More documents for Corpus
Movement Pruning: Adaptive Sparsity by Fine-Tuning

Victor Sanh 1, Thomas Wolf 1, Alexander M. Rush 1,2
1 Hugging Face, 2 Cornell University
{victor,thomas}@huggingface.co; arush@cornell.edu
arXiv:2005.07683v1 [cs.CL] 15 May 2020

Abstract

Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of movement pruning, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. We give mathematical foundations to the method and compare it to existing zeroth- and first-order pruning methods. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters.
1 Introduction

Large-scale transfer learning has become ubiquitous in deep learning and achieves state-of-the-art performance in natural language processing and related fields. In this setup, a large model pretrained on a massive generic dataset is then fine-tuned on a smaller annotated dataset to perform a specific end-task. Model accuracy has been shown to scale with the pretrained model and dataset size [Raffel et al., 2019]. However, significant resources are required to ship and deploy these large models, and training them has a high environmental cost [Strubell et al., 2019].

Sparsity induction is a widely used approach to reduce the memory footprint of neural networks at only a small cost in accuracy. Pruning methods, which remove weights based on their importance, are a particularly simple and effective way to compress models to be sent to edge devices such as mobile phones. Magnitude pruning [Han et al., 2015, 2016], which preserves weights with high absolute values, is the most widely used method for weight pruning. It has been applied to a large variety of architectures in computer vision [Guo et al., 2016] and in language processing [Gale et al., 2019], and has more recently been leveraged as a core component of the lottery ticket hypothesis [Frankle et al., 2019].
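To make the magnitude criterion concrete, here is a minimal sketch, assuming a PyTorch-style setup, of a top-k magnitude mask; the function name and the keep_ratio parameter are illustrative choices rather than anything specified in the paper.

import torch

def magnitude_prune_mask(weight: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
    """Zeroth-order (magnitude) pruning: keep the keep_ratio fraction of
    weights with the largest absolute value and zero out the rest."""
    k = max(1, int(keep_ratio * weight.numel()))
    threshold = torch.topk(weight.abs().flatten(), k).values.min()
    return (weight.abs() >= threshold).float()

# Example: sparsify a weight matrix to 10% density.
# w_pruned = w * magnitude_prune_mask(w, keep_ratio=0.10)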
While magnitude pruning is highly effective for standard supervised learning, it is inherently less useful in the transfer learning regime. In supervised learning, weight values are primarily determined by the end-task training data. In transfer learning, weight values are mostly predetermined by the original model and are only fine-tuned on the end task. This prevents these methods from learning to prune based on the fine-tuning step, or "fine-pruning."
In this work, we argue that to effectively reduce the size of models for transfer learning, one should instead use movement pruning, i.e., pruning approaches that consider the changes in weights during fine-tuning. Movement pruning differs from magnitude pruning in that weights with both low and high values can be pruned if they shrink during training. This strategy moves the selection criterion from 0th-order to 1st-order and facilitates greater pruning based on the fine-tuning objective. To test this approach, we introduce a particularly simple, deterministic version of movement pruning utilizing the straight-through estimator [Bengio et al., 2013].
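As a rough illustration of how such a deterministic, score-based variant can be wired up, the PyTorch-style sketch below keeps the top-scoring weights in the forward pass and lets the gradient flow straight through the non-differentiable masking step to the importance scores; the class names, the keep_ratio parameter, and the initialization choices are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class TopKStraightThrough(torch.autograd.Function):
    """Binary top-k mask in the forward pass; identity gradient in the
    backward pass (straight-through estimator), so the importance scores
    still receive gradients from the fine-tuning loss."""

    @staticmethod
    def forward(ctx, scores, keep_ratio):
        k = max(1, int(keep_ratio * scores.numel()))
        threshold = torch.topk(scores.flatten(), k).values.min()
        return (scores >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        # Pass the gradient w.r.t. the mask to the scores unchanged.
        return grad_output, None

class MaskedLinear(nn.Module):
    """Linear layer whose weights are multiplied by a learned binary mask.
    Because the scores are trained with the task gradient, weights that move
    away from zero during fine-tuning tend to be kept (a movement-style
    criterion). Illustrative sketch only, not the paper's exact recipe."""

    def __init__(self, in_features, out_features, keep_ratio=0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.scores = nn.Parameter(torch.zeros(out_features, in_features))
        self.keep_ratio = keep_ratio
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x):
        mask = TopKStraightThrough.apply(self.scores, self.keep_ratio)
        return nn.functional.linear(x, self.weight * mask, self.bias)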
We apply movement pruning to pretrained language representations (BERT) [Devlin et al., 2019, Vaswani et al., 2017] on a diverse set of fine-tuning tasks. In highly sparse regimes (less than 15% of remaining weights), we observe significant improvements over magnitude pruning and other 1st-order methods such as L0 regularization [Louizos et al., 2017]. Our models reach 95% of the original BERT performance with only 5% of the encoder's weights on natural language inference (MNLI) [Williams et al., 2018] and question answering (SQuAD v1.1) [Rajpurkar et al., 2016]. Analysis of the differences between magnitude pruning and movement pruning shows that the two methods lead to radically different pruned models, with movement pruning showing a greater ability to adapt to the end-task.
2 Related Work

In addition to magnitude pruning, there are many other approaches for generic model weight pruning. Most similar to our approach are methods that use parallel score matrices to augment the weight matrices [Mallya and Lazebnik, 2018, Ramanujan et al., 2020], which have been applied to convolutional networks. Differing from our method, these approaches keep the weights of the model fixed (either from a randomly initialized network or a pre-trained network) and update only the scores to find a good sparse subnetwork.

Many previous works have also explored using higher-order information to select prunable weights. LeCun et al. [1989] and Hassibi et al. [1993] leverage the Hessian of the loss to select weights for deletion. Our method does not require the (possibly costly) computation of second-order derivatives, since the importance scores are obtained simply as a by-product of standard fine-tuning. Theis et al. [2018] and Ding et al. [2019] use the absolute value or the square of the gradient. In contrast, we found it useful to preserve the direction of movement in our algorithm.
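To make the contrast between these criteria concrete, the short sketch below computes three candidate importance scores for a single weight matrix; the instantaneous movement-style score -w * dL/dw is an illustrative simplification of a direction-preserving first-order criterion, not the exact score used in this paper.

import torch

def importance_scores(weight: torch.Tensor, grad: torch.Tensor):
    """Compare pruning criteria on one weight matrix (illustrative only):
    - magnitude (0th-order): keep weights with large |w|
    - gradient magnitude (1st-order, direction ignored): keep large |dL/dw|
    - movement-style (1st-order, direction kept): keep weights moving away
      from zero, i.e. where -w * dL/dw is large and positive."""
    magnitude = weight.abs()
    grad_magnitude = grad.abs()
    movement = -weight * grad  # positive when gradient descent increases |w|
    return magnitude, grad_magnitude, movement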
Compressing pretrained language models for transfer learning is also a popular area of study. Other approaches include knowledge distillation [Sanh et al., 2019, Tang et al., 2019] and structured pruning [Fan et al., 2020a, Michel et al., 2019]. Our core method does not require an external teacher model and targets individual weights. We also show that having a teacher can further improve our approach. Recent work also builds upon iterative magnitude pruning with rewinding [Yu et al., 2020] to train sparse language models from scratch; this differs from our approach, which focuses on the fine-tuning stage. Finally, another popular compression approach is quantization. Quantization has been applied to a variety of modern large architectures [Fan et al., 2020b, Zafrir et al., 2019, Gong et al., 2014], providing high memory compression rates at little or no cost in performance. As shown in previous work [Li et al., 2020, Han et al., 2016], quantization and pruning are complementary and can be combined to further improve the performance/size ratio.
3 Background: Score-Based Pruning

We first establish shared notation for discussing different neural network pruning strategies. Let W ∈ R^n