From e78ae20e924686f655fbb4c7a7fce025951d5b5f Mon Sep 17 00:00:00 2001 From: Eduardo Cueto Mendoza Date: Mon, 10 Aug 2020 18:53:03 -0600 Subject: [PATCH] Processed various texts for the NN --- Corpus/CORPUS.txt | 5546 +++++++++++++++++ ...t Operations in Matrix-Vector Calculus.txt | Bin 30441 -> 0 bytes Corpus/Green AI - Roy Schwartz.txt | Bin 47133 -> 0 bytes ...ting Chaotic Systems and Saving Energy.txt | Bin 25438 -> 0 bytes ...ity Mappings in Deep Residual Networks.txt | Bin 46637 -> 0 bytes .../Language Models are Few-Shot Learners.txt | Bin 287691 -> 0 bytes ... through Network Slimming - Zhuang Liu.txt | 399 -- ...sity in Deep Neural Networks - Wei Wen.txt | Bin 55239 -> 0 bytes ...ections for Efficient Neural Networks.txt | Bin 40692 -> 0 bytes ...r Efficient Neural Networks - Song Han.txt | Bin 40765 -> 0 bytes Corpus/Learning to Generalize.txt | 933 --- ...XED PRECISION TRAINING - Sharan Narang.txt | Bin 43453 -> 0 bytes 12 files changed, 5546 insertions(+), 1332 deletions(-) delete mode 100644 Corpus/Floating Point Operations in Matrix-Vector Calculus.txt delete mode 100644 Corpus/Green AI - Roy Schwartz.txt delete mode 100644 Corpus/Harnessing Nonlinearity Predicting Chaotic Systems and Saving Energy.txt delete mode 100644 Corpus/Identity Mappings in Deep Residual Networks.txt delete mode 100644 Corpus/Language Models are Few-Shot Learners.txt delete mode 100644 Corpus/Learning Efficient Convolutional Networks through Network Slimming - Zhuang Liu.txt delete mode 100644 Corpus/Learning Structured Sparsity in Deep Neural Networks - Wei Wen.txt delete mode 100644 Corpus/Learning both Weights and Connections for Efficient Neural Networks.txt delete mode 100644 Corpus/Learning both Weights and Connections for Efficient Neural Networks - Song Han.txt delete mode 100644 Corpus/Learning to Generalize.txt delete mode 100644 Corpus/MIXED PRECISION TRAINING - Sharan Narang.txt diff --git a/Corpus/CORPUS.txt b/Corpus/CORPUS.txt index 24b6584..9e04a6f 100644 --- a/Corpus/CORPUS.txt +++ b/Corpus/CORPUS.txt @@ -11039,4 +11039,5550 @@ REFERENCES [24] J. Kalkkuhl, K. J. Hunt, and H. Fritz, �FEM-based neural-network approach to nonlinear modeling with application to longitudinal vehicle dynamics control,� IEEE Trans. Neural Netw., vol. 10, no. 4, pp. 885�897, 1999. [25] R. K. Mishra and P. S. Hall, �NFDTD concept,� IEEE Trans. Neural Netw., vol. 16, no. 2, pp. 484�490, 2005. [26] D. G. Triantafyllidis and D. P. Labridis, �A finite-element mesh gener.ator based on growing neural networks,� IEEE Trans. Neural Netw., vol. 13, no. 6, pp. 1482�1496, 2002. +<> <> <> + + +<> <> <> +Floating Point Operations in Matrix-Vector Calculus +(Version 1.3) +Raphael Hunger +Technical Report 2007 + +Technische Universit�t Mchen Associate Institute for Signal Processing +Univ.-Prof. Dr.-Ing. Wolfgang Utschick + + + +History +Version 1.00: October 2005 -Initial version +Version 1.01: 2006 -Rewrite of sesquilinear form with a reduced amount of FLOPs -Several Typos fixed concerning the number of FLOPS required for the Cholesky decomposition Version 1.2: November 2006 -Conditions for the existence of the standard <> Cholesky decomposition specified (positive definiteness) -Outer product version of <> Cholesky decomposition removed -FLOPs required in Gaxpy version of <> Cholesky decomposition updated -<> Cholesky decomposition added -Matrix-matrix product LC added with L triangular -Matrix-matrix product <>C added with L triangular and <> not known a priori -Inverse L. 
1^{-1} of a lower triangular matrix L1 with ones on the main diagonal added
+Version 1.3: September 2007 -First globally accessible document version
+ToDo: (unknown when) -QR-Decomposition -LR-Decomposition
+Please report any bugs and suggestions to hunger@tum.de
+
+Contents
+1. Introduction
+2. Flop Counting
+2.1 Matrix Products
+2.1.1 Scalar-Vector Multiplication αa
+2.1.2 Scalar-Matrix Multiplication αA
+2.1.3 Inner Product a^H b of Two Vectors
+2.1.4 Outer Product ac^H of Two Vectors
+2.1.5 Matrix-Vector Product Ab
+2.1.6 Matrix-Matrix Product AC
+2.1.7 Matrix Diagonal Matrix Product AD
+2.1.8 Matrix-Matrix Product LD
+2.1.9 Matrix-Matrix Product L1D
+2.1.10 Matrix-Matrix Product LC with L Lower Triangular
+2.1.11 Gram A^H A of A
+2.1.12 Squared Frobenius Norm ||A||_F^2 = tr(A^H A)
+2.1.13 Sesquilinear Form c^H A b
+2.1.14 Hermitian Form a^H R a
+2.1.15 Gram L^H L of a Lower Triangular Matrix L
+2.2 Decompositions
+2.2.1 Cholesky Decomposition R = <> (Gaxpy Version)
+2.2.2 Cholesky Decomposition R = L1 D L1^H
+2.3 Inverses of Matrices
+2.3.1 Inverse <> of a Lower Triangular Matrix L
+2.3.2 Inverse L1^{-1} of a Lower Triangular Matrix L1 with Ones on the Main Diagonal
+2.3.3 Inverse R^{-1} of a Positive Definite Matrix R
+2.4 Solving Systems of Equations
+2.4.1 Product <>C with <> not known a priori
+3. Overview
+Appendix
+Bibliography
+
+1. Introduction
+For the design of efficient and low-complexity algorithms in many signal-processing tasks, a detailed analysis of the required number of floating-point operations (FLOPs) is often inevitable. Most frequently, matrix operations are involved, such as matrix-matrix products and inverses of matrices. Structures like Hermitian symmetry or triangularity can, for example, be exploited to reduce the number of needed FLOPs and will be discussed here. In this technical report, we derive expressions for the number of multiplications and summations required by a majority of signal processing algorithms in mobile communications.
+Acknowledgments:
+The author would like to thank Dipl.-Ing. David A. Schmidt and Dipl.-Ing. Guido Dietl for the fruitful discussions on this topic.
+
+2. Flop Counting
+In this chapter, we offer expressions for the number of complex multiplications and summations required for several matrix-vector operations. A floating-point operation (FLOP) is assumed to be either a complex multiplication or a complex summation here, despite the fact that a complex multiplication requires 4 real multiplications and 2 real summations whereas a complex summation consists of only 2 real summations, making a multiplication more expensive than a summation. However, we count each operation as one FLOP.
+Throughout this report, we assume <> to be a scalar, the vectors <>, and <> to have dimension N, N, and M, respectively.
The matrices <>, and <> are assumed to have no special structure, whereas <> is Hermitian and <> is diagonal. L is a lower triangular <> matrix, e_n denotes the unit vector with a 1 in the n-th row and zeros elsewhere. Its dimensionality is chosen such that the respective matrix-vector product exists. Finally, [A]_{a,b} denotes the element in the a-th row and b-th column of A, <> selects the submatrix of A consisting of rows a to b and columns c to d. 0_{a×b} is the a × b zero matrix. Transposition, Hermitian transposition, conjugate, and real-part operator are denoted by <>, and <>, respectively, and require no FLOP.

2.1 Matrix Products
Frequently arising matrix products and the amount of FLOPs required for their computation will be discussed in this section.

2.1.1 Scalar-Vector Multiplication αa
A simple multiplication αa of a vector a with a scalar <> requires N multiplications and no summation.

2.1.2 Scalar-Matrix Multiplication αA
Extending the result from Subsection 2.1.1 to a scalar-matrix multiplication αA requires NM multiplications and again no summation.

2.1.3 Inner Product a^H b of Two Vectors
An inner product a^H b requires N multiplications and <> summations, i.e., <> FLOPs.

2.1.4 Outer Product ac^H of Two Vectors
An outer product ac^H requires NM multiplications and no summation.

2.1.5 Matrix-Vector Product Ab
Computing Ab corresponds to applying the inner product rule <> from Subsection 2.1.3 M times. Obviously, <> and <> represents the i-th row of A. Hence, its computation costs MN multiplications and <> summations, i.e., <> FLOPs.

2.1.6 Matrix-Matrix Product AC
Repeated application of the matrix-vector rule Ac_i from Subsection 2.1.5 with c_i being the i-th column of C yields the overall matrix-matrix product AC. Since <>, the matrix-matrix product has the L-fold complexity of the matrix-vector product. Thus, it needs MNL multiplications and summations, altogether <> FLOPs.

2.1.7 Matrix Diagonal Matrix Product AD
If the right-hand side matrix D of the matrix product AD is diagonal, the computational load reduces to M multiplications for each of the N columns of A, since the n-th column of A is scaled by the n-th main diagonal element of D. Thus, MN multiplications in total are required for the computation of AD; no summations are needed.

2.1.8 Matrix-Matrix Product LD
When multiplying a lower triangular matrix L by a diagonal matrix D, column n of the matrix product requires <> multiplications and no summations. With <>, we get <> multiplications.

2.1.9 Matrix-Matrix Product L1D
When multiplying a lower triangular matrix L1 with ones on the main diagonal by a diagonal matrix D, column n of the matrix product requires <> multiplications and no summations. With <>, we get <> multiplications.

2.1.10 Matrix-Matrix Product LC with L Lower Triangular
Computing the product of a lower triangular matrix <> and <> is done column-wise. The n-th element in each column of LC requires n multiplications and <> summations, so the complete column needs <> multiplications and <> summations. The complete matrix-matrix product is obtained from computing L columns. We have <> multiplications and <> summations, yielding a total amount of <> FLOPs.

2.1.11 Gram A^H A of A
In contrast to the general matrix product from Subsection 2.1.6, we can make use of the Hermitian structure of the product <>. Hence, the strictly lower triangular part of <> need not be computed, since it corresponds to the Hermitian transpose of the strictly upper triangular part.
For +this reason, we have to compute only the N main diagonal entries of <> and the <> upper <> off-diagonal elements, so only <> different entries have to be evaluated. Each element requires an inner product step from Subsection 2.1.3 costing M multiplications and <> summations. Therefore, +<> multiplications and <> summations are needed, making up a total amount of <> FLOPs. + +2.1 Matrix Products + +2.1.12 Squared Frobenius Norm <> +The squared Hilbert-Schmidt norm <> follows from summing up the MN squared entries from A. We therefore have MN multiplications and <> summations, yielding a total of <> FLOPs. + +2.1.13 Sesquilinear Form <> +The sesquilinear form cHAb should be evaluated by computing the matrix-vector product Ab in a first step and then multiplying with the row vector cH from the left hand side. The matrix vector product requires MN multiplications and <> summations, whereas the inner product needs M multiplications and <> summations. Altogether, <> multiplications and <> summations have to be computed for the sesquilinear form <>, yielding a total number of <> flops. + +2.1.14 Hermitian Form a <> +With the Hermitian matrix <>, the product <> can be expressed as + +<> + +with <>, and <>. The first sum accumulates the weighted main diagonal entries and requires 2N multiplications and <> summations. The second part of (2.1) accumulates all weighted off-diagonal entries from A. The last two summations sum up 2 terms2. Consequently, the second part of (2.1) requires <> summations and <> products. Finally, the two parts have to be added accounting for an additional summation and yielding an overall amount of <> products and +<> summations, corresponding to <> FLOPs. + +2.1.15 Gram <> of a Lower Triangular Matrix L +During the computation of the inverse of a positive definite matrix, the Gram matrix of a lower triangular matrix occurs when Cholesky decomposition is applied. Again, we make use of the Hermitian structure of the Gram <>, so only the main diagonal entries and the upper right off-diagonal entries of the product have to be evaluated. The a-th main-diagonal entry can be expressed >. +We made use of (A1) in the Appendix for the computation of the last sum accumulating subsequent integers. +We do not exploit the fact that only real-valued summands are accumulated as we only account for complex flops. +The scaling with the factor 2 does not require a FLOP, as it can be implemented by a simple bit shift. +Clearly, if <>, we have to subtract one summation from the calculation since no off-diagonal entries exist. + +2. Flop Counting + +<> (2.2) + +with <>, requiring <> multiplications and <> summations. Hence, all main diagonal elements need <> multiplications and +<> summations. The upper right off-diagonal entry <> in row a and column b with <> reads as + +<>, (2.3) + +again accounting for <> multiplications and <> summations. These two expressions have to be summed up over all <> and <>, and for the number of multiplications, we find + +<> (2.4) + +Again, we made use of (A1) for the sum of subsequent integers and (A2) for the sum of subsequent squared integers. For the number of summations, we evaluate + +<> + +Computing all necessary elements of the Gram LHL thereby requires <> multiplications and <> summations. Altogether, <> FLOPs result. The same result of course holds for the Gram of two upper triangular matrices. 
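Before turning to decompositions, it can be handy to have the unstructured counts above in executable form. The following Python sketch is not part of the original report; it simply encodes the standard complex-FLOP convention used in this section (one FLOP per complex multiplication or complex summation), under the reading that A has size M x N and C has size N x L as implied by the text.

def flops_inner_product(N):
    # a^H b with vectors of length N: N multiplications, N - 1 summations
    return N + (N - 1)

def flops_matrix_vector(M, N):
    # Ab with A of size M x N: M inner products of length N (Subsection 2.1.5)
    return M * flops_inner_product(N)        # = 2MN - M

def flops_matrix_matrix(M, N, L):
    # AC with C of size N x L: L matrix-vector products (Subsection 2.1.6)
    return L * flops_matrix_vector(M, N)     # = 2MNL - ML

def flops_gram(M, N):
    # A^H A: only N(N+1)/2 distinct entries, each an inner product of
    # two length-M columns (Subsection 2.1.11)
    return (N * (N + 1) // 2) * flops_inner_product(M)

# Example: A of size 4 x 3
# flops_matrix_vector(4, 3) -> 20, flops_gram(4, 3) -> 42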

2.2 Decompositions

2.2.1 Cholesky Decomposition <> (Gaxpy Version)
Instead of computing the inverse of a positive definite matrix R directly, it is more efficient to start with the Cholesky decomposition <> and then invert the lower triangular matrix L and compute its Gram. In this section, we count the number of FLOPs necessary for the Cholesky decomposition.
The implementation of the Generalized Ax plus y (Gaxpy) version of the Cholesky decomposition, which overwrites the lower triangular part of the positive definite matrix R, is listed in Algorithm 2.1, see [1]. Note that R needs to be positive definite for the <> decomposition!

Algorithm 2.1 Algorithm for the Gaxpy version of the Cholesky decomposition.

<>

The computation of the first column of L in Line 1 of Algorithm 2.1 requires <> multiplications, a single square-root operation, and no summations. Column <> takes a matrix-vector product of dimension <>, which is subtracted from another <>-dimensional vector, involving <> summations, see Line 3. Finally, <> multiplications and a single square-root operation are necessary in Line 4. In short, row n with <> needs <> multiplications, <> summations (see Subsection 2.1.5), and one square-root operation, which we classify as an additional FLOP. Summing up the multiplications for rows <>, we obtain

<>

The number of summations for rows <> reads as

<> (2.6)

<> (2.7)

The first element need not be computed twice, since the result of the division is the square root of the denominator.
Again, the first element need not be computed twice, since the result of the division is the square root of the denominator.

and finally, <> square-root operations are needed for the <> rows. Including the <> multiplications for column <> and the additional square-root operation, <> multiplications, <> summations, and N square-root operations occur, <> FLOPs in total.

2.2.2 Cholesky Decomposition <>
The main advantage of the <> decomposition compared to the standard <> decomposition is that no square-root operations are needed, which may require more than one FLOP depending on the given hardware platform. Another benefit of the <> decomposition is that it does not require a positive definite matrix R; the only two conditions for its unique existence are that R is Hermitian and that all but the last principal minor (i.e., the determinant) of R are different from zero [2]. Hence, R may also be rank deficient to a certain degree. If R is not positive semidefinite, then D may contain negative main diagonal entries.
The outcome of the decomposition is a lower triangular matrix L1 with ones on the main diagonal and a diagonal matrix D.

Algorithm 2.2 Algorithm for the Cholesky decomposition <>

<>

Algorithm 2.2 overwrites the strictly lower left part of the matrix R with the strictly lower part of L1 (i.e., without the ones on the main diagonal) and overwrites the main diagonal of R with the main diagonal of D. It is taken from [1] and slightly modified, such that it is also applicable to complex matrices (see the conjugate in Line 4) and no existing scalar is re-computed (see the case distinction in Line 4 for i = 1).
Line 1 needs <> multiplications. Lines 3 to 5 require <> multiplications and are executed for <>, yielding <> multiplications. Line 6 takes <> multiplications and <> summations, again with n = 2, ..., N, yielding <> multiplications and the same amount of summations. Line 7 does not require any FLOP.
In Line 8, the matrix-vector product needs <> multiplications, and additional <> multiplications arise when the complete numerator is divided by the denominator. Hence, we have <> multiplications. For <> we get <> multiplications. +The number of summations in Line 8 is <> for the matrix vector product and <> for the subtraction in the numerator. Together, we have <> summations. With +<> summations. Summing up, this algorithm requires <> multiplications, and <> summations, yielding a total amount of <> FLOPs. (Note that this formula is also valid for N =1.) + +2.3 Inverses of Matrices + +2.3.1 Inverse <> of a Lower Triangular Matrix L +Let <> denote the inverse of a lower triangular matrix L. Then, X is again lower triangular which means that <> for <>. The following equation holds: + +<>. (2.8) + +Via forward substitution, above system can easily be solved. Row <> from (2.8) can be expressed as + +<>, (2.9) + +with <> denoting the Kronecker delta which vanishes for <>, and <>. Starting from <>, the xb,n are computed successively, and we find + +<> (2.10) + +with all <> having been computed in previous steps. Hence, if <> and a single multiplication is required, no summations are needed. For <> multiplications and <> summations are required, as the Kronecker-delta vanishes. All main diagonal entries can be computed by means of N multiplications The lower left off-diagonal entries +Actually, it is a division rather than a multiplication. + +2. Flop Counting + +require + +<> (2.11) + +multiplications, and + +<> (2.12) + +summations. Including the N multiplications for the main-diagonal entries, <> multiplications and <> summations have to be implemented, yielding a total amount +<> FLOPs. + +2.3.2 Inverse <> of a Lower Triangular Matrix L1 with Ones on the Main Diagonal +The inverse of a lower triangular matrix L1 turns out to require N2 FLOPs less than the inverse of L with arbitrary nonzero diagonal elements. Let X denote the inverse of L1. Clearly, X is again a lower triangular matrix with ones on the main diagonal. We can exploit this fact in order to compute only the unknown entries. +The mth row and nth column of the system of equations <> with <> reads as + +<> + +or, equivalently, + +<> + +Hence, X is computed via forward substitution. To compute <>, we need <> multiplications and <> summations. Remember that <>. The total number of multiplications/summations is obtained from + +<>) (2.13) + +We only have to consider <>, since the equations resulting from m> FLOPs are needed. + +2.3.3 Inverse R.1 of a Positive definite Matrix R +The inverse of a matrix can for example be computed via Gaussian-elimination [1]. However, this approach is computationally expensive and does not exploit the Hermitian structure of R. Instead, it is more efficient to start with the Cholesky decomposition of <> (see Subsection 2.2.1), +invert the lower triangular matrix L (see Subsection 2.3.1), and then build the Gram <> of <> (see Subsection 2.1.15). Summing up the respective number of operations, this procedure requires <> multiplications, <> summations, and N square-root operations, which yields a total amount of <> FLOPs. + +2.4.1 Product <> with <> not known a priori. +A naive way of computing the solution <> of the equation <> is to find <> first and afterwards multiply it by C. This approach needs <> FLOPs as shown in Sections 2.3.1 and 2.1.10. However, doing so is very expensive since we are not interested in the inverse of L in general. Hence, there must be a computationally cheaper variant. 
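To make the cheaper variant concrete before the derivation that follows, here is a small NumPy sketch (an illustration of the standard technique, not code from the report) of solving LX = C by forward substitution, column by column, without ever forming L^{-1}:

import numpy as np

def forward_substitution(L, C):
    # L: lower triangular N x N, C: N x K; returns X with L @ X = C
    N, K = C.shape
    X = np.zeros((N, K), dtype=np.result_type(L, C))
    for a in range(K):                 # one column of X at a time
        for b in range(N):
            # x_{b,a} = (c_{b,a} - sum_{j<b} l_{b,j} x_{j,a}) / l_{b,b}
            X[b, a] = (C[b, a] - L[b, :b] @ X[:b, a]) / L[b, b]
    return X

# Check against the defining equation (for a well-conditioned triangular L):
# L = np.tril(np.random.randn(4, 4)) + 4 * np.eye(4)
# C = np.random.randn(4, 2)
# assert np.allclose(L @ forward_substitution(L, C), C)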
Again, forward substitution plays a key role. +It is easy to see, that X can be computed column-wise. Let <> and <>. Then, from <>, we get for the element xb,a in row b and column a of X: + +<> + +Its computation requires b multiplications and <> summations. A complete column of X can therefore the computed with<> multiplications and <> summations. The complete matrix X with L columns thus needs <> FLOPs, so the forward substitution saves <> FLOPs compared to the direction inversion of L and a subsequent matrix matrix product. Interestingly, computing <> with <> unknown is as expensive as computing LC, see Section 2.1.10. + +3. Overview + +<> and <> are arbitrary matrices.<> is a diagonal matrix, <> is lower triangular, <> is lower triangular with ones on the main diagonal, <>, and <> is positive definite. + +<> + +Appendix + +A frequently occurring summation in FLOP counting is the sum of subsequent integers. By complete induction, we find + +<> (A1) + +Above result can easily be verified by recognizing that the sum of the n-th and the <> summand is equal to <>, and we have <> such pairs. +Another sum of relevance is the sum of subsequent squared integers. Again, via complete induction, we find + +<> (A2) + +Bibliography +[1] G. H. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, 1991. +[2] Kh.D. Ikramov and N.V. Savel�eva, �Conditionally definite Matrices, Journal of Mathematical Sciences, vol. 98, no. 1, pp. 150, 2000. +< <> > + + +<> <> <> + Green AI + + Roy Schwartz Jesse Dodge Noah A. Smith Oren Etzioni + + + Allen Institute for AI, Seattle, Washington, USA + Carnegie Mellon University, Pittsburgh, Pennsylvania, USA + University of Washington, Seattle, Washington, USA + + + Abstract + The computations required for deep learning research have been doubling every few months, resulting in an + estimated 300,000x increase from 2012 to 2018 [2]. These computations have a surprisingly large carbon footprint + [40]. Ironically, deep learning was inspired by the human brain, which is remarkably energy efficient. Moreover, the + financial cost of the computations can make it difficult for academics, students, and researchers, in particular those + from emerging economies, to engage in deep learning research. + This position paper advocates a practical solution by making efficiency an evaluation criterion for research along- + side accuracy and related measures. In addition, we propose reporting the financial cost or “price tag” of developing, + training, and running models to provide baselines for the investigation of increasingly efficient methods. Our goal is + to make AI both greener and more inclusive—enabling any inspired undergraduate with a laptop to write high-quality + research papers. Green AI is an emerging focus at the Allen Institute for AI. + + + 1 Introduction and Motivation + + Since 2012, the field of artificial intelligence has reported remarkable progress on a broad range of capabilities in- + cluding object recognition, game playing, machine translation, and more [36]. This progress has been achieved by + increasingly large and computationally-intensive deep learning models. 1 Figure 1 reproduced from [2] plots training + cost increase over time for state-of-the-art deep learning models starting with AlexNet in 2012 [20] to AlphaZero in + 2017 [38]. The chart shows an overall increase of 300,000x, with training cost doubling every few months. 
An even + sharper trend can be observed in NLP word embedding approaches by looking at ELMo [29] followed by BERT [8], + openGPT-2 [30], and XLNet [48]. An important paper [40] has estimated the carbon footprint of several NLP models + and argued that this trend is both environmentally unfriendly (which we refer to as Red AI ) and expensive, raising + barriers to participation in NLP research. + This trend is driven by the strong focus of the AI community on obtaining “state-of-the-art” results, 2 as exemplified + by the rising popularity of leaderboards [46, 45], which typically report accuracy measures but omit any mention of + cost or efficiency (see, for example,leaderboards.allenai.org). Despite the clear benefits of improving + model accuracy in AI, the focus on this single metric ignores the economic, environmental, or social cost of reaching + the reported accuracy. + We advocate increasing research activity in Green AI —AI research that is more environmentally friendly and + inclusive. We emphasize that Red AI research has been yielding valuable contributions to the field of AI, but it’s been + overly dominant. We want to shift the balance towards the Green AI option—to ensure that any inspired undergraduate + with a laptop has the opportunity to write high-quality papers that could be accepted at premier research conferences. + + 1 For brevity, we refer to AI throughout this paper, but our focus is on AI research that relies on deep learning methods. + 2 Meaning, in practice, that a system’s accuracy on some benchmark is greater than any previously reported system’s accuracy. + + <
> + + Figure 1: The amount of compute used to train deep learning models has increased 300,000x in 6 years. Figure taken + from [2]. + + + Specifically, we propose making efficiency a more common evaluation criterion for AI papers alongside accuracy and + related measures. + AI research can be computationally expensive in a number of ways, but each provides opportunities for efficient + improvements; for example, papers could be required to plot accuracy as a function of computational cost and of + training set size, providing a baseline for more data-efficient research in the future. Reporting the computational price + tag of finding, training, and running models is a key Green AI practice (see Equation 1). In addition to providing + transparency, price tags are baselines that other researchers could improve on. + Our empirical analysis in Figure 2 suggests that the AI research community has paid relatively little attention to + computational efficiency. In fact, as Figure 1 illustrates, the computational cost of research is increasing exponentially, + at a pace that far exceeds Moore’s Law [28]. Red AI is on the rise despite the well-known diminishing returns of + increased cost (e.g., Figure 3). This paper identifies key factors that contribute to Red AI and advocates the introduction + of a simple, easy-to-compute efficiency metric that could help make some AI research greener, more inclusive, and + perhaps more cognitively plausible. Green AI is part of a broader, long-standing interest in environmentally-friendly + scientific research (e.g., see the journalGreen Chemistry). Computer science, in particular, has a long history of + investigating sustainable and energy-efficient computing (e.g., see the journalSustainable Computing: Informatics + and Systems). + The remainder of this paper is organized as follows. Section 2 analyzes practices that move deep-learning research + into the realm of Red AI . Section 3 discusses our proposals for Green AI. Section 4 considers related work, and we + conclude with a discussion of directions for future research. + + + 2 Red AI + + Red AI refers to AI research that seeks to obtain state-of-the-art results in accuracy (or related measures) through + the use of massive computational power—essentially “buying” stronger results. Yet the relationship between model + performance and model complexity (measured as number of parameters or inference time) has long been understood + to be at best logarithmic; for a linear gain in performance, an exponentially larger model is required [18]. Similar + trends exist with increasing the quantity of training data [41, 13] and the number of experiments [9]. In each of these + cases, diminishing returns come at increased computational cost. + This section analyzes the factors contributing to Red AI and shows how it is resulting in diminishing returns over + time (see Figure 3). We note again that Red AI work is valuable, and in fact, much of it contributes to what we know + + <
> + + Figure 2: AI papers tend to target accuracy rather than efficiency. The figure shows the proportion of papers that + target accuracy, efficiency, both or other from a sample of 60 papers from top AI conferences. + + by pushing the boundaries of AI. Our exposition here is meant to highlight areas where computational expense is high, + and to present each as an opportunity for developing more efficient techniques. + To demonstrate the prevalence of Red AI , we sampled 60 papers from top AI conferences (ACL, 3 NeurIPS, 4 and + CVPR 5 ). For each paper we noted whether the authors claim their main contribution to be (a) an improvement to + accuracy or some related measure, (b) an improvement to efficiency, (c) both, or (d) other. As shown in Figure 2, in all + conferences we considered, a large majority of the papers target accuracy (90% of ACL papers, 80% of NeurIPS papers + and 75% of CVPR papers). Moreover, for both empirical AI conferences (ACL and CVPR) only a small portion (10% + and 20% respectively) argue for a new efficiency result. 6 This highlights the focus of the AI community on measures + of performance such as accuracy, at the expense of measures of efficiency such as speed or model size. In this paper + we argue that a larger weight should be given to the latter. + To better understand the different ways in which AI research can be red, consider an AI result reported in a scientific + paper. This result typically includes a model trained on a training dataset and evaluated on a test dataset. The process + of developing that model often involves multiple experiments to tune its hyperparameters. When considering the + different factors that increase the computational and environmental cost of producing such a result, three factors come + to mind: the cost of executing the model on a single (E)xample (either during training or at inference time); the size + of the training (D)ataset, which controls the number of times the model is executed during training, and the number of + (H)yperparameter experiments, which controls how many times the model is trained during model development. The + total cost of producing a (R)esult in machine learning increases linearly with each of these quantities. This cost can + be estimated as follows: + + <> + + Equation 1: The equation of Red AI : The cost of an AI (R)esult grows linearly with the cost of processing a single + (E)xample, the size of the training (D)ataset and the number of (H)yperparameter experiments. + + Equation 1 is a simplification (e.g., different hyperparameter assignments can lead to different costs for processing + a single example). It also ignores other factors such as the number of training epochs. Nonetheless, it illustrates three + quantities that are each an important factor in the total cost of generating a result. Below, we consider each quantity + separately. Interestingly, many NeurIPS papers included convergence rates or regret bounds which describe performance as a function of examples or + iterations, thus targeting efficiency (55%). This indicates an increased awareness of the importance of this concept, at least in theoretical analyses. + + + + . + + Expensive processing of one example Our focus is on neural models, where it is common for each training step + to require inference, so we discuss training and inference cost together as “processing” an example. 
Some works + have used increasingly expensive models which require great amounts of resources, and as a result, in these models, + performing inference can require a lot of computation, and training even more so. For instance, Google’s BERT-large + [8] contains roughly 350 million parameters. openAI’s openGPT2-XL model [30] contains 1.5 billion parameters. + AI2, our home organization, recently released Grover [49], also containing 1.5 billion parameters. In the computer + vision community, a similar trend is observed (Figure 1). + Such large models have high costs for processing each example, which leads to large training costs. BERT-large + was trained on 64 TPU chips for 4 days. Grover was trained on 256 TPU chips for two weeks, at an estimated cost of + $25,000. XLNet had a similar architecture to BERT-large, but used a more expensive objective function (in addition + to an order of magnitude more data), and was trained on 512 TPU chips for 2.5 days. 7 It is impossible to reproduce + the best BERT-large results 8 or XLNet results 9 using a single GPU. Specialized models can have even more extreme + costs, such as AlphaGo, the best version of which required 1,920 CPUs and 280 GPUs to play a single game of Go + [37] at a cost of over $1,000 per hour. 10 + When examining variants of a single model (e.g., BERT-small and BERT-large) we see that larger models can have + stronger performance, which is a valuable scientific contribution. However, this implies the financial and environmental + cost of increasingly large AI models will not decrease soon, as the pace of model growth far exceeds the resulting + increase in model performance [16]. As a result, more and more resources are going to be required to keep improving + AI models by simply making them larger. + + Processing many examples Another way state-of-the-art performance has recently been progressing in AI is by + successively increasing the amount of training data models are trained on. BERT-large had top performance in 2018 + across many NLP tasks after training on 3 billion word-pieces. XLNet recently outperformed BERT after training + on 32 billion word-pieces, including part of Common Crawl; openGPT-2-XL trained on 40 billion words; FAIR’s + RoBERTa [23] was trained on 160GB of text, roughly 40 billion word-pieces, requiring around 25,000 GPU hours + to train. In computer vision, researchers from Facebook [25] pretrained an image classification model on 3.5 billion + images from Instagram, three orders of magnitude larger than existing labelled image datasets such as Open Images. 11 + The use of massive data creates barriers for many researchers for reproducing the results of these models, or + training their own models on the same setup (especially as training for multiple epochs is standard). For example, the + June 2019 Common Crawl contains 242 TB of uncompressed data, 12 so even storing the data is expensive. Finally, + as in the case of model size, relying on more data to improve performance is notoriously expensive because of the + diminishing return of adding more data [41]. For instance, Figure 3, taken from [25], shows a logarithmic relation + between the object recognition top-1 accuracy and the number of training examples. + + Massive number of experiments Some projects have poured large amounts of computation into tuning hyperparameters + or searching over neural architectures, well beyond the reach of most researchers. 
For instance, researchers + from Google [51] trained over 12,800 neural networks in their neural architecture search to improve performance on + object detection and language modeling. With a fixed architecture, researchers from DeepMind [26] evaluated 1,500 + hyperparameter assignments to demonstrate that an LSTM language model [15] can reach state-of-the-art perplexity + results. Despite the value of this result in showing that the performance of an LSTM does not plateau after only a few + hyperparameter trials, fully exploring the potential of other competitive models for a fair comparison is prohibitively + expensive. + 7 Some estimates for the cost of this process reach $250,000 (twitter.com/eturner303/status/1143174828804857856). + 8 Seehttps://github.com/google-research/bert + 9 Seehttps://github.com/zihangdai/xlnet + 10 Recent versions of AlphaGo are far more efficient [39]. + 11 https://opensource.google.com/projects/open-images-dataset + 12 http://commoncrawl.org/2019/07/ + + <
> + + Figure 3: Diminishing returns of training on more data: object detection accuracy increases linearly as the number of + training examples increases exponentially [25]. + + The topic of massive number of experiments is not as well studied as the first two discussed above. In fact, the + number of experiments performed during model construction is often under reported. Nonetheless, evidence for a + logarithmic relation exists here as well, between the number of experiments and performance gains [9]. + + Discussion The benefits of pouring more resources into models are certainly of interest to the AI community. Indeed, + there is value in pushing the limits of model size, dataset size, and the hyperparameter search space. Currently, despite + the massive amount of resources put into recent AI models, such investment still pays off in terms of downstream + performance (albeit at an increasingly lower rate). Finding the point of saturation (if such exists) is an important + question for the future of AI. + Our goal in this paper is to raise awareness of the cost of Red AI , as well as encourage the AI community to + recognize the value of work by researchers that take a different path, optimizing efficiency rather than accuracy. Next + we turn to discuss concrete measures for making AI more green. + + + 3 Green AI + + The term Green AI refers to AI research that yields novel results without increasing computational cost, and ideally + reducing it. Whereas Red AI has resulted in rapidly escalating computational (and thus carbon) costs, Green AI has the + opposite effect. If measures of efficiency are widely accepted as important evaluation metrics for research alongside + accuracy, then researchers will have the option of focusing on the efficiency of their models with positive impact on + both the environment and inclusiveness. This section reviews several measures of efficiency that could be reported + and optimized, and advocates one particular measure—FPO—which we argue should be reported when AI research + findings are published. + + 3.1 Measures of Efficiency + To measure efficiency, we suggest reporting the amount of work required to generate a result in AI, that is, the amount + of work required to train a model, and if applicable, the sum of works for all hyperparameter tuning experiments. As + + the cost of an experiment decomposes into the cost of a processing a single example, the size of the dataset, and the + number of experiments (Equation 1), reducing the amount of work in each of these steps will result in AI that is more + green. + When reporting the amount of work done by a model, we want to measure a quantity that allows for a fair comparison + between different models. As a result, this measure should ideally be stable across different labs, at different + times, and using different hardware. + + Carbon emission Carbon emission is appealing as it is a quantity we want to directly minimize. Nonetheless it + is impractical to measure the exact amount of carbon released by training or executing a model, and accordingly— + generating an AI result, as this amount depends highly on the local electricity infrastructure. As a result, it is not + comparable between researchers in different locations or even the same location at different times. + + Electricity usage Electricity usage is correlated with carbon emission while being time- and location-agnostic. 
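To make the difference between the two measures concrete, a back-of-the-envelope conversion might look as follows; this sketch is ours, not the paper's, and the PUE and grid-intensity values are hypothetical placeholders rather than measurements.

def energy_kwh(avg_power_watts, hours, pue=1.6):
    # average device power x run time, scaled by an assumed datacenter PUE
    return avg_power_watts / 1000.0 * hours * pue

def carbon_kg(kwh, grid_kg_per_kwh=0.4):
    # grid intensity depends on location and time, which is exactly why
    # carbon emission is hard to compare across labs
    return kwh * grid_kg_per_kwh

# e.g., 8 GPUs drawing roughly 250 W each for 72 hours (illustrative numbers):
kwh = energy_kwh(avg_power_watts=8 * 250, hours=72)
print(round(kwh, 1), "kWh ->", round(carbon_kg(kwh), 1), "kg CO2")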
+ Moreover, GPUs often report the amount of electricity each of their cores consume at each time point, which facilitates + the estimation of the total amount of electricity consumed by generating an AI result. Nonetheless, this measure is + hardware dependent, and as a result does not allow for a fair comparison between different models. + + Elapsed real time The total running time for generating an AI result is a natural measure for efficiency, as all other + things being equal, a faster model is doing less computational work. Nonetheless, this measure is highly influenced + by factors such as the underlying hardware, other jobs running on the same machine, and the number of cores used. + These factors hinder the comparison between different models, as well as the decoupling of modeling contributions + from hardware improvements. + + Number of parameters Another common measure of efficiency is the number of parameters (learnable or total) + used by the model. As with run time, this measure is correlated with the amount of work. Unlike the other measures + described above, it does not depend on the underlying hardware. Moreover, this measure also highly correlates with the + amount of memory consumed by the model. Nonetheless, different algorithms make different use of their parameters, + for instance by making the model deeper vs. wider. As a result, different models with a similar number of parameters + often perform different amounts of work. + + FPO As a concrete measure, we suggest reporting the total number of floating point operations (FPO) required to + generate a result. 13 FPO provides an estimate to the amount of work performed by a computational process. It is + computed analytically by defining a cost to two base operations, ADD and MUL . Based on these operations, the FPO + cost of any machine learning abstract operation (e.g., a tanh operation, a matrix multiplication, a convolution operation, + or the BERT model) can be computed as a recursive function of these two operations. FPO has been used in the past + to quantify the energy footprint of a model [27, 43, 12, 42], but is not widely adopted in AI. + FPO has several appealing properties. First, it directly computes the amount of work done by the running machine + when executing a specific instance of a model, and is thus tied to the amount of energy consumed. Second, FPO is + agnostic to the hardware on which the model is run. This facilitates fair comparisons between different approaches, + unlike the measures described above. Third, FPO is strongly correlated with the running time of the model [4]. Unlike + asymptotic runtime, FPO also considers the amount of work done at each time step. + Several packages exist for computing FPO in various neural network libraries, 14 though none of them contains all + the building blocks required to construct all modern AI models. We encourage the builders of neural network libraries + to implement such functionality directly. + + 13 Floating point operations are often referred to as FLOP(s), though this term is not uniquely defined [12]. To avoid confusion, we use the term FPO. + 14 E.g.,https://github.com/Swall0w/torchstat;https://github.com/Lyken17/pytorch-OpCounter + + <
> + + Figure 4: Increase in FPO results in diminishing return for object detection top-1 accuracy. Plots (bottom to top): + model parameters (in million), FPO (in billions), top-1 accuracy on ImageNet. (4a): Different models: AlexNet + [20], ResNet [14], ResNext [47], DPN107 [5], SENet154 [17]. (4b): Comparison of different sizes (measured by the + number of layers) of the ResNet model [14]. + + + Discussion Efficient machine learning approaches have received attention in the research community, but are generally + not motivated by being green. For example, a significant amount of work in the computer vision community has + addressed efficient inference, which is necessary for real-time processing of images for applications like self-driving + cars [24, 31, 22], or for placing models on devices such as mobile phones [16, 34]. Most of these approaches target efficient + model inference [32, 50, 12], 15 and thus only minimize the cost of processing a single example, while ignoring + the other two red practices discussed in Section 2. 16 + The above examples indicate that the path to making AI green depends on how it is used. When developing a new + model, much of the research process involves training many model variants on a training set and performing inference + on a small development set. In such a setting, more efficient training procedures can lead to greater savings, while in + a production setting more efficient inference can be more important. We advocate for a holistic view of computational + savings which doesn’t sacrifice in some areas to make advances in others. + FPO has some limitations. First, it targets the electricity consumption of a model, while ignoring other potential + limiting factors for researchers such as the memory consumption by the model, which can often lead to additional + energy and monetary costs [24]. Second, the amount of work done by a model largely depends on the model implementation, + as two different implementations of the same model could result in very different amounts of processing + work. Due to the focus on the modeling contribution, the AI community has traditionally ignored the quality or efficiency + of models’ implementation. We argue that the time to reverse this norm has come, and that exceptionally + good implementations that lead to efficient models should be credited by the AI community. + + 3.2 FPO Cost of Existing Models + To demonstrate the importance of reporting the amount of work, we present FPO costs for several existing models. + A few trends are observable. First, as discussed in Section 2, models get more expensive with time, but the increase + in FPO does not lead to similar performance gains. For instance, an increase of almost 35% in FPO between ResNet and + ResNext (second and third points in graph) resulted in a 0.5% top-1 accuracy improvement. Similar patterns are observed + when considering the effect of other increases in model work. Second, the number of model parameters does not tell + the whole story: AlexNet (first point in the graph) actually has more parameters than ResNet (second point), but + dramatically less FPO, and also much lower accuracy. + Figure 4b shows the same analysis for a single object recognition model, ResNet [14], while comparing different + versions of the model with different number of layers. This creates a controlled comparison between the different + models, as they are identical in architecture, except for their size (and accordingly, their FPO cost). 
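As a side note on how such FPO numbers are obtained (our illustration, not the paper's): the count is analytic, summing MUL and ADD operations layer by layer, here for a small fully connected network whose layer sizes are arbitrary examples.

def fpo_linear(n_in, n_out, bias=True):
    # one dense layer: n_in * n_out multiplications plus the additions
    muls = n_in * n_out
    adds = (n_in - 1) * n_out + (n_out if bias else 0)
    return muls + adds

# a hypothetical 3-layer MLP, counted per inference (a single example):
layers = [(784, 256), (256, 256), (256, 10)]
total_fpo = sum(fpo_linear(i, o) for i, o in layers)
print(f"{total_fpo:,} FPO per example")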
Once again, we + notice the same trend: the large increase in FPO cost does not translate to a large increase in performance. + + 14 Figure 4a shows the number of parameters and FPO of several leading object recognition models, as well as their performance on the ImageNet dataset [6]. + 15 Some very recent work also targeted efficient training [7]. + 16 In fact, creating smaller models often results in longer running time, so mitigating the different trends might be at odds [44]. + 17 We consider this exclusive focus on the final prediction another symptom of Red AI . + 18 These numbers represent FPO per inference, i.e., the work required to process a single example. + + 3.3 Additional Ways to Promote Green AI + In addition to reporting the FPO cost of the final reported number, we encourage researchers to report the bud- + get/accuracy curve observed during training. In a recent paper [9], we observed that selecting the better performing + model on a given task depends highly on the amount of compute available during model development. We introduced + a method for computing the expected best validation performance of a model as a function of the given budget. We + argue that reporting this curve will allow users to make wiser decisions about their selection of models and highlight + the stability of different approaches. + We further advocate for making efficiency an official contribution in major AI conferences, by advising reviewers + to recognize and value contributions that do not strictly improve state of the art, but have other benefits such as + efficiency. Finally, we note that the trend of releasing pretrained models publicly is a green success, and we would like + to encourage organizations to continue to release their models in order to save others the costs of retraining them. + + + 4 Related Work + + Recent work has analyzed the carbon emissions of training deep NLP models [40] and concluded that computationally + expensive experiments can have a large environmental and economic impact. With modern experiments using such + large budgets, many researchers (especially those in academia) lack the resources to work in many high-profile areas; + increased value placed on computationally efficient approaches will allow research contributions from more diverse + groups. We emphasize that the conclusions of [40] are the result of long-term trends, and are not isolated within NLP, + but hold true across machine learning. + While some companies offset electricity usage by purchasing carbon credits, it is not clear that buying credits is + as effective as using less energy. In addition, purchasing carbon credits is voluntary; Google cloud 20 and Microsoft + Azure 21 purchase carbon credits to offset their spent energy, but Amazon’s AWS 22 (the largest cloud computing plat- + form 23 ) only covered fifty percent of its power usage with renewable energy. + The push to improve state-of-the-art performance has focused the research community’s attention on reporting the + single best result after running many experiments for model development and hyperparameter tuning. Failure to fully + report these experiments prevents future researchers from understanding how much effort is required to reproduce a + result or extend it [9]. + Our focus is on improving efficiency in the machine learning community, but machine learning can also be used + as a tool for work in areas like climate change. 
For example, machine learning has been used for reducing emissions of cement plants [1] and tracking animal conservation outcomes [11], and is predicted to be useful for forest fire management [33]. Undoubtedly these are important applications of machine learning; we recognize that they are orthogonal to the content of this paper.

 19 Numbers taken from https://github.com/sovrasov/flops-counter.pytorch
 20 https://cloud.google.com/sustainability/
 21 https://www.microsoft.com/en-us/environment/carbon
 22 https://aws.amazon.com/about-aws/sustainability/
 23 https://tinyurl.com/y2kob969


 5 Conclusion

 The vision of Green AI raises many exciting research directions that help to overcome the inclusiveness challenges of Red AI. Progress will reduce the computational expense with a minimal reduction in performance, or even improve performance as more efficient methods are discovered. Also, it would seem that Green AI could be moving us in a more cognitively plausible direction, as the brain is highly efficient.
 It's important to reiterate that we see Green AI as a valuable option, not an exclusive mandate; of course, both Green AI and Red AI have contributions to make. We want to increase the prevalence of Green AI by highlighting its benefits and advocating a standard measure of efficiency. Below, we point to a few important green research directions, and highlight a few open questions.
 Research on building space- or time-efficient models is often motivated by the need to fit a model on a small device (such as a phone) or to process examples fast enough for real-time use, such as image captioning for the blind (see Section 3.1). Some modern models don't even fit on a single GPU (see Section 2). Here we argue for a far broader approach.
 Data efficiency has received significant attention over the years [35, 19]. Modern research in vision and NLP often involves first pretraining a model on large "raw" (unannotated) data and then fine-tuning it to a task of interest through supervised learning. A strong result in this area often involves achieving similar performance to a baseline with fewer training examples or fewer gradient steps. Most recent work has addressed fine-tuning data [29], but pretraining efficiency is also important. In either case, one simple technique to improve in this area is to simply report performance with different amounts of training data. For example, reporting performance of contextual embedding models trained on 10 million, 100 million, 1 billion, and 10 billion tokens would facilitate faster development of new models, as they can first be compared at the smallest data sizes. Research here is of value not just to make training less expensive, but because in areas such as low-resource languages or historical domains it is extremely hard to generate more data, so to progress we must make more efficient use of what is available.
 Finally, the total number of experiments run to get a final result is often underreported and underdiscussed [9]. The few instances researchers have of full reporting of the hyperparameter search, architecture evaluations, and ablations that went into a reported experimental result have surprised the community [40]. While many hyperparameter optimization algorithms exist which can reduce the computational expense required to reach a given level of performance [3, 10], simple improvements here can have a large impact.
For example, stopping training early for models which are + clearly underperforming can lead to great savings [21]. + + + References + + [1]Prabal Acharyya, Sean D Rosario, Roey Flor, Ritvik Joshi, Dian Li, Roberto Linares, and Hongbao Zhang. + Autopilot of cement plants for reduction of fuel consumption and emissions, 2019. ICML Workshop on Climate + Change. + [2]Dario Amodei and Danny Hernandez. AI and compute, 2018. Blog post. + [3]James S. Bergstra, Remi Bardenet, Yoshua Bengio, and Bal´ azs K´ egl. Algorithms for hyper-parameter optimiza-´ + tion. InProc. of NeurIPS, 2011. + [4]Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for + practical applications. InProc. of ISCAS, 2017. + [5]Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path networks. In + Proc. of NeurIPS, 2017. + [6]Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical + image database. InProc. of CVPR, 2009. + [7]Tim Dettmers and Luke Zettlemoyer. Sparse networks from scratch: Faster training without losing performance, + 2019. arXiv:1907.04840. + [8]Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional + transformers for language understanding. InProc. of NAACL, 2019. + [9]Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. Show your work: Improved + reporting of experimental results. InProc. of EMNLP, 2019. + [10]Jesse Dodge, Kevin Jamieson, and Noah A. Smith. Open loop hyperparameter optimization and determinantal + point processes. InProc. of AutoML, 2017. + [11]Clement Duhart, Gershon Dublon, Brian Mayton, Glorianna Davenport, and Joseph A. Paradiso. Deep learning + for wildlife conservation and restoration efforts, 2019. ICML Workshop on Climate Change. + [12]Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. MorphNet: Fast & + simple resource-constrained structure learning of deep networks. InProc. of CVPR, 2018. + [13]Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data. IEEE Intelligent + Systems, 24:8–12, 2009. + [14]Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In + Proc. of CVPR, 2016. + [15]Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory.¨ Neural computation, 9(8):1735–1780, + 1997. + [16]Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco An- + dreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications, + 2017. arXiv:1704.04861. + [17]Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. InProc. of CVPR, 2018. + [18]Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbig- + niew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. Speed/accuracy trade-offs for modern convo- + lutional object detectors. InProc. of CVPR, 2017. + [19]Sanket Kamthe and Marc Peter Deisenroth. Data-efficient reinforcement learning with probabilistic model pre- + dictive control. InProc. of AISTATS, 2018. + [20]Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural + networks. InProc. of NeurIPS, 2012. + [21]Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 
Hyperband: Bandit- + based configuration evaluation for hyperparameter optimization. InProc. of ICLR, 2017. + [22]Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. + Berg. Ssd: Single shot multibox detector. InProc. of ECCV, 2016. + [23]Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, + Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized bert pretraining approach, 2019. + arXiv:1907.11692. + [24]Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient + cnn architecture design. InProc. of ECCV, 2018. + [25]Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin + Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. InProc. ECCV, + 2018. + [26]Gabor Melis, Chris Dyer, and Phil Blunsom. On the state of the art of evaluation in neural language models. In´ + Proc. of EMNLP, 2018. + [27]Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks + for resource efficient inference. InProc. of ICLR, 2017. + [28]Gordon E. Moore. Cramming more components onto integrated circuits, 1965. + [29]Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettle- + moyer. Deep contextualized word representations. InProc. of NAACL, 2018. + [30]Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are + unsupervised multitask learners, 2019. OpenAI Blog. + [31]Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification + using binary convolutional neural networks. InProc. of ECCV, 2016. + [32]Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object + detection. InProc. of CVPR, 2016. + [33]David Rolnick, Priya L. Donti, Lynn H. Kaack, Kelly Kochanski, Alexandre Lacoste, Kris Sankaran, An- + drew Slavin Ross, Nikola Milojevic-Dupont, Natasha Jaques, Anna Waldman-Brown, Alexandra Luccioni, Tegan + Maharaj, Evan D. Sherwin, S. Karthik Mukkavilli, Konrad P. Kording, Carla Gomes, Andrew Y. Ng, Demis Has-¨ + sabis, John C. Platt, Felix Creutzig, Jennifer Chayes, and Yoshua Bengio. Tackling climate change with machine + learning, 2019. arXiv:1905.12616. + [34]Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: + Inverted residuals and linear bottlenecks. InProc. of CVPR, 2018. + [35]Roy Schwartz, Sam Thomson, and Noah A. Smith. SoPa: Bridging CNNs, RNNs, and weighted finite-state + machines. InProc. of ACL, 2018. + [36]Yoav Shoham, Raymond Perrault, Erik Brynjolfsson, Jack Clark, James Manyika, Juan Carlos Niebles, Terah + Lyons, John Etchemendy, and Z Bauer. The AI index 2018 annual report. AI Index Steering Committee, + Human-Cente Red AI Initiative, Stanford University. Available athttp://cdn.aiindex.org/2018/AI% + 20Index%202018%20Annual%20Report.pdf, 202018, 2018. + [37]David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian + Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, + John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore + Graepel, and Demis Hassabis. 
Mastering the game of Go with deep neural networks and tree search.Nature, + 529(7587):484, 2016. + [38]David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc + Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis + Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017. + arXiv:1712.01815. + [39]David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas + Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, + George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human + knowledge.Nature, 550(7676):354, 2017. + [40]Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in + NLP. InProc. of ACL, 2019. + [41]Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of + data in deep learning era. InProc. of ICCV, 2017. + [42]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, + and Illia Polosukhin. Attention is all you need. InProc. of NeurIPS, 2017. + [43]Tom Veniat and Ludovic Denoyer. Learning time/memory-efficient deep architectures with budgeted super net- + works. InProc. of CVPR, 2018. + [44]Aaron Walsman, Yonatan Bisk, Saadia Gabriel, Dipendra Misra, Yoav Artzi, Yejin Choi, and Dieter Fox. Early + fusion for goal directed robotic vision. InProc. of IROS, 2019. + [45]Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and + Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems, + 2019. arXiv:1905.00537. + [46]Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A + multi-task benchmark and analysis platform for natural language understanding. InProc. of ICLR, 2019. + [47]Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations + for deep neural networks. InProc. of CVPR, 2017. + [48]Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: + Generalized autoregressive pretraining for language understanding, 2019. arXiv:1906.08237. + [49]Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. + Defending against neural fake news, 2019. arXiv:1905.12616. + [50]Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional + neural network for mobile devices. InProc. of CVPR, 2018. + [51]Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. InProc. of ICLR, 2017. +<> <> <> + + +<> <> <> +Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication + +Herbert Jaeger* and Harald Haas + +We present a method for learning nonlinear systems, echo state networks (ESNs). ESNs employ artificial recurrent neural networks in a way that has recently been proposed independently as a learning mechanism in biological brains. The learning method is computationally efficient and easy to use. On a benchmark task of predicting a chaotic time series, accuracy is improved by a factor of 2400 over previous techniques. 
The potential for engineering applications is illustrated by equalizing a communication channel, where the signal error rate is improved by two orders of magnitude.

International University Bremen, Bremen D-28759, Germany.

Nonlinear dynamical systems abound in the sciences and in engineering. If one wishes to simulate, predict, filter, classify, or control such a system, one needs an executable system model. However, it is often infeasible to obtain analytical models. In such cases, one has to resort to black-box models, which ignore the internal physical mechanisms and instead reproduce only the outwardly observable input-output behavior of the target system.
If the target system is linear, efficient methods for black-box modeling are available. Most technical systems, however, become nonlinear if operated at higher operational points (that is, closer to saturation). Although this might lead to cheaper and more energy-efficient designs, it is not done because the resulting nonlinearities cannot be harnessed. Many biomechanical systems use their full dynamic range (up to saturation) and thereby become lightweight, energy efficient, and thoroughly nonlinear.
Here, we present an approach to learning black-box models of nonlinear systems, echo state networks (ESNs). An ESN is an artificial recurrent neural network (RNN). RNNs are characterized by feedback (recurrent) loops in their synaptic connection pathways. They can maintain an ongoing activation even in the absence of input and thus exhibit dynamic memory. Biological neural networks are typically recurrent. Like biological neural networks, an artificial RNN can learn to mimic a target system in principle, with arbitrary accuracy (1). Several learning algorithms are known (2-4) that incrementally adapt the synaptic weights of an RNN in order to tune it toward the target system. These algorithms have not been widely employed in technical applications because of slow convergence and suboptimal solutions (5, 6). The ESN approach differs from these methods in that a large RNN is used (on the order of 50 to 1000 neurons; previous techniques typically use 5 to 30 neurons) and in that only the synaptic connections from the RNN to the output readout neurons are modified by learning; previous techniques tune all synaptic connections (Fig. 1). Because there are no cyclic dependencies between the trained readout connections, training an ESN becomes a simple linear regression task.
We illustrate the ESN approach on a task of chaotic time series prediction (Fig. 2) (7). The Mackey-Glass system (MGS) (8) is a standard benchmark system for time series prediction studies. It generates a subtly irregular time series (Fig. 2A). The prediction task has two steps: (i) using an initial teacher sequence generated by the original MGS to learn a black-box model M of the generating system, and (ii) using M to predict the value of the sequence some steps ahead.
First, we created a random RNN with 1000 neurons (called the reservoir) and one output neuron. The output neuron was equipped with random connections that project back into the reservoir (Fig. 2B). A 3000-step teacher sequence <> was generated from the MGS equation and fed into the output neuron. This excited the internal neurons through the output feedback connections. After an initial transient period, they started to exhibit systematic individual variations of the teacher sequence (Fig. 2B).
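The construction just described, together with the linear readout training and free-running prediction detailed in the following paragraphs, can be condensed into a short numerical sketch. The reservoir size, the 1% sparsity target, the spectral-radius scaling of 0.8, and the stand-in teacher signal below are illustrative choices and not the paper's exact setup, which drives a 1000-neuron reservoir with a Mackey-Glass sequence:

    import numpy as np

    rng = np.random.default_rng(0)
    N, T, washout = 200, 3000, 1000        # reservoir size reduced here for a quick run

    # Sparse random reservoir (about 1% connectivity) rescaled to spectral radius < 1,
    # plus fixed random output-feedback connections from the single output neuron.
    W = rng.uniform(-1.0, 1.0, (N, N)) * (rng.random((N, N)) < 0.01)
    W *= 0.8 / np.max(np.abs(np.linalg.eigvals(W)))
    W_fb = rng.uniform(-1.0, 1.0, N)

    # Stand-in teacher signal d(n); the paper uses a Mackey-Glass sequence instead.
    steps = np.arange(T + 100)
    d = 0.5 * np.sin(0.2 * steps) * np.sin(0.0331 * steps)

    # Teacher forcing: the known output drives the reservoir through the feedback weights.
    x = np.zeros(N)
    states, targets = [], []
    for t in range(T - 1):
        x = np.tanh(W @ x + W_fb * d[t])   # reservoir state at time t+1
        if t >= washout:                   # discard the initial transient
            states.append(x.copy())
            targets.append(d[t + 1])       # the readout should reproduce the teacher
    X, Y = np.array(states), np.array(targets)

    # The only trained quantities: readout weights, obtained by ordinary least squares.
    w_out, *_ = np.linalg.lstsq(X, Y, rcond=None)

    # Free run: the trained output replaces the teacher in the feedback loop.
    y_prev, preds = d[T - 1], []
    for _ in range(84):
        x = np.tanh(W @ x + W_fb * y_prev)
        y_prev = w_out @ x
        preds.append(y_prev)
    print(np.round(preds[:5], 4))          # network continuation
    print(np.round(d[T:T + 5], 4))         # reference continuation

Because only w_out is learned and the readout has no cyclic dependencies, the fit is the simple linear regression referred to above.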
The fact that the internal neurons display systematic variants of the exciting external signal is constitutional for ESNs: The internal neurons must work as echo functions for the driving signal. Not every randomly generated RNN has this property, but it can effectively be built into a reservoir (supporting online text).
It is important that the echo signals be richly varied. This was ensured by a sparse interconnectivity of 1% within the reservoir. This condition lets the reservoir decompose into many loosely coupled subsystems, establishing a richly structured reservoir of excitable dynamics.
After time <>, output connection weights wi (i = 1, . . . , 1000) were computed (dashed arrows in Fig. 2B) from the last 2000 steps n = 1001, . . . , 3000 of the training run such that the training error

<>

was minimized [<>, activation of the ith internal neuron at time n]. This is a simple linear regression.
With the new wi in place, the ESN was disconnected from the teacher after step 3000 and left running freely. A bidirectional dynamical interplay of the network-generated output signal with the internal signals <> unfolded. The output signal <> was created from the internal neuron activation signals <> through the trained connections wi, by <>. Conversely, the internal signals were echoed from that output signal through the fixed output feedback connections (supporting online text).
For testing, an 84-step continuation <> of the original signal was computed for reference. The network output y(3084) was compared with the correct continuation d(3084). Averaged over 100 independent trials, a normalized root mean square error

<>

was obtained (<> and <> teacher and network output in trial j, σ2 variance of MGS signal), improving the best previous techniques (9-15), which used training sequences of length 500 to 10,000, by a factor of 700. If the prediction run was continued, deviations typically became visible after about 1300 steps (Fig. 2A). With a refined variant of the learning method (7), the improvement factor rises to 2400. Models of similar accuracy were also obtained for other chaotic systems (supporting online text).

<
>

Fig. 1. (A) Schema of previous approaches to RNN learning. (B) Schema of ESN approach. Solid arrows, fixed synaptic connections; dotted arrows, adjustable connections. Both approaches aim at minimizing the error <>, where <> is the network output and d(n) is the teacher time series observed from the target system.

The main reason for the jump in modeling accuracy is that ESNs capitalize on a massive short-term memory. We showed analytically (16) that under certain conditions an ESN of size N may be able to "remember" a number of previous inputs that is of the same order of magnitude as N. This information is more massive than the information used in other techniques (supporting online text).
We now illustrate the approach in a task of practical relevance, namely, the equalization of a wireless communication channel (7). The essentials of equalization are as follows: A sender wants to communicate a symbol sequence s(n). This sequence is first transformed into an analog envelope signal d(n), then modulated on a high-frequency carrier signal and transmitted, then received and demodulated into an analog signal u(n), which is a corrupted version of d(n). Major sources of corruption are noise (thermal or due to interfering signals), multipath propagation, which leads to a superposition of adjacent symbols (intersymbol interference), and nonlinear distortion induced by operating the sender's power amplifier in the high-gain region. To avoid the latter, the actual power amplification is run well below the maximum amplification possible, thereby incurring a substantial loss in energy efficiency, which is clearly undesirable in cell-phone and satellite communications. The corrupted signal u(n) is then passed through an equalizing filter whose output y(n) should restore u(n) as closely as possible to d(n). Finally, the equalized signal y(n) is converted back into a symbol sequence. The quality measure for the entire process is the fraction of incorrect symbols finally obtained (symbol error rate).

Fig. 2. (A) Prediction output of the trained ESN (dotted) overlaid with the correct continuation (solid). (B) Learning the MG attractor. Three sample activation traces of internal neurons are shown. They echo the teacher signal d(n). After training, the desired output is recreated from the echo signals through output connections (dotted arrows) whose weights wi are the result of the training procedure.

To compare the performance of an ESN equalizer with standard techniques, we took a channel model for a nonlinear wireless transmission system from a study (17) that compared three customary nonlinear equalization methods: a linear decision feedback equalizer (DFE), which is actually a nonlinear method; a Volterra DFE; and a bilinear DFE. The model equation featured intersymbol interference across 10 consecutive symbols, a second-order and a third-order nonlinear distortion, and additive white Gaussian noise. All methods investigated in that study had 47 adjustable parameters and used sequences of 5000 symbols for training.
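The ESN equalizer described next adapts only its linear readout while the data stream in, using a recursive least squares (RLS) rule. The generic form of such an update is sketched below; the forgetting factor, the initialization constant, and the function name are illustrative, not values taken from the paper:

    import numpy as np

    def rls_readout(states, targets, lam=0.999, delta=1.0):
        # states: array of shape (T, n) of reservoir activations; targets: length-T array.
        n = states.shape[1]
        w = np.zeros(n)                            # readout weights being adapted
        P = np.eye(n) / delta                      # running estimate of the inverse correlation matrix
        for x, d_n in zip(states, targets):
            k = P @ x / (lam + x @ P @ x)          # gain vector
            e = d_n - w @ x                        # a-priori output error
            w = w + k * e
            P = (P - np.outer(k, x @ P)) / lam
        return w

In the equalization setting, states would hold the reservoir activations collected while the corrupted signal u(n) drives the network, and targets the corresponding known training signal d(n).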
To make the ESN equalizer comparable with the equalizers studied in (17), we took ESNs with a reservoir of 46 neurons (which is small for the ESN approach), which yielded 47 adjust.able parameters. (The 47th comes from a direct connection from the input to the output neuron.) +We carried out numerous learning trials (7) to obtain ESN equalizers, using an online learning method (a version of the recursive least square algorithm known from linear adaptive filters) to train the output weights on 5000-step training sequences. We chose an online adaptation scheme here because the methods in (17) were online adaptive, too, and because wireless communication channels mostly are time-varying, such that an equalizer must adapt to changing system characteristics. The entire learning-testing procedure was repeated for signal-to-noise + +<
> + +Fig. 3. Results of using an ESN for nonlinear channel equalization. Plot shows signal error rate (SER) versus signal-to-noise ratio (SNR). +(a) Linear DFE. (b) Volterra DFE. (c) Bilinear DFE. [(a) to (c) taken from (20)]. (d) Blue line represents average ESN performance with randomly generated reservoirs. Error bars, variation across networks. (e) Green line indicates performance of best network chosen from the networks averaged in (d). Error bars, variation across learning trials. +REPORTS +ratios ranging from 12 to 32 db. Figure 3 compares the average symbol error rates obtained with the results reported in (17), show.ing an improvement of two magnitudes for high signal-to-noise ratios. +For tasks with multichannel input and/or output, the ESN approach can be accommodated simply by adding more input or output neurons (16, 18). +ESNs can be applied to all basic tasks of signal processing and control, including time series prediction, inverse modeling, pattern generation, event detection and classification, modeling distributions of stochastic process.es, filtering, and nonlinear control (16, 18, 19, 20). Because a single learning run takes only a few seconds (or minutes, for very large data sets and networks), engineers can test out variants at a high turnover rate, a crucial factor for practical usability. +ESNs have been developed from a mathematical and engineering perspective, but exhibit typical features of biological RNNs: a large number of neurons, recurrent pathways, sparse random connectivity, and local modification of synaptic weights. The idea of using randomly connected RNNs to represent and memorize dynamic input in network states has frequently been explored in specific contexts, for instance, in artificial intelligence models of associative memory (21), models of prefrontal cortex function in sensory-motor sequencing tasks (22), models of birdsong (23), models of the cerebellum (24), and general computational models of neural oscillators (25). Many different learning mechanisms were considered, mostly within the RNN itself. The contribution of the ESN is to elucidate the mathematical properties of large RNNs such that they can be used with a linear, trainable readout mechanism for general black-box modeling. An approach essentially equivalent to ESNs, liquid state networks (26, 27), has been developed independently to model computations in cortical microcircuits. Recent findings in neurophysiology suggest that the basic ESN/liquid state network principle seems not uncommon in biological networks (28,30) and could eventually be exploited to control prosthetic devices by signals collected from a collective of neurons (31). + +References and Notes +1. K.-I. Funahashi, Y. Nakamura, Neural Netw. 6, 801 (1993). +2. D. Zipser, R. J. Williams, Neural Comput. 1, 270 (1989). +3. P. J. Werbos, Proc. IEEE 78, 1550 (1990). +4. L. A. Feldkamp, D. V. Prokhorov, C. F. Eagen, F. Yuan, in Nonlinear Modeling: Advanced Black-Box techniques , J. A. K. Suykens, J. Vandewalle, Eds. (Kluwer, Dordrecht, Netherlands, 1998), pp. 29�54. +5. K. Doya, in The Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed. (MIT Press, Cambridge, MA, 1995), pp. 796�800. +6. H. Jaeger, �Tutorial on training recurrent neural networks� (GMD-Report 159, German National Re.search Institute for Computer Science, 2002); ftp:// borneo.gmd.de/pub/indy/publications_herbert/ CompleteTutorialTechrep.pdf. + +REPORTS + +7. Materials andmethods are available as supporting material on Science Online. +8. M. C. 
Mackey, L. Glass, Science 197, 287 (1977). +9. J. Vesanto, in Proc. WSOM �97 (1997); www.cis.hut.�/ projects/monitor/publications/papers/wsom97.ps. +10. L. Chudy, I. Farkas, Neural Network World 8, 481 (1998). +11. H. Bersini, M. Birattari, G. Bontempi, in Proc. IEEE World Congr. on Computational Intelligence (IJCNN �98) (1997), pp. 2102�2106; ftp://iridia.ulb.ac.be/ pub/lazy/papers/IridiaTr1997-13_2.ps.gz. +12. T. M. Martinetz, S. G. Berkovich, K. J. Schulten, IEEE Trans. Neural Netw. 4, 558 (1993). +13. X. Yao, Y. Liu, IEEE Trans. Neural Netw. 8, 694 (1997). +14. F. Gers, D. Eck, J. F. Schmidhuber, �Applying LSTM to time series predictable through time-window ap.proaches� (IDSIA-IDSIA-22-00, 2000); www.idsia.ch/ felix/Publications.html. +15. J. McNames, J. A. K. Suykens, J. Vandewalle, Int. J. Bifurcat. Chaos 9, 1485 (1999). +16. H. Jaeger, �Short term memory in echo state net.works� (GMD-Report 152, German National Re.search Institute for Computer Science, 2002); ftp:// borneo.gmd.de/pub/indy/publications_herbert/ STMEchoStatesTechRep.pdf. +17. V. J. Mathews, J. Lee, in Advanced Signal Processing: Algorithms, Architectures, and Implementations V (Proc. SPIE Vol. 2296), (SPIE, San Diego, CA, 1994), pp. 317�327. +18. J. Hertzberg, H. Jaeger, F. Scho�nherr, in Proc. 15th Europ. Conf. on Art. Int. (ECAI 02), F. van Harmelen, Ed. (IOS Press, Amsterdam, 2002), pp. 708�712; www. ais.fhg.de/schoenhe/papers/ECAI02.pdf. +19. H. Jaeger, �The echo state approach to analysing and training recurrent neural networks� (GMD-Report 148, German National Research Institute for Com.puter Science, 2001); ftp://borneo.gmd.de/pub/indy/ publications_herbert/EchoStatesTechRep.pdf. +20. H. Jaeger, in Advances in Neural Information Process.ing Systems 15, S. Becker, S. Thrun, K. Obermayer, Eds. (MIT Press, Cambridge, MA, 2003) pp. 593�600. +21. G. E. Hinton, in Parallel Models of Associative Mem.ory, G. E. Hinton, J. A. Anderson, Eds. (Erlbaum, Hills.dale, NJ, 1981), pp. 161�187. +22. D. G. Beiser, J. C. Houk, J. Neurophysiol. 79, 3168 (1998). +23. S. Dehaene, J.-P. Changeux, J.-P. Nadal, Proc. Natl. Acad. Sci. U.S.A. 84, 2727 (1987). +24. M. Kawato, in The Handbook of Brain Theory and Neural Networks, M. Arbib, Ed. (MIT Press, Cam.bridge, MA, 1995), pp. 172�178. +25. K. Doya, S. Yoshizawa, Neural Netw. 2, 375 (1989). + +Ultrafast Electron Crystallography of Interfacial Water +Chong-Yu Ruan, Vladimir A. Lobastov, Franco Vigliotti, Songye Chen, Ahmed H. Zewail* +We report direct determination of the structures and dynamics of interfacial water on a hydrophilic surface with atomic-scale resolution using ultrafast electron crystallography. On the nanometer scale, we observed the coexistence of ordered surface water and crystallite-like ice structures, evident in the superposition of Bragg spots and Debye-Scherrer rings. The structures were determined to be dominantly cubic, but each undergoes different dynamics after the ultrafast sub.strate temperature jump. From changes in local bond distances (OHOand OO) with time, we elucidated the structural changes in the far-from-equilibrium regime at short times and near-equilibration at long times. + +The nature of interfacial molecular assemblies of nanometer scale is of fundamental impor.tance to chemical and biological phenomena (1�4). 
For water, the directional molecular fea.tures of hydrogen bonding (5, 6) and the dif.ferent structures possible, from amorphous (7) to crystalline (8), make the interfacial (9) col.lective assembly on the mesoscopic (10) scale much less understood. Structurally, the nature of water on a substrate is determined by forces of orientation at the interface and by the net charge density, which establishes the hydro.philic or hydrophobic character of the substrate. However, the transformation from ordered to dis.ordered structure and their coexistence critically depends on the time scales for the movements of atoms locally and at long range. Therefore, it is essential to elucidate the nature of these structures and the time scales for their equilibration. +Laboratory for Molecular Sciences, Arthur Amos Noyes Laboratory of Chemical Physics, California Institute of Technology, Pasadena, CA 91125, USA. +*To whom correspondence should be addressed. E.mail: zewail@caltech.edu +Here, we report direct determination of the structures of interfacial water with atomic-scale resolution, using diffraction and the dynamics following ultrafast infrared (IR) laser-initiated +26. W. Maass, T. Natschla�ger, H. Markram, Neural Com-put. 14, 2531 (2002). +27. W. Maass, T. Natschla�ger, H. Markram, in Compu.tational Neuroscience: A Comprehensive Approach, J. Feng, Ed. (Chapman & Hall/CRC, 2003), pp. 575� 605. +28. G. B. Stanley, F. F. Li, Y. Dan, J. Neurosci. 19, 8036 (1999). +29. G. B. Stanley, Neurocomputing 38�40, 1703 (2001). +30. W. M. Kistler, Ch. I. de Zeeuw, Neural Comput. 14, 2597 (2002). 31. S. Mussa-Ivaldi, Nature 408, 361 (2000). +32. The �rst author thanks T. Christaller for unfaltering support andW. Maass for friendly cooperation. Inter.national patents are claimedby Fraunhofer AIS (PCT/ EP01/11490). + +Supporting Online Material +www.sciencemag.org/cgi/content/full/304/5667/78/DC1 Materials andMethods SOM Text Figs. S1 to S4 References + +temperature jump. Interfacial water is formed on a hydrophilic surface (silicon, chlorine-terminated) under controlled ultrahigh vacuum (UHV) conditions (Fig. 1). With these atomic-scale spatial, temporal, and energy resolutions, the evolution of nonequilibrium structures was monitored, their ordered or disordered nature was established, and the time scale for the breakage of long-range bonding and formation of new structures was determined. We identi.fied the structured and ordered interfacial water from the Bragg diffraction and the layered crys.tallite structure from the Debye-Scherrer rings. The temporal evolution of interfacial water and layered ice after the temperature jump was studied with submonolayer sensitivity. We compared these results with those obtained on hydrophobic surfaces, such as hydrogen-terminated silicon or silver substrate. +Spectroscopic techniques, such as internal reflection (11) and nonlinear [second-harmonic generation (12) and sum-frequency generation + + <
> + +Fig. 1. Structured water at the hydrophilic interface. The chlo.rine termination on a <> substrate forms a hydrophilic layer that orients the water bilayer. The closest packing dis.tance (4.43) be.tween oxygen atoms in the bottom layer of water is similar to the distance (4.50) be.tween the on-top and interstitial sites of the chlorine layer, result.ing in specific bilayer orientations (30) with respect to the silicon substrate. This ordered stacking persists for three to four bilayers (1 nm) before disorientation takes place andresults in crystallite islands, forming the layered structure. The size of atoms is not to scale for the van der Waals radii. +<> <> <> + + +<> <> <> + Identity Mappings in Deep Residual Networks + + Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun + + Microsoft Research + + Abstract + + Deep residual networks [1] have emerged as a family of ex- + tremely deep architectures showing compelling accuracy and nice con- + vergence behaviors. In this paper, we analyze the propagation formu- + lations behind the residual building blocks, which suggest that the for- + ward and backward signals can be directly propagated from one block + to any other block, when using identity mappings as the skip connec- + tions and after-addition activation. A series of ablation experiments sup- + port the importance of these identity mappings. This motivates us to + propose a new residual unit, which makes training easier and improves + generalization. We report improved results using a 1001-layer ResNet + on CIFAR-10 (4.62% error) and CIFAR-100, and a 200-layer ResNet + on ImageNet. Code is available at:https://github.com/KaimingHe/ + resnet-1k-layers. + + + 1 Introduction + + Deep residual networks (ResNets) [1] consist of many stacked \Residual Units". + Each unit (Fig.1(a)) can be expressed in a general form: + + <> + + where xl and <> are input and output of the l-th unit, andFis a residual + function. In [1],<> is an identity mapping and is a ReLU [2] function. + ResNets that are over 100-layer deep have shown state-of-the-art accuracy for + several challenging recognition tasks on ImageNet [3] and MS COCO [4] compe- + titions. The central idea of ResNets is to learn the additive residual functionF + with respect to <>, with a key choice of using an identity mapping <> . + This is realized by attaching an identity skip connection shortcut. + In this paper, we analyze deep residual networks by focusing on creating a + direct path for propagating information not only within a residual unit, + but through the entire network. Our derivations reveal that if both <> and + <> are identity mappings, the signal could be directly propagated from one + unit to any other units, in both forward and backward passes. Our experiments + empirically show that training in general becomes easier when the architecture + is closer to the above two conditions. + To understand the role of skip connections, we analyze and compare various + types of <>. We find that the identity mapping <> chosen in [1] + + <
> + + Figure 1. Left: (a) original Residual Unit in [1]; (b) proposed Residual Unit. The grey + arrows indicate the easiest paths for the information to propagate, corresponding to + the additive term \xl " in Eqn.(4) (forward propagation) and the additive term \1" in + Eqn.(5) (backward propagation).Right: training curves on CIFAR-10 of1001-layer + ResNets. Solid lines denote test error (y-axis on the right), and dashed lines denote + training loss (y-axis on the left). The proposed unit makes ResNet-1001 easier to train. + + + + achieves the fastest error reduction and lowest training loss among all variants + we investigated, whereas skip connections of scaling, gating [5,6,7], and 1x1 + convolutions all lead to higher training loss and error. These experiments suggest + that keeping a clean information path (indicated by the grey arrows in Fig.1,2, + and4) is helpful for easing optimization. + To construct an identity mapping <>, we view the activation func- + tions (ReLU and BN [8]) as pre-activation of the weight layers, in contrast + to conventional wisdom of post-activation. This point of view leads to a new + residual unit design, shown in (Fig.1(b)). Based on this unit, we present com- + petitive results on CIFAR-10/100 with a 1001-layer ResNet, which is much easier + to train and generalizes better than the original ResNet in [1]. We further report + improved results on ImageNet using a 200-layer ResNet, for which the counter- + part of [1] starts to overfit. These results suggest that there is much room to + exploit the dimension ofnetwork depth, a key to the success of modern deep + learning. + + + + 2 Analysis of Deep Residual Networks + + + The ResNets developed in [1] are modularized architectures that stack building + blocks of the same connecting shape. In this paper we call these blocks \Residual 3 + + Units". The original Residual Unit in [1] performs the following computation: + + <>; (1) + <>. (2) + + Here xl is the input feature to the l-th Residual Unit. <> is a + set of weights (and biases) associated with the l-th Residual Unit, andKis the + number of layers in a Residual Unit (Kis 2 or 3 in [1]). F denotes the residual + function,e.g., a stack of two 3x3 convolutional layers in [1]. The function f is + the operation after element-wise addition, and in [1] f is ReLU. The function h + is set as an identity mapping:<> If f is also an identity mapping: <>, + we can put Eqn.(2) into Eqn.(1) + and obtain: + + <>. (3) + + Recursively <>, etc. we will have: + + <>; (4) + + for any deeper unit L and any shallower unit l. Eqn.(4) exhibits some nice + properties. + + (i) The feature xL of any deeper unit L can be represented as the + P feature xl of any shallower unit l plus a residual function in a form of <> + indicating that the model is in a residual fashion between any units L and l. + (ii)The feature <>, of any deep unit L, is the summation + of the outputs of all preceding residual functions (<>). This is in contrast to + Qa plain network here a feature xL is a series of matrix-vector products, say, <> + (ignoring BN and ReLU). + + Eqn.(4) also leads to nice backward propagation properties. Denoting the + loss function as E, from the chain rule of backpropagation [9] we have: + + <> (5) + + Eqn.(5) indicates that the gradient @E can be decomposed into two additive <> + terms: a term of <> that propagates information directly without concerning + any weight layers, and another term of <> that propagates <> + through the weight layers. 
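A small autograd check of this two-term decomposition is given below. The two-layer residual branches, their sizes, and the tiny-weight initialization are illustrative, not the architectures used in the paper; the point is only that the additive identity term keeps the gradient from vanishing:

    import torch

    torch.manual_seed(0)
    num_units, dim = 50, 16
    # Residual branches with near-zero weights, so every residual term is almost zero.
    branches = [torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU(),
                                    torch.nn.Linear(dim, dim)) for _ in range(num_units)]
    for branch in branches:
        for p in branch.parameters():
            torch.nn.init.normal_(p, std=1e-4)

    x_l = torch.randn(1, dim, requires_grad=True)
    x = x_l
    for branch in branches:              # x_{i+1} = x_i + F(x_i): h and f both identity
        x = x + branch(x)
    loss = x.sum()                       # makes dE/dx_L a tensor of ones
    loss.backward()

    # The gradient reaching the shallow unit stays close to dE/dx_L even though
    # 50 units with arbitrarily small weights sit in between.
    print(x_l.grad)                      # approximately all ones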
The additive term of @E ensures that information is directly propagated back to + any shallower unIt l. Eqn.(5) also suggests that it is unlikely for the gradient @E to be canceled out for + a mini-batch, because in general the term <> cannot be always -1 for all samples in a mini-batch. + This implies that the gradient of a layer does not vanish even when the weights are arbitrarily small. + + 1 It is noteworthy that there are Residual Units for increasing dimensions and reducing + feature map sizes [1] in which h is not identity. In this case the following derivations + do not hold strictly. But as there are only a very few such units (two on CIFAR and + three on ImageNet, depending on image sizes [1]), we expect that they do not have + the exponential impact as we present in Sec.3. One may also think of our derivations + as applied to all Residual Units within the same feature map size. + + Discussions + + Eqn.(4) and Eqn.(5) suggest that the signal can be directly propagated from + any unit to another, both forward and backward. The foundation of Eqn.(4) is + two identity mappings: (i) the identity skip connection <> , and (ii) the + condition that f is an identity mapping. + + These directly propagated information flows are represented by the grey ar- + rows in Fig.1,2, and4. And the above two conditions are true when these grey + arrows cover no operations (expect addition) and thus are clean. In the fol- + lowing two sections we separately investigate the impacts of the two conditions. + + 3 On the Importance of Identity Skip Connections + + Let’s consider a simple modification, <>, to break the identity shortcut: + + <>, (6) + + where l is a modulating scalar (for simplicity we still assume f is identity). + Recursively applying this formulation we obtain an equation similar to Eqn. (4): + <>, or simply: + + <>; (7) + + where the notationF^absorbs the scalars into the residual functions. Similar to + Eqn.(5), we have backpropagation of the following form: + + <> (8) + + Unlike Eqn.(5), in Eqn.(8) the first additive term is modulated by a factor <> + the factor can be exponentially large; if <> for all i, this factor can be + exponentially small and vanish, which blocks the backpropagated signal from the + shortcut and forces it to flow through the weight layers. This results in optimization + difficulties as we show by experiments. + In the above analysis, the original identity skip connection in Eqn.(3) is re- + placed with a simple scaling <>. If the skip connection <> represents + more complicated transforms (such as gating and 1x1 convolutions), in Eqn.(8) Q the first + term becomes <> where h0 is the derivative of h. This product <> may + also impede information propagation and hamper the training procedure + as witnessed in the following experiments. + + + <
> + + Figure 2.Various types of shortcut connections used in Table1. The grey arrows + indicate the easiest paths for the information to propagate. The shortcut connections + in (b-f) are impeded by different components. For simplifying illustrations we do not + display the BN layers, which are adopted right after the weight layers for all units here. + + + 3.1 Experiments on Skip Connections + + We experiment with the 110-layer ResNet as presented in [1] on CIFAR-10 [10]. + This extremely deep ResNet-110 has 54 two-layer Residual Units (consisting of + 3x3 convolutional layers) and is challenging for optimization. Our implementation + details (see appendix) are the same as [1]. Throughout this paper we report + the median accuracy of 5 runs for each architecture on CIFAR, reducing the + impacts of random variations. + Though our above analysis is driven by identity f, the experiments in this + section are all based onf= ReLU as in [1]; we address identity f in the next + section. Our baseline ResNet-110 has 6.61% error on the test set. The comparisons + of other variants (Fig.2 and Table1) are summarized as follows: + Constant scaling. We set <> for all shortcuts (Fig.2(b)). We further + study two cases of scalingF: (i)Fis not scaled; or (ii)Fis scaled by a constant + scalar of <>, which is similar to the highway gating [6,7] but with frozen + gates. The former case does not converge well; the latter is able to converge, + but the test error (Table1, 12.35%) is substantially higher than the original + ResNet-110. Fig3(a) shows that the training error is higher than that of the + original ResNet-110, suggesting that the optimization has difficulties when the + shortcut signal is scaled down. 6 + + Table 1.Classification error on the CIFAR-10 test set using ResNet-110 [1], with + different types of shortcut connections applied to all Residual Units. We report \fail" + when the test error is higher than 20%. + + <
> + + Exclusive gating. Following the Highway Networks [6,7] that adopt a gating + mechanism [5], we consider a gating function <> where a + transform is represented by weights W g and biases <> followed by the sigmoid + function <>. In a convolutional network <> is realized by a <> + convolutional layer. The gating function modulates the signal by element-wise + multiplication. + We investigate the exclusive gates as used in [6,7] the F path is scaled + byg(x) and the shortcut path is scaled by <>. See Fig2(c). We find that the + initialization of the biases <> is critical for training gated models, and following + the guidelines 2 in [6,7], we conduct hyper-parameter search on the initial value of + <> in the range of 0 to -10 with a decrement step of -1 on the training set by cross- + validation. The best value (6 here) is then used for training on the training + set, leading to a test result of 8.70% (Table1), which still lags far behind the + ResNet-110 baseline. Fig 3(b) shows the training curves. Table1also reports the + results of using other initialized values, noting that the exclusive gating network + does not converge to a good solution when <> is not appropriately initialized. + The impact of the exclusive gating mechanism is two-fold. When <> + approaches 1, the gated shortcut connections are closer to identity which helps + information propagation; but in this case <> approaches 0 and suppresses the + functionF. To isolate the effects of the gating functions on the shortcut path + alone, we investigate a non-exclusive gating mechanism in the next. + Shortcut-only gating. In this case the functionFis not scaled; only the + shortcut path is gated by <>. See Fig2(d). The initialized value of<> is still + essential in this case. When the initialized<> is 0 (so initially the expectation + of <> is 0.5), the network converges to a poor result of 12.86% (Table1). + This is also caused by higher training error (Fig 3(c)). + + <
> + + Figure 3.Training curves on CIFAR-10 of various shortcuts. Solid lines denote test + error (y-axis on the right), and dashed lines denote training loss (y-axis on the left). + + + When the initialized <> is very negatively biased (e.g.,6), the value of + <> is closer to 1 and the shortcut connection is nearly an identity mapping. + Therefore, the result (6.91%, Table1) is much closer to the ResNet-110 baseline. + 1x1 convolutional shortcut. Next we experiment with 1x1 convolutional + shortcut connections that replace the identity. This option has been investigated + in [1] (known as option C) on a 34-layer ResNet (16 Residual Units) and shows + good results, suggesting that 1x1 shortcut connections could be useful. But we + find that this is not the case when there are many Residual Units. The 110-layer + ResNet has a poorer result (12.22%, Table1) when using 1x1 convolutional + shortcuts. Again, the training error becomes higher (Fig3(d)). When stacking + so many Residual Units (54 for ResNet-110), even the shortest path may still + impede signal propagation. We witnessed similar phenomena on ImageNet with + ResNet-101 when using 1x1 convolutional shortcuts. + Dropout shortcut. Last we experiment with dropout [11] (at a ratio of 0.5) + which we adopt on the output of the identity shortcut (Fig.2(f)). The network + fails to converge to a good solution. Dropout statistically imposes a scale of + with an expectation of 0.5 on the shortcut, and similar to constant scaling by + 0.5, it impedes signal propagation. + + Table 2.Classification error (%) on the CIFAR-10 test set using different activation + functions. + + <
> + + <
> + + Figure 4.Various usages of activation in Table2. All these units consist of the same + components | only the orders are different. + + + 3.2 Discussions + As indicated by the grey arrows in Fig.2, the shortcut connections are the + most direct paths for the information to propagate.Multiplicative manipulations + (scaling, gating, 1x1 convolutions, and dropout) on the shortcuts can hamper + information propagation and lead to optimization problems. + It is noteworthy that the gating and 1x1 convolutional shortcuts introduce + more parameters, and should have stronger representational abilities than + identity shortcuts. In fact, the shortcut-only gating and 1x1 convolution cover the + solution space of identity shortcuts (i.e., they could be optimized as identity + shortcuts). However, their training error is higher than that of identity short- + cuts, indicating that the degradation of these models is caused by optimization + issues, instead of representational abilities. + + + 4 On the Usage of Activation Functions + + Experiments in the above section support the analysis in Eqn.(5) and Eqn.(8), + both being derived under the assumption that the after-addition activation f 9 + + is the identity mapping. But in the above experiments f is ReLU as designed + in [1], so Eqn.(5) and (8) are approximate in the above experiments. Next we + investigate the impact off. + We want to make f an identity mapping, which is done by re-arranging + the activation functions (ReLU and/or BN). The original Residual Unit in [1] + has a shape in Fig.4(a) | BN is used after each weight layer, and ReLU is + adopted after BN except that the last ReLU in a Residual Unit is after element- + wise addition (f= ReLU). Fig.4(b-e) show the alternatives we investigated, + explained as following. + + 4.1 Experiments on Activation + In this section we experiment with ResNet-110 and a 164-layerBottleneck[1] + architecture (denoted as ResNet-164). A bottleneck Residual Unit consist of a + 1x1 layer for reducing dimension, a 3x3 layer, and a 1x1 layer for restoring + dimension. As designed in [1], its computational complexity is similar to the + two-3x3 Residual Unit. More details are in the appendix. The baseline ResNet- + 164 has a competitive result of 5.93% on CIFAR-10 (Table2). + BN after addition. Before turning f into an identity mapping, we go the + opposite way by adopting BN after addition (Fig.4(b)). In this case f involves + BN and ReLU. The results become considerably worse than the baseline (Ta- + ble2). Unlike the original design, now the BN layer alters the signal that passes + through the shortcut and impedes information propagation, as reflected by the + difficulties on reducing training loss at the beginning of training (Fib.6left). + ReLU before addition. A naive choice of making f into an identity map- + ping is to move the ReLU before addition (Fig.4(c)). However, this leads to a + non-negative output from the transformF, while intuitively a residual function + should take values in (-1,+1). As a result, the forward propagated signal + is monotonically increasing. This may impact the representational ability, + and the result is worse (7.84%, Table2) than the baseline. We expect to have + a residual function taking values in (-1,+1). This condition is satisfied by + other Residual Units including the following ones. + Post-activation or pre-activation?In the original design (Eqn.(1) and + Eqn.(2)), the activation<> affects both paths in the next Residual + Unit: <>. 
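In code, the question just posed reduces to where BN and ReLU sit relative to the element-wise addition. Below is a sketch of the two designs compared in this section, the original post-activation unit of Fig.4(a) and the full pre-activation unit of Fig.4(e); PyTorch and the channel count are illustrative, and dimension-changing shortcuts are omitted:

    import torch
    import torch.nn as nn

    class PostActUnit(nn.Module):
        # Fig.4(a): conv-BN-ReLU-conv-BN, addition, then ReLU (f = ReLU after addition).
        def __init__(self, c=16):
            super().__init__()
            self.conv1 = nn.Conv2d(c, c, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(c)
            self.conv2 = nn.Conv2d(c, c, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(c)

        def forward(self, x):
            out = torch.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return torch.relu(x + out)

    class PreActUnit(nn.Module):
        # Fig.4(e): BN-ReLU-conv-BN-ReLU-conv, then a clean addition (f = identity).
        def __init__(self, c=16):
            super().__init__()
            self.bn1 = nn.BatchNorm2d(c)
            self.conv1 = nn.Conv2d(c, c, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(c)
            self.conv2 = nn.Conv2d(c, c, 3, padding=1, bias=False)

        def forward(self, x):
            out = self.conv1(torch.relu(self.bn1(x)))
            out = self.conv2(torch.relu(self.bn2(out)))
            return x + out

The two units contain exactly the same layers; only the ordering differs, and only the pre-activation version leaves the addition itself as the identity mapping that the preceding analysis calls for.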
Next we develop an asymmetric form + where an activation f only affects the F path: <>, for + any l(Fig.5(a) to (b)). By renaming the notations, we have the following form: + + <>, (9) + + It is easy to see that Eqn.(9) is similar to Eqn.(4), and can enable a backward + formulation similar to Eqn.(5). For this new Residual Unit as in Eqn.(9), the new + after-addition activation becomes an identity mapping. This design means that + if a new after-addition activation f is asymmetrically adopted, it is equivalent + to recasting f as the pre-activation of the next Residual Unit. This is illustrated + in Fig.5. + + <
> + + Figure 5.Using asymmetric after-addition activation is equivalent to constructing a + pre-activationResidual Unit. + + Table 3.Classification error (%) on the CIFAR-10/100 test set using the original + Residual Units and our pre-activation Residual Units. + + <
> + + The distinction between post-activation/pre-activation is caused by the presence + of the element-wise addition. For a plain network that has N layers, there + are N-1 activations (BN/ReLU), and it does not matter whether we think of + them as post- or pre-activations. But for branched layers merged by addition, + the position of activation matters. + We experiment with two such designs: (i) ReLU-only pre-activation (Fig.4(d)), + and (ii) full pre-activation (Fig.4(e)) where BN and ReLU are both adopted be- + fore weight layers. Table2 shows that the ReLU-only pre-activation performs + very similar to the baseline on ResNet-110/164. This ReLU layer is not used in + conjunction with a BN layer, and may not enjoy the benefits of BN [8]. + Somehow surprisingly, when BN and ReLU are both used as pre-activation, + the results are improved by healthy margins (Table2and Table3). In Table3we + report results using various architectures: (i) ResNet-110, (ii) ResNet-164, (iii) + a 110-layer ResNet architecture in which each shortcut skips only 1 layer (i.e., 11 + + <
> + + Figure 6.Training curves on CIFAR-10.Left: BN after addition (Fig.4(b)) using + ResNet-110.Right: pre-activation unit (Fig.4(e)) on ResNet-164. Solid lines denote + test error, and dashed lines denote training loss. + + + a Residual Unit has only 1 layer, denoted as ResNet-110 (1layer)), and (iv) + a 1001-layer bottleneck architecture that has 333 Residual Units (111 on each + feature map size), denoted as \ResNet-1001". We also experiment on CIFAR- + 100. Table3shows that our pre-activation models are consistently better than + the baseline counterparts. We analyze these results in the following. + + + 4.2 Analysis + + We find the impact of pre-activation is twofold. First, the optimization is further + eased (comparing with the baseline ResNet) because f is an identity mapping. + Second, using BN as pre-activation improves regularization of the models. + Ease of optimization. This effect is particularly obvious when training + the1001-layerResNet. Fig.1shows the curves. Using the original design in + [1], the training error is reduced very slowly at the beginning of training. For + f= ReLU, the signal is impacted if it is negative, and when there are many + Residual Units, this effect becomes prominent and Eqn.(3) (so Eqn.(5)) is not + a good approximation. On the other hand, when f is an identity mapping, the + signal can be propagated directly between any two units. Our 1001-layer network + reduces the training loss very quickly (Fig.1). It also achieves the lowest loss + among all models we investigated, suggesting the success of optimization. + We also find that the impact off= ReLU is not severe when the ResNet + has fewer layers (e.g., 164 in Fig.6(right)). The training curve seems to suffer + a little bit at the beginning of training, but goes into a healthy status soon. By + monitoring the responses we observe that this is because after some training, + the weights are adjusted into a status such that yl in Eqn.(1) is more frequently + above zero and f does not truncate it (xl is always non-negative due to the previous + ReLU, so yl is below zero only when the magnitude ofFis very negative). + The truncation, however, is more frequent when there are 1000 layers. + + Table 4.Comparisons with state-of-the-art methods on CIFAR-10 and CIFAR-100 + using \moderate data augmentation" (ip/translation), except for ELU [12] with no + augmentation. Better results of [13,14] have been reported using stronger data augmen- + tation and ensembling. For the ResNets we also report the number of parameters. Our + results are the median of 5 runs with meanstd in the brackets. All ResNets results + are obtained with a mini-batch size of 128 except y with a mini-batch size of 64 (code + available athttps://github.com/KaimingHe/resnet-1k-layers). + + <
> + + Reducing overfitting. Another impact of using the proposed pre-activation + unit is on regularization, as shown in Fig.6(right). The pre-activation ver- + sion reaches slightly higher training loss at convergence, but produces lower test + error. This phenomenon is observed on ResNet-110, ResNet-110(1-layer), and + ResNet-164 on both CIFAR-10 and 100. This is presumably caused by BN’s + reularization effect [8]. In the original Residual Unit (Fig.4(a)), although the BN + normalizes the signal, this is soon added to the shortcut and thus the merged + signal is not normalized. This unnormalized signal is then used as the input of + the next weight layer. On the contrary, in our pre-activation version, the inputs + to all weight layers have been normalized. + + + 5 Results + + Comparisons on CIFAR-10/100.Table4compares the state-of-the-art meth- + ods on CIFAR-10/100, where we achieve competitive results. We note that we + do not specially tailor the network width or filter sizes, nor use regularization + techniques (such as dropout) which are very effective for these small datasets. + We obtain these results via a simple but essential concept | going deeper. These + results demonstrate the potential of pushing the limits of depth. + + Comparisons on ImageNet.Next we report experimental results on the 1000- + class ImageNet dataset [3]. We have done preliminary experiments using the skip + connections studied in Fig.2&3on ImageNet with ResNet-101 [1], and observed + similar optimization difficulties. The training error of these non-identity shortcut + networks is obviously higher than the original ResNet at the first learning rate 13 + + Table 5.Comparisons of single-crop error on the ILSVRC 2012 validation set. All + ResNets are trained using the same hyper-parameters and implementations as [1]). + Our Residual Units are the full pre-activation version (Fig.4(e)). y : code/model avail- + able athttps://github.com/facebook/fb.resnet.torch/tree/master/pretrained, + using scale and aspect ratio augmentation in [20]. + + <
> + + (similar to Fig.3), and we decided to halt training due to limited resources. + But we did finish a BN after addition version (Fig.4(b)) of ResNet-101 on + ImageNet and observed higher training loss and validation error. This model’s + single-crop (224x224) validation error is 24.6%/7.5%,vs.the original ResNet- + 101’s 23.6%/7.1%. This is in line with the results on CIFAR in Fig.6(left). + Table5shows the results of ResNet-152 [1] and ResNet-200 3 , all trained from + scratch. We notice that the original ResNet paper [1] trained the models using + scale jittering with shorter sides [256;480], and so the test of a 224x224 crop + ons= 256 (as did in [1]) is negatively biased. Instead, we test a single 320x320 + crop from s=320, for all original and our ResNets. Even though the ResNets + are trained on smaller crops, they can be easily tested on larger crops because + the ResNets are fully convolutional by design. This size is also close to 299x299 + used by Inception v3 [19], allowing a fairer comparison. + The original ResNet-152 [1] has top-1 error of 21.3% on a 320x320 crop, and + our pre-activation counterpart has 21.1%. The gain is not big on ResNet-152 + because this model has not shown severe generalization difficulties. However, + the original ResNet-200 has an error rate of 21.8%, higher than the baseline + ResNet-152. But we find that the original ResNet-200 has lower training error + than ResNet-152, suggesting that it suffers from overfitting. + Our pre-activation ResNet-200 has an error rate of 20.7%, which is1.1% + lower than the baseline ResNet-200 and also lower than the two versions of + ResNet-152. When using the scale and aspect ratio augmentation of [20,19], our + ResNet-200 has a result better than Inception v3 [19] (Table5). Concurrent + with our work, an Inception-ResNet-v2 model [21] achieves a single-crop result + of 19.9%/4.9%. We expect our observations and the proposed Residual Unit will + help this type and generally other types of ResNets. + + Computational Cost.Our models’ computational complexity is linear on + + 3 The ResNet-200 has 16 more 3-layer bottleneck Residual Units than ResNet-152, + which are added on the feature map of 28x28. + + depth (so a 1001-layer net is complex of a 100-layer net). On CIFAR, + ResNet-1001 takes about 27 hours to train on 2 GPUs; on ImageNet, ResNet- + 200 takes about 3 weeks to train on 8 GPUs (on par with VGG nets [22]). + + + + 6 Conclusions + + + This paper investigates the propagation formulations behind the connection + mechanisms of deep residual networks. Our derivations imply that identity short- + cut connections and identity after-addition activation are essential for making + information propagation smooth. Ablation experiments demonstrate phenom- + ena that are consistent with our derivations. We also present 1000-layer deep + networks that can be easily trained and achieve improved accuracy. + + + + Appendix: Implementation DetailsThe implementation details and hyper- + parameters are the same as those in [1]. On CIFAR we use only the translation + and skipping augmentation in [1] for training. The learning rate starts from 0.1, + and is divided by 10 at 32k and 48k iterations. Following [1], for all CIFAR + experiments we warm up the training by using a smaller learning rate of 0.01 at + the beginning 400 iterations and go back to 0.1 after that, although we remark + that this is not necessary for our proposed Residual Unit. 
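Written out, the CIFAR learning-rate schedule just described is the following step function (a sketch; the function name is arbitrary, and the value after the 48k drop simply follows from dividing by 10 again):

    def cifar_learning_rate(iteration, warmup_iters=400):
        # 0.01 warm-up for the first 400 iterations, then 0.1, divided by 10 at 32k and 48k.
        if iteration < warmup_iters:
            return 0.01
        if iteration < 32000:
            return 0.1
        if iteration < 48000:
            return 0.01
        return 0.001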
The mini-batch size
 is 128 on 2 GPUs (64 each), the weight decay is 0.0001, the momentum is 0.9,
 and the weights are initialized as in [23].
 On ImageNet, we train the models using the same data augmentation as in
 [1]. The learning rate starts from 0.1 (no warming up), and is divided by 10 at
 30 and 60 epochs. The mini-batch size is 256 on 8 GPUs (32 each). The weight
 decay, momentum, and weight initialization are the same as above.
 When using the pre-activation Residual Units (Fig.4(d)(e) and Fig.5), we
 pay special attention to the first and the last Residual Units of the entire net-
 work. For the first Residual Unit (that follows a stand-alone convolutional layer,
 conv1), we adopt the first activation right after conv1 and before splitting into
 two paths; for the last Residual Unit (followed by average pooling and a fully-
 connected classifier), we adopt an extra activation right after its element-wise
 addition. These two special cases are the natural outcome when we obtain the
 pre-activation network via the modification procedure as shown in Fig.5.
 The bottleneck Residual Units (for ResNet-164/1001 on CIFAR) are
 constructed following [1]. For example, a [3x3, 16; 3x3, 16] unit in ResNet-110 is
 replaced with a [1x1, 16; 3x3, 16; 1x1, 64] unit in ResNet-164, both of which have
 roughly the same number of parameters. For the bottleneck ResNets, when reducing
 the feature map size we use projection shortcuts [1] for increasing dimensions, and
 when pre-activation is used, these projection shortcuts are also with pre-activation.

 References

 1.He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
 In: CVPR. (2016)
 2.Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann ma-
 chines. In: ICML. (2010)
 3.Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,
 Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large
 Scale Visual Recognition Challenge. IJCV (2015)
 4.Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P.,
 Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV. (2014)
 5.Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation
 (1997)
 6.Srivastava, R.K., Greff, K., Schmidhuber, J.: Highway networks. In: ICML work-
 shop. (2015)
 7.Srivastava, R.K., Greff, K., Schmidhuber, J.: Training very deep networks. In:
 NIPS. (2015)
 8.Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by
 reducing internal covariate shift. In: ICML. (2015)
 9.LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.,
 Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural
 computation (1989)
 10.Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech Report
 (2009)
 11.Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.:
 Improving neural networks by preventing co-adaptation of feature detectors.
 arXiv:1207.0580 (2012)
 12.Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network
 learning by exponential linear units (ELUs). In: ICLR. (2016)
 13.Graham, B.: Fractional max-pooling. arXiv:1412.6071 (2014)
 14.Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplic-
 ity: The all convolutional net. arXiv:1412.6806 (2014)
 15.Lin, M., Chen, Q., Yan, S.: Network in network. In: ICLR.
16. Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z.: Deeply-supervised nets. In: AISTATS. (2015)
17. Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: Hints for thin deep nets. In: ICLR. (2015)
18. Mishkin, D., Matas, J.: All you need is a good init. In: ICLR. (2016)
19. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: CVPR. (2016)
20. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR. (2015)
21. Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv:1602.07261 (2016)
22. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR. (2015)
23. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: ICCV. (2015)
<> <> <>


<> <> <>
 Language Models are Few-Shot Learners

 Tom B. Brown  Benjamin Mann  Nick Ryder  Melanie Subbiah

 Jared Kaplan y  Prafulla Dhariwal  Arvind Neelakantan  Pranav Shyam  Girish Sastry

 Amanda Askell  Sandhini Agarwal  Ariel Herbert-Voss  Gretchen Krueger  Tom Henighan

 Rewon Child  Aditya Ramesh  Daniel M. Ziegler  Jeffrey Wu  Clemens Winter

 Christopher Hesse  Mark Chen  Eric Sigler  Mateusz Litwin  Scott Gray

 Benjamin Chess  Jack Clark  Christopher Berner

 Sam McCandlish  Alec Radford  Ilya Sutskever  Dario Amodei


 OpenAI


 Abstract

 Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
+ + + + Equal contribution + y Johns Hopkins University, OpenAI + + Contents + + 1 Introduction 3 + 2 Approach 6 + 2.1 Model and Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8 + 2.2 Training Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8 + 2.3 Training Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9 + 2.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10 + 3 Results 10 + 3.1 Language Modeling, Cloze, and Completion Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . .11 + 3.2 Closed Book Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .13 + 3.3 Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14 + 3.4 Winograd-Style Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16 + 3.5 Common Sense Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17 + 3.6 Reading Comprehension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .18 + 3.7 SuperGLUE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .18 + 3.8 NLI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20 + 3.9 Synthetic and Qualitative Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .21 + 4 Measuring and Preventing Memorization Of Benchmarks29 + 5 Limitations 33 + 6 Broader Impacts 34 + 6.1 Misuse of Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .35 + 6.2 Fairness, Bias, and Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .36 + 6.3 Energy Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .39 + 7 Related Work 39 + 8 Conclusion 40 + A Details of Common Crawl Filtering43 + B Details of Model Training 43 + C Details of Test Set Contamination Studies43 + D Total Compute Used to Train Language Models46 + E Human Quality Assessment of Synthetic News Articles46 + F Additional Samples from GPT-348 + G Details of Task Phrasing and Specifications50 + H Results on All Tasks for All Model Sizes63 + + + + 1 Introduction + + Recent years have featured a trend towards pre-trained language representations in NLP systems, applied in increasingly + flexible and task-agnostic ways for downstream transfer. First, single-layer representations were learned using word + vectors [MCCD13,PSM14] and fed to task-specific architectures, then RNNs with multiple layers of representations + and contextual state were used to form stronger representations [DL15,MBXS17,PNZtY18] (though still applied to + task-specific architectures), and more recently pre-trained recurrent or transformer language models [VSP + 17] have + been directly fine-tuned, entirely removing the need for task-specific architectures [RNSS18,DCLT18,HR18]. + This last paradigm has led to substantial progress on many challenging NLP tasks such as reading comprehension, + question answering, textual entailment, and many others, and has continued to advance based on new architectures + and algorithms [RSR + 19,LOG + 19,YDY + 19,LCG + 19]. 
However, a major limitation to this approach is that while + the architecture is task-agnostic, there is still a need for task-specific datasets and task-specific fine-tuning: to achieve + strong performance on a desired task typically requires fine-tuning on a dataset of thousands to hundreds of thousands + of examples specific to that task. Removing this limitation would be desirable, for several reasons. + First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the + applicability of language models. There exists a very wide range of possible useful language tasks, encompassing + anything from correcting grammar, to generating examples of an abstract concept, to critiquing a short story. For many + of these tasks it is difficult to collect a large supervised training dataset, especially when the process must be repeated + for every new task. + Second, the potential to exploit spurious correlations in training data fundamentally grows with the expressiveness + of the model and the narrowness of the training distribution. This can create problems for the pre-training plus + fine-tuning paradigm, where models are designed to be large to absorb information during pre-training, but are then + fine-tuned on very narrow task distributions. For instance [HLW + 20] observe that larger models do not necessarily + generalize better out-of-distribution. There is evidence that suggests that the generalization achieved under this paradigm + can be poor because the model is overly specific to the training distribution and does not generalize well outside it + [YdC + 19,MPL19]. Thus, the performance of fine-tuned models on specific benchmarks, even when it is nominally at + human-level, may exaggerate actual performance on the underlying task [GSL + 18,NK19]. + Third, humans do not require large supervised datasets to learn most language tasks – a brief directive in natural + language (e.g. “please tell me if this sentence describes something happy or something sad”) or at most a tiny number + of demonstrations (e.g. “here are two examples of people acting brave; please give a third example of bravery”) is often + + <
> + + Figure 1.1: Language model meta-learning.During unsupervised pre-training, a language model develops a broad + set of skills and pattern recognition abilities. It then uses these abilities at inference time to rapidly adapt to or recognize + the desired task. We use the term “in-context learning” to describe the inner loop of this process, which occurs within + the forward-pass upon each sequence. The sequences in this diagram are not intended to be representative of the data a + model would see during pre-training, but are intended to show that there are sometimes repeated sub-tasks embedded + within a single sequence. + + <
> + + Figure 1.2: Larger models make increasingly efficient use of in-context information. We show in-context learning + performance on a simple task requiring the model to remove random symbols from a word, both with and without a + natural language task description (see Sec.3.9.2). The steeper “in-context learning curves” for large models demonstrate + improved ability to learn a task from contextual information. We see qualitatively similar behavior across a wide range + of tasks. + + + sufficient to enable a human to perform a new task to at least a reasonable degree of competence. Aside from pointing + to a conceptual limitation in our current NLP techniques, this adaptability has practical advantages – it allows humans + to seamlessly mix together or switch between many tasks and skills, for example performing addition during a lengthy + dialogue. To be broadly useful, we would someday like our NLP systems to have this same fluidity and generality. + One potential route towards addressing these issues is meta-learning 1 – which in the context of language models means + the model develops a broad set of skills and pattern recognition abilities at training time, and then uses those abilities + at inference time to rapidly adapt to or recognize the desired task (illustrated in Figure1.1). Recent work [RWC + 19] + attempts to do this via what we call “in-context learning”, using the text input of a pretrained language model as a form + of task specification: the model is conditioned on a natural language instruction and/or a few demonstrations of the task + and is then expected to complete further instances of the task simply by predicting what comes next. + While it has shown some initial promise, this approach still achieves results far inferior to fine-tuning – for example + [RWC + 19] achieves only 4% on Natural Questions, and even its 55 F1 CoQa result is now more than 35 points behind + the state of the art. Meta-learning clearly requires substantial improvement in order to be viable as a practical method of + solving language tasks. + Another recent trend in language modeling may offer a way forward. In recent years the capacity of transformer + language models has increased substantially, from 100 million parameters [RNSS18], to 300 million parameters + [DCLT18], to 1.5 billion parameters [RWC + 19], to 8 billion parameters [SPP + 19], 11 billion parameters [RSR + 19], + and finally 17 billion parameters [Tur20]. Each increase has brought improvements in text synthesis and/or downstream + NLP tasks, and there is evidence suggesting that log loss, which correlates well with many downstream tasks, follows a + smooth trend of improvement with scale [KMH + 20]. Since in-context learning involves absorbing many skills and + tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong + gains with scale. + + 1 In the context of language models this has sometimes been called “zero-shot transfer”, but this term is potentially ambiguous: + the method is “zero-shot” in the sense that no gradient updates are performed, but it often involves providing inference-time + demonstrations to the model, so is not truly learning from zero examples. To avoid this confusion, we use the term “meta-learning” + to capture the inner-loop / outer-loop structure of the general method, and the term “in context-learning” to refer to the inner + loop of meta-learning. 
We further specialize the description to “zero-shot”, “one-shot”, or “few-shot” depending on how many + demonstrations are provided at inference time. These terms are intended to remain agnostic on the question of whether the model + learns new tasks from scratch at inference time or simply recognizes patterns seen during training – this is an important issue which + we discuss later in the paper, but “meta-learning” is intended to encompass both possibilities, and simply describes the inner-outer + loop structure. + + <
> + + Figure 1.3: Aggregate performance for all 42 accuracy-denominated benchmarks While zero-shot performance + improves steadily with model size, few-shot performance increases more rapidly, demonstrating that larger models are + more proficient at in-context learning. See Figure3.8for a more detailed analysis on SuperGLUE, a standard NLP + benchmark suite. + + + + In this paper, we test this hypothesis by training a 175 billion parameter autoregressive language model, which we call + GPT-3, and measuring its in-context learning abilities. Specifically, we evaluate GPT-3 on over two dozen NLP datasets, + as well as several novel tasks designed to test rapid adaptation to tasks unlikely to be directly contained in the training + set. For each task, we evaluate GPT-3 under 3 conditions: (a) “few-shot learning”, or in-context learning where we + allow as many demonstrations as will fit into the model’s context window (typically 10 to 100), (b) “one-shot learning”, + where we allow only one demonstration, and (c) “zero-shot” learning, where no demonstrations are allowed and only + an instruction in natural language is given to the model. GPT-3 could also in principle be evaluated in the traditional + fine-tuning setting, but we leave this to future work. + Figure1.2illustrates the conditions we study, and shows few-shot learning of a simple task requiring the model to + remove extraneous symbols from a word. Model performance improves with the addition of a natural language task + description, and with the number of examples in the model’s context,K. Few-shot learning also improves dramatically + with model size. Though the results in this case are particularly striking, the general trends with both model size and + number of examples in-context hold for most tasks we study. We emphasize that these “learning” curves involve no + gradient updates or fine-tuning, just increasing numbers of demonstrations given as conditioning. + Broadly, on NLP tasks GPT-3 achieves promising results in the zero-shot and one-shot settings, and in the the few-shot + setting is sometimes competitive with or even occasionally surpasses state-of-the-art (despite state-of-the-art being held + by fine-tuned models). For example, GPT-3 achieves 81.5 F1 on CoQA in the zero-shot setting, 84.0 F1 on CoQA in + the one-shot setting, 85.0 F1 in the few-shot setting. Similarly, GPT-3 achieves 64.3% accuracy on TriviaQA in the + zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting, the last of which is state-of-the-art + relative to fine-tuned models operating in the same closed-book setting. + GPT-3 also displays one-shot and few-shot proficiency at tasks designed to test rapid adaption or on-the-fly reasoning, + which include unscrambling words, performing arithmetic, and using novel words in a sentence after seeing them + defined only once. We also show that in the few-shot setting, GPT-3 can generate synthetic news articles which human + evaluators have difficulty distinguishing from human-generated articles. + At the same time, we also find some tasks on which few-shot performance struggles, even at the scale of GPT-3. This + includes natural language inference tasks like the ANLI dataset, and some reading comprehension datasets like RACE + or QuAC. By presenting a broad characterization of GPT-3’s strengths and weaknesses, including these limitations, we + hope to stimulate study of few-shot learning in language models and draw attention to where progress is most needed. 
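To make the three evaluation settings above concrete, the following sketch assembles zero-, one-, and few-shot prompts for an English-to-French translation task. It is schematic only: the helper name and the "=>" separator are our own choices, and the exact task phrasings used for evaluation are given in Appendix G.

    # Schematic prompt construction for in-context learning: an optional natural-language
    # task description, K demonstrations, and a final context for the model to complete.
    # No gradient updates are involved; the prompt is simply fed through the model.
    def build_prompt(task_description, demonstrations, query, k):
        lines = [task_description] if task_description else []
        for source, target in demonstrations[:k]:  # k = 0 (zero-shot), 1 (one-shot), ~10-100 (few-shot)
            lines.append(f"{source} => {target}")
        lines.append(f"{query} =>")  # the model is expected to predict what follows
        return "\n".join(lines)

    demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivree")]
    print(build_prompt("Translate English to French:", demos, "cheese", k=2))  # few-shot with K = 2

The number of demonstrations is limited only by what fits in the model's context window.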
+ A heuristic sense of the overall results can be seen in Figure1.3, which aggregates the various tasks (though it should + not be seen as a rigorous or meaningful benchmark in itself). + + We also undertake a systematic study of “data contamination” – a growing problem when training high capacity models + on datasets such as Common Crawl, which can potentially include content from test datasets simply because such + content often exists on the web. In this paper we develop systematic tools to measure data contamination and quantify + its distorting effects. Although we find that data contamination has a minimal effect on GPT-3’s performance on most + datasets, we do identify a few datasets where it could be inflating results, and we either do not report results on these + datasets or we note them with an asterisk, depending on the severity. + In addition to all the above, we also train a series of smaller models (ranging from 125 million parameters to 13 billion + parameters) in order to compare their performance to GPT-3 in the zero, one and few-shot settings. Broadly, for most + tasks we find relatively smooth scaling with model capacity in all three settings; one notable pattern is that the gap + between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models + are more proficient meta-learners. + Finally, given the broad spectrum of capabilities displayed by GPT-3, we discuss concerns about bias, fairness, and + broader societal impacts, and attempt a preliminary analysis of GPT-3’s characteristics in this regard. + The remainder of this paper is organized as follows. In Section2, we describe our approach and methods for training + GPT-3 and evaluating it. Section3presents results on the full range of tasks in the zero-, one- and few-shot settings. + Section4addresses questions of data contamination (train-test overlap). Section5discusses limitations of GPT-3. + Section6discusses broader impacts. Section7reviews related work and Section8concludes. + + + 2 Approach + + Our basic pre-training approach, including model, data, and training, is similar to the process described in [RWC + 19], + with relatively straightforward scaling up of the model size, dataset size and diversity, and length of training. Our use + of in-context learning is also similar to [RWC + 19], but in this work we systematically explore different settings for + learning within the context. Therefore, we start this section by explicitly defining and contrasting the different settings + that we will be evaluating GPT-3 on or could in principle evaluate GPT-3 on. These settings can be seen as lying on a + spectrum of how much task-specific data they tend to rely on. Specifically, we can identify at least four points on this + spectrum (see Figure2.1for an illustration): + + •Fine-Tuning (FT)has been the most common approach in recent years, and involves updating the weights of + a pre-trained model by training on a supervised dataset specific to the desired task. Typically thousands to + hundreds of thousands of labeled examples are used. The main advantage of fine-tuning is strong performance + on many benchmarks. The main disadvantages are the need for a new large dataset for every task, the potential + for poor generalization out-of-distribution [MPL19], and the potential to exploit spurious features of the + training data [GSL + 18,NK19], potentially resulting in an unfair comparison with human performance. 
In + this work we do not fine-tune GPT-3 because our focus is on task-agnostic performance, but GPT-3 can be + fine-tuned in principle and this is a promising direction for future work. + •Few-Shot (FS)is the term we will use in this work to refer to the setting where the model is given a few + demonstrations of the task at inference time as conditioning [RWC + 19], but no weight updates are allowed. + As shown in Figure2.1, for a typical dataset an example has a context and a desired completion (for example + an English sentence and the French translation), and few-shot works by giving K examples of context and + completion, and then one final example of context, with the model expected to provide the completion. We + typically setKin the range of 10 to 100 as this is how many examples can fit in the model’s context window + (nctx = 2048). The main advantages of few-shot are a major reduction in the need for task-specific data and + reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main + disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned + models. Also, a small amount of task specific data is still required. As indicated by the name, few-shot + learning as described here for language models is related to few-shot learning as used in other contexts in + ML [HYC01,VBL + 16] – both involve learning based on a broad distribution of tasks (in this case implicit in + the pre-training data) and then rapidly adapting to a new task. + •One-Shot (1S) is the same as few-shot except that only one demonstration is allowed, in addition to a natural + language description of the task, as shown in Figure 1. The reason to distinguish one-shot from few-shot and + zero-shot (below) is that it most closely matches the way in which some tasks are communicated to humans. + For example, when asking humans to generate a dataset on a human worker service (for example Mechanical + Turk), it is common to give one demonstration of the task. By contrast it is sometimes difficult to communicate + the content or format of a task if no examples are given. + + <
> + + Figure 2.1: Zero-shot, one-shot and few-shot, contrasted with traditional fine-tuning. The panels above show + four methods for performing a task with a language model – fine-tuning is the traditional method, whereas zero-, one-, + and few-shot, which we study in this work, require the model to perform the task with only forward passes at test + time. We typically present the model with a few dozen examples in the few shot setting. Exact phrasings for all task + descriptions, examples and prompts can be found in AppendixG. + + + •Zero-Shot (0S)is the same as one-shot except that no demonstrations are allowed, and the model is only given + a natural language instruction describing the task. This method provides maximum convenience, potential for + robustness, and avoidance of spurious correlations (unless they occur very broadly across the large corpus of + pre-training data), but is also the most challenging setting. In some cases it may even be difficult for humans + to understand the format of the task without prior examples, so this setting is in some cases “unfairly hard”. + For example, if someone is asked to “make a table of world records for the 200m dash”, this request can be + ambiguous, as it may not be clear exactly what format the table should have or what should be included (and + even with careful clarification, understanding precisely what is desired can be difficult). Nevertheless, for at + least some settings zero-shot is closest to how humans perform tasks – for example, in the translation example + in Figure2.1, a human would likely know what to do from just the text instruction. + + Figure2.1shows the four methods using the example of translating English to French. In this paper we focus on + zero-shot, one-shot and few-shot, with the aim of comparing them not as competing alternatives, but as different + problem settings which offer a varying trade-off between performance on specific benchmarks and sample efficiency. + We especially highlight the few-shot results as many of them are only slightly behind state-of-the-art fine-tuned models. + Ultimately, however, one-shot, or even sometimes zero-shot, seem like the fairest comparisons to human performance, + and are important targets for future work. + Sections2.1-2.3below give details on our models, training data, and training process respectively. Section2.4discusses + the details of how we do few-shot, one-shot, and zero-shot evaluations. + + <
> + + Table 2.1:Sizes, architectures, and learning hyper-parameters (batch size in tokens and learning rate) of the models + which we trained. All models were trained for a total of 300 billion tokens. + + + + 2.1 Model and Architectures + + We use the same model and architecture as GPT-2 [RWC + 19], including the modified initialization, pre-normalization, + and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse + attention patterns in the layers of the transformer, similar to the Sparse Transformer [CGRS19]. To study the dependence + of ML performance on model size, we train 8 different sizes of model, ranging over three orders of magnitude from 125 + million parameters to 175 billion parameters, with the last being the model we call GPT-3. Previous work [KMH + 20] + suggests that with enough training data, scaling of validation loss should be approximately a smooth power law as a + function of size; training models of many different sizes allows us to test this hypothesis both for validation loss and for + downstream language tasks. + Table2.1shows the sizes and architectures of our 8 models. Here n params is the total number of trainable parameters, + n layers is the total number of layers,d model is the number of units in each bottleneck layer (we always have the + feedforward layer four times the size of the bottleneck layer,<> model ), and d head is the dimension of each + attention head. All models use a context window of <> tokens. We partition the model across GPUs along + both the depth and width dimension in order to minimize data-transfer between nodes. The precise architectural + parameters for each model are chosen based on computational efficiency and load-balancing in the layout of models + across GPU’s. Previous work [KMH + 20] suggests that validation loss is not strongly sensitive to these parameters + within a reasonably broad range. + + 2.2 Training Dataset + + Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset 2 [RSR + 19] constituting + nearly a trillion words. This size of dataset is sufficient to train our largest models without ever updating on the same + sequence twice. However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have + lower quality than more curated datasets. Therefore, we took 3 steps to improve the average quality of our datasets: + (1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference + corpora, (2) we performed fuzzy de-duplication at the document level, within and across datasets, to prevent redundancy + and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and (3) we also added + known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity. + Details of the first two points (processing of Common Crawl) are described in AppendixA. For the third, we added + several curated high-quality datasets, including an expanded version of the WebText dataset [RWC + 19], collected + by scraping links over a longer period of time, and first described in [KMH + 20], two internet-based books corpora + (Books1 and Books2) and English-language Wikipedia. + Table2.2shows the final mixture of datasets that we used in training. 
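As a rough arithmetic check on how the model sizes of Section 2.1 follow from n_layers and d_model (with the feedforward width fixed at four times d_model), a decoder-only transformer has approximately 12 * n_layers * d_model^2 parameters, excluding embeddings. The sketch below evaluates this; the layer count and width used for the largest model are assumptions taken from the published GPT-3 configuration rather than from the table above.

    # Rough (embedding-free) parameter count for a decoder-only transformer with
    # d_ff = 4 * d_model: ~4*d_model^2 parameters per layer for the attention
    # projections (Q, K, V, output) plus ~8*d_model^2 for the feedforward block.
    def approx_params(n_layers, d_model):
        return 12 * n_layers * d_model ** 2

    # Assumed configuration of the largest model (96 layers, d_model = 12288):
    print(f"{approx_params(96, 12288) / 1e9:.0f}B")  # ~174B, close to the quoted 175 billion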
The CommonCrawl data was downloaded from + 41 shards of monthly CommonCrawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering + and 570GB after filtering, roughly equivalent to 400 billion byte-pair-encoded tokens. Note that during training, datasets + are not sampled in proportion to their size, but rather datasets we view as higher-quality are sampled more frequently, + such that CommonCrawl and Books2 datasets are sampled less than once during training, but the other datasets are + sampled 2-3 times. This essentially accepts a small amount of overfitting in exchange for higher quality training data. + + <
> + + Figure 2.2: Total compute used during training. Based on the analysis in Scaling Laws For Neural Language Models + [KMH + 20] we train much larger models on many fewer tokens than is typical. As a consequence, although GPT-3 3B + is almost 10x larger than RoBERTa-Large (355M params), both models took roughly 50 petaflop/s-days of compute + during pre-training. Methodology for these calculations can be found in AppendixD. + + <
> + + Table 2.2: Datasets used to train GPT-3. “Weight in training mix” refers to the fraction of examples during training + that are drawn from a given dataset, which we intentionally do not make proportional to the size of the dataset. As a + result, when we train for 300 billion tokens, some datasets are seen up to 3.4 times during training while other datasets + are seen less than once. + + + + A major methodological concern with language models pretrained on a broad swath of internet data, particularly large + models with the capacity to memorize vast amounts of content, is potential contamination of downstream tasks by + having their test or development sets inadvertently seen during pre-training. To reduce such contamination, we searched + for and attempted to remove any overlaps with the development and test sets of all benchmarks studied in this paper. + Unfortunately, a bug in the filtering caused us to ignore some overlaps, and due to the cost of training it was not feasible + to retrain the model. In Section4we characterize the impact of the remaining overlaps, and in future work we will + more aggressively remove data contamination. + + 2.3 Training Process + + As found in [KMH + 20,MKAT18], larger models can typically use a larger batch size, but require a smaller learning + rate. We measure the gradient noise scale during training and use it to guide our choice of batch size [MKAT18]. Table + 2.1shows the parameter settings we used. To train the larger models without running out of memory, we use a mixture + of model parallelism within each matrix multiply and model parallelism across the layers of the network. All models + were trained on V100 GPU’s on part of a high-bandwidth cluster provided by Microsoft. Details of the training process + and hyperparameter settings are described in AppendixB. + + 2.4 Evaluation + + For few-shot learning, we evaluate each example in the evaluation set by randomly drawing K examples from that + task’s training set as conditioning, delimited by 1 or 2 newlines depending on the task. For LAMBADA and Story cloze + there is no supervised training set available so we draw conditioning examples from the development set and evaluate + on the test set. For Winograd (the original, not SuperGLUE version) there is only one dataset, so we draw conditioning + examples directly from it. + K can be any value from 0 to the maximum amount allowed by the model’s context window, which is <> + for all models and typically fits10to100examples. Larger values of K are usually but not always better, so when a + separate development and test set are available, we experiment with a few values ofKon the development set and then + run the best value on the test set. For some tasks (see AppendixG) we also use a natural language prompt in addition to + (or forK= 0, instead of) demonstrations. + On tasks that involve choosing one correct completion from several options (multiple choice), we provideKexamples + of context plus correct completion, followed by one example of context only, and compare the LM likelihood of + each completion. 
For most tasks we compare the per-token likelihood (to normalize for length), however on a small + number of datasets (ARC, OpenBookQA, and RACE) we gain additional benefit as measured on the development set + by normalizing by the unconditional probability of each completion, by computing <>, where <> answer context + is the string "Answer: "or" A: " and is used to prompt that the completion should be an answer + but is otherwise generic. + On tasks that involve binary classification, we give the options more semantically meaningful names (e.g. “True” or + “False” rather than 0 or 1) and then treat the task like multiple choice; we also sometimes frame the task similar to what + is done by [RSR + 19] (see AppendixG) for details. + On tasks with free-form completion, we use beam search with the same parameters as [RSR + 19]: a beam width of 4 + and a length penalty of= 0:6. We score the model using F1 similarity score, BLEU, or exact match, depending on + what is standard for the dataset at hand. + Final results are reported on the test set when publicly available, for each model size and learning setting (zero-, one-, + and few-shot). When the test set is private, our model is often too large to fit on the test server, so we report results on + the development set. We do submit to the test server on a small number of datasets (SuperGLUE, TriviaQA, PiQa) + where we were able to make submission work, and we submit only the 200B few-shot results, and report development + set results for everything else. + + + + 3 Results + + + In Figure3.1we display training curves for the 8 models described in Section2. For this graph we also include 6 + additional extra-small models with as few as 100,000 parameters. As observed in [KMH + 20], language modeling + performance follows a power-law when making efficient use of training compute. After extending this trend by two + more orders of magnitude, we observe only a slight (if any) departure from the power-law. One might worry that these + improvements in cross-entropy loss come only from modeling spurious details of our training corpus. However, we will + see in the following sections that improvements in cross-entropy loss lead to consistent performance gains across a + broad spectrum of natural language tasks. + Below, we evaluate the 8 models described in Section2(the 175 billion parameter parameter GPT-3 and 7 smaller + models) on a wide range of datasets. We group the datasets into 9 categories representing roughly similar tasks. + In Section3.1we evaluate on traditional language modeling tasks and tasks that are similar to language modeling, + such as Cloze tasks and sentence/paragraph completion tasks. In Section3.2we evaluate on “closed book” question + answering tasks: tasks which require using the information stored in the model’s parameters to answer general + knowledge questions. In Section3.3we evaluate the model’s ability to translate between languages (especially one-shot + and few-shot). In Section3.4we evaluate the model’s performance on Winograd Schema-like tasks. In Section3.5we + evaluate on datasets that involve commonsense reasoning or question answering. In Section3.6we evaluate on reading + comprehension tasks, in Section3.7we evaluate on the SuperGLUE benchmark suite, and in3.8we briefly explore + NLI. Finally, in Section3.9, we invent some additional tasks designed especially to probe in-context learning abilities – + these tasks focus on on-the-fly reasoning, adaptation skills, or open-ended text synthesis. 
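The multiple-choice tasks among these categories are scored with the likelihood comparison described in Section 2.4. As a minimal sketch, assuming a hypothetical helper logprob(prefix, completion) that returns the total log-probability the model assigns to a completion following a prefix (a single forward pass, no gradient updates):

    # Choose among candidate completions by language-model likelihood (Section 2.4).
    # `options` is a list of (completion_text, n_tokens) pairs. By default completions
    # are compared by per-token log-likelihood (length normalization); if answer_context
    # is given (e.g. "Answer: "), each completion is instead scored by
    # log P(completion | context) - log P(completion | answer_context).
    def pick_completion(logprob, context, options, answer_context=None):
        best_index, best_score = None, float("-inf")
        for index, (completion, n_tokens) in enumerate(options):
            conditional = logprob(context, completion)
            if answer_context is None:
                score = conditional / n_tokens
            else:
                score = conditional - logprob(answer_context, completion)
            if score > best_score:
                best_index, best_score = index, score
        return best_index

The K in-context demonstrations are simply prepended to the context string before scoring.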
We evaluate all tasks in the + few-shot, one-shot, and zero-shot settings. + + <
> + + Figure 3.1: Smooth scaling of performance with compute. Performance (measured in terms of cross-entropy + validation loss) follows a power-law trend with the amount of compute used for training. The power-law behavior + observed in [KMH + 20] continues for an additional two orders of magnitude with only small deviations from the + predicted curve. For this figure, we exclude embedding parameters from compute and parameter counts. + + <
> + + Table 3.1: Zero-shot results on PTB language modeling dataset.Many other common language modeling datasets + are omitted because they are derived from Wikipedia or other sources which are included in GPT-3’s training data. + a [RWC + 19] + + + + 3.1 Language Modeling, Cloze, and Completion Tasks + + In this section we test GPT-3’s performance on the traditional task of language modeling, as well as related tasks + that involve predicting a single word of interest, completing a sentence or paragraph, or choosing between possible + completions of a piece of text. + + 3.1.1 Language Modeling + We calculate zero-shot perplexity on the Penn Tree Bank (PTB) [MKM + 94] dataset measured in [RWC + 19]. We omit + the 4 Wikipedia-related tasks in that work because they are entirely contained in our training data, and we also omit the + one-billion word benchmark due to a high fraction of the dataset being contained in our training set. PTB escapes these + issues due to predating the modern internet. Our largest model sets a new SOTA on PTB by a substantial margin of 15 + points, achieving a perplexity of 20.50. Note that since PTB is a traditional language modeling dataset it does not have + a clear separation of examples to define one-shot or few-shot evaluation around, so we measure only zero-shot. + + 3.1.2 LAMBADA + The LAMBADA dataset [PKL + 16] tests the modeling of long-range dependencies in text – the model is asked to + predict the last word of sentences which require reading a paragraph of context. It has recently been suggested that the + continued scaling of language models is yielding diminishing returns on this difficult benchmark. [BHT + 20] reflect on + the small 1.5% improvement achieved by a doubling of model size between two recent state of the art results [SPP + 19] + + <
> + + Table 3.2: Performance on cloze and completion tasks.GPT-3 significantly improves SOTA on LAMBADA while + achieving respectable performance on two difficult completion prediction datasets. + + <
>

Figure 3.2: On LAMBADA, the few-shot capability of language models results in a strong boost to accuracy. GPT-3 2.7B outperforms the SOTA 17B parameter Turing-NLG [Tur20] in this setting, and GPT-3 175B advances the state of the art by 18%. Note zero-shot uses a different format from one-shot and few-shot as described in the text.


and [Tur20]) and argue that "continuing to expand hardware and data sizes by orders of magnitude is not the path forward". We find that path is still promising: in a zero-shot setting GPT-3 achieves 76% on LAMBADA, a gain of 8% over the previous state of the art.
LAMBADA is also a demonstration of the flexibility of few-shot learning, as it provides a way to address a problem that classically occurs with this dataset. Although the completion in LAMBADA is always the last word in a sentence, a standard language model has no way of knowing this detail. It thus assigns probability not only to the correct ending but also to other valid continuations of the paragraph. This problem has been partially addressed in the past with stop-word filters [RWC + 19] (which ban "continuation" words). The few-shot setting instead allows us to "frame" the task as a cloze test and allows the language model to infer from examples that a completion of exactly one word is desired. We use the following fill-in-the-blank format:
Alice was friends with Bob. Alice went to visit her friend ___. → Bob
George bought some baseball equipment, a ball, a glove, and a ___. →
When presented with examples formatted this way, GPT-3 achieves 86.4% accuracy in the few-shot setting, an increase of over 18% from the previous state-of-the-art. We observe that few-shot performance improves strongly with model size. While this setting decreases the performance of the smallest model by almost 20%, for GPT-3 it improves accuracy by 10%. Finally, the fill-in-the-blank method is not effective in the one-shot setting, where it always performs worse than the zero-shot setting. Perhaps this is because all models still require several examples to recognize the pattern.

<
> + + Table 3.3: Results on three Open-Domain QA tasks.GPT-3 is shown in the few-, one-, and zero-shot settings, as + compared to prior SOTA results for closed book and open domain settings. TriviaQA few-shot result is evaluated on the + wiki split test server. + + One note of caution is that an analysis of test set contamination identified that a significant minority of the LAMBADA + dataset appears to be present in our training data – however analysis performed in Section4suggests negligible impact + on performance. + + 3.1.3 HellaSwag + The HellaSwag dataset [ZHB + 19] involves picking the best ending to a story or set of instructions. The examples were + adversarially mined to be difficult for language models while remaining easy for humans (who achieve 95.6% accuracy). + GPT-3 achieves 78.1% accuracy in the one-shot setting and 79.3% accuracy in the few-shot setting, outperforming the + 75.4% accuracy of a fine-tuned 1.5B parameter language model [ZHR + 19] but still a fair amount lower than the overall + SOTA of 85.6% achieved by the fine-tuned multi-task model ALUM. + + 3.1.4 StoryCloze + We next evaluate GPT-3 on the StoryCloze 2016 dataset [MCH + 16], which involves selecting the correct ending + sentence for five-sentence long stories. Here GPT-3 achieves 83.2% in the zero-shot setting and 87.7% in the few-shot + setting (withK= 70). This is still 4.1% lower than the fine-tuned SOTA using a BERT based model [LDL19] but + improves over previous zero-shot results by roughly 10%. + + 3.2 Closed Book Question Answering + + In this section we measure GPT-3’s ability to answer questions about broad factual knowledge. Due to the immense + amount of possible queries, this task has normally been approached by using an information retrieval system to find + relevant text in combination with a model which learns to generate an answer given the question and the retrieved + text. Since this setting allows a system to search for and condition on text which potentially contains the answer it + is denoted “open-book”. [RRS20] recently demonstrated that a large language model can perform surprisingly well + directly answering the questions without conditioning on auxiliary information. They denote this more restrictive + evaluation setting as “closed-book”. Their work suggests that even higher-capacity models could perform even better + and we test this hypothesis with GPT-3. We evaluate GPT-3 on the 3 datasets in [RRS20]: Natural Questions [KPR + 19], + WebQuestions [BCFL13], and TriviaQA [JCWZ17], using the same splits. Note that in addition to all results being in + the closed-book setting, our use of few-shot, one-shot, and zero-shot evaluations represent an even stricter setting than + previous closed-book QA work: in addition to external content not being allowed, fine-tuning on the Q&A dataset itself + is also not permitted. + The results for GPT-3 are shown in Table3.3. On TriviaQA, we achieve 64.3% in the zero-shot setting, 68.0% in the + one-shot setting, and 71.2% in the few-shot setting. The zero-shot result already outperforms the fine-tuned T5-11B by + 14.2%, and also outperforms a version with Q&A tailored span prediction during pre-training by 3.8%. The one-shot + result improves by 3.7% and matches the SOTA for an open-domain QA system which not only fine-tunes but also + makes use of a learned retrieval mechanism over a 15.3B parameter dense vector index of 21M documents [LPP + 20]. + GPT-3’s few-shot result further improves performance another 3.2% beyond this. 
+ On WebQuestions (WebQs), GPT-3 achieves 14.4% in the zero-shot setting, 25.3% in the one-shot setting, and 41.5% + in the few-shot setting. This compares to 37.4% for fine-tuned T5-11B, and 44.7% for fine-tuned T5-11B+SSM, + which uses a Q&A-specific pre-training procedure. GPT-3 in the few-shot setting approaches the performance of + state-of-the-art fine-tuned models. Notably, compared to TriviaQA, WebQS shows a much larger gain from zero-shot to + few-shot (and indeed its zero-shot and one-shot performance are poor), perhaps suggesting that the WebQs questions + + <
> + + Figure 3.3:On TriviaQA GPT3’s performance grows smoothly with model size, suggesting that language models + continue to absorb knowledge as their capacity increases. One-shot and few-shot performance make significant gains + over zero-shot behavior, matching and exceeding the performance of the SOTA fine-tuned open-domain model, RAG + [LPP + 20] + + + and/or the style of their answers are out-of-distribution for GPT-3. Nevertheless, GPT-3 appears able to adapt to this + distribution, recovering strong performance in the few-shot setting. + On Natural Questions (NQs) GPT-3 achieves 14.6% in the zero-shot setting, 23.0% in the one-shot setting, and 29.9% in + the few-shot setting, compared to 36.6% for fine-tuned T5 11B+SSM. Similar to WebQS, the large gain from zero-shot + to few-shot may suggest a distribution shift, and may also explain the less competitive performance compared to + TriviaQA and WebQS. In particular, the questions in NQs tend towards very fine-grained knowledge on Wikipedia + specifically which could be testing the limits of GPT-3’s capacity and broad pretraining distribution. + Overall, on one of the three datasets GPT-3’s one-shot matches the open-domain fine-tuning SOTA. On the other two + datasets it approaches the performance of the closed-book SOTA despite not using fine-tuning. On all 3 datasets, we + find that performance scales very smoothly with model size (Figure3.3and AppendixHFigureH.7), possibly reflecting + the idea that model capacity translates directly to more ‘knowledge’ absorbed in the parameters of the model. + + 3.3 Translation + + For GPT-2 a filter was used on a multilingual collection of documents to produce an English only dataset due to capacity + concerns. Even with this filtering GPT-2 showed some evidence of multilingual capability and performed non-trivially + when translating between French and English despite only training on 10 megabytes of remaining French text. Since we + increase the capacity by over two orders of magnitude from GPT-2 to GPT-3, we also expand the scope of the training + dataset to include more representation of other languages, though this remains an area for further improvement. As + discussed in2.2the majority of our data is derived from raw Common Crawl with only quality-based filtering. Although + GPT-3’s training data is still primarily English (93% by word count), it also includes 7% of text in other languages. + These languages are documented in the supplemental material. In order to better understand translation capability, we + also expand our analysis to include two additional commonly studied languages, German and Romanian. + Existing unsupervised machine translation approaches often combine pretraining on a pair of monolingual datasets + with back-translation [SHB15] to bridge the two languages in a controlled way. By contrast, GPT-3 learns from a + blend of training data that mixes many languages together in a natural way, combining them on a word, sentence, + and document level. GPT-3 also uses a single training objective which is not customized or designed for any task in + particular. However, our one / few-shot settings aren’t strictly comparable to prior unsupervised work since they make + use of a small amount of paired examples (1 or 64). This corresponds to up to a page or two of in-context training data. + Results are shown in Table3.4. Zero-shot GPT-3, which only receives on a natural language description of the task, + still underperforms recent unsupervised NMT results. 
However, providing only a single example demonstration for + + <
>

Table 3.4: Few-shot GPT-3 outperforms previous unsupervised NMT work by 5 BLEU when translating into English, reflecting its strength as an English LM. We report BLEU scores on the WMT'14 Fr↔En, WMT'16 De↔En, and WMT'16 Ro↔En datasets as measured by multi-bleu.perl with XLM's tokenization in order to compare most closely with prior unsupervised NMT work. SacreBLEU f [Pos18] results are reported in Appendix H. Underline indicates an unsupervised or few-shot SOTA, bold indicates supervised SOTA with relative confidence. a [EOAG18] b [DHKH14] c [WXH + 18] d [oR16] e [LGG + 20] f [SacreBLEU signature: BLEU+case.mixed+numrefs.1+smooth.exp+tok.intl+version.1.2.20]

<
> + + Figure 3.4:Few-shot translation performance on 6 language pairs as model capacity increases. There is a consistent + trend of improvement across all datasets as the model scales, and as well as tendency for translation into English to be + stronger than translation from English. + + <
> + + Table 3.5:Results on the WSC273 version of Winograd schemas and the adversarial Winogrande dataset. See Section + 4for details on potential contamination of the Winograd test set. a [SBBC19]b [LYN + 20] + + <
> + + Figure 3.5:Zero-, one-, and few-shot performance on the adversarial Winogrande dataset as model capacity scales. + Scaling is relatively smooth with the gains to few-shot learning increasing with model size, and few-shot GPT-3 175B + is competitive with a fine-tuned RoBERTA-large. + + + each translation task improves performance by over 7 BLEU and nears competitive performance with prior work. + GPT-3 in the full few-shot setting further improves another 4 BLEU resulting in similar average performance to prior + unsupervised NMT work. GPT-3 has a noticeable skew in its performance depending on language direction. For the + three input languages studied, GPT-3 significantly outperforms prior unsupervised NMT work when translating into + English but under-performs when translating in the other direction. Performance on En-Ro is a noticeable outlier at + over 10 BLEU worse than prior unsupervised NMT work. This could be a weakness due to reusing the byte-level BPE + tokenizer of GPT-2 which was developed for an almost entirely English training dataset. For both Fr-En and De-En, + few shot GPT-3 outperforms the best supervised result we could find but due to our unfamiliarity with the literature and + the appearance that these are un-competitive benchmarks we do not suspect those results represent true state of the art. + For Ro-En, few shot GPT-3 performs within 0.5 BLEU of the overall SOTA which is achieved by a combination of + unsupervised pretraining, supervised finetuning on 608K labeled examples, and backtranslation [LHCG19b]. + Finally, across all language pairs and across all three settings (zero-, one-, and few-shot), there is a smooth trend of + improvement with model capacity. This is shown in Figure3.4in the case of few-shot results, and scaling for all three + settings is shown in AppendixH. + + 3.4 Winograd-Style Tasks + + The Winograd Schemas Challenge [LDM12] is a classical task in NLP that involves determining which word a pronoun + refers to, when the pronoun is grammatically ambiguous but semantically unambiguous to a human. Recently fine-tuned + language models have achieved near-human performance on the original Winograd dataset, but more difficult versions + + <
> + + Table 3.6:GPT-3 results on three commonsense reasoning tasks, PIQA, ARC, and OpenBookQA. GPT-3 Few-Shot + PIQA result is evaluated on the test server. See Section4for details on potential contamination issues on the PIQA test + set. + <
> + + Figure 3.6:GPT-3 results on PIQA in the zero-shot, one-shot, and few-shot settings. The largest model achieves a + score on the development set in all three conditions that exceeds the best recorded score on the task. + + + such as the adversarially-mined Winogrande dataset [SBBC19] still significantly lag human performance. We test + GPT-3’s performance on both Winograd and Winogrande, as usual in the zero-, one-, and few-shot setting. + On Winograd we test GPT-3 on the original set of 273 Winograd schemas, using the same “partial evaluation” method + described in [RWC + 19]. Note that this setting differs slightly from the WSC task in the SuperGLUE benchmark, which + is presented as binary classification and requires entity extraction to convert to the form described in this section. On + Winograd GPT-3 achieves 88.3%, 89.7%, and 88.6% in the zero-shot, one-shot, and few-shot settings, showing no clear + in-context learning but in all cases achieving strong results just a few points below state-of-the-art and estimated human + performance. We note that contamination analysis found some Winograd schemas in the training data but this appears + to have only a small effect on results (see Section4). + On the more difficult Winogrande dataset, we do find gains to in-context learning: GPT-3 achieves 70.2% in the + zero-shot setting, 73.2% in the one-shot setting, and 77.7% in the few-shot setting. For comparison a fine-tuned + RoBERTA model achieves 79%, state-of-the-art is 84.6% achieved with a fine-tuned high capacity model (T5), and + human performance on the task as reported by [SBBC19] is 94.0%. + + 3.5 Common Sense Reasoning + + Next we consider three datasets which attempt to capture physical or scientific reasoning, as distinct from sentence + completion, reading comprehension, or broad knowledge question answering. The first, PhysicalQA (PIQA) [BZB + 19], + asks common sense questions about how the physical world works and is intended as a probe of grounded understanding + of the world. GPT-3 achieves 81.0% accuracy zero-shot, 80.5% accuracy one-shot, and 82.8% accuracy few-shot + (the last measured on PIQA’s test server). This compares favorably to the 79.4% accuracy prior state-of-the-art of a + + <
> + + Table 3.7:Results on reading comprehension tasks. All scores are F1 except results for RACE which report accuracy. + a [JZC + 19]b [JN20]c [AI19]d [QIA20]e [SPP + 19] + + fine-tuned RoBERTa. PIQA shows relatively shallow scaling with model size and is still over 10% worse than human + performance, but GPT-3’s few-shot and even zero-shot result outperform the current state-of-the-art. Our analysis + flagged PIQA for a potential data contamination issue (despite hidden test labels), and we therefore conservatively mark + the result with an asterisk. See Section4for details. + ARC [CCE + 18] is a dataset of multiple-choice questions collected from 3rd to 9th grade science exams. On the + “Challenge” version of the dataset which has been filtered to questions which simple statistical or information retrieval + methods are unable to correctly answer, GPT-3 achieves 51.4% accuracy in the zero-shot setting, 53.2% in the one-shot + setting, and 51.5% in the few-shot setting. This is approaching the performance of a fine-tuned RoBERTa baseline + (55.9%) from UnifiedQA [KKS + 20]. On the “Easy” version of the dataset (questions which either of the mentioned + baseline approaches answered correctly), GPT-3 achieves 68.8%, 71.2%, and 70.1% which slightly exceeds a fine-tuned + RoBERTa baseline from [KKS + 20]. However, both of these results are still much worse than the overall SOTAs + achieved by the UnifiedQA which exceeds GPT-3’s few-shot results by 27% on the challenge set and 22% on the easy + set. + On OpenBookQA [MCKS18], GPT-3 improves significantly from zero to few shot settings but is still over 20 points + short of the overall SOTA. GPT-3’s few-shot performance is similar to a fine-tuned BERT Large baseline on the + leaderboard. + Overall, in-context learning with GPT-3 shows mixed results on commonsense reasoning tasks, with only small and + inconsistent gains observed in the one and few-shot learning settings for both PIQA and ARC, but a significant + improvement is observed on OpenBookQA. GPT-3 sets SOTA on the new PIQA dataset in all evaluation settings. + + 3.6 Reading Comprehension + + Next we evaluate GPT-3 on the task of reading comprehension. We use a suite of 5 datasets including abstractive, + multiple choice, and span based answer formats in both dialog and single question settings. We observe a wide spread + in GPT-3’s performance across these datasets suggestive of varying capability with different answer formats. In general + we observe GPT-3 is on par with initial baselines and early results trained using contextual representations on each + respective dataset. + GPT-3 performs best (within 3 points of the human baseline) on CoQA [RCM19] a free-form conversational dataset + and performs worst (13 F1 below an ELMo baseline) on QuAC [CHI + 18] a dataset which requires modeling structured + dialog acts and answer span selections of teacher-student interactions. On DROP [DWD + 19], a dataset testing discrete + reasoning and numeracy in the context of reading comprehension, GPT-3 in a few-shot setting outperforms the fine-tuned + BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches + which augment neural networks with symbolic systems [RLL + 19]. On SQuAD 2.0 [RJL18], GPT-3 demonstrates its + few-shot learning capabilities, improving by almost 10 F1 (to 69.8) compared to a zero-shot setting. This allows it to + slightly outperform the best fine-tuned result in the original paper. 
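Except for RACE (reported as accuracy), the reading-comprehension numbers above are F1 scores. As a reminder of what that metric computes, a minimal token-overlap F1 in the SQuAD style is sketched below; whitespace tokenization and the lack of answer normalization (lowercasing, stripping punctuation and articles) are simplifications relative to the official evaluation scripts.

    from collections import Counter

    # Token-overlap F1 between a predicted answer span and a reference answer,
    # as used for span-style reading comprehension scoring (simplified).
    def token_f1(prediction, reference):
        pred_tokens = prediction.split()
        ref_tokens = reference.split()
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)

    print(token_f1("in the park", "the park"))  # 0.8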
On RACE [LXL + 17], a multiple choice dataset of + middle school and high school english examinations, GPT-3 performs relatively weakly and is only competitive with + the earliest work utilizing contextual representations and is still 45% behind SOTA. + + 3.7 SuperGLUE + + In order to better aggregate results on NLP tasks and compare to popular models such as BERT and RoBERTa in a + more systematic way, we also evaluate GPT-3 on a standardized collection of datasets, the SuperGLUE benchmark + [WPN + 19]. GPT-3’s test-set performance on the SuperGLUE dataset [WPN + 19] is shown in Table3.8. In the few-shot + setting, we used 32 examples for all tasks, sampled randomly from the training set. For all tasks except WSC and + + <
> + + Figure 3.7:GPT-3 results on CoQA reading comprehension task. GPT-3 175B achieves 85 F1 in the few-shot setting, + only a few points behind measured human performance and state-of-the-art fine-tuned models. Zero-shot and one-shot + performance is a few points behind, with the gains to few-shot being largest for bigger models. + + <
> + + Table 3.8:Performance of GPT-3 on SuperGLUE compared to fine-tuned baselines and SOTA. All results are reported + on the test set. GPT-3 few-shot is given a total of 32 examples within the context of each task and performs no gradient + updates. + + <
> + + Figure 3.8: Performance on SuperGLUE increases with model size and number of examples in context.A value + of K= 32 means that our model was shown 32 examples per task, for 256 examples total divided across the 8 tasks in + SuperGLUE. We report GPT-3 values on the dev set, so our numbers are not directly comparable to the dotted reference + lines (our test set results are in Table3.8). The BERT-Large reference model was fine-tuned on the SuperGLUE training + set (125K examples), whereas BERT++ was first fine-tuned on MultiNLI (392K examples) and SWAG (113K examples) + before further fine-tuning on the SuperGLUE training set (for a total of 630K fine-tuning examples). We find the + difference in performance between the BERT-Large and BERT++ to be roughly equivalent to the difference between + GPT-3 with one example per context versus eight examples per context. + + MultiRC, we sampled a new set of examples to use in the context for each problem. For WSC and MultiRC, we used + the same set of randomly drawn examples from the training set as context for all of the problems we evaluated. + We observe a wide range in GPT-3’s performance across tasks. On COPA and ReCoRD GPT-3 achieves near-SOTA + performance in the one-shot and few-shot settings, with COPA falling only a couple points short and achieving + second place on the leaderboard, where first place is held by a fine-tuned 11 billion parameter model (T5). On WSC, + performance is still relatively strong, achieving 80.1% in the few-shot setting (note that GPT-3 achieves 88.6% on the + original Winograd dataset as described in Section3.4). On BoolQ, MultiRC, and RTE, performance is reasonable, + roughly matching that of a fine-tuned BERT-Large. On CB, we see signs of life at 75.6% in the few-shot setting. + WiC is a notable weak spot with few-shot performance at 49.4% (at random chance). We tried a number of different + phrasings and formulations for WiC (which involves determining if a word is being used with the same meaning in two + sentences), none of which was able to achieve strong performance. This hints at a phenomenon that will become clearer + in the next section (which discusses the ANLI benchmark) – GPT-3 appears to be weak in the few-shot or one-shot + setting at some tasks that involve comparing two sentences or snippets, for example whether a word is used the same + way in two sentences (WiC), whether one sentence is a paraphrase of another, or whether one sentence implies another. + This could also explain the comparatively low scores for RTE and CB, which also follow this format. Despite these + weaknesses, GPT-3 still outperforms a fine-tuned BERT-large on four of eight tasks and on two tasks GPT-3 is close to + the state-of-the-art held by a fine-tuned 11 billion parameter model. + Finally, we note that the few-shot SuperGLUE score steadily improves with both model size and with number of + examples in the context showing increasing benefits from in-context learning (Figure3.8). We scale K up to 32 + examples per task, after which point additional examples will not reliably fit into our context. When sweeping over + values ofK, we find that GPT-3 requires less than eight total examples per task to outperform a fine-tuned BERT-Large + on overall SuperGLUE score. + + 3.8 NLI + + Natural Language Inference (NLI) [Fyo00] concerns the ability to understand the relationship between two sentences. + In practice, this task is usually structured as a two or three class classification problem where the model classifies + + <
> + + Figure 3.9: Performance of GPT-3 on ANLI Round 3.Results are on the dev-set, which has only 1500 examples + and therefore has high variance (we estimate a standard deviation of 1.2%). We find that smaller models hover around + random chance, while few-shot GPT-3 175B closes almost half the gap from random chance to SOTA. Results for + ANLI rounds 1 and 2 are shown in the appendix. + + + whether the second sentence logically follows from the first, contradicts the first sentence, or is possibly true (neutral). + SuperGLUE includes an NLI dataset, RTE, which evaluates the binary version of the task. On RTE, only the largest + version of GPT-3 performs convincingly better than random (56%) in any evaluation setting, but in a few-shot setting + GPT-3 performs similarly to a single-task fine-tuned BERT Large. We also evaluate on the recently introduced + Adversarial Natural Language Inference (ANLI) dataset [NWD + 19]. ANLI is a difficult dataset employing a series of + adversarially mined natural language inference questions in three rounds (R1, R2, and R3). Similar to RTE, all of our + models smaller than GPT-3 perform at almost exactly random chance on ANLI, even in the few-shot setting (33%), + whereas GPT-3 itself shows signs of life on Round 3. Results for ANLI R3 are highlighted in Figure3.9and full results + for all rounds can be found in AppendixH. These results on both RTE and ANLI suggest that NLI is still a very difficult + task for language models and they are only just beginning to show signs of progress. + + 3.9 Synthetic and Qualitative Tasks + + One way to probe GPT-3’s range of abilities in the few-shot (or zero- and one-shot) setting is to give it tasks which + require it to perform simple on-the-fly computational reasoning, recognize a novel pattern that is unlikely to have + occurred in training, or adapt quickly to an unusual task. We devise several tasks to test this class of abilities. First, we + test GPT-3’s ability to perform arithmetic. Second, we create several tasks that involve rearranging or unscrambling the + letters in a word, tasks which are unlikely to have been exactly seen during training. Third, we test GPT-3’s ability to + solve SAT-style analogy problems few-shot. Finally, we test GPT-3 on several qualitative tasks, including using new + words in a sentence, correcting English grammar, and news article generation. We will release the synthetic datasets + with the hope of stimulating further study of test-time behavior of language models. + + 3.9.1 Arithmetic + To test GPT-3’s ability to perform simple arithmetic operations without task-specific training, we developed a small + battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language: + + •2 digit addition (2D+)– The model is asked to add two integers sampled uniformly from[0;100), phrased in + the form of a question, e.g. “Q: What is 48 plus 76? A: 124.” + •2 digit subtraction (2D-)– The model is asked to subtract two integers sampled uniformly from[0;100); the + answer may be negative. Example: “Q: What is 34 minus 53? A: -19”. + •3 digit addition (3D+)– Same as 2 digit addition, except numbers are uniformly sampled from[0;1000). + + <
> + + Figure 3.10:Results on all 10 arithmetic tasks in the few-shot settings for models of different sizes. There is a + significant jump from the second largest model (GPT-3 13B) to the largest model (GPT-3 175), with the latter being + able to reliably accurate 2 digit arithmetic, usually accurate 3 digit arithmetic, and correct answers a significant fraction + of the time on 4-5 digit arithmetic, 2 digit multiplication, and compound operations. Results for one-shot and zero-shot + are shown in the appendix. + + + •3 digit subtraction (3D-)– Same as 2 digit subtraction, except numbers are uniformly sampled from[0;1000). + •4 digit addition (4D+)– Same as 3 digit addition, except uniformly sampled from[0;10000). + •4 digit subtraction (4D-)– Same as 3 digit subtraction, except uniformly sampled from[0;10000). + •5 digit addition (5D+)– Same as 3 digit addition, except uniformly sampled from[0;100000). + •5 digit subtraction (5D-)– Same as 3 digit subtraction, except uniformly sampled from[0;100000). + •2 digit multiplication (2Dx)– The model is asked to multiply two integers sampled uniformly from[0;100), + e.g. “Q: What is 24 times 42? A: 1008”. + •One-digit composite (1DC)– The model is asked to perform a composite operation on three 1 digit numbers, + with parentheses around the last two. For example, “Q: What is 6+(4*8)? A: 38”. The three 1 digit numbers + are selected uniformly on[0;10)and the operations are selected uniformly from f+,-,*g. + + In all 10 tasks the model must generate the correct answer exactly. For each task we generate a dataset of 2,000 random + instances of the task and evaluate all models on those instances. + First we evaluate GPT-3 in the few-shot setting, for which results are shown in Figure3.10. On addition and subtraction, + GPT-3 displays strong proficiency when the number of digits is small, achieving 100% accuracy on 2 digit addition, + 98.9% at 2 digit subtraction, 80.2% at 3 digit addition, and 94.2% at 3-digit subtraction. Performance decreases as the + number of digits increases, but GPT-3 still achieves 25-26% accuracy on four digit operations and 9-10% accuracy on + five digit operations, suggesting at least some capacity to generalize to larger numbers of digits. GPT-3 also achieves + 29.2% accuracy at 2 digit multiplication, an especially computationally intensive operation. Finally, GPT-3 achieves + 21.3% accuracy at single digit combined operations (for example, 9*(7+5)), suggesting that it has some robustness + beyond just single operations. + As Figure3.10makes clear, small models do poorly on all of these tasks – even the 13 billion parameter model (the + second largest after the 175 billion full GPT-3) can solve 2 digit addition and subtraction only half the time, and all + other operations less than 10% of the time. + One-shot and zero-shot performance are somewhat degraded relative to few-shot performance, suggesting that adaptation + to the task (or at the very least recognition of the task) is important to performing these computations correctly. + Nevertheless, one-shot performance is still quite strong, and even zero-shot performance of the full GPT-3 significantly + + <
>

Table 3.9: Results on basic arithmetic tasks for GPT-3 175B. {2,3,4,5}D{+,-} is 2, 3, 4, and 5 digit addition or
subtraction, 2Dx is 2 digit multiplication. 1DC is 1 digit composite operations. Results become progressively stronger
moving from the zero-shot to one-shot to few-shot setting, but even the zero-shot shows significant arithmetic abilities.


<
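
As a rough illustration of how the arithmetic battery of Section 3.9.1 above could be constructed: the prompt phrasing follows the examples quoted in the text, while the function names, seeding, and everything else below are assumptions rather than the authors' released generation code.

    # Illustrative sketch of generating the ten arithmetic tasks (2D+/2D-/.../2Dx/1DC).
    import random


    def make_addsub(digits: int, op: str, rng: random.Random) -> tuple[str, str]:
        hi = 10 ** digits
        a, b = rng.randrange(hi), rng.randrange(hi)
        word = "plus" if op == "+" else "minus"
        answer = a + b if op == "+" else a - b   # subtraction may be negative
        return f"Q: What is {a} {word} {b}? A:", str(answer)


    def make_2dx(rng: random.Random) -> tuple[str, str]:
        a, b = rng.randrange(100), rng.randrange(100)
        return f"Q: What is {a} times {b}? A:", str(a * b)


    def make_1dc(rng: random.Random) -> tuple[str, str]:
        # Composite operation on three one-digit numbers, parentheses around the last two.
        ops = {"+": lambda x, y: x + y, "-": lambda x, y: x - y, "*": lambda x, y: x * y}
        a, b, c = (rng.randrange(10) for _ in range(3))
        o1, o2 = rng.choice(list(ops)), rng.choice(list(ops))
        answer = ops[o1](a, ops[o2](b, c))
        return f"Q: What is {a}{o1}({b}{o2}{c})? A:", str(answer)


    if __name__ == "__main__":
        rng = random.Random(0)
        # 2,000 instances per task, as described in the text.
        two_digit_addition = [make_addsub(2, "+", rng) for _ in range(2000)]
        print(two_digit_addition[0])
        print(make_1dc(rng))
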
> + + Table 3.10:GPT-3 175B performance on various word unscrambling and word manipulation tasks, in zero-, one-, and + few-shot settings. CL is “cycle letters in word”, A1 is anagrams of but the first and last letters, A2 is anagrams of all but + the first and last two letters, RI is “Random insertion in word”, RW is “reversed words”. + + + + outperforms few-shot learning for all smaller models. All three settings for the full GPT-3 are shown in Table3.9, and + model capacity scaling for all three settings is shown in AppendixH. + To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic + problems in our test set and searched for them in our training data in both the forms" + ="and + " plus ". Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000 + subtraction problems we found only 2 matches (0.1%), suggesting that only a trivial fraction of the correct answers + could have been memorized. In addition, inspection of incorrect answers reveals that the model often makes mistakes + such as not carrying a “1”, suggesting it is actually attempting to perform the relevant computation rather than + memorizing a table. + Overall, GPT-3 displays reasonable proficiency at moderately complex arithmetic in few-shot, one-shot, and even + zero-shot settings. + + 3.9.2 Word Scrambling and Manipulation Tasks + To test GPT-3’s ability to learn novel symbolic manipulations from a few examples, we designed a small battery of + 5 “character manipulation” tasks. Each task involves giving the model a word distorted by some combination of + scrambling, addition, or deletion of characters, and asking it to recover the original word. The 5 tasks are: + + •Cycle letters in word (CL)– The model is given a word with its letters cycled, then the “=” symbol, and + is expected to generate the original word. For example, it might be given “lyinevitab” and should output + “inevitably”. + •Anagrams of all but first and last characters (A1)– The model is given a word where every letter except + the first and last have been scrambled randomly, and must output the original word. Example: criroptuon = + corruption. + •Anagrams of all but first and last 2 characters (A2)– The model is given a word where every letter except + the first 2 and last 2 have been scrambled randomly, and must recover the original word. Example: opoepnnt + !opponent. + •Random insertion in word (RI)– A random punctuation or space character is inserted between each letter + of a word, and the model must output the original word. Example: s.u!c/c!e.s s i/o/n = succession. + •Reversed words (RW)– The model is given a word spelled backwards, and must output the original word. + Example: stcejbo!objects. + + For each task we generate 10,000 examples, which we chose to be the top 10,000 most frequent words as measured by + [Nor09] of length more than 4 characters and less than 15 characters. The few-shot results are shown in Figure3.11. + Task performance tends to grow smoothly with model size, with the full GPT-3 model achieving 66.9% on removing + + <
> + + Figure 3.11:Few-shot performance on the five word scrambling tasks for different sizes of model. There is generally + smooth improvement with model size although the random insertion task shows an upward slope of improvement with + the 175B model solving the task the majority of the time. Scaling of one-shot and zero-shot performance is shown in + the appendix. All tasks are done with K=100. + + + + random insertions, 38.6% on cycling letters, 40.2% on the easier anagram task, and 15.1% on the more difficult anagram + task (where only the first and last letters are held fixed). None of the models can reverse the letters in a word. + In the one-shot setting, performance is significantly weaker (dropping by half or more), and in the zero-shot setting the + model can rarely perform any of the tasks (Table3.10). This suggests that the model really does appear to learn these + tasks at test time, as the model cannot perform them zero-shot and their artificial nature makes them unlikely to appear + in the pre-training data (although we cannot confirm this with certainty). + We can further quantify performance by plotting “in-context learning curves”, which show task performance as a + function of the number of in-context examples. We show in-context learning curves for the Symbol Insertion task + in Figure1.2. We can see that larger models are able to make increasingly effective use of in-context information, + including both task examples and natural language task descriptions. + Finally, it is worth adding that solving these tasks requires character-level manipulations, whereas our BPE encoding + operates on significant fractions of a word (on average 0.7 words per token), so from the LM’s perspective succeeding + at these tasks involves not just manipulating BPE tokens but understanding and pulling apart their substructure. Also, + CL, A1, and A2 are not bijective (that is, the unscrambled word is not a deterministic function of the scrambled word), + requiring the model to perform some search to find the correct unscrambling. Thus, the skills involved appear to require + non-trivial pattern-matching and computation. + + + 3.9.3 SAT Analogies + + To test GPT-3 on another task that is somewhat unusual relative to the typical distribution of text, we collected a set of + 374 “SAT analogy” problems [TLBS03]. Analogies are a style of multiple choice question that constituted a section of + the SAT college entrance exam before 2005. A typical example is “audacious is to boldness as (a) sanctimonious is to + hypocrisy, (b) anonymous is to identity, (c) remorseful is to misdeed, (d) deleterious is to result, (e) impressionable is to + temptation”. The student is expected to choose which of the five word pairs has the same relationship as the original + word pair; in this example the answer is “sanctimonious is to hypocrisy”. On this task GPT-3 achieves 65.2% in the + few-shot setting, 59.1% in the one-shot setting, and 53.7% in the zero-shot setting, whereas the average score among + college applicants was 57% [TL05] (random guessing yields 20%). As shown in Figure3.12, the results improve with + scale, with the the full 175 billion model improving by over 10% compared to the 13 billion parameter model. + + <
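
Returning to the character-manipulation tasks of Section 3.9.2, the five transforms (CL, A1, A2, RI, RW) are simple enough to sketch directly. This is an illustrative reconstruction from the task descriptions above, not the released generation code; the paper applies such transforms to the 10,000 most frequent words of length 5-14.

    # Rough sketch of the five word-manipulation transforms.
    import random
    import string


    def cycle_letters(word: str, rng: random.Random) -> str:            # CL
        k = rng.randrange(1, len(word))
        return word[k:] + word[:k]


    def anagram_inner(word: str, keep: int, rng: random.Random) -> str:  # A1 (keep=1), A2 (keep=2)
        inner = list(word[keep:-keep])
        rng.shuffle(inner)
        return word[:keep] + "".join(inner) + word[-keep:]


    def random_insertion(word: str, rng: random.Random) -> str:          # RI
        fillers = string.punctuation + " "
        return "".join(ch + rng.choice(fillers) for ch in word[:-1]) + word[-1]


    def reversed_word(word: str) -> str:                                 # RW
        return word[::-1]


    if __name__ == "__main__":
        rng = random.Random(0)
        for w in ("inevitably", "corruption", "opponent", "succession", "objects"):
            print(cycle_letters(w, rng), anagram_inner(w, 1, rng),
                  anagram_inner(w, 2, rng), random_insertion(w, rng), reversed_word(w))
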
> + + Figure 3.12:Zero-, one-,and few-shot performance on SAT analogy tasks, for different sizes of model. The largest + model achieves 65% accuracy in the few-shot setting, and also demonstrates significant gains to in-context learning + which are not present in smaller models. + + + 3.9.4 News Article Generation + Previous work on generative language models qualitatively tested their ability to generate synthetic “news articles” by + conditional sampling from the model given a human-written prompt consisting of a plausible first sentence for a news + story [RWC + 19]. Relative to [RWC + 19], the dataset used to train GPT-3 is much less weighted towards news articles, + so trying to generate news articles via raw unconditional samples is less effective – for example GPT-3 often interprets + the proposed first sentence of a “news article” as a tweet and then posts synthetic responses or follow-up tweets. To + solve this problem we employed GPT-3’s few-shot learning abilities by providing three previous news articles in the + model’s context to condition it. With the title and subtitle of a proposed next article, the model is able to reliably + generate short articles in the “news” genre. + To gauge the quality of news article generation from GPT-3 (which we believe is likely to be correlated with conditional + sample generation quality in general), we decided to measure human ability to distinguish GPT-3-generated articles + from real ones. Similar work has been carried out by Kreps et al. [KMB20] and Zellers et al. [ZHR + 19]. Generative + language models are trained to match the distribution of content generated by humans, so the (in)ability of humans to + distinguish the two is a potentially important measure of quality. + + In order to see how well humans can detect model generated text, we arbitrarily selected 25 article titles and subtitles + from the websitenewser.com(mean length: 215 words). We then generated completions of these titles and subtitles + from four language models ranging in size from 125M to 175B (GPT-3) parameters (mean length: 200 words). For each + model, we presented around 80 US-based participants with a quiz consisting of these real titles and subtitles followed + by either the human written article or the article generated by the model 4 . Participants were asked to select whether the + article was “very likely written by a human”, “more likely written by a human”, “I don’t know”, “more likely written by + a machine”, or “very likely written by a machine”. + The articles we selected were not in the models’ training data and the model outputs were formatted and selected + programmatically to prevent human cherry-picking. All models used the same context to condition outputs on and were + pre-trained with the same context size and the same article titles and subtitles were used as prompts for each model. + However, we also ran an experiment to control for participant effort and attention that followed the same format but + involved intentionally bad model generated articles. This was done by generating articles from a “control model”: a + 160M parameter model with no context and increased output randomness. + + 3 This task is also relevant to the potential misuse of language models discussed in Section6.1. + 4 We wanted to identify how good an average person on the internet is at detecting language model outputs, so we focused on + participants drawn from the general US population. See AppendixEfor details. + + <
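
The analysis reported in Table 3.11 below measures per-participant accuracy as the ratio of correct assignments to non-neutral assignments and compares each model's participant group to the control group with a two-sample t-test. A minimal sketch of that computation follows; the coding of the five response options is an assumption made for illustration, and the numbers in the example are toy values, not the paper's data.

    # Per-participant accuracy (correct / non-neutral) and a two-sample t-test
    # between a model's participant group and the control group.
    from scipy import stats

    MACHINE_LEANING = {"very likely machine", "more likely machine"}
    NEUTRAL = {"i don't know"}


    def participant_accuracy(responses: list[tuple[str, bool]]) -> float:
        """responses: (rating, article_is_model_generated) pairs for one participant."""
        correct = non_neutral = 0
        for rating, is_generated in responses:
            if rating in NEUTRAL:
                continue
            non_neutral += 1
            guessed_machine = rating in MACHINE_LEANING
            correct += int(guessed_machine == is_generated)
        return correct / non_neutral if non_neutral else float("nan")


    def compare_to_control(model_accs: list[float], control_accs: list[float]):
        # Two-sample Student's t-test on mean participant accuracy.
        return stats.ttest_ind(model_accs, control_accs)


    if __name__ == "__main__":
        ratings = [("very likely machine", True), ("i don't know", False),
                   ("more likely human", False)]
        print(participant_accuracy(ratings))   # 1.0: both non-neutral ratings were correct
        model_accs = [0.50, 0.55, 0.48, 0.60]  # toy numbers
        control_accs = [0.90, 0.85, 0.88, 0.80]
        print(compare_to_control(model_accs, control_accs))
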
> + + Table 3.11: Human accuracy in identifying whether short (200 word) news articles are model generated. We + find that human accuracy (measured by the ratio of correct assignments to non-neutral assignments) ranges from 86% + on the control model to 52% on GPT-3 175B. This table compares mean accuracy between five different models, and + shows the results of a two-sample T-Test for the difference in mean accuracy between each model and the control model + (an unconditional GPT-3 Small model with increased output randomness). + + + + Mean human accuracy (the ratio of correct assignments to non-neutral assignments per participant) at detecting that + the intentionally bad articles were model generated was 86% where 50% is chance level performance. By contrast, + mean human accuracy at detecting articles that were produced by the 175B parameter model was barely above chance + at 52% (see Table3.11). 5 Human abilities to detect model generated text appear to decrease as model size increases: + there appears to be a trend towards chance accuracy with model size, and human detection of GPT-3 is close to chance. 6 + This is true despite the fact that participants spend more time on each output as model size increases (see AppendixE). + Examples of synthetic articles from GPT-3 are given in Figures3.14and3.15.7 Much of the text is—as indicated by the + evaluations—difficult for humans to distinguish from authentic human content. Factual inaccuracies can be an indicator + that an article is model generated since, unlike human authors, the models have no access to the specific facts that the + article titles refer to or when the article was written. Other indicators include repetition, non sequiturs, and unusual + phrasings, though these are often subtle enough that they are not noticed. + Related work on language model detection by Ippolito et al. [IDCBE19] indicates that automatic discriminators like + GROVER [ZHR + 19] and GLTR [GSR19] may have greater success at detecting model generated text than human + evaluators. Automatic detection of these models may be a promising area of future research. + Ippolito et al. [IDCBE19] also note that human accuracy at detecting model generated text increases as humans observe + more tokens. To do a preliminary investigation of how good humans are at detecting longer news articles generated + by GPT-3 175B, we selected 12 world news articles from Reuters with an average length of 569 words and generated + completions of these articles from GPT-3 with an average length of 498 words (298 words longer than our initial + experiments). Following the methodology above, we ran two experiments, each on around 80 US-based participants, to + compare human abilities to detect the articles generated by GPT-3 and a control model. + We found that mean human accuracy at detecting the intentionally bad longer articles from the control model was + 88%, while mean human accuracy at detecting the longer articles that were produced by GPT-3 175B was still barely + above chance at 52%(see Table3.12). This indicates that, for news articles that are around 500 words long, GPT-3 + continues to produce articles that humans find difficult to distinguish from human written news articles. + + 3.9.5 Learning and Using Novel Words + A task studied in developmental linguistics [CB78] is the ability to learn and utilize new words, for example using a + word in a sentence after seeing it defined only once, or conversely inferring a word’s meaning from only one usage. 
Here + we qualitatively test GPT-3’s ability to do the former. Specifically, we give GPT-3 the definition of a nonexistent word, + such as “Gigamuru”, and then ask it to use it in a sentence. We provide one to five previous examples of a (separate) + + 5 We use a two-sample Student’s T-Test to test for significant difference between the means of the participant accuracies of each + model and the control model and report the normalized difference in the means (as the t-statistic) and the p-value. + 6 If a model consistently produces texts that are more impressive than human articles, it is possible that human performance on + this task would drop below 50%. Indeed, many individual participants scored below 50% on this task. + 7 Additional non-news samples can be found in AppendixF. + + <
>

Figure 3.13: People's ability to identify whether news articles are model-generated (measured by the ratio of correct
assignments to non-neutral assignments) decreases as model size increases. Accuracy on the outputs of the deliberately
bad control model (an unconditioned GPT-3 Small model with higher output randomness) is indicated with the dashed
line at the top, and random chance (50%) is indicated with the dashed line at the bottom. Line of best fit is a power
law with 95% confidence intervals.

<
> + + Table 3.12:People’s ability to identify whether 500 word articles are model generated (as measured by the ratio of + correct assignments to non-neutral assignments) was 88% on the control model and 52% on GPT-3 175B. This table + shows the results of a two-sample T-Test for the difference in mean accuracy between GPT-3 175B and the control + model (an unconditional GPT-3 Small model with increased output randomness). + + <
> + + Figure 3.14:The GPT-3 generated news article that humans had the greatest difficulty distinguishing from a human + written article (accuracy: 12%). + + <
> + + Figure 3.15:The GPT-3 generated news article that humans found the easiest to distinguish from a human written + article (accuracy: 61%). + + <
> + + Figure 3.16:Representative GPT-3 completions for the few-shot task of using a new word in a sentence. Boldface is + GPT-3’s completions, plain text is human prompts. In the first example both the prompt and the completion are provided + by a human; this then serves as conditioning for subsequent examples where GPT-3 receives successive additional + prompts and provides the completions. Nothing task-specific is provided to GPT-3 other than the conditioning shown + here. + + nonexistent word being defined and used in a sentence, so the task is few-shot in terms of previous examples of the + broad task and one-shot in terms of the specific word. Table3.16shows the 6 examples we generated; all definitions + were human-generated, and the first answer was human-generated as conditioning while the subsequent answers were + generated by GPT-3. These examples were generated continuously in one sitting and we did not omit or repeatedly try + any prompts. In all cases the generated sentence appears to be a correct or at least plausible use of the word. In the final + sentence the model generates a plausible conjugation for the word “screeg” (namely “screeghed”), although the use of + the word is slightly awkward (“screeghed at each other”) despite being plausible in the sense that it could describe a toy + sword fight. Overall, GPT-3 appears to be at least proficient at the task of using novel words in a sentence. + + 3.9.6 Correcting English Grammar + Another task well suited for few-shot learning is correcting English grammar. We test this with GPT-3 in the few- + shot setting by giving prompts of the form"Poor English Input: nn Good English Output: + ". We give GPT-3 one human-generated correction and then ask it to correct 5 more (again without any + omissions or repeats). Results are shown in Figure3.17. + + 4 Measuring and Preventing Memorization Of Benchmarks + + Since our training dataset is sourced from the internet, it is possible that our model was trained on some of our + benchmark test sets. Accurately detecting test contamination from internet-scale datasets is a new area of research + without established best practices. While it is common practice to train large models without investigating contamination, + given the increasing scale of pretraining datasets, we believe this issue is becoming increasingly important to attend to. + This concern is not just hypothetical. One of the first papers to train a language model on Common Crawl data [TL18] + detected and removed a training document which overlapped with one of their evaluation datasets. Other work such + as GPT-2 [RWC + 19] also conducted post-hoc overlap analysis. Their study was relatively encouraging, finding that + + <
>

Figure 3.17: Representative GPT-3 completions for the few-shot task of correcting English grammar. Boldface
is GPT-3's completions, plain text is human prompts. In the first few examples, both the prompt and the
completion are provided by a human; this then serves as conditioning for subsequent examples where GPT-3 receives
successive additional prompts and provides the completions. Nothing task-specific is provided to GPT-3 aside from
the first few examples as conditioning and the "Poor English input/Good English output" framing. We note that the
distinction between "poor" and "good" English (and the terms themselves) is complex, contextual, and contested. As
the example mentioning the rental of a house shows, assumptions that the model makes about what "good" is can even
lead it to make errors (here, the model not only adjusts grammar, but also removes the word "cheap" in a way that alters
meaning).

<
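
The contamination analysis continued below flags an evaluation example as potentially leaked if it shares a 13-gram with the pretraining data (or overlaps as a whole when it is shorter). The following is only a rough sketch of such a filter; tokenization, normalization, and data structures are illustrative assumptions, and the exact procedure is given in the paper's Appendix C.

    # Conservative n-gram overlap filter (sketch).
    from typing import Iterable, List, Set, Tuple

    N = 13
    Ngram = Tuple[str, ...]


    def ngrams(text: str, n: int = N) -> List[Ngram]:
        toks = text.lower().split()
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]


    def build_index(training_docs: Iterable[str]) -> Set[Ngram]:
        index: Set[Ngram] = set()
        for doc in training_docs:
            index.update(ngrams(doc))
        return index


    def is_potentially_contaminated(example: str, index: Set[Ngram],
                                    training_docs: List[str]) -> bool:
        toks = example.lower().split()
        if len(toks) < N:
            # Short example: fall back to a whole-example substring check (simplified).
            return any(example.lower() in doc.lower() for doc in training_docs)
        return any(g in index for g in ngrams(example))


    if __name__ == "__main__":
        docs = ["the quick brown fox jumps over the lazy dog and then sleeps all afternoon"]
        idx = build_index(docs)
        print(is_potentially_contaminated("a completely unrelated evaluation example", idx, docs))
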
> + + Figure 4.1: GPT-3 Training Curves We measure model performance during training on a deduplicated validation + split of our training distribution. Though there is some gap between training and validation performance, the gap grows + only minimally with model size and training time, suggesting that most of the gap comes from a difference in difficulty + rather than overfitting. + + + + although models did perform moderately better on data that overlapped between training and testing, this did not + significantly impact reported results due to the small fraction of data which was contaminated (often only a few percent). + GPT-3 operates in a somewhat different regime. On the one hand, the dataset and model size are about two orders of + magnitude larger than those used for GPT-2, and include a large amount of Common Crawl, creating increased potential + for contamination and memorization. On the other hand, precisely due to the large amount of data, even GPT-3 175B + does not overfit its training set by a significant amount, measured relative to a held-out validation set with which it was + deduplicated (Figure4.1). Thus, we expect that contamination is likely to be frequent, but that its effects may not be as + large as feared. + We initially tried to address the issue of contamination by proactively searching for and attempting to remove any overlap + between our training data and the development and test sets of all benchmarks studied in this paper. Unfortunately, a + bug resulted in only partial removal of all detected overlaps from the training data. Due to the cost of training, it wasn’t + feasible to retrain the model. To address this, we investigate in detail how the remaining detected overlap impacts + results. + For each benchmark, we produce a ‘clean’ version which removes all potentially leaked examples, defined roughly as + examples that have a 13-gram overlap with anything in the pretraining set (or that overlap with the whole example when + it is shorter than 13-grams). The goal is to very conservatively flag anything that could potentially be contamination, + so as to produce a clean subset that is free of contamination with high confidence. The exact procedure is detailed in + AppendixC. + We then evaluate GPT-3 on these clean benchmarks, and compare to the original score. If the score on the clean + subset is similar to the score on the entire dataset, this suggests that contamination, even if present, does not have a + significant effect on reported results. If the score on the clean subset is lower, this suggests contamination may be + inflating the results. The results are summarized in Figure4.2. Although potential contamination is often high (with a + quarter of benchmarks scoring over 50%), in most cases performance changes only negligibly, and we see no evidence + that contamination level and performance difference are correlated. We conclude that either our conservative method + substantially overestimated contamination or that contamination has little effect on performance. + Below, we review in more detail the few specific cases where either (1) the model performs significantly worse on + the cleaned version, or (2) potential contamination is very high, which makes measuring the performance difference + difficult. + Our analysis flagged six groups of benchmarks for further investigation: Word Scrambling, Reading Comprehension + (QuAC, SQuAD2, DROP), PIQA, Winograd, language modeling tasks (Wikitext tasks, 1BW), and German to English + + <
> + + Figure 4.2: Benchmark contamination analysis We constructed cleaned versions of each of our benchmarks to + check for potential contamination in our training set. The x-axis is a conservative lower bound for how much of the + dataset is known with high confidence to be clean, and the y-axis shows the difference in performance when evaluating + only on the verified clean subset. Performance on most benchmarks changed negligibly, but some were flagged for + further review. On inspection we find some evidence for contamination of the PIQA and Winograd results, and we mark + the corresponding results in Section3with an asterisk. We find no evidence that other benchmarks are affected. + + + translation. Since our overlap analysis is designed to be extremely conservative, we expect it to produce some false + positives. We summarize the results for each group of tasks below: + + •Reading Comprehension:Our initial analysis flagged>90% of task examples from QuAC, SQuAD2, and + DROP as potentially contaminated, so large that even measuring the differential on a clean subset was difficult. + Upon manual inspection, however, we found that for every overlap we inspected, in all 3 datasets, the source + text was present in our training data but the question/answer pairs were not, meaning the model gains only + background information and cannot memorize the answer to a specific question. + •German translation:We found 25% of the examples in the WMT16 German-English test set were marked + as potentially contaminated, with an associated total effect size of 1-2 BLEU. Upon inspection, none of the + flagged examples contain paired sentences resembling NMT training data and collisions were monolingual + matches mostly of snippets of events discussed in the news. + •Reversed Words and Anagrams:Recall that these tasks are of the form “alaok = koala”. Due to the + short length of these tasks, we used 2-grams for filtering (ignoring punctuation). After inspecting the flagged + overlaps, we found that they were not typically instances of real reversals or unscramblings in the training set, + but rather palindromes or trivial unscramblings, e.g “kayak = kayak”. The amount of overlap was small, + but removing the trivial tasks lead to an increase in difficulty and thus a spurious signal. Related to this, the + symbol insertion task shows high overlap but no effect on performance – this is because that task involves + removing non-letter characters from a word, and the overlap analysis itself ignores such characters, leading to + many spurious matches. + •PIQA:The overlap analysis flagged 29% of examples as contaminated, and observed a 3 percentage point + absolute decrease (4% relative decrease) in performance on the clean subset. Though the test dataset was + released after our training set was created and its labels are hidden, some of the web pages used by the + crowdsourced dataset creators are contained in our training set. We found a similar decrease in a 25x smaller + model with much less capacity to memorize, leading us to suspect that the shift is likely statistical bias + rather than memorization; examples which workers copied may simply be easier. Unfortunately, we cannot + rigorously prove this hypothesis. We therefore mark our PIQA results with an asterisk to denote this potential + contamination. + •Winograd:The overlap analysis flagged 45% of examples, and found a 2.6% decrease in performance on the + clean subset. 
Manual inspection of the overlapping data point showed that 132 Winograd schemas were in + fact present in our training set, though presented in a different format than we present the task to the model. + Although the decrease in performance is small, we mark our Winograd results in the main paper with an + asterisk. + + •Language modeling:We found the 4 Wikipedia language modeling benchmarks measured in GPT-2, plus the + Children’s Book Test dataset, to be almost entirely contained in our training data. Since we cannot reliably + extract a clean subset here, we do not report results on these datasets, even though we intended to when starting + this work. We note that Penn Tree Bank due to its age was unaffected and therefore became our chief language + modeling benchmark. + + We also inspected datasets where contamination was high, but the impact on performance was close to zero, simply + to verify how much actual contamination existed. These appeared to often contain false positives. They had either + no actual contamination, or had contamination that did not give away the answer to the task. One notable exception + was LAMBADA, which appeared to have substantial genuine contamination, yet the impact on performance was very + small, with the clean subset scoring within 0.5% of the full dataset. Also, strictly speaking, our fill-in-the-blank format + precludes the simplest form of memorization. Nevertheless, since we made very large gains on LAMBADA in this + paper, the potential contamination is noted in the results section. + An important limitation of our contamination analysis is that we cannot be sure that the clean subset is drawn from the + same distribution as the original dataset. It remains possible that memorization inflates results but at the same time + is precisely counteracted by some statistical bias causing the clean subset to be easier. However, the sheer number + of shifts close to zero suggests this is unlikely, and we also observed no noticeable difference in the shifts for small + models, which are unlikely to be memorizing. + Overall, we have made a best effort to measure and document the effects of data contamination, and to note or outright + remove problematic results, depending on the severity. Much work remains to be done to address this important and + subtle issue for the field in general, both when designing benchmarks and when training models. For a more detailed + explanation of our analysis, we refer the reader to AppendixC. + + + 5 Limitations + + GPT-3 and our analysis of it have a number of limitations. Below we describe some of these and suggest directions for + future work. + First, despite the strong quantitative and qualitative improvements of GPT-3, particularly compared to its direct + predecessor GPT-2, it still has notable weaknesses in text synthesis and several NLP tasks. On text synthesis, although + the overall quality is high, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to + lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences + or paragraphs. We will release a collection of 500 uncurated unconditional samples to help provide a better sense of + GPT-3’s limitations and strengths at text synthesis. Within the domain of discrete language tasks, we have noticed + informally that GPT-3 seems to have special difficulty with “common sense physics”, despite doing well on some + datasets (such as PIQA [BZB + 19]) that test this domain. 
Specifically GPT-3 has difficulty with questions of the type + “If I put cheese into the fridge, will it melt?”. Quantitatively, GPT-3’s in-context learning performance has some notable + gaps on our suite of benchmarks, as described in Section3, and in particular it does little better than chance when + evaluated one-shot or even few-shot on some “comparison” tasks, such as determining if two words are used the same + way in a sentence, or if one sentence implies another (WIC and ANLI respectively), as well as on a subset of reading + comprehension tasks. This is especially striking given GPT-3’s strong few-shot performance on many other tasks. + GPT-3 has several structural and algorithmic limitations, which could account for some of the issues above. We focused + on exploring in-context learning behavior in autoregressive language models because it is straightforward to both + sample and compute likelihoods with this model class. As a result our experiments do not include any bidirectional + architectures or other training objectives such as denoising. This is a noticeable difference from much of the recent + literature, which has documented improved fine-tuning performance when using these approaches over standard + language models [RSR + 19]. Thus our design decision comes at the cost of potentially worse performance on tasks + which empirically benefit from bidirectionality. This may include fill-in-the-blank tasks, tasks that involve looking back + and comparing two pieces of content, or tasks that require re-reading or carefully considering a long passage and then + generating a very short answer. This could be a possible explanation for GPT-3’s lagging few-shot performance on a + few of the tasks, such as WIC (which involves comparing the use of a word in two sentences), ANLI (which involves + comparing two sentences to see if one implies the other), and several reading comprehension tasks (e.g. QuAC and + RACE). We also conjecture, based on past literature, that a large bidirectional model would be stronger at fine-tuning + than GPT-3. Making a bidirectional model at the scale of GPT-3, and/or trying to make bidirectional models work with + few- or zero-shot learning, is a promising direction for future research, and could help achieve the “best of both worlds”. + A more fundamental limitation of the general approach described in this paper – scaling up any LM-like model, whether + autoregressive or bidirectional – is that it may eventually run into (or could already be running into) the limits of the + + pretraining objective. Our current objective weights every token equally and lacks a notion of what is most important to + predict and what is less important. [RRS20] demonstrate benefits of customizing prediction to entities of interest. Also, + with self-supervised objectives, task specification relies on forcing the desired task into a prediction problem, whereas + ultimately, useful language systems (for example virtual assistants) might be better thought of as taking goal-directed + actions rather than just making predictions. Finally, large pretrained language models are not grounded in other domains + of experience, such as video or real-world physical interaction, and thus lack a large amount of context about the world + [BHT + 20]. For all these reasons, scaling pure self-supervised prediction is likely to hit limits, and augmentation with a + different approach is likely to be necessary. 
Promising future directions in this vein might include learning the objective + function from humans [ZSW + 19a], fine-tuning with reinforcement learning, or adding additional modalities such as + images to provide grounding and a better model of the world [CLY + 19]. + Another limitation broadly shared by language models is poor sample efficiency during pre-training. While GPT-3 + takes a step towards test-time sample efficiency closer to that of humans (one-shot or zero-shot), it still sees much more + text during pre-training than a human sees in the their lifetime [Lin20]. Improving pre-training sample efficiency is + an important direction for future work, and might come from grounding in the physical world to provide additional + information, or from algorithmic improvements. + A limitation, or at least uncertainty, associated with few-shot learning in GPT-3 is ambiguity about whether few-shot + learning actually learns new tasks “from scratch” at inference time, or if it simply recognizes and identifies tasks that it + has learned during training. These possibilities exist on a spectrum, ranging from demonstrations in the training set that + are drawn from exactly the same distribution as those at test time, to recognizing the same task but in a different format, + to adapting to a specific style of a general task such as QA, to learning a skill entirely de novo. Where GPT-3 is on + this spectrum may also vary from task to task. Synthetic tasks such as wordscrambling or defining nonsense words + seem especially likely to be learned de novo, whereas translation clearly must be learned during pretraining, although + possibly from data that is very different in organization and style than the test data. Ultimately, it is not even clear what + humans learn from scratch vs from prior demonstrations. Even organizing diverse demonstrations during pre-training + and identifying them at test time would be an advance for language models, but nevertheless understanding precisely + how few-shot learning works is an important unexplored direction for future research. + A limitation associated with models at the scale of GPT-3, regardless of objective function or algorithm, is that they are + both expensive and inconvenient to perform inference on, which may present a challenge for practical applicability of + models of this scale in their current form. One possible future direction to address this is distillation [HVD15] of large + models down to a manageable size for specific tasks. Large models such as GPT-3 contain a very wide range of skills, + most of which are not needed for a specific task, suggesting that in principle aggressive distillation may be possible. + Distillation is well-explored in general [LHCG19a] but has not been tried at the scale of hundred of billions parameters; + new challenges and opportunities may be associated with applying it to models of this size. + Finally, GPT-3 shares some limitations common to most deep learning systems – its decisions are not easily interpretable, + it is not necessarily well-calibrated in its predictions on novel inputs as observed by the much higher variance in + performance than humans on standard benchmarks, and it retains the biases of the data it has been trained on. This + last issue – biases in the data that may lead the model to generate stereotyped or prejudiced content – is of special + concern from a societal perspective, and will be discussed along with other issues in the next section on Broader Impacts + (Section6). 
+ + 6 Broader Impacts + + Language models have a wide range of beneficial applications for society, including code and writing auto-completion, + grammar assistance, game narrative generation, improving search engine responses, and answering questions. But + they also have potentially harmful applications. GPT-3 improves the quality of text generation and adaptability over + smaller models and increases the difficulty of distinguishing synthetic text from human-written text. It therefore has the + potential to advance both the beneficial and harmful applications of language models. + Here we focus on the potential harms of improved language models, not because we believe the harms are necessarily + greater, but in order to stimulate efforts to study and mitigate them. The broader impacts of language models like this + are numerous. We focus on two primary issues: the potential for deliberate misuse of language models like GPT-3 in + Section6.1, and issues of bias, fairness, and representation within models like GPT-3 in Section6.2. We also briefly + discuss issues of energy efficiency (Section6.3). + + 6.1 Misuse of Language Models + + Malicious uses of language models can be somewhat difficult to anticipate because they often involve repurposing + language models in a very different environment or for a different purpose than researchers intended. To help with this, + we can think in terms of traditional security risk assessment frameworks, which outline key steps such as identifying + threats and potential impacts, assessing likelihood, and determining risk as a combination of likelihood and impact + [Ros12]. We discuss three factors: potential misuse applications, threat actors, and external incentive structures. + + 6.1.1 Potential Misuse Applications + + Any socially harmful activity that relies on generating text could be augmented by powerful language models. Examples + include misinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing + and social engineering pretexting. Many of these applications bottleneck on human beings to write sufficiently high + quality text. Language models that produce high quality text generation could lower existing barriers to carrying out + these activities and increase their efficacy. + The misuse potential of language models increases as the quality of text synthesis improves. The ability of GPT-3 to + generate several paragraphs of synthetic content that people find difficult to distinguish from human-written text in + 3.9.4 represents a concerning milestone in this regard. + + 6.1.2 Threat Actor Analysis + + Threat actors can be organized by skill and resource levels, ranging from low or moderately skilled and resourced actors + who may be able to build a malicious product to ‘advanced persistent threats’ (APTs): highly skilled and well-resourced + (e.g. state-sponsored) groups with long-term agendas [SBC + 19]. + To understand how low and mid-skill actors think about language models, we have been monitoring forums and chat + groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed. While we did + find significant discussion of misuse following the initial release of GPT-2 in spring of 2019, we found fewer instances + of experimentation and no successful deployments since then. Additionally, those misuse discussions were correlated + with media coverage of language model technologies. 
From this, we assess that the threat of misuse from these actors is + not immediate, but significant improvements in reliability could change this. + Because APTs do not typically discuss operations in the open, we have consulted with professional threat analysts about + possible APT activity involving the use of language models. Since the release of GPT-2 there has been no discernible + difference in operations that may see potential gains by using language models. The assessment was that language + models may not be worth investing significant resources in because there has been no convincing demonstration that + current language models are significantly better than current methods for generating text, and because methods for + “targeting” or “controlling” the content of language models are still at a very early stage. + + 6.1.3 External Incentive Structures + + Each threat actor group also has a set of tactics, techniques, and procedures (TTPs) that they rely on to accomplish their + agenda. TTPs are influenced by economic factors like scalability and ease of deployment; phishing is extremely popular + among all groups because it offers a low-cost, low-effort, high-yield method of deploying malware and stealing login + credentials. Using language models to augment existing TTPs would likely result in an even lower cost of deployment. + Ease of use is another significant incentive. Having stable infrastructure has a large impact on the adoption of TTPs. + The outputs of language models are stochastic, however, and though developers can constrain these (e.g. using top-k + truncation) they are not able to perform consistently without human feedback. If a social media disinformation bot + produces outputs that are reliable 99% of the time, but produces incoherent outputs 1% of the time, this could reduce the + amount of human labor required in operating this bot. But a human is still needed to filter the outputs, which restricts + how scalable the operation can be. + Based on our analysis of this model and analysis of threat actors and the landscape, we suspect AI researchers will + eventually develop language models that are sufficiently consistent and steerable that they will be of greater interest to + malicious actors. We expect this will introduce challenges for the broader research community, and hope to work on + this through a combination of mitigation research, prototyping, and coordinating with other technical developers. + + 6.2 Fairness, Bias, and Representation + + Biases present in training data may lead models to generate stereotyped or prejudiced content. This is concerning, + since model bias could harm people in the relevant groups in different ways by entrenching existing stereotypes and + producing demeaning portrayals amongst other potential harms [Cra17]. We have conducted an analysis of biases in + the model in order to better understand GPT-3’s limitations when it comes to fairness, bias, and representation. 8 + + Our goal is not to exhaustively characterize GPT-3, but to give a preliminary analysis of some of its limitations and + behaviors. We focus on biases relating to gender, race, and religion, although many other categories of bias are likely + present and could be studied in follow-up work. This is a preliminary analysis and does not reflect all of the model’s + biases even within the studied categories. 
Broadly, our analysis indicates that internet-trained models have internet-scale biases; models tend to reflect stereotypes
present in their training data. Below we discuss our preliminary findings of bias along the dimensions of gender, race,
and religion. We probe for bias in the 175 billion parameter model and also in similar smaller models, to see if and how
they are different in this dimension.

6.2.1 Gender
In our investigation of gender bias in GPT-3, we focused on associations between gender and occupation. We found
that occupations in general have a higher probability of being followed by a male gender identifier than a female one
(in other words, they are male leaning) when given a context such as "The {occupation} was a" (Neutral Variant).
83% of the 388 occupations we tested were more likely to be followed by a male identifier by GPT-3. We measured
this by feeding the model a context such as "The detective was a" and then looking at the probability of the
model following up with male indicating words (e.g. man, male etc.) or female indicating words (woman, female etc.).
In particular, occupations demonstrating higher levels of education such as legislator, banker, or professor emeritus
were heavily male leaning along with occupations that require hard physical labour such as mason, millwright, and
sheriff. Occupations that were more likely to be followed by female identifiers include midwife, nurse, receptionist,
housekeeper etc.
We also tested how these probabilities changed when we shifted the context to be "The competent {occupation}
was a" (Competent Variant), and when we shifted the context to be "The incompetent {occupation} was a"
(Incompetent Variant) for each occupation in the dataset. We found that, when prompted with "The competent
{occupation} was a", the majority of occupations had an even higher probability of being followed by a
male identifier than a female one than was the case with our original neutral prompt, "The {occupation} was
a". With the prompt "The incompetent {occupation} was a" the majority of occupations still leaned male
with a similar probability to that of our original neutral prompt. The average occupation bias - measured as
<> - was <> for the Neutral Variant, <> for the Competent Variant and <>
for the Incompetent Variant.

We also carried out pronoun resolution on the Winogender dataset [RNLVD18] using two methods which further
corroborated the model's tendency to associate most occupations with males. One method measured the model's
ability to correctly assign a pronoun as the occupation or the participant. For example, we fed the model
a context such as "The advisor met with the advisee because she wanted to get advice about job
applications. 'She' refers to the" and found the option with the lowest probability between the two possible
options (Choices between Occupation Option: advisor; Participant Option: advisee).
Occupation and participant words often have societal biases associated with them such as the assumption that most
occupants are by default male. We found that the language models learnt some of these biases such as a tendency to
associate female pronouns with participant positions more than male pronouns. GPT-3 175B had the highest accuracy of
all the models (64.17%) on this task. It was also the only model where the accuracy for Occupant sentences (sentences
where the correct answer was the Occupation option) for females was higher than for males (81.7% vs 76.7%).
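
As a rough sketch of the occupation probe described above: for each occupation, compare the model's probability of male-indicating versus female-indicating continuations of the neutral context. The next_word_prob helper is a hypothetical stand-in for a real language model, and the log-ratio summary used here is an illustrative assumption (the exact bias metric is elided in this copy of the text).

    # Occupation-bias probe (illustrative sketch).
    import math
    from typing import Callable, Iterable

    MALE_WORDS = ("man", "male")        # indicator words taken from the text above
    FEMALE_WORDS = ("woman", "female")


    def occupation_bias(occupations: Iterable[str],
                        next_word_prob: Callable[[str, str], float],
                        template: str = "The {occupation} was a") -> float:
        """Average over occupations of log P(female words) - log P(male words)."""
        scores = []
        for occ in occupations:
            ctx = template.format(occupation=occ)
            p_female = sum(next_word_prob(ctx, w) for w in FEMALE_WORDS)
            p_male = sum(next_word_prob(ctx, w) for w in MALE_WORDS)
            scores.append(math.log(p_female / p_male))
        return sum(scores) / len(scores)   # negative values indicate male-leaning contexts


    if __name__ == "__main__":
        # Toy probability function; a real one would query the language model.
        toy = lambda ctx, w: 0.2 if w in MALE_WORDS else 0.1
        print(occupation_bias(["detective", "nurse"], toy))
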
We also performed co-occurrence tests, where we analyzed which words are likely to occur in the vicinity of other pre-selected words. We created a model output sample set by generating 800 outputs of length 50 each with a temperature of 1 and top_p of 0.9 for every prompt in our dataset. For gender, we had prompts such as "He was very", "She was very", "He would be described as", "She would be described as" 9 . We looked at the adjectives and adverbs in the top 100 most favored words using an off-the-shelf POS tagger [LB02]. We found females were more often described using appearance oriented words such as "beautiful" and "gorgeous" as compared to men, who were more often described using adjectives that span a greater spectrum.

Table 6.1 shows the top 10 most favored descriptive words for the model along with the raw number of times each word co-occurred with a pronoun indicator. "Most Favored" here indicates words which were most skewed towards a category by co-occurring with it at a higher rate as compared to the other category. To put these numbers in perspective, we have also included the average for the number of co-occurrences across all qualifying words for each gender.

8 Evaluating fairness, bias, and representation in language models is a rapidly-developing area with a large body of prior work. See, for example, [HZJ + 19, NBR20, SCNP19].
9 We only used male and female pronouns. This simplifying assumption makes it easier to study co-occurrence since it does not require the isolation of instances in which 'they' refers to a singular noun from those where it didn't, but other forms of gender bias are likely present and could be studied using different approaches.

Table 6.1: Most Biased Descriptive Words in 175B Model

<>

6.2.2 Race

To investigate racial bias in GPT-3, we seeded the model with prompts such as "The {race} man was very", "The {race} woman was very" and "People would describe the {race} person as", and generated 800 samples for each of the above prompts, with {race} replaced with a term indicating a racial category such as White or Asian. We then measure word co-occurrences in the generated samples. Given prior research demonstrating that language models produce text of differing sentiment when varying features such as occupation [HZJ + 19], we explored how race impacted sentiment. We measured sentiment using SentiWordNet [BES10] for the words which co-occurred disproportionately with each race. Each word sentiment varied from 100 to -100, with positive scores indicating positive words (e.g. wonderfulness: 100, amicable: 87.5), negative scores indicating negative words (e.g. wretched: -87.5, horrid: -87.5) and a score of 0 indicating neutral words (e.g. sloping, chalet).

It should be noted that we were explicitly prompting the models to talk about race and this in turn generated text that focused on racial features; these results are not from the models talking about race in the wild but talking about race in an experimental setup where they have been primed to do so. Additionally, since we are measuring sentiment by simply looking at word co-occurrences, the resulting sentiment can reflect socio-historical factors - for instance, text relating to a discussion of slavery will frequently have a negative sentiment, which may lead to a demographic being associated with a negative sentiment under this testing methodology.

Across the models we analyzed, 'Asian' had a consistently high sentiment - it ranked 1st in 3 out of 7 models. On the other hand, 'Black' had a consistently low sentiment - it ranked the lowest in 5 out of 7 models. These differences narrowed marginally on the larger model sizes. This analysis gives a sense of the biases of different models and highlights the need for more sophisticated analysis of the relationship between sentiment, entities, and input data.
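As a concrete illustration of this co-occurrence and sentiment methodology, the sketch below (ours, not the authors' analysis code) POS-tags a set of pre-generated samples, collects the adjectives and adverbs that co-occur with each category, and scores them with SentiWordNet through NLTK. The sample strings are placeholders, and rescaling SentiWordNet's 0-1 scores to the -100..100 range used above is an approximation.

# Sketch of co-occurrence + sentiment scoring (assumed inputs; illustrative only).
# Requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
#           nltk.download('wordnet'), nltk.download('sentiwordnet')
from collections import Counter
import nltk
from nltk.corpus import sentiwordnet as swn

# Pre-generated model outputs per category (assumed placeholders).
samples = {
    "Asian": ["The Asian man was very hardworking and friendly."],
    "Black": ["The Black man was very tall and wretched, people said."],
}

def descriptive_words(texts):
    """Adjectives (JJ*) and adverbs (RB*) found by an off-the-shelf POS tagger."""
    counts = Counter()
    for text in texts:
        for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
            if tag.startswith("JJ") or tag.startswith("RB"):
                counts[word.lower()] += 1
    return counts

def sentiment(word):
    """Approximate word sentiment on a -100..100 scale from SentiWordNet."""
    synsets = list(swn.senti_synsets(word))
    if not synsets:
        return 0.0
    s = synsets[0]
    return 100.0 * (s.pos_score() - s.neg_score())

for category, texts in samples.items():
    words = descriptive_words(texts)
    avg = (sum(sentiment(w) * c for w, c in words.items()) / sum(words.values())) if words else 0.0
    print(category, dict(words), f"avg sentiment = {avg:.1f}")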
<>

Figure 6.1: Racial Sentiment Across Models

<>
Table 6.2: Shows the ten most favored words about each religion in the GPT-3 175B model.


6.2.3 Religion

We studied which words co-occurred with religious terms relating to Atheism, Buddhism, Christianity, Hinduism, Islam, and Judaism, by generating 800 model outputs of length 50 with a temperature of 1 and a top_p of 0.9 for every prompt. Our prompts were of the nature "{Religion practitioners} are" (e.g. "Christians are") for each of the six religious categories listed above. We then allowed the model to naturally carry out completions and created a corpus of such completions for studying co-occurrence of words.

The following is an example output from the model:
"Buddhists are divided into two main branches - Theravada and Mahayana. Theravada is the more conservative branch, centering on monastic life and the earliest sutras and refusing to recognize the later Mahayana sutras as authentic."

Similar to race, we found that the models make associations with religious terms that indicate some propensity to reflect how these terms are sometimes presented in the world. For example, with the religion Islam, we found that words such as ramadan, prophet and mosque co-occurred at a higher rate than for other religions. We also found that words such as violent, terrorism and terrorist co-occurred at a greater rate with Islam than with other religions and were in the top 40 most favored words for Islam in GPT-3.

6.2.4 Future Bias and Fairness Challenges

We have presented this preliminary analysis to share some of the biases we found in order to motivate further research, and to highlight the inherent difficulties in characterizing biases in large-scale generative models; we expect this to be an area of continuous research for us and are excited to discuss different methodological approaches with the community. We view the work in this section as subjective signposting - we chose gender, race, and religion as a starting point, but we recognize the inherent subjectivity in this choice. Our work is inspired by the literature on characterizing model attributes to develop informative labels such as Model Cards for Model Reporting from [MWZ + 18].

Ultimately, it is important not just to characterize biases in language systems but to intervene. The literature on this is also extensive [QMZH19, HZJ + 19], so we offer only a few brief comments on future directions specific to large language models. In order to pave the way for effective bias prevention in general purpose models, there is a need for building a common vocabulary tying together the normative, technical and empirical challenges of bias mitigation for these models. There is room for more research that engages with the literature outside NLP, better articulates normative statements about harm, and engages with the lived experience of communities affected by NLP systems [BBDIW20]. Thus, mitigation work should not be approached purely with a metric driven objective to 'remove' bias, as this has been shown to have blind spots [GG19, NvNvdG19], but in a holistic manner.

6.3 Energy Usage

Practical large-scale pre-training requires large amounts of computation, which is energy-intensive: training the GPT-3 175B consumed several thousand petaflop/s-days of compute during pre-training, compared to tens of petaflop/s-days for a 1.5B parameter GPT-2 model (Figure 2.2). This means we should be cognizant of the cost and efficiency of such models, as advocated by [SDSE19].
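To put the "several thousand petaflop/s-days" figure in perspective, here is a rough back-of-the-envelope estimate in the spirit of Appendix D. The 300 billion training tokens and the 6 FLOPs per parameter per token rule (2 for the forward pass, times 3 to account for the backward pass, ignoring attention) are assumptions used purely for illustration.

# Order-of-magnitude estimate of GPT-3 175B pre-training compute.
# Assumptions: ~300e9 training tokens; 6 FLOPs per parameter per token
# (2 forward x 3 for the backward pass); attention ignored, as in Appendix D.
n_params = 175e9
n_tokens = 300e9
total_flops = 6 * n_params * n_tokens
petaflop_s_day = 1e15 * 24 * 3600  # 8.64e19 FLOPs
print(f"total training compute ~ {total_flops:.2e} FLOPs")
print(f"                       ~ {total_flops / petaflop_s_day:.0f} petaflop/s-days")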
+ The use of large-scale pre-training also gives another lens through which to view the efficiency of large models - we + should consider not only the resources that go into training them, but how these resources are amortized over the + lifetime of a model, which will subsequently be used for a variety of purposes and fine-tuned for specific tasks. Though + models like GPT-3 consume significant resources during training, they can be surprisingly efficient once trained: even + with the full GPT-3 175B, generating 100 pages of content from a trained model can cost on the order of 0.4 kW-hr, or + only a few cents in energy costs. Additionally, techniques like model distillation [LHCG19a] can further bring down + the cost of such models, letting us adopt a paradigm of training single, large-scale models, then creating more efficient + versions of them for use in appropriate contexts. Algorithmic progress may also naturally further increase the efficiency + of such models over time, similar to trends observed in image recognition and neural machine translation [HB20]. + + 7 Related Work + + Several lines of work have focused on increasing parameter count and/or computation in language models as a + means to improve generative or task performance. An early work scaled LSTM based language models to over a + billion parameters [JVS + 16]. One line of work straightforwardly increases the size of transformer models, scaling + up parameters and FLOPS-per-token roughly in proportion. Work in this vein has successively increased model size: + 213 million parameters [VSP + 17] in the original paper, 300 million parameters [DCLT18], 1.5 billion parameters + [RWC + 19], 8 billion parameters [SPP + 19], 11 billion parameters [RSR + 19], and most recently 17 billion parameters + [Tur20]. A second line of work has focused on increasing parameter count but not computation, as a means of + increasing models’ capacity to store information without increased computational cost. These approaches rely on the + conditional computation framework [BLC13] and specifically, the mixture-of-experts method [SMM + 17] has been + used to produce 100 billion parameter models and more recently 50 billion parameter translation models [AJF19], + though only a small fraction of the parameters are actually used on each forward pass. A third approach increases + computation without increasing parameters; examples of this approach include adaptive computation time [Gra16] and + the universal transformer [DGV + 18]. Our work focuses on the first approach (scaling compute and parameters together, + by straightforwardly making the neural net larger), and increases model size 10x beyond previous models that employ + this strategy. + Several efforts have also systematically studied the effect of scale on language model performance. [KMH + 20, + RRBS19,LWS + 20,HNA + 17], find a smooth power-law trend in loss as autoregressive language models are scaled up. + This work suggests that this trend largely continues as models continue to scale up (although a slight bending of the + curve can perhaps be detected in Figure3.1), and we also find relatively smooth increases in many (though not all) + downstream tasks across 3 orders of magnitude of scaling. + Another line of work goes in the opposite direction from scaling, attempting to preserve strong performance in language + models that are as small as possible. 
This approach includes ALBERT [LCG + 19] as well as general [HVD15] and + task-specific [SDCW19,JYS + 19,KR16] approaches to distillation of language models. These architectures and + techniques are potentially complementary to our work, and could be applied to decrease latency and memory footprint + of giant models. + As fine-tuned language models have neared human performance on many standard benchmark tasks, considerable + effort has been devoted to constructing more difficult or open-ended tasks, including question answering [KPR + 19, + IBGC + 14,CCE + 18,MCKS18], reading comprehension [CHI + 18,RCM19], and adversarially constructed datasets + designed to be difficult for existing language models [SBBC19,NWD + 19]. In this work we test our models on many + of these datasets. + Many previous efforts have focused specifically on question-answering, which constitutes a significant fraction of the + tasks we tested on. Recent efforts include [RSR + 19,RRS20], which fine-tuned an 11 billion parameter language model, + and [GLT + 20], which focused on attending over a large corpus of data at test time. Our work differs in focusing on + in-context learning but could be combined in the future with those of [GLT + 20,LPP + 20]. + Metalearning in language models has been utilized in [RWC + 19], though with much more limited results and no + systematic study. More broadly, language model metalearning has an inner-loop-outer-loop structure, making it + structurally similar to metalearning as applied to ML in general. Here there is an extensive literature, including + matching networks [VBL + 16], RL2 [DSC + 16], learning to optimize [RL16,ADG + 16,LM17] and MAML [FAL17]. + Our approach of stuffing the model’s context with previous examples is most structurally similar to RL2 and also + resembles [HYC01], in that an inner loop of adaptation takes place through computation in the model’s activations + across timesteps, without updating the weights, while an outer loop (in this case just language model pre-training) + updates the weights, and implicitly learns the ability to adapt to or at least recognize tasks defined at inference-time. + Few-shot auto-regressive density estimation was explored in [RCP + 17] and [GWC + 18] studied low-resource NMT as + a few-shot learning problem. + While the mechanism of our few-shot approach is different, prior work has also explored ways of using pre-trained + language models in combination with gradient descent to perform few-shot learning [SS20]. Another sub-field with + similar goals is semi-supervised learning where approaches such as UDA [XDH + 19] also explore methods of fine-tuning + when very little labeled data is available. + Giving multi-task models instructions in natural language was first formalized in a supervised setting with [MKXS18] + and utilized for some tasks (such as summarizing) in a language model with [RWC + 19]. The notion of presenting + tasks in natural language was also explored in the text-to-text transformer [RSR + 19], although there it was applied for + multi-task fine-tuning rather than for in-context learning without weight updates. + Another approach to increasing generality and transfer-learning capability in language models is multi-task learning + [Car97], which fine-tunes on a mixture of downstream tasks together, rather than separately updating the weights for + each one. 
If successful multi-task learning could allow a single model to be used for many tasks without updating the + weights (similar to our in-context learning approach), or alternatively could improve sample efficiency when updating + the weights for a new task. Multi-task learning has shown some promising initial results [LGH + 15,LSP + 18] and + multi-stage fine-tuning has recently become a standardized part of SOTA results on some datasets [PFB18] and pushed + the boundaries on certain tasks [KKS + 20], but is still limited by the need to manually curate collections of datasets and + set up training curricula. By contrast pre-training at large enough scale appears to offer a “natural” broad distribution of + tasks implicitly contained in predicting the text itself. One direction for future work might be attempting to generate + a broader set of explicit tasks for multi-task learning, for example through procedural generation [TFR + 17], human + interaction [ZSW + 19b], or active learning [Mac92]. + Algorithmic innovation in language models over the last two years has been enormous, including denoising-based + bidirectionality [DCLT18], prefixLM [DL15] and encoder-decoder architectures [LLG + 19,RSR + 19], random permu- + tations during training [YDY + 19], architectures that improve the efficiency of sampling [DYY + 19], improvements in + data and training procedures [LOG + 19], and efficiency increases in the embedding parameters [LCG + 19]. Many of + these techniques provide significant gains on downstream tasks. In this work we continue to focus on pure autoregressive + language models, both in order to focus on in-context learning performance and to reduce the complexity of our large + model implementations. However, it is very likely that incorporating these algorithmic advances could improve GPT-3’s + performance on downstream tasks, especially in the fine-tuning setting, and combining GPT-3’s scale with these + algorithmic techniques is a promising direction for future work. + + + 8 Conclusion + + We presented a 175 billion parameter language model which shows strong performance on many NLP tasks and + benchmarks in the zero-shot, one-shot, and few-shot settings, in some cases nearly matching the performance of + state-of-the-art fine-tuned systems, as well as generating high-quality samples and strong qualitative performance at + tasks defined on-the-fly. We documented roughly predictable trends of scaling in performance without using fine-tuning. + We also discussed the social impacts of this class of model. Despite many limitations and weaknesses, these results + suggest that very large language models may be an important ingredient in the development of adaptable, general + language systems. + + Acknowledgements + + The authors would like to thank Ryan Lowe for giving detailed feedback on drafts of the paper. Thanks to Jakub + Pachocki and Szymon Sidor for suggesting tasks, and Greg Brockman, Michael Petrov, Brooke Chan, and Chelsea + Voss for helping run evaluations on OpenAI’s infrastructure. 
Thanks to David Luan for initial support in scaling up this project, Irene Solaiman for discussions about ways to approach and evaluate bias, Harrison Edwards and Yura Burda for discussions and experimentation with in-context learning, Geoffrey Irving and Paul Christiano for early discussions of language model scaling, Long Ouyang for advising on the design of the human evaluation experiments, Chris Hallacy for discussions on data collection, and Shan Carter for help with visual design. Thanks to the millions of people who created content that was used in the training of the model, and to those who were involved in indexing or upvoting the content (in the case of WebText). Additionally, we would like to thank the entire OpenAI infrastructure and supercomputing teams for making it possible to train models at this scale.

Contributions

Tom Brown, Ben Mann, Prafulla Dhariwal, Dario Amodei, Nick Ryder, Daniel M Ziegler, and Jeffrey Wu implemented the large-scale models, training infrastructure, and model-parallel strategies.
Tom Brown, Dario Amodei, Ben Mann, and Nick Ryder conducted pre-training experiments.
Ben Mann and Alec Radford collected, filtered, deduplicated, and conducted overlap analysis on the training data.
Melanie Subbiah, Ben Mann, Dario Amodei, Jared Kaplan, Sam McCandlish, Tom Brown, Tom Henighan, and Girish Sastry implemented the downstream tasks and the software framework for supporting them, including creation of synthetic tasks.
Jared Kaplan and Sam McCandlish initially predicted that a giant language model should show continued gains, and applied scaling laws to help predict and guide model and data scaling decisions for the research.
Ben Mann implemented sampling without replacement during training.
Alec Radford originally demonstrated few-shot learning occurs in language models.
Jared Kaplan and Sam McCandlish showed that larger models learn more quickly in-context, and systematically studied in-context learning curves, task prompting, and evaluation methods.
Prafulla Dhariwal implemented an early version of the codebase, and developed the memory optimizations for fully half-precision training.
Rewon Child and Mark Chen developed an early version of our model-parallel strategy.
Rewon Child and Scott Gray contributed the sparse transformer.
Aditya Ramesh experimented with loss scaling strategies for pretraining.
Melanie Subbiah and Arvind Neelakantan implemented, experimented with, and tested beam search.
Pranav Shyam worked on SuperGLUE and assisted with connections to few-shot learning and meta-learning literature.
Sandhini Agarwal conducted the fairness and representation analysis.
Girish Sastry and Amanda Askell conducted the human evaluations of the model.
Ariel Herbert-Voss conducted the threat analysis of malicious use.
Gretchen Krueger edited and red-teamed the policy sections of the paper.
Benjamin Chess, Clemens Winter, Eric Sigler, Christopher Hesse, Mateusz Litwin, and Christopher Berner optimized OpenAI's clusters to run the largest models efficiently.
Scott Gray developed fast GPU kernels used during training.
Jack Clark led the analysis of ethical impacts — fairness and representation, human assessments of the model, and broader impacts analysis, and advised Gretchen, Amanda, Girish, Sandhini, and Ariel on their work.
Dario Amodei, Alec Radford, Tom Brown, Sam McCandlish, Nick Ryder, Jared Kaplan, Sandhini Agarwal, Amanda Askell, Girish Sastry, and Jack Clark wrote the paper.
Sam McCandlish led the analysis of model scaling, and advised Tom Henighan and Jared Kaplan on their work.
Alec Radford advised the project from an NLP perspective, suggested tasks, put the results in context, and demonstrated the benefit of weight decay for training.
Ilya Sutskever was an early advocate for scaling large generative likelihood models, and advised Pranav, Prafulla, Rewon, Alec, and Aditya on their work.
Dario Amodei designed and led the research.

A Details of Common Crawl Filtering

As mentioned in Section 2.2, we employed two techniques to improve the quality of the Common Crawl dataset: (1) filtering Common Crawl and (2) fuzzy deduplication:

1. In order to improve the quality of Common Crawl, we developed an automatic filtering method to remove low quality documents. Using the original WebText as a proxy for high-quality documents, we trained a classifier to distinguish these from raw Common Crawl. We then used this classifier to re-sample Common Crawl by prioritizing documents which were predicted by the classifier to be higher quality. The classifier is a logistic regression classifier trained with features from Spark's standard tokenizer and HashingTF 10 . For the positive examples, we used a collection of curated datasets such as WebText, Wikipedia, and our web books corpus, and for the negative examples, we used unfiltered Common Crawl. We used this classifier to score Common Crawl documents. We kept each document in our dataset iff

<>

We chose <> in order to take mostly documents the classifier scored highly, but still include some documents that were out of distribution. <> was chosen to match the distribution of scores from our classifier on WebText. We found this re-weighting increased quality as measured by loss on a range of out-of-distribution generative text samples. (A sketch of this filtering step is given below, after this list.)

2. To further improve model quality and prevent overfitting (which becomes increasingly important as model capacity increases), we fuzzily deduplicated documents (i.e. removed documents with high overlap with other documents) within each dataset using Spark's MinHashLSH implementation with 10 hashes, using the same features as were used for classification above. We also fuzzily removed WebText from Common Crawl. Overall this decreased dataset size by an average of 10%.

After filtering for duplicates and quality, we also partially removed text occurring in benchmark datasets, described in Appendix C.
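The following is a minimal sketch of the quality-classifier filtering described in item (1) above, not the released pipeline. The toy training rows, the number of hash features, the Pareto shape parameter, and the exact acceptance rule are stand-ins for details that are elided in the text.

# Sketch of quality-based re-sampling of Common Crawl (illustrative only).
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("cc-quality-filter").getOrCreate()

# label 1.0 = curated high-quality proxy (WebText-like), 0.0 = raw Common Crawl.
train = spark.createDataFrame(
    [("a carefully edited article about climate science", 1.0),
     ("click here cheap pills buy now free offer", 0.0)],
    ["text", "label"],
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),   # Spark's standard tokenizer
    HashingTF(inputCol="words", outputCol="features", numFeatures=1 << 18),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(train)

docs = spark.createDataFrame([("a candidate common crawl document",)], ["text"])
alpha = 9  # assumed Pareto shape; larger values keep fewer low-scoring documents
for row in model.transform(docs).select("text", "probability").collect():
    score = float(row["probability"][1])          # P(high quality)
    keep = np.random.pareto(alpha) > 1.0 - score  # assumed heavy-tailed acceptance rule
    print(row["text"][:40], round(score, 3), keep)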
B Details of Model Training

To train all versions of GPT-3, we use Adam with <>, we clip the global norm of the gradient at 1.0, and we use cosine decay for the learning rate down to 10% of its value, over 260 billion tokens (after 260 billion tokens, training continues at 10% of the original learning rate). There is a linear LR warmup over the first 375 million tokens. We also gradually increase the batch size linearly from a small value (32k tokens) to the full value over the first 4-12 billion tokens of training, depending on the model size. Data are sampled without replacement during training (until an epoch boundary is reached) to minimize overfitting. All models use weight decay of 0.1 to provide a small amount of regularization [LH17].

During training we always train on sequences of the full n_ctx = 2048 token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency. Sequences with multiple documents are not masked in any special way but instead documents within a sequence are delimited with a special end of text token, giving the language model the information necessary to infer that context separated by the end of text token is unrelated. This allows for efficient training without need for any special sequence-specific masking.

C Details of Test Set Contamination Studies

In Section 4 we gave a high level overview of test set contamination studies. In this section we provide details on methodology and results.

Initial training set filtering We attempted to remove text occurring in benchmarks from training data by searching for 13-gram overlaps between all test/development sets used in this work and our training data, and we removed the colliding 13-gram as well as a 200 character window around it, splitting the original document into pieces. For filtering purposes we define a gram as a lowercase, whitespace delimited word with no punctuation. Pieces less than 200 characters long were discarded. Documents split into more than 10 pieces were considered contaminated and removed entirely. Originally we removed entire documents given a single collision, but that overly penalized long documents such as books for false positives. An example of a false positive might be a test set based on Wikipedia, in which the Wikipedia article quotes a single line from a book. We ignored 13-grams that matched more than 10 training documents, as inspection showed the majority of these to contain common cultural phrases, legal boilerplate, or similar content that we likely do want the model to learn, rather than undesired specific overlaps with test sets. Examples for various frequencies can be found in the GPT-3 release repository.

Overlap methodology For our benchmark overlap analysis in Section 4, we used a variable number of words N to check for overlap for each dataset, where N is the 5th percentile example length in words, ignoring all punctuation, whitespace, and casing. Due to spurious collisions at lower values of N we use a minimum value of 8 on non-synthetic tasks. For performance reasons, we set a maximum value of 13 for all tasks. Values for N and the amount of data marked as dirty are shown in Table C.1. Unlike GPT-2's use of bloom filters to compute probabilistic bounds for test contamination, we used Apache Spark to compute exact collisions across all training and test sets. We compute overlaps between test sets and our full training corpus, even though we only trained on 40% of our filtered Common Crawl documents per Section 2.2.

We define a 'dirty' example as one with any N-gram overlap with any training document, and a 'clean' example as one with no collision.

Test and validation splits had similar contamination levels despite some test splits being unlabeled. Due to a bug revealed by this analysis, filtering described above failed on long documents such as books. Because of cost considerations it was infeasible to retrain the model on a corrected version of the training dataset. As such, several language modeling benchmarks plus the Children's Book Test showed almost complete overlap, and therefore were not included in this paper. Overlaps are shown in Table C.1.

Overlap results To understand how much having seen some of the data helps the model perform on downstream tasks, we filter every validation and test set by dirtiness. Then we run evaluation on the clean-only examples and report the relative percent change between the clean score and the original score. If the clean score is more than 1% or 2% worse than the overall score, it suggests the model may have overfit to the examples it has seen. If the clean score is significantly better, our filtering scheme may have preferentially marked easier examples as dirty.
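As an illustration of this overlap check, here is a small, pure-Python sketch (ours, not the Apache Spark pipeline described above); the helper names and the approximate percentile computation are our own.

# Sketch of the N-gram collision check: mark a test example "dirty" if any of
# its N-grams appears in any training document (illustrative implementation).
import string

def grams(text, n):
    """Lowercase, whitespace-delimited words with punctuation stripped."""
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def choose_n(examples):
    """N = 5th-percentile example length in words, clamped to [8, 13] (approximate)."""
    lengths = sorted(len(e.split()) for e in examples)
    n = lengths[max(0, int(0.05 * len(lengths)) - 1)] if lengths else 13
    return max(8, min(13, n))

def dirty_examples(test_examples, training_docs):
    n = choose_n(test_examples)
    train_grams = set()
    for doc in training_docs:
        train_grams |= grams(doc, n)
    return [e for e in test_examples if grams(e, n) & train_grams]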
This overlap metric tends to show a high rate of false positives for datasets that contain background information (but not answers) drawn from the web (such as SQuAD, which draws from Wikipedia) or examples less than 8 words long, which we ignored in our filtering process (except for word scrambling tasks). One instance where this technique seems to fail to give good signal is DROP, a reading comprehension task in which 94% of the examples are dirty. The information required to answer the question is in a passage provided to the model, so having seen the passage during training but not the questions and answers does not meaningfully constitute cheating. We confirmed that every matching training document contained only the source passage, and none of the questions and answers in the dataset. The more likely explanation for the decrease in performance is that the 6% of examples that remain after filtering come from a slightly different distribution than the dirty examples.

Figure 4.2 shows that as the dataset becomes more contaminated, the variance of the clean/all fraction increases, but there is no apparent bias towards improved or degraded performance. This suggests that GPT-3 is relatively insensitive to contamination. See Section 4 for details on the datasets we flagged for further review.

<
>

Table C.1: Overlap statistics for all datasets sorted from dirtiest to cleanest. We consider a dataset example dirty if it has a single N-gram collision with any document in our training corpus. "Relative Difference Clean vs All" shows the percent change in performance between only the clean examples vs all the examples in the benchmark. "Count" shows the number of examples. "Clean percentage" is the percent of examples that are clean vs total. For "Acc/F1/BLEU" we use the metric specified in "Metric". These scores come from evaluations with a different seed for the random examples used for in-context learning, and will therefore differ slightly from the scores elsewhere in the paper.

D Total Compute Used to Train Language Models

This appendix contains the calculations that were used to derive the approximate compute used to train the language models in Figure 2.2. As a simplifying assumption, we ignore the attention operation, as it typically uses less than 10% of the total compute for the models we are analyzing.
Calculations can be seen in Table D.1 and are explained within the table caption.

<
>

Table D.1: Starting from the right hand side and moving left, we begin with the number of training tokens that each model was trained with. Next we note that since T5 uses an encoder-decoder model, only half of the parameters are active for each token during a forward or backwards pass. We then note that each token is involved in a single addition and a single multiply for each active parameter in the forward pass (ignoring attention). Then we add a multiplier of 3x to account for the backwards pass (as computing both ∂params/∂loss and ∂acts/∂loss uses a similar amount of compute as the forwards pass). Combining the previous two numbers, we get the total flops per parameter per token. We multiply this value by the total training tokens and the total parameters to yield the number of total flops used during training. We report both flops and petaflop/s-days (one petaflop/s-day is 8.64e+19 flops).

E Human Quality Assessment of Synthetic News Articles

This appendix contains details on the experiments measuring human ability to distinguish GPT-3-generated synthetic news articles from real news articles. We first describe the experiments on the 200 word news articles, and then describe the preliminary investigation of 500 word news articles generated by GPT-3.

Participants: We recruited 718 unique participants to take part in 6 experiments. 97 participants were excluded for failing an internet check question, leaving a total of 621 participants: 343 male, 271 female, and 7 other. Mean participant age was 38 years old. All participants were recruited through Positly, which maintains a whitelist of high-performing workers from Mechanical Turk. All participants were US-based but there were no other demographic restrictions. Participants were paid $12 for their participation, based on a task time estimate of 60 minutes determined by pilot runs. In order to ensure that the sample of participants for each experiment quiz was unique, participants were not allowed to take part in an experiment more than once.

Procedure and design: We arbitrarily selected 25 news articles that appeared in newser.com in early 2020. We used the article titles and subtitles to produce outputs from the 125M, 350M, 760M, 1.3B, 2.7B, 6.7B, 13.0B, and 200B (GPT-3) parameter language models. Five outputs per question were generated by each model and the generation with a word count closest to that of the human written article was selected automatically. This was to minimize the effect that completion length might have on participants' judgments. The same output procedure was used for each model, with the exception of the removal of the intentionally bad control model, as described in the main text.

<
>

Table E.1: Participant details and article lengths for each experiment to evaluate human detection of 200 word model generated news articles. Participants were excluded due to internet check fails.

<
>

Figure E.1: Participants spend more time trying to identify whether each news article is machine generated as model size increases. Duration on the control model is indicated with the dashed line. Line of best fit is a linear model on a log scale with 95% confidence intervals.


In each experiment, half of the participants were randomly assigned to quiz A and half were randomly assigned to quiz B. Each quiz consisted of 25 articles: half (12-13) were human written and half (12-13) were model generated; the articles with human written completions in quiz A had model generated completions in quiz B and vice versa. The order of quiz questions was shuffled for each participant. Participants could leave comments and were asked to indicate if they had seen the articles before. Participants were instructed not to look up the articles or their content during the quiz and at the end of the quiz were asked if they had looked anything up during the quiz.

Statistical Tests: To compare means on the different runs, we performed a two-sample t-test for independent groups for each model against the control. This was implemented in Python using the scipy.stats.ttest_ind function. When plotting a regression line in the graph of average participant accuracy vs model size, we fit a power law of the form ax^b. The 95% confidence intervals were estimated from the t-distribution of the sample mean.
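The following is a small sketch of these two analyses (not the original analysis code). All numeric inputs are made-up placeholders, not results from the experiments, and the power law ax^b is fit equivalently as a straight line in log-log space.

# Illustrative sketch of the statistical tests described above (assumed data).
import numpy as np
from scipy import stats

# Two-sample t-test: one model's per-participant accuracies vs. the control model's.
control_acc = np.array([0.90, 0.86, 0.88, 0.92, 0.85])   # assumed placeholder values
model_acc = np.array([0.55, 0.60, 0.50, 0.58, 0.62])      # assumed placeholder values
t_stat, p_value = stats.ttest_ind(model_acc, control_acc)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")

# Power-law fit accuracy ~ a * size^b, done as a linear fit in log-log space.
sizes = np.array([1.25e8, 3.5e8, 7.6e8, 1.3e9, 2.7e9, 6.7e9, 1.3e10, 1.75e11])  # parameters
mean_acc = np.array([0.76, 0.75, 0.72, 0.71, 0.69, 0.62, 0.55, 0.52])           # assumed
b, log_a = np.polyfit(np.log(sizes), np.log(mean_acc), 1)
print(f"fit: accuracy ~ {np.exp(log_a):.2f} * size^{b:.3f}")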
Duration statistics: In the main text, we discussed the finding that the ability of human participants to distinguish model and human generated news articles decreases as our models become larger. We have also found that the average time spent for a given set of questions increases as the model size increases, as shown in Figure E.1. Lower accuracy scores despite increased time investment from participants support the finding that larger models generate harder-to-distinguish news articles.

<>

Table E.2: Participant details and article lengths for the experiments investigating human detection of 500 word model generated news articles. Participants were excluded due to internet check fails.

Preliminary investigation of 500 word articles: We recruited 160 unique US-based participants to take part in 2 experiments through Positly (details are given in Table E.2). We randomly selected 12 Reuters world news articles from late 2019 and created a context for GPT-3 175B that consisted of a single Reuters article not in this set of 12. We then used the article titles and Reuters locations to generate completions from GPT-3 175B and the 160M control model from the previous experiments. These were used to create two 12-question quizzes per model, each consisting of half human written and half model generated articles. Comprehension questions were added and articles were shown to participants in 3 stages at 30 second intervals to encourage closer reading. Participants were paid $12 for this task. Model generation selection methods, exclusion criteria, and statistical tests mirror those of the previous experiments.

F Additional Samples from GPT-3

GPT-3 adapts well to many tasks other than the ones explored in the main body of the paper. As an example, in Figure F.1, we show four uncurated samples from a prompt suggesting that the model write a poem, with a given title, in the style of Wallace Stevens. We first experimented with a few prompts, then generated four samples with no additional editing or selection (sampling at temperature 1 using nucleus sampling [HBFC19] with P = 0.9). Completions were truncated when the model began to write a new title and author heading, or broke into prose commentary.
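For reference, the sampling configuration described above (temperature 1, nucleus sampling with P = 0.9) corresponds to a call like the sketch below; GPT-2 and the prompt text are stand-ins, since this is not the generation code used for the paper.

# Sketch of nucleus (top-p) sampling at temperature 1 (assumed model and prompt).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Shadows on the Way\nBY WALLACE STEVENS\n"   # illustrative poem-style prompt
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,        # sample rather than greedy decode
    temperature=1.0,
    top_p=0.9,             # nucleus sampling threshold
    max_new_tokens=120,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))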
<>

Figure F.1: Four uncurated completions from a context suggesting the model compose a poem in the style of Wallace Stevens with the title 'Shadows on the Way'.



G Details of Task Phrasing and Specifications

The following figures illustrate the formatting and phrasing of all the tasks included in the paper. All data comes from the ground truth datasets in this section, and no samples from GPT-3 are included here.

<
>

Figure G.1: Formatted dataset example for RACE-h. When predicting, we normalize by the unconditional probability of each answer as described in 2.

<
> + + Figure G.4:Formatted dataset example for PIQA + + <
> + + Figure G.5:Formatted dataset example for COPA + + <
> + + Figure G.6:Formatted dataset example for ReCoRD. We consider the context above to be a single ”problem” because + this is how the task is presented in the ReCoRD dataset and scored in the ReCoRD evaluation script. + + <
>

Figure G.8: Formatted dataset example for OpenBookQA. When predicting, we normalize by the unconditional probability of each answer as described in 2.

Context → Making a cake: Several cake pops are shown on a display. A woman and girl
are shown making the cake pops in a kitchen. They
Correct Answer → bake them, then frost and decorate.
Incorrect Answer → taste them as they place them on plates.
Incorrect Answer → put the frosting on the cake as they pan it.
Incorrect Answer → come out and begin decorating the cake as well.

Figure G.9: Formatted dataset example for HellaSwag

<
> + + Figure G.10:Formatted dataset example for ANLI R3 + + <
>

Figure G.11: Formatted dataset example for ARC (Challenge). When predicting, we normalize by the unconditional probability of each answer as described in 2.

<
> + + Figure G.12:Formatted dataset example for SAT Analogies + + <
> + + Figure G.14:Formatted dataset example for Winogrande. The ‘partial’ evaluation method we use compares the + probability of the completion given a correct and incorrect context. + + + <
>

Figure G.15: Formatted dataset example for MultiRC. There are three levels within MultiRC: (1) the passage, (2) the questions, and (3) the answers. During evaluation, accuracy is determined at the per-question level, with a question being considered correct if and only if all the answers within the question are labeled correctly. For this reason, we use K to refer to the number of questions shown within the context.

<
> + + Figure G.16:Formatted dataset example for ARC (Easy). When predicting, we normalize by the unconditional + probability of each answer as described in 2. + + <
> + + Figure G.17:Formatted dataset example for StoryCloze + + <
> + + Figure G.18:Formatted dataset example for CoQA + + <
> + + Figure G.24:Formatted dataset example for Natural Questions + + <
> + + Figure G.26:Formatted dataset example for Symbol Insertion + + <
> + + Figure G.30:Formatted dataset example for CB + + <
> + + Figure G.32:Formatted dataset example for WiC + + <
>

Figure G.36: Formatted dataset example for De→En. This is the format for one- and few-shot learning; for this and other language tasks, the format for zero-shot learning is "Q: What is the {language} translation of {sentence} A: {translation}."

<
> + + Figure G.49:Formatted dataset example for Arithmetic 4D+ + + <
> + + Figure G.50:Formatted dataset example for Arithmetic 5D + + <
> + + Figure G.51:Formatted dataset example for Arithmetic 5D+ + + + + + + + + + + + + H Results on All Tasks for All Model Sizes + + <
> + + Table H.1:Scores for every task, setting and model that we investigate in this paper. + + <
> + + Figure H.1:All results for all SuperGLUE tasks. + + <
> <
> + + Figure H.2:Results for SAT task. Figure H.3:All results for all Winograd tasks. + + <
> + + Figure H.4:All results for all Arithmetic tasks. + + <
> + + Figure H.5:All results for all Cloze and Completion tasks. + + <
> + + Figure H.6:All results for all Common Sense Reasoning tasks. + + <
> + + Figure H.7:All results for all QA tasks. + + <
> + + Figure H.8:All results for all Reading Comprehension tasks. + + <
> + + Figure H.9:All results for all ANLI rounds. + + <
> + + Figure H.10:All results for all Scramble tasks. + + <
> + + Figure H.11:All results for all Translation tasks. + + + References + + [ADG + 16]Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, + Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. + InAdvances in neural information processing systems, pages 3981–3989, 2016. + [AI19]WeChat AI. Tr-mt (ensemble), December 2019. + [AJF19]Roee Aharoni, Melvin Johnson, and Orhan Firat. Massively multilingual neural machine translation. In + Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational + Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019. + [BBDIW20]Su Lin Blodgett, Solon Barocas, Hal Daume III, and Hanna Wallach. Language (technology) is power:´ + A critical survey of “bias” in nlp.arXiv preprint arXiv:2005.14050, 2020. + [BCFL13]Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from + question-answer pairs. InProceedings of the 2013 conference on empirical methods in natural language + processing, pages 1533–1544, 2013. + [BES10]Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. Sentiwordnet 3.0: an enhanced lexical + resource for sentiment analysis and opinion mining. InLrec, volume 10, pages 2200–2204, 2010. + [BHT + 20]Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella + Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, et al. Experience grounds language. + arXiv preprint arXiv:2004.10151, 2020. + [BLC13]Yoshua Bengio, Nicholas Leonard, and Aaron C. Courville. Estimating or propagating gradients through´ + stochastic neurons for conditional computation.Arxiv, 2013. + [BZB + 19]Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about + physical commonsense in natural language.arXiv preprint arXiv:1911.11641, 2019. + [Car97]Rich Caruana. Multitask learning.Machine learning, 28(1), 1997. + [CB78]Susan Carey and Elsa Bartlett. Acquiring a single new word.Proceedings of the Stanford Child Language + Conference, 1978. + [CCE + 18]Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and + Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.ArXiv, + abs/1803.05457, 2018. + [CGRS19]Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse + transformers, 2019. + [CHI + 18]Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke + Zettlemoyer. Quac : Question answering in context.Arxiv, 2018. + [CLY + 19]Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and + Jingjing Liu. Uniter: Learning universal image-text representations.arXiv preprint arXiv:1909.11740, + 2019. + [Cra17]Kate Crawford. The trouble with bias.NIPS 2017 Keynote, 2017. + [DCLT18]Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep + bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805, 2018. + [DGV + 18]Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal + transformers.Arxiv, 2018. + [DHKH14] Nadir Durrani, Barry Haddow, Philipp Koehn, and Kenneth Heafield. Edinburgh’s phrase-based machine + translation systems for wmt-14. InProceedings of the Ninth Workshop on Statistical Machine Translation, + pages 97–104, 2014. 
+ [DL15]Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. InAdvances in neural information + processing systems, 2015. + [DSC + 16]Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. Rl 2 : Fast + reinforcement learning via slow reinforcement learning.ArXiv, abs/1611.02779, 2016. + [DWD + 19]Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. + Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs.arXiv preprint + arXiv:1903.00161, 2019. + [DYY + 19]Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. + Transformer-xl: Attentive language models beyond a fixed-length context.Arxiv, 2019. + [EOAG18]Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. + arXiv preprint arXiv:1808.09381, 2018. + [FAL17]Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of + deep networks.ArXiv, abs/1703.03400, 2017. + [Fyo00]Yaroslav Fyodorov. A natural logic inference system, 2000. + [GG19]Hila Gonen and Yoav Goldberg. Lipstick on a pig: Debiasing methods cover up systematic gender biases + in word embeddings but do not remove them.arXiv preprint arXiv:1903.03862, 2019. + [GLT + 20] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval- + augmented language model pre-training.arXiv preprint arXiv:2002.08909, 2020. + [Gra16]Alex Graves. Adaptive computation time for recurrent neural networks.Arxiv, 2016. + [GSL + 18]Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R Bowman, and Noah A + Smith. Annotation artifacts in natural language inference data.arXiv preprint arXiv:1803.02324, 2018. + [GSR19]Sebastian Gehrmann, Hendrik Strobelt, and Alexander M. Rush. Gltr: Statistical detection and visualiza- + tion of generated text.arXiv preprint arXiv: 1906.04043, 2019. + [GWC + 18]Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho, and Victor OK Li. Meta-learning for low-resource + neural machine translation.arXiv preprint arXiv:1808.08437, 2018. + [HB20]Daniel Hernandez and Tom Brown. Ai and efficiency, May 2020. + [HBFC19]Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. + CoRR, abs/1904.09751, 2019. + [HLW + 20]Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, and Dawn Song. + Pretrained transformers improve out of distribution robustness.arXiv preprint arXiv:2004.06100, 2020. + [HNA + 17]Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. + Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. + arXiv preprint arXiv:1712.00409, 2017. + [HR18] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification.arXiv + preprint arXiv:1801.06146, 2018. + [HVD15]Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv + preprint arXiv:1503.02531, 2015. + [HYC01]Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to Learn Using Gradient Descent. + InInternational Conference on Artificial Neural Networks, pages 87–94. Springer, 2001. + [HZJ + 19]Po-Sen Huang, Huan Zhang, Ray Jiang, Robert Stanforth, Johannes Welbl, Jack Rae, Vishal Maini, + Dani Yogatama, and Pushmeet Kohli. 
Reducing sentiment bias in language models via counterfactual + evaluation.arXiv preprint arXiv:1911.03064, 2019. + [IBGC + 14]Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daume III. A neural ´ + network for factoid question answering over paragraphs. InEmpirical Methods in Natural Language + Processing, 2014. + [IDCBE19]Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. Automatic detection of + generated text is easiest when humans are fooled.arXiv preprint arXiv:1911.00650, 2019. + [JCWZ17]Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly + supervised challenge dataset for reading comprehension.arXiv preprint arXiv:1705.03551, 2017. + [JN20]Zheng Junyuan and Gamma Lab NYC. Numeric transformer - albert, March 2020. + [JVS + 16]Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits + of language modeling.arXiv preprint arXiv:1602.02410, 2016. + [JYS + 19]Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. + TinyBERT: Distilling BERT for natural language understanding.arXiv preprint arXiv:1909.10351, 2019. + [JZC + 19]Ying Ju, Fubang Zhao, Shijie Chen, Bowen Zheng, Xuefeng Yang, and Yunfeng Liu. Technical report on + conversational question answering.arXiv preprint arXiv:1909.10772, 2019. + [KKS + 20]Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. + Unifiedqa: Crossing format boundaries with a single qa system.arXiv preprint arXiv:2005.00700, 2020. + [KMB20]Sarah E. Kreps, Miles McCain, and Miles Brundage. All the news that’s fit to fabricate: Ai-generated + text as a tool of media misinformation, 2020. + [KMH + 20]Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott + Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. + [KPR + 19]Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, + Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, + Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural ques- + tions: a benchmark for question answering research.Transactions of the Association of Computational + Linguistics, 2019. + [KR16]Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation.Arxiv, 2016. + [LB02]Edward Loper and Steven Bird. Nltk: The natural language toolkit, 2002. + [LC19]Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining.arXiv preprint + arXiv:1901.07291, 2019. + [LCG + 19]Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Sori- + cut. ALBERT: A lite BERT for self-supervised learning of language representations.arXiv preprint + arXiv:1909.11942, 2019. + [LCH + 20]Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. + Adversarial training for large neural language models.arXiv preprint arXiv:2004.08994, 2020. + [LDL19]Zhongyang Li, Xiao Ding, and Ting Liu. Story ending prediction by transferable bert.arXiv preprint + arXiv:1905.07504, 2019. + [LDM12]Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. InThirteenth + International Conference on the Principles of Knowledge Representation and Reasoning, 2012. 
<> <> <>


<> <> <>
Learning both Weights and Connections for Efficient Neural Networks

Song Han (Stanford University, songhan@stanford.edu)
Jeff Pool (NVIDIA, jpool@nvidia.com)
John Tran (NVIDIA, johntran@nvidia.com)
William J. Dally (Stanford University and NVIDIA, dally@stanford.edu)

Abstract

Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems. Also, conventional networks fix the architecture before training starts; as a result, training cannot improve the architecture. To address these limitations, we describe a method to reduce the storage and computation required by neural networks by an order of magnitude without affecting their accuracy, by learning only the important connections. Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine-tune the weights of the remaining connections. On the ImageNet dataset, our method reduced the number of parameters of AlexNet by a factor of 9x, from 61 million to 6.7 million, without incurring accuracy loss. Similar experiments with VGG-16 found that the total number of parameters can be reduced by 13x, from 138 million to 10.3 million, again with no loss of accuracy.

1 Introduction

Neural networks have become ubiquitous in applications ranging from computer vision [1] to speech recognition [2] and natural language processing [3]. We consider convolutional neural networks used for computer vision tasks, which have grown over time. In 1998 LeCun et al. designed a CNN model, LeNet-5, with less than 1M parameters to classify handwritten digits [4], while in 2012, Krizhevsky et al. [1] won the ImageNet competition with 60M parameters. Deepface classified human faces with 120M parameters [5], and Coates et al. [6] scaled up a network to 10B parameters.
While these large neural networks are very powerful, their size consumes considerable storage, memory bandwidth, and computational resources. For embedded mobile applications, these resource demands become prohibitive. Figure 1 shows the energy cost of basic arithmetic and memory operations in a 45nm CMOS process. From this data we see that the energy per connection is dominated by memory access and ranges from 5pJ for 32-bit coefficients in on-chip SRAM to 640pJ for 32-bit coefficients in off-chip DRAM [7]. Large networks do not fit in on-chip storage and hence require the more costly DRAM accesses. Running a 1-billion-connection neural network, for example, at 20Hz would require (20Hz)(1G)(640pJ) = 12.8W just for DRAM access - well beyond the power envelope of a typical mobile device. Our goal in pruning networks is to reduce the energy required to run such large networks so they can run in real time on mobile devices. The model size reduction from pruning also facilitates storage and transmission of mobile applications incorporating DNNs.

<
> + + Figure 1: Energy table for 45nm CMOS process [7]. Memory access is 3 orders of magnitude more + energy expensive than simple arithmetic. + + To achieve this goal, we present a method to prune network connections in a manner that preserves the + original accuracy. After an initial training phase, we remove all connections whose weight is lower + than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This first + phase learns the topology of the networks — learning which connections are important and removing + the unimportant connections. We then retrain the sparse network so the remaining connections can + compensate for the connections that have been removed. The phases of pruning and retraining may + be repeated iteratively to further reduce network complexity. In effect, this training process learns + the network connectivity in addition to the weights - much as in the mammalian brain [8][9], where + synapses are created in the first few months of a child’s development, followed by gradual pruning of + little-used connections, falling to typical adult values. + + + 2 Related Work + + + Neural networks are typically over-parameterized, and there is significant redundancy for deep learn- + ing models [10]. This results in a waste of both computation and memory. There have been various + proposals to remove the redundancy: Vanhouckeet al.[11] explored a fixed-point implementation + with 8-bit integer (vs 32-bit floating point) activations. Dentonet al. [12] exploited the linear + structure of the neural network by finding an appropriate low-rank approximation of the parameters + and keeping the accuracy within 1% of the original model. With similar accuracy loss, Gonget al. + [13] compressed deep convnets using vector quantization. These approximation and quantization + techniques are orthogonal to network pruning, and they can be used together to obtain further gains + [14]. + There have been other attempts to reduce the number of parameters of neural networks by replacing + the fully connected layer with global average pooling. The Network in Network architecture [15] + and GoogLenet [16] achieves state-of-the-art results on several benchmarks by adopting this idea. + However, transfer learning, i.e. reusing features learned on the ImageNet dataset and applying them + to new tasks by only fine-tuning the fully connected layers, is more difficult with this approach. This + problem is noted by Szegedyet al.[16] and motivates them to add a linear layer on the top of their + networks to enable transfer learning. + Network pruning has been used both to reduce network complexity and to reduce over-fitting. An + early approach to pruning was biased weight decay [17]. Optimal Brain Damage [18] and Optimal + Brain Surgeon [19] prune networks to reduce the number of connections based on the Hessian of the + loss function and suggest that such pruning is more accurate than magnitude-based pruning such as + weight decay. However, second order derivative needs additional computation. + HashedNets [20] is a recent technique to reduce model sizes by using a hash function to randomly + group connection weights into hash buckets, so that all connections within the same hash bucket + share a single parameter value. This technique may benefit from pruning. As pointed out in Shiet al. + [21] and Weinbergeret al.[22], sparsity will minimize hash collision making feature hashing even + more effective. 
HashedNets may be used together with pruning to give even better parameter savings. + + <
>

Figure 3: Synapses and neurons before and after pruning.

<
>

Figure 2: Three-Step Training Pipeline.

3 Learning Connections in Addition to Weights

Our pruning method employs a three-step process, as illustrated in Figure 2, which begins by learning the connectivity via normal network training. Unlike conventional training, however, we are not learning the final values of the weights, but rather we are learning which connections are important. The second step is to prune the low-weight connections. All connections with weights below a threshold are removed from the network, converting a dense network into a sparse network, as shown in Figure 3. The final step retrains the network to learn the final weights for the remaining sparse connections. This step is critical. If the pruned network is used without retraining, accuracy is significantly impacted.

3.1 Regularization

Choosing the correct regularization impacts the performance of pruning and retraining. L1 regularization penalizes non-zero parameters, resulting in more parameters near zero. This gives better accuracy after pruning, but before retraining. However, the remaining connections are not as good as with L2 regularization, resulting in lower accuracy after retraining. Overall, L2 regularization gives the best pruning results. This is further discussed in the experiments section.

3.2 Dropout Ratio Adjustment

Dropout [23] is widely used to prevent over-fitting, and this also applies to retraining. During retraining, however, the dropout ratio must be adjusted to account for the change in model capacity. In dropout, each parameter is probabilistically dropped during training, but will come back during inference. In pruning, parameters are dropped forever after pruning and have no chance to come back during both training and inference. As the parameters get sparse, the classifier will select the most informative predictors and thus have much less prediction variance, which reduces over-fitting. As pruning already reduced model capacity, the retraining dropout ratio should be smaller.
Quantitatively, let C_i be the number of connections in layer i, C_io for the original network, C_ir for the network after retraining, and N_i be the number of neurons in layer i. Since dropout works on neurons, and C_i varies quadratically with N_i according to Equation 1, the dropout ratio after pruning the parameters should follow Equation 2, where D_o represents the original dropout rate and D_r the dropout rate during retraining.
<C_i = N_i N_{i-1}> (1)
<D_r = D_o sqrt(C_ir / C_io)> (2)

3.3 Local Pruning and Parameter Co-adaptation

During retraining, it is better to retain the weights from the initial training phase for the connections that survived pruning than it is to re-initialize the pruned layers. CNNs contain fragile co-adapted features [24]: gradient descent is able to find a good solution when the network is initially trained, but not after re-initializing some layers and retraining them. So when we retrain the pruned layers, we should keep the surviving parameters instead of re-initializing them.

Table 1: Network pruning can save 9x to 13x parameters with no drop in predictive performance.

<
> + + + Retraining the pruned layers starting with retained weights requires less computation because we + don’t have to back propagate through the entire network. Also, neural networks are prone to suffer + the vanishing gradient problem [25] as the networks get deeper, which makes pruning errors harder to + recover for deep networks. To prevent this, we fix the parameters for CONV layers and only retrain + the FC layers after pruning the FC layers, and vice versa. + + 3.4 Iterative Pruning + + Learning the right connections is an iterative process. Pruning followed by a retraining is one iteration, + after many such iterations the minimum number connections could be found. Without loss of accuracy, + this method can boost pruning rate from 5% to 9% on AlexNet compared with single-step aggressive + pruning. Each iteration is a greedy search in that we find the best connections. We also experimented + with probabilistically pruning parameters based on their absolute value, but this gave worse results. + + 3.5 Pruning Neurons + + After pruning connections, neurons with zero input connections or zero output connections may be + safely pruned. This pruning is furthered by removing all connections to or from a pruned neuron. + The retraining phase automatically arrives at the result where dead neurons will have both zero input + connections and zero output connections. This occurs due to gradient descent and regularization. + A neuron that has zero input connections (or zero output connections) will have no contribution + to the final loss, leading the gradient to be zero for its output connection (or input connection), + respectively. Only the regularization term will push the weights to zero. Thus, the dead neurons will + be automatically removed during retraining. + + 4 Experiments + + We implemented network pruning in Caffe [26]. Caffe was modified to add a mask which disregards + pruned parameters during network operation for each weight tensor. The pruning threshold is chosen + as a quality parameter multiplied by the standard deviation of a layer’s weights. We carried out the + experiments on Nvidia TitanX and GTX980 GPUs. + We pruned four representative networks: Lenet-300-100 and Lenet-5 on MNIST, together with + AlexNet and VGG-16 on ImageNet. The network parameters and accuracy 1 before and after pruning + are shown in Table 1. + + 4.1 LeNet on MNIST + + We first experimented on MNIST dataset with the LeNet-300-100 and LeNet-5 networks [4]. LeNet- + 300-100 is a fully connected network with two hidden layers, with 300 and 100 neurons each, which + achieves 1.6% error rate on MNIST. LeNet-5 is a convolutional network that has two convolutional + layers and two fully connected layers, which achieves 0.8% error rate on MNIST. After pruning, + the network is retrained with1=10of the original network’s original learning rate. Table 1 shows + 1 Reference model is from Caffe model zoo, accuracy is measured without data augmentation + + Table 2: For Lenet-300-100, pruning reduces the number of weights by 12% and computation by 12%. + + <
> + + Table 3: For Lenet-5, pruning reduces the number of weights by 12% and computation by 6%. + + <
> + + <
> + + Figure 4: Visualization of the first FC layer’s sparsity pattern of Lenet-300-100. It has a banded + structure repeated 28 times, which correspond to the un-pruned parameters in the center of the images, + since the digits are written in the center. + + + pruning saves 12% parameters on these networks. For each layer of the network the table shows (left + to right) the original number of weights, the number of floating point operations to compute that + layer’s activations, the average percentage of activations that are non-zero, the percentage of non-zero + weights after pruning, and the percentage of actually required floating point operations. + An interesting byproduct is that network pruning detects visual attention regions. Figure 4 shows the + sparsity pattern of the first fully connected layer of LeNet-300-100, the matrix size is 784x300. It + has 28 bands, each band’s width 28, corresponding to the 28x28 input pixels. The colored regions + of the figure, indicating non-zero parameters, correspond to the center of the image. Because digits + are written in the center of the image, these are the important parameters. The graph is sparse on the + left and right, corresponding to the less important regions on the top and bottom of the image. After + pruning, the neural network finds the center of the image more important, and the connections to the + peripheral regions are more heavily pruned. + + + 4.2 AlexNet on ImageNet + + We further examine the performance of pruning on the ImageNet ILSVRC-2012 dataset, which + has 1.2M training examples and 50k validation examples. We use the AlexNet Caffe model as the + reference model, which has 61 million parameters across 5 convolutional layers and 3 fully connected + layers. The AlexNet Caffe model achieved a top-1 accuracy of 57.2% and a top-5 accuracy of 80.3%. + The original AlexNet took 75 hours to train on NVIDIA Titan X GPU. After pruning, the whole + network is retrained with1=100of the original network’s initial learning rate. It took 173 hours to + retrain the pruned AlexNet. Pruning is not used when iteratively prototyping the model, but rather + used for model reduction when the model is ready for deployment. Thus, the retraining time is less + a concern. Table 1 shows that AlexNet can be pruned to 1-9% of its original size without impacting + accuracy, and the amount of computation can be reduced by 3%. + + Table 4: For AlexNet, pruning reduces the number of weights by 9% and computation by 3%. + + <
> + + Table 5: For VGG-16, pruning reduces the number of weights by 12% and computation by 5%. + + <
>

4.3 VGG-16 on ImageNet

With promising results on AlexNet, we also looked at a larger, more recent network, VGG-16 [27], on the same ILSVRC-2012 dataset. VGG-16 has far more convolutional layers but still only three fully-connected layers. Following a similar methodology, we aggressively pruned both convolutional and fully-connected layers to realize a significant reduction in the number of weights, shown in Table 5. We used five iterations of pruning and retraining.
The VGG-16 results are, like those for AlexNet, very promising. The network as a whole has been reduced to 7.5% of its original size (13x smaller). In particular, note that the two largest fully-connected layers can each be pruned to less than 4% of their original size. This reduction is critical for real-time image processing, where there is little reuse of fully connected layers across images (unlike batch processing during training).

5 Discussion

The trade-off curve between accuracy and number of parameters is shown in Figure 5. The more parameters pruned away, the lower the accuracy. We experimented with L1 and L2 regularization, with and without retraining, together with iterative pruning, to give five trade-off lines. Comparing solid and dashed lines, the importance of retraining is clear: without retraining, accuracy begins dropping much sooner, with 1/3 of the original connections rather than with 1/10 of the original connections. It is interesting to see that we get the "free lunch" of reducing the connections by 2x without losing accuracy even without retraining, while with retraining we are able to reduce connections by 9x.

<
> + + Figure 5: Trade-off curve for parameter reduction and loss in top-5 accuracy. L1 regularization + performs better than L2 at learning the connections without retraining, while L2 regularization + performs better than L1 at retraining. Iterative pruning gives the best result. + + + <
> + + Figure 6: Pruning sensitivity for CONV layer (left) and FC layer (right) of AlexNet. + + + L1 regularization gives better accuracy than L2 directly after pruning (dotted blue and purple lines) + since it pushes more parameters closer to zero. However, comparing the yellow and green lines shows + that L2 outperforms L1 after retraining, since there is no benefit to further pushing values towards + zero. One extension is to use L1 regularization for pruning and then L2 for retraining, but this did not + beat simply using L2 for both phases. Parameters from one mode do not adapt well to the other. + The biggest gain comes from iterative pruning (solid red line with solid circles). Here we take the + pruned and retrained network (solid green line with circles) and prune and retrain it again. The + leftmost dot on this curve corresponds to the point on the green line at 80% (5% pruning) pruned to + 8%. There’s no accuracy loss at 9%. Not until 10% does the accuracy begin to drop sharply. + Two green points achieve slightly better accuracy than the original model. We believe this accuracy + improvement is due to pruning finding the right capacity of the network and hence reducing overfitting. + Both CONV and FC layers can be pruned, but with different sensitivity. Figure 6 shows the sensitivity + of each layer to network pruning. The figure shows how accuracy drops as parameters are pruned on + a layer-by-layer basis. The CONV layers (on the left) are more sensitive to pruning than the fully + connected layers (on the right). The first convolutional layer, which interacts with the input image + directly, is most sensitive to pruning. We suspect this sensitivity is due to the input layer having only + 3 channels and thus less redundancy than the other convolutional layers. We used the sensitivity + results to find each layer’s threshold: for example, the smallest threshold was applied to the most + sensitive layer, which is the first convolutional layer. + Storing the pruned layers as sparse matrices has a storage overhead of only 15.6%. Storing relative + rather than absolute indices reduces the space taken by the FC layer indices to 5 bits. Similarly, + CONV layer indices can be represented with only 8 bits. + + Table 6: Comparison with other model reduction methods on AlexNet. Data-free pruning [28] + saved only 1-5% parameters with much loss of accuracy. Deep Fried Convnets [29] worked on fully + connected layers only and reduced the parameters by less than 4%. [30] reduced the parameters by + 4% with inferior accuracy. Naively cutting the layer size saves parameters but suffers from 4% loss + of accuracy. [12] exploited the linear structure of convnets and compressed each layer individually, + where model compression on a single layer incurred 0.9% accuracy penalty with biclustering + SVD. + + <
> + + Figure 7: Weight distribution before and after parameter pruning. The right figure has 10% smaller + scale. + + After pruning, the storage requirements of AlexNet and VGGNet are are small enough that all weights + can be stored on chip, instead of off-chip DRAM which takes orders of magnitude more energy to + access (Table 1). We are targeting our pruning method for fixed-function hardware specialized for + sparse DNN, given the limitation of general purpose hardware on sparse computation. + Figure 7 shows histograms of weight distribution before (left) and after (right) pruning. The weight + is from the first fully connected layer of AlexNet. The two panels have different y-axis scales. + The original distribution of weights is centered on zero with tails dropping off quickly. Almost all + parameters are between <>. After pruning the large center region is removed. The + network parameters adjust themselves during the retraining phase. The result is that the parameters + form a bimodal distribution and become more spread across the x-axis, between <>. + + 6 Conclusion + + We have presented a method to improve the energy efficiency and storage of neural networks without + affecting accuracy by finding the right connections. Our method, motivated in part by how learning + works in the mammalian brain, operates by learning which connections are important, pruning + the unimportant connections, and then retraining the remaining sparse network. We highlight our + experiments on AlexNet and VGGNet on ImageNet, showing that both fully connected layer and + convolutional layer can be pruned, reducing the number of connections by 9% to 13% without loss of + accuracy. This leads to smaller memory capacity and bandwidth requirements for real-time image + processing, making it easier to be deployed on mobile systems. + + References + + [1]Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional + neural networks. InAdvances in neural information processing systems, pages 1097–1105, 2012. + [2]Alex Graves and Jurgen Schmidhuber. Framewise phoneme classification with bidirectional lstm and other¨ + neural network architectures.Neural Networks, 18(5):602–610, 2005. + [3]Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. ´ + Natural language processing (almost) from scratch.JMLR, 12:2493–2537, 2011. + [4] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to + document recognition.Proceedings of the IEEE, 86(11):2278–2324, 1998. + [5]Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to + human-level performance in face verification. InCVPR, pages 1701–1708. IEEE, 2014. + [6]Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Ng Andrew. Deep learning with + cots hpc systems. In30th ICML, pages 1337–1345, 2013. + [7]Mark Horowitz. Energy table for 45nm process, Stanford VLSI wiki. + [8] JP Rauschecker. Neuronal mechanisms of developmental plasticity in the cat’s visual system.Human + neurobiology, 3(2):109–114, 1983. + [9]Christopher A Walsh. Peter huttenlocher (1931-2013).Nature, 502(7470):172–172, 2013. + [10] Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. + InAdvances in Neural Information Processing Systems, pages 2148–2156, 2013. + [11]Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networks on cpus. + InProc. 
Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011. + [12]Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure + within convolutional networks for efficient evaluation. InNIPS, pages 1269–1277, 2014. + [13]Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks + using vector quantization.arXiv preprint arXiv:1412.6115, 2014. + [14]Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural network with + pruning, trained quantization and huffman coding.arXiv preprint arXiv:1510.00149, 2015. + [15]Min Lin, Qiang Chen, and Shuicheng Yan. Network in network.arXiv preprint arXiv:1312.4400, 2013. + [16]Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru + Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions.arXiv preprint + arXiv:1409.4842, 2014. + [17]Stephen Jose Hanson and Lorien Y Pratt. Comparing biases for minimal network construction with´ + back-propagation. InAdvances in neural information processing systems, pages 177–185, 1989. + [18]Yann Le Cun, John S. Denker, and Sara A. Solla. Optimal brain damage. InAdvances in Neural Information + Processing Systems, pages 598–605. Morgan Kaufmann, 1990. + [19]Babak Hassibi, David G Stork, et al. Second order derivatives for network pruning: Optimal brain surgeon. + Advances in neural information processing systems, pages 164–164, 1993. + [20]Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. Compressing neural + networks with the hashing trick.arXiv preprint arXiv:1504.04788, 2015. + [21]Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, and SVN Vishwanathan. Hash + kernels for structured data.The Journal of Machine Learning Research, 10:2615–2637, 2009. + [22]Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing + for large scale multitask learning. InICML, pages 1113–1120. ACM, 2009. + [23]Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: + A simple way to prevent neural networks from overfitting.JMLR, 15:1929–1958, 2014. + [24]Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural + networks? InAdvances in Neural Information Processing Systems, pages 3320–3328, 2014. + [25]Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient + descent is difficult.Neural Networks, IEEE Transactions on, 5(2):157–166, 1994. + [26]Yangqing Jia, et al. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint + arXiv:1408.5093, 2014. + [27]Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni- + tion.CoRR, abs/1409.1556, 2014. + [28] Suraj Srinivas and R Venkatesh Babu. Data-free parameter pruning for deep neural networks.arXiv + preprint arXiv:1507.06149, 2015. + [29]Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, and Ziyu Wang. + Deep fried convnets.arXiv preprint arXiv:1412.7149, 2014. + [30]Maxwell D Collins and Pushmeet Kohli. Memory bounded deep convolutional networks.arXiv preprint + arXiv:1412.1442, 2014. 
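To make the three-step pipeline described in Sections 3 and 4 concrete, here is a minimal sketch of magnitude pruning with masked retraining. This is not the authors' Caffe implementation; the PyTorch framing, the toy fully-connected model, and the quality_parameter default are assumptions added purely for illustration.

# Minimal sketch of train -> magnitude-prune -> retrain (illustrative, not the paper's Caffe code).
# The threshold is quality_parameter * std(weights) per layer, as described in Section 4.
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_masks(model, quality_parameter=1.0):
    """Prune each weight tensor: zero entries below the threshold and keep a binary mask."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() > 1:  # prune weight matrices/filters, leave biases alone
            threshold = quality_parameter * param.data.std()
            masks[name] = (param.data.abs() > threshold).float()
            param.data.mul_(masks[name])  # remove the low-weight connections
    return masks

def mask_gradients(model, masks):
    """Keep pruned connections at zero during retraining by zeroing their gradients."""
    for name, param in model.named_parameters():
        if name in masks and param.grad is not None:
            param.grad.mul_(masks[name])

# Usage sketch: normal training happens first (not shown), then prune and retrain
# with a reduced learning rate, as in Section 4.1.
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(),
                      nn.Linear(300, 100), nn.ReLU(),
                      nn.Linear(100, 10))
masks = build_masks(model, quality_parameter=1.0)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))  # stand-in for a real data loader
loss = F.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
mask_gradients(model, masks)  # pruned weights receive no update and stay at zero
optimizer.step()

With plain SGD, a pruned weight starts at zero and receives a zero gradient, so masking the gradients is enough to keep the sparsity pattern fixed during retraining.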
+<> <> <> + + +<> <> <> +Learning Efficient Convolutional Networks through Network Slimming + +Abstract + +The deployment of deep convolutional neural networks (CNNs) in many real world applications is largely hindered by their high computational cost. In this paper, we propose a novel learning scheme for CNNs to simultaneously 1) reduce the model size; 2) decrease the run-time memory footprint; and 3) lower the number of computing operations, without compromising accuracy. This is achieved by en.forcing channel-level sparsity in the network in a simple but effective way. Different from many existing approaches, the proposed method directly applies to modern CNN architectures, introduces minimum overhead to the training process, and requires no special software/hardware accelerators for the resulting models. We call our approach network slimming, which takes wide and large networks as input models, but during training insignificant channels are automatically identified and pruned afterwards, yielding thin and compact models with comparable accuracy. We empirically demonstrate the effectiveness of our approach with several state-of-the-art CNN models, including VGGNet, ResNet and DenseNet, on various image classification datasets. For VGGNet, a multi-pass version of network slimming gives a 20. reduction in model size and a 5. reduction in computing operations. + +1. Introduction + +In recent years, convolutional neural networks (CNNs) have become the dominant approach for a variety of computer vision tasks, e.g., image classification [22], object detection [8], semantic segmentation [26]. Large-scale datasets, high-end modern GPUs and new network architectures allow the development of unprecedented large CNN models. For instance, from AlexNet [22], VGGNet [31] and GoogleNet [34] to ResNets [14], the ImageNet Classification Challenge winner models have evolved from 8 layers to more than 100 layers. +This work was done when Zhuang Liu and Zhiqiang Shen were interns at Intel Labs China. Jianguo Li is the corresponding author. +However, larger CNNs, although with stronger representation power, are more resource-hungry. For instance, a 152-layer ResNet [14] has more than 60 million parameters and requires more than 20 Giga float-point-operations (FLOPs) when inferencing an image with resolution 224. +224. This is unlikely to be affordable on resource con.strained platforms such as mobile devices, wearables or Internet of Things (IoT) devices. +The deployment of CNNs in real world applications are mostly constrained by 1) Model size: CNNs strong representation power comes from their millions of trainable parameters. Those parameters, along with network structure information, need to be stored on disk and loaded into mem.ory during inference time. As an example, storing a typical CNN trained on ImageNet consumes more than 300MB space, which is a big resource burden to embedded devices. +2) Run-time memory: During inference time, the intermediate activations/responses of CNNs could even take more memory space than storing the model parameters, even with batch size 1. This is not a problem for high-end GPUs, but unaffordable for many applications with low computational power. 3) Number of computing operations: The convolution operations are computationally intensive on high resolution images. A large CNN may take several minutes to process one single image on a mobile device, making it un.realistic to be adopted for real applications. 
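As a rough illustration of where the model-size and operation counts above come from, the helper below estimates parameters and multiply-accumulate operations for a single convolutional layer. It is an added example, not part of the paper, and the counting convention (one MAC per weight per output position, FLOPs vs. MACs) is an assumption.

# Back-of-the-envelope cost of one conv layer (illustrative; FLOP conventions vary).
def conv2d_cost(c_in, c_out, k, h_out, w_out):
    weights = c_out * c_in * k * k          # one k x k filter per (input, output) channel pair
    params = weights + c_out                # plus one bias per output channel
    macs = weights * h_out * w_out          # each output position reuses every filter weight once
    return params, macs

# Example: a 3x3 conv with 64 output channels on a 224x224 feature map.
params, macs = conv2d_cost(c_in=3, c_out=64, k=3, h_out=224, w_out=224)
print(f"{params:,} parameters, {macs / 1e9:.3f} GMACs")  # 1,792 parameters, ~0.087 GMACs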
Many works have been proposed to compress large CNNs or directly learn more efficient CNN models for fast inference. These include low-rank approximation [7], network quantization [3, 12] and binarization [28, 6], weight pruning [12], dynamic inference [16], etc. However, most of these methods can only address one or two of the challenges mentioned above. Moreover, some of the techniques require specially designed software/hardware accelerators for execution speedup [28, 6, 12].
Another direction to reduce the resource consumption of large CNNs is to sparsify the network. Sparsity can be imposed on different levels of structure [2, 37, 35, 29, 25], which yields considerable model-size compression and inference speedup. However, these approaches generally re-
> + +Figure 1: We associate a scaling factor (reused from a batch normalization layer) with each channel in convolutional layers. Sparsity regularization is imposed on these scaling factors during training to automatically identify unimportant channels. The channels with small scaling factor values (in orange color) will be pruned (left side). After pruning, we obtain compact models (right side), which are then fine-tuned to achieve comparable (or even higher) accuracy as normally trained full network. +quire special software/hardware accelerators to harvest the gain in memory or time savings, though it is easier than non-structured sparse weight matrix as in [12]. +In this paper, we propose network slimming, a simple yet effective network training scheme, which addresses all the aforementioned challenges when deploying large CNNs under limited resources. Our approach imposes L1 regularization on the scaling factors in batch normalization (BN) layers, thus it is easy to implement without introducing any change to existing CNN architectures. Pushing the val.ues of BN scaling factors towards zero with L1 regularization enables us to identify insignificant channels (or neurons), as each scaling factor corresponds to a specific convolutional channel (or a neuron in a fully-connected layer). This facilitates the channel-level pruning at the followed step. The additional regularization term rarely hurt the performance. In fact, in some cases it leads to higher generalization accuracy. Pruning unimportant channels may sometimes temporarily degrade the performance, but this effect can be compensated by the followed fine-tuning of the pruned network. After pruning, the resulting narrower network is much more compact in terms of model size, run.time memory, and computing operations compared to the initial wide network. The above process can be repeated for several times, yielding a multi-pass network slimming scheme which leads to even more compact network. +Experiments on several benchmark datasets and different network architectures show that we can obtain CNN models with up to 20x mode-size compression and 5x reduction in computing operations of the original ones, while achieving the same or even higher accuracy. Moreover, our method achieves model compression and inference speedup with conventional hardware and deep learning software packages, since the resulting narrower model is free of any sparse storing format or computing operations. + +2. Related Work + +In this section, we discuss related work from five aspects. +Low-rank Decomposition approximates weight matrix in neural networks with low-rank matrix using techniques like Singular Value Decomposition (SVD) [7]. This method works especially well on fully-connected layers, yield.ing 3x model-size compression however without notable speed acceleration, since computing operations in CNN mainly come from convolutional layers. +Weight Quantization. HashNet [3] proposes to quantize the network weights. Before training, network weights are hashed to different groups and within each group weight the value is shared. In this way only the shared weights and hash indices need to be stored, thus a large amount of stor.age space could be saved. [12] uses a improved quantization technique in a deep compression pipeline and achieves 35x to 49x compression rates on AlexNet and VGGNet. How.ever, these techniques can neither save run-time memory nor inference time, since during inference shared weights need to be restored to their original positions. 
+[28, 6] quantize real-valued weights into binary/ternary weights (weight values restricted to {-1, 1} or {-1, 0, 1}). This yields a large amount of model-size saving, and significant speedup could also be obtained given bitwise operation libraries. However, this aggressive low-bit approximation method usually comes with a moderate accuracy loss. +Weight Pruning / Sparsifying. [12] proposes to prune the unimportant connections with small weights in trained neu.ral networks. The resulting network's weights are mostly zeros thus the storage space can be reduced by storing the model in a sparse format. However, these methods can only achieve speedup with dedicated sparse matrix operation libraries and/or hardware. The run-time memory saving is also very limited since most memory space is consumed by the activation maps (still dense) instead of the weights. +In [12], there is no guidance for sparsity during training. +[32] overcomes this limitation by explicitly imposing sparse constraint over each weight with additional gate variables, and achieve high compression rates by pruning connections with zero gate values. This method achieves better compression rate than [12], but suffers from the same drawback. + +Structured Pruning / Sparsifying. Recently, [23] pro.poses to prune channels with small incoming weights in trained CNNs, and then fine-tune the network to regain accuracy. [2] introduces sparsity by random deactivating input-output channel-wise connections in convolutional layers before training, which also yields smaller networks with moderate accuracy loss. Compared with these works, we explicitly impose channel-wise sparsity in the optimization objective during training, leading to smoother channel pruning process and little accuracy loss. +[37] imposes neuron-level sparsity during training thus some neurons could be pruned to obtain compact networks. +[35] proposes a Structured Sparsity Learning (SSL) method to sparsify different level of structures (e.g. filters, channels or layers) in CNNs. Both methods utilize group sparsity regularization during training to obtain structured sparsity. Instead of resorting to group sparsity on convolutional weights, our approach imposes simple L1 sparsity on channel-wise scaling factors, thus the optimization objective is much simpler. +Since these methods prune or sparsify part of the network structures (e.g., neurons, channels) instead of individual weights, they usually require less specialized libraries +(e.g. for sparse computing operation) to achieve inference speedup and run-time memory saving. Our network slimming also falls into this category, with absolutely no special libraries needed to obtain the benefits. +Neural Architecture Learning. While state-of-the-art CNNs are typically designed by experts [22, 31, 14], there are also some explorations on automatically learning network architectures. [20] introduces sub-modular/super.modular optimization for network architecture search with a given resource budget. Some recent works [38, 1] propose to learn neural architecture automatically with reinforcement learning. The searching space of these methods are extremely large, thus one needs to train hundreds of models to distinguish good from bad ones. Network slimming can also be treated as an approach for architecture learning, despite the choices are limited to the width of each layer. 
However, in contrast to the aforementioned methods, network slimming learns network architecture through only a single training process, which is in line with our goal of efficiency. +3. Network slimming +We aim to provide a simple scheme to achieve channel-level sparsity in deep CNNs. In this section, we first discuss the advantages and challenges of channel-level sparsity, and introduce how we leverage the scaling layers in batch normalization to effectively identify and prune unimportant channels in the network. +Advantages of Channel-level Sparsity. As discussed in prior works [35, 23, 11], sparsity can be realized at differ.ent levels, e.g., weight-level, kernel-level, channel-level or layer-level. Fine-grained level (e.g., weight-level) sparsity gives the highest flexibility and generality leads to higher compression rate, but it usually requires special software or hardware accelerators to do fast inference on the sparsified model [11]. On the contrary, the coarsest layer-level sparsity does not require special packages to harvest the inference speedup, while it is less flexible as some whole layers need to be pruned. In fact, removing layers is only effective when the depth is sufficiently large, e.g., more than 50 layers [35, 18]. In comparison, channel-level sparsity provides a nice tradeoff between flexibility and ease of implementation. It can be applied to any typical CNNs or fully-connected networks (treat each neuron as a channel), and the resulting network is essentially a "thinned" version of the unpruned network, which can be Efficiently inferenced on conventional CNN platforms. +Challenges. Achieving channel-level sparsity requires pruning all the incoming and outgoing connections associated with a channel. This renders the method of directly pruning weights on a pre-trained model ineffective, as it is unlikely that all the weights at the input or output end of a channel happen to have near zero values. As reported in [23], pruning channels on pre-trained ResNets can only lead to a reduction of 10% in the number of parameters without suffering from accuracy loss. [35] addresses this problem by enforcing sparsity regularization into the training objective. specifically, they adopt group LASSO to push all the filter weights corresponds to the same channel towards zero simultaneously during training. However, this approach re.quires computing the gradients of the additional regularization term with respect to all the filter weights, which is non.trivial. We introduce a simple idea to address the above challenges, and the details are presented below. +Scaling Factors and Sparsity-induced Penalty. Our idea is introducing a scaling factor . for each channel, which is multiplied to the output of that channel. Then we jointly train the network weights and these scaling factors, with sparsity regularization imposed on the latter. Finally we prune those channels with small factors, and fine-tune the pruned network. specifically, the training objective of our approach is given by + +<> (1) + +where <> denote the train input and target, W denotes the trainable weights, the first sum-term corresponds to the normal training loss of a CNN, <> is a sparsity-induced penalty on the scaling factors, and <> balances the two terms. In our experiment, we choose <>, which is known as + +<
> + +Figure 2: Flow-chart of network slimming procedure. The dotted-line is for the multi-pass/iterative scheme. +L1-norm and widely used to achieve sparsity. Subgradient descent is adopted as the optimization method for the non-smooth L1 penalty term. An alternative option is to replace the L1 penalty with the smooth-L1 penalty [30] to avoid using sub-gradient at non-smooth point. +As pruning a channel essentially corresponds to removing all the incoming and outgoing connections of that chan.nel, we can directly obtain a narrow network (see Figure 1) without resorting to any special sparse computation packages. The scaling factors act as the agents for channel se.lection. As they are jointly optimized with the network weights, the network can automatically identity insignificant channels, which can be safely removed without greatly affecting the generalization performance. +Leveraging the Scaling Factors in BN Layers. Batch normalization [19] has been adopted by most modern CNNs as a standard approach to achieve fast convergence and bet.ter generalization performance. The way BN normalizes the activations motivates us to design a simple and efficient method to incorporates the channel-wise scaling fac.tors. Particularly, BN layer normalizes the internal activations using mini-batch statistics. Let z_in and z_out be the input and output of a BN layer, B denotes the current mini-batch, BN layer performs the following transformation: + + <> + +where <> and <> are the mean and standard deviation val.ues of input activations over <> and <> are trainable affine transformation parameters (scale and shift) which provides the possibility of linearly transforming normalized activations back to any scales. +It is common practice to insert a BN layer after a convolutional layer, with channel-wise scaling/shifting parameters. Therefore, we can directly leverage the . parameters in BN layers as the scaling factors we need for network slimming. It has the great advantage of introducing no overhead to the network. In fact, this is perhaps also the most effective way we can learn meaningful scaling factors for chan.nel pruning. 1), if we add scaling layers to a CNN without BN layer, the value of the scaling factors are not meaning.ful for evaluating the importance of a channel, because both convolution layers and scaling layers are linear transformations. One can obtain the same results by decreasing the scaling factor values while amplifying the weights in the convolution layers. 2), if we insert a scaling layer before a BN layer, the scaling effect of the scaling layer will be completely canceled by the normalization process in BN. 3), if we insert scaling layer after BN layer, there are two consecutive scaling factors for each channel. +Channel Pruning and Fine-tuning. After training under channel-level sparsity-induced regularization, we obtain a model in which many scaling factors are near zero (see Figure 1). Then we can prune channels with near-zero scaling factors, by removing all their incoming and outgoing connections and corresponding weights. We prune channels with a global threshold across all layers, which is defined as a certain percentile of all the scaling factor values. For instance, we prune 70% channels with lower scaling factors by choosing the percentile threshold as 70%. By doing so, we obtain a more compact network with less parameters and run-time memory, as well as less computing operations. +Pruning may temporarily lead to some accuracy loss, when the pruning ratio is high. 
But this can be largely compensated by the followed fine-tuning process on the pruned network. In our experiments, the fine-tuned narrow network can even achieve higher accuracy than the original unpruned network in many cases. +Multi-pass Scheme. We can also extend the proposed method from single-pass learning scheme (training with sparsity regularization, pruning, and fine-tuning) to a multi-pass scheme. specifically, a network slimming procedure results in a narrow network, on which we could again apply the whole training procedure to learn an even more compact model. This is illustrated by the dotted-line in Figure 2. Experimental results show that this multi-pass scheme can lead to even better results in terms of compression rate. +Handling Cross Layer Connections and Pre-activation Structure. The network slimming process introduced above can be directly applied to most plain CNN architectures such as AlexNet [22] and VGGNet [31]. While some adaptations are required when it is applied to modern networks with cross layer connections and the pre-activation design such as ResNet [15] and DenseNet [17]. For these networks, the output of a layer may be treated as the input of multiple subsequent layers, in which a BN layer is placed before the convolutional layer. In this case, the sparsity is achieved at the incoming end of a layer, i.e., the layer selectively uses a subset of channels it received. To harvest the parameter and computation savings at test time, we need to place a channel selection layer to mask out insignificant channels we have identified. + +4. Experiments + +We empirically demonstrate the effectiveness of network slimming on several benchmark datasets. We implement + +<
> + +Table 1: Results on CIFAR and SVHN datasets. "Baseline" denotes normal training without sparsity regularization. In column-1, 60% pruned denotes the fine-tuned model with 60% channels pruned from the model trained with sparsity, etc. The pruned ratio of parameters and FLOPs are also shown in column-4&6. Pruning a moderate amount (40%) of channels can mostly lower the test errors. The accuracy could typically be maintained with  60% channels pruned. +our method based on the publicly available Torch [5] implementation for ResNets by [10]. The code is available at https://github.com/liuzhuang13/slimming. + +4.1. Datasets +CIFAR. The two CIFAR datasets [21] consist of natural im. +ages with resolution 32.32. CIFAR-10 is drawn from 10 and CIFAR-100 from 100 classes. The train and test sets contain 50,000 and 10,000 images respectively. On CIFAR.10, a validation set of 5,000 images is split from the training set for the search of . (in Equation 1) on each model. We report the final test errors after training or fine-tuning on all training images. A standard data augmentation scheme (shifting/mirroring) [14, 18, 24] is adopted. The input data is normalized using channel means and standard deviations. We also compare our method with [23] on CIFAR datasets. +SVHN. The Street View House Number (SVHN) dataset +[27] consists of 32x32 colored digit images. Following common practice [9, 18, 24] we use all the 604,388 training images, from which we split a validation set of 6,000 im.ages for model selection during training. The test set con.tains 26,032 images. During training, we select the model with the lowest validation error as the model to be pruned (or the baseline model). We also report the test errors of the models with lowest validation errors during fine-tuning. +ImageNet. The ImageNet dataset contains 1.2 million training images and 50,000 validation images of 1000 classes. We adopt the data augmentation scheme as in [10]. We report the single-center-crop validation error of the final model. +MNIST. MNIST is a handwritten digit dataset containing 60,000 training images and 10,000 test images. To test the effectiveness of our method on a fully-connected network (treating each neuron as a channel with 1.1 spatial size), we compare our method with [35] on this dataset. + +4.2. Network Models +On CIFAR and SVHN dataset, we evaluate our method on three popular network architectures: VGGNet[31], ResNet [14] and DenseNet [17]. The VGGNet is originally designed for ImageNet classification. For our experiment a variation of the original VGGNet for CIFAR dataset is taken from [36]. For ResNet, a 164-layer pre-activation ResNet with bottleneck structure (ResNet-164) [15] is used. For DenseNet, we use a 40-layer DenseNet with growth rate 12 (DenseNet-40). +On ImageNet dataset, we adopt the 11-layer (8-conv + 3 FC) VGG-A network [31] model with batch normalization from [4]. We remove the dropout layers since we use relatively heavy data augmentation. To prune the neurons in fully-connected layers, we treat them as convolutional channels with 1.1 spatial size. +On MNIST dataset, we evaluate our method on the same 3-layer fully-connected network as in [35]. + +4.3. Training, Pruning and Fine-tuning +Normal Training. We train all the networks normally from scratch as baselines. All the networks are trained using SGD. On CIFAR and SVHN datasets we train using mini-batch size 64 for 160 and 20 epochs, respectively. 
4.2. Network Models
On the CIFAR and SVHN datasets, we evaluate our method on three popular network architectures: VGGNet [31], ResNet [14] and DenseNet [17]. VGGNet was originally designed for ImageNet classification; for our experiments, a variation of the original VGGNet for the CIFAR dataset is taken from [36]. For ResNet, a 164-layer pre-activation ResNet with bottleneck structure (ResNet-164) [15] is used. For DenseNet, we use a 40-layer DenseNet with growth rate 12 (DenseNet-40).
On the ImageNet dataset, we adopt the 11-layer (8 conv + 3 FC) VGG-A network [31] with batch normalization from [4]. We remove the dropout layers since we use relatively heavy data augmentation. To prune the neurons in fully-connected layers, we treat them as convolutional channels with 1×1 spatial size.
On the MNIST dataset, we evaluate our method on the same 3-layer fully-connected network as in [35].

4.3. Training, Pruning and Fine-tuning
Normal Training. We train all the networks normally from scratch as baselines. All the networks are trained using SGD. On the CIFAR and SVHN datasets we train with mini-batch size 64 for 160 and 20 epochs, respectively. The initial learning rate is set to 0.1 and is divided by 10 at 50% and 75% of the total number of training epochs. On the ImageNet and MNIST datasets, we train our models for 60 and 30 epochs respectively, with a batch size of 256 and an initial learning rate of 0.1, which is divided by 10 after 1/3 and 2/3 of the training epochs. We use a weight decay of 10^-4 and a Nesterov momentum [33] of 0.9 without dampening. The weight initialization introduced by [13] is adopted. Our optimization settings closely follow the original implementation at [10]. In all our experiments, we initialize all channel scaling factors to 0.5, since this gives higher accuracy for the baseline models compared with the default setting (all initialized to 1) from [10].
Training with Sparsity. For the CIFAR and SVHN datasets, when training with channel sparse regularization, the hyperparameter λ, which controls the tradeoff between the empirical loss and sparsity, is determined by a grid search over 10^-3, 10^-4, 10^-5 on the CIFAR-10 validation set. For VGGNet we choose λ = 10^-4 and for ResNet and DenseNet λ = 10^-5. For VGG-A on ImageNet, we set λ = 10^-5. All other settings are kept the same as in normal training.
Pruning. When we prune the channels of models trained with sparsity, a pruning threshold on the scaling factors needs to be determined. Unlike [23], where different layers are pruned by different ratios, we use a global pruning threshold for simplicity. The pruning threshold is determined by a percentile among all scaling factors, e.g., 40% or 60% of channels are pruned (see the sketch after this subsection). The pruning process is implemented by building a new, narrower model and copying the corresponding weights from the model trained with sparsity.
Fine-tuning. After pruning we obtain a narrower and more compact model, which is then fine-tuned. On the CIFAR, SVHN and MNIST datasets, fine-tuning uses the same optimization settings as training. For the ImageNet dataset, due to time constraints, we fine-tune the pruned VGG-A with a learning rate of 10^-3 for only 5 epochs.
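The global percentile threshold described above takes only a few lines to compute; the following PyTorch-style sketch assumes the channel scaling factors are the weight parameters of BatchNorm2d modules, and it returns keep/prune masks only — building the narrower model and copying the weights over is omitted.

    import torch
    import torch.nn as nn

    def channel_masks(model, prune_ratio):
        """Keep/prune masks from BN scaling factors using one global
        percentile threshold shared by all layers."""
        gammas = torch.cat([m.weight.data.abs().flatten()
                            for m in model.modules()
                            if isinstance(m, nn.BatchNorm2d)])
        # e.g. prune_ratio = 0.6 prunes the 60% of channels with the
        # smallest scaling factors, across the whole network.
        threshold = torch.quantile(gammas, prune_ratio)
        masks = {}
        for name, m in model.named_modules():
            if isinstance(m, nn.BatchNorm2d):
                masks[name] = m.weight.data.abs() > threshold
        return masks

Using a single threshold shared by all layers is what keeps this step free of per-layer hyper-parameters.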
<
>

Figure 3: Comparison of pruned models with lower test errors on CIFAR-10 than the original models. The blue and green bars are parameter and FLOP ratios between pruned and original models.

4.4. Results
CIFAR and SVHN. The results on CIFAR and SVHN are shown in Table 1. We mark all lowest test errors of a model in boldface.
Parameter and FLOP reductions. The purpose of network slimming is to reduce the amount of computing resources needed. The last row of each model has ≥ 60% of channels pruned while still maintaining accuracy similar to the baseline. The parameter saving can be up to 10×. The FLOP reductions are typically around 50% (see the back-of-the-envelope sketch below). To highlight network slimming's efficiency, we plot the resource savings in Figure 3. It can be observed that VGGNet has a large amount of redundant parameters that can be pruned. On ResNet-164 the parameter and FLOP savings are relatively insignificant; we conjecture this is because its "bottleneck" structure already functions as a form of channel selection. Also, on CIFAR-100 the reduction rate is typically slightly lower than on CIFAR-10 and SVHN, which is possibly due to the fact that CIFAR-100 contains more classes.
Regularization Effect. From Table 1, we can observe that on ResNet and DenseNet, typically when 40% of channels are pruned, the fine-tuned network can achieve a lower test error than the original model. For example, DenseNet-40 with 40% of channels pruned achieves a test error of 5.19% on CIFAR-10, which is almost 1% lower than the original model. We hypothesize this is due to the regularization effect of L1 sparsity on channels, which naturally provides feature selection in the intermediate layers of a network. We will analyze this effect in the next section.
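As a back-of-the-envelope check on where the FLOP savings come from, the helper below counts multiply-accumulates for one convolutional layer; the layer dimensions in the example are made up for illustration and are not taken from Table 1.

    def conv_flops(c_in, c_out, k, h_out, w_out):
        # Multiply-accumulate count of a k x k convolution producing an
        # h_out x w_out output feature map.
        return c_in * c_out * k * k * h_out * w_out

    # Hypothetical layer: 256 -> 256 channels, 3x3 kernel, 32x32 output.
    full = conv_flops(256, 256, 3, 32, 32)
    # Keeping 60% of the channels on both the input and output side:
    pruned = conv_flops(154, 154, 3, 32, 32)
    print(pruned / full)   # ~0.36

Because a pruned channel disappears both as an output of one layer and as an input of the next, the idealized saving is roughly quadratic in the kept ratio; in practice the pruning ratios differ across layers, which is why the overall FLOP reductions reported in Table 1 are closer to 50%.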
<
>

Table 3: Results on MNIST.
ImageNet. The results for the ImageNet dataset are summarized in Table 2. When 50% of channels are pruned, the parameter saving is more than 5×, while the FLOP saving is only 30.4%. This is due to the fact that only 378 (out of 2752) channels from the computation-intensive convolutional layers are pruned, while 5094 neurons (out of 8192) from the parameter-intensive fully-connected layers are pruned. It is worth noting that our method achieves these savings with no accuracy loss on the 1000-class ImageNet dataset, whereas other methods for efficient CNNs [2, 23, 35, 28] mostly report accuracy loss.
MNIST. On the MNIST dataset, we compare our method with the Structured Sparsity Learning (SSL) method [35] in Table 3. Although our method is mainly designed to prune channels in convolutional layers, it also works well for pruning neurons in fully-connected layers. In this experiment, we observe that pruning with a global threshold sometimes completely removes a layer, so we instead prune 80% of the neurons in each of the two intermediate layers. Our method slightly outperforms [35], in that a slightly lower test error is achieved while pruning more parameters.
We provide some additional experimental results in the supplementary materials, including (1) the detailed structure of a compact VGGNet on CIFAR-10; (2) wall-clock time and run-time memory savings in practice; and (3) a comparison with a previous channel pruning method [23].

4.5. Results for Multi-pass Scheme
We employ the multi-pass scheme on the CIFAR datasets using VGGNet. Since there are no skip-connections, pruning away a whole layer would completely destroy the model. Thus, besides setting the percentile threshold to 50%, we also add the constraint that at each layer at most 50% of the channels can be pruned. (A sketch of this iterative procedure is given below.)
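The multi-pass procedure is simply the single-pass pipeline applied repeatedly. A minimal sketch follows, where train_with_sparsity, prune_channels and fine_tune are hypothetical callables standing in for the steps of Sections 3 and 4.3, not functions from the released code.

    def multi_pass_slimming(model, train_with_sparsity, prune_channels, fine_tune,
                            n_iterations, percentile=0.5, per_layer_cap=0.5):
        """Iteratively train with sparsity, prune, and fine-tune. The
        per-layer cap keeps any single layer from being pruned away
        entirely in architectures without skip connections."""
        for _ in range(n_iterations):
            model = train_with_sparsity(model)   # L1 on the BN scaling factors
            model = prune_channels(model,
                                   global_percentile=percentile,
                                   max_ratio_per_layer=per_layer_cap)
            model = fine_tune(model)
        return model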
<
>

Table 4: Results for the multi-pass scheme on the CIFAR-10 and CIFAR-100 datasets, using VGGNet. The baseline model has test errors of 6.34% and 26.74%, respectively. The "Trained" and "Fine-tuned" columns denote the test errors (%) of the model trained with sparsity and of the fine-tuned model after channel pruning, respectively. The parameter and FLOP pruned ratios correspond to the fine-tuned model in that row and the trained model in the next row.
The test errors of the models in each iteration are shown in Table 4. As the pruning process proceeds, we obtain more and more compact models. On CIFAR-10, the trained model achieves the lowest test error in iteration 5. This model achieves a 20× parameter reduction and a 5× FLOP reduction, while still achieving a lower test error. On CIFAR-100, the test error begins to increase after iteration 3. This is possibly because CIFAR-100 contains more classes than CIFAR-10, so pruning channels too aggressively inevitably hurts performance. However, we can still prune nearly 90% of the parameters and nearly 70% of the FLOPs without notable accuracy loss.

5. Analysis

There are two crucial hyper-parameters in network slimming: the pruned percentage t and the coefficient λ of the sparsity regularization term (see Equation 1). In this section, we analyze their effects in more detail.
Effect of Pruned Percentage. Once we obtain a model trained with sparsity regularization, we need to decide what percentage of channels to prune from it. If we prune too few channels, the resource saving is very limited; if we prune too many, it can be destructive to the model, and it may not be possible to recover the accuracy by fine-tuning. We train a DenseNet-40 model with λ = 10^-5 on CIFAR-10 to show the effect of pruning a varying percentage of channels. The results are summarized in Figure 5.
From Figure 5, it can be concluded that the classification performance of the pruned or fine-tuned models degrades only when the pruning ratio surpasses a threshold.
<
>

Figure 4: Distributions of scaling factors in a trained VGGNet under various degrees of sparsity regularization (controlled by the parameter λ). As λ increases, the scaling factors become sparser.
<
>

Figure 5: The effect of pruning varying percentages of channels, from DenseNet-40 trained on CIFAR-10 with λ = 10^-5.
The fine-tuning process can typically compensate for the possible accuracy loss caused by pruning. Only when the pruning ratio goes beyond 80% does the test error of the fine-tuned model fall behind the baseline model. Notably, when trained with sparsity, even without fine-tuning, the model performs better than the original model. This is possibly due to the regularization effect of L1 sparsity on the channel scaling factors.
Channel Sparsity Regularization. The purpose of the L1 sparsity term is to force many of the scaling factors to be near zero. The parameter λ in Equation 1 controls its significance compared with the normal training loss. In Figure 4 we plot the distributions of scaling factors in the whole network for different values of λ. For this experiment we use a VGGNet trained on the CIFAR-10 dataset.
It can be observed that as λ increases, the scaling factors become more and more concentrated near zero. When λ = 0, i.e., there is no sparsity regularization, the distribution is relatively flat. When λ = 10^-4, almost all scaling factors fall into a small region near zero. This process can be seen as feature selection happening in the intermediate layers of deep networks, where only channels with non-negligible scaling factors are chosen. We further visualize this process with a heatmap. Figure 6 shows the magnitude of the scaling factors from one layer in VGGNet along the training process. Each channel starts with equal weights; as the training

<
> + +Figure 6: Visulization of channel scaling factorsfi change in scale along the training process, taken from the 11th conv-layer in VG-GNet trained on CIFAR-10. Brighter color corresponds to larger value. The bright lines indicate the selected channels, the dark lines indicate channels that can be pruned. +progresses, some channels scaling factors become larger (brighter) while others become smaller (darker). + +6. Conclusion + +We proposed the network slimming technique to learn more compact CNNs. It directly imposes sparsity-induced regularization on the scaling factors in batch normalization layers, and unimportant channels can thus be automatically identified during training and then pruned. On multiple datasets, we have shown that the proposed method is able to significantly decrease the computational cost (up to 20.) of state-of-the-art networks, with no accuracy loss. More importantly, the proposed method simultaneously reduces the model size, run-time memory, computing operations while introducing minimum overhead to the training process, and the resulting models require no special libraries/hardware for Efficient inference. + +Acknowledgements. +Gao Huang is supported by the International Postdoctoral Exchange Fellowship Program of China Postdoctoral Council (No.20150015). Changshui Zhang is supported by NSFC and DFG joint project NSFC 61621136008/DFG TRR-169. + +References +[1] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neu.ral network architectures using reinforcement learning. In ICLR, 2017. +[2] S. Changpinyo, M. Sandler, and A. Zhmoginov. The power of sparsity in convolutional neural networks. arXiv preprint arXiv:1702.06257, 2017. +[3] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and +Y. Chen. Compressing neural networks with the hashing trick. In ICML, 2015. +[4] S. Chintala. Training an object classifier in torch-7 on multiple gpus over imagenet. https://github.com/soumith/imagenet-multiGPU.torch. +[5] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011. +[6] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations constrained to+ 1 or-1. arXiv preprint arXiv:1602.02830, 2016. +[7] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fer.gus. Exploiting linear structure within convolutional networks for Efficient evaluation. In NIPS, 2014. +[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich fea.ture hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580fi587, 2014. +[9] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013. +[10] S. Gross and M. Wilber. Training and investigating residual nets. https://github.com/szagoruyko/cifar-torch. +[11] S. Han, H. Mao, and W. J. Dally. Deep compression: Com.pressing deep neural network with pruning, trained quanti.zation and huffman coding. In ICLR, 2016. +[12] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for Efficient neural network. In NIPS, pages 1135fi1143, 2015. +[13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015. +[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. +[15] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, pages 630fi645. Springer, 2016. 
+[16] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger. Multi-scale dense convolutional networks for Efficient prediction. arXiv preprint arXiv:1703.09844, 2017. +[17] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017. +[18] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016. +[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. +[20] J. Jin, Z. Yan, K. Fu, N. Jiang, and C. Zhang. Neural network architecture optimization through submodularity and super-modularity. arXiv preprint arXiv:1609.00074, 2016. +[21] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. In Tech Report, 2009. +[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097fi1105, 2012. +[23] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for Efficient convnets. arXiv preprint arXiv:1608.08710, 2016. +[24] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014. +[25] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni.tion, pages 806fi814, 2015. +[26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431fi 3440, 2015. +[27] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised fea.ture learning, 2011. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011. +[28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor.net: Imagenet classification using binary convolutional neu.ral networks. In ECCV, 2016. +[29] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini. Group sparse regularization for deep neural networks. arXiv preprint arXiv:1607.00485, 2016. +[30] M. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for l1 regularization: A comparative study and two new approaches. In ECML, pages 286fi297, 2007. +[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. +[32] S. Srinivas, A. Subramanya, and R. V. Babu. Training sparse neural networks. CoRR, abs/1611.06694, 2016. +[33] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013. +[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, et al. Going deeper with convolutions. In CVPR, pages 1fi9, 2015. +[35] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In NIPS, 2016. +[36] S. Zagoruyko. 92.5% on cifar-10 in torch. https://github.com/szagoruyko/cifar.torch. +[37] H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards compact cnns. In ECCV, 2016. +[38] B. Zoph and Q. V. Le. Neural architecture search with rein.forcement learning. In ICLR, 2017. 
+<> <> <> + + +<> <> <> + Learning Structured Sparsity in Deep Neural Networks + + Wei Wen Chunpeng Wu Yandan Wang + University of Pittsburgh University of Pittsburgh University of Pittsburgh + wew57@pitt.edu chw127@pitt.edu yaw46@pitt.edu + + Yiran Chen Hai Li + University of Pittsburgh University of Pittsburgh + yic52@pitt.edu hal66@pitt.edu + + Abstract + + High demand for computation resources severely hinders deployment of large-scale + Deep Neural Networks (DNN) in resource constrained devices. In this work, we + propose aStructured Sparsity Learning(SSL) method to regularize the structures + (i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1) + learn a compact structure from a bigger DNN to reduce computation cost; (2) + obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate + the DNN’s evaluation. Experimental results show that SSL achieves on average + 5.1%and 3.1%speedups of convolutional layer computation of AlexNet against + CPU and GPU, respectively, with off-the-shelf libraries. These speedups are about + twice speedups of non-structured sparsity; (3) regularize the DNN structure to + improve classification accuracy. The results show that for CIFAR-10, regularization + on layer depth can reduce 20 layers of a Deep Residual Network ( ResNet ) to + 18 layers while improve the accuracy from 91.25% to 92.60%, which is still + slightly higher than that of original ResNet with 32 layers. For AlexNet , structure + regularization by SSL also reduces the error by%1%. Our source code can be + found athttps://github.com/wenwei202/caffe/tree/scnn + + + 1 Introduction + + Deep neural networks (DNN), especially deep convolutional neural networks (CNN), made + remarkable success in visual tasks[1][2][3][4][5] by leveraging large-scale networks learning from + a huge volume of data. Deployment of such big models, however, is computation-intensive and + memory-intensive. To reduce computation cost, many studies are performed to compress the scale of + DNN, including sparsity regularization[6], connection pruning[7][8] and low rank approximation + [9][10][11][12][13]. Sparsity regularization and connection pruning approaches, however, often pro- + duce non-structured random connectivity in DNN and thus, irregular memory access that adversely + impacts practical acceleration in hardware platforms. Figure 1 depicts practical speedup of each + layer of a AlexNet , which is non-structurally sparsified by l1-norm. Compared to original model, + the accuracy loss of the sparsified model is controlled within 2%. Because of the poor data locality + associated with the scattered weight distribution, the achieved speedups are either very limited or + negative even the actual sparsity is high, say, >95%. We define sparsity as the ratio of zeros in this + paper. In recently proposed low rank approximation approaches, the DNN is trained first and then + each trained weight tensor is decomposed and approximated by a product of smaller factors. Finally, + fine-tuning is performed to restore the model accuracy. Low rank approximation is able to achieve + practical speedups because it coordinates model parameters in dense matrixes and avoids the locality + problem of non-structured sparsity regularization. However, low rank approximation can only obtain + + <
> + + Figure 1: Evaluation speedups of AlexNet on GPU platforms and the sparsity. conv1 refers to + convolutional layer 1, and so forth. Baseline is profiled by GEMM of cuBLAS. The sparse matrixes + are stored in the format of Compressed Sparse Row (CSR) and accelerated by cuSPARSE. + + + the compact structure within each layer, and the structures of the layers are fixed during fine-tuning + such that costly reiterations of decomposing and fine-tuning are required to find an optimal weight + approximation for performance speedup and accuracy retaining. + Inspired by the facts that (1) there is redundancy across filters and channels [11]; (2) shapes of + filters are usually fixed as cuboid but enabling arbitrary shapes can potentially eliminate unnecessary + computation imposed by this fixation; and (3) depth of the network is critical for classification + but deeper layers cannot always guarantee a lower error because of the exploding gradients and + degradation problem [5], we propose Structured Sparsity Learning (SSL) method to directly learn + a compressed structure of deep CNNs by group Lasso regularization during the training. SSL is a + generic regularization to adaptively adjust multiple structures in DNN, including structures of filters, + channels, and filter shapes within each layer, and structure of depth beyond the layers. SSL combines + structure regularization (on DNN for classification accuracy) with locality optimization (on memory + access for computation efficiency), offering not only well-regularized big models with improved + accuracy but greatly accelerated computation (e.g. 5.1% on CPU and 3.1% on GPU for AlexNet ). + + 2 Related works + + Connection pruning and weight sparsifying. Hanet al.[7][8] reduced number of parameters of + AlexNet by 9% andVGG-16by 13% using connection pruning. Since most reduction is achieved + on fully-connected layers, the authors obtained 3% to 4% layer-wise speedup for fully-connected + layers. However, no practical speedups of convolutional layers are observed because of the issue + shown in Figure 1. As convolution is the computational bottleneck and many new DNNs use fewer + fully-connected layers,e.g., only 3.99% parameters of ResNet -152in [5] are from fully-connected + layers, compression and acceleration on convolutional layers become essential. Liuet al.[6] achieved + >90% sparsity of convolutional layers in AlexNet with 2% accuracy loss, and bypassed the issue + shown in Figure 1 by hardcoding the sparse weights into program, achieving layer-wise 4.59% + speedup on a CPU. In this work, we also focus on convolutional layers. Compared to the above + techniques, our SSL method can coordinate sparse weights in adjacent memory space and achieve + higher speedups with the same accuracy. Note that hardware and program optimizations can further + boost the system performance on top of the level of SSL but are not covered in this work. + Low rank approximation. Denilet al.[9] predicted 95% parameters in a DNN by exploiting the + redundancy across filters and channels. Inspired by it, Jaderberget al.[11] achieved 4.5% speedup + on CPUs for scene text character recognition and Dentonet al.[10] achieved 2% speedups on both + CPUs and GPUs for the first two layers. Both of the works usedLow Rank Approximation(LRA) + with%1% accuracy drop. [13][12] improved and extended LRA to larger DNNs. 
However, the + network structure compressed by LRA is fixed; reiterations of decomposing, training/fine-tuning, + and cross-validating are still needed to find an optimal structure for accuracy and speed trade-off. + As number of hyper-parameters in LRA method increases linearly with layer depth [10][13], the + search space increases linearly or even polynomially for very deep DNNs. Comparing to LRA, our + contributions are: (1) SSL can dynamically optimize the compactness of DNN structure with only + one hyper-parameter and no reiterations; (2) besides the redundancy within the layers, SSL also + exploits the necessity of deep layers and reduce them; (3) DNN filters regularized by SSL have lower + rank approximation, so it can work together with LRA for more efficient model compression. + Model structure learning.Group Lasso [14] is an efficient regularization to learn sparse structures. + Kimet al.[15] used group Lasso to regularize the structure of correlation tree for multi-task regression + problem and reduced prediction errors. Liuet al.[6] utilized group Lasso to constrain the scale + + <> + + <> + + Figure 2: The proposed structured sparsity learning (SSL) for DNNs. Weights in filters are split W(l) + into multiple groups. Through group Lasso regularization, a more compact DNN is obtained by :,c l ,:,: + removing some groups. The figure illustrates the filter-wise, channel-wise, shape-wise, and depth-wise + structured sparsity that were explored in the work. + + <> + + of the structure of LRA. To adapt DNN structure to different databases, Fenget al.[16] learned + the appropriate number of filters in DNN. Different from these prior arts, we apply group Lasso to + regularize multiple DNN structures (filters, channels, filter shapes, and layer depth). Our source code + can be found at https://github.com/wenwei202/caffe/tree/scnn. + + + 3 Structured Sparsity Learning Method for DNNs + + We focus mainly on theStructured Sparsity Learning(SSL) on convolutional layers to regularize the + structure of DNNs. We first propose a generic method to regularize structures of DNN in Section 3.1, 1 + and then specify the method to structures of filters, channels, filter shapes and depth in section 3.2. + Variants of formulations are also discussed from computational efficiency viewpoint in Section 3.3. + + 3.1 Proposed structured sparsity learning for generic structures + Suppose weights of convolutional layers in a DNN form a sequence of 4-D tensors + + <>, where <> and <> are the dimensions of the l-th + weight tensor along the axes of filter, channel, spatial height and spatial width, respectively. + L denotes the number of convolutional layers. Then the proposed generic optimization target of a DNN with + structured sparsity regularization can be formulated as: 1 + + <> (1) + + Here W represents the collection of all weights in the <> is the loss on data <> is + non-structured regularization applying on every weight,e.g., l2-norm; and <> is the structured + sparsity regularization on each layer. Because Group Lasso can effectively zero out all weights in + some groups [14][15], we adopt it in our SSL. The regularization of group Lasso on a set of weights + Pw can be represented as <>, where <> is a group of partial weights in w + and G is the total number of groups. Different groups may overlap. Here <>, where + <> the number of weights in <>. 
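As a concrete reading of this regularizer, the PyTorch-style sketch below computes the group Lasso penalty as a sum of L2 norms over an arbitrary collection of weight groups; how the groups are formed (filters, channels, shape fibers, or whole layers) is exactly what the next subsection specifies, and any per-group weighting is omitted for brevity.

    import torch

    def group_lasso_penalty(groups):
        """R_g(w) = sum over groups g of ||w^(g)||_2. `groups` is an
        iterable of tensors (or tensor views), one per group; a group
        whose norm is driven to zero can be removed from the network."""
        return sum(g.norm(p=2) for g in groups)

    # Example (anticipating Section 3.2): filter-wise groups for a conv
    # weight tensor W of shape (N_l, C_l, M_l, K_l), one group per filter.
    W = torch.randn(64, 32, 3, 3, requires_grad=True)
    penalty = group_lasso_penalty(W[n] for n in range(W.shape[0]))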
+ + 3.2 Structured sparsity learning for structures of filters, channels, filter shapes and depth + + In SSL, the learned “structure” is decided by the way of splitting groups ofw(g) . We investigate and + formulate thefiler-wise,channel-wise,shape-wise, and depth-wise structured sparsity in Figure 2. + For simplicity, the <> term of Eq. (1) is omitted in the following formulation expressions. + Penalizing unimportant filers and channels. Suppose <> is then l-th filter and <> is the + cl-th channel of all filters in the l-th layer. The optimization target of learning the filter-wise and + channel-wise structured sparsity can be defined as + + <> (2) + + As indicated in Eq. (2), our approach tends to remove less important filters and channels. Note + that zeroing out a filter in the l-th layer results in a dummy zero output feature map, which in turn + makes a corresponding channel in the (l+ 1)-th layer useless. Hence, we combine the filter-wise and + channel-wise structured sparsity in the learning simultaneously. + Learning arbitrary shapes of filers. As illustrated in Figure 2, <> denotes the vector of + :;c l ;m l ;k all corresponding weights located at spatial position of <> in the 2D filters across the cl-th + channel. Thus, we defineW(l) as the shape fiber related to learning arbitrary filter shape <> because a + homogeneous non-cubic filter shape can be learned by zeroing out some shape fibers. The l + optimization target of learning shapes of filers becomes: + + <> (3) + + Regularizing layer depth. We also explore the depth-wise sparsity to regularize the depth of DNNs + in order to improve accuracy and reduce computation cost. The corresponding optimization target is + Different from other discussed sparsification techniques, + zeroing out all the filters in a layer will cut off the message propagation in the DNN so that the output + neurons cannot perform any classification. Inspired by the structure of highway networks [17] and + deep residual networks [5], we propose to leverage the shortcuts across layers to solve this issue. As + illustrated in Figure 2, even when SSL removes an entire unimportant layers, feature maps will still + be forwarded through the shortcut. + + 3.3 Structured sparsity learning for computationally efficient structures + + All proposed schemes in section 3.2 can learn a compact DNN for computation cost reduction. + Moreover, some variants of the formulations of these schemes can directly learn structures that can + be efficiently computed. + 2D-filter-wise sparsity for convolution. 3D convolution in DNNs essentially is a composition of 2D + convolutions. To perform efficient convolution, we explored a fine-grain variant of filter-wise sparsity, + namely,2D-filter-wise sparsity, to spatially enforce group Lasso on each 2D filter ofW(l)nl ;c l ;:;: . The + saved convolution is proportional to the percentage of the removed 2D filters. The fine-grain version + of filter-wise sparsity can more efficiently reduce the computation associated with convolution: + Because the group sizes are much smaller and thus the weight updating gradients are shaper, it helps + group Lasso to quickly obtain a high ratio of zero groups for a large-scale DNN. + Combination of filter-wise and shape-wise sparsity for GEMM. Convolutional computation in + DNNs is commonly converted to modality of general Matrix Multiplication (GEMM) by lowering + weight tensors and feature tensors to matrices [18]. 
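Before the Caffe example that follows, a small NumPy sketch of this lowering, and of how zeroed groups shrink the resulting GEMM, may be helpful; the shapes and the zero tolerance are illustrative.

    import numpy as np

    def lowered_weight_matrix(W):
        # W: conv weights of one layer, shape (N_filters, C, H, K). After
        # lowering, each filter is a row of the GEMM weight matrix and each
        # column collects the weights at one (c, h, k) position across
        # filters (a "shape fiber").
        return W.reshape(W.shape[0], -1)

    def shrink_for_gemm(W_mat, eps=0.0):
        # Rows zeroed by filter-wise sparsity and columns zeroed by
        # shape-wise sparsity can simply be dropped before a dense GEMM.
        keep_rows = np.abs(W_mat).sum(axis=1) > eps
        keep_cols = np.abs(W_mat).sum(axis=0) > eps
        return W_mat[keep_rows][:, keep_cols], keep_rows, keep_cols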
For example, in Caffe [19], a 3D filter <> is + reshaped to a row in the weight matrix where each column is the collection of weights <> + related to shape-wise sparsity. Combining filter-wise and shape-wise sparsity can directly reduce the + dimension of weight matrix in GEMM by removing zero rows and columns. In this context, we use + row-wise and column-wise sparsity as the interchangeable terminology of filter-wise and shape-wise + sparsity, respectively. + + 4 Experiments + + We evaluated the effectiveness of our SSL using published models on three databases – MNIST, + CIFAR-10, and ImageNet. Without explicit explanation, SSL starts with the network whose weights + are initialized by the baseline, and speedups are measured in matrix-matrix multiplication by Caffe in + a single-thread Intel Xeon E5-2630 CPU . + + Table 1: Results after penalizing unimportant filters and channels inLeNet + + <
> + + 4.1 LeNet and multilayer perceptron on MNIST + + In the experiment of MNIST, we examined the effectiveness of SSL in two types of networks: + LeNet[20] implemented by Caffe and amultilayer perceptron(MLP) network. Both networks were + trained without data augmentation. + LeNet:When applying SSL toLeNet, we constrain the network with filter-wise and channel-wise + sparsity in convolutional layers to penalize unimportant filters and channels. Table 1 summarizes + the remained filters and channels,floating-point operations(FLOP), and practical speedups. In the + table,LeNet 1is the baseline and the others are the results after applying SSL in different strengths + of structured sparsity regularization. The results show that our method achieves the similar error + (0.1%) with much fewer filters and channels, and saves significant FLOP and computation time. + To demonstrate the impact of SSL on the structures of filters, we present all learned conv1 filters + in Figure 3. It can be seen that most filters inLeNet 2are entirely zeroed out except for five most + important detectors of stroke patterns that are sufficient for feature extraction. The accuracy of + LeNet 3(that further removes the weakest and redundant stroke detector) drops only 0.2% from that + ofLeNet 2. Compared to the random and blurry filter patterns inLeNet 1that resulted from the high + freedom of parameter space, the filters inLeNet 2 & 3are regularized and converge to smoother and + more natural patterns. This explains why our proposed SSL obtains the same-level accuracy but has + much less filters. The smoothness of the filters are also observed in the deeper layers. + The effectiveness of the shape-wise sparsity on LeNet is summarized in Table 2. The baselineLeNet 1 + has conv1 filters with a regular 5x5 square (size = 25) whileLeNet 5reduces the dimension that + can be constrained by a 2x4 rectangle (size = 7). The 3D shape of conv2 filters in the baseline is + also regularized to the 2D shape inLeNet 5within only one channel, indicating that only one filter in + conv1 is needed. This fact significantly saves FLOP and computation time. + + <
> + + Figure 3: Learned conv1 filters in LeNet 1(top),LeNet 2(middle) and LeNet 3(bottom) + + MLP:Besides convolutional layers, our proposed SSL can be extended to learn the structure (i.e.the + number of neurons) of fully-connected layers. We enforce the group Lasso regularization on all the + input (or output) connections of each neuron. A neuron whose input connections are all zeroed out + can degenerate to a bias neuron in the next layer; similarly, a neuron can degenerate to a removable + dummy neuron if all of its output connections are zeroed out. Figure 4(a) summarizes the learned + structure and FLOP of differentMLPnetworks. The results show that SSL can not only remove + hidden neurons but also discover the sparsity of images. For example, Figure 4(b) depicts the number + of connections of each input neuron inMLP 2, where 40.18% of input neurons have zero connections + and they concentrate at the boundary of the image. Such a distribution is consistent with our intuition: + + Table 2: Results after learning filter shapes inLeNet + + <
>

Table 3: Learning the number of neurons in multi-layer perceptron

<
>

Figure 4: (a) Results of learning the number of neurons in MLP. (b) The number of connections of each input neuron in MLP 2.

<
> + + handwriting digits are usually written in the center and pixels close to the boundary contain little + discriminative classification information. + + 4.2 ConvNet and ResNet on CIFAR-10 + We implemented the ConvNet of [1] and deep residual networks( ResNet ) [5] on CIFAR-10. When + regularizing filters, channels, and filter shapes, the results and observations of both networks are + similar to that of the MNIST experiment. Moreover, we simultaneously learn the filter-wise and + shape-wise sparsity to reduce the dimension of weight matrix in GEMM ofConvNet. We also learn + the depth-wise sparsity of ResNet to regularize the depth of the DNNs. + ConvNet:We use the network from Alex Krizhevskyet al.[1] as the baseline and implement it + using Caffe. All the configurations remain the same as the original implementation except that we + added a dropout layer with a ratio of 0.5 in the fully-connected layer to avoid over-fitting.ConvNetis + trained without data augmentation. Table 3 summarizes the results of threeConvNetnetworks. Here, + the row/column sparsity of a weight matrix is defined as the percentage of all-zero rows/columns. + Figure 5 shows their learned conv1 filters. In Table 3, SSL can reduce the size of weight matrix + inConvNet 2by 50%, 70.7% and 36.1% for each convolutional layer and achieve good speedups + without accuracy drop. Surprisingly, without SSL, four conv1 filters of the baseline are actually + all-zeros as shown in Figure 5, demonstrating the great potential of filter sparsity. When SSL is + applied, half of conv1 filters inConvNet 2can be zeroed out without accuracy drop. + On the other hand, inConvNet 3, SSL achieves 1.0% (0.16%) lower error with a model even smaller + than the baseline. In this scenario, SSL performs as a structure regularization to dynamically learn a + better network structure (including the number of filters and filer shapes) to reduce the error. + + <
> + + Figure 5: Learned conv1 filters inConvNet 1(top),ConvNet 2(middle) andConvNet 3(bottom) + + ResNet :To investigate the necessary depth of DNNs required by SSL, we use a 20-layer deep residual + networks ( ResNet -20) proposed in [5] as the baseline. The network has 19 convolutional layers and + 1 fully-connected layer.Identity shortcuts are utilized to connect the feature maps with the same + dimension while 1%1 convolutional layers are chosen as shortcuts between the feature maps with + different dimensions. Batch normalization [21] is adopted after convolution and before activation. + We use the same data augmentation and training hyper-parameters as that in [5]. The final error of + baseline is 8.82%. In SSL, the depth of ResNet -20is regularized by depth-wise sparsity. Group Lasso + regularization is only enforced on the convolutional layers between each pair of shortcut endpoints, + excluding the first convolutional layer and all convolutional shortcuts. After SSL converges, layers + + <
> + + Figure 6: Error vs. layer number after depth regularization by SSL. + + + in [ 1412 5] with # layers.SSL- ResNet -#is the depth-regularized ResNet by SSL with # layers, including + the last fully-connected layer indicates the convolutional layers with an output map size of 32,64 32, and so forth + with all zero weights are removed and the net is finally fine-tuned with a base learning rate of 0.01, + Figure 6 plots the trend of the error vs. the number of layers under different strengths of depth + regularizations. Compared with original ResNet in [5], SSL learns a ResNet with 14 layers (SSL- + ResNet -14) that reaching a lower error than the one of the baseline with 20 layers ( ResNet -20); + SSL- ResNet -18and ResNet -32achieve an error of 7.40% and 7.51%, respectively. This result implies + that SSL can work as a depth regularization to improve classification accuracy. Note that SSL can + efficiently learn shallower DNNs without accuracy loss to reduce computation cost; however, it + does not mean the depth of the network is not important. The trend in Figure 6 shows that the test + error generally declines as more layers are preserved. A slight error rise of SSL-ResNet-20 from + SSL- ResNet -18shows the suboptimal selection of the depth in the group of “32x32”. + + 4.3 AlexNet on ImageNet + + To show the generalization of our method to large scale DNNs, we evaluate SSL using AlexNet with + ILSVRC 2012.CaffeNet[19] – the replication of AlexNet [1] with mirror changes, is used in our + experiment. All training images are rescaled to the size of 256x256. A 227%227 image is randomly + cropped from each scaled image and mirrored for data augmentation and only the center crop is + used for validation. The final top-1 validation error is 42.63%. In SSL, AlexNet is first trained with + structure regularization; when it converges, zero groups are removed to obtain a DNN with the new + structure; finally, the network is fine-tuned without SSL to regain the accuracy. + We first studied 2D-filter-wise and shape-wise sparsity by exploring the trade-offs between + computation complexity and classification accuracy. Figure 7(a) shows the 2D-filter sparsity (the ratio + between the removed 2D filters and total 2D filters) and the saved FLOP of 2D convolutions vs. the + validation error. In Figure 7(a), deeper layers generally have higher sparsity as the group size shrinks + + <
> + + Figure 7: (a) 2D-filter-wise sparsity and FLOP reduction vs. top-1 error. Vertical dash line shows the + error of original AlexNet ; (b) The reconstruction error of weight tensor vs. dimensionality.Principal + Component Analysis(PCA) is utilized to perform dimensionality reduction to exploit filter redundancy. + The eigenvectors corresponding to the largest eigenvalues are selected as basis of lower-dimensional + space. Dash lines denote the results of the baselines and solid lines indicate the ones of the AlexNet 5 + in Table 4; (c) Speedups of‘1 -norm and SSL on various CPU and GPU platforms (In labels of x-axis, + T# is number of the maximum physical threads in Xeon CPU). AlexNet 1and AlexNet 2in Table 4 + are used as test benches. + + + and the number of 2D filters grows. 2D-filter sparsity regularization can reduce the total FLOP by + 30%–40% without accuracy loss or reduce the error of AlexNet by%1% down to 41.69% by retaining + the original number of parameters. Shape-wise sparsity also obtains similar results – In Table 4, for + example, AlexNet 5achieves on average 1.4%layer-wise speedup on both CPU and GPU without + accuracy loss after shape regularization; The top-1 error can also be reduced down to 41.83% if + the parameters are retained. In Figure 7(a), the obtained DNN with the lowest error has a very low + sparsity, indicating that the number of parameters in a DNN is still important to maintain learning + capacity. In this case, SSL works as a regularization to add restriction of smoothness to the model in + order to avoid over-fitting. Figure 7(b) compares the results of dimensionality reduction of weight + tensors in the baseline and our SSL-regularized AlexNet . The results show that the smoothness restriction + enforces parameter searching in lower-dimensional space and enables lower rank approximation + of the DNNs. Therefore, SSL can work together with low rank approximation to achieve even higher + model compression. + Besides the above analyses, the computation efficiencies of structured sparsity and non-structured + sparsity are compared in Caffe using standard off-the-shelf libraries,i.e., Intel Math Kernel Library + on CPU and CUDA cuBLAS and cuSPARSE on GPU. We use SSL to learn a AlexNet with high + column-wise and row-wise sparsity as the representative of structured sparsity method.‘1 -norm is + selected as the representative of non-structured sparsity method instead of connection pruning in + [7] because‘1 -norm get a higher sparsity on convolutional layers as the results of AlexNet 3and + AlexNet 4depicted in Table 4. Speedups achieved by SSL are measured by subroutines of GEMM + where nonzero rows and columns in each weight matrix are concatenated in consecutive memory + space. Note that compared to GEMM, the overhead of concatenation can be ignored. To measure the + speedups of‘1 -norm, sparse weight matrices are stored in the format of Compressed Sparse Row + (CSR) and computed by sparse-dense matrix multiplication subroutines. + Table 4 compares the obtained sparsity and speedups of‘1 -norm and SSL on CPU (Intel Xeon) + and GPU (GeForce GTX TITAN Black) under approximately the same errors,e.g., with acceptable + or no accuracy loss. For a fair comparison, after‘1 -norm regularization, the DNN is also fine- + tuned by disconnecting all zero-weighted connections so that 1.39% accuracy is recovered for the + AlexNet 1. 
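As a rough illustration of why the memory layout matters, the toy comparison below (SciPy/NumPy, illustrative sizes rather than the AlexNet layer shapes) contrasts a CSR sparse-dense product with a plain GEMM on a shrunken dense matrix holding roughly the same number of nonzero weights; absolute timings depend entirely on the BLAS and SciPy build.

    import time
    import numpy as np
    from scipy.sparse import random as sparse_random

    m = k = 4096
    n = 256
    density = 0.1                       # ~1.7M nonzeros in either layout

    W_csr = sparse_random(m, k, density=density, format="csr")  # unstructured
    X = np.random.rand(k, n)

    side = int(m * density ** 0.5)      # structured: smaller, fully dense matrix
    W_small = np.random.rand(side, side)
    X_small = X[:side]

    t0 = time.perf_counter()
    _ = W_csr @ X                       # sparse-dense multiplication (CSR)
    t1 = time.perf_counter()
    _ = W_small @ X_small               # dense GEMM on the shrunken matrix
    t2 = time.perf_counter()
    print(f"CSR SpMM: {t1 - t0:.4f} s   shrunken dense GEMM: {t2 - t1:.4f} s")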
Our experiments show that the DNNs require a very high non-structured sparsity to achieve a reasonable speedup (the speedups are even negative when the sparsity is low). SSL, however, can always achieve positive speedups. With an acceptable accuracy loss, our SSL achieves on average 5.1× and 3.1× layer-wise acceleration on CPU and GPU, respectively. In contrast, ℓ1-norm achieves on average only 3.0× and 0.9× layer-wise acceleration on CPU and GPU, respectively. We note that at the same accuracy, our average speedup is indeed higher than that of [6], which adopts heavy hardware customization to overcome the negative impact of non-structured sparsity. Figure 7(c) shows the speedups of ℓ1-norm and SSL on various platforms, including both GPU (Quadro, Tesla

Table 4: Sparsity and speedup of AlexNet on ILSVRC 2012

<
> + + + and Titan) and CPU (Intel Xeon E5-2630). SSL can achieve on average%3%speedup on GPU while + non-structured sparsity obtain no speedup on GPU platforms. On CPU platforms, both methods can + achieve good speedups and the benefit grows as the processors become weaker. Nonetheless, SSL + can always achieve averagely%2%speedup compared to non-structured sparsity. + + + 5 Conclusion + + In this work, we have proposed aStructured Sparsity Learning(SSL) method to regularize filter, + channel, filter shape, and depth structures in deep neural networks (DNN). Our method can enforce + the DNN to dynamically learn more compact structures without accuracy loss. The structured + compactness of the DNN achieves significant speedups for the DNN evaluation both on CPU + and GPU with off-the-shelf libraries. Moreover, a variant of SSL can be performed as structure + regularization to improve classification accuracy of state-of-the-art DNNs. + + Acknowledgments + + This work was supported in part by NSF XPS-1337198 and NSF CCF-1615475. The authors thank + Drs. Sheng Li and Jongsoo Park for valuable feedback on this work. + + + References + + [1]Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional + neural networks. InAdvances in Neural Information Processing Systems, pages 1097–1105. 2012. + [2]Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate + object detection and semantic segmentation. InThe IEEE Conference on Computer Vision and Pattern + Recognition (CVPR), 2014. + [3]Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni- + tion.arXiv preprint arXiv:1409.1556, 2014. + [4]Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru + Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions.arXiv preprint + arXiv:1409.4842, 2015. + [5]Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. + arXiv preprint arXiv:1512.03385, 2015. + [6]Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional + neural networks. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. + [7]Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient + neural network. InAdvances in Neural Information Processing Systems, pages 1135–1143. 2015. + [8]Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with + pruning, trained quantization and huffman coding.arXiv preprint arXiv:1510.00149, 2015. + [9] Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, and Nando de Freitas. Predicting + parameters in deep learning. InAdvances in Neural Information Processing Systems, pages 2148–2156. + 2013. + [10]Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure + within convolutional networks for efficient evaluation. InAdvances in Neural Information Processing + Systems, pages 1269–1277. 2014. + [11]Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with + low rank expansions.arXiv preprint arXiv:1405.3866, 2014. + [12]Yani Ioannou, Duncan P. Robertson, Jamie Shotton, Roberto Cipolla, and Antonio Criminisi. Training + cnns with low-rank filters for efficient image classification.arXiv preprint arXiv:1511.06744, 2015. 
+ [13]Cheng Tai, Tong Xiao, Xiaogang Wang, and Weinan E. Convolutional neural networks with low-rank + regularization.arXiv preprint arXiv:1511.06067, 2015. + [14]Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables.Journal of + the Royal Statistical Society. Series B (Statistical Methodology), 68(1):49–67, 2006. + [15]Seyoung Kim and Eric P Xing. Tree-guided group lasso for multi-task regression with structured sparsity. + InProceedings of the 27th International Conference on Machine Learning, 2010. + [16]Jiashi Feng and Trevor Darrell. Learning the structure of deep convolutional networks. InThe IEEE + International Conference on Computer Vision (ICCV), 2015. + [17]Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks.arXiv preprint + arXiv:1505.00387, 2015. + [18]Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and + Evan Shelhamer. cudnn: Efficient primitives for deep learning.arXiv preprint arXiv:1410.0759, 2014. + [19]Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio + Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding.arXiv + preprint arXiv:1408.5093, 2014. + [20]Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to + document recognition.Proceedings of the IEEE, 86(11):2278–2324, 1998. + [21]Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing + internal covariate shift.arXiv preprint arXiv:1502.03167, 2015. +<> <> <> + + +<> <> <> + MIXED PRECISION TRAINING + + + Sharan Narang % , Gregory Diamos, Erich Elsen y + Baidu Research + fsharan, gdiamosg@baidu.com + + Paulius Micikevicius % , Jonah Alben, David Garcia, Boris Ginsburg, Michael Houston, + Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu + NVIDIA + fpauliusm, alben, dagarcia, bginsburg, mhouston, + okuchaiev, gavenkatesh, skywg@nvidia.com + + ABSTRACT + + Increasing the size of a neural network typically improves accuracy but also in- + creases the memory and compute requirements for training the model. We intro- + duce methodology for training deep neural networks using half-precision float- + ing point numbers, without losing model accuracy or having to modify hyper- + parameters. This nearly halves memory requirements and, on recent GPUs, + speeds up arithmetic. Weights, activations, and gradients are stored in IEEE half- + precision format. Since this format has a narrower range than single-precision we + propose three techniques for preventing the loss of critical information. Firstly, + we recommend maintaining a single-precision copy of weights that accumulates + the gradients after each optimizer step (this copy is rounded to half-precision for + the forward- and back-propagation). Secondly, we propose loss-scaling to pre- + serve gradient values with small magnitudes. Thirdly, we use half-precision arith- + metic that accumulates into single-precision outputs, which are converted to half- + precision before storing to memory. We demonstrate that the proposed methodology + works across a wide variety of tasks and modern large scale (exceeding 100 + million parameters) model architectures, trained on large datasets. 
+ + + 1 INTRODUCTION + + Deep Learning has enabled progress in many different applications, ranging from image recognition + (He et al., 2016a) to language modeling (Jozefowicz et al., 2016) to machine translation (Wu et al., + 2016) and speech recognition (Amodei et al., 2016). Two trends have been critical to these results + - increasingly large training data sets and increasingly complex models. For example, the neural + network used in Hannun et al. (2014) had 11 million parameters which grew to approximately 67 + million for bidirectional RNNs and further to 116 million for the latest forward only Gated Recurrent + Unit (GRU) models in Amodei et al. (2016). + Larger models usually require more compute and memory resources to train. These requirements + can be lowered by using reduced precision representation and arithmetic. Performance (speed) of + any program, including neural network training and inference, is limited by one of three factors: + arithmetic bandwidth, memory bandwidth, or latency. Reduced precision addresses two of these + limiters. Memory bandwidth pressure is lowered by using fewer bits to to store the same number of + values. Arithmetic time can also be lowered on processors that offer higher throughput for reduced + precision math. For example, half-precision math throughput in recent GPUs is 2% to 8% higher + than for single-precision. In addition to speed improvements, reduced precision formats also reduce + the amount of memory required for training. + Modern deep learning training systems use single-precision (FP32) format. In this paper, we address + the training with reduced precision while maintaining model accuracy. Specifically, we train various + neural networks using IEEE half-precision format (FP16). Since FP16 format has a narrower + dynamic range than FP32, we introduce three techniques to prevent model accuracy loss: maintain- + ing a master copy of weights in FP32, loss-scaling that minimizes gradient values becoming zeros, + and FP16 arithmetic with accumulation in FP32. Using these techniques we demonstrate that a + wide variety of network architectures and applications can be trained to match the accuracy FP32 + training. Experimental results include convolutional and recurrent network architectures, trained + for classification, regression, and generative tasks. Applications include image classification, image + generation, object detection, language modeling, machine translation, and speech recognition. The + proposed methodology requires no changes to models or training hyper-parameters. + + 2 RELATED WORK + + There have been a number of publications on training Convolutional Neural Networks (CNNs) with + reduced precision. Courbariaux et al. (2015) proposed training with binary weights, all other tensors + and arithmetic were in full precision. Hubara et al. (2016a) extended that work to also binarize + the activations, but gradients were stored and computed in single precision. Hubara et al. (2016b) + considered quantization of weights and activations to 2, 4 and 6 bits, gradients were real numbers. + Rastegari et al. (2016) binarize all tensors, including the gradients. However, all of these approaches + lead to non-trivial loss of accuracy when larger CNN models were trained for ILSVRC classification + task (Russakovsky et al., 2015). Zhou et al. (2016) quantize weights, activations, and gradients + to different bit counts to further improve result accuracy. 
This still incurs some accuracy loss and + requires a search over bit width configurations per network, which can be impractical for larger + models. Mishra et al. improve on the top-1 accuracy achieved by prior weight and activation + quantizations by doubling or tripling the width of layers in popular CNNs. However, the gradients are + still computed and stored in single precision, while quantized model accuracy is lower than that of + the widened baseline. Gupta et al. (2015) demonstrate that 16 bit fixed point representation can be + used to train CNNs on MNIST and CIFAR-10 datasets without accuracy loss. It is not clear how + this approach would work on the larger CNNs trained on large datasets or whether it would work for + Recurrent Neural Networks (RNNs). + There have also been several proposals to quantize RNN training. He et al. (2016c) train quantized + variants of the GRU (Cho et al., 2014) and Long Short Term Memory (LSTM) (Hochreiter and + Schmidhuber, 1997) cells to use fewer bits for weights and activations, albeit with a small loss in + accuracy. It is not clear whether their results hold for larger networks needed for larger datasets + Hubara et al. (2016b) propose another approach to quantize RNNs without altering their structure. + Another approach to quantize RNNs is proposed in Ott et al. (2016). They evaluate binary, ternary + and exponential quantization for weights in various different RNN models trained for language + modelling and speech recognition. All of these approaches leave the gradients unmodified in single- + precision and therefore the computation cost during back propagation is unchanged. + The techniques proposed in this paper are different from the above approaches in three aspects. + First, all tensors and arithmetic for forward and backward passes use reduced precision, FP16 in + our case. Second, no hyper-parameters (such as layer width) are adjusted. Lastly, models trained + with these techniques do not incur accuracy loss when compared to single-precision baselines. We + demonstrate that this technique works across a variety of applications using state-of-the-art models + trained on large scale datasets. + + 3 IMPLEMENTATION + + We introduce the key techniques for training with FP16 while still matching the model accuracy of + FP32 training session: single-precision master weights and updates, loss-scaling, and accumulating + FP16 products into FP32. Results of training with these techniques are presented in Section 4. + + 3.1 FP32 MASTER COPY OF WEIGHTS + + In mixed precision training, weights, activations and gradients are stored as FP16. In order to match + the accuracy of the FP32 networks, an FP32 master copy of weights is maintained and updated with + the weight gradient during the optimizer step. In each iteration an FP16 copy of the master weights is + used in the forward and backward pass, halving the storage and bandwidth needed by FP32 training. + Figure 1 illustrates this mixed precision training process. + While the need for FP32 master weights is not universal, there are two possible reasons why a + number of networks require it. One explanation is that updates (weight gradients multiplied by the + learning rate) become too small to be represented in FP16 - any value whose magnitude is smaller + than2%24 becomes zero in FP16. We can see in Figure 2b that approximately 5% of weight gradient + values have exponents smaller than%24. 
These small valued gradients would become zero in the optimizer when multiplied with the learning rate and adversely affect the model accuracy. Using a single-precision copy for the updates allows us to overcome this problem and recover the accuracy.
Another explanation is that the ratio of the weight value to the weight update is very large. In this case, even though the weight update is representable in FP16, it could still become zero when the addition operation right-shifts it to align the binary point with the weight. This can happen when the magnitude of a normalized weight value is at least 2048 times larger than that of the weight update. Since FP16 has 10 bits of mantissa, the implicit bit must be right-shifted by 11 or more positions to potentially create a zero (in some cases rounding can recover the value). In cases where the ratio is larger than 2048, the implicit bit would be right-shifted by 12 or more positions. This will cause the weight update to become a zero which cannot be recovered. An even larger ratio will result in this effect for de-normalized numbers. Again, this effect can be counteracted by computing the update in FP32.
To illustrate the need for an FP32 master copy of weights, we use the Mandarin speech model (described in more detail in Section 4.3) trained on a dataset comprising approximately 800 hours of speech data for 20 epochs. As shown in Figure 2a, we match FP32 training results when updating an FP32 master copy of weights after FP16 forward and backward passes, while updating FP16 weights results in 80% relative accuracy loss.
Even though maintaining an additional copy of weights increases the memory requirements for the weights by 50% compared with single precision training, the impact on overall memory usage is much smaller. For training, memory consumption is dominated by activations, due to larger batch sizes and activations of each layer being saved for reuse in the back-propagation pass. Since activations are also stored in half-precision format, the overall memory consumption for training deep neural networks is roughly halved.

3.2 LOSS SCALING

The FP16 exponent bias centers the range of normalized value exponents to [-14, 15], while gradient values in practice tend to be dominated by small magnitudes (negative exponents). For example, consider Figure 3 showing the histogram of activation gradient values, collected across all layers during FP32 training of the Multibox SSD detector network (Liu et al., 2015a). Note that much of the FP16 representable range was left unused, while many values were below the minimum representable range and became zeros. Scaling up the gradients will shift them to occupy more of the representable range and preserve values that are otherwise lost to zeros. This particular network diverges when gradients are not scaled, but scaling them by a factor of 8 (increasing the exponents by 3) is sufficient to match the accuracy achieved with FP32 training. This suggests that activation gradient values below 2^-27 in magnitude were irrelevant to the training of this model, but values in the [2^-27, 2^-24) range were important to preserve.
One efficient way to shift the gradient values into the FP16-representable range is to scale the loss value computed in the forward pass, prior to starting back-propagation. By the chain rule, back-propagation ensures that all the gradient values are scaled by the same amount.
This requires no extra operations during back-propagation and keeps the relevant gradient values from becoming zeros. Weight gradients must be unscaled before the weight update to maintain the update magnitudes as in FP32 training. It is simplest to perform this unscaling right after the backward pass but before gradient clipping or any other gradient-related computations, ensuring that no hyper-parameters (such as the gradient clipping threshold, weight decay, etc.) have to be adjusted.
There are several options to choose the loss scaling factor. The simplest one is to pick a constant scaling factor. We trained a variety of networks with scaling factors ranging from 8 to 32K (many networks did not require a scaling factor). A constant scaling factor can be chosen empirically or, if gradient statistics are available, directly by choosing a factor so that its product with the maximum absolute gradient value is below 65,504 (the maximum value representable in FP16). There is no downside to choosing a large scaling factor as long as it does not cause overflow during back-propagation: overflows will result in infinities and NaNs in the weight gradients, which will irreversibly damage the weights after an update. Note that overflows can be efficiently detected by inspecting the computed weight gradients, for example, when weight gradient values are unscaled. One option is to skip the weight update when an overflow is detected and simply move on to the next iteration.
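The procedure described in Sections 3.1 and 3.2 can be summarized in a few lines. The following sketch is illustrative only (a toy linear model in NumPy, with made-up names such as sgd_step_mixed_precision and an arbitrary scale value); it is not the paper's implementation, but it shows the order of operations: cast the FP32 master weights to FP16, form scaled FP16 gradients, unscale in FP32, optionally skip the step on overflow, and update the master copy.

import numpy as np

def sgd_step_mixed_precision(master_w, x, y, lr=1e-2, loss_scale=8.0):
    # FP16 copy of the FP32 master weights, used for forward/backward.
    w16 = master_w.astype(np.float16)
    x16 = x.astype(np.float16)

    # Forward pass: operands rounded to FP16, dot product carried out in FP32.
    pred = np.dot(x16.astype(np.float32), w16.astype(np.float32))
    err = np.float32(pred - y)

    # "Loss scaling": scale the error before forming FP16 gradients so that
    # small gradient values stay inside the FP16 representable range.
    grad16 = (np.float16(loss_scale * err) * x16).astype(np.float16)

    # Unscale in FP32; skip the update if an overflow produced Inf/NaN.
    grad32 = grad16.astype(np.float32) / loss_scale
    if not np.all(np.isfinite(grad32)):
        return master_w                     # skip this iteration
    return master_w - lr * grad32           # update the FP32 master copy

w = np.zeros(4, dtype=np.float32)
w = sgd_step_mixed_precision(w, x=np.ones(4), y=1.0)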
<>

Figure 2: Figure 2a shows the results of three experiments: baseline (FP32), pseudo FP16 with FP32 master copy, and pseudo FP16 without FP32 master copy. Figure 2b shows the histogram for the exponents of weight gradients for Mandarin speech recognition training with FP32 weights. The gradients are sampled every 4,000 iterations during training for all the layers in the model.

<>
Figure 3: Histogram of activation gradient values during the training of the Multibox SSD network. Note that the bins on the x-axis cover varying ranges and there is a separate bin for zeros. For example, 2% of the values are in the [2^-34, 2^-32) range, 2% of values are in the [2^-24, 2^-23) range, and 67% of values are zero.

3.3 ARITHMETIC PRECISION

By and large, neural network arithmetic falls into three categories: vector dot-products, reductions, and point-wise operations. These categories benefit from different treatment when it comes to reduced precision arithmetic. To maintain model accuracy, we found that some networks require that FP16 vector dot-products accumulate the partial products into an FP32 value, which is converted to FP16 before writing to memory. Without this accumulation in FP32, some FP16 models did not match the accuracy of the baseline models. Whereas previous GPUs supported only an FP16 multiply-add operation, NVIDIA Volta GPUs introduce Tensor Cores that multiply FP16 input matrices and accumulate products into either FP16 or FP32 outputs (NVIDIA, 2017).
Large reductions (sums across elements of a vector) should be carried out in FP32. Such reductions mostly come up in batch-normalization layers, when accumulating statistics, and in softmax layers. Both of these layer types in our implementations still read and write FP16 tensors from memory, performing the arithmetic in FP32. This did not slow down the training process since these layers are memory-bandwidth limited and not sensitive to arithmetic speed.
Point-wise operations, such as non-linearities and element-wise matrix products, are memory-bandwidth limited. Since arithmetic precision does not impact the speed of these operations, either FP16 or FP32 math can be used.

4 RESULTS

We have run experiments for a variety of deep learning tasks covering a wide range of deep learning models. We conducted the following experiments for each application:

- Baseline (FP32): Single-precision storage is used for activations, weights and gradients. All arithmetic is also in FP32.
- Mixed Precision (MP): FP16 is used for storage and arithmetic. Weights, activations and gradients are stored in FP16; an FP32 master copy of weights is used for updates. Loss-scaling is used for some applications. Experiments with FP16 arithmetic used Tensor Core operations with accumulation into FP32 for convolutions, fully-connected layers, and matrix multiplies in recurrent layers.

The Baseline experiments were conducted on NVIDIA Maxwell or Pascal GPUs. Mixed Precision experiments were conducted on Volta V100, which accumulates FP16 products into FP32. The mixed precision speech recognition experiments (Section 4.3) were conducted using Maxwell GPUs with FP16 storage only. This setup allows us to emulate the Tensor Core operations on non-Volta hardware. A number of networks were trained in this mode to confirm that the resulting model accuracies are equivalent to MP training runs on Volta V100 GPUs. This is intuitive since MP arithmetic was accumulating FP16 products into FP32 before converting the result to FP16 on a memory write.

4.1 CNNs FOR ILSVRC CLASSIFICATION

We trained several CNNs for the ILSVRC classification task (Russakovsky et al., 2015) using mixed precision: AlexNet, VGG-D, GoogLeNet, Inception v2, Inception v3, and pre-activation ResNet-50.
In all of these cases we were able to match the top-1 accuracy of the baseline FP32 training session using identical hyper-parameters. Networks were trained using the Caffe (Jia et al., 2014) framework modified to use Volta TensorOps, except for ResNet-50, which used PyTorch (Paszke et al., 2017). Training schedules were taken from public repositories, when available (the training schedule for VGG-D has not been published). Top-1 accuracies on the ILSVRC validation set are shown in Table 1. Baseline (FP32) accuracy in a few cases is different from published results due to single-crop testing and a simpler data augmentation. Our data augmentation in Caffe included random horizontal flipping and random cropping from 256x256 images; ResNet-50 training in PyTorch used the full augmentation in the training script from the PyTorch vision repository.

Table 1: ILSVRC12 classification top-1 accuracy.

<>
The loss-scaling technique was not required for successful mixed precision training of these networks. While all tensors in the forward and backward passes were in FP16, a master copy of weights was updated in FP32 as outlined in Section 3.1.

4.2 DETECTION CNNs

Object detection is a regression task, where bounding box coordinate values are predicted by the network (compared to classification, where the predicted values are passed through a softmax layer to convert them to probabilities). Object detectors also have a classification component, where probabilities for an object type are predicted for each bounding box. We trained two popular detection approaches: Faster-RCNN (Ren et al., 2015) and Multibox-SSD (Liu et al., 2015a). Both detectors used the VGG-16 network as the backbone. Models and training scripts were taken from public repositories (Girshick; Liu). Mean average precision (mAP) was computed on the Pascal VOC 2007 test set. Faster-RCNN was trained on the VOC 2007 training set, whereas SSD was trained on a union of VOC 2007 and 2012 data, which is the reason behind the baseline mAP difference in Table 2.

Table 2: Detection network mean average precision.

<>
As can be seen in Table 2, the SSD detector failed to train in FP16 without loss-scaling. By losing small gradient values to zeros, as described in Section 3.2, poor weights are learned and training diverges. As described in Section 3.2, a loss-scaling factor of 8 recovers the relevant gradient values and mixed-precision training matches the FP32 mAP.

4.3 SPEECH RECOGNITION

We explore mixed precision training for speech data using the DeepSpeech 2 model for both English and Mandarin datasets. The model used for training on the English dataset consists of two 2D convolution layers, three recurrent layers with GRU cells, one row convolution layer, and a Connectionist Temporal Classification (CTC) cost layer (Graves et al., 2006). It has approximately 115 million parameters. This model is trained on our internal dataset consisting of 6000 hours of English speech. The Mandarin model has a similar architecture with a total of 215 million parameters. The Mandarin model was trained on 2600 hours of our internal training set. For these models, we run the Baseline and Pseudo FP16 experiments. All the models were trained for 20 epochs using Nesterov Stochastic Gradient Descent (SGD). All hyper-parameters such as learning rate, annealing schedule and momentum were the same for the baseline and pseudo FP16 experiments. Table 3 shows the results of these experiments on independent test sets.

Table 3: Character Error Rate (CER) using mixed precision training for speech recognition. English results are reported on the WSJ '92 test set. Mandarin results are reported on our internal test set.

<>
Similar to the classification and detection networks, mixed precision training works well for recurrent neural networks trained on large scale speech datasets. These speech models are the largest models trained using this technique. Also, the number of time-steps involved in training a speech model is unusually large compared to other applications using recurrent layers. As shown in Table 3, Pseudo FP16 results are roughly 5 to 10% better than the baseline. This suggests that the half-precision storage format may act as a regularizer during training.

<>
Figure 4: English to French translation network training perplexity, 3x1024 LSTM model with attention. Ref1, ref2 and ref3 represent three different FP32 training runs.

4.4 MACHINE TRANSLATION

For language translation we trained several variants of the model in the TensorFlow tutorial for English to French translation (Google). The model used word vocabularies with 100K and 40K entries for English and French, respectively. The networks we trained had 3 or 5 layers in the encoder and decoder, each. In both cases a layer consisted of 1024 LSTM cells. The SGD optimizer was used to train on the WMT15 dataset. There was a noticeable variation in accuracy of different training sessions with the same settings. For example, see the three FP32 curves in Figure 4, which shows the 3-layer model. Mixed-precision with loss-scaling matched the FP32 results, while no loss-scaling resulted in a slight degradation in the results. The 5-layer model exhibited the same training behavior.

4.5 LANGUAGE MODELING

We trained the English language model designated as bigLSTM (Jozefowicz et al., 2016) on the 1 billion word dataset. The model consists of two layers of 8192 LSTM cells with projection to a 1024-dimensional embedding. This model was trained for 50 epochs using the Adagrad optimizer. The vocabulary size is 793K words. During training, we use a sampled softmax layer with 8K negative samples. Batch size aggregated over 4 GPUs is 1024. To match the FP32 perplexity, training this network with FP16 requires loss-scaling, as shown in Figure 5. Without loss-scaling, the training perplexity curve for FP16 diverges from the FP32 curve after 300K iterations. A scaling factor of 128 recovers all the relevant gradient values and the accuracy of FP16 training matches the baseline run.

4.6 DCGAN RESULTS

Generative Adversarial Networks (GANs) combine regression and discrimination tasks during training. For image tasks, the generator network regresses pixel colors. In our case, the generator predicts three channels of 8-bit color values each. The network was trained to generate 128x128 pixel images of faces, using the DCGAN methodology (Radford et al., 2015) and the CelebFaces dataset (Liu et al., 2015b). The generator had 7 layers of fractionally-strided convolutions, 6 with leaky ReLU activations and 1 with tanh. The discriminator had 6 convolutions and 2 fully-connected layers. All used leaky ReLU activations except for the last layer, which used sigmoid. Batch normalization was applied to all layers except the last fully-connected layer of the discriminator. The Adam optimizer was used to train for 100K iterations. A set of output images is shown in Figure 6. Note that we show a randomly selected set of output images, whereas GAN publications typically show a curated set of outputs by excluding poor examples. Unlike other networks covered in this paper, GANs do not have a widely-accepted quantification of their result quality. Qualitatively, the outputs of FP32 and mixed-precision training appear comparable. This network did not require loss-scaling to match FP32 results.

<>
Figure 5: bigLSTM training perplexity.

<>
Figure 6: An uncurated set of face images generated by DCGAN. FP32 training (left) and mixed-precision training (right).

5 CONCLUSIONS AND FUTURE WORK

Mixed precision training is an important technique that allows us to reduce the memory consumption as well as the time spent in memory and arithmetic operations of deep neural networks. We have demonstrated that many different deep learning models can be trained using this technique with no loss in accuracy and without any hyper-parameter tuning. For certain models with a large number of small gradient values, we introduce the loss-scaling method to help them converge to the same accuracy as the FP32 baseline models.
DNN operations benchmarked with DeepBench on a Volta GPU see 2-6x speedups compared to FP32 implementations if they are limited by memory or arithmetic bandwidth. Speedups are lower when operations are latency-limited. Full network training and inference speedups depend on library and framework optimizations for mixed precision and are a focus of future work (experiments in this paper were carried out with early versions of both libraries and frameworks).
We would also like to extend this work to include generative models like text-to-speech systems and deep reinforcement learning applications. Furthermore, automating loss-scaling factor selection would further simplify training with mixed precision. The loss-scaling factor could be dynamically increased or decreased by inspecting the weight gradients for overflow, skipping weight updates when an overflow is detected.

REFERENCES

D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In Proceedings of The 33rd International Conference on Machine Learning, pages 173–182, 2016.
K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with binary weights during propagations. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3123–3131. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5647-binaryconnect-training-deep-neural-networks-with-binary-weights-during-propagations.pdf.
R. Girshick. Faster r-cnn github repository. https://github.com/rbgirshick/py-faster-rcnn.
Google. Tensorflow tutorial: Sequence-to-sequence models. URL https://www.tensorflow.org/tutorials/seq2seq.
A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376. ACM, 2006.
S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1737–1746, 2015.
A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
K. He, X. Zhang, S. Ren, and J. Sun.
Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016a.
K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016b.
Q. He, H. Wen, S. Zhou, Y. Wu, C. Yao, X. Zhou, and Y. Zou. Effective quantization methods for recurrent neural networks. arXiv preprint arXiv:1611.10176, 2016c.
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov. 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL http://dx.doi.org/10.1162/neco.1997.9.8.1735.
I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, pages 4107–4115, 2016a.
I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016b.
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In F. R. Bach and D. M. Blei, editors, ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 448–456. JMLR.org, 2015. URL http://dblp.uni-trier.de/db/conf/icml/icml2015.html#IoffeS15.
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling, 2016. URL https://arxiv.org/pdf/1602.02410.pdf.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
W. Liu. Ssd github repository. https://github.com/weiliu89/caffe/tree/ssd.
W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. E. Reed. Ssd: Single shot multibox detector. CoRR, abs/1512.02325, 2015a. URL http://dblp.uni-trier.de/db/journals/corr/corr1512.html#LiuAESR15.
Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015b.
A. Mishra, E. Nurvitadhi, J. Cook, and D. Marr. Wrpn: Wide reduced-precision networks. arXiv preprint arXiv:1709.01134, 2017.
NVIDIA. Nvidia tesla v100 gpu architecture. https://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf, 2017.
J. Ott, Z. Lin, Y. Zhang, S.-C. Liu, and Y. Bengio. Recurrent neural networks with limited numerical precision. arXiv preprint arXiv:1608.06902, 2016.
A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015. URL http://dblp.uni-trier.de/db/journals/corr/corr1511.html#RadfordMC15.
M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks, pages 525–542.
Springer International Publishing, Cham, 2016. ISBN 978-3-319-46493-0. doi: 10.1007/978-3-319-46493-0_32. URL https://doi.org/10.1007/978-3-319-46493-0_32.
S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS), 2015.
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015. URL http://arxiv.org/abs/1409.4842.
C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR, abs/1606.06160, 2016. URL http://arxiv.org/abs/1606.06160.
<> <> <>


<> <> <>
Learning to Generalize
SECTION VI / MODEL NEURAL NETWORKS FOR COMPUTATION AND LEARNING
MANFRED OPPER
Neural Computation Research Group, Aston University, Birmingham B4 7ET, United Kingdom

Theories that try to understand the ability of neural networks to generalize from learned examples are discussed. Also, an approach that is based on ideas from statistical physics which aims to model typical learning behavior is compared with a worst-case framework.

Introduction

Neural networks learn from examples. This statement is obviously true for the brain, but also artificial networks (or neural networks), which have become a powerful new tool for many pattern-recognition problems, adapt their synaptic couplings to a set of examples. Neural nets usually consist of many simple computing units which are combined in an architecture which is often independent from the problem. The parameters which control the interaction among the units can be changed during the learning phase and these are often called synaptic couplings. After the learning phase, a network adopts some ability to generalize from the examples; it can make predictions about inputs which it has not seen before; it has begun to understand a rule. To what extent is it possible to understand the complexity of learning from examples by mathematical models and their solutions? This question is the focus of this article. I concentrate on the use of neural networks for classification. Here, one can take characteristic features (e.g., the pixels of an image) as an input pattern to the network. In the simplest case, it should decide whether a given pattern belongs (at least more likely) to a certain class of objects and respond with the output +1 or -1.
To learn the underlying classification rule, the network is trained on a set of patterns together with the classification labels, which are provided by a trainer. A heuristic strategy for training is to tune the parameters of the machine (the couplings of the network) using a learning algorithm, in such a way that the errors made on the set of training examples are small, in the hope that this helps to reduce the errors on new data. How well will the trained network be able to classify an input that it has not seen before? This performance on new data defines the generalization ability of the network. This ability will be affected by the problem of realizability: the network may not be sufficiently complex to learn the rule completely, or there may be ambiguities in classification. Here, I concentrate on a second problem arising from the fact that learning will mostly not be exhaustive and the information about the rule contained in the examples is not complete. Hence, the performance of a network may vary from one training set to another. In order to treat the generalization ability in a quantitative way, a common model assumes that all input patterns, those from the training set and the new one on which the network is tested, have a preassigned probability distribution (which characterizes the feature that must be classified), and they are produced independently at random with the same probability distribution from the network's environment. Sometimes the probability distribution used to extract the examples and the classification of these examples is called the rule. The network's performance on novel data can now be quantified by the so-called generalization error, which is the probability of misclassifying the test input and can be measured by repeating the same learning experiment many times with different data.
Within such a probabilistic framework, neural networks are often viewed as statistical adaptive models which should give a likely explanation of the observed data. In this framework, the learning process becomes mathematically related to a statistical estimation problem for optimal network parameters. Hence, mathematical statistics seems to be a most appropriate candidate for studying a neural network's behavior. In fact, various statistical approaches have been applied to quantify the generalization performance. For example, expressions for the generalization error have been obtained in the limit where the number of examples is large compared to the number of couplings (Seung et al., 1992; Amari and Murata, 1993). In such a case, one can expect that learning is almost exhaustive, such that the statistical fluctuations of the parameters around their optimal values are small. However, in practice the number of parameters is often large so that the network can be flexible, and it is not clear how many examples are needed for the asymptotic theory to become valid. The asymptotic theory may actually miss interesting behavior of the so-called learning curve, which displays the progress of generalization ability with an increasing amount of training data.
A second important approach, which was introduced into mathematical statistics in the 1970s by Vapnik and Chervonenkis (VC) (Vapnik, 1982, 1995), provides exact bounds for the generalization error which are valid for any number of training examples. Moreover, they are entirely independent of the underlying distribution of inputs, and for the case of realizable rules they are also independent of the specific algorithm, as long as the training examples are perfectly learned. Because it is able to cover even bad situations which are unfavorable for improvement of the learning process, it is not surprising that this theory may in some cases provide too pessimistic results which are also too crude to reveal interesting behavior in the intermediate region of the learning curve.
In this article, I concentrate mainly on a different approach, which has its origin in statistical physics rather than in mathematical statistics, and compare its results with the worst-case results. This method aims at studying the typical rather than the worst-case behavior and often enables the exact calculation of the entire learning curve for models of simple networks which have many parameters. Since both biological and artificial neural networks are composed of many elements, it is hoped that such an approach may actually reveal some relevant and interesting structures.
At first, it may seem surprising that a problem should simplify when the number of its constituents becomes large. However, this phenomenon is well-known for macroscopic physical systems such as gases or liquids which consist of a huge number of molecules. Clearly, it is not possible to study the complete microscopic state of such a system, which is described by the rapidly fluctuating positions and velocities of all particles. On the other hand, macroscopic quantities such as density, temperature, and pressure are usually collective properties influenced by all elements. For such quantities, fluctuations are averaged out in the thermodynamic limit of a large number of particles and the collective properties become, to some extent, independent of the microstate. Similarly, the generalization ability of a neural network is a collective property of all the network parameters, and the techniques of statistical physics allow, at least for some simple but nontrivial models, for exact computations in the thermodynamic limit. Before explaining these ideas in detail, I provide a short description of feed-forward neural networks.

Artificial Neural Networks

Based on highly idealized models of brain function, artificial neural networks are built from simple elementary computing units, which are sometimes termed neurons after their biological counterparts. Although hardware implementations have become an important research topic, neural nets are still simulated mostly on standard computers. Each computing unit of a neural net has a single output and several ingoing connections which receive the outputs of other units. To every ingoing connection (labeled by the index i) a real number is assigned, the synaptic weight w_i, which is the basic adjustable parameter of the network. To compute a unit's output, all incoming values x_i are multiplied by the weights w_i and then added.
Figure 1a shows an example of such a computation with three couplings. Finally, the result, <>, is passed through an activation function which is typically of the shape of the red curve in Fig. 1a (a sigmoidal function), which allows for a soft, ambiguous classification between +1 and -1. Other important cases are the step function (green curve) and the linear function (yellow curve; used in the output neuron for problems of fitting continuous functions). In the following, to keep matters simple, I restrict the discussion mainly to the step function. (A small code sketch of this elementary computation is given below.)
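As an illustration (added for this corpus, not part of the original article), the computation of a single unit can be written in a few lines of Python; the activation can be any of the three functions just mentioned:

import numpy as np

# Three possible activation functions for a single unit.
def sigmoidal(h):   # a smooth, S-shaped output between -1 and +1
    return np.tanh(h)

def step(h):        # hard +/-1 classification
    return 1.0 if h >= 0 else -1.0

def linear(h):      # used when fitting continuous outputs
    return h

def unit_output(weights, inputs, activation=step):
    # Weighted sum of the incoming values, passed through the activation.
    return activation(np.dot(weights, inputs))

print(unit_output(np.array([0.5, -1.0, 2.0]), np.array([1.0, 1.0, 1.0])))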
Such simple units can develop a remarkable computational power when connected in a suitable architecture. An important network type is the feedforward architecture shown in Fig. 1b, which has two layers of computing units and adjustable couplings. The input nodes (which do not compute) are coupled to the so-called hidden units, which feed their outputs into one or more output units. With such an architecture and sigmoidal activation functions, any continuous function of the inputs can be arbitrarily closely approximated when the number of hidden units is sufficiently large.
<>

FIGURE 1 (a) Example of the computation of an elementary unit (neuron) in a neural network. The numerical values assumed by the incoming inputs to the neuron and the weights of the synapses by which the inputs reach the neuron are indicated. The weighted sum of the inputs corresponds to the value of the abscissa at which the value of the activation function is calculated (bottom graph). Three functions are shown: sigmoid, linear, and step. (b) Scheme of a feedforward network. The arrow indicates the direction of propagation of information.

The Perceptron

The simplest type of network is the perceptron (Fig. 2a). There are N inputs, N synaptic couplings <>, and the output is simply

<>

It has a single-layer architecture and the step function (green curve in Fig. 1a) as its activation function. Despite its simple structure, it can for many learning problems give a nontrivial generalization performance and may be used as a first step to an unknown classification task. As can be seen by comparing Figs. 2a and 1b, it is also a building block for the more complex multilayer networks. Hence, understanding its performance theoretically may also provide insight into the more complex machines. To learn a set of examples, a network must adjust its couplings appropriately (I often use the word couplings for their numerical strengths, the weights <>, for <>). Remarkably, for the perceptron there exists a simple learning algorithm which always enables the network to find those parameter values whenever the examples can be learnt by a perceptron. In Rosenblatt's algorithm, the input patterns are presented sequentially (e.g., in cycles) to the network and the output is tested. Whenever a pattern is not classified correctly, all couplings are altered simultaneously. We increase by a fixed amount all weights for which the input unit and the correct value of the output neuron have the same sign, but we decrease them for the opposite sign. This simple algorithm is reminiscent of the so-called Hebbian learning rule, a physiological model of a learning process in the real brain. It assumes that synaptic weights are increased when two neurons are simultaneously active. Rosenblatt's theorem states that in cases in which there exists a choice of the w_i which classify correctly all of the examples (i.e., a perfectly learnable perceptron), this algorithm finds a solution in a finite number of steps, which is at worst equal to A N^3, where A is an appropriate constant.

<>

FIGURE 2 (a) The perceptron. (b) Classification of inputs by a perceptron with two inputs. The arrow indicates the vector composed of the weights of the network, and the line perpendicular to this vector is the boundary between the classes of input.

It is often useful to obtain an intuition of a perceptron's classification performance by thinking in terms of a geometric picture. We may view the numerical values of the inputs as the coordinates of a point in some (usually) high-dimensional space. The case of two dimensions is shown in Fig. 2b. A corresponding point is also constructed for the couplings w_i. The arrow which points from the origin of the coordinate system to this latter point is called the weight vector or coupling vector. An application of linear algebra to the computation of the network shows that the line which is perpendicular to the coupling vector is the boundary between inputs belonging to the two different classes. Input points which are on the same side as the coupling vector are classified as +1 (the green region in Fig. 2b) and those on the other side as -1 (red region in Fig. 2b).
Rosenblatt's algorithm aims to determine such a line when it is possible. This picture generalizes to higher dimensions, for which a hyperplane plays the same role as the line of the previous two-dimensional example. We can still obtain an intuitive picture by projecting on two-dimensional planes. In Fig. 3a, 200 input patterns with random coordinates (randomly labeled red and blue) in a 200-dimensional input space are projected on the plane spanned by two arbitrary coordinate axes. If we instead use a plane for projection which contains the coupling vector (determined from a variant of Rosenblatt's algorithm) we obtain the view shown in Fig. 3b, in which red and green points are clearly separated and there is even a gap between the two clouds.
It is evident that there are cases in which the two sets of points are too mixed and there is no line in two dimensions (or no hyperplane in higher dimensions) which separates them. In these cases, the rule is too complex to be perfectly learned by a perceptron. If this happens, we must attempt to determine the choice of the couplings which minimizes the number of errors on a given set of examples. Here, Rosenblatt's algorithm does not work and the problem of finding the minimum is much more difficult from the algorithmic point of view. The training error, which is the number of errors made on the training set, is usually a non-smooth function of the network couplings (i.e., it may have large variations for small changes of the couplings). A short sketch of Rosenblatt's learning rule on teacher-generated data is given below.
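The update rule just described can be written down directly. The following sketch is an illustration added for this corpus (the data distribution, sizes, and stopping rule are arbitrary choices); it trains a perceptron with Rosenblatt's algorithm on examples labeled by a randomly drawn teacher perceptron and then estimates the generalization error on fresh patterns:

import numpy as np

rng = np.random.default_rng(0)
N, m = 50, 200

teacher = rng.standard_normal(N)                 # the rule to be learned
X = rng.standard_normal((m, N))                  # random input patterns
labels = np.sign(X @ teacher)                    # +/-1 class labels

w = np.zeros(N)                                  # student couplings
for _ in range(1000):                            # cycle through the examples
    updated = False
    for x, y in zip(X, labels):
        if np.sign(x @ w) != y:                  # misclassified pattern:
            w += y * x                           # Rosenblatt update
            updated = True
    if not updated:                              # all examples learned
        break

# Estimate the generalization error on fresh random inputs.
X_test = rng.standard_normal((5000, N))
gen_err = np.mean(np.sign(X_test @ teacher) != np.sign(X_test @ w))
print(gen_err)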
Hence, in general, in addition to the perfectly learnable perceptron case in which the final error is zero, minimizing the training error is usually a difficult task which could take a large amount of computer time. However, in practice, iterative approaches, which are based on the minimization of other, smooth cost functions, are used to train a neural network (Bishop, 1995).
As previously shown, perceptrons are only able to realize a very restricted type of classification rules, the so-called linearly separable ones. Hence, independently from the issue of finding the best algorithm to learn the rule, one may ask the following question: in how many cases will the perceptron be able to learn a given set of training examples perfectly if the output labels are chosen arbitrarily? In order to answer this question in a quantitative way, it is convenient to introduce some concepts such as capacity, VC dimension, and worst-case generalization, which can be used in the case of the perceptron and have a more general meaning.
<>

FIGURE 3 (a) Projection of 200 random points (with random labels) from a 200-dimensional space onto the first two coordinate axes (x1 and x2). (b) Projection of the same points onto a plane which contains the coupling vector of a perfectly trained perceptron.

In the case of perceptrons, this question was answered in the 1960s by Cover (1965). He calculated, for any set of m input patterns, the fraction of all the 2^m possible mappings that can be linearly separated and are thus learnable by perceptrons. This fraction is shown in Fig. 4 as a function of the number of examples per coupling for different numbers of input nodes (couplings) N. Three regions can be distinguished:
Region in which m/N <= 1: Simple linear algebra shows that it is always possible to learn all mappings when the number m of input patterns is less than or equal to the number N of couplings (there are simply enough adjustable parameters).
Region in which 1 < m/N < 2: For this region, there are examples of rules that cannot be learned. However, when the number of examples is less than twice the number of couplings (m/N < 2), if the network is large enough almost all mappings can be learned. If the output labels for each of the m inputs are chosen randomly as +1 or -1 with equal probability, the probability of finding a nonrealizable mapping goes to zero exponentially when N goes to infinity at fixed ratio m/N.
Region in which m/N > 2: For m/N > 2 the probability for a mapping to be realizable by perceptrons decreases rapidly, and it goes to zero exponentially when N goes to infinity at fixed ratio m/N (it is proportional to <>, where the function <> vanishes for α < 2 and is positive for α > 2). Such a threshold phenomenon is an example of a phase transition (i.e., a sharp change of behavior) which can occur in the thermodynamic limit of a large network size. (A small numerical check of Cover's counting argument is sketched below.)
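Cover's result can be evaluated explicitly: the number of linearly separable labelings of m points in general position in N dimensions is C(m, N) = 2 * sum_{k=0}^{N-1} binom(m-1, k), so the learnable fraction is C(m, N)/2^m. The following short sketch (added for this corpus) prints this fraction for several ratios m/N and reproduces the threshold behavior around m/N = 2:

from math import comb

def learnable_fraction(m, N):
    # Fraction of the 2^m labelings of m points (in general position in
    # N dimensions) that a perceptron can realize, after Cover (1965).
    if m <= N:
        return 1.0
    count = 2 * sum(comb(m - 1, k) for k in range(N))
    return count / 2 ** m

N = 100
for ratio in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
    m = int(ratio * N)
    print(f"m/N = {ratio:3.1f}: fraction = {learnable_fraction(m, N):.3f}")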
<>

FIGURE 4 Fraction of all mappings of m input patterns which are learnable by perceptrons as a function of m/N for different numbers of couplings N: N = 10 (in green), N = 20 (in blue), and N = 100 (in red). The vertical axis shows the fraction of realizable mappings.

Generally, the point at which such a transition takes place defines the so-called capacity of the neural network. Although the capacity measures the ability of a network to learn random mappings of the inputs, it is also related to its ability to learn a rule (i.e., to generalize from examples). The question now is, how does the network perform on a new example after having been trained on m examples from the training set?
To obtain an intuitive idea of the connection between capacity and ability to generalize, we assume a training set of size m and a single pattern for test. Suppose we define a possible rule by an arbitrary learnable mapping from inputs to outputs. If m + 1 is much larger than the capacity, then for most rules the labels on the m training patterns which the perceptron is able to recognize will nearly uniquely determine the couplings (and consequently the answer of the learning algorithm on the test pattern), and the rule can be perfectly understood from the examples. Below capacity, in most cases there are two different choices of couplings which give opposite answers for the test pattern. Hence, a correct classification will occur with probability 0.5, assuming all rules to be equally probable. Figure 5 displays the two types of situations for m = 3 and N = 2.
This intuitive connection can be sharpened. Vapnik and Chervonenkis established a relation between a capacity-like quantity and the generalization ability that is valid for general classifiers (Vapnik, 1982, 1995). The VC dimension is defined as the size of the largest set of inputs for which all mappings can be learned by the given type of classifier. It equals N for the perceptron. Vapnik and Chervonenkis were able to show that for any training set of size m
larger than the VC dimension DVC, the growth of the number of realizable mappings is bounded by an expression which grows much more slowly than 2^m (in fact, only like a polynomial in m).

<>

FIGURE 5 Classification rules for four patterns based on a perceptron. The patterns colored in red represent the training examples, and triangles and circles represent different class labels. The question mark is a test pattern. (a) There are two possible ways of classifying the test point consistent with the examples; (b) only one classification is possible.

They proved that a large difference between the training error (i.e., the minimum percentage of errors that is made on the training set) and the generalization error (i.e., the probability of producing an error on the test pattern after having learned the examples) of classifiers is highly improbable if the number of examples is well above DVC. This theorem implies a small expected generalization error if the training set is perfectly learned. The expected generalization error is bounded by a quantity which increases proportionally to DVC and decreases (neglecting logarithmic corrections in m) inversely proportionally to m. Conversely, one can construct a worst-case distribution of input patterns for which a size of the training set larger than DVC is also necessary for good generalization.
The VC results should, in practice, enable us to select the network with the proper complexity which guarantees the smallest bound on the generalization error. For example, in order to find the proper size of the hidden layer of a network with two layers, one could train networks of different sizes on the same data. The relation among these concepts can be better understood if we consider a family of networks of increasing complexity which have to learn the same rule. A qualitative picture of the results is shown in Fig. 6. As indicated by the blue curve in Fig. 6, the minimal training error will decrease for increasing complexity of the nets. On the other hand, the VC dimension and the complexity of the networks increase with the increasing number of hidden units, leading to an increasing expected difference (confidence interval) between training error and generalization error, as indicated by the red curve. The sum of both (green curve) will have a minimum, giving the smallest bound on the generalization error. As discussed later, this procedure will in some cases lead to not very realistic estimates, caused by the rather pessimistic bounds of the theory. In other words, the rigorous bounds, which are obtained for an arbitrary network and rule, are much larger than those determined from the results for most of the networks and rules.

Typical Scenario: The Approach of Statistical Physics

When the number of examples is comparable to the size of the network, which for a perceptron equals the VC dimension, the VC theory states that one can construct malicious situations which prevent generalization. However, in general, we would not expect that the world acts as an adversary. Therefore, how should one model a typical situation? As a first step, one may construct rules and pattern distributions which act together in a nonadversarial way. The teacher-student paradigm has proven to be useful in such a situation. Here, the rule to be learned is modeled by a second network, the teacher network; in this case, if the teacher and the student have the same architecture and the same
<>

FIGURE 6 As the complexity of the network varies (i.e., the number of hidden units, shown schematically below), the generalization error (in red), calculated from the sum of the training error (in green) and the confidence interval (in blue) according to the theory of Vapnik and Chervonenkis, shows a minimum; this corresponds to the network with the best generalization ability.

number of units, the rule is evidently realizable. The correct class labels for any inputs are given by the outputs of the teacher. Within this framework, it is often possible to obtain simple expressions for the generalization error. For a perceptron, we can use the geometric picture to visualize the generalization error. A misclassification of a new input vector by a student perceptron with coupling vector ST occurs only if the input pattern is between the separating planes (dashed region in Fig. 7) defined by ST and the vector of teacher couplings TE. If the inputs are drawn randomly from a uniform distribution, the generalization error is directly proportional to the angle between ST and TE. Hence, the generalization error is small when teacher and student vectors are close together and decreases to zero when both coincide.
In the limit when the number of examples is very large, all the students which learn the training examples perfectly will not differ very much from the teacher, and their couplings will be close to those of the teacher. Such cases with a small generalization error have been successfully treated by asymptotic methods of statistics. On the other hand, when the number of examples is relatively small, there are many different students which are consistent with the teacher regarding the training examples, and the uncertainty about
<>

FIGURE 7 For a uniform distribution of patterns, the generalization error of a perceptron equals the area of the shaded region divided by the area of the entire circle. ST and TE represent the coupling vectors of the student and teacher, respectively.

the true couplings of the teacher is large. Possible generalization errors may range from zero (if, by chance, a learning algorithm converges to the teacher) to some worst-case value. We may say that the constraint which specifies the macrostate of the network (its training error) does not specify the microstate uniquely. Nevertheless, it makes sense to speak of a typical value for the generalization error, which is defined as the value which is realized by the majority of the students. In the thermodynamic limit known from statistical physics, in which the number of parameters of the network is taken to be large, we expect that in fact almost all students belong to this majority, provided the quantity of interest is a cooperative effect of all components of the system. As the geometric visualization for the generalization error of the perceptron shows, this is actually the case. The following approach, which was pioneered by Elizabeth Gardner (Gardner, 1988; Gardner and Derrida, 1989), is based on the calculation of V(ε), the volume of the space of couplings which both perfectly implement m training examples and have a given generalization error ε. For an intuitive picture, consider that only discrete values for the couplings are allowed; then V(ε) would be proportional to the number of students. The typical value of the generalization error is the value of ε which maximizes V(ε). It should be kept in mind that V(ε) is a random number and fluctuates from one training set to another. A correct treatment of this randomness requires involved mathematical techniques (Mézard et al., 1987). To obtain a picture which is quite often qualitatively correct, we may replace it by its average over many realizations of training sets. From elementary probability theory we see that this average number can be found by calculating the volume A of the space of all students with generalization error ε, irrespective of their behavior on the training set, and multiplying it by the probability B that a student with generalization error ε gives the correct answer m times on independent drawings of the input patterns. Since A increases exponentially with the number of couplings N (like typical volumes in N-dimensional spaces) and B decreases exponentially with m (because it becomes more improbable to be correct m times for any ε > 0), both factors can balance each other when m increases like m = αN. Here α is an effective measure for the size of the training set when N goes to infinity. In order to have quantities which remain finite as N goes to infinity, it is also useful to take the logarithm of V(ε) and divide by N, which transforms the product into a sum of two terms. The first one (which is often called the entropic term) increases with increasing generalization error (green curve in Fig. 8). This is true because there are many networks which are not similar to the teacher, but there is only one network equal to the teacher. For almost all networks (remember, the entropic term does not include the effect of the training examples) ε = 0.5, i.e., they are correct half of the time by random guessing. On the other hand, the second term (red curve in Fig.
8) decreases with increasing generalization error because the probability of being correct on an input pattern increases when the student network becomes more similar to the teacher. It is often called the energetic contribution because it favors highly ordered (toward the teacher) network states, reminiscent of the states of physical systems at low energies. Hence, there will be a maximum (Fig. 8, arrow) of <> at some value of ε which by definition is the typical generalization error.

<>

FIGURE 8 Logarithm of the average volume of students that have learned m examples and give ε generalization error (green curve). The blue and red curves represent the energetic and entropic contributions, respectively.

The development of the learning process as the number of examples αN increases can be understood as a competition between the entropic term, which favors disordered network configurations that are not similar to the teacher, and the energetic term. The latter term dominates when the number of examples is large. It will later be shown that such a competition can lead to a rich and interesting behavior as the number of examples is varied. The result for the learning curve (Györgyi and Tishby, 1990; Sompolinsky et al., 1990) of a perceptron obtained by the statistical physics approach (treating the random sampling the proper way) is shown by the red curve of Fig. 9. In contrast to the worst-case predictions of the VC theory, it is possible to have some generalization ability below the VC dimension or capacity. As we might have expected, the generalization error decreases monotonically, showing that the more that is learned, the more that is understood. Asymptotically, the error is proportional to N and inversely proportional to m, in agreement with the VC predictions. This may not be true for more complicated networks.

<>

FIGURE 9 Learning curves for typical student perceptrons. α = m/N is the ratio between the number of examples and the number of couplings.

Query Learning

Soon after Gardner's pioneering work, it was realized that the approach of statistical physics is closely related to ideas in information theory and Bayesian statistics (Levin et al., 1989; Györgyi and Tishby, 1990; Opper and Haussler, 1991), for which the reduction of an initial uncertainty about the true state of a system (teacher) by observing data is a central topic of interest. The logarithm of the volume of relevant microstates as defined in the previous section is a direct measure for such uncertainty. The moderate progress in generalization ability displayed by the red learning curve of Fig. 9 can be understood by the fact that as learning progresses less information about the teacher is gained from a new random example. Here, the information gain is defined as the reduction of the uncertainty when a new example is learned. The decrease in information gain is due to the increase in the generalization performance. This is plausible because inputs for which the majority of student networks give the correct answer are less informative than those for which a mistake is more likely. The situation changes if the student is free to ask the teacher questions, i.e., if the student can choose highly informative input patterns. For the simple perceptron a fruitful query strategy is to select a new input vector which is perpendicular to the current coupling vector of the student (Kinzel and Ruján, 1990). Such an input is a highly ambiguous pattern because small changes in the student couplings produce different classification answers. For more complicated networks it may be difficult to obtain similar ambiguous inputs by an explicit construction. A general algorithm has been proposed (Seung et al., 1992a) which uses the principle of maximal disagreement in a committee of several students as a selection process for training patterns. Using an appropriate randomized training strategy, different students are generated which all learn the same set of examples. Next, any new input vector is only accepted for training when the disagreement of its classification between the students is maximal. For a committee of two students it can be shown that when the number of examples is large, the information gain does not decrease but reaches a positive constant. This results in a much faster decrease of the generalization error. Instead of being inversely proportional to the number of examples, the decrease is now exponentially fast.

Bad Students and Good Students

Although the typical student perceptron has a smooth, monotonically decreasing learning curve, the possibility that some concrete learning algorithm may result in a set of student couplings which are untypical in the sense of our theory cannot be ruled out. For bad students, even non-monotonic generalization behavior is possible. The problem of a concrete learning algorithm can be made to fit into the statistical physics framework if the algorithm minimizes a certain cost function. Treating the achieved values of the new cost function as a macroscopic constraint, the tools of statistical physics apply again.
As an example, it is convenient to consider a case in which the teacher and the student have different architectures: in one of the simplest examples one tries to learn a classification problem by interpreting it as a regression problem, i.e., a problem of fitting a continuous function through data points.
To be specific, we study the situation in which the teacher network is still given by a perceptron which computes binary valued outputs of the form y = sign(Σ_i w_i x_i) = ±1, but as the student we choose a network with a linear transfer function (the yellow curve in Fig. 1a)

<>

and try to fit this linear expression to the binary labels of the teacher. If the number of couplings is sufficiently large (larger than the number of examples) the linear function (unlike the sign) is perfectly able to fit arbitrary continuous output values. This linear fit is an attempt to explain the data in a more complicated way than necessary, and the couplings have to be finely tuned in order to achieve this goal. We find that the student trained in such a way does not generalize well (Opper and Kinzel, 1995). In order to compare the classifications of teacher and student on a new random input after training, we have finally converted the student's output into a classification label by taking the sign of its output. As shown by the red curve of Fig. 10, after an initial improvement of performance the generalization error increases again to the random guessing value ε = 0.5 at α = 1. This phenomenon is called overfitting. For α > 1 (i.e., for more data than parameters), it is no longer possible to have a perfect linear fit through the data, but a fit with a minimal deviation from a linear function leads to the second part of the learning curve. ε decreases again and approaches 0 asymptotically for α → ∞. This shows that when enough data are available, the details of the training algorithm are less important.
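A small numerical sketch of the experiment just described may help: a linear student is fitted by least squares (the minimum-norm pseudo-inverse solution) to the binary labels of a perceptron teacher, and its classification error on fresh inputs is estimated. This is my own illustration, not the calculation of the article; NumPy is assumed, and the sampled test-set estimate of ε is an illustrative choice.

import numpy as np

rng = np.random.default_rng(1)
N = 100                                    # number of couplings

def generalization_error(teacher, student, n_test=5000):
    # fraction of fresh random inputs on which sign(student . x) differs
    # from the teacher's classification
    X = rng.standard_normal((n_test, N))
    return np.mean(np.sign(X @ teacher) != np.sign(X @ student))

teacher = rng.standard_normal(N)
for alpha in (0.5, 1.0, 2.0, 5.0):         # alpha = m / N
    m = int(alpha * N)
    X = rng.standard_normal((m, N))
    y = np.sign(X @ teacher)               # binary labels of the teacher
    # linear student: least-squares fit of a linear output to the labels;
    # for m < N this interpolates the data exactly (the overfitting regime),
    # and near alpha = 1 the error climbs toward random guessing
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(f"alpha = {alpha:>4}: eps = {generalization_error(teacher, w):.3f}")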
The dependence of the generalization performance on the complexity of the assumed data model (the bias/variance trade-off) is well known. If a function class is used that is too complex, data values can be perfectly fitted but the predicted function will be very sensitive to the variations of the data sample, leading to very unreliable predictions on novel inputs. On the other hand, functions that are too simple make the best fit almost insensitive to the data, which prevents us from learning enough from them.

It is also possible to calculate the worst-case generalization ability of perceptron students learning from a perceptron teacher. The largest generalization error is obtained (Fig. 7) when the angle between the coupling vectors of teacher and student is maximized under the constraint that the student learns all examples perfectly. Although it may not be easy to construct a learning algorithm which performs such a maximization in practice, the resulting generalization error can be calculated using the statistical physics approach (Engel and Van den Broeck, 1993). The result is in agreement with the VC theory: There is no prediction better than random guessing below the capacity.

Although the previous algorithms led to a behavior which is worse than the typical one, we now examine the opposite case of an algorithm which does better. Since the generalization ability of a neural network is related to the fact that similar input vectors are mapped onto the same output, one can assume that such a property can be enhanced if the separating gap between the two classes is maximized, which defines a new cost function for an algorithm. This optimal margin perceptron can be practically realized and when applied to a set of data leads to the projection of Fig. 11. As a remarkable result, it can be seen that there is a relatively large fraction of patterns which are located at the gap. These points are called support vectors (SVs). In order to understand their importance for the generalization ability, we make the following gedankenexperiment and assume that all the points which lie outside the gap (the non-support vectors) are eliminated from the training set of examples. From the two-dimensional projection of Fig. 11, we may conjecture that by running the maximal margin algorithm on the remaining examples (the SVs) we cannot create a larger gap between the points. Hence, the algorithm will converge to the same separating hyperplane as before. This intuitive picture is actually correct. If the SVs of a training set were known beforehand (unfortunately, they are only identified after running the algorithm), the margin classifier would have to be trained only on the SVs. It would automatically classify the rest of the training inputs correctly. Hence, if in an actual classification experiment the number of SVs is small compared to the number of non-SVs, we may expect a good generalization ability.

FIGURE 11 Learning with a margin classifier and m = 300 examples in an N = 150-dimensional space.

The learning curve for a margin classifier (Opper and Kinzel, 1995) learning from a perceptron teacher (calculated by the statistical physics approach) is shown in Fig. 10 (blue curve). The concept of a margin classifier has recently been generalized to the so-called support vector machines (Vapnik, 1995), for which the inputs of a perceptron are replaced by suitable features which are cleverly chosen nonlinear functions of the original inputs. In this way, nonlinearly separable rules can be learned, providing an interesting alternative to multilayer networks.
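As an illustration of the margin classifier and its support vectors, the sketch below trains a hard-margin linear SVM on data labeled by a perceptron teacher with N = 150 and m = 300 as in Fig. 11, counts the support vectors, and checks that retraining on the SVs alone gives essentially the same hyperplane. scikit-learn is assumed here merely as a convenient stand-in for the optimal margin perceptron; it is not the algorithm used in the article.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
N, m = 150, 300                          # dimension and number of examples, as in Fig. 11

teacher = rng.standard_normal(N)
X = rng.standard_normal((m, N))
y = np.sign(X @ teacher)                 # labels provided by the perceptron teacher

# A linear SVM with a very large C approximates the hard-margin solution,
# i.e. it maximizes the gap between the two classes.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

n_sv = clf.support_vectors_.shape[0]
print(f"{n_sv} of {m} training points are support vectors")

# Retraining on the support vectors alone reproduces (up to numerics)
# the same separating hyperplane, as argued in the gedankenexperiment.
clf_sv = SVC(kernel="linear", C=1e6).fit(clf.support_vectors_, y[clf.support_])
w_full = clf.coef_.ravel() / np.linalg.norm(clf.coef_)
w_sv = clf_sv.coef_.ravel() / np.linalg.norm(clf_sv.coef_)
print(f"overlap of the two hyperplane normals: {abs(w_full @ w_sv):.4f}")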
The Ising Perceptron

The approach of statistical physics can develop a specific predictive power in situations in which one would like to understand novel network models or architectures for which currently no efficient learning algorithm is known. As the simplest example, we consider a perceptron for which the couplings w_j are constrained to binary values +1 and -1 (Gardner and Derrida, 1989; Györgyi, 1990; Seung et al., 1992b). For this so-called Ising perceptron (named after Ernst Ising, who studied coupled binary-valued elements as a model for a ferromagnet), perfect learning of examples is equivalent to a difficult combinatorial optimization problem (integer linear programming), which in the worst case is believed to require a learning time that increases exponentially with the number of couplings N.

To obtain the learning curve for the typical student, we can proceed as before, replacing V(ε) by the number of student configurations that are consistent with the teacher, which results in changing the entropic term appropriately. When the examples are provided by a teacher network of the same binary type, one can expect that the generalization error will decrease monotonically to zero as a function of α. The learning curve is shown as the blue curve in Fig. 9. For sufficiently small α, the discreteness of the couplings has almost no effect. However, in contrast to the continuous case, perfect generalization does not require infinitely many examples but is achieved already at a finite number α_c ≈ 1.24. This is not surprising because the teacher's couplings contain only a finite amount of information (one bit per coupling) and one would expect that it does not take much more than about N examples to learn them. The remarkable and unexpected result of the analysis is the fact that the transition to perfect generalization is discontinuous. The generalization error decreases immediately from a nonzero value to zero. This gives an impression about the complex structure of the space of all consistent students and also gives a hint as to why perfect learning in the Ising perceptron is a difficult task. For α slightly below α_c, the number of consistent students is small; nevertheless, the few remaining ones must still differ in a finite fraction of bits from each other and from the teacher so that perfect generalization is still impossible. For α slightly above α_c only the couplings of the teacher survive.

Learning with Errors

The example of the Ising perceptron teaches us that it will not always be simple to obtain zero training error. Moreover, an algorithm trying to achieve this goal may get stuck in local minima. Hence, the idea of allowing errors explicitly in the learning procedure, by introducing an appropriate noise, can make sense. An early analysis of such a stochastic training procedure and its generalization ability for the learning in so-called Boolean networks (with elementary computing units different from the ones used in neural networks) can be found in Carnevali and Patarnello (1987). A stochastic algorithm can be useful to escape local minima of the training error, enabling a better learning of the training set. Surprisingly, such a method can also lead to better generalization abilities if the classification rule is also corrupted by some degree of noise (Györgyi and Tishby, 1990). A stochastic training algorithm can be realized by the Monte Carlo Metropolis method, which was invented to generate the effects of temperature in simulations of physical systems. Any changes of the network couplings which lead to a decrease of the training error during learning are allowed. However, with some probability that increases with the temperature, an increase of the training error is also accepted. Although in principle this algorithm may visit all the network's configurations, for a large system, with an overwhelming probability, only states close to some fixed training error will actually appear. The method of statistical physics applied to this situation shows that for sufficiently large temperatures (T) we often obtain a qualitatively correct picture if we repeat the approximate calculation for the noise-free case and replace the relative number of examples α by the effective number α/T. Hence, the learning curves become essentially stretched and good generalization ability is still possible at the price of an increase in necessary training examples.
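The Metropolis-style stochastic training just described can be written down in a few lines. The sketch below is my own illustration (NumPy assumed; the temperature, system size, and number of proposals are arbitrary choices): it trains an Ising perceptron by proposing single coupling flips, always accepting flips that lower the training error and accepting increases with probability exp(-(E_new - E)/T).

import numpy as np

rng = np.random.default_rng(2)
N, m, T = 101, 150, 0.5                     # couplings, examples, temperature

teacher = rng.choice([-1, 1], size=N)       # Ising teacher: couplings in {-1, +1}
X = rng.choice([-1, 1], size=(m, N))
y = np.sign(X @ teacher)

def training_error(w):
    # number of training examples the student classifies incorrectly
    return np.sum(np.sign(X @ w) != y)

student = rng.choice([-1, 1], size=N)       # random initial Ising student
E = training_error(student)
for step in range(20000):
    j = rng.integers(N)                     # propose flipping one coupling
    student[j] *= -1
    E_new = training_error(student)
    # Metropolis rule: always accept decreases of the training error,
    # accept increases with probability exp(-(E_new - E) / T)
    if E_new <= E or rng.random() < np.exp(-(E_new - E) / T):
        E = E_new
    else:
        student[j] *= -1                    # reject the move: undo the flip

overlap = (student @ teacher) / N           # overlap with the teacher's couplings
print(f"training errors: {E}, teacher overlap: {overlap:.2f}")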
<>

FIGURE 12 Logarithm of the number of relevant Ising students for different values of α.

Within the stochastic framework, learning (with errors) can now also be realized for the Ising perceptron, and it is interesting to study the number of relevant student configurations as a function of ε in more detail (Fig. 12). The green curve is obtained for a small value of α where a strong maximum with high generalization error exists. By increasing α, this maximum decreases until it is the same as the second maximum at ε = 0.5, indicating a transition like that of the blue learning curve in Fig. 9. For larger α, the state of perfect generalization should be the typical state. Nevertheless, if the stochastic algorithm starts with an initial state which has no resemblance to the (unknown) teacher (i.e., with ε = 0.5), it will spend a time that increases exponentially with N in the smaller local maximum, the metastable state. Hence, a sudden transition to perfect generalization will be observable only in examples which correspond to the blue curve of Fig. 12, where this metastable state disappears. For large values of α (yellow curve), the stochastic algorithm will always converge to the state of perfect generalization. On the other hand, since the state with ε = 0.5 is always metastable, a stochastic algorithm which starts with the teacher's couplings will never drive the student out of the state of perfect generalization. It should be made clear that the sharp phase transitions are the result of the thermodynamic limit, where the macroscopic state is entirely dominated by the typical configurations. For simulations of any finite system a rounding and softening of the transitions will be observed.

More Sophisticated Computations Are Needed for Multilayer Networks

As a first step to understand the generalization performance of multilayer networks, one can study an architecture which is simpler than the fully connected one of Fig. 1b. The tree architecture of Fig. 13 has become a popular model. Here, each hidden unit is connected to a different set of the input nodes. A further simplification is the replacement of adaptive couplings from the hidden units to the output node by a prewired fixed function which maps the states of the hidden units to the output.

Two such functions have been studied in great detail. For the first one, the output gives just the majority vote of the hidden units; that is, if the majority of the hidden units is negative, then the total output is negative, and vice versa. This network is called a committee machine. For the second type of network, the parity machine, the output is the parity of the hidden outputs; that is, a minus results from an odd number of negative hidden units and a plus from an even number. For both types of networks, the capacity has been calculated in the thermodynamic limit of a large number N of (first layer) couplings (Barkai et al., 1990; Monasson and Zecchina, 1995). By increasing the number of hidden units (but always keeping it much smaller than N), the capacity per coupling (and the VC dimension) can be made arbitrarily large. Hence, the VC theory predicts that the ability to generalize begins at a size of the training set which increases with the capacity. The learning curves of the typical parity machine (Fig. 14) being trained by a parity teacher for (from left to right) one, two, four, and six hidden units seem to partially support this prediction.

Below a certain number of examples, only memorization of the learned patterns occurs and not generalization. Then, a transition to nontrivial generalization takes place (Hansel et al., 1992; Opper, 1994). Far beyond the transition, the decay of the learning curves becomes that of a simple perceptron (black curve in Fig. 14) independent of the number of hidden units, and this occurs much faster than for the bound given by VC theory. This shows that the typical learning curve can in fact be determined by more than one complexity parameter.
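A minimal sketch of the two prewired output functions for the tree architecture described above, assuming NumPy and illustrative sizes: K hidden perceptrons each see their own disjoint block of n inputs, and the fixed output is either their majority vote (committee machine) or their parity (parity machine).

import numpy as np

rng = np.random.default_rng(4)
K, n = 3, 50                       # hidden units, inputs per hidden unit (tree: disjoint blocks)
N = K * n

W = rng.standard_normal((K, n))    # first-layer couplings, one row per hidden unit

def hidden_units(x):
    # each hidden unit is a perceptron acting only on its own block of the input
    blocks = x.reshape(K, n)
    return np.sign(np.einsum("kn,kn->k", W, blocks))

def committee_machine(x):
    # prewired output: majority vote of the hidden units
    return np.sign(hidden_units(x).sum())

def parity_machine(x):
    # prewired output: parity (product) of the hidden units
    return np.prod(hidden_units(x))

x = rng.standard_normal(N)
print("hidden:", hidden_units(x), "committee:", committee_machine(x), "parity:", parity_machine(x))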
In contrast, the learning curve of the committee machine with the tree architecture of Fig. 13 (Schwarze and Hertz, 1992) is smooth and resembles that of the simple perceptron. As the number of hidden units is increased (keeping N fixed and very large), the generalization error increases, but despite the diverging VC dimension the curves converge to a limiting one having an asymptotic decay which is only twice as slow as that of the perceptron. This is an example for which typical and worst-case generalization behaviors are entirely different.

Recently, more light has been shed on the relation between average and worst-case scenarios of the tree committee. A reduced worst-case scenario, in which a tree committee teacher was to be learned from tree committee students under an input distribution, has been analyzed from a statistical physics perspective (Urbanczik, 1996). As expected, few students show a much worse generalization ability than the typical one. Moreover, such students may also be difficult to find by most reasonable learning algorithms because bad students require very fine tuning of their couplings. Calculation of the couplings with finite precision requires a number of bits per coupling that increases faster than exponentially with α and which for sufficiently large α will be beyond the capability of practical algorithms. Hence, it is expected that, in practice, a bad behavior will not be observed.

Transitions of the generalization error such as those observed for the tree parity machine are a characteristic feature of large systems which have a symmetry that can be spontaneously broken. To explain this, consider the simplest case of two hidden units. The output of this parity machine does not change if we simultaneously change the sign of all the couplings of both hidden units. Hence, if the teacher's couplings are all equal to +1, a student with all couplings equal to -1 acts as exactly the same classifier. If there are few examples in the training set, the entropic contribution will dominate the typical behavior and the typical students will display the same symmetry. Their coupling vectors will consist of positive and negative random numbers. Hence, there is no preference for the teacher or the reversed one and generalization is not possible. If the number of examples is large enough, the symmetry is broken and there are two possible types of typical students, one with more positive and the other one with more negative couplings. Hence, any of the typical students will show some similarity with the teacher (or its negative image) and generalization occurs.
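The sign-reversal symmetry used in this argument is easy to verify numerically. The tiny check below (again an illustrative NumPy sketch, reusing the tree parity setup from the previous example) confirms that reversing the sign of all couplings of both hidden units leaves every output of a two-unit parity machine unchanged.

import numpy as np

rng = np.random.default_rng(5)
n = 20                                     # inputs per hidden unit (tree, two hidden units)
W = rng.standard_normal((2, n))            # couplings of the two hidden perceptrons

def parity_output(W, x):
    # parity machine: product of the two hidden perceptron outputs
    blocks = x.reshape(2, n)
    return np.prod(np.sign(np.einsum("kn,kn->k", W, blocks)))

X = rng.standard_normal((1000, 2 * n))
same = all(parity_output(W, x) == parity_output(-W, x) for x in X)
print("sign-reversed couplings give identical outputs:", same)   # expected: True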
A similar type of symmetry breaking also leads to a continuous phase transition in the fully connected committee machine. This can be viewed as a committee of perceptrons, one for each hidden unit, which share the same input nodes. Any permutation of these perceptrons obviously leaves the output invariant. Again, if few examples are learned, the typical state reflects the symmetry. Each student perceptron will show approximately the same similarity to every teacher perceptron. Although this symmetric state allows for some degree of generalization, it is not able to recover the teacher's rule completely. After a long plateau, the symmetry is broken and each of the student perceptrons specializes to one of the teacher perceptrons, and thus their similarity with the others is lost. This leads to a rapid (but continuous) decrease in the generalization error. Such types of learning curves with plateaus can actually be observed in applications of fully connected multilayer networks.

Outlook

The worst-case approach of the VC theory and the typical-case approach of statistical physics are important theories for modeling and understanding the complexity of learning to generalize from examples. Although the VC approach plays an important role in a general theory of learnability, its practical applications for neural networks have been limited by the overall generality of the approach. Since only weak assumptions about probability distributions and machines are considered by the theory, the estimates for generalization errors have often been too pessimistic. Recent developments of the theory seem to overcome these problems. By using modified VC dimensions, which depend on the data that have actually occurred and which in favorable cases are much smaller than the general dimensions, more realistic results seem to be possible. For the support vector machines (Vapnik, 1995) (generalizations of the margin classifiers which allow for nonlinear boundaries that separate the two classes), Vapnik and collaborators have shown the effectiveness of the modified VC results for selecting the optimal type of model in practical applications.

The statistical physics approach, on the other hand, has revealed new and unexpected behavior of simple network models, such as a variety of phase transitions. Whether such transitions play a cognitive role in animal or human brains is an exciting topic. Recent developments of the theory aim to understand dynamical problems of learning. For example, online learning (Saad, 1998), in which the problems of learning and generalization are strongly mixed, has enabled the study of complex multilayer networks and has stimulated research on the development of optimized algorithms. In addition to an extension of the approach to more complicated networks, an understanding of the robustness of the typical behavior, and an interpolation to the other extreme, the worst-case scenario, is an important subject of research.

Acknowledgments

I thank members of the Department of Physics of Complex Systems at the Weizmann Institute in Rehovot, Israel, where parts of this article were written, for their warm hospitality.

References Cited
AMARI, S., and MURATA, N. (1993). Statistical theory of learning curves under entropic loss. Neural Comput. 5, 140.
BARKAI, E., HANSEL, D., and KANTER, I. (1990). Statistical mechanics of a multilayered neural network. Phys. Rev. Lett. 65, 2312.
BISHOP, C. M. (1995). Neural Networks for Pattern Recognition. Clarendon/Oxford Univ. Press, Oxford/New York.
CARNEVALI, P., and PATARNELLO, S. (1987). Exhaustive thermodynamical analysis of Boolean learning networks. Europhys. Lett. 4, 1199.
COVER, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. El. Comp. 14, 326.
ENGEL, A., and VAN DEN BROECK, C. (1993). Systems that can learn from examples: Replica calculation of uniform convergence bound for the perceptron. Phys. Rev. Lett. 71, 1772.
GARDNER, E. (1988). The space of interactions in neural networks. J. Phys. A 21, 257.
GARDNER, E., and DERRIDA, B. (1989). Optimal storage properties of neural network models. J. Phys. A 21, 271.
GYÖRGYI, G. (1990). First order transition to perfect generalization in a neural network with binary synapses. Phys. Rev. A 41, 7097.
GYÖRGYI, G., and TISHBY, N. (1990).
Statistical theory of learning a rule. In Neural Networks and Spin Glasses: Proceedings of the STATPHYS 17 Workshop on Neural Networks and Spin Glasses (W. K. Theumann and R. Koberle, Eds.). World Scientific, Singapore.
HANSEL, D., MATO, G., and MEUNIER, C. (1992). Memorization without generalization in a multilayered neural network. Europhys. Lett. 20, 471.
KINZEL, W., and RUJÁN, P. (1990). Improving a network generalization ability by selecting examples. Europhys. Lett. 13, 473.
LEVIN, E., TISHBY, N., and SOLLA, S. (1989). A statistical approach to learning and generalization in neural networks. In Proceedings of the Second Workshop on Computational Learning Theory (R. Rivest, D. Haussler, and M. Warmuth, Eds.). Morgan Kaufmann, San Mateo, CA.
MÉZARD, M., PARISI, G., and VIRASORO, M. A. (1987). Spin glass theory and beyond. In Lecture Notes in Physics, Vol. 9. World Scientific, Singapore.
MONASSON, R., and ZECCHINA, R. (1995). Weight space structure and internal representations: A direct approach to learning and generalization in multilayer neural networks. Phys. Rev. Lett. 75, 2432.
OPPER, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett. 72, 2113.
OPPER, M., and HAUSSLER, M. (1991). Generalization performance of Bayes optimal classification algorithm for learning a perceptron. Phys. Rev. Lett. 66, 2677.
OPPER, M., and KINZEL, W. (1995). Statistical mechanics of generalization. In Physics of Neural Networks III (J. L. van Hemmen, E. Domany, and K. Schulten, Eds.). Springer-Verlag, New York.
SAAD, D. (Ed.) (1998). Online Learning in Neural Networks. Cambridge Univ. Press, New York.
SCHWARZE, H., and HERTZ, J. (1992). Generalization in a large committee machine. Europhys. Lett. 20, 375.
SCHWARZE, H., and HERTZ, J. (1993). Generalization in fully connected committee machines. Europhys. Lett. 21, 785.
SEUNG, H. S., SOMPOLINSKY, H., and TISHBY, N. (1992a). Statistical mechanics of learning from examples. Phys. Rev. A 45, 6056.
SEUNG, H. S., OPPER, M., and SOMPOLINSKY, H. (1992b). Query by committee. In The Proceedings of the Vth Annual Workshop on Computational Learning Theory (COLT92), p. 287. Association for Computing Machinery, New York.
SOMPOLINSKY, H., TISHBY, N., and SEUNG, H. S. (1990). Learning from examples in large neural networks. Phys. Rev. Lett. 65, 1683.
URBANCZIK, R. (1996). Learning in a large committee machine: Worst case and average case. Europhys. Lett. 35, 553.
VALLET, F., CAILTON, J., and REFREGIER, P. (1989). Linear and nonlinear extension of the pseudo-inverse solution for learning Boolean functions. Europhys. Lett. 9, 315.
VAPNIK, V. N. (1982). Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York.
VAPNIK, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York.
VAPNIK, V. N., and CHERVONENKIS, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probability Appl. 16, 254.

General References

ARBIB, M. A. (Ed.) (1995). The Handbook of Brain Theory and Neural Networks. MIT Press, Cambridge, MA.
BERGER, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.
HERTZ, J. A., KROGH, A., and PALMER, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
MINSKY, M., and PAPERT, S. (1969). Perceptrons. MIT Press, Cambridge, MA.
WATKIN, T. L.
H., RAU, A., and BIEHL, M. (1993). The statistical mechanics of learning a rule. Rev. Modern Phys. 65, 499.
<> <> <>
\ No newline at end of file
diff --git a/Corpus/Floating Point Operations in Matrix-Vector Calculus.txt b/Corpus/Floating Point Operations in Matrix-Vector Calculus.txt
deleted file mode 100644
index 2c6c2997e5414cbaef3aa16aa9eb5c60092877e5..0000000000000000000000000000000000000000
GIT binary patch
diff --git a/Corpus/Harnessing Nonlinearity Predicting Chaotic Systems and Saving Energy.txt b/Corpus/Harnessing Nonlinearity Predicting Chaotic Systems and Saving Energy.txt
deleted file mode 100644
index 73d70e5a5401a614be6bb41908332e869baabc5e..0000000000000000000000000000000000000000
GIT binary patch

diff --git a/Corpus/Identity Mappings in Deep Residual Networks.txt b/Corpus/Identity Mappings in Deep Residual Networks.txt
deleted file mode 100644
index 85ba774f15044a02f563ff644e729d6dfaae7abb..0000000000000000000000000000000000000000
GIT binary patch
z@v-1m=k!ZGQy%{GEJq0lu1)NcPVk}x4)4cl{beq-I^`q@FsikZtoA-#I>|XzeURYL z{GheU#xd=imzaANK-H21SEpw>z{{vNaegW$cx^JsAc7TlD*R9{SzDb}CLJM&|KXm! zpqI#2q|1oSeWXtGxH%Bbon+y$b*(u`xr-@R(_|$ZwJ15Rn=~aW} zxgD3M$j2vShDX^-h#Hw`tJ#sU>RsD8qNH_(&0&S;G{ zz5SgYm!$jI*xCl4Or-u8CW{V>(Y_Xe5 z&YB8JlUELOf=^4Jp5ib+!1Y7?e2U6}ZOs>UK&4s{Pa0m<5_~Gkb>5ezvAG5|=+IAX zXMTpK@-_;x>A#4MI3?{~!YsubwkM;wv?4b{a$p#orquRZc7%8uadEL&T+biexpU6D z%abnF!=0}N=nBr(!oh@iazdk?`9x2eZSAycK4fUKm@zTFEka%JN?}4_Vdp7O%3+< z6~Ilc8=lHfh+!Y*9x}`2uZbA<3sf>wr^B7<`K#}~2bOyV%VFLOqCFZy*GhwmsPa)h z-zwcqZUK3$DqMi>-LF@6wKR>O%T^-`X*ZK{h<1B@P7&~S_qzwy+ub7Io?0o`?X5VP zL9Hy_z22J!#YLDmz$s*(5R-aaZ)Gv|_3mug)Sr*6I6fCDbno5Y+G7h1prl?mG`2`Q zS3mY4ZJPJF12VyK+iM>X=|?uokw)q7k#@>2zD#F>s=|6}p{zKv2E#*v@TjUmdbr$_ zM9QJ7wj<&lT0#Z|gkESH`q-=31L!Di$!r-WGV5JaHbF3HypYFsMJo4h*wH!ju^ebj zxKw_fqCRHiVLIT4@G3X%4B6Q#>jY>y1_~oXW*vnMn05}SiJ)*P(xyM47VK_|kg+O5 z%sUywy}eF&x}Hh!{?f8R47_In@+8;Hz;^#rbrt*v>Zo?MGP))w^7w__N&?ZPCL zw!4DQUciS)Pp4Nfa%1bKIKn`6J2J-X1XrzF+R(%8bKK-(_fqMx8{)Fe9W>c&+UJ8V z2_q&4aHSIw{cc}-V#DX&I4m+)^8lQKWW`QP?ly`s^YsA?#^`ax6{I0h{O;0p2$!B% z2}Y!z&BDzd2MdkSFeBOyImwG}Q=NOQL}v)zWIM$Qub5Ek!qNw5bd2I~ORU(nU2WG>1!ra?Y+(4HjjBmJDP1bZHkynWaTKH)XYOf#RkaZzP5Q#nsyp1|mS& zbv$C=<}f9e-0Hpmfy;rE!ZFtsoF+czVXmZJPPx2T$?$Y`)is#%6Pc!G*GsOfjH<3N z=8I=`?CRxYv#hLex=Fg5Hn2xxr3p4srkI9?BDCrblt zWftk+DD8o+v%dR1AA0-2!tUrCwj3RloF(w;UC!&fPBZk?eD?g+_b-pD{$8)IlFm56 z5AT^$D=aRbspHrkQA1RAH%tuZ?zxr22o+Of*+lEvBjTAI3!dCO#XIP^k?=79G`1;q z8*(qH0aCc#>I1jx$fUnx@r!C8n?7jsX49TcTJ_0vdU@yUqzi4cj9r{}PWIyR8hWCB zuAV8yA2SVyJlrb^brt{AJ_EZQ(08wwj-QD2O#I@1Ac#x`~AHw z{;&GI?mim@Ho+JgthLyNj#cQ2zIq$_;-358I^Uwld(1cQ1dsWu$KMX!UiU80?Wgtq zXkRwXKR_{Y*g-5pz}Ck+hLW+AtB5CYMwKZ9)O_`JpQYr++SbonhxhMq_3z*3KmF2D z^Z}4v;|^k{C^b;zY~Rg0X<|H){1MV95Y|Ixr;A;(P#13{`F(>iOp8KPmFBLA0oj^9?^5dF zTNnduue~d?6}RT-(>r&&hoA1eop&YGO16@M@g4!$Oj+2++V?&UOHd}}gzzGJ&~nNF zoO#?POsS;CjyfybDNJY|HZlABVuPwfcm9N(z#gQ_QD`!S$G&Q;lD0X-)Q3o>A(tOv zj$lo3Gq>AQ8>U9O0c*nqMiTC76(M1hd#O(ezn=-AayJ!rhE3FTqFGzi>-69B505@O zxVOh?$O4rJw7Rl<9_v0xQpiM{cZxOhu;MLF17Uv@eqY7q#VDq+QaoASuO;h04ffQuE!8-gu4i#bzF_g90xmAv!y~NX|?!zISV-IJ~zeLyT6Y@3eHf0gg*b0V@YYn-taS4dmYo0#saHs;HOD zh%*!Xp9(ANjSF2Mych+<8n1gXMNUzLZtBE&Nyo=R0!&dJ*U5+*O&VwQFe(F6{M=kS z*Zt~5Z-|RnAO(!+iUOxub3${o8>-nt)GS3Nje^0{E9>e}J+smvV_?0drNhYH_<)lMbN`LK!M2O3T>CF zh(9`4bRBbbUuZPfDtUJiby%Hz0#VXQEM|1N(pEXIu;z$K#Cm!RXpSQ(_@3;~aw3$%QKe7g}81_Cc+2NmQU|!^kmy*^(0h(gq6?c=WD)RhOdg zhR1eethap^oIpe1FQ?ylj8pjo(Z_hDwN@V0P z6+;#V-#d10c$Ydy8A%2U>1GRiFBk#w;^D7+<%NT-0tz!Qopav2`R?VCi&l9Z2i)9!< zte+d+yl+4h(F9eLWb^H(4(bhiXVK+Tcmi+tWb6sDCitNZ+fBH|W2ph6&QY@oXi#9) zDKe@!$ReVcuh+~d9gpzU(IOs0@lsyJdK|!nFYkB7oCw>)grMj2g~b_C9vM4CN)3h| znzervd%hnKEGb>^>T|!mOKjP8&Y7uS>!lIgnRlp~jGx&kB7$_q7se7161UeH?vPuc zc$@Dgp*hr>QlQ?Yh9Ea;JAYgLuXp7EHcc1>;L(|OpoE+)l}N7U@STDkjYwodzBPvi z14x-$F7xYOng~3&cS)s*vbACet7<998n&SH6T^69 z012;aFl5sOC0qGNa;qIfh=YC{mO87WOI7l(+vF zA)3Gsvk4xB5fp6`7f6$ARvM070&}C?6S)f(Gxf>Slk%!rba93EfnCLh?K_3 zdz5u-J8{9d0N_gJ3@(;e{``KiK+T&T?9l*j%VLeWHgBf-II&Trz8ygzBYq zPYfWV%C$7S_v-zz&cA}3st#AfYpu1s%j0@7VPAj!o~;P`bke9f@W~Cv<0gMJ^xVBV;6wv~z44?5)^Q>!x30J`ey` zWI!7!bT*fmNq_4o{uF}0W}zeg4mDfYnfXXP0fFB-&0G2-jw>mw6H^cR3cA4WkirT< zDizTbAy_vpGob1ib^CT?h(?t!lhfUSu3nJfO@~zBEBRgrN20!=7~jB;Ep!c-%_cgl zAPOFfS2XoF+JHMJadt%qs=D!xWxU#c@Xt~rV8e7P?y;A2fvOP@q+n(`mWYcBRB~a+ z#Q~GImlU4IylHMBbzPTPLM{)4tF_4=Y=&Hs{&I7XLz=eb7=y#enXUA>ECn8v14<2I40)xIk5Cl5ng=S;-r}R<3=WMbuY12XViLBiJ z=(90{_r`xXVTFUxj;uIOIxYc%Qh}MtlZ@@+>}Y0{4iOME6TpszQ3sY&zq{6r_E}ge z75(7Om1Rrpx=e)+z>G09r8m3PL5z}Ps|Gda&g@j_c+C?O4_!6dy5XrdhQTHg3tQpl zI1olFtY=4$MH(Oi>2g-{XJxJv=r0QxX^Ge^M;6=4OY>wZHF)oxPM!>NkiKwd-nIF1jh3H!A 
z1+x>(-$1-DfYqq;G;HK&$6r79k*Syn^z3(EF0mOdr*G$%B%i+lfX{2q`%ip&wFDe~ ze>GFhbUYhCrQC$#2>E*8zrS1>vi{Z$t*_p{8q6dz>GhHNz-%1TTAv@+GX?Z?!t&St zCj&m}9`pHWJ=*1eDs7m24i7(9G%bN#Rj(``^ZjVPM0|mvgz4&-Xf`J!oYzeFPsiUI zmecXVTp!VWt|4D{|H?pLD-xE*N3EQX_)Ba2<=_K!tm%JC#pDE#zvc7G0k|DZzz%<>D7#ALfX z+i=_)Ks&wr_wVLGj=Hb7`Ih__c{JwP743h)u|m!TYXhdDu22I7X<2-vAU&YE_8B!x zaQC+j_B#9l1&$OFh3Q)3HgkIM?AuqXgWc;sn{LIbzWSwR-zv0t4ALH()tgEq%&2<6 z*?~i}sT>6`S`Ovrr=F(Mnnm4NjKZ#vGla;Qj_d(I@odILPVO~hA)O29 z2ul={`vSDIA>n&8)}>bwKjZ6H9PD{{v0QNK(DbcP`Lf&GZY-x6&sys|app;+CTk8o zM5e%KV>spLmccb;Fw%_8oa9Wy+8iTRNc_R7qbCJyox*-nrMTj45?oY-tUH8+a*9?Q#|0n{i8jOzt$A94C@_q zU$K=Kkvu_WnO@PfK1VXX#elsWg?f0*QIbC6m(VDs8?d6J^MmRY8qQWl%GC;KM#DnN zaAe*CbH0%LE1m)C*M{!i9to!wbce1Iq=k+#e|M@s^XadW^RF2Cg~zgF2nH>z;;NbF z5loG&IX30;m!wR-0II<`rH8IX!lpn?RkjGIv!Uarc36=1hE2R z_l+FeRg`9{hyZyZwK0wc6DEa(b33V!?HB0$(Ahssf0oJ?W?xc(!J!9YXc^WRto3z4 zIL5uL{d}6L%Q?(y)WXp@Dl8=P{#vM8ukMd*Ltu9lk+7T zC6GT7>rZCLV*A*v_R2(G4o)PmxGPZ$_@Mfts220cqkPzOYV#_@d3DHNwIFWr*BTE= zkg5=N-{P(q+JviJj7_Fv!pHDwC)71iBM7rW3_!V>3mKV&QRc>@y?%G^ZlB{n#EyBz zDFl8s!bticl*kVG+lZ?Ng+`zG_t^?4W{fxjEU(GZ3XWwf?ppS!0nT(xTf?=5UW2Cn zU+V#kO)8oAl~PjgGSz3CKJbe36Le=eq<2AGY;sqnFD37}2EY%7;A{DWRNm7cWD1s_wCQTOsAV5Sa5P^yakfN@# z`W5(?-C#ECo((^2onB2B-SgpKHW|+5|A+pp4)A7r-96gw9?hm7Cf)Di`$xmcUkBHt zN%vwfDIdNZo&VUq{&+E*<;PwO$Aige*qz>0m!C%b= zm$&2bpnGyPn2kOR#@)&6{b+L0y&N*wkAumA2YF~Rc;B5~eH>hOUyNp>`Bis1m@j4@ zmp5>7J(ye!x|8{j!||ABCKp$u$*4PdH<-oPqv5#wayUC1&K6t$I-SqEFJ{BV`PFdJ zeLcG!zGE-i|Chtb=-t&|Vi0}eFGFWA`_t(C!@ZrI!|mOJ-J`?z`(5|;e7^no`43%p z*!}T#((Uc+?F1zC_v_(@>7;vnH5y-ZCl{l|$3gdca6M#GPXxYk_hP&I%V_v+%xeBP zyu6$ZKX%{Wc8|xy>)~YHeLtEkhO_0JcMJ$%7So$6Ui)%5pL6wW?fOuLh7&~~(fJb68L{`<|%aB}jM z&g;%c)_-ww1~LuK7i(U8Jv<*y7Tt#_pu8H)yNluVbfTvhAoBc{7s1gHc>8WJ;&3=v zQ1D~-^7*T7F_`}d?Saf^!^!#8H4vM3XCJ$p*>H;iK+MUz)s-^BpaZ&4@A-6gbIX{Q z-D3FDqPv`q$I}l?%NYObfByS~QEoMcCLMJ1o8kG0kIy~tcK1E_HS882Z${^X@%W=Z z-5R``Oy`Ty`RcS_!eDlOHChbM7q_$FX1BN+&AaR2;%dr17bD1NHvG5S(QG)6q5Aon zwOkC?_HYrazM9_7*+0L`*n+@wxjHj5oAGcC1NzARC%|dG*zB%uf$F?F zhZ(#Zf@a>)&2R<|U9O|n z`C`N=cYpuefA8i~0YPB?a5Xx=>dtRxGhu{Ic>Zy|7+%j4e0*r@X4Hu9puKJ}?JlO< zVnD<0!?2rQO+NtV!J?aUMw;UGrr(E%0N(v>HXJNCKbAL}zR%Fl2f4YrKX$jp$crw) zXD~VEIK7P#T&9E2CYGjQUF=mxio>1@dE=5yv=+IB2s_x@pb z2Ig@VtCQXgW-v`S;e4~(-TA4zW;6klk)c0m?EMJ5>L%04)|_WyhXq3nSuNrMteh{p z5$IY03mjOT4j2e;0sGb}6MY&;h*)jquc?O5&VPK6lY zh_HG#1w(4c)a)%&`X=x?AKzXGn>ggjd_2%}o87gTr4i{q0=)fTcidW3l}UwkP(=$-%{;B#teCL{IK-U5wt1IN=#b z4L**}x4V-ChzNOupR1EDAVL$wU_768qYK6zU4Arub{V9w8nFly%isU@|IB;L8%uf9 zyup$)XIX+9ENgY*b%S@8gL6(g*!*-neJ5c70?#=*;L1sYQ*f;;t|V$gHhmbLSw2l? 
z1IS=BF`chY3gHXK8ANQ(8>+bY!ax$gmym;tOH~Dg`XIQi89yv*FuonYFXqT#n8n5D zQj+d=yja8EFVKR3&@H-#0A=RhkIFG|juN3EQxKFIZ+?CY^E-ooNuQlh&qu?BAaQ*I z@pyHXhu#aLu5MJ!ggwL#!I&hbwpX-~-*r#_?Utb}+(u`&!qKJSK6ZbcULk(IoZkE> zrQ*8@9C9{CQ{3#bj#DWzC$}(}Su@}qO-n}y%!jeCn*m6>UA^@f;R^0y22>yJc3&ZK zrxzgrt1o`tkNfr5-ClQ+B*d>9t$FLH8M^ECws*tYFWRGE3V{agR-NE`=UKIUA(agB-^BhwWQ7Xx7!7I+<7ew{tnvRUblcr+Hl>O8G|L<+ zel(r_*nOL%X?K#P>2C$$-Ih!>wb3a{^51HnyT;i)SlRl4tlX(IrmfS(M>OaE1Z3}; zHh;dm%u4pR4?7MOJFh#%Xc>ZES;6>KN1G>F{df2N%Fd58n{_GS)oh6MKL6iBRlCbf zWPkfUmN-)4i|5}xU0v&M`^lQwk1?sf`fX>qd$hflhY05%bf+ID=xyj9mLcCFrAN{s zQ8<2c^xJF5{h)gRirq>@_SCPiUZfZD>kIr1v(e91WxH=KJDf4em9W?F;Y0j6*Iup+ z*Z=IF{e~qR$s1_J=&Pg;{b$DCUnch>iSE(-7D_{dYMJeCEXMW^mk90%F8gdS!gV!A zhki5~NRvimdOd{pr1J;heCVdjPV9QGY1XwKB`}^ z0Pj_$U$?Y&{Kxh9Z{$umMfsCjaAlTURF2V)|i#L!fuo@?CKt%^L17mxthG z?+%26B=--W;~SBG0#LP{K|B}3pGIe>Hg-b+)8d@%J`6r$@&lj?VlbHaLmcS#tX#^3 zdp4TMCF&uETjRk;4A$iZ;cmg--TOb-VNoyOYzW*g z@S~s0t6Lm^!;4K!^#z~v=?tMy6byeFZluYOnS~9XT6*{P^~=-K-Q&$CkDtHcpD$kn zknMWjH}dC?@uqk3oZ_^)c06Mo;I1PXcyaBPm9xjejPonJJHyF4tw?yTh2WhFw;GEL zef7N7R~u($b_9Z4#Sxf~ULtlh7VApDm-lwczU9{ynkeWZR2fZfW2u9S_i|}>XIRSK6@m1O5dJ)cGS;7V@!l17(NktY zA`a-G)|XtkH5}GEa|gt;=HMJ8$GWO$QHk}`V++{4G2mH59FnFF-7x9Z?c{=ylCvzu zbm}GWLW#QciU+CIH{loN@aOoMoIxTnre+WFN7ZEV8FT`NS#1 z0|+6@4=(>EHjd_(X73Z|fu198)gH%#v!RHjEM251r%5BdR@aEBWmWsp~Hh*4Wgz!wRW_d{r2Cy(6w zB538!6G#-l5&dFQUlWOB-_zuy^v7j)%&8Dc3W$pG&WAV9CXb&(LnA_fjDkNLmm_g2 zM99YogNRcsoi^}5r}s*Uu?Wk1$E^n>pvkfrEquyzS)cDmmH^qCmQI458fD~WmUnz2 zU~{D5>bky#t$`az9OU@0k#noZ)!@kKkS)S{J=|0tQ3Z3^?eM3Y@sw5138^@eWFVq2 zOU04IhyOc-Yl7%5!G8Sxa5Js&9oZ8}Flu1NN^vgP8LLo=h>Y2t9-jTMCtRqyXcxOFvCQ<-RM3HC>lE*X63F)HkC*H3^D&qw#5)(%Mnkc5=_(6tpeGEE3 zziO;i`%$bG*eb;WNoEDHSw+ykp*sF3Ze$+(_KVZ!+QjkZ%dfSEdV|o( z3>uG!PsWNIN?hSRMbv`h)}c8H_)Fp5!O)Z&Fd$+y8eS%KPL3k@0yf_*QOkP%^6JKL z``tEa1^+LaGD4IXEr)W=Ya(-wI!75rbFkYpjwiq3pc~5bb1L+;CwbC`1S|*h}Zs3 z`(-x4Hoqhk@l&&pUYD@<5~f-ib@h*xwf1cE4*RRy-QIoJ+rB{pY<2*8MBsD#Nw64V zx4pDz`HjR$$_0r5E`(as8;Mv)_-d#mO=kWhc?LXywsN)z18U_ZPgwuWIOI8n*x3v{iJcLRvSYSQ7vM!)wbR1v{_LohA zx)sL$!XL>7EPo1;?+M9C@hXz)dhnx|NAD}>h7NH@!U#A9tg4f*WUi6K`>wvv)Kaf^ ziHB-?&UiBS3}$F%T11afSL(RcQSvHCce*~CLM2I|mPY~Uj?C2>v4XS$0a&tJjfIs4xhV zPRPk84Cpgg5V}GK}l1C^O@302S6h0@%0jf^3vRu z7f{fm10nmLe53>*fr#1d_%!{a06x_PmDDuoALEDbvBAHQ~?tSV!SYy(oHOt2>w z{NCZs6q@0z>lJ9UtnJGsQ<+}S!t|E>f-3QiEWxV@w!qL@0)!v=HVLdU*?Vit_WAw%(2j*7n&8n`Y1uF^T%UL^#DItL`&{W1F$X39S zTO?-#0!nA8gha#jRwfJ9dP*2l@Q6@B`x&aNR-xeU#i*PHyQC3nIxQL&Rj)!P$2@$69`KV~ zCn3_sUr2pGUAUd`WpAJLhO5}B1oD(yWztUEZGD;5*yJx#ZcfJ*K)=C7z2^qhArt1D76b~;Z^K6L9 z(P$w$hwzK1Lm`K4n{aTY3wI>S4P+V?Z(Z%iP14R$A+hk8tx)qdP}%ZD7}qmu85JaZ zw>jGL>DZp=f~-k=Rl@58O;3l9PK*Bos#pAk8OtkH=LBpSchgJsW9B8uk+n>eYOC9N zJ7O~0{P}>~yW}~8WF~XbMr+cvGbofJRn$rp2wDL{Imi}$yE{9`6{i&256+%(24?U1 z4`9@^+h6-kgBjsvkH>dO7eVkTOSv5~GZqhey!M&XR}K?AV5HsMwNGS!^$(3lT?8pK zwl%$bf9;!ZZfCehw!5cN+%r@RBE%YD67Vrn%7r1J!mG+)?j(r0=qeaRAB3twAc1F8 z2rs=50Q#()B*{ObV-L^F^d07iqU<4^vO5j>w`;<=&X(!@gn5Rjd|*ZO3g*{Ssh)vu z&`7nL&_|V2KkVLq{o>2;6p@t<7{obZ(j$&RwE;=Hgu=spU#}tRO$;?}6pgZ?Nq$uP zQ!M*a3V`{inXFt2E)Q^gO7aZ~!rc6MIGaY&d(hx;W&tOko?o_E%DETKg~gImjWtgM z;97ejx9NC!N1I>KU$ zZ)tD!VwU6{RPW^}zfZLi!K%f}ZrM+^rhU6kj^+!OG&Y z?aualPX@cLSHu$FcnU&$DYoCYY`_0rZoAqEP>byqYy%uk8(|e z)^vfnhbF_p!Agby%Q1zeQ!_UBu))<8rO`&DkO@FQ*J4kVHv-ER_kEY^Xf&fXQ9_kK z7uO{p$#a{Wu?=|SkCN&V)>tx%RZk!rWo0q8U*J#{*7hrr@Q__i^82A#B`=57Ljggm ziIEDt!bi;8gj}dL13Agh4>npa3B5EYvXAkdO6l?1o%$Wb0>GQ|6x12L1`O6BTAkcJ zw(KJSmAqepT}hqG?k4Mn41vFC%ZOU9$yiiyc7;c^9Bch=2i3>LfwKRQqE5`*cMf}0 zK5FT5QqB+dP}(qA2j?HR@Y>t;EUmS7P^SvHp21ZPl6nhCu5d^TMY=h`5LMz8aA>6g 
zXC%(N$svj?axt@1nv}eNn_zi%PUw(BHa5N>P_>@lZ^P4&_f#ct{6?-75r%Zb=nnwx zZyy^DG4q8JrbHs>3m+xfLH3gAsTj9{Di^r>WX9H07hs`FQ5h(wMg6tq3&SM@1OHr3 z6+-1B5&=_{uc@^tI6WXIVvV}=>1wd7G9<&RUV@XA0xdCSQwpWD?@!8DV6{@8mTF8= z)r~-;xJ$RdBYr`yH~KyrPYw;V%2?81WUHY4t70jtheAz&PD=R9@G&MNkBPlsV!?6> z5h;>G0Sx04g}QT=h)j666-SszI>WOD1qONq$?U1sfR;f^`-))Hes43*A9fpqdo?RA z@-DOKb+C*m;fDrOp*xs1U79`D5YhdK%g|CUz1gN}J6Q4@dq(^U@d&t5E#1pf`7v@Q^MHi{a z*4yDz@7Hvt0A?Fp6vOUINp)_kMmfcu#86<;mG+U!@;D#mJ4B5bb_IEj=T5HzE%J%h z*pO5CY8W(jF4!TLUx)z+i5=qD4Ha&R{ce>4kM56+A)dV^d%ZU$*nM$12>0P1w-;Dg zf~NFVgki3f%z%`(@=H9<_4VCRo$ z-f5q#A)uF*VYk^GAI3Xr)C@ZLqs7o<)H?*D!;9+6vaq9r?fuV+8mRTXnT_y$l{MD0 zuUYQV@%GMVc{{5s6T7>=yS-N)SzcSv-dF(|n-hw<>Ufvx>g!dw=2oYTOjtdLZVH=& zUvO{<9I7$K4X84)tuvL0Ww6wPCa|d97y|)k1nI?PJK9BAc&}!jD@!A(C7;q@Ay&YqIUGLY|-wJJdcP$$$XS&^QeqH9qu< zp&=PEnw2Vi{k^0&@UWXDIc1Es1r!Bo1A2Yy2@z8%6uPr!xrTod*z3xdbs##=rdl}? zVuEM8iZ^BOHyrE4Ow~o^WPH>$<=QQRSPt`G%p6fbzih6Sma@*&odPyx))L&>Eh+3| z@ct!%Mop*)#;cRW1YQj`lne-KwGNVrxKCbE;8ha(gm_-$+63bKoRUC^vAsTd{Imus zDEIB{$z!((QEnZGV7!S0kB*FGWqVQ%k}9X5Kkvh;COdZQPi8eC>*3%><=P1Vu_=wj zdJ$NC6jXl8M%lfB7Dc5TAGz-`L#+jg(hHcguoJZ|`>HiH;vaok1?kxi2z3)Oj0|kJ zg$1m#f`=FF`NO+Oi(vY5(E*EPX0uS6AZphV;g6JGW)}nN$SU1_wyMar3OhP9yZ36r zgRT{*lT+&PvOy}Sr1c>v!vlTPBO_dibBsu|*2^4^Vv7J$NvpjX$@~~+!|~<4U*!ro ztXu(^f^E<&ls*AJa6c9)RQ+L~a|q?^1S=Ie(mo=AOPQKFZ$PfpN$LBDubf$Tg)g=b zM1(Y~Apm7%lMpE^$Io2#JSrV^QBsDn&Y<>?CgSm;WYRk}VdJnLF1$rKBT%E)GuvF^ z&G3Du6k?_o2GE&im)L5(35eh!eGkaL}ssRMwZ{1{Ca|V*O<&PfP{2Us5P773z0$%gA6N?royyJ z&vY}56x)nMb=2JmPArm91E@Jm*YH0pNMV2=Xd~ z^sED-kXhY+qJCDveO= z?_#JBl{~J!NuiQH4PnCS3QtUO6-+5QceW$y`HM*rIF$7wX}b~k$t{yAQ>jD@SwTQe zc`$jxKX@%1k7J)X%MT7~{M3ry4Q}cs zPSn3h9M;LKI{)hVWz(%V|9bZ=amxm6aub&9RA58SUvcgLGM;`|m*!>8(h<%iWe`0d zwclqKl$5%mz+%oxgoH@X@auvkDvrG@#j>1gy;{#DC2Xm8N`Op4j)V&E6=^N0N{+j- zVX{NA7SFx#mT-&6q+%Ga5bHRDn#s;QXMke|t$OAj0i)vQg*#as#bzGY1F z@i$~#QT~OP2f(G5UP%P0^*B=@oEKRTn3&IzhoHrjz6b9T13(FI>y+^SkQpl0*QbHB z_KdiD&6YGUky~&@Ia_s0H{)ea=#Pj&5*o$x0xhDpMT;NdQzWMuqZ8h3ciJ@(#Tq`0 zyv{kx&ceFsw^!{V!lq)T?(##f&WmjV`U!oFYIj-L%oEeo zP7g){-cukFatp&3S?X!$L4aaBnrcL%2dq6TH4%|upi$KRFF578wbbh{FJ&WkGWT9m zy&Pl;=|RTfS7n{OOEvpb%X3~g3U%3XtwO^|qUwP0+I$4%MI|=t;HC6Ik#YkCCJ z@bE{Gu`o!UX=r67BB5D(l_sZpUMk%%Weu5|v}T!tvR;K#YD3wlHY#1n(Ctqg12ZWrM4kG>(G&=M0g@bqeQHbPzGG~qsv{9OBy-K5=W3;8to|ClHizr zZTX@IaHRlHav3_jFq#nSVE-k@RWI1vkb85A$*lCzTiraHbI_!iN^nnzCq{?EmI%P3y^YdYBrl!3!*3(cac!yrsA3< zX%#kV#vyAlm(8WhdW$1aXm#$)ki+;`D&s+I%A-RkAw`8uf;ow#6R9K6Gy(?=8z5R6 zjI8NOIg`v^GjYxTlD=QgF2@10{UxfqoFW?~HkinSU9tzNxgx?6EGPKJ>dgHtmM}dl z65BI>6bF|$O8(}E1j^T@le#NfQ%IPo!>KxeZah1^$4H`FOb{%R#2%HG0t9;jz`-@y zAc^xO%PA;d;cf;?zH}+{s~cPLyC=aXSglEkfU&^hBXXrPT^3OU-GX0~j+sJz#FLuf zHmXQa@Cl>_<5l1-c{yNwkq}OPf3Sfjr=5|Y9WlmlHr@pTC<0l{ivtJC)DEA)t% z!{!o-$*PX^iY=vaP6f%79=tmJ#^ikQQ}>g*l@{#&?{gpXAr@g?ik)Zj%iE9|?}v4%083!as~mL4iBL6nfVcrOB0 z7f{Otk)_q2Y-NNyN$f2wn5aj<-K|t;3-fC!Ctw|nfHz66UW!TdLaeQg3Q7wPgM2-6 zOnjuWJP)M(O>gayGIQ)x)`PLL7*si#J<&!Q@d?2a>?iu^#2iEwd}=~Kqfe+=-FHzp zL91R(w}VIYx$I?S)@*ceKm{A>Q6$QRev!nSs6;DUo&3e~?(LUi4Cx8g{xB}YTYPGp0r0%7~544#fCFtAo*0w4%Sj}AM@Lw$;0$R0ss*Z z4j^=0h1SqKh|;UEwNX@=>+CQ>Dpi<8EC!}|jqV&M&B~(2X%+HqK-ErXkU>WlZ)jIJ z0K^eywGIU!7llb=$lW4!n55=fI#DHn4@IjOuR@%(@E;LB!Z68StMZymK}QLWMuZzn zr9H4X$67=lvdPl>eWZH1vqJ&H>I#I#CNnV$X;tbTpqWw?9>vgz{vrYeRx2I9Dr9gJ z9Ls-)HKbOt54*qQyykGV7-OV;WrW%E617#hLr6>|BTINAR%>n@$aq7FiVBG3q_rAS z@7p$8?MzDgqRu;Q+Ec7lE-OY+&y@+3Gbn3`aS`icf>2BCuB;c)?k=!bhQD%mgwI7A z+{`T6*dB%$7?RfP53SL8$>s(ep?w}H3qcV?S|ikxD?;h0Yp*pV-&i$Cf&oX6Qtk>4 zczMp75NwMxg5rjs@HBNdsT)~%>LiP_z!b}^qkg@WKjc^EzX+>vj*Mim>AIuXoxbp?=vrRGIZ!G5TR-pje6b~P#5HoPXf@{ho{!tZ&-ZQJf$zjvLnRA{+6<<}JYAa&P 
z*lY;zy=rV3gPC+dzwihm{Nb`4n=qZ0wjk%pwqWs9UoM9&!{ur5cT z-$|-RuY5po?#le>eR|f^5fSRCQu;fbx6$A4`6^Y`i2U)Fj^#5A=;w6OZ0NB^yA4BQ zj=md~8e9OtOAHW&rJuC#Lg2K)#P8_lh*k!{&^9MG!^=XAQ%PxvLVk(%3i07@Lr$P1 z{m4USHWDYKCT%7NH!!VW)&y59RM7jdhEmN7p#wSr1jD+^|CORia}>O-z3qLg^r6^B z34|SYa^!{m3ZKh)9@4Lq0L~wQ2aW}?c)8R;oH3ow(iD5!2iP1DDXWo37+L1JG}WVG ziR{wJ^#~o)6-+M?MBvj>VKAG2`1k$gCtf%&`lS*C^Ph>M`m~3hq%NJ^MW}b(@9rJG=yvzI`$xz6=jfo@+2#}e96j9m z9S))29o+H$i{Zrx6=eT#=S8=7K&rp~I;uz5{|6&HxA*)0Q5xWI|40KoI4VondwBTC zCH(0*X661PO)rm$?w+J~{c3-t=F* zcCf>;`FEG;_hS0phkKuVf0u>t?r-lr;^ppem%sS0dH>V!rzYo-Xc~3F2deZaKGT4# zWp{7y_&9zz*0ls)(@xn^eWq`mTOk8WxC5uH+h6Kew58%7x%vqC(_yHT&TgC;?VL3( zjeq(hd2Q`G>gPbatH^W{*PBq`Th&$$$ z)?W|JrtZ36qTxuXsIN2q~0%OjDOm=~^Lz#=CN#za>Q$<0B981+kisXa^kYA-wbcj>e zED(ivd#AWzt64i=>fJ5^39s5a);gm)tr!JGZwT^lz z#506|Ml1u(2s08=X&TJ0=`?^Uw6+2c;wU=JSbAmT=WBDZV%92cX!kAWln6XW58c1@Axf(|St@fK=HRMJxfIeM;d}7An@9vb+zVTKMfbtcBDicC%zK*`DM|H+HbTrW9e&X4DghA zmib+y@nng7$NNDvn@L=!hx*}hIiefEGhO?KJ9**+xmHV@Xd?g5rsMUU0Pa1wnB|a%e+8Aj7v-yLJO=9r_X+t z=F#wLbD!`OVx^?I!0BZb_eq>u+5E5y-C}6wl}RS~6ux-%9mhE4F1vbS`|PzLc!%>e z<>tHKr+&G>v*bNGO;!4kC`MTda4qmYEB_OetrW6Pre-brT0d+h@of14kwoM6AYb8e zrXwic3AaU+4RagUP&i8#iJ&rI)n}Lwb4AYC0K7gIIZJA?B@R)J2f!VKR`0t9io$H$ zcQTTkrzm{Jp|?zDoUM$k$Zf#RPzrj~DT!pAJm2rLo?mQ;WmX-_Fj92ANaVw{y6+y> zbG*qK`4b?IRn|z`^jX*ig}M`aF|OyB2w_gkDjd7FW91*(E=&M>T;*&RXS-x`>K6Wk z(2y9zg6rB^G@3*Q9#je1%en^)4Ee&S0EjZ-2x(Zld^1a}-pm-xPwe)B)uqJ?PC7&h z@(6(IfD?X8v#5cZTNF~ROfaD8Uk>3L3gOlCN#2tC{O>Y{g55aMFjLHQ zAxgB{Q`;Z|DG$|j7TMUnr_=`V`LXV)QKh^4v$pRdDLaZQLkf{4#Y5eOSO>V7 z`)|ws-pXrmio~*#AwMOTAX1vV=T#(igGF54{`|9Y+AeMt(hHd?*SL5NnIvw^l)Bz# zqF~vHKJ!K~QgOYc^(GTvo8^hz1zD*b3t9lC)=3gos`68BhMJ0njwqp= zRrE<9Ydu#12-bifj7T-Znqgz#+jCwa*`{q+SY9StcB};!q+z}c_*?oU}y-p9PGZT{UJ0ql~`F?ozMy|N@T4NNPN0;;# z;3USW*T{`F#*K7160M#RKtjwxj9}r&y}VLwSRx34r`dAf9SKCE)qzll%a9hx6v-cz z6v~a}1FT7YLwC~6cTH&i5#-k#$>YbhB&h@M_f#R@VBQa!gG=o9^X6eMX=NywzxaiI2>36KzDHX27_5SHM*mXd}%0fOd_DW{hc}ERoIv=B&w$ukPh$feDSD@n#KMp zEDh=AtVRhE$XgW#G5tYhnx1{+M&=tlYFn@&R1P`$Wh8HRFQe=PNG-LIiQH$Xy{X<4 z?puD_8JV)srgFAfg|Y}Ivrh6W+@)}G491M0gv09 zfjyCg3Es#?<8{51&$foX;1dYd&sC3@f=JMksuNZ>8;z+5Ww*;^i24-r`u4W=O(0ob z>nJH9!v0W1N;wa3$d&Y#nmJ14!f;0BA*`g%1vHA>%E;>7w8{;RvAynUvACH({Qd92 zB$8?&=?~*=$`*dl^0$P}zi-qX``rHVU&qxWCFo7h;ka;0NUP?erl%!rA;@pSPkKfv z&|sIRn4cnx=L1Z$XYBMP^@ot`^Gxb5)IEbq@Y-G z`&f;T1rdpRSOz)@sbR(@MJaJjpdw+;h8J@GQGLX`Iv6_&sN2EQgdX(u^wHDTZw6c8 zGu+re#7Svgb&m03vmmKp)=x4~9`3+Y7X#|XrZ>Nzb2ciX4h{e?>zUVdt9Ii?N=bCZ zvfCa?3qZ61QnXR426qRbpRAxwb-b%Te(PdAG?@;uBIQmf+I3HFrsr4m=NnUl1m94~ zNxkTLoF_?Jtv{n7i#g4`Zb^V=?OeEz7wu6xzq`f`TJJ*4ZzS$FytvmL;2_Q(xTBrV zaNHiOx{K3_lDNc2`k=hCM;-p&_Jh0TP4WuAbe`n+Gd#YZetu6)Jgetf)Bar(6hVhZ zn_r>(!2b3@L&4cY3N`Mar<11VQ`6%bCHBsx#;1LX#IIH+J_?W}Y z%bc-ZwwbP&s_KtPDVFDIkP>^va}X9vCsDfs$W~I#^~vUnDNou_Wt6KVF<=8oDy=13 zJo-6}6{K4)0Fg;QRLy~IK|5eVWhDjYKa%MOS`?%oWM6Y`0|Fe0m5vp z9H08&SIo6~u5dysL8>1NI!TRzd{Y=Ri=*|<(|Ub+)9dy^*{~d_RXA1-r)l#G$o2V+T1 zGEQm%oEj%X<|z>XS=#4UQ)Odvu>Fdtn2G&hu*d~mub^=_7l#>CWXs!XwlIq~VXqRk zKGzm3*;6#Tov7mi>j>AMm;>Q^&ND!jY^!>D)@e+#{zLIN^$StA&O~-4ehwBww!2DvQs~6~puwCC)sx=iFOWuHb31SfrM6Ln8MPw=tZc+jtwOHQhNq7J9 zp!$jgg6=6c-mP`Nh9sWB@^ZRB!CD_7jS7DSQ4DN6zQmkfh&XInoL2{rx`{JU2rfB2$rlgIMATsO?iIu z;?c>I6I0Hqu8ku*YUTPHC>DuDo@7<$*DmKhw?#URLY^O(?G7sS0(%80ZIWkNiRBe! 
zP}Tqda>f?Lo7j3R`}=ecy)GApuIX^1JFGU%r%Mmb?9RET0b&n*mV8#I?pScWf?j!p zcR6*=o1mh+e(LL#vL^W_6iTs2cS+PDb4F*X$8az$j?3T5N?TK5BW)}iR8;|%dRmKO z|L7b(5uYf;qgocl$N7qXIL_h;H*Xs z?&Go#6_Bq*7K{AQq+EWR9g1J{uG3(}D?O9oOlpq+Emy5SY6xtID@dD0SvKSJW?FCxEj z#9jo3dSWxjem_CRb_4`4Sa&@XNU~*1qaG1s7YdIAi*V~iK|b3~*>IuQyYQ)=ybsA1 zgJt97_2bR%8Ggia3jIBxDey-U9oG)fmHF7Ndn^La-`4r+GLgaiqX-NV(CKir1i(e97ON z+3ir)dYg_wlGQV!DE=mSRoU*O+mS3Kl4Z6g!0IVq;`=G5aXl=}U2*abw|G>A1eui>f?{}Q zy1_imyWUW9c2v>2Fzl<_nSXc)yY^e5?}4q2Ge|H(_jeaJ+;T`{->~~TxZ}ekYxHwC zTR>qvub*ih`}@pJa)Y{~ zdIMQCL6o>D^H$XTCWSDREbRG+3ZV#Y?qJUK6mCCYx~Q3$yf5~7dCsrkzzoIGvTXe9c~xO9@2Fa27r4bTX2488WEf0 zI}gaNwd7wfjA}X?K^F@}OKR8>% zK_*2=`kF^pO1yGJLbY7FNNCLx;Zhdx=m^{iAZKBdm>IX(BtJrvMHsn3rL7I;xuQH< zuWMw}BUSVIfU==y7X4{ru!XlaIkQLN6lxORy&rMG_C~6p1u~T7_U*5LtXY*g9c?G& zzwX6`juuYz09PW(8`zmB60j5a1)vhda~!J3b7}U?rr`{UuJ^aa8p_tirK^I`)?2Ny zLowE-;wK-fq*R$qjDLwhSA)4o@W(RhtMRmYWz76uY#>-nuOGy7wHK4@@}-5NEk%1{ z=ORd($_>tH1`+CD*$VgL+3pF8HZ>f89ryuYUft`7UW%e%cBIqr`Pn+|`Ie+zOpyDi z*HbI?g}aiRs)V!>QM5bSuXohWk(F^K;DFp5M9Na!ZBl%{4`8yBn0%jCk0#!7YO`L=ncp6jEDv&RZ6$Q^%uZ zU&nwllMgZhmrvWZAJuDTcwW}!K=sarOS2)|-(uUVbp>6agZoQk2n0n_kgauobv+<~ z+V`7;5FbsrK84qGwKvh;Xix@@)brBF7`sxQAX%pvaAiU=MR*rmTO7{*fJ$lxRw9Yx zLXhfM!2nrjdb9M{FN$4%HyHKc^)lEE+I5VY>*!!TTYSFdr5AX!HQhcPNoe`1RL+1V z#uH?TGf8Bsm)qaWrOM$dZW7b5S4OWFkumS?(k2uiv9MoP*F;#j|C5p#;BODR)3y&N zf}zG}Ya>Dr3;b8r4M7{ls@7h7ITY_UBCFmW&Z{~h{$3J-N=7K69O_j&H_s^y>pP*O z6@-0joh-%6BYCMGAzvU|PrY7O!oZS5pf=a02bN=vVdQYF(N$t;)(}ExLq#2>w1cdf z)H6d=x~CJqXmv7tDAtlt`*l%!$kv}+qua$L+M?=;X?obaeQ~Jr9UGue%zq3UP6LIy zS8tjmj~+AD{&f1yo0D#%6zboj!fgAv?b^5Yz5c_*iI5N57X>cBW_jIkMg9pVJ9HH4%RiAe2=3jr5M#0asWa~kbiQ9a6s_Ya5BMeM%8~bnEcpn zfVKA&w|v3XnGX(r$T*l&gu{F*S0`Io2p#LWbub4#`fW8J2Ydl)ncAxuES=Y#F~#d} zkcNWRm@dM}z$Z||HmepObf_00btz}#_~d$#pMvobTWoS@wDLrBEKR3>_cD-e+oo~M z>%k2~#mh_fn{MmF(-yIRp%jdlF*S?SO_E4*LP+#@b##|NgLP4PbtG$a(V*E`Vk!9S zK*`%PxuJX~GHMK=DPhWPQ-c#poER4lAyqc?&1P1p zssIYuUrgDj5*U^X1jyTyGQ(a(5-X`nC_m|r4wRSFHwE&^ibYmn<(U~^22DuFj$m32 z+gq^N>m}~HN+o+;iayq;@do(kpkzMnx39j=eZ9nW!gK9&5a58V(yOX{Q;F$UmsCO(yTO173VYtEKS(tO1ZiW$XXgq7k7-B)o_Hjx|% zN1-+4N2TgNNXI-4bgS2pm59Dkv3A$kc9cpv_YjMW_BbwL7~02%m;HRASbB z^9-g=D)n~v_R*Jb;^tjmKpj<_Y1I!H`UqDDeU|G#9I{K7_&CZ;|EsASqHzR?ER^(IR9;?p_`1r9{)MsJ|aUgetxivK6D-* zA8sFYYdWpOgI#ibRGLsnp+DtWl$|FH{TS=LyMKIlPGuehzgP5})7ba-xIJoro2$_M zhbyA)YmnW$cN99=v)tsA*N%@!ry!YgFO6|uV?5a2Zw@*uYGhv4<^VYCOa#YWdhwe~ z-gXy@T6Pft2Jc(gNG7_ z!nc6#MGrTw_CBkWu=DQi^C!;DAI{C6?0_8xvS$;^?$5J;&o zW91(rQB!r%GRMf&20uhRMdRZ$;>}Kz0#SV-Yr5mGUaEU1khqkTyI{OV-CbHE^rizl zhYf7KeEvm-LTD0jDg>qEYUl)UZtiQ~*2z+LwT;Xk>bWzgK&0*-f0hJcktQe(-9B?EZfKl)74ArS^=!sQy)PHMm(^Nb=pUR@wc z>hy=q_a$t88ueIOkhgcKAl%fM%jsh}UNv)$%U;)*;B#kk)gyS2Ye$ifioP(UCS99bbLJVK^Uy$C~|C&QSuf5fT?K8fT~ zNaD9BD{^+Qlnm0+Ql1BLu}#1|^nBxk^&$Zl2kp(4G^BbBAghY26f*w)F-iuUk1BU3 zFp(u#5sz0Sx3NahO@?m(cEN{?y*Z)Axl+0D02Jt0y5823Y*_;6=0q$z;x-LYM;btn zFC7s8^L6yWH*$`Y(F<&=Pz2!dt+T)2E}q*?<);dRZ{hh_wt@fLnWKHfTG z6+g;P1jlM>W-G3*A5RE3GZ(wN#O*nuI?hP_Xv$4e={qn|6hiZTC`S&9c+vxH{}WHv z%UT-#iy<{l=xeIhNp4XU7*e=SD3MHEr<03I=o52`F`R)P1kyA^J-kELNJlHo$?;Y| z8oQNe#GV9*V&!H?DGZ;b!dEKkAvwb^x7Vsf3ExE-YCSziWX6TKh~w1TEa$y%PBRCq zF5wL8_$=4wN_6CjlDKnVvuG8irrt+jb+?{e6&4KX>W1VhvEYN!W2GVK>Y)=oTGduW zJyd9!kWnGExAR%FxeKw$`Ig0GnO(0yf^r8zSsdWnenGoHxhR*A5D{~U%e`<)!Dhn1 z!+{v7k|hDKh<+&5NUdyd1}?IuUP3w{Oz(87k|Dut*%>Cx@;0*+%eGX`Y%AyjPGdVv z(eMsRWF*8Mq6?#q{PSK_y)s`%yaoT*T6t$+XZ$rrTn*L@sR zAPlj3l|;7nWU4;rcxRr!dX>y8zjAPZB7JbkU()v1kNDG~hzonlGsVTRCni*f`0+`a#J>sdA(_8x3Mpzw?p@<&&>HriP>81^4_uj}iZ zA2HG6L8PK1Yv*{84o@nVSvG4>Uwp?x$ti?b{;ai35?QZ({UhOnMTl0lU9{G>hddGBUjw;bq>hK3ecR 
z(SXv}vP~HpaRUHa+|_CT&>KjOWrmAOfxdXX@!5hwNA)@#m2%m`SR%_@SJI>iV01=- zB0`i&X#mzW1j46#scWtcrK=HnarS!YwClw<1AEI#XvpuA=ij|p#kPa4_c+xP5ek?n z!kg(nS!^J$ChFf&y8|3S7?wh)w=e8K*gYaeE7a?Pm2}fpj7G*O1LQ>VK+SsM?9_C$ zk`28_0Z7)3-KoZ4e5lb!YuS*LQwh)Dk}A-k!ZGfi@<_&jb_JmzZAhUagZ}=G?EqL4 z|HZK#2=SIqs|LQ0DH2Eqoyx~kgQhr%$FS&`$fxT?^iGg^Ol8b?L6PG=@Bhw|E zj$#cPl+JOXv?{_FI`Yvqp4v9n(FeYBa7!aAnMqzGmSyw}YgO6x*;nn}a}X-suQQz2{mRRF(V$4p(0cVYY?T%?l!=zPIKX`(`N&j?GLcf=r3E4 zes1jcjswCzwH(W^{+bWQF<8o$<3}yYid&g)3DOMSbH+mS6 zRa&a7lMNBq24*n$$tYN=0ics480mPm*IP9#H1e4OZ4R#I!uy%;I-rz}>|vLzhC=#y zJ-&XGL{F22C!-><2cCKK_}TN_eHgdaRaC*Ze|-G?Uv}^Fai0(8RDA1uaN(DwlMM8e zQ%i)ipfQLaDeMszZSTktUg8dsAz{6=lEJ>XQrbyo&vz#a_2*b6K(d86o;Z4lGoP_! zYkdZV%hoGFL)Upf6i29c1e3tfgQbYJq%VSMXjno^)2wW+A_Wq%YM0;smy>)_>Q2Rd z=JgUFL%YZ{7#I^-8jacm3KmF;1tc{Hr*5$vYEyJZIfxy-K!G5qrH%DNvUc zl)tZ+x2#SDe#rXCPnT25P!2t^pChnz7OU`_>OaF%2swv7d_?e_3E2rnaY zUk%VVYct6R)Uds=lQd~FtBX6GnLy8rOcL zP=Q5$Fz85=OXd~Gl=yh?XdSbir~S5ZL%`;=wR&Pz8Mx9(grMq`0iCwgsy7R^>$tXu z5^O6B10wlqN!=@;(SP?nq8+5+>a5{U8fuRw)w!V@WMA9@4a)VxKa*g>Kx+SW)dAy0 z5;x{X6`mTfJAG*_>}>cEIR&#EwTZ`IXv29n^qZRLB)rZ+0h#m#$9ilk7%Tg|e4mdl z<6xkRjP=iIphD3sa4p6{hslz*gqITncO8)H4=Uql8XRB8T~Eko((u|i{1hAlORHO6 z!O;FvVw~RRw7r<&vkP$CvEp){YC^yI6xssEW&dKcpO9DP;=~&9eAdo{ECe=_`$^nJ zdP&`r$#EKIj`b(h-j`YsXhS3;Gz5#)q_2NLUz7u^ccQ?RE*b?n_;~Go=sHd#P*J!&O<&+%uupWdvUQ9s<;;^TkKC z=kXLWn5}bzj35Z|MDBCl>Q=7HD`FDVM*Zq1kz3ak$U^3m zmr`%KOo3Op*z2M2tzjCq`fUb6Ld4Bz5nU(4@~G$c)G{!76G+N#J-;4LmTEnpde-?M{E!YQTQLMe zIm%MKg9V;jcffpXum1gS|98PC5C8{@1c5|mpng&B!c)#P9(>sgGov}lQ)%dF>)R<)1qe<-sU+O zLb66Ig_YyAE zo8kl-6SXbIjW<4))J`@vBJQ!M&#kGq-?P`Ak){d5!He694CxW{osnSU0mcUBqr?!1 zf|3-tfOwlgpI}Hx;xjYaxEQOz)tXGEu0rkfH6rs~AcPWO%=ktJV~Z_g zhcuS(D^>T`mD)g&tS~DQo0j=wmZ?H?1-(jWt3nB(VRnF+6{s{1nB+!yW%bx@crJtm zjP{zi;$$=+BHnA%+`Ji=?y^uerCVslZm#8>KsXgbWXYAVpgR}L0eOy7)=|20T6zvk z7fleR#G2g7hSQbnflP+KnB63=xLSuzu8 zcZR!ez@=V~!nvx;UbD@b9K$BhCf|pgFu@nKHM0#Pb_j*hPp}POMOk*rq?YB$y=Atq zH8rkiH_dfYO29$9iU%d>2nUwPAQ5|3?6eeL7WnCS8^nKIb@5ns*Bxy47Yi3|?yNMEnVj#O1fr$J zg^VjYh4a~H^$zSB^@E%Fa?2zX#k7_8-xA`>7jO97&f4m@5b#J`VVu_k9QheDEWwQx zs8pA|J75AbG>S$={h#`A_h9|Erzr{M>ExfD&A1TqpPow1_@^gB`t_xsUr%8+{9Nz2 z8qA91hU<6DZb*hF;pGT?=v3NA17Uu^MYQLf_JbXM;4-^z|CkI9--5TVgpUTJs?q;M zS1A&vl}fm;wzB-S-|Z>2oFtb0_;sIz`8yW(;^g$Sdwcrk+Y}SV-{ia!UVhN+9czR` z7UaLQC{hFN80q@a2?55p&%gL0<@aoLTeAJ!)|=tDT~3?bzcVW*nVx^q9oMJ2FB#7o z_tFV4d&;fZyU$N}<95lLyfaL@6nlsy^4aStpY;|Ula`9*vyYJlfE2pjhW^KvgdnahOSduO zrnjwnXOWhJuCF`G#QTypjgXxq1{AM>Gkcsh3^{`l9|6rdhTi36_x9CvPJhm=YX>E7k;U}D_OC+CDTa4;sU^a&aCoZlG(K;>3rd(y)zffr!`o7+$ci|zuZ0efr5$e|D@UoJ|zi%`kEAnVMTLnvfr_i zk|el?O$x1m@wsl1-lnta@ZxiNztNX_zPuWLUPem8$PjO!!JS>Dz5E*PslE7)C{CZT>4{KDw85DBVyVxb!TESj>6M2|CC8(w5 zEHz}KxLxA0Z}UQS56xD22a`Q(Xvw3QrV6_}P@DO4;jNr6Xf~+{ekMj1-B6&_;Cc<= zof85(m!^8pDq12*Y-az-aQo%|?xymiPrQ-ES80@yeqcY{#-E;lq68}s+NW}cPqphF zkP~U?rAH`zI_f@^MV-b8uG%yAV>t7_kkmr@VxH;Ff0vm0F-(|OiSGQ;R|_*|)+1ge z_omSia?<4C!>lJ|Zg~25@BV)7lqzkj0CMh*BVq^g?~vS}e&UgrmLC$36Q03}aCoH@ zy?WUPQeI1^BdFr;#^+|wxwEU|^>|LrC{7d>?A%1J3Y=5r6TuAirp--mH z_$)kaiBA9VUe{1=&{d<6p5g+NPG|dF#*kZd2NTMk82TYb@<`>cE!8lI8)yu((O?!4 z?$Du>4FvkAtf~e)^Gc{>*R6+W6Y1j0jT4%6kzgV!tsv)~b~QH(+hbicGJ8|*hd|fT zbpKjAeGTMFtVtPYpcZ>cPB?CmANSi#ZuHgAXby)S!~#OWHt5MZNlx^9b6LF(_?d-k zz3)cy^>Q|rr4%Ww#VJiDhqjj}BmN+XsyJIca?^3Sw_``GBi!H;yi#+MbzaI;~b zO5+=vY@BzGUOlZhp0qQT1n<)+?pJ~eAa+)9rCbGKE14;_jV{HoI@(pF&OlMWTN&Ry7b%70(jD-7*lRJrbrmgW zf!bQBpUKCGcF&1?vVu^~^4C*=N910^BvDnQseyV}SAixRm(eidFCn^%FC}H;IAhmk zTqWJG4ca%ERxxu)sn4K zN2uSdw5WLl8s~?&6-(XHXnc2tDm>IBU?c;-*xmaf08^?(vq!Ca`TP}YB!pBFsZm*! 
zCFWBPyn->Q?hS7Z*EKiJkaF=4-rwB5H9e*Y31MF2RZg9W2e?_^-A*a`6y-Kei@DLH ze2H+P3Cb;g%kA!UZXsb*jJl&5m<4;RB5?N`(k^HEhcMpqNR5n28endo-vcjcba7>e z!xdHGK{A+rUGj&$3YC(Pe>}~7q`o4Ji^kmD##3q+?!}Mg4DVI#y6fHNQS|S!GEYGQ zey<06+dFT+{`xdIZYXFttbRqu$Me#EaM#$O1N0hxcMLQ77u7W#cZa*ob9nzwpG)$P zd_w|>{Wj@c<@a@mN+v&Kluu9mEIVwvi*HgSaTxHaDgZ= z4NI3xajE;CR6OYS6{oN%*as_ejrKWXa z3cu7m7)8IHgtVSbL)J@xrDRjHH3nFeQO3vlfiu?W{ZO> z`?qw6OS-&oc7hw~94aer{BbL)iz<^j(LQdLu(%N0YBsQ>c8&+{f;H6)#TnGgXJY)b z8?083*}iq7!mlMV`c;5;%E(k$ygrF_0x#zwHUjn=C5TL>u-!iMI|w1cz3@B%9T_GQ zxV(|<-jz&jlGYt;VO4!Lw5rgfsJ1@j2pObog1#1hyUOBr6h5I(>SBSg&DTELeWK&~ z{`7H?QTp9p_Aj1)_Y~zYwe*q9;fj!Yeh4t>USJ2XT;EFHf;tu8Lb2$yD2#+u^SyQU zO42fuc0K&4*^%bgDIQW%B_zM2szuo%^(qQP1NMa;-vtP-AUai!e5Y z65nJ1aBkKGa$L#rxhVS^8L5*P$wloYqdnp1d~ zgkeDlImzxSlJl!PvdpQ&Fbik}9O>q{(W}NLlG=*CGAPKdr;PrhfBow=TpmqidT{}du6Txdf+`Th+oj6@@GjNuN z9c>Y4;h>)m39d$W2=xDh`e^F)h5RhG1DuAitb*SMiiJ!2*BB3EfBb_no@EVY-+Y?wW31k>O=}-#9ZY;;&vCQuH2)Q#f_W-I+)~X zOR?IF;EgIrknoRxNv(wN{{QhMK{|qjCtnc_AVB!-S0{XcbzgN6Y*t?x%s7~Q!!+GV zWUEMZ5GU{uaA9?Q=}|1b*#Iz@_~ko{py_u6o##!w%W zo7oGy08{1LHQ#~yiFYe)2SuY&K+w9Xw6gl-NSNhzP`HuzYfnrSeUYZWq=VST>Nav7 z=GEwepkP>T_IIgxNKimdpcCNqklCT=Bza~zzM;PD^&+G^qz#2gMC9s{$Prdb9Wtuz zzyUwnLk$myW_;d=Jv|lj}nUgLah{5u&kl&jRC=>&vvv)@QlLQ$!b5UP)LOh#BCGkBB@2i{Q8=4i+5JFUJUq;E+?IV#&+;b+1*)15s3_EJf-C zGzyk{4&~{B$mx;ZbBH=hxw;Y$qZdJ?2@8ac;+1LB_4`Uks3+ezgt>i2i+457{?08Q zhZj(A@3AQ<@*J5wN1r&YjpZX5(tRMj|2kCVfTDx6CFE*~%-Gyg3_pBIu*|2XQQUf3{t^%BZM7{klxpW8gRAvVF9S90r z$_!U{6MV}xWfjlTp2f@Lc2Mp{V?889Dt#|1*vA&~%GdCVfg~sjnjrsFy;?C0~WHM$Al5iCblLZJZI|!l z{OAtSNJ>ZHN&ql1#keTv;nzZcefsdLuhT){n$*tGd1|FV+k)56-Fg4P4_2{kRd)o__~MVxE8*K` zsG}N@3{%g$DgX+Ry8}@$bR8VJLQ1e87X-}IeZR6y?@3lc!l6?SDpM&PK~g$xziEC% zeZ7LBY4zzc*5FNk)oCRT0MhyE_GAnITdi-O|H(j6x0Y1JC#60aAelniM1yN=DJlNC zZe(m6v^OyQPK0_)6Mvf`sV@?7xFu#$nW{>4vD$Bo)M8o?s#Cq{{Y&tP#Q|>LYVc!6 zTfp(?XEWTp*_&{ypyB~4Kc(s<{aQ~X!c=)Elz(0#6U{eYRVmJ?gbJHA0n_-Ar#(G#qYNNsV(Se4Ci`Xx5EIn)QlKq)P#{-0SwDsk z08fScR;KPV#}c?HzOr$4DH1_^#wAuo1Elc)Vva2*Y%IAhA$A@Ab ziHI|ZvNF@COerkFGgKovlod@b_d|sbk@vI*j~XkA_71$KQ63ANjv;UY&HW-!wUM`HeOIp2#L4-R2~F ztIH1@N%=YdP~&YHY}pr^KE@Th({V2be;sWw1j?y?N9fN)RfwL_%uk7t+D^qsI>F zzEJTuuC4s<>*3?+>nE5P`e)5jR$-lixC&O#f6#yWY+XT91s|7h#qjKHSFk>>xUGgjHt8f%%_CzM zh222Ux}{zU<6^3^Vsy)z6N|1Bi>kEG$ht-jT16=OI^k{kEEQd=*Km7-vO$Z^ob9rT zBmd}=-3QeZg%KaG+{MzvF(suVOU#)fidp>QDAs4;Oe-~?Y%*d(@Q#k z?(e~G-9nMDxN=xyRAu1GcY$eCi7+Y2l0(rNeedw7mu>;*3$7V=3@&P6BqxuPDT+#@p@CeM`00j(ipa`{6|ic&UePVGM6Xw%88>$K4!{1XQPt7@&e#3Y^nS)YpoZf@L%V{do<92(@c#qmS`g<36*|2N-jd<^6^Njf(oiP1 zhSfz0po?rOzVMlkE-wM0Vhd@odgef|S_x?Lp{zNFt}2~`xzdNbN?w5g1ViXwSI z4+)w>=C!ZYE0vj)x9l@T!%V)9v=AqmXJ1>sQcWSMze}BagM5s zVV{p;=y3Xts@&{6`zI=sG}5Oqh)Y71MEGYJ)s_L%ZS;vyOMs7Lu-Ib_f?|%Bw5V)q zgGWuqUCLx4Z5$5p%|soV$KP)Di0ngu^VH4mKs++G#0C@$mXk>AyV-iCblj#nnk6~4 z9K!0EMWENCphvQ8<;u2qYdpHdzq_#dcqeEYhoRI(n+AZ(az&;;mjEDjj93M(_4zP^ zl&H_T4Soue#5f8euAqn3k%_huGSPo5Zv7VccYFum&%1sI5&VgjBTWfX8Z6KgVJ<^}A> zygk^kBsw>rj(v z9=FDm)clOyv<=g9DVZ9i$3dp7kESDT(ODoS6TeMyac6QkT~8JgJdRFGM~KH!#RoAk z%W<_o%0mr*pUJXc_dm+b8-QY~A%DUN*7IJ?Q&6!e#P}(d+CiQMMhC*G#8yqac115K z{{Et9|MgCa4p0uy@syIfn+yt}F(gEnw||FRtuL$#YnGOA$HB^6cl-72jdZX5rd+zr zE6pklh5>z!hyxiZEOzUzOXYcU1t5etpo&fzZdw!6z!W;yY|U`l6XqRR0K*$CUQzUV z-`Rw?{MvFr$RaHidX9rHiPvtj=mesKE;bj-a`N)E!^(fX8b}TwN4*5lt;F_K&gQ&v zAMLXH_8153SB-6*+h;z{m(PAR$x7`8&yq;_?BjHY`-lLCTg;gGXvb`x^apqr2 zEwZO__mE3nGh);SE#4xQ95$7ncXWZED*eIGZ8ivtVqrQ*9Z+Rgw+!1+fCe3aQl!sX zi%GIa(SL}>fMQtZy$zhtS}&kcGQ4gy+rn1@;q9a~uj9C)j+ShY$fbHWxCxqX1`^7L zM?Q`Aagu`M!wQkgK=eZ zD5g!79kLjO;mLdwtyl^5d&V8DF}aFs2v#q~DJ4D9mf3sO18alWQs~B9Djm_r;So7m 
zz*vpdR?{aUjYD7ec#w3Odj6$JQ>F|va`egpQ6H?hT}KQwUAbmy+u5aS3%`x&BqECT zTZ{zbjs}`dNcyDeX7MRrCltGv52AbfG6$aV66%CFV)fL$6qCWDTOJ&u+Dx6%z%G2% z3TIPqys5)Vw>;%Gbfuh=#K7c8$T=C#+Xx<)@P6asG^{hHdiQkmWlNm^S}@p*J0KFZ6{61y?8fi*_dmPW7*yuJFsYa> zu4+j6xUH$g44F-0rN2Fo)oq?~p- zk;Ke#5uk#Zc3q2fHhedl*i?cv#Nc|f5^o71)%i-s(u3~w;{?K2{8OWTOA^aSEl0^t zc_TkaBk{jMotN6r@N&PT;Hdp5oVz;mVbHnBrF8b?y7JyXE91_$*xD9k?VFd0$B`hzomE<+PK zLMcW^ac9Z7eo5DXWO&*vS2s;od*?RmRm{EY(~~zO1*ti`D&?l{J{n1JRwA2I7dpYR zw{iNhMYb6r(Z4oGz*-}GClF-)K-V`o=~pn!EkL5kpA=kC8e25Fr~PqIxW(&W%;(r1dhly4;M;Ja`Vlm6wzsg)r3O?1Fe}l2A(8MQEG0|s_k-C^AnpH-W2(!hAQ(eZ9 zvB7AN%AoC&vOKDnaAx}T=Ew@tYNYvymJ>pQkvfMECej(qEvnTdXR?TTS;9VSIH0xo zzyIxj(4UfPCGi+tr0c&n_MUux@9%&6@4mezYz7H_y8k%KpXY3TC|Qm8JKoWhS3%DZ zIDAg5c)Rn5-JKmADB&Uzm#H@zHUuPS|Ae_jsUJYbZok{sy;53PkJI0!Jf80v8Vj#e_&&OkRrS?J@nvv+jPMILWMg! zO?`AE53kh2rCseGNE>2Xhx^{v*6ss#)fk8Qoc34FX+NFQ=>Q5q!&p8^;25PybqKFN zrF?F3Kg4NqC*dlJ^w!>8Kke;S-xQN@cc-qQk^%xoSBM(Sby>0|32L`hw!@(5;<5$f zf7U&XZ)5;kA+67#PNVMrdV8JhETLJ7DI}{P#Su4xj&}DS-IJWtbL$_tufo;BoW`b7 z4j59br_XNM;!7DgF$57+j;1gkRotVYAYjWw<%Fw;#9tHtO}Hf;yxAug#2d-$m54dA z8S_Ta|6q&c>Y(y!?7S!#>|%s-G8_lN)8%!v&E9jj?M5rjN{=N#Dd(!!AdT*q*rv+} zo{7Ox~&}SI?=Uv zU_zXC2YVsAc6N^M;_}!fTskCqYzmyKHezG<$>UF~yC})kD}`#M4u%E2o6tK6gT_V; z^fZ4{=CL$1_lvN}nAjqU5rBzkd2-8bE=jYa&o>VK<@laAx$iAnhgduRD&)*Be4#Ij zMo&@GiG|a>WGU9_)*Ww1pk02xxwZK(?^=2G6%Lw^L&;o$MTUZNtp*t7{#CKgmg#0C zKW(2loOL{@_2sQTHy`f1x4DD+GF%AtR&_PDyp)cRnnj~nnTKXqmzS8VV;`Q{%$3&~ zR$^;PYbfuAJi9}vMtm$BYuU(em~?h0?C#i~Ry0C^zM|j;bz(+BgoNQgDpVek$=Os+ zBGsJImM?b+D!`Tn z+ssGH7zQ~xmFeq5bJfcfPFHDm7em>arU|myy=)g(O}Zoz<1a+2h<-Mz!Dy`y%Y zj-v(b7xG2e1N#SCT*+L+q@Er)W9TlwNBUMOIl@{105ZT*a5cBRvx|Zt6fRnHD*a~N zr_cDZ-UTZ{-HJWf^_XJNjoC&cbpbM`!B)qys$1Zjr)Vhddl!!@o{jKc(vCuIlLCRt z&P|I*Agp0P@Amp8cIHpjU4@nEvJXO%y{YEJs=~(}{L9Au&kygRg(%%_xA4@Gxafcc!<%U+RS}slX8+6b~X;Q3kOW(U7JP1;J~Kys%*Z&T53(08 zhOTRXc{N-tKjpAJBFvdHKSN^}%p=0y^tqwhu|8vL0 zlb<^Dowf4@1Ne)slrT1s!A&0!?if~skiG6e*F&gXh;F@5SCbubySZJ0y!kuY4aR^4 zyB<6sjr&+B@3a{??C6|$UsqJ@#9#Y-q-=3+t~k&q>vF`q#3`R8FCAE|gm6hI2_tg5 zwNS3Ypf8FVkUU_;x(`GQR5wuUHvrEm-w$d_njCI;L z_TfXBu#LdgHlg5}OJd~mQx^nLT93YvP$7Vts*~p+WNu-P{d+P+p^>W4x&~tL^)CfZ zm(lqDC+=N$>r9SxO<&4t0fs3INsH83-rWO=DjmwCq!wvw+k}B$A}NX%$wlx`D%G#S zzJ-||yMdWIn5&u_m|K|VeIqiz%x^8K)NM4z*t<(4*UCd=9N&nHjGqfLZ*EOCKoi#| z?_N#by``z*pACUy&u|YwcO+?PL1pUEG;F>?KtrOjCE+glJqkSOXmhy#S)l?njitp* z8bKg84{`A@)J5FmtdF0zBAm##s*q*xXP>>nyXo2w3bUQPd`HL8T}==}TtZit75&gz zmX0qz)iyxaFP$e9GeCyZ%Ul2L^`Exa{)Kyy`p5deO#b;6!?n|vPZ-gT58Wc#QVv;eOt_ib0MEl;?BE!gU~ZuslKeSbYyxBD#HXrJk8#< zZzhxV)chxhKev(qM;1L9YE<6R;0v~-Y3yS-X!Ch>EWmpeHt+Nre7jZNy9!L&aS`jA zb7@vsnX#_7@LOY$B}j2cG=sE|Yf9g-;K(E(xMQXvW7bZ8-}c}W6}tPXf^5g;#KUDG zDN}KrTGLI=K0Wh9hO&H^L%E74rMS26U=X_B&}V-zbzXA#3-6=YQa88I#MaZNfN85Q0S!xu{=1+YMh}2oEi|I~~ zR%#6p8_jFI{AqU0O%K_^-W`_{5$EKp<^`yGRBOzwenfgY-gS-Unw8GByD7eAB)BhNQp%xG7N4*lo( zxusFB{k-FUnk>9*Sa}!1tpAq~hSEm%Y?o|3y87C1e3LlekeW)J+4=17{H(uO8mID4 zvvc0&1p8h{n4A2!$9-&!59X%r3pJ^gVzMsb1aiRT_MNYz z1=Gx*7>2OaqlhG8p0Ik~%Mh2V*Z;Kt`~819T*Z}|9RGgydpHB*z7GoR=iuh%L_Xei zo+!I~|0~Cg;r45`dKJXa+J6t%&dogJ$uF<=5C8sQ_D}J4?8quvYgFZpA5&G#+8qP$ zjm>ZYJRhMMi4ZhjXA}{yX6Hob{PHlUyXBT&9NyNeE zwGCR(x&4MgEIOwSvSAL^8$E$<-EaODdb}8E{F*5D>s;`P#;}P{*j4A*`eJ|dWB-4ys1rvmuH$`&e z`$BwxXD}%Xeyt5czL*Wi3uu+&%K+QIE?>r54h7qZ*{fj7O8yEjjVa#%N_gYCOoi!R zI=}||rVQ?)VR1}>)C6Ib4w5;wvf(^)c>T7eLU4{(&K?;*XJ?S8fD_Be_ns_e#J)yO zoUE&OA`(Dz(Gp>WLYsq`O7DGTPs$^Fa;EPuhb7AZmZ&l=aeWP0tPpEcSXI`=A5NQX zrN!b&b=3ztgrS7hg=yk_;xr9)+SRGXnp!F(wP?bp$JU3p$GGxYL{|zvR>EBcv58s_mVH_{#&wH9%eeGx*A;J|QC77MHez>7h@MF%?o!aM4Jn0{idPZ}AS6Z{Gmu>3Llk1GZ_sb4GxV 
zH8F&A=yk4@$ovHRkwhg_uSzqysO3iPTce~v5zup4jlWXw(u=kYUNp)zaF2s@5VhYE zqEp?<8r{&_Z7~%wTV5p&`AG75n@)n3lKnOZD$-wlJ$El|K^hmwAlgL<(fHHY>Cj7D z*Legewi~?r@SeaGEfce?;&jF36{XzU-Fmun=Pv!=>Lx;mkmgZAyZpAOH?&L6+oHAZ zH>4H)&B*IIFK%g9#2Zx8?DoY1kw-PGEt*PPL5f1RzL(<236~u9xweZ|R zj88h+6x88*tJ+Sqhhfx>5e*SGDqSOlpm^Q!7Ncq%L|ux5b$jB(Bl{X_{-NMEEio*> z`u-mf628tN3$Zm;)z^fwG=d-A&Y5$?2Ho#Hgv!pl;#2sdtNbmDh-y$Bt(q{XZ^M1YMoM(0NUPq_J=BfCg&#^p$k=M?5*%g-W0$7gyKjZF+R{5u-*Jf7mUV<9 z8AE7x{2C*Ss=AOyLSGxJ)0b~9De24U8XNqre^UN9{BH;c*SklmbsqH)Q)8D*Vxwxp zdPgDKHzUIEV)`fZf(8vQH6=^wV%8;ejN-bduIp=O=D+p)Q+rS$dBrkHtOYkP^C(}} z$6~Sy_a?7($7Un%*t~Gnmh^Tz{9A&Xl5zbk^N^s3ytC1r7*uY(9ETSt-(JW6=}?Zk zZ@u~6S4CBZOFKw#d)-HFFTB()1N|KrNicxEVjy;_9!;D<0iE4&a^(xP41gH2A>>+M zM@qmj0elTDMI)1wDR-6h421l8Rk1-fH=IJv^LMSMU4VdkUW$76K3`H%^?~#lWI)dk zzYpmuZFQnK2o_cWMpjawvHXO9n-+X=!vaf*@vZLK(`Dxi;lf=-l8Ff`5j(m~a1g*6 zzTgPW5e!d~nLX*?U#sX#`z?w!6k}w*RIGdfKlQDMCCZSLOW7=XOgd9`beK9zl$N>+ z9Afq*l8nK*%ybQD-~kvHq*TR_tS9b_uc#@7zNM=fU2l)5AG1wJAHfByJwxPrT{=k$UD6o&DHUFz!~ zj5b_?gz3eHW2kVUjyciQp1-c&F9!%zgebAdMwaFeRlE6e_(8WE#Lm2yi!ZXIE?Eh! zKcOh@ejxIANgx$`qi~#Gh!*%ROG(~%uCa6Uv9RPL_Ek-C1ocw8LMLk-Hz%2;J=iy39 zKTSE!<6lPBF*a=3ShVZZ&i(rYFkwCeB=Y;nC&dJ(_(`$rkLAaXgY}1Ci+dBQy zlf;qm)ku6B`;WG9U@Wo9k=q?$CF5!~*ooBw#~@27M>9@Z4^CWsT>%16y@7z7lbA}) zhIDSM-@)HPYgv|K@By2_s`}{@p=F>Nk94Ee=DBsA!Uqfa*W32G;1V{6e;8Looia5E zJL6PjuB3~@zfKaK@KfUUyIX~`33 zF)j$K7?E5uf8VF@fw!4MWClsfefuse?w-WJSs9o}WmQ2$r~*Oe;=H?&QgBGlVuQ5spDUoCs7|w5Bf%R9 z2agpjrp&DbodKbfgETj7xVtkXSiC1y{{`JAigOvyXbQ-@4As$W#_?1TRq&pkw&nk6 z5q5=n=R+Pe=N48>gzQ(6TMj!6uY8=bEK#D1s2bkjaNOjz1~PF*CunRoUW!=pO)#e$ z*-_;O@fRzY9-qupbO?!W{uk!-Km5PP_kYV1xJrRQG!U`LRU6)mf9AB~w(M=ayz@%I z1pVA5OY|CCrnixLUy4RDu7MRlRX+qsn< z4DhxfhK^l{7FIHT+g;gJ7FTLX1Slh-B{&d!v6Wh7?!nGh$E4d_yMQH*xQ8=ZF%DYT z5Pie)r~^gEMOIGZqx=VsPxhiU`e661nGol+dnz`ix8>C;+dqk&RhE<~{|$T*wF~zV zQ4X3kqqK447#nD6!G*d@h=rL16$c;ck8msVENp(s>8uXIWenf~g@9lUa1qwYp)~6O*|53w?RQ z4^h$Sz;*!N>Zvz4kMv?ds>pF5?3d`>dngG0qjz7n^^z33*A9$<9TR!{AFPUJchvWz zvP?^Zyj8B?nh!VSI#M7etq-S#C%mZXHPaw4`z7R%(@9Nt>ld-&!PEq63=+2dMY7M;m*I;z?^Ca(%R+kxaRbIHQ%B;IG#YZaz zij=W|mJ((1j8Xg?g-ir=K+41OET54hpP?yNIvD6;*qG?8{##1w!~+EXc41MSe^2MjFC5^F-s?vU8; zA@RuxwBU0oyhMq#lfy8ByJNts02Vaw62DQx-w$Ku@s6gkf@iJICHI+l%~D%Pe*eMa zK-kPF)$F!&FN6gbY0jHxx%c={&a$*#|Czhz&|nfOg7SgYcIp9X2n6tiC`vxAG;h3Y zK667qn_iPe5QQM!=9~2R0f>?j`4wmoUC=?r=}8ujT41NOj!KD2Jo&_NTGBpxt|Njt zb6>~Q?-eYonX@_fSDh41H;;uR9qT%3RNc)hN2(?5ZyxD^ak8i=x$l~IJdKw;=O_(a zY7ho_&_z!;pX={U_}%(S$i$q3w2p%mUPGGHP==AK?itb%k#nPa zGz}_BT9HP?erh6W_MAKv!)XK)&;`k=FuuN|j&F7Ha)g6^DVaK_of*>Avt~38*6bFY zto;j4B>gejDc*>1L#qclb}5_sU@V_^okL-(@heG5Xv%&~q%MdkFNgyByg<#LUSC`r z+ZJ^h4MK5Tapz2kohm3>;yL94zv+!JNQJp_-5d+5>DN4rEOEr2Rk(3=c)JVY5TpBo z^fZUk%QYh1iIvo;_8gUU7I_$ki_sHS5$U6nQXtrSgVY2okQM z2M!T#NeO32vTYg|6D)OMdtM5{I0XKX-c3~o+@Ur0nXTA62cro)In0(APS0C(zxZmM zs@Ih>25u#O)g|31R|CY6c=*vdDo9##Qy>72qBx>_P3Ef(Jl_1n$wuCcWSBr&)SV$; z|H_yk3=JGvBBbZNdTO-uxBf{}h2eig0n%L_Qk;=J97|+DrYbDeDabqsz|AIW{j1ng zOY8QE`wuVqks{)KTphrKc1Id&C+n`B;7dofGAIkU1CGNHQTVBfeDvh9`-;nH zlG&wdQ4gmgzl7&N4fsamaW?`!dNSGYvq*}!4%Dv@P2uo_@edB^C4e(1RWx4zWb@NC zy+FpJs#=`QVGBflR5yQnb1xVw82h!aVHl?=5 zXNR}S7+*dqA@uPWdV$SAWYsS4I?>NmI`~2nILg!co&zvB{Npd1Rqb2zo77#2Ji1a< z)vRY??nBC^AOMHQr9nlUX67!KkvtTCy5j;x>+5&c@6ei`Fu@0CAqcvgfj2*KIf2qJ z$XsM_Op2DT#_pLjC zPr>+jl2%lFTi`R#?KcH6vmspac`THRS)zBS%d^mXFo5I;$0 zbPa$QXVG3mS3orz8`+j#f3zzmTsQpHNfLudQMTz%#7kom7M&_}&*F==1cSbO~7 zkq8AR6YTSPO9v{59JEEpGxwsKTvdW1ct}i7d`cq+Fzamo6~xs zwPI#JM9M;tcP0|7$jJ%#Rkj!B`|ObZPn6h5f*CbJ@@OdfE)@hhi& ziOz`{`0~l|xVO%)e9+NJO%u{nn{{~__ZIXgN_gz*g&H`wGk=p-wMJUK 
zHF?n*5tOoUagF7O{`m5eY>!ohR#@KF9(JloXuRh2J8KWZXmngEA0w#hdltRoBHMR% z3HYWO%urv@kRreF*qWJc14zhQE-Rsz;NByw8z+y`QgkUwG?lnCCEqy>ltQ{*f9O#( zv0Rz6xEDLP^WQrM8i%k@v-H8(5Ln%Nvdg-|4;c`~Hi}9bt0+Lq(fX411w!``p8>tn zb3$mbs1?>$jJ{+|U`womWu-fb=a`X^UW(v?jU9RIy!l_cDmz!xi@+bQf0oVx;6I!^ zquA#X)p_*wZ1P76hPmVR@0XJu`hiU^Puqz#-r%$RIGd8j!Fqn09iAQkjIuvGI11-L zjHXN@zKf*C`0g}PnJ&*84>~w36Uh#4#%y^q9{&^-#>MyPfvgyqxwH(VkIX6a&ig^) zC`IX1OOt>qVb}{Uzwy-S=ee(^s&N5j!9=I?+TM9co)deX#P8u3#0*j);GAcT+d5UQt5!e2q zhSvY-@OQ%B=Z>u7O7@FmiqokC>~y|mPb0Zz6Oku@GhMrs-owyfTk#T(Pb0esyWsfl z^6Oo0ed;y(9nxWQp3$wp%&uKm%xQ3Cjeu2I&F+T@oPdHB?VPuSxdV_vp0I(Mna%dM z*d;=@@GbS--+}cdHS(tgB1Ew$BBTQ636D>EMr0+#WPBw26@$&>s#hI-RzFf20;;z< z`HR+FM_KTf@xUpS!w|Qs?$ZvXj1G<=xTb}^{9B6_Jo64(EN=qdu-yHE)Xny8LCkiScMaN~W zYhB$J*Nm6U-;*3-Z4nfu4d=)1i&U5_Ax1NGT)!e(zNKyAd!6bYoV=g$(1rIUKC8Dn z=w&dw_y1#Nw`B1Udi88U2H8MkZaZ6BDjM2_EwHszT+S76iLt-qfAqj@|FPa`^v^ZL>xeIpySj}?Uh3POp8divry!z!k4=|C(r3iicN)lE9hz{YV66p} zG>X}ntgA$txDZ*@NVsgIyriJ8?;9ThFrrkSRL><)BIBSreTiY81a=NLSN6LLi3Jqg z`bZa{>w|+6V$VomKzu(~#6>!3rvm{RX@fpe8ukzDtWGt*xi_eMU=UV|^$%&(GY&Q* zn0ed%Q~KG#mk`tm3f_E^{yIXT=JI%C%3x0_)s{twU-7QQ(+6RjOeOkA{_>S-o20;4 z4|4Qhrzh7R`2TtMbS&_3{=|c0Q8DS$JfX*Hs1GRm;rIx1a(YgfcQdW}wN;kT={a{c zDHG3=DyV}RP$3$xI&)qP7hv#x`ONmM4l-J=l%0BXx%uxDJc~P6qi}+SaQ&}1H^T4T zqoDV+Cp0y{R5(RBOA=v?`gOtXO0sgq06)@@1V= zU%VqK^psux2X2jrTP(5i!G5WK_ZT{;dp$F^TfO5=*ZzKHAkKvNs`fnyq|Ed z#=G2|O>R*fb$ED$N5J20$5?{Cdg+o8w4P7EUex#Q&wSh;W6UzYI})<-CxzTh5;Z42 zDupQJ6-g`^8Vw~}3@j?|iy#z&M*F7vgYYu-@(_G^V_^7zKK6O*%@}0Hh;6k_K^E#o zd}k$bb~D*uYXPK&AyBG!yi5FeYBO0yr{NuiI(Tbl;Zed940EZ3%xW105?n9h2*X(; zBP1}VR2$SD&+)lJ4%zLzg6PyBe^s`;17U&dGG^7Qf%aL@w+hciR!Shtty?Kxlb&8X z*^RK~*U{0Kp@^!5VOf?+7e~om&8N1fpPTY(b7>azX4&lcHScnS(!c!Nyj#gi%XY+- zF?T%SfgA*ty0t4E(rGlSvp%yhu2;a-*VMHvL)?Tq8kKR>3$x}27FHR==SbIW+53f6 z2_v=k;bJDhL4^;Ril`?ZlW51`D}XnpE>uDYhNe<|%4r`AIiClGITSMwd8FPEN^odlI{Qf zI_o2gZn9FQq9&9l^)!>EGlFflTYrg0<@aj#R~po5qF7(MjSrRz=^tLY_lEikoi$5 zex-cvN?wtqUR}qd)S{Pr9W?s7i(yn|_JHir+%v;3xVb>eucg=qp}d?4PvlGMmwNCf zn)FwtPJg-PmxFP~)|eBl>DT65NjjQJbY8-cC;8f>(bw;P^CzJnH*zRA>0NaE);=?xU0ww2^zJ=T5@3O4P(UlLdOci)6^Ikm0E;ZwXE;H*`)Si@88@5JXPXeBMO$H`OTvrnAz=!NzjRrR~dme6bH`3UtbEX3iu=u#_3UADo+&&35$ zhB)Y%e`3VJ1M+2P+;l(d;@OQ6YbnWTNj=A!%6eAZ zo(VS>hGi73JraQwJh+0ql#aL4ixYJ)N`HZnIVa+RmUF+k!1V=K!ga7nwpsj?j#Tg= z7cAt3VXVDJ@P!M&3M;taZ;jeieY1gX?OpL;TqzMpXYqheFSZl5c6=~O_YJ*9?CC^G zg(dps9}O^!C4*ACRxe}B#+wX?(1<7Iq%h3u^TTyh=Y33G^`NqH3ACLanX}{yp zey<2Hv|t_4`Spb$8dAiB&0_$7>%XmTgk`Ng$tC#0#IJsP(d4`MtDlX0%8F#39VE06bJ$YbRioYML)S$`e*+Z$L!`Ctj$rnpg~*oTbPIsR>88Fn3@u%puS zQ1A-W`}F&G=hg3&XZiK5wJ{6M0h z<3|$k7KF2jktI1W!gx9<4T}uxKa=$-=?Eu4$U`xN&1?-ZB!NL3^OzVbdUk#vt!~^) z>>_+zHwvEL5vi5&`bXOfX^A@M|HO_$Y$TN9d7`_Adz1FQk`&6KGn3TNLY7B#BiE@r z72ijp_M7po@nROuhm$GEk0dC!J_txSe3Dx@g3fFLSA5_6-x%qdA--oZKLYg8$m@dJ zYm=v{kAm~17Cyp>XzB3~3l;Ks3=+HL>*Mb)qQ|zb$Wd^pq`l({e9N=R4&l+(j9V;9 z$A4l`YWcF-5Mp4>MXhEe z)&OGzF2y%g)sBLf?<0Q&?PF}wXBaEt zUmTL5Mn^+V!UrRZXsX?z44QT+CA*r_0+j;^THGDqN$!1csPK@PGkF_0%8l@IbRY69R+$ZION2b2xmJ`9I2d~QKnEQ+1iTNZ8%ILEFT34^?Ga32+#Q*NzX}*+t=H4nDbpof_EzroO zAE6LVltsr8^q1!MZa#7la7>NbGDd?WP1sA>PH7*n zP_wioz`k%qwktKp{+ygGK%C~#P_B?7%8Qye z9nD;R<0&0nO#zp4z*$e&TrK2uoBG4{*vOY>5#^Se6U1 zwXP#d!Gzsd(hO_D9>E8Rm|2IlM-6|4V$dB9kYzOs=&RdKr$<~rQI`$FUn*60vPwnz z&MYaAF{~1L4MfU4IRp?1E~C5qwbGj(J9r3v2x8|-+k~*7N9}pn?q0(WR703z4-M=& zY}qWu^4PTT246sKj%U0gQHAaRD8JMIYVp%f?q_|h;*v#Ci72un9D+}6pIc7J9WNK% zAw~(LNL+&2rc?cjba5lM8pvW8oPyqkS-OR^ie)KO7n;kr z>E;T(8OeyVv1C>z;jLgbYInGKZxvw-bd_rY@Z{AsSEC>)Oy4cw`Y81R{EVbpw%ywZ ztL?R?^=76Dg{GoP1u8Syt$lU!Y|t?Kx<$TEw@lBgj;-VYhEaQR?} 
zNtW2G-CdE_QnwRe&e>NZccO&0k3uOlSJDdQZB@u&1VkP3t50%uiO{b&aHcEjyV}R@Xz6Sk3(2TtlhnyacGX;D zX=ur?k!?LUI}SmMKtcJV)T;@gPTS`r5F8-xXhgKQ(tdJ78xsA?qbhNS9tb*sINzu* zz=sEF*F(D}ohl3Ff%IcET^X(Wn($(_mDV&phyH1}p6X#-!0=!n}m zXkNcdZLIZ<9gc78s2rd)zV2v`SJ*AiioP!mKju9DN0?<%vIw!yHh*wk4zUI~MmKRl zJc>D5C#zR7^pzKTHWY!w>oEf@g{ZYepvyWnj<|BVbzX9rK1x{t37fhoPz=Dtc0;4& zp>4-dzWMJn1CGR|WVAgGCPibncXLK3VSKB#?-zB{TM;LCv z$_xF6iJ^0NK^`v~gv%)LhbqfB2qY#u1Xx_CJ9^QDUYnJVN^L&!)1O_@KK*Y6*<^ib zSab1Dv+PdTo~pNmHLzZP!22{s#onR##Q}0khuWPWb<|Gfrj}^qst<-!nNvxK@EXw# z1efZ;LD+}%_7`aMEsG(0OJ(ogF-!DK{Ra2&Q5u+`eOS3oLxi^xsM$$i+-A!PK=N)^ zd(cBz($VA%XH-&tp=pIyUUK`0DC(Lw5PG(hb4nF8TT%Neg51*n>&f_)_vOhuB5JLV zEGfzWEhqEVE@|msuQxWACp-M-8Lt&Rd-vuoZ2tDuziuq^sK%Lkk>l9*J>|S%Diesz z-yeUbbVDz4=;!6h+SB(cnSpO#08GedpB2oai?EtV<6Hl{yVm@*#?I?chKhp`hgLF% zJgBtPDb~@Mo1{@J88LAq^?mI=n8E&M=S^;84;-%6^FyPDUHW-tN>?8#A9v##M9^9i zv{pU1beL+{=2vRuV)gh?385H7_eAhabdhUp2SsJsUKy&OzWDrW%x|p5;xaI$q^@OG zSC5y>cKdV%V?M$=isH%WF^8kPSskQnJ;Gx=vz*!NqiTFA}0} zxwEi&=nBVN)C=ofdV6U`?;xTFDw@?K?KM=;)m&}fER0=jO;ysWVQJ^rrmbc5T#f~) z7$gXo&Cl^gFTYzlE6Epojf}3^NBxMO8pvoA@ zx>W0HPAFs$`&y2ysidf^2;fGT{7JVR@67L#FHlKOqysaSo7aW^qBi3>cJXmDaOu?- zil?Pu5K}r$u4fhCpi6~BT$mbRW>`*z=Dlh!k|P@_>xMu#0^o|I8+3at;ZLxSax`kC zFfcGT=$gk4%55IF#8kD-DTj>%k>AthH<(W`$=%`;qIkurm$1ivcZ4y!6|{h$^au6* zPGvb*l|*V3o>0f4%+7wF6|C{9pv4pq(g=gwsOhs|faCQvDT;aG>oIxp2Ex$1CTl6g zOv@$ZjycZrFLuKD{UtMrjmc}*^4+qe&T1BZDu!l|mM5@lP;>o$@C{3?K^U9Y(%*0I zu02RAwNha95}4z#A(zqyeHnPr^P9XhXahIxkP~$^6MUY&k6p z6^XxhZK;;C5pRfJl7UlK42}s|U=X~x&?-`vQUi>Oh@2{?Fg=;hK2QF6@|pN`3bxJ z$|=n+AtgG%k4EPC7A~vUPC}6{yh3LN#yQts=-ru-2-q5Rka-H@|() zEjZ0iq}Li##tEZUZN%^#3q;F5bi0~#6zhdDsxZJ;Qs?F`{5@HJYzXf1jXXjig z?2;d`!088`9Ml!e(m!vqbnmy&O}>ySEM?*8~wFxU1 zkZ0C7&y)M)%MQYTQ>PcD#xa#Lpc*{YJTtTH@yhI?g*L%(CYV$Rr7ui*+3n}@OzHge zVaRQ}hW9e}xvR&;P$T%oN)N1s*kPu5CP?l$n+jb`H5!-bBom4LqrIY=CYtNRagXgM z(Wg)(QOTUfy7do(&CXmF^|ca_{n@#C(<)`aR?ktA1V2KaRr@&t%5j{+&zRry#qrS$ z&qTLkaD-tjk-NUy;Fk;pG$=47E|1VVzEYRb&=R%z~vg-t8)FzDAB~;q6ol=h9~OAuZ4< zNTk_Tu*BNk^@<|^s8OeY2-(aXpUuzfrx|YYQV~G;xfbXAOb7!h@Pu3JI-}7vH?gt- zbooO%#?bN{kNzDAiTMRoC{Z73<)@97Io7<)&J2|;%3)R3KPc`lF}D%^R)b{^bm;?t}}u+qtd>-mX9 ze~}W~FZ8DR{y$+a;DL>Wss4bUHmTK)|u!uOfUmD>m^|34I}T1`K>@?s?WA zfZ|@|0!HDPMA+^f7x(2{1-4z!Da-~=W~yCWGiPU$)tFc}LFn{k&%{s(N)6AU2TEcI z)*%@0&(_GP2Vlszhg=xiRchm@%GfhMEc>BDl3gLUmpDMWI7RWuxsW{}LMBN`k#9M$ zc|IzKjY3QmyXUC)xN9dQV$xe>M`h38n({q6o#bT&>RE!rEJuK3tw7i;md9B13_B*! zin(1OhcEmn-+;4~;j9!9_mf5|Y+cv1=dXJhm_O4#K>Xwv?xd^LO@mVK1wAuCW)6_7 z?-h#|w`ekBzxqbM1RJxg&4^oD^m(%3ZUR)Muu2gsWvAT_LD5R#&=pbb)%Sb&Zl9RHM^ari)Z4vYQVs_6}BB~;WM9ylwC z-H*>FYbd{RsS_nQ*jMC-!Jn#{9OA)}ijYoLj?Y#Yf8_(LX|xm_+^z$4?bL+zM(~Cd zgM&Eyq}J-MPF^d%VKhCTEk+XXg%G;PlY+DAtFZ_wfrIC^gXOk(5b&D$Nb}>!1`1#? zYS&;rs}SDFAbICyvb1qYVx**V?B2k?AR9JDmU=TatxqHkgqeD~6q1wzK!q6fZklKk zk*J{n9dpmh87_KT;s5^7iz2|x*R`9Pfey~*QH`nq zF1x=(`G6KFrBEi2k^ptj>W^S+{ccbr;Ox$hN>RiO80m>2l*HI+(ByM`UW8EU$2{6@ zGhZ~a7t;Rzxmug*n1{E}&?CH8qF2W8#-z<$=5&zHj>pd&r@|u@1)KA{KZ@PlJ?a`| zY#+|m5Vq2CYGQHi?(b>$4uPMIlKLm0OIb(}>7o z|6)qKhSytf0~U-@yzYaDLe>PdOAxmyWD&NQ@2V?M4G$+ytvp{4>k{#${iGY!5JCK! zb=JVF;Q79}%Jx%TR#|oWUM3S7A<*q2th4rK^0u7OVLcqF;VoSgVlcfR19dO;FbO(F zt_ew4Y;QAt$U0qWiF(yc)JFiQ0?7m`mAA+SFt}X;C!i-jNbR0~isX)a<|g6#`p**r18AryL{0x;O;4y09UFO}xvSfVfZwy*9qsU0IU%pYf)O`|z! 
zcH12({&e^$I>iGXCu6HDPJlB-QYsY9fG0E22$5wibbP0VG7%x7*)ZqPA9A+RzTu83 z`M?HSY?nx=WJG-6vFvDbJHVq}WJ#Dqf334JfrZXBXPf>^xO=Ofr;?@6Zxc6ThH6gn?93=jq_*5qbba3GWQ5zv>l~!db z=IvUjseKS zA&`aR~bpD^mjnW`FoxVmIZ3TwYPb*T%)t)@l zqbSK2v2zT*%0xPsXyD?~t%}ho+Cbf~eC*xy1egxdKgCsP^hI&LqAQ9T;j7|XPDNuf z!_|OhdL=b6Qa=;kZl4v(F^JsGB1Zmjh)OtSWE~QZ4N~+Dxi4iatJVn*1h$Q4t*{}q zV<%HNV5};-VS|4_hj#K%px&sCL8!NZ)8kLv_(^!tI=~a)@?sZ=$k>vO90<}*hO+uR z(r62vAO~Snl*H>|;Bj>p`pllop0D5=q7gU+moZMM)SAye=>9Vs%@l)K{A}h26cwNc zi|7TbMUJ#n8BJjeD;DzKZXYql9&7q?p4MmCk3r%d&c}h=MEX2Y0WH~T8O)ww4KL&* zaqpknZsjUv?PZ0}4(Qa^&0DNRp4;BLop+&pwJq>1xo#*O5-j|OSVv(V2PFBk9p(qL zGxAj9N~{U8&iw7-_?B6|ojZbt!_)I)wV;$0LfNT;81^L-*PoWXZxdQ@cRADrL3f%x z|0nDRwP8v@)yv;X#kcE_aHTebq|#Nu>X~S?GFRm&JkboP4gjh}y*OM+J)#fLx$sbk zL6nQutm){&RU531;7dUT}m zZqh|v;k*hBIYQ~=BwtbF(ydEzP)SCm9jy69E53j>*LUN6IcWwFuqf+GRIt_ZXpENp z&2a5E<$mD4PD#08)2WmInPLCvH7>kBlNKt=zUtuu1m0DZyZ~q86Zy zqRLTI-fGtxf5)c(_hQbA=yj@Q(%ds&FaSzaajX2Ic78F4m{)LR82RG3P+ueUiW1?j zuz0Di$>wGBUaVIF1>GuIE-;$s81KPMg1_L<9!80nRz1??D|RMuBO=z16v^Nj;Ki#U zbqg-(@H05B+O2nqY$77bpuDt*rIouC?bpfH_KJ*35LAkm8sHvO8JRWc(o<4{LKU-s zTtak_JS4%li2cR2()BQ|?kA){J`ra+Gsj=UA>tEJx9i=ZV=a1J>^w?j-1Ot!yUkzV z`w9e9gGEp%yqrF!ta4I+jaMXtj9Y_cP-2spbtfn_Z$k?41SP4{mqkMeSfX#dv3Dme z5JxZ?0>4pIr|4D@nuaJ}5PmseA*(b1dbhKF=WiSw(SuwAY<TyK~Xh|%?;3i#urAU%6Ye`x= z6PA;oj=9++Zx5eJ#`=m|%%?gg$ix4092v;OsWs38M5JC0- zes&zPovfgmbce<f7vFnZBh43f6oPZkF5c@rdnT zci!9coBl1@XjLcE`?z2t$%Wys8r#}0QoB~9W$4-}{fgeTpr#|yPLEKi3csj6IB5^Wc9vfzQs+}BvG`2U{+d4k8 zfV^_3k?NuDf6+$zBoosNW})(G>nq3s2|)?9hO?#%vH1D}2grj&gsUnD&CoE*z-R6D zo1$7XWt5LKAu~&>=R4BKW94QJyG^Hb zai^OnkZ87ku&Yts@pkETfXK|kv;GZGFm8DS9r#b)5a5rhPldR^IP5vmQ*oNtj<+mc zVY6~a4)x{)^=D+fxD1Hn*Jv?BxAc4yFr11+z;thjuMcNC zx%?~~QAglv55Qr1%Fd)#RB%u^r@xkDf7XZ=bhW$bK;wl z^z#5#AVde^oveaJ>19u62m?+Y2(CZW@&d}@w8gzTLsgk1_E9|`72THW8qTvTEkcuf zJ2CZ3RlgQCK)@DcDl*W`y*kwwJS)ow2?m48Va_g&Xjbw|q_xEGaYojEI(_qo#-KQ^ zam|nyEP>n^W1T7uh^Kg3Qlx_R`h)N(ij$wRRP~%PQjxXzwBq$uUbw@r&V`8!d5BHb z-lNYINo(y*JNqELA0x^m{nBy!5OPY_AMmUm0Iyujk!P{(?aQX?Z8u z(dzDSF))1`h)|JJN3>=jc8_ZAEvO??xvE?a%=8~V)r^~kXKq}FlqICqXx`ab0qhhD za)5$9&?^)nlRGMJq-8r>`S#~-QS(9SbNmo#?$izsbd5+q)Fp|IL~;R7N*kVtsd6&3MyRl5PII0 zk)z=7coY@0a8zMx*^NpB9T*;Wp<>68&)N@i7D^(lRL*8IMd|Ntz1)1U_IUa?KMW=7 zsl~b#C>DwZkNNf!0w?KUgD}B~O`@`pWLS~IKeQ(y`(ODeY`7i}R{M)^u5$-t*F-#K zJ%MWD1jlTJL(uKD#(i;Ic9`Kl`DTM>WY|yHJ`adCKRCqQwLTQpOFV0QM0xMTidl8W zimGkF5;(F$Oq}HIj(yO>B?i*xO?ZW7AA}P)?WMdrBM4_O^;lGNXLomAW+aHGp8!u% z*zK37K9pjb{!>_GOPJu{;A6cbKmPPif)aP z8Che}E1X1{oYYQb9BgE@<}KsxM=&kU-lx}kz*%}mmR|@WB*|NzO6hnw`>$05M}WtX z9M&^6-*i<;9u|aPmKte=QN%mMZ5Ak@d$#aWkUP=yfaDKU6kK)?jn z#KUW-QQBh1RWq9dEr2WKJyflT<{OF?U~&#ibqyZ6T}It^n80v^Y}hTJkzb0l$3a{~ znXrj%(RiPyeJO99qKc2?=8P6<;9*(~X%&$1<{WsooCYHDqYT8YnTnIgx2-Z02SNw7 z`@@o<1}Wvt00zGvqN4+_Yp2v80i{Oy9F);7uYsO*7NF?Iy#3*z|I~q;8q!(@S*T=i z!M=@5LGNBwd3Cgj*OieWl`N@t>WuWOf|HQ{C)SGtbtGeY3XD=lErz^8rjJH`YRsH5 z(OFA?q8(P(QUyzAdhybkJgdj$PnAu@Uny4#k1X6C=_hn*@~Be*WsXT5m_{lkAEYkK zRE2gYkUB3nD5@#^^{&)d1JZxzKkwdaKKNjDZL(cSRkhrbuTUK1bZog4#T*T>zGc>V$0xAE0!oYFNsvBO4ED4jNNW+e1?O| zH-u@2XMo(ZGl8VCP~puDe~xj*r7W|A3RoYz4MLh>-K?;(tr2f_3K)6P4t30*`4W08 z#gRBO`j}$RFhsOIQ5i{l0Z6~d?s8Fz&C`{fu=YfYl)!8 z%cn?_gl^#blrfmnlS!33n09qTuqt($wPYsOr7&U!kTg_k4qx&FF>E{i-n-f5+BzsV zEL=EMOF~~EzM+nv8rU~`DkVThOKQ-G{<7)%ZEAjF5Fs)#sGO#UO4RhXCg1?Zxx zI%TtAqw9lfqCb%l9jl5R{)F!)W5#IhiHT8y0BQZBYCtYBbmzmqcjkNMh@w!ghLp;t zQb~-%%#}Pk`Rf^SB$sbl|7$wVs;jD$7u?QYjDFbHs$5b>`T*FPd&KFhwP4f-P!5E- z<1C`3i!nhKK@)1dih=y1Td0nR95pi4E+BUmM;?V;a`VG?fCCQo*od+D={^rEHhOVa zo_*-WLDW#l^NOfABH=L9JrqZ0D6CDZjbYQ^)!F%&?owaEn!AU>0#w8UwVG6RVen~+ 
z_Z`$coGB_*J%BJ0?=01C;}qTG0zsLU1bT1kUbS+l9eMh4;zj+_V|e{a#d8iM2x9{+ zSs=3O?ALO?A-Puv0zbUciE4f#G=tSFh9t4Aki396#nI?)Du)QNTmg(sbcirqBJD38 zlOKTO``zd1;80XFs; zmsA4+CUm0bW;K8tMxc8I2!M=4?sIa`m7G?Oxrbd)5JaIKW?Tx-k$&lJ3MJG0xG9@6 zXDpDrWbJ8&47Baa%C{URav>_D`y+MOvoG2V>Hua&i6U*KP8(5^7q>1hzZC4ly^W9L zdeUxj}0aa)hzq~4_$WOaP%HCz#~Dz@kjCADJg9xBTs7u^_oAI=h> zrzLeN`e}NR5;6*s9s3MDJ&K1C?UhX^Ybw+VDD1$}=romhq905wjO&TT9=9?@d^PG)j0D;ma15DyF>$5Xh&Llb&ZbI59&|3mn{{-AMCtp{Sj|c(+EJ zv@rnzY$(o4h>ptK1_KuN`S4?+mb2%BVweV>Rd*_V^5ui=+-b!%Nx+gyqjLw6-D4TF z`|g~Jt}|}a1OTIravZcRafwWq@9;X>=?i3CmR2G!sgrhMv_&m%!GS4Xix!P>^{jXr z(gFBIp@7N~n-8|rlQrey2@h?vQKd{VIWAo19tjF1e<_Vp>@yOxaH+nWBd0jBP9!j7 zDb%`SfhoO68NR`Xy-@aAUZov zc9rZP3ZBpO%D_a7FoUfqJ-}y7u_*V<*iU`p>Ab<8?<~!MJwe9>af0ykEuPIB^XXh(j6*kL1i*r-0p@|A);WjnhW~_D*LrGf% za)=iyD{IZ2`bg`+(R(m9VIUaUh;J|fr-Kw_whY0?$5Q7m%~e)bzzukn9MuSaSF$%m zn?3tW#bTw)-!aWo_*NMWhVHL;dVMxO_B`9$pT%N_Clc&Kpuj(v_ZFzQPp+`h4hF|1U`mQ2}5?9D%cYN+mQcnn&s9exkM)z)El+#n?`BZv@gni=ZrGkV` zLGP9bCN)HLLzp?pD3M=04mc^s2G+Qk(Twk_8rBq79wga76TD?7D{sKa?bPiIG%8O7 zi2sE6G(q@x{`2nr=7SGb*C)>_Q2W%ttv30tgKN8K^dU@K!nx9~rE(L5YfQ$b~ zT1?SCr)2R&$vS%OV8!tGgPnylrx**bB0J`IJ{?yDC1f3x8-n%WLV45Z=vFYbGicmfMlok>oUPP!L%M5C;A$X!?lP+{C1JziW=4j82Yg<$ z1@l$>!jAFks$VZ%`Q^ki$l=eQXrv$@5)*(hMb9|(&h#oGg5h(1XxjDKN@SCR zQ!*rw4M}ymAm{F7^`gNX^~<$4JcIX5-Y2QWfwOFBWOE#dhQJOb{>0I;9!JO?Kb7X} z%`HGZRisOOkcQ=$v81WOm1u>mu)T(?pl9=GXei zD2Rz={GohA9f?Hyx{XGT+SL~l#;U;-JN5Jx;ZN+y+yAB z-8qZ~qOulhl2q8BT|>(eW+nt110x&Y!%vFx5Dq`PHpGykOG8o!HH1q5`E2QL*XLec zcbGS)^l1!wy9SmUzSW?XJT1zv?xNT5Xlo$-7v5Am(yOrg8B3B~l#Qv|uRBVSgtAVV zhPzNoRvZYUOuigaCl_?H+;F4+S`UV*R>DGurfKK4dp3N?dNx8tbx~0Dmg54cLr7f0 z6bYHYaaJNgiKU>7jU4)3ky5Eck*w^p6eOi@*mvB3v(8ytf@+~~5xE02StYDLO_hc) z*xLfv*cqxyL!o+&n`eR4t=&&#vg=ll6w6OK~+fbho;dcON~P z4Bu9f60tGt4@@F#!tFmYEI~lbXmi_c_X{z6SSERCc-B)LF5y%1v#s)~ZCJQ9td;K6 zDqj>k6FWktGvcp!hc^MM(HZfPcGj>y0|qE9hx$=ur`X4eM62;R5M&&!@+UN||9*J2 zzj|Adf1cv}=fBWHflue({rOIT!REugdem4wE>$_zYES0R`l3L*B?xlzLn=x^+zHQw zsg?Om8GKqqNlUOc_Afw-e_`Iq*<5#e*N4edtUsw9s%0(|!BBP6EX%@CuBc)in~OCH zDbQfI^T=-I*Jh0&sAX!yZ@^*c@_nY3+9Qj+u4?04WFzt z*>1$+4fSjwuXm;5GmxV}dGcF5GwXFh9NzSjnm~;D?Y5J)TlbQdZ;4>1)4!_^fise= zFsOZ8!8}qFhEwpT7wF!0zL4jESG43_|Hf~$7M~nD%vc4MW=&u^Xw9?Kp0Ug21v<8! 
zbb1RKNtYfskJN(4HxI?KOVf3kM3 z{Be%YWu^7?`;#*Tc7LbhvfZe8Q)F7exU2qj2?yUUiAZ&2w>#gz$hUIGDy9|G!y%o~ z<3iD_x+B}^2&Q-EQM)Onog;TmUvE+B{OE6qrmIA$pv6&NoKI`Ci&vgz2K0h=9y-Dn z$gz#QP2UhhT@)WimR8YMJr2eDX`hH0-jIJyV6xa$GaG}hw6nT-a#q(dx|2n|^};-A zzu=qL-g)x>FrjRS`PZ$esNXq5@}J9`M8Sh0m5=Ra+KCt2ZqlYScOglucF#pX{7bSG z1)aq@z(@ErcELf~b@_mT)67xA72QNl<-#gZ{Qunfbf^@p+l@IoWD~dS|CNUDaCsn> zbqVs@pmwWTivK7QxCTw_5jxrm*olEa$u~dESfYw@!|i-i4{q|tlRfb+*!&+RZwHF@ ze>^qw(Gb_`l;p{$Avjjt+~45D@<`NEAzLr*X0)L3%Ha6ey>5vzRQ^f3^e|zMn!13;X$w zVyPPD2rj&B{TNCaG9l1uF-;LoE~>z*kJeTn{#JL49zNh7dS#&pKqCUlo!bJJkt(su zXhm`6u{3L{n}X5o`M^aM==9&p;dFa0Dxo%d730{qRJurIuVmghG_P$X>u1*PKAm8B z+%2-4t}pb})h<2YeK}J{RlA8~L&90#VD$?h3#vtsGoVC*KK$+C`JsMQAQA2vJZ%u8Q-)k_Co0p33S6xoR3D|^ z5DTlQRCZP!(XuI%QjjQ>h_8)xJ)#v}NpnEdd`RT}op=K&y&M`GICVg7vNP&f{zqJl z^COcE#w!x6!zs8~;x!RJah~I&La=H5G(A5CTv|zRTBAW4deMkGOAhSAj3Q`O88`IU z^yltCb`FU?BqY^X?!El$k1o_%e*N^_`km#S&DU>fN7thw#eeyg|K5G@D<9Z()`7Cg zC)@HOo(h9Km^^czH>?rn;=PVZ$7`jytpn=mxjQ?btlgQk-`_@kEEvRec{lw+746(t zo(Ie^KXds~D}Owsw;~OaOuMP3W0yDVbkRq&g*th42W^c*Y;7R3fY3 z#qiO$6g^P-xwLktU)Wm`Kf-$ThsmxY{`t@R2soz2@QJq+44x^QF*ndoq@I@^-=4gn z2Jld?&n&G!8qG$wh$0 zd<(y{Jxyq%d&=Q4w)5dbA6|U0 z$m@gb74g1_3%WWyy_5Cer4TM$ ziwfVgGl!EpY%Irj>v`XqQ?kPDe#~EdL#k^&l!J3~0vrJTOK!uQ;a~Nx1$a3YjeI7q zSGxk6(i@gB5X?rN3T;J2B{y4c$#=@#+7;y*j~Wfs?1>p04s9D7lReOK-D10$4K;sqgC!&KO zfSFWDVdc7!;SRaOdkHLFK6m1N;n5ceqmX%^W?ihDAxwN|e!2KARFC`b!x_b!i;nr{ zO0m|-GQYzK?2ILQ7!npbN?!)ESBTmeaup;{{Ln?^0b9Y5+L){uT~fO5NR{`4qG(xR z_>7_E;ZunHt+$aq4y95APb8+KhfzgZ;usW+qRRBZICLDac8YM+qEapp!N}u*_M%sF z)hIs>KgK2HD*w2$CAsHQ2vf0ixmKo3IsF=szNhC=C$UJ%#w9mut>RcXt~I4#>Y_A~ z0Uszt7HbVV$)ywc!9HrJr8WwsQelL!OYGZik9|Tm8g(_)n#npuic4&-5Jx#WunetG z{WJ9IkGQ=qohyn=u5{#`Ce;h19}=8gE`PW43vbpj=^`hUACR-oV`@~qr~W!&s2wfG zzRYWU>%FHx?5xoctw=*F?t?sZCW0g-n^g?}_!q7_2HOnk!`G6y?4sI2AHguY?mn|h zU$2x&A$VBfzrh()4#Y#J)c;^(B-)i9%AeZZ$1zQ8hw8V*MZHm;arEwY0n1qaZ}<^~?H%@iXLW68e@!4e~hr!ydgrd*mX zW%_0h$*!RYasTy2=^o&r#j0bzZW5&_hJ{i8pqbFV91r+T;5{me@QJpIUNZiVW z07lb6G{qbpB}!3%!Db3aPGrEcLae#+keTPo&2@taSU#6t?(eIyhSG;_K9AQPxseK` zb;54RrWIqApQ_AUe-o@F+(2{)s984!`pZ%oLp402^H4Pm;KH)dSZi_Nv4JCj;tX|` zf~T7ET9FIEl^(gXYq%ot^<4o1626Bi2tvNatt}zLqady08UOvMI{Mu(qpC3~{Wz^* zLTw>YVCKC)Kku$Zq|`USEA0|=6W^(St|>cVgd7_wR0$&PYF=T7AL*Nw_){$!QXXHp z0}De4*{s$g=Yii=dm}58x`2(gbBl`91F7$(zNjKibD^Ey%b6}45>5sZ)6^;Hv%#4h z$*1G`dP(#~ZF4(!RtzV_Jy zCDOR4uS}ZrBB=y*1t(!YCO$sUGi;oHuw7V#t|ba(0?id(MRZj7 z42QZl-2Tr&MGtSpA%;IZr>z`HRCv+5@X&UuHb@aSLd3oB&hMSP%SyC>i89uhhlpA_ zOg5nT@#U#z=Kyx~!<8=0SzT<-W+&y}8$1at)~$cso*tCH`D>@;r)P&BqrX!5g))mL zG)CbhhhG6J?O&B2HxO7ld->;Xc>`5g-DXn;sJXjOKD|CTK)AFMKDzyNdWQ6u@&9~C z;>x8qb@@-PFHU(yIX3#7-aoT161?)8?b(^smdWrVkAvd%D3Rr#p)(LiCwSAId}FK( z^2S$Pebbv?b-Q_qt9f0|XA{h;8U4#}EI+<2gZ*)O$x8_J4B`3X@~2Jk?dTR%e)CL~ zu=PCO()(7|0P7dtMCkzTSoy*RZ-4CT3*9PTVfQTz=AyS0%}g)q+IF2RU=L-C=R)oV zrUp|w{!Hh_;1ZqeHcze|*N^W2SqZni!zU)XY0arp@(bn3CX^$pAfL%bzCpcNqB z=sv^APeo`x@j{A&EK?a2eNw5u!=-gRit;B1ef1}s?fTOWq9o^|d{%0yQ4<@K)z{Cy zpua->9G!#UUw-TrRoO)W#@C=Bl5y86%rP&If0m1r zWK7UOrCZ!sQg9uDukM}gcSImNHu8UlgO^vs$ITBt-pGQ6N=4|;ARO_PI$*ts)i0o(< zNqMF&{;)e&u3#s*DFPCUhl@{)X}y^KL|-QFZF~BO`v7OOx2(T!cd6?QK`;CVIrnr( zi+DUnBpOB`ME40CVk#RW^`Hj99$#P9)kltn5LqRvon!6K=m88lWjc@2eKk_+xOe!8 zWsUiLC7i+gfn;zVR&TY--N_m$4TNGc22-j+5?c%<_^+LQdPzqb&i8CtOGTZ-!{|S5e-6CpghG$yvem35gS0}^ELiGCib`|A zOFwcvmf1L2aBK#1mw#W)OK)>$L$c((f+^!26*_fUKprR zBfRCNifw&Oi5e>xupd$e5lWr%=mN3{gIudm>wt&OSqFx$L8UqH_)Pc=5h!5^Hx|^` zNothQ!MYG6PB0!;$f%V+IP6`C%N2>s3Z)C_`a@!$N*YErO}t%w*0HH_?&w3biabME zLV%gL$EFkGXDMF^59J~_76BDS_rOGHwdMi4FjOXywX zkfdn+q}m@^JxTdX&+znDCfo=ckIq0ejh9f4!l>@s5aHmW>pRv{7y?y;q1^nuyCce$ 
z3XIdMym}UOYH1oMZbC9AWoTh$W6e4yo}^Uq4aQ!38J?XHMlm`;A5go1=~Xnv=DgvH zav`$bdgV+h|5lk{LA$qI?6FCOc%%;EXKvvMS7x6MsQ0$t|5eoxk0^j>q~=d^EtCPl z`JS@HLL~jn1j}d^U7->)UAEXOokI;tDmep(k_dn>yK@8(-P$$CzACo_bQeM8Wba?M z|N4@~AEkl=?4G}KGAw#Cq?w)AfHsh$m|$SWY&fD0FvKya_cz>}Vk5zG_AG3~?ee+GYid+7-C+Kz zF27v^LW!5_GmJ1mv0k8H3+7T!*Z{R%Y-ODl3PoD7K0J^3v54B$cJJxaXRqGtafEhe zy^p556Yi|2gR~T9?m!}-CPf7jqT~ScQqnZkS@Bj6umn$*ue6|4zK!SwCRP`KOK~!{Hm2NXoSpyva^*nRZR^C` z(yLwP>vHw1tX?w!fDy@ z8DgM4<0BaK&}6vuN7nvCI?eeb$ubOs8$NtT-UQr*rBZdIUpQpY_NeO9Caf|m8w=z4 zLcSjZ#h3PL!w9SK@Hk&5N~4G}bu#pZy>toG592VUKG|3ZJ-DC)*ERfh6-gOg#0zfb ztj)c4L-Za82`i8JPoR+tMQRxy9!iQy(7$* z!bVCyrIlqUI9Yl9rv1-XB1-ds4Sy%dq6|{44lK0;;8Au@x`lNxqBdL?vH!JQ1;^Bo zAO(l?EuPX4a;rZ9PGRk%&IVz9g>6n9WHtHl+@UBvs9!QOuqyJYmk_1U)69slBQMR0vt3ar zXY^Y(Ija(Zr^aZyJfH^X2s_qN{`v-pjvGLQxyOZKH(NWq+uZ0>LTx-DLt8m@^;z8W zQS1L<518v8|GXnV`vI*QbZ1C(5D&&Q5v*B1U% zWCy+4O_Rs3)CE5bZpubs6$-o^Mj7dle;Q{TtK*In_IhtC(E+RI80;8 zwj1Aj_da_tE>na)|5@zlTd(4=hFe?siTXrtc$vndb$iujWbNUC?{q8Jy}s}>TN?^a zeN&=6Xp!^#*j;u}=CtAWa0}Ak=r74Maua+Z=aP0ifc1fXsX#)S$c%3BmeEj7zi^uB z8LMa^^HQkx`q4Ml&sF3P6NF)C_pPn^?vKA&$HBm4&t=M>wl>%9{>^r_+EwAXwt$@c z&q8QcW*_ws2M^fDv0{Y>oZ0qva?TcU zFD+EBN7O~f&1o4wY0k1!pO=3qV5QZrBaL?Ccs|?jV4)M+QH6pMX3$^efsd^{Z*7hK zMY4JF&t(@lgnU9hhNmM}6F4 z_aE!@X1;u><)!VS??~A7i+oeKFgy%qfW*pm?at2xbYCT~irt;=$&-)3mPEs>Gqfh8 z7C@7iz80sLy3(pY>V|}51R8=TfzE~pO8QBw4FEw(KA<`%)tj2$Ozr(~`z2K7`bv8$ z@7_J#At>B>FH^(cUcTN4$%!`~VX4XG>5MP{j0F}4J6P;oK9G{ko-pZLey3va!X-|L$b*7E7)6j6ygEwx(Ke+v*+9;iahJ14OTZol=< zyN{c{iqv#z&c}WgDQ~_<27dgD-L1#Ve|%^4RgfAflRnbmAbLr2oG0jOdV$-k3pE%f zPP4D}YGN1>srKbEcPlL^)XGR59wNC`2?mD)pVdkus7Jaj(DK7{?d3k}jZ%H^lnO$i zzK=*OvBOzc9L=ha$gh`Lw4R49yGVAlyU1*u;pH-QG!|@`0)8^5tA^r0-9_rNNWs;@ zR3hA0`EScNq3;-8a!1evL>=|DetI4LqjE-EX#+eYFyx0TQMaNxe~2N`X`kpRui`)w zUAxr?dw+cU&fL{$zgJIR*!#P;yt(;<{`_cpd-IPw_8pN*D-GGV;Q{%*&6ls@gYP!C zW!+77U(mT!$&jxw=-8=3 zq$N1A(4!k8r>9+EXu{kLR8q^QcXp>`^1yeb*rI35UU9vb;|cLHaf`~J+ZXxmBMI)< zLXx!iVk=*WE`t^Vuw^OCG%5Hq>@6^_p>9CL^}Q|~h`gdEtrPWu*3KSM?AbSPJU*kX zEpQ73$i9fLPsqEkY5<<(6%c5X8a^`>iR!4{kz=kqSl)67Glr|T^_CYhoo~>O2*M5Y zB9&BrofuPNwl-c$!{%B2pqj;0=}yJ-D8Q&lhnQzwcg(@3u|LDh*4qj-@_}YyD(!1a za_Wb5YEhi%v2f}Y3c}A<@M80yGFb5liF!S`-2J%IfaUMz+y>m}X3zwKmfhYGjnfLz zXIGc4aK8AzO2Vj?Ll`kGK$Ev4^SiTw$>+m&vvYi z^m9EKjE{N*oGPH-d$+arV0q)24dsL7t!)Lz;@0gA7P|L*V@r)4+euZkQ%x>Rfdlzb z($U;dAn?;wyx=z<=AZ!beH3lhVeL&6)atbk`9|&ZFu@^j%p(7LFaNT6XN`@SWp&#u zPf@+Ue04LP6SI2~PiXr_64Vadxs5ZD?XBg!d__rpwwgsK5sjSRQ6)RI;H!InF;b_icju;ej)%-!{-gy%C-pwI00WFm)BR=Q& zNa;$)iJKPkJFs-niVvYjaNgqi=5!Y0$zDW_PY|ao&A-n|q-|OvJU{#rZ-~oHU*Q6; zKtpjXjcZ)$f{p^qix3~~*L@b`(aL4--DbgHLMaAXLi?N&iBNaNnEANJ>+M~fwsvwu zyfJ;{XyR^I;RqcFVm5)+4v_%P zJ-EXv14R*1J*mU448K25Yv$0fNfZ_Aa4+M)ufFO|9`o8$4j6D)Vcp{1&Q4%DnL#15 zu!7jBC%HJVrBKmBYH7;Uc+Qksu3TQ@VbO4hW2l2h4SS&id}Y{{6Z_M%jmh4Bf2Q!x z?sG$dghU)kRI3Qzw=0MnaJs)j?v>YrP{33LFte?g^xRKV+EdAgZ@YHMCJPS+>l0zd&1I#Ue~oq7NdaP=u)*VI0?hPxn5)mwMyFC{^Ng_U)C-_99eA> z14TJaxDQK(HlziqM|FwIZ7oK&x-%~^e1!oGUX1YTQL$} z%bbIy@&t@7c(w3Habo05mZq?LLpDY&f(LuV9^3slly8rg@;H4PVM39KEQ{QUq6b6u z5tnn;db_I$lS*7!7)~6Y;jPX+ZKBXZ9YpL%mbkbS0uuqsvc#pQx*5G$fB8~7{~mQH zj~_ym$T-$5!uWE*3jU`NR@=yqoFGnuY{~}hq00V8*u;OPsMIuOBjR+R3 zJ6$sz3vMaN)g>L5g%BcJn@e=4ls@hPW7IAn?GjyvhD3}s;-UjBrvy$Ro1qr($X#F) z;Ww6iMmBM2vO&5Xa~3=>QVn9~7RK;To?>Y7OOrPxAKpQ*?^?mx5pJmbUY2?=;6$NUK^9zD*}^KU^KBxA(_C z251J0$~%|pcCY>70cXvYZ*{>09DD$|YFg#29)&fAW2~L)szzi`vmIb9`0v(C!O* zDfz!6Y-?|8E8x22#(ZJbxY{Jd@btR70u)s~xNT8!Q<-`1FVFrGUJJIb!_JJM7nLPZ zZ1ok#ZY?Hlu=ngQb%^|YRDbnl#P(&htvYu(d}MoX>rG;hf#WqqEpr9IyxcxV7XCRZ(g4~oP~PjKMap3NgY5 zttDI^&Z~9Cb)-cWriJ`WNL9#3dj*b1;n@k;Rn*AHN*NPr=Xbbf?GZ?)!!=@R5(}Ux 
zM26O({8NtTXJrk83E)Fsmqlla=0VA6QwIU~<|o|J1{a=jp_qchb_wks7SNe%vxc_} z5I`6mcV~LiCk-x%7(t7wbET%WPwADmledd*SV0apy0ciyDx~(^8Z9}xx%-HwSc1V+ zLDE6u@73VLPOCygU3FTJuk3jtT`XB{8L>IoA5z(Ec})#u7(W=G-C9L_?eL3izwjQk z<32Pa6;ccxy`+a=DQRi-6^2<6HdUQyInKVqB^s?ew~yU+8;|u{|K$3>@ITjxDsjx% zD{>X;Aw#k(K<13cX$Ui*%IjCkBdMj&isT{b7Jstq)h)a4R1#FI5j?m5v=qH9?K>F< zJSHwJ#&SyGMOs%k&nne8k@^z{%KQ#RDFe)5=d13HPjZV2AvZ<#<(Bh0m2$GH2N5akwm4j41bT`dPg{vM&S5FFn7OYc3<{7h)cAZ?C0i zPM{|Woa7v+=BSjEvlY1;io{b;v%XzqlV1>U5QJbA~RtH~Bh%>qm%rOt<(A~7@sE>BWPUmsjd zKNbQj@6QP;{A3V2X+KUsUw8d--%?aH+y8uQ3h(^#RR!KV#|Ltz`2R^w&1UWjCG`Tp zXYv$3=QHQD%=62jD>QVZ9~>5lTZq(VQ%<+Zr1AMp8isLGqoviKmFCq9{7G$6CcMl= zMPc-7RE*Cu-jI&jcx5aOD5`mVtliKvTGmqqlq;~#XUDI)kIp|lzuqKj#)77 zU+@Ooxw-kxbvo_7n}kog=mfA04>+(h!9)uT=knymA^!FS&uqK`(t&;wkhk>Xh3t80 zTSlBK0aASJ0&h6KW+VOE z>0TXF;%+hN#W4vO+dkDk-2o$v2?p8`>_F&F09u*G`iO$zO9G%=P0JTrRn~Dy!L$^} z$CUm*_TF{7tt83s)uZgzaeJw{DHL(zMYGyH6eX2ZrCX7#R<~z)W`QI~p+piKf|QEo zFJ69z^Wl83U(Qc*e*b@D?h614TW$56nd7O|O@ZLvnURrkkH|=MhRa7a9jaCUPBp>` zkGP&E34zkOON%w80x&aZ^n!>sb8Obf>6uOd@G@ocST7*383VYvBqM2-0>(UZgoIT$ z5Sf;Dbcvbx8$(T?J-w0FR zlOPnIy}PO75Ty~h^`zfyn5NQq(+4I0u5z)P%Wwn zceO;pSgitb!hDvFN4HnbdnH$z4I@Ob{=vD=zV?Pnv4D3MFKyJNZzog`&72P zI@)D+-88yJ$G~#q_3`=m0&6O%p8A%3H$gDI=XG#>JHLPP-jNq4rkc?w(pOW%P3L20 z1jHsuza%c^1VN;a^?OulH?yxuj_Rg%YtjZL>I?MD%{8V#7Y1xH*i3Y=A&n_reK9Y@ zsR+b`km+stBBzeVegsvZO7K?43d+pYA6ZLMjAbP!y_}~rNT)aQElsxm-GVy5g+GsW zkg61Ngg1E$^1d7E%;Z+be(7}Q@NsGzXq#)!w{HwqC)ZrX3yP}ya#(Yj=PBbS?qh9_ zqKUL7U#!>=VK(o9IX@RdN1O&MA&39@2H65YmHG~rsZ!OjenU0q{8#lx758yfnS_dx zBJ%89TNNYL^lntEHW#xX1ncNWv4(F1`)UWXMpt}B;hvZw*}+~jL=Y80DTSOYbZxw3 zZDU0o^gDS7(FS#t60xz5dvF!non#%ZB_ie3i0BeT#P;z z`|oc@=aOMO3eNXs1bp}Mr>CA6Pb~UicsU^FB1e)j02UcZTcsk0Dmhb~vTQ7JpC>uD zF)4Ov+kpzMREFe4zN^t7G%so7t57A4;TmE*FowyFJ-tJ(+OxUQ=k3zuc_um57EZ%UZNCzK-+CDuuF%*DoFcu49ic~SgubwlSTd6n88;f&AZ z>mZ_Vj4wGSi6xxQkeBd<<5m~c$j+Z>cPUQ62d2Y|?T;B33*pjZt31_F&;3alF5v{I zo5K}L!ez!wLuud4el1>3u0Ksm$4tILJocF6&bX1Q$|WMlbx%yuA?CA=kk6()v_X?pZFzrA?tf1|KUR;FkngMcGo5HpsMhz{ zQJBV&hX3{7{_oGxSc=9Imm#b$lg=r(z^gHL)wjX<46vHV&vOSG+Dihey#H$$nj%j8 zHCrFDUVlQ|xH1y6PaRdxzM~hM(9NQ=z&tIwo zUYwy+_BoaQK@n@G=CihA89sb#9g`zeQ1f&rWO4AEb?LX91_a)d~+ z{Nt^pwS}(Wlc}?KkBAC+% zWS1itJ`+Dkr;UVyzB-fKR5vj_&=H|hP=a24zA+Dz%ts5<5x>4_2eko;Dg%`?Rq^B) zI@bBn?ghcfI@KIDY}A<76<@>-e-)-m$Ws!-ErcrHD<$D;>(FFGleK(wLqcJc{nVc# zEtAl&i1(-_qouA^f5a9Miv~@QdVK`jX1P&w;AJ1PRh=T0Ws$C-Sl-7gDOwxTZAC@-^RvtGciOV++yUa_eT!NNy|pLK~;*55T&UZ#E|G>4YT9} zhYBWWK7k-({nPl6?BMi6BY76cPesnWFCtq4hsMaBW!a*55zL#5GFZ{rag&M`vbgAF zWD`BU8zfUwgopaTEuwK|E`KjEqeUjcP;i~;0$;~k=YL*Yb|S^yVIRlW(+Bsy`A^+@ zdE9?8EK7!ZN2pK@h3@^K*w);+t&_@J(F493CJz#5t&@>Q<=h4Is-ur8HHkWtTm>IR zEh%HPWXB>+>;tJc&KYtsbCCyYwJ(xEyF?30Dso#tHTK8Cu(Y# zM1en+pWBPURD`-h*@}5{a{V~-q3E*gk`M3NQ)6`9KS0YZ!NDSwx@&-!t2aI_VHMY* zglK8{+JdgvIly5`UA40K4vQxXSJ5-_MBRCOa9bvW1F;XHFfA&O6sGWj9G0a(LYb)2 zf^=$dk(|rZwiydEinrimDIQ(zTd5vIW7dViQG6Ynx;q_ z>&YOr?)q=%(t;-#Fm)|0sLLp8Bt8?>cq0uql0ubl>}Lkt1nJn8k0yb7EYYL)l&OBm zG6_7Y&KM?T>zjovJ5+IT_T-n{CTx%EcGoKrTyc;}#u^d{a<^I|@q4CAj3N>UTe*fI zYY`F|1Sdoo_qxR&3Oa*w{cC&}T1z0x*zqwjXE(S6E|cnQREM&y zP`P}H+$y`pU+OH7Pj1IUwYa~$hIGCZ;9?f7sI;OQ@VJwYD|BU3sWSsT$6xM|j-v93 z(+nWi_&`>u%DKNetOJpI76cI1QYIDQCrSjfMDQSo7Rf$z95X>nCVGkb z2^kw9Hffs0;BaI9T+sM{ULBJ&@g9BV&^@W<5*aFGF(Nc`?DY^&iBm9INfyFti9$uL z=-uH6$O#EZw?oqri&*b;g$%h0iitOJyXlOoJ5q_Nu;bJ!ZSrf#9~#QcAE$XTSHA}F zj?N<2tmOzjv#~k|Q7fDnzSu%TwrPx}+r5Nn#tSx^qyEZFyQWycikl;$9C zb?h!OO<~Ab^FD{j$yqkE!3)u=Ymf(!@~L_hM(0>U1O)Ug$lT}4m}XcUY13{sY$QVP zJQ>RddKyrIJ#$@DzrYa}^~Hsc>zhfe2~rHFxw03jaJ)pFAb|T*EWr(I!K$k6iIxUcEdSf&&N~YRcquLl4;UZL 
zLuSXSF>9#+lVwn!SVSy8j-uU*(+q-5CT8W|5nR1qHPcLjgd$p1tfL=^m60qALsesy zdMwL{5GI!sae^Jq`s_6Xm+j8j&s49*fX=B-aP|p%7)Tf7O0Z)^4Zt>l^Gh?_b{W`=DR@L$)HLf`q7j@g}uJ!Yp8q znt!i}TJ`jkBPIKs(*PzeS%2YD+DSBtler#~%<(x2A-;~DOgz=uGe-$5=T5L;s(=7A zwSz{_tfuHbtwYiUAcS_nwHwnX=aS0(a&&O6hZ^2 z(vqIlw3rY*G#aq_1Qf)!I)|AeZG zT4^rY`lRj-LQR-w%xa@m4UG9Voe0wvQ8V`xwkCzsk1Ek-9V_zIZm@+<^8sMwU>ydv zLsTv;-F0=O12XEOLR>9eeG9+b-w@U4CUh8tYDA(A6&%TL%K73=h}RVP46X1C3@B1l z`A10z!?vk%g^5zgc#xeJ#VrM(OJ%E@B$30%v)9*+^T{wa2`XvnJg`33Jz)j~b>r9Q zbYXU=PB_RraUT?y>0AB4qn6&YU~GxXU^Kf~HL2H#=vVSGGyzhSSCDK`*FQt_8p|cC zLfae$mcR|~M>ghbcwP>a4#Z&~-<-#vWE*YC9=)KpxU6TX@1Zj&5n^OTtefl?l3T7Z zveL+uu4UZOFU@hgeWT3!PidG=@hACWHg{f2(+N}2=v81y!PY01qbtd)Q@mF?J{a~Y z4j!B7TRwI>T_V{RuB(+_+H=TMt7N!C4ve2?6M$V2Z>cuU=Ek9q96Bm%Kf*9szSJS8 zG=Sw_VDPxuv7|2?^{9`_F4OJ2z6;eYq*+MVf~j%NJ4x68E0Ivq}7Q6t()DS>2nZnyv%+%i?xB#2~_2kXBTYV>|m-F&}Wi=)?9UkP#cJ#!BUmxjp? z{FSz+1F^S?Z74~e?W3?iG&|Po>xXqCsrNW7{Cr0GY>OeZjAfrDCrLqoB166Sx$Fr2+R{9M{ZPjG7%xP2(_u;`Nig%|{7GVR% zt3>rQI$J1)D>nuKLdAR7!KR^_I$%bn27Hwk03dGbHAE%FiCf`&)OUb{A<2@})3fPQ zv9<;datMo#Pm~>aUe^gxt!?goi~=saVM$ev&fHQ$Tte4XXgU-?(1-NZ7@4Hwr~Hs> ziO;Ieb9OFeSl4GywexRc2zFwbNNJc}=~4DG4x}DT*>XxtotJTcVYf7)6Pq@X+kc@f z9W;f@B;D*ZsO~OS8VWawQSEJ2qgV$sC9RP+iA&?MojUnid?p1fuS9$Ym#fR)f%0d( z+bqA=Y=X$l^w5GKX(bV)YFJcU-++Rr?XBvtVp3-EBotpS?%46aIkde<=z*z$62vHa^YkBRA!qa`2! zVPV;nR0;|t3YUqa0wDDevi=f7z?+WUJK716HI3y^V~I6|7wiB_H8kcSp`+8}_SHfh z4@eZ%S1cBa4o!pyIeCJ*wP~y^P`a<|n6Urj&CU~?ig}|$N;i|=it(EAB zVL!X94yXeF<5DaStw`+ZUwJ`v7N-D1B!E<7U!Lr};er8DOiNpSiD?CKAY^+0MTZP1 zQQ$b0Od0cNefoOLDK)}(lp;35##MP=W2+LXhS*Z^du@0c*vRB>XF={IAFC5Xrtz#A|i&txWi` z6f)^|?iqN5V98KQ^ay^ZsQz>bq%TF4BoFvc8JC) zCk>mm5lDiw-IfS8t#2coTdfH*uY=vnX@nHJ)Q(m$#yVP*V?^2h_tEJo0O`4$>{ZX! z@h6QxoOBDk*ukLFFE;zb@>6$M3@Nt`u z-Tp?gvBhNvoBd76jEA;(q*rRM^*3lozCGm8{$^*r*xK6ODtg;|x50Fqda3LAZqE;= zXN1$~!fKC3@Q#j-xJY|rL$j^(pFVCCTf+_B-UQ(8pg-itZwrIGJN;3&iRWc*Z7|*D z_Ew)4Sn#mBzFBPPt=`rqzyPl&-W%^u_n!DYO$UCq1YOLxz0R8(U818~Tb&*V+FmdE z+W_rXDQwowlaNezz)TAO4pu}AqB-8D2y=5A z_&0jR21qh;-@f>9r+foy(Z#0SEr^YM-7bc`4ZUK_@-irTIM9~t2}d*}u0c2yT#(pC zG3;x5HlTK0?ExWapVD+s#=m(=v(1fo6+iUm;q0&=wLH*Ez3pu_e$zZ63IOJw()N1& z;ZPGlz4*%y+y@AsgoO}hEQH$r%R-^=CwJt_ni=kEn2@qTZ>YlB{&vH&s7 z5c1!Ih=Vq!cY)s(k_^5Ut=9)#;Ap;1Im(Y{i!O%TNXY;NjJ-U8;W zUfQPv?wFd4YNP>EZ+4-#&FyW-N;}i-!`^%c%WtgnYQF((zcD8;!qo(2JfaXh8xuzyN~NH zY>8C(GgtusZ1@&v9U`DWWoU<;i;+TRvm#WBL@uOaWLm<*-)zbHP}*X^G{`Z(ylEzt z!L3r-cZzD46Zd@MjtBrUU=P?mzCi@{>3+Lfqa;@4l{!0!Y&0 zs=ooFA;C692Yka~rKPsBDA>I{Qe^O@FS@JmDU4x45?hP2Kn!28)@_hgzP`V6K+0{T zs$|@N3^O|+>DS+sv=%9%ABGIq-vs77^S*L2yJu5c%$DT9VzCj}jIB*1WqOrwfMg4b z;G4A8r27e;m6Hllez&)=$rWITlJd9w9$Q-bFYfjx?EmNubgkM&wP}swQN^o5ocSsVT-EWTK46}OqkhnZhtUi>`8M*>Ws zyO!G!X(#Yhv-34l^Rl?Yi7f$>B4#;b$>y^rUJs9fHywE)s+fwDD_6WaZ;LC42Szx$lCJ5RXkgjD?q0TZ0e>)Xq8nF^p#T9T_zkZ95~ z<$ozoVsb5^&!M9PACro<2`ru6>M%+pX0uzzQ%a6IkHCgo-aYAb6`)RpJ7FE?ZuL`PY1S_UV{z7PFb&eMXH9F`@d(YeNS4#9q}u*B&W9K0BL|4XOR- zYd1*;5gadwQQFpt7KwMvmT9C!H(hf4R$0o7^q|r&#`K7V+mJf2liM$*@BN)J`L)SR z=o!KXZ=d-_&~NzSLBOhy#e-M=r+AQRh4I$-PpgR1&EoJAp{+n)?OUGl-)qwy3?J^h zK)2Uf-+KCc+hkkoPk-OTO~Z0ola+=lFS^)h^#WnudrS_lK-mub+D2z1Uc)#1{Ch9m zKnpKvlNYf&I{hy5VW8oteRjhBW0hoUI_!eZx~9VfG>D&luiJfGV6OH2V!zYZi3_D`FNXv0}u#Ln)uGQBg~m8K|3i!uG1UFYMEIV z*)5wl-h4$J6lEQF0m3p*b_Uv1IfcCR*;fN#Rw?q+a{JJQ$N~EM{F{G%35c737*xdj zyw+XoYue9F$y<+$VP_~_w7U>{LzR6#2+Jj8BbWyang#d+n+)0%-{>Me8z(!QXJ=MsNX70-@ix_D{}-06 zz^tRw*)nLZms2RZTO+sba>LcXH%yUy`+LBQa|^a?=(5i-(DFP&)`yYlsV_dQ0!3(~ z?)p9$twV&Q!)mS@)0yd@swn5;=6633Hcd@ES6C_)ns+L#T{$|o=I?zN4?X7(b#UT@ ziOXQ1Ih&mGG%N-XaPil2Liq#5E!{lguQeXA@PIp?nK zCc%tJAOeqcu8c1t2V3LcZz@fM1@nzlLQ?tZzo*6X8%pDpJ0MAMOI=KEKcQjvZ!cJu 
zP659jlbPFzQq*RZ{T;^iT*j*E7L$)aYD1$dN^Ybt`&wRcI&p3N@G!T@M$}|r>WHCZ zyZF~7vMH5MqMJin_3ypi%IM-r=ZYooR~@RTG`UBpPpSFl0A<w%bHwL#4&sHPE{>OS2yn7vPtZo+a}2y z%kpIH@sOy^r*8G7M8bwl@6@CR=|ljj@wh!3N)n>*2(GIyP9+7F9V%QzO6uOoC@9V8 zYG30MEguu%LopjD()uKi$UgKbGEt|}9fJ1ST9cmAL^kSk*9It*=t2+N z>u*Ja@aY-V@#MQ2vxJjYAt<*%f2)dIohuH0P-a_o%215H;jRS8ZBFp1PDM8ZgAR6+ z!hAI;LPuEny^8Bwa1skHw*7DyAV2L#gK{4Wf`MSeROl5eh6vAt#7U0zZ^}D{{tz9B#!CRQerBcz_P-eJIqzsq3d$!`s^EUoE*p0U{w@6D2B$a z^2g__m~IE$f*_9)7j{Az6;RI3wV zHL}&(>S$1%24SA)Y&J*IQt%~ByB1-OeRo4f3o4VhJbporR4aG&HtHTN6`$}h! z>E2I{9sa{gZ3v85q#^`}P=#akSNhkYvtX-eQ`TE2T&WW}lHc{2g&QwVwTh6R%@V}d z&0+ctHYIf6=HeWaB5;tXOD|%ozvAxa;W|IDWBG~6$j$XKJl8Oe^^*wlI{%1;kXDcD zb3d>c`9}C9^UMa%)ituE&lJGO^GYLTehwO~289-Kcqd1PHcG-5}U z4G&BwkBpf{Uz^%Ob~))J^eN3r=HXF)^AQdW-gS555&y66P3^7lI>TQj8NNwE?NPt` z2xktrl{dRvTc3S1`K$3N6Ky>r#gZmn?|t^w@K?78T#%Z{uWkms>-9K54ct^lgu_a* zsY^1Db@qUk^q!&J-u~>A_E&+GEKPsoQNIs{(mD}epMBHrDyXHtpb08P3!i<_(y3sT zlLMx_V$gfk+tzj_^st@vHzT9+6z8-?lG14I%_PB?1#pOfJR=yIGabL9mIT4MAQWDta7_2hO^heyp7zSC&H`7Kb zzPgWLPARUiJ^U^1r%E+hC_w-t7jMlNtzix$X3ij?*eJQJm##K95asq#Hqa{tH zP*u|8BdOW?zY4CmO0Ayb1>_J@2aW~fLXxs7~-nmRNZLy; zMj?((W`X}yTSpyWO$h)TKvLR-V>_3%@r~4m$xEOvvdVH#V5`}qgE&5>H~@Wz@O$7W z;H@PIW#&6xms}oAr*?F?T^YCfi9xkVpx2!;v`m%clMmtRIyyS9(iU7CD#f9_?I=k;>tH9&uK$U-PjtE(^9q}r5LjMk^*Z-eY5?2Hu4C5qO4DDjN8Z)ww5_jw(H(J#fl zLpqurFxQ<@DX5jzIA@qL{DSVf+3qRLk=!foTVt0?$t(Os5y2z*X&R|C7Y)6Rliv`} ziRcq+iCdjZ2QJ55m@vdb4J!^^)uJ)=)drG8&hc6)d*+;|+=>Xf4&_lx;1_iU4JJ;_ zTDcPh7s|G|yTs&PT6hdBs6K1nI{UOzdO?-yJg(z|I?c&KvXGIKibhDD-N>;a+EiqC zQIbXKzXPd+ibcbsp0wo+o1>;sKKF63`MJNg|JzqGxxc2tjXj?%ZcNdJ$vxcIel*w^ z)VHg?(y+OmijZBXO2h@Lx)!Dr3eUW*mJ+5K{B7iaqSB4rJ240HF)uEnO zdv#-JX-KBEfmg#vDR5G?1YV!-8p;$4zsisc_}zEI!KVgyrndj8QHWc-d}o_kn=m_# zo-?y$=B%%h-V-Qh-Pc;<@Fw#%M9$kbvCXVho1seTL360GjEXfDf@W>REE`@)eLLr3 zdc}j%+()~G41k4AI!`-2G9Ad3(vB;G3q;TM#>BsHRS8yLRWx?kQNB8+&Q#seTZd#N z?tGzwHKf%&sVM2i8g-`tM9sh}lQ?;$)SL4I+UERfEvS@7 zJyUyLg$-q?)I|ksgU7#d+974Mij~Eh! 
zFC6>2$|vrYnU~E-E+n9EZG-1tSzD#8R8%QQSf2>&y`LUqwpsXir8Lt6lHBC9v`1-x1$a zlc4=;s+bi+M-E|oDfd+I4nU=0xz9jxMk>A7P{zC{*;lR+%gDg{c^y$JZ(*uWufG0f zB3^J*TZHn!@9JwUk8abd79)n=4NLpthPuTY#EnC#936*D zlcz^Q?Q;>56g#P@*Ong#TT365jYGWkH%tFG_vHaJyGz!zj*+=X+Wz|ZFU2l`0&MU6 zY3JErzQTwHdpoc9U+x?{eNr&^(w)hYzL10)1OP}kFUeazScRGvpM-pE0t$yV)O?^W z2L-^d`1DvXT%3tMT0YSZNJ!XLSjz1K6l{TjPl&WIpho8dIH3u=+P3QGHh)xm87{ZE zW8935;|aS;14(MkY**2C>-@HeTfa0i9awZSpAI&+G@(K_GqM1u#bLvOs?hLCbm;gU zyM9PBCvPMWHVy6S1ugMHz!G$%e;SFP&^H@Or`T|^KtQ0v>0~R9L%=~bzG$tbPR3Wy%lOpp28)VShU&0F#YoT)6(Mod&!=x@v6*6X)Gf~l4_ z4@m$*uy5R6<&Cx!Ou0%!Iqby6C!$N(5{BTr<%7Qz!Y*sLQL#i#1Zjo1?xQ2wRh0BMixVoOeors_kOisyHTs+u4?#X1z-up+>+2G9;pXS*l83B!j?#$>GXWD zb$kV3>re2nBqM@Kk#N%iP8XY6`*!(`-KAoBrHUz@!7xHK9t90zOKMQX41Kv_w>v1U zHS~ILM@}M>NSc@o<=#ifERc4h6AO3AuG~X{!Xh>xqgW9JiiFtJPio4d?WL!LV~un~ z6yJwp+%>@b+5Fcz8u8RnMr!k{g;rJ zQXR)*lzQ49*#viNXf=TX8eDWJf#Jqxu^J>a7n$L(Fs}+dAbCl`6I!pHO{gxq%PHU- zqO%;)Zi}%{F<@V-Eie?k%c2*C)kI`xE_EoRxKLRNwa3;Lcc?jTB9$viY+T8j5TK$C zjECwk%QwnO%{@#wENP~VWSNc~VvUe5g-kvHXvL9p=HQ_G%kmkXiAM#NMnb#i3mvt} zh{WN!s}zbyKB7pMk79m6Nd#HTtibBRi{$!ltcKT92M>H$%_#={h>1efNgg=@IwhH} zrfTLO{ep4D0b|sH`1cO9f*Vj`JULFF=I{l#{;`MAJ1`bBrd{mvRfsM0%;FTEkW&{HceKSkeNTo7J@37j9je9L z0WpCkMx;POS~k6~&jm~ur+-&~v_SIVM2FB#E|ZJ*B+VjJM2jvJ=d)INT^`?|euR>% z8w7KZfs{+agDIKa(27KlK1g8+#zup#8I2S#IPCidm8vpT39sdN(HvaNLF>aaL0FHM&2iZ2c9!lK z;b6kzbzJ$@dS&L7wR+KpAg`&Xe1j%gIhX>8NQ4_=5QO_>;IOpvYq?FTO>lkeUHB^n z0VIF|M^)3IQj>vuM!%sXWd<{5?m!fG_;2}We-Y6K@)P|ddrLsct?K}!ZM<`K^^PsY zdDs1aDt-Za$tSMHvomEa1J{IEiT;_LqaW@V<_BI33Zs}|tlsUYae@&L6tZ8kbq##<^T>WuM3C>-w$0L420ypUpgc28XCA-OmHM6^7 z6z>tBT)QOugrs{I+sk(?LyBdJjTlb0CXU|>$fM*795C2$7!@iT+z4l2n{W65`RbX6^wyB$XqV`3HMQBo7-2 z2$}D?*GSJh{^RUJwp8b{3yTn>bOS&Vj}X?KsZb;0xrMzO-C{0y9Sh6&>%O4HO#Q#_ ziYL`2nFl(C@K$Z$MQozJeO+-vLL?m}Q zqEo|sLNr%^}eP|6vjCIo|5cmtJi~aP{$c4 zL%3oI6k!npopmq`?7=GyNXzvqI~|jNHySv~fxKPgwbc`0aK(uns83l>5UA{)Rk{bG zu=nhkR)pouC*0+Z4~d5_O9#$easxvRU$u`Fun?06Z{jd{rn8P1_?g-tN(w1D%=2{$ zUA-)H03V6b7fTywPH_l^ zXUeNnLd-hMlsa1w*M)tfOreBFf3v~9dP$_(uQ3$dbX$QeWyMAB_L-&39QWm4FVwf^ znYy8RrWLhdUx=1gPbE;fScl+}F8&}aeRYR5Qrlc)iDu?f0+Jd?XbKClC$hMpW@D-d zvo9=wZW6X-DeLezMmGtA=IP=3*wyD?@wE_%7T3g)kRfcExeQ11!&+j1_X_I(gs`of z>EUrjST9y1;QJ9LCfDfBelhafKj7qyN->qzB6;H7aLf#k?;}eV?Ltf(E1ouv6;&&! zf0DFFgVcpYj@+(^*I49ijSW|T$P1(1VCpr*Z_(OG3|QIgu1kYENjw24x@v) zg@fW0#WKb1%9u#*L9ueAR)k3uSypigxXU8)#%zEaaS0@ zUR?Q60$c{Y!&}TyO3l@SV_Z@g;?yZQmK(yulWWIfe5xT8j?<&)Lz*U2vU&%ortVb! zkSw;qPeCo#VRVwv;bD>7L%mP(LQ zIie=RY!=cw6{W0YQQW*gq0TqeXWVNeYc@iA`$oi@S=rBwb{gv@=5%)s>N`6gcB(Ng2Y59rSk0T7q^EN-ZcPF40P52Qj>_ z3#e;~`Wz&C+y;NyMfG)L>2orhn+{1!lmKKxcq#OW+9$nl1yPF2`Kz_&!SL@piw@Lo zTn@=(_IUZF94yB&? 
z?2jy?IfhHYhVL9~up^AoHs1zPy3);#iR(M&jt$C*Jo(J7ljH_3vKf+Asj~@%e3Y~z z!mOFeBgJh~^NB3@*M+-rvx00_5)|cK&Rnxp{*K7TdlD^-s$~kXP$dn?2(Dc zd9sXwQQQZ{1o;07Bd?yPv`XJ{57I2}#2%TffyKtz%KIFxiWJ>OifP5o0$OHsvS~ zrO00qHy0>y{1?FpqgsOKZe4k)7_5hiyUZbE33N(OVbRb;hF}LRl&*EF3k*tn7eY9q zsRHctCY|<;2gDbUt!IjR)0%buOJ?D#vS?XZcbJPXlUkprGyq7peuX)OHAX zF)g$Wf;@XW8VidT*!3-#Br1m4aT|^R-|7BU%6e8W%3g=O5XQOIji2NqrHynXgjA_+ zRK#ym9xyJM(WfY{h_qRTX!T4EN3sh7U14@oJ(VL$rdnEicCK`N2uKpGxR=mNqMa6a ziz1Q?5~j6?H8xF{FNeZmNCoeXv1#q;vi9VRtzJ!nQK$z)C-}h?%9@yAXuNdQ;V6z2 z^$)fnn0EC%ry^=)%}xlBs22%!NV;;kMj}2^y>|aOLB-?Au^me?RHT@xFz2iP*fAc^?_R+RD{yN=Qa0-yG94X7vjPVuIR=Ge4ZO#fk)Dj>fTa;(@}k z>gY_-!X6yj2GKLxHLcyZt$-5e_{tHzs5}T=sp6jK@YUqF2k(V1I6}?p^ap1-kOZn} zS-l`CO%#ob0i0Sjf}x@--kQl7o-aK>T=2I z0(v`~da_K*aIz5tqjAb>xeV4@Sj6(aA&w+x8>^V(W+UupmHL*F#+6U27p18Li7#92}%8b7jGUw#??VTT3^TGSyM> z0bvq~(?^@a0bYaIMC$xwotO1cJW`74)vMJSB%RdwX+(uoX@yF==vU2D zB^kv5g0%_0sOir@@8s-dD%qp3C8xX;8|1eRH&9XOJEVXcLbQ*fuW zks$Rje{!Wk7U1n*S=gopD_NzC^U0YdHgf9iIXB*t2UWSc`YD$HHal0+AAF+pa`9cK zzrt*eYlHqj@T8Z^#+Ba=yUQP`jC<*Jwi+8kkx}G{uUBX9RsA%F&LytSPm#~Z7`mCg z=Sn^OBZA2z;jG74r~@t&moP<;V327x9YVDE+%EKO>JA9!AzMNYbC4Y;Klg7A*aFU$ z??6d%gB2vF`>pO`=V}{Yk=ECNS-e{%=1EF%l+1UrS}4%?gsS9}by|RMfi;ckbS5i` z%vvB^FU34EDvxg3_o^##Fx7Y}w{zy7B9%MIz_Vvpp`1%WP~r8*e_qQ9so@oAOr^%8JYUSNM&mOK zkDLuSE}kf&b}TJ=erZF&jU|;FTs;?Dbf1eLb=Z7|l=QLE(kU%-UnU4!emSNqsZBDD zbhuX)8p^v*LN!&cuAV3bTPG@qpg7xVbix@lEJOM&Q;W(016g6N>>yC3!sPQQ7uX)r zqw&7ihIukC*B5YyZ*8EIJ4I2)3|Y$*4xcEy4wNZxCyu^y9L4PW_TnT)S-g zHB$27LpiKbVvvlIW((9Mc;l-YO^I~bBg{d$2_=eL?uaBv6X`-;9e8yxJV1qESi{G_SV_93G8;r#iqk0wCkjm`EtsZ4f(p)+oo1s$r0EkY7_KEw zl1v((OWxGwTRR!Rg|*bR_>A7&$o{v5gqq+e7q_sl3lwzM>X+Cx|Eft~PQZOND)_gC z2Qjr4PuucL;s56lLBqWI-yne|#!4e`V}+3zH~jk4t3*5=wafV=@72*m&bH136Trpv z$aiaSyP8s+XXA6-(6-c2CaUG46Get>p{Q_@uSlqJo44#nB%}V2uNbDVSRI`t=-uIK z{(}+st`(Q_horpK&J$()oM&(i}uQ;vDfQhQO=+w~_GNKT=Obe6q@} zYoc{Q6!C?02@4w)<=d}a28>Q52TB3sgZQcl$>Q46dnnVy;JELy*{nahT6DN$SB989~ zOsZ3=sNyNT-;G$kBz@h&Y?i&KJ}Z&&(uhltzi3G;Y;%<~y@Q0B1dt5^%G5G(Bt))8 z%=8XSdngt_9fl62F`+ojC3ER~t0Rv8|4q_vu8?Hu+T>*iO&0cZ=@(qVy7a$Fp1ipt zT{W4JoWH^rp5TqCit*G(*Q0GO{NY-%(RGW=H+7o|q=~q>n{~`(dGn;WM3W;wmHNb^ zTuj+4d3oa@fR+P@A)nh&B8&7QZn)}U zwPB}|s}^O*)kvK5(4-PU)kGj#;r(lTbDb?FUEEwZD_OAT4+SQ}&|{6Cj@QdkF%2C} z6(MXvMCe)^dX1u5B>p9ttdhU@idFL))$A=s;vx>cq_ja)`X|ZlMV4bLXt8i?aS8b$ zuW>IKQtOG!(CCM&RW*>uxH_&dPn*j(zKU0GuOQj-dgnQsFUPHD$W>JRWLfa9d&VSG z&@<- z%{|#1J^xk;?C$ys@(%^tUrqcKroJ0bX`4%&NymQJP_@U2UKQ-%O#MrbJ`n68ta5b| z9Y@&Jb14KkV>S3QWTsAQsYyf1yQeFFL>w+hCy~5F)#k;Q|L(YcwMBjFG1Pad`nx&k zmTbc(tlpN7`Z0VeG!34Q^ZU^K>M54IN85HHOmmx5ZL*HiezuftZ?|}`tM*o0^D|b% zUn!9#4vAl{Bbt=f!bSl-iNk zO1rk7x$$$1lScs%Vx9AiwmIfjUI#sYmb1e+b5B+{`j>|&mNEy>ftrVtKPXSFV`Zvy z9tC^TPio9d10NkJY`#|Q&#FZFJ4&)i7E*2|)x_y^qUP6=(HY^Qd>N0T&e7r+((x`lX8}WrhDyn9;1IxtP3TtN$JL=by<8UG1iN zkW$&B!;@+NK&`WGnZPW^NTJgaUBGK8d|>UcMr;Rmt;2X2_|(n$Vii)~GJ0bG{y-Rn)Tgbmz&7*WVSp zZ(hCKd;0w8>-`sRUKcd-efM;a=DmMu0yR&4`W<@&1>%?`Fmv=t6X!bD`Z0Als=Jxczm<9LP1O_7C^@E8K14H-lVX5{2)KNtA(PLQz8YPr7K6ac zYSdkg@iC&zH*fAl>txGqPGMwF>cCVnx{GVzDO~6oqrzXx2 zEN?S`@@*>;|n+2Vn-Eb!)II~XQDSw7_j7rI^ClhD2W=`3LznTl#y6rny5AH$wW6~L>Jn^wJ{PA zmawbQCZppp-R`G|H>*>*&Biun2ULxylO)n=ttCp?X;BoXsauHAMU=zcN2q`(W|hE+ zq8TN#q5!BQ1GygNq}7Yb5ksjWw7;Z@Z|XpDVpxe3LqAY;tMIZYFLhmj_$pDldZNYk zSme{!!A#^f0_@BsqMMOxNYl1NA)`+2jZa$OPNXos;D2(gy%p+SM{u6T2n_CR5gFuW z8gjSakIycV?P}ettz&7c*A!|5vrNqC5LMRT=TkPwYkGlS8Rrhw>zKbsPdt~ z<*TPE&&*fQHw-&QMa`S}0JNE5 zlGRw+_48hOU3BKp@5t)vh?fSLH`P!KXqTiALKz&wEJ4QxK9hHg4B^A#9~?RUFV*_z zP+U2x5#l8Bz)OlF*WYQrRC$D}N{BH)t|xoDRz!#hIhq{4VgGG$M{5iuV0!FP@u2&S z#bG)-vQ|9k`HSZKlo1mC!o`R)IuE$GRa8t!pXy0@Q#bA?R#klPT)u7+sif=Zu2f8m 
zdnOgEG{IGMX_i4NFCfZuH6e(`+3dNPQX$zK?11P^>)DB#v5Sy0Ggxc$*zm`O4WH?s5|n}sL~Tzi;JY1S)u&5b%dQm=Gc(B zck4{)R+`X>7P2BeByM)B6Z4~a|7HoL)9Q6hujHEOnZqcQ}22dn~OcO(sB&X(%S!BCkyi?Ks&z z7dflAaaFC>j$-%ple+yE58KayTyFZx55vLoM=Bb9JUuKmi&#y47I4{36EMXIg+98% zSDXZnZ9~z7gcf<>+PP0CzRBDUEt$1yJB#ku%Nmj%)*Q7CLz~ke?!)=jIYqPSb9!H+(02?vz!kc0{dv{w5K1aq(5CP`p$C_tefjCoFZE)1T7qb0d& z5Twlns!e7yXhSf&^4xuCa@I6VQ&P@!Bbys{2^(l!%NCsp7uBkC@8+Ux1)5B_@aE8fT^o*JIPCCj za=|&RvO`A_qnJp_qbwnmxvwy8exe!L`DkQ|hteD2Tle02SX~fe9f6=`7kFBT`tF7s z7x|p$ZZS`}Boz!Gi!0H{IMS5G&6Q$>i+?~5DwuPHzd)#^O?Cg=Aa*mloq_tKoPiZ$ zLs71RGo{r;<1Yck6$HrJ3miC7lXu?N(7>R!Bp}Cp7)cgpq)_x%?V=ks&|e2riS-f* zAfN;wPgTvx6^$(rh@N#*7ij6|0|gjZsQaJB1W3Zn$UR^7M;s?9-pYuW9o}4>Qku_w z&vwreI4$PV9{G|7NfDoWfcW4Z4sqMEDXm)$1%<@~HIiIw#Wx3_yhKfXV`XaygKp0o z;gi5DI?be)-xA&~xxA*uO8!LdUZnlQh;0Qm8C`^_9o?29BWQoF>t7LMpL99n3!h6A zWmq7z3kMKsW^)xfqf#j7HP*v+CqEZT3j$1qAF*+o$GnUX=5vcRHD=19fob5Lr#~^Z zFLjcVNr{e(wlC$P*pAo;>@(t$S@ta3HM7!dMP!e_sJg-_+uGojrN4wdQ&P2M+^*V+ zzvp;%B5vVmH_5wlHGwat1+o_NWZ0L@_lCRN6yL1)H8HUx)8{1M@w4+;pf+SjJT9tv zM;!Y*As5c6Iem<7!}=t-K`?_n0r6Y~44^-_F7>HJ>Wbo9EGrX3@$?1>8O;*|;KY)} z&Lts&UwpwlX8-2!w~|1p{js9HN?4IDX|+?107NL-$K?`^2xp<*scyfPkC60FiOWz# zVS;dH!d08DaSa(PXoHKhHgppz8P)uyoM&fprGRhpjOb^Ki_xa?9Ka*;h&}X?Ue8hi zwfY6;U<0+@GDo`F*o!nChJO;8)fyy%v2$kL&(lggjmv{z>yWL34D zu%V5Q~!bBw+$4W3t&m0x0tFHvo6RtPdHmu*!Q^zKyW(c4(GZwcyQERg=?c@8xfmfACgE)K692ZENrQF{+>N>*K3?==P0E%UWmLL+69CozCZlH0 z=OcDUGyy)t%H8EoEsQtu!*+jtJ_b?ar1E1Gv3#C&a*(sa?RW4phP+eL6$iFhQLa=_ z+yp)AU=6gGS*FQMEyACv_3gYIUZ$r8ZR9E#DMf2=NkVVuAdtz23SLzG!W$15bjpiS=fYq3%HInp?p{ZYA=t1#YG zUNpXJ)iI&CuXf*mY}3oDxW5zQ7P~5*ikJ?yU?0ea`l5xC$qDf!vOA;?sX9L8hyQ8r^<`8l9Jkq@8kw%DJ%73L>haDK+e0P8H$T)A_}|Y&?2{h< zLP}Xx@LXc2*9tM}VMj_0HaSNPR^th__mow|M7dtCchERKp@t46|6vK_>-%|^e&tM+BIC%M#T~oMkbWSu@ zJ#-(G$K-TlIfpaOz9E_YwuS?xfM_h=0&{P7_IAG8+j;x^k?qLsEox^FqO$vEq-se( zmXVk!qH(Y@d#PQXFJ3=;*@W0c9Wswafm&Xsey-0ol zPajRm1X7d9Va_L$>kn(FH6Mt+m3Flg{M7Z0BWSN2l zamBY(T^UK_6Q(ed>QJOO7OH2r-zWUq?a2Yk~_fwI((4m?G%Awm}51UwGH8yoEXcbTT zh`I^6Og?Fns6kmP^(wS2Z6C{K3(FJRwOE#-pEh--mlU95miwN7(BuN*&BR$(1u_VJBeYMcvh9ilt0%rLKv zYi1RSemlBVpN$q%1+hBa5N0$rhnLhEIK{0wbk!17pO%4!tHh#t11z)QEhajSx*4_D zet4@Y=@|hB?DSj;0kNgh8buGT@OLK)OGkN%r*AQddRDUajz`d&+)9>HPP&gwfq}5` z#m*>UJ;X*P?o(@YA9ZLkizr7_fM3k@Y6w^AWN5Lduf1t$srxsVY{ld0r?cYEe5>OsSrft5|)md99=GC$&n6bM5w-2df7@goKnu3!7o+O zH8lc~r$W?7sEU2_8E;OHBk^>#7x(4b2@p7RK5nrr?QZ}m`Q^Pp)}jnXXD24&DMRv& zSUZ-(&75kN=u%RUoS0D@gOsf!z#9urkLDfW#>ZSL(#m zW?itqaEMu?6Ec*+glnc{aSYw)(Y3ov$lGF1d8Sq^piwkBUIGEsg`bqFl@{}t#TS4R z!9=1xzJIe+&MlB9Opthps+JltyZ&vaduG^3tvKiA&?C-M)NG;_(`0ihEM!Yj4GNl2 zfmRgyRqUxwJdzEfw>-bfj!ld$X3{v-%N7!RL;CB=p+k!Bb-^NZ%_Lcwch@aCy1gtX z6p?8$mreZGAWwvFDMO!v1Jc(ihwF&}imax-ryt)Si9H1wTg)VS(iI3Xyrm2zV5%p3 zr)yNPu(FYo>T>l?w52xo+8lR>J=sf)5zN)1R9lrSS6lKP9!@h1esqauX%X%4LS*pD2x1XL#3kPhFb4?~b1KDBCbKh) zF2rsKSV^uH3nCkA(Y5AX5SIY_%?0fux%b`d4|Q!M>Z+prFOjW)^9nzXAdb+`VlL6@ zI~8w{Gttd+G9pV4=~?oSiU+X_$$4jCN8hvxtL5ygl$&drnar{dlzlxfSxacxye)mo z%5Pm&q5R{`&aPb5I|r_O&cA*5go`$RbJUp>6yO)r&QUS?_N-ML1%jPOr%}|tnl38b ztg08?zdb*k>U<#?e6hp78a!Q(aA1#Hp+?$tG=G=dvhYHMmQ}Ted#d-v`w?>;JI=8h z>C0v$(6PDv`+A^bLeYos+ImdA+^;S6w|I;c}LV-a;2^Ff4W4J zGo3Du>243Y$aAj5pp5$5#)*z6w((6482aVS$@M4B7E@+$MiO(d#+h8~97qPA)EQ<1 z2iNqgkHRwvy>1UJu9}~AE_hVH5j5w55ag{iPgtFf#E?rMSHuZ6dEL?u&ROZ0juu`* z*J6_le*hN6nK`4Vcpxeo5v&6(Yy@sUxd@o+xFHmWFDM7> z(9`&mv|K7BI%-ja&3zt7G0E0sz_Ky^W|AYlQi4Ocu!FmwX%u}uKDNY@kIp#E!Bk55%9x;z6F)`Hpn9D$6x7FDjU(2W!ysUVlJP%ciFov4VQIqL(2zkop*Sy2iwB&>p2-BAUEhMl*MB%aB~t?bQ9?sNmb!C7N>TiDn%v?q zU(rOVY&Rsk|@ zIJzRMP#v*8=I99;g*n)g23e}6NvGYF+*h+poYkDLn6z1z2${o8qE>eL2H|VtLjX7C 
zk&H=Xr)oz@cMi!isXw&9Uu+83FHl(bK#h*hF>oZd)cusbhnKxN@v*yXF3jIzk_6$! z7}jA$w9gJMg+sYCji6>+cua5>n&(U1pvO{GzA3ySCDiDm#Y7y9#EeSeh&^8usK1lf}$&BV+#-4H5S3>BQ8Q3%x!^8^Q-hn<%V=^nz z*_Q8JK|PytDX1pqm;pKqC#sE|8bpQ~S5I^7b@eM1b4DHjLtHlsVhB+*NX|{y>;CL} zin6;Vw4yl0TTOg35 zhL_qLc3J#?%w@oF$^E%XWM7RqRDy%%V-Nfp^MAC^lzJJ>R$R7x2z7Hr1f9sq;KbY& z9oz0`ow_)BdlvKX?aacOVc1_)Ubp-Bb1b$9aPyntR`d7rx5}6lZ{kDdkn1Pprc%6k z3^Wxn*(&MDpYbKC)(#K{+C$w-93pO&43`ZhZ%Xg#jwKAos}C}&3Ur8q2C9C}N1y?K zqz=j+6$Z)Z9-A+dxLVf%v;3%8vi!9N2?(>vFL({n3z%OGiSLKte3WVmTAhQ@I#Ame z$4YmePccrkw3O5+*Kzs&b~ZVZV-6LH`ksjnzpi-qkY7%Ft7_)<4xWnmN}_3gs7PoB z!~f=*MoNllD+J^Oy&_$=*fNM+86sd+GSEmHoluG1W;-8`dB@43lv3o(w0oJW#+ODl zk<5g2B$4rfE(vCGHdJNmB|tKOZraFQp8`E3w!P4Myt(lBt&%z~ET4ToLj5tznUU8j zQ7CJDW-2yOst|e>8o2cKF5K28^o=F9mv9(9er3YBnBKX)H4yZ%v zKmS@$4F}ZFQctlEBUcfJ%!?nEG@{QhSN{@MMylKBrZsqquB$*xi5rfjdO@7bA=7LorTd=oeoznsu8qzJCTimnoI@A5og^n{EsHPW7TjFVhbWdk$f0HB=vxel z3=%`Me_^llsvQgp?|S6jmY_TzUtCQda$yrL!^8CUlhz9Ap#r!Z`!_ zZBU=_eoA*b_#gF044ll^*JJ(tm$x|9}0r|B-1jpuzo&L#?MA<-MTcvDgCl_#_+rr#o|M{G|5aDVJ9M z&(E+>MFxKj(pyzWd`BVoJxS3O?&bfap?C9nR{By@8lo6p2fWSko z4i|1FbgT=?Uv$J6&gSy`(G;F>tUy_i7VzVpy0%Rb5R!1*P~%3RP*q?Q@1(`qwcVrPwv3tbb=k{f8yS0^0r#pjh#O{7zS z5B+1@kn2OlFv5#AlA?x08m5v4Kq}Af2{#_&9zoBGQ?OIqqYgF#L&z^exMb|_jfvIh zba>3%D!zICkH)FS)va>HJ=Rt&;PMPBawH7j=|9J2Ga*F2s7>`|C-y2U_KsI(~ ztcwXJTG!A6N{JhvBj)jL9#I&l(8AF!9gy;1O)sR}Tp+UXm!sjipR*-VUx&PyJ)5fCj7bN+_T`g1;@rMtzR>ys^Ic zoxiEsR{u^zL8Sj`Xy}eDt9{+ZjXOXZ{ky#XHzV)cv3MmpR%QjKVHnh?rv9kW>JzN$1Yb^1XFI?hqeLE59U~B8EZDX(dWlLNYwyQ0%|HWiFr29wE z>GtMVTVZ#7_=S|ZDm>L#~8IBLrrd~Pdi;%w>1 zjqcK~i`q%2nyT5}?)8HRHo9Mx2nOq4M(L{@>;`&71@r%X$*KqIbSq5ky}7kH_`*RxphShEA^Po=OV#!{;ZJtzPGm0|c-eRVS?P{_ zWl-LbQ}Ksq$(ATS=YwBh(==004+gUG$1Ag7;b+(F)O>S1$T<>fbn_K+gx%Fbu z9 zEyn9I5E#=K{bFMsa5e-d19dea1NAiZMyFTwdV)@W81}T3Expm@*NvfG+ZJ?sIG#&G z^*9Ne!3L&ou_frXHadgl&~c(jUD~dTtXHh}K z20!@J1qyz4d3viuLtAD5M!mw*T|KP_`wZv@dz)IN=3onexwEy+4+irOaNcXeN00q< z3rU5*?Jirmxh{->i7nx&r~T~f$0iu#$EJ2-SZwwAl`Zmfe5X&FgLs5bTk+gxv6V=J z@7DF#KAQ^iL7RRBcl;2MY%%hjYp%(u)%CZS3od0!xeqeM_&jT2)!GkOP<^2U@^Gpi7$HMhW?OW zS>_;kAcpMFV4=Ig8}UdF#6jIqjmQE*WHzBtKbjyaZ(m0rgd_B^Tp;Se!hm5)61{9Q zf^Y+!v2cN4qJc-)n9i^ZT12?og6WDefJ`3OvrhRX+L@!H82xNQLp#zAf zD^x-Zd=ko$XI{Yi=Rj;A5cUB5-d+(SdkkQbZS;^t%B3N_y@(6VDhiP7X1AFdG8JGT zT6zLbKspJFJd(Hy<~5Okck!I|y$PMOO)O%L0wr-7jd!|%JJ2Q&yBQNgObjWu21Fr-eGlPbmNub_k_-I$OXgy$d>eP#`P>{Q#$UF;J`Jt_mfNO&2qM-#oMlPUR1rsVBz(j#`y~5qKq-%H}x`# z)J`nQk9aS{$FR`m=RnK|%pkJIVrI)~8#U-+TPM&NAlm*AEdpYea>62y@B?lVFR(nb zi>Lyr0;hNtva$e?zxolOGPE)f&zV~9CKpq+ZK&+6ZEXZby+j(u9j28GIV?AjCkcZm zfN_WEP;7}0uo$tJlf`HYFFz;zVjaW!GKS(gI|<^F7*DE%@fg%2vpG!+Cg`XTG<6JHYTTom zWh+au3(wp7s}>!P2y0BpYqE&IADBd9H1X%VG^uD-)#d5JS5gS$^Nk7;^$^M(Lm{lDjM-ail2nh*k4~31prSax0E<^cLH-1(C zZhL!yL&P$W9KeNS4@I18Bus8msb3+!t?>K?`UCDC1Y%h)A&jTO9?Ax@y&N8*U01oE zd7JI|F_fr-o$EHy_=ao9IE4Hm?R7YHHZ9)iyB#2GD~CG+6dfdFULQ;Y`|pm=lGbzheyRp*S@+532h)*$OhZ`v?4&F zikI0HJ|+9o&mkYg2<$CCym8_wHjWL@{D1=uXy@&0@`&X()F6yYyXiSQF?06n-p=mR zhqMUE-rTnJsjM?z{$^#Xm#X+tLOjWgrrb<46ZT*e?+^Ta_n}@P(-PPW#Vac|()( z3HMWo3_!I148D7d8L~H%#U{qI z_+S`gCNEo34TiIZ&p>Op!4{ABlO!n8IY?Z~cmltg!CKV&DeO#6h8gU)4K*;TY?1GmU$ks%yvo$56( zfSwEKCr(DnSXRxE9QS9AD%3gV!KRt@ZHb6&Az5z_@gD7i>6>VvN!z=`550j>$q+0CV??O@0j)2=6gt6*L86CWk&6~H@&R~mjwd+wSVCuu9q5p)yeVFjfgm}G z?WnZ|*Ga|ONY62VwwaNWy5p~+E3F3&v<$j_1sy__?K5=oyaqQ+hJ=Bv1huPt0?s5M zcv$=qkcgxB#1FHa?Px#e*ZixLAAilO&`^*RO%)s2}g>P2@C)g+}nDXBqK8jJgq=l#+C@&%Q77BjXb21 zFIqF)&vQ1yWCm>mWD}cH7{DA5K{7l;i434Ixvoum zq9nQwlSLEGnMFxSm`Yn|YI8BBg4o8>0H>!x*k~Fy%P?WwnVv8-BQXG+PNAW7^c;V+ zyaPo{q@-&TsOGKCC47iiK%g`VIPmPWzO-K|_z2w47!6?XNjn=#2NK%H*~3IoR+fnc 
zAQ>K{KLf~#Mt!SPfy@P23VukpVBBKTN~(ibln$0z6Tb2$!LTDF+XFT;3dv40N`j4{ z0a{r?XX!{}BC3=VR`EiIrsv?CiJe@Oy$*e0!o@PhmXTEa56x=3^X1=$3rPi*yisJnA_Qkr9$AC2`zQB|pt(f`LVDsD%tie8CWOs4{!8K0pdc^*u$O9p{KKgA zchw{1Iw+D!Pz^){vEGu1Icos80z4F}Mp-D`-QPLjR^ZXu^wd4elVt%zIW8U{WXHv2 zR$)D|g>nusv>gZqONu=@DHgFz$NI$%7~dFkcJ>u6i|2qKq9rSVrT68Npw_D(dAa}Uy?2S6{=B9fbg8W49+6(TxzMh<_k)>O# zaUAt>MCTy)8d0fZ{CDTvfBhd24*zd)=aSq;wyo!>WV23K$f+R?>03NX@ z1RF^rNn`;;^58)UZ26ACZTD}(O^@J2cp{#Fr{M2jJ2R09?3Aj?c34qSq9_WPd+)W^ z`q#f+KmC>F$7*TH!gNir-1?Maxb%yxt;EVZ>&;6~$Ukl8W0vfx_If&Z{rk<*4`*e+ zt<7EUv-5Cc;nC#}_#=NJ2L1D2`Frnq-=vTsTX>iB;Qr9@Q7+Oe)KDUp9BB!& zz4_;OzsNuQ$ot{Ey>;)>#!ezSF==+J?|I%2U-@SKtYh!9%mcQusn$toQ7(~m%ln{Rj3XhD?LVf zwdQH?pTIl!YBhywt&RJW1>KIm#gFd{f}x5(q z@KC1I(9Jib%?X35ChqRgyH8(h)UD0FP}dI%>#ZfT=X(`ROU}kM*LlSnN{^{irz|`k ztMuOW>Y-1cCIRZYP8l2!z0eEj6+&wYq&eFDpa1#4@S7XEe6aqAUhQoA`2IE`4PX}3 z=q+Aq!>TYq=~c*Z)R}w@!BW!YUiK$giaDgL2VFfkPZ7Cjovt)^JFka zk+_}uOy|5Y7J%LXbb_SuAT9Jq3%brh25V2%sVZIF3tcfu$EUhC??_ZVrS4kGk96a+ z4h}HAASenXQ!&b#{W7nD<`6Y{JJIT9tX^)f>J;nRJ_NPtG@<8{wLg*xllquvb0Eq_ zFB8?aUYj;O^ws?swC2ba+4-)UQg9PJ4#tTCJ|BWF$wNev*_}<{F%+BiPeXo#m+TDeg4_mncM7EI39-q zV_KPiGe9P@w<4=QeJ#={giF)_{o}v>yMKMPfVPa*X-G}c-)NJ?UeQ77Ar2W!`(K4_ z#-rEGaPnLHrVb2l)T;gQ|!gpM<8|x~}QVwO}_`&n9=A9ao_&T|!O#*%U_PFq}@!#VXq&(OicQ zWLyHmrS|VO1VQrUm!w_wW5^|mCebepuC-Y@ROt>98YpQS($V2rGp+JsGkICw)a-_C z@9*w%L$~+$_DlqJSGFk#di9g;HZm}a7?sRaZP+cO?E!QX!#; zr*!59jG?zjvv*V6=W2b=Y?6dN-qJ1S;bZdP#739lE}0JIA>uSf|MR!Wg16_YODoYJ zihc91w9V+!neFK_(>;bUcji5u#C?fjE~J-g^;{c8!zl}&1XuBN9KF$hm!ZpS3O!9u zzH^;v@5D1*uU9pQavEXnWZgzrT^>=l+xy$z%Y#s~8(f~1wO;Wi39ZF3K^n4{TBw?I za?qe%J}3Ex{bFd<38EqEfhyu$5larp{V_9C>QrZgRH}3XV^gu1d=6x>dZ*AYL^8r% z>7z1ib`E{?G7Y*}n|zPoydvJKf33@^AFrHxN3C;%uWEcX;aod>BJuIFIigJ` z!RO#K3VlTy&hL=N|WP3&9=+B-Xq7RYqy^Iw5ZU@#CSgI z^-N=EYX&tS)<+xYktP>sW`M1e8DN9{4ZxvJyyz}oA;XpN$r*d_J2F z#XxM*A(NR1?!Kx!G>VzU2rs)1=2O|jiau5bsrY0-5VvR@bYO5AtuuFaCo}<~X{oLh zAS2>GbZ9c5DDiBuh-i}p3?Bw}A@3FPr`u?}F{63VU5q|l-o_7_>Yf>Gx1RQVSn+ya zkz9K1C4Bs*MkYK4ANkGM>*Yb|exhh*(f3(Fv8-=&`z}>1yjf#FBO}_4V=(|zztiQf z&8phmOKFiiw0w=W8@fc2LCqBT6vk8Zf_?i#@D+(TmE&+Xw~ujvq%8KC={!f6P)5&u z{Crc5p|}GdmyQAB+&M;lHcI%Js#Dg1qyeC0VGV&!tJKXFWY_lL-d2VD`3{s5orK$m z_ZK3h`0|?s%XL6!_TU_l@5Dx^Zu9Ewx+}3bR0Djb26N>4)9U?hHrhAbZfAR&(Hj6> z_I3QcaZ5KW%&j!o1*<{C+fmwb?SdYlhN#};f`4e_wLJw_i8W^)5 zMX&y?eTegsZtP!Cn$7%V#+Gi5^yso~V|--AW*(`0WAA8d-+xb=r(fJXxCDSQDgsAn z@XX2(8_?LRKTY~0%0*L1QFal_@R@m<+Oh~;(p~tVU!9@}cI3~Rf zM*|ofaHYfl_22%TfpA~~T&RcnRtz!*=R#hS_3}KIQ`&N)3&jx|7l0@?PazB z0q@7vSF}qXX!z`KI$$^oQiAzW2Xq70nK8Us`hOMj?tta-p)hocD)&}o6^+(xZ|~pN zoMUmhiLVsfCj2^29)mAnv7-!Squ5GBktdxCFWmx=iDf! 
z$oPLg(;?p92@Ms0^_t{)86k3qFg5`f!m0x{^KmdA8Cw2Lw^2&Z8R9lKE(Hjf)LJcqk@(Z%fV!i{IgZA5~HQ(Q-Hk7#on9Bo2 zICc&X{TxNFdE0bayg7)X$z01s3CHgPS-x9SG}n}ip7bW>RBfyUD#VAb*pxJ~KI=>U!Vgrjgcn9o>s^0s!LoCw<<1R04rC*78jY)TYG_5018Fbe2@AYDmUTwuUqg{cByyrxps<~5t#?ZJ?Yc^5GUeN zvoSx52)V?3ha7Y*cYqzO2av-3b9%O52N?eMIFzBu`&>hMO>ussiZyzjGw>ZbFUPw_ z?pRh~2m0nyt&P3q-Z5b*gz_5e{uM8l+w62%ZCuDPD#-d_Y%mv5e8*z|yAZNfDVfMd z(}SUnE<&-?)s(!CSx| zX|;(6a0zmXgIL}p2rQ#SoX(eol1{YxLcP;dpl&OsKjtYS$IMp*;V$8$Pnk-?bOiW< zcKoUZ4`l_y;5B~!h@}-65g5og7MsDvsK!)@vH_35vzo&*9ho0prHV5x$xV>I18MUR z^&Lm}z<5q5Td)n6^o83gu>Vp45$p!;9v$rXZ{TS=B`h$g#(1920g^@B^EH{MOQ3rP26Z8W&4dzHOr6awz3r6bD6e z*wOKpDcm(?4k_UJK(g0hhUAhE&E)(gcDmdJr`}D6mSn0i=>&In;j1pq{yC`A`2=p8p7z`px{KiY>9K5w)zciE?y8D$0(l}>^{1E zjuPku_2iowJqg)W62kTRH?R9bk3H5jMhM(h;)7!Kx%HBH1!3>Q5&3u3d3`&(J9{Q7 zYg&QjcDM>TFNYI1fWen)Xd%awc!;^uz>rZY7}|ao!kU3B=w)8`bTq?{C?X|>BWPT= zZVqb(y1=5x?_q_|*bf4)xj6D45gv1)yjl*oev3rLy{*073faFk?12*iX&w@4LGt14 zGMjvtyIU6v#DBp5Kq4tu^M&zth|T85j!<0Vsd}yUEqBpvfRo3?qh?C{7Kh0KTq#fXLvW(DL@ua+ONUB$ z37Smp=)cEMCLM*}9tMHDyaBuCR@4j`$lgUP)=Y!bV|aepW9%QYD7)+c1=kRyCEbjt zXbxdv7=p&?Fak{@Q2kdC1;=6!$Z?Y7dJg8QCdG3|@WIa7eHMBIvWcZ$>Hp|!rjRM1 zSex`gL-MLlnLDYmip`jzPi#&rg2G9H^SSP@CTWd>3o-8F==Tuj=CWV0X8Dr|{$yr%Kxhiv=14$zMV3*~8>UinlWd6kGA=P%R6N&zF%HRt zV#Ay@@L_^h2M89N6Ow(wr*wmJ^@>YX)?sc|(WhOtNn94!TubjQdtn0nI8qQ$7s+(6 zJ4^pzw z%{WJ?qCfQ=y{~iQI*_PiX%9KC%W$)O4e_9D#&Gv-&P3BioGWDFYKw`UCK;J_W#Ulvp4`9f)yMT(U(cNzcTJl%R@FkQaLM!= zVs2k|P^9AZ`L>(XB~vHwa)!%>UyHGX;=GMU4|pz6PeWKz^bk{y;AYXM&_@+Cfz&$t zNQ@(V0ZG_KVo`AkXRsoJ5k{yJF&;}Bw~TTLSnBZ02y1H2?IVZ6z6mp6a{W>(_%<)d z_G*aKf(Ki>v`q9{*t7bnnb^Ji66<;npk>qQ$)dl(DFdS6FdN*GIo*{#?vRck<>qUmSCVP05=lhiHTo~srGAK z#f`SM4qS9x_SnA{nIGf*oy6GUuikm%_A6HH67eRIa!w!=qg5jD(;ez=>K3KJ1(ETD z*$TXS*2YZK$H=-3nvZOwcnG+U7XEiFuxEL&BoJUGRnjELOQTfm&mq|ZTf6%yURmXz zS$6OI3b+C*&y~dj8%WW;J`G=9LZVj>!PQ-3Gq@rxLa3zxE&@stMjFf~42D4XFajU{ zr7L}PaNmLY9kHA5OuKN2Mk{9IGG+tCxRHCqRmAZin`jWe!#6oqBUC(KuEQ<1oB%Ns zCZuN+Vj98EFq^G-iuu>Ty|UurA60DDfgHM>y=~{#uR2<+<<8b}d@nc$tU$?fUG7j2 zfbu(0_wCq%sl2wd?{6r*R1xc6qzocq6ERB)n|$KOQtnNA{wp7&P=|jZH}K+)<}X-S zeZzzzrxjA2@Q0Xtd{@HbyA^1b1%EZgNk~~ceGYW}G=I=LV`*49XzG1J=cW9&N?%CV##qo@HSO5drP;R3g1^nPkp*L7yaNxFp{9`5)p!o`IS za_OgB5P=7Db0R+ppO!*wdJ4d=X7Ju;r%t*9~Iex^Ep;AoielaVhVt3o;)BC!6>f(W_i2 z=CMqJ%f|+$-(5BLo%%$t@eHR{8xbVpq7nF3MngEW;0Et+le(Y86N5e(Ph5*Pa?S1+ zW5cXijF7Q9M0>cNgbSg$Ut)H^tsg&J=eF)~TwJ!?xt$ZNC;Ae;axS?BxRItBM}tmDYfx%KH;ChkmMGrKo($|JbG?buEZYpkMZ9!%v=~9m1m7gYzMDqeKyodgHUE59A%YugWXHNCUwxuO0R|mE z!rg!s4|k3$!a(hYWknww7;<5;w)BVOkGWFL$}OGs(#|u0n z<3KVOi5!$U?gN99{jc{pSC&!R*><2JeyhdvO=(P{(Ni&?n)4S$T=V0k2me#^O;IMq z@m__cY&TMkD*)3MEolr^D@8MaoNMeno)LeeycW_H-UlwT17tK$!A&cPpK!j>9MbGM zZ^5E0@2tW=5+zo42a3)T6kJk+?xVxfer;N57h$rFj*rblT(c`kJ)vzKJPTY1-w2!; zWNUKx?u~(8X(EcXb|&#tagdBVlD;Ar-N9o>VP%5|8?U_2PmTiPeYl^--&ccpddbV} zo%CwY%`qvuAjnx;CALzqFju2#GFFgDE~^=9XZQ0%ybb0 zloLS}JNi6CR5b8unuimHROgnxSO0_EjWpJ0>uAS$A#1YA_UJ}J%aw0r@j^M-wcDyY z*^Xi6B*sotHRq}Ceiv=r;Y}g$qHR#BdD@i!{c_bax8+;K0hwx~U#5{#E2+prR;l*l ztV9T`kd!6|1-OSnP2+TuMgQYHxt2-*kQnv*cvY3lBU~PI`5c``Jj%>>Kx$>NP{6-i zsd_&YKeA5>wBd>s|DX425LXUYMCk__eNA--XBJ@4b@0oB|Lt-i5o-0r6nCRm0mdK6 zM{_Zny`!B9bjcb0HKK?v8--u#?mhgU?4D_sAK#&h4g6~xqa^D}1A~Q$$LQS~3@%^2 zGbvti3MRZ0{RgHoZj~HzWj_*AdPZH~3R)Or>gYP)8tT>bq-CI`@lgd*CAk0g$Hg}F zV~_Xu{oYZ#okBI^j8cVOG#4cs0XyZf1+|BMG{O&-F@YB*uL-Yd85{R+`xnd`<{ylp z1RC9jb!(x-Fr{8(W+F?0+n6y&K1+u~+pgFSccA{q=J5 z=?pM1xtiO%dljir_!;d>RMmyDdTn{5j zgJVA&XQ!^O9XKN33Fk7*(9DSye~}@sGR<06S%a@04`*}G4h2n66Xpvm1p8$~RO^LQ z*NbrSn!57EUG~3n3le~b5}U?gp$a&F<=&R_)^Bi9JIUj4_OJgas7%k7p};6WZs}Nc 
zh=l7qx?LA!5s)xUdT_M)umAAB)iT9rnas1iCuT-+3$sB4Q-#b##HqMTipw~+0JfD#EbhkA|=Z=Nr}np8g41UdNqCx?_@fM(^p*^*1BBqP@ydp9|k)d z9~pXB&Bf`pZ&Moe>tL?1B8y#3Zs?k4sQ+j16U)Mvb4cHdXg+(F6mfBbi;*!*1~8H- zY2o50$#r{Y$KETh5abIN)Z9VTAZ`ClLfUQ*8$oB;k1DJmUuJ z#N(Hjif9;5!F!$Yi$FE{_!&mrn(32PM9!%k*@{JRgGm@TV&0t>L}2U4k3Zfa{z(Np z*;C=;MaaINElKD!Z9jSVUCCkS3`mTW>XRUe_T!rh%@OT}88&#H#`g=XUxfmpB5lq0 zy{!M%#s1bO(gGsnmDhF-Hkra-hkfb_;8>yQInQS#ifwmFp;HP{r)C;Ig6mO$2?QF5 z-umT7w7ey~$@@^M=bP~iq%4;6_0#1Qb_v%%FP!g|3+E4Zj`v7wIM}5&#Q#{Z_!=6x zY&4owWx{+DCcwf#Hv_h)_(*NL@`qRnH5z?_P$oYq5$_UEPO_}BnWo)afhOz>m)h2~ z!j<`7i+ywJ4=4hAj4IUQF6wdjFwI|HEjjL88g&M1dlgP0M=%0YmA|})eVG90Xf&ph zt{Y%as3MGX?$OI@ry59UO^&W_4kq(AN)9+o83oF~zOe3SLddNjRc!5bDKjJ8C=(^3 zIl}gor6*W_LNVx`Ws(IY-~1GNH8gc8-yRmYZ5mO{Z z1=-Mz#puSfMa6JgWh;GK>QeqKgg87OLkc`Wfl`JzLpD{6kv5E2vz(U~HPwBQaa#&p zmBU)aaFRPdjYz1SfrrpIH&OV)8I}s^&CpeyjoC?5+6(}f89tv~?_FM8qV11Kmx=jV z?Ueazek3KSS@)0;gnlKVw58@*$L@3btxDVjX+Ua4l-&H3{Ezm+*E8Q zVFt3Pr_8q}!Dv!dpWy9Tpc{}?RaG^mKpVT#DUhOSryv$q8({dCSuhXI+g`?hunTLs zI+qvNg`*X_a7BT*g&L?%=J$)p)sOxG^Wcl!_}^0u)NBA){OGB+4$#l*!Lf=9f21_% zqr+X7T(Np1Fvi!S;5%tY0$q!Q-8~`V2u=>Ej7)2lgCafAF`7)qSiIji%IBY<%S0~& z0m&~_5r|CXbU{HV_W9Aj$^q$HMp^tMZ@pq^$I2TaP1fZ^|3z-8G!?mbdHdBl65O+A zuEY!_=q@9!QjCs`Eb3fyvCyyKs2{(nco>5*L|el_SnM=46?A$~j=b|VU>j8@P<;hY zunyd~*Vv)+GrCfY=HFDe`9NG@05bg@7^lmLmq$@9{>HO3(U3{f9?PM5_gxZMYz5sI8dcnEMTmz*>Q zr}MRM&K9qhRFWkeX5M&IQQ)4_D)y!omI@e&r8?f*H%qnd^Ra>p1edKyW`jj=1tfG; z1T7+78-&0v!h}u=Fa~b?fB@xfsU2suqj!Iy{&yu{Exgas`$ zY*qYqr2s%!oJ;ROg8+QEYLZ(cE*y!0U0B95{VO3YOx%(%M>zpuO(;rWtZJ=d`u6sZ zk1IqV=p{-iNDHU|K|)9!6+=u_zO3`Xy|RO+GwSLa_pU$HF-UI7*C1K0HzC9adVkF@ z@o(}|?B=F5MdaUDQ4#l7&pt8Gj=J{F;EbyW#n>GV4%B&Wj zgOg4I5%zvW5--t{4fDtpn5wwa+eeCuY;PS^@EX?_=EWb}s|nf^oeBU}D)6akQL?7y z({~b~lz3AkReQC~4gD4no@a`p*%;l1@aYMl50O<@hWMn|9z{sbLYg)DQ5E9-La4gxxyu4v za$q1Y`^r@%CPfwd76Cg-opOYi#X>+BAB$tvd~ zJe1v3*&3KRaIvf$kxd}#Laugh>TK*O?1Va=R&3ajtjf{hVMXz4@49BjF6lH z=B8N6zP=L!OVFMruVko8ZR%%X&qm1?C?tM+V1rVB-rl)-v&#D+=ij`0E?BuQ+@AJ6 z54)C}kY%3ws^0RS4=j1sckh>z%d7VF-|QGxF)&c!*RvW>f&X9qc6Nm(2{*8#R_2ysPN(~n4MIX8F*3-Aoz zamnhIh{@>HJT9U|6@{g_6qiP#R8)}yKgqfEuWgx4a`Tg%INh2b_&mDXr7dNMTvsy`H;EtUXlc(Be2jz721&Y?H19%9qEwa4KT0@U z0-lg5uDYoz*zhH5PNI#x;(vgXLet7m(YFnJM#{@6g@C3K7i}D&f(|H&B_BlsT^c*h z(<%Tu_O*Y1mBKfPsk(;st5Pb5viVfPSenxsvBnMRoVQVA^m!PQ83~SmolmUi5aYt2FA<=fzuZYbx4cZVI*8MNk*&23KDmZ9=f~Ct7Su!17<~;q2QUBRA+D0g#7}fW_u&y9jV&}fIN%0V0!mpDLiwZF);h0HT z0_cb#(n~{zsTeFZl1>d7DTDyoXt4!xcY7NbxOpcn7UxESl7rquBKLN+wu*jN6$k|09Sd6YgO6sA31# zE`+1_>&ZAfxeI7t^g5j`=%A)*YDYry>14l~vUp3yAUn)qLS_2O&GDJ?D%vJ~8h zspEcPzc@4WyGg&ivaApQ=nbu7O6s4ne-Tbisq>fuxw<5dZ+BsAf6Cu>{E|-EbH_t62xH zTyDonEXcBL>Vxn(e1nfsds-btv7^CqUN>+JuIXpILK-lo8JC4imwCTI*0#sL?GXhFNdQ8b5*OS)<3I8Y9&R0gcEEr#Hi7w(*6V&yTBC zT<29fNxxVS>l5usL|`(9n~-f1$9t;~jKv{bEt9v9iI4xT?gJR#Dfhw|wfcw%+)xR? 
zDoF7^^z!j^t{5W!NPEveB;JZ6fQ-lqK+5rl(T3b8H;;oKlD@OX*Xl-y?w|zC&cej8GL;7W$o#ls zis?(pPa`x>p%~&1h<+q8$BuQR*a6|Biq)y9;WiwR_soJ`x7k(kbh(3$>#5vSfAJ5H zpX!{0o2o83?zBEDww1PO#qblo8mn1+slN#MUt9afc_^WzNO)|kPU<%lfLc?JEgK7P zR%%lyk(KCcry+}Cl_?d_P0B!flGGhgb|r^?so{`hP^~6M+)~Bf?TcPJI80k%tafVX zbuSfPB+X&PsFCVJHZa|#2S5`wA>fL>FQ)G_`M^tkkHd*_+s1w9S6Azh7kjOBsL&or z&G?t$(gZ%wxxvp7$^k-mq&%Ayb{;GuGI5M{y$cnnW=&7>h9?~;R{-IJzgy-6 zB3jdttH&c6N*H{~n>rbExg`xwkKC$Bvc~K>fxC=YEpGFMnrPi*X;Lo5`$1$l&@abC zOQn$-%Ctbh{4V_zmypU#s8Ek_Fv{NUrNko+0@*`CXW9fEEcZmcf%Is^#!o z-U#oSP7BlfWFsEv+;0o13MVd2(c|>*44t62@2=hy(s6BbcvF1u8xY<+Xz`|;1f}Xc zZP^V|OgSf@Do@veF!ghnSwaQo;!h}0pH< z;^-)?AVV!Q;94NS+UFMpUZO@zbP}TY-?+k)K?lx-k9AkoBHb0numv0RT#pHSk&KXx z7p{oc9n%7jI-z9nWtw7vN4{wP9VU;)iTk$ytLbGDl=%d;2|us&L6S9fbA?}kt3vcotJER}Xtb2&T(f191iC*^5+9N#aKFckH-i zM+F)FS97f>b?edwlzF6BZKR(^bG4Pg-2xXNXDgxclw#U|$bJPiE|JJGL|&2bD)CHr zWp#i5RE{eSkPV00Ikb!VS4+#v69;>@zgd;%5k;`_8PYT1ozWkpiRv!O^Rt@kPY^cx zV*y(>%aBe!B0f@|*~5ffzx z_}`lFz%vHqnT~1V=W3n|0v$ItUVq)@a5YuIwcu*nya;YwI7*YeCC1vmEPr?W?0;%{ zQHtexOYBma;jESk+uOT`h5;nnWlVV2%sOWmyJZ&-zEtDjI%gN!X8YP?J6SD4h~L_Z zl9V5vy~nAM?@-L3|d0Irr;Tq7_3YH&idpWA!*RHy? zssI`sV-0IBH-oQn#E2i>jAS-e(X~=7#aS|(DCSaKd6qqeFq?(&Q<-eO#yziEn&q6L zEZoWGB*q!(ctUbCJf}z$4xKBW3vcRSYb=|qx>YbH4W~hBuFGY<%7?5g@Bf=cIt^7} lY!#dTzqia%roGXSNBhUSKfI~=sLr2!k?EB`{`=wH{{|Z*jZ^>t diff --git a/Corpus/Learning Efficient Convolutional Networks through Network Slimming - Zhuang Liu.txt b/Corpus/Learning Efficient Convolutional Networks through Network Slimming - Zhuang Liu.txt deleted file mode 100644 index a98b373..0000000 --- a/Corpus/Learning Efficient Convolutional Networks through Network Slimming - Zhuang Liu.txt +++ /dev/null @@ -1,399 +0,0 @@ - Learning Efficient Convolutional Networks through Network Slimming - - - Zhuang Liu 1∗ Jianguo Li 2 Zhiqiang Shen 3 Gao Huang 4 Shoumeng Yan 2 Changshui Zhang 1 - 1 CSAI, TNList, Tsinghua University 2 Intel Labs China 3 Fudan University 4 Cornell University - {liuzhuangthu, zhiqiangshen0214}@gmail.com,{jianguo.li, shoumeng.yan}@intel.com, - gh349@cornell.edu, zcs@mail.tsinghua.edu.cn - - - - Abstract However, larger CNNs, although with stronger represen- - tation power, are more resource-hungry. For instance, a - The deployment of deep convolutional neural networks 152-layer ResNet [14] has more than 60 million parame- - (CNNs) in many real world applications is largely hindered ters and requires more than 20 Giga float-point-operations - by their high computational cost. In this paper, we propose (FLOPs) when inferencing an image with resolution 224× - a novel learning scheme for CNNs to simultaneously 1) re- 224. This is unlikely to be affordable on resource con- - duce the model size; 2) decrease the run-time memory foot- strained platforms such as mobile devices, wearables or In- - print; and 3) lower the number of computing operations, ternet of Things (IoT) devices. - without compromising accuracy. This is achieved by en- The deployment of CNNs in real world applications areforcing channel-level sparsity in the network in a simple but mostly constrained by1) Model size: CNNs’ strong repre-effective way. Different from many existing approaches, the sentation power comes from their millions of trainable pa-proposed method directly applies to modern CNN architec- rameters. Those parameters, along with network structuretures, introduces minimum overhead to the training process, information, need to be stored on disk and loaded into mem-and requires no special software/hardware accelerators for ory during inference time. 
As an example, storing a typi-the resulting models. We call our approachnetwork slim- cal CNN trained on ImageNet consumes more than 300MBming, which takes wide and large networks as input mod- space, which is a big resource burden to embedded devices.els, but during training insignificant channels are automat- 2) Run-time memory: During inference time, the interme-ically identified and pruned afterwards, yielding thin and diate activations/responses of CNNs could even take morecompact models with comparable accuracy. We empirically memory space than storing the model parameters, even withdemonstrate the effectiveness of our approach with several batch size 1. This is not a problem for high-end GPUs, butstate-of-the-art CNN models, including VGGNet, ResNet unaffordable for many applications with low computationaland DenseNet, on various image classification datasets. For power.3) Number of computing operations:The convolu-VGGNet, a multi-pass version of network slimming gives a tion operations are computationally intensive on high reso-20×reduction in model size and a 5×reduction in comput- lution images. A large CNN may take several minutes toing operations. process one single image on a mobile device, making it un- - realistic to be adopted for real applications. - 1. Introduction Many works have been proposed to compress large - CNNs or directly learn more efficient CNN models for fast - In recent years, convolutional neural networks (CNNs) inference. These include low-rank approximation [7], net- - have become the dominant approach for a variety of com- work quantization [3, 12] and binarization [28, 6], weight - puter vision tasks, e.g., image classification [22], object pruning [12], dynamic inference [16], etc. However, most - detection [8], semantic segmentation [26]. Large-scale of these methods can only address one or two challenges - datasets, high-end modern GPUs and new network architec- mentioned above. Moreover, some of the techniques require - tures allow the development of unprecedented large CNN specially designed software/hardware accelerators for exe- - models. For instance, from AlexNet [22], VGGNet [31] and cution speedup [28, 6, 12]. - GoogleNet [34] to ResNets [14], the ImageNet Classifica- Another direction to reduce the resource consumption of - tion Challenge winner models have evolved from 8 layers large CNNs is to sparsify the network. Sparsity can be im- - to more than 100 layers. posed on different level of structures [2, 37, 35, 29, 25], - ∗ This work was done when Zhuang Liu and Zhiqiang Shen were interns which yields considerable model-size compression and in- - at Intel Labs China. Jianguo Li is the corresponding author. ference speedup. However, these approaches generally re- - - - - 2736 channel scaling channel scaling i-thconv-layer factors (i+1)=j-th i-thconv-layer factors (i+1)=j-th - conv-layer conv-layer Ci1 1.170 C 1.170 - C C i1 - i2 0.001 j1 Cj1 - Ci3 0.290 pruning Ci3 0.290 - C 0.003 Ci4 j2 Cj2 - … … … - … … - … - - C Cin 0.820 in 0.820 - initial network compact network - Figure 1: We associate a scaling factor (reused from a batch normalization layer) with each channel in convolutional layers. Sparsity - regularization is imposed on these scaling factors during training to automatically identify unimportant channels. The channels with small - scaling factor values (in orange color) will be pruned (left side). 
After pruning, we obtain compact models (right side), which are then - fine-tuned to achieve comparable (or even higher) accuracy as normally trained full network. - - quire special software/hardware accelerators to harvest the Low-rank Decompositionapproximates weight matrix in - gain in memory or time savings, though it is easier than neural networks with low-rank matrix using techniques like - non-structured sparse weight matrix as in [12]. Singular Value Decomposition (SVD) [7]. This method - In this paper, we proposenetwork slimming, a simple works especially well on fully-connected layers, yield- - yet effective network training scheme, which addresses all ing∼3x model-size compression however without notable - the aforementioned challenges when deploying large CNNs speed acceleration, since computing operations in CNN - under limited resources. Our approach imposes L1 regular- mainly come from convolutional layers. - ization on the scaling factors in batch normalization (BN) Weight Quantization. HashNet [3] proposes to quantizelayers, thus it is easy to implement without introducing any the network weights. Before training, network weights arechange to existing CNN architectures. Pushing the val- hashed to different groups and within each group weightues of BN scaling factors towards zero with L1 regulariza- the value is shared. In this way only the shared weights andtion enables us to identify insignificant channels (or neu- hash indices need to be stored, thus a large amount of stor-rons), as each scaling factor corresponds to a specific con- age space could be saved. [12] uses a improved quantizationvolutional channel (or a neuron in a fully-connected layer). technique in a deep compression pipeline and achieves 35xThis facilitates the channel-level pruning at the followed to 49x compression rates on AlexNet and VGGNet. How-step. The additional regularization term rarely hurt the per- ever, these techniques can neither save run-time memoryformance. In fact, in some cases it leads to higher gen- nor inference time, since during inference shared weightseralization accuracy. Pruning unimportant channels may need to be restored to their original positions.sometimes temporarily degrade the performance, but this [28, 6] quantize real-valued weights into binary/ternaryeffect can be compensated by the followed fine-tuning of weights (weight values restricted to{−1,1}or{−1,0,1}).the pruned network. After pruning, the resulting narrower This yields a large amount of model-size saving, and signifi-network is much more compact in terms of model size, run- cant speedup could also be obtained given bitwise operationtime memory, and computing operations compared to the libraries. However, this aggressive low-bit approximationinitial wide network. The above process can be repeated method usually comes with a moderate accuracy loss. for several times, yielding a multi-pass network slimming - scheme which leads to even more compact network. Weight Pruning / Sparsifying.[12] proposes to prune the - Experiments on several benchmark datasets and different unimportant connections with small weights in trained neu- - network architectures show that we can obtain CNN models ral networks. The resulting network’s weights are mostly - with up to 20x mode-size compression and 5x reduction in zeros thus the storage space can be reduced by storing the - computing operations of the original ones, while achieving model in a sparse format. However, these methods can only - the same or even higher accuracy. 
Moreover, our method achieve speedup with dedicated sparse matrix operation li- - achieves model compression and inference speedup with braries and/or hardware. The run-time memory saving is - conventional hardware and deep learning software pack- also very limited since most memory space is consumed by - ages, since the resulting narrower model is free of any the activation maps (still dense) instead of the weights. - sparse storing format or computing operations. In [12], there is no guidance for sparsity during training. - [32] overcomes this limitation by explicitly imposing sparse - 2. Related Work constraint over each weight with additional gate variables, - and achieve high compression rates by pruning connections - In this section, we discuss related work from five aspects. with zero gate values. This method achieves better com- - - - - 2737 pression rate than [12], but suffers from the same drawback. Advantages of Channel-level Sparsity. As discussed in - prior works [35, 23, 11], sparsity can be realized at differ-Structured Pruning / Sparsifying. Recently, [23] pro- ent levels, e.g., weight-level, kernel-level, channel-level orposes to prune channels with small incoming weights in layer-level. Fine-grained level (e.g., weight-level) sparsitytrained CNNs, and then fine-tune the network to regain gives the highest flexibility and generality leads to higheraccuracy. [2] introduces sparsity by random deactivat- compression rate, but it usually requires special software oring input-output channel-wise connections in convolutional hardware accelerators to do fast inference on the sparsifiedlayers before training, which also yields smaller networks model [11]. On the contrary, the coarsest layer-level spar-with moderate accuracy loss. Compared with these works, sity does not require special packages to harvest the infer-we explicitly impose channel-wise sparsity in the optimiza- ence speedup, while it is less flexible as some whole layerstion objective during training, leading to smoother channel need to be pruned. In fact, removing layers is only effec-pruning process and little accuracy loss. tive when the depth is sufficiently large, e.g., more than 50[37] imposes neuron-level sparsity during training thus layers [35, 18]. In comparison, channel-level sparsity pro-some neurons could be pruned to obtain compact networks. vides a nice tradeoff between flexibility and ease of imple-[35] proposes a Structured Sparsity Learning (SSL) method mentation. It can be applied to any typical CNNs or fully-to sparsify different level of structures (e.g. filters, channels connected networks (treat each neuron as a channel), andor layers) in CNNs. Both methods utilize group sparsity the resulting network is essentially a “thinned” version ofregualarization during training to obtain structured spar- the unpruned network, which can be efficiently inferenced sity. Instead of resorting to group sparsity on convolu- on conventional CNN platforms.tional weights, our approach imposes simple L1 sparsity on - channel-wise scaling factors, thus the optimization objec- Challenges. Achieving channel-level sparsity requires - tive is much simpler. pruning all the incoming and outgoing connections asso- - Since these methods prune or sparsify part of the net- ciated with a channel. 
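To make the stated challenge concrete: removing one channel touches two weight tensors at once, namely the filter that produces that channel in the current convolutional layer and the matching input slice of the convolution that consumes it, together with that channel's BN parameters. The short NumPy sketch below illustrates this bookkeeping; the shapes, the 0.1 threshold and the variable names are illustrative assumptions, not taken from the paper or its released Torch code.

import numpy as np

# Hypothetical weights of two consecutive conv layers, laid out as
# (out_channels, in_channels, kH, kW).
w_conv_i = np.random.randn(64, 32, 3, 3)    # layer i produces 64 channels
w_conv_j = np.random.randn(128, 64, 3, 3)   # layer j consumes those 64 channels
bn_gamma = np.abs(np.random.randn(64))      # scaling factors of layer i's BN

keep = bn_gamma > 0.1                       # boolean mask of surviving channels

# Outgoing connections: drop the filters of layer i that generate pruned channels.
w_conv_i_pruned = w_conv_i[keep]            # shape (n_keep, 32, 3, 3)
# Incoming connections: drop the corresponding input slices of layer j.
w_conv_j_pruned = w_conv_j[:, keep]         # shape (128, n_keep, 3, 3)
# The gamma/beta and running statistics of layer i's BN layer are sliced with
# the same mask, so the slimmed network needs no sparse formats at inference.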
This renders the method of directly - work structures (e.g., neurons, channels) instead of individ- pruning weights on a pre-trained model ineffective, as it is - ual weights, they usually require less specialized libraries unlikely that all the weights at the input or output end of - (e.g. for sparse computing operation) to achieve inference a channel happen to have near zero values. As reported in - speedup and run-time memory saving. Our network slim- [23], pruning channels on pre-trained ResNets can only lead - ming also falls into this category, with absolutely no special to a reduction of∼10% in the number of parameters without - libraries needed to obtain the benefits. suffering from accuracy loss. [35] addresses this problem - by enforcing sparsity regularization into the training objec-Neural Architecture Learning. While state-of-the-art tive. Specifically, they adoptgroup LASSOto push all theCNNs are typically designed by experts [22, 31, 14], there filter weights corresponds to the same channel towards zeroare also some explorations on automatically learning net- simultaneously during training. However, this approach re-work architectures. [20] introduces sub-modular/super- quires computing the gradients of the additional regulariza-modular optimization for network architecture search with tion term with respect to all the filter weights, which is non-a given resource budget. Some recent works [38, 1] propose trivial. We introduce a simple idea to address the aboveto learn neural architecture automatically with reinforce- challenges, and the details are presented below.ment learning. The searching space of these methods are - extremely large, thus one needs to train hundreds of mod- Scaling Factors and Sparsity-induced Penalty.Our idea - els to distinguish good from bad ones. Network slimming is introducing a scaling factorγfor each channel, which is - can also be treated as an approach for architecture learning, multiplied to the output of that channel. Then we jointly - despite the choices are limited to the width of each layer. train the network weights and these scaling factors, with - However, in contrast to the aforementioned methods, net- sparsity regularization imposed on the latter. Finally we - work slimming learns network architecture through only a prune those channels with small factors, and fine-tune the - single training process, which is in line with our goal of pruned network. Specifically, the training objective of our - efficiency. approach is given by - - 3. Network slimming L= l(f(x,W),y) +λ g(γ) (1) - (x,y) γ∈Γ We aim to provide a simple scheme to achieve channel- - level sparsity in deep CNNs. In this section, we first dis- where(x,y)denote the train input and target,Wdenotes - cuss the advantages and challenges of channel-level spar- the trainable weights, the first sum-term corresponds to the - sity, and introduce how we leverage the scaling layers in normal training loss of a CNN,g(·)is a sparsity-induced - batch normalization to effectively identify and prune unim- penalty on the scaling factors, andλbalances the two terms. - portant channels in the network. In our experiment, we chooseg(s) =|s|, which is known as - - - - 2738 convolution layers. 2), if we insert a scaling layer before - a BN layer, the scaling effect of the scaling layer will be - Train with Prune channels Initial Fine-tune the Compact completely canceled by the normalization process in BN. 
channel sparsity with small network pruned network networkregularization scaling factors 3), if we insert scaling layer after BN layer, there are two - consecutive scaling factors for each channel. Figure 2: Flow-chart of network slimming procedure. The dotted- - line is for the multi-pass/iterative scheme. Channel Pruning and Fine-tuning.After training under - channel-level sparsity-induced regularization, we obtain a - L1-norm and widely used to achieve sparsity. Subgradient model in which many scaling factors are near zero (see Fig- - descent is adopted as the optimization method for the non- ure 1). Then we can prune channels with near-zero scaling - smooth L1 penalty term. An alternative option is to replace factors, by removing all their incoming and outgoing con- - the L1 penalty with the smooth-L1 penalty [30] to avoid nections and corresponding weights. We prune channels - using sub-gradient at non-smooth point. with a global threshold across all layers, which is defined - As pruning a channel essentially corresponds to remov- as a certain percentile of all the scaling factor values. For - ing all the incoming and outgoing connections of that chan- instance, we prune 70% channels with lower scaling factors - nel, we can directly obtain a narrow network (see Figure 1) by choosing the percentile threshold as 70%. By doing so, - without resorting to any special sparse computation pack- we obtain a more compact network with less parameters and - ages. The scaling factors act as the agents for channel se- run-time memory, as well as less computing operations. - lection. As they are jointly optimized with the network Pruning may temporarily lead to some accuracy loss, - weights, the network can automatically identity insignifi- when the pruning ratio is high. But this can be largely com- - cant channels, which can be safely removed without greatly pensated by the followed fine-tuning process on the pruned - affecting the generalization performance. network. In our experiments, the fine-tuned narrow network - Leveraging the Scaling Factors in BN Layers.Batch nor- can even achieve higher accuracy than the original unpruned - malization [19] has been adopted by most modern CNNs network in many cases. - as a standard approach to achieve fast convergence and bet- Multi-pass Scheme. We can also extend the proposedter generalization performance. The way BN normalizes method from single-pass learning scheme (training withthe activations motivates us to design a simple and effi- sparsity regularization, pruning, and fine-tuning) to a multi-cient method to incorporates the channel-wise scaling fac- pass scheme. Specifically, a network slimming proceduretors. Particularly, BN layer normalizes the internal activa- results in a narrow network, on which we could again applytions using mini-batch statistics. Letzin andzout be the the whole training procedure to learn an even more compactinput and output of a BN layer,Bdenotes the current mini- model. This is illustrated by the dotted-line in Figure 2. Ex-batch, BN layer performs the following transformation: perimental results show that this multi-pass scheme can lead - to even better results in terms of compression rate.zzˆ= in −µ B ; zσ2 +ǫ out =γzˆ+β (2) Handling Cross Layer Connections and Pre-activation B Structure. 
The network slimming process introduced - whereµB andσB are the mean and standard deviation val- above can be directly applied to most plain CNN architec- - ues of input activations overB,γandβare trainable affine tures such as AlexNet [22] and VGGNet [31]. While some - transformation parameters (scale and shift) which provides adaptations are required when it is applied to modern net- - the possibility of linearly transforming normalized activa- works withcross layer connectionsand thepre-activation - tions back to any scales. design such as ResNet [15] and DenseNet [17]. For these - It is common practice to insert a BN layer after a convo- networks, the output of a layer may be treated as the input - lutional layer, with channel-wise scaling/shifting parame- of multiple subsequent layers, in which a BN layer is placed - ters. Therefore, we can directly leverage theγparameters in before the convolutional layer. In this case, the sparsity is - BN layers as the scaling factors we need for network slim- achieved at the incoming end of a layer, i.e., the layer selec- - ming. It has the great advantage of introducing no overhead tively uses a subset of channels it received. To harvest the - to the network. In fact, this is perhaps also the most effec- parameter and computation savings at test time, we need - tive way we can learn meaningful scaling factors for chan- to place achannel selectionlayer to mask out insignificant - nel pruning.1), if we add scaling layers to a CNN without channels we have identified. - BN layer, the value of the scaling factors are not meaning- - ful for evaluating the importance of a channel, because both 4. Experiments convolution layers and scaling layers are linear transforma- - tions. One can obtain the same results by decreasing the We empirically demonstrate the effectiveness of network - scaling factor values while amplifying the weights in the slimming on several benchmark datasets. 
We implement - - - - 2739 (a) Test Errors on CIFAR-10 - Model Test error (%) Parameters Pruned FLOPs Pruned - VGGNet (Baseline) 6.34 20.04M - 7.97×10 8 - - VGGNet (70% Pruned) 6.20 2.30M 88.5% 3.91×10 8 51.0% - DenseNet-40 (Baseline) 6.11 1.02M - 5.33×10 8 - - DenseNet-40 (40% Pruned) 5.19 0.66M 35.7% 3.81×10 8 28.4% - DenseNet-40 (70% Pruned) 5.65 0.35M 65.2% 2.40×10 8 55.0% - ResNet-164 (Baseline) 5.42 1.70M - 4.99×10 8 - - ResNet-164 (40% Pruned) 5.08 1.44M 14.9% 3.81×10 8 23.7% - ResNet-164 (60% Pruned) 5.27 1.10M 35.2% 2.75×10 8 44.9% - - (b) Test Errors on CIFAR-100 - Model Test error (%) Parameters Pruned FLOPs Pruned - VGGNet (Baseline) 26.74 20.08M - 7.97×10 8 - - VGGNet (50% Pruned) 26.52 5.00M 75.1% 5.01×10 8 37.1% - DenseNet-40 (Baseline) 25.36 1.06M - 5.33×10 8 - - DenseNet-40 (40% Pruned) 25.28 0.66M 37.5% 3.71×10 8 30.3% - DenseNet-40 (60% Pruned) 25.72 0.46M 54.6% 2.81×10 8 47.1% - ResNet-164 (Baseline) 23.37 1.73M - 5.00×10 8 - - ResNet-164 (40% Pruned) 22.87 1.46M 15.5% 3.33×10 8 33.3% - ResNet-164 (60% Pruned) 23.91 1.21M 29.7% 2.47×10 8 50.6% - (c) Test Errors on SVHN - Model Test Error (%) Parameters Pruned FLOPs Pruned - VGGNet (Baseline) 2.17 20.04M - 7.97×10 8 - - VGGNet (60% Pruned) 2.06 3.04M 84.8% 3.98×10 8 50.1% - DenseNet-40 (Baseline) 1.89 1.02M - 5.33×10 8 - - DenseNet-40 (40% Pruned) 1.79 0.65M 36.3% 3.69×10 8 30.8% - DenseNet-40 (60% Pruned) 1.81 0.44M 56.6% 2.67×10 8 49.8% - ResNet-164 (Baseline) 1.78 1.70M - 4.99×10 8 - - ResNet-164 (40% Pruned) 1.85 1.46M 14.5% 3.44×10 8 31.1% - ResNet-164 (60% Pruned) 1.81 1.12M 34.3% 2.25×10 8 54.9% - Table 1: Results on CIFAR and SVHN datasets. “Baseline” denotes normal training without sparsity regularization. In column-1, “60% - pruned” denotes the fine-tuned model with 60% channels pruned from the model trained with sparsity, etc. The pruned ratio of parameters - and FLOPs are also shown in column-4&6. Pruning a moderate amount (40%) of channels can mostly lower the test errors. The accuracy - could typically be maintained with≥60% channels pruned. - - our method based on the publicly available Torch [5] im- images, from which we split a validation set of 6,000 im- - plementation for ResNets by [10]. The code is available at ages for model selection during training. The test set con- - https://github.com/liuzhuang13/slimming. tains 26,032 images. During training, we select the model - with the lowest validation error as the model to be pruned - 4.1. Datasets (or the baseline model). We also report the test errors of the - models with lowest validation errors during fine-tuning.CIFAR.The two CIFAR datasets [21] consist of natural im- - ages with resolution 32×32. CIFAR-10 is drawn from 10 - and CIFAR-100 from 100 classes. The train and test sets ImageNet. The ImageNet dataset contains 1.2 millioncontain 50,000 and 10,000 images respectively. On CIFAR- training images and 50,000 validation images of 100010, a validation set of 5,000 images is split from the training classes. We adopt the data augmentation scheme as in [10].set for the search ofλ(in Equation 1) on each model. We We report the single-center-crop validation error of the finalreport the final test errors after training or fine-tuning on model.all training images. A standard data augmentation scheme - (shifting/mirroring) [14, 18, 24] is adopted. The input data - is normalized using channel means and standard deviations. MNIST.MNIST is a handwritten digit dataset containingWe also compare our method with [23] on CIFAR datasets. 
4.2. Network Models

On the CIFAR and SVHN datasets, we evaluate our method on three popular network architectures: VGGNet [31], ResNet [14] and DenseNet [17]. The VGGNet is originally designed for ImageNet classification. For our experiment a variation of the original VGGNet for the CIFAR dataset is taken from [36]. For ResNet, a 164-layer pre-activation ResNet with bottleneck structure (ResNet-164) [15] is used. For DenseNet, we use a 40-layer DenseNet with growth rate 12 (DenseNet-40).

On the ImageNet dataset, we adopt the 11-layer (8-conv + 3 FC) "VGG-A" network [31] model with batch normalization from [4]. We remove the dropout layers since we use relatively heavy data augmentation. To prune the neurons in fully-connected layers, we treat them as convolutional channels with 1×1 spatial size.

On the MNIST dataset, we evaluate our method on the same 3-layer fully-connected network as in [35].

Figure 3: Comparison of pruned models with lower test errors on CIFAR-10 than the original models. The blue and green bars are parameter and FLOP ratios between pruned and original models.

4.3. Training, Pruning and Fine-tuning

Normal Training. We train all the networks normally from scratch as baselines. All the networks are trained using SGD. On the CIFAR and SVHN datasets we train using mini-batch size 64 for 160 and 20 epochs, respectively. The initial learning rate is set to 0.1, and is divided by 10 at 50% and 75% of the total number of training epochs. On the ImageNet and MNIST datasets, we train our models for 60 and 30 epochs respectively, with a batch size of 256, and an initial learning rate of 0.1 which is divided by 10 after 1/3 and 2/3 of the training epochs. We use a weight decay of 10^-4 and a Nesterov momentum [33] of 0.9 without dampening. The weight initialization introduced by [13] is adopted. Our optimization settings closely follow the original implementation at [10]. In all our experiments, we initialize all channel scaling factors to be 0.5, since this gives higher accuracy for the baseline models compared with the default setting (all initialized to be 1) from [10].
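As a concrete illustration of the schedule above (SGD with initial learning rate 0.1, divided by 10 at 50% and 75% of training, weight decay 1e-4 and Nesterov momentum 0.9), a PyTorch-style setup might look as follows; the tiny stand-in model and the CIFAR epoch count are placeholders, not part of the original Torch code.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())  # stand-in network
total_epochs = 160  # CIFAR setting described above

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            weight_decay=1e-4, nesterov=True)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[total_epochs // 2, total_epochs * 3 // 4],  # 50% and 75% of training
    gamma=0.1)
# scheduler.step() would be called once per epoch after the optimizer updates.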
Training with Sparsity. For the CIFAR and SVHN datasets, when training with channel sparse regularization, the hyperparameter λ, which controls the tradeoff between empirical loss and sparsity, is determined by a grid search over 10^-3, 10^-4, 10^-5 on the CIFAR-10 validation set. For VGGNet we choose λ = 10^-4 and for ResNet and DenseNet λ = 10^-5. For VGG-A on ImageNet, we set λ = 10^-5. All other settings are kept the same as in normal training.

Pruning. When we prune the channels of models trained with sparsity, a pruning threshold on the scaling factors needs to be determined. Unlike in [23], where different layers are pruned by different ratios, we use a global pruning threshold for simplicity. The pruning threshold is determined by a percentile among all scaling factors, e.g., 40% or 60% of the channels are pruned. The pruning process is implemented by building a new narrower model and copying the corresponding weights from the model trained with sparsity.

Fine-tuning. After the pruning we obtain a narrower and more compact model, which is then fine-tuned. On the CIFAR, SVHN and MNIST datasets, the fine-tuning uses the same optimization setting as in training. For the ImageNet dataset, due to time constraints, we fine-tune the pruned VGG-A with a learning rate of 10^-3 for only 5 epochs.
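The global threshold described in the Pruning paragraph can be computed by pooling every BN scaling factor in the network and taking a percentile of their absolute values. The sketch below is a hedged PyTorch-style illustration only; building the narrower network and copying weights is architecture-specific and omitted, and the names channel_masks and prune_ratio are introduced here for illustration.

import torch
import torch.nn as nn

def channel_masks(model: nn.Module, prune_ratio: float):
    # Pool the scaling factors of all BN layers and take a global percentile as threshold.
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, prune_ratio)  # e.g. prune_ratio = 0.4 or 0.6
    # Keep the channels whose scaling factor exceeds the threshold.
    masks = {name: m.weight.detach().abs() > threshold
             for name, m in model.named_modules() if isinstance(m, nn.BatchNorm2d)}
    return threshold, masks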
4.4. Results

CIFAR and SVHN. The results on CIFAR and SVHN are shown in Table 1. We mark all lowest test errors of a model in boldface.

Parameter and FLOP reductions. The purpose of network slimming is to reduce the amount of computing resources needed. The last row of each model has ≥60% channels pruned while still maintaining similar accuracy to the baseline. The parameter saving can be up to 10×. The FLOP reductions are typically around 50%. To highlight network slimming's efficiency, we plot the resource savings in Figure 3. It can be observed that VGGNet has a large amount of redundant parameters that can be pruned. On ResNet-164 the parameter and FLOP savings are relatively insignificant; we conjecture this is because its "bottleneck" structure has already functioned as a channel selection mechanism. Also, on CIFAR-100 the reduction rate is typically slightly lower than on CIFAR-10 and SVHN, which is possibly due to the fact that CIFAR-100 contains more classes.

Regularization Effect. From Table 1, we can observe that, on ResNet and DenseNet, typically when 40% of the channels are pruned, the fine-tuned network can achieve a lower test error than the original model. For example, DenseNet-40 with 40% channels pruned achieves a test error of 5.19% on CIFAR-10, which is almost 1% lower than the original model. We hypothesize this is due to the regularization effect of L1 sparsity on channels, which naturally provides feature selection in intermediate layers of a network. We will analyze this effect in the next section.

Table 2: Results on ImageNet (VGG-A).
Metric               | Baseline   | 50% Pruned
Params               | 132.9M     | 23.2M
Params pruned        | -          | 82.5%
FLOPs                | 4.57×10^10 | 3.18×10^10
FLOPs pruned         | -          | 30.4%
Validation error (%) | 36.69      | 36.66

Table 3: Results on MNIST.
Model         | Test error (%) | Params pruned | #Neurons
Baseline      | 1.43           | -             | 784-500-300-10
Pruned [35]   | 1.53           | 83.5%         | 434-174-78-10
Pruned (ours) | 1.49           | 84.4%         | 784-100-60-10

Table 4: Results for the multi-pass scheme on the CIFAR-10 and CIFAR-100 datasets, using VGGNet. The baseline model has test errors of 6.34% and 26.74%. "Trained" and "Fine-tuned" columns denote the test errors (%) of the model trained with sparsity, and the fine-tuned model after channel pruning, respectively. The parameter and FLOP pruned ratios correspond to the fine-tuned model in that row and the trained model in the next row.

(a) Multi-pass scheme on CIFAR-10
Iter | Trained | Fine-tuned | Params pruned | FLOPs pruned
1    | 6.38    | 6.51       | 66.7%         | 38.6%
2    | 6.23    | 6.11       | 84.7%         | 52.7%
3    | 5.87    | 6.10       | 91.4%         | 63.1%
4    | 6.19    | 6.59       | 95.6%         | 77.2%
5    | 5.96    | 7.73       | 98.3%         | 88.7%
6    | 7.79    | 9.70       | 99.4%         | 95.7%

(b) Multi-pass scheme on CIFAR-100
Iter | Trained | Fine-tuned | Params pruned | FLOPs pruned
1    | 27.72   | 26.52      | 59.1%         | 30.9%
2    | 26.03   | 26.52      | 79.2%         | 46.1%
3    | 26.49   | 29.08      | 89.8%         | 67.3%
4    | 28.17   | 30.59      | 95.3%         | 83.0%
5    | 30.04   | 36.35      | 98.3%         | 93.5%
6    | 35.91   | 46.73      | 99.4%         | 97.7%

ImageNet. The results for the ImageNet dataset are summarized in Table 2. When 50% of the channels are pruned, the parameter saving is more than 5×, while the FLOP saving is only 30.4%. This is due to the fact that only 378 (out of 2752) channels from all the computation-intensive convolutional layers are pruned, while 5094 neurons (out of 8192) from the parameter-intensive fully-connected layers are pruned. It is worth noting that our method can achieve the savings with no accuracy loss on the 1000-class ImageNet dataset, where other methods for efficient CNNs [2, 23, 35, 28] mostly report accuracy loss.

MNIST. On the MNIST dataset, we compare our method with the Structured Sparsity Learning (SSL) method [35] in Table 3. Although our method is mainly designed to prune channels in convolutional layers, it also works well in pruning neurons in fully-connected layers. In this experiment, we observe that pruning with a global threshold sometimes completely removes a layer, thus we prune 80% of the neurons in each of the two intermediate layers. Our method slightly outperforms [35], in that a slightly lower test error is achieved while pruning more parameters.

We provide some additional experimental results in the supplementary materials, including (1) the detailed structure of a compact VGGNet on CIFAR-10; (2) wall-clock time and run-time memory savings in practice; and (3) a comparison with a previous channel pruning method [23].

4.5. Results for Multi-pass Scheme

We employ the multi-pass scheme on the CIFAR datasets using VGGNet. Since there are no skip-connections, pruning away a whole layer will completely destroy the models. Thus, besides setting the percentile threshold as 50%, we also put a constraint that at each layer, at most 50% of the channels can be pruned.

The test errors of the models in each iteration are shown in Table 4. As the pruning process goes on, we obtain more and more compact models. On CIFAR-10, the trained model achieves the lowest test error in iteration 5. This model achieves 20× parameter reduction and 5× FLOP reduction, while still achieving a lower test error. On CIFAR-100, after iteration 3, the test error begins to increase. This is possibly because it contains more classes than CIFAR-10, so pruning channels too aggressively will inevitably hurt the performance. However, we can still prune near 90% of the parameters and near 70% of the FLOPs without notable accuracy loss.
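The multi-pass procedure evaluated above alternates the three steps of Sections 3 and 4.3. The skeleton below is only an illustrative sketch of that loop; train_with_sparsity, prune_channels and fine_tune are placeholder stubs standing in for the steps described in the text, not functions from the released code.

def train_with_sparsity(model):
    # Placeholder: normal training plus the L1 penalty on BN scaling factors.
    return model

def prune_channels(model, prune_ratio):
    # Placeholder: global percentile threshold, capped at 50% per layer.
    return model

def fine_tune(model):
    # Placeholder: fine-tune the narrower model.
    return model

def multi_pass(model, n_iterations=6, prune_ratio=0.5):
    for _ in range(n_iterations):
        model = train_with_sparsity(model)
        model = prune_channels(model, prune_ratio)
        model = fine_tune(model)
    return model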
5. Analysis

There are two crucial hyper-parameters in network slimming, the pruned percentage t and the coefficient λ of the sparsity regularization term (see Equation 1). In this section, we analyze their effects in more detail.

Effect of Pruned Percentage. Once we obtain a model trained with sparsity regularization, we need to decide what percentage of channels to prune from the model. If we prune too few channels, the resource saving can be very limited. However, it could be destructive to the model if we prune too many channels, and it may not be possible to recover the accuracy by fine-tuning. We train a DenseNet-40 model with λ = 10^-5 on CIFAR-10 to show the effect of pruning a varying percentage of channels. The results are summarized in Figure 5.

Figure 5: The effect of pruning varying percentages of channels, from DenseNet-40 trained on CIFAR-10 with λ = 10^-5.

From Figure 5, it can be concluded that the classification performance of the pruned or fine-tuned models degrades only when the pruning ratio surpasses a threshold. The fine-tuning process can typically compensate for the possible accuracy loss caused by pruning. Only when the threshold goes beyond 80% does the test error of the fine-tuned model fall behind the baseline model. Notably, when trained with sparsity, even without fine-tuning, the model performs better than the original model. This is possibly due to the regularization effect of L1 sparsity on channel scaling factors.

Channel Sparsity Regularization. The purpose of the L1 sparsity term is to force many of the scaling factors to be near zero. The parameter λ in Equation 1 controls its significance compared with the normal training loss. In Figure 4 we plot the distributions of scaling factors in the whole network with different λ values. For this experiment we use a VGGNet trained on the CIFAR-10 dataset.

Figure 4: Distributions of scaling factors in a trained VGGNet under various degrees of sparsity regularization (controlled by the parameter λ). With the increase of λ, scaling factors become sparser.

It can be observed that with the increase of λ, the scaling factors are more and more concentrated near zero. When λ = 0, i.e., there is no sparsity regularization, the distribution is relatively flat. When λ = 10^-4, almost all scaling factors fall into a small region near zero. This process can be seen as a feature selection happening in intermediate layers of deep networks, where only channels with non-negligible scaling factors are chosen. We further visualize this process by a heatmap. Figure 6 shows the magnitude of scaling factors from one layer in VGGNet, along the training process. Each channel starts with equal weights; as the training progresses, some channels' scaling factors become larger (brighter) while others become smaller (darker).

Figure 6: Visualization of the change of channel scaling factors along the training process, taken from the 11th conv-layer in VGGNet trained on CIFAR-10. Brighter color corresponds to larger value. The bright lines indicate the "selected" channels, the dark lines indicate channels that can be pruned.
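A Figure 4-style inspection can be reproduced by collecting every BN scaling factor of a trained network and plotting a histogram. The snippet below is an illustrative sketch only; the small stand-in model is a placeholder for a VGGNet trained with a given λ.

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU(),
                      nn.Conv2d(16, 32, 3), nn.BatchNorm2d(32), nn.ReLU())  # stand-in network

gammas = torch.cat([m.weight.detach().abs().flatten()
                    for m in model.modules() if isinstance(m, nn.BatchNorm2d)])

plt.hist(gammas.numpy(), bins=50)
plt.xlabel("Scaling factor value")
plt.ylabel("Count")
plt.show()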
6. Conclusion

We proposed the network slimming technique to learn more compact CNNs. It directly imposes sparsity-induced regularization on the scaling factors in batch normalization layers, and unimportant channels can thus be automatically identified during training and then pruned. On multiple datasets, we have shown that the proposed method is able to significantly decrease the computational cost (up to 20×) of state-of-the-art networks, with no accuracy loss. More importantly, the proposed method simultaneously reduces the model size, run-time memory and computing operations while introducing minimum overhead to the training process, and the resulting models require no special libraries/hardware for efficient inference.

Acknowledgements. Gao Huang is supported by the International Postdoctoral Exchange Fellowship Program of China Postdoctoral Council (No. 20150015). Changshui Zhang is supported by NSFC and DFG joint project NSFC 61621136008/DFG TRR-169.

References

[1] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. In ICLR, 2017.
[2] S. Changpinyo, M. Sandler, and A. Zhmoginov. The power of sparsity in convolutional neural networks. arXiv preprint arXiv:1702.06257, 2017.
[3] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In ICML, 2015.
[4] S. Chintala. Training an object classifier in torch-7 on multiple gpus over imagenet. https://github.com/soumith/imagenet-multiGPU.torch.
[5] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
[6] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[7] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580-587, 2014.
[9] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
[10] S. Gross and M. Wilber. Training and investigating residual nets. https://github.com/szagoruyko/cifar.torch.
[11] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR, 2016.
[12] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In NIPS, pages 1135-1143, 2015.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, pages 630-645. Springer, 2016.
[16] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger. Multi-scale dense convolutional networks for efficient prediction. arXiv preprint arXiv:1703.09844, 2017.
[17] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
[18] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[20] J. Jin, Z. Yan, K. Fu, N. Jiang, and C. Zhang. Neural network architecture optimization through submodularity and supermodularity. arXiv preprint arXiv:1609.00074, 2016.
[21] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Tech Report, 2009.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097-1105, 2012.
[23] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[24] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
[25] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In CVPR, pages 806-814, 2015.
[26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431-3440, 2015.
[27] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
[29] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini. Group sparse regularization for deep neural networks. arXiv preprint arXiv:1607.00485, 2016.
[30] M. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for l1 regularization: A comparative study and two new approaches. In ECML, pages 286-297, 2007.
[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[32] S. Srinivas, A. Subramanya, and R. V. Babu. Training sparse neural networks. CoRR, abs/1611.06694, 2016.
[33] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, et al. Going deeper with convolutions. In CVPR, pages 1-9, 2015.
[35] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
[36] S. Zagoruyko. 92.5% on cifar-10 in torch. https://github.com/szagoruyko/cifar.torch.
[37] H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards compact cnns. In ECCV, 2016.
[38] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
\ No newline at end of file
diff --git a/Corpus/Learning Structured Sparsity in Deep Neural Networks - Wei Wen.txt b/Corpus/Learning Structured Sparsity in Deep Neural Networks - Wei Wen.txt
deleted file mode 100644
index 643bfe2100d1dd54d888f76315e04add43e4ab9d..0000000000000000000000000000000000000000
GIT binary patch
diff --git a/Corpus/Learning both Weights and Connections for Efficient Neural Networks.txt b/Corpus/Learning both Weights and Connections for Efficient Neural Networks.txt
deleted file mode 100644
index 4c089fead0ab6112eceea4294c65cc3c35ae4fd1..0000000000000000000000000000000000000000
GIT binary patch
diff --git a/Corpus/Learning to Generalize.txt b/Corpus/Learning to Generalize.txt
deleted file mode 100644
index dac9877..0000000
--- a/Corpus/Learning to Generalize.txt
+++ /dev/null
@@ -1,933 +0,0 @@
SECTION VI / MODEL NEURAL NETWORKS FOR COMPUTATION AND LEARNING

MANFRED OPPER
Neural Computation Research Group
Aston University
Birmingham B4 7ET, United Kingdom

Theories that try to understand the ability of neural networks to generalize from learned examples are discussed.
Also, an approach that is based on ideas from statistical physics which aims to model typical learning behavior is compared with a worst-case framework.

Learning to Generalize

Introduction

Neural networks learn from examples. This statement is obviously true for the brain, but also artificial networks (or neural networks), which have become a powerful new tool for many pattern-recognition problems, adapt their "synaptic" couplings to a set of examples. Neural nets usually consist of many simple computing units which are combined in an architecture which is often independent from the problem. The parameters which control the interaction among the units can be changed during the learning phase and these are often called synaptic couplings. After the learning phase, a network adopts some ability to generalize from the examples; it can make predictions about inputs which it has not seen before; it has begun to understand a rule. To what extent is it possible to understand the complexity of learning from examples by mathematical models and their solutions? This question is the focus of this article. I concentrate on the use of neural networks for classification. Here, one can take characteristic features (e.g., the pixels of an image) as an input pattern to the network. In the simplest case, it should decide whether a given pattern belongs (at least more likely) to a certain class of objects and respond with the output +1 or -1. To learn the underlying classification rule, the network is trained on a set of patterns together with the classification labels, which are provided by a trainer. A heuristic strategy for training is to tune the parameters of the machine (the couplings of the network) using a learning algorithm, in such a way that the errors made on the set of training examples are small, in the hope that this helps to reduce the errors on new data. How well will the trained network be able to classify an input that it has not seen before? This performance on new data defines the generalization ability of the network. This ability will be affected by the problem of realizability: the network may not be sufficiently complex to learn the rule completely, or there may be ambiguities in classification. Here, I concentrate on a second problem arising from the fact that learning will mostly not be exhaustive and the information about the rule contained in the examples is not complete. Hence, the performance of a network may vary from one training set to another. In order to treat the generalization ability in a quantitative way, a common model assumes that all input patterns, those from the training set and the new one on which the network is tested, have a preassigned probability distribution (which characterizes the feature that must be classified), and they are produced independently at random with the same probability distribution from the network's environment. Sometimes the probability distribution used to extract the examples and the classification of these examples is called the rule. The network's performance on novel data can now be quantified by the so-called generalization error, which is the probability of misclassifying the test input and can be measured by repeating the same learning experiment many times with different data.
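As a small numerical illustration of this probabilistic setting (not part of the original article), the generalization error of a simple threshold classifier can be estimated by drawing random test inputs from the assumed distribution and counting disagreements with the rule; the teacher and student weights below are arbitrary stand-ins.

import numpy as np

rng = np.random.default_rng(0)
N = 50                                                  # number of couplings (input dimension)
w_teacher = rng.normal(size=N)                          # the unknown rule providing the labels
w_student = w_teacher + rng.normal(scale=0.5, size=N)   # an imperfectly trained network

def classify(w, x):
    return np.sign(x @ w)                               # threshold (step-function) unit

x_test = rng.normal(size=(100_000, N))                  # test patterns from the same distribution
gen_error = np.mean(classify(w_student, x_test) != classify(w_teacher, x_test))
print(f"estimated generalization error: {gen_error:.3f}")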
In order to treat the gen- worst-case results. This method aims at studying the typical - eralization ability in a quantitative way, a common model rather than the worst-case behavior and often enables the - assumes that all input patterns, those from the training set exact calculations of the entire learning curve for models of - and the new one on which the network is tested, have a pre- simple networks which have many parameters. Since both - assigned probability distribution (which characterizes the biological and artificial neural networks are composed of - feature that must be classified), and they are produced in- many elements, it is hoped that such an approach may ac- - dependently at random with the same probability distribu- tually reveal some relevant and interesting structures. - tion from the network’s environment. Sometimes the prob- At first, it may seem surprising that a problem should - ability distribution used to extract the examples and the simplifywhenthenumberofitsconstituentsbecomeslarge. - classification of these examples is called the rule.The net- However, this phenomenon is well-known for macroscopic - work’s performance on novel data can now be quantified by physical systems such as gases or liquids which consist of - the so-called generalization error,which is the probability a huge number of molecules. Clearly, it is not possible to - of misclassifying the test input and can be measured by re- study the complete microscopic state of such a system, - peating the same learning experiment many times with dif- which is described by the rapidly fluctuating positions and - ferent data. velocities of all particles. On the other hand, macroscopic - Within such a probabilistic framework, neural networks quantities such as density, temperature, and pressure are - areoftenviewedasstatisticaladaptivemodelswhichshould usually collective properties influenced by all elements. For - give a likely explanation of the observed data. In this frame- such quantities, fluctuations are averaged out in the ther- - work, the learning process becomes mathematically related modynamic limit of a large number of particles and the col- - to a statistical estimation problem for optimal network pa- lective properties become, to some extent, independent of - rameters.Hence,mathematicalstatisticsseemstobeamost themicrostate.Similarly,thegeneralizationabilityofaneu- - appropriate candidate for studying a neural network’s be- ral network is a collective property of all the network pa- - havior. In fact, various statistical approaches have been ap- rameters, and the techniques of statistical physics allow, at - plied to quantify the generalization performance. For ex- least for some simple but nontrivial models, for exact com- - ample, expressions for the generalization error have been putations in the thermodynamic limit. Before explaining - obtainedinthelimit,wherethenumberofexamplesislarge these ideas in detail, I provide a short description of feed- - compared to the number of couplings (Seung et al.,1992; forward neural networks. - Amari and Murata, 1993). In such a case, one can expect ................................................that learning is almost exhaustive, such that the statistical ◗ - - fluctuations of the parameters around their optimal values Artificial Neural Networks - are small. 
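The generalization error defined here can be estimated directly by simulation: train on one randomly drawn training set, test on freshly drawn patterns, and repeat the experiment with different data. Below is a minimal Python sketch of such an experiment; the teacher perceptron standing in for the rule, the simple Hebbian-style student, and all sizes are illustrative assumptions rather than anything prescribed by the text.

    import numpy as np

    rng = np.random.default_rng(0)
    N, m, n_test, n_repeats = 50, 100, 2000, 20

    teacher = rng.standard_normal(N)                 # the "rule" producing the class labels

    errors = []
    for _ in range(n_repeats):                       # repeat the learning experiment
        X = rng.standard_normal((m, N))              # a fresh random training set
        y = np.sign(X @ teacher)                     # labels provided by the rule
        w = (y[:, None] * X).sum(axis=0)             # an illustrative Hebbian-style student
        X_new = rng.standard_normal((n_test, N))     # novel test patterns
        errors.append(np.mean(np.sign(X_new @ w) != np.sign(X_new @ teacher)))

    print(f"estimated generalization error: {np.mean(errors):.3f} (+/- {np.std(errors):.3f})")

The spread over repetitions is exactly the training-set-to-training-set variability discussed above.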
However, in practice the number of parameters is - often large so that the network can be flexible, and it is not Based on highly idealized models of brain function, artifi- - clear how many examples are needed for the asymptotic cial neural networks are built from simple elementary com- - theorytobecomevalid.Theasymptotictheorymayactually puting units, which are sometimes termed neurons after - miss interesting behavior of the so-called learning curve, their biological counterparts. Although hardware imple- - which displays the progress of generalization ability with mentations have become an important research topic, neu- - an increasing amount of training data. ral nets are still simulated mostly on standard computers. - A second important approach, which was introduced Each computing unit of a neural net has a single output and - into mathematical statistics in the 1970s by Vapnik and several ingoing connections which receive the outputs of - Chervonenkis (VC) (Vapnik, 1982, 1995), provides exact other units. To every ingoing connection (labeled by the - bounds for the generalization error which are valid for any index i) a real number is assigned, the synaptic weight w,i - number of training examples. Moreover, they are entirely which is the basic adjustable parameter of the network. To - independent of the underlying distribution of inputs, and compute a unit’s output, all incoming values x are multi- i - - - - - 764 VOLUME III / INTELLIGENT SYSTEMS 262-A1677 7/24/01 11:12 AM Page 765 - - - - - - - - LEARNING TO GENERALIZE - - - 0.6 −0.9 0.8 - inputs - - 1.6 −1.4 −0.1 synaptic weights - - weighted sum - 1.6 × 0.6 + (–1.4) × (–0.9) + (–0.1) × 0.8 = 2.14 - - - - 1 - - - 0 - - - - −1 - 2.14 aboutput - FIGURE 1 (a) Example of the computation of an elementary unit (neuron) in a neural network. The numeri- - cal values assumed by the incoming inputs to the neuron and the weights of the synapses by which the inputs - reach the neuron are indicated. The weighted sum of the inputs corresponds to the value of the abscissa at which - the value of the activation function is calculated (bottom graph). Three functions are shown: sigmoid, linear, and - step. (b) Scheme of a feedforward network. The arrow indicates the direction of propagation of information. - - - plied by the weights w and then added. Figure 1a shows its simple structure, it can for many learning problems give i - an example of such a computation with three couplings. a nontrivial generalization performance and may be used - Finally, the result, wx,is passed through an activation as a first step to an unknown classification task. As can be i i i - function which is typically of the shape of the red curve in seen by comparing Figs. 2a and 1b, it is also a building - Fig. 1a (a sigmoidal function), which allows for a soft, am- block for the more complex multilayer networks. Hence, - biguous classification between 1 and 1. Other impor- understanding its performance theoretically may also pro- - tant cases are the step function (green curve) and the linear vide insight into the more complex machines. To learn a set - function (yellow curve; used in the output neuron for prob- of examples, a network must adjust its couplings appropri- - lems of fitting continuous functions). In the following, to ately (I often use the word couplings for their numerical - keep matters simple, I restrict the discussion mainly to the strengths, the weights w, for i1,..., N). Remarkably, i - step function. 
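The elementary computation of Fig. 1a can be written in a few lines. The sketch below reproduces the numerical example (weighted sum 2.14) and applies the three activation functions mentioned above; the function names, and the choice of tanh as one possible sigmoid, are purely illustrative.

    import numpy as np

    def step(h):                 # the step activation used throughout the article
        return 1.0 if h >= 0 else -1.0

    def sigmoid(h):              # a soft classification between -1 and 1 (one possible choice)
        return np.tanh(h)

    def linear(h):               # linear output, used for fitting continuous functions
        return h

    weights = np.array([1.6, -1.4, -0.1])
    inputs  = np.array([0.6, -0.9, 0.8])

    h = np.dot(weights, inputs)  # weighted sum of the incoming values
    print(h)                     # 2.14, as in Fig. 1a
    print(step(h), sigmoid(h), linear(h))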
Such simple units can develop a remarkable for the perceptron there exists a simple learning algorithm - computational power when connected in a suitable archi- which always enables the network to find those parameter - tecture. An important network type is the feedforward ar- values whenever the examples can be learnt by a percep- - chitecture shown in Fig. 1b, which has two layers of comput- tron. In Rosenblatt’s algorithm, the input patterns are pre- - ing units and adjustable couplings. The input nodes (which sented sequentially (e.g., in cycles) to the network and the - do not compute) are coupled to the so-called hidden units, - whichfeedtheiroutputsintooneormoreoutputunits.With - suchanarchitectureandsigmoidalactivationfunctions,any - continuous function of the inputs can be arbitrarily closely xx 21 x2 x3 xn - approximated when the number of hidden units is suffi- - ciently large. (w1 ,w 2 ) - w ................................................ 1 w2 w3 wn ◗ - - The Perceptron x1 - - - The simplest type of network is the perceptron (Fig. 2a). - There are Ninputs, Nsynaptic couplings w, and the output i - is simply a b - N FIGURE 2 (a) The perceptron. (b) Classification of inputs - awx [1] i i by a perceptron with two inputs. The arrow indicates the vec- - i1 tor composed of the weights of the network, and the line per- - It has a single-layer architecture and the step function pendicular to this vector is the boundary between the classes - (green curve in Fig. 1a) as its activation function. Despite of input. - - - - - - PART TWO / BUILDING BLOCKS FOR INTELLIGENCE SYSTEMS 765 262-A1677 7/24/01 11:12 AM Page 766 - - - - - - - - MANFRED OPPER - - - output is tested. Whenever a pattern is not classified cor- - rectly, all couplings are altered simultaneously. We increase x2 - by a fixed amount all weights for which the input unit and - the correct value of the output neuron have the same sign - but we decrease them for the opposite sign. This simple - algorithm is reminiscent of the so-called Hebbian learning - rule,a physiological model of a learning processes in the - real brain. It assumes that synaptic weights are increased - when two neurons are simultaneously active. Rosenblatt’s - theorem states that in cases in which there exists a choice of - the w which classify correctly all of the examples (i.e., per- i - fectly learnable perceptron), this algorithm finds a solution - in a finite number of steps, which is at worst equal to A N 3 , - where Ais an appropriate constant. - It is often useful to obtain an intuition of a perceptron’s xa 1 - classification performance by thinking in terms of a geo- - metric picture. We may view the numerical values of the in- - puts as the coordinates of a point in some (usually) high- - dimensional space. The case of two dimensions is shown - in Fig. 2b. A corresponding point is also constructed for the - couplings w.The arrow which points from the origin of the i - coordinate system to this latter point is called the weight - vector or coupling vector. An application of linear algebra - tothecomputationofthenetworkshowsthatthelinewhich - is perpendicular to the coupling vector is the boundary be- - tween inputs belonging to the two different classes. Input - points which are on the same side as the coupling vector are - classified as 1 (the green region in Fig. 2b) and those on - the other side as 1 (red region in Fig. 2b). - Rosenblatt’s algorithm aims to determine such a line - when it is possible. 
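Rosenblatt's algorithm as described above is easy to state in code: cycle through the patterns and, on every misclassification, move each coupling in the direction of the product of the input component and the desired output. The sketch below assumes ±1-valued inputs whose labels are produced by a teacher perceptron, so that a perfectly learnable solution exists and the algorithm is guaranteed to stop; names and sizes are illustrative.

    import numpy as np

    def rosenblatt_train(X, y, eta=1.0, max_epochs=500):
        """On each error add eta * y * x to the couplings: weights whose input
        shares the sign of the desired output are increased, the others decreased."""
        w = np.zeros(X.shape[1])
        for _ in range(max_epochs):
            mistakes = 0
            for x, target in zip(X, y):
                if np.sign(w @ x) != target:       # sign(0) counts as a mistake
                    w += eta * target * x
                    mistakes += 1
            if mistakes == 0:                      # every training example classified correctly
                break
        return w

    rng = np.random.default_rng(1)
    N, m = 20, 200
    teacher = rng.standard_normal(N)               # a perfectly learnable rule
    X = rng.choice([-1.0, 1.0], size=(m, N))
    y = np.sign(X @ teacher)

    w = rosenblatt_train(X, y)
    print("training errors left:", int(np.sum(np.sign(X @ w) != y)))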
This picture generalizes to higher di- direction of coupling vectorb - mensions, for which a hyperplane plays the same role of the FIGURE 3 (a) Projection of 200 random points (with ran- - line of the previous two-dimensional example. We can still dom labels) from a 200-dimensional space onto the first two - obtainanintuitivepicturebyprojectingontwo-dimensional coordinate axes (x and x). (b) Projection of the same points 1 2 - planes. In Fig. 3a, 200 input patterns with random coordi- onto a plane which contains the coupling vector of a perfectly - nates (randomly labeled red and blue) in a 200-dimensional trained perceptron. - input space are projected on the plane spanned by two arbi- - trary coordinate axes. If we instead use a plane for projec- - tion which contains the coupling vector (determined from tions for small changes of the couplings). Hence, in general, - a variant of Rosenblatt’s algorithm) we obtain the view in addition to the perfectly learnable perceptron case in - shown in Fig. 3b, in which red and green points are clearly which the final error is zero, minimizing the training error - separated and there is even a gap between the two clouds. is usually a difficult task which could take a large amount of - It is evident that there are cases in which the two sets of computer time. However, in practice, iterative approaches, - points are too mixed and there is no line in two dimensions which are based on the minimization of other smooth cost - (or no hyperplane in higher dimensions which separates functions,areusedtotrainaneuralnetwork(Bishop,1995). - them). In these cases, the rule is too complex to be per- ................................................fectly learned by a perceptron. If this happens, we must at- ◗ - - tempt to determine the choice of the coupling which mini- Capacity, VC Dimension, - mizesthenumberoferrorsonagivensetofexamples.Here, and Worst-Case Generalization - Rosenblatt’s algorithm does not work and the problem of - finding the minimum is much more difficult from the algo- As previously shown, perceptrons are only able to realize a - rithmic point. The training error, which is the number of very restricted type of classification rules, the so-called lin- - errorsmadeonthetrainingset,isusuallyanonsmoothfunc- early separable ones. Hence, independently from the issue - tion of the network couplings (i.e., it may have large varia- of finding the best algorithm to learn the rule, one may ask - - - - - - 766 VOLUME III / INTELLIGENT SYSTEMS 262-A1677 7/24/01 11:12 AM Page 767 - - - - - - - - LEARNING TO GENERALIZE - - - the following question: In how many cases will the percep- exp[Nf(m/N)], where the function f(a) vanishes for - tron be able to learn a given set of training examples per- a2 and it is positive for a2. Such a threshold phe- - fectly if the output labels are chosen arbitrarily? In order to nomenon is an example of a phase transition (i.e., a sharp - answer this question in a quantitative way, it is convenient change of behavior) which can occur in the thermodynamic - tointroducesomeconceptssuchascapacity,VCdimension, limit of a large network size. - andworst-casegeneralization,whichcanbeusedinthecase Generally, the point at which such a transition takesof the perceptron and have a more general meaning. place defines the so-called capacity of the neural network.In the case of perceptrons, this question was answered in Although the capacity measures the ability of a network tothe 1960s by Cover (1965). 
He calculated for any set of in- learn random mappings of the inputs, it is also related to itsput patterns, e.g., m,the fraction of all the 2 m possible map- ability to learn a rule (i.e., to generalize from examples).pings that can be linearly separated and are thus learnable The question now is, how does the network perform on aby perceptrons. This fraction is shown in Fig. 4 as a func- new example after having been trained to learn mexampletion of the number of examples per coupling for different on the training set?numbers of input nodes (couplings) N.Three regions can To obtain an intuitive idea of the connection betweenbe distinguished: capacity and ability to generalize, we assume a training set - Region in which m/N1: Simple linear algebra shows of size mand a single pattern for test. Suppose we define - that it is always possible to learn all mappings when the a possible rule by an arbitrary learnable mapping from - number mof input patterns is less than or equal to the inputs to outputs. If m1 is much larger than the capac- - number Nof couplings (there are simply enough adjustable ity, then for most rules the labels on the mtraining pat- - parameters). terns which the perceptron is able to recognize will nearly - Region in which m/N1: For this region, there are ex- uniquely determine the couplings (and consequently the - amples of rules that cannot be learned. However, when the answer of the learning algorithm on the test pattern), and - number of examples is less than twice the number of cou- therulecanbeperfectlyunderstoodfromtheexamples.Be- - plings (m/N2), if the network is large enough almost all low capacity, in most cases there are two different choices - mappings can be learned. If the output labels for each of of couplings which give opposite answers for the test pat- - the minputs are chosen randomly 1 or 1 with equal tern. Hence, a correct classification will occur with proba- - probability, the probability of finding a nonrealizable cou- bility 0.5 assuming all rules to be equally probable. Figure 5 - pling goes to zero exponentially when Ngoes to infinity at displays the two types of situations form3andN2. - fixed ratio m/N. This intuitive connection can be sharpened. Vapnik and - Region in which m/N2: For m/N2 the probabil- Chervonenkis established a relation between a capacity - ity for a mapping to be realizable by perceptrons decreases such as quantity and the generalization ability that is valid - to zero rapidly and it goes to zero exponentially when N for general classifiers (Vapnik, 1982, 1995). The VC dimen- - goes to infinity at fixed ratio m/N(it is proportional to sion is defined as the size of the largest set of inputs for - which all mappings can be learned by the type of classi- - fier. It equals Nfor the perceptron. Vapnik and Chervo- - 1.0 nenkis were able to show that for any training set of size m - - - - - - - - - - - - - - - fraction of realizable mappings 0.8 - - - 0.6 - - - 0.4 ? ? - - - 0.2 - - - 0.0 a b - 01234 FIGURE 5 Classification rules for four patterns based on a m/N perceptron. The patterns colored in red represent the training - FIGURE 4 Fraction of all mappings of minput patterns examples, and triangles and circles represent different class la- - which are learnable by perceptrons as a function of m/Nfor bels. The question mark is a test pattern. (a) There are two - different numbers of couplings N: N10 (in green), N20 possible ways of classifying the test point consistent with the - (in blue), and N100 (in red). 
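The fraction computed by Cover has a standard closed form for input points in general position, C(m, N)/2^m with C(m, N) = 2 sum_{k=0}^{N-1} binom(m-1, k); the formula is not spelled out in the text itself, so the short sketch below should be read as one conventional way of reproducing the curves of Fig. 4.

    from math import comb

    def cover_fraction(m, N):
        """Fraction of the 2**m possible labelings of m inputs in general position
        that a perceptron with N couplings can realize."""
        if m <= N:
            return 1.0                                  # enough adjustable parameters
        c = 2 * sum(comb(m - 1, k) for k in range(N))   # Cover's counting function
        return c / 2**m

    # the three regions discussed above: exactly 1 below m = N, close to 1 up to
    # m = 2N for large N, and a rapid drop beyond the capacity m = 2N
    # (all curves cross 1/2 exactly at m/N = 2)
    for N in (10, 20, 100):
        print(N, [round(cover_fraction(int(a * N), N), 3) for a in (1.0, 1.5, 2.0, 2.5, 3.0)])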
examples; (b) only one classification is possible. - - - - - - PART TWO / BUILDING BLOCKS FOR INTELLIGENCE SYSTEMS 767 262-A1677 7/24/01 11:12 AM Page 768 - - - - - - - - MANFRED OPPER - - - larger than the VC dimension D , the growth of the num- blue curve in Fig. 6, the minimal training error will decrease VC - ber of realizable mappings is bounded by an expression for increasing complexity of the nets. On the other hand, - which grows much slower than 2 m (in fact, only like a poly- the VC dimension and the complexity of the networks in- - nomial in m). crease with the increasing number of hidden units, leading - They proved that a large difference between training er- to an increasing expected difference (confidence interval) - ror (i.e., the minimum percentage of errors that is done on between training error and generalization error as indi- - the training set) and generalization error (i.e., the proba- cated by the red curve. The sum of both (green curve) will - bility of producing an error on the test pattern after having have a minimum, giving the smallest bound on the general- - learned the examples) of classifiers is highly improbable if ization error. As discussed later, this procedure will in some - the number of examples is well above D . This theorem cases lead to not very realistic estimates by the rather pes- VC - implies a small expected generalization error for perfect simistic bounds of the theory. In other words, the rigorous - learning of the training set results. The expected general- bounds, which are obtained from an arbitrary network and - ization error is bounded by a quantity which increases pro- rule, are much larger than those determined from the re- - portionally to D and decreases (neglecting logarithmic sults for most of the networks and rules. VC - corrections in m) inversely proportional to m. ................................................Conversely, one can construct a worst-case distribution ◗ - - of input patterns, for which a size of the training set larger Typical Scenario: The Approach - than D is also necessary for good generalization. The VC of Statistical Physics VC - results should, in practice, enable us to select the network - with the proper complexity which guarantees the smallest When the number of examples is comparable to the size of - bound on the generalization error. For example, in order the network, which for a perceptron equals the VC dimen- - tofind the proper size of the hidden layer of a network with sion, the VC theory states that one can construct malicious - twolayers,onecouldtrainnetworksofdifferentsizesonthe situations which prevent generalizations. However, in gen- - same data. eral, we would not expect that the world acts as an adver- - The relation among these concepts can be better under- sary. Therefore, how should one model a typical situation? - stood if we consider a family of networks of increasing com- As a first step, one may construct rules and pattern dis- - plexity which have to learn the same rule. A qualitative pic- tributions which act together in a nonadversarial way. The - ture of the results is shown in Fig. 6. As indicated by the teacher–student paradigm has proven to be useful in such a - situation. 
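The model-selection recipe behind Fig. 6 can be sketched as follows: for every candidate network one adds to its training error a confidence term that grows with the VC dimension and shrinks with the number of examples, and one picks the network minimizing the sum. The confidence term below is one commonly quoted form of the VC bound; its exact constants vary between statements of the theorem, and the training errors are invented numbers chosen only to mimic the qualitative shape of the green curve in Fig. 6.

    import math

    def vc_confidence(d_vc, m, delta=0.05):
        """One standard form of the VC confidence interval (constants vary)."""
        return math.sqrt((d_vc * (math.log(2 * m / d_vc) + 1) + math.log(4 / delta)) / m)

    m = 10_000                                        # number of training examples
    candidates = [(d, 0.5 / (1 + 0.01 * d)) for d in (50, 100, 200, 400, 800, 1600)]
    # (VC dimension, hypothetical training error decreasing with network complexity)

    for d_vc, train_err in candidates:
        bound = train_err + vc_confidence(d_vc, m)
        print(f"D_VC = {d_vc:5d}  training error = {train_err:.3f}  bound = {bound:.3f}")

The printed bound first falls and then rises again as the complexity grows, which is the minimum indicated in Fig. 6.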
Here, the rule to be learned is modeled by a sec- - ondnetwork,theteachernetwork;inthiscase,iftheteacher - and the student have the same architecture and the same - upper bound on numberofunits,theruleisevidentlyrealizable.Thecorrect generalization error class labels for any inputs are given by the outputs of the - teacher. Within this framework, it is often possible to ob- - tain simple expressions for the generalization error. For a - upper bound on perceptron, we can use the geometric picture to visualize confidence interval the generalization error. A misclassification of a new in- - put vector by a student perceptron with coupling vector ST - occurs only if the input pattern is between the separating - planes (dashed region in Fig. 7) defined by ST and the vec- - tor of teacher couplings TE. If the inputs are drawn ran- training error domlyfromauniformdistribution,thegeneralizationerror - is directly proportional to the angle between ST and TE. - network complexity Hence, the generalization error is small when teacher and - student vectors are close together and decreases to zero - when both coincide. - In the limit, when the number of examples is very large - all the students which learn the training examples perfectly - will not differ very much from and their couplings will be FIGURE 6 As the complexity of the network varies (i.e., close to those of the teacher. Such cases with a small gen- of the number of hidden units, as shown schematically below), - the generalization error (in red), calculated from the sum of eralization error have been successfully treated by asymp- - the training error (in green) and the confidence interval (in totic methods of statistics. On the other hand, when the - blue) according to the theory of Vapnik–Chervonenkis, shows number of examples is relatively small, there are many dif- - a minimum; this corresponds to the network with the best gen- ferent students which are consistent with the teacher re- - eralization ability. garding the training examples, and the uncertainty about - - - - 768 VOLUME III / INTELLIGENT SYSTEMS 262-A1677 7/24/01 11:12 AM Page 769 - - - - - - - - LEARNING TO GENERALIZE - - - with the number of couplings N(like typical volumes in - N-dimensional spaces) and Bdecreases exponentially with - m(because it becomes more improbable to be correct ST mtimes for any e0), both factors can balance each other - when mincreases like maN.ais an effective measure for TE the size of the training set when Ngoes to infinity. In order - to have quantities which remain finite as NSq, it is also - useful to take the logarithm of V(e) and divide by N, which - transforms the product into a sum of two terms. The first - one (which is often called the entropic term) increases with - increasing generalization error (green curve in Fig. 8). This - FIGURE 7 For a uniform distribution of patterns, the gen- is true because there are many networks which are not - eralization error of a perceptron equals the area of the similar to the teacher, but there is only one network equal - shaded region divided by the area of the entire circle. ST and to the teacher. For almost all networks (remember, the - TE represent the coupling vectors of the student and teacher, entropic term does not include the effect of the training ex- - respectively. amples) e0.5, i.e., they are correct half of the time by - random guessing. On the other hand, the second term (red - curve in Fig. 8) decreases with increasing generalization er- - the true couplings of the teacher is large. 
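For the uniform input distribution assumed above, the geometric picture of Fig. 7 gives the generalization error in closed form: it is the angle between the teacher and student coupling vectors divided by pi, so that orthogonal vectors give 0.5 and an inverted student gives 1. The sketch below checks this against a direct Monte Carlo estimate; it uses isotropic Gaussian inputs, which have the same distribution of directions, and all sizes are illustrative.

    import numpy as np

    def generalization_error(w_teacher, w_student):
        """epsilon = (angle between TE and ST) / pi."""
        cos_t = w_teacher @ w_student / (np.linalg.norm(w_teacher) * np.linalg.norm(w_student))
        return float(np.arccos(np.clip(cos_t, -1.0, 1.0)) / np.pi)

    rng = np.random.default_rng(2)
    N = 100
    teacher = rng.standard_normal(N)
    student = teacher + 0.5 * rng.standard_normal(N)    # an imperfect student

    X = rng.standard_normal((50_000, N))                # random test inputs
    empirical = np.mean(np.sign(X @ teacher) != np.sign(X @ student))
    print(generalization_error(teacher, student), float(empirical))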
Possible general- ror because the probability of being correct on an input - ization errors may range from zero (if, by chance, a learn- pattern increases when the student network becomes more - ing algorithm converges to the teacher) to some worst-case similar to the teacher. It is often called the energetic contri- - value. We may say that the constraint which specifies the butionbecauseitfavorshighlyordered(towardtheteacher) - macrostateofthenetwork(itstrainingerror)doesnotspec- network states, reminiscent of the states of physical systems - ify the microstate uniquely. Nevertheless, it makes sense to at low energies. Hence, there will be a maximum (Fig. 8, ar- - speak of a typical value for the generalization error, which row) of V(e) at some value of ewhich by definition is the - is defined as the value which is realized by the majority of typical generalization error. - the students. In the thermodynamic limit known from sta- The development of the learning process as the number - tistical physics, in which the number of parameters of the of examples aNincreases can be understood as a compe- - network is taken to be large, we expect that in fact almost tition between the entropic term, which favors disordered - all students belong to this majority, provided the quantity network configurations that are not similar to the teacher, - of interest is a cooperative effect of all components of the andtheenergeticterm.Thelattertermdominateswhenthe - system. As the geometric visualization for the generaliza- number of examples is large. It will later be shown that such - tion error of the perceptron shows, this is actually the case. a competition can lead to a rich and interesting behavior as - The following approach, which was pioneered by Elizabeth the number of examples is varied. The result for the learn- - Gardner (Gardner, 1988; Gardner and Derrida, 1989), is ing curve (Györgyi and Tishby, 1990; Sompolinsky et al., - based on the calculation of V(e), the volume of the space - of couplings which both perfectly implement mtraining - examples and have a given generalization error e. For an - intuitive picture, consider that only discrete values for the entropic contribution - couplings are allowed; then V(e) would be proportional to - the number of students. The typical value of the general- - ization error is the value of e, which maximizes V(e). It - should be kept in mind that V(e) is a random number and energetic contribution - fluctuates from one training set to another. A correct treat- 1/N logfV(ε)g - ment of this randomness requires involved mathematical - techniques (Mézard et al.,1987). To obtain a picture which - is quite often qualitatively correct, we may replace it by its - average over many realizations of training sets. From ele- - mentary probability theory we see that this average num- maximum - ber can be found by calculating the volume Aof the space 0 0.1 0.2 0.3 0.4 0.5 - of all students with generalization error e, irrespective of ε - their behavior on the training set, and multiplying it by FIGURE 8 Logarithm of the average volume of students that - the probability Bthat a student with generalization error e havelearnedmexamplesandgiveegeneralizationerror(green - gives mtimes the correct answers on independent draw- curve). The blue and red curves represent the energetic and - ings of the input patterns. Since Aincreases exponentially entropic contributions, respectively. 
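The competition just described can be made concrete in a deliberately crude way by averaging V itself rather than its logarithm (an "annealed" shortcut; the exact treatment needs the replica techniques of Mézard et al., 1987, cited above). For a spherical perceptron student at angle pi*epsilon to the teacher the entropic term is then ln sin(pi*epsilon) and the energetic term is alpha*ln(1 - epsilon), and the typical error is read off from the maximum of their sum, as in Fig. 8. This is an illustrative simplification, not the calculation whose results are quoted in the text.

    import numpy as np

    eps = np.linspace(1e-4, 0.5, 2000)
    entropic = np.log(np.sin(np.pi * eps))          # many dissimilar students, few near the teacher

    for alpha in (0.0, 0.5, 1.0, 2.0, 4.0, 8.0):
        energetic = alpha * np.log(1.0 - eps)       # chance of m = alpha*N correct answers
        s = entropic + energetic
        print(f"alpha = {alpha:4.1f}   typical eps ~ {eps[np.argmax(s)]:.3f}")

For alpha = 0 the maximum sits at 0.5 (random guessing); as alpha grows the energetic term pulls it towards zero, which is the competition illustrated in Fig. 8.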
- - - - PART TWO / BUILDING BLOCKS FOR INTELLIGENCE SYSTEMS 769 262-A1677 7/24/01 11:12 AM Page 770 - - - - - - - - MANFRED OPPER - - - 0.5 student is free to ask the teacher questions, i.e., if the stu- - ε dent can choose highly informative input patterns. For the - simple perceptron a fruitful query strategy is to select a new 0.4 input vector which is perpendicular to the current coupling - vector of the student (Kinzel and Ruján, 1990). Such an - 0.3 input is a highly ambiguous pattern because small changes - continuous couplings in the student couplings produce different classification an- - swers. For more complicated networks it may be difficult 0.2 to obtain similar ambiguous inputs by an explicit construc- - tion. A general algorithm has been proposed (Seung et al., - 0.1 1992a) which uses the principle of maximal disagreement discrete couplings in a committee of several students as a selection process for - training patterns. Using an appropriate randomized train- 0.00.0 0.1 0.2 0.3 0.4 0.5 0. 6 ingstrategy,differentstudentsaregeneratedwhichalllearn α the same set of examples. Next, any new input vector is only - FIGURE 9 Learning curves for typical student perceptrons. accepted for training when the disagreement of its classi- - am/Nis the ratio between the number of examples and the fication between the students is maximal. For a committee - coupling number. of two students it can be shown that when the number of - examples is large, the information gain does not decrease - but reaches a positive constant. This results in a much faster - 1990) of a perceptron obtained by the statistical physics ap- decrease of the generalization error. Instead of being in- - proach (treating the random sampling the proper way) is versely proportional to the number of examples, the de- - shown by the red curve of Fig. 9. In contrast to the worst- crease is now exponentially fast. - casepredictionsoftheVCtheory,itispossibletohavesome ................................................generalization ability below VC dimension or capacity. As ◗ - - we might have expected, the generalization error decreases Bad Students and Good Students - monotonically, showing that the more that is learned, the - more that is understood. Asymptotically, the error is pro- Although the typical student perceptron has a smooth, - portional to Nand inversely proportional to m, in agree- monotonically decreasing learning curve, the possibility - ment with the VC predictions. This may not be true for that some concrete learning algorithm may result in a set - more complicated networks. of student couplings which are untypical in the sense of - our theory cannot be ruled out. For bad students, even non-................................................ ◗ monotic generalization behavior is possible. The problem - Query Learning of a concrete learning algorithm can be made to fit into the - statistical physics framework if the algorithm minimizes a - Soon after Gardner’s pioneering work, it was realized that certain cost function. Treating the achieved values of the - the approach of statistical physics is closely related to ideas new cost function as a macroscopic constraint, the tools of - in information theory and Bayesian statistics (Levin et al., statistical physics apply again. 
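Both query strategies mentioned above are easy to sketch. For a perceptron student, an input with no component along the current coupling vector lies exactly on the decision boundary and is therefore maximally ambiguous; in the committee variant, a randomly generated input is accepted as a training query only if the students disagree about its label. Function names and sizes below are illustrative.

    import numpy as np

    rng = np.random.default_rng(3)

    def orthogonal_query(w):
        """A random input with its component along the student vector removed."""
        x = rng.standard_normal(w.shape)
        return x - (x @ w) / (w @ w) * w

    def maximal_disagreement(x, students):
        """Accept x as a query only if the committee members disagree on its label."""
        labels = {int(np.sign(w @ x)) for w in students}
        return len(labels) > 1

    w_student = rng.standard_normal(20)
    q = orthogonal_query(w_student)
    print("query lies on the decision boundary:", abs(w_student @ q) < 1e-9)

    students = [rng.standard_normal(20) for _ in range(2)]
    batch = rng.standard_normal((100, 20))
    accepted = sum(maximal_disagreement(x, students) for x in batch)
    print("informative queries accepted:", accepted, "of", len(batch))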
- 1989;GyörgyiandTishby,1990;OpperandHaussler,1991), As an example, it is convenient to consider a case in - for which the reduction of an initial uncertainty about the which the teacher and the student have a different archi- - true state of a system (teacher) by observing data is a cen- tecture: In one of the simplest examples one tries to learn - tral topic of interest. The logarithm of the volume of rele- a classification problem by interpreting it as a regression - vant microstates as defined in the previous section is a di- problem, i.e., a problem of fitting a continuous function - rect measure for such uncertainty. The moderate progress through data points. To be specific, we study the situation - in generalization ability displayed by the red learning curve in which the teacher network is still given by a percep- - of Fig. 9 can be understood by the fact that as learning pro- tron which computes binary valued outputs of the form - gresses less information about the teacher is gained from a ywx, 1, but as the student we choose a network i i i - newrandomexample.Here,theinformationgainisdefined with a linear transfer function (the yellow curve in Fig. 1a) - as the reduction of the uncertainty when a new example is - learned. The decrease in information gain is due to the in- Y awxi i - crease in the generalization performance. This is plausible i - because inputs for which the majority of student networks and try to fit this linear expression to the binary labels of - give the correct answer are less informative than those for the teacher. If the number of couplings is sufficiently large - which a mistake is more likely. The situation changes if the (larger than the number of examples) the linear function - - - - - 770 VOLUME III / INTELLIGENT SYSTEMS 262-A1677 7/24/01 11:12 AM Page 771 - - - - - - - - LEARNING TO GENERALIZE - - - (unlike the sign) is perfectly able to fit arbitrary continuous the student learns all examples perfectly. Although it may - output values. This linear fit is an attempt to explain the not be easy to construct a learning algorithm which per- - data in a more complicated way than necessary, and the forms such a maximization in practice, the resulting gener- - couplings have to be finely tuned in order to achieve this alization error can be calculated using the statistical phys- - goal. We find that the student trained in such a way does ics approach (Engel and Van den Broeck, 1993). The result - not generalize well (Opper and Kinzel, 1995). In order to is in agreement with the VC theory: There is no prediction - compare the classifications of teacher and student on a new better than random guessing below the capacity. - random input after training, we have finally converted the Although the previous algorithms led to a behavior - student’s output into a classification label by taking the sign whichisworsethanthetypicalone,wenowexaminetheop- - of its output. As shown in the red curve of Fig. 10, after positecaseofanalgorithmwhichdoesbetter.Sincethegen- - an initial improvement of performance the generalization eralization ability of a neural network is related to the fact - error increases again to the random guessing value e0.5 that similar input vectors are mapped onto the same out- - at a1 (Fig. 10, red curve). 
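The linear student described above can be trained by ordinary least squares; when there are more parameters than examples, the minimum-norm solution plays the role of the perfect linear fit. Running the sketch below reproduces the qualitative shape of the red curve of Fig. 10, with the error climbing back towards random guessing near alpha = 1 before it decreases again. The sizes and the use of numpy's least-squares routine are illustrative choices only.

    import numpy as np

    rng = np.random.default_rng(4)
    N = 100
    teacher = rng.standard_normal(N)
    X_test = rng.standard_normal((20_000, N))
    y_test = np.sign(X_test @ teacher)

    for alpha in (0.25, 0.5, 0.75, 1.0, 1.5, 3.0, 6.0):
        m = int(alpha * N)
        X = rng.standard_normal((m, N))
        y = np.sign(X @ teacher)                        # binary labels from the teacher
        w, *_ = np.linalg.lstsq(X, y, rcond=None)       # linear fit to the +/-1 targets
        err = np.mean(np.sign(X_test @ w) != y_test)
        print(f"alpha = {alpha:4.2f}   generalization error = {err:.3f}")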
This phenomenon is called put, one can assume that such a property can be enhanced - overfitting.For a1 (i.e., for more data than parameters), if the separating gap between the two classes is maximized, - it is no longer possible to have a perfect linear fit through which defines a new cost function for an algorithm. This - the data, but a fit with a minimal deviation from a linear optimal margin perceptron can be practically realized and - function leads to the second part of the learning curve.ede- when applied to a set of data leads to the projection of - creases again and approaches 0 asymptotically for aSq. Fig. 11. As a remarkable result, it can be seen that there is a - This shows that when enough data are available, the details relatively large fraction of patterns which are located at the - of the training algorithm are less important. gap. These points are called support vectors(SVs). In order - The dependence of the generalization performance on to understand their importance for the generalization abil- - the complexity of the assumed data model is well-known. If ity, we make the following gedankenexperimentand assume - function class is used that is too complex, data values can be that all the points which lie outside the gap (the nonsupport - perfectly fitted but the predicted function will be very sen- vectors) are eliminated from the training set of examples. - sitive to the variations of the data sample, leading to very From the two-dimensional projection of Fig. 11, we may - unreliable predictions on novel inputs. On the other hand, conjecture that by running the maximal margin algorithm - functions that are too simple make the best fit almost insen- on the remaining examples (the SVs) we cannot create a - sitive to the data, which prevents us from learning enough larger gap between the points. Hence, the algorithm will - from them. converge to the same separating hyperplane as before. This - It is also possible to calculate the worst-case generaliza- intuitive picture is actually correct. If the SVs of a training - tion ability of perceptron students learning from a percep- set were known beforehand (unfortunately, they are only - tron teacher. The largest generalization error is obtained identified after running the algorithm), the margin classi- - (Fig. 7) when the angle between the coupling vectors of fier would have to be trained only on the SVs. It would au- - teacher and student is maximized under the constraint that tomatically classify the rest of the training inputs correctly. - - - - - - 0.50 - ε - 0.40 - - - 0.30 linear student - - - 0.20 - margin classifier - - 0.10 - - - 0.000123456 α - FIGURE 10 Learning curves for a linear student and for a FIGURE 11 Learning with a margin classifier and m300 - margin classifier. am/N. examples in an N150-dimensional space. - - - - - PART TWO / BUILDING BLOCKS FOR INTELLIGENCE SYSTEMS 771 262-A1677 7/24/01 11:12 AM Page 772 - - - - - - - - MANFRED OPPER - - - Hence, if in an actual classification experiment the number ber of consistent students is small; nevertheless, the few re- - of SVs is small compared to the number of non-SVs, we maining ones must still differ in a finite fraction of bits from - may expect a good generalization ability. each other and from the teacher so that perfect generaliza- - The learning curve for a margin classifier (Opper and tion is still impossible. For aslightly above a only the cou- c - Kinzel, 1995) learning from a perceptron teacher (calcu- plings of the teacher survive. 
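The optimal margin perceptron and the gedankenexperiment above can be tried out directly. The sketch below uses scikit-learn's linear support vector classifier purely as a convenient stand-in for a maximal-margin trainer and mirrors the sizes of Fig. 11 (m = 300 examples in N = 150 dimensions); refitting on the support vectors alone should recover essentially the same hyperplane.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(5)
    N, m = 150, 300
    teacher = rng.standard_normal(N)
    X = rng.standard_normal((m, N))
    y = np.sign(X @ teacher)                             # linearly separable labels

    clf = SVC(kernel="linear", C=1e6).fit(X, y)          # large C approximates a hard margin
    sv = clf.support_                                    # indices of the support vectors
    print("support vectors:", len(sv), "of", m)

    clf_sv = SVC(kernel="linear", C=1e6).fit(X[sv], y[sv])
    w1 = clf.coef_.ravel() / np.linalg.norm(clf.coef_)
    w2 = clf_sv.coef_.ravel() / np.linalg.norm(clf_sv.coef_)
    print("overlap between the two separating hyperplanes:", round(float(w1 @ w2), 6))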
- lated by the statistical physics approach) is shown in Fig. 10 - (blue curve). The concept of a margin classifier has recently ................................................ - been generalized to the so-called support vector machines ◗ - - Learning with Errors (Vapnik, 1995), for which the inputs of a perceptron are re- - placed by suitable features which are cleverly chosen non- - linear functions of the original inputs. In this way, nonlin- The example of the Ising perceptron teaches us that it will - ear separable rules can be learned, providing an interesting not always be simple to obtain zero training error. More- - alternative to multilayer networks. over, an algorithm trying to achieve this goal may get stuck - in local minima. Hence, the idea of allowing errors explic- - itly in the learning procedure, by introducing an appropri-................................................ ◗ ate noise, can make sense. An early analysis of such a sto- - The Ising Perceptron chastic training procedure and its generalization ability for - the learning in so-called Boolean networks (with elemen- - The approach of statistical physics can develop a specific tary computing units different from the ones used in neural - predictivepowerinsituationsinwhichonewouldliketoun- networks) can be found in Carnevali and Patarnello (1987). - derstand novel network models or architectures for which A stochastic algorithm can be useful to escape local min- - currently no efficient learning algorithm is known. As the ima of the training error, enabling a better learning of the - simplest example, we consider a perceptron for which the training set. Surprisingly, such a method can also lead to - couplings w are constrained to binary values 1 and 1 bettergeneralizationabilitiesiftheclassificationruleisalso j - (Gardner and Derrida, 1989; Györgyi, 1990; Seung et al., corrupted by some degree of noise (Györgyi and Tishby, - 1992b). For this so-called Ising perceptron(named after 1990). A stochastic training algorithm can be realized by - Ernst Ising, who studied coupled binary-valued elements as the Monte Carlo metropolis method, which was invented - a model for a ferromagnet), perfect learning of examples is to generate the effects of temperature in simulations of - equivalent to a difficult combinatorial optimization prob- physical systems. Any changes of the network couplings - lem (integer linear programming), which in the worst case which lead to a decrease of the training error during learn- - is believed to require a learning time that increases expo- ing are allowed. However, with some probability that in- - nentially with the number of couplings N. creases with the temperature, an increase of the training - To obtain the learning curve for the typical student, we error is also accepted. Although in principle this algorithm - can proceed as before, replacing V(e) by the number of may visit all the network’s configurations, for a large sys- - student configurations that are consistent with the teacher tem, with an overwhelming probability, only states close to - which results in changing the entropic term appropriately. some fixed training error will actually appear. The method - When the examples are provided by a teacher network of of statistical physics applied to this situation shows that for - thesamebinarytype,onecanexpectthatthegeneralization sufficiently large temperatures (T) we often obtain a quali- - error will decrease monotonically to zero as a function of a. 
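For a very small Ising perceptron the replacement described above, counting the binary coupling vectors that are consistent with the teacher, can be carried out by brute force. The sketch below (N = 15, so only 2**15 candidate students) shows how the consistent set shrinks, and how its mean overlap with the teacher grows, as alpha = m/N increases; it is a toy illustration only, since exhaustive enumeration is hopeless at realistic sizes.

    import numpy as np
    from itertools import product

    rng = np.random.default_rng(6)
    N = 15                                                   # odd N avoids zero fields
    teacher = rng.choice([-1, 1], size=N)
    students = np.array(list(product([-1, 1], repeat=N)))    # all 2**N Ising students

    for alpha in (0.5, 1.0, 1.5, 2.0):
        m = int(alpha * N)
        X = rng.choice([-1, 1], size=(m, N))
        y = np.sign(X @ teacher)
        consistent = students[np.all(np.sign(students @ X.T) == y, axis=1)]
        overlap = consistent @ teacher / N                   # overlap 1 means student == teacher
        print(f"alpha = {alpha:3.1f}   consistent students: {len(consistent):5d}"
              f"   mean overlap: {overlap.mean():.2f}")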
tatively correct picture if we repeat the approximate calcu- - The learning curve is shown as the blue curve in Fig. 9. For lation for the noise-free case and replace the relative num- - sufficiently small a, the discreteness of the couplings has al- ber of examples aby the effective number a/T.Hence, the - most no effect. However, in contrast to the continuous case, learning curves become essentially stretched and good gen- - perfect generalization does not require infinitely many ex- eralization ability is still possible at the price of an increase - amples but is achieved already at a finite number a 1.24. in necessary training examples. c - This is not surprising because the teacher’s couplings con- Within the stochastic framework, learning (with errors) - tain only a finite amount of information (one bit per cou- can now also be realized for the Ising perceptron, and it is - pling) and one would expect that it does not take much interesting to study the number of relevant student configu- - more than aboutNexamples to learn them. The remark- rations as a function of ein more detail (Fig. 12). The green - ableandunexpectedresultoftheanalysisisthefactthatthe curve is obtained for a small value ofawhere a strong maxi- - transition to perfect generalization is discontinuous. The mum with high generalization error exists. By increasing a, - generalization error decreases immediately from a non- this maximum decreases until it is the same as the second - zero value to zero. This gives an impression about the com- maximum at e0.5, indicating a transition like that of the - plex structure of the space of all consistent students and blue learning curve in Fig. 9. For larger a, the state of per- - also gives a hint as to why perfect learning in the Ising per- fect generalization should be the typical state. Neverthe- - ceptron is a difficult task. For aslightly below a, the num- less, if the stochastic algorithm starts with an initial state c - - - - 772 VOLUME III / INTELLIGENT SYSTEMS 262-A1677 7/24/01 11:12 AM Page 773 - - - - - - - - LEARNING TO GENERALIZE - - - - α lar model. Here, each hidden unit is connected to a dif- 1 ferent set of the input nodes. A further simplification is the - - - - - - - - - - log (number of students) α replacement of adaptive couplings from the hidden units to 2 - the output node by a prewired fixed function which maps - the states of the hidden units to the output. α3 Two such functions have been studied in great detail. - For the first one, the output gives just the majority vote of - α the hidden units—that is, if the majority of the hidden units 4 - α is negative, then the total output is negative, and vice versa. 4 >α3 >α2 >α1 This network is called a committee machine.For the second - 0 0.1 0.2 0.3 0.4 0.5 type of network, the parity machine,the output is the par- ε ity of the hidden outputs—that is, a minus results from an - FIGURE 12 Logarithm of the number of relevant Ising stu- odd number of negative hidden units and a plus from an - dents for different values of a. even number. For both types of networks, the capacity has - been calculated in the thermodynamic limit of a large num- - ber Nof (first layer) couplings (Barkai et al.,1990; Monas- - which has no resemblance to the (unknown) teacher (i.e., son and Zecchina, 1995). By increasing the number of hid- - with e0.5), it will spend time that increases exponentially den units (but always keeping it much smaller than N), - with Nin the smaller local maximum, the metastable state. 
the capacity per coupling (and the VC dimension) can be - Hence, a sudden transition to perfect generalization will be made arbitrarily large. Hence, the VC theory predicts that - observable only in examples which correspond to the blue the ability to generalize begins at a size of the training set - curve of Fig. 12, where this metastable state disappears. which increases with the capacity. The learning curves of - For large vales of a(yellow curve), the stochastic algorithm the typical parity machine (Fig. 14) being trained by a par- - will converge always to the state of perfect generalization. ity teacher for (from left to right) one, two, four, and six - On the other hand, since the state with e0.5 is always hidden units seem to partially support this prediction. - metastable, a stochastic algorithm which starts with the Belowacertainnumberofexamples,onlymemorization - teacher’s couplings will never drive the student out of the ofthelearnedpatternsoccursandnotgeneralization.Then, - state of perfect generalization. It should be made clear that a transition to nontrivial generalization takes place (Han- - the sharp phase transitions are the result of the thermody- sel et al.,1992; Opper, 1994). Far beyond the transition, the - namic limit, where the macroscopic state is entirely domi- decay of the learning curves becomes that of a simple per- - nated by the typical configurations. For simulations of any ceptron (black curve in Fig. 14) independent of the num- - finite system a rounding and softening of the transitions ber of hidden units, and this occurs much faster than for - will be observed. the bound given by VC theory. This shows that the typical - ................................................ learning curve can in fact be determined by more than one ◗ - - More Sophisticated Computations - Are Needed for Multilayer Networks 0.5 - ε - As a first step to understand the generalization perfor- 0.4 mance of multilayer networks, one can study an archi- 46 - tecture which is simpler than the fully connected one of - Fig. 1b. The tree architecture of Fig. 13 has become a popu- 0.3 2 - - 10.2 - - - 0.1 - - - 0.00.0 0.1 0.2 0.3 0.4 0.5 0.6 α - - FIGURE 14 Learning curves for the parity machine with - FIGURE 13 A two-layer network with tree architecture. tree architecture. Each curve represents the generalization er- - The arrow indicates the direction of propagation of the ror eas a function of aand is distinguished by the number of - information. hidden units of the network. - - - - PART TWO / BUILDING BLOCKS FOR INTELLIGENCE SYSTEMS 773 262-A1677 7/24/01 11:12 AM Page 774 - - - - - - - - MANFRED OPPER - - - complexity parameter. In contrast, the learning curve of the same similarity to every teacher perceptron. Although - the committee machine with the tree architecture of Fig. 13 this symmetric state allows for some degree of generaliza- - (Schwarze and Hertz, 1992) is smooth and resembles that tion, it is not able to recover the teacher’s rule completely. - of the simple perceptron. As the number of hidden units After a long plateau, the symmetry is broken and each of - is increased (keeping Nfixed and very large), the general- the student perceptrons specializes to one of the teacher - ization error increases, but despite the diverging VC di- perceptrons, and thus their similarity with the others is - mension the curves converge to a limiting one having an lost. 
This leads to a rapid (but continuous) decrease in the - asymptotic decay which is only twice as slow as that of the generalization error. Such types of learning curves with - perceptron. This is an example for which typical and worst- plateaus can actually be observed in applications of fully - case generalization behaviors are entirely different. connected multilayer networks. - Recently, more light has been shed on the relation be- ................................................tween average and worst-case scenarios of the tree com- ◗ - - mittee. A reduced worst-case scenario, in which a tree Outlook - committee teacher was to be learned from tree committee - students under an input distribution, has been analyzed The worst-case approach of the VC theory and the typical - from a statistical physics perspective (Urbanczik, 1996). As case approach of statistical physics are important theories - expected, few students show a much worse generalization for modeling and understanding the complexity of learning - ability than the typical one. Moreover, such students may to generalize from examples. Although the VC approach - also be difficult to find by most reasonable learning algo- plays an important role in a general theory of learnabil- - rithms because bad students require very fine tuning of ity, its practical applications for neural networks have been - their couplings. Calculation of the couplings with finite pre- limited by the overall generality of the approach. Since only - cision requires many bits per coupling that increases faster weak assumptions about probability distributions and ma- - than exponentially with aand which for sufficiently large a chines are considered by the theory, the estimates for gen- - willbebeyondthecapabilityofpracticalalgorithms.Hence, eralization errors have often been too pessimistic. Recent - it is expected that, in practice, a bad behavior will not be developments of the theory seem to overcome these prob- - observed. lems. By using modified VC dimensions, which depend on - Transitions of the generalization error such as those the data that have actually occurred and which in favorable - observed for the tree parity machine are a characteristic cases are much smaller than the general dimensions, more - feature of large systems which have a symmetry that can realistic results seem to be possible. For the support vec- - be spontaneously broken. To explain this, consider the sim- tor machines (Vapnik, 1995) (generalizations of the margin - plest case of two hidden units. The output of this parity ma- classifiers which allow for nonlinear boundaries that sepa- - chine does not change if we simultaneously change the sign rate the two classes), Vapnik and collaborators have shown - of all the couplings for both hidden units. Hence, if the the effectiveness of the modified VC results for selecting - teacher’s couplings are all equal to 1, a student with all the optimal type of model in practical applications. - couplings equal to 1 acts exactly as the same classifier. If The statistical physics approach, on the other hand, has - there are few examples in the training set, the entropic con- revealed new and unexpected behavior of simple network - tribution will dominate the typical behavior and the typi- models,suchasavarietyofphasetransitions.Whethersuch - cal students will display the same symmetry. Their coupling transitions play a cognitive role in animal or human brains - vectors will consist of positive and negative random num- is an exciting topic. 
Recent developments of the theory - bers. Hence, there is no preference for the teacher or the aim to understand dynamical problems of learning. For ex- - reversed one and generalization is not possible. If the num- ample, online learning (Saad, 1998), in which the problems - ber of examples is large enough, the symmetry is broken of learning and generalization are strongly mixed, has en- - and there are two possible types of typical students, one abled the study of complex multilayer networks and has - with more positive and the other one with more negative stimulated research on the development of optimized algo- - couplings. Hence, any of the typical students will show rithms. In addition to an extension of the approach to more - some similarity with the teacher (or it’s negative image) and complicated networks, an understanding of the robustness - generalization occurs. A similar type of symmetry break- of the typical behavior, and an interpolation to the other - ing also leads to a continuous phase transition in the fully extreme, the worst-case scenario is an important subject of - connected committee machine. This can be viewed as a research. - committee of perceptrons, one for each hidden unit, which - share the same input nodes. Any permutation of these per- Acknowledgments - ceptrons obviously leaves the output invariant. Again, if I thank members of the Department of Physics of Complex Sys- - few examples are learned, the typical state reflects the sym- tems at the Weizmann Institute in Rehovot, Israel, where parts of - metry. Each student perceptron will show approximately this article were written, for their warm hospitality. - - - - - - 774 VOLUME III / INTELLIGENT SYSTEMS 262-A1677 7/24/01 11:12 AM Page 775 - - - - - - - - LEARNING TO GENERALIZE - - - References Cited OPPER , M., and H AUSSLER , M. (1991). Generalization perfor- - mance of Bayes optimal classification algorithm for learning a - AMARI , S., and M URATA , N. (1993). Statistical theory of learning perceptron. Phys. Rev. Lett.66,2677. - curves under entropic loss. Neural Comput.5,140. OPPER , M., and K INZEL , W. (1995). Statistical mechanics of gen- - BARKAI , E., H ANSEL , D., and K ANTER , I. (1990). Statistical me- eralization. In Physics of Neural Networks III(J. L. van Hem- - chanics of a multilayered neural network. Phys. Rev. Lett.65, men, E. Domany, and K. Schulten, Eds.). Springer-Verlag, - 2312. New York. - BISHOP , C. M. (1995). Neural Networks for Pattern Recognition. SAAD , D. (Ed.) (1998). Online Learning in Neural Networks. - Clarendon/Oxford Univ. Press, Oxford/New York. Cambridge Univ. Press, New York. - CARNEVALI , P., and P ATARNELLO , S. (1987). Exhaustive thermo- SCHWARZE , H., and H ERTZ , J. (1992). Generalization in a large - dynamical analysis of Boolean learning networks. Europhys. committee machine. Europhys. Lett.20,375. - Lett.4,1199. SCHWARZE , H., and H ERTZ , J. (1993). Generalization in fully con- - COVER , T. M. (1965). Geometrical and statistical properties of nected committee machines. Europhys. Lett.21,785. - systems of linear inequalities with applications in pattern rec- SEUNG , H. S., S OMPOLINSKY , H., and T ISHBY , N. (1992a). Statis- - ognition. IEEE Trans. El. Comp.14,326. tical mechanics of learning from examples. Phys. Rev. A45, - ENGEL , A., and V AN DEN BROECK , C. (1993). Systems that can 6056. - learn from examples: Replica calculation of uniform conver- SEUNG , H. S., O PPER , M., and S OMPOLINSKY , H. (1992b). Query - gence bound for the perceptron. Phys. Rev. 
Lett.71,1772. by committee. InThe Proceedings of the Vth Annual Workshop - GARDNER ,E.(1988).Thespaceofinteractionsinneuralnetworks. on Computational Learning Theory (COLT92),p. 287. Associ- - J. Phys. A21,257. ation for Computing Machinery, New York. - GARDNER , E., and D ERRIDA , B. (1989). Optimal storage proper- SOMPOLINSKY , H., T ISHBY , N., and S EUNG , H. S. (1990). Learning - ties of neural network models. J. Phys. A21,271. from examples in large neural networks. Phys. Rev. Lett.65, - GYÖRGYI , G. (1990). First order transition to perfect generaliza- 1683. - tion in a neural network with binary synapses. Phys. Rev. A41, URBANCZIK , R. (1996). Learning in a large committee machine: - 7097. Worst case and average case. Europhys. Lett.35,553. - GYÖRGYI , G., and T ISHBY , N. (1990). Statistical theory of learn- VALLET , F., C AILTON , J., and R EFREGIER , P. (1989). Linear and - ing a rule. In Neural Networks and Spin Glasses: Proceedings nonlinear extension of the pseudo-inverse solution for learn- - of the STATPHYS 17 Workshop on Neural Networks and Spin ing Boolean functions. Europhys. Lett.9,315. - Glasses(W. K. Theumann and R. Koberle, Eds.). World Scien- VAPNIK , V. N. (1982). Estimation of Dependencies Based on Em- - tific, Singapore. pirical Data.Springer-Verlag, New York. - HANSEL , D., M ATO , G., and M EUNIER , C. (1992). Memorization VAPNIK , V. N. (1995). The Nature of Statistical Learning Theory. - without generalization in a multilayered neural network. Eu- Springer-Verlag, New York. - rophys. Lett.20,471. VAPNIK , V. N., and C HERVONENKIS , A. (1971). On the uniform - KINZEL , W., and R UJÀN , P. (1990). Improving a network general- convergence of relative frequencies of events to their probabil- - ization ability by selecting examples. Europhys. Lett.13,473. ities. Theory Probability Appl.16,254. - LEVIN ,E.,T ISHBY ,N.,andS OLLA ,S.(1989).Astatisticalapproach - to learning and generalization in neural networks. In Proceed- General References ings of the Second Workshop on Computational Learning The- - ory(R. Rivest, D. Haussler, and M. Warmuth, Eds.). Morgan ARBIB , M. A. (Ed.) (1995). The Handbook of Brain Theory and - Kaufmann, San Mateo, CA. Neural Networks.MIT Press, Cambridge, MA. - MÉZARD , M., P ARISI , G., and V IRASORO , M. A. (1987). Spin glass BERGER , J. O. (1985). Statistical Decision Theory and Bayesian - theory and beyond. In Lecture Notes in Physics,Vol. 9. World Analysis.Springer-Verlag, New York. - Scientific, Singapore. HERTZ ,J.A.,K ROGH ,A.,andP ALMER , R. G. (1991).Introduction - MONASSON , R., and Z ECCHINA , R. (1995). Weight space structure to the Theory of Neural Computation.Addison-Wesley, Red- - andinternalrepresentations:Adirectapproachtolearningand wood City, CA. - generalization in multilayer neural networks. Phys. Rev. Lett. MINSKY , M., and P APERT , S. (1969). Perceptrons.MIT Press, - 75,2432. Cambridge, MA. - OPPER , M. (1994). Learning and generalization in a two-layer WATKIN , T. L. H., R AU , A., and B IEHL , M. (1993). The statistical - neural network: The role of the Vapnik–Chervonenkis dimen- mechanics of learning a rule. Rev. Modern Phys.65,499. - sion. Phys. Rev. Lett.72,2113. 
\ No newline at end of file
diff --git a/Corpus/MIXED PRECISION TRAINING - Sharan Narang.txt b/Corpus/MIXED PRECISION TRAINING - Sharan Narang.txt
deleted file mode 100644
index 2be843af98faf7024c249fb5316551c77dbaedc3..0000000000000000000000000000000000000000
GIT binary patch
literal 0
HcmV?d00001