Processed various texts for the NN

This commit is contained in:
Eduardo Cueto Mendoza 2020-08-10 18:53:03 -06:00
parent 93b3da7a7d
commit e78ae20e92
12 changed files with 5546 additions and 1332 deletions




@@ -1,399 +0,0 @@
Learning Efficient Convolutional Networks through Network Slimming
Zhuang Liu 1 Jianguo Li 2 Zhiqiang Shen 3 Gao Huang 4 Shoumeng Yan 2 Changshui Zhang 1
1 CSAI, TNList, Tsinghua University 2 Intel Labs China 3 Fudan University 4 Cornell University
{liuzhuangthu, zhiqiangshen0214}@gmail.com,{jianguo.li, shoumeng.yan}@intel.com,
gh349@cornell.edu, zcs@mail.tsinghua.edu.cn
Abstract

The deployment of deep convolutional neural networks (CNNs) in many real-world applications is largely hindered by their high computational cost. In this paper, we propose a novel learning scheme for CNNs to simultaneously 1) reduce the model size; 2) decrease the run-time memory footprint; and 3) lower the number of computing operations, without compromising accuracy. This is achieved by enforcing channel-level sparsity in the network in a simple but effective way. Different from many existing approaches, the proposed method directly applies to modern CNN architectures, introduces minimum overhead to the training process, and requires no special software/hardware accelerators for the resulting models. We call our approach network slimming, which takes wide and large networks as input models, but during training insignificant channels are automatically identified and pruned afterwards, yielding thin and compact models with comparable accuracy. We empirically demonstrate the effectiveness of our approach with several state-of-the-art CNN models, including VGGNet, ResNet and DenseNet, on various image classification datasets. For VGGNet, a multi-pass version of network slimming gives a 20× reduction in model size and a 5× reduction in computing operations.

1. Introduction

In recent years, convolutional neural networks (CNNs) have become the dominant approach for a variety of computer vision tasks, e.g., image classification [22], object detection [8], and semantic segmentation [26]. Large-scale datasets, high-end modern GPUs and new network architectures allow the development of unprecedentedly large CNN models. For instance, from AlexNet [22], VGGNet [31] and GoogleNet [34] to ResNets [14], the ImageNet Classification Challenge winner models have evolved from 8 layers to more than 100 layers. (This work was done when Zhuang Liu and Zhiqiang Shen were interns at Intel Labs China. Jianguo Li is the corresponding author.)

However, larger CNNs, although with stronger representation power, are more resource-hungry. For instance, a 152-layer ResNet [14] has more than 60 million parameters and requires more than 20 Giga float-point-operations (FLOPs) when inferencing an image with resolution 224×224. This is unlikely to be affordable on resource-constrained platforms such as mobile devices, wearables or Internet of Things (IoT) devices.

The deployment of CNNs in real-world applications is mostly constrained by: 1) Model size: CNNs' strong representation power comes from their millions of trainable parameters. Those parameters, along with network structure information, need to be stored on disk and loaded into memory during inference time. As an example, storing a typical CNN trained on ImageNet consumes more than 300MB of space, which is a big resource burden to embedded devices. 2) Run-time memory: During inference time, the intermediate activations/responses of CNNs can take even more memory space than storing the model parameters, even with batch size 1. This is not a problem for high-end GPUs, but unaffordable for many applications with low computational power. 3) Number of computing operations: The convolution operations are computationally intensive on high-resolution images. A large CNN may take several minutes to process one single image on a mobile device, making it unrealistic to adopt for real applications.

Many works have been proposed to compress large CNNs or directly learn more efficient CNN models for fast inference. These include low-rank approximation [7], network quantization [3, 12] and binarization [28, 6], weight pruning [12], dynamic inference [16], etc. However, most of these methods can only address one or two of the challenges mentioned above. Moreover, some of the techniques require specially designed software/hardware accelerators for execution speedup [28, 6, 12].

Another direction to reduce the resource consumption of large CNNs is to sparsify the network. Sparsity can be imposed on different levels of structures [2, 37, 35, 29, 25], which yields considerable model-size compression and inference speedup.
Figure 1: We associate a scaling factor (reused from a batch normalization layer) with each channel in convolutional layers. Sparsity
regularization is imposed on these scaling factors during training to automatically identify unimportant channels. The channels with small
scaling factor values (in orange color) will be pruned (left side). After pruning, we obtain compact models (right side), which are then
fine-tuned to achieve comparable (or even higher) accuracy to that of the normally trained full network.
However, these approaches generally require special software/hardware accelerators to harvest the gain in memory or time savings, though this is easier than for the non-structured sparse weight matrices as in [12].

In this paper, we propose network slimming, a simple yet effective network training scheme, which addresses all the aforementioned challenges when deploying large CNNs under limited resources. Our approach imposes L1 regularization on the scaling factors in batch normalization (BN) layers, so it is easy to implement without introducing any change to existing CNN architectures. Pushing the values of the BN scaling factors towards zero with L1 regularization enables us to identify insignificant channels (or neurons), as each scaling factor corresponds to a specific convolutional channel (or a neuron in a fully-connected layer). This facilitates channel-level pruning in the following step. The additional regularization term rarely hurts the performance; in fact, in some cases it leads to higher generalization accuracy. Pruning unimportant channels may sometimes temporarily degrade the performance, but this effect can be compensated by the subsequent fine-tuning of the pruned network. After pruning, the resulting narrower network is much more compact in terms of model size, run-time memory, and computing operations compared to the initial wide network. The above process can be repeated several times, yielding a multi-pass network slimming scheme which leads to an even more compact network.

Experiments on several benchmark datasets and different network architectures show that we can obtain CNN models with up to 20x model-size compression and 5x reduction in computing operations of the original ones, while achieving the same or even higher accuracy. Moreover, our method achieves model compression and inference speedup with conventional hardware and deep learning software packages, since the resulting narrower model is free of any sparse storing format or computing operations.

2. Related Work

In this section, we discuss related work from five aspects.

Low-rank Decomposition approximates the weight matrix in neural networks with a low-rank matrix using techniques such as Singular Value Decomposition (SVD) [7]. This method works especially well on fully-connected layers, yielding roughly 3x model-size compression, however without notable speed acceleration, since computing operations in a CNN mainly come from the convolutional layers.

Weight Quantization. HashNet [3] proposes to quantize the network weights. Before training, network weights are hashed to different groups and within each group the weight value is shared. In this way only the shared weights and hash indices need to be stored, so a large amount of storage space can be saved. [12] uses an improved quantization technique in a deep compression pipeline and achieves 35x to 49x compression rates on AlexNet and VGGNet. However, these techniques can neither save run-time memory nor inference time, since during inference shared weights need to be restored to their original positions.
[28, 6] quantize real-valued weights into binary/ternary weights (weight values restricted to {-1, 1} or {-1, 0, 1}). This yields a large amount of model-size saving, and significant speedup can also be obtained given bitwise operation libraries. However, this aggressive low-bit approximation method usually comes with a moderate accuracy loss.

Weight Pruning / Sparsifying. [12] proposes to prune the unimportant connections with small weights in trained neural networks. The resulting network's weights are mostly zeros, so the storage space can be reduced by storing the model in a sparse format. However, these methods can only achieve speedup with dedicated sparse matrix operation libraries and/or hardware. The run-time memory saving is also very limited since most memory space is consumed by the activation maps (still dense) rather than the weights.
In [12], there is no guidance for sparsity during training. [32] overcomes this limitation by explicitly imposing a sparse constraint over each weight with additional gate variables, and achieves high compression rates by pruning connections with zero gate values. This method achieves a better compression rate than [12], but suffers from the same drawback.

Structured Pruning / Sparsifying. Recently, [23] proposes to prune channels with small incoming weights in trained CNNs, and then fine-tune the network to regain accuracy. [2] introduces sparsity by randomly deactivating input-output channel-wise connections in convolutional layers before training, which also yields smaller networks with moderate accuracy loss. Compared with these works, we explicitly impose channel-wise sparsity in the optimization objective during training, leading to a smoother channel pruning process and little accuracy loss.
[37] imposes neuron-level sparsity during training so that some neurons can be pruned to obtain compact networks. [35] proposes a Structured Sparsity Learning (SSL) method to sparsify different levels of structures (e.g. filters, channels or layers) in CNNs. Both methods utilize group sparsity regularization during training to obtain structured sparsity. Instead of resorting to group sparsity on convolutional weights, our approach imposes simple L1 sparsity on channel-wise scaling factors, so the optimization objective is much simpler.
Since these methods prune or sparsify parts of the network structures (e.g., neurons, channels) instead of individual weights, they usually require less specialized libraries (e.g. for sparse computing operations) to achieve inference speedup and run-time memory saving. Our network slimming also falls into this category, with absolutely no special libraries needed to obtain the benefits.

Neural Architecture Learning. While state-of-the-art CNNs are typically designed by experts [22, 31, 14], there are also some explorations on automatically learning network architectures. [20] introduces sub-modular/super-modular optimization for network architecture search with a given resource budget. Some recent works [38, 1] propose to learn neural architectures automatically with reinforcement learning. The searching space of these methods is extremely large, so one needs to train hundreds of models to distinguish good ones from bad ones. Network slimming can also be treated as an approach for architecture learning, although the choices are limited to the width of each layer. However, in contrast to the aforementioned methods, network slimming learns network architectures through only a single training process, which is in line with our goal of efficiency.

3. Network slimming

We aim to provide a simple scheme to achieve channel-level sparsity in deep CNNs. In this section, we first discuss the advantages and challenges of channel-level sparsity, and then introduce how we leverage the scaling layers in batch normalization to effectively identify and prune unimportant channels in the network.

Advantages of Channel-level Sparsity. As discussed in prior works [35, 23, 11], sparsity can be realized at different levels, e.g., weight-level, kernel-level, channel-level or layer-level. Fine-grained (e.g., weight-level) sparsity gives the highest flexibility and generality, and leads to a higher compression rate, but it usually requires special software or hardware accelerators to do fast inference on the sparsified model [11]. On the contrary, the coarsest layer-level sparsity does not require special packages to harvest the inference speedup, but it is less flexible since whole layers need to be pruned. In fact, removing layers is only effective when the depth is sufficiently large, e.g., more than 50 layers [35, 18]. In comparison, channel-level sparsity provides a nice tradeoff between flexibility and ease of implementation. It can be applied to any typical CNNs or fully-connected networks (treating each neuron as a channel), and the resulting network is essentially a "thinned" version of the unpruned network, which can be efficiently inferenced on conventional CNN platforms.

Challenges. Achieving channel-level sparsity requires pruning all the incoming and outgoing connections associated with a channel. This renders the method of directly pruning weights on a pre-trained model ineffective, as it is unlikely that all the weights at the input or output end of a channel happen to have near-zero values. As reported in [23], pruning channels on pre-trained ResNets can only lead to a reduction of about 10% in the number of parameters without suffering accuracy loss. [35] addresses this problem by enforcing sparsity regularization in the training objective. Specifically, they adopt group LASSO to push all the filter weights corresponding to the same channel towards zero simultaneously during training. However, this approach requires computing the gradients of the additional regularization term with respect to all the filter weights, which is nontrivial. We introduce a simple idea to address the above challenges, and the details are presented below.

Scaling Factors and Sparsity-induced Penalty. Our idea is to introduce a scaling factor γ for each channel, which is multiplied to the output of that channel. Then we jointly train the network weights and these scaling factors, with sparsity regularization imposed on the latter. Finally we prune those channels with small factors, and fine-tune the pruned network. Specifically, the training objective of our approach is given by

    L = Σ_(x,y) l(f(x, W), y) + λ Σ_(γ∈Γ) g(γ)    (1)

where (x, y) denote the training input and target, W denotes the trainable weights, the first sum-term corresponds to the normal training loss of a CNN, g(·) is a sparsity-induced penalty on the scaling factors, and λ balances the two terms. In our experiment, we choose g(s) = |s|, which is known as the L1-norm and is widely used to achieve sparsity. Subgradient descent is adopted as the optimization method for the non-smooth L1 penalty term. An alternative option is to replace the L1 penalty with the smooth-L1 penalty [30] to avoid using sub-gradients at the non-smooth point.
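As a concrete illustration of Equation 1, the sketch below adds the L1 penalty on all BN scaling factors to the task loss. It is a minimal PyTorch-style sketch, not the authors' Torch/Lua implementation; `model`, `criterion`, and the value of `lambda_` are assumed placeholders.

```python
import torch.nn as nn

def slimming_loss(model, criterion, inputs, targets, lambda_=1e-4):
    """Equation 1: normal training loss plus lambda * sum over |gamma|,
    where gamma are the per-channel scale factors of every BN layer."""
    task_loss = criterion(model(inputs), targets)
    l1_on_gamma = sum(m.weight.abs().sum()            # BN weight == channel scaling factor
                      for m in model.modules()
                      if isinstance(m, nn.BatchNorm2d))
    return task_loss + lambda_ * l1_on_gamma
```

Backpropagating through the absolute value yields exactly the subgradient update mentioned above (with zero subgradient at the non-smooth point).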
Figure 2: Flow-chart of the network slimming procedure (initial network → train with channel sparsity regularization → prune channels with small scaling factors → fine-tune the pruned network → compact network). The dotted line is for the multi-pass/iterative scheme.

As pruning a channel essentially corresponds to removing all the incoming and outgoing connections of that channel, we can directly obtain a narrow network (see Figure 1) without resorting to any special sparse computation packages. The scaling factors act as the agents for channel selection. As they are jointly optimized with the network weights, the network can automatically identify insignificant channels, which can be safely removed without greatly affecting the generalization performance.

Leveraging the Scaling Factors in BN Layers. Batch normalization [19] has been adopted by most modern CNNs as a standard approach to achieve fast convergence and better generalization performance. The way BN normalizes the activations motivates us to design a simple and efficient method to incorporate the channel-wise scaling factors. In particular, a BN layer normalizes the internal activations using mini-batch statistics. Let z_in and z_out be the input and output of a BN layer and B denote the current mini-batch; the BN layer performs the following transformation:

    ẑ = (z_in − μ_B) / sqrt(σ_B² + ε);    z_out = γ ẑ + β    (2)

where μ_B and σ_B are the mean and standard deviation values of the input activations over B, and γ and β are trainable affine transformation parameters (scale and shift) which provide the possibility of linearly transforming normalized activations back to any scale.

It is common practice to insert a BN layer after a convolutional layer, with channel-wise scaling/shifting parameters. Therefore, we can directly leverage the γ parameters in BN layers as the scaling factors we need for network slimming. This has the great advantage of introducing no overhead to the network. In fact, this is perhaps also the most effective way we can learn meaningful scaling factors for channel pruning. 1) If we add scaling layers to a CNN without BN layers, the values of the scaling factors are not meaningful for evaluating the importance of a channel, because both convolution layers and scaling layers are linear transformations; one can obtain the same results by decreasing the scaling factor values while amplifying the weights in the convolution layers. 2) If we insert a scaling layer before a BN layer, the scaling effect of the scaling layer will be completely canceled by the normalization process in BN. 3) If we insert a scaling layer after a BN layer, there are two consecutive scaling factors for each channel.

Channel Pruning and Fine-tuning. After training under channel-level sparsity-induced regularization, we obtain a model in which many scaling factors are near zero (see Figure 1). Then we can prune channels with near-zero scaling factors by removing all their incoming and outgoing connections and corresponding weights. We prune channels with a global threshold across all layers, which is defined as a certain percentile of all the scaling factor values. For instance, we prune the 70% of channels with lower scaling factors by choosing the percentile threshold as 70%. By doing so, we obtain a more compact network with fewer parameters, less run-time memory, and fewer computing operations.

Pruning may temporarily lead to some accuracy loss when the pruning ratio is high, but this can be largely compensated by the subsequent fine-tuning process on the pruned network. In our experiments, the fine-tuned narrow network can even achieve higher accuracy than the original unpruned network in many cases.
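The global percentile threshold described above can be computed directly from the trained BN scaling factors. The following PyTorch-style sketch is an illustration under the assumption that every pruned layer has a `BatchNorm2d`; it is not the released Torch implementation.

```python
import torch
import torch.nn as nn

def global_channel_masks(model, prune_ratio=0.7):
    """Keep a channel only if its |gamma| exceeds the global prune_ratio percentile."""
    all_gammas = torch.cat([m.weight.data.abs().flatten()
                            for m in model.modules()
                            if isinstance(m, nn.BatchNorm2d)])
    k = int(prune_ratio * all_gammas.numel())
    threshold = all_gammas.sort().values[k]
    return {name: m.weight.data.abs() > threshold      # True = keep this channel
            for name, m in model.named_modules()
            if isinstance(m, nn.BatchNorm2d)}
```

A compact network is then built by copying only the kept channels' incoming and outgoing weights into a new, narrower model (see Sec. 4.3).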
Multi-pass Scheme. We can also extend the proposed method from a single-pass learning scheme (training with sparsity regularization, pruning, and fine-tuning) to a multi-pass scheme. Specifically, a network slimming procedure results in a narrow network, on which we can again apply the whole training procedure to learn an even more compact model. This is illustrated by the dotted line in Figure 2. Experimental results show that this multi-pass scheme can lead to even better results in terms of compression rate.

Handling Cross Layer Connections and Pre-activation Structure. The network slimming process introduced above can be directly applied to most plain CNN architectures such as AlexNet [22] and VGGNet [31]. Some adaptations are required when it is applied to modern networks with cross layer connections and the pre-activation design, such as ResNet [15] and DenseNet [17]. For these networks, the output of a layer may be treated as the input of multiple subsequent layers, in which a BN layer is placed before the convolutional layer. In this case, the sparsity is achieved at the incoming end of a layer, i.e., the layer selectively uses a subset of the channels it receives. To harvest the parameter and computation savings at test time, we need to place a channel selection layer to mask out the insignificant channels we have identified.

4. Experiments

We empirically demonstrate the effectiveness of network slimming on several benchmark datasets. We implement our method based on the publicly available Torch [5] implementation for ResNets by [10]. The code is available at https://github.com/liuzhuang13/slimming.
(a) Test Errors on CIFAR-10

Model | Test error (%) | Parameters | Pruned | FLOPs | Pruned
VGGNet (Baseline) | 6.34 | 20.04M | - | 7.97×10^8 | -
VGGNet (70% Pruned) | 6.20 | 2.30M | 88.5% | 3.91×10^8 | 51.0%
DenseNet-40 (Baseline) | 6.11 | 1.02M | - | 5.33×10^8 | -
DenseNet-40 (40% Pruned) | 5.19 | 0.66M | 35.7% | 3.81×10^8 | 28.4%
DenseNet-40 (70% Pruned) | 5.65 | 0.35M | 65.2% | 2.40×10^8 | 55.0%
ResNet-164 (Baseline) | 5.42 | 1.70M | - | 4.99×10^8 | -
ResNet-164 (40% Pruned) | 5.08 | 1.44M | 14.9% | 3.81×10^8 | 23.7%
ResNet-164 (60% Pruned) | 5.27 | 1.10M | 35.2% | 2.75×10^8 | 44.9%

(b) Test Errors on CIFAR-100

Model | Test error (%) | Parameters | Pruned | FLOPs | Pruned
VGGNet (Baseline) | 26.74 | 20.08M | - | 7.97×10^8 | -
VGGNet (50% Pruned) | 26.52 | 5.00M | 75.1% | 5.01×10^8 | 37.1%
DenseNet-40 (Baseline) | 25.36 | 1.06M | - | 5.33×10^8 | -
DenseNet-40 (40% Pruned) | 25.28 | 0.66M | 37.5% | 3.71×10^8 | 30.3%
DenseNet-40 (60% Pruned) | 25.72 | 0.46M | 54.6% | 2.81×10^8 | 47.1%
ResNet-164 (Baseline) | 23.37 | 1.73M | - | 5.00×10^8 | -
ResNet-164 (40% Pruned) | 22.87 | 1.46M | 15.5% | 3.33×10^8 | 33.3%
ResNet-164 (60% Pruned) | 23.91 | 1.21M | 29.7% | 2.47×10^8 | 50.6%

(c) Test Errors on SVHN

Model | Test Error (%) | Parameters | Pruned | FLOPs | Pruned
VGGNet (Baseline) | 2.17 | 20.04M | - | 7.97×10^8 | -
VGGNet (60% Pruned) | 2.06 | 3.04M | 84.8% | 3.98×10^8 | 50.1%
DenseNet-40 (Baseline) | 1.89 | 1.02M | - | 5.33×10^8 | -
DenseNet-40 (40% Pruned) | 1.79 | 0.65M | 36.3% | 3.69×10^8 | 30.8%
DenseNet-40 (60% Pruned) | 1.81 | 0.44M | 56.6% | 2.67×10^8 | 49.8%
ResNet-164 (Baseline) | 1.78 | 1.70M | - | 4.99×10^8 | -
ResNet-164 (40% Pruned) | 1.85 | 1.46M | 14.5% | 3.44×10^8 | 31.1%
ResNet-164 (60% Pruned) | 1.81 | 1.12M | 34.3% | 2.25×10^8 | 54.9%
Table 1: Results on the CIFAR and SVHN datasets. "Baseline" denotes normal training without sparsity regularization. In column 1, "60% Pruned" denotes the fine-tuned model with 60% of channels pruned from the model trained with sparsity, and so on. The pruned ratios of parameters and FLOPs are shown in columns 4 and 6. Pruning a moderate amount (40%) of channels can mostly lower the test errors. The accuracy can typically be maintained with ≥60% of channels pruned.
4.1. Datasets

CIFAR. The two CIFAR datasets [21] consist of natural images with resolution 32×32. CIFAR-10 is drawn from 10 classes and CIFAR-100 from 100 classes. The training and test sets contain 50,000 and 10,000 images, respectively. On CIFAR-10, a validation set of 5,000 images is split from the training set for the search of λ (in Equation 1) on each model. We report the final test errors after training or fine-tuning on all training images. A standard data augmentation scheme (shifting/mirroring) [14, 18, 24] is adopted. The input data is normalized using channel means and standard deviations. We also compare our method with [23] on the CIFAR datasets.

SVHN. The Street View House Number (SVHN) dataset [27] consists of 32×32 colored digit images. Following common practice [9, 18, 24], we use all the 604,388 training images, from which we split a validation set of 6,000 images for model selection during training. The test set contains 26,032 images. During training, we select the model with the lowest validation error as the model to be pruned (or the baseline model). We also report the test errors of the models with the lowest validation errors during fine-tuning.

ImageNet. The ImageNet dataset contains 1.2 million training images and 50,000 validation images of 1000 classes. We adopt the data augmentation scheme as in [10]. We report the single-center-crop validation error of the final model.

MNIST. MNIST is a handwritten digit dataset containing 60,000 training images and 10,000 test images. To test the effectiveness of our method on a fully-connected network (treating each neuron as a channel with 1×1 spatial size), we compare our method with [35] on this dataset.
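For reference, the shifting/mirroring augmentation and per-channel normalization described for CIFAR can be written as the following torchvision pipeline. This is an assumed illustration: the mean/std values are placeholders that would be computed from the training set, and the paper's own preprocessing is implemented in Torch.

```python
import torchvision.transforms as T

# Placeholder per-channel statistics; in practice they are computed from the training images.
CIFAR_MEAN = (0.49, 0.48, 0.45)
CIFAR_STD = (0.25, 0.24, 0.26)

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),         # "shifting": zero-pad, then crop back to 32x32
    T.RandomHorizontalFlip(),            # "mirroring"
    T.ToTensor(),
    T.Normalize(CIFAR_MEAN, CIFAR_STD),  # per-channel mean/std normalization
])
```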
4.2. Network Models

On the CIFAR and SVHN datasets, we evaluate our method on three popular network architectures: VGGNet [31], ResNet [14] and DenseNet [17]. VGGNet is originally designed for ImageNet classification; for our experiment a variation of the original VGGNet for the CIFAR dataset is taken from [36]. For ResNet, a 164-layer pre-activation ResNet with bottleneck structure (ResNet-164) [15] is used. For DenseNet, we use a 40-layer DenseNet with growth rate 12 (DenseNet-40).

On the ImageNet dataset, we adopt the 11-layer (8-conv + 3 FC) "VGG-A" network [31] with batch normalization from [4]. We remove the dropout layers since we use relatively heavy data augmentation. To prune the neurons in fully-connected layers, we treat them as convolutional channels with 1×1 spatial size.

On the MNIST dataset, we evaluate our method on the same 3-layer fully-connected network as in [35].

4.3. Training, Pruning and Fine-tuning

Normal Training. We train all the networks normally from scratch as baselines. All the networks are trained using SGD. On the CIFAR and SVHN datasets we train using mini-batch size 64 for 160 and 20 epochs, respectively. The initial learning rate is set to 0.1, and is divided by 10 at 50% and 75% of the total number of training epochs. On the ImageNet and MNIST datasets, we train our models for 60 and 30 epochs respectively, with a batch size of 256, and an initial learning rate of 0.1 which is divided by 10 after 1/3 and 2/3 of the training epochs. We use a weight decay of 10^-4 and a Nesterov momentum [33] of 0.9 without dampening. The weight initialization introduced by [13] is adopted. Our optimization settings closely follow the original implementation at [10]. In all our experiments, we initialize all channel scaling factors to 0.5, since this gives higher accuracy for the baseline models compared with the default setting (all initialized to 1) from [10].

Training with Sparsity. For the CIFAR and SVHN datasets, when training with channel sparse regularization, the hyperparameter λ, which controls the tradeoff between empirical loss and sparsity, is determined by a grid search over 10^-3, 10^-4, 10^-5 on the CIFAR-10 validation set. For VGGNet we choose λ=10^-4 and for ResNet and DenseNet λ=10^-5. For VGG-A on ImageNet, we set λ=10^-5. All other settings are kept the same as in normal training.

Pruning. When we prune the channels of models trained with sparsity, a pruning threshold on the scaling factors needs to be determined. Unlike in [23], where different layers are pruned by different ratios, we use a global pruning threshold for simplicity. The pruning threshold is determined by a percentile among all scaling factors, e.g., 40% or 60% of channels are pruned. The pruning process is implemented by building a new narrower model and copying the corresponding weights from the model trained with sparsity.

Fine-tuning. After the pruning we obtain a narrower and more compact model, which is then fine-tuned. On the CIFAR, SVHN and MNIST datasets, the fine-tuning uses the same optimization setting as in training. For the ImageNet dataset, due to time constraints, we fine-tune the pruned VGG-A with a learning rate of 10^-3 for only 5 epochs.
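A hedged sketch of the optimization settings above, written with the modern PyTorch API rather than the original Torch code; `model` and `epochs` are assumed to be defined elsewhere.

```python
import torch
import torch.nn as nn

# SGD with Nesterov momentum 0.9 and weight decay 1e-4; dampening=0 is PyTorch's default.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, nesterov=True, weight_decay=1e-4)

# Learning rate 0.1, divided by 10 at 50% and 75% of the total epochs (CIFAR/SVHN schedule).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[epochs // 2, epochs * 3 // 4], gamma=0.1)

# Initialize every channel scaling factor (BN gamma) to 0.5 instead of the default 1.0.
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        nn.init.constant_(m.weight, 0.5)
```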
Figure 3: Model parameter and FLOP savings — comparison of pruned models with lower test errors on CIFAR-10 than the original models. The blue and green bars are the parameter and FLOP ratios between pruned and original models.

4.4. Results

CIFAR and SVHN. The results on CIFAR and SVHN are shown in Table 1. We mark all lowest test errors of a model in boldface.

Parameter and FLOP reductions. The purpose of network slimming is to reduce the amount of computing resources needed. The last row of each model has ≥60% of channels pruned while still maintaining accuracy similar to the baseline. The parameter saving can be up to 10×. The FLOP reductions are typically around 50%. To highlight network slimming's efficiency, we plot the resource savings in Figure 3. It can be observed that VGGNet has a large amount of redundant parameters that can be pruned. On ResNet-164 the parameter and FLOP savings are relatively insignificant; we conjecture this is because its "bottleneck" structure already functions as channel selection. Also, on CIFAR-100 the reduction rate is typically slightly lower than on CIFAR-10 and SVHN, which is possibly due to the fact that CIFAR-100 contains more classes.

Regularization Effect. From Table 1, we can observe that on ResNet and DenseNet, typically when 40% of channels are pruned, the fine-tuned network can achieve a lower test error than the original model. For example, DenseNet-40 with 40% of channels pruned achieves a test error of 5.19% on CIFAR-10, which is almost 1% lower than the original model. We hypothesize this is due to the regularization effect of L1 sparsity on channels, which naturally provides feature selection in intermediate layers of a network. We will analyze this effect in the next section.
Table 2: Results on ImageNet.

VGG-A | Baseline | 50% Pruned
Params | 132.9M | 23.2M
Params Pruned | - | 82.5%
FLOPs | 4.57×10^10 | 3.18×10^10
FLOPs Pruned | - | 30.4%
Validation Error (%) | 36.69 | 36.66

Table 3: Results on MNIST.

Model | Test Error (%) | Params Pruned | #Neurons
Baseline | 1.43 | - | 784-500-300-10
Pruned [35] | 1.53 | 83.5% | 434-174-78-10
Pruned (ours) | 1.49 | 84.4% | 784-100-60-10

Table 4: Results for the multi-pass scheme on the CIFAR-10 and CIFAR-100 datasets, using VGGNet. The baseline model has test errors of 6.34% and 26.74%. The "Trained" and "Fine-tuned" columns denote the test errors (%) of the model trained with sparsity and of the fine-tuned model after channel pruning, respectively. The parameter and FLOP pruned ratios correspond to the fine-tuned model in that row and the trained model in the next row.

(a) Multi-pass Scheme on CIFAR-10

Iter | Trained | Fine-tuned | Params Pruned | FLOPs Pruned
1 | 6.38 | 6.51 | 66.7% | 38.6%
2 | 6.23 | 6.11 | 84.7% | 52.7%
3 | 5.87 | 6.10 | 91.4% | 63.1%
4 | 6.19 | 6.59 | 95.6% | 77.2%
5 | 5.96 | 7.73 | 98.3% | 88.7%
6 | 7.79 | 9.70 | 99.4% | 95.7%

(b) Multi-pass Scheme on CIFAR-100

Iter | Trained | Fine-tuned | Params Pruned | FLOPs Pruned
1 | 27.72 | 26.52 | 59.1% | 30.9%
2 | 26.03 | 26.52 | 79.2% | 46.1%
3 | 26.49 | 29.08 | 89.8% | 67.3%
4 | 28.17 | 30.59 | 95.3% | 83.0%
5 | 30.04 | 36.35 | 98.3% | 93.5%
6 | 35.91 | 46.73 | 99.4% | 97.7%

ImageNet. The results for the ImageNet dataset are summarized in Table 2. When 50% of channels are pruned, the parameter saving is more than 5×, while the FLOP saving is only 30.4%. This is due to the fact that only 378 (out of 2752) channels from all the computation-intensive convolutional layers are pruned, while 5094 neurons (out of 8192) from the parameter-intensive fully-connected layers are pruned. It is worth noting that our method can achieve these savings with no accuracy loss on the 1000-class ImageNet dataset, where other methods for efficient CNNs [2, 23, 35, 28] mostly report accuracy loss.

MNIST. On the MNIST dataset, we compare our method with the Structured Sparsity Learning (SSL) method [35] in Table 3. Although our method is mainly designed to prune channels in convolutional layers, it also works well in pruning neurons in fully-connected layers. In this experiment, we observe that pruning with a global threshold sometimes completely removes a layer, so we prune 80% of the neurons in each of the two intermediate layers. Our method slightly outperforms [35], in that a slightly lower test error is achieved while pruning more parameters.

We provide some additional experimental results in the supplementary materials, including (1) the detailed structure of a compact VGGNet on CIFAR-10; (2) wall-clock time and run-time memory savings in practice; and (3) a comparison with a previous channel pruning method [23].

4.5. Results for Multi-pass Scheme

We employ the multi-pass scheme on the CIFAR datasets using VGGNet. Since there are no skip-connections, pruning away a whole layer will completely destroy the model. Thus, besides setting the percentile threshold as 50%, we also put a constraint that at each layer at most 50% of channels can be pruned.

The test errors of models in each iteration are shown in Table 4. As the pruning process goes on, we obtain more and more compact models. On CIFAR-10, the trained model achieves the lowest test error in iteration 5. This model achieves 20× parameter reduction and 5× FLOP reduction, while still achieving a lower test error. On CIFAR-100, after iteration 3, the test error begins to increase. This is possibly because it contains more classes than CIFAR-10, so pruning channels too aggressively will inevitably hurt the performance. However, we can still prune nearly 90% of the parameters and nearly 70% of the FLOPs without notable accuracy loss.
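Conceptually, the multi-pass scheme is just the single-pass pipeline applied repeatedly. The sketch below assumes hypothetical helpers `train_with_sparsity`, `prune_channels`, and `finetune` (they are not functions from the released code) and shows where the per-layer 50% cap enters.

```python
def multi_pass_slimming(model, n_iters=6, percentile=0.5, per_layer_cap=0.5):
    """Repeat train -> prune -> fine-tune (the dotted line in Figure 2)."""
    for _ in range(n_iters):
        model = train_with_sparsity(model)                    # L1 penalty on BN scaling factors
        model = prune_channels(model,
                               percentile=percentile,         # global 50% threshold
                               per_layer_cap=per_layer_cap)   # never drop >50% of any one layer
        model = finetune(model)                               # recover accuracy after pruning
    return model
```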
5. Analysis

There are two crucial hyper-parameters in network slimming: the pruned percentage t and the coefficient λ of the sparsity regularization term (see Equation 1). In this section, we analyze their effects in more detail.

Effect of Pruned Percentage. Once we obtain a model trained with sparsity regularization, we need to decide what percentage of channels to prune from the model. If we prune too few channels, the resource saving can be very limited. However, it could be destructive to the model if we prune too many channels, and it may not be possible to recover the accuracy by fine-tuning. We train a DenseNet-40 model with λ=10^-5 on CIFAR-10 to show the effect of pruning a varying percentage of channels. The results are summarized in Figure 5.

Figure 4: Distributions of scaling factors in a trained VGGNet under various degrees of sparsity regularization (controlled by the parameter λ). With the increase of λ, the scaling factors become sparser.

Figure 5: The effect of pruning varying percentages of channels, from DenseNet-40 trained on CIFAR-10 with λ=10^-5.

Figure 6: Visualization of the change in channel scaling factors along the training process, taken from the 11th conv-layer in VGGNet trained on CIFAR-10. Brighter color corresponds to larger value. The bright lines indicate the "selected" channels; the dark lines indicate channels that can be pruned.

From Figure 5, it can be concluded that the classification performance of the pruned or fine-tuned models degrades only when the pruning ratio surpasses a threshold. The fine-tuning process can typically compensate the possible accuracy loss caused by pruning. Only when the threshold goes beyond 80% does the test error of the fine-tuned model fall behind the baseline model. Notably, when trained with sparsity, even without fine-tuning, the model performs better than the original model. This is possibly due to the regularization effect of L1 sparsity on channel scaling factors.

Channel Sparsity Regularization. The purpose of the L1 sparsity term is to force many of the scaling factors to be near zero. The parameter λ in Equation 1 controls its significance compared with the normal training loss. In Figure 4 we plot the distributions of scaling factors in the whole network with different λ values. For this experiment we use a VGGNet trained on the CIFAR-10 dataset.

It can be observed that with the increase of λ, the scaling factors are more and more concentrated near zero. When λ=0, i.e., there is no sparsity regularization, the distribution is relatively flat. When λ=10^-4, almost all scaling factors fall into a small region near zero. This process can be seen as feature selection happening in intermediate layers of deep networks, where only channels with non-negligible scaling factors are chosen. We further visualize this process with a heatmap. Figure 6 shows the magnitude of scaling factors from one layer in VGGNet along the training process. Each channel starts with equal weight; as the training progresses, some channels' scaling factors become larger (brighter) while others become smaller (darker).

6. Conclusion

We proposed the network slimming technique to learn more compact CNNs. It directly imposes sparsity-induced regularization on the scaling factors in batch normalization layers, and unimportant channels can thus be automatically identified during training and then pruned. On multiple datasets, we have shown that the proposed method is able to significantly decrease the computational cost (up to 20×) of state-of-the-art networks, with no accuracy loss. More importantly, the proposed method simultaneously reduces the model size, run-time memory, and computing operations while introducing minimum overhead to the training process, and the resulting models require no special libraries/hardware for efficient inference.

Acknowledgements. Gao Huang is supported by the International Postdoctoral Exchange Fellowship Program of China Postdoctoral Council (No. 20150015). Changshui Zhang is supported by NSFC and DFG joint project NSFC 61621136008 / DFG TRR-169.
References

[1] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. In ICLR, 2017.
[2] S. Changpinyo, M. Sandler, and A. Zhmoginov. The power of sparsity in convolutional neural networks. arXiv preprint arXiv:1702.06257, 2017.
[3] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In ICML, 2015.
[4] S. Chintala. Training an object classifier in torch-7 on multiple gpus over imagenet. https://github.com/soumith/imagenet-multiGPU.torch.
[5] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
[6] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[7] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580-587, 2014.
[9] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
[10] S. Gross and M. Wilber. Training and investigating residual nets. https://github.com/szagoruyko/cifar.torch.
[11] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR, 2016.
[12] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural networks. In NIPS, pages 1135-1143, 2015.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, pages 630-645. Springer, 2016.
[16] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger. Multi-scale dense convolutional networks for efficient prediction. arXiv preprint arXiv:1703.09844, 2017.
[17] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
[18] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[20] J. Jin, Z. Yan, K. Fu, N. Jiang, and C. Zhang. Neural network architecture optimization through submodularity and supermodularity. arXiv preprint arXiv:1609.00074, 2016.
[21] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Tech report, 2009.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097-1105, 2012.
[23] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[24] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
[25] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806-814, 2015.
[26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431-3440, 2015.
[27] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
[29] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini. Group sparse regularization for deep neural networks. arXiv preprint arXiv:1607.00485, 2016.
[30] M. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for l1 regularization: A comparative study and two new approaches. In ECML, pages 286-297, 2007.
[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[32] S. Srinivas, A. Subramanya, and R. V. Babu. Training sparse neural networks. CoRR, abs/1611.06694, 2016.
[33] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, et al. Going deeper with convolutions. In CVPR, pages 1-9, 2015.
[35] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
[36] S. Zagoruyko. 92.5% on cifar-10 in torch. https://github.com/szagoruyko/cifar.torch.
[37] H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards compact cnns. In ECCV, 2016.
[38] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.


@@ -1,933 +0,0 @@
SECTION VI / MODEL NEURAL NETWORKS FOR COMPUTATION AND LEARNING

Learning to Generalize

MANFRED OPPER
Neural Computation Research Group
Aston University
Birmingham B4 7ET, United Kingdom

Theories that try to understand the ability of neural networks to generalize from learned examples are discussed. Also, an approach that is based on ideas from statistical physics, which aims to model typical learning behavior, is compared with a worst-case framework.
Introduction

Neural networks learn from examples. This statement is obviously true for the brain, but also artificial networks (or neural networks), which have become a powerful new tool for many pattern-recognition problems, adapt their "synaptic" couplings to a set of examples. Neural nets usually consist of many simple computing units which are combined in an architecture which is often independent from the problem. The parameters which control the interaction among the units can be changed during the learning phase and these are often called synaptic couplings. After the learning phase, a network adopts some ability to generalize from the examples; it can make predictions about inputs which it has not seen before; it has begun to understand a rule. To what extent is it possible to understand the complexity of learning from examples by mathematical models and their solutions? This question is the focus of this article.

I concentrate on the use of neural networks for classification. Here, one can take characteristic features (e.g., the pixels of an image) as an input pattern to the network. In the simplest case, it should decide whether a given pattern belongs (at least more likely) to a certain class of objects and respond with the output +1 or -1. To learn the underlying classification rule, the network is trained on a set of patterns together with the classification labels, which are provided by a trainer. A heuristic strategy for training is to tune the parameters of the machine (the couplings of the network) using a learning algorithm, in such a way that the errors made on the set of training examples are small, in the hope that this helps to reduce the errors on new data.
How well will the trained network be able to classify an input that it has not seen before? This performance on new data defines the generalization ability of the network. This ability will be affected by the problem of realizability: the network may not be sufficiently complex to learn the rule completely, or there may be ambiguities in classification. Here, I concentrate on a second problem arising from the fact that learning will mostly not be exhaustive and the information about the rule contained in the examples is not complete. Hence, the performance of a network may vary from one training set to another. In order to treat the generalization ability in a quantitative way, a common model assumes that all input patterns, those from the training set and the new one on which the network is tested, have a preassigned probability distribution (which characterizes the feature that must be classified), and they are produced independently at random with the same probability distribution from the network's environment. Sometimes the probability distribution used to extract the examples and the classification of these examples is called the rule. The network's performance on novel data can now be quantified by the so-called generalization error, which is the probability of misclassifying the test input and can be measured by repeating the same learning experiment many times with different data.

Within such a probabilistic framework, neural networks are often viewed as statistical adaptive models which should give a likely explanation of the observed data. In this framework, the learning process becomes mathematically related to a statistical estimation problem for optimal network parameters. Hence, mathematical statistics seems to be a most appropriate candidate for studying a neural network's behavior. In fact, various statistical approaches have been applied to quantify the generalization performance. For example, expressions for the generalization error have been obtained in the limit where the number of examples is large compared to the number of couplings (Seung et al., 1992; Amari and Murata, 1993). In such a case, one can expect that learning is almost exhaustive, such that the statistical fluctuations of the parameters around their optimal values are small. However, in practice the number of parameters is often large so that the network can be flexible, and it is not clear how many examples are needed for the asymptotic theory to become valid. The asymptotic theory may actually miss interesting behavior of the so-called learning curve, which displays the progress of generalization ability with an increasing amount of training data.

A second important approach, which was introduced into mathematical statistics in the 1970s by Vapnik and Chervonenkis (VC) (Vapnik, 1982, 1995), provides exact bounds for the generalization error which are valid for any number of training examples. Moreover, they are entirely independent of the underlying distribution of inputs, and for the case of realizable rules they are also independent of the specific algorithm, as long as the training examples are perfectly learned. Because it is able to cover even bad situations which are unfavorable for improvement of the learning process, it is not surprising that this theory may in some cases provide too pessimistic results which are also too crude to reveal interesting behavior in the intermediate region of the learning curve.

In this article, I concentrate mainly on a different approach, which has its origin in statistical physics rather than in mathematical statistics, and compare its results with the worst-case results. This method aims at studying the typical rather than the worst-case behavior and often enables exact calculations of the entire learning curve for models of simple networks which have many parameters. Since both biological and artificial neural networks are composed of many elements, it is hoped that such an approach may actually reveal some relevant and interesting structures.

At first, it may seem surprising that a problem should simplify when the number of its constituents becomes large. However, this phenomenon is well-known for macroscopic physical systems such as gases or liquids which consist of a huge number of molecules. Clearly, it is not possible to study the complete microscopic state of such a system, which is described by the rapidly fluctuating positions and velocities of all particles. On the other hand, macroscopic quantities such as density, temperature, and pressure are usually collective properties influenced by all elements. For such quantities, fluctuations are averaged out in the thermodynamic limit of a large number of particles and the collective properties become, to some extent, independent of the microstate. Similarly, the generalization ability of a neural network is a collective property of all the network parameters, and the techniques of statistical physics allow, at least for some simple but nontrivial models, for exact computations in the thermodynamic limit. Before explaining these ideas in detail, I provide a short description of feedforward neural networks.

Artificial Neural Networks

Based on highly idealized models of brain function, artificial neural networks are built from simple elementary computing units, which are sometimes termed neurons after their biological counterparts. Although hardware implementations have become an important research topic, neural nets are still simulated mostly on standard computers. Each computing unit of a neural net has a single output and several ingoing connections which receive the outputs of other units. To every ingoing connection (labeled by the index i) a real number is assigned, the synaptic weight w_i, which is the basic adjustable parameter of the network. To compute a unit's output, all incoming values x_i are multiplied by the weights w_i and then added.
FIGURE 1 (a) Example of the computation of an elementary unit (neuron) in a neural network. The numerical values assumed by the incoming inputs to the neuron and the weights of the synapses by which the inputs reach the neuron are indicated. The weighted sum of the inputs corresponds to the value of the abscissa at which the value of the activation function is calculated (bottom graph). Three functions are shown: sigmoid, linear, and step. (b) Scheme of a feedforward network. The arrow indicates the direction of propagation of information.
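A minimal sketch of the single-unit computation in Figure 1a: the weighted sum of the inputs is passed through one of the three activation functions shown. The function and variable names are illustrative, and tanh is used here as the sigmoidal curve taking values between -1 and 1.

```python
import math

def unit_output(inputs, weights, activation="step"):
    """Weighted sum of the incoming values followed by an activation function."""
    s = sum(w * x for w, x in zip(weights, inputs))
    if activation == "step":        # hard classification into +1 / -1
        return 1 if s >= 0 else -1
    if activation == "sigmoid":     # soft, graded output between -1 and 1
        return math.tanh(s)
    return s                        # linear output, used for fitting continuous functions
```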
Figure 1a shows an example of such a computation with three couplings. Finally, the result, Σ_i w_i x_i, is passed through an activation function which is typically of the shape of the red curve in Fig. 1a (a sigmoidal function), which allows for a soft, ambiguous classification between -1 and 1. Other important cases are the step function (green curve) and the linear function (yellow curve; used in the output neuron for problems of fitting continuous functions). In the following, to keep matters simple, I restrict the discussion mainly to the step function. Such simple units can develop a remarkable computational power when connected in a suitable architecture. An important network type is the feedforward architecture shown in Fig. 1b, which has two layers of computing units and adjustable couplings. The input nodes (which do not compute) are coupled to the so-called hidden units, which feed their outputs into one or more output units. With such an architecture and sigmoidal activation functions, any continuous function of the inputs can be arbitrarily closely approximated when the number of hidden units is sufficiently large.

The Perceptron

The simplest type of network is the perceptron (Fig. 2a). There are N inputs, N synaptic couplings w_i, and the output is simply

    a = Σ_{i=1..N} w_i x_i    [1]

It has a single-layer architecture and the step function (green curve in Fig. 1a) as its activation function.

FIGURE 2 (a) The perceptron. (b) Classification of inputs by a perceptron with two inputs. The arrow indicates the vector composed of the weights of the network, and the line perpendicular to this vector is the boundary between the classes of input.

Despite its simple structure, the perceptron can give a nontrivial generalization performance for many learning problems and may be used as a first step to an unknown classification task. As can be seen by comparing Figs. 2a and 1b, it is also a building block for the more complex multilayer networks. Hence, understanding its performance theoretically may also provide insight into the more complex machines. To learn a set of examples, a network must adjust its couplings appropriately (I often use the word couplings for their numerical strengths, the weights w_i, for i = 1,..., N). Remarkably, for the perceptron there exists a simple learning algorithm which always enables the network to find those parameter values whenever the examples can be learnt by a perceptron.
PART TWO / BUILDING BLOCKS FOR INTELLIGENCE SYSTEMS 765 262-A1677 7/24/01 11:12 AM Page 766
MANFRED OPPER
output is tested. Whenever a pattern is not classified cor-
rectly, all couplings are altered simultaneously. We increase x2
by a fixed amount all weights for which the input unit and
the correct value of the output neuron have the same sign
but we decrease them for the opposite sign. This simple
algorithm is reminiscent of the so-called Hebbian learning
rule,a physiological model of a learning processes in the
real brain. It assumes that synaptic weights are increased
when two neurons are simultaneously active. Rosenblatts
theorem states that in cases in which there exists a choice of
the w which classify correctly all of the examples (i.e., per- i
fectly learnable perceptron), this algorithm finds a solution
in a finite number of steps, which is at worst equal to A N 3 ,
where Ais an appropriate constant.
It is often useful to obtain an intuition of a perceptrons xa 1
classification performance by thinking in terms of a geo-
metric picture. We may view the numerical values of the in-
puts as the coordinates of a point in some (usually) high-
dimensional space. The case of two dimensions is shown
in Fig. 2b. A corresponding point is also constructed for the
couplings w.The arrow which points from the origin of the i
coordinate system to this latter point is called the weight
vector or coupling vector. An application of linear algebra
tothecomputationofthenetworkshowsthatthelinewhich
is perpendicular to the coupling vector is the boundary be-
tween inputs belonging to the two different classes. Input
points which are on the same side as the coupling vector are
classified as 1 (the green region in Fig. 2b) and those on
the other side as 1 (red region in Fig. 2b).
Rosenblatt's algorithm aims to determine such a line when it is possible. This picture generalizes to higher dimensions, for which a hyperplane plays the same role as the line of the previous two-dimensional example. We can still obtain an intuitive picture by projecting onto two-dimensional planes. In Fig. 3a, 200 input patterns with random coordinates (randomly labeled red and blue) in a 200-dimensional input space are projected onto the plane spanned by two arbitrary coordinate axes. If we instead use a plane for projection which contains the coupling vector (determined from a variant of Rosenblatt's algorithm), we obtain the view shown in Fig. 3b, in which the red and green points are clearly separated and there is even a gap between the two clouds.

FIGURE 3 (a) Projection of 200 random points (with random labels) from a 200-dimensional space onto the first two coordinate axes (x_1 and x_2). (b) Projection of the same points onto a plane which contains the coupling vector of a perfectly trained perceptron.
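A small numerical sketch of this projection argument. As a simplification of my own (not from the article), the labels are generated by a fixed coupling vector rather than by training on random labels, so a perfectly separating perceptron exists by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 200, 200
X = rng.standard_normal((m, N))           # random input patterns
w = rng.standard_normal(N)                # coupling vector of a perfectly trained perceptron
labels = np.sign(X @ w)                   # class of each pattern

# (a) projection onto two arbitrary coordinate axes: the classes look completely mixed
proj_a = X[:, :2]

# (b) projection onto a plane that contains the coupling vector: the classes separate
e1 = w / np.linalg.norm(w)                # first basis vector along the couplings
v = rng.standard_normal(N)
v -= (v @ e1) * e1                        # Gram-Schmidt step: second basis vector orthogonal to e1
e2 = v / np.linalg.norm(v)
proj_b = np.column_stack([X @ e1, X @ e2])

print(np.all(np.sign(proj_b[:, 0]) == labels))   # True: the line e1 = 0 separates the classes
```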
It is evident that there are cases in which the two sets of points are too mixed and there is no line in two dimensions (or no hyperplane in higher dimensions) which separates them. In these cases, the rule is too complex to be perfectly learned by a perceptron. If this happens, we must attempt to determine the choice of the couplings which minimizes the number of errors on a given set of examples. Here, Rosenblatt's algorithm does not work, and the problem of finding the minimum is much more difficult from the algorithmic point of view. The training error, which is the number of errors made on the training set, is usually a nonsmooth function of the network couplings (i.e., it may have large variations for small changes of the couplings). Hence, in general, in addition to the perfectly learnable perceptron case in which the final error is zero, minimizing the training error is usually a difficult task which could take a large amount of computer time. However, in practice, iterative approaches, which are based on the minimization of other smooth cost functions, are used to train a neural network (Bishop, 1995).

................................................ ◗

Capacity, VC Dimension, and Worst-Case Generalization

As previously shown, perceptrons are only able to realize a very restricted type of classification rules, the so-called linearly separable ones. Hence, independently from the issue of finding the best algorithm to learn the rule, one may ask
the following question: In how many cases will the perceptron be able to learn a given set of training examples perfectly if the output labels are chosen arbitrarily? In order to answer this question in a quantitative way, it is convenient to introduce some concepts such as capacity, VC dimension, and worst-case generalization, which can be used in the case of the perceptron and have a more general meaning.

In the case of perceptrons, this question was answered in the 1960s by Cover (1965). He calculated, for any set of, say, m input patterns, the fraction of all the 2^m possible mappings that can be linearly separated and are thus learnable by perceptrons. This fraction is shown in Fig. 4 as a function of the number of examples per coupling for different numbers of input nodes (couplings) N. Three regions can be distinguished:

Region in which m/N ≤ 1: Simple linear algebra shows that it is always possible to learn all mappings when the number m of input patterns is less than or equal to the number N of couplings (there are simply enough adjustable parameters).

Region in which m/N > 1: For this region, there are examples of rules that cannot be learned. However, when the number of examples is less than twice the number of couplings (m/N < 2), if the network is large enough almost all mappings can be learned. If the output labels for each of the m inputs are chosen randomly as +1 or −1 with equal probability, the probability of finding a nonrealizable coupling goes to zero exponentially when N goes to infinity at fixed ratio m/N.

Region in which m/N ≥ 2: For m/N > 2 the probability for a mapping to be realizable by perceptrons decreases to zero rapidly, and it goes to zero exponentially when N goes to infinity at fixed ratio m/N (it is proportional to exp[−N f(m/N)], where the function f(α) vanishes for α ≤ 2 and is positive for α > 2). Such a threshold phenomenon is an example of a phase transition (i.e., a sharp change of behavior) which can occur in the thermodynamic limit of a large network size.

FIGURE 4 Fraction of all mappings of m input patterns which are learnable by perceptrons as a function of m/N for different numbers of couplings N: N = 10 (in green), N = 20 (in blue), and N = 100 (in red).
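Cover's counting argument can be checked directly. The sketch below uses the standard formula C(m, N) = 2 Σ_{k=0}^{N−1} binom(m−1, k) for the number of linearly separable dichotomies of m points in general position; the function name and the printed ratios are my own choices:

```python
from math import comb

def separable_fraction(m, N):
    """Fraction of the 2**m labelings of m points in general position in N
    dimensions that a perceptron (hyperplane through the origin) can realize."""
    count = 2 * sum(comb(m - 1, k) for k in range(N))
    return count / 2 ** m

# The threshold behaviour of Fig. 4: close to 1 below m/N = 2, exactly 1/2 at
# m/N = 2, and collapsing toward 0 above it as N grows.
for N in (10, 20, 100):
    row = [round(separable_fraction(int(r * N), N), 3) for r in (1.0, 1.5, 2.0, 2.5, 3.0)]
    print(N, row)
```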
Generally, the point at which such a transition takes place defines the so-called capacity of the neural network. Although the capacity measures the ability of a network to learn random mappings of the inputs, it is also related to its ability to learn a rule (i.e., to generalize from examples). The question now is, how does the network perform on a new example after having been trained to learn m examples on the training set?

To obtain an intuitive idea of the connection between capacity and ability to generalize, we assume a training set of size m and a single pattern for test. Suppose we define a possible rule by an arbitrary learnable mapping from inputs to outputs. If m + 1 is much larger than the capacity, then for most rules the labels on the m training patterns which the perceptron is able to recognize will nearly uniquely determine the couplings (and consequently the answer of the learning algorithm on the test pattern), and the rule can be perfectly understood from the examples. Below capacity, in most cases there are two different choices of couplings which give opposite answers for the test pattern. Hence, a correct classification will occur with probability 0.5, assuming all rules to be equally probable. Figure 5 displays the two types of situations for m = 3 and N = 2.

FIGURE 5 Classification rules for four patterns based on a perceptron. The patterns colored in red represent the training examples, and triangles and circles represent different class labels. The question mark is a test pattern. (a) There are two possible ways of classifying the test point consistent with the examples; (b) only one classification is possible.

This intuitive connection can be sharpened. Vapnik and Chervonenkis established a relation between a capacity-like quantity and the generalization ability that is valid for general classifiers (Vapnik, 1982, 1995). The VC dimension is defined as the size of the largest set of inputs for which all mappings can be learned by the type of classifier. It equals N for the perceptron. Vapnik and Chervonenkis were able to show that for any training set of size m
larger than the VC dimension D_VC, the growth of the number of realizable mappings is bounded by an expression which grows much slower than 2^m (in fact, only like a polynomial in m).

They proved that a large difference between training error (i.e., the minimum percentage of errors that is made on the training set) and generalization error (i.e., the probability of producing an error on the test pattern after having learned the examples) of classifiers is highly improbable if the number of examples is well above D_VC. This theorem implies a small expected generalization error if perfect learning of the training set results. The expected generalization error is bounded by a quantity which increases proportionally to D_VC and decreases (neglecting logarithmic corrections in m) inversely proportional to m. Conversely, one can construct a worst-case distribution of input patterns, for which a size of the training set larger than D_VC is also necessary for good generalization. The VC results should, in practice, enable us to select the network with the proper complexity which guarantees the smallest bound on the generalization error. For example, in order to find the proper size of the hidden layer of a network with two layers, one could train networks of different sizes on the same data.

The relation among these concepts can be better understood if we consider a family of networks of increasing complexity which have to learn the same rule. A qualitative picture of the results is shown in Fig. 6. As indicated by the blue curve in Fig. 6, the minimal training error will decrease for increasing complexity of the nets. On the other hand, the VC dimension and the complexity of the networks increase with the increasing number of hidden units, leading to an increasing expected difference (confidence interval) between training error and generalization error, as indicated by the red curve. The sum of both (green curve) will have a minimum, giving the smallest bound on the generalization error. As discussed later, this procedure will in some cases lead to not very realistic estimates, owing to the rather pessimistic bounds of the theory. In other words, the rigorous bounds, which are obtained from an arbitrary network and rule, are much larger than those determined from the results for most of the networks and rules.

FIGURE 6 As the complexity of the network varies (i.e., the number of hidden units, as shown schematically below), the generalization error (in red), calculated from the sum of the training error (in green) and the confidence interval (in blue) according to the theory of Vapnik-Chervonenkis, shows a minimum; this corresponds to the network with the best generalization ability.

................................................ ◗

Typical Scenario: The Approach of Statistical Physics

When the number of examples is comparable to the size of the network, which for a perceptron equals the VC dimension, the VC theory states that one can construct malicious situations which prevent generalization. However, in general, we would not expect that the world acts as an adversary. Therefore, how should one model a typical situation?

As a first step, one may construct rules and pattern distributions which act together in a nonadversarial way. The teacher-student paradigm has proven to be useful in such a situation. Here, the rule to be learned is modeled by a second network, the teacher network; in this case, if the teacher and the student have the same architecture and the same number of units, the rule is evidently realizable. The correct class labels for any inputs are given by the outputs of the teacher. Within this framework, it is often possible to obtain simple expressions for the generalization error. For a perceptron, we can use the geometric picture to visualize the generalization error. A misclassification of a new input vector by a student perceptron with coupling vector ST occurs only if the input pattern lies between the separating planes (dashed region in Fig. 7) defined by ST and the vector of teacher couplings TE. If the inputs are drawn randomly from a uniform distribution, the generalization error is directly proportional to the angle between ST and TE. Hence, the generalization error is small when teacher and student vectors are close together and decreases to zero when both coincide.

FIGURE 7 For a uniform distribution of patterns, the generalization error of a perceptron equals the area of the shaded region divided by the area of the entire circle. ST and TE represent the coupling vectors of the student and teacher, respectively.
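The angle picture can be verified numerically. The sketch below is my own construction; Gaussian inputs are used as a stand-in for a rotationally symmetric input distribution, and the prediction ε = θ/π (θ being the angle between the two coupling vectors) is compared with a direct Monte Carlo estimate of the disagreement:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
TE = rng.standard_normal(N)                  # teacher couplings
ST = TE + 0.5 * rng.standard_normal(N)       # an imperfect student

# Geometric prediction: the probability of disagreement equals theta / pi.
cos_theta = TE @ ST / (np.linalg.norm(TE) * np.linalg.norm(ST))
eps_angle = np.arccos(cos_theta) / np.pi

# Monte Carlo estimate: fraction of random inputs classified differently.
X = rng.standard_normal((200_000, N))
eps_mc = np.mean(np.sign(X @ TE) != np.sign(X @ ST))
print(round(eps_angle, 4), round(eps_mc, 4))   # the two numbers agree closely
```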
In the limit, when the number of examples is very large, all the students which learn the training examples perfectly will not differ very much from each other, and their couplings will be close to those of the teacher. Such cases with a small generalization error have been successfully treated by asymptotic methods of statistics. On the other hand, when the number of examples is relatively small, there are many different students which are consistent with the teacher regarding the training examples, and the uncertainty about
the true couplings of the teacher is large. Possible generalization errors may range from zero (if, by chance, a learning algorithm converges to the teacher) to some worst-case value. We may say that the constraint which specifies the macrostate of the network (its training error) does not specify the microstate uniquely. Nevertheless, it makes sense to speak of a typical value for the generalization error, which is defined as the value which is realized by the majority of the students. In the thermodynamic limit known from statistical physics, in which the number of parameters of the network is taken to be large, we expect that in fact almost all students belong to this majority, provided the quantity of interest is a cooperative effect of all components of the system. As the geometric visualization for the generalization error of the perceptron shows, this is actually the case.

The following approach, which was pioneered by Elizabeth Gardner (Gardner, 1988; Gardner and Derrida, 1989), is based on the calculation of V(ε), the volume of the space of couplings which both perfectly implement m training examples and have a given generalization error ε. For an intuitive picture, consider that only discrete values for the couplings are allowed; then V(ε) would be proportional to the number of students. The typical value of the generalization error is the value of ε which maximizes V(ε). It should be kept in mind that V(ε) is a random number and fluctuates from one training set to another. A correct treatment of this randomness requires involved mathematical techniques (Mézard et al., 1987). To obtain a picture which is quite often qualitatively correct, we may replace it by its average over many realizations of training sets. From elementary probability theory we see that this average number can be found by calculating the volume A of the space of all students with generalization error ε, irrespective of their behavior on the training set, and multiplying it by the probability B that a student with generalization error ε gives m times the correct answers on independent drawings of the input patterns. Since A increases exponentially with the number of couplings N (like typical volumes in N-dimensional spaces) and B decreases exponentially with m (because it becomes more improbable to be correct m times for any ε > 0), both factors can balance each other when m increases like m = αN; α is an effective measure for the size of the training set when N goes to infinity. In order to have quantities which remain finite as N → ∞, it is also useful to take the logarithm of V(ε) and divide by N, which transforms the product into a sum of two terms.

The first one (which is often called the entropic term) increases with increasing generalization error (green curve in Fig. 8). This is true because there are many networks which are not similar to the teacher, but there is only one network equal to the teacher. For almost all networks (remember, the entropic term does not include the effect of the training examples) ε = 0.5, i.e., they are correct half of the time by random guessing. On the other hand, the second term (red curve in Fig. 8) decreases with increasing generalization error because the probability of being correct on an input pattern increases when the student network becomes more similar to the teacher. It is often called the energetic contribution because it favors highly ordered (toward the teacher) network states, reminiscent of the states of physical systems at low energies. Hence, there will be a maximum (Fig. 8, arrow) of V(ε) at some value of ε which by definition is the typical generalization error.

FIGURE 8 Logarithm of the average volume of students that have learned m examples and give ε generalization error (green curve). The blue and red curves represent the energetic and entropic contributions, respectively.
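As a rough illustration of this competition, the sketch below evaluates (1/N) log V(ε) in the annealed approximation, a simplification of the quenched average described in the text: an entropic part log sin(πε) (the log-fraction of coupling directions at overlap cos(πε) with the teacher) plus an energetic part α log(1 − ε) from being correct m = αN times. These functional forms are standard for the spherical perceptron but are my choice here, not taken from the article:

```python
import numpy as np

eps = np.linspace(1e-3, 0.5, 2000)

def log_volume(eps, alpha):
    entropic = np.log(np.sin(np.pi * eps))      # many students far from the teacher
    energetic = alpha * np.log(1.0 - eps)       # probability of alpha*N correct answers
    return entropic + energetic

for alpha in (0.5, 2.0, 8.0):
    eps_typical = eps[np.argmax(log_volume(eps, alpha))]   # position of the maximum (arrow in Fig. 8)
    print(f"alpha = {alpha:4.1f}   typical eps ~ {eps_typical:.3f}")
# With no examples the maximum sits at eps = 0.5; it moves toward zero as alpha grows.
```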
The development of the learning process as the number of examples αN increases can be understood as a competition between the entropic term, which favors disordered network configurations that are not similar to the teacher, and the energetic term. The latter term dominates when the number of examples is large. It will later be shown that such a competition can lead to a rich and interesting behavior as the number of examples is varied. The result for the learning curve (Györgyi and Tishby, 1990; Sompolinsky et al.,
1990) of a perceptron obtained by the statistical physics approach (treating the random sampling the proper way) is shown by the red curve of Fig. 9. In contrast to the worst-case predictions of the VC theory, it is possible to have some generalization ability below VC dimension or capacity. As we might have expected, the generalization error decreases monotonically, showing that the more that is learned, the more that is understood. Asymptotically, the error is proportional to N and inversely proportional to m, in agreement with the VC predictions. This may not be true for more complicated networks.

FIGURE 9 Learning curves for typical student perceptrons with continuous and discrete couplings. α = m/N is the ratio between the number of examples and the coupling number.

................................................ ◗

Query Learning

Soon after Gardner's pioneering work, it was realized that the approach of statistical physics is closely related to ideas in information theory and Bayesian statistics (Levin et al., 1989; Györgyi and Tishby, 1990; Opper and Haussler, 1991), for which the reduction of an initial uncertainty about the true state of a system (teacher) by observing data is a central topic of interest. The logarithm of the volume of relevant microstates as defined in the previous section is a direct measure for such uncertainty. The moderate progress in generalization ability displayed by the red learning curve of Fig. 9 can be understood by the fact that as learning progresses less information about the teacher is gained from a new random example. Here, the information gain is defined as the reduction of the uncertainty when a new example is learned. The decrease in information gain is due to the increase in the generalization performance. This is plausible because inputs for which the majority of student networks give the correct answer are less informative than those for which a mistake is more likely. The situation changes if the student is free to ask the teacher questions, i.e., if the student can choose highly informative input patterns. For the simple perceptron a fruitful query strategy is to select a new input vector which is perpendicular to the current coupling vector of the student (Kinzel and Ruján, 1990). Such an input is a highly ambiguous pattern because small changes in the student couplings produce different classification answers. For more complicated networks it may be difficult to obtain similar ambiguous inputs by an explicit construction. A general algorithm has been proposed (Seung et al., 1992a) which uses the principle of maximal disagreement in a committee of several students as a selection process for training patterns. Using an appropriate randomized training strategy, different students are generated which all learn the same set of examples. Next, any new input vector is only accepted for training when the disagreement of its classification between the students is maximal. For a committee of two students it can be shown that when the number of examples is large, the information gain does not decrease but reaches a positive constant. This results in a much faster decrease of the generalization error. Instead of being inversely proportional to the number of examples, the decrease is now exponentially fast.
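A toy sketch of query by maximal disagreement for a committee of two perceptron students. This is my own simplification: the randomized training of Seung et al. is replaced here by plain perceptron updates, and only the pattern-selection rule follows the description above:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
teacher = rng.standard_normal(N)
students = [rng.standard_normal(N) for _ in range(2)]

def update(w, x, y, lr=0.05):
    if np.sign(w @ x) != y:              # perceptron-style correction
        w += lr * y * x

accepted = 0
while accepted < 500:
    x = rng.standard_normal(N)           # candidate query
    votes = [np.sign(w @ x) for w in students]
    if votes[0] != votes[1]:             # keep only inputs on which the committee disagrees
        y = np.sign(teacher @ x)         # ask the teacher for the label
        for w in students:
            update(w, x, y)
        accepted += 1
```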
................................................ ◗

Bad Students and Good Students

Although the typical student perceptron has a smooth, monotonically decreasing learning curve, the possibility that some concrete learning algorithm may result in a set of student couplings which are untypical in the sense of our theory cannot be ruled out. For bad students, even nonmonotonic generalization behavior is possible. The problem of a concrete learning algorithm can be made to fit into the statistical physics framework if the algorithm minimizes a certain cost function. Treating the achieved values of the new cost function as a macroscopic constraint, the tools of statistical physics apply again.

As an example, it is convenient to consider a case in which the teacher and the student have a different architecture: In one of the simplest examples one tries to learn a classification problem by interpreting it as a regression problem, i.e., a problem of fitting a continuous function through data points. To be specific, we study the situation in which the teacher network is still given by a perceptron which computes binary valued outputs of the form y = sign(Σ_i w_i x_i) = ±1, but as the student we choose a network with a linear transfer function (the yellow curve in Fig. 1a)

Y = Σ_i w_i x_i

and try to fit this linear expression to the binary labels of the teacher. If the number of couplings is sufficiently large (larger than the number of examples) the linear function
(unlike the sign) is perfectly able to fit arbitrary continuous output values. This linear fit is an attempt to explain the data in a more complicated way than necessary, and the couplings have to be finely tuned in order to achieve this goal. We find that the student trained in such a way does not generalize well (Opper and Kinzel, 1995). In order to compare the classifications of teacher and student on a new random input after training, we have finally converted the student's output into a classification label by taking the sign of its output. As shown by the red curve of Fig. 10, after an initial improvement of performance the generalization error increases again to the random guessing value ε = 0.5 at α = 1. This phenomenon is called overfitting. For α > 1 (i.e., for more data than parameters), it is no longer possible to have a perfect linear fit through the data, but a fit with a minimal deviation from a linear function leads to the second part of the learning curve: ε decreases again and approaches 0 asymptotically for α → ∞. This shows that when enough data are available, the details of the training algorithm are less important.

The dependence of the generalization performance on the complexity of the assumed data model is well-known. If a function class is used that is too complex, data values can be perfectly fitted, but the predicted function will be very sensitive to the variations of the data sample, leading to very unreliable predictions on novel inputs. On the other hand, functions that are too simple make the best fit almost insensitive to the data, which prevents us from learning enough from them.

FIGURE 10 Learning curves for a linear student and for a margin classifier. α = m/N.
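The overfitting peak of the linear student can be reproduced in a few lines. This is a minimal sketch; the sizes, seeds, and the use of a minimal-norm least-squares fit are my own choices:

```python
import numpy as np

rng = np.random.default_rng(3)
N, trials = 50, 20

def gen_error(w_student, w_teacher, n_test=20_000):
    X = rng.standard_normal((n_test, N))
    return np.mean(np.sign(X @ w_student) != np.sign(X @ w_teacher))

for alpha in (0.25, 0.5, 1.0, 2.0, 4.0):
    m, errs = int(alpha * N), []
    for _ in range(trials):
        teacher = rng.standard_normal(N)
        X = rng.standard_normal((m, N))
        y = np.sign(X @ teacher)                     # binary labels of the perceptron teacher
        w, *_ = np.linalg.lstsq(X, y, rcond=None)    # linear student fitted by least squares
        errs.append(gen_error(w, teacher))
    print(f"alpha = {alpha:4.2f}   eps ~ {np.mean(errs):.3f}")
# The error first improves, climbs back toward 0.5 around alpha = 1, then decreases again.
```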
It is also possible to calculate the worst-case generalization ability of perceptron students learning from a perceptron teacher. The largest generalization error is obtained (Fig. 7) when the angle between the coupling vectors of teacher and student is maximized under the constraint that the student learns all examples perfectly. Although it may not be easy to construct a learning algorithm which performs such a maximization in practice, the resulting generalization error can be calculated using the statistical physics approach (Engel and Van den Broeck, 1993). The result is in agreement with the VC theory: There is no prediction better than random guessing below the capacity.

Although the previous algorithms led to a behavior which is worse than the typical one, we now examine the opposite case of an algorithm which does better. Since the generalization ability of a neural network is related to the fact that similar input vectors are mapped onto the same output, one can assume that such a property can be enhanced if the separating gap between the two classes is maximized, which defines a new cost function for an algorithm. This optimal margin perceptron can be practically realized and, when applied to a set of data, leads to the projection of Fig. 11. As a remarkable result, it can be seen that there is a relatively large fraction of patterns which are located at the gap. These points are called support vectors (SVs). In order to understand their importance for the generalization ability, we make the following gedankenexperiment and assume that all the points which lie outside the gap (the nonsupport vectors) are eliminated from the training set of examples. From the two-dimensional projection of Fig. 11, we may conjecture that by running the maximal margin algorithm on the remaining examples (the SVs) we cannot create a larger gap between the points. Hence, the algorithm will converge to the same separating hyperplane as before. This intuitive picture is actually correct. If the SVs of a training set were known beforehand (unfortunately, they are only identified after running the algorithm), the margin classifier would have to be trained only on the SVs. It would automatically classify the rest of the training inputs correctly.

FIGURE 11 Learning with a margin classifier and m = 300 examples in an N = 150-dimensional space.
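This gedankenexperiment is easy to try numerically. The sketch below assumes scikit-learn is available and uses a linear SVC with a large C as a stand-in for the hard-margin (optimal margin) perceptron; the sizes match those quoted for Fig. 11:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
N, m = 150, 300
teacher = rng.standard_normal(N)
X = rng.standard_normal((m, N))
y = np.sign(X @ teacher)

clf = SVC(kernel="linear", C=1e6).fit(X, y)       # large C approximates a hard margin
sv = clf.support_                                  # indices of the support vectors
print("support vectors:", len(sv), "out of", m)

# Retraining on the SVs alone gives (numerically) the same separating direction.
clf_sv = SVC(kernel="linear", C=1e6).fit(X[sv], y[sv])
w_full = clf.coef_.ravel() / np.linalg.norm(clf.coef_)
w_sv = clf_sv.coef_.ravel() / np.linalg.norm(clf_sv.coef_)
print("max difference between the two directions:", np.max(np.abs(w_full - w_sv)))
```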
Hence, if in an actual classification experiment the number of SVs is small compared to the number of non-SVs, we may expect a good generalization ability.

The learning curve for a margin classifier (Opper and Kinzel, 1995) learning from a perceptron teacher (calculated by the statistical physics approach) is shown in Fig. 10 (blue curve). The concept of a margin classifier has recently been generalized to the so-called support vector machines (Vapnik, 1995), for which the inputs of a perceptron are replaced by suitable features which are cleverly chosen nonlinear functions of the original inputs. In this way, nonlinearly separable rules can be learned, providing an interesting alternative to multilayer networks.

................................................ ◗

The Ising Perceptron

The approach of statistical physics can develop a specific predictive power in situations in which one would like to understand novel network models or architectures for which currently no efficient learning algorithm is known. As the simplest example, we consider a perceptron for which the couplings w_j are constrained to the binary values +1 and −1 (Gardner and Derrida, 1989; Györgyi, 1990; Seung et al., 1992b). For this so-called Ising perceptron (named after Ernst Ising, who studied coupled binary-valued elements as a model for a ferromagnet), perfect learning of examples is equivalent to a difficult combinatorial optimization problem (integer linear programming), which in the worst case is believed to require a learning time that increases exponentially with the number of couplings N.

To obtain the learning curve for the typical student, we can proceed as before, replacing V(ε) by the number of student configurations that are consistent with the teacher, which results in changing the entropic term appropriately. When the examples are provided by a teacher network of the same binary type, one can expect that the generalization error will decrease monotonically to zero as a function of α. The learning curve is shown as the blue curve in Fig. 9. For sufficiently small α, the discreteness of the couplings has almost no effect. However, in contrast to the continuous case, perfect generalization does not require infinitely many examples but is achieved already at a finite number α_c ≈ 1.24. This is not surprising because the teacher's couplings contain only a finite amount of information (one bit per coupling) and one would expect that it does not take much more than about N examples to learn them. The remarkable and unexpected result of the analysis is the fact that the transition to perfect generalization is discontinuous. The generalization error decreases immediately from a nonzero value to zero. This gives an impression of the complex structure of the space of all consistent students and also gives a hint as to why perfect learning in the Ising perceptron is a difficult task. For α slightly below α_c, the number of consistent students is small; nevertheless, the few remaining ones must still differ in a finite fraction of bits from each other and from the teacher, so that perfect generalization is still impossible. For α slightly above α_c only the couplings of the teacher survive.

................................................ ◗

Learning with Errors

The example of the Ising perceptron teaches us that it will not always be simple to obtain zero training error. Moreover, an algorithm trying to achieve this goal may get stuck in local minima. Hence, the idea of allowing errors explicitly in the learning procedure, by introducing an appropriate noise, can make sense. An early analysis of such a stochastic training procedure and its generalization ability for the learning in so-called Boolean networks (with elementary computing units different from the ones used in neural networks) can be found in Carnevali and Patarnello (1987). A stochastic algorithm can be useful to escape local minima of the training error, enabling a better learning of the training set. Surprisingly, such a method can also lead to better generalization abilities if the classification rule is also corrupted by some degree of noise (Györgyi and Tishby, 1990). A stochastic training algorithm can be realized by the Monte Carlo Metropolis method, which was invented to generate the effects of temperature in simulations of physical systems. Any changes of the network couplings which lead to a decrease of the training error during learning are allowed. However, with some probability that increases with the temperature, an increase of the training error is also accepted. Although in principle this algorithm may visit all the network configurations, for a large system, with an overwhelming probability, only states close to some fixed training error will actually appear. The method of statistical physics applied to this situation shows that for sufficiently large temperatures (T) we often obtain a qualitatively correct picture if we repeat the approximate calculation for the noise-free case and replace the relative number of examples α by the effective number α/T. Hence, the learning curves become essentially stretched, and good generalization ability is still possible at the price of an increase in the number of necessary training examples.
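A minimal Metropolis sketch for the Ising perceptron along these lines (the temperature, sizes, and step count are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(5)
N, m, T, steps = 51, 100, 0.5, 20_000            # odd N avoids ties in the sign
teacher = rng.choice([-1, 1], size=N)
X = rng.choice([-1, 1], size=(m, N))
y = np.sign(X @ teacher)

def training_error(w):
    return int(np.sum(np.sign(X @ w) != y))

w = rng.choice([-1, 1], size=N)                  # random initial student
E = training_error(w)
for _ in range(steps):
    j = rng.integers(N)
    w[j] *= -1                                    # propose flipping one binary coupling
    E_new = training_error(w)
    if E_new <= E or rng.random() < np.exp(-(E_new - E) / T):
        E = E_new                                 # accept the move
    else:
        w[j] *= -1                                # reject: undo the flip
print("final training error:", E, "out of", m)
```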
Within the stochastic framework, learning (with errors) can now also be realized for the Ising perceptron, and it is interesting to study the number of relevant student configurations as a function of ε in more detail (Fig. 12). The green curve is obtained for a small value of α where a strong maximum with high generalization error exists. By increasing α, this maximum decreases until it is the same as the second maximum at ε = 0.5, indicating a transition like that of the blue learning curve in Fig. 9. For larger α, the state of perfect generalization should be the typical state. Nevertheless, if the stochastic algorithm starts with an initial state
which has no resemblance to the (unknown) teacher (i.e., with ε = 0.5), it will spend time that increases exponentially with N in the smaller local maximum, the metastable state. Hence, a sudden transition to perfect generalization will be observable only in examples which correspond to the blue curve of Fig. 12, where this metastable state disappears. For large values of α (yellow curve), the stochastic algorithm will always converge to the state of perfect generalization. On the other hand, since the state with ε = 0.5 is always metastable, a stochastic algorithm which starts with the teacher's couplings will never drive the student out of the state of perfect generalization. It should be made clear that the sharp phase transitions are the result of the thermodynamic limit, where the macroscopic state is entirely dominated by the typical configurations. For simulations of any finite system a rounding and softening of the transitions will be observed.

FIGURE 12 Logarithm of the number of relevant Ising students for different values of α (α_4 > α_3 > α_2 > α_1).

................................................ ◗

More Sophisticated Computations Are Needed for Multilayer Networks

As a first step to understand the generalization performance of multilayer networks, one can study an architecture which is simpler than the fully connected one of Fig. 1b. The tree architecture of Fig. 13 has become a popular model. Here, each hidden unit is connected to a different set of the input nodes. A further simplification is the replacement of the adaptive couplings from the hidden units to the output node by a prewired fixed function which maps the states of the hidden units to the output.

FIGURE 13 A two-layer network with tree architecture. The arrow indicates the direction of propagation of the information.

Two such functions have been studied in great detail. For the first one, the output gives just the majority vote of the hidden units; that is, if the majority of the hidden units is negative, then the total output is negative, and vice versa. This network is called a committee machine. For the second type of network, the parity machine, the output is the parity of the hidden outputs; that is, a minus results from an odd number of negative hidden units and a plus from an even number. For both types of networks, the capacity has been calculated in the thermodynamic limit of a large number N of (first layer) couplings (Barkai et al., 1990; Monasson and Zecchina, 1995). By increasing the number of hidden units (but always keeping it much smaller than N), the capacity per coupling (and the VC dimension) can be made arbitrarily large. Hence, the VC theory predicts that the ability to generalize begins at a size of the training set which increases with the capacity. The learning curves of the typical parity machine (Fig. 14) being trained by a parity teacher for (from left to right) one, two, four, and six hidden units seem to partially support this prediction.

FIGURE 14 Learning curves for the parity machine with tree architecture. Each curve represents the generalization error ε as a function of α and is distinguished by the number of hidden units of the network.
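The two prewired output functions are easy to state in code. The sketch below (layout and names are mine) builds a tree architecture with K hidden units, each seeing its own block of N/K inputs:

```python
import numpy as np

rng = np.random.default_rng(6)
K, N = 3, 300                               # hidden units and total number of inputs
W = rng.standard_normal((K, N // K))        # first-layer couplings, one row per hidden unit

def hidden_states(x):
    blocks = x.reshape(K, N // K)           # disjoint receptive fields: the tree structure
    return np.sign(np.sum(W * blocks, axis=1))

def committee_machine(x):
    return np.sign(np.sum(hidden_states(x)))   # majority vote of the hidden units

def parity_machine(x):
    return np.prod(hidden_states(x))            # parity of the hidden outputs

x = rng.standard_normal(N)
print(committee_machine(x), parity_machine(x))
```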
Below a certain number of examples, only memorization of the learned patterns occurs and not generalization. Then, a transition to nontrivial generalization takes place (Hansel et al., 1992; Opper, 1994). Far beyond the transition, the decay of the learning curves becomes that of a simple perceptron (black curve in Fig. 14) independent of the number of hidden units, and this occurs much faster than for the bound given by VC theory. This shows that the typical learning curve can in fact be determined by more than one
complexity parameter. In contrast, the learning curve of the committee machine with the tree architecture of Fig. 13 (Schwarze and Hertz, 1992) is smooth and resembles that of the simple perceptron. As the number of hidden units is increased (keeping N fixed and very large), the generalization error increases, but despite the diverging VC dimension the curves converge to a limiting one having an asymptotic decay which is only twice as slow as that of the perceptron. This is an example for which typical and worst-case generalization behaviors are entirely different.

Recently, more light has been shed on the relation between average and worst-case scenarios of the tree committee. A reduced worst-case scenario, in which a tree committee teacher was to be learned from tree committee students under an input distribution, has been analyzed from a statistical physics perspective (Urbanczik, 1996). As expected, few students show a much worse generalization ability than the typical one. Moreover, such students may also be difficult to find by most reasonable learning algorithms, because bad students require very fine tuning of their couplings. Calculating the couplings with finite precision requires a number of bits per coupling that increases faster than exponentially with α and which for sufficiently large α will be beyond the capability of practical algorithms. Hence, it is expected that, in practice, a bad behavior will not be observed.

Transitions of the generalization error such as those observed for the tree parity machine are a characteristic feature of large systems which have a symmetry that can be spontaneously broken. To explain this, consider the simplest case of two hidden units. The output of this parity machine does not change if we simultaneously change the sign of all the couplings for both hidden units. Hence, if the teacher's couplings are all equal to +1, a student with all couplings equal to −1 acts exactly as the same classifier. If there are few examples in the training set, the entropic contribution will dominate the typical behavior and the typical students will display the same symmetry. Their coupling vectors will consist of positive and negative random numbers. Hence, there is no preference for the teacher or the reversed one and generalization is not possible. If the number of examples is large enough, the symmetry is broken and there are two possible types of typical students, one with more positive and the other one with more negative couplings. Hence, any of the typical students will show some similarity with the teacher (or its negative image) and generalization occurs. A similar type of symmetry breaking also leads to a continuous phase transition in the fully connected committee machine. This can be viewed as a committee of perceptrons, one for each hidden unit, which share the same input nodes. Any permutation of these perceptrons obviously leaves the output invariant. Again, if few examples are learned, the typical state reflects the symmetry. Each student perceptron will show approximately the same similarity to every teacher perceptron. Although this symmetric state allows for some degree of generalization, it is not able to recover the teacher's rule completely. After a long plateau, the symmetry is broken and each of the student perceptrons specializes to one of the teacher perceptrons, and thus their similarity with the others is lost. This leads to a rapid (but continuous) decrease in the generalization error. Such types of learning curves with plateaus can actually be observed in applications of fully connected multilayer networks.
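A short check of the sign-flip symmetry just described (a self-contained toy with K = 2 hidden units on a tree; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(7)
W = rng.standard_normal((2, 10))                     # couplings of the two hidden units

def parity_output(W, x):
    return np.prod(np.sign(np.sum(W * x.reshape(2, 10), axis=1)))

X = rng.standard_normal((1000, 20))
print(all(parity_output(W, x) == parity_output(-W, x) for x in X))
# True: a teacher and its sign-reversed image act as exactly the same classifier.
```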
................................................ ◗

Outlook

The worst-case approach of the VC theory and the typical case approach of statistical physics are important theories for modeling and understanding the complexity of learning to generalize from examples. Although the VC approach plays an important role in a general theory of learnability, its practical applications for neural networks have been limited by the overall generality of the approach. Since only weak assumptions about probability distributions and machines are considered by the theory, the estimates for generalization errors have often been too pessimistic. Recent developments of the theory seem to overcome these problems. By using modified VC dimensions, which depend on the data that have actually occurred and which in favorable cases are much smaller than the general dimensions, more realistic results seem to be possible. For the support vector machines (Vapnik, 1995) (generalizations of the margin classifiers which allow for nonlinear boundaries that separate the two classes), Vapnik and collaborators have shown the effectiveness of the modified VC results for selecting the optimal type of model in practical applications.

The statistical physics approach, on the other hand, has revealed new and unexpected behavior of simple network models, such as a variety of phase transitions. Whether such transitions play a cognitive role in animal or human brains is an exciting topic. Recent developments of the theory aim to understand dynamical problems of learning. For example, online learning (Saad, 1998), in which the problems of learning and generalization are strongly mixed, has enabled the study of complex multilayer networks and has stimulated research on the development of optimized algorithms. In addition to an extension of the approach to more complicated networks, an understanding of the robustness of the typical behavior, and an interpolation to the other extreme, the worst-case scenario, is an important subject of research.

Acknowledgments

I thank members of the Department of Physics of Complex Systems at the Weizmann Institute in Rehovot, Israel, where parts of this article were written, for their warm hospitality.
References Cited

Amari, S., and Murata, N. (1993). Statistical theory of learning curves under entropic loss. Neural Comput. 5, 140.
Barkai, E., Hansel, D., and Kanter, I. (1990). Statistical mechanics of a multilayered neural network. Phys. Rev. Lett. 65, 2312.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Clarendon/Oxford Univ. Press, Oxford/New York.
Carnevali, P., and Patarnello, S. (1987). Exhaustive thermodynamical analysis of Boolean learning networks. Europhys. Lett. 4, 1199.
Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. El. Comp. 14, 326.
Engel, A., and Van den Broeck, C. (1993). Systems that can learn from examples: Replica calculation of uniform convergence bound for the perceptron. Phys. Rev. Lett. 71, 1772.
Gardner, E. (1988). The space of interactions in neural networks. J. Phys. A 21, 257.
Gardner, E., and Derrida, B. (1989). Optimal storage properties of neural network models. J. Phys. A 21, 271.
Györgyi, G. (1990). First order transition to perfect generalization in a neural network with binary synapses. Phys. Rev. A 41, 7097.
Györgyi, G., and Tishby, N. (1990). Statistical theory of learning a rule. In Neural Networks and Spin Glasses: Proceedings of the STATPHYS 17 Workshop on Neural Networks and Spin Glasses (W. K. Theumann and R. Koberle, Eds.). World Scientific, Singapore.
Hansel, D., Mato, G., and Meunier, C. (1992). Memorization without generalization in a multilayered neural network. Europhys. Lett. 20, 471.
Kinzel, W., and Ruján, P. (1990). Improving a network generalization ability by selecting examples. Europhys. Lett. 13, 473.
Levin, E., Tishby, N., and Solla, S. (1989). A statistical approach to learning and generalization in neural networks. In Proceedings of the Second Workshop on Computational Learning Theory (R. Rivest, D. Haussler, and M. Warmuth, Eds.). Morgan Kaufmann, San Mateo, CA.
Mézard, M., Parisi, G., and Virasoro, M. A. (1987). Spin glass theory and beyond. In Lecture Notes in Physics, Vol. 9. World Scientific, Singapore.
Monasson, R., and Zecchina, R. (1995). Weight space structure and internal representations: A direct approach to learning and generalization in multilayer neural networks. Phys. Rev. Lett. 75, 2432.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett. 72, 2113.
Opper, M., and Haussler, M. (1991). Generalization performance of Bayes optimal classification algorithm for learning a perceptron. Phys. Rev. Lett. 66, 2677.
Opper, M., and Kinzel, W. (1995). Statistical mechanics of generalization. In Physics of Neural Networks III (J. L. van Hemmen, E. Domany, and K. Schulten, Eds.). Springer-Verlag, New York.
Saad, D. (Ed.) (1998). Online Learning in Neural Networks. Cambridge Univ. Press, New York.
Schwarze, H., and Hertz, J. (1992). Generalization in a large committee machine. Europhys. Lett. 20, 375.
Schwarze, H., and Hertz, J. (1993). Generalization in fully connected committee machines. Europhys. Lett. 21, 785.
Seung, H. S., Sompolinsky, H., and Tishby, N. (1992a). Statistical mechanics of learning from examples. Phys. Rev. A 45, 6056.
Seung, H. S., Opper, M., and Sompolinsky, H. (1992b). Query by committee. In Proceedings of the Vth Annual Workshop on Computational Learning Theory (COLT92), p. 287. Association for Computing Machinery, New York.
Sompolinsky, H., Tishby, N., and Seung, H. S. (1990). Learning from examples in large neural networks. Phys. Rev. Lett. 65, 1683.
Urbanczik, R. (1996). Learning in a large committee machine: Worst case and average case. Europhys. Lett. 35, 553.
Vallet, F., Cailton, J., and Refregier, P. (1989). Linear and nonlinear extension of the pseudo-inverse solution for learning Boolean functions. Europhys. Lett. 9, 315.
Vapnik, V. N. (1982). Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York.
Vapnik, V. N., and Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probability Appl. 16, 254.

General References

Arbib, M. A. (Ed.) (1995). The Handbook of Brain Theory and Neural Networks. MIT Press, Cambridge, MA.
Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.
Hertz, J. A., Krogh, A., and Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Minsky, M., and Papert, S. (1969). Perceptrons. MIT Press, Cambridge, MA.
Watkin, T. L. H., Rau, A., and Biehl, M. (1993). The statistical mechanics of learning a rule. Rev. Modern Phys. 65, 499.