SECTION VI / MODEL NEURAL NETWORKS FOR COMPUTATION AND LEARNING
MANFRED OPPER
Neural Computation Research Group
Aston University
Birmingham B4 7ET, United Kingdom

Theories that try to understand the ability of neural networks to generalize from learned examples are discussed. An approach based on ideas from statistical physics, which aims to model typical learning behavior, is also compared with a worst-case framework.
Learning to Generalize
................................................ ◗

Introduction

Neural networks learn from examples. This statement is obviously true for the brain, but artificial networks (or neural networks), which have become a powerful new tool for many pattern-recognition problems, also adapt their "synaptic" couplings to a set of examples. Neural nets usually consist of many simple computing units which are combined in an architecture which is often independent of the problem. The parameters which control the interaction among the units can be changed during the learning phase, and these are often called synaptic couplings. After the learning phase, a network acquires some ability to generalize from the examples; it can make predictions about inputs which it has not seen before; it has begun to understand a rule.

To what extent is it possible to understand the complexity of learning from examples by mathematical models and their solutions? This question is the focus of this article. I concentrate on the use of neural networks for classification. Here, one can take characteristic features (e.g., the pixels of an image) as an input pattern to the network. In the simplest case, it should decide whether a given pattern belongs (at least more likely) to a certain class of objects and respond with the output +1 or −1. To learn the underlying classification rule, the network is trained on a set of patterns together with the classification labels, which are provided by a trainer. A heuristic strategy for training is to tune the parameters of the machine (the couplings of the network) using a learning algorithm, in such a way that the errors made on the set of training examples are small, in
PART TWO / BUILDING BLOCKS FOR INTELLIGENCE SYSTEMS 763
the hope that this helps to reduce the errors on new data. How well will the trained network be able to classify an input that it has not seen before? This performance on new data defines the generalization ability of the network. This ability will be affected by the problem of realizability: The network may not be sufficiently complex to learn the rule completely, or there may be ambiguities in classification. Here, I concentrate on a second problem arising from the fact that learning will mostly not be exhaustive and the information about the rule contained in the examples is not complete. Hence, the performance of a network may vary from one training set to another. In order to treat the generalization ability in a quantitative way, a common model assumes that all input patterns, those from the training set and the new one on which the network is tested, have a preassigned probability distribution (which characterizes the feature that must be classified), and they are produced independently at random with the same probability distribution from the network's environment. Sometimes the probability distribution used to extract the examples and the classification of these examples is called the rule. The network's performance on novel data can now be quantified by the so-called generalization error, which is the probability of misclassifying a test input; it can be measured by repeating the same learning experiment many times with different data.

Within such a probabilistic framework, neural networks are often viewed as statistical adaptive models which should give a likely explanation of the observed data. In this framework, the learning process becomes mathematically related to a statistical estimation problem for optimal network parameters. Hence, mathematical statistics seems to be a most appropriate candidate for studying a neural network's behavior. In fact, various statistical approaches have been applied to quantify the generalization performance. For example, expressions for the generalization error have been obtained in the limit where the number of examples is large compared to the number of couplings (Seung et al., 1992; Amari and Murata, 1993). In such a case, one can expect that learning is almost exhaustive, such that the statistical fluctuations of the parameters around their optimal values are small. However, in practice the number of parameters is often large so that the network can be flexible, and it is not clear how many examples are needed for the asymptotic theory to become valid. The asymptotic theory may actually miss interesting behavior of the so-called learning curve, which displays the progress of generalization ability with an increasing amount of training data.

A second important approach, introduced into mathematical statistics in the 1970s by Vapnik and Chervonenkis (VC) (Vapnik, 1982, 1995), provides exact bounds for the generalization error which are valid for any number of training examples. Moreover, these bounds are entirely independent of the underlying distribution of inputs, and for the case of realizable rules they are also independent of the specific algorithm, as long as the training examples are perfectly learned. Because it is able to cover even bad situations which are unfavorable for improvement of the learning process, it is not surprising that this theory may in some cases provide results that are too pessimistic and too crude to reveal interesting behavior in the intermediate region of the learning curve.

In this article, I concentrate mainly on a different approach, which has its origin in statistical physics rather than in mathematical statistics, and compare its results with the worst-case results. This method aims at studying the typical rather than the worst-case behavior and often enables exact calculations of the entire learning curve for models of simple networks which have many parameters. Since both biological and artificial neural networks are composed of many elements, it is hoped that such an approach may actually reveal some relevant and interesting structures.

At first, it may seem surprising that a problem should simplify when the number of its constituents becomes large. However, this phenomenon is well known for macroscopic physical systems such as gases or liquids which consist of a huge number of molecules. Clearly, it is not possible to study the complete microscopic state of such a system, which is described by the rapidly fluctuating positions and velocities of all particles. On the other hand, macroscopic quantities such as density, temperature, and pressure are usually collective properties influenced by all elements. For such quantities, fluctuations are averaged out in the thermodynamic limit of a large number of particles, and the collective properties become, to some extent, independent of the microstate. Similarly, the generalization ability of a neural network is a collective property of all the network parameters, and the techniques of statistical physics allow, at least for some simple but nontrivial models, for exact computations in the thermodynamic limit. Before explaining these ideas in detail, I provide a short description of feedforward neural networks.

................................................ ◗

Artificial Neural Networks

Based on highly idealized models of brain function, artificial neural networks are built from simple elementary computing units, which are sometimes termed neurons after their biological counterparts. Although hardware implementations have become an important research topic, neural nets are still simulated mostly on standard computers. Each computing unit of a neural net has a single output and several ingoing connections which receive the outputs of other units. To every ingoing connection (labeled by the index i) a real number is assigned, the synaptic weight w_i, which is the basic adjustable parameter of the network. To compute a unit's output, all incoming values x_i are multi-
764 VOLUME III / INTELLIGENT SYSTEMS
FIGURE 1 (a) Example of the computation of an elementary unit (neuron) in a neural network. The numerical values assumed by the incoming inputs to the neuron and the weights of the synapses by which the inputs reach the neuron are indicated: inputs 0.6, −0.9, 0.8 with synaptic weights 1.6, −1.4, −0.1 give the weighted sum 1.6 × 0.6 + (−1.4) × (−0.9) + (−0.1) × 0.8 = 2.14. The weighted sum of the inputs corresponds to the value of the abscissa at which the value of the activation function is calculated (bottom graph). Three functions are shown: sigmoid, linear, and step. (b) Scheme of a feedforward network. The arrow indicates the direction of propagation of information.

plied by the weights w_i and then added. Figure 1a shows an example of such a computation with three couplings. Finally, the result, Σ_i w_i x_i, is passed through an activation function which is typically of the shape of the red curve in Fig. 1a (a sigmoidal function), which allows for a soft, ambiguous classification between +1 and −1. Other important cases are the step function (green curve) and the linear function (yellow curve; used in the output neuron for problems of fitting continuous functions). In the following, to keep matters simple, I restrict the discussion mainly to the step function. Such simple units can develop a remarkable computational power when connected in a suitable architecture. An important network type is the feedforward architecture shown in Fig. 1b, which has two layers of computing units and adjustable couplings. The input nodes (which do not compute) are coupled to the so-called hidden units, which feed their outputs into one or more output units. With such an architecture and sigmoidal activation functions, any continuous function of the inputs can be arbitrarily closely approximated when the number of hidden units is sufficiently large.

................................................ ◗

The Perceptron

The simplest type of network is the perceptron (Fig. 2a). There are N inputs, N synaptic couplings w_i, and the output is simply

    a = Σ_{i=1}^{N} w_i x_i        [1]

It has a single-layer architecture and the step function (green curve in Fig. 1a) as its activation function. Despite its simple structure, it can for many learning problems give a nontrivial generalization performance and may be used as a first step to an unknown classification task. As can be seen by comparing Figs. 2a and 1b, it is also a building block for the more complex multilayer networks. Hence, understanding its performance theoretically may also provide insight into the more complex machines. To learn a set of examples, a network must adjust its couplings appropriately (I often use the word couplings for their numerical strengths, the weights w_i, for i = 1, ..., N). Remarkably, for the perceptron there exists a simple learning algorithm which always enables the network to find those parameter values whenever the examples can be learnt by a perceptron. In Rosenblatt's algorithm, the input patterns are presented sequentially (e.g., in cycles) to the network and the

FIGURE 2 (a) The perceptron. (b) Classification of inputs by a perceptron with two inputs. The arrow indicates the vector composed of the weights of the network, and the line perpendicular to this vector is the boundary between the classes of input.
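The computation of a single unit can be written out in a few lines. The following sketch is my own illustration (the function names are not from the article); it reproduces the worked example of Fig. 1a, using the step activation to which the discussion is restricted:

```python
# Sketch of a single perceptron unit (Eq. [1]) with a step activation.
# The numbers reproduce the worked example of Fig. 1a.

def weighted_sum(weights, inputs):
    """Compute a = sum_i w_i * x_i."""
    return sum(w * x for w, x in zip(weights, inputs))

def step(a):
    """Step activation: classify as +1 or -1."""
    return 1 if a >= 0 else -1

weights = [1.6, -1.4, -0.1]   # synaptic couplings w_i
inputs = [0.6, -0.9, 0.8]     # input pattern x_i

a = weighted_sum(weights, inputs)
print(round(a, 2))   # 2.14, as in Fig. 1a
print(step(a))       # 1, i.e., the pattern is put in the +1 class
```

With a sigmoidal activation in place of `step`, the same weighted sum would instead yield a soft value between −1 and +1.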
output is tested. Whenever a pattern is not classified correctly, all couplings are altered simultaneously. We increase by a fixed amount all weights for which the input unit and the correct value of the output neuron have the same sign, but we decrease them for the opposite sign. This simple algorithm is reminiscent of the so-called Hebbian learning rule, a physiological model of learning processes in the real brain. It assumes that synaptic weights are increased when two neurons are simultaneously active. Rosenblatt's theorem states that in cases in which there exists a choice of the w_i which classify correctly all of the examples (i.e., a perfectly learnable perceptron), this algorithm finds a solution in a finite number of steps, which is at worst equal to AN³, where A is an appropriate constant.

It is often useful to obtain an intuition of a perceptron's classification performance by thinking in terms of a geometric picture. We may view the numerical values of the inputs as the coordinates of a point in some (usually) high-dimensional space. The case of two dimensions is shown in Fig. 2b. A corresponding point is also constructed for the couplings w_i. The arrow which points from the origin of the coordinate system to this latter point is called the weight vector or coupling vector. An application of linear algebra to the computation of the network shows that the line which is perpendicular to the coupling vector is the boundary between inputs belonging to the two different classes. Input points which are on the same side as the coupling vector are classified as +1 (the green region in Fig. 2b) and those on the other side as −1 (red region in Fig. 2b).

Rosenblatt's algorithm aims to determine such a line when it is possible. This picture generalizes to higher dimensions, for which a hyperplane plays the same role as the line of the previous two-dimensional example. We can still obtain an intuitive picture by projecting on two-dimensional planes. In Fig. 3a, 200 input patterns with random coordinates (randomly labeled red and blue) in a 200-dimensional input space are projected on the plane spanned by two arbitrary coordinate axes. If we instead use a plane for projection which contains the coupling vector (determined from a variant of Rosenblatt's algorithm), we obtain the view shown in Fig. 3b, in which the two sets of points are clearly separated and there is even a gap between the two clouds.

It is evident that there are cases in which the two sets of points are too mixed and there is no line in two dimensions (or no hyperplane in higher dimensions) which separates them. In these cases, the rule is too complex to be perfectly learned by a perceptron. If this happens, we must attempt to determine the choice of the couplings which minimizes the number of errors on a given set of examples. Here, Rosenblatt's algorithm does not work and the problem of finding the minimum is much more difficult from the algorithmic point of view. The training error, which is the number of errors made on the training set, is usually a nonsmooth function of the network couplings (i.e., it may have large variations for small changes of the couplings). Hence, in general, in addition to the perfectly learnable perceptron case in which the final error is zero, minimizing the training error is usually a difficult task which could take a large amount of computer time. However, in practice, iterative approaches, which are based on the minimization of other smooth cost functions, are used to train a neural network (Bishop, 1995).

FIGURE 3 (a) Projection of 200 random points (with random labels) from a 200-dimensional space onto the first two coordinate axes (x_1 and x_2). (b) Projection of the same points onto a plane which contains the coupling vector of a perfectly trained perceptron.

................................................ ◗

Capacity, VC Dimension, and Worst-Case Generalization

As previously shown, perceptrons are only able to realize a very restricted type of classification rules, the so-called linearly separable ones. Hence, independently from the issue of finding the best algorithm to learn the rule, one may ask
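Rosenblatt's procedure as described above, i.e., cycling through the patterns and nudging every coupling toward the correct answer on a misclassification, can be sketched as follows. The toy data set is invented for illustration and is linearly separable, so the algorithm is guaranteed to converge:

```python
# Minimal sketch of Rosenblatt's perceptron algorithm: patterns are
# presented in cycles, and whenever one is misclassified, each coupling
# is increased or decreased according to the sign of input and target
# (a Hebbian-style update).

def train_perceptron(patterns, labels, max_cycles=1000):
    n = len(patterns[0])
    w = [0.0] * n                      # couplings w_i, started at zero
    for _ in range(max_cycles):
        errors = 0
        for x, y in zip(patterns, labels):
            a = sum(wi * xi for wi, xi in zip(w, x))
            out = 1 if a > 0 else -1
            if out != y:               # misclassified: update all couplings
                w = [wi + y * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:                # every training example is learned
            return w
    return w                           # not separable within max_cycles

# An invented, linearly separable toy set (two classes in two dimensions).
patterns = [(2.0, 1.0), (1.0, 2.5), (-1.5, -0.5), (-0.5, -2.0)]
labels = [1, 1, -1, -1]
w = train_perceptron(patterns, labels)
# After convergence, the training error on this set is zero.
```

When the data are not linearly separable, the loop simply exhausts `max_cycles`, which mirrors the point made above: Rosenblatt's algorithm offers no answer in that case.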
the following question: In how many cases will the perceptron be able to learn a given set of training examples perfectly if the output labels are chosen arbitrarily? In order to answer this question in a quantitative way, it is convenient to introduce some concepts such as capacity, VC dimension, and worst-case generalization, which can be used in the case of the perceptron and have a more general meaning.

In the case of perceptrons, this question was answered in the 1960s by Cover (1965). He calculated, for any set of m input patterns, the fraction of all the 2^m possible mappings that can be linearly separated and are thus learnable by perceptrons. This fraction is shown in Fig. 4 as a function of the number of examples per coupling for different numbers of input nodes (couplings) N. Three regions can be distinguished:

Region in which m/N ≤ 1: Simple linear algebra shows that it is always possible to learn all mappings when the number m of input patterns is less than or equal to the number N of couplings (there are simply enough adjustable parameters).

Region in which m/N > 1: For this region, there are examples of rules that cannot be learned. However, when the number of examples is less than twice the number of couplings (m/N < 2), if the network is large enough almost all mappings can be learned. If the output labels for each of the m inputs are chosen randomly +1 or −1 with equal probability, the probability of finding a nonrealizable coupling goes to zero exponentially when N goes to infinity at fixed ratio m/N.

Region in which m/N > 2: For m/N > 2 the probability for a mapping to be realizable by perceptrons decreases rapidly, and it goes to zero exponentially when N goes to infinity at fixed ratio m/N (it is proportional to exp[−N f(m/N)], where the function f(α) vanishes for α ≤ 2 and is positive for α > 2). Such a threshold phenomenon is an example of a phase transition (i.e., a sharp change of behavior) which can occur in the thermodynamic limit of a large network size.

FIGURE 4 Fraction of all mappings of m input patterns which are learnable by perceptrons as a function of m/N for different numbers of couplings N: N = 10 (in green), N = 20 (in blue), and N = 100 (in red).

Generally, the point at which such a transition takes place defines the so-called capacity of the neural network. Although the capacity measures the ability of a network to learn random mappings of the inputs, it is also related to its ability to learn a rule (i.e., to generalize from examples). The question now is, how does the network perform on a new example after having been trained to learn m examples on the training set?

To obtain an intuitive idea of the connection between capacity and ability to generalize, we assume a training set of size m and a single pattern for test. Suppose we define a possible rule by an arbitrary learnable mapping from inputs to outputs. If m + 1 is much larger than the capacity, then for most rules the labels on the m training patterns which the perceptron is able to recognize will nearly uniquely determine the couplings (and consequently the answer of the learning algorithm on the test pattern), and the rule can be perfectly understood from the examples. Below capacity, in most cases there are two different choices of couplings which give opposite answers for the test pattern. Hence, a correct classification will occur with probability 0.5, assuming all rules to be equally probable. Figure 5 displays the two types of situations for m = 3 and N = 2.

This intuitive connection can be sharpened. Vapnik and Chervonenkis established a relation between a capacity-like quantity and the generalization ability that is valid for general classifiers (Vapnik, 1982, 1995). The VC dimension is defined as the size of the largest set of inputs for which all mappings can be learned by the given type of classifier. It equals N for the perceptron. Vapnik and Chervonenkis were able to show that for any training set of size m

FIGURE 5 Classification rules for four patterns based on a perceptron. The patterns colored in red represent the training examples, and triangles and circles represent different class labels. The question mark is a test pattern. (a) There are two possible ways of classifying the test point consistent with the examples; (b) only one classification is possible.
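Cover's counting result can be evaluated directly. For m points in general position and N couplings, the number of linearly separable dichotomies is C(m, N) = 2 Σ_{k=0}^{N−1} binom(m−1, k); the closed form is standard, though the article only describes the resulting curve (Fig. 4). A short sketch reproducing the three regions:

```python
# Cover's counting function: fraction of the 2^m dichotomies of m points
# in general position that a perceptron with N couplings can realize,
# C(m, N) / 2^m  with  C(m, N) = 2 * sum_{k=0}^{N-1} binom(m-1, k).

from math import comb

def separable_fraction(m, N):
    c = 2 * sum(comb(m - 1, k) for k in range(N))
    return c / 2**m

# The three regions discussed in the text:
print(separable_fraction(10, 10))   # m/N <= 1: all mappings learnable -> 1.0
print(separable_fraction(20, 10))   # m/N = 2: exactly half learnable -> 0.5
print(separable_fraction(40, 10))   # m/N = 4: fraction is nearly zero
```

The value 1/2 at m = 2N holds exactly for every N, which is why the curves for different N in Fig. 4 all cross at m/N = 2 before the transition sharpens into a step in the large-N limit.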
larger than the VC dimension D_VC, the growth of the number of realizable mappings is bounded by an expression which grows much more slowly than 2^m (in fact, only like a polynomial in m).

They proved that a large difference between training error (i.e., the minimum percentage of errors that is made on the training set) and generalization error (i.e., the probability of producing an error on the test pattern after having learned the examples) of classifiers is highly improbable if the number of examples is well above D_VC. This theorem implies a small expected generalization error when perfect learning of the training set results. The expected generalization error is bounded by a quantity which increases proportionally to D_VC and decreases (neglecting logarithmic corrections in m) inversely proportionally to m.

Conversely, one can construct a worst-case distribution of input patterns for which a training set of size larger than D_VC is also necessary for good generalization. The VC results should, in practice, enable us to select the network with the proper complexity which guarantees the smallest bound on the generalization error. For example, in order to find the proper size of the hidden layer of a network with two layers, one could train networks of different sizes on the same data.

The relation among these concepts can be better understood if we consider a family of networks of increasing complexity which have to learn the same rule. A qualitative picture of the results is shown in Fig. 6. As indicated by the blue curve in Fig. 6, the minimal training error will decrease for increasing complexity of the nets. On the other hand, the VC dimension and the complexity of the networks increase with the increasing number of hidden units, leading to an increasing expected difference (confidence interval) between training error and generalization error, as indicated by the red curve. The sum of both (green curve) will have a minimum, giving the smallest bound on the generalization error. As discussed later, this procedure will in some cases lead to not very realistic estimates because of the rather pessimistic bounds of the theory. In other words, the rigorous bounds, which hold for an arbitrary network and rule, are much larger than those determined from the results for most of the networks and rules.

FIGURE 6 As the complexity of the network varies (i.e., the number of hidden units, as shown schematically below), the generalization error (in red), calculated from the sum of the training error (in green) and the confidence interval (in blue) according to the theory of Vapnik–Chervonenkis, shows a minimum; this corresponds to the network with the best generalization ability.

................................................ ◗

Typical Scenario: The Approach of Statistical Physics

When the number of examples is comparable to the size of the network, which for a perceptron equals the VC dimension, the VC theory states that one can construct malicious situations which prevent generalization. However, in general, we would not expect that the world acts as an adversary. Therefore, how should one model a typical situation?

As a first step, one may construct rules and pattern distributions which act together in a nonadversarial way. The teacher–student paradigm has proven to be useful in such a situation. Here, the rule to be learned is modeled by a second network, the teacher network; in this case, if the teacher and the student have the same architecture and the same number of units, the rule is evidently realizable. The correct class labels for any inputs are given by the outputs of the teacher. Within this framework, it is often possible to obtain simple expressions for the generalization error. For a perceptron, we can use the geometric picture to visualize the generalization error. A misclassification of a new input vector by a student perceptron with coupling vector ST occurs only if the input pattern is between the separating planes (dashed region in Fig. 7) defined by ST and the vector of teacher couplings TE. If the inputs are drawn randomly from a uniform distribution, the generalization error is directly proportional to the angle between ST and TE. Hence, the generalization error is small when teacher and student vectors are close together and decreases to zero when both coincide.

In the limit when the number of examples is very large, all the students which learn the training examples perfectly will not differ very much from each other, and their couplings will be close to those of the teacher. Such cases with a small generalization error have been successfully treated by asymptotic methods of statistics. On the other hand, when the number of examples is relatively small, there are many different students which are consistent with the teacher regarding the training examples, and the uncertainty about
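The proportionality between generalization error and angle can be checked numerically. Writing it as ε = θ/π, where θ is the angle between the student and teacher coupling vectors (the constant 1/π is my normalization of the proportionality stated above), a Monte Carlo sketch with illustrative vector values:

```python
# Monte Carlo check of the geometric picture: for isotropically
# distributed inputs, teacher and student perceptrons disagree exactly
# when the input falls between their two separating planes, so the
# disagreement probability is theta/pi.

import random

def disagreement(teacher, student, trials=200000, seed=0):
    rng = random.Random(seed)
    n = len(teacher)
    errors = 0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]   # isotropic input
        st = sum(w * xi for w, xi in zip(student, x))
        te = sum(w * xi for w, xi in zip(teacher, x))
        errors += (st * te < 0)                       # opposite outputs
    return errors / trials

teacher = [1.0, 0.0]
student = [1.0, 1.0]      # at angle pi/4 to the teacher
eps = disagreement(teacher, student)
# eps should be close to (pi/4) / pi = 0.25
```

A Gaussian input distribution is used here because it is isotropic; any spherically uniform distribution gives the same angle-proportional result.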
FIGURE 7 For a uniform distribution of patterns, the generalization error of a perceptron equals the area of the shaded region divided by the area of the entire circle. ST and TE represent the coupling vectors of the student and teacher, respectively.

the true couplings of the teacher is large. Possible generalization errors may range from zero (if, by chance, a learning algorithm converges to the teacher) to some worst-case value. We may say that the constraint which specifies the macrostate of the network (its training error) does not specify the microstate uniquely. Nevertheless, it makes sense to speak of a typical value for the generalization error, which is defined as the value which is realized by the majority of the students. In the thermodynamic limit known from statistical physics, in which the number of parameters of the network is taken to be large, we expect that in fact almost all students belong to this majority, provided the quantity of interest is a cooperative effect of all components of the system. As the geometric visualization for the generalization error of the perceptron shows, this is actually the case. The following approach, which was pioneered by Elizabeth Gardner (Gardner, 1988; Gardner and Derrida, 1989), is based on the calculation of V(ε), the volume of the space of couplings which both perfectly implement m training examples and have a given generalization error ε. For an intuitive picture, consider that only discrete values for the couplings are allowed; then V(ε) would be proportional to the number of students. The typical value of the generalization error is the value of ε which maximizes V(ε). It should be kept in mind that V(ε) is a random number and fluctuates from one training set to another. A correct treatment of this randomness requires involved mathematical techniques (Mézard et al., 1987). To obtain a picture which is quite often qualitatively correct, we may replace it by its average over many realizations of training sets. From elementary probability theory we see that this average number can be found by calculating the volume A of the space of all students with generalization error ε, irrespective of

with the number of couplings N (like typical volumes in N-dimensional spaces) and B decreases exponentially with m (because it becomes more improbable to be correct m times for any ε > 0), both factors can balance each other when m increases like m = αN. α is an effective measure for the size of the training set when N goes to infinity. In order to have quantities which remain finite as N → ∞, it is also useful to take the logarithm of V(ε) and divide by N, which transforms the product into a sum of two terms. The first term (often called the entropic term) increases with increasing generalization error (green curve in Fig. 8). This is true because there are many networks which are not similar to the teacher, but there is only one network equal to the teacher. For almost all networks (remember, the entropic term does not include the effect of the training examples) ε = 0.5, i.e., they are correct half of the time by random guessing. On the other hand, the second term (red curve in Fig. 8) decreases with increasing generalization error because the probability of being correct on an input pattern increases when the student network becomes more similar to the teacher. It is often called the energetic contribution because it favors highly ordered (toward the teacher) network states, reminiscent of the states of physical systems at low energies. Hence, there will be a maximum (Fig. 8, arrow) of V(ε) at some value of ε which by definition is the typical generalization error.

The development of the learning process as the number of examples αN increases can be understood as a competition between the entropic term, which favors disordered network configurations that are not similar to the teacher, and the energetic term. The latter dominates when the number of examples is large. It will later be shown that such a competition can lead to rich and interesting behavior as the number of examples is varied. The result for the learning curve (Györgyi and Tishby, 1990; Sompolinsky et al.,

[Figure 8 plots (1/N) log V(ε) against ε between 0 and 0.5, showing the entropic contribution, the energetic contribution, and the maximum of their sum (arrow).]
|
|||
|
their behavior on the training set, and multiplying it by FIGURE 8 Logarithm of the average volume of students that
|
|||
|
the probability Bthat a student with generalization error e havelearnedmexamplesandgiveegeneralizationerror(green
|
|||
|
gives mtimes the correct answers on independent draw- curve). The blue and red curves represent the energetic and
|
|||
|
ings of the input patterns. Since Aincreases exponentially entropic contributions, respectively.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
PART TWO / BUILDING BLOCKS FOR INTELLIGENCE SYSTEMS 769 262-A1677 7/24/01 11:12 AM Page 770
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
MANFRED OPPER
|
|||
|
|
|||
|
|
|||
|
0.5 student is free to ask the teacher questions, i.e., if the stu-
|
|||
|
ε dent can choose highly informative input patterns. For the
|
|||
|
simple perceptron a fruitful query strategy is to select a new 0.4 input vector which is perpendicular to the current coupling
|
|||
|
vector of the student (Kinzel and Ruján, 1990). Such an
|
|||
|
0.3 input is a highly ambiguous pattern because small changes
|
|||
|
continuous couplings in the student couplings produce different classification an-
|
|||
|
swers. For more complicated networks it may be difficult 0.2 to obtain similar ambiguous inputs by an explicit construc-
|
|||
|
tion. A general algorithm has been proposed (Seung et al.,
|
|||
|
0.1 1992a) which uses the principle of maximal disagreement discrete couplings in a committee of several students as a selection process for
|
|||
|
training patterns. Using an appropriate randomized train- 0.00.0 0.1 0.2 0.3 0.4 0.5 0. 6 ingstrategy,differentstudentsaregeneratedwhichalllearn α the same set of examples. Next, any new input vector is only
|
|||
|
FIGURE 9 Learning curves for typical student perceptrons. accepted for training when the disagreement of its classi-
|
|||
|
am/Nis the ratio between the number of examples and the fication between the students is maximal. For a committee
|
|||
|
coupling number. of two students it can be shown that when the number of
|
|||
|
examples is large, the information gain does not decrease
|
|||
|
but reaches a positive constant. This results in a much faster
|
|||
|
1990) of a perceptron obtained by the statistical physics ap- decrease of the generalization error. Instead of being in-
|
|||
|
proach (treating the random sampling the proper way) is versely proportional to the number of examples, the de-
|
|||
|
shown by the red curve of Fig. 9. In contrast to the worst- crease is now exponentially fast.
|
|||
|
casepredictionsoftheVCtheory,itispossibletohavesome ................................................generalization ability below VC dimension or capacity. As ◗
|
|||
|
|
|||
|
we might have expected, the generalization error decreases Bad Students and Good Students
|
|||
|
monotonically, showing that the more that is learned, the
|
|||
|
more that is understood. Asymptotically, the error is pro- Although the typical student perceptron has a smooth,
|
|||
|
portional to Nand inversely proportional to m, in agree- monotonically decreasing learning curve, the possibility
|
|||
|
ment with the VC predictions. This may not be true for that some concrete learning algorithm may result in a set
|
|||
|
more complicated networks. of student couplings which are untypical in the sense of
|
|||
|
our theory cannot be ruled out. For bad students, even non-................................................ ◗ monotic generalization behavior is possible. The problem
|
|||
|
Query Learning of a concrete learning algorithm can be made to fit into the
|
|||
|
statistical physics framework if the algorithm minimizes a
|
|||
|
Soon after Gardner’s pioneering work, it was realized that certain cost function. Treating the achieved values of the
|
|||
|
the approach of statistical physics is closely related to ideas new cost function as a macroscopic constraint, the tools of
|
|||
|
in information theory and Bayesian statistics (Levin et al., statistical physics apply again.
|
|||
|
1989;GyörgyiandTishby,1990;OpperandHaussler,1991), As an example, it is convenient to consider a case in
|
|||
|
for which the reduction of an initial uncertainty about the which the teacher and the student have a different archi-
|
|||
|
true state of a system (teacher) by observing data is a cen- tecture: In one of the simplest examples one tries to learn
|
|||
|
tral topic of interest. The logarithm of the volume of rele- a classification problem by interpreting it as a regression
|
|||
|
vant microstates as defined in the previous section is a di- problem, i.e., a problem of fitting a continuous function
|
|||
|
rect measure for such uncertainty. The moderate progress through data points. To be specific, we study the situation
|
|||
|
in generalization ability displayed by the red learning curve in which the teacher network is still given by a percep-
|
|||
|
of Fig. 9 can be understood by the fact that as learning pro- tron which computes binary valued outputs of the form
|
|||
|
gresses less information about the teacher is gained from a ywx, 1, but as the student we choose a network i i i
|
|||
|
newrandomexample.Here,theinformationgainisdefined with a linear transfer function (the yellow curve in Fig. 1a)
|
|||
|
as the reduction of the uncertainty when a new example is
|
|||
|
learned. The decrease in information gain is due to the in- Y awxi i
|
|||
|
crease in the generalization performance. This is plausible i
|
|||
|
because inputs for which the majority of student networks and try to fit this linear expression to the binary labels of
|
|||
|
give the correct answer are less informative than those for the teacher. If the number of couplings is sufficiently large
|
|||
|
which a mistake is more likely. The situation changes if the (larger than the number of examples) the linear function
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
770 VOLUME III / INTELLIGENT SYSTEMS 262-A1677 7/24/01 11:12 AM Page 771
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
LEARNING TO GENERALIZE
|
|||
|
|
|||
|
|
|||
|
(unlike the sign) is perfectly able to fit arbitrary continuous the student learns all examples perfectly. Although it may
|
|||
|
output values. This linear fit is an attempt to explain the not be easy to construct a learning algorithm which per-
|
|||
|
data in a more complicated way than necessary, and the forms such a maximization in practice, the resulting gener-
|
|||
|
couplings have to be finely tuned in order to achieve this alization error can be calculated using the statistical phys-
|
|||
|
goal. We find that the student trained in such a way does ics approach (Engel and Van den Broeck, 1993). The result
|
|||
|
not generalize well (Opper and Kinzel, 1995). In order to is in agreement with the VC theory: There is no prediction
|
|||
|
compare the classifications of teacher and student on a new better than random guessing below the capacity.
|
|||
|
random input after training, we have finally converted the Although the previous algorithms led to a behavior
|
|||
|
student’s output into a classification label by taking the sign whichisworsethanthetypicalone,wenowexaminetheop-
|
|||
|
of its output. As shown in the red curve of Fig. 10, after positecaseofanalgorithmwhichdoesbetter.Sincethegen-
|
|||
|
an initial improvement of performance the generalization eralization ability of a neural network is related to the fact
|
|||
|
error increases again to the random guessing value e0.5 that similar input vectors are mapped onto the same out-
|
|||
|
at a1 (Fig. 10, red curve). This phenomenon is called put, one can assume that such a property can be enhanced
|
|||
|
overfitting.For a1 (i.e., for more data than parameters), if the separating gap between the two classes is maximized,
|
|||
|
it is no longer possible to have a perfect linear fit through which defines a new cost function for an algorithm. This
|
|||
|
the data, but a fit with a minimal deviation from a linear optimal margin perceptron can be practically realized and
|
|||
|
function leads to the second part of the learning curve.ede- when applied to a set of data leads to the projection of
|
|||
|
creases again and approaches 0 asymptotically for aSq. Fig. 11. As a remarkable result, it can be seen that there is a
|
|||
|
This shows that when enough data are available, the details relatively large fraction of patterns which are located at the
|
|||
|
of the training algorithm are less important. gap. These points are called support vectors(SVs). In order
|
|||
|
The dependence of the generalization performance on to understand their importance for the generalization abil-
|
|||
|
the complexity of the assumed data model is well-known. If ity, we make the following gedankenexperimentand assume
|
|||
|
function class is used that is too complex, data values can be that all the points which lie outside the gap (the nonsupport
|
|||
|
perfectly fitted but the predicted function will be very sen- vectors) are eliminated from the training set of examples.
|
|||
|
sitive to the variations of the data sample, leading to very From the two-dimensional projection of Fig. 11, we may
|
|||
|
unreliable predictions on novel inputs. On the other hand, conjecture that by running the maximal margin algorithm
|
|||
|
functions that are too simple make the best fit almost insen- on the remaining examples (the SVs) we cannot create a
|
|||
|
sitive to the data, which prevents us from learning enough larger gap between the points. Hence, the algorithm will
|
|||
|
from them. converge to the same separating hyperplane as before. This
|
|||
|
It is also possible to calculate the worst-case generaliza- intuitive picture is actually correct. If the SVs of a training
|
|||
|
tion ability of perceptron students learning from a percep- set were known beforehand (unfortunately, they are only
|
|||
|
tron teacher. The largest generalization error is obtained identified after running the algorithm), the margin classi-
|
|||
|
(Fig. 7) when the angle between the coupling vectors of fier would have to be trained only on the SVs. It would au-
|
|||
|
teacher and student is maximized under the constraint that tomatically classify the rest of the training inputs correctly.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
0.50
|
|||
|
ε
|
|||
|
0.40
|
|||
|
|
|||
|
|
|||
|
0.30 linear student
|
|||
|
|
|||
|
|
|||
|
0.20
|
|||
|
margin classifier
|
|||
|
|
|||
|
0.10
|
|||
|
|
|||
|
|
|||
|
0.000123456 α
|
|||
|
FIGURE 10 Learning curves for a linear student and for a FIGURE 11 Learning with a margin classifier and m300
|
|||
|
margin classifier. am/N. examples in an N150-dimensional space.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
PART TWO / BUILDING BLOCKS FOR INTELLIGENCE SYSTEMS 771 262-A1677 7/24/01 11:12 AM Page 772
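The geometric picture of Fig. 7 can be checked directly by simulation: for spherically symmetric input patterns, the probability that student and teacher perceptrons disagree equals the angle between their coupling vectors divided by π. The following sketch (function names are mine, not from the article) estimates this disagreement rate by random sampling:

```python
import math
import random

def disagreement_rate(student, teacher, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the probability that the student and
    teacher perceptrons label a random Gaussian input differently."""
    rng = random.Random(seed)
    n = len(student)
    errors = 0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        s = sum(w * xi for w, xi in zip(student, x))
        t = sum(w * xi for w, xi in zip(teacher, x))
        if (s >= 0) != (t >= 0):
            errors += 1
    return errors / n_samples

def angle(u, v):
    """Angle between two coupling vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.acos(dot / (norm_u * norm_v))

teacher = [1.0, 0.0, 0.0]
student = [math.cos(1.0), math.sin(1.0), 0.0]  # 1 radian away from the teacher

eps_empirical = disagreement_rate(student, teacher)
eps_geometric = angle(student, teacher) / math.pi  # = 1/pi, about 0.318
print(eps_empirical, eps_geometric)
```

With 100,000 samples the empirical rate matches θ/π to within about a percent; the identity holds in any dimension as long as the input distribution is spherically symmetric, which is why the two-dimensional picture of Fig. 7 suffices.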
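The perpendicular-query strategy of Kinzel and Ruján described in the Query Learning section can be sketched in a few lines (the helper name is mine): projecting out the component of a random input along the student's coupling vector w yields an input on which the student's output is exactly zero, i.e., a maximally ambiguous pattern.

```python
import random

def perpendicular_query(w, rng):
    """Draw a random Gaussian input and project out its component
    along the student couplings w; the student's weighted sum on the
    result is zero, so its classification is maximally ambiguous."""
    x = [rng.gauss(0.0, 1.0) for _ in w]
    w_sq = sum(wi * wi for wi in w)
    c = sum(wi * xi for wi, xi in zip(w, x)) / w_sq
    return [xi - c * wi for wi, xi in zip(w, x)]

rng = random.Random(1)
w = [0.3, -1.2, 0.7, 2.0]          # current student couplings
q = perpendicular_query(w, rng)
overlap = sum(wi * qi for wi, qi in zip(w, q))
print(abs(overlap) < 1e-9)  # True: the student is undecided on q
```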
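The "bad student" of this section — a linear network fitted to the teacher's binary labels — can be built explicitly. The sketch below (a minimum-norm least-squares fit via a hand-rolled Gaussian elimination; all names and the concrete construction are mine, chosen for illustration) shows that for m < N the linear output can reproduce the ±1 labels exactly, the fine tuning of couplings the text refers to:

```python
import random

def solve(a, b):
    """Solve the m x m linear system a x = b by Gaussian elimination
    with partial pivoting."""
    m = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = M[r][m] - sum(M[r][c] * x[c] for c in range(r + 1, m))
        x[r] = s / M[r][r]
    return x

rng = random.Random(0)
N, m = 20, 10                      # more couplings than examples
teacher = [rng.choice([-1.0, 1.0]) for _ in range(N)]
X = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(m)]
y = [1.0 if sum(t * xi for t, xi in zip(teacher, row)) >= 0 else -1.0
     for row in X]

# Minimum-norm interpolation of the binary labels: w = X^T (X X^T)^{-1} y
G = [[sum(X[i][k] * X[j][k] for k in range(N)) for j in range(m)]
     for i in range(m)]
c = solve(G, y)
w = [sum(c[i] * X[i][k] for i in range(m)) for k in range(N)]

fit = [sum(wk * xk for wk, xk in zip(w, row)) for row in X]
max_residual = max(abs(f - yi) for f, yi in zip(fit, y))
print(max_residual)  # tiny: the linear student reproduces all m labels exactly
```

Taking the sign of this student's output then gives zero training error; the point of the red curve in Fig. 10 is that such a perfectly fitted student nevertheless generalizes badly as α approaches 1.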
Hence, if in an actual classification experiment the number of SVs is small compared to the number of non-SVs, we may expect a good generalization ability.

The learning curve for a margin classifier (Opper and Kinzel, 1995) learning from a perceptron teacher (calculated by the statistical physics approach) is shown in Fig. 10 (blue curve). The concept of a margin classifier has recently been generalized to the so-called support vector machines (Vapnik, 1995), for which the inputs of a perceptron are replaced by suitable features which are cleverly chosen nonlinear functions of the original inputs. In this way, nonlinearly separable rules can be learned, providing an interesting alternative to multilayer networks.

................................................ ◗
The Ising Perceptron

The approach of statistical physics can develop a specific predictive power in situations in which one would like to understand novel network models or architectures for which currently no efficient learning algorithm is known. As the simplest example, we consider a perceptron for which the couplings w_j are constrained to binary values +1 and −1 (Gardner and Derrida, 1989; Györgyi, 1990; Seung et al., 1992b). For this so-called Ising perceptron (named after Ernst Ising, who studied coupled binary-valued elements as a model for a ferromagnet), perfect learning of examples is equivalent to a difficult combinatorial optimization problem (integer linear programming), which in the worst case is believed to require a learning time that increases exponentially with the number of couplings N.

To obtain the learning curve for the typical student, we can proceed as before, replacing V(ε) by the number of student configurations that are consistent with the teacher, which results in changing the entropic term appropriately. When the examples are provided by a teacher network of the same binary type, one can expect that the generalization error will decrease monotonically to zero as a function of α. The learning curve is shown as the blue curve in Fig. 9. For sufficiently small α, the discreteness of the couplings has almost no effect. However, in contrast to the continuous case, perfect generalization does not require infinitely many examples but is achieved already at a finite number α_c ≈ 1.24. This is not surprising because the teacher's couplings contain only a finite amount of information (one bit per coupling) and one would expect that it does not take much more than about N examples to learn them. The remarkable and unexpected result of the analysis is the fact that the transition to perfect generalization is discontinuous. The generalization error decreases immediately from a nonzero value to zero. This gives an impression about the complex structure of the space of all consistent students and also gives a hint as to why perfect learning in the Ising perceptron is a difficult task. For α slightly below α_c, the number of consistent students is small; nevertheless, the few remaining ones must still differ in a finite fraction of bits from each other and from the teacher so that perfect generalization is still impossible. For α slightly above α_c only the couplings of the teacher survive.

................................................ ◗
Learning with Errors

The example of the Ising perceptron teaches us that it will not always be simple to obtain zero training error. Moreover, an algorithm trying to achieve this goal may get stuck in local minima. Hence, the idea of allowing errors explicitly in the learning procedure, by introducing an appropriate noise, can make sense. An early analysis of such a stochastic training procedure and its generalization ability for the learning in so-called Boolean networks (with elementary computing units different from the ones used in neural networks) can be found in Carnevali and Patarnello (1987). A stochastic algorithm can be useful to escape local minima of the training error, enabling a better learning of the training set. Surprisingly, such a method can also lead to better generalization abilities if the classification rule is also corrupted by some degree of noise (Györgyi and Tishby, 1990). A stochastic training algorithm can be realized by the Monte Carlo Metropolis method, which was invented to generate the effects of temperature in simulations of physical systems. Any changes of the network couplings which lead to a decrease of the training error during learning are allowed. However, with some probability that increases with the temperature, an increase of the training error is also accepted. Although in principle this algorithm may visit all the network's configurations, for a large system, with an overwhelming probability, only states close to some fixed training error will actually appear. The method of statistical physics applied to this situation shows that for sufficiently large temperatures T we often obtain a qualitatively correct picture if we repeat the approximate calculation for the noise-free case and replace the relative number of examples α by the effective number α/T. Hence, the learning curves become essentially stretched and good generalization ability is still possible at the price of an increase in the necessary number of training examples.

Within the stochastic framework, learning (with errors) can now also be realized for the Ising perceptron, and it is interesting to study the number of relevant student configurations as a function of ε in more detail (Fig. 12). The green curve is obtained for a small value of α where a strong maximum with high generalization error exists. By increasing α, this maximum decreases until it is the same as the second maximum at ε = 0.5, indicating a transition like that of the blue learning curve in Fig. 9. For larger α, the state of perfect generalization should be the typical state. Nevertheless, if the stochastic algorithm starts with an initial state which has no resemblance to the (unknown) teacher (i.e., with ε = 0.5), it will spend time that increases exponentially with N in the smaller local maximum, the metastable state. Hence, a sudden transition to perfect generalization will be observable only in examples which correspond to the blue curve of Fig. 12, where this metastable state disappears. For large values of α (yellow curve), the stochastic algorithm will always converge to the state of perfect generalization. On the other hand, since the state with ε = 0.5 is always metastable, a stochastic algorithm which starts with the teacher's couplings will never drive the student out of the state of perfect generalization. It should be made clear that the sharp phase transitions are the result of the thermodynamic limit, where the macroscopic state is entirely dominated by the typical configurations. For simulations of any finite system a rounding and softening of the transitions will be observed.

FIGURE 12 Logarithm of the number of relevant Ising students for different values of α. [Plot: log(number of students) against ε from 0 to 0.5, with curves for α4 > α3 > α2 > α1.]

................................................ ◗
More Sophisticated Computations Are Needed for Multilayer Networks

As a first step to understand the generalization performance of multilayer networks, one can study an architecture which is simpler than the fully connected one of Fig. 1b. The tree architecture of Fig. 13 has become a popular model. Here, each hidden unit is connected to a different set of the input nodes. A further simplification is the replacement of adaptive couplings from the hidden units to the output node by a prewired fixed function which maps the states of the hidden units to the output.

Two such functions have been studied in great detail. For the first one, the output gives just the majority vote of the hidden units—that is, if the majority of the hidden units is negative, then the total output is negative, and vice versa. This network is called a committee machine. For the second type of network, the parity machine, the output is the parity of the hidden outputs—that is, a minus results from an odd number of negative hidden units and a plus from an even number. For both types of networks, the capacity has been calculated in the thermodynamic limit of a large number N of (first layer) couplings (Barkai et al., 1990; Monasson and Zecchina, 1995). By increasing the number of hidden units (but always keeping it much smaller than N), the capacity per coupling (and the VC dimension) can be made arbitrarily large. Hence, the VC theory predicts that the ability to generalize begins at a size of the training set which increases with the capacity. The learning curves of the typical parity machine (Fig. 14) being trained by a parity teacher for (from left to right) one, two, four, and six hidden units seem to partially support this prediction. Below a certain number of examples, only memorization of the learned patterns occurs and not generalization. Then, a transition to nontrivial generalization takes place (Hansel et al., 1992; Opper, 1994). Far beyond the transition, the decay of the learning curves becomes that of a simple perceptron (black curve in Fig. 14) independent of the number of hidden units, and this occurs much faster than for the bound given by VC theory. This shows that the typical learning curve can in fact be determined by more than one complexity parameter. In contrast, the learning curve of the committee machine with the tree architecture of Fig. 13 (Schwarze and Hertz, 1992) is smooth and resembles that of the simple perceptron. As the number of hidden units is increased (keeping N fixed and very large), the generalization error increases, but despite the diverging VC dimension the curves converge to a limiting one having an asymptotic decay which is only twice as slow as that of the perceptron. This is an example for which typical and worst-case generalization behaviors are entirely different.

FIGURE 13 A two-layer network with tree architecture. The arrow indicates the direction of propagation of the information.

FIGURE 14 Learning curves for the parity machine with tree architecture. Each curve represents the generalization error ε as a function of α and is distinguished by the number of hidden units of the network. [Plot: ε (0 to 0.5) against α (0 to 0.6), with curves for 1, 2, 4, and 6 hidden units.]

Recently, more light has been shed on the relation between average and worst-case scenarios of the tree committee. A reduced worst-case scenario, in which a tree committee teacher was to be learned from tree committee students under an input distribution, has been analyzed from a statistical physics perspective (Urbanczik, 1996). As expected, a few students show a much worse generalization ability than the typical one. Moreover, such students may also be difficult to find by most reasonable learning algorithms because bad students require very fine tuning of their couplings. Calculation of the couplings with finite precision requires a number of bits per coupling that increases faster than exponentially with α and which for sufficiently large α will be beyond the capability of practical algorithms. Hence, it is expected that, in practice, a bad behavior will not be observed.

Transitions of the generalization error such as those observed for the tree parity machine are a characteristic feature of large systems which have a symmetry that can be spontaneously broken. To explain this, consider the simplest case of two hidden units. The output of this parity machine does not change if we simultaneously change the sign of all the couplings for both hidden units. Hence, if the teacher's couplings are all equal to +1, a student with all couplings equal to −1 acts as exactly the same classifier. If there are few examples in the training set, the entropic contribution will dominate the typical behavior and the typical students will display the same symmetry. Their coupling vectors will consist of positive and negative random numbers. Hence, there is no preference for the teacher or the reversed one and generalization is not possible. If the number of examples is large enough, the symmetry is broken and there are two possible types of typical students, one with more positive and the other one with more negative couplings. Hence, any of the typical students will show some similarity with the teacher (or its negative image) and generalization occurs. A similar type of symmetry breaking also leads to a continuous phase transition in the fully connected committee machine. This can be viewed as a committee of perceptrons, one for each hidden unit, which share the same input nodes. Any permutation of these perceptrons obviously leaves the output invariant. Again, if few examples are learned, the typical state reflects the symmetry. Each student perceptron will show approximately the same similarity to every teacher perceptron. Although this symmetric state allows for some degree of generalization, it is not able to recover the teacher's rule completely. After a long plateau, the symmetry is broken and each of the student perceptrons specializes to one of the teacher perceptrons, and thus their similarity with the others is lost. This leads to a rapid (but continuous) decrease in the generalization error. Such types of learning curves with plateaus can actually be observed in applications of fully connected multilayer networks.

................................................ ◗
Outlook

The worst-case approach of the VC theory and the typical case approach of statistical physics are important theories for modeling and understanding the complexity of learning to generalize from examples. Although the VC approach plays an important role in a general theory of learnability, its practical applications for neural networks have been limited by the overall generality of the approach. Since only weak assumptions about probability distributions and machines are considered by the theory, the estimates for generalization errors have often been too pessimistic. Recent developments of the theory seem to overcome these problems. By using modified VC dimensions, which depend on the data that have actually occurred and which in favorable cases are much smaller than the general dimensions, more realistic results seem to be possible. For the support vector machines (Vapnik, 1995) (generalizations of the margin classifiers which allow for nonlinear boundaries that separate the two classes), Vapnik and collaborators have shown the effectiveness of the modified VC results for selecting the optimal type of model in practical applications.

The statistical physics approach, on the other hand, has revealed new and unexpected behavior of simple network models, such as a variety of phase transitions. Whether such transitions play a cognitive role in animal or human brains is an exciting topic. Recent developments of the theory aim to understand dynamical problems of learning. For example, online learning (Saad, 1998), in which the problems of learning and generalization are strongly mixed, has enabled the study of complex multilayer networks and has stimulated research on the development of optimized algorithms. In addition to an extension of the approach to more complicated networks, an understanding of the robustness of the typical behavior, and an interpolation to the other extreme, the worst-case scenario, is an important subject of research.

Acknowledgments
I thank members of the Department of Physics of Complex Systems at the Weizmann Institute in Rehovot, Israel, where parts of this article were written, for their warm hospitality.
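The stochastic training procedure of the Learning with Errors section can be sketched for the Ising perceptron with a standard Metropolis rule (the code layout and all names are my own illustration, not the article's algorithm verbatim): a flip of a single binary coupling that lowers the training error is always accepted, and one that raises it by ΔE is accepted with probability exp(−ΔE/T).

```python
import math
import random

def train_error(w, X, y):
    """Number of examples the Ising perceptron w misclassifies."""
    err = 0
    for row, label in zip(X, y):
        s = sum(wi * xi for wi, xi in zip(w, row))
        if (1 if s >= 0 else -1) != label:
            err += 1
    return err

def metropolis_step(w, X, y, T, rng):
    """Propose flipping one coupling; accept downhill moves always and
    uphill moves with probability exp(-dE / T)."""
    j = rng.randrange(len(w))
    e_old = train_error(w, X, y)
    w[j] = -w[j]
    dE = train_error(w, X, y) - e_old
    if dE > 0 and rng.random() >= math.exp(-dE / T):
        w[j] = -w[j]               # reject: undo the flip

rng = random.Random(3)
N, m, T = 15, 30, 0.5
teacher = [rng.choice([-1, 1]) for _ in range(N)]
X = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(m)]
y = [1 if sum(t * xi for t, xi in zip(teacher, row)) >= 0 else -1
     for row in X]

w = [rng.choice([-1, 1]) for _ in range(N)]    # random initial student
e_start = train_error(w, X, y)
for _ in range(4000):
    metropolis_step(w, X, y, T, rng)
e_end = train_error(w, X, y)
print(e_start, "->", e_end)
```

At finite temperature T the chain settles near some fixed training error instead of insisting on zero error, which is exactly the trade-off the text describes; the teacher configuration itself has zero training error by construction.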
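The two prewired output functions for the tree architecture, and the sign-flip symmetry of the parity machine on which the symmetry-breaking argument rests, can be made concrete (a minimal sketch; the function names are mine). Each hidden unit sees its own disjoint block of the input, and negating all couplings of both hidden units of a two-unit parity machine leaves every output unchanged:

```python
import random

def hidden_signs(couplings, x):
    """Signs of the hidden units of a tree machine: each unit sees its
    own disjoint block of the input."""
    k = len(couplings)
    n = len(x) // k
    signs = []
    for i, w in enumerate(couplings):
        h = sum(wj * xj for wj, xj in zip(w, x[i * n:(i + 1) * n]))
        signs.append(1 if h >= 0 else -1)
    return signs

def parity_output(couplings, x):
    """Parity machine: product of the hidden unit signs."""
    out = 1
    for s in hidden_signs(couplings, x):
        out *= s
    return out

def committee_output(couplings, x):
    """Committee machine: majority vote of the hidden unit signs."""
    return 1 if sum(hidden_signs(couplings, x)) >= 0 else -1

W = [[1.0, -2.0], [0.5, 0.5], [-1.0, 1.0]]   # three hidden units, two inputs each
x = [1.0, 1.0, 1.0, -1.0, 2.0, 0.5]
print(parity_output(W, x), committee_output(W, x))  # 1 -1

# Sign-flip symmetry of the two-unit parity machine: negating all
# couplings of both hidden units never changes the output.
rng = random.Random(7)
W2 = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(2)]
W2_flipped = [[-wj for wj in row] for row in W2]
symmetric = all(
    parity_output(W2, x) == parity_output(W2_flipped, x)
    for x in ([rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(200))
)
print(symmetric)
```

Because the teacher and its mirror image are indistinguishable under this symmetry, a typical student trained on few examples has no reason to prefer either one — the starting point of the symmetry-breaking discussion above.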
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
LEARNING TO GENERALIZE
|
|||
|
|
|||
|
|
|||
|
References Cited

AMARI, S., and MURATA, N. (1993). Statistical theory of learning curves under entropic loss. Neural Comput. 5, 140.

BARKAI, E., HANSEL, D., and KANTER, I. (1990). Statistical mechanics of a multilayered neural network. Phys. Rev. Lett. 65, 2312.

BISHOP, C. M. (1995). Neural Networks for Pattern Recognition. Clarendon/Oxford Univ. Press, Oxford/New York.

CARNEVALI, P., and PATARNELLO, S. (1987). Exhaustive thermodynamical analysis of Boolean learning networks. Europhys. Lett. 4, 1199.

COVER, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. El. Comp. 14, 326.

ENGEL, A., and VAN DEN BROECK, C. (1993). Systems that can learn from examples: Replica calculation of uniform convergence bound for the perceptron. Phys. Rev. Lett. 71, 1772.

GARDNER, E. (1988). The space of interactions in neural networks. J. Phys. A 21, 257.

GARDNER, E., and DERRIDA, B. (1989). Optimal storage properties of neural network models. J. Phys. A 21, 271.

GYÖRGYI, G. (1990). First order transition to perfect generalization in a neural network with binary synapses. Phys. Rev. A 41, 7097.

GYÖRGYI, G., and TISHBY, N. (1990). Statistical theory of learning a rule. In Neural Networks and Spin Glasses: Proceedings of the STATPHYS 17 Workshop on Neural Networks and Spin Glasses (W. K. Theumann and R. Koberle, Eds.). World Scientific, Singapore.

HANSEL, D., MATO, G., and MEUNIER, C. (1992). Memorization without generalization in a multilayered neural network. Europhys. Lett. 20, 471.

KINZEL, W., and RUJÀN, P. (1990). Improving a network generalization ability by selecting examples. Europhys. Lett. 13, 473.

LEVIN, E., TISHBY, N., and SOLLA, S. (1989). A statistical approach to learning and generalization in neural networks. In Proceedings of the Second Workshop on Computational Learning Theory (R. Rivest, D. Haussler, and M. Warmuth, Eds.). Morgan Kaufmann, San Mateo, CA.

MÉZARD, M., PARISI, G., and VIRASORO, M. A. (1987). Spin glass theory and beyond. In Lecture Notes in Physics, Vol. 9. World Scientific, Singapore.

MONASSON, R., and ZECCHINA, R. (1995). Weight space structure and internal representations: A direct approach to learning and generalization in multilayer neural networks. Phys. Rev. Lett. 75, 2432.

OPPER, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik–Chervonenkis dimension. Phys. Rev. Lett. 72, 2113.

OPPER, M., and HAUSSLER, M. (1991). Generalization performance of Bayes optimal classification algorithm for learning a perceptron. Phys. Rev. Lett. 66, 2677.

OPPER, M., and KINZEL, W. (1995). Statistical mechanics of generalization. In Physics of Neural Networks III (J. L. van Hemmen, E. Domany, and K. Schulten, Eds.). Springer-Verlag, New York.

SAAD, D. (Ed.) (1998). Online Learning in Neural Networks. Cambridge Univ. Press, New York.

SCHWARZE, H., and HERTZ, J. (1992). Generalization in a large committee machine. Europhys. Lett. 20, 375.

SCHWARZE, H., and HERTZ, J. (1993). Generalization in fully connected committee machines. Europhys. Lett. 21, 785.

SEUNG, H. S., SOMPOLINSKY, H., and TISHBY, N. (1992a). Statistical mechanics of learning from examples. Phys. Rev. A 45, 6056.

SEUNG, H. S., OPPER, M., and SOMPOLINSKY, H. (1992b). Query by committee. In The Proceedings of the Vth Annual Workshop on Computational Learning Theory (COLT92), p. 287. Association for Computing Machinery, New York.

SOMPOLINSKY, H., TISHBY, N., and SEUNG, H. S. (1990). Learning from examples in large neural networks. Phys. Rev. Lett. 65, 1683.

URBANCZIK, R. (1996). Learning in a large committee machine: Worst case and average case. Europhys. Lett. 35, 553.

VALLET, F., CAILTON, J., and REFREGIER, P. (1989). Linear and nonlinear extension of the pseudo-inverse solution for learning Boolean functions. Europhys. Lett. 9, 315.

VAPNIK, V. N. (1982). Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York.

VAPNIK, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York.

VAPNIK, V. N., and CHERVONENKIS, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probability Appl. 16, 254.

General References

ARBIB, M. A. (Ed.) (1995). The Handbook of Brain Theory and Neural Networks. MIT Press, Cambridge, MA.

BERGER, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.

HERTZ, J. A., KROGH, A., and PALMER, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.

MINSKY, M., and PAPERT, S. (1969). Perceptrons. MIT Press, Cambridge, MA.

WATKIN, T. L. H., RAU, A., and BIEHL, M. (1993). The statistical mechanics of learning a rule. Rev. Modern Phys. 65, 499.
PART TWO / BUILDING BLOCKS FOR INTELLIGENCE SYSTEMS 775