LETTER Communicated by Herbert Jaeger

Analysis and Design of Echo State Networks

Mustafa C. Ozturk
can@cnel.ufl.edu
Dongming Xu
dmxu@cnel.ufl.edu
Jose C. Príncipe
principe@cnel.ufl.edu
Computational NeuroEngineering Laboratory, Department of Electrical and
Computer Engineering, University of Florida, Gainesville, FL 32611, U.S.A.

The design of echo state network (ESN) parameters relies on the selection of the maximum eigenvalue of the linearized system around zero (spectral radius). However, this procedure does not quantify in a systematic manner the performance of the ESN in terms of approximation error. This article presents a functional space approximation framework to better understand the operation of ESNs and proposes an information-theoretic metric, the average entropy of echo states, to assess the richness of the ESN dynamics. Furthermore, it provides an interpretation of the ESN dynamics rooted in system theory as families of coupled linearized systems whose poles move according to the input signal dynamics. With this interpretation, a design methodology for functional approximation is put forward where ESNs are designed with uniform pole distributions covering the frequency spectrum to abide by the richness metric, irrespective of the spectral radius. A single bias parameter at the ESN input, adapted with the modeling error, configures the ESN spectral radius to the input-output joint space. Function approximation examples compare the proposed design methodology versus the conventional design.

1 Introduction

Dynamic computational models require the ability to store and access the time history of their inputs and outputs. The most common dynamic neural architecture is the time-delay neural network (TDNN) that couples delay lines with a nonlinear static architecture where all the parameters (weights) are adapted with the backpropagation algorithm. The conventional delay line utilizes ideal delay operators, but delay lines with local first-order recursive filters have been proposed by Werbos (1992) and extensively studied in the gamma model (de Vries, 1991; Principe, de Vries, & de Oliviera, 1993). Chains of first-order integrators are interesting because they effectively decrease the number of delays necessary to create time embeddings (Principe, 2001). Recurrent neural networks (RNNs) implement a different type of embedding that is largely unexplored. RNNs are perhaps the most biologically plausible of the artificial neural network (ANN) models (Anderson, Silverstein, Ritz, & Jones, 1977; Hopfield, 1984; Elman, 1990), but they are not well understood theoretically (Siegelmann & Sontag, 1991; Siegelmann, 1993; Kremer, 1995). One of the main practical problems with RNNs is the difficulty of adapting the system weights. Various algorithms, such as backpropagation through time (Werbos, 1990) and real-time recurrent learning (Williams & Zipser, 1989), have been proposed to train RNNs; however, these algorithms suffer from computational complexity, resulting in slow training, complex performance surfaces, the possibility of instability, and the decay of gradients through the topology and time (Haykin, 1998). The problem of decaying gradients has been addressed with special processing elements (PEs) (Hochreiter & Schmidhuber, 1997). Alternative second-order training methods based on extended Kalman filtering (Singhal & Wu, 1989; Puskorius & Feldkamp, 1994; Feldkamp, Prokhorov, Eagen, & Yuan, 1998) and the multistreaming training approach (Feldkamp et al., 1998) provide more reliable performance and have enabled practical applications in identification and control of dynamical systems (Kechriotis, Zervas, & Monolakos, 1994; Puskorius & Feldkamp, 1994; Delgado, Kambhampati, & Warwick, 1995).

Recently, two new recurrent network topologies have been proposed: the echo state network (ESN) by Jaeger (2001, 2002a; Jaeger & Haas, 2004) and the liquid state machine (LSM) by Maass (Maass, Natschläger, & Markram, 2002). ESNs possess a highly interconnected and recurrent topology of nonlinear PEs that constitutes a "reservoir of rich dynamics" (Jaeger, 2001) and contains information about the history of input and output patterns. The outputs of these internal PEs (echo states) are fed to a memoryless but adaptive readout network (generally linear) that produces the network output. The interesting property of ESNs is that only the memoryless readout is trained, whereas the recurrent topology has fixed connection weights. This reduces the complexity of RNN training to simple linear regression while preserving a recurrent topology, but it obviously places important constraints on the overall architecture that have not yet been fully studied. Similar ideas have been explored independently by Maass and formalized in the LSM architecture. LSMs, although formulated quite generally, are mostly implemented as neural microcircuits of spiking neurons (Maass et al., 2002), whereas ESNs are dynamical ANN models. Both attempt to model biological information processing using similar principles. We focus on the ESN formulation in this letter.

The echo state condition is defined in terms of the spectral radius (the largest among the absolute values of the eigenvalues of a matrix, denoted by ‖ · ‖) of the reservoir's weight matrix (‖W‖ < 1). This condition states that the dynamics of the ESN is uniquely controlled by the input, and the effect of the initial states vanishes. The current design of ESN parameters relies on the selection of the spectral radius. However, there are many possible weight matrices with the same spectral radius, and unfortunately they do not all perform at the same level of mean square error (MSE) for functional approximation. A similar problem exists in the design of the LSM. LSMs have been shown to possess universal approximation given the separation property (SP) for the liquid (reservoir in ESNs) and the approximation property (AP) for the readout (Maass et al., 2002). SP is quantified by a kernel-quality measure proposed in Maass, Legenstein, and Bertschinger (2005) that is based on the rank of a matrix formed by the system states corresponding to different input signals. The kernel quality is a measure of the complexity and diversity of nonlinear operations carried out by the liquid on its input stream in order to boost the classification power of a subsequent linear decision hyperplane (Maass et al., 2005). A variation of SP has been proposed in Bertschinger and Natschläger (2004), and it has been argued that complex calculations can be best carried out by networks on the boundary between ordered and chaotic dynamics.
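
In practice, "selecting the spectral radius" usually means drawing a random reservoir matrix and rescaling it to the desired value. The following is a minimal sketch of that common recipe (not specific to this letter); the sparsity level, value range, and function name are illustrative choices.

```python
import numpy as np

def random_reservoir(n_units, sparsity=0.95, desired_radius=0.9, seed=0):
    """Draw a sparse random reservoir matrix and rescale it so that its
    spectral radius (largest absolute eigenvalue) equals desired_radius."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(n_units, n_units))
    W[rng.random((n_units, n_units)) < sparsity] = 0.0    # sparsify the reservoir
    radius = np.max(np.abs(np.linalg.eigvals(W)))          # current spectral radius
    return W * (desired_radius / radius)                   # < 1 satisfies the echo state condition
```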

In this letter, we are interested in studying the ESN for functional approximation (filters that map input functions u(·) of time on output functions y(·) of time). We see two major shortcomings with the current ESN approach that uses the echo state condition as a design principle. First, the impact of fixed reservoir parameters for function approximation means that the information about the desired response is conveyed only to the output projection. This is not optimal, and strategies to select different reservoirs for different applications have not been devised. Second, imposing a constraint only on the spectral radius is a weak condition to properly set the parameters of the reservoir, as experiments show (different randomizations with the same spectral radius perform differently for the same problem; see Figure 2).

This letter aims to address these two problems by proposing a framework, a metric, and a design principle for ESNs. The framework is a signal processing interpretation of basis and projections in functional spaces to describe and understand the ESN architecture. According to this interpretation, the ESN states implement a set of basis functionals (representation space) constructed dynamically by the input, while the readout simply projects the desired response onto this representation space. The metric to describe the richness of the ESN dynamics is an information-theoretic quantity, the average state entropy (ASE). Entropy measures the amount of information contained in a given random variable (Shannon, 1948). Here, the random variable is the instantaneous echo state from which the entropy for the overall state (vector) is estimated. The probability density function (pdf) in a differential geometric framework should be thought of as a volume form; that is, in our case, the pdf of the state vector describes the metric of the state space manifold (Amari, 1990). Moreover, Cox (1946) established information as a coordinate-free metric in the state manifold. Therefore, entropy becomes a global descriptor of information that quantifies the volume of the manifold defined by the random variable. Due to the time dependency of the states, the state entropy averaged over time (ASE) is an appropriate estimate of the volume of the state manifold.
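
This description of ASE can be sketched in code. One plausible reading, offered only as an illustration: at each time step the N PE outputs are treated as samples of the instantaneous echo state, a Parzen (Gaussian-kernel) density estimate yields an entropy value, and the per-step values are averaged over time. The bandwidth and the particular (Shannon, resubstitution) entropy estimator are assumptions here, not specifications from the letter.

```python
import numpy as np

def average_state_entropy(states, bandwidth=0.3):
    """ASE sketch. states: (T, N) array of echo states over T time steps."""
    entropies = []
    for x in states:                                  # x: (N,) echo states at one time step
        diff = x[:, None] - x[None, :]                # pairwise differences between PE outputs
        k = np.exp(-0.5 * (diff / bandwidth) ** 2) / (bandwidth * np.sqrt(2 * np.pi))
        density = k.mean(axis=1)                      # Parzen density estimate at each sample
        entropies.append(-np.log(density).mean())     # resubstitution Shannon entropy
    return float(np.mean(entropies))                  # average over time
```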

The design principle specifies that one should consider independently the correlation among the bases and the spectral radius. In the absence of any information about the desired response, the ESN states should be designed with the highest ASE, independent of the spectral radius. We interpret the ESN dynamics as a combination of time-varying linear systems obtained from the linearization of the ESN nonlinear PE in a small, local neighborhood of the current state. The design principle means that the poles of the linearized ESN reservoir should have uniform pole distributions to generate echo states with the most diverse pole locations (which correspond to the uniformity of time constants). Effectively, this will create the least correlated bases for a given spectral radius, which corresponds to the largest volume spanned by the basis set. When the designer has no other information about the desired response to set the basis, this principle distributes the system's degrees of freedom uniformly in space. It approximates for ESNs the well-known property of orthogonal bases. The unresolved issue that ASE does not quantify is how to set the spectral radius, which depends again on the desired mapping. The concept of memory depth as explained in Principe et al. (1993) and Jaeger (2002a) is helpful in understanding the issues associated with the spectral radius. The correlation time of the desired response (as estimated by the first zero of the autocorrelation function) gives an indication of the type of spectral radius required (long correlation time requires high spectral radius). Alternatively, a simple adaptive bias is added at the ESN input to control the spectral radius, integrating the information from the input-output joint space into the ESN bases. For sigmoidal PEs, the bias adjusts the operating points of the reservoir PEs, which has the net effect of adjusting the volume of the state manifold as required to approximate the desired response with a small error. This letter shows that ESNs designed with this strategy obtain systematically better results in a set of experiments when compared with the conventional ESN design.
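
The pole picture above can be made concrete. Linearizing the state update (equation 2.1, introduced in section 2.1) around a state x, and assuming the tanh PEs used later in the letter, the Jacobian is diag(1 − x_i^2) W, whose eigenvalues are the instantaneous poles; at the origin they reduce to the eigenvalues of W. A minimal sketch, with the function name ours:

```python
import numpy as np

def reservoir_poles(W, x=None):
    """Poles of the reservoir linearized at state x, assuming tanh PEs:
    eigenvalues of diag(1 - x_i**2) @ W. At the origin (x = 0) this
    reduces to the eigenvalues of W itself."""
    if x is None:
        x = np.zeros(W.shape[0])
    return np.linalg.eigvals(np.diag(1.0 - x ** 2) @ W)
```
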
2 Analysis of Echo State Networks

2.1 Echo States as Bases and Projections. Let us consider the architecture and recursive update equation of a typical ESN more closely. Consider the recurrent discrete-time neural network given in Figure 1 with M input units, N internal PEs, and L output units. The value of the input units at time n is u(n) = [u1(n), u2(n), ..., uM(n)]^T, of the internal units x(n) = [x1(n), x2(n), ..., xN(n)]^T, and of the output units y(n) = [y1(n), y2(n), ..., yL(n)]^T. The connection weights are given in an N × M weight matrix Win = (win_ij) for connections between the input and the internal PEs, in an N × N matrix W = (w_ij) for connections between the internal PEs, in an L × N matrix Wout = (wout_ij) for connections from PEs to the output units, and in an N × L matrix Wback = (wback_ij) for the connections that project back from the output to the internal PEs (Jaeger, 2001).

Figure 1: An echo state network (ESN). The ESN is composed of two parts: a fixed-weight (‖W‖ < 1) recurrent network and a linear readout. The recurrent network is a reservoir of highly interconnected dynamical components, states of which are called echo states. The memoryless linear readout is trained to produce the output.

The activation of the internal PEs (echo state) is updated according to

x(n+1) = f(Win u(n+1) + W x(n) + Wback y(n)),   (2.1)

where f = (f1, f2, ..., fN) are the internal PEs' activation functions. Here, all fi's are hyperbolic tangent functions ((e^x − e^−x)/(e^x + e^−x)). The output from the readout network is computed according to

y(n+1) = fout(Wout x(n+1)),   (2.2)

where fout = (fout_1, fout_2, ..., fout_L) are the output units' nonlinear functions (Jaeger, 2001, 2002a). Generally, the readout is linear, so fout is the identity.
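
As a concrete reading of equations 2.1 and 2.2, the following sketch performs one simulation step with tanh PEs and an identity (linear) readout; the function name and array shapes are ours.

```python
import numpy as np

def esn_step(x, u_next, y, Win, W, Wback, Wout):
    """One ESN update. x: (N,) echo states, u_next: (M,) next input,
    y: (L,) current output; Win, W, Wback, Wout as defined above."""
    x_next = np.tanh(Win @ u_next + W @ x + Wback @ y)   # equation 2.1, tanh PEs
    y_next = Wout @ x_next                                # equation 2.2, identity readout
    return x_next, y_next
```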

ESNs resemble the RNN architecture proposed in Puskorius and Feldkamp (1996) and also used by Sanchez (2004) in brain-machine interfaces. The critical difference is the dimensionality of the hidden recurrent PE layer and the adaptation of the recurrent weights. We submit that the ideas of approximation theory in functional spaces (bases and projections), so useful in adaptive signal processing (Principe, 2001), should be utilized to understand the ESN architecture. Let h(u(t)) be a real-valued function of a real-valued vector

u(t) = [u1(t), u2(t), ..., uM(t)]^T.

In functional approximation, the goal is to estimate the behavior of h(u(t)) as a combination of simpler functions ϕi(t), called the basis functionals, such that its approximant, ĥ(u(t)), is given by

ĥ(u(t)) = Σ_{i=1}^{N} ai ϕi(t).
Here,ai ’s are the projections ofh(u(t)) onto each basis function. One of
|
|||
|
the central questions in practical functional approximation is how to choose
|
|||
|
the set of bases to approximate a given desired signal. In signal processing,
|
|||
|
thechoicenormallygoesforacompletesetoforthogonalbasis,independent
|
|||
|
of the input. When the basis set is complete and can be made as large
|
|||
|
as required, fixed bases work wonders (e.g., Fourier decompositions). In
|
|||
|
neural computing, the basic idea is to derive the set of bases from the
|
|||
|
input signal through a multilayered architecture. For instance, consider a
|
|||
|
single hidden layer TDNN withNPEs and a linear output. The hidden-
|
|||
|
layer PE outputs can be considered a set of nonorthogonal basis functionals
|
|||
|
dependent on the input,
|
|||
|
|
|||
|
|
|||
|
ϕi (u(t))=g bij uj (t).
|
|||
|
j
|
|||
|
|
|||
|
bij ’s are the input layer weights, andgis the PE nonlinearity. The approxi-
|
|||
|
mation produced by the TDNN is then
|
|||
|
|
|||
|
N
|
|||
|
h ˆ(u(t))= ai ϕi (u(t)), (2.3)
|
|||
|
i=1
|
|||
|
|
|||
|
whereai ’s are the weights of the output layer. Notice that thebij ’s adapt
|
|||
|
the bases and theai ’s adapt the projection in the projection space. Here the
|
|||
|
goal is to restrict the number of bases (number of hidden layer PEs) because
|
|||
|
their number is coupled with the number of parameters to adapt, which
|
|||
|
has an impact on generalization and training set size, for example. Usually, Analysis and Design of Echo State Networks 117
|
|||
|
|
|||
|
|
|||
|
since all of the parameters of the network are adapted, the best basis in the
|
|||
|
joint (input and desired signals) space as well as the best projection can be
|
|||
|
achieved and represents the optimal solution. The output of the TDNN is
|
|||
|
a linear combination of its internal representations, but to achieve a basis
|
|||
|
set (even if nonorthogonal), linear independence among theϕi (u(t))’s must
|
|||
|
be enforced. Ito, Shah and Pon, and others have shown that this is indeed
|
|||
|
the case (Ito, 1996; Shah & Poon, 1999), but a thorough discussion is outside
|
|||
|
the scope of this article.
|
|||
|
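
As a small numeric illustration of equation 2.3 and the basis-functional view above, the sketch below computes ϕi(u(t)) from a hidden tanh layer and projects with the output weights ai; all dimensions and weight values are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 10                     # input dimension, number of hidden PEs
b = rng.normal(size=(N, M))      # input-layer weights b_ij (adapt the bases)
a = rng.normal(size=N)           # output weights a_i (adapt the projection)

u = rng.normal(size=M)           # one input sample u(t)
phi = np.tanh(b @ u)             # basis functionals: phi_i(u(t)) = g(sum_j b_ij u_j(t))
h_hat = a @ phi                  # approximant: sum_i a_i phi_i(u(t)), equation 2.3
```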

The ESN (and the RNN) architecture can also be studied in this framework. The states of equation 2.1 correspond to the basis set, which are recursively computed from the input, output, and previous states through Win, W, and Wback. Notice, however, that none of these weight matrices is adapted; that is, the functional bases in the ESN are uniquely defined by the input and the initial selection of weights. In a sense, ESNs are trading the adaptive connections in the RNN hidden layer for a brute force approach of creating fixed, diversified dynamics in the hidden layer.

For an ESN with a linear readout network, the output equation (y(n+1) = Wout x(n+1)) has the same form as equation 2.3, where the ϕi's and ai's are replaced by the echo states and the readout weights, respectively. The readout weights are adapted on the training data, which means that the ESN is able to find the optimal projection in the projection space, just like the RNN or the TDNN.

A similar perspective of basis and projections for information processing in biological networks has been proposed by Pouget and Sejnowski (1997). They explored the possibility that the response of neurons in parietal cortex serves as basis functions for the transformations from the sensory input to the motor responses. They proposed that "the role of spatial representations is to code the sensory inputs and posture signals in a format that simplifies subsequent computation, particularly in the generation of motor commands."

The central issue in ESN design is exactly the nonadaptive nature of the basis set. Parameter sets in the reservoir that provide linearly independent states and possess a given spectral radius may define drastically different projection spaces because the correlation among the bases is not constrained. A simple experiment was designed to demonstrate that the selection of the ESN parameters by constraining the spectral radius is not the most suitable for function approximation. Consider a 100-unit ESN where the input signal is sin(2πn/10π). Mimicking Jaeger (2001), the goal is to let the ESN generate the seventh power of the input signal. Different realizations of a randomly connected 100-unit ESN were constructed where the entries of W are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025, and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input weights are set to +1 or −1 with equal probabilities, and Wback is set to zero. Input is applied for 300 time steps, and the echo states are calculated using equation 2.1. The next step is to train the linear readout.

Figure 2: Performances of ESNs for different realizations of W with the same weight distribution. The weight values are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025, and 0.95. All realizations have the same spectral radius of 0.88. In the 50 realizations, MSEs vary from 5.9 × 10^−9 to 8.9 × 10^−5. Results show that for each set of random weights that provide the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.

One method to determine the optimal output weight matrix, Wout, in the mean square error (MSE) sense (where MSE is defined by O = (1/2)(d − y)^T (d − y)) is to use the Wiener solution given by Haykin (2001):
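
The experiment just described is concrete enough to sketch end to end. The reservoir, input weights, input signal, and target below follow the stated recipe, with tanh PEs as in equation 2.1; an ordinary least-squares fit stands in for the Wiener readout solution, and the washout length and random seed are illustrative choices, so individual draws will not reproduce the exact MSE range of Figure 2.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 100, 300

# Reservoir entries: 0.4, -0.4, or 0 with probabilities 0.025, 0.025, 0.95.
W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.025, 0.025, 0.95])
Win = rng.choice([1.0, -1.0], size=N)        # input weights +1 or -1; Wback is zero

n = np.arange(T)
u = np.sin(2 * np.pi * n / (10 * np.pi))     # input signal
d = u ** 7                                   # desired response: seventh power of the input

# Run equation 2.1 (no output feedback) and collect the echo states.
x = np.zeros(N)
X = np.zeros((T, N))
for t in range(T):
    x = np.tanh(Win * u[t] + W @ x)
    X[t] = x

# Train the linear readout by ordinary least squares (one way to realize the
# Wiener solution); the 100-step washout is an illustrative choice.
washout = 100
Wout, *_ = np.linalg.lstsq(X[washout:], d[washout:], rcond=None)
y = X @ Wout
mse = 0.5 * (d[washout:] - y[washout:]) @ (d[washout:] - y[washout:])
```

Repeating the draw of W with different seeds and comparing the resulting MSE values reproduces the spirit of Figure 2: the spectral radius stays roughly the same, while the approximation error varies from realization to realization.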