LETTER Communicated by Herbert Jaeger

Analysis and Design of Echo State Networks

Mustafa C. Ozturk
can@cnel.ufl.edu

Dongming Xu
dmxu@cnel.ufl.edu

José C. Príncipe
principe@cnel.ufl.edu

Computational NeuroEngineering Laboratory, Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, U.S.A.

The design of echo state network (ESN) parameters relies on the selection of the maximum eigenvalue of the linearized system around zero (spectral radius). However, this procedure does not quantify in a systematic manner the performance of the ESN in terms of approximation error. This article presents a functional space approximation framework to better understand the operation of ESNs and proposes an information-theoretic metric, the average entropy of echo states, to assess the richness of the ESN dynamics. Furthermore, it provides an interpretation of the ESN dynamics rooted in system theory as families of coupled linearized systems whose poles move according to the input signal dynamics. With this interpretation, a design methodology for functional approximation is put forward where ESNs are designed with uniform pole distributions covering the frequency spectrum to abide by the richness metric, irrespective of the spectral radius. A single bias parameter at the ESN input, adapted with the modeling error, configures the ESN spectral radius to the input-output joint space. Function approximation examples compare the proposed design methodology versus the conventional design.

1 Introduction

Dynamic computational models require the ability to store and access the time history of their inputs and outputs. The most common dynamic neural architecture is the time-delay neural network (TDNN) that couples delay lines with a nonlinear static architecture where all the parameters (weights) are adapted with the backpropagation algorithm. The conventional delay line utilizes ideal delay operators, but delay lines with local first-order recursive filters have been proposed by Werbos (1992) and extensively studied in the gamma model (de Vries, 1991; Principe, de Vries, & de Oliveira, 1993). Chains of first-order integrators are interesting because they effectively decrease the number of delays necessary to create time embeddings (Principe, 2001). Recurrent neural networks (RNNs) implement a different type of embedding that is largely unexplored. RNNs are perhaps the most biologically plausible of the artificial neural network (ANN) models (Anderson, Silverstein, Ritz, & Jones, 1977; Hopfield, 1984; Elman, 1990), but are not well understood theoretically (Siegelmann & Sontag, 1991; Siegelmann, 1993; Kremer, 1995). One of the main practical problems with RNNs is the difficulty of adapting the system weights. Various algorithms, such as backpropagation through time (Werbos, 1990) and real-time recurrent learning (Williams & Zipser, 1989), have been proposed to train RNNs; however, these algorithms suffer from computational complexity, resulting in slow training, complex performance surfaces, the possibility of instability, and the decay of gradients through the topology and time (Haykin, 1998). The problem of decaying gradients has been addressed with special processing elements (PEs) (Hochreiter & Schmidhuber, 1997). Alternative second-order training methods based on extended Kalman filtering (Singhal & Wu, 1989; Puskorius & Feldkamp, 1994; Feldkamp, Prokhorov, Eagen, & Yuan, 1998) and the multistreaming training approach (Feldkamp et al., 1998) provide more reliable performance and have enabled practical applications in identification and control of dynamical systems (Kechriotis, Zervas, & Manolakos, 1994; Puskorius & Feldkamp, 1994; Delgado, Kambhampati, & Warwick, 1995).

Recently, two new recurrent network topologies have been proposed: the echo state network (ESN) by Jaeger (2001, 2002a; Jaeger & Haas, 2004) and the liquid state machine (LSM) by Maass (Maass, Natschläger, & Markram, 2002). ESNs possess a highly interconnected and recurrent topology of nonlinear PEs that constitutes a "reservoir of rich dynamics" (Jaeger, 2001) and contain information about the history of input and output patterns. The outputs of these internal PEs (echo states) are fed to a memoryless but adaptive readout network (generally linear) that produces the network output. The interesting property of ESNs is that only the memoryless readout is trained, whereas the recurrent topology has fixed connection weights. This reduces the complexity of RNN training to simple linear regression while preserving a recurrent topology, but obviously places important constraints on the overall architecture that have not yet been fully studied. Similar ideas have been explored independently by Maass and formalized in the LSM architecture. LSMs, although formulated quite generally, are mostly implemented as neural microcircuits of spiking neurons (Maass et al., 2002), whereas ESNs are dynamical ANN models. Both attempt to model biological information processing using similar principles. We focus on the ESN formulation in this letter.

The echo state condition is defined in terms of the spectral radius (the largest among the absolute values of the eigenvalues of a matrix, denoted by ‖ · ‖) of the reservoir's weight matrix (‖W‖ < 1). This condition states that the dynamics of the ESN is uniquely controlled by the input, and the effect of the initial states vanishes. The current design of ESN parameters relies on the selection of the spectral radius. However, there are many possible weight matrices with the same spectral radius, and unfortunately they do not all perform at the same level of mean square error (MSE) for functional approximation. A similar problem exists in the design of the LSM. LSMs have been shown to possess universal approximation given the separation property (SP) for the liquid (reservoir in ESNs) and the approximation property (AP) for the readout (Maass et al., 2002). SP is quantified by a kernel-quality measure proposed in Maass, Legenstein, and Bertschinger (2005) that is based on the rank of a matrix formed by the system states corresponding to different input signals. The kernel quality is a measure of the complexity and diversity of nonlinear operations carried out by the liquid on its input stream in order to boost the classification power of a subsequent linear decision hyperplane (Maass et al., 2005). A variation of SP has been proposed in Bertschinger and Natschläger (2004), and it has been argued that complex calculations can be best carried out by networks on the boundary between ordered and chaotic dynamics.
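
Since the conventional design hinges on the spectral radius, a short numerical illustration may be useful. The sketch below (NumPy; the helper name and the density and scale values are my own illustrative assumptions, not from the letter) draws a sparse random reservoir matrix and rescales it so its spectral radius, the largest absolute eigenvalue, sits at a prescribed value below 1:

import numpy as np

def scale_to_spectral_radius(W, rho):
    """Rescale W so that max|eig(W)| equals rho (illustrative helper)."""
    current = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (rho / current)

rng = np.random.default_rng(0)
# A 100-unit reservoir with ~5% random connectivity, rescaled so the
# echo state condition discussed above is satisfied (spectral radius < 1).
W = rng.uniform(-1.0, 1.0, size=(100, 100)) * (rng.random((100, 100)) < 0.05)
W = scale_to_spectral_radius(W, rho=0.9)
print(np.max(np.abs(np.linalg.eigvals(W))))  # ~0.9

Note that infinitely many matrices pass this check with the same rho, which is precisely the degeneracy the experiments below expose.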

In this letter, we are interested in studying the ESN for functional approximation (filters that map input functions u(·) of time on output functions y(·) of time). We see two major shortcomings with the current ESN approach that uses the echo state condition as a design principle. First, the impact of fixed reservoir parameters for function approximation means that the information about the desired response is conveyed only to the output projection. This is not optimal, and strategies to select different reservoirs for different applications have not been devised. Second, imposing a constraint only on the spectral radius is a weak condition to properly set the parameters of the reservoir, as experiments show (different randomizations with the same spectral radius perform differently for the same problem; see Figure 2).

This letter aims to address these two problems by proposing a framework, a metric, and a design principle for ESNs. The framework is a signal processing interpretation of basis and projections in functional spaces to describe and understand the ESN architecture. According to this interpretation, the ESN states implement a set of basis functionals (representation space) constructed dynamically by the input, while the readout simply projects the desired response onto this representation space. The metric to describe the richness of the ESN dynamics is an information-theoretic quantity, the average state entropy (ASE). Entropy measures the amount of information contained in a given random variable (Shannon, 1948). Here, the random variable is the instantaneous echo state from which the entropy for the overall state (vector) is estimated. The probability density function (pdf) in a differential geometric framework should be thought of as a volume form; that is, in our case, the pdf of the state vector describes the metric of the state space manifold (Amari, 1990). Moreover, Cox (1946) established information as a coordinate-free metric in the state manifold. Therefore, entropy becomes a global descriptor of information that quantifies the volume of the manifold defined by the random variable. Due to the time dependency of the states, the state entropy averaged over time (ASE) is an appropriate estimate of the volume of the state manifold.
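
To make ASE concrete before the formal treatment, one possible estimator (an illustrative assumption on my part; the letter has not specified its estimator at this point) treats the N PE outputs at each time step as samples of the instantaneous state variable, estimates a Renyi quadratic entropy with a gaussian Parzen window, and averages over time:

import numpy as np

def instantaneous_state_entropy(x_n, sigma=0.1):
    """H2 = -log((1/N^2) sum_ij G(x_i - x_j)) over the N PE outputs x_n
    at one time step; G is a gaussian kernel of variance 2*sigma^2."""
    diffs = x_n[:, None] - x_n[None, :]
    gauss = np.exp(-diffs**2 / (4 * sigma**2)) / np.sqrt(4 * np.pi * sigma**2)
    return -np.log(gauss.mean())

def average_state_entropy(X, sigma=0.1):
    """ASE: time average of the instantaneous entropy; X is (T, N)."""
    return float(np.mean([instantaneous_state_entropy(x, sigma) for x in X]))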

The design principle specifies that one should consider independently the correlation among the bases and the spectral radius. In the absence of any information about the desired response, the ESN states should be designed with the highest ASE, independent of the spectral radius. We interpret the ESN dynamics as a combination of time-varying linear systems obtained from the linearization of the ESN nonlinear PE in a small, local neighborhood of the current state. The design principle means that the poles of the linearized ESN reservoir should have uniform pole distributions to generate echo states with the most diverse pole locations (which correspond to the uniformity of time constants). Effectively, this will create the least correlated bases for a given spectral radius, which corresponds to the largest volume spanned by the basis set. When the designer has no other information about the desired response to set the basis, this principle distributes the system's degrees of freedom uniformly in space. It approximates for ESNs the well-known property of orthogonal bases. The unresolved issue that ASE does not quantify is how to set the spectral radius, which depends again on the desired mapping. The concept of memory depth as explained in Principe et al. (1993) and Jaeger (2002a) is helpful in understanding the issues associated with the spectral radius. The correlation time of the desired response (as estimated by the first zero of the autocorrelation function) gives an indication of the type of spectral radius required (a long correlation time requires a high spectral radius). Alternatively, a simple adaptive bias is added at the ESN input to control the spectral radius, integrating the information from the input-output joint space into the ESN bases. For sigmoidal PEs, the bias adjusts the operating points of the reservoir PEs, which has the net effect of adjusting the volume of the state manifold as required to approximate the desired response with a small error. This letter shows that ESNs designed with this strategy obtain systematically better results in a set of experiments when compared with the conventional ESN design.

2 Analysis of Echo State Networks

2.1 Echo States as Bases and Projections. Let us consider the architecture and recursive update equation of a typical ESN more closely. Consider the recurrent discrete-time neural network given in Figure 1 with M input units, N internal PEs, and L output units. The value of the input units at time n is u(n) = [u1(n), u2(n), ..., uM(n)]^T, that of the internal units is x(n) = [x1(n), x2(n), ..., xN(n)]^T, and that of the output units is y(n) = [y1(n), y2(n), ..., yL(n)]^T. The connection weights are given in an N×M weight matrix Win = (w_ij^in) for connections between the input and the internal PEs, in an N×N matrix W = (w_ij) for connections between the internal PEs, in an L×N matrix Wout = (w_ij^out) for connections from the PEs to the output units, and in an N×L matrix Wback = (w_ij^back) for the connections that project back from the output to the internal PEs (Jaeger, 2001).

[Figure 1 appears here: a block diagram in which the input layer u(n) feeds the dynamical reservoir (states x(n), internal weights W) through Win; the readout Wout produces y(n), and Wback projects the output back to the reservoir.]

Figure 1: An echo state network (ESN). ESN is composed of two parts: a fixed-weight (‖W‖ < 1) recurrent network and a linear readout. The recurrent network is a reservoir of highly interconnected dynamical components, states of which are called echo states. The memoryless linear readout is trained to produce the output.

The activation of the internal PEs (echo state) is updated according to

    x(n+1) = f(Win u(n+1) + W x(n) + Wback y(n)),    (2.1)

where f = (f1, f2, ..., fN) are the internal PEs' activation functions. Here, all fi's are hyperbolic tangent functions, (e^x − e^(−x))/(e^x + e^(−x)). The output from the readout network is computed according to

    y(n+1) = fout(Wout x(n+1)),    (2.2)

where fout = (f1^out, f2^out, ..., fL^out) are the output units' nonlinear functions (Jaeger, 2001, 2002a). Generally, the readout is linear, so fout is the identity.
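
A minimal sketch of equations 2.1 and 2.2 in code may help fix the notation. This is an illustration only (NumPy, tanh PEs, identity fout, as stated above), not the authors' implementation:

import numpy as np

def esn_step(x, u_next, y, Win, W, Wback, Wout):
    """One ESN update: equation 2.1 followed by equation 2.2.

    x, y: previous state x(n) and output y(n); u_next: new input u(n+1).
    Win (N x M), W (N x N), Wback (N x L), Wout (L x N) as defined above.
    """
    x_next = np.tanh(Win @ u_next + W @ x + Wback @ y)  # equation 2.1
    y_next = Wout @ x_next                              # equation 2.2, fout = identity
    return x_next, y_next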

ESNs resemble the RNN architecture proposed in Puskorius and Feldkamp (1996) and also used by Sanchez (2004) in brain-machine interfaces. The critical difference is the dimensionality of the hidden recurrent PE layer and the adaptation of the recurrent weights. We submit that the ideas of approximation theory in functional spaces (bases and projections), so useful in adaptive signal processing (Principe, 2001), should be utilized to understand the ESN architecture. Let h(u(t)) be a real-valued function of a real-valued vector

    u(t) = [u1(t), u2(t), ..., uM(t)]^T.

In functional approximation, the goal is to estimate the behavior of h(u(t)) as a combination of simpler functions ϕi(t), called the basis functionals, such that its approximant, ĥ(u(t)), is given by

    ĥ(u(t)) = Σ_{i=1}^{N} ai ϕi(t).

Here, the ai's are the projections of h(u(t)) onto each basis function. One of the central questions in practical functional approximation is how to choose the set of bases to approximate a given desired signal. In signal processing, the choice normally goes for a complete set of orthogonal bases, independent of the input. When the basis set is complete and can be made as large as required, fixed bases work wonders (e.g., Fourier decompositions). In neural computing, the basic idea is to derive the set of bases from the input signal through a multilayered architecture. For instance, consider a single hidden layer TDNN with N PEs and a linear output. The hidden-layer PE outputs can be considered a set of nonorthogonal basis functionals dependent on the input,

    ϕi(u(t)) = g(Σ_j bij uj(t)).

The bij's are the input layer weights, and g is the PE nonlinearity. The approximation produced by the TDNN is then

    ĥ(u(t)) = Σ_{i=1}^{N} ai ϕi(u(t)),    (2.3)

where the ai's are the weights of the output layer. Notice that the bij's adapt the bases and the ai's adapt the projection in the projection space. Here the goal is to restrict the number of bases (number of hidden layer PEs) because their number is coupled with the number of parameters to adapt, which has an impact on generalization and training set size, for example. Usually, since all of the parameters of the network are adapted, the best basis in the joint (input and desired signals) space as well as the best projection can be achieved and represents the optimal solution. The output of the TDNN is a linear combination of its internal representations, but to achieve a basis set (even if nonorthogonal), linear independence among the ϕi(u(t))'s must be enforced. Ito, Shah and Poon, and others have shown that this is indeed the case (Ito, 1996; Shah & Poon, 1999), but a thorough discussion is outside the scope of this article.
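
A compact sketch of this structure (illustrative; tanh stands in for the nonlinearity g, and in practice B and a would come from training):

import numpy as np

def tdnn_approximant(u, B, a):
    """Equation 2.3: hhat(u) = sum_i a_i * phi_i(u), with basis functionals
    phi_i(u) = g(sum_j b_ij * u_j). B is N x M (adapts the bases); a has
    length N (adapts the projection)."""
    phi = np.tanh(B @ u)  # nonorthogonal, input-dependent bases
    return a @ phi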

The ESN (and the RNN) architecture can also be studied in this framework. The states of equation 2.1 correspond to the basis set, which are recursively computed from the input, output, and previous states through Win, W, and Wback. Notice, however, that none of these weight matrices is adapted; that is, the functional bases in the ESN are uniquely defined by the input and the initial selection of weights. In a sense, ESNs are trading the adaptive connections in the RNN hidden layer for a brute force approach of creating fixed, diversified dynamics in the hidden layer.

For an ESN with a linear readout network, the output equation (y(n+1) = Wout x(n+1)) has the same form as equation 2.3, where the ϕi's and ai's are replaced by the echo states and the readout weights, respectively. The readout weights are adapted on the training data, which means that the ESN is able to find the optimal projection in the projection space, just like the RNN or the TDNN.

A similar perspective of bases and projections for information processing in biological networks has been proposed by Pouget and Sejnowski (1997). They explored the possibility that the responses of neurons in parietal cortex serve as basis functions for the transformations from the sensory input to the motor responses. They proposed that "the role of spatial representations is to code the sensory inputs and posture signals in a format that simplifies subsequent computation, particularly in the generation of motor commands."

The central issue in ESN design is exactly the nonadaptive nature of the basis set. Parameter sets in the reservoir that provide linearly independent states and possess a given spectral radius may define drastically different projection spaces because the correlation among the bases is not constrained. A simple experiment was designed to demonstrate that the selection of the ESN parameters by constraining the spectral radius is not the most suitable for function approximation. Consider a 100-unit ESN where the input signal is sin(2πn/10π). Mimicking Jaeger (2001), the goal is to let the ESN generate the seventh power of the input signal. Different realizations of a randomly connected 100-unit ESN were constructed where the entries of W are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025, and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input weights are set to +1 or −1 with equal probabilities, and Wback is set to zero. The input is applied for 300 time steps, and the echo states are calculated using equation 2.1. The next step is to train the linear readout.
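
This setup translates directly into code. The sketch below follows the stated construction (the random seed is arbitrary, and any single draw of W has a spectral radius only approximately equal to the nominal 0.88):

import numpy as np

rng = np.random.default_rng(42)
N, T = 100, 300

# Entries of W: 0.4, -0.4, or 0 with probabilities 0.025, 0.025, 0.95.
W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.025, 0.025, 0.95])
print(np.max(np.abs(np.linalg.eigvals(W))))  # approximately 0.88

# Input weights +1 or -1 with equal probability; Wback is zero.
Win = rng.choice([1.0, -1.0], size=N)

n = np.arange(T)
u = np.sin(2 * np.pi * n / (10 * np.pi))  # input signal
d = u**7                                  # desired: seventh power of the input

# Run equation 2.1 (no output feedback) and collect the echo states.
X = np.zeros((T, N))
x = np.zeros(N)
for t in range(T):
    x = np.tanh(Win * u[t] + W @ x)
    X[t] = x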

[Figure 2 appears here: MSE values on a logarithmic axis (roughly 10^-9 to 10^-4) plotted against realization index, 0 to 50.]

Figure 2: Performances of ESNs for different realizations of W with the same weight distribution. The weight values are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025, and 0.95. All realizations have the same spectral radius of 0.88. In the 50 realizations, MSEs vary from 5.9×10^−9 to 8.9×10^−5. Results show that for each set of random weights that provide the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.

One method to determine the optimal output weight matrix, Wout, in the mean square error (MSE) sense (where the MSE is defined by O = (1/2)(d − y)^T(d − y)) is to use the Wiener solution given by Haykin (2001):
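
The text is truncated here, before the Wiener solution formula itself. Under the standard least-squares reading (an assumption of mine, not quoted from the letter), training the linear readout reduces to solving the normal equations on the collected echo states:

import numpy as np

def train_readout(X, d, ridge=0.0):
    """Wiener-type readout: w = (X^T X)^(-1) X^T d, minimizing the squared
    error between X w and d. X is (T, N) echo states; d is (T,) desired.
    The optional ridge term is a common numerical-stability safeguard."""
    R = X.T @ X + ridge * np.eye(X.shape[1])  # state autocorrelation matrix
    p = X.T @ d                               # cross-correlation with desired
    return np.linalg.solve(R, p)

# Continuing the experiment above: w_out = train_readout(X, d), and the MSE
# of y = X @ w_out against d is what varies across realizations in Figure 2.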