LETTER Communicated by Herbert Jaeger
Analysis and Design of Echo State Networks
Mustafa C. Ozturk
can@cnel.ufl.edu
Dongming Xu
dmxu@cnel.ufl.edu
José C. Príncipe
principe@cnel.ufl.edu
Computational NeuroEngineering Laboratory, Department of Electrical and
Computer Engineering, University of Florida, Gainesville, FL 32611, U.S.A.
The design of echo state network (ESN) parameters relies on the selec-
tion of the maximum eigenvalue of the linearized system around zero
(spectral radius). However, this procedure does not quantify in a sys-
tematic manner the performance of the ESN in terms of approximation
error. This article presents a functional space approximation framework
to better understand the operation of ESNs and proposes an information-
theoretic metric, the average entropy of echo states, to assess the richness
of the ESN dynamics. Furthermore, it provides an interpretation of the
ESN dynamics rooted in system theory as families of coupled linearized
systems whose poles move according to the input signal dynamics. With
this interpretation, a design methodology for functional approximation
is put forward where ESNs are designed with uniform pole distributions
covering the frequency spectrum to abide by the richness metric, irre-
spective of the spectral radius. A single bias parameter at the ESN input,
adapted with the modeling error, configures the ESN spectral radius to
the input-output joint space. Function approximation examples compare
the proposed design methodology versus the conventional design.
1 Introduction
Dynamic computational models require the ability to store and access the
time history of their inputs and outputs. The most common dynamic neural
architecture is the time-delay neural network (TDNN) that couples delay
lines with a nonlinear static architecture where all the parameters (weights)
are adapted with the backpropagation algorithm. The conventional delay
line utilizes ideal delay operators, but delay lines with local first-order re-
cursive filters have been proposed by Werbos (1992) and extensively stud-
ied in the gamma model (de Vries, 1991; Principe, de Vries, & de Oliviera,
1993). Chains of first-order integrators are interesting because they effec-
tively decrease the number of delays necessary to create time embeddings
(Principe, 2001). Recurrent neural networks (RNNs) implement a differ-
ent type of embedding that is largely unexplored. RNNs are perhaps the
most biologically plausible of the artificial neural network (ANN) models
(Anderson, Silverstein, Ritz, & Jones, 1977; Hopfield, 1984; Elman, 1990),
but are not well understood theoretically (Siegelmann & Sontag, 1991;
Siegelmann, 1993; Kremer, 1995). One of the main practical problems with
RNNs is the difficulty to adapt the system weights. Various algorithms,
such as backpropagation through time (Werbos, 1990) and real-time recur-
rent learning (Williams & Zipser, 1989), have been proposed to train RNNs;
however, these algorithms suffer from computational complexity, resulting
in slow training, complex performance surfaces, the possibility of instabil-
ity, and the decay of gradients through the topology and time (Haykin,
1998). The problem of decaying gradients has been addressed with spe-
cial processing elements (PEs) (Hochreiter & Schmidhuber, 1997). Alter-
native second-order training methods based on extended Kalman filtering
(Singhal & Wu, 1989; Puskorius & Feldkamp, 1994; Feldkamp, Prokhorov,
Eagen, & Yuan, 1998) and the multistreaming training approach (Feldkamp
et al., 1998) provide more reliable performance and have enabled practical
applications in identification and control of dynamical systems (Kechri-
otis, Zervas, & Monolakos, 1994; Puskorius & Feldkamp, 1994; Delgado,
Kambhampati, & Warwick, 1995).
Recently, two new recurrent network topologies have been proposed: the
echo state network (ESN) by Jaeger (2001, 2002a; Jaeger & Hass, 2004) and
the liquid state machine (LSM) by Maass (Maass, Natschläger, & Markram,
2002). ESNs possess a highly interconnected and recurrent topology of
nonlinear PEs that constitutes a “reservoir of rich dynamics” (Jaeger, 2001)
and contain information about the history of input and output patterns.
The outputs of these internal PEs (echo states) are fed to a memoryless but
adaptive readout network (generally linear) that produces the network out-
put. The interesting property of ESN is that only the memoryless readout is
trained, whereas the recurrent topology has fixed connection weights. This
reduces the complexity of RNN training to simple linear regression while
preserving a recurrent topology, but obviously places important constraints
in the overall architecture that have not yet been fully studied. Similar ideas
have been explored independently by Maass and formalized in the LSM
architecture. LSMs, although formulated quite generally, are mostly im-
plemented as neural microcircuits of spiking neurons (Maass et al., 2002),
whereas ESNs are dynamical ANN models. Both attempt to model biolog-
ical information processing using similar principles. We focus on the ESN
formulation in this letter.
The echo state condition is defined in terms of the spectral radius (the
largest among the absolute values of the eigenvalues of a matrix, denoted
by ‖·‖) of the reservoir's weight matrix (‖W‖ < 1). This condition states
that the dynamics of the ESN is uniquely controlled by the input, and the
effect of the initial states vanishes. The current design of ESN parameters
relies on the selection of spectral radius. However, there are many possible
weight matrices with the same spectral radius, and unfortunately they do
not all perform at the same level of mean square error (MSE) for functional
approximation. A similar problem exists in the design of the LSM. LSMs
have been shown to possess universal approximation given the separation
property (SP) for the liquid (reservoir in ESNs) and the approximation
property (AP) for the readout (Maass et al., 2002). SP is quantified by a
kernel-quality measure proposed in Maass, Legenstein, and Bertschinger
(2005) that is based on the rank of a matrix formed by the system states
corresponding to different input signals. The kernel quality is a measure
for the complexity and diversity of nonlinear operations carried out by the
liquid on its input stream in order to boost the classification power of a
subsequent linear decision hyperplane (Maass et al., 2005). A variation of
SP has been proposed in Bertschinger and Natschläger (2004), and it has
been argued that complex calculations can be best carried out by networks
on the boundary between ordered and chaotic dynamics.
In this letter, we are interested in studying the ESN for functional approx-
imation (filters that map input functions u(·) of time on output functions y(·)
of time). We see two major shortcomings with the current ESN approach
that uses echo state condition as a design principle. First, the impact of fixed
reservoir parameters for function approximation means that the informa-
tion about the desired response is conveyed only to the output projection.
This is not optimal, and strategies to select different reservoirs for different
applications have not been devised. Second, imposing a constraint only on
the spectral radius is a weak condition to properly set the parameters of
the reservoir, as experiments show (different randomizations with the same
spectral radius perform differently for the same problem; see Figure 2).
This letter aims to address these two problems by proposing a frame-
work, a metric, and a design principle for ESNs. The framework is a signal
processing interpretation of basis and projections in functional spaces to
describe and understand the ESN architecture. According to this interpre-
tation, the ESN states implement a set of basis functionals (representation
space) constructed dynamically by the input, while the readout simply
projects the desired response onto this representation space. The metric
to describe the richness of the ESN dynamics is an information-theoretic
quantity, the average state entropy (ASE). Entropy measures the amount of
information contained in a given random variable (Shannon, 1948). Here,
the random variable is the instantaneous echo state from which the en-
tropy for the overall state (vector) is estimated. The probability density
function (pdf) in a differential geometric framework should be thought of
as a volume form; that is, in our case, the pdf of the state vector describes
the metric of the state space manifold (Amari, 1990). Moreover, Cox (1946)
established information as a coordinate free metric in the state manifold.
Therefore, entropy becomes a global descriptor of information that quanti-
fies the volume of the manifold defined by the random variable. Due to the
time dependency of the states, the state entropy averaged over time (ASE)
is an appropriate estimate of the volume of the state manifold.
The design principle specifies that one should consider independently
the correlation among the basis and the spectral radius. In the absence of any
information about the desired response, the ESN states should be designed
with the highest ASE, independent of the spectral radius. We interpret the
ESN dynamics as a combination of time-varying linear systems obtained
from the linearization of the ESN nonlinear PE in a small, local neighbor-
hood of the current state. The design principle means that the poles of the
linearized ESN reservoir should have uniform pole distributions to gener-
ate echo states with the most diverse pole locations (which correspond to
the uniformity of time constants). Effectively, this will create the least cor-
related bases for a given spectral radius, which corresponds to the largest
volume spanned by the basis set. When the designer has no other informa-
tion about the desired response to set the basis, this principle distributes
the system's degrees of freedom uniformly in space. It approximates for
ESNs the well-known property of orthogonal basis. The unresolved issue
that ASE does not quantify is how to set the spectral radius, which depends
again on the desired mapping. The concept of memory depth as explained
in Principe et al. (1993) and Jaeger (2002a) is helpful in understanding the
issues associated with the spectral radius. The correlation time of the de-
sired response (as estimated by the first zero of the autocorrelation function)
gives an indication of the type of spectral radius required (long correlation
time requires high spectral radius). Alternatively, a simple adaptive bias is
added at the ESN input to control the spectral radius integrating the infor-
mation from the input-output joint space in the ESN bases. For sigmoidal
PEs, the bias adjusts the operating points of the reservoir PEs, which has
the net effect of adjusting the volume of the state manifold as required to
approximate the desired response with a small error. This letter shows that
ESNs designed with this strategy obtain systematically better results in a
set of experiments when compared with the conventional ESN design.
2 Analysis of Echo State Networks
2.1 Echo States as Bases and Projections. Let us consider the ar-
chitecture and recursive update equation of a typical ESN more closely.
Consider the recurrent discrete-time neural network given in Figure 1
with M input units, N internal PEs, and L output units. The value of
the input units at time n is u(n) = [u_1(n), u_2(n), ..., u_M(n)]^T, of internal
units x(n) = [x_1(n), x_2(n), ..., x_N(n)]^T, and of output units y(n) =
[y_1(n), y_2(n), ..., y_L(n)]^T. The connection weights are given in an N×M
weight matrix W^in = (w^in_ij) for connections between the input and the internal
PEs, in an N×N matrix W = (w_ij) for connections between the internal
PEs, in an L×N matrix W^out = (w^out_ij) for connections from the PEs to the
Figure 1: An echo state network (ESN). ESN is composed of two parts: a fixed-
weight (‖W‖ < 1) recurrent network and a linear readout. The recurrent net-
work is a reservoir of highly interconnected dynamical components, states of
which are called echo states. The memoryless linear readout is trained to pro-
duce the output.
output units, and in an N×L matrix W^back = (w^back_ij) for the connections
that project back from the output to the internal PEs (Jaeger, 2001). The
activation of the internal PEs (echo state) is updated according to

x(n+1) = f(W^in u(n+1) + W x(n) + W^back y(n)),   (2.1)

where f = (f_1, f_2, ..., f_N) are the internal PEs' activation functions. Here, all
f_i's are hyperbolic tangent functions ((e^x − e^−x)/(e^x + e^−x)). The output from the readout
network is computed according to

y(n+1) = f^out(W^out x(n+1)),   (2.2)

where f^out = (f^out_1, f^out_2, ..., f^out_L) are the output units' nonlinear functions
(Jaeger, 2001, 2002a). Generally, the readout is linear, so f^out is the identity.
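For concreteness, equations 2.1 and 2.2 can be written as a minimal NumPy sketch for tanh PEs and a linear readout. The function names and the use of dense matrices are illustrative assumptions, not part of the original formulation.

import numpy as np

def esn_update(x, u_next, y, W_in, W, W_back):
    # Equation 2.1: x(n+1) = f(W_in u(n+1) + W x(n) + W_back y(n)), with f = tanh.
    return np.tanh(W_in @ u_next + W @ x + W_back @ y)

def esn_readout(x_next, W_out):
    # Equation 2.2 with f_out = identity: y(n+1) = W_out x(n+1).
    return W_out @ x_next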
ESNs resemble the RNN architecture proposed in Puskorius and
Feldkamp (1996) and also used by Sanchez (2004) in brain-machine
interfaces. The critical difference is the dimensionality of the hidden re-
current PE layer and the adaptation of the recurrent weights. We submit
that the ideas of approximation theory in functional spaces (bases and pro-
jections), so useful in adaptive signal processing (Principe, 2001), should
be utilized to understand the ESN architecture. Let h(u(t)) be a real-valued
function of a real-valued vector

u(t) = [u_1(t), u_2(t), ..., u_M(t)]^T.

In functional approximation, the goal is to estimate the behavior of h(u(t))
as a combination of simpler functions φ_i(t), called the basis functionals,
such that its approximant, ĥ(u(t)), is given by

ĥ(u(t)) = Σ_{i=1}^{N} a_i φ_i(t).

Here, the a_i's are the projections of h(u(t)) onto each basis function. One of
the central questions in practical functional approximation is how to choose
the set of bases to approximate a given desired signal. In signal processing,
thechoicenormallygoesforacompletesetoforthogonalbasis,independent
of the input. When the basis set is complete and can be made as large
as required, fixed bases work wonders (e.g., Fourier decompositions). In
neural computing, the basic idea is to derive the set of bases from the
input signal through a multilayered architecture. For instance, consider a
single hidden layer TDNN with N PEs and a linear output. The hidden-
layer PE outputs can be considered a set of nonorthogonal basis functionals
dependent on the input,

φ_i(u(t)) = g( Σ_j b_ij u_j(t) ).

The b_ij's are the input layer weights, and g is the PE nonlinearity. The approxi-
mation produced by the TDNN is then

ĥ(u(t)) = Σ_{i=1}^{N} a_i φ_i(u(t)),   (2.3)

where the a_i's are the weights of the output layer. Notice that the b_ij's adapt
the bases and the a_i's adapt the projection in the projection space. Here the
goal is to restrict the number of bases (number of hidden layer PEs) because
their number is coupled with the number of parameters to adapt, which
has an impact on generalization and training set size, for example. Usually,
since all of the parameters of the network are adapted, the best basis in the
joint (input and desired signals) space as well as the best projection can be
achieved and represents the optimal solution. The output of the TDNN is
a linear combination of its internal representations, but to achieve a basis
set (even if nonorthogonal), linear independence among the φ_i(u(t))'s must
be enforced. Ito, Shah and Poon, and others have shown that this is indeed
the case (Ito, 1996; Shah & Poon, 1999), but a thorough discussion is outside
the scope of this article.
The ESN (and the RNN) architecture can also be studied in this frame-
work. The states of equation 2.1 correspond to the basis set, which are
recursively computed from the input, output, and previous states through
W^in, W, and W^back. Notice, however, that none of these weight matrices is
adapted, that is, the functional bases in the ESN are uniquely defined by the
input and the initial selection of weights. In a sense, ESNs are trading the
adaptive connections in the RNN hidden layer by a brute force approach
of creating fixed diversified dynamics in the hidden layer.
For an ESN with a linear readout network, the output equation (y(n+1) =
W^out x(n+1)) has the same form as equation 2.3, where the φ_i's and
a_i's are replaced by the echo states and the readout weights, respectively.
The readout weights are adapted in the training data, which means that the
ESN is able to find the optimal projection in the projection space, just like
the RNN or the TDNN.
A similar perspective of basis and projections for information processing
in biological networks has been proposed by Pouget and Sejnowski (1997).
They explored the possibility that the response of neurons in parietal cortex
serves as basis functions for the transformations from the sensory input
to the motor responses. They proposed that “the role of spatial represen-
tations is to code the sensory inputs and posture signals in a format that
simplifies subsequent computation, particularly in the generation of motor
commands”.
The central issue in ESN design is exactly the nonadaptive nature of
the basis set. Parameter sets in the reservoir that provide linearly inde-
pendent states and possess a given spectral radius may define drastically
different projection spaces because the correlation among the bases is not
constrained. A simple experiment was designed to demonstrate that the se-
lection of the ESN parameters by constraining the spectral radius is not the
most suitable for function approximation. Consider a 100-unit ESN where
the input signal is sin(2πn/10π). Mimicking Jaeger (2001), the goal is to let
the ESN generate the seventh power of the input signal. Different realiza-
tions of a randomly connected 100-unit ESN were constructed where the
entries of W are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025,
and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input
weights are set to +1 or −1 with equal probabilities, and W^back is set to
zero. Input is applied for 300 time steps, and the echo states are calculated
using equation 2.1. The next step is to train the linear readout. One method
Figure 2: Performances of ESNs for different realizations of W with the same
weight distribution. The weight values are set to 0.4, −0.4, and 0 with proba-
bilities of 0.025, 0.025, and 0.95. All realizations have the same spectral radius
of 0.88. In the 50 realizations, MSEs vary from 5.9×10^−9 to 8.9×10^−5. Results
show that for each set of random weights that provide the same spectral ra-
dius, the correlation or degree of redundancy among the bases will change, and
different performances are encountered in practice.
to determine the optimal output weight matrix, W^out, in the mean square
error (MSE) sense (where MSE is defined by O = (1/2)(d − y)^T (d − y)) is to use
the Wiener solution given by Haykin (2001):

W^out = E[x x^T]^−1 E[x d] ≈ [ (1/N) Σ_n x(n) x(n)^T ]^−1 [ (1/N) Σ_n x(n) d(n) ].   (2.4)

Here, E[·] denotes the expected value operator, and d denotes the desired
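A minimal sketch of this readout computation is given below, with the echo states collected row-wise in X; the function name is illustrative, and in practice a small regularization term is often added to the state correlation matrix, which equation 2.4 does not include.

import numpy as np

def wiener_readout(X, d):
    # X: (samples x N) matrix of echo states, d: desired signal of length samples.
    # Equation 2.4: W_out = E[x x^T]^{-1} E[x d], estimated with sample averages.
    R = X.T @ X / len(d)           # state autocorrelation estimate
    p = X.T @ d / len(d)           # cross-correlation with the desired signal
    return np.linalg.solve(R, p)   # solve R w = p rather than forming the inverse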
signal. Figure 2 depicts the MSE values for 50 different realizations of
the ESNs. As observed, even though each ESN has the same sparseness
and spectral radius, the MSE values obtained vary greatly among differ-
ent realizations. The minimum MSE value obtained among the 50 realiza-
tions is 5.9×10^−9, whereas the maximum MSE is 8.9×10^−5. This experiment
demonstrates that a design strategy that is based solely on the spectral
radius is not sufficient to specify the system architecture for function ap-
proximation. This shows that for each set of random weights that provide
thesamespectralradius,thecorrelationordegreeofredundancyamongthe
bases will change, and different performances are encountered in practice.
2.2 ESN Dynamics as a Combination of Linear Systems. It is well
known that the dynamics of a nonlinear system can be approximated by
that of a linear system in a small neighborhood of an equilibrium point
(Kuznetsov, Kuznetsov, & Marsden, 1998). Here, we perform the analysis
with hyperbolic tangent nonlinearities and approximate the ESN dynam-
ics by the dynamics of the linearized system in the neighborhood of the
current system state. Hence, when the system operating point varies over
time, the linear system approximating the ESN dynamics changes. We are
particularly interested in the movement of the poles of the linearized ESN.
Consider the update equation for the ESN without output feedback given
by
x(n+1) = f(W^in u(n+1) + W x(n)).
Linearizing the system around the current state x(n), one obtains the
Jacobian matrix, J(n+1), defined by

J(n+1) = [ f'(net_1(n))w_11   f'(net_1(n))w_12   ···   f'(net_1(n))w_1N ]
         [ f'(net_2(n))w_21   f'(net_2(n))w_22   ···   f'(net_2(n))w_2N ]
         [       ···                ···          ···         ···        ]
         [ f'(net_N(n))w_N1   f'(net_N(n))w_N2   ···   f'(net_N(n))w_NN ]

       = [ f'(net_1(n))      0         ···       0        ]
         [      0       f'(net_2(n))   ···       0        ]  · W = F(n) · W.   (2.5)
         [     ···           ···       ···      ···       ]
         [      0             0        ···  f'(net_N(n))  ]

Here, net_i(n) is the ith entry of the vector (W^in u(n+1) + W x(n)), and w_ij
denotes the (i,j)th entry of W. The poles of the linearized system at time
n+1 are given by the eigenvalues of the Jacobian matrix J(n+1).^1 As the
amplitude of each PE changes, the local slope changes, and so the poles of

^1 The transfer function of a linear system x(n+1) = Ax(n) + Bu(n) is X(z)/U(z) = (zI −
A)^−1 B = Adjoint(zI − A) B / det(zI − A). The poles of the transfer function can be obtained by solving
det(zI − A) = 0. The solution corresponds to the eigenvalues of A.
the linearized system are time varying, although the parameters of ESN are
fixed.
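The pole movement described above can be reproduced numerically. The sketch below is a simplified stand-in for the procedure used in the following paragraph (no output feedback, tanh PEs); it advances the ESN state and collects the eigenvalues of F(n)W at every step. The function name and the zero initial state are assumptions.

import numpy as np

def linearized_poles(W, W_in, u_seq):
    # Track the poles (eigenvalues of J(n+1) = F(n) W, equation 2.5) along a trajectory.
    N = W.shape[0]
    x = np.zeros(N)                                # assumed zero initial state
    poles = []
    for u in u_seq:
        net = W_in @ np.atleast_1d(u) + W @ x      # pre-activations (equation 2.1 without feedback)
        F = np.diag(1.0 - np.tanh(net) ** 2)       # f'(net) for tanh PEs
        poles.append(np.linalg.eigvals(F @ W))     # poles of the local linearized system
        x = np.tanh(net)                           # advance the ESN state
    return np.array(poles)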
In order to visualize the movement of the poles, consider an ESN with
100 states. The entries of the internal weight matrix are chosen to be 0,
0.4, and −0.4 with probabilities 0.9, 0.05, and 0.05. W is scaled such that a
spectral radius of 0.95 is obtained. Input weights are set to +1 or −1 with
equal probabilities. A sinusoidal signal with a period of 100 is fed to the
system, and the echo states are computed according to equation 2.1. Then
the Jacobian matrix and the eigenvalues are calculated using equation 2.5.
Figure 3 shows the pole tracks of the linearized ESN for different input
values. A single ESN with fixed parameters implements a combination of
many linear systems with varying pole locations, hence many different
time constants that modulate the richness of the reservoir of dynamics as a
function of input amplitude. Higher-amplitude portions of the signal tend
to saturate the nonlinear function and cause the poles to shrink toward
the origin of the z-plane (decreases the spectral radius), which results in a
system with a large stability margin. When the input is close to zero, the
poles of the linearized ESN are close to the maximal spectral radius chosen,
decreasing the stability margin. When compared to their linear counterpart,
an ESN with the same number of states results in a detailed coverage of
the z-plane dynamics, which illustrates the power of nonlinear systems.
Similar results can be obtained using signals of different shapes at the ESN
input.
A key corollary of the above analysis is that the spectral radius of an
ESN can be adjusted using a constant bias signal at the ESN input without
changing the recurrent connection matrix,W. The application of a nonzero
constant bias will move the operating point to regions of the sigmoid func-
tion closer to saturation and always decrease the spectral radius due to the
shape of the nonlinearity. 2 The relevance of bias in terms of overall system
performance has also been discussed in Jaeger (2002b) and Bertschinger
and Natschläger (2004), but here we approach it from a system theory per-
spective and explain its effect on reservoir dynamics.
3 Average State Entropy as a Measure of the Richness of ESN Reservoir
Previous research was aware of the influence of diversity of the recurrent
layer outputs on the overall performance of ESNs and LSMs. Several met-
rics to quantify the diversity have been proposed (Jaeger, 2001; Maass, et al.,
2 Assume W has nondegenerate eigenvalues and corresponding linearly independent
eigenvectors. Then consider the eigendecomposition of W, where W = PDP^−1, P is the
eigenvector matrix and D is the diagonal matrix of eigenvalues (D_ii) of W. Since F(n) and D
are diagonal, J(n+1) = F(n)W = F(n)(PDP^−1) = P(F(n)D)P^−1 is the eigendecomposition
of J(n+1). Here, each entry of F(n)D, f'(net_i(n))D_ii, is an eigenvalue of J. Therefore,
|f'(net_i(n))D_ii| ≤ |D_ii| since f'(net_i) ≤ f'(0).
Figure 3: The pole tracks of the linearized ESN with 100 PEs when the input
goes through a cycle. An ESN with fixed parameters implements a combination
of linear systems with varying pole locations. (A) One cycle of sinusoidal signal
with a period of 100. (B–E) The positions of poles of the linearized systems
when the input values are at B, C, D, and E in Figure 5A. (F) The cumulative
pole locations show the movement of the poles as the input changes. Due to
the varying pole locations, different time constants modulate the richness of
the reservoir of dynamics as a function of input amplitude. Higher-amplitude
signals tend to saturate the nonlinear function and cause the poles to shrink
toward the origin of the z-plane (decreases the spectral radius), which results in
a system with a large stability margin. When the input is close to zero, the poles
of the linearized ESN are close to the maximal spectral radius chosen, decreasing
the stability margin. An ESN with more states results in a detailed coverage of
the z-plane dynamics, which illustrates the power of nonlinear systems, when
compared to their linear counterpart.
2005). Here, our approach of bases and projections leads to a new metric.
We propose the instantaneous state entropy to quantify the distribution of
instantaneous amplitudes across the ESN states. Entropy of the instanta-
neous ESN states is appropriate to quantify performance in function ap-
proximation because the ESN output is a mere weighted combination of
the instantaneous value of the ESN states. If the echo states' instantaneous
amplitudes are concentrated on only a few values across the ESN state dy-
namic range, the ability to approximate an arbitrary desired response by
weighting the states is limited (and wasteful due to redundancy between
the different states), and performance will suffer. On the other hand, if the
ESN states provide a diversity of instantaneous amplitudes, it is much eas-
ier to achieve the desired mapping. Hence, the instantaneous entropy of the
states appears as a good measure to quantify the richness of dynamics with
instantaneous mappers. Due to the time structure of signals, the average
state entropy (ASE), defined as the state entropy averaged over time, will be
the parameter used to quantify the diversity in the dynamical reservoir of
the ESN. Moreover, entropy has been proposed as an appropriate measure
of the volume of the signal manifold (Cox, 1946; Amari, 1990). Here, ASE
measures the volume of the echo state manifold spanned by trajectories.
Renyi's quadratic entropy is employed here because it is a global measure
of information. In addition, an efficient nonparametric estimator of Renyi's
entropy, which avoids explicit pdf estimation, has been developed (Principe,
Xu, & Fisher, 2000). Renyi's entropy with parameter γ for a random variable
X with a pdf f_X(x) is given by Renyi (1970):

H_γ(X) = (1/(1−γ)) log E[f_X^{γ−1}(X)].
Renyi's quadratic entropy is obtained for γ = 2 (for γ → 1, Shannon's en-
tropy is obtained). Given N samples {x_1, x_2, ..., x_N} drawn from the un-
known pdf to be estimated, Parzen windowing approximates the underly-
ing pdf by

f_X(x) = (1/N) Σ_{i=1}^{N} K_σ(x − x_i),

where K_σ is the kernel function with the kernel size σ. Then Renyi's
quadratic entropy can be estimated by (Principe et al., 2000)

H_2(X) = −log[ (1/N^2) Σ_j Σ_i K_σ(x_j − x_i) ].   (3.1)
The instantaneous state entropy is estimated using equation 3.1 where
the samples are the entries of the state vector x(n) = [x_1(n), x_2(n), ..., x_N(n)]^T
of an ESN with N internal PEs. Results will be shown with a gaussian kernel
with kernel size chosen to be 0.3 of the standard deviation of the entries
of the state vector. We will show that ASE is a more sensitive parameter to
quantify the approximation properties of ESNs by experimentally demon-
strating that ESNs with different spectral radius and even with the same
spectral radius display different ASEs.
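A compact sketch of this estimator is given below, assuming a gaussian Parzen kernel whose size is 0.3 of the standard deviation of the state entries, as in the experiments that follow; the function names are illustrative.

import numpy as np

def instantaneous_state_entropy(x, kernel_frac=0.3):
    # Renyi's quadratic entropy (equation 3.1) of the N echo-state values at one time step.
    sigma = kernel_frac * np.std(x)
    diffs = x[:, None] - x[None, :]
    K = np.exp(-diffs ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(np.mean(K))             # mean(K) = (1/N^2) sum_j sum_i K(x_j - x_i)

def average_state_entropy(states):
    # states: (time x N) array of echo states; ASE is the time average of the entropy.
    return np.mean([instantaneous_state_entropy(x) for x in states])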
Let us consider the same 100-unit ESN that we used in the previous
section built with three different spectral radii 0.2, 0.5, 0.8 with an input
signal of sin(2πn/20). Figure 4A depicts the echo states over 200 time ticks.
The instantaneous state entropy is also calculated at each time step using
equation 3.1 and plotted in Figure 4B. First, note that the instantaneous
state entropy changes over time with the distribution of the echo states as
we would expect, since state entropy is dependent on the input signal that
also changes in this case. Second, as the spectral radius increases in the
simulation, the diversity in the echo states increases. For the spectral radius
of 0.2, the echo states' instantaneous amplitudes are concentrated on only a
few values, which is wasteful due to redundancy between different states.
In practice, to quantify the overall representation ability over time, we will
use ASE, which takes values −0.735, −0.007, and 0.335 for the spectral
radii of 0.2, 0.5, and 0.8, respectively. Moreover, even for the same spectral
radius, several ASEs are possible. Figure 4C shows ASEs from 50 different
realizations of ESNs with the same spectral radius of 0.5, which means that
ASE is a finer descriptor of the dynamics of the reservoir. Although we
have presented an experiment with sinusoidal signal, similar results are
obtained for other inputs as long as the input dynamic range is properly
selected.
Maximizing ASE means that the diversity of the states over time is the
largest and should provide a basis set that is as uncorrelated as possible.
This condition is unfortunately not a guarantee that the ESN so designed
will perform the best, because the basis set in ESNs is created independent
of the desired response and the application may require a small spectral
radius. However, we maintain that when the desired response is not ac-
cessible for the design of the ESN bases or when the same reservoir is
to be used for a number of problems, the default strategy should be to
maximize the ASE of the state vector. The following section addresses
the design of ESNs with high ASE values and a simple mechanism to
adjust the reservoir dynamics without changing the recurrent connection
weights.
4 Designing Echo State Networks
4.1 Design of the Echo State Recurrent Connections. According to the
interpretation of ESNs as coupled linear systems, the design of the internal
connection matrix, W, will be based on the distribution of the poles of the
linearized system around zero state. Our proposal is to design the ESN
such that the linearized system has uniform pole distribution inside the
unit circle of the z-plane. With this design scenario, the system dynamics
will include uniform coverage of time constants arising from the uniform
distribution of the poles, which also decorrelates as much as possible the
basis functionals. This principle was chosen by analogy to the identification
of linear systems using Kautz filters (Kautz, 1954), which shows that the best
approximation of a given transfer function by a linear system with finite
order is achieved when poles are placed in the neighborhood of the spectral
resonances. When no information is available about the desired response,
we should uniformly spread the poles to anticipate good approximation to
arbitrary mappings.
We again use a maximum entropy principle to distribute the poles inside
the unit circle uniformly. The constraints of a circle as boundary conditions
for discrete linear systems and complex conjugate locations are easy to
include for the pole distribution (Thogula, 2003). The poles are first initial-
ized at random locations; the quadratic Renyi's entropy is calculated by
equation 3.1, and poles are moved such that the entropy of the new dis-
tribution is increased over iterations (Erdogmus & Principe, 2002). This
method is efficient to find uniform coverage of the unit circle with an arbi-
trary number of poles. The system with the uniform pole locations can be
interpreted using linear system theory. The poles that are close to the unit
circle correspond to many sharp bandpass filters specializing in different
frequency regions, whereas the inner poles realize filters of larger frequency
support. Moreover, different orientations (angles) of the poles create filters
of different center frequencies.
Now the problem is to construct an internal weight matrix from the pole
locations (eigenvalues of W). In principle, we would like to create a sparse
Figure 4: Examples of echo states and instantaneous state entropy. (A) Outputs
of echo states (100 PEs) produced by ESNs with spectral radius of 0.2, 0.5, and 0.8,
from top to bottom, respectively. The diversity of echo states increases when the
spectral radius increases. Within the dynamic range of the echo states, systems
with smaller spectral radius can generate only uneven representations, while
for ‖W‖ = 0.8, outputs of echo states almost uniformly distribute within their
dynamic range. (B) Instantaneous state entropy is calculated using equation 3.1.
Information contained in the echo states is changing over time according to the
input amplitude. Therefore, the richness of representation is controlled by the
input amplitude. Moreover, the value of ASE increases with spectral radius.
(C) ASEs from 50 different realizations of ESNs with the same spectral radius
of 0.5. The plot shows that ASE is a finer descriptor of the dynamics of the
reservoir than the spectral radius.
matrix, so we started with the sparsest matrix (with an inverse), which is
the direct canonical structure given by (Kailath, 1980)
W = [ −a_1   −a_2   ···   −a_(N−1)   −a_N ]
    [   1      0    ···      0         0  ]
    [   0      1    ···      0         0  ]
    [  ···    ···   ···     ···       ··· ]
    [   0      0    ···      1         0  ] .   (4.1)

The characteristic polynomial of W is

l(s) = det(sI − W) = s^N + a_1 s^(N−1) + a_2 s^(N−2) + ··· + a_N
     = (s − p_1)(s − p_2)···(s − p_N),   (4.2)

where the p_i's are the eigenvalues and the a_i's are the coefficients of the character-
istic polynomial of W. Here, we know the pole locations of the linear system
obtained from the linearization of the ESN, so using equation 4.2, we can
obtain the characteristic polynomial and construct the W matrix in the canon-
ical form using equation 4.1. We will call the ESN constructed based on
the uniform pole principle ASE-ESN. All other possible solutions with the
same eigenvalues can be obtained by Q^−1 WQ, where Q is any nonsingular
matrix.
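A small sketch of this construction is given below: np.poly recovers the coefficients of equation 4.2 from the chosen pole locations, and the canonical form of equation 4.1 is filled in directly. The helper name is illustrative, and the poles are assumed to occur in complex-conjugate pairs so that W is real.

import numpy as np

def reservoir_from_poles(poles):
    # Build W in the direct canonical form of equation 4.1 from the desired eigenvalues.
    a = np.real(np.poly(poles))       # [1, a_1, ..., a_N] of the characteristic polynomial (4.2)
    N = len(poles)
    W = np.zeros((N, N))
    W[0, :] = -a[1:]                  # first row: -a_1, ..., -a_N
    W[1:, :-1] = np.eye(N - 1)        # ones on the subdiagonal
    return W

The eigenvalues of the returned matrix (np.linalg.eigvals(W)) reproduce the chosen pole locations up to numerical precision.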
To corroborate our hypothesis, we would like to show that the linearized
ESN designed with the recurrent weight matrix having the eigenvalues
uniformly distributed inside the unit circle creates higher ASE values for a
given spectral radius compared to other ESNs with random internal con-
nection weight matrices. We will consider an ESN with 30 states and use our
procedure to create the W matrix for ASE-ESN for different spectral radii
between [0.1, 0.95]. Similarly, we constructed ESNs with sparse random W
matrices with different sparseness constraints. This corresponds to a weight
distribution having the values 0, c, and −c with probabilities p_1, (1 − p_1)/2,
and (1 − p_1)/2, where p_1 defines the sparseness of W and c is a constant
that takes a specific value depending on the spectral radius. We also created
W matrices with values uniformly distributed between −1 and 1 (U-ESN)
and scaled to obtain a given spectral radius (Jaeger & Hass, 2004). Then,
for different W^in matrices, we run the ASE-ESNs with the sinusoidal input
given in section 3 and calculate ASE. Figure 5 compares the ASE values
averaged over 1000 realizations. As observed from the figure, the ASE-ESN
with uniform pole distribution generates higher ASE on average for all
spectral radii compared to ESNs with sparse and uniform random connec-
tions. This approach is indeed conceptually similar to Jeffreys' maximum
entropy prior (Jeffreys, 1946): it will provide a consistently good response
for the largest class of problems. Concentrating the poles of the linearized
Figure 5: Comparison of ASE values obtained for ASE-ESN having W with
uniform eigenvalue distribution, ESNs with random W matrix, and U-ESN
with uniformly distributed weights between −1 and 1. Randomly generated
weights have sparseness of 0.07, 0.1, and 0.2. ASE values are calculated for the
networks with spectral radius from 0.1 to 0.95. The ASE-ESN with uniform pole
distribution generates a higher ASE on average for all spectral radii compared
to ESNs with random connections.
system in certain regions of the space provides good performance only if
the desired response has energy in this part of the space, as is well known
from the theory of Kautz filters (Kautz, 1954).
4.2 Design of the Adaptive Bias. In conventional ESNs, only the out-
put weights are trained, optimizing the projections of the desired response
onto the basis functions (echo states). Since the dynamical reservoir is fixed,
the basis functions are only input dependent. However, since function ap-
proximation is a problem in the joint space of the input and desired signals,
a penalty in performance will be incurred. From the linearization analysis
that shows the crucial importance of the operating point of the PE non-
linearity in defining the echo state dynamics, we propose to use a single
external adaptive bias to adjust the effective spectral radius of an ESN. No-
tice that according to linearization analysis, bias can reduce only spectral
radius. The information for adaptation of bias is the MSE in training, which
modulates the spectral radius of the system with the information derived
from the approximation error. With this simple mechanism, some informa-
tion from the input-output joint space is incorporated in the definition of the
projection space of the ESN. The beauty of this method is that the spectral
radius can be adjusted by a single parameter that is external to the system
without changing reservoir weights.
The training of bias can be easily accomplished. Indeed, since the pa-
rameter space is only one-dimensional, a simple line search method can be
efficiently employed to optimize the bias. Among different line search al-
gorithms, we will use a search that uses Fibonacci numbers in the selection
of points to be evaluated (Wilde, 1964). The Fibonacci search method min-
imizes the maximum number of evaluations needed to reduce the interval
of uncertainty to within the prescribed length. In our problem, a bias value
is picked according to Fibonacci search. For each value of bias, training
data are applied to the ESN, and the echo states are calculated. Then the
corresponding optimal output weights and the objective function (MSE)
are evaluated to pick the next bias value.
Alternatively, gradient-based methods can be utilized to optimize the
bias, due to simplicity and low computational cost. The system update equation
with an external bias signal, b, is given by

x(n+1) = f(W^in u(n+1) + W^in b + W x(n)).

The update equation for b is given by

∂O(n+1)/∂b = −e · W^out × ∂x(n+1)/∂b   (4.3)
           = −e · W^out × [ f'(net_{n+1}) · ( W × ∂x(n)/∂b + W^in ) ].   (4.4)

Here, O is the MSE defined previously. This algorithm may suffer from
similar problems observed in gradient-based methods in recurrent net-
works training. However, we observed that the performance surface is
rather simple. Moreover, since the search parameter is one-dimensional,
the gradient vector can assume only one of the two directions. Hence, im-
precision in the gradient estimation should affect the speed of convergence
but normally not change the correct gradient direction.
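As a rough illustration of the line search alternative, the sketch below scans a grid of candidate bias values and keeps the one with the smallest training MSE. It is a simplified stand-in for the Fibonacci search described above (which evaluates far fewer points); run_esn_with_bias is an assumed helper that applies the bias through W^in, trains the readout with equation 2.4, and returns the training MSE.

import numpy as np

def train_bias(run_esn_with_bias, bias_candidates):
    # One-dimensional search over the external bias value.
    mses = [run_esn_with_bias(b) for b in bias_candidates]
    return bias_candidates[int(np.argmin(mses))]

# Example usage with a hypothetical grid of bias values:
# best_b = train_bias(run_esn_with_bias, np.linspace(0.0, 4.0, 21))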
5 Experiments
This section presents a variety of experiments in order to test the validity
of the ESN design scheme proposed in the previous section.
5.1 Short-Term Memory Capacity. This experiment compares the short-
term memory (STM) capacity of ESNs with the same spectral radius using
the framework presented in Jaeger (2002a). Consider an ESN with a sin-
gle input signal, u(n), optimally trained with the desired signal u(n − k),
for a given delay k. Denoting the optimal output signal y_k(n), the k-delay
STM capacity of a network, MC_k, is defined as a squared correlation coef-
ficient between u(n − k) and y_k(n) (Jaeger, 2002a). The STM capacity, MC,
of the network is defined as MC = Σ_{k=1}^{∞} MC_k. STM capacity measures how accu-
rately the delayed versions of the input signal are recovered with optimally
trained output units. Jaeger (2002a) has shown that the memory capacity
for recalling an independent and identically distributed (i.i.d.) input by an
N-unit RNN with linear output units is bounded by N.
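The following sketch computes MC_k and their sum from test data, assuming the k-th trained output has been collected column-wise in Y so that Y[n, k-1] approximates u(n − k); this arrangement and the names are illustrative assumptions.

import numpy as np

def memory_capacity(u, Y, max_delay=40):
    # MC_k: squared correlation coefficient between u(n-k) and y_k(n); MC = sum over k.
    mc = 0.0
    for k in range(1, max_delay + 1):
        target, prediction = u[:-k], Y[k:, k - 1]
        mc += np.corrcoef(target, prediction)[0, 1] ** 2
    return mc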
We use ESNs with 20 PEs and a single input unit. ESNs are driven
by an i.i.d. random input signal, u(n), that is uniformly distributed over
[−0.5, 0.5]. The goal is to train the ESN to generate the delayed versions
of the input, u(n − 1), ..., u(n − 40). We used four different ESNs: R-ESN,
U-ESN, ASE-ESN, and BASE-ESN. R-ESN is a randomly connected ESN
used in Jaeger (2002a) where the entries of the W matrix are set to 0, 0.47,
−0.47 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a
sparse connectivity of 20% and a spectral radius of 0.9. The entries of W of
U-ESN are uniformly distributed over [−1, 1] and scaled to obtain the spec-
tral radius of 0.9. ASE-ESN also has a spectral radius of 0.9 and is designed
with uniform poles. BASE-ESN has the same recurrent weight matrix as
ASE-ESN and an adaptive bias at its input. In each ESN, the input weights
are set to 0.1 or −0.1 with equal probability, and direct connections from the
input to the output are allowed, whereas W^back is set to 0 (Jaeger, 2002a).
The echo states are calculated using equation 2.1 for 200 samples of the
input signal, and the first 100 samples corresponding to initial transient
are eliminated. Then the output weight matrix is calculated using equation
2.4. For the BASE-ESN, the bias is trained for each task. All networks are
run with a test input signal, and the corresponding output and MC_k are
calculated. Figure 6 shows the k-delay STM capacity (averaged over 100
trials) of each ESN for delays 1,...,40 for the test signal. The STM capac-
ities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70,
and 16.90, respectively. First, ESNs with uniform pole distribution (ASE-
ESN and BASE-ESN) have MCs that are much longer than the randomly
generated ESN given in Jaeger (2002a) in spite of all having the same spec-
tral radius. In fact, the STM capacity of ASE-ESN is close to the theoretical
maximum value of N = 20. A closer look at the figure shows that R-ESN per-
forms slightly better than ASE-ESN for delays less than 9. In fact, for small
k, large ASE degrades the performance because the tasks do not need long
memory depth. However, the drawback of high ASE for small k is recov-
ered in BASE-ESN, which reduces the ASE to the appropriate level required
for the task. Overall, the addition of the bias to the ASE-ESN increases the
STM capacity from 16.70 to 16.90. On the other hand, U-ESN has slightly
better STM compared to R-ESN with only three different weight values,
although it has more distinct weight values compared to R-ESN. It is also
significant to note that the MC will be very poor for an ESN with smaller
spectral radius even with an adaptive bias, since the problem requires large
ASE and bias can only reduce ASE. This experiment demonstrates the
Figure 6: The k-delay STM capacity of each ESN for delays 1, ..., 40 computed
using the test signal. The results are averaged over 100 different realizations of
each ESN type with the specifications given in the text for different W and W^in
matrices. The STM capacities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are
13.09, 13.55, 16.70, and 16.90, respectively.
suitability of maximizing ASE in tasks that require a substantial memory
length.
5.2 Binary Parity Check. The effect of the adaptive bias was marginal
in the previous experiment since the nature of the problem required large
ASE values. However, there are tasks in which the optimal solutions re-
quire smaller ASE values and smaller spectral radius. Those are the tasks
where the adaptive bias becomes a crucial design parameter in our design
methodology.
Consider an ESN with 100 internal units and a single input unit. ESN is
driven by a binary input signal, u(n), that assumes the values 0 or 1. The goal
is to train an ESN to generate the m-bit parity corresponding to the last m bits
received, where m is 3, ..., 8. Similar to the previous experiments, we used
the R-ESN, ASE-ESN, and BASE-ESN topologies. R-ESN is a randomly
connected ESN where the entries of the W matrix are set to 0, 0.06, −0.06
with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse
connectivity of 20% and a spectral radius of 0.3. ASE-ESN and BASE-ESN
are designed with a spectral radius of 0.9. The input weights are set to 1 or -1
with equal probability, and direct connections from the input to the output
are allowed, whereas W^back is set to 0. The echo states are calculated using
equation 2.1 for 1000 samples of the input signal, and the first 100 samples
corresponding to the initial transient are eliminated. Then the output weight
Figure 7: The number of wrong decisions made by each ESN for m = 3, ..., 8
in the binary parity check problem. The results are averaged over 100 differ-
ent realizations of R-ESN, ASE-ESN, and BASE-ESN for different W and W^in
matrices with the specifications given in the text. The total numbers of wrong
decisions for m = 3, ..., 8 of R-ESN, ASE-ESN, and BASE-ESN are 722, 862, and
699.
matrix is calculated using equation 2.4. For ESN with adaptive bias, the bias
is trained for each task. The binary decision is made by a threshold detector
that compares the output of the ESN to 0.5. Figure 7 shows the number of
wrong decisions (averaged over 100 different realizations) made by each
ESN form=3,...,8.
The total numbers of wrong decisions for m = 3, ..., 8 of R-ESN, ASE-
ESN, and BASE-ESN are 722, 862, and 699, respectively. ASE-ESN performs
poorly since the nature of the problem requires a short time constant for
fast response, but ASE-ESN has a large spectral radius. For 5-bit parity, the
R-ESN has no wrong decisions, whereas ASE-ESN has 53 wrong decisions.
BASE-ESN performs a lot better than ASE-ESN and slightly better than
the R-ESN since the adaptive bias reduces the spectral radius effectively.
Note that for m = 7 and 8, the ASE-ESN performs similarly to the R-ESN,
since the task requires access to longer input history, which compromises
the need for fast response. Indeed, the bias in the BASE-ESN takes effect
when there are errors (m > 4) and when the task benefits from smaller
spectral radius. The optimal bias values are approximately 3.2, 2.8, 2.6, and
2.7 for m = 3, 4, 5, and 6, respectively. For m = 7 or 8, there is a wide
range of bias values that result in similar MSE values (between 0 and 3). In
summary, this experiment clearly demonstrates the power of the bias signal
to configure the ESN reservoir according to the mapping task.
5.3 System Identification. This section presents a function approxima-
tion task where the aim is to identify a nonlinear dynamical system. The
unknown system is defined by the difference equation
y(n+1) = 0.3y(n) + 0.6y(n−1) + f(u(n)),
where
f(u) = 0.6 sin(πu) + 0.3 sin(3πu) + 0.1 sin(5πu).
The input to the system is chosen to be sin(2πn/25).
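A brief sketch of the data generation for this task is given below; the sequence length and the zero initial conditions are assumptions made for illustration.

import numpy as np

def generate_plant_data(T=1000):
    # Unknown system of section 5.3 driven by u(n) = sin(2*pi*n/25).
    u = np.sin(2 * np.pi * np.arange(T) / 25)
    f = 0.6 * np.sin(np.pi * u) + 0.3 * np.sin(3 * np.pi * u) + 0.1 * np.sin(5 * np.pi * u)
    y = np.zeros(T)
    for n in range(1, T - 1):
        y[n + 1] = 0.3 * y[n] + 0.6 * y[n - 1] + f[n]
    return u, y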
We used three different ESNs—R-ESN, ASE-ESN, and BASE-ESN—with
30 internal units and a single input unit. The W matrix of each ESN is scaled
such that it has a spectral radius of 0.95. R-ESN is a randomly connected ESN
where the entries of the W matrix are set to 0, 0.35, −0.35 with probabilities 0.8,
0.1, 0.1, respectively. In each ESN, the input weights are set to 1 or −1 with
equal probability, and direct connections from the input to the output are
allowed, whereas W^back is set to 0. The optimal output weights are calculated
using equation 2.4. The MSE values (averaged over 100 realizations) for R-
ESN and ASE-ESN are 1.23×10^−5 and 1.83×10^−6, respectively. The addition
of the adaptive bias to the ASE-ESN reduces the MSE value from 1.83×10^−6
to 3.27×10^−9.
6 Discussion
The great appeal of echo state networks (ESNs) and liquid state machine
(LSM) is their ability to construct arbitrary mappings of signals with rich
and time-varying temporal structures without requiring adaptation of the
free parameters of the recurrent layer. The echo state condition allows the
recurrent connections to be fixed with training limited to the linear output
layer. However, the literature has not elucidated how to properly choose
the recurrent parameters for system identification applications. Here, we
provide an alternate framework that interprets the echo states as a set
of functional bases formed by fixed nonlinear combinations of the input.
The linear readout at the output stage simply computes the projection of
the desired output space onto this representation space. We further in-
troduce an information-theoretic criterion, ASE, to better understand and
evaluate the capability of a given ESN to construct such a representation
layer. The average entropy of the distribution of the echo states quantifies
the volume spanned by the bases. As such, this volume should be the largest
to achieve the smallest correlation among the bases and be able to cope with
arbitrary mappings. However, not all function approximation problems re-
quire the same memory depth, which is coupled to the spectral radius. The
effective spectral radius of an ESN can be optimized for the given problem
with the help of an external bias signal that is adapted using the joint input-
output space information. The interesting property of this method when
applied to ESN built from sigmoidal nonlinearities is that it allows the fine
tuning of the system dynamics for a given problem with a single external
adaptive bias input and without changing internal system parameters. In
our opinion, the combination of the largest possible ASE and the adapta-
tion of the spectral radius by the bias produces the most parsimonious pole
location of the linearized ESN when no knowledge about the mapping is
available to optimally locate the basis functionals. Moreover, the bias can be
easily trained with either a line search method or a gradient-based method
since it is one-dimensional. We have illustrated experimentally that the de-
sign of the ESN using the maximization of ASE with the adaptation of the
spectral radius by the bias has provided consistently better performance
across tasks that require different memory depths. This means that this
two-parameter design methodology is preferred to the spectral radius
criterion proposed by Jaeger, and it is still easily incorporated in the ESN
design.
Experiments demonstrate that the ASE for ESN with uniform linearized
poles is maximized when the spectral radius of the recurrent weight matrix
approaches one (instability). It is interesting to relate this observation with
the computational properties found in dynamical systems “at the edge of
chaos” (Packard, 1988; Langton, 1990; Mitchell, Hraber, & Crutchfield, 1993;
Bertschinger & Natschläger, 2004). Langton stated that when cellular au-
tomata rules are evolved to perform a complex computation, evolution will
tend to select rules with “critical” parameter values, which correlate with
a phase transition between ordered and chaotic regimes. Recently, similar
conclusions were suggested for LSMs (Bertschinger & Natschläger, 2004).
Langton's interpretation of edge of chaos was questioned by Mitchell et al.
(1993). Here, we provide a system-theoretic view and explain the computa-
tional behavior with the diversity of dynamics achieved with linearizations
that have poles close to the unit circle. According to our results, the spectral
radius of the optimal ESN in function approximation is problem dependent,
and in general it is impossible to forecast the computational performance
as the system approaches instability (the spectral radius of the recurrent
weight matrix approaches one). However, allowing the system to modu-
late the spectral radius by either the output or internal biasing may allow
a system close to instability to solve various problems requiring different
spectral radii.
Our emphasis here is mostly on ESNs without output feedback connec-
tions. However, the proposed design methodology can also be applied to
ESNs with output feedback. Both feedforward and feedback connections
contribute to specify the bases to create the projection space. At the same
time, there are applications where the output feedback contributes to the
system dynamics in a different fashion. For example, it has been shown that
a fixed weight (fully trained) RNN with output feedback can implement a
family of functions (meta-learners) (Prokhorov, Feldkamp, & Tyukin, 1992).
In meta-learning, the role of output feedback in the network is to bias the
system to different regions of dynamics, providing multiple input-output
mappings required (Santiago & Lendaris, 2004). However, results could not
be replicated with ESNs (Prokhorov, 2005). We believe that more work has
to be done on output feedback in the context of ESNs but also suspect that
the echo state condition may be a restriction on the system dynamics for
this type of problem.
There are many interesting issues to be researched in this exciting new
area. Besides an evaluation tool, ASE may also be utilized to train the ESN's
representation layer in an unsupervised fashion. In fact, we can easily adapt
with the SIG (stochastic information gradient) described in Erdogmus, Hild,
and Principe (2003): extra weights linking the outputs of recurrent states to
maximize output entropy. Output entropy maximization is a well-known
metric to create independent components (Bell & Sejnowski, 1995), and
here it means that the echo states will become as independent as possible.
This would circumvent the linearization of the dynamical system to set the
recurrent weights and would fine-tune continuously in an unsupervised
manner the parameters of the ESN among different inputs. However, it
goes against the idea of a fixed ESN reservoir.
The reservoir of recurrent PEs can be thought of as a new form of a time-
to-space mapping. Unlike the delay line that forms an embedding (Takens,
1981), this mapping may have the advantage of filtering noise and producing
representations with better SNRs at the peaks of the input, which is very
appealing for signal processing and seems to be used in biology. However,
further theoretical work is necessary in order to understand the embedding
capabilities of ESNs. One of the disadvantages of the ESN correlated basis
is in the design of the readout. Gradient-based algorithms will be very
slow to converge (due to the large eigenvalue spread of modes), and even
if recursive methods are used, their stability may be compromised by the
condition number of the matrix. However, our recent results incorporating
an L1-norm penalty in the LMS (Rao et al., 2005) show great promise of
solving this problem.
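For concreteness, the sketch below shows one plausible form of such an update: an LMS readout with a sign-based L1-norm shrinkage term. The step size, the shrinkage strength, and this particular form of the penalty are our assumptions and are not taken from Rao et al. (2005).

import numpy as np

def l1_lms_readout(states, targets, eta=1e-3, gamma=1e-5):
    # Train a linear ESN readout with LMS plus an L1-norm penalty.
    # states: (T, N) matrix of echo states; targets: (T,) desired outputs.
    # The sign(w) term shrinks small weights toward zero, one way to cope
    # with the slow convergence caused by strongly correlated echo states.
    T, N = states.shape
    w = np.zeros(N)
    for x, d in zip(states, targets):
        e = d - np.dot(w, x)              # instantaneous error
        w += eta * e * x - gamma * np.sign(w)
    return w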
Finally we would like to briefly comment on the implications of these
models to neurobiology and computational neuroscience. The work by
Pouget and Sejnowski (1997) has shown that the available physiological
data are consistent with the hypothesis that the response of a single neuron
in the parietal cortex serves as a basis function generated by the sensory
input in a nonlinear fashion. In other words, the neurons transform the
sensory input into a format (representation space) such that the subsequent
computation is simplified. Then, whenever a motor command (output of
the biological system) needs to be generated, this simple computation to
read out the neuronal activity is done. There is an intriguing similarity
between the interpretation of the neuronal activity by Pouget and Sejnowski
and our interpretation of echo states in ESN. We believe that similar ideas
can be applied to improve the design of microcircuit implementations of
LSMs. First, the framework of functional space interpretation (bases and
projections) is also applicable to microcircuits. Second, the ASE measure
may be directly utilized for LSM states because the states are normally low-
pass-filtered before the readout. However, the control of ASE by changing
the liquid dynamics is unclear. Perhaps global control of thresholds or bias
current will be able to accomplish bias control as in ESN with sigmoid
PEs.
Acknowledgments
This work was partially supported by NSF ECS-0422718, NSF CNS-0540304,
and ONR N00014-1-1-0405.
References
Amari, S.-I. (1990). Differential-geometrical methods in statistics. New York: Springer.
Anderson, J., Silverstein, J., Ritz, S., & Jones, R. (1977). Distinctive features, categorical
perception, and probability learning: Some applications of a neural model.
Psychological Review, 84, 413–451.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach
to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Bertschinger, N., & Natschläger, T. (2004). Real-time computation at the edge of chaos
in recurrent neural networks. Neural Computation, 16(7), 1413–1436.
Cox, R. T. (1946). Probability, frequency, and reasonable expectation. American Journal
of Physics, 14(1), 1–13.
de Vries, B. (1991). Temporal processing with neural networks—the development of the
gamma model. Unpublished doctoral dissertation, University of Florida.
Delgado, A., Kambhampati, C., & Warwick, K. (1995). Dynamic recurrent neural
network for system identification and control. IEEE Proceedings of Control Theory
and Applications, 142(4), 307–314.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
Erdogmus, D., Hild, K. E., & Principe, J. (2003). Online entropy manipulation:
Stochastic information gradient. Signal Processing Letters, 10(8), 242–245.
Erdogmus, D., & Principe, J. (2002). Generalized information potential criterion for
adaptive system training. IEEE Transactions on Neural Networks, 13(5), 1035–1044.
Feldkamp, L. A., Prokhorov, D. V., Eagen, C., & Yuan, F. (1998). Enhanced multistream
Kalman filter training for recurrent networks. In J. Suykens & J. Vandewalle
(Eds.), Nonlinear modeling: Advanced black-box techniques (pp. 29–53). Dordrecht,
Netherlands: Kluwer.
Haykin, S. (1998). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle
River, NJ: Prentice Hall.
Haykin, S. (2001). Adaptive filter theory (4th ed.). Upper Saddle River, NJ: Prentice
Hall.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation,
9(8), 1735–1780.
Hopfield, J. (1984). Neurons with graded response have collective computational
properties like those of two-state neurons. Proceedings of the National Academy of
Sciences, 81, 3088–3092.
Ito, Y. (1996). Nonlinearity creates linear independence. Advances in Computer
Mathematics, 5(1), 189–203.
Jaeger, H. (2001). The echo state approach to analyzing and training recurrent neural
networks (Tech. Rep. No. 148). Bremen: German National Research Center for
Information Technology.
Jaeger, H. (2002a). Short term memory in echo state networks (Tech. Rep. No. 152).
Bremen: German National Research Center for Information Technology.
Jaeger, H. (2002b). Tutorial on training recurrent neural networks, covering BPPT, RTRL,
EKF and the “echo state network” approach (Tech. Rep. No. 159). Bremen: German
National Research Center for Information Technology.
Jaeger, H., & Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems
and saving energy in wireless communication. Science, 304(5667), 78–80.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems.
Proceedings of the Royal Society of London, A 196, 453–461.
Kailath, T. (1980). Linear systems. Upper Saddle River, NJ: Prentice Hall.
Kautz, W. (1954). Transient synthesis in time domain. IRE Transactions on Circuit
Theory, 1(3), 29–39.
Kechriotis, G., Zervas, E., & Manolakos, E. S. (1994). Using recurrent neural networks
for adaptive communication channel equalization. IEEE Transactions on Neural
Networks, 5(2), 267–278.
Kremer, S. C. (1995). On the computational power of Elman-style recurrent networks.
IEEE Transactions on Neural Networks, 6(5), 1000–1004.
Kuznetsov, Y., Kuznetsov, L., & Marsden, J. (1998). Elements of applied bifurcation
theory (2nd ed.). New York: Springer-Verlag.
Langton, C. G. (1990). Computation at the edge of chaos. Physica D, 42, 12–37.
Maass, W., Legenstein, R. A., & Bertschinger, N. (2005). Methods for estimating the
computational power and generalization capability of neural microcircuits. In
L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing
systems, no. 17 (pp. 865–872). Cambridge, MA: MIT Press.
Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without
stable states: A new framework for neural computation based on perturbations.
Neural Computation, 14(11), 2531–2560.
Mitchell, M., Hraber, P., & Crutchfield, J. (1993). Revisiting the edge of chaos:
Evolving cellular automata to perform computations. Complex Systems, 7, 89–130.
Packard, N. (1988). Adaptation towards the edge of chaos. In J. A. S. Kelso, A. J.
Mandell, & M. F. Shlesinger (Eds.), Dynamic patterns in complex systems
(pp. 293–301). Singapore: World Scientific.
Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex
using basis functions. Journal of Cognitive Neuroscience, 9(2), 222–237.
Principe, J. (2001). Dynamic neural networks and optimal signal processing. In
Y. Hu & J. Hwang (Eds.), Neural networks for signal processing (Vol. 6-1, pp. 6–28).
Boca Raton, FL: CRC Press.
Principe, J. C., de Vries, B., & de Oliviera, P. G. (1993). The gamma filter—a new
class of adaptive IIR filters with restricted feedback. IEEE Transactions on Signal
Processing, 41(2), 649–656.
Principe, J., Xu, D., & Fisher, J. (2000). Information theoretic learning. In S. Haykin
(Ed.), Unsupervised adaptive filtering (pp. 265–319). Hoboken, NJ: Wiley.
Prokhorov, D. (2005). Echo state networks: Appeal and challenges. In Proc. of
International Joint Conference on Neural Networks (pp. 1463–1466). Montreal, Canada.
Prokhorov, D., Feldkamp, L., & Tyukin, I. (2002). Adaptive behavior with fixed
weights in recurrent neural networks: An overview. In Proc. of International Joint
Conference on Neural Networks (pp. 2018–2022). Honolulu, Hawaii.
Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical
systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural
Networks, 5(2), 279–297.
Puskorius, G. V., & Feldkamp, L. A. (1996). Dynamic neural network methods applied
to on-vehicle idle speed control. Proceedings of the IEEE, 84(10), 1407–1420.
Rao, Y., Kim, S., Sanchez, J., Erdogmus, D., Principe, J. C., Carmena, J., Lebedev,
M., & Nicolelis, M. (2005). Learning mappings in brain machine interfaces with
echo state networks. In 2005 IEEE International Conference on Acoustics, Speech, and
Signal Processing. Philadelphia.
Renyi, A. (1970). Probability theory. New York: Elsevier.
Sanchez, J. C. (2004). From cortical neural spike trains to behavior: Modeling and analysis.
Unpublished doctoral dissertation, University of Florida.
Santiago, R. A., & Lendaris, G. G. (2004). Context discerning multifunction networks:
Reformulating fixed weight neural networks. In Proc. of International Joint
Conference on Neural Networks (pp. 189–194). Budapest, Hungary.
Shah, J. V., & Poon, C.-S. (1999). Linear independence of internal representations in
multilayer perceptrons. IEEE Transactions on Neural Networks, 10(1), 10–18.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical
Journal, 27, 623–656.
Siegelmann, H. T. (1993). Foundations of recurrent neural networks. Unpublished
doctoral dissertation, Rutgers University.
Siegelmann, H. T., & Sontag, E. (1991). Turing computability with neural nets. Applied
Mathematics Letters, 4(6), 77–80.
Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended
Kalman algorithm. In D. S. Touretzky (Ed.), Advances in neural information processing
systems, 1 (pp. 133–140). San Mateo, CA: Morgan Kaufmann.
Takens, F. (1981). Detecting strange attractors in turbulence. In D. A. Rand & L.-S.
Young (Eds.), Dynamical systems and turbulence (pp. 366–381). Berlin: Springer.
Thogula, R. (2003). Information theoretic self-organization of multiple agents. Unpublished
master's thesis, University of Florida.
Werbos, P. (1990). Backpropagation through time: What it does and how to do it.
Proceedings of the IEEE, 78(10), 1550–1560.
Werbos, P. (1992). Neurocontrol and supervised learning: An overview and evaluation.
In D. White & D. Sofge (Eds.), Handbook of intelligent control (pp. 65–89). New
York: Van Nostrand Reinhold.
Wilde, D. J. (1964). Optimum seeking methods. Upper Saddle River, NJ: Prentice Hall.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running
fully recurrent neural networks. Neural Computation, 1, 270–280.
Received December 28, 2004; accepted June 1, 2006.