LETTER Communicated by Herbert Jaeger
Analysis and Design of Echo State Networks
Mustafa C. Ozturk
Dongming Xu
José C. Príncipe
Computational NeuroEngineering Laboratory, Department of Electrical and
Computer Engineering, University of Florida, Gainesville, FL 32611, U.S.A.
The design of echo state network (ESN) parameters relies on the selec-
tion of the maximum eigenvalue of the linearized system around zero
(spectral radius). However, this procedure does not quantify in a sys-
tematic manner the performance of the ESN in terms of approximation
error. This article presents a functional space approximation framework
to better understand the operation of ESNs and proposes an information-
theoretic metric, the average entropy of echo states, to assess the richness
of the ESN dynamics. Furthermore, it provides an interpretation of the
ESN dynamics rooted in system theory as families of coupled linearized
systems whose poles move according to the input signal dynamics. With
this interpretation, a design methodology for functional approximation
is put forward where ESNs are designed with uniform pole distributions
covering the frequency spectrum to abide by the richness metric, irre-
spective of the spectral radius. A single bias parameter at the ESN input,
adapted with the modeling error, configures the ESN spectral radius to
the input-output joint space. Function approximation examples compare
the proposed design methodology versus the conventional design.
1 Introduction
Dynamic computational models require the ability to store and access the
time history of their inputs and outputs. The most common dynamic neural
architecture is the time-delay neural network (TDNN) that couples delay
lines with a nonlinear static architecture where all the parameters (weights)
are adapted with the backpropagation algorithm. The conventional delay
line utilizes ideal delay operators, but delay lines with local first-order recursive
filters have been proposed by Werbos (1992) and extensively studied
in the gamma model (de Vries, 1991; Principe, de Vries, & de Oliveira,
1993). Chains of first-order integrators are interesting because they effectively
decrease the number of delays necessary to create time embeddings
Neural Computation 19, 111–138 (2007) © 2006 Massachusetts Institute of Technology
(Principe, 2001). Recurrent neural networks (RNNs) implement a differ-
ent type of embedding that is largely unexplored. RNNs are perhaps the
most biologically plausible of the artificial neural network (ANN) models
(Anderson, Silverstein, Ritz, & Jones, 1977; Hopfield, 1984; Elman, 1990),
but are not well understood theoretically (Siegelmann & Sontag, 1991;
Siegelmann, 1993; Kremer, 1995). One of the main practical problems with
RNNs is the difficulty of adapting the system weights. Various algorithms,
such as backpropagation through time (Werbos, 1990) and real-time recur-
rent learning (Williams & Zipser, 1989), have been proposed to train RNNs;
however, these algorithms suffer from computational complexity, resulting
in slow training, complex performance surfaces, the possibility of instabil-
ity, and the decay of gradients through the topology and time (Haykin,
1998). The problem of decaying gradients has been addressed with spe-
cial processing elements (PEs) (Hochreiter & Schmidhuber, 1997). Alter-
native second-order training methods based on extended Kalman filtering
(Singhal & Wu, 1989; Puskorius & Feldkamp, 1994; Feldkamp, Prokhorov,
Eagen, & Yuan, 1998) and the multistreaming training approach (Feldkamp
et al., 1998) provide more reliable performance and have enabled practical
applications in identification and control of dynamical systems (Kechriotis,
Zervas, & Manolakos, 1994; Puskorius & Feldkamp, 1994; Delgado,
Kambhampati, & Warwick, 1995).
Recently, two new recurrent network topologies have been proposed: the
echo state network (ESN) by Jaeger (2001, 2002a; Jaeger & Haas, 2004) and
the liquid state machine (LSM) by Maass (Maass, Natschläger, & Markram,
2002). ESNs possess a highly interconnected and recurrent topology of
nonlinear PEs that constitutes a “reservoir of rich dynamics” (Jaeger, 2001)
and contain information about the history of input and output patterns.
The outputs of these internal PEs (echo states) are fed to a memoryless but
adaptive readout network (generally linear) that produces the network output.
The interesting property of the ESN is that only the memoryless readout is
trained, whereas the recurrent topology has fixed connection weights. This
reduces the complexity of RNN training to simple linear regression while
preserving a recurrent topology, but obviously places important constraints
on the overall architecture that have not yet been fully studied. Similar ideas
have been explored independently by Maass and formalized in the LSM
architecture. LSMs, although formulated quite generally, are mostly im-
plemented as neural microcircuits of spiking neurons (Maass et al., 2002),
whereas ESNs are dynamical ANN models. Both attempt to model biolog-
ical information processing using similar principles. We focus on the ESN
formulation in this letter.
The echo state condition is defined in terms of the spectral radius (the
largest among the absolute values of the eigenvalues of a matrix, denoted
by $\|\cdot\|$) of the reservoir’s weight matrix ($\|\mathbf{W}\| < 1$). This condition states
that the dynamics of the ESN is uniquely controlled by the input, and the
effect of the initial states vanishes. The current design of ESN parameters
relies on the selection of spectral radius. However, there are many possible
weight matrices with the same spectral radius, and unfortunately they do
not all perform at the same level of mean square error (MSE) for functional
approximation. A similar problem exists in the design of the LSM. LSMs
have been shown to possess universal approximation given the separation
property (SP) for the liquid (reservoir in ESNs) and the approximation
property (AP) for the readout (Maass et al., 2002). SP is quantified by a
kernel-quality measure proposed in Maass, Legenstein, and Bertschinger
(2005) that is based on the rank of a matrix formed by the system states
corresponding to different input signals. The kernel quality is a measure
for the complexity and diversity of nonlinear operations carried out by the
liquid on its input stream in order to boost the classification power of a
subsequent linear decision hyperplane (Maass et al., 2005). A variation of
SP has been proposed in Bertschinger and Natschläger (2004), and it has
been argued that complex calculations can be best carried out by networks
on the boundary between ordered and chaotic dynamics.
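The spectral radius defined above can be computed directly from the eigenvalues of the reservoir matrix. The following is a minimal sketch, assuming NumPy and a hypothetical 50-unit reservoir with Gaussian weights (neither is specified in the text). It rescales two independent random matrices to the same spectral radius and checks that they are nevertheless different matrices, which is the situation in which the spectral radius alone leaves the reservoir underdetermined.

```python
import numpy as np

def spectral_radius(W):
    """Largest absolute value among the eigenvalues of W."""
    return np.max(np.abs(np.linalg.eigvals(W)))

def rescale_to_radius(W, target):
    """Scale W uniformly so that its spectral radius equals `target`."""
    return W * (target / spectral_radius(W))

rng = np.random.default_rng(0)
N = 50  # hypothetical reservoir size, chosen only for illustration

# Two independent random reservoirs, both rescaled to spectral radius 0.9.
W_a = rescale_to_radius(rng.standard_normal((N, N)), 0.9)
W_b = rescale_to_radius(rng.standard_normal((N, N)), 0.9)

print(spectral_radius(W_a), spectral_radius(W_b))  # both close to 0.9
# The matrices share a spectral radius but are not the same matrix.
print(np.allclose(W_a, W_b))
```

Uniform rescaling works because the eigenvalues of a scaled matrix are the scaled eigenvalues of the original; it fixes only the largest eigenvalue magnitude and leaves the rest of the spectrum to chance.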
In this letter, we are interested in studying the ESN for function approximation
(filters that map input functions $u(\cdot)$ of time on output functions $y(\cdot)$
of time). We see two major shortcomings with the current ESN approach
that uses the echo state condition as a design principle. First, the impact of fixed
reservoir parameters for function approximation means that the informa-
tion about the desired response is conveyed only to the output projection.
This is not optimal, and strategies to select different reservoirs for different
applications have not been devised. Second, imposing a constraint only on
the spectral radius is a weak condition to properly set the parameters of
the reservoir, as experiments show (different randomizations with the same
spectral radius perform differently for the same problem; see Figure 2).
This letter aims to address these two problems by proposing a frame-
work, a metric, and a design principle for ESNs. The framework is a signal
processing interpretation of basis and projections in functional spaces to
describe and understand the ESN architecture. According to this interpre-
tation, the ESN states implement a set of basis functionals (representation
space) constructed dynamically by the input, while the readout simply
projects the desired response onto this representation space. The metric
to describe the richness of the ESN dynamics is an information-theoretic
quantity, the average state entropy (ASE). Entropy measures the amount of
information contained in a given random variable (Shannon, 1948). Here,
the random variable is the instantaneous echo state from which the en-
tropy for the overall state (vector) is estimated. The probability density
function (pdf) in a differential geometric framework should be thought of
as a volume form; that is, in our case, the pdf of the state vector describes
the metric of the state space manifold (Amari, 1990). Moreover, Cox (1946)
established information as a coordinate free metric in the state manifold.
Therefore, entropy becomes a global descriptor of information that quantifies
the volume of the manifold defined by the random variable. Due to the
time dependency of the states, the state entropy averaged over time (ASE)
is an appropriate estimate of the volume of the state manifold.
The design principle specifies that one should consider independently
the correlation among the bases and the spectral radius. In the absence of any
information about the desired response, the ESN states should be designed
with the highest ASE, independent of the spectral radius. We interpret the
ESN dynamics as a combination of time-varying linear systems obtained
from the linearization of the ESN nonlinear PE in a small, local neighborhood
of the current state. The design principle means that the poles of the
linearized ESN reservoir should be uniformly distributed, so that the echo
states are generated with the most diverse pole locations (which correspond to
the uniformity of time constants). Effectively, this will create the least correlated
bases for a given spectral radius, which corresponds to the largest
volume spanned by the basis set. When the designer has no other information
about the desired response to set the basis, this principle distributes
the system’s degrees of freedom uniformly in space. It approximates for
ESNs the well-known property of orthogonal bases. The unresolved issue
that ASE does not quantify is how to set the spectral radius, which depends
again on the desired mapping. The concept of memory depth as explained
in Principe et al. (1993) and Jaeger (2002a) is helpful in understanding the
issues associated with the spectral radius. The correlation time of the desired response
gives an indication of the type of spectral radius required (long correlation
time requires high spectral radius). Alternatively, a simple adaptive bias is
added at the ESN input to control the spectral radius integrating the infor-
mation from the input-output joint space in the ESN bases. For sigmoidal
PEs, the bias adjusts the operating points of the reservoir PEs, which has
the net effect of adjusting the volume of the state manifold as required to
approximate the desired response with a small error. This letter shows that
ESNs designed with this strategy obtain systematically better results in a
set of experiments when compared with the conventional ESN design.
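The effect of an input bias on the linearized reservoir can be illustrated numerically. The sketch below is an assumption-laden illustration, not the adaptation procedure of this letter: it uses NumPy, a hypothetical 30-unit Gaussian reservoir rescaled to spectral radius 0.9, input weights of +1 or −1, a constant (non-adapted) bias applied as the input, and linearization at the zero state. The Jacobian of a tanh reservoir is the weight matrix with each row scaled by the slope of tanh at that PE's operating point, so a larger bias moves the PEs toward saturation and shrinks the local poles.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 30                                    # hypothetical reservoir size
W = rng.standard_normal((N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius 0.9 at the origin
w_in = rng.choice([-1.0, 1.0], size=N)    # input weights of +1 or -1

def local_poles(x, u):
    """Eigenvalues of the Jacobian of x(n+1) = tanh(W x(n) + w_in u) with respect to x(n)."""
    pre = W @ x + w_in * u
    gain = 1.0 - np.tanh(pre) ** 2        # slope of tanh at each PE's operating point
    return np.linalg.eigvals(gain[:, None] * W)  # diag(gain) @ W

x0 = np.zeros(N)                          # linearization point (an assumption of this sketch)
for bias in (0.0, 0.5, 1.0):
    rho = np.max(np.abs(local_poles(x0, bias)))
    print(f"bias={bias:.1f}  local spectral radius={rho:.3f}")
```

With ±1 input weights and the zero state, every PE sees the same slope, so the local spectral radius is exactly 0.9 times (1 − tanh²(bias)). Along a real trajectory the slopes differ per PE and the pole pattern changes shape as well as scale.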
2 Analysis of Echo State Networks
2.1 Echo States as Bases and Projections. Let us consider the architecture
and recursive update equation of a typical ESN more closely.
Consider the recurrent discrete-time neural network given in Figure 1
with $M$ input units, $N$ internal PEs, and $L$ output units. The value of
the input unit at time $n$ is $\mathbf{u}(n) = [u_1(n), u_2(n), \ldots, u_M(n)]^T$, of internal
units are $\mathbf{x}(n) = [x_1(n), x_2(n), \ldots, x_N(n)]^T$, and of output units are
$\mathbf{y}(n) = [y_1(n), y_2(n), \ldots, y_L(n)]^T$. The connection weights are given in an $N \times M$
weight matrix $\mathbf{W}^{in} = (w^{in}_{ij})$ for connections between the input and the internal
PEs, in an $N \times N$ matrix $\mathbf{W} = (w_{ij})$ for connections between the internal
PEs, in an $L \times N$ matrix $\mathbf{W}^{out} = (w^{out}_{ij})$ for connections from PEs to the
output units, and in an $N \times L$ matrix $\mathbf{W}^{back} = (w^{back}_{ij})$ for the connections
that project back from the output to the internal PEs (Jaeger, 2001).

[Figure 1 diagram: input layer $\mathbf{u}(n)$ connected through $\mathbf{W}^{in}$ to the dynamical reservoir $\mathbf{x}(n)$ with recurrent weights $\mathbf{W}$, followed by the readout $\mathbf{W}^{out}$ producing $\mathbf{y}(n)$.]
Figure 1: An echo state network (ESN). ESN is composed of two parts: a fixed-weight
($\|\mathbf{W}\| < 1$) recurrent network and a linear readout. The recurrent network
is a reservoir of highly interconnected dynamical components, states of
which are called echo states. The memoryless linear readout is trained to produce
the output.

The activation of the internal PEs (echo state) is updated according to

$$\mathbf{x}(n+1) = \mathbf{f}\big(\mathbf{W}^{in}\mathbf{u}(n+1) + \mathbf{W}\mathbf{x}(n) + \mathbf{W}^{back}\mathbf{y}(n)\big), \tag{2.1}$$

where $\mathbf{f} = (f_1, f_2, \ldots, f_N)$ are the internal PEs’ activation functions. Here, all
$f_i$’s are hyperbolic tangent functions, $(e^{x} - e^{-x})/(e^{x} + e^{-x})$. The output from the readout
network is computed according to

$$\mathbf{y}(n+1) = \mathbf{f}^{out}\big(\mathbf{W}^{out}\mathbf{x}(n+1)\big), \tag{2.2}$$

where $\mathbf{f}^{out} = (f^{out}_1, f^{out}_2, \ldots, f^{out}_L)$ are the output unit’s nonlinear functions
(Jaeger, 2001, 2002a). Generally, the readout is linear so $\mathbf{f}^{out}$ is identity.
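Equations 2.1 and 2.2 translate almost line by line into code. The following is a minimal sketch, assuming NumPy, tanh PEs, an identity readout function, and hypothetical sizes, weights, and input signal (none of these particular values come from the text). The readout matrix is an untrained placeholder; it only shows where the output enters the recursion.

```python
import numpy as np

def esn_step(x, u_next, y, W_in, W, W_back):
    """Equation 2.1 with tanh PEs: x(n+1) = tanh(W_in u(n+1) + W x(n) + W_back y(n))."""
    return np.tanh(W_in @ u_next + W @ x + W_back @ y)

def readout(x, W_out):
    """Equation 2.2 with an identity f_out (linear readout)."""
    return W_out @ x

rng = np.random.default_rng(0)
M, N, L = 1, 20, 1                         # hypothetical numbers of inputs, PEs, outputs
W_in = rng.choice([-1.0, 1.0], size=(N, M))
W = rng.standard_normal((N, N))
W *= 0.8 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius set to 0.8
W_back = np.zeros((N, L))                  # no output feedback in this sketch
W_out = 0.1 * rng.standard_normal((L, N))  # untrained placeholder readout

x = np.zeros(N)
y = np.zeros(L)
states = []
for n in range(100):
    u_next = np.array([np.sin(0.2 * (n + 1))])  # hypothetical scalar input u(n+1)
    x = esn_step(x, u_next, y, W_in, W, W_back)
    y = readout(x, W_out)
    states.append(x)
states = np.array(states)                  # one echo-state vector per time step
print(states.shape)  # (100, 20)
```

Collecting the states row by row gives the matrix on which a linear readout is later fitted.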
ESNs resemble the RNN architecture proposed in Puskorius and
Feldkamp (1996) and also used by Sanchez (2004) in brain-machine
interfaces. The critical difference is the dimensionality of the hidden re-
current PE layer and the adaptation of the recurrent weights. We submit
that the ideas of approximation theory in functional spaces (bases and pro-
jections), so useful in adaptive signal processing (Principe, 2001), should
be utilized to understand the ESN architecture. Let $h(\mathbf{u}(t))$ be a real-valued
function of a real-valued vector

$$\mathbf{u}(t) = [u_1(t), u_2(t), \ldots, u_M(t)]^T.$$

In functional approximation, the goal is to estimate the behavior of $h(\mathbf{u}(t))$
as a combination of simpler functions $\varphi_i(t)$, called the basis functionals,
such that its approximant, $\hat{h}(\mathbf{u}(t))$, is given by

$$\hat{h}(\mathbf{u}(t)) = \sum_{i=1}^{N} a_i \varphi_i(t).$$

Here, $a_i$’s are the projections of $h(\mathbf{u}(t))$ onto each basis function. One of
the central questions in practical functional approximation is how to choose
the set of bases to approximate a given desired signal. In signal processing,
the choice normally goes for a complete set of orthogonal bases, independent
of the input. When the basis set is complete and can be made as large
as required, fixed bases work wonders (e.g., Fourier decompositions). In
neural computing, the basic idea is to derive the set of bases from the
input signal through a multilayered architecture. For instance, consider a
single hidden layer TDNN with $N$ PEs and a linear output. The hidden-layer
PE outputs can be considered a set of nonorthogonal basis functionals
dependent on the input,

$$\varphi_i(\mathbf{u}(t)) = g\Big(\sum_{j} b_{ij} u_j(t)\Big).$$

$b_{ij}$’s are the input layer weights, and $g$ is the PE nonlinearity. The approximation
produced by the TDNN is then

$$\hat{h}(\mathbf{u}(t)) = \sum_{i=1}^{N} a_i \varphi_i(\mathbf{u}(t)), \tag{2.3}$$

where $a_i$’s are the weights of the output layer. Notice that the $b_{ij}$’s adapt
the bases and the $a_i$’s adapt the projection in the projection space. Here the
goal is to restrict the number of bases (number of hidden layer PEs) because
their number is coupled with the number of parameters to adapt, which
has an impact on generalization and training set size, for example. Usually,
since all of the parameters of the network are adapted, the best basis in the
joint (input and desired signals) space as well as the best projection can be
achieved and represents the optimal solution. The output of the TDNN is
a linear combination of its internal representations, but to achieve a basis
set (even if nonorthogonal), linear independence among the $\varphi_i(\mathbf{u}(t))$’s must
be enforced. Ito, Shah and Poon, and others have shown that this is indeed
the case (Ito, 1996; Shah & Poon, 1999), but a thorough discussion is outside
the scope of this article.
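The basis-and-projection reading of equation 2.3 can be made concrete with a small numerical sketch. It assumes NumPy, a hypothetical sinusoidal input embedded in a five-tap delay line, eight tanh hidden PEs, and a hypothetical desired signal (the cube of the current sample). One simplification is deliberate and is not how a TDNN is trained: the input-layer weights $b_{ij}$ are drawn at random and held fixed, and only the projection coefficients $a_i$ are found, by least squares. A TDNN would adapt both with backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M, N = 400, 5, 8                        # samples, delay taps, hidden PEs (all hypothetical)

s = np.sin(0.2 * np.arange(T + M - 1))     # scalar input signal
# Delay-line embedding: row t holds M consecutive samples ending at the current one.
U = np.stack([s[k:k + T] for k in range(M)], axis=1)   # shape (T, M)

b = rng.standard_normal((N, M))            # input-layer weights b_ij (fixed here, not trained)
Phi = np.tanh(U @ b.T)                     # Phi[t, i] = g(sum_j b_ij u_j(t)), shape (T, N)

d = s[M - 1:M - 1 + T] ** 3                # hypothetical desired signal: cube of current sample
a, *_ = np.linalg.lstsq(Phi, d, rcond=None)  # projection coefficients a_i
h_hat = Phi @ a                            # approximant of equation 2.3
print("MSE:", np.mean((d - h_hat) ** 2))
```

The columns of `Phi` play the role of the basis functionals and `a` is the projection of the desired signal onto them; how small the error can get depends on how well those columns span the desired signal.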
The ESN (and the RNN) architecture can also be studied in this frame-
work. The states of equation 2.1 correspond to the basis set, which are
recursively computed from the input, output, and previous states through
$\mathbf{W}^{in}$, $\mathbf{W}$, and $\mathbf{W}^{back}$. Notice, however, that none of these weight matrices is
adapted, that is, the functional bases in the ESN are uniquely defined by the
input and the initial selection of weights. In a sense, ESNs are trading the
adaptive connections in the RNN hidden layer for a brute-force approach
of creating fixed diversified dynamics in the hidden layer.
For an ESN with a linear readout network, the output equation
($\mathbf{y}(n+1) = \mathbf{W}^{out}\mathbf{x}(n+1)$) has the same form as equation 2.3, where the $\varphi_i$’s and
$a_i$’s are replaced by the echo states and the readout weights, respectively.
The readout weights are adapted using the training data, which means that the
ESN is able to find the optimal projection in the projection space, just like
the RNN or the TDNN.
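In the ESN case the same projection is computed on the echo states while the reservoir stays fixed. The sketch below assumes NumPy, a hypothetical 100-unit Gaussian reservoir rescaled to spectral radius 0.9, a sinusoidal input, an illustrative nonlinear target (the seventh power of the input), and a 50-step washout that discards the initial transient. These choices, and the use of a generic least-squares solver for the readout, are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, washout = 100, 300, 50               # reservoir size, time steps, discarded transient

W = rng.standard_normal((N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # fixed reservoir, spectral radius 0.9
w_in = rng.choice([-1.0, 1.0], size=N)     # fixed input weights

u = np.sin(np.arange(T) / 5.0)             # hypothetical input signal
d = u ** 7                                 # illustrative desired response

# Run the reservoir: x(n+1) = tanh(w_in u(n+1) + W x(n)), with no output feedback.
X = np.zeros((T, N))
x = np.zeros(N)
for n in range(T):
    x = np.tanh(w_in * u[n] + W @ x)
    X[n] = x

# Only the readout is fitted; the echo states act as the fixed basis set.
A, target = X[washout:], d[washout:]
w_out, *_ = np.linalg.lstsq(A, target, rcond=None)
mse = np.mean((target - A @ w_out) ** 2)
print("training MSE:", mse)
```

Rerunning with a different seed changes the reservoir but not its spectral radius, which gives a quick way to observe how much the training error depends on the particular realization.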
A similar perspective of basis and projections for information processing
in biological networks has been proposed by Pouget and Sejnowski (1997).
They explored the possibility that the response of neurons in parietal cortex
serves as basis functions for the transformations from the sensory input
to the motor responses. They proposed that “the role of spatial represen-
tations is to code the sensory inputs and posture signals in a format that
simplifies subsequent computation, particularly in the generation of motor
commands.”
The central issue in ESN design is exactly the nonadaptive nature of
the basis set. Parameter sets in the reservoir that provide linearly inde-
pendent states and possess a given spectral radius may define drastically
different projection spaces because the correlation among the bases is not
constrained. A simple experiment was designed to demonstrate that the se-
lection of the ESN parameters by constraining the spectral radius is not the
most suitable for function approximation. Consider a 100-unit ESN where
the input signal is $\sin(2\pi n/10\pi)$. Mimicking Jaeger (2001), the goal is to let
the ESN generate the seventh power of the input signal. Different realizations
of a randomly connected 100-unit ESN were constructed where the
entries of $\mathbf{W}$ are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025,
and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input
weights are set to +1 or −1 with equal probabilities, and $\mathbf{W}^{back}$ is set to
zero. Input is applied for 300 time steps, and the echo states are calculated
using equation 2.1. The next step is to train the linear readout.

[Figure 2 plot, titled "MSE for different realizations": MSE on a logarithmic axis versus different realizations (0 to 50).]
Figure 2: Performances of ESNs for different realizations of $\mathbf{W}$ with the same
weight distribution. The weight values are set to 0.4, −0.4, and 0 with probabilities
of 0.025, 0.025, and 0.95. All realizations have the same spectral radius
of 0.88. In the 50 realizations, MSEs vary from $5.9 \times 10^{-9}$ to $8.9 \times 10^{-5}$. Results
show that for each set of random weights that provide the same spectral radius,
the correlation or degree of redundancy among the bases will change, and
different performances are encountered in practice.

One method to determine the optimal output weight matrix, $\mathbf{W}^{out}$, in the mean square
error (MSE) sense (where MSE is defined by $O = \frac{1}{2}(\mathbf{d} - \mathbf{y})^T(\mathbf{d} - \mathbf{y})$) is to use
the Wiener solution given by Haykin (2001):