testing_generation/Corpus/Analysis and Design of Echo...

            LETTER                                 Communicated by Herbert Jaeger


            Analysis and Design of Echo State Networks


            Mustafa C. Ozturk
            can@cnel.uﬂ.edu
            Dongming Xu
            dmxu@cnel.uﬂ.edu
            JoseC.Pr´     ´ıncipe
            principe@cnel.uﬂ.edu
            Computational NeuroEngineering Laboratory, Department of Electrical and
            Computer Engineering, University of Florida, Gainesville, FL 32611, U.S.A.


            The design of echo state network (ESN) parameters relies on the selec-
            tion of the maximum eigenvalue of the linearized system around zero
            (spectral radius). However, this procedure does not quantify in a sys-
            tematic manner the performance of the ESN in terms of approximation
            error. This article presents a functional space approximation framework
            to better understand the operation of ESNs and proposes an information-
            theoretic metric, the average entropy of echo states, to assess the richness
            of the ESN dynamics. Furthermore, it provides an interpretation of the
            ESN dynamics rooted in system theory as families of coupled linearized
            systems whose poles move according to the input signal dynamics. With
            this interpretation, a design methodology for functional approximation
            is put forward where ESNs are designed with uniform pole distributions
            covering the frequency spectrum to abide by the richness metric, irre-
            spective of the spectral radius. A single bias parameter at the ESN input,
            adapted with the modeling error, conﬁgures the ESN spectral radius to
            the input-output joint space. Function approximation examples compare
            the proposed design methodology versus the conventional design.


            1 Introduction

            Dynamic computational models require the ability to store and access the
            time history of their inputs and outputs. The most common dynamic neural
            architecture is the time-delay neural network (TDNN) that couples delay
            lines with a nonlinear static architecture where all the parameters (weights)
            are adapted with the backpropagation algorithm. The conventional delay
            line utilizes ideal delay operators, but delay lines with local ﬁrst-order re-
            cursive ﬁlters have been proposed by Werbos (1992) and extensively stud-
            ied in the gamma model (de Vries, 1991; Principe, de Vries, & de Oliviera,
            1993). Chains of ﬁrst-order integrators are interesting because they effec-
            tively decrease the number of delays necessary to create time embeddings


            Neural Computation19, 111–138(2007)   C 2006 Massachusetts Institute of Technology           112 M. Ozturk, D. Xu, and J. Pr´ıncipe


           (Principe, 2001). Recurrent neural networks (RNNs) implement a differ-
           ent type of embedding that is largely unexplored. RNNs are perhaps the
           most biologically plausible of the artiﬁcial neural network (ANN) models
           (Anderson, Silverstein, Ritz, & Jones, 1977; Hopﬁeld, 1984; Elman, 1990),
           but are not well understood theoretically (Siegelmann & Sontag, 1991;
           Siegelmann, 1993; Kremer, 1995). One of the main practical problems with
           RNNs is the difﬁculty to adapt the system weights. Various algorithms,
           such as backpropagation through time (Werbos, 1990) and real-time recur-
           rent learning (Williams & Zipser, 1989), have been proposed to train RNNs;
           however, these algorithms suffer from computational complexity, resulting
           in slow training, complex performance surfaces, the possibility of instabil-
           ity, and the decay of gradients through the topology and time (Haykin,
           1998). The problem of decaying gradients has been addressed with spe-
           cial processing elements (PEs) (Hochreiter & Schmidhuber, 1997). Alter-
           native second-order training methods based on extended Kalman ﬁltering
           (Singhal & Wu, 1989; Puskorius & Feldkamp, 1994; Feldkamp, Prokhorov,
           Eagen, & Yuan, 1998) and the multistreaming training approach (Feldkamp
           et al., 1998) provide more reliable performance and have enabled practical
           applications in identiﬁcation and control of dynamical systems (Kechri-
           otis, Zervas, & Monolakos, 1994; Puskorius & Feldkamp, 1994; Delgado,
           Kambhampati, & Warwick, 1995).
             Recently,twonewrecurrentnetworktopologieshavebeenproposed:the
           echo state network (ESN) by Jaeger (2001, 2002a; Jaeger & Hass, 2004) and
           the liquid state machine (LSM) by Maass (Maass, Natschlager, & Markram,¨
           2002). ESNs possess a highly interconnected and recurrent topology of
           nonlinear PEs that constitutes a “reservoir of rich dynamics” (Jaeger, 2001)
           and contain information about the history of input and output patterns.
           The outputs of these internal PEs (echo states) are fed to a memoryless but
           adaptive readout network (generally linear) that produces the network out-
           put. The interesting property of ESN is that only the memoryless readout is
           trained, whereas the recurrent topology has ﬁxed connection weights. This
           reduces the complexity of RNN training to simple linear regression while
           preserving a recurrent topology, but obviously places important constraints
           in the overall architecture that have not yet been fully studied. Similar ideas
           have been explored independently by Maass and formalized in the LSM
           architecture. LSMs, although formulated quite generally, are mostly im-
           plemented as neural microcircuits of spiking neurons (Maass et al., 2002),
           whereas ESNs are dynamical ANN models. Both attempt to model biolog-
           ical information processing using similar principles. We focus on the ESN
           formulation in this letter.
             The echo state condition is deﬁned in terms of the spectral radius (the
           largest among the absolute values of the eigenvalues of a matrix, denoted
           by·) of the reservoir’s weight matrix (W<1). This condition states
           that the dynamics of the ESN is uniquely controlled by the input, and the
           effect of the initial states vanishes. The current design of ESN parameters           Analysis and Design of Echo State Networks 113


           relies on the selection of spectral radius. However, there are many possible
           weight matrices with the same spectral radius, and unfortunately they do
           not all perform at the same level of mean square error (MSE) for functional
           approximation. A similar problem exists in the design of the LSM. LSMs
           have been shown to possess universal approximation given the separation
           property (SP) for the liquid (reservoir in ESNs) and the approximation
           property (AP) for the readout (Maass et al., 2002). SP is quantiﬁed by a
           kernel-quality measure proposed in Maass, Legenstein, and Bertschinger
           (2005) that is based on the rank of a matrix formed by the system states
           corresponding to different input signals. The kernel quality is a measure
           for the complexity and diversity of nonlinear operations carried out by the
           liquid on its input stream in order to boost the classiﬁcation power of a
           subsequent linear decision hyperplane (Maass et al., 2005). A variation of
           SP has been proposed in Bertschinger and Natschlager (2004), and it has¨
           been argued that complex calculations can be best carried out by networks
           on the boundary between ordered and chaotic dynamics.
             Inthisletter,weareinterestedinstudyingtheESNforfunctionalapprox-
           imation (ﬁlters that map input functionsu(·) of time on output functionsy(·)
           of time). We see two major shortcomings with the current ESN approach
           that uses echo state condition as a design principle. First, the impact of ﬁxed
           reservoir parameters for function approximation means that the informa-
           tion about the desired response is conveyed only to the output projection.
           This is not optimal, and strategies to select different reservoirs for different
           applications have not been devised. Second, imposing a constraint only on
           the spectral radius is a weak condition to properly set the parameters of
           the reservoir, as experiments show (different randomizations with the same
           spectral radius perform differently for the same problem; see Figure 2).
             This letter aims to address these two problems by proposing a frame-
           work, a metric, and a design principle for ESNs. The framework is a signal
           processing interpretation of basis and projections in functional spaces to
           describe and understand the ESN architecture. According to this interpre-
           tation, the ESN states implement a set of basis functionals (representation
           space) constructed dynamically by the input, while the readout simply
           projects the desired response onto this representation space. The metric
           to describe the richness of the ESN dynamics is an information-theoretic
           quantity, the average state entropy (ASE). Entropy measures the amount of
           information contained in a given random variable (Shannon, 1948). Here,
           the random variable is the instantaneous echo state from which the en-
           tropy for the overall state (vector) is estimated. The probability density
           function (pdf) in a differential geometric framework should be thought of
           as a volume form; that is, in our case, the pdf of the state vector describes
           the metric of the state space manifold (Amari, 1990). Moreover, Cox (1946)
           established information as a coordinate free metric in the state manifold.
           Therefore, entropy becomes a global descriptor of information that quanti-
           ﬁes the volume of the manifold deﬁned by the random variable. Due to the           114 M. Ozturk, D. Xu, and J. Pr´ıncipe


           time dependency of the states, the state entropy averaged over time (ASE)
           is an appropriate estimate of the volume of the state manifold.
             The design principle speciﬁes that one should consider independently
           thecorrelationamongthebasisandthespectralradius.Intheabsenceofany
           information about the desired response, the ESN states should be designed
           with the highest ASE, independent of the spectral radius. We interpret the
           ESN dynamics as a combination of time-varying linear systems obtained
           from the linearization of the ESN nonlinear PE in a small, local neighbor-
           hood of the current state. The design principle means that the poles of the
           linearized ESN reservoir should have uniform pole distributions to gener-
           ate echo states with the most diverse pole locations (which correspond to
           the uniformity of time constants). Effectively, this will create the least cor-
           related bases for a given spectral radius, which corresponds to the largest
           volume spanned by the basis set. When the designer has no other informa-
           tion about the desired response to set the basis, this principle distributes
           the system’s degrees of freedom uniformly in space. It approximates for
           ESNs the well-known property of orthogonal basis. The unresolved issue
           that ASE does not quantify is how to set the spectral radius, which depends
           again on the desired mapping. The concept of memory depth as explained
           in Principe et al. (1993) and Jaeger (2002a) is helpful in understanding the
           issues associated with the spectral radius. The correlation time of the de-
           siredresponse(asestimatedbytheﬁrstzerooftheautocorrelationfunction)
           gives an indication of the type of spectral radius required (long correlation
           time requires high spectral radius). Alternatively, a simple adaptive bias is
           added at the ESN input to control the spectral radius integrating the infor-
           mation from the input-output joint space in the ESN bases. For sigmoidal
           PEs, the bias adjusts the operating points of the reservoir PEs, which has
           the net effect of adjusting the volume of the state manifold as required to
           approximate the desired response with a small error. This letter shows that
           ESNs designed with this strategy obtain systematically better results in a
           set of experiments when compared with the conventional ESN design.


           2 Analysis of Echo State Networks

              2.1 Echo States as Bases and Projections.Let us consider the ar-
           chitecture and recursive update equation of a typical ESN more closely.
           Consider the recurrent discrete-time neural network given in Figure 1
           withMinput units,Ninternal PEs, andLoutput units. The value of
           the input unit at timenisu(n)=[u1 (n),u2 (n),...,uM (n)] T , of internal
           units arex(n)=[x1 (n),x2 (n),...,xN (n)] T , and of output units arey(n)=
           [y1 (n),y2 (n),...,yL (n)] T . The connection weights are given in anN×M
           weight matrixWin =(win ) for connections between the input and the inter- ij nalPEs,inanN×NmatrixW=(wij ) for connections between the internal
           PEs, in anL×NmatrixWout =(wout ) for connections from PEs to the ij           Analysis and Design of Echo State Networks 115


            Input Layer      Dynamical Reservoir Read-out

                     Win  WW                 out


                                       x(n) u(n)

                                .                     +
                                .                               y(n)


                                                             Wback


           Figure 1: An echo state network (ESN). ESN is composed of two parts: a ﬁxed-
           weight (W<1) recurrent network and a linear readout. The recurrent net-
           work is a reservoir of highly interconnected dynamical components, states of
           which are called echo states. The memoryless linear readout is trained to pro-
           duce the output.


           output units, and in anN×LmatrixWback =(wback ) for the connections ij that project back from the output to the internal PEs (Jaeger, 2001). The
           activation of the internal PEs (echo state) is updated according to

               x(n+1)=f(Win u(n+1)+Wx(n)+Wback y(n)),             (2.1)

           wheref=(f1 ,f2 ,...,fN )aretheinternalPEs’activationfunctions.Here,all
           f                              e−x
            i ’s are hyperbolic tangent functions ( ex −  ). The output from the readout ex +e−x
           network is computed according to

               y(n+1)=fout (Wout x(n+1)),                           (2.2)

           wherefout =(fout ,fout ,...,fout ) are the output unit’s nonlinear functions 1   2      L (Jaeger, 2001, 2002a). Generally, the readout is linear sofout is identity.
             ESNs resemble the RNN architecture proposed in Puskorius and
           Feldkamp (1996) and also used by Sanchez (2004) in brain-machine           116 M. Ozturk, D. Xu, and J. Pr´ıncipe


           interfaces. The critical difference is the dimensionality of the hidden re-
           current PE layer and the adaptation of the recurrent weights. We submit
           that the ideas of approximation theory in functional spaces (bases and pro-
           jections), so useful in adaptive signal processing (Principe, 2001), should
           be utilized to understand the ESN architecture. Leth(u(t)) be a real-valued
           function of a real-valued vector

               u(t)=[u1 (t),u2 (t),...,uM (t)] T .

           In functional approximation, the goal is to estimate the behavior ofh(u(t))
           as a combination of simpler functionsϕi (t), called the basis functionals,
           such that its approximant,hˆ(u(t)), is given by

                       N
               hˆ(u(t))=   ai ϕi (t).
                       i=1

           Here,ai ’s are the projections ofh(u(t)) onto each basis function. One of
           the central questions in practical functional approximation is how to choose
           the set of bases to approximate a given desired signal. In signal processing,
           thechoicenormallygoesforacompletesetoforthogonalbasis,independent
           of the input. When the basis set is complete and can be made as large
           as required, ﬁxed bases work wonders (e.g., Fourier decompositions). In
           neural computing, the basic idea is to derive the set of bases from the
           input signal through a multilayered architecture. For instance, consider a
           single hidden layer TDNN withNPEs and a linear output. The hidden-
           layer PE outputs can be considered a set of nonorthogonal basis functionals
           dependent on the input,
                                 

               ϕi (u(t))=g  bij uj (t).
                           j

           bij ’s are the input layer weights, andgis the PE nonlinearity. The approxi-
           mation produced by the TDNN is then

                       N
               h ˆ(u(t))=   ai ϕi (u(t)),                                (2.3)
                       i=1

           whereai ’s are the weights of the output layer. Notice that thebij ’s adapt
           the bases and theai ’s adapt the projection in the projection space. Here the
           goal is to restrict the number of bases (number of hidden layer PEs) because
           their number is coupled with the number of parameters to adapt, which
           has an impact on generalization and training set size, for example. Usually,           Analysis and Design of Echo State Networks 117


           since all of the parameters of the network are adapted, the best basis in the
           joint (input and desired signals) space as well as the best projection can be
           achieved and represents the optimal solution. The output of the TDNN is
           a linear combination of its internal representations, but to achieve a basis
           set (even if nonorthogonal), linear independence among theϕi (u(t))’s must
           be enforced. Ito, Shah and Pon, and others have shown that this is indeed
           the case (Ito, 1996; Shah & Poon, 1999), but a thorough discussion is outside
           the scope of this article.
             The ESN (and the RNN) architecture can also be studied in this frame-
           work. The states of equation 2.1 correspond to the basis set, which are
           recursively computed from the input, output, and previous states through
           Win ,W,andWback . Notice, however, that none of these weight matrices is
           adapted, that is, the functional bases in the ESN are uniquely deﬁned by the
           input and the initial selection of weights. In a sense, ESNs are trading the
           adaptive connections in the RNN hidden layer by a brute force approach
           of creating ﬁxed diversiﬁed dynamics in the hidden layer.
             For an ESN with a linear readout network, the output equation (y(n+
           1)=Wout x(n+1)) has the same form of equation 2.3, where theϕi ’s and
           ai ’s are replaced by the echo states and the readout weights, respectively.
           The readout weights are adapted in the training data, which means that the
           ESN is able to ﬁnd the optimal projection in the projection space, just like
           the RNN or the TDNN.
             A similar perspective of basis and projections for information processing
           in biological networks has been proposed by Pouget and Sejnowski (1997).
           They explored the possibility that the response of neurons in parietal cortex
           serves as basis functions for the transformations from the sensory input
           to the motor responses. They proposed that “the role of spatial represen-
           tations is to code the sensory inputs and posture signals in a format that
           simpliﬁes subsequent computation, particularly in the generation of motor
           commands”.
             The central issue in ESN design is exactly the nonadaptive nature of
           the basis set. Parameter sets in the reservoir that provide linearly inde-
           pendent states and possess a given spectral radius may deﬁne drastically
           different projection spaces because the correlation among the bases is not
           constrained. A simple experiment was designed to demonstrate that the se-
           lection of the ESN parameters by constraining the spectral radius is not the
           most suitable for function approximation. Consider a 100-unit ESN where
           the input signal is sin(2πn/10π). Mimicking Jaeger (2001), the goal is to let
           the ESN generate the seventh power of the input signal. Different realiza-
           tions of a randomly connected 100-unit ESN were constructed where the
           entries ofWare set to 0.4,−0.4, and 0 with probabilities of 0.025, 0.025,
           and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input
           weights are set to+1or,−1 with equal probabilities, andWback is set to
           zero. Input is applied for 300 time steps, and the echo states are calculated
           using equation 2.1. The next step is to train the linear readout. One method           118 M. Ozturk, D. Xu, and J. Pr´ıncipe


                              MSE for different realizations10 4


              10 6


              10 8


              10 9
                0        10        20        30        40        50
                                Different realizations

           Figure 2: Performances of ESNs for different realizations ofWwith the same
           weight distribution. The weight values are set to 0.4,−0.4, and 0 with proba-
           bilities of 0.025, 0.025, and 0.95. All realizations have the same spectral radius
           of 0.88. In the 50 realizations, MSEs vary from 5.9×10 −9 to 8.9×10 −5 . Results
           show that for each set of random weights that provide the same spectral ra-
           dius, the correlation or degree of redundancy among the bases will change, and
           different performances are encountered in practice.


           to determine the optimal output weight matrix,Wout , in the mean square
           error (MSE) sense (where MSE is deﬁned byO=1 (d−y)T (d−y)) is to use 2 the Wiener solution given by Haykin (2001):

                                                        −1  1
            Wout =E[xx T ]−1 E[xd]∼  1
                              =      x(n)x(n)T         x(n)d(n) .  (2.4) N               Nn               n

           Here,E[.] denotes the expected value operator, andddenotes the desired
           signal. Figure 2 depicts the MSE values for 50 different realizations of
           the ESNs. As observed, even though each ESN has the same sparseness
           and spectral radius, the MSE values obtained vary greatly among differ-
           ent realizations. The minimum MSE value obtained among the 50 realiza-
           tions is 5.9x10 −9 , whereas the maximum MSE is 8.9x10 −5 . This experiment           Analysis and Design of Echo State Networks 119


           demonstrates that a design strategy that is based solely on the spectral
           radius is not sufﬁcient to specify the system architecture for function ap-
           proximation. This shows that for each set of random weights that provide
           thesamespectralradius,thecorrelationordegreeofredundancyamongthe
           bases will change, and different performances are encountered in practice.

             2.2 ESN Dynamics as a Combination of Linear Systems.It is well
           known that the dynamics of a nonlinear system can be approximated by
           that of a linear system in a small neighborhood of an equilibrium point
           (Kuznetsov, Kuznetsov, & Marsden, 1998). Here, we perform the analysis
           with hyperbolic tangent nonlinearities and approximate the ESN dynam-
           ics by the dynamics of the linearized system in the neighborhood of the
           current system state. Hence, when the system operating point varies over
           time, the linear system approximating the ESN dynamics changes. We are
           particularly interested in the movement of the poles of the linearized ESN.
           Consider the update equation for the ESN without output feedback given
           by

               x(n+1)=f(Win u(n+1)+Wx(n)).

           Linearizing the system around the current statex(n), one obtains the
           Jacobian matrix,J(n+1), deﬁned by
                                                         f˙(net 1 (n))w   ˙11 f(net 1 (n))w12 ··· f˙(net 1 (n))w1N                                       f˙(net                                J(n+1)=    2 (n))w   ˙21 f(net 2 (n))w22 ··· f˙(net 2 (n))w2N                                          ··· ··· ··· ···    
                      f˙(net N (n))w  ˙N1 f(net N (n))wN2 ···f˙(net N (n))wNN
                                               f˙(net 1 (n)) 0   ···   0
                                                  0    f ˙(net               =            2 (n))···   0                               ·W=F(n)·W.   (2.5)
                       ··· ··· ··· ···  
                         00···f˙ (net N (n))


           Here,net i (n)istheith entry of the vector (Win u(n+1)+Wx(n)), andwij
           denotes the (i,j)th entry ofW. The poles of the linearized system at time
           n+1 are given by the eigenvalues of the Jacobian matrixJ(n+1). 1 As the
           amplitude of each PE changes, the local slope changes, and so the poles of


             1 The transfer function of a linear systemx(n+1)=Ax(n)+Bu(n)is X(z) =(zI−U(z)
           A)−1 B=Adjoint(zI−A) B. The poles of the transfer function can be obtained by solving det(zI−A)
           det(zI−A)=0. The solution corresponds to the eigenvalues ofA.           120 M. Ozturk, D. Xu, and J. Pr´ıncipe


           the linearized system are time varying, although the parameters of ESN are
           ﬁxed.
             In order to visualize the movement of the poles, consider an ESN with
           100 states. The entries of the internal weight matrix are chosen to be 0,
           0.4 and−0.4 with probabilities 0.9, 0.05, and 0.05.Wis scaled such that a
           spectral radius of 0.95 is obtained. Input weights are set to+1or−1 with
           equal probabilities. A sinusoidal signal with a period of 100 is fed to the
           system, and the echo states are computed according to equation 2.1. Then
           the Jacobian matrix and the eigenvalues are calculated using equation 2.5.
           Figure 3 shows the pole tracks of the linearized ESN for different input
           values. A single ESN with ﬁxed parameters implements a combination of
           many linear systems with varying pole locations, hence many different
           time constants that modulate the richness of the reservoir of dynamics as a
           function of input amplitude. Higher-amplitude portions of the signal tend
           to saturate the nonlinear function and cause the poles to shrink toward
           the origin of thez-plane (decreases the spectral radius), which results in a
           system with a large stability margin. When the input is close to zero, the
           poles of the linearized ESN are close to the maximal spectral radius chosen,
           decreasing the stability margin. When compared to their linear counterpart,
           an ESN with the same number of states results in a detailed coverage of
           thez-plane dynamics, which illustrates the power of nonlinear systems.
           Similar results can be obtained using signals of different shapes at the ESN
           input.
             A key corollary of the above analysis is that the spectral radius of an
           ESN can be adjusted using a constant bias signal at the ESN input without
           changing the recurrent connection matrix,W. The application of a nonzero
           constant bias will move the operating point to regions of the sigmoid func-
           tion closer to saturation and always decrease the spectral radius due to the
           shape of the nonlinearity. 2 The relevance of bias in terms of overall system
           performance has also been discussed in Jaeger (2002b) and Bertschinger
           and Natschlager (2004), but here we approach it from a system theory per-¨
           spective and explain its effect on reservoir dynamics.

           3 Average State Entropy as a Measure of the Richness of ESN Reservoir

           Previous research was aware of the inﬂuence of diversity of the recurrent
           layer outputs on the overall performance of ESNs and LSMs. Several met-
           rics to quantify the diversity have been proposed (Jaeger, 2001; Maass, et al.,


             2 AssumeWhas nondegenerate eigenvalues and corresponding linearly independent
           eigenvectors. Then consider the eigendecomposition ofW,whereW=PDP −1 ,Pis the
           eigenvectormatrixandDisthediagonalmatrixofeigenvalues(Dii )ofW.SinceF(n)andD
           are diagonal,J(n+1)=F(n)W=F(n)(PDP −1 )=P(F(n)D)P−1 is the eigendecomposition
           ofJ(n+1). Here, each entry ofF(n)D,f (net(n))Dii , is an eigenvalue ofJ. Therefore,
           |f (net(n))Dii |≤|Dii |sincef (net i )≤f (0).           Analysis and Design of Echo State Networks 121


            (A) 1                        (B) 1
                     D0.8                          0.8
               0.6   C                      0.6
               0.4                          0.4


                                         Imaginary
              Amplitude 0.2                          0.2
                0 B        E                 0
              -0.2                          -0.2
              -0.4                          -0.4
              -0.6                          -0.6
              -0.8                          -0.8
               -1                           -1 0   20   40   60   80  100     -1    -0.5   Real 0    0.5    1 Time
            (C) 1                        (D) 1
               0.8                          0.8
               0.6                          0.6
               0.4                          0.4


              Imaginary 0.2


                                         Imaginary 0.2
                0                           0
              -0.2                          -0.2
              -0.4                          -0.4
              -0.6                          -0.6
              -0.8                          -0.8
               -1                           -1-1    -0.5   Real 0    0.5    1     -1    -0.5   Real 0    0.5    1

            (E) 1                        (F) 1
               0.8                          0.8
               0.6                          0.6
               0.4                          0.4


              Imaginary 0.2


                                         Imaginary 0.2
                0                           0
              -0.2                          -0.2
              -0.4                          -0.4
              -0.6                          -0.6
              -0.8                          -0.8
               -1                           -1-1    -0.5   Real 0    0.5    1     -1    -0.5   Real 0    0.5    1

           Figure 3: The pole tracks of the linearized ESN with 100 PEs when the input
           goes through a cycle. An ESN with ﬁxed parameters implements a combination
           of linear systems with varying pole locations. (A) One cycle of sinusoidal signal
           with a period of 100. (B–E) The positions of poles of the linearized systems
           when the input values are at B, C, D, and E in Figure 5A. (F) The cumulative
           pole locations show the movement of the poles as the input changes. Due to
           the varying pole locations, different time constants modulate the richness of
           the reservoir of dynamics as a function of input amplitude. Higher-amplitude
           signals tend to saturate the nonlinear function and cause the poles to shrink
           toward the origin of thez-plane (decreases the spectral radius), which results in
           a system with a large stability margin. When the input is close to zero, the poles
           ofthelinearizedESNareclosetothemaximalspectralradiuschosen,decreasing
           the stability margin. An ESN with more states results in a detailed coverage of
           thez-plane dynamics, which illustrates the power of nonlinear systems, when
           compared to their linear counterpart.           122 M. Ozturk, D. Xu, and J. Pr´ıncipe


           2005). Here, our approach of bases and projections leads to a new metric.
           We propose the instantaneous state entropy to quantify the distribution of
           instantaneous amplitudes across the ESN states. Entropy of the instanta-
           neous ESN states is appropriate to quantify performance in function ap-
           proximation because the ESN output is a mere weighted combination of
           the instantaneous value of the ESN states. If the echo state’s instantaneous
           amplitudes are concentrated on only a few values across the ESN state dy-
           namic range, the ability to approximate an arbitrary desired response by
           weighting the states is limited (and wasteful due to redundancy between
           the different states), and performance will suffer. On the other hand, if the
           ESN states provide a diversity of instantaneous amplitudes, it is much eas-
           ier to achieve the desired mapping. Hence, the instantaneous entropy of the
           states appears as a good measure to quantify the richness of dynamics with
           instantaneous mappers. Due to the time structure of signals, the average
           state entropy (ASE), deﬁned as the state entropy averaged over time, will be
           the parameter used to quantify the diversity in the dynamical reservoir of
           the ESN. Moreover, entropy has been proposed as an appropriate measure
           of the volume of the signal manifold (Cox, 1946; Amari, 1990). Here, ASE
           measures the volume of the echo state manifold spanned by trajectories.
             Renyi’squadraticentropyisemployedherebecauseitisaglobalmeasure
           of information. In addition, an efﬁcient nonparametric estimator of Renyi’s
           entropy,whichavoidsexplicitpdfestimation,hasbeendeveloped(Principe,
           Xu, & Fisher, 2000). Renyi’s entropy with parameterγfor a random variable
           Xwith a pdffX (x) is given by Renyi (1970):


                        1Hγ (X)=     logE[fγ−1 (X)].1−γ      X


           Renyi’s quadratic entropy is obtained forγ=2 (forγ→1, Shannon’s en-
           tropy is obtained). GivenNsamples{x1 ,x2 ,...,xN }drawn from the un-
           known pdf to be estimated, Parzen windowing approximates the underly-
           ing pdf by


                      1N
                fX (x)=     KN    σ (x−xi ),
                        i=1

           whereKσ is the kernel function with the kernel sizeσ. Then the Renyi’s
           quadratic entropy can be estimated by (Principe et al., 2000)

                                          

               H2 (X)=−log1
                                      KN2        σ (xj −xi ) .               (3.1)
                                j   i           Analysis and Design of Echo State Networks 123


             The instantaneous state entropy is estimated using equation 3.1 where
           thesamplesaretheentriesofthestatevectorx(n)=[x1 (n),x2 (n),...,xN (n)] T
           of an ESN withNinternal PEs. Results will be shown with a gaussian kernel
           with kernel size chosen to be 0.3 of the standard deviation of the entries
           of the state vector. We will show that ASE is a more sensitive parameter to
           quantify the approximation properties of ESNs by experimentally demon-
           strating that ESNs with different spectral radius and even with the same
           spectral radius display different ASEs.
             Let us consider the same 100-unit ESN that we used in the previous
           section built with three different spectral radii 0.2, 0.5, 0.8 with an input
           signal of sin(2πn/20). Figure 4A depicts the echo states over 200 time ticks.
           The instantaneous state entropy is also calculated at each time step using
           equation 3.1 and plotted in Figure 4B. First, note that the instantaneous
           state entropy changes over time with the distribution of the echo states as
           we would expect, since state entropy is dependent on the input signal that
           also changes in this case. Second, as the spectral radius increases in the
           simulation, the diversity in the echo states increases. For the spectral radius
           of 0.2, echo state’s instantaneous amplitudes are concentrated on only a
           few values, which is wasteful due to redundancy between different states.
           In practice, to quantify the overall representation ability over time, we will
           use ASE, which takes values−0.735,−0.007, and 0.335 for the spectral
           radii of 0.2, 0.5, and 0.8, respectively. Moreover, even for the same spectral
           radius, several ASEs are possible. Figure 4C shows ASEs from 50 different
           realizations of ESNs with the same spectral radius of 0.5, which means that
           ASE is a ﬁner descriptor of the dynamics of the reservoir. Although we
           have presented an experiment with sinusoidal signal, similar results are
           obtained for other inputs as long as the input dynamic range is properly
           selected.
             Maximizing ASE means that the diversity of the states over time is the
           largest and should provide a basis set that is as uncorrelated as possible.
           This condition is unfortunately not a guarantee that the ESN so designed
           will perform the best, because the basis set in ESNs is created independent
           of the desired response and the application may require a small spectral
           radius. However, we maintain that when the desired response is not ac-
           cessible for the design of the ESN bases or when the same reservoir is
           to be used for a number of problems, the default strategy should be to
           maximize the ASE of the state vector. The following section addresses
           the design of ESNs with high ASE values and a simple mechanism to
           adjust the reservoir dynamics without changing the recurrent connection
           weights.

           4 Designing Echo State Networks

             4.1 Design of the Echo State Recurrent Connections.According to the
           interpretation of ESNs as coupled linear systems, the design of the internal           124 M. Ozturk, D. Xu, and J. Pr´ıncipe


           connection matrix,W, will be based on the distribution of the poles of the
           linearized system around zero state. Our proposal is to design the ESN
           such that the linearized system has uniform pole distribution inside the
           unit circle of thez-plane. With this design scenario, the system dynamics
           will include uniform coverage of time constants arising from the uniform
           distribution of the poles, which also decorrelates as much as possible the
           basis functionals. This principle was chosen by analogy to the identiﬁcation
           oflinearsystemsusingKautzﬁlters(Kautz,1954),whichshowsthatthebest
           approximation of a given transfer function by a linear system with ﬁnite
           order is achieved when poles are placed in the neighborhood of the spectral
           resonances. When no information is available about the desired response,
           we should uniformly spread the poles to anticipate good approximation to
           arbitrary mappings.
             We again use a maximum entropy principle to distribute the poles inside
           the unit circle uniformly. The constraints of a circle as boundary conditions
           for discrete linear systems and complex conjugate locations are easy to
           include for the pole distribution (Thogula, 2003). The poles are ﬁrst initial-
           ized at random locations; the quadratic Renyi’s entropy is calculated by
           equation 3.1, and poles are moved such that the entropy of the new dis-
           tribution is increased over iterations (Erdogmus & Principe, 2002). This
           method is efﬁcient to ﬁnd uniform coverage of the unit circle with an arbi-
           trary number of poles. The system with the uniform pole locations can be
           interpreted using linear system theory. The poles that are close to the unit
           circle correspond to many sharp bandpass ﬁlters specializing in different
           frequency regions, whereas the inner poles realize ﬁlters of larger frequency
           support. Moreover, different orientations (angles) of the poles create ﬁlters
           of different center frequencies.
             Now the problem is to construct an internal weight matrix from the pole
           locations (eigenvalues ofW). In principle, we would like to create a sparse


           Figure 4: Examples of echo states and instantaneous state entropy. (A) Outputs
           ofechostates(100PEs)producedbyESNswithspectralradiusof0.2,0.5,and0.8,
           from top to bottom, respectively. The diversity of echo states increases when the
           spectral radius increases. Within the dynamic range of the echo states, systems
           with smaller spectral radius can generate only uneven representations, while
           forW=0.8, outputs of echo states almost uniformly distribute within their
           dynamic range. (B) Instantaneous state entropy is calculated using equation 3.1.
           Information contained in the echo states is changing over time according to the
           input amplitude. Therefore, the richness of representation is controlled by the
           input amplitude. Moreover, the value of ASE increases with spectral radius.
           (C) ASEs from 50 different realizations of ESNs with the same spectral radius
           of 0.5. The plot shows that ASE is a ﬁner descriptor of the dynamics of the
           reservoir than the spectral radius.           Analysis and Design of Echo State Networks 125


                                    (A) Echo States1
                          0
                          - 10 20 40 60 801001201401601802001
                          0
                          - 10 20 40 60 801001201401601802001
                          0
                          - 10 20 40 60 80100120140160180200Time
                                   (B) State Entropy1.5             Spectral Radius = 0.2
                           1             Spectral Radius = 0.5 Spectral Radius = 0.8
                          0.5
                           0
                          - 0.5
                           - 1
                          - 1.5
                           - 2
                          - 2.50     50    100    150    200Time
                            (C) Different ASEs for the same spectral radius0.3

                         0.25

                          0.2

                        ASE0.15

                          0.1

                         0.050    10   20   30   40   50
                                       Trials           126 M. Ozturk, D. Xu, and J. Pr´ıncipe


           matrix, so we started with the sparsest matrix (with an inverse), which is
           the direct canonical structure given by (Kailath, 1980)

                                      −a1 −a2 ···−aN−1 −aN
                    10···  00                    W= 01···  00                     .                       (4.1)
                   ··· ··· ··· ··· ···
                      00···  10

           The characteristic polynomial ofWis

               l(s)=det(sI−W)=sN +a N−11 s   +a2 sN−2 +aN
                  =(s−p1 )(s−p2 )···(s−pN ),                        (4.2)

           wherepi ’s are the eigenvalues andai ’s are the coefﬁcients of the character-
           istic polynomial ofW. Here, we know the pole locations of the linear system
           obtained from the linearization of the ESN, so using equation 4.2, we can
           obtain the characteristic polynomial and constructWmatrix in the canon-
           ical form using equation 4.1. We will call the ESN constructed based on
           the uniform pole principle ASE-ESN. All other possible solutions with the
           same eigenvalues can be obtained byQ−1 WQ,whereQis any nonsingular
           matrix.
             To corroborate our hypothesis, we would like to show that the linearized
           ESN designed with the recurrent weight matrix having the eigenvalues
           uniformly distributed inside the unit circle creates higher ASE values for a
           given spectral radius compared to other ESNs with random internal con-
           nection weight matrices. We will consider an ESN with 30 states and use our
           procedure to create theWmatrix for ASE-ESN for different spectral radii
           between [0.1, 0.95]. Similarly, we constructed ESNs with sparse randomW
           matrices with different sparseness constraints. This corresponds to a weight
           distribution having the values 0,cand−cwith probabilitiesp1 ,(1−p1 )/2,
           and (1−p1 )/2, wherep1 deﬁnes the sparseness ofWandcis a constant
           that takes a speciﬁc value depending on the spectral radius. We also created
           Wmatrices with values uniformly distributed between−1 and 1 (U-ESN)
           and scaled to obtain a given spectral radius (Jaeger & Hass, 2004). Then,
           for differentWin matrices, we run the ASE-ESNs with the sinusoidal input
           given in section 3 and calculate ASE. Figure 5 compares the ASE values
           averaged over 1000 realizations. As observed from the ﬁgure, the ASE-ESN
           with uniform pole distribution generates higher ASE on average for all
           spectral radii compared to ESNs with sparse and uniform random connec-
           tions. This approach is indeed conceptually similar to Jeffreys’ maximum
           entropy prior (Jeffreys, 1946): it will provide a consistently good response
           for the largest class of problems. Concentrating the poles of the linearized           Analysis and Design of Echo State Networks 127


                      1
                            ASEESN
                     0.8     UESN
                            sparseness=0.2
                     0.6     sparseness=0.1
                            sparseness=0.07
                     0.4

                    ASE 0.2

                      0

                    - 0.2

                    - 0.40      0.2     0.4     0.6     0.8      1
                                    Spectral radius

           Figure 5: Comparison of ASE values obtained for ASE-ESN havingWwith
           uniform eigenvalue distribution, ESNs with randomWmatrix, and U-ESN
           with uniformly distributed weights between−1 and 1. Randomly generated
           weights have sparseness of 0.07, 0.1, and 0.2. ASE values are calculated for the
           networks with spectral radius from 0.1 to 0.95. The ASE-ESN with uniform pole
           distribution generates a higher ASE on average for all spectral radii compared
           to ESNs with random connections.


           system in certain regions of the space provides good performance only if
           the desired response has energy in this part of the space, as is well known
           from the theory of Kautz ﬁlters (Kautz, 1954).

             4.2 Design of the Adaptive Bias.In conventional ESNs, only the out-
           put weights are trained, optimizing the projections of the desired response
           onto the basis functions (echo states). Since the dynamical reservoir is ﬁxed,
           the basis functions are only input dependent. However, since function ap-
           proximation is a problem in the joint space of the input and desired signals,
           a penalty in performance will be incurred. From the linearization analysis
           that shows the crucial importance of the operating point of the PE non-
           linearity in deﬁning the echo state dynamics, we propose to use a single
           external adaptive bias to adjust the effective spectral radius of an ESN. No-
           tice that according to linearization analysis, bias can reduce only spectral
           radius. The information for adaptation of bias is the MSE in training, which
           modulates the spectral radius of the system with the information derived
           from the approximation error. With this simple mechanism, some informa-
           tionfromtheinput-outputjointspaceisincorporatedinthedeﬁnitionofthe
           projection space of the ESN. The beauty of this method is that the spectral           128 M. Ozturk, D. Xu, and J. Pr´ıncipe


           radius can be adjusted by a single parameter that is external to the system
           without changing reservoir weights.
             The training of bias can be easily accomplished. Indeed, since the pa-
           rameter space is only one-dimensional, a simple line search method can be
           efﬁciently employed to optimize the bias. Among different line search al-
           gorithms, we will use a search that uses Fibonacci numbers in the selection
           of points to be evaluated (Wilde, 1964). The Fibonacci search method min-
           imizes the maximum number of evaluations needed to reduce the interval
           of uncertainty to within the prescribed length. In our problem, a bias value
           is picked according to Fibonacci search. For each value of bias, training
           data are applied to the ESN, and the echo states are calculated. Then the
           corresponding optimal output weights and the objective function (MSE)
           are evaluated to pick the next bias value.
             Alternatively, gradient-based methods can be utilized to optimize the
           bias, due to simplicity and low computational cost. System update equation
           with an external bias signal,b,isgivenby

               x(n+1)=f(Win u(n+1)+Win b+Wx(n)).

           The update equation forbis given by

                ∂O(n+1)           ∂x(n+1)=−e·Wout ×                               (4.3)∂b                 ∂b                    ∂x(n)=−e·Wout × f˙(net n+1 )· W×     +Win  .    (4.4)∂b

             Here,Ois the MSE deﬁned previously. This algorithm may suffer from
           similar problems observed in gradient-based methods in recurrent net-
           works training. However, we observed that the performance surface is
           rather simple. Moreover, since the search parameter is one-dimensional,
           the gradient vector can assume only one of the two directions. Hence, im-
           precision in the gradient estimation should affect the speed of convergence
           but normally not change the correct gradient direction.

           5 Experiments

           This section presents a variety of experiments in order to test the validity
           of the ESN design scheme proposed in the previous section.

             5.1 Short-TermMemoryCapacity.Thisexperimentcomparestheshort-
           term memory (STM) capacity of ESNs with the same spectral radius using
           the framework presented in Jaeger (2002a). Consider an ESN with a sin-
           gle input signal,u(n), optimally trained with the desired signalu(n−k),
           for a given delayk. Denoting the optimal output signalyk (n), thek-delay           Analysis and Design of Echo State Networks 129


           STM capacity of a network,MC k , is deﬁned as a squared correlation coef-
           ﬁcient betweenu(n−k)andyk (n) (Jaeger, 2002a). The STM capacity,MC,
           of the network is deﬁned as  ∞ MC k=1   k . STM capacity measures how accu-
           rately the delayed versions of the input signal are recovered with optimally
           trained output units. Jaeger (2002a) has shown that the memory capacity
           for recalling an independent and identically distributed (i.i.d.) input by an
           Nunit RNN with linear output units is bounded byN.
             We use ESNs with 20 PEs and a single input unit. ESNs are driven
           by an i.i.d. random input signal,u(n), that is uniformly distributed over
           [−0.5, 0.5]. The goal is to train the ESN to generate the delayed versions
           of the input,u(n−1),...,u(n−40). We used four different ESNs: R-ESN,
           U-ESN, ASE-ESN, and BASE-ESN. R-ESN is a randomly connected ESN
           used in Jaeger (2002a) where the entries ofWmatrix are set to 0, 0.47,
           −0.47 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a
           sparse connectivity of 20% and a spectral radius of 0.9. The entries ofWof
           U-ESN are uniformly distributed over [−1, 1] and scaled to obtain the spec-
           tral radius of 0.9. ASE-ESN also has a spectral radius of 0.9 and is designed
           with uniform poles. BASE-ESN has the same recurrent weight matrix as
           ASE-ESN and an adaptive bias at its input. In each ESN, the input weights
           are set to 0.1 or−0.1 with equal probability, and direct connections from the
           input to the output are allowed, whereasWback is set to0(Jaeger, 2002a).
           The echo states are calculated using equation 2.1 for 200 samples of the
           input signal, and the ﬁrst 100 samples corresponding to initial transient
           are eliminated. Then the output weight matrix is calculated using equation
           2.4. For the BASE-ESN, the bias is trained for each task. All networks are
           run with a test input signal, and the corresponding output andMC k are
           calculated. Figure 6 shows thek-delay STM capacity (averaged over 100
           trials) of each ESN for delays 1,...,40 for the test signal. The STM capac-
           ities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70,
           and 16.90, respectively. First, ESNs with uniform pole distribution (ASE-
           ESN and BASE-ESN) haveMCs that are much longer than the randomly
           generated ESN given in Jaeger (2002a) in spite of all having the same spec-
           tral radius. In fact, the STM capacity of ASE-ESN is close to the theoretical
           maximumvalueofN=20.AcloserlookattheﬁgureshowsthatR-ESNper-
           forms slightly better than ASE-ESN for delays less than 9. In fact, for small
           k, large ASE degrades the performance because the tasks do not need long
           memory depth. However, the drawback of high ASE for smallkis recov-
           ered in BASE-ESN, which reduces the ASE to the appropriate level required
           for the task. Overall, the addition of the bias to the ASE-ESN increases the
           STM capacity from 16.70 to 16.90. On the other hand, U-ESN has slightly
           better STM compared to R-ESN with only three different weight values,
           although it has more distinct weight values compared to R-ESN. It is also
           signiﬁcant to note that theMCwill be very poor for an ESN with smaller
           spectral radius even with an adaptive bias, since the problem requires large
           ASE and bias can only reduce ASE. This experiment demonstrates the           130 M. Ozturk, D. Xu, and J. Pr´ıncipe


                      1                           RESN
                                                   UESN
                                                   ASEESN0.8                           BASEESN


                    Memory Capacity 0.6


                     0.4


                     0.2


                      0
                       0       10       20       30       40
                                       Delay

           Figure 6: Thek-delay STM capacity of each ESN for delays 1,...,40 computed
           using the test signal. The results are averaged over 100 different realizations of
           each ESN type with the speciﬁcations given in the text for differentWandWin
           matrices. The STM capacities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are
           13.09, 13.55, 16.70, and 16.90, respectively.


           suitability of maximizing ASE in tasks that require a substantial memory
           length.

             5.2 Binary Parity Check.The effect of the adaptive bias was marginal
           in the previous experiment since the nature of the problem required large
           ASE values. However, there are tasks in which the optimal solutions re-
           quire smaller ASE values and smaller spectral radius. Those are the tasks
           where the adaptive bias becomes a crucial design parameter in our design
           methodology.
             Consider an ESN with 100 internal units and a single input unit. ESN is
           drivenbyabinaryinputsignal,u(n),thatassumesthevalues0or1.Thegoal
           is to train an ESN to generate them-bit parity corresponding to lastmbits
           received, wheremis 3,...,8. Similar to the previous experiments, we used
           the R-ESN, ASE-ESN, and BASE-ESN topologies. R-ESN is a randomly
           connected ESN where the entries ofWmatrix are set to 0, 0.06,−0.06
           with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse
           connectivity of 20% and a spectral radius of 0.3. ASE-ESN and BASE-ESN
           are designed with a spectral radius of 0.9. The input weights are set to 1 or -1
           with equal probability, and direct connections from the input to the output
           are allowed whereasWback is set to 0. The echo states are calculated using
           equation 2.1 for 1000 samples of the input signal, and the ﬁrst 100 samples
           correspondingtotheinitialtransientareeliminated.Thentheoutputweight           Analysis and Design of Echo State Networks 131


                    350

                    300

                    250


                   Wrong Decisions 200

                    150

                    100
                                                    ASEESN50                             RESN
                                                    BASEESN0
                          3     4     5     6     7     8
                                         m

           Figure 7: The number of wrong decisions made by each ESN form=3,...,8
           in the binary parity check problem. The results are averaged over 100 differ-
           ent realizations of R-ESN, ASE-ESN, and BASE-ESN for differentWandWin
           matrices with the speciﬁcations given in the text. The total numbers of wrong
           decisions form=3,...,8 of R-ESN, ASE-ESN, and BASE-ESN are 722, 862, and
           699.


           matrix is calculated using equation 2.4. For ESN with adaptive bias, the bias
           is trained for each task. The binary decision is made by a threshold detector
           that compares the output of the ESN to 0.5. Figure 7 shows the number of
           wrong decisions (averaged over 100 different realizations) made by each
           ESN form=3,...,8.
             The total numbers of wrong decisions form=3,...,8 of R-ESN, ASE-
           ESN, and BASE-ESN are 722, 862, and 699, respectively. ASE-ESN performs
           poorly since the nature of the problem requires a short time constant for
           fast response, but ASE-ESN has a large spectral radius. For 5-bit parity, the
           R-ESN has no wrong decisions, whereas ASE-ESN has 53 wrong decisions.
           BASE-ESN performs a lot better than ASE-ESN and slightly better than
           the R-ESN since the adaptive bias reduces the spectral radius effectively.
           Note that form=7 and 8, the ASE-ESN performs similar to the R-ESN,
           since the task requires access to longer input history, which compromises
           the need for fast response. Indeed, the bias in the BASE-ESN takes effect
           when there are errors (m>4) and when the task beneﬁts from smaller
           spectral radius. The optimal bias values are approximately 3.2, 2.8, 2.6, and
           2.7 form=3, 4, 5, and 6, respectively. Form=7 or 8, there is a wide
           range of bias values that result in similar MSE values (between 0 and 3). In           132 M. Ozturk, D. Xu, and J. Pr´ıncipe


           summary, this experiment clearly demonstrates the power of the bias signal
           to conﬁgure the ESN reservoir according to the mapping task.

             5.3 System Identiﬁcation.This section presents a function approxima-
           tion task where the aim is to identify a nonlinear dynamical system. The
           unknown system is deﬁned by the difference equation

               y(n+1)=0.3y(n)+0.6y(n−1)+f(u(n)),

           where

                f(u)=0.6sin(πu)+0.3sin(3πu)+0.1sin(5πu).

           The input to the system is chosen to be sin(2πn/25).
             We used three different ESNs—R-ESN, ASE-ESN, and BASE-ESN—with
           30 internal units and a single input unit. TheWmatrix of each ESN is scaled
           suchthatithasaspectralradiusof0.95.R-ESNisarandomlyconnectedESN
           where the entries ofWmatrix are set to 0, 0.35,−0.35 with probabilities 0.8,
           0.1, 0.1, respectively. In each ESN, the input weights are set to 1 or−1 with
           equal probability, and direct connections from the input to the output are
           allowed,whereasWback issetto0.Theoptimaloutputweightsarecalculated
           using equation 2.4. The MSE values (averaged over 100 realizations) for R-
           ESN and ASE-ESN are 1.23x10 −5 and 1.83x10 −6 , respectively. The addition
           of the adaptive bias to the ASE-ESN reduces the MSE value from 1.83x10 −6
           to 3.27x10 −9 .

           6 Discussion

           The great appeal of echo state networks (ESNs) and liquid state machine
           (LSM) is their ability to construct arbitrary mappings of signals with rich
           and time-varying temporal structures without requiring adaptation of the
           free parameters of the recurrent layer. The echo state condition allows the
           recurrent connections to be ﬁxed with training limited to the linear output
           layer. However, the literature did not elucidate on how to properly choose
           the recurrent parameters for system identiﬁcation applications. Here, we
           provide an alternate framework that interprets the echo states as a set
           of functional bases formed by ﬁxed nonlinear combinations of the input.
           The linear readout at the output stage simply computes the projection of
           the desired output space onto this representation space. We further in-
           troduce an information-theoretic criterion, ASE, to better understand and
           evaluate the capability of a given ESN to construct such a representation
           layer. The average entropy of the distribution of the echo states quantiﬁes
           thevolumespannedbythebases.Assuch,thisvolumeshouldbethelargest
           to achieve the smallest correlation among the bases and be able to cope with           Analysis and Design of Echo State Networks 133


           arbitrary mappings. However, not all function approximation problems re-
           quire the same memory depth, which is coupled to the spectral radius. The
           effective spectral radius of an ESN can be optimized for the given problem
           with the help of an external bias signal that is adapted using the joint input-
           output space information. The interesting property of this method when
           applied to ESN built from sigmoidal nonlinearities is that it allows the ﬁne
           tuning of the system dynamics for a given problem with a single external
           adaptive bias input and without changing internal system parameters. In
           our opinion, the combination of the largest possible ASE and the adapta-
           tion of the spectral radius by the bias produces the most parsimonious pole
           location of the linearized ESN when no knowledge about the mapping is
           available to optimally locate the bass functionals. Moreover, the bias can be
           easily trained with either a line search method or a gradient-based method
           since it is one-dimensional. We have illustrated experimentally that the de-
           sign of the ESN using the maximization of ASE with the adaptation of the
           spectral radius by the bias has provided consistently better performance
           across tasks that require different memory depths. This means that these
           two parameters’ design methodology is preferred to the spectral radius
           criterion proposed by Jaeger, and it is still easily incorporated in the ESN
           design.
             Experiments demonstrate that the ASE for ESN with uniform linearized
           poles is maximized when the spectral radius of the recurrent weight matrix
           approaches one (instability). It is interesting to relate this observation with
           the computational properties found in dynamical systems “at the edge of
           chaos” (Packard, 1988; Langton, 1990; Mitchell, Hraber, & Crutchﬁeld, 1993;
           Bertschinger & Natschlager, 2004). Langton stated that when cellular au-¨
           tomata rules are evolved to perform a complex computation, evolution will
           tend to select rules with “critical” parameter values, which correlate with
           a phase transition between ordered and chaotic regimes. Recently, similar
           conclusions were suggested for LSMs (Bertschinger & Natschlager, 2004).¨
           Langton’s interpretation of edge of chaos was questioned by Mitchell et al.
           (1993). Here, we provide a system-theoretic view and explain the computa-
           tional behavior with the diversity of dynamics achieved with linearizations
           that have poles close to the unit circle. According to our results, the spectral
           radiusoftheoptimalESNinfunctionapproximationisproblemdependent,
           and in general it is impossible to forecast the computational performance
           as the system approaches instability (the spectral radius of the recurrent
           weight matrix approaches one). However, allowing the system to modu-
           late the spectral radius by either the output or internal biasing may allow
           a system close to instability to solve various problems requiring different
           spectral radii.
             Our emphasis here is mostly on ESNs without output feedback connec-
           tions. However, the proposed design methodology can also be applied to
           ESNs with output feedback. Both feedforward and feedback connections
           contribute to specify the bases to create the projection space. At the same           134 M. Ozturk, D. Xu, and J. Pr´ıncipe


           time, there are applications where the output feedback contributes to the
           system dynamics in a different fashion. For example, it has been shown that
           a ﬁxed weight (fully trained) RNN with output feedback can implement a
           family of functions (meta-learners) (Prokhorov, Feldkamp, & Tyukin, 1992).
           In meta-learning, the role of output feedback in the network is to bias the
           system to different regions of dynamics, providing multiple input-output
           mappings required (Santiago & Lendaris, 2004). However, results could not
           be replicated with ESNs (Prokhorov, 2005). We believe that more work has
           to be done on output feedback in the context of ESNs but also suspect that
           the echo state condition may be a restriction on the system dynamics for
           this type of problem.
             There are many interesting issues to be researched in this exciting new
           area. Besides an evaluation tool, ASE may also be utilized to train the ESN’s
           representation layer in an unsupervised fashion. In fact, we can easily adapt
           withtheSIG(stochasticinformationgradient)describedinErdogmus,Hild,
           and Principe (2003): extra weights linking the outputs of recurrent states to
           maximize output entropy. Output entropy maximization is a well-known
           metric to create independent components (Bell & Sejnowski, 1995), and
           here it means that the echo states will become as independent as possible.
           This would circumvent the linearization of the dynamical system to set the
           recurrent weights and would ﬁne-tune continuously in an unsupervised
           manner the parameters of the ESN among different inputs. However, it
           goes against the idea of a ﬁxed ESN reservoir.
             The reservoir of recurrent PEs can be thought of as a new form of a time-
           to-space mapping. Unlike the delay line that forms an embedding (Takens,
           1981), this mapping may have the advantage of ﬁltering noise and produce
           representations with better SNRs to the peaks of the input, which is very
           appealing for signal processing and seems to be used in biology. However,
           further theoretical work is necessary in order to understand the embedding
           capabilities of ESNs. One of the disadvantages of the ESN correlated basis
           is in the design of the readout. Gradient-based algorithms will be very
           slow to converge (due to the large eigenvalue spread of modes), and even
           if recursive methods are used, their stability may be compromised by the
           condition number of the matrix. However, our recent results incorporating
           anL1 norm penalty in the LMS (Rao et al., 2005) show great promise of
           solving this problem.
             Finally we would like to brieﬂy comment on the implications of these
           models to neurobiology and computational neuroscience. The work by
           Pouget and Sejnowski (1997) has shown that the available physiological
           data are consistent with the hypothesis that the response of a single neuron
           in the parietal cortex serves as a basis function generated by the sensory
           input in a nonlinear fashion. In other words, the neurons transform the
           sensory input into a format (representation space) such that the subsequent
           computation is simpliﬁed. Then, whenever a motor command (output of
           the biological system) needs to be generated, this simple computation to           Analysis and Design of Echo State Networks 135


           read out the neuronal activity is done. There is an intriguing similarity
           betweentheinterpretationoftheneuronalactivitybyPougetandSejnowski
           and our interpretation of echo states in ESN. We believe that similar ideas
           can be applied to improve the design of microcircuit implementations of
           LSMs. First, the framework of functional space interpretation (bases and
           projections) is also applicable to microcircuits. Second, the ASE measure
           may be directly utilized for LSM states because the states are normally low-
           pass-ﬁltered before the readout. However, the control of ASE by changing
           the liquid dynamics is unclear. Perhaps global control of thresholds or bias
           current will be able to accomplish bias control as in ESN with sigmoid
           PEs.


           Acknowledgments

           ThisworkwaspartiallysupportedbyNSFECS-0422718,NSFCNS-0540304,
           and ONR N00014-1-1-0405.


           References

           Amari, S.-I. (1990).Differential-geometrical methods in statistics.NewYork:Springer.
           Anderson, J., Silverstein, J., Ritz, S., & Jones, R. (1977). Distinctive features, categor-
             ical perception, and probability learning: Some applications of a neural model.
             Psychological Review, 84, 413–451.
           Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach
             to blind separation and blind deconvolution.Neural Computation, 7(6), 1129–
             1159.
           Bertschinger,N.,&Natschlager,T.(2004).Real-timecomputationattheedgeofchaos¨
             in recurrent neural networks.Neural Computation, 16(7), 1413–1436.
           Cox,R.T.(1946).Probability,frequency,andreasonableexpectation.AmericanJournal
             of Physics, 14(1), 1–13.
           de Vries, B. (1991).Temporal processing with neural networks—the development of the
             gamma model. Unpublished doctoral dissertation, University of Florida.
           Delgado, A., Kambhampati, C., & Warwick, K. (1995). Dynamic recurrent neural
             network for system identiﬁcation and control.IEEE Proceedings of Control Theory
             and Applications, 142(4), 307–314.
           Elman, J. L. (1990). Finding structure in time.Cognitive Science, 14(2), 179–211.
           Erdogmus, D., Hild, K. E., & Principe, J. (2003). Online entropy manipulation:
             Stochastic information gradient.Signal Processing Letters, 10(8), 242–245.
           Erdogmus, D., & Principe, J. (2002). Generalized information potential criterion for
             adaptive system training.IEEE Transactions on Neural Networks, 13(5), 1035–1044.
           Feldkamp,L.A.,Prokhorov,D.V.,Eagen,C.,&Yuan,F.(1998).Enhancedmultistream
             Kalman ﬁlter training for recurrent networks. In J. Suykens, & J. Vandewalle
             (Eds.),Nonlinear modeling: Advanced black-box techniques(pp. 29–53). Dordrecht,
             Netherlands: Kluwer.           136 M. Ozturk, D. Xu, and J. Pr´ıncipe


           Haykin,S.(1998).Neuralnetworks:Acomprehensivefoundation(2nded.).UpperSaddle
             River, NJ. Prentice Hall.
           Haykin, S. (2001).Adaptive ﬁlter theory(4th ed.). Upper Saddle River, NJ: Prentice
             Hall.
           Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory.Neural Computa-
             tion, 9(8), 1735–1780.
           Hopﬁeld, J. (1984). Neurons with graded response have collective computational
             properties like those of two-state neurons.Proceedings of the National Academy of
             Sciences, 81, 3088–3092.
           Ito, Y. (1996). Nonlinearity creates linear independence.Advances in Computer Math-
             ematics, 5(1), 189–203.
           Jaeger, H. (2001).The echo state approach to analyzing and training recurrent neural
             networks(Tech. Rep. No. 148). Bremen: German National Research Center for
             Information Technology.
           Jaeger, H. (2002a).Short term memory in echo state networks(Tech. Rep. No. 152).
             Bremen: German National Research Center for Information Technology.
           Jaeger, H. (2002b).Tutorial on training recurrent neural networks, covering BPPT, RTRL,
             EKF and the “echo state network” approach(Tech. Rep. No. 159). Bremen: German
             National Research Center for Information Technology.
           Jaeger, H., & Hass, H. (2004). Harnessing nonlinearity: Predicting chaotic systems
             and saving energy in wireless communication.Science, 304(5667), 78–80.
           Jeffreys,H.(1946).Aninvariantformforthepriorprobabilityinestimationproblems.
             Proceedings of the Royal Society of London, A 196, 453–461.
           Kailath, T. (1980).Linear systems. Upper Saddle River, NJ: Prentice Hall.
           Kautz, W. (1954). Transient synthesis in time domain.IRE Transactions on Circuit
             Theory, 1(3), 29–39.
           Kechriotis,G.,Zervas,E.,&Manolakos,E.S.(1994). Usingrecurrentneuralnetworks
             for adaptive communication channel equalization.IEEE Transactions on Neural
             Networks, 5(2), 267–278.
           Kremer,S.C.(1995).OnthecomputationalpowerofElman-stylerecurrentnetworks.
             IEEE Transactions on Neural Networks, 6(5), 1000–1004.
           Kuznetsov, Y., Kuznetsov, L., & Marsden, J. (1998).Elements of applied bifurcation
             theory(2nd ed.). New York: Springer-Verlag.
           Langton, C. G. (1990). Computation at the edge of chaos.Physica D, 42, 12–37.
           Maass, W., Legenstein, R. A., & Bertschinger, N. (2005). Methods for estimating the
             computational power and generalization capability of neural microcircuits. In
             L. K. Saul, Y. Weiss, L. Bottou (Eds.),Advances in neural information processing
             systems, no. 17 (pp. 865–872). Cambridge, MA: MIT Press.
           Maass, W., Natschlager, T., & Markram, H. (2002). Real-time computing without¨
             stable states: A new framework for neural computation based on perturbations.
             Neural Computation, 14(11), 2531–2560.
           Mitchell, M., Hraber, P., & Crutchﬁeld, J. (1993). Revisiting the edge of chaos:
             Evolving cellular automata to perform computations.Complex Systems, 7, 89–
             130.
           Packard, N. (1988). Adaptation towards the edge of chaos. In J. A. S. Kelso, A. J.
             Mandell, & M. F. Shlesinger (Eds.),Dynamic patterns in complex systems(pp. 293–
             301). Singapore: World Scientiﬁc.           Analysis and Design of Echo State Networks 137


           Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex
             using basis functions.Journal of Cognitive Neuroscience, 9(2), 222–237.
           Principe, J. (2001). Dynamic neural networks and optimal signal processing. In
             Y. Hu & J. Hwang (Eds.),Neural networks for signal processing(Vol. 6-1, pp. 6–
             28). Boca Raton, FL: CRC Press.
           Principe, J. C., de Vries, B., & de Oliviera, P. G. (1993). The gamma ﬁlter—a new
             class of adaptive IIR ﬁlters with restricted feedback.IEEE Transactions on Signal
             Processing, 41(2), 649–656.
           Principe, J., Xu, D., & Fisher, J. (2000). Information theoretic learning. In S. Haykin
             (Ed.),Unsupervised adaptive ﬁltering(pp. 265–319). Hoboken, NJ: Wiley.
           Prokhorov, D. (2005). Echo state networks: Appeal and challenges. InProc. of Inter-
             national Joint Conference on Neural Networks(pp. 1463–1466). Montreal, Canada.
           Prokhorov, D., Feldkamp, L., & Tyukin, I. (1992). Adaptive behavior with ﬁxed
             weights in recurrent neural networks: An overview. InProc. of International Joint
             Conference on Neural Networks(pp. 2018–2022). Honolulu, Hawaii.
           Puskorius,G.V.,&Feldkamp,L.A.(1994).Neurocontrolofnonlineardynamicalsys-
             tems with Kalman ﬁlter trained recurrent networks.IEEE Transactions on Neural
             Networks, 5(2), 279–297.
           Puskorius, G. V., & Feldkamp, L. A. (1996). Dynamic neural network methods ap-
             plied to on-vehicle idle speed control.Proceedings of IEEE, 84(10), 1407–1420.
           Rao, Y., Kim, S., Sanchez, J., Erdogmus, D., Principe, J. C., Carmena, J., Lebedev,
             M., & Nicolelis, M. (2005). Learning mappings in brain machine interfaces with
             echo state networks. In2005 IEEE International Conference on Acoustics, Speech, and
             Signal Processing. Philadelphia.
           Renyi, A. (1970).Probability theory. New York: Elsevier.
           Sanchez, J. C. (2004).From cortical neural spike trains to behavior: Modeling and analysis.
             Unpublished doctoral dissertation, University of Florida.
           Santiago, R. A., & Lendaris, G. G. (2004). Context discerning multifunction net-
             works: Reformulating ﬁxed weight neural networks. InProc. of International Joint
             Conference on Neural Networks(pp. 189–194). Budapest, Hungary.
           Shah, J. V., & Poon, C.-S. (1999). Linear independence of internal representations in
             multilayer perceptrons.IEEE Transactions on Neural Networks, 10(1), 10–18.
           Shannon,C.E.(1948).Amathematicaltheoryofcommunication.BellSystemTechnical
             Journal, 27, 623–656.
           Siegelmann, H. T. (1993).Foundations of recurrent neural networks. Unpublished doc-
             toral dissertation, Rutgers University.
           Siegelmann,H.T.,&Sontag,E.(1991).Turingcomputabilitywithneuralnets.Applied
             Mathematics Letters, 4(6), 77–80.
           Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended
             Kalman algorithm. In D. S. Touretzky (Ed.),Advances in neural information process-
             ing systems, 1(pp. 133–140). San Mateo, CA: Morgan Kaufmann.
           Takens, F. (1981). Detecting strange attractors in turbulence. In D. A. Rand & L.-S.
             Young (Eds.),Dynamical systems and turbulence(pp. 366–381). Berlin: Springer.
           Thogula, R. (2003).Information theoretic self-organization of multiple agents.Unpub-
             lished master’s thesis, University of Florida.
           Werbos, P. (1990). Backpropagation through time: What it does and how to do it.
             Proceedings of IEEE, 78(10), 1550–1560.           138 M. Ozturk, D. Xu, and J. Pr´ıncipe


           Werbos, P. (1992). Neurocontrol and supervised learning: An overview and evalua-
             tion. In D. White & D. Sofge (Eds.),Handbook of intelligent control(pp. 65–89). New
             York: Van Nostrand Reinhold.
           Wilde, D. J. (1964).Optimum seeking methods. Upper Saddle River, NJ: Prentice Hall.
           Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running
             fully recurrent neural networks.Neural Computation, 1, 270–280.


           Received December 28, 2004; accepted June 1, 2006.