LETTER Communicated by Herbert Jaeger Analysis and Design of Echo State Networks Mustafa C. Ozturk can@cnel.ufl.edu Dongming Xu dmxu@cnel.ufl.edu JoseC.Pr´ ´ıncipe principe@cnel.ufl.edu Computational NeuroEngineering Laboratory, Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, U.S.A. The design of echo state network (ESN) parameters relies on the selec- tion of the maximum eigenvalue of the linearized system around zero (spectral radius). However, this procedure does not quantify in a sys- tematic manner the performance of the ESN in terms of approximation error. This article presents a functional space approximation framework to better understand the operation of ESNs and proposes an information- theoretic metric, the average entropy of echo states, to assess the richness of the ESN dynamics. Furthermore, it provides an interpretation of the ESN dynamics rooted in system theory as families of coupled linearized systems whose poles move according to the input signal dynamics. With this interpretation, a design methodology for functional approximation is put forward where ESNs are designed with uniform pole distributions covering the frequency spectrum to abide by the richness metric, irre- spective of the spectral radius. A single bias parameter at the ESN input, adapted with the modeling error, configures the ESN spectral radius to the input-output joint space. Function approximation examples compare the proposed design methodology versus the conventional design. 1 Introduction Dynamic computational models require the ability to store and access the time history of their inputs and outputs. The most common dynamic neural architecture is the time-delay neural network (TDNN) that couples delay lines with a nonlinear static architecture where all the parameters (weights) are adapted with the backpropagation algorithm. The conventional delay line utilizes ideal delay operators, but delay lines with local first-order re- cursive filters have been proposed by Werbos (1992) and extensively stud- ied in the gamma model (de Vries, 1991; Principe, de Vries, & de Oliviera, 1993). Chains of first-order integrators are interesting because they effec- tively decrease the number of delays necessary to create time embeddings Neural Computation19, 111–138(2007) C 2006 Massachusetts Institute of Technology 112 M. Ozturk, D. Xu, and J. Pr´ıncipe (Principe, 2001). Recurrent neural networks (RNNs) implement a differ- ent type of embedding that is largely unexplored. RNNs are perhaps the most biologically plausible of the artificial neural network (ANN) models (Anderson, Silverstein, Ritz, & Jones, 1977; Hopfield, 1984; Elman, 1990), but are not well understood theoretically (Siegelmann & Sontag, 1991; Siegelmann, 1993; Kremer, 1995). One of the main practical problems with RNNs is the difficulty to adapt the system weights. Various algorithms, such as backpropagation through time (Werbos, 1990) and real-time recur- rent learning (Williams & Zipser, 1989), have been proposed to train RNNs; however, these algorithms suffer from computational complexity, resulting in slow training, complex performance surfaces, the possibility of instabil- ity, and the decay of gradients through the topology and time (Haykin, 1998). The problem of decaying gradients has been addressed with spe- cial processing elements (PEs) (Hochreiter & Schmidhuber, 1997). 
Alter- native second-order training methods based on extended Kalman filtering (Singhal & Wu, 1989; Puskorius & Feldkamp, 1994; Feldkamp, Prokhorov, Eagen, & Yuan, 1998) and the multistreaming training approach (Feldkamp et al., 1998) provide more reliable performance and have enabled practical applications in identification and control of dynamical systems (Kechri- otis, Zervas, & Monolakos, 1994; Puskorius & Feldkamp, 1994; Delgado, Kambhampati, & Warwick, 1995). Recently,twonewrecurrentnetworktopologieshavebeenproposed:the echo state network (ESN) by Jaeger (2001, 2002a; Jaeger & Hass, 2004) and the liquid state machine (LSM) by Maass (Maass, Natschlager, & Markram,¨ 2002). ESNs possess a highly interconnected and recurrent topology of nonlinear PEs that constitutes a “reservoir of rich dynamics” (Jaeger, 2001) and contain information about the history of input and output patterns. The outputs of these internal PEs (echo states) are fed to a memoryless but adaptive readout network (generally linear) that produces the network out- put. The interesting property of ESN is that only the memoryless readout is trained, whereas the recurrent topology has fixed connection weights. This reduces the complexity of RNN training to simple linear regression while preserving a recurrent topology, but obviously places important constraints in the overall architecture that have not yet been fully studied. Similar ideas have been explored independently by Maass and formalized in the LSM architecture. LSMs, although formulated quite generally, are mostly im- plemented as neural microcircuits of spiking neurons (Maass et al., 2002), whereas ESNs are dynamical ANN models. Both attempt to model biolog- ical information processing using similar principles. We focus on the ESN formulation in this letter. The echo state condition is defined in terms of the spectral radius (the largest among the absolute values of the eigenvalues of a matrix, denoted by·) of the reservoir’s weight matrix (W<1). This condition states that the dynamics of the ESN is uniquely controlled by the input, and the effect of the initial states vanishes. The current design of ESN parameters Analysis and Design of Echo State Networks 113 relies on the selection of spectral radius. However, there are many possible weight matrices with the same spectral radius, and unfortunately they do not all perform at the same level of mean square error (MSE) for functional approximation. A similar problem exists in the design of the LSM. LSMs have been shown to possess universal approximation given the separation property (SP) for the liquid (reservoir in ESNs) and the approximation property (AP) for the readout (Maass et al., 2002). SP is quantified by a kernel-quality measure proposed in Maass, Legenstein, and Bertschinger (2005) that is based on the rank of a matrix formed by the system states corresponding to different input signals. The kernel quality is a measure for the complexity and diversity of nonlinear operations carried out by the liquid on its input stream in order to boost the classification power of a subsequent linear decision hyperplane (Maass et al., 2005). A variation of SP has been proposed in Bertschinger and Natschlager (2004), and it has¨ been argued that complex calculations can be best carried out by networks on the boundary between ordered and chaotic dynamics. Inthisletter,weareinterestedinstudyingtheESNforfunctionalapprox- imation (filters that map input functionsu(·) of time on output functionsy(·) of time). 
We see two major shortcomings with the current ESN approach that uses echo state condition as a design principle. First, the impact of fixed reservoir parameters for function approximation means that the informa- tion about the desired response is conveyed only to the output projection. This is not optimal, and strategies to select different reservoirs for different applications have not been devised. Second, imposing a constraint only on the spectral radius is a weak condition to properly set the parameters of the reservoir, as experiments show (different randomizations with the same spectral radius perform differently for the same problem; see Figure 2). This letter aims to address these two problems by proposing a frame- work, a metric, and a design principle for ESNs. The framework is a signal processing interpretation of basis and projections in functional spaces to describe and understand the ESN architecture. According to this interpre- tation, the ESN states implement a set of basis functionals (representation space) constructed dynamically by the input, while the readout simply projects the desired response onto this representation space. The metric to describe the richness of the ESN dynamics is an information-theoretic quantity, the average state entropy (ASE). Entropy measures the amount of information contained in a given random variable (Shannon, 1948). Here, the random variable is the instantaneous echo state from which the en- tropy for the overall state (vector) is estimated. The probability density function (pdf) in a differential geometric framework should be thought of as a volume form; that is, in our case, the pdf of the state vector describes the metric of the state space manifold (Amari, 1990). Moreover, Cox (1946) established information as a coordinate free metric in the state manifold. Therefore, entropy becomes a global descriptor of information that quanti- fies the volume of the manifold defined by the random variable. Due to the 114 M. Ozturk, D. Xu, and J. Pr´ıncipe time dependency of the states, the state entropy averaged over time (ASE) is an appropriate estimate of the volume of the state manifold. The design principle specifies that one should consider independently thecorrelationamongthebasisandthespectralradius.Intheabsenceofany information about the desired response, the ESN states should be designed with the highest ASE, independent of the spectral radius. We interpret the ESN dynamics as a combination of time-varying linear systems obtained from the linearization of the ESN nonlinear PE in a small, local neighbor- hood of the current state. The design principle means that the poles of the linearized ESN reservoir should have uniform pole distributions to gener- ate echo states with the most diverse pole locations (which correspond to the uniformity of time constants). Effectively, this will create the least cor- related bases for a given spectral radius, which corresponds to the largest volume spanned by the basis set. When the designer has no other informa- tion about the desired response to set the basis, this principle distributes the system’s degrees of freedom uniformly in space. It approximates for ESNs the well-known property of orthogonal basis. The unresolved issue that ASE does not quantify is how to set the spectral radius, which depends again on the desired mapping. The concept of memory depth as explained in Principe et al. (1993) and Jaeger (2002a) is helpful in understanding the issues associated with the spectral radius. 
The correlation time of the de- siredresponse(asestimatedbythefirstzerooftheautocorrelationfunction) gives an indication of the type of spectral radius required (long correlation time requires high spectral radius). Alternatively, a simple adaptive bias is added at the ESN input to control the spectral radius integrating the infor- mation from the input-output joint space in the ESN bases. For sigmoidal PEs, the bias adjusts the operating points of the reservoir PEs, which has the net effect of adjusting the volume of the state manifold as required to approximate the desired response with a small error. This letter shows that ESNs designed with this strategy obtain systematically better results in a set of experiments when compared with the conventional ESN design. 2 Analysis of Echo State Networks 2.1 Echo States as Bases and Projections.Let us consider the ar- chitecture and recursive update equation of a typical ESN more closely. Consider the recurrent discrete-time neural network given in Figure 1 withMinput units,Ninternal PEs, andLoutput units. The value of the input unit at timenisu(n)=[u1 (n),u2 (n),...,uM (n)] T , of internal units arex(n)=[x1 (n),x2 (n),...,xN (n)] T , and of output units arey(n)= [y1 (n),y2 (n),...,yL (n)] T . The connection weights are given in anN×M weight matrixWin =(win ) for connections between the input and the inter- ij nalPEs,inanN×NmatrixW=(wij ) for connections between the internal PEs, in anL×NmatrixWout =(wout ) for connections from PEs to the ij Analysis and Design of Echo State Networks 115 Input Layer Dynamical Reservoir Read-out Win WW out x(n) u(n) . + . y(n) Wback Figure 1: An echo state network (ESN). ESN is composed of two parts: a fixed- weight (W<1) recurrent network and a linear readout. The recurrent net- work is a reservoir of highly interconnected dynamical components, states of which are called echo states. The memoryless linear readout is trained to pro- duce the output. output units, and in anN×LmatrixWback =(wback ) for the connections ij that project back from the output to the internal PEs (Jaeger, 2001). The activation of the internal PEs (echo state) is updated according to x(n+1)=f(Win u(n+1)+Wx(n)+Wback y(n)), (2.1) wheref=(f1 ,f2 ,...,fN )aretheinternalPEs’activationfunctions.Here,all f e−x i ’s are hyperbolic tangent functions ( ex − ). The output from the readout ex +e−x network is computed according to y(n+1)=fout (Wout x(n+1)), (2.2) wherefout =(fout ,fout ,...,fout ) are the output unit’s nonlinear functions 1 2 L (Jaeger, 2001, 2002a). Generally, the readout is linear sofout is identity. ESNs resemble the RNN architecture proposed in Puskorius and Feldkamp (1996) and also used by Sanchez (2004) in brain-machine 116 M. Ozturk, D. Xu, and J. Pr´ıncipe interfaces. The critical difference is the dimensionality of the hidden re- current PE layer and the adaptation of the recurrent weights. We submit that the ideas of approximation theory in functional spaces (bases and pro- jections), so useful in adaptive signal processing (Principe, 2001), should be utilized to understand the ESN architecture. Leth(u(t)) be a real-valued function of a real-valued vector u(t)=[u1 (t),u2 (t),...,uM (t)] T . In functional approximation, the goal is to estimate the behavior ofh(u(t)) as a combination of simpler functionsϕi (t), called the basis functionals, such that its approximant,hˆ(u(t)), is given by N hˆ(u(t))= ai ϕi (t). i=1 Here,ai ’s are the projections ofh(u(t)) onto each basis function. 
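To make the state recursion of equation 2.1 and the linear readout of equation 2.2 concrete, the following minimal sketch runs an ESN forward and exposes the echo states that act as the basis functionals ϕ_i. It is an illustration only; the function and variable names (`esn_states`, `W_in`, and so on) are ours rather than the paper's.

```python
import numpy as np

def esn_states(W_in, W, u, W_back=None, y=None, x0=None):
    """Run the echo state update of equation 2.1 with tanh PEs.

    u: (T, M) input sequence; returns x: (T, N) echo states.
    Output feedback (W_back, y) is optional, matching the experiments
    in the paper where W_back is set to zero.
    """
    T, _ = u.shape
    N = W.shape[0]
    x = np.zeros((T, N))
    x_prev = np.zeros(N) if x0 is None else x0
    for n in range(T):
        net = W_in @ u[n] + W @ x_prev
        if W_back is not None and y is not None and n > 0:
            net += W_back @ y[n - 1]
        x_prev = np.tanh(net)          # equation 2.1 with f = tanh
        x[n] = x_prev
    return x

# Linear readout (equation 2.2 with f_out = identity): y(n) = W_out @ x(n).
# The echo states play the role of the basis functionals phi_i, and W_out
# holds the projection coefficients a_i.
```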
One of the central questions in practical functional approximation is how to choose the set of bases to approximate a given desired signal. In signal processing, thechoicenormallygoesforacompletesetoforthogonalbasis,independent of the input. When the basis set is complete and can be made as large as required, fixed bases work wonders (e.g., Fourier decompositions). In neural computing, the basic idea is to derive the set of bases from the input signal through a multilayered architecture. For instance, consider a single hidden layer TDNN withNPEs and a linear output. The hidden- layer PE outputs can be considered a set of nonorthogonal basis functionals dependent on the input,   ϕi (u(t))=g bij uj (t). j bij ’s are the input layer weights, andgis the PE nonlinearity. The approxi- mation produced by the TDNN is then N h ˆ(u(t))= ai ϕi (u(t)), (2.3) i=1 whereai ’s are the weights of the output layer. Notice that thebij ’s adapt the bases and theai ’s adapt the projection in the projection space. Here the goal is to restrict the number of bases (number of hidden layer PEs) because their number is coupled with the number of parameters to adapt, which has an impact on generalization and training set size, for example. Usually, Analysis and Design of Echo State Networks 117 since all of the parameters of the network are adapted, the best basis in the joint (input and desired signals) space as well as the best projection can be achieved and represents the optimal solution. The output of the TDNN is a linear combination of its internal representations, but to achieve a basis set (even if nonorthogonal), linear independence among theϕi (u(t))’s must be enforced. Ito, Shah and Pon, and others have shown that this is indeed the case (Ito, 1996; Shah & Poon, 1999), but a thorough discussion is outside the scope of this article. The ESN (and the RNN) architecture can also be studied in this frame- work. The states of equation 2.1 correspond to the basis set, which are recursively computed from the input, output, and previous states through Win ,W,andWback . Notice, however, that none of these weight matrices is adapted, that is, the functional bases in the ESN are uniquely defined by the input and the initial selection of weights. In a sense, ESNs are trading the adaptive connections in the RNN hidden layer by a brute force approach of creating fixed diversified dynamics in the hidden layer. For an ESN with a linear readout network, the output equation (y(n+ 1)=Wout x(n+1)) has the same form of equation 2.3, where theϕi ’s and ai ’s are replaced by the echo states and the readout weights, respectively. The readout weights are adapted in the training data, which means that the ESN is able to find the optimal projection in the projection space, just like the RNN or the TDNN. A similar perspective of basis and projections for information processing in biological networks has been proposed by Pouget and Sejnowski (1997). They explored the possibility that the response of neurons in parietal cortex serves as basis functions for the transformations from the sensory input to the motor responses. They proposed that “the role of spatial represen- tations is to code the sensory inputs and posture signals in a format that simplifies subsequent computation, particularly in the generation of motor commands”. The central issue in ESN design is exactly the nonadaptive nature of the basis set. 
Parameter sets in the reservoir that provide linearly independent states and possess a given spectral radius may define drastically different projection spaces because the correlation among the bases is not constrained. A simple experiment was designed to demonstrate that selecting the ESN parameters by constraining only the spectral radius is not the most suitable strategy for function approximation. Consider a 100-unit ESN where the input signal is sin(2πn/10π). Mimicking Jaeger (2001), the goal is to let the ESN generate the seventh power of the input signal. Different realizations of a randomly connected 100-unit ESN were constructed where the entries of W are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025, and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input weights are set to +1 or −1 with equal probabilities, and W^back is set to zero. The input is applied for 300 time steps, and the echo states are calculated using equation 2.1.

[Figure 2: MSE (log scale) versus realization index for the 50 realizations.]

Figure 2: Performances of ESNs for different realizations of W with the same weight distribution. The weight values are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025, and 0.95. All realizations have the same spectral radius of 0.88. In the 50 realizations, MSEs vary from 5.9 × 10^−9 to 8.9 × 10^−5. Results show that for each set of random weights that provides the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.

The next step is to train the linear readout. One method to determine the optimal output weight matrix, W^out, in the mean square error (MSE) sense, where the MSE is defined by O = (1/2)(d − y)^T(d − y), is to use the Wiener solution given by Haykin (2001):

W^{out} = E[\mathbf{x}\mathbf{x}^T]^{-1} E[\mathbf{x}d] \cong \left[\frac{1}{N}\sum_n \mathbf{x}(n)\mathbf{x}(n)^T\right]^{-1} \left[\frac{1}{N}\sum_n \mathbf{x}(n)d(n)\right]. \qquad (2.4)

Here, E[·] denotes the expected value operator, and d denotes the desired signal. Figure 2 depicts the MSE values for 50 different realizations of the ESNs. As observed, even though each ESN has the same sparseness and spectral radius, the MSE values obtained vary greatly among different realizations. The minimum MSE value obtained among the 50 realizations is 5.9 × 10^−9, whereas the maximum MSE is 8.9 × 10^−5. This experiment demonstrates that a design strategy based solely on the spectral radius is not sufficient to specify the system architecture for function approximation: for each set of random weights that provides the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.

2.2 ESN Dynamics as a Combination of Linear Systems. It is well known that the dynamics of a nonlinear system can be approximated by those of a linear system in a small neighborhood of an equilibrium point (Kuznetsov, Kuznetsov, & Marsden, 1998). Here, we perform the analysis with hyperbolic tangent nonlinearities and approximate the ESN dynamics by the dynamics of the linearized system in the neighborhood of the current system state. Hence, when the system operating point varies over time, the linear system approximating the ESN dynamics changes. We are particularly interested in the movement of the poles of the linearized ESN. Consider the update equation for the ESN without output feedback given by

x(n + 1) = f(W^in u(n + 1) + W x(n)).
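The Figure 2 experiment is simple enough to sketch in a few lines. The code below is a rough reproduction under stated assumptions: per-sample mean-squared error instead of the paper's O = (1/2)(d − y)^T(d − y), a small ridge term added before inverting the correlation matrix, and no removal of an initial transient. It is meant only to show how the Wiener solution of equation 2.4 is applied to echo states and how widely the error spreads across realizations drawn from the same weight distribution; it is not guaranteed to match the reported numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 300
n = np.arange(T)
u = np.sin(2 * np.pi * n / (10 * np.pi))      # input as written in the text
d = u ** 7                                    # desired output: seventh power of the input

def run_reservoir(W, W_in, u):
    """Echo state recursion of equation 2.1 for a scalar input signal."""
    x = np.zeros((len(u), W.shape[0]))
    xp = np.zeros(W.shape[0])
    for t in range(len(u)):
        xp = np.tanh(W_in * u[t] + W @ xp)
        x[t] = xp
    return x

mses = []
for trial in range(50):
    # Entries of W: 0.4, -0.4, 0 with probabilities 0.025, 0.025, 0.95;
    # the paper reports that this distribution gives a spectral radius near 0.88.
    W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.025, 0.025, 0.95])
    W_in = rng.choice([1.0, -1.0], size=N)
    x = run_reservoir(W, W_in, u)
    # Wiener solution of equation 2.4 (small ridge term guards the inversion).
    R = x.T @ x / T + 1e-8 * np.eye(N)
    p = x.T @ d / T
    w_out = np.linalg.solve(R, p)
    y = x @ w_out
    mses.append(np.mean((d - y) ** 2))

print(f"min MSE {min(mses):.2e}, max MSE {max(mses):.2e}")  # spread across realizations
```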
Linearizing the system around the current state x(n), one obtains the Jacobian matrix, J(n + 1), defined by

J(n+1) = \begin{bmatrix} \dot{f}(net_1(n))w_{11} & \dot{f}(net_1(n))w_{12} & \cdots & \dot{f}(net_1(n))w_{1N} \\ \dot{f}(net_2(n))w_{21} & \dot{f}(net_2(n))w_{22} & \cdots & \dot{f}(net_2(n))w_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \dot{f}(net_N(n))w_{N1} & \dot{f}(net_N(n))w_{N2} & \cdots & \dot{f}(net_N(n))w_{NN} \end{bmatrix}
= \begin{bmatrix} \dot{f}(net_1(n)) & 0 & \cdots & 0 \\ 0 & \dot{f}(net_2(n)) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \dot{f}(net_N(n)) \end{bmatrix} \cdot W = F(n) \cdot W. \qquad (2.5)

Here, net_i(n) is the ith entry of the vector (W^in u(n + 1) + W x(n)), and w_ij denotes the (i, j)th entry of W. The poles of the linearized system at time n + 1 are given by the eigenvalues of the Jacobian matrix J(n + 1).^1 As the amplitude of each PE changes, the local slope changes, and so the poles of the linearized system are time varying, although the parameters of the ESN are fixed.

^1 The transfer function of a linear system x(n + 1) = Ax(n) + Bu(n) is X(z)/U(z) = (zI − A)^{-1} B = Adjoint(zI − A) B / det(zI − A). The poles of the transfer function can be obtained by solving det(zI − A) = 0. The solution corresponds to the eigenvalues of A.

In order to visualize the movement of the poles, consider an ESN with 100 states. The entries of the internal weight matrix are chosen to be 0, 0.4, and −0.4 with probabilities 0.9, 0.05, and 0.05. W is scaled such that a spectral radius of 0.95 is obtained. Input weights are set to +1 or −1 with equal probabilities. A sinusoidal signal with a period of 100 is fed to the system, and the echo states are computed according to equation 2.1. Then the Jacobian matrix and its eigenvalues are calculated using equation 2.5. Figure 3 shows the pole tracks of the linearized ESN for different input values. A single ESN with fixed parameters implements a combination of many linear systems with varying pole locations, hence many different time constants that modulate the richness of the reservoir of dynamics as a function of input amplitude. Higher-amplitude portions of the signal tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreasing the spectral radius), which results in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. Compared to its linear counterpart, an ESN with the same number of states provides a detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems. Similar results can be obtained using signals of different shapes at the ESN input.

A key corollary of the above analysis is that the spectral radius of an ESN can be adjusted using a constant bias signal at the ESN input without changing the recurrent connection matrix, W. The application of a nonzero constant bias will move the operating point to regions of the sigmoid function closer to saturation and always decrease the spectral radius, due to the shape of the nonlinearity.^2 The relevance of bias in terms of overall system performance has also been discussed in Jaeger (2002b) and Bertschinger and Natschläger (2004), but here we approach it from a system theory perspective and explain its effect on reservoir dynamics.

^2 Assume W has nondegenerate eigenvalues and corresponding linearly independent eigenvectors. Then consider the eigendecomposition of W, where W = PDP^{-1}, P is the eigenvector matrix, and D is the diagonal matrix of eigenvalues (D_ii) of W. Since F(n) and D are diagonal, J(n + 1) = F(n)W = F(n)(PDP^{-1}) = P(F(n)D)P^{-1} is the eigendecomposition of J(n + 1). Here, each entry of F(n)D, \dot{f}(net_i(n))D_ii, is an eigenvalue of J. Therefore, |\dot{f}(net_i(n))D_ii| ≤ |D_ii| since \dot{f}(net_i) ≤ \dot{f}(0).
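Before moving to the entropy measure, the pole-tracking computation of equation 2.5 can be summarized in a short sketch: form F(n) from the tanh derivatives at the current operating point, multiply by W, and take eigenvalues. The setup below (reservoir size, seeding, scaling) is our own illustration and is not taken from the paper.

```python
import numpy as np

def linearized_poles(W, W_in, u_next, x_curr):
    """Poles of the ESN linearized at the current operating point.

    Implements equation 2.5: J(n+1) = F(n) W, where F(n) is diagonal with
    entries f'(net_i(n)) and f = tanh, so f'(net) = 1 - tanh(net)^2.
    Returns the eigenvalues of J(n+1), i.e., the instantaneous pole locations.
    """
    net = W_in * u_next + W @ x_curr
    F = np.diag(1.0 - np.tanh(net) ** 2)
    return np.linalg.eigvals(F @ W)

# Example: track the poles while a slow sinusoid drives a small reservoir.
rng = np.random.default_rng(1)
N = 20
W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.05, 0.05, 0.9])
W *= 0.95 / max(abs(np.linalg.eigvals(W)))   # scale to spectral radius 0.95
W_in = rng.choice([1.0, -1.0], size=N)

x = np.zeros(N)
for n in range(100):
    u_next = np.sin(2 * np.pi * (n + 1) / 100)
    poles = linearized_poles(W, W_in, u_next, x)
    x = np.tanh(W_in * u_next + W @ x)       # advance the state (equation 2.1)
    # max(abs(poles)) shrinks when |u| is large and approaches 0.95 near u = 0.
```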
3 Average State Entropy as a Measure of the Richness of ESN Reservoir

Previous research was aware of the influence of the diversity of the recurrent layer outputs on the overall performance of ESNs and LSMs. Several metrics to quantify the diversity have been proposed (Jaeger, 2001; Maass et al., 2005).

[Figure 3: pole tracks of the linearized ESN in the z-plane; panels A–F as described in the caption below.]

Figure 3: The pole tracks of the linearized ESN with 100 PEs when the input goes through a cycle. An ESN with fixed parameters implements a combination of linear systems with varying pole locations. (A) One cycle of a sinusoidal signal with a period of 100. (B–E) The positions of the poles of the linearized systems when the input values are at B, C, D, and E in panel A. (F) The cumulative pole locations show the movement of the poles as the input changes. Due to the varying pole locations, different time constants modulate the richness of the reservoir of dynamics as a function of input amplitude. Higher-amplitude signals tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreasing the spectral radius), which results in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. An ESN with more states results in a detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems when compared to their linear counterpart.

Here, our approach of bases and projections leads to a new metric. We propose the instantaneous state entropy to quantify the distribution of instantaneous amplitudes across the ESN states. Entropy of the instantaneous ESN states is appropriate to quantify performance in function approximation because the ESN output is a mere weighted combination of the instantaneous values of the ESN states. If the echo states' instantaneous amplitudes are concentrated on only a few values across the ESN state dynamic range, the ability to approximate an arbitrary desired response by weighting the states is limited (and wasteful due to redundancy between the different states), and performance will suffer. On the other hand, if the ESN states provide a diversity of instantaneous amplitudes, it is much easier to achieve the desired mapping. Hence, the instantaneous entropy of the states appears as a good measure to quantify the richness of dynamics with instantaneous mappers. Due to the time structure of signals, the average state entropy (ASE), defined as the state entropy averaged over time, will be the parameter used to quantify the diversity in the dynamical reservoir of the ESN. Moreover, entropy has been proposed as an appropriate measure of the volume of the signal manifold (Cox, 1946; Amari, 1990).
Here, ASE measures the volume of the echo state manifold spanned by the trajectories.

Renyi's quadratic entropy is employed here because it is a global measure of information. In addition, an efficient nonparametric estimator of Renyi's entropy, which avoids explicit pdf estimation, has been developed (Principe, Xu, & Fisher, 2000). Renyi's entropy with parameter γ for a random variable X with pdf f_X(x) is given by Renyi (1970):

H_\gamma(X) = \frac{1}{1-\gamma} \log E\big[f_X^{\gamma-1}(X)\big].

Renyi's quadratic entropy is obtained for γ = 2 (for γ → 1, Shannon's entropy is obtained). Given N samples {x_1, x_2, ..., x_N} drawn from the unknown pdf to be estimated, Parzen windowing approximates the underlying pdf by

\hat{f}_X(x) = \frac{1}{N} \sum_{i=1}^{N} K_\sigma(x - x_i),

where K_σ is the kernel function with kernel size σ. Then Renyi's quadratic entropy can be estimated by (Principe et al., 2000)

H_2(X) = -\log\left(\frac{1}{N^2} \sum_j \sum_i K_\sigma(x_j - x_i)\right). \qquad (3.1)

The instantaneous state entropy is estimated using equation 3.1, where the samples are the entries of the state vector x(n) = [x_1(n), x_2(n), ..., x_N(n)]^T of an ESN with N internal PEs. Results will be shown with a gaussian kernel with kernel size chosen to be 0.3 of the standard deviation of the entries of the state vector. We will show that ASE is a more sensitive parameter to quantify the approximation properties of ESNs by experimentally demonstrating that ESNs with different spectral radii, and even with the same spectral radius, display different ASEs.

Let us consider the same 100-unit ESN that we used in the previous section, built with three different spectral radii, 0.2, 0.5, and 0.8, with an input signal of sin(2πn/20). Figure 4A depicts the echo states over 200 time ticks. The instantaneous state entropy is also calculated at each time step using equation 3.1 and plotted in Figure 4B. First, note that the instantaneous state entropy changes over time with the distribution of the echo states, as we would expect, since the state entropy depends on the input signal, which also changes in this case. Second, as the spectral radius increases in the simulation, the diversity in the echo states increases. For the spectral radius of 0.2, the echo states' instantaneous amplitudes are concentrated on only a few values, which is wasteful due to redundancy between different states. In practice, to quantify the overall representation ability over time, we will use ASE, which takes the values −0.735, −0.007, and 0.335 for the spectral radii of 0.2, 0.5, and 0.8, respectively. Moreover, even for the same spectral radius, several ASEs are possible. Figure 4C shows ASEs from 50 different realizations of ESNs with the same spectral radius of 0.5, which means that ASE is a finer descriptor of the dynamics of the reservoir. Although we have presented an experiment with a sinusoidal signal, similar results are obtained for other inputs as long as the input dynamic range is properly selected.

Maximizing ASE means that the diversity of the states over time is the largest and should provide a basis set that is as uncorrelated as possible. This condition is unfortunately not a guarantee that the ESN so designed will perform the best, because the basis set in ESNs is created independent of the desired response, and the application may require a small spectral radius. However, we maintain that when the desired response is not accessible for the design of the ESN bases, or when the same reservoir is to be used for a number of problems, the default strategy should be to maximize the ASE of the state vector.
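The ASE computation described above reduces to a few lines. The sketch below estimates Renyi's quadratic entropy with the Parzen estimator of equation 3.1 (gaussian kernel, kernel size equal to 0.3 of the standard deviation of the state entries, as in the text) and averages it over time. The function names are ours, and no attempt is made to reproduce the numerical values reported for Figure 4.

```python
import numpy as np

def quadratic_renyi_entropy(samples, sigma):
    """Parzen-window estimator of Renyi's quadratic entropy (equation 3.1).

    H2 = -log( (1/N^2) * sum_{j,i} K_sigma(x_j - x_i) ) with a gaussian kernel.
    """
    x = np.asarray(samples, dtype=float)
    diff = x[:, None] - x[None, :]
    K = np.exp(-0.5 * (diff / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(K.mean())            # K.mean() = (1/N^2) * double sum

def average_state_entropy(states):
    """ASE: instantaneous state entropy averaged over time.

    states: (T, N) matrix of echo states; the kernel size follows the text,
    0.3 times the standard deviation of the state entries.
    """
    sigma = 0.3 * np.std(states)
    return np.mean([quadratic_renyi_entropy(states[t], sigma)
                    for t in range(len(states))])
```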
The following section addresses the design of ESNs with high ASE values and a simple mechanism to adjust the reservoir dynamics without changing the recurrent connection weights. 4 Designing Echo State Networks 4.1 Design of the Echo State Recurrent Connections.According to the interpretation of ESNs as coupled linear systems, the design of the internal 124 M. Ozturk, D. Xu, and J. Pr´ıncipe connection matrix,W, will be based on the distribution of the poles of the linearized system around zero state. Our proposal is to design the ESN such that the linearized system has uniform pole distribution inside the unit circle of thez-plane. With this design scenario, the system dynamics will include uniform coverage of time constants arising from the uniform distribution of the poles, which also decorrelates as much as possible the basis functionals. This principle was chosen by analogy to the identification oflinearsystemsusingKautzfilters(Kautz,1954),whichshowsthatthebest approximation of a given transfer function by a linear system with finite order is achieved when poles are placed in the neighborhood of the spectral resonances. When no information is available about the desired response, we should uniformly spread the poles to anticipate good approximation to arbitrary mappings. We again use a maximum entropy principle to distribute the poles inside the unit circle uniformly. The constraints of a circle as boundary conditions for discrete linear systems and complex conjugate locations are easy to include for the pole distribution (Thogula, 2003). The poles are first initial- ized at random locations; the quadratic Renyi’s entropy is calculated by equation 3.1, and poles are moved such that the entropy of the new dis- tribution is increased over iterations (Erdogmus & Principe, 2002). This method is efficient to find uniform coverage of the unit circle with an arbi- trary number of poles. The system with the uniform pole locations can be interpreted using linear system theory. The poles that are close to the unit circle correspond to many sharp bandpass filters specializing in different frequency regions, whereas the inner poles realize filters of larger frequency support. Moreover, different orientations (angles) of the poles create filters of different center frequencies. Now the problem is to construct an internal weight matrix from the pole locations (eigenvalues ofW). In principle, we would like to create a sparse Figure 4: Examples of echo states and instantaneous state entropy. (A) Outputs ofechostates(100PEs)producedbyESNswithspectralradiusof0.2,0.5,and0.8, from top to bottom, respectively. The diversity of echo states increases when the spectral radius increases. Within the dynamic range of the echo states, systems with smaller spectral radius can generate only uneven representations, while forW=0.8, outputs of echo states almost uniformly distribute within their dynamic range. (B) Instantaneous state entropy is calculated using equation 3.1. Information contained in the echo states is changing over time according to the input amplitude. Therefore, the richness of representation is controlled by the input amplitude. Moreover, the value of ASE increases with spectral radius. (C) ASEs from 50 different realizations of ESNs with the same spectral radius of 0.5. The plot shows that ASE is a finer descriptor of the dynamics of the reservoir than the spectral radius. 
[Figure 4: (A) echo states for spectral radii 0.2, 0.5, and 0.8; (B) instantaneous state entropy over time for the three spectral radii; (C) ASEs across 50 trials at spectral radius 0.5. See the caption above.]

matrix, so we started with the sparsest matrix (with an inverse), which is the direct canonical structure given by (Kailath, 1980)

W = \begin{bmatrix} -a_1 & -a_2 & \cdots & -a_{N-1} & -a_N \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}. \qquad (4.1)

The characteristic polynomial of W is

l(s) = \det(sI - W) = s^N + a_1 s^{N-1} + a_2 s^{N-2} + \cdots + a_N = (s - p_1)(s - p_2)\cdots(s - p_N), \qquad (4.2)

where the p_i's are the eigenvalues and the a_i's are the coefficients of the characteristic polynomial of W. Here, we know the pole locations of the linear system obtained from the linearization of the ESN, so using equation 4.2, we can obtain the characteristic polynomial and construct the W matrix in the canonical form of equation 4.1. We will call the ESN constructed based on the uniform pole principle ASE-ESN. All other possible solutions with the same eigenvalues can be obtained by Q^{-1}WQ, where Q is any nonsingular matrix.

To corroborate our hypothesis, we would like to show that the linearized ESN designed with a recurrent weight matrix whose eigenvalues are uniformly distributed inside the unit circle creates higher ASE values for a given spectral radius compared to other ESNs with random internal connection weight matrices. We will consider an ESN with 30 states and use our procedure to create the W matrix of the ASE-ESN for different spectral radii between [0.1, 0.95]. Similarly, we constructed ESNs with sparse random W matrices under different sparseness constraints. This corresponds to a weight distribution having the values 0, c, and −c with probabilities p_1, (1 − p_1)/2, and (1 − p_1)/2, where p_1 defines the sparseness of W and c is a constant that takes a specific value depending on the spectral radius. We also created W matrices with values uniformly distributed between −1 and 1 (U-ESN) and scaled to obtain a given spectral radius (Jaeger & Hass, 2004). Then, for different W^in matrices, we run the ASE-ESNs with the sinusoidal input given in section 3 and calculate the ASE. Figure 5 compares the ASE values averaged over 1000 realizations. As observed from the figure, the ASE-ESN with uniform pole distribution generates higher ASE on average for all spectral radii compared to ESNs with sparse and uniform random connections.

[Figure 5: ASE versus spectral radius for ASE-ESN, U-ESN, and random ESNs with sparseness 0.07, 0.1, and 0.2.]

Figure 5: Comparison of ASE values obtained for the ASE-ESN, whose W has a uniform eigenvalue distribution, ESNs with random W matrices, and the U-ESN with weights uniformly distributed between −1 and 1. Randomly generated weights have sparseness of 0.07, 0.1, and 0.2. ASE values are calculated for networks with spectral radii from 0.1 to 0.95. The ASE-ESN with uniform pole distribution generates a higher ASE on average for all spectral radii compared to ESNs with random connections.

This approach is indeed conceptually similar to Jeffreys' maximum entropy prior (Jeffreys, 1946): it will provide a consistently good response for the largest class of problems. Concentrating the poles of the linearized system in certain regions of the space provides good performance only if the desired response has energy in that part of the space, as is well known from the theory of Kautz filters (Kautz, 1954).
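The construction of the ASE-ESN recurrent matrix can be sketched as follows. The paper distributes the poles by iteratively maximizing Renyi's entropy inside the unit circle (Erdogmus & Principe, 2002); the sketch below substitutes a simpler area-uniform random placement in complex-conjugate pairs as a stand-in for that procedure, and then builds W in the direct canonical form of equations 4.1 and 4.2 via the characteristic polynomial.

```python
import numpy as np

def uniform_disk_poles(N, radius, rng):
    """Draw N poles roughly uniformly over the disk |z| <= radius.

    Stand-in for the paper's entropy-maximizing placement: area-uniform
    sampling in complex-conjugate pairs (plus one real pole if N is odd),
    which keeps the resulting W real-valued.
    """
    poles = []
    while len(poles) < N - (N % 2):
        r = radius * np.sqrt(rng.uniform())        # area-uniform radius
        theta = rng.uniform(0, np.pi)
        p = r * np.exp(1j * theta)
        poles.extend([p, np.conj(p)])
    if N % 2:
        poles.append(radius * (2 * rng.uniform() - 1))
    return np.array(poles[:N])

def companion_from_poles(poles):
    """Build W in the direct canonical form of equation 4.1 from its eigenvalues."""
    a = np.real(np.poly(poles))[1:]                # coefficients a_1 ... a_N of eq. 4.2
    N = len(a)
    W = np.zeros((N, N))
    W[0, :] = -a                                   # first row: -a_1 ... -a_N
    W[1:, :-1] = np.eye(N - 1)                     # shift structure below
    return W

rng = np.random.default_rng(2)
W = companion_from_poles(uniform_disk_poles(30, 0.9, rng))
# Spectral radius is at most (and typically close to) the requested 0.9.
print(max(abs(np.linalg.eigvals(W))))
```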
4.2 Design of the Adaptive Bias. In conventional ESNs, only the output weights are trained, optimizing the projections of the desired response onto the basis functions (echo states). Since the dynamical reservoir is fixed, the basis functions are only input dependent. However, since function approximation is a problem in the joint space of the input and desired signals, a penalty in performance will be incurred. From the linearization analysis, which shows the crucial importance of the operating point of the PE nonlinearity in defining the echo state dynamics, we propose to use a single external adaptive bias to adjust the effective spectral radius of an ESN. Notice that according to the linearization analysis, the bias can only reduce the spectral radius. The information for the adaptation of the bias is the MSE in training, which modulates the spectral radius of the system with information derived from the approximation error. With this simple mechanism, some information from the input-output joint space is incorporated into the definition of the projection space of the ESN. The beauty of this method is that the spectral radius can be adjusted by a single parameter that is external to the system, without changing the reservoir weights.

The training of the bias can be easily accomplished. Indeed, since the parameter space is only one-dimensional, a simple line search method can be efficiently employed to optimize the bias. Among different line search algorithms, we will use a search that uses Fibonacci numbers in the selection of points to be evaluated (Wilde, 1964). The Fibonacci search method minimizes the maximum number of evaluations needed to reduce the interval of uncertainty to within the prescribed length. In our problem, a bias value is picked according to the Fibonacci search. For each value of the bias, the training data are applied to the ESN, and the echo states are calculated. Then the corresponding optimal output weights and the objective function (MSE) are evaluated to pick the next bias value. Alternatively, gradient-based methods can be utilized to optimize the bias, due to their simplicity and low computational cost. The system update equation with an external bias signal, b, is given by

x(n + 1) = f(W^in u(n + 1) + W^in b + W x(n)).

The update equation for b is given by

\frac{\partial O(n+1)}{\partial b} = -e \cdot W^{out} \times \frac{\partial x(n+1)}{\partial b} \qquad (4.3)
= -e \cdot W^{out} \times \left[\dot{f}(net_{n+1}) \cdot \left(W \times \frac{\partial x(n)}{\partial b} + W^{in}\right)\right]. \qquad (4.4)

Here, O is the MSE defined previously. This algorithm may suffer from problems similar to those observed in gradient-based training of recurrent networks. However, we observed that the performance surface is rather simple. Moreover, since the search parameter is one-dimensional, the gradient vector can assume only one of two directions. Hence, imprecision in the gradient estimation should affect the speed of convergence but normally not change the correct gradient direction.

5 Experiments

This section presents a variety of experiments in order to test the validity of the ESN design scheme proposed in the previous section.

5.1 Short-Term Memory Capacity. This experiment compares the short-term memory (STM) capacity of ESNs with the same spectral radius using the framework presented in Jaeger (2002a). Consider an ESN with a single input signal, u(n), optimally trained with the desired signal u(n − k), for a given delay k.
Denoting the optimal output signalyk (n), thek-delay Analysis and Design of Echo State Networks 129 STM capacity of a network,MC k , is defined as a squared correlation coef- ficient betweenu(n−k)andyk (n) (Jaeger, 2002a). The STM capacity,MC, of the network is defined as ∞ MC k=1 k . STM capacity measures how accu- rately the delayed versions of the input signal are recovered with optimally trained output units. Jaeger (2002a) has shown that the memory capacity for recalling an independent and identically distributed (i.i.d.) input by an Nunit RNN with linear output units is bounded byN. We use ESNs with 20 PEs and a single input unit. ESNs are driven by an i.i.d. random input signal,u(n), that is uniformly distributed over [−0.5, 0.5]. The goal is to train the ESN to generate the delayed versions of the input,u(n−1),...,u(n−40). We used four different ESNs: R-ESN, U-ESN, ASE-ESN, and BASE-ESN. R-ESN is a randomly connected ESN used in Jaeger (2002a) where the entries ofWmatrix are set to 0, 0.47, −0.47 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral radius of 0.9. The entries ofWof U-ESN are uniformly distributed over [−1, 1] and scaled to obtain the spec- tral radius of 0.9. ASE-ESN also has a spectral radius of 0.9 and is designed with uniform poles. BASE-ESN has the same recurrent weight matrix as ASE-ESN and an adaptive bias at its input. In each ESN, the input weights are set to 0.1 or−0.1 with equal probability, and direct connections from the input to the output are allowed, whereasWback is set to0(Jaeger, 2002a). The echo states are calculated using equation 2.1 for 200 samples of the input signal, and the first 100 samples corresponding to initial transient are eliminated. Then the output weight matrix is calculated using equation 2.4. For the BASE-ESN, the bias is trained for each task. All networks are run with a test input signal, and the corresponding output andMC k are calculated. Figure 6 shows thek-delay STM capacity (averaged over 100 trials) of each ESN for delays 1,...,40 for the test signal. The STM capac- ities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70, and 16.90, respectively. First, ESNs with uniform pole distribution (ASE- ESN and BASE-ESN) haveMCs that are much longer than the randomly generated ESN given in Jaeger (2002a) in spite of all having the same spec- tral radius. In fact, the STM capacity of ASE-ESN is close to the theoretical maximumvalueofN=20.AcloserlookatthefigureshowsthatR-ESNper- forms slightly better than ASE-ESN for delays less than 9. In fact, for small k, large ASE degrades the performance because the tasks do not need long memory depth. However, the drawback of high ASE for smallkis recov- ered in BASE-ESN, which reduces the ASE to the appropriate level required for the task. Overall, the addition of the bias to the ASE-ESN increases the STM capacity from 16.70 to 16.90. On the other hand, U-ESN has slightly better STM compared to R-ESN with only three different weight values, although it has more distinct weight values compared to R-ESN. It is also significant to note that theMCwill be very poor for an ESN with smaller spectral radius even with an adaptive bias, since the problem requires large ASE and bias can only reduce ASE. This experiment demonstrates the 130 M. Ozturk, D. Xu, and J. 
Pr´ıncipe 1 RESN UESN ASEESN0.8 BASEESN Memory Capacity 0.6 0.4 0.2 0 0 10 20 30 40 Delay Figure 6: Thek-delay STM capacity of each ESN for delays 1,...,40 computed using the test signal. The results are averaged over 100 different realizations of each ESN type with the specifications given in the text for differentWandWin matrices. The STM capacities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70, and 16.90, respectively. suitability of maximizing ASE in tasks that require a substantial memory length. 5.2 Binary Parity Check.The effect of the adaptive bias was marginal in the previous experiment since the nature of the problem required large ASE values. However, there are tasks in which the optimal solutions re- quire smaller ASE values and smaller spectral radius. Those are the tasks where the adaptive bias becomes a crucial design parameter in our design methodology. Consider an ESN with 100 internal units and a single input unit. ESN is drivenbyabinaryinputsignal,u(n),thatassumesthevalues0or1.Thegoal is to train an ESN to generate them-bit parity corresponding to lastmbits received, wheremis 3,...,8. Similar to the previous experiments, we used the R-ESN, ASE-ESN, and BASE-ESN topologies. R-ESN is a randomly connected ESN where the entries ofWmatrix are set to 0, 0.06,−0.06 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral radius of 0.3. ASE-ESN and BASE-ESN are designed with a spectral radius of 0.9. The input weights are set to 1 or -1 with equal probability, and direct connections from the input to the output are allowed whereasWback is set to 0. The echo states are calculated using equation 2.1 for 1000 samples of the input signal, and the first 100 samples correspondingtotheinitialtransientareeliminated.Thentheoutputweight Analysis and Design of Echo State Networks 131 350 300 250 Wrong Decisions 200 150 100 ASEESN50 RESN BASEESN0 3 4 5 6 7 8 m Figure 7: The number of wrong decisions made by each ESN form=3,...,8 in the binary parity check problem. The results are averaged over 100 differ- ent realizations of R-ESN, ASE-ESN, and BASE-ESN for differentWandWin matrices with the specifications given in the text. The total numbers of wrong decisions form=3,...,8 of R-ESN, ASE-ESN, and BASE-ESN are 722, 862, and 699. matrix is calculated using equation 2.4. For ESN with adaptive bias, the bias is trained for each task. The binary decision is made by a threshold detector that compares the output of the ESN to 0.5. Figure 7 shows the number of wrong decisions (averaged over 100 different realizations) made by each ESN form=3,...,8. The total numbers of wrong decisions form=3,...,8 of R-ESN, ASE- ESN, and BASE-ESN are 722, 862, and 699, respectively. ASE-ESN performs poorly since the nature of the problem requires a short time constant for fast response, but ASE-ESN has a large spectral radius. For 5-bit parity, the R-ESN has no wrong decisions, whereas ASE-ESN has 53 wrong decisions. BASE-ESN performs a lot better than ASE-ESN and slightly better than the R-ESN since the adaptive bias reduces the spectral radius effectively. Note that form=7 and 8, the ASE-ESN performs similar to the R-ESN, since the task requires access to longer input history, which compromises the need for fast response. Indeed, the bias in the BASE-ESN takes effect when there are errors (m>4) and when the task benefits from smaller spectral radius. 
The optimal bias values are approximately 3.2, 2.8, 2.6, and 2.7 form=3, 4, 5, and 6, respectively. Form=7 or 8, there is a wide range of bias values that result in similar MSE values (between 0 and 3). In 132 M. Ozturk, D. Xu, and J. Pr´ıncipe summary, this experiment clearly demonstrates the power of the bias signal to configure the ESN reservoir according to the mapping task. 5.3 System Identification.This section presents a function approxima- tion task where the aim is to identify a nonlinear dynamical system. The unknown system is defined by the difference equation y(n+1)=0.3y(n)+0.6y(n−1)+f(u(n)), where f(u)=0.6sin(πu)+0.3sin(3πu)+0.1sin(5πu). The input to the system is chosen to be sin(2πn/25). We used three different ESNs—R-ESN, ASE-ESN, and BASE-ESN—with 30 internal units and a single input unit. TheWmatrix of each ESN is scaled suchthatithasaspectralradiusof0.95.R-ESNisarandomlyconnectedESN where the entries ofWmatrix are set to 0, 0.35,−0.35 with probabilities 0.8, 0.1, 0.1, respectively. In each ESN, the input weights are set to 1 or−1 with equal probability, and direct connections from the input to the output are allowed,whereasWback issetto0.Theoptimaloutputweightsarecalculated using equation 2.4. The MSE values (averaged over 100 realizations) for R- ESN and ASE-ESN are 1.23x10 −5 and 1.83x10 −6 , respectively. The addition of the adaptive bias to the ASE-ESN reduces the MSE value from 1.83x10 −6 to 3.27x10 −9 . 6 Discussion The great appeal of echo state networks (ESNs) and liquid state machine (LSM) is their ability to construct arbitrary mappings of signals with rich and time-varying temporal structures without requiring adaptation of the free parameters of the recurrent layer. The echo state condition allows the recurrent connections to be fixed with training limited to the linear output layer. However, the literature did not elucidate on how to properly choose the recurrent parameters for system identification applications. Here, we provide an alternate framework that interprets the echo states as a set of functional bases formed by fixed nonlinear combinations of the input. The linear readout at the output stage simply computes the projection of the desired output space onto this representation space. We further in- troduce an information-theoretic criterion, ASE, to better understand and evaluate the capability of a given ESN to construct such a representation layer. The average entropy of the distribution of the echo states quantifies thevolumespannedbythebases.Assuch,thisvolumeshouldbethelargest to achieve the smallest correlation among the bases and be able to cope with Analysis and Design of Echo State Networks 133 arbitrary mappings. However, not all function approximation problems re- quire the same memory depth, which is coupled to the spectral radius. The effective spectral radius of an ESN can be optimized for the given problem with the help of an external bias signal that is adapted using the joint input- output space information. The interesting property of this method when applied to ESN built from sigmoidal nonlinearities is that it allows the fine tuning of the system dynamics for a given problem with a single external adaptive bias input and without changing internal system parameters. 
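Since the bias is a single scalar, its optimization is a one-dimensional search over the training MSE, as discussed in section 4.2. The sketch below uses a golden-section search as an easy-to-verify stand-in for the Fibonacci search (Wilde, 1964) or the gradient update of equation 4.4 used in the paper; `objective(b)` is assumed to run the ESN with input bias b, solve for the readout, and return the training MSE.

```python
import numpy as np

def golden_section_search(objective, lo, hi, tol=1e-3):
    """One-dimensional line search over the bias value.

    Stand-in for the Fibonacci search used in the paper: shrink the interval
    [lo, hi] around the minimizer of objective(b). The objective is
    re-evaluated at both interior points each iteration (simple, not minimal
    in function evaluations).
    """
    phi = (np.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - phi * (b - a), a + phi * (b - a)
    while abs(b - a) > tol:
        if objective(c) < objective(d):
            b, d = d, c
            c = b - phi * (b - a)
        else:
            a, c = c, d
            d = a + phi * (b - a)
    return (a + b) / 2

# Toy check with a known minimum at b = 1.7 (stands in for the MSE-vs-bias curve):
print(golden_section_search(lambda b: (b - 1.7) ** 2, 0.0, 4.0))
```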
In our opinion, the combination of the largest possible ASE and the adapta- tion of the spectral radius by the bias produces the most parsimonious pole location of the linearized ESN when no knowledge about the mapping is available to optimally locate the bass functionals. Moreover, the bias can be easily trained with either a line search method or a gradient-based method since it is one-dimensional. We have illustrated experimentally that the de- sign of the ESN using the maximization of ASE with the adaptation of the spectral radius by the bias has provided consistently better performance across tasks that require different memory depths. This means that these two parameters’ design methodology is preferred to the spectral radius criterion proposed by Jaeger, and it is still easily incorporated in the ESN design. Experiments demonstrate that the ASE for ESN with uniform linearized poles is maximized when the spectral radius of the recurrent weight matrix approaches one (instability). It is interesting to relate this observation with the computational properties found in dynamical systems “at the edge of chaos” (Packard, 1988; Langton, 1990; Mitchell, Hraber, & Crutchfield, 1993; Bertschinger & Natschlager, 2004). Langton stated that when cellular au-¨ tomata rules are evolved to perform a complex computation, evolution will tend to select rules with “critical” parameter values, which correlate with a phase transition between ordered and chaotic regimes. Recently, similar conclusions were suggested for LSMs (Bertschinger & Natschlager, 2004).¨ Langton’s interpretation of edge of chaos was questioned by Mitchell et al. (1993). Here, we provide a system-theoretic view and explain the computa- tional behavior with the diversity of dynamics achieved with linearizations that have poles close to the unit circle. According to our results, the spectral radiusoftheoptimalESNinfunctionapproximationisproblemdependent, and in general it is impossible to forecast the computational performance as the system approaches instability (the spectral radius of the recurrent weight matrix approaches one). However, allowing the system to modu- late the spectral radius by either the output or internal biasing may allow a system close to instability to solve various problems requiring different spectral radii. Our emphasis here is mostly on ESNs without output feedback connec- tions. However, the proposed design methodology can also be applied to ESNs with output feedback. Both feedforward and feedback connections contribute to specify the bases to create the projection space. At the same 134 M. Ozturk, D. Xu, and J. Pr´ıncipe time, there are applications where the output feedback contributes to the system dynamics in a different fashion. For example, it has been shown that a fixed weight (fully trained) RNN with output feedback can implement a family of functions (meta-learners) (Prokhorov, Feldkamp, & Tyukin, 1992). In meta-learning, the role of output feedback in the network is to bias the system to different regions of dynamics, providing multiple input-output mappings required (Santiago & Lendaris, 2004). However, results could not be replicated with ESNs (Prokhorov, 2005). We believe that more work has to be done on output feedback in the context of ESNs but also suspect that the echo state condition may be a restriction on the system dynamics for this type of problem. There are many interesting issues to be researched in this exciting new area. 
Besides an evaluation tool, ASE may also be utilized to train the ESN’s representation layer in an unsupervised fashion. In fact, we can easily adapt withtheSIG(stochasticinformationgradient)describedinErdogmus,Hild, and Principe (2003): extra weights linking the outputs of recurrent states to maximize output entropy. Output entropy maximization is a well-known metric to create independent components (Bell & Sejnowski, 1995), and here it means that the echo states will become as independent as possible. This would circumvent the linearization of the dynamical system to set the recurrent weights and would fine-tune continuously in an unsupervised manner the parameters of the ESN among different inputs. However, it goes against the idea of a fixed ESN reservoir. The reservoir of recurrent PEs can be thought of as a new form of a time- to-space mapping. Unlike the delay line that forms an embedding (Takens, 1981), this mapping may have the advantage of filtering noise and produce representations with better SNRs to the peaks of the input, which is very appealing for signal processing and seems to be used in biology. However, further theoretical work is necessary in order to understand the embedding capabilities of ESNs. One of the disadvantages of the ESN correlated basis is in the design of the readout. Gradient-based algorithms will be very slow to converge (due to the large eigenvalue spread of modes), and even if recursive methods are used, their stability may be compromised by the condition number of the matrix. However, our recent results incorporating anL1 norm penalty in the LMS (Rao et al., 2005) show great promise of solving this problem. Finally we would like to briefly comment on the implications of these models to neurobiology and computational neuroscience. The work by Pouget and Sejnowski (1997) has shown that the available physiological data are consistent with the hypothesis that the response of a single neuron in the parietal cortex serves as a basis function generated by the sensory input in a nonlinear fashion. In other words, the neurons transform the sensory input into a format (representation space) such that the subsequent computation is simplified. Then, whenever a motor command (output of the biological system) needs to be generated, this simple computation to Analysis and Design of Echo State Networks 135 read out the neuronal activity is done. There is an intriguing similarity betweentheinterpretationoftheneuronalactivitybyPougetandSejnowski and our interpretation of echo states in ESN. We believe that similar ideas can be applied to improve the design of microcircuit implementations of LSMs. First, the framework of functional space interpretation (bases and projections) is also applicable to microcircuits. Second, the ASE measure may be directly utilized for LSM states because the states are normally low- pass-filtered before the readout. However, the control of ASE by changing the liquid dynamics is unclear. Perhaps global control of thresholds or bias current will be able to accomplish bias control as in ESN with sigmoid PEs. Acknowledgments ThisworkwaspartiallysupportedbyNSFECS-0422718,NSFCNS-0540304, and ONR N00014-1-1-0405. References Amari, S.-I. (1990).Differential-geometrical methods in statistics.NewYork:Springer. Anderson, J., Silverstein, J., Ritz, S., & Jones, R. (1977). Distinctive features, categor- ical perception, and probability learning: Some applications of a neural model. Psychological Review, 84, 413–451. Bell, A. J., & Sejnowski, T. J. 
Acknowledgments

This work was partially supported by NSF ECS-0422718, NSF CNS-0540304, and ONR N00014-1-1-0405.

References

Amari, S.-I. (1990). Differential-geometrical methods in statistics. New York: Springer.
Anderson, J., Silverstein, J., Ritz, S., & Jones, R. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84, 413–451.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Bertschinger, N., & Natschläger, T. (2004). Real-time computation at the edge of chaos in recurrent neural networks. Neural Computation, 16(7), 1413–1436.
Cox, R. T. (1946). Probability, frequency, and reasonable expectation. American Journal of Physics, 14(1), 1–13.
de Vries, B. (1991). Temporal processing with neural networks—the development of the gamma model. Unpublished doctoral dissertation, University of Florida.
Delgado, A., Kambhampati, C., & Warwick, K. (1995). Dynamic recurrent neural network for system identification and control. IEEE Proceedings of Control Theory and Applications, 142(4), 307–314.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
Erdogmus, D., Hild, K. E., & Principe, J. (2003). Online entropy manipulation: Stochastic information gradient. Signal Processing Letters, 10(8), 242–245.
Erdogmus, D., & Principe, J. (2002). Generalized information potential criterion for adaptive system training. IEEE Transactions on Neural Networks, 13(5), 1035–1044.
Feldkamp, L. A., Prokhorov, D. V., Eagen, C., & Yuan, F. (1998). Enhanced multistream Kalman filter training for recurrent networks. In J. Suykens & J. Vandewalle (Eds.), Nonlinear modeling: Advanced black-box techniques (pp. 29–53). Dordrecht, Netherlands: Kluwer.
Haykin, S. (1998). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Haykin, S. (2001). Adaptive filter theory (4th ed.). Upper Saddle River, NJ: Prentice Hall.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hopfield, J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088–3092.
Ito, Y. (1996). Nonlinearity creates linear independence. Advances in Computer Mathematics, 5(1), 189–203.
Jaeger, H. (2001). The echo state approach to analyzing and training recurrent neural networks (Tech. Rep. No. 148). Bremen: German National Research Center for Information Technology.
Jaeger, H. (2002a). Short term memory in echo state networks (Tech. Rep. No. 152). Bremen: German National Research Center for Information Technology.
Jaeger, H. (2002b). Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach (Tech. Rep. No. 159). Bremen: German National Research Center for Information Technology.
Jaeger, H., & Hass, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667), 78–80.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, A 196, 453–461.
Kailath, T. (1980). Linear systems. Upper Saddle River, NJ: Prentice Hall.
Kautz, W. (1954). Transient synthesis in time domain. IRE Transactions on Circuit Theory, 1(3), 29–39.
Kechriotis, G., Zervas, E., & Manolakos, E. S. (1994). Using recurrent neural networks for adaptive communication channel equalization. IEEE Transactions on Neural Networks, 5(2), 267–278.
Kremer, S. C. (1995). On the computational power of Elman-style recurrent networks. IEEE Transactions on Neural Networks, 6(5), 1000–1004.
Kuznetsov, Y., Kuznetsov, L., & Marsden, J. (1998). Elements of applied bifurcation theory (2nd ed.). New York: Springer-Verlag.
Langton, C. G. (1990). Computation at the edge of chaos. Physica D, 42, 12–37.
Maass, W., Legenstein, R. A., & Bertschinger, N. (2005). Methods for estimating the computational power and generalization capability of neural microcircuits. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 865–872). Cambridge, MA: MIT Press.
Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560.
Mitchell, M., Hraber, P., & Crutchfield, J. (1993). Revisiting the edge of chaos: Evolving cellular automata to perform computations. Complex Systems, 7, 89–130.
Packard, N. (1988). Adaptation towards the edge of chaos. In J. A. S. Kelso, A. J. Mandell, & M. F. Shlesinger (Eds.), Dynamic patterns in complex systems (pp. 293–301). Singapore: World Scientific.
Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex using basis functions. Journal of Cognitive Neuroscience, 9(2), 222–237.
Principe, J. (2001). Dynamic neural networks and optimal signal processing. In Y. Hu & J. Hwang (Eds.), Neural networks for signal processing (Vol. 6-1, pp. 6–28). Boca Raton, FL: CRC Press.
Principe, J. C., de Vries, B., & de Oliviera, P. G. (1993). The gamma filter—a new class of adaptive IIR filters with restricted feedback. IEEE Transactions on Signal Processing, 41(2), 649–656.
Principe, J., Xu, D., & Fisher, J. (2000). Information theoretic learning. In S. Haykin (Ed.), Unsupervised adaptive filtering (pp. 265–319). Hoboken, NJ: Wiley.
Prokhorov, D. (2005). Echo state networks: Appeal and challenges. In Proc. of International Joint Conference on Neural Networks (pp. 1463–1466). Montreal, Canada.
Prokhorov, D., Feldkamp, L., & Tyukin, I. (1992). Adaptive behavior with fixed weights in recurrent neural networks: An overview. In Proc. of International Joint Conference on Neural Networks (pp. 2018–2022). Honolulu, Hawaii.
Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279–297.
Puskorius, G. V., & Feldkamp, L. A. (1996). Dynamic neural network methods applied to on-vehicle idle speed control. Proceedings of the IEEE, 84(10), 1407–1420.
Rao, Y., Kim, S., Sanchez, J., Erdogmus, D., Principe, J. C., Carmena, J., Lebedev, M., & Nicolelis, M. (2005). Learning mappings in brain machine interfaces with echo state networks. In 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing. Philadelphia.
Renyi, A. (1970). Probability theory. New York: Elsevier.
Sanchez, J. C. (2004). From cortical neural spike trains to behavior: Modeling and analysis. Unpublished doctoral dissertation, University of Florida.
Santiago, R. A., & Lendaris, G. G. (2004). Context discerning multifunction networks: Reformulating fixed weight neural networks. In Proc. of International Joint Conference on Neural Networks (pp. 189–194). Budapest, Hungary.
Shah, J. V., & Poon, C.-S. (1999). Linear independence of internal representations in multilayer perceptrons. IEEE Transactions on Neural Networks, 10(1), 10–18.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 623–656.
Siegelmann, H. T. (1993). Foundations of recurrent neural networks. Unpublished doctoral dissertation, Rutgers University.
Siegelmann, H. T., & Sontag, E. (1991). Turing computability with neural nets. Applied Mathematics Letters, 4(6), 77–80.
Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended Kalman algorithm. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 133–140). San Mateo, CA: Morgan Kaufmann.
Takens, F. (1981). Detecting strange attractors in turbulence. In D. A. Rand & L.-S. Young (Eds.), Dynamical systems and turbulence (pp. 366–381). Berlin: Springer.
Thogula, R. (2003). Information theoretic self-organization of multiple agents. Unpublished master's thesis, University of Florida.
Werbos, P. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560.
Werbos, P. (1992). Neurocontrol and supervised learning: An overview and evaluation. In D. White & D. Sofge (Eds.), Handbook of intelligent control (pp. 65–89). New York: Van Nostrand Reinhold.
Wilde, D. J. (1964). Optimum seeking methods. Upper Saddle River, NJ: Prentice Hall.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280.

Received December 28, 2004; accepted June 1, 2006.