
Fractal Learning Neural Networks

Whitney Tabor

Department of Psychology

University of Connecticut

Storrs, CT 06269 USA

Phone: (860) 486-4910

email: [email protected]

Abstract

One of the fundamental challenges to using recurrent neural networks (RNNs) for learning natural languages is the difficulty they have with learning complex temporal dependencies in symbol sequences. Recently, substantial headway has been made in training RNNs to process languages which can be handled by the use of one or more symbol-counting stacks (e.g., $a^nb^n$, $a^nA^mB^mb^n$, $a^nb^nc^n$), with compelling cases made that the networks are solving the problem in principle, not simply using finite-state methods to approximate the training data (e.g., Gers and Schmidhuber, 2001; Bodén and Wiles, 2002). Success on cases that require stacks to function as full-fledged sequence memories (e.g., palindrome languages) has not been as great. On the other hand, (Moore, 1998), (Siegelmann, 1996), and (Tabor, 2000) describe methods of using fractal sets to keep track of arbitrary stack computations in neural units with bounded, rational-valued activation values. The present paper combines these hand-wiring methods with the work on learning by describing Fractal Learning Neural Networks (FLNNs), which discover such fractal encodings via hill climbing in weight space. Simulations show that FLNNs can learn several context-free languages that cannot be processed by a finite set of input-symbol counters.


1 Introduction

Recurrent neural networks (RNNs) of sufficient size can implement Turing

machines (Pollack, 1987; Kremer, 1996; Siegelmann, 1996) and thus perform

the same computations as symbolic computing mechanisms. It has proved

difficult, however, to successfully train RNNs to learn infinite-state languages.

This paper focuses on RNN learning of context-free languages (CFLs), the

smallest class of infinite state languages on the Chomsky Hierarchy (Chomsky,

1956). Appendix 1 provides definitions of terms like “context free language”

which come from the literature on formal automata. The central training

results of this paper were synopsized in (Tabor, 2003). The current work

confirms the correspondence between the context free process and the neural

network process by analyzing the discrepancies between them.

An early effort to let a sequential neural symbol processor induce a context

free language used a neural controller to manipulate a separate neural stack

(Sun, Chen, Giles, Lee, and Chen, 1990). Recently, several projects have been

able to induce the stack mechanism itself, as well as the controller, in the

network’s hidden units. Successful results have been obtained for languages

that involve using one or more stacks as “counters”, which keep track of the

length of a substring of input elements. For example, a Simple Recurrent

Network (SRN) trained by (Wiles and Elman, 1995) was able to correctly

learn the corpus of sentences from the language $a^nb^n$ (this language consists of the sentences, “a b”, “a a b b”, “a a a b b b”, etc.) with up to 11 levels of embedding and to generalize up to 18 levels. (Bodén and Wiles, 2000) and (Bodén and Wiles, 2002) found that the Sequential Cascaded Network (SCN) of (Pollack, 1991) achieved generalization considerably beyond the training set on $a^nb^nc^n$, a context-sensitive language.

Because infinite-state languages involve arbitrarily long temporal depen-

dencies, these projects have struggled with the well-known exponential decay

of the error signal over time in standard RNN training regimens—see (Hochre-

iter, 1991; Hochreiter, 1997) for a formal analysis of this problem. There-

fore, (Hochreiter and Schmidhuber, 1997) developed Long Short-Term Mem-


ory (LSTM) networks, which employ a special kind of unit that maintains con-

stant error flow across arbitrarily long temporal dependencies. Indeed, (Gers

and Schmidhuber, 2001) found that LSTM networks trained on examples from

$a^nb^n$ and $a^nb^nc^n$ with at most 10s of levels of embedding generalized perfectly to 1000s of levels (see Table 1). LSTM networks also did well with the “restricted palindrome language”¹, $a^nA^mB^mb^n$, far exceeding the performance of

previous RNNs. Moreover, the LSTM learning algorithm is local in space and

time in the sense of (Schmidhuber, 1989). The LSTM paradigm is thus cur-

rently the most successful neural network paradigm for learning infinite state

languages.

Whereas formal automata are naturally suited to modeling symbol se-

quences whose structure is specified exactly by finitely-many rules, neural net-

works are well suited to modeling noisy environments where statistical methods

are appropriate. Since symbol-processing RNNs combine these viewpoints, it

is helpful to establish a terminology for talking about probabilities of symbol

sequences.

Let a corpus-distribution be a map from each prefix (i.e., sequence of words

from the vocabulary) to a probability distribution over next-words. The prob-

ability of a word sequence is the product of the probabilities of the successive

word-to-word transitions within it (i.e., the successive word choices are inde-

pendently sampled). A word sequence is a grammatical string of a corpus-

distribution if it has nonzero probability. A machine correctly processes a

grammatical string from a corpus distribution if, upon being presented with

each symbol of the string in succession, it accurately specifies the probabilities

of the symbols that follow. An output activation vector accurately specifies a

probability distribution if it is closer to the correct distribution than to any

other distribution associated with a grammatical symbol transition in the cor-

pus distribution. This method of assessing correctness is equivalent to the

method used by a number of researchers in cases where only one next-word

is possible: count the prediction as correct if the activation of the unit asso-

¹The term is due to (Rodriguez and Wiles, 1998).


ciated with the correct next word is above 0.5 (Rodriguez, 2001; Bodén and Wiles, 2000; Bodén and Wiles, 2002; Rodriguez and Wiles, 1998). The current

method has the advantage of also being useful in cases where probabilities are

involved. Table 1 cites best published performance levels to date, measured as

percent accurate specifications in a sample corpus.
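As an illustration of this correctness criterion, the following minimal Python sketch checks whether an output vector accurately specifies a next-symbol distribution. The Euclidean metric and the function name are my own assumptions, since the text does not fix a particular distance measure.

```python
import numpy as np

def accurately_specifies(output, correct_dist, grammatical_dists):
    """True iff the output vector is closer (here: Euclidean distance) to the
    correct next-symbol distribution than to any other distribution associated
    with a grammatical transition in the corpus distribution."""
    output = np.asarray(output, float)
    d_correct = np.linalg.norm(output - correct_dist)
    others = [d for d in grammatical_dists if not np.allclose(d, correct_dist)]
    return all(d_correct < np.linalg.norm(output - d) for d in others)

# With one-hot candidate distributions this behaves like the "activation of the
# correct unit above 0.5" criterion discussed in the text.
dists = [np.eye(3)[i] for i in range(3)]
print(accurately_specifies([0.7, 0.2, 0.1], dists[0], dists))   # True
```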

The languages mentioned so far constitute a proper subset of stack-manipulation

languages. In particular, all of them can be modeled with a device that counts

stack symbols on one or more stacks (Gers and Schmidhuber, 2001; Rodriguez,

2001); none require keeping track of arbitrary orders of the symbols on the

stack. For example, to process $a^nb^nc^n$, one stack can count up as the a’s occur

and then count back down as the b’s occur; a second stack can count up as the

b’s occur and then count back down as the c’s occur. There are many stack-

manipulation languages, even within the set of context-free languages, which

require keeping track of the order, as well as the number, of elements on the

stack. For example, each sentence in $wW^R$ (a palindrome language) is a string that can be divided in half such that each half is a mirror-image (under homomorphism) of the other half, e.g., “a b a a A A B A” (see Table 1).

RNN learning of languages that require keeping track of stack order has

not been very successful. With considerable effort, (Rodriguez, 2001) was

able to get an SRN to make some correct predictions about an unrestricted

palindrome language ($wW^R$) but it only mastered 78% of its training strings

by (Rodriguez, 2001)’s measure in which the net had to correctly process

every transition within a string to be counted as getting the string correct. To

process $wW^R$, it is necessary for the network to remember not only how many

symbols are on the stack, but the order in which they occurred. (Tabor, 2002)

trained a set of 10 SRNs with four hidden units each on the crossed-serial

dependency language, $wW$ (see Table 1), another language for which keeping

track of stack order is essential. Although the focus of that work was not on

trying to achieve high accuracy, the very poor performance of the network in

that study suggests that it is not straightforward to get an SRN to learn $wW$.


Table 1: State-growth rates for several languages. The column “State-Growth” gives the minimal number of states a system must distinguish in order to correctly classify all strings of length L or less (for those L for which such strings exist). $wW$ is the language in which an arbitrary sequence of symbols, $w = w_1w_2 \ldots w_n$, must be followed by an equal-length sequence, $W = W_1W_2 \ldots W_n$, and each correspondence $w_i \leftrightarrow W_i$ is an instance of either $a \leftrightarrow A$ or $b \leftrightarrow B$ (a crossed serial dependency language). $wW^R$ is identical to $wW$ except that the order of the $W_i$'s is reversed (a palindrome language). Under “Best Performance”, “x% at L ≤ p” means the network accurately specified x% of the symbol transitions when tested on all sentences up to length p.

Language          State-Growth          Best performance     Training   Net    Source
$a^nb^n$          $L$                   100% at L ≤ 2000     L ≤ 20     LSTM   (Gers & Schmidhuber, 2001)
$a^nb^nc^n$       $L$                   100% at L ≤ 1500     L ≤ 120    LSTM   (Gers & Schmidhuber, 2001)
$a^nA^mB^mb^n$    $(L/2 + 1)^2 - 1$     100% at L ≤ 92       L ≤ 44     LSTM   (Gers & Schmidhuber, 2001)
$wW^R$            $3(2^{L/2} - 1)$      ≤ 81%² at L = 14     L ≤ 12+    SRN    (Rodriguez, 2001)
$wW$              $3(2^{L/2} - 1)$      69% at L ≤ 6         Sample³    SRN    (Tabor, 2002)

1.1 State Growth Rates

A useful way of categorizing languages which require different kinds of stack

mechanisms is to examine their rates of state growth as a function of string length

(Crutchfield, 1994). I define the state growth function of a language L to be the

function which specifies, for each possible sentence-length, n, how many states

a computer must distinguish in order to correctly process all strings of length

n or less from L. Given this definition, infinite-state languages which require

keeping track of arbitrary orders of elements on their stacks are exponential

state-growth languages.
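The definition can be made concrete with a brute-force count. The sketch below (an illustration of the definition over a finite sample, not the closed-form expressions of Table 1) groups the prefixes of a set of sentences by the grammatical continuations they admit; the number of groups is the number of states a processor must distinguish for that sample.

```python
def states_needed(sentences):
    """Group prefixes of a finite sentence sample by the set of continuations
    that complete them grammatically; each group is one state a processor must
    distinguish in order to process the sample correctly."""
    prefixes = {s[:k] for s in sentences for k in range(len(s) + 1)}
    def future(p):
        return frozenset(s[len(p):] for s in sentences if s.startswith(p))
    return len({future(p) for p in prefixes})

# All a^n b^n sentences of length <= 8; the count grows roughly linearly with
# the length bound, in line with the State-Growth column of Table 1.
print(states_needed(["a" * n + "b" * n for n in range(1, 5)]))
```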

It is difficult for RNNs to learn exponential state-growth languages. Even

though the performance levels of LSTM networks are consistently much higher

than those of SRNs, I am aware of no published results on their learning of

exponential state-growth languages.


It makes sense that exponential state-growth languages should be partic-

ularly hard to learn because (i) many training exemplars are needed to fully

sample the set of viable strings up to any given length, (ii) the hidden units

cannot function merely as counters, but must also keep track of stack se-

quencing information, and (iii) many states implies smaller critical distances

between states, so the network must be tuned very accurately even to handle

short strings. Unfortunately, among practical applications of infinite state lan-

guages, exponential state-growth cases are common. For example, embedding

order is critical for keeping track of the subject-predicate dependencies in nat-

ural language relative clauses (Christiansen and Chater, 1999). Moreover, in

computer programming tasks which require recursion, order of recursive em-

bedding almost always has a major effect on function. It is thus desirable to

be able to handle such cases with learning RNNs.

For both linear and quadratic languages, the principles employed by net-

works that successfully learn the languages are very clear: hidden units are

used as counters of stack symbols. But how to extend this method to ex-

ponential state-growth cases is not obvious: (Rodriguez, 2001), using human

insight to build a linear approximation of the network identified in Row 4

of Table 1, was able to find an effective solution for one case, but he did not

provide a principled method of accomplishing this approximation. Nor did

he show how the correct weights could be learned. The present work starts

with a known method of encoding exponential state-growth cases in artificial

neurons and then identifies a type of network that has a strictly monotonic

gradient from its initial state to the engineered solution for several exponential

²(Rodriguez, 2001) reports 39 correct out of 114 mixed sequences (i.e., with both a’s & b’s). I concluded that there were a minimum of 114 - 39 = 73 errors among at most $3(2^7 - 1)$ states. The actual error rate may have been higher.

³(Tabor, 2002) generated strings with a probabilistic queue-grammar in which the probability of pushing a symbol onto the queue was 0.2 at each point. The network was trained on one pass through a 3 million word corpus of such strings.


state-growth cases.

2 Dynamical Automata and Fractal Learning Neural Networks

Insight into the challenge of encoding arbitrary stack manipulations in RNNs

has been provided by (Moore, 1998), (Siegelmann, 1996), (Tabor, 2000), (Barns-

ley, 1988), and (Tino, 1999) who show that fractal sets provide a natural frame-

work for encoding recursive computation in analog computing devices. The

current paper employs the framework laid out by (Tabor, 2000).

2.1 Pushdown Dynamical Automata for Context Free Languages

(Tabor, 2000) defines dynamical automata for processing symbol sequences in

a bounded real-valued activation space. A dynamical automaton (or “DA”)

(Tabor, 2000) is a symbol processor whose states lie in a complete metric

space. The DA must start at a particular point called the start state. It is

associated with a list of state change functions that map that metric space to

itself, a vocabulary of symbols that it processes, and a partition of the metric

space. An Input Map specifies, for each compartment of the partition, a set

of symbols that can be grammatically processed when the network is in the

current partition, and the state change function that is associated with each

symbol. Appendix 2 provides a formal definition.

(Tabor, 2000) defines a subset of Dynamical Automata, called Pushdown

Dynamical Automata (PDDAs), that behave like Pushdown Automata (PDAs—

see Appendix 1). PDDAs travel around on fractals with several branches.

Their functions implement analogs of stack operations: pushing, popping, and

exchanging symbols. Pushing involves moving to a part of the fractal where

the local scale is an order of magnitude smaller than the current scale; pop-

ping is the inverse of pushing; and exchanging involves shifting position at the


Table 2: The Input Map for Pushdown Dynamical Automaton 1 (PDDA 1). The initial state is $\binom{0}{0}$. When the automaton is in the state listed under “Compartment”, it can read (or emit) the symbol listed under “Input”. The consequent state change is listed under “State Change”.

Compartment                Input   State Change
$z_1 < 0$ and $z_2 < 0$    b       $\vec{z} \leftarrow \vec{z} + \binom{0}{2}$
$z_1 < 0$ and $z_2 > 0$    c       $\vec{z} \leftarrow 2(\vec{z} + \binom{1}{-1})$
Any                        a       $\vec{z} \leftarrow \frac{1}{2}\vec{z} + \binom{-1}{-1}$

largest scale without changing the scale. (Tabor, 2000) shows that the set of

languages processed by PDDAs is identical to the set of languages processed

by PDAs, that is to say, the Context Free Languages (Hopcroft and Ullman,

1979).

Table 2 gives an example of a PDDA for processing the sentences generated

by the grammar shown in Table 3. PDDA 1 always starts at the origin of

the space z1 × z2 at the beginning of processing a sentence; it must move in

accordance with the rules specified in Table 2; if it is able to process each word

of the sentence and return to the origin when the last word is read, then the

sentence belongs to the language.

In order to successfully process Language 1 with a symbolic stack, it is

necessary to store information on the stack whenever an “a b c” sequence

is begun but not completed. If A is stored when “a” is encountered, and

B is stored when “b” is encountered, then the possible stack states are the

members of {A, B}* (see Appendix 1). As Figure 1a shows, PDDA 1 induces

a map from this set to positions on a fractal with the topology of a Cantor Set

(Barnsley, 1988). Figure 1b provides an example of the states visited by the

automaton as it processes the center-embedded sentence, “a b a a b c b c c”.
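To make the mechanics concrete, here is a small Python sketch of PDDA 1. The compartment tests and state-change functions are taken directly from Table 2, and acceptance is the return-to-origin condition described above; the function names are mine.

```python
import numpy as np

def pdda1_step(z, symbol):
    """One state change of PDDA 1, following the Input Map of Table 2."""
    z1, z2 = z
    if symbol == "a":                        # legal in any compartment (push)
        return 0.5 * z + np.array([-1.0, -1.0])
    if symbol == "b" and z1 < 0 and z2 < 0:  # legal only in this compartment
        return z + np.array([0.0, 2.0])
    if symbol == "c" and z1 < 0 and z2 > 0:  # legal only here (pop)
        return 2.0 * (z + np.array([1.0, -1.0]))
    raise ValueError(f"'{symbol}' is ungrammatical in state {z}")

def accepts(sentence):
    """A sentence belongs to Language 1 iff every word can be processed and the
    trajectory returns to the start state, the origin."""
    z = np.zeros(2)
    for w in sentence.split():
        z = pdda1_step(z, w)
    return bool(np.allclose(z, 0.0))

print(accepts("a b c"))                # True
print(accepts("a b a a b c b c c"))    # True: the center-embedded example
```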


Table 3: Grammar 1. Parentheses denote optional constituents, which occur with probability 0.2 in every case. A probability, indicated by a decimal number to its left, is associated with each production. The probabilities are irrelevant for implementation in a dynamical automaton but are useful for generating a corpus which can be used to train a FLNN (see below).

1.0  S → A B C
1.0  A → a (S)
1.0  B → b (S)
1.0  C → c (S)
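For illustration only, a sampler for Grammar 1 might look like the sketch below. The training corpora actually used in Section 3 were exhaustive sets of short sentences rather than random samples, and the function name is hypothetical; the probability 0.2 is the optional-constituent probability from the table.

```python
import random

def sample_grammar1(p_embed=0.2):
    """Sample one sentence from Grammar 1: S -> A B C, where each of A, B, C
    rewrites to its terminal, optionally followed by an embedded S."""
    def expand_S():
        words = []
        for terminal in ("a", "b", "c"):
            words.append(terminal)
            if random.random() < p_embed:   # the optional (S) constituent
                words.extend(expand_S())
        return words
    return " ".join(expand_S())

random.seed(1)
print([sample_grammar1() for _ in range(3)])
```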

Figure 1: a. The stack map of PDDA 1, plotted in the $z_1 \times z_2$ plane. Stack-states (labeled) are associated with the apices of rightward-pointing triangles. Each letter in a label identifies a symbol on the stack. The top-of-stack is the right-most letter in each label. b. The trajectory associated with the sentence, “a b a a b c b c c”, from Language 1. “1.” identifies the first word, “2.” the second, etc.


2.2 Insight from Prior Training Results

(Rodriguez, 2001) trained an SRN on several context free languages. Although the networks did not achieve high accuracy on exponential state-growth languages (see Table 1), (Rodriguez, 2001) was able to examine the network

weights and construct an effective solution in each case, using a linear re-

current map in place of the SRN’s sigmoidal hidden unit activation function.

For example, based on a network trained on $wW^R$, (Rodriguez, 2001) defined

the hidden-layer map shown in (1)-(2).

$$\vec{z}_t = f\left(\begin{pmatrix} 0.5 & 0 & 0 & 0 \\ 0 & 0.5 & 0 & 0 \\ 2 & 0 & 2 & 0 \\ 0 & 2 & 0 & 2 \end{pmatrix} \vec{z}_{t-1} + \begin{pmatrix} 0.5 & 0.5 & -5 & -5 \\ 0.4 & 0.1 & -5 & -5 \\ -5 & -5 & -1 & -1 \\ -5 & -5 & -0.8 & -0.2 \end{pmatrix} \vec{I}_t\right) \qquad (1)$$

$$f_i(n_i) = \begin{cases} 0, & n_i < 0 \\ n_i, & 0 \le n_i \le 1 \\ 1, & 1 < n_i \end{cases} \qquad (2)$$

The input, It, is encoded as a = (1, 0, 0, 0); b = (0, 1, 0, 0); A = (0, 0, 1,

0); B = (0, 0, 0, 1). For the “push” operations (see Appendix 1) associated

with the symbols a and b, the third and fourth dimensions have fixed values,

so it is possible to graph the important state changes. Figure 2 shows the set

of all states visited by the map as it processes all possible initial sequences of

a’s and b’s down to seven levels of embedding (the “push set”).
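For reference, the gleaned map (1)-(2) can be written out directly. In the sketch below the matrix and input-code values come from (1) and the surrounding text; the start state (taken here to be the zero vector) is not specified in the excerpt, so the absolute positions of the resulting push set are only illustrative.

```python
import numpy as np

W_rec = np.array([[0.5, 0.0, 0.0, 0.0],
                  [0.0, 0.5, 0.0, 0.0],
                  [2.0, 0.0, 2.0, 0.0],
                  [0.0, 2.0, 0.0, 2.0]])
W_in = np.array([[ 0.5,  0.5, -5.0, -5.0],
                 [ 0.4,  0.1, -5.0, -5.0],
                 [-5.0, -5.0, -1.0, -1.0],
                 [-5.0, -5.0, -0.8, -0.2]])
onehot = {"a": [1, 0, 0, 0], "b": [0, 1, 0, 0],
          "A": [0, 0, 1, 0], "B": [0, 0, 0, 1]}

def f(n):
    # Piecewise-linear squashing function of equation (2).
    return np.clip(n, 0.0, 1.0)

def step(z, symbol):
    # Hidden-layer map of equation (1).
    return f(W_rec @ z + W_in @ np.array(onehot[symbol], float))

# Trace a few "push" states for an initial sequence of a's and b's (cf. Figure
# 2); the zero start state is an assumption, not a value given in the paper.
z = np.zeros(4)
for sym in "aba":
    z = step(z, sym)
    print(sym, z[:2])
```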

Figure 2 makes it clear that Rodriguez’s gleaned model uses the same kind

of stack mechanism as a PDDA. Two insights from this and other prior work

help with the challenge of learning exponential state-growth languages: (i)

as (Rodriguez, 2001) and (Holldobler, Kalinke, and Lehmann, 1997) point

out, the fractal scaling solution depends critically on being able to invert the

scale changes. SRNs do this by using the linear region of their activation

function to approximate multiplicative inverses. But the imperfection of the

approximation leads to inaccuracies at high levels of embedding—hence the use


Figure 2: The push set for the map gleaned by (Rodriguez, 2001) from an SRN trained on $wW^R$, plotted in the $z_1 \times z_2$ plane. Visited states are at the tips of leftward-pointing triangles. The letter labels identify the stack states associated with the visited states under the minimal PDA for $wW^R$. As in PDDA 1, the states with different immediate futures are on different branches of the fractal.


by (Rodriguez, 2001) of the piecewise linear function, f. The FLNNs discussed

in the next section avoid the nonlinear distortion by using linear recurrent units

in the learning model. (ii) Part of the complexity of the solution embodied in

equation (1) lies in the use of a rotation when the network crosses the midpoint

of a palindrome (This rotation is accomplished by the submatrix $\begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$ in

the third quadrant of the recurrent weight block in (1)). The rotation has the

effect of mapping the whole push-set into a new subspace so the network can

easily tell whether it is in push- or pop-mode. This solution is somewhat akin

to doing a global symbol substitution across the stack in a PDA: it preserves

the stack structure while changing the identity of every element in it. A simpler

solution with a PDA is to use a control-state variable to keep track of whether

the midpoint has been passed. Indeed, (Tabor, 2000) handles such cases with

DAs by invoking an additional dimension which encodes the analog of a control

state change. Suspecting that learning the rotation or the control state change

might be difficult for a network, I have focused here on cases in which a stack

alone is sufficient—no separate control variables are needed.

2.3 Fractal Learning Neural Networks (FLNNs): Implementation of Dynamical Automata in RNNs

Figure 3 shows a Fractal Learning Neural Network (FLNN), one way of im-

plementing a dynamical automaton in artificial neurons. The input layer uses

a one-hot encoding for words of the vocabulary, as in (Elman, 1990). The

first hidden layer is recurrently connected. The input layer has two kinds of

projections to the first hidden layer: first-order connections project directly to

the hidden units; second-order connections control the weights on the recurrent

hidden connections. As in (Rodriguez, 2001)’s hand-constructed networks, the

first hidden layer has a linear (identity) activation function. All maps are dis-

crete. The following is a general form of the update rule for the first hidden

layer:


Figure 3: A FLNN for learning the language of Grammar 1. The layers are: Input (units for a, b, and c), Hidden 1 (linear), Hidden 2 (Gaussian), and Output (softmax, with units for a, b, and c).


$$z_i(t) = \sum_{j\in \mathrm{Inputs}} w_{ij}\, a_j(t) \;+ \sum_{k\in \mathrm{Hidden1}}\ \sum_{j\in \mathrm{Inputs}} s_{ikj}\, z_k(t-1)\, a_j(t) \qquad (3)$$

Here, $t$ indexes time, $a_j$ is the activation of the $j$'th input unit, and $z_i$ is the activation of the $i$'th first hidden layer unit; $w_{ij}$ is the (first-order) weight on the connection from unit $j$ to unit $i$; $s_{ikj}$ is the (second-order) weight from input unit $j$ and the previous state of hidden unit $k$ to the current state of hidden unit $i$. Because the PDDA analysis indicates that only the self-weights need to be manipulated for handling context free grammars and it suffices to use the same contraction/expansion factor across all hidden dimensions for a given word, one can set all the non-self-connections in Hidden1 to 0 and define $s_j = s_{iij}$. Moreover, because the input vectors are “one-hot” codes, the description can be further simplified by defining $z_i(t, j)$ to be the value of $z_i(t)$ when unit $j$ is the activated input:

$$z_i(t, j) = w_{ij} + z_i(t-1)\, s_j \qquad (4)$$

Equation (4) should be compared to the equations in the “State Change”

column of Table 2, which it implements.
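To make the correspondence explicit, here is a minimal sketch of the simplified update (4), together with one choice of weights (my reading of Table 2, not values reported in the paper) under which it reproduces PDDA 1's state changes.

```python
def hidden1_step(z, j, W, s):
    """Equation (4): with one-hot input unit j active and only self-recurrent
    second-order weights, unit i moves to W[i][j] + s[j] * z[i]."""
    return [W[i][j] + s[j] * zi for i, zi in enumerate(z)]

# One assignment that reproduces Table 2 (words indexed a=0, b=1, c=2):
#   word  s_j    (w_1j, w_2j)   resulting map
#   a     0.5    (-1, -1)       z <- 0.5 z + (-1, -1)
#   b     1.0    ( 0,  2)       z <- z + (0, 2)
#   c     2.0    ( 2, -2)       z <- 2 z + (2, -2) = 2(z + (1, -1))
W = [[-1.0, 0.0,  2.0],   # w_1j for words a, b, c
     [-1.0, 2.0, -2.0]]   # w_2j for words a, b, c
s = [0.5, 1.0, 2.0]
print(hidden1_step([0.0, 0.0], 0, W, s))   # read "a" from the start state
```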

The Hidden1 units have first-order connections to the units in Hidden2.

These second hidden layer units have gaussian activation functions:

$$g_i(t) = \exp\left[\frac{-\,|\vec{w}_i - \vec{z}(t)|^2}{b_i^2}\right] \qquad (5)$$

Here, $g_i$ is the activation of the $i$'th second hidden layer unit and $b_i^2$ is its “bias” (a parameter which controls the radius of the spherical region of Hidden1 space over which the unit is active); $\vec{w}_i$ is the vector of weights feeding from Hidden1 to unit $i$ in Hidden2; $\vec{z}$ is the vector of Hidden1 activations. The

purpose of these radial basis function units is explained below.

The second hidden layer units have first-order connections to the output

units, which, as a group, have the normalized exponential (or “soft-max”)

activation function, since they model the probability distribution for the next

symbol at each point in time (Haykin, 1994):


$$o_i(t) = \frac{\exp(net_i(t))}{\sum_{k\in \mathrm{Outputs}} \exp(net_k(t))} \qquad (6)$$

$$net_i(t) = \sum_{j\in \mathrm{Hidden2}} w_{ij}\, g_j(t) \qquad (7)$$

Here, $o_i(t)$ is the activation of the $i$'th output unit at time $t$.
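A compact sketch of equations (5)-(7) follows; the max-subtraction is a standard numerical-stability step, not part of the paper's formulation, and the variable names are mine.

```python
import numpy as np

def flnn_output(z, W_h2, b, W_out):
    """Forward pass through Hidden2 and Output, equations (5)-(7).
    z: Hidden1 state; W_h2[i]: centre of radial-basis unit i; b[i]: its radius
    parameter; W_out: Hidden2-to-Output weights (one row per output unit)."""
    g = np.exp(-np.sum((W_h2 - z) ** 2, axis=1) / b ** 2)   # eq. (5)
    net = W_out @ g                                          # eq. (7)
    e = np.exp(net - net.max())                              # eq. (6), stabilised
    return e / e.sum()

# Tiny shape check: 2 Hidden1 units, 3 Hidden2 units, 3 output units;
# b_i^2 = 0.25 (i.e. b_i = 0.5) matches the fixed variances of Section 3.2.
rng = np.random.default_rng(0)
print(flnn_output(np.zeros(2), rng.normal(size=(3, 2)),
                  np.full(3, 0.5), rng.normal(size=(3, 3))))
```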

The network receives symbols in sequence from the language it is trained

on. Each symbol activates a single, unique unit on the input layer. The ac-

tivations are updated layer-by-layer in the order, Input, Hidden1, Hidden2,

Output. In the recurrent layer, Hidden1, all the new states are computed

before any of the states are actually changed. The job of the network is to

activate on the output layer, after each symbol is presented, the correct prob-

ability distribution over next-symbols (Elman, 1990). The (gaussian) radial

basis functions at Hidden2 are used by the network to classify the branches of

the fractal on which the network travels in Hidden1. For the languages consid-

ered here, each fractal branch in Hidden1 corresponds to a different immediate

future and hence a different probability distribution on the output layer. The

use of radial basis functions rather than sigmoid functions on Hidden2 helps

to drive the formation of the contraction maps which are important for im-

plementing the fractal scaling: if the FLNN persists in using non-contractive

scaling, then it will suffer great inaccuracy when multiple embeddings drive it

outside the halo of the radial basis units.

3 Simulations

3.1 Training Languages

I used the training languages specified by Grammar 1 (Table 3) and Grammar

2 (Table 4). These languages are exponential state-growth languages. Their

state-growth functions are $2^{(L/3+1)} - 1$ and $2^{(L/2+1)} - 1$, respectively.


Table 4: Grammar 2. (See Table 3 for explanation.)

0.5  S → A B
0.5  S → X Y
1.0  A → a (S)
1.0  B → b (S)
1.0  X → x (S)
1.0  Y → y (S)

3.2 Training Procedure

Two constraints were put on the FLNN to make learning easier. I assigned the

three gaussian units in the second hidden layer fixed variances. These units

end up classifying the branches of the fractal. Since the radius of each fractal

branch is arbitrary, provided it is positive, the second hidden layer variances

can be fixed without loss of generality. I set these variances to 0.25. I also set

all the hidden self-weights to 1 initially. This choice is motivated by the fact

that the self-weights perform the fractal contraction and expansion, so setting

them to the multiplicative identity is the unbiased choice.

The networks were trained by gradient sampling (also called “hill climbing”) in batch mode. By gradient sampling, I mean the class of methods of

descending an error surface which sample the change in error associated with

a set of small displacements from the current state and choose the one that

reduces the error most at each step. This approach was motivated by the fact

that gradient sampling is simple to implement and conventional methods of

computing the gradient accurately, Backpropagation Through Time (Werbos,

1974; Rumelhart, Hinton, and Williams, 1986; Williams and Peng, 1990; Pearl-

mutter, 1995) and Real-time Recurrent Learning (Robinson and Fallside, 1987;

Williams and Zipser, 1995), suffer from disappearance of the error signal across

long time lags (Hochreiter, 1991; Hochreiter, 1997). It was also motivated by

the fact that (Chalup and Blair, 1999) and (Melnik, Levy, and Pollack, 2000)

were successful in training other neural networks on infinite state languages

using gradient sampling techniques. Disadvantages of the current method are

that it is inefficient and it is nonlocal in time and space in the sense of (Schmid-


huber, 1989). The work of (Gers and Schmidhuber, 2001) and (Hochreiter and

Schmidhuber, 1997), which computes an accurate gradient without letting the

error signal decay, helpfully complements the present work and may provide

the basis for a more efficient algorithm.

The gradient sampling was implemented as follows. A corpus of sentences

from the language was processed at the current weight setting and at points

on a sphere in weight-space surrounding the current setting. To reduce the

amount of testing, only a set of basis directions and their negatives was tested

at each stage. Whichever single weight adjustment produced the greatest

reduction in the error relative to the current setting was adopted and the

process was repeated. Error on a particular word was measured as Kullback-

Leibler Divergence at the output layer ($E = \sum_{i\in \mathrm{Outputs}} t_i \log(\frac{t_i}{o_i})$, where $t_i$ is the probability of word $i$ in the context and $o_i$ is the activation of output unit $i$) and was summed over all words in the training corpus. The radius of the

sphere was 0.001. No momentum was used. The initial values of the Hidden1-

to-Hidden2 weights and the Hidden2-to-Output weights in each network were

randomly picked from the uniform distribution on (-0.3, 0.3).
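A schematic version of one such step might look as follows. The helper names and the epsilon guard in the KL term are mine, and error_fn is assumed to run the whole training corpus through the network at a given flattened weight vector.

```python
import numpy as np

def kl_error(targets, outputs, eps=1e-12):
    # Summed Kullback-Leibler divergence over the corpus (the epsilon guard is
    # a numerical convenience, not part of the paper).
    t, o = np.asarray(targets, float), np.asarray(outputs, float)
    return float(np.sum(np.where(t > 0, t * np.log(t / (o + eps)), 0.0)))

def gradient_sample_step(weights, error_fn, radius=0.001):
    """Probe each basis direction and its negative at the given radius and keep
    whichever single-weight displacement lowers the corpus error most, if any."""
    best_w, best_e = weights, error_fn(weights)
    for i in range(len(weights)):
        for sign in (+1.0, -1.0):
            probe = weights.copy()
            probe[i] += sign * radius
            e = error_fn(probe)
            if e < best_e:
                best_w, best_e = probe, e
    return best_w, best_e
```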

The training corpus for Language 1 consisted of one instance each of all

sentences of length ≤ 9. The training corpus for Language 2 consisted of one

instance each of all sentences of length ≤ 6. Note that the training sets for

both languages included strings with at most three levels of recursion. From

the corpus distribution associated with each language, I computed the transi-

tion probabilities for the transitions in its training corpus and used these as

the targets on the output layer. Note that each training corpus was a highly

skewed sample from its corresponding corpus distribution. Similar results were

obtained when each training corpus formed a random sample from its corre-

sponding corpus distribution but convergence was slower. The networks were

trained until their mean error per word on the test corpus dropped below 0.001

or the gradient became so shallow that it appeared flat in single-precision float-

ing point encoding.


Table 5: Performance on training and test sets for the two FLNNs. RMSE = Root Mean Squared Error. % Cor. = Percent Correct. N = the number of networks that contributed to the computation of Standard Error (SE). Npoints = the number of words tested per network. All networks were trained by gradient sampling.

Language  Corpus    RMSE   SE     % Cor.   SE     N   Npoints
1         Training  0.013  0.001  100.000  0.000  9   129
1         Test      0.048  0.005  99.424   0.195  9   4755
2         Training  0.008  0.000  100.000  0.000  11  276
2         Test      0.022  0.001  99.900   0.022  11  15232

3.3 Training Results

3.3.1 Test Statistics

The test corpora consisted of all sentences of length 12 to 15 words from Lan-

guage 1 and all sentences of length 8 to 10 words from Language 2 (i.e., novel

sentences with ≤ 5 levels of recursion). The point was to test all sentences

within a short depth range in order to find out how close to perfectly im-

plementing the recursive structure the nets were coming, without running up

against the distortion created by limited precision (Boden and Wiles, 2000).

FLNN1 and FLNN2 processed their training sets completely accurately in

9 and 11 out of 15 trials, respectively. I will refer to the runs with accurate

training set performance in each case as the “successful runs”. No unsuccessful

runs achieved more than 93% accuracy on the training corpus. To detect and

avoid over-learning, the networks were tested on the testing corpus as well as

the training corpus throughout training. This precaution turned out not to

be necessary: the minimum of testing error occurred at, or very near, the end

of training and the difference in performance between the test minima and

the end of training was very slight. All results reported are from the end of

training.


Table 5 shows the Root Mean Squared Error (Haykin, 1994) and the per-

centage of correctly predicted transitions for each of the two networks on its training corpus and test corpus. The very low error rates on the test corpus suggest

that each network has successfully generalized to a language with similar struc-

ture to that of the target. The analysis of the following section reveals that

this performance degrades when deeper levels of embedding are tested, but it

also shows that the successful networks have adopted solutions that are struc-

turally identical to those of the corresponding PDDAs. In this sense, they have

successfully identified the intended, infinite-state language.

3.3.2 State Space Analysis

In order to process a language correctly, the network must assign states with

different futures to different locations in the recurrent hidden unit space (Hid-

den1). Therefore, it is helpful to divide the Hidden1 states into equivalence

classes of states with identical futures. Figure 4 provides one view of such a

partition. It was generated from the Hidden1 states of a network that learned

Language 1 successfully. Each labeled point in the figure marks the mean of

the points associated with the stack state that labels it (i.e. it marks the mean

of a set with identical futures). Figure 4 was derived from the set of all states

the network visited while processing the test corpus (such a set derived from a

corpus will henceforth be referred to as the Visitation Set of the corpus. The

set of vector means over states with common futures derived from this set will

be called the Mean Visitation Set or MVS.). Figure 4 is analogous to Figure

1a. Indeed, it is clear from comparison of the figures that the trained net-

work has discovered an analogous fractal organization to that of the PDDA.

The branches of the fractal have been shaded to distinguish the two different

immediate futures that are associated with all points but the initial point in

Language 1. These two immediate futures are [0.2 a, 0.8 b, 0.0 c] and [0.2 a, 0.0 b, 0.8 c], where the decimal numbers indicate probabilities and the letters identify next-symbols. For convenience, I will refer to these two branches of the fractal as the “a-branch” and the “b-branch”, respectively.


Figure 4: State means for a sample FLNN at the end of training, plotted in the $z_1 \times z_2$ plane. The (labeled) stack states are located at the apices of downward-pointing triangles. Horizontal shading marks the a-branch of the fractal and vertical shading marks the b-branch. This figure is analogous to Figure 1a.

Every other successful

network had an identically structured set of means.

Two important properties distinguish PDDAs from DAs in general: (a)

the regions (or branches) of the fractal that correspond to different immedi-

ate future states are nonoverlapping; (b) symbolic machine cycles correspond

to dynamical machine cycles (see Appendix 1). Peter Tino (p.c.) refers to

constraint (b) as the Generalized Fixed Point Condition (GFPC). It says, in

essence, that in situations where a PDA makes a loop in its state space, the

corresponding PDDA will too. To understand the process by which a FLNN

develops its solution, it is helpful to examine properties (a) and (b) separately.


a. Overlap of Branches. Figure 5 tracks the overlap of fractal branches

during the course of training for the same successful network depicted in Figure

4. In this figure, the stack state labels have been removed to make viewing

easier. Figure 5 depicts the MVS associated with a test set of all palindromes

up to 5 levels of embedding. At the beginning of training, the MVS is a single

point because all of the Input-to-Hidden1 weights are initialized to 0. After

training has proceeded for a while, the MVS has expanded into an infinite

lattice with high overlap between the branches (Figure 5a). This infinite-lattice

is the reflex of the unit initial weights in Hidden1, which imply no contraction

or expansion. Over the course of training, the branches shrink to a finite size

and gradually spread apart (Figure 5b), eventually reaching a nonoverlapping

final state (Figure 5c, which shows the same points as Figure 4). The fact

that once the fractal branches come into existence, they monotonically shrink

and separate is an indication that the dynamics of the learning process are

dominated by an attractor with the topological properties of the PDDA 1

solution.

b. Generalized Fixed Point Condition (GFPC). Figure 6 shows the distri-

bution of points associated with each common future at the end of training.

Figure 6a was based on points from the Training Corpus alone (up to 3 levels

of recursion). In addition to showing the means of points with identical fu-

tures at the apices of downward pointing triangles, each set of points with a

common future has a polygonal simplex drawn around its periphery. Because

the training set was learned with high accuracy, these polygons appear as sin-

gle points coincident with their means for most levels of recursion. Only the

points associated with the 0’th level are visibly separated, and surrounded in

the figure by a small triangular simplex. The contours indicate the network’s

interpretation of the error surface at the end of training. This surface was

arrived at by computing the Hidden1-to-Output map for each point $\vec{z}$ within

the plotted region of Hidden1. For each resulting point in Output space, its

distance to the three grammar-derived distributions was computed. The height

of the induced error surface at ~z was taken to be the minimum of these three


Figure 5: The evolution of the state means over the course of training a typical FLNN, plotted in the $z_1 \times z_2$ plane. Each graph depicts the MVS derived from the test corpus of all palindromes with ≤ 5 levels of recursion. The MVS is shown after the network has traveled 10000 steps along the gradient (a), 13500 steps (b), and 25000 steps (c). Vertical shading marks the a-branch and horizontal shading marks the b-branch of the fractal in each panel.


Figure 6: Visited points grouped by source state for (a) the training set (three levels of recursion) and (b) the test set (five levels of recursion), plotted in the $z_1 \times z_2$ plane. The data are from the same run that is depicted in Figures 4 and 5. The contours show the induced classification surface.

distances. Thus the basin structure of Figure 6a indicates how the network

classifies points in its Hidden1 space. Indeed, each training point lies well

within the basin of the correct probability distribution, in keeping with the

fact that the network made no classification errors on the training corpus at

the end of training. However, the positive area of the triangles (visible at the

0’th level in the figure) indicates that the network is not perfectly obeying the

GFPC: it does not always return to exactly the same point after processing

a sequence of symbols that would return the corresponding PDA to the same

stack-state.

Figure 6b displays the consequence of this failure in the case of processing


more deeply embedded strings. It is identical to Figure 6a except that the

Visitation Set on which it is based included the shortest palindromes from

each level of recursion up to the maximum level contained in the test corpus

(5 levels). Displacements (with respect to the GFPC) at high levels of recursion

grow exponentially as the network state scales back up to the 0 level. Thus

the set of points that are supposed to coincide with the initial state shows

the highest level of distortion. Indeed, some of the initial points and some of

the b-predictions land in the wrong basin for this particular network. Such

basin transgressions are the source of the errors on the test corpora reported

above. Although this distortion implies that the network will make many errors

on sentences with very deep embeddings, the fact that the network completely

separates the branches of the fractal indicates that it has successfully captured

the structure of the PDDA solution.

4 Conclusions

Fractal Learning Neural Networks appear to be able to reliably discover, via

gradient descent exploration, the information structure of a time series gener-

ated by several exponential state-growth context free languages. These net-

works thus show promise of helping to understand, in a principled way, neural

network learning of the kind of complex processes for which symbolic models

are so widely used.

Several prior studies have shown that SRNs can learn some fairly com-

plex natural language patterns (Allen and Seidenberg, 1999; Rohde and Plaut,

1999). But the utility of these findings for researchers of language phenomena

will be limited in the long run if we cannot relate them to the structural un-

derstanding provided by current symbolic analyses of language. One valuable

aspect of the analysis in the preceding section is that it reveals the principle

behind the network’s solution to the learning problem and it sets the stage for

identifying precise correlates of the symbolic entities that are the bread and

butter of current linguistic understanding of language structure.

Several avenues of future work look promising. As noted in the section,


Training Procedure, it may be helpful to pursue a more efficient and accurate

method of computing the error gradient—e.g., LSTM (Gers and Schmidhu-

ber, 2001; Hochreiter and Schmidhuber, 1997). A topic of current work is to

improve FLNN generalization accuracy in cases of deep embedding by using

the information in the gradient to drive a more perfect approximation of the

GFPC. It will also be helpful to consider a wider range of languages than the

ones reported on here, including languages that require using a stack with

more than two symbols, non-context-free languages, languages for which a PDA with independent control states is needed, languages with lexical am-

biguity (the same word causes different manipulations of the stack, depending

on context), and languages with structural ambiguity (the stack state is not

uniquely determined by the input).

Much research on neural network learning focuses on the micro-level of the

learning mechanism; less has focused on the macro-level of the representations

that are to be learned. Indeed, these representations are often inscrutable in

nets trained on complex problems. The present results suggest that it is both

possible and valuable to study representations in order to guide the search for

an effective learning mechanism in complex domains.

5 Appendix 1: Formal Language Notation and Background

Language. I use the term “language” here in the sense of formal language

theory (Hopcroft and Ullman, 1979): a language L is a (possibly infinite) set

of finite-length strings of symbols drawn from a finite alphabet. The strings in

L are called grammatical and all other strings are called ungrammatical with

respect to L. A symbol w from the alphabet is said to be a grammatical con-

tinuation of a string s if the concatenation [sw] forms an initial substring of

some grammatical string of the language. A machine is said to be a recognizer

of a language L if it can determine, for each finite-length string s, whether

s ∈ L. A machine is said to be a generator of a language L if it can list all


and only the strings of L. Generally, RNNs recognize and generate languages

by identifying grammatical continuations of symbol strings. I will refer to the

identification of the grammatical continuations of a symbol string as the (cor-

rect) processing of the string. The sequence of states that a machine occupies

during the correct processing of a string is called the parse of the string. In the

text, I define a corresponding notion for languages in which symbol transitions

are associated with probabilities rather than absolute possibilities.

The formalization in the remainder of this appendix is based on that found

in (Hopcroft and Ullman, 1979) but varies slightly from it in order to make

some of the terms in the text more transparent.

For C a set, the expression C∗ refers to the set of finite sequences of el-

ements of C, including the null sequence. This convention is called “Kleene

star notation”. The null sequence is denoted by “ε”.

Finite-State Automata. A finite-state automaton (FSA), F , is a computer

which can be in only a finite number of states. Formally, a finite-state machine

is denoted by a 5-tuple, (Q, Σ, δ, q0, F ), where Q is a finite set of control states,

Σ is the input alphabet, which consists of the symbols (words) in the language,

q0 ∈ Q is the initial control state, F ⊆ Q is the set of final control states, and δ, a function which maps from Q × Σ to Q, is called the transition function.

If F is in the state q1 and δ contains the map δ(q1, w) → q2, for w ∈ Σ and

q2 ∈ Q, then the occurrence of w is said to be a grammatical transition under

F . If w is processed under such conditions, then the state of F changes from

q1 to q2. The set of symbol sequences s ∈ Σ∗ whose symbols bring F from its

initial state to one of its final states via grammatical transitions when they are

processed in sequence is called the language, L, of F .

Pushdown Automata. A push-down automaton (PDA), P , is denoted by

a 7-tuple, (Q, Σ, Γ, δ, q0, Z0, F ), where Q, Σ, and q0 are defined as for a finite

state machine. Γ is a second alphabet called the stack alphabet. Z0 ∈ Γ is

the initial stack state. δ is a map from Q × (Σ ∪ {ε}) × Γ to finite subsets of

Q×Γ∗. An element S of Γ∗ is called a stack state. I adopt the convention that

the rightmost element of S is called the top of stack and is denoted top(S).


F ⊆ Q× Γ∗ is the set of final states of P .

If P is in a state (q1, S1) where q1 ∈ Q is a control state and S1 ∈ Γ∗ is

a stack state, and δ contains the map δ(q1, w, top(S1)) → (q2, γ) for w ∈ Σ

and γ ∈ Γ∗, then the occurrence of w is said to be a grammatical transition

under P . If w is processed under such conditions, then the state of P changes

from (q1, S1) to (q2, S2), where S2 is derived from S1 by replacing top(S1)

with γ. The set of symbol sequences s ∈ Σ∗ whose symbols bring P from its

initial state to one of its final states via grammatical transitions when they

are processed in sequence is called the language of P . A PDA amounts to a

FSA in combination with a pushdown stack, a memory consisting of a string

of symbols which, at any point in time, can be manipulated only by changing

(i.e. replacing or deleting) the most-recently-added symbol.

A theorem states that the languages that can be processed by FSAs are

a proper subset of the languages that can be processed by PDAs (Hopcroft

and Ullman, 1979). This implies that some PDAs are infinite-state machines,

in the sense that there is no bound on the number of states they can occupy

during grammatical processing.

The addition, by a PDA, of a symbol to its stack is called pushing the

symbol onto the stack. The removal of the symbol is called popping the stack.

The number of symbols that a PDA has on its stack at a particular point in

time is called its current level of recursion. The maximum current level of

recursion that a particular PDA, Pi, achieves during the course of assigning

parse p to a sentence s is called the level of recursion of parse p of s under Pi.

Since the set of all PDAs is countable, there is a smallest PDA PL (measured

in terms of the number of symbols in the formalization just specified) for each

PDA language L. Let the level of recursion of a sentence s ∈ L be the smallest

level of recursion of s under PL. For example, the level of recursion of “a x y

b” in Language 2 in the text is 2.
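As a small worked example of these definitions (my own illustration, not drawn from the paper), here is a PDA for $a^nb^n$ written in the appendix's conventions: the top of the stack is the rightmost symbol, and δ maps a (control state, input, top-of-stack) triple to a set of (new control state, replacement string) pairs.

```python
# delta for a PDA recognising a^n b^n: Z marks the stack bottom, A counts a's.
delta = {
    ("q0", "a", "Z"): {("q0", "ZA")},   # push an A for each a
    ("q0", "a", "A"): {("q0", "AA")},
    ("q0", "b", "A"): {("q1", "")},     # pop an A for each b
    ("q1", "b", "A"): {("q1", "")},
}
final = {("q1", "Z")}                   # every a matched by a b

def pda_accepts(string, state="q0", stack="Z"):
    # Nondeterministic acceptance by exhaustive search over delta.
    if not string:
        return (state, stack) in final
    moves = delta.get((state, string[0], stack[-1]), set())
    return any(pda_accepts(string[1:], q2, stack[:-1] + gamma)
               for q2, gamma in moves)

print(pda_accepts("aabb"), pda_accepts("aab"))   # True False
```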

Context-Free Grammar. A context-free grammar (CFG), G, is a 4-tuple,

(V, T, P,R) where V is a finite set of abstract nodes, T is a finite set of terminal

nodes, P is a finite set of productions of the form A → α, where A is an abstract


node and α is a member of (V ∪ T )∗, and R ∈ V is a special abstract node

called the root node. If β1 ∈ (V ∪ T )∗ contains an instance Ai of the abstract

node, A ∈ V , and G contains the production, A → α, as defined above, then

G derives β2 from β1, where β2 is the string that results when Ai in β1 is

replaced by α. Derivation is assumed to be a transitive relation. s ∈ T ∗ is a

(grammatical) sentence of G if s can be derived from R. The set of sentences

of G is called the language generated by G. A language generated by a CFG

is called a Context Free Language (CFL).

A theorem states that the languages that can be processed by PDAs are

exactly the CFLs (Hopcroft and Ullman, 1979).

Cycles. If r is a particular subsequence of a grammatical sentence s of

language L, processed by the symbolic machine P (e.g. a finite state automaton

or a pushdown automaton) and if P is in state qP just before starting to process

r and it returns to state qP after processing r, then r is said to generate a

symbolic cycle under P . Likewise, if D is a FLNN which processes sequences

from the alphabet of L and if, when processing s, D is in state qD just before

starting to process r and it returns to state qD after processing r, then r is said

to generate a dynamical cycle under D. D obeys the Generalized Fixed Point

Condition (GFPC) with L if all the cycles of the minimal symbolic machine

for L correspond to cycles of D.

6 Appendix 2: Definition of Pushdown Dynamical Automaton (PDDA)

Def. A Dynamical Automaton (DA) is a 7-tuple, M:

M = (X, F, P, Σ, IM,O, FR) (8)

(1) The space, X, is a complete metric space.

(2) The function list, F , consists of a finite number of functions,

w1, . . . wN where wi : X → X for each i ∈ {1, . . . , N}.


(3) The partition, P , is a finite partition of X with compartments,

m1, . . . , mK .

(4) The input list is a finite set of symbols drawn from an alphabet,

Σ.

(5) The input map is a three-place relation, IM : P × Σ × F → {0, 1}, which specifies for each compartment, m, and each input

symbol, s, what function(s) can be performed if symbol s is pre-

sented when the system state is in compartment m. If IM(m, s, f) =

0 for all f ∈ F then symbol s cannot be presented when the system

is in compartment m.

(6) The machine always starts at the start state, O ∈ X. If, as

successive symbols from an input string are presented, it makes

transitions consistent with the Input Map, and arrives in the final

region, FR, then the input string is accepted.
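The following Python sketch shows how the pieces of this 7-tuple fit together in a recognizer (illustrative only: the class name and the deterministic choice of the first permitted function are simplifications; the definition allows IM to license several functions for a given compartment and symbol):

class DynamicalAutomaton:
    def __init__(self, functions, compartment_of, input_map, start, final_region):
        self.functions = functions            # F: list of maps from X to X
        self.compartment_of = compartment_of  # P: maps a state to its compartment label
        self.input_map = input_map            # IM: (compartment, symbol) -> indices of permitted functions
        self.start = start                    # O: the start state
        self.final_region = final_region      # FR: predicate that is true on the final region

    def accepts(self, symbols):
        x = self.start
        for s in symbols:
            permitted = self.input_map.get((self.compartment_of(x), s), [])
            if not permitted:
                return False                  # symbol s cannot be presented in this compartment
            x = self.functions[permitted[0]](x)  # take the first permitted function
        return self.final_region(x)

A concrete recognizer built in the same spirit, specialized to the PDDA definition, is sketched after the theorem at the end of this appendix.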

Def. A generalized iterated function system (GIFS) consists of a complete

metric space (X, d) together with a finite set of functions, wi : X → X,

i ∈ {1, . . . , N}.

Def. Let X be a metric space with GIFS S = {X; w1 . . . wN}. A set O ⊂ X

is called a pooling set of S if it satisfies the following:

(i) wi(O) ∩ wj(O) = ∅ for i, j ∈ {1, . . . , N} and i ≠ j.

(ii) w1(O) ∪ · · · ∪ wN(O) ⊂ O.

The set of points in O that are not in w1(O) ∪ · · · ∪ wN(O) is called the crest of O.
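For instance (a worked example, not from the paper): take X = [0, 1] with w1(x) = x/3 and w2(x) = x/3 + 2/3. Then O = [0, 1] is a pooling set, since w1(O) = [0, 1/3] and w2(O) = [2/3, 1] are disjoint and their union lies inside O; the crest of O is the middle interval (1/3, 2/3). A short numerical check in Python:

import numpy as np

w1 = lambda x: x / 3.0
w2 = lambda x: x / 3.0 + 2.0 / 3.0

O = np.linspace(0.0, 1.0, 1001)                      # a fine sample of O = [0, 1]
image1, image2 = w1(O), w2(O)

assert image1.max() < image2.min()                   # (i) the images are disjoint
assert image1.min() >= 0.0 and image2.max() <= 1.0   # (ii) their union is contained in O
# Points of O lying in neither image form the crest: here, (1/3, 2/3).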

Def. Let S = {X; w1 . . . wN} be a GIFS with cascade point x0 and corre-

sponding cascade C. Then wi : C → C is called a push function on C.

Def. Let S = {X; w1 . . . wN} be a GIFS with cascade point x0 and correspond-

ing cascade C. Let Y = {x ∈ C : topC(x) = i for i ∈ {1, . . . , N}}. Suppose wi

is invertible on Y. Then the function f : Y → C such that f(x) = wi⁻¹(x) is

called a pop function on C. (This definition was incorrectly stated in (Tabor,

2000).)
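A minimal sketch of push and pop functions on such a cascade (illustrative; the particular maps w1(x) = x/2 and w2(x) = x/2 + 1/2 and the cascade point x0 = 1/2 are chosen here for simplicity and are not the parameters used in the simulations):

X0 = 0.5                                   # cascade point: the empty stack

def push(x, symbol):
    """Push symbol 1 via w1(x) = x/2, or symbol 2 via w2(x) = x/2 + 1/2."""
    return x / 2.0 if symbol == 1 else x / 2.0 + 0.5

def top(x):
    """Which image of the cascade x lies in (only meaningful for a non-empty stack)."""
    return 1 if x < 0.5 else 2

def pop(x):
    """Invert whichever push function produced the current top symbol."""
    return 2.0 * x if top(x) == 1 else 2.0 * x - 1.0

x = X0
for s in [2, 1, 1]:                        # push 2, then 1, then 1
    x = push(x, s)
print(top(x))                              # 1: the most recently pushed symbol
x = pop(pop(pop(x)))
print(x == X0)                             # True: back at the empty stack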


Def. Let S = {X; w1 . . . wn} be a GIFS. Let C1 and C2 be disjoint cascades

under S with cascade points x10 and x20 respectively. Then the function f :

C1 → C2 such that for all x ∈ C1 the x10-address of x is equal to the x20-

address of f(x) is called a switch function. Note that f is one-to-one and onto,

and hence invertible for all x′ ∈ C2.

Def. If C1, . . . , CK are cascades on GIFS S satisfying

Ci ∩ Cj = ∅ for i ≠ j

(i.e., they are disjoint) and x ∈ Ci for i ∈ {1, . . . , K}, then i is called the index

of x with respect to the set {C1, . . . , CK}.

Def. Let M be a dynamical automaton on metric space X. We say M

is a pushdown dynamical automaton (PDDA) if there exists a GIFS, S =

{X; w1, . . . , wN} with cascade points x10, x20, . . . , xK0 ∈ X, K ∈ {1, 2, 3, . . .} and corresponding cascades, C1, C2, . . . , CK such that

(i) C1, C2, . . . , CK are disjoint.

(ii) For x ∈ C1 ∪ · · · ∪ CK, the partition compartment of x is determined

by the conjunction of the index, i, of x and topCi(x).

(iii) Let m be a compartment of the partition of M. Each function

f : m → X in the input map is either a stack function (i.e., a push or a pop function), a switch

function, or a composition of the two when restricted to points on

one of the cascades.

(iv) The start state, O, and final region, FR, of M are contained

in the set of cascade points, {x10, . . . , xK0}.

A theorem states that the languages processed by PDDAs are exactly the

CFLs (Tabor, 2000).
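As an illustration of the definition (a hand-built sketch in Python, not one of the paper's trained networks), the following recognizer for aⁿbⁿ (n ≥ 1) uses a state (x, phase): x carries a one-symbol fractal stack with push w(x) = x/2 and pop w⁻¹(x) = 2x, and the two values of phase play the role of two disjoint cascades related by a switch function. The partition compartment is determined by phase together with whether the stack is empty (x = 1, the cascade point), and the start state and final region are cascade points, as conditions (i)-(iv) require.

START = (1.0, 0)        # O: cascade point of the first cascade (reading a's)
FINAL = (1.0, 1)        # FR: cascade point of the second cascade (finished reading b's)

def step(state, symbol):
    """Apply the input map; return the next state, or None if the symbol
    cannot be presented in the current compartment."""
    x, phase = state
    stack_empty = (x == 1.0)
    if symbol == "a" and phase == 0:
        return (x / 2.0, 0)        # push
    if symbol == "b" and phase == 0 and not stack_empty:
        return (2.0 * x, 1)        # switch to the second cascade composed with a pop
    if symbol == "b" and phase == 1 and not stack_empty:
        return (2.0 * x, 1)        # pop
    return None

def accepts(string):
    state = START
    for symbol in string:
        state = step(state, symbol)
        if state is None:
            return False
    return state == FINAL

print([s for s in ["ab", "aabb", "aaabbb", "abb", "aab", "abab"] if accepts(s)])
# ['ab', 'aabb', 'aaabbb']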

Acknowledgements

This work was partially supported by a grant from the University of Connecti-

cut Research Foundation.


References

Allen, J. and M. S. Seidenberg (1999). Grammaticality judgment and apha-

sia: a connectionist account. In B. MacWhinney (Ed.), The Emergence

of Language. Mahwah, NJ: Lawrence Erlbaum.

Barnsley, M. (1988). Fractals Everywhere. Boston: Academic Press.

Boden, M. and J. Wiles (2000). Context-free and context sensitive dynamics

in recurrent neural networks. Connection Science 12 (3), 197–210.

Boden, M. and J. Wiles (2002). On learning context-free and context-

sensitive languages. IEEE Transactions on Neural Networks 13 (2), 491–

493.

Chalup, S. and A. D. Blair (1999). Hill climbing in recurrent neural networks

for learning the aⁿbⁿcⁿ language. In Proceedings of the Sixth International

Conference on Neural Information Processing, pp. 508–513.

Chomsky, N. (1956). Three models for the description of language. IRE

Transactions on Information Theory 2 (3), 113–124. A corrected version

appears in Luce, Bush, and Galanter, eds., 1965 Readings in Mathemat-

ical Psychology, Vol. 2.

Christiansen, M. H. and N. Chater (1999). Toward a connectionist model of

recursion in human linguistic performance. Cognitive Science 23, 157–

205.

Crutchfield, J. P. (1994). The calculi of emergence: Computation, dynamics,

and induction. Physica D 75, 11–54. In the special issue on the Proceed-

ings of the Oji International Seminar, Complex Systems—from Complex

Dynamics to Artificial Reality.

Elman, J. L. (1990). Finding structure in time. Cognitive Science 14, 179–

211.

Gers, F. A. and J. Schmidhuber (2001). LSTM recurrent networks learn

simple context-free and context-sensitive languages. IEEE Transactions

on Neural Networks 12 (6), 1333–1340.


Haykin, S. S. (1994). Neural Networks: A Comprehensive Foundation. New

York: Macmillan.

Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma

thesis, Technische Universität München. Available online at

www7.informatik.tu-muenchen.de/~hochreit.

Hochreiter, S. (1997). The vanishing gradient problem during learning re-

current neural nets and problem solutions. International Journal of Un-

certainty, Fuzziness, and Knowledge-Based Systems 6 (2), 107–116.

Hochreiter, S. and J. Schmidhuber (1997). Long short-term memory. Neural

Computation 9 (8), 1735–1780.

Holldobler, S., Y. Kalinke, and H. Lehmann (1997). Designing a counter:

Another case study of dynamics and activation landscapes in recurrent

networks. In Advances in Artificial Intelligence, pp. 313–324. Berlin; New

York: Springer.

Hopcroft, J. E. and J. D. Ullman (1979). Introduction to Automata Theory,

Languages, and Computation. Menlo Park, California: Addison-Wesley.

Kremer, S. C. (1996). A theory of grammatical induction in the connectionist

paradigm. PhD Thesis, Department of Computing Science, Edmonton,

Alberta.

Melnik, O., S. Levy, and J. B. Pollack (2000). RAAM for infinite context-

free languages. In Proceedings of the International Joint Conference on

Artificial Neural Networks (IJCNN). IEEE.

Moore, C. (1998). Dynamical recognizers: Real-time language recognition

by analog computers. Theoretical Computer Science 201, 99–136.

Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent

neural networks: A survey. IEEE Transactions on Neural Networks 6 (5),

1212–1228.

Pollack, J. (1987). On connectionist models of natural language processing.

Unpublished doctoral dissertation, University of Illinois.


Pollack, J. B. (1991). The induction of dynamical recognizers. Machine

Learning 7, 227–252.

Robinson, A. J. and F. Fallside (1987). The utility driven dynamic error

propagation network. Technical Report CUED/F-INFENG/TR.1, Cam-

bridge Univ. Engineering Department.

Rodriguez, P. (2001). Simple recurrent networks learn context-free and

context-sensitive languages by counting. Neural Computation 13 (9).

Rodriguez, P. and J. Wiles (1998). Recurrent neural networks can learn

to implement symbol-sensitive counting. In M. Jordan, M. Kearns, and

S. Solla (Eds.), Advances in Neural Information Processing Systems 10,

pp. 87–93. Cambridge, MA: MIT Press.

Rohde, D. and D. Plaut (1999). Language acquisition in the absence of

explicit negative evidence: How important is starting small? Journal of

Memory and Language 72, 67–109.

Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986). Learning in-

ternal representations by error propagation. In D. E. Rumelhart, J. L.

McClelland, and the PDP Research Group (Eds.), Parallel Distributed

Processing, v. 1, pp. 318–362. MIT Press.

Schmidhuber, J. (1989). The neural bucket brigade: a local learning al-

gorithm for dynamic feedforward and recurrent networks. Connection

Science 1 (4), 403–412.

Siegelmann, H. (1996). The simple dynamics of super Turing theories. The-

oretical Computer Science 168, 461–472.

Sun, G. Z., H. H. Chen, C. L. Giles, Y. C. Lee, and D. Chen (1990).

Connectionist pushdown automata that learn context-free grammars. In

M. Caudill (Ed.), Proceedings of the International Joint Conference on

Neural Networks, pp. 577–580. Hillsdale, NJ: Lawrence Erlbaum.

Tabor, W. (2000). Fractal encoding of context-free grammars in connection-

ist networks. Expert Systems: The International Journal of Knowledge

Engineering and Neural Networks 17 (1), 41–56.


Tabor, W. (2002). The value of symbolic computation. Ecological Psychol-

ogy 14 (1/2), 21–52.

Tabor, W. (2003). Learning exponential state growth languages by hill

climbing. IEEE Transactions on Neural Networks 14 (2), 444–446.

Tino, P. (1999). Spatial representation of symbolic sequences through itera-

tive function systems. IEEE Transactions on Systems, Man, and Cyber-

netics Part A: Systems and Humans 29 (4), 386–392.

Werbos, P. J. (1974). Beyond regression: New tools for prediction and anal-

ysis in the behavioral sciences. Unpublished Doctoral Dissertation, Har-

vard University, Cambridge, MA.

Wiles, J. and J. Elman (1995). Landscapes in recurrent networks. In J. D.

Moore and J. F. Lehman (Eds.), Proceedings of the 17th Annual Cogni-

tive Science Conference. Lawrence Erlbaum Associates.

Williams, R. J. and J. Peng (1990). An efficient gradient-based algorithm for

on-line training of recurrent network trajectories. Neural Computation 2,

490–501.

Williams, R. J. and D. Zipser (1995). Gradient-based learning algorithms

for recurrent networks. In Y. Chauvin and D. E. Rumelhart (Eds.),

Backpropagation: Theory, Architectures, and Applications, pp. 433–486.

Lawrence Erlbaum Associates.
