Fractal Learning Neural Networks
Whitney Tabor
Department of Psychology
University of Connecticut
Storrs, CT 06269 USA
Phone: (860) 486-4910
email: whitney.tabor@uconn.edu
Abstract
One of the fundamental challenges to using recurrent neural net-
works (RNNs) for learning natural languages is the difficulty they have
with learning complex temporal dependencies in symbol sequences. Re-
cently, substantial headway has been made in training RNNs to process
languages which can be handled by the use of one or more symbol-
counting stacks (e.g., $a^nb^n$, $a^nA^mB^mb^n$, $a^nb^nc^n$) with compelling cases
made that the networks are solving the problem in principle, not simply
using finite-state methods to approximate the training data—e.g., (Gers
and Schmidhuber, 2001; Boden and Wiles, 2002). Success on cases that
require stacks to function as full-fledged sequence memories (e.g., palin-
drome languages) has not been as great. On the other hand, (Moore,
1998), (Siegelmann, 1996), and (Tabor, 2000) describe methods of us-
ing fractal sets to keep track of arbitrary stack computations in neural
units with bounded, rational-valued activation values. The present pa-
per combines these hand-wiring methods with the work on learning by
describing Fractal Learning Neural Networks (FLNNs), which discover
such fractal encodings via hill climbing in weight space. Simulations
show that FLNNs can learn several context free languages that cannot
be processed by a finite set of input-symbol counters.
1 Introduction
Recurrent neural networks (RNNs) of sufficient size can implement Turing
machines (Pollack, 1987; Kremer, 1996; Siegelmann, 1996) and thus perform
the same computations as symbolic computing mechanisms. It has proved
difficult, however, to successfully train RNNs to learn infinite-state languages.
This paper focuses on RNN learning of context-free languages (CFLs), the
smallest class of infinite state languages on the Chomsky Hierarchy (Chomsky,
1956). Appendix 1 provides definitions of terms like “context free language”
which come from the literature on formal automata. The central training
results of this paper were synopsized in (Tabor, 2003). The current work
confirms the correspondence between the context free process and the neural
network process by analyzing the discrepancies between them.
An early effort to let a sequential neural symbol processor induce a context
free language used a neural controller to manipulate a separate neural stack
(Sun, Chen, Giles, Lee, and Chen, 1990). Recently, several projects have been
able to induce the stack mechanism itself, as well as the controller, in the
network’s hidden units. Successful results have been obtained for languages
that involve using one or more stacks as “counters”, which keep track of the
length of a substring of input elements. For example, a Simple Recurrent
Network (SRN) trained by (Wiles and Elman, 1995) was able to correctly
learn the corpus of sentences from the language $a^nb^n$ (this language consists
of the sentences, “a b”, “a a b b”, “a a a b b b”, etc.) with up to 11 levels
of embedding and to generalize up to 18 levels. (Boden and Wiles, 2000) and
(Boden and Wiles, 2002) found that the Sequential Cascaded Network (SCN)
of (Pollack, 1991) achieved generalization considerably beyond the training set
on $a^nb^nc^n$, a context-sensitive language.
Because infinite-state languages involve arbitrarily long temporal depen-
dencies, these projects have struggled with the well-known exponential decay
of the error signal over time in standard RNN training regimens—see (Hochre-
iter, 1991; Hochreiter, 1997) for a formal analysis of this problem. There-
fore, (Hochreiter and Schmidhuber, 1997) developed Long Short-Term Mem-
ory (LSTM) networks, which employ a special kind of unit that maintains con-
stant error flow across arbitrarily long temporal dependencies. Indeed, (Gers
and Schmidhuber, 2001) found that LSTM networks trained on examples from
$a^nb^n$ and $a^nb^nc^n$ with at most tens of levels of embedding generalized perfectly
to thousands of levels (see Table 1). LSTM networks also did well with the “re-
stricted palindrome language,”¹ $a^nA^mB^mb^n$, far exceeding the performance of
previous RNNs. Moreover, the LSTM learning algorithm is local in space and
time in the sense of (Schmidhuber, 1989). The LSTM paradigm is thus cur-
rently the most successful neural network paradigm for learning infinite state
languages.
Whereas formal automata are naturally suited to modeling symbol se-
quences whose structure is specified exactly by finitely-many rules, neural net-
works are well suited to modeling noisy environments where statistical methods
are appropriate. Since symbol-processing RNNs combine these viewpoints, it
is helpful to establish a terminology for talking about probabilities of symbol
sequences.
Let a corpus-distribution be a map from each prefix (i.e., sequence of words
from the vocabulary) to a probability distribution over next-words. The prob-
ability of a word sequence is the product of the probabilities of the successive
word-to-word transitions within it (i.e., the successive word choices are inde-
pendently sampled). A word sequence is a grammatical string of a corpus-
distribution if it has nonzero probability. A machine correctly processes a
grammatical string from a corpus distribution if, upon being presented with
each symbol of the string in succession, it accurately specifies the probabilities
of the symbols that follow. An output activation vector accurately specifies a
probability distribution if it is closer to the correct distribution than to any
other distribution associated with a grammatical symbol transition in the cor-
pus distribution. This method of assessing correctness is equivalent to the
method used by a number of researchers in cases where only one next-word
is possible: count the prediction as correct if the activation of the unit asso-
ciated with the correct next word is above 0.5 (Rodriguez, 2001; Boden and
Wiles, 2000; Boden and Wiles, 2002; Rodriguez and Wiles, 1998). The current
method has the advantage of also being useful in cases where probabilities are
involved. Table 1 cites best published performance levels to date, measured as
percent accurate specifications in a sample corpus.

¹The term is due to (Rodriguez and Wiles, 1998).
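To make this concrete, here is a minimal sketch in Python of a corpus-distribution and the correctness criterion just described. The toy prefix table for a fragment of $a^nb^n$, including its probabilities and the "$" end-of-sentence marker, is an illustrative assumption, not data from this paper:

```python
import math

# Hypothetical toy corpus-distribution for a fragment of a^n b^n: each prefix
# (a tuple of words) maps to a probability distribution over next-words;
# "$" is an assumed end-of-sentence marker.
corpus_dist = {
    (): {"a": 1.0},
    ("a",): {"a": 0.3, "b": 0.7},
    ("a", "a"): {"b": 1.0},
    ("a", "b"): {"$": 1.0},
    ("a", "a", "b"): {"b": 1.0},
    ("a", "a", "b", "b"): {"$": 1.0},
}

def string_probability(words):
    """Product of the probabilities of the successive word-to-word transitions."""
    p = 1.0
    for i, w in enumerate(words):
        p *= corpus_dist.get(tuple(words[:i]), {}).get(w, 0.0)
    return p

def distance(p, q):
    """Euclidean distance between two next-word distributions."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))

def accurately_specifies(output, prefix):
    """True if `output` is closer to the correct next-word distribution than
    to any other distribution associated with a grammatical transition."""
    correct = corpus_dist[tuple(prefix)]
    others = [d for d in corpus_dist.values() if d != correct]
    return all(distance(output, correct) < distance(output, d) for d in others)

print(string_probability(["a", "a", "b", "b"]))             # 0.3
print(accurately_specifies({"a": 0.25, "b": 0.75}, ["a"]))  # True
```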
The languages mentioned so far constitute a proper subset of stack-manipulation
languages. In particular, all of them can be modeled with a device that counts
stack symbols on one or more stacks (Gers and Schmidhuber, 2001; Rodriguez,
2001); none require keeping track of arbitrary orders of the symbols on the
stack. For example, to process $a^nb^nc^n$, one stack can count up as the a’s occur
and then count back down as the b’s occur; a second stack can count up as the
b’s occur and then count back down as the c’s occur. There are many stack-
manipulation languages, even within the set of context-free languages, which
require keeping track of the order, as well as the number, of elements on the
stack. For example, each sentence in $wW^R$ (a palindrome language) can be
divided in half such that each half is a mirror-image (under homomorphism)
of the other half—e.g. “a b a a A A B A” (see Table 1).
RNN learning of languages that require keeping track of stack order has
not been very successful. With considerable effort, (Rodriguez, 2001) was
able to get an SRN to make some correct predictions about an unrestricted
palindrome language ($wW^R$), but it only mastered 78% of its training strings
by (Rodriguez, 2001)’s measure, in which the net had to correctly process
every transition within a string to be counted as getting the string correct. To
process wWR, it is necessary for the network to remember not only how many
symbols are on the stack, but the order in which they occurred. (Tabor, 2002)
trained a set of 10 SRNs with four hidden units each on the crossed-serial
dependency language, $wW$ (see Table 1), another language for which keeping
track of stack order is essential. Although the focus of that work was not on
trying to achieve high accuracy, the very poor performance of the network in
that study suggests that it is not straightforward to get an SRN to learn $wW$.
Table 1: State-growth rates for several languages. The column “State-
Growth” gives the minimal number of states a system must distinguish in
order to correctly classify all strings of length L or less (for those L for
which such strings exist). $wW$ is the language in which an arbitrary sequence
of symbols, $w = w_1w_2\ldots w_n$, must be followed by an equal-length sequence,
$W = W_1W_2\ldots W_n$, and each correspondence $w_i \leftrightarrow W_i$ is an instance of either
$a \leftrightarrow A$ or $b \leftrightarrow B$ (a crossed serial dependency language). $wW^R$ is identical to
$wW$ except that the order of the $W_i$'s is reversed (a palindrome language).
Under “Best Performance”, “x% at L ≤ p” means the network accurately spec-
ified x% of the symbol transitions when tested on all sentences up to length
p.

Language          State-Growth              Best performance    Training   Net    Source
$a^nb^n$          $L$                       100% at L ≤ 2000    L ≤ 20     LSTM   (Gers & Schmidhuber, 2001)
$a^nb^nc^n$       $L$                       100% at L ≤ 1500    L ≤ 120    LSTM   (Gers & Schmidhuber, 2001)
$a^nA^mB^mb^n$    $(\frac{L}{2}+1)^2 - 1$   100% at L ≤ 92      L ≤ 44     LSTM   (Gers & Schmidhuber, 2001)
$wW^R$            $3(2^{L/2}-1)$            ≤ 81%² at L = 14    L ≤ 12+    SRN    (Rodriguez, 2001)
$wW$              $3(2^{L/2}-1)$            69% at L ≤ 6        Sample³    SRN    (Tabor, 2002)

²(Rodriguez, 2001) reports 39 correct out of 114 mixed sequences (i.e., with
both a’s & b’s). I concluded that there were a minimum of 114 − 39 = 75
errors among at most $3(2^7 - 1)$ states. The actual error rate may have been
higher.
³(Tabor, 2002) generated strings with a probabilistic queue-grammar in which
the probability of pushing a symbol onto the queue was 0.2 at each point. The
network was trained on one pass through a 3 million word corpus of such
strings.
1.1 State Growth Rates
A useful way of categorizing languages which require different kinds of stack
mechanisms is to examine their rates of state growth as a function of string length
(Crutchfield, 1994). I define the state growth function of a language L to be the
function which specifies, for each possible sentence-length, n, how many states
a computer must distinguish in order to correctly process all strings of length
n or less from L. Given this definition, infinite-state languages which require
keeping track of arbitrary orders of elements on their stacks are exponential
state-growth languages.
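For small lengths the state growth function can be estimated by brute force: under the definition above, two prefixes can share a state only if they permit the same set of completions among strings of length n or less. The following sketch applies this completion-set criterion to $a^nb^n$ (linear growth) and $wW$ (exponential growth); the exact counts differ from the formulas in Table 1 by small bookkeeping constants, so treat it as an illustration rather than a reproduction of the table:

```python
from itertools import product

def language_anbn(max_len):
    """All grammatical strings of a^n b^n with length <= max_len."""
    return {"a" * n + "b" * n for n in range(1, max_len // 2 + 1)}

def language_wW(max_len):
    """All strings wW where W copies w under a -> A, b -> B (crossed serial)."""
    strings = set()
    for n in range(1, max_len // 2 + 1):
        for w in product("ab", repeat=n):
            strings.add("".join(w) + "".join(c.upper() for c in w))
    return strings

def state_growth(language, max_len):
    """Number of states needed for all strings of length <= max_len: prefixes
    with different completion sets must occupy different states."""
    strings = language(max_len)
    prefixes = {s[:i] for s in strings for i in range(len(s) + 1)}
    futures = set()
    for p in prefixes:
        futures.add(frozenset(s[len(p):] for s in strings if s.startswith(p)))
    return len(futures)

for L in (4, 6, 8, 10):
    print(L, state_growth(language_anbn, L), state_growth(language_wW, L))
```

The count for $a^nb^n$ grows roughly linearly with L while the count for $wW$ roughly doubles with each increment of two, as the table indicates.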
It is difficult for RNNs to learn exponential state-growth languages. Even
though the performance levels of LSTM networks are consistently much higher
than those of SRNs, I am aware of no published results on their learning of
exponential state-growth languages.
It makes sense that exponential state-growth languages should be partic-
ularly hard to learn because (i) many training exemplars are needed to fully
sample the set of viable strings up to any given length, (ii) the hidden units
cannot function merely as counters, but must also keep track of stack se-
quencing information, and (iii) having many states implies smaller critical distances
between states, so the network must be tuned very accurately even to handle
short strings. Unfortunately, among practical applications of infinite state lan-
guages, exponential state-growth cases are common. For example, embedding
order is critical for keeping track of the subject-predicate dependencies in nat-
ural language relative clauses (Christiansen and Chater, 1999). Moreover, in
computer programming tasks which require recursion, order of recursive em-
bedding almost always has a major effect on function. It is thus desirable to
be able to handle such cases with learning RNNs.
For both linear and quadratic state-growth languages, the principles employed by net-
works that successfully learn the languages are very clear: hidden units are
used as counters of stack symbols. But how to extend this method to ex-
ponential state-growth cases is not obvious: (Rodriguez, 2001), using human
insight to build a linear approximation of the network identified in Row 4
of Table 1, was able to find an effective solution for one case, but he did not
provide a principled method of accomplishing this approximation. Nor did
he show how the correct weights could be learned. The present work starts
with a known method of encoding exponential state-growth cases in artificial
neurons and then identifies a type of network that has a strictly monotonic
gradient from its initial state to the engineered solution for several exponential
state-growth cases.
2 Dynamical Automata and Fractal Learning
Neural Networks
Insight into the challenge of encoding arbitrary stack manipulations in RNNs
has been provided by (Moore, 1998), (Siegelmann, 1996), (Tabor, 2000), (Barns-
ley, 1988), and (Tino, 1999), who show that fractal sets provide a natural frame-
work for encoding recursive computation in analog computing devices. The
current paper employs the framework laid out by (Tabor, 2000).
2.1 Pushdown Dynamical Automata for Context Free
Languages
(Tabor, 2000) defines dynamical automata for processing symbol sequences in
a bounded real-valued activation space. A dynamical automaton (or “DA”)
(Tabor, 2000) is a symbol processor whose states lie in a complete metric
space. The DA must start at a particular point called the start state. It is
associated with a list of state change functions that map that metric space to
itself, a vocabulary of symbols that it processes, and a partition of the metric
space. An Input Map specifies, for each compartment of the partition, a set
of symbols that can be grammatically processed when the automaton is in that
compartment, and the state change function that is associated with each
symbol. Appendix 2 provides a formal definition.
(Tabor, 2000) defines a subset of Dynamical Automata, called Pushdown
Dynamical Automata (PDDAs), that behave like Pushdown Automata (PDAs—
see Appendix 1). PDDAs travel around on fractals with several branches.
Their functions implement analogs of stack operations: pushing, popping, and
exchanging symbols. Pushing involves moving to a part of the fractal where
the local scale is an order of magnitude smaller than the current scale; pop-
ping is the inverse of pushing; and exchanging involves shifting position at the
largest scale without changing the scale. (Tabor, 2000) shows that the set of
languages processed by PDDAs is identical to the set of languages processed
by PDAs, that is to say, the Context Free Languages (Hopcroft and Ullman,
1979).

Table 2: The Input Map for Pushdown Dynamical Automaton 1 (PDDA 1).
The initial state is $\binom{0}{0}$. When the automaton is in the state listed under
“Compartment”, it can read (or emit) the symbol listed under “Input”. The
consequent state change is listed under “State Change”.

Compartment                Input    State Change
$z_1 < 0$ and $z_2 < 0$    b        $\vec{z} \leftarrow \vec{z} + \binom{0}{2}$
$z_1 < 0$ and $z_2 > 0$    c        $\vec{z} \leftarrow 2\left(\vec{z} + \binom{1}{-1}\right)$
Any                        a        $\vec{z} \leftarrow \frac{1}{2}\vec{z} + \binom{-1}{-1}$
Table 2 gives an example of a PDDA for processing the sentences generated
by the grammar shown in Table 3. PDDA 1 always starts at the origin of
the space $z_1 \times z_2$ at the beginning of processing a sentence; it must move in
accordance with the rules specified in Table 2; if it is able to process each word
of the sentence and return to the origin when the last word is read, then the
sentence belongs to the language.
In order to successfully process Language 1 with a symbolic stack, it is
necessary to store information on the stack whenever an “a b c” sequence
is begun but not completed. If A is stored when “a” is encountered, and
B is stored when “b” is encountered, then the possible stack states are the
members of {A, B}* (see Appendix 1). As Figure 1a shows, PDDA 1 induces
a map from this set to positions on a fractal with the topology of a Cantor Set
(Barnsley, 1988). Figure 1b provides an example of the states visited by the
automaton as it processes the center-embedded sentence, “a b a a b c b c c”.
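Since Table 2 specifies PDDA 1 completely, its behavior is easy to simulate. The following minimal sketch checks sentences against PDDA 1; treating the origin as both the start state and the final region follows the description above, and the numeric tolerance is an implementation detail:

```python
def pdda1_accepts(sentence, tol=1e-9):
    """Run PDDA 1 (Table 2) over a space-separated sentence."""
    z1, z2 = 0.0, 0.0                        # start state: the origin
    for word in sentence.split():
        if word == "a":                      # any compartment: push
            z1, z2 = 0.5 * z1 - 1.0, 0.5 * z2 - 1.0
        elif word == "b" and z1 < 0 and z2 < 0:
            z2 += 2.0                        # move at the current scale
        elif word == "c" and z1 < 0 and z2 > 0:
            z1, z2 = 2.0 * (z1 + 1.0), 2.0 * (z2 - 1.0)   # pop
        else:
            return False                     # symbol ungrammatical here
    return abs(z1) < tol and abs(z2) < tol   # back at the origin?

print(pdda1_accepts("a b c"))                # True
print(pdda1_accepts("a b a a b c b c c"))    # True (the sentence of Figure 1b)
print(pdda1_accepts("a b c c"))              # False
```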
Table 3: Grammar 1. Parentheses denote optional constituents, which oc-
cur with probability 0.2 in every case. A probability, indicated by a decimal
number to its left, is associated with each production. The probabilities are
irrelevant for implementation in a dynamical automaton but are useful for
generating a corpus which can be used to train a FLNN—see below.
1.0  S → A B C
1.0  A → a (S)
1.0  B → b (S)
1.0  C → c (S)
Figure 1: a. Stack map. The stack map of PDDA 1, plotted in the $(z_1, z_2)$
plane. Stack-states (labeled) are associated with the apices of rightward-
pointing triangles. Each letter in a label identifies a symbol on the stack. The
top-of-stack is the right-most letter in each label. b. Sample trajectory. The
trajectory associated with the sentence, “a b a a b c b c c”, from Language 1.
“1.” identifies the first word, “2.” the second, etc. [Plots omitted: only axis
ticks and point labels survived extraction.]
2.2 Insight from Prior Training Results
(Rodriguez, 2001) trained an SRN on several context free languages. Although
the networks did not achieve high accuracy on exponential state-growth lan-
guages (see Table 1), (Rodriguez, 2001) was able to examine the network
weights and construct an effective solution in each case, using a linear re-
current map in place of the SRN’s sigmoidal hidden unit activation function.
For example, based on a network trained on $wW^R$, (Rodriguez, 2001) defined
the hidden-layer map shown in (1)-(2).
$$\vec{z}_t = f\left(\begin{pmatrix} 0.5 & 0 & 0 & 0 \\ 0 & 0.5 & 0 & 0 \\ 2 & 0 & 2 & 0 \\ 0 & 2 & 0 & 2 \end{pmatrix} \vec{z}_{t-1} + \begin{pmatrix} 0.5 & 0.5 & -5 & -5 \\ 0.4 & 0.1 & -5 & -5 \\ -5 & -5 & -1 & -1 \\ -5 & -5 & -0.8 & -0.2 \end{pmatrix} \vec{I}_t\right) \tag{1}$$

$$f_i(n_i) = \begin{cases} 0, & n_i < 0 \\ n_i, & 0 \le n_i \le 1 \\ 1, & 1 < n_i \end{cases} \tag{2}$$
The input, $\vec{I}_t$, is encoded as a = (1, 0, 0, 0); b = (0, 1, 0, 0); A = (0, 0, 1,
0); B = (0, 0, 0, 1). For the “push” operations (see Appendix 1) associated
with the symbols a and b, the third and fourth dimensions have fixed values,
so it is possible to graph the important state changes. Figure 2 shows the set
of all states visited by the map as it processes all possible initial sequences of
a’s and b’s down to seven levels of embedding (the “push set”).
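A sketch of how such a push set can be generated directly from equations (1)-(2). Starting the state at the zero vector is an assumption (the text does not specify an initial state), and NumPy's clip implements the piecewise linear function f of equation (2):

```python
from itertools import product
import numpy as np

W = np.array([[0.5, 0, 0, 0],                # recurrent weights in equation (1)
              [0, 0.5, 0, 0],
              [2, 0, 2, 0],
              [0, 2, 0, 2]], dtype=float)
V = np.array([[0.5, 0.5, -5, -5],            # input weights in equation (1)
              [0.4, 0.1, -5, -5],
              [-5, -5, -1, -1],
              [-5, -5, -0.8, -0.2]], dtype=float)
inputs = {"a": np.array([1.0, 0, 0, 0]), "b": np.array([0, 1.0, 0, 0])}

def step(z, symbol):
    return np.clip(W @ z + V @ inputs[symbol], 0.0, 1.0)  # equation (2)

# Collect the "push set": states reached by all a/b sequences up to depth 7.
push_set = []
for depth in range(1, 8):
    for seq in product("ab", repeat=depth):
        z = np.zeros(4)                       # assumed initial state
        for s in seq:
            z = step(z, s)
        push_set.append(z[:2])                # dims 3-4 stay clipped at 0
```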
Figure 2 makes it clear that Rodriguez’s gleaned model uses the same kind
of stack mechanism as a PDDA. Two insights from this and other prior work
help with the challenge of learning exponential state-growth languages: (i)
as (Rodriguez, 2001) and (Holldobler, Kalinke, and Lehmann, 1997) point
out, the fractal scaling solution depends critically on being able to invert the
scale changes. SRNs do this by using the linear region of their activation
function to approximate multiplicative inverses. But the imperfection of the
approximation leads to inaccuracies at high levels of embedding—hence the use
Figure 2: The push set for the map gleaned by (Rodriguez, 2001) from an
SRN trained on $wW^R$, plotted in the $(z_1, z_2)$ plane. Visited states are at
the tips of leftward-pointing triangles. The letter labels identify the stack
states associated with the visited states under the minimal PDA for $wW^R$.
As in PDDA 1, the states with different immediate futures are on different
branches of the fractal. [Plot omitted: only axis ticks and stack-state labels
survived extraction.]
by (Rodriguez, 2001) of the piecewise linear function, $f$. The FLNNs discussed
in the next section avoid the nonlinear distortion by using linear recurrent units
in the learning model. (ii) Part of the complexity of the solution embodied in
equation (1) lies in the use of a rotation when the network crosses the midpoint
of a palindrome (this rotation is accomplished by the submatrix
$\begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$ in
the third quadrant of the recurrent weight block in (1)). The rotation has the
effect of mapping the whole push-set into a new subspace so the network can
easily tell whether it is in push- or pop-mode. This solution is somewhat akin
to doing a global symbol substitution across the stack in a PDA: it preserves
the stack structure while changing the identity of every element in it. A simpler
solution with a PDA is to use a control-state variable to keep track of whether
the midpoint has been passed. Indeed, (Tabor, 2000) handles such cases with
DAs by invoking an additional dimension which encodes the analog of a control
state change. Suspecting that learning the rotation or the control state change
might be difficult for a network, I have focused here on cases in which a stack
alone is sufficient—no separate control variables are needed.
2.3 Fractal Learning Neural Networks (FLNN’s): Im-
plementation of Dynamical Automata in RNNs
Figure 3 shows a Fractal Learning Neural Network (FLNN), one way of im-
plementing a dynamical automaton in artificial neurons. The input layer uses
a one-hot encoding for words of the vocabulary, as in (Elman, 1990). The
first hidden layer is recurrently connected. The input layer has two kinds of
projections to the first hidden layer: first-order connections project directly to
the hidden units; second-order connections control the weights on the recurrent
hidden connections. As in (Rodriguez, 2001)’s hand-constructed networks, the
first hidden layer has a linear (identity) activation function. All maps are dis-
crete.

Figure 3: A FLNN for learning the language of Grammar 1. The layers, from
bottom to top, are Input (one unit each for a, b, c), Hidden 1 (linear), Hidden 2
(Gaussian), and Output (softmax, one unit each for a, b, c). [Diagram omitted.]

The following is a general form of the update rule for the first hidden layer:

$$z_i(t) = \sum_{j \in \mathrm{Inputs}} w_{ij} a_j(t) + \sum_{k \in \mathrm{Hidden1}} \sum_{j \in \mathrm{Inputs}} s_{ikj} z_k(t-1) a_j(t) \tag{3}$$
Here, $t$ indexes time, $a_j$ is the activation of the $j$'th input unit, and $z_i$ is the
activation of the $i$'th first hidden layer unit; $w_{ij}$ is the (first-order) weight on
the connection from unit $j$ to unit $i$; $s_{ikj}$ is the (second-order) weight from
input unit $j$ and the previous state of hidden unit $k$ to the current state of
hidden unit $i$. Because the PDDA analysis indicates that only the self-weights
need to be manipulated for handling context free grammars and it suffices to
use the same contraction/expansion factor across all hidden dimensions for
a given word, one can set all the non-self-connections in Hidden1 to 0 and
define $s_j = s_{iij}$. Moreover, because the input vectors are “one-hot” codes, the
description can be further simplified by defining $z_i(t, j)$ to be the value of $z_i(t)$
when unit $j$ is the activated input:

$$z_i(t, j) = w_{ij} + z_i(t-1)\, s_j \tag{4}$$

Equation (4) should be compared to the equations in the “State Change”
column of Table 2, which it implements.
The Hidden1 units have first-order connections to the units in Hidden2.
These second hidden layer units have gaussian activation functions:

$$g_i(t) = \exp\left[\frac{-|\vec{w}_i - \vec{z}(t)|^2}{b_i^2}\right] \tag{5}$$

Here, $g_i$ is the activation of the $i$'th second hidden layer unit and $b_i^2$ is its
“bias” (a parameter which controls the radius of the spherical region of Hidden1
space over which the unit is active); $\vec{w}_i$ is the vector of weights feeding from
Hidden1 to unit $i$ in Hidden2; $\vec{z}$ is the vector of Hidden1 activations. The
purpose of these radial basis function units is explained below.
The second hidden layer units have first-order connections to the output
units, which, as a group, have the normalized exponential (or “soft-max”)
activation function, since they model the probability distribution for the next
symbol at each point in time (Haykin, 1994):
$$o_i(t) = \frac{\exp(net_i(t))}{\sum_{k \in \mathrm{Outputs}} \exp(net_k(t))} \tag{6}$$

$$net_i(t) = \sum_{j \in \mathrm{Hidden2}} w_{ij} g_j(t) \tag{7}$$

Here, $o_i(t)$ is the activation of the $i$'th output unit at time $t$.
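Gathering equations (4)-(7), a single forward pass of the FLNN can be sketched as follows. The layer sizes and random weight values are placeholders rather than trained values; the initialization range and the fixed Gaussian variances of 0.25 anticipate the Training Procedure section below:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h1, n_h2, n_out = 3, 2, 3, 3          # assumed layer sizes

W_in = rng.uniform(-0.3, 0.3, (n_h1, n_in))   # first-order weights w_ij
s = np.ones(n_in)                             # per-word self-weights s_j
W_rbf = rng.uniform(-0.3, 0.3, (n_h2, n_h1))  # Hidden2 centers, vectors w_i
b2 = np.full(n_h2, 0.25)                      # fixed Gaussian variances b_i^2
W_out = rng.uniform(-0.3, 0.3, (n_out, n_h2)) # Hidden2-to-Output weights

def forward(z_prev, j):
    """Process the one-hot input word j given the previous Hidden1 state."""
    z = W_in[:, j] + s[j] * z_prev                      # equation (4)
    g = np.exp(-np.sum((W_rbf - z) ** 2, axis=1) / b2)  # equation (5)
    net = W_out @ g                                     # equation (7)
    out = np.exp(net) / np.exp(net).sum()               # equation (6)
    return z, out

z = np.zeros(n_h1)
for j in (0, 1, 2):          # e.g., the word indices of "a b c"
    z, out = forward(z, j)
    print(out)               # predicted next-symbol distribution
```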
The network receives symbols in sequence from the language it is trained
on. Each symbol activates a single, unique unit on the input layer. The ac-
tivations are updated layer-by-layer in the order, Input, Hidden1, Hidden2,
Output. In the recurrent layer, Hidden1, all the new states are computed
before any of the states are actually changed. The job of the network is to
activate on the output layer, after each symbol is presented, the correct prob-
ability distribution over next-symbols (Elman, 1990). The (gaussian) radial
basis functions at Hidden2 are used by the network to classify the branches of
the fractal on which the network travels in Hidden1. For the languages consid-
ered here, each fractal branch in Hidden1 corresponds to a different immediate
future and hence a different probability distribution on the output layer. The
use of radial basis functions rather than sigmoid functions on Hidden2 helps
to drive the formation of the contraction maps which are important for im-
plementing the fractal scaling: if the FLNN persists in using non-contractive
scaling, then it will suffer great inaccuracy when multiple embeddings drive it
outside the halo of the radial basis units.
3 Simulations
3.1 Training Languages
I used the training languages specified by Grammar 1 (Table 3) and Grammar
2 (Table 4). These languages are exponential state-growth languages. Their
state-growth functions are $2^{(\frac{L}{3}+1)} - 1$ and $2^{(\frac{L}{2}+1)} - 1$, respectively.
Table 4: Grammar 2. (See Table 3 for explanation.)
0.5  S → A B
0.5  S → X Y
1.0  A → a (S)
1.0  B → b (S)
1.0  X → x (S)
1.0  Y → y (S)
3.2 Training Procedure
Two constraints were put on the FLNN to make learning easier. I assigned the
three gaussian units in the second hidden layer fixed variances. These units
end up classifying the branches of the fractal. Since the radius of each fractal
branch is arbitrary, provided it is positive, the second hidden layer variances
can be fixed without loss of generality. I set these variances to 0.25. I also set
all the hidden self-weights to 1 initially. This choice is motivated by the fact
that the self-weights perform the fractal contraction and expansion, so setting
them to the multiplicative identity is the unbiased choice.
The networks were trained by gradient sampling (also called “hill climb-
ing”) in batch mode. By gradient sampling, I mean the class of methods of
descending an error surface which sample the change in error associated with
a set of small displacements from the current state and choose the one that
reduces the error most at each step. This approach was motivated by the fact
that gradient sampling is simple to implement and conventional methods of
computing the gradient accurately, Backpropagation Through Time (Werbos,
1974; Rumelhart, Hinton, and Williams, 1986; Williams and Peng, 1990; Pearl-
mutter, 1995) and Real-time Recurrent Learning (Robinson and Fallside, 1987;
Williams and Zipser, 1995), suffer from disappearance of the error signal across
long time lags (Hochreiter, 1991; Hochreiter, 1997). It was also motivated by
the fact that (Chalup and Blair, 1999) and (Melnik, Levy, and Pollack, 2000)
were successful in training other neural networks on infinite state languages
using gradient sampling techniques. Disadvantages of the current method are
that it is inefficient and it is nonlocal in time and space in the sense of (Schmid-
huber, 1989). The work of (Gers and Schmidhuber, 2001) and (Hochreiter and
Schmidhuber, 1997), which computes an accurate gradient without letting the
error signal decay, helpfully complements the present work and may provide
the basis for a more efficient algorithm.
The gradient sampling was implemented as follows. A corpus of sentences
from the language was processed at the current weight setting and at points
on a sphere in weight-space surrounding the current setting. To reduce the
amount of testing, only a set of basis directions and their negatives was tested
at each stage. Whichever single weight adjustment produced the greatest
reduction in the error relative to the current setting was adopted and the
process was repeated. Error on a particular word was measured as Kullback-
Leibler Divergence at the output layer ($E = \sum_{i \in \mathrm{Outputs}} t_i \log(t_i/o_i)$, where $t_i$ is
the probability of word $i$ in the context and $o_i$ is the activation of output unit
$i$) and was summed over all words in the training corpus. The radius of the
sphere was 0.001. No momentum was used. The initial values of the Hidden1-
to-Hidden2 weights and the Hidden2-to-Output weights in each network were
randomly picked from the uniform distribution on (-0.3, 0.3).
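One step of the gradient-sampling loop can be sketched as below. Here `corpus_error` stands in for a full pass of the FLNN over the training corpus that sums the Kullback-Leibler error over all words; representing the weights as a flat array and returning unchanged weights when no probe improves the error are implementation assumptions:

```python
import numpy as np

def kl_error(targets, outputs, eps=1e-12):
    """E = sum_i t_i log(t_i / o_i), summed over the output units of one word."""
    return float(np.sum(targets * np.log((targets + eps) / (outputs + eps))))

def gradient_sample_step(weights, corpus_error, radius=0.001):
    """Probe +/- radius along each weight axis; take the single best move."""
    best_err = corpus_error(weights)
    best_move = None
    for k in range(len(weights)):
        for delta in (+radius, -radius):      # basis directions and negatives
            probe = weights.copy()
            probe[k] += delta
            err = corpus_error(probe)
            if err < best_err:
                best_err, best_move = err, (k, delta)
    if best_move is None:                     # gradient appears flat: stop
        return weights, best_err
    k, delta = best_move
    new_weights = weights.copy()
    new_weights[k] += delta
    return new_weights, best_err
```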
The training corpus for Language 1 consisted of one instance each of all
sentences of length ≤ 9. The training corpus for Language 2 consisted of one
instance each of all sentences of length ≤ 6. Note that the training sets for
both languages included strings with at most three levels of recursion. From
the corpus distribution associated with each language, I computed the transi-
tion probabilities for the transitions in its training corpus and used these as
the targets on the output layer. Note that each training corpus was a highly
skewed sample from its corresponding corpus distribution. Similar results were
obtained when each training corpus formed a random sample from its corre-
sponding corpus distribution but convergence was slower. The networks were
trained until their mean error per word on the test corpus dropped below 0.001
or the gradient became so shallow that it appeared flat in single-precision float-
ing point encoding.
Table 5: Performance on training and test sets for the two FLNN’s. RMSE =
Root Mean Squared Error. % Cor. = Percent Correct. N = the number of
networks that contributed to the computation of Standard Error (SE). Npoints
= the number of words tested per network. All networks were trained by
gradient sampling.
Language   Corpus     RMSE    SE      % Cor.    SE      N    Npoints
1          Training   0.013   0.001   100.000   0.000    9     129
1          Test       0.048   0.005    99.424   0.195    9    4755
2          Training   0.008   0.000   100.000   0.000   11     276
2          Test       0.022   0.001    99.900   0.022   11   15232
3.3 Training Results
3.3.1 Test Statistics
The test corpora consisted of all sentences of length 12 to 15 words from Lan-
guage 1 and all sentences of length 8 to 10 words from Language 2 (i.e., novel
sentences with ≤ 5 levels of recursion). The point was to test all sentences
within a short depth range in order to find out how close to perfectly im-
plementing the recursive structure the nets were coming, without running up
against the distortion created by limited precision (Boden and Wiles, 2000).
FLNN1 and FLNN2 processed their training sets completely accurately in
9 and 11 out of 15 trials, respectively. I will refer to the runs with accurate
training set performance in each case as the “successful runs”. No unsuccessful
runs achieved more than 93% accuracy on the training corpus. To detect and
avoid over-learning, the networks were tested on the testing corpus as well as
the training corpus throughout training. This precaution turned out not to
be necessary: the minimum of testing error occurred at, or very near, the end
of training and the difference in performance between the test minima and
the end of training was very slight. All results reported are from the end of
training.
Table 5 shows the Root Mean Squared Error (Haykin, 1994) and the per-
centage of correctly predicted transitions for the two networks on their training
corpus and test corpus. The very low error rates on the test corpus suggest
that each network has successfully generalized to a language with similar struc-
ture to that of the target. The analysis of the following section reveals that
this performance degrades when deeper levels of embedding are tested, but it
also shows that the successful networks have adopted solutions that are struc-
turally identical to those of the corresponding PDDAs. In this sense, they have
successfully identified the intended, infinite-state language.
3.3.2 State Space Analysis
In order to process a language correctly, the network must assign states with
different futures to different locations in the recurrent hidden unit space (Hid-
den1). Therefore, it is helpful to divide the Hidden1 states into equivalence
classes of states with identical futures. Figure 4 provides one view of such a
partition. It was generated from the Hidden1 states of a network that learned
Language 1 successfully. Each labeled point in the figure marks the mean of
the points associated with the stack state that labels it (i.e. it marks the mean
of a set with identical futures). Figure 4 was derived from the set of all states
the network visited while processing the test corpus (such a set derived from a
corpus will henceforth be referred to as the Visitation Set of the corpus. The
set of vector means over states with common futures derived from this set will
be called the Mean Visitation Set or MVS.). Figure 4 is analogous to Figure
1a. Indeed, it is clear from comparison of the figures that the trained net-
work has discovered a fractal organization analogous to that of the PDDA.
The branches of the fractal have been shaded to distinguish the two different
immediate futures that are associated with all points but the initial point in
Language 1. These two immediate futures are [0.2 a, 0.8 b, 0.0 c] and [0.2 a,
0.0 b, 0.8 c].² For convenience, I will refer to these two branches of the frac-
tal as the “a-branch” and the “b-branch” respectively. Every other successful
network had an identically structured set of means.

²In [0.2 a, 0.8 b, 0.0 c], the decimal numbers indicate probabilities and the
letters identify next-symbols.

Figure 4: State means for a sample FLNN at the end of training, plotted in
the $(z_1, z_2)$ plane. The (labeled) stack states are located at the apices of
downward-pointing triangles. Horizontal shading marks the a-branch of the
fractal and vertical shading marks the b-branch. This figure is analogous to
Figure 1a. [Plot omitted: only axis ticks and stack-state labels survived
extraction.]
Two important properties distinguish PDDAs from DAs in general: (a)
the regions (or branches) of the fractal that correspond to different immedi-
ate future states are nonoverlapping; (b) symbolic machine cycles correspond
to dynamical machine cycles (see Appendix 1). Peter Tino (personal communication) refers to
constraint (b) as the Generalized Fixed Point Condition (GFPC). It says, in
essence, that in situations where a PDA makes a loop in its state space, the
corresponding PDDA will too. To understand the process by which a FLNN
develops its solution, it is helpful to examine properties (a) and (b) separately.
a. Overlap of Branches. Figure 5 tracks the overlap of fractal branches
during the course of training for the same successful network depicted in Figure
4. In this figure, the stack state labels have been removed to make viewing
easier. Figure 5 depicts the MVS associated with a test set of all palindromes
up to 5 levels of embedding. At the beginning of training, the MVS is a single
point because all of the Input-to-Hidden1 weights are initialized to 0. After
training has proceeded for a while, the MVS has expanded into an infinite
lattice with high overlap between the branches (Figure 5a). This infinite lattice
is the reflex of the unit initial weights in Hidden1, which imply no contraction
or expansion. Over the course of training, the branches shrink to a finite size
and gradually spread apart (Figure 5b), eventually reaching a nonoverlapping
final state (Figure 5c, which shows the same points as Figure 4). The fact
that once the fractal branches come into existence, they monotonically shrink
and separate is an indication that the dynamics of the learning process are
dominated by an attractor with the topological properties of the PDDA 1
solution.
b. Generalized Fixed Point Condition (GFPC). Figure 6 shows the distri-
bution of points associated with each common future at the end of training.
Figure 6a was based on points from the Training Corpus alone (up to 3 levels
of recursion). In addition to showing the means of points with identical fu-
tures at the apices of downward pointing triangles, each set of points with a
common future has a polygonal simplex drawn around its periphery. Because
the training set was learned with high accuracy, these polygons appear as sin-
gle points coincident with their means for most levels of recursion. Only the
points associated with the 0’th level are visibly separated, and surrounded in
the figure by a small triangular simplex. The contours indicate the network’s
interpretation of the error surface at the end of training. This surface was
arrived at by computing the Hidden1-to-Output map for each point, $\vec{z}$, within
the plotted region of Hidden1. For each resulting point in Output space, its
distance to the three grammar-derived distributions was computed. The height
of the induced error surface at ~z was taken to be the minimum of these three
distances.

Figure 5: The evolution of the state means over the course of training a typical
FLNN, plotted in the $(z_1, z_2)$ plane. Each graph depicts the MVS derived
from the test corpus of all palindromes with ≤ 5 levels of recursion. The MVS
is shown after the network has traveled 10000 steps along the gradient (a),
13500 steps (b), and 25000 steps (c). Vertical shading marks the a-branch and
horizontal shading marks the b-branch of the fractal in each panel. [Plots
omitted.]

Figure 6: Visited points grouped by source state for (a) the training set (three
levels of recursion) and (b) the test set (five levels of recursion), plotted in
the $(z_1, z_2)$ plane. The data are from the same run that is depicted in
Figures 4 and 5. The contours show the induced classification surface. [Plots
omitted.]

Thus the basin structure of Figure 6a indicates how the network
classifies points in its Hidden1 space. Indeed, each training point lies well
within the basin of the correct probability distribution, in keeping with the
fact that the network made no classification errors on the training corpus at
the end of training. However, the positive area of the triangles (visible at the
0’th level in the figure) indicates that the network is not perfectly obeying the
GFPC: it does not always return to exactly the same point after processing
a sequence of symbols that would return the corresponding PDA to the same
stack-state.
Figure 6b displays the consequence of this failure in the case of processing
more deeply embedded strings. It is identical to Figure 6a except that the
Visitation Set on which it is based included the shortest palindromes from
each level of recursion up to the maximum level contained in the test corpus
(5 levels). Displacements (with respect to the GFPC) at high levels of recursion
grow exponentially as the network state scales back up to the 0 level. Thus
the set of points that are supposed to coincide with the initial state shows
the highest level of distortion. Indeed, some of the initial points and some of
the b-predictions land in the wrong basin for this particular network. Such
basin transgressions are the source of the errors on the test corpora reported
above. Although this distortion implies that the network will make many errors
on sentences with very deep embeddings, the fact that the network completely
separates the branches of the fractal indicates that it has successfully captured
the structure of the PDDA solution.
4 Conclusions
Fractal Learning Neural Networks appear to be able to reliably discover, via
gradient descent exploration, the information structure of a time series gener-
ated by several exponential state-growth context free languages. These net-
works thus show promise of helping to understand, in a principled way, neural
network learning of the kind of complex processes for which symbolic models
are so widely used.
Several prior studies have shown that SRNs can learn some fairly com-
plex natural language patterns (Allen and Seidenberg, 1999; Rohde and Plaut,
1999). But the utility of these findings for researchers of language phenomena
will be limited in the long run if we cannot relate them to the structural un-
derstanding provided by current symbolic analyses of language. One valuable
aspect of the analysis in the preceding section is that it reveals the principle
behind the network’s solution to the learning problem and it sets the stage for
identifying precise correlates of the symbolic entities that are the bread and
butter of current linguistic understanding of language structure.
Several avenues of future work look promising. As noted in the section,
Training Procedure, it may be helpful to pursue a more efficient and accurate
method of computing the error gradient—e.g., LSTM (Gers and Schmidhu-
ber, 2001; Hochreiter and Schmidhuber, 1997). A topic of current work is to
improve FLNN generalization accuracy in cases of deep embedding by using
the information in the gradient to drive a more perfect approximation of the
GFPC. It will also be helpful to consider a wider range of languages than the
ones reported on here, including languages that require using a stack with
more than two symbols, non-context-free languages, languages for which
a PDA with independent control states is needed, languages with lexical am-
biguity (the same word causes different manipulations of the stack, depending
on context), and languages with structural ambiguity (the stack state is not
uniquely determined by the input).
Much research on neural network learning focuses on the micro-level of the
learning mechanism; less has focused on the macro-level of the representations
that are to be learned. Indeed, these representations are often inscrutable in
nets trained on complex problems. The present results suggest that it is both
possible and valuable to study representations in order to guide the search for
an effective learning mechanism in complex domains.
5 Appendix 1: Formal Language Notation and
Background
Language. I use the term “language” here in the sense of formal language
theory (Hopcroft and Ullman, 1979): a language L is a (possibly infinite) set
of finite-length strings of symbols drawn from a finite alphabet. The strings in
L are called grammatical and all other strings are called ungrammatical with
respect to L. A symbol w from the alphabet is said to be a grammatical con-
tinuation of a string s if the concatenation [sw] forms an initial substring of
some grammatical string of the language. A machine is said to be a recognizer
of a language L if it can determine, for each finite-length string s, whether
s ∈ L. A machine is said to be a generator of a language L if it can list all
and only the strings of L. Generally, RNNs recognize and generate languages
by identifying grammatical continuations of symbol strings. I will refer to the
identification of the grammatical continuations of a symbol string as the (cor-
rect) processing of the string. The sequence of states that a machine occupies
during the correct processing of a string is called the parse of the string. In the
text, I define a corresponding notion for languages in which symbol transitions
are associated with probabilities rather than absolute possibilities.
The formalization in the remainder of this appendix is based on that found
in (Hopcroft and Ullman, 1979) but varies slightly from it in order to make
some of the terms in the text more transparent.
For C a set, the expression C∗ refers to the set of finite sequences of el-
ements of C, including the null sequence. This convention is called “Kleene
star notation”. The null sequence is denoted by “ε”.
Finite-State Automata. A finite-state automaton (FSA), F , is a computer
which can be in only a finite number of states. Formally, a finite-state machine
is denoted by a 5-tuple, (Q, Σ, δ, q0, F ), where Q is a finite set of control states,
Σ is the input alphabet, which consists of the symbols (words) in the language,
q0 ∈ Q, is the initial control state, F ⊆ Q, is the set of final control states, and
δ, a function which maps from Q × Σ to Q, is called the transition function.
If F is in the state q1 and δ contains the map δ(q1, w) → q2, for w ∈ Σ and
q2 ∈ Q, then the occurrence of w is said to be a grammatical transition under
F . If w is processed under such conditions, then the state of F changes from
q1 to q2. The set of symbol sequences s ∈ Σ∗ whose symbols bring F from its
initial state to one of its final states via grammatical transitions when they are
processed in sequence is called the language, L, of F .
Pushdown Automata. A push-down automaton (PDA), P , is denoted by
a 7-tuple, (Q, Σ, Γ, δ, q0, Z0, F ), where Q, Σ, and q0 are defined as for a finite
state machine. Γ is a second alphabet called the stack alphabet. Z0 ∈ Γ is
the initial stack state. δ is a map from Q × (Σ ∪ {ε}) × Γ to finite subsets of
Q×Γ∗. An element S of Γ∗ is called a stack state. I adopt the convention that
the rightmost element of S is called the top of stack and is denoted top(S).
F ⊆ Q× Γ∗ is the set of final states of P .
If P is in a state (q1, S1) where q1 ∈ Q is a control state and S1 ∈ Γ∗ is
a stack state, and δ contains the map δ(q1, w, top(S1)) → (q2, γ) for w ∈ Σ
and γ ∈ Γ∗, then the occurrence of w is said to be a grammatical transition
under P . If w is processed under such conditions, then the state of P changes
from (q1, S1) to (q2, S2), where S2 is derived from S1 by replacing top(S1)
with γ. The set of symbol sequences s ∈ Σ∗ whose symbols bring P from its
initial state to one of its final states via grammatical transitions when they
are processed in sequence is called the language of P . A PDA amounts to a
FSA in combination with a pushdown stack, a memory consisting of a string
of symbols which, at any point in time, can be manipulated only by changing
(i.e. replacing or deleting) the most-recently-added symbol.
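As an illustration of this definition, here is a minimal sketch of a deterministic PDA for $a^nb^n$. The control-state names, the bottom-of-stack marker Z, and the single final state are illustrative choices, not part of the formal definition:

```python
# delta maps (control state, input symbol, top of stack) to a new control
# state and the string gamma that replaces the top of the stack.
delta = {
    ("q0", "a", "Z"): ("q0", "ZA"),   # push the first A above the marker
    ("q0", "a", "A"): ("q0", "AA"),   # push another A
    ("q0", "b", "A"): ("q1", ""),     # pop an A and enter popping mode
    ("q1", "b", "A"): ("q1", ""),     # keep popping
}

def pda_accepts(string, start=("q0", "Z"), final=("q1", "Z")):
    q, stack = start
    for w in string:
        if not stack or (q, w, stack[-1]) not in delta:
            return False                     # no grammatical transition
        q, gamma = delta[(q, w, stack[-1])]
        stack = stack[:-1] + gamma           # replace top(S) with gamma
    return (q, stack) == final

print(pda_accepts("aabb"))   # True
print(pda_accepts("aab"))    # False
```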
A theorem states that the languages that can be processed by FSAs are
a proper subset of the languages that can be processed by PDAs (Hopcroft
and Ullman, 1979). This implies that some PDAs are infinite-state machines,
in the sense that there is no bound on the number of states they can occupy
during grammatical processing.
The addition, by a PDA, of a symbol to its stack is called pushing the
symbol onto the stack. The removal of the symbol is called popping the stack.
The number of symbols that a PDA has on its stack at a particular point in
time is called its current level of recursion. The maximum current level of
recursion that a particular PDA, Pi, achieves during the course of assigning
parse p to a sentence s is called the level of recursion of parse p of s under Pi.
Since the set of all PDAs is countable, there is a smallest PDA PL (measured
in terms of the number of symbols in the formalization just specified) for each
PDA language L. Let the level of recursion of a sentence s ∈ L be the smallest
level of recursion of s under PL. For example, the level of recursion of “a x y
b” in Language 2 in the text is 2.
Context-Free Grammar. A context-free grammar (CFG), G, is a 4-tuple,
(V, T, P,R) where V is a finite set of abstract nodes, T is a finite set of terminal
nodes, P is a finite set of productions of the form A → α, where A is an abstract
node and α is a member of (V ∪ T )∗, and R ∈ V is a special abstract node
called the root node. If β1 ∈ (V ∪ T )∗ contains an instance Ai of the abstract
node, A ∈ V , and G contains the production, A → α, as defined above, then
G derives β2 from β1, where β2 is the string that results when Ai in β1 is
replaced by α. Derivation is assumed to be a transitive relation. s ∈ T ∗ is a
(grammatical) sentence of G if s can be derived from R. The set of sentences
of G is called the language generated by G. A language generated by a CFG
is called a Context Free Language (CFL).
A theorem states that the languages that can be processed by PDAs are
exactly the CFLs (Hopcroft and Ullman, 1979).
Cycles. If r is a particular subsequence of a grammatical sentence s of
language L, processed by the symbolic machine P (e.g. a finite state automaton
or a pushdown automaton) and if P is in state qP just before starting to process
r and it returns to state qP after processing r, then r is said to generate a
symbolic cycle under P . Likewise, if D is a FLNN which processes sequences
from the alphabet of L and if, when processing s, D is in state qD just before
starting to process r and it returns to state qD after processing r, then r is said
to generate a dynamical cycle under D. D obeys the Generalized Fixed Point
Condition (GFPC) with L if all the cycles of the minimal symbolic machine
for L correspond to cycles of D.
6 Appendix 2: Definition of Pushdown Dy-
namical Automaton (PDDA)
Def. A Dynamical Automaton (DA) is a 7-tuple, M:
M = (X, F, P, Σ, IM, O, FR)    (8)
(1) The space, X, is a complete metric space.
(2) The function list, F , consists of a finite number of functions,
w1, . . . wN where wi : X → X for each i ∈ {1, . . . , N}.
(3) The partition, P , is a finite partition of X with compartments,
m1, . . . , mK .
(4) The input list is a finite set of symbols drawn from an alphabet,
Σ.
(5) The input map is a three-place relation, IM : P × Σ × F →
{0, 1}, which specifies for each compartment, m, and each input
symbol, s, what function(s) can be performed if symbol s is pre-
sented when the system state is in compartment m. If IM(m, s, f) =
0 for all f ∈ F then symbol s cannot be presented when the system
is in compartment m.
(6) The machine always starts at the start state, O ∈ X. If, as
successive symbols from an input string are presented, it makes
transitions consistent with the Input Map, and arrives in the final
region, FR, then the input string is accepted.
Def. A generalized iterated function system (GIFS) consists of a complete
metric space (X, d) together with a finite set of functions, wi : X → X,
i ∈ {1, . . . , N}.

Def. Let X be a metric space with GIFS S = {X; w1 . . . wN}. A set O ⊂ X
is called a pooling set of S if it satisfies the following:

(i) wi(O) ∩ wj(O) = ∅ for i, j ∈ {1, . . . , N} and i ≠ j.

(ii) $\bigcup_{i=1}^{N} w_i(O) \subset O$

The set of points in O that are not in $\bigcup_{i=1}^{N} w_i(O)$ is called the crest of O.
Def. Let S = {X; w1 . . . wN} be a GIFS with cascade point x0 and corre-
sponding cascade C. Then wi : C → C is called a push function on C.

Def. Let S = {X; w1 . . . wN} be a GIFS with cascade point x0 and correspond-
ing cascade C. Let Y = {x ∈ C : $top_C(x)$ = i for i ∈ {1, . . . , N}}. Suppose wi
is invertible on Y. Then the function f : Y → C such that f(x) = $w_i^{-1}(x)$ is
called a pop function on C. (This definition was incorrectly stated in (Tabor,
2000).)
Def. Let S = {X; w1 . . . wN} be a GIFS. Let C1 and C2 be disjoint cascades
under S with cascade points x10 and x20 respectively. Then the function f :
C1 → C2 such that for all x ∈ C1 the x10-address of x is equal to the x20-
address of f(x) is called a switch function. Note that f is one-to-one and onto,
and hence invertible for all x′ ∈ C2.

Def. If C1, . . . , CK are cascades on GIFS S satisfying Ci ∩ Cj = ∅ for i ≠ j
(i.e., they are disjoint) and x ∈ Ci for i ∈ {1, . . . , K}, then i is called the index
of x with respect to the set {C1, . . . , CK}.

Def. Let M be a dynamical automaton on metric space X. We say M
is a pushdown dynamical automaton (PDDA) if there exists a GIFS, S =
{X; w1, . . . , wN}, with cascade points x10, x20, . . . , xK0 ∈ X, K ∈ {1, 2, 3, . . .},
and corresponding cascades, C1, C2, . . . , CK, such that

(i) C1, C2, . . . , CK are disjoint.

(ii) For x ∈ $\bigcup_{i=1}^{K} C_i$, the partition compartment of x is determined
by the conjunction of the index, i, of x and $top_{C_i}(x)$.

(iii) Let m be a compartment of the partition of M. Each function
f : m → X in the input map is either a stack function, a switch
function, or a composition of the two when restricted to points on
one of the cascades.

(iv) The start state, O, and final region, FR, of M are contained in
$\bigcup_{i=1}^{K} \{x_{i0}\}$.
A theorem states that the languages processed by PDDAs are exactly the
CFLs (Tabor, 2000).
Acknowledgements
This work was partially supported by a grant from the University of Connecti-
cut Research Foundation.
References
Allen, J. and M. S. Seidenberg (1999). Grammaticality judgment and apha-
sia: a connectionist account. In B. MacWhinney (Ed.), The Emergence
of Language. Mahwah, NJ: Lawrence Erlbaum.
Barnsley, M. (1988). Fractals Everywhere. Boston: Academic Press.
Boden, M. and J. Wiles (2000). Context-free and context sensitive dynamics
in recurrent neural networks. Connection Science 12 (3), 197–210.
Boden, M. and J. Wiles (2002). On learning context-free and context-
sensitive languages. IEEE Transactions on Neural Networks 13 (2), 491–
493.
Chalup, S. and A. D. Blair (1999). Hill climbing in recurrent neural networks
for learning the $a^nb^nc^n$ language. In Proceedings of the Sixth International
Conference on Neural Information Processing, pp. 508–513.
Chomsky, N. (1956). Three models for the description of language. IRE
Transactions on Information Theory 2 (3), 113–124. A corrected version
appears in Luce, Bush, and Galanter, eds., 1965 Readings in Mathemat-
ical Psychology, Vol. 2.
Christiansen, M. H. and N. Chater (1999). Toward a connectionist model of
recursion in human linguistic performance. Cognitive Science 23, 157–
205.
Crutchfield, J. P. (1994). The calculi of emergence: Computation, dynamics,
and induction. Physica D 75, 11–54. In the special issue on the Proceed-
ings of the Oji International Seminar, Complex Systems—from Complex
Dynamics to Artificial Reality.
Elman, J. L. (1990). Finding structure in time. Cognitive Science 14, 179–
211.
Gers, F. A. and J. Schmidhuber (2001). LSTM recurrent networks learn
simple context-free and context-sensitive languages. IEEE Transactions
on Neural Networks 12 (6), 1333–1340.
Haykin, S. S. (1994). Neural networks: a comprehensive foundation. New
York: Macmillan.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen.
Diploma thesis, Technische Universität München. Available online at
www7.informatik.tu-muenchen.de/~hochreit.
Hochreiter, S. (1997). The vanishing gradient problem during learning re-
current neural nets and problem solutions. International Journal of Un-
certainty, Fuzziness, and Knowledge-Based Systems 6 (2), 107–116.
Hochreiter, S. and J. Schmidhuber (1997). Long short-term memory. Neural
Computation 9 (8), 1735–1780.
Holldobler, S., Y. Kalinke, and H. Lehmann (1997). Designing a counter:
Another case study of dynamics and activation landscapes in recurrent
networks. In Advances in Artificial Intelligence, pp. 313–324. Berlin:
Springer.
Hopcroft, J. E. and J. D. Ullman (1979). Introduction to Automata Theory,
Languages, and Computation. Menlo Park, California: Addison-Wesley.
Kremer, S. C. (1996). A theory of grammatical induction in the connectionist
paradigm. PhD thesis, Department of Computing Science, University of
Alberta, Edmonton, Alberta.
Melnik, O., S. Levy, and J. B. Pollack (2000). RAAM for infinite context-
free languages. In Proceedings of the International Joint Conference on
Artificial Neural Networks (IJCNN). IEEE.
Moore, C. (1998). Dynamical recognizers: Real-time language recognition
by analog computers. Theoretical Computer Science 201, 99–136.
Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent
neural networks: A survey. IEEE Transactions on Neural Networks 6 (5),
1212–1228.
Pollack, J. (1987). On connectionist models of natural language processing.
Unpublished doctoral dissertation, University of Illinois.
Pollack, J. B. (1991). The induction of dynamical recognizers. Machine
Learning 7, 227–252.
Robinson, A. J. and F. Fallside (1987). The utility driven dynamic error
propagation network. Technical Report CUED/F-INFENG/TR.1, Cam-
bridge Univ. Engineering Department.
Rodriguez, P. (2001). Simple recurrent networks learn context-free and
context-sensitive languages by counting. Neural Computation 13 (9).
Rodriguez, P. and J. Wiles (1998). Recurrent neural networks can learn
to implement symbol-sensitive counting. In M. Jordan, M. Kearns, and
S. Solla (Eds.), Advances in Neural Information Processing Systems 10,
pp. 87–93. Cambridge, MA: MIT Press.
Rohde, D. and D. Plaut (1999). Language acquisition in the absence of
explicit negative evidence: How important is starting small? Cognition 72,
67–109.
Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986). Learning in-
ternal representations by error propagation. In D. E. Rumelhart, J. L.
McClelland, and the PDP Research Group (Eds.), Parallel Distributed
Processing, v. 1, pp. 318–362. MIT Press.
Schmidhuber, J. (1989). The neural bucket brigade: a local learning al-
gorithm for dynamic feedforward and recurrent networks. Connection
Science 1 (4), 403–412.
Siegelmann, H. (1996). The simple dynamics of super Turing theories. The-
oretical Computer Science 168, 461–472.
Sun, G. Z., H. H. Chen, C. L. Giles, Y. C. Lee, and D. Chen (1990).
Connectionist pushdown automata that learn context-free grammars. In
M. Caudill (Ed.), Proceedings of the International Joint Conference on
Neural Networks, pp. 577–580. Hillsdale, NJ: Lawrence Erlbaum.
Tabor, W. (2000). Fractal encoding of context-free grammars in connection-
ist networks. Expert Systems: The International Journal of Knowledge
Engineering and Neural Networks 17 (1), 41–56.
Tabor, W. (2002). The value of symbolic computation. Ecological Psychol-
ogy 14 (1/2), 21–52.
Tabor, W. (2003). Learning exponential state growth languages by hill
climbing. IEEE Transactions on Neural Networks 14 (2), 444–446.
Tino, P. (1999). Spatial representation of symbolic sequences through itera-
tive function systems. IEEE Transactions on Systems, Man, and Cyber-
netics Part A: Systems and Humans 29 (4), 386–392.
Werbos, P. J. (1974). Beyond regression: New tools for prediction and anal-
ysis in the behavioral sciences. Unpublished doctoral dissertation, Har-
vard University, Cambridge, MA.
Wiles, J. and J. Elman (1995). Landscapes in recurrent networks. In J. D.
Moore and J. F. Lehman (Eds.), Proceedings of the 17th Annual Cogni-
tive Science Conference. Lawrence Erlbaum Associates.
Williams, R. J. and J. Peng (1990). An efficient gradient-based algorithm for
on-line training of recurrent network trajectories. Neural Computation 2,
490–501.
Williams, R. J. and D. Zipser (1995). Gradient-based learning algorithms
for recurrent networks. In Y. Chauvin and D. E. Rumelhart (Eds.),
Backpropagation: Theory, Architectures, and Applications, pp. 433–486.
Lawrence Erlbaum Associates.