Accepted for publication, IEEE Transactions on Knowledge and Data Engineering.
Natural Language Grammatical Inference with Recurrent Neural Networks

Steve Lawrence, C. Lee Giles, Sandiway Fong

NEC Research Institute
4 Independence Way
Princeton, NJ 08540
Abstract

This paper examines the inductive inference of a complex grammar with neural networks – specifically, the task considered is that of training a network to classify natural language sentences as grammatical or ungrammatical, thereby exhibiting the same kind of discriminatory power provided by the Principles and Parameters linguistic framework, or Government-and-Binding theory. Neural networks are trained, without the division into learned vs. innate components assumed by Chomsky, in an attempt to produce the same judgments as native speakers on sharply grammatical/ungrammatical data. How a recurrent neural network could possess linguistic capability, and the properties of various common recurrent neural network architectures are discussed. The problem exhibits training behavior which is often not present with smaller grammars, and training was initially difficult. However, after implementing several techniques aimed at improving the convergence of the gradient descent backpropagation-through-time training algorithm, significant learning was possible. It was found that certain architectures are better able to learn an appropriate grammar. The operation of the networks and their training is analyzed. Finally, the extraction of rules in the form of deterministic finite state automata is investigated.

Keywords: recurrent neural networks, natural language processing, grammatical inference, government-and-binding theory, gradient descent, simulated annealing, principles-and-parameters framework, automata extraction.
Also with the Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742.
1 Introduction
This paper considers the task of classifying natural language sentences as grammatical or ungrammatical. We attempt to train neural networks, without the bifurcation into learned vs. innate components assumed by Chomsky, to produce the same judgments as native speakers on sharply grammatical/ungrammatical data. Only recurrent neural networks are investigated for computational reasons. Computationally, recurrent neural networks are more powerful than feedforward networks and some recurrent architectures have been shown to be at least Turing equivalent [53, 54]. We investigate the properties of various popular recurrent neural network architectures, in particular Elman, Narendra & Parthasarathy (N&P) and Williams & Zipser (W&Z) recurrent networks, and also Frasconi-Gori-Soda (FGS) locally recurrent networks. We find that both Elman and W&Z recurrent neural networks are able to learn an appropriate grammar after implementing techniques for improving the convergence of the gradient descent based backpropagation-through-time training algorithm. We analyze the operation of the networks and investigate a rule approximation of what the recurrent network has learned – specifically, the extraction of rules in the form of deterministic finite state automata.

Previous work [38] has compared neural networks with other machine learning paradigms on this problem – this work focuses on recurrent neural networks, investigates additional networks, analyzes the operation of the networks and the training algorithm, and investigates rule extraction.
This paper is organized as follows: section 2 provides the motivation for the task attempted. Section 3 provides a brief introduction to formal grammars and grammatical inference and describes the data. Section 4 lists the recurrent neural network models investigated and provides details of the data encoding for the networks. Section 5 presents the results of investigation into various training heuristics, and investigation of training with simulated annealing. Section 6 presents the main results and simulation details, and investigates the operation of the networks. The extraction of rules in the form of deterministic finite state automata is investigated in section 7, and section 8 presents a discussion of the results and conclusions.
2 Motivation

2.1 Representational Power

Natural language has traditionally been handled using symbolic computation and recursive processes. The most successful stochastic language models have been based on finite-state descriptions such as n-grams or hidden Markov models. However, finite-state models cannot represent hierarchical structures as found in natural language¹ [48]. In the past few years several recurrent neural network architectures have emerged

¹The inside-outside re-estimation algorithm is an extension of hidden Markov models intended to be useful for learning hierarchical systems. The algorithm is currently only practical for relatively small grammars [48].
which have been used for grammatical inference [9, 21, 19, 20, 68]. Recurrent neural networks have been used for several smaller natural language problems, e.g. papers using the Elman network for natural language tasks include: [1, 12, 24, 58, 59]. Neural network models have been shown to be able to account for a variety of phenomena in phonology [23, 61, 62, 18, 22], morphology [51, 41, 40] and role assignment [42, 58]. Induction of simpler grammars has been addressed often – e.g. [64, 65, 19] on learning Tomita languages [60]. The task considered here differs from these in that the grammar is more complex. The recurrent neural networks investigated in this paper constitute complex, dynamical systems – it has been shown that recurrent networks have the representational power required for hierarchical solutions [13], and that they are Turing equivalent.
2.2 Language and Its Acquisition

Certainly one of the most important questions for the study of human language is: How do people unfailingly manage to acquire such a complex rule system? A system so complex that it has resisted the efforts of linguists to date to adequately describe in a formal system [8]. A couple of examples of the kind of knowledge native speakers often take for granted are provided in this section.

For instance, any native speaker of English knows that the adjective eager obligatorily takes a complementizer for with a sentential complement that contains an overt subject, but that the verb believe cannot. Moreover, eager may take a sentential complement with a non-overt, i.e. an implied or understood, subject, but believe cannot²:

*I am eager John to be here        I believe John to be here
I am eager for John to be here     *I believe for John to be here
I am eager to be here              *I believe to be here
Such grammaticality judgments are sometimes subtle but unarguably form part of the native speaker's language competence. In other cases, judgment falls not on acceptability but on other aspects of language competence such as interpretation. Consider the reference of the embedded subject of the predicate to talk to in the following examples:

John is too stubborn for Mary to talk to
John is too stubborn to talk to
John is too stubborn to talk to Bill

In the first sentence, it is clear that Mary is the subject of the embedded predicate. As every native speaker knows, there is a strong contrast in the co-reference options for the understood subject in the second and

²As is conventional, an asterisk is used to indicate ungrammaticality.
third sentences despite their surface similarity. In the third sentence, John must be the implied subject of the predicate to talk to. By contrast, John is understood as the object of the predicate in the second sentence, the subject here having arbitrary reference; in other words, the sentence can be read as John is too stubborn for some arbitrary person to talk to John. The point to emphasize here is that the language faculty has impressive discriminatory power, in the sense that a single word, as seen in the examples above, can result in sharp differences in acceptability or alter the interpretation of a sentence considerably. Furthermore, the judgments shown above are robust in the sense that virtually all native speakers agree with the data.

In the light of such examples and the fact that such contrasts crop up not just in English but in other languages (for example, the stubborn contrast also holds in Dutch), some linguists (chiefly Chomsky [7]) have hypothesized that it is only reasonable that such knowledge is only partially acquired: the lack of variation found across speakers, and indeed, languages for certain classes of data suggests that there exists a fixed component of the language system. In other words, there is an innate component of the language faculty of the human mind that governs language processing. All languages obey these so-called universal principles. Since languages do differ with regard to things like subject-object-verb order, these principles are subject to parameters encoding systematic variations found in particular languages. Under the innateness hypothesis, only the language parameters plus the language-specific lexicon are acquired by the speaker; in particular, the principles are not learned. Based on these assumptions, the study of these language-independent principles has become known as the Principles-and-Parameters framework, or Government-and-Binding (GB) theory.
This paper investigates whether a neural network can be made to exhibit the same kind of discriminatory power on the sort of data GB-linguists have examined. More precisely, the goal is to train a neural network from scratch, i.e. without the division into learned vs. innate components assumed by Chomsky, to produce the same judgments as native speakers on the grammatical/ungrammatical pairs of the sort discussed above. Instead of using innate knowledge, positive and negative examples are used (a second argument for innateness is that it is not possible to learn the grammar without negative examples).
3 Data

We first provide a brief introduction to formal grammars, grammatical inference, and natural language; for a thorough introduction see Harrison [25] and Fu [17]. We then detail the data set which we have used in our experiments.
3.1 Formal Grammars and Grammatical Inference

Briefly, a grammar G is a four-tuple (T, N, P, S), where T and N are sets of terminals and nonterminals comprising the alphabet of the grammar, P is a set of production rules, and S is the start symbol. For every grammar there exists a language L, a set of strings of the terminal symbols, that the grammar generates or recognizes. There also exist automata that recognize and generate the grammar. Grammatical inference is concerned mainly with the procedures that can be used to infer the syntactic or production rules of an unknown grammar G based on a finite set of strings from L(G), the language generated by G, and possibly also on a finite set of strings from the complement of L(G) [17]. This paper considers replacing the inference algorithm with a neural network and the grammar is that of the English language. The simple grammar used by Elman [13] shown in table 1 contains some of the structures in the complete English grammar: e.g. agreement, verb argument structure, interactions with relative clauses, and recursion.
S → NP VP "."
NP → PropN | N | N RC
VP → V (NP)
RC → who NP VP | who VP (NP)
N → boy | girl | cat ...
PropN → John | Mary
V → chase | feed | see ...

Table 1. A simple grammar encompassing a subset of the English language (from [13]). NP = noun phrase, VP = verb phrase, PropN = proper noun, RC = relative clause, V = verb, N = noun, and S = the full sentence.
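To make the generative reading of Table 1 concrete, here is a minimal sketch (not from the paper) that expands the grammar's nonterminals at random. The depth bound, which curtails the recursive NP → N RC rule, is an addition to keep generation finite:

```python
import random

# Production rules from Table 1; "|" alternatives become list entries,
# and the optional (NP) is expanded into with/without variants.
GRAMMAR = {
    "S":     [["NP", "VP", "."]],
    "NP":    [["PropN"], ["N"], ["N", "RC"]],
    "VP":    [["V"], ["V", "NP"]],
    "RC":    [["who", "NP", "VP"], ["who", "VP"], ["who", "VP", "NP"]],
    "N":     [["boy"], ["girl"], ["cat"]],
    "PropN": [["John"], ["Mary"]],
    "V":     [["chase"], ["feed"], ["see"]],
}

def generate(symbol="S", depth=0, max_depth=5):
    """Expand a nonterminal into a list of terminal words."""
    if symbol not in GRAMMAR:            # terminal symbol: emit it
        return [symbol]
    rules = GRAMMAR[symbol]
    if depth >= max_depth and symbol == "NP":
        rules = [["PropN"], ["N"]]       # cut off the recursive N RC branch
    words = []
    for sym in random.choice(rules):
        words.extend(generate(sym, depth + 1, max_depth))
    return words

random.seed(0)
print(" ".join(generate()))
```

Repeated calls produce sentences such as "boy who chases Mary sees the cat ." style strings (modulo agreement, which the symbol grammar does not enforce), illustrating the recursion through relative clauses.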
In the Chomsky hierarchy of phrase structured grammars, the simplest grammar and its associated automata are regular grammars and finite-state automata (FSA). However, it has been firmly established [6] that the syntactic structures of natural language cannot be parsimoniously described by regular languages. Certain phenomena (e.g. center embedding) are more compactly described by context-free grammars which are recognized by push-down automata, while others (e.g. crossed-serial dependencies and agreement) are better described by context-sensitive grammars which are recognized by linear bounded automata [50].
3.2 Data

The data used in this work consists of 552 English positive and negative examples taken from an introductory GB-linguistics textbook by Lasnik and Uriagereka [37]. Most of these examples are organized into minimal pairs like the example I am eager for John to win/*I am eager John to win above. The minimal nature of the changes involved suggests that the data set may represent an especially difficult task for the models. Due to the small sample size, the raw data, namely words, were first converted (using an existing parser) into the major syntactic categories assumed under GB-theory. Table 2 summarizes the parts of speech that were used.

The part-of-speech tagging represents the sole grammatical information supplied to the models about particular sentences in addition to the grammaticality status. An important refinement that was implemented
Category            Examples
Nouns (N)           John, book and destruction
Verbs (V)           hit, be and sleep
Adjectives (A)      eager, old and happy
Prepositions (P)    without and from
Complementizer (C)  that or for as in I thought that ... or I am eager for ...
Determiner (D)      the or each as in the man or each man
Adverb (Adv)        sincerely or why as in I sincerely believe ... or Why did John want ...
Marker (Mrkr)       possessive 's, of, or to as in John's mother, the destruction of ..., or I want to help ...

Table 2. Parts of speech
was to include sub-categorization information for the major predicates, namely nouns, verbs, adjectives and prepositions. Experiments showed that adding sub-categorization to the bare category information improved the performance of the models. For example, an intransitive verb such as sleep would be placed into a different class from the obligatorily transitive verb hit. Similarly, verbs that take sentential complements or double objects such as seem, give or persuade would be representative of other classes³. Fleshing out the sub-categorization requirements along these lines for lexical items in the training set resulted in 9 classes for verbs, 4 for nouns and adjectives, and 2 for prepositions. Examples of the input data are shown in table 3.
Sentence                          Encoding                   Grammatical Status
I am eager for John to be here    n4 v2 a2 c n4 v2 adv       1
                                  n4 v2 a2 c n4 p1 v2 adv    1
I am eager John to be here        n4 v2 a2 n4 v2 adv         0
                                  n4 v2 a2 n4 p1 v2 adv      0
I am eager to be here             n4 v2 a2 v2 adv            1
                                  n4 v2 a2 p1 v2 adv         1

Table 3. Examples of the part-of-speech tagging
Tagging was done in a completely context-free manner. Obviously, a word, e.g. to, may be part of more than one part-of-speech. The tagging resulted in several contradictory and duplicated sentences. Various methods

³Following classical GB theory, these classes are synthesized from the theta-grids of individual predicates via the Canonical Structural Realization (CSR) mechanism of Pesetsky [49].
were tested to deal with these cases, however they were removed altogether for the results reported here. In addition, the number of positive and negative examples was equalized (by randomly removing examples from the higher frequency class) in all training and test sets in order to reduce any effects due to differing a priori class probabilities (when the number of samples per class varies between classes there may be a bias towards predicting the more common class [3, 2]).
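The equalization step can be sketched as follows (`equalize_classes` is a hypothetical helper; labels 1/0 denote grammatical/ungrammatical):

```python
import random

def equalize_classes(examples):
    """Randomly drop examples from the larger class so that grammatical
    (label 1) and ungrammatical (label 0) examples are equally frequent.
    `examples` is a list of (encoding, label) pairs."""
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    n = min(len(pos), len(neg))
    balanced = random.sample(pos, n) + random.sample(neg, n)
    random.shuffle(balanced)
    return balanced

data = [("n4 v2 a2 c n4 v2 adv", 1)] * 6 + [("n4 v2 a2 n4 v2 adv", 0)] * 4
print(len(equalize_classes(data)))   # 8: four examples per class
```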
4 Neural Network Models and Data Encoding

The following architectures were investigated. Architectures 1 to 3 are topological restrictions of 4 when the number of hidden nodes is equal and in this sense may not have the representational capability of model 4. It is expected that the Frasconi-Gori-Soda (FGS) architecture will be unable to perform the task and it has been included primarily as a control case.

1. Frasconi-Gori-Soda (FGS) locally recurrent networks [16]. A multilayer perceptron augmented with local feedback around each hidden node. The local-output version has been used. The FGS network has also been studied by [43] – the network is called FGS in this paper in line with [63].

2. Narendra and Parthasarathy [44]. A recurrent network with feedback connections from each output node to all hidden nodes. The N&P network architecture has also been studied by Jordan [33, 34] – the network is called N&P in this paper in line with [30].

3. Elman [13]. A recurrent network with feedback from each hidden node to all hidden nodes. When training the Elman network, backpropagation-through-time is used rather than the truncated version used by Elman, i.e. in this paper "Elman network" refers to the architecture used by Elman but not the training algorithm.

4. Williams and Zipser [67]. A recurrent network where all nodes are connected to all other nodes.

Diagrams of these architectures are shown in figures 1 to 4.
For input to the neural networks, the data was encoded into a fixed length window made up of segments containing eight separate inputs, corresponding to the classifications noun, verb, adjective, etc. Sub-categories of the classes were linearly encoded into each input in a manner demonstrated by the specific values for the noun input: Not a noun = 0, noun class 1 = 0.5, noun class 2 = 0.667, noun class 3 = 0.833, noun class 4 = 1. The linear order was defined according to the similarity between the various sub-categories⁴. Two outputs were used in the neural networks, corresponding to grammatical and ungrammatical classifications. The data was input to the neural networks with a window which is passed over the sentence in temporal

⁴A fixed length window made up of segments containing 23 separate inputs, corresponding to the classifications noun class 1, noun class 2, verb class 1, etc. was also tested but proved inferior.
Figure 1. A Frasconi-Gori-Soda locally recurrent network. Not all connections are shown fully.

Figure 2. A Narendra & Parthasarathy recurrent network. Not all connections are shown fully.

Figure 3. An Elman recurrent network. Not all connections are shown fully.
Figure 4. A Williams & Zipser fully recurrent network. Not all connections are shown fully.
order from the beginning to the end of the sentence (see figure 5). The size of the window was variable from one word to the length of the longest sentence. We note that the case where the input window is small is of greater interest – the larger the input window, the greater the capability of a network to correctly classify the training data without forming a grammar. For example, if the input window is equal to the longest sentence, then the network does not have to store any information – it can simply map the inputs directly to the classification. However, if the input window is relatively small, then the network must learn to store information. As will be shown later these networks implement a grammar, and a deterministic finite state automaton which recognizes this grammar can be extracted from the network. Thus, we are most interested in the small input window case, where the networks are required to form a grammar in order to perform well.
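The encoding and sliding window described above can be sketched as follows. Only the noun values are given explicitly in the text, so spacing the sub-classes of the other categories linearly between 0.5 and 1.0 in the same way is an assumption here, as are the helper names:

```python
# Eight inputs per window position, one per category, with the number of
# sub-classes reported in the text (9 verb, 4 noun/adjective, 2 preposition
# classes; the remaining categories are treated as single-class here).
CATEGORIES = ["n", "v", "a", "p", "c", "d", "adv", "mrkr"]
NUM_CLASSES = {"n": 4, "v": 9, "a": 4, "p": 2, "c": 1, "d": 1, "adv": 1, "mrkr": 1}

def encode_tag(tag):
    """Encode one tag, e.g. 'n4' or 'adv', into 8 inputs. Sub-classes are
    spaced linearly between 0.5 and 1.0, matching the noun values in the
    text (0, 0.5, 0.667, 0.833, 1.0); 0 means 'not this category'."""
    vec = [0.0] * len(CATEGORIES)
    cat = tag.rstrip("0123456789")
    k = int(tag[len(cat):] or 1)                     # sub-class index
    n = NUM_CLASSES[cat]
    vec[CATEGORIES.index(cat)] = 1.0 if n == 1 else 0.5 + (k - 1) * 0.5 / (n - 1)
    return vec

def windows(tags, size=2):
    """Slide a fixed-length window over the tagged sentence, padding the
    start with zero vectors, and concatenate the per-word encodings."""
    pad = [[0.0] * len(CATEGORIES)] * (size - 1)
    encoded = pad + [encode_tag(t) for t in tags]
    return [sum(encoded[i:i + size], []) for i in range(len(tags))]

sentence = "n4 v2 a2 c n4 v2 adv".split()   # "I am eager for John to be here"
for w in windows(sentence):
    print([round(x, 3) for x in w])
```

With a two-word window this yields one 16-dimensional input vector per word, presented to the network in temporal order.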
5 Gradient Descent and Simulated Annealing Learning

Backpropagation-through-time [66]⁵ has been used to train the globally recurrent networks⁶, and the gradient descent algorithm described by the authors [16] was used for the FGS network. The standard gradient descent algorithms were found to be impractical for this problem⁷. The techniques described below for improving convergence were investigated. Due to the dependence on the initial parameters, a number of simulations were performed with different initial weights and training set/test set combinations. However, due to the computational complexity of the task⁸, it was not possible to perform as many simulations as

⁵Backpropagation-through-time extends backpropagation to include temporal aspects and arbitrary connection topologies by considering an equivalent feedforward network created by unfolding the recurrent network in time.
⁶Real-time recurrent learning (RTRL) [67] was also tested but did not show any significant convergence for the present problem.
⁷Without modifying the standard gradient descent algorithms it was only possible to train networks which operated on a large temporal input window. These networks were not forced to model the grammar, they only memorized and interpolated between the training data.
⁸Each individual simulation in this section took an average of two hours to complete on a Sun Sparc 10 server.
Figure 5. Depiction of how the neural network inputs come from an input window on the sentence. The window moves from the beginning to the end of the sentence.

desired. The standard deviation of the NMSE values is included to help assess the significance of the results. Table 4 shows some results for using and not using the techniques listed below. Except where noted, the results in this section are for Elman networks using: two word inputs, 10 hidden nodes, the quadratic cost function, the logistic sigmoid function, sigmoid output activations, one hidden layer, the learning rate schedule shown below, an initial learning rate of 0.2, the weight initialization strategy discussed below, and 1 million stochastic updates (target values are only provided at the end of each sentence).
Standard                   NMSE    Std. Dev.    Variation                  NMSE    Std. Dev.
Update: batch              0.931   0.0036       Update: stochastic         0.366   0.035
Learning rate: constant    0.742   0.154        Learning rate: schedule    0.394   0.035
Activation: logistic       0.387   0.023        Activation: tanh           0.405   0.14
Sectioning: no             0.367   0.011        Sectioning: yes            0.573   0.051
Cost function: quadratic   0.470   0.078        Cost function: entropy     0.651   0.0046

Table 4. Comparisons of using and not using various convergence techniques. All other parameters are constant in each case: Elman networks using: two word inputs (i.e. a sliding window of the current and previous word), 10 hidden nodes, the quadratic cost function, the logistic activation function, sigmoid output activations, one hidden layer, a learning rate schedule with an initial learning rate of 0.2, the weight initialization strategy discussed below, and 1 million stochastic updates. Each NMSE result represents the average of four simulations. The standard deviation value given is the standard deviation of the four individual results.
1. Detection of Significant Error Increases. If the NMSE increases significantly during training then network weights are restored from a previous epoch and are perturbed to prevent updating to the same point. This technique was found to increase robustness of the algorithm when using learning rates large enough to help avoid problems due to local minima and "flat spots" on the error surface, particularly in the case of the Williams & Zipser network.

2. Target Outputs. Target outputs were 0.1 and 0.9 using the logistic activation function and -0.8 and 0.8 using the tanh activation function. This helps avoid saturating the sigmoid function. If targets were set to the asymptotes of the sigmoid this would tend to: a) drive the weights to infinity, b) cause outlier data to produce very large gradients due to the large weights, and c) produce binary outputs even when incorrect – leading to decreased reliability of the confidence measure.
3. Stochastic Versus Batch Update. In stochastic update, parameters are updated after each pattern presentation, whereas in true gradient descent (often called "batch" updating) gradients are accumulated over the complete training set. Batch update attempts to follow the true gradient, whereas a stochastic path is followed using stochastic update.

Stochastic update is often much quicker than batch update, especially with large, redundant data sets [39]. Additionally, the stochastic path may help the network to escape from local minima. However, the error can jump around without converging unless the learning rate is reduced, most second order methods do not work well with stochastic update, and stochastic update is harder to parallelize than batch [39]. Batch update provides guaranteed convergence (to local minima) and works better with second order techniques. However it can be very slow, and may converge to very poor local minima.

In the results reported, the training times were equalized by reducing the number of updates for the batch case (for an equal number of weight updates batch update would otherwise be much slower). Batch update often converges quicker using a higher learning rate than the optimal rate used for stochastic update⁹, hence altering the learning rate for the batch case was investigated. However, significant convergence was not obtained as shown in table 4.
4. Weight Initialization. Random weights are initialized with the goal of ensuring that the sigmoids do not start out in saturation but are not very small (corresponding to a flat part of the error surface) [26]. In addition, several (20) sets of random weights are tested and the set which provides the best performance on the training data is chosen. In our experiments on the current problem, it was found that these techniques do not make a significant difference.
5. Learning Rate Schedules. Relatively high learning rates are typically used in order to help avoid slow convergence and local minima. However, a constant learning rate results in significant parameter and performance fluctuation during the entire training cycle such that the performance of the network can alter significantly from the beginning to the end of the final epoch. Moody and Darken have proposed "search then converge" learning rate schedules of the form [10, 11]:

η(t) = η₀ / (1 + t/τ)    (1)

where η(t) is the learning rate at time t, η₀ is the initial learning rate, and τ is a constant.

⁹Stochastic update does not generally tolerate as high a learning rate as batch update due to the stochastic nature of the updates.
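Equation (1), Moody and Darken's search-then-converge schedule, can be sketched directly (the value of τ here is illustrative, not from the paper):

```python
def search_then_converge(t, eta0=0.2, tau=1000.0):
    """'Search then converge' schedule, equation (1):
    eta(t) = eta0 / (1 + t/tau). The rate stays near eta0 while t << tau
    (search phase) and decays like eta0*tau/t for t >> tau (converge phase)."""
    return eta0 / (1.0 + t / tau)

for t in [0, 1000, 10000]:
    print(t, round(search_then_converge(t), 4))
```

At t = τ the rate has halved; the crossover point τ controls how long the high-rate search phase lasts.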
We have found that the learning rate during the final epoch still results in considerable parameter fluctuation¹⁰, and hence we have added an additional term to further reduce the learning rate over the final epochs (our specific learning rate schedule can be found in a later section). We have found the use of learning rate schedules to improve performance considerably as shown in table 4.
6. Activation Function. Symmetric sigmoid functions (e.g. tanh) often improve convergence over the standard logistic function. For our particular problem we found that the difference was minor and that the logistic function resulted in better performance as shown in table 4.
7. Cost Function. The relative entropy cost function [4, 29, 57, 26, 27] has received particular attention and has a natural interpretation in terms of learning probabilities [36]. We investigated using both quadratic and relative entropy cost functions:

Definition 1 The quadratic cost function is defined as

E = (1/2) Σᵢ (dᵢ − yᵢ)²    (2)

Definition 2 The relative entropy cost function is defined as

E = Σᵢ [ ((1 + dᵢ)/2) log((1 + dᵢ)/(1 + yᵢ)) + ((1 − dᵢ)/2) log((1 − dᵢ)/(1 − yᵢ)) ]    (3)

where y and d correspond to the actual and desired output values, and i ranges over the outputs (and also the patterns for batch update). We found the quadratic cost function to provide better performance as shown in table 4. A possible reason for this is that the use of the entropy cost function leads to an increased variance of weight updates and therefore decreased robustness in parameter updating.
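A minimal sketch of the two cost functions of definitions 1 and 2; the symmetric (1 ± d)/2 form of the relative entropy is assumed here (suited to outputs in (−1, 1)), and the per-pattern sum is left implicit:

```python
import math

def quadratic_cost(y, d):
    """Equation (2): one half the summed squared error over the outputs.
    y = actual outputs, d = desired outputs."""
    return 0.5 * sum((di - yi) ** 2 for yi, di in zip(y, d))

def relative_entropy_cost(y, d):
    """Equation (3): relative entropy between desired and actual outputs.
    Zero exactly when y == d, positive otherwise."""
    total = 0.0
    for yi, di in zip(y, d):
        total += 0.5 * (1 + di) * math.log((1 + di) / (1 + yi))
        total += 0.5 * (1 - di) * math.log((1 - di) / (1 - yi))
    return total

y, d = [0.7, 0.2], [0.9, 0.1]   # actual vs. desired outputs
print(quadratic_cost(y, d))
print(relative_entropy_cost(y, d))
```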
8. Sectioning of the Training Data. We investigated dividing the training data into subsets. Initially, only one of these subsets was used for training. After 100% correct classification was obtained or a pre-specified time limit expired, an additional subset was added to the "working" set. This continued until the working set contained the entire training set. The data was ordered in terms of sentence length with the shortest sentences first. This enabled the networks to focus on the simpler data first. Elman suggests that the initial training constrains later training in a useful way [13]. However, for our problem, the use of sectioning has consistently decreased performance as shown in table 4.

¹⁰NMSE results which are obtained over an epoch involving stochastic update can be misleading. We have been surprised to find quite significant difference in these on-line NMSE calculations compared to a static calculation even if the algorithm appears to have converged.
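The sectioning scheme can be sketched as follows (`sectioned_training` and the `train` callback are hypothetical names; the actual stopping criteria are 100% classification or a time limit, as stated above):

```python
def sectioned_training(examples, num_sections=4, train=None):
    """Order examples by sentence length, split into subsets, and grow the
    'working' set one subset at a time, training on each enlargement.
    `examples` is a list of (encoding, label) pairs."""
    ordered = sorted(examples, key=lambda e: len(e[0]))   # shortest first
    size = -(-len(ordered) // num_sections)               # ceiling division
    working = []
    for i in range(num_sections):
        working.extend(ordered[i * size:(i + 1) * size])
        if train is not None:
            train(working)   # until 100% correct or a time limit expires
    return working

data = [("n4 v2", 1), ("n4 v2 a2 c n4 v2 adv", 1), ("n4 v2 a2", 0), ("n4", 0)]
print([len(s[0]) for s in sectioned_training(data)])
```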
We have also investigated the use of simulated annealing. Simulated annealing is a global optimization method [32, 35]. When minimizing a function, any downhill step is accepted and the process repeats from this new point. An uphill step may also be accepted. It is therefore possible to escape from local minima. As the optimization process proceeds, the length of the steps declines and the algorithm converges on the global optimum. Simulated annealing makes very few assumptions regarding the function to be optimized, and is therefore quite robust with respect to non-quadratic error surfaces.
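A minimal, generic sketch of the procedure just described, applied to a toy one-dimensional function rather than the network training itself (the cooling schedule and all constants are illustrative):

```python
import math
import random

def simulated_anneal(f, x0, steps=5000, t0=1.0, seed=0):
    """Minimal simulated annealing: always accept downhill steps; accept
    uphill steps with probability exp(-delta/T); shrink the temperature,
    and with it the typical step length, as the run proceeds."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    for k in range(1, steps + 1):
        temp = t0 / k                                   # cooling schedule
        cand = x + rng.gauss(0.0, max(temp, 1e-3))      # step length ~ temp
        fc = f(cand)
        if fc < fx or rng.random() < math.exp(-(fc - fx) / max(temp, 1e-12)):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = x, fx
    return best, fbest

# A one-dimensional function with several local minima.
f = lambda x: x * x + 2.0 * math.sin(5.0 * x)
x, fx = simulated_anneal(f, x0=3.0)
print(x, fx)
```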
Previous work has shown the use of simulated annealing for finding the parameters of a recurrent network model to improve performance [56]. For comparison with the gradient descent based algorithms the use of simulated annealing has been investigated in order to train exactly the same Elman network as has been successfully trained to 100% correct training set classification using backpropagation-through-time (details are in section 6). No significant results were obtained from these trials¹¹. The use of simulated annealing has not been found to improve performance as in Simard et al. [56]. However, their problem was the parity problem using networks with only four hidden units whereas the networks considered in this paper have many more parameters.

This result provides an interesting comparison to the gradient descent backpropagation-through-time (BPTT) method. BPTT makes the implicit assumption that the error surface is amenable to gradient descent optimization, and this assumption can be a major problem in practice. However, although difficulty is encountered with BPTT, the method is significantly more successful than simulated annealing (which makes few assumptions) for this problem.
6 Experimental Results

Results for the four neural network architectures are given in this section. The results are based on multiple training/test set partitions and multiple random seeds. In addition, a set of Japanese control data was used as a test set (we do not consider training models with the Japanese data because we do not have a large enough data set). Japanese and English are at the opposite ends of the spectrum with regard to word order. That is, Japanese sentence patterns are very different from English. In particular, Japanese sentences are typically SOV (subject-object-verb) with the verb more or less fixed and the other arguments more or less available to freely permute. English data is of course SVO and argument permutation is generally not available. For example, the canonical Japanese word order is simply ungrammatical in English. Hence, it would be extremely surprising if an English-trained model accepts Japanese, i.e. it is expected that a network trained on English will not generalize to Japanese data. This is what we find – all models resulted in no significant generalization on the Japanese data (50% error on average).

¹¹The adaptive simulated annealing code by Lester Ingber [31, 32] was used.
Five simulations were performed for each architecture. Each simulation took approximately four hours on a Sun Sparc 10 server. Table 5 summarizes the results obtained with the various networks. In order to make the number of weights in each architecture approximately equal we have used only single word inputs for the W&Z model but two word inputs for the others. This reduction in dimensionality for the W&Z network improved performance. The networks contained 20 hidden units. Full simulation details are given in section 6.

The goal was to train a network using only a small temporal input window. Initially, this could not be done. With the addition of the techniques described earlier it was possible to train Elman networks with sequences of the last two words as input to give 100% correct (99.6% averaged over 5 trials) classification on the training data. Generalization on the test data resulted in 74.2% correct classification on average. This is better than the performance obtained using any of the other networks, however it is still quite low. The data is quite sparse and it is expected that increased generalization performance will be obtained as the amount of data is increased, as well as increased difficulty in training. Additionally, the data set has been hand-designed by GB linguists to cover a range of grammatical structures and it is likely that the separation into the training and test sets creates a test set which contains many grammatical structures that are not covered in the training set. The Williams & Zipser network also performed reasonably well with 71.3% correct classification of the test set. Note that the test set performance was not observed to drop significantly after extended training, indicating that the use of a validation set to control possible overfitting would not alter performance significantly.
TRAIN          Classification   Std. dev.
Elman          99.6%            0.84
FGS            67.1%            1.22
N&P            75.2%            1.41
W&Z            91.7%            2.26

ENGLISH TEST   Classification   Std. dev.
Elman          74.2%            3.82
FGS            59.0%            1.52
N&P            60.6%            0.97
W&Z            71.3%            0.75

Table 5. Results of the network architecture comparison. The classification values reported are an average of five individual simulations, and the standard deviation value is the standard deviation of the five individual results.
Complete details on a sample Elman network are as follows (other networks differ only in topology, except for the W&Z, for which better results were obtained using an input window of only one word): the network contained three layers, including the input layer. The hidden layer contained 20 nodes. Each hidden layer node had a recurrent connection to all other hidden layer nodes. The network was trained for a total of 1 million stochastic updates. All inputs were within the range zero to one. All target outputs were either 0.1 or 0.9. Bias inputs were used. The best of 20 random weight sets was chosen based on training set performance. Weights were initialized as shown in Haykin [26], where weights are initialized on a node-by-node basis as uniformly distributed random numbers in the range (-2.4/F_i, 2.4/F_i), where F_i is the fan-in of neuron i. The logistic output activation function was used. The quadratic cost function was used. A search-then-converge learning rate schedule was used, with eta denoting the learning rate, eta_0 the initial learning rate, N the total number of training epochs, n the current training epoch, and constants c_1 = 50 and c_2 = 0.65. The training set consisted of 373 non-contradictory examples as described earlier. The English test set consisted of 100 non-contradictory samples and the Japanese test set consisted of 119 non-contradictory samples.
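As a rough sketch of two of the ingredients above, fan-in-scaled initialization and a Darken-Moody-style search-then-converge schedule can be written as follows. The schedule constants (`eta0`, `tau_frac`) are illustrative assumptions, not the paper's exact formula:

```python
import random

def init_weight(fan_in, scale=2.4):
    """Fan-in-scaled initialization (Haykin-style): each incoming weight
    of a node is drawn uniformly from [-scale/fan_in, scale/fan_in]."""
    return random.uniform(-scale / fan_in, scale / fan_in)

def search_then_converge(epoch, total_epochs, eta0=0.1, tau_frac=0.65):
    """Search-then-converge schedule: the learning rate stays near eta0
    early in training ("search") and decays roughly as 1/epoch later
    ("converge"). tau_frac is an illustrative constant."""
    tau = tau_frac * total_epochs
    return eta0 / (1.0 + epoch / tau)
```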
We now take a closer look at the operation of the networks. The error during training for a sample of each network architecture is shown in figure 6. The error at each point in the graphs is the NMSE over the complete training set. Note the nature of the Williams & Zipser learning curve and the utility of detecting and correcting for significant error increases (see footnote 12).
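The NMSE here is, under the usual convention (an assumption, since the definition is not restated in this section), the mean squared error divided by the variance of the targets, so that always predicting the target mean gives NMSE = 1:

```python
def nmse(targets, outputs):
    """Normalized mean squared error: MSE divided by the target variance.
    A perfect predictor scores 0; predicting the target mean scores 1."""
    n = len(targets)
    mean_t = sum(targets) / n
    var_t = sum((t - mean_t) ** 2 for t in targets) / n
    mse = sum((t - o) ** 2 for t, o in zip(targets, outputs)) / n
    return mse / var_t
```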
Figure 7 shows an approximation of the "complexity" of the error surface based on the first derivatives of the error criterion with respect to each weight: C = (1/N_w) sum_i |dE/dw_i|, where i sums over all weights in the network and N_w is the total number of weights. This value has been plotted after each epoch during training. Note the more complex nature of the plot for the Williams & Zipser network.
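This complexity measure can be computed directly from the per-weight gradients; a minimal sketch:

```python
def gradient_complexity(gradients):
    """Approximate error-surface "complexity": the mean absolute first
    derivative of the cost with respect to each weight,
    C = (1/N_w) * sum_i |dE/dw_i|, evaluated once per epoch."""
    return sum(abs(g) for g in gradients) / len(gradients)
```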
Figures 8 to 11 show sample plots of the error surface for the various networks. The error surface has many dimensions, making visualization difficult. We plot sample views showing the variation of the error for two dimensions, and note that these plots are indicative only – quantitative conclusions should be drawn from the test error and not from these plots. Each plot shown in the figures is with respect to only two randomly chosen dimensions. In each case, the center of the plot corresponds to the values of the parameters after training. Taken together, the plots provide an approximate indication of the nature of the error surface for the different network types. The FGS network error surface appears to be the smoothest; however, the results indicate that the solutions found do not perform very well, indicating that the minima found are poor compared to the global optimum and/or that the network is not capable of implementing a mapping with low error. The Williams & Zipser fully connected network has greater representational capability than the Elman architecture (in the sense that it can perform a greater variety of computations with the same number of hidden units). However, comparing the Elman and W&Z network error surface plots, it can be observed that the W&Z network has a greater percentage of flat spots. These graphs are not conclusive (because they only show two dimensions and are only plotted around one point in the weight space); however, they back up
12 The learning curve for the Williams & Zipser network can be made smoother by reducing the learning rate, but this tends to promote convergence to poorer local minima.
[Four panels: NMSE (log scale) vs. epoch, 0–3500; final NMSE values: FGS 0.835, Elman 0.0569, N&P 0.693, W&Z 0.258.]
Figure 6. Average NMSE (log scale) over the training set during training. Top to bottom: Frasconi-Gori-Soda, Elman, Narendra & Parthasarathy and Williams & Zipser.
the hypothesis that the W&Z network performs worse because the error surface presents greater difficulty to the training method.
7 Automata Extraction
The extraction of symbolic knowledge from trained neural networks allows the exchange of information between connectionist and symbolic knowledge representations and has been of great interest for understanding what the neural network is actually doing [52]. In addition, symbolic knowledge can be inserted into recurrent neural networks and even refined after training [15, 47, 45].
The ordered triple of a discrete Markov process (state, input → next-state) can be extracted from an RNN
[Four panels: approximate error-surface complexity vs. epoch, 0–3500.]
Figure 7. Approximate "complexity" of the error surface during training. Top to bottom: Frasconi-Gori-Soda, Elman, Narendra & Parthasarathy and Williams & Zipser.
and used to form an equivalent deterministic finite state automaton (DFA). This can be done by clustering the activation values of the recurrent state neurons [46]. The automata extracted with this process can only recognize regular grammars (see footnote 13).
However, natural language [6] cannot be parsimoniously described by regular languages – certain phenomena (e.g. center embedding) are more compactly described by context-free grammars, while others (e.g. crossed-serial dependencies and agreement) are better described by context-sensitive grammars. Hence, the networks may be implementing more parsimonious versions of the grammar which we are unable to extract with this technique.
13 A regular grammar G is a 4-tuple G = (S, N, T, P), where S is the start symbol, N and T are non-terminal and terminal symbols, respectively, and P represents productions of the form A → a or A → aB, where A, B ∈ N and a ∈ T.
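To make the footnote's definition concrete, a right-linear grammar of this form can be simulated directly as a nondeterministic automaton. The grammar below is a hypothetical example (not from the paper) generating the regular language (ab)+:

```python
def accepts(productions, start, string):
    """Simulate a right-linear grammar (A -> 'a' or A -> 'a' B) as an NFA.
    `productions` maps a nonterminal to a list of (terminal, next) pairs,
    where next is another nonterminal or None (a terminating production)."""
    states = {start}
    for ch in string:
        nxt = set()
        for nt in states:
            for term, dest in productions.get(nt, []):
                if term == ch:
                    nxt.add(dest)
        states = nxt
    # accept if some derivation consumed the whole string and terminated
    return None in states

# Hypothetical grammar: S -> aB, B -> bS | b, generating (ab)+
g = {"S": [("a", "B")], "B": [("b", "S"), ("b", None)]}
```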
Figure 8. Error surface plots for the FGS network. Each plot is with respect to two randomly chosen dimensions. In each case, the center of the plot corresponds to the values of the parameters after training.
Figure 9. Error surface plots for the N&P network. Each plot is with respect to two randomly chosen dimensions. In each case, the center of the plot corresponds to the values of the parameters after training.
The algorithm we use for automata extraction (from [19]) works as follows: after the network is trained (or even during training), we apply a procedure for extracting what the network has learned — i.e., the network's current conception of what DFA it has learned. The DFA extraction process includes the following steps: 1) clustering of the recurrent network activation space, S, to form DFA states, 2) constructing a transition
Figure 10. Error surface plots for the Elman network. Each plot is with respect to two randomly chosen dimensions. In each case, the center of the plot corresponds to the values of the parameters after training.
Figure 11. Error surface plots for the W&Z network. Each plot is with respect to two randomly chosen dimensions. In each case, the center of the plot corresponds to the values of the parameters after training.
diagram by connecting these states together with alphabet-labelled arcs, 3) putting these transitions together to make the full digraph – forming loops, and 4) reducing the digraph to a minimal representation. The hypothesis is that during training, the network begins to partition (or quantize) its state space into fairly well-separated, distinct regions or clusters, which represent corresponding states in some finite state automaton (recently, it has been proved that arbitrary DFAs can be stably encoded into recurrent neural networks [45]). One simple way of finding these clusters is to divide each neuron's range into q partitions of equal width. Thus, for N hidden neurons, there exist q^N possible partition states. The DFA is constructed by generating a state transition diagram, i.e. associating an input symbol with the partition state it just left and the partition state it activates. The initial partition state, or start state of the DFA, is determined from the initial value of S(t=0). If the next input symbol maps to the same partition state value, we assume that a loop is formed. Otherwise, a new state in the DFA is formed. The DFA thus constructed may contain a maximum of q^N states; in practice it is usually much less, since not all partition states are reached by S(t). Eventually this process must terminate, since there are only a finite number of partitions available; and, in practice, many of the partitions are never reached. The derived DFA can then be reduced to its minimal DFA using standard minimization algorithms [28]. It should be noted that this DFA extraction method may be applied to any discrete-time recurrent net, regardless of order or hidden layers. Recently, the extraction process has been proven to converge and extract any DFA learned or encoded by the neural network [5].
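The quantization step can be sketched as follows (a generic sketch over hypothetical state trajectories; the actual extraction operates on the trained network's recurrent activations):

```python
def extract_dfa(trajectories, q):
    """Build a DFA transition table from recurrent-state trajectories.
    Each neuron's activation range [0, 1] is divided into q equal-width
    partitions; a hidden-state vector maps to the tuple of its partition
    indices. Each trajectory is (input_symbols, state_vectors) with
    len(state_vectors) == len(input_symbols) + 1; the first vector is
    the initial state, which determines the DFA start state."""
    def partition(vec):
        return tuple(min(int(v * q), q - 1) for v in vec)

    transitions = {}
    start = None
    for symbols, vectors in trajectories:
        current = partition(vectors[0])
        if start is None:
            start = current  # start state from the initial value of S(t=0)
        for symbol, vec in zip(symbols, vectors[1:]):
            nxt = partition(vec)
            transitions[(current, symbol)] = nxt  # same state => self-loop
            current = nxt
    return start, transitions
```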
The extracted DFAs depend on the quantization level, q. We extracted DFAs using values of q starting from 3 and used standard minimization techniques to compare the resulting automata [28]. We passed the training and test data sets through the extracted DFAs. We found that the extracted automata correctly classified 95% of the training data and 60% of the test data for the best value of q tried. Smaller values of q produced DFAs with lower performance, and larger values of q did not produce significantly better performance. A sample extracted automaton can be seen in figure 12. It is difficult to interpret the extracted automata, and a topic for future research is analysis of the extracted automata with a view to aiding interpretation. Additionally, an important open question is how well the extracted automata approximate the grammar implemented by the recurrent network, which may not be a regular grammar.
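The standard minimization referred to here can be sketched with Moore-style partition refinement (a generic sketch of the textbook algorithm, not the authors' implementation); two states are merged when no input string distinguishes them:

```python
def minimize(states, alphabet, delta, accepting):
    """Moore-style DFA minimization by iterative partition refinement.
    `delta` maps (state, symbol) -> state; assumes a complete DFA.
    Returns a dict mapping each state to its equivalence-class index."""
    # initial partition: accepting vs. non-accepting states
    block = {s: (s in accepting) for s in states}
    while True:
        # states stay equivalent only if their successors' blocks agree
        signature = {s: (block[s],
                         tuple(block[delta[(s, a)]] for a in alphabet))
                     for s in states}
        ids = {sig: i for i, sig in enumerate(sorted(set(signature.values())))}
        new_block = {s: ids[signature[s]] for s in states}
        if new_block == block:  # no refinement occurred: partition is stable
            return new_block
        block = new_block
```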
Automata extraction may also be useful for improving the performance of the system via an iterative combination of rule extraction and rule insertion. Significant learning time improvements can be achieved by training networks with prior knowledge [46]. This may lead to the ability to train larger networks which encompass more of the target grammar.
8 Discussion
This paper has investigated the use of various recurrent neural network architectures (FGS, N&P, Elman and W&Z) for classifying natural language sentences as grammatical or ungrammatical, thereby exhibiting the same kind of discriminatory power provided by the Principles and Parameters linguistic framework, or Government-and-Binding theory. From best to worst performance, the architectures were: Elman, W&Z, N&P and FGS. It is not surprising that the Elman network outperforms the FGS and N&P networks. The computational power of Elman networks has been shown to be at least Turing equivalent [55], whereas the N&P networks have been shown to be Turing equivalent [54] but to within a linear slowdown. FGS networks
Figure 12. An automaton extracted from an Elman network trained to perform the natural language task. The start state is state 1 at the bottom left and the accepting state is state 17 at the top right. All strings which do not reach the accepting state are rejected.
have recently been shown to be the most computationally limited [14]. Elman networks are just a special case of W&Z networks – the fact that the Elman and W&Z networks are the top performers is not surprising. However, theoretically, why the Elman network outperformed the W&Z network is an open question. Our experimental results suggest that this is a training issue and not a representational issue. Backpropagation-through-time (BPTT) is an iterative algorithm that is not guaranteed to find the global minimum of the cost function error surface. The error surface is different for the Elman and W&Z networks, and our results suggest that the error surface of the W&Z network is less suitable for the BPTT training algorithm used. However, all architectures do learn some representation of the grammar.
Are the networks learning the grammar? The hierarchy of architectures with increasing computational power (for a given number of hidden nodes) gives an insight into whether the increased power is used to model the more complex structures found in the grammar. The fact that the more powerful Elman and W&Z networks provided increased performance suggests that they were able to find structure in the data which it may not be possible to model with the FGS network. Additionally, investigation of the data suggests that 100% correct classification on the training data with only two-word inputs would not be possible unless the networks were able to learn significant aspects of the grammar.
Another comparison of recurrent neural network architectures, that of Giles and Horne [30], compared various networks on randomly generated 6 and 64-state finite memory machines. The locally recurrent and Narendra & Parthasarathy networks proved as good as or superior to more powerful networks like the Elman network, indicating that either the task did not require the increased power, or the vanilla backpropagation-through-time learning algorithm used was unable to exploit it.
This paper has shown that both Elman and W&Z recurrent neural networks are able to learn an appropriate grammar for discriminating between the sharply grammatical/ungrammatical pairs used by GB linguists. However, generalization is limited by the amount and nature of the data available, and it is expected that increased difficulty will be encountered in training the models as more data is used. It is clear that there is considerable difficulty scaling the models considered here up to larger problems. We need to continue to address the convergence of the training algorithms, and believe that further improvement is possible by addressing the nature of parameter updating during gradient descent. However, a point must be reached after which improvement with gradient descent based algorithms requires consideration of the nature of the error surface. This is related to the input and output encodings (rarely are they chosen with the specific aim of controlling the error surface), the ability of parameter updates to modify network behavior without destroying previously learned information, and the method by which the network implements structures such as hierarchical and recursive relations.
Acknowledgments
This work has been partially supported by the Australian Telecommunications and Electronics Research Board (SL).
References
[1] Robert B. Allen. Sequential connectionist networks for answering simple questions about a microworld. In 5th Annual Proceedings of the Cognitive Science Society, pages 489–495, 1983.
[2] Etienne Barnard and Elizabeth C. Botha. Back-propagation uses prior information efficiently. IEEE Transactions on Neural Networks, 4(5):794–802, September 1993.
[3] Etienne Barnard and David Casasent. A comparison between criterion functions for linear classifiers, with an application to neural nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(5):1030–1041, 1989.
[4] E.B. Baum and F. Wilczek. Supervised learning of probability distributions by neural networks. In D.Z. Anderson, editor, Neural Information Processing Systems, pages 52–61, New York, 1988. American Institute of Physics.
[5] M.P. Casey. The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8(6):1135–1178, 1996.
[6] N.A. Chomsky. Three models for the description of language. IRE Transactions on Information Theory, IT-2:113–124, 1956.
[7] N.A. Chomsky. Lectures on Government and Binding. Foris Publications, 1981.
[8] N.A. Chomsky. Knowledge of Language: Its Nature, Origin, and Use. Prager, 1986.
[9] A. Cleeremans, D. Servan-Schreiber, and J.L. McClelland. Finite state automata and simple recurrent networks. Neural Computation, 1(3):372–381, 1989.
[10] C. Darken and J.E. Moody. Note on learning rate schedules for stochastic optimization. In R.P. Lippmann, J.E. Moody, and D.S. Touretzky, editors, Advances in Neural Information Processing Systems, volume 3, pages 832–838. Morgan Kaufmann, San Mateo, CA, 1991.
[11] C. Darken and J.E. Moody. Towards faster stochastic gradient search. In Neural Information Processing Systems 4, pages 1009–1016. Morgan Kaufmann, 1992.
[12] J.L. Elman. Structured representations and connectionist models. In 6th Annual Proceedings of the Cognitive Science Society, pages 17–25, 1984.
[13] J.L. Elman. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2/3):195–226, 1991.
[14] P. Frasconi and M. Gori. Computational capabilities of local-feedback recurrent networks acting as finite-state machines. IEEE Transactions on Neural Networks, 7(6):1521–1524, 1996.
[15] P. Frasconi, M. Gori, M. Maggini, and G. Soda. Unified integration of explicit rules and learning by example in recurrent networks. IEEE Transactions on Knowledge and Data Engineering, 7(2):340–346, 1995.
[16] P. Frasconi, M. Gori, and G. Soda. Local feedback multilayered networks. Neural Computation, 4(1):120–130, 1992.
[17] K.S. Fu. Syntactic Pattern Recognition and Applications. Prentice-Hall, Englewood Cliffs, NJ, 1982.
[18] M. Gasser and C. Lee. Networks that learn phonology. Technical report, Computer Science Department, Indiana University, 1990.
[19] C. Lee Giles, C.B. Miller, D. Chen, H.H. Chen, G.Z. Sun, and Y.C. Lee. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3):393–405, 1992.
[20] C. Lee Giles, C.B. Miller, D. Chen, G.Z. Sun, H.H. Chen, and Y.C. Lee. Extracting and learning an unknown grammar with recurrent neural networks. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 317–324, San Mateo, CA, 1992. Morgan Kaufmann Publishers.
[21] C. Lee Giles, G.Z. Sun, H.H. Chen, Y.C. Lee, and D. Chen. Higher order recurrent networks & grammatical inference. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 380–387, San Mateo, CA, 1990. Morgan Kaufmann Publishers.
[22] M. Hare. The role of similarity in Hungarian vowel harmony: A connectionist account. Technical Report CRL 9004, Centre for Research in Language, University of California, San Diego, 1990.
[23] M. Hare, D. Corina, and G.W. Cottrell. Connectionist perspective on prosodic structure. Technical Report CRL Newsletter Volume 3 Number 2, Centre for Research in Language, University of California, San Diego, 1989.
[24] Catherine L. Harris and J.L. Elman. Representing variable information with simple recurrent networks. In 6th Annual Proceedings of the Cognitive Science Society, pages 635–642, 1984.
[25] M.H. Harrison. Introduction to Formal Language Theory. Addison-Wesley Publishing Company, Reading, MA, 1978.
[26] S. Haykin. Neural Networks, A Comprehensive Foundation. Macmillan, New York, NY, 1994.
[27] J.A. Hertz, A. Krogh, and R.G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley Publishing Company, Redwood City, CA, 1991.
[28] J.E. Hopcroft and J.D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company, Reading, MA, 1979.
[29] J. Hopfield. Learning algorithms and probability distributions in feed-forward and feed-back networks. Proceedings of the National Academy of Science, 84:8429–8433, 1987.
[30] B.G. Horne and C. Lee Giles. An experimental comparison of recurrent neural networks. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7, pages 697–704. MIT Press, 1995.
[31] L. Ingber. Very fast simulated re-annealing. Mathematical Computer Modelling, 12:967–973, 1989.
[32] L. Ingber. Adaptive simulated annealing (ASA). Technical report, Lester Ingber Research, McLean, VA, 1993.
[33] M.I. Jordan. Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Ninth Annual Conference of the Cognitive Science Society, pages 531–546. Lawrence Erlbaum, 1986.
[34] M.I. Jordan. Serial order: A parallel distributed processing approach. Technical Report ICS Report 8604, Institute for Cognitive Science, University of California at San Diego, La Jolla, CA, May 1986.
[35] S. Kirkpatrick and G.B. Sorkin. Simulated annealing. In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 876–878. MIT Press, Cambridge, Massachusetts, 1995.
[36] S. Kullback. Information Theory and Statistics. Wiley, New York, 1959.
[37] H. Lasnik and J. Uriagereka. A Course in GB Syntax: Lectures on Binding and Empty Categories. MIT Press, Cambridge, MA, 1988.
[38] Steve Lawrence, Sandiway Fong, and C. Lee Giles. Natural language grammatical inference: A comparison of recurrent neural networks and machine learning methods. In Stefan Wermter, Ellen Riloff, and Gabriele Scheler, editors, Symbolic, Connectionist, and Statistical Approaches to Learning for Natural Language Processing, Lecture Notes in AI, pages 33–47. Springer Verlag, New York, 1996.
[39] Y. Le Cun. Efficient learning and second order methods. Tutorial presented at Neural Information Processing Systems 5, 1993.
[40] L.R. Leerink and M. Jabri. Learning the past tense of English verbs using recurrent neural networks. In Peter Bartlett, Anthony Burkitt, and Robert Williamson, editors, Australian Conference on Neural Networks, pages 222–226. Australian National University, 1996.
[41] B. MacWhinney, J. Leinbach, R. Taraban, and J. McDonald. Language learning: cues or rules? Journal of Memory and Language, 28:255–277, 1989.
[42] R. Miikkulainen and M. Dyer. Encoding input/output representations in connectionist cognitive systems. In D.S. Touretzky, G.E. Hinton, and T.J. Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School, pages 188–195, Los Altos, CA, 1989. Morgan Kaufmann.
[43] M.C. Mozer. A focused backpropagation algorithm for temporal pattern recognition. Complex Systems, 3(4):349–381, August 1989.
[44] K.S. Narendra and K. Parthasarathy. Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1):4–27, 1990.
[45] C.W. Omlin and C.L. Giles. Constructing deterministic finite-state automata in recurrent neural networks. Journal of the ACM, 45(6):937, 1996.
[46] C.W. Omlin and C.L. Giles. Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1):41–52, 1996.
[47] C.W. Omlin and C.L. Giles. Rule revision with recurrent neural networks. IEEE Transactions on Knowledge and Data Engineering, 8(1):183–188, 1996.
[48] F. Pereira and Y. Schabes. Inside-outside re-estimation from partially bracketed corpora. In Proceedings of the 30th Annual Meeting of the ACL, pages 128–135, Newark, 1992.
[49] D.M. Pesetsky. Paths and Categories. PhD thesis, MIT, 1982.
[50] J.B. Pollack. The induction of dynamical recognizers. Machine Learning, 7:227–252, 1991.
[51] D.E. Rumelhart and J.L. McClelland. On learning the past tenses of English verbs. In D.E. Rumelhart and J.L. McClelland, editors, Parallel Distributed Processing, volume 2, chapter 18, pages 216–271. MIT Press, Cambridge, 1986.
[52] J.W. Shavlik. Combining symbolic and neural learning. Machine Learning, 14(3):321–331, 1994.
[53] H.T. Siegelmann. Computation beyond the Turing limit. Science, 268:545–548, 1995.
[54] H.T. Siegelmann, B.G. Horne, and C.L. Giles. Computational capabilities of recurrent NARX neural networks. IEEE Trans. on Systems, Man and Cybernetics – Part B, 27(2):208, 1997.
[55] H.T. Siegelmann and E.D. Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150, 1995.
[56] P. Simard, M.B. Ottaway, and D.H. Ballard. Analysis of recurrent backpropagation. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School, pages 103–112, San Mateo, 1989. (Pittsburgh 1988), Morgan Kaufmann.
[57] S.A. Solla, E. Levin, and M. Fleisher. Accelerated learning in layered neural networks. Complex Systems, 2:625–639, 1988.
[58] M.F. St. John and J.L. McClelland. Learning and applying contextual constraints in sentence comprehension. Artificial Intelligence, 46:5–46, 1990.
[59] Andreas Stolcke. Learning feature-based semantics with simple recurrent networks. Technical Report TR-90-015, International Computer Science Institute, Berkeley, California, April 1990.
[60] M. Tomita. Dynamic construction of finite-state automata from examples using hill-climbing. In Proceedings of the Fourth Annual Cognitive Science Conference, pages 105–108, Ann Arbor, MI, 1982.
[61] D.S. Touretzky. Rules and maps in connectionist symbol processing. Technical Report CMU-CS-89-158, Carnegie Mellon University, Department of Computer Science, Pittsburgh, PA, 1989.
[62] D.S. Touretzky. Towards a connectionist phonology: The "many maps" approach to sequence manipulation. In Proceedings of the 11th Annual Conference of the Cognitive Science Society, pages 188–195, 1989.
[63] A.C. Tsoi and A.D. Back. Locally recurrent globally feedforward networks: A critical review of architectures. IEEE Transactions on Neural Networks, 5(2):229–239, 1994.
[64] R.L. Watrous and G.M. Kuhn. Induction of finite state languages using second-order recurrent networks. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 309–316, San Mateo, CA, 1992. Morgan Kaufmann Publishers.
[65] R.L. Watrous and G.M. Kuhn. Induction of finite-state languages using second-order recurrent networks. Neural Computation, 4(3):406, 1992.
[66] R.J. Williams and J. Peng. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2(4):490–501, 1990.
[67] R.J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.
[68] Z. Zeng, R.M. Goodman, and P. Smyth. Learning finite state machines with self-clustering recurrent networks. Neural Computation, 5(6):976–990, 1993.