English version
A method for top-down and deterministic
parsing of multilingual corpora :
application : computing subject-verb links
Jacques Vergne, GREYC - Université de Caen
http://www.info.unicaen.fr/~jvergne
TALN 2002
24/6/2002 © Jacques Vergne TALN 2002 -2-

Features of the experiment
• experimenting with, exploring, explaining and transmitting deterministic parsing methods
• choice of a classical task, limited and (apparently) simple :
detecting and linking subjects and verbs in clauses
with the smallest possible software (program + resources)

Linking subject <-> verb
• linking the pronoun or chunk subject to the verbal chunk in every clause
• multilingual corpus (English, German, French, Italian, Spanish)
with language identification : genericity of the method ?
• top-down : document -> clause and chunk (with partial chunking, without going down to the word level)
• written in Perl :
- sentence parsing : 40 KB
- resources : 20 KB for 5 languages

How to do without a dictionary ?
• with beginnings of clauses and beginnings of chunks
• with determiner - verbal ending pairs
<[>||<d>L'euro</d> ||<V>rend déjà <p>d'éminents</p> services
<[><p>Dans les deux</p> cas ||<d>ces systèmes</d> <p>d'armes</p> ||<V>disposent <p>de radars</p>
<[>||<d>Questo tema</d> ||<V>rischia <p>di essere</p> <d>la questione</d> sociale <p>del futuro</p>
<[>||<d>La Bolsa</d> <p>de Tokio</p> ||<V>cerró ayer <p>a su nivel</p> más bajo <p>en 17</p> años

How to do without a dictionary ?
• with beginnings of clauses and beginnings of chunks
• with determiner - verbal ending pairs
<[>||<d>Das Sternbild</d> nämlich ||<V>steht <p>in dieser Jahreszeit</p> besonders tief <p>am Himmel</p>
<[><p>Bis Ende Oktober</p> ||<V>schließt sich ||<d>der Reigen</d> <p>in Connecticut</p>, Massachusetts <cc>und Rhode Island
<[>||<d>The costs</d> ||<V>mount rapidly,
<[cc>But ||<d>the Pentagon</d> move ||<V>represents <d>the first</d> significant federal call-up
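
The idea of determiner - verbal ending pairs can be illustrated with a minimal Python sketch (the original program is written in Perl and is not shown in these slides). It assumes that a determiner chunk is the determiner plus the next word, that a prepositional chunk is the preposition plus the next word, and that the verb is the first later word whose ending agrees in number with the determiner. These skipping rules, the function names and the shortened word lists are assumptions made for illustration, not the author's algorithm.

```python
import re

# Short excerpts of the French resources; the real lists are longer.
SINGULAR_DETERMINERS = {"un", "une", "le", "la", "l", "ce", "cet", "cette"}
PLURAL_DETERMINERS = {"les", "ces", "ses", "leurs", "nos"}
PREPOSITIONS = {"à", "d", "de", "des", "du", "dans", "en", "par", "pour", "sur"}
VERBAL_ENDINGS = {"sing": ("e", "a", "ed", "end", "it", "ut"), "plur": ("ent", "ont")}


def tokenize(clause):
    # Split on whitespace and apostrophes, so "L'euro" gives ["L", "euro"].
    return [t for t in re.split(r"[\s']+", clause) if t]


def number_of(word):
    # Number carried by a determiner, or None if the word is not a determiner.
    if word in SINGULAR_DETERMINERS:
        return "sing"
    if word in PLURAL_DETERMINERS:
        return "plur"
    return None


def link_subject_verb(clause):
    """Return (subject_chunk_tokens, verb_token), or None if no pair is found."""
    tokens = tokenize(clause)
    i = 0
    while i < len(tokens):
        word = tokens[i].lower()
        if word in PREPOSITIONS:
            i += 2  # assumed: skip the prepositional chunk (preposition + next word)
            continue
        number = number_of(word)
        if number is None:
            i += 1
            continue
        j = i + 2  # assumed: the determiner chunk is the determiner + next word
        while j < len(tokens):
            candidate = tokens[j].lower()
            if candidate in PREPOSITIONS:
                j += 2
            elif candidate.endswith(VERBAL_ENDINGS[number]):
                return tokens[i:i + 2], tokens[j]
            else:
                j += 1
        i += 2
    return None
```

On the two French examples above, `link_subject_verb("L'euro rend déjà d'éminents services")` pairs the chunk `L euro` with `rend`, and the second example pairs `ces systèmes` with `disposent`. The ending test only makes sense relative to the determiner's number: `disposent` also ends in `sent`, which appears among the singular endings.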

Resources : all for French
beginnings of clause :
"à condition que|à condition qu|ainsi que|ainsi qu|auquel|auxquels|combien|comme|comment|dont|dès que|dès qu|lorsque|lorsqu|même si|où|parce que|parce qu|pourquoi|quand|alors que|alors qu|bien que|bien qu|quoi que|quoi qu|tandis que|tandis qu|tant que|tant qu|puisque|puisqu|sans que|sans qu|que|qu|qui|sauf si|si"
"et donc|et encore|et ensuite|et même|et non|et pas|et pourtant|et|ou bien|ou même|ou encore|ou|mais aussi|mais|car|mais|or|puis"
beginnings of chunk :
"quant à|quant au|quant aux|grâce à|grâce au|grâce aux|face à|face au|face aux|à partir de|à partir du|à partir d|à|À|afin de|afin d|après|au-delà d|au-delà de|au-delà du|au-delà des|au|aux|auprès d|auprès de|auprès du|auprès des|autour d|autour de|autour du|autour des|avant|avec|chez|contre|dans|de par|d'entre|d'où|d|de|des|du|depuis|devant|dès|durant|en tant que|en tant qu|en|entre|hors d|hors de|hors du|hors des|jusque|jusqu'à|jusqu'au|jusqu'aux|lors d|lors de|lors du|lors des|malgré|outre|par|parmi|pendant|pour|près de|près d|sans|sauf|sous|selon|sur|vers|via|voire"
"un|une|le|la|l|ce|cet|cette|sa|son|notre|leur|tout|toute|chaque|aucun|aucune|Un|Une|Le|La|L|Ce|Cet|Cette|Sa|Son|Notre|Leur|Tout|Toute|Chaque|Aucun|Aucune"
"les|ces|ses|leurs|nos|tous|toutes|plusieurs|deux|trois|quatre|cinq|six|sept|huit|neuf|dix|d'autres|certains|quelques|Les|Ces|Ses|Leurs|Nos|Tous|Toutes|Plusieurs|Deux|Trois|Quatre|Cinq|Six|Sept|Huit|Neuf|Dix|D'autres|Certains|Quelques"
subject pronouns :
"je|j|tu|il|elle|l'on|on|c|ça|cela|ceci"
"ils|elles|nous|vous"
auxiliaries :
"a|avait|aura|ait|aurait|est|était|sera|serait|va|allait|ira|faisait|fera"
"ont|avaient|auront|aient|auraient|sont|étaient|seront|seraient|vont|allaient|iront|font|faisaient|feront"
verbal endings :
"e|a|ed|pand|end|ond|erd|ord|oud|et|it|ît|tient|vient|pent|sent|eint|ort|ut|ût"
"ent|ont"
clitics :
"n'|ne |m'|me |t'|te |s'|se |s'en |s'y |lui |leur |en |y |le |la |les |l'"

Resources : all for English
beginnings of clause :
"although|as if|as|because|before|how|if|since|than|that|though|unless|until|whatever|what|when|where|whether|while|who|which|whom|whose|why"
"and|but|or|nor"
beginnings of chunk :
"about|according to|across|after|against|along|amid|among|around|such as|at|because of|behind|between|by|despite|due to|during|except for|for|from|in order to|in|inside|into|instead of|like|of|off|on|out of|over|per|prior to|less than|more than|throughout|through|to|toward|under|unlike|via|within|without|with"
"such a|a|an|another|this|any|each|one|Such a|A|An|Another|This|Any|Each|One"
"many|most|much of|much|plenty of|several|some|such|these|those|both|two|three|four|five|six|seven|eight|nine|ten|a few|Many|Most|Much of|Much|Plenty of|Several|Some|Such|These|Those|Both|Two|Three|Four|Five|Six|Seven|Eight|Nine|Ten|A few"
"the|our|your|its|his|her|their|The|Our|Your|Its|His|Her|Their"
subject pronouns :
"I|he|she|it"
"we|they|you"
auxiliaries :
"has|is|was|does|says|tells|hasn't|isn't|wasn't|doesn't"
"have|are|were|do|say|tell|haven't|aren't|weren't|don't"
"had|will|would|shall|should|may|might|must|cannot|can|could|did|said|told|hadn't|wouldn't|shouldn't|may|mustn't|can't|couldn't|didn't|won't"
verbal endings :
"s|ed"
"a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|t|u|v|w|x|y|z"
clitics :
""
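
The resource strings are written as `|`-separated alternatives, which suggests they are used as regular-expression alternations. As a minimal Python sketch (not the author's Perl code), the following shows how such strings could drive partial chunking of an English clause. It assumes that a chunk is a preposition or determiner plus the next word, with an optional determiner after a preposition. The names `to_pattern` and `tag_chunks` and the shortened lists are hypothetical.

```python
import re

# Short excerpts of the English resources; the real lists are longer.
PREPOSITIONS = "according to|because of|about|at|behind|by|for|from|in|of|on|to|with"
DETERMINERS = "such a|a|an|another|this|any|each|one|the|our|your|its|his|her|their"


def to_pattern(resource):
    # Put the longest alternatives first, so that "such a" is tried before "a".
    alternatives = sorted(resource.split("|"), key=len, reverse=True)
    return "|".join(re.escape(a) for a in alternatives)


# Assumed chunk shape: (preposition [determiner] word) or (determiner word).
CHUNK_RE = re.compile(
    r"\b(?:(?P<p>{prep})\s+(?:(?:{det})\s+)?|(?P<d>{det})\s+)\w[\w'-]*".format(
        prep=to_pattern(PREPOSITIONS), det=to_pattern(DETERMINERS)
    ),
    re.IGNORECASE,
)


def tag_chunks(clause):
    """Wrap prepositional chunks in <p>...</p> and determiner chunks in <d>...</d>."""
    def wrap(match):
        tag = "p" if match.group("p") else "d"
        return "<%s>%s</%s>" % (tag, match.group(0), tag)

    return CHUNK_RE.sub(wrap, clause)
```

For instance, `tag_chunks("another criminal could be behind the anthrax attacks")` returns `<d>another criminal</d> could be <p>behind the anthrax</p> attacks`. The words between chunks are left untagged, which is what "partial chunking" means in these slides.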

Resources : all for German
beginnings of clause :
"dass|daß|in denen|indessen|dessen|indem|nachdem|ob|obwohl|was|warum|wer|weil|wenn|wie|wo|wofür|worauf|worin"
"aber|oder|und"
beginnings of chunk :
"dem|den|des|diesem|diesen|dieser|dieses|einem|einen|einer|eines|meinem|meinen|meiner|meines|deinem|deinen|deiner|deines|seinem|seinen|seiner|seines|ab|als|am|an|anhand|auf|aus|bei|bis|durch|für|gegen|gen|hinter|ihren|im|innerhalb|ins|in|mit|nach|neben|ohne|pro|seit|über|Über|um|unseren|unter|vom|von|vor|während|wegen|zum|zur|zu|zwischen"
"der|das|ein|eine|dieser|diese|kein|keine|ihres|ihr|Der|Das|Ein|Eine|Dieser|Diese|Kein|Keine|Ihres|Ihr|die|meine|seine|viel|Die|Meine|Seine|Viel"
"die|meine|seine|ihre|viele|alle|zwei|Die|Meine|Seine|Ihre|Viele|Alle|Zwei"
subject pronouns :
"ich|er|sie|es|man|Ich|Er|Sie|Es|Man"
"wir|Sie|sie|Wir"
auxiliaries :
"habe|hat|hatte|bin|ist|sei|wäre|war|wird|werde|wurde|darf|dürfte|kann|konnte|könnte|könne|lässt|muss|soll|will|wollte"
"haben|hatten|sind|waren|werden|wurden|worden|können|könnten|lassen|müssen|mussten|sollen|sollten"
verbal endings :
"b|nd|te|e|f|ag|ng|ah|hm|t"
"en|rn"
clitics :
""

Parsing and Hierarchies of grains : purely top-down parser
[Diagram] Going down in the hierarchy of physical grains, starting from 1 document :
• physical grains : textual zones (extracting), then sentences (segmenting / punctuation)
• intermediary grains : proto-clauses (segmenting / written forms), proto-chunks (tagging / written forms)
• computed grains : clauses (validating, segmenting, linking), chunks
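
The first two steps of this descent (document -> textual zones -> sentences) can be sketched in a few lines of Python. This is a minimal illustration only. It assumes that textual zones are blank-line-separated blocks and that sentences end with `.`, `!` or `?` followed by whitespace. Neither rule is stated in the slides, and the function names are hypothetical.

```python
import re


def extract_zones(document):
    # Assumed: a textual zone is a block of text delimited by blank lines.
    return [z.strip() for z in re.split(r"\n\s*\n", document) if z.strip()]


def split_sentences(zone):
    # Cut after ., ! or ? followed by whitespace; the mark stays with its sentence.
    return [s for s in re.split(r"(?<=[.!?])\s+", zone.strip()) if s]


def descend(document):
    """Go down from the document to its zones, then to the sentences of each zone."""
    return [split_sentences(zone) for zone in extract_zones(document)]
```

Each level is produced only by segmenting the level above it, using punctuation and layout. No word-level analysis is needed at this stage.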

Parsing process
[Diagram] Starting from 1 sentence :
• segmentation / written forms, using beginnings of clause : proto-clauses (= hypotheses on clauses)
• standard process : partial chunking, using beginnings of chunks, then linking subject - verb, using auxiliaries, subject pronouns and verbal endings
• diagnostic : subject & verb ? sentence ?
• when the diagnostic succeeds : clauses (= 1 proto-clause)
• when the answer is no : post-processing (cutting, linking proto-clauses), giving clauses (= 1/2 proto-clause, 2 proto-clauses)

Standard process : example 1
Je n'ai jamais dit que l'euro allait remplacer le dollar.
(Ouest-France of 18/10/2001)
• tagging beginnings of proto-clauses -> segmentation into proto-clauses :
proto-clause = clause
0 : <[>Je n'ai jamais dit
1 : <[cs>que </cs>l'euro allait remplacer le dollar
2 : <[.>.
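
The segmentation step of example 1 can be sketched as follows in Python. This is a minimal illustration, not the author's Perl program. It cuts the sentence before every beginning-of-clause word and sets the final punctuation apart. The list is a short excerpt of the French resource, the tags `<[>`, `<[cs>` and `<[.>` follow the slides, and the function name is hypothetical.

```python
import re

# Short excerpt of the French "beginnings of clause" resource.
CLAUSE_STARTS = "parce que|parce qu|lorsque|lorsqu|quand|que|qu|qui|si"

_ALTERNATIVES = "|".join(
    re.escape(a) for a in sorted(CLAUSE_STARTS.split("|"), key=len, reverse=True)
)
START_RE = re.compile(r"(?:%s)\b" % _ALTERNATIVES, re.IGNORECASE)
# Split on the whitespace that precedes a beginning-of-clause word.
SPLIT_RE = re.compile(r"\s+(?=(?:%s)\b)" % _ALTERNATIVES, re.IGNORECASE)


def segment_proto_clauses(sentence):
    """Return a list of (tag, text) pairs, one per proto-clause."""
    body, final = sentence.rstrip(), ""
    if body and body[-1] in ".!?":
        body, final = body[:-1].rstrip(), body[-1]
    grains = []
    for part in SPLIT_RE.split(body):
        tag = "<[cs>" if START_RE.match(part) else "<[>"
        grains.append((tag, part))
    if final:
        grains.append(("<[.>", final))
    return grains
```

On the sentence of example 1 this yields the unmarked proto-clause `Je n'ai jamais dit`, the marked proto-clause `que l'euro allait remplacer le dollar`, and the final `.`, as on the slide.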

Standard process : example 1
0 : <[><pp>Je <V>n'ai jamais dit [nbpp=1 nbV=1]
1 : <[cs>que </cs><d>l'euro</d> allait remplacer <d>le dollar</d> [nbpp=0 nbV=0]
2 : <[.>.
• tagging beginnings of chunks -> partial chunking in the written form of the proto-clause
• tagging subject pronouns, auxiliaries -> counting subject pronouns and auxiliaries

Standard process : example 1
0 : <[>||<pp>Je ||<V>n'ai jamais dit [nbV=1 saturS=1]
1 : <[cs>que </cs>||<d>l'euro</d> ||<V>allait remplacer <d>le dollar</d> [nbV=1 saturS=1]
2 : <[.>.
• for every proto-clause : detecting and linking subject and verb

Standard process : example 1
0 : <[>||<pp>Je ||<V>n'ai jamais dit [nbV=1 saturS=1]
1 : <[cs>que </cs>||<d>l'euro</d> ||<V>allait remplacer <d>le dollar</d> [nbV=1 saturS=1]
2 : <[.>.
• diagnostic of every clause and of the sentence
• every clause has its subject and its verb
and the sentence has a main clause (without a mark)

Standard process : example 2
Eine spektakuläre Operation gelang ihm im November 1974, als er ein Spenderherz transplantierte, ohne das Herz des Empfängers zu entfernen.
(Der Spiegel - 2/9/2001)
• tagging beginnings of proto-clauses -> segmentation into proto-clauses
0 : <[>Eine spektakuläre Operation gelang ihm im November 1974,
1 : <[cs>als </cs>er ein Spenderherz transplantierte,
2 : <[><pi>ohne </pi>das Herz des Empfängers <pi>zu </pi>entfernen
3 : <[.>.

Standard process : example 2
0 : <[><d>Eine spektakuläre Operation</d> gelang ihm <p>im November</p> 1974, [nbpp=0 nbV=0]
1 : <[cs>als </cs><pp>er <d>ein Spenderherz</d> transplantierte, [nbpp=1 nbV=0]
2 : <[><pi>ohne </pi><d>das Herz</d> <p>des Empfängers</p> <pi>zu entfernen</pi>
3 : <[.>.
• tagging beginnings of chunks -> partial chunking in the written form of the proto-clause
• tagging pronouns, auxiliaries -> counting pronouns and auxiliaries

Standard process : example 2
• for every proto-clause : detecting and linking subject and verb
0 : <[>||<d>Eine spektakuläre Operation</d> ||<V>gelang ihm <p>im November</p> 1974, [nbV=1 saturS=1]
1 : <[cs>als </cs>||<pp>er <d>ein Spenderherz</d> ||<V>transplantierte, [nbV=1 saturS=1]
2 : <[><pi>ohne </pi><d>das Herz</d> <p>des Empfängers</p> <pi>zu entfernen</pi>
3 : <[.>.

Standard process : example 2
0 : <[>||<d>Eine spektakuläre Operation</d> ||<V>gelang ihm <p>im November</p> 1974, [nbV=1 saturS=1]
1 : <[cs>als </cs>||<pp>er <d>ein Spenderherz</d> ||<V>transplantierte, [nbV=1 saturS=1]
2 : <[><pi>ohne </pi><d>das Herz</d> <p>des Empfängers</p> <pi>zu entfernen</pi>
3 : <[.>.
• diagnostic of every clause and of the sentence
• every clause has its subject and its verb
and the sentence has a main clause (without a mark)

Post-processing :
proto-clause ≠ clause
2 operations are possible :
• cutting 1 proto-clause => 2 clauses
• linking 2 proto-clauses => 1 clause
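
How the diagnostic might select one of these operations can be sketched as a small decision function in Python. The three triggering cases are the ones shown in the post-processing examples of this presentation (2 verbs in 1 proto-clause, 2 proto-clauses without a verb, 1 proto-clause without a verb). The function name, the returned labels and the treatment of any other combination are assumptions made for illustration.

```python
def choose_post_processing(nb_verbs):
    """nb_verbs: the nbV count of each proto-clause of a sentence, in order."""
    overfull = sum(1 for n in nb_verbs if n >= 2)  # proto-clauses with 2 or more verbs
    empty = sum(1 for n in nb_verbs if n == 0)     # proto-clauses with no verb
    if overfull == 0 and empty == 0:
        return "none"            # every proto-clause is already a clause
    if overfull >= 1 and empty == 0:
        return "cut"             # 2 verbs in 1 proto-clause: search for a cut point
    if overfull == 0 and empty >= 2:
        return "link"            # 2 proto-clauses have no verb: try to link them
    if overfull == 0 and empty == 1:
        return "cut and link"    # 1 proto-clause has no verb: try to cut and link
    return "undecided"           # combination not covered by the slides
```

For example, the counts `[1, 2]` give `"cut"`, `[0, 1, 0]` give `"link"` and `[0, 1]` give `"cut and link"`.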

Post-processing : cutting a proto-clause into 2 clauses
Result of the standard process :
2 verbs in 1 proto-clause
=> searching for a cut point
0 : <[cs>Although </cs>||<pp>they ||<V>have not ruled out <d>a possibility</d> [nbV=1 saturS=1]
1 : <[cs>that </cs><d>another criminal</d> <V>could be <p>behind the anthrax</p> attacks, investigators <V>are intensely looking <p>at evidentiary</p> threads linking <d>the letters</d> <p>to the hijackers</p> [nbV=2]
2 : <[.>.

Post-processing : cutting a proto-clause into 2 clauses
0 : <[cs>Although </cs>||<pp>they ||<V>have not ruled out <d>a possibility</d> [nbV=1 saturS=1]
1 : <[cs>that </cs>||<d>another criminal</d> ||<V>could be <p>behind the anthrax</p> attacks, [nbV=1 saturS=1]
2 : <[>||investigators ||<V>are intensely looking <p>at evidentiary</p> threads linking <d>the letters</d> <p>to the hijackers</p> [nbV=1 saturS=1]
3 : <[.>.
Cut on the comma :
every clause now has its subject and its verb
and the sentence has a main clause (without a mark)
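
The cut itself can be sketched in Python as a search for a comma located between the two verbs of the proto-clause. This is a minimal illustration. The slides only say that the cut is made on the comma, so the choice of the first comma found between the verbs, the function name and its parameters are assumptions.

```python
def cut_on_comma(proto_clause, first_verb, second_verb):
    """Cut a 2-verb proto-clause at a comma between its verbs, or return None."""
    start = proto_clause.find(first_verb)
    if start < 0:
        return None
    end = proto_clause.find(second_verb, start + len(first_verb))
    if end < 0:
        return None
    # Assumed policy: take the first comma between the two verbs.
    comma = proto_clause.find(",", start, end)
    if comma < 0:
        return None
    return proto_clause[:comma + 1].strip(), proto_clause[comma + 1:].strip()
```

On the example above, cutting `that another criminal could be behind the anthrax attacks, investigators are intensely looking at evidentiary threads` with the verbs `could be` and `are intensely looking` returns the two clauses on either side of the comma, the comma staying with the first one.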

Post-processing : linking 2 proto-clauses
Result of the standard process :
2 proto-clauses have no verb
=> trying to link them
0 : <[><d>Eine junge Südafrikanerin</d>, [nbV=0]
1 : <[pr>||die </pr>1969 <d>ein neues Herz</d> ||<V>erhielt, [nbV=1 saturS=1]
2 : <[>überlebte damit zwölf Jahre [nbV=0]
3 : <[.>.

Post-processing : linking 2 proto-clauses
linking proto-clause 0 to proto-clause 2 by the "ping-pong" process :
0 : <[>|<d>Eine junge Südafrikanerin</d>, [nbV=0 S_en_attente=1] (ping of the subject)
1 : <[pr>||die </pr>1969 <d>ein neues Herz</d> ||<V>erhielt, [nbV=1 saturS=1]
2 : <[>überlebte damit zwölf Jahre [nbV=0]

Post-processing : linking 2 proto-clauses
linking proto-clause 0 to proto-clause 2 by the "ping-pong" process :
0 : <[>||<d>Eine junge Südafrikanerin</d>, [nbV=0 S_en_attente=0 lienS=2] (ping of the subject)
1 : <[pr>||die </pr>1969 <d>ein neues Herz</d> ||<V>erhielt, [nbV=1 saturS=1]
2 : <[>||<V>überlebte damit zwölf Jahre [nbV=1 saturS=1 lienS=0] (pong of the verb)
3 : <[.>.
every clause now has its subject and its verb
and the sentence has a main clause (without a mark)
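
The "ping-pong" process can be sketched in Python as one left-to-right pass over the proto-clauses. This is a minimal illustration, not the author's Perl code. A verbless proto-clause holding a subject candidate puts it in a waiting position (ping). A later verbless proto-clause whose first word has an agreeing verbal ending takes it as its subject (pong). The dictionary keys `nbV`, `saturS` and `lienS` follow the slides. The key `subject_number`, the restriction to the first word and the shortened German ending lists are assumptions.

```python
# Short excerpt of the German "verbal endings" resource.
VERBAL_ENDINGS = {"sing": ("te", "e", "t"), "plur": ("en", "rn")}


def ping_pong(proto_clauses):
    """Link a waiting subject to a later verb; proto_clauses is a list of dicts."""
    waiting = None  # (index, number) of the subject candidate in waiting position
    for i, clause in enumerate(proto_clauses):
        if clause["nbV"] != 0:
            continue  # this proto-clause already has its verb
        if clause.get("subject_number"):
            waiting = (i, clause["subject_number"])  # ping of the subject
        elif waiting is not None:
            words = clause["text"].split()
            j, number = waiting
            # Assumed: the verb candidate is the first word of the proto-clause.
            if words and words[0].lower().endswith(VERBAL_ENDINGS[number]):
                clause.update(nbV=1, saturS=1, lienS=j)  # pong of the verb
                proto_clauses[j]["lienS"] = i
                waiting = None
    return proto_clauses
```

Applied to the three proto-clauses of the German example, with `subject_number="sing"` on proto-clause 0, the pass sets `lienS=2` on proto-clause 0 and `nbV=1 saturS=1 lienS=0` on proto-clause 2.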

Post-processing : cutting a proto-clause into 2 clauses + linking 2 proto-clauses
Result of the standard process :
1 proto-clause has no verb
=> trying to cut and link
0 : <[><d>Les tueurs</d>, [nbV=0]
1 : <[pr>||qui </pr>||<V>ont assassiné Rehavam Zeevi, ministre israélien <p>du Tourisme</p>, appartiennent <p>au camp</p> <p>des ennemis</p> <p>de la paix</p> [nbV=1 saturS=1]
2 : <[.>.

Post-processing : cutting a proto-clause into 2 clauses + linking 2 proto-clauses
"ping-pong" process : ping of the subject = putting a subject candidate in a waiting position
cutting proto-clause 1 into 2 proto-clauses :
0 : <[>|<d>Les tueurs</d>, [nbV=0 S_en_attente=plur] (ping of the subject?)
1 : <[pr>||qui </pr>||<V>ont assassiné Rehavam Zeevi, ministre israélien <p>du Tourisme</p>, appartiennent <p>au camp</p> <p>des ennemis</p> <p>de la paix</p> [nbV=1 saturS=1]

Post-processing : cutting a proto-clause into 2 clauses + linking 2 proto-clauses
cutting proto-clause 1 into 2 proto-clauses :
0 : <[>|<d>Les tueurs</d>, [nbV=0 S_en_attente=plur] (ping of the subject?)
1 : <[pr>||qui </pr>||<V>ont assassiné Rehavam Zeevi, ministre israélien <p>du Tourisme</p>, [nbV=1 saturS=1]
2 : <[>appartiennent <p>au camp</p> <p>des ennemis</p> <p>de la paix</p> [nbV=0]

Post-processing : cutting a proto-clause into 2 clauses + linking 2 proto-clauses
"ping-pong" process : pong of the verb = a waiting subject candidate & agreeing verbal ending
0 : <[>||<d>Les tueurs</d>, [nbV=0 S_en_attente=0 lienS=2] (ping of the subject?)
1 : <[pr>||qui </pr>||<V>ont assassiné Rehavam Zeevi, ministre israélien <p>du Tourisme</p>, [nbV=1 saturS=1]
2 : <[>||<V>appartiennent <p>au camp</p> <p>des ennemis</p> <p>de la paix</p> [nbV=1 saturS=1 lienS=0] (pong of the verb)
3 : <[.>.
every clause now has its subject and its verb
and the sentence has a main clause (without a mark)

Implementation of the linguistic model
[Diagram] Grains : sentences (physical grains) ; proto-clauses and proto-chunks (intermediary grains) ; clauses and chunks (computed grains)
• some grains are represented in a repetitive structure
• the other grains are tagged in the written forms of the (proto-)clauses, in the repetitive structure of the (proto-)clauses

Aims of the "Groupe Syntaxe" of the GREYC
• searching for minimal solutions :
for a given task, minimising means
- very small programs
- very simple algorithms
- deterministic solutions (without combination enumeration) :
. computing on forms and their positions
- minimal linguistic bases :
. using very few properties,
only those which are useful in the process
. very few resources (typographical, morphological)

Very small programs !
• how ?
by using very general linguistic properties
defined in comprehension
and not in extension
• why ?
because these properties are interesting :
few, abstract : understanding, modelling
operative, efficient : acting

Conclusions
• classical tasks are feasible with minimal means (almost no dictionary)
other tasks : computing reported speech, locating explanations, cf. Nadine Lucas (GREYC) and Emmanuel Giguet (LATTICE)
• with fewer means, work is easier :
- fewer lexical resources => lower cost
- easy to add a new language
- always above the word level
• beginnings of a promising way
• still a long way to go ...
your questions ?
End of the lecture

to download
• you can download this presentation on http://www.info.unicaen.fr/~jvergne/TALN2002_JVergne_en.ppt
• also see my presentation at TALN 2001, "Parsing natural languages : from "combinatory" to "deterministic" parsing",
on http://www.info.unicaen.fr/~jvergne/TALN2001_JVergne_en.ppt
• also see the tutorial of Coling 2000, "Trends in Robust Parsing",
on http://www.info.unicaen.fr/~jvergne/tutorialColing2000.html
(presentation and references)

Parsing and Hierarchies of grains : classical parsers
[Diagram]
• physical grains : 1 document, sentences (segmenting), tokens (segmenting) ; top-down in the hierarchy of physical grains
• computed grains : recursive phrases, sentence (grouping tokens and phra.) ; bottom-up in the hierarchy of computed grains

Parsing and Hierarchies of grains : 1998 parser
[Diagram]
• physical grains : 1 document, sentences (segmenting), tokens (segmenting) ; top-down in the hierarchy of physical grains
• computed grains : chunks (grouping tokens), then linking chunks ; bottom-up in the hierarchy of computed grains

Parsing and Hierarchies of grains : GREYC parser
[Diagram]
• physical grains : 1 document, textual zones (extracting), tokens (segmenting) ; top-down in the hierarchy of physical grains
• computed grains : chunks, clauses, sentences (each by grouping and linking) ; bottom-up in the hierarchy of computed grains