Experiments in German Noun Chunking
Michael Schiehlen
Institut für Maschinelle Sprachverarbeitung
Universität Stuttgart
COLING 2002, Taipei
August 27th, 2002
IMS Stuttgart
COLING 2002, August 27th, 2002
© Michael Schiehlen
Definition of Chunks (Abney:93)
• chunk: maximal string containing a major head, dominated by a root and not contained in another chunk
• major head: a content word that does not lie between a function word f and the word selected by f
• root: highest node with the major head as semantic head
Example: In John's house we met a nice man proud of his son.
Base Noun Chunks
• A base noun chunk is a chunk with a noun as major head. (Base noun chunks are “core” NPs and PPs.)
• The underlying grammar assumes
- null determiners: “Ø poor people” forms a base noun chunk
- and empty nouns: “the poor Ø” forms a base noun chunk
Problems with Coordination
the old men and women
How this phrase is chunked depends on the definition:
• by Abney’s definition
• if conjunctions are excluded from chunks
• if chunks may be multi-headed
Disambiguation in a Cascaded Finite-State Parser
• POS ambiguities resolved by POS tagger.
• PP attachment ambiguities are kept underspecified.
• All other ambiguities are resolved using the longest-match criterion (Abney, 1993).
• Chunks should be as long as possible.
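The longest-match criterion can be illustrated with a minimal sketch. The tag pattern below is a hypothetical toy grammar over STTS-style tags (ART, ADJA, NN), not the grammar used in the system, and the function name `longest_match_chunks` is an assumption made for illustration only.

```python
import re

# Hypothetical toy grammar over STTS-style tags (ART, ADJA, NN):
#   alternative 1: optional article, adjectives, noun
#   alternative 2: article plus adjectives with an empty noun ("the poor Ø")
NOUN_CHUNK = re.compile(r"(?:ART )?(?:ADJA )*NN|ART(?: ADJA)+")


def longest_match_chunks(tagged):
    """Scan left to right; at each position keep the longest span that matches."""
    tags = [tag for _, tag in tagged]
    chunks = []
    i = 0
    while i < len(tags):
        # Try candidate spans from longest to shortest.
        for end in range(len(tags), i, -1):
            if NOUN_CHUNK.fullmatch(" ".join(tags[i:end])):
                chunks.append([word for word, _ in tagged[i:end]])
                i = end
                break
        else:
            i += 1  # no chunk starts here; skip the token
    return chunks


sentence = [("the", "ART"), ("poor", "ADJA"), ("people", "NN"),
            ("met", "VVFIN"), ("the", "ART"), ("poor", "ADJA")]
print(longest_match_chunks(sentence))
# [['the', 'poor', 'people'], ['the', 'poor']]
```

At the first position both "the poor Ø" and "the poor people" are possible chunks, and the longer one wins. This sketch tries every candidate span and is therefore not linear; a compiled finite-state transducer would make the same choice in a single pass.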
System Overview
string → tokenizer → POS tagger → cascaded noun chunker → cascaded clause chunker → (underspecified representation of) predicate-argument structure
Determining Predicate-Argument Structure in German
• In German, case is important, not position!
Der/Den Hund kennt Anna.
the dog knows Anna
(The dog knows Anna. / Anna knows the dog.)
• Case is determined jointly by determiners, adjectives and nouns.
der/den hohen Schäden (the heavy damages, gen.pl/dat.pl)
der große/großen Felsen (the large rock(s), nom.sg/gen.pl)
den großen Stein/Steinen (the large stone(s), acc.sg/dat.pl)
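One way to picture the joint determination of case is as an intersection of the readings that each word allows. The sketch below assumes a tiny hand-written lexicon of (case, number) readings that ignores gender; the entries and the names `READINGS` and `chunk_readings` are illustrative assumptions, not the morphology component of the system.

```python
# All four plural readings, shared by several entries below.
PLURAL = {("nom", "pl"), ("gen", "pl"), ("dat", "pl"), ("acc", "pl")}

# Hand-written toy lexicon (assumption): (case, number) readings, gender ignored.
READINGS = {
    "der": {("nom", "sg"), ("gen", "sg"), ("dat", "sg"), ("gen", "pl")},
    "den": {("acc", "sg"), ("dat", "pl")},
    "hohen": {("gen", "sg"), ("dat", "sg"), ("acc", "sg")} | PLURAL,
    "Schäden": PLURAL,
}


def chunk_readings(words):
    """Intersect the readings of all words; what survives is the chunk's reading."""
    result = set(READINGS[words[0]])
    for word in words[1:]:
        result &= READINGS[word]
    return result


print(chunk_readings(["der", "hohen", "Schäden"]))  # {('gen', 'pl')}
print(chunk_readings(["den", "hohen", "Schäden"]))  # {('dat', 'pl')}
```

No single word fixes the case here: the adjective and the noun are highly ambiguous on their own, and only the combination with the determiner leaves one reading.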
Problem: Center-Embedding
• The words needed to compute case may be separated by other (embedded) noun chunks.
[der/den [mehrere Milliarden Euro] hohen Schäden]
the several billions Euro high damages
(the damages amounting to several billion Euro)
• Base noun chunks may be ungrammatical.
{die} {im Alter} {nachlassenden Kräfte}
the in-the age diminishing forces
(the strength diminishing in old age)
Definition of Full Noun Chunks (1)
part of NP between determiner and (first) head noun (Schmid and Schulte im Walde, 2000)
• includes names
{the discoverer Christopher Columbus}
• but not coordinated NPs
{parts} {of Scotland} and {Northern Ireland}
• and not appositions
{Christopher Columbus}, {the famous discoverer},
Definition of Full Noun Chunks (2)
NP/PP stripped of adverbials at the front and PPs and relative clauses at the back (Brants, 1999)
• coordinations (attachment ambiguity!) and appositions
{? parts {? of Scotland and Northern Ireland}
• pre- and postnominal genitives
{{Marias} Version {der Geschichte}} Mary's version of the story
• measure phrases
{{20 Dollar} Strafe} 20 dollars penalty (a penalty of $20)
Recognizing Full Noun Chunks (1)
explicit representation of ambiguities (potential noun chunks)
• used in previous work on full noun chunking (Brants:99, Schmid and Schulte im Walde:00, Kermes and Evert:02)
• drawback: requires search
→ Parser is no longer deterministic.
→ Linear complexity is lost.
Recognizing Full Noun Chunks (2)
a new method retaining determinism and linear complexity
• recognize base noun chunks that could form the beginning, middle or end of a full noun chunk
• discard those noun chunks (monotonicity lost!)
• re-apply original noun chunk transducer
Recognizing Recursive NPs by Non-Monotonic Cascades
die Ende der Woche geplanten Treffen sind geplatzt
the end of the week planned meetings have been cancelled
(the meetings planned for the end of the week have been cancelled)
[Figure: analysis in three steps. 1. three NPs (nom;acc, nom;dat;acc, dat), annotated “might be beginning” and “might be end of full noun chunk”; 2. AP; 3. NP (nom;acc)]
Three Approaches to Agreement Checking in FS Parsers
1. add agreement info to POS tags and compile the grammar out (drawback: explosion of the transition table)
2. postpone agreement check until after chunk recognition (Abney, 1997)
3. interleave agreement checking with chunking (Neumann et al., 2000); problems with subcategorizing multi-word expressions
um Gottes willen (for God's sake)
“um” takes acc., “um-willen” takes gen.!
Evaluation: Test Data
• gold standard: NEGRA tree bank
- 321,000 tokens
- 100,974 base noun chunks
- 78,942 full noun chunks
• Structure of full noun chunks not considered.
• Agreement information extracted not considered.
• Same test data were used by Brants (1999) and Kermes and Evert (2002).
Evaluation: Baseline
• baseline: statistical knowledge-free method of Ramshaw and Marcus (1995)
• Instead of the Brill tagger, the tree tagger (Schmid:94) was used.
- precision/recall on R&M’s test data with the tree tagger: 90.7/91.2% (R&M got 91.8/92.3%)
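The results on the later slides are reported as F-values. The slides do not state the formula, so the sketch below assumes the usual balanced F-measure (harmonic mean of precision and recall) and scoring by exact span match. The function names `f_value` and `evaluate` are illustrative, not part of the evaluation software.

```python
def f_value(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(gold, predicted):
    """Score predicted chunks against gold chunks; chunks are (start, end) spans."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)  # exact span matches only (assumption)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_value(precision, recall)


# Baseline figures quoted above, converted to F-values.
print(round(f_value(90.7, 91.2), 2))  # 90.95 (tree tagger)
print(round(f_value(91.8, 92.3), 2))  # 92.05 (R&M)
```

Under this assumption, the gap between the two baseline runs is about 1.1 points of F-value.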
Parameters Tested
• agreement checking online or offline
• left-to-right or right-to-left traversal
{der 14} {Jahre} {alte Junge} (the 14-year-old boy)
{der} {14 Jahre} {alte Junge}
• quality of POS tagging
- POS-I(deal): POS tags from tree bank
- POS-L(exicon): from tree tagger trained on tree bank
- POS-T(agger): from tree tagger
- POS-C(hunker): POS tags disambiguated by chunker
F-Values for Base Noun Chunks
[Bar chart: F-values (axis from 85 to 100); x-axis labels: English baseline, FS->, FS<-, FS->agr, FS->agr; series: POS-I, POS-L, POS-T, POS-C]
Discussion
• English is harder than German.
- German nouns are less ambiguous than English nouns.
• POS-I > POS-L > POS-T > POS-C
• Tags from the chunker (POS-C) are worse than baseline. Using a POS tagger is a good idea.
• Direction of processing makes no difference.
• Checking agreement yields a small improvement.
F-Values for Full Noun Chunks
[Bar chart: F-values (axis from 85 to 100); x-axis labels: baseline, FS->, FS<-, FS->agr, FS<-agr, Skut and Brants:98 (maximum entropy model), Schmid+Schulte00 (PCFG model); series: POS-I, POS-L, POS-T]
Discussion
• Online agreement checking pays (see next slide).
• Better results with right-to-left parsing are mainly due to a heuristic which could only be incorporated in the right-to-left parser:
- prefer shortest match with conjunct attachment
{? The presidents of {? France and the U.S.A.} met.
Online Agreement Checking (+)
errors avoided
• genitives (case mismatch)
{in John's} {house}
• conjunction attachment (case mismatch)
{das Leben {von Schauspielern} und Zirkusleuten}
the life(nom;acc) of actors and circus people(dat)
• adjacent NPs (adjective declension)
{diese beiden ähnliche Erfolge}
those two(weak) similar(strong) successes
Online Agreement Checking (-)
Some grammar errors become visible only with agreement checking.
• N coordination is missing.
{die nachlassenden Kräfte} the diminishing strength
{die Verletzungen} und {nachlassenden Kräfte} the injuries and diminishing strength
no noun chunk!
Conclusion (1)
• Writing a finite-state grammar is worth the effort.
- The FS method performs better than the statistical method.
• Noun chunker is not very good at determining POS tags.
• Online agreement checking improves performance.
• Shortest match is better than longest match for conjunction attachment.
Conclusion (2)
• Two chunkers have been implemented (base noun chunker, full noun chunker).
• Both are completely deterministic.
• On a SUN Ultra-250, the base noun chunker processes 12,500 words per second, the full noun chunker achieves 5,200 wps.
• plans for the future: extend the system to recognize predicate-argument structure for Information Extraction