Experiments in German Noun Chunking

Michael Schiehlen
Institut für Maschinelle Sprachverarbeitung
Universität Stuttgart
mike@ims.uni-stuttgart.de

COLING 2002, Taipei
August 27th, 2002

IMS Stuttgart, COLING 2002, August 27th, 2002. © Michael Schiehlen

Definition of Chunks (Abney:93)

• chunk: maximal string containing a major head, dominated by the root, not contained in another chunk

• major head: a content word not between a function word f and the word selected by f

• root: highest node with the major head as semantic head

In John's house we met a nice man proud of his son.

Base Noun Chunks

• A base noun chunk is a chunk with a noun as major head. (Base noun chunks are “core” NPs and PPs.)

• The underlying grammar assumes

• null determiners

Ø poor people forms a base noun chunk

• and empty nouns.

the poor Ø forms a base noun chunk

Problems with Coordination

Chunking of "the old men and women":

• if chunks may be multi-headed

• if conjunctions are excluded from chunks

• by Abney’s definition

Disambiguation in a Cascaded Finite-State Parser

• POS ambiguities resolved by POS tagger.

• PP attachment ambiguities are kept underspecified.

• All other ambiguities are resolved using the longest-match criterion (Abney, 1993).

• Chunks should be as long as possible.
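
The longest-match criterion can be illustrated with a minimal sketch. It is not the parser described in these slides: the one-letter tag codes, the chunk pattern `A?J*N+` and the function name `longest_match_chunks` are assumptions made for illustration. Each POS tag is mapped to one character so that regular-expression offsets coincide with token indices, and the greedy quantifiers make every chunk extend as far as the pattern allows.

```python
import re

# Assumed one-letter codes for a few STTS-style tags; any other tag becomes "x".
CODE = {"ART": "A", "ADJA": "J", "NN": "N", "NE": "N"}

# Hypothetical chunk pattern: optional article, adjectives, one or more nouns/names.
# The greedy quantifiers take as many tokens as possible (longest match).
CHUNK = re.compile(r"A?J*N+")


def longest_match_chunks(tagged):
    """Return the chunks found in a list of (word, tag) pairs as lists of words."""
    codes = "".join(CODE.get(tag, "x") for _, tag in tagged)
    return [
        [word for word, _ in tagged[m.start():m.end()]]
        for m in CHUNK.finditer(codes)
    ]


tagged = [
    ("the", "ART"), ("discoverer", "NN"),
    ("Christopher", "NE"), ("Columbus", "NE"), ("sailed", "VVFIN"),
]
print(longest_match_chunks(tagged))
```

With this toy pattern the article, the noun and both name tokens end up in one chunk rather than in several shorter ones.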

System Overview

string → tokenizer → POS tagger → cascaded noun chunker → cascaded clause chunker → (underspecified representation of) predicate-argument structure

Determining Predicate-Argument Structure in German

• In German, case is important, not position!

Der/Den Hund kennt Anna.
the dog knows Anna
(The dog knows Anna. / Anna knows the dog.)

• Case is determined jointly by determiners, adjectives and nouns.

der/den hohen Schäden (the heavy damages, gen.pl/dat.pl)
der große/großen Felsen (the large rock(s), nom.sg/gen.pl)
den großen Stein/Steinen (the large stone(s), acc.sg/dat.pl)
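
The joint determination of case can be sketched as an intersection of the readings that each word allows. The lexicon below is a toy assumption (gender is ignored and only the words of the first example are listed), and `agreement` is a hypothetical helper, not part of the system presented here.

```python
from functools import reduce

# All four cases in the plural, as (case, number) pairs.
PLURAL = {(case, "pl") for case in ("nom", "gen", "dat", "acc")}

# Toy lexicon (an assumption for illustration): possible (case, number)
# readings per word form, with gender left out for brevity.
LEXICON = {
    "der": {("nom", "sg"), ("gen", "sg"), ("dat", "sg"), ("gen", "pl")},
    "den": {("acc", "sg"), ("dat", "pl")},
    "hohen": PLURAL | {("gen", "sg"), ("dat", "sg"), ("acc", "sg")},
    "Schäden": set(PLURAL),
}


def agreement(words):
    """Intersect the readings of all words; an empty set means no agreement."""
    return reduce(set.intersection, (LEXICON[word] for word in words))


print(agreement(["der", "hohen", "Schäden"]))  # {('gen', 'pl')}
print(agreement(["den", "hohen", "Schäden"]))  # {('dat', 'pl')}
```

Neither the adjective nor the noun fixes the case on its own; only the determiner narrows the reading to genitive or dative plural.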

Problem: Center-Embedding

• The words needed to compute case may be separated by other (embedded) noun chunks.

[der/den [mehrere Milliarden Euro] hohen Schäden]
the several billions Euro high damages
(the damages amounting to several billion Euro)

• Base noun chunks may be ungrammatical.

{die} {im Alter} {nachlassenden Kräfte}
the in-the age diminishing forces
(the strength diminishing in old age)

Definition of Full Noun Chunks (1)

part of NP between determiner and (first) head noun (Schmid and Schulte im Walde: 2000)

• includes names
{the discoverer Christopher Columbus}

• but not coordinated NPs
{parts} {of Scotland} and {Northern Ireland}

• and not appositions
{Christopher Columbus}, {the famous discoverer},

Definition of Full Noun Chunks (2)

NP/PP stripped of adverbials at the front and PPs and relative clauses at the back (Brants, 1999)

• coordinations (attachment ambiguity!) and appositions

{? parts {? of Scotland and Northern Ireland}

• pre- and postnominal genitives

{{Marias} Version {der Geschichte}}
Mary's version of the story

• measure phrases

{{20 Dollar} Strafe}
20 dollars penalty
(a penalty of 20$)

Recognizing Full Noun Chunks (1)

explicit representation of ambiguities (potential noun chunks)

• used in previous work on full noun chunking (Brants:99, Schmid and Schulte im Walde:00, Kermes and Evert:02)

• drawback: requires search

→ Parser is not deterministic any longer.

→ Linear complexity is lost.

Recognizing Full Noun Chunks (2)

a new method retaining determinism and linear complexity

• recognize base noun chunks that could form the beginning, middle or end of a full noun chunk

• discard those noun chunks (monotonicity lost!)

• re-apply original noun chunk transducer

Recognizing Recursive NPs by Non-Monotonic Cascades

die Ende der Woche geplanten Treffen sind geplatzt
the end of the week planned meetings have been cancelled
(the meetings planned for the end of the week have been cancelled)

[Figure: analysis of the example in three numbered steps. Step 1: three NPs labelled nom;acc, nom;dat;acc and dat. Steps 2 and 3: an AP and an NP labelled nom;acc. Annotations: "might be beginning", "might be end of full noun chunk".]

Three Approaches to Agreement Checking in FS Parsers

1. add agreement info to POS tags and compile the grammar out (drawback: explosion of the transition table)

2. postpone agreement check until after chunk recognition (Abney, 1997)

3. interleave agreement checking with chunking (Neumann et al., 2000); drawback: problems with subcategorizing multi-words

um Gottes willen (for God's sake)

“um” takes acc., “um-willen” takes gen.!

Evaluation: Test Data

• gold standard: NEGRA tree bank
  321,000 tokens
  100,974 base noun chunks
  78,942 full noun chunks

• Structure of full noun chunks not considered.

• Agreement information extracted not considered.

• Same test data were used by Brants (1999) and Kermes and Evert (2002).

Evaluation: Baseline

• baseline: statistical knowledge-free method of Ramshaw and Marcus (1995)

• Instead of the Brill tagger, the tree tagger (Schmid:94) was used.
  - precision/recall on R&M’s test data with the tree tagger: 90.7/91.2% (R&M got 91.8/92.3%)
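
For orientation, precision, recall and F-value of a chunker can be computed as in the following sketch. It assumes that a chunk counts as correct only if its span matches a gold-standard span exactly and that the F-value is the balanced harmonic mean of precision and recall; neither assumption is stated in these slides, and the function names are hypothetical.

```python
def f_value(precision, recall):
    """Balanced F-value (harmonic mean); assumed definition, see lead-in."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score(gold_spans, predicted_spans):
    """Precision, recall and F-value over exact (start, end) span matches."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_value(precision, recall)


# Combining the precision/recall pair reported above for the tree tagger run.
print(round(f_value(90.7, 91.2), 2))
```

Under this definition the pair 90.7/91.2 combines to an F-value of roughly 90.9.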

Parameters Tested

• agreement checking online or offline

• left-to-right or right-to-left traversal
  {der 14} {Jahre} {alte Junge} (the 14-year-old boy)
  {der} {14 Jahre} {alte Junge}

• quality of POS tagging
  POS-I(deal): POS tags from tree bank
  POS-L(exicon): from tree tagger trained on tree bank
  POS-T(agger): from tree tagger
  POS-C(hunker): POS tags disambiguated by chunker
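
How the traversal direction can change the segmentation is shown in the sketch below. The inventory of admissible tag sequences is hypothetical and was chosen only so that the two traversals reproduce the two segmentations given above; the actual grammar is not part of these slides. Both directions apply greedy longest match, once from the left edge and once from the right edge.

```python
# Hypothetical inventory of admissible chunk tag sequences (an assumption).
PATTERNS = {("ART",), ("ART", "CARD"), ("CARD", "NN"), ("NN",), ("ADJA", "NN")}
MAX_LEN = max(len(pattern) for pattern in PATTERNS)


def chunk(tagged, right_to_left=False):
    """Greedy longest-match chunking of (word, tag) pairs in either direction."""
    seq = tagged[::-1] if right_to_left else tagged
    chunks, i = [], 0
    while i < len(seq):
        for n in range(min(MAX_LEN, len(seq) - i), 0, -1):
            window = seq[i:i + n]
            if right_to_left:
                window = window[::-1]  # restore surface order before matching
            if tuple(tag for _, tag in window) in PATTERNS:
                chunks.append([word for word, _ in window])
                i += n
                break
        else:
            i += 1  # no pattern applies here; leave the token outside any chunk
    return chunks[::-1] if right_to_left else chunks


tagged = [("der", "ART"), ("14", "CARD"), ("Jahre", "NN"),
          ("alte", "ADJA"), ("Junge", "NN")]
print(chunk(tagged))                      # [['der', '14'], ['Jahre'], ['alte', 'Junge']]
print(chunk(tagged, right_to_left=True))  # [['der'], ['14', 'Jahre'], ['alte', 'Junge']]
```

The same inventory yields {der 14} {Jahre} from the left and {der} {14 Jahre} from the right, because the greedy choice is made at opposite ends of the string.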

F-Values for Base Noun Chunks

[Bar chart: F-values (axis from 85 to 100) for the columns English, baseline, FS->, FS<-, FS->agr, FS->agr; one series each for POS-I, POS-L, POS-T and POS-C.]

Discussion

• English is harder than German.
  - German nouns are less ambiguous than English nouns.

• POS-I > POS-L > POS-T > POS-C

• Tags from the chunker (POS-C) are worse than baseline. Using a POS tagger is a good idea.

• Direction of processing makes no difference.

• Checking agreement yields small improvement.

F-Values for Full Noun Chunks

[Bar chart: F-values (axis from 85 to 100) for the columns baseline, FS->, FS<-, FS->agr, FS<-agr, Skut and Brants:98 (maximum entropy model) and Schmid+Schulte00 (PCFG model); one series each for POS-I, POS-L and POS-T.]

Discussion

• Online agreement checking pays (see next slide).

• Better results with right-to-left parsing are mainly due to a heuristic which could only be incorporated in the right-to-left parser:

- prefer shortest match with conjunct attachment
  {? The presidents of {? France and the U.S.A.} met.

Online Agreement Checking (+)

errors avoided

• genitives (case mismatch)
  {in John's} {house}

• conjunction attachment (case mismatch)
  {das Leben {von Schauspielern} und Zirkusleuten}
  the life(nom;acc) of actors and circus people(dat)

• adjacent NPs (adjective declension)
  {diese beiden ähnliche Erfolge}
  those two(weak) similar(strong) successes
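
The conjunction-attachment case can be sketched as a case check on candidate attachment sites. The case sets are taken from the glosses above; the function `attach_conjunct` and the ordering of candidates from highest to lowest attachment site are assumptions for illustration, not the implementation evaluated here.

```python
def attach_conjunct(conjunct_cases, candidates):
    """Return the first candidate head whose case set overlaps the conjunct's.

    Candidates are tried from the highest attachment site downwards, so a
    site is skipped only when its case cannot agree with the conjunct.
    """
    for head, cases in candidates:
        if cases & conjunct_cases:
            return head
    return None


# "das Leben von Schauspielern und Zirkusleuten": Zirkusleuten is dative.
candidates = [("Leben", {"nom", "acc"}), ("Schauspielern", {"dat"})]
print(attach_conjunct({"dat"}, candidates))  # Schauspielern
```

Without the case check the conjunct would be attached to the highest site, "Leben"; the mismatch between nom;acc and dat rules that out.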

Online Agreement Checking (-)

Some grammar errors become visible only with agreement checking.

• N coordination is missing.

{die nachlassenden Kräfte}
the diminishing strength

{die Verletzungen} und {nachlassenden Kräfte}
the injuries and diminishing strength

no noun chunk!

Conclusion (1)

• Writing a finite-state grammar is worth the effort: the FS method performs better than the statistical method.

• Noun chunker is not very good at determining POS tags.

• Online agreement checking improves performance.

• Shortest match is better than longest match for conjunction attachment.

Conclusion (2)

• Two chunkers have been implemented (base noun chunker, full noun chunker).

• Both are completely deterministic.

• On a SUN Ultra-250, the base noun chunker processes 12,500 words per second, the full noun chunker achieves 5,200 wps.

• plans for the future: extend the system to recognize predicate-argument structure for Information Extraction