Human Language Technology Part of Speech (POS) Tagging II Rule-based Tagging


Human Language Technology

Part of Speech (POS) Tagging II

Rule-based Tagging


April 2005 CLINT Lecture IV 2

Acknowledgment

Most slides taken from Bonnie Dorr’s course notes: www.umiacs.umd.edu/~bonnie/courses/cmsc723-03

Jurafsky & Martin Chapter 5


Bibliography

A. Voutilainen, Morphological disambiguation, in Karlsson, Voutilainen, Heikkilä, Anttila (eds), Constraint Grammar, pp. 165-284, Mouton de Gruyter, 1995. See [e-book]


EngCG Rule-Based Tagger (Voutilainen 1995)

Rules based on English Constraint Grammar

Two-stage design

Uses ENGTWOL Lexicon

Hand-written disambiguation rules


ENGTWOL Lexicon

Based on TWO-Level morphology of English (hence the name)

56,000 entries for English word stems

Each entry annotated with morphological and syntactic features


Sample ENGTWOL Lexicon


Examples of constraints (informal)

Discard all verb readings if to the left there is an unambiguous determiner, and between that determiner and the ambiguous word itself, there are no nominals (nouns, abbreviations etc.).

Discard all finite verb readings if the immediately preceding word is to.

Discard all subjunctive readings if to the left, there are no instances of the subordinating conjunction that or lest.

The first constraint would discard the verb reading of “tables” in the phrase “the tables” (see the ENGCG Tagger slide)

There are about 1,100 constraints altogether
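The first two informal constraints can be sketched in Python as follows. This is only an illustration, not EngCG's actual rule formalism: the token representation (a word paired with a list of tag sets), the function names, and the `ABBR` tag are assumptions made for the example. The sketch also assumes the usual Constraint Grammar convention that a constraint never removes a word's last remaining reading.

```python
# Hypothetical representation: a sentence is a list of (word, readings),
# and each reading is a frozenset of tags.
NOMINAL_TAGS = frozenset({"N", "ABBR"})  # assumed tags for "nouns, abbreviations etc."


def discard(readings, unwanted):
    """Drop readings for which unwanted(reading) is true, but keep at least one."""
    kept = [r for r in readings if not unwanted(r)]
    return kept if kept else readings


def no_verb_after_determiner(sentence, i):
    """Constraint 1: discard verb readings of word i if an unambiguous
    determiner stands to the left with no nominal in between."""
    readings = sentence[i][1]
    for j in range(i - 1, -1, -1):  # scan leftwards from the previous word
        left = sentence[j][1]
        if len(left) == 1 and "DET" in left[0]:
            return discard(readings, lambda r: "V" in r)
        if any(r & NOMINAL_TAGS for r in left):
            break  # a nominal intervenes, so the constraint does not apply
    return readings


def no_finite_verb_after_to(sentence, i):
    """Constraint 2: discard finite verb readings right after the word 'to'."""
    readings = sentence[i][1]
    if i > 0 and sentence[i - 1][0].lower() == "to":
        return discard(readings, lambda r: "VFIN" in r)
    return readings


# Readings for "the tables", as produced by the ENGCG morphological analyzer.
the_tables = [
    ("the", [frozenset({"DET", "CENTRAL", "ART", "SG/PL"})]),
    ("tables", [frozenset({"N", "NOM", "PL"}),
                frozenset({"V", "PRES", "SG3", "VFIN"})]),
]
remaining = no_verb_after_determiner(the_tables, 1)
print(remaining)
```

Applied to “tables”, the first constraint leaves only the noun reading. Treating any word with a nominal reading as a nominal is a simplification; a real rule could require that reading to be unambiguous.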


Actual Constraint Syntax

Given input: “that”

If

(+1 A/ADV/QUANT)

(+2 SENT-LIM)

(NOT -1 SVOC/A)

Then eliminate non-ADV tags

Else eliminate ADV tag

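This rule keeps the adverbial sense of that only when it directly precedes a sentence-final adjective, adverb, or quantifier, as in “it isn’t that odd”; in all other contexts the ADV reading is eliminated. The last condition excludes cases preceded by verbs such as consider, which take an adjective as object complement (“I consider that odd”).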
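Read literally, the rule checks three context conditions around “that” and then either keeps only the ADV reading or removes it. The Python sketch below mirrors that logic. The data layout, the helper names, and the tags given to the neighbouring words are assumptions for illustration. It also assumes that a positional condition holds if any reading at that position matches, that the end of the token list counts as SENT-LIM, and that the last reading is never removed.

```python
# Hypothetical representation: a sentence is a list of (word, readings),
# and each reading is a frozenset of tags.

def any_reading_has(sentence, i, tags):
    """True if position i exists and some reading there carries one of tags."""
    if not 0 <= i < len(sentence):
        return False
    return any(r & tags for r in sentence[i][1])


def is_sentence_limit(sentence, i):
    """Assumption: the end of the list, or a SENT-LIM token, ends the sentence."""
    return i >= len(sentence) or any_reading_has(sentence, i, frozenset({"SENT-LIM"}))


def adverbial_that(sentence, i):
    """Return the readings of word i after applying the rule for 'that'."""
    word, readings = sentence[i]
    if word.lower() != "that":
        return readings
    context_ok = (
        any_reading_has(sentence, i + 1, frozenset({"A", "ADV", "QUANT"}))  # (+1 A/ADV/QUANT)
        and is_sentence_limit(sentence, i + 2)                              # (+2 SENT-LIM)
        and not any_reading_has(sentence, i - 1, frozenset({"SVOC/A"}))     # (NOT -1 SVOC/A)
    )
    if context_ok:
        kept = [r for r in readings if "ADV" in r]      # eliminate non-ADV tags
    else:
        kept = [r for r in readings if "ADV" not in r]  # eliminate ADV tag
    return kept or readings


# The four readings of "that" listed in the lecture's example.
THAT = [
    frozenset({"ADV"}),
    frozenset({"PRON", "DEM", "SG"}),
    frozenset({"DET", "CENTRAL", "DEM", "SG"}),
    frozenset({"CS"}),
]

# Tags on the other words are illustrative, not real ENGTWOL output.
it_isnt_that_odd = [
    ("it", [frozenset({"PRON"})]),
    ("isn't", [frozenset({"V", "PRES", "VFIN"})]),
    ("that", THAT),
    ("odd", [frozenset({"A"})]),
]
i_consider_that_odd = [
    ("I", [frozenset({"PRON"})]),
    ("consider", [frozenset({"V", "PRES", "VFIN", "SVOC/A"})]),
    ("that", THAT),
    ("odd", [frozenset({"A"})]),
]

adv_kept = adverbial_that(it_isnt_that_odd, 2)
adv_dropped = adverbial_that(i_consider_that_odd, 2)
print(len(adv_kept), len(adv_dropped))  # 1 3
```

In the first sentence only the ADV reading survives. In the second, the SVOC/A verb to the left blocks the context, so the ADV reading is removed and the other three remain.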


ENGCG Tagger

Stage 1: Run words through morphological analyzer to get all parts of speech. E.g. for the phrase “the tables”, we get the following output:

"<the>"
    "the" <Def> DET CENTRAL ART SG/PL

"<tables>"
    "table" N NOM PL
    "table" <SVO> V PRES SG3 VFIN

Stage 2: Apply constraints to rule out incorrect POSs
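The two stages can be sketched as a small pipeline. This is a minimal illustration under stated assumptions. `TOY_LEXICON` is a plain lookup table standing in for the ENGTWOL two-level analyzer. The single constraint is an adjacency-only simplification of the determiner constraint. All function names are hypothetical.

```python
# Toy stand-in for ENGTWOL: surface form -> list of (lemma, tags) readings.
# Real ENGTWOL analyzes word forms morphologically instead of listing them.
TOY_LEXICON = {
    "the": [("the", ("<Def>", "DET", "CENTRAL", "ART", "SG/PL"))],
    "tables": [
        ("table", ("N", "NOM", "PL")),
        ("table", ("<SVO>", "V", "PRES", "SG3", "VFIN")),
    ],
}


def analyze(words, lexicon):
    """Stage 1: attach every reading the lexicon lists for each word.
    Unknown words raise KeyError in this sketch."""
    return [(w, list(lexicon[w.lower()])) for w in words]


def drop_verb_after_det(cohorts, i):
    """Simplified constraint: no verb reading right after an unambiguous determiner."""
    readings = cohorts[i][1]
    if i == 0:
        return readings
    prev = cohorts[i - 1][1]
    if len(prev) == 1 and "DET" in prev[0][1]:
        kept = [r for r in readings if "V" not in r[1]]
        return kept or readings  # never remove the last reading
    return readings


def disambiguate(cohorts, constraints):
    """Stage 2: apply every constraint at every position until nothing changes."""
    changed = True
    while changed:
        changed = False
        for i in range(len(cohorts)):
            for constraint in constraints:
                kept = constraint(cohorts, i)
                if kept != cohorts[i][1]:
                    cohorts[i] = (cohorts[i][0], kept)
                    changed = True
    return cohorts


tagged = disambiguate(analyze(["the", "tables"], TOY_LEXICON), [drop_verb_after_det])
for word, readings in tagged:
    print(word, readings)
```

After stage 2, “tables” keeps only its noun reading. The loop repeats until a full pass removes nothing, because discarding a reading for one word can make a neighbouring word's context unambiguous and let another constraint fire.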


Example

WORD        TAGS

Pavlov      PAVLOV N NOM SG PROPER

had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO

shown       SHOW PCP2 SVOO SVO SV

that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS (subord. conj)

salivation  N NOM SG


Performance

Tested on examples from the Wall Street Journal, the Brown Corpus, and the Lancaster-Oslo-Bergen Corpus

After application of the rules, 93-97% of all words are fully disambiguated, and 99.7% of all words retain the correct reading.

At the time, this performance was superior to that of other taggers

However, one should not discount the amount of effort needed to create this system
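The two percentages above measure different things. A word counts as fully disambiguated when exactly one reading is left, and it retains the correct reading whenever the gold tag is still among the remaining ones, even if other readings remain. The sketch below computes both measures on a tiny made-up example; the data and the function name are hypothetical and are not taken from the EngCG evaluation.

```python
def evaluate(output, gold):
    """Return (fraction fully disambiguated, fraction retaining the correct reading).

    output: for each word, the list of readings left after disambiguation
    gold:   for each word, the correct reading
    """
    total = len(gold)
    fully_disambiguated = sum(1 for readings in output if len(readings) == 1)
    correct_retained = sum(1 for readings, g in zip(output, gold) if g in readings)
    return fully_disambiguated / total, correct_retained / total


# Hypothetical tagger output: the third word is still ambiguous,
# but its correct reading has not been discarded.
output = [["DET"], ["N"], ["V", "N"], ["ADV"]]
gold = ["DET", "N", "V", "ADV"]

fully, retained = evaluate(output, gold)
print(fully, retained)  # 0.75 1.0
```

This shows how a constraint-based tagger can score very high on retained correct readings while leaving some words ambiguous: rules only discard readings they are sure about.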