NLTK Tagging

CS1573: AI Application Development, Spring 2003

(modified from Steven Bird’s notes)

Today’s Outline

• Administration
• Final Words on Regular Expressions
  – Regular Expressions in NLTK
• New Topic: Tagging
  – Motivation and Linguistic Background
• NLTK Tutorial: Tagging
  – Part-of-Speech Tagging
  – The nltk.tagger Module
  – A Few Tagging Algorithms
  – Some Gory Details

Regular Expressions, again

• Python
  – Regular expression syntax
• NLTK uses
  – The regular expression tokenizer
  – A simple regular expression tagging algorithm

Regular Expression Tokenizers

• Mimicking the WSTokenizer

>>> tokenizer=RETokenizer(r'[^\s]+')

>>> tokenizer.tokenize(example_text)

['Hello.'@[0w], "Isn't"@[1w], 'this'@[2w], 'fun?'@[3w]]
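For comparison, the same whitespace-style tokenization can be sketched with Python’s standard re module (plain Python, not the NLTK class above):

>>> import re
>>> re.findall(r'[^\s]+', "Hello. Isn't this fun?")
['Hello.', "Isn't", 'this', 'fun?']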

RE Tokenization, continued

> regexp = r'\w+|[^\w\s]+'
> tokenizer = RETokenizer(regexp)
> tokenizer.tokenize(example_text)

['Hello'@[0w], '.'@[1w], 'Isn'@[2w], "'"@[3w], 't'@[4w], 'this'@[5w], 'fun'@[6w], '?'@[7w]]

Why is this version better?

RE Tokenization, continued

> regexp = r'\w+|[^\w\s]+'

Why is this version better?

– includes punctuation as separate tokens
– matches either a sequence of alphanumeric characters (letters and numbers) or a sequence of punctuation characters

But it still has problems, for example … ?

Improved Example

> example_text = 'That poster costs $22.40.'

> regexp = r'(\w+)|(\$\d+\.\d+)|([^\w\s]+)'

> tokenizer = RETokenizer(regexp)

> tokenizer.tokenize(example_text)
['That'@[0w], 'poster'@[1w], 'costs'@[2w], '$22.40'@[3w], '.'@[4w]]
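The improved pattern can be checked with the standard re module as well. Since re.findall returns group tuples when a pattern has capturing groups, this plain-Python sketch drops the parentheses:

>>> import re
>>> re.findall(r'\w+|\$\d+\.\d+|[^\w\s]+', 'That poster costs $22.40.')
['That', 'poster', 'costs', '$22.40', '.']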

Regular Expression Limitations

While regular languages can model many things, there are still limitations: a regular expression offers no guidance when it rejects an input, and when the accept condition is ambiguous it yields either all matches or just one.

New Topic

Now we’re going to start looking at tagging, and especially approaches that depend on looking at words in context.

We’ll start with what looks like an artificial task: predicting the next word in a sequence.

We’ll then move to tagging, the process of associating auxiliary information with each token, often for use in later stages of text processing.

Word Prediction Example

• From NY Times, revealed one step at a time:
  – Stocks plunged this…
  – Stocks plunged this morning, despite a cut in interest…
  – Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall…
  – Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began…
  – Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last…
  – Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last Tuesday’s terrorist attacks.

Format Change

• Move to pdf slides (highlights of Jurafsky and Martin Chapters 6 and 8)

Tagging: Overview / Review

• Motivation
  – What is tagging? What does tagging do? Kinds of tagging?
  – Significance of part of speech
• Basics
  – Features and context
  – Brown and Penn Treebank tagsets
  – Tagging in NLTK (nltk.tagger module)
• Tagging
  – Algorithms: statistical and rule-based tagging
  – Evaluation

Terminology

• Tagging
  – The process of associating labels with each token in a text
• Tags
  – The labels
• Tag Set
  – The collection of tags used for a particular task

Example

Typically a tagged text is a sequence of white-space separated base/tag tokens:

The/at Pantheon’s/np interior/nn ,/, still/rb in/in its/pp original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at architectural/jj triumph/nn ./. Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.

What does Tagging do?

1. Collapses Distinctions
   • Lexical identity may be discarded
   • e.g. all personal pronouns tagged with PRP
2. Introduces Distinctions
   • Ambiguities may be removed
   • e.g. deal tagged with NN or VB
   • e.g. deal tagged with DEAL1 or DEAL2
3. Helps classification and prediction

Kinds of Tagging

• Part-of-Speech tagging
  – Grammatical tagging
  – Divides words into categories based on how they can be combined to form sentences (e.g., articles can combine with nouns but not verbs)
• Semantic Sense tagging
  – Sense disambiguation
  – Homonym disambiguation
• Discourse tagging
  – Speech acts (request, inform, greet, etc.)

Significance of Parts of Speech

• A word’s POS tells us a lot about the word and its neighbors
  – Limits the range of meanings (deal), pronunciation (OBject vs. obJECT), or both (wind)
  – Helps in stemming
  – Limits the range of following words for ASR
  – Helps select nouns from a document for IR
  – Basis for partial parsing
  – Basis for searching for linguistic constructions
  – Parsers can build trees directly on the POS tags instead of maintaining a lexicon

Features and Contexts

[Diagram: a window of words w(n-2) w(n-1) w(n) w(n+1) with tags t(n-2) t(n-1) t(n) t(n+1); the surrounding words and preceding tags form the CONTEXT, and the tag t(n) to be predicted is the FEATURE.]

Why there are many tag sets

1. Definition of POS tag
   • Semantic, syntactic, morphological
   • Tagsets differ in both how they define the tags and at what level of granularity
2. Balancing classification and prediction
   • Introducing more distinctions:
     – Better information about context
     – Harder to classify current token
   • Introducing fewer distinctions:
     – Less information about context
     – Less work to do for classifying current token

The Brown Corpus

• The first digital corpus (1961)
  – Francis and Kucera, Brown University
• Contents: 500 texts, each 2000 words long
  – From American books, newspapers, magazines
  – Representing genres:
    • Science fiction, romance fiction, press reportage, scientific writing, popular lore

Penn Treebank

• First syntactically annotated corpus

• 1 million words from Wall Street Journal

• Part of speech tags and syntax trees

Representing Tags in NLTK

• TaggedType class
  >>> ttype1 = TaggedType('dog', 'NN')
  'dog'/'NN'
  >>> ttype1.base()
  'dog'
  >>> ttype1.tag()
  'NN'

• Tagged tokens
  >>> ttoken = Token(ttype1, Location(5))
  'dog'/'NN'@[5]

Reading Tagged Corpora

>> tagged_text_str = open('corpus.txt').read()

'John/NN saw/VB the/AT book/NN on/IN the/AT table/NN ./END He/NN sighed/VB ./END'

>> tokens=TaggedTokenizer().tokenize(tagged_text_str)

['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], 'table'/'NN'@[6w], '.'/'END'@[7w], 'He'/'NN'@[8w], 'sighed'/'VB'@[9w], '.'/'END'@[10w]]

If TaggedTokenizer encounters a word without a tag, it will assign it the default tag None.
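A minimal plain-Python sketch of this parsing step (parse_tagged is a hypothetical helper, not part of the NLTK API) splits each token on its final slash and falls back to the default tag None:

def parse_tagged(text):
    # Split 'base/tag' tokens; untagged words get the default tag None.
    pairs = []
    for tok in text.split():
        if '/' in tok:
            base, tag = tok.rsplit('/', 1)
            pairs.append((base, tag))
        else:
            pairs.append((tok, None))
    return pairs

>>> parse_tagged('John/NN saw/VB the/AT book/NN ./END')
[('John', 'NN'), ('saw', 'VB'), ('the', 'AT'), ('book', 'NN'), ('.', 'END')]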

The TaggerI Interface

> tokens = WSTokenizer().tokenize(untagged_text_str)
['John'@[0w], 'saw'@[1w], 'the'@[2w], 'book'@[3w], 'on'@[4w], 'the'@[5w], 'table'@[6w], '.'@[7w], 'He'@[8w], 'sighed'@[9w], '.'@[10w]]

> my_tagger.tag(tokens)
['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], 'table'/'NN'@[6w], '.'/'END'@[7w], 'He'/'NN'@[8w], 'sighed'/'VB'@[9w], '.'/'END'@[10w]]

The interface defines a single method, tag, which assigns a tag to each token in a list, and returns the resulting list of tagged tokens.

Tagging Algorithms

• Default tagger
  – Inspect the word and guess a tag
• Unigram tagger
  – Assign the tag which is most probable for the word in question, based on raw frequency
  – Uses training data
• Bigram tagger, n-gram tagger
• Rule-based taggers, HMM taggers (outside scope of this class)

Default Tagger

• We need something to use for unseen words
  – E.g., guess NNP for a word with an initial capital
• Do regular-expression processing of the words
  – Sequence of regular expression tests
  – Assignment of the word to a suitable tag
• If there are no matches…
  – Assign the most frequent tag, NN

Finding the most frequent tag

• nltk.probability module

  for ttoken in ttext:
      freq_dist.inc(ttoken.type().tag())
  def_tag = freq_dist.max()
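The same computation can be sketched in plain Python with collections.Counter (illustrative only; FreqDist is the NLTK class):

>>> from collections import Counter
>>> freq_dist = Counter(['NN', 'VB', 'NN', 'AT', 'NN'])
>>> def_tag = freq_dist.most_common(1)[0][0]
>>> def_tag
'NN'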

A Default Tagger

> tokens = WSTokenizer().tokenize(untag_text_str)

['John'@[0w], 'saw'@[1w], '3'@[2w], 'polar'@[3w], 'bears'@[4w], '.'@[5w]]

> my_tagger.tag(tokens)

['John'/'NN'@[0w], 'saw'/'NN'@[1w], '3'/'CD'@[2w], 'polar'/'NN'@[3w], 'bears'/'NN'@[4w], '.'/'NN'@[5w]]

NN_CD_Tagger assigns CD to numbers, otherwise NN

Poor performance (20-30%) in isolation, but when used with other taggers can significantly improve performance

Unigram Tagger

• Unigram = table of frequencies
  – E.g. in a tagged WSJ sample, “deal” is tagged with NN 11 times, with VB 1 time, and with VBP 1 time
  – 90% accuracy
• Counting events

  freq_dist = CFFreqDist()
  for ttoken in ttext:
      context = ttoken.type().base()
      feature = ttoken.type().tag()
      freq_dist.inc(CFSample(context, feature))

• Finding the most likely tag for a context

  context_event = ContextEvent(token.type())
  sample = freq_dist.cond_max(context_event)
  tag = sample.feature()
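The conditional counting above can be sketched in plain Python with a dictionary of Counters, one per word context (illustrative only; CFFreqDist and CFSample are the old NLTK names):

from collections import defaultdict, Counter

# Tagged text as (word, tag) pairs
ttext = [('deal', 'NN'), ('deal', 'VB'), ('deal', 'NN'), ('the', 'AT')]

freq_dist = defaultdict(Counter)
for word, tag in ttext:
    freq_dist[word][tag] += 1          # count each tag, conditioned on the word

tag = freq_dist['deal'].most_common(1)[0][0]   # 'NN'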

Unigram Tagger (continued)

• Before being used, UnigramTaggers are trained using the train method, which uses a tagged corpus to determine which tags are most common for each word:

# 'train.txt' is a tagged training corpus

>>> tagged_text_str = open('train.txt').read()

>>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)

>>> tagger = UnigramTagger()

>>> tagger.train(train_toks)

Unigram Tagger (continued)

• Once a UnigramTagger has been trained, the tag method can be used to tag untagged corpora:

> tokens = WSTokenizer().tokenize(untagged_text_str)

> tagger.tag(tokens)

['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], ...]

Unigram Tagger (continued)

• Performance is highly dependent on the quality of the training set
  – Can’t be too small
  – Can’t be too different from the texts we actually want to tag
• How is this related to the homework that we just did?

Nth Order Tagging

• Bigram table: frequencies of pairs
  – Not necessarily adjacent or of same category
  – What is the most likely tag for w_n, given w_n-1 and t_n-1?
  – What is the context for NLTK?
• N-gram tagger
  – Consider n-1 previous tags
  – Sparse data problem
  – Accuracy versus coverage tradeoff
  – Backoff
• Throwing away order
  – Put context into a set

Nth-Order Tagging (continued)

• In addition to considering the token’s type, the context also considers the tags of the n preceding tokens
• The tagger then picks the tag which is most likely for that context
• Different values of n are possible (a sketch of a bigram-style context follows below)
  – 0th order = unigram tagger
  – 1st order = bigrams
  – 2nd order = trigrams
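A bigram-style (1st order) context can be sketched as a (previous tag, current word) pair used as the lookup key. This is a plain-Python illustration of the idea, not the NthOrderTagger internals:

from collections import defaultdict, Counter

freq_dist = defaultdict(Counter)

def train(tagged_sent):
    # Count tags by (previous tag, word) context.
    prev_tag = None
    for word, tag in tagged_sent:
        freq_dist[(prev_tag, word)][tag] += 1
        prev_tag = tag

def next_tag(prev_tag, word):
    # Most frequent tag for this context, or None if the context is unseen.
    counts = freq_dist.get((prev_tag, word))
    return counts.most_common(1)[0][0] if counts else None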

Nth-Order Tagging (continued)

• Tagged training corpus determines the most likely tag for each context:

> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
> tagger = NthOrderTagger(3)  # 3rd order tagger
> tagger.train(train_toks)

Nth-Order Tagging (continued)

• Once trained, it can tag untagged corpora:

> tokens = WSTokenizer().tokenize(untag_text_str)

> tagger.tag(tokens)

['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], ...]

Combining Taggers

Use more accurate algorithms when we can; back off to wider coverage when needed.

• Try tagging the token with the 1st order tagger.
• If the 1st order tagger is unable to find a tag for the token, try finding a tag with the 0th order tagger.
• If the 0th order tagger is also unable to find a tag, use the NN_CD_Tagger to find a tag.

BackoffTagger class

>>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)

# Construct the taggers

>>> tagger1 = NthOrderTagger(1) # 1st order

>>> tagger2 = UnigramTagger() # 0th order

>>> tagger3 = NN_CD_Tagger()

# Train the taggers

>>> tagger1.train(train_toks)

>>> tagger2.train(train_toks)

Backoff (continued)

# Combine the taggers (in order, by specificity)
>> tagger = BackoffTagger([tagger1, tagger2, tagger3])

# Use the combined tagger
>> tokens = WSTokenizer().tokenize(untagged_text_str)

>> tagger.tag(tokens)

['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], ...]
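The backoff logic itself is just “the first tagger with an answer wins”. A minimal plain-Python sketch, assuming each tagger is a function that returns None when it cannot tag a word (as in the bigram sketch above):

def backoff_tag(taggers, prev_tag, word):
    # Try each tagger in order, from most specific to most general.
    for tagger in taggers:
        tag = tagger(prev_tag, word)
        if tag is not None:
            return tag
    return None  # no tagger could tag this word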

Rule-Based Tagger

• The Linguistic Complaint
  – Where is the linguistic knowledge of a tagger? Just a massive table of numbers
  – Aren’t there any linguistic insights that could emerge from the data?
  – Could thus use handcrafted sets of rules to tag input sentences; for example, if a word follows a determiner, tag it as a noun.

Evaluating a Tagger

• Tagged tokens – the original data
• Untag the data
• Tag the data with your own tagger
• Compare the original and new tags
  – Iterate over the two lists, checking for identity and counting
  – Accuracy = fraction correct
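The accuracy computation is a paired comparison over the two lists; a minimal sketch in plain Python:

def accuracy(gold, predicted):
    # gold and predicted are parallel lists of (word, tag) pairs.
    assert len(gold) == len(predicted)
    correct = sum(1 for g, p in zip(gold, predicted) if g[1] == p[1])
    return float(correct) / len(gold)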

A Look at Tagging Implementations

• It demonstrates how to write classes implementing the interfaces defined by NLTK.

• It provides you with a better understanding of the algorithms and data structures underlying each approach to tagging.

• It gives you a chance to see some of the code used to implement NLTK. The developers have tried hard to ensure that the implementation of every class in NLTK is easy to understand.

A Sequential Tagger

The taggers in this tutorial are implemented as sequential taggers.

• Assigns tags to one token at a time, starting with the first token of the text, and proceeding in sequential order.
• Decides which tag to assign a token on the basis of that token, the tokens that precede it, and the predicted tags for the tokens that precede it.
• To capture this commonality, we define a common base class, SequentialTagger (class SequentialTagger(TaggerI)).
• The next_tag method (misprinted as “next.tag” in the tutorial) returns the appropriate tag for the next token; each tagger subclass provides its own implementation.

SequentialTagger.next_tag

– Decides which tag to assign a token, given the list of tagged tokens that precedes it.
– Takes two arguments: a list of tagged tokens preceding the token to be tagged, and the token to be tagged; returns the appropriate tag for that token.

  def next_tag(self, tagged_tokens, next_token):
      assert 0, "next_tag not defined by SequentialTagger subclass"

SequentialTagger.tag

def tag(self, text):
    tagged_text = []
    # Tag each token, in sequential order.
    for token in text:
        # Get the tag for the next token.
        tag = self.next_tag(tagged_text, token)
        # Use tag to build a tagged token; add it to tagged_text.
        tagged_token = Token(TaggedType(token.type(), tag), token.loc())
        tagged_text.append(tagged_token)
    return tagged_text

Example Subclass: NN_CD_Tagger

class NN_CD_Tagger(SequentialTagger):
    def __init__(self):
        pass  # empty constructor

    def next_tag(self, tagged_tokens, next_token):
        # Assign 'CD' for numbers, 'NN' for anything else.
        if re.match(r'^[0-9]+(\.[0-9]+)?$', next_token.type()):
            return 'CD'
        else:
            return 'NN'

# We only define next_tag; when the tag method is called, the definition given by SequentialTagger is used.

Another Example: UnigramTagger

class UnigramTagger(SequentialTagger):

Unigram Tagger: Training

def train(self, tagged_tokens):
    for token in tagged_tokens:
        outcome = token.type().tag()
        context = token.type().base()
        self._freqdist[context].inc(outcome)

Unigram Tagger: Tagging

def next_tag(self, tagged_tokens, next_token):
    context = next_token.type()
    return self._freqdist[context].max()

E.g. access a context and find the most likely outcome:

>>> freqdist['bank'].max()
'NN'

Unigram Tagger: Initialization

• The constructor for UnigramTagger simply initializes self._freqdist with a new conditional frequency distribution.

def __init__(self):
    self._freqdist = probability.ConditionalFreqDist()

For Self-Study

• NthOrderTagger Implementation
• BackoffTagger Implementation

For Next Time

Chunk Parsing