
Page 1: Tagging – more details

Tagging – more details

Reading:

D Jurafsky & J H Martin (2000) Speech and Language Processing, Ch 8
R Dale et al (2000) Handbook of Natural Language Processing, Ch 17
C D Manning & H Schütze (1999) Foundations of Statistical Natural Language Processing, Ch 10

Page 2: Tagging – more details

POS tagging - overview

• What is a “tagger”?
• Tagsets
• How to build a tagger and how a tagger works
  – Supervised vs unsupervised learning
  – Rule-based vs stochastic
  – And some details

Page 3: Tagging – more details

What is a tagger?

• Lack of distinction between …
  – Software which allows you to create something you can then use to tag input text, e.g. “Brill’s tagger”
  – The result of running such software, e.g. a tagger for English (based on the such-and-such corpus)
• Taggers (even rule-based ones) are almost invariably trained on a given corpus
• “Tagging” is usually understood to mean “POS tagging”, but you can have other types of tags (e.g. semantic tags)

Page 4: Tagging – more details

Tagging vs. parsing

• Once the tagger is “trained”, the process consists of straightforward look-up, plus local context (and sometimes morphology)
• A tagger will attempt to assign a tag to unknown words, and to disambiguate homographs
• The “tagset” (list of categories) is usually larger, with more distinctions

Page 5: Tagging – more details

Tagset

• Parsing usually has basic word-categories, whereas tagging makes more subtle distinctions

• E.g. noun sg vs pl vs genitive, common vs proper, +is, +has, … and all combinations

• Parser uses maybe 12-20 categories, tagger may use 60-100

Page 6: Tagging – more details

Simple taggers

• Default tagger has one tag per word, and assigns it on the basis of dictionary lookup (a minimal sketch follows below)
  – Tags may indicate ambiguity but not resolve it, e.g. nvb for noun-or-verb
• Words may be assigned different tags with associated probabilities
  – Tagger will assign the most probable tag unless there is some way to identify when a less probable tag is in fact correct
• Tag sequences may be defined by regular expressions, and assigned probabilities (including 0 for illegal sequences)
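
A minimal sketch of such a default lookup tagger, assuming a tiny hand-built lexicon; the words, tags and the fallback tag are invented for illustration:

```python
# Minimal default (lookup) tagger; the lexicon and the fallback tag are invented.
LEXICON = {
    "the": "det",
    "dog": "noun",
    "barks": "verb",
    "run": "nvb",      # ambiguity recorded but not resolved: noun-or-verb
}

def default_tag(word, unknown_tag="noun"):
    """Assign the single dictionary tag, or a default tag to unknown words."""
    return LEXICON.get(word.lower(), unknown_tag)

print([(w, default_tag(w)) for w in "The dog barks".split()])
# [('The', 'det'), ('dog', 'noun'), ('barks', 'verb')]
```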

Page 7: Tagging – more details

What probabilities do we have to learn?

(a) Individual word probabilities:
    Probability that a given tag t is appropriate for a given word w
  – Easy (in principle): learn from training corpus:

      P(t|w) = f(t,w) / f(w)

  – Problem of “sparse data”:
    • Add a small amount to each calculation, so we get no zeros (see the sketch below)
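
A small sketch of estimating P(t|w) by relative frequency from a hand-tagged corpus, with an optional small added count to avoid zeros; the toy corpus and the add value are assumptions:

```python
from collections import Counter

# Toy hand-tagged corpus of (word, tag) pairs. Purely illustrative.
tagged = [("the", "det"), ("can", "noun"), ("can", "verb"),
          ("rusts", "verb"), ("the", "det"), ("can", "verb")]

word_tag = Counter(tagged)                  # f(w, t) counts
word_freq = Counter(w for w, _ in tagged)   # f(w) counts
tagset = sorted({t for _, t in tagged})

def p_tag_given_word(tag, word, add=0.0):
    """P(t|w) = f(t,w) / f(w), optionally adding a small amount to every count."""
    return (word_tag[(word, tag)] + add) / (word_freq[word] + add * len(tagset))

print(p_tag_given_word("verb", "can"))           # 2/3
print(p_tag_given_word("noun", "can", add=0.5))  # smoothed: no tag gets probability 0
```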

Page 8: Tagging – more details

(b) Tag sequence probability:
    Probability that a given tag sequence t1,t2,…,tn is appropriate for a given word sequence w1,w2,…,wn
  – P(t1,t2,…,tn | w1,w2,…,wn) = ???
  – Too hard to calculate entire sequence:
      P(t1,t2,t3,t4,…) = P(t2|t1) P(t3|t1,t2) P(t4|t1,t2,t3) …
  – Subsequence is more tractable
  – Sequence of 2 or 3 should be enough:
      Bigram model:  P(t1,t2) = P(t2|t1)
      Trigram model: P(t1,t2,t3) = P(t2|t1) P(t3|t2)
      N-gram model:  P(t1,…,tn) = ∏ i=1..n P(ti|ti−1)   (a bigram scoring sketch follows)
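
A short sketch of scoring a tag sequence under the bigram model by multiplying P(ti|ti−1) along the sequence; the transition probabilities and the start pseudo-tag "<s>" are invented for illustration:

```python
# Illustrative bigram transition probabilities P(ti | ti-1); "<s>" is an assumed start pseudo-tag.
P_TRANS = {
    ("<s>", "det"): 0.6, ("det", "noun"): 0.7,
    ("noun", "verb"): 0.5, ("verb", "noun"): 0.3,
}

def bigram_sequence_prob(tags, start="<s>"):
    """P(t1,...,tn) under the bigram model: the product of P(ti | ti-1)."""
    prob, prev = 1.0, start
    for tag in tags:
        prob *= P_TRANS.get((prev, tag), 0.0)   # probability 0 for unseen/illegal transitions
        prev = tag
    return prob

print(bigram_sequence_prob(["det", "noun", "verb"]))   # 0.6 * 0.7 * 0.5 = 0.21
```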

Page 9: Tagging – more details

More complex taggers

• Bigram taggers assign tags on the basis of sequences of two words (usually assigning the tag of word n on the basis of word n−1); see the NLTK sketch below

• An nth-order tagger assigns tags on the basis of sequences of n words

• As the value of n increases, so does the complexity of the statistical calculation involved in comparing probability combinations
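
As a concrete illustration, NLTK ships unigram and bigram taggers that can be chained with backoff; note that NLTK's BigramTagger conditions on the previous tag plus the current word, a slight variation on the description above. The corpus choice and split here are arbitrary, and the exact accuracy will vary:

```python
import nltk
from nltk.corpus import brown

# Assumes the Brown corpus is available locally, e.g. via nltk.download("brown").
tagged_sents = list(brown.tagged_sents(categories="news"))
train, test = tagged_sents[:4000], tagged_sents[4000:]

unigram = nltk.UnigramTagger(train)                 # most frequent tag per word
bigram = nltk.BigramTagger(train, backoff=unigram)  # backs off when the context is unseen

print(bigram.accuracy(test))   # .evaluate(test) in older NLTK releases
print(bigram.tag("The jury praised the administration".split()))
```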

Page 10: Tagging – more details

History

Timeline of milestones, 1960–2000:

• Brown Corpus created (EN-US), 1 million words; later tagged
• Greene and Rubin: rule-based tagging, ~70%
• LOB Corpus created (EN-UK), 1 million words; later tagged
• HMM tagging (CLAWS): 93%–95%
• DeRose/Church: efficient HMMs, handling sparse data, 95%+
• POS tagging separated from other NLP
• Penn Treebank Corpus (WSJ, 4.5M words)
• British National Corpus (tagged by CLAWS)
• Transformation-based tagging (Eric Brill), rule-based: 95%+
• Tree-based statistics (Helmut Schmid): 96%+
• Trigram tagger (Kempe): 96%+
• Neural network taggers: 96%+
• Combined methods: 98%+

Page 11: Tagging – more details

How do they work?

• Tagger must be “trained”
• Many different techniques, but typically …
• Small “training corpus” hand-tagged
• Tagging rules learned automatically
• Rules define most likely sequence of tags
• Rules based on:
  – Internal evidence (morphology)
  – External evidence (context)

Page 12: Tagging – more details

Rule-based taggers

• Earliest type of tagging: two stages
• Stage 1: look up word in lexicon to give a list of potential POSs
• Stage 2: apply rules which certify or disallow tag sequences (a toy sketch follows)
• Rules originally handwritten; more recently, Machine Learning methods can be used
• cf. transformation-based learning, below
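
A toy sketch of the two-stage idea, with an invented lexicon and a single hand-written constraint that disallows certain tag bigrams; everything here is illustrative, and the disambiguation is greedy rather than exhaustive:

```python
# Stage 1: lexicon lookup gives each word a list of candidate POS tags. Invented entries.
LEXICON = {"the": ["det"], "can": ["noun", "verb", "modal"], "rusts": ["verb", "noun"]}

# Stage 2: hand-written constraints that disallow certain tag bigrams.
DISALLOWED = {("det", "verb"), ("det", "modal")}   # e.g. no finite verb straight after a determiner

def rule_based_tag(words):
    """Greedy two-stage tagging: look up candidates, then drop those the rules disallow."""
    tags, prev = [], "<s>"
    for w in words:
        candidates = LEXICON.get(w.lower(), ["noun"])        # crude default for unknown words
        allowed = [t for t in candidates if (prev, t) not in DISALLOWED]
        choice = (allowed or candidates)[0]                   # fall back if every candidate was disallowed
        tags.append(choice)
        prev = choice
    return list(zip(words, tags))

print(rule_based_tag("The can rusts".split()))
# [('The', 'det'), ('can', 'noun'), ('rusts', 'verb')]
```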

Page 13: Tagging – more details

Stochastic taggers

• Nowadays pretty much all taggers are statistics-based, and have been since the 1980s (or even earlier: some primitive algorithms were already published in the 60s and 70s)
• The most common approach is based on Hidden Markov Models (also found in speech processing, etc.)

Page 14: Tagging – more details

(Hidden) Markov Models

• Probability calculations imply Markov models: we assume that P(t|w) is dependent only on the (or, a sequence of) previous word(s)
• (Informally) Markov models are the class of probabilistic models that assume we can predict the future without taking too much account of the past
• Markov chains can be modelled by finite state automata: the next state in a Markov chain is always dependent on some finite history of previous states
• The model is “hidden” when the underlying states themselves are not observed directly; only their outputs (here, the words) are seen

Page 15: Tagging – more details

Three stages of HMM training

• Estimating likelihoods on the basis of a corpus: Forward-backward algorithm

• “Decoding”: applying the process to a given input: Viterbi algorithm

• Learning (training): Baum-Welch algorithm or Iterative Viterbi

Page 16: Tagging – more details

Forward-backward algorithm

• Denote At(s) = P(w1…wt, state at time t = s)
• Claim: At+1(s) = Σq At(q) · P(s|q) · P(wt+1|s)
• Therefore we can calculate all At(s) in time O(L·T^n)
• Similarly, by going backwards, we can get: Bt(s) = P(wt+1…wL | state at time t = s)
• Multiplying, we can get: At(s) · Bt(s) = P(w1…wL, state at time t = s)
• Note that summing this for all states at a time t gives the likelihood of w1…wL (a code sketch follows)
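
A compact sketch of the forward (At) and backward (Bt) passes for a small categorical HMM; the transition matrix A, emission matrix B and word indices are invented for illustration:

```python
import numpy as np

# Toy HMM with T = 2 states (tags) and V = 3 symbols (words); all values are illustrative.
pi = np.array([0.7, 0.3])                        # initial state probabilities
A = np.array([[0.8, 0.2],                        # A[q, s] = P(s | q), transition probabilities
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],                   # B[s, w] = P(w | s), emission probabilities
              [0.1, 0.3, 0.6]])
obs = [0, 2, 1]                                   # observed word indices w1..wL

def forward(pi, A, B, obs):
    """alpha[t, s]: joint probability of the words up to position t and state s at t (the slide's At(s))."""
    L, T = len(obs), len(pi)
    alpha = np.zeros((L, T))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, L):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # sum over previous states q
    return alpha

def backward(A, B, obs):
    """beta[t, s]: probability of the words after position t given state s at t (the slide's Bt(s))."""
    L, T = len(obs), A.shape[0]
    beta = np.ones((L, T))
    for t in range(L - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
print(alpha[-1].sum())             # likelihood of the whole word sequence
print((alpha * beta).sum(axis=1))  # the same likelihood, recovered at every time step
```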

Page 17: Tagging – more details

Viterbi algorithm (aka Dynamic programming)

(see J&M p177ff)

• Denote Qt(s) = the best state sequence ending with state s at time t
• Claim: the first t states of Qt+1(s) form Qt(q) for some state q
• Otherwise, appending s to that better prefix would give a path better than Qt+1(s)
• Therefore, checking all possible states q at time t, multiplying by the transition probability between q and s and the emission probability of wt+1 given s, and taking the maximum, gives Qt+1(s)
• We need to store, for each state, the previous state in Qt(s)
• Find the maximal finish state, and reconstruct the path (a code sketch follows)
• O(L·T^n) instead of T^L
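
A sketch of Viterbi decoding for the same kind of toy HMM, storing a back-pointer to the previous state as described above; the numbers are again illustrative:

```python
import numpy as np

pi = np.array([0.7, 0.3])                             # illustrative initial probabilities
A = np.array([[0.8, 0.2], [0.4, 0.6]])                # A[q, s] = P(s | q)
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])      # B[s, w] = P(w | s)
obs = [0, 2, 1]                                        # observed word indices

def viterbi(pi, A, B, obs):
    """Most probable state sequence for obs; O(L*T^2) for this first-order model."""
    L, T = len(obs), len(pi)
    score = np.zeros((L, T))                # score[t, s]: probability of the best path ending in s at t
    backptr = np.zeros((L, T), dtype=int)   # previous state on that best path
    score[0] = pi * B[:, obs[0]]
    for t in range(1, L):
        for s in range(T):
            cand = score[t - 1] * A[:, s] * B[s, obs[t]]
            backptr[t, s] = cand.argmax()
            score[t, s] = cand.max()
    path = [int(score[-1].argmax())]        # maximal finish state
    for t in range(L - 1, 0, -1):           # follow back-pointers to reconstruct the path
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

print(viterbi(pi, A, B, obs))   # [0, 0, 0] with these toy numbers
```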

Page 18: Tagging – more details

Baum-Welch algorithm

• Start with an initial HMM
• Calculate, using F-B, the likelihood of getting our observations given that a certain hidden state was used at time i
• Re-estimate the HMM parameters (the re-estimation step is sketched below)
• Continue until convergence
• Can be shown to monotonically improve the likelihood
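
A sketch of the re-estimation step only, assuming the occupancy probabilities gamma[t, s] (being in state s at time t) and xi[t, q, s] (moving from q to s at time t) have already been computed from the forward and backward passes; the function and argument names are mine, not the slide's:

```python
import numpy as np

def reestimate(gamma, xi, obs, n_symbols):
    """One Baum-Welch re-estimation (M) step from expected counts.

    gamma: (L, T) probability of each state at each time; xi: (L-1, T, T) probability
    of each transition at each time; obs: the L observed word indices."""
    pi_new = gamma[0]                                          # expected initial states
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # expected transitions, normalised
    B_new = np.zeros((gamma.shape[1], n_symbols))
    for t, w in enumerate(obs):                                # expected emissions
        B_new[:, w] += gamma[t]
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new
```

Alternating forward-backward (the E-step) with this re-estimation (the M-step) until the likelihood stops improving is the whole Baum-Welch loop.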

Page 19: Tagging – more details

Unsupervised learning

• We have an untagged corpus
• We may also have partial information, such as a set of tags, a dictionary, knowledge of tag transitions, etc.
• Use Baum-Welch to estimate both the context probabilities and the lexical probabilities

Page 20: Tagging – more details

Supervised learning

• Use a tagged corpus
• Count the frequencies of tag–word pairs t,w: C(t,w)
• Estimate (Maximum Likelihood Estimate):

    P(w|t) = C(t,w) / C(t),   where C(t) = Σw C(t,w)

• Count the frequencies of tag n-grams C(t1…tn)
• Estimate (Maximum Likelihood Estimate):

    P(ti | ti−n+1 … ti−1) = C(ti−n+1 … ti) / C(ti−n+1 … ti−1)

• What about small counts? Zero counts? (a counting sketch follows)
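
A small sketch of these counts and MLE estimates over an invented two-sentence tagged corpus, using bigrams (n = 2) for the tag n-grams:

```python
from collections import Counter

# Toy tagged corpus: a list of tagged sentences. Purely illustrative.
corpus = [
    [("the", "det"), ("can", "noun"), ("rusts", "verb")],
    [("the", "det"), ("dog", "noun"), ("can", "verb"), ("run", "verb")],
]

tag_word = Counter()       # C(t, w)
tag_count = Counter()      # C(t) = sum over w of C(t, w)
tag_bigram = Counter()     # C(t_{i-1}, t_i)

for sent in corpus:
    prev = "<s>"
    for word, tag in sent:
        tag_word[(tag, word)] += 1
        tag_count[tag] += 1
        tag_bigram[(prev, tag)] += 1
        prev = tag

def p_word_given_tag(word, tag):
    return tag_word[(tag, word)] / tag_count[tag]               # P(w|t) = C(t,w) / C(t)

def p_tag_given_prev(tag, prev):
    context = sum(c for (p, _), c in tag_bigram.items() if p == prev)
    return tag_bigram[(prev, tag)] / context                    # C(t_{i-1}, t_i) / C(t_{i-1})

print(p_word_given_tag("can", "verb"))   # 1/3
print(p_tag_given_prev("noun", "det"))   # 2/2 = 1.0
```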

Page 21: Tagging – more details

Sparse Training Data - Smoothing

• Adding a bias λ to each count:

    C'(t,w) = C(t,w) + λ
    P'(t|w) = C'(t,w) / (C(w) + λT)      (T = number of tags)

• Compensates for estimation (Bayesian approach)
• Has a larger effect on low-count words
• Solves the zero-count word problem
• Generalized smoothing:

    P(t|w) = (1 − λ(w)) · C(t,w)/C(w) + λ(w) · f(t,w),   where f is a probability measure over t

• Reduces to the bias version (sketched below) using:

    f(t,w) = 1/T,   λ(w) = λT / (C(w) + λT)
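
A small sketch of the add-λ (bias) estimate for a single word, with invented counts; the tagset size T and λ are arbitrary choices:

```python
from collections import Counter

# Illustrative counts C(t, w) and C(w) for one word; T is the tagset size.
tag_word = Counter({("verb", "can"): 2, ("noun", "can"): 1})
word_count = Counter({"can": 3})
T = 60            # e.g. a 60-tag tagset
lam = 0.5         # the added bias; a tuning choice

def p_smoothed(tag, word):
    """Add-lambda estimate: (C(t,w) + lambda) / (C(w) + lambda * T)."""
    return (tag_word[(tag, word)] + lam) / (word_count[word] + lam * T)

print(p_smoothed("verb", "can"))   # (2 + 0.5) / (3 + 30) ≈ 0.076
print(p_smoothed("adj", "can"))    # an unseen tag still gets a small non-zero probability
```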

Page 22: Tagging – more details

Decision-tree tagging

• Not all n-grams are created equal:
  – Some n-grams contain redundant information that may be expressed well enough with fewer tags
  – Some n-grams are too sparse
• Decision Tree (Schmid, 1994)

Page 23: Tagging – more details

Decision Trees

• Each node is a binary test of tag ti−k
• The leaves store probabilities for ti
• All HMM algorithms can still be used
• Learning (an illustrative sketch follows):
  – Build the tree from root to leaves
  – Choose tests for nodes that maximize information gain
  – Stop when a branch is too sparse
  – Finally, prune the tree
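
As an illustration of the idea only (not Schmid's actual TreeTagger), a decision tree can be trained to predict the current tag from the preceding tags, with entropy-based splits standing in for information gain; the features and toy data below are assumptions:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy training data: predict tag ti from the context (ti-2, ti-1). Purely illustrative.
contexts = [["det", "adj"], ["det", "noun"], ["noun", "verb"], ["verb", "det"],
            ["adj", "noun"], ["det", "adj"], ["noun", "verb"], ["verb", "det"]]
targets = ["noun", "verb", "det", "noun", "verb", "noun", "det", "adj"]

# The entropy criterion corresponds to choosing node tests by information gain;
# min_samples_leaf is a crude stand-in for "stop when the branch is too sparse".
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(criterion="entropy", min_samples_leaf=2),
)
model.fit(contexts, targets)

print(model.predict([["det", "adj"]]))          # most likely tag for this context
print(model.predict_proba([["noun", "verb"]]))  # the leaf stores a distribution over tags
```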

Page 24: Tagging – more details

Transformation-based learning

• Eric Brill (1993)
• Start from an initial tagging, and apply a series of transformations
• Transformations are learned as well, from the training data
• Captures the tagging data in far fewer parameters than stochastic models
• The transformations learned have linguistic meaning

Page 25: Tagging – more details

Transformation-based learning

• Examples: change tag a to b when:
  – The preceding (following) word is tagged z
  – The word two before (after) is tagged z
  – One of the 2 preceding (following) words is tagged z
  – The preceding word is tagged z and the following word is tagged w
  – The preceding (following) word is W

Page 26: Tagging – more details

Transformation-based Tagger: Learning

• Start with initial tagging
• Score the possible transformations by comparing their result to the “truth”
• Choose the transformation that maximizes the score
• Repeat the last 2 steps (see the sketch below)
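
A toy sketch of this learning loop with a single transformation template ("change tag a to b when the preceding tag is z") and an invented training corpus; Brill's tagger uses a much larger template set and a more efficient scoring scheme:

```python
from collections import Counter
from itertools import product

# Toy "truth": a tiny hand-tagged training corpus. Purely illustrative.
truth = [[("the", "DT"), ("can", "NN"), ("rusts", "VB")],
         [("the", "DT"), ("dog", "NN"), ("can", "VB"), ("run", "VB")]]

# Initial tagging: most frequent tag for each word in the training data.
freq = Counter((w, t) for sent in truth for w, t in sent)
best = {}
for (w, t), c in freq.items():
    if c > best.get(w, ("", 0))[1]:
        best[w] = (t, c)
current = [[(w, best[w][0]) for w, _ in sent] for sent in truth]
tags = {t for sent in truth for _, t in sent}

def score(rule):
    """Net number of corrections made by rule = (a, b, z): change a to b when the previous tag is z."""
    a, b, z = rule
    s = 0
    for cur, gold in zip(current, truth):
        prev = "<s>"
        for (_, t), (_, g) in zip(cur, gold):
            if t == a and prev == z:
                s += (b == g) - (t == g)   # +1 if the change fixes the tag, -1 if it breaks a correct one
            prev = t                       # context is the tag before any transformation
    return s

def apply_rule(rule, tagged):
    a, b, z = rule
    out = []
    for sent in tagged:
        new, prev = [], "<s>"
        for w, t in sent:
            new.append((w, b if (t == a and prev == z) else t))
            prev = t
        out.append(new)
    return out

# Learning loop: repeatedly choose and apply the highest-scoring transformation.
for _ in range(3):
    candidates = [r for r in product(tags, tags, tags | {"<s>"}) if r[0] != r[1]]
    rule = max(candidates, key=score)
    if score(rule) <= 0:
        break
    current = apply_rule(rule, current)
    print("learned:", rule)   # e.g. ('NN', 'VB', 'NN') for this toy corpus
```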