
Dept. of Computer Science & Engg.

Indian Institute of Technology Kharagpur

Part-of-Speech Tagging and Chunking with Maximum Entropy Model

Sandipan Dandapat

Department of Computer Science & Engineering

Indian Institute of Technology Kharagpur


Goal

Lexical Analysis
  • Part-of-Speech (POS) Tagging: assigning a part of speech (e.g. noun, verb) to each word.

Syntactic Analysis
  • Chunking: identifying and labeling phrases such as verb phrases and noun phrases.


Machine Learning to Resolve POS Tagging and Chunking

HMM
  • Supervised (DeRose, 88; Meteer, 91; Brants, 2000; etc.)
  • Semi-supervised (Cutting, 92; Merialdo, 94; Kupiec, 92; etc.)

Maximum Entropy (Ratnaparkhi, 96; etc.)

TB(ED)L (Brill, 92, 94, 95; etc.)

Decision Tree (Black, 92; Marquez, 97; etc.)


Our Approach

Maximum Entropy based

  • Diverse and overlapping features
  • Language independence
  • Reasonably good accuracy

  • Data intensive
  • Absence of sequence information


POS Tagging Schema

[Diagram: Raw text → POS tagging (Language Model and Disambiguation Algorithm, with possible POS class restriction …) → Tagged text]


POS Tagging: Our Approach

[Diagram: Raw text → POS tagging (ME Model and Disambiguation Algorithm, with possible POS class restriction …) → Tagged text]

ME Model: the current state depends on the history (features).

$$P(t_1 \ldots t_n \mid w_1 \ldots w_n) = \prod_{i=1}^{n} P(t_i \mid h_i)$$


ME Model: the current state depends on the history (features).

$$P(t \mid h) = \frac{\prod_i \alpha_i^{g_i(h,t)}}{Z(h)}, \qquad Z(h) = \sum_{t'} \prod_i \alpha_i^{g_i(h,t')}$$
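The product form of the ME model can be sketched in a few lines. This is only an illustration under assumptions: each binary feature g_i(h, t) is taken to contribute a multiplicative weight (called `alphas` here, an assumed name), the two features and their weights are made up, and nothing below is the trained model from the talk.

```python
from math import prod

def me_probability(history, tags, features, alphas):
    """Return {tag: P(tag | history)} under a product-form ME model."""
    # Unnormalised score: product of alpha_i ** g_i(h, t) over all features.
    scores = {
        t: prod(a ** g(history, t) for g, a in zip(features, alphas))
        for t in tags
    }
    z = sum(scores.values())  # Z(h): normaliser over all candidate tags
    return {t: s / z for t, s in scores.items()}

# Two hypothetical binary features with made-up (untrained) weights.
features = [
    lambda h, t: 1 if h["word"] == "Ami" and t == "PRP" else 0,
    lambda h, t: 1 if h["prev_tag"] == "PRP" and t == "VFM" else 0,
]
alphas = [4.0, 2.0]

probs = me_probability(
    {"word": "Ami", "prev_tag": None}, ["PRP", "NN", "VFM"], features, alphas
)
```

A feature that does not fire contributes a factor of 1, so only active features change the score of a tag.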


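[Diagram: Raw text → POS tagging (ME Model and Disambiguation Algorithm) → Tagged text]

For each word $w_i$, the candidate tags are restricted to

$$t_i \in \{T\} \quad \text{or} \quad t_i \in T_{MA}(w_i)$$

{T}: set of all tags

$T_{MA}(w_i)$: set of tags computed by the Morphological Analyzer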


[Diagram: Raw text → POS tagging (ME Model and Beam Search) → Tagged text, with $t_i \in \{T\}$ or $t_i \in T_{MA}(w_i)$ for each word $w_i$]


Disambiguation Algorithm

Text: $w_1\ w_2\ w_3\ \ldots\ w_n$

Tags: [lattice: one column of candidate tags under each word]

where $t_i \in \{T\}$ for all $w_i$; {T} = set of tags

$$P(t_1 \ldots t_n \mid w_1 \ldots w_n) = \prod_{i=1}^{n} P(t_i \mid h_i)$$


Disambiguation Algorithm

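Text: $w_1\ w_2\ w_3\ \ldots\ w_n$

Tags: [lattice: a reduced column of candidate tags under each word]

where $t_i \in T_{MA}(w_i)$ for all $w_i$; {T} = set of tags

$$P(t_1 \ldots t_n \mid w_1 \ldots w_n) = \prod_{i=1}^{n} P(t_i \mid h_i)$$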
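The disambiguation step can be sketched as a beam search over per-word candidate tags. The sketch below assumes a `prob` callback returning P(t | h) and a `candidate_tags` callback returning either all tags or the morphological analyzer's tag set; both callbacks, the beam width, the toy lexicon and the toy probabilities are hypothetical and not taken from the slides.

```python
import math

def beam_search(words, candidate_tags, prob, beam_width=3):
    """Return the best tag sequence found with a fixed-width beam.

    candidate_tags(word) -> iterable of allowed tags (all tags, or T_MA(word)).
    prob(tag, words, i, prev_tags) -> P(tag | history) at position i.
    """
    beam = [(0.0, [])]  # (log-probability, tags assigned so far)
    for i, word in enumerate(words):
        extended = []
        for logp, tags in beam:
            for t in candidate_tags(word):
                p = prob(t, words, i, tags)
                if p > 0.0:
                    extended.append((logp + math.log(p), tags + [t]))
        # Keep only the beam_width most probable partial sequences.
        beam = sorted(extended, key=lambda x: x[0], reverse=True)[:beam_width]
    return beam[0][1] if beam else []

# Hypothetical toy data standing in for a morphological analyzer and a model.
TAGS = ["NN", "VFM", "PRP"]
LEXICON = {"Ami": ["PRP"], "bhat": ["NN"], "khai": ["VFM", "NN"]}

def toy_candidates(word):
    return LEXICON.get(word, TAGS)

def toy_prob(tag, words, i, prev_tags):
    # Made-up distribution: prefer a verb right after a noun.
    options = list(toy_candidates(words[i]))
    if prev_tags and prev_tags[-1] == "NN" and "VFM" in options:
        return 0.8 if tag == "VFM" else 0.2 / max(len(options) - 1, 1)
    return 1.0 / len(options)

best = beam_search(["Ami", "bhat", "khai"], toy_candidates, toy_prob)
```

Restricting `candidate_tags` to the analyzer's output shrinks each column of the lattice, so fewer hypotheses compete for the beam.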


What are Features?

Feature function: a binary function of the history and the target.

Example:

$$f_j(h,t) = \begin{cases} 1 & \text{if } \mathrm{current\_token}(h) = \text{Ami and } t = \text{PRP} \\ 0 & \text{otherwise} \end{cases}$$
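The example feature can be written directly as a small function. This is a minimal sketch: the history is assumed to be a dictionary with a `current_token` key, and `make_word_tag_feature` is a hypothetical helper name.

```python
def make_word_tag_feature(word, tag):
    """Build a binary feature that fires when the current token and target tag match."""
    def feature(history, target):
        return 1 if history["current_token"] == word and target == tag else 0
    return feature

f_j = make_word_tag_feature("Ami", "PRP")
fired = f_j({"current_token": "Ami"}, "PRP")      # token and tag both match
not_fired = f_j({"current_token": "Ami"}, "NN")   # tag differs
```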


POS Tagging Features

[Figure: context window table with columns pos, word and POS_Tag, covering positions i-3 … i+3 (words W1 … W7, tags T1 … T7); the tag at position i is marked "Estimated Tag"]

Feature Set

$$F = \{w_i, w_{i-1}, w_{i-2}, w_{i+1}, w_{i+2}, t_{i-1}, t_{i-2}, |pre| \le 4, |suf| \le 4\}$$

40 different experiments were conducted, taking several combinations from the set F.


POS Tagging Features

[Figure: context window table with columns pos, word and POS_Tag, covering positions i-3 … i+3 (words W1 … W7, tags T1 … T7); the tag at position i is marked "Estimated Tag"]

Feature Set

$$F = \{w_i, w_{i-1}, w_{i+1}, t_{i-1}, |pre| \le 4, |suf| \le 4\}$$

Condition                         Features
Static features for all words     Current word ($w_i$)
                                  Previous word ($w_{i-1}$)
                                  Next word ($w_{i+1}$)
                                  |prefix| ≤ 4
                                  |suffix| ≤ 4
Dynamic features for all words    POS tag of previous word ($t_{i-1}$)
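A feature extractor for this set might look like the sketch below. It assumes that "|prefix| ≤ 4" means every prefix of length 1 to 4 is used (likewise for suffixes) and that sentence boundaries are padded with `<BOS>` and `<EOS>`; the function name, the feature keys and the padding symbols are all illustrative choices, not details from the slides.

```python
def pos_features(words, i, prev_tag, max_affix=4):
    """Collect static and dynamic features for position i as a dict."""
    w = words[i]
    feats = {
        "w_i": w,
        "w_i-1": words[i - 1] if i > 0 else "<BOS>",
        "w_i+1": words[i + 1] if i + 1 < len(words) else "<EOS>",
        # Dynamic feature: depends on the tag assigned to the previous word.
        "t_i-1": prev_tag if prev_tag is not None else "<BOS>",
    }
    # Prefixes and suffixes of length 1..max_affix, never longer than the word.
    for k in range(1, min(max_affix, len(w)) + 1):
        feats[f"pre_{k}"] = w[:k]
        feats[f"suf_{k}"] = w[-k:]
    return feats

feats = pos_features(["Ami", "bhat", "khai"], 1, "PRP")
```

The static features can be computed once per sentence, while the dynamic feature has to be filled in during decoding because it depends on the previous decision.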


Chunking Features

[Figure: context window table with columns pos, word, POS_Tag and Chunk_Tag, covering positions i-3 … i+3 (words W1 … W7, POS tags T1 … T7, chunk tags C1 … C7); the chunk tag at position i is marked "Estimated Tag"]

Feature Set

Static features for all words
  • Current word ($w_i$)
  • POS tag of the current word ($t_i$)
  • POS tags of previous two words ($t_{i-1}$ and $t_{i-2}$)
  • POS tags of next two words ($t_{i+1}$ and $t_{i+2}$)

Dynamic features for all words
  • Chunk tags of previous two words ($C_{i-1}$ and $C_{i-2}$)
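The chunking features can be gathered in the same way as the POS features. In this sketch the `<PAD>` symbol, the feature keys and the chunk label `B-NP` are assumptions made for illustration; the slides do not specify the chunk label inventory or the padding scheme.

```python
def chunk_features(words, pos_tags, i, prev_chunks):
    """Features for chunk-tagging position i.

    prev_chunks holds the chunk tags already assigned to positions 0..i-1.
    """
    def pos_at(j):
        return pos_tags[j] if 0 <= j < len(pos_tags) else "<PAD>"

    def chunk_back(k):
        return prev_chunks[-k] if len(prev_chunks) >= k else "<PAD>"

    return {
        "w_i": words[i],
        "t_i": pos_at(i),
        "t_i-1": pos_at(i - 1),
        "t_i-2": pos_at(i - 2),
        "t_i+1": pos_at(i + 1),
        "t_i+2": pos_at(i + 2),
        # Dynamic features: chunk tags decided for the two previous words.
        "c_i-1": chunk_back(1),
        "c_i-2": chunk_back(2),
    }

feats = chunk_features(["Ami", "bhat", "khai"], ["PRP", "NN", "VFM"], 1, ["B-NP"])
```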


Experiments: POS tagging

  • Baseline Model
  • Maximum Entropy Model
      • ME (Bengali, Hindi and Telugu)
      • ME + IMA (Bengali)
      • ME + CMA (Bengali)

Data Used

Language              Bengali   Hindi     Telugu
Training data         20,396    21,470    21,416
Development data      5,023     5,681     6,098
Test data             5,226     4,924     5,193
No. of POS tags       27        25        25
No. of Chunk labels   6         7         6


Tagset and Corpus Ambiguity

  • Tagset consists of 27 grammatical classes
  • Corpus Ambiguity
      • Mean number of possible tags for each word
      • Measured in the training tagged data

Language           Dutch   German   English   French   Bengali   Hindi   Telugu
Corpus Ambiguity   1.11    1.3      1.34      1.69     1.75      1.85    1.70
Accuracy           96%     97%      96.5%     94.5%    ?         ?       ?
Unknown Words      13%     9%       11%       5%       33%       21%     56%

(Dermatas et al., 1995)
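Corpus ambiguity as described here can be computed from a tagged training corpus. The sketch below assumes the mean is taken over tokens (each occurrence of a word counts once); averaging over distinct word types instead would be an equally plausible reading, and the toy corpus is made up.

```python
from collections import defaultdict

def corpus_ambiguity(tagged_corpus):
    """Mean number of distinct tags seen per word, averaged over tokens."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_corpus:
        tags_by_word[word].add(tag)
    if not tagged_corpus:
        return 0.0
    total = sum(len(tags_by_word[w]) for w, _ in tagged_corpus)
    return total / len(tagged_corpus)

# Toy tagged corpus: "khai" is seen with two different tags.
toy = [("khai", "VFM"), ("khai", "NN"), ("Ami", "PRP"), ("bhat", "NN")]
ambiguity = corpus_ambiguity(toy)
```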


POS Tagging Results on Development Set

[Figure: Overall Accuracy. Tagging accuracy (%) vs. size of the training corpus (1000x words: 5, 10, 15, 20) for Bengali, Hindi and Telugu]

Language           Bengali   Hindi    Telugu
Corpus Ambiguity   1.75      1.85     1.70
Accuracy           79.74%    83.10%   67.12%
Unknown Words      33%       21%      56%


POS Tagging Results on Development Set

[Figure: three plots of tagging accuracy (%) vs. size of the training corpus (1000x words: 5, 10, 15, 20) for Bengali, Hindi and Telugu: Overall Accuracy, Known Words, Unknown Words]


POS Tagging Results - Bengali

[Figure: tagging accuracy (%) vs. size of the training corpus (1000x words: 5, 10, 15, 20) for ME, ME + IMA and ME + CMA]

[Figure: bar chart of tagging accuracy (%) on known words and unknown words for ME, ME+IMA and ME+CMA; bar labels as extracted: 89.81, 89.81, 90.14, 68.85, 88.39, 72.45]


Results on Development set

Method      Bengali              Hindi                Telugu
Baseline    58.88                68.93                -
ME          79.74 (89.3, 60.5)   83.10 (90.9, 53.7)   67.82 (82.5, 70.0)
ME + IMA    83.51 (84.2, 82.1)   -                    -
ME + CMA    88.25 (89.3, 86.2)   -                    -


Chunking Results

Two different measures
  • Per word basis
  • Per chunk basis: correctly identified groups along with correctly labeled groups

Evaluation Criteria   Method       Bengali      Hindi        Telugu
Per word basis        ME + I_POS   84.45        79.88        65.92
Per chunk basis       ME + I_POS   87.3, 80.6   74.1, 67.4   69.6, 56.7
                      ME + C_POS   93.3, 87.7   78.5, 74.4   -
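The two measures can be sketched as follows. The slides do not define the denominators, so this sketch assumes that both per-chunk figures are percentages of gold chunks, the first counting chunks whose boundaries were found and the second counting chunks found with the right label as well. The span representation `(start, end, label)` and the labels `NP` and `VG` are hypothetical.

```python
def per_word_accuracy(gold_tags, pred_tags):
    """Percentage of words whose chunk tag matches the gold tag."""
    if not gold_tags:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_tags, pred_tags))
    return 100.0 * correct / len(gold_tags)

def chunk_scores(gold_chunks, pred_chunks):
    """Return (% gold chunks with boundaries found, % also labeled correctly).

    Chunks are (start, end, label) tuples.
    """
    if not gold_chunks:
        return 0.0, 0.0
    pred_spans = {(s, e) for s, e, _ in pred_chunks}
    pred_full = set(pred_chunks)
    identified = sum((s, e) in pred_spans for s, e, _ in gold_chunks)
    labeled = sum(c in pred_full for c in gold_chunks)
    n = len(gold_chunks)
    return 100.0 * identified / n, 100.0 * labeled / n

gold = [(0, 1, "NP"), (1, 2, "NP"), (2, 3, "VG"), (3, 5, "NP")]
pred = [(0, 1, "NP"), (1, 2, "VG"), (2, 3, "VG"), (3, 4, "NP")]
identified_pct, labeled_pct = chunk_scores(gold, pred)
```

Under these assumptions the labeled figure can never exceed the identified figure, which matches the ordering of each pair in the table.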


Assessment of Error Types

Bengali

Predicted Class   Actual Class   % of total error   % of class error
NN                NNC            10.4               3.43
NN                JJ             7.9                2.6
NN                NNP            6.0                1.9
VFM               VRB            4.4                5.4
NNP               NNPC           4.4                11.11

Hindi

Predicted Class   Actual Class   % of total error   % of class error
NN                NNP            14.5               10.2
NN                JJ             7.9                5.6
NN                NNC            6.0                4.27
JJ                NN             3.9                14.34
VFM               VAUX           3.1                5.4

Telugu

Predicted Class   Actual Class   % of total error   % of class error
NN                JJ             12.5               9.5
NN                NNP            10.9               8.3
PREP              NLOC           6.1                23.7
NN                RB             4.5                3.4
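A table of this kind can be derived by counting (predicted, actual) confusions over the mis-tagged words. The sketch below computes only the "% of total error" column; the normalisation behind "% of class error" is not stated in the slides, so it is left out. The function name and the toy tag sequences are illustrative.

```python
from collections import Counter

def top_confusions(gold_tags, pred_tags, k=5):
    """Return [(predicted, actual, % of total error)] for the k most common confusions."""
    errors = Counter((p, g) for g, p in zip(gold_tags, pred_tags) if g != p)
    total = sum(errors.values())
    return [(p, g, 100.0 * n / total) for (p, g), n in errors.most_common(k)]

# Toy data: four of the six words are mis-tagged as NN.
gold = ["NNC", "NN", "JJ", "NNC", "NNP", "VFM"]
pred = ["NN", "NN", "NN", "NN", "NN", "VFM"]
rows = top_confusions(gold, pred)
```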


Results on Test Set

  • Bengali data has been tagged using the ME+IMA model
  • Hindi and Telugu data have been tagged with the simple ME model

Language   Number of Words   POS Tagging Accuracy   Chunking Accuracy
Bengali    5225              77.61                  80.59
Hindi      4924              75.69                  74.92
Telugu     5193              74.47                  68.59

Chunking accuracy has been measured on a per-word basis.


Conclusion and Future Scope

Morphological restriction on tags gives an efficient tagging model even when only a small amount of labeled text is available

Performance on Hindi and Telugu can be improved using morphological analyzers for those languages

Linguistic prefix and suffix information can be adopted

More features can be explored for chunking


Thank You