Part-of-Speech Tagging and Chunking with Maximum Entropy Model

Sandipan Dandapat

Dept. of Computer Science & Engg.
Indian Institute of Technology Kharagpur




Goal

Lexical analysis: Part-of-Speech (POS) tagging, i.e. assigning a part of speech (e.g. noun, verb) to each word.

Syntactic analysis: chunking, i.e. identifying and labeling phrases such as verb phrases and noun phrases.



Machine Learning to Resolve POS Tagging and Chunking

HMM, supervised (DeRose, 88; Meteer, 91; Brants, 2000; etc.)

HMM, semi-supervised (Cutting, 92; Merialdo, 94; Kupiec, 92; etc.)

Maximum Entropy (Ratnaparkhi,96; etc.)

TB(ED)L (Brill,92,94,95; etc.)

Decision Tree (Black,92; Marquez,97; etc.)



Our Approach

Maximum Entropy based

Advantages: diverse and overlapping features; language independence; reasonably good accuracy

Limitations: data intensive; absence of sequence information



POS Tagging Schema

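[Figure: POS tagging schema. Raw text is the input and tagged text is the output of POS tagging; the components shown are a language model, a disambiguation algorithm, and a possible POS class restriction.]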



POS Tagging: Our Approach

[Figure: POS tagging with an ME model; raw text in, tagged text out.]

ME Model: the current state depends on the history (features)

$$P(t_1 \ldots t_n \mid w_1 \ldots w_n) = \prod_{i=1}^{n} P(t_i \mid h_i)$$




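ME Model: probability of a tag given the history (features)

$$P(t_i \mid h_i) = \frac{1}{Z(h_i)} \prod_{j} \alpha_j^{g_j(h_i, t_i)}, \qquad Z(h_i) = \sum_{t} \prod_{j} \alpha_j^{g_j(h_i, t)}$$

Here the $g_j$ are binary feature functions, $\alpha_j > 0$ are their weights, and $Z(h_i)$ normalizes over the candidate tags $t$.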
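The conditional ME model P(t | h) with its normalizer Z(h) can be sketched in a few lines. This is a minimal illustration, assuming the usual log-linear parameterization in which each active binary feature contributes a real-valued weight (the logarithm of its multiplicative weight). The names `me_probabilities`, `active_features`, `toy_features` and `toy_weights`, and the weight values, are hypothetical and not taken from the slides.

```python
import math
from typing import Any, Callable, Dict, Iterable, List


def me_probabilities(
    history: Any,
    tags: Iterable[str],
    active_features: Callable[[Any, str], List[str]],
    weights: Dict[str, float],
) -> Dict[str, float]:
    """Return P(t | h) for every candidate tag t under a log-linear model."""
    tags = list(tags)
    # Unnormalized score: exp of the summed weights of the features that fire.
    scores = {
        t: math.exp(sum(weights.get(f, 0.0) for f in active_features(history, t)))
        for t in tags
    }
    z = sum(scores.values())  # normalizer Z(h)
    return {t: s / z for t, s in scores.items()}


def toy_features(history, tag):
    # Hypothetical feature template: current word paired with the candidate tag.
    return [f"word={history['word']}&tag={tag}"]


toy_weights = {"word=Ami&tag=PRP": 2.0}  # made-up weight for illustration
probs = me_probabilities({"word": "Ami"}, ["PRP", "NN"], toy_features, toy_weights)
```

Features without a learned weight default to 0, so they leave the score of a tag unchanged.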








[Figure: POS tagging with an ME model and beam search; raw text in, tagged text out.]

Candidate tags for each word $w_i$: $t_i \in \{T\}$ or $t_i \in T_{MA}(w_i)$

$\{T\}$: set of all tags

$T_{MA}(w_i)$: set of tags computed by the morphological analyzer
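The combination of beam search with an optional tag restriction can be sketched as follows. This is an illustrative sketch only: the beam width, the fallback to the full tagset when the analyzer returns nothing, and the names `beam_search_tag`, `allowed_tags` and `toy_prob` (with its made-up probabilities) are assumptions, not details given in the slides.

```python
import math
from typing import Callable, List, Optional, Sequence, Set, Tuple


def beam_search_tag(
    words: Sequence[str],
    all_tags: Set[str],
    prob: Callable[[Sequence[str], int, Tuple[str, ...], str], float],
    allowed_tags: Optional[Callable[[str], Set[str]]] = None,
    beam_size: int = 3,
) -> List[str]:
    """Return the best tag sequence found by beam search over P(t_i | h_i)."""
    beam: List[Tuple[float, Tuple[str, ...]]] = [(0.0, ())]  # (log-prob, tags so far)
    for i, word in enumerate(words):
        # Use T_MA(w_i) when an analyzer is supplied; fall back to {T} if it is empty.
        candidates = (allowed_tags(word) if allowed_tags else None) or all_tags
        expanded = []
        for logp, prev in beam:
            for tag in sorted(candidates):
                p = max(prob(words, i, prev, tag), 1e-300)  # guard against log(0)
                expanded.append((logp + math.log(p), prev + (tag,)))
        # Keep only the best `beam_size` partial sequences.
        beam = sorted(expanded, key=lambda x: x[0], reverse=True)[:beam_size]
    return list(beam[0][1])


def toy_prob(words, i, prev_tags, tag):
    # Hypothetical model that conditions only on the previous tag.
    prev = prev_tags[-1] if prev_tags else "<s>"
    table = {
        ("<s>", "NN"): 0.7, ("<s>", "VB"): 0.3,
        ("NN", "NN"): 0.2, ("NN", "VB"): 0.8,
        ("VB", "NN"): 0.6, ("VB", "VB"): 0.4,
    }
    return table[(prev, tag)]


unrestricted = beam_search_tag(["w1", "w2"], {"NN", "VB"}, toy_prob)
restricted = beam_search_tag(
    ["w1", "w2"], {"NN", "VB"}, toy_prob,
    allowed_tags=lambda w: {"NN"} if w == "w2" else set(),
)
```

In the toy run, restricting the second word to NN changes the best sequence, which shows how a morphological restriction prunes the search space.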




Text: $w_1\ w_2\ w_3\ \ldots\ w_n$

[Figure: lattice of candidate tags for each word of the text.]

where $t_i \in \{T\}$ for all $w_i$; $\{T\}$ = set of tags

$$P(t_1 \ldots t_n \mid w_1 \ldots w_n) = \prod_{i=1}^{n} P(t_i \mid h_i)$$




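Text: $w_1\ w_2\ w_3\ \ldots\ w_n$

[Figure: lattice of candidate tags for each word of the text, with fewer candidates per word.]

where $t_i \in T_{MA}(w_i)$ for all $w_i$; $\{T\}$ = set of tags

$$P(t_1 \ldots t_n \mid w_1 \ldots w_n) = \prod_{i=1}^{n} P(t_i \mid h_i)$$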



What are Features?

Feature function: a binary function of the history and the target

Example:

$$f_j(h, t) = \begin{cases} 1 & \text{if } \mathrm{current\_token}(h) = \text{Ami and } t = \text{PRP} \\ 0 & \text{otherwise} \end{cases}$$
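The example feature translates directly into code. This is a minimal sketch that assumes the history is represented as a dictionary with a `current_token` key; that representation is an illustrative choice, not something the slides specify.

```python
def f_j(history: dict, tag: str) -> int:
    """Binary feature: fires when the current token is 'Ami' and the tag is PRP."""
    return 1 if history.get("current_token") == "Ami" and tag == "PRP" else 0


fires = f_j({"current_token": "Ami"}, "PRP")
wrong_tag = f_j({"current_token": "Ami"}, "NN")
wrong_word = f_j({"current_token": "w2"}, "PRP")
```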



POS Tagging Features

[Figure: context window with columns pos, word and POS_Tag over positions i-3 to i+3 (words W1 to W7, tags T1 to T7); the tag at position i is the estimated tag.]

Feature Set

$$F = \{w_i,\ w_{i-1},\ w_{i-2},\ w_{i+1},\ w_{i+2},\ t_{i-1},\ t_{i-2},\ |pre| \le 4,\ |suf| \le 4\}$$

40 different experiments were conducted, taking several combinations from the set F.



POS Tagging Features

Feature Set

$$F = \{w_i,\ w_{i-1},\ w_{i+1},\ t_{i-1},\ |pre| \le 4,\ |suf| \le 4\}$$

Condition                         Features
Static features for all words     Current word ($w_i$)
                                  Previous word ($w_{i-1}$)
                                  Next word ($w_{i+1}$)
                                  |prefix| ≤ 4
                                  |suffix| ≤ 4
Dynamic features for all words    POS tag of previous word ($t_{i-1}$)

[Figure: context window with columns pos, word and POS_Tag over positions i-3 to i+3 (words W1 to W7, tags T1 to T7); the tag at position i is the estimated tag.]
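A feature extractor for this set (current, previous and next word, previous tag, prefixes and suffixes up to length 4) might look like the sketch below. It assumes that "|prefix| ≤ 4" means every prefix of length 1 to 4 is used as a separate feature, and it introduces `<s>` and `</s>` as boundary placeholders; both are assumptions, as are the function name and the string encoding of features.

```python
from typing import List, Sequence


def pos_features(
    words: Sequence[str], i: int, prev_tag: str, max_affix: int = 4
) -> List[str]:
    """Return string-valued features for tagging the word at position i."""
    w = words[i]
    feats = [
        f"w0={w}",                                                  # current word
        f"w-1={words[i - 1] if i > 0 else '<s>'}",                  # previous word
        f"w+1={words[i + 1] if i + 1 < len(words) else '</s>'}",    # next word
        f"t-1={prev_tag}",                                          # dynamic: previous tag
    ]
    # Prefixes and suffixes of length 1..max_affix (never longer than the word).
    for k in range(1, min(max_affix, len(w)) + 1):
        feats.append(f"pre={w[:k]}")
        feats.append(f"suf={w[-k:]}")
    return feats


feats = pos_features(["Ami", "w2", "w3"], 0, "<s>")
```

The previous tag is the only feature that depends on earlier decisions, which is why it has to be supplied by the decoder at tagging time.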



Chunking Features

[Figure: context window with columns pos, word, POS_Tag and Chunk_Tag over positions i-3 to i+3 (words W1 to W7, POS tags T1 to T7, chunk tags C1 to C7); the chunk tag at position i is the estimated tag.]

Feature Set

Static features for all words
    Current word ($w_i$)
    POS tag of the current word ($t_i$)
    POS tags of previous two words ($t_{i-1}$ and $t_{i-2}$)
    POS tags of next two words ($t_{i+1}$ and $t_{i+2}$)

Dynamic features for all words
    Chunk tags of previous two words ($C_{i-1}$ and $C_{i-2}$)
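The chunking feature set (current word, POS tags in a window of two on either side, chunk tags of the previous two words) can be sketched as below. The `<pad>` placeholder for out-of-sentence positions, the function name, and the chunk label `B-NP` used in the toy call are assumptions made for illustration.

```python
from typing import List, Sequence

PAD = "<pad>"  # assumed placeholder for positions outside the sentence


def chunk_features(
    words: Sequence[str],
    pos_tags: Sequence[str],
    i: int,
    prev_chunks: Sequence[str],
) -> List[str]:
    """Return features for predicting the chunk tag of the word at position i."""

    def pos_at(k: int) -> str:
        return pos_tags[k] if 0 <= k < len(pos_tags) else PAD

    def chunk_back(k: int) -> str:
        # k-th most recently assigned chunk tag, if it exists.
        return prev_chunks[-k] if len(prev_chunks) >= k else PAD

    return [
        f"w0={words[i]}",
        f"t0={pos_at(i)}",
        f"t-1={pos_at(i - 1)}",
        f"t-2={pos_at(i - 2)}",
        f"t+1={pos_at(i + 1)}",
        f"t+2={pos_at(i + 2)}",
        f"c-1={chunk_back(1)}",
        f"c-2={chunk_back(2)}",
    ]


chunk_feats = chunk_features(["w1", "w2", "w3"], ["NN", "JJ", "VFM"], 1, ["B-NP"])
```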



Experiments: POS tagging

Baseline Model
Maximum Entropy Model
    ME (Bengali, Hindi and Telugu)
    ME + IMA (Bengali)
    ME + CMA (Bengali)

Data Used

Language               Bengali   Hindi    Telugu
Training data          20,396    21,470   21,416
Development data        5,023     5,681    6,098
Test data               5,226     4,924    5,193
No. of POS tags            27        25       25
No. of Chunk labels         6         7        6



Tagset and Corpus Ambiguity

Tagset consists of 27 grammatical classes

Corpus ambiguity: mean number of possible tags for each word, measured in the training tagged data

Language           Dutch   German   English   French   Bengali   Hindi   Telugu
Corpus ambiguity   1.11    1.3      1.34      1.69     1.75      1.85    1.70
Accuracy           96%     97%      96.5%     94.5%    ?         ?       ?
Unknown words      13%     9%       11%       5%       33%       21%     56%

(Dermatas et al 1995)
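Corpus ambiguity, the mean number of possible tags per word in the tagged training data, can be computed as in the sketch below. The slides do not say whether the mean is taken over running words (tokens) or over distinct words (types), so the sketch offers both through a hypothetical `per_token` flag; the toy corpus is made up.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def corpus_ambiguity(
    tagged_sentences: Iterable[List[Tuple[str, str]]], per_token: bool = True
) -> float:
    """Mean number of distinct tags observed for each word in a tagged corpus."""
    tags_of = defaultdict(set)  # word -> set of tags seen with it
    tokens = []
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tags_of[word].add(tag)
            tokens.append(word)
    if per_token:
        # Average over running words, so frequent ambiguous words count more.
        return sum(len(tags_of[w]) for w in tokens) / len(tokens)
    # Average over distinct word types.
    return sum(len(t) for t in tags_of.values()) / len(tags_of)


toy_corpus = [[("a", "NN"), ("b", "VB")], [("a", "JJ"), ("a", "NN")]]
by_token = corpus_ambiguity(toy_corpus)
by_type = corpus_ambiguity(toy_corpus, per_token=False)
```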



POS Tagging Results on Development Set

[Figure: Overall accuracy. Tagging accuracy (%) against size of the training corpus (1000x words) for Bengali, Hindi and Telugu.]

Language           Bengali   Hindi    Telugu
Corpus ambiguity   1.75      1.85     1.70
Accuracy           79.74%    83.10%   67.12%
Unknown words      33%       21%      56%



POS Tagging Results on Development Set

[Figure: three panels (Overall Accuracy, Known Words, Unknown Words), each plotting tagging accuracy (%) against size of the training corpus (1000x words) for Bengali, Hindi and Telugu.]



POS Tagging Results - Bengali

[Figure: two charts of tagging accuracy (%) for Bengali comparing ME, ME + IMA and ME + CMA; one chart separates known words and unknown words. Data labels recovered from the charts: 89.81, 89.81, 90.14, 68.85, 88.39, 72.45.]



Results on Development set

Each cell: overall accuracy (known words, unknown words)

Method     Bengali               Hindi                 Telugu
Baseline   58.88                 68.93                 -
ME         79.74 (89.3, 60.5)    83.10 (90.9, 53.7)    67.82 (82.5, 70.0)
ME + IMA   83.51 (84.2, 82.1)    -                     -
ME + CMA   88.25 (89.3, 86.2)    -                     -
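Splitting accuracy into known and unknown words can be sketched as follows. The sketch assumes a word counts as known when it occurs in the training vocabulary; the function name and the toy data are hypothetical.

```python
from typing import Dict, Sequence, Set, Tuple


def accuracy_breakdown(
    gold: Sequence[Tuple[str, str]],
    predicted: Sequence[str],
    train_vocab: Set[str],
) -> Dict[str, float]:
    """Return overall, known-word and unknown-word tagging accuracy in percent."""
    correct = {"known": 0, "unknown": 0}
    total = {"known": 0, "unknown": 0}
    for (word, gold_tag), pred_tag in zip(gold, predicted):
        group = "known" if word in train_vocab else "unknown"
        total[group] += 1
        correct[group] += int(gold_tag == pred_tag)

    def pct(c: int, n: int) -> float:
        return 100.0 * c / n if n else float("nan")

    return {
        "overall": pct(sum(correct.values()), sum(total.values())),
        "known": pct(correct["known"], total["known"]),
        "unknown": pct(correct["unknown"], total["unknown"]),
    }


toy_gold = [("a", "NN"), ("b", "VB"), ("c", "NN"), ("d", "JJ")]
toy_pred = ["NN", "VB", "VB", "JJ"]
result = accuracy_breakdown(toy_gold, toy_pred, train_vocab={"a", "b"})
```

The overall figure is the average of the two group accuracies weighted by how many known and unknown words the data contains.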



Chunking Results

Evaluation Criteria

Two different measures
    Per word basis
    Per chunk basis: correctly identified groups along with correctly labeled groups

Method          Bengali       Hindi         Telugu
Per word basis
ME + I_POS      84.45         79.88         65.92
Per chunk basis
ME + I_POS      87.3, 80.6    74.1, 67.4    69.6, 56.7
ME + C_POS      93.3, 87.7    78.5, 74.4    -
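The two evaluation measures can be illustrated with the sketch below. It assumes chunk tags in a B-/I- style such as `B-NP` and `I-NP`, and it scores identified and labeled chunks relative to the gold chunks; neither the tag format nor the exact denominator is stated in the slides, so both are assumptions, as are the function names.

```python
from typing import List, Sequence, Tuple


def spans(tags: Sequence[str]) -> List[Tuple[int, int, str]]:
    """Extract (start, end, label) chunks from B-/I- style chunk tags."""
    out, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        begins = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if start is not None and (begins or not tag.startswith("I-")):
            out.append((start, i, label))
            start, label = None, None
        if begins:
            start, label = i, tag[2:]
    return out


def chunk_scores(gold: Sequence[str], pred: Sequence[str]) -> Tuple[float, float, float]:
    """Return (per-word accuracy, % chunks identified, % chunks also labeled)."""
    per_word = 100.0 * sum(g == p for g, p in zip(gold, pred)) / len(gold)
    gold_spans, pred_spans = spans(gold), spans(pred)
    pred_bounds = {(s, e) for s, e, _ in pred_spans}
    # Identified: boundaries match; labeled: boundaries and label both match.
    identified = sum((s, e) in pred_bounds for s, e, _ in gold_spans)
    labeled = len(set(gold_spans) & set(pred_spans))
    n = len(gold_spans)
    return per_word, 100.0 * identified / n, 100.0 * labeled / n


toy_gold_chunks = ["B-NP", "I-NP", "B-VG", "B-NP"]
toy_pred_chunks = ["B-NP", "I-NP", "B-NP", "B-NP"]
per_word, identified, labeled = chunk_scores(toy_gold_chunks, toy_pred_chunks)
```

In the toy example every chunk boundary is found but one chunk receives the wrong label, so the identified score is higher than the labeled score.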



Assessment of Error Types

Bengali

Predicted Class   Actual Class   % of total error   % of class error
NN                NNC            10.4               3.43
NN                JJ             7.9                2.6
NN                NNP            6.0                1.9
VFM               VRB            4.4                5.4
NNP               NNPC           4.4                11.11

Hindi

Predicted Class   Actual Class   % of total error   % of class error
NN                NNP            14.5               10.2
NN                JJ             7.9                5.6
NN                NNC            6.0                4.27
JJ                NN             3.9                14.34
VFM               VAUX           3.1                5.4

Telugu

Predicted Class   Actual Class   % of total error   % of class error
NN                JJ             12.5               9.5
NN                NNP            10.9               8.3
PREP              NLOC           6.1                23.7
NN                RB             4.5                3.4



Results on Test Set

Bengali data has been tagged using the ME + IMA model.
Hindi and Telugu data have been tagged with the simple ME model.

Language   Number of Words   POS Tagging Accuracy   Chunking Accuracy
Bengali    5225              77.61                  80.59
Hindi      4924              75.69                  74.92
Telugu     5193              74.47                  68.59

Chunking accuracy has been measured on a per-word basis.



Conclusion and Future Scope

Morphological restriction on tags gives an efficient tagging model even when only a small amount of labeled text is available

Performance on Hindi and Telugu can be improved by using morphological analyzers for those languages

Linguistic prefix and suffix information can be adopted

More features can be explored for chunking



Thank You
