Dept. of Computer Science & Engg.
Indian Institute of Technology Kharagpur
Part-of-Speech Tagging and Chunking with Maximum Entropy Model
Sandipan Dandapat
Department of Computer Science & Engineering
Indian Institute of Technology Kharagpur
Goal
Lexical Analysis
  Part-of-Speech (POS) Tagging: assigning a part of speech (e.g. noun, verb) to each word
Syntactic Analysis
  Chunking: identifying and labeling phrases such as verb phrases and noun phrases
Machine Learning to Resolve POS Tagging and Chunking
HMM
  Supervised (DeRose, 88; Meteer, 91; Brants, 2000; etc.)
  Semi-supervised (Cutting, 92; Merialdo, 94; Kupiec, 92; etc.)
Maximum Entropy (Ratnaparkhi, 96; etc.)
TB(ED)L (Brill, 92, 94, 95; etc.)
Decision Tree (Black, 92; Marquez, 97; etc.)
Our Approach
Maximum Entropy based
  Advantages: diverse and overlapping features, language independence, reasonably good accuracy
  Limitations: data intensive, absence of sequence information
POS Tagging Schema
[Diagram: Raw text → Disambiguation Algorithm → Tagged text. The disambiguation algorithm draws on a Language Model and a Possible POS Class Restriction …]
POS Tagging: Our Approach
[Diagram: Raw text → Disambiguation Algorithm → Tagged text. The language model is an ME Model; the disambiguation algorithm also uses a Possible POS Class Restriction …]
ME Model: the current state depends on the history (features)
$$P(t_1 \ldots t_n \mid w_1 \ldots w_n) = \prod_{i=1}^{n} P(t_i \mid h_i)$$
$$P(t \mid h) = \frac{1}{Z(h)} \prod_{i} \alpha_i^{\,g_i(h,t)}, \qquad Z(h) = \sum_{t} \prod_{i} \alpha_i^{\,g_i(h,t)}$$
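As a rough illustration of how the conditional probability of a tag can be computed from binary features and their weights, here is a minimal Python sketch. The function name `me_probability`, the two toy features, and the weights 3.0 and 2.0 are hypothetical and are not taken from the slides; the product-of-weights form follows the usual maximum entropy tagger formulation, which is assumed here.

```python
import math

def me_probability(history, tags, features, alphas):
    """Return {tag: P(tag | history)} for a maximum entropy model.

    features: list of binary functions g_i(history, tag) -> 0 or 1
    alphas:   list of positive weights alpha_i, one per feature
    """
    scores = {}
    for tag in tags:
        # prod_i alpha_i ** g_i(h, t), accumulated in log space
        log_score = sum(math.log(a) * g(history, tag)
                        for g, a in zip(features, alphas))
        scores[tag] = math.exp(log_score)
    z = sum(scores.values())  # normalisation constant Z(h)
    return {tag: score / z for tag, score in scores.items()}


# Hypothetical toy features and weights, for illustration only.
def g_nn_after_jj(history, tag):
    return 1 if history["prev_tag"] == "JJ" and tag == "NN" else 0

def g_is_jj(history, tag):
    return 1 if tag == "JJ" else 0

probs = me_probability({"prev_tag": "JJ"}, ["NN", "JJ", "VFM"],
                       [g_nn_after_jj, g_is_jj], [3.0, 2.0])
print(probs)
```

With these toy weights the unnormalised scores are 3, 2 and 1, so NN receives half of the probability mass. In a trained model the weights would come from parameter estimation on the tagged corpus.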
Possible POS class restriction for each word $w_i$: $t_i \in \{T\}$ or $t_i \in T_{MA}(w_i)$
  $\{T\}$: set of all tags
  $T_{MA}(w_i)$: set of tags computed by the Morphological Analyzer
[Diagram: Raw text → Beam Search → Tagged text. Beam search is the disambiguation algorithm; it uses the ME Model and the tag restriction …]
Disambiguation Algorithm
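Text: $w_1\ w_2\ w_3 \ldots w_n$
Tags: [Lattice diagram: every word $w_i$ may take any tag from $\{T\}$]
Where $t_i \in \{T\}\ \forall w_i$; $\{T\}$ = set of tags
$$P(t_1 \ldots t_n \mid w_1 \ldots w_n) = \prod_{i=1}^{n} P(t_i \mid h_i)$$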
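Text: $w_1\ w_2\ w_3 \ldots w_n$
Tags: [Lattice diagram: every word $w_i$ keeps only the tags in $T_{MA}(w_i)$, so fewer candidates per word]
Where $t_i \in T_{MA}(w_i)\ \forall w_i$; $\{T\}$ = set of tags
$$P(t_1 \ldots t_n \mid w_1 \ldots w_n) = \prod_{i=1}^{n} P(t_i \mid h_i)$$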
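The deck names beam search as the disambiguation algorithm and describes an optional restriction of each word's candidate tags to those proposed by a morphological analyzer. A minimal Python sketch of that idea follows. Everything concrete in it is an assumption for illustration: the function names, the beam width of 3, the fallback to the full tag set when the analyzer returns nothing, and the toy score table, which is not a trained model.

```python
import math

def beam_search(words, all_tags, prob, allowed_tags=None, beam_size=3):
    """Return the best tag sequence kept by a beam of width `beam_size`.

    prob(words, i, previous_tags, tag) stands in for P(t_i | h_i).
    allowed_tags(word), if given, plays the role of T_MA(w_i).
    """
    beam = [(0.0, [])]  # (log probability, tags chosen so far)
    for i, word in enumerate(words):
        candidates = allowed_tags(word) if allowed_tags else all_tags
        if not candidates:  # assumption: fall back to the full tag set
            candidates = all_tags
        expanded = []
        for log_p, tags in beam:
            for tag in candidates:
                p = prob(words, i, tags, tag)
                if p > 0.0:
                    expanded.append((log_p + math.log(p), tags + [tag]))
        if not expanded:
            raise ValueError(f"no tag with non-zero score at position {i}")
        expanded.sort(key=lambda item: item[0], reverse=True)
        beam = expanded[:beam_size]  # keep only the best partial sequences
    return beam[0][1]


# Toy scores keyed by (previous tag, tag); rows are not normalised.
TOY = {("<s>", "PRP"): 0.7, ("PRP", "NN"): 0.6, ("NN", "VFM"): 0.8}

def toy_prob(words, i, previous_tags, tag):
    prev = previous_tags[-1] if previous_tags else "<s>"
    return TOY.get((prev, tag), 0.1)

TAGS = ["PRP", "NN", "VFM"]
unrestricted = beam_search(["w1", "w2", "w3"], TAGS, toy_prob)
restricted = beam_search(["w1", "w2", "w3"], TAGS, toy_prob,
                         allowed_tags=lambda w: ["VFM"] if w == "w2" else TAGS)
```

In the restricted call the second word can only be tagged VFM, which mirrors how a morphological analyzer prunes the lattice before scoring.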
What are Features?
Feature function: a binary function of the history and the target
Example:
$$f_j(h, t) = \begin{cases} 1 & \text{if } \mathrm{current\_token}(h) = \text{Ami and } t = \text{PRP} \\ 0 & \text{otherwise} \end{cases}$$
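The example feature can be written directly as a small function. The sketch below assumes the history is a dictionary with a `current_token` key; that representation and the helper name `make_token_tag_feature` are illustrative choices, not something the slides specify.

```python
def make_token_tag_feature(token, tag):
    """Build a binary feature that fires for one (current token, tag) pair."""
    def feature(history, target):
        return 1 if history["current_token"] == token and target == tag else 0
    return feature

# The feature from the slide: current token is "Ami" and the target tag is PRP.
f_j = make_token_tag_feature("Ami", "PRP")
fired = f_j({"current_token": "Ami"}, "PRP")   # both conditions hold
silent = f_j({"current_token": "Ami"}, "NN")   # tag differs, feature stays off
```

A factory like this makes it easy to instantiate one feature per (token, tag) pair observed in the training data.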
POS Tagging Features
[Diagram: a window of positions i-3 … i+3 with columns pos, word (W1 … W7) and POS_Tag (T1 … T7); the tag at position i is the estimated tag]
Feature Set
$$F = \{w_i, w_{i-1}, w_{i-2}, w_{i+1}, w_{i+2}, t_{i-1}, t_{i-2}, |pre| \le 4, |suf| \le 4\}$$
40 different experiments were conducted, each taking a different combination of features from the set F.
Feature Set
$$F = \{w_i, w_{i-1}, w_{i+1}, t_{i-1}, |pre| \le 4, |suf| \le 4\}$$
Static features for all words
  Current word ($w_i$)
  Previous word ($w_{i-1}$)
  Next word ($w_{i+1}$)
  |prefix| ≤ 4
  |suffix| ≤ 4
Dynamic features for all words
  POS tag of the previous word ($t_{i-1}$)
[Diagram: the same window of positions i-3 … i+3 with columns pos, word and POS_Tag, highlighting the words and tags used by this feature set]
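A feature set of this shape (current, previous and next word, previous tag, prefixes and suffixes up to four characters) could be extracted as in the following Python sketch. The function name, the dictionary keys, and the boundary markers `<s>` and `</s>` are assumptions made for the example.

```python
def pos_features(words, i, prev_tag):
    """Collect contextual features for the word at position i."""
    word = words[i]
    feats = {
        "w_i": word,
        "w_i-1": words[i - 1] if i > 0 else "<s>",               # assumed start marker
        "w_i+1": words[i + 1] if i + 1 < len(words) else "</s>",  # assumed end marker
        "t_i-1": prev_tag,  # dynamic feature: known only during decoding
    }
    # All prefixes and suffixes of length 1..4 (shorter words yield fewer).
    for k in range(1, min(4, len(word)) + 1):
        feats[f"pre{k}"] = word[:k]
        feats[f"suf{k}"] = word[-k:]
    return feats

example = pos_features(["abcde", "fg"], 0, "<s>")
```

The static features depend only on the input words, while `t_i-1` has to be filled in from the tag chosen for the previous position, which is why the tagger needs a search procedure rather than independent per-word decisions.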
Chunking Features
[Diagram: a window of positions i-3 … i+3 with columns pos, word (W1 … W7), POS_Tag (T1 … T7) and Chunk_Tag (C1 … C7); the chunk tag at position i is the estimated tag]
Feature Set
Static features for all words
  Current word ($w_i$)
  POS tag of the current word ($t_i$)
  POS tags of the previous two words ($t_{i-1}$ and $t_{i-2}$)
  POS tags of the next two words ($t_{i+1}$ and $t_{i+2}$)
Dynamic features for all words
  Chunk tags of the previous two words ($C_{i-1}$ and $C_{i-2}$)
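The chunking features listed on this slide could be gathered as in the sketch below. The function name, the `<pad>` marker for positions outside the sentence, and the chunk label used in the example call are hypothetical.

```python
def chunk_features(words, pos_tags, i, prev_chunks):
    """Collect chunking features for position i.

    prev_chunks holds the chunk tags already assigned to positions 0..i-1.
    """
    def tag_at(j):
        return pos_tags[j] if 0 <= j < len(pos_tags) else "<pad>"

    def chunk_back(k):
        return prev_chunks[-k] if len(prev_chunks) >= k else "<pad>"

    return {
        "w_i": words[i],
        "t_i": tag_at(i),
        "t_i-1": tag_at(i - 1),
        "t_i-2": tag_at(i - 2),
        "t_i+1": tag_at(i + 1),
        "t_i+2": tag_at(i + 2),
        "c_i-1": chunk_back(1),  # dynamic: chunk tag of the previous word
        "c_i-2": chunk_back(2),  # dynamic: chunk tag two words back
    }

# "NP" is a placeholder chunk label for the example.
example = chunk_features(["a", "b", "c"], ["NN", "VFM", "NN"], 1, ["NP"])
```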
Experiments: POS tagging
Baseline Model
Maximum Entropy Model
  ME (Bengali, Hindi and Telugu)
  ME + IMA (Bengali)
  ME + CMA (Bengali)
Data Used
Language             Bengali   Hindi    Telugu
Training data        20,396    21,470   21,416
Development data     5,023     5,681    6,098
Test data            5,226     4,924    5,193
No. of POS tags      27        25       25
No. of chunk labels  6         7        6
Tagset and Corpus Ambiguity
Tagset consists of 27 grammatical classes
Corpus Ambiguity
  Mean number of possible tags for each word
  Measured in the training tagged data
Language          Dutch  German  English  French  Bengali  Hindi  Telugu
Corpus Ambiguity  1.11   1.3     1.34     1.69    1.75     1.85   1.70
Accuracy          96%    97%     96.5%    94.5%   ?        ?      ?
Unknown Words     13%    9%      11%      5%      33%      21%    56%
(Dermatas et al 1995)
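Corpus ambiguity as defined here can be computed from the tagged training data in a few lines. The sketch below assumes the mean is taken over word tokens (each occurrence counts once); averaging over distinct word types instead is the other plausible reading, and the slides do not say which was used. The toy corpus is invented for the example.

```python
from collections import defaultdict

def corpus_ambiguity(tagged_tokens):
    """Mean number of possible tags per word token in a tagged corpus.

    tagged_tokens: list of (word, tag) pairs.
    """
    if not tagged_tokens:
        raise ValueError("corpus must not be empty")
    tags_per_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_per_word[word].add(tag)
    # Each token contributes the number of distinct tags seen for its word.
    total = sum(len(tags_per_word[word]) for word, _ in tagged_tokens)
    return total / len(tagged_tokens)

toy_corpus = [("a", "NN"), ("a", "VFM"), ("b", "NN"), ("c", "JJ")]
ambiguity = corpus_ambiguity(toy_corpus)  # (2 + 2 + 1 + 1) / 4
```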
POS Tagging Results on Development Set
[Figure: Overall Accuracy. Tagging accuracy (%) vs. size of the training corpus (1000x words, 5 to 20) for Bengali, Hindi and Telugu]
Language          Bengali  Hindi   Telugu
Corpus Ambiguity  1.75     1.85    1.70
Accuracy          79.74%   83.10%  67.12%
Unknown Words     33%      21%     56%
[Figures: tagging accuracy (%) vs. size of the training corpus (1000x words, 5 to 20) for Bengali, Hindi and Telugu, in three panels: Overall Accuracy, Known Words, Unknown Words]
POS Tagging Results - Bengali
[Figure: tagging accuracy (%) vs. size of the training corpus (1000x words, 5 to 20) for ME, ME + IMA and ME + CMA]
[Figure: bar chart of tagging accuracy (%) on known words and unknown words for ME, ME+IMA and ME+CMA; data labels in extraction order: 89.81, 89.81, 90.14, 68.85, 88.39, 72.45]
Results on Development set
Method     Bengali              Hindi                Telugu
Baseline   58.88                68.93                -
ME         79.74 (89.3, 60.5)   83.10 (90.9, 53.7)   67.82 (82.5, 70.0)
ME + IMA   83.51 (84.2, 82.1)   -                    -
ME + CMA   88.25 (89.3, 86.2)   -                    -
Accuracy in %. Values in parentheses: accuracy on (known words, unknown words).
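A breakdown of accuracy into overall, known-word and unknown-word figures can be produced as in the following sketch. It assumes a word counts as "known" when it occurs in the training vocabulary; the function name and the toy data are hypothetical.

```python
def accuracy_breakdown(train_vocab, test_tokens, predicted_tags):
    """Return accuracy (%) overall, on known words and on unknown words.

    test_tokens: list of (word, gold_tag) pairs.
    A group with no tokens is reported as None.
    """
    counts = {"overall": [0, 0], "known": [0, 0], "unknown": [0, 0]}
    for (word, gold), predicted in zip(test_tokens, predicted_tags):
        group = "known" if word in train_vocab else "unknown"
        for key in ("overall", group):
            counts[key][0] += int(predicted == gold)  # correct
            counts[key][1] += 1                       # total
    return {key: (100.0 * correct / total if total else None)
            for key, (correct, total) in counts.items()}

toy_test = [("a", "NN"), ("b", "VFM"), ("x", "NN"), ("y", "JJ")]
result = accuracy_breakdown({"a", "b"}, toy_test, ["NN", "VFM", "NN", "NN"])
```

Here three of four tokens are correct overall, both known words are correct, and one of the two unknown words is correct.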
Chunking Results
Evaluation Criteria
Two different measures
  Per word basis
  Per chunk basis: correctly identified groups along with correctly labeled groups
Method        Bengali      Hindi        Telugu
Per word basis
ME + I_POS    84.45        79.88        65.92
Per chunk basis
ME + I_POS    87.3, 80.6   74.1, 67.4   69.6, 56.7
ME + C_POS    93.3, 87.7   78.5, 74.4   -
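The two evaluation views can be sketched as follows. The slides do not give the exact formulas, so this example makes several assumptions: chunks are represented as (start, end, label) spans, a chunk counts as "identified" when its boundaries match a predicted chunk, as "labeled" when the label matches as well, and both are reported as a percentage of the gold chunks. The labels NP and VG and the B/I word tags are placeholders.

```python
def chunk_scores(gold_chunks, predicted_chunks):
    """Return (% gold chunks identified, % gold chunks identified and labeled).

    Chunks are (start, end, label) tuples over word positions.
    """
    if not gold_chunks:
        raise ValueError("gold_chunks must not be empty")
    predicted_spans = {(start, end) for start, end, _ in predicted_chunks}
    predicted_full = set(predicted_chunks)
    identified = sum(1 for start, end, _ in gold_chunks
                     if (start, end) in predicted_spans)
    labeled = sum(1 for chunk in gold_chunks if chunk in predicted_full)
    n = len(gold_chunks)
    return 100.0 * identified / n, 100.0 * labeled / n

def per_word_accuracy(gold_tags, predicted_tags):
    """Percentage of words whose chunk tag matches the gold tag."""
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return 100.0 * correct / len(gold_tags)

gold = [(0, 2, "NP"), (2, 3, "VG"), (3, 5, "NP")]
pred = [(0, 2, "NP"), (2, 3, "NP"), (3, 4, "NP"), (4, 5, "NP")]
identified_pct, labeled_pct = chunk_scores(gold, pred)
word_pct = per_word_accuracy(["B", "I", "B", "B", "I"],
                             ["B", "I", "B", "B", "B"])
```

In the toy data two of the three gold chunks have matching boundaries but only one also has the right label, which shows why the second number in each pair can never exceed the first.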
Assessment of Error Types
Bengali
Predicted Class  Actual Class  % of total error  % of class error
NN               NNC           10.4              3.43
NN               JJ            7.9               2.6
NN               NNP           6.0               1.9
VFM              VRB           4.4               5.4
NNP              NNPC          4.4               11.11

Hindi
Predicted Class  Actual Class  % of total error  % of class error
NN               NNP           14.5              10.2
NN               JJ            7.9               5.6
NN               NNC           6.0               4.27
JJ               NN            3.9               14.34
VFM              VAUX          3.1               5.4

Telugu
Predicted Class  Actual Class  % of total error  % of class error
NN               JJ            12.5              9.5
NN               NNP           10.9              8.3
PREP             NLOC          6.1               23.7
NN               RB            4.5               3.4
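An error table of this kind can be derived from gold and predicted tag sequences. The sketch below assumes that "% of total error" is the share of all tagging errors accounted for by one (predicted, actual) pair, and that "% of class error" is that pair's error count relative to the number of tokens of the actual class. The second definition in particular is a guess, since the slides do not define it.

```python
from collections import Counter

def error_table(gold_tags, predicted_tags):
    """Return rows (predicted, actual, % of total error, % of class error),
    most frequent confusion first."""
    confusions = Counter((pred, gold)
                         for gold, pred in zip(gold_tags, predicted_tags)
                         if pred != gold)
    gold_counts = Counter(gold_tags)
    total_errors = sum(confusions.values())
    rows = []
    for (pred, gold), n in confusions.most_common():
        rows.append((pred, gold,
                     100.0 * n / total_errors,       # share of all errors
                     100.0 * n / gold_counts[gold]))  # share of this class
    return rows

gold = ["NN", "NNP", "NNP", "JJ", "NN"]
pred = ["NN", "NN", "NN", "NN", "NN"]
rows = error_table(gold, pred)
```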
Results on Test Set
Bengali data has been tagged using the ME+IMA model
Hindi and Telugu data have been tagged with the simple ME model
Language  Number of Words  POS Tagging Accuracy (%)  Chunking Accuracy (%)
Bengali   5225             77.61                     80.59
Hindi     4924             75.69                     74.92
Telugu    5193             74.47                     68.59
Chunking accuracy has been measured on a per-word basis
Conclusion and Future Scope
Morphological restriction on tags gives an efficient tagging model even when only a small amount of labeled text is available
Performance on Hindi and Telugu can be improved by using morphological analyzers for those languages
Linguistic prefix and suffix information can be adopted
More features can be explored for chunking
Thank You