34
Adding morphological information to a connectionist Part-Of-Speech tagger F. Zamora-Martínez M.J. Castro-Bleda S. España-Boquera S. Tortajada-Velert Departamento de Sistemas Informáticos y Computación Universidad Politécnica de Valencia, Spain Escuela Superior de Enseñanzas Técnicas Universidad CEU-Cadenal Herrera, Alfara del Patriarca, Valencia, Spain 10-12 November 2009, Sevilla F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 1 / 33

Adding morphological information to a connectionist Part-Of-Speech tagger

Embed Size (px)

DESCRIPTION

In this paper, we describe our recent advances on a novel approach to Part-Of-Speech tagging based on neural networks. Multilayer perceptrons are used following corpus-based learning from contextual, lexical and morphological information. The Penn Treebank corpus has been used for the training and evaluation of the tagging system. The results show that the connectionist approach is feasible and comparable with other approaches.

Citation preview

Page 1: Adding morphological information  to a connectionist Part-Of-Speech tagger

Adding morphological information to a connectionistPart-Of-Speech tagger

F. Zamora-Martínez M.J. Castro-Bleda S. España-BoqueraS. Tortajada-Velert

Departamento de Sistemas Informáticos y ComputaciónUniversidad Politécnica de Valencia, Spain

Escuela Superior de Enseñanzas TécnicasUniversidad CEU-Cadenal Herrera, Alfara del Patriarca, Valencia, Spain

10-12 November 2009, Sevilla

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 1 / 33

Page 2: Adding morphological information  to a connectionist Part-Of-Speech tagger

Index

1 POS tagging

2 Probalilistic tagging

3 Connectionist tagging

4 The Penn Treebank Corpus

5 The connectionist POS taggers

6 Conclusions

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 2 / 33

Page 3: Adding morphological information  to a connectionist Part-Of-Speech tagger

Index

1 POS tagging

2 Probalilistic tagging

3 Connectionist tagging

4 The Penn Treebank Corpus

5 The connectionist POS taggers

6 Conclusions

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 3 / 33

Page 4: Adding morphological information  to a connectionist Part-Of-Speech tagger

What is Part-Of-Speech (POS) tagging?

T = τ1, τ2, . . . , τk: a set of POS tagsΩ = ω1, ω2, . . . , ωm: the vocabulary of the application

The goal of a Part-Of-Speech tagger is to associate each word in a textwith its correct lexical-syntactic category (represented by a tag).

ExampleThe grand jury commented on a number of other topicsDT JJ NN VBD IN DT NN IN JJ NNS

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 4 / 33

Page 5: Adding morphological information  to a connectionist Part-Of-Speech tagger

Ambiguity and applications

Words often have more than one POS tag: lowerEurope proposed lower rate increases . . . = JJRTo push the pound even lower . . . = RBR. . . should be able to lower long-term . . . = VB

Ambiguity!!!

Applications: speech synthesis, speech recognition, informationretrieval, word-sense disambiguation, machine translation, ...

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 5 / 33

Page 6: Adding morphological information  to a connectionist Part-Of-Speech tagger

How hard is POS tagging? Measuring ambiguity

Peen Treebank (45-tag corpus)Unambiguous (1 tag) 36,678 (84%)Ambiguous (2-7 tags) 7,088 (16%)Details: 2 tags 5,475

3 tags 1,313 (lower)4 tags 2505 tags 416 tags 77 tags 2 (bet, open)

A simple approach which assigns only the most common tag to eachword performs with 90% accuracy!

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 6 / 33

Page 7: Adding morphological information  to a connectionist Part-Of-Speech tagger

Unknown Words

How can one assign a tag to a given word if that word is unknown tothe tagger?

Unknown words are the hardest problem for POS tagging!

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 7 / 33

Page 8: Adding morphological information  to a connectionist Part-Of-Speech tagger

Index

1 POS tagging

2 Probalilistic tagging

3 Connectionist tagging

4 The Penn Treebank Corpus

5 The connectionist POS taggers

6 Conclusions

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 8 / 33

Page 9: Adding morphological information  to a connectionist Part-Of-Speech tagger

Probabilistic model

We are given a sentence: what is the best sequence of tags whichcorresponds to the sequence of words?

Probabilistic view: Consider all possible sequences of tags and out ofthis universe of sequences, choose the tag sequence which is mostprobable given the observation sequence of words.

tn1 = argmax

tn1

P(tn1 |wn

1 ) = argmaxtn1

P(wn1 |tn

1 )P(tn1 ).

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 9 / 33

Page 10: Adding morphological information  to a connectionist Part-Of-Speech tagger

Probabilistic model: Simplifications

To simplify:1 Words are independent of each other and a word’s identity only

depends on its tag→ lexical probabilities:

P(wn1 |tn

1 ) ≈n∏

i=1

P(wi |ti)

2 Another one establishes that the probability of one tag to appearonly depends on its predecessor tag (bigram, trigram, ...) →contextual probabilities:

P(tn1 ) ≈

n∏i=1

P(ti |ti−1).

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 10 / 33

Page 11: Adding morphological information  to a connectionist Part-Of-Speech tagger

Probabilistic model: Limitations

With these assumptions, a typical probabilistic model is expressed as:

tn1 = argmax

tn1

P(tn1 |wn

1 ) ≈ argmaxtn1

n∏i=1

P(wi |ti)P(ti |ti−1),

where tn1 is the best estimation of POS tags for the given sentence

wn1 = w1w2 . . .wn and considering that P(t1|t0) = 1.

1 It does not model long-distance relationships.2 The contextual information takes into account the context on the

left while the context on the right is not considered.

Both limitations can be overwhelmed using ANNs models.

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 11 / 33

Page 12: Adding morphological information  to a connectionist Part-Of-Speech tagger

Index

1 POS tagging

2 Probalilistic tagging

3 Connectionist tagging

4 The Penn Treebank Corpus

5 The connectionist POS taggers

6 Conclusions

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 12 / 33

Page 13: Adding morphological information  to a connectionist Part-Of-Speech tagger

Basic connectionist model

Europe proposed lower rate increasesNNP VBD ????? NN NNS

MLPs as POS tags classifiers:MLP Input:

lower — wi : the ambiguous input word, loc. cod. → projection layerNNP , VBD, NN, NNS — ci : the tags of the words surrounding theambiguous word to be tagged (past and future context), loc. cod.

MLP Output:the probability of each tag given the input:Pr(JJR|input)=0.6, Pr(RBR|input)=0.2, Pr(VB|input)=0.1, . . .

Therefore, the network learnt the following mapping:

F (wi , ci , ti ,Θ) = PrΘ(ti |wi , ci)

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 13 / 33

Page 14: Adding morphological information  to a connectionist Part-Of-Speech tagger

Morphological extended connectionist model

Europe proposed lower rate increasesNNP-Cap VBD-NCap ????? NN-NCap NNS-NCap

NCap, -er

MLPs as POS tags classifiers:MLP Input:

lower — wi : the ambiguous input word, loc. cod. → projection layerNCap, -er — mi : morph. info related to the amb. input word.NNP-Cap., VBD-NCap, NN-NCap, NNS-NCap — c′

i : the tags of thewords surrounding the ambiguous word to be tagged (past andfuture context) extended with morphological information, loc. cod.

MLP Output:the probability of each tag given the input.

Therefore, the network learnt the following mapping:

F (wi ,mi , c′i , ti ,Θ) = PrΘ(ti |wi ,mi , c′i ),

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 14 / 33

Page 15: Adding morphological information  to a connectionist Part-Of-Speech tagger

And what about Unknown Words?

When evaluating the model, there are words that have never beenseen during training; therefore, they do not belong neither to thevocabulary of known ambiguous words nor to the vocabulary of knownnon-ambiguous words→ “Unknown words”: the hardest problem forthe network to tag correctly.

Proposed solutionA combination of two especialized models:

MLPKnow : the MLP specialized for known ambiguous wordsMLPUnk : the MLP specialized in unknown words

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 15 / 33

Page 16: Adding morphological information  to a connectionist Part-Of-Speech tagger

MLPKnow for known ambiguous words

wi : known ambiguousinput word locallycodified at the input ofthe projection layer

mi : morphological inforelated to the inputambiguous word

Context: two labels ofpast context and onelabel of future context,extended withmorphological info.

FKnow (wi ,mi , c′i , ti ,ΘK ) = PrΘK (ti |wi ,mi , c′i ).

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 16 / 33

Page 17: Adding morphological information  to a connectionist Part-Of-Speech tagger

MLPUnk for unknown words

mi : morphological inforelated to the inputunknown word (thesame that for MLPKnow

si : more specificmorphological inforelated to the inputunknown word(different fromMLPKnow

Context: three labelsof past context andone label of futurecontext, extended withmorphological info.

FUnk (mi , si , c′i , ti ,ΘU) = PrΘU (ti |mi , si , c′i ),

where si corresponds to additional morphological information relatedto the unknown input i-th word.

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 17 / 33

Page 18: Adding morphological information  to a connectionist Part-Of-Speech tagger

Twi table with the POS tags

minutes NNS, NNPSmagnification NNstrikes NNS, VBZsize VBP, NNlayoff NNcohens NNPS. . . . . .

Tminutes = NNS,NNPS Known ambiguous wordTmagnification = NN Known non-ambiguous word

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 18 / 33

Page 19: Adding morphological information  to a connectionist Part-Of-Speech tagger

Final connectionist model

For each posible known word (ambiguous and non-ambiguous) wehave a Twi table with the POS tags observed in training for word wi :

F (wi ,mi , si , c′i , ti ,ΘK ,ΘU) =

0 if ti 6∈ Twi ,

1 if Twi = ti,FKnow (wi ,mi , c′

i , ti ,ΘK ) if wi ∈ Ω′ ∧ ti ∈ Twi ,

FUnk (mi , si , c′i , ti ,ΘU) in other case.

Where Ω′ is the ambiguous words vocabulary.

tn1 = argmax

tn1

Pr(tn1 |wn

1 ) ≈ argmaxtn1

n∏i=1

F (wi ,mi , si , c′i , ti ,ΘK ,ΘU)

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 19 / 33

Page 20: Adding morphological information  to a connectionist Part-Of-Speech tagger

Index

1 POS tagging

2 Probalilistic tagging

3 Connectionist tagging

4 The Penn Treebank Corpus

5 The connectionist POS taggers

6 Conclusions

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 20 / 33

Page 21: Adding morphological information  to a connectionist Part-Of-Speech tagger

The Penn Treebank Corpus

This corpus consists of a set of English texts from the Wall StreetJournal distributed in 25 directories containing 100 files withseveral sentences each one.The total number of words is about one million, being 49 000different.The whole corpus was labeled with POS and synyactic tags.The POS tag labeling consists of a set of 45 different categories.Two more tag were added to take into account the beginning andending of a sentence, thus resulting in a total amount of 47different POS tags.

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 21 / 33

Page 22: Adding morphological information  to a connectionist Part-Of-Speech tagger

The Penn Treebank Corpus: Partitions

Dataset Directory Num. of Num. of Vocabularysentences words size

Training 00-18 38 219 912 344 34 064Tuning 19-21 5 527 131 768 12 389Test 22-24 5 462 129 654 11 548Total 00-24 49 208 1 173 766 38 452

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 22 / 33

Page 23: Adding morphological information  to a connectionist Part-Of-Speech tagger

The Penn Treebank Corpus: Preprocess

Huge corpus with a lot of words in ambiguous vocabulary. Preprocessto reduce the vocabulary:

Ten random partitions from training set of equal size. Words thatappeared just in one partition were considered as unknown words.POS tags appearing in a word less than 1% of its possible tagswere eliminated (tagging errors).

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 23 / 33

Page 24: Adding morphological information  to a connectionist Part-Of-Speech tagger

The Penn Treebank Corpus: Morph. information

Two morphological preprocessing filters:Deleting the prefixes from the composed words (using a set of the125 more common English prefixes). In this way, some unknownwords were converted to known words.

Examplepre-, electro-, tele-, . . .

All the cardinal and ordinal numbers (except “one” and “second”that are polysemic) were replaced with the special token *CD*.

Exampletwenty-years-old ⇒ *CD*-years-oldpost-1987 ⇒ post-*CD*

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 24 / 33

Page 25: Adding morphological information  to a connectionist Part-Of-Speech tagger

The Penn Treebank Corpus: Morph. information

Morphological added to MLPs:Three input units⇒ input word has the first capital letter, all capsor a subset. This is an important morphological characteristic andit was also added to the POS tags of the context (both MLPs).A unit indicating if the word has any dash “-” (both MLPs).A unit indicating if the word has any point “.” (both MLPs).Suffix analysis to deal with unknown words (only MLPUnk ):

Compute the probability distribution of tags for suffixes of lengthless or equal to 10⇒ 709 suffixes found.An agglomerative hierarchical clustering process was followed, anda empirical set of clusters was chosen.Finally, a set of the 21 more common grammatical suffixes wereadded.

MLPUnk needs 209 units for take into account the presence ofsuffixes in words.

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 24 / 33

Page 26: Adding morphological information  to a connectionist Part-Of-Speech tagger

The Penn Treebank Corpus: after preproces

Dataset Num. of words Unambiguous Ambiguous UnknownTraining 912 344 549 272 361 704 1 368Tuning 131 768 77 347 51 292 3 129Test 129 654 75 758 51 315 2 581Total 1 173 766 702 377 464 311 7 078

Vocabulary in Training

6 239 ambiguous words.

25 798 unambiguous words were obtained.

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 25 / 33

Page 27: Adding morphological information  to a connectionist Part-Of-Speech tagger

Index

1 POS tagging

2 Probalilistic tagging

3 Connectionist tagging

4 The Penn Treebank Corpus

5 The connectionist POS taggers

6 Conclusions

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 26 / 33

Page 28: Adding morphological information  to a connectionist Part-Of-Speech tagger

The connectionist POS taggers

Projection layer.Error backpropagation algorithm for training.The topology and parameters of multilayer perceptrons in thetrainings were selected in previous experimentation.For the experiments we have used a toolkit for pattern recognitiontasks developed by our research group.MLPKnow trained with ambiguous vocabulary words.MLPUnk trained with words that appear less than 4 times.

Parameter MLPKnown MLPUnkInput layer size |T + M′|(p + f ) + 50 + |M| |T + M′|(p + f ) + |M|+ |S|Output layer size |T | |T |Projection layer size |Ω′| → 50 –Hidden layer(s) size 100-75 175-100Hidden layer act. func. Hyperbolic TangentOutput layer act. func. SoftmaxLearning rate 0.005Momentum 0.001Weight decay 0.0000001

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 27 / 33

Page 29: Adding morphological information  to a connectionist Part-Of-Speech tagger

Performance on the tuning set

POS tagging error rate for the tuning set varying the context (p is thepast context, and f is the future context).

MLPKnown errorFuture

Past 2 3 4 52 6.30 6.26 6.25 6.313 6.28 6.22 6.20 6.314 6.28 6.27 6.28 6.31

MLPUnk errorFuture

Past 1 2 31 12.56 12.46 12.402 12.27 12.08 12.373 12.59 11.95 12.244 12.72 12.34 12.46

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 28 / 33

Page 30: Adding morphological information  to a connectionist Part-Of-Speech tagger

Test POS tagging performance

POS tagging error rate for the tuning and test sets for the globalsystem. Comparison of our connectionist system with morphologicalinformation versus our previous system without morphologicalinformation.

Partition With morp. info. Without morp. info.Tuning 3.2% 4.2%Test 3.3% 4.3%

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 29 / 33

Page 31: Adding morphological information  to a connectionist Part-Of-Speech tagger

Index

1 POS tagging

2 Probalilistic tagging

3 Connectionist tagging

4 The Penn Treebank Corpus

5 The connectionist POS taggers

6 Conclusions

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 30 / 33

Page 32: Adding morphological information  to a connectionist Part-Of-Speech tagger

Conclusions: Comparison with other tagging systems

POS tagging error rate for the test set. Known refers to thedisambiguation error for known ambiguous words. Unk refers to thePOS tag error for unknown words. Total is the total POS tag error, withambiguous, non-ambiguous, and unknown words.

Model KnownAmb Unknown TotalSVMs 6.1 11.0 2.8MT - 23.5 3.5TnT 7.8 14.1 3.5NetTagger - - 3.8HMM Tagger - - 5.8RANN - - 8.0Our approach 6.7 10.3 3.3

Results comparable with state of the art systems.

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 31 / 33

Page 33: Adding morphological information  to a connectionist Part-Of-Speech tagger

Conclusions: Future works

Increase the amount of morphological information.Test the models in a graph based approach.

Introduce a language model of POS tags to improve the results.

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 32 / 33

Page 34: Adding morphological information  to a connectionist Part-Of-Speech tagger

Thank you!

F. Zamora et al (UPV/CEU-UCH) CAEPIA 2009 10-12 November 2009, Sevilla 33 / 33