
https://helda.helsinki.fi

Part-of-Speech Tagging using Parallel Weighted Finite-State Transducers

Silfverberg, Miikka

2010-08

Silfverberg, M & Linden, K 2010, Part-of-Speech Tagging using Parallel Weighted Finite-State Transducers. In: Proceedings of IceTAL 2010: 7th International Conference on Natural Language Processing, Reykjavik, Iceland, 16/08/2010.

http://hdl.handle.net/10138/29357

Downloaded from Helda, University of Helsinki institutional repository.

This is an electronic reprint of the original article.

This reprint may differ from the original in pagination and typographic detail.

Please cite the original version.


Part-of-Speech Tagging Using Parallel Weighted

Finite-State Transducers

Miikka Silfverberg and Krister Linden

Department of Modern Languages, University of Helsinki

Helsinki, Finland
{miikka.silfverberg,krister.linden}@helsinki.fi

Abstract. We use parallel weighted finite-state transducers to implement a part-of-speech tagger, which obtains state-of-the-art accuracy when used to tag the Europarl corpora for Finnish, Swedish and English. Our system consists of a weighted lexicon and a guesser combined with a bigram model factored into two weighted transducers. We use both lemmas and tag sequences in the bigram model, which guarantees reliable bigram estimates.

Keywords: Weighted Finite-State Transducer, Part-of-Speech Tagging, Markov Model, Europarl.

1 Introduction

Part-of-Speech (POS) taggers play a crucial role in many language applications such as parsers, speech synthesizers, information retrieval systems and translation systems. Systems that need to process a lot of data benefit from fast taggers. It is generally easier to find fast implementations for simple models than for complex ones, so simple models should be preferred when tagging speed is crucial.

We demonstrate that a straightforward first order Markov model is sufficient to obtain state-of-the-art accuracy when tagging the English, Finnish and Swedish Europarl corpora [Koehn 2005]. The corpora were tagged using the Connexor fdg parsers [Jarvinen et al. 2004], and we used the tagged corpora both for training and as a gold standard in testing. Our results indicate that bigram probabilities yield accurate tagging if lemmas are included in POS analyzes.

Our model consists of a weighted lexicon, a guessing mechanism for unknown words, and two bigram models. We analyze each word in a sentence separately using the weighted lexicon and guesser. The analyzes are then combined into one acyclic minimal weighted finite-state transducer (WFST), whose paths correspond to possible POS analyzes of the sentence. The paths in the sentence WFST are re-scored using the bigram models.

The bigram models assign weights to pairs of successive word forms and the corresponding POS analyzes, including lemmas. One of the models assigns weights to POS analyzes of word form bigrams starting at even positions in the sentence, and the other one assigns weights to bigrams starting at odd positions. Both bigram models are implemented as WFSTs.

The sentence WFST and the bigram model WFSTs are combined using weighted intersecting composition [Silfverberg and Linden 2009], which composes the sentence WFST with the simulated intersection of the bigram models. Finally, the POS analysis of the sentence is obtained using a best-paths algorithm [Mohri and Riley 2002]. The WFSTs and the algorithms for parsing were implemented using the open-source transducer library HFST [Linden et al. 2009].

The paper is structured as follows. We first review earlier relevant research in section 2. We then formalize the POS tagging task in section 3 and present our model for a POS tagger as an instance of the general formulation in section 4. In section 5 we demonstrate how to implement the model using WFSTs.

The remainder of the paper deals with training and testing the POS tagger. We present the corpora and parsers used in training and testing in section 6, describe the training of the model in section 7, evaluate the implementation in section 8, and analyze the results of the evaluation and present future research directions in section 9. Lastly, we conclude the paper in section 10.

2 Previous Research

Statistical POS tagging is a common task in natural language applications. POS taggers can be implemented using a variety of statistical models, including Hidden Markov Models (HMM) [Church 1999] [Brants 2000] and Conditional Random Fields [Lafferty et al. 2001].

Markov models are probably the most widely used technique for POS tagging. Some older systems such as [Cutting 1992] used first order models, but the accuracies reported were not very good; e.g. [Cutting 1992] reports an accuracy of 96% for tagging English text. Newer systems like [Brants 2000] have used second order models, which generally lead to better tagging accuracy. [Brants 2000] reports an accuracy of 96.46% for tagging the Penn Treebank. More recent second order models further improve on accuracy: [Collins 2002] reports 97.11% accuracy and [Shen et al. 2007] 97.33% accuracy on the Penn Treebank.

We use lemmas in our bigram model, as did [Thede and Harper 1999], who used lexical probabilities in their second order HMM for tagging English and obtained improved accuracy (96%–97%) w.r.t. a second order model using plain tag sequences. In contrast to this, our model uses only bigram probabilities, and it is not an HMM, since we only use frequency counts of POS analyzes for word pairs. In addition, we split our bigram model into two components, which reduces its size, thus allowing us to use larger training material.

The idea of syntactic parsing and POS tagging using parallel finite-state constraints was outlined by [Koskenniemi 1990]. The general idea in our system is the same, but instead of a rule-based morphological disambiguator, we implement a statistical tagger using WFSTs. Still, hand-crafted tagging constraints could be added to the system.


3 Formulation of the POS Tagging Task

In this section we formulate the task of Part-of-Speech (POS) tagging and describe probabilistic POS taggers formally.

By a sentence, we mean a sequence of syntactic tokens s = (s_1, ..., s_n), and by a POS analysis of the sentence s, we mean a sequence of POS analyzes t = (t_1, ..., t_n). We include lemmas in POS analyzes. For each i, the analysis t_i corresponds to the token s_i in the sentence s. We denote the set of all sentences by S and the set of all analyzes by T.

A POS tagger is a machine which associates each sentence s with its most likely POS analysis t_s. To find the most likely POS analysis for the sentence s, the model estimates the probabilities of all possible analyzes of s using a distribution P. For the sentence s and every possible POS analysis t, the distribution P associates a probability P(t, s). Keeping s fixed, the mapping t ↦ P(t, s) is a normalized probability distribution. The most likely analysis t_s of the sentence s is the analysis which maximizes the probability P(t, s), i.e.

t_s = \arg\max_t P(t, s).

The distribution P can consist of a number of component distributions P_1, ..., P_n, each giving a probability P_i(t, s) for sentence s and analysis t. The component probabilities are combined using some function F: [0, 1]^n → [0, 1] to obtain

P(t, s) = F(P_1(t, s), ..., P_n(t, s)).

The function F should be chosen in such a way that P is nonnegative and satisfies

\sum_{t \in T} P(t, s) = 1

for each sentence s.

Often a convex linear function F is used to combine the estimates given by the component models. In such a case the model P is called a linear interpolation of the models P_i.

4 A Probabilistic First Order Model

In this section we describe the idea behind our POS tagger. We use a bigram model for POS tagging. Thus the probability of a given tagging of a sentence is estimated using analyzes of word pairs.

Since we make use of extensive training material, we may include lemmas in bigrams. Although the training material is extensive, the tagger will still encounter bigrams which did not occur in the training material, or which occurred only once or twice. In such cases we want to use unigram probabilities for estimating the best POS analysis. Hence we weight all analyzes using probabilities given by both the unigram and bigram models, but weight the bigram probabilities heavily while giving the unigram probabilities only a small weight. Thus unigram probabilities become significant only when bigram probabilities are very close to each other.


4.1 The Unigram Model

The unigram model emits plain unigram probabilities p_u(t, s_x) for analyzes t given a word form s_x (we use the index x to signify that p_u(t, s_x) is independent of the context of the word form s_x). Unigram probabilities are readily computed from training material. The probability of the analysis t = (t_1, ..., t_n) given the sentence s = (s_1, ..., s_n) assigned by the unigram model is

P_u(t, s) = \prod_{i=1}^{n} p_u(t_i, s_i).

In practice it is not possible to train the unigram model for all possible word forms in highly inflecting languages with a productive compounding mechanism such as Finnish or Turkish. Instead, the probabilities for analyzes given a word form need to be estimated using probabilities for words with similar suffixes. For instance, if the word form foresaw was not observed during training, we can give it a distribution of analyzes similar to the one the word saw receives, since saw shares a three-letter suffix with foresaw.

In practice such estimation relying on analogy is accomplished by a so-called POS guesser, which seeks words with maximally long suffixes in common with an unknown word. It then assigns probabilities for POS analyzes of the unknown word on the basis of the analyzes of the known words. [Linden 2009a] shows how a guesser can be integrated with a weighted lexicon in a consistent way.
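As an illustration, consider the following minimal Python sketch (ours, not the authors' implementation; the lexicon structure and names are hypothetical) of suffix-based guessing: it assigns an unknown word the analysis distribution of the known words sharing its longest suffix.

    from collections import Counter, defaultdict

    def build_suffix_index(lexicon):
        """lexicon: dict mapping a word form to a Counter of analysis frequencies.
        Index the analysis counts under every suffix of every known word."""
        index = defaultdict(Counter)
        for word, analyses in lexicon.items():
            for i in range(len(word)):
                index[word[i:]].update(analyses)
        return index

    def guess(word, index):
        """Distribution over analyses for an unknown word, based on the known
        words that share its longest possible suffix."""
        for i in range(len(word)):
            counts = index.get(word[i:])
            if counts:
                total = sum(counts.values())
                return {analysis: c / total for analysis, c in counts.items()}
        return {}

    # Hypothetical toy lexicon: "foresaw" inherits the distribution of "saw".
    lexicon = {"saw": Counter({"see+V PAST": 8, "saw+N NOM SG": 2})}
    print(guess("foresaw", build_suffix_index(lexicon)))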

4.2 The Bigram Models

We use two bigram models Q_o and Q_e giving probabilities for bigrams starting at odd and even positions in the sentence, respectively. The estimates are built using plain bigram probabilities for tagging a word pair s_1 and s_2 with analyzes t_1 and t_2, respectively.[1] These probabilities p_b(t_1, s_1, t_2, s_2) are easily computed from a training corpus.

For an analysis t = t_1 ... t_{2k} and a sentence s = s_1 ... s_{2k} of even length 2k, the models Q_o and Q_e give bigram scores

Q_o(t, s) = \prod_{i=1}^{k} p_b(t_{2i-1}, s_{2i-1}, t_{2i}, s_{2i}),   Q_e(t, s) = \prod_{i=1}^{k-1} p_b(t_{2i}, s_{2i}, t_{2i+1}, s_{2i+1}).

For an analysis t = t_1 ... t_{2k+1} and a sentence s = s_1 ... s_{2k+1} of odd length 2k + 1, the models Q_o and Q_e give bigram scores

Q_o(t, s) = \prod_{i=1}^{k} p_b(t_{2i-1}, s_{2i-1}, t_{2i}, s_{2i}),   Q_e(t, s) = \prod_{i=1}^{k} p_b(t_{2i}, s_{2i}, t_{2i+1}, s_{2i+1}).

[1] In the literature, it is often suggested that one should instead compute probabilities of word form bigrams given POS analysis bigrams. We cannot do this, since we include lemmas in POS analyzes. This makes the probability of a word form given a POS analysis either 0 or 1, since most analyzes have only one realization as a word form.
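To make the even/odd factorization concrete, here is a short Python sketch of our own (not taken from the paper) that computes Q_o and Q_e for one candidate analysis, given some bigram probability function p_b:

    from math import prod

    def q_odd(analysis, sentence, p_b):
        """Product of bigram probabilities for bigrams starting at the odd
        positions 1, 3, 5, ... (1-based, as in the text)."""
        return prod(p_b(analysis[i], sentence[i], analysis[i + 1], sentence[i + 1])
                    for i in range(0, len(sentence) - 1, 2))

    def q_even(analysis, sentence, p_b):
        """Product of bigram probabilities for bigrams starting at the even
        positions 2, 4, 6, ..."""
        return prod(p_b(analysis[i], sentence[i], analysis[i + 1], sentence[i + 1])
                    for i in range(1, len(sentence) - 1, 2))

    # Together, q_odd and q_even score every adjacent bigram of the sentence
    # exactly once, which is what allows the two models to be applied in parallel.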


4.3 Combining the Unigram and Bigram Models

The standard way of forming a model from P_u, Q_o and Q_e would be to use linear interpolation. We do not want to do this, since we aim to convert probabilities into penalty weights in the tropical semiring using the mapping p ↦ -log p, which is not compatible with sums. Instead we take a weighted product of powers of the component probabilities. Hence we get the model

P(t, s) = P_u(t, s)^{w_u} Q_o(t, s)^{w_o} Q_e(t, s)^{w_e},

where w_u, w_e and w_o are parameters which need to be estimated. If each of the models P_u, Q_e and Q_o agrees on the probability p of an analysis t given a sentence s, we want P to give the same probability. This is accomplished exactly when w_u + w_e + w_o = 1. There does not seem to be any reason to prefer either of the models Q_e or Q_o, which makes it plausible to assume that w_e = w_o. Hence an implementation of the model only requires estimating two non-negative parameters: the unigram parameter w_u and the bigram parameter w_b. They should satisfy w_u + 2w_b = 1.

It is possible that P(t, s) will not be a normalized distribution when s is kept fixed, but it can easily be normalized by dividing by the factor \sum_t P(t, s). For the present implementation, it is not crucial that P is normalized.

5 Implementing the Statistical Model Using Weighted Finite-State Transducers

We describe the implementation of the POS tagger model using weighted finite-state transducers (WFSTs). We implement each component of the statistical model as a WFST; the WFSTs are trained using corpus data.

In order to speed up computations and prevent roundoff errors, we convert the probabilities p given by the models into penalty weights in the tropical semiring using the transformation p ↦ -log p. In the tropical semiring, the product of probabilities pq translates to the sum of the corresponding penalty weights -log p + (-log q). The kth power of the probability p, namely p^k, translates to a scaling of its weight, -k log p. These observations follow from familiar algebraic rules for logarithms.
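As a minimal sketch of our own (using the parameter values estimated in section 7 purely for illustration), the weighted product of section 4.3 becomes a weighted sum of tropical penalty weights:

    from math import log

    def tropical(p):
        """Convert a probability into a penalty weight in the tropical semiring."""
        return -log(p) if p > 0 else float("inf")

    def combined_weight(p_u, q_o, q_e, w_u=0.1, w_b=0.45):
        """Penalty weight of the combined model P_u^w_u * Q_o^w_b * Q_e^w_b,
        i.e. a sum of scaled penalty weights (with w_u + 2 * w_b = 1)."""
        return w_u * tropical(p_u) + w_b * tropical(q_o) + w_b * tropical(q_e)

    # The analysis with the lowest combined weight is the analysis with the
    # highest combined probability.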

In our system, tagging of sentences is performed in three stages using four different WFSTs. The first two WFSTs, a weighted lexicon and a guesser for unknown words, implement a unigram model. They produce weighted suggestions for analyzes of individual word forms. The latter two WFSTs re-score the suggestions using bigram probabilities. The weights -log p given by the unigram model and the bigram model are scaled by multiplying with a constant in order to prefer analyzes which are strong bigrams. The scaled weights -k log p are then added to give the total score of the input sentence. This corresponds to multiplying the powers p^k of the corresponding probabilities.

In the first stage we use a weighted lexicon, which gives the five best analyzes for each known word form. In initial tests, the correct tagging for a known word could be found among the five best analyzes in over 99% of tagged word forms, so we get sufficient coverage while reducing computational complexity.

For an unknown word x, we use a guesser which estimates the probability of analyzes using the probabilities for analyzes of known words. We find the set of known word forms W whose words share the longest possible suffix with the word form x. We then determine the five best analyzes for the unknown word form x by finding the five best analyzes for the words in the set W.

For each word s_i in a sentence s = s_1 ... s_n, we form a WFST W_i which is a disjunction of its five best analyzes t_1, ..., t_5 according to the weights w(s_i, t_i) given by the unigram model. In case there are fewer than five analyzes for a word, we take as many as there are. We then compute a weighted concatenation W_s of the individual WFSTs W_i. The transducer W_s is the disjunction of all POS analyzes of the sentence s in which each word receives one of its five best analyzes given by the unigram model.
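The construction of W_s can be pictured as a simple lattice. The sketch below is ours and uses plain Python lists instead of actual WFSTs; it merely builds the per-word candidate sets that the sentence transducer encodes:

    def five_best(word, unigram_weights):
        """Up to five (analysis, weight) pairs for a word, sorted by tropical
        weight (lower is better). unigram_weights is a hypothetical structure
        mapping a word form to a dict of analysis -> weight."""
        candidates = unigram_weights.get(word, {})
        return sorted(candidates.items(), key=lambda item: item[1])[:5]

    def sentence_lattice(sentence, unigram_weights):
        """One candidate list per token; conceptually the concatenation of the
        per-word disjunctions W_1 ... W_n described above."""
        return [(word, five_best(word, unigram_weights)) for word in sentence]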

To re-score the analysis suggestions given by the lexicon and the guesser, we use two WFSTs whose combined effect gives the bigram weighting for the sentence. One of the models scores bigrams starting at even positions in the sentence and the other one scores bigrams starting at odd positions. Thus we give a score to all bigrams in the sentence without having to compute a WFST equivalent to the intersection of the models, which might be quite large.

Using weighted intersecting composition [Silfverberg and Linden 2009] we simultaneously apply both bigram scoring WFSTs to the sentence WFST W_s. The POS analysis of the sentence s is the best path of the result of the composition.

The WFSTs and algorithms for parsing were implemented using the Helsinki Finite-State Technology (HFST) interface [Linden et al. 2009].

We now describe the lexicon, guesser and the bigram WFSTs in more detail.

5.1 The Weighted Lexicon

Using a tagged corpus, we form a weighted lexicon L which rewrites word forms to their lemmas and analyzes. POS analyzes for a word form s_i are weighted according to their frequencies, which are transformed into tropical weights.
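A weighted lexicon of this kind could be estimated roughly as follows (a sketch of ours; it turns relative frequencies into tropical weights but ignores smoothing and the actual transducer encoding):

    from collections import Counter, defaultdict
    from math import log

    def build_weighted_lexicon(tagged_corpus):
        """tagged_corpus: iterable of (word_form, analysis) pairs, where the
        analysis string contains the lemma and the POS tags.
        Returns a dict: word form -> {analysis: tropical weight}."""
        counts = defaultdict(Counter)
        for word, analysis in tagged_corpus:
            counts[word][analysis] += 1
        lexicon = {}
        for word, analyses in counts.items():
            total = sum(analyses.values())
            lexicon[word] = {a: -log(c / total) for a, c in analyses.items()}
        return lexicon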

In order to estimate the weights for words which were not seen in the training corpus, we construct a guesser. For an unknown word, the guesser will try to construct a series of analyzes relying on information about the analyzes of known similar words.

Figure 1 shows an example guesser, which can be constructed from a reversed weighted lexicon. Guessing begins at the end of the word. We allow guessing at a particular analysis for a word only if the word has a suffix agreeing with the analysis. See [Linden 2009a] for more information on guessers.

Fig. 1. Guesser constructed from a weighted lexicon (state diagram not reproduced here). Guessing starts at the end of a word. Skipping letters gives a high penalty, and analyzes where equally many letters are skipped are weighted according to the frequency of the analyzes.

5.2 The Bigram Models

To re-score analyzes given by the unigram model, we use two WFSTs whose combination serves as a bigram model. The first one, B_e, scores each known word form/analysis bigram s_{2k}, s_{2k+1} and t_1, t_2 in the sentence starting at an even position 2k, according to the maximum likelihood estimate of the tag bigram t_1 t_2 w.r.t. the word form bigram s_{2k} s_{2k+1}. The WFST B_o is similar to B_e except that it weights bigrams starting at odd positions s_{2k-1} s_{2k}.

Given a word form pair s_1, s_2, we compute the probability P(t_1, s_1, t_2, s_2) for each POS analysis pair t_1, t_2. These probabilities sum to 1 when s_1 and s_2 remain fixed. Then we form a transducer B whose paths transform word form pairs s_1 s_2 into analysis pairs t_1 t_2 with weight -log P(t_1, s_1, t_2, s_2). Lastly we disjunct B with a default bigram, which transforms arbitrary word form sequences into arbitrary analyzes with a penalty weight greater than the penalty received by all other transformations.
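A rough Python analogue of the transducer B and its default bigram (our sketch; the penalty value is hypothetical and only needs to exceed every estimated weight) is a table of bigram weights with a fallback:

    from collections import Counter, defaultdict
    from math import log

    DEFAULT_PENALTY = 50.0  # hypothetical; must exceed all estimated weights

    def build_bigram_table(tagged_sentences):
        """tagged_sentences: iterable of lists of (word_form, analysis) pairs.
        Returns (word1, word2) -> {(analysis1, analysis2): tropical weight}."""
        counts = defaultdict(Counter)
        for sent in tagged_sentences:
            for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
                counts[(w1, w2)][(t1, t2)] += 1
        table = {}
        for pair, analyses in counts.items():
            total = sum(analyses.values())
            table[pair] = {ts: -log(c / total) for ts, c in analyses.items()}
        return table

    def bigram_weight(table, w1, t1, w2, t2):
        """Weight of an analysis bigram, falling back to the default bigram."""
        return table.get((w1, w2), {}).get((t1, t2), DEFAULT_PENALTY)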

In addition to the model B, we also compute a general word model W, which transforms an arbitrary sequence of symbols into an arbitrary lemma and analysis. The word model W is used to skip words at the beginning and end of sentences.

From the transducers above, we form the models B_e and B_o using the weighted finite-state operations

B_e = W B* W^{0,1}   and   B_o = B* W^{0,1}.

Here W^{0,1} signifies an optional instance of W.

Fig. 2. A small example of an even bigram model B_e (state diagram not reproduced here). ? signifies an arbitrary symbol and <?> signifies an arbitrary POS analysis symbol.


5.3 Parsing Using Weighted Intersecting Composition

In our system, parsing a sentence S is in principle equivalent to finding the best path of the transducer

(S ∘ L) ∘ (B_e ∩ B_o).

Since the intersection of B_o and B_e could become prohibitively large, we instead use intersecting composition [Silfverberg and Linden 2009] to simulate the intersection of B_e and B_o during composition with the unigram-tagged sentence S ∘ L.

Intersecting composition is an operation first used in compiling two-level grammars [Karttunen 1994]. We use a weighted version of the operation.

After the intersecting composition, we extract the best path from the resulting transducer. This is the tagged sentence.
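Outside the WFST framework, the combined effect of the composition and the best-path search can be emulated with a small Viterbi-style dynamic program over the five-best lattice. The sketch below is ours and is not the HFST implementation; it scores every adjacent bigram directly, which is what the even and odd models achieve together:

    def best_path(lattice, bigram_weight, w_u=0.1, w_b=0.45):
        """lattice: list of (word, [(analysis, unigram_weight), ...]) per token,
        e.g. as built by sentence_lattice above. bigram_weight(w1, t1, w2, t2)
        returns a tropical weight. Returns the lowest-weight analysis sequence."""
        word, cands = lattice[0]
        best = {t: (w_u * w, [t]) for t, w in cands}
        prev_word = word
        for word, cands in lattice[1:]:
            new_best = {}
            for t2, w2 in cands:
                options = [
                    (score + w_u * w2
                     + w_b * bigram_weight(prev_word, t1, word, t2), path + [t2])
                    for t1, (score, path) in best.items()
                ]
                new_best[t2] = min(options, key=lambda o: o[0])
            best, prev_word = new_best, word
        return min(best.values(), key=lambda o: o[0])[1]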

6 Data

In this section we describe the data used for testing and training the POS tagger. For testing and training, we used the Europarl parallel corpus [Koehn 2005]. The Europarl parallel corpus is a collection of proceedings of the European Parliament in eleven European languages. The corpus has markup to identify the speaker and some HTML markup, which we removed to produce a file in raw text format. We used the Finnish, English and Swedish corpora. Since the training and testing materials are the same for all three languages, the results we obtain for the different languages are comparable.

We parsed the Europarl corpora using the Connexor functional dependency parsers fi-fdg for Finnish, sv-fdg for Swedish and en-fdg for English [Jarvinen et al. 2004]. From the parses of the corpora we extracted word forms, lemmas and POS tags. For training and testing, we preserved the original tokenization of the fdg parsers and removed prop tags marking proper nouns, abbr tags marking abbreviations and heur tags marking guesses made by the fdg parser. The tag sequence counts in Table 1 represent the number of tag sequences after the abbr, prop and heur tags were removed.
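Tag filtering of this kind might look as follows; this is our sketch, and the exact fdg output format is an assumption (here: a whitespace-separated tag list per token):

    REMOVED_TAGS = {"prop", "abbr", "heur"}

    def strip_tags(analysis):
        """Drop the prop, abbr and heur tags from a whitespace-separated tag
        string (assumed format), keeping the remaining tags in order."""
        return " ".join(tag for tag in analysis.split() if tag not in REMOVED_TAGS)

    # Hypothetical example: a proper-noun reading loses only its prop tag.
    print(strip_tags("n prop nom sg"))  # -> "n nom sg"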

Table 1. Some figures describing the test and training material for the POS tagger

Language   Syntactic tokens   Sentences   POS tag sequences
English    43 million         1 million   122
Finnish    25 million         1 million   2194
Swedish    38 million         1 million   243

Table 1 describes the data used in training and testing the POS tagger. We see that the fi-fdg parser for Finnish emitted more than ten times as many tag sequences as sv-fdg for Swedish or en-fdg for English. The en-fdg parser emitted clearly the fewest tag sequences.


7 Training the Model

We now describe training the model, which consists of two phases. In the first phase we build the weighted lexicon, the guesser and the bigram models. In the second phase we experimentally estimate the coefficients w_u and w_b which maximize the accuracy of the interpolated model

P(t, s) = P_u(t, s)^{w_u} Q_o(t, s)^{w_b} Q_e(t, s)^{w_b}.

Using a small material covering 1000 syntactic tokens, we estimated w_u = 0.1 and w_b = 0.45. This shows that it is beneficial to weight the bigram model heavily, which seems natural, since bigrams provide more information than unigrams.
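The estimation itself can be done with a small grid search over the constraint w_u + 2w_b = 1. The sketch below is ours; the tagging function and the development data are placeholders assumed to be supplied by the surrounding system:

    def estimate_weights(tag_sentence, dev_sentences, steps=20):
        """Pick the w_u (and w_b = (1 - w_u) / 2) maximizing tagging accuracy.
        tag_sentence(sentence, w_u, w_b) returns a predicted analysis sequence;
        dev_sentences is a list of (sentence, gold_analyses) pairs."""
        best_accuracy, best_weights = -1.0, None
        for i in range(steps + 1):
            w_u = i / steps
            w_b = (1.0 - w_u) / 2.0
            correct = total = 0
            for sentence, gold in dev_sentences:
                predicted = tag_sentence(sentence, w_u, w_b)
                correct += sum(p == g for p, g in zip(predicted, gold))
                total += len(gold)
            accuracy = correct / total
            if accuracy > best_accuracy:
                best_accuracy, best_weights = accuracy, (w_u, w_b)
        return best_weights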

Fig. 3. The accuracy of the English POS tagger as a function of the size of the training data (y-axis: accuracy in %, x-axis: size of the training corpus, 10^n sentences; plot not reproduced here). We used between 10^2 and 10^6 sentences for training. The lower curve displays the accuracy using only the unigram model, whilst the upper curve displays the accuracy of the combined unigram and bigram model.

Figure 3 shows learning curves for the English-language POS tagger using 10^2 to 10^6 sentences for training. The lower curve displays accuracies for the unigram model and the upper curve shows the accuracy of the combined unigram and bigram model. For the unigram model, we can see that little improvement is obtained by increasing the training data beyond 10^4 sentences. In contrast, there is significant improvement (≈ 0.82%) for the bigram model even when we move from 10^5 to 10^6 sentences.


8 Evaluation

We describe the methods we used to evaluate the POS tagger and the results we obtained.

We used ten-fold cross-validation to evaluate the POS tagger; that is, we split the training material into ten equally sized parts and used nine parts for training the model and the remaining part for testing. Varying the tenth used for testing, we trained ten POS taggers for each language.

For each of the languages we trained two sets of taggers. One set used only unigram probabilities for assigning POS tags. The other used both unigram and bigram probabilities. We may consider the unigram taggers as a baseline.

For each of the three languages, we computed the average and standard deviation of the accuracy of the unigram and bigram taggers. In addition, we computed the Wilcoxon matched-pairs signed-ranks test for the bigram and unigram accuracies in all three languages. The test does not assume that the data is normally distributed (unlike the paired t-test). The results of our tests can be seen in Table 2.
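An evaluation of this shape is easy to reproduce; the sketch below is ours and assumes matched per-fold accuracy lists, using scipy's implementation of the Wilcoxon matched-pairs signed-ranks test:

    from statistics import mean, stdev
    from scipy.stats import wilcoxon

    def summarize(unigram_acc, bigram_acc):
        """unigram_acc, bigram_acc: accuracies of the ten cross-validation folds
        for one language, as matched pairs (same test fold at the same index)."""
        statistic, p_value = wilcoxon(unigram_acc, bigram_acc)
        return {
            "unigram mean": mean(unigram_acc), "unigram sd": stdev(unigram_acc),
            "bigram mean": mean(bigram_acc), "bigram sd": stdev(bigram_acc),
            "difference": mean(bigram_acc) - mean(unigram_acc),
            "wilcoxon p": p_value,
        }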

Table 2. Average accuracies and standard deviations for the POS taggers for Finnish, English and Swedish. The sixth column shows the improvement that results from adding the bigram model. The seventh column shows the result of the Wilcoxon matched-pairs signed-ranks test between the unigram and bigram accuracies.

Language   Unigram Acc.   σ      Bigram Acc.   σ      Diff.    Conf.
English    93.10%         0.09   98.29%        0.01   5.19%    ≥ 99.8%
Finnish    94.38%         0.07   96.63%        0.03   2.25%    ≥ 99.8%
Swedish    94.12%         0.20   97.31%        0.11   3.19%    ≥ 99.6%

9 Discussion and Future Work

It is interesting to see that a bigram tagger can perform equally well as or better than trigram taggers, at least on certain text genres. The mean accuracy of 98.29% we obtained for tagging the English Europarl corpus is exceptionally high (for example, [Shen et al. 2007] report a 97.33% accuracy on tagging the Penn Treebank). The improvement of 5.19 percentage points from the unigram model to the combined unigram and bigram model is also impressive. There is also a clear improvement for Finnish and Swedish when the bigram model is used in tagging, and accuracy for these languages is also high. We had problems finding accuracy figures for statistical taggers of Finnish, but for Swedish [Megyesi 2001] reports accuracies between 94% and 96%, which means that we obtain state-of-the-art accuracy for Swedish.

Of course, the Europarl corpus is probably more homogeneous than the Penn Treebank or the Brown Corpus, both of which include texts from a variety of genres. Furthermore, tagging is easier because the en-fdg parser only emits 122 different POS analyzes. Still, Europarl texts represent an important genre, because the EU is constantly producing written material which needs to be translated into all official languages of the union.

The accuracy for Finnish shows less improvement than for English and Swedish. We believe this is a result of the fact that Finnish words carry a lot of information, but the bonds between words in sentences may be quite weak. This conclusion is supported by the fact that the unigram accuracy for Finnish is the best of all three languages.

We do not believe that using trigram statistics would bring much improvement for Finnish. Instead, we would like to write a set of linguistic rules which would cover the most typically occurring tagging errors. In particular, we would like to try out constraints which would mark certain analyzes as illegal in some contexts. Such negative information is hard to learn using statistical methods. Still, it may be very useful, so it could be provided by hand-crafted rules.

Clearly, our figures for accuracy need to be considered in relation to the tagging accuracy of the fdg parsers. We did not succeed in finding a study on the POS tagging accuracy of the fdg parsers. Instead, we examined the POS tagging of one word per twenty thousand in the first tenth of the Europarl corpora for Finnish, English and Swedish. This amounted to 131 examined words for Finnish, 219 examined words for English and 191 examined words for Swedish. According to these tests, the POS tagging accuracy of the fdg parsers is 95.4% for Finnish, 97.3% for English and 97.5% for Swedish.

10 Conclusions

We introduced a model for a statistical POS tagger using bigram statistics with lemmas included. We showed how the tagger can be implemented using WFSTs. We also demonstrated a new way to factor a first order model into a model tagging bigrams at even positions in the sentence and another model tagging bigrams at odd positions.

In order to test our model, we implemented POS taggers for Finnish, English and Swedish, training and evaluating them using the Europarl corpora in the respective languages and the Connexor fdg parsers.

We obtained a clear, statistically significant improvement for all three languages when compared to the baseline unigram tagger. At least for English and Swedish, we obtain state-of-the-art accuracy.

Acknowledgements. We thank the anonymous referees. We also want to thank our colleagues in the HFST team. The first author is funded by the Langnet Graduate School for Language Studies.

References

[Brants 2000] Brants, T.: TnT – A Statistical Part-of-Speech Tagger. In: ANLP 2000 (2000)

[Church 1999] Church, K.: A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In: Proceedings of the Second Conference on Applied Natural Language Processing (1988)

[Collins 2002] Collins, M.: Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In: EMNLP (2002)

[Cutting 1992] Cutting, D., Kupiec, J., Pedersen, J., Sibun, P.: A Practical Part-of-Speech Tagger. In: Proceedings of the Third Conference on Applied Natural Language Processing (1992)

[Jarvinen et al. 2004] Jarvinen, T., Laari, M., Lahtinen, T., Paajanen, S., Paljakka, P., Soininen, M., Tapanainen, P.: Robust Language Analysis Components for Practical Applications. In: Gamback, B., Jokinen, K. (eds.) Robust and Adaptive Information Processing for Mobile Speech Interfaces (2004)

[Karttunen 1994] Karttunen, L.: Constructing Lexical Transducers. In: COLING 1994, pp. 406–411 (1994)

[Koehn 2005] Koehn, P.: Europarl: A Parallel Corpus for Statistical Machine Translation. In: Machine Translation Summit X, Phuket, Thailand, pp. 79–86 (2005)

[Koskenniemi 1990] Koskenniemi, K.: Finite-state parsing and disambiguation. In: 13th COLING (1990)

[Lafferty et al. 2001] Lafferty, J., McCallum, A., Pereira, F.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: ICML 2001 (2001)

[Linden et al. 2009] Linden, K., Silfverberg, M., Pirinen, T.: Hfst Tools for Morphology – an Efficient Open-Source Package for Construction of Morphological Analyzers. In: Mahlow, C., Piotrowski, M. (eds.) SFCM 2009. LNCS, vol. 41, pp. 28–47. Springer, Heidelberg (2009)

[Linden 2009a] Linden, K.: Entry Generation by Analogy Encoding New Words for Morphological Lexicons. NEJLT, vol. 1 (2009)

[Linden 2009b] Linden, K.: Guessers for Finite-State Transducer Lexicons. In: Gelbukh, A. (ed.) CICLing 2009. LNCS, vol. 5449, pp. 158–169. Springer, Heidelberg (2009)

[Megyesi 2001] Megyesi, B.: Comparing data-driven learning algorithms for POS tagging of Swedish. In: EMNLP 2001 (2001)

[Mohri and Riley 2002] Mohri, M., Riley, M.: An Efficient Algorithm for the n-Best-Strings Problem. In: ICSLP 2002 (2002)

[Mikheev 1997] Mikheev, A.: Automatic Rule Induction for Unknown-Word Guessing. In: CL, vol. 23 (1997)

[Shen et al. 2007] Shen, L., Satta, G., Joshi, A.: Guided Learning for Bidirectional Sequence Classification. In: ACL 2007 (2007)

[Silfverberg and Linden 2009] Silfverberg, M., Linden, K.: Conflict Resolution Using Weighted Rules in HFST-TwolC. In: NODALIDA 2009 (2009)

[Thede and Harper 1999] Thede, S., Harper, M.: A Second-Order Hidden Markov Model for Part-of-Speech Tagging. In: 37th ACL (1999)