

Pattern Recognition Letters 33 (2012) 62–70

Contents lists available at SciVerse ScienceDirect

Pattern Recognition Letters

journal homepage: www.elsevier.com/locate/patrec

Improving Korean verb–verb morphological disambiguation using lexical knowledge from unambiguous unlabeled data and selective web counts

Seonho Kim a,⇑, Juntae Yoon b,⇑, Jungyun Seo a,⇑, Seog Park a,⇑
a Department of Computer Science, Sogang University, Seoul, Republic of Korea
b Daumsoft Inc., Se-Ah Venture Tower, Seoul, Republic of Korea


Article history:
Received 18 June 2010
Available online 12 September 2011
Communicated by R.C. Guido

Keywords:
POS tagging
Verb–verb morphological disambiguation
Unlabeled corpora
Automatic annotation
Web counts
Hard example-based selective sampling

0167-8655/$ - see front matter © 2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.patrec.2011.09.003

⇑ Corresponding authors. Tel.: +82 2 6091 8532; fax: +82 2 706 8954 (S. Kim).
E-mail addresses: [email protected] (S. Kim), [email protected] (J. Yoon), [email protected] (J. Seo), [email protected] (S. Park).

This paper deals with verb–verb morphological disambiguation of two different verbs that have the same inflected form. Verb–verb morphological ambiguity (VVMA) is one of the critical issues in Korean part-of-speech (POS) tagging. Recognizing the verb base form behind an ambiguous word depends heavily on the lexical information in its surrounding context and on the domain in which it occurs. However, current probabilistic morpheme-based POS tagging systems cannot handle VVMA adequately: most of them are limited in how much word-level context they can reflect, and they are trained on too little labeled data to represent the lexical information required for VVMA disambiguation.

In this study, we suggest a classifier based on a large pool of raw text that contains sufficient lexical information to handle VVMA. The underlying idea is that we automatically generate an annotated training set for an ambiguity problem such as VVMA resolution from unlabeled unambiguous instances that belong to the same class. This makes it possible to label ambiguous instances with knowledge induced from unambiguous instances. Since an unambiguous instance has only one possible label, its annotated corpus can be generated automatically from unlabeled data.

In our problem, since not all conjugations of irregular verbs undergo the spelling changes that cause VVMA, training data for VVMA disambiguation are generated from instances of unambiguous conjugations of each possible verb base form of the ambiguous words. This approach requires neither a manual annotation process for an initial training set nor a selection process for good seeds with which to iteratively augment the labeled set, both of which are important issues in bootstrapping methods that use unlabeled data. This is a strength over previous related work using unlabeled data. Furthermore, plenty of confident seeds that are unambiguous and give enough coverage for the learning process are assured as well.

We also suggest a strategy to extend the context information incrementally with web counts, applied only to selected test examples that are difficult to predict with the current classifier or that differ greatly from the pre-trained data set.

As a result, automatic data generation and knowledge acquisition from unlabeled text for VVMA resolution improved the overall token-level tagging accuracy by 0.04%. In practice, 9–10% of verb-related tagging errors are fixed by VVMA resolution, whose accuracy was about 98% using a Naïve Bayes classifier coupled with selective web counts.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Machine learning methods based on probabilistic models trained on corpora have been successfully applied to various natural language processing problems. For instance, current state-of-the-art statistical part-of-speech (POS) taggers achieve almost 97% token-level accuracy regardless of the language (Lee and Rim, 2009; Shen et al., 2007; Toutanova et al., 2003; Tsuruoka and Tsujii, 2005). However, sentence-level accuracy has been reported at only about 56%, a much lower score. Thus, the remaining 3% of token errors are still critical in practical systems that require more accurate and high-quality language processing.

The tagging involved with NN(noun)/RB(adverb), NN(noun)/NNP(proper noun), NNS(noun, plural)/VBZ(verb, 3rd person singular present), NN(noun)/VB(verb, base form), VBD(verb, past tense)/VBN(verb, past participle), VBP(verb, non-3rd person singular present)/VBD(verb, past tense), and IN(preposition)/WDT(wh-determiner) in

Table 1
Examples of various word forms.

Word        Morphological analysis
gatda       ga(가, go) + eoss(었, past ending) + da(다, declarative ending)
gasieotda   ga(가, go) + sieoss(시었, honorific past ending) + da(다, declarative ending)
ganda       ga(가, go) + n(ㄴ, present ending) + da(다, declarative ending)
ganeun      ga(가, go) + neun(는, adnominal ending)
gagi        ga(가, go) + gi(기, nominal ending)
gago        ga(가, go) + go(고, conjunctive ending)


English and the verb–verb morphological ambiguity (VVMA) in Korean are examples of such errors.

In fact, these problems are hard to address with the local features used by current state-of-the-art sequence-model taggers, since they require various kinds of linguistic knowledge: syntax or semantics, broad contextual knowledge, or multi-sentence discourse context (Manning, 2011). For instance, selecting the tag of "record" (NN/VB) needs much broader lexical context for correct tagging. In practice, we may need another type of lexical context and a knowledge acquisition method in order to correct the confusions mentioned above.

For these reasons, the various kinds of linguistic knowledge and features required by different error types are hard to combine in a single tagging framework. Thus, we need to classify errors by type and devise an appropriate knowledge acquisition process for each error type, coupled with classifier combination or semi-supervised learning, rather than concentrate on improving tagging performance within a single framework.

As an example of processing a specific error type, we consider here Korean verb–verb morphological ambiguity (VVMA), where two different verbs can have the same inflected form. VVMA in Korean denotes ambiguity with respect to base forms: one word can be morphologically analyzed into different verb base forms, as described in detail in the next section. For example, the word saneun1 is morphologically ambiguous in terms of its base form, since it can be a conjugation of both the verbs sa (buy) and sal (live). VVMA resolution is the task of finding the correct verb base form, between sa and sal, for the word saneun.

In Korean, since verbs with such VVMA cause semantic ambiguities, the problem is closely related to sense disambiguation, except that it must be handled at the morphological analysis stage. Thus, determining the correct base form requires broad contextual knowledge that is beyond a sequence tagger with local features. However, it is hard for current state-of-the-art morpheme-based Korean statistical taggers to reflect broad contextual information at the word level, which makes it difficult to handle VVMA. In fact, VVMA errors account for about 9–10% of verb-related tagging errors in Korean. Since the word forms exhibiting VVMA are polysemous and occur frequently in text, although there are not many base form types themselves, as shown in Table 2, their resolution is relatively important in practical applications.

In this work, we suggest a methodology for Korean VVMA resolution that can draw on a large number of confident seeds from unlabeled text. The underlying idea is that we automatically generate an annotated training set for an ambiguity problem such as VVMA resolution by using unlabeled unambiguous data in the same class. We can classify ambiguous instances with knowledge derived from other unambiguous instances in the same class, based on the characteristics of Korean verbs with VVMA.

In practice, studies using unlabeled data to boost learning performance or to reduce annotation effort have been performed continuously, since performance improvements can be obtained with only a small amount of training material, so-called seed data (Abney, 2008; Blum and Mitchell, 1998; Chan and Ng, 2007; Steedman et al., 2003; Suzuki and Isozaki, 2008; Yarowsky, 1995). Because the quality of the seeds affects the final performance as well as the initial precision, the selection of good seeds and learning from a small initial data set are important issues in boosting methods that use unlabeled text (Dasgupta and Ng, 2007; Lee and Lee, 2007). In general, seeds have to be unambiguous, independent of domain shift, and to show enough coverage for the learning

1 That is, saneun is analyzed as two different morphological combinations, i.e. sa(buy) + neun(adnominal ending) and sal(live) + neun(adnominal ending).

2 The lemma "deul" has many meanings, such as enter/go, hold/take, raise/lift, give/quote, eat/drink, cut/be sharp, clear (up)/become clear, take/cost, and be dyed/tinged.

process. However, in most cases it is not easy to select a large amount of such reliable seeds or to construct useful unlabeled examples for training.

Therefore, in order to obtain reliable training data, we recast the VVMA problem as another classification problem that can be trained on automatically constructed annotated instances, so that automatic generation of annotated data and self-training are possible without applying any iterative method. That is, it is possible to map the problem onto different unambiguous instances of the same view in unlabeled data.

In our work, unambiguous conjugations of each base morpheme that is possible for an ambiguous word can be used as seeds for base form disambiguation, since not all conjugations of irregular verbs undergo the spelling changes that cause VVMA. The contexts of unambiguous conjugations in unlabeled text are used to train the classifier for VVMA resolution. We describe unambiguous conjugations in more detail in the next section.

Accordingly, this approach requires no initial annotation of seeds, and the correct classes of unlabeled data can be identified in advance, since an unambiguous word in unlabeled data carries only one base form. Furthermore, plenty of reliable seeds (unambiguous words) are assured for training VVMA resolution classifiers. As a result, it is possible to extract lexical information for classification from raw text as accurately as from a tagged corpus.

In addition, in order to adapt the VVMA classifier dynamically to vocabulary variation and new domains, we suggest a strategy that selectively extends the range of training examples by using web data. It adds to the training set only those examples that are difficult to predict with the current classifier or that differ greatly from the pre-trained data set. For this, confidence and diversity values are measured for each test example.
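The selection criterion just described can be sketched as follows. This is a hedged illustration only: the function name, the diversity measure (fraction of unseen features), and the thresholds are our own stand-ins, not the paper's exact formulation.

```python
# Hypothetical sketch of "hard example" selection: an example is escalated
# to the (expensive) web-count step only if the classifier's confidence is
# low or its features overlap little with the training vocabulary.

def is_hard_example(posterior, features, train_vocab,
                    conf_threshold=0.7, diversity_threshold=0.5):
    """Return True if the example should be re-scored with web counts."""
    confidence = max(posterior.values())           # P(best class | example)
    seen = sum(1 for f in features if f in train_vocab)
    diversity = 1.0 - seen / len(features)         # fraction of unseen features
    return confidence < conf_threshold or diversity > diversity_threshold

# An uncertain prediction (0.55 vs. 0.45) counts as hard even when every
# feature was observed during training.
print(is_hard_example({"deud": 0.55, "deul": 0.45},
                      ["sori", "reul", "keun"], {"sori", "reul", "keun"}))
```

Confident predictions over familiar features are kept as-is, so the web is queried only for the small residue of genuinely difficult cases.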

As a result, we achieved about 98% accuracy for VVMA resolution by using a Naïve Bayes classifier coupled with selective web counts. The overall POS tagging accuracy also improved by about 0.04% through this resolution. The proposed approach, which transforms an ambiguous problem into an unambiguous one, could be extended to non-Korean languages and to other ambiguity problems, as long as the ambiguous problem can be represented via other unambiguous instances in the same class.

2. VVMA problems in current POS tagging models

In Korean, which is an agglutinative and inflectional language, a word, called an eojeol, is a sequence of morphemes consisting of a lexical morpheme and functional morphemes. For example, the three morphemes deul(들),2 eot(었, past ending), and da(다, assertive ending) form the word (eojeol) deuleotda(들었다), which is a past assertive form of the verb deul(들). In linguistics, this modification of a verb by inflection is called conjugation. Table 1 shows some verb inflections with various Korean verb endings.
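The eojeol formation just described can be illustrated with a toy concatenation, using romanized strings as stand-ins for Hangul. This deliberately ignores the phonological changes discussed below, which is exactly what makes the irregular cases interesting.

```python
# Toy illustration of eojeol formation: a lexical morpheme plus functional
# morphemes concatenate into one word form. Romanized stand-ins for Hangul;
# phonological variation (the source of VVMA) is ignored here.
def form_eojeol(morphemes):
    """Concatenate a lexical morpheme and its functional morphemes."""
    return "".join(morphemes)

# deul (verb) + eot (past ending) + da (assertive ending)
print(form_eojeol(["deul", "eot", "da"]))  # deuleotda
```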


Table 2
Unambiguous conjugation seed patterns for each morpheme.

Ambiguous words: deuleot, deuleo, deuleotgo, deuleoseo, deuleoya, ...
  deud(듣): deud(hear, listen, follow, work)/VB + eot(었, past ending)/PE
  deul(들): deul(hold, raise, give, cut, take, eat, clear, cost, be dyed, join, catch, feel, contain, suffer, break into, accommodate, ripe)/VB + eot/PE
  Unambiguous conjugations: deudgo(deud + go)/deulgo(deul + go), deudneun(deud + neun)/deuneun(deul + neun), ...

Ambiguous words: muleot, muleo, muleotgo, muleoya, muleoseo, ...
  mud(묻): mud(ask, blame, charge, bury, be stained)/VB + eot/PE
  mul(물): mul(bite, pay, spoil, hold, stick to)/VB + eot/PE
  Unambiguous conjugations: mudgo(mud + go)/muleo(mud + eo), mulgo(mul + go)/muneun(mul + neun), ...

Ambiguous words: geoleot, geoleo, geoleotgo, geoleoseo, geoleoya, ...
  geod(걷): geod(roll, walk, gather, live, remove, go out)/VB + eot/PE
  geol(걸): geol(hang, bet, risk, speak, call, expect, trip, sue, cast, set)/VB + eot/PE
  Unambiguous conjugations: geodgo/geolgo, geodge/geolge, geodjiman/geoljiman, geodneunge/geoneunge, ...

Ambiguous words: gileot, gileo, gileoya, gileoseo, ...
  gid(긷): gid(draw, pump)/VB + eot/PE
  gil(길): gil(long)/VB + eot/PE
  Unambiguous conjugations: gidge/gilge, gidgo/gilgo, gidneunji/gileunji, gidneundago/gildago, ...

Ambiguous words: ggeuneun, ggeundago, ggeundamyeon, ggeundaneun, ggeuji, ...
  ggeu(끄): ggeu(put out, turn out, extinguish, break, repay)/VB + neun(는, adnominal ending)/ENTR1
  ggeul(끌): ggeul(drag, lead, attract, prolong, install, pull out)/VB + neun/ENTR1
  Unambiguous conjugations: ggeuge/ggeulge, ggeugo/ggeulgo, ggedorok/ggeuldorok, ggeoya/ggeuleoya, ...

Ambiguous words: gganeun, ggandago, ggandaneun, ggandamyeon, ...
  gga(까): gga(peel, hatch, deduct, strike)/VB + neun/ENTR1
  ggal(깔): ggal(spread out, lower)/VB + neun/ENTR1
  Unambiguous conjugations: ggage/ggalge, ggago/ggalgo, ggajiman/ggaljiman, ggaya/ggalaya, ...

Ambiguous words: sseuneun, sseul, sseundaneun, sseuni, sseulji, ...
  sseu(쓰): sseu(use, write, employ, hire, spend, speak, wear, bitter)/VB + neun
  sseul(쓸): sseul(sweep)/VB + neun
  Unambiguous conjugations: sseugo/sseulgo, sseuge/sseulge, ssedorok/sseuldorok, sseujiman/sseuljiman, ...

Ambiguous words: paneun, pandago, pandaneun, paji, paneunde, ...
  pa(파): pa(dig, carve, remove, investigate)/VB + neun
  pal(팔): pal(sell, turn away, trade on)/VB + neun
  Unambiguous conjugations: pago/palgo, page/palge, pamyeonseo/palmyeonseo, paryeogo/palryeogo, ...

Ambiguous words: naneun, nani, nanigga, nabnida, nandaneun, ...
  na(나): na(grow, be born, sprout, occur, produce, come out, smell)/VB + neun
  nal(날): nal(fly, soar)/VB + neun
  Unambiguous conjugations: nalgo/nago, nage/nalge, nadorok/naldorok, nareogo/nalreogo, ...

Ambiguous words: saneun, saneunde, sal, sandaneun, salji, ...
  sa(사): sa(buy, get, praise, value, hire, incur)/VB + neun
  sal(살): sal(live, lead, serve, vivid)/VB + neun
  Unambiguous conjugations: sago/salgo, sage/salge, saya/salaya, samyeonseo/salmyeonseo, sadorok/saldorok, ...

Ambiguous words: juneun, jundago, julgi, juneunji, juneunde, ...
  ju(주): ju(give, present)/VB + neun
  jul(줄): jul(decrease, diminish, drop)/VB + neun
  Unambiguous conjugations: jueo(ju + eo)/juleo(jul + eo), jugo(ju + go)/julgo(jul + go), ...


Korean has both irregular and regular verbal conjugations. In an irregular conjugation, the spelling is changed by a phonological variation when the base form meets a morpheme with a specific sound; in a regular conjugation, there is no spelling change. For instance, the Korean verb deud(듣)3 is conjugated irregularly: its spelling deud(듣) changes into deul(들) when it is combined with the morpheme eot(었, past ending) or eo(어, conjunctive ending). Thus, the word composed of the morphemes deud(듣), eot(었, past ending), and da(다, assertive ending) is not deudeotda(듣었다) but deuleotda(들었다). As a result, the word deuleotda(들었다) is ambiguous, since it can be the conjugation of either base form, deul(들) or deud(듣), and it can be analyzed into two different morpheme sequences: "deud(듣) + eot(었) + da(다)" and "deul(들) + eot(었) + da(다)". Similarly, conjugations such as deuleo(들어), deuleotgo(들었고), deuleoseo(들어서), deuleun(들으니), deuleuryeogo(들으려고), deuleumeo(들으며), and deuleoya(들어야) contain morphological ambiguities in terms of their base forms. However, there is no spelling change or phonological variation when the verb base form deud(듣) combines with the morpheme ji(지); in that case, the word deudji(듣지) is formed as is.
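The ambiguity pattern above can be made concrete with a small lookup. The mapping is hand-built from the paper's examples purely for illustration; it is not a resource from the authors' system.

```python
# Hand-built sketch: each ambiguous surface stem (romanized) maps to its
# possible base-form analyses from the paper's examples. Anything not in
# the table is treated as unambiguous and returned as-is.
AMBIGUOUS_STEMS = {
    "deuleot": ["deud + eot", "deul + eot"],  # 듣/들 + 었
    "muleot":  ["mud + eot",  "mul + eot"],   # 묻/물 + 었
    "saneun":  ["sa + neun",  "sal + neun"],  # 사/살 + 는
}

def candidate_analyses(surface):
    """Return the possible morpheme sequences for a word form."""
    return AMBIGUOUS_STEMS.get(surface, [surface])

print(candidate_analyses("deuleot"))  # ['deud + eot', 'deul + eot']
print(candidate_analyses("deudji"))   # ['deudji'] (no spelling change, unambiguous)
```

VVMA resolution is then the problem of choosing one analysis from such a candidate list using context.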

Such base-form morphological ambiguity is a somewhat different phenomenon from English irregular verbs, whose base forms can be recognized with lists or rules. This work treats the Korean verb–verb morphological ambiguity (VVMA) issue, in which a verbal conjugation is ambiguous in its base form. VVMA has not been handled properly by previous statistical morphological analyzers and POS taggers, even though the word forms containing VVMA appear very frequently in Korean text. Moreover, it is recognized as a difficult problem, since it involves sense disambiguation at the morphological analysis level.

In general, the morphological complexity of Korean poses a challenge to morphological analysis and POS tagging. Since extensive morphological variants of a single base form are possible,

3 The lemma "듣(deud)" indicates different senses of a word, such as hear/listen, follow/obey, learn, receive/suffer, attend, take effect/work, and drip/drop.

Korean statistical tagging models need much larger annotated corpora to reflect the variants sufficiently. Thus, morpheme-based bigram or trigram models (Lee et al., 2002; Lee and Rim, 2009) are normally preferred over eojeol (word)-based models because of the data sparseness problem. Although previous statistical taggers have achieved good results, they fail to capture important contextual clues necessary for VVMA resolution. In particular, a VVMA resolution model should be able to take contexts at the word level into account, but most tagging models are limited in reflecting such a broad context. At most, the previous one or two morphemes are considered when determining the POS of the current morpheme. Nonetheless, it is still hard to extend the window size of morpheme-based tagging models, or to adopt eojeol (word)-based tagging models, due to the sparse-data problem.

Table 3 shows how the VVMA problem is processed by a general statistical POS tagger. In the sentence,4 the word surrounding deuleotda(들었다), namely sorireul(소리를), can be a clue for choosing the correct base form deud(듣). In this case, the verb base form depends on the context of its neighboring words, particularly the previous content morpheme, sori(소리, sound).

In general, the optimal POS tags t_{1..n} for a word sequence under a first-order Markov model are estimated as follows:

\hat{t}_{1..n} = \arg\max_{t_{1..n}} p(t_{1..n} \mid w_{1..n}) = \arg\max_{t_{1..n}} p(w_{1..n} \mid t_{1..n})\, p(t_{1..n})

p(w_{1..n} \mid t_{1..n})\, p(t_{1..n}) = \prod_{i=1}^{n} p(w_i \mid t_i)\, p(t_i \mid t_{i-1})

\hat{t}_{1..n} = \arg\max_{t_{1..n}} \prod_{i=1}^{n} p(w_i \mid t_i)\, p(t_i \mid t_{i-1})
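The first-order model just formulated can be sketched in a few lines: a tag sequence is scored by the product of emission and transition terms, and tagging picks the argmax over sequences. We use brute-force enumeration for clarity (a real tagger would use Viterbi), and the probabilities are toy numbers, not estimates from any corpus.

```python
# Minimal first-order HMM-style scoring: score(tags) = prod over i of
# p(w_i|t_i) * p(t_i|t_{i-1}); unseen pairs get a tiny floor probability.
from itertools import product

def sequence_score(words, tags, emit, trans, floor=1e-9):
    score = 1.0
    prev = "<s>"  # sentence-start pseudo-tag
    for w, t in zip(words, tags):
        score *= emit.get((w, t), floor) * trans.get((prev, t), floor)
        prev = t
    return score

def best_tags(words, tagset, emit, trans):
    """Brute-force argmax over all tag sequences (exponential; demo only)."""
    return max(product(tagset, repeat=len(words)),
               key=lambda tags: sequence_score(words, tags, emit, trans))

emit = {("sori", "NN"): 0.4, ("reul", "PPCA"): 0.9, ("deuleotda", "VB"): 0.2}
trans = {("<s>", "NN"): 0.5, ("NN", "PPCA"): 0.6, ("PPCA", "VB"): 0.7}
print(best_tags(["sori", "reul", "deuleotda"], ["NN", "PPCA", "VB"], emit, trans))
```

Note that this model chooses among tags, not among base forms of the same tag, which is exactly the limitation the paper identifies for VVMA.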

The HMM (Hidden Markov Model) is widely adopted to assign the most likely POS tags to an observed sequence of words. The HMM tagging procedure is formulated with transition probabilities,

4 The sentence "keun sorireul deuleotda(큰 소리를 들었다)" can be analyzed as "keu(크, big/loud) + n(ㄴ, adnominal ending) sori(소리, sound) + reul(를, object case particle) deud(듣, hear) + eot(었, past tense) + da(다, assertive ending)".

Table 3
Previous statistical taggers.

Surface form: keun(큰) sorireul(소리를) deuleotda(들었다)
Possible morphological analyses:
a. sori(소리, sound)/NN + reul(를, object particle)/PPCA2; deud(듣, hear)/VB + eot(었, past ending)/PE + da(다, assertive ending)/ENTE
b. sori(sound)/NN + reul(object particle)/PPCA2; deul(들, hold)/VB + eot(past ending)/PE + da/ENTE

Model forms:
Bigram hidden Markov: \prod_i p(t_{i-1} | t_i) p(m_i | t_i)
Trigram hidden Markov: \prod_i p(t_i | t_{i-2}, t_{i-1}) p(m_i | t_{i-1}, t_i)
Lexicalized bigram hidden Markov: \prod_i p(t_{i-1} | t_i, m_{i-1}) p(m_i | t_i, m_{i-1})
ME, CRF, MRF: \prod_i p(t_i | h_i)

Terms deciding between the two analyses:
Bigram HMM: p(PPCA|VB) p(deud|VB) vs. p(PPCA|VB) p(deul|VB)
Trigram HMM: p(VB|NN,PPCA) p(deud|PPCA,VB) vs. p(VB|NN,PPCA) p(deul|PPCA,VB)
Lexicalized bigram HMM: p(PPCA|VB,reul) p(deud|VB,reul) vs. p(PPCA|VB,reul) p(deul|VB,reul)
ME, CRF, MRF: f_j = 1 if m_{i-2} = sori and m_i = deud and t_i = VB, 0 otherwise


p(t_i|t_{i-1}), and emission probabilities, p(w_i|t_i). However, in the first-order HMM tagger shown in Table 3, the final decision is determined by the lexical probabilities p(deul|VB) and p(deud|VB), because the transition probabilities are the same. Likewise, the result of a lexicalized bigram HMM is decided by the probabilities p(deul|PPCA, VB) and p(deud|PPCA, VB) (Lee et al., 2000).

As a result, in classical HMM-based models, the base form of an ambiguous word is determined only by whichever verb base form occurs more frequently in the annotated corpora, regardless of the context of neighboring words. It is thus difficult to capture sufficient word-based contextual information in morpheme-based statistical models.

To design a model that captures wide lexical information, maximum entropy (ME) taggers can be adopted (Curran and Clark, 2003; Ratnaparkhi et al., 1996). They can combine wide-range word-context information as specialized features more flexibly than HMM-based models. However, typical annotated corpora are insufficient to train existing statistical taggers based on supervised learning methods. Most Korean statistical taggers are trained on annotated corpora of 1.5–2.0 million words (eojeols), so lexical contextual information often cannot be covered well using current supervised learning methods such as HMM, CRF, and ME (Han and Palmer, 2005; Lee and Rim, 2009). Furthermore, a considerable number of annotations of the base forms of VVMA words in the annotated corpora are clearly wrong.

Thus, in this work we suggest a classifier for VVMA disambiguation based on the current POS tagger and a large pool of unlabeled raw text. Raw text is easily accessible on the web or via other electronic resources, and the joint contextual probability distribution over base forms and neighboring context words can be estimated from it. In particular, we exploit the fact that accurate information can be extracted even from unlabeled text in this specific VVMA task. Since not all conjugations of irregular verbs lead to ambiguous forms, we can use the surrounding contexts of unambiguous conjugations of each possible verb morpheme of the ambiguous words as training data. In other words, raw text can be treated as an annotated corpus. For example, in disambiguating the word form deuleot(들었), which can be analyzed into deud/VB + eot/PPE5 or deul/VB + eot/PPE, the contexts of unambiguous conjugations such as deulgo(들고)6 are utilized as training examples for the label deul(들). Likewise, contexts of deudgo(듣고)7 or deudneun(듣는)8 can be used to train the label deud(듣). In this problem, a supervised learning scheme can be carried out even with unlabeled raw corpora. Consequently, we improved POS tagging performance through VVMA disambiguation based on the current POS tagger and unlabeled text. Moreover, this system can easily adapt to a domain shift, since feature extension can proceed with raw corpora.

5 Past tense pre-ending.
6 deul(들) + go(고, conjunctive ending).
7 deud(듣) + go(고, conjunctive ending).
8 deud(듣) + neun(는, adnominal ending).
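The treat-raw-text-as-annotated idea can be sketched as follows. The pattern table here is a tiny romanized stand-in for the Hangul patterns of Table 2, and the tokenization is simplified; it illustrates the principle, not the authors' pipeline.

```python
# Sketch of automatic annotation: sentences containing an unambiguous
# conjugation (deudgo can only come from deud; deulgo only from deul) are
# harvested from raw text and labeled with that base form, so no manual
# annotation is needed.
UNAMBIGUOUS = {"deudgo": "deud", "deudneun": "deud",
               "deulgo": "deul", "deuneun": "deul"}

def harvest(sentences):
    """Label each raw sentence by the base form of any unambiguous
    conjugation it contains; the remaining tokens become the context."""
    data = []
    for sent in sentences:
        tokens = sent.split()
        for token in tokens:
            label = UNAMBIGUOUS.get(token)
            if label:
                context = [w for w in tokens if w != token]
                data.append((context, label))
    return data

corpus = ["sorireul deudgo", "soneul deulgo itda"]
print(harvest(corpus))  # [(['sorireul'], 'deud'), (['soneul', 'itda'], 'deul')]
```

The harvested (context, base form) pairs can then train a classifier that is applied to the ambiguous forms such as deuleot.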

In the remainder of this paper, we describe the context features needed for VVMA resolution and briefly introduce the learning methods that we tested in Section 3. We then define hard examples, and the efficient augmentation of context features for these hard examples using web counts is explained in Section 4. Finally, we present experimental results and show how VVMA disambiguation helps POS tagging, and we conclude the paper by summarizing its contributions and indicating areas of future work.

3. Learning strategy and features for VVMA classifier

Methods that exploit unlabeled data to boost the performance of learning models and reduce annotation effort have been studied continuously. For example, semi-supervised learning methods such as co-training, self-training, and active learning utilize unlabeled data to build a more accurate classifier (Abney, 2008; Blum and Mitchell, 1998; Nigam and Ghani, 2000; Suzuki and Isozaki, 2008). Typically, a semi-supervised classifier labels unlabeled data using a classifier trained on a small set of labeled data, and the newly labeled data are then combined with the pre-labeled data to retrain the classifier.

Unsupervised learning first sets up a small number of training examples that represent each label with seeds, labels unlabeled data by applying the classifier trained on the seed data, and adds to the training set those classified examples whose precision exceeds a threshold (Poon et al., 2009; Yarowsky, 1995). The initial seeds are often selected from words that appear in a gazetteer or that show reliable collocation relationships with a target class. As a result, the training set is incrementally augmented and the classifier is iteratively retrained on the new training set.
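The iterative bootstrapping loop just described (which the proposed one-shot seed construction avoids) can be sketched generically. `train_fn` and `predict_fn` are placeholders for any classifier API, not a specific library, and the toy demonstration below is ours.

```python
# Generic self-training: label unlabeled data with the current classifier,
# keep only predictions above a confidence threshold, retrain, and repeat.
def self_train(train_fn, predict_fn, labeled, unlabeled,
               threshold=0.9, rounds=3):
    model = train_fn(labeled)
    for _ in range(rounds):
        newly, rest = [], []
        for x in unlabeled:
            label, prob = predict_fn(model, x)
            (newly if prob >= threshold else rest).append((x, label))
        if not newly:
            break                      # nothing confident enough; stop
        labeled = labeled + newly      # augment the training set
        unlabeled = [x for x, _ in rest]
        model = train_fn(labeled)      # retrain on the augmented set
    return model, labeled

# Toy demonstration: a "classifier" that predicts the majority label and is
# always confident, so all unlabeled items are absorbed in one round.
def toy_train(data):
    labels = [y for _, y in data]
    return max(set(labels), key=labels.count)

def toy_predict(model, x):
    return model, 0.95

model, final = self_train(toy_train, toy_predict, [(1, "a"), (2, "a")], [3, 4])
print(model, len(final))  # a 4
```

The key contrast with this paper: here seed quality and the confidence threshold govern whether errors compound across rounds, whereas unambiguous-conjugation seeds are correct by construction and need no iteration.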

EM (expectation maximization)-based unsupervised POS tagging starts by setting parameters from a standard tag dictionary and word/tag pairs derived from a corpus. Thus, EM-based unsupervised tagging accuracy varies depending on the dictionary employed. In POS tagging, EM has performed worse than other techniques (Ravi and Knight, 2009).

We also employ unlabeled text to address the VVMA problem, but we use raw text just like labeled text. Sufficient training data can be constructed automatically from raw text if we use unambiguous conjugations as reliable category seeds. Since the correct classes of the unlabeled data are identified in advance, we avoid both a preliminary classification step for annotating training data and the initial misclassifications introduced by wrong or insufficient seeds. This is a major difference between this work and other bootstrapping approaches, such as semi-supervised or unsupervised learning, that employ unlabeled data (Abney, 2008; Blum and Mitchell, 1998; Nigam and Ghani, 2000; Poon et al., 2009; Suzuki and Isozaki, 2008; Yarowsky, 1995).

Since the availability of good training data for a classifier is stillthe main bottleneck in many applications, this is a meaningful

Table 4
Features extracted from seed examples.

Content word of word1 | Content word of word2 | Word1 | Word2 | Case particle of word1 | Case particle of word2 | Class (output)
gejagbi(production cost) | mani(many) | gejagbi + ga(subject case particle) | mani | ga(subject case particle) | - | deul
na(my) | yyegi(story) | na + eui(adnominal case particle) | yyegi + reul(object particle) | eui(adnominal case particle) | reul(object case) | deud
got(same) | iyagi(story) | got + eun(adnominal ending) | iyagi + reul(object particle) | - | reul(object case) | deud
jum(aspect) | iyoo(reason) | jum + eul(object case particle) | iyoo + ro(adverbial case) | eul(object case particle) | ro(adverbial case) | deul
geupgyeoghi(rapidly) | jul(reduce) | geupgyeoghi | jul + eo(auxiliary particle) | - | - | deul


approach to find out a way to utilize unlabeled data efficiently. Inthis work, we found a specific domain to apply confident seedswhich contributes to a learning performance comparable to thatof supervised learning.

In our work, the training examples are automatically generated from raw corpora simply by specifying unambiguous verbal ending patterns related to ambiguous base forms. We deal with 11 pairs of morphologically ambiguous verb forms that commonly occur in text, as shown in Table 2. In general, VVMA is closely related to verb sense, and the nearby content words provide strong and consistent clues to the sense of a target word, since words strongly tend to exhibit only one sense in a given context or discourse (Yarowsky, 1995). For the disambiguation of deud(듣) and deul(들), contexts of unambiguous verbal conjugations such as deudgo(듣+고) and deulgo(들+고) are used as training data to distinguish deud and deul. To extract the context features for learning a base form, sentences including unambiguous conjugations of each base form are first identified, and these sentences are then tagged with the current POS tagger. To design a classifier for the VVMA resolution, the two words preceding a target word and their content morphemes are used as lexical features, and the postpositions (case particles) of the two preceding words are used as syntactic features. In Korean, postpositions generally mark the role of nouns, such as subject, object, or complement, in a sentence or clause. For example, the postpositions eul(을) and reul(를) mark object cases, and i(이) or ga(가) mark subject cases. This information can thus signal verb subcategorization or a structural relationship between a verb and its arguments, which is helpful in distinguishing ambiguous base forms.

Table 4 shows the context features of a number of training examples used to distinguish the base forms deud(듣) and deul(들). With a set of features extracted from unlabeled data, a classifier is trained to disambiguate the two possible base forms. We test various classifiers based on machine learning methods. The classifiers perform binary classifications, such as the deud(듣)/deul(들) or sa(사)/sal(살) distinctions shown in Table 2, since every Korean VVMA word is analyzed into only one of two base morphemes.
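To make the generation step concrete, the following sketch harvests labeled examples from POS-tagged sentences containing unambiguous conjugations. The romanized strings, the triple-based sentence representation, and the helper names (`UNAMBIGUOUS`, `extract_examples`) are our own illustrative assumptions, not the authors' implementation.

```python
# Sketch of automatic training-data generation from unambiguous conjugations.
# Romanized forms and the toy sentence are illustrative, not the paper's data.

# Unambiguous conjugations mapped to the base form (class) they prove.
UNAMBIGUOUS = {
    "deudgo": "deud",   # deud(hear) + go -> only analyzable as 'deud'
    "deulgo": "deul",   # deul(hold) + go -> only analyzable as 'deul'
}

def extract_examples(tagged_sentence):
    """tagged_sentence: list of (word, content_morpheme, postposition) triples,
    as produced by the current POS tagger.  Returns (features, label) pairs."""
    examples = []
    for i, (word, _, _) in enumerate(tagged_sentence):
        if word in UNAMBIGUOUS and i >= 2:
            w2, cm2, p2 = tagged_sentence[i - 2]
            w1, cm1, p1 = tagged_sentence[i - 1]
            # Lexical features: previous two words and their content morphemes;
            # syntactic features: their postpositions (case particles).
            features = {"w-2": w2, "w-1": w1, "cm-2": cm2, "cm-1": cm1,
                        "p-2": p2, "p-1": p1}
            examples.append((features, UNAMBIGUOUS[word]))
    return examples

# Toy sentence: "na+eui yyegi+reul deudgo" ('listening to my story').
sent = [("naeui", "na", "eui"), ("yyegireul", "yyegi", "reul"),
        ("deudgo", "deud", None)]
for feats, label in extract_examples(sent):
    print(label, feats["cm-1"], feats["p-1"])
```

Every occurrence of an unambiguous conjugation thus yields one labeled example at no annotation cost, which is the core of the approach.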

4. VVMA classifiers

4.1. Classification methods

We now briefly introduce the learning schemes that we experimented with. First, the Naïve Bayesian classifier (McCallum and Nigam, 1998; Pedersen, 2000) was tested for the VVMA disambiguation. It assumes that the presence or absence of a particular feature for a specific class is unrelated to the presence of other features; that is, the words of a text are treated as generated independently of their context. We can represent it as follows:

p(y_j|x) = p(y_j) p(x_1, x_2, ..., x_n | y_j) / p(x) = (p(y_j)/p(x)) \prod_{i=1}^{n} p(x_i|y_j) \propto p(y_j) \prod_{i=1}^{n} p(x_i|y_j)    (1)

In the equation, x is a set of context features, such as those in Table 4, and y_j is a specific base form (class) to be learned. In spite of their naive design and apparently over-simplified assumptions, naive Bayesian classifiers have worked well in many complex real-world situations (Pedersen, 2000).

Second, we tested a boosting method that combines many simple weak rules (classifiers) and produces a single highly accurate classification rule from the weak classifiers (Freund et al., 2003). Because weak classifiers may give incorrect predictions, the final classification is based on the weighted votes of the weak classifiers. At each iteration, the base learner concentrates on examples that are particularly difficult to classify and that yield high prediction errors. The learner is controlled by modifying the distribution over the training examples: the distribution is iteratively updated by placing more weight on incorrectly classified training examples and less weight on correctly classified ones, and weak classifiers are then sequentially applied to the modified data distribution.

In this problem, we define a weak hypothesis h_t(x_i) according to whether or not a specific context feature x_t exists in the context of an example x_i. Here, y_i denotes one of the two possible base forms as output. The weak hypothesis simply decides the output y_i as 1 or -1 depending on the presence of a specific feature. The value of the function I(y_n \neq h_t(x_n)) is 1 if the label of a target morpheme and the output of the hypothesis (classifier) differ. The algorithm is as follows:

Input:
  N examples {(x_1, y_1), ..., (x_N, y_N)}
  L: a learning algorithm generating hypothesis h_t(x)
  T: maximum number of hypotheses in the ensemble
Initialize:
  d_n^t: the weight (distribution) of example n at iteration t
Do for t = 1, ..., T:
  1. Train the base learner according to the example distribution d^t and obtain hypothesis h_t(x)
  2. Compute the weighted error of the weak hypothesis: e_t = \sum_n d_n^t I(y_n \neq h_t(x_n))
  3. Compute the hypothesis weight: a_t = (1/2) ln((1 - e_t)/e_t)
  4. Update the example distribution: d_n^{t+1} = d_n^t exp(-a_t y_n h_t(x_n)) / Z_t, where Z_t = \sum_n d_n^t exp(-a_t y_n h_t(x_n)) is a normalization factor
Output:
  Final hypothesis: f_ensemble(x) = \sum_t a_t h_t(x)
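The boosting loop above can be sketched as follows. The feature-presence stumps with both polarities, the toy data, and the tie-breaking rule are our own simplifications of the paper's weak learner, shown only to make the weight updates concrete.

```python
import math

def stump(feature, polarity):
    """Weak hypothesis h_t: polarity if `feature` occurs in the context, else -polarity."""
    return lambda x: polarity if feature in x else -polarity

def adaboost(examples, features, T=5):
    """examples: [(feature_set, y)] with y in {+1, -1}; returns [(a_t, h_t)]."""
    N = len(examples)
    d = [1.0 / N] * N                          # uniform initial distribution
    ensemble = []
    for _ in range(T):
        # 1. choose the stump with the smallest weighted error e_t
        best_h, best_e = None, None
        for f in features:
            for pol in (1, -1):
                h = stump(f, pol)
                e = sum(w for w, (x, y) in zip(d, examples) if h(x) != y)
                if best_e is None or e < best_e:
                    best_h, best_e = h, e
        if best_e >= 0.5:                      # no better than chance: stop
            break
        # 3. hypothesis weight a_t = (1/2) ln((1 - e_t) / e_t)
        a = 0.5 * math.log((1 - best_e) / max(best_e, 1e-12))
        ensemble.append((a, best_h))
        # 4. reweight examples: d <- d * exp(-a_t * y_n * h_t(x_n)) / Z_t
        d = [w * math.exp(-a * y * best_h(x)) for w, (x, y) in zip(d, examples)]
        z = sum(d)
        d = [w / z for w in d]
    return ensemble

def classify(ensemble, x):
    """f_ensemble(x) = sign(sum_t a_t h_t(x))."""
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# Toy data: +1 ~ deud(hear)-like contexts, -1 ~ deul(hold)-like contexts.
data = [({"sori"}, 1), ({"yyegi"}, 1), ({"ga"}, -1), ({"eul"}, -1)]
ens = adaboost(data, ["sori", "yyegi", "ga", "eul"])
print(all(classify(ens, x) == y for x, y in data))
```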

In this way, the boosting algorithm focuses on the more informative or difficult examples. It is easy to implement, though it may lead to worse classification results than the original classifier if the data is too noisy.

Table 5. Google counts.

Word | Total counts | Counts (within a date range)
ggago(까고) | 492,000 | 70,500
ggalgo(깔고) | 1,200,000 | 122,000
gganeun(까는) | 1,020,000 | 156,000
sago(사고) | 37,000,000 | 246,000
salgo(살고) | 9,200,000 | 247,000
saneun(사는) | 21,500,000 | 248,000

9 http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html.

Third, we tested a support vector machine (SVM), which is known to handle large feature sets robustly and to provide high generalization performance in many NLP areas (Steinwart and Christmann, 2008; Vapnik, 2006). For a given finite set of learning patterns, an SVM constructs the optimal hyperplane that separates the set of positive examples from the set of negative examples with a maximum distance (margin) between the hyperplane and the support vectors. Here, the support vectors denote the training examples closest to the hyperplane. Assume N observation pairs (x_i, y_i), where x_i \in R^n is an input context feature vector and y_i is its associated base-form output label. The goal of the SVM is to find a hyperplane w \cdot x + b = 0, where the weight vector w is normal to the hyperplane and b/||w|| is the perpendicular distance from the hyperplane to the origin. The margin between the support vectors and the hyperplane is then 2/||w||, and the SVM finds the hyperplane with the largest margin by minimizing (1/2)||w||^2, subject to the constraints y_i(<x_i \cdot w> + b) - 1 >= 0 for i = 1, ..., N. By introducing Lagrange multipliers a_i, the solution can be found by maximizing L_D:

L_D = \sum_{i=1}^{N} a_i - (1/2) \sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j y_i y_j <x_i \cdot x_j>,  s.t. \sum_{i=1}^{N} a_i y_i = 0, a_i >= 0    (2)

For the SVM learning, the LibSVM toolkit was used after constructing a feature vector to represent the context of each example (Fan et al., 2005).
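Before an SVM toolkit can be applied, the symbolic context features must be encoded as numeric vectors. A possible sketch of this encoding, emitting the sparse "label index:value" text format accepted by LibSVM-style tools, is shown below; the feature strings and helper names are invented for illustration.

```python
def build_vocab(examples):
    """Assign a fixed index to every distinct context feature."""
    vocab = {}
    for feats, _ in examples:
        for f in feats:
            vocab.setdefault(f, len(vocab) + 1)   # LibSVM indices start at 1
    return vocab

def to_libsvm_line(feats, label, vocab):
    """Encode one example as a sparse 'label idx:1 idx:1 ...' line."""
    idxs = sorted(vocab[f] for f in feats if f in vocab)
    return " ".join([str(label)] + ["%d:1" % i for i in idxs])

# Toy binary examples: +1 ~ deud(hear)-like, -1 ~ deul(hold)-like contexts.
data = [(["sori", "reul"], 1), (["gejagbi", "ga"], -1)]
vocab = build_vocab(data)
for feats, y in data:
    print(to_libsvm_line(feats, y, vocab))
```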

Finally, we tested the maximum entropy (ME) framework, whose goal is to maximize the entropy of a distribution subject to certain constraints. It learns a conditional probability model that predicts the output y, one of two possible verb base forms, for a given context x of the current word w_i. The model is defined as follows:

p_\lambda(y|x) = (1/Z_\lambda(x)) exp(\sum_{j=1}^{n} \lambda_j f_j(x, y))    (3)

where f_j(x, y) denotes a binary-valued feature function, \lambda_j is the weighting parameter of f_j(x, y), n is the total number of features, and Z_\lambda(x) denotes a normalization factor. The context x is defined as a sequence of words (eojeols), content morphemes, and postpositions that precede the current word: x = {w_{i-2}, w_{i-1}, cm_{i-2}, cm_{i-1}, p_{i-2}, p_{i-1}}. For example, we can consider the following feature function for the recognition of a base form:

f_i(x, y) = 1 if a content word sori(소리, sound) precedes w_i and y = deud(듣, hear); 0 otherwise    (4)

This feature function means that the base morpheme (output) of the word y is deud(듣) when sori(소리) occurs in the surrounding context of the current word. As a result, the probability p_\lambda(y|x) is calculated from the weighted sum of the active features. Given an exponential model p with n features and a set of training data with their empirical distribution, the weight for each feature is trained to maximize the model p's log-likelihood as follows:

L(p) = \sum_{x,y} \tilde{p}(x, y) log p(y|x)    (5)

The model parameters for the distribution p and the weights of the features \lambda_j can be obtained using the GIS (generalized iterative scaling) method. For ME learning, we use the Maxent toolkit.9
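A toy evaluation of the model in Eq. (3) is sketched below, purely to illustrate how feature functions like Eq. (4) combine; the weight values are hand-picked for illustration, not GIS-trained.

```python
import math

def maxent_prob(x, y, classes, feature_funcs, weights):
    """p_lambda(y|x) = exp(sum_j lambda_j f_j(x, y)) / Z_lambda(x), as in Eq. (3)."""
    def score(label):
        return math.exp(sum(w * f(x, label) for f, w in zip(feature_funcs, weights)))
    return score(y) / sum(score(c) for c in classes)

# Binary feature functions in the spirit of Eq. (4); the weights below are
# invented for illustration only.
f_sori = lambda x, y: 1 if "sori" in x and y == "deud" else 0
f_ga   = lambda x, y: 1 if "ga" in x and y == "deul" else 0

p = maxent_prob({"sori", "reul"}, "deud", ["deud", "deul"], [f_sori, f_ga], [1.5, 1.2])
print(round(p, 3))
```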

4.2. Hard examples and web counts

Practically, the classifier for the VVMA resolution should be dynamically adaptable to variations in vocabulary and to future examples. For instance, some context features of examples never occurred in the training data; "iPhone" and "iPad," in fact, are new terminologies. Thus, we need a strategy that can adaptively and incrementally extend the range of training examples. In this study, we use web data as new corpora. Previous works have demonstrated that web counts can be used to approximate bigram or n-gram frequencies as well as to supply additional information for unseen data (Blum and Mitchell, 1998; Brants and Alex, 2006; Keller and Lapata, 2003; Lapata and Keller, 2005; Zhu and Rosenfeld, 2001). Zhu and Rosenfeld (2001) adopted web-based n-gram counts for language modeling. Keller and Lapata (2003) and Lapata and Keller (2005) showed that web queries can generate frequencies comparable to those obtained from a balanced, carefully edited corpus, even though web counts may contain noise.

In general, however, web-scale data sets are too huge to download the related pages for the further processing that extracts correct lexical context features. For example, for the ambiguous word gganeun(까는), 1,020,000 web pages match that query via the Google API, and for saneun(사는), 21,500,000 web pages match the search string, as shown in Table 5. The unambiguous conjugations for learning gga(까)/ggal(깔), ggago(까고) and ggalgo(깔고), occur in 492,000 and 1,200,000 pages, respectively. Thus, we need to extend the data incrementally by selecting only useful examples with respect to small data sets.

In this study, we define useful examples as hard examples. They correspond to examples that are unseen or highly different from the pre-trained data, or whose predictions by the current classifier derived from the pre-trained data are either uncertain or wrong. They can be selected when new examples are added or when some of the test examples to be classified require additional information.

In order to evaluate whether or not an example is hard, confidence and diversity values are used. First, the confidence value is defined as (6). Since it measures the difference between the probability that an example x is predicted as category c_i and the probability that it is predicted as category c_j, it indicates how confident the prediction of the trained classifier is with respect to a specific example x:

Confidence(x) = (p(c_i|x) - p(c_j|x)) / p(c_i|x)    (6)

This equation is based on a Naïve Bayesian classifier; c_i denotes the best category (class), i.e., the one predicted with the highest value by the current classifier, and c_j is the category with the second highest value. That is, c_i is the category closest to x, which corresponds to the context feature vector of an ambiguous word, and c_j is the other category, because this problem is a binary classification. The classification confidence for an example is high if there is a big difference between the two prediction probabilities. The confidence score is thus a metric for the correctness of the classification with respect to a particular example.

Table 6. Disambiguation results.

However, even when there is no applicable feature for an unseen example, a classifier can have a high confidence value if the class distribution is biased; in that case, the confidence depends only on the class probabilities. We thus include unseen examples among the hard examples, even though they can have high prediction confidence values. In addition, unambiguous conjugation examples on the web that yield prediction errors after applying the current classifier also belong to the set of hard examples.

Another criterion for the identification of hard examples is diversity, which evaluates how different or redundant a specific test example or new example is with respect to the previously trained data. Diversity is defined as follows:

diversity(t_i) = 1 - max_{s_j \in S} K(t_i, s_j) / \sqrt{K(t_i, t_i) K(s_j, s_j)},  where K(t_i, s_j) = t_i \cdot s_j and cos \theta = (t_i \cdot s_j) / (||t_i|| ||s_j||)    (7)

In this equation, S is the set of examples used for training and t_i is a specific example to be classified or added to the training set. K represents the similarity between two examples. The diversity is computed from the similarity between t_i and the example s_j that is most similar to t_i in the set of previously trained examples. If t_i is more similar to existing examples, it has a lower degree of diversity. Using this measure, totally new examples, or examples that are at least different from previously learned data, can be selected. That is, such examples can be interpreted as uncertain examples under the current model derived from the training data.
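The two selection criteria, Eqs. (6) and (7), can be sketched together as follows. The probability vectors, feature weights, and the thresholds inside `is_hard` are illustrative assumptions; the paper does not specify threshold values.

```python
import math

def confidence(probs):
    """Eq. (6): (p(c_i|x) - p(c_j|x)) / p(c_i|x) for the top two categories."""
    top = sorted(probs.values(), reverse=True)
    return (top[0] - top[1]) / top[0]

def diversity(t, trained):
    """Eq. (7): 1 - max cosine similarity between vector t and trained vectors."""
    def cos(a, b):
        dot = sum(a[k] * b.get(k, 0.0) for k in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0
    return 1.0 - max(cos(t, s) for s in trained)

def is_hard(probs, t, trained, conf_thresh=0.5, div_thresh=0.5):
    """An example is 'hard' if its prediction is uncertain OR its vector is novel.
    Both thresholds are arbitrary illustrations."""
    return confidence(probs) < conf_thresh or diversity(t, trained) > div_thresh

trained = [{"sori": 1.0, "reul": 1.0}]          # previously trained feature vectors
novel = {"iphone": 1.0, "eul": 1.0}             # shares no feature -> diversity 1.0
print(is_hard({"deud": 0.55, "deul": 0.45}, novel, trained))
```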

Table 7. Disambiguation results (naïve Bayesian).

Method | Accuracy (%)
Bigram model | 75.10
Naïve Bayes | 93.58
Naïve Bayes + web counts | 98.12

5. Experimental results

For the VVMA resolution test, we used 5 million eojeols (words) of raw corpora as a training set and a POS-tagged corpus of 600,000 eojeols as a test set. In this study, 11 types of VVMA are considered, as shown in Table 2. In order to evaluate various classifiers for the VVMA resolution, we first retrieve context features with respect to the selected morphologically ambiguous words from the test set, and then base forms are assigned to the respective context features by each classifier. The performance of the classifiers is evaluated by accuracy, based on whether each base form is correctly assigned or not.

Table 6 shows the disambiguation results of each learning method for the VVMA resolution. As shown in the table, changing the classification method did not yield any significant improvement in disambiguation performance. In general, ME or SVM achieve somewhat higher accuracy than the others, but the variations in accuracy between the learning methods were not apparent except in the cases of gga(까)/ggal(깔) and ggeu(끄)/ggeul(끌), where the sparseness rates are high or the number of training samples is comparatively small. In Table 6, the sparse data correspond to test examples whose lexical features are unseen in the training data.

As a result, the average accuracy of the VVMA disambiguation was about 93.58%, an improvement of 18.48% over the current bigram HMM tagging model. Table 7 presents performance comparisons between the Naïve Bayesian classifiers and the morpheme-based bigram tagger, and Table 8 shows performance comparisons of the Naïve Bayesian classifiers depending on the presence of syntactic features. The use of case particle (postposition) information for the preceding words yields a slight improvement in the VVMA disambiguation, except in the case of na(나)/nal(날). For the base forms na(나) and nal(날), the case particle information is not very effective at distinguishing the two base forms, since they have similar verb subcategorization features.

We also investigated the efficacy of the method by applying it to the real tagger. The test set consisted of 113,830 morphemes (58,965 words). The VVMA disambiguation was performed as postprocessing after the POS tagging. As shown in Table 9, the performance of the tagger was improved by 0.04% (token-level accuracy) with the VVMA disambiguation alone. In fact, 26 (9.35%) of the 278 verb-related tagging errors were VVMA errors, and 24 (92.3%) of those 26 errors were corrected by the VVMA resolution classifier. The incorrectly modified words mostly needed broader contexts than we considered.

In addition, we experimented with how to incorporate web data for hard examples, as explained in the previous section. To investigate the possibility of using web data for selected hard examples, we identified hard examples in the test data and obtained additional information from web pages to process them. The issues of extending the training set and of the appropriate time for retraining require further study, as does the integration of web searching and tagging into a single process.

For this experiment, web counts returned by a search engine were used instead of downloading the related pages. We first identified hard examples based on their diversity and confidence measures and then used web counts to estimate the probabilities of their context features with respect to each base form. Table 10 shows a number of selected hard examples for the disambiguation of gga(까) and ggal(깔), along with their context features. We retrieved about 5% of the test examples as hard examples for which further information is needed. Table 11 shows the queries for the context features of the hard examples and their Google counts. The query

Table 8. Disambiguation results (naïve Bayesian).

Table 9. Tagging performance improvement by the VVMA disambiguation.

 | # of correctly tagged words | Accuracy (%)
Without VVMA | 57,593 | 97.67
With VVMA disambiguation | 57,614 | 97.71

Table 10. Selected hard examples (gga/ggal).

Hard example | Context features | Correct category
boan prograemeul gganeun | boan(security) prograem(program) boan programeul josa1=eul(object case particle) | ggal(install)
gueyeoreul myeonjeoneseo gganeun | geuyeo(her) myeonjeon(before-face) geuyeoreul myeonjeoneseo josa2=reul(object case) josa1=eseo(adverbial case particle) | gga(slate)
pentieom 4eseo gganeun | pentieom(Pentium) 4 pentieom 4eseo josa1=eseo | ggal(install)

Table 11. Google counts ("gga"/"ggal") with respect to each query.

Query (feature + decision) | Google counts
"boan(보안, security) * ggalgo(깔, install + 고, conjunctive ending)" | 7
"boan * ggago(까+고)" | 0
"boan * ggalge(깔+게)" | 0
"boan * ggage(까+게)" | 0
"prograem(프로그램, program) * ggalgo(깔+고)" | 1510
"prograem * ggago(까+고)" | 2
"prograem * ggalge(깔+게)" | 52
"prograem * ggage(까+게)" | 0
"prograemeul ggalgo(깔+고)" | 1140
"prograemeul ggago(까+고)" | 1
"prograemeul ggalge(깔+게)" | 28
"prograemeul ggage(까+게)" | 0
"eul(을, object particle) ggalgo" | 29,500
"eul ggago" | 3820
"eul ggalge" | 436
"eul ggage" | 126

11 "prograem(프로그램, program) + eul(을, object case particle) ggal(깔, install) + go(고, conjunctive ending)".

consists of each context feature of a hard example and the co-occurring unambiguous conjugations of each base form. The inflected verb forms used in the queries are generated by expanding each possible base form of an ambiguous word into its predefined unambiguous conjugation patterns. In the table, inflected verbs such as ggalgo(깔고)/ggago(까고) and ggage(까게)/ggalge(깔게) are used as the unambiguous conjugations defined for the disambiguation of gga(까) and ggal(깔).
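This expansion step might be sketched as follows; the conjugation table, the feature strings, and the helper names are toy stand-ins for the paper's predefined patterns.

```python
# Sketch of generating exact-phrase web queries for a hard example's context
# features; the patterns below are illustrative, not the paper's full inventory.

# Predefined unambiguous conjugation patterns for each candidate base form.
CONJUGATIONS = {
    "gga":  ["ggago", "ggage"],      # gga(까) + go / ge
    "ggal": ["ggalgo", "ggalge"],    # ggal(깔) + go / ge
}

def build_queries(context_features):
    """Pair every context feature with every unambiguous conjugation of every
    candidate base form, as quoted exact-phrase queries ('*' = Google wildcard)."""
    queries = []
    for feat in context_features:
        for base, conjs in CONJUGATIONS.items():
            for conj in conjs:
                queries.append((base, '"%s * %s"' % (feat, conj)))
    return queries

for base, q in build_queries(["prograem"]):
    print(base, q)
```

The counts returned for each base form's queries then serve as evidence for that base form, as in Table 11.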

All queries are performed as exact phrase matches using quotes, and the counts for each feature and its co-occurring decision (base form) on the web are retrieved through phrase searches using the Google API. In addition, the "daterange" option is used to limit the results to documents published within a specific date range; in the table, web counts are restricted to documents published within the previous three months. Further, a wildcard character is used to retrieve distant contexts; Google treats it as a placeholder for one or more words. For example, the query "prograem* ggalgo"10 matches the pages containing a phrase that starts with prograem(프로그램) and is followed by one or more words and then ggalgo(깔고). In the query, ggalgo is one of the unambiguous conjugations of the base form ggal(깔). Various phrases, including "prograemeul ggalgo"11, "prograemman ggalgo"12, and "prograemeul mujogeon ggalgo"13, can match the query. In other words, examples that include unambiguous conjugations of the base form ggal(깔), such as ggalge(깔게) or ggalgiman(깔기만), besides ggalgo(깔고), are used to learn the contexts of the base form.

10 "prograem(프로그램, program)* ggal(깔, install) + go(고, conjunctive ending)".
12 "prograem(프로그램, program) + man(만, auxiliary particle) ggal(깔, install) + go(고, conjunctive ending)".
13 "prograem(프로그램, program) + eul(을, object case particle) mujogeon(무조건, surely) ggal(깔, install) + go(고, conjunctive ending)".

By interpolating the web and corpus counts, we do not discard either side of the counts completely. Eq. (8) shows the interpolation scheme for the two naïve Bayesian classifiers used in this study.

\lambda p_web(y_j|x) + (1 - \lambda) p_corpus(y_j|x)    (8)

It combines naïve Bayesian classifications based on corpus counts and on web counts. Since web counts are generally much larger than corpus counts, the interpolation approach is applied instead of retraining the classifier after adding the feature counts from the web to the training set. The training set can be incrementally extended with additional unlabeled text or with features weighted by web counts.
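A minimal sketch of Eq. (8) follows; the value of lambda and the toy probability vectors are arbitrary illustrations, not values from the paper.

```python
def interpolate(p_web, p_corpus, lam=0.7):
    """Eq. (8): lambda * p_web(y|x) + (1 - lambda) * p_corpus(y|x).
    lam is an illustrative value, not the paper's setting."""
    return {y: lam * p_web[y] + (1 - lam) * p_corpus[y] for y in p_web}

# Corpus model is unsure about a hard example; web counts are more decisive.
p_corpus = {"gga": 0.55, "ggal": 0.45}
p_web = {"gga": 0.05, "ggal": 0.95}
mixed = interpolate(p_web, p_corpus)
best = max(mixed, key=mixed.get)
print(best, round(mixed[best], 3))
```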

As a result, after adding web counts for the context features of selected hard examples, the average disambiguation accuracy on VVMA words improved by 4.54% compared to the plain naïve Bayesian classifier, even though only web counts, without any further processing, were used.

Cases that require a broader context than the previous two words account for most of the erroneous results. Also, some errors originate from incorrect web counts caused by the inclusion of wrong web pages, because we only considered web counts rather than downloading and filtering the related pages. However, web-based frequencies are generally useful for approximating bigram counts in the VVMA disambiguation task.

This work explored the resolution of VVMA in Korean, but it is not limited to Korean. By generating an appropriate training set for disambiguation from unambiguous instances of the same class in a raw corpus, the method can similarly be applied to improve taggers in other languages. In English, the confusion between NN and VB is one of the main tagging errors: the word forms are the same, but the contexts differ, which can be handled with this approach. For instance, the disambiguation knowledge could be learned if the contexts of NP and VP are distinctively determined and extracted from large-sized raw materials for training, using function words such as determiners, auxiliary verbs, and so on.

6. Conclusion

In this paper, we address the VVMA problem based on a large pool of raw text that contains sufficient lexical information. As training data, we use the surrounding contexts of unambiguous conjugations related to each possible verb morpheme of ambiguous words, since all conjugations of irregular verbs do not lead to the spelling changes that produce VVMA. Thus, in this problem, supervised classification can be attempted even with unlabeled raw corpora. As a result, the classifiers based on context features demonstrated better performance than the current morpheme-based bigram POS taggers. In addition, we provided additional information through web counts for selected test examples that are difficult to predict using the current classifier or that are highly different from the pre-trained data set.

As a result, automatic data generation and knowledge acquisition from unlabeled text for the VVMA resolution improved the overall tagging accuracy (token level) by 0.04%. The VVMA resolution accuracy was about 98% with the Naïve Bayes classifier coupled with the addition of selective web counts. Since the availability of good training data for a classifier is still the main bottleneck in many applications, this work can be a meaningful approach to utilizing unlabeled data efficiently, by providing, from unambiguous unlabeled instances of the targeted class, information as sufficient as that provided by labeled corpora.

Acknowledgements

This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2009-0070211).

References

Abney, S., 2008. Semisupervised Learning for Computational Linguistics. Chapman & Hall/CRC, London, UK.

Blum, A., Mitchell, T., 1998. Combining labeled and unlabeled data with co-training. In: Proc. Workshop on Computational Learning Theory, pp. 92–100.

Brants, T., Alex, F., 2006. Web 1T 5-gram Version 1. Linguistic Data Consortium, Philadelphia. LDC2006T13.

Chan, Y.S., Ng, H.T., 2007. Domain adaptation with active learning for word sense disambiguation. In: Proc. 45th Annual Meeting of the Association for Computational Linguistics, pp. 49–56.

Curran, R.J., Clark, S., 2003. Language independent NER using a maximum entropy tagger. In: Proc. Seventh Conference on Natural Language Learning (CoNLL-03), pp. 164–167.

Dasgupta, S., Ng, V., 2007. Unsupervised part-of-speech acquisition for resource-scarce languages. In: Proc. Joint Conf. on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 218–227.

Fan, R.E., Chen, P.H., Lin, C.J., 2005. Working set selection using second order information for training SVM. J. Machine Learn. Res. 6, 1889–1918.

Freund, Y., Iyer, R.D., Schapire, R.E., Singer, Y., 2003. An efficient boosting algorithm for combining preferences. J. Machine Learn. Res. 4, 933–969.

Han, C., Palmer, M., 2005. A morphological tagger for Korean: Statistical tagging combined with corpus-based morphological rule application. Machine Transl. 18 (4), 275–297.

Keller, F., Lapata, M., 2003. Using the web to obtain frequencies for unseen bigrams. Comput. Linguist. 29 (3), 459–484.

Lapata, M., Keller, F., 2005. Web-based models for natural language processing. ACM Trans. Speech Lang. Process. 2 (1), 1–31.

Lee, D., Rim, H., 2009. Probabilistic modeling of Korean morphology. IEEE Trans. Audio Speech Lang. Process. 17 (5), 945–955.

Lee, G.G., Lee, J., Cha, J., 2002. Syllable-pattern-based unknown-morpheme segmentation and estimation for hybrid part-of-speech tagging of Korean. Comput. Linguist. 28 (1), 53–70.

Lee, S., Tsujii, J., Rim, H., 2000. Part-of-speech tagging based on Hidden Markov Model assuming joint independence. In: Proc. First North American Annual Meeting of the Association for Computational Linguistics, pp. 263–269.

Lee, S., Lee, G., 2007. Exploring phrasal context and error correction heuristics in bootstrapping for geographic named entity annotation. Inform. Systems 32, 575–592.

Manning, C.D., 2011. Part-of-speech tagging from 97% to 100%: Is it time for some linguistics? In: Proc. CICLing 2011, pp. 171–189.

McCallum, A., Nigam, K., 1998. A comparison of event models for Naïve Bayes text classification. In: Proc. AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41–48.

Nigam, K., Ghani, R., 2000. Analyzing the effectiveness and applicability of co-training. In: Proc. 9th International Conference on Information and Knowledge Management, pp. 86–93.

Pedersen, T., 2000. A simple approach to building ensembles of naive Bayesian classifiers for word sense disambiguation. In: Proc. First North American Annual Meeting of the Association for Computational Linguistics, pp. 63–69.

Poon, H., Cherry, C., Toutanova, K., 2009. Unsupervised morphological segmentation with log-linear models. In: Proc. HLT-NAACL, pp. 209–217.

Ratnaparkhi, A., 1996. A maximum entropy model for part-of-speech tagging. In: Proc. Empirical Methods in Natural Language Processing, pp. 133–142.

Ravi, S., Knight, K., 2009. Minimized models for unsupervised part-of-speech tagging. In: Proc. ACL-IJCNLP, pp. 504–512.

Shen, L., Satta, G., Joshi, A., 2007. Guided learning for bidirectional sequence classification. In: Proc. 45th Annual Meeting of the Association for Computational Linguistics, pp. 760–767.

Steedman, M., Hwa, R., Clark, S., Osborne, M., Sarkar, A., Hockenmaier, J., Ruhlen, P., Baker, S., Crim, J., 2003. Example selection for bootstrapping statistical parsers. In: Proc. HLT-NAACL 2003, pp. 331–338.

Steinwart, I., Christmann, A., 2008. Support Vector Machines. Springer, New York.

Suzuki, J., Isozaki, H., 2008. Semi-supervised sequential labeling and segmentation using Giga-word scale unlabeled data. In: Proc. ACL-08, pp. 665–673.

Toutanova, K., Klein, D., Manning, C.D., Singer, Y., 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proc. HLT-NAACL 2003, pp. 252–259.

Tsuruoka, Y., Tsujii, J., 2005. Bidirectional inference with the easiest-first strategy for tagging sequence data. In: Proc. HLT/EMNLP 2005, pp. 467–474.

Vapnik, V., 2006. Estimation of Dependences Based on Empirical Data. Springer, New York.

Yarowsky, D., 1995. Unsupervised word sense disambiguation rivaling supervised methods. In: Proc. 33rd Annual Meeting of the Association for Computational Linguistics, pp. 189–196.

Zhu, X., Rosenfeld, R., 2001. Improving trigram language modeling with the World Wide Web. In: Proc. Internat. Conf. on Acoustics, Speech, and Signal Processing, pp. 533–536.