
An Unsupervised Method for Identifying Loanwords in Korean

Hahn Koo
San Jose State University

[email protected]

Manuscript to appear in Language Resources and Evaluation

The final publication is available at Springer via http://dx.doi.org/10.1007/s10579-015-9296-5


Loanword Identification in Korean

Abstract

This paper presents an unsupervised method for developing a character-based n-gram classifier that identifies loanwords or transliterated foreign words in Korean text. The classifier is trained on an unlabeled corpus using the Expectation Maximization algorithm, building on seed words extracted from the corpus. Words with high token frequency serve as native seed words. Words with seeming traces of vowel insertion to repair consonant clusters serve as foreign seed words. What counts as a trace of insertion is determined using phoneme co-occurrence statistics in conjunction with ideas and findings in phonology. Experiments show that the method can produce an unsupervised classifier that performs at a level comparable to that of a supervised classifier. In a cross-validation experiment using a corpus of about 9.2 million words and a lexicon of about 71,000 words, mean F-scores of the best unsupervised classifier and the corresponding supervised classifier were 94.77% and 96.67%, respectively. Experiments also suggest that the method can be readily applied to other languages with similar phonotactics such as Japanese.

Keywords: Loanwords; Transliteration; Detection; N-gram; EM algorithm; Korean


1 Introduction

Loanwords are words whose meaning and pronunciation are borrowed from words in a foreign language. Their forms, both pronunciation and spelling, are often nativized. Their pronunciations adapt to conform to native sound patterns. Their spellings are transliterated using the native script and reflect the adapted pronunciations. For example, flask [flæsk] in English becomes 플라스크 [pʰɨl.ɾa.sɨ.kʰɨ] in Korean. The present paper is concerned with building a system that scans Korean text and identifies loanwords¹ spelled in Hangul, the Korean alphabet. Such a system can be useful in many ways. First, one can use the system to collect data to study various aspects of loanwords (e.g. Haspelmath and Tadmor, 2009) or develop machine transliteration systems (e.g. Knight and Graehl, 1998; Ravi and Knight, 2009). Loanwords or transliterations (e.g. 플라스크) can be extracted from monolingual corpora by running the system alone. Transliteration pairs (e.g. <flask, 플라스크>) can be extracted from parallel corpora by first identifying the output with the system and then matching input forms based on scoring heuristics such as phonetic similarity (e.g. Yoon et al., 2007). Second, the system allows one to use etymological origins of words as a feature and be more discrete in text processing. For example, grapheme-to-phoneme conversion in Korean (Yoon and Brew, 2006) and stemming in Arabic (Nwesri, 2008) can be improved by keeping separate rules for native words and loanwords. The system can be used to classify a given word into either category and apply the proper set of rules.

The loanword identification system envisioned here is a binary, character-based n-gram classifier. Given a word ($w$) spelled in Hangul, the classifier decides whether the word is of native ($N$) or foreign ($F$) origin by Bayesian classification, i.e. solving the following equation:

\[
c(w) = \arg\max_{c \in \{N, F\}} P(w \mid c) \cdot P(c) \tag{1}
\]

The likelihood $P(w \mid c)$ is calculated using a character n-gram model specific to that class.¹ The classifier is trained on a corpus in an unsupervised manner, building on seed words extracted from the corpus. The native seed consists of words with high token frequency in the corpus. The idea is that frequent words are more likely to be native words than foreign words. The foreign seed consists of words that contain what appear to be traces of vowel insertion. Korean does not have words that begin or end with consonant clusters. Like many other languages with similar phonotactics (e.g. Japanese), foreign words with consonant clusters are transliterated with vowels inserted to break the clusters. So the presence of substrings that resemble traces of insertion suggests that a word may be of foreign origin. An obvious problem is deciding what those traces look like a priori. Here the problem is resolved by a heuristic based on phoneme co-occurrence statistics and rudimentary ideas and findings in phonology.

¹ In this paper, loanwords in Korean refer to all words of foreign origin that are transliterated in Hangul except Sino-Korean words, which are ancient borrowings from Chinese. Sino-Korean words are considered more native-like than other words of foreign origin due to their longer history and higher morphological productivity (Sohn, 1999).

The rest of the paper is organized as follows. In Section 2, I discuss previous studies in foreign word identification as well as ideas and findings in phonology that the present study builds on. I describe the proposed method for developing the unsupervised classifier in detail in Section 3. I discuss experiments that evaluate the effectiveness of the method in Korean in Section 4 and pilot experiments in Japanese that explore its applicability to other languages in Section 5. I conclude the paper in Section 6.

2 Background

This work is motivated by previous studies on identifying loanwords or foreign words in monolingual data. Many of them rely on the assumption that the distribution of strings of sublexical units such as phonemes, letters, and syllables differs between words of different origins. Some write explicit and categorical rules stating which substrings are characteristic of foreign words (e.g. Bali et al., 2007; Khaltar and Fujii, 2009). Some train letter or syllable n-gram models separately for native words and foreign words and compare the two. It has been shown that the n-gram approach can be very effective in Korean (e.g. Jeong et al., 1999; Oh and Choi, 2001).

Training the n-gram models is straightforward with labeled data in which words are tagged either native or foreign. But creating labeled data can be expensive and tedious. In response, some have proposed methods for generating pseudo-annotated data: Baker and Brew (2008) for Korean and Goldberg and Elhadad (2008) for Hebrew. In both studies, the authors suggest generating pseudo-loanwords by applying transliteration rules to a foreign lexicon such as the CMU Pronouncing Dictionary. They suggest different methods for generating pseudo-native words. Baker and Brew extract words with high token frequencies in a Korean newswire corpus, assuming that frequent words are more likely to be native than foreign. Goldberg and Elhadad extract words from a collection of old Hebrew texts, assuming that old texts are much less likely to contain foreign words than recent texts. The approach is effective, and a classifier trained on the pseudo-labeled data can perform comparably to a classifier trained on manually labeled data. Baker and Brew trained a logistic regression classifier using letter trigrams on about 180,000 pseudo-words, half pseudo-Korean and half pseudo-English. Tested on a labeled set of 10,000 native Korean words and 10,000 English loanwords, the classifier showed 92.4% classification accuracy. In comparison, the corresponding classifier trained on manually labeled data showed 96.2% accuracy in a 10-fold cross-validation experiment.

The pseudo-annotation approach obviates the need to manually label data. But one has to write a separate set of transliteration rules for every pair of languages. In addition, the transliteration rules may not be available to begin with, if the very purpose of identifying loanwords is to collect training data for machine transliteration. The foreign seed extraction method proposed in the present study is an attempt to reduce the level of language-specificity and the demand for additional natural language processing capabilities. The method essentially equips one with a subset of transliteration rules by presupposing a generic pattern in pronunciation change, i.e. vowel insertion. The method should be applicable to many language pairs. The need to repair consonant clusters arises for many language pairs, and vowel insertion is a repair strategy adopted in many languages. Foreign sound sequences that are phonotactically illegal in the native language are usually repaired rather than overlooked. A common source of phonotactic discrepancy involves consonant clusters: different languages allow consonant clusters of different complexity. Maddieson (2013) identifies 151 languages that allow a wide variety of consonant clusters, 274 languages that allow only a highly restricted set of clusters, and 61 languages that do not allow clusters at all. Illegal clusters are repaired by vowel insertion or consonant deletion, but vowel insertion appears to be cross-linguistically more common (Kang, 2011).


The vowel insertion pattern is initially characterized only generically as ‘insert vowel X in position Y to repair consonant cluster Z’. The generic nature of the characterization ensures language-neutrality. But in order for the pattern to be of any use, one must eventually flesh out the details and provide instances of the pattern equivalent to specific transliteration rules: ‘insert [u] between the consonants to repair [sm]’, or [sm] → [sum], for example. Here the language-specific details of vowel insertion are discovered from a corpus in a data-driven manner, but the search process is guided by findings and ideas in phonology. As will be described in detail below, possible values of which vowel is inserted where are constrained based on typological studies of loanword adaptation (e.g. Kang, 2011) and vowel insertion (e.g. Hall, 2011). Possible consonant sequences originating from a cluster are delimited by the sonority sequencing principle (e.g. Clements, 1990).
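As an illustration of what such an instance amounts to in practice, a specific rule can be applied mechanically to a phoneme string. The sketch below is my own, not code from the paper: phonemes are plain strings, and the function name and examples are invented.

```python
# Illustrative sketch (not from the paper): apply one vowel-insertion rule
# of the form <C1, C2, Vid, Vloc> to a phoneme string, repairing every
# occurrence of the cluster C1+C2.

def repair_cluster(phonemes, c1, c2, vowel, loc):
    """Insert `vowel` before, between, or after each C1+C2 cluster."""
    out, i = [], 0
    while i < len(phonemes):
        if i + 1 < len(phonemes) and (phonemes[i], phonemes[i + 1]) == (c1, c2):
            repaired = {"before": [vowel, c1, c2],
                        "between": [c1, vowel, c2],
                        "after": [c1, c2, vowel]}[loc]
            out.extend(repaired)
            i += 2
        else:
            out.append(phonemes[i])
            i += 1
    return out

# The [sm] -> [sum] example above, i.e. the rule <s, m, u, between>:
print(repair_cluster(list("smak"), "s", "m", "u", "between"))
# ['s', 'u', 'm', 'a', 'k']
```

The same function covers all three insertion sites, so a list of such rules is exactly the "subset of transliteration rules" the method tries to discover.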

3 Proposal

The goal is to build a Bayesian classifier made of two character n-gram models: one for native words ($N$) and the other for foreign words ($F$). That is,

\[
c(w) = \arg\max_{c \in \{N, F\}} P(c) \cdot P(w \mid c) \approx \arg\max_{c \in \{N, F\}} P(c) \cdot \prod_i P(g_i \mid g_{i-n+1}^{i-1}, c) \tag{2}
\]

where $g_i$ is the $i$th character of $w$ and $g_{i-n+1}^{i-1}$ is the string of $n-1$ characters preceding it. In this study, the n-gram models use Witten-Bell smoothing (Witten and Bell, 1991) for its ease of implementation. That is,

\[
P(g_i \mid g_{i-n+1}^{i-1}, c) = \bigl(1 - \lambda_c(g_{i-n+1}^{i-1})\bigr) \cdot P_{mle}(g_i \mid g_{i-n+1}^{i-1}, c) + \lambda_c(g_{i-n+1}^{i-1}) \cdot P(g_i \mid g_{i-n+2}^{i-1}, c) \tag{3}
\]

So the parameters of the classifier consist of $P(c)$, $P_{mle}(g_i \mid g_{i-n+1}^{i-1}, c)$, and $\lambda_c(g_{i-n+1}^{i-1})$. They can be estimated from data as follows:

\[
P(c) = \frac{\sum_w z(w, c)}{\sum_{c'} \sum_w z(w, c')} \tag{4}
\]


\[
P_{mle}(g_i \mid g_{i-n+1}^{i-1}, c) = \frac{\sum_w \mathrm{freq}_w(g_{i-n+1}^{i}) \cdot z(w, c)}{\sum_w \mathrm{freq}_w(g_{i-n+1}^{i-1}) \cdot z(w, c)} \tag{5}
\]

\[
\lambda_c(g_{i-n+1}^{i-1}) = \frac{N_{1+}(g_{i-n+1}^{i-1} \bullet)}{N_{1+}(g_{i-n+1}^{i-1} \bullet) + \sum_w \mathrm{freq}_w(g_{i-n+1}^{i-1}) \cdot z(w, c)} \tag{6}
\]

Here, $z(w, c)$ indicates whether $w$ is classified as $c$: $z(w, c) = 1$ if it is and $z(w, c) = 0$ otherwise. $\mathrm{freq}_w(x)$ is the number of times $x$ occurs in $w$. $N_{1+}(g_{i-n+1}^{i-1} \bullet)$ is the number of different n-grams prefixed by $g_{i-n+1}^{i-1}$ that occur at least once.
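To make the model concrete, here is a small sketch of a Witten-Bell-interpolated character bigram model (eq. 3 with $n = 2$) and the Bayesian decision rule of eq. 1, under simplifying assumptions of my own: fully labeled toy training words, romanized characters, and a floored unigram as the backoff distribution. All names and data are invented for illustration.

```python
import math
from collections import Counter

class WittenBellBigram:
    """Character bigram model with Witten-Bell interpolation (a sketch of
    eq. 3 for n = 2); backs off to a floored unigram distribution."""

    def __init__(self, words):
        self.bigrams, self.hist, self.unigrams = Counter(), Counter(), Counter()
        for w in words:
            chars = ["<s>"] + list(w) + ["</s>"]
            self.unigrams.update(chars)
            for a, b in zip(chars, chars[1:]):
                self.bigrams[(a, b)] += 1
                self.hist[a] += 1
        self.types = Counter(a for (a, _) in self.bigrams)  # N1+(a .)
        self.total = sum(self.unigrams.values())
        self.vsize = len(self.unigrams) + 1  # reserve mass for unseen characters

    def p_uni(self, g):
        return (self.unigrams[g] + 1) / (self.total + self.vsize)

    def p(self, g, h):
        c_h, n1 = self.hist[h], self.types[h]
        if c_h == 0:                       # unseen history: pure backoff
            return self.p_uni(g)
        lam = n1 / (n1 + c_h)              # Witten-Bell weight on the backoff
        return (1 - lam) * (self.bigrams[(h, g)] / c_h) + lam * self.p_uni(g)

    def logprob(self, w):
        chars = ["<s>"] + list(w) + ["</s>"]
        return sum(math.log(self.p(b, a)) for a, b in zip(chars, chars[1:]))

# Bayesian decision between the two classes (eq. 1), on invented toy data:
native = WittenBellBigram(["hada", "mal", "saram", "nara"])
foreign = WittenBellBigram(["keompyuteo", "peullaseukeu", "teureoku"])
log_prior = {"N": math.log(0.95), "F": math.log(0.05)}

def classify(w):
    scores = {"N": log_prior["N"] + native.logprob(w),
              "F": log_prior["F"] + foreign.logprob(w)}
    return max(scores, key=scores.get)

print(classify("hada"), classify("peullaseukeu"))
```

In the real system the counts in both models are weighted by $z(w, c)$ rather than taken from labeled words; the decision rule itself is unchanged.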

The challenge here is that the training corpus is unlabeled, i.e. $z(w, c)$ is hidden. I use variants of the EM algorithm to iteratively guess $z(w, c)$ and update the parameters. The n-gram models are initialized with seed words extracted from the corpus. For the native class, I use high frequency words in the corpus as seed words: for example, all words whose token frequency is in the 95th percentile. For the foreign class, I first use sublexical statistics to list phoneme strings that would result from vowel insertion and then use words that contain the phoneme strings as seed words. Below I describe in detail how foreign seed words are extracted and how the seeded classifier is iteratively trained.

3.1 Foreign seed extraction

The method aims to identify loanwords whose original forms contain consonant clusters and use them as foreign seed words. This is done by string/pattern matching, where the pattern consists of phoneme strings that can result from vowel insertion. Consonant clusters do not begin or end syllables in Korean. When foreign words are borrowed, consonant clusters are repaired by inserting a vowel somewhere next to the consonants to break the cluster into separate syllables. Speakers usually insert the same vowel in the same position to repair a given consonant cluster. As a result, transliterations of different words with the same consonant cluster all share a common substring showing a trace of insertion. For example, 트라이 (try), 트레인 (train), 트리 (tree), 트롤 (troll), and 트루 (true) all have 트ㄹ, which is pronounced [tʰɨɾ]. The idea is to figure out what those signature substrings are in advance and look for words that have them. There is a risk of false positives since such substrings may exist for reasons other than vowel insertion. But the hope is that the seeded classifier will gradually learn to be discrete and use other substrings in words for further disambiguation.

The phoneme strings defining the pattern are specified below as tuples of the form $\langle C_1 C_2, V_{id}, V_{loc} \rangle$ for ease of description. Each tuple characterizes a phoneme string made of two consonants and a vowel. $C_1$ and $C_2$ are the two consonants. $V_{id}$ is the identity of the vowel. $V_{loc}$ is the location of the vowel relative to the consonants, i.e. between, before, or after the consonants. For example, <s, n, ɨ, between> means [sɨn] as in [sɨnou] for 스노우 (snow) and <n, tʰ, ɨ, after> means [ntʰɨ] as in [hintʰɨ] for 힌트 (hint). The idea is to use $C_1 C_2$ to specify consonants from a foreign cluster and $V_{id}$ and $V_{loc}$ to specify which vowel is inserted where to repair the cluster.

Rather than being manually listed using language expertise, the tuples are discovered from a corpus using the following heuristic:

1. List words that appear atypical compared with the native seed words.

2. Extract $\langle C_1 C_2, V_{id}, V_{loc} \rangle$ tuples from the atypical words where

   (a) $C_1 C_2$ respects the sonority sequencing principle.

   (b) $V_{id}$ and $V_{loc}$ most strongly co-occur with $C_1 C_2$ among all vowels.

3. Identify the most common $V_{id}$ as the default vowel used for insertion. Keep tuples whose $V_{id}$ matches the default vowel and throw away the rest.

4. Identify the most common $V_{loc}$ of the default vowel as its site of insertion for clusters in each syllable position (onset or coda). Keep tuples whose $V_{loc}$ matches the identified site of insertion and throw away the rest.

The basic idea is to find recurring combinations of two consonants that potentially came from a foreign cluster and a vowel. Step 1 defines the search space. It should be easier to see the target pattern if we zeroed in on loanwords. Native words have various morphological patterns that can obscure the target pattern. Of course, it is not yet known which words are loanwords. So instead the method avoids words similar to what are currently believed to be native words, i.e. the native seed words. Put differently, words dissimilar to the native seed words are tentatively loanwords. Here the similarity is measured by a word's length-normalized probability according to a character n-gram model trained on the native seed words: $1/|w| \cdot \log P(w)$ for word $w$ of length $|w|$. A word is atypical if its probability ranks below a threshold percentile (e.g. 5%).
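The atypicality filter can be sketched as follows. This is my own minimal illustration, with add-one smoothing standing in for Witten-Bell for brevity; function names and the toy strings are invented.

```python
import math
from collections import Counter

# Sketch of step 1: rank words by length-normalized log-probability under a
# character bigram model trained on the native seed, and keep the bottom
# fraction as "atypical". Add-one smoothing stands in for Witten-Bell here.

def train_bigram(seed_words):
    bigrams, hist, vocab = Counter(), Counter(), set()
    for w in seed_words:
        chars = ["<s>"] + list(w)
        vocab.update(chars)
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            hist[a] += 1
    V = len(vocab) + 1
    return lambda a, b: (bigrams[(a, b)] + 1) / (hist[a] + V)

def norm_logprob(p, w):
    chars = ["<s>"] + list(w)
    return sum(math.log(p(a, b)) for a, b in zip(chars, chars[1:])) / len(w)

def atypical_words(words, seed_words, frac=0.05):
    p = train_bigram(seed_words)
    ranked = sorted(words, key=lambda w: norm_logprob(p, w))
    return ranked[:max(1, int(len(ranked) * frac))]

# A word made of unseen characters ranks as most atypical:
print(atypical_words(["abab", "baba", "zzzz"], ["abab", "abba", "baba"]))
# ['zzzz']
```

Dividing by $|w|$ keeps long words from being penalized merely for having more bigrams.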

Step 2 generates a first-pass list. Condition 2a delimits possible consonant sequences from a foreign cluster. According to the sonority sequencing principle, consonants in a syllable are ordered so that consonants of higher sonority appear closer to the vowel of the syllable. There are different proposals on what sonority is and how different classes of consonants rank on the sonority scale (e.g. Clements, 1990; Selkirk, 1984; Ladefoged, 2001). Here I simply classify consonants as either obstruents or sonorants (see Table 1) and stipulate that sonorants have higher sonority than obstruents. I also assume that the sonority of consonants does not change during transliteration although their identities may change. For example, ‘free’ changes from [fɹi] to [pʰɨɾi], but [pʰ] remains an obstruent and [ɾ] remains a sonorant. Accordingly, $C_1 C_2$ must be obstruent-sonorant if it is from an onset cluster and sonorant-obstruent if it is from a coda cluster. To determine with certainty whether the consonants originally occupied onset or coda, I focus on phoneme strings found only at word boundaries. If $C_1 C_2$ are the first two consonants of a word, they are from onset. If they are the last two consonants of a word, they are from coda.

<Insert Table 1 here>

Condition 2b is used to guess the vowel inserted to repair each cluster. Only one vowel is repeatedly used, so its co-occurrence with the consonants should not only be noticeable but most noticeable among all vowels. Here the co-occurrence tendency is measured using pointwise mutual information: $\mathrm{PMI}(C_1 C_2, V) = \log P(C_1 C_2, V) - \log\bigl(P(C_1 C_2) \cdot P(V)\bigr)$ where $V = \langle V_{id}, V_{loc} \rangle$.
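The PMI-based selection of condition 2b, together with the default-vowel count of step 3, can be sketched as below. The counts, consonant pairs, and function names are toy inventions of mine, not data from the paper.

```python
import math
from collections import Counter

# Sketch of condition 2b and step 3: for each candidate consonant pair,
# pick the (vowel, location) with the highest PMI, then take the vowel
# chosen for the most pairs as the default.

def best_vowels(observations):
    pair_n, v_n, joint_n = Counter(), Counter(), Counter()
    for c1, c2, vid, vloc in observations:
        pair_n[(c1, c2)] += 1
        v_n[(vid, vloc)] += 1
        joint_n[(c1, c2, vid, vloc)] += 1
    total = len(observations)
    best = {}
    for (c1, c2, vid, vloc), n in joint_n.items():
        pmi = math.log(n / total) - math.log((pair_n[(c1, c2)] / total)
                                             * (v_n[(vid, vloc)] / total))
        if (c1, c2) not in best or pmi > best[(c1, c2)][1]:
            best[(c1, c2)] = ((vid, vloc), pmi)
    return {pair: v for pair, (v, _) in best.items()}

obs = ([("t", "r", "ɨ", "between")] * 5 + [("t", "r", "a", "after")]
       + [("s", "n", "ɨ", "between")] * 5 + [("s", "n", "a", "after")]
       + [("m", "p", "a", "after")] * 4)
picks = best_vowels(obs)
default_vowel = Counter(vid for vid, _ in picks.values()).most_common(1)[0][0]
print(picks[("t", "r")], default_vowel)  # ('ɨ', 'between') ɨ
```

Note that PMI normalizes by how frequent each vowel is overall, so a globally common vowel must co-occur with a pair disproportionately often to be selected for it.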

The list is truncated to avoid false positives in steps 3 and 4. This is done by identifying the default vowel insertion strategy and keeping only the tuples consistent with it. Exactly which vowel is inserted where to repair a consonant cluster is context-specific. But a language that relies on vowel insertion for repair usually has a default vowel inserted in typical locations (cf. Uffmann, 2006). Here it is assumed that the default vowel is the one used to repair the most diverse consonant clusters. So it is the most frequent vowel in the list. Similarly, its default site of insertion is in principle its most frequent location in the list. But possible sites of insertion differ for onset clusters and coda clusters: before or between the consonants in onset, but after or between the consonants in coda (Hall, 2011). So the default site of insertion is identified separately for onset and coda.

3.2 Bootstrapping with EM

The parameters ($\theta$) to estimate are $P(c)$, $P_{mle}(g_i \mid g_{i-n+1}^{i-1}, c)$, and $\lambda_c(g_{i-n+1}^{i-1})$. The first parameter $P(c)$ is initialized according to some assumption about what proportion of words in the given corpus are loanwords. For example, if one assumes that 5% are loanwords, $P(N) = 0.95$ and $P(F) = 0.05$. The latter two parameters, which define the n-gram models, are initialized using the seed words as if they were labeled data: $z(w, N) = 1$ and $z(w, F) = 0$ for native seed words and $z(w, N) = 0$ and $z(w, F) = 1$ for foreign seed words. Note that other words in the corpus are not used to initialize the n-gram models. The initial parameters are then updated on the whole corpus by iterating the following two steps until some stopping criterion is met.

E-step: Calculate the expected value of z(w, c) using current parameters.

\[
E[z(w, c)] = P(c \mid w; \theta^{(t)}) = \frac{P(w \mid c; \theta^{(t)}) \cdot P(c; \theta^{(t)})}{\sum_{c'} P(w \mid c'; \theta^{(t)}) \cdot P(c'; \theta^{(t)})} \tag{7}
\]

M-step: Transform the expected value to $\tilde{z}(w, c)$, i.e. some estimate of $z(w, c)$, and plug it into equations (4-6) to update the parameters.

I experiment with three versions of the algorithm in the present study: soft EM, hard EM, and smoothstep EM. The three differ with respect to how $E[z(w, c)]$ is transformed to $\tilde{z}(w, c)$. In soft EM, which is the same as the classic EM algorithm (Dempster et al., 1977), there is no transformation, i.e. $\tilde{z}(w, c) = E[z(w, c)]$. In hard EM, $\tilde{z}(w, c) = 1$ if $c = \arg\max_{c'} E[z(w, c')]$ and $\tilde{z}(w, c) = 0$ otherwise. Since there are only two classes here, this is equivalent to applying a threshold function at 0.5 to $E[z(w, c)]$. In smoothstep EM, a smooth step function is applied instead of the threshold function: $\tilde{z}(w, c) = f^3(E[z(w, c)])$ where $f(x) = -2x^3 + 3x^2$. Figure 1 illustrates how $E[z(w, c)]$ is transformed to $\tilde{z}(w, c)$ by the three variants of the EM algorithm.

<Insert Figure 1 here>
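In code, the three transforms are tiny functions of the posterior. This sketch is mine, and it reads $f^3$ as threefold composition of the smoothstep polynomial, which keeps 0, 0.5, and 1 as fixed points while sharpening the curve.

```python
# The three E[z] -> z-tilde transforms, where e is the posterior
# probability E[z(w, F)] in [0, 1].

def soft(e):
    return e                          # classic EM: use the posterior as-is

def hard(e):
    return 1.0 if e >= 0.5 else 0.0   # threshold at 0.5 (two classes)

def smoothstep(e):
    f = lambda x: -2 * x ** 3 + 3 * x ** 2
    return f(f(f(e)))                 # f applied three times (composition)

for e in (0.1, 0.5, 0.9):
    print(soft(e), hard(e), round(smoothstep(e), 4))
```

Composing $f$ pushes values near 0 and 1 almost all the way to 0 and 1, while leaving values near 0.5 only mildly sharpened, which matches the compromise behavior described below.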


As will be shown in the experiments below, soft EM is aggressive while hard EM is conservative in recruiting words to the foreign class. Soft EM gives partial credit even to words that are very unlikely to be foreign according to the current model. Over time, such words may manage to gain enough confidence and be considered foreign. Some of them may turn out to be false positives. On the other hand, hard EM does not give any credit even to words that are just barely below the threshold to be considered foreign. Some of them may turn out to be false negatives. Smoothstep EM is a compromise between the two extremes. It virtually ignores words that do not stand a chance but gives due credit to words that barely missed.

4 Experiments

Experiments show that the proposed approach can be effective in Korean despite its unsupervised nature. Classifiers built on a raw corpus with minor preprocessing (e.g. removing tokens with non-Hangul characters) identify loanwords in test lexicons well. The foreign seed extraction method correctly identifies the default vowel insertion strategy in Korean loanword phonology. The resulting classifier performs better when initialized with the proposed seeding method than with random seeding. Its performance is not that far behind the corresponding supervised classifier either. Moreover, after exposure to the words (but not their labels) used to train the supervised classifier, the unsupervised classifier performs at a level comparable to the supervised classifier. I discuss the details of the experiments below.

4.1 Methods

I use four datasets called SEJONG, KAIST, NIKL-1, and NIKL-2 below. SEJONG and KAIST are unlabeled data used to initialize and train the unsupervised classifier. SEJONG consists of 1,019,853 types and 9,206,430 tokens of eojeols, which are character strings delimited by white space equivalent to words or phrases. The eojeols are from a morphologically annotated corpus developed in the 21st Century Sejong Project under the auspices of the Ministry of Culture, Sports, and Tourism of South Korea, and the National Institute of the Korean Language (2011). They were selected by extracting Hangul character strings delimited by white space after removing punctuation marks. Strings that contained non-Hangul characters (e.g. 12월의, Farrington으로부터) were excluded in the process. KAIST consists of 2,409,309 types and 31,642,833 tokens of eojeols from the KAIST corpus (Korea Advanced Institute of Science and Technology, 1997) extracted in the same way as SEJONG. NIKL-1 and NIKL-2 are labeled data used to test the classifier. They are made of words from various language resources released by the National Institute of the Korean Language (NIKL). NIKL-1 consists of 49,962 native words and 21,176 foreign words selected from two lexicons (NIKL, 2008, 2013). NIKL-2 consists of 44,214 native words and 18,943 foreign names selected from four reports released by NIKL (2000a,b,c,d) and a list of transliterated names of people and places originally spelled in Latin alphabets (NIKL, 2013). I examined the words manually and labeled them either native or foreign. Words of unknown or ambiguous etymological origin were excluded in the process. SEJONG and NIKL-1 are mainly used to examine the effectiveness of the proposed methods. KAIST and NIKL-2 are used to examine whether the methods are robust to varying data. See Table 2 for a summary of data sizes.

<Insert Table 2 here>

The proposed methods are implemented as follows. All n-gram models are trained on character bigrams, where each Hangul character represents a syllable. The high frequency words defining the native seed are eojeols whose token frequency is above the 95th percentile in a given corpus. When extracting the foreign seed, the so-called ‘atypical words’ are eojeols whose length-normalized n-gram probabilities lie in the bottom 5% according to the model trained on the native seed. Their phonetic transcriptions are generated by applying the simple rewrite rules in Appendix A. For bootstrapping, the prior probabilities are initialized to $P(c = N) = 0.95$ and $P(c = F) = 0.05$. The parameters of the classifier are iteratively updated until the average likelihood of the data improves by no more than 0.01% or the number of iterations reaches 100.

Classification performance is measured in terms of precision, recall, and F-score. Here, precision ($p$) is the percentage of words correctly classified as foreign out of all words classified as foreign. Recall ($r$) is the percentage of words correctly classified as foreign out of all words that should have been classified as foreign. F-score is the harmonic mean of the two with equal emphasis on both, i.e. $F = 2 \cdot p \cdot r / (p + r)$. To put the numbers in perspective, scores of classifiers built using the proposed methods are compared with those of supervised classifiers and randomly seeded classifiers. Supervised classifiers are trained and tested on the labeled data (NIKL-1 or NIKL-2) using five-fold cross-validation. The labeled data is partitioned into five equal-sized subsets. The supervised classifier is trained on four subsets and tested on the remaining subset. This is repeated five times for the five different combinations of subsets. Randomly seeded classifiers are unsupervised classifiers with just a different seeding strategy: 5% of words in the corpus are randomly chosen as foreign seed words and the rest are native seed words. For fair comparison, the unsupervised classifiers are also tested five separate times on the five subsets of labeled data that the supervised classifier is tested on. Accordingly, classification scores reported below are the arithmetic means of scores on the five subsets.
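For concreteness, the three metrics can be computed as below; this is my own sketch, with gold and predicted labels represented as "N"/"F" strings.

```python
# Precision, recall, and F-score for the foreign class, as defined above.

def prf(gold, pred):
    tp = sum(g == "F" and p == "F" for g, p in zip(gold, pred))
    fp = sum(g == "N" and p == "F" for g, p in zip(gold, pred))
    fn = sum(g == "F" and p == "N" for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# tp = 2, fp = 1, fn = 1, so p = r = F = 2/3:
print(prf(list("FFFNN"), list("FFNFN")))
```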

4.2 Results and discussion

The foreign seed extraction method correctly identifies the default vowel insertion strategy. Table 3 lists the number of different consonant clusters for which each vowel in Korean is selected as the top candidate. [ɨ] is predicted to be the default vowel as it is chosen most often overall. Its predicted site of insertion for onset clusters is between the consonants of each cluster, as it is chosen more often there than before the consonants. Similarly, its predicted site of insertion for coda clusters is after the consonants of each cluster rather than between the consonants.

<Insert Table 3 here>

The 28 phoneme strings made of the default vowel and the consonant pairs it allegedly separates are listed in the row labeled SEJONG in Table 4. They specify what traces of vowel insertion would look like and define the pattern matched against the atypical words to extract the foreign seed. All but three of them indeed occur as traces of vowel insertion in one or more loanwords in the entire data used for the present study. The foreign seed consists of 2,500 eojeols (out of 50,992 atypical ones) that contain one or more of the phoneme strings. The foreign seed does contain false positives, but their proportion is not that big: 489/2,500 (= 19.56%). Since SEJONG is unlabeled and too large, it is hard to tell what percentage of loanwords the foreign seed represents. But if one extracted all atypical words in NIKL-1 that contained the phoneme strings, it would return a foreign seed containing 458/21,176 = 2.16% of all the loanwords in the dataset. So the foreign seed is small in size and represents a tiny fraction of loanwords.

<Insert Table 4 here>

The seeded classifier can be trained effectively with smoothstep EM (see row 2 in Table 5 for scores). Despite the small seed, recall is high (85.51%) without compromise in precision (94.21%). The scores are, of course, lower than those of the supervised classifier (see row 1 in Table 5). Precision is lower by 2.67% points and recall is lower by 10.95% points. But considering the unsupervised nature of the approach, the scores are encouraging. The classifier performs better when trained with smoothstep EM than with the other two variants of EM (see rows 4 and 5 in Table 5). Precision is just as high but recall is a bit lower (80.16%) when trained with hard EM. On the other hand, precision is miserable (47.81%) although recall is higher (91.46%) when trained with soft EM. Figure 2 illustrates how well the classifier performs on NIKL-1 over time as it is iteratively trained on SEJONG with the three variants of EM. Right after initialization, the scores of the classifier are precision = 93.82% and recall = 52.07%. All three variants boost recall significantly within the first several iterations. Soft EM is the most successful, followed by smoothstep EM, and then hard EM. But while the other two not only maintain but also marginally improve precision, soft EM steadily loses precision throughout the whole training session.

<Insert Figure 2 here>

Bootstrapping is more effective with the proposed seeding method than with random seeding. Scores of three different randomly seeded classifiers trained with smoothstep EM are listed in rows 6-8 in Table 5. Compared to the proposed classifier, their precision is higher by around 1% point, but their recall is lower by around 14% points. Their performance is nonetheless consistent as well as strong, and deserves a closer look. The three randomly seeded classifiers all followed a similar trajectory as they evolved. To describe the process briefly using a clustering analogy, the foreign cluster, which started out as a small random subset of the 50,992 atypical eojeols, immediately shrank to a much smaller set including eojeols with hapax character bigrams, i.e. bigrams whose type frequency is one. For one of the three classifiers, the foreign cluster shrank to a set of 5,421 eojeols as soon as training began, and 2,061 of them contained hapax bigrams. It is likely that many words containing hapax bigrams were loanwords and the foreign cluster eventually grew around them. In fact, among 4,378 words in NIKL-1 containing character bigrams that appear only once in SEJONG, 1,601 are native words and 2,777 are loanwords. The process makes intuitive sense. At the beginning, the foreign cluster is overwhelmed in size by the native cluster and unlikely to have homogeneous subclusters due to random initialization. Eojeols in the foreign cluster will be absorbed by the native cluster unless they have bigrams that seem alien to the native cluster. Hapax bigrams are a prime example of such bigrams, so they figure more prominently in the foreign cluster. Loanwords are alien to begin with, so it makes sense that they are more likely than native words to contain hapax bigrams. The dynamics involving data size, randomness, hapax bigrams, and loanwords are indeed interesting and did lead to good classifiers. But at the moment, it is not clear whether they are reliable and predictable. More importantly, the proposed seeding method led to significantly better classifiers.

Robustness to noise: The proposed methods are effective despite some noise in the training data. There are two sources of noise in SEJONG: crude grapheme-to-phoneme conversion (G2P) and lack of morphological processing. G2P generates the phonetic transcriptions required for foreign seed extraction. In the experiments above, the transcriptions were generated by applying a rather simple set of rules. Grapheme-phoneme correspondence in Hangul is quite regular, but there are phonological patterns such as coda neutralization and tensification (Sohn, 1999) that the rules do not capture. Accordingly, the resulting transcriptions are decent approximations but occasionally incorrect. In fact, when the rules are tested on 14,007 words randomly chosen from the Standard Korean Dictionary, word accuracy and phoneme accuracy are 67.92% and 94.67%, respectively. One could ask whether the proposed methods would perform better with more accurate transcriptions. An experiment with a better G2P suggests that the approximate transcriptions are good enough. A joint 5-gram model (Bisani and Ney, 2008) was trained on 126,068 words from the Standard Korean Dictionary. The model transcribes words in SEJONG differently from the rules: by 36.62% in terms of words and 5.53% in terms of phonemes. The model's transcriptions are expected to be more accurate: its word accuracy and phoneme accuracy on the 14,007 words mentioned above are 95.30% and 99.35%.

Building the classifier from scratch using the new transcriptions barely changes the results. The foreign seed extraction method again correctly identifies the default vowel insertion strategy: it identifies [ɨ] as the default vowel, inserted between the consonants in onset and after the consonants in coda. It picks 31 phoneme strings including the vowel as potential traces of insertion (see SEJONG-g2p in Table 4). All but four of them have example loanwords in which they occur as traces of vowel insertion. The set of phoneme strings is similar to the one identified before, with a 73.53% overlap between the two. The resulting foreign seed is even more similar to the previous seed, with an 84.35% overlap between the two. The new seed is slightly larger than the previous seed (2,527 vs. 2,500 words) but has a higher proportion of false positives (20.66% vs. 19.56%). The two seeds lead to very similar classifiers trained with smoothstep EM. The two trained classifiers tag 99.39% of words in NIKL-1 in the same way, and their scores differ by only 0.24% – 0.48% points (see row 9 in Table 5 for the new classification scores).

The training data in the experiments above include eojeols containing both native and foreign morphemes. Loanwords can be suffixed with native morphemes, combine with native words to form compounds, or both. A good example is 투자펀드를 (investment-fund-ACC), where 투자 and 를 are native and 펀드 is foreign. Such items may mislead the classifier to recruit false positives during training. One could ask whether performance of the proposed methods can be improved by stemming or further morpheme segmentation. Experiments suggest that they improve precision but at the sacrifice of recall. Data for the experiments consist of a set of 250,844 stems and a set of 132,430 non-suffix morphemes in SEJONG. Eojeols in SEJONG are morphologically annotated in the original corpus. For example, 투자펀드를 is annotated 투자/NNG + 펀드/NNG + 를/JKO. Stems were extracted by removing substrings tagged as suffixes and particles (e.g. 투자펀드를 → 투자펀드). Non-suffix morphemes were extracted by splitting the derived stems at specified morpheme boundaries (e.g. 투자펀드 → 투자 and 펀드). Two classifiers were built from scratch with rule-based transcriptions: one using the stems and the other using the morphemes.
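The stem and morpheme extraction described above can be sketched as follows. The tag prefixes used to drop particles and suffixes are a simplification of the Sejong tagset and should be treated as an assumption, not the paper's exact filter.

```python
def parse(annotation):
    """'투자/NNG + 펀드/NNG + 를/JKO' -> [('투자', 'NNG'), ...]"""
    return [tuple(m.strip().rsplit("/", 1)) for m in annotation.split("+")]

# Assumed simplification: particles J*, endings E*, suffixes XS*.
DROP = ("J", "E", "XS")

def stem(annotation):
    # Concatenate the remaining morphs into a single stem string.
    return "".join(form for form, tag in parse(annotation)
                   if not tag.startswith(DROP))

def morphemes(annotation):
    # Keep the remaining morphs as separate units.
    return [form for form, tag in parse(annotation)
            if not tag.startswith(DROP)]
```

For the example in the text, `stem` yields 투자펀드 and `morphemes` yields 투자 and 펀드.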

The foreign seed extraction method is as effective as when it was applied to eojeols. It correctly identifies the default vowel and its site of insertion in both data sets. The phoneme strings identified as potential traces of insertion are listed in the rows labeled SEJONG-stem and SEJONG-morph in Table 4. As before, many of them are indeed found in loanwords because of vowel insertion, while a few of them are not. The resulting seeds are much smaller but contain proportionally fewer false positives than before: 59/642 = 9.20% and 58/323 = 17.96% when using stems and morphemes respectively, vs. 489/2,500 = 19.56% when using eojeols. Scores of the seeded classifiers trained with smoothstep EM are listed in rows 10 and 11 in Table 5. Compared to the classifier trained on eojeols, precision improves by 1.55 and 2.14% points but recall plummets by 11.62 and 23.81% points. The gain in precision is tiny compared to the loss in recall. Perhaps one could prevent the loss in recall by adding more data. But the current results suggest that the proposed methods are good enough, if not better off, without morphological processing.

Robustness to varying data: Experiments with different Korean data suggest that the proposed methods are effective for Korean in general rather than just the particular data used above. A new classifier was built from scratch on KAIST using rule-based transcriptions and smoothstep EM and tested on NIKL-2. Its performance was compared with that of the unsupervised classifier trained on SEJONG and a new supervised classifier trained on subsets of NIKL-2. The foreign seed extraction method again correctly identifies the default vowel and its site of insertion. It picks 26 phoneme strings including the vowel as potential traces of insertion (see KAIST in Table 4). All but one of them have example loanwords in which they occur as traces of vowel insertion. The phoneme strings lead to a foreign seed consisting of 4,179 eojeols. The seed contains relatively more false positives (27.35%) than when using eojeols in SEJONG (19.56%). But scores of the SEJONG classifier and the resulting KAIST classifier tested on NIKL-2 are barely different (see rows 13 and 15 in Table 5). The SEJONG classifier is behind the supervised classifier by 5.31% points in precision and 11.20% points in recall (see row 12 in Table 5 for scores of the supervised classifier). The difference is slightly larger than the difference observed with NIKL-1, most likely because SEJONG differs more from NIKL-2 than from NIKL-1: the perplexity of a character bigram model trained on SEJONG is higher on NIKL-2 (564.55) than on NIKL-1 (484.18).
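The perplexity comparison can be reproduced with any character bigram model. The sketch below uses add-alpha smoothing as a stand-in; the paper's own smoothing scheme is not reproduced here, so the absolute numbers would differ, but the relative comparison between test sets works the same way.

```python
import math
from collections import Counter

def train_bigram(text, alpha=1.0):
    """Train an add-alpha smoothed character bigram model on a string;
    returns a conditional probability function prob(prev_char, char)."""
    vocab = set(text)
    counts = Counter(zip(text, text[1:]))   # bigram counts
    context = Counter(text[:-1])            # unigram (context) counts

    def prob(prev, ch):
        return (counts[(prev, ch)] + alpha) / (context[prev] + alpha * len(vocab))

    return prob

def perplexity(prob, text):
    """Per-bigram perplexity of the model on a test string."""
    logp = sum(math.log(prob(a, b)) for a, b in zip(text, text[1:]))
    return math.exp(-logp / (len(text) - 1))
```

A model trained on one corpus assigns higher perplexity to a test set whose character sequences it has seen less often, which is how the SEJONG vs. NIKL-1/NIKL-2 comparison is quantified.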

Adaptation: Unlike for the supervised classifier, the training data and the test data for the unsupervised classifiers come from different sources. For example, one unsupervised classifier was trained on SEJONG and tested on NIKL-1, while the supervised classifier compared with it was both trained and tested on NIKL-1. So the comparison between the two was not entirely fair. Experiments show that a simple adaptation method such as linear interpolation can fix the problem. In brief, a baseline classifier is interpolated with a new classifier that inherits its parameters from the baseline classifier and is iteratively trained on adaptation data. The classifiers are interpolated and make predictions according to the following equation:

c(w) = argmax_c [ (1 − λ) · P_base(w, c) + λ · P_new(w, c) ]    (8)

Here the baseline classifier is the classifier trained on words from an unlabeled corpus (e.g. SEJONG) and the adaptation data is the portion of the labeled data (e.g. NIKL-1) used to train the comparable supervised classifier. Of course, the adaptation data does not include labels from the original data. The idea is not to provide feedback but merely to expose the classifier to the kinds of words it will be asked to classify later. In the experiments, the new classifier was trained on 90% of the adaptation data with smoothstep EM, just like the baseline classifier. The interpolation weights were estimated on the remaining 10% with the classic EM algorithm. Applying the method to adapt the SEJONG and KAIST classifiers to the NIKL data significantly improves their performance. F-scores of the unsupervised classifiers after adaptation are behind those of the comparable supervised classifiers by no more than 2.5% points. See rows 3, 14, and 16 in Table 5 for scores after adaptation.
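Equation (8) amounts to a weighted vote between the two classifiers. A minimal sketch, with `p_base` and `p_new` standing in for the two models' joint scores P(w, c) (their internal structure is not assumed here):

```python
def interpolated_predict(w, p_base, p_new, lam, classes=("native", "foreign")):
    """Eq. (8): pick the class maximizing the linear interpolation of the
    baseline and adapted scores; lam is the interpolation weight."""
    return max(classes,
               key=lambda c: (1 - lam) * p_base(w, c) + lam * p_new(w, c))
```

With lam = 0 the prediction reduces to the baseline classifier; with lam = 1 it reduces to the adapted one; the held-out 10% chooses a value in between.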

<Insert Table 5 here>

5 Applicability to other languages: a pilot study in Japanese

Ideally, the proposed approach should work with any language that does not allow consonant clusters and relies on vowel insertion to repair foreign clusters. In this section, I demonstrate its potential applicability with a pilot study in Japanese. In addition to not allowing consonant clusters, Japanese does not allow consonants in coda except the moraic nasal (e.g. [san]) and the first part of a geminate obstruent that straddles two syllables (e.g. [kip.pu]). The vowel inserted for repair is usually [u] (e.g. フランス [huransu] for 'France'), but [o] after the coronal stops [t] and [d] (e.g. トレンド [torendo] for 'trend'). It is inserted between the consonants to repair onset clusters and after the consonants to repair coda clusters beginning with [n]. But for other coda clusters, it is inserted after each consonant of the cluster (e.g. ヘルス [herusu] for 'health'). The patterns are similar to Korean, so the approach should work without much modification.

The data for the experiment consist of 108,816 words for training and 148,128 words for testing. The training data came from the JEITA corpus (Hagiwara, 2013). Word boundaries and pronunciation are not obvious in raw Japanese text: words are not delimited by white space and are sometimes spelled in kanji, which are logographic, rather than hiragana or katakana, which are phonographic. Fortunately, the corpus comes with the words segmented and additionally spelled in katakana. It is those katakana spellings that constitute the training data. The test data came from JMDict (Breen, 2004), a lexicon annotated with various information including pronunciation transcribed in either hiragana or katakana and, if a word is a loanword, its source language. Since loanwords in Japanese are spelled in katakana, I labeled words spelled without any katakana characters as native, and words that had language source information and were spelled only in katakana as foreign. This led to a test set of 130,237 native words and 17,891 foreign words. Some of the words in the training and test data were respelled to make the classification task non-trivial. First, all words in hiragana were respelled in katakana (e.g. それ → ソレ). Otherwise, one could simply label any word in hiragana as native and avoid false positives. Second, all instances of choonpu were replaced with the proper vowel characters given the context (e.g. ハープーン [haapuun] 'harpoon' → ハアプウン). The choonpu character in katakana indicates long vowels, which in hiragana are indicated by adding an extra vowel character. Without the correction, one could simply label words with choonpu as foreign and identify a significant portion of loanwords.
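Both respelling steps are mechanical in Unicode. A minimal sketch follows; the vowel lookup table is truncated to the characters needed for the example, whereas a full implementation would cover all kana.

```python
def hira_to_kata(s):
    # Hiragana (U+3041-U+3096) and katakana differ by a fixed codepoint
    # offset of 0x60, so respelling is a per-character shift.
    return "".join(chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
                   for ch in s)

# Katakana -> vowel of that kana (truncated, illustrative table).
VOWEL = {"ハ": "ア", "プ": "ウ", "レ": "エ", "ソ": "オ"}

def expand_choonpu(s):
    # Replace the long-vowel mark ー with the vowel of the preceding kana.
    out = []
    for ch in s:
        if ch == "ー" and out:
            out.append(VOWEL.get(out[-1], ch))
        else:
            out.append(ch)
    return "".join(out)
```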

The n-gram models in the experiment were trained on katakana character bigrams. Phonetic transcriptions for foreign seed extraction were generated essentially by romanization: katakana symbols were romanized following the Nihon-shiki system (e.g. シャツ → syatu) and each letter was mapped to the corresponding phonetic symbol (e.g. syatu → [sjatu]). All other aspects of the experiment were set up in the same way as the experiments in Korean. The results appear promising. The foreign seed extraction method identifies [u] as the default vowel and its site of insertion as between consonants in onset and after consonants in coda. It picks 14 phoneme strings including the vowel as potential traces of insertion (see JEITA in Table 4). Eight of them have example loanwords in which they occur as traces of vowel insertion. The phoneme strings lead to a foreign seed consisting of 173 words that include 68 false positives (46.26%). It is encouraging that the method correctly identifies the default vowel insertion strategy. But the resulting foreign seed is quite small, partly because the corpus is small to begin with, and less accurate than the seeds in the Korean experiments. Classification scores are listed in rows 17-19 in Table 5. Overall, the scores are lower than those achieved in Korean. Considering that the scores are lower even for the supervised classifier, it seems that character bigrams are less effective in Japanese than in Korean. As expected from the size of the foreign seed, recall of the unsupervised classifier is quite low. But after adaptation to the lexicon, recall improves significantly and the F-score is not that far behind that of the supervised classifier.

6 Conclusion

I proposed an unsupervised method for developing a classifier that identifies loanwords in Korean text. As shown in the experiments discussed above, the method can yield an effective classifier that can be made to perform at a level comparable to that of a supervised classifier. The method is cost-efficient, as it requires no language resources other than a large monolingual corpus, a grapheme-to-phoneme converter, and perhaps a lexicon to supplement the corpus. The method is in principle applicable to a wide range of languages, namely those that rely on vowel insertion to repair illegal consonant clusters. Results from the pilot experiment in Japanese were encouraging. Future studies will further explore the applicability of the method to other languages, especially under-resourced languages.

References

Baker, K. and Brew, C. (2008). Statistical identification of English loanwords in Korean using automatically generated training data. In Proceedings of the 6th Language Resources and Evaluation Conference (LREC'08), pages 1159–1163.

Bali, R.-M., Chong, C. C., and Pek, K. N. (2007). Identifying and classifying unknown words in Malay texts. In Proceedings of the 7th International Symposium on Natural Language Processing, pages 493–498.

Bisani, M. and Ney, H. (2008). Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5):434–451.

Breen, J. (2004). JMDict: a Japanese-multilingual dictionary. In Proceedings of the Workshop on Multilingual Linguistic Resources, pages 71–79. Association for Computational Linguistics.

Clements, G. N. (1990). The role of the sonority cycle in core syllabification. In Kingston, J. and Beckman, M., editors, Papers in Laboratory Phonology I: Between the Grammar and Physics of Speech, pages 283–333. Cambridge: Cambridge University Press.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38.

Goldberg, Y. and Elhadad, M. (2008). Identification of transliterated foreign words in Hebrew script. In Computational Linguistics and Intelligent Text Processing, pages 466–477. Springer Berlin Heidelberg.

Hagiwara, M. (2013). JEITA public morphologically tagged corpus (in ChaSen format).

Hall, N. (2011). Vowel epenthesis. In van Oostendorp, M., Ewen, C. J., Hume, E., and Rice, K., editors, The Blackwell Companion to Phonology, pages 1576–1596. Malden, MA & Oxford: Wiley-Blackwell.

Haspelmath, M. and Tadmor, U. (2009). Loanwords in the World's Languages: A Comparative Handbook. Walter de Gruyter.

Jeong, K. S., Myaeng, S. H., Lee, J. S., and Choi, K.-S. (1999). Automatic identification and back-transliteration of foreign words for information retrieval. Information Processing and Management, 35:523–540.

Kang, Y. (2011). Loanword phonology. In van Oostendorp, M., Ewen, C. J., Hume, E., and Rice, K., editors, The Blackwell Companion to Phonology, pages 2258–2281. Malden, MA & Oxford: Wiley-Blackwell.

Khaltar, B.-O. and Fujii, A. (2009). A lemmatization method for Mongolian and its application to indexing for information retrieval. Information Processing & Management, 45(4):438–451.

Knight, K. and Graehl, J. (1998). Machine transliteration. Computational Linguistics, 24(4):599–612.

Korea Advanced Institute of Science and Technology (1997). Automatically analyzed large scale KAIST corpus [Data file].

Ladefoged, P. (2001). A Course in Phonetics. Orlando: Harcourt Brace, 4th edition.

Maddieson, I. (2013). Syllable structure. In Dryer, M. S. and Haspelmath, M., editors, The World Atlas of Language Structures Online, Leipzig. Max Planck Institute for Evolutionary Anthropology.

Ministry of Culture, Sports, and Tourism of South Korea, and National Institute of the Korean Language (2011). The 21st century Sejong project [Data file].

NIKL (2000a). gukeo eohwiui bunryu mokrok yeongu. Resource document.

NIKL (2000b). pyojuneo geomtoyong jaryo. Resource document.

NIKL (2000c). pyojungukeodaesajeon pyeonchanyong eowon jeongbo jaryo.Resource document.

NIKL (2000d). yongeon hwalyongpyo. Resource document.

NIKL (2008). Survey of the state of loanword usage. [Data file].

NIKL (2013). oeraeeo pyogi yongrye jaryo – romaja inmyeonggwa jimyeong.Resource document.

Nwesri, A. F. A. (2008). Effective Retrieval Techniques for Arabic text. PhDthesis, RMIT University, Melbourne, Australia.

Oh, J.-H. and Choi, K.-S. (2001). Automatic extraction of transliterated foreign words using hidden Markov model. In Proceedings of the International Conference on Computer Processing of Oriental Languages, pages 433–438.

Ravi, S. and Knight, K. (2009). Learning phoneme mappings for transliteration without parallel data. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 37–45.

Selkirk, E. (1984). On the major class features and syllable theory. In Aronoff, M. and Oehrle, R. T., editors, Language Sound Structure: Studies in Phonology Presented to Morris Halle by His Teachers and Students, pages 107–136. Cambridge, MA: MIT Press.

Sohn, H.-M. (1999). The Korean Language. Cambridge: Cambridge University Press.

Uffmann, C. (2006). Epenthetic vowel quality in loanwords: Empirical andformal issues. Lingua, 116(7):1079–1111.

Witten, I. H. and Bell, T. (1991). The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085–1094.

Yoon, K. and Brew, C. (2006). A linguistically motivated approach to grapheme-to-phoneme conversion for Korean. Computer Speech & Language, 20(4):357–381.

Yoon, S.-Y., Kim, K.-Y., and Sproat, R. (2007). Multilingual transliteration using feature based phonetic method. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 112–119.


Appendix A. Rewrite rules for grapheme-to-phoneme conversion

The table below shows letter-to-phoneme correspondences in Korean. The idea is to transcribe the pronunciation of a spelled word by first decomposing syllable-sized characters into letters and then mapping the letters to their matching phonemes one by one. For example, 한글 → ᄒ + ᅡ + ᄂ + ᄀ + ᅳ + ᄅ → [hankɨl].

Letter      Phoneme(s)   Letter      Phoneme(s)   Letter   Phoneme(s)
ᄀ           k            ᄁ           k*           ᄂ        n
ᄃ           t            ᄄ           t*           ᄅ (onset) ɾ
ᄅ (coda)    l            ᄆ           m            ᄇ        p
ᄈ           p*           ᄉ           s            ᄊ        s*
ᄋ (onset)   Null         ᄋ (coda)    ŋ            ᄌ        tʃ
ᄍ           tʃ*          ᄎ           tʃʰ          ᄏ        kʰ
ᄐ           tʰ           ᄑ           pʰ           ᄒ        h
ㅏ           a            ㅑ           j a          ㅐ        æ
ᅤ           j æ          ᅥ           ʌ            ᅧ        j ʌ
ᅦ           e            ᅨ           j e          ᅩ        o
ᅭ           j o          ᅪ           w a          ᅫ        w æ
ᅬ           ø            ᅮ           u            ᅲ        j u
ᅯ           w ʌ          ᅰ           w e          ᅱ        w i
ᅳ           ɨ            ᅵ           i            ᅴ        ɨi
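Under these correspondences, G2P reduces to Unicode arithmetic plus table lookup. The sketch below decomposes each precomposed Hangul syllable into lead, vowel, and tail indices using the standard Unicode formula; cluster codas, which the table above does not cover, are transcribed letter-by-letter as an assumption, and cross-syllable processes such as coda neutralization are ignored, matching the crude rule-based G2P described in the text.

```python
# Phonemes indexed by the Unicode ordering of lead consonants, vowels,
# and tail consonants (empty string = no phoneme).
LEADS = ["k", "k*", "n", "t", "t*", "ɾ", "m", "p", "p*", "s", "s*",
         "", "tʃ", "tʃ*", "tʃʰ", "kʰ", "tʰ", "pʰ", "h"]
VOWELS = ["a", "æ", "ja", "jæ", "ʌ", "e", "jʌ", "je", "o", "wa", "wæ",
          "ø", "jo", "u", "wʌ", "we", "wi", "ju", "ɨ", "ɨi", "i"]
# Cluster codas (e.g. "ks", "lm") transcribed letter-by-letter: an assumption.
TAILS = ["", "k", "k*", "ks", "n", "ntʃ", "nh", "t", "l", "lk", "lm",
         "lp", "ls", "ltʰ", "lpʰ", "lh", "m", "p", "ps", "s", "s*",
         "ŋ", "tʃ", "tʃʰ", "kʰ", "tʰ", "pʰ", "h"]

def g2p(word):
    """Rule-based G2P sketch: decompose each Hangul syllable and map
    its jamo to phonemes via the appendix table."""
    phones = []
    for ch in word:
        code = ord(ch) - 0xAC00
        if not 0 <= code < 11172:   # skip non-Hangul characters
            continue
        phones.append(LEADS[code // 588]
                      + VOWELS[(code % 588) // 28]
                      + TAILS[code % 28])
    return "".join(phones)
```

For the example above, `g2p("한글")` reproduces [hankɨl], with onset ᄅ mapped to [ɾ] and coda ᄅ to [l] by position.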


Table 1: Korean phonemes and their place in the proposed sonority hierarchy.

Class        Phonemes
Obstruents   p p* pʰ t t* tʰ k k* kʰ s s* h tʃ tʃ* tʃʰ
Sonorants    m n ŋ ɾ l w j
Vowels       a e i o u æ ʌ ø ɨ ɨi


Table 2: Data sizes in number of unique words or eojeols.

Class     SEJONG     KAIST      NIKL-1   NIKL-2   JEITA    JMDict
Native    unknown    unknown    49,962   44,214   unknown  130,237
Foreign   unknown    unknown    21,176   18,943   unknown  17,891
Total     1,019,863  2,409,309  71,138   63,157   108,816  148,128


Table 3: Number of consonant clusters each vowel allegedly repairs via insertion.

                           ɨ    e    ʌ    a    æ    u    ɨi   ø    o    i
Before onset consonants    0    3    4    1    1    2    3    5    1    1
Between onset consonants   17   5    6    8    4    10   11   7    9    6
Between coda consonants    0    3    3    2    1    1    0    0    1    2
After coda consonants      11   9    4    5    10   3    2    4    4    5
Total                      28   20   17   16   16   16   16   16   15   14


Table 4: Phoneme strings chosen as potential traces of insertion. Strings in parentheses were not found in any loanwords as traces of insertion.

Data            Potential traces of insertion
SEJONG          kɨɾ, k*ɨɾ, k*ɨn, kʰɨɾ, pɨɾ, p*ɨɾ, pʰɨɾ, sɨm, sɨn, sɨw, tɨɾ, (tɨŋ), tɨl, tʃ*ɨm, (tʃ*ɨn), tʰɨɾ, tʰɨl, ŋkʰɨ, lpʰɨ, lsɨ, ltʃɨ, ltʃʰɨ, ltʰɨ, mpʰɨ, msɨ, nsɨ, (ntʃ*ɨ), ntʰɨ
SEJONG-g2p      kɨɾ, k*ɨɾ, k*ɨn, kʰɨɾ, kʰɨn, pɨɾ, p*ɨɾ, pʰɨɾ, pʰɨw, sɨm, sɨn, sɨw, tɨɾ, (tɨŋ), tɨl, tʃ*ɨm, (t*ɨj), (t*ɨm), tʰɨɾ, tʰɨl, ŋkʰɨ, lpʰɨ, lsɨ, ltʃɨ, (lt*ɨ), ltʰɨ, mpʰɨ, msɨ, nsɨ, ntʃɨ, ntʰɨ
SEJONG-stem     kɨl, kɨm, kʰɨɾ, pɨɾ, p*ɨɾ, pʰɨɾ, sɨn, sɨw, tɨɾ, tɨl, (tʃ*ɨn), (tʃʰɨŋ), tʃʰɨl, t*ɨl, tʰɨɾ, tʰɨw, ŋkʰɨ, lpʰɨ, ltʃɨ, ltʃʰɨ, ltʰɨ, mpʰɨ, (mt*ɨ), nsɨ, ns*ɨ, (ntʃ*ɨ), ntʃʰɨ
SEJONG-morph    kɨɾ, kɨm, kʰɨɾ, pɨɾ, pʰɨɾ, pʰɨw, sɨɾ, sɨn, sɨw, tɨɾ, (tʃ*ɨn), tʃʰɨl, tʰɨɾ, tʰɨw, ŋkʰɨ, ŋt*ɨ, lpʰɨ, lsɨ, ltʃɨ, ltʃʰɨ, ltʰɨ, msɨ, (mt*ɨ), nsɨ, ns*ɨ, (ntʃ*ɨ), ntʃʰɨ, ntʰɨ
KAIST           kɨɾ, kɨn, k*ɨɾ, k*ɨn, kʰɨɾ, pɨɾ, p*ɨw, pʰɨɾ, pʰɨw, sɨm, sɨn, sɨw, s*ɨɾ, tɨɾ, (tɨŋ), tɨl, tʃ*ɨm, t*ɨl, tʰɨɾ, ŋtʰɨ, lpʰɨ, ltʃɨ, ltʰɨ, mpʰɨ, nsɨ, ntʰɨ
JEITA           (bum), gur, (huj), (huw), (kun), kur, pur, (tuj), (tum), ngu, nhu, nku, nsu, nzu


Table 5: Performance of trained classifiers.

Index  Train (+adapt)    Test    Seeding   Learning       Precision  Recall  F-score
1      NIKL-1            NIKL-1  N/A       Supervised     96.88      96.46   96.67
2      SEJONG            NIKL-1  Proposed  Smoothstep EM  94.21      85.51   89.65
3      SEJONG (+NIKL-1)  NIKL-1  Proposed  Smoothstep EM  95.49      94.05   94.77
4      SEJONG            NIKL-1  Proposed  Hard EM        94.21      80.16   86.62
5      SEJONG            NIKL-1  Proposed  Soft EM        47.81      93.35   60.81
6      SEJONG            NIKL-1  Random    Smoothstep EM  95.30      70.98   81.36
7      SEJONG            NIKL-1  Random    Smoothstep EM  95.37      71.75   81.89
8      SEJONG            NIKL-1  Random    Smoothstep EM  95.20      71.89   81.92
9      SEJONG-g2p        NIKL-1  Proposed  Smoothstep EM  94.45      85.03   89.49
10     SEJONG-stem       NIKL-1  Proposed  Smoothstep EM  95.76      73.89   83.42
11     SEJONG-morph      NIKL-1  Proposed  Smoothstep EM  96.35      61.70   75.22
12     NIKL-2            NIKL-2  N/A       Supervised     95.36      94.12   94.73
13     SEJONG            NIKL-2  Proposed  Smoothstep EM  90.05      82.92   86.34
14     SEJONG (+NIKL-2)  NIKL-2  Proposed  Smoothstep EM  93.85      90.89   92.34
15     KAIST             NIKL-2  Proposed  Smoothstep EM  90.53      82.52   86.34
16     KAIST (+NIKL-2)   NIKL-2  Proposed  Smoothstep EM  93.80      91.17   92.46
17     JMDict            JMDict  N/A       Supervised     88.17      84.62   86.36
18     JEITA             JMDict  Proposed  Smoothstep EM  81.20      61.82   70.20
19     JEITA (+JMDict)   JMDict  Proposed  Smoothstep EM  88.00      80.27   83.96


Figure 1: Transformation of E[z(w, c)] to z(w, c).


Figure 2: Precision and recall of the unsupervised classifier over iterations.
