
Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018), pages 444–453, Brussels, Belgium, October 31 - November 1, 2018. ©2018 Association for Computational Linguistics


DIMSIM: An Accurate Chinese Phonetic Similarity Algorithm based on Learned High Dimensional Encoding

Min Li
IBM Research
[email protected]

Marina Danilevsky
IBM Research
[email protected]

Sara Noeman
IBM
[email protected]

Yunyao Li
IBM Research
[email protected]

Abstract

Phonetic similarity algorithms identify words and phrases with similar pronunciation, which are used in many natural language processing tasks. However, existing approaches are designed mainly for Indo-European languages and fail to capture the unique properties of Chinese pronunciation. In this paper, we propose a high dimensional encoded phonetic similarity algorithm for Chinese, DIMSIM. The encodings are learned from annotated data to separately map initial and final phonemes into n-dimensional coordinates. Pinyin phonetic similarities are then calculated by aggregating the similarities of initial, final and tone. DIMSIM demonstrates a 7.5X improvement on mean reciprocal rank over the state-of-the-art phonetic similarity approaches.

1 Introduction

Performing the mental gymnastics of transforming ‘I’m hear’ to ‘I’m here,’ or ‘I can’t so buttons’ to ‘I can’t sew buttons,’ is familiar to anyone who has encountered autocorrected text messages, punny social media posts, or just friends with bad grammar. Although at first glance it may seem that phonetic similarity can only be quantified for audible words, this problem is often present in purely textual spaces, such as social media posts or text messages. Incorrect homophones and synophones, whether used in error or in jest, pose challenges for a wide range of NLP tasks, such as named entity identification, text normalization and spelling correction (Chung et al., 2011; Xia et al., 2006; Toutanova and Moore, 2002; Twiefel et al., 2014; Lee et al., 2013; Kessler, 2005). These tasks must therefore successfully transform incorrect words or phrases (‘hear’, ‘so’) to their phonetically similar correct counterparts (‘here’, ‘sew’), which in turn requires a robust representation of phonetic similarity between word pairs.

Pinyin   initial   final   tone
xi1      x         i       1
fan4     f         an      4

Table 1: Example Pinyins.

偶(ou2, 我wo2)稀饭(xi1fan4, 喜欢xi2huan1)你。 I like you.
杯具(bei1ju4, 悲剧bei1ju4)啊, 为一个女孩纸(zhi2, 子zi5)这么香菇(xiang1gu1, 想哭xiang2ku1)。 Sadly, I am heartbroken for a girl.

Table 2: Microblogs using phonetic transcription.

A reliable approach for generating phonetically similar words is equally crucial for Chinese text (Xia et al., 2006).

Unfortunately, most existing phonetic similarity algorithms, such as Soundex (Archives and Administration, 2007) and Double Metaphone (DM) (Philips, 2000), are motivated by English and designed for Indo-European languages. Words are encoded to approximate phonetic representations by ignoring vowels (except foremost ones), which is appropriate where phonetic transcription consists of a sequence of phonemes, such as for English. In contrast, the speech sound of a Chinese character is represented by a single syllable in Pinyin consisting of two or three parts: an initial (optional), a final or compound finals, and tone¹ (Table 1). As a result, phonetic similarity approaches designed for Indo-European languages often fall short when applied to Chinese text. Note that we use Pinyin as the phonetic representation because it is a widely accepted Romanization system (San, 2007; ISO, 2015) of Chinese syllables, used to teach pronunciation of standard Chinese. Table 2 shows two sentences from Chinese microblogs, containing informal words derived from phonetic transcription. The DM and Soundex encodings for

¹ Chinese has five tones, represented on a 1-5 scale.


Words          DM          Soundex
稀xi1饭fan4    S:S,FN:FN   X000,F500
喜xi2欢huan1   S:S,HN:HN   X000,H500
泄xie4愤fen4   S:S,FN:FN   X000,F500

Table 3: DM and Soundex of Chinese words.

Figure 1: Grouping initials by phonetic similarity (top row: zh, ch, sh; bottom row: z, c, s).

near-homonyms of 喜欢 from Table 2 are shown in Table 3. Since both DM and Soundex ignore vowels and tones, words with dissimilar pronunciations are incorrectly assigned to the same encoding (e.g. 稀饭 and 泄愤), while true near-homonyms are encoded much further apart (e.g. 稀饭 and 喜欢). On the other hand, additional candidates with similar phonetic distances, such as 心xin1烦fan2 and 西xi1方fang1 for 稀饭, should be generated, for consumption by downstream applications such as text normalization.

The example highlights the importance of considering all Pinyin components and their characteristics when calculating Chinese phonetic similarity (Xia et al., 2006). One recent work (Yao, 2015) manually assigns a single numerical value to encode and derive phonetic similarity. However, this single-encoding approach is inaccurate since the phonetic distances between Pinyins are not captured well in a one dimensional space. Figure 1 illustrates the similarities between a subset of initials. Initial groups “z, c”, “zh, ch”, “z, zh” and “c, ch” are all similar, which cannot be captured using a one dimensional representation (e.g., an encoding of “zh=0, z=1, c=2, ch=3” fails to identify the “zh, ch” pair as similar.) ALINE (Kondrak, 2003) is another illustration of the challenge of manually assigning numerical values in order to accurately represent the complex relative phonetic similarity relationships across various languages. Therefore, given the perceptual nature of the problem of phonetic similarity, it is critical to learn the distances based on as much empirical data as possible (Kessler, 2005), rather than using a manually encoded metric.

This paper presents DIMSIM, a learned n-dimensional phonetic encoding for Chinese along with a phonetic similarity algorithm, which uses the encoding to generate and rank phonetically

similar words. To address the complexity of relative phonetic similarities in Pinyin components, we propose a supervised learning approach to learn n-dimensional encodings for finals and initials, where n can be easily extended from one to two or higher dimensions. The learning model derives accurate encodings by jointly considering Pinyin linguistic characteristics, such as place of articulation and pronunciation methods, as well as high quality annotated training data sets. We compare DIMSIM to Double Metaphone (DM), Minimum edit distance (MED) and ALINE, demonstrating that DIMSIM outperforms these algorithms by 7.5X on mean reciprocal rank, 1.4X on precision and 1.5X on recall on a real-world dataset. Our contributions are:

1. An encoding for Chinese Pinyin leveraging Chinese pronunciation characteristics.

2. A simple and effective phonetic similarity algorithm to generate and rank phonetically similar Chinese words.

3. An implementation and a comprehensive evaluation showing the effectiveness of DIMSIM over the state-of-the-art algorithms.

4. A package release of the implemented algorithm and a constructed dataset of Chinese words with phonetic corrections.²

2 Generating Phonetic Candidates

DIMSIM generates ranked candidate words with similar pronunciation to a seed word. Similarity is measured by a phonetic distance metric based on n-dimensional encodings, as introduced below.

2.1 Phonetic Comparison for Pinyin

An important characteristic of Pinyin is that the three components, initial, final and tone, can be independently phonetically compared. For example, the phonetic similarity of the finals “ie” and “ue” is identical in the Pinyin pairs {“xie2”, “xue2”} and {“lie2”, “lue2”}, in spite of the varying initials. English, by contrast, does not have this characteristic. Consider as an example the letter group “ough,” which is pronounced quite differently in “rough,” “through” and “though.”

Note that depending on the initials, a final of the same written form can represent different finals. For instance, ü is written as u after j, q and x; uo is written as o after b, p, m, f or w. There are a total

² https://github.com/System-T/DimSim.


of six rewrite rules in Pinyin (ISO, 2015). Since these rules are fixed, we preprocess the Pinyins according to these rules, transforming them into the original form for our internal representation (e.g., we represent ju as jü and bo as buo).
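To make this preprocessing step concrete, the following is a minimal Python sketch of the normalization, covering only the two rewrite rules mentioned above (the full set has six). The function name and the (initial, final) representation are illustrative and not part of the released implementation.

def normalize_pinyin(initial, final):
    # Undo Pinyin spelling conventions so components can be compared directly.
    # Only two of the six rewrite rules are shown here.
    if initial in ("j", "q", "x") and final.startswith("u"):
        # After j, q, x the written "u" actually denotes the vowel "ü".
        final = "ü" + final[1:]
    if initial in ("b", "p", "m", "f", "w") and final == "o":
        # After b, p, m, f, w the written "o" actually denotes "uo".
        final = "uo"
    return initial, final

print(normalize_pinyin("j", "u"))  # ('j', 'ü')
print(normalize_pinyin("b", "o"))  # ('b', 'uo')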

2.2 Measuring Phonetic Similarity

DIMSIM represents a given word w as a list of characters {c_i | 1 ≤ i ≤ K}, where K is the number of characters and p_{c_i} denotes the Pinyin of the i-th character. The initial, final, and tone components of p_{c_i} are denoted as p^I_{c_i}, p^F_{c_i}, and p^T_{c_i}, respectively.

Formally, the phonetic similarity S between the pronunciation of c_i and c'_i is computed using Manhattan distance as the sum of the distances between the three pairs of components, as follows:

\sum_{1 \le i \le K} S(c_i, c'_i) = \sum_{1 \le i \le K} \{ S_p(p^I_{c_i}, p^I_{c'_i}) + S_p(p^F_{c_i}, p^F_{c'_i}) + S_T(p^T_{c_i}, p^T_{c'_i}) \} \quad (1)

Manhattan distance is an appropriate metric since the three components are independent. Any single change does not affect more than one component, and any change affecting several components is the result of multiple independent and additive changes. It follows that the similarity between two words is computed as the sum of the phonetic distances of their characters. For example, the Pinyins of “童鞋” and “同学” are “tong2xie2” and “tong2xue2”. The distance between “童(tong2)” and “同(tong2)” is zero; the distance between “鞋(xie2)” and “学(xue2)” is calculated as S(“鞋”, “学”) = S_p(x, x) + S_p(ie, ue) + S_T(2, 2). Although the characters “鞋(xie2)” and “学(xue2)” are completely different, their Pinyins only differ in their finals.
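As a concrete illustration of Equation 1, the short sketch below computes the distance between two characters from their (initial, final, tone) triples. The two-dimensional coordinates and the tone weight are made-up placeholders used only to show the structure of the computation; the actual values are learned (Section 2.3) and scaled (Section 2.4).

import math

# Hypothetical encodings for illustration only; DIMSIM learns these values.
INITIAL_ENC = {"x": (0.0, 0.0), "l": (40.0, 25.0)}
FINAL_ENC = {"ie": (10.0, 0.0), "ue": (12.0, 1.5)}
TONE_SCALE = 0.1  # chosen so that Max(S_T) < Min(S_p); see Section 2.4

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def char_distance(py1, py2):
    # Equation 1 for one character pair: initial term + final term + tone term.
    (i1, f1, t1), (i2, f2, t2) = py1, py2
    return (euclidean(INITIAL_ENC[i1], INITIAL_ENC[i2])
            + euclidean(FINAL_ENC[f1], FINAL_ENC[f2])
            + TONE_SCALE * abs(t1 - t2))

def word_distance(pys1, pys2):
    # Word-level distance: sum over aligned character pairs.
    return sum(char_distance(a, b) for a, b in zip(pys1, pys2))

# 鞋 (x, ie, 2) vs. 学 (x, ue, 2): only the finals differ.
print(word_distance([("x", "ie", 2)], [("x", "ue", 2)]))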

2.3 Learning Pinyin Encodings

The next task is to compute encodings for initials, finals, and tones. While tonal similarity is easily handled (see Section 2.4), pairwise similarity for initials and finals is more complex. We adopt a supervised learning approach to obtain these encodings, using linguistic characteristics combined with a labeled dataset. The latter consists of word pairs, with specific pairs of initials or finals manually annotated for phonetic similarity. The set of annotated pairs between initials and finals are then used to learn the n-dimensional encodings of initials and finals, which will in turn be used for generating phonetically similar candidates.



               Labial    Alveolar   Retroflex   Alveolo-palatal   Velar
Plosive (u)    b         d                                        g
Plosive (a)    p         t                                        k
Nasal          m         n
Affricate (u)            z          zh          j
Affricate (a)            c          ch          q
Fricative      f         s          sh          x                 h
Liquid                   l          r
Semivowel      y and w
(u) = unaspirated, (a) = aspirated


Figure 2: Table of Pinyin Initials.

2.3.1 Generating Similar Word Pairs

Phonetically similar word pairs are used to create annotations representing the phonetic similarity of initials, or finals. Chinese has 253 pairs of initials and 666 pairs of finals. Annotating examples of all these pairs is labor intensive and error-prone. Assuming twenty word pairs are provided as context per pair, the task quickly blows up to eighteen thousand annotations. However, we observe that the phonetic similarity of Pinyin is greatly impacted by the pronunciation methods and the place of articulation - this allows us to improve the accuracy and simplify the annotation task. Specifically, this is done by grouping Pinyin components into initial clusters and only annotating pairs within each cluster, and representative cluster pairs.

Figure 2 partitions initials into 12 clusters, consisting of “bp”, “dt”, “gk”, “hf”, “nl”, “r”, “jqx”, “zcs”, “zhchsh”, “m”, “y” and “w”, based on the pronunciation method and the place of articulation. “f” and “h” are grouped together as they are both fricative and sound very similar, especially for people from the southeast of China (Zhishihao, 2017). We then eliminate the comparison of pairs that are highly similar or highly dissimilar. For example, as the semivowel initials “y” and “w” are dissimilar to all other initials, we label every initial pair containing one of them with the lowest possible score. To compare between clusters, we randomly choose one initial from each cluster and generate just those comparison pairs. The number of pairs of initials decreases from 253 to 59.
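The resulting annotation budget can be checked with a short sketch. It assumes, as described above, that pairs involving the semivowels y and w are scored directly rather than annotated, and that one (arbitrary) representative per remaining cluster is used for cross-cluster comparisons; this reproduces the count of 59.

from itertools import combinations

clusters = [["b", "p"], ["d", "t"], ["g", "k"], ["h", "f"], ["n", "l"],
            ["r"], ["j", "q", "x"], ["z", "c", "s"], ["zh", "ch", "sh"],
            ["m"], ["y"], ["w"]]

# All pairs within a cluster are annotated.
within = [p for c in clusters for p in combinations(c, 2)]

# Across clusters, one representative per cluster is compared; pairs involving
# the dissimilar semivowels y and w are assigned the lowest score directly.
reps = [c[0] for c in clusters if c[0] not in ("y", "w")]
across = list(combinations(reps, 2))

print(len(within), len(across), len(within) + len(across))  # 14 45 59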

We use a similar method for finals, partitioning them into six groups by the six basic vowels (“a, o, e, i, u, ü”) (e.g., “i, in, ing” are clustered together.) We then use edit distance and common sequence length constraints to guide the pair generation; specifically, we compare a pair of finals if the edit distance between them is 1 or 2. Since


the length of finals on average is two, an edit distance of three means a complete change to the final, resulting in pairs with the lowest similarity. To compare finals across clusters, since the edit distance between any such pair is at least two, we compare pairs only when the length of the common sequence is at least two (for example, “ian” and “uan”), and otherwise assign the lowest possible similarity to the pairs. This drops the number of comparison pairs of finals down to 113.
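The pair-filtering heuristics can be sketched as follows. The paper does not spell out the exact notion of “common sequence”, so this sketch assumes the longest common subsequence; the helper names and cluster inputs are illustrative.

from itertools import combinations

def edit_distance(a, b):
    # Standard Levenshtein distance between two finals.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def common_seq_len(a, b):
    # Longest common subsequence length (assumed reading of "common sequence").
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def final_pairs_to_annotate(cluster, other_cluster=None):
    # Within a cluster: annotate pairs whose edit distance is 1 or 2.
    if other_cluster is None:
        return [(x, y) for x, y in combinations(cluster, 2)
                if edit_distance(x, y) <= 2]
    # Across clusters: annotate only pairs sharing a common sequence of length >= 2.
    return [(x, y) for x in cluster for y in other_cluster
            if common_seq_len(x, y) >= 2]

print(final_pairs_to_annotate(["i", "in", "ing"]))  # all three within-cluster pairs
print(final_pairs_to_annotate(["ian"], ["uan"]))    # [('ian', 'uan')]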

After generating the comparison pairs, we create word pairs whose Pinyins only differ in these pairs. We identify and account for several confounding factors that may affect annotation: 1) the position of the character containing the initial or final being compared; 2) the word length; and 3) the combination of initials and finals. Since most Chinese words are of length two, we only generate word pairs of length two for this task. Providing word pairs of length greater than two would not make much difference to the learned encodings as long as the word pairs are representative.

For a given initial (or final) pair (p1, p2), such as (b, p), we first generate all possible Pinyins containing the component p1, such as bao and bing. For each such Pinyin py, we retrieve all the words of length two in the dictionary whose first or second character has the Pinyin py. Example words for py=“bao” include 包bao1袱fu2. For each such word w, we change the initial (or final) from p1 to p2, retrieve the corresponding words from the dictionary and generate the word pairs to compare. One such example is (包bao1袱fu2, 泡pao4芙fu2). Finally, from the full list we randomly select five word pairs that vary the first character, and five word pairs that vary the second character.

We invite three native Chinese speakers to perform the annotations. For each word pair, the annotators give a label on a 7-point scale representing their agreement, where the labels range from ‘Completely Disagree’ (1) to ‘Completely Agree’ (7). We calculate Krippendorff’s α (Hayes and Krippendorff, 2007) for the initials and finals annotations to be 0.69 and 0.54, representing the inter-annotator agreement. For each word pair, we use Equation 2 to calculate the distance θ from the average value φ of the labels across the annotators. Equation 2 inverts the labels so that the output can be used as a distance metric (phonetically similar initials or finals are closer together), and scales the


Figure 3: Learned initial encodings, n=1.


Figure 4: Learned initial encodings, n=2.

result to more accurately measure phonetic similarities. The parameters a and b are set to 4 and 10^4 by default, but we also show that the performance of our method is not sensitive to the parameter settings (see Section 3.2).

\theta(\phi) = \frac{1}{a^{\phi}} \cdot b \quad (2)
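With the default a = 4 and b = 10^4, and reading Equation 2 as θ(φ) = b / a^φ (which matches the scoring-function variants listed in Section 3.2), the mapping from the 7-point agreement scale to target distances looks as follows; the values are shown only to illustrate the shape of the function.

def annotation_to_distance(phi, a=4.0, b=1e4):
    # Equation 2: higher agreement (phi) maps to a smaller target distance.
    return b / (a ** phi)

for phi in (1, 4, 7):
    print(phi, round(annotation_to_distance(phi), 2))
# 1 2500.0
# 4 39.06
# 7 0.61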

2.3.2 Learning Model

Once the average distances between pairs are computed from the annotated data sets, we define a constrained optimization to compute encodings of the initials and finals. The final goal is to map each initial (or final) to an n-dimensional point.

The distance S_p of a pair p of points (x_1, x_2, ..., x_n), (y_1, y_2, ..., y_n) is calculated using Euclidean distance as shown in Equation 3.

S_p = \sqrt{\sum_{1 \le i \le n} (x_i - y_i)^2} \quad (3)

The model aims to minimize the sum of the absolute differences between the Euclidean distances of component pairs and the average distances obtained from the annotated training data, across all pairs C for initials (or finals). We also incorporate a penalty function, τ_p, for pairs deviating from the manually annotated distance θ, so that more phonetically similar pairs are penalized more highly (we discuss τ further in Section 3.2). Equation 4 represents the cost function:

\min \sum_{p \in C} |S_p^2 - \theta_p^2| \cdot \tau_p \quad (4)
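A minimal sketch of this optimization is given below, using a generic numerical optimizer over a toy set of target distances. The symbols, target values, penalty form and optimizer choice are placeholders, not the actual training setup; the real model uses the annotated pairs and the τ variants discussed in Section 3.2. The learned points then play the role of the coordinates visualized in Figures 3 and 4.

import numpy as np
from scipy.optimize import minimize

symbols = ["z", "c", "zh", "ch"]
n = 2  # encoding dimension
# Toy target distances theta_p (hypothetical; real values come from Equation 2).
targets = {("z", "c"): 10.0, ("zh", "ch"): 10.0, ("z", "zh"): 15.0,
           ("c", "ch"): 15.0, ("z", "ch"): 60.0, ("c", "zh"): 60.0}

def cost(flat):
    # Equation 4: sum over pairs of |S_p^2 - theta_p^2| * tau_p.
    enc = {s: flat[i * n:(i + 1) * n] for i, s in enumerate(symbols)}
    total = 0.0
    for (a, b), theta in targets.items():
        sp2 = float(np.sum((enc[a] - enc[b]) ** 2))  # squared Equation 3
        tau = 1.0 / theta  # heavier penalty for pairs annotated as more similar
        total += abs(sp2 - theta ** 2) * tau
    return total

x0 = np.random.default_rng(0).normal(scale=10.0, size=len(symbols) * n)
res = minimize(cost, x0, method="Nelder-Mead",
               options={"maxiter": 20000, "fatol": 1e-8})
learned = {s: res.x[i * n:(i + 1) * n].round(1) for i, s in enumerate(symbols)}
print(learned)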


One main advantage of our learning model is that it is generic and can easily extend to any n-dimensional space. Based on the structure of Figure 2, we intuit that extending beyond one dimension will yield more accurate encodings. Figures 3 and 4 visualize the computed encodings of initials when setting n=1 and n=2. We see that when n = 2, the locations of initial coordinates align well with Figure 2. In particular, the twelve groups are clustered in the pattern defined in Section 2.3.1. For example, “bp”, “gk” and “jqx” are separated into different clusters. However, while Figure 2 indicated the basic clusters for the initials, our learned model goes further by actually quantifying the inter- and intra-cluster similarities. Specifically, clusters “c, ch, j, q, x” are tighter than clusters “c, c, h” and “d, t”, whereas the clusters “m” and “n, r, l” are well separated from other clusters. Interestingly, the learning algorithm organically discovers new clusters that are not reflected in Figure 2; namely that “r, n” and “r, l” are pairs of phonetically similar initials.

When n = 1, the learned model collapses the coordinates into one dimension (Figure 3). We observe that the predefined clusters are not well aligned, and many clusters are mixed together (e.g., “bp”, “gk”, “nl”, “dt”), preventing DIMSIM from considering variations within a cluster to be more similar than variations between clusters. Visually comparing Figures 3 and 4 gives the intuition for why DIMSIM with n = 2 performs better than DIMSIM with n = 1, which is in turn reflected in our evaluation results. Section 3 presents the effects that varying the number of dimensions has on evaluation results.

2.4 Phonetic Tone Similarity

There are five tones in Chinese, represented by a tone number scale ranging from 1 to 5. It is simple to use the tone numbers as tone encodings and the difference between the tones of two Pinyins as the raw measure of distance (e.g., S_T(xue2, xue4) = 4 − 2 = 2). One exception is that we encode tone 3 as the numerical value 2.5, since tone 3 is more similar to tone 2 than to tone 4 according to the relative pitch changes of the four tones (ISO, 2015). However, this measure must first be scaled to be comparable to the pairwise phonetic distances of initials and finals. There is an additional constraint: any pairwise difference in initials or finals must have

Input: Word w, Threshold th, Dict dict
Output: Words outws
begin
    pys = getPinyins(w, dict)
    headPys = getSimPinyins(pys(0), th)
    headWords = getWordsWithHeadPy(headPys, dict)
    for cw ∈ headWords do
        if cw.size ≠ w.size then
            continue
        end
        sim = getSimilarity(cw, w)
        if sim ≤ th then
            outws.add(cw)
        end
    end
    sortByAscSim(outws)
    return outws
end
Algorithm 1: Generating phonetic candidates.

a greater negative effect on the phonetic similarity between characters than any difference in tones. For example, S(xue1, xue5) < S(xue1, lue1) even though xue1 and xue5 are at opposite ends of the tone scale. We therefore scale S_T such that Max(S_T) < Min(S_p).
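The tone term can then be sketched as below. The scaling constant is illustrative; in practice it would be derived from the learned initial and final encodings so that Max(S_T) < Min(S_p) holds.

# Tone encodings: tone 3 is mapped to 2.5 because it is closer to tone 2 than
# to tone 4 in relative pitch; the other tones keep their tone number.
TONE_ENC = {1: 1.0, 2: 2.0, 3: 2.5, 4: 4.0, 5: 5.0}

def tone_distance(t1, t2, min_sp=10.0):
    # Raw distance is the absolute difference of the tone encodings (at most 4),
    # rescaled so that the largest tone distance stays below min_sp, the
    # smallest non-zero initial/final distance (illustrative value).
    raw = abs(TONE_ENC[t1] - TONE_ENC[t2])
    return raw * (0.9 * min_sp) / 4.0

print(tone_distance(2, 4))  # 4.5: a tone change costs relatively little
print(tone_distance(1, 5))  # 9.0: even the largest tone gap stays below min_sp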

2.5 Candidate Generation and Ranking

Having determined the phonetic encodings and the mechanism to compute the phonetic similarity using the learned phonetic encodings, we now describe how to generate and rank similar candidates in Algorithm 1. Given a word w, a similarity threshold th, and a Chinese Pinyin dictionary dict, we retrieve the Pinyin py of w from dict. We derive a list of Pinyins Pys whose similarity to py falls within the threshold th. These are used to generate a list of words with the same Pinyin in Pys and the same number of characters as w. We calculate the similarity of each candidate word with w using Equation 1 and filter out candidates that fall outside the similarity threshold th. Thus, th is a parameter that affects the precision and recall of the generated candidates. A larger th generates more candidates, increasing recall while decreasing precision.³ Finally, we output the candidates ranked in ascending order by similarity distance.

³ We study the impact of varying th in Section 3.


3 Evaluation

We collect 350 words from social media (Wu, 2016), and annotate each with 1-3 phonetically similar words. We use a community-maintained free dictionary to map characters to Pinyins (CEDict, 2016). We compare DIMSIM

with Double Metaphone (DM) (Philips, 2000), ALINE (Kondrak, 2003) and Minimum edit distance (MED) (Navarro, 2001) in terms of precision (P), recall (R), and average Mean Reciprocal Rank (MRR) (Voorhees et al., 1999). We calculate recall automatically using the full test set of word pairs (Wu, 2016). Since downstream applications will only consider a limited number of candidates in practice, we evaluate precision via a manual annotation task on the top-ranked candidates generated by each approach. DM considers word spelling, pronunciation and other miscellaneous characteristics to encode the word into a primary and a secondary code. DM as one of the baselines is known to perform poorly at ranking the candidates (Carstensen, 2005) since only two codes are used. We therefore use our method (Equation 1) to rank the DM-generated candidates, to create a second baseline, DM-rank.⁴ The third baseline, ALINE, measures phonetic similarity based on manually coded multi-valued articulatory features weighted by their relative importance with respect to feature salience (again, manually determined). MED, the last baseline, computes similarity as the minimum-weight series of edit operations that transforms one sound component into another.

3.1 The Effectiveness of DIMSIM

Recall and MRR: We compare DIMSIM to DM, DM-rank, ALINE and MED. DIMSIM1 and DIMSIM2 denote DIMSIM with encoding dimension n = 1 and n = 2, respectively. As shown in Figure 5, DIMSIM2 improves recall by factors of 1.5, 1.5, 1.3 and 1.2, and improves MRR by factors of 7.5, 1.4, 1.03 and 1.2 over DM, DM-Rank, ALINE and MED, respectively. DM performs relatively poorly, as it is designed for English, and does not accurately reflect Chinese pronunciation. Ranking DM candidates using the DIMSIM phonetic distance defined in Equation 1 improves its average MRR by a factor of 5.5. However, even DM-Rank is outperformed by the simple MED

⁴ We do not compare with Soundex as DM is accepted to be an improved phonetic similarity algorithm over Soundex.


Figure 5: Recall and MRR.


Figure 6: Precision and MRR.

baseline, demonstrating the inherent problem with DM’s coarse encodings. While ALINE has a similar recall to DIMSIM, it performs worse on MRR than DIMSIM2 because it does not have a direct representation of compound vowels for Pinyin. It measures distance between compound vowels using phonetic features of basic vowels, which leads to inaccuracy. In turn, MED struggles with representing accurate phonetic distances between initials, since most initials are of length 1, and the edit distance between any two characters of length 1 is identical. In contrast, DIMSIM encodes initials and finals separately, and thus even a 1-dimensional encoding (DIMSIM1) outperforms the other baselines. Finally, the intuition of Figures 3 and 4 is reflected in the data, as DIMSIM2 outperforms DIMSIM1 by 14% (MRR).

Precision and MRR: Here we evaluate the quality of the candidate ranking since in practice, downstream applications consider only a small number of possible candidates for every word. We ask two native Chinese speakers to annotate the quality of the generated candidates. Choosing 100 words randomly from the test set, we use DM-rank, MED, ALINE and DIMSIM2 to generate top-K candidates for each seed word (K = 5).⁵

The annotators mark each candidate as phonetically similar to the seed word (1) or not (0), also marking the one candidate they believe to be the most similar-sounding (2), which may be any of

⁵ We do not evaluate DM and DIMSIM1 as they perform worse than DM-Rank and DIMSIM2, respectively.


the top-K candidates. We then compute precision and average MRR using the obtained annotations. We achieve inter-level agreement (ILA) of 0.75 for P and ILA of 0.84 for average MRR. DIMSIM

once again outperforms MED and DM-Rank by up to 1.4X for precision and 1.24X for MRR. Since the only criterion for picking the best top-K candidate is phonetic similarity, this demonstrates that DIMSIM ranks the most phonetically similar candidates higher than the other baselines.

θ(φ):           1/2^φ · 10^2   1/4^φ · 10^4   1/φ^2 · 10^2   1/φ^4 · 10^3
τ(φ) = None     F10            F20            F30            F40
τ(φ) = 2^φ      F11            F21            -              -
τ(φ) = 4^φ      F12            F22            -              -
τ(φ) = φ^2      -              -              F33            F34
τ(φ) = φ^4      -              -              F43            F44

Table 4: Variations of θ, τ.


Figure 7: Distance by θ.


Figure 8: Impact of varying θ,τ .

3.2 Impact of Scoring and Penalty Functions

We study the sensitivity of DIMSIM to varying the scoring and penalty functions, using recall and average MRR for evaluation. Table 4 shows four different scoring functions θ and penalty functions τ (including the variation of not using a penalty function) to convert the annotator scores φ to pairwise distances S, following Equation 4.

Figure 7 depicts the values of the four scoring functions θ as a function of the annotator scores on a log 10 scale, to demonstrate the effect of varying a and b, as well as using φ as the base or the exponent. Figure 8 demonstrates how sensitive our model is to the different combinations of scoring and penalty functions. We see that although recall is entirely insensitive to the variations, the performance on MRR is impacted. There is a clear preference for the variations on the “diagonal” of Table 4: F11, F22, F33, F44, but the near-identical performance of these variations demonstrates DIMSIM’s robustness to the particular scoring and penalty functions used. Note that not using a penalty function impacts MRR significantly.


Figure 9: Impact of varying n.

3.3 Impact of the Encoding Dimensions

As demonstrated above, encoding initials and finals into a two-dimensional space is more effective than a one-dimensional space. Figure 9 presents the results of continuing to increase the number of dimensions, n = [1, 4]. We observe that recall is barely affected, with all variations able to successfully identify the targeted words 98% to 99% of the time. We also see that moving from n=1 to n=2 increases the average MRR by 1.14X. However, further increasing the number of dimensions to n>2 no longer improves average MRR, indicating that learning a two-dimensional encoding is enough to capture the phonetic relationships between Pinyin components.

3.4 Impact of the Distance Threshold

We examine how the similarity distance threshold (th) impacts DIMSIM by varying th from 2 to 4096 (Figure 10), using the scoring function F22. As th increases, recall increases from 0.75 to 0.99, converging when th reaches 2048. By increasing th, DIMSIM matches more characters that are



Figure 10: Impact of varying th.

similar to the first character of the given word, which in turn increases the number of candidates within the distance. Thus, the probability of including the labeled gold standard words in the results increases. MRR is less sensitive to th, converging when th reaches 128. However, the generated set of candidate words is reduced too much for th < 128, hurting the performance on MRR. To ensure both high recall and MRR we set th = 2500.


Figure 11: Varying candidate nc.

3.5 Impact of Number of Candidates

While generating more candidates improves the recall, presenting too many candidates to a downstream application is not desirable. To find a balance, we study the impact of varying the upper limit of the number of generated candidates nc from 2 to 2048 (Figure 11). We find that MRR converges at 64 candidates, while recall takes longer; however, setting the upper limit at 64 candidates already achieves almost 98% recall, suggesting it as a reasonable cutoff in practice. Unless otherwise mentioned, we set nc = 1,000 for experiments, to isolate the impact of this parameter.

3.6 Error Analysis

We analyze and summarize three types of errors made by DIMSIM. The first occurs when targeted words are out of vocabulary (OOV). For instance, for the original word “药丸”, the targeted word is

“要完”, which is OOV. As is commonly the case in text normalization applications which convert informal language to well-formed terms, our method works as long as the targeted words are in the dictionary. This shortcoming is generally alleviated by adding new terms to the dictionary. Second, DIMSIM cannot derive phonetic candidates from dialects that are not encoded in our mapping table. For example, for “冻(dong4)蒜(suan4)”, the targeted word “当(dang1)选(xuan2)” is obtained using the pronunciation of the southern Fujian dialect. However, our approach can easily be extended to incorporate and capture such variants by learning mapping tables for each dialect and using them to generate corresponding candidates. Finally, we constrain DIMSIM to not identify candidates that differ in length from the seed word, as we observe that most transcriptions have the same word length - though some corner cases do occur.

4 Related Work

There is a plethora of work focusing on the phonetic similarities between words and characters (Archives and Administration, 2007; Mokotoff, 1997; Taft, 1970; Philips, 1990, 2000; Elsner et al., 2013). These algorithms encode words with similar pronunciation into the same code. For example, Soundex (Archives and Administration, 2007) converts words into a fixed length code through a mapping table of initial groups to ordinal numbers. These algorithms fail to capture Chinese phonetic similarity since the conversion rules do not consider pronunciation properties of Pinyin. Linguists in the phonetics and phonology community have also proposed several phonetic comparison algorithms (Kessler, 2005; Mak and Barnard, 1996; Nerbonne and Heeringa, 1997; Ladefoged, 1969; Kondrak, 2003) for determining the similarity between speech forms. However, as features of articulatory phonetics are manually assigned, these algorithms fall short in capturing the perceptual essence of phonetic similarity through empirical data (Kessler, 2005). In contrast, DIMSIM achieves high accuracy by learning the encodings both from high quality training data sets and linguistic Pinyin features.

Several works in Named Entity translation (Lin and Chen, 2002; Lam et al., 2004; Kuo et al., 2007; Chung et al., 2011) focus on learning the phonetic similarity between English and Chinese automatically. These approaches first represent English and


Chinese words in basic phoneme units and apply edit distance algorithms to compute the similarity. Training frameworks are then used to learn the similarity. However, the phonetic similarity used in these systems cannot be applied to Chinese words since Pinyin has its own specific characteristics, which do not easily map to English, for determining phonetic similarity. Another main application of phonetic similarity algorithms is text normalization (Xia et al., 2006; Li et al., 2003; Han et al., 2012; Sonmez and Ozgur, 2014; Qian et al., 2015), where phonetic similarity is measured by a combination of initial and final similarities. However, the encodings used in these approaches are too coarse-grained, yielding low F1 measures. DIMSIM learns separate high dimensional encodings for initials and finals, and uses them to calculate and rank the distances between Pinyin representations of Chinese word pairs. Karl Stratos (Stratos, 2017) proposes a sub-character architecture to deal with the data sparsity problem in Korean language processing by breaking down each Korean character into a small set of primitive phonetic units. However, this work does not address the problem of phonetic similarity and is thus orthogonal to DIMSIM.

5 Conclusion

Motivated by phonetic transcription as a widely observed phenomenon in Chinese social media and informal language, we have designed an accurate phonetic similarity algorithm. DIMSIM generates phonetically similar candidate words based on learned encodings that capture the pronunciation characteristics of Pinyin initial, final, and tone components. Using a real world dataset, we demonstrate that DIMSIM effectively improves MRR by 7.5X, recall by 1.5X and precision by 1.4X over existing approaches.

The original motivation for this work was to improve the quality of downstream NLP tasks, such as named entity identification, text normalization and spelling correction. These tasks all share a dependency on reliable phonetic similarity as an intermediate step, especially for languages such as Chinese where incorrect homophones and synophones abound. We therefore plan to extend this line of work by applying DIMSIM to downstream applications, such as text normalization.

References

National Archives and Records Administration. 2007. The Soundex Indexing System. https://www.archives.gov/research/census/soundex.html.

Adam Carstensen. 2005. An Introduction to Double Metaphone and the Principles Behind Soundex. http://www.b-eye-network.com/view/1596.

CEDict. 2016. CC-CEDICT. https://www.mdbg.net/chindict/chindict.php?page=cc-cedict.

Jen-Ming Chung, Fu-Yuan Hsu, Cheng-Yu Lu, Hahn-Ming Lee, and Jan-Ming Ho. 2011. Automatic English-Chinese name translation by using web-mining and phonetic similarity. In IEEE International Conference on Information Reuse and Integration, pages 283–287.

Micha Elsner, Sharon Goldwater, Naomi Feldman, and Frank Wood. 2013. A joint learning model of word segmentation, lexical acquisition, and phonetic variability. In Proc. EMNLP.

Bo Han, Paul Cook, and Timothy Baldwin. 2012. Automatically constructing a normalisation dictionary for microblogs. In Proc. of the joint conference on EMNLP and CoNLL, pages 421–432. ACL.

Andrew F. Hayes and Klaus Krippendorff. 2007. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1):77–89.

ISO. 2015. ISO 7098: Romanization of Chinese. https://www.iso.org/standard/61420.html.

Brett Kessler. 2005. Phonetic comparison algorithms. Transactions of the Philological Society, 103(2):243–260.

Grzegorz Kondrak. 2003. Phonetic alignment and similarity. Computers and the Humanities, 37(3):273–291.

Jin-Shea Kuo, Haizhou Li, and Ying-Kuei Yang. 2007. A phonetic similarity model for automatic extraction of transliteration pairs. ACM Transactions on Asian Language Information Processing (TALIP), 6(2):6.

Peter Ladefoged. 1969. The measurement of phonetic similarity. In Proceedings of the 1969 Conference on Computational Linguistics, pages 1–14. Association for Computational Linguistics.

Wai Lam, Ruizhang Huang, and Pik-Shan Cheung. 2004. Learning phonetic similarity for matching named entity translations and mining new translations. In Proc. ACM SIGIR, pages 289–296.

Chia-ying Lee, Yu Zhang, and James R. Glass. 2013. Joint learning of phonetic units and word pronunciations for ASR. In EMNLP, pages 182–192.


Hong-lian Li, Wei He, and Bao-zong Yuan. 2003. A kind of Chinese text strings' similarity and its application in speech recognition. Journal of Chinese Information Processing, 17(1):60–64.

Wei-Hao Lin and Hsin-Hsi Chen. 2002. Backward machine transliteration by learning phonetic similarity. In Proc. CoNLL, pages 1–7. ACL.

Brian Mak and Etienne Barnard. 1996. Phone clustering using the Bhattacharyya distance. In Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on, volume 4, pages 2005–2008. IEEE.

Gary Mokotoff. 1997. Soundexing and genealogy. Avotaynu.

Gonzalo Navarro. 2001. A guided tour to approximate string matching. ACM Computing Surveys (CSUR), 33(1):31–88.

John Nerbonne and Wilbert Heeringa. 1997. Measuring dialect distance phonetically. In Proceedings of the Third Meeting of the ACL Special Interest Group in Computational Phonology (SIGPHON-97).

Lawrence Philips. 1990. Hanging on the metaphone. Computer Language, 7(2):6.

Lawrence Philips. 2000. The double metaphone search algorithm. C/C++ Users Journal, 18(6):38–43.

Tao Qian, Yue Zhang, Meishan Zhang, Yafeng Ren, and Dong-Hong Ji. 2015. A transition-based model for joint segmentation, POS-tagging and normalization. In EMNLP, pages 1837–1846.

Duanmu San. 2007. The Phonology of Standard Chinese. Oxford University Press.

Cagil Sonmez and Arzucan Ozgur. 2014. A graph-based approach for contextual text normalization. In EMNLP, pages 313–324.

Karl Stratos. 2017. A sub-character architecture for Korean language processing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 721–726.

Robert L. Taft. 1970. Name search techniques. 1. Bureau of Systems Development, New York State Identification and Intelligence System.

Kristina Toutanova and Robert C. Moore. 2002. Pronunciation modeling for improved spelling correction. In Proc. ACL, pages 144–151, Stroudsburg, PA, USA.

Johannes Twiefel, Timo Baumann, Stefan Heinrich, and Stefan Wermter. 2014. Improving domain-independent cloud-based speech recognition with domain-dependent phonetic post-processing. In AAAI, pages 1529–1536.

Ellen M. Voorhees et al. 1999. The TREC-8 question answering track report. In TREC, volume 99, pages 77–82.

You Wu. 2016. Commonly used phonetic vocabulary. http://www.51wendang.com/doc/97585e99067d692a1bbaec92.

Yunqing Xia, Kam-Fai Wong, and Wenjie Li. 2006. A phonetic-based approach to Chinese chat text normalization. In Proc. ACL, pages 993–1000.

Mabus Yao. 2015. An algorithm for Chinese similarity based on phonetic grapheme coding. http://mabusyao.iteye.com/blog/2267661.

Zhishihao. 2017. Differentiating f and h. http://www.zhishihao.com/xue/show/51763.