Advisor: Dr. Hsu
Student: Sheng-Hsuan Wang
Department of Information Management

Intelligent Database Systems Lab

國立雲林科技大學 National Yunlin University of Science and Technology


Acquisition of English-Japanese proper nouns from noisy-parallel newswire articles using KATAKANA matching

Toshiba Corp. R&D Center


N.Y.U.S.T.

I. M.

Outline

Motivation
Objective
Introduction
Background
Method
Simulations
Discussion
Conclusion



Motivation

Limitation of statistical approaches



Objective

Superiority of linguistic approaches



Introduction

A tool for extracting bilingual knowledge from noisy-parallel English-Japanese text
  Dynamic programming
  Phonetic similarities
  Partial matching of English-Japanese word pairs
Extract a small, reliable bilingual lexicon of anchor points
Establish further bilingual correspondences



Introduction

Types of bilingual knowledge acquisition from parallel corpora
  Statistical
    Internal distributional evidence of bilingual word pairs
  Linguistic
    External evidence provided by bilingual lexicons to establish anchor points between pairs of bilingual phrases



Background

The challenge of establishing a bilingual correspondence between English and Katakana
  Information is lost when going from English to Katakana
    `r' and `l', or `b' and `v', are typically not distinguished
  Redundant vowel sounds appear when going from Katakana to English
    `fra' in “Frankfurt” is written “フラ”, which transliterates back as `fura'



Background

How previous research dealt with these problems
  Transcribe both words into intermediate representations and match these.
  The matching knowledge may be biased towards English pronunciation.
    “Chirac” => “シラク”: `シ' is pronounced `shi'.



Background

A neutral intermediate representation allows for partial matching
  When two intermediate representations match above a certain threshold, the words are taken to be in a translation relation.
    “パレスチナ” => “Palestine”, “Palestinian”, “Palestinians”



Method

NPT (Nearest Phonetic Transliteration)
  Takes each Katakana word and converts it to a phonetic string representing all English spelling combinations of the word.
    “ブルンジ”, which is “Burundi” in English
    ‘ルー > rloue’
    “buorlouenmgesdjgiou”
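The conversion step can be illustrated with a minimal sketch. The slide gives only the final string and one rule, so the table below is hypothetical: its per-character entries are guesses, chosen only so that their concatenation reproduces the NPT string on the slide. The real rule set may segment the string differently, and `NPT_TABLE` and `to_npt` are illustrative names, not the authors'.

```python
# Hypothetical per-character table: the split of the slide's string into
# these four pieces is an assumption made for illustration only.
NPT_TABLE = {
    "ブ": "buo",
    "ル": "rloue",
    "ン": "nm",
    "ジ": "gesdjgiou",
}


def to_npt(katakana: str) -> str:
    """Concatenate the candidate-letter string of every Katakana character."""
    missing = [ch for ch in katakana if ch not in NPT_TABLE]
    if missing:
        raise ValueError(f"no NPT entry for {missing[0]!r}")
    return "".join(NPT_TABLE[ch] for ch in katakana)


print(to_npt("ブルンジ"))  # buorlouenmgesdjgiou
```

Each entry packs several plausible English spellings of one Katakana character into a single run of letters, which is what lets one NPT string stand for many English spellings at once.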




Method – NPT_score

“Burundi” and “buorlouenmgesdjgiou”

npt: NPT string
e: English string
md: maximum depth
d: depth count
s: score
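The scoring algorithm itself did not survive on this slide, only its variable legend. As a rough sketch of what such a score could look like, the function below scans the NPT string left to right and, for each English letter, looks ahead at most `md` characters for a match. This greedy scheme, the normalisation by word length, and the default `md=5` are all assumptions, not the authors' NPT_score.

```python
def npt_score(npt: str, e: str, md: int = 5) -> float:
    """Fraction of the letters of `e` found, in order, in `npt`.

    Assumed scheme: for each English letter, at most `md` characters of
    `npt` are inspected, starting from the current position.
    """
    e = e.lower()
    if not e:
        return 0.0
    pos = 0  # current position in the NPT string
    s = 0    # score: number of English letters matched so far
    for ch in e:
        hit = npt.find(ch, pos, pos + md)  # -1 if not within the window
        if hit != -1:
            s += 1
            pos = hit + 1  # continue after the matched character
    return s / len(e)


print(npt_score("buorlouenmgesdjgiou", "Burundi"))  # 1.0
```

Under these assumptions every letter of “Burundi” is found in order, while a shorter lookahead such as `md=2` misses the later letters and lowers the score. A word is accepted as a translation when its score clears a chosen threshold.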



Method

Several heuristics save search time and help detect substrings
  Only English words whose first letter is upper case are taken as candidate proper nouns.
  A minimum length is imposed on the Katakana words available for matching.
    “クリスマス” (= “Christmas”) and “Mass”
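The two heuristics above can be sketched as simple filters. The regular expressions, the function names, and the default `min_len=3` are assumptions for illustration; the slide does not state the actual minimum length.

```python
import re

# An upper-case letter followed by any letters: a candidate proper noun.
CAPITALIZED = re.compile(r"\b[A-Z][A-Za-z]*\b")
# Runs of Katakana letters (U+30A1..U+30FA) and the long-vowel mark (U+30FC).
KATAKANA = re.compile(r"[\u30A1-\u30FA\u30FC]+")


def english_candidates(text: str) -> list[str]:
    """Return every capitalized word as a candidate proper noun."""
    return CAPITALIZED.findall(text)


def katakana_words(text: str, min_len: int = 3) -> list[str]:
    """Return Katakana runs of at least `min_len` characters (assumed cutoff)."""
    return [w for w in KATAKANA.findall(text) if len(w) >= min_len]


print(english_candidates("The envoy met Chirac in Burundi."))
print(katakana_words("クリスマスのマス"))
```

The capitalization filter also lets through sentence-initial words such as “The”, so it narrows the candidate set without guaranteeing proper nouns. The length filter drops short Katakana words such as “マス”, which could otherwise match inside longer words.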



Simulations

Two corpora of English and Japanese headline newswire articles.

The test corpus had 150 aligned articles
  1730 English paragraphs and 771 Japanese paragraphs
  871 Katakana words
  9742 potential English proper nouns
  65 comparisons for each Katakana word in each article.
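The figure of 65 comparisons appears to be the average number of candidate proper nouns per article. That reading is an assumption, but the arithmetic agrees with the counts on the slide:

```python
articles = 150
english_candidates = 9742

# Average number of English candidates each Katakana word is compared
# against within one aligned article (assumed interpretation).
per_article = english_candidates / articles
print(round(per_article))  # 65
```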



Simulations

Baseline
  Soundex algorithm
K&H
  Convert the Katakana and the English word to a simplified disjunctive phonetic form.
  Does not allow either partial matches or matching of substrings.
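For reference, here is a compact version of the standard American Soundex code. The slide does not say how Katakana words were fed to Soundex; the romanized form “Burunji” used below is an assumption made only to show the kind of mismatch a whole-word code can produce.

```python
def soundex(word: str) -> str:
    """Standard American Soundex: first letter plus three digits."""
    groups = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
              ("L", "4"), ("MN", "5"), ("R", "6"))
    codes = {ch: digit for letters, digit in groups for ch in letters}
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    result = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


print(soundex("Burundi"))  # B653
print(soundex("Burunji"))  # B652
```

Because the two codes are compared for exact equality, a single differing consonant class is enough to reject the pair, and there is no notion of a partial or substring match.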



Results

F-measure: 81%, 58%, 39%
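The slide does not define the F-measure. Assuming the usual balanced form, it is the harmonic mean of precision $P$ and recall $R$:

```latex
F = \frac{2PR}{P + R}
```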



Discussion

NPT yielded the best result overall.
  A higher threshold gives higher precision.
K&H can’t handle partial matches, and its intermediate form may lose information.
Partial matching
  Finding substrings
  Identify cognatively connected translation pairs
    “インドネシア” => “Indonesia”, “Indonesian”, “Indonesians”, “Indonesias”



Conclusion

Back-transliterating from Katakana to English is unexpectedly difficult.

The set of matching rules is quite small and could be improved.

Future research
  Induce the rules automatically from a corpus of examples.
