Transcript
Page 1: Learning Formulation and Transformation Rules for Multilingual Named Entities

Intelligent Database Systems Lab

國立雲林科技大學 (National Yunlin University of Science and Technology)

Learning Formulation and Transformation Rules for Multilingual Named Entities

Advisor : Dr. Hsu

Reporter : Chun Kai Chen

Authors : Hsin-Hsi Chen, Changhua Yang and Ying Lin

Proceedings of the ACL 2003

Page 2: Learning Formulation and Transformation Rules for Multilingual Named Entities

Outline

Motivation
Objective
Introduction
Multilingual Named Entity Corpora
Rule Mining
Experimental Results
Conclusions
Personal Opinion

Page 3: Learning Formulation and Transformation Rules for Multilingual Named Entities

Motivation

Past work on multilingual named entities has emphasized transliteration

However, transforming a named entity from one language into another is not a matter of transliteration alone
─ Victoria Fall - 維多利亞瀑布
─ Little Rocky Mountains - 小落磯山脈
─ Kenmare - 康美爾
─ East Chicago - 東芝加哥

Page 4: Learning Formulation and Transformation Rules for Multilingual Named Entities

Objective

Propose a method to extract
─ formulation rules of named entities for individual languages
─ transformation rules for mapping among languages

Apply the results to cross-language information retrieval (CLIR)

Page 5: Learning Formulation and Transformation Rules for Multilingual Named Entities

Introduction (1/3)

In the past, named entity extraction
─ mainly focused on general domains
─ was employed in various applications such as information retrieval and question answering

Page 6: Learning Formulation and Transformation Rules for Multilingual Named Entities

Introduction (2/3)

Most previous approaches
─ dealt with monolingual named entity extraction
─ Chen et al. (1998) extended it to cross-language information retrieval (CLIR)

A grapheme-based (字母, letter-level) model
─ was proposed to compute the similarity between a Chinese transliterated name and an English name

Lin and Chen (2000) further classified the work into two directions
─ forward transliteration (Wan and Verspoor, 1998)
─ backward transliteration (Chen et al., 1998; Knight and Graehl, 1998)
─ and proposed a phoneme-based model

Page 7: Learning Formulation and Transformation Rules for Multilingual Named Entities

Introduction (3/3)

This paper studies
─ how language and named entity type affect the choice between translation and transliteration
─ only the three more challenging named entity types, i.e.,
  named people
  named locations
  named organizations

Page 8: Learning Formulation and Transformation Rules for Multilingual Named Entities

Multilingual Named Entity Corpora

NICT location name corpus
─ developed by the Ministry of Education of Taiwan in 1995
─ consists of three parts: foreign location name, Chinese transliteration/translation name, country name
  (Victoria Fall, “維多利亞瀑布” (wei duo li ya pu bu), South Africa)

CNA personal name and organization corpora
─ used by news reporters to unify name transliteration/translation in news stories

Page 9: Learning Formulation and Transformation Rules for Multilingual Named Entities

Rule Mining

Frequency-Based Approach with a Bilingual Dictionary
Keyword Extraction without a Bilingual Dictionary
Extraction of Transformation Rules
Extraction of Keywords at a Distance

Page 10: Learning Formulation and Transformation Rules for Multilingual Named Entities

Learning Formulation and Transformation Rules (overview)

Frequency-Based Approach with a Bilingual Dictionary
─ Victoria Fall: Victoria ⇔ “維多利亞”, Fall ⇔ “瀑布”
─ dictionary entry: “Mountain” ⇔ “山”

Keyword Extraction without a Bilingual Dictionary
─ handles cases such as World Taiwanese Association ⇔ “世台會”
─ generate candidates by decomposing each pair
  (s6) {Aletschhorn Mountain, 阿利奇赫恩山}
  {Aletschhorn, 阿 利 奇 赫 恩 山} {e1, 阿利 利奇 奇赫 赫恩 恩山} {e1, …} {e1, 阿利奇赫恩山}
  {Mountain, 阿 利 奇 赫 恩 山} {e2, 阿利 利奇 奇赫 赫恩 恩山} {e2, …}
  (s7) {Catalan Mountain, 卡太蘭山}
  {Catalan, 卡 太 蘭 山} {e1, 卡太 太蘭 蘭山} {e1, …} {e1, 卡太蘭山}
  {Mountain, 卡 太 蘭 山} {e2, 卡太 太蘭 蘭山} {e2, …} {e2, 卡太蘭山}
─ count the frequency (TFIDF)
─ result: {Mountain, “山” (shan)}

Extraction of Transformation Rules
─ (s6’) γ Mountain ⇔ δ 山
─ (s7’) γ Mountain ⇔ δ 山
─ (s8’) γ Strait ⇔ δ 海峽
─ (s9’) γ, Strait of ⇔ δ 海峽

Extraction of Keywords at a Distance
─ “American Civil Liberties Union”
─ “American ∆ Liberties Union”
─ “American Civil ∆ Union”
─ “American ∆ Union”

Page 11: Learning Formulation and Transformation Rules for Multilingual Named Entities

Frequency-Based Approach with a Bilingual Dictionary

We postulate that
─ a transliterated term is usually an unknown word that is not listed in a lexicon
─ a translated term often appears in a lexicon

Under this postulate
─ a translated term (翻譯詞) occurs more often in a corpus, e.g., Fall ⇔ “瀑布”
─ a transliterated term (音譯詞) appears only a few times, e.g., Victoria ⇔ “維多利亞”

Page 12: Learning Formulation and Transformation Rules for Multilingual Named Entities

Frequency-based method (1/2)

The simple frequency-based method computes term frequencies and uses them to tell apart the transliterated and translated parts of a named entity
─ Compute the frequency of each word in the foreign name list
─ Keep the words that
  appear more often than a threshold
  appear in a common foreign dictionary
  these words form the candidate simple keywords, e.g., Mountain
─ Examine the foreign word list again
─ Cluster the Chinese name list based on the foreign keywords
  a bilingual dictionary may be consulted here, e.g., “Mountain” ⇔ “山”
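The steps on this slide can be sketched in Python. This is only an illustration under stated assumptions: the function names, the toy dictionary, and the threshold value are hypothetical and not taken from the paper, and the four name pairs are the (s6)-(s9) examples used elsewhere in the slides.

```python
from collections import Counter

def simple_keywords(foreign_names, dictionary, threshold):
    """Return words that occur more than `threshold` times and are in the dictionary."""
    counts = Counter(
        word.strip(",").lower()
        for name in foreign_names
        for word in name.split()
    )
    return {w for w, n in counts.items() if n > threshold and w in dictionary}

def cluster_by_keyword(pairs, keywords):
    """Group Chinese names by the foreign keyword found in their foreign name."""
    clusters = {k: [] for k in keywords}
    for foreign, chinese in pairs:
        for word in foreign.split():
            key = word.strip(",").lower()
            if key in clusters:
                clusters[key].append(chinese)
    return clusters

# Toy data: (s6)-(s9) from the slides; the dictionary is a hypothetical stand-in
# for "a common foreign dictionary".
pairs = [
    ("Aletschhorn Mountain", "阿利奇赫恩山"),
    ("Catalan Mountain", "卡太蘭山"),
    ("Cook Strait", "科克海峽"),
    ("Dover, Strait of", "多佛海峽"),
]
dictionary = {"mountain", "strait", "of"}

keywords = simple_keywords([f for f, _ in pairs], dictionary, threshold=1)
clusters = cluster_by_keyword(pairs, keywords)
```

Each cluster then holds the Chinese names that share a foreign keyword, which is where a bilingual dictionary entry such as “Mountain” ⇔ “山” could be checked against the names.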

Page 13: Learning Formulation and Transformation Rules for Multilingual Named Entities

Frequency-based method (2/2)

NICT location name corpus
─ River (河, he), Island (島, dao), Lake (湖, hu), Mountain (山, shan), Bay (灣, wan), Mountain (峰, feng), Peak (峰, feng)
─ “Mountain” ⇔ “山” (shan) and “峰” (feng)
─ “峰” (feng) ⇔ “Mountain” and “Peak”

CNA organization name corpus
─ Suffix: Association (協會, xie hui), University (大學, da xue)
─ Prefix: International (國際, guo ji), World (世界, shi jie), American (美國, mei guo)

Page 14: Learning Formulation and Transformation Rules for Multilingual Named Entities

Keyword Extraction without a Bilingual Dictionary (problem)

Abbreviation is commonly adopted in translation, and a dictionary-based approach can hardly capture this phenomenon
─ (World Taiwanese Association, “世台會”)

Another approach, which needs no dictionary, is therefore proposed

Page 15: Learning Formulation and Transformation Rules for Multilingual Named Entities

Keyword Extraction without a Bilingual Dictionary (process)

(s6) Aletschhorn Mountain ⇔ 阿利奇赫恩山
─ {e1, s1 s2 … st}
  {Aletschhorn, 阿 利 奇 赫 恩 山}
  {e1, 阿利 利奇 奇赫 赫恩 恩山}
  {e1, 阿利奇 利奇赫 奇赫恩 赫恩山}
  {e1, 阿利奇赫 利奇赫恩 奇赫恩山}
  {e1, 阿利奇赫恩 利奇赫恩山}
  {e1, 阿利奇赫恩山}
─ {e2, s1 s2 … st}
  {Mountain, 阿 利 奇 赫 恩 山}
  {e2, 阿利 利奇 奇赫 赫恩 恩山}
  {e2, 阿利奇 利奇赫 奇赫恩 赫恩山}
  {e2, 阿利奇赫 利奇赫恩 奇赫恩山}
  {e2, 阿利奇赫恩 利奇赫恩山}
  {e2, 阿利奇赫恩山}

(s7) Catalan Mountain ⇔ 卡太蘭山
─ {e1, s1 s2 … st}
  {Catalan, 卡 太 蘭 山}
  {e1, 卡太 太蘭 蘭山}
  {e1, 卡太蘭 太蘭山}
  {e1, 卡太蘭山}
─ {e2, s1 s2 … st}
  {Mountain, 卡 太 蘭 山}
  {e2, 卡太 太蘭 蘭山}
  {e2, 卡太蘭 太蘭山}
  {e2, 卡太蘭山}

Pairs {e, c} whose frequency > 2 are kept
─ {Mountain, “山” (shan)}

Page 16: Learning Formulation and Transformation Rules for Multilingual Named Entities

Keyword Extraction without a Bilingual Dictionary (algorithm)

Given a pair {Ej, Cj}
─ Ej is a foreign named entity
─ Cj is a Chinese named entity

Decompose the named entities
─ Ej comprises m words w1·w2…wm; a candidate segment ep,q is defined as wp … wq
─ Cj has n syllables s1·s2…sn; a candidate segment cx,y is defined as sx … sy
─ we can get pairs {ep,q, cx,y} from {Ej, Cj}

Group and count
─ collect the pairs from the multilingual named entity list
─ count the frequency of each pair
─ pairs with higher frequency denote significant segment pairs
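A minimal sketch of the decompose-and-count step, assuming that one Chinese character corresponds to one syllable and that a pair is counted at most once per named entity. The function names are hypothetical, and the input is the (s6)-(s9) list from the slides.

```python
from collections import Counter

def segments(units):
    """All contiguous sub-sequences units[p:q] of a list of words or syllables."""
    n = len(units)
    return {tuple(units[p:q]) for p in range(n) for q in range(p + 1, n + 1)}

def count_segment_pairs(entity_pairs):
    """Count how many named-entity pairs contain each {e, c} segment pair."""
    counts = Counter()
    for foreign, chinese in entity_pairs:
        words = foreign.replace(",", " ").split()   # w1 ... wm
        syllables = list(chinese)                   # s1 ... sn (assumption: 1 char = 1 syllable)
        for e in segments(words):
            for c in segments(syllables):
                counts[(" ".join(e), "".join(c))] += 1
    return counts

entity_pairs = [
    ("Aletschhorn Mountain", "阿利奇赫恩山"),
    ("Catalan Mountain", "卡太蘭山"),
    ("Cook Strait", "科克海峽"),
    ("Dover, Strait of", "多佛海峽"),
]
counts = count_segment_pairs(entity_pairs)
```

On this toy list, {Mountain, 山} and {Strait, 海峽} are each counted twice, but so are the sub-segments {Strait, 海} and {Strait, 峽}, which is the redundancy that a later scoring step has to resolve.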

Page 17: Learning Formulation and Transformation Rules for Multilingual Named Entities

Keyword Extraction without a Bilingual Dictionary (example)

(s6) Aletschhorn Mountain ⇔ 阿利奇赫恩山
(s7) Catalan Mountain ⇔ 卡太蘭山
(s8) Cook Strait ⇔ 科克海峽
(s9) Dover, Strait of ⇔ 多佛海峽

─ All the pairs {e, c} whose frequency > 2 are kept
─ {Mountain, “山” (shan)} and {Strait, “海峽” (hai xia)} appear twice

Page 18: Learning Formulation and Transformation Rules for Multilingual Named Entities

Keyword Extraction without a Bilingual Dictionary (problem)

Two issues have to be addressed
─ redundancy among the pairs of segments should be eliminated carefully
─ e may be translated to more than one synonym, e.g., “Association” ⇔ “協會” (xie hui) and “聯誼會” (lian yi hui)

A metric to deal with the above issues is proposed

$$\mathrm{score}(\{e, c_i\}) = f(\{e, c_i\}) \times \mathrm{idf}(c_i) \times \log_2(|c_i| + 1)$$

$$f(\{e, c_i\}) = \frac{\mathrm{tf}\{e, c_i\}}{\max_j \mathrm{tf}\{e, c_j\}}$$

$$\mathrm{idf}(c_i) = \log_2 \frac{N}{\mathrm{df}(c_i)}$$

$$\arg\max_{c_i} \mathrm{score}(\{e, c_i\})$$
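A hedged sketch of how such a metric could be computed. It assumes the slide's formula reads score = f × idf × log2(|c| + 1), with f the pair frequency normalized by the largest frequency for the same e, idf = log2(N / df(c)), and |c| the length of the Chinese segment in characters. The operators were lost in the transcript, so this reading, the function name, and the toy counts below are all assumptions.

```python
import math

def best_translation(e, tf, df, n_segments):
    """Pick the Chinese segment c with the highest score for foreign segment e.

    tf: {(e, c): co-occurrence count}; df: {c: document frequency of c};
    n_segments: N in the idf term. All three are hypothetical inputs.
    """
    candidates = {c: n for (e2, c), n in tf.items() if e2 == e}
    max_tf = max(candidates.values())

    def score(c):
        f = candidates[c] / max_tf                 # normalized term frequency
        idf = math.log2(n_segments / df[c])        # rarer segments weigh more
        return f * idf * math.log2(len(c) + 1)     # assumed length factor

    return max(candidates, key=score)

# Toy counts: "海峽" and its sub-segments co-occur equally often with "Strait".
tf = {("Strait", "海峽"): 2, ("Strait", "海"): 2, ("Strait", "峽"): 2, ("Strait", "科"): 1}
df = {"海峽": 1, "海": 1, "峽": 1, "科": 1}
best = best_translation("Strait", tf, df, n_segments=4)
```

With equal frequencies and idf values, only the length term separates the full segment from its redundant sub-segments under this reading, so “海峽” is selected.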

Page 19: Learning Formulation and Transformation Rules for Multilingual Named Entities

Extraction of Transformation Rules

Chinese location name
─ the keyword tends to be located in the rightmost position
─ the remaining part is a transliterated name

Foreign location name
─ the keyword tends to be either in the rightmost position, or permuted with prepositions, a comma, and the transliterated part

(s6) Aletschhorn Mountain ⇔ 阿利奇赫恩山
(s7) Catalan Mountain ⇔ 卡太蘭山
(s8) Cook Strait ⇔ 科克海峽
(s9) Dover, Strait of ⇔ 多佛海峽

(s6’) γ Mountain ⇔ δ 山
(s7’) γ Mountain ⇔ δ 山
(s8’) γ Strait ⇔ δ 海峽
(s9’) γ, Strait of ⇔ δ 海峽
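The step from (s6)-(s9) to the primed rules can be sketched as follows. This is an illustrative assumption about the procedure, not the paper's implementation: the keyword pair is given, commas and a small hypothetical list of function words are kept as they are, every other foreign token run becomes γ, and the non-keyword part of the Chinese name becomes δ.

```python
import re

FUNCTION_WORDS = {"of", "for", "the"}  # hypothetical list of words kept in the rule

def to_rule(foreign, chinese, e_kw, c_kw):
    """Abstract a name pair into a rule such as 'γ, Strait of ⇔ δ 海峽'."""
    if c_kw not in chinese:
        raise ValueError("Chinese keyword not found in the Chinese name")
    left = []
    for tok in re.findall(r"[^\s,]+|,", foreign):
        if tok == e_kw or tok == "," or tok.lower() in FUNCTION_WORDS:
            left.append(tok)
        elif not left or left[-1] != "γ":
            left.append("γ")                    # collapse a non-keyword run into one γ
    lhs = " ".join(left).replace(" ,", ",")
    head, _, tail = chinese.rpartition(c_kw)    # the keyword tends to be rightmost
    rhs = " ".join(x for x in ("δ" if head else "", c_kw, "δ" if tail else "") if x)
    return f"{lhs} ⇔ {rhs}"

rule_s8 = to_rule("Cook Strait", "科克海峽", "Strait", "海峽")
rule_s9 = to_rule("Dover, Strait of", "多佛海峽", "Strait", "海峽")
```

Counting identical rule strings over a whole corpus would then give the number of names that follow each transformation pattern.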

Page 20: Learning Formulation and Transformation Rules for Multilingual Named Entities

Extraction of Keywords at a Distance

(s12) and (s13)
─ the English compound keyword is separated, and so is its Chinese counterpart

(s14) and (s15)
─ the English compound keyword is connected
─ but the corresponding Chinese translation is separated

(s12) American Podiatric Medical Association ⇔ 美國足病醫療學會
(s13) American Public Health Association ⇔ 美國公共衛生學會
(s14) American Society for Industrial Security ⇔ 美國工業安全協會
(s15) American Society of Newspaper Editors ⇔ 美國報紙編輯人協會

Page 21: Learning Formulation and Transformation Rules for Multilingual Named Entities

Extraction of Keywords at a Distance

Introduce a symbol ∆ to cope with the distance issue
─ “American Civil Liberties Union”
─ “American ∆ Liberties Union”
─ “American Civil ∆ Union”
─ “American ∆ Union”
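One way to generate the gapped candidates listed on this slide, assuming that ∆ replaces a single contiguous run of interior words and that the first and last words are always kept. That restriction is inferred from the four examples and is not stated in the slides; the function name is hypothetical.

```python
def gapped_patterns(name, gap="∆"):
    """Return the name plus every variant with one interior word run replaced by the gap symbol."""
    words = name.split()
    patterns = [name]
    for i in range(1, len(words) - 1):          # first word is never replaced
        for j in range(i + 1, len(words)):      # last word is never replaced
            patterns.append(" ".join(words[:i] + [gap] + words[j:]))
    return patterns

patterns = gapped_patterns("American Civil Liberties Union")
```

The resulting patterns can be fed to the same segment-pair counting as ordinary segments, so that “American ∆ Union” can be matched against names with different middle words.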

Page 22: Learning Formulation and Transformation Rules for Multilingual Named Entities

Experimental Analysis (corpus)

NICT location corpus
─ 122 keyword pairs are identified in total
─ 230 transformation rules in total
─ on average, a keyword pair corresponds to 1.89 transformation rules

CNA personal names
─ names composed of more than one word: 100 / 50,586
─ only a few keywords are extracted
  De ⇔ 戴 (dai), La ⇔ 拉 (la), De La ⇔ 戴拉 (dai la), Du ⇔ 杜 (du), David ⇔ 大衛 (da wei)

CNA organization names
─ names composed of more than one word: 12,885 / 14,658
─ 5,229 keyword pairs are extracted
─ most of the keyword pairs are meaning translations

Page 23: Learning Formulation and Transformation Rules for Multilingual Named Entities

Experimental Analysis (classify)

We classify these keyword pairs into the following types
─ (1) Meaning translation
  common location keywords: Bir ⇔ 井 (jing), Ain ⇔ 泉 (quan), Bahr ⇔ 河 (he), Cerro ⇔ 山 (shan)
  direction (e.g., Central ⇔ 中 (zhong), East ⇔ 東 (dong), etc.)
  size (e.g., Big ⇔ 大 (da)), length (e.g., Long ⇔ 長 (chang)), color (e.g., Black ⇔ 黑 (hei), Blue ⇔ 藍 (lan), etc.)
  the specificity of place or area: Crystal ⇔ 結晶 (jie jing), Diamond ⇔ 鑽石 (zuan shi)
─ (2) Phoneme transliteration keywords
  Dera ⇔ 德拉 (de la), Monte ⇔ 蒙特 (meng te), Los ⇔ 洛斯 (luo si), 伊利莎白 (yi li sha bai), Edward ⇔ 愛德華 (ai de hua)
  39 terms in total belong to this type (31.97%)
─ (3) Some keywords in type (1) are transliterated
  Bay ⇔ 貝 (bei), Beach ⇔ 比奇 (bi qi)
  14 keywords in total are extracted (11.48%)
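The reported percentages and the average are consistent with the 122 keyword pairs and 230 rules stated for the NICT location corpus. A quick check, using only numbers that appear in the slides (variable names are mine):

```python
keyword_pairs = 122          # NICT keyword pairs
rules = 230                  # NICT transformation rules
phoneme_transliterated = 39  # type (2) keywords
type1_transliterated = 14    # type (3) keywords

avg_rules = round(rules / keyword_pairs, 2)
share_type2 = round(100 * phoneme_transliterated / keyword_pairs, 2)
share_type3 = round(100 * type1_transliterated / keyword_pairs, 2)
```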

Page 24: Learning Formulation and Transformation Rules for Multilingual Named Entities

Experimental Results

NICT location corpus
─ 122 keyword pairs are identified in total
─ 230 transformation rules in total
─ on average, a keyword pair corresponds to 1.89 transformation rules

Keyword pair Mountain ⇔ 山 (shan)
─ four transformation rules
  (1) γα ⇔ δβ (234)
  (2) γ, α ⇔ δβ (45)
  (3) γ, αγ ⇔ δβ (1)
  (4) γαγ ⇔ δβ (1)

Page 25: Learning Formulation and Transformation Rules for Multilingual Named Entities

Application on CLIR

Page 26: Learning Formulation and Transformation Rules for Multilingual Named Entities

Conclusion and Remarks

This paper proposes corpus-based approaches
─ to extract the formulation rules and the translation/transliteration rules among multilingual named entities

Two types of evaluation
─ partitioning the corpora into two parts, one for training and the other for testing
─ integrating the method into a cross-language information retrieval system

Further applications
─ will be explored in the future, and the methodology will be extended to other types of named entities

Page 27: Learning Formulation and Transformation Rules for Multilingual Named Entities

Personal Opinion

Drawback
─ lacks an analysis of time complexity

Application
─ construct Chinese-English rules and apply them to IR

Future Work
─ address the transliterated / translated term issue

