Korean-Spanish CLIR System Development: Translation of Unknown Words

August 23, 2005

Qing Li

Information Retrieval and Natural Language Processing Lab.
Information and Communications University (ICU)

Outline

• Introduction

• Unknown Word Translation Method

• Evaluation

• Unknown word module Demo

How to retrieve documents?

[Diagram: Crawler, Search Engine, Query]

CLIR

[Diagram: crawling Spanish documents; search engine with indexed Spanish documents; Spanish query; Korean person]

Translation Module

[Diagram: Korean query, Spanish query]

Unknown word translation

Unknown words:

• They are currently not available
    – Named entities (person, organization and location)
    – Book/movie titles
    – Terminology (medical, sci&tech, military, …)

• Most of them are compound nouns
    – The meaning cannot be directly derived from the components
    – Translation requires more world knowledge

• Important for NLP applications:
    – Machine Translation (MT)
    – Cross-lingual Information Retrieval (CLIR)
    – Question Answering (QA)

Our strategy

Manually constructing an unknown-word list
• Accurate, but almost impossible!?

A good strategy is to construct that list automatically
• Mine the web for the corresponding translations of unknown words.

Searching the web for the translation?

Searching for parallel data on the web (e.g., STRAND: Resnik and Smith 2003)

Searching the web for the translation?

Searching comparable corpora on the web (Fung and Yee 1998)

Searching the web for the translation?

Anchor texts pointing to the same page (Lu et al. 2004)

Searching the web for the translation?

Mining mixed-language pages.

Mining translations from mixed-language pages

Crawling the Chinese web pages that contain English text (Zhang and Vines, SIGIR 2004)
– Use Google to locate the web pages containing the Chinese terms
– English expressions that occur next to the Chinese terms are considered their translations
– Crawled 2 GB of web data; 1,168 distinct English terms were found, 61% of which are correct translations

Searching for the Chinese terms among the English pages (Cheng et al., SIGIR 2004)
– Use Google to retrieve "English" pages containing the Chinese terms
– Extract translations from the snippets
– LiveTrans system
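
The idea of treating English expressions that occur next to a source-language term as translation candidates can be sketched roughly as follows. This is only an illustration, not the cited systems' implementation: the function name `adjacent_candidates`, the 30-character window, the Latin-run pattern and the two example snippets are all assumptions made for the sketch.

```python
import re
from collections import Counter

# Runs of Latin-script words, optionally joined by single spaces or hyphens.
LATIN_RUN = re.compile(r"[A-Za-z]+(?:[ \-][A-Za-z]+)*")


def adjacent_candidates(snippets, term, window=30):
    """Count Latin-script expressions found within `window` characters of `term`."""
    counts = Counter()
    for text in snippets:
        for match in re.finditer(re.escape(term), text):
            before = text[max(0, match.start() - window):match.start()]
            after = text[match.end():match.end() + window]
            for context in (before, after):
                counts.update(LATIN_RUN.findall(context))
    return counts


# Made-up snippets standing in for search-engine results.
snippets = [
    "联合国秘书长科菲·安南（Kofi Annan）表示",
    "科菲·安南 Kofi Annan 周一",
]
print(adjacent_candidates(snippets, "科菲·安南").most_common(3))
```

Counting how often each expression appears near the term gives a simple frequency signal that later ranking steps could reuse.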

Our method

– Preprocessing
– Multiple features
    • Phonetic
    • Physical structure
        – parenthesis
    • Frequency-length
– Feature fusion

Preprocessing: case 1

Preprocessing: case 2

Frequency-length model

\[
w_{FL}(c_i) = \alpha \times \frac{\mathrm{len}(c_i)}{\max \mathrm{len}} + (1 - \alpha) \times \frac{\mathrm{Freq}(c_i)}{\max \mathrm{Freq}}
\]

Candidate       Freq   Len   Weight_FL
Moo hyun        4      2     1.67
Roh Moo hyun    4      3     2
Moo             4      1     1.33
Roh Moo         4      2     1.67
…
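
A small sketch of how the frequency-length weight could be computed for the candidates listed on this slide. The value `alpha=0.5` is an assumption (the slide does not give one), and `max_len` and `max_freq` are taken over the four listed candidates only, since the table is truncated. Note that the weights shown on the slide equal the unweighted sum len/max len + Freq/max Freq (for example 2/3 + 4/4 ≈ 1.67), so the sketch prints both variants.

```python
def fl_weight(length, freq, max_len, max_freq, alpha=0.5):
    """Frequency-length weight: alpha * len/max_len + (1 - alpha) * freq/max_freq."""
    return alpha * length / max_len + (1 - alpha) * freq / max_freq


# (length in words, frequency) for the candidates listed on the slide.
candidates = {
    "Moo hyun": (2, 4),
    "Roh Moo hyun": (3, 4),
    "Moo": (1, 4),
    "Roh Moo": (2, 4),
}
max_len = max(length for length, _ in candidates.values())
max_freq = max(freq for _, freq in candidates.values())

for name, (length, freq) in candidates.items():
    weighted = fl_weight(length, freq, max_len, max_freq)
    unweighted = length / max_len + freq / max_freq
    print(f"{name:14s} weighted={weighted:.2f} unweighted_sum={unweighted:.2f}")
```

Either way, the longest candidate with the highest frequency ("Roh Moo hyun") receives the largest weight, which is the behaviour the model is meant to capture.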

Physical structure model

Candidates retrieved from parentheses
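
As a rough illustration, candidates inside parentheses could be pulled out of a snippet with a regular expression. The pattern, the function name `parenthesis_candidates` and the example snippet are assumptions made for this sketch; the slide does not specify how the extraction is done.

```python
import re

# Latin-script text (including accented letters) enclosed in ASCII or full-width parentheses.
PAREN = re.compile(r"[(（]\s*([A-Za-zÀ-ÿ][A-Za-zÀ-ÿ .'\-]*?)\s*[)）]")


def parenthesis_candidates(snippet):
    """Return the Latin-script expressions that appear inside parentheses."""
    return [match.group(1) for match in PAREN.finditer(snippet)]


# Made-up snippet: a Korean name followed by its romanized form in parentheses.
print(parenthesis_candidates("노무현(Roh Moo-hyun) 대통령 (2003)"))
```

Parenthesized content that does not start with a letter, such as a year, is skipped by the pattern.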

Phonetic model

• Captures phonetic similarity
• Applies to person, location and brand names
• Probabilistic surface string alignment
• Romanized source phrase vs. target phrase
• Letters are aligned according to their pronunciation similarity (not their orthographic forms)

Huang, Vogel and Waibel, Automatic Extraction of Named Entity Translingual Equivalence Based on Multi-feature Cost Minimization, ACL '03 Multilingual NE Recognition Workshop
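
The slide describes a probabilistic alignment; as a much simpler stand-in, the sketch below scores a romanized source phrase against a target candidate with a weighted edit distance in which similar-sounding letters are cheap to substitute. The `SIMILAR_LETTERS` table, the 0.5 substitution cost and the romanization "kopi anan" are illustrative assumptions, not values from the cited work.

```python
# Hypothetical pairs of letters treated as similar-sounding.
SIMILAR_LETTERS = {
    frozenset(pair)
    for pair in [("k", "c"), ("k", "g"), ("r", "l"), ("p", "f"), ("b", "v"), ("i", "y")]
}


def substitution_cost(a, b):
    """0 for identical letters, 0.5 for similar-sounding ones, 1 otherwise."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in SIMILAR_LETTERS:
        return 0.5
    return 1.0


def phonetic_similarity(romanized, candidate):
    """Similarity in [0, 1] based on a weighted edit distance over letters."""
    s = romanized.lower().replace(" ", "")
    t = candidate.lower().replace(" ", "")
    if not s and not t:
        return 1.0
    previous = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        current = [float(i)]
        for j, b in enumerate(t, 1):
            current.append(min(
                previous[j] + 1.0,                          # delete a
                current[j - 1] + 1.0,                       # insert b
                previous[j - 1] + substitution_cost(a, b),  # align a with b
            ))
        previous = current
    return 1.0 - previous[-1] / max(len(s), len(t))


print(phonetic_similarity("kopi anan", "Kofi Annan"))
print(phonetic_similarity("kopi anan", "United Nations"))
```

A learned, probabilistic letter-pair model would replace the hand-written table in a real system; the sketch only shows why "Kofi Annan" scores higher than an unrelated phrase.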

Feature combination

Stage 1:
• Combine the results from the physical structure model and the frequency-length model.
• Rank the candidates based on the weight.

Stage 2:
• If the phonetic model weights of some candidates are larger than a certain threshold, move those candidates to the top and rerank them based on the phonetic model weight.
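
The two stages could be sketched as below. How stage 1 merges the physical structure and frequency-length results into one weight is not specified on the slide, so the sketch simply takes an already combined stage-1 weight per candidate as input; the function name, the threshold of 0.8 and all example weights are assumptions.

```python
def rank_candidates(stage1_weights, phonetic_weights, threshold=0.8):
    """Rank by stage-1 weight, then promote candidates with a high phonetic weight."""
    # Stage 1: order the candidates by their combined weight, highest first.
    stage1 = sorted(stage1_weights, key=stage1_weights.get, reverse=True)

    # Stage 2: candidates above the phonetic threshold move to the top,
    # ordered among themselves by phonetic weight.
    promoted = [c for c in stage1 if phonetic_weights.get(c, 0.0) > threshold]
    promoted.sort(key=lambda c: phonetic_weights[c], reverse=True)
    rest = [c for c in stage1 if c not in promoted]
    return promoted + rest


# Made-up weights for illustration only.
stage1_weights = {"United Nations": 1.9, "Kofi Annan": 1.7, "Annan": 1.3}
phonetic_weights = {"United Nations": 0.2, "Kofi Annan": 0.83, "Annan": 0.5}
print(rank_candidates(stage1_weights, phonetic_weights))
```

Candidates that stay below the threshold keep their stage-1 order, so the phonetic model only overrides the other features when it is confident.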

Sample

English unknown word: Kofi Annan (UN Secretary-General)

Result from stage 1

Result from stage 2

One problem

• The number of web pages that contain both Korean and Spanish words is limited.

• Solution: mine the web pages that contain both Korean and English words in order to get the corresponding English translations of the unknown words.

• Is this reasonable?
    – Unknown words have the same spelling in English and Spanish
    – The Chilean web contains many pages written in English
    – Even within a single line, some words are written in English on Chilean web pages

Reason

Evaluation

Test set
• 300 key phrases, manually selected
• Manual translations as references
• One phrase may have several correct translations
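
Because one phrase may have several correct translations, a scorer has to accept any of the references. The sketch below computes a top-N inclusion rate under that rule; the choice of metric, the function name `top_n_accuracy` and the example data are assumptions, since the slides do not state how the results were scored.

```python
def top_n_accuracy(ranked, references, n):
    """Fraction of phrases whose top-n candidates contain any reference translation."""
    if not references:
        return 0.0
    hits = 0
    for phrase, refs in references.items():
        accepted = {ref.casefold() for ref in refs}
        top = ranked.get(phrase, [])[:n]
        if any(candidate.casefold() in accepted for candidate in top):
            hits += 1
    return hits / len(references)


# Made-up references and system output for two phrases.
references = {
    "노무현": {"Roh Moo-hyun", "Roh Moo hyun"},
    "코피 아난": {"Kofi Annan"},
}
ranked = {
    "노무현": ["Moo hyun", "Roh Moo hyun"],
    "코피 아난": ["United Nations", "Annan", "Kofi Annan"],
}
for n in (1, 2, 3):
    print(n, top_n_accuracy(ranked, references, n))
```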

Overall Translation Quality

Snapshot of translation result

On-going work

We have collected 40,000 Korean unknown words from a newspaper collection.

We will translate those words into Spanish/English to extend the current bilingual language list.

We will continue to refine the method by applying techniques from the MT area.

Demo

http://220.69.185.118:8080/tran/

References

Fung, P. and Yee, L. Y. An IR Approach for Translating New Words from Nonparallel, Comparable Texts. In Proc. of COLING-ACL, pp. 414-420, 1998.

Huang, F., Vogel, S., and Waibel, A. Automatic Extraction of Named Entity Translingual Equivalence Based on Multi-feature Cost Minimization. In Proceedings of the 41st ACL, Workshop on Multilingual and Mixed-Language Named Entity Recognition, Sapporo, Japan, July 2003.

Lu, W.-H., Chien, L.-F., and Lee, H.-J. Anchor Text Mining for Translation of Web Queries: A Transitive Translation Approach. ACM Transactions on Information Systems 22(2), pp. 242-269, 2004.

Resnik, P. and Smith, N. A. The Web as a Parallel Corpus. Computational Linguistics 29(3), pp. 349-380, 2003.

Zhang, Y. and Vines, P. Detection and Translation of OOV Terms Prior to Query Time. In SIGIR '04, pp. 524-525. ACM Press, 2004.

Zhang, Y., Huang, F., and Vogel, S. Mining Translations of OOV Terms from the Web through Cross-lingual Query Expansion. In SIGIR '05.


Thank you.

Any questions are welcome!