Constructing Bilingual Resources for Digital Libraries Rim, Hae-Chang Korea University 2000.8.10

Constructing Bilingual Resources for Digital Libraries

Rim, Hae-Chang
Korea University

2000.8.10

Contents

• Introduction
• Bilingual resources
  – bilingual dictionary
  – bilingual corpus
  – bilingual thesaurus
• Our experience
  – bilingual dictionary
  – bilingual corpus
  – bilingual thesaurus
• Summary

Introduction

• What is the problem?
  – the language barrier in multilingual digital libraries
• How to solve the problem?
  – machine translation (MT)
  – cross-language information retrieval (CLIR)
• Why bilingual resources?
  – MT and CLIR are based on bilingual resources.
• What shall we do?
  – construct
    • a Korean-English bilingual dictionary
    • a Korean-English bilingual corpus
    • a Korean-English bilingual thesaurus

Overview

[Diagram: two digital libraries (DL) separated by a language barrier; CLIR and MT bridge the barrier, and both rest on bilingual resources.]

• Bilingual Resources
  – Bilingual dictionary
  – Bilingual corpus
  – Bilingual thesaurus

Bilingual Dictionary

• Definition
  – a dictionary containing words and their translations
• Application field
  – CLIR
    • [Oard 98], [Fujii et al. 99], [Myaeng et al. 99]
  – MT
• Utilization

[Diagram: word "대기" → bilingual dictionary ("대기 1" – "atmosphere", "대기 2" – "waiting") → translated words "atmosphere", "waiting" → CLIR / MT]
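The dictionary lookup in this slide can be sketched in a few lines of Python. This is only an illustration, not the system described in the talk: the dictionary contents are the toy entries from the slides, and the names `BILINGUAL_DICT` and `translate_query` are hypothetical.

```python
from typing import Dict, List

# Toy Korean-English dictionary (assumption: only the entries shown on the slides).
BILINGUAL_DICT: Dict[str, List[str]] = {
    "대기": ["atmosphere", "waiting"],  # two senses of the same Korean word
    "오염": ["pollution"],
}


def translate_query(query: str, dictionary: Dict[str, List[str]]) -> List[List[str]]:
    """Return the candidate translations of each whitespace-separated query term.

    Terms missing from the dictionary are passed through unchanged.
    """
    return [dictionary.get(term, [term]) for term in query.split()]


if __name__ == "__main__":
    print(translate_query("대기", BILINGUAL_DICT))
    print(translate_query("대기 오염", BILINGUAL_DICT))
```

A lookup like this returns every sense of an ambiguous word, so a CLIR or MT system still has to choose among the candidates.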

Bilingual Corpus (1)

• Definition
  – comparable corpus
    • a collection of similar texts in different languages
  – parallel corpus
    • a collection of texts which have been translated into one or more other language(s)
    • Ex) Canadian Hansard corpus
• Application field
  – CLIR
    • [Yang et al. 98]
  – MT
    • Example-Based Machine Translation
      – [Brown 96], [Murata et al. 99], [Shirai et al. 97]
      – [Turcato et al. 99]

Bilingual Corpus (2)

• Utilization

[Diagram: translated words + bilingual corpus → translated phrase → CLIR / MT]

translated words
  "대기" – "atmosphere", "waiting"
  "오염" – "pollution"
  "대기 오염" → "atmosphere pollution"? "waiting pollution"?

bilingual corpus
  "the sources of atmosphere pollution may have a global, regional and local character."
  "대기 오염의 원인은 전세계적, 국부적, 그리고 지역적인 특징을 가진다."

translated phrase
  "대기 오염" – "atmosphere pollution"
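One simple way to pick the phrase translation, sketched below, is to build every combination of word-level translations and keep the one that occurs most often in the English side of sentence pairs whose Korean side contains the phrase. This counting heuristic is an assumption made for illustration, as are the names `best_phrase_translation`, `WORD_TRANSLATIONS` and `PARALLEL_CORPUS`; the slides do not say which method was used.

```python
from itertools import product
from typing import Dict, List, Tuple

# Word-level translations from the slide.
WORD_TRANSLATIONS: Dict[str, List[str]] = {
    "대기": ["atmosphere", "waiting"],
    "오염": ["pollution"],
}

# One aligned (Korean, English) sentence pair, taken from the slide.
PARALLEL_CORPUS: List[Tuple[str, str]] = [
    (
        "대기 오염의 원인은 전세계적, 국부적, 그리고 지역적인 특징을 가진다.",
        "the sources of atmosphere pollution may have a global, regional and local character.",
    ),
]


def best_phrase_translation(
    phrase: str,
    translations: Dict[str, List[str]],
    corpus: List[Tuple[str, str]],
) -> str:
    """Choose the candidate phrase translation best supported by the parallel corpus."""
    words = phrase.split()
    # All combinations of word-level translations, e.g. "waiting pollution".
    candidates = [
        " ".join(combo)
        for combo in product(*(translations.get(w, [w]) for w in words))
    ]
    counts = {candidate: 0 for candidate in candidates}
    for korean, english in corpus:
        if phrase not in korean:
            continue  # only sentence pairs that contain the source phrase count
        for candidate in candidates:
            if candidate in english.lower():
                counts[candidate] += 1
    # On a tie, max() keeps the first candidate in dictionary order.
    return max(candidates, key=lambda c: counts[c])


if __name__ == "__main__":
    print(best_phrase_translation("대기 오염", WORD_TRANSLATIONS, PARALLEL_CORPUS))
```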

Bilingual Thesaurus (1)

• Definition
  – a collection of words in two languages that are grouped together according to connections between their meanings
  – Ex) EuroWordNet
• Application field
  – CLIR
    • concept-based CLIR
      – [Gonzalo et al. 98], [Gilarranz et al. 97]

Bilingual Thesaurus (2)

• Utilization

[Diagram: word "대기" → bilingual thesaurus → word concepts "region", "inactivity" → CLIR]

bilingual thesaurus
  {region, part}
    {atmosphere, 대기} {air}
  {inactivity}
    {wait, waiting, 대기} {pause}
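The concept lookup in this slide can be sketched as follows. The synsets are the ones shown in the diagram; treating {region, part} and {inactivity} as the parent concepts of the synsets drawn under them is an assumption read off the layout, and the names `SYNSETS`, `PARENT`, `concepts_of` and `share_concept` are hypothetical.

```python
from typing import Dict, List, Set

# Synsets from the slide, keyed by an arbitrary identifier.
SYNSETS: Dict[str, Set[str]] = {
    "atmosphere": {"atmosphere", "대기"},
    "air": {"air"},
    "wait": {"wait", "waiting", "대기"},
    "pause": {"pause"},
}

# Parent concept of each synset (assumption: inferred from the diagram layout).
PARENT: Dict[str, str] = {
    "atmosphere": "region",
    "air": "region",
    "wait": "inactivity",
    "pause": "inactivity",
}


def concepts_of(word: str) -> List[str]:
    """Return the sorted concepts of every synset that contains the word."""
    return sorted({PARENT[sid] for sid, members in SYNSETS.items() if word in members})


def share_concept(word_a: str, word_b: str) -> bool:
    """True if two words, possibly in different languages, share a concept."""
    return bool(set(concepts_of(word_a)) & set(concepts_of(word_b)))


if __name__ == "__main__":
    print(concepts_of("대기"))
    print(share_concept("대기", "air"))
```

Matching on concepts rather than on surface words is what allows a Korean query term to retrieve an English document that uses a related word such as "air".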

• Our Experience
  – Bilingual dictionary
  – Bilingual corpus
  – Bilingual thesaurus

Bilingual Dictionary

• Korean-English bilingual dictionary
  – size
    • 2 million entries
  – application

[Diagram: person's name "링컨" → bilingual biographical dictionary ("링컨" – "Lincoln") → translated person's name "Lincoln" → CLIR / MT]

Bilingual Corpus

• Korean-English bilingual corpus
  – parallel corpus containing 250,000 words
  – based on CES (Corpus Encoding Standard)
• Corpus construction tools
  – corpus refining tools
  – corpus annotating tools
  – bilingual concordancer
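As a rough illustration of what a bilingual concordancer does, the sketch below searches the source side of aligned sentence pairs for a keyword and returns it in context together with the aligned translation. It is not the tool mentioned on the slide; the function name `concordance`, the `width` parameter and the bracket formatting are assumptions.

```python
from typing import List, Tuple


def concordance(
    keyword: str,
    corpus: List[Tuple[str, str]],
    width: int = 20,
) -> List[Tuple[str, str]]:
    """Return (keyword-in-context, aligned translation) for each matching pair.

    Only the first occurrence of the keyword in each source sentence is shown.
    """
    hits = []
    for source, target in corpus:
        start = source.find(keyword)
        if start == -1:
            continue  # keyword not in this source sentence
        end = start + len(keyword)
        left = source[max(0, start - width):start]
        right = source[end:end + width]
        hits.append((f"{left}[{keyword}]{right}", target))
    return hits


if __name__ == "__main__":
    # Sentence pair taken from the earlier bilingual corpus slide.
    corpus = [
        (
            "대기 오염의 원인은 전세계적, 국부적, 그리고 지역적인 특징을 가진다.",
            "the sources of atmosphere pollution may have a global, regional and local character.",
        ),
    ]
    for context, translation in concordance("오염", corpus):
        print(context)
        print(translation)
```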

Bilingual Thesaurus (1)

• Goal
  – constructing a Korean-English bilingual thesaurus
• Approach
  – assigning Korean words to corresponding English words in WordNet

[Diagram: Korean word "대기" + WordNet ({region, part}; {air}; {atmosphere}) → Korean-English bilingual thesaurus ({region, part}; {air}; {atmosphere, 대기})]

Bilingual Thesaurus (2)

• Current status of the task
  – under construction

                                Korean thesaurus    WordNet
  word count                    20149               94473
  concept count (synset count)  13211               68046
  word sense count              23838               116317

Summary

• Surmounting the language barrier
  – using bilingual resources
• Korean-English bilingual resources
  – Korean-English bilingual dictionary
  – Korean-English bilingual corpus
  – Korean-English bilingual thesaurus
• Our experience
  – Korean-English bilingual dictionary
  – Korean-English bilingual corpus
  – Korean-English bilingual thesaurus

References (1)

• [Oard 98] Douglas W. Oard, “A Comparative Study of Query and Document Translation for Cross-Language Information Retrieval”, the Third Conference of the Association for Machine Translation in the Americas (AMTA), Philadelphia, PA, October, 1998.

• [Fujii et al. 99] Atsushi Fujii, Tetsuya Ishikawa, "Cross-Language Information Retrieval for Technical Documents", Proceedings of the joint ACL SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp.29-37, 1999.

• [Myaeng et al. 99] Sung Hyon Myaeng and Myung-gil Jang, "Complementing Dictionary-Based Query Translations with Corpus Statistics for Cross-Language IR", Machine Translation Summit VII, 1999.

References (2)

• [Yang et al. 98] Yiming Yang, Jaime G. Carbonell, Ralf D. Brown, and Robert E. Frederking, "Translingual Information Retrieval: Learning from Bilingual Corpora", Artificial Intelligence (Special issue: Best of IJCAI-97), Vol. 103 (1998), pp. 323-345.

• [Brown 96] Ralf D. Brown, "Example-Based Machine Translation in the Pangloss System", In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), pp.169-174, Copenhagen, Denmark, August 5-9, 1996.

• [Murata et al. 99] M. Murata, Q. Ma, K. Uchimoto, H. Isahara, "An Example-Based Approach to Japanese-to-English Translation of Tense, Aspect, and Modality", in TMI'99, Chester, UK, August 23, 1999.

References (3)

• [Shirai et al. 97] Shirai, S., F. Bond, and Y. Takahashi. 1997. “A Hybrid Rule and Example based Method for Machine Translation.” In Natural Language Processing Pacific Rim Symposium '97: NLPRS-97.

• [Turcato et al. 99] Davide Turcato, Paul McFetridge, Fred Popowich, Janine Toole, "A Unified Example-Based and Lexicalist Approach to Machine Translation", at the 8th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-99)

• [Gonzalo et al. 98] Julio Gonzalo, Felisa Verdejo, Carol Peters and Nicoletta Calzolari, “Applying EuroWordNet to Cross-Language Text Retrieval”, Computers and the Humanities, Vol 32, Nos. 2-3, pp. 73-89, 1998.

References (4)

• [Gilarranz et al. 97] Julio Gilarranz, Julio Gonzalo and Felisa Verdejo, "An Approach to Conceptual Text Retrieval Using the EuroWordNet Multilingual Semantic Database", AAAI 97.