Constructing Bilingual Resources for Digital Libraries
Rim, Hae-Chang, Korea University
2000.8.10
Contents
• Introduction
• Bilingual resources
  – bilingual dictionary
  – bilingual corpus
  – bilingual thesaurus
• Our experience
  – bilingual dictionary
  – bilingual corpus
  – bilingual thesaurus
• Summary
Introduction
• What is the problem?
  – The language barrier in multilingual digital libraries.
• How to solve the problem?
  – machine translation (MT)
  – cross-language information retrieval (CLIR)
• Why bilingual resources?
  – MT and CLIR are both based on bilingual resources.
• What shall we do?
  – Construct:
    • a Korean-English bilingual dictionary
    • a Korean-English bilingual corpus
    • a Korean-English bilingual thesaurus
Overview
[Diagram: two digital libraries (DL) separated by a language barrier; CLIR and MT cross the barrier, and both rest on bilingual resources.]
• Bilingual Resources
  – Bilingual dictionary
  – Bilingual corpus
  – Bilingual thesaurus
Bilingual Dictionary
• Definition
  – a dictionary that pairs words with their translations in another language.
• Application field
  – CLIR
    • [Oard 98], [Fujii et al. 99], [Myaeng et al. 99]
  – MT
• Utilization
  [Diagram: the word “대기” is looked up in the bilingual dictionary, which holds “대기 1” – “atmosphere” and “대기 2” – “waiting”; the translated words “atmosphere” and “waiting” are passed on to CLIR and MT.]
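The lookup in the diagram can be sketched in a few lines of Python. This is only an illustration of dictionary-based term translation, not the system described in the talk: the dictionary contents, the name `translate_query`, and the choice to pass unknown terms through unchanged are all assumptions made for the example.

```python
# Toy bilingual dictionary: each Korean headword maps to every English
# translation listed for it (one per sense). Entries are illustrative only.
BILINGUAL_DICT = {
    "대기": ["atmosphere", "waiting"],
    "오염": ["pollution"],
}


def translate_query(query, dictionary):
    """Return (term, candidate translations) for each whitespace-separated term.

    Terms missing from the dictionary are passed through untranslated,
    which is an assumption of this sketch.
    """
    return [(term, dictionary.get(term, [term])) for term in query.split()]


candidates = translate_query("대기 오염", BILINGUAL_DICT)
for term, translations in candidates:
    print(term, "->", translations)
```

The ambiguous headword “대기” comes back with both of its translations; the dictionary alone cannot say which one a given query intends.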
Bilingual Corpus (1)
• Definition
  – comparable corpus
    • a collection of similar texts in different languages
  – parallel corpus
    • a collection of texts which have been translated into one or more other language(s).
    • Ex) Canadian Hansard corpus
• Application field
  – CLIR
    • [Yang et al. 98]
  – MT
    • Example-Based Machine Translation
      – [Brown 96], [Murata et al. 99], [Shirai et al. 97]
      – [Turcato et al. 99]
Bilingual Corpus (2)
• Utilization
  [Diagram: dictionary lookup gives the translated words “대기” - “atmosphere” / “waiting” and “오염” - “pollution”, so “대기 오염” could be “atmosphere pollution” or “waiting pollution”. A bilingual corpus containing the sentence pair
    “The sources of atmosphere pollution may have a global, regional and local character.”
    “대기 오염의 원인은 전세계적, 국부적, 그리고 지역적인 특징을 가진다.”
  yields the translated phrase “대기 오염” - “atmosphere pollution”, which is passed on to CLIR and MT.]
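The disambiguation step in the diagram can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used in the cited systems: candidate phrases are built from every combination of dictionary translations, and a candidate is counted when it appears as a plain substring of the English side of a sentence pair whose Korean side contains the source phrase. The names `translate_phrase`, `BILINGUAL_DICT` and `PARALLEL_CORPUS` are hypothetical.

```python
from itertools import product

# Illustrative dictionary and one-sentence parallel corpus taken from the slide.
BILINGUAL_DICT = {"대기": ["atmosphere", "waiting"], "오염": ["pollution"]}

PARALLEL_CORPUS = [
    (
        "대기 오염의 원인은 전세계적, 국부적, 그리고 지역적인 특징을 가진다.",
        "The sources of atmosphere pollution may have a global, "
        "regional and local character.",
    ),
]


def translate_phrase(phrase, dictionary, corpus):
    """Pick the candidate translation best supported by the parallel corpus.

    Returns None when no candidate is attested in any matching sentence pair.
    """
    words = phrase.split()
    # Every combination of per-word translations is a candidate phrase.
    candidates = [" ".join(c) for c in product(*(dictionary[w] for w in words))]
    counts = {c: 0 for c in candidates}
    for korean, english in corpus:
        if phrase in korean:  # crude substring match, an assumption of this sketch
            for candidate in candidates:
                if candidate in english.lower():
                    counts[candidate] += 1
    best = max(candidates, key=lambda c: counts[c])
    return best if counts[best] > 0 else None


best = translate_phrase("대기 오염", BILINGUAL_DICT, PARALLEL_CORPUS)
print(best)
```

Substring matching is used here only to keep the example short; a realistic system would rely on sentence and word alignment rather than exact string containment.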
Bilingual Thesaurus (1)
• Definition
  – a collection of words in two languages that are put into groups together according to connections between their meanings
  – Ex) EuroWordNet
• Application field
  – CLIR
    • concept-based CLIR
      – [Gonzalo et al. 98], [Gilarranz et al. 97]
  [Diagram: bilingual thesaurus fragment. One cluster links {region, part}, {atmosphere, 대기} and {air}; another links {inactivity}, {wait, waiting, 대기} and {pause}.]
Bilingual Thesaurus (2)
• Utilization
  [Diagram: for CLIR, the word “대기” is mapped through the bilingual thesaurus to the word concepts “region” and “inactivity”.]
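The word-to-concept mapping in the diagram can be sketched as below. The data structure is an assumption made for the example: each entry pairs a group of words from the slide with the concept label the slide attaches to it, and the field names `concept` and `members` are hypothetical.

```python
# Toy bilingual thesaurus built from the groups shown on the slide.
THESAURUS = [
    {"concept": "region", "members": {"atmosphere", "대기"}},
    {"concept": "inactivity", "members": {"wait", "waiting", "대기"}},
]


def concepts_for(word):
    """Return the concept labels of every group that contains the word."""
    return sorted(e["concept"] for e in THESAURUS if word in e["members"])


def english_terms_for(word):
    """Return the English words that share a group with the given word.

    ASCII-only members are treated as English, a shortcut for this sketch.
    """
    terms = set()
    for entry in THESAURUS:
        if word in entry["members"]:
            terms.update(m for m in entry["members"] if m.isascii())
    return sorted(terms)


print(concepts_for("대기"))
print(english_terms_for("대기"))
```

Matching queries and documents on concept labels rather than on surface words is the idea behind the concept-based CLIR mentioned earlier; the sketch leaves out the choice between the two concepts.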
• Our Experience
  – Bilingual dictionary
  – Bilingual corpus
  – Bilingual thesaurus
Bilingual Dictionary
• Korean-English bilingual dictionary
  – size
    • 2 million entries
  – application
    [Diagram: a person’s name, “링컨”, is looked up in the bilingual biographical dictionary (“링컨” - “Lincoln”); the translated person’s name, “Lincoln”, is passed on to CLIR and MT.]
Bilingual Corpus
• Korean-English bilingual corpus
  – parallel corpus containing 250,000 words
  – based on CES (Corpus Encoding Standard)
• Corpus construction tools
  – corpus refining tools
  – corpus annotating tools
  – bilingual concordancer
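To make the last tool concrete, here is a minimal sketch of what a bilingual concordancer does: it finds a keyword on either side of a sentence-aligned corpus and shows it in context next to the aligned sentence. It does not reproduce the tool built for this corpus; the function name `concordance`, the `width` parameter and the tuple layout are assumptions.

```python
# One aligned sentence pair, reused from the earlier slide, as sample data.
ALIGNED_PAIRS = [
    (
        "대기 오염의 원인은 전세계적, 국부적, 그리고 지역적인 특징을 가진다.",
        "The sources of atmosphere pollution may have a global, "
        "regional and local character.",
    ),
]


def concordance(keyword, aligned_pairs, width=20):
    """Return (left context, keyword, right context, aligned sentence) hits.

    Only the first occurrence of the keyword in each sentence is reported.
    """
    hits = []
    for korean, english in aligned_pairs:
        for text, other in ((korean, english), (english, korean)):
            pos = text.find(keyword)
            if pos != -1:
                end = pos + len(keyword)
                left = text[max(0, pos - width):pos]
                right = text[end:end + width]
                hits.append((left, keyword, right, other))
    return hits


for left, key, right, aligned in concordance("pollution", ALIGNED_PAIRS):
    print(f"{left}[{key}]{right}  |  {aligned}")
```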
Bilingual Thesaurus (1)
• Goal
  – Constructing a Korean-English bilingual thesaurus
• Approach
  – assigning Korean words to corresponding English words in WordNet
  [Diagram: WordNet contains the synsets {air}, {region, part} and {atmosphere}. The Korean word “대기” is assigned to {atmosphere}, so the Korean-English bilingual thesaurus contains {air}, {region, part} and {atmosphere, 대기}.]
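The assignment step can be sketched as below. The slides do not say how the correct English sense is chosen for each Korean word, so this example simply assumes a sense-specific Korean-to-English mapping is already available; `WORDNET_SYNSETS` is a hand-written toy stand-in for WordNet, and all names are hypothetical.

```python
# Toy stand-in for the WordNet synsets shown on the slide.
WORDNET_SYNSETS = [
    {"air"},
    {"region", "part"},
    {"atmosphere"},
]

# Assumed input: each Korean word already paired with one English word.
KOREAN_TO_ENGLISH = {"대기": "atmosphere"}


def build_bilingual_thesaurus(synsets, korean_to_english):
    """Copy the synsets and add each Korean word to the synsets
    that contain its English counterpart."""
    bilingual = [set(s) for s in synsets]  # leave the input untouched
    for korean, english in korean_to_english.items():
        for synset in bilingual:
            if english in synset:
                synset.add(korean)
    return bilingual


thesaurus = build_bilingual_thesaurus(WORDNET_SYNSETS, KOREAN_TO_ENGLISH)
for synset in thesaurus:
    print(sorted(synset))
```

When the English word belongs to several synsets, this sketch adds the Korean word to all of them; selecting the right sense is the part that needs human judgment or further evidence.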
Bilingual Thesaurus (2)
• Current status of the task
  – under construction
                                  Korean thesaurus    WordNet
  word count                           20149            94473
  concept count (synset count)         13211            68046
  word sense count                     23838           116317
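The figures in the table can be compared with a little arithmetic. The sketch below only derives ratios from the six numbers above (senses per word, and the size of the Korean thesaurus relative to WordNet); the dictionary layout and function names are assumptions for the example.

```python
# Counts copied from the table above.
COUNTS = {
    "Korean thesaurus": {"words": 20149, "synsets": 13211, "senses": 23838},
    "WordNet": {"words": 94473, "synsets": 68046, "senses": 116317},
}


def senses_per_word(resource):
    """Average number of word senses per word in a resource."""
    c = COUNTS[resource]
    return c["senses"] / c["words"]


def coverage(field):
    """Size of the Korean thesaurus relative to WordNet for one count."""
    return COUNTS["Korean thesaurus"][field] / COUNTS["WordNet"][field]


for name in COUNTS:
    print(f"{name}: {senses_per_word(name):.2f} senses per word")
for field in ("words", "synsets", "senses"):
    print(f"{field}: Korean thesaurus is {coverage(field):.1%} of WordNet")
```

By these numbers the Korean thesaurus averages about 1.18 senses per word against about 1.23 for WordNet, and it holds roughly a fifth as many words, synsets and senses.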
Summary
• Surmounting the language barrier
  – using bilingual resources
• Korean-English bilingual resources
  – Korean-English bilingual dictionary
  – Korean-English bilingual corpus
  – Korean-English bilingual thesaurus
• Our experience
  – Korean-English bilingual dictionary
  – Korean-English bilingual corpus
  – Korean-English bilingual thesaurus
References (1)
• [Oard 98] Douglas W. Oard, "A Comparative Study of Query and Document Translation for Cross-Language Information Retrieval", Third Conference of the Association for Machine Translation in the Americas (AMTA), Philadelphia, PA, October 1998.
• [Fujii et al. 99] Atsushi Fujii and Tetsuya Ishikawa, "Cross-Language Information Retrieval for Technical Documents", Proceedings of the Joint ACL SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 29-37, 1999.
• [Myaeng et al. 99] Sung Hyon Myaeng and Myung-gil Jang, "Complementing Dictionary-Based Query Translations with Corpus Statistics for Cross-Language IR", Machine Translation Summit VII, 1999.
References (2)
• [Yang et al. 98] Yiming Yang, Jaime G. Carbonell, Ralf D. Brown, and Robert E. Frederking, "Translingual Information Retrieval: Learning from Bilingual Corpora", Artificial Intelligence (Special issue: Best of IJCAI-97), Vol. 103, pp. 323-345, 1998.
• [Brown 96] Ralf D. Brown, "Example-Based Machine Translation in the Pangloss System", Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), pp. 169-174, Copenhagen, Denmark, August 5-9, 1996.
• [Murata et al. 99] M. Murata, Q. Ma, K. Uchimoto, and H. Isahara, "An Example-Based Approach to Japanese-to-English Translation of Tense, Aspect, and Modality", TMI'99, Chester, UK, August 23, 1999.
References (3)
• [Shirai et al. 97] S. Shirai, F. Bond, and Y. Takahashi, "A Hybrid Rule and Example based Method for Machine Translation", Natural Language Processing Pacific Rim Symposium '97 (NLPRS-97), 1997.
• [Turcato et al. 99] Davide Turcato, Paul McFetridge, Fred Popowich, and Janine Toole, "A Unified Example-Based and Lexicalist Approach to Machine Translation", 8th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-99), 1999.
• [Gonzalo et al. 98] Julio Gonzalo, Felisa Verdejo, Carol Peters, and Nicoletta Calzolari, "Applying EuroWordNet to Cross-Language Text Retrieval", Computers and the Humanities, Vol. 32, Nos. 2-3, pp. 73-89, 1998.
References (4)
• [Gilarranz et al. 97] Julio Gilarranz, Julio Gonzalo, and Felisa Verdejo, "An Approach to Conceptual Text Retrieval Using the EuroWordNet Multilingual Semantic Database", AAAI 97.