A Brief Survey on Cross-language Information Retrieval (CLIR) - Text Retrieval Perspective by Ying Alvarado (24401693) CSE 8337 Lecturer : Dr. Margaret Dunham April 26, 2007


A Brief Survey on Cross-language Information Retrieval (CLIR) - Text Retrieval Perspective

by Ying Alvarado (24401693)

CSE 8337
Lecturer: Dr. Margaret Dunham

April 26, 2007


Outline

• Introduction
  • Concept
  • Why important
• Approach
  • CLIR problems
  • Resource
  • Approaches
  • Example Techniques
• A CLIR application system
• CLIR effectiveness
• CLIR future tasks
• CLIR communities
• References


Cross Language IR

• Definition: Users enter their query in one language and the system retrieves relevant documents in other languages.
  • For example, a user may pose their query in English but retrieve relevant documents written in French.
• Example CLIR applications
  • Cross-language retrieval from texts
  • Cross-language retrieval from audio and images

[1] Wikipedia, http://en.wikipedia.org/wiki/Cross-language_information_retrieval
[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005

In this presentation, we focus on text IR only!


Monolingual vs. Bilingual vs. Multilingual

• Monolingual IR: Documents and user requests are in the same language
  Request (L1) → IR system over Documents (L1) → Results (L1)

• Cross-language IR: Documents and user requests are in different languages (bilingual IR)
  Request (L1) → Cross-language IR (CLIR) system over Documents (L2) → Results (L2)
  L1 is the source language; L2 is the target language

[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005


Monolingual vs. Bilingual vs. Multilingual (cont.)

• Multilingual IR: The collection contains documents in different languages, and search requests may be in any language (e.g. the Web)
  Request (L?) → Multilingual IR (MLIR) system over Documents (L2), Documents (L3), Documents (L4) → Results (L2, L3 or L4)


Why CLIR?

Top Ten Languages Used in the Web (Number of Internet Users by Language), Mar. 10, 2007

TOP TEN LANGUAGES         % of all        Internet Users  Internet Penetration  Internet Growth for     2007 Estimate World
IN THE INTERNET           Internet Users  by Language     by Language           Language (2000 - 2007)  Population for the Language
English                   29.5 %          328,666,386     28.7 %                139.6 %                 1,143,218,916
Chinese                   14.3 %          159,001,513     11.8 %                392.2 %                 1,351,737,925
Spanish                   8.0 %           88,920,232      20.2 %                260.3 %                 439,284,783
Japanese                  7.7 %           86,300,000      67.1 %                83.3 %                  128,646,345
German                    5.3 %           58,711,687      61.1 %                113.2 %                 96,025,053
French                    5.0 %           55,521,294      14.3 %                355.2 %                 387,820,873
Portuguese                3.6 %           40,216,760      17.2 %                430.8 %                 234,099,347
Korean                    3.1 %           34,120,000      45.6 %                79.2 %                  74,811,368
Italian                   2.8 %           30,763,940      51.7 %                133.1 %                 59,546,696
Arabic                    2.6 %           28,540,700      8.4 %                 931.8 %                 340,548,157
TOP TEN LANGUAGES         81.7 %          910,762,512     21.4 %                181.4 %                 4,255,739,462
Rest of World Languages   18.3 %          203,511,914     8.8 %                 444.5 %                 2,318,926,955
WORLD TOTAL               100.0 %         1,114,274,426   16.9 %                208.7 %                 6,574,666,417

[3] Internet World Stats, http://www.internetworldstats.com/stats7.htm
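
The percentage columns of the table follow from the user and population counts by simple ratios. As a quick sanity check, the sketch below recomputes two of them for three rows; the figures are copied from the table, while the function and variable names are chosen here only for illustration.

```python
# Figures copied from the Internet World Stats table (Mar. 10, 2007).
WORLD_USERS = 1_114_274_426

ROWS = {
    # language: (Internet users, 2007 estimated population for the language)
    "English": (328_666_386, 1_143_218_916),
    "Chinese": (159_001_513, 1_351_737_925),
    "Spanish": (88_920_232, 439_284_783),
}


def penetration(users: int, population: int) -> float:
    """Internet users as a percentage of the language's population."""
    return round(100 * users / population, 1)


def share_of_users(users: int, world_users: int = WORLD_USERS) -> float:
    """Internet users of this language as a percentage of all Internet users."""
    return round(100 * users / world_users, 1)


for language, (users, population) in ROWS.items():
    print(language, penetration(users, population), share_of_users(users))
```

The recomputed values agree with the "Internet Penetration" and "% of all Internet Users" columns for these rows.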


Why CLIR? (cont.)

• A collection may contain documents in many different languages, e.g. the Web. It would be impractical to form a query in each language.
• A document may be written in more than one language. For example:
  • Technical documents in which English jargon appears intermixed with narrative text in another language.
  • Academic works which cite the titles of documents in different languages.
• The user is not sufficiently fluent to express a query in a language, but is able to make use of the documents that are identified.
• The user is monolingual and wants to query in their native language, because they
  • can judge relevance even if the results are not translated
  • have access to document translation

[4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615. 1996
[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005


CLIR problems

• Handling non-ASCII character sets
• Untranslatable search keys (out-of-vocabulary, OOV): e.g. compound words, proper names, special terms
• Multi-word concepts, e.g. phrases and idioms
• Ambiguity, e.g. homonymy and polysemy
• Word inflections, e.g. plurals and gender

[5] Ari Pirkola, et al. Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings. Information Retrieval, Vol. 4. 2001
[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005


Resources for Translation

• Ontology: a representation of concepts and relationships
• Thesaurus
  • Most commonly, a listing of words with similar, related, or opposite meanings
  • It does not include the definitions of words
• Bilingual dictionary: a list of words together with additional word-specific information
• Bilingual controlled vocabulary: a carefully selected list of words and phrases, which are used to tag units of information (document or work) so that they may be more easily retrieved by a search
• Corpora: the document collection itself

[6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006
[4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615. 1996
[1] Wikipedia. Related pages.
[7] Metamodel.com. What are the differences between a vocabulary, a taxonomy, a thesaurus, an ontology, and a meta-model? http://www.metamodel.com/article.php?story=20030115211223271. 2004


An example of controlled vocabulary

[14] Boxes and Arrows, http://www.boxesandarrows.com/view/what_is_a_controlled_vocabulary

The hierarchical relationships

Women’s Pants:
  BT Pants
  NT Casual Pants
  NT Dress Pants
  NT Sports Pants
(BT = broader term, NT = narrower term)
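
The hierarchical entry above can be held in a small data structure. The following is a minimal sketch, assuming a plain dictionary keyed by term; the layout and the helper name `expand_with_narrower` are illustrative choices, not part of the cited article.

```python
# Hypothetical in-memory form of the "Women's Pants" entry.
# BT = broader term, NT = narrower term.
VOCABULARY = {
    "Women's Pants": {
        "BT": ["Pants"],
        "NT": ["Casual Pants", "Dress Pants", "Sports Pants"],
    },
}


def expand_with_narrower(term: str, vocabulary: dict) -> list[str]:
    """Return the term followed by its narrower terms, if any are recorded."""
    entry = vocabulary.get(term, {})
    return [term] + entry.get("NT", [])


print(expand_with_narrower("Women's Pants", VOCABULARY))
```

A search for the tag "Women's Pants" could then also match items tagged with any of its narrower terms.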

The equivalence relationship


What to translate?

• Document translation
  • Text translation
    E.g., translate entire document collection into English → search collection in English
  • Vector translation
• Query translation
  E.g., translate English query into Chinese query → search Chinese document collection

[6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006
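
To make the query translation route concrete, here is a minimal sketch of a word-by-word dictionary lookup followed by a monolingual search. The tiny dictionary, the pre-segmented toy collection and both function names are assumptions made for illustration; a real system would also need segmentation, disambiguation and a proper ranking function.

```python
# Toy English -> Chinese dictionary; the entries are illustrative only.
EN_ZH = {
    "information": ["信息"],
    "retrieval": ["检索"],
    "storage": ["存储"],
}

# Toy Chinese collection, already segmented into words for simplicity.
DOCUMENTS = {
    "d1": ["信息", "存储", "检索", "课程"],
    "d2": ["天气", "预报"],
}


def translate_query(query: str, dictionary: dict) -> list[str]:
    """Replace each query word by all of its dictionary translations.

    Words missing from the dictionary (OOV) are kept untranslated.
    """
    translated = []
    for word in query.lower().split():
        translated.extend(dictionary.get(word, [word]))
    return translated


def search(terms: list[str], documents: dict) -> list[tuple[str, int]]:
    """Rank documents by how many distinct query terms they contain."""
    scores = {
        doc_id: len(set(terms) & set(words))
        for doc_id, words in documents.items()
    }
    hits = [(doc_id, score) for doc_id, score in scores.items() if score > 0]
    return sorted(hits, key=lambda hit: -hit[1])


chinese_query = translate_query("information retrieval", EN_ZH)
print(chinese_query, search(chinese_query, DOCUMENTS))
```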


Tradeoffs

• Document Translation
  • Documents can be translated and stored offline
  • Dependent on a high-quality automatic machine translation (MT) system
  • Does not easily deal with changing document sets
• Query Translation
  • Often easier
  • Disambiguation of query terms may be difficult with short queries

[6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006

[4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615. 1996


Approaches to query translation

• Knowledge-based: Several aspects of domain knowledge are manually encoded into a lexicon.
  • Ontology-based (concept driven)
  • Thesaurus-based
  • Dictionary-based
  • Lexicons are expensive to construct and lag behind the common use of terminology.
• Corpus-based: directly exploit statistical information about term usage in a corpus; automatically construct a lexicon.
  • Parallel corpora: document pairs, sentence pairs, term pairs
  • Comparable corpora: document pairs, similar content
  • Unaligned corpora: documents from the same domain, not translations of one another, not linked in any other way

[8] Miguel E. Ruiz, CLIR. Slides for school seminars. 2001
[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007

[4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615. 1996


Applying monolingual IR techniques

• Query expansion
• Relevance feedback
• Stemming
• Latent semantic analysis
• Parsing
• Part of speech tagging
• ……

[4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615. 1996


Multilingual Thesauri

• Three construction techniques
  • Build it from scratch
  • Translate an existing thesaurus
  • Merge monolingual thesauri
• For example, EuroWordNet
  • 7 languages
  • Built from existing lexical resources
  • Has the same structure as Princeton WordNet

[8] Miguel E. Ruiz, CLIR. Slides for school seminars. 2001
[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007


Pseudo-Relevance Feedback

• Also called blind feedback
• Assume that the top n documents in the result set actually are relevant.
• Enter query terms in French
• Find top French documents in parallel corpus
• Construct a query from English translations
• Perform a monolingual free text search

[Figure: French query terms → French text retrieval system (over the parallel corpus) → top ranked French documents → English translations → AltaVista → English Web pages]

[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
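
The four steps above can be sketched in a few lines. This is a toy illustration under stated assumptions: the parallel corpus is three invented sentence pairs, ranking is plain term overlap, and the name `english_query_from_french` and the stopword list are hypothetical. The final monolingual search (AltaVista on the slide) is left out.

```python
from collections import Counter

# Toy document-aligned parallel corpus: (French text, English translation).
# The texts are invented for illustration.
PARALLEL = [
    ("recherche d'information multilingue", "multilingual information retrieval"),
    ("recherche d'information sur le web", "information retrieval on the web"),
    ("prévisions météo pour demain", "weather forecast for tomorrow"),
]
STOPWORDS = {"on", "the", "for"}


def english_query_from_french(french_query, corpus, top_n=2, n_terms=3):
    """Blind feedback across languages.

    Rank the French sides by term overlap with the query, assume the top_n
    are relevant, and build an English query from the most frequent words
    of their English translations.
    """
    query_terms = set(french_query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda pair: len(query_terms & set(pair[0].split())),
        reverse=True,
    )
    counts = Counter(
        word
        for _, english in ranked[:top_n]
        for word in english.split()
        if word not in STOPWORDS
    )
    return [term for term, _ in counts.most_common(n_terms)]


print(english_query_from_french("recherche d'information", PARALLEL))
```

The resulting English terms would then be submitted to an ordinary monolingual search engine.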


Different levels of alignment in parallel corpora

• Document alignment
  • Already exists
  • Collected from existing corpora
  • Examine document external features
  • Examine document internal features
• Sentence alignment
  • Easily constructed from aligned documents
  • Match pattern of relative sentence lengths
  • Good first step for term alignment
• Term alignment
  • Using co-occurrence-based translation

[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
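
"Match pattern of relative sentence lengths" can be illustrated with a small dynamic program. This is a deliberately simplified sketch, not the method of [9]: it assumes that only 1-1, 1-2 and 2-1 sentence groups occur, and it scores a group by how different its share of the source text is from its share of the target text. The function name is hypothetical.

```python
def align_by_length(src: list[str], tgt: list[str]):
    """Monotone sentence alignment using character lengths only.

    Returns a list of (source indices, target indices) groups.
    """
    total_src = sum(len(s) for s in src) or 1
    total_tgt = sum(len(t) for t in tgt) or 1

    def cost(src_ids, tgt_ids):
        # Difference between the shares of text covered on each side.
        src_share = sum(len(src[i]) for i in src_ids) / total_src
        tgt_share = sum(len(tgt[j]) for j in tgt_ids) / total_tgt
        return abs(src_share - tgt_share)

    inf = float("inf")
    n, m = len(src), len(tgt)
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = best[i][j] + cost(range(i, ni), range(j, nj))
                if c < best[ni][nj]:
                    best[ni][nj] = c
                    back[ni][nj] = (i, j)
    if best[n][m] == inf:
        raise ValueError("no alignment with 1-1, 1-2 and 2-1 groups exists")

    # Walk the back-pointers from the end to recover the groups.
    groups, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return groups[::-1]
```

For instance, a short sentence followed by a long one on the source side, against a short sentence followed by two medium ones on the target side, would pair the short sentences and group the long sentence with the two medium ones.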


Example of term alignment

CSE8337 是一门关于信息存储和检索的课程。

CSE8337 is a class about information storage and retrieval.


Co-occurrence-based translation

• Align terms using co-occurrence statistics
  • Assumes that the correct translations of query terms tend to co-occur in target language documents
  • How often does a term pair occur in sentence pairs?
    • Weighted by relative position in the sentences
  • Retain term pairs that occur unusually often

[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
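
As a rough illustration of "retain term pairs that occur unusually often", the sketch below counts how often each source/target term pair appears in the same aligned sentence pair and keeps pairs with a high Dice coefficient. The Dice score, the threshold and the function name are choices made here for illustration; the slide does not name a specific statistic, and the position weighting it mentions is omitted.

```python
from collections import Counter
from itertools import product


def cooccurrence_lexicon(sentence_pairs, threshold: float = 0.8):
    """Build a term-pair lexicon from tokenised, aligned sentence pairs.

    Each (source term, target term) pair is scored by the Dice coefficient
    2 * pairs_with_both / (pairs_with_source + pairs_with_target); pairs
    scoring at or above the threshold are kept.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in sentence_pairs:
        src_terms, tgt_terms = set(src_sentence), set(tgt_sentence)
        src_count.update(src_terms)
        tgt_count.update(tgt_terms)
        pair_count.update(product(src_terms, tgt_terms))

    lexicon = {}
    for (s, t), both in pair_count.items():
        dice = 2 * both / (src_count[s] + tgt_count[t])
        if dice >= threshold:
            lexicon[(s, t)] = dice
    return lexicon


# Invented, pre-segmented sentence pairs.
PAIRS = [
    (["信息", "检索"], ["information", "retrieval"]),
    (["信息", "存储"], ["information", "storage"]),
    (["存储", "检索"], ["storage", "retrieval"]),
]
print(cooccurrence_lexicon(PAIRS))
```

In this toy corpus only the three correct pairs co-occur in every sentence pair where either term appears, so only they survive the threshold.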


Exploiting Unaligned Corpora

• Example approach: category-based translation
  • Extract a large number of terms from unaligned corpora of the first and second languages
  • Assign a category to each extracted term by accessing monolingual thesauri of the first and second languages
  • Estimate category-to-category translation probabilities
  • Estimate term-to-term translation probabilities using said category-to-category translation probabilities

[15] David Hull, Terminology translation for unaligned comparable corpora using category based translation probabilities. United States Patent 6885985. Filing date: Dec 18, 2000. Issue date: Apr 26, 2005
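
The last step, going from category-level to term-level probabilities, might look roughly like the sketch below. It is only one plausible reading and is not taken from [15]: it assumes each term has exactly one category and that target terms within a category are equally likely. Every term, category and probability in it is hypothetical.

```python
# All categories, terms and probabilities below are hypothetical.
SRC_CATEGORY = {"banque": "FINANCE", "rive": "GEOGRAPHY"}
TGT_CATEGORY = {"bank": "finance", "loan": "finance", "shore": "geography"}

# P(target category | source category), assumed to be already estimated.
CATEGORY_TRANSLATION = {
    "FINANCE": {"finance": 0.9, "geography": 0.1},
    "GEOGRAPHY": {"finance": 0.1, "geography": 0.9},
}


def term_translation_probs(src_term: str) -> dict[str, float]:
    """Estimate P(target term | source term) through the categories.

    Assumes one category per term and a uniform distribution over the
    target terms inside each target category.
    """
    category_probs = CATEGORY_TRANSLATION[SRC_CATEGORY[src_term]]
    members: dict[str, list[str]] = {}
    for term, category in TGT_CATEGORY.items():
        members.setdefault(category, []).append(term)
    return {
        term: category_probs[category] / len(terms)
        for category, terms in members.items()
        for term in terms
    }


print(term_translation_probs("banque"))
```

Under these assumptions the categories alone cannot tell "bank" from "loan", since both sit in the same target category, which shows why term-level estimates need more than category evidence.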


In Summary

Cross-Language Text Retrieval
  • Query Translation
    • Controlled Vocabulary
    • Free Text
      • Knowledge-based
        • Ontology-based
        • Dictionary-based
        • Thesaurus-based
      • Corpus-based
        • Term-aligned
        • Sentence-aligned
        • Document-aligned
          • Parallel
          • Comparable
        • Unaligned
  • Document Translation
    • Text Translation
    • Vector Translation

[8] Miguel E. Ruiz, CLIR. Slides for school seminars. 2001


An experimental system

• Automatic construction of parallel English-Chinese corpus for CLIR
• A parallel text mining system: PTMiner
  • Finds parallel text from the Web
• Parallel Text Mining Algorithm

[10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000

1. Search for candidate sites - Using existing Web search engines, search for the candidate sites that may contain parallel pages; (by using text anchor)

2. File name fetching - For each candidate site, fetch the URLs of Web pages that are indexed by the search engines;

3. Host crawling - Starting from the URLs collected in the previous step, search through each candidate site separately for more URLs;

4. Pair scan - From the obtained URLs of each site, scan for possible parallel pairs; (by analyzing document external features)

5. Download and verify - Download the parallel pages, determine file size, language and character set, text length, HTML structure, and filter out non-parallel pairs.


The workflow of the mining process

• Sample anchor texts: “english version” [“in english”, ……]
• Sample document external features:
  • “file-ch.html” vs. “file-en.html”
  • “…/chinese/…/file.html” vs. “…/english/…file.html”
• Sample document internal features: character set, HTML structure

[10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000
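
The pair scan step can be sketched from the sample external features alone. The code below is a minimal illustration, not PTMiner's actual implementation: it handles only the two naming patterns quoted on this slide, and the function names and example URLs are hypothetical.

```python
import re
from typing import Optional


def to_english_url(url: str) -> Optional[str]:
    """Map a Chinese-version URL to the URL its English version would have.

    Returns None when the URL carries neither of the two language markers.
    """
    if "/chinese/" in url:
        return url.replace("/chinese/", "/english/", 1)
    if re.search(r"-ch(\.html?)$", url):
        return re.sub(r"-ch(\.html?)$", r"-en\1", url)
    return None


def scan_pairs(urls: list[str]) -> list[tuple[str, str]]:
    """Return (Chinese URL, English URL) candidates found among the URLs."""
    known = set(urls)
    pairs = []
    for url in urls:
        candidate = to_english_url(url)
        if candidate in known:
            pairs.append((url, candidate))
    return pairs


SITE_URLS = [
    "http://example.org/file-ch.html",
    "http://example.org/file-en.html",
    "http://example.org/chinese/a/file.html",
    "http://example.org/english/a/file.html",
    "http://example.org/about.html",
]
print(scan_pairs(SITE_URLS))
```

Candidates found this way would still go through the download and verification step, which checks internal features such as character set and HTML structure.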


An alignment example

[10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000


Part of the lexicons (t: true, f: false)

Other techniques and tools used:

• Encoding scheme transformation (for Chinese)
• Sentence level segmentation
• Chinese word segmentation
• English expression extraction
• SILC: language and encoding identification system

[10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000


Results

• 14820 pairs of texts (lexicon)
  • C-E has a precision of 77%
  • E-C has a precision of 81.5%
• CLIR results
  • Test corpus: TREC5 and TREC6 Chinese track

[10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000


Does CLIR work?

• Best systems at TREC-6 (1997):
  • English-French: 49% of highest French monolingual
  • English-German: 64% of highest German monolingual
• Best systems at CLEF (2002):
  • English-French: 83% of highest French monolingual
  • English-German: 86% of highest German monolingual
• Best systems at CLEF (2006):
  • English-French: 93.82% of best French monolingual
  • English-Portuguese: 90.91% of best Portuguese monolingual

[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005
[16] Giorgio M. Di Nunzio, CLEF 2006: Ad Hoc Track Overview. 2006
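
The percentages above express a cross-language run's score relative to the best monolingual run on the same target collection. A minimal sketch of that ratio follows; the two scores in the example are hypothetical values on a 0 to 1 scale (such as mean average precision) and are not taken from any of the cited evaluations.

```python
def relative_effectiveness(clir_score: float, monolingual_score: float) -> float:
    """Cross-language score as a percentage of the monolingual baseline."""
    return round(100 * clir_score / monolingual_score, 2)


# Hypothetical scores, for illustration only.
print(relative_effectiveness(0.30, 0.40))
```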


Future tasks

• Extend study scope: Web pages, medical literature, USENET newsgroup articles, records of legislative and legal proceedings…
• Lower cost, improve efficiency
  • Pay more attention to indexing-time optimizations to improve query-time efficiency
• Consider the user’s perspective
  • Improve the utility of ranked lists
• Define suitable criteria for the construction of a valid multilingual Web corpus
• Get resources for resource-poor languages

[11] D.W. Oard, When You Come to a Fork in the Road, Take It: Multiple Futures for CLIR Research. SIGIR 2002 CLIR
[12] Fredric Gey, et al, CROSS LANGUAGE INFORMATION RETRIEVAL: A RESEARCH ROADMAP. SIGIR 2002 CLIR


CLIR Communities

• TREC Cross Language Track, which currently focuses on the Arabic language
• Cross-Language Evaluation Forum (CLEF), a spinoff from TREC, covering many European languages
• NTCIR Asian Language Evaluation (covering Chinese, Japanese and Korean)

[12] Fredric Gey, et al, CROSS LANGUAGE INFORMATION RETRIEVAL: A RESEARCH ROADMAP. SIGIR 2002 CLIR


CLEF

In CLEF 2006, eight tracks were offered to evaluate the performance of systems:

• multilingual document retrieval on news collections (Ad-hoc)
• cross-language structured scientific data (Domain-specific)
• interactive cross-language retrieval
• multiple language question answering
• cross-language retrieval on image collections
• cross-language speech retrieval
• multilingual web retrieval
• cross-language geographic retrieval

[13] Carol Peters, Cross-Language Evaluation Forum - CLEF 2006. D-Lib Magazine October 2006


References

[1] Wikipedia, http://en.wikipedia.org/wiki/Cross-language_information_retrieval
[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005
[3] Internet World Stats, http://www.internetworldstats.com/stats7.htm
[4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR-3615. 1996
[5] Ari Pirkola, et al. Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings. Information Retrieval, Vol. 4. 2001
[6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006
[7] Metamodel.com. What are the differences between a vocabulary, a taxonomy, a thesaurus, an ontology, and a meta-model? http://www.metamodel.com/article.php?story=20030115211223271. 2004
[8] Miguel E. Ruiz, CLIR. Slides for school seminars. 2001
[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
[10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000
[11] D.W. Oard, When You Come to a Fork in the Road, Take It: Multiple Futures for CLIR Research. SIGIR 2002 CLIR
[12] Fredric Gey, et al, CROSS LANGUAGE INFORMATION RETRIEVAL: A RESEARCH ROADMAP. SIGIR 2002 CLIR
[13] Carol Peters, Cross-Language Evaluation Forum - CLEF 2006. D-Lib Magazine October 2006
[14] Boxes and Arrows, http://www.boxesandarrows.com/view/what_is_a_controlled_vocabulary
[15] David Hull, Terminology translation for unaligned comparable corpora using category based translation probabilities. United States Patent 6885985. Filing date: Dec 18, 2000. Issue date: Apr 26, 2005
[16] Giorgio M. Di Nunzio, CLEF 2006: Ad Hoc Track Overview. 2006


Thank you!