35
Multilingual Search using Query Translation and Collection Selection Jacques Savoy, Pierre-Yves Berger University of Neuchatel, Switzerland www.unine.ch/info/clef/

Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Multilingual Search using

Query Translation and

Collection Selection

Jacques Savoy, Pierre-Yves Berger

University of Neuchatel, Switzerland

www.unine.ch/info/clef/

Page 2: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Our IR Strategy

1. Effective monolingual IR system1. Effective monolingual IR system

(combining indexing schemes?)(combining indexing schemes?)

2. Combining query translation tools2. Combining query translation tools

3. Effective collection selection and 3. Effective collection selection and

merging strategiesmerging strategies

Page 3: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Effective Monolingual IR

1.1. Define a Define a stopword stopword listlist

2.2. Have a « good » stemmerHave a « good » stemmer

Removing only the feminineRemoving only the feminineand plural suffixes (inflections)and plural suffixes (inflections)Focus on Focus on nounsnouns and and adjectivesadjectivessee www.see www.unineunine..chch/info/clef//info/clef/

Page 4: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Indexing Units

For Indo-European languages : Set of significant words

CJK: bigramswith stoplistwithout stemming

For Finnish, we used 4-grams.

Page 5: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

IR Models

ProbabilisticOkapiProsit or deviation from randomness

Vector-spaceLnu-ltctf-idf (ntc-ntc)binary (bnn-bnn)

Page 6: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Monolingual Evaluation

0.2017

0.3309

0.4349

0.4568

0.4685

word

French Portug.EnglishModel

wordwordQ=TD

0.18340.3005binary

0.3708

0.4579

0.4695

0.4835

0.3706tf-idf

0.4979Lnu-ltc

0.5313Prosit

0.5422Okapi

Page 7: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Monolingual Evaluation

wordwordQ=TD

0.15120.1859binary

0.27160.3862tf-idf

0.37940.4643Lnu-ltc

0.34480.4620Prosit

0.38000.4773Okapi

RussianFinnishModel

Page 8: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Monolingual Evaluation

4-gramword4-gramwordQ=TD

0.03730.15120.23870.1859binary

0.4466

0.5022

0.5357

0.5386

0.19160.27160.3862tf-idf

0.28520.37940.4643Lnu-ltc

0.28790.34480.4620Prosit

0.28900.38000.4773Okapi

RussianFinnishModel

Page 9: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Problems with Finnish

Finnish can be characterized by:

1. A large number of cases (12+),

2. Agglutinative (saunastani)

& compound construction (rakkauskirje)

3. Variations in the stem

(matto, maton, mattoja, …).

Do we need a deeper morpholgical analysis?

Page 10: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Improvement using…

Relevance feedback (Rocchio)?

Page 11: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Monolingual Evaluation

+8.4%+6.1%+1.6%+8.1%

+3.8%-1.5%+3.5%+5.2%

RussianFinnishFrenchEnglishQ=TD

0.37360.56840.46430.5742+PRF

0.4568

0.4851

0.4685

0.34480.53570.5313Prosit

0.39450.53080.5704+PRF

0.38000.53860.5422Okapi

Page 12: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Relevance FeedbackMAP after blind query-expansion

0 10 15 20 30 40 50 60 75 100# of terms added (from 3 best-ranked doc.)

0.36

0.38

0.4

0.42

0.44

0.46

0.48

0.5

Prosit Okapidtu-dtn Lnu-ltc

Page 13: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Improvement using…

Data fusion approaches (see the paper)

Improvement with the English, Finnish and Portuguese corpora

Hurt the MAP with the French and Russian collections

Page 14: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Bilingual IR

1. Machine-readable dictionaries (MRD)e.g., Babylon

2. Machine translation services (MT)e.g., FreeTranslation

BabelFish

3. Parallel and/or comparable corpora(not used in this evaluation campaign)

Page 15: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Bilingual IR

“Tour de France WinnerWho won the Tour de France in 1995?”

(Babylon1) “France, vainqueurOrganisation Mondiale de la Santé, le, France,* 1995”

(Right) “Le vainqueur du Tour de FranceQui a gagné le tour de France en 1995 ?”

Page 16: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Bilingual EN->FR/FI/RU/PT

0.3888

0.2960

0.1216

0.3067

0.2209

0.3800

Russianword

0.4057N/A0.3845FreeTra

0.42040.30420.4066Combi.

0.3830

0.2664

0.3706

0.4685

Frenchword

N/AN/AReverso

0.32770.2653InterTra

0.30710.1965Baby1

0.48350.5386Manual

Portugword

Finnish4-gram

TDOkapi

Page 17: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Bilingual EN->FR/FI/RU/PT

102.3%

77.9%

32.0%

80.7%

58.1%

0.3800

Russianword

83.9%N/A82.1%FreeTra

86.9%56.5%86.8%Combi.

81.8%

56.9%

79.1%

0.4685

Frenchword

N/AN/AReverso

67.8%49.3%InterTra

63.5%36.5%Baby1

0.48350.5386Manual

Portugword

Finnish4-gram

TDOkapi

Page 18: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Bilingual IR

Improvement …Query combination (concatenation)Relevance feedback

(yes for PT, FR, RU, no for FI)Data fusion

(yes for FR, no for RU, FI, PT)

Page 19: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Multilingual EN->EN FI FR RU

1. Create a common indexDocument translation (DT)

2. Search on each language andmerge the result lists (QT)

3. Mix QT and DT

4. No translation

Page 20: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Merging Problem

EN

<<–––– MergingMerging

RUFRFI

Page 21: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Merging Problem

1 EN120 1.22 EN200 1.03 EN050 0.74 EN705 0.6…

1 FR043 0.82 FR120 0.753 FR055 0.654 …

1 RU050 1.62 RU005 1.13 RU120 0.94 …

Page 22: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Merging Strategies

Round-robin (baseline)

Raw-score merging

Normalize

Biased round-robin

Z-score

Logistic regression

Page 23: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Merging Problem

1 EN120 1.22 EN200 1.03 EN050 0.74 EN705 0.6…

1 FR043 0.82 FR120 0.753 FR055 0.654 …

1 RU050 1.62 RU005 1.13 RU120 0.94 …

1 RU… 1 RU… 2 EN…2 EN…3 RU…3 RU…4 ….4 ….

Page 24: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Merging Strategies

Round-robin (baseline)

Raw-score merging

Normalize

Biased round-robin

Z-score

Logistic regression

Page 25: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Test-Collection CLEF-2004

EN FI FR RU

size 154 MB 137 MB 244 MB 68 MB

doc 56,472 55,344 90,261 16,716

topic 42 45 49 34

rel./query 8.9 9.2 18.7 3.6

Page 26: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Merging Strategies

Round-robin (baseline)

Raw-score merging

Normalize

Biased round-robin

Z-score

Logistic regression

Page 27: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Z-Score Normalization

1 EN120 1.22 EN200 1.03 EN050 0.74 EN765 0.6… …

Compute the mean µ and standard deviation σ

New score = ((old score-µ) / σ ) + δ

1 EN120 7.02 EN200 5.03 EN050 2.04 EN765 1.0…

Page 28: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

MLIR Evaluation

Condition A (meta-search)

Prosit/Okapi (+PRF, data fusion

for FR, EN)

Condition C (same SE)

Prosit only (+PRF)

Page 29: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Multilingual Evaluation

0.28670.2669Z-score wt

0.33930.3090Logistic

0.2639

0.2899

0.0642

0.2386

Cond. A

0.2613Biased RR

0.2646Norm

0.3067Raw-score

0.2358Round-robin

Cond. CEN->ENFRFI RU

Page 30: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Collection Selection

General idea:1. Use the logistic regression to compute

a probability of relevance for each document;

2. Sum the document scores from the first 15 best ranked documents;

3. If this sum ≥ δ, consider all the list,otherwise select only the first m doc.

Page 31: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Multilingual Evaluation

0.33780.2953Select (m=3)

0.2957

0.3234

0.3090

0.2386

Cond. A

0.3405Select (m=0)

0.3558Optimal select

0.3393Logistic reg.

0.2358Round-robin

Cond. CEN->ENFRFI RU

Page 32: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Conclusion (monolingual)

Finnish is still a difficult language

Blind-query expansion:different patterns with

different IR models

Data fusion (?)

Page 33: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Conclusion (bilingual)

Translation resources not available for all languages pairs

Improvement bycombining query translations yes, but can we have a better combination scheme?

Page 34: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Conclusion (multilingual)

Selection and merging are still hard problems

Overfitting is not always the best choice (see results with Condition C)

Page 35: Multilingual Search using Query Translation and Collection ...delos-old.isti.cnr.it/eventlist/CLEF2004/Savoy.pdf · Effective Monolingual IR 1. Define a stopword list 2. Have a «

Multilingual Search using

Query Translation and

Collection Selection

Jacques Savoy, Pierre-Yves Berger

University of Neuchatel, Switzerland

www.unine.ch/info/clef/