A merging strategy proposal: The 2-step retrieval status value method
Fernando Martínez-Santiago · L. Alfonso Ureña-López · Maite Martín-Valdivia
Department of Computer Science, University of Jaén, Jaén, Spain
Inf Retrieval (2006) 9: 71–93
Merging problem
[Figure: a query is run against Language 1, Language 2 and Language 3; each language returns its own result list (d11, d12, d13, …; d21, d22, d23, …; d31, d32, d33, …), and a merge strategy combines them into a single result list (d31, d32, d21, d11, d12, d23, d13, …).]
Traditional solution
• Round-Robin
– Language 1 list: d11 d12 d13 …
– Language 2 list: d21 d22 d23 …
– Language 3 list: d31 d32 d33 …
– Merge: d11 d21 d31 d12 d22 d32 …
• Raw-scoring
• Normalized scoring
– 1)
– 2)
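The three traditional strategies can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and because the two normalized-scoring formulas did not survive in the slide text, the sketch assumes the two variants that are commonly used (division by the list maximum, and min-max scaling).

```python
from itertools import zip_longest


def round_robin(ranked_lists):
    """Interleave ranked lists: first of each list, then second of each, ..."""
    merged = []
    for tier in zip_longest(*ranked_lists):
        merged.extend(doc for doc in tier if doc is not None)
    return merged


def raw_scoring(scored_lists):
    """Pool all (doc, score) pairs and sort by the unmodified score."""
    pooled = [pair for lst in scored_lists for pair in lst]
    return [doc for doc, _ in sorted(pooled, key=lambda p: p[1], reverse=True)]


def normalized_scoring(scored_lists, use_min=False):
    """Rescale scores per list before pooling (assumed variants, see lead-in).

    use_min=False: score / max(score); use_min=True: min-max scaling.
    """
    pooled = []
    for lst in scored_lists:
        if not lst:
            continue
        scores = [s for _, s in lst]
        hi = max(scores)
        lo = min(scores) if use_min else 0.0
        span = (hi - lo) or 1.0  # avoid division by zero on flat lists
        pooled.extend((doc, (s - lo) / span) for doc, s in lst)
    return [doc for doc, _ in sorted(pooled, key=lambda p: p[1], reverse=True)]
```

Raw scoring lets a collection with systematically larger scores dominate the merged list, which is the usual motivation for normalizing each list first.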
Traditional solution
• Logistic regression (Calvé and Savoy (2000), Savoy (2003a))
• LVQ neural networks (Martín et al. 2003)
2-step retrieval status value method
• Step 1:
– translating the query and searching it on each monolingual collection produces two results:
a) a set of concepts T', each concept consisting of a term together with its corresponding translations
b) a multilingual collection D', the union of the 1000 documents retrieved for each language.
2-step retrieval status value method
• Step 2:
– re-indexing D', but considering solely the T' vocabulary.
– given a concept, its document frequency is obtained by grouping together the document frequencies of the terms that make up the concept.
2-step retrieval status value method
• For example:
• the Spanish word "casa" translates to the English words "house" and "home".
Given a document, the term frequency is calculated as usual, and the document frequency is the sum of the document frequencies of "casa", "house" and "home".
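The "casa" example can be written out as a small Python sketch. The function names and the document-frequency counts are invented for illustration; counting every member term of the concept toward the term frequency is one reading of "as usual", not something the slides spell out.

```python
def concept_document_frequency(concept_terms, df_by_term):
    """Document frequency of a concept: the sum of its member terms' df."""
    return sum(df_by_term.get(term, 0) for term in concept_terms)


def concept_term_frequency(concept_terms, document_tokens):
    """Occurrences of any member term of the concept in one document."""
    members = set(concept_terms)
    return sum(1 for token in document_tokens if token in members)


# Hypothetical concept and invented per-collection document frequencies.
concept = ["casa", "house", "home"]
df_by_term = {"casa": 120, "house": 300, "home": 250}

concept_df = concept_document_frequency(concept, df_by_term)
concept_tf = concept_term_frequency(concept, "the house is a home".split())
```

With these made-up counts the concept's document frequency is 120 + 300 + 250.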
Mixed 2-step RSV
• Not aligned words
• Raw mixed 2-step RSV method
– for a given term $\tau_{ij}$ (term j in monolingual collection i), the document frequency value will be:
  • as in the 2-step method, if $\tau_{ij}$ is aligned;
  • the initial weight from the first step of the method, if the translation of $\tau_{ij}$ into the other languages is unknown.
• $RSV_i = \alpha \cdot RSV_i^{align} + (1 - \alpha) \cdot RSV_i^{nonalign}$
– $\alpha = 0.75$
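The raw mixed combination is a plain weighted sum, which a short Python sketch makes concrete. The function name and the sample scores are illustrative only; the default weight is the α = 0.75 quoted on the slide.

```python
def raw_mixed_rsv(rsv_align, rsv_nonalign, alpha=0.75):
    """Weighted sum of the aligned and non-aligned scores of one document."""
    return alpha * rsv_align + (1 - alpha) * rsv_nonalign


# Invented scores for a single document.
score = raw_mixed_rsv(rsv_align=2.0, rsv_nonalign=1.0)
```

With α = 0.75 the aligned part of the query contributes three times the weight of the non-aligned part.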
Mixed 2-step RSV
• Normalized mixed 2-step RSV method
– $\alpha = 0.75$
Mixed 2-step RSV
• Learning-based algorithm
– Logistic regression
  • $\alpha$, $\beta_1$, $\beta_2$ and $\beta_3$ must be estimated using the iteratively re-weighted least squares method
– LVQ neural network (Martín et al. 2003)
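For readers unfamiliar with iteratively re-weighted least squares (IRLS), the following is a generic pure-Python sketch of fitting a logistic regression with an intercept (playing the role of α) and one coefficient per feature (β1, β2, …). The regression formula itself is not preserved in the slide text, so the choice of features is left open here, and all names, including the small `ridge` stabilizer, are assumptions of this sketch rather than the authors' setup.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_logistic_irls(features, labels, iterations=25, ridge=1e-6):
    """Return [intercept, beta_1, ..., beta_k] fitted by IRLS (Newton steps)."""
    rows = [[1.0] + list(f) for f in features]  # prepend the intercept column
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iterations):
        probs = [1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row))))
                 for row in rows]
        # Gradient X^T (y - p) and weighted normal matrix X^T W X.
        grad = [sum((y - p) * row[j] for row, y, p in zip(rows, labels, probs))
                for j in range(k)]
        hess = [[sum(p * (1 - p) * row[i] * row[j] for row, p in zip(rows, probs))
                 + (ridge if i == j else 0.0)
                 for j in range(k)] for i in range(k)]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta
```

The sketch assumes the training data are not perfectly separable; on separable data the coefficients grow without bound and the loop would need a convergence check or regularization.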
Using machine translation to align words
• $P_{en}$ = "Pesticides in baby food"
– Unigrams $P_{en}$ = {Pesticides, baby, food}
– Bigrams $P_{en}$ = {Pesticides baby, baby food}
• the expression to be translated is:
– $EXP_{en}$ = {Pesticides in baby food} {Pesticides, baby, food} {Pesticides baby, baby food}
• Then we have:
• $P_{sp}$ = {Pesticidas alimento niños}
• Unigrams $P_{sp}$ = {Pesticidas, bebé, alimento} (the translation of Unigrams $P_{en}$)
• Bigrams $P_{sp}$ = {Pesticidas bebés, alimento niños} (the translation of Bigrams $P_{en}$)
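How the unigram and bigram sets for the English phrase could be derived is sketched below in Python. The slides do not say how "in" is dropped, so the stopword set and the function name are assumptions made for this illustration.

```python
def unigrams_and_bigrams(phrase, stopwords):
    """Split a phrase into content words and adjacent content-word pairs."""
    words = [w for w in phrase.split() if w.lower() not in stopwords]
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words, bigrams


# Hypothetical stopword list containing only the word needed for the example.
unigrams_en, bigrams_en = unigrams_and_bigrams("Pesticides in baby food", {"in"})

# The full phrase, its unigrams and its bigrams, grouped for translation.
expression_en = [["Pesticides in baby food"], unigrams_en, bigrams_en]
```

Sending the phrase, the unigrams and the bigrams through the translator together is what later makes it possible to match each translated word back to its source word.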
Using machine translation to align words
• For each $word^{sp}_i \in$ Unigrams $P_{sp}$ do:
– (a) if $word^{sp}_i \in P_{sp}$, then remove $word^{sp}_i$ from $P_{sp}$ and add $(word^{sp}_i, word^{en}_i)$ to the set of aligned words ALIGNED.
• Thus, we obtain:
– $P_{sp}$ = {niños}
– ALIGNED = {(pesticidas, pesticides), (alimento, food)}
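The unigram pass can be traced with a short Python sketch. It assumes, as the slide's indices suggest, that the i-th translated unigram corresponds to the i-th English unigram, and that words are compared in lowercase; the function name is invented here.

```python
def align_unigrams(phrase_sp, unigrams_sp, unigrams_en):
    """Align translated unigrams that also occur in the translated phrase.

    Returns the phrase words still unaligned and the set of aligned pairs.
    """
    remaining = list(phrase_sp)
    aligned = set()
    for word_sp, word_en in zip(unigrams_sp, unigrams_en):
        if word_sp in remaining:
            remaining.remove(word_sp)
            aligned.add((word_sp, word_en))
    return remaining, aligned


remaining_sp, aligned = align_unigrams(
    phrase_sp=["pesticidas", "alimento", "niños"],
    unigrams_sp=["pesticidas", "bebé", "alimento"],
    unigrams_en=["pesticides", "baby", "food"],
)
```

"bebé" does not occur in the translated phrase, so "baby" stays unaligned after this pass and "niños" remains in the phrase.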
Using machine translation to align words
• For each bigram $bigram^{sp}_i \in$ Bigrams $P_{sp}$:
– (a) if $(word^{sp}_1, word^{en}_1) \in$ ALIGNED ($word^{sp}_1$ is aligned with $word^{en}_1$) and $word^{sp}_2 \in P_{sp}$, then remove $word^{sp}_2$ from $P_{sp}$ and add $(word^{sp}_2, word^{en}_2)$ to the ALIGNED set.
– (b) if $(word^{sp}_1, word^{en}_2) \in$ ALIGNED and $word^{sp}_2 \in P_{sp}$, then remove $word^{sp}_2$ from $P_{sp}$ and add $(word^{sp}_2, word^{en}_1)$ to the ALIGNED set.
Using machine translation to align words
– (c) if $(word^{sp}_2, word^{en}_1) \in$ ALIGNED and $word^{sp}_1 \in P_{sp}$, then remove $word^{sp}_1$ from $P_{sp}$ and add $(word^{sp}_1, word^{en}_2)$ to the ALIGNED set.
– (d) if $(word^{sp}_2, word^{en}_2) \in$ ALIGNED and $word^{sp}_1 \in P_{sp}$, then remove $word^{sp}_1$ from $P_{sp}$ and add $(word^{sp}_1, word^{en}_1)$ to the ALIGNED set.
• $P_{sp} = \emptyset$
• ALIGNED = {(pesticidas, pesticides), (alimento, food), (niños, baby)}
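Rules (a) to (d) can be traced on the running example with the Python sketch below. It assumes that the i-th translated bigram corresponds to the i-th English bigram and that at most one rule fires per bigram (the slides do not say whether the rules are exclusive); the function name and the lowercase word forms are choices made for this illustration.

```python
def align_bigrams(remaining_sp, aligned, bigrams_sp, bigrams_en):
    """Use already aligned words to align the other word of each bigram."""
    remaining = set(remaining_sp)
    aligned = set(aligned)
    for (sp1, sp2), (en1, en2) in zip(bigrams_sp, bigrams_en):
        if (sp1, en1) in aligned and sp2 in remaining:    # rule (a)
            remaining.discard(sp2)
            aligned.add((sp2, en2))
        elif (sp1, en2) in aligned and sp2 in remaining:  # rule (b)
            remaining.discard(sp2)
            aligned.add((sp2, en1))
        elif (sp2, en1) in aligned and sp1 in remaining:  # rule (c)
            remaining.discard(sp1)
            aligned.add((sp1, en2))
        elif (sp2, en2) in aligned and sp1 in remaining:  # rule (d)
            remaining.discard(sp1)
            aligned.add((sp1, en1))
    return remaining, aligned


remaining_sp, aligned = align_bigrams(
    remaining_sp={"niños"},
    aligned={("pesticidas", "pesticides"), ("alimento", "food")},
    bigrams_sp=[("pesticidas", "bebés"), ("alimento", "niños")],
    bigrams_en=[("pesticides", "baby"), ("baby", "food")],
)
```

In the example only rule (b) fires: "alimento" is already aligned with "food", the second English word of "baby food", so "niños" is paired with the first one, "baby".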
Method conclusion
• Fully aligned words
– 2-step method
• Partially aligned words
– Raw mixed 2-step RSV
– Normalized mixed 2-step RSV
– Logistic regression mixed 2-step RSV
– Neural network mixed 2-step RSV
• Algorithm to align phrases and translations
Experiment
• Document
– CLEF 2003 has two tasks, CLEF 2003-8 and CLEF 2003-4. CLEF 2003-4 is limited to four languages (English, French, German and Spanish).
• Query (Title + Description)
Experiment
• The documents are indexed with the Zprise IR system, using the OKAPI probabilistic model (fixed at b = 0.75 and k1 = 1.2)
• Translation strategies
– Machine Readable Dictionary (MRD, Babylon)
  • picking either the first translation available (labelled "Babylon 1") or the first two terms (labelled "Babylon 2")
– Machine Translation (MT, Babelfish)
– Mixed MT and MRD
  • taking the Babelfish and Babylon 1 translations together.
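To show where the OKAPI settings quoted above (b = 0.75, k1 = 1.2) enter the score, here is a minimal Python sketch of the widely used BM25 term weight. The exact variant implemented in the Zprise system is not given in the slides, so this textbook form, the function name and the sample numbers are assumptions.

```python
import math


def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 weight of one query term in one document (textbook form)."""
    # Inverse document frequency (Robertson/Sparck Jones style).
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    # b controls how strongly long documents are penalized.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences saturate.
    return idf * tf * (k1 + 1) / (tf + length_norm)


# Invented statistics: the same term in an average-length and a long document.
w_average = bm25_weight(tf=1, df=10, n_docs=1000, doc_len=100, avg_doc_len=100)
w_long = bm25_weight(tf=1, df=10, n_docs=1000, doc_len=400, avg_doc_len=100)
```

The document's score for a query is the sum of these weights over the query terms; the longer document receives the smaller weight for the same term frequency.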
Experiment 1 – multilingual results with fully aligned queries
Experiment 1 – analysis of failures
• Too many documents from the Spanish collection for this query
Experiment 2 – multilingual results with partially aligned queries
• Based on the MRD translation approach
Experiment 2 – multilingual results with partially aligned queries
• Based on the MT translation approach
• With the CLEF 2001–2002 test collection and the CLEF2001+CLEF2002+CLEF2003 query set (160 queries, five languages: EN, SP, DE, FR, IT)
Conclusion
• Future effort
– Dealing with translation probabilities.
– Testing the method with other translation strategies such as the Multilingual Similarity Thesaurus.
– n-grams indexing.
– Continuing to study strategies for dealing with aligned and non-aligned term queries: the integration of both sorts of terms by means of Bayesian networks.