38th European Conference on Information Retrieval
Nicola Ferro @frrncl
University of Padua, Italy
Workshop on Modeling, Learning and Mining for Cross/Multilinguality (MultiLingMine 2016) 20 March 2016, Padua, Italy
Multilingual Information Access: What and How Well?
What
#ecir2016 @frrncl
N. Ferro - Multilingual Information Access: What and How Well?
Some search trends: translate + …
[Figure: search trends for "translate + …" over 2004-2015 (vertical axis 0 to 5000); series: Facebook, Twitter, Whatsapp, Instagram, Spotify, Youtube]
Multilingual information access is a multimedia concern
Text is the primary enabler for multilingual information access
Top Ten Languages in the Web
Data taken from http://www.internetworldstats.com/stats7.htm
[Figure: users (million, axis 0 to 600) and growth (%, axis 0% to 3000%) by language, 2011. Languages shown: English, Chinese, Spanish, Japanese, Portuguese, German, Arabic, French, Russian, Korean, Others]
Top Ten Languages in the Web
Data taken from http://www.internetworldstats.com/stats7.htm
[Figure: users (million, axis 0 to 900) and growth (%, axis 0% to 7000%) by language, 2015. Languages shown: English, Chinese, Spanish, Arabic, Portuguese, Japanese, Russian, Malay, French, German, Others]
Multilingual Information Access
Monolingual retrieval in non-English languages
Bilingual retrieval A → B
Multilingual retrieval A → A, B, ...
Multilingual retrieval AB → A, AB, AC, B, BC, ABC, …
Given a query in any medium and any language, select relevant items from a multilingual multimedia collection which can be in any medium and any language, and present them in the style or order most likely to be useful to the querier, with identical or near identical objects in different media or languages appropriately identified.
[D. Oard & D. Hull, AAAI Symposium on Cross-Language IR, Spring 1997, Stanford]
Typical IR Flow
[Figure: "IR-Cycle" and IR "Flow".
IR-Cycle labels: Verbalization; Formulation / Coding; Query; IR System; Processing; Result.
IR-Cycle example texts: "I need a trap to get rid of some mice"; "I could get me a cat!"; "How to get rid of mice – the politically correct way"; "mouse trap"; "Traps to catch mice"; "'The Mousetrap', a play by Agatha Christie"; "a good trap against rodents"; "DOC1: Poisonless mousetraps DOC2: Get rid of mice DOC3: Traps for rodents …".
IR "Flow" components: Documents, Indexing, Document representation, Index, Query, Indexing, Query representation, Matching, Result. Example term: "Wirtschaft".]
[Figure taken from M. Braschler, Multilingual Information Retrieval and Cross-Language Information Retrieval, TrebleCLEF Summer School 2009, Italy]
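The IR "flow" in the figure (indexing of documents and query, then matching) can be sketched in a few lines. This is a minimal illustration, not the system shown in the figure: the whitespace tokenizer, the term-overlap scoring, and the function names `tokenize`, `build_index`, and `match` are all assumptions made for this example. The documents reuse the titles from the figure.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive indexing step)."""
    return text.lower().split()


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def match(index, query):
    """Rank documents by the number of distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Sort by descending score, then by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


documents = {
    "DOC1": "Poisonless mousetraps",
    "DOC2": "Get rid of mice",
    "DOC3": "Traps for rodents",
}
index = build_index(documents)
ranking = match(index, "get rid of mice")
```

With this naive representation the query "mouse trap" matches none of the three documents, because "mousetraps" and "traps" are different index terms. This is the kind of mismatch that segmentation and stemming are meant to reduce.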
Possible CLIR Flow: Query Translation
[Figure taken from M. Braschler, Multilingual Information Retrieval and Cross-Language Information Retrieval, TrebleCLEF Summer School 2009, Italy]
MLIA/CLIR "flow"/"structure"
Building an MLIA/CLIR system involves addressing different processing steps. We structure our discussion into the following list of steps:
Indexing
Translation
Matching
[Figure: CLIR flow with query translation. Components: Documents, Indexing, Document representation, Index, Query, Indexing, Query representation, Translation, Query representation, Matching, Result. Example term: "Wirtschaft".]
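One way to picture the translation step in a query-translation flow is a dictionary lookup applied to the query representation before matching. The sketch below is only an illustration under stated assumptions: the slides do not say which translation resource is used, and the function `translate_query` and the toy German-to-English dictionary `DE_EN` are hypothetical.

```python
def translate_query(terms, dictionary):
    """Replace each source term with all its known target-language translations.

    Terms missing from the dictionary (out-of-vocabulary) are kept unchanged.
    """
    translated = []
    for term in terms:
        translated.extend(dictionary.get(term, [term]))
    return translated


# Hypothetical German -> English dictionary, for illustration only.
DE_EN = {
    "wirtschaft": ["economy", "business"],
    "krise": ["crisis"],
}

query_terms = "Wirtschaft Krise Padua".lower().split()
english_terms = translate_query(query_terms, DE_EN)
```

Keeping every candidate translation makes the ambiguity of "Wirtschaft" visible in the translated query, and passing "padua" through untranslated shows the simplest possible treatment of an out-of-vocabulary term.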
Possible MLIR Flow: Query Translation
Bilingual CLIR
Maybe the "simplest scenario"
We add query translation to a monolingual IR system
How to integrate the translation step into the overall system?
[Figure: MLIR flow with query translation. Components: Documents, Indexing, Document representation, Query, Indexing, Query representation, repeated Translation, Index, Matching and Result components, Merging, Result.]
[Figure taken from M. Braschler, Multilingual Information Retrieval and Cross-Language Information Retrieval, TrebleCLEF Summer School 2009, Italy]
Possible MLIR Flow: Query and Document Translation
MLIA – Query Translation
More complex setup
A series of bilingual steps
A merging step is needed to produce a single, integrated result
[Figure: MLIR flow with query and document translation. Components: Documents, Translation, Indexing, Document representation, Translation, Index, Query, Indexing, Query representation, Translation, Query representation, Matching, Result. Example term: "Wirtschaft".]
[Figure taken from M. Braschler, Multilingual Information Retrieval and Cross-Language Information Retrieval, TrebleCLEF Summer School 2009, Italy]
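The merging step that turns several bilingual result lists into a single, integrated result can be sketched as follows. The slides do not name a merging strategy; min-max score normalization followed by a global sort is assumed here as one simple option, and the function names, document ids, and scores are invented for illustration.

```python
def min_max_normalize(run):
    """Rescale the scores of one result list to the [0, 1] interval."""
    scores = [score for _, score in run]
    low, high = min(scores), max(scores)
    if high == low:
        return [(doc_id, 1.0) for doc_id, _ in run]
    return [(doc_id, (score - low) / (high - low)) for doc_id, score in run]


def merge_runs(runs):
    """Merge per-language result lists into one list ordered by normalized score."""
    merged = []
    for run in runs:
        if run:  # skip empty result lists
            merged.extend(min_max_normalize(run))
    # Python's sort is stable, so ties keep the order in which runs were given.
    return sorted(merged, key=lambda pair: pair[1], reverse=True)


# Toy result lists whose raw scores live on very different scales.
english_run = [("en-12", 14.2), ("en-07", 9.1), ("en-33", 4.0)]
german_run = [("de-05", 0.82), ("de-19", 0.41)]
merged = merge_runs([english_run, german_run])
```

Normalizing first matters because raw scores from separate indexes are generally not comparable; without it, the English list above would outrank every German document simply because its scores are numerically larger.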
MLIA: Issues to Consider
Document representation
Language identification
  cross-scripting
Segmentation
  compound words
  N-grams
  discourse
Stop lists
  varying length
Normalization
  diacritics
  spellings
  ...
Stemming
  rule-based
  statistical / N-grams
Enrichment
  named entity recognition
  thesaurus expansion
  authorship
  ...
Translation
  ambiguity
  out-of-vocabulary terms
  bag-of-words
  ...
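A few of the issues listed above (diacritics normalization, stop lists, character N-grams) can be made concrete with a short sketch. It relies only on the Python standard library; the tiny stop lists, the choice of 4-grams, and the function names `strip_diacritics`, `char_ngrams`, and `represent` are assumptions made for this example, not recommendations from the slides.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def char_ngrams(token, n=4):
    """Return overlapping character n-grams; short tokens are kept whole."""
    if len(token) <= n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


# Tiny illustrative stop lists; real lists differ in length across languages.
STOP_WORDS = {
    "en": {"the", "of", "a"},
    "it": {"il", "la", "di", "e"},
}


def represent(text, language, n=4):
    """Normalize, remove stop words, and expand tokens into character n-grams."""
    tokens = strip_diacritics(text.lower()).split()
    kept = [t for t in tokens if t not in STOP_WORDS.get(language, set())]
    return [gram for token in kept for gram in char_ngrams(token, n)]


grams = represent("La città di Padova", "it")
```

Character N-grams are language-independent, which is why they appear in the list both as a segmentation device and as a statistical alternative to rule-based stemming.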
How Well
Why Evaluation?
“To measure is to know”
“If you cannot measure it, you cannot improve it”
Lord William Thomson, first Baron Kelvin (1824-1907)
How Does Experimental Evaluation Work?
Cranfield Paradigm
Dates back to the mid-1960s
Makes use of experimental collections
documents
topics
relevance judgments
Ensures comparability and repeatability of the experiments
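To show how the three components of an experimental collection fit together, the sketch below scores a system's ranked output against relevance judgments. The slides do not name specific measures; precision at a cutoff and average precision are assumed here as common examples, and the topics, documents, and judgments are toy data invented for illustration.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are judged relevant."""
    top_k = ranking[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Toy relevance judgments: topic id -> set of relevant document ids.
qrels = {"T1": {"d1", "d3"}, "T2": {"d2"}}
# Toy system output: topic id -> ranked list of document ids.
run = {"T1": ["d1", "d2", "d3", "d4"], "T2": ["d4", "d2", "d1", "d3"]}

ap = {topic: average_precision(run[topic], qrels[topic]) for topic in qrels}
mean_ap = sum(ap.values()) / len(ap)
```

Because the documents, topics, and judgments are fixed, any other system can be scored on the same data with the same functions, which is what makes the results comparable and the experiment repeatable.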
Large-scale Evaluation Campaigns
Evaluation activities are conducted in large international fora
TREC (Text REtrieval Conference), USA, since 1992
NTCIR (NII Test Collection for IR Systems), Japan, since 1999
CLEF (Conference and Labs of the Evaluation Forum), Europe, since 2000
FIRE (Forum for Information Retrieval Evaluation), India, since 2008
Share a common methodology, the Cranfield Paradigm