Multilingual Information Access: What and How Well?


  • 38th European Conference on Information Retrieval

    Nicola Ferro @frrncl

    University of Padua, Italy

    Workshop on Modeling, Learning and Mining for Cross/Multilinguality (MultiLingMine 2016) 20 March 2016, Padua, Italy

    Multilingual Information Access: What and How Well?

  • What

  • 38th European Conference on Information Retrieval

    #ecir2016 @frrncl

    N. Ferro - Multilingual Information Access: What and How Well?

    Some search trends: translate + …

    [Figure: chart of search trends, 2004 to 2015 (vertical axis 0 to 5000), with one series each for Facebook, Twitter, Whatsapp, Instagram, Spotify, Youtube]

    Multilingual information access is a multimedia concern

    Text is the primary enabler for multilingual information access

  • Top Ten Languages in the Web

    Data taken from http://www.internetworldstats.com/stats7.htm

    [Figure: chart for 2011 showing Users (million, axis 0 to 600) and Growth (%, axis 0% to 3000%) per language. Languages shown: English, Chinese, Spanish, Japanese, Portuguese, German, Arabic, French, Russian, Korean, Others]

  • Top Ten Languages in the Web

    Data taken from http://www.internetworldstats.com/stats7.htm

    [Figure: chart for 2015 showing Users (million, axis 0 to 900) and Growth (%, axis 0% to 7000%) per language. Languages shown: English, Chinese, Spanish, Arabic, Portuguese, Japanese, Russian, Malay, French, German, Others]

  • Multilingual Information Access

    Monolingual retrieval in non-English languages

    Bilingual retrieval A → B

    Multilingual retrieval A → A, B, ...

    Multilingual retrieval AB → A, AB, AC, B, BC, ABC, …

    Given a query in any medium and any language, select relevant items from a multilingual multimedia collection which can be in any medium and any language, and present them in the style or order most likely to be useful to the querier, with identical or near identical objects in different media or languages appropriately identified.

    [D. Oard & D. Hull, AAAI Symposium on Cross-Language IR, Spring 1997, Stanford]

  • Typical IR Flow

    [Figure: "IR-Cycle". Stages: Verbalization, Formulation / Coding, Query, Processing (IR System), Result. Example texts shown: "I need a trap to get rid of some mice"; "I could get me a cat!"; "How to get rid of mice – the politically correct way"; query "mouse trap"; documents "DOC1: Poisonless mousetraps", "DOC2: Get rid of mice", "DOC3: Traps for rodents", …; results "Traps to catch mice ..", "The Mousetrap, a play by Agatha Christie", "..a good trap against rodents"]

    [Figure: IR "Flow". Labels: Documents, Indexing, Document representation, Index; Query ("Wirtschaft"), Indexing, Query representation; Matching; Result]

    [Figure taken from M. Braschler, Multilingual Information Retrieval and Cross-Language Information Retrieval, TrebleCLEF Summer School 2009, Italy]
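
    The flow on this slide (documents and query are indexed into comparable representations, which are then matched) can be sketched in a few lines. This is a minimal illustration, not the system behind the cited figure: the tokenizer, the dictionary-based inverted index, and the term-overlap score are simplifying assumptions, and the documents are the example titles from the slide.

```python
from collections import defaultdict

PUNCTUATION = ".,:;!?\"'"

def represent(text):
    """Indexing step shared by documents and queries: lowercase word tokens."""
    tokens = (tok.strip(PUNCTUATION).lower() for tok in text.split())
    return [tok for tok in tokens if tok]

def build_index(docs):
    """Build an inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in represent(text):
            index[term].add(doc_id)
    return index

def match(index, query):
    """Score each document by the number of distinct query terms it contains."""
    scores = defaultdict(int)
    for term in set(represent(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

docs = {
    "DOC1": "Poisonless mousetraps",
    "DOC2": "Get rid of mice",
    "DOC3": "Traps for rodents",
}
index = build_index(docs)
ranking = match(index, "traps to get rid of mice")
no_hits = match(index, "mouse trap")
```

    With exact term matching, the short query "mouse trap" retrieves nothing from these three documents (neither "mousetraps" nor "traps" equals "trap"), which is the kind of vocabulary mismatch that segmentation, stemming and normalization are meant to reduce.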

  • Possible CLIR Flow: Query Translation

    MLIA/CLIR "flow"/"structure"

    Building an MLIA/CLIR system involves addressing different processing steps. We structure our discussion into the following list of steps:

    Indexing

    Translation

    Matching

    [Figure: CLIR flow with query translation. Labels: Documents, Indexing, Document representation, Index; Query ("Wirtschaft"), Indexing, Query representation, Translation, Query representation; Matching; Result]

    [Figure taken from M. Braschler, Multilingual Information Retrieval and Cross-Language Information Retrieval, TrebleCLEF Summer School 2009, Italy]
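
    As a rough sketch of the query-translation flow, the example below translates a German query term by term and then runs ordinary monolingual matching against English documents. Everything here is an assumption made for illustration: the toy dictionary `DE_EN`, the example documents, and the overlap score are hypothetical, and a real system would use a proper translation resource.

```python
# Toy bilingual dictionary (German -> English); entries are illustrative assumptions.
DE_EN = {
    "wirtschaft": ["economy", "industry"],
    "krise": ["crisis"],
}

def translate_query(query, dictionary):
    """Replace each source term by all of its dictionary translations.

    Ambiguous terms contribute every candidate translation.
    Out-of-vocabulary terms are kept unchanged, so that proper names
    and cognates can still match.
    """
    translated = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated

def match(docs, query_terms):
    """Rank documents by how many distinct query terms they contain."""
    wanted = set(query_terms)
    scored = []
    for doc_id, text in docs.items():
        score = len(set(text.lower().split()) & wanted)
        if score:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda item: (-item[1], item[0]))

english_docs = {
    "EN1": "the economy is recovering after the crisis",
    "EN2": "car industry reports record sales",
    "EN3": "traps for rodents",
}
query_terms = translate_query("Wirtschaft Krise", DE_EN)
ranking = match(english_docs, query_terms)
```

    Keeping all candidate translations is the simplest way to handle ambiguity, at the cost of pulling in documents such as EN2 that match only a secondary sense.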

  • Possible MLIR Flow: Query Translation

    Bilingual CLIR

    Maybe the "simplest scenario"

    We add query translation to a monolingual IR system

    How to integrate the translation step into the overall system?

    [Figure: MLIR flow with query translation. Labels: Documents, Indexing, Document representation, Index; Query, Indexing, Query representation, Translation, Query representation; Matching; Result; Merging; Result. The Translation, Index, Matching and Result boxes appear several times in parallel]

    [Figure taken from M. Braschler, Multilingual Information Retrieval and Cross-Language Information Retrieval, TrebleCLEF Summer School 2009, Italy]
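
    The merging box in this flow has to combine result lists whose scores are not directly comparable across languages. One simple option, shown here purely as an assumption and not as the method of the slides, is to min-max normalize the scores of each list and then sort the union. The run data and document ids are made up.

```python
def min_max_normalize(results):
    """Rescale the scores of one result list to the range [0, 1]."""
    if not results:
        return []
    scores = [score for _, score in results]
    low, high = min(scores), max(scores)
    if high == low:
        # All scores equal: treat every document as equally good.
        return [(doc_id, 1.0) for doc_id, _ in results]
    return [(doc_id, (score - low) / (high - low)) for doc_id, score in results]

def merge(result_lists, k=10):
    """Merge per-language result lists into one ranking of at most k items."""
    merged = []
    for results in result_lists.values():
        merged.extend(min_max_normalize(results))
    # Highest normalized score first; ties broken by document id.
    merged.sort(key=lambda item: (-item[1], item[0]))
    return merged[:k]

# Hypothetical per-language runs with scores on very different scales.
runs = {
    "en": [("EN1", 12.0), ("EN2", 8.0), ("EN3", 4.0)],
    "de": [("DE1", 0.9), ("DE2", 0.3)],
}
merged = merge(runs, k=4)
```

    Min-max normalization always maps the lowest-scored document of each list to 0, even if it is a good match, so it should be read as a baseline rather than a recommendation.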

  • Possible MLIR Flow: Query and Document Translation

    MLIA – Query Translation

    More complex setup

    A series of bilingual steps

    A merging step is needed to produce a single, integrated result

    [Figure: MLIR flow with query and document translation. Labels: Documents, Translation, Indexing, Document representation, Translation, Index; Query ("Wirtschaft"), Indexing, Query representation, Translation, Query representation; Matching; Result]

    [Figure taken from M. Braschler, Multilingual Information Retrieval and Cross-Language Information Retrieval, TrebleCLEF Summer School 2009, Italy]

  • MLIA: Issues to Consider

    Document representation

    Language identification: cross-scripting

    Segmentation: compound words, N-grams, discourse

    Stop lists: varying length

    Normalization: diacritics, spellings, ...

    Stemming: rule-based, statistical / N-grams

    Enrichment: named entity recognition, thesaurus expansion, authorship, ...

    Translation: ambiguity, out-of-vocabulary terms, bag-of-words, ...
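
    Three of the issues listed on this slide (diacritics normalization, stop lists, and character N-grams as a language-independent alternative to stemming) can be illustrated together. This is a minimal sketch under stated assumptions: the function names, the N-gram length of 4, and the example stop list are choices made for the example, not values from the slides.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def char_ngrams(word, n=4):
    """Overlapping character n-grams; words no longer than n are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def index_terms(text, stop_words=frozenset(), n=4):
    """Normalize, drop stop words, and emit character n-grams as index terms."""
    terms = []
    for token in strip_diacritics(text).lower().split():
        if token not in stop_words:
            terms.extend(char_ngrams(token, n))
    return terms

terms = index_terms("the mousetrap", stop_words={"the"})
```

    Because the compound "mousetrap" yields the N-gram "trap", it now shares an index term with the query word "trap" without any language-specific decompounding. Note that NFD stripping only handles characters with a canonical decomposition; letters such as "ß" or "ø" need separate rules.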

  • How Well

  • Why Evaluation?

    “To measure is to know”

    “If you cannot measure it, you cannot improve it”

    Lord William Thomson, first Baron Kelvin (1824-1907)

  • How Does Experimental Evaluation Work?

    Cranfield Paradigm

    Dates back to the mid-1960s

    Makes use of experimental collections, each consisting of:

    documents

    topics

    relevance judgments

    Ensures comparability and repeatability of the experiments
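
    To make the role of the three components concrete, the sketch below scores a system run against relevance judgments. The slide does not name a measure; average precision and its mean over topics are used here as an assumption, since they are commonly reported. Topic ids, document ids and judgments are invented for the example.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against the set of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant)

def mean_average_precision(run, qrels):
    """Mean of the per-topic average precision over all judged topics."""
    values = [
        average_precision(run.get(topic, []), relevant)
        for topic, relevant in qrels.items()
    ]
    return sum(values) / len(values)

# Relevance judgments: topic id -> set of relevant document ids (hypothetical).
qrels = {"T1": {"D1", "D3"}, "T2": {"D2"}}
# System output: topic id -> ranked list of document ids (hypothetical).
run = {"T1": ["D1", "D2", "D3"], "T2": ["D1", "D2"]}

map_score = mean_average_precision(run, qrels)
```

    Since the documents, topics and judgments are fixed, any two systems scored this way on the same collection can be compared directly, and the experiment can be repeated later.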

  • Large-scale Evaluation Campaigns

    Evaluation activities are conducted in large international fora

    TREC (Text REtrieval Conference), USA, since 1992

    NTCIR (NII Test Collection for IR Systems), Japan, since 1999

    CLEF (Conference and Labs of the Evaluation Forum), Europe, since 2000

    FIRE (Forum for Information Retrieval Evaluation), India, since 2008

    Share a common methodology, the Cranfield Paradigm

