LREC 2008
From Research to Application in Multilingual Information Access:
The Contribution of Evaluation
Carol Peters, ISTI-CNR, Pisa, Italy
Outline
What is MLIA/CLIR? What is the State-of-the-Art?
Where are the Problems?
What is the Contribution of Evaluation?
What more can we do? From CLEF to TrebleCLEF
Europe’s Linguistic Diversity
There are 6,800 known languages spoken in 200 countries; 2,261 have writing systems (the others are only spoken)
Just 300 have some kind of language processing tools
What is MLIA?
MLIA-related research concerns the storage, access, retrieval and presentation of information in any of the world's languages.
Two main areas of interest:
multiple language access, browsing and display
cross-language information discovery and retrieval
Multi-Language Access, Browsing, Display
The enabling technologies:
character encoding
the specific requirements of particular languages and scripts
internationalization & localization
Cross-Language Information Retrieval
Crossing the language barrier…
querying a multilingual collection in one language against documents in many other languages
filtering, selecting and ranking the retrieved documents
presenting the retrieved information in an interpretable and exploitable fashion
The Problem
[Diagram: a user's query and the target document, with the query representation and the document representation separated by the language barrier]
CLIR Methods: How is it Done?
Pre-process & index both documents and queries – generally using language-dependent techniques (tokenisation, stopword removal, stemming, morphological analysis, decompounding, etc.)
Translate: queries or documents (or both)
Translation resources: Machine Translation (MT), parallel/comparable corpora, bilingual dictionaries, multilingual thesauri, conceptual interlingua
Find relevant documents in the target collection(s) & present the results
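The translate-then-retrieve pipeline above can be sketched minimally. The tiny FR→EN dictionary, stopword list and document collection below are invented purely for illustration; real systems use MT or large bilingual resources.

```python
# Minimal query-translation CLIR sketch (illustrative only).
# The FR->EN dictionary, stopword list and documents are invented.

FR_EN = {"maladie": ["disease", "illness"], "coeur": ["heart"]}
STOPWORDS = {"la", "le", "de", "du"}

def translate_query(query):
    """Tokenise, drop stopwords, expand each term with its dictionary translations."""
    terms = []
    for tok in query.lower().split():
        if tok in STOPWORDS:
            continue
        terms.extend(FR_EN.get(tok, [tok]))  # keep OOV terms untranslated
    return terms

def retrieve(terms, docs):
    """Rank documents by a simple term-overlap score."""
    scored = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        score = sum(1 for t in terms if t in words)
        if score:
            scored.append((score, doc_id))
    return [d for s, d in sorted(scored, reverse=True)]

docs = {"d1": "heart disease treatment", "d2": "weather report"}
print(retrieve(translate_query("maladie du coeur"), docs))  # ['d1']
```

Note how every dictionary sense is kept ("disease" and "illness"): naive expansion like this is one source of the ambiguity problems discussed later.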
CLIR for Multimedia
Retrieval from a mixed-media collection is a non-trivial problem
Different media are processed in different ways and suffer from different kinds of indexing errors:
spoken documents indexed using speech recognition
handwritten documents indexed using OCR
images indexed using significant features
Need for complex integration of multiple technologies
Need for merging of results from different sources
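One common approach to the merging problem is to normalise each source's scores onto a comparable scale before fusing, since a speech-recognition index and a text index produce incomparable raw scores. A minimal sketch (run names and scores invented):

```python
# Sketch: merge ranked lists from different media/indexes by
# min-max normalising each list's scores, then summing per document.
# The runs and scores below are invented for illustration.

def minmax(run):
    """Rescale a {doc: score} run to the [0, 1] range."""
    scores = list(run.values())
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return {d: 1.0 for d in run}
    return {d: (s - lo) / (hi - lo) for d, s in run.items()}

def merge(runs):
    """Sum normalised scores for each document across all runs."""
    fused = {}
    for run in runs:
        for doc, s in minmax(run).items():
            fused[doc] = fused.get(doc, 0.0) + s
    return sorted(fused, key=fused.get, reverse=True)

text_run = {"d1": 12.0, "d2": 7.5, "d3": 3.0}    # e.g. text index scores
speech_run = {"d2": 0.9, "d4": 0.4}              # e.g. ASR index scores
print(merge([text_run, speech_run]))
```

Here d2 wins because it is found (with a reasonable score) by both sources; many fancier fusion schemes exist, but they share this normalise-then-combine shape.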
Main CLIR Difficulties (I)
Language identification
Morphology: inflection, derivation, compounding, …
OOV terms, e.g. proper names, terminology
Multi-word concepts, e.g. phrases and idioms
Ambiguity, e.g. polysemy
Handling many languages: L1 -> Ln
Merging results from different sources / media
Presenting the results in a useful fashion
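To make the morphology point concrete: in compounding languages such as German, a query term may need to be split into lexicon words before it can match or be translated. A greedy, longest-match-first sketch against a toy lexicon (the lexicon is invented; production systems use statistical or lexicon-based decompounders):

```python
# Sketch: greedy decompounding of a German compound against a small
# lexicon, one of the language-dependent pre-processing steps above.
# The lexicon is invented for illustration.

LEXICON = {"kranken", "haus", "versicherung"}

def decompound(word, lexicon=LEXICON):
    """Split word into known lexicon parts, left to right, longest-first."""
    parts, rest = [], word.lower()
    while rest:
        for i in range(len(rest), 0, -1):
            if rest[:i] in lexicon:
                parts.append(rest[:i])
                rest = rest[i:]
                break
        else:
            return [word]  # no full split found: keep the word whole
    return parts

print(decompound("Krankenhaus"))  # ['kranken', 'haus']
```

Greedy splitting can of course pick the wrong segmentation for harder compounds, which is exactly why decompounding appears on this list of difficulties.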
Main CLIR Difficulties (II)
MLIA systems need clever pre-processing of target collections (e.g. semantic analysis, classification, information extraction)
MLIA systems need intelligent post-processing of results: merging / summarization / translation
MLIA systems need well-developed resources: language processing tools and language resources
Resources are expensive to acquire, maintain and update
Cross-Language Evaluation Forum
Objectives: promote research and stimulate the development of multilingual IR systems for European languages, through:
creation of an evaluation infrastructure
building of an MLIA/CLIR research community
construction of publicly available test-suites
Major goal: encourage the development of truly multilingual, multimodal systems
CLEF Coordination
Centre for the Evaluation of Human Language and Multimodal Communication Technologies (CELCT), Trento, Italy
College of Information Studies and Institute for Advanced Computer Studies, U. Maryland, USA
Dept. of Computer Science, U. Indonesia
Depts. of Computer Science & Medical Informatics, RWTH Aachen U., Germany
Dept. of Computer Science and Information Systems, U. Limerick, Ireland
Dept. of Computer Science and Information Engineering, National U. Taiwan
Dept. of Information Engineering, U. Padua, Italy
Dept. of Information Science, U. Hildesheim, Germany
Dept. of Information Studies, U. Sheffield, UK
Evaluations and Language Resources Distribution Agency Sarl, Paris, France
Fondazione Bruno Kessler FBK-irst, Trento, Italy
German Research Centre for Artificial Intelligence, DFKI, Saarbrücken, Germany
Information and Language Processing Systems, U. Amsterdam, Netherlands
IZ Bonn, Germany
Inst. for Information Technology, Hyderabad, India
Inst. of Formal and Applied Linguistics, Charles University, Czech Rep.
LSI-UNED, Madrid, Spain
Linguateca, Sintef, Oslo, Norway
Linguistic Modelling Lab., Bulgarian Acad. Sci.
Microsoft Research Asia
NIST, USA
Biomedical Informatics, Oregon Health and Science University, USA
Research Computing Center of Moscow State U.
Research Institute for Linguistics, Hungarian Academy of Sciences
School of Computer Science and Mathematics, Victoria U., Australia
School of Computing, DCU, Ireland
UC Data Archive and School of Information Management and Systems, UC Berkeley, USA
University "Alexandru Ioan Cuza", Iași, Romania
U. Hospitals and U. of Geneva, Switzerland
Vienna University of Technology, Austria
Institutions contributing to the organisation of the different tracks of CLEF 2007
Evolution of CLEF
CLEF 2000 tracks:
mono-, bi- & multilingual text document retrieval (Ad Hoc)
mono- and cross-language information retrieval on structured scientific data (Domain-Specific)
CLEF 2001 added:
interactive cross-language retrieval (iCLEF)
CLEF 2002 added:
cross-language spoken document retrieval (CL-SR)
CLEF 2003 added:
multiple-language question answering (QA@CLEF)
cross-language retrieval in image collections (ImageCLEF)
CLEF 2005 added:
multilingual retrieval of Web documents (WebCLEF)
cross-language geographical retrieval (GeoCLEF)
CLEF 2008 added:
cross-language video retrieval (VideoCLEF)
multilingual information filtering (INFILE@CLEF)
CLEF Test Collections
2000: news documents in 4 languages; GIRT German social science database
2007:
CLEF multilingual comparable corpus of more than 3M news docs in 13 languages: CZ, DE, EN, ES, FI, FR, IT, NL, RU, SV, PT, BG and HU
GIRT-4 social science database in EN and DE; Russian ISISS collection; Cambridge Sociological Abstracts
Malach collection of conversational speech derived from the Shoah archives (EN & CZ)
EuroGOV, 3.5M webpages crawled from European governmental sites
IAPR TC-12 photo database; PASCAL VOC 2006 training data
ImageCLEFmed radiological database consisting of 6 distinct datasets; IRMA collection in EN & DE for automatic medical image annotation
Each track creates topics/queries & relevance assessments in diverse languages
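The topics and relevance assessments are what make laboratory evaluation possible: given a system's ranked run and the judged relevant documents per topic, one computes measures such as mean average precision (MAP), the standard figure reported in such campaigns. A sketch (the runs and judgments below are invented):

```python
# Sketch: average precision per topic and MAP over topics,
# computed from ranked runs and relevance assessments (qrels).
# The example runs and qrels are invented for illustration.

def average_precision(ranked, relevant):
    """Mean of precision values at the ranks where relevant docs appear."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """MAP: average of per-topic average precision."""
    return sum(average_precision(runs[t], qrels[t]) for t in runs) / len(runs)

runs = {"t1": ["d1", "d2", "d3"], "t2": ["d5", "d4"]}
qrels = {"t1": {"d1", "d3"}, "t2": {"d4"}}
print(round(mean_average_precision(runs, qrels), 3))  # 0.667
```

Because the assessments are reusable, the same qrels let later systems be benchmarked against the original campaign's runs.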
Promoting Research through Evaluation
Text Retrieval (from 2000)
Mono-, bi- and multilingual system performance tested using news documents (13 European languages)
bilingual tasks testing unusual language combinations
multilingual system testing with many target languages
advanced tasks to monitor improvement in system performance over time, focused on the problem of merging results from different collections/languages
a “robust” task emphasized the importance of stable performance across languages rather than high average performance
Since 2006, queries in non-European languages (Indian sub-task)
2008: new tasks on library archives; tasks on non-European target collections; the robust task uses WSD data
Results: Cross-Language Text Retrieval
Comparing bilingual results with monolingual baselines:
TREC-6, 1997:
EN→FR: 49% of the best monolingual French system
EN→DE: 64% of the best monolingual German system
CLEF 2002:
EN→FR: 83.4% of the best monolingual French system
EN→DE: 85.6% of the best monolingual German system
CLEF 2003 enforced the use of “unusual” language pairs:
IT→ES: 83% of the best monolingual Spanish IR system
DE→IT: 87% of the best monolingual Italian IR system
FR→NL: 82% of the best monolingual Dutch IR system
CLEF 2007: best bilingual system at 88% of the best monolingual system
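The percentages above express a bilingual run's effectiveness (typically MAP) relative to the best monolingual baseline on the same collection, so the arithmetic is a simple ratio. The MAP values in this sketch are invented; only the 83.4% figure comes from the slide:

```python
# Relative performance of a bilingual run against the best monolingual
# baseline, as reported in the comparisons above.
# The MAP values below are invented for illustration.

def relative_performance(bilingual_map, monolingual_map):
    """Bilingual effectiveness as a percentage of the monolingual best."""
    return 100.0 * bilingual_map / monolingual_map

# e.g. a bilingual run at 0.417 MAP against a monolingual best of 0.50
print(round(relative_performance(0.417, 0.50), 1))  # 83.4
```

Read this way, the trend from 49% (TREC-6) to 88% (CLEF 2007) means the cost of crossing the language barrier shrank from half the monolingual effectiveness to about a tenth.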
Other Results: Non-Document & Non-Text Retrieval
Interactive CLEF: cross-language IR from a user-inclusive perspective
Multilingual question answering: 10 different target collections, real-time exercise, answer validation, QA on speech transcripts
Geographical CLIR
Cross-language image retrieval: tasks on photo and medical archives, tasks for retrieval and classification
Cross-language spoken document & cross-language speech retrieval
CLEF Achievements
Stimulation of research activity in new, previously unexplored areas
Study and implementation of evaluation methodologies for diverse types of cross-language IR systems
Creation of a large set of empirical data about multilingual information access from the user perspective
Quantitative and qualitative evidence with respect to best practice in cross-language system development
Creation of reusable test collections for system benchmarking
Building of a strong, multidisciplinary research community
BUT
BUT
Notable lack of take-up
by
Application Communities
TrebleCLEF
TrebleCLEF is a Coordination Action, funded under FP7 from 2008 to 2009, which aims at:
continuing to promote the development of advanced multilingual multimedia information access systems
disseminating know-how, tools and resources to enable DL creators to make content and knowledge accessible, usable and exploitable over time, over media and over language boundaries
Objectives I
TrebleCLEF will promote R&D and industrial take-up of multilingual, multimodal information access functionality in the following ways:
by continuing to support the annual CLEF system evaluation campaigns, with particular focus on:
user modelling, e.g. the requirements of different classes of users when querying multilingual information sources
language-specific experimentation, e.g. looking at differences across languages in order to derive best practices for each language
results presentation, e.g. how results can be presented in the most useful and comprehensible way to the user
Objectives II
by constituting a scientific forum for the MLIA community of researchers, enabling them to meet and discuss results, emerging trends and new directions
by providing a scientific digital library to make accessible the scientific data and experiments produced during the course of an evaluation campaign, with tools to:
analyze, compare and cite the data and experiments
curate, preserve, annotate and enrich them (promoting their re-use)
Objectives III
by acting as a virtual centre of competence providing a central reference point for anyone interested in studying or implementing MLIA functionality:
making publicly available sets of guidelines on best practices in MLIA (e.g. what stemmer to use, what stop list, what translation resources, how best to evaluate, etc., depending on the application requirements);
making tools and resources used in the evaluation campaigns freely available to a wider public whenever possible; otherwise providing links to where they can be acquired;
organising workshops, and/or tutorials and training sessions.
Approach
Evaluation:
test collections and laboratory evaluation
user evaluation
log analysis
Best Practices & Guidelines:
system-oriented aspects of MLIA applications
collaborative user studies
user-oriented aspects of MLIA interfaces
Dissemination and Training:
tutorials
workshops
summer school
Consortium
ISTI-CNR, Pisa, Italy
University of Padua, Italy
University of Sheffield, United Kingdom
Universidad Nacional de Educación a Distancia, Spain
Zurich University of Applied Sciences, Switzerland
Centre for the Evaluation of Language and Communication Technologies (CELCT), Italy
Evaluations & Language Resources Distribution Agency, France
Contacts
For further information see:
http://www.trebleclef.eu/
or contact:
Carol Peters - ISTI-CNR
E-mail: [email protected]