LREC 2008
From Research to Application in Multilingual Information Access:
The Contribution of Evaluation
Carol Peters, ISTI-CNR, Pisa, Italy
Outline
What is MLIA/CLIR? What is the State-of-the-Art?
Where are the Problems?
What is the Contribution of Evaluation?
What more can we do? From CLEF to TrebleCLEF
Europe’s Linguistic Diversity
There are 6,800 known languages spoken in 200 countries; 2,261 have writing systems (the others are only spoken)
Just 300 have some kind of language processing tools
What is MLIA?
MLIA-related research concerns the storage, access, retrieval and presentation of information in any of the world's languages.
Two main areas of interest:
multiple language access, browsing and display
cross-language information discovery and retrieval
Multi-Language Access, Browsing, Display
The enabling technologies:
character encoding
the specific requirements of particular languages and scripts
internationalization & localization
Cross-Language Information Retrieval
Crossing the language barrier…
querying a multilingual collection in one language against documents in many other languages
filtering, selecting and ranking the retrieved documents
presenting the retrieved information in an interpretable and exploitable fashion
The Problem
[Diagram: a user's query and the target document, with the query representation and the document representation separated by the language barrier]
CLIR Methods: How is it Done?
Pre-process & index both documents and queries – generally using language-dependent techniques (tokenisation, stopword removal, stemming, morphological analysis, decompounding, etc.)
Translate: queries or documents (or both)
Translation resources: Machine Translation (MT), parallel/comparable corpora, bilingual dictionaries, multilingual thesauri, conceptual interlingua
Find relevant documents in the target collection(s) & present the results
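The translate-then-retrieve pipeline above can be sketched minimally. The tiny FR→EN dictionary, stopword list and document collection below are invented purely for illustration; real systems use MT or large bilingual resources.

```python
# Minimal query-translation CLIR sketch (illustrative only).
# The FR->EN dictionary, stopword list and documents are invented.

FR_EN = {"maladie": ["disease", "illness"], "coeur": ["heart"]}
STOPWORDS = {"la", "le", "de", "du"}

def translate_query(query):
    """Tokenise, drop stopwords, expand each term with its dictionary translations."""
    terms = []
    for tok in query.lower().split():
        if tok in STOPWORDS:
            continue
        terms.extend(FR_EN.get(tok, [tok]))  # keep OOV terms untranslated
    return terms

def retrieve(terms, docs):
    """Rank documents by a simple term-overlap score."""
    scored = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        score = sum(1 for t in terms if t in words)
        if score:
            scored.append((score, doc_id))
    return [d for s, d in sorted(scored, reverse=True)]

docs = {"d1": "heart disease treatment", "d2": "weather report"}
print(retrieve(translate_query("maladie du coeur"), docs))  # ['d1']
```

Note how every dictionary sense is kept ("disease" and "illness"): naive expansion like this is one source of the ambiguity problems discussed later.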
CLIR for Multimedia
Retrieval from a mixed-media collection is a non-trivial problem
Different media are processed in different ways and suffer from different kinds of indexing errors:
spoken documents indexed using speech recognition
handwritten documents indexed using OCR
images indexed using significant features
Need for complex integration of multiple technologies
Need for merging of results from different sources
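One common approach to the merging problem is to normalise each source's scores onto a comparable scale before fusing, since a speech-recognition index and a text index produce incomparable raw scores. A minimal sketch (run names and scores invented):

```python
# Sketch: merge ranked lists from different media/indexes by
# min-max normalising each list's scores, then summing per document.
# The runs and scores below are invented for illustration.

def minmax(run):
    """Rescale a {doc: score} run to the [0, 1] range."""
    scores = list(run.values())
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return {d: 1.0 for d in run}
    return {d: (s - lo) / (hi - lo) for d, s in run.items()}

def merge(runs):
    """Sum normalised scores for each document across all runs."""
    fused = {}
    for run in runs:
        for doc, s in minmax(run).items():
            fused[doc] = fused.get(doc, 0.0) + s
    return sorted(fused, key=fused.get, reverse=True)

text_run = {"d1": 12.0, "d2": 7.5, "d3": 3.0}    # e.g. text index scores
speech_run = {"d2": 0.9, "d4": 0.4}              # e.g. ASR index scores
print(merge([text_run, speech_run]))
```

Here d2 wins because it is found (with a reasonable score) by both sources; many fancier fusion schemes exist, but they share this normalise-then-combine shape.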
Main CLIR Difficulties (I)
Language identification
Morphology: inflection, derivation, compounding, …
OOV terms, e.g. proper names, terminology
Multi-word concepts, e.g. phrases and idioms
Ambiguity, e.g. polysemy
Handling many languages: L1 -> Ln
Merging results from different sources / media
Presenting the results in a useful fashion
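To make the morphology point concrete: in compounding languages such as German, a query term may need to be split into lexicon words before it can match or be translated. A greedy, longest-match-first sketch against a toy lexicon (the lexicon is invented; production systems use statistical or lexicon-based decompounders):

```python
# Sketch: greedy decompounding of a German compound against a small
# lexicon, one of the language-dependent pre-processing steps above.
# The lexicon is invented for illustration.

LEXICON = {"kranken", "haus", "versicherung"}

def decompound(word, lexicon=LEXICON):
    """Split word into known lexicon parts, left to right, longest-first."""
    parts, rest = [], word.lower()
    while rest:
        for i in range(len(rest), 0, -1):
            if rest[:i] in lexicon:
                parts.append(rest[:i])
                rest = rest[i:]
                break
        else:
            return [word]  # no full split found: keep the word whole
    return parts

print(decompound("Krankenhaus"))  # ['kranken', 'haus']
```

Greedy splitting can of course pick the wrong segmentation for harder compounds, which is exactly why decompounding appears on this list of difficulties.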
Main CLIR Difficulties (II)
MLIA systems need clever pre-processing of target collections (e.g. semantic analysis, classification, information extraction)
MLIA systems need intelligent post-processing of results: merging / summarization / translation
MLIA systems need well-developed resources: language processing tools and language resources
Resources are expensive to acquire, maintain and update
Cross-Language Evaluation Forum
Objectives: promote research and stimulate the development of multilingual IR systems for European languages, through:
creation of an evaluation infrastructure
building of an MLIA/CLIR research community
construction of publicly available test-suites
Major goal: encourage the development of truly multilingual, multimodal systems
CLEF Coordination
Centre for the Evaluation of Human Language and Multimodal Communication Technologies (CELCT), Trento, Italy
College of Information Studies and Institute for Advanced Computer Studies, U. Maryland, USA
Dept. of Computer Science, U. Indonesia
Depts. of Computer Science & Medical Informatics, RWTH Aachen U., Germany
Dept. of Computer Science and Information Systems, U. Limerick, Ireland
Dept. of Computer Science and Information Engineering, National U. Taiwan
Dept. of Information Engineering, U. Padua, Italy
Dept. of Information Science, U. Hildesheim, Germany
Dept. of Information Studies, U. Sheffield, UK
Evaluations and Language Resources Distribution Agency Sarl, Paris, France
Fondazione Bruno Kessler FBK-irst, Trento, Italy
German Research Centre for Artificial Intelligence, DFKI, Saarbrücken, Germany
Information and Language Processing Systems, U. Amsterdam, Netherlands
IZ Bonn, Germany
Inst. for Information Technology, Hyderabad, India
Inst. of Formal and Applied Linguistics, Charles University, Czech Rep.
LSI-UNED, Madrid, Spain
Linguateca, Sintef, Oslo, Norway
Linguistic Modelling Lab., Bulgarian Acad. Sci.
Microsoft Research Asia
NIST, USA
Biomedical Informatics, Oregon Health and Science University, USA
Research Computing Center of Moscow State U.
Research Institute for Linguistics, Hungarian Academy of Sciences
School of Computer Science and Mathematics, Victoria U., Australia
School of Computing, DCU, Ireland
UC Data Archive and School of Information Management and Systems, UC Berkeley, USA
University "Alexandru Ioan Cuza", Iași, Romania
U. Hospitals and U. of Geneva, Switzerland
Vienna University of Technology, Austria
Institutions contributing to the organisation of the different tracks of CLEF 2007
Evolution of CLEF
CLEF 2000 tracks:
mono-, bi- & multilingual text document retrieval (Ad Hoc)
mono- and cross-language information retrieval on structured scientific data (Domain-Specific)
CLEF 2001 added:
interactive cross-language retrieval (iCLEF)
CLEF 2002 added:
cross-language spoken document retrieval (CL-SR)
CLEF 2003 added:
multiple-language question answering (QA@CLEF)
cross-language retrieval in image collections (ImageCLEF)
CLEF 2005 added:
multilingual retrieval of Web documents (WebCLEF)
cross-language geographical retrieval (GeoCLEF)
CLEF 2008 added:
cross-language video retrieval (VideoCLEF)
multilingual information filtering (INFILE@CLEF)
CLEF Test Collections
2000: news documents in 4 languages; GIRT German social science database
2007:
CLEF multilingual comparable corpus of more than 3M news docs in 13 languages: CZ, DE, EN, ES, FI, FR, IT, NL, RU, SV, PT, BG and HU
GIRT-4 social science database in EN and DE; Russian ISISS collection; Cambridge Sociological Abstracts
Malach collection of conversational speech derived from the Shoah archives (EN & CZ)
EuroGOV, 3.5M webpages crawled from European governmental sites
IAPR TC-12 photo database; PASCAL VOC 2006 training data
ImageCLEFmed radiological database consisting of 6 distinct datasets; IRMA collection in EN & DE for automatic medical image annotation
Each track creates topics/queries & relevance assessments in diverse languages
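The topics and relevance assessments are what make laboratory evaluation possible: given a system's ranked run and the judged relevant documents per topic, one computes measures such as mean average precision (MAP), the standard figure reported in such campaigns. A sketch (the runs and judgments below are invented):

```python
# Sketch: average precision per topic and MAP over topics,
# computed from ranked runs and relevance assessments (qrels).
# The example runs and qrels are invented for illustration.

def average_precision(ranked, relevant):
    """Mean of precision values at the ranks where relevant docs appear."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """MAP: average of per-topic average precision."""
    return sum(average_precision(runs[t], qrels[t]) for t in runs) / len(runs)

runs = {"t1": ["d1", "d2", "d3"], "t2": ["d5", "d4"]}
qrels = {"t1": {"d1", "d3"}, "t2": {"d4"}}
print(round(mean_average_precision(runs, qrels), 3))  # 0.667
```

Because the assessments are reusable, the same qrels let later systems be benchmarked against the original campaign's runs.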
Promoting Research through Evaluation
Text Retrieval (from 2000)
Mono-, bi- and multilingual system performance tested using news documents (13 European languages)
bilingual tasks testing unusual language combinations
multilingual system testing with many target languages
advanced tasks to monitor improvement in system performance over time, focused on the problem of merging results from different collections/languages
a “robust” task emphasized the importance of stable performance across languages rather than high average performance
Since 2006, queries in non-European languages (Indian sub-task)
2008: new tasks on library archives; tasks on non-European target collections; the robust task uses WSD data
Results: Cross-Language Text Retrieval
Comparing bilingual results with monolingual baselines:
TREC-6, 1997:
EN→FR: 49% of the best monolingual French system
EN→DE: 64% of the best monolingual German system
CLEF 2002:
EN→FR: 83.4% of the best monolingual French system
EN→DE: 85.6% of the best monolingual German system
CLEF 2003 enforced the use of “unusual” language pairs:
IT→ES: 83% of the best monolingual Spanish IR system
DE→IT: 87% of the best monolingual Italian IR system
FR→NL: 82% of the best monolingual Dutch IR system
CLEF 2007: best bilingual system at 88% of the best monolingual system
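The percentages above express a bilingual run's effectiveness (typically MAP) relative to the best monolingual baseline on the same collection, so the arithmetic is a simple ratio. The MAP values in this sketch are invented; only the 83.4% figure comes from the slide:

```python
# Relative performance of a bilingual run against the best monolingual
# baseline, as reported in the comparisons above.
# The MAP values below are invented for illustration.

def relative_performance(bilingual_map, monolingual_map):
    """Bilingual effectiveness as a percentage of the monolingual best."""
    return 100.0 * bilingual_map / monolingual_map

# e.g. a bilingual run at 0.417 MAP against a monolingual best of 0.50
print(round(relative_performance(0.417, 0.50), 1))  # 83.4
```

Read this way, the trend from 49% (TREC-6) to 88% (CLEF 2007) means the cost of crossing the language barrier shrank from half the monolingual effectiveness to about a tenth.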
Other Results: Non-Document & Non-Text Retrieval
Interactive CLEF: cross-language IR from a user-inclusive perspective
Multilingual question answering: 10 different target collections, real-time exercise, answer validation, QA on speech transcripts
Geographical CLIR
Cross-language image retrieval: tasks on photo and medical archives, tasks for retrieval and classification
Cross-language spoken document & cross-language speech retrieval
CLEF Achievements
Stimulation of research activity in new, previously unexplored areas
Study and implementation of evaluation methodologies for diverse types of cross-language IR systems
Creation of a large set of empirical data about multilingual information access from the user perspective
Quantitative and qualitative evidence with respect to best practice in cross-language system development
Creation of reusable test collections for system benchmarking
Building of a strong, multidisciplinary research community
BUT
BUT
Notable lack of take-up
by
Application Communities
TrebleCLEF
TrebleCLEF is a Coordination Action, funded under FP7 from 2008 to 2009, which aims at:
continuing to promote the development of advanced multilingual multimedia information access systems
disseminating know-how, tools and resources to enable DL creators to make content and knowledge accessible, usable and exploitable over time, over media and over language boundaries
Objectives I
TrebleCLEF will promote R&D and industrial take-up of multilingual, multimodal information access functionality in the following ways:
by continuing to support the annual CLEF system evaluation campaigns, with particular focus on:
user modelling, e.g. the requirements of different classes of users when querying multilingual information sources
language-specific experimentation, e.g. looking at differences across languages in order to derive best practices for each language
results presentation, e.g. how results can be presented in the most useful and comprehensible way to the user
Objectives II
by constituting a scientific forum for the MLIA community of researchers, enabling them to meet and discuss results, emerging trends and new directions
by providing a scientific digital library to make accessible the scientific data and experiments produced during the course of an evaluation campaign, with tools to:
analyze, compare and cite the data and experiments
curate, preserve, annotate and enrich them (promoting their re-use)
Objectives III
by acting as a virtual centre of competence providing a central reference point for anyone interested in studying or implementing MLIA functionality:
making publicly available sets of guidelines on best practices in MLIA (e.g. what stemmer to use, what stop list, what translation resources, how best to evaluate, etc., depending on the application requirements);
making tools and resources used in the evaluation campaigns freely available to a wider public whenever possible; otherwise providing links to where they can be acquired;
organising workshops, and/or tutorials and training sessions.
Approach
Evaluation:
test collections and laboratory evaluation
user evaluation
log analysis
Best Practices & Guidelines:
system-oriented aspects of MLIA applications
collaborative user studies
user-oriented aspects of MLIA interfaces
Dissemination and Training:
tutorials
workshops
summer school
Consortium
ISTI-CNR, Pisa, Italy
University of Padua, Italy
University of Sheffield, United Kingdom
Universidad Nacional de Educación a Distancia, Spain
Zurich University of Applied Sciences, Switzerland
Centre for the Evaluation of Language and Communication Technologies (CELCT), Italy
Evaluations & Language Resources Distribution Agency, France
Contacts
For further information see:
http://www.trebleclef.eu/
or contact:
Carol Peters - ISTI-CNR
E-mail: [email protected]