
CLEF 2009, Corfu Question Answering Track Overview


Page 1: CLEF 2009, Corfu Question Answering Track Overview

1

CLEF 2009, Corfu

Question Answering Track Overview

J. Turmo, P.R. Comas, S. Rosset, O. Galibert, N. Moreau, D. Mostefa, P. Rosso, D. Buscaldi

D. Santos, L.M. Cabral

A. Peñas, P. Forner, R. Sutcliffe, Á. Rodrigo, C. Forascu, I. Alegria, D. Giampiccolo, N. Moreau, P. Osenova

Page 2: CLEF 2009, Corfu Question Answering Track Overview

2

QA Tasks & Time

2003  2004  2005  2006  2007  2008  2009

Tasks:
• Multiple Language QA Main Task; ResPubliQA (2009)
• Temporal restrictions and lists
• Answer Validation Exercise (AVE)
• GikiCLEF
• Real Time
• QA over Speech Transcriptions (QAST)
• WiQA
• WSD QA

Page 3: CLEF 2009, Corfu Question Answering Track Overview

3

2009 campaign

ResPubliQA: QA on European Legislation

GikiCLEF: QA requiring geographical reasoning on Wikipedia

QAST: QA on Speech Transcriptions of European Parliament Plenary sessions

Page 4: CLEF 2009, Corfu Question Answering Track Overview

4

QA 2009 campaign

Task       | Registered groups  | Participant groups | Submitted runs          | Organizing people
ResPubliQA | 20                 | 11                 | 28 + 16 (baseline runs) | 9
GikiCLEF   | 27                 | 8                  | 17                      | 2
QAST       | 12                 | 4                  | 86 (5 subtasks)         | 8
Total      | 59 showed interest | 23 groups          | 147 runs evaluated      | 19 + additional assessors

Page 5: CLEF 2009, Corfu Question Answering Track Overview

5

ResPubliQA 2009: QA on European Legislation

Organizers
Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forascu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, Petya Osenova

Additional Assessors
Fernando Luis Costa, Anna Kampchen, Julia Kramme, Cosmina Croitoru

Advisory Board
Donna Harman, Maarten de Rijke, Dominique Laurent

Page 6: CLEF 2009, Corfu Question Answering Track Overview

6

Evolution of the task

Evolution from 2003 to 2009:

• Target languages: 3, 7, 8, 9, 10, 11, 8
• Collections: News 1994; + News 1995; + Wikipedia Nov. 2006; European Legislation (2009)
• Number of questions: 200; 500 in 2009
• Type of questions: 200 Factoid; + Temporal restrictions; + Definitions; - Type of question; + Lists; + Linked questions; + Closed lists; in 2009: - Linked, + Reason, + Purpose, + Procedure
• Supporting information: Document; Snippet; Paragraph
• Size of answer: Snippet; Exact; Paragraph

Page 7: CLEF 2009, Corfu Question Answering Track Overview

7

Objectives

1. Move towards a domain of potential users

2. Compare systems working in different languages

3. Compare QA Tech. with pure IR

4. Introduce more types of questions

5. Introduce Answer Validation Tech.

Page 8: CLEF 2009, Corfu Question Answering Track Overview

8

Collection

• Subset of JRC-Acquis (10,700 docs per language)
• Parallel at document level
• EU treaties, EU legislation, agreements and resolutions
• Economy, health, law, food, …
• Between 1950 and 2006
• XML-TEI.2 encoding
• Unfortunately, not parallel at the paragraph level -> extra work (see the sketch below)
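As an illustration of that extra work, the sketch below extracts a paragraph collection from the XML-TEI.2 files. It assumes paragraphs are encoded as <p> elements; this is an assumption for illustration only, and the actual JRC-Acquis markup may differ in detail.

```python
# Illustrative only: build (doc_id, paragraph_number, text) records from
# TEI files, assuming paragraphs appear as <p> elements (an assumption;
# the real JRC-Acquis markup may differ).

import xml.etree.ElementTree as ET
from pathlib import Path

def extract_paragraphs(tei_file: Path):
    """Yield (doc_id, paragraph_number, text) for each non-empty paragraph."""
    tree = ET.parse(tei_file)
    doc_id = tei_file.stem
    for i, p in enumerate(tree.iter("p"), start=1):
        text = " ".join("".join(p.itertext()).split())  # flatten and normalize whitespace
        if text:
            yield doc_id, i, text

# Usage (hypothetical folder layout): collect every paragraph of one language
# corpus = [rec for f in Path("acquis/en").glob("*.xml") for rec in extract_paragraphs(f)]
```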

Page 9: CLEF 2009, Corfu Question Answering Track Overview

9

500 questions

REASON: Why did a commission expert conduct an inspection visit to Uruguay?

PURPOSE/OBJECTIVE: What is the overall objective of the eco-label?

PROCEDURE: How are stable conditions in the natural rubber trade achieved?

In general, any question that can be answered in a paragraph

Page 10: CLEF 2009, Corfu Question Answering Track Overview

10

500 questions

Also FACTOID

• In how many languages is the Official Journal of the Community published?

DEFINITION
• What is meant by “whole milk”?

No NIL questions

Page 11: CLEF 2009, Corfu Question Answering Track Overview

11

Page 12: CLEF 2009, Corfu Question Answering Track Overview

12

Translation of questions

Page 13: CLEF 2009, Corfu Question Answering Track Overview

13

Selection of the final pool of 500 questions out of the 600 produced

Page 14: CLEF 2009, Corfu Question Answering Track Overview

14

Page 15: CLEF 2009, Corfu Question Answering Track Overview

15

Systems response

No Answer ≠ Wrong Answer

1. Decide if the answer is given or not (see the sketch below)
   • [ YES | NO ]
   • Classification problem
   • Machine Learning, Provers, etc.
   • Textual Entailment

2. Provide the paragraph (ID + Text) that answers the question

Aim: Leaving a question unanswered has more value than giving a wrong answer
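The snippet below is a minimal sketch of this two-step response format. The word-overlap validator and the 0.5 threshold are stand-in assumptions, not any participant's actual method; real systems used machine learning, provers or textual entailment for the YES/NO decision.

```python
# Illustrative two-step response: (1) decide YES/NO whether to answer,
# using a stand-in validation score, (2) return the supporting paragraph
# (ID + text). The overlap score and the threshold are assumptions.

def validation_score(question: str, paragraph: str) -> float:
    """Stand-in validator: fraction of question words found in the paragraph."""
    q_words = set(question.lower().split())
    p_words = set(paragraph.lower().split())
    return len(q_words & p_words) / max(len(q_words), 1)

def respond(question: str, para_id: str, para_text: str, threshold: float = 0.5) -> dict:
    """Answer only when the validator is confident enough; otherwise leave the
    question unanswered (NoA), keeping the candidate so it can still be assessed."""
    answered = validation_score(question, para_text) >= threshold
    return {"answered": answered, "paragraph_id": para_id, "paragraph": para_text}
```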

Page 16: CLEF 2009, Corfu Question Answering Track Overview

16

Assessments

R: The question is answered correctly
W: The question is answered incorrectly
NoA: The question is not answered

• NoA R: NoA, but the candidate answer was correct
• NoA W: NoA, and the candidate answer was incorrect
• NoA Empty: NoA, and no candidate answer was given

Evaluation measure: c@1
• An extension of traditional accuracy (the proportion of questions correctly answered)
• Takes unanswered questions into account

Page 17: CLEF 2009, Corfu Question Answering Track Overview

17

Evaluation measure

n: number of questions
n_R: number of correctly answered questions
n_U: number of unanswered questions

c@1 = \frac{1}{n}\left(n_R + n_U \cdot \frac{n_R}{n}\right)

Page 18: CLEF 2009, Corfu Question Answering Track Overview

18

Evaluation measure

If n_U = 0 then c@1 = n_R / n (accuracy)

If n_R = 0 then c@1 = 0

If n_U = n then c@1 = 0

Leaving a question unanswered adds value only if it avoids returning a wrong answer.

c@1 = \frac{1}{n}\left(n_R + n_U \cdot \frac{n_R}{n}\right), where the n_R / n terms correspond to accuracy.

The added value is the performance shown on the answered questions: accuracy.
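As a concrete illustration, here is a minimal Python sketch of the measure defined above (not the official evaluation script); the per-question labels follow the R / W / NoA assessment categories from the previous slides.

```python
# Minimal sketch of c@1, assuming per-question assessment labels
# "R" (right), "W" (wrong) and "NoA" (unanswered), as defined above.

def c_at_1(assessments):
    """c@1 = (n_R + n_U * n_R / n) / n"""
    n = len(assessments)
    if n == 0:
        return 0.0
    n_r = sum(1 for a in assessments if a == "R")    # correctly answered
    n_u = sum(1 for a in assessments if a == "NoA")  # left unanswered
    return (n_r + n_u * (n_r / n)) / n

# Example with the icia092roro counts (260 R, 84 W, 156 NoA): c@1 = 0.68
labels = ["R"] * 260 + ["W"] * 84 + ["NoA"] * 156
print(round(c_at_1(labels), 2))  # 0.68
```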

Page 19: CLEF 2009, Corfu Question Answering Track Overview

19

List of Participants

System | Team
elix   | ELHUYAR-IXA, SPAIN
icia   | RACAI, ROMANIA
iiit   | Search & Info Extraction Lab, INDIA
iles   | LIMSI-CNRS-2, FRANCE
isik   | ISI-Kolkata, INDIA
loga   | U. Koblenz-Landau, GERMANY
mira   | MIRACLE, SPAIN
nlel   | U. Politecnica Valencia, SPAIN
syna   | Synapse Développement, FRANCE
uaic   | Al.I. Cuza U. of Iasi, ROMANIA
uned   | UNED, SPAIN

Page 20: CLEF 2009, Corfu Question Answering Track Overview

20

Value of reducing wrong answers

System      | c@1  | Accuracy | #R  | #W  | #NoA | #NoA R | #NoA W | #NoA empty
combination | 0.76 | 0.76     | 381 | 119 | 0    | 0      | 0      | 0
icia092roro | 0.68 | 0.52     | 260 | 84  | 156  | 0      | 0      | 156
icia091roro | 0.58 | 0.47     | 237 | 156 | 107  | 0      | 0      | 107
UAIC092roro | 0.47 | 0.47     | 236 | 264 | 0    | 0      | 0      | 0
UAIC091roro | 0.45 | 0.45     | 227 | 273 | 0    | 0      | 0      | 0
base092roro | 0.44 | 0.44     | 220 | 280 | 0    | 0      | 0      | 0
base091roro | 0.37 | 0.37     | 185 | 315 | 0    | 0      | 0      | 0

Page 21: CLEF 2009, Corfu Question Answering Track Overview

21

Detecting wrong answers

System      | c@1  | Accuracy | #R  | #W  | #NoA | #NoA R | #NoA W | #NoA empty
combination | 0.56 | 0.56     | 278 | 222 | 0    | 0      | 0      | 0
loga091dede | 0.44 | 0.4      | 186 | 221 | 93   | 16     | 68     | 9
loga092dede | 0.44 | 0.4      | 187 | 230 | 83   | 12     | 62     | 9
base092dede | 0.38 | 0.38     | 189 | 311 | 0    | 0      | 0      | 0
base091dede | 0.35 | 0.35     | 174 | 326 | 0    | 0      | 0      | 0

While maintaining the number of correct answers, the candidate answer was not correct for 83% of the unanswered questions.

A very good step towards improving the system.

Page 22: CLEF 2009, Corfu Question Answering Track Overview

22

IR important, not enough

System c@1 Accuracy #R #W #NoA #NoA R #NoA W #NoA empty

combination 0.9 0.9 451 49 0 0 0 0

uned092enen 0.61 0.61 288 184 28 15 12 1

uned091enen 0.6 0.59 282 190 28 15 13 0

nlel091enen 0.58 0.57 287 211 2 0 0 2

uaic092enen 0.54 0.52 243 204 53 18 35 0

base092enen 0.53 0.53 263 236 1 1 0 0

base091enen 0.51 0.51 256 243 1 0 1 0

elix092enen 0.48 0.48 240 260 0 0 0 0

uaic091enen 0.44 0.42 200 253 47 11 36 0

elix091enen 0.42 0.42 211 289 0 0 0 0

syna091enen 0.28 0.28 141 359 0 0 0 0

isik091enen 0.25 0.25 126 374 0 0 0 0

iiit091enen 0.2 0.11 54 37 409 0 11 398

elix092euen 0.18 0.18 91 409 0 0 0 0

elix091euen 0.16 0.16 78 422 0 0 0 0

Feasible task

A perfect combination is 50% better than the best system

Many systems are below the IR baselines

Page 23: CLEF 2009, Corfu Question Answering Track Overview

23

Comparison across languages

• Same questions
• Same documents
• Same baseline systems
• Strict comparison only affected by the variable of language
• But it is feasible to detect the most promising approaches across languages

Page 24: CLEF 2009, Corfu Question Answering Track Overview

24

Comparison across languages

System   | RO   | ES   | EN   | IT   | DE
icia092  | 0.68 |      |      |      |
nlel092  |      | 0.47 |      |      |
uned092  |      | 0.41 | 0.61 |      |
uned091  |      | 0.41 | 0.6  |      |
icia091  | 0.58 |      |      |      |
nlel091  |      | 0.35 | 0.58 | 0.52 |
uaic092  | 0.47 |      | 0.54 |      |
uaic091  | 0.45 |      |      |      |
loga091  |      |      |      |      | 0.44
loga092  |      |      |      |      | 0.44
Baseline | 0.44 | 0.4  | 0.53 | 0.42 | 0.38

Systems above the baselines:
icia: Boolean + intensive NLP + ML-based validation & very good knowledge of the collection (Eurovoc terms…)

Baseline: Okapi-BM25 tuned for paragraph retrieval (a sketch follows below)
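Since the baseline is Okapi-BM25 tuned for paragraph retrieval, here is a compact, self-contained sketch of BM25 scoring over a paragraph collection. It is illustrative only: the k1 and b values are common defaults rather than the tuned settings of the baseline runs, and the tokenization is deliberately naive.

```python
# Illustrative Okapi BM25 over paragraphs (not the actual baseline code).
import math
from collections import Counter

def bm25_rank(query: str, paragraphs: list, k1: float = 1.2, b: float = 0.75):
    """Return (score, paragraph_index) pairs, best paragraph first."""
    docs = [p.lower().split() for p in paragraphs]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / max(n, 1)
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    ranking = []
    for idx, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        ranking.append((score, idx))
    return sorted(ranking, reverse=True)
```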

Page 25: CLEF 2009, Corfu Question Answering Track Overview

25

Comparison across languages

System   | RO   | ES   | EN   | IT   | DE
icia092  | 0.68 |      |      |      |
nlel092  |      | 0.47 |      |      |
uned092  |      | 0.41 | 0.61 |      |
uned091  |      | 0.41 | 0.6  |      |
icia091  | 0.58 |      |      |      |
nlel091  |      | 0.35 | 0.58 | 0.52 |
uaic092  | 0.47 |      | 0.54 |      |
uaic091  | 0.45 |      |      |      |
loga091  |      |      |      |      | 0.44
loga092  |      |      |      |      | 0.44
Baseline | 0.44 | 0.4  | 0.53 | 0.42 | 0.38

Systems above the baselines:
nlel092: n-gram-based retrieval, combining evidence from several languages

Baseline: Okapi-BM25 tuned for paragraph retrieval

Page 26: CLEF 2009, Corfu Question Answering Track Overview

26

Comparison across languages

System   | RO   | ES   | EN   | IT   | DE
icia092  | 0.68 |      |      |      |
nlel092  |      | 0.47 |      |      |
uned092  |      | 0.41 | 0.61 |      |
uned091  |      | 0.41 | 0.6  |      |
icia091  | 0.58 |      |      |      |
nlel091  |      | 0.35 | 0.58 | 0.52 |
uaic092  | 0.47 |      | 0.54 |      |
uaic091  | 0.45 |      |      |      |
loga091  |      |      |      |      | 0.44
loga092  |      |      |      |      | 0.44
Baseline | 0.44 | 0.4  | 0.53 | 0.42 | 0.38

Systems above the baselines:
uned: Okapi-BM25 + NER + paragraph validation + n-gram-based re-ranking (re-ranking sketched below)

Baseline: Okapi-BM25 tuned for paragraph retrieval
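The re-ranking step mentioned for uned can be pictured with the loose sketch below: candidate paragraphs are re-ordered by how many question word n-grams they contain, with longer matches weighted more. This is an illustrative stand-in, not the actual uned formulation.

```python
# Illustrative n-gram overlap re-ranking of retrieved paragraphs
# (a stand-in for "n-gram-based re-ranking", not the real uned scoring).

def ngrams(tokens, n):
    """Set of word n-grams of length n."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_score(question: str, paragraph: str, max_n: int = 3) -> float:
    q, p = question.lower().split(), paragraph.lower().split()
    score = 0.0
    for n in range(1, max_n + 1):
        score += n * len(ngrams(q, n) & ngrams(p, n))  # longer matches count more
    return score

def rerank(question: str, candidates: list) -> list:
    """Best candidate paragraph first."""
    return sorted(candidates, key=lambda c: ngram_score(question, c), reverse=True)
```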

Page 27: CLEF 2009, Corfu Question Answering Track Overview

27

Comparison across languages

System   | RO   | ES   | EN   | IT   | DE
icia092  | 0.68 |      |      |      |
nlel092  |      | 0.47 |      |      |
uned092  |      | 0.41 | 0.61 |      |
uned091  |      | 0.41 | 0.6  |      |
icia091  | 0.58 |      |      |      |
nlel091  |      | 0.35 | 0.58 | 0.52 |
uaic092  | 0.47 |      | 0.54 |      |
uaic091  | 0.45 |      |      |      |
loga091  |      |      |      |      | 0.44
loga092  |      |      |      |      | 0.44
Baseline | 0.44 | 0.4  | 0.53 | 0.42 | 0.38

Systems above the baselines:
nlel091: n-gram-based paragraph retrieval

Baseline: Okapi-BM25 tuned for paragraph retrieval

Page 28: CLEF 2009, Corfu Question Answering Track Overview

28

Comparison across languages

System   | RO   | ES   | EN   | IT   | DE
icia092  | 0.68 |      |      |      |
nlel092  |      | 0.47 |      |      |
uned092  |      | 0.41 | 0.61 |      |
uned091  |      | 0.41 | 0.6  |      |
icia091  | 0.58 |      |      |      |
nlel091  |      | 0.35 | 0.58 | 0.52 |
uaic092  | 0.47 |      | 0.54 |      |
uaic091  | 0.45 |      |      |      |
loga091  |      |      |      |      | 0.44
loga092  |      |      |      |      | 0.44
Baseline | 0.44 | 0.4  | 0.53 | 0.42 | 0.38

Systems above the baselines:
loga: Lucene + deep NLP + logic + ML-based validation

Baseline: Okapi-BM25 tuned for paragraph retrieval

Page 29: CLEF 2009, Corfu Question Answering Track Overview

29

Conclusion

• Compare systems working in different languages

• Compare QA Tech. with pure IR
  • Pay more attention to paragraph retrieval
  • Old issue, late-90s state of the art (English)
  • Pure IR performance: 0.38 - 0.58
  • Highest difference with respect to the IR baselines: 0.44 - 0.68
  • Intensive NLP
  • ML-based answer validation

• Introduce more types of questions
  • Some types are difficult to distinguish
  • Any question that can be answered in a paragraph
  • Analysis of results by question types (in progress)

Page 30: CLEF 2009, Corfu Question Answering Track Overview

30

Conclusion

• Introduce Answer Validation Tech.
  • Evaluation measure: c@1
  • Value of reducing wrong answers
  • Detecting wrong answers is feasible

• Feasible task
  • 90% of questions have been answered
  • Room for improvement: best systems around 60%

• Even with fewer participants we have:
  • More comparison
  • More analysis
  • More learning

• ResPubliQA proposal for 2010: SC and breakout session

Page 31: CLEF 2009, Corfu Question Answering Track Overview

31

Interest in ResPubliQA 2010

GROUP
1 Uni. "Al.I.Cuza" Iasi (Dan Cristea, Diana Trandabat)
2 Linguateca (Nuno Cardoso)
3 RACAI (Dan Tufis, Radu Ion)
4 Jesus Vilares
5 Univ. Koblenz-Landau (Bjorn Pelzer)
6 Thomson Reuters (Isabelle Moulinier)
7 Gracinda Carvalho
8 UNED (Alvaro Rodrigo)
9 Uni. Politecnica Valencia (Paolo Rosso & Davide Buscaldi)
10 Uni. Hagen (Ingo Glockner)
11 Linguit (Jochen L. Leidner)
12 Uni. Saarland (Dietrich Klakow)
13 ELHUYAR-IXA (Arantxa Otegi)
14 MIRACLE TEAM (Paloma Martínez Fernández)

But we need more.

You already have a Gold Standard of 500 questions & answers to play with…