Evaluating Question Answering Validation

Anselmo Peñas (and Álvaro Rodrigo)
NLP & IR Group, UNED
nlp.uned.es

Information Science Institute, Marina del Rey, December 11, 2009
Old friends: Question Answering
Nothing else than answering a question.
Natural Language Understanding: something is there, if you are able to answer a question.
QA: an extrinsic evaluation for NLU.
Suddenly… (see the track?) … the QA Track at TREC.
Question Answering at TREC
QA became an object of evaluation itself, redefined (roughly speaking) as a highly precision-oriented IR task where NLP was necessary, especially for Answer Extraction.
• Before: knowledge bases (e.g. semantic networks), specific domain, a single accurate answer (with explanation): more reasoning.
• After: big document collections (news, blogs), unrestricted domain, ranking of answers (linked to documents): more retrieval.
What’s this story about?
QA tasks at CLEF, 2003–2010:
• Multiple Language QA Main Task / ResPubliQA
• Temporal restrictions and lists
• Answer Validation Exercise (AVE)
• GikiCLEF
• Real Time QA
• QA over Speech Transcriptions (QAST)
• WiQA
• WSD QA
Outline
1. Motivation and goals
2. Definition and general framework
3. AVE 2006
4. AVE 2007 & 2008
5. QA 2009
The evaluation cycle (short cycle / long cycle):
1. Analysis of current systems’ performance
2. Mid-term goals and strategy
3. Evaluation task definition
4. Analysis of the evaluation cycle: result analysis, methodology analysis
Plus: generation of methodology and evaluation resources; task activation and development.
Systems performance, 2003–2006 (Spanish)
• Overall: best result < 60%
• Definitions: best result > 80% (NOT an IR approach)
Pipeline Upper Bounds
We need SOMETHING to break the pipeline:
Question → Question analysis → Passage retrieval → Answer extraction → Answer ranking → Answer
1.0 × 0.8 × 0.8 = 0.64 — not enough evidence
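The multiplicative bound on this slide can be reproduced in a few lines (the stage rates 1.0, 0.8 and 0.8 are the slide's illustrative figures, not measurements):

```python
from functools import reduce

# Illustrative per-stage success rates from the slide.
stage_rates = [
    1.0,   # question analysis
    0.8,   # passage retrieval
    0.8,   # answer extraction / ranking
]

def pipeline_upper_bound(rates):
    """In a strict pipeline an error at any stage is unrecoverable,
    so the best reachable end-to-end accuracy is the product of the
    per-stage success rates."""
    return reduce(lambda acc, r: acc * r, rates, 1.0)

print(round(pipeline_upper_bound(stage_rates), 2))  # 0.64
```

Hence the motivation for "something to break the pipeline": a component that can reject an answer or choose among streams escapes this product bound.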
Results in CLEF-QA 2006 (Spanish)
• Perfect combination: 81%
• Best system: 52.5%
• Different systems were best for ORGANIZATION, PERSON and TIME questions.
Collaborative architectures
Different systems answer different types of questions better:
• Specialization
• Collaboration
QA sys 1 … QA sys n each take the Question and produce candidate answers; we need SOMETHING for combining / selecting the final Answer.
Collaborative architectures: how to select the good answer?
• Redundancy
• Voting
• Confidence score
• Performance history
Why not deeper content analysis?
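A minimal sketch of such shallow selection, combining voting with confidence scores as a tie-breaker (the input format and the lower-casing normalisation are assumptions for illustration, not any AVE system's design):

```python
from collections import defaultdict

def select_answer(candidates):
    """Pick one answer from several QA streams.
    `candidates` is a list of (answer_string, confidence) pairs, one per
    stream. Majority voting decides; summed confidence breaks ties.
    Lower-casing stands in for real answer-equivalence matching."""
    votes = defaultdict(int)
    confidence = defaultdict(float)
    for answer, score in candidates:
        key = answer.strip().lower()
        votes[key] += 1
        confidence[key] += score
    return max(votes, key=lambda a: (votes[a], confidence[a]))

print(select_answer([("Rome", 0.6), ("rome", 0.5), ("Paris", 0.9)]))  # rome
```

The slide's point is precisely that this shallow evidence (redundancy, votes, scores, history) is what multi-stream systems used, and that deeper content analysis — validation — was the missing piece.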
Mid-Term Goal
Goal: improve QA systems’ performance.
New mid-term goal: improve the devices for rejecting / accepting / selecting answers.
The new task (2006): validate the correctness of the answers given by real QA systems — the participants at CLEF QA.
Outline
1. Motivation and goals
2. Definition and general framework
3. AVE 2006
4. AVE 2007 & 2008
5. QA 2009
Define Answer Validation: decide whether an answer is correct or not.
More precisely, the task: given
• a Question,
• an Answer,
• a Supporting Text,
decide if the answer is correct according to the supporting text.
Let’s call it the Answer Validation Exercise (AVE).
Wish list
• Test collection: questions, answers, supporting texts, human assessments
• Evaluation measures
• Participants
Evaluation linked to the main QA task
The Question Answering Track provides the questions, the systems’ answers and the systems’ supporting texts. The Answer Validation Exercise judges each answer (ACCEPT / REJECT).
The human judgements (R, W, X, U) behind the QA Track results are mapped to (ACCEPT / REJECT) and used to evaluate the AVE Track results.
This reuses the human assessments.
Answer Validation Exercise (AVE)
Inputs: a Question, a Candidate answer, a Supporting text.
Answer Validation decides: the answer is correct, or the answer is not correct / there is not enough evidence.
AVE 2006: the decision was cast as Textual Entailment over text–hypothesis pairs.
AVE 2007–2008: automatic hypothesis generation from question and answer.
Outline
• Motivation and goals
• Definition and general framework
• AVE 2006
  – Underlying architecture: pipeline
  – Evaluating the validation
  – As an RTE exercise: text–hypothesis pairs
• AVE 2007 & 2008
• QA 2009
AVE 2006: an RTE exercise
If the text semantically entails the hypothesis, then the answer is expected to be correct.
The Question and the QA system’s exact answer are combined into a Hypothesis; the supporting snippet is the Text. Entailment?
Is this true? Yes, in 95% of cases with current QA systems (J LOG COMP 2009).
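A toy version of the two steps — hypothesis generation plus a lexical-overlap entailment check, in the spirit of the overlap baselines reported for AVE. The rewriting patterns and the 0.6 threshold are illustrative assumptions:

```python
def build_hypothesis(question, answer):
    """Combine question and candidate answer into a declarative
    hypothesis. Real systems used per-question-type rewriting rules;
    this sketch handles only simple 'What/Who is X?' questions."""
    q = question.strip().rstrip("?")
    for wh in ("What is", "Who is"):
        if q.startswith(wh):
            return f"{q[len(wh):].strip()} is {answer.strip()}"
    return f"{q}: {answer.strip()}"

def lexical_entailment(text, hypothesis, threshold=0.6):
    """Toy entailment test: the share of hypothesis tokens that also
    appear in the supporting text must clear the threshold."""
    t = set(text.lower().split())
    h = set(hypothesis.lower().split())
    return len(h & t) / len(h) >= threshold

h = build_hypothesis("What is Zanussi?",
                     "an Italian producer of home appliances")
# h == "Zanussi is an Italian producer of home appliances"
print(lexical_entailment("Zanussi was an Italian producer of home "
                         "appliances that in 1984 was bought", h))  # True
```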
Collections AVE 2006 (available at nlp.uned.es/clef-qa/ave/)

Language   | Testing (pairs, % entail.) | Training
English    | 2088 (10% YES)             | 2870 (15% YES)
Spanish    | 2369 (28% YES)             | 2905 (22% YES)
German     | 1443 (25% YES)             |
French     | 3266 (22% YES)             |
Italian    | 1140 (16% YES)             |
Dutch      |  807 (10% YES)             |
Portuguese | 1324 (14% YES)             |
Evaluating the Validation
Validation: decide if each candidate answer is correct or not (YES | NO).
The collections are not balanced.
Approach: detect whether there is enough evidence to accept an answer.
Measures: precision, recall and F over correct answers.
Baseline system: accept all answers.
Evaluating the Validation

                 | Correct Answer | Incorrect Answer
Answer Accepted  | nCA            | nWA
Answer Rejected  | nCR            | nWR

recall = nCA / (nCA + nCR)
precision = nCA / (nCA + nWA)
F = 2 · precision · recall / (precision + recall)
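The three validation measures as code — a sketch following the contingency table's cell naming (nCA/nWA for accepted correct/wrong answers, nCR/nWR for rejected ones):

```python
def validation_scores(n_ca, n_wa, n_cr, n_wr):
    """Precision, recall and F over correct answers.
    n_ca: correct answers accepted   n_wa: wrong answers accepted
    n_cr: correct answers rejected   n_wr: wrong answers rejected"""
    precision = n_ca / (n_ca + n_wa) if (n_ca + n_wa) else 0.0
    recall = n_ca / (n_ca + n_cr) if (n_ca + n_cr) else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# The accept-all baseline: every answer accepted, so recall is 1.0 and
# precision equals the rate of correct answers in the collection.
print(validation_scores(n_ca=20, n_wa=80, n_cr=0, n_wr=0))
```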
Results AVE 2006

Language   | Baseline (F) | Best system (F) | Reported techniques
English    | .27          | .44             | Logic
Spanish    | .45          | .61             | Logic
German     | .39          | .54             | Lexical, syntax, semantics, logic, corpus
French     | .37          | .47             | Overlapping, learning
Dutch      | .19          | .39             | Syntax, learning
Portuguese | .38          | .35             | Overlapping
Italian    | .29          | .41             | Overlapping, learning
Outline
• Motivation and goals
• Definition and general framework
• AVE 2006
• AVE 2007 & 2008
  – Underlying architecture: multi-stream
  – Quantify the potential benefit of AV in QA
  – Evaluating the correct selection of one answer
  – Evaluating the correct rejection of all answers
• QA 2009
AVE 2007 & 2008
QA sys 1 … QA sys n — the participant systems in CLEF-QA — each receive the Question and return candidate answers plus supporting texts; an Answer Validation & Selection module picks the final Answer.
AVE 2007 & 2008 evaluate this Answer Validation & Selection step.
Collections

<q id="116" lang="EN">
  <q_str> What is Zanussi? </q_str>
  <a id="116_1" value="">
    <a_str> was an Italian producer of home appliances </a_str>
    <t_str doc="Zanussi">Zanussi For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_2" value="">
    <a_str> who had also been in Cassibile since August 31 </a_str>
    <t_str doc="en/p29/2998260.xml">Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31.</t_str>
  </a>
  <a id="116_4" value="">
    <a_str> 3 </a_str>
    <t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
  </a>
</q>
Evaluating the Selection: goals
• Quantify the potential gain of Answer Validation in Question Answering
• Compare AV systems with QA systems
• Develop measures more comparable to QA accuracy

qa_accuracy = n_correctly_answered_questions / n_questions
Evaluating the selection
Given a question with several candidate answers, there are two options:
• Selection: select an answer ≡ try to answer the question
  – Correct selection: the answer was correct
  – Incorrect selection: the answer was incorrect
• Rejection: reject all candidate answers ≡ leave the question unanswered
  – Correct rejection: all candidate answers were incorrect
  – Incorrect rejection: not all candidate answers were
UNED
nlp.uned.es
Evaluating the Selection
n questions; n = nCA + nWA + nWS + nWR + nCR

                                          | Question with correct answer | Question without correct answer
Answered correctly (one answer selected)  | nCA                          | –
Answered incorrectly                      | nWA                          | nWS
Unanswered (all answers rejected)         | nWR                          | nCR

qa_accuracy = nCA / n
rej_accuracy = nCR / n
Evaluating the Selection
qa_accuracy = nCA / n
rej_accuracy = nCR / n
accuracy = (nCA + nCR) / n
This rewards rejection (the collections are not balanced). Interpretation for QA: all questions correctly rejected by AV would be answered correctly.
Evaluating the Selection

estimated = (1/n) · (nCA + nCR · (nCA / n))

qa_accuracy = nCA / n
rej_accuracy = nCR / n

Interpretation for QA: questions correctly rejected are given value as if they were answered correctly in the qa_accuracy proportion.
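The three selection measures side by side, as a sketch that follows the definitions above (the sample counts in the usage line are invented for illustration):

```python
def selection_measures(n_ca, n_wa, n_ws, n_wr, n_cr):
    """Measures for answer selection over n questions.
    n_ca: answered correctly          n_wa: answered incorrectly
    n_ws: wrong selection (no correct candidate existed)
    n_wr: incorrectly rejected        n_cr: correctly rejected"""
    n = n_ca + n_wa + n_ws + n_wr + n_cr
    qa_accuracy = n_ca / n
    rej_accuracy = n_cr / n
    # Correct rejections count as if answered correctly,
    # in the qa_accuracy proportion.
    estimated = (n_ca + n_cr * qa_accuracy) / n
    return qa_accuracy, rej_accuracy, estimated

print(selection_measures(n_ca=40, n_wa=30, n_ws=10, n_wr=10, n_cr=10))
```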
Analysis and discussion (AVE 2007, Spanish): charts comparing validation, selection, and AV systems vs. QA systems.
Techniques in AVE 2007

Technique                              | # systems
Generates hypotheses                   | 6
WordNet                                | 3
Chunking                               | 3
n-grams, longest common subsequences   | 5
Phrase transformations                 | 2
NER                                    | 5
Numeric expressions                    | 6
Temporal expressions                   | 4
Coreference resolution                 | 2
Dependency analysis                    | 3
Syntactic similarity                   | 4
Functions (sub, obj, etc.)             | 3
Syntactic transformations              | 1
Word-sense disambiguation              | 2
Semantic parsing                       | 4
Semantic role labeling                 | 2
First-order logic representation       | 3
Theorem prover                         | 3
Semantic similarity                    | 2
Conclusion of AVE
Answer Validation before: it was assumed to be a QA module, but there was no space for its own development.
The new devices should help to improve QA; they
• introduce more content analysis,
• use machine-learning techniques,
• are able to break pipelines or combine streams.
Let’s transfer them to the QA main task.
Outline
• Motivation and goals
• Definition and general framework
• AVE 2006
• AVE 2007 & 2008
• QA 2009
CLEF QA 2009 campaign
• ResPubliQA: QA on European legislation
• GikiCLEF: QA requiring geographical reasoning on Wikipedia
• QAST: QA on speech transcriptions of European Parliament plenary sessions
CLEF QA 2009 campaign

Task       | Registered groups  | Participant groups | Submitted runs            | Organizing people
ResPubliQA | 20                 | 11                 | 28 + 16 (baseline runs)   | 9
GikiCLEF   | 27                 | 8                  | 17 runs                   | 2
QAST       | 12                 | 4                  | 86 (5 subtasks)           | 8
Total      | 59 showed interest | 23 groups          | 147 runs evaluated        | 19 + additional assessors
ResPubliQA 2009: QA on European Legislation

Organizers: Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forascu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, Petya Osenova
Additional assessors: Fernando Luis Costa, Anna Kampchen, Julia Kramme, Cosmina Croitoru
Advisory board: Donna Harman, Maarten de Rijke, Dominique Laurent
Evolution of the task, 2003–2009
• Target languages: 3 (2003), 7 (2004), 8 (2005), 9 (2006), 10 (2007), 11 (2008), 8 (2009)
• Collections: News 1994 → + News 1995 → + Wikipedia Nov. 2006 → European legislation (2009)
• Number of questions: 200 → 500
• Type of questions: 200 factoid → + temporal restrictions → + definitions → − type of question → + lists → + linked questions → + closed lists → − linked, + reason, + purpose, + procedure
• Supporting information: document → snippet → paragraph
• Size of answer: snippet → exact → paragraph
Collection
Subset of JRC-Acquis (10,700 documents per language), parallel at document level: EU treaties, EU legislation, agreements and resolutions; economy, health, law, food, …; between 1950 and 2006.
500 questions
• REASON: Why did a Commission expert conduct an inspection visit to Uruguay?
• PURPOSE/OBJECTIVE: What is the overall objective of the eco-label?
• PROCEDURE: How are stable conditions in the natural rubber trade achieved?
In general, any question that can be answered in a paragraph.
500 questions — also:
• FACTOID: In how many languages is the Official Journal of the Community published?
• DEFINITION: What is meant by “whole milk”?
No NIL questions.
Systems’ response
No Answer ≠ Wrong Answer
1. Decide whether to answer or not [YES | NO] — a classification problem (machine learning, provers, textual entailment, etc.)
2. Provide the paragraph (ID + text) that answers the question
Aim: leaving a question unanswered has more value than giving a wrong answer.
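The answer-or-abstain decision can be seen as wrapping any validator score in a threshold; a minimal sketch, where the 0.5 threshold and the score interface are assumptions for illustration:

```python
def respond(paragraph_id, paragraph_text, validation_score, threshold=0.5):
    """Return the supporting paragraph only when the validator is
    confident enough; otherwise abstain (NoA), which the evaluation
    rewards over returning a wrong answer."""
    if validation_score >= threshold:
        return (paragraph_id, paragraph_text)
    return None  # NoA: leave the question unanswered

print(respond("p1", "some paragraph", 0.3))  # None
```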
Assessments
R: the question is answered correctly
W: the question is answered incorrectly
NoA: the question is not answered
• NoA R: NoA, but the candidate answer was correct
• NoA W: NoA, and the candidate answer was incorrect
• NoA empty: NoA, and no candidate answer was given
Evaluation measure: c@1, an extension of traditional accuracy (the proportion of questions correctly answered) that takes unanswered questions into account.
Evaluation measure
n: number of questions
nR: number of correctly answered questions
nU: number of unanswered questions

c@1 = (1/n) · (nR + nU · (nR / n))
Evaluation measure
c@1 = (1/n) · (nR + nU · (nR / n))
• If nU = 0 then c@1 = nR/n (accuracy)
• If nR = 0 then c@1 = 0
• If nU = n then c@1 = 0
Leaving a question unanswered adds value only if this avoids returning a wrong answer.
The added value of an unanswered question is the performance shown on the answered questions: accuracy.
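c@1 in code, a direct transcription of the formula (not the official evaluation script), with the boundary cases from the slide:

```python
def c_at_1(n_r, n_u, n):
    """c@1: unanswered questions are credited at the accuracy rate
    achieved on the questions the system did answer.
    n_r: correctly answered, n_u: unanswered, n: total questions."""
    return (n_r + n_u * (n_r / n)) / n

# 260 correct and 156 unanswered out of 500 questions (the figures of
# the icia092roro run below): plain accuracy 0.52, but c@1 ≈ 0.68.
print(round(c_at_1(260, 156, 500), 2))  # 0.68
```

With n_u = 0 it reduces to plain accuracy n_r/n; with n_r = 0 (or n_u = n) it is 0.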
List of Participants

System | Team
elix   | ELHUYAR-IXA, Spain
icia   | RACAI, Romania
iiit   | Search & Info Extraction Lab, India
iles   | LIMSI-CNRS-2, France
isik   | ISI-Kolkata, India
loga   | U. Koblenz-Landau, Germany
mira   | MIRACLE, Spain
nlel   | U. Politécnica de Valencia, Spain
syna   | Synapse Développement, France
uaic   | Al. I. Cuza U. of Iasi, Romania
uned   | UNED, Spain
Value of reducing wrong answers

System      | c@1  | Accuracy | #R  | #W  | #NoA | #NoA R | #NoA W | #NoA empty
combination | 0.76 | 0.76     | 381 | 119 | 0    | 0      | 0      | 0
icia092roro | 0.68 | 0.52     | 260 | 84  | 156  | 0      | 0      | 156
icia091roro | 0.58 | 0.47     | 237 | 156 | 107  | 0      | 0      | 107
UAIC092roro | 0.47 | 0.47     | 236 | 264 | 0    | 0      | 0      | 0
UAIC091roro | 0.45 | 0.45     | 227 | 273 | 0    | 0      | 0      | 0
base092roro | 0.44 | 0.44     | 220 | 280 | 0    | 0      | 0      | 0
base091roro | 0.37 | 0.37     | 185 | 315 | 0    | 0      | 0      | 0
Detecting wrong answers

System      | c@1  | Accuracy | #R  | #W  | #NoA | #NoA R | #NoA W | #NoA empty
combination | 0.56 | 0.56     | 278 | 222 | 0    | 0      | 0      | 0
loga091dede | 0.44 | 0.40     | 186 | 221 | 93   | 16     | 68     | 9
loga092dede | 0.44 | 0.40     | 187 | 230 | 83   | 12     | 62     | 9
base092dede | 0.38 | 0.38     | 189 | 311 | 0    | 0      | 0      | 0
base091dede | 0.35 | 0.35     | 174 | 326 | 0    | 0      | 0      | 0

While maintaining the number of correct answers, the candidate answer was not correct for 83% of the unanswered questions.
A very good step towards improving the system.
IR important, not enough

System      | c@1  | Accuracy | #R  | #W  | #NoA | #NoA R | #NoA W | #NoA empty
combination | 0.9  | 0.9      | 451 | 49  | 0    | 0      | 0      | 0
uned092enen | 0.61 | 0.61     | 288 | 184 | 28   | 15     | 12     | 1
uned091enen | 0.6  | 0.59     | 282 | 190 | 28   | 15     | 13     | 0
nlel091enen | 0.58 | 0.57     | 287 | 211 | 2    | 0      | 0      | 2
uaic092enen | 0.54 | 0.52     | 243 | 204 | 53   | 18     | 35     | 0
base092enen | 0.53 | 0.53     | 263 | 236 | 1    | 1      | 0      | 0
base091enen | 0.51 | 0.51     | 256 | 243 | 1    | 0      | 1      | 0
elix092enen | 0.48 | 0.48     | 240 | 260 | 0    | 0      | 0      | 0
uaic091enen | 0.44 | 0.42     | 200 | 253 | 47   | 11     | 36     | 0
elix091enen | 0.42 | 0.42     | 211 | 289 | 0    | 0      | 0      | 0
syna091enen | 0.28 | 0.28     | 141 | 359 | 0    | 0      | 0      | 0
isik091enen | 0.25 | 0.25     | 126 | 374 | 0    | 0      | 0      | 0
iiit091enen | 0.2  | 0.11     | 54  | 37  | 409  | 0      | 11     | 398
elix092euen | 0.18 | 0.18     | 91  | 409 | 0    | 0      | 0      | 0
elix091euen | 0.16 | 0.16     | 78  | 422 | 0    | 0      | 0      | 0

An achievable task: the perfect combination is 50% better than the best system.
Many systems are under the IR baselines.
Outline
• Motivation and goals
• Definition and general framework
• AVE 2006
• AVE 2007 & 2008
• QA 2009
• Conclusion
Conclusion
A new QA evaluation setting, assuming that leaving a question unanswered has more value than giving a wrong answer.
This assumption gives space to develop QA systems further, and hopefully to improve their performance.
Thanks!
http://nlp.uned.es/clef-qa/ave
http://www.clef-campaign.org
Acknowledgement: EU project T-CLEF (ICT-1-4-1 215231)