Evaluating Question Answering Validation

Anselmo Peñas (and Álvaro Rodrigo)
NLP & IR Group, UNED
nlp.uned.es

Information Science Institute, Marina del Rey, December 11, 2009
Old friends: Question Answering
Nothing else than answering a question.
Natural Language Understanding: something is there, if you are able to answer a question.
QA: an extrinsic evaluation for NLU.
Suddenly… (see the track?) … the QA Track at TREC.
Question Answering at TREC
QA became an object of evaluation itself, redefined (roughly speaking) as a highly precision-oriented IR task where NLP was necessary, especially for Answer Extraction.
• Before: knowledge bases (e.g. semantic networks), specific domain, a single accurate answer (with explanation): more reasoning.
• After: big document collections (news, blogs), unrestricted domain, ranking of answers (linked to documents): more retrieval.
What’s this story about?
QA tasks at CLEF, 2003–2010:
• Multiple Language QA Main Task / ResPubliQA
• Temporal restrictions and lists
• Answer Validation Exercise (AVE)
• GikiCLEF
• Real Time QA
• QA over Speech Transcriptions (QAST)
• WiQA
• WSD QA
Outline
1. Motivation and goals
2. Definition and general framework
3. AVE 2006
4. AVE 2007 & 2008
5. QA 2009
The evaluation cycle (short cycle / long cycle):
1. Analysis of current systems’ performance
2. Mid-term goals and strategy
3. Evaluation task definition
4. Analysis of the evaluation cycle: result analysis, methodology analysis
Plus: generation of methodology and evaluation resources; task activation and development.
Systems performance, 2003–2006 (Spanish)
• Overall: best result < 60%
• Definitions: best result > 80% (NOT an IR approach)
Pipeline Upper Bounds
We need SOMETHING to break the pipeline:
Question → Question analysis → Passage retrieval → Answer extraction → Answer ranking → Answer
1.0 × 0.8 × 0.8 = 0.64 — not enough evidence
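The multiplicative bound on this slide can be reproduced in a few lines (the stage rates 1.0, 0.8 and 0.8 are the slide's illustrative figures, not measurements):

```python
from functools import reduce

# Illustrative per-stage success rates from the slide.
stage_rates = [
    1.0,   # question analysis
    0.8,   # passage retrieval
    0.8,   # answer extraction / ranking
]

def pipeline_upper_bound(rates):
    """In a strict pipeline an error at any stage is unrecoverable,
    so the best reachable end-to-end accuracy is the product of the
    per-stage success rates."""
    return reduce(lambda acc, r: acc * r, rates, 1.0)

print(round(pipeline_upper_bound(stage_rates), 2))  # 0.64
```

Hence the motivation for "something to break the pipeline": a component that can reject an answer or choose among streams escapes this product bound.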
Results in CLEF-QA 2006 (Spanish)
• Perfect combination: 81%
• Best system: 52.5%
• Different systems were best for ORGANIZATION, PERSON and TIME questions.
Collaborative architectures
Different systems answer different types of questions better:
• Specialization
• Collaboration
QA sys 1 … QA sys n each take the Question and produce candidate answers; we need SOMETHING for combining / selecting the final Answer.
Collaborative architectures: how to select the good answer?
• Redundancy
• Voting
• Confidence score
• Performance history
Why not deeper content analysis?
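A minimal sketch of such shallow selection, combining voting with confidence scores as a tie-breaker (the input format and the lower-casing normalisation are assumptions for illustration, not any AVE system's design):

```python
from collections import defaultdict

def select_answer(candidates):
    """Pick one answer from several QA streams.
    `candidates` is a list of (answer_string, confidence) pairs, one per
    stream. Majority voting decides; summed confidence breaks ties.
    Lower-casing stands in for real answer-equivalence matching."""
    votes = defaultdict(int)
    confidence = defaultdict(float)
    for answer, score in candidates:
        key = answer.strip().lower()
        votes[key] += 1
        confidence[key] += score
    return max(votes, key=lambda a: (votes[a], confidence[a]))

print(select_answer([("Rome", 0.6), ("rome", 0.5), ("Paris", 0.9)]))  # rome
```

The slide's point is precisely that this shallow evidence (redundancy, votes, scores, history) is what multi-stream systems used, and that deeper content analysis — validation — was the missing piece.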
Mid-Term Goal
Goal: improve QA systems’ performance.
New mid-term goal: improve the devices for rejecting / accepting / selecting answers.
The new task (2006): validate the correctness of the answers given by real QA systems — the participants at CLEF QA.
Outline
1. Motivation and goals
2. Definition and general framework
3. AVE 2006
4. AVE 2007 & 2008
5. QA 2009
Define Answer Validation: decide whether an answer is correct or not.
More precisely, the task: given
• a Question,
• an Answer,
• a Supporting Text,
decide if the answer is correct according to the supporting text.
Let’s call it the Answer Validation Exercise (AVE).
Wish list
• Test collection: questions, answers, supporting texts, human assessments
• Evaluation measures
• Participants
Evaluation linked to the main QA task
The Question Answering Track provides the questions, the systems’ answers and the systems’ supporting texts. The Answer Validation Exercise judges each answer (ACCEPT / REJECT).
The human judgements (R, W, X, U) behind the QA Track results are mapped to (ACCEPT / REJECT) and used to evaluate the AVE Track results.
This reuses the human assessments.
Answer Validation Exercise (AVE)
Inputs: a Question, a Candidate answer, a Supporting text.
Answer Validation decides: the answer is correct, or the answer is not correct / there is not enough evidence.
AVE 2006: the decision was cast as Textual Entailment over text–hypothesis pairs.
AVE 2007–2008: automatic hypothesis generation from question and answer.
Outline
• Motivation and goals
• Definition and general framework
• AVE 2006
  – Underlying architecture: pipeline
  – Evaluating the validation
  – As an RTE exercise: text–hypothesis pairs
• AVE 2007 & 2008
• QA 2009
AVE 2006: an RTE exercise
If the text semantically entails the hypothesis, then the answer is expected to be correct.
The Question and the QA system’s exact answer are combined into a Hypothesis; the supporting snippet is the Text. Entailment?
Is this true? Yes, in 95% of cases with current QA systems (J LOG COMP 2009).
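A toy version of the two steps — hypothesis generation plus a lexical-overlap entailment check, in the spirit of the overlap baselines reported for AVE. The rewriting patterns and the 0.6 threshold are illustrative assumptions:

```python
def build_hypothesis(question, answer):
    """Combine question and candidate answer into a declarative
    hypothesis. Real systems used per-question-type rewriting rules;
    this sketch handles only simple 'What/Who is X?' questions."""
    q = question.strip().rstrip("?")
    for wh in ("What is", "Who is"):
        if q.startswith(wh):
            return f"{q[len(wh):].strip()} is {answer.strip()}"
    return f"{q}: {answer.strip()}"

def lexical_entailment(text, hypothesis, threshold=0.6):
    """Toy entailment test: the share of hypothesis tokens that also
    appear in the supporting text must clear the threshold."""
    t = set(text.lower().split())
    h = set(hypothesis.lower().split())
    return len(h & t) / len(h) >= threshold

h = build_hypothesis("What is Zanussi?",
                     "an Italian producer of home appliances")
# h == "Zanussi is an Italian producer of home appliances"
print(lexical_entailment("Zanussi was an Italian producer of home "
                         "appliances that in 1984 was bought", h))  # True
```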
Collections AVE 2006 (available at nlp.uned.es/clef-qa/ave/)

Language   | Testing (pairs, % entail.) | Training
English    | 2088 (10% YES)             | 2870 (15% YES)
Spanish    | 2369 (28% YES)             | 2905 (22% YES)
German     | 1443 (25% YES)             |
French     | 3266 (22% YES)             |
Italian    | 1140 (16% YES)             |
Dutch      |  807 (10% YES)             |
Portuguese | 1324 (14% YES)             |
Evaluating the Validation
Validation: decide if each candidate answer is correct or not (YES | NO).
The collections are not balanced.
Approach: detect whether there is enough evidence to accept an answer.
Measures: precision, recall and F over correct answers.
Baseline system: accept all answers.
Evaluating the Validation

                 | Correct Answer | Incorrect Answer
Answer Accepted  | nCA            | nWA
Answer Rejected  | nCR            | nWR

recall = nCA / (nCA + nCR)
precision = nCA / (nCA + nWA)
F = 2 · precision · recall / (precision + recall)
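The three validation measures as code — a sketch following the contingency table's cell naming (nCA/nWA for accepted correct/wrong answers, nCR/nWR for rejected ones):

```python
def validation_scores(n_ca, n_wa, n_cr, n_wr):
    """Precision, recall and F over correct answers.
    n_ca: correct answers accepted   n_wa: wrong answers accepted
    n_cr: correct answers rejected   n_wr: wrong answers rejected"""
    precision = n_ca / (n_ca + n_wa) if (n_ca + n_wa) else 0.0
    recall = n_ca / (n_ca + n_cr) if (n_ca + n_cr) else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# The accept-all baseline: every answer accepted, so recall is 1.0 and
# precision equals the rate of correct answers in the collection.
print(validation_scores(n_ca=20, n_wa=80, n_cr=0, n_wr=0))
```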
Results AVE 2006

Language   | Baseline (F) | Best system (F) | Reported techniques
English    | .27          | .44             | Logic
Spanish    | .45          | .61             | Logic
German     | .39          | .54             | Lexical, syntax, semantics, logic, corpus
French     | .37          | .47             | Overlapping, learning
Dutch      | .19          | .39             | Syntax, learning
Portuguese | .38          | .35             | Overlapping
Italian    | .29          | .41             | Overlapping, learning
Outline
• Motivation and goals
• Definition and general framework
• AVE 2006
• AVE 2007 & 2008
  – Underlying architecture: multi-stream
  – Quantify the potential benefit of AV in QA
  – Evaluating the correct selection of one answer
  – Evaluating the correct rejection of all answers
• QA 2009
AVE 2007 & 2008
QA sys 1 … QA sys n — the participant systems in CLEF-QA — each receive the Question and return candidate answers plus supporting texts; an Answer Validation & Selection module picks the final Answer.
AVE 2007 & 2008 evaluate this Answer Validation & Selection step.
Collections

<q id="116" lang="EN">
  <q_str> What is Zanussi? </q_str>
  <a id="116_1" value="">
    <a_str> was an Italian producer of home appliances </a_str>
    <t_str doc="Zanussi">Zanussi For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_2" value="">
    <a_str> who had also been in Cassibile since August 31 </a_str>
    <t_str doc="en/p29/2998260.xml">Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31.</t_str>
  </a>
  <a id="116_4" value="">
    <a_str> 3 </a_str>
    <t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
  </a>
</q>
Evaluating the Selection: goals
• Quantify the potential gain of Answer Validation in Question Answering
• Compare AV systems with QA systems
• Develop measures more comparable to QA accuracy

qa_accuracy = n_correctly_answered_questions / n_questions
Evaluating the selection
Given a question with several candidate answers, there are two options:
• Selection: select an answer ≡ try to answer the question
  – Correct selection: the answer was correct
  – Incorrect selection: the answer was incorrect
• Rejection: reject all candidate answers ≡ leave the question unanswered
  – Correct rejection: all candidate answers were incorrect
  – Incorrect rejection: not all candidate answers were
UNED
nlp.uned.es
Evaluating the Selection
n questions; n = nCA + nWA + nWS + nWR + nCR

                                          | Question with correct answer | Question without correct answer
Answered correctly (one answer selected)  | nCA                          | –
Answered incorrectly                      | nWA                          | nWS
Unanswered (all answers rejected)         | nWR                          | nCR

qa_accuracy = nCA / n
rej_accuracy = nCR / n
Evaluating the Selection
qa_accuracy = nCA / n
rej_accuracy = nCR / n
accuracy = (nCA + nCR) / n
This rewards rejection (the collections are not balanced). Interpretation for QA: all questions correctly rejected by AV would be answered correctly.
Evaluating the Selection

estimated = (1/n) · (nCA + nCR · (nCA / n))

qa_accuracy = nCA / n
rej_accuracy = nCR / n

Interpretation for QA: questions correctly rejected are given value as if they were answered correctly in the qa_accuracy proportion.
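The three selection measures side by side, as a sketch that follows the definitions above (the sample counts in the usage line are invented for illustration):

```python
def selection_measures(n_ca, n_wa, n_ws, n_wr, n_cr):
    """Measures for answer selection over n questions.
    n_ca: answered correctly          n_wa: answered incorrectly
    n_ws: wrong selection (no correct candidate existed)
    n_wr: incorrectly rejected        n_cr: correctly rejected"""
    n = n_ca + n_wa + n_ws + n_wr + n_cr
    qa_accuracy = n_ca / n
    rej_accuracy = n_cr / n
    # Correct rejections count as if answered correctly,
    # in the qa_accuracy proportion.
    estimated = (n_ca + n_cr * qa_accuracy) / n
    return qa_accuracy, rej_accuracy, estimated

print(selection_measures(n_ca=40, n_wa=30, n_ws=10, n_wr=10, n_cr=10))
```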
Analysis and discussion (AVE 2007, Spanish): charts comparing validation, selection, and AV systems vs. QA systems.
Techniques in AVE 2007

Technique                              | # systems
Generates hypotheses                   | 6
WordNet                                | 3
Chunking                               | 3
n-grams, longest common subsequences   | 5
Phrase transformations                 | 2
NER                                    | 5
Numeric expressions                    | 6
Temporal expressions                   | 4
Coreference resolution                 | 2
Dependency analysis                    | 3
Syntactic similarity                   | 4
Functions (sub, obj, etc.)             | 3
Syntactic transformations              | 1
Word-sense disambiguation              | 2
Semantic parsing                       | 4
Semantic role labeling                 | 2
First-order logic representation       | 3
Theorem prover                         | 3
Semantic similarity                    | 2
Conclusion of AVE
Answer Validation before: it was assumed to be a QA module, but there was no space for its own development.
The new devices should help to improve QA; they
• introduce more content analysis,
• use machine-learning techniques,
• are able to break pipelines or combine streams.
Let’s transfer them to the QA main task.
Outline
• Motivation and goals
• Definition and general framework
• AVE 2006
• AVE 2007 & 2008
• QA 2009
CLEF QA 2009 campaign
• ResPubliQA: QA on European legislation
• GikiCLEF: QA requiring geographical reasoning on Wikipedia
• QAST: QA on speech transcriptions of European Parliament plenary sessions
CLEF QA 2009 campaign

Task       | Registered groups  | Participant groups | Submitted runs            | Organizing people
ResPubliQA | 20                 | 11                 | 28 + 16 (baseline runs)   | 9
GikiCLEF   | 27                 | 8                  | 17 runs                   | 2
QAST       | 12                 | 4                  | 86 (5 subtasks)           | 8
Total      | 59 showed interest | 23 groups          | 147 runs evaluated        | 19 + additional assessors
ResPubliQA 2009: QA on European Legislation

Organizers: Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forascu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, Petya Osenova
Additional assessors: Fernando Luis Costa, Anna Kampchen, Julia Kramme, Cosmina Croitoru
Advisory board: Donna Harman, Maarten de Rijke, Dominique Laurent
Evolution of the task, 2003–2009
• Target languages: 3 (2003), 7 (2004), 8 (2005), 9 (2006), 10 (2007), 11 (2008), 8 (2009)
• Collections: News 1994 → + News 1995 → + Wikipedia Nov. 2006 → European legislation (2009)
• Number of questions: 200 → 500
• Type of questions: 200 factoid → + temporal restrictions → + definitions → − type of question → + lists → + linked questions → + closed lists → − linked, + reason, + purpose, + procedure
• Supporting information: document → snippet → paragraph
• Size of answer: snippet → exact → paragraph
Collection
Subset of JRC-Acquis (10,700 documents per language), parallel at document level: EU treaties, EU legislation, agreements and resolutions; economy, health, law, food, …; between 1950 and 2006.
500 questions
• REASON: Why did a Commission expert conduct an inspection visit to Uruguay?
• PURPOSE/OBJECTIVE: What is the overall objective of the eco-label?
• PROCEDURE: How are stable conditions in the natural rubber trade achieved?
In general, any question that can be answered in a paragraph.
500 questions — also:
• FACTOID: In how many languages is the Official Journal of the Community published?
• DEFINITION: What is meant by “whole milk”?
No NIL questions.
Systems’ response
No Answer ≠ Wrong Answer
1. Decide whether to answer or not [YES | NO] — a classification problem (machine learning, provers, textual entailment, etc.)
2. Provide the paragraph (ID + text) that answers the question
Aim: leaving a question unanswered has more value than giving a wrong answer.
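The answer-or-abstain decision can be seen as wrapping any validator score in a threshold; a minimal sketch, where the 0.5 threshold and the score interface are assumptions for illustration:

```python
def respond(paragraph_id, paragraph_text, validation_score, threshold=0.5):
    """Return the supporting paragraph only when the validator is
    confident enough; otherwise abstain (NoA), which the evaluation
    rewards over returning a wrong answer."""
    if validation_score >= threshold:
        return (paragraph_id, paragraph_text)
    return None  # NoA: leave the question unanswered

print(respond("p1", "some paragraph", 0.3))  # None
```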
Assessments
R: the question is answered correctly
W: the question is answered incorrectly
NoA: the question is not answered
• NoA R: NoA, but the candidate answer was correct
• NoA W: NoA, and the candidate answer was incorrect
• NoA empty: NoA, and no candidate answer was given
Evaluation measure: c@1, an extension of traditional accuracy (the proportion of questions correctly answered) that takes unanswered questions into account.
Evaluation measure
n: number of questions
nR: number of correctly answered questions
nU: number of unanswered questions

c@1 = (1/n) · (nR + nU · (nR / n))
Evaluation measure
c@1 = (1/n) · (nR + nU · (nR / n))
• If nU = 0 then c@1 = nR/n (accuracy)
• If nR = 0 then c@1 = 0
• If nU = n then c@1 = 0
Leaving a question unanswered adds value only if this avoids returning a wrong answer.
The added value of an unanswered question is the performance shown on the answered questions: accuracy.
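c@1 in code, a direct transcription of the formula (not the official evaluation script), with the boundary cases from the slide:

```python
def c_at_1(n_r, n_u, n):
    """c@1: unanswered questions are credited at the accuracy rate
    achieved on the questions the system did answer.
    n_r: correctly answered, n_u: unanswered, n: total questions."""
    return (n_r + n_u * (n_r / n)) / n

# 260 correct and 156 unanswered out of 500 questions (the figures of
# the icia092roro run below): plain accuracy 0.52, but c@1 ≈ 0.68.
print(round(c_at_1(260, 156, 500), 2))  # 0.68
```

With n_u = 0 it reduces to plain accuracy n_r/n; with n_r = 0 (or n_u = n) it is 0.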
List of Participants

System | Team
elix   | ELHUYAR-IXA, Spain
icia   | RACAI, Romania
iiit   | Search & Info Extraction Lab, India
iles   | LIMSI-CNRS-2, France
isik   | ISI-Kolkata, India
loga   | U. Koblenz-Landau, Germany
mira   | MIRACLE, Spain
nlel   | U. Politécnica de Valencia, Spain
syna   | Synapse Développement, France
uaic   | Al. I. Cuza U. of Iasi, Romania
uned   | UNED, Spain
Value of reducing wrong answers

System      | c@1  | Accuracy | #R  | #W  | #NoA | #NoA R | #NoA W | #NoA empty
combination | 0.76 | 0.76     | 381 | 119 | 0    | 0      | 0      | 0
icia092roro | 0.68 | 0.52     | 260 | 84  | 156  | 0      | 0      | 156
icia091roro | 0.58 | 0.47     | 237 | 156 | 107  | 0      | 0      | 107
UAIC092roro | 0.47 | 0.47     | 236 | 264 | 0    | 0      | 0      | 0
UAIC091roro | 0.45 | 0.45     | 227 | 273 | 0    | 0      | 0      | 0
base092roro | 0.44 | 0.44     | 220 | 280 | 0    | 0      | 0      | 0
base091roro | 0.37 | 0.37     | 185 | 315 | 0    | 0      | 0      | 0
Detecting wrong answers

System      | c@1  | Accuracy | #R  | #W  | #NoA | #NoA R | #NoA W | #NoA empty
combination | 0.56 | 0.56     | 278 | 222 | 0    | 0      | 0      | 0
loga091dede | 0.44 | 0.40     | 186 | 221 | 93   | 16     | 68     | 9
loga092dede | 0.44 | 0.40     | 187 | 230 | 83   | 12     | 62     | 9
base092dede | 0.38 | 0.38     | 189 | 311 | 0    | 0      | 0      | 0
base091dede | 0.35 | 0.35     | 174 | 326 | 0    | 0      | 0      | 0

While maintaining the number of correct answers, the candidate answer was not correct for 83% of the unanswered questions.
A very good step towards improving the system.
IR important, not enough

System      | c@1  | Accuracy | #R  | #W  | #NoA | #NoA R | #NoA W | #NoA empty
combination | 0.9  | 0.9      | 451 | 49  | 0    | 0      | 0      | 0
uned092enen | 0.61 | 0.61     | 288 | 184 | 28   | 15     | 12     | 1
uned091enen | 0.6  | 0.59     | 282 | 190 | 28   | 15     | 13     | 0
nlel091enen | 0.58 | 0.57     | 287 | 211 | 2    | 0      | 0      | 2
uaic092enen | 0.54 | 0.52     | 243 | 204 | 53   | 18     | 35     | 0
base092enen | 0.53 | 0.53     | 263 | 236 | 1    | 1      | 0      | 0
base091enen | 0.51 | 0.51     | 256 | 243 | 1    | 0      | 1      | 0
elix092enen | 0.48 | 0.48     | 240 | 260 | 0    | 0      | 0      | 0
uaic091enen | 0.44 | 0.42     | 200 | 253 | 47   | 11     | 36     | 0
elix091enen | 0.42 | 0.42     | 211 | 289 | 0    | 0      | 0      | 0
syna091enen | 0.28 | 0.28     | 141 | 359 | 0    | 0      | 0      | 0
isik091enen | 0.25 | 0.25     | 126 | 374 | 0    | 0      | 0      | 0
iiit091enen | 0.2  | 0.11     | 54  | 37  | 409  | 0      | 11     | 398
elix092euen | 0.18 | 0.18     | 91  | 409 | 0    | 0      | 0      | 0
elix091euen | 0.16 | 0.16     | 78  | 422 | 0    | 0      | 0      | 0

An achievable task: the perfect combination is 50% better than the best system.
Many systems are under the IR baselines.
Outline
• Motivation and goals
• Definition and general framework
• AVE 2006
• AVE 2007 & 2008
• QA 2009
• Conclusion
Conclusion
A new QA evaluation setting, assuming that leaving a question unanswered has more value than giving a wrong answer.
This assumption gives space to develop QA systems further, and hopefully to improve their performance.
Thanks!
http://nlp.uned.es/clef-qa/ave
http://www.clef-campaign.org
Acknowledgement: EU project T-CLEF (ICT-1-4-1 215231)