CLEF 2011, Amsterdam
QA4MRE, Question Answering for Machine Reading Evaluation
Question Answering Track Overview

Main task: Anselmo Peñas, Eduard Hovy, Pamela Forner, Álvaro Rodrigo, Richard Sutcliffe, Corina Forascu, Caroline Sporleder
Modality and Negation: Roser Morante, Walter Daelemans
QA Tasks & Time at CLEF (2003–2011)

Main task sequence: Multiple Language QA Main Task → ResPubliQA → QA4MRE
Other exercises over the years:
- Temporal restrictions and lists
- Answer Validation Exercise (AVE)
- GikiCLEF
- Negation and Modality
- Real Time
- QA over Speech Transcriptions (QAST)
- WiQA
- WSD QA
New setting
QA over a single document: multiple-choice reading comprehension tests
• Forget about the IR step (for a while)
• Focus on answering questions about a single text
• Choose the correct answer
Why this new setting?
Systems' performance shows an upper bound of about 60% accuracy:
- Best overall result: below 60%
- Best result on definition questions: above 80%, achieved with an approach that was NOT IR-based
Pipeline Upper Bound
SOMETHING to break the pipeline: answer validation instead of re-ranking
Question → Question analysis → Passage Retrieval → Answer Extraction → Answer Ranking → Answer
Per-stage accuracies compound: 1.0 × 0.8 × 0.8 = 0.64
Not enough evidence
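The compounding effect can be sketched numerically. The per-stage accuracies below are the illustrative figures from the slide, not measurements from any particular system; answer ranking then has to choose among candidates that are already degraded:

```python
# Illustrative per-stage accuracies for a QA pipeline; errors compound
# multiplicatively because each stage only sees its predecessor's output.
stages = {
    "question analysis": 1.0,
    "passage retrieval": 0.8,
    "answer extraction": 0.8,
}

upper_bound = 1.0
for accuracy in stages.values():
    upper_bound *= accuracy

print(round(upper_bound, 2))  # 0.64
```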
Multi-stream upper bound
- Perfect combination of streams: 81%
- Best single system: 52.5%
- Different systems are best for ORGANIZATION, PERSON, and TIME questions
Multi-stream architectures
Different systems answer different types of questions better
• Specialization
• Collaboration

Question → QA sys 1, QA sys 2, QA sys 3, …, QA sys n → candidate answers → SOMETHING for combining / selecting → Answer
AVE 2006-2008
Answer Validation: decide whether to return the candidate answer or not
Answer Validation should help to improve QA:
- Introduce more content analysis
- Use Machine Learning techniques
- Be able to break pipelines and combine streams
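A minimal sketch of validation-as-selection over multiple streams, with the option to abstain. The confidence function here (token overlap plus answer containment) is a toy stand-in for a real ML-trained validator, and the names `validate` and `select_answer` are mine:

```python
def validate(question: str, answer: str, snippet: str) -> float:
    """Crude confidence that `snippet` supports `answer` for `question`."""
    q_tokens = set(question.lower().split())
    s_tokens = set(snippet.lower().split())
    overlap = len(q_tokens & s_tokens) / max(len(q_tokens), 1)
    return overlap if answer.lower() in snippet.lower() else 0.0

def select_answer(question, candidates, threshold=0.3):
    """candidates: (answer, supporting_snippet) pairs from any number of
    QA streams. Returns the best-validated answer, or None to leave the
    question unanswered when no candidate is validated well enough."""
    scored = [(validate(question, ans, snip), ans) for ans, snip in candidates]
    best_score, best_answer = max(scored)
    return best_answer if best_score >= threshold else None
```

Selecting by validation score rather than re-ranking within one pipeline lets candidates from any stream win, and abstention falls out naturally from the threshold.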
Hypothesis generation + validation
Question → search the space of candidate answers → hypothesis generation functions + answer validation functions → Answer
ResPubliQA 2009 - 2010
Transfer AVE results to the QA main task (2009 and 2010)
Promote QA systems with better answer validation
A QA evaluation setting assuming that leaving a question unanswered has more value than giving a wrong answer
Evaluation measure
n: number of questions
nR: number of correctly answered questions
nU: number of unanswered questions

c@1 = (1/n) · (nR + nU · (nR / n))
Reward systems that maintain accuracy but reduce the number of incorrect answers by leaving some questions unanswered
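The measure is straightforward to implement (the function name `c_at_1` is mine):

```python
def c_at_1(n: int, n_correct: int, n_unanswered: int) -> float:
    """c@1 = (nR + nU * nR / n) / n: unanswered questions earn
    partial credit at the system's observed accuracy rate."""
    return (n_correct + n_unanswered * (n_correct / n)) / n

# 50/100 correct, nothing left unanswered -> plain accuracy
print(c_at_1(100, 50, 0))   # 0.5
# 50 correct, 30 unanswered (so only 20 wrong) -> rewarded
print(c_at_1(100, 50, 30))  # 0.65
```

With no unanswered questions, c@1 reduces to plain accuracy; abstaining on questions that would otherwise be answered wrongly strictly raises the score.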
Conclusions of ResPubliQA 2009 – 2010
This was not enough:
- We expected a bigger change in systems' architecture
- Validation is still in the pipeline: bad IR → bad QA
- No qualitative improvement in performance
- Need for room to develop the technology
2011 campaign
Promote a bigger change in QA systems architecture
QA4MRE: Question Answering for Machine Reading Evaluation
Measure progress in two reading abilities:
- Answer questions about a single text
- Capture knowledge from text collections
Reading test
Text
Coal seam gas drilling in Australia's Surat Basin has been halted by flooding.
Australia's Easternwell, being acquired by Transfield Services, has ceased drilling because of the flooding.
The company is drilling coal seam gas wells for Australia's Santos Ltd.
Santos said the impact was minimal.
Multiple-choice test. According to the text, what company owns wells in the Surat Basin?
a) Australia
b) Coal seam gas wells
c) Easternwell
d) Transfield Services
e) Santos Ltd.
f) Ausam Energy Corporation
g) Queensland
h) Chinchilla
Knowledge gaps
Acquire this knowledge from the reference collection
[Diagram: relations acquired from text, e.g. "Company B drills Well C for Company A" supports "own" (P = 0.8); Surat Basin is part of Queensland, which is part of Australia]
Knowledge-Understanding dependence
We “understand” because we “know”; we need a little more of both to answer questions:
I. Capture ‘knowledge’ expressed in texts
II. ‘Understand’ language
(the reading cycle)
Control the variable of knowledge
The ability to make inferences about texts is correlated with the amount of knowledge considered. This variable has to be taken into account during evaluation; otherwise it is very difficult to compare methods.
How to control the variable of knowledge in a reading task?
Texts as sources of knowledge: a text collection
- Big and diverse enough to acquire knowledge (impossible for all possible topics)
- Define a scalable strategy: topic by topic
- One reference collection per topic (20,000–100,000 docs)
- Several topics, each narrow enough to limit the knowledge needed:
  • AIDS
  • CLIMATE CHANGE
  • MUSIC & SOCIETY
Evaluation tests
12 reading tests (4 docs per topic)
120 questions (10 questions per test)
600 choices (5 options per question)
Translated into 5 languages: English, German, Spanish, Italian, Romanian
Evaluation tests
44 questions required background knowledge from the reference collection
38 required combining information from different paragraphs
Textual inferences:
- Lexical: acronyms, synonyms, hypernyms…
- Syntactic: nominalizations, paraphrasing…
- Discourse: coreference, ellipsis…
Evaluation
QA perspective evaluation: c@1 over all 120 questions
Reading perspective evaluation: aggregating results by test

Task     Registered groups   Participant groups   Submitted runs
QA4MRE   25                  12                   62
Workshop QA4MRE
Tuesday 10:30 – 12:30
Keynote: Text Mining in Biograph (Walter Daelemans)
QA4MRE methodology and results (Álvaro Rodrigo)
Report on Modality and Negation pilot (Roser Morante)
14:00 – 16:00: Reports from participants
Wednesday 10:30 – 12:30
Breakout session
CLEF 2011, Amsterdam
QA4MRE, Question Answering for Machine Reading Evaluation
Question Answering Track Breakout Session
QA4MRE breakout session: Task
- Questions are more difficult and realistic
- 100% reusable test sets

Languages and participants:
- No participants for some languages, but a valuable resource for evaluation
- Good balance for developing tests in other languages (even without participants)
- The problem is finding parallel translations for the tests
QA4MRE breakout session: Background collections
- Good balance of quality and noise
- The methodology to build them is OK

Test documents (TED):
- Not ideal, but parallel
- Open audience and no copyright issues
- Consider other possibilities: CafeBabel, BBC News
QA4MRE breakout session: Evaluation
- Encourage participants to test previous systems on new campaigns
- Ablation tests: what happens if you remove a component?
- Runs with and without background knowledge, with and without external resources
- Processing time measurements
QA4MRE 2012
Topics

Previous:
1. AIDS
2. Music and Society
3. Climate Change

Added:
4. Alzheimer's disease (popular sources: blogs, web, news, …)
QA4MRE 2012 Pilots
Modality and Negation: move to a three-value setting.
Given an event in the text, decide whether it is:
1. Asserted (no negation and no speculation)
2. Negated (negation and no speculation)
3. Speculated

Roadmap:
1. 2012: run as a separate pilot
2. 2013: integrate modality and negation into the main task tests
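As a toy illustration of the three-value decision (not the pilot's annotation scheme or any participant's system), a cue-word classifier might look like this; the cue lists are illustrative, not from the task guidelines:

```python
# Toy three-value classifier for event mentions: asserted / negated /
# speculated, decided from surface cue words in the sentence.
NEGATION_CUES = {"not", "no", "never", "without"}
SPECULATION_CUES = {"may", "might", "possibly", "suggest", "suggests"}

def classify_event(sentence: str) -> str:
    tokens = set(sentence.lower().split())
    if tokens & SPECULATION_CUES:
        return "speculated"          # speculation, regardless of negation
    if tokens & NEGATION_CUES:
        return "negated"             # negation and no speculation
    return "asserted"                # no negation and no speculation

print(classify_event("The drug does not reduce plaques"))  # negated
print(classify_event("The drug may reduce plaques"))       # speculated
print(classify_event("The drug reduces plaques"))          # asserted
```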
QA4MRE 2012 Pilots
Biomedical domain
- Focus on one disease: Alzheimer's (59,000 Medline abstracts)
- Scientific language
- Give participants the background collection already processed: tokenization, lemmatization, POS tagging, NER, dependency parsing
- Development set
QA4MRE 2012 in summary

Main task:
- Multiple-choice reading comprehension tests
- Same format
- Additional topic: Alzheimer's
- English, German (and maybe Spanish, Italian, Romanian, others)

Two pilots:
- Modality and negation: asserted, negated, speculated
- Biomedical domain, focused on Alzheimer's disease: same format as the main task
Thanks!