
An Analysis of the AskMSR Question-Answering System
Eric Brill, Susan Dumais, and Michelle Banko
Microsoft Research


From Proceedings of the EMNLP Conference, 2002

Goals

Evaluate contributions of components

Explore strategies for predicting when answers are incorrect

AskMSR – What Sets It Apart

Dependency on data redundancy
No sophisticated linguistic analyses
– Of questions
– Of answers

TREC Question Answering Track

Fact-based, short-answer questions
– How many calories are there in a Big Mac? (562, in case you're wondering)
– Who killed Abraham Lincoln?
– How tall is Mount Everest?
Motivation for much of recent work in QA

Other Approaches

POS tagging
Parsing
Named Entity extraction
Semantic relations
Dictionaries
WordNet

AskMSR Approach

Web – "gigantic data repository"
Different from other systems using the web
– Simplicity & efficiency
  – No complex parsing
  – No entity extraction, for queries or for best-matching web pages
  – No local caching

Claim: techniques used in approach to short-answer tasks are more broadly applicable

Some QA Difficulties

Single, small information source
– Likely only 1 answer exists
Source with small # of answer formulations
– Complex relations between Q & A
  – Lexical, syntactic, semantic relations
  – Anaphora, synonymy, alternate syntactic formulations, indirect answers make this difficult

Answer Redundancy

Greater answer redundancy in source
– More likely: a simple relation between Q & A exists
– Less likely: need to deal with difficulties facing NLP systems

System Architecture

Query Reformulation

Rewrite question
– Substring of declarative answer
– Weighted
– "when was the paper clip invented?" → "the paper clip was invented"
Produce less precise rewrites
– Greater chance of matching
– Backoff to simple ANDing of non-stop words

Query Reformulation (cont.)

String-based manipulations
No parser
No POS tagging
Small lexicon for possible POS and morphological variants
Created rewrite rules by hand
Chose associated weights by hand
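A minimal sketch of what such hand-written rewrite rules and the AND backoff could look like. The patterns, weights, and helper names below are illustrative assumptions, not the actual AskMSR rules:

```python
import re

# Hypothetical rewrite rules: (question pattern, declarative rewrite template, weight).
# The real AskMSR rules and weights were hand-crafted; these are illustrative only.
REWRITE_RULES = [
    (re.compile(r"^when was (?P<x>.+) invented\?*$", re.I), '"{x} was invented"', 5),
    (re.compile(r"^who (?P<v>\w+) (?P<x>.+?)\?*$", re.I),   '"{x} was {v} by"',   5),
    (re.compile(r"^how tall is (?P<x>.+?)\?*$", re.I),      '"{x} is"',           2),
]

STOP_WORDS = {"the", "a", "an", "is", "was", "of", "in", "who", "what", "when", "how"}

def reformulate(question: str):
    """Return a list of (query, weight) pairs, most precise first."""
    queries = []
    for pattern, template, weight in REWRITE_RULES:
        match = pattern.match(question.strip())
        if match:
            queries.append((template.format(**match.groupdict()), weight))
    # Backoff: simple ANDing of non-stop words, with the lowest weight.
    content_words = [w for w in re.findall(r"\w+", question.lower()) if w not in STOP_WORDS]
    queries.append((" AND ".join(content_words), 1))
    return queries

print(reformulate("When was the paper clip invented?"))
# [('"the paper clip was invented"', 5), ('paper AND clip AND invented', 1)]
```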

N-gram Mining

Formulate rewrite for search engine
Collect and analyze page summaries
Why use summaries?
– Efficiency
– Contain search terms, plus some context

N-grams collected from summaries

N-gram Mining (cont.)

Extract 1-, 2-, 3-grams from each summary
– Score by weight of the rewrite that retrieved it
Sum scores across all summaries containing the n-gram
– No frequency count within a summary
Final score for an n-gram combines
– Weights associated with rewrite rules
– # of unique summaries it occurs in
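A minimal sketch of this scoring scheme, assuming summaries arrive as plain strings already paired with the weight of the rewrite that retrieved them (variable names are illustrative, not the authors' code):

```python
from collections import defaultdict

def mine_ngrams(summaries, max_n=3):
    """summaries: list of (summary_text, rewrite_weight) pairs.
    Returns {ngram: score}, where each summary contributes its rewrite
    weight once per distinct n-gram (no within-summary frequency)."""
    scores = defaultdict(float)
    for text, weight in summaries:
        tokens = text.lower().split()
        seen = set()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(" ".join(tokens[i:i + n]))
        for ngram in seen:                 # each summary counts once per n-gram
            scores[ngram] += weight
    return dict(scores)

summaries = [
    ("the paper clip was invented in 1899 by johan vaaler", 5),
    ("johan vaaler patented the paper clip in 1899", 5),
    ("history of the paper clip", 1),
]
top = sorted(mine_ngrams(summaries).items(), key=lambda kv: -kv[1])[:5]
print(top)
```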

N-gram Filtering

Use handwritten filter rules
Question type assignment
– e.g. who, what, how
Choose set of filters based on question type
Rescore n-grams based on presence of features relevant to the filters

N-gram Filtering (cont.)

15 simple filters
– Based on human knowledge of
  – Question types
  – Answer domain
– Surface string features
  – Capitalization
  – Digits
  – Handcrafted regular expression patterns
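As a rough illustration of how such filters might be applied: the question-type rules and regular expressions below are assumptions for the sake of the example, not the 15 filters the paper describes.

```python
import re

# Hypothetical surface-string filters keyed by question type.
FILTERS = {
    "how many": lambda ng: bool(re.search(r"\b\d+\b", ng)),               # prefer numbers
    "who":      lambda ng: any(w[:1].isupper() for w in ng.split()),      # prefer capitalized words
    "when":     lambda ng: bool(re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", ng)),  # prefer years
}

def question_type(question):
    q = question.lower()
    for qtype in FILTERS:
        if q.startswith(qtype):
            return qtype
    return None

def rescore(question, ngram_scores, boost=2.0):
    """Boost n-grams that pass the filter for the question's type."""
    qtype = question_type(question)
    if qtype is None:
        return ngram_scores
    passes = FILTERS[qtype]
    return {ng: score * (boost if passes(ng) else 1.0)
            for ng, score in ngram_scores.items()}

scores = {"pool of 16 dogs": 3.0, "alaskan": 4.0, "dog racing": 4.0}
print(rescore("How many dogs pull a sled in the Iditarod?", scores))
# "pool of 16 dogs" is boosted because it contains a number
```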

N-gram Tiling

Merge similar answers
Create longer answers from overlapping smaller answer fragments
– "A B C" + "B C D" → "A B C D"
Greedy algorithm
– Start with top-scoring n-gram, check lower-scoring n-grams for tiling potential
– If an n-gram can be tiled, replace the higher-scoring n-gram with the tiled n-gram and remove the lower-scoring one
– Stop when no more tiling is possible
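A minimal sketch of this greedy tiling loop, reconstructed from the slide's description rather than the authors' code, assuming candidates arrive as (n-gram, score) pairs:

```python
def tile(a, b):
    """If b overlaps a suffix of a, return the merged string, else None.
    Example: tile("a b c", "b c d") -> "a b c d"."""
    ta, tb = a.split(), b.split()
    for k in range(min(len(ta), len(tb)), 0, -1):
        if ta[-k:] == tb[:k]:
            return " ".join(ta + tb[k:])
    return None

def tile_answers(candidates):
    """candidates: list of (ngram, score). Greedily merge, highest score first."""
    answers = sorted(candidates, key=lambda x: -x[1])
    changed = True
    while changed:
        changed = False
        for i in range(len(answers)):
            for j in range(i + 1, len(answers)):
                hi, lo = answers[i], answers[j]
                merged = tile(hi[0], lo[0]) or tile(lo[0], hi[0])
                if merged:
                    # keep the merged n-gram with the higher score, drop the lower one
                    answers[i] = (merged, hi[1])
                    del answers[j]
                    changed = True
                    break
            if changed:
                break
    return answers

print(tile_answers([("san francisco", 10.0), ("francisco bay", 6.0), ("area", 1.0)]))
# [('san francisco bay', 10.0), ('area', 1.0)]
```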

Experiments

First 500 TREC-9 queries
Use scoring patterns provided by NIST
– Modified some patterns to accommodate web answers not in TREC
– More specific answers allowed: Edward J. Smith vs. Edward Smith
– More general answers not allowed: Smith vs. Edward Smith
– Simple substitutions allowed: 9 months vs. nine months
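Scoring against such answer-key patterns amounts to regular-expression matching over the candidate strings. A tiny sketch, with made-up patterns standing in for the (modified) NIST keys:

```python
import re

# Made-up answer patterns standing in for the (modified) NIST answer keys.
ANSWER_KEY = {
    "Who killed Abraham Lincoln?": [r"\bJohn Wilkes Booth\b"],
    "How many calories are there in a Big Mac?": [r"\b5\d\d\b\s*(calories)?"],
}

def matches_key(question: str, answer: str) -> bool:
    """An answer counts as correct if any pattern for the question matches it."""
    return any(re.search(p, answer, re.I) for p in ANSWER_KEY.get(question, []))

print(matches_key("Who killed Abraham Lincoln?", "shot by John Wilkes Booth in 1865"))  # True
print(matches_key("Who killed Abraham Lincoln?", "Booth"))  # False: more general answer
```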

Experiments (cont.)

Time differences between Web & TREC
– "Who is the president of Bolivia?"
– Did NOT modify answer key
– Would make comparison with earlier TREC results impossible (instead of difficult?)

Changes influence absolute scores, not relative performance

Experiments (cont.)

Automatic runs
– Start with queries
– Generate ranked list of 5 answers
Use Google as search engine
– Query-relevant summaries for n-gram mining efficiency
Answers are a maximum of 50 bytes long
– Typically shorter

"Basic" System Performance

Backwards notion of "basic"
– Current system, all modules implemented
– Default settings
Mean Reciprocal Rank (MRR) – 0.507
61% of questions answered correctly
Average answer length – 12 bytes
Impossible to compare precisely with TREC-9 groups, but still very good performance
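For reference, MRR is the mean over questions of 1/rank of the first correct answer (0 if none of the top five is correct). A minimal computation looks like this, with the correctness judge left as an assumption:

```python
def mean_reciprocal_rank(ranked_answers_per_question, is_correct):
    """ranked_answers_per_question: list of answer lists (top 5 per question).
    is_correct(q_index, answer) -> bool is assumed to wrap the answer key."""
    total = 0.0
    for qi, answers in enumerate(ranked_answers_per_question):
        for rank, answer in enumerate(answers, start=1):
            if is_correct(qi, answer):
                total += 1.0 / rank
                break           # only the first correct answer counts
    return total / len(ranked_answers_per_question)

# Toy example: question 0 answered at rank 1, question 1 at rank 3, question 2 missed.
key = [{"1899"}, {"john wilkes booth"}, {"8848 m"}]
runs = [["1899", "1900"], ["lincoln", "booth jr", "john wilkes booth"], ["everest"]]
print(mean_reciprocal_rank(runs, lambda qi, a: a in key[qi]))  # (1 + 1/3 + 0) / 3 ≈ 0.444
```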

Component Contributions

Query Rewrite Contribution

More precise queries – higher weights
All rewrites equal – MRR drops 3.6%
Only backoff AND – MRR drops 11.2%
Rewrites capitalize on web redundancy
Could use more specific regular expression matching

N-gram Filtering Contribution

1-, 2-, 3-grams from 100 best-matching summaries
Filter by question type
– "How many dogs pull a sled in the Iditarod?"
– Question prefers a number
– Run, Alaskan, dog racing, many mush ranked lower than pool of 16 dogs (correct answer)

No filtering – MRR drops 17.9%

N-gram Tiling Contribution

Benefits of tiling
– Substrings take up only 1 answer slot
  – e.g. San, Francisco, San Francisco
– Longer answers can never be found with only tri-grams
  – e.g. "light amplification by [stimulated] emission of radiation"

No tiling – MRR drops 14.2%

Component Combinations

Only weighted sum of occurrences of 1-, 2-, 3-grams – MRR drops 47.5%
Simple statistical system
– No linguistic knowledge or processing
– Only AND queries
– Filtering – no; (statistical) tiling – yes
– MRR drops 33% to 0.338

Component Combinations

Statistical system – good performance?
– Reasonable on an absolute scale?
– One TREC-9 50-byte run performed better
All components contribute to accuracy
– Precise weights of rewrites unimportant
– N-gram tiling – a "poor man's named-entity recognizer"
– Biggest contribution from filters/selection

Component Combinations

Claim: "Because of the effectiveness of our tiling algorithm…we do not need to use any named entity recognition components."
– By having filters with capitalization info (section 2.3, 2nd paragraph), aren't they doing some NE recognition?

Component Problems

Component Problems (cont.)

No correct answer in top 5 hypotheses
23% of errors – not knowing units
– "How fast can Bill's Corvette go?" – mph or km/h?
34% (Time, Correct) – time problems or answer not in TREC-9 answer key
16% from shortcomings in n-gram tiling
Number retrieval (5%) – query limitation

Component Problems (cont.)

12% – beyond current system paradigm
– Can't be fixed with minor enhancements
– Is this really so? Or have they been easy on themselves in error attribution?
9% – no discussion

Knowing When…

Some cost for answering incorrectly
System can choose not to answer instead of giving an incorrect answer
– How likely is it that a hypothesis is correct?

TREC – no distinction between wrong answer and no answer

Deploy real system – trade-off between precision & recall

Knowing When… (cont.)

Answer score is an ad-hoc combination of hand-tuned weights

Is it possible to induce useful precision-recall (ROC) curve when answers don’t have meaningful probabilities?

What is an ROC (Receiver Operating Characteristic) curve?

ROC

From http://www-csli.stanford.edu/~schuetze/roc.html (Hinrich Schütze, co-author of Foundations of Statistical Natural Language Processing)

ROC (cont.)

Determining Likelihood

Ideal – determine likelihood of a correct answer based only on the question
If possible, can skip such questions
Use decision tree based on a set of features from the question string
– 1-, 2-grams, question type
– Sentence length, longest word length
– # capitalized words, # stop words
– Ratio of stop words to non-stop words
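A sketch of what extracting such question features might look like; the exact feature set and the stop-word list here are assumptions, not the hand-chosen set used in the paper:

```python
STOP_WORDS = {"the", "a", "an", "is", "was", "of", "in", "to", "who", "what", "when", "how", "many"}

def question_features(question: str) -> dict:
    """Turn a question string into a flat feature dict for a decision-tree learner."""
    words = question.rstrip("?").split()
    stops = sum(1 for w in words if w.lower() in STOP_WORDS)
    return {
        "qtype": words[0].lower() if words else "",            # who / what / how / ...
        "length": len(words),
        "longest_word": max((len(w) for w in words), default=0),
        "num_capitalized": sum(1 for w in words if w[:1].isupper()),
        "num_stop_words": stops,
        "stop_ratio": stops / max(1, len(words) - stops),
        # unigram / bigram indicator features
        **{f"uni={w.lower()}": 1 for w in words},
        **{f"bi={a.lower()}_{b.lower()}": 1 for a, b in zip(words, words[1:])},
    }

print(question_features("How many dogs pull a sled in the Iditarod?"))
```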

Decision Tree/Diagnostic Tool

Performs worst on "how" questions
Performs best on short "who" questions with many stop words
Induce ROC curve from decision tree
– Sort leaf nodes from highest probability of being correct to lowest
– Gain precision by not answering questions with highest probability of error
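A minimal sketch of inducing such a curve, assuming each question has been assigned its leaf's estimated probability of being answered correctly (all names here are illustrative). The curve swept below is precision vs. recall as the confidence threshold varies, which is how the slides use "ROC":

```python
def precision_recall_curve(items):
    """items: list of (leaf_probability_correct, actually_correct) per question.
    Sweep a confidence threshold: answer only the questions above it."""
    items = sorted(items, key=lambda x: -x[0])     # most confident first
    total_correct = sum(1 for _, ok in items if ok)
    curve, answered, correct = [], 0, 0
    for prob, ok in items:
        answered += 1
        correct += ok
        precision = correct / answered
        recall = correct / total_correct if total_correct else 0.0
        curve.append((recall, precision))
    return curve

# Toy data: (estimated P(correct) from the tree leaf, whether the answer was right)
data = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.3, False)]
for recall, precision in precision_recall_curve(data):
    print(f"recall={recall:.2f}  precision={precision:.2f}")
```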

Decision Tree–Query

Decision Tree–Query Results

Decision tree trained on TREC-9
Tested on TREC-10
Overfits training data – insufficient generalization

Decision Tree–Query Training

Decision Tree–Query Test

Answer Correctness/Score

Ad-hoc score based on
– # of retrieved passages the n-gram occurs in
– weight of rewrite used to retrieve the passage
– which filters apply
– effects of n-gram tiling

Correlation between whether answer appears in top 5 output and…

Correct Answer In Top 5

…and score of system's first-ranked answer
– Correlation coefficient: 0.363
– No time-sensitive questions: 0.401
…and score of first-ranked answer minus second
– Correlation coefficient: 0.270
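The correlation being measured is between a binary variable (correct answer in the top 5) and a real-valued score, i.e., a point-biserial/Pearson correlation. A minimal way to compute it, with made-up data:

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Pearson correlation; with a 0/1 variable this is the point-biserial correlation."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))

# Made-up data: score of the #1 answer, and whether a correct answer was in the top 5.
scores  = [0.91, 0.42, 0.77, 0.15, 0.63, 0.30]
in_top5 = [1,    0,    1,    0,    1,    0]
print(round(correlation(scores, in_top5), 3))
```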

Answer #1 Score – Train

Answer #1 Score – Test

Other Likelihood Indicators

Snippets gathered for each question
– AND queries
– More refined exact string-match rewrites
MRR and snippets
– All snippets from AND: 0.238
– 11 to 100 from non-AND: 0.612
– 100 to 400 from non-AND: 0.628

But wasn’t MRR for “base” system 0.507?

Another Decision Tree

Features of first DT, plus
– Score of #1 answer
– State of system in processing
  – Total # of matching passages
  – # of non-AND matching passages
  – Filters applied
  – Weight of best rewrite rule yielding matching passages
  – Others

Decision Tree–All Features

Decision Tree–All Train

Decision Tree–All Test

Decision Tree–All

Gives useful ROC curve on test data
Outperformed by Answer #1 Score
Though outperformed by a simpler ad-hoc technique, still useful as a diagnostic tool

Conclusions

Novel approach to QA
Careful analysis of contributions of major system components
Analysis of factors behind errors
Approach for learning when the system is likely to answer incorrectly
– Allowing system designers to decide when to trade recall for precision

My Conclusions

Claim: techniques used in approach to short-answer tasks are more broadly applicable

Reality: “We are currently exploring whether these techniques can be extended beyond short answer QA to more complex cases of information access.”

My Conclusions (cont.)

"…we do not need to use any named entity recognition components."
– Filters with capitalization info = NE recognition
12% of errors beyond the system paradigm
– Still wonder: is this really so?
9% of errors – no discussion
Ad-hoc method outperforms the decision tree
– Did they merely do a good job of designing the system, assigning weights, etc.?
– Did they get lucky?