A Proposal for Evaluating Answer Distillation from Web Data Bhaskar Mitra, Grady Simon, Jianfeng Gao, Nick Craswell & Li Deng


Page 1: A Proposal for Evaluating Answer Distillation from Web Data

A Proposal for Evaluating Answer Distillation from Web Data

Bhaskar Mitra, Grady Simon, Jianfeng Gao, Nick Craswell & Li Deng

Page 2: A Proposal for Evaluating Answer Distillation from Web Data

Answer distillation task

Given query and passage containing answer, summarize answer for presentation, on-screen or read-out.

Answers are query-biased summaries:

• Single entities or phrases (e.g., “Rome” for the query “Italy capital”)

• Multi-sentence (e.g., for the query “how to get a passport”)

• Need not be spans from passage

• Might combine multiple passages

Page 3: A Proposal for Evaluating Answer Distillation from Web Data

Query: commissioner of the nba

Source passages: [figure]

Answer? Adam Silver

Page 4: A Proposal for Evaluating Answer Distillation from Web Data
Page 5: A Proposal for Evaluating Answer Distillation from Web Data

http://www.kdnuggets.com/2016/05/datasets-over-algorithms.html

Datasets let algorithms shine

Page 6: A Proposal for Evaluating Answer Distillation from Web Data

Existing QA Datasets

Data (e.g., TREC-QA, MCTest, CBT, WikiQA, SQuAD)

• Questions are not search queries

• Questions are well-formed & curated

• Single entity / phrase answers

• Multiple-choice answers

+

Metric (e.g., P/R, BLEU, METEOR)

• Designed for matching short responses, or

• Poorly correlate with human judgments, or

• Human in the loop (non-repeatable)

Page 7: A Proposal for Evaluating Answer Distillation from Web Data

Our proposal

Data

• Sample queries from Bing logs

• Editorially curated reference answers

• Many reference answers per query

+

Metric

• Phrasing Aware (pa-) metrics

• Modified versions of BLEU / METEOR

Page 8: A Proposal for Evaluating Answer Distillation from Web Data

Towards variance reduction

Use single reference passage set to reduce variance from conflicting information at source

Get many reference answers to model the natural variance in answer phrasing

Extend existing metrics to take better advantage of the large number of available reference answers

Query: law for ages for children allowed to sit in front seat

Passages:

The law requires all children traveling in the front or rear seat of any car, van or goods vehicle must use the correct child car seat until they are either 135cm in height or 12 years old (which ever they reach first). After this they must use an adult seat belt. There are very few exceptions.

Distilled answers:

• Children under the age of 12 and less than 135cm tall need a child car seat when traveling in the front or the rear seat of a car.

• Children of any age can travel in the front or the rear seat of a car. They need a child seat if under the age of 12.

• Children under the age of 12 need a child seat, unless more than 135cm tall.

…

• A child seat is necessary for children under 12. Otherwise an adult seat belt must be worn.

Page 9: A Proposal for Evaluating Answer Distillation from Web Data

Generating the dataset

Sample queries

• Randomly sample from Bing logs

• Remove PII

• Remove navigational, transactional queries

• Remove queries with no deterministic answers (e.g., “holiday recipes”)

Retrieve candidate passages

• Retrieve top-N candidate passages per query

• Typically retrieved from many different documents

Select minimal passage set

• Editors select the minimal but sufficient passage set

• If multiple passages are selected then information across passages should not conflict

Curate reference answers

• Editors curate minimal but complete answer for each query

• Answers can be single entity or phrase, or multi-sentence passage
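The four steps above produce, for each query, a passage set and a pool of reference answers. The proposal does not specify a storage format, so the following is only a minimal sketch of what one record might look like. The class name `DistillationExample`, its field names, and the `problems` check are hypothetical and chosen for illustration; the example values are taken from the child car seat slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DistillationExample:
    """One dataset record (hypothetical layout, not defined by the proposal)."""
    query: str                      # sampled search query
    passages: List[str]             # minimal but sufficient passage set
    reference_answers: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means the record looks complete."""
        issues = []
        if not self.query.strip():
            issues.append("empty query")
        if not self.passages:
            issues.append("no passages selected")
        if not self.reference_answers:
            issues.append("no reference answers")
        elif any(not answer.strip() for answer in self.reference_answers):
            issues.append("blank reference answer")
        return issues


example = DistillationExample(
    query="law for ages for children allowed to sit in front seat",
    passages=[
        "The law requires all children traveling in the front or rear seat "
        "of any car, van or goods vehicle must use the correct child car seat "
        "until they are either 135cm in height or 12 years old.",
    ],
    reference_answers=[
        "Children under the age of 12 need a child seat, unless more than 135cm tall.",
        "A child seat is necessary for children under 12. "
        "Otherwise an adult seat belt must be worn.",
    ],
)
print(example.problems())
```

Keeping the reference answers as a list reflects the goal of collecting many answers per query; a check such as `problems` could flag records that editors have not finished.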

Page 10: A Proposal for Evaluating Answer Distillation from Web Data

Phrasing Aware Metrics

Score candidate answer based on average similarity with all available reference answers

Each reference answer is importance weighted based on agreement with other reference answers

Metrics like BLEU (or METEOR) can be used as similarity metric
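The slide does not give an exact formula, so the following is only one possible reading of the idea: weight each reference answer by its mean similarity to the other references, normalize the weights, and score a candidate as the weighted average of its similarity to each reference. Everything in this sketch is an assumption for illustration: the function names are hypothetical, and a simple unigram F1 stands in for BLEU or METEOR to keep the example self-contained.

```python
from collections import Counter
from typing import Callable, List

Similarity = Callable[[str, str], float]


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1; an assumed stand-in for BLEU / METEOR."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def reference_weights(references: List[str], sim: Similarity) -> List[float]:
    """Weight each reference by its mean agreement with the other references."""
    n = len(references)
    if n == 1:
        return [1.0]
    raw = []
    for i, ref in enumerate(references):
        others = [sim(ref, other) for j, other in enumerate(references) if j != i]
        raw.append(sum(others) / len(others))
    total = sum(raw)
    if total == 0:
        # No reference agrees with any other: fall back to uniform weights.
        return [1.0 / n] * n
    return [w / total for w in raw]


def pa_score(candidate: str, references: List[str],
             sim: Similarity = unigram_f1) -> float:
    """Importance-weighted average similarity of a candidate to all references."""
    if not references:
        raise ValueError("at least one reference answer is required")
    weights = reference_weights(references, sim)
    return sum(w * sim(candidate, ref) for w, ref in zip(weights, references))


refs = ["adam silver", "adam silver", "david stern"]
print(pa_score("adam silver", refs))
print(pa_score("david stern", refs))
```

In this toy pool the outlier reference agrees with no other reference and so receives zero weight, which is the intended effect of importance weighting: a candidate matching the consensus phrasing scores higher than one matching an isolated answer. Any sentence-level similarity function with the same signature could be passed as `sim`.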

Page 11: A Proposal for Evaluating Answer Distillation from Web Data
Page 12: A Proposal for Evaluating Answer Distillation from Web Data

Request For Comments

We want to make the proposed Answer Distillation dataset and corresponding metrics publicly available for academic research

We need YOUR feedback to build the right evaluation framework

https://gitter.im/ProjectDistillery/Distillery