Entity-Driven Desiderata

Computational Linguistics: Jordan Boyd-Graber
University of Maryland
COREFERENCE

Adapted from slides by Phil Blunsom

Source: users.umiacs.umd.edu/~jbg/teaching/CMSC_723/15c_qa.pdf

Machine Reading Framework

CNN/Daily Mail Datasets

Good

We aimed to factor out world knowledge through entity anonymisation so that models could not rely on correlations rather than understanding.

Bad

The generation process and entity anonymisation reduced the task to multiple choice and introduced additional noise.

Good

Posing reading comprehension as a large-scale conditional modelling task made it accessible to machine learning researchers, generating a great deal of subsequent research.

Bad

While this approach is reasonable for building applications, it is entirely the wrong way to develop and evaluate natural language understanding.
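
The entity anonymisation discussed for this dataset can be sketched roughly as follows. This is a minimal illustration, not the dataset's actual pipeline: the `anonymise` function, the hand-supplied entity list, and the `@entityN` marker format are assumptions made for the example.

```python
import re


def anonymise(passage, question, entities):
    """Replace each listed entity string with an abstract marker.

    `entities` is a hypothetical, hand-supplied list of surface strings;
    a real pipeline would obtain it from an entity or coreference system.
    """
    if not entities:
        return passage, question, {}
    # Longer names first, so a short name never pre-empts a longer one.
    ordered = sorted(entities, key=len, reverse=True)
    mapping = {name: f"@entity{i}" for i, name in enumerate(ordered)}
    pattern = re.compile("|".join(re.escape(name) for name in ordered))

    def replace(text):
        return pattern.sub(lambda m: mapping[m.group(0)], text)

    return replace(passage), replace(question), mapping


passage = "Ada Lovelace wrote notes on a machine designed by Charles Babbage."
question = "Who designed the machine described by Ada Lovelace?"
anon_passage, anon_question, mapping = anonymise(
    passage, question, ["Ada Lovelace", "Charles Babbage"]
)
print(anon_passage)
print(anon_question)
```

Once names are replaced by markers, a model can no longer answer from prior knowledge about the people involved. The answer also becomes a choice among the markers present in the passage, which is the multiple-choice reduction listed under "Bad".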

Narrative QA

Good

A challenging evaluation that tests a range of language understanding, particularly temporal aspects of narrative, and also scalability, as current models cannot represent and reason over full narratives.

Bad

Performing well on this task is clearly well beyond current models, both representationally and computationally. As such, it will be hard for researchers to hill-climb on this evaluation.

Good

The relatively small number of narratives for training models forces researchers to approach this task from a transfer learning perspective.

Bad

The relatively small number of narratives means that this dataset is not of immediate use for those wanting to build supervised models for applications.

MS Marco

Questions are mined from a search engine and matched with candidate answer passages using IR techniques.
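
As a rough illustration of matching a query to candidate passages, the sketch below ranks passages by a simple TF-IDF overlap score. The slide says only "IR techniques", so the scoring function here is a stand-in chosen for the example, and `tokenize` and `rank_passages` are hypothetical helper names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_passages(query, passages):
    """Return passage indices ordered from best to worst match."""
    docs = [Counter(tokenize(p)) for p in passages]
    n = len(docs)

    def idf(term):
        # Smoothed inverse document frequency over the candidate pool.
        df = sum(1 for d in docs if term in d)
        return math.log((n + 1) / (df + 1)) + 1.0

    query_terms = set(tokenize(query))
    scores = [sum(d[t] * idf(t) for t in query_terms) for d in docs]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


passages = [
    "The capital of France is Paris.",
    "Bananas are rich in potassium.",
    "Paris hosts the Louvre museum.",
]
ranking = rank_passages("what is the capital of france", passages)
print([passages[i] for i in ranking])
```

A retriever of this kind rewards lexical overlap with the query, which is one way to see why IR-selected passages may favour questions answerable by surface matching.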

Good

The reliance on real queries creates a much more useful resource for those interested in applications.

Bad

People rarely ask interesting questions of search engines, and the use of IR techniques to collect candidate passages limits the usefulness of this dataset for evaluating language understanding.

Good

Unrestricted answers allow a greater range of questions.

Bad

How to evaluate freeform answers is an unsolved problem. BLEU is not the answer!
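
One way to see why n-gram overlap struggles with freeform answers is the toy comparison below. It computes clipped unigram precision, which is only the simplest ingredient of BLEU-style metrics and not BLEU itself; the function name and example sentences are made up for illustration.

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Fraction of candidate tokens also found in the reference (clipped)."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Clip each token's credit at the number of times the reference has it.
    matched = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return matched / len(cand_tokens)


reference = "he died in 1955"
correct_paraphrase = "his death came in 1955"  # right answer, different wording
wrong_answer = "he died in 1965"               # wrong answer, same wording

print(unigram_precision(correct_paraphrase, reference))
print(unigram_precision(wrong_answer, reference))
```

Under this measure the factually wrong answer scores higher than the correct paraphrase, because the score tracks shared wording and not whether the answer is right.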

SQuAD

Good

Very scalable annotation process that can cheaply generate large numbers of questions per article.

Bad

Annotating questions directly from the context passages strongly skews the data distribution. The task then becomes reverse engineering the annotators, rather than language understanding.

Good

The online leaderboard allows easy benchmarking of systems and motivates competition.

Bad

Restricting answers to spans reduces the task to multiple choice, and doesn't allow questions with answers latent in the text.

Good

Human upper bound sets a reasonable goal.

Bad

Allows mischaracterization of what it means to “read”.

Quiz Bowl

Good

Free data from experts

Bad

Sometimes can be trivially solved with pattern matching

Good

Based on already known knowledge

Bad

Not tied to readable data

Good

Human comparison makes sense

Bad

More cumbersome computer evaluation

Possible Projects

• Improve selection of answer spans

• Improve IR search for context

• Visualize reader spans

• Domain adaptation (Wikipedia/Questions/Books)
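
For the first project, a baseline to improve on might look like the sketch below: given per-token start and end scores from some reader model, pick the highest-scoring span subject to a length limit. This is a generic decoding step, not a method taken from these slides; `best_span`, `max_len`, and the scores are hypothetical.

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return ((start, end), score) maximising start_scores[i] + end_scores[j].

    Spans satisfy i <= j and cover at most `max_len` tokens.
    """
    best = (0, 0)
    best_score = float("-inf")
    for i, start in enumerate(start_scores):
        # Only consider end positions at or after the start, within max_len.
        for j in range(i, min(i + max_len, len(end_scores))):
            score = start + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best, best_score


tokens = ["the", "capital", "is", "paris", "france"]
start_scores = [0.0, 0.5, 0.0, 2.0, 0.5]  # made-up model outputs
end_scores = [0.0, 0.0, 0.5, 1.0, 1.5]

(span_start, span_end), _ = best_span(start_scores, end_scores)
print(tokens[span_start:span_end + 1])
```

Possible improvements include scoring spans jointly instead of summing independent start and end scores, or tuning the length limit per question type.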