Entity-Driven Desideratausers.umiacs.umd.edu/~jbg/teaching/CMSC_723/15c_qa.pdf · Entity-Driven Desiderata Computational Linguistics: Jordan Boyd-Graber University of Maryland COREFERENCE

Entity-Driven Desiderata

Computational Linguistics: Jordan Boyd-GraberUniversity of MarylandCOREFERENCE

Adapted from slides by Phil Blunsom

Computational Linguistics: Jordan Boyd-Graber | UMD Entity-Driven Desiderata | 1 / 8

Machine Reading Framework


CNN/Daily Mail Datasets





Good

we aimed to factor out worldknowledge through entityanonymisation so models could notrely on correlations rather thanunderstanding.

Bad

The generation process and entityanonymisation reduced the task tomultiple choice and introducedadditional noise.



Good

posing reading comprehension as alarge scale conditional modelling taskmade it accessible to machinelearning researchers, generating agreat deal of subsequent research.

Bad

while this approach is reasonable forbuilding applications, it is entirely thewrong way to develop and evaluatenatural language understanding.


Narrative QA


Narrative QA

Good

A challenging evaluation that tests arange of language understanding,particularly temporal aspects ofnarrative, and also scalability ascurrent models cannot represent andreason over full narratives

Bad

Performing well on this task is clearlywell beyond current models, bothrepresentationally andcomputationally. As such it will behard for researchers to hill climb onthis evaluation.


Narrative QA

Good

The relatively small number ofnarratives for training models forcesresearchers to approach this taskfrom a transfer learning perspective.

Bad

The relatively small number ofnarratives means that this dataset isnot of immediate use for thosewanting to build supervised modelsfor applications.


MS Marco

Questions are mined from a search engine and matched with candidateanswer passages using IR techniques.


MS Marco

Good

The reliance on real queries createsa much more useful resource forthose interested in applications.

Bad

People rarely ask interestingquestions of search engines, and theuse of IR techniques to collectcandidate passages limits theusefulness of this dataset forevaluating language understanding.


MS Marco

Good

Unrestricted answers allow a greaterrange of questions.

Bad

How to evaluate freeform answers isan unsolved problem. Bleu is not theanswer!


SQuAD


SQuAD


SQuAD

Good

Very scalable annotation process thatcan cheaply generate large numbersof questions per article.

Bad

Annotating questions directly fromthe context passages strongly skewsthe data distribution. The task thenbecomes reverse engineering theannotators, rather than languageunderstanding.


SQuAD

Good

The online leaderboard allows easybenchmarking of systems andmotivates competition.

Bad

Answers as spans reduces the taskto multiple choice, and doesnâAZtallow questions with answers latent inthe text.


SQuAD

Good

Human upperbound sets reasonablegoal.

Bad

Allows mischaracterization of what itmeans to “read”.


Quiz Bowl


Quiz Bowl

Good

Free data from experts

Bad

Sometimes can be trivially solvedwith pattern matching


Quiz Bowl

Good

Based on already known knowledge

Bad

Not tied to readable data


Quiz Bowl

Good

Human comparison makes sense

Bad

More cumbersome computerevaluation


Possible Projects

� Improve selection of answer spans

� Improve IR search for context

� Visualizing reader spans

� Domain adaptation (Wikipedia/Questions/Books)


Documents

Entity-Driven Desideratausers.umiacs.umd.edu/~jbg/teaching/CMSC_723/15c_qa.pdf · Entity-Driven Desiderata Computational Linguistics: Jordan Boyd-Graber University of Maryland COREFERENCE