
Page 1:

An Evaluation Competition? Eight Reasons to be Cautious

Donia Scott

Open University

&

Johanna Moore

University of Edinburgh

Page 2:

1. All that Glitters is not Gold
- Evaluation requires a gold standard, i.e., clearly specified input/output pairs
- Does this make sense for NLG?
  - For most NLG tasks, there is no one right answer (Walker, LREC 2005)
  - Any output that allows the user to successfully perform the task is acceptable
  - “Using human outputs assumes they are properly geared to the larger purpose the outputs are meant for.” (KSJ, p.c.)

Page 3:

2. What’s good for the goose…
- Most important criterion is “fitness for purpose”
- Can’t compare output of systems designed for different purposes
- NLG systems (unlike parsing and MT?) serve a wide range of functions

Page 4:

3. Don’t count on metrics
- Summarization and MT communities are questioning the usefulness of their shared metrics
  - BLEU does not correlate with human judgements of translation quality (Callison-Burch et al., EACL 2006); a toy illustration follows below
  - BLEU should only be used to compare versions of the same system (Knight, EACL 2006 invited talk)
- Will nuggets of pyramids topple over?
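The point about reference-based metrics can be made concrete. The sketch below is not from the slides: it assumes NLTK's sentence_bleu, and the sentences are invented. It shows that an acceptable paraphrase can score far below a verbatim match against a single reference, which is exactly the “no one right answer” problem.

```python
# Toy illustration (sentences invented): two outputs a human might judge equally
# acceptable can get very different BLEU scores against a single reference,
# because BLEU rewards n-gram overlap rather than adequacy or fitness for purpose.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference   = "heavy rain is expected in the north tomorrow".split()
candidate_a = "heavy rain is expected in the north tomorrow".split()  # verbatim match
candidate_b = "tomorrow the north will see a lot of rain".split()     # acceptable paraphrase

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], candidate_a, smoothing_function=smooth))  # 1.0
print(sentence_bleu([reference], candidate_b, smoothing_function=smooth))  # far lower, despite similar adequacy
```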

Page 5:

4. What’s the input?
- There is no agreed input for any stage of the NLG pipeline
- Or even agreement on where the NLG problem starts, e.g., weather report generation
  - Is the input raw weather data, or significant events determined by a weather analysis program? (both alternatives are sketched below)
  - Weather forecasting is not part of the NLG problem!
  - But the quality of the text depends on the quality of the data analysis!
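As a hypothetical illustration (the field names and values below are invented), these are the two candidate “inputs” the slide contrasts: raw weather data versus significant events already extracted by an upstream analysis step.

```python
# Hypothetical sketch (field names and values invented): the two candidate "inputs"
# to a weather-report generator. Whether evaluation starts from the raw data or from
# the pre-analysed events changes what is actually being measured.
raw_weather_data = {
    "station": "EDI-03",
    "hourly_wind_speed_kt": [12, 14, 18, 27, 33, 31, 22],
    "hourly_precip_mm": [0.0, 0.2, 1.4, 6.8, 9.1, 4.0, 0.5],
}

significant_events = [
    {"event": "wind_increase", "peak_kt": 33, "period": "afternoon"},
    {"event": "heavy_rain", "total_mm": 22.0, "period": "afternoon"},
]
```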

Page 6:

5. What to standardize/evaluate?
- Realization (for example)
  - Should the input contain rhetorical goals (à la Hovy)? Information structure?
  - Should the output contain prosodic markup?

Page 7:

6. Plug and play delusion
- Requires agreeing on interfaces at each stage of the pipeline
- Not, “it’s gonna be XML”
- Must define representations to be passed and their semantics (à la RAGS)
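A minimal sketch of what “defining representations and their semantics” could look like in practice; this is not the RAGS formalism itself, and the class and field names are invented for illustration.

```python
# Hypothetical sketch (not the RAGS formalism): an explicit, typed representation
# passed from the microplanner to the realiser, with the semantics of each field
# documented, rather than an unconstrained "it's gonna be XML" blob.
from dataclasses import dataclass
from typing import Dict, Literal

@dataclass
class MessageSpec:
    """One message to be realised as a single clause."""
    predicate: str                                # e.g. "rain"
    tense: Literal["past", "present", "future"]   # grammatical tense for the clause
    arguments: Dict[str, str]                     # semantic role -> entity id, e.g. {"location": "north"}
    information_status: Literal["given", "new"]   # drives theme/rheme and referring-expression choices
```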

Page 8:

7. Who will pay the piper?
- It’s pretty clear why DARPA pays for ASR, MT, Summarization, TDT, TREC, etc.
- What’s the killer app for NLG?
- The fact that NSF is holding this workshop and consulting the research community is a very good sign

Page 9:

8. Stifling Science
- To push this forward, we have to agree on the input (and interfaces)
- Whatever we agree on will limit the phenomena we study and the theories we can test (e.g., SPUD)
- Hard to find a task that allows study of all the phenomena the community is interested in (e.g., MapTask)

Page 10:

What are we evaluating?
- Is the generated text (or speech):
  - Grammatical?
  - Natural?
  - Easy to comprehend?
  - Memorable?
  - Suitable to enable the user to achieve their intended purpose?

Page 11:

Recommendations
- Must be clear about who is going to learn what from the (very large) effort
- The task chosen must:
  - be realistic, i.e., reflect how effectively the generated text (or speech) enables the user to achieve their purpose
  - inform NLG research, i.e., help us learn things that enable the development of better systems

Page 12:

Thank You!