
Page 1:

An Evaluation Competition? Eight Reasons to be Cautious

Donia Scott

Open University

&

Johanna Moore

University of Edinburgh

Page 2:

1. All that Glitters is not Gold
- Evaluation requires a gold standard, i.e., clearly specified input/output pairs
- Does this make sense for NLG?
  - For most NLG tasks, there is no one right answer (Walker, LREC 2005)
  - Any output that allows the user to successfully perform the task is acceptable
  - “Using human outputs assumes they are properly geared to the larger purpose the outputs are meant for.” (KSJ, p.c.)

Page 3:

2. What’s good for the goose…
- Most important criterion is “fitness for purpose”
- Can’t compare output of systems designed for different purposes
- NLG systems (unlike parsing and MT?) serve a wide range of functions

Page 4:

3. Don’t count on metrics
- Summarization and MT communities are questioning the usefulness of their shared metrics
  - BLEU does not correlate with human judgements of translation quality (Callison-Burch et al., EACL 2006); a toy illustration follows below
  - BLEU should only be used to compare versions of the same system (Knight, EACL 2006 invited talk)
- Will nuggets of pyramids topple over?
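The point about reference-based metrics can be made concrete. The sketch below is not from the slides: it assumes NLTK's sentence_bleu, and the sentences are invented. It shows that an acceptable paraphrase can score far below a verbatim match against a single reference, which is exactly the “no one right answer” problem.

```python
# Toy illustration (sentences invented): two outputs a human might judge equally
# acceptable can get very different BLEU scores against a single reference,
# because BLEU rewards n-gram overlap rather than adequacy or fitness for purpose.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference   = "heavy rain is expected in the north tomorrow".split()
candidate_a = "heavy rain is expected in the north tomorrow".split()  # verbatim match
candidate_b = "tomorrow the north will see a lot of rain".split()     # acceptable paraphrase

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], candidate_a, smoothing_function=smooth))  # 1.0
print(sentence_bleu([reference], candidate_b, smoothing_function=smooth))  # far lower, despite similar adequacy
```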

Page 5:

4. What’s the input?
- There is no agreed input for any stage of the NLG pipeline
- Or even agreement on where the NLG problem starts, e.g., weather report generation
  - Is the input raw weather data, or significant events determined by a weather analysis program? (both alternatives are sketched below)
  - Weather forecasting is not part of the NLG problem!
  - But the quality of the text depends on the quality of the data analysis!
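As a hypothetical illustration (the field names and values below are invented), these are the two candidate “inputs” the slide contrasts: raw weather data versus significant events already extracted by an upstream analysis step.

```python
# Hypothetical sketch (field names and values invented): the two candidate "inputs"
# to a weather-report generator. Whether evaluation starts from the raw data or from
# the pre-analysed events changes what is actually being measured.
raw_weather_data = {
    "station": "EDI-03",
    "hourly_wind_speed_kt": [12, 14, 18, 27, 33, 31, 22],
    "hourly_precip_mm": [0.0, 0.2, 1.4, 6.8, 9.1, 4.0, 0.5],
}

significant_events = [
    {"event": "wind_increase", "peak_kt": 33, "period": "afternoon"},
    {"event": "heavy_rain", "total_mm": 22.0, "period": "afternoon"},
]
```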

Page 6:

5. What to standardize/evaluate?
- Realization (for example)
  - Should the input contain rhetorical goals (à la Hovy)? Information structure?
  - Should the output contain prosodic markup?

Page 7:

6. Plug and play delusion
- Requires agreeing on interfaces at each stage of the pipeline
- Not, “it’s gonna be XML”
- Must define representations to be passed and their semantics (à la RAGS)
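A minimal sketch of what “defining representations and their semantics” could look like in practice; this is not the RAGS formalism itself, and the class and field names are invented for illustration.

```python
# Hypothetical sketch (not the RAGS formalism): an explicit, typed representation
# passed from the microplanner to the realiser, with the semantics of each field
# documented, rather than an unconstrained "it's gonna be XML" blob.
from dataclasses import dataclass
from typing import Dict, Literal

@dataclass
class MessageSpec:
    """One message to be realised as a single clause."""
    predicate: str                                # e.g. "rain"
    tense: Literal["past", "present", "future"]   # grammatical tense for the clause
    arguments: Dict[str, str]                     # semantic role -> entity id, e.g. {"location": "north"}
    information_status: Literal["given", "new"]   # drives theme/rheme and referring-expression choices
```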

Page 8:

7. Who will pay the piper?
- It’s pretty clear why DARPA pays for ASR, MT, Summarization, TDT, TREC, etc.
- What’s the killer app for NLG?
- The fact that NSF is holding this workshop and consulting the research community is a very good sign

Page 9:

8. Stifling Science
- To push this forward, we have to agree on the input (and interfaces)
- Whatever we agree on will limit the phenomena we study and the theories we can test (e.g., SPUD)
- Hard to find a task that allows study of all the phenomena the community is interested in (e.g., MapTask)

Page 10:

What are we evaluating?
- Is the generated text (or speech):
  - Grammatical?
  - Natural?
  - Easy to comprehend?
  - Memorable?
  - Suitable to enable the user to achieve their intended purpose?

Page 11:

Recommendations
- Must be clear about who is going to learn what from the (very large) effort
- The task chosen must:
  - be realistic, i.e., reflect how effectively the generated text (or speech) enables the user to achieve their purpose
  - inform NLG research, i.e., help us learn things that enable the development of better systems

Page 12:

Thank You!