
Cécile Paris, Nathalie Colineau and Ross Wilkinson

CSIRO ICT Centre

Sydney, Australia

NLG Systems Evaluation: Establishing the Big Picture

www.csiro.au


What have we learnt from the shared-task approach of our siblings (e.g., IR)?

Advantages
• Some algorithm & system comparability tests (e.g., inverse frequency works, length normalisation does not)
• Some shared resources

(Recognised) Disadvantages
• It will only tell you some of the things one needs to know: important elements will be missed
• It does not allow the community to answer some important theoretical and practical questions
• Too narrow

Note: No belief that there is a perfect system
• There are some standards, but not a gold standard


Some beliefs?

1. Subtasks and input/output requirements need to be standardised to make core technologies truly comparable.
   What needs to be evaluated is an approach with its characteristics and its application context.

2. To evaluate systems/approaches we need to compare them in a shared task.
   Comparison is not a requirement for evaluation, nor is a shared task a requirement for comparison.

3. Quality of systems equates to quality of output (i.e., English text).
   A system cannot be reduced to its output; there are other attributes.

4. There has to be a gold standard.
   One measure cannot account for everything (even if we were to look at the quality of the output only).

5. One NLG technique works better than another.
   Vive la différence! There is no such thing as “one size fits all”; typically there are pros and cons.


We can compare apples and pears

We do it every day for many things (e.g., usefulness of comparisons as found in consumer reports)

Comparison:
• Does not require exact similarity
• But focuses on a set of characteristics/attributes

Depending on situations and needs, different characteristics are required or favoured over others: fitness for purpose

We propose a framework in which to describe characteristics of NLG systems, modules or approaches


Example: Buying a car

Hard constraint: must be between $30,000 and $40,000

Convertible: RRP $31,990
Manual, Convertible, 2 doors, 4 seats, 82 kW, 1.60 L
Origin: Spain, 2004
• General: Convertible
• Safety: Front side airbag; Brake assist system; Convertible rollover protection; Rain sensor windshield wipers
• Dimensions: 277 mm lower; 600 mm smaller curb-to-curb turning circle
• Engine: Clutchless manual gearbox

4WD: RRP $36,490
Manual, 4WD, 5 doors, 7 seats, 145 kW, 3.50 L
Origin: Korea, 2007
• General: 4 wheel drive; Larger seating capacity
• Safety: Rear window wipers
• Dimensions: 766 mm longer & 160 mm wider
• Engine: 1.9 l larger engine; 2.1 faster acceleration 0-100 km/h; 28 l larger fuel tank

Set of attributes that characterise a car:
• General
• Safety
• Dimensions
• Engine
• Etc.


How can we compare (and choose)?

Depends on the criteria of a person (or of a situation)

Robert’s Priorities:
• Sports car
• Size: wants a smaller car
• Safety important

Bill’s Priorities:
• Needs 4 wheel drive for camping trips
• Seating capacity: large

Does this mean that one car is better than another? No

Comparison and evaluation in the abstract are not necessarily meaningful

What is required is a way (a framework) to describe characteristics
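The choice between the two cars can be sketched in a few lines of Python. This is only an illustration: the `Car` class, the helper names and the attribute tags are hypothetical simplifications of the slide content; only the prices and the $30,000 to $40,000 constraint come from the example.

```python
from dataclasses import dataclass, field


@dataclass
class Car:
    name: str                 # hypothetical label; the slides do not name the cars
    price: int                # RRP in dollars, as given in the example
    attributes: set = field(default_factory=set)  # simplified attribute tags (assumed)


def candidates(cars, low, high):
    """Apply the hard constraint: keep cars priced within [low, high]."""
    return [car for car in cars if low <= car.price <= high]


def best_fit(cars, priorities):
    """Pick the car that matches the largest number of a person's priorities."""
    return max(cars, key=lambda car: len(car.attributes & priorities))


convertible = Car("convertible", 31_990, {"sports car", "smaller", "safety features"})
four_wd = Car("4WD", 36_490, {"4 wheel drive", "large seating capacity"})

shortlist = candidates([convertible, four_wd], 30_000, 40_000)

robert = {"sports car", "smaller", "safety features"}
bill = {"4 wheel drive", "large seating capacity"}

print(best_fit(shortlist, robert).name)  # convertible
print(best_fit(shortlist, bill).name)    # 4WD
```

Both cars pass the hard constraint; which one is "best" changes only with the priorities passed in, not with the cars themselves.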


Can we apply these ideas to generation systems (or modules)?

Example: Generating Referring Expressions (GRE)

Input:
• Type (e.g., numerical, semantic)

Output:
• Type (e.g., English, logical form)
• “Quality”
• Number of expressions generated

Fitness into other modules:
• Place into overall NLG architecture (e.g., requires a text planning or a grammar component)

Configuration:
• Availability of parameters to fine-tune (e.g., user model, domain model)

General:
• Execution time


An example: Comparison of GRE components

Hard constraint: need referring expressions in English

System X GRE module
English Language
Origin: Lab X
• Input: Type: numerical
• Output: Type: English
• Output: Quality: has been shown to allow people to select specific objects in a landscape
• Fitness into other modules: Place into overall NLG architecture: requires a text planning component; no additional lexico-grammatical component needed
• Configuration: Parameters to fine-tune: yes, user model
• Configuration: Requirements: creation of user model

System Y GRE module
English Language
Origin: University Y
• Input: Type: knowledge base
• Output: Type: logical form
• Output: Quality: produces appropriate input to a functional grammar
• Fitness into other modules: Place into overall NLG architecture: requires a text planning component; requires a functional grammar for realisation
• Configuration: Parameters to fine-tune: no


Possible situations/criteria

My situation (→ System X):
• My input is numerical data
• I need parameters to fine-tune

Your situation (→ System Y):
• You have a domain model available
• You already have a grammar component
• You need a GRE to “plug in”

Different systems/approaches will be appropriate

(A similar debate has taken place for template vs planning: no “best” method; it depends on what one needs to do)
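Matching a situation to a module description can be sketched as follows. This is a minimal sketch, not part of the original slides: `ModuleDescription` and `matching` are hypothetical names, and encoding "a domain model and a grammar component are available" as a need for knowledge-base input and logical-form output is an assumption made for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleDescription:
    name: str
    input_type: str
    output_type: str
    requires: frozenset   # other components the module needs
    tunable: bool         # are there parameters to fine-tune?


# Descriptions transcribed from the System X / System Y comparison.
MODULES = [
    ModuleDescription("System X", "numerical", "English",
                      frozenset({"text planning"}), True),
    ModuleDescription("System Y", "knowledge base", "logical form",
                      frozenset({"text planning", "functional grammar"}), False),
]


def matching(modules, **required):
    """Return names of modules whose described characteristics equal every requirement."""
    return [m.name for m in modules
            if all(getattr(m, key) == value for key, value in required.items())]


# "My situation": numerical input, parameters to fine-tune.
print(matching(MODULES, input_type="numerical", tunable=True))
# -> ['System X']

# "Your situation" (assumed encoding): knowledge-base input, logical form for an existing grammar.
print(matching(MODULES, input_type="knowledge base", output_type="logical form"))
# -> ['System Y']
```

Neither module is ranked above the other; the descriptions stay fixed and only the requirements change.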


What we need to develop/agree upon

Comprehensive set of characteristics that describe and specify NLG components and systems

How to measure them? (when they need to be measured)
• Might be qualitative or quantitative
• Might not be a gold standard
• Might depend on the characteristics (e.g., different measures for fluency, task effectiveness, user satisfaction or cost/ease of building a system)


A framework for evaluation

Inspired by other work, looking beyond our immediate “siblings”, e.g.:

Information systems
• DeLone and McLean 92
• Cornford et al. 94

ISO 9126

UM (effectiveness)


Need for a more general framework for evaluation

• Enlarge the view of evaluation
• Ensure we have a big picture (avoid dangers of local view)
• Organise the possible criteria/ways to think about the questions to ask
• Guides the experimental work
• Consider NLG in its context and that of its stakeholders
• Consider costs and benefits
• Allows one to choose system/module best fit for purpose
• Allows for specific evaluation tasks, placing them and their results into a larger context


A proposed framework

Perspectives: System functions, Human perspectives, Organisational context

Structure
• System functions: System requirements (hardware and resources required); Architecture details (flexibility, extensibility, maintainability)
• Human perspectives: Skills required
• Organisational context: Allocation of resources (staff & money)

Process
• System functions: Efficiency (in using resources)
• Human perspectives: Impact on work (time & workflow)
• Organisational context: Impact on Process & Activities

Outcomes
• System functions: Accuracy; Reliability; System Usage
• Human perspectives: Task performance; User satisfaction
• Organisational context: Productivity Gains; Profits; Effectiveness


Refining the characteristics (with our work)

Information Consumer (end user)
• Benefits (Outcomes): Task Effectiveness; Knowledge Gain; Satisfaction
• Costs (Process): Cognitive Load; Learning Time

Information Provider
• Benefits (Outcomes): Audience Reach; Audience Accuracy; Message Accuracy
• Costs (Process): Metadata Provision; Structured Information; Currency of Data

Information Intermediaries
• Benefits (Process): Ease (cost) of Knowledge Creation
• Costs (Process): Time and Effort for knowledge creation and integration

System Providers
• Benefits (Outcomes): System Usage; Reliability; Response Time; Correctness (for purpose)
• Costs (Structure): Implementation Cost; System Integration; System Maintenance


Using the framework to define characteristics -- GRE



Structure
• System requirements: Input; Other modules required
• Configuration: Availability of parameters
• Intermediaries: Time & Skills required to build resources; Time & Skills to integrate into overall NLG architecture
• (probably not applicable as it is only 1 module in a larger system)

Process
• Execution Time

Outcomes
• Output type
• Accuracy: always produces output
• Reliability: graceful degradation?
• Consumer (if output is English): Task performance; User satisfaction

What does this allow?



Structure
• System requirements: Input; Other modules required
• Configuration: Availability of parameters

Process
• Execution Time

Outcomes
• Output type
• Consumer (if output is English): Task performance; User satisfaction

Choice: Given a requirement, choose a system with characteristics that fit the environment

Comparison & Evaluation: Given a system/module for specific requirements, evaluation with other systems can be done for a specific characteristic (e.g., user satisfaction, task completion, ease of building required input)

New attributes, guided by the framework
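Comparison on one specific characteristic can be sketched as below. The function name `compare`, the system names and all numbers are hypothetical placeholders chosen for illustration; they are not measurements of any real system.

```python
def compare(measurements, characteristic, higher_is_better=True):
    """Rank systems on a single characteristic.

    Systems with no measurement for that characteristic are left out
    rather than penalised.
    """
    scored = {name: values[characteristic]
              for name, values in measurements.items()
              if characteristic in values}
    return sorted(scored, key=scored.get, reverse=higher_is_better)


# Placeholder values only (assumed for the sketch, not experimental results).
measurements = {
    "System A": {"user satisfaction": 4.1, "execution time (s)": 0.8},
    "System B": {"user satisfaction": 3.6, "execution time (s)": 0.2},
    "System C": {"execution time (s)": 0.5},
}

print(compare(measurements, "user satisfaction"))
# -> ['System A', 'System B']
print(compare(measurements, "execution time (s)", higher_is_better=False))
# -> ['System B', 'System C', 'System A']
```

The ordering differs per characteristic, and a system that was never measured on one of them can still take part in the comparisons that apply to it.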


Impact of such a framework

Way to describe a system (component, approach) → better understanding of strengths and weaknesses

Useful for evaluations and comparisons

But also in general:
• Someone needing a component can choose an appropriate one
• Someone outside the NLG community can choose a module for their own purposes, without knowing much about it → increased visibility of the field in other communities

Way to compare systems (modules, approaches) without the need to standardise

Fit-for-purpose vs generic: not an issue

Researchers constrained to work on a specific domain/application can still describe their work and be part of this activity → no exclusion


(Almost final) Remarks

Big picture
• Funding
• Fine-tuning a system for a specific task: no longer an issue
• Attention to important theoretical problems
• Understanding of weaknesses & strengths of systems (modules, approaches)

Orthogonal issues
• Finding balance between talking and doing
• Shared resources vs shared tasks


Moving forward as a community

What should we do?

Define a set of characteristics to:
• Understand position and specificity of an approach (module, system)
• Allow descriptions and comparisons

How?

Reflect on our own work and characterise it in terms of its strengths (and weaknesses!), e.g., think about different stakeholders involved in construction, maintenance, funding, etc.

Use the framework as guidance:
• To understand an approach (module, system) from a variety of perspectives (e.g., not just the output)
• To know what to evaluate depending on the situation
• To ensure we see the big picture


References

Cornford, T., Doukidis, G. I. & Forster, D. (1994). Experience with a structure, process and outcome framework for evaluating an information system. Omega, International Journal of Management Science, 22(5), 491-504.

DeLone, W. H. & McLean, E. R. (1992). Information Systems Success: The Quest for the Dependent Variable. Information Systems Research, 3(1), 60-96.


Outline

Misconceptions: what we commonly think is true

Can we compare apples and pears to get rid of the lemons?

How does this apply to NLG?

Enlarging the view of evaluation: “The Big Picture”

Remarks

Moving forward
