NLG Systems Evaluation: Establishing the Big Picture
Cécile Paris, Nathalie Colineau and Ross Wilkinson
CSIRO ICT Centre
Sydney, Australia
www.csiro.au
What have we learnt from the shared-task approach of our siblings (e.g., IR)?
Advantages:
• Some algorithm and system comparability tests (e.g., inverse frequency works, length normalisation does not)
• Some shared resources
(Recognised) Disadvantages:
• It will only tell you some of the things one needs to know; important elements will be missed
• It does not allow the community to answer some important theoretical and practical questions
• Too narrow
Note: No belief that there is a perfect system
• There are some standards, but not a gold standard
Some beliefs?
1. Subtasks and input/output requirements need to be standardised to make core technologies truly comparable.
   • What needs to be evaluated is an approach with its characteristics and its application context.
2. To evaluate systems/approaches, we need to compare them in a shared task.
   • Comparison is not a requirement for evaluation, nor is a shared task a requirement for comparison.
3. Quality of a system equates to quality of its output (i.e., English text).
   • A system cannot be reduced to its output; there are other attributes.
4. There has to be a gold standard.
   • One measure cannot account for everything (even if we were to look at the quality of the output only).
5. One NLG technique works better than another.
   • Vive la différence! There is no such thing as “one size fits all”; typically there are pros and cons.
We can compare apples and pears
We do it every day for many things (e.g., the comparisons found in consumer reports are useful)
Comparison:
• Does not require exact similarity
• But focuses on a set of characteristics/attributes
Depending on situations and needs, different characteristics are required or favoured over others: fitness for purpose
We propose a framework in which to describe characteristics of NLG systems, modules or approaches
Example: Buying a car
Hard constraint: must be between $30,000 and $40,000
Set of attributes that characterise a car: General, Safety, Dimensions, Engine, etc.
RRP $31,990: Manual, Convertible, 2 doors, 4 seats, 82 kW, 1.60 L; Origin: Spain, 2004
• General: Convertible
• Safety: Front side airbag; Brake assist system; Convertible rollover protection; Rain sensor windshield wipers
• Dimensions: 277 mm lower; 600 mm smaller curb to curb turning circle
• Engine: Clutchless manual gearbox
RRP $36,490: Manual, 4WD, 5 doors, 7 seats, 145 kW, 3.50 L; Origin: Korea, 2007
• General: 4 wheel drive; Larger seating capacity
• Safety: Rear window wipers
• Dimensions: 766 mm longer & 160 mm wider
• Engine: 1.9 l larger engine; 2.1 faster acceleration 0-100 km/h; 28 l larger fuel tank
How can we compare (and choose)?
Depends on the criteria of a person (or of a situation)
Robert’s priorities:
• Sports car
• Size: wants a smaller car
• Safety important
Bill’s priorities:
• Needs 4 wheel drive for camping trips
• Seating capacity: large
Does this mean that one car is better than another? No
Comparison and evaluation in the abstract are not necessarily meaningful
What is required is a way, a framework, to describe characteristics
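The point that neither car is better in the abstract can be illustrated with a small Python sketch. Prices, seats and drive type come from the car slide; everything else is an assumption made for illustration: the labels "convertible" and "4WD", treating a convertible as a stand-in for "sports car", treating 4 seats or fewer as "smaller", counting the safety items listed for each car, and scoring a car by the number of priorities it satisfies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Car:
    name: str              # label invented for this sketch
    price: int             # RRP in dollars (from the slide)
    convertible: bool
    four_wheel_drive: bool
    seats: int
    safety_items: int      # number of safety items listed for the car on the slide


CARS: List[Car] = [
    Car("convertible", 31_990, True, False, 4, 4),
    Car("4WD", 36_490, False, True, 7, 1),
]

Priority = Callable[[Car], bool]

# Hypothetical encodings of the two people's priorities.
PRIORITIES: Dict[str, List[Priority]] = {
    "Robert": [
        lambda c: c.convertible,        # stand-in for "sports car"
        lambda c: c.seats <= 4,         # stand-in for "smaller car"
        lambda c: c.safety_items >= 2,  # "safety important" (threshold assumed)
    ],
    "Bill": [
        lambda c: c.four_wheel_drive,   # needed for camping trips
        lambda c: c.seats >= 7,         # large seating capacity
    ],
}


def within_budget(car: Car) -> bool:
    """Hard constraint from the slide: between $30,000 and $40,000."""
    return 30_000 <= car.price <= 40_000


def best_fit(person: str) -> str:
    """Return the car satisfying the most of this person's priorities."""
    candidates = [c for c in CARS if within_budget(c)]
    return max(candidates, key=lambda c: sum(p(c) for p in PRIORITIES[person])).name


print(best_fit("Robert"))
print(best_fit("Bill"))
```

The same two descriptions yield a different choice for each person, which is why the descriptions, rather than a single ranking, are what needs to be shared.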
Can we apply these ideas to generation systems (or modules)?
Example: Generating Referring Expressions (GRE)
Input:
• Type (e.g., numerical, semantic)
Output:
• Type (e.g., English, logical form)
• “Quality”
• Number of expressions generated
Fitness into other modules:
• Place into overall NLG architecture (e.g., requires a text planning or a grammar component)
Configuration:
• Availability of parameters to fine-tune (e.g., user model, domain model)
General:
• Execution time
An example: Comparison of GRE components
Hard constraint: need referring expressions in English
System X: GRE module, English language, Origin: Lab X
Input:
• Type: numerical
Output:
• Type: English
• Quality: has been shown to allow people to select specific objects in a landscape
Fitness into other modules:
• Place into overall NLG architecture: requires a text planning component; no additional lexico-grammatical component needed
Configuration:
• Parameters to fine-tune: yes, user model
• Requirements: creation of user model
System Y: GRE module, English language, Origin: University Y
Input:
• Type: knowledge base
Output:
• Type: logical form
• Quality: produces appropriate input to a functional grammar
Fitness into other modules:
• Place into overall NLG architecture: requires a text planning component; requires a functional grammar for realisation
Configuration:
• Parameters to fine-tune: no
Possible situations/criteria
My situation:
• My input is numerical data
• I need parameters to fine-tune
→ System X
Your situation:
• You have a domain model available
• You already have a grammar component
• You need a GRE to “plug in”
→ System Y
Different systems/approaches will be appropriate
(A similar debate has taken place for template vs planning: no “best” method; it depends on what one needs to do)
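As a rough illustration of choosing a component from its description, the Python sketch below encodes the System X and System Y profiles from the comparison slide and checks each against a situation. The class names, field names and the `fits` rule are hypothetical, and mapping "a domain model available" to a knowledge-base input is an assumption of this sketch, not something the slides state.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GREProfile:
    name: str
    input_type: str
    output_type: str
    requires: Tuple[str, ...]   # other modules the component depends on
    tunable: Tuple[str, ...]    # parameters available for fine-tuning


@dataclass(frozen=True)
class Situation:
    input_type: str
    needs_tuning: bool
    has_grammar: bool


PROFILES: List[GREProfile] = [
    GREProfile("System X", "numerical", "English",
               ("text planning",), ("user model",)),
    GREProfile("System Y", "knowledge base", "logical form",
               ("text planning", "functional grammar"), ()),
]


def fits(profile: GREProfile, situation: Situation) -> bool:
    """True when the profile is compatible with every stated requirement."""
    if profile.input_type != situation.input_type:
        return False
    if situation.needs_tuning and not profile.tunable:
        return False
    if "functional grammar" in profile.requires and not situation.has_grammar:
        return False
    return True


def candidates(situation: Situation) -> List[str]:
    """Names of all profiles that fit the situation."""
    return [p.name for p in PROFILES if fits(p, situation)]


# "My situation": numerical input, parameters needed, no grammar component assumed.
MINE = Situation("numerical", needs_tuning=True, has_grammar=False)
# "Your situation": domain model (assumed to be a knowledge base) and a grammar component.
YOURS = Situation("knowledge base", needs_tuning=False, has_grammar=True)

print(candidates(MINE))
print(candidates(YOURS))
```

Neither profile is ranked above the other here; each is simply checked against what a given situation requires.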
What we need to develop/agree upon
• Comprehensive set of characteristics that describe and specify NLG components and systems
• How to measure them (when they need to be measured)?
   • Might be qualitative or quantitative
   • Might not be a gold standard
   • Might depend on the characteristics (e.g., different measures for fluency, task effectiveness, user satisfaction or cost/ease of building a system)
A framework for evaluation
Inspired by other work, looking beyond our immediate “siblings”, e.g.:
Information systems
• DeLone and McLean 92
• Cornford et al. 94
ISO 9126
UM (effectiveness)
Need for a more general framework for evaluation
• Enlarge the view of evaluation
• Ensure we have a big picture (avoid dangers of local view)
• Organise the possible criteria/ways to think about the questions to ask
• Guides the experimental work
• Consider NLG in its context and that of its stakeholders
• Consider costs and benefits
• Allows one to choose system/module best fit for purpose
• Allows for specific evaluation tasks, placing them and their results into a larger context
A proposed framework
Columns: System functions, Human Perspectives, Organisational Context
Structure:
• System functions: System requirements (hardware and resources required); Architecture details (flexibility, extensibility, maintainability)
• Human Perspectives: Skills required
• Organisational Context: Allocation of resources (staff & money)
Process:
• System functions: Efficiency (in using resources)
• Human Perspectives: Impact on work (time & workflow)
• Organisational Context: Impact on Process & Activities
Outcomes:
• System functions: Accuracy; Reliability
• Human Perspectives: System Usage; Task performance; User satisfaction
• Organisational Context: Productivity Gains; Profits; Effectiveness
Refining the characteristics (with our work)
Columns: Information Consumer (end user), Information Provider, Information Intermediaries, System Providers
Benefits:
• Information Consumer (end user): Outcomes: Task Effectiveness; Knowledge Gain; Satisfaction
• Information Provider: Outcomes: Audience Reach; Audience Accuracy; Message Accuracy
• Information Intermediaries: Process: Ease (cost) of Knowledge Creation
• System Providers: Outcomes: System Usage; Reliability; Response Time; Correctness (for purpose)
Costs:
• Information Consumer (end user): Process: Cognitive Load; Learning Time
• Information Provider: Process: Metadata Provision; Structured Information; Currency of Data
• Information Intermediaries: Process: Time and Effort for knowledge creation and integration
• System Providers: Structure: Implementation Cost; System Integration; System Maintenance
Using the framework to define characteristics -- GRE
Columns: System functions, Human Perspectives, Organisational Context
Structure:
• System functions: System requirements: Input; Other modules required. Configuration: Availability of parameters
• Human Perspectives: Intermediaries: Time & Skills required to build resources; Time & Skills to integrate into overall NLG architecture
• Organisational Context: (probably not applicable as it is only 1 module in a larger system)
Process:
• System functions: Execution Time
Outcomes:
• System functions: Output type; Accuracy: always produces output; Reliability: graceful degradation?
• Human Perspectives: Consumer (if output is English): Task performance; User satisfaction
What does this allow?
• Choice: Given a requirement, choose the system with characteristics that fit the environment
• Comparison & Evaluation: Given a system/module for specific requirements, evaluation with other systems can be done for a specific characteristic (e.g., user satisfaction, task completion, ease of building required input)
• New attributes, guided by the framework
Impact of such a framework
• Way to describe a system (component, approach): better understanding of strengths and weaknesses
   • Useful for evaluations and comparisons
   • But also in general:
      • Someone needing a component can choose an appropriate one
      • Someone outside the NLG community can choose a module for their own purposes, without knowing much about it: increased visibility of the field in other communities
• Way to compare systems (modules, approaches) without the need to standardise
• Fit-for-purpose vs generic: not an issue
• Researchers constrained to work on a specific domain/application can still describe their work and be part of this activity: no exclusion
(Almost final) Remarks
Big picture:
• Funding
• Fine-tuning a system for a specific task: no longer an issue
• Attention to important theoretical problems
• Understanding of weaknesses & strengths of systems (modules, approaches)
Orthogonal issues:
• Finding balance between talking and doing
• Shared resources vs shared tasks
Moving forward as a community
What should we do?
Define a set of characteristics to:
• Understand position and specificity of an approach (module, system)
• Allow descriptions and comparisons
How?
Reflect on our own work and characterise it in terms of its strengths (and weaknesses!), e.g., think about different stakeholders involved in construction, maintenance, funding, etc.
Use the framework as guidance:
• To understand an approach (module, system) from a variety of perspectives (e.g., not just the output)
• To know what to evaluate depending on the situation
• To ensure we see the big picture
References
Cornford, T., Doukidis, G. I. & Forster, D. (1994). Experience with a structure, process and outcome framework for evaluating an information system. Omega, International Journal of Management Science, 22 (5), 491-504.
DeLone, W. H. & McLean, E. R. (1992). Information Systems Success: The Quest for the Dependent Variable. Information Systems Research, 3 (1), 60-96.
Outline
Misconceptions: what we commonly think is true
Can we compare apples and pears to get rid of the lemons?
How does this apply to NLG?
Enlarging the view of evaluation: “The Big Picture”
Remarks
Moving forward