SAARLAND UNIVERSITY
Faculty of Natural Science and Technology I
Department of Computer Science
Master’s Program in Computer Science
Master’s Thesis
Embodied Presentation Teams:
A plan-based approach for affective
sports commentary in real-time
submitted by
Ivan Gregor
on March 1, 2010
Supervisor
Prof. Wolfgang Wahlster
Advisor
Dr. Michael Kipp
Reviewers
Prof. Wolfgang Wahlster
Dr. Michael Kipp
Statement
Hereby I confirm that this thesis is my own work and that I have documented all sources
used.
Signed:
Date:
Declaration of Consent
Herewith I agree that my thesis will be made available through the library of the Com-
puter Science Department.
Signed:
Date:
Abstract
Virtual agents are essential representatives of multimodal user interfaces. This thesis
presents the IVAN system (Intelligent Interactive Virtual Agent Narrators), which generates
affective commentary in real-time on a tennis game given as an annotated video.
The system employs two distinguishable virtual agents with different
roles (TV commentator, expert), personality profiles, and positive, neutral, or negative
attitudes toward the players. The system uses an HTN planner to generate dialogues, which
makes it possible to plan large dialogue contributions and to generate alternative plans. The system
can also interrupt the current discourse when a more important event happens. The
current affect of the virtual agents is conveyed by lexical selection, facial expression,
and gestures. The system integrates background knowledge about the players and the
tournament as well as pre-defined user questions. We have focused on the dialogue planning,
knowledge processing, and behaviour control of the virtual agents. Commercial products
have been used as the audio-visual component of the system.
A demo version of the IVAN system was accepted for GALA 2009, which was part of
the 9th International Conference on Intelligent Virtual Agents. We have verified that an
HTN planner can be employed to generate affective commentary on a continuous sports
event in real-time. However, while HTN planning is well suited to generating large
dialogue contributions, expert systems are better suited to producing commentary on a
rapidly changing environment. Most parts of the system are domain-dependent; however,
the same architecture can be reused to implement applications such as interactive
tutoring systems, tourist guides, or guides for the blind.
Acknowledgements
First of all, I would like to thank Michael Kipp and Jan Miksatko for being very helpful
and inspiring supervisors. Thanks as well to the DFKI for providing the opportunity
to work on this project, for the necessary equipment, and funding to attend the GALA
competition and the IVA conference. Thank you also to Charamel GmbH and Nuance
Communications, Inc., for providing the Charamel virtual agents Mark and Gloria
and the RealSpeak Solo software with the Tom and Serena voices, respectively. Finally,
I would like to thank my parents for being very supportive during my studies in Prague
and Saarbruecken.
Contents
Abstract i
Acknowledgements ii
List of Figures v
List of Tables vii
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 GALA 2009 Challenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 IVAN System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Research Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Related Work 8
2.1 ERIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 The Affect Module . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 The Natural Language Generation Module . . . . . . . . . . . . . 9
2.2 DEIRA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Spectators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 STEVE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Presentation Teams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.1 Design of Presentation Teams . . . . . . . . . . . . . . . . . . . . . 13
2.5.2 Inhabited Marketplace . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5.3 Rocco II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Methods for Controlling Behaviour of Virtual Agents 16
3.1 Hierarchical Task Network Planning . . . . . . . . . . . . . . . . . . . . . 16
3.1.1 Example of a Planning Task . . . . . . . . . . . . . . . . . . . . . . 17
3.1.2 Java Simple Hierarchical Ordered Planner (JSHOP) . . . . . . . . 19
3.1.3 JSHOP Language . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Expert Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Statecharts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4 Generating Dialogue 26
4.1 Commentary Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1.2 Dialogue Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.3 Planning Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.4 Commentary Excerpt . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2 Affect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2.2 Planning with Attitude . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2.3 OCC Generated Emotions . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5 Architecture 41
5.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.1.1 Design Aims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1.2 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1.3 Off-the-shelf Components . . . . . . . . . . . . . . . . . . . . . . . 44
5.2 Tennis Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3 Plan Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3.1 Event Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3.2 Background Knowledge . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3.3 Discourse Planner . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.4 Plan Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4.1 Template Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.4.2 Avatar Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.4.3 Output Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6 Discussion 62
6.1 Comparison with the ERIC system . . . . . . . . . . . . . . . . . . . . . . 62
6.2 Evaluation in Terms of Research Aims . . . . . . . . . . . . . . . . . . . . 63
6.3 Comparison JSHOP vs Jess . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7 Conclusion 68
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A Commentary Excerpt 72
List of Figures
1.1 Event Position Specification . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Example of an ANVIL File . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 ERIC commenting on a Horse Race . . . . . . . . . . . . . . . . . . . . . . 9
2.2 DEIRA (Dynamic Engaging Intelligent Reporter Agent) . . . . . . . . . . 10
2.3 STEVE in a 3D Simulated Student’s Work Environment . . . . . . . . . . 12
2.4 Example of a Planning Method (Dialogue Scheme) to Discuss an Attribute Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Excerpt of the Domain Knowledge . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Gerd and Metze commenting on a RoboCup Soccer Game . . . . . . . . . 15
3.1 Example of a Planning Task - HTN . . . . . . . . . . . . . . . . . . . . . 18
3.2 Example of a Planning Task - generated Plan . . . . . . . . . . . . . . . . 18
3.3 JSHOP Input Generation Process . . . . . . . . . . . . . . . . . . . . . . 19
3.4 Sample JSHOP Axiom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5 Sample JSHOP Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.6 Sample JSHOP Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.7 Overview of the COHIBIT system . . . . . . . . . . . . . . . . . . . . . . 25
4.1 Example of a Planning Method . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2 Example of a Compound Task Decomposition . . . . . . . . . . . . . . . . 30
4.3 Possible Decompositions of a Compound Task . . . . . . . . . . . . . . . . 31
4.4 Decomposition of the Goal Task “Comment” . . . . . . . . . . . . . . . . 32
4.5 Decomposition of the Subgoal Task Comment on rally . . . . . . . . . . . 32
4.6 Decomposition of the Goal Task “Comment” that leads to a Subgoal Task Drop Volley . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.7 Emotion Module GUI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.1 IVAN Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.2 Dataflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.3 Charamel Virtual Agents Mark and Gloria . . . . . . . . . . . . . . . . . 45
5.4 Tennis Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.5 Tennis Simulator GUI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.6 IVAN Architecture - Plan Generation . . . . . . . . . . . . . . . . . . . . 47
5.7 Dataflow - Plan Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.8 States of the Tennis Game . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.9 Tennis Score Counting using a Finite State Machine . . . . . . . . . . . . 50
5.10 Hierarchy of Facts from which an Ace can be deduced . . . . . . . . . . . 52
5.11 JSHOP Input Generation Process . . . . . . . . . . . . . . . . . . . . . . 55
5.12 IVAN Architecture - Plan Execution . . . . . . . . . . . . . . . . . . . . . 56
5.13 Dataflow - Plan Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
List of Tables
1.1 Tennis Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Event Position Specification . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Track Element Specification . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4.1 Dialogue Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Example of Generated Dialogues based on different Appraisals . . . . . . 36
4.3 Description of the eight Basic OCC Emotions . . . . . . . . . . . . . . . . 37
4.4 Five Personality Traits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.5 Example of Events that elicit respective Emotions . . . . . . . . . . . . . 38
5.1 Description of the Tennis Counting Terminology . . . . . . . . . . . . . . 50
5.2 Example of high-level facts deduced from low-level facts . . . . . . . . . . 52
5.3 Examples of Facts deduced from the Background Knowledge . . . . . . . 53
Chapter 1
Introduction
This thesis presents the IVAN system (Intelligent Interactive Virtual Agent Narrators),
that provides affective commentary on a continuous sports event in real-time. We have
employed two virtual agents that are engaged in dialogues to comment on a tennis game
that was given as the GALA 2009 challenge (see section 1.2). The virtual agents can have
different attitudes to players and their current affective state can be conveyed by lexical
selection, facial expression, and gestures. We have focused on the knowledge processing,
dialogue planning, and behaviour control of the virtual agents. We have used commercial
software as the audio-visual component of the system. In the following sections, we will
explain why it is beneficial to employ virtual agents, then we will describe our task as
the GALA 2009 challenge, outline the IVAN system, and describe our research aims.
1.1 Motivation
Multimodal user interfaces are becoming more and more important in human-machine
communication. Essential representatives of such interfaces are virtual agents, which
aim to act like humans in the way they employ gestures, gaze, facial expression, posture,
and prosody to convey facts in face-to-face communication with a user [1]. Face-to-face
interaction over such a rich communication channel is believed to be an exclusively
human domain; for instance, if people have something important to say, they say it in
person. To generate such complex behaviour, it is important to endow
the virtual agent with emotions, since this makes the agent more believable to humans and
the system that employs such agents more entertaining and enjoyable
for its users [2]. Virtual agents can be employed in many fields, such as computer
games, tutoring systems, virtual training environments [3], storytelling systems [4, 5],
advertisement, automated presenters [6, 7, 8, 9], and commentators [10, 11].
In this thesis, we have focused on commentary agents. Moreover, we have employed
a presentation team [6], i.e., several distinguishable virtual agents with different
personality profiles, roles, and goals, since this enriches the communication strategies
and allows the information being conveyed to be distributed over several virtual agents
in the form of a dialogue. It is particularly important to endow the virtual agents of
a presentation team with emotions, since this makes them more distinguishable.
Distinct virtual agents can better represent different roles and opposing
points of view. A presentation team is also more advantageous than a single virtual
agent because its performance is more entertaining for the audience, provides better
understanding, and improves recall of the presented information.
An additional advantage of virtual commentary agents is that they can run locally on
a user’s computer. Hence, the commentary can be partly customized, since the user can
adjust its basic settings. Thus, it is a good idea to employ a presentation team of virtual
agents to comment on a sports event.
1.2 GALA 2009 Challenge
In this section, we will introduce our task, which was given as the GALA 2009¹
(Gathering of Animated Lifelike Agents) challenge. The GALA event is part of the annual
International Conference on Intelligent Virtual Agents (IVA)². The aim of GALA is to
encourage students to implement a system that provides behaviourally complex
commentary on a continuous stream of events in real-time. The GALA 2009 challenge
was to provide commentary on a tennis game given as an annotated video.
In previous years, the GALA challenge was to comment on a horse race
produced by a horse race simulator.
The events that occur in the video of a tennis game are manually annotated with the
ANVIL tool [12] and stored into an ANVIL file. The ANVIL file contains timestamped
events that are grouped into tracks where each track contains events that have the same
source, namely, we have one track for the ball and one track for each player. Table 1.1
contains all events that can be annotated.
Each event is further specified with the place on a tennis court where it happened.
Table 1.2 contains attributes that specify the position of a ball or a player and Figure
1.1 depicts these tags in the picture of a tennis court.
¹ http://hmi.ewi.utwente.nl/gala
² http://iva09.dfki.de/
Player events       Ball events
throw               shot
serve               cross net
forehand            hit net
backhand            hit tape
forehand-volley     bounce
backhand-volley     fault
smash               out
miss
Table 1.1: Tennis Events
Position side    Position longitudinal    Position lateral    Position height
server           net                      left                low
receiver         mid court                middle              middle
                 baseline                 right               high
Table 1.2: Event Position Specification
Figure 1.1: Event Position Specification
Each event with its timestamp and position specification stands for a track element.
Table 1.3 contains information which attributes each track element has.
Ball track element       Player track element
timestamp                timestamp
ball event               player event
position lateral         position lateral
position longitudinal    position longitudinal
position side
position height
Table 1.3: Track Element Specification
Figure 1.2: Example of an ANVIL File
Figure 1.2 shows two excerpts from an ANVIL file. The left column is an example of
a ball track and the right column is an example of a track of the first player. As we
can see, each track consists of track elements where each track element represents one
event. Furthermore, each track element has a start time and an end time. Whilst the
start time of an event corresponds to its timestamp, the end time of an event can be
omitted, since all events can be considered as instantaneous. The ball track describes
that a ball was shot on the right side of the baseline on the server side at time 7.49 sec,
then the ball crossed the net in the middle, and bounced in the middle of the mid-court
on the receiver side, and then was shot on the right side of the baseline. The player
track describes that the player is throwing a ball on the right side of the baseline at time
7.4 sec, then he is serving. Later on, the player is playing a forehand on the right side
of the baseline, and then he is playing a backhand from the left side of the baseline.
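The track-element structure just described can be modelled as a small data class. The field names below follow Tables 1.1–1.3, but the class itself is an illustrative sketch, not the actual ANVIL schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrackElement:
    """One annotated event; the start time doubles as the event's timestamp."""
    start: float
    event: str
    pos_lateral: str
    pos_longitudinal: str
    end: Optional[float] = None  # may be omitted: events are instantaneous

    @property
    def timestamp(self) -> float:
        return self.start

# The first element of the ball track in Figure 1.2: a shot on the right
# side of the baseline at 7.49 sec.
shot = TrackElement(7.49, "shot", "right", "baseline")
print(shot.timestamp)  # 7.49
```

Leaving `end` optional mirrors the convention above that the end time can be dropped because all events are treated as instantaneous.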
1.3 IVAN System
In this section, we will introduce the IVAN system (Intelligent Interactive Virtual Agent
Narrators) [13] that we have developed to produce affective, behaviourally complex
commentary on a continuous sports event in real-time. The system was employed to
comment on a tennis game that was given as the GALA 2009 challenge. We have
employed a presentation team (Elisabeth André, Thomas Rist) [6], in our case
two virtual agents with different roles (TV commentator, expert) reflecting two different
presentation styles, attitudes toward the players (positive, neutral, negative), and personality
profiles, to jointly comment on a tennis game. One virtual agent can interrupt
the other virtual agent, or himself/herself, when a more important event happens. The
system also integrates background knowledge about the players and the tournament.
Moreover, the user can ask one of the pre-defined questions at any time. We have
focused on the knowledge processing, dialogue planning, and behaviour control of the
virtual agents. We have used commercial software as the audio-visual component of the system.
The IVAN system consists of several modules that are running in separate threads and
communicate via shared queues. We employed an HTN planner to generate dialogues,
statecharts to simulate basic states of the game, and expert systems to maintain the
emotional state of each virtual agent. When the system starts, the tennis simulator reads
an ANVIL [12] file that contains the description of a tennis game and sends timestamped
events (e.g., a player plays a forehand, a ball hits the net) at the times they occur to
the input interface of the core system. The core system transforms these elementary
events into low-level facts (e.g., which player just scored) that form the knowledge base
for the HTN planner and the emotion module. Generated plans that represent possible
dialogues are transformed to individual utterances and annotated with gestures. The
current emotional state of a virtual agent is used to derive his/her facial expression.
Annotated utterances along with the corresponding facial expression tags are sent to
the audio-visual component that creates the multimodal output of the system.
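The transformation from elementary events to low-level facts can be illustrated with a deliberately simplified sketch; the fact names, the event list, and the single rule below are hypothetical, not IVAN's actual rule base:

```python
def derive_facts(events):
    """Map timestamped elementary events (time, source, event) to a set of
    low-level facts; a hypothetical simplification of IVAN's input stage."""
    facts = set()
    for _time, source, event in events:
        facts.add((source, event))
        # e.g. the ball going out ends the rally (which player scored would
        # follow from further rules over the game state)
        if source == "ball" and event == "out":
            facts.add(("rally", "ended"))
    return facts

events = [(7.49, "player1", "serve"), (9.80, "ball", "out")]
print(("rally", "ended") in derive_facts(events))  # True
```

In the real system, the resulting facts feed both the HTN planner's knowledge base and the emotion module.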
When the system starts, our two virtual agents engage in dialogues to comment on the
tennis game or on background facts. A virtual agent is happy if his/her favourite
player is doing well and unhappy if s/he is losing. A virtual agent comments in a positive
way on a player s/he likes and on events that lead toward the victory of his/her favourite
player, and in a negative way on a player s/he dislikes and on events that hinder the
victory of his/her favourite player. The current affect of a virtual agent is conveyed by
lexical selection, facial expression, and gestures.
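The appraisal rule just described reduces to a sign computation. The sketch below is an illustrative reconstruction, not the system's actual emotion module:

```python
def comment_tone(attitude: int, event_helps_player: bool) -> str:
    """Tone of a comment on an event, given the agent's attitude toward
    the affected player: +1 (likes), 0 (neutral), -1 (dislikes)."""
    valence = attitude * (1 if event_helps_player else -1)
    return {1: "positive", 0: "neutral", -1: "negative"}[valence]

# An agent with a positive attitude praises events that help its favourite...
print(comment_tone(+1, True))   # positive
# ...and laments events that hinder that player's victory.
print(comment_tone(+1, False))  # negative
```

A neutral attitude yields a neutral tone regardless of the event, matching the behaviour described above.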
1.4 Research Aims
In this section, we will describe our four main research aims. They will be discussed in
section 6.2 (Evaluation in Terms of Research Aims) after we have described the architecture
of the whole system.
• Dialogue Planning for Real-time Commentary
In this master’s thesis, we wanted to investigate how an HTN planner can be employed
to generate commentary for two virtual agents, in the form of a dialogue, on a continuous
sports event in real-time. An example of a real-time commentary system that uses an
expert system to control a single virtual agent is ERIC [10]; however, it may be too
reactive, i.e., individual utterances are tied to particular knowledge states, so ERIC
cannot generate larger dialogue contributions. In addition, expert systems cannot generate
alternative plans, so HTN planning offers more variability. We therefore wanted to
examine an HTN planner, which we expected to be a good strategy for generating
elaborate, large, and coherent dialogue contributions.
• Reactivity
The system should be able to react quickly to new events that happen during the
tennis game. Moreover, when a more important event happens than the event
on which the virtual agents are commenting at the moment, the system should
be able to interrupt the current discourse and comment on the new event. The
interruption should be graceful, with a smooth transition.
• Behavioural Complexity
The virtual agents of our presentation team should ideally behave like human ten-
nis commentators and produce interesting, suitable, and believable commentary.
They should use the whole range of communication channels to convey facts about
the tennis game. They should generate a variety of dialogues, along with synchronized
hand and body gestures, and show facial expressions appropriate to their current
emotional states. Moreover, allowing the user to interact with the system makes it more
engaging. Behavioural complexity ensures the believability of the virtual characters;
without the above-mentioned traits, the virtual agents would look unrealistic.
• Affective Behaviours
The virtual agents should affectively react to the events that occur in the tennis
game according to their (positive, neutral, or negative) attitudes toward the players.
Their emotional state should be derived from the appraisals of the events that hap-
pen during the tennis game. The virtual agents’ current affect should be conveyed
by lexical selection, facial expression, and gestures. Endowing virtual agents with
emotions increases their believability and makes them better accepted by users.
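The first of these aims, HTN-based dialogue planning, can be previewed with a minimal decomposition sketch. The task names loosely follow the planning tree of chapter 4 (goal task "Comment", subgoal "Comment on rally"), but the method table, preconditions, and primitive tasks are illustrative, not the actual JSHOP domain:

```python
# Each compound task maps to alternative decompositions; the first
# decomposition whose precondition holds in the current state is taken.
METHODS = {
    "comment": [
        (lambda state: "rally_finished" in state, ["comment_on_rally"]),
        (lambda state: True, ["comment_on_background"]),
    ],
    "comment_on_rally": [
        (lambda state: True, ["say_score", "say_evaluation"]),
    ],
}

def plan(task, state):
    """Decompose a task recursively; tasks without methods are primitive."""
    if task not in METHODS:
        return [task]
    for precondition, subtasks in METHODS[task]:
        if precondition(state):
            steps = []
            for subtask in subtasks:
                steps += plan(subtask, state)
            return steps
    return []  # no applicable method: empty contribution

print(plan("comment", {"rally_finished"}))
# ['say_score', 'say_evaluation']
print(plan("comment", set()))
# ['comment_on_background']
```

Because several decompositions can apply to the same compound task, a planner of this kind can also return alternative plans, which is the source of the variability mentioned above.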
Chapter 2
Related Work
In this chapter, we will describe several examples of virtual agent applications that are
relevant to our work. We will introduce ERIC, an affective, rule-based sports
commentary agent that won GALA 2007 as a horse race commentator. We will also
present DEIRA, another horse race reporter. Then we will present the Spectators
project, which participated in GALA 2009 (see section 1.2); it employs several autonomous
affective virtual agents that jointly watch a tennis game as ordinary tennis spectators.
To introduce HTN planning (see section 3.1), which we have employed in our system
to generate dialogues, we will describe STEVE, which uses HTN planning to help
students perform physical procedural tasks in a 3D simulated work environment.
Since we employed a presentation team [6] in our system, we will also describe
the general design of presentation teams and two applications that employ them.
2.1 ERIC
ERIC [10, 14] won GALA 2007¹ as a horse race commentator. It is a generic rule-based
framework for affective real-time commentary developed at DFKI. The system was
tested in two domains, a horse race and a tank battle game, where the horse race was
produced by a horse race simulator supplied by GALA 2007. The simulator sends the
speed and position of each horse to ERIC every second via a socket. ERIC receives
events from the horse race simulator and produces coherent natural language along
with non-verbal behaviour. The visual output is a virtual agent whose lip movement is
synchronized to speech and who can show various facial expressions and
perform many different gestures. ERIC employs the same avatar engine as our system.
The graphical output of ERIC is shown in Figure 2.1.
¹ http://hmi.ewi.utwente.nl/gala/finalists 2007/
Figure 2.1: ERIC commenting on a Horse Race
ERIC consists of several modules. We will describe the two most interesting, the
Affect module and the Natural Language Generation module, in detail.
2.1.1 The Affect Module
The affect module receives facts from the world and assigns appraisals to each event,
action, and object according to goals, desires, and cause-effect relations. The appraisal
of an event, action, or object is then sent in the form of a specific tag to the
ALMA module [15], which maintains the commentator’s affective state. ALMA considers
three types of affect: emotions (short-term), mood (medium-term), and personality
(long-term). Emotions are bound to specific events and decay over time. Mood
represents the average of the emotional state across time. Personality is defined by the Big
Five [16], i.e., openness, conscientiousness, extraversion, agreeableness, and neuroticism.
Personality is used to compute the initial mood and influences the intensity and decay
of emotions. The affective state of a virtual agent influences utterance, gesture, and
facial expression selection.
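The three time scales can be illustrated numerically. The exponential decay and the half-life constant below are assumptions for illustration, not ALMA's actual update rule:

```python
def decayed_intensity(intensity: float, dt: float, half_life: float = 5.0) -> float:
    """Emotion intensity dt seconds after the eliciting event
    (assumed exponential decay; the half-life is an illustrative constant)."""
    return intensity * 0.5 ** (dt / half_life)

def mood(recent_emotions):
    """Mood as the average of the recent emotional state (medium-term)."""
    return sum(recent_emotions) / len(recent_emotions) if recent_emotions else 0.0

print(decayed_intensity(1.0, 5.0))       # 0.5 (one half-life has passed)
print(round(mood([0.5, 0.3, 0.1]), 3))   # 0.3
```

In ALMA, personality would additionally set the initial mood and modulate both the intensity and the decay speed of emotions.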
2.1.2 The Natural Language Generation Module
This module uses a template-based algorithm to generate utterances. Each template
corresponds to a rule in a rule-based engine. Each such rule has conditions that can be
partitioned into four groups: facts that must be known, facts that must be unknown,
facts that must be true, and facts that must be false. There is at least one utterance
for each template, containing flat text and slots for variables. First, all candidate
templates are generated, then the corresponding utterances are retrieved, and finally one
of the most coherent utterances is chosen. Discourse coherence is ensured by
Centering Theory [17], which, simplified, says that a discourse is coherent if every
two consecutive utterances are coherent. Thus, for each template, its topic and a list of
all possible topics for a coherent following sentence are defined. After a template
has been chosen, the next template is chosen so that its topic is among the possible topics
for a coherent following sentence of the previous template.
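The four condition groups and the topic-based coherence choice can be sketched as follows; the template representation, topic names, and fact names are hypothetical, not ERIC's actual data structures:

```python
def fires(t, known, true_facts):
    """A template's rule fires when all four condition groups hold."""
    return (t["known"] <= known and not (t["unknown"] & known)
            and t["true"] <= true_facts and not (t["false"] & true_facts))

def choose_next(templates, known, true_facts, last):
    """Among firing templates, prefer one whose topic is listed as a
    coherent follow-up topic of the previously chosen template."""
    candidates = [t for t in templates if fires(t, known, true_facts)]
    coherent = [t for t in candidates
                if last is None or t["topic"] in last["next_topics"]]
    return (coherent or candidates or [None])[0]

score = {"topic": "score", "next_topics": {"serve"}, "known": {"score"},
         "unknown": set(), "true": set(), "false": set()}
serve = {"topic": "serve", "next_topics": {"score"}, "known": {"serve_speed"},
         "unknown": set(), "true": set(), "false": set()}

nxt = choose_next([score, serve], known={"score", "serve_speed"},
                  true_facts=set(), last=score)
print(nxt["topic"])  # serve
```

After commenting on the score, the serve template is preferred because "serve" is listed among the score template's coherent follow-up topics.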
This system is most closely related to our work since the overall goal of ERIC is the
same as ours. A comparison of the IVAN system and ERIC is given in section 6.1.
2.2 DEIRA
DEIRA [11] (Dynamic Engaging Intelligent Reporter Agent) is another commentary
agent that participated in GALA 2007² as a horse race reporter. DEIRA employs an
expert system to generate affective commentary in real-time. The system maintains
the affective state of the reporter according to his personality and the events that occur in
the horse race. The current affect is represented by a vector of four values (tension,
surprise, amusement, pity) and is conveyed by the reporter’s lexical selection and facial
expression. The graphical output of the system is shown in Figure 2.2.
Figure 2.2: DEIRA (Dynamic Engaging Intelligent Reporter Agent)
² http://hmi.ewi.utwente.nl/gala/finalists 2007/
2.3 Spectators
The Spectators project [18] participated in GALA 2009³ (see section 1.2). The system
consists of several autonomous virtual agents that are watching a tennis game. The
spectators can have different attitudes toward the teams, where an attitude can be positive
or neutral. Each spectator has a euphoria factor that determines how much the
spectator’s mood changes when an important event happens in the tennis game; the
euphoria factor stands in for a personality trait. A spectator’s mood
is expressed by facial expression, typical animations, and speech. The possible
moods are: euphoric, happy, slightly happy, neutral, slightly sad, sad, and
disappointed. Furthermore, the position of the ball is interpolated so that the spectators
can gaze at the ball during a rally. The voice of a referee is also incorporated to announce
the score in the conventional way.
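The euphoria-scaled mood update can be reconstructed as a small sketch; the numeric scale, the clamping, and the constants are assumptions, not the Spectators implementation:

```python
MOODS = ["disappointed", "sad", "slightly sad", "neutral",
         "slightly happy", "happy", "euphoric"]  # the seven moods above

def update_mood(mood_value: float, event_impact: float, euphoria: float) -> float:
    """Shift the mood by the event's impact scaled by the spectator's
    euphoria factor, clamped to the assumed [-3, +3] scale."""
    return max(-3.0, min(3.0, mood_value + event_impact * euphoria))

def mood_label(mood_value: float) -> str:
    return MOODS[round(mood_value) + 3]

value = update_mood(0.0, 2.0, 1.5)  # big positive event, excitable spectator
print(mood_label(value))  # euphoric
```

A spectator with a small euphoria factor would drift only slowly away from the neutral mood under the same events.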
However, the system focuses only on non-verbal behaviour, i.e., neither the spectators
nor the referee comment on the game as tennis commentators would. The system essentially
consists of a limited set of rules that trigger the respective animations. Thus, our
system and the Spectators project could be combined to generate a complex scene of a
tennis game with both tennis commentators and spectators.
2.4 STEVE
STEVE (Soar Training Expert for Virtual Environments) [3] is a sample application that
uses the same method as our system to control the behaviour of virtual agents, namely,
the HTN planning (see section 3.1). STEVE is a virtual agent that helps students
to perform physical procedural tasks in a 3D simulated student’s work environment.
STEVE can either demonstrate procedural tasks or monitor students while they are
performing tasks and provide assistance if they need help or ask questions. Each task
consists of a set of partially ordered steps where a step can be a primitive action or
a composite action which creates a hierarchical structure where some steps of a task
can also be reused to solve other tasks. STEVE therefore employs a Hierarchical
Task Network to define the tasks. STEVE consists of a perception module, a cognition
module, and a motor control module. The perception module monitors the state of the
virtual world and maintains a coherent representation of it. In each loop of its decision
cycle, the cognition module gets the current snapshot of the world from the perception
module, chooses appropriate goals, and then constructs and executes plans. The motor
control module receives high-level commands from the cognition module
³ http://hmi.ewi.utwente.nl/gala/finalists 2009/
to control the voice, locomotion, gaze, gestures, and object manipulation. The graphical
output of STEVE is shown in Figure 2.3.
Figure 2.3: STEVE in a 3D Simulated Student’s Work Environment
Our system, like STEVE, uses an HTN planner to generate speech and can interact
with users via user questions. We were also inspired by STEVE’s execution cycle and
the concept of snapshots of the world. In comparison to STEVE, our system employs two
virtual agents, maintains their affective states, and generates affective commentary. On
the other hand, our system generates shorter contributions, it does not have elaborate
user interaction, and our virtual agents cannot move in the virtual environment.
2.5 Presentation Teams
We employed a presentation team [6, 7, 8, 9] in our system to comment on a tennis
game. In this section, we briefly describe the general design of presentation teams and
then focus on two projects that employ them. The first project is the Inhabited Marketplace,
where a car seller and customers that have different preferences (e.g. running costs, prestige)
and character profiles are engaged in dialogues to discuss the attributes of
a car that the customers are interested in. The second project is Rocco II, where two
soccer fans, who can have different attitudes to the teams and different character profiles,
jointly watch a RoboCup soccer game and comment on it.
2.5.1 Design of Presentation Teams
The idea of presentation teams is to automatically generate presentations on the fly. A
presentation team consists of at least two virtual agents that convey information in the
style of a performance observed by a user. This approach is believed to be more entertaining
and to provide better understanding than a system with only one presenter. The
virtual agents' roles, character profiles, and dialogue types are chosen depending
on the discourse purpose. Moreover, the characters should be distinguishable, i.e., they
should differ in audio-visual appearance, expertise, interests, and personality. Distinct
agents can also better express opposing roles. There are two basic approaches to
generating the dialogue [19]. Agents with scripted behaviour correspond to actors
of a play who can still improvise a little at performance time, i.e., their behaviour is
first generated as a script (that contains slots for variables that can be substituted at
runtime) and is executed later on. In contrast, autonomous agents have no script;
they generate their dialogue contributions on the fly, i.e., they pursue their own
communicative goals and react to the dialogue contributions of the other characters.
First, we present a project that employs agents with scripted behaviour, and then a
project that employs autonomous agents.
2.5.2 Inhabited Marketplace
The Inhabited Marketplace project employs a presentation team to present facts along
with an evaluation under constraints. Each character’s profile is defined by agreeable-
ness (agreeable, neutral, disagreeable), extraversion (extravert, neutral, introvert) and
valence (positive, neutral, negative). The presentation team consists of a car seller and
customers, where each of them can prefer a different dimension (e.g. environment, economy,
prestige, or running costs). The aim of each customer is to discuss all attributes that
have a positive or negative impact on the dimension they are interested in. Furthermore, the
dialogue is also driven by the characters' personality traits, e.g., an extravert will start
the conversation and an introvert will use less direct speech. The dialogue is generated
by an HTN planner (see section 3.1), i.e., the goal task is successively decomposed by
planning methods into individual utterances. An example of a planning method that
represents a particular dialogue scheme is shown in Figure 2.4. The method represents
a scenario where two agents discuss a feature of an object. It applies if the feature has a
negative impact on any dimension and if this relationship can be easily inferred. In that
case, a disagreeable buyer produces a negative comment referring to this dimension, e.g.,
to the dimension running costs, considering the facts contained in Figure 2.5.
Figure 2.4: Example of a Planning Method (Dialogue Scheme) to Discuss an Attribute Value
Figure 2.5: Excerpt of the Domain Knowledge
2.5.3 Rocco II
Gerd and Metze are two soccer fans that comment on a RoboCup soccer game. They can
have different attitudes to the teams and their character profile is defined by extraversion
(extravert, neutral, introvert), openness (open, neutral, not open) and valence (positive,
neutral, negative). The project focuses on the following dispositions: arousal (calm,
neutral, excited) and valence. The system performs incremental event recognition [20],
proceeding from a high-level analysis of the scene, over recognized events, to the basis
for the commentary, where the basis additionally contains background knowledge about
the game and the teams. The system employs two autonomous agents that use
template-based natural language generation to produce the commentary on the fly.
Furthermore, an agent can interrupt himself if a more important event happens. The
templates are strings with slots
for variables. Each template contains several tags, for instance: verbosity (the number
of words), bias (positive, neutral, negative), formality (formal, normal, colloquial) and
floridity (dry, normal, flowery language). The candidate templates are filtered in four
steps in the execution cycle:
1. pass only short templates in the case of time pressure
2. eliminate templates that were used recently
3. pass only templates expressing the speaker's attitude
4. choose templates according to the speaker's personality
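The four filtering steps above can be sketched as successive list filters. This is an illustrative reconstruction, not the Rocco II implementation: the field names, the word-count threshold, and the recency window are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Template:
    text: str
    verbosity: int   # number of words
    bias: str        # "positive" | "neutral" | "negative"
    formality: str   # "formal" | "normal" | "colloquial"
    last_used: int   # cycle index of last use, -1 if never used

def filter_templates(candidates, cycle, time_pressure, attitude, formality):
    # 1. in the case of time pressure, pass only short templates
    if time_pressure:
        candidates = [t for t in candidates if t.verbosity <= 5]
    # 2. eliminate templates that were used recently
    candidates = [t for t in candidates if cycle - t.last_used > 10]
    # 3. pass only templates expressing the speaker's attitude
    candidates = [t for t in candidates if t.bias == attitude]
    # 4. choose templates matching the speaker's personality
    return [t for t in candidates if t.formality == formality]

pool = [
    Template("Goal!!!", 1, "positive", "colloquial", -1),
    Template("A truly remarkable strike by the home side.", 8, "positive", "formal", -1),
    Template("What a blunder!", 3, "negative", "colloquial", -1),
]
chosen = filter_templates(pool, cycle=20, time_pressure=True,
                          attitude="positive", formality="colloquial")
```

Each step only narrows the candidate set, so under time pressure an excited positive speaker is left with short, recent-free, positively biased, colloquial templates.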
The agents’ emotions are influenced by the current state of the game. Emotions can
be expressed by the speed and pitch range of the speech along with different hand and
body gestures. The graphical output of the system is shown in Figure 2.6.
Figure 2.6: Gerd and Metze commenting on a RoboCup Soccer Game
Similar to our system, Gerd and Metze can have different attitudes to the teams (players)
and different personality profiles, and the system integrates background knowledge about
the game and the teams and allows interruptions. In contrast to our system, Rocco II
employs two autonomous agents that use template-based natural language generation to
produce the commentary on the fly. While our templates can be categorized only according
to bias, Rocco II uses a wide range of templates that are categorized according to verbosity,
bias, formality, and floridity. Thus, the system can generate more reactive and
elaborate commentary than ours. The system also maintains the emotional state
of the virtual agents, which can be expressed by prosody as well as hand and body gestures.
On the one hand, our system does not integrate prosody; on the other hand, our virtual
agents have more elaborate facial expressions and gestures.
Chapter 3
Methods for Controlling
Behaviour of Virtual Agents
In this chapter, we will introduce three basic methods for controlling the behaviour of
virtual agents that we have employed in our system. The most important method is
HTN planning, which we have employed to generate dialogues for our presentation
team (see section 4.1). The second method is expert systems, which we have used to
define emotion eliciting conditions in the emotion module (see section 4.2.3). The third
method is statecharts, of which we have used three simple finite state machines to model
the basic states of the system (see section 5.3.1). Let us note that each of these methods can
also be used on its own for natural language generation (e.g. see ERIC in section 2.1,
which uses an expert system).
3.1 Hierarchical Task Network Planning
In our system, we have employed the Hierarchical Task Network (HTN) planning to
generate the dialogues for our presentation team (see section 4.1). In general, planning
is employed for problem solving and can be applied in many different domains to
save time and money, e.g., in air transport, flight control, control of space probes,
army missions, maintenance of complex machines (e.g. submarines), assistance in the case of
natural disasters, or tutoring systems (e.g. see STEVE in section 2.4) [21].
HTN planning is a variant of automated planning. First, we will introduce
STRIPS-like planning [22] (where STRIPS stands for Stanford Research Institute Problem
Solver) and then compare it to HTN planning. The input of a STRIPS-like
planner consists of a set of facts that describe the initial state of the world, a set of goal
facts, and a set of planning operators that correspond to actions that can modify the
current state of the world. Let us denote the set of facts that describe the current state
of the world as a Base. A planning operator has a list of preconditions, a delete list,
and an add list. A planning operator can be applied if its preconditions are contained
in the Base. After a planning operator is applied, all facts that are in its delete list are
deleted from the Base and all facts that are in its add list are added to the Base. The
STRIPS-Like planner reaches the goal state of the world if the Base contains all goal
facts. After the planner is started, it searches for a sequence of planning operators
that successively change the initial state of the world to its goal state. The output of
the planner is a plan (or a list of all possible plans) that consists of a list of planning
operators such that if we successively apply these operators to the initial state of the
world, we get the goal state of the world. While a STRIPS-Like planner can try to
apply any planning operator at any step of the planning process to reach the goal state
of the world, an HTN planner can only try to apply planning operators that are defined
in the HTN at a particular step of the planning process.
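The application of a single STRIPS-like planning operator can be sketched in a few lines. This is a minimal illustration of the Base/precondition/delete-list/add-list mechanics described above; the fact tuples and operator are invented for the example.

```python
# Facts are tuples; the Base is a set of facts.

def apply_operator(base, preconditions, delete_list, add_list):
    """Return the new Base, or None if the operator is not applicable."""
    if not preconditions <= base:       # all preconditions must be in the Base
        return None
    # remove the delete list, then insert the add list
    return (base - delete_list) | add_list

base = frozenset({("at", "truck", "depot")})
new_base = apply_operator(
    base,
    preconditions={("at", "truck", "depot")},
    delete_list={("at", "truck", "depot")},
    add_list={("at", "truck", "market")},
)
```

A STRIPS-like planner then searches for a sequence of such applications that turns the initial Base into one containing all goal facts.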
HTN planning is based on task decomposition, i.e., compound tasks are decomposed
into subtasks, where each subtask is either a compound task on a lower level of
the planning hierarchy or a primitive task that corresponds to an action that can be
executed in the real world. Let us note that the primitive tasks in the HTN planning
correspond to the planning operators in the STRIPS-Like planning. The description
of the world (called planning domain in the HTN planning terminology) is given as a
Hierarchical Task Network and the planning goal (called planning problem) is given as
a list of goal tasks and a list of facts that describe the initial state of the world. The
resulting plan is a list of primitive tasks such that if we successively perform these prim-
itive tasks we accomplish the goal tasks. In the following text, we will show an example
of a planning task, introduce JSHOP1 as an implementation of an HTN planner that
we have employed in our system to generate the dialogues for our presentation team (see
section 4.1), and finally we will define some basic constructs of the JSHOP language.
3.1.1 Example of a Planning Task
Let us consider the example of a planning task depicted in Figure 3.1, which demonstrates
a typical task for an HTN planner [23]. The Hierarchical Task Network
represents ways to travel from x to y, more precisely, to accomplish the goal
task travel(x,y). We can either take a taxi for a short distance or fly by air for
a long distance. (There might also be other ways to travel that we do not consider
here.) Thus, to accomplish the compound goal task travel(x,y) we have to fulfil one of
1JSHOP2 (Java Simple Hierarchical Ordered Planner) http://www.cs.umd.edu/projects/shop/
its compound subtasks, namely, travel by taxi or travel by air. In the first case (travel
by taxi) we must first get a taxi, then ride the taxi from x to y and finally pay for it. In
the second case (travel by air) we must first buy a ticket from airport(x) to airport(y),
then travel from x to airport(x), fly from airport(x) to airport(y), and finally
travel from airport(y) to y. Thus, to fulfil the compound task travel by taxi or travel by air,
we have to satisfy all its respective subtasks. Let us note that after the planner starts,
it first finds out whether it is possible to travel by taxi, and if not, it backtracks and tries
the option of travelling by air.
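The decomposition just described can be sketched as ordered problem reduction. The airport mapping and the short-distance test below are assumptions made for illustration; a real HTN planner would derive them from facts in the planning problem.

```python
# Hypothetical world knowledge for the travel example.
AIRPORT = {"UMD": "BWI", "MIT": "Logan"}

def short_distance(x, y):
    # assumed: campus-to-local-airport legs are short enough for a taxi
    short_pairs = {frozenset({"UMD", "BWI"}), frozenset({"Logan", "MIT"})}
    return frozenset({x, y}) in short_pairs

def travel(x, y):
    # method 1: travel by taxi (applicable only for short distances)
    if short_distance(x, y):
        return [("get-taxi",), ("ride-taxi", x, y), ("pay-driver",)]
    # method 2: travel by air, decomposing the ground legs recursively
    ax, ay = AIRPORT[x], AIRPORT[y]
    return ([("buy-ticket", ax, ay)]
            + travel(x, ax)
            + [("fly", ax, ay)]
            + travel(ay, y))

plan = travel("UMD", "MIT")
```

The returned list of primitive tasks mirrors the plan of Figure 3.2: buy a ticket, taxi to the local airport, fly, and taxi onwards to the destination.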
Figure 3.1: Example of a Planning Task - HTN
The resulting plan for travelling from the UMD (University of Maryland) to the MIT
is depicted in Figure 3.2. First we have to buy a ticket from the BWI (Baltimore
Washington International) airport to the Logan airport, then take a taxi from the UMD
to the BWI airport, then fly by air from the BWI airport to the Logan airport, and
finally take a taxi from the Logan airport to the MIT.
Figure 3.2: Example of a Planning Task - generated Plan
3.1.2 Java Simple Hierarchical Ordered Planner (JSHOP)
In the following text, we will introduce the Java Simple Hierarchical Ordered Planner
(JSHOP)2 [24, 25], the HTN planner implementation that we have employed in our
system. JSHOP is a Java implementation of a domain-independent Hierarchical
Task Network (HTN) planner, developed at the University of Maryland, that is based on
ordered task decomposition. The planning is conducted by problem reduction, i.e., the
planner recursively decomposes tasks into subtasks and stops when it reaches primitive
tasks that can be performed directly by planning operators. The compound task de-
composition is realized by methods that define how to decompose compound tasks into
subtasks. Since there may be more than one method that can be applied to a compound
task, the planner can backtrack, i.e., it can try more than one method to decompose a
compound task. As a consequence, the planner can find more than one suitable plan.
The Input of JSHOP consists of a description of a planning domain and a planning
problem. The planning domain constitutes the world description, i.e., it consists of planning
methods, planning operators, and axioms. The planning problem consists of a list of
tasks and a list of facts that hold in the initial state of the world. The planning domain
description is stored in a domain file and the problem description in a problem file.
The Output of JSHOP is a list of suitable plans where each plan consists of a list of
primitive tasks and each primitive task corresponds to an action that can be executed
in the real world (e.g. utter an utterance or move object O from place X to place Y ).
Figure 3.3: JSHOP Input Generation Process
To Run the Planner, we first have to generate Java code from the respective domain
and problem files, which are written in a special Lisp-like syntax. JSHOP is implemented
2JSHOP2 (Java Simple Hierarchical Ordered Planner) http://www.cs.umd.edu/projects/shop/
in this way because this approach makes it possible to perform certain optimizations and
to produce Java code that is tailored to a particular domain and problem description [26].
See Figure 3.3. (The generated Domain Description Java file is compiled with the Domain-
Independent Templates, which results in a Domain-Specific Planner. The generated Java
Problem file is compiled as well. At the end, we can run the planner, which outputs all
possible Solution Plans.)
3.1.3 JSHOP Language
In the following text, we will describe the most important JSHOP constructs, namely:
axioms, planning operators, and planning methods. See the JSHOP manual [27] for
more details on the full syntax of the language. JSHOP contains many constructs
characteristic of an HTN planner (e.g. symbols, terms, call terms, logical atoms, logical
expressions, implication, universal quantification, assignment, call expressions, logical
preconditions, task atoms, task list, axioms, operators, and methods). Furthermore, it
is possible to write user defined functions in Java.
Axioms
An axiom is an expression of the form:
(:- a [name1] L1 [name2] L2 ... [namen] Ln)
where the head of the axiom is the logical atom a and its tail is a list of pairs (name,
logical precondition); a is true if L1 is true, or if L1, ..., Lk−1 are all false and Lk is true
(for some k ≤ n). The name of a logical precondition is optional, but it can improve
readability. Figure 3.4 shows an example of an axiom. A place ?x is in walking distance
if the weather is good and a place ?x is within two miles of home, or if the weather is
bad and a place ?x is within one mile of home.
Figure 3.4: Sample JSHOP Axiom
Operators
An operator has the following form:
(:operator h P D A [c])
where h is the operator’s head; P is the operator’s precondition; D is the operator’s
delete list; A is the operator’s add list; c is the operator’s cost where the default cost
is 1. Let us denote the set of facts that describe the current state of the world as the
facts base. The operator can be applied if the preconditions in P are satisfied. After the
operator has been applied, all facts contained in D are deleted from the facts base and
all facts contained in A are added to the facts base. Figure 3.5 shows an example of a
planning operator. We can drive a ?truck from a ?old-loc to a ?location if the ?truck
is at the ?old-loc. After the operator has been applied, the fact (at ?truck ?old-loc) is
deleted from the facts base and a new fact (at ?truck ?location) is added to the facts
base.
Figure 3.5: Sample JSHOP Operator
Methods
A method is a list of the form:
(:method h [name1] L1 T1 [name2] L2 T2 ... [namen] Ln Tn)
where h is the method’s head; each Li is a precondition; each Ti is a list of tasks; each
namei is a respective optional name. The compound task specified by the method can
be performed by performing all tasks in the list Ti if the precondition Li is satisfied and
all preconditions Lk with k < i are not satisfied. Figure 3.6
presents an example of a method. The task specified by this method is to eat a ?food.
If we have a fork then we eat the ?food with a fork. If we do not have a fork but we
have a spoon then we eat the ?food with a spoon.
Figure 3.6: Sample JSHOP Method
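The if/else-like semantics of a JSHOP method can be sketched in a few lines: the first branch whose precondition holds is taken, which implies that all earlier preconditions failed. The fact and task names below mirror the eat example but are illustrative, not JSHOP syntax.

```python
def decompose(branches, facts):
    """branches: ordered list of (precondition_set, task_list) pairs."""
    for precondition, tasks in branches:
        if precondition <= facts:
            return tasks        # first satisfied branch wins
    return None                 # the method is not applicable

eat = [({"have-fork"}, ["eat-with-fork"]),
       ({"have-spoon"}, ["eat-with-spoon"])]
```

With both utensils available, the fork branch shadows the spoon branch, exactly as the ordered preconditions of Figure 3.6 prescribe.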
3.2 Expert Systems
Expert systems can also be employed to generate commentary on a sports event, as we
have shown with ERIC (see section 2.1). Nevertheless, we have employed an expert system
only in the emotion module to define emotion eliciting conditions (see section 4.2.3).
Expert systems are used in many domains to “replace” human experts. The know-how
of human experts is first stored in the system. Afterwards, the system can be queried
by users, who always get consistent answers. A disadvantage of such a system, however,
is that it is not well suited to changing environments. Expert systems can, for instance,
be used in the following domains: financial services, accounting, production,
process control, medicine, or human resources. Examples of expert systems are CLIPS
(C Language Integrated Production System) [28] and its Java reimplementation
Jess (Java Expert System Shell) [29], which we have employed in our system.
Expert systems reason about the world using knowledge that consists
of facts and rules. While the facts describe the current world in terms of assertions, the
rules define how to modify the facts base (knowledge base), e.g., how to deduce new
facts from already known facts, where each rule has the form of an if-then clause. Let
us note that it is also possible to retract or modify facts as a result of a rule being fired.
The inference loop of a typical expert system consists of the following three steps:
1. Match the left hand side of the rules against facts and move matched rules onto
the agenda.
2. Order the rules on the agenda according to some conflict resolution strategy (e.g.
at random).
3. Execute the right hand side of the rules on the agenda in the order decided by
step (2).
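The three-step loop above can be sketched as a toy forward-chaining engine. This is illustrative only: a real engine such as Jess uses the Rete algorithm for matching and richer conflict resolution strategies, and here a rule is simplified to a pair (preconditions, fact to assert).

```python
def infer(facts, rules):
    facts = set(facts)
    fired = []
    while True:
        # 1. match left-hand sides against the facts to build the agenda
        agenda = [r for r in rules if r[0] <= facts and r[1] not in facts]
        if not agenda:
            break               # no new facts can be inferred
        # 2. conflict resolution: here, simply keep definition order
        pre, new_fact = agenda[0]
        # 3. execute the right-hand side (assert the new fact)
        facts.add(new_fact)
        fired.append(new_fact)
    return facts, fired

# chained rules in the spirit of the lunch example below
rules = [(frozenset({"lunchtime"}), "food-available"),
         (frozenset({"hungry", "food-available"}), "had-lunch")]
facts, fired = infer({"lunchtime", "hungry"}, rules)
```

Note how the second rule only becomes applicable after the first one fires, which is exactly the chaining behaviour discussed with the Jess example further below.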
The inference loop ends when no new facts can be inferred. After the inference process
ends, we know which rules have been fired, and the fact base contains all initial and
inferred facts that have not been retracted. In the following text, we will present the
implementation of an expert system that we have employed in our system.
Java Expert System Shell (Jess)
Jess [29] is a fast Java implementation of an expert system developed at Sandia National
Laboratories. Although it has a rich Lisp-like syntax, we will show only two examples:
one that defines an unordered fact and one that defines a rule. See [29] for more
details on the complete syntax of the language.
Unordered Fact - Every fact corresponds to a particular template. The definition
of a template starts with the keyword deftemplate followed by a template name and an
optional documentation comment. The following template is an example of how to define
an automobile. The template contains four slots: the manufacturer, the model, the year
of production as an integer, and the colour, where red is the default.
(deftemplate automobile
"A specific car."
(slot make)
(slot model)
(slot year (type INTEGER))
(slot color (default red))
)
The following command asserts a concrete Volkswagen Golf that was produced in 2009
and is of the default red colour.
(assert (automobile (model Golf)(make Volkswagen)(year 2009)))
Rule - Consider the following templates. The first template defines an agent that has
a name and can be hungry, the second template defines the current time.
(deftemplate agent
"A hungry agent"
(slot name)
(slot hungry)
)
Chapter 3. Methods for Controlling Behaviour of Virtual Agents 24
(deftemplate current_time
"The current time"
(slot ctime (type FLOAT))
)
The following commands assert the agent George, who is hungry, and the current time,
which is half past twelve.
(assert (agent (name George)(hungry TRUE)))
(assert (current_time (ctime 12.5)))
Consider the following rules that are chained.
(defrule open_cafeteria
(current_time {(12.0 <= ctime && ctime <= 14.0)})
=>
(assert (food-available))
)
(defrule have_lunch
?agent <- (agent (name ?name) (hungry TRUE))
(food-available)
=>
(modify ?agent (hungry FALSE))
(printout t ?name " had lunch." crlf)
)
The first rule opens the cafeteria if the current time is between 12 and 14, asserting the
fact that food is available at the moment. Thus, the rule fires, since the current time
is 12.5, and adds the new fact (food-available) to the facts base. The second rule fires if
there is an agent that is hungry and food is available. Hence, the second rule fires
as well, prints out “George had lunch.”, and modifies the respective fact (i.e. the slot
hungry is set to FALSE ).
3.3 Statecharts
Another method that can be employed to control virtual agents is statecharts. In our
system, we have used finite state machines to maintain different states of the system
(see section 5.3.1). However, statecharts can also be used to generate speech. An
example of a tool that enables the control of virtual agents using statecharts is SceneMaker
[30]. A user can create an arbitrary statechart using SceneMaker to describe the behaviour
of virtual agents. In every node of a statechart, a scene is stored. A scene can, for
instance, describe a dialogue between two virtual agents, i.e., the scene is written in a
theatre script-like language and consists of utterances that are annotated with gestures.
A statechart can also contain several types of edges that define transitions
between nodes (e.g. a timeout edge, a conditional edge, or a probability edge).
The difference between SceneMaker and our approach is that while SceneMaker performs
one of the pre-defined scenes at a node, we first run the HTN planner to generate the
scene, which is then performed. Nevertheless, we have employed only three
simple finite state machines to maintain the basic states of our system; the logic itself is
implemented in the domain description of the HTN planner.
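A finite state machine of the kind we use can be represented as a plain transition table. The state and event names below are assumptions made for illustration, not taken from the IVAN system.

```python
TRANSITIONS = {
    ("idle", "rally-started"): "commenting-rally",
    ("commenting-rally", "important-event"): "interrupting",
    ("interrupting", "comment-done"): "commenting-rally",
    ("commenting-rally", "rally-finished"): "summarizing",
    ("summarizing", "summary-done"): "idle",
}

def step(state, event):
    # events with no transition defined leave the state unchanged
    return TRANSITIONS.get((state, event), state)

state = "idle"
for event in ("rally-started", "important-event", "comment-done",
              "rally-finished", "summary-done"):
    state = step(state, event)
```

Keeping the machine this small is deliberate: the states only gate when the planner is run, while the dialogue logic lives in the HTN domain description.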
SceneMaker was employed in several projects: CrossTalk [31], VirtualHuman [32],
IDEAS4Games [33], and COHIBIT [34, 35]. For instance, the purpose of the COHIBIT
project is to provide knowledge about car technology and virtual agents in an entertaining
way. Two virtual agents interact with users and give them advice on how to build a
car from different car pieces. The system is informed about the presence of users via
cameras, and about the location and orientation of the car pieces via RFID technology.
An overview of the COHIBIT system is depicted in Figure 3.7.
Figure 3.7: Overview of the COHIBIT system
Chapter 4
Generating Dialogue
In this chapter, we will explain how we generate affective commentary on a tennis game
for our two virtual agents. First, we will describe how we generate dialogues using an
HTN planner. Then, we will describe how we generate a piece of a dialogue that conveys
a particular attitude of a virtual agent to a player, how we maintain the affective state
of a virtual agent, and how a particular affect can be conveyed by different modalities.
4.1 Commentary Planning
In this section, we will describe how we generate the dialogues for our presentation team
that consists of two virtual agents. We have employed the JSHOP planner (see sec-
tion 3.1) to generate the commentary where the generated plans correspond to possible
dialogues in which the presentation team can be engaged. The planner is triggered at
particular states of the tennis game, receives facts that describe the current state of the
game, and outputs all possible plans. A detailed description of the states in
which the planner is triggered, of the input facts, and of how the generated plans are executed
will be given in Chapter 5. In this section, we will therefore focus only on the dialogue
generation, i.e., on the dialogues in which our commentary team can be engaged in distinct
states of the tennis game, according to the facts that describe the game and the
background of the players and the tournament.
4.1.1 Motivation
The overall goal of our system is to automatically generate interesting, suitable, coherent,
and affective commentary in real-time, from different points of view depending on the
commentators' attitudes to the players. To investigate what real tennis commentators
say during a game, we analysed several tennis games from YouTube1. We found
that there are usually two commentators, where the second is typically a former tennis
player or an expert in the field who can always provide additional background information.
We also found that the commentary is to some extent driven by the states of the game,
e.g., nobody talks while the serving player concentrates before the serve, the commentators
engage in small talk about the players' background when there is nothing else to
comment on, and the commentators usually summarize every rally after it finishes. Thus,
for instance, the statechart approach of the SceneMaker project (see section
3.3) would also be convenient here; we have therefore employed finite state machines to decide
when to run the planner according to the states of the tennis game.
We have also noticed that the information conveyed by a sports commentator often does
not add much to what an ordinary spectator can perceive while watching
the same tennis game. Since we wanted our commentary to be more sophisticated, we
took inspiration from the TennisEarth2 web page, which describes tennis matches (rally
by rally) for tennis fans who have not seen them. As a consequence, the commentary on
TennisEarth is more elaborate and was inspiring for us. We also wanted to incorporate
more background knowledge, since a standard tennis match is usually long-winded and
there is often nothing to comment on; we have therefore made use of the OnCourt3 project
as a source of background knowledge about players and tennis tournaments.
As we have already stated, the commentators have positive, neutral, or negative attitudes
to the players. Since standard live commentary is usually balanced, except at
particular international tournaments, we had to add the respective bias to our utterances.
Let us note that biased utterances usually convey particular affects. To meet the
real-time requirement, we had to make sure that the dialogues are not too long. However,
we can predict the time at our disposal for a comment from the state
of the tennis game. For instance, we always have more time to comment on a just-finished
game than on an event that happens within a rally. Nevertheless, these predictions
are only rough approximations, so we had to allow interruptions, i.e., to interrupt
the current plan if a more relevant event happens. The coherence of the commentary is
ensured by the dialogue planning that is elaborated in the next section.
1http://www.youtube.com/
2http://www.tennisearth.com/
3http://www.oncourt.info/
4.1.2 Dialogue Planning
To represent our presentation team, we have employed two virtual agents that have
different roles, attitudes to the players, and audio-visual appearance. The first commentator
is the Charamel virtual agent Mark, who represents a TV tennis commentator,
and the second Charamel virtual agent is Gloria, who represents a tennis expert. (See
section 5.1.3 for more details on the Charamel avatar engine.) While Mark should
concentrate on simple facts concerning the tennis game, Gloria should elaborate on
these facts. Let us remember that all dialogues are based on the commentators' attitudes
to the players, which can be positive, neutral, or negative.
Dialogue Schemes
We were inspired by the dialogue schemes presented in the Inhabited Marketplace project
(see section 2.5.2). A dialogue scheme is a generic representation of a piece of dialogue
that can be generated by a planner under certain conditions. Let us note that dialogue
schemes correspond to methods in HTN planning. Let us also remember that
in HTN planning, the compound goal task is decomposed by planning methods into
subtasks, where each subtask is either a planning operator, which corresponds to a
template (representing an utterance), or a compound task that is further decomposed
by planning methods. Consider the planning method depicted in Figure 4.1.
Figure 4.1: Example of a Planning Method
Let us assume that player ?P1 has played a winning return (i.e. player ?P2 has lost
the rally) and the subgoal task deduced by the planner from the goal task according
to the current state of the game is the compound task “comment on rally”. Thus, we
can satisfy the compound task “comment on rally” by performing the BODY of the
planning method if the PRECONDITIONS of the planning method can be satisfied (i.e.
?A is a commentator, ?B is an expert, player ?P1 has played a winning return, player
?P2 has lost the rally, and ?A and ?B both have a positive attitude to player ?P1 ). Figure
4.1 also presents an example of a possible dialogue that can be generated by applying
this planning method, assuming that the BODY of the planning method consists only of
two planning operators (i.e. no compound tasks) and that the variables ?P1, ?P2, ?A and ?B
stand for the respective players, commentator, and expert. We have already stated that
all dialogue schemes are based on the commentators' attitudes to the players; nevertheless,
the semantics of a dialogue scheme can take one of the forms defined in Table 4.1.
While the left column defines the individual dialogue schemes, the right column presents
an example of a possible generated dialogue for each dialogue scheme.
Dialogue Scheme Example of a Generated Dialogue
A: argument for/against X A: “That serve was really phenomenal!”
B : contrary B : “Well, that is a little exaggerated!”
A: argument for/against X A: “Blake is in great shape as usual.”
B : contrary B : “But he already produced several unforced errors.”
A: override A: “Still, he is the best player on the court.”
A: argue for X A: “Excellent return by Safin.”
B : elaborate on X B : “Unreachable for Blake”.
A: background fact X A: “The brother of Blake Thomas is a well known player.”
B : evidence of X B : “His best ranking was the 141st place in 2002.”
A: background fact X A: “Roddick has been 4 times injured recently.”
B : consequence of X B : “It will be hard to break through today.”
Table 4.1: Dialogue Schemes
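Which scheme from Table 4.1 applies depends on the commentators' attitudes. The selection rules below are a hedged sketch of that dependency, not the actual precondition logic of our planning methods.

```python
def choose_scheme(commentator_attitude, expert_attitude):
    """Pick a Table 4.1 scheme from attitudes to the player who just scored."""
    if commentator_attitude == "positive" and expert_attitude == "positive":
        return "argue for X / elaborate on X"
    if commentator_attitude != expert_attitude:
        # opposing or mixed attitudes invite argument and contradiction
        return "argument for/against X / contrary"
    if commentator_attitude == "neutral":
        return "background fact X / evidence of X"
    return "background fact X / consequence of X"
```

In the real system, each scheme is a planning method whose preconditions test exactly these attitude facts, so the choice is made by the planner rather than by explicit branching.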
Planning Large Dialogue Contributions
We have already shown how to generate a simple dialogue. In the following text, we
will describe how to generate large dialogue contributions that consist of several simple
dialogues. Consider a part of a planning tree that is depicted in Figure 4.2 where all
nodes stand for compound tasks. Imagine that a game has finished and the subgoal task
of the planner deduced from the goal task is the compound task “comment on just fin-
ished game”. Hence, to satisfy the compound task “comment on just finished game”, we
have to satisfy all its compound subtasks, namely: Introduction, Body, and Conclusion.
Similarly, to satisfy the compound task Body, we have to satisfy all its compound sub-
tasks, namely: comment on score, comment on winning team, and comment on losing
team. The decomposition of the compound subtasks comment on winning team and
Figure 4.2: Example of a Compound Task Decomposition
comment on losing team is analogous. Every leaf of the subtree depicted in Figure
4.2 corresponds to at least one planning method that decomposes respective compound
task. The compound task decomposition is accomplished by a planning method that
stands for a dialogue scheme or by a planning method that represents a hierarchy of
dialogue schemes, i.e., the compound task can be decomposed into several dialogue
schemes depending on the facts that hold in the current description
of the world (e.g. commentators’ attitudes to the players). The following list presents
a possible generated dialogue that summarizes a game that has just finished (where C
and E stand for a commentator and an expert, respectively).
Introduction
E : “What a relief!”
C : “Tight game, let’s summarize it.”
Comment on Score
C : “Blake and Roddick won the first game.”
E : “That’s unbelievable that they broke opponents’ serve!”
C : “That was spectacular!”
Comment on winning team - Highlights
C : “Blake and Roddick played an excellent game.”
E : “Well, they played several excellent winning returns.”
Comment on winning team - Difficulties
C : “Can you say something about difficulties of Blake and Roddick?”
E : “They were already trailing.”
C : “But they recovered.”
Comment on winning team - Odds
C : “Are Blake and Roddick going to win the match?”
E : “They are my favourites!”
Comment on losing team - Difficulties
C : “What difficulties did Safin and Ferrer have?”
E : “They made many unforced errors.”
Comment on losing team - Odds
C : “Do Safin and Ferrer have any chance to win?”
E : “Well, they can still break through.”
Conclusion
C : “Let’s see the next game.”
E : “Definitely.”
4.1.3 Planning Tree
In this section, we will describe our planning tree that represents the hierarchy of all
dialogues that can be generated. The planning tree is defined as a Hierarchical Task
Network (HTN) in the planning domain of the JSHOP planner (see section 3.1). The
root of the planning tree is the goal task, any internal node of the planning tree is a
compound task (i.e. a possible subgoal task), and every leaf of the planning tree is
either a primitive task that corresponds to a template (that represents an utterance) or
a reference to a particular compound task that is an internal node of the planning tree.
Let us consider Figure 4.3. To satisfy a compound task, we have to either satisfy all its
descendants (1), one arbitrary descendant that can be satisfied (2), or we have to satisfy
the first descendant that can be satisfied (3).
Figure 4.3: Possible Decompositions of a Compound Task
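The three decomposition types can be sketched as follows. This is our own formalization for illustration, with the satisfiability of each descendant given as a boolean; the class and method names are not part of the system.

```java
// Sketch of the three decomposition types in Figure 4.3.
public class Decomposition {

    // (1) Satisfy ALL descendants.
    public static boolean satisfyAll(boolean... descendants) {
        for (boolean d : descendants) if (!d) return false;
        return true;
    }

    // (2) Satisfy ONE arbitrary descendant that can be satisfied.
    public static boolean satisfyAny(boolean... descendants) {
        for (boolean d : descendants) if (d) return true;
        return false;
    }

    // (3) Satisfy the FIRST descendant that can be satisfied; returns its
    // index, or -1 if no descendant can be satisfied.
    public static int satisfyFirst(boolean... descendants) {
        for (int i = 0; i < descendants.length; i++)
            if (descendants[i]) return i;
        return -1;
    }
}
```

Type (1) corresponds to an ordered task list, while types (2) and (3) correspond to alternative planning methods for the same compound task.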
Chapter 4. Generating Dialogue 32
The root of our planning tree is the goal task “Comment”. Figure 4.4 depicts how the
goal task “Comment” is decomposed into subgoal tasks depending on the state of
the game, e.g., the presentation team is engaged in dialogues introducing the upcoming
game if the game is just at the beginning, or they summarize a rally just after it
finishes.
Figure 4.4: Decomposition of the Goal Task “Comment”
Figure 4.5 shows the further decomposition of the compound task Comment on rally,
which is a subgoal task of the goal task “Comment”. Thus, our presentation team
comments on the result of the last rally depending on its outcome, e.g., the presentation
team can comment on an excellent ace or a winning return played by a player.
Figure 4.5: Decomposition of the Subgoal Task Comment on rally
Figure 4.6 depicts the whole decomposition path from the goal task “Comment” to the
subgoal task “Drop Volley” which results in a commentary on a rally that finished with
a winning return that was a drop volley (i.e. the player won the rally by hitting the
ball before it bounced and placing it just behind the net).
Figure 4.6: Decomposition of the Goal Task “Comment” that leads to a Subgoal Task Drop Volley
4.1.4 Commentary Excerpt
In this section, we will show an example of a generated dialogue where the players of
the serving team are: Blake and Roddick and the players of the receiving team are:
Safin and Ferrer. In this example, the dialogues are unbiased, i.e., the attitude of the
commentators is neutral since we would like to show how detailed the commentary can
be supposing that there is enough time to utter it. The state of the game and the
subgoal of the planner are mentioned before each dialogue. Let us note that C stands
for a commentator and E stands for a tennis expert. Another commentary excerpt is
shown in Appendix A.
Beginning - Introduction to the upcoming game
C : “Ladies and Gentlemen! Welcome to the Wimbledon semi-final in doubles.”
E : “We will guide you through the match in which James Blake and Andy Roddick
are playing versus Marat Safin and David Ferrer.”
C : “Enjoy the show!”
Rally in Progress - Serving Player’s Background
C : “Roddick has been injured 4 times since last year.”
E : “It will be hard to break through today.”
Rally in Progress - Comment on a nice shot
E : “What a shot!”
Rally finished - Summarize the rally (score: 15:0)
C : “What a Forehand by Roddick.”
E : “Roddick hit an excellent forehand-volley right into the left corner.”
C : “Roddick took advantage of a weak forehand return from Safin.”
Rally in Progress - Players’ Background
C : “James Blake’s brother Thomas is also playing tennis.”
E : “His best ranking was in 2002 when he occupied the 141st place in doubles.”
Rally in Progress - Comment on a nice shot
C : “What a shot by Roddick!”
Rally finished - Summarize the rally (score: 30:0)
C : “What a long rally!”
E : “Ended by an inaccurate backhand-volley by Safin.”
C : “30:0”
E : “Blake and Roddick are holding their serve so far.”
Rally in Progress - Background
E : “The weather is cloudy today.”
C : “Hopefully it won’t be raining.”
Rally finished - Summarize the rally (score: 30:15)
C : “Nice high lob by Safin.”
E : “Too high for Roddick.”
C : “Caused unforced error by Blake.”
4.2 Affect
In the following sections, we will explain why it is important to generate affective
commentary on a tennis game and how affect can be conveyed by different modalities.
We will explain two methods that we have employed to generate affective commentary
on a tennis game and discuss the pros and cons of this approach.
4.2.1 Motivation
In this section, we will clarify how important it is to incorporate emotions into the
commentary and how affect can be expressed. In general, virtual agents are better
accepted by users if they are endowed with emotions [2]. Different personality profiles
and affect make virtual agents more distinguishable, which is beneficial to the creation
of presentation teams. We were inspired by the concept of presentation teams described
in section 2.5. Thus, we have employed two distinct virtual agents that
have different roles (commentator, expert), attitudes to the players (positive, neutral,
negative), and personality profiles (defined by: optimistic, choleric, extravert, neurotic,
social). Two affective virtual agents can also represent opposing opinions better and
are more entertaining than a single presenter. Moreover, users should recall the
conveyed facts better.
There can be many exciting moments in a tennis game. For example, to win a tennis
game a player must score at least four points in total and two points more than the
opponent; thus the finish of a tennis game can be quite thrilling, since there can be many
game and break points (i.e. situations where the serving or receiving player needs only
one point to win the game). Therefore, our virtual agents should react affectively to
events that, e.g., lead to the victory of their favourite player or that lower his odds of winning. The
current affect of a virtual agent can be expressed by dialogue scheme selection, lexical
selection (i.e. choice of an appropriate utterance according to the current affect), gaze,
facial expression, and hand and body gestures.
4.2.2 Planning with Attitude
In this section, we will describe how a particular affect can be conveyed via the choice
of a corresponding dialogue scheme where a dialogue scheme is a generic definition of
a piece of dialogue (see section 4.1.2). As we have already stated, a virtual agent can
have positive, neutral, or negative attitude to a player. Let us note that almost every
topic of the commentary is related to a specific event (e.g. a player has just scored, a
player has lost the lead). Thus, every such event can be appraised by a virtual agent as
desirable or undesirable according to his/her attitude to the players (e.g. it is desirable
when my favourite player gets a point or undesirable when he loses the lead). Hence, a
virtual agent will comment in a positive way on a desirable event and in a negative way
on an undesirable event. Each event is also usually connected with a particular player,
thus a virtual agent will comment in a positive way on actions of a player s/he likes and
in a negative way on actions of a player s/he dislikes. A virtual agent that has a neutral
attitude to a player will comment in a neutral way on events that are connected with
the respective player.
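This appraisal rule can be formalized as follows. The sketch below is our own simplification for illustration; the enum and method names are not part of the system.

```java
// Sketch of the appraisal rule: an event connected with a player is
// appraised through the virtual agent's attitude to that player.
public class Appraisal {
    public enum Attitude { POSITIVE, NEUTRAL, NEGATIVE }
    public enum Valence { DESIRABLE, NEUTRAL, UNDESIRABLE }

    // goodForPlayer: true if the event favours the player it is connected
    // with (e.g. he scored), false if it harms him (e.g. he lost the lead).
    public static Valence appraise(Attitude attitudeToPlayer, boolean goodForPlayer) {
        switch (attitudeToPlayer) {
            case NEUTRAL:  return Valence.NEUTRAL;   // neutral commentary
            case POSITIVE: return goodForPlayer ? Valence.DESIRABLE
                                                : Valence.UNDESIRABLE;
            default:       return goodForPlayer ? Valence.UNDESIRABLE
                                                : Valence.DESIRABLE;
        }
    }
}
```

The resulting valence then determines whether the agent comments in a positive, neutral, or negative way.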
Let us consider a dialogue that consists of two utterances that are uttered by two virtual
agents. Let us assume that the dialogue is either related to an event that can be
appraised as positive, neutral, or negative, or related to a player to whom a virtual
agent has a positive, neutral, or negative attitude. Table 4.2 presents examples of
possible generated dialogues where A and B stand for respective commentators. The first
column represents a particular combination of appraisals of an event or a combination of
attitudes to a player that is related to a particular event. The second column represents
a dialogue scheme of a possible dialogue where X stands for a player’s action or a fact.
The third column represents an example of a generated dialogue.
Appraisal Dialogue Scheme Example of a Generated Dialogue
A: positive A: argue for X A: “Outstanding ace by Blake!”
B : positive B : support X B : “Blake hits blistering serve down the line!”
A: positive A: argue for X A: “Excellent forehand by Safin!”
B : negative B : play down X B : “That’s a bit overstated.”
A: negative A: point out fault X A: “Safin failed to get the ball over the net.”
B : positive B : excuse X B : “Safin just overhits the serve.”
A: neutral A: convey fact X A: “The score is already 30:0.”
B : negative B : consequence of X B : “Safin and Ferrer are real losers as usual!”
A: neutral A: convey fact X A: “Deuce again.”
B : neutral B : elaborate on fact X B : “Safin and Ferrer got back on board.”
Table 4.2: Example of Generated Dialogues based on different Appraisals
Thus, we have shown how a particular affect can be conveyed via the choice of an
appropriate dialogue scheme. Let us note that the pieces of a generated dialogue are
individual utterances where an utterance is usually uttered by a virtual agent in a
particular situation that is correlated with a particular affect. Therefore, we annotated
each utterance with default gesture and facial expression tags to seamlessly convey a
particular affect by an utterance. Nevertheless, these tags are only defaults and can be
substituted by other tags generated by other modules. For instance, the facial expression
can also be set according to the current affective state of a virtual agent generated by
the emotion module that is described in the next section.
4.2.3 OCC Generated Emotions
In this section, we will describe the emotion module that models the affective state
of each virtual agent according to the OCC (Ortony, Clore, Collins) cognitive model
of emotions [36, 37]. We simulate eight basic OCC emotions that are relevant to the
tennis commentary. These emotions are explained in Table 4.3. The emotion module is
initialized with the personality of each virtual agent that is defined by five personality
traits listed in Table 4.4.
OCC Emotion Description
JOY Something happened that I wanted to happen.
DISTRESS Something happened that I did not want to happen.
HOPE Something may happen that I really want to occur.
FEAR Something may happen that I wish to never occur.
RELIEF Something bad did not happen.
DISAPPOINTMENT Something did not happen that I really wanted to occur.
SATISFACTION Something happened that I really wanted to occur.
FEAR-CONFIRMED Something bad did actually happen.
Table 4.3: Description of the eight Basic OCC Emotions
Personality Trait
optimistic
choleric
extravert
neurotic
social
Table 4.4: Five Personality Traits
The input of the emotion module consists of facts that our system deduces from the
elementary events received from the tennis game. The main functionality of the emotion module4 is
implemented in Jess (see section 3.2). The goals and antigoals of a virtual agent are
deduced from his/her attitude to the players, e.g., virtual agent A that has a positive
attitude to player P wants P to win the game, conversely, virtual agent B that has a
negative attitude to player P wants P to lose the game. The events that happen in the
tennis game are appraised as desirable if they lead to the goal or undesirable if they
hinder the goal. The conditions that elicit emotions based on the events that happen in
the tennis game are called emotion eliciting conditions. The appraisals of the emotion
eliciting conditions then generate particular emotions with respective intensities where
the initial intensity of a particular emotion depends on the personality of the respective
virtual agent. The affective state of a virtual agent is represented by a vector of intensi-
ties of each emotion where, for instance, the emotion with the highest intensity can be
considered as the output of the emotion module. Since the emotions decay over time,
the emotion module maintains the emotion decay using, e.g., a linear decay function.
Table 4.5 shows examples of events that elicit respective emotions.
4The definitions of the OCC emotions (in source file occ.clp) were provided by Michael Kipp (DFKI).
OCC Emotion Event
JOY My favourite player scored.
DISTRESS My favourite player lost a point.
HOPE My favourite player is now leading.
FEAR My favourite player is now trailing.
RELIEF My favourite player settled the score.
DISAPPOINTMENT My favourite player lost the lead.
SATISFACTION My favourite player won the game.
FEAR-CONFIRMED My favourite player lost the game.
Table 4.5: Example of Events that elicit respective Emotions
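The affective state described above can be sketched as follows. This is our own minimal simplification: an intensity vector over the eight OCC emotions, a linear per-tick decay, and the highest-intensity emotion as the module's output; the class and method names are illustrative, and the real module computes the initial intensities from the personality traits.

```java
// Minimal sketch of the affective state: a vector of OCC emotion
// intensities in [0, 1] with a linear decay applied every tick.
public class EmotionState {
    public static final int JOY = 0, DISTRESS = 1, HOPE = 2, FEAR = 3,
            RELIEF = 4, DISAPPOINTMENT = 5, SATISFACTION = 6, FEAR_CONFIRMED = 7;

    private final double[] intensity = new double[8];

    // An appraised event elicits an emotion with an initial intensity; in
    // the real module this intensity depends on the agent's personality.
    public void elicit(int emotion, double initialIntensity) {
        intensity[emotion] = Math.min(1.0, Math.max(intensity[emotion], initialIntensity));
    }

    // Linear decay: each intensity is reduced by a fixed rate per tick.
    public void decay(double rate) {
        for (int e = 0; e < intensity.length; e++)
            intensity[e] = Math.max(0.0, intensity[e] - rate);
    }

    // The emotion with the highest intensity is taken as the output.
    public int dominantEmotion() {
        int best = 0;
        for (int e = 1; e < intensity.length; e++)
            if (intensity[e] > intensity[best]) best = e;
        return best;
    }

    public double intensityOf(int emotion) { return intensity[emotion]; }
}
```

For instance, after eliciting JOY at 0.8 and FEAR at 0.3, one decay step with rate 0.4 leaves JOY as the dominant emotion and FEAR fully decayed to zero.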
Figure 4.7 depicts the GUI of the emotion module. The left part of the chart depicts
the current intensities of respective emotions for the first virtual agent and the right
part of the chart depicts corresponding data for the second virtual agent. The dynamic
bar chart was created using the JFreeChart5 library. There is also a log for each virtual
agent that lists all events that have caused a particular emotion from the beginning of
the tennis game. (Let us remark that Figure 4.7 depicts only the last two events.) Each
log entry consists of the emotion name, initial intensity, and the cause description.
Figure 4.7: Emotion Module GUI
5Andreas Viklund. The JFreeChart Class Library. http://www.jfree.org/jfreechart/
The output of the emotion module is currently employed to set and update the facial
expression of each virtual agent every second. Nevertheless, it could also be used for
gesture and lexical selection or as an input to the planner (if we had dialogue schemes
based on the OCC emotions).
4.2.4 Discussion
In this section, we will explain why we have employed two methods to simulate emotions
and which other options we have considered. As we have already stated, all dialogue
schemes are based on virtual agents’ attitudes to the players. Nevertheless, we could
have based the dialogue schemes also on the virtual agents’ current emotions. In this
case, we would have first derived the current emotion for each virtual agent and then
we would have tried to find an appropriate dialogue scheme. Nevertheless, in this case,
we would have had to face a substantial growth in the number of dialogue schemes
and a subsequent growth in the number of templates that represent individual
utterances, since we would have needed dialogue schemes for all meaningful combinations
of emotions that the virtual agents can have.
However, we noticed that the positive appraisals usually correspond to emotions such as:
joy, hope, satisfaction, and relief, and that the negative appraisals usually correspond
to emotions such as: distress, disappointment, fear, and fear-confirmed. Therefore, we
could simplify the design of the planning domain and base the dialogue schemes only on
virtual agents’ attitudes to the players and derive the specific emotion in a separate
emotion module. Such a specific emotion can then be expressed by the other modalities
(e.g. facial expression, gaze, gestures, lexical selection), apart from dialogue scheme selection.
Nevertheless, if we had had the specific emotion of each virtual agent as an input of the
planner, we could have also generated plans where the emotions could have changed at
some point as a reaction to what the other agent would have said. However, this option
is not useful in our case since both virtual agents share the same knowledge about the
tennis game, and the emotion of a virtual agent should correspond to the current state
of the game and not substantially change, for instance, from joy to distress if the virtual
agent’s favourite player is winning but the other virtual agent has just said something
bad about the winner.
Nevertheless, the option to change the emotion at some point of a plan would be useful if
the virtual agents had different knowledge about the tennis game such that an utterance
uttered by one virtual agent could have substantially changed the emotion of the other
virtual agent (e.g. one virtual agent would have made the other virtual agent happy if
s/he had told him/her that his/her favourite player had just won the game). To change
the emotion at some point in a plan would also be useful if the plans were longer (in
our case, the longest plan is the commentary on a just finished game). However, it is
hard to imagine that a virtual agent that is very happy because his/her favourite player
has just won the game would change his/her emotion, e.g., from joy to distress just
because the other virtual agent said something bad about a player s/he likes.
We have written a separate emotion module since we wanted to simulate the emotional
state of each virtual agent more precisely, e.g., we wanted to maintain the emotion decay,
which would be infeasible in the planner. We could also have used some off-the-shelf
software to simulate the emotional state of each virtual agent. Nevertheless, we wanted
to simulate the emotions in a transparent way so that we could clearly see which event
had elicited which emotion and which emotion currently prevailed. We also wanted to
have full control over the module (i.e. we can adjust the computation of the initial
intensities of individual OCC emotions depending on the personality, define our own
decay function, and control the input and output tags). Therefore, we
did not use any “black box” such as ALMA [15], although ALMA is in general a good
choice to simulate the affective state of a virtual agent since it additionally maintains
the history and emotion blending.
The emotion module and the planner run independently. The planner cannot update the
emotion module since not every plan that is generated is also executed. Additionally,
the time of the plan generation and the time of the plan execution are different. The
emotion module could pass the current emotional states of the virtual agents to
the planner; nevertheless, we do not need the exact emotional states of the virtual agents
in the planner, since our dialogue schemes are based only on the virtual agents’ attitudes to
the players.
Chapter 5
Architecture
In this chapter, we will introduce individual modules of our system and describe how
they cooperate to generate a commentary on a tennis game for our presentation team
based on elementary events that are produced by a tennis simulator in real-time. The
system consists of several modules that are running in separate threads and communicate
via shared queues. For each module, we will describe its task and how it communicates
with other modules, i.e., what the inputs and outputs of a particular module are. First,
we will introduce the tennis simulator that produces elementary events (e.g. a player
plays a forehand, the ball crosses the net, the ball lands out). Then, we will describe
the plan generation, i.e., how we generate plans based on the knowledge deduced from
the elementary events received from the tennis simulator, where a plan represents a particular
dialogue. Afterwards, we will explain how these generated plans are executed, i.e., how
we select plans from all the plans generated in the previous step. Our presentation team
is then engaged in dialogues that correspond to the selected plans.
5.1 System Overview
In the following sections, we will present the main design aims, introduce the overall
architecture of the system, and present the off-the-shelf components that are employed
in the system. We will discuss the advantages of the modular architecture of the system,
how to ensure reactivity, and the need for extensibility. Finally, we will
briefly introduce individual modules of the system and how they cooperate to produce
a commentary on a tennis game.
5.1.1 Design Aims
The system was designed with three main design aims, namely: modularity, reactivity,
and extensibility, that will be described below.
Modularity
The overall system is broken down into individual modules, where each module provides
a clearly defined interface and functionality. Each module runs in a separate thread
and communicates asynchronously with other modules via shared queues. This approach
is advantageous since each module can be tested separately and possibly replaced by
another module that implements the same interface.
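This producer/consumer pattern can be sketched as follows. The sketch is our own illustration of the communication style, not the actual module code; the module names and the message string are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of two modules running in separate threads and communicating
// asynchronously via a shared queue, as in our architecture.
public class ModulePipeline {

    // Simulates one producer/consumer exchange and returns the message
    // that the consumer module received.
    public static String runOnce() {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);

        // Producer module (e.g. the event manager) in its own thread.
        Thread producer = new Thread(() -> {
            try {
                queue.put("rally-finished");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        try {
            // Consumer module (e.g. the discourse planner) blocks until a
            // message arrives, so no busy waiting is needed.
            String message = queue.take();
            producer.join();
            return message;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```

Because each module only sees the queue, either side can be replaced by another implementation of the same interface without affecting the rest of the system.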
Reactivity
The system should be able to react quickly to new events. Evidently, reactivity is
closely related to modularity, which facilitates not only parallel execution on multi-core
platforms but also the possibility of interruptions, i.e., one module can cause the
interruption of another module by sending an asynchronous message. The response time
of each module must be reasonably bounded as well.
Extensibility
Since we wanted to participate in GALA 2009 (see section 1.2), we had to rapidly
develop a demo application at that time. As a consequence, the overall design had to
allow for a simple initial implementation of the functionality and its subsequent refinement. This aim
is also related to the modularity since individual modules can be added, replaced, or
separately improved.
5.1.2 System Architecture
In the following text, we will briefly explain how we generate the commentary for our
presentation team based on the elementary events (e.g. a player serves, a ball hits the
net) that are produced by the tennis simulator. We will introduce individual modules
of the system and describe how they communicate. Figure 5.1 depicts the overall ar-
chitecture of the IVAN system and Figure 5.2 describes the dataflow that starts with
the elementary events produced by the tennis simulator and ends with the multimodal
output represented by the Charamel avatar engine (see section 5.1.3).
The tennis simulator sends elementary events to the event manager. The event
manager receives these elementary events (such as the ball crossing the net or the ball
bouncing) and deduces low-level facts (e.g. a rally finished). These derived low-level
facts are stored in the knowledge base. The event manager also decides when to run
Figure 5.1: IVAN Architecture
Figure 5.2: Dataflow
the discourse planner based on the global state of the game. In other words, the event
manager plays the role of a perception unit, since it receives events from the outside world
and maintains a coherent representation of them in the form of the knowledge base. The discourse
planner, triggered by the event manager, gets facts from the knowledge base, generates all
possible plans, and passes them to the output manager, where each plan represents a possible
dialogue. Some facts can also be deduced during the planning process and stored in the
knowledge base (e.g. statistics to generate the commentary that summarizes the game).
The output manager maintains the plan execution, chooses one plan to execute, matches
planning operators with templates, adds gesture annotations, and sends appropriate
commands to the avatar manager that transforms them to the avatar engine specific
commands. More precisely, there is a mapping that maps each planning operator onto
a template where a template represents a set of possible annotated utterances. Thus,
a planning operator is mapped onto an annotated utterance that is chosen at random
among all utterances that correspond to a respective template. Furthermore, the avatar
manager maintains the state of the dialogue (e.g. who is speaking at the moment or how
long it will take to finish the current utterance) which can be used, e.g., to decide when
to interrupt the current discourse.
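The operator-to-template mapping can be sketched as follows. This is our own illustration: the operator names, annotation tags, and sample utterances are invented for the example and do not reproduce the system's actual template set.

```java
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch of the output manager's mapping: each planning operator maps to
// a template, i.e. a set of annotated utterances, and one utterance is
// chosen at random among all utterances of that template.
public class TemplateMapper {
    private static final Map<String, List<String>> TEMPLATES = Map.of(
        "argue-for-ace", List.of(
            "[gesture:beat] What an ace!",
            "[gesture:raise] Outstanding serve, right on the line!"),
        "convey-score", List.of(
            "[face:neutral] The score is %SCORE%."));

    private static final Random RANDOM = new Random();

    // Maps a planning operator onto one randomly chosen annotated
    // utterance, or null if no template exists for the operator.
    public static String realize(String operator) {
        List<String> utterances = TEMPLATES.get(operator);
        if (utterances == null) return null;
        return utterances.get(RANDOM.nextInt(utterances.size()));
    }
}
```

The random choice among a template's utterances keeps the commentary from sounding repetitive when the same operator recurs.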
There is also the emotion module that separately maintains the emotional state of each
virtual agent. For instance, the facial expression of each virtual agent is updated every
second according to the current emotional state that is stored in the knowledge base.
Let us note that the knowledge base also contains background facts about the game and
players, virtual agents’ roles (commentator or expert), personality profiles, and attitudes
(positive, neutral, or negative) to the players.
5.1.3 Off-the-shelf Components
We have used two commercial products as an audio-visual component of the system.
We have employed Charamel1 to visualize virtual agents and RealSpeak Solo2 as a text-
to-speech (TTS) engine. We will describe both software toolkits in the following para-
graphs.
Charamel Avatar Engine
Charamel is a standalone application that communicates via a socket and can visualize
several virtual agents at the same time. Individual virtual agents are controlled via the
scripting language CharaScript. The virtual agents can express 14 different facial
expressions (e.g. smile, happy, disappointed, angry, sad) with varying intensities. Their
lip movement is synchronized to speech that is produced by the RealSpeak Solo TTS.
The virtual agents can play back around one hundred pre-fabricated gesture clips that
can be tweaked using many different parameters (e.g. velocity, start time, end time,
interpolation time). Moreover, the transitions between two consecutive gestures or
facial expressions are interpolated; the virtual agents also perform idle gestures
while no other gestures are triggered, in order to look natural. Figure
5.3 depicts two Charamel virtual agents Mark and Gloria that were employed in the
system.
1 http://www.charamel.com/
2 http://www.nuance.com/realspeak/solo/
Figure 5.3: Charamel Virtual Agents Mark and Gloria
RealSpeak Solo TTS Engine
RealSpeak Solo is a TTS engine that gets commands from Charamel to vocalize
desired utterances. While the TTS engine is vocalizing an utterance, it also sends
tags back to Charamel, which enables synchronized lip movement of the virtual agent
that is speaking. RealSpeak Solo supports several male and female voices. We employed
the British female voice Serena for the Charamel virtual agent Gloria and the American
male voice Tom for Mark.
5.2 Tennis Simulator
The GALA 2009 challenge was given as a static ANVIL file that describes a tennis game
(see section 1.2). Since we wanted to test our system as if it were a real-time application,
we wrote a tennis simulator that first reads an ANVIL file and then simulates the game
in real time. Although we consider the tennis simulator a part of our system, it can
be easily reused in other systems since it communicates via a socket. Moreover, only a
minor modification is needed to simulate any game that is given as an ANVIL file (with
a corresponding video). In the following text, we will describe our tennis simulator in
detail.
The architecture of the tennis simulator is shown in Figure 5.4. The tennis simulator
first reads a video file and its annotation that is stored in an ANVIL file. The video is
Figure 5.4: Tennis Simulator
opened in a video player that is implemented using the Java Media Framework API3;
the timestamped events, read from the ANVIL file, are stored in a priority queue. When
the simulator is started, it sends the events one by one to a socket at the time they occur.
Since the time of the simulation is determined by the video player, it is possible to pause
the simulation or to move it forwards. It is also possible to fire one of the pre-defined
question events at any time.
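The simulator's event scheduling can be sketched as follows. This is our own illustration of the priority-queue idea, assuming events ordered by timestamp and a clock driven externally (in the real simulator, by the video player); the class and event names are hypothetical.

```java
import java.util.PriorityQueue;

// Sketch of the simulator's event scheduling: timestamped events read
// from the ANVIL file are kept in a priority queue ordered by timestamp,
// so they can be emitted one by one at the time they occur.
public class EventSchedule {

    static final class TimedEvent {
        final double timeSec;
        final String name;
        TimedEvent(double timeSec, String name) {
            this.timeSec = timeSec;
            this.name = name;
        }
    }

    private final PriorityQueue<TimedEvent> queue =
        new PriorityQueue<>((a, b) -> Double.compare(a.timeSec, b.timeSec));

    public void add(double timeSec, String name) {
        queue.add(new TimedEvent(timeSec, name));
    }

    // Returns the next event that is already due at the given simulation
    // time, or null if no event is due yet (the simulator would then wait,
    // with the video player's clock as its time source).
    public String pollDue(double currentTimeSec) {
        TimedEvent next = queue.peek();
        if (next == null || next.timeSec > currentTimeSec) return null;
        return queue.poll().name;
    }
}
```

Because the queue is ordered by timestamp rather than insertion order, events read out of order from the ANVIL file are still emitted at the correct simulated time.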
Figure 5.5 shows the GUI of the tennis simulator. A user first chooses an input file. S/he
can decide whether the video will be displayed in the video player or not and whether
the start of the simulation will be postponed or moved forward; then the simulation can
be started.
Figure 5.5: Tennis Simulator GUI
3 http://java.sun.com/javase/technologies/desktop/media/jmf/
5.3 Plan Generation
In this section, we will describe how we generate plans that correspond to possible
dialogues from the elementary events that are generated by the tennis simulator. Figure
5.6 depicts in colors the part of the system that is responsible for the plan generation
and Figure 5.7 shows which part of the dataflow is covered in this section. First, we will
describe the event manager that receives elementary events from the tennis simulator,
deduces low-level facts from the elementary events, and stores them in the knowledge
base, where the low-level facts along with the background knowledge, virtual agents’
roles, personality profiles, and attitudes to the players create a coherent representation
of the outside world. Then, we will describe the discourse planner that is triggered by
the event manager, gets facts from the knowledge base, and outputs all possible plans
that are subsequently passed to the output manager that maintains the plan execution
described in section 5.4.
Figure 5.6: IVAN Architecture - Plan Generation
Figure 5.7: Dataflow - Plan Generation
5.3.1 Event Manager
In this section, we will describe the event manager, which plays the role of a “perception unit”
since it receives events from the outside world and maintains a coherent representation of them
that is stored in the knowledge base. More precisely, the event manager receives
elementary events from the tennis simulator and deduces low-level facts that are stored
in the knowledge base. It also maintains the overall state and score of the match and
decides when to run the discourse planner. The elementary events (e.g. a player plays
a backhand, the ball lands out) that the event manager receives from the tennis
simulator were defined in detail in the GALA 2009 scenario (see section 1.2); moreover,
an elementary event can also be a user pre-defined question event. Let us remember
that a tennis match consists of sets, a set consists of games, and a game consists of
rallys. However, for the sake of simplicity we consider only one tennis game. Since we
cannot run the discourse planner every time we get an elementary event, we first describe
basic states of the tennis game that are modelled using finite state machines, and then
we identify at which states we run the discourse planner. After that, we explain what
low-level facts are deduced by the event manager, stored in the knowledge base and
subsequently available for the discourse planner.
States
The two finite state machines that we have employed to model the basic states of the tennis
game are depicted in Figure 5.8. Both finite state machines run in parallel; the
initial state is marked in red and the transitions correspond to particular sequences of
elementary events.
Figure 5.8: States of the Tennis Game
Let us first look at the finite state machine on the left side. We start in the state
beginning; after a player throws a ball to serve we move to the state game in progress;
and after the game finishes we move to the state game finished. The state machine
on the right side starts in the state game not in progress. After a player throws a ball to
serve we move to the state rally beginning. A player can throw a ball several times before
he actually serves, but once he serves we move to the state rally in progress. After the
ball hits the net, lands out, or bounces twice, we reach the state rally finished. Then, if
the game is finished, we return to the state game not in progress; otherwise we wait
until a player throws a ball to serve and move to the state rally beginning. Both finite state
machines could be merged into one, but keeping them separate makes them easier to understand.
Two facts derived from the respective finite state machines are also stored in the knowledge
base.
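The two machines above can be sketched as transition functions. This is a minimal sketch; the event names are our shorthand for the elementary-event sequences, not the actual identifiers used in IVAN.

```java
// Sketch of the two parallel finite state machines of Figure 5.8.
class TennisStates {
    enum GameState { BEGINNING, GAME_IN_PROGRESS, GAME_FINISHED }
    enum RallyState { GAME_NOT_IN_PROGRESS, RALLY_BEGINNING, RALLY_IN_PROGRESS, RALLY_FINISHED }

    // Left-hand machine: overall game progress.
    static GameState nextGameState(GameState s, String event) {
        if (s == GameState.BEGINNING && event.equals("throw_ball")) return GameState.GAME_IN_PROGRESS;
        if (s == GameState.GAME_IN_PROGRESS && event.equals("game_over")) return GameState.GAME_FINISHED;
        return s; // all other events leave the state unchanged
    }

    // Right-hand machine: rally progress within the game.
    static RallyState nextRallyState(RallyState s, String event, boolean gameFinished) {
        switch (s) {
            case GAME_NOT_IN_PROGRESS:
                return event.equals("throw_ball") ? RallyState.RALLY_BEGINNING : s;
            case RALLY_BEGINNING:
                return event.equals("serve") ? RallyState.RALLY_IN_PROGRESS : s;
            case RALLY_IN_PROGRESS:
                if (event.equals("net") || event.equals("out") || event.equals("double_bounce"))
                    return RallyState.RALLY_FINISHED;
                return s;
            case RALLY_FINISHED:
                if (gameFinished) return RallyState.GAME_NOT_IN_PROGRESS;
                return event.equals("throw_ball") ? RallyState.RALLY_BEGINNING : s;
        }
        return s;
    }
}
```

Running both machines in parallel means feeding every elementary event to both transition functions and storing the two resulting state facts in the knowledge base.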
The event manager triggers the discourse planner at certain states of the tennis game. The
following list shows the specific states at which the discourse planner is triggered,
together with examples of goals that the discourse planner can derive
at the respective states. (Additional states could be added if desired.)
• beginning - do some introduction to the upcoming game
• rally finished - summarize just finished rally
• game finished - discuss just finished game
• rally beginning & a player has thrown the ball already twice - a player is nervous,
a player concentrates
• rally in progress - comment on the serving player’s background
• rally in progress & a volley or a smash was played - nice shot, risky shot
• rally in progress & the ball hit the tape - luck, inaccuracy
• a question event occurred - answer the question
Score
The score of the game is also maintained in the event manager, using a point counter for
each player and the finite state machine depicted in Figure 5.9. If a player wins a rally,
s/he gets one point. A player wins the game if s/he has at least 4 points in total and at
least 2 points more than the opponent. After both players have reached at least 3 points and
the game is not over yet, the score is either deuce or advantage. Table 5.1 explains how
the tennis score is expressed in tennis terminology for one player. Let us note that
the same player serves throughout one game and that the score is read with the serving
player's score first.
Figure 5.9: Tennis Score Counting using a Finite State Machine
Score          Explanation
"love/zero"    0 points
"fifteen"      1 point
"thirty"       2 points
"forty"        3 points
"deuce"        at least 3 points have been scored by each player, scores are equal
"advantage"    for the leading player; at least 3 points have been scored by each
               player and one player has one point more
Table 5.1: Description of the Tennis Counting Terminology
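The counting rules above can be sketched as two small methods. This is our simplification: `announce` assumes it is only called while the game is still running, and the "server"/"receiver" labels are ours.

```java
// Sketch of the score counting described in Table 5.1 and Figure 5.9.
class TennisScore {
    private static final String[] NAMES = { "love", "fifteen", "thirty", "forty" };

    // Translate point counts into tennis terminology, serving player's score first.
    static String announce(int server, int receiver) {
        if (server >= 3 && receiver >= 3) {
            if (server == receiver) return "deuce";
            return server > receiver ? "advantage server" : "advantage receiver";
        }
        return NAMES[server] + " " + NAMES[receiver];
    }

    // A player wins the game with at least 4 points and a 2-point lead.
    static boolean gameWon(int points, int opponentPoints) {
        return points >= 4 && points - opponentPoints >= 2;
    }
}
```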
Facts
We will now explain which low-level facts are deduced by the event manager from the
elementary events and stored in the knowledge base. The reason why we perform the
deduction of the low-level facts at this level, in the event manager, is that it substantially
facilitates the design of the planning domain. Working with the elementary events in the
planning domain would be quite cumbersome and unsuitable if we want to
achieve reasonable latency. As we have already mentioned, the state of the game and the
score are maintained in the event manager, and the respective facts are therefore stored in
the knowledge base. While the knowledge base contains only the current state of the
game, it contains all facts that describe the score from the beginning of the game. To
distinguish between individual score facts and to order them, we introduce the concept
of score generations, i.e., the first score fact has generation 0, the second score fact has
generation 1, etc. From consecutive score facts we can deduce, e.g., whether a player has
lost the lead or equalized. (Let us note that the concept of generations is often used in
computer science to distinguish among data that originate at consecutive steps of an
algorithm.)
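A deduction over two consecutive score generations can be sketched as follows; the fact names and the array encoding of a score fact are illustrative, not the actual planning-domain vocabulary.

```java
// Sketch: deduce a fact about player A from two consecutive score generations.
class ScoreGenerations {
    // Each score fact is encoded as {pointsA, pointsB}; older precedes newer.
    static String deduce(int[] older, int[] newer) {
        boolean wasLeading = older[0] > older[1];
        if (wasLeading && newer[0] == newer[1]) return "lost_lead";
        if (!wasLeading && newer[0] > newer[1]) return "took_lead";
        return "no_change";
    }
}
```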
Rally Snapshots
All events that occur in the tennis game are partitioned into so-called rally snapshots.
We will now describe which low-level facts are derived from a rally snapshot and stored
in the knowledge base. Each rally snapshot has its generation, defined similarly to
the score generation. (Let us note that the rally generation and the score generation
differ in general since, e.g., a first fault is a rally without a score change.) The low-
level facts are deduced for each rally snapshot and stored in the knowledge base. If
the planner is triggered in the middle of a rally, the knowledge base contains
only the facts deduced from the elementary events of the current, incomplete
rally snapshot. The following list outlines which specific low-level facts are deduced from
a rally snapshot and stored in the knowledge base:
• how many times did the ball cross the net
• a list of the heights at which the ball crossed the net
• a list of pairs (player, shot) ordered from the beginning of the rally to its end
• the position where the last ball that was in the field first bounced
• the position where the last ball that was out bounced
• whether the ball crossed the net before it landed out
• which player missed the last ball
• how many times the serving player had thrown the ball before he served
Table 5.2 contains three examples that show which high-level facts can be deduced from
the low-level facts listed above. Figure 5.10 depicts a hierarchy of facts that shows how
an ace can be deduced.
high-level fact   a list of low-level facts
ace               the ball crossed the net once, bounced in the field, state - rally finished
lob               the ball crossed the net at a high position, bounced at the baseline
drop              the ball crossed the net at a low position, bounced at the net
Table 5.2: Example of high-level facts deduced from low-level facts
Figure 5.10: Hierarchy of Facts from which an Ace can be deduced
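The ace deduction of Table 5.2 and Figure 5.10 can be sketched as a predicate over boolean low-level facts; the field names are ours, chosen to mirror the list above.

```java
// Sketch of deducing the high-level fact "ace" from low-level rally facts.
class RallyFacts {
    int netCrossings;                // how many times the ball crossed the net
    boolean lastBallBouncedInField;  // the last ball bounced inside the field
    boolean rallyFinished;           // right-hand state machine is in "rally finished"

    // An ace, per Table 5.2: the ball crossed the net once, bounced in the
    // field, and the rally is finished (so the receiver never returned it).
    boolean isAce() {
        return netCrossings == 1 && lastBallBouncedInField && rallyFinished;
    }
}
```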
Comparison to Related Work
The event manager is to some extent similar to STEVE's perception module (see
section 2.4), since it also maintains the state of the world and its coherent representation.
Our approach is also similar to SceneMaker (see section 3.3), which employs
statecharts to control virtual agents, with the difference that while SceneMaker can
perform, e.g., a pre-defined scene (i.e. a dialogue whose utterances are annotated with
gestures) at a certain state, we run the planner to generate the scene.
5.3.2 Background Knowledge
The background knowledge about the players and the game is incorporated to produce
commentary when, for instance, there is currently nothing else to comment on. We will
show some examples of background facts that are stored in the knowledge base. The
background knowledge is stored in several static CSV (Comma Separated Values) files,
which could alternatively be replaced with a relational database. After the system starts,
all CSV files are read and the background knowledge they contain is transformed into
facts that are stored in the knowledge base. Table 5.3 shows some examples of facts that
can be deduced from the background knowledge.
Background knowledge   Examples of deduced facts
Player's details       A sister of a player is also a tennis professional.
Ranking                A player is leading the ATP score.
Style                  A player is playing risky as usual.
Injury                 A player has been injured four times recently.
Player's results       A player won two matches in a row.
Tournament details     The tournament is played in London on grass.
Table 5.3: Examples of Facts deduced from the Background Knowledge
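The CSV-to-facts transformation can be sketched as below. The column layout and the Lisp-like fact format are assumptions for illustration, not the thesis's actual file format.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of loading background knowledge from CSV lines into facts.
class BackgroundLoader {
    // e.g. "ranking,Federer,1" -> "(ranking Federer 1)"  (assumed formats)
    static String lineToFact(String csvLine) {
        String[] cols = csvLine.split(",");
        return "(" + String.join(" ", cols) + ")";
    }

    // Read all lines of one CSV file and turn each into a knowledge-base fact.
    static List<String> load(List<String> csvLines) {
        List<String> facts = new ArrayList<>();
        for (String line : csvLines) facts.add(lineToFact(line.trim()));
        return facts;
    }
}
```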
5.3.3 Discourse Planner
The discourse planner is responsible for plan generation, where a plan represents a
dialogue. The discourse planner is triggered by the event manager at particular states
of the game. It gets all facts from the knowledge base and outputs all possible plans,
which are subsequently passed to the output manager. We will describe the input of the
planner, the planner itself, and the representation of the planner output. Let us note
that the concept of the dialogue generation has already been described in Chapter 4.
Input
The input of the planner consists of a planning task and a list of facts that describe
the initial state of the world. The planning task is the same all the time, namely, the
compound task “comment”, since the planner decides each time what it should comment
on according to the supplied facts. The list of facts varies and contains all the facts that
are stored in the knowledge base, i.e., it contains the following types of facts:
• the current state of the game
• scores of the game
• rally snapshots
• background knowledge (see section 5.3.2)
• commentators’ (positive, neutral, negative) attitudes to the players
• roles (commentator, expert)
• a question (a fact identifying that there is a question to be answered)
The Planner
We have employed JSHOP (Java Simple Hierarchical Ordered Planner) as an HTN
planner to produce the commentary on a tennis game; see section 3.1 for more details
on JSHOP. As described above, the planner gets its input in the form of a problem
description and outputs all possible plans. The concept of how these plans are generated
has already been described in detail in Chapter 4. Since JSHOP is an offline planner,
we had to modify it to run online. We will describe what makes JSHOP an offline
planner, how we modified it to run online, and how JSHOP could have been employed
without modification, since we also considered and implemented this option.
JSHOP as an Offline Planner - The drawback of JSHOP is that it requires the
problem description to be generated and compiled prior to running the planner, assuming
that the problem description changes whereas the domain description remains the same
during the system run. As we can see, there is a costly compilation step before each run
of the planner. See section 3.1, where we explained the JSHOP input generation process
in detail. Let us also note that the planner does not have its own working memory, in
the sense that every time it is run all facts have to be supplied again.
JSHOP as an Online Planner - We investigated how the problem description Java file
is generated from the JSHOP problem file and found a way to bypass the compilation
step described above. We have written a universal problem description Java file that
is compiled only once and fully replaces the problem description Java file that
would be generated by JSHOP, i.e., an instance of the universal problem description Java
class accepts the discourse planner's problem description representation as Java objects
and serves as the input of JSHOP, as if the problem file had been generated by JSHOP.
This approach is fast: the plan generation takes only about 50-150 ms.
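The idea behind the universal problem description can be sketched as a pre-compiled class that accepts the facts and the task as plain Java objects. The class and method names here are hypothetical; JSHOP's real API differs.

```java
import java.util.List;

// Sketch of a "universal problem description": compiled once, then reused for
// every planner run by passing in fresh facts instead of regenerating and
// recompiling a problem description Java file.
class UniversalProblem {
    private final List<String> facts; // initial state of the world
    private final String task;        // always the compound task "comment" in IVAN

    UniversalProblem(List<String> facts, String task) {
        this.facts = facts;
        this.task = task;
    }

    List<String> getFacts() { return facts; }
    String getTask() { return task; }
}
```

The point of the design is that only the data changes between planner runs; the class itself, and hence the costly compilation, happens exactly once.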
Alternative Use of JSHOP as an Online Planner - JSHOP can also be used as an
online planner without modification. However, this approach is quite costly, since the
compilation step takes about 1 second each time and also consumes a lot of CPU
resources. Figure 5.11 shows the individual steps of this alternative approach. The
discourse planner uses its own problem description representation, which is first
transformed into the JSHOP problem file (which uses a special Lisp-like syntax);
then the respective Java file is generated and compiled. After that, we make use of a
convenient Java feature, namely, that it allows one class implementation to be replaced by
another at runtime, i.e., one *.class file can be replaced by another during the system run.
Thus, at the end of the process depicted in Figure 5.11, we have a *.class representation
of a problem description and the planner can be started.
Let us note that we use this approach to compile the domain description once at the
beginning, when the system starts. In this case the process starts with the JSHOP
domain file, from which the corresponding Java file is generated, compiled, and replaced
at runtime.
Figure 5.11: JSHOP Input Generation Process
Output
The output of the planner is the so-called planning response, which contains: a list of
all possible plans, the time when the planner was triggered, and the respective state
of the game. Each plan from the list contains: a priority, a semantic
token, and a list of planning operators. The semantic tokens are strings that identify
plans. For instance, the semantic tokens can be used to avoid repetition, where we
disallow consecutive execution of two plans with the same semantic token. The list
of planning operators corresponds to a dialogue, where each planning operator stands
for one template (which corresponds to an utterance). Moreover, some facts can also be
deduced during the planning process and stored in the knowledge base for the next run
of the planner; for instance, the statistics that summarize the game (e.g.
the number of outs, winning returns, and aces for each player). These facts can then be
used, for instance, to generate the commentary on a just finished game.
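The structure of the planning response described above can be sketched as a small data holder; the field and class names are ours.

```java
import java.util.List;

// Sketch of the planning response: all possible plans plus trigger context.
class PlanningResponse {
    static class Plan {
        final int priority;                  // used by plan selection
        final String semanticToken;          // identifies the plan, used to avoid repetition
        final List<String> planningOperators; // each operator maps to one template/utterance
        Plan(int priority, String token, List<String> ops) {
            this.priority = priority;
            this.semanticToken = token;
            this.planningOperators = ops;
        }
    }

    final List<Plan> plans;   // all possible plans
    final long triggerTimeMs; // when the planner was triggered
    final String gameState;   // the respective state of the game
    PlanningResponse(List<Plan> plans, long triggerTimeMs, String gameState) {
        this.plans = plans;
        this.triggerTimeMs = triggerTimeMs;
        this.gameState = gameState;
    }
}
```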
5.4 Plan Execution
In this section, we will describe how we execute the plans that are generated by the
discourse planner, i.e., how we select the plans that will be executed, or, more precisely,
in which dialogues the virtual agents will be engaged. Figure 5.12 highlights in color the
part of the system that is responsible for plan execution, and Figure 5.13 shows
which part of the dataflow is covered in this section. First, we will describe the template
manager, which maps each planning operator of a plan onto a particular
utterance that is furthermore annotated with gesture tags. Then, we will describe the
avatar manager, which serves as an interface to the Charamel avatar engine, and finally
we will describe the output manager, which is responsible for the plan execution, i.e., it
decides which plans will be executed and when.
Figure 5.12: IVAN Architecture - Plan Execution
Figure 5.13: Dataflow - Plan Execution
5.4.1 Template Manager
Let us recall that each plan corresponds to a dialogue, where a plan consists of a
list of planning operators (primitive tasks) and each planning operator corresponds to a
template that contains a set of possible utterances that can be uttered by a virtual agent.
In this section, we will describe how a planning operator is mapped onto a particular
utterance, which can additionally be annotated with gesture tags. The template manager
contains over 220 different templates and maps each planning operator
onto a particular template, where each template usually has several slots that can be
substituted by parameters of the respective planning operator. Each template contains
1-3 variants of an utterance. Which utterance will be chosen is decided at random, for
the sake of higher variability.
Moreover, there are default gesture and facial expression tags in every utterance, since
each utterance is more or less bound to a particular situation that is correlated with a
certain emotion. The facial expression tags can be, for instance: Smile, Happy, Surprise,
Angry, or Sad, with different intensities. The gesture tags can be, for instance: Disagree,
DontKnow, Disappointed, Surprise, Oops, or OhYes. Each gesture tag is stored in a so-
called gesticon and is mapped onto a set of 1-3 possible gestures that can be directly
performed by a virtual agent in a particular situation. Every time the gesticon is queried
for a mapping of a given gesture tag, it chooses one gesture from the corresponding
set of possible gestures at random, to achieve higher variability.
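The gesticon lookup can be sketched as a map with random selection; the gesture strings used in a real gesticon would be Charamel animation names, which we do not reproduce here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch of the gesticon: each gesture tag maps to 1-3 concrete gestures,
// one of which is picked at random for variability.
class Gesticon {
    private final Map<String, List<String>> entries = new HashMap<>();
    private final Random random = new Random();

    void put(String tag, List<String> gestures) {
        entries.put(tag, gestures);
    }

    // Returns one randomly chosen gesture for the given tag.
    String lookup(String tag) {
        List<String> candidates = entries.get(tag);
        return candidates.get(random.nextInt(candidates.size()));
    }
}
```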
Furthermore, there are two duration tags for each utterance: the first denotes the number
of milliseconds needed to utter the utterance using a male voice, and the second is
the respective duration for a female voice. These tags can be used to estimate the
duration of an utterance in case it is not provided by the text-to-speech engine. Let us
note that the gesture and facial expression tags stand only for default values, i.e., they
can be filtered out and substituted by other tags generated by other modules.
Example
In the following text, we will show an example of how a planning operator can be mapped
onto a particular utterance. Imagine that the server has served and the receiver has
returned the ball in such a way that the server failed to return it. One planning operator
(more precisely, an operator's head) of the generated plan can be, for instance:
briskly_returned_serve ?server ?receiver ?receiver_shot
where the first string is the operator's name and the strings that begin with a question
mark stand for variables that are substituted into the slots of a template. The planning
operator's head contains three variables: ?server refers to the serving player, ?receiver
refers to the receiving player, and ?receiver_shot refers to the type of shot that the
receiving player played. There is a corresponding template in the template manager
that contains three slots corresponding to the three variables of the planning operator.
The template consists of two utterances:
{EmotionSurprise} {ExplainTo} ?receiver surprised ?server with an accurate
?receiver_shot return.
{EmotionSurprise} {Play} ?receiver generated a ?receiver_shot {Look} return
that was out of ?server’s reach.
The facial expression and gesture tags are annotated in curly brackets. The facial
expression tags start with the prefix Emotion, whereas all other tags are gesture tags.
Let us assume that the second utterance has been chosen at random, the variable
substitutions are known, and the respective gesture tags have been chosen from the
gesticon at random. Thus, we get the following substitutions:
?server := Safin
?receiver := Federer
?receiver_shot := forehand
{EmotionSurprise} := $(Emotion,surprise,0.9,500,1000,3000)
{Play} := $(Motion,interaction/bye/bye01,400,500,0,10000,1.5)
{Look} := $(Motion,presentation/look/lookto_right02,400,500,0,1200,0.8)
where the facial expression and gesture tags are mapped onto the avatar engine specific
tags (see the Charamel manual [38] for more details). After we apply the substitutions,
we get the following annotated utterance, which can be directly sent to the Charamel
avatar engine:
$(Emotion,surprise,0.9,500,1000,3000)
$(Motion,interaction/bye/bye01,400,500,0,10000,1.5)
Federer generated a forehand
$(Motion,presentation/look/lookto_right02,400,500,0,1200,0.8)
return that was out of Safin’s reach.
After a Charamel virtual agent gets this utterance, s/he looks surprised, s/he makes a
hand movement as if s/he played a ball with a tennis racket, and then s/he gazes at the
other virtual agent.
5.4.2 Avatar Manager
The avatar manager serves as an interface to the Charamel avatar engine. In the
following text, we will describe how we have incorporated this module into our system
and which functionality it provides. The avatar manager sits between the output
manager and the Charamel avatar engine. The output manager decides which plan
will be executed, i.e., which utterance will be uttered and when, whereas the Charamel
avatar engine displays the two virtual agents that represent our commentary team and
accepts commands to control their behaviour. Thus, the role of the avatar manager is
to transform commands from the output manager into Charamel-specific commands.
Furthermore, it maintains the state of the dialogue, which can be exploited by the output
manager. An annotated utterance, a gesture, or a facial expression can be sent to the
avatar manager. The dialogue-state information obtained from the avatar manager
comprises: which virtual agent is currently speaking, how long s/he has
already been speaking, how much time it will take to finish the current utterance, and
which gesture or facial expression was last set for each virtual agent.
Let us recall that all commands sent to the avatar manager or to the
Charamel avatar engine are sent in a non-blocking manner (i.e. the sender never waits
until a command is completed). Thus, the output manager must first get the current state
of the dialogue and then decide which command to send to the avatar manager. For
instance, if nobody is speaking, it can immediately send an annotated utterance to
the Charamel avatar engine. If somebody is speaking, it knows who is speaking and how
long it will take to finish the current utterance. The output manager can then
decide whether to wait or to send a new utterance right away. For instance, it should wait
if the utterance that is being uttered will be finished within a second. Nonetheless, in the
case that somebody is speaking and the avatar manager gets a command to utter
another utterance, it interrupts the virtual agent that is speaking and starts uttering the
new utterance.
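The wait-or-send decision described above can be sketched as a predicate; the 1-second threshold follows the example in the text, while the method signature and the "urgent" flag are our simplifications.

```java
// Sketch of the output manager's wait-or-send decision over the dialogue state.
class SpeechGate {
    enum Action { SEND_NOW, WAIT, INTERRUPT }

    static Action decide(boolean someoneSpeaking, long msToFinish, boolean urgent) {
        if (!someoneSpeaking) return Action.SEND_NOW;   // the channel is free
        if (msToFinish <= 1000) return Action.WAIT;     // utterance almost done: wait for it
        return urgent ? Action.INTERRUPT : Action.WAIT; // interrupt only for urgent content
    }
}
```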
There can be two kinds of interruptions: a self-interruption or an interruption by the
other agent. Gaze gestures and interruption utterances (e.g. "Wait!" or "Look!") are
used to make the interruptions smoother. As we have already stated, the length of an
utterance is stored in the template manager for each template; nevertheless, this length is
not accurate, since the exact length of an utterance depends on the slot substitutions in the
templates (e.g. the ?name "Ray" is shorter than "Richard"). Thus, the Charamel avatar
engine is always queried to send back the real length of an utterance. However, it can
take up to 1 second to get the response, so the estimated length stored in the
template manager is used as long as the real length returned by the Charamel avatar
engine is unknown. A gesture or a facial expression can be sent to the Charamel avatar
engine at any time. A new gesture or facial expression is smoothly interpolated with
the previous one.
Since the avatar manager communicates with the Charamel avatar engine via a socket (see
the Charamel manual [38]), we have to deal with a latency that can be up to one
second, which can cause unwanted delays in the commentary. Another shortcoming of the
Charamel avatar engine is that a virtual agent that is speaking cannot be interrupted at
a specific position in an utterance, since the exact state of the virtual agent is unknown.
We can only estimate the position in an utterance from the time elapsed since
its beginning. Therefore, we cannot prevent an utterance from being interrupted in the
middle of a word.
5.4.3 Output Manager
The output manager is responsible for the plan execution, i.e., it decides in which
dialogues the virtual agents will be engaged. In the following text, we will explain the
functionality of the output manager in detail. The output manager gets plans from the
discourse planner, chooses one plan to execute, maps its planning operators onto templates,
and sends the respective annotated utterances to the avatar manager, which transforms them
into Charamel-specific commands. Thus, the output manager decides which plan
to execute and when. Furthermore, the output manager can interrupt the current plan and
run a new one, while the interrupted plan can be resumed later. The decision when
to interrupt a plan is based on heuristics. Moreover, the output manager keeps a plan
history that prevents repetition, so that one plan is not executed twice in a row.
Decision Loop
The functionality of the output manager is implemented in a decision loop that
maintains the state of the plan that is being executed, the stack of candidate plans, and the
plan history. The decision loop consists of the following steps:
1. Try to get new plans.
2. If there are new plans then select one and put it on the stack of candidate plans.
3. Remove old plans from the stack of candidate plans.
4. Get the status of the dialogue engine.
5. In the case that nobody is speaking we can perform one of the following actions:
• The plan that is being executed continues with the next utterance.
• The plan that has been interrupted starts again.
• The current plan is interrupted by a new one.
• A new plan is started.
6. In the case that somebody is speaking and there is a newer plan on the stack of
candidate plans, we decide according to heuristics whether the current plan will
be interrupted or not.
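Steps 2 and 3 of the loop can be sketched as a candidate stack with an aging rule. The fixed per-token age limit used here is a simplifying assumption standing in for the semantic-token-based filtering described below.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the candidate-plan stack maintained by the decision loop.
class CandidateStack {
    static class Candidate {
        final String semanticToken;
        final long createdMs;
        final long maxAgeMs; // background plans age slower than rally-event plans
        Candidate(String token, long createdMs, long maxAgeMs) {
            this.semanticToken = token;
            this.createdMs = createdMs;
            this.maxAgeMs = maxAgeMs;
        }
    }

    private final Deque<Candidate> stack = new ArrayDeque<>();

    // Step 2: the selected new plan is pushed on the stack.
    void push(Candidate c) { stack.push(c); }

    // Step 3: remove plans that are no longer up-to-date.
    void removeOld(long nowMs) {
        stack.removeIf(c -> nowMs - c.createdMs > c.maxAgeMs);
    }

    int size() { return stack.size(); }
}
```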
The plan is selected (at step 2) according to its priority and a least-recently-used
strategy, which prefers plans with a high priority and plans that have not been executed
recently. To ensure that the stack of candidate plans contains only plans that are
up-to-date (at step 3), we go through the plans and filter out old plans depending on
their semantic tokens. For instance, a plan that contains some background facts
(e.g. that the serving player is leading the ATP score) does not become outdated as
quickly as a plan that is related to an event that happened in the middle of a rally (e.g.
a player playing a smash).
Each time the output manager gets new plans, it has to decide, on the basis of some
heuristics, whether or not to interrupt the current plan and continue with a new one.
The output manager makes use of the state of the dialogue to know the approximate
time needed to finish the current utterance and how long the current plan has already
been running. For instance, the current plan will not be interrupted if it finishes within a
second or if it was started only a couple of milliseconds ago. Interruptions also must not
occur too often. Depending on the semantic tokens of the plans, some plans should be
executed as soon as possible (e.g. a comment referring to an ace) and some plans can
be executed with a certain delay (e.g. a comment on a player's background). Furthermore,
an interrupted plan can be run again if it is still up-to-date and was not almost
finished when it was interrupted.
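The interruption heuristics summarized above can be sketched as a single predicate; the concrete thresholds are illustrative values of ours, not the ones used in IVAN.

```java
// Sketch of the heuristics deciding whether to interrupt the current plan.
class InterruptHeuristic {
    static boolean shouldInterrupt(long msToFinishUtterance,
                                   long msSincePlanStart,
                                   long msSinceLastInterrupt,
                                   boolean newPlanUrgent) {
        if (msToFinishUtterance <= 1000) return false;  // current utterance almost done
        if (msSincePlanStart < 500) return false;       // plan has only just started
        if (msSinceLastInterrupt < 10000) return false; // interruptions must not be too frequent
        return newPlanUrgent;                           // e.g. a comment referring to an ace
    }
}
```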
Chapter 6
Discussion
In this chapter, we will compare the IVAN system with ERIC, evaluate our system in
terms of the research aims, and discuss two basic tools (JSHOP and Jess) that can
both be employed to generate affective commentary on a continuous sports event in
real-time.
6.1 Comparison with the ERIC system
In this section, we will compare our system with ERIC (see section 2.1), since ERIC
is most closely related to our work. ERIC is an affective commentary virtual agent
that won GALA 2007 1 as a horse race reporter. The overall goal of ERIC is the same
as ours, with the difference that while ERIC is a monologic system that employs one
virtual agent, we have employed a presentation team of two virtual agents
to comment on a sports event. Our virtual agents have different roles (TV commentator,
expert) and can have different attitudes to the players (positive, neutral, negative). The
use of a presentation team is believed to be more entertaining for the audience than a
single presenter and enriches the communication strategies, since our virtual agents can
be engaged in dialogues and represent opposing points of view.
ERIC employs an expert system to generate speech, where his utterances reflect his
current knowledge state and the discourse coherence is ensured by centering theory.
Nevertheless, ERIC may be too reactive, i.e., individual utterances are uttered at
particular knowledge states, and ERIC cannot generate larger contributions. Hence,
we have employed an HTN planner to generate the dialogues, which enabled us to plan
large dialogue contributions while the discourse coherence is ensured by the planner.
1http://hmi.ewi.utwente.nl/gala/finalists 2007/
In contrast to ERIC, we have also implemented the possibility of interruptions, i.e.,
the current discourse can be interrupted if a more important event happens. However,
there is always a certain trade-off between reactivity, i.e., a reactive commentary with
frequent interruptions, and discourse coherence, i.e., a commentary with large and coherent
dialogue contributions that does not comment on every event.
While ERIC uses ALMA to maintain his affective state, we use two methods: one that
generates affective dialogues based on the virtual agents' attitudes to the players, and
another that maintains the affective state of each virtual agent in the emotion module.
ALMA might appear to be a "black box"; in contrast, the generation of affective
dialogues and the simulation of the affective states of our virtual agents are more
transparent, i.e., we can adjust the computation of the initial intensities of individual
OCC emotions depending on the personality, we can define our own decay function, and
we have full control over the input tags and the output of our emotion module. We can also
always say which event has caused a virtual agent's current emotion or why a virtual
agent is commenting in a positive or negative way on an event or a player.
In comparison to ERIC, our virtual agents have gestures that are better synchronized with
speech, use more elaborate idle gestures (provided by Charamel), can gaze at each other, and
can interact with a user via user pre-defined questions. Whilst ERIC was designed to
be domain independent and was tested in two different domains, our system has only
been designed to comment on a tennis game; nevertheless, the same architecture can be
used to produce affective commentary in other domains.
6.2 Evaluation in Terms of Research Aims
In this section, we will compare our research aims, listed in section 1.4, with the system
that we have implemented.
Dialogue Planning for Real-time Commentary and Reactivity
We have employed JSHOP as an HTN planner to produce commentary on a continuous
sports event in real-time. The motivation for using an HTN planner was to generate
large dialogue contributions and to avoid being too reactive (in the sense described
in section 6.1). It also seemed to be a good strategy for generating dialogues. First,
JSHOP gets all facts that describe the current state of the world and outputs all possible
plans (dialogues). Then, in the decision loop, one plan is selected and executed. A
problem arises when an important event happens in the middle of the execution of a
plan (dialogue) that comments on another event. In this case, our system can either
interrupt the execution of the current plan or wait until the current plan finishes. This
This problem would be solved by dynamic replanning, i.e., modifying the current plan on the fly. Since JSHOP does not support dynamic replanning, we can only either wait until the current plan finishes or interrupt it. However, even if JSHOP supported dynamic replanning, it would not be sufficient, since the Charamel avatar engine does not indicate its exact state; e.g., we cannot interrupt an utterance at a specific position within the utterance. Moreover, if we sent an utterance word by word to the Charamel avatar engine, it would not be uttered in a coherent way. Thus, the planner would need to work with whole utterances, which would not be optimal: we would have to wait until the current utterance had been uttered, and only then continue with an utterance of the modified plan created by the dynamic replanning.
Therefore, there is always a certain trade-off between reactivity and discourse coherence. We can either frequently interrupt plans (dialogues) to be reactive, or we can delay the commentary on some events, or even ignore some events, to obtain large, coherent dialogue contributions. Nevertheless, we have noticed that real-life tennis commentators do not comment on every event; when the game is not interesting, they engage in small talk to amuse the audience by talking about the players' background. Thus, we have implemented a compromise that uses some heuristics to decide when to interrupt the discourse. The resulting commentary is partly reactive, but since we cannot interrupt the discourse too often, our commentary sometimes has delays or does not consider some events.
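Such a compromise heuristic could be sketched roughly as follows. The priority scale, the thresholds, and all names are invented for illustration and do not reproduce IVAN's actual rules.

```java
// Illustrative interruption heuristic (invented names and thresholds, not
// IVAN's actual rules): interrupt the running dialogue only if the new event
// is clearly more important than the one being commented on and the current
// plan has not almost finished anyway.
public class InterruptionHeuristic {
    /**
     * @param newEventPriority     importance of the incoming event (0..10)
     * @param currentEventPriority importance of the event being commented on (0..10)
     * @param fractionDone         executed portion of the current plan (0..1)
     */
    public static boolean shouldInterrupt(int newEventPriority,
                                          int currentEventPriority,
                                          double fractionDone) {
        boolean clearlyMoreImportant = newEventPriority >= currentEventPriority + 3;
        boolean nearlyFinished = fractionDone > 0.8; // better to let the plan finish
        return clearlyMoreImportant && !nearlyFinished;
    }
}
```

The two conditions reflect the trade-off described above: the margin on the priorities limits how often the discourse is broken, while the progress check avoids discarding a plan that is about to finish anyway.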
There is also always a certain trade-off between reactive commentary that uses short utterances and elaborate, more detailed commentary that is less reactive. Since we wanted to produce more interesting and detailed commentary that conveys more facts, our utterances are rather long.
We have assumed that HTN planning is well suited to producing commentary on sports events that unfold rather slowly (e.g. a live tennis game). However, the test files provided by GALA 2009 were generated by the Wii2 software, which produced tennis games that unfolded more quickly than a standard live tennis game. Hence, there was a slight mismatch between the input we anticipated and the input we received. Nevertheless, our system was able to produce the commentary even under these conditions.
The reactivity of the system also partly depends on the response time of the avatar engine and on the speed at which the virtual agents talk. Slightly faster speech and a lower response time of the avatar engine, which is sometimes up to 1 second, would lead to better results in terms of reactivity.
2 http://wii.com/
Behavioural Complexity and Affectivity
Our virtual agents provide affective commentary on a tennis game according to their (positive, neutral, negative) attitudes to the players and according to the events that occur during the tennis game. The current affect of a virtual agent is expressed by dialogue scheme selection, lexical selection, facial expression, and gestures. A user can recognize which virtual agent is in favour of which player and whether the virtual agent's favourite player is doing well or not. For instance, a virtual agent's facial expression alone can reveal whether his/her favourite player is leading. The virtual agents also have gestures synchronized with speech and can interact with a user in the form of pre-defined questions.
The variability of the dialogues is ensured by the planner, which always outputs all possible plans (dialogues), and by the random selection of utterances and gestures within particular templates. However, there is always a certain trade-off between a few polished, well-fitting, specific dialogues and a large variety of general dialogues. Since we wanted specific commentary for GALA, we preferred the first option. Nevertheless, more variety could be achieved by adding more dialogue schemes and more variants of utterances and gestures to the respective templates. The dialogue schemes could also be based on the different types of OCC emotions that are maintained for each virtual agent in a simple emotion module, which would further increase the variability and affectivity of the commentary.
We have used two methods to produce affective commentary: one that generates affective dialogues based on the virtual agents' attitudes to the players, and another that maintains the affective state of each virtual agent in the emotion module. Thus, the user can see which event elicited which emotion and why a virtual agent is commenting in a positive or negative way.
Generalizability
Although our system was not designed to be domain independent, we describe below which modifications would be necessary to change the domain. The tennis simulator would need only a subtle modification to simulate any sports event given as an ANVIL file. We would need to define the new states at which the discourse planner is triggered by the event manager. We would also need to define the snapshots of the world and which low-level facts are derived from the respective snapshots. The pre-processing of the background facts is done in a generic way, so we would only have to provide corresponding input CSV files. While the Java code in the discourse planner is domain independent, the definition of the Hierarchical Task Network in the planning domain would need to be rewritten, except for the part that concerns the background knowledge (e.g. injury, weather). We would also need to add corresponding templates and change some heuristics in the output manager, e.g., those that determine under which conditions a plan can be interrupted. We would also need to define the respective emotion eliciting conditions in the emotion module. The avatar manager is domain independent. Thus, the most complex task would be to rewrite the domain description of the planner and to add the respective templates.
6.3 Comparison: JSHOP vs. Jess
In this section, we compare the two approaches, HTN planning (see section 3.1) and expert systems (see section 3.2), that can be used to generate commentary on a sports event as defined by GALA 2009 (see section 1.2). We focus on two tools: JSHOP3, a representative of HTN planners that we have employed in our system to generate dialogues, and Jess4, a representative of expert systems that was used, e.g., in ERIC (see section 2.1) to generate speech. Whereas HTN planning is well suited to planning larger contributions (e.g. dialogue planning), expert systems are more suitable for producing shorter comments that reflect the current state of the world. In the following, we compare JSHOP and Jess in terms of their expressive power, usability, and user-friendliness.
• Variability
Variability is important, e.g., for dialogue planning, since the virtual agents should not be engaged in the same dialogues all the time. In logistics, it is likewise convenient to have more than one way to deliver a package, since not all paths cost the same (so the cheapest path should be chosen) and some paths can be dynamically added to or deleted from the domain. The advantage of planning is that it finds all solutions to a problem, while an expert system outputs only one. (More precisely, while a planner is backtracking to find all possible plans, it can try several substitutions of a variable; in contrast, once a rule fires in an expert system, a variable is substituted and cannot be changed.) Nevertheless, it is possible to set a random conflict resolution strategy in a rule-based system, which resembles choosing a plan at random among all possible plans output by a planner. Thus, some variability can be achieved in rule-based systems as well.
• Priority
We can assign a cost to each planning operator in the planning domain such that the cost of a plan equals the sum of the costs of all planning operators that the plan contains. After the planner outputs all possible plans, we can choose the most or the least expensive plan according to our preferences. If the cost corresponds to the length of a path, we will probably choose the shortest one. If the cost corresponds to the amount of money that we receive when we execute the plan, we will presumably choose the most profitable plan. In an expert system, we can assign a salience value to each rule, which specifies how urgently the rule should be fired; if two rules have the same salience value, the current conflict resolution strategy decides which rule is fired first. This is how rule-based systems can prioritize some outcomes. Nevertheless, the use of salience values should be avoided, since it makes the execution of the rules very difficult to monitor.
3 JSHOP2 (Java Simple Hierarchical Ordered Planner): http://www.cs.umd.edu/projects/shop/
4 Jess (Java Expert System Shell): http://www.jessrules.com/
• Expressive Power
Jess offers substantially more constructs than JSHOP. We give two examples of constructs that are defined in Jess but not in JSHOP, where it would be advantageous to have them in JSHOP as well. First, JSHOP does not support unordered facts; thus, if we want to work with only one slot of a fact, we have to consider all of its slots, since JSHOP supports only ordered facts. Second, it is quite cumbersome to count the number of facts that match a certain condition in JSHOP, although this can be worked around by recursion. In Jess, the same task can be solved in an intuitive way using the accumulate construct.
• Online vs Offline Execution
We have already pointed out that JSHOP runs offline (see Figure 3.3). Thus, after any change in the domain or problem file, the respective Java file first has to be generated and compiled before the planner can actually be run. In contrast to JSHOP, Jess runs online, i.e., after the Jess rule-based engine is initialized, it can be run several times, and facts and rules can be added to its fact base or retracted in the meantime.
• Development Environment
Jess can be better integrated into a development environment than JSHOP, since there is a plugin that integrates Jess into the Eclipse IDE5, which facilitates development; e.g., it offers a Jess editor that highlights the Jess Lisp-like syntax and marks errors. In comparison, JSHOP is provided as a Java library. Nevertheless, the input JSHOP files can be edited as text files in the Eclipse IDE as well.
5 http://www.eclipse.org/
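To make the plan-selection side of this comparison concrete, the two selection policies discussed above, random choice for variability and cost-based choice for priority, can be sketched in Java. The Plan type and all names are hypothetical and are not taken from the JSHOP or IVAN code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Hypothetical sketch (not JSHOP/IVAN code): two ways of picking one plan
// from the set of all plans returned by an HTN planner.
public class PlanSelection {
    /** A plan whose cost is the sum of its operators' costs. */
    public record Plan(String name, double cost) {}

    private static final Random RNG = new Random();

    /** Variability: choose a plan uniformly at random. */
    public static Plan randomChoice(List<Plan> plans) {
        return plans.get(RNG.nextInt(plans.size()));
    }

    /** Priority: choose the plan with the lowest total cost. */
    public static Plan cheapest(List<Plan> plans) {
        return plans.stream()
                    .min(Comparator.comparingDouble(Plan::cost))
                    .orElseThrow();
    }
}
```

In a rule-based system, the closest analogues are a random conflict resolution strategy and salience values, respectively; the planner-side version has the advantage that the full set of alternatives is available before the choice is made.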
Chapter 7
Conclusion
7.1 Summary
In this thesis, we have presented the architecture of the IVAN system (Intelligent Interactive Virtual Agent Narrators), which generates affective commentary in real-time on a tennis game given as an annotated video provided by GALA 2009. The demo version of the IVAN system was accepted for GALA 20091, which was a part of the 9th International Conference on Intelligent Virtual Agents (IVA)2. The system employs two virtual agents with different attitudes to the players that are engaged in dialogues commenting on a tennis game. We have focused on the knowledge processing, dialogue planning, and behaviour control of the virtual agents. Commercial products have been employed as the audio-visual component of the system. Most parts of the system are domain dependent; however, the same architecture can be reused to implement applications such as interactive tutoring systems, tourist guides, or guides for the blind.
The system consists of several modules. We have employed an HTN planner to plan the dialogues, an expert system to define the appraisals of the emotion eliciting conditions in the emotion module, and finite state machines to simulate the basic states of the system. Our two virtual agents can have positive, neutral, or negative attitudes to the players. The system uses two methods to generate affective multimodal output. In the first method, the dialogue schemes in the HTN planner are selected according to the desirability of particular events for the respective virtual agents. In the second method, the system maintains the affective state of each virtual agent in the emotion module, according to the OCC cognitive model of emotions [36], based on the appraisals of the events that happen in a tennis game. The current affect of the virtual agents is expressed by lexical selection, facial expression, and gestures. Furthermore, the system integrates background knowledge about the players and the tournament and allows the user to fire one of the pre-defined questions at any time.
1 http://hmi.ewi.utwente.nl/gala/finalists 2009/
2 http://iva09.dfki.de/
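A rough sketch of the first method's appraisal, mapping an agent's attitude to a player and an event's outcome to a desirability sign, could look as follows in Java. The class, the enum, and the simple sign rule are illustrative assumptions, not the exact IVAN scheme.

```java
// Rough illustration (assumed, not the exact IVAN scheme): an event that is
// good for a player is desirable for an agent with a positive attitude to
// that player, undesirable for an agent with a negative attitude, and
// neutral otherwise. The sign then drives the dialogue scheme selection.
public class Appraisal {
    public enum Attitude { POSITIVE, NEUTRAL, NEGATIVE }

    /** +1 desirable, 0 neutral, -1 undesirable for this agent. */
    public static int desirability(Attitude attitudeToPlayer, boolean goodForPlayer) {
        int attitudeSign = switch (attitudeToPlayer) {
            case POSITIVE -> 1;
            case NEUTRAL -> 0;
            case NEGATIVE -> -1;
        };
        return goodForPlayer ? attitudeSign : -attitudeSign;
    }
}
```

Under this kind of rule, the same rally can simultaneously yield a positive dialogue scheme for one commentator and a negative one for the other, which is what makes the two agents distinguishable.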
We have employed JSHOP3 as an HTN planner to generate the dialogues of our two virtual agents. We have verified that JSHOP can be employed to generate affective commentary on a continuous sports event in real-time. However, HTN planning is best suited to generating large dialogue contributions; thus, if the environment changed rapidly and we wanted to consider most of the events occurring in the environment, it would be more appropriate to use an expert system, as in ERIC [10].
7.2 Future Work
In the following paragraphs, we will outline which modifications could be made to im-
prove our system in the future.
EMBR
We could integrate EMBR (A Realtime Animation Engine for Interactive Embodied Agents) [39], since EMBR has more advanced behaviour control; e.g., it supports more precise gaze that can express particular emotions, whereas the Charamel virtual agents (see section 5.1.3) can only turn the head to gaze at the other virtual agent. We did not employ EMBR because it had not yet been released at the time, and it also offered only one virtual agent, whereas we needed two distinguishable characters.
Prosody
We could also integrate a prosody module if we had an appropriate TTS engine that provided the option to set the respective parameters. Then, we could use the current emotional state of a virtual agent, as simulated by the emotion module (see section 4.2.3), to set the respective parameters of the TTS engine. We have not implemented a prosody module because the RealSpeak Solo TTS4 did not provide the option to change the respective parameters.
ALMA
We could use ALMA [15] to maintain the emotional state of each virtual agent, since ALMA, in addition to what our emotion module does, maintains an emotion history and performs emotion blending. We could then anticipate smoother transitions between the individual emotional states of a virtual agent. Nevertheless, we did not employ ALMA because we wanted full control over the emotion module, e.g., to adjust the computation of the initial intensities of individual OCC emotions depending on the personality and to define our own decay function.
3 JSHOP2 (Java Simple Hierarchical Ordered Planner): http://www.cs.umd.edu/projects/shop/
4 http://www.nuance.com/realspeak/solo/
Affect
We could base some dialogue schemes on the particular OCC emotions that are output by our emotion module; in this way, we would obtain more affective and suitable dialogues. Nevertheless, it would entail a lot of work, since we would also have to write many utterances that express particular emotions. Note that, to keep the amount of templates reasonable, we can have either many general affective dialogues or many specific dialogues that express particular emotions in a limited way. We have chosen the second option; thus, our dialogue schemes are based only on the virtual agents' (positive, neutral, negative) attitudes to the players.
We could also base the selection of particular utterances and gestures within templates on the current emotional state of a virtual agent, as maintained by our emotion module: a particular utterance and gesture would be chosen according to the agent's current emotional state. The current affect could also, for instance, influence the velocity of particular gestures. In this way, we would obtain more affective dialogues. Nevertheless, we did not implement this feature, since it would have required writing many different affective utterances, and we have assumed that it is sufficient for the utterances to convey only the virtual agents' (positive, neutral, negative) attitudes to the players.
Dynamic Replanning
We could try another planner that supports dynamic replanning (e.g. HOTRiDE [40]), since the only way we can currently change a plan (dialogue) is to interrupt it and start a new one. Nevertheless, dynamic replanning seems to be quite difficult to implement. One reason why we did not try such a planner is that the Charamel avatar engine (see section 5.1.3) does not indicate the exact state of the discourse, so such a planner would have to work with whole utterances, which would not be optimal. Thus, the precondition for employing such a planner is an avatar engine that indicates, at any point in time, what exactly has been uttered so far.
Evaluation
A more elaborate evaluation of the system could be done. We could perform an experiment to find out what a user remembers from the commentary with and without virtual agents. However, live tennis commentators are usually hidden so that the audience can concentrate on the tennis game. Though we would in general expect the commentary with the virtual agents to be better, it could easily happen that users would concentrate more on the video of the tennis game, and remember more, without the virtual agents, since the virtual agents would rather distract them. We have not performed this sort of evaluation since it was not clear how to interpret the possible results.
We could also compare our commentary with live commentary. Nevertheless, in contrast to our system, real commentators are usually hidden and their commentaries are not biased. Our system was also partly optimized for GALA 2009 (see section 1.2), which differed slightly from a live tennis game since it used Wii5 videos of tennis games. Live tennis commentary is also often very elaborate, so our system cannot compete with such commentary in terms of variability.
Other Domains
We could reuse the architecture of the system to implement systems in other domains, e.g., other slowly unfolding sports events, interactive tutoring systems, tour guides, or guides for the blind.
5 http://wii.com/
Appendix A
Commentary Excerpt
In the following list, we show a commentary excerpt, where C stands for the commentator and E stands for the tennis expert.
C : “Ladies and Gentlemen. Welcome to the Wimbledon semi-final in doubles.”
E : “We will guide you through the match in which James Blake and Andy Roddick are playing versus Marat Safin and David Ferrer.”
C : “Enjoy the show!”
C : “The weather is cloudy.”
E : “I hope it won’t be raining.”
C : “Oops, Roddick scored!”
E : “Roddick hits an excellent forehand-volley right into the left corner.”
C : “Roddick has been unbeatable recently.”
E : “What a control by Roddick!”
C : “The score is already 30:0.”
E : “Players Safin and Ferrer are real losers as usual!”
E : “None of the players are playing on favourite surface.”
C : “It should harm performance of Blake and Rod..”
Interruption
E : “What an unexpected lob!”
C : “It’s a bad idea to try to return a lob at the net.”
C : “The brother of Blake Thom..”
Interruption
C : “Look!”
C : “Blake missed the huge backhand return by Ferrer!”
C : “The brother of Blake Thomas is also playing tennis.”
E : “His best ranking was in 2002.”
C : “The score is already 30:40.”
E : “No, Blake and Roddick are facing the break point.”
C : “Ferrer had a hard time recently.”
C : “Deuce.”
E : “Great recover by Blake and Roddick!”
C : “Roddick has been unbeatable recently.”
E : “No doubt he is a genius.”
C : “Off forehand by Blake drifts outside the left sideline.”
E : “Blake just overcooked his forehand.”
E : “Blake concentrates on his serve.”
C : “Roddick has been four times injured since last year.”
E : “It’s amazing how he ...”
Interruption
C : “What a relief!”
E : “Oh, no!”
C : “Tight game, let's summarize it.”
C : “Safin and Ferrer won the first game.”
E : “That’s unbelievable that they broke opponents’ serve!”
C : “That was quite obvious!”
E : “Safin and Ferrer played below par!”
C : “There were some excellent shots!”
E : “Well there might have been some bright sides.”
C : “Of course there were!”
C : “Did Safin and Ferrer have any difficulties?”
E : “They were already trailing.”
C : “We have seen nice recovery.”
C : “Let’s see the next game.”
E : “Definitely.”
Bibliography
[1] Justine Cassell, Tim Bickmore, Lee Campbell, Hannes Vilhjalmsson, and Hao Yan.
Human conversation as a system framework: Designing embodied conversational
agents. In Embodied Conversational Agents, pages 29–63. MIT Press, Cambridge,
2000.
[2] Jonathan Gratch and Stacy Marsella. Tears and fears: modeling emotions and
emotional behaviors in synthetic agents. In Proceedings of the Fifth International
Conference on Autonomous Agents, pages 278–285. ACM Press, Montreal, Quebec,
Canada, 2001.
[3] Jeff Rickel and W. Lewis Johnson. Animated agents for procedural training in
virtual reality: Perception, cognition, and motor control. Applied Artificial
Intelligence, 13:343–382, 1998.
[4] Marc Cavazza, Fred Charles, and Steven J. Mead. Interacting with virtual char-
acters in interactive storytelling. In Proceedings of the first international joint
conference on Autonomous agents and multiagent systems, pages 318–325. ACM
Press, Bologna, Italy, 2002.
[5] Mark Riedl, C.J. Saretto, and R. Michael Young. Managing interaction between
users and agents in a multi-agent storytelling environment. In Proceedings of the 2nd
International Joint Conference on Autonomous Agents and Multi Agent Systems.
Melbourne, 2003.
[6] Elisabeth Andre, Thomas Rist, Susanne van Mulken, Martin Klesen, and Stephan
Baldes. The automated design of believable dialogues for animated presentation
teams. In Embodied Conversational Agents, pages 220–225, Cambridge, 2000. MIT
Press.
[7] Elisabeth Andre and Thomas Rist. Presenting through performing: On the use of
multiple Life-Like characters in Knowledge-Based presentation systems. In 2000
International Conference on Intelligent User Interfaces, pages 1–8. ACM Press,
New York, 2000.
[8] Elisabeth Andre, Thomas Rist, and Jochen Muller. Integrating reactive and scripted
behaviors in a Life-Like presentation agent. In Proceedings of the Second Inter-
national Conference on Autonomous Agents (Agents 1998), pages 261–268. ACM
Press, New York, 1998.
[9] Elisabeth Andre, Kim Binsted, Kumiko Tanaka-Ishii, Sean Luke, Gerd Herzog,
and Thomas Rist. Three RoboCup simulation league commentator systems. AI
Magazine, 22:57–66, 2000.
[10] Martin Strauss and Michael Kipp. ERIC: a generic rule-based framework for an
affective embodied commentary agent. 2007.
[11] Francois L. A. Knoppel, Almer S. Tigelaar, Danny Oude Bos, Thijs Alofs, and
Zsofia Ruttkay. Trackside DEIRA: a dynamic engaging intelligent reporter agent.
In Proceedings of the 7th international joint conference on Autonomous agents and
multiagent systems (AAMAS). Portugal, 2008.
[12] Michael Kipp. ANVIL a generic annotation tool for multimodal dialogue. pages
1367–1370, Aalborg, 2001.
[13] Ivan Gregor, Michael Kipp, and Jan Miksatko. IVAN intelligent interactive virtual
agent narrators. In Proceedings of the 9th International Conference on Intelligent
Virtual Agents (IVA-09), pages 560–561. Springer, Amsterdam, 2009.
[14] Martin Strauss. Realtime generation of multimodal affective sports commentary
for embodied agents, 2007.
[15] Patrick Gebhard. ALMA - a layered model of affect. In Proceedings of the Fourth In-
ternational Joint Conference on Autonomous Agents and Multiagent Systems (AA-
MAS 05), pages 29–36. Utrecht, 2005.
[16] Lewis R. Goldberg. An alternative description of personality: The Big-Five
factor structure. In Journal of Personality and Social Psychology, volume 59,
pages 1216–1229, 1990.
[17] Barbara J. Grosz, Aravind K. Joshi, and Scott Weinstein. Centering: A framework
for modeling the local coherence of discourse. In Computational Linguistics,
volume 21, pages 203–225, 1995.
[18] Ionut Damian, Kathrin Janowski, and Dominik Sollfrank. Spectators, a joy to
watch. In Proceedings of the 9th International Conference on Intelligent Virtual
Agents (IVA-09), pages 558–559. Springer, Amsterdam, 2009.
[19] Elisabeth Andre and Thomas Rist. Controlling the behavior of animated pre-
sentation agents in the interface: Scripting versus instructing. In AI Magazine,
volume 22, pages 53–66. AAAI Press, 2001.
[20] Elisabeth Andre, Gerd Herzog, and Thomas Rist. Generating multimedia presen-
tations for RoboCup soccer games. In RoboCup-97: Robot Soccer World Cup I
(Lecture Notes in Computer Science). Springer, 1998.
[21] Dana Nau, Tsz-Chiu Au, Okhtay Ilghami, Ugur Kuter, Hector Munoz-Avila,
J. William Murdock, Dan Wu, and Fusun Yaman. Applications of SHOP and
SHOP2, 2004.
[22] Richard Fikes and Nils Nilsson. STRIPS: a new approach to the application of
theorem proving to problem solving. In Artificial Intelligence, volume 2, pages
189–208. 1971.
[23] Dana S. Nau, Stephen J. J. Smith, and Kutluhan Erol. Control strategies in HTN
planning: Theory versus practice. In AAAI-98/IAAI-98 Proceedings, pages 1127–
1133. 1998.
[24] Dana Nau, Hector Munoz-Avila, Yue Cao, Amnon Lotem, and Steven Mitchell.
Total-Order planning with partially ordered subtasks. In Proceedings of the Sev-
enteenth International Joint Converence on Artificial Intelligence (IJCAI-2001).
Seattle, 2001.
[25] Dana Nau, Yue Cao, Amnon Lotem, and Hector Munoz-Avila. SHOP: simple hier-
archical ordered planner. In International Joint Conference on Artificial Intelligence
(IJCAI-99), pages 968–973, Stockholm, 1999.
[26] Okhtay Ilghami and Dana S. Nau. A general approach to synthesize Problem-
Specific planners, 2003.
[27] Okhtay Ilghami. Documentation for JSHOP2. 2006.
[28] Gary Riley. CLIPS: a tool for building expert systems, 2008. URL
http://clipsrules.sourceforge.net/.
[29] Ernest Friedman-Hill. Jess, the rule engine for the java platform, 2009. URL
http://www.jessrules.com/.
[30] Patrick Gebhard, Michael Kipp, Martin Klesen, and Thomas Rist. Authoring scenes
for adaptive, interactive performances. In Proceedings of the Second International
Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-03),
pages 725–732. ACM Press, New York, 2003.
[31] Martin Klesen, Michael Kipp, Patrick Gebhard, and Thomas Rist. Staging exhibi-
tions: Methods and tools for modelling narrative structure to produce interactive
performances with virtual actors. In Virtual Reality. Special Issue on Storytelling
in Virtual Environments, volume 7, pages 17–29. Springer-Verlag, 2003.
[32] Norbert Reithinger, Patrick Gebhard, Markus Lockelt, Alassane Ndiaye, Norbert
Pfleger, and Martin Klesen. VirtualHuman: Dialogic and affective interaction with
virtual characters. In Proceedings of the 8th International Conference on Multimodal
Interfaces (ICMI'06), pages 51–58. Canada, 2006.
[33] Patrick Gebhard, Marc Schroder, Marcela Charfuelan, Christoph Endres, Michael
Kipp, Sathish Pammi, Martin Rumpler, and Oytun Turk. IDEAS4Games: building
expressive virtual characters for computer games. In Proceedings of the 8th Interna-
tional Conference on Intelligent Virtual Agents (IVA’08), pages 426–440. Springer,
2008.
[34] Patrick Gebhard and Susanne Karsten. On-Site evaluation of the interactive CO-
HIBIT museum exhibit. In Proceedings of the 9th International Conference on
Intelligent Virtual Agents (IVA-09), pages 174–180. Springer, Amsterdam, 2009.
[35] Michael Kipp, Kerstin H. Kipp, Alassane Ndiaye, and Patrick Gebhard. Evaluating
the tangible interface and virtual characters in the interactive COHIBIT exhibit,
2006.
[36] Andrew Ortony, Allan Collins, and Gerald L. Clore. The cognitive structure of
emotions., 1988.
[37] Christoph Bartneck. Integrating the OCC model of emotions in embodied charac-
ters. In Proceedings of the Workshop on Virtual Conversational Characters: Appli-
cations, Methods, and Research Challenges. Melbourne, 2002.
[38] Alexander Reinecke, Christian Dold, and Thomas Koch. Charamel Avatar Player
Interface. 2009.
[39] Alexis Heloir and Michael Kipp. EMBR - a realtime animation engine for interactive
embodied agents. In Proceedings of the 9th International Conference on Intelligent
Virtual Agents (IVA-09), pages 393–404. Springer, Amsterdam, 2009.
[40] N. Fazil Ayan, Ugur Kuter, Fusun Yaman, and Robert P. Goldman. HOTRiDE:
hierarchical ordered task replanning in dynamic environments. In Proceedings of
the ICAPS-07 Workshop on Planning and Plan Execution for Real-World Systems
- Principles and Practices for Planning in Execution. Providence, Rhode Island,
USA, 2007.