Chapter 8. Situated Dialogue Processing for Human-Robot Interaction in Cognitive Systems, Christensen et al.
Course: Robots Learning from Humans
Sabaleuski Matsvei
Interdisciplinary Program in Cognitive Science, Seoul National University
http://cogsci.snu.ac.kr
Contents
Introduction
Background
Multi-level Integration in Language Processing
Language Processing and Situational Experience
Talking
Talking about What You Can See
Talking about Places You Can Visit
Talking about Things You Can Do
Conclusions
Local visuo-spatial scenes
Spatial organization of an indoor environment
DIALOGUE
«THE WORLD»
Playmate scenario Explorer scenario
Introduction
Requirements for the solution
Gradual construction
Referentiality
Persistence
Efficiency & Effectiveness
LANGUAGE PERCEPTION
Winograd's SHRDLU
Incremental "left-to-right" linguistic analyses connected to visuo-spatial representations of local scenes.
Could understand and execute human commands
Had a basic memory to supply context
Small virtual world
Language consisting of around 50 words
Steels's Semiotic Networks
Open-ended, adaptive communication system
Ability to learn
Communicative success above 80%
Lexicon of around 50 words
Impossible to connect alternative meanings at the same time
Sony AIBO robots
Bi-directionality hypothesis
• Gradual construction
Use of Combinatory Categorial Grammar (CCG)
• Referentiality
Use of structured discourse representation models with the ability to resolve linguistic references to the situated context
• Persistence
Different referent resolutions can be combined, which is used in visual learning
• Efficiency & Effectiveness
The incremental comprehension model can sort out unlikely word and meaning hypotheses; performance of speech recognition and parsing is close to 90%
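The pruning of unlikely hypotheses mentioned above can be sketched as a simple beam filter. All names, scores, and the beam ratio below are illustrative assumptions, not values from the chapter:

```python
# Sketch: pruning unlikely word/meaning hypotheses during incremental
# comprehension. Scores and the beam ratio are illustrative assumptions.

def prune(hypotheses, beam=0.5):
    """Keep only hypotheses whose score is within `beam` of the best one."""
    best = max(score for _, score in hypotheses)
    return [(h, s) for h, s in hypotheses if s >= best * beam]

# Competing interpretations of a partially heard word, with recognition scores.
hyps = [("ball", 0.9), ("wall", 0.6), ("bowl", 0.2)]
print(prune(hyps))  # low-scoring "bowl" is discarded
```

At each incremental step, only the surviving hypotheses are extended, which keeps the search space manageable.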
Multi-level Integration in Language Processing
Modular model
A context-independent representation is constructed first, and only then is it interpreted against the preceding dialogue context
Incremental model
Every new word is related to representations of the preceding input
Principle of parsimony: preference for the least 'presuppositionally' heavy interpretations
e.g. "The postman delivered the baby." "Mary gave the child the dog bit a band-aid."
The incremental model is supported by the results of psycholinguistic research (saccadic eye-movement studies)
Language Processing and Situational Experience
Anticipatory effect
Disambiguation by scene understanding
Temporal projection
Focus of psycholinguistic research:
How information from situation awareness affects utterance comprehension
Interaction between LANGUAGE and VISION is mediated by CATEGORIES
The research revealed the effects listed above (anticipation, disambiguation by scene understanding, temporal projection)
Talking
Listening
Comprehending
Representing an utterance
Representing the Interpretation of an Utterance in Context
Comprehending an Utterance in Context
Picking Up the Right Interpretation
Speaking
Producing an Utterance in Context
Producing Speech
Representing an utterance
An utterance is represented as an ontologically richly sorted, relational structure – a logical form in a decidable fragment of modal logic
I want you to put the red mug to the right of the ball
Packing
Take the ball to the left of the box
Packing node
Internal relation
Packing nominal
Packing edge
Packing node target
Example of incremental parsing and packing of logical forms
Here is the ball
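The idea of packing can be sketched as a data structure in which alternative readings share common content. The class and attribute names below are hypothetical illustrations, not the chapter's actual representation:

```python
# Sketch of "packing": alternative logical forms that share structure are
# merged into one packed representation. Names are illustrative only.

class PackingNode:
    def __init__(self, nominal, alternatives=None):
        self.nominal = nominal                  # shared content (head word)
        self.alternatives = alternatives or []  # competing sub-analyses

# "Take the ball to the left of the box": the PP "to the left of the box" may
# modify "ball" or "take", so both readings hang off one shared packing node.
packed = PackingNode("take", alternatives=[
    {"reading": "modify-ball", "edge": ("ball", "left-of", "box")},
    {"reading": "modify-take", "edge": ("take", "left-of", "box")},
])
print(len(packed.alternatives))  # → 2
```

Because the shared material is stored once, the packed structure grows much more slowly than the number of distinct parses.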
Representing the Interpretation of an Utterance in Context
Co-reference relations are relations between mentions that refer to the same objects or events, e.g. pronouns ('it') or anaphoric expressions ('the red mug')
New referent identifier – [NEW : {antn}]
Antecedent referent – [OLD : {anti}],
[OLD : anti < {antj, ..., antk} < NEW : {antn}].
Reference structure can specify preference orders over sets of old and new referents
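A preference order over old and new referents can be sketched as a ranking function. The recency-based scoring below is an assumption for illustration, not the chapter's actual model:

```python
# Sketch: ranking antecedent candidates for a referring expression.
# The recency heuristic and field names are illustrative assumptions.

def rank_referents(old_referents, utterance_index):
    """Order OLD referents by recency; a NEW referent is the fallback."""
    ranked = sorted(old_referents,
                    key=lambda r: utterance_index - r["last_mention"])
    return ranked + [{"id": "NEW", "last_mention": utterance_index}]

olds = [{"id": "ant_i", "last_mention": 3}, {"id": "ant_j", "last_mention": 1}]
order = rank_referents(olds, utterance_index=4)
print([r["id"] for r in order])  # most recent antecedent first, NEW last
```

Resolution then tries candidates in this order, falling back to introducing a new referent only when no old one fits.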
Decision tree for dialogue moves
A dialogue move ('speech act') specifies how an utterance contributes to furthering the dialogue
Dialogue context model
Put the red ball next to the cube
Comprehending an Utterance in Context
Cross-modal salience model
Visual salience
Linguistic salience
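Combining visual and linguistic salience into one cross-modal score can be sketched as a weighted sum over referents. The weights and scores below are illustrative assumptions, not values from the chapter:

```python
# Sketch: a cross-modal salience model combining visual and linguistic
# salience per referent. Weights and scores are illustrative assumptions.

def cross_modal_salience(visual, linguistic, w_vis=0.6, w_ling=0.4):
    """Weighted combination of per-referent salience scores."""
    return {r: w_vis * visual.get(r, 0.0) + w_ling * linguistic.get(r, 0.0)
            for r in set(visual) | set(linguistic)}

visual = {"red_mug": 0.8, "ball": 0.3}   # e.g. object in view / visually prominent
linguistic = {"ball": 0.9}               # e.g. recently mentioned in dialogue
scores = cross_modal_salience(visual, linguistic)
print(max(scores, key=scores.get))  # → ball
```

The combined score lets a recently mentioned but visually unremarkable object outrank a visually prominent one, which is the point of making the model cross-modal.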
Word recognition lattice
Example of an incremental analysis
Utterance interpretation at grammatical level
Picking up the right interpretation
The parse selection system, based on a statistical linear model, explores a set of relevant acoustic, syntactic, semantic and contextual features of the parses and computes a likelihood score for each of them.
Parse selection is a function F : X → Y, where X is a set of possible input utterances and Y is a set of parses. We also assume:
1. A function GEN(x) which enumerates all possible parses for an input x.
2. A d-dimensional feature vector f(x, y) ∈ R^d, representing specific features of the pair (x, y).
3. A parameter vector w ∈ R^d.
The best parse is then F(x) = argmax_{y ∈ GEN(x)} w^T · f(x, y), where the inner product w^T · f(x, y) can be seen as a measure of the 'quality' of the parse.
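The linear-model selection can be sketched directly from the definitions above. The feature vectors and weights here are made-up toy values:

```python
# Sketch of linear-model parse selection: pick the parse y in GEN(x) that
# maximizes the inner product w^T · f(x, y). Values are illustrative only.

def select_parse(gen_x, f, w):
    """Return the parse with the highest linear score w^T · f(x, y)."""
    return max(gen_x, key=lambda y: sum(wi * fi for wi, fi in zip(w, f[y])))

f = {"parse_a": [1.0, 0.0, 2.0], "parse_b": [0.0, 1.0, 1.0]}  # f(x, y)
w = [0.5, 0.2, 0.3]                                           # learned weights
print(select_parse(["parse_a", "parse_b"], f, w))  # → parse_a
```

In the actual system the features would span acoustic, syntactic, semantic and contextual information, and w would be estimated from training data.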
Producing an utterance in context
http://mary.dfki.de:59125/
Production of an utterance is triggered by a communicative goal. The communicative goal specifies a dialogue move and the content which is to be communicated.
The utterance realizer uses the same grammar as the parser.
The MARY speech synthesis engine then produces audio output.
Referring expressions are generated using the incremental algorithm of Dale and Reiter. The algorithm is initialized with the intended referent, a contrast set, and a list of preferred attributes. It incrementally tries to rule out members of the contrast set for which a given property of the intended referent does not hold.
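A simplified sketch of the Dale and Reiter incremental algorithm follows. The objects, attributes, and values are illustrative assumptions, and the sketch omits refinements of the full algorithm (such as always including the type attribute):

```python
# Sketch of the Dale & Reiter incremental algorithm: walk through the
# preferred attributes, keeping each one that rules out some distractors,
# until the referent is uniquely identified. Data values are illustrative.

def incremental_algorithm(referent, contrast_set, preferred_attrs):
    description = {}
    distractors = list(contrast_set)
    for attr in preferred_attrs:
        value = referent[attr]
        ruled_out = [d for d in distractors if d.get(attr) != value]
        if ruled_out:                 # this attribute helps discriminate
            description[attr] = value
            distractors = [d for d in distractors if d.get(attr) == value]
        if not distractors:           # referent uniquely identified
            break
    return description

target = {"type": "mug", "colour": "red", "size": "small"}
others = [{"type": "mug", "colour": "blue"}, {"type": "ball", "colour": "red"}]
print(incremental_algorithm(target, others, ["colour", "type", "size"]))
# → {'colour': 'red', 'type': 'mug'}  (realized as "the red mug")
```

"Colour" alone leaves the red ball as a distractor, so "type" is added as well; "size" is never needed, which is what makes the algorithm incremental rather than exhaustive.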
Thank you for your attention