CIMMI - A Contextually Informed Multimodal Integration Method for a Humanoid Robot in an Assistive Technology Context
Elizabeth Harte
Bachelor of Science (Hons.), Computer Science (2005)
University of Auckland
Submitted for the Degree of
Master of Engineering Science (Research)
Intelligent Robotics Research Centre
Department of Electrical & Computer Systems Engineering
Monash University, Clayton, VIC 3800, Australia
April 2008
Declaration
I declare that, to the best of my knowledge, the research described herein is original except where the work of others is indicated and acknowledged, and that the thesis has not, in whole or in part, been submitted for any other degree at this or any other university.
Elizabeth Harte
Melbourne, Australia
April 2008
Acknowledgments
My thanks go to Prof. Ray Jarvis for his enthusiasm and support over the past 2 years. He is an inspiring roboticist and researcher. I will be lucky to be even half as creative as him in my life.
I would also like to acknowledge Mitsubishi Heavy Industries Ltd for the unique opportunity to use the Wakamaru robot for this project.
I would like to thank all the members of the Intelligent Robotics Research Centre. I have met some great people and made some great friends. Thank you for welcoming this Kiwi into the circle. In particular, thanks go to Alan Zhang and Jay Chakravarty for their valuable discussions and friendship during the development of this project.
Finally, I thank Rob Davidson for his moral support during this project. His encouragement, support and editorial skills were greatly appreciated during the writing of this thesis. Kia ora.
Abstract
Robots dealing directly with people need to be able to handle the uncertainties associated with human-centric environments, including human interaction. This field has seen much attention in the past decade, though work using multimodal systems has not been deeply explored. This work describes a Contextually Informed MultiModal Integrator, CIMMI, that fuses speech and symbolic gesture probabilistically and is informed by contextual knowledge in an assistive technology robotic application. Symbolic gestures are gestures that have semantic meaning, such as a wave gesture meaning "hello". Very few systems use symbolic gestures in a robotic domain, primarily because symbolic gestures are not as commonly used in day-to-day human conversation as deictic (pointing) gestures. Using symbolic gestures allows the system to resolve ambiguities associated with action words in a command hypothesis, where deictic gestures are often used to identify objects or locations. Exploring the use of symbolic gestures will be one focus of this work.
Since no deictic gestures are implemented in this work, contextual knowledge is used to resolve ambiguities associated with object and location words in a command. Contextual knowledge is defined as both conversational and situational. Conversational contextual knowledge uses the dialogue history to select the best command from a generated list of possible commands or to resolve ambiguities in the selected command. In human conversation, if a person had been talking about a yellow cup, then the conversational context allows a person to simplify additional references to the same object by saying "the cup" without having to specify the implied "yellow" again. Using this idea as an influence, ambiguities where the object or location information is missing in a command can be resolved by referencing the conversational context. These simple ideas are explored and discussed further in this work.
The second kind of implemented contextual knowledge is situational. In this work, situational contextual knowledge consists of the last known locations of the user and the robot, and a list of objects that exist in the environment. Each object in the list has a colour, an object class type (such as cup) and a location. Knowing the location of an object allows the user to simplify their spoken requests: they can simply ask "bring me the blue cup" without stating the implied "from the table". The same concept applies to specifying the colour of an object. Using these ideas as an influence, this knowledge can also resolve misrecognitions where the colour or location information is missing from a command. The user and robot locations are only used to assist the robotic responses. This situational knowledge is explored further in this thesis. The exploitation of conversational and situational contextual knowledge, as introduced above, will be the second focus of this work.
The accuracy of the speech recognition system alone (53%) was compared to using the speech with contextual knowledge, both conversational and situational. This increased the accuracy of the system to 72%. With the addition of gesture recognition as well, the system's accuracy rose by only a further 4%, because only 3 cases required the gesture to clarify the ambiguity in the spoken command. CIMMI was implemented on a humanoid robot, Wakamaru, with the speech, gesture and object recognition systems implemented on a separate laptop.
Contents

1 Introduction
   1.1 Motivation
   1.2 Multimodal Human-Robot Interaction
      1.2.1 Modes of Input
      1.2.2 Contextual Knowledge
      1.2.3 Multimodal Integration
   1.3 Contributions
   1.4 Thesis Overview

2 Literature Review
   2.1 Service and Field Robotics
   2.2 Healthcare Robotics
   2.3 Social Robotics
   2.4 Multimodal Integration
      2.4.1 Modes of Input
      2.4.2 Contextual Knowledge
      2.4.3 Methods of Integration
   2.5 Multimodal Human-Robot Interaction

3 Contextually Informed Multimodal Human-Robot Interaction and Robotic Responses
   3.1 Functional Scenarios
   3.2 Software Overview
   3.3 Speech Recognition
   3.4 Symbolic Gesture Recognition
      3.4.1 Segmentation
      3.4.2 Region Tracking
      3.4.3 Static Gesture Recognition
   3.5 Contextual Knowledge
      3.5.1 Conversational Context
      3.5.2 Situational Context
   3.6 Integrating Speech and Symbolic Gestures in an Informed Manner
      3.6.1 CIMMI Overview
      3.6.2 The Iterative Approach
      3.6.3 The Sequential Approach
   3.7 Hardware Overview
   3.8 Wakamaru's Responses

4 Experiments
   4.1 Speech Recognition
      4.1.1 Experiment 1: Selecting the Number of Recognition Hypotheses
      4.1.2 Experiment 2: Word Recognition Rate of Unparsed Speech Results
      4.1.3 Experiment 3: Word Recognition Rate of Parsed Speech Results
      4.1.4 Discussion
   4.2 Gesture Recognition
      4.2.1 Experiment 1: Gesture Detection and Recognition by Skin Segmentation Only
      4.2.2 Experiment 2: Gesture Detection and Recognition by Skin and Motion Segmentation
      4.2.3 Experiment 3: Gesture Detection and Recognition by Skin and Motion Segmentation in the New Laboratory Environment
      4.2.4 Discussion
   4.3 Multimodal Interaction Accuracy
      4.3.1 Experiment 1: Accuracy of the Individual Modes
      4.3.2 Experiment 2: Using the Iterative Approach
      4.3.3 Experiment 3: Using the Sequential Approach
   4.4 Object Recognition
      4.4.1 Experiment 1: Object Recognition using Colour Segmentation and ShapeMatcher in the Old Laboratory Environment
      4.4.2 Experiment 2: Object Recognition using Colour Segmentation and ShapeMatcher in the New Laboratory Environment
   4.5 Navigation and Obstacle Avoidance
      4.5.1 Experiment 1: Navigation and Obstacle Avoidance in the Old Laboratory
      4.5.2 Experiment 2: Navigation and Obstacle Avoidance in the New Laboratory
   4.6 Robotic Demonstration
      4.6.1 Natural Interactions
      4.6.2 Requesting Information
      4.6.3 Fetch-And-Carry Requests
      4.6.4 Task Performance

5 Discussions
   5.1 The Speech Recognition
   5.2 Integrating Speech with Contextual Knowledge
   5.3 Addition of Symbolic Gestures
   5.4 Comparing the Iterative and Sequential Multimodal Approaches
   5.5 The Robotic Demonstrations

6 Future Work and Conclusions
   6.1 Future Work
   6.2 Conclusions
List of Figures
3.1 The defined rooms in the laboratory environment. The height of the laser range finder, which is used for obstacle avoidance, is lower than the height of the white polystyrene walls between the rooms. The "desk" is the foreground table, the "kitchen" is at the furthest table and the "table" location is to the right with the blue cup on it. The closest Wakamaru robot is the active one. The other is not used in this system.
3.2 Overview of the Transactional Intelligence, and the Contextual Knowledge which informs it.
3.3 A diagram showing the speech recognition process, from capturing the spoken utterance through to generating a list of parsed recognition hypotheses.
3.4 The non-skin (left) and skin (right) histograms.
3.5 A series of motion segmentations of a wave gesture. (a) In the initial frame there is no history to segment any motion, so all lighter coloured regions are segmented. (b) In the second frame, the user has moved their arm a distance, so the previous motion segmentation and the currently detected motion are added as shown here. (c) After three frames, the static background region has been segmented away and only the user's motion, and speckles of motion of the person in the bottom right, are detected as shown here.
3.6 A single frame from a wave gesture using skin and motion segmentation.
3.7 A diagram of the gesture capture and recognition process.
3.8 The red circle outlines the face region and the orange circle represents the hand region. The blue dots represent the center of the bounding circles in the XY plane for each frame of the gesture sequence. The lighter the colour of the dot, the closer that part of the gesture, and vice versa.
3.9 Showing the 9 directions that should be differentiable by using the gesture tracking and the three features specified.
3.10 Captures from the SM program of matches between two gesture skeletons. Notice the individual segments of the skeleton representing the silhouette are numbered and different colours.
3.11 A diagram illustrating CIMMI's sequential algorithm.
3.12 Calculating the speech weight, wts, in the iterative approach.
3.13 Wakamaru with Bumblebee camera on head, Hokuyo URG laser range finder at middle and laptop on back.
3.14 Localization panoramic images on Wakamaru.
3.15 Screen captures from Wakatest, Wakamaru's debugging and testing tool.
3.16 A sample image of each of the recognized object classes.
3.17 A sample object capture which is segmented, made into a skeleton, then matched to the object database.
3.18 A diagram showing the SM matching process, as provided by Macrini.
3.19 Route following and obstacle avoidance on Wakamaru. (a) shows two sequential, but not consecutive, images of the robot's position approximately when the screen captures in (b) and (c) were taken. (b) shows two consecutive captures of the obstacle avoidance using the laser range finder. Obstacles are red and the path is green, with the yellow square as the start point and the blue square the end point. (c) shows Wakamaru's on-board obstacle avoidance sensor map in green and blue with the path in pink. If the on-board sensors detected an obstacle, parts of the blue sensor map would be red. Notice the faint pink crosses which identify the location of the landmark reflectors.
4.1 A capture of the glass panels along one side of the new laboratory.
4.2 (a) The binary skin segmented images. These segmented regions correspond to the coloured boxes in the coloured captures. (b) Three sequential, but not consecutive, captures from a go gesture. The coloured boxes bound the detected skin coloured regions. Note that the top two left hand boxes in the top row are patches of ceiling segmented as skin regions. Camshift started tracking the hand and face instead of the ceiling patches, resulting in the large box in the bottom row.
4.3 (a) The binary skin segmented images. These segmented regions correspond to the coloured boxes in the coloured captures. (b) Three sequential, but not consecutive, captures from a go gesture. The coloured boxes bound the detected skin coloured regions. Note that the top two left hand boxes in the top row are patches of ceiling segmented as skin regions. Camshift started tracking the hand and face instead of the ceiling patches, resulting in the large box in the bottom row.
4.4 (a) The binary skin segmented images. These segmented regions correspond to the coloured boxes in the coloured captures. (b) Three sequential, but not consecutive, captures from a go gesture. The blue box shows the location of the face in the first frame of the sequence and the coloured boxes bound the detected skin coloured regions. Note that the top two left hand boxes in the top row are patches of ceiling segmented as skin regions. Camshift started tracking the hand and face instead of the ceiling patches, resulting in the large box in the bottom row.
4.5 Graphs showing the gesture training database for the same dataset plotted against the three classification features. The gestures in (a) were only tracked by skin segmentation, and in (b) were tracked using both skin and motion segmentation. In (a), the gesture classes were quite separate from each other and training samples were clustered within each class. By comparison, in (b) the gesture classes were less separate from each other, with some overlap as well. The training samples were less clustered per class, especially the go gesture. This could be because the tracking of go gestures in the previous version was influenced by segmented background regions, as the gesture region could get stuck on these regions, modifying the path tracked.
4.6 A capture of the user not performing a gesture using skin and motion segmentation in the new laboratory.
4.7 A capture of the user performing a come gesture using skin and motion segmentation in the new laboratory.
4.8 A capture, and segmentations, of objects on the table in the old laboratory.
4.9 A capture, segmentation and skeleton of an object on the kitchen table in the new laboratory.
4.10 A bowl skeleton matching a cup skeleton due to the rotational invariance of the ShapeMatcher approach.
4.11 A sample capture where the yellow plate is partially out of frame, resulting in a misrecognition of the region.
4.12 A comparison of the camera disparity obstacle detection against the laser range finder obstacle detection.
4.13 A sequence of images of Wakamaru traveling from its charger base toward the kitchen table in the laboratory environment.
4.14 Waving to Wakamaru.
4.15 Wakamaru touching a recognized object.
4.16 An attempted object recognition capture. The robot's head is still moving downwards, resulting in a blurry image. This image contains part of the object and was, in fact, correctly recognized. Other captures would still only be looking at the wall, resulting in a failed recognition.
4.17 Another attempted object recognition capture. The segmented cup silhouette failed to be recognised as a cup because of the shadow on the table, which was segmented as part of the cup, changing the silhouette. This meant the cup had a top width of similar width to its base.
List of Tables
3.1 Sample Unparsed Speech Recognitions
3.2 Sample Risk Values
3.3 Sample Values from the Associative Map
3.4 Sample Parsed Speech Recognitions
3.5 Table showing some Sample Command Hypotheses from Speech Input. These utterances were spoken in the order listed. Because the Yellow Cup in Ex. 1 is not in the known object database, the robot requested its location, which the user provided in Ex. 2. The combination of these gives an unambiguous command hypothesis "Get Yellow Cup Desk". Ex. 3 illustrates that, using the conversational history, the identity of the cup can be resolved to "Yellow Cup" where the user is asking the robot to take the cup away again.
3.6 Table showing Sample Objects from the Object Database
4.1 The First Ten Recognition Hypotheses for a Spoken Utterance
4.2 Sample Unparsed Speech Recognitions
4.3 Word Recognition Rate (WRR) for the Unparsed Speech Results
4.4 Sample Parsed Speech Recognitions
4.5 Word Recognition Rate (WRR) for the Parsed Speech Results
4.6 Two Cases of Deletion Errors for Parsed Speech
4.7 Comparing the Gesture Detection and Recognition Results
4.8 Table showing some Sample Unimodal Hypotheses Selected by the Iterative Approach
4.9 Table showing some Sample Multimodal Hypotheses Selected by the Iterative Approach
4.10 Comparison of the Use of Speech and Contextual Knowledge against the Iterative version of CIMMI
4.11 Table showing some Sample Unimodal Hypotheses Selected by the Sequential Approach
4.12 Comparison of Speech Recognition Accuracy Results
4.13 Table showing some Sample Multimodal Hypotheses Selected by the Sequential Approach
4.14 Comparison of the Accuracies of using Speech with Contextual Knowledge to the two different versions of CIMMI
1 Introduction
1.1 Motivation
Increasing levels of affordable computational resources and quality sensors have per-
mitted the development of robots that can operate in a widening set of domains,
including both structured and unstructured environments, at times shared with people.
Sometimes a robot is designed because it can do a task more accurately or ef-
ficiently than a human, such as with industrial and surgical robots. Other times
robots are designed because a human cannot perform the dangerous or inaccessible
task, such as with search and rescue or space exploration. In the case of service
robotics, robots are often designed to perform tasks which directly interact with,
and assist, a person.
The need for assistive robotics is increasing as demographic studies have shown
there will be an increase in the number of elderly in the future [1]. According to the
Australian Bureau of Statistics in 2004, 13% of the population was over 65. This
figure is expected to increase to between 26% and 28% by 2051 [2]. While assistive
technologies may not replace human carers in the future, they could provide viable
alternatives. A robot carer could assist an elderly person in a domestic environment,
allowing that person to be independent and remain in their home while still having
some basic support and security.
Robot assistants dealing directly with people need to be social robots - ones that
interact with people by following some expected behavioural norms [3]. They also
need to be able to handle the uncertainties associated with human-centric environ-
ments including people interactions. This field has seen much attention in the past
decade, however work in multimodal systems for robotic platforms has not been
deeply explored.
1.2 Multimodal Human-Robot Interaction
Multimodal integration is where two or more modes of input are combined to create a
richer, and possibly more accurate, interpretation than possible with a single mode
alone. Using the partial information from one mode to clarify another is called
mutual disambiguation - a concept defined and proved by Sharon Oviatt [4].
1.2.1 Modes of Input
As with human communication, speech is commonly used as a primary mode of
communication input in multimodal robotic systems. The speech input is combined
with other modes, such as gestures or emotion recognition, complementing its in-
terpretation. In human-robot interaction (HRI), deictic gestures (pointing gestures)
are most commonly used as they can replace object or location information missing
from the speech. For example, if the user said "Bring me that cup", then the direction
of the pointing gesture could resolve which cup to get.
Symbolic gestures, also called emblems, are gestures that hold their own semantic
meaning and can be understood without context from additional speech, such as a
"thumbs up" gesture [5]. These gestures have been explored broadly in human-
computer interaction (HCI) as pen-based actions or as GUI interface clicks. This
work has been transferred to touch screen interfaces for HRI but natural hand-
performed gestures have not been researched deeply [6]. Symbolic gestures can
substitute, reinforce, or conflict with action words in the speech input.
1.2.2 Contextual Knowledge
While humans communicate by multiple modes, they also use their knowledge to
aid their interpretations. Contextual knowledge is instinctively used by people to
resolve what may otherwise be ambiguous communications. Contextual knowledge
is broad and varied, though some simple and common kinds are:
- what they know about the environment
- what they have been talking about
- the speaker and listener locations
These different kinds of context can be used by a robot to help it to more ac-
curately interpret, or respond to, what is communicated to it by a human user.
The environmental knowledge can provide situational context about objects in the
environment, so a spoken utterance like "get the cup in the kitchen" could be stated
without specifying which cup. The speaker could assume the listener had shared
knowledge of objects in the environment, such as the green cup in the kitchen.
Similarly, the conversation history can provide context about which specific cup
a speaker is talking about, by referring to a previously mentioned yellow cup.
The conversational context can also be used to resolve a command by combining a
user's clarification with the last recognized user command.
The speaker and listener locations, another part of the situational context, can
be used to increase the likelihood of one object being requested over another, or to
aid responses to a request.
In a robotic system, all this knowledge can be used to help the system inter-
pret, and respond to, human communication more accurately. Whether the speaker
provides an incomplete command or some information is misrecognized, contextual
knowledge can help resolve the ambiguities in a useful manner.
1.2.3 Multimodal Integration
To combine multiple modes of input, various methods have been proposed. A popu-
lar early work, QuickSet, combined two n-best lists of input, one of speech and one
of gestures. An n-best list is a list of speech or gesture recognition hypotheses, any
one of which could be the correct recognition. QuickSet would statistically combine
the n-best lists using the probabilities of each hypothesis [7] [8]. This statistical ap-
proach became the driving force for the systems that followed [9] [10] [11] [12].
More recently, learning algorithms [13] [14] [15] and domain independent approaches
[16] [17] [18] have been developed for use with HCI systems.
In HRI, the multimodal systems often process speech with deictic (pointing)
gestures integrated at the semantic level [19] [20] [21]. The methods implemented
vary from frame-based matching to constraint-based integration to learning algo-
rithms. In these systems, object recognition was used to identify which object the
pointing gesture was pointing at, to resolve anaphoric ambiguities in the speech. The
multimodal fusion in [21] only combined the gesture inputs if there was a need for dis-
ambiguation, because their gesture recognition accuracy was low, an idea supported
by [13]. The current multimodal systems may use contextual knowledge, but very
few assess their practical implementations [19] [20] [21].
1.3 Contributions
A robot developed to interact with humans in human-centric environments re-
quires both transactional and spatial intelligence. Transactional intelligence comprises
the skills to interpret what was communicated and interact with the human to re-
solve ambiguities if a difficulty occurs. In this work, the transactional intelligence
consists of speech and gesture recognition as well as their probabilistic combination
in CIMMI, a Contextually Informed MultiModal Integrator. CIMMI combines the
speech and symbolic gestures semantically, where speech is the primary mode of
input. The semantic relationships between the speech and gestures are manually
defined in an associative map, illustrating whether a word positively, neutrally or
negatively reinforces the semantic meaning of the corresponding gesture. The sym-
bolic gestures are also used to substitute a speech word semantically. In this work,
a substitution occurs when an action word is misrecognized in the speech; the
recognized gesture then provides the action word in the command hypothesis.
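To make the associative map and gesture substitution concrete, the following minimal sketch shows one possible encoding. The +1/0/-1 values, the example entries and all names (ASSOCIATIVE_MAP, GESTURE_MEANING, substitute_action) are illustrative assumptions, not the implementation used in this thesis.

```python
# A minimal sketch of the associative-map idea described above. The
# +1/0/-1 encoding (positive/neutral/negative reinforcement) and every
# name here are illustrative assumptions, not the thesis implementation.
ASSOCIATIVE_MAP = {
    ("wave", "hello"): +1,   # gesture positively reinforces the word
    ("wave", "stop"):  -1,   # gesture conflicts with the word
    ("come", "bring"): +1,
    ("come", "table"):  0,   # gesture is neutral for this word
}

# Semantic meaning of each symbolic gesture, used for substitution.
GESTURE_MEANING = {"wave": "hello", "come": "bring", "go": "leave"}

def substitute_action(command_words, recognized_gesture):
    """If no action word was recognized in the speech, substitute the
    semantic meaning of the recognized symbolic gesture."""
    action_words = set(GESTURE_MEANING.values())
    if not action_words.intersection(command_words):
        return [GESTURE_MEANING[recognized_gesture]] + list(command_words)
    return list(command_words)

# e.g. a misrecognized "... the yellow cup" plus a "come" gesture
print(substitute_action(["the", "yellow", "cup"], "come"))
# -> ['bring', 'the', 'yellow', 'cup']
```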
Spatial intelligence refers to being able to understand and navigate through a
space, to know the location and function of objects in that space, and to know how
to manipulate them. In this work, a simple navigation and obstacle avoidance system and an
object recognition system were implemented to demonstrate the successes, and failures,
of CIMMI.
The transactional and spatial intelligences can be informed by contextual knowl-
edge, both conversational and situational. Conversational context is built by previ-
ous dialogue between the robot and the user. It can be used to resolve the identity
of an object: if a yellow cup was requested earlier in the conversation, then
when the speaker asks about a cup, it can be assumed to be the same cup as the yellow one. This conversational information is used to clarify object identities
and locations in this work.
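As an illustration of this use of the dialogue history, the sketch below resolves an underspecified reference against the most recent matching command. The data structures and field names are assumptions for illustration only, not the thesis implementation.

```python
# A minimal sketch of conversational-context resolution, assuming a simple
# dialogue history of recognized commands. The dictionary schema and all
# names are illustrative assumptions.
dialogue_history = [
    {"action": "get", "colour": "yellow", "object": "cup", "location": "desk"},
]

def resolve_from_history(hypothesis, history):
    """Fill a missing colour or location from the most recent command
    that mentioned the same object class."""
    for past in reversed(history):
        if past.get("object") == hypothesis.get("object"):
            for field in ("colour", "location"):
                hypothesis.setdefault(field, past[field])
            break
    return hypothesis

# "take the cup away": "the cup" resolves to the yellow cup discussed earlier
print(resolve_from_history({"action": "take", "object": "cup"}, dialogue_history))
```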
Situational context includes spatial information about known objects in the en-
vironment, the robot and the user. The known object knowledge is used if a speaker
requests "the green cup" without stating the location of the cup. The robotic system
can resolve this missing location by referencing the green cup in the known object
knowledge and returning the last known location of it. This situational context is
used to clarify objects' identities (i.e. colour) and locations in this work. The robot
and user locations are not used to clarify ambiguities in the multimodal commands, but do dictate how the robot responds to particular requests, e.g. is the robot already
in the correct location?
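The situational lookup just described might be sketched as follows. The object database schema (colour, class, location) follows the description in this section; the code itself is an illustrative assumption, not the thesis implementation.

```python
# Minimal sketch of situational-context resolution: the last known object
# locations fill in a location omitted from the spoken request.
object_db = [
    {"colour": "green", "class": "cup", "location": "kitchen"},
    {"colour": "blue",  "class": "cup", "location": "table"},
]

def resolve_location(colour, obj_class, db):
    """Return the last known location of a requested object, or None if
    the object is unknown (an ambiguity the robot must query)."""
    for entry in db:
        if entry["colour"] == colour and entry["class"] == obj_class:
            return entry["location"]
    return None

# "get the green cup" -> location resolved to "kitchen"
print(resolve_location("green", "cup", object_db))
```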
The transactional intelligence developed in this thesis primarily involves fusing
the speech and gesture modes in a probabilistic multimodal integration method to
generate a command hypothesis that can be correctly responded to by the robot.
The command hypothesis is selected by a weight-based system, where the weights are
calculated using the transactional intelligence and contextual knowledge. A weight
signifies CIMMI's confidence in a command hypothesis, where the weight is based
on:
- the speech/gesture pair relationship according to the associative map
- whether any word in the speech input has an associated risk value
- whether the object or location mentioned was discussed in the conversation history
- whether the object mentioned exists in the object knowledge
These weights, combined with the individual probabilities of the speech and
gesture recognition systems, support the selection of a multimodal command to be
responded to by the robot, so long as it has no ambiguities.
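A minimal sketch of this weight calculation is given below, combining the four factors listed above with the individual mode probabilities. The term magnitudes (0.1) and all data structures are illustrative assumptions, not the weights actually used by CIMMI.

```python
# A sketch of the weight-based hypothesis selection described above. Each
# command hypothesis accumulates a confidence weight from the associative
# map, word risk values, the conversation history and the object knowledge.
def hypothesis_weight(hyp, gesture, assoc_map, risk, history, object_db):
    w = hyp["speech_prob"] * hyp.get("gesture_prob", 1.0)
    # speech/gesture pair relationship from the associative map (+1/0/-1)
    w += 0.1 * assoc_map.get((gesture, hyp.get("action")), 0)
    # penalise words that carry an associated risk value
    w -= sum(risk.get(word, 0.0) for word in hyp["words"])
    # reward objects or locations already discussed in the conversation
    if any(past.get("object") == hyp.get("object") for past in history):
        w += 0.1
    # reward objects that exist in the known-object database
    if any(entry["class"] == hyp.get("object") for entry in object_db):
        w += 0.1
    return w
```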
In this work, ambiguities are defined as:
- missing or uncertain information, including references to non-existent objects
- a conflict between the selected speech/gesture pair
- low individual probabilities of the speech input
Missing information can be resolved either by using the symbolic gestures, the
contextual knowledge or by requesting more information from the user. A conflict
between the speech/gesture pair, or low speech probabilities, must be resolved by
requesting confirmation of the spoken input from the user.
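The resolution strategy just described could be triaged as in the following sketch, where the 0.4 probability threshold, the callables and the field names are assumptions for illustration only.

```python
# A sketch of the ambiguity triage described above.
def respond_to(hyp, gesture_meaning, context_fill, confirm_with_user):
    # conflicts or low speech probability: ask the user to confirm
    if hyp.get("conflict") or hyp["speech_prob"] < 0.4:
        return confirm_with_user(hyp)
    # a missing action word can be substituted by the symbolic gesture
    if hyp.get("action") is None and gesture_meaning is not None:
        hyp["action"] = gesture_meaning
    # missing object/location information is filled from contextual knowledge
    for field in ("object", "location"):
        if hyp.get(field) is None:
            hyp[field] = context_fill(field)
    # anything still missing must be requested from the user
    if any(hyp.get(f) is None for f in ("action", "object", "location")):
        return confirm_with_user(hyp)
    return hyp
```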
1.4 Thesis Overview
In this chapter, some of the concepts associated with multimodal integration, in-
cluding the modes and methods, were discussed and the motivation for this work
explained. In the next chapter, a review of the previous work done in these fields
is discussed more technically. This review is not a complete survey of the fields
involved, but does discuss the key ideas and issues involved in multimodal human-
robot interaction.
With this survey as a base, chapter 3 discusses and justifies the concepts used in
this thesis, with particular emphasis on the multimodal integration methodology. The
speech and gesture recognition systems, which provide the inputs to the multimodal
integration algorithm, are described, as is the demonstration system, as implemented
on Wakamaru - a humanoid robot.
Having described the work completed, the experimental testing and results are
presented in chapter 4. The speech and gesture recognition systems are assessed as
unimodal systems and then compared to the multimodal results of CIMMI. CIMMI is
then tested in some robotic demonstrations where object recognition and navigation
are assessed individually first. The successes, and failures, of these experiments are
then discussed in chapter 5.
In the final chapter, the future work for this project is outlined, and the conclu-
sions of the work presented.
2 Literature Review
2.1 Service and Field Robotics
Since Al-Jazari's mechanical boat of musicians in the 13th century [22], people have
created machines to perform both simple and complex tasks [23] [24] [25]. In 1954,
George Devol and Joseph Engelberger invented Unimate, the first programmable
and teachable robotic arm which became the basis for the now flourishing indus-
trial robotics industry [26]. While industrial robotics is one major area of modern
robotics, the other is service and field robotics. Industrial robots are used because
they perform their tasks better than a human in terms of speed and precision, and
relieve humans of the tedium of repetition. In comparison, service and field robots
are used because a human needs, or prefers, a robot to perform a dangerous, inac-
cessible or assistive task.
Field robotics can be divided into three key areas: search and rescue robotics, space
exploration, and military, police or security robotics. Search and rescue robots are
designed to access places where it may be too dangerous for a human to go, such as
in a bush fire or collapsed buildings [27] [28]. Space exploration robots, like the Mars
rovers, explore where a human is, so far, unable to go [29] [30]. More recently, space
robotic aids, like the Robonaut [31] [32], have been developed to assist astronauts
perform tasks, though at this stage these robots are teleoperated. Military and police
robots' functions include bomb detection and defusing, security and surveillance, and
transportation [33] [34] [35].
Service robots are designed to interact with humans directly and to perform tasks
to assist a human, often in everyday chores of a tedious nature or where assistance is
needed by a frail or disabled person. Service robotics can be broken into several
key domains:
- Entertainment: Entertainment robots, such as Sony's Aibo [36] and WowWee's RoboSapien, perform interesting acts by teleoperation and are quite popular purchases for consumers despite their lack of truly social interaction.
- Public guides: Public guides, such as in a museum or airport, require a much higher degree of interaction to accomplish their more complex tasks, but the interaction can be by touch screen [37] [38] [39] rather than by speech [40].
- Health care robotics: Health care robots have been designed to assist the elderly and infirm by guidance and scheduling reminders. This is discussed further in the next section.
- Personal assistants: Personal robots have been developed to serve people in human-centric environments, such as in offices and homes. This is discussed in more detail in Section 2.3.
2.2 Healthcare Robotics
The design and development of healthcare robots to assist the elderly and infirm
has been broadly researched in recent years, as statistics indicate there will be a
greater proportion of elderly in the future [1] [41]. According to the Australian
Bureau of Statistics in 2004, 13% of the population was over 65. This figure is
expected to increase to between 26% and 28% by 2051 [2]. Furthermore in 2004,
1.5% of the population was over 85 years old, and this is also expected to rise to
between 6% and 8% by 2051 [2]. While assistive technologies are unlikely to replace
human carers in the future, they could provide viable alternatives. A robot carer
could assist an elderly person in a domestic environment, allowing that person to
be independent and remain in their home while still having some basic support and
security. The research for assistive technology ranges from monitoring systems [42]
[43] [44] to scheduling systems [45] [46] [47] to guidance systems [47] [48] [49] and
robotic assistants [50] [51] [45] [47].
Healthcare robots include Pearl, which was developed by the NurseBot Project
[45] [47] to locate a person, remind them of an appointment and guide them to their
destination. The platform was a mobile unit with unmoving walker arms for support
and a touch screen. The touch screen and speech synthesis provided the human-robot
interaction, which was adequate for the testing of this system in a nursing home.
The Care-O-bot system was developed by Hans, Graf and Schraft from Fraunhofer
IPA, Stuttgart, and could perform fetch-and-carry and various other supporting tasks
in a domestic environment [50] [51]. It had a gripper arm, allowing it to lift and hold
objects, but the robot itself could also be used as a support and walker. The system
included a simple scheduler, could make emergency phone calls and, with its touch
screen monitor, could provide videophone, TV and stereo interfaces. The human-robot
interaction was performed multimodally, but not cooperatively, as the system could
either process a speech input or a touch screen input, but did not integrate both.
2.3 Social Robotics
Personal assistants ten years ago were merely fetch-and-carry systems where the
focus of the systems was the planning, mapping, localization and obstacle avoidance
[52] [53]. More recently, there has been an emphasis on making robots more social
so that they can perform tasks and provide information more accurately [21] [20] [54] [55]. Bartneck and Forlizzi [3] defined a social robot as:
... an autonomous or semi-autonomous robot that interacts and commu-
nicates with humans by following the behavioral norms expected by the
people with whom the robot is intended to interact.
Breazeal [56] comments that humans have to develop socially and emotionally
to survive in our society, so it should not be surprising that we must give robots,
or have them learn, these same skills to the same end. This point is reiterated by
Forlizzi [57], who introduced Roomba robots into domestic environments and per-
formed an ethnographic study of the human-robot interactions. Roomba, compared
to another robot vacuum cleaner Flair, inspired social interaction, in some cases
being nicknamed and spoken to as an entity, even though it could not socially inter-
act itself. While the author does not give a reason for the different treatment, it is
assumed that it is because of Roomba's autonomy, competency and cute appearance
as compared to Flair.
The form of the robot is another interesting area of research, particularly with
regards to human-robot interaction (HRI). Many robotic assistants use graphical
displays of a face or user interface on top of a robotic platform, such as RHINO [37],
BIRON [58], Care-o-bot [51], Nursebot [47], Valerie the Roboceptionist [54] and
Grace and George [59]. These displays were often chosen for their expressiveness
and reliability compared to the mechanical alternative. However, it is acknowledged
that the lack of a physical embodiment does make HRI more difficult, especially as
the human often has trouble knowing where the robot is looking [54].
Most socially interactive robots are anthropomorphic in some regard, with the
well-known exceptions being robotic pets, such as PARO - a baby harp seal robot
[60]. Some of these robots may only have an expressive face as the human-like
part, such as Kismet, an expressive anthropomorphic robot head [61] and Sparky,
a small mobile robot with an expressive face [62]. Most, however, are humanoid
with a mechanical embodiment of a face on either a bipedal [63] or wheeled robotic
platform [19] [64] [40] [65] [66] [21]. The humanoid form is argued by researchers to
be ideal for social HRI because, for robots to interact with humans like a human
(using gaze, gestures, etc.) they need to be physically capable of performing such
acts [55] [67] [66]. Also, with a humanoid form, a human's expectations of the
robot's functions will be natural human interaction, an ideal scenario for HRI
[64] [40] [65].
Repliee-Q1 and Repliee-Q2 are an extension of the humanoid form robots, as the
robots are cast from human subjects and have realistic skin and facial features [68].
These robots are androids (or gynoids if female) and have numerous actuators from
the waist up controlling all their motion. Because of the realism of these systems,
people's interactive expectations are very high.
2.4 Multimodal Integration
Multimodal integration or fusion is the challenge of taking many different modes of
input, such as speech, lip tracking, gestures, posture, head pose etc., and merging
all this information, either at the feature level or semantic level, to create a richer
interpretation than any single mode may have provided [4]. It has been proved that this fusion of inputs can resolve errors that may occur in a single input by using
partial information of another input - a concept called mutual disambiguation, as
defined and proved by Sharon Oviatt in [4].
2.4.1 Modes of Input
Speech is often used as the primary mode of communication for people so this is
often the case for both human-computer and human-robot interactions [69] [70] [19]
[20] [21] [71] [72]. Despite it being the primary mode of communication, it is rarely
the only mode of communication. According to McNeill, 90% of gestures occur in
conjunction with some speech [73]. People commonly use different types of gestures
in conjunction with speech to communicate their intentions. Gestures, either pen-
based or naturally performed by hand, are often classified in four key ways [73]
[5]:
- Symbolic: Also called emblems, these are gestures that have a semantic meaning, such as a wave to mean "hello". Sign language is a vocabulary of symbolic gestures.
- Deictic: Pointing gestures referring to people or objects in a space.
- Iconic: Iconic gestures resemble what they convey, such as showing the size, shape or motion of an object.
- Metaphoric: These gestures involve the abstract manipulation of an object or tool.
Gullberg found that over 50% of observed spontaneous gestures were deictic
gestures [69]. These gestures can replace object or location information missing from
the accompanying speech by pointing to an object or location [70] [19] [20] [21]. For
example, if a user said "put the lamp here", then the pointing gesture would resolve
the location of "here". The first system to integrate speech and deictic gestures was
described in Bolt's seminal paper, Put-That-There [74].
Alternatively, symbolic gestures have a direct meaning and can be understood
without context from additional speech [5]. Because of this independence from
speech, they can reinforce, substitute or conflict with speech [71] [72]. As pen-based
gestures, symbolic gestures can be GUI interface clicks or predefined structures,
such as arrows, which have a specific meaning in the application [7] [8] [75] [76].
The form of a symbolic hand gesture has to adhere to an expected structure to give the
gesture its specific meaning; for example, a "stop" gesture is indicated by an open
palm facing away from the gesturer. In human-robot interaction, symbolic gestures
have been broadly explored on touch screen interfaces [45] [47] [50] [51] but not so
much as natural hand gestures [6].
More recently, parallel modes of input that complement each other, such as
speech and lip tracking, have been fused where one mode directly reinforces the
other [15] [14].
2.4.2 Contextual Knowledge
Contextual knowledge is instinctively used by people to communicate with each
other. However, it is only sometimes used in unimodal or multimodal systems to
select a command to be performed. It can have many forms and uses, as context
is a broad area of information. Some of the key contexts used in HRI systems are
conversational and situational.
Conversational contextual knowledge is where the previously recognised com-
mands contribute to selecting the current command [40] [77] [20] [21] [78] [79] [80]
[81]. Storing this dialogue history is also a method to resolve ambiguous commands
by combining information provided by the speaker to form an unambiguous com-
mand [40] [20] [21] [77].
An extension of conversational contextual knowledge is the ability to limit
a speech recognition vocabulary depending on an expected answer [40] [20]. When
a vocabulary consists of 10,000 words or more, the likelihood of misrecognitions
increases. To decrease the number of misrecognitions, sub-vocabularies can be used
to lessen the range of words recognized by the system. Alternatively, sub-grammars
can be used to improve the speech recognition performance, where a grammar defines
the form the recognized spoken utterance could take [77].
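As an illustration of sub-vocabularies and sub-grammars, the sketch below selects a smaller active vocabulary based on the dialogue state. The states and word lists are illustrative assumptions and follow no particular recognizer's grammar syntax.

```python
# A sketch of vocabulary restriction by dialogue state: when the system
# expects a particular kind of answer, only the matching sub-vocabulary
# is active, reducing the chance of misrecognition. All names and word
# lists here are illustrative assumptions.
SUB_VOCABULARIES = {
    "expect_location":     ["kitchen", "desk", "table"],
    "expect_confirmation": ["yes", "no"],
    "expect_command":      ["get", "bring", "take", "cup", "plate", "bowl"],
}

def active_vocabulary(dialogue_state):
    """Return the sub-vocabulary for the expected answer, falling back
    to the full command vocabulary."""
    return SUB_VOCABULARIES.get(dialogue_state, SUB_VOCABULARIES["expect_command"])

print(active_vocabulary("expect_confirmation"))  # -> ['yes', 'no']
```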
Contextual knowledge can also be situational, where the location of the agents in-
volved contributes to selecting a current command [70] [77] [80]. Situational context
is called "situatedness" in speech recognition applications [82] [77] [83]. Situational
context has been defined in a number of ways, including by plan recognition [82] [83]
and information about objects in the environment [21] [82] [77].
2.4.3 Methods of Integration
In addition to the types of modes and knowledge that are fused multimodally, how
they are fused is also an important field of research. Virtual World [75], Finger
Pointer [84], VisualMan [76], Jeanie [85] and QuickSet [7] [8] all integrated speech
and either pen or gesture based motion at a semantic level. QuickSet, though,
was the first system to use a unification-based system, a logic-based method for
combining the meanings from two modes into a single common interpretation.
The other systems used frame-based integration, a pattern-matching method for
fusing data structures of semantic information into a single common interpretation.
QuickSet would take two n-best lists of speech and gesture semantic information,
where an n-best list is a list of n possibly correct interpretations of each mode. It
then generated a list of multimodal commands by exhaustively combining the speech
and gestures that temporally align. That is, where a gesture precedes or overlaps the
speech [86]. The list of multimodal commands was then sorted statistically using the
probabilities of each list item from the individual modes. Finally, a multimodal com-
mand was selected if the speech/gesture pair defining it could be legally combined
into an executable instruction.
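A minimal sketch of this QuickSet-style fusion is shown below: temporally aligned speech/gesture pairs are combined and ranked by their joint probability. The timestamp representation, the alignment test and the independence assumption behind the probability product are illustrative assumptions, not QuickSet's actual implementation.

```python
# A sketch of QuickSet-style statistical fusion of two n-best lists.
# Each hypothesis is {'meaning': ..., 'prob': p, 'start': t0, 'end': t1}.
def fuse(speech_nbest, gesture_nbest):
    fused = []
    for s in speech_nbest:
        for g in gesture_nbest:
            overlaps = g["start"] <= s["end"] and g["end"] >= s["start"]
            precedes = g["end"] <= s["start"]
            # temporal alignment: the gesture precedes or overlaps the speech
            if overlaps or precedes:
                fused.append({
                    "speech": s["meaning"],
                    "gesture": g["meaning"],
                    "prob": s["prob"] * g["prob"],
                })
    # rank the combined commands by joint probability
    return sorted(fused, key=lambda h: h["prob"], reverse=True)
```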
While QuickSet was highly expressive and scalable, it did not account for the
correct solution not being recognized by one of the input modalities. Some of the
QuickSet developers extended their multimodal system to include an associative map
to represent legal semantic combinations between all the defined types of speech and
pen-based inputs, as well as formalizing the integration process statistically [86].
Others proposed using a finite state automaton which would parse, understand and
interpret speech and gesture inputs [87]. The statistical approach
defined by [86] became the driving force for the multimodal field, leading to the
development of SmartKom [9], MUST [10], a speech and gaze multimodal system
[11] and a speech and mouse interface for remotely controlling a UAV robot [12].
Despite this statistical strength, a variety of approaches were developed for mul-
timodal fusion in the last five years. A new method for combining multimodal inputs
where one or more are only intermittently used (such as gestures) was defined by [13].
The approach involved learning to weight non-verbal features only when they are
relevant to the speech features, i.e. when there is an ambiguity. This technique
showed a 73% improvement over using the non-verbal inputs whenever available. A
semantic based fusion technique was developed by [16] to make their system domain
independent, similar to the more recent work [17] and [18]. For the modes of speech
and lip-tracking, coupled hidden Markov models [15] and a method to learn a collec-
tion of meaningful multimodal structures [14] have recently been developed, though
this field has been explored far more deeply than this brief summary suggests [88].
2.5 Multimodal Human-Robot Interaction
Early multimodal HRI systems did not combine the multiple modes of input, but
rather acted on one or the other [50] [51]. One of the earliest multimodal HRI
systems that combined the modes was Hermes, by Bischoff et al [40]. Hermes was
a robust humanoid museum robot that used speech, vision and touch with context
information about the nature of the conversation to interpret a user's intention.
The vision system would perceive objects to help resolve pronouns, and the tactile sense allowed Hermes' attention to be redirected [40]. The speech system had a vocabulary
of about 350 words but would only check for a handful of words depending on the
conversational context, as triggered by key words.
Iba et al from CMU described the framework where a robot is programmed in-
teractively through a multimodal interface [70]. The user can provide feedback and
interrupt tasks to tell the robot to avoid an obstacle or stop moving. The system
uses speech and one- or two-handed gloved gestures as the modes of input, though
the accuracy of these systems is not reported. The limited speech vocabulary con-
tains about 50 words, and the gesture system has seven single-handed gestures and one two-handed gesture. The multimodal integration gives the symbolic interpreta-
tions of the speech and gestures probabilistic values that are used to decide which
task should be performed based on the current situational context. This system
was demonstrated, but had not been quantitatively evaluated in the most recent
publication.
Rogalla et al developed an HRI system using the service robot Albert [19]. The
system integrated speech, gestures and object recognition into an event manage-
ment architecture, where an event is a single input. The event manager transformed
a certain event or group of events into an action to be performed. Their gesture
recognition's overall performance was 95.6% for six static gestures using skin segmenta-
tion and a Fourier description of the normalized contour. Their speech recognition
system was domain dependent to increase accuracy, though the accuracy was not
reported.
Gorostiza et al [89] described a multimodal HRI framework for a personal robot
assistant. The framework took into account speech, body gestures and tactile sensors
as the modes of input and used markup languages to encode these functionalities.
As this project was still in development, they did not specifically outline how the
multimodal integration would work in their current publications.
Huwel and Wrede implemented a multimodal human-robot system which used
speech, gesture and object recognition as well as environmental information [20].
The focus of their project is to enable a robot to carry out an involved dialogue for
handling instructions. Their speech system was domain independent, could incor-
porate results from pointing gestures and object recognition, and could interpret
speech even if the input is not grammatically correct and the recognition incorrect.
They used situated semantic units to define speech interpretations, with any miss-
ing information presumed to be available in the contextual knowledge. While
the contextual knowledge is not described, it is suggested that conversational and
object information is used. As the work came from a German research group, the users
were non-native English speakers, all these factors contributing to the average 52%
accuracy of their speech recognition. Because of the design of their system, though, 61% of the 1642
test utterances were interpreted as correct utterances with full semantic meaning,
illustrating how their speech system recovers from the speech recognition errors.
They do not report the accuracy of their gesture and object recognition systems.
The most recent HRI publication is [21] by Stiefelhagen et al from the German
research group Sonderforschungsbereich on humanoid robotics. They have developed
a robotic system whose internal systems include speech recognition, dialogue pro-
cessing, localization, tracking and identification of a person, pointing gesture recog-
nition and head orientation recognition. Each of these systems has been developed
and tested separately, and they are being integrated into a single system. The multimodal
fusion uses a constraint-based system to semantically fuse inputs from speech, point-
ing gestures and an environmental model [90]. To resolve some speech ambiguities,
they use conversation history and user identification - contextual information.
Because their gesture recognition system detected only 87% of gestures, and had only 47% recognition accuracy, their system uses speech as the primary modality and references the gesture recognition system only when disambiguation is needed. The
direction of a pointing gesture generates a list of objects from the environment model
that are in that direction within an error cone. By comparing the appropriate speech
token with the objects on the list, a match can be found, resolving the ambiguity.
Their multimodal system had a success rate of 74% on 102 multimodal user inputs.
Of the errors, 52% were caused by failing to detect a gesture, 22% because a gesture
could not be resolved with a high confidence, 17% due to speech recognition errors
and 9% were because the speech and gesture inputs failed timing constraints.
Based on these implementations of multimodal technology for robotic platforms,
there are many areas to explore. The practical use of contextual knowledge to
inform a multimodal system needs to be assessed. In addition, the different kinds of gestures that would be used to communicate with a robot, not just deictic ones, should be investigated as well.
3  Contextually Informed Multimodal Human-Robot Interaction and Robotic Responses
For a robot to communicate with a person effectively, it must possess transactional intelligence: the skills to understand what was communicated and to interact with the human to resolve ambiguity, so as to negotiate what should be acted upon if a difficulty occurs. In this work, speech recognition, gesture recognition and CIMMI represent these skills. But for a robot to respond to a human's request, spatial intelligence is required to understand and interact with the environment. In this work, navigation, obstacle avoidance and object recognition represent these skills. Both of these intelligences are informed by contextual knowledge, both conversational and situational. All of these components are described in this chapter.
The scenarios that the system is expected to accomplish are initially described
in Section 3.1. For a person to communicate with the robot, both speech and
gestures are recognized as described in Sections 3.3 and 3.4 respectively. These
modes of input are combined by CIMMI using the defined contextual knowledge.
This knowledge is described in Section 3.5. The multimodal algorithm itself is comprehensively described in Section 3.6.
For a service robot to act in human-centric environments, it requires spatial
intelligence as well. In this work, the navigation, obstacle avoidance and object
recognition systems represent these skills. They were developed to demonstrate the
multimodal algorithm on a humanoid robot and are described in Section 3.8.
3.1 Functional Scenarios
The developed informed multimodal system was designed to accomplish simple tasks in a domestic environment, perhaps to assist an elderly or frail person.
Figure 3.1: The defined rooms in the laboratory environment. The height of the laser range finder, which is used for obstacle avoidance, is lower than the height of the white polystyrene walls between the rooms. The desk is the foreground table, the kitchen is the furthest table, and the "table" location is to the right with the blue cup on it. The closer Wakamaru robot, with the mounted camera, is the active one; the other is not used in this system.
In the laboratory environment where the system was developed, simple rooms were defined with a hall, kitchen and office space, using short walls to separate the spaces (See Fig. 3.1). Each room had a table with one or two objects on it. The simple tasks to be performed
by the robot include:
Natural interactions The robot should act and respond to users by natural means, such as speech and gestures. The Mitsubishi engineers built in a series of smooth, natural-looking motions for when the robot performs preprogrammed actions, such as leaving the charger or moving around the room. Simple hand and arm gestures were also defined for user responses, such as waving.
Providing information The system should be able to provide information about
the environment, such as object locations, on demand using its situational
contextual knowledge - the known object database.
Fetch and carry The robot should be able to retrieve an object from a location at the user's request. To accomplish this, the robot must navigate to the object's location, avoiding obstacles along the way. It then must recognize the
requested object, reach out and touch it. Wakamaru does not have articulated hands, so the object is not actually grasped; the possibility of using magnets was explored but was unsuccessful due to the shape of Wakamaru's hands. The robot should then return to the user. If no object is recognized, the robot should return to the user and inform them of this.
With these scenarios in mind, the transactional intelligence was developed to
accomplish the human interaction components of the system, as described in the
following sections.
3.2 Software Overview
In this section, the transactional intelligence is outlined conceptually, first by implementation and then as a user interaction. This overview should provide a guide to how the different components of the transactional intelligence relate, as well as the reasons for the various design decisions.
The developed system is modular by design. The capturing and processing of
the speech, vision and laser-depth data are performed in separate processes on the
laptop, as it is a more powerful system than the one in Wakamaru alone. CIMMI and the robotic response system run as a single-threaded process on Wakamaru. The speech recognizer is a client system, triggering action in the robotic system. The
vision system is a server that detects gestures while it waits for instructions from
Wakamaru to return the last found gesture, detect obstacles, or recognize objects.
The laser system is a server that waits for an instruction to return depth data for
obstacle avoidance.
A user communicates with Wakamaru by speech and dynamic symbolic gestures
to complete specific tasks. The speech recognizer detects a single utterance, trigger-
ing the gesture recognizer to return the last detected gesture. Retrieving the last
recognized gesture, rather than detecting the current one, is done because previous
studies showed that when a meaningful gesture was made, 85% of the time it was accompanied by a spoken keyword temporally aligned during or after the gesture [91], and gestures could precede the deictic word by as much as four seconds [88].
Kettebekov and Sharma also showed that 93.7% of the time, a deictic or iconic gesture
was temporally aligned with the semantically associated nouns, pronouns, spatial
adverbials, and spatial prepositions [91]. While this system does not semantically
recognize deictic or iconic gestures, it assumes this statistic would apply to the re-
lationship between symbolic gestures and the semantically associated speech action
word, the first word of a sentence.
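The thesis does not list code for this retrieval step, so the following is only a minimal sketch of how a last-gesture buffer might be gated by these timing statistics. The class and threshold names are hypothetical; the 4-second look-back follows [88].

```python
import time

GESTURE_LOOKBACK_S = 4.0  # gestures may precede the spoken action word by up to ~4 s [88]

class GestureBuffer:
    """Keeps the most recently recognized gesture and when it was seen."""

    def __init__(self):
        self.last_gesture = None  # e.g. ("wave", 0.8): name and confidence
        self.last_time = 0.0

    def store(self, gesture, confidence):
        self.last_gesture = (gesture, confidence)
        self.last_time = time.time()

    def fetch_for_utterance(self, utterance_start):
        """Return the last gesture only if it plausibly belongs to this utterance."""
        if self.last_gesture is None:
            return None
        if utterance_start - self.last_time <= GESTURE_LOOKBACK_S:
            return self.last_gesture
        return None  # too old: treat the input as speech-only
```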
Figure 3.2: Overview of the Transactional Intelligence, and the Contextual Knowledgewhich informs it.
The speech and gesture recognizers generate n-best lists of recognition hypothe-
ses, where n = 5 for speech, as selected by experimentation (See Section 4.1), and
n = 3 for gestures, as there are only 3 gestures implemented. These recognizers
are scalable, as words can be added to the vocabulary file for parsing, and training data can be added to the gesture recognition file for classification (See Sections 3.3 and 3.4 for more information). However, in the current implementation additional actions cannot be clarified by CIMMI or responded to by the robot, as these components are hard-coded. Implementing a more flexible clarification process is a short-term goal of this work. The recognition hypotheses in the n-best lists are sent to CIMMI along with each hypothesis's probability of correctness.
CIMMI selects the best speech/gesture pair, a command hypothesis, based on the transactional intelligence's recognition results and the contextual knowledge. See Section 3.5 for details about the implemented contextual knowledge. The selected hypothesis is checked for ambiguities according to some predefined rules (See Section 3.6). If an ambiguity is not resolved beyond a satisfactory threshold of certainty, then the robot will ask the user for more information. If there is no ambiguity, or an acceptable level of certainty is reached, then the hypothesis is passed on to the robotic response finite state machine so the robot can respond appropriately to the input. This overview of the transactional intelligence is represented by Fig. 3.2.
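As a rough illustration of this selection step, the sketch below scores every speech/gesture pairing by multiplying the two recognizers' probabilities with the associative-map value from the vocabulary (Section 3.3). The function name and the simple multiplicative combination are assumptions for illustration, not CIMMI's exact formula, which Section 3.6 describes.

```python
def select_command(speech_nbest, gesture_nbest, assoc_map):
    """Pick the best-scoring speech/gesture pairing.

    speech_nbest:  list of (parsed_command, probability),
                   e.g. (("get", "yellow", "cup", None), 0.37)
    gesture_nbest: list of (gesture_name, probability), e.g. ("go", 0.6)
    assoc_map:     assoc_map[action_word][gesture_name] -> 0 (contradicts),
                   1 (neutral) or 2 (reinforces), as in Table 3.3
    """
    scored = []
    for command, p_speech in speech_nbest:
        action = command[0]  # may be None if no action word was parsed
        for gesture, p_gesture in gesture_nbest:
            weight = assoc_map.get(action, {}).get(gesture, 1)
            scored.append((p_speech * p_gesture * weight, command, gesture))
    # e.g. a "go" gesture (associative value 2 for "get") lifts "Get Yellow Cup"
    # above an acoustically similar but contradicted alternative
    return max(scored, key=lambda entry: entry[0])
```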
Having outlined the transactional intelligence in this work, each component will
be described in more detail in the following sections.
Figure 3.3: A diagram showing the speech recognition process, from capturing thespoken utterance through to generating a list of parsed recognition hypotheses
3.3 Speech Recognition
Speech is the primary form of communication between people, so it makes sense that a robot interacting with people should recognize this mode of input. For this reason, speech recognition is implemented as part of the transactional intelligence of this human-robot interactive system. As an overview, the speech recognition engine processes a spoken utterance and generates a list of recognition hypotheses. These hypotheses are then parsed using a defined vocabulary, and the interpreted hypotheses, with their probabilities of correctness, are sent to Wakamaru. This process is described in detail below.
The developed speech recognition system uses the Microsoft Speech SDK v5.1¹ because it is well documented and provides an interface to the speech recognizer for developing speech recognition applications.
The Microsoft Speech Recognizer Engine (SRE) v5.1 was based on the Whisper (Windows Highly Intelligent SPEech Recognizer) speech engine [92], which in turn was based on SPHINX-II [93]. A speech recognizer consists of an acoustic model to represent the sounds, a decoder to interpret the possible words that could be represented by those models, and language models which define the probabilities of sequences of words. The acoustic models in the SRE use hidden Markov models (HMMs) to represent groups of three phonemes, where a phoneme is a single sound [94]. The language models use both bigram-class-based and trigram-word structures, where bigram-class-based structures are pairs of words sorted into context-dependent classes [95] and, for example, the trigram-word structures for the utterance "the quick brown fox" are "the quick brown" and "quick brown fox" [96].
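For illustration, the trigram-word structures from the example above can be extracted as follows (a generic sketch, not part of the SRE itself):

```python
def trigrams(utterance):
    """Word-level trigram structures, as described for the SRE language model."""
    words = utterance.split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

print(trigrams("the quick brown fox"))
# [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]
```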
Training the acoustic models required reading provided passages from which the sounds were encoded into HMMs. Approximately seven hours of training was needed before this American-English speech recognition engine recognized most of the 80 words in the desired vocabulary.
¹ Microsoft Speech SDK 5.1 is available for download from: http://www.microsoft.com/speech/speech2007/downloads.mspx.
A training file was also developed containing the specific utterances the system should recognize, such as "bring me the blue cup". To recognize a spoken utterance, the SRE generates HMMs representing groups of three phonemes, and these are concatenated together to represent the spoken utterance. The HMMs are then decoded into words using the learned language models, selecting the words with the highest probabilities. A list of hypotheses is generated by selecting alternative words with the next highest probabilities. This much training was required to refine the acoustic models so that they could accommodate subtle variations in the user's pronunciation, because words are pronounced slightly differently depending on the neighbouring words in the spoken utterance.
In the developed speech system, a spoken utterance is captured and a list of five
recognition hypotheses is generated. The number of recognition hypotheses gener-
ated, five, was selected by experimentation, as it encompassed 97% of the correct
recognition results for the tested dataset (See Section 4.1 for more information). A
recognition hypothesis consists of a list of words that were possibly spoken by the
user and a confidence value generated by the SRE (See Table 3.1). Generating a list
of possible recognitions is done because the first recognition according to the speech
engine is not always the correct one.
The confidence values for each recognition hypothesis are scaled so that the values sum to 1.0, i.e. form a probability distribution. The hypotheses are also parsed, detecting an action, a colour, an object type and/or a location, similar to [19]. However, for the recognized hypotheses to be parsed, the nickname for Wakamaru, "Waka", must be recognized in at least one of the hypotheses. The reason for requiring "Waka" is that the speech recognizer would otherwise try to recognize ambient sounds in the laboratory, such as other conversations or music, as spoken utterances. To avoid these sounds being processed by CIMMI and possibly interfering with experiments, the name "Waka" needed to be recognized so that only a spoken utterance clearly directed at Wakamaru would be acted upon.
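A minimal sketch of this gating-and-scaling step follows, assuming the raw SRE confidences are simply divided by their sum; the function name is hypothetical.

```python
def gate_and_normalize(hypotheses):
    """hypotheses: list of (text, raw_confidence) pairs from the SRE."""
    # Discard the whole batch unless some hypothesis contains the robot's name.
    if not any("waka" in text.lower() for text, _ in hypotheses):
        return []  # ambient speech or music, not addressed to the robot
    total = sum(conf for _, conf in hypotheses)
    return [(text, conf / total) for text, conf in hypotheses]
```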
Each hypothesis is parsed by the developed system using a domain-dependent vocabulary (one possible encoding of an entry is sketched after this list), where each keyword has a:

Lexical category (such as noun, verb, etc.) represented by a number. If a word is in the vocabulary, then the word's unique ID number and its probability of correctness are stored in the parsed recognition structure, [action, colour, object, location], according to its defined lexical category. This structure is a simple form of case grammar, where a different number of slots would be defined depending on the action word recognized [20]. Because of the simple utterances being recognized, this was not necessary for the implemented system.
Table 3.1: Sample Unparsed Speech Recognitions

Ex. 1, spoken utterance "bring the yellow cup Waka":
    bring the yellow cup Waka    (confidence 111484)
    bring the yellow cup water   (confidence 111484)
    bring the yellow cab Waka    (confidence 93467)
    bring the yellow tape Waka   (confidence 93467)
    bring the yellow cut Waka    (confidence 93467)

Ex. 2, spoken utterance "fetch the yellow cup Waka":
    fetch the yellow cut Waka    (confidence 112738)
    fetch the yellow cut off     (confidence 112738)
    fetch the yellow cup Waka    (confidence 107971)
    fetch the yellow cup soccer  (confidence 107971)
    fetch the yellow cab Waka    (confidence 86016)
Table 3.2: Sample Risk Values

Word:        Time  Help  Come  Go  Stop  Hello  Cup  Desk
Risk value:     0    10     0   0    10      0    0     0
Risk value The risk value represents the semantic importance of certain words, where risk values are higher for words whose misinterpretation could have seriously negative consequences, such as the words "help" and "stop" being misrecognized (see Table 3.2). Gestures do not have risk values in this implementation of the system, as symbolic gestures with high-risk meanings, such as stop, have not been implemented (See Section 3.4 for more information).

Currently the risk values are binary, as illustrated in Table 3.2, but this could easily be extended so that risk values span a greater range, including fractional values. These values could also be learned through user interaction.
List of values semantically relating the word to each of the gestures defined in the system. The values relating a word to each gesture are defined using an associative map, similar to [86]. The map is manually defined in the vocabulary file based on the context of a domestic environment and the developer's common sense. For example, the gesture "Come" contradicts the word "Go", so the corresponding associative map value is 0 (See Table 3.3). Also, the utterance "What's the time?", represented as "Time" in Table 3.3, neither complements nor contradicts any gesture, so its associative map values are always 1.
Table 3.3: Sample Values from the Associative Map

Word:           Time  Help  Come  Go  Stop  Hello  Cup  Desk
Come gesture:      1     2     2   0     0      1    2     2
Go gesture:        1     2     0   2     0      1    2     2
Wave gesture:      1     2     1   1     1      2    1     1
Table 3.4: Sample Parsed Speech Recognitions

Ex. 1, spoken utterance "Bring the yellow cup", parsed n-best list:
    Get Yellow Cup   (prob. 0.37)
    Get Yellow       (prob. 0.31)
    Get Yellow Cut   (prob. 0.31)

Ex. 2, spoken utterance "Fetch the yellow cup", parsed n-best list:
    Get Yellow Cut   (prob. 0.37)
    Get Yellow Cup   (prob. 0.35)
    Get Yellow       (prob. 0.28)
This associative map could be adapted to individual users with learning algorithms and could also contain fractional values. The associative map is a key factor in the mutual disambiguation performed by our system [97].
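The exact file format of the vocabulary is not reproduced in this thesis excerpt, so the following sketch shows one plausible encoding of vocabulary entries and the slot-filling parse. The risk and associative values follow Tables 3.2 and 3.3; the field layout, IDs and helper names are assumptions. Synonyms such as "bring" and "get" are given the same action ID, mirroring the parses in Table 3.4.

```python
VOCAB = {
    "get":    {"cat": "action",   "id": 1,  "risk": 0,  "assoc": {"come": 0, "go": 2, "wave": 1}},
    "bring":  {"cat": "action",   "id": 1,  "risk": 0,  "assoc": {"come": 2, "go": 0, "wave": 1}},
    "help":   {"cat": "action",   "id": 2,  "risk": 10, "assoc": {"come": 2, "go": 2, "wave": 2}},
    "yellow": {"cat": "colour",   "id": 10, "risk": 0,  "assoc": {"come": 1, "go": 1, "wave": 1}},
    "cup":    {"cat": "object",   "id": 20, "risk": 0,  "assoc": {"come": 2, "go": 2, "wave": 1}},
    "desk":   {"cat": "location", "id": 30, "risk": 0,  "assoc": {"come": 2, "go": 2, "wave": 1}},
}

SLOTS = ("action", "colour", "object", "location")

def parse(hypothesis):
    """Fill the [action, colour, object, location] structure from one hypothesis."""
    parsed = dict.fromkeys(SLOTS)
    for word in hypothesis.lower().split():
        entry = VOCAB.get(word)
        if entry and parsed[entry["cat"]] is None:
            parsed[entry["cat"]] = entry["id"]
    return [parsed[slot] for slot in SLOTS]

print(parse("bring the yellow cup Waka"))  # -> [1, 10, 20, None], i.e. "Get Yellow Cup"
```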
The vocabulary can be extended beyond the 80 words defined by adding new words to the text file along with the necessary values. However, additional action words cannot be added to the vocabulary without extending the robotic response system, as it is hard-coded for Wakamaru and the implemented action words only. Different vocabulary files could also be defined for different contexts, such as the kitchen, bathroom or office, selected based on the recognition of keywords as done by [40].
Once all the recognition hypotheses have been parsed, the parsed recognition
structures and probabilities, [action, colour, object, location, probability], are sent to
Wakamaru to be combined with the gesture recognitions and contextual knowledge.
These two components will be described in the following two sections. The overall speech recognition process is illustrated in Fig. 3.3. This
style of diagram will be used to describe the different components of the transactional
intelligence in the following sections.
3.4 Symbolic Gesture Recognition
In human communication, while speech may be the primary form of interaction, people instinctively gesticulate to express themselves as well. According to McNeill, 90% of gestures occur in conjunction with some speech [73]. Based on this fact, gesture recognition was implemented in this work. This system is only a proof-of-concept, built to illustrate the use of a multimodal system in robotics. Symbolic gestures were used because they can semantically substitute for, reinforce or conflict with an action word. Deictic gestures, while more common in human communication for anaphora resolution (words such as "this", "here" and "him") and object identification, are already being thoroughly explored by a number of groups [70] [19] [89] [20] [21]. The multimodal combination of symbolic gestures and speech has been explored primarily in HCI [7] [8] [91], with some work in HRI [6].
The three symbolic dynamic gestures recognized in this work are "wave", "come here" and "go away". The gestures are dynamic because they are moving, as opposed to static gestures such as a thumbs-up. These gestures are tracked using OpenCV's Camshift tracking and 3D geometry. Each of the implemented gestures has two or more meanings:

Come - bring or come

Go - get, go, call or answer

Wave - hello, goodbye or help
The gestural meaning is decided during multimodal integration, as dictated by the speech action word or, in the action word's absence, the speech object or location words. For example, if the action word is "get", then the semantic meaning of the gesture "go" is get. However, if the speech action word is missing but a location word is defined, then it is assumed the semantic meaning is go.
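A minimal sketch of this meaning-selection rule follows, using only the meanings and the fallback described above; the function and table names are hypothetical.

```python
# Meanings per gesture, as listed above; "go" is the fallback when only a
# location word is present.
GESTURE_MEANINGS = {
    "come": {"bring", "come"},
    "go":   {"get", "go", "call", "answer"},
    "wave": {"hello", "goodbye", "help"},
}

def gesture_meaning(gesture, action_word, location_word):
    meanings = GESTURE_MEANINGS[gesture]
    if action_word in meanings:
        return action_word  # e.g. the action word "get" fixes "go" to mean get
    if action_word is None and location_word is not None and "go" in meanings:
        return "go"         # no action word, but a destination was named
    return None             # leave ambiguous for CIMMI to resolve

print(gesture_meaning("go", "get", None))   # -> "get"
print(gesture_meaning("go", None, "desk"))  # -> "go"
```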
For gesture performance, the gesturer must be standing approximately 2 meters away from the robot so that both a frontal face and the gestures are detectable. If the gesturer is too far away, the face will be too small to detect using OpenCV's Haar frontal face detector. If the gesturer is too close to the camera, the perspective of the gesture is changed, perhaps resulting in a misrecognition. The face and all pixels to the right of it (as seen by the robot) are ignored to simplify image processing, under the assumption that gestures are performed right-handed. If a user could be identified (perhaps by using face recognition) as a left-handed person, then a left-hand mode could be used.
Figure 3.5: A series of motion segmentations of a wave gesture. (a) In the initial frame there is no history to segment any motion, so all lighter-coloured regions are segmented. (b) In the second frame, the user has moved their arm some distance, so the previous motion segmentation and the currently detected motion are added, as shown here. (c) After three frames, the static background region has been segmented away and only the user's motion, and speckles of motion from the person in the bottom right, are detected, as shown here.
templates use the previous three frames to build a motion history. This means that differences between the current frame and the last three frames are included in the segmented motion image, as shown in Fig. 3.5.
These two masks, the skin and motion segmentations, are then combined, resulting in a greyscale image showing the moving skin-coloured regions in the frame. This image is used to select the regions for the Camshift algorithm to track, as shown in Fig. 3.6. In the first frame, there is no motion history, so static regions are included in the segmentation. However, after 3 frames have been processed, the background regions are successfully ignored.
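The sketch below reproduces this combination step under stated assumptions: plain differencing against the last three frames stands in for the motion-template machinery, and the HSV skin bounds are placeholders rather than the calibrated values used in this work.

```python
import cv2
import numpy as np
from collections import deque

# Hypothetical HSV skin bounds; the calibrated thresholds are not given here.
SKIN_LO = np.array([0, 40, 60], dtype=np.uint8)
SKIN_HI = np.array([25, 180, 255], dtype=np.uint8)

history = deque(maxlen=3)  # the previous three frames, as described above

def moving_skin_mask(frame_bgr):
    """Conjunction of a skin mask and a three-frame motion mask (cf. Fig. 3.6)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    skin = cv2.inRange(hsv, SKIN_LO, SKIN_HI)

    # Accumulate differences against each of the last three frames.
    motion = np.zeros_like(gray)
    for past in history:
        diff = cv2.absdiff(gray, past)
        _, diff = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
        motion = cv2.bitwise_or(motion, diff)
    history.append(gray)

    if len(history) == 1:                     # first frame: no history yet (Fig. 3.5a)
        return skin
    return cv2.bitwise_and(skin, motion)      # moving skin-coloured regions only
```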
3.4.2 Region Tracking
The initial search windows for the Camshift tracker are manually defined by labeling the greyscale image generated by the skin and motion segmentation. In each subsequent frame, the algorithm uses the greyscale skin and motion segmentation for that frame, together with the search window for each region as defined from the previous frame, to track that region. It is assumed that part of the region in the current frame will overlap with the search window. Camshift finds the centre of a region by iteratively searching within the search window of the greyscale image. It then calculates the region's size and orientation and returns its bounding box. The bounding box is used as the search window for this region in the next frame, and the centre of the bounding box is taken as the (x, y) coordinate of the region for the current frame. This is done for each region that was segmented in the first frame, of which there can sometimes be as many as 15. However, it is assumed that only one region will be a gesture.
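A sketch of one Camshift update across the tracked regions follows, using OpenCV's cv2.CamShift; the bookkeeping around it (region IDs, the windows dictionary) is assumed for illustration rather than taken from the implementation.

```python
import cv2

def track_regions(prob_image, windows):
    """One Camshift update per tracked region (cf. Section 3.4.2).

    prob_image: greyscale moving-skin image for the current frame
    windows:    dict of region id -> (x, y, w, h) search window from the last frame
    Returns the updated windows and each region's (x, y) centre for this frame.
    """
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    centres = {}
    for rid, win in windows.items():
        _rot_rect, new_win = cv2.CamShift(prob_image, win, criteria)
        windows[rid] = new_win                   # search window for the next frame
        x, y, w, h = new_win
        centres[rid] = (x + w // 2, y + h // 2)  # trajectory point for this region
    return windows, centres
```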
Figure 3.6: A single frame from a wave gesture using skin and motion segmentation. (a) A binary image of the skin segmentation. (b) A greyscale image of the motion segmentation. (c) The conjunction of (a) and (b). (d) The colour capture with the detected regions identified.
Figure 3.7: A diagram of the gesture capture and recognition process
For each tracked region of each frame, the center (x,y) and the average disparity
value of the tracked region are translated into (X, Y, Z)