
The Conversational Interface

Michael McTear • Zoraida Callejas • David Griol

The Conversational Interface
Talking to Smart Devices


Michael McTear, School of Computing and Mathematics, Ulster University, Northern Ireland, UK

Zoraida Callejas, ETSI Informática y Telecomunicación, University of Granada, Granada, Spain

David Griol, Department of Computer Science, Universidad Carlos III de Madrid, Madrid, Spain

ISBN 978-3-319-32965-9    ISBN 978-3-319-32967-3 (eBook)
DOI 10.1007/978-3-319-32967-3

Library of Congress Control Number: 2016937351

© Springer International Publishing Switzerland 2016

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG Switzerland

Foreword

Some of us who have been in the field of “computers that understand speech” for many years have experienced firsthand a tremendous evolution of all the technologies that are required for computers to talk and understand speech and language. These technologies, including automatic speech recognition, natural language understanding, language generation, and text-to-speech, are extremely complex and have required decades for scientists, researchers, and practitioners around the world to create algorithms and systems that would allow us to communicate with machines using speech, which is the preferred and most effective channel for humans.

Even though we are not there yet, in the sense that computers do not yet have a mastery of speech and language comparable to that of humans, we have made great strides toward that goal. The introduction of Interactive Voice Response (IVR) systems in the 1990s created programs that could automatically handle simple requests on the phone and allow large corporations to scale up their customer care services at a reasonable cost. With the evolution of speech recognition and natural language technologies, IVR systems rapidly became more sophisticated and enabled the creation of complex dialog systems that could handle natural language queries and many turns of interaction. That success prompted the industry to create standards such as VoiceXML that contributed to making the task of developers easier, and IVR applications became the test bed and catalyst for the evolution of technologies related to automatically understanding and producing human language.

Today, while IVRs are still in use to serve millions of people, a new technology that penetrated the market half a decade ago is becoming more important in everyday life: that of “Virtual Personal Assistants.” Apple’s Siri, Google Now, and Microsoft’s Cortana allow whoever has a smartphone to make virtually unlimited requests and interact with applications in the cloud or on the device. The embodiment of virtual personal assistants in connected devices such as Amazon’s Echo and multimodal social robots such as Jibo is on the verge of enriching the human–computer communication experience and defining new ways to interact with the vast knowledge repositories on the Web, and eventually with the Internet of Things. Virtual Personal Assistants are being integrated into cars to make it safer to interact with onboard entertainment and navigation systems, phones, and the Internet. We can imagine how all of these technologies will be able to enhance the usability of self-driving cars when they become a reality. This is what the conversational interface is about, today and in the near future.

The possibilities of conversational interfaces are endless and the science of computers that understand speech, once the prerogative of a few scientists who had access to complex math and sophisticated computers, is today accessible to many who are interested and want to understand it, use it, and perhaps contribute to its progress.

Michael McTear, Zoraida Callejas, and David Griol have produced an excellent book that is the first to fill a gap in the literature for researchers and practitioners who want to take on the challenge of building a conversational machine with available tools. This is an opportune time for a book like this, filled with the depth of understanding that Michael McTear has accumulated in a career dedicated to technologies such as speech recognition, natural language understanding, language generation and dialog, and the inspiration he has brought to the field.

In fact, until now, there was no book available for those who want to understand the challenges and the solutions, and to build a conversational interface by using available modern open source software. The authors do an excellent job in setting the stage by explaining the technology behind each module of a conversational system. They describe the different approaches in accessible language and propose solutions using available software, giving appropriate examples and questions to stimulate the reader to further research.

The Conversational Interface is a must read for students, researchers, interaction designers, and practitioners who want to be part of the revolution brought by “computers that understand speech.”

San Francisco
February 2016

Roberto Pieraccini
Director of Advanced Conversational Technologies
Jibo, Inc.


Preface

When we first started planning to write a book on how people would be able to talk in a natural way to their smartphones, devices, and robots, we could not have anticipated that the conversational interface would become such a hot topic.

During the course of writing the book we have kept in touch with the most recent advances in technology and applications. The technologies required for conversational interfaces have improved dramatically in the past few years, making what was once a dream of visionaries and researchers into a commercial reality. New applications that make use of conversational interfaces are now appearing on an almost weekly basis.

All of this has made writing the book exciting and challenging. We have been guided by the comments of Deborah Dahl, Wolfgang Minker, and Roberto Pieraccini on our initial book proposal. Roberto also read the finished manuscript and kindly agreed to write the foreword to the book.

We have received ongoing support from the team at Springer, in particular Mary James, Senior Editor for Applied Science, who first encouraged us to consider writing the book, as well as Brian Halm and Zoe Kennedy, who answered our many questions during the final stages of writing. Special thanks to Ms. Shalini Selvam and her team at Scientific Publishing Services (SPS) for their meticulous editing and for transforming our typescript into the published version of the book.

We have come to this book with a background in spoken dialog systems, having all worked for many years in this field. During this time we have received advice, feedback, and encouragement from many colleagues, including Jim Larson, who has kept us abreast of developments in the commercial world of conversational interfaces; Richard Wallace, who encouraged us to explore the world of chatbot technology; Ramón López-Cózar, Emilio Sanchis, Encarna Segarra, and Lluís F. Hurtado, who introduced us to spoken dialog research; and José Manuel Molina and Araceli Sanchis for the opportunity to continue working in this area. We would also like to thank Wolfgang Minker, Sebastian Möller, Jan Nouza, Catherine Pelachaud, Giuseppe Riccardi, Huiru (Jane) Zheng, and their respective teams for welcoming us to their labs and sharing their knowledge with us.


Writing a book like this would not be possible without the support of family and friends. Michael would like to acknowledge the encouragement and patience of his wife Sandra, who has supported and encouraged him throughout. Zoraida and David would like to thank their parents, Francisco Callejas, Francisca Carrión, Amadeo Griol, and Francisca Barres, for being supportive and inspirational, with a heartfelt and loving memory of Amadeo, who will always be a model of honesty and passion in everything they do.


Contents

1 Introducing the Conversational Interface
  1.1 Introduction
  1.2 Who Should Read the Book?
  1.3 A Road Map for the Book
    1.3.1 Part I: Conversational Interfaces: Preliminaries
    1.3.2 Part II: Developing a Speech-Based Conversational Interface
    1.3.3 Part III: Conversational Interfaces and Devices
    1.3.4 Part IV: Evaluation and Future Prospects

Part I Conversational Interfaces: Preliminaries

2 The Dawn of the Conversational Interface
  2.1 Introduction
  2.2 Interacting with a Conversational Interface
  2.3 Conversational Interfaces for Smart Watches and Other Devices
  2.4 Explaining the Rise of the Conversational Interface
    2.4.1 Technological Developments
    2.4.2 User Acceptance and Adoption
    2.4.3 Enterprise and Specialized VPAs
    2.4.4 The Cycle of Increasing Returns
  2.5 The Technologies that Make up a Conversational Interface
  2.6 Summary
  References

3 Toward a Technology of Conversation
  3.1 Introduction
  3.2 Conversation as Action
  3.3 The Structure of Conversation
    3.3.1 Dealing with Longer Sequences
  3.4 Conversation as a Joint Activity
    3.4.1 Turn Taking in Conversation
    3.4.2 Grounding
    3.4.3 Conversational Repair
  3.5 The Language of Conversation
    3.5.1 Prosodic, Paralinguistic, and Nonverbal Behaviors
    3.5.2 Implications for the Conversational Interface
  3.6 Summary
  References

4 Conversational Interfaces: Past and Present
  4.1 Introduction
  4.2 Conversational Interfaces: A Brief History
    4.2.1 A Typical Interaction with a Spoken Dialog System
    4.2.2 An Interaction that Goes Wrong
    4.2.3 Spoken Dialog Systems
    4.2.4 Voice User Interfaces
    4.2.5 Embodied Conversational Agents, Companions, and Social Robots
    4.2.6 Chatbots
  4.3 What Have We Learned so Far?
    4.3.1 Making Systems More Intelligent
    4.3.2 Using Incremental Processing to Model Conversational Phenomena
    4.3.3 Languages and Toolkits for Developers
    4.3.4 Large-Scale Experiments on System Design Using Techniques from Machine Learning
  4.4 Summary
  References

Part II Developing a Speech-Based Conversational Interface

5 Speech Input and Output
  5.1 Introduction
  5.2 Speech Recognition
    5.2.1 ASR as a Probabilistic Process
    5.2.2 Acoustic Model
    5.2.3 Language Model
    5.2.4 Decoding
  5.3 Text-to-Speech Synthesis
    5.3.1 Text Analysis
    5.3.2 Waveform Synthesis
    5.3.3 Using Prerecorded Speech
    5.3.4 Speech Synthesis Markup Language
  5.4 Summary
  References

6 Implementing Speech Input and Output
  6.1 Introduction
  6.2 Web Speech API
    6.2.1 Text-to-Speech Synthesis
    6.2.2 Speech Recognition
  6.3 The Android Speech APIs
    6.3.1 Text-to-Speech Synthesis
    6.3.2 Speech Recognition
    6.3.3 Using Speech for Input and Output
  6.4 Summary

7 Creating a Conversational Interface Using Chatbot Technology
  7.1 Introduction
  7.2 Introducing the Pandorabots Platform
  7.3 Developing Your Own Bot Using AIML
    7.3.1 Creating Categories
    7.3.2 Wildcards
    7.3.3 Variables
    7.3.4 Sets and Maps
    7.3.5 Context
  7.4 Creating a Link to Pandorabots from Your Android App
    7.4.1 Creating a Bot in the Developer Portal
    7.4.2 Linking an Android App to a Bot
  7.5 Introducing Mobile Functions
    7.5.1 Processing the <oob> Tags
    7.5.2 Battery Level
    7.5.3 Search Queries
    7.5.4 Location and Direction Queries
  7.6 Extending the App
  7.7 Alternatives to AIML
  7.8 Some Ways in Which AIML Can Be Further Developed
    7.8.1 Learning a Chatbot Specification from Data
    7.8.2 Making Use of Techniques from Natural Language Processing
  7.9 Summary
  References

8 Spoken Language Understanding
  8.1 Introduction
  8.2 Technologies for Spoken Language Understanding
  8.3 Dialog Act Recognition
  8.4 Identifying Intent
  8.5 Analyzing the Content of the User’s Utterances
    8.5.1 Tokenization
    8.5.2 Bag of Words
    8.5.3 Latent Semantic Analysis
    8.5.4 Regular Expressions
    8.5.5 Part-of-Speech Tagging
    8.5.6 Information Extraction
    8.5.7 Semantic Role Labeling
  8.6 Obtaining a Complete Semantic Interpretation of the Input
    8.6.1 Semantic Grammar
    8.6.2 Syntax-Driven Semantic Analysis
  8.7 Statistical Approaches to Spoken Language Understanding
    8.7.1 Generative Models
    8.7.2 Discriminative Models
    8.7.3 Deep Learning for Natural and Spoken Language Understanding
  8.8 Summary
  References

9 Implementing Spoken Language Understanding
  9.1 Introduction
  9.2 Getting Started with the Api.ai Platform
    9.2.1 Exercise 9.1 Creating an Agent in Api.ai
    9.2.2 Exercise 9.2 Testing the Agent
  9.3 Creating an Android App for an Agent
    9.3.1 Exercise 9.3 Producing a Semantic Parse
    9.3.2 Exercise 9.4 Testing the App
  9.4 Specifying Your Own Entities and Intents
    9.4.1 Exercise 9.5 Creating Entities
    9.4.2 Exercise 9.6 Creating an Intent
    9.4.3 Exercise 9.7 Testing the Custom Entities and Intents
  9.5 Using Aliases
  9.6 Using Context
    9.6.1 Exercise 9.8 Defining Contexts
  9.7 Creating a Slot-Filling Dialog
    9.7.1 Exercise 9.9 Creating a Slot-Filling Dialog
    9.7.2 Exercise 9.10 Additional Exercises
  9.8 Overview of Some Other Spoken Language Understanding Tools
    9.8.1 Tools Using Intents and Entities
    9.8.2 Toolkits for Various Other NLP Tasks
  9.9 Summary
  References

10 Dialog Management
  10.1 Introduction
  10.2 Defining the Dialog Management Task
    10.2.1 Interaction Strategies
    10.2.2 Error Handling and Confirmation Strategies
  10.3 Handcrafted Approaches to Dialog Management
  10.4 Statistical Approaches to Dialog Management
    10.4.1 Reinforcement Learning
    10.4.2 Corpus-Based Approaches
  10.5 Summary
  References

11 Implementing Dialog Management
  11.1 Introduction
  11.2 Development of a Conversational Interface Using a Rule-Based Dialog Management Technique
    11.2.1 Practical Exercises Using VoiceXML
  11.3 Development of a Conversational Interface Using a Statistical Dialog Management Technique
  11.4 Summary
  References

12 Response Generation
  12.1 Introduction
  12.2 Using Canned Text and Templates
  12.3 Using Natural Language Generation Technology
    12.3.1 Document Planning
    12.3.2 Microplanning
    12.3.3 Realization
  12.4 Statistical Approaches to Natural Language Generation
  12.5 Response Generation for Conversational Queries
    12.5.1 Question Answering
    12.5.2 Structured Resources to Support Conversational Question Answering
    12.5.3 Text Summarization
  12.6 Summary
  References

Part III Conversational Interfaces and Devices

13 Conversational Interfaces: Devices, Wearables, Virtual Agents, and Robots
  13.1 Introduction
  13.2 Wearables
    13.2.1 Smartwatches and Wristbands
    13.2.2 Armbands and Gloves
    13.2.3 Smart Glasses
    13.2.4 Smart Jewelry
    13.2.5 Smart Clothing
  13.3 Multimodal Conversational Interfaces for Smart Devices and Wearables
  13.4 Virtual Agents
  13.5 Multimodal Conversations with Virtual Agents
  13.6 Examples of Tools for Creating Virtual Agents
  13.7 Social Robots
  13.8 Conversational Interfaces for Robots
  13.9 Examples of Social Robots and Tools for Creating Robots
    13.9.1 Aldebaran Robots
    13.9.2 Jibo
    13.9.3 FurHat
    13.9.4 Aisoy
    13.9.5 Amazon Echo
    13.9.6 Hello Robo
    13.9.7 The Open Robot Hardware Initiative
    13.9.8 iCub.Org: Open-Source Cognitive Humanoid Robotic Platform
    13.9.9 SPEAKY for Robots
    13.9.10 The Robot Operating System (ROS)
  13.10 Summary
  References

14 Emotion, Affect, and Personality
  14.1 Introduction
  14.2 Computational Models of Emotion
    14.2.1 The Dimensional Approach
    14.2.2 The Discrete Approach
    14.2.3 The Appraisal Approach
  14.3 Models of Personality
    14.3.1 The Detection of Personality
    14.3.2 Simulating Personality
  14.4 Making Use of Affective Behaviors in the Conversational Interface
    14.4.1 Acknowledging Awareness and Mirroring Emotion
    14.4.2 Dealing with and Provoking the User’s Emotions
    14.4.3 Building Empathy
    14.4.4 Fostering the User’s Engagement
    14.4.5 Emotion as Feedback on the System’s Performance
  14.5 Summary
  References

15 Affective Conversational Interfaces
  15.1 Introduction
  15.2 Representing Emotion with EmotionML
  15.3 Emotion Recognition
    15.3.1 Emotion Recognition from Physiological Signals
    15.3.2 Emotion Recognition from Speech
    15.3.3 Emotion Recognition from Facial Expressions and Gestures
  15.4 Emotion Synthesis
    15.4.1 Expressive Speech Synthesis
    15.4.2 Generating Facial Expressions, Body Posture, and Gestures
    15.4.3 The Uncanny Valley
  15.5 Summary
  References

16 Implementing Multimodal Conversational Interfaces Using Android Wear
  16.1 Introduction
  16.2 Visual Interfaces for Android Wear
  16.3 Voice Interfaces for Android Wear
    16.3.1 System-Provided Voice Actions
    16.3.2 Developer-Defined Voice Actions
  16.4 Summary

Part IV Evaluation and Future Directions

17 Evaluating the Conversational Interface
  17.1 Introduction
  17.2 Objective Evaluation
    17.2.1 Overall System Evaluation
    17.2.2 Component Evaluation
    17.2.3 Metrics Used in Industry
  17.3 Subjective Evaluation
    17.3.1 Predicting User Satisfaction
  17.4 Evaluation Procedures
    17.4.1 Evaluation Settings: Laboratory Versus Field
    17.4.2 Wizard of Oz
    17.4.3 Test Subjects
  17.5 Summary
  References

18 Future Directions
  18.1 Introduction
  18.2 Advances in Technology
    18.2.1 Cognitive Computing
    18.2.2 Deep Learning
    18.2.3 The Internet of Things
    18.2.4 Platforms, SDKs, and APIs for Developers
  18.3 Applications that Use Conversational Interfaces
    18.3.1 Enterprise Assistants
    18.3.2 Ambient Intelligence and Smart Environments
    18.3.3 Health Care
    18.3.4 Companions for the Elderly
    18.3.5 Conversational Toys and Educational Assistants
    18.3.6 Bridging the Digital Divide for Under-Resourced Languages
  18.4 Summary
  References

Index

About the Authors

Michael McTear is Emeritus Professor at the University of Ulster with a special research interest in spoken language technologies. He graduated in German Language and Literature from Queen’s University Belfast in 1965, was awarded an MA in Linguistics at the University of Essex in 1975, and a Ph.D. at the University of Ulster in 1981. He has been a Visiting Professor at the University of Hawaii (1986–1987), the University of Koblenz, Germany (1994–1995), and the University of Granada, Spain (2006–2010). He has been researching in the field of spoken dialog systems for more than 15 years and is the author of the widely used textbook Spoken Dialogue Technology: Toward the Conversational User Interface (Springer, 2004). He is also coauthor (with Kristiina Jokinen) of the book Spoken Dialogue Systems (Morgan and Claypool, 2010) and (with Zoraida Callejas) of the book Voice Application Development for Android (Packt Publishing, 2013).

Zoraida Callejas is Associate Professor at the University of Granada, Spain, and a member of the CITIC-UGR Research Institute. She graduated in Computer Science in 2005 and was awarded a Ph.D. in 2008, also at the University of Granada. Her research focuses on dialog systems, especially affective human–machine conversational interaction, and she has made more than 160 contributions to books, journals, and conferences. She has participated in several research projects related to these topics and contributes to scientific committees and societies in the area. She has been a Visiting Professor at the Technical University of Liberec, Czech Republic (2007–2013), the University of Trento, Italy (2008), the University of Ulster, Northern Ireland (2009), the Technical University of Berlin, Germany (2010), the University of Ulm, Germany (2012, 2014), and ParisTech, France (2013). She is coauthor (with Michael McTear) of the book Voice Application Development for Android (Packt Publishing, 2013).

David Griol is Professor in the Department of Computer Science at the Carlos III University of Madrid (Spain). He obtained his Ph.D. degree in Computer Science from the Technical University of Valencia (Spain) in 2007. He also holds a B.S. in Telecommunication Science from the same university. He has participated in several European and Spanish projects related to natural language processing and conversational interfaces. His main research activities concern statistical methodologies for dialog management, dialog simulation, user modeling, adaptation and evaluation of dialog systems, mobile technologies, and multimodal interfaces. His research results have been applied to several application domains, including Education, Healthcare, Virtual Environments, Augmented Reality, and Ambient Intelligence and Smart Environments, and he has published in a number of international journals and conferences. He has been a visiting researcher at the University of Ulster (Belfast, UK), the Technical University of Liberec (Liberec, Czech Republic), the University of Trento (Trento, Italy), the Technical University of Berlin (Berlin, Germany), and ParisTech University (Paris, France). He is a member of several research associations for Artificial Intelligence, Speech Processing, and Human–Computer Interaction.


Abbreviations

ABNF  Augmented Backus-Naur Form
AI  Artificial intelligence
AIML  Artificial Intelligence Markup Language
AM  Application manager
AmI  Ambient intelligence
API  Application programming interface
ASR  Automatic speech recognition
ATIS  Airline Travel Information Service
AU  Action unit
AuBT  Augsburg Biosignal Toolbox
AuDB  Augsburg Database of Biosignals
BDI  Belief, desire, intention
BML  Behavior Markup Language
BP  Blood pressure
BVP  Blood volume pulse
CA  Conversation analysis
CAS  Common answer specification
CCXML  Call Control eXtensible Markup Language
CFG  Context-free grammar
CITIA  Conversational Interaction Technology Innovation Alliance
CLI  Command line interface
COCOSDA  International Committee for Co-ordination and Standardisation of Speech Databases
CRF  Conditional random field
CTI  Computer–telephone integration
CVG  Compositional vector grammar
DBN  Dynamic Bayesian networks
DG  Dependency grammar
DiAML  Dialog Act Markup Language
DIT  Dynamic Interpretation Theory
DM  Dialog management
DME  Dialog move engine
DNN  Deep neural network
DR  Dialog register
DST  Dialog state tracking
DSTC  Dialog state tracking challenge
DTMF  Dual-tone multifrequency
ECA  Embodied Conversational Agent
EEG  Electroencephalography
ELRA  European Language Resources Association
EM  Expectation maximization
EMG  Electromyography
EMMA  Extensible MultiModal Annotation
EmotionML  Emotion Markup Language
ESS  Expressive speech synthesis
EU  European Union
FACS  Facial action coding system
FAQ  Frequently asked questions
FIA  Form Interpretation Algorithm
FML  Functional Markup Language
FOPC  First-order predicate calculus
GMM  Gaussian Mixture Model
GPU  Graphics processing unit
GSR  Galvanic skin response
GUI  Graphical user interface
HCI  Human–computer interaction
HMM  Hidden Markov model
HRV  Heart rate variability
IoT  Internet of Things
IQ  Interaction quality
ISCA  International Speech Communication Association
ISU  Information State Update
ITU-T  International Telecommunication Union
IVR  Interactive voice response
JSGF  Java Speech Grammar Format
JSON  JavaScript Object Notation
LGPL  Lesser General Public License
LPC  Linear predictive coding
LPCC  Linear prediction cepstral coefficient
LSA  Latent semantic analysis
LUIS  Language Understanding Intelligent Service (Microsoft)
LVCSR  Large vocabulary continuous speech recognition
MDP  Markov decision process
MFCC  Mel frequency cepstral coefficients
MLE  Maximum likelihood estimation
MLP  Multilayer perceptron
MURML  Multimodal Utterance Representation Markup Language
NARS  Negative Attitudes toward Robots Scale
NER  Named entity recognition
NIST  National Institute of Standards and Technology
NLG  Natural language generation
NLP  Natural language processing
NLTK  Natural Language Toolkit
NP  Noun phrase
OCC  Ortony, Clore, and Collins
PAD  Pleasure, arousal, and dominance
PARADISE  PARAdigm for DIalogue Evaluation System
PLS  Pronunciation Lexicon Specification
POMDP  Partially observable Markov decision process
POS  Part of speech
PP  Perplexity
PP  Prepositional phrase
PPG  Photoplethysmograph
PSTN  Public switched telephone network
QA  Question answering
RAS  Robot Anxiety Scale
RG  Response generation
RL  Reinforcement learning
RNN  Recurrent (or Recursive) neural network
ROS  Robot Operating System
RSA  Respiratory sinus arrhythmia
RVDK  Robotic Voice Development Kit
SAIBA  Situation, Agent, Intention, Behavior, Animation
SASSI  Subjective Assessment of Speech System Interfaces
SC  Skin conductance
SCXML  State Chart XML: State Machine Notation for Control Abstraction
SDK  Software development kit
SDS  Spoken dialog system
SIGGEN  Special Interest Group on Natural Language Generation
SISR  Semantic Interpretation for Speech Recognition
SLU  Spoken language understanding
SMIL  Synchronized Multimedia Integration Language
SMILE  Speech and Music Interpretation by Large-space feature Extraction
SRGS  Speech Recognition Grammar Specification
SSI  Social Signal Interpretation
SSLU  Statistical spoken language understanding
SSML  Speech Synthesis Markup Language
SURF  Speeded Up Robust Features
SVM  Support vector machine
TAM  Technology acceptance model
TIPI  Ten-Item Personality Inventory
TTS  Text-to-speech synthesis
UI  User interface
UMLS  Unified Medical Language System
URI  Uniform resource identifier
VoiceXML  Voice Extensible Markup Language
VP  Verb phrase
VPA  Virtual personal assistant
VUI  Voice user interface
W3C  World Wide Web Consortium
WA  Word accuracy
Weka  Waikato Environment for Knowledge Analysis
WER  Word error rate
WOZ  Wizard of Oz
XML  Extensible Markup Language
ZCR  Zero crossing rate