
Spoken Dialogue Systems for Ambient Environments: Second International Workshop on Spoken Dialogue Systems Technology, IWSDS 2010, Gotemba, Shizuoka, Japan, October 1-2, 2010. Proceedings



Lecture Notes in Artificial Intelligence 6392
Edited by R. Goebel, J. Siekmann, and W. Wahlster

Subseries of Lecture Notes in Computer Science


Gary Geunbae Lee, Joseph Mariani, Wolfgang Minker, Satoshi Nakamura (Eds.)

Spoken Dialogue Systems for Ambient Environments

Second International Workshop on Spoken Dialogue Systems Technology, IWSDS 2010, Gotemba, Shizuoka, Japan, October 1-2, 2010, Proceedings



Series Editors

Randy Goebel, University of Alberta, Edmonton, Canada
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany

Volume Editors

Gary Geunbae Lee
Pohang University of Science and Technology
Department of Computer Science and Engineering
San 31, Hyoja-dong, Nam-gu, Pohang, 790-784, South Korea
E-mail: [email protected]

Joseph Mariani
Centre National de la Recherche Scientifique
Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
B.P. 133, 91403 Orsay cedex, France
E-mail: [email protected]

Wolfgang Minker
University of Ulm, Institute of Information Technology
Albert-Einstein-Allee 43, 89081 Ulm, Germany
E-mail: [email protected]

Satoshi Nakamura
National Institute of Information and Communications Technology
3-5 Hikaridai, Keihanna Science City, Kyoto, Japan
E-mail: [email protected]

Library of Congress Control Number: 2010935212

CR Subject Classification (1998): I.2, H.5, H.4, H.3, I.4, I.5

LNCS Sublibrary: SL 7 – Artificial Intelligence

ISSN 0302-9743
ISBN-10 3-642-16201-0 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-16201-5 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

springer.com

© Springer-Verlag Berlin Heidelberg 2010
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper 06/3180


Preface

It is our great pleasure to welcome you to the 2nd International Workshop on Spoken Dialogue Systems Technology (IWSDS), which was held, as a satellite event of INTERSPEECH 2010, at Gotemba Kogen Resort in the Fuji area, Japan, October 1–2, 2010.

The annual workshop brings together researchers from all over the world working in the field of spoken dialogue systems. It provides an international forum for the presentation of research and applications and for lively discussions among researchers as well as industrialists. Building on the success of IWSDS 2009 in Irsee, Germany, this year's workshop designated "Spoken Dialogue Systems for Ambient Environments" as a special theme of discussion. We also encouraged discussions of common issues of spoken dialogue systems including but not limited to:

– Speech recognition and semantic analysis
– Dialogue management
– Adaptive dialogue modelling
– Recognition of emotions from speech, gestures, facial expressions and physiological data
– User modelling
– Planning and reasoning capabilities for coordination and conflict description
– Conflict resolution in complex multi-level decisions
– Multi-modality such as graphics, gesture and speech for input and output
– Fusion and information management
– Learning and adaptability
– Visual processing and recognition for advanced human-computer interaction
– Databases and corpora
– Evaluation strategies and paradigms
– Prototypes and products

The workshop program consisted of 22 regular papers and 2 invited keynote talks. This year, we were pleased to have two keynote speakers: Prof. Ramon Lopez-Cozar, Universidad de Granada, Spain, and Prof. Tetsunori Kobayashi, Waseda University, Japan.

We would like to take this opportunity to thank the scientific committee members for their timely and efficient contributions and for completing the review process on time.

In addition, we would like to express our sincere gratitude to the local organizing committee, especially to Dr. Teruhisa Misu, who contributed to the success of this workshop with careful consideration and timely and accurate action. Furthermore, we have to mention that this workshop would not have been possible without the support of the Korean Society of Speech Scientists and the National Institute of Information and Communications Technology.


Finally, we hope all the attendees benefited from the workshop and enjoyed their stay at the base of beautiful Mount Fuji.

July 2010

Gary Geunbae Lee
Joseph Mariani
Wolfgang Minker
Satoshi Nakamura


Organization

IWSDS 2010 was organized by the National Institute of Information and Communications Technology (NICT), in cooperation with Pohang University of Science and Technology; Centre National de la Recherche Scientifique, Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur; the Dialogue Systems Group, Institute of Information Technology, Ulm University; and The Korean Society of Speech Sciences (KSSS).

Organizing Committee

Gary Geunbae Lee: Pohang University of Science and Technology, Korea
Joseph Mariani: Centre National de la Recherche Scientifique, Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur, and Institute for Multilingual and Multimedia Information, France
Wolfgang Minker: Dialogue Systems Group, Institute of Information Technology, Ulm University, Germany
Satoshi Nakamura: National Institute of Information and Communications Technology, Japan

Local Committee

Hisashi Kawai: National Institute of Information and Communications Technology, Japan
Hideki Kashioka: National Institute of Information and Communications Technology, Japan
Chiori Hori: National Institute of Information and Communications Technology, Japan
Kiyonori Ohtake: National Institute of Information and Communications Technology, Japan
Sakriani Sakti: National Institute of Information and Communications Technology, Japan
Teruhisa Misu: National Institute of Information and Communications Technology, Japan

Referees

Jan Alexandersson, Germany
Masahiro Araki, Japan
Andre Berton, Germany
Sadaoki Furui, Japan
Rainer Gruhn, Germany
Joakim Gustafson, Sweden
Paul Heisterkamp, Germany
David House, Sweden


Kristiina Jokinen, Finland
Tatsuya Kawahara, Japan
Hong Kook Kim, Korea
Lin-Shan Lee, Taiwan
Li Haizhou, Singapore
Ramon Lopez-Cozar Delgado, Spain
Mike McTear, UK
Mikio Nakano, Japan
Elmar Noth, Germany
Norbert Reithinger, Germany
Laurent Romary, France
Gabriel Skantze, Sweden
Kazuya Takeda, Japan
Hsin-min Wang, Taiwan
Wayne Ward, USA


Table of Contents

Long Papers

Impact of a Newly Developed Modern Standard Arabic Speech Corpus on Implementing and Evaluating Automatic Continuous Speech Recognition Systems ..... 1
    Mohammad A.M. Abushariah, Raja N. Ainon, Roziati Zainuddin, Bassam A. Al-Qatab, and Assal A.M. Alqudah

User and Noise Adaptive Dialogue Management Using Hybrid System Actions ..... 13
    Senthilkumar Chandramohan and Olivier Pietquin

Detection of Unknown Speakers in an Unsupervised Speech Controlled System ..... 25
    Tobias Herbig, Franz Gerl, and Wolfgang Minker

Evaluation of Two Approaches for Speaker Specific Speech Recognition ..... 36
    Tobias Herbig, Franz Gerl, and Wolfgang Minker

Issues in Predicting User Satisfaction Transitions in Dialogues: Individual Differences, Evaluation Criteria, and Prediction Models ..... 48
    Ryuichiro Higashinaka, Yasuhiro Minami, Kohji Dohsaka, and Toyomi Meguro

Expansion of WFST-Based Dialog Management for Handling Multiple ASR Hypotheses ..... 61
    Naoto Kimura, Chiori Hori, Teruhisa Misu, Kiyonori Ohtake, Hisashi Kawai, and Satoshi Nakamura

Evaluation of Facial Direction Estimation from Cameras for Multi-modal Spoken Dialog System ..... 73
    Akihiro Kobayashi, Kentaro Kayama, Etsuo Mizukami, Teruhisa Misu, Hideki Kashioka, Hisashi Kawai, and Satoshi Nakamura

D3 Toolkit: A Development Toolkit for Daydreaming Spoken Dialog Systems ..... 85
    Donghyeon Lee, Kyungduk Kim, Cheongjae Lee, Junhwi Choi, and Gary Geunbae Lee

New Technique to Enhance the Performance of Spoken Dialogue Systems by Means of Implicit Recovery of ASR Errors ..... 96
    Ramon Lopez-Cozar, David Griol, and Jose F. Quesada


Simulation of the Grounding Process in Spoken Dialog Systems with Bayesian Networks ..... 110
    Stephane Rossignol, Olivier Pietquin, and Michel Ianotto

Facing Reality: Simulating Deployment of Anger Recognition in IVR Systems ..... 122
    Alexander Schmitt, Tim Polzehl, and Wolfgang Minker

A Discourse and Dialogue Infrastructure for Industrial Dissemination ..... 132
    Daniel Sonntag, Norbert Reithinger, Gerd Herzog, and Tilman Becker

Short Papers

Impact of Semantic Web on the Development of Spoken Dialogue Systems ..... 144
    Masahiro Araki and Yu Funakura

A User Model to Predict User Satisfaction with Spoken Dialog Systems ..... 150
    Klaus-Peter Engelbrecht and Sebastian Moller

Sequence-Based Pronunciation Modeling Using a Noisy-Channel Approach ..... 156
    Hansjorg Hofmann, Sakriani Sakti, Ryosuke Isotani, Hisashi Kawai, Satoshi Nakamura, and Wolfgang Minker

Rational Communication and Affordable Natural Language Interaction for Ambient Environments ..... 163
    Kristiina Jokinen

Construction and Experiment of a Spoken Consulting Dialogue System ..... 169
    Teruhisa Misu, Chiori Hori, Kiyonori Ohtake, Hideki Kashioka, Hisashi Kawai, and Satoshi Nakamura

A Study Toward an Evaluation Method for Spoken Dialogue Systems Considering User Criteria ..... 176
    Etsuo Mizukami, Hideki Kashioka, Hisashi Kawai, and Satoshi Nakamura

A Classifier-Based Approach to Supporting the Augmentation of the Question-Answer Database for Spoken Dialogue Systems ..... 182
    Hiromi Narimatsu, Mikio Nakano, and Kotaro Funakoshi

The Influence of the Usage Mode on Subjectively Perceived Quality ..... 188
    Ina Wechsung, Anja Naumann, and Sebastian Moller


Demo Papers

Sightseeing Guidance Systems Based on WFST-Based Dialogue Manager ..... 194
    Teruhisa Misu, Chiori Hori, Kiyonori Ohtake, Etsuo Mizukami, Akihiro Kobayashi, Kentaro Kayama, Tetsuya Fujii, Hideki Kashioka, Hisashi Kawai, and Satoshi Nakamura

Spoken Dialogue System Based on Information Extraction from Web Text ..... 196
    Koichiro Yoshino and Tatsuya Kawahara

Author Index ..... 199



Impact of a Newly Developed Modern Standard Arabic Speech Corpus on Implementing and Evaluating Automatic Continuous Speech Recognition Systems

Mohammad A.M. Abushariah1,2, Raja N. Ainon1, Roziati Zainuddin1, Bassam A. Al-Qatab1, and Assal A.M. Alqudah1

1 Faculty of Computer Science and Information Technology, University of Malaya, 50603, Kuala Lumpur, Malaysia

2 Department of Computer Information Systems, King Abdullah II School for Information Technology, University of Jordan, 11942, Amman, Jordan

[email protected], [email protected], [email protected], [email protected], [email protected],

[email protected]

Abstract. Being the current formal linguistic standard and the only acceptable form of the Arabic language for all native speakers, Modern Standard Arabic (MSA) still lacks sufficient spoken corpora compared to other forms such as Dialectal Arabic. This paper describes our work towards developing a new speech corpus for MSA, which can be used for implementing and evaluating any Arabic automatic continuous speech recognition system. The speech corpus contains 415 (367 training and 48 testing) sentences recorded by 42 (21 male and 21 female) Arabic native speakers from 11 countries representing three major regions (Levant, Gulf, and Africa). The impact of using this speech corpus on the overall performance of Arabic automatic continuous speech recognition systems was examined. Two development phases were conducted based on the size of training data, Gaussian mixture distributions, and tied states (senones). Overall results indicate that a larger training data size results in higher word recognition rates and lower Word Error Rates (WER).

Keywords: Modern Standard Arabic, text corpus, speech corpus, phonetically rich, phonetically balanced, automatic continuous speech recognition.

1 Introduction

Arabic is the largest living Semitic language and one of the six official languages of the United Nations (UN). It is the official language in 21 countries situated in the Levant, the Gulf, and Africa. Arabic is ranked fourth after Mandarin, Spanish and English in terms of the number of first-language speakers.

According to [1], Standard Arabic and Dialectal Arabic are the two major forms of the Arabic language. The Standard Arabic form includes both Classical Arabic and Modern Standard Arabic (MSA).


Dialectal Arabic varies from one country to another and includes the daily spoken Arabic. This form of Arabic deviates from Standard Arabic, and sometimes more than one dialect can be found within a country [1].

Being the most formal and standard form of Arabic, Classical Arabic can be found in the scripts of the Holy Qur'an. These scripts carry full diacritical marks; therefore, Arabic phonetics are completely represented [1].

Modern Standard Arabic (MSA) is the current formal linguistic standard of the Arabic language; it is widely taught in schools and universities, and used in the office and the media. Although almost all written Arabic resources use MSA, diacritical marks are mostly omitted and readers must infer the missing diacritical marks from the context. Modern Standard Arabic (MSA) contains 34 phonemes (28 consonants and 6 vowels). Any Arabic utterance or word must start with a consonant. Arabic vowels are classified into 3 short and 3 long vowels, where long vowels are approximately double the duration of short vowels [1, 2].

Since MSA is the only acceptable form of the Arabic language for all native speakers [1], it has become the main focus of current Arabic Automatic Speech Recognition (ASR) research. Previous Arabic ASR research, however, was directed towards dialectal and colloquial Arabic serving a specific cluster of Arabic native speakers [3].

The following section, Section 2, emphasizes the need for a Modern Standard Arabic (MSA) speech corpus. The speech corpus description and analysis are presented in Section 3. Section 4 presents all implementation requirements and components required for the development of the Arabic automatic continuous speech recognition system. The speech corpus testing and evaluation for Arabic ASR systems is presented in Section 5. Section 6 analyzes the experimental results. We finally present the conclusion in Section 7.

2 The Need for Modern Standard Arabic (MSA) Speech Corpus

Lack of spoken and written training data is one of the main issues encountered by Arabic ASR researchers. A list of the most popular corpora (from 1986 through 2005) is provided in [4], showing only 19 corpora (14 written, 2 spoken, 1 written and spoken, and 2 conversational).

A survey on industrial needs for Arabic language resources was conducted on 20 companies situated in Lebanon, Palestine, Egypt, France, and the US [5]. Responses highlighted the need for read, prepared, prompted, elicited, and spontaneous Arabic spoken data. In most cases, the responding companies did not show much interest in telephone and broadcast news spoken data. According to [5], responding companies commented that available resources are too expensive and do not meet standard quality requirements. They also lack adaptability, reusability, quality, coverage, and adequate information types.

In a complementary survey [6], a total of 55 responses were received (36 institutions and 19 individual experts) representing 15 countries located in North Africa, the Near and Middle East, Europe, and North America. Respondents insisted on the need for Arabic language resources for both Modern Standard Arabic (MSA) and Colloquial Arabic speech corpora. Over 100 language resources (25 speech corpora, 45 lexicons and dictionaries, 29 text corpora, and 1 multimodal corpus) were identified [6].


Based on this literature investigation, our research work provides Arabic language resources that meet academic and industrial expectations and recommendations. The Modern Standard Arabic (MSA) speech corpus was developed in order to provide a state-of-the-art spoken corpus that bridges the gap between currently available Arabic spoken resources and the research community's expectations and recommendations. The following motivational factors and speech corpus characteristics were considered when developing our spoken corpus:

1. Modern Standard Arabic (MSA) is the only acceptable form of the Arabic language for all native speakers and is in high demand for Arabic language research; therefore, our speech corpus is based on the MSA form.

2. The newly developed Arabic speech corpus was prepared in a high-quality, specialized noise-proof studio, which suits a wide horizon of systems, especially for the office environment as recommended by [6].

3. The speech corpus was designed in a way that would serve any Arabic ASR system regardless of its domain. It focused on covering the Arabic phonemes as much as possible using the least possible number of Arabic words and sentences, based on the phonetically rich and balanced speech corpus approach.

4. The opportunity to explore differences in speech patterns between Arabic native speakers from 11 different countries representing the three major regions (Levant, Gulf, and Africa).

5. The need for read and prepared Arabic spoken data as illustrated in [5] was also considered. Companies did not show interest in Arabic telephone and broadcast news spoken data. Therefore, this Arabic speech corpus is neither telephone- nor broadcast-news-based spoken data; it is prepared and read Arabic spoken data.

3 Speech Corpus Description and Analysis

A speech corpus is an important requirement for developing any ASR system. The developed corpus contains 415 sentences in Modern Standard Arabic (MSA). 367 written phonetically rich and balanced sentences were developed in [7], and were recorded and used for training the acoustic model. For testing the acoustic model, 48 additional sentences representing Arabic proverbs were created by an Arabic language specialist. The speech corpus was recorded by 42 (21 male and 21 female) Arabic native speakers from 11 different Arab countries representing three major regions (Levant, Gulf, and Africa).

Since this speech corpus contains training and testing written and spoken data from a variety of speakers representing different genders, age categories, nationalities, and professions, and is also based on phonetically rich and balanced sentences, it is expected to be used for the development of many MSA speech- and text-based applications, such as speaker-independent ASR, text-to-speech (TTS) synthesis, speaker recognition, and others.

The motivation behind the creation of our phonetically rich and balanced speech corpus was to provide large amounts of high-quality recordings of Modern Standard Arabic (MSA) suitable for the design and development of any speaker-independent continuous automatic Arabic speech recognition system.

The phonetically rich and balanced Arabic speech corpus was initiated in March 2009. Although participants joined based on their own interest in this work, speakers were indirectly selected based on agreed-upon characteristics. Participants were selected so that they:

• Have a fair distribution of gender and age.
• Have different current professions.
• Have a variety of educational backgrounds with a minimum of a high school certification. This is important to secure an efficient reading ability of the participants.
• Belong to a variety of native Arabic-speaking countries.
• Belong to any of the three major regions where Arabic native speakers mostly live (Levant, Gulf, and Africa). This is important to produce a comprehensive speech corpus that can be used by the whole Arabic language research community.

As a result, 51 (23 male and 28 female) participants were selected and asked to record the prepared text corpus. Recordings of 3 participants were incomplete. 2 participants were from Eritrea living in Saudi Arabia and are therefore non-native speakers. In addition, 2 participants had a resonance or voice disorder, whereby the quality of their voice was poor and it was difficult to obtain a single correct recording. Finally, 2 other participants had an articulation disorder, whereby some sounds were not pronounced clearly or were even replaced in some cases with another sound. Therefore, the recordings of 9 participants were excluded.

Speech recordings of 42 participants were finally shortlisted in order to form our speech corpus as shown in Table 1. Shortlisted participants belong to two major age groups as shown in Table 2.

Table 1. Shortlisted participants

Region   Country        Male   Female   Total   Total/Region
Levant   Jordan           8      4       12
         Palestine        2      -        2
         Syria            1      -        1     15
Gulf     Iraq             -      4        4
         Saudi Arabia     -      3        3
         Yemen            -      3        3
         Oman             -      1        1     11
Africa   Sudan            4      3        7
         Algeria          3      3        6
         Egypt            2      -        2
         Morocco          1      -        1     16
Total:                   21     21       42     42
Total (%):               50%    50%     100%   100%


Table 2. Participants’ age and gender distribution

No.   Age Category          Male   Female   Total
1     Less Than 30 Years     7      14       21
2     30 Years and Above    14       7       21
      Total:                21      21       42

Recording sessions were conducted in a sound-attenuated studio. Sound Forge 8 software was installed and used for making the recordings. Default recording attributes were initially used, as shown in Table 3.

Table 3. Initial recording attributes

Recording Attribute   Value
Sampling Rate (Hz)    44100 Hz
Bit-Depth             16 bits
Channels              2 channels (Stereo)

These recording attributes were then converted at a later stage to be used for developing speech recognition applications, as shown in Table 4.

Table 4. Converted recording attributes

Recording Attribute   Value
Sampling Rate (Hz)    16000 Hz
Bit-Depth             16 bits
Channels              1 channel (Mono)

In order to use our phonetically rich and balanced speech corpus for training and testing any Arabic ASR system, a number of Matlab programs were developed to produce a ready-to-use speech corpus. These Matlab programs were developed for the purposes of 1) automatic Arabic speech segmentation, 2) parameter conversion of the speech data, 3) directory structure and sound filename conventions, and 4) automatic generation of training and testing transcription files.

A manual classification and validation of the correct speech data was conducted, requiring considerable human effort. This process was crucial in order to ensure and validate the pronunciation correctness of the speech data before using it to train the system's acoustic model.

4 Arabic Automatic Continuous Speech Recognition System

This section describes the major implementation requirements and components for developing the Arabic automatic speech recognition system, shown in Fig. 1; the architecture complies with the generic architecture of the Carnegie Mellon University (CMU) Sphinx engine. A brief description of each component is given in the following sub-sections.

Fig. 1. Components of Arabic automatic continuous speech recognition system

4.1 Feature Extraction

Feature extraction, also referred to as the front-end component, is the initial stage of any ASR system; it converts speech inputs into feature vectors to be used for training and testing the speech recognizer. The dominant feature extraction technique, known as Mel-Frequency Cepstral Coefficients (MFCC), was applied to extract features from the set of spoken utterances. A feature vector represents the unique characteristics of each recorded utterance and is the input to the classification component.
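As an illustration of such a front end (not the exact Sphinx front-end configuration used by the authors), MFCC vectors can be computed per utterance as follows; the use of librosa, the 16 kHz sampling rate, and the 13-coefficient setting are assumptions.

    import librosa

    def extract_mfcc(wav_path, n_mfcc=13):
        """Compute an MFCC feature matrix (frames x coefficients) for one utterance."""
        signal, sr = librosa.load(wav_path, sr=16000)                # 16 kHz mono speech
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
        return mfcc.T                                                # one feature vector per frame

    features = extract_mfcc("speaker01_utt001_16k.wav")  # hypothetical file name
    print(features.shape)                                # (number of frames, 13)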

4.2 Arabic Phonetic Dictionary

The phoneme pronunciation dictionary serves as an intermediary link between the acoustic model and the language model in all speech recognition systems. A rule-based approach was used to automatically generate a phonetic dictionary for a given transcription. A detailed description of the development of this Arabic phonetic dictionary can be found in [8]. Arabic pronunciation follows certain rules and patterns when the text is fully diacritized; a detailed description of these rules and patterns can be found in [9].

In this work, the transcription file contains 2,110 words and the vocabulary list contains 1,626 unique words. The number of pronunciations in the developed phonetic dictionary is 2,482 entries. Fig. 2 shows a sample of the generated phonetic dictionary.


Fig. 2. Sample of the rule-based phonetic dictionary:

آلام    E AE: L AE: M UH
آمن    E AE: M IH N IH N
آيات   E AE: Y AE: T UH
أبد    E AE B AE D AE
أبي    E AE B IY
أبجلني  E AE B JH AE L AE N IY
أبطأ   E AE B TT AH E AE
أبلج   E AE B L AE JH UH
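A toy illustration of the kind of letter-to-phone mapping behind such a dictionary is sketched below. The mapping covers only a handful of letters and diacritics, the symbol choices follow the sample entries above, and it is not the rule set of [8].

    # Toy grapheme-to-phoneme mapping for fully diacritized Arabic (illustrative only).
    G2P = {
        "\u0623": "E",      # alef with hamza above
        "\u0622": "E AE:",  # alef madda
        "\u0628": "B",      # baa
        "\u062F": "D",      # dal
        "\u0644": "L",      # lam
        "\u0645": "M",      # meem
        "\u0627": "AE:",    # alef (long a)
        "\u064E": "AE",     # fatha (short a)
        "\u0650": "IH",     # kasra (short i)
        "\u064F": "UH",     # damma (short u)
    }

    def word_to_phones(word):
        """Map a fully diacritized word to a phone string, skipping unknown symbols."""
        return " ".join(G2P[ch] for ch in word if ch in G2P)

    # "أَبَدَ" -> "E AE B AE D AE", matching the sample entry above.
    print(word_to_phones("\u0623\u064E\u0628\u064E\u062F\u064E"))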

4.3 Acoustic Model Training

The acoustic model component provides the Hidden Markov Models (HMMs) of the Arabic tri-phones used to recognize speech. The basic HMM structure, known as the Bakis model, has a fixed topology consisting of five states, with three emitting states, for tri-phone acoustic modeling.

In order to build a better acoustic model, CMU Sphinx 3 uses tri-phone based acoustic modeling. The Continuous Hidden Markov Model (CHMM) technique is also supported in CMU Sphinx 3 for parametrizing the probability distributions of the state emission probabilities.

There are two development phases for the acoustic model training. The first phase is based on 4.07 hours of training data, whereas the second phase is based on 8 hours of training data.

4.3.1 Acoustic Model Training Based on 4.07 Hours

During our first development phase, the speech recordings of 8 speakers (4 males and 4 females) were manually segmented. Each speaker recorded both training and testing sentences, whereby the training sentences are used to train the acoustic model and the testing sentences are used to test the performance of the speech recognizer.

Out of the 8 speakers, only 5 (3 males and 2 females) are used to train the acoustic model in this phase, and the other 3 speakers are mainly used to test the performance. A total of 3604 utterances (4.07 hours) are used to train the acoustic model.

The acoustic model is trained using a continuous state probability density of 16 Gaussian mixture distributions. However, the state distributions were tied to different numbers of senones ranging from 350 to 2500. The different results obtained are shown in Section 5.

4.3.2 Acoustic Model Training Based on 8 Hours

During our second development phase, a small portion of the entire speech corpus is experimented with. A total of 8,043 utterances are used, resulting in about 8 hours of speech data collected from 8 (5 male and 3 female) Arabic native speakers from 6 different Arab countries, namely Jordan, Palestine, Egypt, Sudan, Algeria, and Morocco.

In order to provide a fair testing and evaluation of the Arabic ASR performance, a round-robin testing approach was applied, where in every round the speech data of 7 out of the 8 speakers are used for training and the speech data of the 8th speaker are used for testing. This is also important to show how speaker-independent the system is.
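This round-robin (leave-one-speaker-out) protocol can be expressed compactly; the sketch below is a schematic outline, and train_acoustic_model and evaluate are hypothetical placeholders for the actual training and decoding runs.

    speakers = ["spk1", "spk2", "spk3", "spk4", "spk5", "spk6", "spk7", "spk8"]

    def round_robin(speakers, train_acoustic_model, evaluate):
        """Train on 7 speakers, test on the held-out 8th; repeat for every speaker."""
        results = {}
        for held_out in speakers:
            train_set = [s for s in speakers if s != held_out]
            model = train_acoustic_model(train_set)        # hypothetical training call
            results[held_out] = evaluate(model, held_out)  # hypothetical WER/accuracy evaluation
        return results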



Acoustic model training was divided into two stages. During the first stage, one of the eight training data sets was used in order to identify the best combination of Gaussian mixture distributions and number of senones. The acoustic model is trained using continuous state probability densities ranging from 2 to 64 Gaussian mixture distributions. In addition, the state distributions were tied to different numbers of senones ranging from 350 to 2500. A total of 54 experiments were done at this stage, producing the results shown in Section 5. During the second stage, the best combination of Gaussian mixture distributions and number of senones was used to train the other seven of the eight training data sets.
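The first-stage tuning described above amounts to a grid search over Gaussian densities and senone counts on one data set. A schematic sketch is given below; the exact grid values and the train_and_score wrapper around acoustic model training and decoding are assumptions, not the authors' scripts.

    import itertools

    gaussians = [2, 4, 8, 16, 32, 64]
    senones = [350, 400, 450, 500, 750, 1000, 1500, 2000, 2500]  # illustrative grid (54 combinations)

    def tune(train_and_score):
        """Return the (densities, senones) pair with the highest word recognition rate."""
        best = None
        for n_gauss, n_senones in itertools.product(gaussians, senones):
            rate = train_and_score(n_gauss, n_senones)   # hypothetical: trains on Exp.1 and decodes
            if best is None or rate > best[0]:
                best = (rate, n_gauss, n_senones)
        return best   # e.g. (93.24, 16, 500) in the authors' first-stage experiment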

4.4 Language Model Training

The language model component provides the grammar used in the system. The grammar's complexity depends on the system to be developed. In this work, the language model is built statistically using the CMU-Cambridge Statistical Language Modeling toolkit, which is based on modeling the uni-grams, bi-grams, and tri-grams of the language for the subject text to be recognized.

Creation of a language model consists of computing the word uni-gram counts, which are then converted into a task vocabulary with word frequencies; generating the bi-grams and tri-grams from the training text based on this vocabulary; and finally converting the n-grams into a binary format language model and standard ARPA format. For both development phases, the number of uni-grams is 1,627, whereas the numbers of bi-grams and tri-grams are 2,083 and 2,085 respectively.
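The counting step behind this statistical language model can be illustrated as follows. This is a plain n-gram count in Python, not the CMU-Cambridge toolkit pipeline itself, and corpus.txt is a hypothetical file holding the training transcriptions, one sentence per line.

    from collections import Counter

    def count_ngrams(sentences):
        """Count uni-, bi- and tri-grams with sentence-boundary markers."""
        uni, bi, tri = Counter(), Counter(), Counter()
        for sent in sentences:
            words = ["<s>"] + sent.split() + ["</s>"]
            uni.update(words)
            bi.update(zip(words, words[1:]))
            tri.update(zip(words, words[1:], words[2:]))
        return uni, bi, tri

    with open("corpus.txt", encoding="utf-8") as f:   # hypothetical transcription file
        uni, bi, tri = count_ngrams(f.read().splitlines())
    print(len(uni), len(bi), len(tri))                # vocabulary and n-gram inventory sizes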

5 Systems’ Testing and Evaluation

This section presents the testing and evaluation of the two development phases of the Arabic automatic continuous speech recognition system.

5.1 First Development Phase Based on 4.07 Hours

The testing and evaluation was done based on 3 different testing data sets, 1) 444 sound files (same speakers, but different sentences), 2) 84 sound files (different speakers, but same sentences), and 3) 130 sound files (different speakers, and different sentences).

Results are shown in Tables 5, 6, and 7 respectively. Table 8 compares the system's performance with and without diacritical marks.

Table 5. System’s performance for testing data set 1

Version       Densities   Senones   Word Recognition Rate (%)
Experiment1   16          1000      87.26
Experiment2   16          1500      81.10
Experiment3   16          2500      72.05
Experiment4   16           500      90.39
Experiment5   16           350      90.91
Experiment6   16           400      91.23
Experiment7   16           450      90.43


Table 6. System’s performance for testing data set 2

Version       Densities   Senones   Word Recognition Rate (%)
Experiment8   16           400      89.42

Table 7. System’s performance for testing data set 3

Version       Densities   Senones   Word Recognition Rate (%)
Experiment9   16           400      80.83

Table 8. Effect of diacritical marks on the overall system’s performance

Testing Sets   With Diacritical Marks   Without Diacritical Marks
Set 1          91.23                    92.54
Set 2          89.42                    90.81
Set 3          80.83                    80.83

5.2 Second Development Phase Based on 8 Hours

There are 8 different data sets used to train and test the system’s performance based on 8 hours as shown in Table 9.

Table 9. Training and testing data sets for 8 hours speech corpus

Experiment   Training   Testing: Same Speakers,   Testing: Different Speakers,   Testing: Different Speakers,   Total Testing   Ratio of Testing
ID           Data       Different Sentences       Same Sentences                 Different Sentences            Data            Data (%)
Exp.1        6379       906                        678                            80                            1664            20.69
Exp.2        6288       871                        769                           115                            1755            21.82
Exp.3        5569       755                       1488                           231                            2474            30.76
Exp.4        6308       888                        749                            98                            1735            21.57
Exp.5        6296       889                        761                            97                            1747            21.72
Exp.6        6331       891                        726                            95                            1712            21.29
Exp.7        6219       861                        838                           125                            1824            22.68
Exp.8        6009       841                       1048                           145                            2034            25.29

During the first stage of training the acoustic model, the first data set (Exp.1) was used to identify the best combination of Gaussian mixture distributions and number of senones. It was found that 16 Gaussians with 500 senones obtained the best word recognition rate of 93.24%, as shown in Fig. 3. Therefore, this combination was used for training the acoustic model on the Exp.2 through Exp.8 data sets.


Fig. 3. Word recognition rate (%) in reference to number of senones and Gaussians

Tables 10 and 11 show the word recognition rates (%) and the Word Error Rates (WER) with and without diacritical marks respectively.

Table 10. Overall system’s performance with full diacritical marks

Experiment ID   Same Speakers,           Different Speakers,      Different Speakers,
                Different Sentences      Same Sentences           Different Sentences
                Rec. Rate (%)   WER      Rec. Rate (%)   WER      Rec. Rate (%)   WER
Exp.1           93.24           10.73    94.98            6.28    90.11           13.48
Exp.2           91.80           11.96    93.30           10.62    83.00           27.87
Exp.3           93.07           10.53    97.22            3.66    89.81           14.94
Exp.4           92.72           11.42    96.89            4.16    91.44           11.76
Exp.5           93.43           10.09    94.92            7.13    89.49           14.86
Exp.6           92.61           11.56    95.55            7.37    90.64           14.23
Exp.7           92.65           11.15    96.37            4.51    88.15           14.25
Exp.8           91.85           12.75    98.10            2.51    89.99           13.31
Average         92.67           11.27    95.92            5.78    89.08           15.59

6 Experimental Results Analysis

During the first development phase, based on 4.07 hours, it is noticed that as the number of senones increases, the recognition rate declines. The combination of 16 Gaussian mixtures and 400 senones is the best for this corpus size, achieving a 91.23% word recognition rate and a 14.37% Word Error Rate (WER) for set 1. This result improved when tested without diacritical marks, achieving 92.54% and a 13.06% WER.
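For clarity, the Word Error Rate figures reported throughout can be computed from a standard word-level edit-distance alignment. The helper below is a generic sketch of that computation, not the scoring tool used to produce the reported numbers.

    def word_error_rate(reference, hypothesis):
        """WER (%) = (substitutions + deletions + insertions) / number of reference words."""
        ref, hyp = reference.split(), hypothesis.split()
        # Dynamic-programming edit distance over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution / match
        return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

    print(word_error_rate("the cat sat", "the cat sat down"))  # 33.3 (one insertion, three reference words)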


Table 11. Overall system’s performance without diacritical marks

Experiment ID   Same Speakers,           Different Speakers,      Different Speakers,
                Different Sentences      Same Sentences           Different Sentences
                Rec. Rate (%)   WER      Rec. Rate (%)   WER      Rec. Rate (%)   WER
Exp.1           94.41            9.57    95.22            6.04    90.79           12.81
Exp.2           93.02           10.74    93.95           10.33    84.38           26.49
Exp.3           94.29            9.31    97.42            3.46    90.88           13.87
Exp.4           93.86           10.29    97.33            3.73    92.87           10.34
Exp.5           94.57            8.95    95.32            6.73    90.76           13.59
Exp.6           93.75           10.41    95.91            7.00    91.39           13.48
Exp.7           94.06            9.74    96.68            4.20    89.42           12.98
Exp.8           93.04           11.56    98.50            2.11    91.33           11.97
Average         93.88           10.07    96.29            5.45    90.23           14.44

On the other hand, during the second development phase, based on 8 hours, the best combination was 16 Gaussian mixture distributions with 500 senones, obtaining 93.43% and 94.57% word recognition accuracy with and without diacritical marks respectively. Therefore, the best number of senones increases when the amount of training speech data increases, and it is expected to increase further when our speech corpus is fully utilized.

Speaker independence is clearly realized in this work, as testing was conducted to assess this aspect. For different speakers but similar sentences, the system obtained a word recognition accuracy of 95.92% and 96.29% and a Word Error Rate (WER) of 5.78% and 5.45% with and without diacritical marks respectively. On the other hand, for different speakers and different sentences, the system obtained a word recognition accuracy of 89.08% and 90.23% and a Word Error Rate (WER) of 15.59% and 14.44% with and without diacritical marks respectively.

It is noticed that the developed systems perform better without diacritical marks than with them. Therefore, the issue of diacritics needs to be addressed in future developments. Further parameter tuning is needed in order to reduce the WER; this includes the language model weights, the beam width, and the word insertion penalty (wip).

7 Conclusions

This paper reports our work towards building a phonetically rich and balanced Modern Standard Arabic (MSA) speech corpus, which is necessary for developing a high-performance Arabic speaker-independent automatic continuous speech recognition system. This work includes creating the phonetically rich and balanced speech corpus with full diacritical mark transcriptions from speakers with a wide variety of attributes, and performing all preparation and pre-processing steps in order to produce ready-to-use speech data for further training and testing purposes. This speech corpus can be used for any Arabic speech-based application, including speaker recognition and text-to-speech synthesis, covering different research needs.


The obtained results are comparable to those reported for other languages with the same vocabulary size. This work adds a new kind of speech data for Modern Standard Arabic (MSA) text and speech applications besides other kinds such as broadcast news and telephone conversations. Therefore, this work is an invitation to all Arabic ASR developers and research groups to utilize and capitalize on this corpus.

References

1. Elmahdy, M., Gruhn, R., Minker, W., Abdennadher, S.: Survey on common Arabic language forms from a speech recognition point of view. In: International Conference on Acoustics (NAG-DAGA), Rotterdam, Netherlands, pp. 63–66 (2009)

2. Alotaibi, Y.A.: Comparative Study of ANN and HMM to Arabic Digits Recognition Systems. Journal of King Abdulaziz University: Engineering Sciences 19(1), 43–59 (2008)

3. Kirchhoff, K., Bilmes, J., Das, S., Duta, N., Egan, M., Ji, G., He, F., Henderson, J., Liu, D., Noamany, M., Schone, P., Schwartz, R., Vergyri, D.: Novel approaches to Arabic speech recognition. In: Report from the 2002 Johns-Hopkins Summer Workshop, ICASSP 2003, Hong Kong, vol. 1, pp. 344–347 (2003)

4. Al-Sulaiti, L., Atwell, E.: The design of a corpus of Contemporary Arabic. International Journal of Corpus Linguistics, John Benjamins Publishing Company, 1–36 (2006)

5. Nikkhou, M., Choukri, K.: Survey on Industrial Needs for Language Resources. Technical Report, NEMLAR – Network for Euro-Mediterranean Language Resources (2004)

6. Nikkhou, M., Choukri, K.: Survey on Arabic Language Resources and Tools in the Mediterranean Countries. Technical Report, NEMLAR – Network for Euro-Mediterranean Language Resources (2005)

7. Alghamdi, M., Alhamid, A.H., Aldasuqi, M.M.: Database of Arabic Sounds: Sentences. Technical Report, King Abdulaziz City of Science and Technology, Saudi Arabia, in Arabic (2003)

8. Ali, M., Elshafei, M., Alghamdi, M., Almuhtaseb, H., Al-Najjar, A.: Generation of Arabic Phonetic Dictionaries for Speech Recognition. In: IEEE Proceedings of the International Conference on Innovations in Information Technology, UAE, pp. 59–63 (2008)

9. Elshafei, A.M.: Toward an Arabic Text-to-Speech System. The Arabian Journal of Science and Engineering 16(4B), 565–583 (1991)


User and Noise Adaptive Dialogue Management Using Hybrid System Actions

Senthilkumar Chandramohan and Olivier Pietquin

SUPELEC - IMS Research Group, Metz, France
{senthilkumar.chandramohan,olivier.pietquin}@supelec.fr

Abstract. In recent years reinforcement-learning-based approaches have been widely used for policy optimization in spoken dialogue systems (SDS). A dialogue management policy is a mapping from dialogue states to system actions, i.e. given the state of the dialogue, the dialogue policy determines the next action to be performed by the dialogue manager. So far, policy optimization has primarily focused on mapping the dialogue state to simple system actions (such as confirming or asking one piece of information), and the possibility of using complex system actions (such as confirming or asking several slots at the same time) has not been well investigated. In this paper we explore the possibilities of using complex (or hybrid) system actions for dialogue management and then discuss the impact of user experience and channel noise on complex action selection. Our experimental results obtained using simulated users reveal that user and noise adaptive hybrid action selection can perform better than dialogue policies which can only perform simple actions.

1 Introduction

Spoken Dialog Systems (SDS) are systems which have the ability to interact with human beings using speech as the medium of interaction. The dialogue policy plays a crucial role in dialogue management and informs the dialogue manager what action to perform next given the state of the dialogue. Thus building an optimal dialogue management policy is an important step when developing any spoken dialogue system. Using a hand-coded dialogue policy is one of the simplest ways to build dialogue systems, but as the complexity of the dialogue task grows it becomes increasingly difficult to code a dialogue policy manually. Over the years various statistical approaches such as [9, 3, 21] have been proposed for dialogue management problems with reasonably large state spaces.

Most of the literature on spoken dialog systems (policy optimization) focuses on the optimal selection of elementary dialog acts at each dialog turn. In this paper, we investigate the possibility of learning to combine these simple dialog acts into complex actions to obtain more efficient dialogue policies. Since complex system acts combine several system acts together, they can lead to shorter dialogue episodes. Also, by using complex system acts, system designers can introduce some degree of flexibility into the human-computer interaction by allowing users with prior knowledge about the system to furnish and receive as much information as they wish in one user/system act. The use of complex system actions for dialogue management has been studied only to a small extent. Work related to the use of open-ended questions is presented in [11].

The primary focus of this contribution is to learn a hybrid action policy which can choose to perform simple system acts as well as more complex and flexible system acts. The challenge in learning such a hybrid policy is the unavailability of dialogue corpora in which complex system acts can be explored. Secondly, the impact of noise and user simulation on complex system acts is analyzed, and means to learn a noise and user adaptive dialogue policy are discussed.

This paper is organized as follows: In Section 2 a formal description of the Markov Decision Process (MDP) is presented first, following which casting and solving the dialogue problem in the framework of an MDP is discussed. In Section 3 complex system actions are formally defined and then the impact of channel noise and user experience is discussed. Section 4 outlines how channel noise can be simulated using user simulation and how a noise adaptive hybrid action policy can be learned. Section 5 describes how a user-adaptive hybrid-action policy can be learned. Section 6 outlines our evaluation set-up and analyzes the performance of the different policies learned. Eventually Section 7 concludes.

2 MDP for Dialogue Management

The MDP [1] framework comes from the optimal control community. It is originally used to describe and solve sequential decision making problems in stochastic dynamic environments. An MDP is formally a tuple $\{S, A, P, R, \gamma\}$ where $S$ is the (finite) state space, $A$ the (finite) action space, $P \in \mathcal{P}(S)^{S \times A}$ the family of Markovian transition probabilities (the notation $f \in A^B$ is equivalent to $f : B \to A$), $R \in \mathbb{R}^{S \times A \times S}$ the reward function and $\gamma$ the discounting factor ($0 \leq \gamma \leq 1$). According to this formalism, during the interaction with a controlling agent, an environment steps from state to state ($s \in S$) according to the transition probabilities $P$ as a consequence of the controller's actions ($a \in A$). After each transition, the system produces an immediate reward ($r$) according to its reward function $R$. A so-called policy $\pi \in A^S$ mapping states to actions models the way the agent controls its environment. The quality of a policy is quantified by the so-called value function $V^\pi(s)$, which maps each state to the expected discounted cumulative reward given that the agent starts in this state and follows the policy $\pi$:

$$V^\pi(s) = E\Big[\sum_{i=0}^{\infty} \gamma^i r_i \,\Big|\, s_0 = s, \pi\Big] \qquad (1)$$

An optimal policy $\pi^*$ maximizes this function for each state:

$$\pi^* = \arg\max_{\pi} V^\pi \qquad (2)$$

Suppose that we are given the optimal value function $V^*$ (that is, the value function associated with an optimal policy); deriving the associated policy would require knowing the transition probabilities $P$. Yet, these are usually unknown and the optimal control policy should be learned only by interaction. This is why the state-action value (or Q-) function is introduced. It adds a degree of freedom on the choice of the first action:

$$Q^\pi(s, a) = E\Big[\sum_{i=0}^{\infty} \gamma^i r_i \,\Big|\, s_0 = s, a_0 = a, \pi\Big] \qquad (3)$$

$$Q^*(s, a) = E\Big[\sum_{i=0}^{\infty} \gamma^i r_i \,\Big|\, s_0 = s, a_0 = a, \pi^*\Big] \qquad (4)$$

where $Q^*(s, a)$ is the optimal state-action value function. An action-selection strategy that is greedy according to this function ($\pi(s) = \arg\max_a Q^*(s, a)$) provides an optimal policy. There are many algorithms that solve this optimization problem. When this optimization is done without any information about the transition probabilities and the reward function, but only transitions and immediate rewards are observed, the solving algorithms belong to the Reinforcement Learning (RL) family [20].

2.1 Dialogue as an MDP

The spoken dialogue management problem can be seen as a sequential decision making problem. It can thus be cast into an MDP and the optimal policy can be found by applying an RL algorithm. Indeed, the role of the dialogue manager (or the decision maker) is to select and perform dialogue acts (actions in the MDP paradigm) when it reaches a given dialogue turn (state in the MDP paradigm) while interacting with a human user (its environment in the MDP paradigm). There can be several types of system dialogue acts. For example, in the case of a restaurant information system, possible acts are request(cuisine type), provide(address), confirm(price range), close, etc. The dialogue state is usually represented efficiently by the Information State paradigm [3]. In this paradigm, the dialogue state contains a compact representation of the history of the dialogue in terms of system acts and the subsequent user responses (user acts). It summarizes the information exchanged between the user and the system until the desired state is reached and the dialogue episode is eventually terminated. A dialogue management strategy is thus a mapping between dialogue states and dialogue acts.

Still following the MDP paradigm, the optimal strategy is the one that maximizes some cumulative function of the rewards collected along the interaction. A common choice for the immediate reward is the contribution of each action to the user's satisfaction [17]. This subjective reward is usually approximated by a linear combination of objective measures (dialogue duration, number of ASR errors, task completion, etc.). The weights of this linear combination can be computed from empirical data [10]. Yet, most of the time, simpler reward functions are used, taking into account that the most important objective measures are task completion and the length of the dialogue episode.
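As a deliberately simplified illustration of such a reward, the sketch below combines task completion, dialogue length and ASR errors linearly; the weight values are arbitrary placeholders, not weights estimated from data as in [10].

    def dialogue_reward(task_completed, n_turns, n_asr_errors,
                        w_task=300.0, w_turn=20.0, w_err=10.0):
        """Linear reward: completion bonus minus duration and ASR-error penalties."""
        return w_task * float(task_completed) - w_turn * n_turns - w_err * n_asr_errors

    print(dialogue_reward(task_completed=True, n_turns=6, n_asr_errors=1))  # 300 - 120 - 10 = 170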

2.2 Restaurant Information MDP-SDS

The dialogue problem studied in the rest of this paper is a slot-filling restaurant information system. The dialogue manager has 3 slots to be filled and confirmed by the user: (1) Cuisine (Italian-French-Thai), (2) Location (City center-East-West) and (3) Price-range (Cheap-Moderate-Expensive). Here the goal of the dialogue system is to fill these slots with user preferences and also to confirm the slot values, if the confidence in the retrieved information is low, before proceeding to seek the relevant information from the database. The initial list of possible (commonly used) system actions is: (1) Ask cuisine, (2) Ask location, (3) Ask restaurant type, (4) Explicit confirm cuisine, (5) Explicit confirm location, (6) Explicit confirm type and (7) Greet the user.

The dialogue state of the restaurant information MDP-SDS includes 6 binary values to indicate whether the 3 slots have been filled and confirmed. It also includes a binary value to indicate whether the user has been greeted or not. The reward function is defined as follows: the system receives a completion reward of 300 if the task is successfully completed and receives a time-step penalty of -20 for every transition.
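A minimal encoding of this state space and reward function might look as follows; this is a sketch under the stated assumptions, not the DIPPER/REALL representation used by the authors.

    from dataclasses import dataclass, field

    SLOTS = ("cuisine", "location", "price_range")

    @dataclass
    class DialogueState:
        """Six binary flags (filled/confirmed per slot) plus a greeting flag."""
        filled: dict = field(default_factory=lambda: {s: False for s in SLOTS})
        confirmed: dict = field(default_factory=lambda: {s: False for s in SLOTS})
        greeted: bool = False

        def is_final(self):
            return all(self.filled.values()) and all(self.confirmed.values())

    def reward(state):
        """-20 per transition, plus 300 once all slots are filled and confirmed."""
        return -20 + (300 if state.is_final() else 0)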

2.3 Dialogue Policy Optimization

Once the dialogue management problem is cast into an MDP, Dynamic Programming or RL methods [20] can be applied to find the optimal dialogue policy [9]. The goal of the policy optimization task is to find the dialogue policy which maximizes the expected discounted sum of rewards that can be obtained by the agent over an infinite time period. Most of the recent work done in this direction [6] focuses on using online reinforcement learning algorithms such as SARSA for policy optimization. Online RL algorithms like SARSA are data intensive, and so it is customary to simulate or model the user behavior based on the available dialogue corpus [8, 18, 13] and to artificially generate simulated dialogues. The RL policy learner then interacts with the simulated user to find the optimal dialogue policy. The DIPPER dialogue management framework [4], along with REALL, a hierarchical reinforcement learning policy learner [7], was used to learn and test the dialogue policies discussed in this paper (the exploration rate of the RL policy learner was set to 0.2). The user simulation used in the experiments was trained using the town information corpus discussed in [4]. The policy learned using the reward function, action and state space described in 2.2 will be the baseline and will be referred to as the simple action policy.
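For reference, the core of an online SARSA learner of the kind used for policy optimization is sketched below. The environment interface (reset/step) is a hypothetical stand-in for the user simulator, the epsilon value mirrors the exploration rate quoted above, and the code is a generic tabular sketch rather than the REALL implementation.

    import random
    from collections import defaultdict

    def sarsa(env, actions, episodes=10000, alpha=0.1, gamma=0.95, epsilon=0.2):
        """Tabular SARSA: learn Q(s, a) by interacting with a (simulated) user environment."""
        Q = defaultdict(float)

        def policy(state):
            if random.random() < epsilon:                      # epsilon-greedy exploration
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(state, a)])

        for _ in range(episodes):
            state = env.reset()                                # hypothetical environment interface
            action = policy(state)
            done = False
            while not done:
                next_state, r, done = env.step(action)         # user-simulator transition and reward
                next_action = policy(next_state)
                target = r + (0 if done else gamma * Q[(next_state, next_action)])
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state, action = next_state, next_action
        return Q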

3 Complex System Actions

Simple actions are commonly used system acts which are related to one slot, such as asking for a slot value or explicitly confirming a slot value. The actions listed in Subsection 2.2 are all examples of simple system acts (except implicit confirmation). Complex actions are system actions which are formed by combining two or more simple system actions. Complex actions deal with multiple slots, such as confirming two slot values or asking for three slot values. Thus for the restaurant information dialogue system there are several possible complex actions that can be performed. Some of the complex actions in this case are (1) Ask two slot values, (2) Ask three slot values, (3) Explicitly confirm two slot values, (4) Explicitly confirm three slot values, (5) Implicitly confirm two slots and ask the third slot and (6) Implicitly confirm a slot and ask a slot value (a commonly used complex action).


3.1 Hybrid Action Policy

This section explains how to learn a hybrid action policy which can choose to perform simple system acts as well as complex system acts. Firstly, the action set of the restaurant information MDP-SDS described in Section 2.2 is updated with the following complex system actions: (1) Ask values for two slots, (2) Explicitly confirm two slot values, (3) Implicitly confirm two slot values and ask the value for the third slot. Since the action set is updated with simple and complex actions, the RL policy learner will explore both types of actions. But the user simulation learned from the dialogue corpora (which only had simple actions) can only respond to simple system actions.

Thus the user behavior for the complex system acts is hand-coded and combined with the learned user simulation. The hand-coded behavior for complex actions is as follows: (1) ask 2 slots {ProvideTwoSlotValue 0.9, ProvideOneSlotValue 0.1, SayNothing 0}, (2) explicitConfirm 2 slots {SayYes 1.0, SayNo 0, SayNothing 0}, (3) implicitConfirm 2 slots and ask value for slot {ProvideOneSlotValue 1.0, SayYes 0.0, SayNo 0}. The updated user simulation can thus respond to both simple (behavior learned from corpora) and complex system actions (hand-coded behavior). As explained in Section 2.3, a dialogue policy is learned using the updated user simulation and the MDP-SDS; this policy will be referred to as the hybrid action policy.
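A minimal sketch of such a hand-coded response model, using the probabilities listed above, could look as follows; the action and response names mirror the text, while the fallback to a learned simulator is an assumed interface.

import random

# Hand-coded user responses to complex system acts (zero channel noise assumed).
COMPLEX_ACTION_BEHAVIOR = {
    "Ask2Slots":               {"ProvideTwoSlotValues": 0.9, "ProvideOneSlotValue": 0.1},
    "ExplicitConfirm2Slots":   {"SayYes": 1.0},
    "ImplicitConfirm2AskSlot": {"ProvideOneSlotValue": 1.0},
}

def sample_user_act(system_act, learned_model=None):
    """Complex acts use the hand-coded table; simple acts fall back to the
    user simulation learned from the corpus (passed in as learned_model)."""
    if system_act in COMPLEX_ACTION_BEHAVIOR:
        dist = COMPLEX_ACTION_BEHAVIOR[system_act]
        acts, probs = zip(*dist.items())
        return random.choices(acts, weights=probs, k=1)[0]
    return learned_model.respond(system_act)  # hypothetical learned simulator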

System: Hello
User: Silence
System: What type of restaurant are you looking for and in which location? (complex)
User: Italian restaurant in city center
System: In what price-range are you looking for an Italian restaurant?
User: A cheap one
System: Did you say you are looking for a cheap restaurant in city-center? (complex)
User: Yes

The dialogue episode presented here is an interaction between the RL policy learner (exploration rate set to zero) and the user simulation using the hybrid action policy. One can observe that the policy can now choose complex system actions as well as simple actions when required. It can also be observed that, given the action set of the restaurant information dialogue system, the sample dialogue presented here is an optimal behavior for grounding the three slots.

3.2 Effect of Noise and User on Complex Action

The hand-coded user behavior for complex actions discussed in Section 3.1 simulates the zero channel noise scenario, i.e., when the user says something it is assumed that the system will capture it correctly and there is no chance for error. This is not always true and there may be some noise in the transmission channel. Thus ideally the probability for the SayNo user act is not zero (the fact that the system does not understand what the user said is modeled as the user saying nothing). But if the user response is SayNo for the complex system act ImplicitConfirm2slotsAndAskASlot it would be difficult to identify which of the two slots is wrong. Based on this, our first assumption is: when there is noise in the automatic speech recognition (ASR) channel it is advisable to perform simple system acts and not complex actions.

System: Can you let me know what your preferences for restaurant selection are?
User 1: Nothing (Novice user)
User 2: Italian restaurant (Novice user)
User 3: Cheap Italian restaurant in City Center (Experienced user)

The users who intend to use the restaurant information SDS may range from novice (new) users to experienced (frequent) users. Now let us consider the above mentioned example. Here the user under-provides information in the first two cases but provides all necessary information in the third case. Based on this, our second assumption is: it is ideal to perform simple system actions while interacting with novice users and to perform hybrid actions while interacting with experienced users.

4 Noise Adaptive Policy

Based on our first assumption, action selection has to be performed depending on the noise level in the ASR channel. The first step in learning a noise dependent policy is to have a noise simulation module. Several works have been done in the recent past, such as [14, 12, 5, 16, 19], to simulate channel noise for dialogue modeling. A simple approach to simulate the channel noise is to tune the probabilities of user responses for confirmation system actions [15]. By increasing the probability for negation we can simulate the high noise scenario and by reducing the probability for negation we can simulate the low noise scenario. The user behaviors for complex confirmation actions presented in Section 3.1 are modified as shown in Table 1 to simulate the low noise and the high noise scenarios. Let us refer to these user simulations as the low noise user model and the high noise user model.

Table 1. Handcrafted user behavior for noise simulation

Channel noise  Action                       Provide 1 Value  SayYes  SayNo
Low            ImplicitConfirm2AndAskASlot  0.9              0       0.1
Low            ExplicitConfirm2             0                0.9     0.1
High           ImplicitConfirm2AndAskASlot  0.6              0       0.4
High           ExplicitConfirm2             0                0.6     0.4

The goal now is to learn a hybrid dialogue policy which is sensitive to noise in the ASR channel. A new variable is introduced in the Information State to measure the negation index (i.e. to measure the frequency of negations made by the user). This approach is similar to the one suggested in [2]. The reward function defined in Section 2.2 is modified to include the negation index as shown below:


Reward = Completion reward − Time step penalty − Negation count ∗ 20

Using this updated reward function and the user simulations with simulated channel noise, noise sensitive policies can be learned. First a policy is learned using the low noise user model and a second using the high noise user model. Let us refer to these dialogue policies as the low noise policy and the high noise policy. An interaction between the policy learner (learning rate set to zero) using the low noise policy and the user simulation with simulated low channel noise is presented below. It can be observed that the system performs complex system actions as well as simple actions.

System: greet
User: silence
System: ask2Slot [askCuisine, askLocation]
User: provide 2 values [cuisine, central Location]
System: implConfAskASlot [confirmCuisine, askPrice]
User: provide 1 value [Moderate]
System: explicitConfirm2 [confirm centralLocation, confirm Moderate]
User: yesAnswer

An interaction between the policy learner (learning rate set to zero) using the high noise policy and the user simulation with simulated high channel noise is presented below. It can be observed that the system performs only simple (confirmation) actions.

System: greet
User: silence
System: ask2Slot [askLocation, askPrice]
User: provide 2 values [CentralLocation, Moderate]
System: askASlot [askCuisine]
User: provide 1 value [Italian]
System: explicitConfirm [CentralLocation]
User: yesAnswer
System: explicitConfirm [Moderate]
User: yesAnswer
System: explicitConfirm [Italian]
User: yesAnswer

In real life situations it is practically impossible to say when the channel noise will be low and when it will be high; thus one may not know when to switch between the high noise and low noise policies. One possible solution is to learn a dialogue policy which can adapt itself to different ASR channel noise levels. A noise adaptive dialogue policy is learned by using the high noise user model and the low noise user model in parallel. For every dialogue cycle the policy learner randomly chooses one of the two user simulations for interaction. This way one can learn a policy that can adapt to different channel noise levels. Let the policy learned by randomly switching between the user models during the policy optimization process be called the noise adaptive hybrid (action) policy.
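Assuming an episode-based policy learner, the random switching between the two noise conditions can be sketched as below; the policy_learner interface is hypothetical and only illustrates where the switching happens (the state is assumed to contain the negation index and the reward the negation penalty described above).

import random

def train_noise_adaptive_policy(policy_learner, low_noise_user, high_noise_user,
                                episodes=10000):
    """Learn one noise-adaptive policy by randomly picking one of the two
    handcrafted noise conditions (Table 1) for every dialogue episode."""
    for _ in range(episodes):
        user = random.choice([low_noise_user, high_noise_user])
        policy_learner.run_episode(user)    # hypothetical: one full dialogue + updates
    return policy_learner.greedy_policy()   # hypothetical accessor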


5 User Adaptive Policy

The goal now is to first simulate the user experience in the user simulation and use it to learn a user experience dependent policy. To perform this task novice users are assumed (as shown in the example in Section 3.2) to under-provide information for complex actions whereas experienced users provide the necessary slot values. In order to simplify the problem the novice users are assumed to say nothing for complex (information seeking) actions whereas the experienced users provide the necessary slot values in most cases. Tuning the probabilities of the user behavior in this way results in two user behaviors; let us term them the novice user simulation and the experienced user simulation. In addition to simulating the user experience, the user behavior also simulates the low noise scenario. Novice and experienced user behaviors with low channel noise are outlined in Table 2.

Table 2. Handcrafted user behavior for user experience simulation

User         Noise  Action                  Give 2 values  Give 1 value  Yes  No   Nothing
Novice       Low    ImplicitConf2&AskASlot  0              0.9           0    0.1  0
Novice       Low    ExplicitConf2           0              0             0.9  0.1  0
Novice       Low    Ask2Slots               0              0             0    0    1.0
Experienced  Low    ImplicitConf2&AskASlot  0              0.9           0    0.1  0
Experienced  Low    ExplicitConf2           0              0             0.9  0.1  0
Experienced  Low    Ask2Slots               0.9            0             0    0    0.1

Similar to the negation index, we introduce a term called the experience index in the state representation of the restaurant information MDP-SDS. The reward function updated in Section 4 is again updated as follows:

Reward = Completion reward + TimePenalty − NegCount ∗ 5 − ExpIndex ∗ 10

By using the novice and experienced user behaviors one can learn two different dialogue policies. Let us term these policies the novice user policy and the experienced user policy. Also, as explained in the previous section, by using these user simulations simultaneously, i.e. by randomly switching them during the policy optimization, one can learn a user adaptive policy. Let us term this policy the adaptive hybrid (action) policy.
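For concreteness, the updated reward could be computed as in the following sketch; treating the experience index as a simple per-dialogue counter of under-informative user turns, and the time penalty as an accumulated per-turn cost, are illustrative assumptions rather than the exact definitions used here.

def shaped_reward(completed, num_turns, negation_count, experience_index,
                  completion_reward=300, step_penalty=20):
    """Reward = completion reward + time penalty - NegCount*5 - ExpIndex*10."""
    total = completion_reward if completed else 0
    total -= num_turns * step_penalty       # accumulated per-turn penalty (assumption)
    total -= negation_count * 5             # penalize frequent user negations
    total -= experience_index * 10          # penalize mismatch with user experience
    return total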

6 Policy Evaluation and Analysis

Table 3 presents the result of the comparison between the simple action policy and the hybrid action policy derived in Sections 2.3 and 3.1. The results are based on 300 dialogue cycles between the policy learner using the two policies (learning rate set to zero) and the user simulation (which was used to learn the hybrid action policy). One can observe that by using complex actions along with simple actions we can considerably reduce the dialogue length and hence improve the overall reward of the dialogue manager.


Table 3. Simple action vs Hybrid action policy

Policy Name  Average Reward  Completion  Average Length
Simple       160             300         7.0
Hybrid       214             300         4.2

Table 4 presents the result of the comparison between the low noise policy and the adaptive noise hybrid action policy derived in Section 4. The results are based on 300 dialogue cycles between the policy learner using the two policies (learning rate set to zero) and the user simulation (also) simulating low channel noise. It can be observed that the adaptive noise policy performs as well as the low noise policy in the low channel noise scenario.

Table 4. Low noise policy vs Adaptive noise policy in low noise scenario

Policy Name     Average Reward  Completion  Average Length
Low noise       216.51          300         4.12
Adaptive noise  214.06          300         4.20

Table 5 presents the result of the comparison between the high noise policy and the adaptive noise hybrid action policy derived in Section 4. The results are based on 300 dialogue cycles between the policy learner using the two policies (exploration rate set to zero) and the user simulation (also) simulating high channel noise. It can be observed that the adaptive noise policy performs as well as the high noise policy in the high channel noise scenario, but there is a small degradation with regard to the task completion average.

Table 5. High noise policy vs Adaptive noise policy in high noise scenario

Policy Name     Average Reward  Completion  Average Length
High noise      160.52          300         6.65
Adaptive noise  175.99          295.84      5.30

Table 6 presents the result of the comparison between the low noise, high noise and adaptive noise hybrid action policies derived in Section 4. The results are based on 300 dialogue cycles between the policy learner using the three policies (exploration rate set to zero) and the two user simulations simulating mixed channel noise. It can be observed that the adaptive noise policy performs better than the high noise policy and the low noise policy in the mixed channel noise scenario. This shows that the adaptive policy learns a trade-off to switch between complex and simple actions with regard to changing noise levels (whereas the low noise policy always tries to perform complex actions and the high noise policy always performs simple actions). It actually takes advantage of the extended state representation to perform this adaptation.


Table 6. Low noise policy vs High noise policy vs Adaptive noise policy in mixed noise scenario

Policy Name     Average Reward  Completion  Average Length
Low noise       140.19          297.56      7.38
High noise      170.06          300         6.33
Adaptive noise  191.38          298.07      4.87

Table 7 presents the result of the comparison between the novice user policy, the experienced user policy and the adaptive user hybrid action policy derived in Section 5. The results are based on 250 dialogue cycles between the policy learner using the three policies (exploration rate set to zero) and both novice and experienced user simulations (randomly switched to simulate mixed user experience). It can be observed that the user adaptive policy performs better than the novice user policy and the experienced user policy in the mixed user scenario. This shows that the user adaptive policy learns a trade-off to switch between complex and simple actions with regard to changing user experience levels (whereas the novice user policy always tries to perform simple actions and the experienced user policy always performs complex actions).

Table 7. Novice vs Experienced vs Adaptive user policy in mixed user scenario

Policy       Avg. Reward  Completion  Avg. SimpleAct  ComplexAct  Length
Novice       145.9        250.0       7.66            0           7.66
Experienced  -197.7       228.6       2.9             16.0        18.9
Adaptive     151.9        250.0       4.69            1.0         5.69

7 Conclusion

So far the possibilities of using complex system actions along with simple actions for spoken dialogue management have not been much investigated. Based on the experimental results presented in this contribution one can conclude that complex action selection can considerably reduce the dialogue length, but at the same time it is important to consider the channel noise and user experience factors before choosing complex actions. Since it is not possible to predict the channel noise level or the experience level of the user in real-life scenarios, one can learn an adaptive hybrid action policy that can adapt to the channel noise and the user experience. Yet, this requires extending the state representation to take into account the behaviour of the user (SayNo or over-informative users, for example).

All the tasks (learning and testing) presented in this paper were carried out using simulated users partially learned from a corpus and partially hand-tuned; thus it would be ideal to test these policies with real users in the future. Hybrid action selection may move human-machine interaction a step closer towards human-human communication. Another interesting direction of future work will be to explore the possibilities of automatically generating new complex actions from a given list of simple actions and using online policy learning approaches to learn a hybrid dialogue policy. This way we may come across potentially new and interesting system actions which may not be available in the dialogue corpus.


References

[1] Bellman, R.: A Markovian decision process. Journal of Mathematics and Mechanics 6, 679–684 (1957)

[2] Janarthanam, S., Lemon, O.: User simulations for online adaptation and knowledge-alignment in troubleshooting dialogue systems. In: Proceedings of LONDial, London, UK (2008)

[3] Larsson, S., Traum, D.R.: Information state and dialogue management in the TRINDI dialogue move engine toolkit. Natural Language Engineering 6, 323–340 (2000)

[4] Lemon, O., Georgila, K., Henderson, J., Stuttle, M.: An ISU dialogue system exhibiting reinforcement learning of dialogue policies: generic slot-filling in the TALK in-car system. In: Proceedings of the Meeting of the European Chapter of the Association for Computational Linguistics (EACL 2006), Morristown, NJ, USA (2006)

[5] Lemon, O., Liu, X.: Dialogue Policy Learning for combinations of Noise and User Simulation: transfer results. In: Proceedings of SIGdial 2007, Antwerp, Belgium (2007)

[6] Lemon, O., Pietquin, O.: Machine learning for spoken dialogue systems. In: Proceedings of the International Conference on Speech Communication and Technologies (InterSpeech 2007), Antwerpen, Belgium (2007)

[7] Lemon, O., Liu, X.X., Shapiro, D., Tollander, C.: Hierarchical Reinforcement Learning of Dialogue Policies in a development environment for dialogue systems: REALL-DUDE. In: Proceedings of the 10th SemDial Workshop, BRANDIAL 2006, Potsdam, Germany (2006)

[8] Levin, E., Pieraccini, R., Eckert, W.: A Stochastic Model of Human-Machine Interaction for learning dialog Strategies. IEEE Transactions on Speech and Audio Processing 8, 11–23 (2000)

[9] Levin, E., Pieraccini, R., Eckert, W.: Using Markov decision process for learning dialogue strategies. In: Proceedings of ICASSP, Seattle, Washington (1998)

[10] Kamm, C.A., Walker, M.A., Litman, D.J., Abella, A.: PARADISE: A framework for evaluating spoken dialogue agents. In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL 1997), Madrid, Spain, pp. 271–280 (1997)

[11] Pietquin, O.: A Framework for Unsupervised Learning of Dialogue Strategies. PhD thesis, Faculté Polytechnique de Mons, TCTS Lab, Belgium (April 2004)

[12] Pietquin, O., Dutoit, T.: A Probabilistic Framework for Dialog Simulation and Optimal Strategy Learning. IEEE Transactions on Audio, Speech and Language Processing 14(2), 589–599 (2006)

[13] Pietquin, O., Dutoit, T.: A probabilistic framework for dialog simulation and optimal strategy learning. IEEE Transactions on Audio, Speech & Language Processing 14(2), 589–599 (2006)

[14] Pietquin, O., Renals, S.: ASR System Modeling For Automatic Evaluation And Optimization of Dialogue Systems. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2002), Orlando, FL, USA (May 2002)

[15] Rieser, V.: Bootstrapping Reinforcement Learning-based Dialogue Strategies from Wizard-of-Oz data. PhD thesis, Saarland University, Dept. of Computational Linguistics (July 2008)

[16] Rieser, V., Lemon, O.: Learning effective multimodal dialogue strategies from wizard-of-oz data: bootstrapping and evaluation. In: Proceedings of the Association for Computational Linguistics (ACL 2008), Columbus, USA (2008)

[17] Singh, S., Kearns, M., Litman, D., Walker, M.: Reinforcement learning for spoken dialogue systems. In: Proceedings of the Annual Meeting of the Neural Information Processing Society (NIPS 1999), Denver, USA. Springer, Heidelberg (1999)


[18] Schatzmann, J., Weilhammer, K., Stuttle, M., Young, S.: A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. Knowledge Engineering Review 21(2), 97–126 (2006)

[19] Schatzmann, J., Young, S.: Error simulation for training statistical dialogue systems. In: Proceedings of ASRU 2007, Kyoto, Japan (2007)

[20] Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning), 3rd edn. The MIT Press, Cambridge (March 1998)

[21] Williams, J.D., Young, S.: Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language 21(2), 393–422 (2007)


Detection of Unknown Speakers in an Unsupervised Speech Controlled System*

Tobias Herbig1,3, Franz Gerl2, and Wolfgang Minker3

1 Nuance Communications Aachen GmbH, Ulm, Germany
2 SVOX Deutschland GmbH, Ulm, Germany

3 University of Ulm, Institute of Information Technology, Ulm, Germany

Abstract. In this paper we investigate the capability of our self-learning speech controlled system, comprising speech recognition, speaker identification and speaker adaptation, to detect unknown users. Our goal is to enhance automated speech controlled systems by an unsupervised personalization of the human-computer interface. New users should be allowed to use a speech controlled device without the need to identify themselves or to undergo a time-consuming enrollment. Instead, the system should detect new users during the operation of the device. New speaker profiles should be initialized and incrementally adjusted without any additional intervention of the user. Such a personalization of human-computer interfaces represents an important research issue. Exemplarily, in-car applications such as speech controlled navigation, hands-free telephony or infotainment systems are investigated. Results for detecting unknown speakers are presented for a subset of the SPEECON database.

1 Introduction

Speech recognition has attracted attention for various applications such as office systems, manufacturing, telecommunication, medical reports and infotainment systems [1].

For in-car applications both the usability and security can be increased for a wide variety of users. The driver can be supported to safely participate in road traffic and to operate technical devices such as navigation systems or hands-free sets. Infotainment systems with speech recognition for navigation, telephony or music control typically are not personalized to a single user. The speech signal may be degraded by varying engine, wind and tire noises, or transient events such as passing cars or babble noise. Computational efficiency and memory consumption are important design parameters. On the other hand, a large vocabulary, e.g. city or street names for navigation, has to be reliably recognized.

However, for a variety of practical applications a small number of users, e.g. 5-10 recurring speakers, can be assumed. The benefits of a device that can recognize the voices of its main users are obvious:

* This work was conducted at Harman-Becker. Tobias Herbig is now with Nuance Communications. Franz Gerl is now with SVOX.



The dialog flow can be personalized to specific user habits. New ways for simplifying the interaction with the device can be suggested. Inexperienced users can be introduced to the system.

Furthermore, speech recognition can be improved. Since speech recognizers are typically trained on a large set of speakers, there is a mismatch between the trained speech pattern and the voice characteristics of each speaker, degrading speech recognition accuracy [2]. Enhanced statistical models can be obtained for each speaker by adapting on speaker specific data. Without speaker tracking all information acquired from a particular speaker is either lost or degraded with each speaker turn. Therefore, it seems to be reasonable to employ speaker identification and speaker adaptation separately for different speakers.

A simple implementation would be to force the user to identify himself whenever the system is initialized. However, we look for a more natural and convenient human-computer communication by identifying the current user in an unsupervised way.

We developed a speech controlled system which includes speaker identification, speech recognition and speaker adaptation. We succeeded in tracking different speakers after a short enrollment phase of only two command and control utterances [3]. This is enabled by combining the strengths of two adaptation schemes [4]. In the learning phase only a few parameters have to be estimated, allowing the main speech and speaker characteristics to be captured. In the long run individual adjustment is achieved. A unified approach of speaker identification and speech recognition was developed as an extension of a standard speech recognizer. Multiple recognitions can therefore be avoided.

In this paper we investigate the detection of unknown speakers to overcome the limitation of an enrollment. The goal is to initialize new speaker profiles in an unsupervised manner during the first few utterances of a new user. Therefore the unified approach of joint speaker identification and speech recognition and a standard speaker identification technique were evaluated for several training levels of the employed statistical models.

First, speaker identification of known and unknown speakers is introduced. An implementation of an automated speech recognizer is described. The speaker adaptation scheme employed to capture and represent speaker characteristics is briefly introduced. Then, the unified approach is summarized. The results of our experiments are then presented for the unified approach and a standard technique. Finally, a summary is given and an extension of our approach is suggested for future work.

2 Speaker Identification

Gaussian Mixture Models (GMMs) have emerged as the dominating statistical model for speaker identification [5]. GMMs comprise a set of N multivariate Gaussian density functions subsequently denoted by the index k. The multimodal probability density function


p(x_t | Θ_i) = \sum_{k=1}^{N} w_k^i · N{ x_t | μ_k^i, Σ_k^i }    (1)

is a convex combination of their component densities. Each speaker model i is completely defined by the parameter set Θ_i, which contains the weighting factors w_k^i, mean vectors μ_k^i and covariance matrices Σ_k^i. The parameter set will be omitted for reasons of simplicity. x_t denotes the feature vector, which may contain Mel Frequency Cepstral Coefficients (MFCCs) [6] or mean-normalized MFCCs [5], for example.

For speaker identification the log-likelihood

log(p(x_{1:T} | i)) = \sum_{t=1}^{T} log( \sum_{k=1}^{N} w_k^i · N{ x_t | μ_k^i, Σ_k^i } )    (2)

is calculated for each utterance characterized by the sequence of feature vectors x_{1:T}. Independently and identically distributed (iid) feature vectors are assumed. The speaker with the highest posterior probability or likelihood is identified, as found by Reynolds and Rose [7].

The detection of unknown speakers is a critical issue for open-set speaker identification since unknown speakers cannot be explicitly modeled. A simple extension to open-set scenarios is to introduce a threshold θ_th for the absolute log-likelihood values, as found by Fortuna et al. [8]:

log(p(x_{1:T} | Θ_i)) ≤ θ_th,   ∀i.    (3)

If the speaker's identity does not correspond to a particular speaker model, a low likelihood value is expected. However, we expect high fluctuations of the absolute likelihood in adverse environments such as automobiles. This may affect the threshold decision.

Advanced approaches may use normalization techniques comprising a Universal Background Model (UBM) [8,9]. Log-likelihood ratios of the speaker models and the UBM can be examined for out-of-set detection [8]. If the following inequality

log(p(x_{1:T} | Θ_i)) − log(p(x_{1:T} | Θ_UBM)) ≤ θ_th,   ∀i    (4)

is valid for all speaker models, an unknown speaker is likely. The latter approach has the advantage of lowering the influence of events which affect all statistical models in a similar way. For example, phrases spoken in an adverse environment may cause a mismatch between the speaker models and the audio signal due to background noises. Furthermore, text-dependent fluctuations in a spoken phrase, e.g. caused by unseen data or the training conditions, can be reduced [8]. In those cases the likelihood ratio appears to be more robust than absolute likelihoods.
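As an illustration of this ratio test, the following sketch uses scikit-learn's GaussianMixture as a stand-in for the speaker models and the UBM; the per-frame normalization of the scores and the concrete threshold value are assumptions of the example, not part of the method described here.

import numpy as np
from sklearn.mixture import GaussianMixture

def detect_unknown_speaker(features, speaker_gmms, ubm, threshold):
    """Open-set decision via log-likelihood ratios against a UBM (cf. Eq. (4)).

    features:     (T, D) array of feature vectors (e.g. mean-normalized MFCCs)
    speaker_gmms: list of fitted GaussianMixture models, one per enrolled speaker
    ubm:          fitted GaussianMixture universal background model
    threshold:    decision threshold on the per-frame log-likelihood ratio
    """
    ubm_ll = ubm.score(features)                        # mean log-likelihood per frame
    ratios = np.array([gmm.score(features) - ubm_ll for gmm in speaker_gmms])
    if np.all(ratios <= threshold):
        return None                                     # unknown speaker
    return int(np.argmax(ratios))                       # index of best in-set speaker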


3 Implementation of an Automated Speech Recognizer

We use a speech recognizer based on Semi-Continuous HMMs (SCHMMs) [10]. All states s_t share the mean vectors and covariances of one GMM and only differ in their weighting factors

p_SCHMM(x_t) = \sum_{s_t=1}^{M} \sum_{k=1}^{N} w_k^{s_t} · N{ x_t | μ_k, Σ_k } · p(s_t)    (5)

where M denotes the number of states. For convenience, the parameter set Θ, which includes the initial state probabilities, state transitions and the GMM parameters, is omitted. The basic setup of the speech recognizer is shown in Fig. 1.

Fig. 1. Block diagram of a speech recognizer based on SCHMMs (front-end → x_t → codebook → q_t → decoder)

Noise reduction is calculated by a standard Wiener filter. 11 MFCC coefficients are extracted. The 0th coefficient is substituted by a normalized energy. Cepstral mean subtraction and a Linear Discriminant Analysis (LDA) are applied to obtain a compact representation which is robust against environmental influences. We use windows of 9 frames of MFCCs, where the dynamics of delta and delta-delta coefficients have been incorporated into the LDA using a bootstrap training [11].

Each feature vector x_t is compared with a speaker independent codebook, subsequently called standard codebook. The standard codebook consists of about 1000 multivariate Gaussian densities defined by the parameter set

Θ_0 = { w_1^0, ..., w_N^0, μ_1^0, ..., μ_N^0, Σ_1^0, ..., Σ_N^0 }.    (6)

The soft quantization

q_t = ( p(x_t | k = 1), ..., p(x_t | k = N) )    (7)

is used for speech decoding. The speech decoder comprises the acoustic models, lexicon and language model. The acoustic model is realized by Markov chains. The lexicon represents the corpus of all word strings to be recognized. The prior probabilities of word sequences are given by the language model [10].

4 Speaker Adaptation

Speaker adaptation allows speaker specific statistical models to be adjusted or initialized, e.g. GMMs for speaker identification or codebooks for enhanced speech recognition. The capability of adaptation algorithms depends on the number of available parameters, which is limited by the amount of speaker specific training data.

The Eigenvoice (EV) approach is advantageous when facing few data since only some 10 parameters have to be estimated to adapt the codebooks of a speech recognizer [12]. To modify the mean vectors of our speech recognizer, about 25,000 parameters have to be optimized. Mean vector adaptation μ_k^{EV} may result from a linear combination of the original speaker independent mean vector μ_k^0 and a weighted sum of the eigenvoices e_{k,l}^{EV}:

μ_k^{EV} = μ_k^0 + \sum_{l=1}^{L} α_l · e_{k,l}^{EV}    (8)

where E_i{ μ_k^{EV} } \overset{!}{=} μ_k^0 is assumed. Only the scalar weighting factors α_l have to be optimized.

When sufficient speaker specific training data is available, the Maximum A Posteriori (MAP) adaptation allows individual adjustments of each Gaussian density [5]:

μ_k^{MAP} = (1 − α_k) · μ_k^0 + α_k · μ_k^{ML}    (9)

α_k = n_k / (n_k + η),   η = const    (10)

n_k = \sum_{t=1}^{T} p(k | x_t, Θ_0).    (11)

When GMMs for standard speaker identification are adapted, we use μ_k^{UBM} instead of μ_k^0. For convenience, we employ only the sufficient statistics of the standard codebook or a UBM. On extensive training data the MAP adaptation approaches the Maximum Likelihood (ML) estimates

μ_k^{ML} = (1 / n_k) \sum_{t=1}^{T} p(k | x_t, Θ_0) · x_t.    (12)

Thus, we use a simple yet efficient combination of EV and ML estimates to adjust the mean vectors of codebooks [4]:

μ_k^{opt} = (1 − β_k) · μ_k^{EV} + β_k · μ_k^{ML}    (13)

β_k = n_k / (n_k + λ),   λ = const.    (14)

The smooth transition from globally estimated mean vectors μ_k^{EV} to locally optimized ML estimates allows speaker characteristics to be retrieved efficiently for enhanced speech recognition. Fast convergence on limited data and individual adaptation on extensive data are achieved. For convenience, the speech decoder's state alignment is omitted in our notation for codebook optimization.


5 Unified Speaker Identification and Speech Recognition

We obtain an unsupervised speech controlled system by fusing all components introduced:

Speaker specific codebooks are initialized and continuously adjusted. The basic setup of the speaker independent speech recognizer shown in Fig. 1 is extended by N_Sp speaker specific codebooks which are operated in parallel to the standard codebook.

Fig. 2. System architecture for joint speaker identification and speech recognition. One front-end is employed for speaker specific feature extraction. Speaker specific codebooks are used to decode the spoken phrase (I) and to estimate the speaker identity (II) in a single step. Both results are used for speaker adaptation to enhance future speaker identification and speech recognition. Furthermore, speaker specific cepstral mean and energy normalization is controlled.

To avoid parallel speech decoding, a two-stage processing can be used. First, the most probable speaker is determined by standard methods for speaker identification. Then, the entire utterance can be re-processed by employing the corresponding codebook for speech decoding to generate a transcription.

To avoid the high latencies and the increase of computational complexity caused by re-processing, we developed a unified approach to realize speech decoding and speaker identification simultaneously. Speaker specific codebooks are considered as common GMMs representing the speaker's pronunciation characteristics. Class et al. [13] and the results in [4] give evidence that speaker specific codebooks can be employed to track different speakers.

We model speaker tracking by an HMM whose states represent enrolled speakers. The emission probability density functions are represented by the adapted codebooks of the speech recognizer. For speaker specific speech recognition with online speaker tracking we employ the forward algorithm to select the optimal codebook on a frame level. Only the soft quantization of the hypothesized speaker is processed by the speech decoder. This technique can be viewed as a fast but probably less confident speaker identification to be used for speaker specific speech recognition under real-time conditions. In this context, codebooks are used to decode a spoken phrase and to determine the current speaker.

In parallel, an improved guess of the speaker identity is provided for speaker adaptation, which is performed after speech decoding. Each speaker specific codebook is evaluated in the same way as common GMMs for speaker identification. We only employ the simplification of equal weighting factors w_k^s to avoid the requirement of a state alignment. The log-likelihood

L_i = (1 / T) \sum_{t=1}^{T} log( \sum_{k=1}^{N} N{ x_t | μ_k^i, Σ_k^0 } )    (15)

denotes the accumulated log-likelihood normalized by the length T of the recorded utterance. The weighting factors w_k = 1/N are omitted for convenience. In addition, the speech recognition result is used to discard speech pauses and garbage words which do not contain speaker specific information. The likelihood values of each codebook are buffered until a precise segmentation is available.

Our target is to automatically personalize speech controlled devices. To obtain a strictly unsupervised speech controlled system, new users should be automatically detected without the requirement to attend an enrollment. In the following, we investigate whether new speakers can be detected when a simple threshold θ_th is applied to the log-likelihood ratios of the speaker specific codebooks and the standard codebook. If no log-likelihood ratio exceeds this threshold,

L_i − L_0 < θ_th,   ∀i,    (16)

an unknown speaker is detected. In the experiments carried out, the performance of the joint speaker identification and speech recognition in detecting unknown speakers was evaluated and compared to a standard technique based on GMMs purely optimized to represent speaker characteristics.
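A sketch of this score and decision rule (Eqs. (15) and (16)) is given below; the Gaussian evaluation relies on SciPy, and the codebook data layout (per-speaker mean vectors with the shared covariances of the standard codebook) is an assumption of the example.

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def codebook_score(features, means, covariances):
    """Length-normalized log-likelihood of one codebook, Eq. (15).

    features:    (T, D) feature vectors
    means:       (N, D) codebook mean vectors (speaker specific)
    covariances: (N, D, D) shared covariance matrices of the standard codebook
    """
    # (T, N) matrix of per-frame, per-density log-likelihoods.
    log_dens = np.stack([multivariate_normal.logpdf(features, m, c)
                         for m, c in zip(means, covariances)], axis=1)
    return logsumexp(log_dens, axis=1).mean()   # average over frames

def is_unknown_speaker(features, speaker_codebooks, standard_codebook, theta):
    """Unknown-speaker decision of Eq. (16): no ratio exceeds the threshold."""
    l0 = codebook_score(features, *standard_codebook)
    ratios = [codebook_score(features, *cb) - l0 for cb in speaker_codebooks]
    return max(ratios) < theta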

6 Evaluation

We conducted several experiments to investigate how accurately unknown speakers are detected by our unified approach. In addition, we examined a reference implementation representing a standard speaker identification technique.

6.1 Reference Implementation

For the standard approach a UBM-GMM with diagonal covariance matrices was trained by the Expectation Maximization (EM) algorithm. About 3.5 h of speech data originating from 41 female and 36 male speakers of our USKCP1 development database was incorporated into the UBM training. Mean-normalized 11 MFCCs and delta-features were extracted.

1 The USKCP is an internal speech database for in-car applications. The USKCP comprises command and control utterances such as navigation commands, spelling and digit loops. The language is US-English.


Speaker specific GMMs are initialized and continuously adapted by MAP adaptation of the mean vectors [5]. We tested several implementations concerning the number of component densities 32 ≤ N ≤ 256 and tuning parameters 4 ≤ η ≤ 20 for adaptation. The best results for speaker identification were obtained for η = 4, as shown in [3].

6.2 Database

For the evaluation of both techniques we employed a subset of the SPEECON [14] database. This subset comprises 50 male and 23 female speakers recorded in an automotive environment. The sampling rate is 11,025 Hz. The language is US-English. Colloquial utterances with more than 4 words and mispronunciations were discarded. Digit and spelling loops were kept.

6.3 Results

First, the joint speaker identification and speech recognition was evaluated. The detection of unknown speakers is realized by the threshold decision given in (16). We employed λ = 4 in speaker adaptation since we observed the best speaker identification results for closed-set scenarios. Several implementations of the combined adaptation with 4 ≤ λ ≤ 20 were compared [3].

For evaluation we use a two-stage technique. First, the best in-set speaker model, characterized by the highest likelihood, is identified. Then a threshold decision is used to test for an unknown speaker. The performance of the binary in-set / out-of-set classifier is evaluated by the Receiver Operating Characteristic (ROC): the detection rate is depicted versus the false alarm rate.

To evaluate the detection accuracy of a self-learning system, we defined several training stages given by the number of utterances N_A used for adaptation. Speaker models during the learning phase (N_A < 20), moderately trained codebooks and extensively trained models (N_A > 100) are investigated. Maladaptations with respect to the speaker identity are neglected. Confidence intervals are given by a gray shading.
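The ROC curves themselves can be obtained by sweeping the threshold over the per-utterance scores, as in the following sketch (the score arrays are assumed to be precomputed log-likelihood ratios for in-set and out-of-set test utterances):

import numpy as np

def roc_curve(in_set_scores, out_of_set_scores, num_points=200):
    """Detection rate versus false alarm rate for the out-of-set decision.

    A speaker is flagged as unknown when the best log-likelihood ratio falls
    below the threshold; higher thresholds detect more unknown speakers but
    also reject more enrolled (in-set) speakers.
    """
    scores = np.concatenate([in_set_scores, out_of_set_scores])
    thresholds = np.linspace(scores.min(), scores.max(), num_points)
    detection, false_alarm = [], []
    for th in thresholds:
        detection.append(np.mean(out_of_set_scores < th))   # unknown correctly flagged
        false_alarm.append(np.mean(in_set_scores < th))     # known speaker rejected
    return np.array(false_alarm), np.array(detection)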

The ROC curves in Fig. 3(a) and Fig. 3(b) show that the accuracy of open-set speaker identification seems to be highly dependent on the adaptation level of the statistical models and the number of enrolled speakers. Especially for speaker models trained on only a few utterances in an adverse environment, a global threshold does not seem feasible. This observation agrees with our former experiments [4]. Even for extensively trained codebooks, e.g. N_A ≥ 100, relatively high error rates can be observed.

The same experiment was repeated with the reference implementation. The results for N = 256 and η = 4 are exemplarily shown in Fig. 4(a) and Fig. 4(b). Unknown speakers are detected by a threshold decision similar to (4). However, log-likelihoods are normalized by the length of the current utterance to be robust against short commands. In summary, significantly worse detection rates are achieved compared to Fig. 3.


Fig. 3. Detection of unknown speakers for the unified system which integrates speaker identification and speech recognition, shown as ROC curves (detection rate versus false alarm rate) for the training stages N_A ≤ 20, 20 < N_A ≤ 50, 50 < N_A ≤ 100 and 100 < N_A ≤ 200. (a) 5 speakers are enrolled. (b) 10 speakers are enrolled.

Fig. 4. Detection of unknown speakers for the reference implementation with 256 Gaussian distributions, shown as ROC curves (detection rate versus false alarm rate) for the training stages N_A ≤ 20, 20 < N_A ≤ 50, 50 < N_A ≤ 100 and 100 < N_A ≤ 200. (a) 5 speakers are enrolled. (b) 10 speakers are enrolled.

Fig. 5. Comparison of speaker specific codebooks (solid line) and GMMs comprising 32, 64, 128 and 256 Gaussian densities for 100 < N_A ≤ 200, shown as ROC curves (detection rate versus false alarm rate). MAP adaptation with η = 4 is employed.


To compare the influence of the number of Gaussian densities on the detection accuracy, all implementations are shown in Fig. 5. Here, only extensively trained speaker models characterized by N_A > 100 are considered. Obviously, the detection accuracy of the reference implementations starts to settle for N > 64. The accuracy also seems to be significantly inferior to the codebook based approach.

7 Summary and Conclusion

The evaluation has shown that our unified speaker identification and speech recognition technique is able to detect unknown speakers. The unified approach produced significantly higher detection rates than the investigated reference implementations. However, the detection rates achieved do not yet allow a speech recognizer to be operated in a completely unsupervised manner. In summary, it seems to be difficult to detect new users from only one utterance, especially for short command and control utterances. It became evident that the training of each speaker model should be reflected in the in-set / out-of-set decision. A global threshold seems to be inadequate.

In the future, we will develop more sophisticated posterior probabilities representing the adaptation level of each speaker model. When series of utterances are used for speaker identification, a significant improvement for detecting unknown speakers and for speaker identification rates can be expected. Still, speaker identification and detecting unknown speakers will never be perfect. This presents a challenge for dialog developers. Dialog strategies will have to deal with ambiguous information about the user's identity and avoid erratic behavior. The dialog may have to wait for increased confidence in following utterances, or take the initiative in confirming the user's identity. When these challenges are met, however, more natural speech understanding systems are possible.

References

1. Rabiner, L., Juang, B.-H.: Fundamentals of Speech Recognition. Prentice-Hall, Englewood Cliffs (1993)

2. Zavaliagkos, G., Schwartz, R., McDonough, J., Makhoul, J.: Adaptation algorithms for large scale HMM recognizers. In: EUROSPEECH 1995, pp. 1131–1135 (1995)

3. Herbig, T., Gerl, F., Minker, W.: Evaluation of two approaches for speaker specific speech recognition. In: Second International Workshop on Spoken Dialogue Systems Technology, IWSDS 2010 (2010) (to appear)

4. Herbig, T., Gerl, F., Minker, W.: Fast adaptation of speech and speaker characteristics for enhanced speech recognition in adverse intelligent environments. In: The 6th International Conference on Intelligent Environments, IE-2010 (2010) (to appear)

5. Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 10(1-3), 19–41 (2000)

6. Reynolds, D.A.: Large population speaker identification using clean and telephone speech. IEEE Signal Processing Letters 2(3), 46–48 (1995)


7. Reynolds, D.A., Rose, R.C.: Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing 3(1), 72–83 (1995)

8. Fortuna, J., Sivakumaran, P., Ariyaeeinia, A., Malegaonkar, A.: Open set speaker identification using adapted Gaussian mixture models. In: INTERSPEECH 2005, pp. 1997–2000 (2005)

9. Angkititrakul, P., Hansen, J.H.L.: Discriminative in-set/out-of-set speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing 15(2), 498–508 (2007)

10. Schukat-Talamazzini, E.G.: Automatische Spracherkennung. Vieweg (1995) (in German)

11. Class, F., Kaltenmeier, A., Regel-Brietzmann, P.: Optimization of an HMM-based continuous speech recognizer. In: EUROSPEECH 1993, pp. 803–806 (1993)

12. Kuhn, R., Junqua, J.-C., Nguyen, P., Niedzielski, N.: Rapid speaker adaptation in eigenvoice space. IEEE Transactions on Speech and Audio Processing 8(6), 695–707 (2000)

13. Class, F., Haiber, U., Kaltenmeier, A.: Automatic detection of change in speaker in speaker adaptive speech recognition systems. US 2003/0187645 A1 (2003)

14. Iskra, D., Grosskopf, B., Marasek, K., van den Heuvel, H., Diehl, F., Kiessling, A.: Speecon - speech databases for consumer devices: Database specification and validation. In: Proceedings of the Third International Conference on Language Resources and Evaluation, LREC 2002, pp. 329–333 (2002)


Evaluation of Two Approaches for Speaker Specific Speech Recognition*

Tobias Herbig1,3, Franz Gerl2, and Wolfgang Minker3

1 Nuance Communications Aachen GmbH, Ulm, Germany
2 Harman/Becker Automotive Systems GmbH, Ulm, Germany

3 University of Ulm, Institute of Information Technology, Ulm, Germany

Abstract. In this paper we examine two approaches for the automatic personalization of speech controlled systems. Speech recognition may be significantly improved by continuous speaker adaptation if the speaker can be reliably tracked. We evaluate two approaches for speaker identification suitable to identify 5-10 recurring users even in adverse environments. Only a very limited amount of speaker specific data can be used for training. A standard speaker identification approach is extended by speaker specific speech recognition. Multiple recognitions of speaker identity and spoken text are avoided to reduce latencies and computational complexity. In comparison, the speech recognizer itself is used to decode spoken phrases and to identify the current speaker in a single step. The latter approach is advantageous for applications which have to be performed on embedded devices, e.g. speech controlled navigation in automobiles. Both approaches were evaluated on a subset of the SPEECON database which represents realistic command and control scenarios for in-car applications.

1 Introduction

During the last few decades steady progress in speech recognition and speaker identification has been achieved, leading to high recognition rates [1]. Complex speech controlled applications can now be realized.

Especially for in-car applications speech recognition may help to improve usability and security. The driver can be supported to safely participate in road traffic and to operate technical devices such as navigation systems or hands-free sets. However, the speech signal may be degraded by various background noises, e.g. varying engine, wind and tire noises, passing cars or babble noise. In addition, changing environments, speaker variability and natural language input may have a negative influence on the performance of speech recognition [2].

For a variety of practical applications, e.g. infotainment systems with speech recognition for navigation, telephony or music control, typically only 5-10 recurring speakers are expected to use the system. The benefits of a device that can identify the voices of its main users are obvious:

* This work was conducted when Tobias Herbig and Franz Gerl were affiliated with Harman/Becker. Tobias Herbig is now with Nuance Communications.



The dialog flow can be personalized to specific user habits. New ways for simplifying the interaction with the device can be suggested. Inexperienced speakers can be introduced to the system, for example.

Furthermore, speech recognition can be improved by adapting the statistical models of a speech recognizer on speaker specific data. A speech recognition engine offers a very detailed modeling of the acoustic feature space. Given a correct decoding of an utterance, there are techniques that enable adaptation on one single utterance, as found in [3], for example. Combining speech recognition and speaker identification offers the opportunity to keep long-term adaptation profiles. The reasoning that recognizing the utterance may help to achieve reasonable speaker identification rates after a short training period prompted us to do the work we report in this paper.

We have developed a speech controlled system which combines speaker identification, speech recognition and speaker adaptation. Different speakers can be reliably tracked after a short enrollment phase of only two command and control utterances. Fast information retrieval is realized by combining the strengths of two adaptation schemes [4]. In the learning phase only a few parameters have to be estimated to capture the most relevant speech and speaker characteristics. In the long run this adaptation scheme smoothly transits to an individual adjustment of each speaker profile. To meet the demands of an efficient implementation suitable for embedded devices, speaker identification has to be performed on-line. We employ the speech recognizer's detailed modeling of speech and speaker characteristics for a unified approach of speaker identification and speech recognition. A standard speech recognizer is extended to identify the current user and to decode the spoken phrase simultaneously.

Alternatively, speaker identification can be implemented by standard techniques known from the literature, e.g. Reynolds et al. [5]. To limit the computational overhead and latencies caused by reprocessing of spoken phrases, we combine speaker identification with our approach for on-line speaker profile selection.

In this paper, speech recognition and speaker identification are briefly introduced. Then we discuss our approach for integrated speaker identification and speech recognition combined with speaker adaptation. The architecture of a reference system combining standard techniques for speaker identification and speaker adaptation is explained. Finally, the evaluation results for realistic command and control applications in automobiles are presented. A summary and conclusion are given.

2 Automated Speech Recognition

We use Hidden Markov Models (HMMs) to represent both the static and dynamic speech characteristics. The Markov models represent the speech dynamics. The emission probability density function is modeled by Gaussian Mixture Models (GMMs). The probability density function of GMMs


p(x_t | Θ) = \sum_{k=1}^{N} w_k · N{ x_t | μ_k, Σ_k }    (1)

comprises a convex combination of N multivariate Gaussian densities which are denoted by the index k. x_t represents the feature vector at time instance t. GMMs are defined by their parameter sets Θ which contain the weights w_k, mean vectors μ_k and covariance matrices Σ_k. For speech recognition, we use so-called Semi-Continuous HMMs (SCHMMs):

p_SCHMM(x_t) = \sum_{s_t=1}^{M} \sum_{k=1}^{N} w_k^{s_t} · N{ x_t | μ_k, Σ_k } · p(s_t),    (2)

as can be found by Schukat-Talamazzini [6]. All states s_t of an SCHMM share the mean vectors and covariances of one GMM and only differ in their weighting factors. M denotes the number of states. For convenience, we omit the parameter set Θ_SCHMM comprising the initial state probabilities, state transitions and the GMM parameters.

The basic setup of our speech recognizer is shown in Fig. 1.

Fig. 1. Block diagram of a speech recognizer based on SCHMMs (front-end → x_t → codebook → q_t → decoder)

Noise reduction is calculated by a standard Wiener filter and 11 Mel Frequency Cepstral Coefficients (MFCCs) are extracted. The 0th coefficient is substituted by a normalized energy. Cepstral mean subtraction and a Linear Discriminant Analysis (LDA) are applied to obtain a compact representation which is robust against environmental influences. We use windows of 9 frames of MFCCs, where the dynamics of delta and delta-delta coefficients have been incorporated into the LDA using a bootstrap training [7].

Each feature vector x_t is compared with a speaker independent codebook, subsequently called standard codebook. The standard codebook consists of about 1000 multivariate Gaussian densities defined by the parameter set

Θ_0 = { w_1^0, ..., w_N^0, μ_1^0, ..., μ_N^0, Σ_1^0, ..., Σ_N^0 }.    (3)

The soft quantization

\[ \mathbf{q}_t^0 = \left( p(\mathbf{x}_t \mid k=1, \Theta^0), \ldots, p(\mathbf{x}_t \mid k=N, \Theta^0) \right) \tag{4} \]

contains the likelihood scores of all Gaussian densities. The soft quantization is employed for speech decoding.
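A minimal sketch of Eq. (4), assuming numpy and scipy; the codebook size and the example data are made up for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def soft_quantization(x_t, means, covs):
    """Eq. (4): likelihood of one frame x_t under every codebook Gaussian."""
    return np.array([multivariate_normal.pdf(x_t, mean=m, cov=c)
                     for m, c in zip(means, covs)])

# toy "standard codebook" with N = 4 densities in D = 2 dimensions
rng = np.random.default_rng(0)
means = rng.normal(size=(4, 2))
covs = np.stack([np.eye(2)] * 4)
q_t = soft_quantization(rng.normal(size=2), means, covs)
print(q_t)  # N likelihood scores that would be forwarded to the decoder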


The speech decoder comprises the acoustic models, lexicon and language model. The acoustic models are realized by Markov chains. The lexicon represents the corpus of all word strings to be recognized. The prior probabilities of word sequences are given by the language model [6]. The Viterbi algorithm is used to determine the most likely word string.

3 Speaker Identification

Speaker variability can be modeled by GMMs which have emerged as the dominating generative statistical model in speaker identification [5].

For each speaker one GMM can be trained on speaker specific data using the EM-algorithm. Alternatively, a speaker independent GMM, the so-called Universal Background Model (UBM), can be trained for a large group of speakers. Speaker specific GMMs can be obtained by speaker adaptation [5].

For testing, independently and identically distributed (iid) feature vectors are assumed by neglecting temporal statistical dependencies. Log-likelihood computation can then be realized by a sum of logarithms

\[ \log\left( p(\mathbf{x}_{1:T} \mid \Theta^i) \right) = \sum_{t=1}^{T} \log\left( \sum_{k=1}^{N} w_k^i \cdot \mathcal{N}\{\mathbf{x}_t \mid \boldsymbol{\mu}_k^i, \boldsymbol{\Sigma}_k^i\} \right) \tag{5} \]

where x_{1:T} = {x_1, ..., x_t, ..., x_T} represents a sequence of feature vectors, e.g. mean-normalized MFCCs. i denotes the speaker index. Subsequently, the speaker with the highest log-likelihood score is identified as the current speaker

\[ i_{\mathrm{ML}} = \operatorname*{argmax}_{i} \left\{ \log\left( p(\mathbf{x}_{1:T} \mid \Theta^i) \right) \right\} \tag{6} \]

according to the Maximum Likelihood (ML) criterion.
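A sketch of Eqs. (5) and (6) in Python, assuming numpy/scipy; the dictionary layout of the enrolled speaker models is a made-up convention for illustration.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, weights, means, covs):
    """Eq. (5): total log-likelihood of the frame sequence X (T, D) under one GMM."""
    log_dens = np.column_stack([multivariate_normal.logpdf(X, mean=m, cov=c)
                                for m, c in zip(means, covs)])   # (T, N)
    return float(logsumexp(log_dens + np.log(weights), axis=1).sum())

def identify_speaker(X, speaker_models):
    """Eq. (6): ML decision over the enrolled speakers, given models stored as
    {speaker_index: (weights, means, covs)}."""
    scores = {i: gmm_log_likelihood(X, *model) for i, model in speaker_models.items()}
    return max(scores, key=scores.get)
```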

4 Joint Speaker Identification and Speech Recognition

Speech recognition can be significantly improved when codebooks are adapted to specific speakers. Speaker specific codebooks can be considered as common GMMs representing the speaker’s pronunciation characteristics. Class et al. [8] and the results in [4] give evidence that speaker specific codebooks can be employed to track different speakers.

To avoid latencies and computational overhead caused by multiple recognitions of the spoken phrase and speaker identity, we employ speech recognition and speaker identification simultaneously. The basic architecture of our speech controlled system is depicted in Fig. 2.

In the front-end standard Wiener filtering is employed for speech enhancement to reduce background noises. MFCC features are extracted to be used for both speech recognition and speaker identification. For each speaker energy normalization and cepstral mean subtraction are continuously adjusted starting from initial values.


Fig. 2. System architecture for joint speaker identification and speech recognition comprising two stages. Part I and II denote the speaker specific speech recognition and speaker identification, respectively. The latter controls speaker specific feature vector normalization. Speaker adaptation is employed to enhance speaker identification and speech recognition. Codebooks are initialized in the case of an unknown speaker. The statistical modeling of speaker characteristics is continuously improved.

For speech recognition appropriate speaker specific codebooks are selected on a frame level. N_Sp speaker specific codebooks are operated in parallel to the standard codebook. The posterior probability p(i_t|x_{1:t}) is estimated for each speaker i given the history of observations x_{1:t}:

\[ p(i_t \mid \mathbf{x}_{1:t}) \propto p(\mathbf{x}_t \mid i_t) \cdot p(i_t \mid \mathbf{x}_{1:t-1}), \quad i_t = 0, 1, \ldots, N_{\mathrm{Sp}} \tag{7} \]

\[ p(i_t \mid \mathbf{x}_{1:t-1}) = \sum_{i_{t-1}} p(i_t \mid i_{t-1}) \cdot p(i_{t-1} \mid \mathbf{x}_{1:t-1}) \tag{8} \]

\[ p(i_1 \mid \mathbf{x}_1) \propto p(\mathbf{x}_1 \mid i_1) \cdot p(i_1). \tag{9} \]

The codebook i_t^MAP characterized by the highest posterior probability is selected according to the Maximum A Posteriori (MAP) criterion. Only the corresponding q_t^i is forwarded to the speech decoder to generate a transcription of the spoken phrase. The corresponding state alignment is used for codebook adaptation.
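A small sketch of the recursion in Eqs. (7)-(9), assuming numpy; the speaker-change probabilities and the frame likelihoods below are invented numbers used only to show the update.

```python
import numpy as np

def update_codebook_posterior(prior, frame_likelihoods, transition):
    """One recursion step of Eqs. (7)-(8).

    prior:             p(i_{t-1} | x_{1:t-1}),           shape (N_sp + 1,)
    frame_likelihoods: p(x_t | i_t) per codebook,        shape (N_sp + 1,)
    transition:        p(i_t | i_{t-1}), row-stochastic, shape (N_sp + 1, N_sp + 1)
    """
    predicted = transition.T @ prior              # Eq. (8)
    posterior = frame_likelihoods * predicted     # Eq. (7), unnormalized
    return posterior / posterior.sum()

# standard codebook (index 0) plus two speaker specific codebooks
stay = 0.98
T = np.full((3, 3), (1 - stay) / 2) + np.eye(3) * (stay - (1 - stay) / 2)
posterior = np.ones(3) / 3                        # Eq. (9) with a uniform prior
for frame_lik in ([0.2, 1.4, 0.3], [0.1, 1.1, 0.4]):
    posterior = update_codebook_posterior(posterior, np.asarray(frame_lik), T)
selected = int(np.argmax(posterior))              # MAP codebook whose q_t feeds the decoder
```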

For speaker identification codebooks are considered as GMMs with equal weighting factors to avoid the requirement of a state alignment. We calculate the log-likelihood per frame

\[ L_i = \frac{1}{T} \sum_{t=1}^{T} \log\left( \sum_{k=1}^{N} \mathcal{N}\{\mathbf{x}_t \mid \boldsymbol{\mu}_k^i, \boldsymbol{\Sigma}_k^0\} \right) \tag{10} \]

to identify the most likely speaker i_ML according to the ML criterion. The speech recognition result is employed to exclude speech pauses and garbage words by buffering the likelihood scores until a precise segmentation can be given. The speaker identification result is used to adapt the corresponding codebook and to control feature extraction.


We use speaker adaptation to initialize and continuously adapt speaker specific codebooks based on recognized utterances. We only use the sufficient statistics of the standard codebook for reasons of computational efficiency. Due to limited adaptation data the number of available parameters has to be balanced with the amount of speaker specific data.

Eigenvoice (EV) adaptation is suitable when facing few data since only some 10 parameters α_l have to be estimated [3]. Mean vector adaptation can be implemented by a weighted sum of the eigenvoices e_{k,l}^EV and an offset, e.g. the original speaker independent mean vector μ_k^0:

\[ \boldsymbol{\mu}_k^{\mathrm{EV}} = \boldsymbol{\mu}_k^0 + \sum_{l=1}^{L} \alpha_l \cdot \mathbf{e}_{k,l}^{\mathrm{EV}}. \tag{11} \]

Principal Component Analysis (PCA) can be applied to extract the eigenvoices. MAP adaptation allows individual adjustment of each Gaussian density when sufficient data is available [9, 5]. On extensive data the MAP adaptation approaches the Maximum Likelihood (ML) estimates

\[ \boldsymbol{\mu}_k^{\mathrm{ML}} = \frac{1}{n_k} \sum_{t=1}^{T} p(k \mid \mathbf{x}_t, \Theta^0) \cdot \mathbf{x}_t \tag{12} \]

where n_k = Σ_{t=1}^{T} p(k | x_t, Θ^0) denotes the number of softly assigned feature vectors. Therefore, we use a simple combination

\[ \boldsymbol{\mu}_k^{\mathrm{opt}} = (1 - \alpha_k) \cdot \boldsymbol{\mu}_k^{\mathrm{EV}} + \alpha_k \cdot \boldsymbol{\mu}_k^{\mathrm{ML}} \tag{13} \]

\[ \alpha_k = \frac{n_k}{n_k + \lambda}, \quad \lambda = \mathrm{const} \tag{14} \]

to efficiently adjust the mean vectors of codebooks [4]. Covariance matrices are not modified. For convenience, the state alignment is omitted in our notation.
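A compact sketch of Eqs. (12)-(14) in Python, assuming numpy; the argument names and the default value of λ are illustrative, and the soft assignments γ are taken to be precomputed from the standard codebook.

```python
import numpy as np

def adapt_codebook_means(X, gamma, mu_ev, lam=12.0):
    """Interpolate eigenvoice and ML mean estimates, Eqs. (12)-(14).

    X:      (T, D) adaptation frames attributed to the speaker
    gamma:  (T, N) soft assignments p(k | x_t, Theta_0)
    mu_ev:  (N, D) eigenvoice-adapted means from Eq. (11)
    lam:    interpolation constant lambda (illustrative default)
    """
    n_k = gamma.sum(axis=0)                                    # softly assigned counts
    mu_ml = (gamma.T @ X) / np.maximum(n_k, 1e-10)[:, None]    # Eq. (12)
    alpha = (n_k / (n_k + lam))[:, None]                       # Eq. (14)
    return (1.0 - alpha) * mu_ev + alpha * mu_ml               # Eq. (13)
```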

5 Independent Speaker Identification and Speech Recognition

In the preceding section a unified approach for speaker identification and speech recognition was introduced. Alternatively, a standard technique for speaker identification based on GMMs purely optimized to capture speaker characteristics can be employed. In combination with a speech recognizer where several speaker profiles are operated in parallel, a reference implementation can be easily obtained. The corresponding setup depicted in Fig. 3 can be summarized as follows:

– Front-end. The recorded speech signal is preprocessed to reduce background noises. The feature vectors comprise 11 mean normalized MFCCs and delta features. The 0th coefficient is replaced by a normalized energy.


Fig. 3. System architecture for parallel speaker identification and speaker specific speech recognition (front-end, GMM-based speaker identification, codebook-based speech recognition, and separate adaptation of both models). Codebook selection is implemented as discussed before. Speaker identification is realized by additional GMMs.

– Speech recognition. Appropriate speaker specific codebooks are selected for the decoding of the recorded utterance as discussed before.

– Speaker identification. Subsequently, common GMMs purely representing speaker characteristics are used to identify the current user. A speaker independent UBM with diagonal covariance matrices is used as template for speaker specific GMMs [5]. The UBM was trained by the EM algorithm. About 3.5 h speech data originating from 41 female and 36 male speakers of the USKCP¹ database was incorporated into the UBM training. For each speaker about 100 command and control utterances, e.g. navigation commands, spelling and digit loops, were used. For testing the ML criterion is applied to identify the current user. The codebooks of the speech recognizer and the GMMs are adapted according to this estimate.

– Speaker adaptation. GMM models and the speaker specific codebooks of the identified speakers are continuously adjusted. Codebook adaptation is implemented as discussed before. GMMs are adjusted by MAP adaptation as found by Reynolds et al. [5]. However, we only use the sufficient statistics of the UBM:

\[ \boldsymbol{\mu}_k^{\mathrm{MAP}} = (1 - \alpha_k) \cdot \boldsymbol{\mu}_k^{\mathrm{UBM}} + \alpha_k \cdot \boldsymbol{\mu}_k^{\mathrm{ML}} \tag{15} \]

\[ \alpha_k = \frac{n_k}{n_k + \eta}, \quad \eta = \mathrm{const} \tag{16} \]

\[ n_k = \sum_{t=1}^{T} p(k \mid \mathbf{x}_t, \Theta^{\mathrm{UBM}}). \tag{17} \]

Adaptation accuracy is supported here by the moderate complexity of the applied GMMs.

¹ The USKCP is an internal speech database for in-car applications which was collected by TEMIC Speech Dialog Systems, Ulm, Germany. The language is US-English.


6 Experiments

Several experiments were conducted for both implementations to investigate speaker identification accuracy and the benefit for speech recognition.

6.1 Database

For the evaluation we employed a subset of the US-SPEECON [10] database. This subset comprises 50 male and 23 female speakers recorded in an automotive environment. The sampling rate is 11,025 Hz. Colloquial utterances with more than four words and mispronunciations were discarded whereas digit and spelling loops were kept.

6.2 Evaluation

The evaluation was performed on 60 sets of five enrolled speakers which were selected randomly. From one utterance to the next the probability of a speaker change is approximately 10 %. In the learning phase of each set 10 utterances are employed for unsupervised initialization of each speaker model. Only the first two utterances of a new speaker are indicated and then the current speaker has to be identified in a completely unsupervised manner. Then the speakers appear randomly. At least five utterances are spoken between two speaker turns. Both the Word Accuracy (WA) and identification rate are examined.

The speech recognizer without any speaker adaptation is used as baseline. Short-term adaptation is implemented by an EV approach which applies an exponential weighting window to the adaptation data. This decay guarantees that speaker changes are captured within approximately five or six utterances if no speaker identification is employed.

The speech recognizer applies grammars for digit and spelling loops, dedicated numbers and a grammar which contains all remaining utterances.

6.3 Results for Joint Speaker Identification and Speech Recognition

First, the joint speaker identification and speech recognition is examined for specific values of λ employed in speaker adaptation.

The results are given in Table 1. They show a significant improvement of the WA with respect to both the baseline and short-term adaptation. The two special cases ML (λ ≈ 0) and EV (λ → ∞) clearly fall behind the combination of both adaptation techniques. MAP adaptation with speaker independent prior parameters is not able to track different speakers in our scenario for η ≥ 8. Furthermore, no notable difference in WA can be observed for 4 ≤ λ ≤ 20. Thus, speaker identification can be optimized independently of the speech recognizer and seems to reach an optimum of 94.64 % for λ = 4. For higher values the identification rates drop significantly.


Table 1. Comparison of different adaptation techniques for joint speaker identification and speech recognition

Speaker adaptation          WA [%]   Speaker ID [%]
Baseline                    85.23    -
Short-term adaptation       86.13    -
Combined adaptation
  ML (λ ≈ 0)                86.89    81.54
  λ = 4                     88.10    94.64
  λ = 8                     88.17    93.49
  λ = 12                    88.16    92.42
  λ = 16                    88.18    92.26
  λ = 20                    88.20    91.68
  EV (λ → ∞)                87.51    84.71
MAP adaptation
  η = 4                     87.47    87.43
  η = 8                     85.97    21.17

6.4 Results for Independent Speaker Identification and Speech Recognition

In comparison to the unified approach the same experiments were repeated with the reference system characterized by separate modules for speaker identification and speech recognition.

In Table 2 the results of this scenario are presented for several implementations with respect to the number of Gaussian distributions and values of parameter η. Both the speaker identification and speech recognition rate reach an optimum for η = 4 and N = 64 or 128. For higher values of η this optimum is shifted towards a lower number of Gaussian distributions as expected. Since the learning rate of the adaptation algorithm is reduced, only a smaller number of distributions can be efficiently estimated at the beginning. The performance of the speech recognizer is marginally reduced with higher η.

Table 2. Realization of parallel speaker identification and speech recognition. Speaker identification is implemented by several GMMs comprising 32, 64, 128 and 256 Gaussian distributions. MAP adaptation of mean vectors is used. Codebook adaptation uses λ = 12.

MAP     η = 4              η = 8              η = 12             η = 20
N       WA [%]   ID [%]    WA [%]   ID [%]    WA [%]   ID [%]    WA [%]   ID [%]
32      88.01    88.64     88.06    88.17     87.98    87.29     87.97    87.50
64      88.13    91.09     88.06    89.64     87.98    87.92     87.92    85.30
128     88.04    91.18     87.94    87.68     87.87    84.97     87.82    80.09
256     87.92    87.96     87.97    85.59     87.90    81.20     87.73    76.48

In the next experiment not only mean vectors but also weights are modified by the MAP adaptation. The results are summarized in Table 3.


Table 3. Realization of parallel speaker identification and speech recognition. Speaker identification is implemented by several GMMs comprising 32, 64, 128 or 256 Gaussian distributions. MAP adaptation of weights and mean vectors is used. Codebook adaptation uses λ = 12.

MAP     η = 4              η = 8              η = 12             η = 20
N       WA [%]   ID [%]    WA [%]   ID [%]    WA [%]   ID [%]    WA [%]   ID [%]
32      87.92    87.24     87.97    88.24     87.97    87.61     88.02    87.04
64      88.11    90.59     88.06    89.99     88.03    88.80     87.93    86.64
128     88.11    91.32     88.03    89.42     88.03    88.10     87.91    84.26
256     88.10    91.62     87.97    88.71     88.02    86.01     87.88    82.88

In the preceding experiment the speaker identification accuracy could be improved for η = 4 by increasing the number of Gaussian distributions to N = 128. For N = 256 the identification rate dropped significantly.

Now a steady improvement and an optimum of 91.62 % can be observed for N = 256. However, the identification rate approaches a limit. For η = 4 doubling the number of Gaussian distributions from 32 to 64 results in 26 % relative improvement of the error rate whereas the relative improvement achieved by the increase from 128 to 256 Gaussian distributions is about 3.5 %. The optimum for speech recognition is again about 88.1 % WA.

Finally, the comparison with the combined approach characterized by an integrated speaker identification is shown in Fig. 4 and Fig. 5. Mean vector and weight adaptation are depicted for η = 4 representing the best speech recognition and speaker identification rates in our experiments. Furthermore, the upper bound for speaker specific speech recognition is shown. There the speaker is known when codebook adaptation is performed [4].

Fig. 4. Comparison of the reference implementation (left) and the joint speaker identification and speech recognition (right) with respect to speech recognition. (a) Speech recognition realized by the reference implementation; MAP adaptation (η = 4) of mean vectors and weights (black) and only mean vectors (dark gray) are depicted. (b) Speech recognition implemented by the unified approach; results are shown for speaker adaptation with predefined speaker identity (black) [4] as well as for joint speaker identification and speech recognition (dark gray); the speaker independent baseline (BL) and short-term adaptation (ST) are shown for comparison.

Fig. 5. Comparison of the reference implementation (left) and the joint speaker identification and speech recognition (right) with respect to speaker identification. (a) Speaker identification rates of the reference implementation; MAP adaptation (η = 4) of mean vectors and weights (black) and only mean vectors (dark gray) are depicted. (b) Speaker identification rates of the joint speaker identification and speech recognition.

7 Summary and Conclusion

Two approaches have been developed to solve the problem of an unsupervised system comprising self-learning speaker identification and speaker specific speech recognition.

Speaker identification and speech recognition use an identical front-end so that a parallel feature extraction for speech recognition and speaker identification is avoided. Speaker specific speech recognition is realized by an on-line codebook selection. On an utterance level the speaker identity is estimated in parallel to speech recognition. Multiple recognitions are not required. A speech recognizer is enabled to create and continuously adapt speaker specific codebooks which allow a higher recognition accuracy in the long run.

A speaker identification rate of 94.64 % and a WA of 88.20 % were achieved by the unified approach for λ = 4 and λ = 20, respectively. The results for the baseline and the corresponding upper bound were 85.23 % and 88.90 % WA [4]. In the latter case it was assumed that the speaker identity is known.

For the reference system, several GMMs are required for speaker identification in addition to the HMMs of the speech recognizer. Complexity therefore increases since both models have to be evaluated and adapted. An optimum of 91.18 % speaker identification rate was achieved for 128 Gaussian distributions and η = 4 when only the mean vectors were adapted. The best speech recognition result of 88.13 % WA was obtained for 64 Gaussian distributions. By adapting both the mean vectors and weights, the speaker identification rate could be increased to 91.62 % for 256 Gaussian distributions and η = 4. The WA remained at the same level.

For both implementations similar results for speech recognition were achieved even though the identification rates of the reference were significantly worse. This observation supports the finding that the speech recognition accuracy is relatively insensitive with respect to moderate error rates of speaker identification. Thus, different strategies can be applied to identify speakers without affecting the performance of the speech recognizer as long as appropriate codebooks are selected for speech decoding.

However, the unified approach seems to be advantageous when unknown speakers have to be detected as shown in [11]. Therefore, we propose to employ the unified approach to implement a speech controlled system which is operated in a completely unsupervised manner.

References

1. Furui, S.: Selected topics from 40 years of research in speech and speaker recognition. In: INTERSPEECH 2009, pp. 1–8 (2009)

2. Junqua, J.-C.: Robust Speech Recognition in Embedded Systems and PC Applications. Kluwer Academic Publishers, Dordrecht (2000)

3. Kuhn, R., Junqua, J.-C., Nguyen, P., Niedzielski, N.: Rapid speaker adaptation in eigenvoice space. IEEE Transactions on Speech and Audio Processing 8(6), 695–707 (2000)

4. Herbig, T., Gerl, F., Minker, W.: Fast adaptation of speech and speaker characteristics for enhanced speech recognition in adverse intelligent environments. In: The 6th International Conference on Intelligent Environments, IE 2010 (2010) (to appear)

5. Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 10(1-3), 19–41 (2000)

6. Schukat-Talamazzini, E.G.: Automatische Spracherkennung. Vieweg (1995) (in German)

7. Class, F., Kaltenmeier, A., Regel-Brietzmann, P.: Optimization of an HMM-based continuous speech recognizer. In: EUROSPEECH 1993, pp. 803–806 (1993)

8. Class, F., Haiber, U., Kaltenmeier, A.: Automatic detection of change in speaker in speaker adaptive speech recognition systems. US 2003/0187645 A1 (2003)

9. Gauvain, J.-L., Lee, C.-H.: Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing 2(2), 291–298 (1994)

10. Iskra, D., Grosskopf, B., Marasek, K., van den Heuvel, H., Diehl, F., Kiessling, A.: Speecon – speech databases for consumer devices: Database specification and validation. In: Proceedings of the Third International Conference on Language Resources and Evaluation, LREC 2002, pp. 329–333 (2002)

11. Herbig, T., Gerl, F., Minker, W.: Detection of unknown speakers in an unsupervised speech controlled system. In: Second International Workshop on Spoken Dialogue Systems Technology, IWSDS 2010 (2010) (to appear)


Issues in Predicting User Satisfaction Transitions in Dialogues: Individual Differences, Evaluation Criteria, and Prediction Models

Ryuichiro Higashinaka1, Yasuhiro Minami2, Kohji Dohsaka2, and Toyomi Meguro2

1 NTT Cyber Space Laboratories, NTT Corporation, 1-1 Hikarinooka, Yokosuka, 239-0847 Kanagawa, Japan
2 NTT Communication Science Laboratories, NTT Corporation, 2-4 Hikaridai, Seika-cho, Soraku-gun, 619-0237 Kyoto, Japan

[email protected],{minami,dohsaka,meguro}@cslab.kecl.ntt.co.jp

Abstract. This paper addresses three important issues in automatic prediction of user satisfaction transitions in dialogues. The first issue concerns the individual differences in user satisfaction ratings and how they affect the possibility of creating a user-independent prediction model. The second issue concerns how to determine appropriate evaluation criteria for predicting user satisfaction transitions. The third issue concerns how to train suitable prediction models. We present our findings for these issues on the basis of the experimental results using dialogue data in two domains.

1 Introduction

Although predicting the overall quality of dialogues has been actively studied [7,12,13], only recently has the work begun on ways to automatically predict user satisfaction transitions during a dialogue [2]. Predicting such transitions would be useful when we want to perform an in-depth turn-by-turn analysis of the performance of a dialogue system, and also when we want to pinpoint situations where the dialogue quality begins to degrade or improve, the discovery of which could be used to improve dialogue systems as well as to assist human operators at contact centers for improving customer satisfaction [9,11].

Since the work on automatic prediction of user satisfaction transitions is still in a preliminary phase, there are a number of issues that need to be clarified. This paper addresses three such issues and presents our findings based on experimental results.

The first issue concerns the individual differences of user satisfaction ratings. In any work that deals with predicting user satisfaction, it is important to determine whether we should aim at creating user-independent or user-dependent prediction models. We investigate how user satisfaction ratings of individuals differ on the basis of correlations and distributions of ratings and discuss the feasibility of creating a user-independent prediction model. The second issue concerns the evaluation criteria for the prediction of user satisfaction transitions. In any engineering work, it is necessary to establish an evaluation measure. Previous work has used the mean squared error (MSE) of rating probabilities [2]; however, the MSE has a serious limitation: the dialogue has to follow a predefined scenario. We consider the MSE to be too restrictive for common use. In this paper, we propose several candidates for evaluation criteria and discuss which criteria should be used. The third issue concerns how to train suitable prediction models. In previous work, hidden Markov models (HMMs) have been used [2]. However, HMMs may not offer the best solution. Recent studies on sequential labeling have shown that conditional random fields (CRFs) [5] provide the state-of-the-art performance in many NLP tasks, such as chunking and named entity recognition [10]. In addition, HMMs are generative models, whereas CRFs are discriminative ones. In this paper, we compare HMMs and CRFs to investigate which kind of model is more appropriate for the task of predicting user satisfaction transitions.

Table 1. Dialogue statistics in the AD and AL domains. Avg and SD denote the average number and the standard deviation of dialogue-acts within a dialogue. Since an utterance can contain multiple dialogue-acts, the number of dialogue-acts is always larger than that of utterances.

AD Domain: 90 dialogues
            # Utterances   # Dialogue-acts   Avg     SD
All         5180           5340              59.33   17.54
User        1890           2050              22.78   6.60
System      3290           3290              36.56   11.81

AL Domain: 100 dialogues
            # Utterances   # Dialogue-acts   Avg     SD
All         3951           4650              46.50   8.99
Speaker     2103           2453              24.53   5.69
Listener    1848           2197              21.97   5.25

The next section describes the dialogue data we use in detail. Section 3 describes the individual differences in user satisfaction ratings between human judges. Section 4 presents our candidates for the evaluation criteria and Section 5 describes our experiments for comparing the prediction performance of HMMs and CRFs. Section 6 summarizes the paper and mentions future work.

2 Data Collection

We collected dialogue data in two domains: the animal discussion (AD) domain and the attentive listening (AL) domain. All dialogues are in Japanese. In both domains, the data were text dialogues. We did not use spoken dialogue data because we wanted to avoid particular problems of speech, such as filled pauses and overlaps, although we plan to deal with spoken dialogue in the future. The dialogues in the AD domain are human-machine dialogues and those in the AL domain are human-human dialogues; hence, we cover both cases of human-machine and human-human dialogues. In addition, neither domain has specific tasks/scenarios, meaning that our setting is more general than that in the previous work [2], where the course of a dialogue was strictly controlled by using scenarios.

2.1 Animal Discussion Domain

In the AD domain, the system and user talk about likes and dislikes about animals via a text chat interface. The data consist of 1000 dialogues between a dialogue system and 50 human users. Each user conversed with the system 20 times, including two example dialogues at the beginning. All user/system utterances have been annotated with dialogue-acts. There are 29 dialogue-act types, including those related to self-disclosure, question, response, and greetings. For example, a dialogue-act DISC-P denotes one’s self-disclosure about a proposition (whether one likes/dislikes a certain animal) and DISC-R denotes one’s self-disclosure of a reason for a proposition (see [3] for the description of dialogue-acts and sample dialogues).

From the data of the initial ten users sorted by user ID, we randomly extracted nine dialogues per user to form a subset of 90 dialogues (see Table 1 for the statistics). Then, two independent annotators (hereafter, AD-annot1 and AD-annot2), who were not the authors, labeled them with utterance-level user satisfaction ratings. More specifically, they provided three different user satisfaction ratings related to “Smoothness of the conversation”, “Closeness perceived by the user towards the system”, and “Willingness to continue the conversation”.

The ratings ranged from 1 to 7, where 1 is the worst and 7 the best. Before actual annotation, the annotators took part in a tutorial session so that their standards for rating could be firmly established. The annotators carefully read each utterance and gave a rating after each system utterance according to how they would have felt after receiving each system utterance if they had been the user in the dialogue. To make the situation more realistic, they were not allowed to look down at the dialogue after the current utterance. At the beginning of a dialogue, the ratings always started from four (neutral). We obtained 3290 ratings for 3290 system utterances (cf. Table 1) from each annotator. In this work, we had third persons (not the actual participants of the conversations) judge user satisfaction for the sake of reliability and consistency.

2.2 Attentive Listening Domain

In the AL domain, a listener attentively listens to the other in order to satisfy the speaker’s desire to speak and make himself/herself heard. Figure 1 shows an excerpt of a listening-oriented dialogue together with utterance-level user satisfaction ratings (see [6] for details of this domain).

We collected such listening-oriented dialogues using a website where users taking the roles of listeners and speakers were matched up to have conversations. A conversation was done through a text-chat interface. The participants were instructed to end the conversation approximately after ten minutes. Within a three-week period, each of the 37 speakers had about two conversations a day with each of the ten listeners, resulting in our collecting 1260 listening-oriented dialogues. All dialogues were annotated with dialogue-acts. There were 46 dialogue-act types in this domain. Although we cannot describe the full details of our dialogue-acts for lack of space, we have dialogue-acts DISC-EVAL-POS for one’s self-disclosure of his/her positive evaluation towards a certain entity, DISC-HABIT for one’s self-disclosure of his/her habit, and INFO for delivery of objective information. Then, we made a subset of the data by randomly selecting ten dialogues for each of the ten listeners to obtain 100 dialogues for annotating user satisfaction ratings (see Table 1 for the statistics).

Two independent annotators (hereafter, AL-annot1 and AL-annot2), who were not the authors or annotators for the AD domain, provided utterance-level ratings after all listeners’ utterances to express how they would have felt after receiving the listeners’ utterances. After a tutorial session, the annotators gave three ratings as in the AD domain; namely, smoothness, closeness, and “good listener”. Instead of willingness, we have a “good listener” criterion here asking for how good the annotator thinks the listener is from the viewpoint of attentive listening; for example, how well the listener is making it easy for the speaker to speak. All ratings ranged from 1 to 7. We obtained 1848 ratings for 1848 listener utterances (cf. Table 1) from each annotator.

  Utterance (dialogue-acts)                                                Sm  Cl  GL
  LIS  You know, in spring, Japanese food tastes delicious.
       (DA: DISC-EVAL-POS)                                                 5   5   5
  SPK  This time every year, I make a plan to go on a healthy diet.
       But . . . (DA: DISC-HABIT)
  LIS  Uh-huh (DA: ACK)                                                    6   5   6
  SPK  The temperature goes up suddenly! (DA: INFO)
  SPK  It's always too late! (DA: DISC-EVAL-NEG)
  LIS  Clothing worn gets less and less when not being able to lose
       weight. (DA: DISC-FACT)                                             6   6   6
  SPK  Well, people around me soon get used to my body shape though.
       (DA: DISC-FACT)

Fig. 1. Excerpt of a dialogue with AL-annot1’s utterance-level user satisfaction ratings for smoothness (Sm), closeness (Cl), and good listener (GL) in the AL domain. SPK and LIS denote speaker and listener, respectively. Both the speaker and listener are human.

Table 2. Correlation (ρ) of ratings. Granularity indicates the levels of user satisfaction ratings. The granularity (a) uses the original 7 levels of ratings, (b) uses 3 levels (we assigned low for 1-2, middle for 3-5, and high for 6-7), (c) uses the same 3 levels with different thresholds [low for 1-3, middle for 4, high for 5-7], (d) uses 2 levels [low for 1-4, high for 5-7], and (e) uses the same 2 levels but with the thresholds [low for 1-3, high for 4-7].

                          AD Domain                            AL Domain
  Granularity    Smoothness  Closeness  Willingness   Smoothness  Closeness  Good Listener
  (a) 7 ratings  0.18        0.15       0.27          0.18        0.10       0.11
  (b) 3 ratings  0.17        0.13       0.18          0.04        0.05       0.11
  (c) 3 ratings  0.13        0.11       0.21          0.14        0.08       0.08
  (d) 2 ratings  0.20        0.17       0.31          0.18        0.13       0.14
  (e) 2 ratings  0.30        0.30       0.32          0.18        0.11       0.04

3 Individual Differences

We investigated how user satisfaction ratings of two independent annotators differ in order to gain insight into whether it is reasonable for us to aim for user-independent prediction models.

Table 2 shows the rather low correlation coefficients (Spearman’s rank correlation coefficients, ρ) of the ratings of our two independent annotators for the AD and AL domains. Here, we first calculated the correlation coefficient for each dialogue and then averaged the coefficients over all dialogues. Since it may be too difficult for the 7 levels of user satisfaction ratings to correlate, we changed the granularity of the ratings to 3 levels (i.e., low, middle, high) and even 2 levels (i.e., low and high) for calculating the correlation coefficients. However, this did not greatly improve the correlations in either domain. It is quite surprising that the simple choice of high/low shows very low correlation. From these results, it is clear that the ratings given to user satisfaction transitions are likely to differ greatly among individuals and that it may be difficult to create a user-independent prediction model; therefore, as a preliminary step, we deal with user-dependent prediction models in this paper.

Fig. 2. Distributions of the smoothness ratings in the AD domain. The histogram on the left is the distribution for AD-annot1; that on the right is the distribution for AD-annot2.

Fig. 3. Distributions of the good listener ratings in the AL domain. The histogram on the left is the distribution for AL-annot1; that on the right is the distribution for AL-annot2.

We also investigated the distributions of the ratings for the annotators. Figure 2 shows the distributions for the smoothness rating in the AD domain, and Fig. 3 shows the distributions for the good listener rating in the AL domain. It can be seen that, in the AD domain, the distributions are rather similar, meaning that the two annotators provided ratings roughly with the same ratios. This, together with the low correlation shown in Table 2, indicates that the annotators allocate the same rating very differently. As for the AL domain, we see that the distributions differ greatly: AL-annot1 rated most of the utterances 4-5, whereas AL-annot2’s ratings follow a normal distribution-like pattern, which is another indication of the difficulty of creating a user-independent prediction model; the ranges of ratings as well as their output probabilities could differ greatly among individuals. Here, the fact that AL-annot1 rated most of the utterances 4-5 can be rather problematic for training prediction models because the output distribution of the trained model would follow a similar distribution, producing only 4-5 ratings. Such a model would not be able to detect good [rating=7] or bad [rating=1] ratings, which may make the prediction models useless. We examine how this bias of ratings affects the prediction performance in Section 5.

4 Evaluation Criteria

We conceived of two kinds of evaluation criteria: one for evaluating individual matches and the other for evaluating distributions. We do not consider the MSE of rating probabilities [2] because its use is too restrictive and because we believe the ideal evaluation criterion should be applied to any hypothesis ratings as long as reference ratings are available.

4.1 Evaluating Individual Matches

Since our task is to predict user satisfaction transitions, it is obviously important that the predicted rating matches that of the reference (i.e., human judgment). Therefore, we have the match rate (MR) and the mean absolute error (MAE) to calculate the rating matches. Here, the MR treats all ratings differently, whereas the MAE takes the distance of ratings into account; namely, 6 is closer to 7 than to 1. In addition, we calculate the Spearman’s rank correlation coefficient (ρ) so that the correspondence of the hypothesis and reference ratings can be taken into account. They are derived using the equations below. In the equations, R (= {R_1 ... R_L}) and H (= {H_1 ... H_L}) denote reference and hypothesis rating sequences for a given dialogue, respectively. L is the length of R and H. Note that they have the same length.

(1) Match Rate (MR):

\[ \mathrm{MR}(R, H) = \frac{1}{L} \sum_{i=1}^{L} \mathrm{match}(R_i, H_i), \tag{1} \]

where ‘match’ returns 1 or 0 depending on whether R_i matches H_i.

(2) Mean Absolute Error (MAE):

\[ \mathrm{MAE}(R, H) = \frac{1}{L} \sum_{i=1}^{L} |R_i - H_i|. \tag{2} \]

(3) Spearman’s rank correlation coefficient (ρ):

\[ \rho(R, H) = \frac{\sum_{i=1}^{L} (R_i - \bar{R})(H_i - \bar{H})}{\sqrt{\sum_{i=1}^{L} (R_i - \bar{R})^2 \sum_{i=1}^{L} (H_i - \bar{H})^2}}, \tag{3} \]

where \bar{R} and \bar{H} denote the average values of R and H, respectively.
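A small Python sketch of these three per-dialogue criteria, assuming numpy and scipy; the reference and hypothesis sequences below are invented examples, and scipy's spearmanr ranks the ratings internally, which matches the criterion's name.

```python
import numpy as np
from scipy.stats import spearmanr

def match_rate(ref, hyp):
    """Eq. (1): fraction of positions where the predicted rating equals the reference."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    return float(np.mean(ref == hyp))

def mean_absolute_error(ref, hyp):
    """Eq. (2): average absolute distance between reference and hypothesis ratings."""
    return float(np.mean(np.abs(np.asarray(ref) - np.asarray(hyp))))

ref, hyp = [4, 5, 5, 6, 7, 6], [4, 4, 5, 5, 7, 7]        # toy rating sequences
rho = spearmanr(ref, hyp).correlation                     # rank correlation, cf. Eq. (3)
print(match_rate(ref, hyp), mean_absolute_error(ref, hyp), rho)
```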


4.2 Evaluating Rating Distributions

As we saw in Fig. 3, the rating distributions of the annotators may vary greatly. Therefore, it may be important to take into account the rating distributions in evaluation. To this end, we can use the Kullback-Leibler divergence (KL), which can measure the similarity of distributions.

Having a similar distribution may not necessarily mean that the prediction is successful, because in cases where reference ratings gather around just a few rating values (see, for example, the left hand side of Fig. 3 for AL-annot1’s distribution), there is a possibility of inappropriately assigning high value to prediction models that output only a few frequent ratings; such models cannot predict other ratings, which is not a desirable function of a prediction model. In the practical as well as information theoretic sense, we have to correctly predict rare but still important cases. Therefore, in addition to the KL, we use the match rate per rating (MR/r) and mean absolute error per rating (MAE/r). These criteria evaluate how accurately each individual rating can be predicted; namely, the accuracy for predicting one rating is equally valued with that for the other ratings irrespective of the distribution of ratings in the reference. We use the following equations for the KL, MR/r and MAE/r.

(4) Kullback-Leibler Divergence (KL):

\[ \mathrm{KL}(\mathbf{R}, \mathbf{H}) = \sum_{r=1}^{K} \mathrm{P}(\mathbf{H}, r) \cdot \log\left( \frac{\mathrm{P}(\mathbf{H}, r)}{\mathrm{P}(\mathbf{R}, r)} \right), \tag{4} \]

where K is the maximum user satisfaction rating (i.e., 7 in our case), R and H denote the sequentially concatenated reference/hypothesis rating sequences of all dialogues, and P(∗, r) denotes the occurrence probability that a rating r is found in an arbitrary rating sequence.

(5) Match Rate per Rating (MR/r):

\[ \mathrm{MR/r}(\mathbf{R}, \mathbf{H}) = \frac{1}{K} \sum_{r=1}^{K} \frac{\sum_{i \in \{i \mid \mathbf{R}_i = r\}} \mathrm{match}(\mathbf{R}_i, \mathbf{H}_i)}{\sum_{i \in \{i \mid \mathbf{R}_i = r\}} 1}, \tag{5} \]

where R_i and H_i denote ratings at i-th positions.

(6) Mean Absolute Error per Rating (MAE/r):

\[ \mathrm{MAE/r}(\mathbf{R}, \mathbf{H}) = \frac{1}{K} \sum_{r=1}^{K} \frac{\sum_{i \in \{i \mid \mathbf{R}_i = r\}} |\mathbf{R}_i - \mathbf{H}_i|}{\sum_{i \in \{i \mid \mathbf{R}_i = r\}} 1}. \tag{6} \]
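The distribution-oriented criteria can be sketched as follows, again assuming numpy; the smoothing constant in the KL computation and the skipping of rating values that never occur in the reference are practical choices of this sketch, not prescribed by the definitions above.

```python
import numpy as np

def rating_distribution(seq, K=7):
    """Occurrence probability of each rating 1..K in a concatenated rating sequence."""
    counts = np.bincount(np.asarray(seq), minlength=K + 1)[1:].astype(float)
    return counts / counts.sum()

def kl_divergence(ref, hyp, K=7, eps=1e-10):
    """Eq. (4); eps avoids division by zero for ratings missing from one sequence."""
    p_r = rating_distribution(ref, K) + eps
    p_h = rating_distribution(hyp, K) + eps
    return float(np.sum(p_h * np.log(p_h / p_r)))

def match_rate_per_rating(ref, hyp, K=7):
    """Eq. (5); rating values absent from the reference are skipped here."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    rates = [np.mean(hyp[ref == r] == r) for r in range(1, K + 1) if np.any(ref == r)]
    return float(np.mean(rates))

def mae_per_rating(ref, hyp, K=7):
    """Eq. (6); rating values absent from the reference are skipped here."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    errs = [np.mean(np.abs(hyp[ref == r] - r)) for r in range(1, K + 1) if np.any(ref == r)]
    return float(np.mean(errs))
```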

4.3 Selecting Appropriate Evaluation Criteria

We have so far presented six evaluation criteria. Although they can all be useful, it would still be desirable if we could choose a single criterion for simplicity and also for practical use. We made three assumptions for selecting the most suitable criterion.


Fig. 4. Topology of our HMM. The states for ratings 1 and 2 are connected ergodically. An oval marked speaker1/speaker2 indicates a state for speaker1/speaker2. Arrows denote transitions and numbers before speaker1/speaker2 are state IDs. Boxes group together the states related to a particular rating. (States 1 and 2 form the group for rating 1; states 3 and 4 form the group for rating 2.)

First, the suitable criterion should not evaluate “random choice” highly. Second, it should not evaluate “no-choice” highly, such as when the prediction is done simply by using a single rating value. In other words, since “random choice” and “no-choice” do not perform any prediction, they should show the lowest performance when we use the suitable criterion. Third, the suitable criterion should be able to evaluate the prediction accuracy independent of individuals because it would be difficult for researchers and developers in the field to adopt a criterion that is too sensitive to individual differences for a reliable comparison. We also believe that the prediction accuracy should be similar among individuals because of the fundamental difficulty in predicting user satisfaction [4]; for a computational model, predicting one person’s ratings would be as difficult as predicting the other person’s. Therefore, we consider the suitable evaluation criterion should produce similar values for different individuals. In the next section, we experimentally find the best evaluation criterion that satisfies these assumptions.

5 Prediction Experiment

We trained our prediction models using HMMs and CRFs and compared their prediction performance. Note that we trained these models for each annotator in each domain following the results in Section 3. As baselines and as the requirements for selecting the best evaluation criterion, we prepared a random baseline (hereafter, RND) and a “no-choice” baseline. Our “no-choice” baseline produces the most common rating 4 as predictions; hence, this is a majority baseline (hereafter, MJR).

5.1 Training Data

Our task is to predict a user satisfaction rating at each evaluation point in a dialogue. We decided to predict the user satisfaction rating after each dialogue-act because a dialogue-act is one of the basic units of dialogue. We created the training data by aligning the dialogue-acts with their user satisfaction ratings.

Fig. 5. Topology of our CRF. The area within the dotted line represents the scope of our features (the dialogue-acts DA, speaker IDs s and ratings r at the surrounding positions) for predicting the rating r_0.

Since we have ratings only after system/listener utterances, we first assumed that the ratings for dialogue-acts corresponding to user/speaker utterances were the same as those after the previous system/listener utterances. In addition, since a system/listener utterance may contain multiple dialogue-acts, its dialogue-acts are given the same rating as the utterance. This process results in our creating a sequence <s_1, DA_1, r_1> · · · <s_N, DA_N, r_N> for each dialogue, where s_i denotes the speaker of a dialogue-act, DA_i the i-th dialogue-act, r_i the rating for DA_i, and N the number of dialogue-acts in a dialogue. We created such sequences for our dialogue data. Our task is to predict r_1 . . . r_N from <s_1, DA_1> · · · <s_N, DA_N>.
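A sketch of this alignment step in Python; the tuple layout of `turns` and the example dialogue-act labels are hypothetical and only meant to illustrate the rule described above.

```python
def build_training_sequence(turns):
    """Create <speaker, dialogue-act, rating> triples for one dialogue.

    `turns` is a list of (speaker_id, dialogue_acts, rating_or_None) tuples in
    dialogue order; rated system/listener turns carry a rating, the other
    turns carry None and inherit the previous rating."""
    sequence, last_rating = [], 4            # dialogues start from the neutral rating 4
    for speaker, dialogue_acts, rating in turns:
        if rating is not None:
            last_rating = rating
        for da in dialogue_acts:             # all acts of a turn share the turn's rating
            sequence.append((speaker, da, last_rating))
    return sequence

dialogue = [(0, ["GREETING"], 4), (1, ["DISC-P", "DISC-R"], None), (0, ["QUESTION"], 5)]
print(build_training_sequence(dialogue))     # hypothetical example dialogue
```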

5.2 Training HMMs

From the training data, we trained HMMs following a manner similar to [2]. We have K groups of states where K is the maximum rating value; i.e., 7. Each group represents a particular rating k (1 ≤ k ≤ K). Figure 4 shows the HMM topology. For the sake of simplicity, the figure only shows the case when we have only two ratings: 1 and 2. Each group has two states: one for representing the emission of one speaker (conversational participant) and the other for the emission of the other speaker. We used this topology because it has been successfully utilized to model two-party conversations [6]. In this HMM, all states are connected ergodically; that is, all states can transit to all other states.

As emissions, we used a speaker ID (a binary value s ∈ {0, 1}, indicating speaker1 or speaker2), a dialogue act, and a rating score. The number of dialogue-acts in the AD domain is 29, and the number of dialogue-acts in the AL domain is 46. A speaker ID s is emitted with the probability of 1.0 from the states corresponding to the speaker s. A rating score k is emitted with the probability of 1.0 from the states representing the rating k. Therefore, a datum having a speaker ID s and a rating k is always assigned to a state representing s and k in the training phase. We used the EM-algorithm for the training.

In decoding, we made the HMM ignore the output probability of rating scores and searched for the best path using the Viterbi algorithm [8]. Since the states in the best path represent the most likely ratings, we can translate the state IDs into corresponding rating values. For example, if the best path goes through state IDs {1,3,4,2} in Fig. 4, then the predicted rating sequence becomes <1,2,2,1>.
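The state-to-rating translation is a simple index calculation for the topology of Fig. 4; the helper below is a sketch that assumes two states (one per speaker) in every rating group and state IDs counted from 1.

```python
def states_to_ratings(state_path, states_per_rating=2):
    """Map a Viterbi state path onto rating values for the Fig. 4 topology,
    e.g. the path [1, 3, 4, 2] becomes the rating sequence [1, 2, 2, 1]."""
    return [(state - 1) // states_per_rating + 1 for state in state_path]

print(states_to_ratings([1, 3, 4, 2]))   # -> [1, 2, 2, 1]
```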

5.3 Training CRFs

We used a linear-chain CRF based on a maximum a posteriori probability (MAP) criterion [5]. The most probable rating for each dialogue-act was estimated using the following features: the current dialogue-act, previous and succeeding two dialogue-acts, the speaker IDs for these dialogue-acts, and the previous and succeeding two ratings. Figure 5 illustrates the topology of our CRF and the scope of the features.
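A hedged sketch of such observation features in the list-of-dicts format used by common linear-chain CRF toolkits (e.g. sklearn-crfsuite, assumed here); the example sequence and labels are invented, and the dependence on neighbouring ratings is left to the chain structure of the CRF rather than encoded as explicit features.

```python
def crf_features(seq, i, window=2):
    """Observation features for position i of a <speaker, dialogue-act> sequence:
    the current dialogue-act, the two preceding/succeeding dialogue-acts and the
    speaker IDs of all of them."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(seq):
            speaker, act = seq[j]
            feats[f"da[{offset}]={act}"] = 1.0
            feats[f"spk[{offset}]={speaker}"] = 1.0
    return feats

seq = [(0, "GREETING"), (1, "DISC-P"), (0, "QUESTION")]    # hypothetical dialogue
X = [[crf_features(seq, i) for i in range(len(seq))]]      # one dialogue
y = [["4", "5", "5"]]                                      # ratings as string labels
# e.g. sklearn_crfsuite.CRF(algorithm="lbfgs").fit(X, y) would train such a model
```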

5.4 Evaluation Procedure

We performed a ten-fold cross validation. We first separated the training data into ten disjoint sets. Then, we used nine sets for training HMMs and CRFs, and used the remaining one for testing. We repeated this ten times in a round-robin fashion. In the evaluation, from the output of our prediction models, we extracted predictions only after system/listener dialogue-acts because the reference ratings were originally given only after them. We compared the predictions with the reference sequences using the six evaluation criteria we proposed in Section 4.

5.5 Results

Tables 3 and 4 show the evaluation results for the AD domain. Tables 5 and 6 show the results for the AL domain. To compare the means of the MR, MAE, and ρ, we performed a non-parametric multiple comparison test (Steel-Dwass test [1]). We did not perform a statistical test for other criteria because it was difficult to perform sample-wise comparison for distributions.

Before looking into the individual values, we first need to fix the evaluation criterion. According to our assumptions for choosing appropriate criteria (see Section 4.3), RND and MJR should not show good performance when they are compared to any prediction model because they do not perform any prediction. Since MJR outperforms others in the MR, MAE, and MAE/r, we should not be using such criteria. Using the third assumption, we can also eliminate ρ and KL because their values differ greatly among individuals. For example, ρ for the smoothness in the AD domain for AD-annot1 is 0.187 (column HMM), whereas that for AD-annot2 is just 0.05 (column HMM), and the KL for the closeness in the AL domain for AL-annot1 is 0.093 (column CRF), whereas that for AL-annot2 is 0.029 (column CRF). The elimination of the KL can also be supported by the fact that the similar rating distributions of AD-annot1 and AD-annot2 did not result in high correlations, which suggests that the shape of rating distributions does not necessarily mean the match of ratings (cf. Fig. 2). As a result, we end up with only one evaluation criterion: MR/r, which becomes our recommended evaluation criterion.

Here, we do not argue that the MR/r is the best possible measure. There could be more appropriate ones that we could not introduce in this paper. In addition, we do not mean that other measures are not useful; for example, both the MR and MR/r approach the same value of 1 as the prediction accuracy improves. We recommend the MR/r simply because, among our proposed criteria, it evaluates non-predicting baselines lowly and seems less susceptible to individual differences than the others.

When we focus on the MR/r, we see that HMMs have consistently better values than CRFs (except for just one case). Therefore, we can say that the current best model can be achieved by HMMs. One explanation of this result may be that the parameters of the CRFs may have been over-tuned to the data with higher posterior probabilities. Consequently, CRFs showed poor performance for the data with lower posterior probabilities. Although HMMs performed comparatively better than CRFs, it should also be noted that the absolute values of MR/r are only 0.2–0.24; further improvements are crucial. One possibility for improving HMMs is to incorporate more features, such as word-level features and those related to dialogue history. Another possibility may be to increase the number of states for more accurate modeling of the dialogue-act/rating sequences. We leave further analyses as future work.

Table 3. The MR, MAE, ρ, KL, MR/r and MAE/r for the random baseline (RND), majority baseline (MJR), HMMs and CRFs for the AD domain. The ratings of AD-annot1 were used as references. The asterisks, ‘+’, ‘h’, and ‘c’ indicate the statistical significance (p<0.01) over RND, MJR, HMM, and CRF, respectively. Bold font indicates the best value for a certain user satisfaction rating.

            Smoothness                        Closeness                         Willingness
        RND     MJR      HMM     CRF      RND     MJR     HMM     CRF      RND     MJR      HMM     CRF
MR      0.142   0.376∗h  0.275∗  0.308∗   0.146   0.340∗  0.279∗  0.273∗   0.155   0.298∗   0.283∗  0.305∗
MAE     1.995   0.996∗h  1.420∗  1.252∗   2.004   1.094∗  1.431∗  1.392∗   1.947   1.085∗   1.403∗  1.245∗
ρ      -0.007   NA       0.187∗  0.109   -0.002   NA      0.213∗  0.110    0.025   NA       0.169   0.183∗
KL      0.284   1.011    0.162   0.031    0.184   1.079   0.092   0.061    0.208   1.222    0.125   0.013
MR/r    0.149   0.143    0.217   0.172    0.143   0.143   0.231   0.162    0.150   0.143    0.224   0.208
MAE/r   2.280   1.714    1.782   1.820    2.242   1.714   1.702   1.836    2.219   1.714    1.705   1.636

Table 4. Evaluation results for the AD domain with the ratings of AD-annot2 as references. See Table 3 for the notations in the table.

            Smoothness                        Closeness                         Willingness
        RND     MJR      HMM     CRF      RND     MJR     HMM     CRF      RND     MJR      HMM     CRF
MR      0.147   0.306∗   0.257∗  0.278∗   0.146   0.320∗  0.240   0.275    0.154   0.288∗   0.277∗  0.312∗
MAE     2.010   1.132∗h  1.738   1.431∗   2.025   1.157∗  1.600∗  1.419∗   2.068   1.183∗h  1.595∗  1.225∗
ρ      -0.024   NA       0.050   0.166∗  -0.011   NA      0.202∗  0.171∗   0.009   NA       0.105   0.245∗
KL      0.154   1.231    0.162   0.027    0.140   1.217   0.184   0.041    0.258   1.280    0.215   0.038
MR/r    0.149   0.143    0.210   0.177    0.136   0.143   0.232   0.176    0.158   0.143    0.234   0.238
MAE/r   2.246   1.714    2.017   1.922    2.292   1.714   1.726   1.852    2.250   1.714    1.938   1.680

Table 5. Evaluation results for the AL domain with the ratings of AL-annot1 as references. See Table 3 for the notations in the table.

            Smoothness                        Closeness                          Good Listener
        RND     MJR      HMM     CRF      RND     MJR      HMM     CRF       RND     MJR     HMM     CRF
MR      0.150   0.472∗   0.436∗  0.519∗   0.135   0.556∗   0.421∗  0.551∗h   0.142   0.370∗  0.423∗  0.505∗+
MAE     1.878   0.688∗   0.803∗  0.642∗h  1.832   0.593∗h  0.897∗  0.575∗h   1.937   0.806∗  0.850∗  0.619∗+h
ρ      -0.002   NA       0.221∗  0.241∗   0.005   NA       0.101   0.209∗   -0.054   NA      0.214∗  0.294∗
KL      0.944   0.781    0.090   0.080    1.000   0.611    0.122   0.093     1.005   1.020   0.088   0.081
MR/r    0.158   0.143    0.228   0.193    0.161   0.143    0.231   0.190     0.138   0.143   0.222   0.202
MAE/r   2.327   1.714    1.878   1.962    2.221   1.714    1.994   1.579     2.366   1.714   1.805   1.596


Table 6. Evaluation results for the AL domain with the ratings of AL-annot2 as references. See Table 3 for the notations in the table.

            Smoothness                        Closeness                          Good Listener
        RND     MJR      HMM     CRF      RND     MJR       HMM     CRF      RND     MJR       HMM     CRF
MR      0.138   0.292∗   0.263∗  0.289∗   0.146   0.310∗h   0.226∗  0.263∗   0.150   0.300∗h   0.244∗  0.273∗
MAE     2.031   1.128∗h  1.489∗  1.264∗   1.972   1.023∗ch  1.508∗  1.297∗h  1.945   1.053∗ch  1.522∗  1.313∗
ρ      -0.018   NA       0.183∗  0.161∗   0.017   NA        0.077   0.044    0.021   NA        0.130   0.074
KL      0.426   1.251    0.101   0.023    0.337   1.188     0.094   0.029    0.342   1.207     0.129   0.038
MR/r    0.132   0.143    0.210   0.185    0.148   0.143     0.195   0.168    0.149   0.143     0.208   0.185
MAE/r   2.330   1.714    1.974   1.799    2.293   1.714     1.772   1.760    2.210   1.714     1.841   1.699

6 Summary and Future Work

This paper addressed three important issues in automatic prediction of user satisfaction transitions in dialogues: individual differences, evaluation criteria, and prediction models. We first showed, by our observation of great individual differences in our rating data, that it is rather difficult to create a user-independent prediction model. Then, we introduced six possible candidates for evaluation criteria of user satisfaction transitions and experimentally found that the match rate per rating (MR/r) is currently the most appropriate criterion. Finally, in our experiment, we found that HMMs provide better prediction accuracies than CRFs. Our contribution lies in our setting a course for future research in predicting user satisfaction transitions in dialogues, especially by the suggestion of an appropriate evaluation criterion and by revealing the standard performance we could attain by the currently available prediction models. Our future work includes creating more appropriate evaluation criteria, improving the prediction accuracy using more features, and further verification of our findings using the data of more individuals and more domains.



Expansion of WFST-Based Dialog Management for Handling Multiple ASR Hypotheses

Naoto Kimura1,2, Chiori Hori1, Teruhisa Misu1, Kiyonori Ohtake1, Hisashi Kawai1, and Satoshi Nakamura1

1 National Institute of Information and Communications Technology (NICT), MASTAR Project, Keihanna Science City, Japan

{naoto.kimura,chiori.hori}@nict.go.jp 2 Nara Institute of Science and Technology (NAIST)

8916-5 Takayama-cho, Ikoma-shi, Nara, 630-0192, Japan

Abstract. We proposed a weighted finite-state transducer-based dialog manager (WFSTDM) as a platform for expandable and adaptable dialog systems. In this platform, all rules and/or models for dialog management (DM) are expressed in WFST form, and the WFSTs are used to accomplish various tasks via multiple modalities. With this framework, we constructed a statistical dialog system using user concept and system action tags, acquired from an annotated corpus of human-to-human spoken dialogs, as input and output labels of the WFST. We introduced a spoken language understanding (SLU) WFST for converting user utterances to user concept tags, a dialog scenario WFST for converting user concept tags to system action tags, and a sentence generation (SG) WFST for converting system action tags to system utterances. The tag sequence probabilities of the dialog scenario WFST were estimated by using a spoken dialog corpus for hotel reservation. The SLU, scenario, and SG WFSTs were then composed into a dialog management WFST which determines the next action of the system in response to the user input. In our previous research, we evaluated the dialog strategy by referring to the manual transcription. In this paper, we present the performance of the WFSTDM when speech recognition hypotheses are input. To alleviate the degradation of DM performance caused by speech recognition errors, we expand the WFSTDM to handle multiple hypotheses of speech recognition together with confidence scores, which indicate the acoustic and linguistic reliability of speech recognition. We also evaluated the accuracy of SLU results and the correctness of system actions selected by the dialog management WFST. We confirmed that the performance of dialog management was enhanced by choosing the optimal action among all the WFST paths for multiple hypotheses (N-best) of speech recognition in consideration of the confidence scores.

Keywords: Spoken dialog, Spoken Language Understanding, Weighted Finite-State Transducer (WFST), Statistical dialog management, speech recognition, confidence score, N-best.

1 Introduction

We aim to construct robust spoken dialog systems through which a human and a machine can converse as freely as two humans do. Since a conventional dialog


system requires a user to answer in response to system questions, the user's responses are limited by the questions and the user rarely behaves in a flexible manner throughout the dialog. In such a system-driven dialog, accurate automatic speech recognition (ASR) results are available by restricting the users' spontaneous input. State-of-the-art ASR technologies recognize spontaneous speech in real time with more than 100M-word vocabularies [1]. It is high time to take on the challenge of constructing dialog systems that accept users' spontaneous dialog behaviors.

To realize such dialog systems, corpus-based dialog management is a promising means. Humans have typical patterns of dialog, especially when the target task is limited, and thus statistical models of the dialog scenario that determine the next system action can be learned from a dialog corpus. Furthermore, a dialog corpus makes it possible to cover more linguistic expressions for understanding the user's intention and to produce more natural system responses, as found in the corpus. Statistical models for spoken language understanding (SLU) and the dialog scenario are trained using a dialog corpus which is annotated with the concept tags of the user and the agent. Dialog management (DM) determines the system's next action in response to user input using such statistical models.

We proposed a weighted finite-state transducer (WFST) based DM platform [2]. WFSTs are mainly used in speech and language processing [3]. The WFST-based DM (WFSTDM) system accepts user inputs and outputs system responses according to a dialog management WFST which represents rules/statistical models. The WFSTDM makes it possible to construct an expandable and adaptable dialog system which handles multiple tasks and modalities [4], since various statistical models represented by multiple WFSTs can be composed using the WFST operations. To build more complex dialog systems, more statistical models need to be composed into a dialog management WFST and the size of the WFST becomes huge. The cost of decoding such a huge WFST can be reduced by the WFST optimization operations [5]. Statistical models for SLU, the dialog scenario, and sentence generation (SG) for system responses can be transformed into WFSTs, and these WFSTs are then composed into a dialog management WFST [6]. The dialog management WFST can be decoded using the WFSTDM system, in which user utterances input to the system are converted into system responses, i.e., sentences. Figure 1 shows the WFSTDM system.

[Figure 1 depicts the WFSTDM system: the SLU WFST, scenario WFST, and SG WFST are composed into a dialog management WFST, which the WFSTDM decodes to connect speech recognition of the user's speech with the system's synthesized speech produced by text-to-speech.]

Fig. 1. WFST-based dialog management system


Focusing on statistical models for dialog systems, a statistical model of the system action sequence for the dialog strategy [7] and that of the user concept sequence for understanding [8][9][10] have thus far been investigated independently. We constructed a statistical model for the dialog scenario obtained from the tag sequences of both the clerk and customer sides [11]. This model determines the system's next action based on the probabilities of multiple user concept hypotheses conditioned on the previous system action. We examined the performance of the WFSTDM using this model as a dialog management WFST for manual transcriptions of user utterance inputs.

In this paper we present the performance of the WFSTDM for ASR hypothesis inputs. Since the SLU models are trained from the manual transcriptions of the dialog corpus, ASR errors may degrade the performance of DM. Thus far, to minimize such degradation of DM performance, multiple ASR hypotheses and confidence scores indicating the acoustic and linguistic reliability of the hypotheses have been used for SLU in dialog systems [12][13]. User inputs are rejected by dialog systems if the confidence score of an ASR result is lower than a threshold, and efficient dialog strategies are chosen by avoiding redundant system confirmations. As another approach, SLU is carried out by selecting the most likely interpretation, weighted by the confidence scores. In this paper, we expand the WFSTDM to handle multiple ASR hypotheses and confidence scores in determining the most likely next system action.

2 WFST-Based Dialog System

2.1 Spoken Language Understanding WFST

A spoken language understanding (SLU) WFST is a pattern detector which detects a phrase expressing a user concept in an input sentence and translates it to the corresponding user concept tag. We extracted a set of utterance sentences corresponding to each concept tag from the corpus, and then n-word phrases with high relative frequency were extracted from each sentence and embedded in the SLU WFST as distinguishable expression patterns of the user concept tags. Specifically, the transition weights in the WFST were determined so that the paths corresponding to longer phrases have a lower cost, i.e., it is based on longest pattern matching. In this paper, we constructed the SLU WFST using frequent n-word phrases (n = 1–6) in a corpus for hotel reservation.
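The following Python sketch illustrates the longest-pattern-matching idea described above. The phrase table and tag names are invented for illustration and are not taken from the hotel-reservation corpus; a real SLU WFST would encode the same preference through its transition weights rather than an explicit greedy scan.

# Minimal sketch of longest-phrase-match SLU (illustrative phrases and tags,
# not the actual hotel-reservation corpus).
from typing import Dict, List, Tuple

# Each concept tag is associated with n-word phrases (n = 1..6) mined from
# training transcripts; longer phrases should win over shorter ones.
PHRASE_TABLE: Dict[Tuple[str, ...], str] = {
    ("make", "a", "reservation"): "c:request-action+reservation",
    ("hello",): "c:greeting",
    ("my", "name", "is"): "c:introduce-self",
}

def slu_longest_match(words: List[str]) -> List[str]:
    """Scan the utterance left to right, always taking the longest phrase
    that matches at the current position (greedy longest-pattern match)."""
    tags, i = [], 0
    while i < len(words):
        matched = False
        for n in range(min(6, len(words) - i), 0, -1):   # prefer longer n
            phrase = tuple(words[i:i + n])
            if phrase in PHRASE_TABLE:
                tags.append(PHRASE_TABLE[phrase])
                i += n
                matched = True
                break
        if not matched:
            i += 1                                       # skip an unmatched word
    return tags

print(slu_longest_match("hello my name is hiroko tanaka".split()))
# -> ['c:greeting', 'c:introduce-self']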

2.2 Scenario WFST

A statistical model for the dialog scenario was trained using the sequences of annotation tags in the corpus. Although there are alternative ways of choosing responses to the user, the scenario WFST enables the dialog system to determine which system action should be taken in response to the user input in each state of the dialog discourse. In this paper, we constructed the scenario WFST from a 3-gram model of the annotation tag sequences in the corpus.
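As a rough illustration of how such a scenario model can be weighted, the sketch below turns tag-trigram counts into negative-log-probability costs of the kind a scenario WFST transition would carry. The toy tag sequences and the add-one smoothing are assumptions for the example, not the estimation procedure used in the paper.

# Sketch: turn annotation-tag trigram counts into -log probability weights,
# the kind of weight a scenario WFST transition carries (toy counts,
# add-one smoothing assumed).
import math
from collections import defaultdict

trigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)

corpus_tag_sequences = [
    ["c:greeting", "a:greeting", "c:request-action", "a:request-information"],
    ["c:greeting", "a:greeting", "c:introduce-self", "a:acknowledge"],
]
for seq in corpus_tag_sequences:
    for t1, t2, t3 in zip(seq, seq[1:], seq[2:]):
        trigram_counts[(t1, t2, t3)] += 1
        bigram_counts[(t1, t2)] += 1

vocab = {t for seq in corpus_tag_sequences for t in seq}

def trigram_weight(t1, t2, t3):
    """Negative log P(t3 | t1, t2) with add-one smoothing;
    a smaller weight means a more likely transition in the tropical semiring."""
    p = (trigram_counts[(t1, t2, t3)] + 1) / (bigram_counts[(t1, t2)] + len(vocab))
    return -math.log(p)

print(trigram_weight("c:greeting", "a:greeting", "c:request-action"))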

2.3 Dialog Management WFST

The SLU WFST, scenario WFST, and SG WFST were composed and then optimized using WFST operations. The finally composed WFST is denoted as the dialog management WFST. The manual transcription or speech recognition results of a user utterance


are input to the dialog management WFST, and the next system action tag is output from the WFST. In this experiment, the SLU WFST and scenario WFST, without the SG WFST, were combined into the dialog management WFST.
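For readers unfamiliar with WFST composition, the following toy sketch composes two epsilon-free transducers over the tropical semiring. It is a didactic stand-in for the optimized operations of an FST toolkit such as OpenFst, and the arc labels are invented; the paper's actual WFSTs are far larger (see Table 4).

# Sketch: epsilon-free composition of two transducers in the tropical semiring.
# Arcs are (src, in_label, out_label, weight, dst). This mirrors how an SLU
# WFST (words -> concept tags) and a scenario WFST (concept tags -> action
# tags) can be chained, but it is a toy implementation, not OpenFst.
from collections import defaultdict, deque

def compose(arcs1, start1, arcs2, start2):
    out1 = defaultdict(list)                  # arcs of T1 indexed by source state
    for a in arcs1:
        out1[a[0]].append(a)
    out2 = defaultdict(list)                  # arcs of T2 indexed by (source, in_label)
    for a in arcs2:
        out2[(a[0], a[1])].append(a)

    start = (start1, start2)
    arcs, seen, queue = [], {start}, deque([start])
    while queue:
        q1, q2 = queue.popleft()
        for (_, i1, o1, w1, d1) in out1[q1]:
            for (_, _, o2, w2, d2) in out2[(q2, o1)]:   # match T1 output with T2 input
                dst = (d1, d2)
                arcs.append(((q1, q2), i1, o2, w1 + w2, dst))  # tropical "times" is +
                if dst not in seen:
                    seen.add(dst)
                    queue.append(dst)
    return start, arcs

# words -> concepts, concepts -> actions (toy)
slu = [(0, "reservation", "c:request-action", 0.5, 1)]
scenario = [(0, "c:request-action", "a:request-information", 0.2, 1)]
print(compose(slu, 0, scenario, 0))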

2.4 Algorithm of WFSTDM

A WFST T over a semiring K is defined by an 8-tuple T = (Σ, Δ, Q, i, F, E, λ, ρ), where:

(1) Σ is a finite set of input symbols;
(2) Δ is a finite set of output symbols;
(3) Q is a finite set of states;
(4) i ∈ Q is an initial state;
(5) F ⊂ Q is a set of final states;
(6) E ⊂ Q × (Σ ∪ {ε}) × (Δ ∪ {ε}) × K × Q is a finite set of transitions;
(7) λ is an initial weight;
(8) ρ : F → K is a final weight function.

Given an input symbol sequence to a WFST, the output symbol sequence can be obtained as that on the best path with the minimum (or maximum) cumulative weight. The best path can be found efficiently with dynamic programming (DP) among the successful paths, i.e., paths from the initial state to one of the final states that accept the input sequence. In dialog management, however, the system has to respond to the user immediately at each turn. Thus the system needs to choose the most appropriate output sequence according to the current situation.

We show the algorithm of our WFSTDM in Table 1. Steps 1 to 5 perform initial actions that can be taken by epsilon transitions from the initial state. Steps 6 to 10 execute actions in response to the user's input at each turn. Steps 11 to 13 check task completion. In the algorithm, π indicates a path consisting of consecutive transitions e_1, ..., e_L in the WFST. For a path π, we denote its origin state by p[π], its end state by n[π], and its cumulative weight by w[π], where w[π] = w[e_1] ⊗ ... ⊗ w[e_L]. "⊕" and "⊗" are two formally-defined binary operations, i.e., "addition" and "multiplication" over the semiring. In this paper, we use the tropical semiring, in which the "addition" and "multiplication" of two real-valued weights are defined as the minimum of the two and ordinary addition, respectively. We can also use the log semiring, in which each weight is defined as a minus log probability and the corresponding "addition" and "multiplication" are defined. Note that a (cumulative) weight is assumed to be better than the others if it is smaller than the others.

For a set of states S, input symbol sequence x, and output symbol sequence y, we define P(S, x, y) as the set of all paths each of which originates from one of the states in S, accepts x, and outputs y. In Steps 1 and 7, the system selects the most appropriate action sequence on the paths in P(S, x, y). C(π) is the expected cost for taking π and the possible future transitions from n[π]. This is a look-ahead for choosing a more appropriate action at each turn, which may be set as a (discounted) reward as used in POMDPs. S_t is the set of states the system occupies at turn t. W_t(s) is the cumulated weight for each state s in S_t.



Steps 11 and 12 check the task completion based on W̃_F, the relative overall cumulated weight of all the successful paths. Threshold is a pre-defined constant value. In Step 13, the control returns to Step 6 to receive the next user input.

Generally, a set of (weighted) rules or (hidden) Markov models can be represented as a WFST. Once such a model is embedded into a WFST, it can be combined with other WFSTs. Many useful operations are available to combine and manipulate WFSTs. The composition operation for two WFSTs can be used to generate a single WFST that performs the two translations in sequence.

Suppose we prepare two WFSTs independently, where one translates a word sequence into its corresponding concept sequence for language understanding, and the other translates a concept sequence into system actions for dialog management. If we compose these two WFSTs, the dialog manager can accept word sequences directly using the composed WFST. In addition, some optimization operations are effective for reducing the size and the computational cost at runtime.

Table 1. Algorithm of WFST-based dialog management

// execute initial actions
1.  ŷ_0 ← argmin_{y ∈ Δ*} ⊕_{π ∈ P({i}, ε, y)} λ ⊗ w[π] ⊗ C(π)
    // choose the best action sequence among paths with epsilon inputs
2.  execute the actions corresponding to ŷ_0
3.  S_0 ← { s' | s' = n[π], π ∈ P({i}, ε, ŷ_0) }
4.  foreach s' ∈ S_0 do
        W_0(s') ← ⊕_{π : n[π] = s', π ∈ P({i}, ε, ŷ_0)} λ ⊗ w[π]
5.  t ← 1
// execute actions for the user's input
6.  receive a user's input symbol sequence x_t
7.  ŷ_t ← argmin_{y ∈ Δ*} ⊕_{π ∈ P(S_{t-1}, x_t, y)} W_{t-1}(p[π]) ⊗ w[π] ⊗ C(π)
    // choose the best action sequence for x_t
8.  execute the actions corresponding to ŷ_t
9.  S_t ← { s' | s' = n[π], π ∈ P(S_{t-1}, x_t, ŷ_t) }
10. foreach s' ∈ S_t do
        W_t(s') ← ⊕_{π : n[π] = s', π ∈ P(S_{t-1}, x_t, ŷ_t)} W_{t-1}(p[π]) ⊗ w[π]
// check task completion
11. W̃_F ← { ⊕_{s' ∈ S_t ∩ F} W_t(s') ⊗ ρ(s') } ⊗ { ⊕_{s'' ∈ S_t} W_t(s'') }^{-1}
12. if W̃_F < Threshold then exit
13. t ← t + 1 and go to Step 6   // proceed to the next turn
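As a rough illustration of the turn update in Steps 6-10, the following sketch performs a single dialog turn over a toy dialog management WFST in the tropical semiring. It omits the epsilon-transition initial actions, the look-ahead cost C(π), and the task-completion check, keeps only the single best end state, and assumes non-negative weights so that a Dijkstra-style search returns the optimal path; it is a didactic sketch, not the WFSTDM implementation.

# Sketch of one WFSTDM turn (cf. Steps 6-10 above): given active states with
# cumulated weights W_{t-1}, find the output (action-tag) sequence on the
# cheapest path accepting the user's input under the tropical semiring (min, +).
# Arcs: dict state -> list of (in_label, out_label, weight, dst).
import heapq

def dialog_turn(arcs, prev_states, x_t):
    """Return (best_output_tags, new_state_weights) for the input symbols x_t."""
    # search items: (cumulated weight, position in x_t, state, output tuple)
    heap = [(w, 0, s, ()) for s, w in prev_states.items()]
    heapq.heapify(heap)
    best = {}                                    # (state, position) -> best weight seen
    while heap:
        w, pos, s, out = heapq.heappop(heap)
        if best.get((s, pos), float("inf")) < w:
            continue
        best[(s, pos)] = w
        if pos == len(x_t):
            # first completed path popped is optimal for non-negative weights
            return list(out), {s: w}
        for (ilab, olab, aw, dst) in arcs.get(s, []):
            if ilab == x_t[pos]:
                heapq.heappush(heap, (w + aw, pos + 1, dst, out + (olab,)))
    return [], prev_states                       # input not accepted; keep old states

# toy dialog-management WFST: concept tag in, action tag out
arcs = {0: [("c:request-action", "a:request-information", 0.3, 1)],
        1: [("c:give-information", "a:acknowledge", 0.1, 2)]}
print(dialog_turn(arcs, {0: 0.0}, ["c:request-action"]))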



2.5 Handling Multiple ASR Hypotheses

In our previous work, we used the transcripts of utterances of several dialogs as input to a dialog management WFST in order to evaluate the effectiveness of the dialog management. In this work, we use speech recognition results, including recognition errors, as the input. To minimize the impact of recognition errors on dialog management, we expand the WFSTDM to handle N-best hypotheses of speech recognition with confidence scores that indicate the acoustic and linguistic reliability of speech recognition. The new WFSTDM chooses the optimal action among all the WFST paths for the multiple hypotheses, while each path is weighted with the confidence score of the corresponding hypothesis.

The N-best hypotheses of speech recognition often include hypotheses with a lower error rate than that of the best-scored hypothesis. Depending on the dialog context, a more appropriate hypothesis can therefore be selected instead of the best-scored one. In this way, we aim not only to reduce the impact of recognition errors but also to enhance the accuracy of selecting the next system action.

To enable the WFSTDM to accept N-best hypotheses with confidence scores from a speech recognizer, we modify Step 7 in Table 1 as:

ŷ_t ← argmin_{y ∈ Δ*, x_{t,n} ∈ x_t} ⊕_{π ∈ P(S_{t-1}, x_{t,n}, y)} W_{t-1}(p[π]) ⊗ w[π] ⊗ CM(x_{t,n}) ⊗ C(π),

where we consider x_t = { x_{t,n} | n = 1, ..., N }, i.e., x_t represents a list of N-best hypotheses and x_{t,n} corresponds to the n-th hypothesis in the list. CM(x_{t,n}) indicates a confidence score for x_{t,n}. For example, it can be calculated using the word posterior probabilities for speech input O as:

CM(x_{t,n}) = - Σ_{w ∈ x_{t,n}} log P(w | O).

Note that we use the negative of the log probability, following the weight convention of WFSTs.
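A minimal sketch of this confidence computation, and of using it to bias the choice among N-best hypotheses, is given below. The posterior values, the toy N-best list, and the stand-in dialog-management cost are invented for illustration; in the actual system the confidence weight is folded into the WFST path search of the modified Step 7.

# Sketch: confidence score CM(x_{t,n}) as the negative sum of log word posterior
# probabilities, and its use to bias the choice among N-best hypotheses.
# Posterior values and costs here are invented for illustration.
import math

def confidence_score(word_posteriors):
    """CM = -sum_w log P(w|O); smaller is better, matching tropical weights."""
    return -sum(math.log(p) for p in word_posteriors)

# N-best list: (hypothesis words, per-word posterior probabilities)
nbest = [
    (["make", "a", "reservation"], [0.9, 0.8, 0.7]),
    (["make", "the", "observation"], [0.9, 0.4, 0.3]),
]

def dm_cost(words):
    # stand-in for the WFST path cost W_{t-1}(p[pi]) (x) w[pi] (x) C(pi)
    return 0.5 if "reservation" in words else 0.2

# choose the hypothesis minimizing DM path cost plus confidence penalty
best_words, _ = min(nbest, key=lambda h: dm_cost(h[0]) + confidence_score(h[1]))
print(best_words)                                 # -> ['make', 'a', 'reservation']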

In the previous work, we assumed that sentence boundaries were established facts, annotated in the transcripts. It is, however, difficult to detect the boundaries precisely because they become ambiguous in speech recognition results. Therefore, when using speech recognition results, we change the WFSTDM so as to accept multiple sentences at once and to translate them into an output tag sequence without consideration of sentence boundaries, i.e., we consider multiple paths with different sentence boundaries for the input sequence.


3 Evaluation Experiment

3.1 Evaluation Data

We used 25 dialogs for hotel reservation between an English speaker and a Japanese speaker as evaluation data [6]. They are annotated with the Interchange Format (IF), which is an interlingua for speech translation systems. Since IF was not originally designed for DM, we modified the original IF tag set into a consistent set for DM and revised the user concept tags and the system action tags. This resulted in 146 tags, including 58 user concept tags and 88 system action tags. An example of the IF tags and the numbers of turns and tags are shown in Tables 2 and 3, respectively.

Table 2. An example of IF tags

Turn 1 (System):
  a:greeting: お電話,ありがとうございます、 (Thank you for your calling,)
  a:introduce-self: ニューヨークシティホテルでございます。 (New York City Hotel,)
  a:offer+help: ご用件をお伺いいたします。 (may I help you?)
Turn 2 (User):
  c:greeting: もしもし、 (Hello,)
  c:introduce-self: わたし田中弘子といいますが、 (my name is Hiroko Tanaka)
  c:request-action+reservation+hotel: 部屋の予約をお願いしたいんですけれども。 (and I would like to make a reservation.)
Turn 3 (System):
  a:acknowledge: はい、 (Yes,)
  a:request-information+temporal: いつがご希望でしょうか。 (and when would you like to stay?)

Table 3. Number of turns and tags used in the systems

                  User              System
#turn/dialogs     10.76 (269/25)    10.8 (270/25)
#tag/turn         1.79 (482/269)    2.91 (786/270)

3.2 Speech Recognition

We used Julius for speech recognition [8]. The acoustic model is a Japanese gender-independent tri-phone model, and the language models are 2-gram and 3-gram models learned from a travel dialog corpus which includes 87,194 utterances in 2,206 dialogs. For the 25 test dialogs, the test-set perplexity of the 3-gram language model is 31.97 and the number of out-of-vocabulary (OOV) words is 111 (0.81%). The word accuracy of speech recognition is 78.7%.


3.3 Evaluation Method

In order to evaluate the dialog system, we simulated dialog discourses by using the test set as a correct-answer dialog. We used a leave-one-out method on the corpus of 25 dialogs with IF tags, in which each dialog served as the test set and the other 24 dialogs as the training set. To measure the performance of SLU and the performance of predicting the system's next actions, we input the set of user utterances at each turn to the WFST, following the dialog discourse in the test set, performed spoken language understanding, and made the system predict the next action tag sequence in response to the user input. In this method, we use Mean Reciprocal Rank (MRR) as an evaluation metric for the prediction of the system's next actions. MRR is defined as:

MRR = (1/M) Σ_{i=1}^{M} (1/R_i),

where R_i is the rank of the correct system action tag sequence at the i-th turn, and M is the number of system turns. MRR is an evaluation metric whose value becomes large when the correct answer is ranked high among the estimated candidates, and it is suitable for evaluating prediction performance from many candidates in dialogs.
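For concreteness, MRR can be computed as in the short sketch below; the example ranks are invented.

# Sketch: Mean Reciprocal Rank over system turns, following the formula above.
def mean_reciprocal_rank(ranks):
    """ranks[i] is the rank R_i of the correct action-tag sequence at turn i."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# e.g., correct action ranked 1st, 3rd, and 2nd over three system turns
print(mean_reciprocal_rank([1, 3, 2]))            # (1 + 1/3 + 1/2) / 3 = 0.611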

4 Experimental Result

4.1 Accuracy of Spoken Language Understanding

Since the SLU WFST is learned from the corpus and operates by converting user utterances into user concept tags based on longest-phrase matching, the dialog system cannot precisely understand a user utterance when the word N-gram based phrases representing user concepts are not in the training data or are misrecognized in the speech recognition results. To determine the upper bound of the performance of the SLU WFST, we evaluated how many correct user concept tags are covered by the multiple concept tag hypotheses obtained from the longest-phrase patterns in the user utterances.

Figure 2 shows the percentages of coverage of the correct concept tags when the manual transcription, the 1-best, and multiple hypotheses of ASR were input. The sentence boundaries are given in the manual transcription results, while the sentence boundaries for the ASR results are unknown. This result shows that, if user utterances are correctly recognized by the ASR, the SLU WFST has the potential to understand at most 80% of the user concepts correctly. When the ASR results are input, the multiple hypotheses minimize the degradation of the coverage in comparison with the 1-best ASR results; 75.3% of the user concepts can be covered even though the speech recognition results are inevitably degraded.

However, the potential of SLU shown in Figure 2 does not always fully represent the performance of the SLU WFST because it is still difficult to select the correct concept tags from multiple hypotheses. To evaluate the performance of the SLU WFST, the concept tag sequences output from the SLU WFST were compared with the references. The accuracy, taking insertions, deletions, and substitutions into account, is shown in Figure 3.



Fig. 2. Coverage of correct concept tags in concept tag hypotheses for SLU [%]. Transcription: manual transcription, ASR(N-best): N-best speech recognition results, N= 1, 5, 10.

[Figure 3 plots concept accuracy for six conditions: transcription with and without sentence boundaries, and 1-best and 5-best speech recognition results, each with and without confidence scores.]

Fig. 3. Performance of Spoken Language Understanding using SLU WFSTs

The decline in concept accuracy caused by missing sentence boundaries is 20% for the manual transcription. None of the speech recognition results have explicit sentence boundaries, and thus the upper bound of the performance is the 45% given by the manual transcription without sentence boundaries. Roughly 5.5% of the accuracy decline was induced by recognition errors.

Although the accuracy of the 1-best was 39.5%, that of the 5-best improved to 40.5% owing to the consideration of multiple ASR hypotheses. This tendency is the same as that shown in Figure 2. Comparing the ASR results with and without the confidence score, the accuracies for both the 1-best and the 5-best were slightly enhanced by using the confidence score. These results also show that the multiple hypotheses and confidence scores of ASR contribute to spoken language understanding by the SLU WFST.


4.2 Prediction Performance of System’s Next Action

The SLU WFST, whose performance is shown in Figure 3, was composed with the scenario WFST and then optimized to obtain a dialog management WFST. Table 4 lists the average sizes of the 25 WFSTs.

Table 4. Size of WFSTs

WFSTs #state #transition

Scenario (tag 3-gram) 708 2976

SLU 2833 15737

Composed 7611 47462

Optimized 7194 44744

To test the performance of the WFSTDM, the prediction power of the system's next action tags was evaluated using MRR. The MRRs for the manual transcriptions and for the 1-best and 5-best ASR results are compared, and the efficiency of the confidence scores for the DM is shown in Figure 4.

[Figure 4 plots the MRR for six conditions: transcription with and without sentence boundaries, and 1-best and 5-best speech recognition results, each with and without confidence scores.]

Fig. 4. Prediction power of system next action tag using the WFSTDM

The performance for the manual transcription is 0.246 MRR and is reduced to 0.219 MRR when the sentence boundaries are unknown. The 1-best and 5-best ASR results yield 0.173 and 0.179 MRR, respectively. These results show that the performance of SLU is directly reflected in the overall performance of the WFSTDM. We confirmed that the performance is enhanced by using the N-best (5-best) hypotheses with confidence scores. This is because longer phrases representing concept tags are covered by the multiple ASR hypotheses, and the increased coverage ratio results in improvements in spoken language understanding and in the performance of dialog management. Additionally, the confidence score raises the rank of the correct action tag among the multiple hypotheses obtained by the WFSTDM. As a result of this boosting, MRR is improved.


Figure 5 shows the correlation between the speech recognition accuracy and the SLU accuracy/MRR of the WFSTDM. The word accuracies, averaged per dialog over the 25 dialogs, ranged from 61% to 76.3%. The SLU accuracy and the MRR improve as the performance of ASR increases, and there is a strong correlation among them.

[Figure 5 plots the SLU concept accuracy and the MRR, for N = 1 and N = 5, against speech recognition accuracies of 61.74%, 72.19%, and 76.3%.]

Fig. 5. Comparison of spoken language understanding performance and MRR

5 Conclusions

We evaluated the WFSTDM, in which statistical models trained using a human-to-human dialog corpus for hotel reservation were used, in response to manual transcriptions and ASR results of user inputs. To minimize the degradation of DM performance due to speech recognition errors, we expanded the WFSTDM to handle multiple hypotheses of ASR together with confidence scores indicating the acoustic and linguistic reliability of speech recognition. We evaluated the accuracy of the spoken language understanding (SLU) results and the correctness of the system actions selected by the dialog management WFST using MRR. To address the problem of sentence boundaries, we applied a framework in which user concept tags can be output repeatedly from multiple utterances at each turn. We confirmed that the performance of the SLU WFST and the DM was enhanced by choosing the optimal action among all the WFST paths for multiple hypotheses (N-best) of speech recognition in consideration of the confidence scores. Although we implemented a sentence generation (SG) WFST module [6], the SG WFST was not composed into the dialog management WFST, and thus the performance of the entire dialog management WFST including the SG WFST in response to speech input has not yet been tested in this research. Future work involves human judgment of the appropriateness of the sentences generated as system responses by the SG WFST and evaluation experiments with humans via speech input and output.


References

[1] Hori, T., Hori, C., Minami, Y., Nakamura, A.: Efficient WFST-based one-pass decoding with on-the-fly hypothesis rescoring in extremely large vocabulary continuous speech recognition. IEEE Transactions on Audio, Speech and Language Processing 15, 1352–1365 (2007)

[2] Hori, C., Ohtake, K., Misu, T., Kashioka, H., Nakamura, S.: Dialog management using weighted finite-state transducers. In: 9th Annual Conference of the International Speech Communication Association, pp. 211–214 (2008)

[3] Mohri, M., Pereira, F., Riley, M.: Weighted finite-state transducers in speech recognition. Computer Speech and Language 16, 69–88 (2002)

[4] Kayama, K., Kobayashi, A., Mizukami, E., Misu, T., Kashioka, H., Kawai, H., Nakamura, S.: Spoken Dialog System on Plasma Display Panel Estimating Users' Interest by Image Processing. In: 1st International Workshop on Human-Centric Interfaces for Ambient Intelligence (July 2010) (to appear)

[5] Hori, C., Ohtake, K., Misu, T., Kashioka, H., Nakamura, S.: Recent Advances in WFST-based Dialog System. In: 10th Annual Conference of the International Speech Communication Association, pp. 268–271 (2009)

[6] Hori, C., Ohtake, K., Misu, T., Kashioka, H., Nakamura, S.: Weighted Finite State Transducer Based Statistical Dialog Management. In: IEEE Workshop on Automatic Speech Recognition & Understanding, pp. 490–495 (2009)

[7] Hurtado, L.F., Griol, D., Sanchis, E., Segarra, E.: A stochastic approach to dialog management. In: IEEE Automatic Speech Recognition and Understanding Workshop, pp. 226–231 (2005)

[8] Nagata, M., Morimoto, T.: First steps towards statistical modeling of dialogue to predict the speech act type of the next utterance. Speech Communication 15, 193–203 (1994)

[9] Higashinaka, R., Nakano, M., Aikawa, K.: Corpus-based Discourse Understanding in Spoken Dialogue Systems. In: 41st Annual Meeting on Association for Computational Linguistics, vol. 1, pp. 240–247 (2003)

[10] Higashinaka, R., Nakano, M.: Ranking Multiple Dialog States by Corpus Statistics to Improve Discourse Understanding in Spoken Dialog Systems. IEICE Trans. Inf. & Syst. E92.D(9), 1771–1782 (2009)

[11] Hori, C., Ohtake, K., Misu, T., Kashioka, H., Nakamura, S.: Statistical dialog management applied to WFST-based dialog systems. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4793–4796 (2009)

[12] Komatani, K., Kawahara, T.: Flexible mixed-initiative dialogue management using concept-level confidence measures of speech recognizer output. In: International Conference on Computational Linguistics, vol. 1, pp. 467–473 (2000)

[13] Hazen, T.J., Seneff, S., Polifroni, J.: Recognition confidence scoring and its use in speech understanding systems. Computer Speech and Language 16, 49–67 (2002)

[14] Lee, A., Shikano, K., Kawahara, T.: Real-time word confidence scoring using local posterior probabilities on tree trellis search. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. I-793–I-796 (2004)


Evaluation of Facial Direction Estimation from Cameras for Multi-modal Spoken Dialog System

Akihiro Kobayashi, Kentaro Kayama, Etsuo Mizukami, Teruhisa Misu, Hideki Kashioka, Hisashi Kawai, and Satoshi Nakamura

Spoken Language Communication Group, MASTAR Project, National Institute of Information and Communications Technology (NICT), Japan

{akihiro-k,kayama,etsuo.mizukami,teruhisa.misu,hideki.kashioka,hisashi.kawai,satoshi.nakamura}@nict.go.jp

http://www.nict.go.jp/

Abstract. This paper presents the results of an evaluation of image-processing techniques for estimating facial direction from a camera for a multi-modal spoken dialog system on a large display panel. The system is called the "proactive dialog system" and aims to present acceptable information in an acceptable time. It can detect non-verbal information, such as changes in gaze and facial direction as well as head gestures of the user during dialog, and recommend suitable information. We implemented a dialog scenario to present sightseeing information on the system. Experiments consisting of 100 sessions with 80 subjects were conducted to evaluate the system's efficiency. The system's clarity, in particular, was rated higher when the dialog contained recommendations.

Keywords: Facial direction estimation, Head detection, User interface.

1 Introduction

Image processing techniques for estimating face and gaze direction from cameras have been widely studied in recent years, and these techniques are used as a multimodal user interface because the directions are thought to indicate the user's attention [1,2]. It is, however, difficult to evaluate the efficiency of these techniques in multimodal applications because many factors influence users' impressions. In this paper, we developed a multi-modal dialog system for digital signage using image processing techniques, and evaluated the performance of the image processing and its efficiency in our application based on dialog corpora and videos of 100 sessions in which 80 subjects actually spoke to our system.

"Digital signage" is an advertising medium that selects and displays appropriate content in real time according to the users, and it has been actively studied in recent years [3]. Almost all such systems, however, display content one-sidedly or need explicit input devices such as touch panels. We expect natural interfaces to make these systems more user-friendly. For example, the ability for a user to draw out desirable information from a system via spoken dialogs and for the system to predict the user's interests and recommend appropriate information


would produce an ambient intelligence. The construction of such systems can lead to applications such as next-generation digital signage. Therefore we proposed a novel interactive information display system that realizes proactive dialogs between human and computer based on image processing techniques [4,5].

A "proactive dialog system" refers to a system that has the functionality of actively presenting acceptable information in an acceptable time, in addition to being able to respond adequately to queries [6]. The proposed system is based on spoken dialog. It is also able to detect non-verbal information, such as changes in gaze and facial direction and head gestures of the user during dialog, and to recommend appropriate information.

We constructed a prototype of this system with data and dialog scenarios for sightseeing guidance on Kyoto. Experiments were held with 80 subjects (100 sessions in total) to analyze user behavior during system use and to evaluate the system's usefulness and the performance of the image processing used in the system.

In this paper, we present the software and hardware architecture, the image processing technology for the system, and user evaluations. The hardware and software architecture and the details of the spoken language recognition and display control parts are described in Section 2. The image processing that detects users and estimates gaze and facial directions is explained in Section 3. The application implemented in this system is described in Section 4. The experiment to evaluate the system's ability is reported in Section 5.

2 System Architecture

2.1 Outline of Total System

This section discusses the prototype of a spoken dialog system with a plasma display panel that we constructed as a system integrating non-verbal information recognition and spoken dialog.

The system is constructed on the premise of being fixed in a public space, such as a tourist information office, and presenting information to a general audience. The main input interface is assumed to be spoken language, and image processing is used to enhance dialog quality by estimating user interests. The output interface utilizes a wide screen divided into four windows displaying a range of information. The character shown in Fig. 2 appears on the screen, explains the displayed content, and controls the dialog via speech synthesis.

2.2 Hardware

The prototype of the proposed system we constructed is shown in Fig. 1. It consists of the following parts:

50-inch plasma display panel (PDP). A 70-cm wide and 120-cm high portrait display on an 80-cm-high base with a resolution of 1080 × 1920.

Three pose-controllable monocular cameras. Grasshopper cameras made by Point Grey Research Inc., with attached lenses so that the face of a user is correctly placed in view.


Fig. 1. Spoken dialog system on plasma display panel

Fig. 2. Screen example of PDP information dialog system

Stereo vision camera. A Bumblebee2, also made by Point Grey Research Inc., with a horizontal angle of view of about 60 degrees.

Directional microphone. A CS-3e made by SANKEN.

Loudspeaker. A PN-AZ10 made by Sony.

2.3 Software

The software modules are divided into these four function types:

– Image processing
– Speech recognition and parsing
– Dialog control
– Display control and speech synthesis

The image processing algorithms implemented in the system are human and head area detection and face and gaze direction estimation. The details of the algorithms are described in the next section, and the dialog control, which includes the integration method for image and speech inputs, is described in Section 4.

The speech recognition and parsing are explained in this section. First, voice activity detection (VAD) is performed on the input audio signal and the uttered part is cut out. This part is sent to a module that contains ATRASR [7], which was developed by ATR as a speech recognition engine, and performs speech recognition and parsing. The parsed results are finally sent to the dialog control.


The display control outputs a screen as in Fig. 2. In principle, the screen is divided into two or four windows, with each window rendered using HTML. In Fig. 2, a brief description of Kinkaku-ji temple is displayed on the upper left, a list of restaurants near Kinkaku-ji on the lower left, and details of a restaurant on the lower right. The character agent is displayed in the center of the lower right window. She performs several actions as a virtual conversational partner. She is equipped with lip synchronization: the shape of her mouth is generated based on the vowel, in cooperation with the speech synthesis.

3 Non-verbal Information Processing via Images

3.1 Detection of Head Area

Candidate human head areas are detected from the stereo vision camera with the following processes (a rough sketch of the slab-clustering step is given after the list):

1. Construct a three-dimensional occupancy grid (10-cm cells)
2. Divide the grid into 20-cm height slabs and cluster the occupied areas
3. Segment each person area via an adaptation of the crossing hierarchy method [8]
4. Detect candidate areas of the human head
5. Evaluate and filter the candidates by assessing their plausibility as heads
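The sketch below conveys only the occupancy-grid and slab-labeling flavor of this procedure, with an invented grid size and a random point cloud; it is not the authors' implementation, and the inherit/split/merge bookkeeping across slabs described below is only indicated in the comments.

# Rough sketch of the slab-wise clustering step (crossing-hierarchy flavour):
# fill a 10 cm occupancy grid from a stereo point cloud, then label connected
# regions inside overlapping 20 cm height slabs from top to bottom.
# Grid extent and the point cloud are dummies, not the system's data.
import numpy as np
from scipy import ndimage

CELL = 0.10                                          # 10 cm occupancy cells
points = np.random.rand(5000, 3) * [3.0, 3.0, 2.0]   # x, y, z in metres (dummy cloud)

# 3-D occupancy grid: True where at least one point falls in the cell
idx = np.floor(points / CELL).astype(int)
grid = np.zeros((30, 30, 20), dtype=bool)
grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True

# overlapping 20 cm slabs, stepping down 10 cm at a time (2.0-1.8 m, 1.9-1.7 m, ...)
for top in range(grid.shape[2] - 1, 0, -1):
    slab = grid[:, :, top - 1:top + 1].any(axis=2)   # two 10 cm cells = 20 cm of height
    labels, n = ndimage.label(slab)                  # connected x-y regions in this slab
    # In the full method each region is matched against the regions of the slab
    # above (inherit / split / merge) so that one cluster per person survives,
    # and the top ~30 cm of each person column becomes the head candidate.
    if n:
        print(f"slab with top at {top * CELL:.1f} m: {n} candidate region(s)")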

Fig. 3. Person detection from stereo images

An example of the processing is shown in Fig. 3. In the crossing hierarchy method, three-dimensional space is divided into overlapping plates: for example, 20-cm plates at heights of 180–160 cm, 170–150 cm, and 160–140 cm. Areas that are determined to be a human region in the upper unit are then propagated to


the lower units, and candidate areas of the human region are decided sequentially. In this way entire human regions are extracted. The original method used multiple stereo vision cameras, but this system uses only one. Moreover, the camera of our system is mounted in plan view, which causes serious occlusion but makes installation easy.

In this system, clustering is first performed on each plate. A distinction is then made for each cluster based on whether it inherits a cluster from the upper plate or appears as a new region. If a cluster inherits multiple regions it is divided appropriately, and if multiple clusters inherit a single region they are integrated, if necessary. These processes enable robust detection of each human region when there are occlusions, close proximity, and contact between multiple people. The center part of Fig. 3 shows the 18 plates, at heights ranging from 200–180 cm down to 30–10 cm, used in this system. The PDP is at the lower center of each square. Black regions contain no objects. Grey regions indicate invisible areas that are out of view or occluded. The other colored regions are candidates for human regions, and the same color indicates the same person.

After that, the upper region, extending about 30 cm down from the top of the human region, is detected as a candidate area for the human head. Moreover, the possibility that each candidate area is a human head is evaluated based on its height, distance from the PDP, size, shape, and the head position in the previous frame. Areas that exceed an a priori threshold are finally decided to be human heads.

3.2 Facial Direction Estimation

The system controls the three high-resolution monocular cameras to capture the human head regions obtained from the above-mentioned processes. Facial direction is then estimated by the individual monocular cameras as follows (Fig. 4), and the mean over the cameras is used to determine where a user is looking:

1. Detect face regions. In this system, images of 800 × 600 pixels are input at 15 frames per second. If the system fails to detect a face or to track facial parts in the image, a facial detection routine using Haar-like features is executed in the next frame.

2. Detect and track facial parts. The system detects 45 feature points on the face by using an active appearance model (AAM); the initial values of the point coordinates are the values of the previous frame (if facial parts were also detected in the previous frame) or a priori values (if the face is newly detected) [9]. AAM is a method that applies principal component analysis to the vector consisting of the coordinates of the facial-part features in the image and the intensities of the pixels of the face region. The correlation between changes in the feature point locations and changes in appearance is learned, which enables tracking of non-rigid objects such as facial parts.

3. Estimate facial direction. Six degrees of freedom (DOF) of the facial direction, i.e., three in rotation and three in translation, are estimated with the steepest descent method by fitting the three-dimensional coordinates of each feature in an a priori three-dimensional face-shape model to the coordinates calculated in the previous step.


Fig. 4. Facial direction estimation

4. Estimate gaze direction. A candidate for the iris region is obtained by binarizing the eye regions and fitting an ellipse. Then, using the three-dimensional coordinates of the center of the eyeball obtained in the previous step and the coordinates of the center of the iris, calculated from the coordinates of the iris-region candidate in the image and the facial direction, the gaze direction is estimated as the direction of the line through these two points [10].

We use facial direction to determine where a user is looking because iris detection is not robust given the variety of individual eye shapes and lighting noise in the real world. We define a base-line of facial direction from the center of the PDP to the facial gravity point of a user who is looking at the center of the PDP. The blue lines in Fig. 4 show the base-lines. The system calculates the motion of the base-line from the six DOF. Finally, the system identifies the point where the base-line crosses the PDP plane.
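The following sketch shows one way such a base-line (a ray from the head along the estimated facial direction) can be intersected with the display plane and mapped to one of the four windows. The coordinate frame, panel dimensions, and quadrant layout are assumptions made for the example, not the system's actual calibration.

# Sketch: intersect the facial-direction ray with the display plane and decide
# which of the four windows it hits. Assumed frame: the display lies in the
# plane z = 0 with its centre at the origin; sizes follow the 70 x 120 cm panel.
import numpy as np

def gaze_point_on_display(head_pos, direction):
    """Intersect the ray head_pos + t * direction (t > 0) with the plane z = 0."""
    p = np.asarray(head_pos, dtype=float)
    d = np.asarray(direction, dtype=float)
    if abs(d[2]) < 1e-9 or -p[2] / d[2] <= 0:
        return None                               # looking parallel to or away from the panel
    t = -p[2] / d[2]
    return (p + t * d)[:2]                        # (x, y) on the display plane

def which_window(xy, width=0.70, height=1.20):
    """Map a display-plane point to one of the four quadrant windows."""
    if xy is None or abs(xy[0]) > width / 2 or abs(xy[1]) > height / 2:
        return None                               # off the panel
    col = "left" if xy[0] < 0 else "right"
    row = "upper" if xy[1] > 0 else "lower"
    return f"{row} {col}"

# user 1 m in front of the panel, face turned slightly down and to the left
print(which_window(gaze_point_on_display([0.0, 0.3, 1.0], [-0.1, -0.3, -1.0])))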

4 Application

4.1 Acceptable Queries

We implemented a Kyoto sightseeing guide scenario on the prototype system, based on a system for a portable PC [11]. Examples of acceptable queries on the system are as follows:

1. "Show me sightseeing spots that are famous for <determinant>."
2. "Show me <sightseeing spot>."
3. "Show me <subject> of <sightseeing spot>."
4. (If a search result list is displayed) "Show me details on the n-th item."


We assigned cherry blossoms, autumn foliage, and gardens as determinants. A bus schedule, how to go to the site, a map, restaurants near the site, and so on are prepared as subjects in query 3. When these words are recognized, the system displays the corresponding contents. If the sightseeing spot is omitted, the system infers that the spot mentioned just before is the current sightseeing spot.

The system has a database of about 2,000 sites. If a user utters words not contained in the database, the system searches by treating the words as keywords on Google and displays the results in list form. If the user then utters a number such as "4," the website nominated by the user is displayed.

4.2 Recommendation Based on Non-verbal Information

A typical function of this dialog system is recommendation dialog based on non-verbal information. The system shows information in four or two windows on the PDP as shown in Fig. 2, and it automatically recommends contents by estimating which window the user is looking at when the user does not know what to say.1 Finally, the system transits the state automatically with a system utterance, "I will explain this content." Fig. 5 shows the details of the dialog states.

Fig. 5. Dialog state transition

1. Initial state: four determinants are displayed. The "determinants" are displayed on the four-section screen.

2. Four sightseeing spots are displayed. In this mode, pictures and a brief explanation of four spots that fulfill the condition selected in step 1 are randomly selected and displayed.

1 The system determines the timing for a recommendation from the time span between the utterances of the user and the system.


3. Four content names are displayed. In this mode, an abstract of the sightseeing spot selected in step 2 is displayed in the upper left of the screen. In addition, the types of information searchable in the system, such as "how to go to the site," "map," and "restaurants near the site," are displayed.

4. Two or four contents are displayed. Contents means a map of the area around the site of the current topic, a list of restaurants near the site, details of one restaurant on the list, the bus schedule from Kyoto station to the spot, and so on. When the system estimates that a user utters a query requesting content that is not prepared, the result list from a Google search for the keywords contained in the utterance, and the first-ranked website on the list, are displayed.

These states are supposed to transit from 1 to 2, 2 to 3, and 3 to 4 as the standard flow. Regardless of the current state, however, uttering a determinant causes a transition to state 2, uttering something such as "Show me <sightseeing spot>" (where the sightseeing spot is different from the current topic) causes a transition to state 3, and uttering something such as "Show me <subject> of <sightseeing spot>" causes a transition to state 4. If a human is not detected by the image processing for a certain period or the user says "Thank you," the system is reset and returns to state 1.
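A highly simplified sketch of this transition policy is given below; the keyword checks stand in for the system's actual speech understanding and are only meant to make the state logic concrete.

# Sketch of the state-transition policy described above (states 1-4).
# The keyword tests are illustrative stand-ins for the real parser.
def next_state(current, utterance, person_detected=True):
    utterance = utterance.lower()
    if not person_detected or "thank you" in utterance:
        return 1                                   # reset to the initial screen
    if any(d in utterance for d in ("cherry blossoms", "autumn foliage", "gardens")):
        return 2                                   # a determinant was uttered
    if utterance.startswith("show me sightseeing spot"):
        return 3                                   # a (different) sightseeing spot was named
    if utterance.startswith("show me"):
        return 4                                   # a subject (map, restaurants, ...) was requested
    return min(current + 1, 4)                     # default 1 -> 2 -> 3 -> 4 progression

print(next_state(2, "Show me restaurants near Kinkaku-ji"))   # -> 4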

5 Experiments

5.1 Outline of Experiments

We conducted the following real-world experiments with the system in December 2009. The number of sessions was 100, consisting of 70 subjects who used the system only once and 10 subjects who used it three times on different days. We conducted two types of experiments. The first experiment is an evaluation of the usability of this application: subjects were instructed to search for sightseeing information and choose a site they wanted to go to by using the system freely. The total duration of these 100 sessions was about 22 hours, and the system recognized 4,807 utterances by users during the experiments. The second experiment was an evaluation of the image processing performance: subjects were instructed to look at 24 markers shown on the display.

5.2 Performance of Image Processing

To evaluate the performance of the facial direction estimation described in Sec. 3.2, we instructed subjects to look at markers whose positions are known, and compared these known positions with the recognized positions where users were looking. Each subject stood 1 meter away from the PDP. The PDP sequentially displayed each of 24 markers (width 4 cm × height 6 cm) at 15-cm intervals. Each marker was 1 cm in size and displayed for 2 seconds.

We defined as a success the case in which the system correctly recognized the window (one of the four windows shown in Fig. 5) the user was looking at, because the application uses this information for the recommendation, as described in Section 4. We calculate the success rate for each of the 100 sessions. Fig. 6 shows the distribution of the success rate.



Fig. 6. Number of Subjects at Each Success Rate

In most of the subjects, the success rate was between 20% and 30%, as shown in Fig. 6.

We analyzed the main reasons for these failures. In Fig. 7 and Fig. 8 we drew the markers as small circles and plotted the recognized destinations of user gazes. Fig. 7 shows the result for a subject with a success rate exceeding 90%, while Fig. 8 shows the result for a subject with a 20–30% success rate. As shown in these figures, some users did not move their head when they looked at the markers.

As these results show, improvements in both recognition and instruction are needed. For example, the system could use not only information about the absolute position but also the flow of motion in the recognition process. In addition, we think this system requires some instruction to direct the user's head motions, such as motions of the character agent.

5.3 Evaluation of Application

We implemented recommendations based on facial direction during periods of no utterance as an application demonstrating the integration of the spoken dialog system and the image processing of non-verbal information. To evaluate the effects of this function, 40 sessions were conducted with recommendations and 60 without. The system recommends information if a user makes no utterance for a certain period, with the threshold set to 8 or 10 seconds.

Experiments were carried out as follows:

1. The subject is given an overview of the system and a typical dialog between a user and the system.


Fig. 7. Distribution of recognized facial directions of a user who moves their head

Fig. 8. Distribution of recognized facial directions of a user who moves their eyes

2. The subject asks the system six questions, instructed by an assistant, to which the system can respond. The assistant supports the subject. (about 2 minutes)

3. The subject freely uses the system unassisted. (about 10 minutes)

4. The subject answers a questionnaire.

During the 40 sessions the system made recommendations 85 times, as described in Section 4.2. Classifying whether the user utterance just after a recommendation is related to the recommended sites gives:

– Continued the recommended topic (A): 25
– Uttered another topic (R): 41
– Reset the system (I): 19

The recommendation acceptance rate is A/(A+R) = 37.9%. If we regard case I as a refusal of the recommendation as well, A/(A+R+I) = 29.4%.2
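The two rates quoted here follow directly from the counts above:

# Acceptance rates from the classification counts above
A, R, I = 25, 41, 19
print(f"A/(A+R)   = {A / (A + R):.1%}")      # 37.9%
print(f"A/(A+R+I) = {A / (A + R + I):.1%}")  # 29.4%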

Moreover, we analyzed answers for the questionnaire to evaluate users’ im-pressions of this system recomendation [12]. Table 1 shows the results. Thequestionnaire is modified and translated ITU-T Rec. P.851 questionnaire, andcontains four-level score (-1.5, -0.5, 0.5, 1.5) of the whole system and individualitems such as its potential, along with free-form entry about their impressionof the system. We analyzed the questionnaire of 54 sessions with an adequacy2 In this experiment, subjects were required to say (I) when the subjects were satisfied

² In this experiment, subjects were required to say (I) when they were satisfied with a site displayed on the PDP. Therefore, we could not decide whether this type of utterance was positive or negative.


Table 1. Evaluation of recommendations

Question                                                    R (N=21)          NR (N=33)         F(1,52)
                                                            Avg      SD       Avg      SD
You knew at each point of the dialogue what the
system expected from you.                                   0.119    0.973    -0.621   0.857    9.611   p<0.01**
The system's behavior was always as expected.              -0.595    0.768    -1.046   0.617    5.638   p<0.05*
You prefer a human operator.                               -0.310    0.814    -0.712   0.781    3.302   p<0.1+

In the sessions of most subjects with a low adequacy response rate, the dialogs were not smooth due to a low speech recognition rate, so we judged their answers to be inadequate for evaluating the recommendations. Among the 54 sessions, 21 were in recommendation mode (R) and 33 were not (NR). Analyzing the answers to each question, the evaluation values for system R are mostly higher than those for system NR. Table 1 shows the average (Avg), standard deviation (SD), and F-test values computed from the scores of the 54 sessions; the differences are especially notable for the items listed in Table 1.

This result indicates that the system's clarity rises when the dialog provides recommendations. However, subjects were often confused by the recommendations. The suspected reasons are that the delay before the system recommends information is too short, and that the system did not distinguish between a user who was perplexed about selecting the information displayed and a user who was puzzled about how to use the system; it uniformly treated both as the former case and made recommendations.

6 Conclusion

We constructed a prototype of a proactive spoken dialog system, a smart interactive information presentation system that incorporates non-verbal information recognition into a spoken dialog system. The system consists of a 50-inch plasma display panel, a microphone for voice input, and cameras (three monocular and one stereo) for image input.

We implemented a Kyoto sightseeing guide scenario on the system, in which the dialog progresses mainly by spoken language. When there is no utterance for a given duration, the system recommends information by estimating the user's interests from the facial direction estimated from the image information.

We conducted an experiment on the system with 100 sessions to evaluate its efficiency. The head detection rate was almost 100%, but the success rate of detecting where a user was looking was about 30%. This seems to be the reason why the recommendation acceptance rate was about 30%. Evaluations of the system with recommendations, however, were mostly higher than those without recommendations. In particular, the evaluation of the system's clarity increases when the dialog provides recommendations.


Future work includes improving the recognition rates for speech and images, adding a function for estimating user reactions to the system's responses, such as positive or negative, and constructing a system that better integrates non-verbal information recognition and spoken dialog.

References

1. Kobayashi, Y., Sugimura, D., Sato, Y., Hirasawa, K., Suzuki, N., Kage, H., Sugimoto, A.: 3D Head Tracking using the Particle Filter with Cascaded Classifiers. In: Proc. British Machine Vision Conference (BMVC 2006), pp. 37–46 (2006)

2. Fujie, S., Yamahata, T., Kobayashi, T.: Conversation robot with the function of gaze recognition. In: Proc. 2006 IEEE-RAS Int'l Conf. on Humanoid Robots (Humanoids 2006), pp. 364–369 (2006)

3. Lao, S., Yamaguchi, O.: Facial Image Processing Technology for Real Applications: Recent Progress in Facial Image Processing Technology. IPSJ Magazine 50(4), 319–326 (2009)

4. Kobayashi, A., Kayama, K., Lee, D., Sumi, K., Kato, T., Kadobayashi, R., Yamazaki, T.: Proposition of Proactive Information Display System Using Face Directions and Head Gestures Estimation. In: Proc. of the IEICE General Conference (2009)

5. Minakuchi, M., Asano, S., Satake, J., Kobayashi, A., Hirayama, T., Kawashima, H., Kojima, H., Matsuyama, T.: Mind Probing: Active Stimulation of Gaze Patterns for Inference of User's Interest. Information Processing Society of Japan (IPSJ) SIG Technical Reports (Human-Computer Interaction, HCI) 125, 1–8 (2007)

6. Kawahara, T., Kawashima, H., Hirayama, T., Matsuyama, T.: "Automated Information Concierge" based on Proactive Dialog and Information Retrieval. IPSJ Magazine 49(8), 912–918 (2008)

7. Itoh, G., Ashikari, Y., Jitsuhiro, T., Nakamura, T.: Summary and evaluation of speech recognition integrated environment ATRASR. In: Proc. of 2005 Acoustic Society of Japan Fall Meeting, pp. 221–222 (2005)

8. Yoda, I., Sakaue, K.: Concept of Ubiquitous Stereo Vision and Applications for Human Sensing. In: Proc. 2003 IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA 2003), pp. 1251–1257 (2003)

9. Kobayashi, A., Satake, J., Hirayama, T., Kawashima, H., Matsuyama, T.: Person-Independent Face Tracking Based on Dynamic AAM Selection. In: IEEE Int. Conf. on Automatic Face and Gesture Recognition (FG) (2008)

10. Satake, J., Kobayashi, A., Hirayama, T., Kawashima, H., Matsuyama, T.: Accuracy Improvement of Real-Time Gaze Estimation using High Resolution Camera. Technical report of IEICE, PRMU 107(491), 137–142 (2008)

11. Kashioka, H., Misu, T., Ohtake, K., Hori, C., Nakamura, S.: Development of dialog system keeping step with users. Technical report of IPSJ, SLP 2008(68), 93–97 (2008)

12. Mizukami, E., Kashioka, H., Kawai, H., Nakamura, S.: A study toward an evaluation method for spoken dialogue systems with considering criteria of users. In: Proc. of 2nd International Workshop on Spoken Dialogue Systems Technology (IWSDS 2010) (2010) (to appear)


D3 Toolkit: A Development Toolkit for Daydreaming Spoken Dialog Systems

Donghyeon Lee, Kyungduk Kim, Cheongjae Lee, Junhwi Choi, and Gary Geunbae Lee

Department of Computer Science and Engineering, Pohang University of Science and Technology

{semko,getta,lcj80,chasunee,gblee}@postech.ac.kr

Abstract. Recently, various data-driven spoken language technologies have been applied to spoken dialog system development. However, the high cost of maintaining spoken dialog systems is one of the biggest challenges. In addition, a fixed corpus collected by humans is never enough to cover the diverse utterances of real users. The concept of a daydreaming dialog system can solve this problem by making the system learn from previous human-machine dialogs. This paper introduces the D3 (Daydreaming Dialog system Development) toolkit, which is a back-end support toolkit for the development of daydreaming spoken dialog systems. To reduce human effort, the D3 toolkit generates new utterances with semantic annotations and new knowledge by analyzing the usage log file. The newly added corpus is determined by verifying proper candidates using semi-automatic methods. The augmented corpus is used for building improved models, and self-evolution of the dialog system is made possible by replacing the old models. We implemented the D3 toolkit using web-based technologies to provide a familiar environment for non-expert end-users.

Keywords: Spoken Dialog System, Statistical NLP, Daydreaming Computer, Failure-driven Learning, Dialog Development Toolkit.

1 Introduction

In recent years, data-driven spoken dialog systems have been studied by many researchers. In general, spoken dialog systems (SDS) consist of three major components: automatic speech recognition (ASR), spoken language understanding (SLU) and dialog management (DM). To build each model, developers need to prepare an annotated dialog corpus and domain-specific knowledge. However, corpus preparation requires tedious human effort.

To reduce this laborious work, several development toolkits such as SUEDE [1], the CSLU Toolkit [2] and DialogDesigner [3] have been developed to help developers design systems rapidly. For example, Jung et al. [4] developed Dialog Studio to reduce engineering work and to upgrade all components, including ASR, SLU, and DM, together. Nevertheless, problems remain when using and managing data-driven spoken dialog systems in the field: a fixed corpus collected by humans is never enough to cover a real user's utterance patterns.


Data-driven user simulation techniques are widely used for learning optimal dialog strategies in a statistical dialog management framework and for automated evaluation of spoken dialog systems [5]. User simulation is an alternative way to resolve common weaknesses of dialog systems such as the scarceness of the training corpus and the cost of evaluations made by real users. However, the problem with data-driven user simulation is the limitation of user patterns. The response patterns of a data-driven simulated user tend to be limited to the training data, even when exploration algorithms are used to find unseen patterns. Therefore, the patterns lack reality and it is not easy to simulate unseen user behavior.

These problems can be addressed by the concept of a daydreaming computer [6]. A daydreaming computer should not sit idle when left unused by users, but should daydream. Learning from experience, one important role of daydreaming, can be applied to the SDS. The daydreaming dialog system can therefore keep trying to learn from failures and successes while the users are not using the system. We developed a support tool for daydreaming dialog system development, called the D3 (Daydreaming Dialog Development) toolkit, to help the developer find real users' new utterance patterns.

This paper is organized as follows: Section 2 introduces the concept of the daydreaming dialog system. The D3 toolkit components and detailed strategies are proposed in Section 3. Preliminary experiments and the implementation of the D3 toolkit are described in Section 4. Finally, Section 5 draws conclusions and outlines future work.

2 Daydreaming Dialog System

An intriguing aspect of humans is that we spend much time engaged in thought not directly related to the current situation or environment. This type of thought, usually called daydreaming, involves recalling or imagining personal or vicarious experiences in the past or future. For both humans and computers, Mueller and Dyer [6] discussed the following functions of daydreaming: support for creativity, future planning and rehearsal, learning from experience, emotion modification and motivation.

The daydreaming computer takes situational descriptions as input, and produces as output the actions it would perform in a given situation as well as several daydreams generated while the computer is idle. The daydreaming computer learns as it daydreams by indexing daydreams, planning strategies, and future plans into memory for future use.

We believe that the concept of daydreaming can be extended to the SDS because daydreaming enables the SDS to learn from successes and failures. While the SDS is not processing user utterances, self-evolutionary modules in the daydreaming dialog system can learn to process the problematic utterances in the background. This daydreaming process corrects the problematic situations so that the system can generate appropriate responses when it faces the same situation again in the future.

For example, the user says "I am travelling for sightseeing", but errors occur due to wrong predictions in the ASR, SLU or DM modules.


Fig. 1. Overview of daydreaming spoken dialog system architecture

While the SDS is idle, the system automatically detects and corrects the errors and updates all models so that it can smoothly process the same utterance in the future. After daydreaming, the system can generate an appropriate response, such as "How long are you staying?", when the user says the same utterance again.

Fig. 1 illustrates an overview of the daydreaming spoken dialog system architecture. The log manager creates log files, which are considered the system's experiences, including successes and failures. The user manager is used to manage user accounts and load the models for a user. When the system becomes idle, it starts a daydreaming process. Daydreaming is performed by a self-evolutionary process, which involves analyzing the human-machine dialog log, trying to extract the patterns, and updating the models to learn from successes and failures. Our D3 toolkit supports this self-evolutionary process of the daydreaming dialog system so that it can be easily carried out by a real user.

3 Daydreaming Dialog System Development Toolkit

Dialog Studio provides a convenient way to prepare a dialog corpus and update all models without large human effort. However, it is still time-consuming at the annotation step, because the developer must annotate new utterances manually to add new patterns, without any suggestion of the most probable semantic tags.

To address this problem, we have developed the D3 support toolkit for easily maintaining an SDS. The D3 toolkit supports three main functions:


Fig. 2. Procedures of the D3 Toolkit

1) finding the out-of-patterns that cause errors in human-machine dialog logs, 2) suggesting their transcriptions for training a language model, and 3) tentatively annotating semantic tags.

3.1 Procedures of the D3 Toolkit

As shown in Fig. 2, the D3 toolkit starts by loading a set of log files generated when real users talked to the dialog system in the past. Log analysis is divided into three parts: the ASR, SLU, and knowledge parts. The log analyzer generates candidates that involve system errors caused by out-of-patterns. The developer is only required to verify and modify the auto-generated candidates, as in active learning [7]. The new utterance patterns are added to the existing corpus, and all models are then retrained.

3.2 Logging Step

All processed human-machine dialogs are saved as log files, in XML format, by the log manager of the SDS. An example of a log file is illustrated in Fig. 3. Each module of the SDS generates information that is used in the analysis step. The ASR module provides the recognized text (RESULT), a confidence score (CONFIDENCE) and a wave filename (WAVE). The SLU module provides a semantic structure (DIALOG_ACT, MAIN_GOAL, SLOT) and a confidence score (SLU_CONFIDENCE). The DM module records the discourse history, such as the previous user intention (PREV_DA, PREV_MG, PREV_SL), the previous system action (PREV_SA), the filled slots (FILLED_SLOT) and the current system action (SYSTEM_ACTION). The FILLED_SLOT field indicates which slots have been filled by the user during the current dialog.
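
The log entries can be consumed directly by the analysis step; the following minimal sketch parses one <UTTERANCE> element of the format shown in Fig. 3. The dictionary keys, the example filename and the assumption of a single root element wrapping the utterances are ours, not part of the toolkit's API.

import xml.etree.ElementTree as ET

def parse_utterance(elem):
    """Extract the fields of one <UTTERANCE> element of a D3 log file (cf. Fig. 3)."""
    asr = elem.find("ASR")
    slu = elem.find("SLU")
    dm = elem.find("DM")
    return {
        "id": elem.get("id"),
        "time": elem.get("time"),
        "wave": asr.findtext("WAVE"),
        "asr_result": asr.findtext("RESULT"),
        "asr_confidence": float(asr.findtext("CONFIDENCE")),
        "dialog_act": slu.findtext("DIALOG_ACT"),
        "main_goal": slu.findtext("MAIN_GOAL"),
        "slots": {s.get("name"): s.text for s in slu.findall("SLOT")},
        "slu_confidence": float(slu.findtext("SLU_CONFIDENCE")),
        "prev_system_action": dm.findtext("PREV_SA"),
        "filled_slots": (dm.findtext("FILLED_SLOT") or "").split(","),
        "system_action": dm.findtext("SYSTEM_ACTION"),
    }

# Usage (hypothetical file name):
# utterances = [parse_utterance(u) for u in ET.parse("session.xml").getroot().iter("UTTERANCE")]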

3.3 Analysis Step

ASR Evolution. Traditional ASR requires both a language model (LM) and an acoustic model (AM). In our system, we use a domain-specific language model and a generic acoustic model to build the spoken dialogue system.


<UTTERANCE id="5" time="2010-06-01 21:39:23">
  <ASR>
    <WAVE>/wav/result005.wav</WAVE>
    <RESULT>오늘 21시에는 무슨 프로그램 하지?</RESULT>
    <CONFIDENCE>0.8587191</CONFIDENCE>
  </ASR>
  <SLU>
    <DIALOG_ACT>wh_question</DIALOG_ACT>
    <MAIN_GOAL>search_program</MAIN_GOAL>
    <SLOT name="date">today</SLOT>
    <SLOT name="time">21:00</SLOT>
    <SLU_CONFIDENCE>0.87671</SLU_CONFIDENCE>
  </SLU>
  <DM>
    <PREV_DA>request</PREV_DA>
    <PREV_MG>search_program</PREV_MG>
    <PREV_SL>date,genre</PREV_SL>
    <PREV_SA>inform(program)</PREV_SA>
    <FILLED_SLOT>date,time</FILLED_SLOT>
    <SYSTEM_ACTION>inform(program)</SYSTEM_ACTION>
  </DM>
</UTTERANCE>

Fig. 3. Example of the log file

ASR evolution involves LM adaptation and AM adaptation (Fig. 4). In Fig. 4, ASR-1 represents the ASR of the SDS and ASR-2 represents the ASR of the toolkit. They are the same module except that they use different LMs.

To learn out-of-patterns, our toolkit first tries to find mis-recognized utterances in the log file. Out-of-patterns may be recognized incorrectly because they were never seen when training the LM of ASR-1. We use the ASR confidence score [8] as a preliminary filter to detect out-of-patterns, because confidence scores are commonly used to detect erroneous utterances: an out-of-pattern is detected by the filter if its utterance-level ASR confidence score is lower than a pre-defined threshold.

To obtain the most probable transcriptions, the out-of-patterns are recognized again by ASR-2, whose LM was trained on an extended corpus. The extended corpus contains the basic dialog corpus of the SDS, a general conversational corpus, and web data related to the domain, in order to cover a variety of utterance patterns. In ASR-2 we can use a larger LM because this process is off-line and the decoding time is not critical.

After re-recognizing the out-of-patterns, the toolkit extracts the re-recognized utterances with a high confidence score. These may now be recognized correctly although they were mis-recognized before. These utterances are used to train a new LM for ASR-1, so that ASR-1 will recognize them correctly when it faces similar patterns.
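
A minimal sketch of this filtering and re-recognition loop is given below; the two thresholds and the asr2_recognize interface are assumptions chosen for illustration, not the toolkit's actual values or API.

LOW_CONF = 0.5    # below this, an ASR-1 result is treated as a possible out-of-pattern
HIGH_CONF = 0.8   # above this, an ASR-2 re-recognition becomes an LM training candidate

def collect_lm_candidates(logged_utterances, asr2_recognize):
    """Return (transcription, wave) pairs proposed for retraining the ASR-1 LM."""
    candidates = []
    for utt in logged_utterances:                    # entries parsed from the log file
        if utt["asr_confidence"] >= LOW_CONF:
            continue                                 # ASR-1 already handled it well
        text, conf = asr2_recognize(utt["wave"])     # off-line pass with the extended LM
        if conf >= HIGH_CONF:
            candidates.append((text, utt["wave"]))   # likely correct transcription now
    return candidates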


Fig. 4. ASR evolution flow

In addition, well-recognized utterances in the log file are also used to adapt the AM using the HTK toolkit [9], because acoustic adaptation to a specific speaker is important for increasing recognition accuracy.

SLU Evolution. The flow of SLU evolution is illustrated in Fig. 5. For SLU evolution, we consider three cases:

Case 1: low ASR confidence score
Case 2: high ASR confidence score and low SLU confidence score
Case 3: high ASR confidence score and high SLU confidence score

In Case 1, the new utterances can be extracted by ASR evolution. In Case 2, errors occur in the current SLU model. Therefore, the utterances in Cases 1 and 2 should be newly annotated with appropriate semantic tags to train the statistical SLU model. In our system, a hybrid approach to SLU [10] is used to extract the semantic frames of the user's utterances.

The hybrid intention recognizer in Fig. 5 is based on a hybrid model that combines an utterance model and a dialog context model. In the SDS, the utterance model is usually used by the SLU to predict the user intention from the utterance itself. The dialog context model predicts the probable user intention given the current dialog context. For the dialog context model, we use a CRF model trained on the dialog corpus of the SDS. We use the following features for this discourse-based model: previous dialog act, previous main goal, previous component slots and previous system action. For each utterance, this information can be obtained from the log file. The hybrid model merges hypotheses from the current SLU model with hypotheses from the dialog context model to find the best overall match to the user intention.
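
The paper does not specify how the two hypothesis sets are combined; the sketch below assumes a simple linear interpolation (weight LAMBDA) purely to illustrate the merging step.

LAMBDA = 0.7   # assumed weight of the utterance (SLU) model vs. the dialog context model

def merge_intention_hypotheses(slu_hyps, context_hyps):
    """slu_hyps / context_hyps: dicts mapping an intention (dialog act, main goal)
    to a normalized score.  Returns the best combined intention."""
    combined = {}
    for intention in set(slu_hyps) | set(context_hyps):
        combined[intention] = (LAMBDA * slu_hyps.get(intention, 0.0)
                               + (1.0 - LAMBDA) * context_hyps.get(intention, 0.0))
    return max(combined, key=combined.get)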


Fig. 5. SLU evolution flow

In Case 3, we treat the SLU results as machine-labeled utterances for the unlabeled utterances, as in semi-supervised learning [11].

Knowledge Database. One of the most important resources in building an SDS is the knowledge database (KDB), which contains the domain-specific information (e.g. the TV program schedule) to be provided or suggested to the user. The SDS sometimes cannot find the requested information because the KDB does not contain what the user said. For example, the user says "I want to watch the drama 'Cinderella's Sister'", but the system does not provide any information because the new program 'Cinderella's Sister' was not added previously.

In general, the KDB is used to build the ASR and SLU modules because they need the set of slot values to be recognized and extracted. However, users do not know which items are included in the KDB; therefore, they often utter out-of-vocabulary (OOV) terms as slot values. In this case, the daydreaming dialog system has to detect the OOV terms so that it can handle the unseen items next time.

In our toolkit, OOV terms are detected using a word confidence scoring method [12]. In this method, a search network for recognition is constructed using the utterance patterns, and a keyword slot is assumed to be an OOV term if its word confidence score is low. Next, a phonetic recognizer transcribes the phonemes of the OOV term, and a phoneme-to-grapheme conversion module generates n-best keywords. After that, the new knowledge is extracted using a knowledge importer, which is a domain-specific module that extracts structured information from external


Fig. 6. A strategy of knowledge acquisition for OOV terms

knowledge sources such as the web. Fig. 6 shows the strategy of knowledge acquisition for OOV terms.
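
The following sketch strings these steps together; the confidence threshold and all component functions (phoneme recognition, phoneme-to-grapheme conversion, the knowledge importer) are placeholders standing in for the modules described above.

OOV_CONF = 0.4   # assumed word-confidence threshold for flagging a slot value as OOV

def acquire_oov_knowledge(keyword_slots, recognize_phonemes, phoneme_to_grapheme,
                          knowledge_importer, n_best=5):
    """For each low-confidence keyword slot, propose n-best graphemic keywords and
    try to import structured knowledge (e.g. a TV program entry) for them."""
    new_entries = []
    for slot in keyword_slots:                       # each slot: {"wave": ..., "confidence": ...}
        if slot["confidence"] >= OOV_CONF:
            continue                                 # slot value is probably in-vocabulary
        phonemes = recognize_phonemes(slot["wave"])
        for keyword in phoneme_to_grapheme(phonemes, n_best=n_best):
            entry = knowledge_importer(keyword)      # query an external source such as the web
            if entry is not None:
                new_entries.append(entry)
                break
    return new_entries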

3.4 Verification Step

At the analysis step, the D3 toolkit suggests a set of unseen utterance patterns that should be incorporated into each model to improve performance. However, these candidates may still include many errors because they are generated automatically by each evolution module. Therefore, the candidates should be verified before training each model. The D3 toolkit provides a verification step based on an active learning method [7]. We empirically defined a set of thresholds for the ASR and SLU models. If the confidence score of a candidate exceeds the corresponding threshold, the candidate is automatically added to the training data. Otherwise, the candidate is verified by a real user.
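
A minimal sketch of this routing policy follows; the threshold values and the candidate dictionary layout are illustrative, not the values used by the toolkit.

ASR_AUTO, SLU_AUTO = 0.85, 0.80   # assumed empirically defined thresholds

def route_candidates(candidates):
    """Split candidates into those added automatically and those queued for user review."""
    auto, review = [], []
    for cand in candidates:           # cand: {"kind": "asr" or "slu", "confidence": float, ...}
        threshold = ASR_AUTO if cand["kind"] == "asr" else SLU_AUTO
        (auto if cand["confidence"] >= threshold else review).append(cand)
    return auto, review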

At the verification step, the D3 toolkit provides useful information, such as the most probable transcription and semantic tags, so that the user can select which utterances should be added to the new models. The user can also modify a candidate by listening to the recorded speech file. We believe that our toolkit can easily be used by real users because the verification step is completed simply by selecting checkboxes based on the user's intuition.

3.5 Training Step

After the verification, the new corpus is combined with the existing corpus. The augmented corpus is used to build new ASR, SLU and knowledge models. By replacing the old models, the self-evolution process is completed.


4 Preliminary Experiments

4.1 Corpus

For the preliminary experiments, we collected about 114 human-machine utterances from 25 dialogs in Korean, based on a set of pre-defined subjects relating to the TV guidance task. We tested our example-based dialog system [13]. Table 1 shows the result types of the collected utterances. The result types correspond to the three cases described for SLU evolution. The number of utterances in Case 2 is small because SLU models are closely related to ASR models in SDS development. After the evolutionary process, 4 utterance patterns with semantic annotations and 2 knowledge sources were newly added to the existing corpus.

Table 1. Result type of the collected utterances

                   Case 1   Case 2   Case 3
# of utterances      51        9       54

4.2 Dialog System Performance

To evaluate the effectiveness of our toolkit, we compared the dialog system performance before and after daydreaming. We measured the ASR by word error rate (WER) and the SLU by concept error rate (CER). After updating the models, the WER decreased from 48.64% to 46.97% and the CER decreased from 32.58% to 29.21%. The WER is high compared to the WER of conventional SDSs because all the dialogs include many new patterns. We also calculated the task completion rate (TCR) for the dialog system evaluation. The TCR increased from 48.00% to 56.00%, which means that two more tasks were completed. The experimental results are shown in Table 2. The system performance improves after daydreaming because, by updating all models, the system succeeds in processing the utterances that caused errors before daydreaming.

Table 2. Experiment results

              Before daydreaming   After daydreaming
WER (ASR)           48.64%               46.97%
CER (SLU)           32.58%               29.21%
TCR (DM)            48.00%               56.00%

4.3 Implementation

The D3 toolkit was implemented with a client-server architecture. To make the toolkit easy to access and to provide a familiar environment, the graphical user interface is implemented using web-based technologies. For efficiency and adaptability, the analysis and training parts were developed using the standard C++ library. The screen shot in Fig. 7 shows the interface of the D3 toolkit.


Fig. 7. Screen shot of the D3 Toolkit user interface

5 Conclusions and Future Work

We introduced the daydreaming spoken dialog system, which can be semi-automatically improved by learning from past human-machine dialogs. We also implemented the D3 toolkit to support the self-evolutionary process of the daydreaming dialog system. The D3 toolkit generates a new corpus by analyzing the log file, and the models of the dialog system can be semi-automatically upgraded using the verification step. The main advantage of the D3 toolkit is that little human effort is required to maintain the SDS, since the out-of-patterns are fed back into the models. In future work, we will apply the D3 toolkit to different domains and extensively evaluate the updated systems. In addition, a process for dialog model evolution will be considered at the analysis step.

Acknowledgments. This work was supported by the Industrial Strategic Technology Development Program, 10035252, Development of dialog-based spontaneous speech interface technology on mobile platform, funded by the Ministry of Knowledge Economy (MKE, Korea).

References

1. Anoop, K.S., Scott, R.K., Chen, J., Landay, J.A., Chen, C.: SUEDE: Iterative, Informal Prototyping for Speech Interfaces. In: Video poster in Extended Abstracts of Human Factors in Computing Systems: CHI 2001, Seattle, pp. 203–204 (2001)

2. Sutton, S., Cole, R., de Villiers, J., Schalkwyk, J., Vermeulen, P., Macon, M., Yan, Y., Kaiser, E., Rundle, B., Shobaki, K., et al.: Universal Speech Tools: the CSLU toolkit. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP), pp. 3221–3224 (1998)

3. Dybkjær, H., Dybkjær, L.: DialogDesigner: Tools support for dialogue model design and evaluation. Lang. Resour. Eval. 40(1), 87–107 (2006)

4. Jung, S., Lee, C., Kim, S., Lee, G.G.: Dialog Studio: A workbench for data-driven spoken dialog system development and management. The Journal of Speech Communication 50(8-9), 683–697 (2008)

5. Scheffler, K., Young, S.: Corpus-based dialogue simulation for automatic strategy learning and evaluation. In: NAACL Workshop on Adaptation in Dialogue Systems, pp. 64–70 (2001)

6. Mueller, E.T., Dyer, M.G.: Daydreaming in humans and computers. In: Proceedings of the Ninth International Joint Conference on Artificial Intelligence. University of California, Los Angeles (1985)

7. Bonwell, C.C., Eison, J.A.: Active learning: Creating excitement in the classroom. In: ASHE-ERIC Higher Education Report No. 1. The George Washington University, School of Education and Human Development, Washington (1991)

8. Jiang, H.: Confidence measures for speech recognition. Speech Communication 45(4), 455–470 (2005)

9. Hidden Markov Toolkit (HTK), http://htk.eng.cam.ac.uk/

10. Lee, S., Lee, C., Lee, J., Noh, H., Lee, G.G.: Intention-based Corrective Feedback Generation using Context-aware Model. In: Proceedings of the 2nd International Conference on Computer Supported Education (CSEDU 2010), Valencia (2010)

11. Tur, G., Tur, D.H., Schapire, R.E.: Combining Active and Semi-Supervised Learning for Spoken Language Understanding. The Journal of Speech Communication 45(2), 171–186 (2005)

12. Hazen, T., Burianek, T., Polifroni, J., Seneff, S.: Recognition Confidence Scoring for Use in Speech Understanding Systems. In: Proceedings of ISCA ASR 2000 Tutorial and Research Workshop, Paris (2000)

13. Lee, C., Jung, S., Kim, S., Lee, G.G.: Example-based Dialog Modeling for Practical Multi-domain Dialog System. Speech Communication 51(5), 466–484 (2009)


New Technique to Enhance the Performance of Spoken Dialogue Systems by Means of Implicit Recovery of ASR Errors*

Ramón López-Cózar1, David Griol2, and José F. Quesada3

1 Dept. of Languages and Computer Systems, CITIC-UGR, University of Granada, Spain [email protected]

2 Dept. of Computer Science, Carlos III University of Madrid, Spain [email protected]

3 Dept. of Artificial Intelligence and Computer Science, University of Seville, Spain [email protected]

Abstract. This paper proposes a new technique to implicitly correct some of the ASR errors made by spoken dialogue systems, implemented at two levels: statistical and linguistic. The goal of the former level is to employ, for the correction, knowledge extracted from the analysis of a training corpus comprised of utterances and their corresponding ASR results. The outcome of the analysis is a set of syntactic-semantic models and a set of lexical models, which are optimally selected during the correction. The goal of the correction at the linguistic level is to repair errors not detected at the statistical level which affect the semantics of the sentences. Experiments carried out with a previously developed spoken dialogue system for the fast food domain indicate that the technique enhances word accuracy, spoken language understanding and task completion by 8.5%, 16.54% and 44.17% absolute, respectively.

1 Introduction

It is well known that user utterances are frequently misheard, misrecognised or misunderstood by spoken dialogue systems (SDSs), mainly due to the current limitations of state-of-the-art automatic speech recognition (ASR). Thus, well-designed error handling strategies are crucial for robust system performance, especially for inexperienced users. Some authors have studied human error recovery strategies in order to apply these, where possible, to SDSs. For example, [1] found that when subjects face speech recognition problems, a common strategy is to ask task-related questions that confirm their hypothesis instead of signalling non-understanding, which leads to better understanding of subsequent sentences.

A number of error handling strategies can be found in the literature, mostly working at three levels of the system's architecture: ASR, spoken language understanding

* This research has been funded by the Spanish Ministry of Science and Technology, under project TIN2007-64718 HADA.


(SLU) and dialogue management (DM). These techniques have traditionally been separated into two groups: error detection and error correction.

A common method for error detection is the use of recognition confidence scores, but the problem is that these measures are not entirely reliable, as they depend on noise conditions and user types. At the DM level, a common method for detecting errors is the use of confirmation strategies [2] or re-phrasing, whereas implicit confirmations can be employed for both error detection and correction.

[3] proposed a model for error correction comprised of four levels: detection, diagnosis, repair plan selection and interactive plan execution. [4] presented an agent-based architecture in which error handling is divided into individual, application-independent components. This architecture makes it possible to construct adaptive and reusable components and entire error-handling toolkits. [5] proposed an example-based error recovery method to detect and correct errors, based on a re-phrase strategy and task guidance that help novice users re-phrase well-recognisable and well-understandable sentences.

Some error handling techniques try to hide the errors made by certain components of SDSs, for example, errors made by the ASR. The technique that we propose in this paper follows this direction, as its goal is to detect ASR errors and correct them before the ASR result is passed as input to the SLU. To do so, it takes the ASR result and carries out two kinds of processing, one statistical and the other linguistic. The first process tries to detect and correct errors using knowledge extracted from the analysis of a training corpus comprised of utterances and their corresponding ASR results. The goal of the linguistic process is to repair errors not detected during the statistical process, if these affect the semantics of the sentences.

The remainder of the paper is organised as follows. In Section 2 we discuss previous studies concerned with error handling techniques for automatically detecting and correcting ASR errors. Section 3 presents our technique for the implicit recovery of these errors; it discusses the necessary elements and explains how to implement the algorithms for error correction. Section 4 presents the experiments, comparing the system performance results achieved with and without the proposed technique. Finally, Section 5 presents the conclusions and outlines possibilities for future work.

2 Related Work

Most previous techniques for automatically correcting ASR errors are based on statistical methods that use probabilistic information about the words uttered and the words in the recognition results. Following this approach, [6] proposed a method based on two parts: a channel model that represents the errors made by a speech recogniser, and a language model that represents the likelihood of a sequence of words uttered by a speaker. They trained both models with transcriptions of dialogues obtained with the TRAINS-95 dialogue system. Their experimental results showed that the post-processor output contained fewer errors than that of the speech recogniser.

Also following this approach, [7] proposed a method that uses statistical features of character co-occurrence, implemented in two consecutive correcting processes. The first detects and corrects errors using a database of erroneous-correct utterance pairs.


The remaining errors are passed to the second process, which uses a string in the corpus that is similar to the string containing the recognition errors. The authors found that this method contributed significantly to the performance of speech translation systems.

Also, [8] proposed a method to model the likely contexts of all words in an ASR system's vocabulary by performing a lexical co-occurrence analysis over a large corpus of speech recogniser output. They identified regions in the data that contain likely contexts for a given query word, and then detected words or sequences of words that are likely to appear in the context and that are phonetically similar to the query word. Their experiments showed that this method provides high precision in the detection and correction of errors.

Several authors have proposed carrying out error detection and correction at several levels. For example, [9] proposed a two-level scheme for detecting recognition errors. The first level applies an utterance classifier to decide whether the speech recognition result is erroneous. If it is judged incorrect, it is passed to the second level, where a word classifier is used to decide which words are misrecognitions.

Following the same approach, [10] proposed an error detection method based on three levels. The first detects whether the input utterance is correct, the second detects incorrect words, and the third detects erroneous characters. The error correction first creates candidate lists of errors, and then re-ranks the candidates with a combined model of mutual information and trigrams.

Statistical methods present several drawbacks. One is that they require large amounts of training data. Another is that their success depends on the size and quality of the speech recognition results, or on the database of collected erroneous strings, since they depend directly on the lexical entries. Hence, a number of authors have proposed combining different types of information sources. For example, [11] presented a method that models the output generated by a set of ASR systems as independent knowledge sources that can be combined and used to generate an output with a reduced error rate. The outputs of the individual ASR systems are combined into a minimal-cost word transition network by means of iterative applications of dynamic programming alignments. The resulting network is decoded by a rescoring or voting process that selects the output sequence with the lowest score.

Employing a different approach, [12] combined lexical information with higher-level knowledge sources via a maximum entropy language model (MELM). Error correction was arranged at two levels, using a different language model at each level. At the first level, a word n-gram was employed to capture local dependencies and to speed up the processing. The MELM was used at the second level to capture long-distance dependencies and higher linguistic phenomena, and to re-score the N-best hypotheses produced by the first level. Their experiments showed that this approach outperformed previous lexically oriented approaches. The main problem found was that training the MELM required a lot of time and was sometimes infeasible.


3 Proposed Technique for Implicit Recovery from ASR Errors

In this paper we propose a new technique to enhance the performance of SDSs by means of a method that implicitly recovers from some ASR errors. We say that the recovery is implicit because the user is not aware of the error; in other words, no dialogue turns are spent on error recovery, which makes the interaction with the system more natural and friendly. This technique is inspired by previous studies based on pattern matching [7] and statistical information [9], and addresses one drawback of the former type of method, namely that the selected pattern may not be optimal. To address this limitation, our technique employs several corpora of previously learnt syntactic-semantic and lexical patterns, as well as a similarity threshold d ∈ [0.0, 1.0] to decide whether a pattern is good enough for error correction. If it is found not to be good enough, the technique searches for a better pattern in the whole set of available patterns, and if a proper pattern is not found there either, the technique makes no correction using the patterns. This procedure is discussed in detail in Section 3.5.

In the following sections we describe the elements required to implement our technique: concepts, grammatical rules, syntactic-semantic models and lexical models.

3.1 Concepts

We define a concept as a set of keywords of a given type which are necessary to extract the semantic content of sentences within an application domain. For example, in our experiments in the fast food domain, described in Section 4, we consider, among others, the following concepts: DESIRE = {want, need, ...}, FOOD = {sandwich, cake, salad, ...}, DRINK = {water, beer, wine, ...} and AMOUNT = {one, two, three, ...}.

3.2 Grammatical Rules

The general format of a grammatical rule is as follows: ssp restriction, where ssp denotes a syntactic-semantic pattern, which will be described in the following section, and restriction is a condition that must be satisfied by all the concepts in the pattern. For example, one rule used in our experiments in Spanish is:

NUMBER DRINK SIZE

number(NUMBER) = number(DRINK) and number(DRINK) = number(SIZE) and number(NUMBER) = number(SIZE)

where number is a function that returns either singular or plural for each Spanish word in the concepts that it uses as input. The goal of this rule is to check number correspondences of drink orders uttered in Spanish. For example, the sentence dos cervezas grandes (two large beers) holds this correspondence.
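
To make the rule format concrete, the following sketch checks the number-agreement restriction above; the concept keyword sets and the tiny number() lexicon are illustrative stand-ins for the real Spanish lexicon used by the system.

CONCEPTS = {
    "NUMBER": {"uno", "dos", "tres"},
    "SIZE":   {"pequeña", "pequeñas", "grande", "grandes"},
    "DRINK":  {"cerveza", "cervezas", "fanta", "fantas"},
}
PLURAL_WORDS = {"dos", "tres", "pequeñas", "grandes", "cervezas", "fantas"}

def number(word):
    """Toy stand-in for the number() function: singular or plural of a Spanish word."""
    return "plural" if word in PLURAL_WORDS else "singular"

def to_containers(words):
    """Map each word of a recognised drink order to its (concept, word) container."""
    return [(next((c for c, kws in CONCEPTS.items() if w in kws), None), w) for w in words]

def number_agreement(containers):
    """Restriction of the rule NUMBER DRINK SIZE: all three concepts must agree in number."""
    values = [number(w) for c, w in containers if c in ("NUMBER", "DRINK", "SIZE")]
    return len(set(values)) <= 1

# to_containers("dos cervezas grandes".split()) satisfies the restriction;
# to_containers("uno cervezas grandes".split()) violates it.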


3.3 Syntactic-Semantic Models

A syntactic-semantic model is a conceptual representation of the sentences uttered by users of an SDS in a given dialogue state. This state is associated with a prompt type T of the dialogue system, which represents equivalent prompts used to obtain a particular type of data from the user. To create a syntactic-semantic model for a prompt type T, we transform each sentence uttered by the user in response to the prompt type into what we call a syntactic-semantic pattern (ssp). This pattern is a sequence of concepts obtained by replacing each word in the sentence with the concept(s) the word belongs to. From the analysis of all the sentences uttered in response to each prompt type we create a set of ssp's, remove those that are redundant, and associate with each ssp its relative frequency within the set. The outcome of this process is a syntactic-semantic model associated with the prompt type T (SSMT). We call the α model the set of SSMT's created considering the m prompt types of an SDS: α = {SSMTi}, i = 1 ... m.
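
A minimal sketch of this construction is shown below, assuming a word_to_concept mapping from keywords to concept names; how non-keywords are represented in an ssp is our simplification (they are kept as the words themselves).

from collections import Counter

def build_ssm(sentences, word_to_concept):
    """Build SSM_T for one prompt type: each sentence becomes a concept sequence (ssp),
    duplicates are merged, and each ssp keeps its relative frequency."""
    patterns = Counter()
    for sentence in sentences:
        ssp = tuple(word_to_concept.get(w, w) for w in sentence.split())
        patterns[ssp] += 1
    total = sum(patterns.values())
    return {ssp: count / total for ssp, count in patterns.items()}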

3.4 Lexical Models

Lexical models contain information about the performance of the speech recogniser of the SDS. We create a lexical model for each prompt type T, which we call LMT, by considering the sentences uttered in response to the prompt type and their corresponding recognition results. The format of this model is LMT = {(wa, wb, pab)}, where wa is a word uttered by a user, wb is the recognised word, and pab is the posterior probability of obtaining wb given wa. To create LMT we align each uttered sentence with the recognised sentence using the method described in [13], and compute the probabilities pab for each word pair (wa, wb). We call the β model the set of LMT's created considering the m prompt types of an SDS: β = {LMTi}, i = 1 ... m.
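
The sketch below estimates such a model from (uttered, recognised) sentence pairs; the real alignment method is the one of [13], and difflib's matcher is used here only as a stand-in to pair up words.

from collections import Counter
from difflib import SequenceMatcher

def build_lexical_model(pairs):
    """pairs: list of (uttered_sentence, recognised_sentence) strings.
    Returns {(w_uttered, w_recognised): p}, the word confusion probabilities."""
    counts, totals = Counter(), Counter()
    for uttered, recognised in pairs:
        u, r = uttered.split(), recognised.split()
        for tag, i1, i2, j1, j2 in SequenceMatcher(None, u, r).get_opcodes():
            if tag in ("equal", "replace"):          # aligned word pairs
                for wa, wb in zip(u[i1:i2], r[j1:j2]):
                    counts[(wa, wb)] += 1
                    totals[wa] += 1
    return {(wa, wb): c / totals[wa] for (wa, wb), c in counts.items()}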

3.5 Algorithms to Implement the Technique

In this section we discuss the two levels for error detection and correction employed by our technique: statistical and linguistic.

3.5.1 Correction at the Statistical Level

The goal of this correction level is to find words wI in the recognised sentence which belong to incorrect concepts KI. For each such word, we must decide the correct concept KC and select the most appropriate word wC ∈ KC to substitute for wI in the recognised sentence. We implement this procedure in two steps: pattern matching and pattern alignment.

3.5.1.1 Pattern Matching. The procedure for pattern matching is illustrated in Fig. 1. It receives as input the recognised sentence w1, w2, w3, w4, ... wn, the syntactic-semantic model associated with the current prompt type T, and the similarity threshold d.


Fig. 1. Pattern matching for correction at the statistical level (flowchart; inputs: the recognised sentence w1 ... wn, the model SSMT and the similarity threshold d; the outcome esspBEST is passed either to pattern alignment or to the correction at the linguistic level)


The procedure employs what we call an enriched syntactic-semantic pattern (esspINPUT) obtained from the recognised sentence. This pattern is a sequence of what we call containers: C1, C2, C3, ... Cn. The goal of this step is to transform esspINPUT into another pattern, called esspBEST, which is initially empty. To do this, we first create a syntactic-semantic pattern called sspINPUT, which contains only the concepts in esspINPUT, for example: sspINPUT = DESIRE AMOUNT INGREDIENT FOOD. Secondly, we determine whether sspINPUT matches any pattern in the syntactic-semantic model associated with the prompt type T (SSMT). If so, we set esspBEST = esspINPUT and proceed with the correction at the linguistic level (discussed in Section 3.5.2). Otherwise, we look for patterns p similar to sspINPUT in SSMT. To do this, we compare sspINPUT with every pattern p in the model and compute a similarity score as follows: similarity(sspINPUT, p) = (n - med) / n, where n is the number of concepts in sspINPUT and med is the minimum edit distance between both patterns, computed using the method described in [14]. We call sspSIMILAR any pattern p in SSMT such that similarity(sspINPUT, p) > d, where d ∈ [0.0, 1.0] is a similarity threshold whose optimal value must be determined experimentally.
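
The similarity score can be computed as sketched below; the standard Levenshtein distance over concept sequences is used here as an illustration of the minimum edit distance of [14].

def edit_distance(a, b):
    """Levenshtein distance between two concept sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def similarity(ssp_input, p):
    n = len(ssp_input)
    return (n - edit_distance(ssp_input, p)) / n

def similar_patterns(ssp_input, ssm, d=0.5):
    """Return the ssp_SIMILAR candidates in the model SSM_T whose similarity exceeds d."""
    return {p: freq for p, freq in ssm.items() if similarity(ssp_input, list(p)) > d}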

We consider three cases depending on the number of sspSIMILAR's in SSMT. The first case is when there is just one. In this case, we create a new pattern called sspBEST, set sspBEST = sspSIMILAR and proceed with the pattern alignment, which is discussed in Section 3.5.1.2.

The second case is when there are no sspSIMILAR‘s in SSMT. In this case, we try to find sspSIMILAR‘s in the α model (discussed in section 3.3) instead of doing so in SSMT, i.e., we employ the same procedure but considering α, not SSMT.

The third case is when there are several sspSIMILAR's in SSMT (or in α). The question then is how to determine the best sspSIMILAR. To make this selection we search for the sspSIMILAR that has the greatest similarity with sspINPUT. If there is just one sspSIMILAR satisfying this condition, we set sspBEST = sspSIMILAR and proceed with Step 2 (pattern alignment). If there are several such patterns, we select those with the highest frequency in SSMT (or in α): if there is just one, we set sspBEST = sspSIMILAR and proceed with Step 2; if there are several, we make no correction at the statistical level.

3.5.1.2 Pattern Alignment. The goal of pattern alignment is to build esspBEST in case it is still empty. The procedure is illustrated in Fig. 2. It receives as input the pattern concept sequence esspBEST, the lexical model associated with the current prompt type T (LMT) and the syntactic-semantic patterns sspINPUT and sspBEST.

The procedure considers each container Ca in sspINPUT and distinguishes two cases. The first is when the word wa in Ca does not affect the semantics of the sentence, i.e., it is not a keyword (e.g. please). In this case we create a new container D, set D = Ca and add D to esspBEST.

The second case is when the word wa in Ca affects the semantics of the sentence, i.e., it is a keyword (e.g. sandwich). Thus, we study whether the word must be corrected. To do this, we try to align the container Ca with a container Cb in sspBEST using the method described in [13] and consider two possible occurrences:


Fig. 2. Pattern alignment for correction at the statistical level (flowchart; inputs: esspBEST, the lexical model LMT, and the patterns sspINPUT and sspBEST)


Occurrence 1: Ca can be aligned. In this occurrence we assume that the container Ca is correct and do not make any correction at the statistical level. We create a new container D, make D = Ca and add D to esspBEST.

Occurrence 2: It is not possible to align Ca. This may happen in two situations. The first is when the container is the result of an insertion recognition error. In this situation we discard Ca, i.e., it is not added to esspBEST. The second situation is when the container is the result of a substitution recognition error. In this case, we must find a correction word from a different concept, wC ∈ CN, store it in a new container D, and add this container to esspBEST. To find wC we consider the lexical model associated with the prompt type T (LMT) and create the set U of words u ∈ CN with which the word wI is confused. If there is only one word u in U, we create a new container D that we name CN, store u in it, and add D to esspBEST. If there are several words, we carry out the same procedure using the word that has the highest confusion probability with wI, if it is unique; if it is not unique, or if there are no words in U, we make no correction at the statistical level.
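
The selection of the correction word from LMT can be sketched as follows, reusing the lexical-model format of Section 3.4; the data-structure layout is our assumption, not the authors' implementation.

def correct_substitution(w_incorrect, target_concept, lexical_model, concepts):
    """lexical_model: {(w_uttered, w_recognised): p}; concepts: {name: set(words)}.
    Returns the correction word w_C, or None when no unique best candidate exists."""
    candidates = [(p, wa) for (wa, wb), p in lexical_model.items()
                  if wb == w_incorrect and wa in concepts[target_concept]]
    if not candidates:
        return None                      # no correction at the statistical level
    candidates.sort(reverse=True)        # highest confusion probability first
    if len(candidates) > 1 and candidates[0][0] == candidates[1][0]:
        return None                      # best confusion probability is not unique
    return candidates[0][1]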

3.5.2 Correction at the Linguistic Level

The goal of the correction at the linguistic level is to repair errors that are not detected at the statistical level and affect the semantics of the sentences. To carry out the correction we use the grammatical rules described in Section 3.2, applying the following procedure for each rule. The syntactic-semantic pattern ssp of the rule is inserted in a window that slides from left to right over esspBEST, as can be observed in Fig. 3.

Fig. 3. Sliding window over esspBEST (the containers C1-C9 hold the recognised words "I want one ham sandwich two small beers and" with their confidence scores and the concepts DESIRE, NUMBER, INGREDIENT, FOOD, NUMBER, SIZE and DRINK; the sliding window contains the rule pattern NUMBER SIZE DRINK)

If the concept sequence in the window is found in esspBEST, we apply the restriction of the rule to the words in the corresponding containers of esspBEST. If the words satisfy the restriction, we make no correction. Otherwise, we try to find the reason for the violation by searching for an incorrect word wI. To decide on the word wC with which to correct the incorrect word, we consider the lexical model LMT and take into account the set U = {u1, u2, ..., up} comprised of words of the same concept as the word wI. We then proceed as in the second case of Occurrence 2 (see the previous section), except that the goal is now to replace a word of one concept with another word of the same concept.
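
A minimal sketch of the sliding-window check follows; the container representation and the restriction callback (for example, the number_agreement function sketched in Section 3.2) are illustrative.

def apply_rule(essp_best, rule_pattern, restriction):
    """essp_best: list of (concept, word) containers; rule_pattern: tuple of concepts;
    restriction: function over the matched containers -> bool.
    Returns the spans whose words violate the restriction and need correction."""
    violations = []
    k = len(rule_pattern)
    for start in range(len(essp_best) - k + 1):
        window = essp_best[start:start + k]
        if tuple(concept for concept, _ in window) == rule_pattern:
            if not restriction(window):
                violations.append((start, window))   # candidate span for word replacement
    return violations

# Example: apply_rule(containers, ("NUMBER", "SIZE", "DRINK"), number_agreement)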


4 Experiments

The goal of the experiments described in this section is to test the proposed technique using the Saplen spoken dialogue system, which we developed in a previous study to answer fast food queries and orders made in Spanish [15]. The evaluation has been carried out in terms of word accuracy (WA), spoken language understanding (SLU) and task completion (TC), considering two ASR front-ends: i) the baseline ASR, comprised of the standard HTK-based speech recogniser of the Saplen system, and ii) the enhanced ASR, comprised of the same speech recogniser plus an additional module that implements the proposed technique.

We have employed a dialogue corpus collected at our University from students interacting with the Saplen system, which contains around 5,500 utterances and roughly 2,000 different words. The utterance corpus has been divided into two separate corpora, each containing around 50% of the utterances. Using the training corpus we have compiled a word bigram that allows recognising sentences of the 18 different types in the corpus. The remaining 50% of the utterances have been used for testing.

The experiments have been carried out employing a user simulator developed in a previous study [16]. The interaction between the Saplen system and the simulator is driven by a set of scenarios that represent user goals. We have created two scenario sets: ScenariosA (300 scenarios) and ScenariosB (100 scenarios). Each dialogue generated by the interaction between the Saplen system and the user simulator is stored in a log file for analysis and evaluation purposes.

Given that the construction of the syntactic-semantic and lexical models described in sections 3.3 and 3.4 has been carried out employing simulated dialogues, we have made additional experiments to decide the necessary number of dialogues to obtain the maximum amount of syntactic-semantic and lexical knowledge. The results indicate that 900 dialogues is the optimal trade-off.

4.1 Experiments with the Baseline ASR

Employing the user simulator, the Saplen system and ScenariosA, we have generated a corpus of 900 dialogues, which we have called DialoguesA1. Table 1 sets out the average results obtained from the analysis of this corpus. The results show the problems of the system in correctly recognising and understanding some utterances. Analysis of the log files reveals that in some cases the misrecognised sentences are similar to the uttered sentences. For example, the sentence in Spanish dos fantas grandes de limón (two large lemon fantas) is recognised as uno fantas grandes de limón (one large lemon fantas) because of the acoustic similarity between dos and uno when uttered by users with strong Southern Spanish accents.

Table 1. Results using the baseline ASR (in %)

WA      SLU     TC
76.12   54.71   24.51


We have also observed problems with confirmations, which happen because the speech recogniser usually substitutes the word si (yes) with the word seis (six) when the former is uttered by strongly accented speakers. In other cases, the recognised sentences are very distorted by ASR errors. For example, the sentence quiero una fanta de naranja grande (I want one big orange Fanta) is sometimes recognised as queso de manzana tercera (cheese of apple third).

4.2 Experiments with the Enhanced ASR

As the concepts required for the technique (discussed in section 3.1), we have employed a set of 21 concepts that we created in a previous study [15]. Following section 3.2 we have created a set of grammatical rules to check the number correspondences for food and drink orders. To create the syntactic-semantic and lexical models, discussed in sections 3.3 and 3.4, we have analysed DialoguesA1, thus obtaining α = {SSMTi} and β = {LMTi}, with i = 1 ... 43, given that the Saplen system can generate 43 different prompt types.

To decide the optimal value for the similarity threshold d (discussed in section 3.5.1) we have carried out experiments considering values in the range [0.1, 0.9]. Employing the user simulator and ScenariosB, we have generated a corpus comprised of 300 dialogues for each value, using in all cases the proposed technique. Analysis of the outcomes of these experiments reveals that the best results are obtained when d = 0.5. Using this optimal value, we have employed again ScenariosA to generate another corpus of 900 dialogues, which we call DialoguesA2. Table 2 shows the average results obtained from the analysis of this corpus.

Table 2. Results using the enhanced ASR (in %)

WA      SLU     TC
84.62   71.25   68.32

Analysis of the log files shows that the technique is successful in correcting some incorrectly recognised sentences. For example, the incorrectly recognised drink order one large lemon fantas is corrected by making no changes at the syntactic-semantic level and replacing one with two at the lexical level. In other product orders the correction is carried out at the syntactic-semantic level. For example, one curry salad is sometimes recognised as one error curry salad. In this case the correction is carried out by removing the ERROR concept at the syntactic-semantic level.

The technique is useful in correcting the errors with confirmations discussed in the previous section. To do this, it replaces the NUMBER concept with the CONFIRMATION concept, and then selects the most likely word in CONFIRMATION.

The enhanced ASR also enables correction of some misrecognised telephone numbers (in the Spanish format). For example, nine five eight twenty-one fourteen eighteen is sometimes recognised as gimme five eight twenty-one fourteen eighteen because of the acoustic similarity between nine and gimme in Spanish. The technique corrects the error by replacing the DESIRE concept with the NUMBER concept and selecting the most likely word in NUMBER given the word gimme at the lexical level.


The technique is also useful to correct some misrecognised postal codes. For example, eighteen zero zero one is sometimes recognised as eighteen zero zero turkey. This error is corrected by replacing the INGREDIENT concept with the NUMBER concept and selecting the most likely word in NUMBER given the word turkey.

Our proposal is also successful in correcting some incorrectly recognised addresses (in the Spanish format). For example, almona del boquerón street number five second floor letter h is sometimes recognised as almona del boquerón street error five second floor letter zero. This error is corrected by making a double correction. First, replacement of the ERROR concept with the NUMBER_ID concept and selection of the most likely word in NUMBER_ID given the word error. Second, replacement of the NUMBER concept with the LETTER concept and selection of the most likely word in LETTER given the word zero.

There are cases where the technique fails to detect errors, and thus to correct them. This happens when words in the uttered sentence are substituted by other words and the result is valid in the application domain. For example, this occurs when the sentence two green salads is recognised as twelve green salads, given that there is no conflict in terms of concepts and there is agreement in number between the words.

4.2.1 Advantage of Using SSMT's, α and d

In this experiment we have checked whether using the SSMT's or α, taking into account d, is preferable to the two following alternative strategies: i) use α only, without first checking the SSMT's, and ii) use the SSMT's, but if the pattern sspINPUT is not found in these, use α without considering the similarity threshold d. The α model is the one created employing DialoguesA1 and d is set to the optimal value, i.e., d = 0.5. We have implemented strategy i) and used ScenariosA to generate a corpus of 900 dialogues, which we call DialoguesA3. Next, we have implemented strategy ii) and, using again ScenariosA, have generated another corpus of 900 dialogues, which we call DialoguesA4. Therefore, DialoguesA1, DialoguesA3 and DialoguesA4 have been created using the same scenarios and are comprised of the same number of dialogues, the only difference being the strategy for selecting the correction model to be used. Table 3 shows the average results obtained from the analysis of DialoguesA3 and DialoguesA4.

Table 3. Results employing strategies to select the syntactic-semantic correction model (in %)

Corpus        WA      SLU     TC
DialoguesA3   80.15   61.67   39.78
DialoguesA4   82.26   66.84   55.35

Analysis of the log files shows that the error correction in confirmations is very much affected by the strategy employed to select the correction model (either SSMT or α). If we always use SSMT to correct errors in confirmations, the correction is in many cases successful. On the other hand, if we always use α the correction is mostly incorrect.


4.2.2 Advantage of Using LMT's, β and d

The goal of this experiment has been to check whether using the LMT's or β, taking into account d, is preferable to using β regardless of d. To carry out the experiment we have used the β model created with DialoguesA1. We have employed again ScenariosA and generated a corpus of 900 dialogues, which we call DialoguesA5. Therefore, DialoguesA1 and DialoguesA5 have been obtained using the same scenarios and are comprised of the same number of dialogues, the only difference being the use of β. Table 4 shows the average results obtained from the analysis of DialoguesA5. The experiment shows that the confusion probabilities of words are not the same in the LMT's and β. For example, considering the β model, the highest probability of confusing the word error with a word in the NUMBER concept is 0.0370, and this word is dieciseis (sixteen). However, considering LMT=PRODUCT-ORDER, this probability is 0.0090 and the word is una (one). Therefore, the correction word is dieciseis if we consider β, and una if we take into account LMT=PRODUCT-ORDER, which in some cases is decisive in making the proper correction.

Table 4. Results employing an alternative strategy to select the lexical model (in %)

Corpus        WA      SLU     TC
DialoguesA5   81.40   65.61   60.89

5 Conclusions and Future Work

Comparing the results set out in Tables 1 and 2 we observe that the proposed technique enhances the performance of the Saplen system in terms of WA, SLU and TC by 8.5%, 16.54% and 44.17% absolute, respectively. These enhancements are mostly achieved because, considering the proposed threshold for similarity scores between patterns, the technique decides whether to use correction models associated with the current prompt type T (SSMT and LMT) or general correction models for the application domain (α and β). This novel contribution optimises the procedure for error recovery, as can be observed from the comparison of results set out in Tables 2, 3 and 4. These results show that our method for selecting the correction models is preferable to other possible strategies for selecting these models. In particular, we have observed that the benefit of the proposed method is particularly noticeable in the correction of misrecognised confirmations.

Future work includes considering additional information sources to correct errors that in the current implementation cannot be detected, such as domain-dependent knowledge. For example, in our application domain we could use this kind of information to consider that the sentence twelve green salads, although syntactically correct, is likely to be incorrectly recognised, given that it is not usual for customers of fast food restaurants to order such a large amount of a product. We also plan to study the performance of the technique considering prompt-dependent similarity thresholds.


References

1. Skantze, G.: Exploring human error recovery strategies: Implications for spoken dialogue systems. Speech Communication 45, 325–341 (2005)

2. McTear, M., O'Neill, I., Hanna, P., Liu, X.: Handling errors and determining confirmation strategies – An object-based approach. Speech Communication 45, 249–269 (2005)

3. Duff, D., Gates, B., Luperfoy, S.: An architecture for spoken dialogue management. In: Proc. of ICSLP, pp. 1025–1028 (1996)

4. Turunen, M., Hakulinen, J.: Agent-based error handling in spoken dialogue systems. In: Proc. of Eurospeech, pp. 2189–2192 (2001)

5. Lee, C., Jung, S., Lee, D., Lee, G.G.: Example-based error recovery strategy for spoken dialog system. In: Proc. of ASRU, pp. 538–543 (2007)

6. Ringger, E.K., Allen, J.F.: A fertility model for post correction of continuous speech recognition. In: Proc. of ICSLP, pp. 897–900 (1996)

7. Kaki, S., Sumita, E., Iida, H.: A method for correcting speech recognitions using the statistical features of character co-occurrences. In: Proc. of COLING-ACL, pp. 653–657 (1998)

8. Sarma, A., Palmer, D.D.: Context-based speech recognition error detection and correction. In: Proc. of HLT-NAACL, pp. 85–88 (2004)

9. Zhou, Z., Meng, H.: A two-level schemata for detecting recognition errors. In: Proc. of ICSLP, pp. 449–452 (2004)

10. Zhou, Z., Meng, H., Lo, W.K.: A multi-pass error detection and correction framework for Mandarin LVCSR. In: Proc. of ICSLP, pp. 1646–1649 (2006)

11. Fiscus, J.G.: A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER), pp. 347–352 (1997)

12. Jeong, M., Jung, S., Lee, G.G.: Speech recognition error correction using maximum entropy language model. In: Proc. of Interspeech, pp. 2137–2140 (2004)

13. Fisher, W.M., Fiscus, J.G.: Better alignment procedures for speech recognition evaluation. In: Proc. of ICASSP, pp. 59–62 (1993)

14. Crestani, F.: Word recognition errors and relevance feedback in spoken query processing. In: Proc. Conf. on Flexible Query Answering Systems, pp. 267–281 (2000)

15. López-Cózar, R., Callejas, Z.: Combining language models in the input interface of a spoken dialogue system. Computer Speech and Language 20, 420–440 (2006)

16. López-Cózar, R., de la Torre, A., Segura, J.C., Rubio, A.J., Sánchez, V.: Assessment of dialogue systems by means of a new simulation technique. Speech Communication 40(3), 387–407 (2003)


Simulation of the Grounding Process in Spoken Dialog Systems with Bayesian Networks

Stephane Rossignol, Olivier Pietquin, and Michel Ianotto

IMS Research Group, Supelec – Metz Campus, [email protected]

Abstract. User simulation has become an important trend of research in the field of spoken dialog systems because collecting and annotating real man-machine interactions with users is often expensive and time consuming. Yet, such data are generally required for designing and assessing efficient dialog systems. The general problem of user simulation is thus to produce as many natural, varied and consistent interactions as necessary from as few data as possible. In this paper, a user simulation method based on Bayesian Networks (BN) is proposed that is able to produce interactions that are consistent in terms of user goal and dialog history, but also to simulate the grounding process that often appears in human-human interactions. The BN is trained on a database of 1234 human-machine dialogs in the TownInfo domain (a tourist information application). Experiments with a state-of-the-art dialog system (REALL-DUDE/DIPPER/OAA) have been realized and promising results are presented.

1 Introduction

Spoken dialog systems are now widespread and are in use in many domains (from flight booking to troubleshooting services). Designing such a speech-based interface is usually an iterative process involving several cycles of prototyping, testing and validation. Tests and validation require interactions between the released system and real human users, which are very expensive and time consuming. For this reason, user simulation has become an important trend of research during the last decade. User simulation can be used for performance assessment [4,9] or for optimisation purposes [8,13,19], such as when optimizing dialog strategies with reinforcement learning. User simulation should not be confused with user modelling. User modelling is generally used by a dialog system for internal purposes such as knowledge representation and user goal inference [5,10] or natural language understanding simulation [14]. The role of user simulation is to generate a large amount of simulated interactions with a dialog system, and the simulated user is therefore external to the system.

Dialog simulation can occur at several levels of description. In this work, simulated interactions take place at the intention level (as in [4,8,13,19]) and not at the signal level as proposed in [9]. An intention is here defined as the minimal unit of information that a dialog participant can express independently. It will be modeled as dialog acts. Indeed, intention-based communication allows error modelling of every part of the system, including speech recognition and understanding [15,14,18]. Pragmatically, it is easier to automatically generate intentions or dialog acts than speech signals, as a large number of utterances can express the same intention.


The Simulated User (SU) presented in this paper is based on Bayesian Networks (BN). This model has been chosen for several reasons. First, BN are generative models and can therefore be used for inference as well as for data generation, which is of course required for simulation. Second, it is a statistical framework and it can thus generate a wide range of different dialogs that are statistically consistent with each other. Third, BN parameters can either be set by experts or trained on data. Given that data is often difficult to collect, the introduction of expert knowledge can be very helpful. Finally, a lot of efficient tools for inference and training of BN are freely available.

This paper relies on previous work [11,13,16] where BN are used for simulation purposes, but emphasizes two novel contributions. First, the model has been modified to generate grounding behaviours [3]. The grounding process will be considered here as the process used by dialog participants to ensure that they share the background knowledge necessary for the understanding of what will be said later in the dialog. In practice, this means that the simulated user will automatically react by providing the correct information again to the dialog system if a problem in the information transmission is detected. The ultimate goal of this work is to train dialog policies that handle such grounding behaviours [12]. Second, the model is trained on actual human-machine dialog data and tested on a state-of-the-art dialog system (the REALL-DUDE/DIPPER/OAA environment [7,1,2]).

The considered domain is the TownInfo domain, a tourist information task. The task consists in retrieving information about restaurants in a given city. This can be considered as a slot-filling task with three different slots ("food", "price range" and "area"), which can take respectively 3, 3 and 5 values (see Table 1).

Table 1. Slots in the task, and corresponding possible values

Food         “italian”    “indian”     “chinese”
Price range  “moderate”   “expensive”  “cheap”
Area         “central”    “north”      “south”    “west”   “east”

The system acts in use are: <hello>, <request>, <confirm>, <implicitconfirm request>, <closingDialogue>. The user acts in use are: <inform>, <confirm>, <bye>, (<null>).

The rest of this paper is organized as follows. In Section 2, the proposed model is described in detail. In Section 3, experiments are presented. Finally, conclusions and future work are provided in Section 4.

2 Description of the Model

2.1 Bayesian Network Model

The user simulation model described in this contribution is based on the probabilistic model of man-machine dialogue proposed in [11,13]. The interaction between the user and the dialogue manager is considered as a sequential transfer of intentions, organized in turns noted t, by means of dialog acts. At each turn t the dialogue manager selects a system dialog act a_t conditioned on its internal state s_t. The user answers with a user act u_t which is conditioned by the goal g_t s/he is pursuing and the knowledge k_t s/he has about the dialogue (what has been exchanged before reaching turn t). So, at a given turn, the information exchange can be modeled by the joint probability p(a, s, u, g, k) of all these variables. This joint probability can be factored as:

p(a, s, u, g, k) = p(u|g, k, a, s)p(g|k, a, s)p(k|s, a)p(a|s)p(s)

Given that:

– since the user doesn't have access to the SDS state, u, g and k cannot depend on s,
– the user's goal can only be modified according to his/her knowledge of the dialogue,

this expression can be simplified:

p(a, s, u, g, k) = p(u|g, k, a) · p(g|k) · p(k|a) · p(a|s) · p(s)

where p(u|g, k, a) is the user act model, p(g|k) the goal modification, p(k|a) the knowledge update, and p(a|s) the DM policy.

This can be expressed by the Bayesian network depicted in Figure 1-a).

Fig. 1. Bayesian Network-based Simulated User
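As an illustration of how this factorization can be used generatively, the sketch below draws one simulated user turn from tabulated conditional distributions. The dictionary layout of the CPTs and all names are assumptions made for the example, not the representation used by the authors.

import random

def draw(dist):
    """Sample a value from a non-empty {value: probability} distribution."""
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r <= acc:
            return value
    return value                               # guard against probability rounding

def sample_user_turn(a_t, cpt_k, cpt_g, cpt_u):
    """One turn under p(u|g,k,a) p(g|k) p(k|a): knowledge update, goal
    modification, then the user act, given the observed system act a_t."""
    k_t = draw(cpt_k[a_t])                     # p(k|a)
    g_t = draw(cpt_g[k_t])                     # p(g|k)
    u_t = draw(cpt_u[(g_t, k_t, a_t)])         # p(u|g,k,a)
    return u_t, g_t, k_t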

As explained in [13,12], the practical use of this kind of Bayesian network requires a tractable representation of the stochastic variables {a, s, u, g, k}. Therefore, the variable a is split into two subvariables {AS, Sys}: the dialog act type AS (<hello>, <request>, <confirm>, <implicitconfirm request>, <closingDialogue>) and the slots Sys on which it is applied (3 in this case, noted Sys-SLOT(i)). The user act u is also divided into two subsets: the dialog act type u (<inform>, <confirm>, <bye>, (<null>)) and the values v of the slots on which they apply (e.g. u = <inform>, v = "food type = Italian"). A special variable C is added to simulate the closing of the dialogue by the user. The knowledge k and the goal g are represented as sets of attribute-value pairs. There is a knowledge value for each of the slots in the task (3 in this case, noted k-SLOT(i)). Three levels of knowledge (thus 3 possible values) are considered: low, medium and high. These values represent the knowledge the user has about the fact that s/he previously provided the information about the corresponding slot to the system. For example, if the user once provided information about the type of food s/he wants, the knowledge about this slot goes from low to medium. If this information has been provided several times, the knowledge becomes high and it is more likely that the user will close the dialogue if this slot is asked for one more time. The knowledge roughly corresponds to an SU estimate of the dialog state. The user goal contains one value for each slot in the task. This ensures that the SU behaviour will be consistent with a given goal. One extra possible value "don't care" is added to indicate that the user may not be interested in the value of one specific slot.

2.2 Grounding

Notice that, if the Dialog Manager (DM) asks for confirmation of a slot for which the SU has a low knowledge value (i.e. a slot for which the SU has never provided information yet), it is likely that a grounding problem occurred. In other words, the DM and the user don't agree on the exchanged slots before reaching turn t. The SU described so far is designed to confirm (or not) the DM information it receives for this slot, with a certain probability, which can be low (or possibly to close the dialog). Yet, the knowledge value could be used to infer the occurrence of a grounding problem. More generally, the Bayesian network of Figure 1-a) shows that one could infer the most probable dialogue state s_t at turn t given the observed DM act a_t and the user's knowledge k_t: s_t = argmax_s p(s|a_t, k_t) (diagnostic inference). The user's knowledge is required since it keeps track of the dialogue history from the user's point of view. If this dialogue state estimate s_t is very different from the user knowledge k_t, a grounding problem is likely to have occurred.

Instead of computing this state estimate and comparing it to the user's knowledge, we preferred to directly add a decision node in the network, as shown in Figure 1-b). This node can only take boolean values, the true value meaning that a grounding problem occurred. In practice a grounding value can be obtained for each slot. In this implementation, if a grounding problem is detected for slot i, the SU is forced to provide information concerning this slot. More sophisticated grounding strategies could be investigated but we restrict the study to this simple one in this paper.
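A minimal sketch of this per-slot grounding decision is given below; the slot, act and knowledge encodings are assumptions consistent with the description above, not code from the paper.

def grounding_problem(system_act, slot_i, knowledge):
    """Flag a grounding problem: the DM asks to (implicitly) confirm a slot
    for which the SU has never provided information (knowledge still 'low')."""
    return system_act in ("confirm", "implicitconfirm_request") and \
        knowledge[slot_i] == "low"

def grounding_repair(system_act, slot_i, knowledge, goal):
    """Forced repair used in this implementation: re-send the goal value of the
    misunderstood slot; otherwise fall back to the regular BN sampling (None)."""
    if grounding_problem(system_act, slot_i, knowledge):
        return ("inform", slot_i, goal[slot_i])
    return None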

3 Experiments

3.1 Learning BN Parameters

The whole set of probabilities in the conditional probability tables (CPTs) cannot be learned, as this concerns thousands of them (2151, more precisely, in this paper). For most of the probabilities, it would not be possible to get a good estimate of their values anyway, since only a small number of similar situations can be found in the database. Tables 2, 3, 4, 5 and 6 summarize the probabilities that we assume generalize well to the whole set of probabilities. For instance, the slots are considered as equivalent, and it is considered that their actual values do not influence the user's answer to confirmations. Furthermore, when considering the probability for the user to close the dialogue (the C variable mentioned before), we consider the four system acts "hello", "request", "confirm" and "implicitconfirm request" as equivalent. Only the system act "closingDialog" has been treated differently. Finally, the variables coming from the AS node and from the k nodes are supposed independent. Therefore, for the probability of closing the dialogue, it is assumed that providing 2 + 3 = 5 probabilities is enough to generalize the parameters for the corresponding variable C well. The same kind of reasoning has been performed for each node, leading us to a set of only 25 probabilities that need to be learned or heuristically fixed. Notice that the aim of the expert was to obtain dialogs as short as possible: the expert probabilities were thus fixed accordingly.
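As an illustration of how one such tied parameter could be estimated from the dialog database by relative-frequency counting, consider the sketch below. The corpus layout (a list of (system_act, user_act) pairs per dialog) and the mapping of "user closes" to the act <bye> are assumptions for the example; the actual parameterization of the C node may differ.

# Illustrative relative-frequency estimate of one tied parameter: the probability
# that the user closes the dialog given that the system act is NOT "closingDialog"
# (tied over the four other system acts, as described above).
def estimate_close_given_not_closing(dialogs):
    n_turns = n_close = 0
    for dialog in dialogs:                      # dialog: list of (system_act, user_act)
        for system_act, user_act in dialog:
            if system_act == "closingDialog":
                continue
            n_turns += 1
            if user_act == "bye":               # assumed encoding of "the user closes"
                n_close += 1
    return n_close / n_turns if n_turns else 0.0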

Table 2. Expert versus trained probabilities – closing probability variable C

                                                                          expert  learned
from AS: the system act is not “closingDialog”                             0.90    0.997
from AS: the system act is “closingDialog”                                 0.99    0.944
from k-SLOT(i): the knowledge is low, thus the SU rather does not close
  the dialog (low value)                                                   0.01    0.016
from k-SLOT(i): the knowledge is medium, thus the SU rather closes the
  dialog                                                                   0.90    0.104
from k-SLOT(i): the knowledge is high, thus the SU rather closes the
  dialog (probability higher than above)                                   1.00    0.773

The heuristically determined values for these probabilities, and their corresponding learned values, are provided in the next five tables (Tables 2, 3, 4, 5 and 6). For instance, the first probability in Table 2 corresponds to the probability that the Simulated User decides to close the dialog if the system act coming from the DM is something other than a "closingDialog" act. Furthermore, in Table 2, the fact that the trained value for the probability of closing the dialog when the knowledge has a high value is quite low (0.773) compared to the corresponding expert probability value (1.00) indicates that human users tend to avoid closing the dialog even when they should think that they have already provided all the requested pieces of information. Therefore, dialogs generated with the heuristically parameterized BN-based simulator would probably tend to be shorter than those obtained with human users or with the trained BN-based Simulated User. The values in the seventh row and in the last row of Table 3 point in the same direction. This indicates as well that the heuristic BN has been especially designed to allow the full system to reach the end of the dialog as fast as possible, that is to say in as few turns as possible.

3.2 Interaction with REALL-DUDE/DIPPER/OAA

The Simulated User has been interfaced with the spoken dialog system provided in the REALL-DUDE/DIPPER/OAA environment (see [7], [1] and [2]). This environment originally aims at training policies by reinforcement learning. Yet, the dialog policy used for our experiments has been trained independently of the SU presented here and is used for testing purposes only. Dialogs similar to the examples provided in Tables 7 and 8 are obtained.


Table 3. Expert versus trained probabilities – {u, v}=INFORM SLOT(i)

                                                                          expert  learned
from AS: the system greets, or asks for a slot, thus the SU decides to
  enter the slot(i)                                                        1.00    0.32
from AS: the system asks for a confirmation, thus the SU decides rather
  not to enter the slot(i) (low value)                                     0.01    0.0314
from AS: the system closes the dialog, thus the SU decides rather to
  enter the slot(i) (low value)                                            0.35    9.6e−5
from Sys-SLOT(i): the slot(i) is concerned by the system act, thus the SU
  decides to enter the slot(i)                                             1.00    0.848
from Sys-SLOT(i): the slot(i) is not concerned by the system act, but the
  SU decides to enter the slot(i) anyway                                   0.95    0.727
from Sys-SLOT(i): the slot(i) is concerned by the system act, and the SU
  decides to enter the slot(j)                                             0.95    0.47
from k-SLOT(i): the knowledge for the slot(i) is low, thus the SU decides
  to enter the slot(i)                                                     1.00    0.968
from k-SLOT(i): the knowledge for the slot(i) is medium, thus the SU
  hesitates (probability smaller than above) to enter the slot(i)          0.8     0.768
from k-SLOT(i): the knowledge for the slot(i) is high, thus the SU
  decides rather not to enter the slot(i) (low value)                      0.001   0.361

Table 4. Expert versus trained probabilities – {u, v} = CONFIRM SLOT(i)

                                                                          expert  learned
from AS: the system greets or asks for a slot, but the SU decides to
  confirm the slot(i) (probability low)                                    0.01    0.117
from AS: the system asks for a confirmation for the slot(i), thus the SU
  decides to confirm the slot(i)                                           1.00    0.709
from Sys-SLOT(i): the slot(i) is concerned by the system act, but the SU
  decides to confirm the slot(i)                                           0.99    0.873
from Sys-SLOT(i): the slot(i) is not concerned by the system act, thus
  the SU decides to confirm the slot(i) (low value)                        0.001   0.0832
from k-SLOT(i): the slot(i) is concerned by the system act and the
  knowledge for the slot(i) is low, and the SU decides to confirm the
  slot(i) (low value?)                                                     0.01    0.906
from k-SLOT(i): the slot(i) is concerned by the system act and the
  knowledge for the slot(i) is medium, thus the SU rather decides to
  confirm the slot(i)                                                      0.98    0.915
from k-SLOT(i): the slot(i) is concerned by the system act and the
  knowledge for the slot(i) is high, thus the SU rather decides to
  confirm the slot(i)                                                      0.99    0.833

Table 5. Expert versus trained probabilities – {u, v} = INFORM VALUE SLOT(i)

                                                                          expert  learned
from INFORM SLOT(I): INFORM SLOT(I) not selected (the system decided not
  to provide information about the slot(i)), thus the SU decides to
  enter the value for the slot(i) (low value)                              0.02    0.0213
from VALUE GOAL SLOT(I): the SU decides to enter the value for the
  slot(i) it has in its goal                                               0.99    0.979


Table 6. Expert versus trained probabilities – {u, v} = CONFIRM VALUE SLOT(i)

                                                                          expert  learned
from u = confirm, v = SLOT(i): CONFIRM SLOT(I) not selected (the system
  decided not to confirm the slot(i)), but the SU decides to confirm the
  value for the slot(i) (low value)                                        0.05    0.00443
from VALUE GOAL SLOT(I): the SU decides to confirm the value for the
  slot(i) it has in its goal                                               0.99    0.980

The SU accompanies each hypothesis it sends to the DM with a probability simulating the confidence score that would be provided, among other things, by the Automatic Speech Recogniser (ASR) system. The DM can decide to ask for a confirmation if this probability is too low. This explains the "<syst act> confirm(Food)" and "<user act> confirm(food=yes)" turns in the second dialog example (in Table 8). It can be noticed that the SU answered this system act correctly.
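A possible way to attach such a simulated confidence score is sketched below; the Beta distribution and the confirmation threshold are purely illustrative assumptions, not values taken from the paper.

import random

def attach_confidence(user_act, a=8.0, b=2.0):
    """Pair a user act with a simulated ASR-like confidence score in [0, 1]."""
    return user_act, random.betavariate(a, b)   # scores concentrated near 1.0

def dm_asks_confirmation(score, threshold=0.7):
    """The DM requests a confirmation when the simulated confidence is too low."""
    return score < threshold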

Table 7. First dialog example, obtained using the Simulated User integrated within the REALL-DUDE/DIPPER/OAA spoken dialog system environment

User goal:   Food: indian   rPrice: cheap   Area: west

<syst act> hello(Food)               Hello. What kind of food would you like?
<user act> inform(slot_1=’indian’)   I’m looking for an Indian restaurant.
<syst act> request(Area)             What part of town do you have in mind?
<user act> inform(slot_3=’west’)     I want a place in the west part of town.
<syst act> request(rPrice)           What kind of price would you like?
<user act> inform(slot_2=’cheap’)    I’d prefer something cheap please.
<syst act> close                     The ... is a nice place. Goodbye.

3.3 Heuristic versus Trained BN-Based SU – Statistics

This section presents statistics computed on dialogs obtained using two versions of the SU presented in this paper and the REALL-DUDE/DIPPER/OAA environment. The first version of the SU is obtained considering heuristically determined values for the BN parameters. The second version is obtained considering trained parameters. The database used for training contains 1234 dialogs. It provides much more complex dialogs than the proposed task. For instance, twelve slots in total are considered. The values for some of these slots are requested by the DM, such as "type", "food", etc., calling thus for a "request" system act and an "inform" user act; others are requested by the user, such as "address", "phone", etc.; the "price range" can be requested by both sides. Furthermore, more than ten different system and user acts are used (see [23] for an exhaustive list). In the current task, fewer acts are considered, as noted in the introduction. Finally, more than one slot and more than one act can be presented during each turn, both for the DM turns and the SU turns. The database has been described in more detail in [22] and [20], where it has been used for training dialog management strategies.

Table 9 presents the results obtained using the heuristic BN (h-BN) and the trained BN (t-BN). One thousand dialogs have been simulated for each of them.
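The dialog-length statistics of Table 9 can be reproduced from the per-dialog turn counts with a few lines; the input format (a list of turn counts, one per simulated dialog) is an assumption made for this sketch.

def length_statistics(turn_counts):
    """Mean/max/min number of turns and the two percentages reported in Table 9."""
    n = float(len(turn_counts))
    return {
        "mean": sum(turn_counts) / n,
        "max": max(turn_counts),
        "min": min(turn_counts),
        "pct_exactly_4_turns": 100.0 * sum(1 for t in turn_counts if t == 4) / n,
        "pct_less_than_9_turns": 100.0 * sum(1 for t in turn_counts if t < 9) / n,
    }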


Figure 2 presents the histogram of the number of turns required to reach the end of the dialogs. Considering the h-BN, it can be seen that most of the dialogs are indeed completed in four turns, as expected (93.4%). This is due to the fact that it has been designed to obtain dialogs as short as possible.

Table 8. Second dialog example, obtained using the Simulated User integrated within the REALL-DUDE/DIPPER/OAA spoken dialog system environment

User goal:   Food: italian   rPrice: expensive   Area: central

<syst act> hello(Food)                  Hello. What kind of food would you like?
<user act> inform(slot_1=’italian’)     I’m looking for an Italian restaurant.
<syst act> confirm(Food)                You are looking for an Italian restaurant, right?
<user act> confirm(food=yes)            Yes, that’s right.
<syst act> request(rPrice)              What kind of price would you like?
<user act> inform(slot_2=’expensive’)   Err, expensive please.
<syst act> request(Area)                What part of town do you have in mind?
<user act> inform(slot_3=’central’)     I want a place in the central part of town.
<syst act> close                        I can recommend the .... Goodbye.

Table 9. Mean number of turns requested to reach the end of the dialogs; max number of turns; min number of turns; percentage of dialogs for which the end is reached in 4 turns; percentage of dialogs for which the end is reached in less than 9 turns

        mean    max   min   4 turns   < 9 turns
h-BN    4.969   21    4     93.20 %   93.40 %
t-BN    4.577   9     4     58.00 %   99.90 %

Using the t-BN, the dialogs are somewhat longer than when using the h-BN. This cannot be seen from the average number of turns, but from the percentage of dialogs which needed exactly four turns to reach their end: this percentage drops from 93.2% to 58.0%. However, very promisingly, the mean number of turns and the percentage of dialogs which needed fewer than nine turns to reach their end are better using the t-BN. Furthermore, it can be noticed that the very long dialogs (requiring more than nine turns), indicating some deep misunderstanding between the DM and the SU, have completely disappeared. This indicates that the dialogs obtained with the t-BN are much more natural, at least from a DM point of view, than the dialogs obtained with the h-BN. The distribution is also more in agreement with the data, which again shows the naturalness of the simulated dialogs.

Table 10 presents the number of turns needed per slot, respectively when the h-BN is used and the longest dialogs are kept, when the h-BN is used and the longest dialogs are discarded (they only reflect the DM stopping condition), when the t-BN is used, and for the database. Clearly, the t-BN provides more realistic dialogs in terms of the number of turns needed for each slot.


Fig. 2. Histogram of the number of turns requested to reach the end of the dialog – top: the heuristic BN is used – bottom: the trained BN is used. [Figure: two histograms; x-axis: number of turns needed to reach the end of the dialogue (4–22); y-axis: number of dialogues.]

Table 10. Number of turns needed per slot; h-BN and longest dialogs being kept, h-BN used and longest dialogs not being kept, t-BN used, and considering the database

h-BN h-BN without long dialogs t-BN database

1.6563 1.3986 1.5257 1.5661

3.4 Grounding Process – Dialog Examples and Statistics

Out of 1000 simulated dialogs, 496 grounding problems have been detected. In Tables 11 and 12, dialog examples with a grounding problem are shown. In the first case, the SU detected the grounding problem and solved it; in the second case, a grounding problem occurred, but the SU did not act accordingly and the dialog deteriorated afterwards. When the DM asks for confirmation of the third slot, the SU answers it even though it had never sent information about this slot before. The SU is confused, and the DM is likely to be confused afterwards as well.

Statistics computed on dialogs obtained using the grounding-enabled SU are presented as well, with the grounding problem solver being used or not. As the task features three slots and as the SU is configured to send information about no more than a single slot per turn, the minimum number of turns necessary to reach the end of the dialog is four: one per slot, plus the "closingDialog" turn. Notice that a turn is defined here as a couple <syst act>/<user act> (except for the "closingDialog" act, as it is the DM which is actually able to stop a dialog).

Considering only the dialogs with a grounding problem, and not the whole set of dialogs, the results in Table 13 are obtained. They are quite promising: the proportion of dialogs longer than 5 turns drops by 30.6% relative.
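Read as a relative reduction, this figure can be checked from the two percentages given in Table 13:

\[
\frac{29.22\,\% - 20.29\,\%}{29.22\,\%} \approx 0.306 = 30.6\,\%\ \text{(relative)},
\]

which corresponds to an absolute reduction of 8.93 percentage points.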


Table 11. Dialog example, obtained using the Simulated User integrated within the REALL-DUDE/DIPPER/OAA spoken dialog system environment; a grounding problem is detected and solved (the grounding problem solver is used)

User goal:   Food: italian   rPrice: cheap   Area: central

<syst act> hello(Food)                Hello. What kind of food would you like?
<user act> inform(slot_1=’italian’)   I’m looking for an Italian restaurant.
<syst act> confirm(Area)              You are looking for a restaurant in the
                                      central part of town, right?
=> grounding error detected for the slot_1 (the DM understood something wrong);
   problem solved by the SU
<user act> inform(slot_1=’italian’)   Erm, an Italian restaurant please.
<syst act> request(rPrice)            What kind of price would you like?
<user act> inform(slot_2=’cheap’)     Could I have a cheap restaurant?
...

Table 12. Dialog example, obtained using the Simulated User integrated within the REALL-DUDE/DIPPER/OAA spoken dialog system environment; a grounding problem occurs and it is not solved by the SU (the grounding problem solver is not used)

User goal:   Food: italian   rPrice: cheap   Area: east

<syst act> hello(Food)                Hello. What kind of food would you like?
<user act> inform(slot_1=’italian’)   I’m looking for an Italian restaurant.
<syst act> request(rPrice)            What kind of price would you like?
<user act> inform(slot_2=’cheap’)     I’d prefer something cheap please.
<syst act> confirm(Area)              Did you say you are looking for a
                                      restaurant in the central part of town?
=> grounding error detected for the slot_2 (the DM understood something wrong);
   the SU does not respond correctly to this problem
<user act> confirm(area=no)           No, sorry.
...

Table 13. Mean number of turns requested to reach the end of the dialogs; percentage of dialogs for which the end is reached in more than 5 turns – the BN without grounding problem solver or the BN with a grounding problem solver is used

                               mean    > 5 turns
no grounding problem solver    5.254   29.22 %
grounding problem solver       5.069   20.29 %

4 Conclusion and Future Works

In this paper, a user simulation model based on Bayesian Networks is proposed to simulate realistic human-machine dialogs at the intention level, including grounding behaviours. Our goal was first to show the interest of training the parameters of the Bayesian Networks using a database of actual human-machine dialogs, and second to show the interest of simulating the grounding process often occurring in human-human dialogs. This is done by comparing the number of turns required to reach the end of a dialog using different configurations of the proposed simulation framework. Several directions for future work are furthermore envisioned.

First, one of the goals of developing this Simulated User is the training of dialog management policies within a reinforcement learning paradigm. The developed Simulated User will allow training the policies in use in the POMDP engines implemented in the REALL-DUDE/DIPPER/OAA environment. Second, some preliminary results concerning the interaction of the presented Simulated User with an independently developed Dialog Manager are shown. It is planned in the very near future to interact with the spoken dialog system provided in the REALL-DUDE/DIPPER/OAA environment in a much more extensive and systematic way. This will allow comparing the number of turns obtained with the BN-based Simulated Users to the number of turns obtained with human users, and analysing the corresponding task completion scores, etc. Furthermore, as considering only the dialog length is definitely not enough for the evaluation of user simulations, refined metrics ([21],[17],[6]) are under study. Third, we would like to use the ability of Bayesian Networks to learn their parameters online, so as to improve the naturalness of the simulated dialogs as users are really interacting with the system, and to be able to retrain policies between real interactions.

Acknowledgements

This work is supported by the CLASSIC (Computational Learning in Adaptive Systems for Spoken Conversation) European project, Project Number 216594, funded under EC FP7, Call 1 (http://www.classic-project.org/). The authors also want to thank Oliver Lemon and Paul Crook for their help in using the REALL-DUDE/DIPPER/OAA framework.

References

1. Bos, J., Klein, E., Lemon, O., Oka, T.: DIPPER: Description and Formalisation of an Information-State Update Dialogue System Architecture. In: Proceedings of the 4th SIGDIAL Workshop on Discourse and Dialogue, pp. 115–124 (2003)

2. Cheyer, A., Martin, D.: The Open Agent Architecture. Journal of Autonomous Agents and Multi-Agent Systems (4), 143–148 (2001)

3. Clark, H., Schaefer, E.: Contributing to discourse. Cognitive Science 13, 259–294 (1989)

4. Eckert, W., Levin, E., Pieraccini, R.: User Modeling for Spoken Dialogue System Evaluation. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 80–87 (1997)

5. Horvitz, E., Breese, J., Heckerman, D., Hovel, D., Rommelse, K.: The Lumiere Project: Bayesian User Modeling for Inferring the Goals and Needs of Software Users. In: Proc. of the 14th Conference on Uncertainty in Artificial Intelligence (July 1998)

6. Janarthanam, S., Lemon, O.: A Two-Tier Simulation Model for Reinforcement Learning of Adaptive Referring Expression Generation Policies. In: Proceedings of the 10th SIGDIAL, pp. 120–123 (2009)

7. Lemon, O., Liu, X., Shapiro, D., Tollander, C.: Hierarchical Reinforcement Learning of Dialogue Policies in a Development Environment for Dialogue Systems: REALL-DUDE. In: 10th SemDial Workshop on the Semantics and Pragmatics of Dialogue, BRANDIAL 2006 (2006)

8. Levin, E., Pieraccini, R., Eckert, W.: A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies. IEEE Transactions on Speech and Audio Processing 8, 11–23 (2000)

9. López-Cózar, R., Callejas, Z., McTear, M.F.: Testing the performance of spoken dialogue systems by means of an artificially simulated user. Artificial Intelligence Review 26(4), 291–323 (2006)

10. Meng, H., Wai, C., Pieraccini, R.: The Use of Belief Networks for Mixed-Initiative Dialog Modeling. In: Proceedings of the 8th International Conference on Spoken Language Processing (ICSLP) (October 2000)

11. Pietquin, O.: A Probabilistic Description of Man-Machine Spoken Communication. In: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME) (July 2005)

12. Pietquin, O.: Learning to Ground in Spoken Dialogue Systems. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, pp. 165–168 (2007)

13. Pietquin, O., Dutoit, T.: A Probabilistic Framework for Dialog Simulation and Optimal Strategy Learning. IEEE Transactions on Audio, Speech, and Language Processing 14, 589–599 (2006)

14. Pietquin, O., Dutoit, T.: Dynamic Bayesian Networks for NLU Simulation with Applications to Dialog Optimal Strategy Learning. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP) (May 2006)

15. Pietquin, O., Renals, S.: ASR System Modeling for Automatic Evaluation and Optimization of Dialogue Systems. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Orlando, FL, USA (May 2002)

16. Pietquin, O., Rossignol, S., Ianotto, M.: Training Bayesian Networks for Realistic Man-Machine Spoken Dialogue Simulation. In: Proceedings of the 1st International Workshop on Spoken Dialogue Systems Technology (December 2009)

17. Schatzmann, J., Georgila, K., Young, S.: Quantitative Evaluation of User Simulation Techniques for Spoken Dialogue Systems. In: Proceedings of the 6th SIGDIAL, pp. 45–54 (2005)

18. Schatzmann, J., Thomson, B., Young, S.: Error Simulation for Training Statistical Dialogue Systems. In: Proceedings of the International Workshop on Automatic Speech Recognition and Understanding (ASRU), Kyoto, Japan (2007)

19. Schatzmann, J., Weilhammer, K., Stuttle, M., Young, S.: A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. Knowledge Engineering Review 21(2), 97–126 (2007)

20. Thomson, B., Gasic, M., Keizer, S., Mairesse, F., Schatzmann, J., Yu, K., Young, S.: User Study of the Bayesian Update of Dialogue State Approach to Dialogue Management. In: Proceedings of Interspeech (2008)

21. Vuurpijl, L., ten Bosch, L., Rossignol, S., Neumann, A., Pfleger, N., Engel, R.: Evaluation of multimodal dialog systems. In: Proceedings of the LREC Workshop on Multimodal Corpora (2004)

22. Williams, J.D., Young, S.: Partially Observable Markov Decision Processes for Spoken Dialog Systems. Computer Speech and Language 21, 231–422 (2007)

23. Young, S.: CUED Standard Dialogue Acts. Technical report, Cambridge University Engineering Dept. (October 2007)


Facing Reality: Simulating Deployment of Anger Recognition in IVR Systems

Alexander Schmitt1, Tim Polzehl2, and Wolfgang Minker1

1 Dialogue Systems Group/Institute of Information Technology, University of Ulm,
  Albert-Einstein-Allee 43, D-89081 Ulm
  [email protected]
2 Quality and Usability Lab der Technischen Universität Berlin/Deutsche Telekom Laboratories,
  Ernst-Reuter-Platz 7, D-10587 Berlin
  [email protected]

Abstract. With the availability of real-life corpora, studies dealing with speech-based emotion recognition have turned towards recognition of angry users on turn level. Based on acoustic, linguistic and sometimes contextual features, classifiers yield performance values of 0.7-0.8 f-score when classifying angry vs. non-angry user turns. The effect of deploying anger classifiers in real systems still remains an open point and has not been examined so far. Is the current performance of anger detection already adequate for a change in dialogue strategy or even an escalation to an operator? In this study we explore the impact of an anger classifier that has been published in a previous study on specific dialogues. We introduce a cost-sensitive classifier that significantly reduces the number of misclassified non-angry user turns.

1 Introduction

An increasing number of studies deal with the detection of emotions in speech corpora stemming from telephone applications. Especially speech-based anger detection in such Interactive Voice Response (IVR) dialogue systems can be used offline to monitor quality of service [15,14]. It can indicate potentially problematic turns or grammar slots to the carrier so that the system can be monitored and refined. Furthermore, it could serve as a trigger to switch to dialogue strategies that take into account the user's emotional state [7,1]. The classification is reaching a performance level that makes it attractive not only for research. Anger detection also opens up opportunities in deployed systems. With reliable classifiers it will be possible to re-route customers to the assistance of a human operator when they get angry. However, this would require a solid performance of the classifier in spotting angry users.

In this study we apply a state-of-the-art anger classifier to specific dialogues. We utilize the new Witchcraft Workbench [10], which has been developed for the simulation of prediction and classification models, in order to analyze the impact of the classifier on the dialogue and to evaluate its performance in a real-life application.


The remainder of this paper is organized as follows: In Section 2 related work on classifying anger in IVR systems is presented. For our study we employed a large IVR corpus that is presented in Section 3. Section 4 presents an anger classifier applied to 1,702 calls from our dialogue corpus. The weaknesses of this approach lead us to Section 5, where we analyze the effect of employing a cost-sensitive meta classifier that reduces the number of false predictions. Section 6 introduces the Witchcraft Workbench as an evaluation framework for anger models along with SDS dialogues. We finally conclude and discuss our findings in Section 7.

2 Related Studies for Detecting User Anger

One of the first studies addressing emotion recognition with special application to call centers was presented by [8] in 1999. Due to the lack of "real" data, the corpus was collected with 18 non-professional actors who were asked to record voice messages of 15-90 seconds at 22 kHz, which were split afterwards into chunks of 1-3 seconds. Raters were asked to label the chunks with "agitation" (including anger) and "calm". Recognition is performed on acoustic cues. Studies on real-life corpora came up in the early 2000s. [4] employed data from a deployed IVR system, but complains about data sparsity. The dataset comprises a balanced set of only 142 short utterances with "negative" and "non-negative" emotions. In later studies [5,3] the employed set contained 1,197 utterances, however strongly skewed towards non-angry emotions. In addition to acoustic features, Lee et al. make use of linguistic cues that are extracted from ASR transcripts and propose the concept of Emotional Salience as an additional feature.

In 2003, [15] stated that real-life corpora are still hard to obtain and employed an acted corpus containing 8 speakers at 22 kHz quality. The utterance length was artificially shortened to simulate conditions typical for Interactive Voice Response systems.

Recent studies employ real-life data and some include additional knowledge sources that go beyond acoustic and linguistic information of the current turn. [6] and [11] used context features such as the performance of the ASR and the barge-in behavior of the user to support anger detection on turn level. In [13] the knowledge about the emotional state of the user in previous turns has been used to improve anger detection on turn level.

Current approaches yield average scores between 60-80% accuracy when classifying balanced sets of angry and non-angry user turns. At the same time, the detection of non-angry user turns performs better than the detection of angry user turns. In earlier work we compared a German and an English IVR corpus and found the classifier's precision for "non-anger" at 84.9% and 86.0%, while the classifier's precision for "anger" was as low as 72.0% and 67.0% [9].

3 Corpus Description

The dialogue corpus in our studies originates from a US-American IVR portal capable of fixing Internet-related problems jointly with the caller. In previous work, three labelers labeled the individual user turns in the corpus with the labels angry, annoyed and non-angry. The final label was defined based on majority voting, resulting in 90.2% neutral, 5.1% garbage, 3.4% annoyed and 0.7% angry utterances. 0.6% of the samples in the corpus were sorted out since all three raters had different opinions. While the number of angry and annoyed utterances seems very low, 429 calls (i.e. 22.4% of all dialogues) contained annoyed or angry utterances.

For recent studies we collapsed annoyed and angry to angry to be comparable with other studies in this field. Table 1 depicts the details of the corpus.

Table 1. Details of the employed dialog corpus

                                        English
Domain                                  Internet Support
Number of Dialogs in Total              1911
Audio Duration in Total                 10 h
Average Number of Turns per Dialog      11.88
Number of Raters                        3
Speech Quality                          Narrow-band
Average Duration Anger in Seconds       1.87 ± 0.61
Average Duration Non-Anger in Seconds   1.57 ± 0.66
Cohen's Extended Kappa                  0.63

4 Cost-Insensitive Classification

A discriminative binary classification task such as anger recognition aims to distinguish between two distinct classes. By definition, one class is designated as the "positive" class, the other as the "negative" class. When evaluating the performance of a binary classifier on a test set of n samples, four numbers play an important role: the number of correctly classified samples from the positive class (true positives; TP); the number of samples mistakenly classified as belonging to the positive class (false positives; FP); the number of correctly classified samples from the negative class (true negatives; TN); and the number of samples incorrectly classified as belonging to the negative class (false negatives; FN). Consequently, the sum TP + FP + TN + FN equals the number of samples n in the test set.

Porting these numbers to the anger detection domain, TP, TN, FP and FN signify how many utterances

– have been angry and have indeed been spotted by the classifier (TP),
– have been angry but have mistakenly been classified as non-angry (FN),
– have not been angry and have correctly been classified as non-angry (TN),
– have not been angry but have mistakenly been classified as angry (FP).

In this task it is vital to yield a low FN rate. For reasonable classification it is important that each class is represented equally in the training set. Otherwise most classifiers would tend to always predict the predominant class, since a sufficient success rate is guaranteed this way; in this task that class is "non-angry". The test is typically carried out with a test set in which the classes are also equally distributed, in order to see whether the learned model separates the classes appropriately. However, when considering the later application of a trained model to realistic test data, the distribution of classes is not at all balanced and is frequently heavily skewed towards "non-angry". Furthermore, the classifier has to deal with noisy data such as coughing, laughing or off-talk.

We evaluate the anger classifier that is designed as described in [12]. In order to stay speaker-independent, the training set for training the classifier is defined as follows: we randomly select 50% of the calls that contain anger and use all their angry user turns for training. These 459 angry user turns stem from 209 calls and are replenished with 459 non-angry turns from the same callers.

To assess how a classifier would perform under real-life conditions we use the remaining 1,702 calls from the corpus as test set. It contains the full bandwidth of speech samples that typically occur in IVR systems: 472 angry, 17,269 non-angry and 935 garbage utterances. For training and testing, turns labeled as "garbage" have been collapsed with the class "non-angry".
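The following sketch illustrates how such a speaker-independent, balanced training split might be assembled. The corpus format (field names, label values) is not specified in the paper and is assumed here purely for illustration.

```python
import random

def split_corpus(calls, seed=42):
    """calls: list of dicts with keys 'call_id' and 'turns'; each turn is a dict
    with keys 'features' and 'label', the label being 'angry' or 'non-angry'
    (garbage turns already collapsed into 'non-angry')."""
    random.seed(seed)
    anger_calls = [c for c in calls if any(t['label'] == 'angry' for t in c['turns'])]
    random.shuffle(anger_calls)
    train_calls = anger_calls[:len(anger_calls) // 2]   # 50% of the calls that contain anger
    train_ids = {c['call_id'] for c in train_calls}

    angry = [t for c in train_calls for t in c['turns'] if t['label'] == 'angry']
    pool = [t for c in train_calls for t in c['turns'] if t['label'] == 'non-angry']
    non_angry = random.sample(pool, len(angry))          # replenish with as many non-angry
                                                         # turns from the same callers
    train_turns = angry + non_angry                      # balanced training set
    test_calls = [c for c in calls if c['call_id'] not in train_ids]   # remaining calls keep
                                                         # their natural class skew
    return train_turns, test_calls
```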

The corpus-wide results are depicted in Table 2.

Table 2. Speaker-independent classification result of real-life data when evaluating the presented classifier

                       true non-angry    true angry    class precision
  pred. non-angry      13,040 (TN)       126 (FN)      99.0%
  pred. angry          5,124 (FP)        386 (TP)      7.0%
  class recall         71.8%             75.3%

At first sight, the performance appears rather satisfying: 75.3% of all "angry" utterances in the corpus could be identified as angry, which can be seen from the recall value of the "angry" class. On the other hand, the precision of the anger class is, at 7%, very low: 5,124 non-angry turns have been falsely identified as angry. Imagine a scenario where the classifier's prediction is employed to trigger escalation to an operator. The result shows 13 times more "false alarms" than "true alarms". The classifier in its current state would lead to a large number of false escalations.
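The class-wise values in Table 2 follow directly from the four counts; a minimal check:

```python
# Class-wise precision and recall for the confusion matrix in Table 2
TP, FP, TN, FN = 386, 5124, 13040, 126

precision_angry = TP / (TP + FP)   # cf. the 7.0% precision of the "angry" class
recall_angry    = TP / (TP + FN)   # cf. the 75.3% recall of the "angry" class
precision_non   = TN / (TN + FN)   # cf. the 99.0% precision of the "non-angry" class
recall_non      = TN / (TN + FP)   # cf. the 71.8% recall of the "non-angry" class
false_alarm_ratio = FP / TP        # roughly 13 false alarms per true alarm

print(precision_angry, recall_angry, precision_non, recall_non, false_alarm_ratio)
```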

5 Cost-Sensitive Classification

In most work it is assumed that both classes are equally important to detect and that the costs of misclassification are equal for both cases. Provided that anger detection is deployed in a real telephone system, where the recognition result of the classifier has an impact on the further dialogue flow, it is easy to see that a high number of FN is not as severe as a high number of FP: if a user turn is classified as non-angry although the user is angry (FN), the dialogue system would not behave differently than if no anger detection were deployed. A completely different scenario occurs when a non-angry user is mistakenly classified as angry (FP): the system would transfer the caller to an operator although there is no need to do so. Thus we can state that not all numbers are of equal importance to the task.

To make the classifier cost-sensitive, we apply the MetaCost algorithm [2] to our classifier. MetaCost uses the confidence scores of the underlying classifier in order to choose the class with the minimal risk. In this context the conditional risk R(i|x) is defined as

R(i|x) = \sum_{j} P(j|x)\, C(i,j)

The risk is the expected cost of classifying sample x as belonging to class i. The probability P is the confidence of the underlying classifier that sample x belongs to class j. C(i,j) is the cost matrix that defines the costs for correct and wrong predictions. A cost matrix contains the penalty values for each decision a classifier can make when predicting the class of an utterance. Normally, each correct decision is weighted with 0 (no penalty, since it was a correct decision) and each misclassification is weighted with 1. Altering these weights causes the classifier to favor one class at the expense of the other class.
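Below is a minimal sketch of this risk-minimising decision rule; the class indices, the concrete cost value of 7 and the example posteriors are illustrative, and the sketch is not the MetaCost implementation itself (the full MetaCost procedure additionally relabels the training data using bagged probability estimates and retrains the classifier).

```python
import numpy as np

# Cost matrix C[i, j]: cost of predicting class i when the true class is j
# (0 = non-angry, 1 = angry). Correct decisions cost 0; misclassifying a
# non-angry turn as angry is penalised with an illustrative cost of 7.
C = np.array([[0.0, 1.0],
              [7.0, 0.0]])

def min_risk_class(posteriors, cost=C):
    """posteriors: [P(non-angry|x), P(angry|x)] from the underlying classifier."""
    risk = cost @ posteriors           # R(i|x) = sum_j P(j|x) * C(i, j)
    return int(np.argmin(risk))

print(min_risk_class(np.array([0.1, 0.9])))  # 1: predicting "angry" risks 0.1*7 = 0.7 < 0.9*1
print(min_risk_class(np.array([0.3, 0.7])))  # 0: predicting "angry" would risk 0.3*7 = 2.1 > 0.7
```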

The true costs of misclassifying non-angry callers as angry ones can hardly be quantified and depend on a variety of factors such as costs for operators, availability of operators, per-minute costs of the IVR system, etc. We thus empirically determine a sensible value for the cost matrix that yields reasonable results. We increase the cost for misclassifying non-angry callers by the factors 1, 2, ..., 15. The results are depicted in Figure 1.

With increasing costs for misclassifying non-angry turns, the classifier favors the class non-angry. The precision for angry increases; on the other hand, the recall for angry decreases. Looking at Figure 1 we can see that the optimal cost for misclassifying non-angry turns is a cost value of 7. It reduces the number of misclassified non-angry turns (FP) by a factor of 16.9, namely from 5,124 (cost = 1) to 303, while the number of correctly classified angry turns (TP) is only reduced by a factor of 3.7, from 386 to 103.
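Such a cost sweep can be sketched as follows, reusing min_risk_class from the previous example; predict_posteriors and test_turns are placeholders for the trained classifier and the evaluation data, not names from the paper.

```python
def sweep_costs(test_turns, predict_posteriors, costs=range(1, 16)):
    """test_turns: list of (features, label) pairs with label 1 = angry, 0 = non-angry.
    predict_posteriors(features) -> [P(non-angry|x), P(angry|x)]."""
    results = {}
    for c in costs:
        cost_matrix = np.array([[0.0, 1.0], [float(c), 0.0]])
        tp = fp = fn = 0
        for features, label in test_turns:
            pred = min_risk_class(predict_posteriors(features), cost_matrix)
            if pred == 1 and label == 1:
                tp += 1
            elif pred == 1 and label == 0:
                fp += 1
            elif pred == 0 and label == 1:
                fn += 1
        results[c] = (tp, fp, fn)      # cf. Figure 1(b)
    return results
```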

6 The Effect on Dialogue

Which effect would the deployment of the cost-insensitive classifier from Section 4 and the cost-sensitive classifier from Section 5 have on specific dialogues?

We apply both classification models within the Witchcraft Workbench. Our Workbench for Intelligent exploraTion of Human ComputeR conversaTions (witchcraftwb.sourceforge.net) is a new platform-independent open-source workbench designed for the analysis, mining and management of large spoken dialogue system corpora. What makes Witchcraft unique is its ability to visualize the effect of classification and prediction models on ongoing system-user interactions.

[Figure 1: two plots over cost values 1–15 — (a) class-wise precision and recall for the positive and negative class; (b) the number of TP, FP and FN samples.]

Fig. 1. Results of iteratively increasing the cost for misclassifying a non-angry user turn: class-wise precision and recall values (a) and the number of TP, FP and FN (b). The number of TN is not depicted here for reasons of visualization; it slightly increases from 19,000 to 20,000 samples.

Witchcraft is able to handle predictions from binary and multi-class discriminative classifiers and from regression models. Hence, the workbench allows us to apply the results of an anger classification model to specific dialogues. Witchcraft operates on turn level, requesting the classifier to deliver a prediction based on the information available at the currently processed dialogue turn of a specific call.

Witchcraft reads in the predictions from the classifier that have been determined on turn level (e.g. the current emotional state of the caller). Then it creates statistics on call level. It counts the number of TP, FP, TN, FN within the call and calculates class-wise precision and recall values as well as accuracy, f-score, etc. The integrated SQL-based search mechanism allows searching for calls that fulfill a specific criterion. In this way, calls with a high number of false positives, low f1 scores for the class anger, etc. can be spotted.
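The call-level aggregation can be pictured as follows. This is an illustrative re-implementation of the described statistics, not code from the Witchcraft Workbench, and the data layout is assumed.

```python
def call_statistics(turns):
    """turns: list of (prediction, reference) pairs, values 'angry' or 'non-angry'."""
    tp = sum(1 for p, r in turns if p == 'angry' and r == 'angry')
    fp = sum(1 for p, r in turns if p == 'angry' and r == 'non-angry')
    tn = sum(1 for p, r in turns if p == 'non-angry' and r == 'non-angry')
    fn = sum(1 for p, r in turns if p == 'non-angry' and r == 'angry')
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        'TP': tp, 'FP': fp, 'TN': tn, 'FN': fn,
        'anger_precision': precision,
        'anger_recall': recall,
        'anger_f1': 2 * precision * recall / (precision + recall) if precision + recall else 0.0,
        'accuracy': (tp + tn) / len(turns) if turns else 0.0,
    }

# e.g. spotting calls with many false alarms:
# noisy_calls = [cid for cid, turns in corpus.items() if call_statistics(turns)['FP'] >= 5]
```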

An example of such a call under analysis in the Witchcraft Workbench is depicted in Figure 2.

Fig. 2. Screenshot of the Witchcraft Workbench in the Analysis Perspective. It consists of two Call Selection Views (top left), a Call Detail View, the central Dialogue View enabling navigation within the conversation, Classification Tables, Chart Views and the audio player. Turns that have been correctly identified as anger (TP) are depicted with a green background color; false predictions are depicted with a red background color (FN+FP).

Our interest lies in finding out why and to what extent misclassification of non-angry turns happens. An auditory analysis of the misclassified samples in Witchcraft showed that the anger model seems to frequently misclassify distorted and garbage turns containing higher frequencies and high loudness values as "angry".

When using the cost-insensitive classifier, Witchcraft identifies 1,196 calls that contain no user anger (according to the raters) but at least one user turn mistakenly classified as angry (according to the classifier). 313 calls even contain 5 or more such misclassifications. The adjusted cost-sensitive classifier, with the cost value for misclassifying non-angry turns set to 7, produces a different picture: the numbers drop to only 134 and 4 calls, respectively. Two representative dialogues are depicted in Figure 3.


(a) Non-angry Caller (b) Angry Caller

Fig. 3. Typical calls from a non-angry caller and an angry caller as predicted by the cost-insensitive and cost-sensitive learner (left and right side, respectively). Again, green turns symbolize correctly spotted angry turns (TP), red turns symbolize misclassified turns (FN+FP).

The cost-insensitive classifier generally causes a large number of false alarms. Figure 3(a) contains a call from a non-angry user. The cost-insensitive classifier (left side) misclassifies a high number of non-angry turns as angry ones. The cost-sensitive classifier (right side) ignores the predictions of the underlying classifier and decides for "non-angry". Figure 3(b) contains a call from an angry caller. The cost-insensitive classifier predicts nearly all angry turns correctly, while the cost-sensitive one behaves more conservatively and misses some of the true angry turns.

7 Conclusion and Discussion

This study has presented the impact of deploying anger recognition in an IVR system. The presented cost-insensitive classifier generates a very high number of false positives. This first setup does not seem suitable for an IVR system that relies on anger detection for escalating to operators.

The employment of the presented cost-sensitive learner decreases the number of misclassified non-angry user turns. A cost value of 7 turned out to be the best compromise in order to obtain low FP values while keeping the TP value comparably high. Notwithstanding these improvements, the false positives for the detection of angry users remain about three times as high as the true positives (303 vs. 103 turns). The fact that the classes "annoyed" and "angry" have previously been merged into one single class could be an explanation for this behavior: in effect, the patterns distinguishing really angry from non-angry turns are blurred and the classifier loses performance.


For classifiers that are intended to predict hot anger online it is thus recommendable to adjust the rating process and include only those samples that contain hot anger. Further, it is questionable whether the detection of a single angry user turn should serve as the basis for escalation to an operator. Pattern matching that takes several user turns into account would be better suited to detect callers that are in a rage. The current performance seems appropriate for an adaptation of the dialogue strategy based on the classifier's prediction.

Acknowledgment

The research leading to these results has received funding from the Transregional Collaborative Research Centre SFB/TRR 62 "Companion-Technology for Cognitive Technical Systems" funded by the German Research Foundation (DFG). We would like to thank the reviewers for their valuable comments.

References

1. Burkhardt, F., van Ballegooy, M., Huber, R.: An emotion-aware voice portal. In: Proceedings of Electronic Speech Signal Processing ESSP (2005)

2. Domingos, P.: MetaCost: A general method for making classifiers cost-sensitive. In: Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, pp. 155–164. ACM Press, New York (1999)

3. Lee, C.M., Narayanan, S.S.: Toward detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Processing 13(2), 293–303 (2005)

4. Lee, C.M., Narayanan, S.S., Pieraccini, R.: Recognition of negative emotions from the speech signal. In: IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2001, pp. 240–243 (2001)

5. Lee, C.M., Narayanan, S.S., Pieraccini, R.: Combining acoustic and language information for emotion recognition. In: ICSLP 2002 (2002)

6. Liscombe, J., Riccardi, G., Hakkani-Tür, D.: Using context to improve emotion detection in spoken dialog systems. In: Interspeech, pp. 1845–1848 (2005)

7. Metze, F., Englert, R., Bub, U., Burkhardt, F., Stegmann, J.: Getting closer: tailored human-computer speech dialog. Universal Access in the Information Society (2008)

8. Petrushin, V.A.: Emotion in speech: Recognition and application to call centers. In: Artificial Neural Networks in Engineering (ANNIE 1999), St. Louis, pp. 7–10 (1999)

9. Polzehl, T., Schmitt, A., Metze, F.: Salient Features for Anger Recognition in German and English IVR Portals. In: Spoken Dialogue Systems Technology and Design. Springer, Boston (August 2010)

10. Schmitt, A., Bertrand, G., Heinroth, T., Liscombe, J.: Witchcraft: A workbench for intelligent exploration of human computer conversations. In: International Conference on Language Resources and Evaluation (LREC), Valletta, Malta (May 2010)

11. Schmitt, A., Heinroth, T., Liscombe, J.: On nomatchs, noinputs and bargeins: Do non-acoustic features support anger detection? In: Proceedings of the 10th Annual SIGDIAL Meeting on Discourse and Dialogue, SigDial Conference 2009, London, UK. Association for Computational Linguistics (September 2009)


12. Schmitt, A., Pieraccini, R., Polzehl, T.: 'For Heaven's sake, gimme a live person!' Designing Emotion-Detection Customer Care Voice Applications in Automated Call Centers. In: Advances in Speech Recognition: Mobile Environments, Call Centers and Clinics. Springer, Heidelberg (September 2010)

13. Schmitt, A., Polzehl, T.: Modeling a-priori likelihoods for angry user turns with hidden Markov models. In: Proc. of the Fifth International Conference on Speech Prosody 2010 (May 2010)

14. Shafran, I., Riley, M., Mohri, M.: Voice signatures. In: 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2003, pp. 31–36 (November–December 2003)

15. Yacoub, S., Simske, S., Lin, X., Burns, J.: Recognition of emotions in interactive voice response systems. In: Proc. Eurospeech, Geneva, pp. 1–4 (2003)


A Discourse and Dialogue Infrastructure for Industrial Dissemination

Daniel Sonntag, Norbert Reithinger, Gerd Herzog, and Tilman Becker

German Research Center for AI (DFKI), Stuhlsatzenhausweg 3, 66123 Saarbruecken, Germany

Abstract. We think that modern speech dialogue systems need a prior usability analysis to identify the requirements for industrial applications. In addition, work from the area of the Semantic Web should be integrated. These requirements can then be met by multimodal semantic processing, semantic navigation, interactive semantic mediation, user adaptation/personalisation, interactive service composition, and semantic output representation, which we explain in this paper. We also describe the discourse and dialogue infrastructure these components develop and provide two examples of disseminated industrial prototypes.

1 Introduction

Dialogue system construction is a difficult task since many individual natural language processing components have to be combined into a complex AI system. Theoretical aspects and perspectives on communicative intention have to meet the practical demands of a user-machine interface—the speech-based communication must be natural for humans, otherwise they will never accept dialogue as a proper means of communication with machines. Over the last several years, the market for speech technology has seen significant developments [14] and powerful commercial off-the-shelf solutions for speech recognition (ASR) or speech synthesis (TTS). Even entire voice user interface (VUI) platforms have become available. However, these discourse and dialogue infrastructures have had only moderate success so far in the entertainment or industrial sector. This is the case because a dialogue system, as a complex AI system, cannot easily be constructed. Additionally, dialogue engineering requires a great deal of customisation work for specific applications.

We implemented a new discourse and dialogue infrastructure for semantic access to structured and unstructured information repositories for industrial applications. In this paper, we basically provide two new contributions. First, we provide architectural recommendations of how new dialogue infrastructures may look. Second, we discuss which components are needed to convey the requirements of dialogical interaction in multiple use case scenarios and with multiple interaction devices. To meet these objectives, we implemented a distributed, ontology-based dialogue system architecture where every major component can be run on a different host, increasing the scalability of the overall system. Thereby, the dialogue system acts as the middleware between the clients and the backend services that hide complexity from the user by presenting aggregated ontological data. We also implemented the attached dialogue components within the architecture. This paper is structured as follows. First we discuss related work and the basic dialogue architecture. This is followed by a discussion of the individual dialogue tasks and components. Finally, we discuss two industrial dissemination prototypes and provide a conclusion.

Related Work The dialogue engineering task is to provide dialogue-based access to the domain of interest, e.g., for answering domain-specific questions about an industrial process in order to complete a business process. Prominent examples of integration platforms include OOA [11], TRIPS [1], and Galaxy Communicator [19]; these infrastructures mainly address the interconnection of heterogeneous software components. The W3C consortium also proposes inter-module communication standards like the Voice Extensible Markup Language VoiceXML1 or the Extensible MultiModal Annotation markup language EMMA2, with products from industry supporting these standards3. In addition, many systems are available that translate natural language input into structured ontological representations (e.g., AquaLog [10]), port the language to specific domains (e.g., ORAKEL [3]), or use reformulated semantic structures (NLION [16]). AquaLog, e.g., presents a solution for a rapid customisation of the system for a particular ontology; with ORAKEL a system engineer can adapt the natural language understanding (NLU) component [4] in several cycles, thereby customising the interface to a certain knowledge domain. The NLION system uses shallow natural language processing techniques (i.e., spell checking, stemming, and compound detection) to instantiate an ontology object. Such systems, which integrate sub-task software modules from academia or industry, can be used off the shelf. However, if one looks closer at actual industrial projects' requirements, this idealistic vision begins to blur, mainly because of software infrastructure or usability issues. These issues are explained in the context of our basic architecture approach for industrial dissemination.

Basic Architecture We learned some lessons which we use as guidelines in the development of basic architectures and software infrastructures for multimodal dialogue systems. In earlier projects [27, 17] we integrated different sub-components into multimodal interaction systems; in these projects, hub-and-spoke dialogue frameworks played a major role [18]. We also learned lessons which we use as guidelines in the development of semantic dialogue systems [12, 23]; over the last years, we have adhered strictly to the rule "No presentation without representation."

1 http://www.w3.org/TR/voicexml20/
2 http://www.w3.org/TR/emma/
3 http://www.voicexml.org


The idea is to implement a generic, semantic dialogue shell that can be configured for and applied to domain-specific dialogue applications. All messages transferred between internal and external components are based on RDF data structures which are modelled in a discourse ontology (also cf. [6, 8, 21]). Our systems for industrial dissemination have four main properties: (1) multimodality of user interaction, (2) ontological representation of interaction structures and queries, (3) semantic representation of the interface, and (4) encapsulation of the dialogue proper from the rest of the application.4 These architectural decisions are based partly on usability issues that arise when dealing with end-to-end dialogue-based interaction systems for industrial dissemination (they correspond to the use case requirements). Intelligent AI systems that involve intelligent algorithms for dialogue processing and interaction management must be judged for their suitability in industrial environments. Our major concern, observed in the development process for industrial applications over the last years, is that the incorporation of AI technologies such as complex natural language understanding components (e.g., HPSG-based speech understanding) and open-domain question answering functionality can unintentionally diminish a dialogue system's usability. This is because negative side-effects such as diminished predictability of what the system is doing at the moment and lost controllability of the internal dialogue processes (e.g., a question answering process) occur more often when AI components are involved. This tendency creates new usability requirements to account for the special demands introduced by the use of AI. For the identification of these usability issues, we adopted the binocular view of interactive intelligent systems (discussed in detail in [9]).

The main architectural challenges we encountered in implementing a new dialogue application for a new domain can be summarised as follows: first, providing a common basis for task-specific processing; second, accessing the entire application backend via a layered approach. In our experience, these challenges can best be solved by implementing the core of a dialogue runtime environment, an ontology dialogue platform (ODP) framework and its platform API (the DFKI spin-off company SemVox, see www.semvox.de, offers a commercial version), as well as providing configurable adaptor components. These translate between conventional answer data structures and ontology-based representations (in the case of, e.g., a SPARQL backend repository) or Web Services (WS)—ranging from simple HTTP-based REST services to Semantic Web Services driven by declarative specifications [24]. Our ODP workbench (figure 1) builds upon the industry standard Eclipse and also integrates other established open source software development tools to support dialogue application development, automated testing, and interactive debugging. A distinguishing feature of the toolbox is the built-in support for eTFS (extended Typed Feature Structures), the optimised ODP-internal data representation for knowledge structures. This enables ontology-aware tools for the knowledge engineer and application developer. A detailed description of the rapid dialogue engineering process, which is possible thanks to the Eclipse plugins and templates, can be found in [25].

4 A comprehensive overview of ontology-based dialogue processing and the systematic realisation of these properties can be found in [21], pp. 71–131.


Fig. 1. Overall design of the discourse and dialogue infrastructure: ontology-based dialogue processing framework and workbench

2 Dialogue Tasks and Components

In addition to automatic speech recognition (ASR), dialogue tasks include the interpretation of the speech signal and other input modalities, the context-based generation of multimedia presentations, and the modelling of discourse structures. These topic areas are all part of the general research and development agenda within the area of discourse and dialogue with an emphasis on dialogue systems. According to the utility issues and industrial user requirements we identified (system robustness/usability and processing transparency play the major roles), we distinguished five dialogue tasks for these topics and built five dialogue components, respectively. These components, which allow for a semantic and pragmatic/application-based modelling of discourse and dialogue, are presented in more detail in the rest of this section.

Multimodal Semantic Processing and Navigation This task provides the rule-based fusion of different input modalities such as text, speech, and pointing gestures. We use a production-rule-based fusion and discourse engine which follows the implementation in [13]. Within the dialogue infrastructure, this component plays the major role since it provides basic and configurable dialogue processing capabilities that can be adapted to specific industrial application scenarios. More processing robustness is achieved through the application of a special robust parsing feature in the context of the RDF graphs resulting from the input parsing process. When the user only utters catchwords instead of complete utterances, the semantic relationship between the catchwords can be guessed (following [26]) according to the ontological domain model of the industrial application domain. The Tool Suite builds upon the industry standard Eclipse and also integrates other established open source software development tools to support dialogue application development, automated testing, and interactive debugging.

The semantic navigation and interaction task/module builds the connection to the backend services (the tasks Interactive Semantic Mediation and Web Service Composition) and the presentation module (the task Semantic Output Representation) and allows for a graph-like presentation of incremental results (also cf. [2]). The spoken dialogue input is used to generate SPARQL queries on ontology instances (using a Sesame repository, see www.openrdf.org). Users are then presented a perspective on the result RDF graph with navigation possibilities.
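The paper does not list the generated queries; the following hypothetical example merely shows what a generated SPARQL request against a Sesame repository could look like. The endpoint URL and the ex: vocabulary are invented for illustration.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = "http://localhost:8080/openrdf-sesame/repositories/demo"   # placeholder repository
sparql = SPARQLWrapper(endpoint)
sparql.setQuery("""
    PREFIX ex: <http://example.org/services#>
    SELECT ?service ?name ?price
    WHERE {
        ?service a ex:Service ;
                 ex:name  ?name ;
                 ex:price ?price .
    }
    ORDER BY ?price
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["name"]["value"], row["price"]["value"])
```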

Interactive Semantic Mediation Interactive semantic mediation has two aspects: (1) the mediation with the processing backend, the so-called dynamic knowledge base layer (i.e., the heterogeneous data repositories), and (2) the mediation with the user (advanced user interface functionality for the adaptation to new industrial use case scenarios). We developed a semantic mediation component to mediate between the query interpretation created in the dialogue shell and the semantic background services prevalent in the industry application context. The mediator can be used to, e.g., collect answers from external information sources. Since we deal with ontology-based information in heterogeneous terminology, ontology matching has become one of the major requirements. The ontology matching task can be addressed by several string-based or structure-based techniques (cf. [5], p. 341, for example). As a new contribution in the context of large-scale speech-based AI systems, we think of ontology matching as a dialogue-based interactive mediation process (cognitive support frameworks for ontology mapping involve users). The overall approach is depicted in figure 2. A dialogue-based approach can make more use of partial, unsure mappings. Furthermore, it increases the usability in dialogue scenarios where the primary task is different from the matching task itself (cf. industrial usability requirements). So as not to annoy the user, he/she is presented only the difficult cases for disambiguation feedback; thus we use the dialogue shell basically to confirm or reject pre-considered alignments [20].
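The confirm-or-reject strategy for difficult alignments can be sketched as follows; the confidence thresholds and the data layout are assumptions for illustration, not the authors' implementation.

```python
def mediate_alignments(candidates, ask_user, accept=0.9, reject=0.4):
    """candidates: list of (source_term, target_term, confidence) produced by an
    automatic matcher; ask_user(source, target) -> bool is the dialogue shell's
    yes/no confirmation for a proposed alignment."""
    accepted = []
    for source, target, confidence in candidates:
        if confidence >= accept:              # clear case: accept silently
            accepted.append((source, target))
        elif confidence <= reject:            # clear case: discard silently
            continue
        elif ask_user(source, target):        # difficult case: one confirm/reject question
            accepted.append((source, target))
    return accepted
```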

User Adaptation and Personalisation Industrial usability requirements clearly advocate that new AI systems work with user models and personalised information. An additional, related concern is the industrial requirement to provide user privacy protection and to implement data security guidelines. This makes user adaptation and personalisation an indispensable asset for many advanced AI systems. Our discourse and dialogue infrastructure should benefit from user preferences, communities, their ontological modelling, the privacy issues of anonymity, data security, as well as data control by the user (also cf. processing transparency and transaction security for processes that involve sensitive data like medical patient records). Figure 3 illustrates the close interrelationships between the adaptation and personalisation blocks. In the context of new dialogue system infrastructures, persistent and personalised annotation of personal data (e.g., personal image annotations) should be provided.


Fig. 2. Interactive semantic mediation for industry adaptation

Further topics of interest include interaction histories with dialogue session, task, and personalised interaction hierarchies. For example, our use case implementation tries to guess a predefined user group according to the interaction history.

Fig. 3. Industry-relevant forms of user adaptation and personalisation

Interactive Service Composition The service composer module takes input from the multimodal semantic base module. First, the structured representation of the query is mapped to a formal query. The formal representation no longer uses concepts of verbs, nouns, etc., but rather uses the custom knowledge representation in RDF. If only one query answering backend exists, the knowledge representation of this backend can be used. Otherwise, an interim RDF-based representation is generated. The formal query is analysed and mapped to one or more services that can answer (parts of) the query. This step typically involves several substeps, including decomposing the query into smaller parts, finding suitable services (service discovery), mapping the query to other representations, planning a series of service executions, and possibly initiating clarification or disambiguation requests. The different workflows form different routes in a pre-specified execution plan which includes the possibility to request clarification information from the user if needed (hence interactive service composition). Figure 4 outlines the hard-wired execution plan to dynamically address and compose SOAP/WSDL-based and REST-based services.
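A highly simplified sketch of such an interactive mapping step is given below; the Service type, the discovery criterion and the clarification callback are invented for illustration and merely stand in for the pre-specified execution plan of Figure 4.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Service:
    name: str
    answers: str                               # the kind of query part the service can answer
    def invoke(self, part: str) -> str:
        return f"{self.name}({part})"          # stand-in for a SOAP/WSDL or REST call

def compose_and_execute(query_parts: List[str], registry: List[Service],
                        clarify: Callable[[List[Service]], Service]) -> List[str]:
    """Map each part of the formal query to a service; ask the user via `clarify`
    when the mapping is ambiguous (interactive service composition)."""
    results = []
    for part in query_parts:
        candidates = [s for s in registry if s.answers == part]   # service discovery
        if not candidates:
            continue                                              # no service for this part
        service = candidates[0] if len(candidates) == 1 else clarify(candidates)
        results.append(service.invoke(part))
    return results
```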

Fig. 4. Execution plan for (interactive) service composition

Semantic Output Representation We implemented a semantic output representation module which realises an abstract container concept called Semantic Interface Elements (SIEs) for the representation and activation of multimedia elements visible on, e.g., a touchscreen user interface [22]. The semantic output representation architecture comprises several GUI-related submodules (such as the Semantic Interface Elements Manager or the Event Manager), dialogue-engine-related modules (such as the Interaction Manager or the natural language Parser), and the Presentation Manager submodule (i.e., the GUI Model Manager). The most important part of the architecture is the Display Manager, which observes the behaviour of the currently displayed SIE. The Display Manager dispatches XML-based messages to the dialogue system with the help of a message Decoder/Encoder. It has to be customised for every new application, whereas all other modules are generic. The Multimodal Dialogue Engine then processes the requests in the Interaction and Interpretation Manager modules. A new system action or reaction is then initiated and sent to the Presentation Manager. The GUI Model Manager builds up a presentation message in eTFS notation (internal typed feature structure format) before the complete message is sent to the Display Manager as a PreML message.

3 Industrial Dissemination

In the following discussion of industrial dissemination, the first example shows a medical prototype application for a radiologist while the second shows a business-to-business mobile application. Both dialogue systems build upon the multimodal speech-based discourse and dialogue infrastructure described in this paper. The prototypes put emphasis on various and combined input forms on different interaction devices, i.e., multitouch screens and iPhones.

Radiology Dialogue System In the MEDICO use case, we work on the direct industrial dissemination of a medical dialogue system prototype. Clinical care and research increasingly rely on digitised patient information. There is a growing need to store and organise all patient data, including health records, laboratory reports, and medical images. Effective retrieval of images builds on the semantic annotation of image contents. At the same time it is crucial that clinicians have access to a coherent view of these data within their particular diagnosis or treatment context. This means that with traditional user interfaces, users may browse or explore visualised patient data, but little or no help is given when it comes to the interpretation of what is being displayed. Semantic annotations should provide the necessary image information, and a semantic dialogue shell should be used to ask questions about the image annotations and refine them while engaging the clinician in a natural speech dialogue at the same time.

The process of reading the images is highly efficient. Recently, structured reporting was introduced that allows radiologists to use predefined standardised forms for a limited but growing number of specific examinations. However, radiologists feel restricted by these standardised forms and fear a decrease in focus and eye dwell time on the images [7, 28]. As a result, the acceptance of structured reporting is still low among radiologists, while referring physicians and hospital administrative staff are generally supportive of structured standardised reporting since it eases the communication with the radiologists and can be used more easily for further processing. We strive to overcome the limitations of structured reporting by allowing content-based information to be automatically extracted from medical images and (in combination with dialogue-based reporting) eliminating the radiologists' need to fill out forms by enabling them to focus on the image while either dictating the image annotations of the reports to the dialogue system or refining existing annotations.


The domain-specific dialogue application [24], which uses a touchscreen (figure 5, left) to display the medical SIE windows, is able to process the following medical user-system dialogue:

1 U: "Show me the CTs, last examination, patient XY."
2 S: Shows corresponding patient CT studies as DICOM picture series and MR videos.
3 U: "Show me the internal organs: lungs, liver, then spleen and colon."
4 S: Shows corresponding patient image data according to the referral record.
5 U: "This lymph node here (+ pointing gesture) is enlarged; so lymphoblastic. Are there any comparative cases in the hospital?"
6 S: "The search obtained this list of patients with similar lesions."
7 U: "Ah okay."
Our system switches to the comparative records to help the radiologist in the differential diagnosis of the suspicious case, before the next organ (liver) is examined.
8 U: "Find similar liver lesions with the characteristics: hyper-intense and/or coarse texture ..."
9 S: Our system again displays the search results ranked by the similarity and matching of the medical ontology terms that constrain the semantic search.

Fig. 5. Left: Multimodal touchscreen interface (reprinted from [24]). The clinician can touch the items and ask questions about them. Right: Mobile speech client for the business expert.

Currently, the prototype application is being tested in a clinical environment (University Hospitals Erlangen). Furthermore, the question of how to integrate this information and image knowledge with other types of data, such as patient data, is paramount. In a further step, individual, speech-based findings should be organised according to a specific body region and the respective textual patient data.

Mobile Business-to-Business Dialogue System In the TEXO use case, we try to assist an employee and his superior in a production pipeline [15]. Our mobile business scenario is as follows: searching on a service platform, an employee of a company has found a suitable service which he needs for only a short period of time for his current work. Since he is not allowed to carry out the purchase, he formally requests the service by writing a ticket in the company-internal Enterprise Resource Planning (ERP) system. In the defined business process, only his superior can approve the request and buy the service. But first, the person in charge has to check for alternative services on the service platform which might be more suitable for the company in terms of quality or cost standards. The person in charge is currently away on business but carries his mobile device with him, which allows him to carry out the transaction on the go. The interaction is speech-based and employs a distributed version and instance of the dialogue system infrastructure. The mobile client (figure 5, left part of the right half) streams the speech and click input to the dialogue server, where the input fusion and reaction tasks are performed. The user can ask for specific services by saying "Show me alternative services" or by naming different service classes. After the ranked list of services is presented, multiple filter criteria can be specified at once (e.g., "Sort the results according to price and rating."). As a result, the services are displayed in a 2D grid which eases the selection according to multiple sorting criteria (figure 5, right).

We tested the mobile system in 12 business-related subtasks before the industrial dissemination. Eleven participants were recruited from a set of 50 people who responded to our request (only that fraction was found suitable). The selected people were all students (most of them in business or economics studies). From our analysis of the questionnaires we conclude that our mobile B2B system can be valuable for business users. Almost all users successfully completed the subtasks (89% of a total of 132 subtasks). In addition, many of them also provided the following positive feedback: they felt confident about the ticket purchase being successful. Then we introduced the prototype to the industrial partner. According to first user tests with the industrial partner SAP, the mobile speech client enhances the perceived usability of mobile business services and has a positive impact on mobile work productivity. More precisely, our mobile business application reflects the business task but simplifies the selection of services and the commitment of a transaction (internal purchase); it additionally minimises text entry (a limitation on mobile devices) and displays relevant information in such a way that a user is able to grasp it at first glance (3D visualisation).

4 Conclusion

Based on an integration platform for off-the-shelf dialogue solutions and internal dialogue modules, we described the parts of a new discourse and dialogue infrastructure. The ontology-based dialogue platform provides a technical solution for the dissemination challenge into industrial environments. The requirements for an industrial dissemination are met by implementing generic components for the most important tasks that we identified: multimodal semantic processing, semantic navigation, interactive semantic mediation, user adaptation/personalisation, interactive service composition, and semantic output representation.


Semantic (ontology-based) interpretations of dialogue utterances may become the key advancement in semantic search and dialogue-based interaction for industrial applications, thereby mediating and addressing dynamic, business-relevant information sources.

Acknowledgments. Thanks go out to Robert Nesselrath, Yajing Zang, Markus Löckelt, Malte Kiesel, Matthieu Deru, Simon Bergweiler, Eugenie Giesbrecht, Alassane Ndiaye, Norbert Pfleger, Alexander Pfalzgraf, Jan Schehl, Jochen Steigner and Colette Weihrauch for the implementation and evaluation of the dialogue infrastructure. This research has been supported by the THESEUS Programme funded by the German Federal Ministry of Economics and Technology (01MQ07016).

References

1. Allen, J., Byron, D., Dzikovska, M., Ferguson, G., Galescu, L., Stent, A.: An Architecture for a Generic Dialogue Shell. Natural Language Engineering 6(3), 1–16 (2000)

2. Boselli, R., Paoli, F.D.: Semantic navigation through multiple topic ontologies. In: Proceedings of Semantic Web Applications and Perspectives (SWAP), Trento, Italy (2005)

3. Cimiano, P., Haase, P., Heizmann, J.: Porting natural language interfaces between domains: an experimental user study with the ORAKEL system. In: IUI 2007: Proceedings of the 12th International Conference on Intelligent User Interfaces, pp. 180–189. ACM, New York (2007)

4. Engel, R.: SPIN: A Semantic Parser for Spoken Dialog Systems. In: Proceedings of the 5th Slovenian and First International Language Technology Conference, IS-LTC 2006 (2006)

5. Euzenat, J., Shvaiko, P.: Ontology Matching. Springer, Heidelberg (2007)

6. Fensel, D., Hendler, J.A., Lieberman, H., Wahlster, W. (eds.): Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. MIT Press, Cambridge (2003)

7. Hall, F.M.: The radiology report of the future. Radiology 251(2), 313–316 (2009)

8. Hitzler, P., Krötzsch, M., Rudolph, S.: Foundations of Semantic Web Technologies. Chapman & Hall/CRC, Boca Raton (August 2009)

9. Jameson, A.D., Spaulding, A., Yorke-Smith, N.: Introduction to the Special Issue on Usable AI. AI Magazine 3(4), 11–16 (2009)

10. Lopez, V., Pasin, M., Motta, E.: AquaLog: An ontology-portable question answering system for the semantic web. In: Gomez-Perez, A., Euzenat, J. (eds.) ESWC 2005. LNCS, vol. 3532, pp. 546–562. Springer, Heidelberg (2005)

11. Martin, D., Cheyer, A., Moran, D.: The Open Agent Architecture: a framework for building distributed software systems. Applied Artificial Intelligence 13(1/2), 91–128 (1999)

12. Oviatt, S.: Ten myths of multimodal interaction. Communications of the ACM 42(11), 74–81 (1999)

13. Pfleger, N.: FADE - An Integrated Approach to Multimodal Fusion and Discourse Processing. In: Proceedings of the Doctoral Spotlight at ICMI 2005, Trento, Italy (2005)


14. Pieraccini, R., Huerta, J.: Where do we go from here? Research and commercial spoken dialog systems. In: Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue, pp. 1–10 (September 2005)

15. Porta, D., Sonntag, D., Neßelrath, R.: A Multimodal Mobile B2B Dialogue Interface on the iPhone. In: Proceedings of the 4th Workshop on Speech in Mobile and Pervasive Environments (SiMPE 2009) in Conjunction with MobileHCI 2009. ACM, New York (2009)

16. Ramachandran, V.A., Krishnamurthi, I.: NLION: Natural language interface for querying ontologies. In: COMPUTE 2009: Proceedings of the 2nd Bangalore Annual Compute Conference, pp. 1–4. ACM, New York (2009)

17. Reithinger, N., Fedeler, D., Kumar, A., Lauer, C., Pecourt, E., Romary, L.: MIAMM - A Multimodal Dialogue System Using Haptics. In: van Kuppevelt, J., Dybkjaer, L., Bernsen, N.O. (eds.) Advances in Natural Multimodal Dialogue Systems. Springer, Heidelberg (2005)

18. Reithinger, N., Sonntag, D.: An integration framework for a mobile multimodal dialogue system accessing the Semantic Web. In: Proceedings of INTERSPEECH, Lisbon, Portugal, pp. 841–844 (2005)

19. Seneff, S., Lau, R., Polifroni, J.: Organization, Communication, and Control in the Galaxy-II Conversational System. In: Proceedings of Eurospeech 1999, Budapest, Hungary, pp. 1271–1274 (1999)

20. Sonntag, D.: Towards dialogue-based interactive semantic mediation in the medical domain. In: Proceedings of the Third International Workshop on Ontology Matching (OM) Collocated with the 7th International Semantic Web Conference, ISWC (2008)

21. Sonntag, D.: Ontologies and Adaptivity in Dialogue for Question Answering. AKA and IOS Press, Heidelberg (2010)

22. Sonntag, D., Deru, M., Bergweiler, S.: Design and Implementation of Combined Mobile and Touchscreen-Based Multimodal Web 3.0 Interfaces. In: Proceedings of the International Conference on Artificial Intelligence (ICAI), pp. 974–979 (2009)

23. Sonntag, D., Engel, R., Herzog, G., Pfalzgraf, A., Pfleger, N., Romanelli, M., Reithinger, N.: SmartWeb Handheld—Multimodal Interaction with Ontological Knowledge Bases and Semantic Web Services. In: Huang, T.S., Nijholt, A., Pantic, M., Pentland, A. (eds.) ICMI/IJCAI Workshops 2007. LNCS (LNAI), vol. 4451, pp. 272–295. Springer, Heidelberg (2007)

24. Sonntag, D., Möller, M.: Unifying semantic annotation and querying in biomedical image repositories. In: Proceedings of the International Conference on Knowledge Management and Information Sharing, KMIS (2009)

25. Sonntag, D., Sonnenberg, G., Nesselrath, R., Herzog, G.: Supporting a rapid dialogue engineering process. In: Proceedings of the First International Workshop on Spoken Dialogue Systems Technology, IWSDS (2009)

26. Tran, T., Wang, H., Rudolph, S., Cimiano, P.: Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-Shaped (RDF) Data. In: ICDE 2009: Proceedings of the 2009 IEEE International Conference on Data Engineering, Washington, DC, USA, pp. 405–416. IEEE Computer Society Press, Los Alamitos (2009)

27. Wahlster, W.: SmartKom: Symmetric Multimodality in an Adaptive and Reusable Dialogue Shell. In: Krahl, R., Gunther, D. (eds.) Proceedings of the Human Computer Interaction Status Conference 2003, Berlin, Germany, pp. 47–62. DLR (2003)

28. Weiss, D.L., Langlotz, C.: Structured reporting: Patient care enhancement or productivity nightmare? Radiology 249(3), 739–747 (2008)


Impact of Semantic Web on the Development of Spoken Dialogue Systems

Masahiro Araki and Yu Funakura

Department of Information Science, Kyoto Institute of Technology, Matsugasaki Sakyo-ku 6068585 Kyoto Japan

[email protected]

Abstract. We examined several possible uses of semantic Web technologies in developing spoken dialogue systems. We report that three advanced features of the semantic Web, namely, collective intelligence, smooth integration of knowledge, and automatic inference, have a large impact on various stages in the development of spoken dialogue systems, such as language modeling, semantic analysis, dialogue management, sentence generation, and user modeling. As an example, we implemented a query generation method for semantic search based on semantic Web technology.

Keywords: Spoken dialogue system, semantic Web, language model, semantic analysis, dialogue management, sentence generation, user model.

1 Introduction

Semantic Web [1] is expected to be a next-generation Web technology. Semantic Web functions as a "Web of data", as compared to the current HTML-based technology, which is considered to be a "Web of documents". The advanced features of the semantic Web are the realization of collective intelligence, smooth integration of distributed knowledge, and automatic inference based on well-established logics. Taking into account this movement, a number of studies on spoken dialogue systems have used semantic Web technology as a knowledge representation (e.g. [2]-[7]). However, a large portion of previous research on the use of the semantic Web in spoken dialogue systems did not make full use of the advantages of the semantic Web.

We examined several possible uses of semantic Web technologies in developing spoken dialogue systems. In the present paper, we report that three advanced features of the semantic Web, namely, collective intelligence, smooth integration of knowledge, and automatic inference, have a large impact on various stages of development of spoken dialogue systems, such as language modeling, semantic analysis, dialogue management, sentence generation, and user modeling.

The remainder of the present paper is organized as follows. Section 2 presents an overview of the basic concept of semantic Web technology. Section 3 surveys previous research on the use of the semantic Web in spoken dialogue systems. Section 4 explains several uses of semantic Web technology in the development of spoken dialogue systems. As an example, Section 5 reports a query generation method for semantic search based on semantic Web technology. The present paper is concluded in Section 6 with a discussion of future research.

2 Overview of Semantic Web Technology

The goal of semantic Web technology is to realize a "Web of data" [1]. All resources (e.g., people, organizations, documents, events, music, etc.) are represented by Uniform Resource Identifiers (URIs). For example, people and organizations can be represented by their Web page URLs. General concepts (e.g., a location name or a famous event) can be represented by the URL of the corresponding Wikipedia entry.

The web of knowledge is expressed by combining these resources in triple form, i.e., subject-predicate-object. This knowledge representation method is referred to as the Resource Description Framework (RDF)1. In RDF, the subject and the predicate are resources. The object can be either a resource or a literal datum. Fig. 1 shows two examples of RDF statements. In the figure, the ellipses indicate resources and the rectangle indicates a literal.

Fig. 1. Example of RDF. The upper RDF represents that "a Web page (http://ex.org/rdf.html) is created by a person (http://ex.org/people#ma)" via the predicate foaf:maker, and the lower RDF represents that "the person's name is 'Masahiro Araki'" via the predicate foaf:name.

The predicates (foaf:maker and foaf:name) in Fig. 1 come from the Friend Of A Friend (FOAF) vocabulary,2 an established vocabulary (i.e., ontology) of the semantic Web. This knowledge can easily be embedded in HTML pages. Microformats3 and RDFa4 are promising approaches for standardizing the mechanisms for embedding RDF in HTML. Therefore, a web crawler can easily recognize these RDF statements if such information is embedded in HTML. In addition to FOAF, there are a number of useful ontologies, for example for events, reviews, and documents. The simple RDF representation and well-established ontologies are the foundation of the next generation of collective intelligence.
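For illustration, the two statements of Fig. 1 can be written down with rdflib, a widely used Python RDF library; this is a minimal sketch, not code from the paper.

```python
from rdflib import Graph, Literal, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
g = Graph()
g.bind("foaf", FOAF)

page = URIRef("http://ex.org/rdf.html")
person = URIRef("http://ex.org/people#ma")

g.add((page, FOAF.maker, person))                      # the page was created by the person
g.add((person, FOAF.name, Literal("Masahiro Araki")))  # the person's name

print(g.serialize(format="turtle"))
```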

1 http://www.w3.org/RDF/
2 http://www.foaf-project.org/
3 http://microformats.org/
4 http://www.w3.org/TR/xhtml-rdfa-primer/

The knowledge integration mechanism for distributed sources is another advantage of semantic Web technology. In the most basic case, for example, the two RDFs in Fig. 1 can be merged into a single graph structure by integrating the object node of the upper RDF and the subject node of the lower RDF. Even if these RDFs are distributed over different knowledge sources, they can be integrated based on URI identification. In the case of ontology integration, there is no need to unify predicate names or class hierarchies, because several set relations can be represented in the Web Ontology Language (OWL)5. For example, the equivalence of two predicates can be represented by the following OWL statement6:

foaf:maker owl:equivalentProperty ex:creator.

This statement indicates that the predicate foaf:maker, which is used in one ontology, is equivalent to the predicate ex:creator, which is used in another ontology. Such URI-based resource identification and ontology-based representation of set relations enable easy integration of distributed knowledge.
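
As a concrete illustration of this merging step, the following sketch combines the two triples of Fig. 1 into one graph purely on the basis of the shared URI. The sketch is written in Python using the rdflib library, which is an assumption of this example and not part of the systems discussed here.

# Minimal sketch of URI-based knowledge integration (assumes the rdflib library).
from rdflib import Graph, Literal, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")

# Two RDF fragments held by different knowledge sources (the triples of Fig. 1).
source_a = Graph()
source_a.add((URIRef("http://ex.org/rdf.html"), FOAF.maker,
              URIRef("http://ex.org/people#ma")))

source_b = Graph()
source_b.add((URIRef("http://ex.org/people#ma"), FOAF.name,
              Literal("Masahiro Araki")))

# Merging is a plain union of triples; the shared URI http://ex.org/people#ma
# links the maker of the Web page to the person's name without any schema alignment.
merged = source_a + source_b
for subject, predicate, obj in merged:
    print(subject, predicate, obj)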

Another advantage of semantic Web technology is automatic inference. OWL is based on description logic. Many OWL data sources provide basic inference functionality, such as set operations and restriction operations. Therefore, the knowledge modeler does not need to write basic inference code in order to implement intelligent systems, such as the back-end of a spoken dialogue system.

3 Previous Research Using Semantic Web in Spoken Dialogue Systems

In this section, we review previous research using semantic Web technology in spoken dialogue systems.

The goal of the Companions project7 is to realize intelligent, persistent, and personalized multimodal interfaces to the Internet. The project uses RDF representation and an inference mechanism as back-end knowledge. For example, one system uses family knowledge expressed in RDF when referring to a user's photos [2]. In [3], Sonntag et al. proposed rapid prototyping of a spoken dialogue system using RDF/OWL as a knowledge representation and SPARQL (the query language for RDF)8 as a query language. In [4], Heinroth et al. demonstrated how to easily merge domain knowledge written in OWL.

A few studies have examined sentence generation from RDF [5],[6]. It is natural to use a standardized concept representation as the input of the sentence generator of a dialogue system.

For dialogue management, we proposed a method that uses the frame representation of OWL as the knowledge for frame-driven dialogue management [7]. In the proposed method, the dialogue flows from upper concepts to lower concepts, following the ontology representation, without any domain-specific control rules.

5 http://www.w3.org/TR/owl-overview/
6 The OWL statement is represented in RDF. The notation used here is called N3; it gives the subject, the predicate, and the object, followed by a period.
7 http://www.companions-project.org/
8 http://www.w3.org/TR/rdf-sparql-query/

In summary, previous studies have revealed several advantages of using the semantic Web in spoken dialogue system development. However, these studies restricted the application of semantic Web technology to a single element of the dialogue system. As we will show in the next section, there is significant potential to integrate knowledge in spoken dialogue systems using the semantic Web.

4 Impact on the Development of Spoken Dialogue Systems

So far, we have described three advanced features of the semantic Web, namely, collective intelligence, smooth integration of knowledge, and automatic inference. These features have a significant impact on various elements of spoken dialogue systems.

4.1 Language Model

Semantic Web technology opens the possibility of automatically acquiring word dictionaries. As shown in Section 2, semantic tags using Microformats have already been attached to a number of Web pages. By combining class information, such as foaf:Person, foaf:Group, and foaf:Organization, with predicate information, e.g., foaf:name, we can easily construct categorical dictionaries for grammar-based language models from scratch. In addition, the categorical dictionaries can be updated automatically as new words appear.
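
A possible realization of this idea is sketched below: a crawled RDF data set is queried for the names of all resources typed as foaf:Person, foaf:Group, or foaf:Organization, and the results are collected into class-based word lists. The sketch assumes Python with the rdflib library, and the file name crawled_foaf.ttl is a placeholder, not an actual resource of the described system.

# Sketch: building categorical word lists from crawled RDF data
# (rdflib and the file name "crawled_foaf.ttl" are assumptions of this example).
from collections import defaultdict
from rdflib import Graph

g = Graph()
g.parse("crawled_foaf.ttl", format="turtle")

QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?class ?name WHERE {
  ?x a ?class .
  ?x foaf:name ?name .
  FILTER (?class IN (foaf:Person, foaf:Group, foaf:Organization))
}
"""

# Each FOAF class becomes one word category of a grammar-based language model;
# re-running the query on newly crawled data updates the categories automatically.
categories = defaultdict(set)
for cls, name in g.query(QUERY):
    categories[str(cls)].add(str(name))

for cls, words in sorted(categories.items()):
    print(cls, len(words), "entries")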

4.2 Semantic Analysis

An RDF triple can be read as a subject-predicate-object (SPO) sentence structure in English. If the user input to the spoken dialogue system is a statement, and if we can map the input entities (e.g., a person's name or an organization name) to resources represented by URIs, and their relation to a predicate resource (e.g., belongs-to), then the user input can be represented in RDF and easily stored in the RDF data store as part of the dialogue history.

On the other hand, if the user input is a question, then the input can be converted to SPARQL. An example of converting spoken input to SPARQL is shown in Section 5.

4.3 Dialogue Management

We previously proposed a frame description method using OWL for our frame-driven dialogue system [7]. There are several advantages of using OWL as a knowledge representation for dialogue management. First, the AND-OR task structure can be represented using rdf:Seq and rdf:Alt. Second, OWL can be used as a dialogue flow controller: the dialogue begins at the top-level concept and continues by going down the concept hierarchy, specifying values for lower-level slots. The transition between concepts is driven by dialogue patterns. The next possible dialogue act can be obtained from knowledge of adjacency pairs; if the precondition of a dialogue act is fulfilled, this act becomes one of the possible next acts of the system. Some of this dialogue pattern knowledge is task independent. Moreover, since such basic inference is realized by a general OWL reasoner, the amount of task-dependent knowledge that the developer must provide decreases.


4.4 User Modeling

Many spoken dialogue systems are used by only a small number of people per system. For example, a personal robot is typically used within a household (ordinarily one to five people), whereas a mobile phone is used by only one person. In such situations, the adaptability of the system to each user is very important. In order to realize user adaptability, a representation of the user model information and an update mechanism for the attribute values in the user model are necessary. If the representation and the update interface are standardized, then this information can be used across various systems.

Generally, in earlier user-adaptive interaction systems, the user model representation and its update procedure were embedded in the system code. Therefore, it was difficult to port a user-modeling module to another system. User modeling based on semantic Web technology is a promising approach to dealing with this portability problem (see, e.g., [8]).

5 Query Analysis on Semantic Search

In information retrieval, a natural language utterance is well suited to expressing the intention of a user directly. However, in order to develop a natural language interface, it is necessary to understand the meaning of words and to distinguish words that should be treated as search conditions from words that should be disregarded.

Therefore, in order to perform information retrieval based on semantic analysis, we present a method for generating query commands from natural language utterances. Since we use RDF as the knowledge source, the target query representation is SPARQL. An example of SPARQL is shown in Fig. 2.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX j.0: <http://ii-lab.is.kit.jp/relationship/>
SELECT ?name WHERE {
  ?x j.0:name ?name .
  ?x j.0:address ?a . FILTER regex(?a, "Kyoto") .
  ?x j.0:class ?c . FILTER regex(?c, "hotel") .
}
LIMIT 10

Fig. 2. Example of SPARQL. The input is “Kyouto (Kyoto) ni aru (located) hoteru (hotel) wo sagashite (Search).” (Search hotels located in Kyoto.).

Specifically, we gathered functional expressions from a large quantity of question sentences on a Web Q&A site and developed a semi-task-dependent functional expression dictionary (61,429 entries) that maps Japanese functional expressions to search condition representations in SPARQL. We obtained a SPARQL generation accuracy of 42% on a location-related search task (33 inputs). For example, the dictionary entry for the expression "located" (in Fig. 2) is shown in Fig. 3.


[Figure 3: RDF graph of a semantic dictionary entry (node A22584) of rdf:type "functional" for the expression "ni aru (located)". The entry links, among others, its notation, neighbouring part-of-speech information (noun-locate, noun-general), a frequency of 6, its role (cond-property) and property kind (location-property), an associated task property (address) under the namespace http://ii-lab.is.kit.jp/relationship#, and the SPARQL rule  ?x proN ?place . FILTER regex(?place, "condition") .]

Fig. 3. Example of a semantic dictionary entry (functional expression: "ni aru (located)")

6 Conclusion

We examined several possible uses of semantic Web technologies and reported that semantic technology has a significant impact on various stages in the development of spoken dialogue systems. As an example, we implemented a query generation method for semantic search based on semantic Web technology.

In the future, we intend to implement a fully integrated semantic-Web-based spoken dialogue system and examine how efficiently such intelligent systems can be developed.

References

1. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American (May 2001)

2. Field, D., Catizone, R., Cheng, W., Dingli, A., Worgan, S., Ye, L., Wilks, Y.: The Senior Companion: a Semantic Web Dialogue System. In: Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems, Vol. 2, pp. 1383–1384 (2009)

3. Sonntag, D., Neßelrath, R., Sonnenberg, G., Herzog, G.: Supporting a Rapid Dialogue System Engineering Process. In: Proceedings of the 1st IWSDS (2009)

4. Heinroth, T., Denich, D., Bertrand, G.: Ontology-based Spoken Dialogue Modelling. In: Proceedings of the 1st IWSDS (2009)

5. Sun, X., Mellish, C.: Domain Independent Sentence Generation from RDF Representations for the Semantic Web. In: Proceedings of the Eleventh European Workshop on Natural Language Generation, pp. 105–108 (2007)

6. Wilcock, G.: Talking OWLs: Towards an Ontology Verbalizer. In: Fensel, D., Sycara, K., Mylopoulos, J. (eds.) ISWC 2003. LNCS, vol. 2870, pp. 109–112. Springer, Heidelberg (2003)

7. Araki, M.: OWL-based Frame Description for Spoken Dialog Systems. In: Proc. International Semantic Web Foundations and Application Technology (SWFAT) Workshop (2003)

8. Fonseca, J.M.C., et al.: Model-Based UI XG Final Report (2010), http://www.w3.org/2005/Incubator/model-based-ui/XGR-mbui/


A User Model to Predict User Satisfaction with Spoken Dialog Systems

Klaus-Peter Engelbrecht and Sebastian Möller

Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin,
Ernst-Reuter-Platz 7, 10587 Berlin, Germany
{Klaus-Peter.Engelbrecht,Sebastian.Moeller}@telekom.de

Abstract. In order to predict interactions of users with spoken dialog systems and their ratings of the interaction, we propose to model basic needs of the user which impact her emotional state. In defining the model we follow the PSI theory by Dörner [1] and identify Competence and Certainty as the relevant needs in this context. By analyzing questionnaires we show that such needs impact the user's overall opinion of the system. Furthermore, relations to interaction parameters are analyzed.

Keywords: PARADISE, user model, evaluation, spoken dialog system.

1 Introduction

Automatic evaluation of interactive systems can save time and costs in testing, and thus for the entire development of the system [2]. Alternatively, ongoing research deals with automatically trained systems which adapt their behavior to the users [3][4]. In both cases, user models are used to simulate interactions with the system before real users are confronted with it. While there are pros and cons for either way of designing a system, both approaches depend heavily on criteria to determine the quality of the observed interaction as it would be perceived by the user [5][6].

While perceived quality is typically measured using a questionnaire, in simulated interactions this is not possible. However, as proposed by the PARADISE framework [7], we can try to predict the rating a user would assign to an interaction from a description of the interaction in the form of interaction parameters.

The accuracy of such predictions can be sufficient to predict differences between the mean ratings of systems or system configurations [8]. However, the relation between the interaction parameters and the user judgments can differ between users, leading to much lower accuracy when predicting judgments for individual dialogs [8]. These differences can to some degree be attributed to user characteristics such as affinity towards technology or attitude towards SDSs [9]; however, a substantial part of the variance in the judgments still cannot be explained. Thus, a deeper study of what happens inside a user remains an interesting research question.

In [10] we proposed an integrated user model covering the interaction and the prediction of the corresponding judgment, by partly drawing on the same resources, such as the task goal and the task model. Such a model may work with "dynamic user attributes", i.e. parameters of the user which change over the course of an interaction, as has been proposed for the MeMo workbench [11]. In this paper, we propose a set of such attributes drawing on PSI theory [1]. This theory tries to explain motivated and emotional behavior of agents on the basis of needs and psychological parameters. By analysis of questionnaires and interaction data, we evaluate the applicability of this approach to interactions with spoken dialog systems (SDSs). Thus, our work should be seen as exploratory in nature.

2 Proposed Model

According to Dörner [1], the behavior of an agent is modulated by a set of needs. Dörner mentions needs for hunger, thirst, physical well-being, and affiliation, as well as the more fundamental needs for Competence and Certainty. Competence can be described as the feeling of being able to manipulate the world (or a device) as desired, while Certainty reflects the reliability of the environment (or some device), i.e. how foreseeable it is. The latter two parameters directly impact the emotional behavior of the agent by modulating behavioral parameters such as arousal, attention (resolution level of perception), flight tendency vs. aggression, and the selection threshold (how easily the agent changes the current goal). The default settings of these parameters are user-specific, so that the same environment can cause different reactions from different agents.

In Human-Computer Interaction (HCI), the former needs, which describe the agent's basic urges, seem to be of minor importance, as they mainly impact goal selection, and in HCI we typically consider scenarios with known (i.e. fixed) goals. However, Competence and Certainty might be of interest, as they are related to the emotions and may thus help to predict the emotional response to a system. The emotional response can then be expected to impact the user behavior as well as the subjective evaluation of the system and may thus explain some of the inter-rater differences.

Figure 1 shows how these needs could be integrated into a prediction model for user satisfaction. Displayed are the relations between different variables in one time slice. Measurable parameters are displayed as white ovals, while latent variables are shaded. The target variable, which is also hidden, is displayed in black. Relations between parameters are indicated by arrows, where thick arrows indicate that the relation is deterministic. Note that "emotions" is a placeholder for a sub-system composed of the above-mentioned parameters. Not all of these parameters might be crucial to the interaction and judgment process. Functions describing the relation between Competence, Certainty, and emotional parameters can be found in [12]. Interaction parameters and audio parameters of the system turns can be measured directly in the interaction. Note that audio parameters of the user utterances can help to determine the current values of the latent variables.

In the next time slice (not displayed), the actions of the user and her expectations will be influenced by the current emotional parameters, and thus indirectly by the underlying needs for Competence and Certainty. In the remainder of this paper, we analyze a database of interactions with SDSs with respect to these relations.
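
To give an impression of what such "dynamic user attributes" could look like in an implementation, the following Python sketch represents one time slice of the user state. The attribute names follow the discussion above, but the update rules are simple placeholders chosen for the example; they do not reproduce the PSI-theory functions referred to in [12].

# Hypothetical sketch of "dynamic user attributes" for one time slice.
from dataclasses import dataclass

@dataclass
class UserState:
    competence: float = 0.5   # feeling of being able to manipulate the system
    certainty: float = 0.5    # perceived foreseeability of the system behaviour
    arousal: float = 0.5      # example emotional parameter derived from the needs

    def update(self, task_success: bool, understood: bool) -> None:
        # Successful actions raise Competence; understanding errors lower Certainty.
        delta = 0.1 if task_success else -0.1
        self.competence = min(1.0, max(0.0, self.competence + delta))
        if not understood:
            self.certainty = max(0.0, self.certainty - 0.1)
        # Emotional parameters are modulated by the two needs (placeholder rule).
        self.arousal = 1.0 - 0.5 * (self.competence + self.certainty)

state = UserState()
state.update(task_success=False, understood=False)
print(state)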


Fig. 1. Proposed model with Competence and Certainty as the central parameters modulating user actions and the emotional apparatus of the user

3 Database

The database contains dialogs with 3 commercial SDSs conducted over a telephone line. All three systems provide information on public transport connections, but for different areas of Germany. The systems differ considerably in their dialog strategy and the voices used for the prompts. One of the systems did not allow barge-in, while the others did most of the time. Nine young (M=22.4, SD=2.8, 4 male) and 7 older (M=53.7, SD=7.8, 4 male) users took part in the experiment. Each user performed a dialog with each of the 3 systems and judged each system on a full questionnaire designed in accordance with ITU-T Rec. P.851 [13]. The calls were recorded and transcribed. Interaction parameters could be read from the transcriptions or were annotated afterwards on the basis of the transcriptions. They cover

• Efficiency (#Turns, #Turns till solution, #constraints/utterance, words per system turn)

• Understanding errors (parsing correct, partially correct, failed, or incorrect (PA:CO, PA:PA, PA:FA, PA:IC))

• Interaction style (user speech acts (e.g. negate, no_input, rephrase), words per user turn, confirmation type (explicit, implicit, none), system help, successful and failed barge-in attempts, time-outs)

• Task success


4 Relation of Competence and Certainty to User Satisfaction

The proposed model can only make sense if we can observe some variance of Competence and Certainty within our dialog corpus. This can be analyzed most easily using the questionnaires the users filled out after each interaction. While we did not ask for Competence or Certainty directly, the questionnaire, comprising 37 items, covers a wide variety of aspects of the interaction, including statements such as "The system reacted as expected" or "I felt in control of the interaction". Such items are closely related to the targeted concepts and may thus be indicative of the desired relations.

In order to simplify the analysis, we first reduced the dimensionality of the data using factor analysis. Before this, we excluded all items asking for an overall evaluation of the system (e.g. "I am satisfied with the system", "I would use the system again"). The remaining questions all deal with more specific aspects of the interaction. We then performed several factor analyses, using different numbers of resulting factors or the eigenvalue > 1 criterion, in order to obtain a clearly interpretable solution from the small number of cases.

The factor solution yielding the most interpretable result has 5 factors, which explain 72% of the variance in the data. One of the factors could not be named, as it contained only 2 very different items. The other factors roughly describe the Controllability of the system, the Cognitive Effort involved in the interaction, the Naturalness, and the Symmetry of the interaction. Most items load on the first two factors. These factors also include all of the items which are semantically similar to the concepts of Competence and Certainty. Thus, we are particularly interested in these dimensions in this study. We store the factor scores after Varimax rotation for later analysis.

In order to obtain a reliable and valid measure of the Overall Quality of the system, we also performed a factor analysis on the items related to the overall evaluation of the system. All items load on the same factor, suggesting that they measure the same, and only one, construct. A scale reliability analysis using Cronbach's alpha supports this assumption (alpha = 0.83). Thus, we feel confident in using the resulting factor score as a measure of the Overall Quality of the system.

We then analyzed the impact of the five factors capturing specific quality aspects on Overall Quality, using linear regression with stepwise inclusion. The stepwise algorithm selected all factors except the unnamed one for inclusion in the model, with Controllability and Cognitive Effort selected first and showing the highest Beta coefficients. The accuracy of the model is characterized by R2 = 0.87 and a standard error of 0.38.
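
The two analysis steps described above can be reproduced numerically in a few lines of code. The sketch below computes Cronbach's alpha for a set of overall-evaluation items and fits an ordinary (non-stepwise) linear regression of the Overall Quality score on the factor scores; the data is random placeholder data, not the questionnaire data of the study.

# Sketch of the reliability analysis and regression on random placeholder data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Cronbach's alpha for k overall-evaluation items (48 ratings, 4 items).
items = rng.normal(size=(48, 4))
items += rng.normal(size=(48, 1))          # shared component -> positive alpha
k = items.shape[1]
item_var = items.var(axis=0, ddof=1).sum()
total_var = items.sum(axis=1).var(ddof=1)
alpha = k / (k - 1) * (1 - item_var / total_var)

# Regression of the Overall Quality factor score on the specific-aspect factors.
factor_scores = rng.normal(size=(48, 5))   # e.g. Controllability, Cognitive Effort, ...
weights = np.array([0.6, -0.5, 0.2, 0.2, 0.0])
overall_quality = factor_scores @ weights + rng.normal(scale=0.3, size=48)
model = LinearRegression().fit(factor_scores, overall_quality)
print("alpha = %.2f, R^2 = %.2f" % (alpha, model.score(factor_scores, overall_quality)))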

These results indicate that Competence and Certainty do vary within the database, and that they have a considerable impact on the overall evaluation of the system. However, it is not yet clear how the concepts Competence and Certainty relate to the factors we extracted from the questionnaires and named Controllability and Cognitive Effort.

In a next step, we therefore evaluated the relations of these factors to the interaction parameters and the measured user characteristics, in order to see whether they modulate the user behavior in conformance with our assumptions about Competence and Certainty.


5 Relation of Competence and Certainty to Interaction Behavior

Of the 30 interaction parameters measured in the data, 10 are significantly correlated with Overall Quality, 8 with Controllability, and 6 with Cognitive Effort. While this statistic suggests that the correlation with Overall Quality determines how strongly a factor is related to the interaction parameters, there are some parameters which are correlated with just one of the more specific factors, or which are more strongly correlated with the specific factor than with Overall Quality. In the following, we restrict ourselves to an analysis of these parameters.

Table 1 shows an overview of these cases. According to the table, the system was perceived as more controllable if there was at least one very long system prompt (WPST_max) or if there were fewer incorrectly understood utterances. The latter is directly linked to the number of negations (the user disconfirmed what the system understood). In addition, Controllability is related to no-inputs and task success. The parameters which are correlated mainly with Cognitive Effort include a number of insignificant "trends". Note that we measure two-tailed significance, as for some parameters the direction of the correlation could not be anticipated, while otherwise one-tailed significance would be sufficient. Thus, the correlations with #Turns till solution, successful barge-ins, and system help can be considered reliable, as the effect is observed as anticipated. The other parameters related to this factor are the number of rephrasings of user utterances, the number of explicit confirmations, and the question whether the user had ever used an SDS before.

In summary, most of these correlations would also be expected with Competence and Certainty. However, these concepts cannot be straightforwardly assigned to one of the factors. Most importantly, Controllability resembles Certainty given the correlated parameters, but includes task success, which should rather be correlated with Competence. Cognitive Effort, in turn, seems to be related to Competence, but we miss correlations with task success, #constraints/utterance, or the absolute number of barge-in attempts, to name some examples.

Table 1. Correlations (Pearson’s r) between interaction parameters and the Overall Quality, Controllability, and the Cognitive Effort scale. Asterisks indicate that a correlation is signifi-cance (*) or highly significant (**; two-tailed).

Parameter OQ Controllability Cogn. Eff. Prior experience with SDSs

0.27 0.16 -0.43**

#Turns till solution -0.15 -0.18 0.26 (p=0.089) WPST_max 0.43** 0.46** -0.28 PA:IC -0.25 -0.32* 0.01 Succ. barge-ins 0.10 -0.00 -0.26 (p=0.094) USA:negate -0.35* -0.37* 0.21 USA:no_input -0.43** -0.45** 0.32* USA:rephrase -0.09 -0.10 0.29 (p=0.056) Explicit confirm. -0.27 -0.21 0.325* System help 0.12 -0.03 -0.27 (p=0.074) Task success 0.46** 0.50** -0.16


6 Conclusion

We presented a model of a user interacting with and judging an SDS. The model should be capable of explaining emotional behavior. In order to verify that such a model is useful for the simulation of interactions and corresponding judgments, we tried to identify the involved variables in our data and analyzed how they are related to parameters describing the interactions with the system. By analyzing the questionnaires, we found two factors which are related to Competence and Certainty. These factors show the expected correlations with interaction parameters to some degree, however not as strongly as expected. In particular, Competence and Certainty seem to be of limited use for predicting variation in user actions. Our results allow predictions about no-inputs and the probability of rephrasing when an utterance has to be repeated. Other correlated parameters seem to be predictors of the factors rather than being influenced by them. Future work will have to analyze such relations on the turn level.

References

1. Dörner, D.: Die Mechanik des Seelenwagens. Eine neuronale Theorie der Handlungsregulation. 1. Auflage. Verlag Hans Huber, Bern (2002)

2. Kieras, D.E.: Model-based Evaluation. In: Jacko, J., Sears, A. (eds.) The Human-Computer Interaction Handbook, pp. 1191–1208. Erlbaum, Mahwah (2003)

3. Levin, E., Pieraccini, R., Eckert, W.: Using Markov decision process for learning dialogue strategies. In: ICASSP 1998, Seattle, pp. 201-204 (1998)

4. Pietquin, O.: A framework for unsupervised learning of dialogue strategies. PhD thesis, Faculty of Engineering, Mons (TCTS Lab), Belgium (2004)

5. Ai, H., Weng, F.: User Simulation as Testing for Spoken Dialog Systems. In: 9th SIGdial Workshop on Discourse and Dialogue, Columbus, OH (2008)

6. Rieser, V., Lemon, O.: Automatic Learning and Evaluation of User-Centered Objective Functions for Dialogue System Optimisation. In: LREC 2008, Marrakech, Morocco, pp. 2356-2361 (2008)

7. Walker, M., Litman, D., Kamm, C., Abella, A.: PARADISE: A Framework for Evaluating Spoken Dialogue Agents. In: ACL/EACL 35th Ann. Meeting of the Assoc. for Computational Linguistics, Madrid, pp. 271–280 (1997)

8. Möller, S., Engelbrecht, K.-P., Schleicher, R.: Predicting the Quality and Usability of Spoken Dialogue Services. Speech Communication 50, 730–744 (2008)

9. Engelbrecht, K.-P., Hartard, F., Gödde, F., Möller, S.: A Closer Look at Quality Judgments of Spoken Dialog Systems. In: Interspeech 2009, Brighton (2009)

10. Möller, S., Engelbrecht, K.-P.: Towards a Perception-based Evaluation Model for Spoken Dialogue Systems. In: 4th IEEE Tutorial and Research Workshop PIT, Kloster Irsee (2008)

11. Engelbrecht, K.-P., Quade, M., Möller, S.: Analysis of a New Simulation Approach to Dialogue System Evaluation. Speech Communication 51, 1234–1252 (2009)

12. Dörner, D., Gerdes, J., Mayer, M., Misra, S.: A Simulation of Cognitive and Emotional Effects of Overcrowding. In: ICCM 2006, pp. 92–99 (2006)

13. ITU-T Rec. P.851. Subjective Quality Evaluation of Telephone Services Based on Spoken Dialogue Systems. In: International Telecommunication Union, Geneva (2003)


Sequence-Based Pronunciation Modeling Using a Noisy-Channel Approach

Hansjörg Hofmann1,2, Sakriani Sakti1, Ryosuke Isotani1, Hisashi Kawai1, Satoshi Nakamura1, and Wolfgang Minker2

1 National Institute of Information and Communications Technology, Japan
{hansjoerg.hofmann,sakriani.sakti,ryosuke.isotani,hisashi.kawai,satoshi.nakamura}@nict.go.jp
2 University of Ulm, Germany
[email protected]

Abstract. Previous approaches to spontaneous speech recognition address the multiple pronunciation problem by modeling the alteration of the pronunciation on a phoneme-to-phoneme level. However, the phonetic transformation effects induced by the pronunciation of the whole sentence have not been considered yet. In this paper we attempt to model sequence-based pronunciation variation using a noisy-channel approach, where the spontaneous phoneme sequence is considered a "noisy" string and the goal is to recover the "clean" string of the word sequence. Hereby, the whole word sequence and its effect on the alteration of the phonemes is taken into consideration. Moreover, the system not only learns the phoneme transformations but also the mapping from the phonemes to the words directly. In this preliminary study, the phonemes are first recognized with the present recognition system, and afterwards the pronunciation variation model based on the noisy-channel approach maps from the phoneme to the word level. Our experiments use Switchboard as the spontaneous speech corpus. The results show that the proposed method improves the word accuracy consistently over the conventional recognition system. The best system achieves up to a 38.9% relative improvement over the baseline speech recognition.

Keywords: Spontaneous speech recognition, pronunciation variation, noisy-channel model, statistical machine translation.

1 Introduction

Nowadays pervasive computing gains more and more importance and is already partly integrated into our daily lives. As speech is the most common and convenient way for humans to communicate, spoken language dialog systems (SLDS) are in high demand. State-of-the-art automatic speech recognition (ASR) systems perform satisfactorily in closed environments. If an SLDS integrated into everyday life is to support the user, it has to handle more relaxed constraints and allow the user to speak spontaneously. Spontaneous or conversational speech differs severely from accurately read speech, and recognition rates decrease accordingly [14]. In natural conversations, people pronounce words differently and tend to combine or even omit words. Discourse particles (e.g. "like") or hesitation sounds (e.g. "ahm") are used to structure the sentence and have no semantic meaning. Riley et al. [17] consider multiple pronunciation variants to be the main problem that spontaneous ASR has to face.

Several approaches have been proposed to resolve the multiple pronunciation problem. One attempt is to extend the dictionary manually with further pronunciation variants or to improve it by applying rule-based algorithms. Nevertheless, both approaches are very time consuming, and the latter needs a significant amount of expert knowledge. Other researchers apply data-driven approaches to model the alteration of the pronunciation on a phoneme-to-phoneme level. Decision-tree-based approaches have been applied by Bates et al. [2] and improved the ASR performance. Chen et al. [4] examine the effect of prosody on pronunciation and propose to use artificial neural networks (ANN) to model pronunciation variation. Livescu et al. [11] propose a feature-based pronunciation model based on a dynamic Bayesian network (BN). A report by Sakti et al. [18] also uses a BN technique to model the variation between the base form and the surface form of the phonemes. After applying the Bayesian network, a small performance improvement of the ASR was gained. As the realization of the current phone does not only depend on neighboring phones, the observation window should be extended. Fosler-Lussier [6] proposes to take syllabification into consideration and investigates decision tree models based on syllables, but the word error rate increases slightly. This may be because the phonetic transformation effects induced by the pronunciation of the whole sentence are not considered yet.

In this paper we attempt to model sequence-based pronunciation variation using a noisy-channel approach, where the spontaneous phoneme sequence is considered a "noisy" string and the goal is to recover the "clean" string of the word sequence. Hereby, the whole word sequence and its effect on the alteration of the phonemes is taken into consideration. Moreover, the system not only learns the phoneme transformations but also the mapping from the phonemes to the words directly. In this preliminary study, the phonemes are first recognized with the present recognition system, and afterwards the proposed pronunciation model maps from the phoneme to the word level.

2 Noisy-Channel Approach

The noisy-channel model used to translate the recognized phonemes into words is adopted from statistical machine translation (SMT). Given a sequence of units of a source language, the SMT translates the units into a specified target language. In this case the source language consists of phonemic units and the output language consists of words. The SMT computes the most probable word sequence w given an input phoneme sequence p by solving the following maximum likelihood estimation:

ŵ = argmax_w P(p|w) · P(w)                                                  (1)


Here, P(w) represents the probability of the word sequence w provided by the language model (LM) of the target language. P(p|w) denotes the likelihood of the phoneme sequence p given the word sequence w; it represents the transition from the phonemic to the word representation and is computed by the translation model [3]. The structure of the SMT system is shown in Figure 1:

Fig. 1. Basics of SMT framework

The SMT system is trained with parallel matching pairs of text data from the input and the output language. When testing the translation system, the SMT evaluates each proposed hypothesis by assigning a score according to the statistical model probabilities. During the translation process all possible hypotheses are considered, and finally the path with the highest score is chosen as the result.
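
The decision rule in Eq. (1) can be illustrated with a toy rescoring loop: each candidate word sequence is scored by the sum of its translation-model and language-model log-probabilities, and the highest-scoring candidate is returned. The hypotheses and probabilities below are invented for the illustration; the actual system uses a phrase-based SMT decoder rather than this exhaustive loop.

import math

# Toy illustration of Eq. (1): pick the word sequence w maximising P(p|w) * P(w).
def decode(candidates):
    # candidates: list of (word_sequence, p_phonemes_given_words, p_lm)
    best = max(candidates, key=lambda c: math.log(c[1]) + math.log(c[2]))
    return best[0]

hypotheses = [
    ("and you know", 1e-4, 2e-5),
    ("an you no",    3e-4, 1e-8),
    ("and he know",  5e-5, 4e-6),
]
print(decode(hypotheses))   # -> "and you know"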

Our SMT system is based on a log-linear framework [12], uses a trigram LM, and applies phrase-based SMT techniques [10]. The statistical models of our system were trained with dedicated toolkits for language modeling [19] and word alignment [13], and a lexicalized distortion model was also applied [1]. The translation process is performed by a tool called CleopATRa [5], which is a multi-stack phrase-based SMT decoder.

3 Experimental Evaluation

3.1 Data Corpora

Read and spontaneous speech data corpora were used in this experiment (see Table 1). The Wall Street Journal speech corpus (WSJ0 and WSJ1) [16] contains read speech recorded from English speakers who read newspaper text paragraphs. The training set consists of 60 hours of speech, and the so-called WSJ test set consists of 215 utterances of a 5K-word dictionary (Hub 2) [15].

The spontaneous speech data was obtained from two subsets of the Switchboard corpus [7], which consists of spontaneous telephone conversations and contains a significant amount of pronunciation variability [18]. The first subset has been phonetically hand-transcribed by ICSI, consists of 4 hours (5117 utterances) of data, and is used for modeling the pronunciation variation. The second one is SVitchboard 1 (the Small Vocabulary Switchboard Database) [9], which consists of statistically selected utterances from the whole Switchboard corpus. This data set has been segmented into small-vocabulary tasks ranging from 10 words up to 500 words. Each segmentation has been further partitioned into 5 subsets (A-E). In this preliminary study we use three subsets of SVitchboard 1: 50, 250, and 500 words. For some of the words there exist more than 10 different pronunciation variants. An example of the pronunciation variation of the vocabulary is given below:

and: /ae eh n/, /ae eh n d/, /ae n/, /ae n d/, /ah n/, /ah n d/, etc.

In each case, 200 utterances which are at least 3 words long were randomly selected from subsets A and B and are used for evaluation.

Table 1. Speech data composition of this experiment

Speech Type   Data Set                        Hours  Words     Usage
Read          WSJ Training Set                60     4989      Baseline ASR Train
              WSJ Test Set                    0.2    4989      Baseline ASR Test
Spontaneous   Hand-Transcribed Switchboard    4      3843      AM Adapt, LM & SMT Train
              SVitchboard1 C&D&E              4      50        AM Adapt, LM Train
              SVitchboard1 A&B                0.2    50 - 500  ASR Test, SMT Test

3.2 Baseline ASR System

The triphone HMM acoustic model (AM) was trained with the WSJ corpus described above. As the spontaneous speech data have a sampling rate of 8 kHz, the WSJ data were downsampled from 16 kHz to 8 kHz. A frame length of 20 ms (Hamming window), a frame shift of 10 ms, and 25-dimensional feature parameters were used. The 25 feature parameters consist of 12 MFCC coefficients, delta MFCCs, and log power. At the beginning, each phoneme consists of a 3-state HMM. By applying a successive state splitting (SSS) algorithm based on the minimum description length (MDL) criterion, the optimum state-level HMnet is obtained. Further information about the MDL-SSS algorithm may be obtained from [8]. We built four different AMs with different Gaussian mixture numbers: 5, 10, 15, and 20 mixtures. Each of the AMs has a total number of 1903 states. The results of the baseline tested on read speech and its degradation on spontaneous speech are shown in Table 2.

Table 2. Recognition accuracy of baseline testing

                                 5mix   10mix  15mix  20mix
WSJ Test Set                     88.7   89.1   89.2   90.8
SVitchboard Test Set 50 words    37.5   40.7   39.4   41.8
SVitchboard Test Set 250 words   29.5   34.1   30.3   33.2
SVitchboard Test Set 500 words   30.1   32.1   32.1   34.9


3.3 Proposed ASR System

Since the amount of spontaneous speech data in SVitchboard 1 is very limited, we adapt our baseline to the conversational speech data using the data described in Table 1 with standard maximum a posteriori (MAP) adaptation.

The SMT was trained on Switchboard with the phonemes as the source and the words as the target. Here, we use dictionary-based canonical phoneme sequences and hand-labelled surface phoneme sequences, which results in 10k utterances in total. Given the correct phoneme sequence of the test list, the performance of our SMT is up to 99% accuracy. When increasing the word range from 50 to 500 words, the performance decreases only slightly (see Figure 2a)). Due to time limitations, the 50-word range is used for the further experiments.

Next, the proposed model is applied after the ASR is conducted. The ASR outputs the most likely words and the corresponding phoneme sequence, but for further processing only the phoneme strings are used. Given the first-best path of the ASR output (Adapt+SMT (1best)), the performance achieved a relative improvement of 19.5% (see Figure 2b)). However, here only the best result of the ASR is considered and translated to the word level. By keeping the whole lattice of the speech recognition result, the system can be further improved. Here, we apply the SMT to the n-best (10best, 50best) lists generated from the ASR, which consist of unique word sequences. Figure 2b) shows the optimum results (Adapt+SMT(10best), Adapt+SMT(50best)) given the SMT output. Thereby, a relative improvement of 9.0% over "Adapt+SMT (1best)" could be achieved. Higher orders still improve the accuracy but converge to a saturation level.

To further improve the performance, we assess the reliability of the ASR output by using the generalized utterance posterior probability (GUPP) [20] approach. We enumerate different thresholds and send only utterances with reliability values lower than the threshold to the SMT.

[Figure 2: two bar charts of word accuracy (%). Panel (a): SMT performance given the correct phoneme sequence for word ranges of 50, 250, and 500 words (y-axis from 70 to 100%). Panel (b): spontaneous-speech accuracy for AMs with 5, 10, 15, and 20 mixtures, comparing Baseline (WSJ), Adapt + SMT (1best), Adapt + SMT (10best), Adapt + SMT (50best), Adapt + SMT GUPP (50best), and Adapt + SMT UpperBound (50best).]

Fig. 2. a) Performance of the SMT given the correct phoneme sequence. b) Improvement of the spontaneous-speech accuracy results by applying the proposed model.


Figure 2b) shows only the optimum results (Adapt+SMT GUPP(50best)); the best system achieves up to 53.6%. Additionally, Figure 2b) shows the upper bound of our proposed system, obtained when only those utterances which can be improved by the system are sent to the SMT. As can be seen (Adapt+SMT UpperBound(50best)), a relative improvement of 20.4% over "Adapt+SMT (1best)" and up to 58% accuracy can be achieved. For comparison, using Bayesian network approaches, Sakti et al. [18] achieve 56.3% and Livescu et al. [11] achieve 57.3% word accuracy. However, these results cannot be compared directly, as the tasks differ.
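
The confidence-based routing described above can be sketched as follows: only ASR outputs whose utterance-level confidence (e.g. the GUPP score) falls below a threshold are re-decoded by the SMT, while confident outputs are kept unchanged. The helper functions, data layout, and threshold value are hypothetical placeholders, not the actual system components.

# Sketch of confidence-based routing: low-confidence utterances are re-decoded
# by the SMT-based pronunciation model; confident outputs are kept as they are.
def route(asr_results, smt_rescore, threshold=0.6):
    final = []
    for hyp in asr_results:           # hyp: dict with "words", "phonemes", "confidence"
        if hyp["confidence"] < threshold:
            final.append(smt_rescore(hyp["phonemes"]))
        else:
            final.append(hyp["words"])
    return final

# Example with a dummy rescoring function.
dummy_smt = lambda phonemes: "<smt output for %d phonemes>" % len(phonemes)
results = [
    {"words": "yeah i know", "phonemes": ["y", "ae", "n", "ow"], "confidence": 0.4},
    {"words": "that is right", "phonemes": ["dh", "ae", "t", "s"], "confidence": 0.9},
]
print(route(results, dummy_smt))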

4 Conclusions

We have proposed a sequence-based pronunciation model for spontaneous ASR applying a noisy-channel approach based on SMT. First, we adapted our baseline AM with spontaneous speech data. After recognizing the phonemes with our ASR engine, we used the SMT system to map from the phoneme to the word level. The results show that the proposed method improves the word accuracy consistently over the conventional recognition system. The best system achieves up to a 38.9% relative improvement over the baseline AM and 21.4% over the proposed model given the first-best path of the ASR output. The results point in a positive direction, opening the possibility of increasing the complexity of the experiment's topology. Future work includes extending the word range, using a larger data set, and extending the n-gram LM of the SMT.

References

1. Al-Onaizan, Y., Papineni, K.: Distortion models for statistical machine translation. In: Proc. ACL/COLING, pp. 529–536 (2006)

2. Bates, A., Osterndorf, M., Wright, R.: Symbolic phonetic features for modeling of pronunciation variation. Speech Communication 49, 83–97 (2007)

3. Brown, P., Pietra, S., Pietra, V.: The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19(2), 263–311 (1993)

4. Chen, K., Hasegawa-Johnson, M.: Modeling pronunciation variation using artificial neural networks for English spontaneous speech. In: Proc. ICSLP, pp. 1461–1464 (2004)

5. Finch, A., Denoual, E., Okuma, H., Paul, M., Yamamoto, H., Yasuda, K., Zhang, R., Sumita, E.: The NICT/ATR speech translation system for IWSLT 2007. In: Proc. IWSLT, pp. 103–110 (2007)

6. Fosler-Lussier, E.: Contextual word and syllable pronunciation models. In: Proc. IEEE ASRU Workshop (1999)

7. Godfrey, J., Holliman, E., McDaniel, J.: SWITCHBOARD: Telephone speech corpus for research and development. In: Proc. ICSLP, pp. 24–27 (1996)

8. Jitsuhiro, T., Matsui, T., Nakamura, S.: Automatic generation of non-uniform HMM topologies based on the MDL criterion. IEICE Trans. Inf. Syst. E87-D(8) (2004)

9. King, S., Bartels, C., Bilmers, J.: Small vocabulary tasks from Switchboard 1. In: Proc. EUROSPEECH, pp. 3385–3388 (2005)


10. Koehn, P., Och, F.J., Marcu, D.: Statistical phrase-based translation. In: Proc. Human Language Technology Conference, pp. 127–133 (2003)

11. Livescu, K., Glass, J.: Feature-based pronunciation modeling for speech recognition. In: Proc. HLT/NAACL (2004)

12. Och, F., Ney, H.: Discriminative training and maximum entropy models for statistical machine translation. In: Proc. ACL, pp. 295–302 (2002)

13. Och, F., Ney, H.: A systematic comparison of various statistical alignment models. Computational Linguistics 29(1), 19–51 (2003)

14. Pallet, D.: A look at NIST's benchmark ASR tests: Past, present, future. In: Proc. ASRU, pp. 483–488 (2003)

15. Pallett, S., Fiscus, J., Fisher, M., Garofolo, J., Lund, B., Przybocki, M.: 1993 benchmark tests for the ARPA spoken language program. In: Proc. Spoken Language Technology Workshop (1994)

16. Paul, D.B., Baker, J.: The design for the Wall Street Journal-based CSR corpus. In: Proc. ICSLP (1992)

17. Riley, M., Byrne, W., Finke, M., Khudanpur, S., Ljolje, A., McDonough, J., Nock, H., Saraclar, M., Wooters, C., Zavaliagkos, G.: Stochastic pronunciation modelling from hand-labelled phonetic corpora. In: Proc. ETRW on Modeling Pronunciation Variation for Automatic Speech Recognition, pp. 109–116 (1998)

18. Sakti, S., Markov, S., Nakamura, S.: Probabilistic pronunciation variation model based on Bayesian networks for conversational speech recognition. In: Second International Symposium on Universal Communication (2008)

19. Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Proc. ICSLP, pp. 901–904 (2002)

20. Lo, W.K., Soong, F.K.: Generalized posterior probability for minimum error verification of recognized sentences. In: Proc. ICASSP, pp. 85–88 (2005)


Rational Communication and Affordable Natural Language Interaction for Ambient Environments

Kristiina Jokinen

Department of Speech Sciences, University of Helsinki, Finland
[email protected]

Abstract. This paper discusses rational interaction as a methodology for designing and implementing dialogue management in ambient environments. It is assumed that natural (multimodal) language communication is the most intuitive way of interacting, and the most suitable when the interlocutors are involved in open-ended activities that concern negotiation and planning. The paper discusses aspects that support this hypothesis by focussing especially on how interlocutors build a shared context through natural language and create social bonds through affective communication. Following the design guidelines for interactive artefacts, it is proposed that natural language provides human-computer systems with an interface which is affordable: it readily suggests the appropriate ways to use the interface.

1 Introduction

Human-computer interaction can be seen as a communicative situation where the participants have various kinds of intentions and expectations concerning the content and flow of the communication (see e.g. [1,3,8,13]). It is not our claim, however, that the computer is conscious of its acts or that it understands the meaning of linguistic symbols in the same way as human users; rather, we put forward the view that human-computer interactions are perceived as natural and intentional if the computer agent's operation and interaction capability are based on capabilities similar to those used in human-human interactions. Our interest lies in studying the assumptions and processes that affect natural interaction and human perceptions of what natural interaction is. Rationality is a crucial aspect of social agency and communication, and it is important to take it into consideration when building the ambient conversational agents that are emerging in our environment [16].

The tasks that the computer is expected to perform have become more complex: search, control, manipulation, and presentation of information are activities where the relation between the input and the output is not necessarily predetermined in a straightforward one-to-one manner, but the system appears to reason without any apparent control by the user. Moreover, the ubiquitous computing paradigm and distributed processing add to the unstructured, uncontrolled nature of interaction: the user interacts with many computers instead of one, and the actual computing may be physically performed somewhere other than in the device next to us (Service-Oriented Architectures). Furthermore, newly emerging tasks deal with activities which need socially aware behaviour: affective computing concerns companions, assistants, and virtual friends which do not only aim at efficient task completion, but also provide company, entertainment, and friendship. This kind of behaviour requires sensitivity to non-verbal signals and analysis of their function in controlling and coordinating human behaviour.

This paper discusses these issues in the framework of rational communication, focussing especially on the interlocutors' cooperation in constructing the shared context. We will first discuss the dialogue activities that the interlocutors take part in, including the change in the role of the computer from a tool into a participant in the interaction. We then move on to the construction of the shared context by dialogue participants, and finally consider the concept of affordance in the design and development of interactive systems. The paper is mainly theoretical and follows thoughts and ideas drawn from the wide body of research dealing with rational interaction, as well as from common design guidelines for natural and intuitive interfaces.

2 Dialogue Activities

The use of natural language as an interface modality often goes together with the assumption that the computer should also be able to chat like humans. While this may be a fair requirement for virtual companions, it seems an unreasonable and undesirable requirement for all interactive systems. Different types of communication strategies, styles of interaction, roles, and politeness codes apply to different human activities, and similarly, human-computer interactions are situated activities where different constraints operate. For instance, a robot companion and a booking system engage their users in different activities with different goals and participant roles, and the requirements for what is natural, efficient, useful, and helpful vary according to the activity type. Consequently, in the design and development of interactive systems it is important to classify the different activities which involve automated interaction and to identify the purpose of the activity as well as the roles and expectations that constrain the participants' behaviour.

Activity analysis for communication has been proposed by several authors, and the background of the ideas goes back to linguistic philosophy, psychology, and sociology; see [1] for an overview. The following four parameters can be used to characterize a social activity among human participants [1]:

• Type: purpose, function, procedures
• Roles: competence/obligations/rights
• Instruments: machines/media
• Other physical environment

The first parameter refers to the reason for an activity, its motivation, or the rationale that helps to understand, and consequently define, the means and procedures that can be used to pursue the activity. Procedures can also be used to define the activity: an idea of the purpose of an activity helps to define the specific type of activity. Each activity is also associated with standard activity roles which refer to the tasks related to the purpose of the activity. Each role is performed by one person and is analyzed into competence requirements, obligations, and rights. Roles give rise to expectations of the participants' actions and style of behaviour, and of a coherent and consistent topic. The instruments of the activity refer to tools, machines, and other devices that are used to pursue the activity. Usually they create their own patterns of communication. Computers are typically regarded as instruments, i.e. tools used to achieve the main activity goal (however, their role has changed, cf. the discussion below). The parameter "other physical environment" concerns environmental conditions which affect the interlocutors' behaviour and manner of interaction. These include light, background noise, and furniture, which in ambient environments are also important constraints for the system operation.

2.1 The Computer as a Tool and an Agent

As noted, the role of the computer in interactions seems to have changed from a tool to a more interactive and intelligent agent. The traditional HCI view has regarded the computer as a transparent tool: it supports human goals for clearly defined tasks, takes the user's cognitive limitations into account, and provides the user with explicit information about the task and the system's capabilities. This approach goes well with applications intended to support the users' work and thinking rather than to act as a partner in an interactive situation (e.g. web portals, assistive tools for computer-aided design, and computer-supported collaborative work).

In speech applications, the interaction often follows the HMIHY ("How May I Help You") approach [7], smoothed through rigorous statistical analysis. The users are expected to behave according to certain assumptions of how the interaction should and could proceed. However, speech interfaces seem to bring in tacit assumptions of fluent human communication, and consequently expectations of the system's communicative capability become high. For instance, [9] noticed that the same system receives different evaluations depending on the users' predisposition and prior knowledge of the system: if the users were told that the system had a graphical-tactile interface with the possibility of inputting spoken language commands, they evaluated the spoken language communication significantly better than those users who thought they were interacting with a speech-based system which also had a multimodal facility.

Developments in interaction technology have also led to a reduction of human control over the system's behaviour. Computers are connected in networks of interacting processors, their internal structure and operations are less transparent due to distributed processing, and various machine-learning techniques make the software design less deterministic and controllable. Context-aware computing investigates systems that can observe their environment and learn the preferences of individual users and user groups, adapting decisions accordingly. Mobile devices offer frameworks for group interaction, and the social context can include virtual friends besides other humans. Future communications with the computer are thus envisaged to resemble human communication: they are conducted in natural language using multimodal interaction technology, understanding the effect of social context. The view of the computer as a passive tool under human control is disappearing.

2.2 Rationality

The aim of interaction research and technology is usually described as being related to the need to increase the efficiency and naturalness of interactions between humans and computer agents, i.e. to improve the robustness of system behaviour. According to the view advocated in this paper, the general principles related to activity analysis, cooperative communication, and rational behaviour also function as the basic principles for successful communication in human-machine interactions. Natural communicative behaviour using language-based symbolic communication provides the standards according to which communicators behave and assess their success in interaction. The actual rationality of system actions may differ from that of the human users, but it is the user's perception of the system's rationality that plays the crucial role in the interaction. In other words, the system is not regarded as a simple reactive machine or a tool, but as capable of acting appropriately in situations which are not directly predictable from the previous action.

In search of a definition of rational action, we look at communication in general [1,3,5,8,13,15]. Rationality emerges in social activity as the agent's compliance with the requirements of communicative principles; it captures the agent's communicative competence, and is effectively a sign of the agent's ability to plan, coordinate, and choose actions so that the behaviour looks competent and fulfils the goal which motivated the action. The observation that rationality is in fact perceived rationality leads us to a straightforward definition of incompetent behaviour: if the agent's interaction with the partner (or with the environment) does not converge to a conclusion, the agent's behaviour is described as incompetent and irrational, lacking the basic characteristics of a motivated and competent agent.

3 Construction of Dialogues

Several authors have described interactive situations as cooperative activity in which the shared context is gradually built by the participants by exchanging information [5,12,15]. We call this kind of dialogue modelling constructive [8]. Although the interaction itself is complex, it is assumed that communicative actions can be formulated using only a few underlying principles, and recursive application of these principles allows the construction of more complex communicative situations. The constructive approach to interaction management has the following properties:

• The speakers are rational agents, engaged in cooperative activity
• The speakers have partial knowledge of the situation, application, and world
• The speakers conduct a dialogue in order to achieve an underlying goal
• The speakers construct shared context
• The speakers exchange new information on a particular topic.

The main challenge in constructive dialogue modelling lies in the grounding of language: updating one's knowledge and constructing shared context accordingly. The grounded elements are not automatically shared by other agents, let alone by the same agent in different situations, but the agents' repeated exposure to the same kind of communicative situations enables learning from interaction: the shared context is the result of the agents' cooperation in similar situations, their alignment [5,12] with each other, rather than something intrinsic to the agents or the environment as such. A related question is the agents' understanding of the relation between language and the physical environment, the embodiment of linguistic knowledge via action and interaction, discussed in experimental psychology and neuroscience. The relation between high-level communication and the processing of sensory information is not clear, although much interesting research is being conducted on how sensorimotor activity constitutes linguistic representations [2,14]. This line of research will surely affect the design and development of interactive systems with regard to how natural language interaction is enabled in ambient environments and with robotic companions.

4 Affordance

One of the main questions in the design and development of more natural interactive systems is what it actually means to communicate in a natural manner. For instance, speech is not "natural" if privacy issues are involved or if the enablements for speech communication are not fulfilled at all. Following the activity analysis (Section 2), if the participants understand the rationale behind the activity, it is easier for them to comprehend the means and procedures that can be used to pursue the activity. Concerning interactive systems, the adjective "natural" should thus not only refer to the system's ability to use natural language, but also to its support for functionality that the user finds intuitive in that it fulfils the rationale behind the interaction. We say that interactive systems should afford natural interaction.

The concept of affordance was introduced by [6] in the field of visual perception to denote the properties of the world that actors can act upon, and brought to product design and HCI by [11]. Depending on whether we talk about the uncountable number of actionable relationships between the actor and the world, or about the action possibilities perceived by the user, we can distinguish real and perceived affordances, respectively. In HCI, the (perceived) affordance is related to the properties of an object that suggest to the user the appropriate ways to use the artefact. The smart technology environment should thus afford natural interaction techniques, i.e. interfaces should lend themselves to natural use without the users needing to reason about how the interaction should take place to get the task completed.

Interactive spoken conversations include a wide communicative repertoire of non-verbal signals concerning the speaker's emotions, attitudes, and intentions. Recent research has focussed on collecting large corpora of natural conversations and examining how ordinary people communicate in verbal and non-verbal ways in their daily conversations. This allows the building of experimental models for affordable spoken interactions, which can then be integrated into the design of naturally interacting applications and services [e.g. 4, 8]. Such intuitive communication strategies also encourage interdisciplinary research between the human sciences and the technological possibilities of how to design and construct affordable interactive systems.

5 Conclusions

This paper has discussed various issues related to rational and cooperative interaction between humans and intelligent computer agents. The discussion has been mainly theoretical, with the aim of understanding the development of human-computer interactions and exploring the impact of new digital technology on society and human communication in general. Understanding the communicative activities that humans get involved in may help to understand the activities where the partner is an intelligent automatic agent. We tend to think of inanimate objects as tools that can be used to perform particular tasks. Computers, however, are not only tools but complex systems whose manipulation requires special skills, and interaction with them is not necessarily a step-wise procedure of commands but resembles natural language communication. Thus it seems reasonable that the design and development of interactive systems takes into account communicative principles that concern rational activity and cooperation in human interaction.

References

1. Allwood, J.: An Activity Based Approach to Pragmatics. Gothenburg Papers in Theoretical Linguistics 76, Dept. of Linguistics, Göteborg University (2000)
2. Arbib, M.: The evolving mirror system: A neural basis for language readiness. In: Christiansen, M., Kirby, S. (eds.) Language Evolution, pp. 182–200. Oxford UP, Oxford (2003)
3. Bunt, H.C.: A framework for dialogue act specification. In: Fourth Workshop on Multimodal Semantic Representation (ACL-SIGSEM and ISO TC37/SC4), Tilburg (2005)
4. Cassell, J., Sullivan, J., Prevost, S., Churchill, E. (eds.): Embodied Conversational Agents. MIT Press, Cambridge (2003)
5. Clark, H.H., Schaefer, E.F.: Contributing to Discourse. Cog. Sci. 13, 259–294 (1989)
6. Gibson, J.: The Ecological Approach to Visual Perception. Houghton Mifflin, Boston (1979)
7. Gorin, A.L., Riccardi, G., Wright, J.H.: How May I Help You? Speech Communication 23(1-2), 113–127 (1997)
8. Jokinen, K.: Constructive Dialogue Modelling – Speech Interaction and Rational Agents. John Wiley & Sons, Chichester (2009a)
9. Jokinen, K.: Gesturing in Alignment and Conversational Activity. In: Proceedings of the Pacific Linguistic Conference, Sapporo, Japan (2009b)
10. Jokinen, K., Hurtig, T.: User Expectations and Real Experience on a Multimodal Interactive System. In: Proceedings of Interspeech 2006, Pittsburgh, USA (2006)
11. Norman, D.A.: The Psychology of Everyday Things. Basic Books, New York (1988)
12. Pickering, M., Garrod, S.: Towards a mechanistic psychology of dialogue. Behavioral and Brain Sciences 27, 169–226 (2004)
13. Sadek, D., Bretier, P., Panaget, F.: ARTIMIS: Natural dialogue meets rational agency. In: Proceedings of IJCAI 1997, pp. 1030–1035 (1997)
14. Tomasello, M.: First Verbs: A Case Study of Early Grammatical Development. Cambridge University Press, Cambridge (1992)
15. Traum, D.: Computational models of grounding in collaborative systems. In: Working Papers of the AAAI Fall Symposium on Psychological Models of Communication in Collaborative Systems, pp. 124–131. AAAI, Menlo Park (1999)
16. Weiser, M.: The Computer for the Twenty-First Century. Scientific American, 94–104 (1991)


Construction and Experiment of a Spoken Consulting Dialogue System

Teruhisa Misu, Chiori Hori, Kiyonori Ohtake, Hideki Kashioka, Hisashi Kawai, and Satoshi Nakamura

MASTAR Project, NICT, Kyoto, Japan
http://mastar.jp/index-e.html

Abstract. This paper addresses a spoken dialogue framework that helps users make decisions. Various decision criteria are involved when we select an alternative from a given set of alternatives. When adopting a spoken dialogue interface, users have little idea of the kinds of criteria that the system can handle. We therefore consider a recommendation function that proactively presents information that the user would be interested in. We implemented a sightseeing guidance system with a recommendation function and conducted a user experiment. We provide an initial analysis of the framework in terms of the system prompts and the users' behavior, as well as in terms of the users' behavior and their knowledge.

1 Introduction

Over the years, a great number of spoken dialogue systems have been developed. Their typical task domains include airline information (ATIS & DARPA Communicator) [1] and railway information (MASK) [2]. Dialogue systems, in most cases, are used in the fields of database (DB) retrieval and transaction processing, and dialogue strategies are optimized so as to minimize the cost of information access. Meanwhile, in many situations where spoken dialogue interfaces are installed, information access by the user is not a goal in itself, but a means for decision making [3]. For example, in using a restaurant retrieval system, the user's goal may not be the extraction of price information but to make a decision based on the retrieved information on candidate restaurants. There have been only a few studies that have addressed spoken dialogue systems that help users make decisions. In this paper, we present our model of consulting dialogue systems with speech interfaces. In this study, our model is concerned with the implementation of a sightseeing guidance system for Kyoto city. In our preliminary analysis we address the users' experience while engaging with this system.

2 Dialog Model for Consulting

A sightseeing guidance system of the type that we are constructing is regarded as a kind of decision support system. That is, the user selects an alternative from a given set of alternatives based on some criteria.




[Figure: a hierarchy with the goal "Choose the best spot" at the top, criteria such as cherry blossoms, Japanese garden, and easy access in the middle, and alternatives such as Kinkakuji temple, Ryoanji temple, and Nanzenji temple at the bottom.]

Fig. 1. Hierarchy structure for sightseeing guidance dialogue

There have been many previous studies of decision support systems in the operations research field, and the typical method that has been employed is the Analytic Hierarchy Process (AHP) [4]. In the AHP, the problem is modeled as a hierarchy that consists of the decision goal, the alternatives for reaching it, and the criteria for evaluating these alternatives. In the case of our sightseeing guidance system, the goal is to decide on an optimal spot that is in agreement with the user's preference. The alternatives are all sightseeing spots that can be proposed and explained by the system. As criteria, we adopt the determinants that we have defined in our tagging scheme for the Kyoto sightseeing guidance dialogue corpus [5]. The determinants include various factors that are used to plan sightseeing activities, such as "cherry blossoms", "Japanese garden", etc. An example hierarchy using these criteria is shown in Fig. 1.

When adopting such a hierarchy structure, the problem of deciding on the optimal alternative can be solved by estimating weights for the criteria. These weights are often obtained through pairwise comparisons, followed by weight tuning based on the results of such comparisons [4]. However, this methodology cannot be applied directly to spoken dialogue systems: the system knowledge is usually not fully observable to users at the beginning of a dialogue, and is observed only through interaction with the system. In addition, spoken dialogue systems usually handle quite a few candidates and criteria, which makes pairwise comparison a costly affair. Although there have been several studies dealing with decision making through a spoken dialogue interface [3], these works assume that the users know all the criteria that the users/system can use for making decisions.

In this work, we assume a situation where users are unaware not only of what kind of information the system can provide but also of their own preferences or the factors that they should emphasize. We therefore consider a spoken dialogue system that provides users with information via system-initiative recommendations. We assume that the number of alternatives is relatively small, and that all alternatives are known to the users. This is highly likely in real-world situations; for example, the situation wherein a user selects one restaurant from a list of candidates presented by a car navigation system.



3 Decision Support System with Spoken Dialogue Interface

3.1 System Overview

The dialogue system we constructed is capable of two functions: answering users' requests and recommending information to them. With the answering function, the system can explain the sightseeing spots in terms of every determinant, unlike conventional systems that are only capable of explaining a pre-set abstract of a given spot. With the recommendation function, the system provides information about what it can explain, since novice users are unlikely to know the capabilities of the system (e.g., the system, as part of its recommendation function, suggests determinants that the user might be interested in). The system flow based on these strategies is summarized below; a minimal code sketch of this loop follows the list. The system:

1. recognizes the user's utterance,
2. detects the spot and determinant in the user's utterance,
3. presents information based on this understanding,
4. recommends information related to the current topic.
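The sketch below illustrates this four-step loop. The module objects and their methods (asr, understander, db, recommender, tts) are hypothetical placeholders introduced only to make the flow concrete; they are not the authors' actual implementation.

```python
# Minimal sketch of the answer-and-recommend dialogue loop described above.
# All module objects are hypothetical placeholders, not the actual system.

def dialogue_loop(asr, understander, db, recommender, tts):
    context = {"spot": None, "determinant": None}
    while True:
        utterance = asr.recognize()                             # 1. recognize the utterance
        if understander.is_commitment(utterance):               # e.g. "I'll go to XXX"
            tts.speak("The item has been chosen. Are there any questions?")
            break
        spot, det = understander.detect(utterance, context)     # 2. detect spot / determinant
        context["spot"] = spot or context["spot"]
        context["determinant"] = det or context["determinant"]
        answer = db.lookup(context["spot"], context["determinant"])  # 3. present information
        recommendation = recommender.suggest(context)           # 4. recommend related info
        tts.speak(answer + " " + recommendation)
```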

3.2 Knowledge Base

Our back-end DB consists of 15 sightseeing spots as alternatives and 10 determinants described for each spot. The number of alternatives is small compared to systems dealing with information retrieval; this work focuses on the process of comparing and evaluating candidates that meet an "essential condition" such as "famous temple around Kyoto station". We selected determinants that frequently appear in our dialogue corpus [5]; they are listed in Table 4. Normally, these determinants are related and dependent on one another, but in practice the determinants are assumed to be independent and to have a parallel structure.

Each spot is annotated in terms of these determinants: the value of the evaluation is "1" when the determinant applies to the spot and "0" when it does not. The explanation text is generated by retrieving appropriate reasons from the Web. An example of the DB is shown in Table 1.

Table 1. Example of the database (translation of Japanese)

Spot name        Determinant      Eval.  Text
Kiyomizu temple  Cherry blossoms  1      There are about 1,000 cherry trees in the temple ground. Best of all, the vistas from the main temple are amazing.
                 Vista            1      The temple stage is built on the slope, and the views of the town from here are breathtaking.
                 Not crowded      0      This temple is very famous and popular, and is thus constantly crowded.
                 ...              ...    ...
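To make the structure of Table 1 concrete, the sketch below models one DB entry as a small Python record. The field names and sample rows are taken from the table, but the class itself is only an illustrative assumption about how such a knowledge base could be stored, not the authors' actual implementation.

```python
from dataclasses import dataclass

@dataclass
class DBEntry:
    """One row of the knowledge base: a spot, a determinant, a 0/1 evaluation, and explanation text."""
    spot: str
    determinant: str
    evaluation: int   # 1 if the determinant applies to the spot, 0 otherwise
    text: str

# Sample entries corresponding to Table 1 (translated from Japanese).
KNOWLEDGE_BASE = [
    DBEntry("Kiyomizu temple", "Cherry blossoms", 1,
            "There are about 1,000 cherry trees in the temple ground."),
    DBEntry("Kiyomizu temple", "Vista", 1,
            "The temple stage is built on the slope, and the views of the town are breathtaking."),
    DBEntry("Kiyomizu temple", "Not crowded", 0,
            "This temple is very famous and popular, and is thus constantly crowded."),
]

def lookup(spot: str, determinant: str) -> str:
    """Return the explanation text for a (spot, determinant) pair, if any."""
    for entry in KNOWLEDGE_BASE:
        if entry.spot == spot and entry.determinant == determinant:
            return entry.text
    return "Sorry, that information is not available."
```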



3.3 Speech Understanding and Response Generation

Our speech understanding process tries to detect sightseeing spots and determinant information in the automatic speech recognition (ASR) results. We thus prepared two modules, one for the spots and one for the determinants. In order to facilitate flexible understanding, we adopted an example-based understanding method based on vector space models. That is, the ASR results were matched against a set of documents written about the target spots¹, and the spots with the highest matching scores were used as understanding results. The ASR results were also matched against a set of sample query sentences to detect determinants. In addition, we concatenated contextual information on the spot or determinant under current focus if the ASR results included either a spot or a determinant. The system then generates a response by selecting one of the appropriate responses from the DB, and presents it through synthesized speech.
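As an illustration of the example-based matching described above, the following sketch scores an ASR hypothesis against per-spot documents using simple term-frequency vectors and cosine similarity. It is only a minimal stand-in for the paper's vector space models; the tokenization and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_spot(asr_result: str, spot_documents: dict) -> str:
    """Return the spot whose document is most similar to the ASR result."""
    query_vec = Counter(asr_result.lower().split())
    scores = {spot: cosine(query_vec, Counter(doc.lower().split()))
              for spot, doc in spot_documents.items()}
    return max(scores, key=scores.get)

# Toy example with assumed (not actual) documents:
docs = {
    "Kiyomizu temple": "famous temple stage vista cherry blossoms crowded",
    "Kinkakuji temple": "golden pavilion garden pond world heritage",
}
print(best_spot("is there a temple with a nice vista", docs))  # -> "Kiyomizu temple"
```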

3.4 System Initiative Recommendation

In information retrieval systems, users often have difficulty formulating queries. This is particularly the case when they are unsure of what information the system possesses. In addition, it is important to raise users' awareness of their potential preferences through the dialogue. We therefore designed a system-initiative recommendation, which is appended after the system's response. The content of the recommendation is determined by one of the following three methods.

1. Recommendation based on the currently focused spot
This method is triggered by the user's current focus on a particular spot. Specifically, the system selects three determinants related to the current spot whose evaluation is "1" and presents them to the user.

2. Recommendation based on the currently focused determinant
This method is triggered by the focus on a certain specific determinant. The system selects three spots whose evaluation for the current determinant is "1".

3. Open prompt
The system does not make a recommendation, and presents an open prompt. After users become acquainted with the domain and the system knowledge, this method may be preferable, since users may become irritated by repetitive recommendations.

Although various recommendation methods could be used, such as the collaborative filtering often employed in recommender systems [6], in this work we use these simple methods, because we do not estimate user preferences and thus the system may present irrelevant information. An example dialogue with the system is shown in Table 2. Note that, in the following experiment, the system was set to choose randomly among the applicable recommendation methods.
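The sketch below illustrates how the three recommendation methods and the random selection used in the experiment could be combined. The knowledge-base format and the focus-tracking dictionary are assumptions made for the example, not the system's actual code.

```python
import random

def recommend(context, kb):
    """Pick one of the three recommendation methods described above.

    context: dict with the currently focused "spot" and "determinant" (either may be None).
    kb: list of (spot, determinant, evaluation) triples, evaluation being 0 or 1.
    """
    methods = ["open_prompt"]
    if context.get("spot"):
        methods.append("by_spot")
    if context.get("determinant"):
        methods.append("by_determinant")
    method = random.choice(methods)  # the experiment chose randomly among applicable methods

    if method == "by_spot":
        dets = [d for (s, d, e) in kb if s == context["spot"] and e == 1][:3]
        return f"I can tell you about {', '.join(dets)} for {context['spot']}. Shall I explain?"
    if method == "by_determinant":
        spots = [s for (s, d, e) in kb if d == context["determinant"] and e == 1][:3]
        return f"{', '.join(spots)} are known for {context['determinant']}."
    return "Do you have any other thing you want to know?"
```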

¹ The documents we referred to were all sourced from Wikipedia.



Table 2. Example dialogue with the system (translation of Japanese)

S1: Hello, this is the Kyoto tourist information system. How may I help you?
U1: Please show me the cherry blossoms of Ninnaji.
S2-Ans.: The cherry blossoms of Ninnaji are specially referred to as "omuro no sakura". These cherry trees are quite unique because they are lower than the others.
S2-Rec.: Japanese garden, events and fall foliage are some of the areas in which information is available on Ninnaji; do you want me to explain? (Method 1)
. . .
U11: Does it have a lovely landscape?
S12-Ans.: From the top of the famous gate, you can see Kyoto city spread out below.
S12-Rec.: I can name Kiyomizu-temple, Fushimiinari and Kurama-temple as famous spots that grant great vistas. (Method 2)
U12: Kiyomizu temple.
S13-Ans.: Kiyomizu temple is built on the side of the hill, and you are treated to a view of Kyoto city center.
S13-Rec.: Do you have any other thing you want to know? (Method 3)
U13: I have decided to go to Nanzenji temple.
S14: The item, Nanzenji temple, has been chosen. Are there any questions?
. . .

Table 3. Percentages of user utterance types for system prompts

Utterance type                    Opening (%)  Method 1 (%)  Method 2 (%)  Method 3 (%)
Acceptance                        -            71.5          30.0          -
Determinant                       15.8         7.0           25.0          38.6
Spot name                         50.0         16.8          16.7          24.8
Determinant and spot name         0            1.4           6.7           2.0
Others (Commitment, OOS, etc.)    34.2         3.3           21.6          34.6

4 System Experiment and Analysis

4.1 User Experiment

We collected test data from 72 subjects who had not used our system before. Subjects were requested to use the system to decide on one sightseeing spot out of the 15 alternatives, based on the information obtained from the system. No instructions on the system knowledge were given, except for example utterances such as "Could you please tell me the spots famous for XXX?" and "Tell me about XXX temple". We asked the subjects not to use their own knowledge and experience when arriving at a decision. Further, they were requested to utter the phrase "I'll go to XXX," signifying commitment, once they had reached a decision.

Only one set of dialogues was collected per subject, since the first such dialogue session would very likely have altered the level of user knowledge. The average length of dialogue before a user communicated his/her commitment was 16.3 turns, with a standard deviation of 7.0 turns.

4.2 Analysis of Collected Dialogue Sessions

We transcribed a total of 1,752 utterances and labeled their correct dialogue acts (spots and determinants) by hand.



Table 4. Analysis of user preference and knowledge

Determinant       Percentage of users   Percentage of users   Percentage of users who uttered it
                  who value it (%)      who uttered it (%)    before system recom. (%)
Japanese garden   34.7                  47.2                  22.2
Not crowded       19.4                  41.7                  1.4
World heritage    48.6                  50.0                  2.7
Vista             48.6                  22.2                  1.4
Easy access       16.7                  19.4                  19.4
Fall foliage      37.5                  47.2                  18.1
Cherry flower     33.3                  51.4                  13.9
History           43.1                  31.9                  12.5
Stroll            45.8                  38.9                  1.4
Event             29.2                  36.1                  8.3

The percentage of user utterances that the system could handle was 89.0%, out of which the system could respond correctly to 72.4%.

Analysis of user utterances. First, we analyzed the relationship between system prompts and user utterances. The percentages of user utterance types for each system prompt are shown in Table 3.

"Acceptance of recommendation" refers to the cases where the user accepts the recommendation. That is, Method 1 is regarded as accepted when the user asks about one of the recommended determinants, and Method 2 is regarded as accepted when the user asks about one of the recommended spots.

The tendency of user utterances varies according to the recommendation type. Many users make queries that the system cannot handle (out of system; OOS) in the opening and after the open prompt (Method 3). Meanwhile, many users can make in-domain queries when system knowledge is presented through recommendations.

Analysis of user preference and domain knowledge. We analyzed the sessions in terms of the preferences and domain knowledge of the subjects. Table 4 lists, by percentage, the preferences that subjects emphasize when selecting sightseeing spots. These are based on questionnaires conducted after the dialogue session (multiple selections were allowed). Since subjects were asked to select determinants from the list of all determinants, their selections are considered to be their preferences under full knowledge of the system. However, when the subjects started the dialogue sessions, some of the above preferences were only potential preferences, owing to the limited nature of the users' knowledge about the system.

In order to analyze the users' knowledge, we analyzed, for each determinant, the percentage of users who uttered it before the system recommended it. The result is shown in Table 4.

Several determinants were seldom uttered before the system made its recommendations, even though they were important for many users. For example, "World heritage site information" and "Stroll information" were seldom uttered before the system's recommendation, despite the fact that around half of the users had emphasized them. These results show that some of the users' actual preferences remained potential preferences before the system made its recommendation, or at the very least, the users were not aware that the system was able to explain those determinants; thus, it is important to make users notice their potential preferences through system-initiative recommendations.

4.3 Analysis of Users’ Decisions

Finally, we analyzed the relationship between user preferences and the decided spot. We counted the number of attributes on which the user's preferences and the decided spot agreed. The average number of agreements was 2.20, which was higher than the expectation under random selection (1.96). However, if the users had known about their potential preferences and about the system knowledge, and had then selected an optimal spot according to their preferences, the average number of agreements would have been 3.34. This result indicates that an improved recommendation strategy can help users make a better choice.
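The agreement measure above can be made explicit with a few lines of code: for each user, count the determinants that the user values and that the chosen spot satisfies (evaluation "1"), and compare with the best achievable spot. The data below are invented solely to illustrate the computation.

```python
def agreements(user_prefs: set, spot_evals: dict) -> int:
    """Number of valued determinants that the spot satisfies (evaluation == 1)."""
    return sum(1 for det in user_prefs if spot_evals.get(det) == 1)

# Invented toy data, for illustration only.
spots = {
    "Kiyomizu temple": {"Cherry blossoms": 1, "Vista": 1, "Not crowded": 0},
    "Ryoanji temple":  {"Cherry blossoms": 0, "Vista": 0, "Not crowded": 1},
}
prefs = {"Vista", "Cherry blossoms"}          # determinants the user says they value

chosen = "Ryoanji temple"
print(agreements(prefs, spots[chosen]))                    # 0: the decided spot matches none
print(max(agreements(prefs, e) for e in spots.values()))   # 2: an optimal choice would match both
```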

5 Conclusion

In this paper, we addressed a spoken dialogue framework that helps users select an alternative from a list of alternatives. Through an experimental evaluation, we confirmed that user utterances are largely affected by system recommendations; moreover, we found that users can be helped to make better decisions by improving the dialogue strategies. We will therefore extend this framework to estimate users' preferences from their utterances. In addition, the system is expected to handle more complex planning of natural language generation in recommendations, such as that discussed in [7]. We also plan to optimize the selection of responses and recommendations based on users' preferences and state of knowledge.

References

1. Levin, E., Pieraccini, R., Eckert, W.: A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies. IEEE Trans. on Speech and Audio Processing 8, 11–23 (2000)
2. Lamel, L., Bennacef, S., Gauvain, J.L., Dartigues, H., Temem, J.N.: User Evaluation of the MASK Kiosk. Speech Communication 38(1) (2002)
3. Polifroni, J., Walker, M.: Intensional Summaries as Cooperative Responses in Dialogue: Automation and Evaluation. In: Proc. ACL/HLT, pp. 479–487 (2008)
4. Saaty, T.: The Analytic Hierarchy Process: Planning, Priority Setting, Resource Allocation. McGraw-Hill, New York (1980)
5. Ohtake, K., Misu, T., Hori, C., Kashioka, H., Nakamura, S.: Annotating Dialogue Acts to Construct Dialogue Systems for Consulting. In: Proc. the 7th Workshop on Asian Language Resources, pp. 32–39 (2009)
6. Breese, J., Heckerman, D., Kadie, C.: Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In: Proc. the 14th Annual Conference on Uncertainty in Artificial Intelligence, pp. 43–52 (1998)
7. Rieser, V., Lemon, O.: Natural Language Generation as Planning Under Uncertainty for Spoken Dialogue Systems. In: Proc. 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL (2009)



A Study Toward an Evaluation Method for Spoken Dialogue Systems Considering User Criteria

Etsuo Mizukami, Hideki Kashioka, Hisashi Kawai, and Satoshi Nakamura

National Institute of Information and Communications Technology 3-5, Hikaridai, Seikacho, Sorakugun, 619-0289, Kyoto, Japan

[email protected]

Abstract. In the development cycle of a spoken dialogue system (SDS), it is important to know how users actually behave and talk and what they expect of the SDS. We are developing SDSs which realize natural communication between users and systems. To collect users' real data, a wide-scale experiment was carried out with a smart-phone prototype SDS. In this brief paper, we report on the experiment's results and make a tentative analysis of cases in which there were gaps between system performance and user judgment. This requires both an adequate experimental design and an evaluation methodology that considers users' judgment criteria.

1 Introduction

We are developing spoken dialogue systems (SDSs) which realize natural communication between users and systems. For this, we adopt a mobile-type SDS as a prototype system. A mobile-type system could enable users to access information easily at any place and any time. Although the performance of each system module has progressively improved, if users talk to the systems as they do to humans, without proper usage methods, the systems cannot at present behave ideally, as a human would. We must develop systems with consideration of how users behave with them and what they expect of them. In this brief report, we introduce a smart-phone prototype system, report on some of the results of a wide-scale experiment with human monitors and of the analysis of the interaction between users and the system, and discuss a problem regarding the relation between users' expectations and the system's performance.

2 Methods

We constructed systems with data and dialogue scenarios for sightseeing guidance in Kyoto. The use case of the mobile-type system is a situation in which users want to decide on a place to visit and obtain the necessary information about it. In this section we describe the outline of the mobile system and the procedures of the experiment.



2.1 Outline of System

The mobile-type, server-client system runs on a smart-phone (iPhone, ©Apple Inc.). The user can input a request in natural spoken language by touching the microphone icon on the display panel, and complete the request by touching the end icon. During input, the speech wave data is sequentially divided into chunks of a certain size and sent to the speech recognition server wirelessly (3G or WiFi), which starts the recognition process as the data arrive. The result of the speech recognition is sent to the dialogue management server (WFSTDM: Weighted Finite State Transducer Dialog Manager; Hori et al., 2009) and a corresponding action is called by the language understanding and the scenario WFSTDM. A relevant back-end module (map retrieval, bus timetable information, or the Kyoto tour guide database) is then called, and the extracted information is sent to the client together with the wave data of the system speech generated by the speech synthesis server, to be played on the speaker. The user can scroll and resize the display via the iPhone's touch interface.
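The client-server flow described above can be summarized in a short sketch. The chunk size and all module objects and methods below are purely hypothetical placeholders used to illustrate the sequence of steps (stream audio to ASR, pass the hypothesis to the dialogue manager, present the result); they are not the actual NICT implementation.

```python
# Hypothetical sketch of the smart-phone client's request cycle (not the actual system).

CHUNK_SIZE = 4096  # assumed chunk size for streaming audio to the ASR server

def handle_user_request(microphone, asr_server, dialogue_server, speaker, screen):
    # 1. Stream recorded audio to the speech recognition server in fixed-size chunks.
    for chunk in microphone.record_chunks(CHUNK_SIZE):
        asr_server.send(chunk)
    hypothesis = asr_server.finish()          # recognition result, e.g. "show me a map of the area"

    # 2. The dialogue manager (a WFST-based manager in the paper) maps the
    #    recognition result to an action and queries a back-end module.
    action = dialogue_server.understand(hypothesis)
    content, reply_text = dialogue_server.execute(action)   # e.g. map image + reply sentence

    # 3. Present the result: display the content and play synthesized speech.
    screen.show(content)
    speaker.play(dialogue_server.synthesize(reply_text))
```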

2.2 Experiment Procedures

The experiment was carried out over 10 days in an office room in Kyoto in December 2009. Since 10 participants took part each day, the total number of participants was 80, including 10 who participated three times, for a total of 100 sessions. The participants consisted of 46 females and 34 males aged from their 20s to their 70s. The participants had no previous experience of using spoken dialogue systems. First, participants answered a questionnaire about their background (a modified and translated ITU-T Rec. P.851 questionnaire, http://www.itu.int/itu-t/recommendations/), and received an explanation of the experiment and how to use the system. If the participant had no questions about the procedures and usage of the system, the experimental tasks were conducted as follows.

1) Practical scenario task
In order to acclimate participants to the system, all were instructed to input a typical scenario composed of seven user turns, as follows.

User: Hello.
System: Welcome to Kyoto. This is the Kyoto tour guide system. Do you have any particular place you'd like to go?
User: I'd like to go somewhere with a beautiful garden.
System: Fifty-nine sites were found as recommended sites for gardens, such as Tenryuji, Ginkakuji, and Kyoto Imperial Palace. Which are you interested in?
User: Do you have a recommendation?
System: How about Kyoto Imperial Palace? Kyoto Imperial Palace is very much recommended for the beauty of its gardens. It has a lovely and famous Japanese-style garden called Oikeniwa, which was built in the early Edo period. Is there any other information that you'd like to know?
User: How do I get there?
System: Access to Kyoto Imperial Palace is shown here. Is there any other information that you are interested in regarding Kyoto Imperial Palace?
User: Show me a map of the area.
System: Here is a map of the area around Kyoto Imperial Palace. Is there any other information that you are interested in regarding Kyoto Imperial Palace?
User: No, I've decided to go there.
System: Are you sure that Kyoto Imperial Palace is your goal?
User: Yes, thanks.
System: Kyoto Imperial Palace is decided as the goal. Thank you very much for using the system.

2) Free scenario task
One of the aims of the experiment was to collect actual data on users' natural utterances and expressions, and this task was designed with that aim. Participants were instructed to use the system freely, without concern about the expression of utterances as in the practical scenario task, but with the aim of deciding on a destination, repeating the task until the time limit. The time limit was set at 15 min., including the practical scenario task. Half of the one-time participants used the system while referring to a list of phrases acceptable to the system (Presented), while the others did not refer to them (Non-presented).

After both tasks the participants judged the system by means of a subjective evaluation using a modified and translated ITU-T Rec. P.851 questionnaire. They responded to 20 statements in terms of "Disagree," "Somewhat disagree," "Somewhat agree," and "Agree." Table 1 shows the questionnaire statements. The participants were also asked to share any comments or needs. During the experiments, the participants' usage of the system was recorded on video (SONY DSR45) and audio (MOTU828 and Digital Performer).

Table 1. Questionnaire statements for subjective evaluation of the system

Q-No. Statement
Q1: My overall impression of the system was good.
Q2: The system provided the desired information.
Q3: The provided information was complete.
Q4: The information was clear.
Q5: I expected more help from the system.
Q6: I felt well-understood by the system.
Q7: The system's voice was clear.
Q8: I understood what the system expected from me.
Q9: The system's behaviour was always as expected.
Q10: The system reacted naturally.
Q11: I was able to control the dialogue in the way I wanted.
Q12: The system reacted fast enough.
Q13: The system's voice was natural.
Q14: Overall, I was satisfied with the dialogue.
Q15: I enjoyed the dialogue.
Q16: During the dialogue, I felt relaxed.
Q17: I prefer a human guide.
Q18: I can see this possibility for obtaining information as helpful.
Q19: The handling of the system was easy.
Q20: In the future, I would use the system again.



3 Results

The data obtained were the system logs, the wave data saved on the speech recognition server and the audio recorder, the video data, and the subjective judgments and users' requirements from the questionnaires. The data of two sessions were omitted from the analysis, since those sessions were carried out using an alternative system due to system trouble. The wave data the remaining participants input were transcribed as text data for the evaluation (N = 98 sessions).

3.1 Interaction Parameters

Table 2 shows results from the experiment as interaction parameters (Möller, 2007) with respect to system performance. Since our system speaks in reply to the user's request, the number of system turns should be the same as the number of user turns; the difference between them came mainly from the detection of noise as input when a participant touched the input icon but did not speak. WER is the word error rate for user utterances, defined as (S+D+I)/N, where N is the number of words in the user sentences, C the number of correctly recognized words, S the number of substitution errors, I the number of insertion errors, and D the number of deletion errors. S.ERR is the sentence error rate, i.e. the rate of recognized sentences that do not match the user sentences. CR is the rate of correct responses of the system, including incomplete information such as "Restaurant information is not available now." in response to user requests such as "Is there any place to have lunch around Kinkakuji?" Some values indexing system performance tended to be much worse for older participants; S.ERR and the rate of CR for those in their 50s were 51.8% and 46.8% respectively, but 36.1% and 60.9% for those in their 20s. There were no significant differences among them regarding WER.
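As a concrete illustration of the WER definition above, the sketch below computes (S+D+I)/N with a standard minimum-edit-distance alignment between a reference transcript and an ASR hypothesis; the two example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N via minimum-edit-distance alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # i deletions
    for j in range(m + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
    return d[n][m] / n

print(word_error_rate("is there any place to have lunch around kinkakuji",
                      "is there a place to have lunch near kinkakuji"))
# ~0.22: two substitutions out of nine reference words
```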

Table 2. Statistical data and interaction parameters obtained from the recorded data. See the text for more details.

Interaction parameter           Amount
Total user turns                4409
Total system turns              4441
WER (%)                         19.9
S.ERR (%)                       38.6
Rate of CR of task 1 (%)        88.3
Rate of CR of task 2 (%)        63.3
Av. system-user delay (s)       4.90
Av. user-system delay (s)       5.51
Av. user-utt. duration (s)      1.64
Av. system-utt. duration (s)    7.02
Total # of tasks                315
Av. # of tasks                  3.21
Total # of task successes       228
Rate of task successes (%)      72.4
Max user turns per task         42
Av. user turns per task         11.5



Task success was defined only as the case in which the participant decided on a destination and closed the dialogue. Task success for a free scenario task was difficult to define, because participants could request or query information about one spot after another without deciding on any destination. It is therefore difficult to calculate the Kappa coefficient adopted in PARADISE (Walker et al., 1997), because we cannot define the necessary "keys" (slots). Cases in which participants cancelled by inputting "Restart," or in which the system crashed due to some form of trouble, were counted as failures. The rate of task success and the other interaction parameters are also shown in Table 2.

3.2 Subjective Evaluation

The values of the subjective evaluations were converted to -1, -0.33, +0.33, and +1. From a factor analysis of these data (maximum likelihood solution, promax rotation; Q5 and Q17 were omitted because of their low communality), four factors were extracted and named: "Acceptability", constructed from Q1, Q2, Q9, Q10, Q11, and Q14; "System Potential", from Q3, Q4, Q18, and Q20; "System Transparency", from Q7, Q8, Q12, and Q13; and "User Comfort", from Q15, Q16, and Q19. The average values of the subjective evaluations of the items composing each factor with high factor loading (> 0.35) were Acceptability = -.22, Potential = +.03, Transparency = +.17, and Comfort = -.08. The correlation between the Acceptability factor score and the sentence error rate was significantly negative (Kendall's tau = -.145, p < .05). There was also a significantly positive correlation (Kendall's tau = .240, p < .01) between the Acceptability score and the rate of correct responses. Therefore, the participants who were recognized and responded to correctly by the system tended to judge the Acceptability items, such as Q1 and Q2, comparably high. On the other hand, not all participants who were responded to correctly by the system in over 70% of their requests (av. COR > 89.4%) judged Acceptability highly; e.g., the average judgment on Q14, "Overall, I was satisfied with the dialogue.", was 0.0 for these participants (N=19), while the average CR of the participants who judged Q14 as "Agree" (1) or "Somewhat agree" (0.33) was 60.0%.
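The correlation analysis reported here can be reproduced with standard tools. The sketch below computes Kendall's tau between per-user Acceptability factor scores and per-user sentence error rates using SciPy; the numbers are made up and stand in for the experimental data only.

```python
from scipy.stats import kendalltau

# Made-up per-user values, for illustration only: one Acceptability factor
# score and one sentence error rate (S.ERR) per participant.
acceptability = [0.8, 0.1, -0.4, 0.5, -0.9, 0.3, -0.2, 0.6]
sentence_err  = [0.20, 0.35, 0.55, 0.30, 0.60, 0.25, 0.45, 0.15]

tau, p_value = kendalltau(acceptability, sentence_err)
print(f"Kendall's tau = {tau:.3f}, p = {p_value:.3f}")
# A negative tau, as in the paper (-.145), means higher recognition error
# rates go together with lower Acceptability judgments.
```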

There were no significant differences between the two experimental groups (Presented group A and Non-presented group B) in either the interaction parameters or the subjective evaluations.

4 Discussions and Conclusion

Why did some users rate the system poorly despite its high performance? Some factors that could influence the evaluation were extracted from the analysis of the interaction between the users and the system and from the users' comments and needs given in the questionnaires. 1) User's stance towards the system: Strict users did not repeat the same request once the system failed to respond, and tended to perform the free scenario task like routine work; that is, they repeated typical interactions in a similar way to the practical scenario task. For such users, the most important thing may be the speech-recognition ability, i.e. the credibility of the SDS as an interface. 2) Generation, or habituation to the machine: Older participants tended to be misrecognized by the system comparatively often. There seemed to be problems in the usage method, such as the timing of touching the icon to input the requests, or producing a substandard voice, e.g., speaking too loudly, brokenly, or too quietly. For such users, the most important thing may be the understandability of the system's use, even before its other abilities. These results indicate the importance of experimental design for adequate evaluation of SDSs. This includes the clarification of target users, the task design, and its instruction, taking priming effects into account. Ideally, however, there would be an evaluation method that considers users' backgrounds and communication styles (Mizukami et al., 2009). The above problems may stem from the fact that all users have their own criteria for judging systems, and the difference in users' criteria or expectations can hinder a stable evaluation. A method like PARADISE (Walker et al., 1997) can address this problem by adopting, as efficiency measures, values which correlate highly with user satisfaction, but judgments that deviate from the correlation, as in the above cases, are then presumably ignored. One approach to this problem is to normalize the value of the subjective evaluation by the rate of correct responses. For instance, a normalized value of subjective evaluation NSE may be defined as NSEi(Q#) = SEi(Q#) * cri, where Q# is the statement number which has a significantly positive correlation with the rate of correct responses, i is the user ID, and cri is the rate of correct responses when SE > 0 and the rate of failed responses when SE < 0. After applying this normalization, the factor values are adjusted to Acceptability = -.11, Potential = +.05, Transparency = +.17, and Comfort = -.08.
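The proposed normalization can be written out directly. The short sketch below implements NSEi(Q#) = SEi(Q#) * cri as described, using invented per-user values for the subjective score and the correct/failed response rates.

```python
def normalized_subjective_evaluation(se: float, correct_rate: float, failed_rate: float) -> float:
    """NSE_i(Q#) = SE_i(Q#) * cr_i, where cr_i is the correct-response rate if SE > 0
    and the failed-response rate if SE < 0, as proposed in the text."""
    if se > 0:
        return se * correct_rate
    if se < 0:
        return se * failed_rate
    return 0.0

# Invented example: a user who judged a statement as "Somewhat agree" (+0.33) with a
# correct-response rate of 0.60, and one who judged it "Disagree" (-1.0) with
# a failed-response rate of 0.25.
print(normalized_subjective_evaluation(+0.33, correct_rate=0.60, failed_rate=0.40))  # 0.198
print(normalized_subjective_evaluation(-1.00, correct_rate=0.75, failed_rate=0.25))  # -0.25
```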

We are now improving the systems based on the experiment results, along with the methodology of the experiment and evaluation framework.

References

1. Hori, C., Ohtake, K., Misu, T., Kashioka, H., Nakamura, S.: Statistical dialog management applied to WFST-based dialog systems. In: Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4793–4796 (2009)
2. Mizukami, E., Kashioka, H., Kawai, H., Nakamura, S.: An Exploratory Analysis on Users' Communication Styles Affecting Subjective Evaluation of Spoken Dialogue Systems. In: Proceedings of the 1st IWSDS (2009)
3. Möller, S.: Evaluating Interactions with Spoken Dialogue Telephone Services. In: Recent Trends in Discourse and Dialogue, pp. 69–100. Springer, Heidelberg (2007)
4. Walker, M.A., Litman, D.J., Kamm, C.A., Abella, A.: PARADISE: A Framework for Evaluating Spoken Dialogue Agents. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 271–280 (1997)

Page 193: Spoken Dialogue Systems for Ambient Environments: Second International Workshop on Spoken Dialogue Systems Technology, IWSDS 2010, Gotemba, Shizuoka, Japan, October 1-2, 2010. Proceedings

A Classifier-Based Approach to Supporting the Augmentation of the Question-Answer Database for Spoken Dialogue Systems

Hiromi Narimatsu¹, Mikio Nakano², and Kotaro Funakoshi²

¹ The University of Electro-Communications, Japan
² Honda Research Institute Japan Co., Ltd., Japan

Abstract. Dealing with a variety of user questions in question-answer spoken dialogue systems requires preparing as many question-answer patterns as possible. This paper proposes a method for supporting the augmentation of the question-answer database. It uses user questions collected with an initial question-answer system, and detects questions that need to be added to the database. It uses two language models: one is built from the database and the other is a large-vocabulary domain-independent model. Experimental results suggest that the proposed method is effective in reducing the amount of effort for augmenting the database when compared to a baseline method that uses only the initial database.

1 Introduction

When humans engage in a dialogue, they use knowledge of the dialogue topic. Without such knowledge, they cannot understand what other humans say and they cannot talk about the topic. Spoken dialogue systems also need to have knowledge of the topic of the dialogue. Technically speaking, such knowledge is a knowledge base consisting of speech understanding models, dialogue management models, and dialogue content. Constructing a knowledge base is, however, a time-consuming and expertise-demanding task. It is therefore crucial to find a way to facilitate constructing the knowledge base.

This paper concerns a kind of spoken dialogue system that answers user questions by retrieving answers from a database consisting of a set of question-answer pairs. We call such systems Question-Answering Spoken Dialogue Systems (QASDS) and the databases Question-Answer Databases (QADB). In each example question, keyphrases are indicated by braces. Those keyphrases are used for matching a speech recognition result for a user utterance with the example questions, as is done in [4]. If the speech recognition result contains the same set of keyphrases as one of the example questions in a QA pair, the answer in that pair is selected. Fig. 1 illustrates this. A statistical language model for speech recognition is trained on the set of example questions in the database. Although QASDSs are simpler than other kinds of systems, such as ones that perform frame-based dialogue management, they have an advantage in that they are easy to design for people without expertise in spoken language processing, because the system behaviors are more predictable than those of more complicated systems.


Page 194: Spoken Dialogue Systems for Ambient Environments: Second International Workshop on Spoken Dialogue Systems Technology, IWSDS 2010, Gotemba, Shizuoka, Japan, October 1-2, 2010. Proceedings


[Figure: a QA pair in the QADB with the example questions "tell me about {parks} in {Sengawa}" and "are there any {squares} in {Sengawa}" and the answer "There is Saneatsu park in Sengawa ..."; an input utterance ("are there any parks in Sengawa") goes through speech recognition and keyphrase-based matching ("parks" and "Sengawa" match) to select that answer.]

Fig. 1. Answer Selection based on Question-Answer Database


Much work has been done on QASDSs (e.g., [3], [9], and [7]), but it assumes that a lot of real user utterances are available as training data. Unlike that work, we are concerned with how to bootstrap a new system with a small amount of training data, because obtaining a lot of data requires a considerable amount of time and effort, making system development difficult.

One of the most crucial problems with QASDSs is that they cannot handle out-of-database (OODB) questions. Since there are no appropriate answers to OODB questions, the system cannot answer them. In addition, since the language model is built from the example questions in the database, OODB questions tend to be misrecognized, resulting in the selection of an answer that is not desired by the user. Since it is not possible to list all possible questions before system deployment, database augmentation based on the user questions obtained by deploying the system is required. However, augmenting the database requires a lot of effort, since it requires listening to all user questions to find the OODB ones.

This paper proposes a classifier-based approach to supporting QADB augmentation. It tries to find questions that are highly likely to be OODB questions, and asks the developer to determine whether those questions are really OODB or not. This enables the developer to augment the QADB more efficiently than by randomly listening to user questions. From the system's point of view, it automatically selects questions whose transcription is more effective in augmenting the system's database. This can be regarded as a kind of active learning [6,2]. To better estimate the scores, the classifier uses various features obtained from the results of speech recognition using not only the language model built from the initial QADB but also a large-vocabulary domain-independent language model.

2 Proposed Method

Our method uses a classifier that classifies user questions into OODB and in-database (IDB) questions in order to estimate a score that indicates how likely a question is to be OODB. The classifier uses various features obtained from the results of speech recognition using both the language model built from the initial database and a large-vocabulary domain-independent language model. Features concerning the confidence of the recognized keyphrases should be effective in indicating how likely the question is to match the example question having the same keyphrases.

The results of speech recognition with the large-vocabulary language model are used for estimating the correctness of the results of speech recognition with the database-derived language model. This is similar to utterance verification techniques [5]. They can also be used for investigating whether the question includes noun phrases other than keyphrases; the existence of such noun phrases indicates that the question might be OODB.
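At a high level, the approach amounts to scoring each collected question with the classifier and handing the highest-scoring ones to the developer for transcription and database augmentation. The sketch below illustrates that selection loop; the classifier object, its score method, and the budget parameter are assumptions made for the illustration, not the authors' code.

```python
def select_questions_for_review(questions, classifier, budget=100):
    """Rank collected user questions by estimated OODB likelihood and return
    the top ones for the developer to listen to and, if truly OODB, add to the QADB."""
    scored = [(classifier.score_oodb(q), q) for q in questions]  # P(OODB | features)
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [q for _, q in scored[:budget]]

# The developer then labels only the selected questions instead of all of them,
# which is where the reduction in augmentation effort comes from.
```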

3 Experiments

3.1 Data

We used data collected with a QASDS that provides town information. The initial QADB contains 232 question-answer pairs and 890 example questions in total. The vocabulary size of the language model built from the database was 460 words. When the system answers user questions, corresponding slides are shown at the same time. Twenty-five people (12 males and 13 females) each engaged in dialogues with the system twice, for about 14 minutes each time. In total, we collected 4,076 questions in the experiment. Among them, 594 are non-question utterances, such as utterances consisting of just fillers and fragments resulting from end-point detection errors. These were excluded from the experimental data, as we plan to incorporate a separate method for detecting such utterances.

3.2 Classifier Training

We used the following two language models for speech recognition:

– LMdb: The trigram model trained on the 890 example questions in the QADB.
– LMlv: A domain-independent large-vocabulary trigram model trained on Web texts [1]. Its vocabulary size is 60,250 words.

We used Julius (http://julius.sourceforge.jp/) as the speech recognizer, and Palmkit (http://palmkit.sourceforge.net/), a language model toolkit compatible with the CMU-Cambridge Toolkit and developed at Tohoku University, for training the language models.

From the speech recognition results, we extracted 35 features. Due to a lack of space we do not list all of them. Sixteen features were obtained from the result of speech recognition with LMdb. They include the acoustic score, the language model score, the number of words in the top recognition result, the average, minimum, and maximum of the confidence scores of the keyphrases used for answer selection, the ratio of nouns in the top recognition result, and whether the top speech recognition result was classified as OODB or not. Nine similar features were obtained from the result of speech recognition with LMlv, although some of the answer-selection-related features were not used. Ten features were obtained by comparing the features from the LMdb-based speech recognition results with those from the LMlv-based speech recognition results.

We used the logistic regression classifier in the Weka data mining toolkit [8] (a logistic regression model with a ridge estimator, with Weka's default values). We used the first 25 questions of each of 5 users as a training data set, and 50 questions (the first 25 questions of each of the two dialogue sessions) from each of the remaining 20 people as the test data set. The non-question utterances were then removed from these sets. The average number of utterances for training in each data set is about 100, and the average number of utterances for testing is about 851. We limited the amount of training data so that the effort required to label it could be kept small.
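For readers who want to reproduce this kind of setup outside Weka, the following sketch shows a rough analogue using an L2-regularized logistic regression in scikit-learn; the data layout and regularization strength are placeholders, not values from the paper.

```python
# Rough analogue of the classifier setup (illustration only): an L2-regularized
# logistic regression over recognition-derived features, as a stand-in for
# Weka's ridge-estimator logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_oodb_classifier(X_train, y_train):
    """X_train: (n_questions, n_features) array of recognition-derived features.
    y_train: 1 for OODB questions, 0 for in-database questions."""
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    clf.fit(X_train, y_train)
    return clf

def oodb_score(clf, feature_vector):
    """Estimated probability that a single question is out-of-database."""
    x = np.asarray(feature_vector).reshape(1, -1)
    return clf.predict_proba(x)[0, 1]
```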

We performed feature selection to avoid overfitting. We used backward stepwise selection so as to maximize the average F-measure of OODB detection (with a threshold of 0.5) over 10-fold cross-validations on the five training sets. Ten features remained, and they achieved an F-measure of 0.74 (recall 0.77, precision 0.70). We examined which of the remaining features are crucial by investigating how much the F-measure decreases when each feature is removed. The top five crucial features are as follows:

1. max_i (the number of occurrences of keyphrase i used for answer selection in SRRdb,all) / (the number of words in SRRdb,all)
2. the number of words in SRRdb,1
3. (the number of nouns in SRRdb,1 / the number of words in SRRdb,1) − (the number of nouns in SRRlv,1 / the number of words in SRRlv,1)
4. (the number of nouns in SRRdb,1 / the number of words in SRRdb,1) − (the number of nouns in SRRdb,all / the number of words in SRRdb,all)
5. (the number of nouns in SRRlv,1 / the number of words in SRRlv,1) − (the number of nouns in SRRlv,all / the number of words in SRRlv,all)

Here SRRdb,1 is the top result of LMdb-based speech recognition, and SRRdb,all is the set of all its results. SRRlv,1 and SRRlv,all are the corresponding results of LMlv-based speech recognition.

We think these features are effective for the following reasons. If Feature 1 is small, the probability that a keyphrase is a misrecognition result is high, and the question is possibly OODB. If Feature 2 is large, the utterance is long, and keyphrases may have been misrecognized as short words. Feature 3, the difference in the ratios of nouns, represents the possibility that the recognition results of LMlv include words outside LMdb; if this value is close to zero, the recognition result of LMdb is likely to be correct. Features 4 and 5 represent the confidence of the recognition result; if these values are large, erroneously recognized nouns are likely to exist, and the question may be OODB.
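To make the feature definitions concrete, the sketch below computes Feature 1 and Feature 3 from token lists; the (word, pos) token representation and the function names are our own illustrative assumptions, not code from the paper.

```python
# Illustrative computation of two of the features above (not the paper's code).
# Recognition results are assumed to be lists of (word, pos) tokens, and
# `keyphrases` is the set of keyphrases used for answer selection.

def feature_1(srr_db_all_tokens, keyphrases):
    """max_i count(keyphrase i in SRRdb,all) / number of words in SRRdb,all."""
    words = [w for w, _ in srr_db_all_tokens]
    if not words or not keyphrases:
        return 0.0
    max_count = max(words.count(k) for k in keyphrases)
    return max_count / len(words)

def noun_ratio(tokens):
    words = [w for w, _ in tokens]
    nouns = [w for w, pos in tokens if pos == "noun"]
    return len(nouns) / len(words) if words else 0.0

def feature_3(srr_db_1_tokens, srr_lv_1_tokens):
    """Difference in noun ratios between the top LMdb and LMlv hypotheses."""
    return noun_ratio(srr_db_1_tokens) - noun_ratio(srr_lv_1_tokens)
```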

3.3 Evaluation Results

We evaluated our method by estimating how much it can reduce the cost of listening to or transcribing user questions in order to augment the database. We compared the number of OODB questions among the n questions extracted by the proposed and baseline methods for a given n.

We compared the following three methods (a code sketch of the extraction strategies follows the list).

– Proposed Method: Extract the top n questions in the order of the scores assigned by the classifier described above.

– Baseline 1 (Random): Extract n questions randomly.
– Baseline 2 (Initial-DB): Extract n questions randomly from among the questions classified as OODB; a question is classified as OODB if the system using the initial QADB cannot select an answer to it. If n is larger than the number of questions classified as OODB using the initial database, the rest are extracted randomly from the remaining questions. In this condition, 5,000 frequent words were added to the language model and treated as unknown-word-class words, which prevents out-of-vocabulary words from being misrecognized as keywords.
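The following sketch restates the three extraction strategies in code; oodb_score is the classifier score from Section 2 and answerable is a hypothetical check of whether the initial QADB can select an answer.

```python
# Sketch of the three extraction strategies compared in the evaluation
# (illustration only; `oodb_score` and `answerable` are assumed helpers).
import random

def extract_proposed(questions, oodb_score, n):
    return sorted(questions, key=oodb_score, reverse=True)[:n]

def extract_random(questions, n):
    return random.sample(questions, n)

def extract_initial_db(questions, answerable, n):
    unanswerable = [q for q in questions if not answerable(q)]
    answerable_qs = [q for q in questions if answerable(q)]
    if n <= len(unanswerable):
        return random.sample(unanswerable, n)
    return unanswerable + random.sample(answerable_qs, n - len(unanswerable))
```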

Figure 2 compares the above methods. The performance of the proposed method is close to that of the initial-QADB-based method when the number of extracted questions is small; this is because the number of questions that the initial-QADB-based method classifies as OODB is small, so its precision is high. The proposed method outperforms the initial-QADB-based method when the number of extracted questions is large.

[Figure 2 plots the number of OODB questions among the extracted questions (0–600) against the number of extracted questions (0–900) for the Proposed Method, Baseline 1 (Random), and Baseline 2 (Initial-DB).]

Fig. 2. The number of out-of-database questions among the extracted questions

4 Summary and Ongoing Work

This paper presented a novel framework for supporting the augmentation of the QADB. It estimates the probability of user questions being OODB, using a language model built from the initial QADB and a large-vocabulary statistical language model. Although the improvement achieved by the proposed method is limited, the experimental results suggest the potential of the framework.


Future work includes finding effective features beyond those used in the experiment. In addition, we plan to investigate ways to build language models that are more effective in detecting OODB questions.

In this experiment, we assumed that the database and the classifier based on which the OODB question candidates are extracted are fixed. In real settings, however, it is possible to incrementally update the database and the classifier for extracting OODB question candidates. Future work includes conducting an experiment in such a setting.

Since our method requires a certain amount of real user questions to train the classifier, we will also try to find a way to train the classifier using user questions from other domains.

References

1. Kawahara, T., Lee, A., Takeda, K., Itou, K., Shikano, K.: Recent progress of open-source LVCSR engine Julius and Japanese model repository. In: Proc. Interspeech 2004 (ICSLP), pp. 3069–3072 (2004)

2. Nakano, M., Hazen, T.J.: Using untranscribed user utterances for improving language models based on confidence scoring. In: Proc. Eurospeech 2003, pp. 417–420 (2003)

3. Nisimura, R., Lee, A., Yamada, M., Shikano, K.: Operating a public spoken guidance system in real environment. In: Proc. Interspeech 2005, pp. 845–848 (2005)

4. Nisimura, R., Uchida, T., Lee, A., Saruwatari, H., Shikano, K., Matsumoto, Y.: ASKA: Receptionist robot with speech dialogue system. In: Proc. IROS 2002, pp. 1314–1317 (2002)

5. Rahim, M.G., Lee, C.H., Juang, B.H.: Discriminative utterance verification for connected digits recognition. IEEE Transactions on Speech and Audio Processing 5(3), 266–277 (1997)

6. Riccardi, G., Hakkani-Tur, D.: Active and unsupervised learning for automatic speech recognition. In: Proc. Eurospeech 2003, pp. 1825–1828 (2003)

7. Takeuchi, S., Cincarek, T., Kawanami, H., Saruwatari, H., Shikano, K.: Question and answer database optimization using speech recognition results. In: Proc. Interspeech 2008 (ICSLP), pp. 451–454 (2008)

8. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)

9. Yoshimi, Y., Kakitsuba, R., Nankaku, Y., Lee, A., Tokuda, K.: Probabilistic answer selection based on conditional random fields for spoken dialog system. In: Proc. Interspeech 2008 (ICSLP), pp. 215–218 (2008)


The Influence of the Usage Mode on Subjectively Perceived Quality

Ina Wechsung, Anja Naumann, and Sebastian Möller

Deutsche Telekom Laboratories, Quality & Usability Lab, TU Berlin, Ernst-Reuter-Platz 7, 10587 Berlin, Germany

{Ina.Wechsung,Anja.Naumann,Sebastian.Moeller}@telekom.de

Abstract. The current paper presents an evaluation study of a multimodal mobile entertainment system. The aim of the study was to investigate the effect of the usage mode (explorative vs. task-oriented) on the perceived quality. In one condition the participants were asked to perform specific tasks (task-oriented mode), and in the other to do “whatever they want to do with the device”. It was shown that the explorative test setting resulted in better ratings than the task-oriented one.

Keywords: Multimodal Interaction, Evaluation, Usability, User Experience.

1 Introduction

Nowadays usability testing is more or less obligatory when presenting new interface techniques or interaction paradigms. The dependent variables of these studies are typically the factors described in the widespread ISO 9241 standard: effectiveness, efficiency and satisfaction. In a meta-analysis by [1] reviewing current practice in usability evaluation, all studies measured at least one of these factors. According to [1], effectiveness and efficiency are most frequently measured via error rate and task completion time, respectively, data often labeled as “objective”. To assess such data it is obvious that, at least to some extent, predefined tasks are necessary.

Although these “objective” measures might be in line with the concept of usability, they are not sufficient for assessing another key concept in HCI, namely User eXperience (UX).

With the attention of the HCI community shifting from usability to UX, the view on how to humanize technology has widened [2]. As described in [2], the usability perspective implies that technology usage is primarily motivated by accomplishing tasks as efficiently and effectively as possible, in order to gain time for the truly pleasurable activities not related to technology. Hassenzahl [2] questions this position and argues that humans use technology for its own sake, as technology usage can be a source of a positive, enjoyable experience. Thus technology usage is linked to different goals: do goals (e.g., buying tickets at a web shop) and be goals (e.g., being competent). Do goals are linked to pragmatic qualities and thus to a system's usability; be goals are associated with the non-instrumental aspects referred to as hedonic qualities.


Since experience can only be subjective, the term UX sets the focus on “the subjective side of product use” [2]. Thus the so-called “objective” parameters used for measuring usability might not be meaningful for UX evaluation.

2 Related Work

So far, several evaluation biases associated with task-oriented usability testing have been documented. Evidence for the high importance of the tasks is provided by [3]: the same website was tested by nine usability expert teams, and the results of the evaluation were hardly affected by the number of participants, but rather by the number of tasks. With more tasks, more usability problems were discovered. It is concluded that it is preferable to give a large variety of different sets of tasks to a small number of users rather than presenting a small set of tasks to many users.

Cordes [4] pointed out that typically only tasks that are supported by the product are selected; domain-relevant tasks the system is not capable of are not presented. This is a rather unnatural situation, since discovering a product's capabilities is an essential aspect of being confronted with a new device [4]. Cordes [4] showed that if users are told that the given tasks may not be solvable, they tend to terminate tasks earlier and terminate more tasks than users receiving exactly the same instruction except for the hint that the tasks may not be solvable. Thus, it is likely that success rates in traditional usability tests are higher than in natural settings. Another crucial issue is the wrong perspective when selecting tasks with respect to the product's capabilities: the product is evaluated not according to users' needs but according to its functionalities.

A direct comparison between task-oriented and explorative instructions is provided by [5] and [6]. In the first study [5], the participants either received the task to find specific information on a website or were instructed to just have fun with the website. Retrospective judgments of overall appeal and of pragmatic and hedonic qualities were assessed. It was shown that with a task-oriented instruction, the website's usability, i.e., its ability to support the given tasks, had a stronger influence on the judgments of experienced overall appeal than for the explorative group; for the explorative group, a correlation between usability and appeal was not observed. In the second study [6], the participants interacted with a story-telling platform. Again, they were either given the task to find some specific information or asked to interact freely with the system. Besides the retrospective measures of the first study, mental effort and affect were measured throughout the usage of the system. Additionally, experienced spontaneity was assessed after the interaction. It was shown that with the task-oriented instruction, spontaneity was related to perceived effort, negative affect and reduced appeal. In the explorative group, spontaneity was linked to positive affect and led to higher appeal.

Based on these results, the authors concluded that different instructions trigger different usage modes and evaluation criteria: depending on the usage mode (task or non-task), the importance of usability for a system's appeal differs, whereas hedonic qualities are equally important in both modes. However, the authors also point out that generalization of their results is difficult and more research using different systems is necessary [6].


So far, only unimodal, non-mobile systems have been investigated. When considering multimodal applications, modality preference and availability should also influence user experience. If, in a task-oriented setting, all tasks have to be performed, then for tasks where the preferred modality is not offered the user will have to switch to a less liked modality, which may result in a negative experience. If no tasks are given, the user is likely to stick with the most preferred modality. Thus we formed the following hypotheses: mental workload should be lower with an explorative instruction, since touch (the more familiar modality) is expected to be used more often than, e.g., speech. The experienced identification with the system should be higher, since the usage of the system is determined not by the experimenter but by the user's decision. According to [5, 6], overall appeal should be determined by pragmatic qualities for the task-oriented group.

3 Method

3.1 Participants

30 German-speaking individuals (15 male, 15 female, mean age 28 years) took part in the study. All of them were paid for their participation. The majority (70%) were familiar with touch input; voice control was considerably less well known (30%).

3.2 Material

The tested application is called the mobile multimodal information cockpit and offers the functionality of a remote control, a mobile TV and video player, video-on-demand services and games. The application was implemented on an ultra-mobile personal computer, the Samsung Q1 Ultra (cf. Fig. 1). The tested system is controllable via a graphical user interface with touch screen and via speech input. The output is given via the graphical user interface and audio feedback. For some tasks only one of the modalities was available.

To assess ratings for hedonic and pragmatic qualities, the AttrakDiff questionnaire [7] was employed. The AttrakDiff consists of four scales measuring hedonic as well as pragmatic attributes. The scale Hedonic Quality-Stimulation measures the extent to which a product can provide stimulation. The scale Hedonic Quality-Identification (HQ-I) measures a product's ability to express the owner's self. The scale Pragmatic Quality (PQ) covers a product's functionality and the access to that functionality and thus more or less matches the traditional concept of usability. Additionally, the perceived global quality is measured with the scale Attractiveness (ATT). The entire questionnaire comprises 28 items on a 7-point semantic differential. Furthermore, the SEA-scale [9], which is the German version of the Subjective Mental Effort Questionnaire (SMEQ, also known as the Rating Scale Mental Effort) [8], was employed as a measure of perceived mental effort. The SEA-scale is a one-dimensional measure with a range between 0 and 220. Along this range, seven verbal anchors (from hardly effortful to extremely effortful) are given.


Fig. 1. Tested application

3.3 Procedure

The experiment consisted of two blocks: one task-oriented and one explorative. Half of the participants started with the task-oriented block followed by the explorative block; for the other half the order was reversed (within-subject design). The participants were either instructed to perform 16 given tasks (e.g., logging in to the system, switching the channel, searching for a certain movie, TV show or actor, increasing and decreasing the volume, playing the quiz, switching between the different categories) or to use the next 15 minutes to do whatever they wanted to do with the device. The duration was set to 15 minutes since pretests showed that this was the average time needed to accomplish all tasks.

In both test blocks the participants were free to choose the input modality; it was possible at any time to switch or combine modalities. In order to rate the previously tested condition, the SEA-scale [9] and the AttrakDiff [7] had to be filled in after each test block. To analyze which modality was used by the participants, the modality used to perform each interaction step was logged. From these logs, the percentages of modality usage were computed.
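A minimal sketch of this last step, assuming the log is simply a list of per-step modality labels (a representation we assume for illustration, not a detail reported in the paper):

```python
# Derive modality-usage percentages from an interaction log (illustration only).
from collections import Counter

def modality_percentages(step_log):
    """step_log: list of modality labels, e.g. ["touch", "speech", "touch"]."""
    counts = Counter(step_log)
    total = sum(counts.values())
    return {modality: 100.0 * count / total for modality, count in counts.items()}
```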

4 Results

4.1 Subjective Data

SEA-Scale: No differences could be observed for perceived mental effort.

AttrakDiff: The AttrakDiff (cf. Figure 2) showed differences on the scale Attractiveness (Wilcoxon Z = 2.20, p = .013) and on the scale Hedonic Quality-Identification (Wilcoxon Z = 1.89, p = .029).

In contrast to the results reported in [5, 6], high correlations between pragmatic qualities and overall attractiveness could be observed in both blocks (Pearson's r_exp = .796, p < .01; Pearson's r_task = .817, p < .01). However, as expected, the correlation was higher in the task-oriented block.
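For illustration, the kind of within-subject comparison and per-block correlation analysis reported here could be computed as in the sketch below (SciPy); the variable names and data layout are our assumptions, not the authors' analysis scripts.

```python
# Sketch of the reported analyses (illustration only): a paired Wilcoxon test of
# the usage-mode effect and per-block correlations between Pragmatic Quality and
# Attractiveness. Each argument is a list with one value per participant.
from scipy.stats import wilcoxon, pearsonr

def analyze_blocks(att_task, att_explore, pq_task, pq_explore):
    _, p_att = wilcoxon(att_task, att_explore)      # usage-mode effect on ATT
    r_task, _ = pearsonr(pq_task, att_task)         # PQ vs. ATT, task-oriented block
    r_exp, _ = pearsonr(pq_explore, att_explore)    # PQ vs. ATT, explorative block
    return {"wilcoxon_p_attractiveness": p_att,
            "r_task": r_task, "r_explorative": r_exp}
```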


Fig. 2. Ratings on AttrakDiff scales and subscales by usage mode

4.2 Performance Data

Modality Preferences: As expected, with the explorative instruction touch was more frequently used than with the task-oriented instruction (M_exp = 89.41, SD_exp = 13.22, M_task = 83.48, SD_task = 16.33, Wilcoxon Z = 1.802, p = .029). However, in both conditions touch was by far the dominant modality and speech was rarely used.

An exploratory analysis of the data from the task-oriented block showed that speech was used for search tasks only. The familiarity of the input modality had an influence in the task-oriented block: participants with prior experience used speech marginally more often (M_with = 17.71, SD_with = 9.24, M_without = 15.22, SD_without = 18.71, Mann-Whitney U = 63.5, p = .081). For the explorative block, no such effect was shown.

5 Discussion

The results show that task-oriented instructions reduce the experienced identification with the system as well as the perceived overall attractiveness. However, in contradiction to [5, 6], and therefore contrary to our hypothesis, pragmatic qualities were strongly related to overall attractiveness in both usage modes; this relation was only slightly stronger for the task-oriented group. Moreover, mental workload was not higher with the task-oriented instruction but the same in both modes. An explanation might be the kind of tasks: in the studies by [5, 6], knowledge acquisition tasks were given, whereas in our study only specific actions (e.g., switching the channel) had to be performed. It is plausible that searching for information and keeping it active in working memory is mentally more demanding than just performing some given actions that do not require memorizing much information.

In addition, touch was the modality most often chosen by the participants in both the task-oriented and the explorative test block. Thus, the chosen modality might have a stronger influence on both the mental workload and the attractiveness of an application than the usage mode. This is in line with previous research on modality usage [e.g., 10]. It also shows the importance of the task itself, since the users' choice of modality has been found to be task dependent [e.g., 11].

Thus, in a future study, different task types, with a variety of tasks including more complex ones that also provoke the usage of different modalities, will be compared to an explorative setting.

References

1. Hornbæk, K., Law, E.L.: Meta-analysis of correlations among usability measures. In: Proceedings of CHI 2007, pp. 617–626. ACM Press, New York (2007)

2. Hassenzahl, M.: User Experience (UX): Towards an experiential perspective on product quality. In: Proceedings of the 20th French-Speaking Conference on Human-Computer Interaction, pp. 11–15 (2008)

3. Lindgaard, G., Chattratichart, J.: Usability Testing: What Have We Overlooked? In: Proc. CHI 2007, pp. 1415–1424. ACM Press, New York (2007)

4. Cordes, E.R.: Task-selection bias: a case for user-defined tasks. International Journal of Human–Computer Interaction 13, 411–420 (2002)

5. Hassenzahl, M., Kekez, R., Burmester, M.: The importance of a software's pragmatic quality depends on usage modes. In: Luczak, H., Cakir, A.E., Cakir, G. (eds.) Proceedings of the 6th International Conference on Work With Display Units (WWDU), pp. 275–276. ERGONOMIC, Berlin (2002)

6. Hassenzahl, M., Ullrich, D.: To do or not to do: Differences in user experience and retrospective judgments depending on the presence or absence of instrumental goals. Interacting with Computers 19, 429–437 (2007)

7. Hassenzahl, M., Burmester, M., Koller, F.: AttrakDiff: Ein Fragebogen zur Messung wahrgenommener hedonischer und pragmatischer Qualität [A questionnaire for measuring perceived hedonic and pragmatic quality]. In: Ziegler, J., Szwillus, G. (eds.) Mensch & Computer 2003: Interaktion in Bewegung, pp. 187–196. B.G. Teubner, Stuttgart (2003)

8. Zijlstra, F.R.H.: Efficiency in work behavior. A design approach for modern tools. PhD thesis, Delft University of Technology. Delft University Press, Delft (1993)

9. Eilers, K., Nachreiner, F., Hänecke, K.: Entwicklung und Überprüfung einer Skala zur Erfassung subjektiv erlebter Anstrengung [Development and evaluation of a scale to assess subjectively perceived effort]. Zeitschrift für Arbeitswissenschaft 40, 215–224 (1986)

10. Naumann, A.B., Wechsung, I., Möller, S.: Factors Influencing Modality Choice in Multimodal Applications. In: André, E., Dybkjær, L., Minker, W., Neumann, H., Pieraccini, R., Weber, M. (eds.) PIT 2008. LNCS (LNAI), vol. 5078, pp. 37–43. Springer, Heidelberg (2008)

11. Wechsung, I., Hurtienne, J., Naumann, A.: Multimodale Interaktion: Intuitiv, robust, bevorzugt und altersgerecht? [Multimodal interaction: intuitive, robust, preferred and age-appropriate?]. In: Wandke, H., Struwe, D., Kain, S. (eds.) Mensch und Computer 2009: 9. fachübergreifende Konferenz für interaktive und kooperative Medien - Grenzenlos frei, pp. 213–222. Oldenbourg, Berlin (2009)


Sightseeing Guidance Systems Based on WFST-Based Dialogue Manager

Teruhisa Misu, Chiori Hori, Kiyonori Ohtake, Etsuo Mizukami, Akihiro Kobayashi, Kentaro Kayama, Tetsuya Fujii, Hideki Kashioka, Hisashi Kawai, and Satoshi Nakamura

MASTAR Project, NICT, Kyoto, Japan
http://mastar.jp/index-e.html

Abstract. We are developing spoken dialogue systems that help users through spontaneous interactions in the sightseeing guidance domain. The systems are constructed on our framework of a weighted finite-state transducer (WFST) based dialogue manager. The demos are our prototype spoken dialogue systems for Kyoto tourist information assistance.

1 WFST-Based Framework for Spoken Dialogue Systems

We have proposed an expandable and portable dialogue scenario description and a platform to manage dialogue systems using weighted finite-state transducers (WFSTs) [1]. The role of dialogue management is to transduce the user's utterance (the ASR result) into a system action. The framework performs this process using WFSTs for spoken language understanding (SLU) and state transition (the dialogue scenario). Since many types of methodologies used for SLU and management (handcrafted rules, statistical n-grams, etc.) can be transformed into WFSTs, different components can easily be integrated and work on a common platform. These WFSTs can be combined and optimized through standard WFST operations, which contributes to efficient beam search in the decoding process [2].
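As a toy illustration of the idea (not the authors' implementation), a scenario transducer can be written as weighted transitions from dialogue states over SLU symbols to system actions, with decoding picking the cheapest matching path; all state, symbol, and action names below are invented for the example.

```python
# Toy WFST-style dialogue step (illustration only): states encode the scenario,
# transitions map understood user inputs to system actions, and weights (costs)
# let the decoder prefer likely paths.

# transitions[state] = list of (input_symbol, output_action, weight, next_state)
TRANSITIONS = {
    "start": [("request_spot_info", "present_spot_summary", 0.2, "spot"),
              ("greet",             "greet_back",           0.1, "start")],
    "spot":  [("ask_fee",           "tell_fee",             0.3, "spot"),
              ("ask_access",        "tell_access",          0.3, "spot"),
              ("request_spot_info", "present_spot_summary", 0.5, "spot")],
}

def best_action(state, slu_hypotheses):
    """slu_hypotheses: list of (input_symbol, slu_cost) pairs from the SLU stage.
    Return the (total_cost, system_action, next_state) minimizing the combined
    SLU and scenario cost, or None if no transition matches."""
    best = None
    for symbol, slu_cost in slu_hypotheses:
        for in_sym, action, weight, next_state in TRANSITIONS.get(state, []):
            if in_sym == symbol:
                cost = slu_cost + weight
                if best is None or cost < best[0]:
                    best = (cost, action, next_state)
    return best
```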

2 Tourist Guide Dialogue System

The demo systems are prototype applications for the sightseeing guidance domain of Kyoto City. We implemented two systems: one on a mobile phone (Fig. 1) and the other on a PC with a large display and several active cameras (Fig. 2).

The systems use an SLU WFST and a scenario WFST that are trained using the statistics of our human-to-human dialogue corpus [3]. They are complemented by handcrafted rules, which are made based on the corpus. Both systems can explain more than 700 sightseeing spots in Kyoto. They can also explain each spot in terms of 15 viewpoints that are used to plan sightseeing activities (“cherry blossoms”, “Japanese garden”, etc.) as well as basic information (“access”, “fee”, etc.). Based on this information, users can narrow down and evaluate the spots.


Fig. 1. Mobile phone system
Fig. 2. Large display system

The system with the large display and cameras can detect non-verbal information, such as changes in gaze and face direction and head gestures of the user during the dialogue, and proactively recommend information that the user would be interested in [4].

3 System Performance

We evaluated the performance of the mobile phone system with 100 subjects. Subjects were asked to assume they were visiting Kyoto and to find a sightseeing spot they were interested in. In total, 390 dialogue sessions with 3,670 utterances were collected, with an average sentence error rate of 70.7%. The system was able to answer 64.7% of the users' requests correctly.

The results were not satisfactory, but the utterances that the system was not able to handle can easily be covered by adding appropriate paths to the SLU and scenario WFSTs. This aspect is considered an advantage of our framework.

References

1. Hori, C., Ohtake, K., Misu, T., Kashioka, H., Nakamura, S.: Dialog Management using Weighted Finite-state Transducers. In: Proc. Interspeech, pp. 211–214 (2008)

2. Hori, C., Ohtake, K., Misu, T., Kashioka, H., Nakamura, S.: Statistical dialog management applied to WFST-based dialog systems. In: Proc. ICASSP, pp. 4793–4796 (2009)

3. Ohtake, K., Misu, T., Hori, C., Kashioka, H., Nakamura, S.: Annotating Dialogue Acts to Construct Dialogue Systems for Consulting. In: Proc. The 7th Workshop on Asian Language Resources, pp. 32–39 (2009)

4. Kayama, K., Kobayashi, A., Mizukami, E., Misu, T., Kashioka, H., Kawai, H., Nakamura, S.: Spoken Dialog System on Plasma Display Panel Estimating User's Interest by Image Processing. In: Proc. 1st International Workshop on Human-Centric Interfaces for Ambient Intelligence, HCIAmi (2010)


Spoken Dialogue System Based on Information Extraction from Web Text

Koichiro Yoshino and Tatsuya Kawahara

School of Informatics, Kyoto University, Sakyo-ku, Kyoto 606-8501, Japan

We present a novel spoken dialogue system which uses up-to-date information on the web. It is based on information extraction, which is defined by predicate-argument (P-A) structures and realized by shallow parsing. Based on this information structure, the dialogue system can perform question answering as well as proactive information presentation using the dialogue context and a topic model.

To be a useful and interactive system, the system should not only reply to the user's requests but also make proactive information presentations. Our proposed scheme realizes this function with an information extraction technique that generates only useful information. Which information structures are useful depends on the domain. Conventionally, the templates for information extraction were hand-crafted, but this heuristic process is so costly that it cannot be applied to a wide variety of domains on the web. Therefore, we introduce a filtering method for the predicate-argument (P-A) structures generated by the parser, which can automatically define the domain-dependent useful information structures.

This scheme is applied to the domain of baseball news, and we design a dialogue system which can reply to the user's questions as well as make proactive information presentations according to the dialogue history and a topic model. The system can be viewed as a smart interactive news reader.

The architecture of the dialogue system is depicted in Figure 1. First, information extraction is conducted by parsing web texts in advance. A user's query is also parsed to extract the same information structure, and the system matches the extracted information against the web information. If the system finds some information which completely matches the user's query, it makes a response using the corresponding web text. When the system cannot find exact information, it searches for information which matches partially. For example, when the user asks “Did Ichiro hit?” and the system cannot find the exact information “[Ichiro (agent) hit]”, it may find “[Lopez (agent) hit]”, which is partially matched and most relevant. This information is used to generate a response similar to the one the user would expect.
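A minimal sketch of this matching step, assuming P-A structures are represented as a predicate plus a role-to-argument mapping (a representation chosen only for illustration, not taken from the paper):

```python
# Exact vs. partial matching of predicate-argument structures (illustration only).
# A P-A structure is represented as (predicate, {role: argument}).

def match_pa(query_pa, extracted):
    """Return (exact_matches, partial_matches) of a parsed user query against
    the P-A structures extracted from web text."""
    pred_q, args_q = query_pa
    exact, partial = [], []
    for pred, args in extracted:
        if pred != pred_q:
            continue
        if all(args.get(role) == value for role, value in args_q.items()):
            exact.append((pred, args))
        else:
            partial.append((pred, args))  # same predicate, different arguments
    return exact, partial

# e.g. the query ("hit", {"agent": "Ichiro"}) may fall back to the partial match
# ("hit", {"agent": "Lopez"}) when no exact match exists.
```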

In the conventional RDB-based dialogue scheme, the system can hardly make a relevant response if it finds no matching entries, and thus usually replies “There are no matching entries”. In the conventional question answering scheme, the same situation often occurs. Occasionally, a close match may be found by statistical matching, but the found answer may not be relevant to the user's query. In the proposed scheme, we guarantee that the answer at least partially matches the user's query in terms of the information structure.


[Figure 1 shows the system architecture: on the back end, web text from a news site is processed by information extraction into P-A structures stored as extracted information; the user's query is parsed into a P-A structure and handled by the dialogue manager, which performs question answering and, using a topic model, proactive presentation, returning answers and presentations to the user.]

Fig. 1. System architecture

We show an example dialogue between a user and the system in Figure 2. The system responds in the question-answering mode when the user's question has a predicate and an argument. If the system cannot find exact information, it makes a proactive presentation by choosing information that has the same predicate but different arguments.

User: Please tell me about today's Tigers.
Sys: The Tigers won the game 3-2 against the Giants.
User: Who was the winning pitcher?
Sys: Nohmi was the winning pitcher.
User: Did Kubota come in to pitch?
Sys: Kubota didn't, but Fujikawa came in to pitch the 9th inning.
(long pause)
Sys: Do you know Fujikawa's ERA?
User: I don't know.
Sys: 0.00.
User: Wow.

Fig. 2. Dialogue example


Author Index

Abushariah, Mohammad A.M. 1
Ainon, Raja N. 1
Al-Qatab, Bassam A. 1
Alqudah, Assal A.M. 1
Araki, Masahiro 144

Becker, Tilman 132

Chandramohan, Senthilkumar 13
Choi, Junhwi 85

Dohsaka, Kohji 48

Engelbrecht, Klaus-Peter 150

Fujii, Tetsuya 194
Funakoshi, Kotaro 182
Funakura, Yu 144

Gerl, Franz 25, 36
Griol, David 96

Herbig, Tobias 25, 36
Herzog, Gerd 132
Higashinaka, Ryuichiro 48
Hofmann, Hansjorg 156
Hori, Chiori 61, 169, 194

Ianotto, Michel 110
Isotani, Ryosuke 156

Jokinen, Kristiina 163

Kashioka, Hideki 73, 169, 176, 194
Kawahara, Tatsuya 196
Kawai, Hisashi 61, 73, 156, 169, 176, 194
Kayama, Kentaro 73, 194
Kim, Kyungduk 85
Kimura, Naoto 61
Kobayashi, Akihiro 73, 194

Lee, Cheongjae 85
Lee, Donghyeon 85
Lee, Gary Geunbae 85
Lopez-Cozar, Ramon 96

Meguro, Toyomi 48
Minami, Yasuhiro 48
Minker, Wolfgang 25, 36, 122, 156
Misu, Teruhisa 61, 73, 169, 194
Mizukami, Etsuo 73, 176, 194
Möller, Sebastian 150, 188

Nakamura, Satoshi 61, 73, 156, 169, 176, 194
Nakano, Mikio 182
Narimatsu, Hiromi 182
Naumann, Anja 188

Ohtake, Kiyonori 61, 169, 194

Pietquin, Olivier 13, 110
Polzehl, Tim 122

Quesada, Jose F. 96

Reithinger, Norbert 132
Rossignol, Stephane 110

Sakti, Sakriani 156
Schmitt, Alexander 122
Sonntag, Daniel 132

Wechsung, Ina 188

Yoshino, Koichiro 196

Zainuddin, Roziati 1