
Amy Neustein
Editor

Advances in Speech Recognition

Mobile Environments, Call Centers and Clinics

Foreword by Judith Markowitz and Bill Scholz

Page 4: Advances in Speech Recognition - link.springer.com978-1-4419-5951-5/1.pdf · When Amy asked me to co-author the foreword to her new book on advances in speech recognition, I was honored

Editor
Amy Neustein
Linguistic Technology Systems
Fort Lee, New Jersey, USA
[email protected]

ISBN 978-1-4419-5950-8
e-ISBN 978-1-4419-5951-5
DOI 10.1007/978-1-4419-5951-5
Springer New York Dordrecht Heidelberg London

Library of Congress Control Number: 2010935485

© Springer Science+Business Media, LLC 2010

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Foreword

Two Top Industry Leaders Speak Out

Judith Markowitz

When Amy asked me to co-author the foreword to her new book on advances in speech recognition, I was honored. Amy’s work has always been infused with creative intensity, so I knew the book would be as interesting for established speech professionals as for readers new to the speech-processing industry.

The fact that I would be writing the foreword with Bill Scholz made the job even more enjoyable. Bill and I have known each other since he was at UNISYS directing projects that had a profound impact on speech-recognition tools and applications.

Bill Scholz

Preparing this foreword with Judith gives me a rare opportunity to collaborate with a seasoned speech professional in identifying the numerous significant contributions to the field offered by the contributors whom Amy has recruited.

Judith and I have had our eyes opened by the ideas and analyses offered by this collection of authors. Speech recognition no longer needs to be relegated to the category of an experimental future technology; it is here today, with sufficient capability to address the most challenging of tasks. And the point-click-type approach to GUI control is no longer sufficient, especially given the limitations of modern-day handheld devices. Instead, VUI and GUI are being integrated into unified multimodal solutions that are maturing into the fundamental paradigm for future computer-human interaction.

Judith Markowitz

Amy divided her book into three parts, but the subject of the first part, mobility, is a theme that flows through the entire book – evidence of the extent to which mobility permeates our lives. For example, Matt Yuschik’s opening chapter in the Call Centers section, which makes up the second part of the book, considers the role of multimodality in supporting mobile devices.

Accurate and usable mobile speech has been a goal that many of us have had for a long time. When I worked for truck manufacturer Navistar International in the 1980s, we wanted to enable drivers to perform maintenance checks on-the-fly by issuing verbal commands to a device embedded in the truck. At that time, a deployment like that was a dream. Chapters in all three sections of this book reveal the extent to which that dream has been realized – and not just for mobile phones. For example, James Rodger and James George’s chapter in the Clinics section examines end-user acceptance of a handheld, voice-activated device for preventive healthcare.

Bill Scholz

The growing availability of sophisticated mobile devices has stimulated a significant paradigm shift resulting from a combination of sophisticated speech capability with limited graphic input and display capability. The need for a paradigm shift is exacerbated by the increased frequency with which applications formerly constrained to desktop computers migrate onto mobile devices, only to frustrate users accustomed to click-and-type input and extensive screen real estate. Bill Meisel’s introductory chapter brings this issue into clear focus, and Mike Phillips’ team offers candidate solutions in which auditory and visual cues are augmented by tactile and haptic feedback to yield multimodal interfaces that overcome many mobile device limitations.

In response to the demand for more accurate speech input on mobile devices, Mike Cohen’s team from Google has enhanced every step of the recognition process, from text normalization and acoustic model development through language model training using billions of words. Sophisticated endpointing permits removal of press-to-talk keys and, in collaboration with enhanced multimodal dialog design, provides a comfortable conversational interface that is a natural extension of traditional Web access.

Sid-Ahmed Selouani summarizes efforts of the European community to enhance both the input phase of speech recognition through techniques such as line spectral frequency analysis, and the use of an AI markup language to facilitate interpretation of recognizer output.

The chapters in the Call Centers section describe an array of technologies. Matt Yuschik shows us data justifying the importance of multimodality in contact centers to facilitate caller–agent communication. The combination of objective and subjective measures identified by Roberto Pieraccini’s team provides metrics for contact center evaluation that dramatically reflect the communication performance enhancements resulting from the increased emphasis on multimodal dialog.


Judith Markowitz

Emphasizing the importance of user expectations, Stephen Springer delves deeply into the subjective aspects of user interface design for call centers. This chapter is a tremendous resource for designers, whether they are working with speech for the first time or are seasoned developers.

Good design is important but problem dialogs can occur even when callers interact with well-designed speech systems or human agents. Unlike many emotion-detection systems, the tool that Alexander Schmitt and his co-authors have constructed for detecting anger and frustration is not limited to acoustic indicators; it also analyzes words, phrases, the dialog as a whole, and prior emotional states.

While Alexander Schmitt and his co-authors focus on resolving problem dialogs for individual callers, Marsal Gavalda and Jeff Schlueter address problems that occur at the macro level. They describe a phonetics-based speech-analytics system capable of indexing more than 30,000 hours of a contact center’s audio and audio-visual data in a single day and then mining the index for business intelligence.

I was pleased to see a section on speech in clinical settings. John Shagoury crafted a fine examination of medical dictation that shows why speech recognition has become an established and widely accepted method for generating medical reports.

Most treatments of speech recognition in clinics rarely go much beyond its use for report generation. Consequently, I was happy to see chapters on a portable medical device and on the use of speech and language for diagnosis and treatment. Julia Hirschberg and her co-authors’ literature review demonstrates that not only are there acoustic and linguistic indicators of diseases as disparate as depression, diabetes, and cancer but also that some of those indicators can be used to measure the effectiveness of treatment regimens. Similarly, Hemant Patil’s classification of infant cries gives that population of patients a “voice” to communicate about what is wrong. If I had had such tools when I worked as a speech pathologist in the 1970s, I would have been able to do far more for the betterment of my patients.

Amy Neustein has compiled an excellent overview of speech for mobility, call centers, and clinics. Bravo!

Judith Markowitz, Ph.D., is president of J. Markowitz Consultants, and is recognized internationally as one of the top analysts in speech processing. For over 25 years, she has provided strategic and technical consulting to large and small organizations, and has been actively involved in the development of standards in biometrics and speech processing. In 2003, she was voted one of the top ten leaders in the speech-processing industry and, in 2006, she was elevated to IEEE Senior Member status. Among Dr. Markowitz’s many accomplishments, she served with distinction as technology editor of Speech Technology Magazine and chaired the VoiceXML Forum Speaker Biometrics Committee.


K.W. “Bill” Scholz, Ph.D., is the president of AVIOS, the speech industry’s oldest professional organization. He founded NewSpeech, LLC in 2006, following his long tenure at Unisys, where he served as Director of Engineering for Natural Language solutions. His long and distinguished career as a consultant for domestic and international organizations in architectural design, speech technology, knowledge-based systems, and integration strategies is focused on speech application development methodology, service creation environments, and technology assessment.

Preface

Advances in Speech Recognition: Mobile Environments, Call Centers and Clinics provides a forum for today’s speech technology industry leaders – drawn from private enterprises and academic institutions all over the world – to discuss the challenges, advances, and aspirations of voice technology.

The collection of essays contained in this volume represents the research findings of over 30 speech experts, including speech engineers, system designers, linguists, and IT (information technology) and MIS (management information systems) specialists. The book’s 14 chapters are divided into three sections – mobile environments, call centers, and clinics. But given the practical ubiquity of mobile devices, this three-part division sometimes seems almost irrelevant. For example, one of the chapters in the “call centers” section provides a vivid discussion of how to provide today’s call centers with multimodal capabilities – supporting text, graphics, voice, and touch – in self-service transactions, so that customers who contact the call center using their mobile phones (rather than a fixed line) can expect a sophisticated interface that lets them resolve their service issues using the full capabilities of their handsets. Similarly, call center agents using mobile devices that support multimodality can navigate and retrieve information more efficiently to complete a transaction for a caller. In the “clinics” section, for that matter, one of the chapters focuses on validating user satisfaction with a voice-activated medical tracking application, run on a compact mobile device for a “hands-free” method of data entry in a clinical setting.

In spite of this unavoidable overlap of sections, the authors’ earnest discussions of the manifold aspects of speech technology not only complement one another but also divide into several areas of specific interest. Each author brings to this roundtable his or her unique insights and new ideas, the fruits of much time spent formulating, developing, and testing out their theories about what kinds of voice applications work best in mobile settings, call centers, and clinics.

The book begins with an introduction to the role of speech technology in mobile applications, written by Bill Meisel, President of TMA Associates. Meisel is also editor of Speech Strategy News and co-chair (with AVIOS) of the annual Mobile Voice conference in northern California. He opens his discussion by quoting the predictions published by the financial investment giant Morgan Stanley in its Mobile Internet Report, issued near the end of 2009. Meisel shows that in Morgan Stanley’s 694-page report, Mobile Internet Computing was said to be “the technology driver of the next decade,” following the Desktop Internet Computing of the 1990s, the Personal Computing of the 1980s, the Mini-Computing of the 1970s and, finally, the Mainframe Computing of the 1960s. In his chapter, fittingly titled “Life on the Go – The Role of Speech Technology in Mobile Applications,” Meisel asserts that since “the mobile phone is becoming an indispensable personal communication assistant and multi-functional device… [such a] range of applications creates user interaction issues that can’t be fully solved by extending the Graphical User Interface and keyboard to these small devices.” “Speech recognition, text-to-speech synthesis, and other speech technologies,” Meisel continues, “are part of the solution, particularly since, unlike PCs, every mobile phone has a microphone and speech output.”

Advances in Speech Recognition – which is being published at the very beginning of this auspicious decade for mobile computing – examines the practical constraints of using voice in tandem with text. Following Meisel’s comprehensive overview of the role of speech technology in mobile applications, Scott Taylor, Vice President of Mobile Marketing and Solutions at Nuance Communications, Inc., offers a chapter titled “Striking a Healthy Balance – Speech Technology in the Mobile Ecosystem.” Here, Taylor cautions the reader about the need to “balance a variety of multimodal capabilities so as to optimally fit the user’s needs at any given time.” While there is “no doubt that speech technologies will continue to evolve and provide a richer user experience,” argues Taylor, it is critical for experts to remember that “the key to success of these technologies will be thoughtful integration of these core technologies into mobile device platforms and operating systems, to enable creative and consistent use of these technologies within mobile applications.” This is why speech developers, including Taylor himself, view speech capabilities on mobile devices not as a single entity but rather as part of an entire mobile ecosystem that must strive to maintain homeostasis so that consumers (as well as carriers and manufacturers) will get the best service from a given mobile application.

To achieve that goal, Mike Phillips, Chief Technology Officer at Boston-based Vlingo, together with members of the company, has been at pains to design more effective and satisfying multimodal interfaces for mobile devices. In the chapter following Taylor’s, titled “Why Tap When You Can Talk – Designing Multimodal Interfaces for Mobile Devices that Are Effective, Adaptive and Satisfying to the User,” Phillips and his co-authors present findings from over 600 usability tests, in addition to results from large-scale commercial deployments, to augment their discussion of the opportunities and challenges presented in the mobile environment. Phillips and his co-writers stress how important it is to strive for user satisfaction: “It is becoming clear that as mobile devices become more capable, the user interface is the last remaining barrier to the scope of applications and services that can be made available to the users of these devices. It is equally clear that speech has an important role to play in removing these user interface barriers.”

Johan Schalkwyk, Senior Staff Engineer at Google, along with some of his colleagues, provides the book’s next chapter, aptly titled “Your Word is my Command – Google Search by Voice: A Case Study.” In this chapter, Schalkwyk and his co-authors illuminate the technology employed by Google “to make search by voice a reality” – and follow this with a fascinating exploration of the user interface side of the problem, which includes detailed descriptions and analyses of the specifically tailored user studies that have been based on Google’s deployed applications.

In painstaking detail, Schalkwyk and his colleagues demystify the complicated technology behind 800-GOOG-411 (an automated system that uses speech recognition and web search to help people find and call businesses); GMM (Google Maps for Mobile), which – unlike GOOG-411 – is a multimodal speech application (making use of graphics); and finally the Google Mobile application for the iPhone, which includes a search-by-voice feature. The coda to the chapter is its discussion of user studies based on analyses of live data, and how such studies reveal important facts about user behavior, facts that impact Google’s “decisions about the technology and user interfaces.” Here are the essential questions addressed in those user studies: “What are people actually looking for when they are mobile? What factors influence them to choose to search by voice or type? What factors contribute to user satisfaction? How do we maintain and grow our user base? How can speech make information access easier?”

The mobile environments section concludes with the presentation of a well-planned study on speech recognition in noisy mobile environments. Sid-Ahmed Selouani, Professor of Information Management at the Université de Moncton, Shippagan Campus, New Brunswick, Canada, in “Well Adjusted – Using Robust and Flexible Speech Recognition Capabilities in Clean to Noisy Mobile Environments,” presents study findings on a new speech-enabled framework that aims at providing a rich interactive experience for smartphone users – particularly in mobile environments that can benefit from hands-free and/or eyes-free operations.

Selouani introduces this framework by explaining that it divides the mapping between the speech acoustical microstructure and the implicit spoken macrostructure into two distinct levels: the signal level and the linguistic level. At the signal level, front-end processing aims to improve the performance of Distributed Speech Recognition (DSR) in noisy mobile environments.

The linguistic level, by contrast, “involves a dialogue scheme to overcome the limitations of current human-computer interactive applications that are mostly using constrained grammars.” “For this purpose,” says Selouani, “conversational intelligent agents capable of learning from their past dialogue experiences are used.”

In conducting this research on speech recognition in clean to noisy mobile environments, Selouani utilized the Carnegie Mellon PocketSphinx engine for speech recognition and the Artificial Intelligence Markup Language (AIML) for pattern matching. The evaluation results showed that including both the Genetic Algorithms (GA)-based front-end processing and the AIML-based conversational agents led to significant improvements in the effectiveness and performance of an interactive spoken dialog system in a mobile setting.
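To make the AIML pattern-matching idea concrete, consider the minimal sketch below, which uses the open-source PyAIML interpreter (installable as python-aiml). The category, file name, and reply text are hypothetical illustrations, not rules from Selouani’s system.

    # Illustrative only: a minimal AIML category loaded with the open-source
    # PyAIML interpreter ("pip install python-aiml"). The pattern and reply
    # below are hypothetical examples, not rules from Selouani's chapter.
    import aiml

    AIML_RULES = """<?xml version="1.0" encoding="UTF-8"?>
    <aiml version="1.0.1">
      <category>
        <pattern>CALL *</pattern>
        <template>Dialing <star/> now.</template>
      </category>
    </aiml>
    """

    with open("demo.aiml", "w") as f:
        f.write(AIML_RULES)

    kernel = aiml.Kernel()
    kernel.learn("demo.aiml")    # compile the category into the pattern matcher
    # Input is normalized before matching; the * wildcard captures the words
    # after CALL, and <star/> echoes them back in the template's reply.
    print(kernel.respond("Call my office"))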

Matthew Yuschik, Senior User Experience Specialist at Cincinnati-based Convergys Corporation, provides the perfect segue to the next section of Advances in Speech Recognition. In “It’s the Best of all Possible Worlds – Leveraging Multimodality To Improve Call Center Productivity,” Yuschik makes a convincing argument for equipping today’s call centers with multimodal capabilities in self-service transactions – to support text, graphics, voice, and touch – so that customers who contact the call center using their mobile phones (rather than a fixed line) can expect an interface that “provides multiple ways for the caller to search for resolution of their [service] issue.” Given market research predictions that there will be over 4 billion wireless subscribers in 2010, Yuschik draws the sound conclusion that more and more callers will be using their mobile devices when availing themselves of customer support services at customer care and contact centers. After all, most customers who need to resolve product and service issues, or to order new products and services, squeeze in their calls “on the go” instead of taking up crucial time while working at their desks.

In “It’s the Best of all Possible Worlds,” Yuschik explains how leveraging multimodality to improve call center productivity is achieved by striking a healthy balance between satisfying the caller’s goal and maximizing the agent’s productivity in the call center. He points out that “a multimodal interface can voice-enable all features of a GUI.” Yet, he cautions, “this is a technologically robust solution, but does not necessarily take into account the caller’s goal.” Conceding that “voice activating all parts of the underlying GUI of the application enables the agent to solve every problem by following the step-by-step sequence imposed by the GUI screens,” Yuschik states that “a more efficient approach…is to follow the way agents and callers carry on their dialog to reach the desired goal.” He shows that “this scenario-based (use-case) flow – with voice-activated tasks and subtasks – provides a streamlined approach in which an agent follows the caller-initiated dialog, using the MMUI [multimodal user interface] to enter data and control the existing GUI in any possible sequence of steps. This goal-focused view,” as explained by Yuschik, “enables callers to complete their transactions as fast as possible.”

Yuschik’s chapter details a set of Convergys trials that “follow a specific sequence where multimodal building-blocks are identified, investigated, and then combined into support tasks that handle call center transactions.” Crucial to those trials were the Convergys call center agents who “tested the Multimodal User Interface for ease of use, and efficiency in completing caller transactions.” The results of the Convergys trials showed that “multimodal transactions are faster to complete than only using a Graphical User Interface.” Yuschik concludes that “the overarching goal of a multimodal approach should be to create a framework that supports many solutions. Then,” he writes, “tasks within any specific transaction are leveraged across multiple applications.”

Every new technology deserves an accurate method of evaluating its performance and effectiveness; otherwise, the technology will not fully serve its intended purpose. David Suendermann, Principal Speech Scientist at the New York-based SpeechCycle, Inc., and his colleagues Roberto Pieraccini and Jackson Liscombe are joined by Keelan Evanini of Educational Testing Service in Princeton, New Jersey, for the presentation of an enlightening discussion of a new framework to measure accurately the performance of automated customer care contact centers.


In “‘How am I Doing?’ – A New Framework To Effectively Measure the Performance of Automated Customer Care Contact Centers,” the authors carefully dissect conventional methods of measuring how satisfied customers are with automated customer care and contact centers, pointing out why such methods can produce woefully misleading results. They point to a problem that is ever-present when evaluating callers’ satisfaction with any of these self-service contact centers. Namely: quantifying how effectively interactive voice response (IVR) systems satisfy callers’ goals and expectations “has historically proven to be a most difficult task.” Suendermann and his co-authors convincingly show that

[s]uch difficulties in assessing automated customer care contact centers can be traced to two assumptions [albeit misguided] made by most stakeholders in the call center industry:

1. Performance can be effectively measured by deriving statistics from call logs; and

2. The overall performance of an IVR can be expressed by a single numeric value.

The authors introduce an IVR assessment framework that confronts these misguided assumptions head on, demonstrating how they can be overcome. The authors show how their “new framework for measuring the performance of IVR-driven call centers incorporates objective and subjective measures.” Using the concepts of hidden and observable measures, the authors demonstrate how to produce metrics that are reliable and meaningful, so that they provide accurate design insights into multiple aspects of IVR performance in call centers.

Just as it is possible to jettison poor methods of evaluating caller satisfaction with IVR performance in favor of more accurate ones, it is equally possible to meet (or even exceed) user expectations with the design of a speech-only interface that builds on what users have come to expect from self-service delivery in general, whether at the neighborhood pharmacy or at the international airport. Stephen Springer, Senior Director of User Interface Design at Nuance Communications, Inc., shows how to do this in his chapter (aptly) titled “Great Expectations – Making Use of Callers’ Experiences from Everyday Life To Design a Satisfying Speech-Only Interface for the Call Center.” According to Springer, “the thoughtful use of user modeling achieved by employing ideas and concepts related to transparency, choice, and expert advice, all of which most, if not all, callers are already familiar with from their own everyday experiences” better meets the users’ expectations than systems whose workings are foreign to what such users encounter in day-to-day life.

Springer carefully examines a wide variety of expectations that callers bring to self-service phone calls, ranging from broad expectations about self-service in general to the more specific expectations of human-to-human conversation about consumer issues. As a specialist in user interface design, Springer recommends to the system designer several indispensable steps to produce more successful interaction between callers and speech interfaces. The irony is that the secrets for meeting greater expectations for caller satisfaction with speech-only interfaces in the call center are not really secrets: they can be found uncannily close to home, by extrapolating from callers’ everyday self-service experiences and from their quotidian dialog with human agents at customer care contact centers.

Next, two German academics and SpeechCycle’s CTO, Roberto Pieraccini, tackle the inscrutable and often elusive emotions of callers to ascertain when task completion and user satisfaction with the automated call center may be at risk. Alexander Schmitt of Ulm University and his two co-authors, in their chapter titled “‘For Heaven’s Sake, Gimme a Live Person!’ – Designing Emotion-Detection Customer Care Voice Applications in Automated Call Centers,” show how their voice application can robustly detect angry user turns by considering acoustic, linguistic, and interaction parameter-based information – all of which can be collected and exploited for anger detection. They introduce, in addition, a valuable subcomponent that is able to estimate the emotional state of the caller based on the caller’s previous emotional state, supporting the theory that anger displayed in calls to automated call centers, rather than being an isolated occurrence, is more likely an incremental build-up of emotion. Using a corpus of 1,911 calls from an Interactive Voice Response system, the authors demonstrate the various aspects of speech displayed by angry callers.
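As a rough illustration of how such heterogeneous evidence can be combined (the feature names and numbers below are invented placeholders, not the authors’ feature set or data), a single classifier can take acoustic, linguistic, and interaction features together with the caller’s previous emotional state in one feature vector:

    # Illustrative only: combining acoustic, linguistic, and interaction
    # features with the caller's previous emotional state. Feature names
    # and values are invented placeholders, not the authors' data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Columns: mean_pitch_hz, energy_db, swear_words, reprompts, prev_angry
    X = np.array([
        [180.0, 62.0, 0, 0, 0],   # calm turn
        [250.0, 73.0, 1, 2, 0],   # frustration emerging
        [265.0, 75.0, 2, 3, 1],   # anger building on prior anger
        [175.0, 61.0, 0, 1, 0],   # calm despite a reprompt
    ])
    y = np.array([0, 1, 1, 0])    # 1 = angry turn, 0 = non-angry turn

    clf = LogisticRegression().fit(X, y)
    # The prev_angry feature lets the model reflect the chapter's finding
    # that anger builds up incrementally rather than appearing in isolation.
    print(clf.predict([[255.0, 74.0, 1, 2, 1]]))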

The call center section of Advances in Speech Recognition is rounded off by a fascinating chapter on advanced speech analytic solutions aimed at learning why customers call help-line desks and how effectively they are served by the human agent. Yes, that is correct: a human agent, a specimen of call center technology that still exists notwithstanding the push for heavily automated self-service centers. In “The Truth Is Out There – Using Advanced Speech Analytics To Learn Why Customers Call Help-Line Desks and How Effectively They’re Being Served by the Call Center Agent,” Marsal Gavalda, Vice President of Incubation and Principal Language Scientist at Nexidia, and Jeff Schlueter (the company’s Vice President of Marketing & Business Development) describe their novel work in phonetic-based indexing and search, which is designed for extremely fast searching through vast amounts of media.

The authors of “The Truth is Out There” explain the nuts and bolts of their method, showing how they “search for words, phrases, jargon, slang and other terminology that are not readily found in a speech-to-text dictionary.” They demonstrate how “the most advanced phonetic-based speech analytics solutions,” such as theirs, “are those that are robust to noisy channel conditions and dialectal variations; those that can extract information beyond words and phrases; and those that do not require the creation or maintenance of lexicons or language models.” The authors assert that “such well performing speech analytic programs offer unprecedented levels of accuracy, scale, ease of deployment, and an overall effectiveness in the mining of live and recorded calls.” Given that speech analytics has become indispensable to understanding how to achieve a high rate of customer satisfaction and cost containment, Gavalda and his co-author demonstrate in their chapter how their data mining technology is used to produce sophisticated analyses and reports (including visualizations of call category trends and correlations or statistical metrics), while preserving “the ability at any time to drill down to individual calls and listen to the specific evidence that supports the particular categorization or data point in question, all of which allows for a deep and fact-based understanding of contact center dynamics.”

John Shagoury, Executive Vice President of the Healthcare & Imaging Division of Nuance Communications, Inc., opens Advances in Speech Recognition’s last section with a cogent discussion of “the benefits of incorporating speech recognition as part of the everyday clinical documentation workflow.” In his chapter – fittingly titled “Dr. Multi-Task – Using Speech To Build up Electronic Medical Records While Caring for Patients” – Shagoury shows how speech technology yields a significant improvement in the quality of patient care by increasing the speed of the medical documentation process, so that patients’ health records are quickly made available to healthcare providers. This means they can deliver timely and efficient medical care. Using some fascinating, and on point, real-world examples, Shagoury richly demonstrates how the use of speech recognition technology directly improves productivity in hospitals, significantly reduces costs, and enhances the physician’s ability to deliver optimal healthcare. But Shagoury does not stop there. He goes on to demonstrate that “beyond the core application of speech technologies to hospitals and primary care practitioners, speech recognition is a core tool within the diagnostics field of healthcare, with broad adoption levels within the radiology department.”

Next, James Rodger, Professor of Management Information Systems and Decision Sciences at Indiana University of Pennsylvania, Eberly College of Business and Information Technology – with his co-author, James A. George, senior consultant at Sam, Inc. – provides the reader with a rare inside look at the authors’ “decade long odyssey” in testing and validating end-user acceptance of speech in the clinical setting aboard US Navy ships. In their chapter, titled “Hands Free – Adapting the Task-Technology-Fit Model and Smart Data To Validate End-User Acceptance of the Voice Activated Medical Tracking Application (VAMTA) in the United States Military,” the authors show how their extensive work on validating user acceptance of VAMTA – which runs on a compact mobile device that enables a “hands-free” method of data entry in the clinical setting – was broken down into two phases: (1) a pilot to establish the validity of an instrument for obtaining user evaluations of VAMTA, and (2) an in-depth study to measure the adaptation of users to a voice-activated medical tracking system in preventive health care. For the latter phase, they adapted a task-technology-fit (TTF) model (from a smart data strategy) to VAMTA, demonstrating that “the perceptions of end-users can be measured and, furthermore, that an evaluation of the system from a conceptual viewpoint can be sufficiently documented.” In this chapter, they report on both the pilot and the in-depth study.

Rodger and his co-author applied the Statistical Package for the Social Sciences (SPSS) data analysis tool to the survey results from the in-depth study to determine whether TTF, along with individual characteristics, would have an impact on user evaluations of VAMTA. In conducting this in-depth study, the authors modified the original TTF model to allow adequate domain coverage of patient care applications. What is most interesting about their study – and perhaps a testament to the vision of those at the forefront of speech applications in the clinical setting – is that, according to Rodger and his co-author, their work “provides the underpinnings for a subsequent, higher level study of nationwide medical personnel.” In fact, they intend “follow-on studies [to] be conducted to investigate performance and user perceptions of VAMTA under actual medical field conditions.”

Julia Hirschberg and Noémie Elhadad, distinguished faculty members at Columbia University, in concert with Anna Hjalmarsson, a bright and talented Swedish graduate student studying at KTH (Royal Institute of Technology), make a strong argument that if “language cues” – primarily acoustic signal and lexical and semantic features – “can be identified and quantified automatically, this information can be used to support diagnosis and treatment of medical conditions in clinical settings [as well as] to further fundamental research in understanding cognition.” In “You’re As Sick As You Sound – Using Computational Approaches for Modeling Speaker State To Gauge Illness and Recovery,” Hirschberg and her co-authors perform an exhaustive medical literature review of studies “that explore the possibility of finding speech-based correlates of various medical conditions using automatic, computational methods.” Among the studies they review are computational approaches that explore communicative patterns of patients who suffer from medical conditions such as depression, autism spectrum disorders, schizophrenia, and cancer.

The authors see a ripe opportunity here for future medical applications. They point out that the emerging research into speaker state for medical diagnostic and treatment purposes – an outgrowth of “related work on computational modeling of emotional state” for studying callers’ interactions with call center agents and Interactive Voice Response (IVR) applications “for which there is interest in distinguishing angry and frustrated callers from the rest” – equips the physician with a whole new set of diagnostic and treatment tools. “Such tools can have economic and public health benefits, in that a wider population – particularly individuals who live far from major medical centers – can be efficiently screened for a broader spectrum of neurological disorders,” they write. “Fundamental research on mental disorders, like post-partum depression and post traumatic stress disorder, and coping mechanisms for patients with chronic conditions, like cancer and degenerative arthritis, can likewise benefit from computational models of speaker state.”

Hemant Patil, Assistant Professor at the Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT) in Gandhinagar, India, echoes the beliefs of Shagoury, of Rodger and George, and of Hirschberg, Hjalmarsson, and Elhadad, all of whom maintain that advances in speech technology have untold economic, social, and public health benefits. In “‘Cry Baby’ – Using Spectrographic Analysis To Assess Neonatal Health Status from an Infant’s Cry,” Patil demonstrates that the rich body of research on spectrographic analysis, predominantly used for speaker recognition, may also be used to assess a neonate’s health status by comparing a normal to an abnormal cry.

Spectrographic analysis is seen by Patil and his colleagues – who are just as passionately involved in this highly specialized area of infant cry research – as useful in improving and complementing “the clinical diagnostic skills of pediatricians and neonatologists, by helping them to detect early warning signs of pathology, developmental lags, and so forth.” Patil points out to the reader that such technology “is especially helpful in today’s healthcare environment, in which newborns do not have the luxury of being solely attended by one physician, and are, instead, monitored remotely by a centralized computer control system.”

In explaining cry analysis – a multidisciplinary area of research integrating pediatrics, neurology, physiology, engineering, developmental linguistics, and psychology – Patil demonstrates in “Cry Baby” his application of spectrographic analysis to the vocal sounds of an infant, comparing normal with abnormal infant crying. In his study, ten distinct cry modes – viz., hyperphonation, dysphonation, inhalation, double harmonic break, trailing, vibration, weak vibration, flat, rising, and falling – were identified for normal infant crying, and their respective spectrographic patterns were observed. This analysis was then extended to the abnormal infant cry. Patil observed that

the double harmonic break is more dominant for abnormal infant cry in cases of myalgia (muscular pain). The inhalation pattern is distinct for infants suffering from asthma or other respiratory ailments such as a cough or cold. For example, for the infant whose larynx is not well developed, the pitch harmonics are nearly absent. As such, there are no voicing or glottal vibrations in the cry signal. And for infants with Hypoxic Ischemic Encephalopathy (HIE), there is an initial tendency of pitch harmonics to rise and then to be followed by a blurring of such harmonics.
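For readers curious how such spectrographic patterns are produced in practice, the following minimal sketch computes a spectrogram in which pitch harmonics show up as horizontal bands. The file name and analysis parameters are placeholders, not Patil’s settings.

    # Illustrative sketch: compute and display a narrowband spectrogram of a
    # cry recording. "infant_cry.wav" is a placeholder file name; a mono WAV
    # is assumed, and the analysis parameters are not Patil's settings.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.io import wavfile
    from scipy.signal import spectrogram

    rate, samples = wavfile.read("infant_cry.wav")
    freqs, times, Sxx = spectrogram(samples, fs=rate, nperseg=1024, noverlap=768)

    plt.pcolormesh(times, freqs, 10 * np.log10(Sxx + 1e-12))  # power in dB
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.title("Cry spectrogram: pitch harmonics appear as horizontal bands")
    plt.show()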

As part of this study, Patil also performed infant cry analysis by observing the nature of the optimal warping path in the Dynamic Time Warping (DTW) algorithm, which is found to be “near diagonal” in healthy infants, in contrast to that in unhealthy infants, whose warping paths reveal significant deviations from the diagonal across most, though not all, cry modes.
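The intuition can be sketched in a few lines: compute the optimal DTW path between two contours and measure its mean deviation from the diagonal. This is an illustrative toy on synthetic stand-in data, not Patil’s implementation.

    # Minimal, illustrative DTW sketch (not Patil's code): find the optimal
    # warping path between two 1-D stand-in contours and measure how far the
    # path strays from the diagonal; "near diagonal" indicates similarity.
    import numpy as np

    def dtw_path(a, b):
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
        # Backtrack from (n, m); the inf border steers the path to (1, 1).
        i, j, path = n, m, []
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        return path[::-1]

    a = np.sin(np.linspace(0, 3, 50))        # stand-in "normal cry" contour
    b = np.sin(np.linspace(0, 3, 50) + 0.1)  # similar contour, slightly shifted
    path = dtw_path(a, b)
    deviation = np.mean([abs(i - j) for i, j in path])
    print(f"mean deviation from diagonal: {deviation:.2f}")  # small = near diagonal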

Looking further into the broader sociologic implications of cry analysis, Patil shows how this novel field of research can redress the social and economic inequities of healthcare delivery. “Motivated by a need to equalize the level of neonatal healthcare (not every neonate has the luxury of being monitored at a teaching hospital equipped with a high level neonatal intensive care unit), I propose for the next phase of research a quantifiable measurement of the added clinical advantage to the clinician (and ancillary healthcare workers) of a baseline comparison of normal versus abnormal cry.”

Now it is up to the reader, after assimilating the substance of this book, to envision how speech applications in mobile environments, call centers, and clinics will improve the lives of consumers, corporations, carriers, manufacturers, and healthcare providers – to say nothing of the overall improvements that such technology provides for the byzantine social architecture known as modern-day living.

Fort Lee, NJ Amy Neustein, Ph.D.

Acknowledgments

This book would not have been possible without the support and encouragement of Springer’s Editorial Director, Alex Greene; his editorial assistant, Ciara J. Vincent; the production editor, Joseph Quatela; and the project manager, Rajesh Harini, who in the final stages of production attended with much alacrity to each and every detail. Every writer/editor needs an editor, and I could not have asked for a more clear-thinking person than Alex Greene. Alex’s amazing vision helped shepherd this project from inception to fruition.

I remain grateful to Drs. Judith Markowitz and K.W. “Bill” Scholz, who contributed an illuminating foreword to this book, and to Dr. James Larson, whose fascinating look into the future provides a fitting coda to the volume. Of equal importance is Dr. Matthew Yuschik, Senior User Experience Specialist at Convergys Corporation, who generously offered to review all three sections of this work, a task that consumed a large portion of his weekends and evenings. I will never be able to sufficiently thank Matt for his astute and conscientious review.

Dr. William Meisel, President of TMA Associates in Tarzana, CA and Editor of Speech Strategy News, deserves a special acknowledgment. If there is one person who has his finger on the pulse of the speech industry, it is Bill Meisel. Bill’s clarity of thought helped me to see the overarching theme of mobile applications.

Finally, I’d like to acknowledge several of the “foot soldiers” – the principal authors who shouldered the burden of the project. Johan Schalkwyk, Google’s Senior Staff Engineer, deserves particular thanks for meeting his chapter submission deadline even though he had to work evenings and weekends to do it. Dr. David Suendermann, Principal Speech Scientist at SpeechCycle, Inc., sat dutifully at his desk during a major snowstorm in New York, answering a series of e-mails containing editing queries. Alexander Schmitt, Scientific Researcher at the Institute of Information Technology at Ulm University, worked tirelessly – and often late into the night – to answer my editing queries as quickly as possible, notwithstanding the six-hour time difference between New York and Germany. And in India, Dr. Hemant Patil, Assistant Professor at the Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT) in Gandhinagar, took on a difficult project (detecting neonatal abnormalities through spectrographic analysis of four different cry modes) as a solo author.


To Johan, David, Alex, Hemant, and to all the other stellar contributors to Advances in Speech Recognition, I offer my wholehearted thanks for your hard work and determination.

A. Neustein


Contents

Part I Mobile Environments

1 “Life on-the-Go”: The Role of Speech Technology in Mobile Applications .................... 3
William Meisel

2 “Striking a Healthy Balance”: Speech Technology in the Mobile Ecosystem .................... 19
Scott Taylor

3 “Why Tap When You Can Talk?”: Designing Multimodal Interfaces for Mobile Devices that Are Effective, Adaptive and Satisfying to the User .................... 31
Mike Phillips, John Nguyen, and Ali Mischke

4 “Your Word is my Command”: Google Search by Voice: A Case Study .................... 61
Johan Schalkwyk, Doug Beeferman, Françoise Beaufays, Bill Byrne, Ciprian Chelba, Mike Cohen, Maryam Kamvar, and Brian Strope

5 “Well Adjusted”: Using Robust and Flexible Speech Recognition Capabilities in Clean to Noisy Mobile Environments .................... 91
Sid-Ahmed Selouani

Part II Call Centers

6 “It’s the Best of All Possible Worlds”: Leveraging Multimodality to Improve Call Center Productivity .................... 115
Matthew Yuschik


7 “How am I Doing?”: A New Framework to Effectively Measure the Performance of Automated Customer Care Contact Centers .................... 155
David Suendermann, Jackson Liscombe, Roberto Pieraccini, and Keelan Evanini

8 “Great Expectations”: Making Use of Callers’ Experiences from Everyday Life to Design a Satisfying Speech-Only Interface for the Call Center .................... 181
Stephen Springer

9 “For Heaven’s Sake, Gimme a Live Person!”: Designing Emotion-Detection Customer Care Voice Applications in Automated Call Centers .................... 191
Alexander Schmitt, Roberto Pieraccini, and Tim Polzehl

10 “The Truth is Out There”: Using Advanced Speech Analytics to Learn Why Customers Call Help-line Desks and How Effectively They Are Being Served by the Call Center Agent .................... 221
Marsal Gavalda and Jeff Schlueter

Part III Clinics

11 Dr. “Multi-Task”: Using Speech to Build Up Electronic Medical Records While Caring for Patients .................... 247
John Shagoury

12 “Hands Free”: Adapting the Task-Technology-Fit Model and Smart Data to Validate End-User Acceptance of the Voice Activated Medical Tracking Application (VAMTA) in the United States Military .................... 275
James A. Rodger and James A. George

13 “You’re as Sick as You Sound”: Using Computational Approaches for Modeling Speaker State to Gauge Illness and Recovery .................... 305
Julia Hirschberg, Anna Hjalmarsson, and Noémie Elhadad

14 “Cry Baby”: Using Spectrographic Analysis to Assess Neonatal Health Status from an Infant’s Cry .................... 323
Hemant A. Patil

Epilog .................... 349

About the Author .................... 359

Index .................... 361

Contributors*

* The e-mail addresses are posted for the corresponding authors only.

Françoise Beaufays, Ph.D. Research Scientist, Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA

Doug Beeferman, Ph.D. Software Engineer, Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA

Bill Byrne, Ph.D. Senior Voice Interface Engineer, Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA

Ciprian Chelba, Ph.D. Research Scientist, Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA

Mike Cohen, Ph.D. Research Scientist, Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA

Noémie Elhadad, Ph.D. Assistant Professor, Department of Biomedical Informatics, Columbia University, 2960 Broadway, New York, NY 10027-6902, USA

Keelan Evanini, Ph.D. Associate Research Scientist, Educational Testing Service, Rosedale Road, Princeton, NJ 08541, USA

Marsal Gavalda, Ph.D. Vice President of Incubation and Principal Language Scientist, Nexidia, 3565 Piedmont Road, NE, Building Two, Suite 400, Atlanta, GA 30305, USA



James A. George Senior Consultant, Sam, Inc., 1700 Rockville Pike #400, Rockville, MD 20852, USA

Julia Hirschberg, Ph.D. Professor, Department of Computer Science, Columbia University, 2960 Broadway, New York, NY 10027-6902, USA [email protected]

Anna Hjalmarsson Graduate student, KTH (Royal Institute of Technology), Kungl Tekniska Högskolan, SE-100 44 STOCKHOLM, Sweden

Maryam Kamvar, Ph.D. Research Scientist, Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA

Jackson Liscombe, Ph.D. Speech Science Engineer, SpeechCycle, Inc., 26 Broadway, 11th Floor, New York, NY 10004, USA

William Meisel, Ph.D. Editor, Speech Strategy News, President, TMA Associates, P.O. Box 570308, Tarzana, California 91357-0308 [email protected]

Ali Mischke User Experience Manager, Vlingo, 17 Dunster Street, Cambridge, MA 02138-5008, USA

John Nguyen, Ph.D. Vice President, Product, Vlingo, 17 Dunster Street, Cambridge, MA 02138-5008, USA

Hemant A. Patil, Ph.D. Assistant Professor, Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar, Gujarat-382 007, India [email protected]

Mike Phillips Chief Technology Officer, Vlingo, 17 Dunster Street, Cambridge, MA 02138-5008, USA [email protected]


Roberto Pieraccini, Ph.D. Chief Technology Officer, SpeechCycle, Inc., 26 Broadway, 11th Floor, New York, NY 10004, USA

Tim Polzehl, MA Scientific Researcher, Quality and Usability Lab, Technische Universität Berlin, Deutsche Telekom Laboratories, Ernst-Reuter-Platz 7, 10587 Berlin, Germany, 030 835358555

James A. Rodger, Ph.D. Professor, Department of Management Information Systems and Decision Sciences, Indiana University of Pennsylvania, Eberly College of Business & Information Technology, 664 Pratt Drive, Indiana, PA 15705, USA [email protected]

Johan Schalkwyk, MSc Senior Staff Engineer, Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA [email protected]

Jeff Schlueter, MA Vice President of Marketing & Business Development, Nexidia, 3565 Piedmont Road, NE, Building Two, Suite 400, Atlanta, GA 30305, USA [email protected]

Alexander Schmitt, MS Scientific Researcher, Institute for Information Technology at Ulm University, Albert-Einstein-Allee 43, 89081 Ulm, Germany [email protected]

Sid-Ahmed Selouani, Ph.D. Professor, Information Management Department; Chair of LARIHS (Research Lab. in Human-System Interaction), Université de Moncton, Shippagan Campus, New Brunswick, Canada [email protected]

John Shagoury, MBA Executive Vice President of Healthcare & Imaging Division, Nuance Communications, Inc., 1 Wayside Road, Burlington, MA 01803, USA [email protected]

Stephen Springer Senior Director of User Interface Design, Nuance Communications, Inc., 1 Wayside Road, Burlington, MA 01803, USA [email protected]


Brian Strope, Ph.D. Research Scientist, Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA

David Suendermann, Ph.D. Principal Speech Scientist, SpeechCycle, Inc., 26 Broadway, 11th Floor, New York, NY 10004, USA [email protected]

Scott Taylor Vice President, Mobile Marketing and Solutions, Nuance Communications, Inc., 1 Wayside Road, Burlington, MA 01803, USA [email protected]

Matthew Yuschik, Ph.D. Senior User Experience Specialist (Multichannel Self Care Solutions), Relationship Technology Management, Convergys Corporation, 201 East Fourth Street, Cincinnati, Ohio 45202, USA [email protected]