Proc. 2nd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA 94), Kyoto, Japan, 26-27 September 1994

INTERACTIVE SPEECH AND LANGUAGE SYSTEMS FOR TELECOMMUNICATIONS APPLICATIONS AT NYNEX

Hong C. Leung and Judith R. Spitz

Speech Recognition & Language Understanding Services Laboratory NYNEX Science & Technology

White Plains, New York 10604 USA

ABSTRACT

As information plays an ever-increasing role in our lives, users are demanding more in terms of their capability to retrieve and manipulate information. This paper is concerned with NYNEX's development of interactive speech and language systems for the provision of automated information services. While our long-term goal is to develop total system solutions that interact with users and assist them in searching for and retrieving information, our progress is made in such a way that each component technology can be deployed by itself in various telecommunications applications. We will discuss some of our findings, drawing from our own experience in technology development, lessons learned from our service trials, and benefits derived from others in the research community. We believe that advanced speech and language technologies can be quite acceptable to users, as long as a graceful and friendly human-computer interface is also provided.

INTRODUCTION

As information plays an ever-increasing role in our lives, users demand better ways to retrieve and manipulate information. Users are often in different environments, such as offices, homes, and phone booths, or in a hands-busy, eyes-busy setting such as an automobile. They may want to access centralized information databases covering travel, entertainment, shopping, geography, weather, or finance. As a result, it is essential that we develop systems that let users remotely access and manipulate information.

Traditionally, human representatives are employed to assist users in querying a database over the telephone network. A representative typically engages in an interactive dialog with the user, asking for clarification when necessary. As information services proliferate, the workforce of representatives must grow as well. From a business perspective, it is very expensive to maintain and train this workforce. As a result, different automation techniques have been introduced to reduce costs.

Touch-tone input and pre-recorded speech have received relatively wide usage in the past few years. However, it has also become clear that a menu-driven touch-tone interface is too cumbersome for complex database queries. Pre-recorded speech also becomes prohibitively expensive for very large databases, and perhaps infeasible for rapidly changing ones.

Natural spoken language is highly desirable for information retrieval. It is natural in that it requires no special training. It is flexible, allowing users to query and retrieve information in a hands-busy, eyes-busy environment. It is also efficient, in that most people can speak faster than they can type on a keyboard or a touch-tone pad. It is economical, since speech can already be transmitted and received efficiently in telecommunications applications. Finally, it is ubiquitous, since it requires no special equipment: every telephone can become a computer terminal for both input and output.

In this paper, we provide an overview of NYNEX's effort in this exciting technological area, which is fast becoming a practical reality. We draw on our own experience in technology development, on lessons learned from field trials, and on benefits derived from others in the research community.

TECHNOLOGY PLATFORM AND ISSUES

Figure 1 shows the major technologies under development at NYNEX Science & Technology. A caller's speech is first processed by the speech recognition component, which converts the speech waveform into word hypotheses. Unless the speech input is in the form of isolated words, the meaning of the spoken input must be interpreted and usually translated into a language that a database management system can understand. The retrieved information must then be translated back into words before it can be delivered to the caller. If the found entry contains any idiosyncrasies, it must also be text-normalized. A text-to-speech device then converts the normalized text string into a speech waveform. If security is an issue, the speaker verification component confirms the identity of the speaker before any of the information processing is activated.


Figure 1: Speech and language technologies being developed at NYNEX.

In this paper, we focus on our effort in speaker-independent speech recognition, language understanding, language generation, and text-to-speech synthesis. NYNEX's speaker-dependent technology has already been deployed in VoiceDialing, which is described in more detail in other papers in this workshop [16,17]. Our effort in speaker verification is also described in this workshop [12].

Speech Recognition

Our speaker-independent speech recognition system adopts a hybrid approach that takes advantage of stochastic segment modeling and hidden Markov modeling (HMM) [9,11]. We have investigated several classification techniques and have found that, for our applications, multi-layer perceptrons (MLP) produce the highest accuracy and are computationally efficient [3,10]. Our word-spotting capability also allows users to embed target words in long phrases or sentences.
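
To make the classification step concrete, the sketch below implements a one-hidden-layer MLP that maps a single frame of acoustic features to posterior probabilities over phonetic classes. It is only a minimal illustration of the MLP idea; the dimensions, class inventory, and all names are hypothetical, and our actual models and training procedure are not shown.

```python
import numpy as np

class FramePhoneticMLP:
    """One-hidden-layer MLP mapping an acoustic feature frame to
    posteriors over phonetic classes (illustrative sketch only)."""

    def __init__(self, n_features=39, n_hidden=128, n_classes=40, seed=0):
        rng = np.random.default_rng(seed)
        # Small random weights; a real system would train these with
        # backpropagation on labeled speech frames.
        self.W1 = rng.normal(0, 0.1, (n_features, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, (n_hidden, n_classes))
        self.b2 = np.zeros(n_classes)

    def posteriors(self, frame):
        """Return P(class | frame) for one feature vector."""
        h = np.tanh(frame @ self.W1 + self.b1)   # hidden layer
        logits = h @ self.W2 + self.b2
        e = np.exp(logits - logits.max())        # numerically stable softmax
        return e / e.sum()

# Usage: classify one hypothetical 39-dimensional feature frame.
mlp = FramePhoneticMLP()
frame = np.random.default_rng(1).normal(0, 1, 39)
probs = mlp.posteriors(frame)
print(probs.argmax(), float(probs.max()))
```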

While our objective in speech recognition is to recognize a user's spoken input automatically, the overall system must be designed to handle the dialog gracefully even when the speech recognition system cannot handle the spoken input. This happens when the user does not speak any of the target words or phrases, or when the recognizer is not confident of its own output. Since speech recognition accuracy still falls short of human performance, we feel that deployment of speech recognition technology, even in a restricted domain, must be accompanied by the support of human representatives. Figure 2 shows a three-way communication between a user, a potentially available human representative, and a speech technology system. The speech system first prompts the user for spoken input. If the recognizer is confident, it can then query a database or continue the dialog with the user without intervention or participation by the human representative. In the event that the recognizer is not confident enough, it can either reprompt the user or solicit assistance by bidding for the first available representative. Once a representative becomes available, the user's previously spoken input can be played back to the representative, who can then continue the dialog with the user.
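
The routing logic just described can be summarized in code. The sketch below is one reading of the Figure 2 flow, assuming a hypothetical confidence threshold and stub interfaces (record_reply, query_database, hand_off_to_rep) that stand in for the platform services.

```python
import random

CONFIDENCE_THRESHOLD = 0.8   # hypothetical operating point
MAX_REPROMPTS = 1

def recognize(utterance):
    """Stub recognizer: returns (word hypothesis, confidence score)."""
    return "boston", random.random()

def handle_call(record_reply, query_database, hand_off_to_rep):
    """Route one caller request: automate if the recognizer is
    confident, reprompt once otherwise, then bid for a human
    representative (the three-way flow of Figure 2)."""
    utterance = None
    for _ in range(1 + MAX_REPROMPTS):
        utterance = record_reply()               # prompt caller, record reply
        hypothesis, confidence = recognize(utterance)
        if confidence >= CONFIDENCE_THRESHOLD:
            return query_database(hypothesis)    # fully automated path
    # Still not confident: replay the last recorded input to the first
    # available representative, who takes over the dialog.
    return hand_off_to_rep(utterance)

# Usage with trivial stand-ins for the platform services:
result = handle_call(lambda: b"...audio...",
                     lambda word: f"listing for {word}",
                     lambda audio: "handled by representative")
print(result)
```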

Figure 2: A three-way communication between a user, a human representative, and a speech system.

The automation rate, the work time of the representatives, and the holding time of callers can be affected by a large number of factors. Some of the major factors include vocabulary size, robustness of the recognizer, word-spotting capability, and the user's cooperation. We have found that some of these factors are interdependent; changing one factor often influences others. For example, we have found that increasing the vocabulary size may also increase the false acceptance rate of our word-spotting recognizer.

Speech Detection and Time Compression

Even when the speech system decides to bid for a human representative, we can still reduce the representative's work time using speech detection and time compression. Usually there is a delay between the beginning of recording and the time the user begins to speak. If the user hesitates, there may also be pauses embedded in the user's phrase or sentence. In order to reduce the work time of the human representative, these pauses are removed before the utterances are played to the representative. However, our experience indicates that playing any noise to the representatives can be irritating. Thus we have designed our speech detector to remove silence, background noise, and other noises from the telephone network.
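
As an illustration of the basic idea, the sketch below removes leading delay and embedded pauses with a simple frame-energy gate. An energy threshold of this kind is only a crude stand-in for a real speech detector, which must also reject background noise and telephone network noises; the parameter values are hypothetical.

```python
import numpy as np

def trim_pauses(samples, rate=8000, frame_ms=20, threshold_db=-35.0):
    """Drop leading delay and embedded pauses by keeping only frames
    whose energy exceeds a fixed threshold (illustrative sketch)."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    kept = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy_db = 10 * np.log10(np.mean(frame ** 2) + 1e-12)
        if energy_db > threshold_db:     # keep only speech-like frames
            kept.append(frame)
    return np.concatenate(kept) if kept else samples[:0]

# Usage: one second of low-level noise followed by a louder "speech" burst.
rate = 8000
audio = np.concatenate([np.random.normal(0, 0.001, rate),
                        np.random.normal(0, 0.2, rate)])
print(len(audio), len(trim_pauses(audio, rate)))
```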

Once speech is detected, time compression is then applied, taking advantage of the redundancy in the speech signal. While this procedure can reduce the duration of an utterance, the intelligibility of the compressed signal must be maintained. We have performed perceptual experiments on some of the utterances collected from real directory assistance customers, examining how the intelligibility of the speech signal is affected by time compression. Figure 3 shows typical word transcription accuracy as a function of the time reduction of utterances. We have found that transcription accuracy stays relatively unchanged for relatively small time reductions. However, further reduction in utterance duration can quickly degrade the intelligibility of the speech signal. In order to ensure that representatives do not become fatigued after repeatedly listening to time-compressed speech, the aggressiveness of this procedure must be readily modifiable.
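
We do not detail our time-compression procedure here; one simple way to exploit the redundancy in the speech signal is to drop a fraction of short frames and cross-fade at the joins, as in the illustrative sketch below. The frame sizes and keep ratio are arbitrary choices, not our deployed settings.

```python
import numpy as np

def time_compress(samples, keep_ratio=0.8, frame_len=240, overlap=60):
    """Shorten speech by periodically dropping frames and cross-fading
    at the joins (a simple overlap-add scheme, for illustration only)."""
    hop = frame_len - overlap
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len, hop)]
    if not frames:
        return samples.copy()
    drop_every = max(2, round(1.0 / (1.0 - keep_ratio)))  # e.g. every 5th
    fade = np.linspace(0.0, 1.0, overlap)
    out = list(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        if i % drop_every == 0:
            continue                     # drop this frame entirely
        tail = np.asarray(out[-overlap:])
        # Cross-fade the retained frame onto the current output tail.
        out[-overlap:] = (1 - fade) * tail + fade * frame[:overlap]
        out.extend(frame[overlap:])
    return np.array(out)

# Compressing one second of 8 kHz audio by roughly 20%.
audio = np.random.normal(0, 0.1, 8000)
short = time_compress(audio, keep_ratio=0.8)
print(len(audio), len(short))
```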


Figure 3: Typical transcription accuracy as a function of the duration reduction due to time compression.

Barge In

During an interactive dialog between a user and the speech system, the system frequently queries the caller for information. The user may choose to begin talking, or "barge in," while the prompt is being played by the speech system, or while speech is being produced by the text-to-speech device. We have found that barge-in happens more frequently with experienced users, or when the prompt or synthesized speech is long. This poses a problem for the speech recognition system, which needs to separate the caller's speech from the prompt.

We have developed an echo cancellation procedure, BIVEC, to allow users to barge into prompts [8]. We have found that BIVEC is quite effective: it can suppress the echoed speech twice as well as a least-mean-square (LMS) algorithm, a commonly used echo cancellation procedure. When BIVEC is evaluated with our speech recognition system, the resulting recognition error rate is about half of that obtained with LMS.
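
BIVEC itself is not described in detail here, but the LMS baseline against which it is compared can be sketched compactly. The code below implements a normalized-LMS adaptive filter that models the echo path from the outgoing prompt and subtracts the predicted echo from the microphone signal; the tap count and step size are illustrative.

```python
import numpy as np

def nlms_echo_cancel(prompt, mic, n_taps=64, mu=0.5, eps=1e-8):
    """Normalized-LMS echo canceller: adaptively estimate the echo path
    from the outgoing prompt and subtract the predicted echo from the
    microphone signal, leaving the caller's barge-in speech.
    (Sketch of the LMS baseline only; BIVEC is not reproduced here.)"""
    w = np.zeros(n_taps)                  # echo-path filter estimate
    out = np.zeros(len(mic))
    for n in range(n_taps, len(mic)):
        x = prompt[n - n_taps:n][::-1]    # most recent prompt samples
        echo_hat = w @ x                  # predicted echo
        e = mic[n] - echo_hat             # residual = caller speech
        w += mu * e * x / (x @ x + eps)   # NLMS weight update
        out[n] = e
    return out

# Usage: the mic picks up a delayed, attenuated echo of the prompt.
rng = np.random.default_rng(1)
prompt = rng.normal(0, 1, 8000)
echo = 0.5 * np.concatenate([np.zeros(10), prompt[:-10]])
cleaned = nlms_echo_cancel(prompt, echo)
print(float(np.mean(echo[4000:] ** 2)), float(np.mean(cleaned[4000:] ** 2)))
```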

User Interface

The success of a speech recognition system is inevitably affected by the cooperation of the user. While user compliance can vary greatly among applications, we have found that prompt wording often plays a significant role in eliciting only the target words.

We have performed several experiments to examine users' behavior in response to different prompts. Figure 4 shows users' compliance in speaking one of the target words as a function of the wording of four different prompts. In this application, we have found that the wording of the prompt can change compliance from 15% to 68% [15]. While this difference is significant, we have found similar results across a variety of telecommunications applications.

Figure 4: Percentage of all responses that were isolated target words (the desired response) as a function of prompt wording, for an operator prompt and four automated prompts.

Several speech service trials have been conducted over the past few years at NYNEX. We have observed that users' behavior has been changing slowly, becoming somewhat more compliant with our speech systems. Table 1 shows users' behavior in two of our directory assistance trials, conducted in 1989 and 1992. Callers' responses have been divided into five major categories: user rejection or hang-up, isolated target words, target words embedded among excess verbiage, speech with no target words, and no speech at all. For the purpose of this discussion, the absolute figures are not important. Instead, the differences in the figures from 1989 to 1992 indicate changes in users' behavior. We can see that the statistics for only two of the categories have changed: users are now more willing to talk to an automated system. Ten percent fewer users were found speechless when they heard the automated prompt, and 10% more users spoke the target words embedded in phrases. We speculate that this change is primarily due to the proliferation of voice messaging systems; users are becoming more comfortable with talking to machines.

Table 1: Comparison of user compliance in the 1989 and 1992 service trials, across five response categories: user rejection, isolated target words, embedded target words, speech with no targets, and no speech.

Speech Databases

Over the past few years, we have developed several large speech databases. The NTIMIT database [5] was collected by transmitting the TIMIT database [4] over the telephone network. The original TIMIT utterances were transmitted through a large number of central offices, which were geographically distributed to sample different telephone network conditions. Half of the database was sent over local telephone paths, while the other half was transmitted over long-distance lines. Transmission involved the use of a commercial device to simulate the acoustic characteristics between a human's mouth and a handset. Calibration signals were transmitted across each path in order to readily evaluate such network characteristics as attenuation, frequency response, and harmonic distortion. NTIMIT is now available through the Linguistic Data Consortium (LDC).

More recently, NYNEX and the LDC have been developing PHONEBOOK, a phonetically rich, isolated-word, telephone-speech database. It is a large database of American English word utterances incorporating all phonemes in as many phonemic/stress contexts as are likely to produce coarticulatory variations. Our goal is to collect approximately 75,000 utterances totaling an estimated 15 hours of speech. Approximately 12,000 triphones and 9,000 syllable-part sequences are expected. This database is still under active development, and it will also become available through the LDC.

In addition, we have been developing an automated information/transaction services research platform, Money Talks. The platform is designed so that it can readily be connected into a real-user, telephone-based, service-providing environment. The speech data and statistics that it collects can enable us to determine the general public's acceptance of interactive speech-driven services. It will also help us to select the speech/language technology that yields the best price/performance ratio. Currently, we are collaborating with a major financial institution in New York.

Language Understanding

While speech recognition can usually automate a request if the spoken input is simple, compliant, and highly constrained, natural language understanding is often necessary to interpret the meaning of a continuous sentence or phrase. For example, a target city name, "Boston," can be found in "Um, in Boston please" by performing word-spotting. However, in a less compliant input such as "city is Braintree, can I have the phone number for Bank of Boston," where both "Braintree" and "Boston" are target city names, word-spotting would not be able to determine the locality of the listing.

We have been collaborating with the MIT Spoken Language Systems Group on adapting their natural language understanding capability for telecommunications applications. Specifically, we have ported TINA [13], a natural language system, from MIT, and are developing new grammar rules for parsing sentences in telecommunications applications. Figure 5 shows a parse tree for an actual utterance spoken by a real caller. Each node in the parse tree corresponds to a semantic/syntactic category. We can see that the system is able to determine that "Braintree" is a city name, and that "Boston" is in fact part of a bank's name.
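
For illustration only, the toy interpreter below shows how even two context-sensitive patterns, rather than bare word-spotting, can resolve the example above. It is not TINA's formalism or grammar; the patterns and names are hypothetical.

```python
import re

# Toy semantic rules, loosely in the spirit of a directory-assistance
# grammar: each pattern uses the surrounding words as context.
CITY_PATTERN = re.compile(r"\bcity is (\w+)")
LISTING_PATTERN = re.compile(r"\bphone number for ([\w ]+)$")

def interpret(utterance):
    """Extract the city and listing name using local grammatical
    context, which plain word-spotting on city names would miss."""
    city = CITY_PATTERN.search(utterance)
    listing = LISTING_PATTERN.search(utterance)
    return {"city": city.group(1) if city else None,
            "listing": listing.group(1) if listing else None}

query = "city is braintree can i have the phone number for bank of boston"
print(interpret(query))
# {'city': 'braintree', 'listing': 'bank of boston'}
# "boston" is correctly treated as part of the bank's name, not a city.
```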

Most spoken language systems to date have been developed under the assumption that the user can interact with the system through speech and a visual display. As a result, some of the system's capabilities in understanding, response generation, and dialogue management may not be directly applicable to telecommunications applications. For example, consider the query: "Show me restaurants near Times Square." While a system with a visual display can show all the appropriate restaurants, telling a user over the telephone may take an excessive amount of time. Thus, future work must include customization for traditional telephone applications so as to help users be more specific. Perhaps an appropriate system response to the user's query is: "What kind of restaurants near Times Square would you like?"

Language Generation

Typically, information in today's large databases is encoded in an efficient machine-readable form, and was originally intended for visual display to human representatives. Large databases that evolve over many years are generally updated and modified by several different groups of people. This results in idiosyncratic variation in the data. Since the human representatives who provide the information are well trained, they can translate the information into normal text, and are very tolerant of variations in the specificity and precision of the data. Furthermore, they can usually deal with inconsistency in the data format, which results from having many representatives type in the information over many years.

Figure 5: Parse tree for the utterance "city is braintree can i have the phone number for bank of boston," used to determine the city in a directory assistance query.

As a result, before information in a database can be presented to a user by text-to-speech devices, the data itself must be automatically normalized into well-formed words, the form that text-to-speech devices require. We have found three major problems that a text normalizer must be able to handle. First, any ad hoc abbreviations, truncations, and typos must be disambiguated. Second, since some of these ambiguities are context-sensitive, the normalizer must be able to delimit meaningful fields in the input text. Finally, acronyms must be detected, especially in a database that has no case information.

The name and address preprocessor (NAP) developed at NYNEX can delimit name and address fields in the input text [6,7]. Abbreviations can then be expanded based on the field they are in. For example, "ST" can be expanded into "street" in the address field, and into "Saint" or "station" in a name field. Some of the abbreviations are relatively common, such as "ST," "DR," "MR," "JR," and "SR," while others are more ad hoc, such as "INSTRUMNTN" for "instrumentation," "COLCTN" for "collection," and "NUR" for "nursery" or "nursing," depending on the context. By applying a set of rules and efficient table searches, NAP can signal a text-to-speech device to spell out an abbreviation, such as "IRS," or pronounce a name, such as "NYNEX."
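
A minimal sketch of field-dependent expansion in the spirit of NAP, using tiny illustrative tables rather than NAP's actual rule sets and table searches:

```python
# Illustrative expansion tables only; a real normalizer uses much
# larger rule sets and context-dependent disambiguation.
EXPANSIONS = {
    "name":    {"ST": "Saint", "JR": "Junior", "SR": "Senior",
                "MR": "Mister", "DR": "Doctor"},
    "address": {"ST": "Street", "DR": "Drive", "AVE": "Avenue"},
}
SPELL_OUT = {"IRS"}    # acronyms a synthesizer should spell letter by letter

def normalize(tokens, field):
    """Expand abbreviations according to the field they occur in,
    and flag acronyms to be spelled out rather than pronounced."""
    out = []
    for tok in tokens:
        if tok in SPELL_OUT:
            out.append(" ".join(tok))              # "IRS" -> "I R S"
        else:
            out.append(EXPANSIONS[field].get(tok, tok.lower()))
    return " ".join(out)

print(normalize(["ST", "JOHNS", "CHURCH"], "name"))   # Saint johns church
print(normalize(["42", "MAIN", "ST"], "address"))     # 42 main Street
```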

NAP has been evaluated on a customer name and address database of about 22 million entries. It correctly normalizes over 90% of the entries. The system was originally written in NAWK, and it responds fairly quickly on a low-end workstation. Currently, the system is being rewritten for higher efficiency, better stability, and greater portability.

Text-to-Speech Synthesis

Pre-recorded or pre-digitized speech is increasingly popular. It is used primarily in applications where the vocabularies are small and known a priori. However, if the vocabulary size is large, or if the vocabulary needs to be updated frequently, it would be prohibitive to pre-record the entire vocabulary, or to maintain the same voice as the vocabulary changes.

While text-to-speech synthesis has made great strides in the past few decades, the intelligibility and quality of synthesized speech still fall far short of human speech. Therefore, before text-to-speech synthesis is deployed for any application, the technology must be carefully assessed.

With the exception of prosody processing [14], NYNEX has been relying on commercially available text-to-speech devices. An Automated Customer Name and Address (ACNA) service trial was conducted to evaluate the feasibility of automating reverse directory assistance using text-to-speech synthesis [1,2]. In a reverse directory service, callers provide a telephone number and in return are given the associated listing, or name and address information. The ACNA trial automated this process, using five high-end text-to-speech synthesis devices to deliver the listing information. Approximately 5000 real-user queries were processed. Data such as requests for repetition, spelling, and survey responses were collected.

User perception and acceptance of the automated service were evaluated. We found that in this limited domain, callers scored the automated service fairly well with respect to its comprehensibility in providing the listing information, though the scores fell short of those for the operator-handled service (88% vs. 60%).¹ Our operator-handled results indicated that users appreciate additional insights or information that transcend the database listing, supporting our advocacy of the three-way communication shown in Figure 2.

SUMMARY

In summary, we have described some of the major technologies being developed at NYNEX Science & Technology. While our speech and language technologies are developed for use in general telecommunications applications, we have also performed several service trials to ensure that the developed technologies are accepted by users. In order to maximize users' satisfaction, we believe that deployment of speech and language technologies should, in the near future, be accompanied by some back-up support from human representatives.

ACKNOWLEDGMENTS

Much of the work described in this paper was done in collaboration with present and past members of the NYNEX Speech Services Technology Group. Their contributions are gratefully acknowledged. We would also like to thank the MIT Spoken Language Systems Group for their input and collaboration.

¹ Even operators were not scored 100% by callers.


REFERENCES

[1] Basson, S., Yashchin, D., Kalyanswamy, A., and Silverman, K., "Comparing Synthesizers for Name and Address Provision: Field Trial Results," Proc. Eurospeech 93, Berlin, Germany, 1993.

[2] Basson, S., Yashchin, D., Silverman, K., and Kalyanswamy, A., "Assessing the Acceptability of Automated Customer Name and Address," Proc. AVIOS, 1991.

[3] Chigier, B., "Phonetic Classification on Wide-Band and Telephone Quality Speech," Fifth DARPA Workshop on Speech and Natural Language, Arden House, February 1992.

[4] Fisher, W., Doddington, G., and Goudie-Marshall, K., "The DARPA Speech Recognition Database: Specifications and Status," Proc. DARPA Workshop on Speech Recognition, February 1986.

[5] Jankowski, C., Kalyanswamy, A., Basson, S., and Spitz, J., "NTIMIT: A Phonetically Balanced, Continuous Speech, Telephone Bandwidth Speech Database," Proc. ICASSP-90, Albuquerque, NM, 1990.

[6] Kalyanswamy, A., Silverman, K., Basson, S., and Yashchin, D., "Preparing Text for a Synthesizer in a Telecommunications Application," First IEEE Workshop on Interactive Voice Technology for Telecommunications Applications, Piscataway, New Jersey, 1992.

[7] Kalyanswamy, A., and Silverman, K., "Say What? - Problems in Preprocessing Names and Addresses for Text-to-Speech Conversion," Proc. AVIOS, 1991.

[8] Kim, J., and Leung, H., "Barge-in via Echo Cancellation," Fifth U.S. Telco Speech Research Workshop, St. Louis, 1991.

[9] Leung, H., Hetherington, I., and Zue, V., "Speech Recognition Using Stochastic Segment Neural Networks," Proc. ICASSP, San Francisco, 1992.

[10] Leung, H., Chigier, B., and Glass, J., "A Comparative Study of Signal Representations and Classification Techniques for Speech Recognition," Proc. ICASSP, Minneapolis, Minnesota, 1993.

[11] Leung, H., Hetherington, I., and Zue, V., "Speech Recognition Using Stochastic Explicit-Segment Modeling," Proc. European Conference on Speech Communication and Technology, Italy, 1991.

[12] Naik, J., "Field Trial of Speaker Verification Service for Caller Identity Verification in the Telephone Network," Proc. Second IEEE Workshop on Interactive Voice Technology for Telecommunications Applications, Kyoto, Japan, 1994.

[13] Seneff, S., "TINA: A Natural Language System for Spoken Language Applications," Computational Linguistics, Vol. 18, No. 1, March 1992.

[14] Silverman, K., Kalyanswamy, A., Silverman, J., Basson, S., and Yashchin, D., "Synthesizer Intelligibility in the Context of a Name-and-Address Information Service," Proc. Eurospeech 93, Berlin, Germany, 1993.

[15] Spitz, J., and the Artificial Intelligence Speech Technology Group, "Collection and Analysis of Data from Real Users: Implications for Speech Recognition/Understanding Systems," Proc. Fourth DARPA Workshop on Speech and Natural Language, Pacific Grove, CA, February 1991.

[16] Vysotsky, G., "VoiceDialing - The First Speech Recognition Based Telephone Service Delivered to Customer's Home," Proc. Second IEEE Workshop on Interactive Voice Technology for Telecommunications Applications, Kyoto, Japan, 1994.

[17] Zreik, L., "A Field Deployed System for Performance Assessment of a Speech Recognition Based Telephone Service," Proc. Second IEEE Workshop on Interactive Voice Technology for Telecommunications Applications, Kyoto, Japan, 1994.
