



2nd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA94), Kyoto, Japan, September 26-27, 1994

A VOICE-ACTIVATED TELEPHONE EXCHANGE SYSTEM AND ITS FIELD TRIAL

Seiichi YAMAMOTO  Kazuya TAKEDA  Naomi INOUE  Shingo KUROIWA  Mwaki NAITOH  KDD R&D Laboratories

2-1-15 Ohara Kamifukuoka-shi, Saitama, 356 Japan

ABSTRACT

Speaker-independent speech recognition systems that can accept telephone-quality speech may open opportunities for introducing new user-friendly services over the public switched telephone network (PSTN). We are currently engaged in a project to introduce an automatic speech recognizer over the PSTN. We have developed a voice-activated telephone exchange system by combining a continuous speech recognizer and a private branch exchange system (PBX), and conducted field trials. The system has been installed in the R&D laboratories for daily use since June 1993, in order to investigate its performance in a real environment and to collect man-machine dialogues. More than 5,000 man-machine dialogues have been collected, and incorrect recognitions have been analyzed and classified into three categories: (1) incorrect detection of speech, (2) out-of-vocabulary responses, and (3) incorrect recognition due to inadequate hidden Markov models of speech and noise. We have improved system performance mainly by attacking issues (1) and (3). We have just developed a new version of the system, using the improved schemes obtained by analyzing the collected speech data. In order to collect more man-machine dialogues, we are planning to carry out a second-phase field trial in which the new system will be installed in our branch offices.

1. INTRODUCTION
Speaker-independent speech recognition systems that can accept telephone-quality speech may open opportunities for introducing new user-friendly services over the public switched telephone network (PSTN). Many research and development activities have been undertaken to make speaker-independent speech recognition systems available over the PSTN. There are, however, still many technical problems to be solved, such as how to compensate for disturbances caused by band limitation and the large variety of channel characteristics, in order to realize high-quality speech recognition over the PSTN. There are also problems in how to design human interfaces so that speech recognition systems can be developed in a user-friendly manner. In order to overcome these problems, a fairly wide variety of application trials using an automatic speech recognizer (ASR) over the PSTN have already been carried out [1], [2], [3], [4].

We are currently engaged in a project to introduce an ASR over the PSTN. At the beginning of the project, we developed an isolated word recognizer based on Dynamic Time Warping and tried to introduce it over the PSTN [5]. Its recognition rate is high in the laboratory, where only well-formed isolated words are used as input speech. It is not, however, favorable to users, because it is natural to speak continuously without pauses between words, and many people have difficulty in modifying this behavior. As a result, the system may degrade in the field, where users produce a significant amount of extraneous vocal output such as non-word vocalizations, stuttering and so on. This result suggests that we should develop a continuous speech recognizer (CSR) if we are to introduce an ASR over the PSTN successfully.

We believe that a large spontaneous speech database is essential to develop a speaker-independent CSR, and that operating a pilot system which works in real time and is truly useful would play a great role in collecting a large amount of spontaneous speech. Thus we have selected as one of our tasks a voice-activated telephone exchange system which can recognize various request sentences and connect the caller to a called person's branch telephone [7]. The system consists of a speaker-independent CSR based on hidden Markov models (HMMs), a speech synthesizer and a dialogue controller, and is connected to a private branch exchange system (PBX) via a telephone line interface module.

With this system, we have recently performed a field trial to investigate its feasibility and evaluate user reactions. We have also collected man-machine dialogues in which connections are requested from the system [6]. The system has been installed in the R&D laboratories for daily use since June 1993. More than 5,000 man-machine dialogues have been collected through various networks, such as a private network, the conventional public switched telephone network and so on.

Incorrect recognitions were analyzed and classified into three groups: (1) incorrect detection of speech, (2) out-of-vocabulary responses, and (3) incorrect recognition due to inadequate hidden Markov models of speech and noise even when the input speech is correctly detected and well formed. We have improved the system performance by improving the speech detection scheme and the HMMs of speech and noise.

In this paper, we first describe the prototype of the voice-activated telephone exchange system in section 2. The field trial and the analysis of the collected man-machine dialogues are then described in section 3. Improvements of system performance on the collected data are described in section 4. Future plans are discussed in section 5.

2. A PROTOTYPE SYSTEM
A prototype of the voice-activated telephone exchange system consists of a continuous speech recognizer, a speech synthesizer and a dialogue controller.




Figure 1. Hardware configuration of the voice-activated telephone exchange system

It can recognize various request sentences such as "Mr. Takeda, please" or "May I talk to Mr. Yamada, Accounting Division?", and connect the line to the called person's branch telephone after confirming the recognized result.

2.1. Hardware configuration
The hardware configuration of the system is illustrated in Figure 1. The system consists of three platforms: an IBM PC/AT-compatible personal computer with a 386SX CPU (25 MHz), a workstation (HP9000/735), and hardware modules for the telephone line interface, speech synthesis and speech recognition. The modules are connected to each other via an AT bus.

The system is connected to the PBX via a telephone line interface module, which can detect touch tones and provide A/D conversion. The interface also has the facility to send tones and on/off-hook signals so as to invoke the call transfer function of the PBX. For real-time speech recognition, we designed special hardware using nine DSPs (TMS320C30). One DSP is used for controlling inter-DSP and DSP/PC communications. One DSP executes echo cancelling, speech period detection and acoustic analysis of the speech, including vector quantization. One DSP executes the Viterbi operation for grammar nodes, and the other six DSPs execute the Viterbi operation for HMM states in parallel. The DSP for communication control has a 2-MByte dynamic memory in addition to the 256-KW local static memory that every DSP has. This configuration can execute continuous

speech recognition for our task, covering up to 300 people, in real time. A workstation, connected to the PC/AT via Ethernet, controls the dialogue by receiving recognized word sequences, searching the directory, generating system announcements and predicting the words of the next utterance.

2.2. Algorithms
As the basic recognition algorithm, we adopted an HMM Viterbi search controlled by a Finite State Network grammar. The vocabulary of the grammar includes the 322 words listed in Table 1 and two non-speech sounds. The static branching factor of the grammar is 14.0. Furthermore, the search space in the grammar network is restricted so that all words in the sentence are included in the results of the word prediction described below. In order to accept spontaneous speech, such as speech with filled pauses, we have designed the Finite State Network grammar so that every grammar node has a self-loop path accepting filled pauses. As a result, the system can accept utterances with filled pauses between phrases.
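As an illustration of this grammar design, the following sketch (hypothetical Python; the node names, word labels and toy grammar are ours, not the system's) adds a filled-pause self-loop to every grammar node so that hesitations between phrases do not break the parse.

```python
# Minimal sketch of a finite state network (FSN) grammar in which every
# node has a self-loop accepting filled pauses, as described in the text.
# Node names, word labels and the tiny grammar are illustrative only.

FILLED_PAUSES = {"<eh>", "<eeto>", "<anoh>"}   # non-speech / hesitation symbols

class FSNGrammar:
    def __init__(self):
        self.arcs = {}          # node -> list of (word, next_node)

    def add_node(self, node):
        # Every grammar node gets a self-loop that absorbs filled pauses.
        self.arcs.setdefault(node, [])
        for fp in FILLED_PAUSES:
            self.arcs[node].append((fp, node))

    def add_arc(self, src, word, dst):
        self.arcs.setdefault(src, []).append((word, dst))

    def accepts(self, words, start="S", finals=("END",)):
        """Return True if the word sequence is covered by the grammar."""
        states = {start}
        for w in words:
            states = {dst for s in states for (label, dst) in self.arcs.get(s, [])
                      if label == w}
            if not states:
                return False
        return any(s in finals for s in states)

g = FSNGrammar()
for n in ("S", "NAME", "END"):
    g.add_node(n)
g.add_arc("S", "takeda", "NAME")
g.add_arc("NAME", "please", "END")

print(g.accepts(["<eeto>", "takeda", "please"]))   # True: filled pause absorbed
```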

Before acoustic analysis and speech detection, an echo canceller is invoked to cancel the system announcement returned through the telephone hybrid, so that the user's voice can be detected correctly at any time. Acoustic analysis is carried out every 10 msec by 14th-order LPC analysis with a 25-msec Hamming window.



Table 1: Vocabulary.

  name                          164
  personal charge                16
  title                          16
  meeting rooms & supl. words   106
  research group name            20

As for acoustic modeling, 43 phone HMMs are used, based on a discrete feature representation (vector quantization, VQ) of the mel-cepstrum and delta cepstrum, with 1024 and 256 codebook prototypes, respectively. These models are trained using 4,000 phonetically balanced sentences from 400 speakers, recorded over 400 public telephone lines.
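A minimal sketch of this two-codebook quantization step is shown below (NumPy; the codebook contents, feature dimension and dummy frames are invented for illustration): each frame's mel-cepstrum and delta-cepstrum vectors are mapped to a pair of discrete symbols that the discrete-output HMMs consume.

```python
import numpy as np

# Sketch of the two-codebook vector quantization step described above.
# Codebook sizes follow the paper (1024 for mel-cepstrum, 256 for delta
# cepstrum); the codebook contents and feature dimension are invented.

rng = np.random.default_rng(0)
CEP_DIM = 14                      # assumed: 14th-order LPC-derived cepstra
cep_codebook = rng.standard_normal((1024, CEP_DIM))
dcep_codebook = rng.standard_normal((256, CEP_DIM))

def quantize(frame_vec, codebook):
    """Return the index of the nearest codeword (Euclidean distance)."""
    dists = np.sum((codebook - frame_vec) ** 2, axis=1)
    return int(np.argmin(dists))

def to_symbols(cepstra, delta_cepstra):
    """Map per-frame feature vectors to (cep_symbol, delta_symbol) pairs."""
    return [(quantize(c, cep_codebook), quantize(d, dcep_codebook))
            for c, d in zip(cepstra, delta_cepstra)]

# Example: 100 frames of dummy features -> discrete observation sequence.
frames = rng.standard_normal((100, CEP_DIM))
deltas = np.diff(frames, axis=0, prepend=frames[:1])
obs = to_symbols(frames, deltas)
print(obs[:3])
```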

The dialogue control module is constructed on the basis of a state transition model. In the model, the sentence to be announced, the words expected in the user's next utterance, and a set of conditions governing the transition to the next state are defined for each state. Once a recognition result is given to the dialogue control module, the conditions of the current state are searched for one that matches the result, and the corresponding state transition is selected. At the next state, the system utterance and the predicted words are then sent to the speech synthesis and recognition modules, respectively.
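The sketch below (hypothetical Python; the state names, announcements and conditions are invented) shows one way such a state-transition dialogue controller can be organized: each state carries its announcement, the predicted words, and an ordered list of transition conditions evaluated against the recognition result.

```python
# Minimal sketch of the state-transition dialogue controller described
# above. State names, announcements and conditions are illustrative.

DIALOGUE = {
    "ASK_NAME": {
        "announce": "Who would you like to talk to?",
        "predict": ["takeda", "yamada", "please", "accounting"],
        # (condition on recognition result, next state)
        "transitions": [
            (lambda r: r.get("name") is not None, "CONFIRM"),
            (lambda r: True, "ASK_NAME"),            # re-prompt otherwise
        ],
    },
    "CONFIRM": {
        "announce": "Calling the requested person. Is that correct?",
        "predict": ["yes", "no"],
        "transitions": [
            (lambda r: r.get("answer") == "yes", "TRANSFER"),
            (lambda r: True, "ASK_NAME"),
        ],
    },
    "TRANSFER": {"announce": "Transferring your call.", "predict": [], "transitions": []},
}

def step(state, recog_result):
    """Pick the first transition whose condition matches the result."""
    for cond, nxt in DIALOGUE[state]["transitions"]:
        if cond(recog_result):
            return nxt
    return state

state = "ASK_NAME"
state = step(state, {"name": "takeda"})      # -> CONFIRM
state = step(state, {"answer": "yes"})       # -> TRANSFER
print(state, "->", DIALOGUE[state]["announce"])
```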

3. FIELD TRIAL AND RECOGNITION ACCURACY

With this system, we have recently performed a field trial. The system was installed in the R&D Laboratories for daily use in June 1993. More than 5,000 man-machine dialogues (about 10,000 sentences) have been collected through various networks, such as a private network, the public switched telephone network and so on, and analyzed from various points of view.

3.1. Data collection
The main purpose of the data collection is to accumulate as much as possible of the linguistic and acoustic phenomena of simple task-oriented telephone dialogues. The digitized speech signals of the user's voice and of the system announcements, after inverse filtering for echo cancelling, are recorded. Furthermore, to prevent the system from making fatal errors, e.g. connecting to the incorrect person or going round in circles, operator intervention is possible at any time. Thus the recognition performance of the system remains very stable, which ensures that the system stays useful.

The users of the system are mainly employees of KDD and its subsidiaries; however, people outside KDD and its subsidiaries may also make calls to the system, because the line to the system is open to the public switched telephone network.

Before the start of the field trial, we informed users about some features of the system, such as:


1. it can accept various expressions for requesting connections, such as "Mr. Takeda, please", "May I speak to Mr. Yamada, Accounting Division?", etc.;

2. it can accept user speech even during system announcements;

3. it can accept speech from various networks, such as a private network, the public switched telephone network and so on.

3.2. User reactions
Both beginners and accustomed users can access the system. Their needs with respect to system announcements and their behavior toward the system may differ. To give beginners sufficient information on how to use it and to provide a short cut for accustomed users, the system is designed to detect user speech and recognize it even while the system is making announcements. Figure 2 shows the probability of the timing at which users start to speak and its distribution. As shown in Figure 2, more than 60% of callers begin to speak before the system announcement ends in order to obtain a quick response. These results show that voice-activated speech processing systems need a function which can detect user speech during the system announcement and recognize it.
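Detecting user speech while an announcement is still playing presupposes that the announcement echo returned through the telephone hybrid has already been removed, which is the role of the echo canceller mentioned in section 2.2. The following NLMS adaptive filter is a generic textbook sketch of that idea, not the authors' DSP implementation; the filter length, step size and synthetic signals are all invented.

```python
import numpy as np

# Generic NLMS echo-canceller sketch: subtract an adaptive estimate of the
# announcement echo from the line signal so that user speech uttered during
# the announcement (barge-in) can be detected. Filter length, step size and
# the synthetic signals are illustrative, not the system's actual settings.

def nlms_echo_cancel(far_end, mic, taps=128, mu=0.5, eps=1e-6):
    w = np.zeros(taps)                      # adaptive echo-path estimate
    out = np.zeros_like(mic)
    x_buf = np.zeros(taps)
    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = far_end[n]               # announcement sample sent to line
        echo_est = w @ x_buf
        e = mic[n] - echo_est               # residual: user speech + noise
        w += (mu / (eps + x_buf @ x_buf)) * e * x_buf
        out[n] = e
    return out

rng = np.random.default_rng(1)
announce = rng.standard_normal(8000)                          # far-end signal
echo = 0.6 * np.convolve(announce, rng.standard_normal(32) * 0.1)[:8000]
user = np.zeros(8000)
user[4000:] = rng.standard_normal(4000) * 0.3                 # barge-in speech
residual = nlms_echo_cancel(announce, echo + user)
print("echo energy before/after cancelling:",
      float(np.mean(echo[:4000] ** 2)), float(np.mean(residual[:4000] ** 2)))
```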


Figure 2. Probability and distribution of the start time of user speech: Announcement (A) is "Hello, KDD R&D Labs.", Announcement (B) is "This is a voice-activated telephone exchange system speaking", Announcement (C) is "Who would you like to talk to?".

As shown in Figure 2, users who begin to speak before the system announcement ends generally tend to start speaking during an interval in which no system announcement is being played.

3.3. Linguistic analysis
A linguistic analysis of the collected man-machine dialogues shows some notable features. First, the percentages of the individual filled pauses are considerably different from those of man-man dialogues [8]. In particular, the occurrences of the filled pauses [ano], [anoh], etc., which are said to be often used in hesitation, are relatively small in comparison with those in man-man dialogues.

Secondly, the percentage of sentences containing filled pauses is less than 10%, which is smaller than that of man-man dialogues [8]. The percentage in the later stage of the field trial is about 4%, generally less than that of the early stage, which is about 12%. This may be because the percentage of accustomed users increases in the later stage of the field trial.

These results on the collected dialogues may suggest that man-machine dialogues have different characteristics from

."

-23-

Page 4: [IEEE 2nd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications - Kyoto, Japan (26-27 Sept. 1994)] Proceedings of 2nd IEEE Workshop on Interactive Voice

those of man-man dialogues, and that changes occur as the user becomes accustomed to the man-machine interface. It is, however, not clear whether such results are general in man-machine dialogues or dependent on the task. We think that further experience with other tasks and careful study are necessary to obtain general conclusions.

3.4. Recognition accuracy and error analysis
Figure 3 shows the percentages of correct and incorrect task recognition. The task recognition rate is the percentage of sentences whose content words are correctly recognized, regardless of whether the function words are correctly recognized or not. As shown in Figure 3, about 61% of sentences are correctly recognized from the viewpoint of the task recognition rate. About 39% of sentences are incorrectly recognized, and they are classified into three categories according to the cause, as follows:

1. Incorrect detection of speech (10%)

2. Out-of-vocabulary responses (9%)

3. Incorrect recognition with inadequate HMMs for speech and noise even though utterances are correctly detected and well formed (20%)


Figure 3. Accuracy and causes of incorrect recognition
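As a concrete illustration of how such a task recognition rate can be computed, the toy scorer below (hypothetical Python; the content-word tags and example data are invented) counts a sentence as correct when all of its content words appear correctly in the recognized output, ignoring function words.

```python
# Toy illustration of the task recognition rate used above: a sentence
# counts as correct if its content words (e.g. requested name, division)
# are all recognized, regardless of function words. Data is invented.

CONTENT_TAGS = {"name", "division"}

def sentence_is_task_correct(reference, hypothesis):
    """reference/hypothesis: dicts mapping tag -> word (content words only)."""
    return all(hypothesis.get(tag) == word
               for tag, word in reference.items() if tag in CONTENT_TAGS)

def task_recognition_rate(pairs):
    correct = sum(sentence_is_task_correct(ref, hyp) for ref, hyp in pairs)
    return correct / len(pairs)

pairs = [
    ({"name": "takeda"},                     {"name": "takeda", "polite": "please"}),
    ({"name": "yamada", "division": "acct"}, {"name": "yamada", "division": "acct"}),
    ({"name": "inoue"},                      {"name": "inexact"}),
]
print(task_recognition_rate(pairs))   # 2/3 in this toy example
```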

Most of the out-of-vocabulary words are given names and some verbs for requesting connections which we had not anticipated. We think that such verbs should be included in the vocabulary of the system. Most of the ill-formed sentences are sentences with false starts; however, their occurrence is low.

The causes of incorrect detection of speech are as follows:

1. The system cannot detect the speech because its level is too low.

2. The system misses the speech portion after a pause, because the pause duration is so long that the system regards it as the end of the speech.

3. The system performs erroneous detection due to coughing, soliloquies and so on.

4. IMPROVEMENT OF SYSTEM PERFORMANCE

Deterioration due to inadequate speech and noise models and incorrect speech detection are the largest and second largest of the three categories of system performance degradation. We have therefore tried to attack the problems of inadequate speech and noise models and incorrect speech detection by using the collected speech data.

4.1. Improvement of models
Figure 4 shows the improvements in system performance on about 500 speech samples selected from the speech data collected from Dec. 1993 to Feb. 1994, under the condition that they were correctly detected and well formed. The task recognition rate is improved up to 94% by adopting an improved noise model and rescoring with context-dependent continuous mixture output density HMMs. In Figure 4, "branch telephone", "PSTN" and "total" show the recognition rates for speech from extension branch telephones, from the public switched telephone network, and from every network including a private network, respectively.



Figure 4. Improvements of system performance: (1) baseline system, (2) modification of the noise model, (3) modification of the FSN grammar, (4) 1st candidate of N-best beam search (discrete output density model), (5) 1st candidate of N-best beam search (continuous mixture output density model), (6) rescoring (context-dependent continuous mixture output density HMMs).

A one-state model is usually used as a noise model. The noise model of the system was also a one-state model, and its output distribution parameters were originally trained using circuit noise collected through various telephone circuits. The noise seen by the system, however, consists of general circuit noise, ambient noise input from telephone sets, and the echo returned through the telephone hybrid. The echo canceller attenuates the echo, but a residual echo still remains. The characteristics of the residual echo are different from those of the usual circuit noise; therefore, the characteristics of the noise during the system announcement are different from those after the system announcement ends.

The noise model is therefore changed to a multiple-state model, and its output distribution parameters are trained using the residual echo and the noise collected through various telephone circuits. This modification of the noise model has improved the system performance by 9%.
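To make the change concrete, the sketch below (hypothetical Python/NumPy; the transition probabilities, output distributions and codebook size are invented, not the trained parameters) defines a small multi-state noise model in which separate states stand for residual-echo noise and ordinary circuit noise, evaluated with a plain forward recursion.

```python
import numpy as np

# Sketch of the noise-model change described above: instead of a single
# noise state, a small multi-state model separates residual-echo noise
# (heard while the announcement plays) from ordinary circuit noise.
# Transition probabilities and output distributions are invented.

NOISE_STATES = ["residual_echo", "circuit_noise"]

# Ergodic transition matrix between the two noise states (rows sum to 1).
A = np.array([[0.95, 0.05],
              [0.05, 0.95]])

# Discrete output distributions over a toy VQ codebook of size 8; in the
# real system these would be trained on residual echo recordings and on
# noise collected over various telephone circuits, respectively.
B = np.array([[0.30, 0.25, 0.15, 0.10, 0.08, 0.06, 0.04, 0.02],
              [0.02, 0.04, 0.06, 0.08, 0.10, 0.15, 0.25, 0.30]])

def noise_log_likelihood(symbols, pi=np.array([0.5, 0.5])):
    """Plain forward algorithm over the small noise HMM (no scaling)."""
    alpha = pi * B[:, symbols[0]]
    for s in symbols[1:]:
        alpha = (alpha @ A) * B[:, s]
    return float(np.log(alpha.sum()))

print(noise_log_likelihood([0, 1, 0, 7, 6, 7]))
```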



The system was designed to accept filled pauses at every inter-phrase boundary; however, the collected speech data show that almost all filled pauses appear at the head of utterances, and that there are some inter-phrase boundaries at which filled pauses do not appear. We have modified the Finite State Network grammar based on this observation. This modification has improved the system performance by a further 3%.

Context-dependent continuous mixture output density HMMs are reported to give better recognition performance than context-independent discrete output density HMMs. They require, however, a large amount of computation, and their real-time implementation is rather difficult. We have therefore employed a rescoring scheme in which context-independent discrete output density HMMs are used in the first search, and the time-consuming context-dependent continuous mixture output density HMMs are used only to rescore the paths selected in the first search, whose number is much smaller than the number of paths explored in the first search. These improvements of the HMMs and the search algorithm have improved the recognition rate by 7%.
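The two-pass structure can be sketched as follows (hypothetical Python; both scoring functions are placeholders standing in for the discrete first-pass HMM score and the context-dependent continuous-mixture rescoring score, and the candidate list is invented).

```python
# Sketch of the two-pass search described above: a cheap first pass with
# context-independent discrete HMMs produces N-best candidates, and only
# those candidates are rescored with the expensive context-dependent
# continuous-mixture HMMs. Both scoring functions are placeholders.

import heapq

def discrete_hmm_score(sentence, observations):
    # Placeholder for the fast first-pass (discrete output density) score.
    return -len(sentence)                     # dummy value

def cd_mixture_hmm_score(sentence, observations):
    # Placeholder for the accurate but slow rescoring-pass score.
    return -sum(len(w) for w in sentence)     # dummy value

def recognize(candidate_sentences, observations, n_best=5):
    # First pass: keep only the N best hypotheses under the cheap model.
    top = heapq.nlargest(n_best, candidate_sentences,
                         key=lambda s: discrete_hmm_score(s, observations))
    # Second pass: rescore the shortlist with the detailed models.
    return max(top, key=lambda s: cd_mixture_hmm_score(s, observations))

candidates = [("takeda", "please"), ("yamada",), ("takeda", "san", "please")]
print(recognize(candidates, observations=None, n_best=2))
```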

The task recognition rate for speech through the public switched telephone network is improved to about 90% from 65% for the baseline system; however, this improved task recognition rate is still not as high as that for speech from the extension branch telephones, which is about 97%. This is because the variety of channels and telephone sets in the private network is expected to be smaller than that of the PSTN, and the signal-to-noise ratio in the private network is generally better than that of the PSTN.

We need to improve the system performance for speech over the PSTN by improving the HMMs with a much larger speech database collected over the PSTN, or by employing techniques for compensating for the channel variety of the PSTN [9].

4.2. Improvement of speech detection

Missing the speech portion after a pause is the biggest problem in the incorrect detection of speech. As its speech detection scheme, the system employs a method in which the outputs of broad band-pass filters are measured at each frame, and the start frame of the speech is decided by whether the outputs exceed thresholds during some consecutive frames [5]. The end of speech is likewise decided by whether the outputs stay below the thresholds during 70 consecutive frames (0.7 seconds).

If the number of consecutive frames is set to a large number, the end of speech can be detected correctly even if there are relatively long pauses between phrases. Such a setting, however, makes the system response slow and irritates users. We have therefore employed a scheme in which the scores of the final grammar nodes and of the intermediate grammar nodes are compared at every provisional end of speech, i.e. whenever a pause lasts longer than a threshold. The provisional end of speech is accepted as the end of speech if the score of a final grammar node exceeds the scores of the intermediate grammar nodes.
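A minimal sketch of this endpoint rule is given below (hypothetical Python; the pause threshold, grammar node names and scores are invented): at each provisional endpoint, the best Viterbi score reaching a final grammar node is compared with the best score still held by an intermediate node.

```python
# Sketch of the end-of-speech decision described above: at each provisional
# endpoint (a pause longer than a threshold), compare the best path score
# reaching a final grammar node with the best score still sitting in an
# intermediate node; accept the endpoint only if a final node wins.
# Node names, threshold and scores are illustrative.

PAUSE_THRESHOLD_FRAMES = 30     # assumed provisional-pause length (~0.3 s)

def is_end_of_speech(node_scores, final_nodes, pause_frames):
    """node_scores: dict grammar_node -> best Viterbi path score (log domain)."""
    if pause_frames < PAUSE_THRESHOLD_FRAMES:
        return False
    best_final = max((s for n, s in node_scores.items() if n in final_nodes),
                     default=float("-inf"))
    best_intermediate = max((s for n, s in node_scores.items() if n not in final_nodes),
                            default=float("-inf"))
    return best_final >= best_intermediate

scores = {"NAME": -120.0, "END": -118.5}      # END is a final node here
print(is_end_of_speech(scores, final_nodes={"END"}, pause_frames=40))   # True
```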

Unfortunately, the collected speech data contain only the speech during the frames which were considered as speech periods by the system. The new scheme therefore cannot be applied directly to the collected speech data; however, it gives the same performance on both speech data with no pauses between phrases and speech data with pauses between phrases, which were newly collected to test the performance. The new scheme is therefore expected to provide good speech detection performance.

5. FUTURE PLANS
The causes of system performance degradation are incorrect detection of speech, out-of-vocabulary responses, and inadequate HMMs for speech and noise, as mentioned before. We have improved the system performance by using context-dependent continuous mixture output density HMMs, a multiple-state noise model and the improved speech detection scheme. Most out-of-vocabulary words are given names and some verbs for requesting connections. These words will be included in the vocabulary when the system has a much larger vocabulary. Problems of ill-formed sentences, such as false starts, are still unsolved; however, there were few occurrences of such false starts in our task except in the early stage of the field trial.

System performance has been improved to an almost satisfactory level, and experienced users are generally favorable to the system. There are, however, users who have stopped using the system, and the number of collected man-machine dialogues is smaller than we expected. We have investigated the reasons why users stop using it by collecting their opinions. One of the reasons is that the number of recognizable words is small, so that the system is not so convenient in comparison with looking up a telephone directory.

We have therefore developed a new version of the system which can recognize a vocabulary of more than 5,000 words and which covers all employees of our branch offices. The system consists of a workstation for searching the optimal path and multiple DSPs for calculating the scores of the HMMs at each frame. The system adopts the above-mentioned improvements, such as:

1. HMM beam search controlled by an LR parser
2. context-independent discrete output density HMMs in the first search to obtain N-best candidates
3. context-dependent continuous mixture output density HMMs in the rescoring process
4. a multiple-state model for noise

We are planning to carry out the second-phase field trial using the new version of the system, in order to evaluate it in a real environment and to evaluate users' behavioral reactions.

6. SUMMARY
This paper has described a voice-activated telephone exchange system developed at KDD R&D Laboratories and its field trial. More than 5,000 man-machine dialogues have been collected through various networks, and incorrectly recognized speech has been analyzed and classified into three categories. We improved the system performance by improving the speech detection scheme and by adopting a rescoring scheme with context-dependent continuous mixture output density HMMs.

Experienced users are generally favorable to the system; however, the number of collected dialogues is not sufficient. One of the reasons for this is that the number of recognizable words is small. We have therefore developed a new version of the system which can recognize a vocabulary of 5,000 words. We are now planning to make a field trial of the new version to investigate it in a real environment and to evaluate users' behavioral reactions.



ACKNOWLEDGEMENT

The authors are grateful to Dr. Urano, Director of KDD R & D Labs. and Dr. Murakami, Deputy Director of KDD R & D Labs. for their continuous support of this work. The authors are also grateful to the members of the Speech and Language Research Group of KDD R & D Labs. for their discussions.

REFERENCES

[1] Wilpon, J., Mikkilineni, R., Roe, D., Gokcen, S.: "Speech Recognition: from the Laboratory to the Real World", AT&T Tech. J., pp. 14-24 (1990).

[2] Lennig, M.: "Putting Speech Recognition to Work in the Telephone Network", IEEE Comput., Vol. 23, pp. 35-41 (1990).

[3] Chang, A., Smith, A., Vysotsky, G.: "An Automated Performance Evaluation System for Speech Recognizers Used in the Telephone Network", Conf. Rec. Int. Conf. Commun., Vol. 1, pp. 1.2.1-1.2.5 (1989).

[4] Jacobs, T.: "Performance of the United Kingdom Intelligent Network Automatic Speech Recognition System", Proc. ICSLP92, pp. 1399-1402 (1992).

[5] Yato, F., Kuroiwa, S., Takeda, K., Yamamoto, S., Owa, K., Shouzakai, M.: "A Real-time Isolated Recognizer for Telephone Input", J. Acoust. Soc. Jpn. (E) 15, 2 (1994).

[6] Kuroiwa, S., Takeda, K., Inoue, N., Yamamoto, S.: "Analysis of Human-machine Dialogues Collected with a Voice-activated Telephone Exchanger", Paper of Technical Group, SP94-30, pp. 57-64 (1994).

[7] Kuroiwa, S., Takeda, K., Inoue, N., Nogaito, I., Yamamoto, S.: "A Voice-Activated Extension Telephone Exchange System", Proc. EuroSpeech93, pp. 789-792 (1993).

[8] Murakami, J., Sagayama, S.: "A Discussion of Acoustic and Linguistic Problems in Spontaneous Speech Recognition", Paper of Technical Group, SP91-100, IEICE Japan (1991) (in Japanese).

[9] Acero, A.: “Acoustical and Environmental Robustness in Automatic Speech Recognition”, Kluwer Academic Publishers (1993).