[IEEE 2nd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications - Kyoto, Japan (26-27 Sept. 1994)] Proceedings of 2nd IEEE Workshop on Interactive Voice

2nd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA94), Kyoto, September 26-27, 1994

INTERACTIVE VOICE TECHNOLOGY AT WORK: THE CSELT EXPERIENCE

R. Billi, F. Canavesio, A. Ciaramella, L. Nebbia

CSELT, Centro Studi e Laboratori Telecomunicazioni, via G. Reiss Romoli, 274, 10148 Torino, ITALY

1. ABSTRACT

This paper is a survey of the speech technologies and applications developed at CSELT, some of which are employed in real services on the Italian telephone network. This represents a major extension of the centre's previous activity, which was essentially oriented to research, to a broader set of activities ranging from defining and experimenting with new algorithmic approaches to speech product engineering and application development.

In particular, the paper describes two applications in the field trial phase: one is an automatic operator providing voice selection from large name directories; the other is an automated network service for directory assistance, which is now accessible to all Italian telephone customers.

2. THE BASIC VOICE TECHNOLOGIES: FROM RESEARCH TO ENGINEERED SOLUTIONS

Text-to-speech

Research on text-to-speech has been done at CSELT since 1974. The first prototype of a synthesizer for the Italian language was built in 1978, based on diphone concatenation and LPC coding [1]. In 1991 the waveform concatenation technique was adopted, along with a more accurate diphone generation procedure [2], in order to significantly increase the acoustic quality, while greater naturalness was obtained through progress at the prosodic level [3]. These improvements, which were confirmed by extensive subjective tests [4], brought the system to a high level of intelligibility, as required by difficult applications such as name and address pronunciation.

Various attempts have been made over the years to obtain cost-effective implementations of the text-to-speech system. Satisfactory results have now been obtained, relying on the progress of PC computational power, by adopting a complete software version of the system, written in ANSI C. This allowed the system to be ported easily to several processing platforms. A special-purpose hardware architecture was designed for an important network application, which handles more than 30 million calls per year and which is described later in this paper.

With the legalisation of shared-value services, which occurred in Italy in 1993, CSELT designed a version of its TTS to be provided on a licence basis under the trademark ELOQUENS®. This version has been released in the form of a library for several operating systems and processing platforms and has been integrated in several voice servers intended for audiotex and IVR applications.

Speech recognition

The first experiments in speech recognition at CSELT date back to 1979. Statistical models have been studied since 1980 [6], and have been applied first to speaker dependent [9], then to speaker independent [8] recognition on the telephone network. The first continuous speech recognition system was developed in 1987 [10]. It was speaker dependent and did not work on the telephone line, but worked fairly well when connected to a language processing module which processed the recognized sentence to answer simple questions on a geographical database.

With the conclusion of the 5-year SUNDIAL project (Esprit) in 1993, an important advance was achieved: the system was now speaker independent and could work on the telephone line. Moreover, a dialogue manager handled the interaction with the user in the domain of train timetables, asking for missing information and resolving ambiguities and complex references.

0-7803-2074-3/94/$4.00 © 1994 IEEE

The first serious attempts to use speech recognition technology for telephone applications were in the automation of the collect call service [7] and in voice dialling in the car [11]. Consequently, a systematic activity of speech database collection and of training and testing speech recognizers on specific vocabularies was started.

As with text-to-speech, engineered versions of the speech recognizers, for different application requirements, have since been produced and provided on a licence basis.

ROBUR® is a compact speech recognizer specifically designed to work in highly noisy environments, such as the car or public telephones. It can recognize, in the speaker independent mode, a vocabulary of up to 34 words and, in the speaker dependent mode, up to 20 words. It is provided as firmware for DSPs like the TMS320C31.

AURIS® is a speaker independent, robust speech recognizer for telephone applications. Different vocabularies, of up to 100 words each, trained on the Italian telephone network, are available and others can be provided on request. AURIS® can also recognize connected digits quite well and has the useful functionality of talk-through. It is provided in the form of a library for the TMS320C31 DSP for easy integration in voice servers.

FLEXUS® is a phoneme based recognizer which has been trained in a speaker independent and vocabulary independent way for the telephone network. Due to its enhanced word discrimination, FLEXUS® can reliably recognize a vocabulary of more than 1,000 words. It is suitable for any kind of application, particularly for those requiring large vocabularies or frequent updates to the vocabulary. This recognizer is also provided in the form of a library for the TMS320C31 DSP for easy integration into voice servers.

Voice servers

A simple way to build a voice server from off-the-shelf components is to use an open architecture such as, for example, the Dialogic one. A few boards transform a PC into a cost-effective voice server which can easily be extended in the number of telephone lines handled, in functionality and in network interfaces. Powerful processing boards make it possible to port proprietary algorithms, while software drivers greatly simplify the integration.

For example CSELT has ported its TTS and speech recognition technologies to the new Dialogic Antares board, a 4 DSP (TMS320C31) board which will be fully integrated in the Dialogic environment.

While a configuration suitable for field trials or uncritical applications may be implemented very quickly and at a very low price, designing more complex applications with hundreds of lines, connection to host databases, remote control and fault tolerance may require more time and a more careful design.

Another problem, which is often given little consideration by developers of small applications, is the importance of having a powerful service creation environment.

An alternative for the application developer is to use platforms specifically designed for network applications, which provide all the necessary functionalities, and to concentrate on the application itself.

An example of a powerful client-server architecture on which all the advanced interactive voice technologies described above are integrated is the Necsy* M329-VIP (Voice Interface Platform). It is configured as one control unit with functionalities of service creation and simulation, operator console, local database and host interface. The control unit is connected by a LAN to one or more peripheral units dedicated to telephone interface and speech functionalities.

CSELT's voice technologies have also been integrated on this product.

* Necsy is a manufacturing company belonging to the STET group.


3. REAL APPLICATIONS AND FIELD TRIALS

Voice dialling

There are two quite different voice dialling applications: one at network level and another at PBX (departmental) level. The first one allows each subscriber to access a private vocabulary of frequently called numbers by pronouncing an associated name or keyword: the required technology is speaker trained recognition and, optionally, speaker independent digit recognition. The second one provides voice access to telephone directories and implements a kind of automatic operator by connecting the internal or external caller to the desired employee: the required technology is speaker independent, flexible vocabulary recognition on the telephone network, and the vocabulary may be small or large depending on the directory size.

With the progress of speech recognition technology it should become possible to access directories of a very large size, such as those containing information on all customers of the public telephone network. Some telephone companies are already experimenting to see what can be achieved with the current level of technology [12].

So this application, even at a smaller scale, is important because it presents problems, such as the variability of pronunciation of names, whose study may be useful for approaching the problem on a much larger scale.

It is also a good application of speech technology, as it provides a useful service based on a very simple and quick interaction with the user. Moreover, it automates repetitive, unskilled work, allowing personnel to be used more efficiently in other activities which require human capabilities such as, for example, the capacity to interact with other people.

At CSELT we have implemented a voice dialling service which provides access to the internal directory of more than 1,000 employees. The service can be used from any internal or external telephone. There is an abbreviated service selection (single digit) for internal access, which for the moment substitutes for the more efficient functionality of automatic service activation when the telephone is lifted, currently not possible due to technical limitations of our PABX.

This service is also used to collect, in the context of a real user interaction, speech databases that will be used to improve the speech recognition training.

The system is entirely based on CSELT's voice technology, namely the ELOQUENS® text-to-speech and the FLEXUS® recognizer. The text-to-speech is used only to produce feedback to the user about the recognized name before routing the call to that person.

A problem reported in the initial tests, related to the use of TTS, was that users found it difficult to judge whether the recognized name was correct when one or more phonetically similar names existed in the directory (for example Mr. Guglielmetti and Mr. Gugliermetti). The solution was to mark (automatically) all the directory entries which had a phonetically similar counterpart and, when the recognized name was one of them, to provide the first name as well (and, if the first name also coincided, the department).
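The marking step described above can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say how phonetic similarity was measured, so a Levenshtein distance on the spelling is used here as a crude stand-in, and the `Entry` fields, the function names and the threshold are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Entry:
    first_name: str
    surname: str
    department: str


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; an assumed stand-in for a phonetic similarity measure."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def mark_confusable(directory, max_distance=1):
    """Map each entry to the entries whose surname is close enough to be confused."""
    rivals = {}
    for x, y in combinations(directory, 2):
        if edit_distance(x.surname.lower(), y.surname.lower()) <= max_distance:
            rivals.setdefault(x, []).append(y)
            rivals.setdefault(y, []).append(x)
    return rivals


def confirmation_text(entry, rivals):
    """Build the text passed to the TTS before the call is routed."""
    others = rivals.get(entry, [])
    if not others:
        return entry.surname                      # unmarked entry: surname is enough
    text = f"{entry.first_name} {entry.surname}"  # marked entry: add the first name
    if any(o.first_name == entry.first_name for o in others):
        text += f", {entry.department}"           # first names coincide: add department
    return text
```

With a directory containing Guglielmetti and Gugliermetti (one substitution apart), both entries are marked and the feedback includes the first name; an entry with no close counterpart is confirmed by surname alone.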

The recognizer was trained at the phonetic level without any specific adaptation to the application vocabulary. The vocabulary is automatically generated from the telephone directory; alternative pronunciation models may be generated for particular names, e.g. foreign names.

The system has been implemented on a PC, equipped with one telephone interface board and one or more speech recognition boards running the FLEXUS® software.

The service has been introduced incrementally by making it known to more and more people, as the processing capabilities were correspondingly increased. To remind people of the existence of the service and encourage them to use it, a distinctive label will be applied to their telephones. A field trial involving about 150 people will be carried out over a period of about 3 months, in order to make objective and subjective evaluations.

On the basis of the information which is automatically collected by the system during its use, it will be possible to compute the following statistical data:

- number of successfully completed calls
- number of uncompleted calls (due to user or system hang up)
- average duration of man-machine interaction
- number of calls per hour during the total field trial
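As an illustration, the statistics listed above could be derived from a call log along the following lines. The record layout (`CallRecord` and its fields) is an assumption, not the format actually logged by the system, and "calls per hour" is interpreted here as an average over the whole trial.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    duration_s: float      # length of the man-machine interaction, in seconds
    completed: bool        # True if the call was routed to the requested person
    hung_up_by: str = ""   # "user" or "system" for uncompleted calls (assumed labels)


def trial_statistics(records, trial_hours):
    """Compute the four field-trial statistics from a list of call records."""
    uncompleted = [r for r in records if not r.completed]
    n = len(records)
    return {
        "completed_calls": n - len(uncompleted),
        "uncompleted_by_user": sum(r.hung_up_by == "user" for r in uncompleted),
        "uncompleted_by_system": sum(r.hung_up_by == "system" for r in uncompleted),
        "mean_interaction_s": sum(r.duration_s for r in records) / n if n else 0.0,
        "calls_per_hour": n / trial_hours,
    }
```

Keeping user and system hang-ups as separate counters makes it possible to distinguish abandoned calls from calls the system itself gave up on.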

In addition, all of the selected field trial users will be asked to fill in a questionnaire intended to collect useful information, such as the usefulness of the service, the usability of the speech interface, the intelligibility and quality of the synthetic speech, any difficulty experienced in interacting with the machine and, finally, a global evaluation of the service.

If the trial is successful, the service will be extended to all the 1,000 CSELT employees by the end of October, by installing a version which will be able to handle 8 calls simultaneously.

Reverse Directory Access

The automation of the Reverse Directory Service, which supplies the user with the name and address corresponding to a given telephone number, has been the first opportunity for SIP, the Italian telephone operating company, to exploit TTS technology in a widespread service on the public network.

The service, which is accessed by dialling 1412, was introduced free of charge, on an experimental basis, in September 1993 only for the customers of one of the telephone exchanges in Rome. At the beginning of 1994 the service entered the full operating stage and started to be charged. It was progressively extended to 32 centres located in the main Italian cities. Since May 1994 the service has been accessible from the whole of Italy: calls to 1412 from any origin are routed to the pertinent centre.

The service structure is quite simple: the user dials 1412 and a voice guide invites him to dial the area code and the telephone number whose corresponding name and address are requested. A specially designed device at the central office automatically queries the telephone user database and converts the answer, received in written form, into the corresponding speech signal.

The total number of activated lines is about 1200. The number of calls to the service is quite large in this initial operating phase, at around 2.5 million calls a month.
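As a rough consistency check on these figures, the short calculation below converts the monthly volume into a yearly total and an average load per line. The 30-day month and the even spread of calls over lines are simplifying assumptions, not data from the service.

```python
def service_load(calls_per_month: float, lines: int, days_per_month: int = 30):
    """Return (calls per year, average calls per line per day)."""
    calls_per_year = calls_per_month * 12
    calls_per_line_per_day = calls_per_month / (lines * days_per_month)
    return calls_per_year, calls_per_line_per_day


yearly, per_line = service_load(2_500_000, 1200)
print(f"{yearly:,.0f} calls per year, about {per_line:.0f} calls per line per day")
```

The yearly total of 30 million agrees with the call volume quoted earlier for the network application served by the special purpose TTS hardware.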

The nature of the text to be synthesized, the form in which it was stored (on the assumption that an operator would interpret it), and the large size and variability of the database require a number of text processing procedures to be activated in order to produce an intelligible synthetic answer. Many problems have to be faced, including, for example:

- the right stress assignment of names, surnames and localities. The solution of this problem required a careful check of tens of thousands of entries to extend the previous rules for word accent to this type of word. We estimate that the residual error is now quite low, less than 1%;
- identification of foreign words, names and surnames (SERVICE, GARAGE, CASH, GEORGE, JACK, ...), possibly along with the original language, for an acceptable phonetic transcription;
- expansion of abbreviations, which could be ambiguous, depending on the context.

The not infrequent presence of orthographic errors (omission, substitution or insertion of characters) can also cause serious problems for the intelligibility of the message.

Some of the operations mentioned require tedious work, which is in progress, to "clean" the database; other operations have to be performed by specialized modules, some of which are already working, while others are under development or are being improved.
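Two of the operations listed above, context-dependent abbreviation expansion and identification of foreign words, can be sketched as a simple table-driven pass over each database field. The tables, field names and language tags below are hypothetical illustrations only; the paper does not list the rules or lexicons actually used.

```python
# Hypothetical tables for illustration; not the rules used in the real service.
ADDRESS_ABBREVIATIONS = {"V.": "VIA", "C.SO": "CORSO", "P.ZZA": "PIAZZA"}
NAME_ABBREVIATIONS = {"F.LLI": "FRATELLI"}
FOREIGN_WORDS = {"SERVICE": "en", "GARAGE": "fr", "CASH": "en",
                 "GEORGE": "en", "JACK": "en"}


def normalize(field: str, text: str):
    """Return (token, language) pairs ready for phonetic transcription.

    `field` is "name" or "address": the same abbreviation may expand
    differently, or not at all, depending on where it occurs.
    """
    table = ADDRESS_ABBREVIATIONS if field == "address" else NAME_ABBREVIATIONS
    result = []
    for token in text.upper().split():
        token = table.get(token, token)                         # expand if known in this context
        result.append((token, FOREIGN_WORDS.get(token, "it")))  # default language: Italian
    return result
```

In this sketch "V." becomes "VIA" in an address field but is left untouched in a name field, where it is more likely to be an initial; a real module would need far richer context than the field type alone.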

The performance of ELOQUENS®, with reference to this application, has been evaluated in the laboratory. In particular, the intelligibility of surnames has been measured under operating telephone conditions, using test material selected to give an estimate of the percentage of real accesses to 1412 correctly served (message correctly understood by the user) [3]. After progressive improvements, including the introduction of specialized prosodic rules, an intelligibility value of 85% was reached, without using spelling.

A recent experiment [5] has emphasized that it is possible to have much better results in the intelligibility of company names or similar, by separating prosodically the parts of which the message is made up (for example, trade name, expansion, type of company) and by adopting specific rules for spelling.

In the coming months, gradually "cleaned" versions of the database will be introduced and updated versions of the text processing modules will be activated.

At the same time an evaluation of the service "on the field" will be carried out, based on a list of questions issued at the end of a number of transactions. The answers will be automatically collected and processed. It will also be important to verify "on the field" the intelligibility results obtained in the laboratory, for example by collecting written responses from a few thousand users who have listened to material selected from a sample of a few thousand database entries.

4. NEW APPLICATION DEVELOPMENTS

An important and socially useful application of voice technologies is to improve communication with disabled people, such as those with hearing or speech impairments.

At CSELT we are experimenting with a machine-mediated communication system between a normal user, who uses a standard telephone, and a person with impaired speech or hearing, who uses a screen-phone terminal. Text typed on the screen-phone is converted to speech by TTS technology and sent to the other person over the telephone, while his speech is transcribed to text by a centralized speech recognition server and sent to the screen-phone using a data communication protocol over the telephone line. We are currently using a modem, but it is planned to switch to a more powerful protocol specific to the screen-phone, such as ADSI.

A prototype of the system, called ATENA, will be used for a field trial in which some disabled people will participate. The trial will make it possible to collect useful information about real communication needs and the difficulties of machine-mediated communication, and to optimize the system for the most frequent usage situations.

ATENA is implemented on a PC equipped with telephone interface boards, speech recognition boards and TTS. The recognizer is the CSELT FLEXUS®, which can handle a vocabulary of a few thousand words and also allows new words to be introduced on line, by typing them on the screen-phone keyboard.

A new application under development, belonging to the class of computer-supported telephony, is related to a partial automation of Data Processing Support Centres, which provide assistance related to terminals and software applications used by telecom personnel. There are about 10 of these Centres, which serve about 40,000 users geographically distributed across the country. Currently the service is handled by operators called through the public telephone network.

The remote user has first to identify his terminal by specifying the terminal-id (an alphanumeric string), then he provides information about his problem, which may be related to the terminal itself, to the communication networks or to the application software.

Two phases of automation are planned. In the first phase, the system will automatically get some preliminary information from the user, such as the terminal-id, and will provide some general information about known problems and solutions before connecting the user to the appropriate operator: text-to-speech and limited vocabulary isolated/connected word recognition will be used in this phase.

In the second phase, the system will try to extract additional information from the user, such as the identification of the application program (about 180 possibilities) and the kind of problem. A set of predefined questions, which are currently included in an expert system used by the operators to assist the users, will be automatically put to the user and the related answers will also be automatically collected: text-to-speech and flexible vocabulary recognition with a vocabulary of a few hundred words will be used in this phase.

A prototype of the system, based on CSELT's interactive voice technologies, is being developed.

5. THE NEXT CHALLENGE: SPOKEN LANGUAGE UNDERSTANDING

Speech recognition has shown impressive progress in the last few years, particularly in the possibility of recognizing continuous speech with large vocabularies. Moreover, the integration of speech recognition with language processing has demonstrated that it is already possible to have a natural spoken dialogue with the machine, as between two persons, provided it is carefully managed and the application domain is sufficiently focused. This is true even when the application requires PBX or PSTN quality speech and large-vocabulary speaker-independent recognition.

At CSELT the experience in spoken language understanding gained in the 5-year Esprit project SUNDIAL [13] is now being used to design advanced voice information services over the telephone network, giving access to transport information, public administration information and so on.

A much more efficient interaction with the user than has so far been possible with predefined menus will be achieved, and this will contribute to the spread of interactive voice services.

REFERENCES

[1] L. Nebbia, P. Lucchini, "Eight-channel digital speech synthesizer based on LPC technique", Int. Conf. on ASSP, Washington, Apr. 1979.

[2] M. Balestri, S. Lazzaretto, P. Salza, S. Sandri, "The CSELT system for Italian text-to-speech synthesis", Proc. of the 3rd European Conference on Speech Communication and Technology, Berlin, Sept. 1993.

[3] S. Quazza, P. Salza, S. Sandri, A. Spini, "Prosodic control in a text-to-speech system for Italian", ESCA Workshop on Prosody, Lund, Sept. 1993.

[4] M. Balestri, E. Foti, L. Nebbia, M. Oreglia, P. Salza, S. Sandri, "Comparison of natural and synthetic speech intelligibility for a reverse directory service", 1992 International Conference on Spoken Language Processing, Banff, Oct. 1992.

[5] P. Salza, E. Foti, M. Oreglia, "Intelligibility of natural speech and TTS synthesis in the reading of acronyms", to be published in AVIOS 94, San Jose, Sept. 1994.

[6] R. Billi, "Vector Quantization and Markov Source Models Applied to Speech Recognition", Int. Conf. on ASSP, Paris, May 1982.

[7] F. Canavesio, L. Fissore, M. Oreglia, "Voice-Operated Automatic Collect-Call in the Public Telephone Network", Proc. of Automatic Operator Services and Telephone Transactions Through Voice Processing, N.Y., Oct. 1989.

[8] F. Canavesio, L. Fissore, M. Oreglia, P. Ruscitti, "HMM modeling in the public telephone network environment: experiments and results", 2nd European Conference on Speech Communication and Technology, Genova, 1991.

[9] M. Cravero, L. Fissore, R. Pieraccini, C. Scagliola, "Syntax Driven Recognition of Connected Words by Markov Models", Proceedings of the ICASSP 1984.

[10] L. Fissore, E. Giachin, P. Laface, G. Micca, R. Pieraccini, C. Rullent, "Experimental Results on Large Vocabulary Continuous Speech Recognition and Understanding", Proceedings of the ICASSP '88.

[11] L. Fissore, M. Codogno, G. Pirani, "Isolated Word Recognition in the Mobile-radio System: Experiments and Results", Proc. of European Signal Processing Conference, 1990.

[12] M. Lennig, G. Bielby, J. Massicotte, F. Dugay, L. Robert, J.L. Dufird, "Automated Bilingual Directory Assistance Trial in Bell Canada", 1st IEEE Workshop on Interactive Voice Technology for Telecomm. Applications, Piscataway, October 1992.

[13] J. Peckham, "A new generation of spoken dialogue systems: results and lessons from the SUNDIAL project", Proc. of the 3rd European Conference on Speech Communication and Technology, Berlin, Sept. 1993.
