
Philips J. Res. 49 (1995) 439-454

USER INTERFACE DESIGN OF VOICE CONTROLLED CONSUMER ELECTRONICS

by S. GAMM and R. HAEB-UMBACH, Philips GmbH Forschungslaboratorien Aachen, Postfach 1980, D-52021 Aachen, Germany

Abstract

Today speech recognition of a small vocabulary can be realized so cost-effectively that the technology can penetrate into consumer electronics. But, as first applications that failed on the market show, it is by no means obvious how to incorporate voice control in a user interface. This paper addresses the issue of how to design a voice control so that the user perceives it as a benefit. User interface guidelines that are adapted or specific to voice control are presented. Then the process of designing a voice control in the user-centred approach is described. By means of two examples, the car stereo and the telephone answering machine, it is shown how this is turned into practice.

Keywords: car stereo; human factors; speech recognition; telephone answering machine; usability engineering; voice control.

1. Introduction

Voice control denotes the ability to control a machine by means of spoken commands. The first application that comes to mind is perhaps in a factory environment where speech is used to elicit a certain action when both hands are occupied. Here, however, we are concerned with applications in the consumer or telecommunications domain. To name just one example, consider name dialling: in order to place a call, the name of the person to be called is simply spoken after a corresponding prompt, instead of keying in the telephone number. This example already shows some of the issues involved. Compared to the factory environment, where the hands simply are not available to do the task, here speech input is just an alternative input modality which has to compete with conventional input modalities. But speech indeed offers some unique features, in this case directness: speaking the name of the person avoids the mental translation of the name into the telephone number. We will discuss other examples of voice control in more detail later.

With the recent progress in recognition accuracy and the decline of hardware costs, speech recognition comes within reach of everyday consumer and telecommunication products.

It is by no means straightforward how to incorporate voice control successfully in an everyday appliance. The validity of this statement can be seen from the limited success of the first consumer products that included speech recognition. It is often said that speech I/O is the most natural interface. However, man-machine communication would only be really natural if the machine could match the capabilities of a human. Since automatic speech recognition is still inferior to human speech recognition, despite great progress in recent years, building reliable and accepted speech interfaces is a sophisticated and subtle process. Just replacing button presses by speech input does not seem to deliver any discernible customer benefit. Speech has to be included from the outset of the development rather than simply added to an existing system. If the human factors issues are addressed from the beginning and if the technological capabilities are well taken into account, attractive designs can be developed which take full advantage of the unique properties of speech input. We will address these issues in two examples.

The unique properties of speech as an input modality are:

• Hands-free operation: Speech input allows hands- and eyes-free operation, which is very important in hands- or eyes-busy situations, e.g. while driving a car.

• Remote control: Speech input can be used for remote control, e.g. via the telephone, in order to control a system which is out of manual reach.

• Straightforwardness: With voice control no mental translations are necessary. With name dialling, for example, the names need no longer be translated into numbers.

The challenge of the user interface design is to exploit these strengths while at the same time coping with the shortcomings of the technology. The machine recognizer has a limited vocabulary size, often a more or less rigid dialogue structure, and misrecognitions are possible. A user interface design must be aware of these limitations and take into consideration how the performance is affected by factors such as vocabulary size, dialogue structure, adherence to prompts, etc.

The consumer application domain poses very stringent restrictions on hardware costs. With today's technology, recognition vocabularies of more than 100 words seem unrealistic. A restricted recognition vocabulary has immediate consequences for the user interface and dialogue design. It is thus the goal of the design process to find a good compromise between recognition accuracy, implementation costs and flexibility of the dialogue.

The outline of the paper is as follows. In the next section we will give a short overview of the state-of-the-art recognition technology in as much as it affects user interface design. Section 3 contains a summary of guidelines on how to incorporate voice control in a user interface. An example, the voice-controlled car stereo, illustrates how these guidelines can be turned into practice. In Section 4 we discuss the user interface design process. We present the usability engineering lifecycle for voice controls and illustrate it with the example of a voice-controlled telephone answering machine. The results are summarized in Section 5.

2. Towards a user-centred interface

The technological capabilities of automatic speech recognition have a great impact on user interface design. In early systems the user had to speak fixed words in a broken fashion to a computer that repeated the last statement, asked for confirmation, then requested another instruction. Such a dialogue is not at all natural. With the progress in speech recognition more and more constraints no longer apply, allowing the design of a more user-oriented rather than machine-oriented dialogue. In the following we give examples of how algorithmic advances lead to increased freedom for the interface designer.

2.1. Continuous-speech recognition

Current state-of-the-art systems allow for continuous input, rendering the mandatory pauses between the words of an isolated-speech recognizer superfluous. This results not only in a more natural way of speaking but also in a higher throughput in spoken words per second.

2.2. Keyword spotting

Keyword spotting is a technique to artificially increase the vocabulary size of a recognition system. Commands can be embedded in carrier phrases and the recognition system extracts the command, whereas non-keywords are detected as 'garbage' and thus discarded. Such a technique frees the user from having to restrict his input to keywords only.
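As an illustration of how a dialogue layer can consume the output of a keyword spotter, consider the following minimal sketch (in Python; the vocabulary, the '<garbage>' token and the function name are our own illustrative assumptions, not the recognizer's actual interface). The recognizer is assumed to label every non-keyword stretch of speech with a generic garbage token, which the dialogue layer simply discards.

KEYWORDS = {"radio", "volume", "next", "previous", "scan station"}

def spot_keywords(hypotheses):
    """Return the command words embedded in a carrier phrase.

    `hypotheses` is the recognizer's word sequence, e.g. the phrase
    'turn the radio on' might be hypothesized as
    ['<garbage>', '<garbage>', 'radio', '<garbage>'].
    """
    return [word for word in hypotheses if word in KEYWORDS]

print(spot_keywords(["<garbage>", "<garbage>", "radio", "<garbage>"]))   # -> ['radio']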


2.3. Speaker-independent recognition

Obliging a user to train a system before using it might indeed often be asking too much, in particular for everyday consumer products. A speaker-independent recognizer frees the user from having to train the system. Note, however, that the speaker-independent vocabulary is fixed and cannot be altered by the user. Therefore, in practice often a mix of speaker-independent and speaker-dependent vocabulary is used. Command words and other common vocabulary (e.g. digits) are 'factory-trained', and the user can add speaker-dependent words for personal settings, e.g. the name repertory for name dialling. Furthermore, it is often a good safety measure to allow a user to overwrite a speaker-independent template by a speaker-dependent one, either to improve the recognition accuracy of this particular word or to give the user the freedom to replace a speaker-independent word by a word he feels more comfortable with. Table I summarizes properties of the two types of recognition vocabulary.
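A minimal sketch of such a mixed vocabulary follows; the class and method names are our own, chosen for illustration rather than taken from any product API. Factory-trained, speaker-independent entries can be overwritten by templates the user trains himself.

class Vocabulary:
    def __init__(self, factory_words):
        self.independent = set(factory_words)    # speaker-independent part, fixed at the factory
        self.dependent = {}                      # speaker-dependent templates, keyed by word

    def train(self, word, template):
        """Add a new word or overwrite a factory word with a user-trained template."""
        self.dependent[word] = template

    def template_for(self, word):
        """A user-trained template takes precedence over the factory one."""
        if word in self.dependent:
            return self.dependent[word]
        if word in self.independent:
            return "factory:" + word             # stands in for the built-in acoustic model
        raise KeyError(word)

vocab = Vocabulary({"radio", "volume", "next"})
vocab.train("BBC", "user template 1")            # personal station name
vocab.train("volume", "user template 2")         # overwrite a factory-trained word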

2.4. Robustness

Robustness means the ability of a system to maintain its performance even under changing environmental conditions. Here, new algorithms have led to considerable improvement, although this topic is still a much-addressed research issue. A practical consequence of improved robustness is that desktop or handheld microphones can now be used in situations where head-mounted microphones had to be used before.

The above examples show that algorithmic advances have led to more freedom for the user interface design. One has to bear in mind, though, that a speaker-dependent isolated-word recognizer with a 10-word vocabulary will still have a better performance in terms of recognition accuracy than a speaker-independent continuous-speech 100-word recognizer. Whereas the speech recognition researcher has a fairly simple measure of performance, the error rate, the picture is much more complicated for the user interface designer. He has to optimize user satisfaction. He has to find the right balance between recognition accuracy and flexibility for the user, given an upper limit on the algorithmic complexity dictated by the implementation costs.

TABLE I
Comparison of speaker-independent and speaker-dependent vocabulary

                                  Speaker-independent    Speaker-dependent
Requires training by user         no                     yes
Language dependent                yes                    no
Degree of algorithm complexity    high                   low
Recognition accuracy              relatively low         relatively high

3. User interface guidelines for voice control

In this section we present guidelines for incorporating voice control into a user interface. The guidelines are explained in the first subsection, whereas the second subsection illustrates them by means of an example.

3.1. The guidelines

A number of usability principles have found widespread acceptance [1]. In the following we mention those guidelines which assume a special or additional interpretation for voice control interfaces. Further, some specific guidelines for voice control interfaces [2], pertinent to our applications, will also be reviewed.

Give the user the choice of input modality

Systems that use several input modalities, such as voice input and keyboard input, should accept either of them whenever an input is demanded from the user [3]. The user should not have to opt once and for all for one input modality. Adherence to this guideline is essential if the different input modalities are to complement each other in such a way that one modality compensates for the shortcomings of the other [4].

Be consistent

Consistency asserts that mechanisms should be used in the same way whenever they occur. A particular system action should always be achievable by one particular user action, so that the user is not required to learn which command is needed for the same intended action at which state of the machine [5]. If speech input is employed as an additional input modality, consistency also means that the result of a command should be the same irrespective of the way it has been invoked, whether by a speech command or by a button.

Provide appropriate feedback

The system should always keep the user informed about what is going on by providing him with correct feedback within reasonable time [5]. Without feedback the user cannot learn from mistakes [6]. This issue is, however, particularly delicate for voice control interfaces. It is very awkward and tiring if the recognizer asks for confirmation each time a word has been recognized. If the outcome of a misrecognition is not fatal, it is more appropriate to execute the recognized command rather than to ask for confirmation. Feedback is then given by the reaction of the machine as implicit feedback.

Take into account the user’s expectations

Take into account the possibility that the user's expectations of the system will affect his interpretation of any dialogue with it. The dialogue should be designed to minimize confusion arising from these expectations [7]. For a speech interface the user's expectations might easily exceed the machine's capabilities. Therefore it is important to detect usage problems. One way to do so is to react upon recognition of non-keyword speech: if a non-valid utterance is detected, help menus may be offered automatically.

Do not overload the voice input channel

Each input modality has its own strength. The input of a location can ideally be done by a pointing device; a straightforward selection is ideally done by speech. But speech is definitely not suited for fine tuning, i.e. the adjustment of a value within a continuous range. Commands like 'up' and 'down' have only limited utility; they have to be used iteratively in order to accomplish an acceptable degree of precision in manipulating an object [2]. Since speech input is not useful for all functions, the right balance between the different input modalities has to be found.

3.2. Example: The voice-controlled car stereo

Hands-free operation has been mentioned as one of the unique properties of speech input. This property is the motivation for bringing speech recognition into the car environment.

Figure 1 shows the control panel of a car stereo with an extra 3-button control element positioned on or close to the steering wheel. It can be controlled while keeping the hands on the steering wheel, similar to windscreen wipers or indicators. The 3-button control element exists in parallel to the ordinary control panel in the slot of the car stereo. The microphone is a free-talk microphone mounted preferably on the car ceiling. Using voice commands and the 3-button control element, most of the functions of the car stereo, at least the common functions, can be called.

Fig. 1. Voice-controlled car stereo.

The speech recognizer employs speaker-independent command words and speaker-dependent, i.e. user-defined, words. The speaker-dependent part is used for user-defined names of radio stations. The station names are trained as follows: the radio station to be programmed is tuned in and, when the training mode is entered, the user is asked to speak the name of the station, which of course need not be the 'official' name, a couple of times (typically 2 to 4 times). Afterwards this station can be tuned in by speaking its given name. This scenario is very much like the conventional programming of presets.

A typical usage scenario is as follows. The car stereo is turned on by saying "radio" or "turn the radio on" while pressing and holding down the recording button in the middle. The keyword 'radio' is recognized and surrounding phrases, as in the second command, are discarded by the keyword spotting feature. The resulting action is that the car stereo is turned on and starts to play. Stations can now be tuned by speaking their names, such as 'BBC' or 'change to BBC now'. The command word 'volume' will program the up and down keys with the volume function, i.e. pushing the up-key will increase the volume, pushing the down-key will decrease the volume. But many other functions can also be realized with these keys. To mention just one more, the command 'scan station' will program the keys to assume the tuning function: pushing the up-key looks for radio stations by increasing the frequency, pushing the down-key by decreasing the frequency. We have thus introduced the concept of speech-programmable softkeys.
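The softkey mechanism can be summarized in a few lines of code. The sketch below is our own rendering of the idea, not the product firmware; the function names and the tuning step are assumptions made for illustration only.

class CarStereo:
    def __init__(self):
        self.volume = 5
        self.frequency = 98.0
        self.softkey = None                      # what the up/down keys currently do

    def step_volume(self, direction):
        self.volume += direction

    def scan_station(self, direction):
        self.frequency += 0.1 * direction        # stands in for a real station seek

    def on_speech(self, command):
        """A recognized command word re-programs the up/down softkeys."""
        bindings = {"volume": self.step_volume, "scan station": self.scan_station}
        if command in bindings:
            self.softkey = bindings[command]

    def on_key(self, key):
        """The up/down keys execute whatever function speech has selected."""
        if self.softkey is not None:
            self.softkey(+1 if key == "up" else -1)

stereo = CarStereo()
stereo.on_speech("volume")
stereo.on_key("up")            # volume goes up
stereo.on_speech("scan station")
stereo.on_key("down")          # tunes downwards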

In the following it shall be shown how the guidelines presented have been turned into practice in this application.

Give the user the choice of input modality

The voice control has been added to the ordinary key control. For the most common functions the user has the choice between voice and key control. For tuning a radio station, for example, he can either press a preset button or speak the station’s name. During use he can switch between voice and key control whenever he wants. There is no need to stick to the input medium once chosen. The key control serves as a fall-back mechanism in case of misrecognitions.

Be consistent

The operation is independent of the music source selected, such as radio or CD. Scanning, for example, is done in the same way for radio stations as for CD tracks. Consistency is not only ensured between the two functional parts of the car stereo but also between the two input modalities. When scanning radio stations, for example, pressing the 'Next' key has the same effect as speaking 'next'; i.e. the feedback and the possible further functions are identical.

Provide appropriate feedback

In this application the feedback is mostly implicit since misrecognitions do not have fatal effects. After tuning in another radio station, for example, the user gets implicit feedback by the changing sound and in addition explicit feedback by the display of the new station name.

Take into account the user's expectations

The user may overestimate the capabilities of the machine. In order to avoid confusion by wrong expectations, the machine guides the user when he speaks in such a manner that the machine is not able to understand him.


Do not overload the voice input channel

Voice control is ideal for a straightforward selection, e.g. for the selection of a preset radio station. Voice control is not suited for all kinds of fine tuning; i.e. the adjustment of volume, bass, treble and fading is better done by keys. The trade-off between voice and key control also concerns the activation of the speech recognizer. In principle the recognizer could be activated by pressing a button or speaking a codeword. Although speech activation may be desirable in the car environment, we chose activation by a button. For controlling the car stereo the middle button in Fig. 1 has to be pressed and held down for the time a speech command is uttered. Currently, activation by speech is still too error prone and an unintentional activation is very costly in terms of user dissatisfaction. In addition, vocal activation tends to be tiring, since each command word has to be preceded by an activation word.

4. Usability engineering lifecycle

In this section we present the usability engineering lifecycle, i.e. the process of developing a usable voice control. The lifecycle itself is explained in the first subsection, whereas the second subsection illustrates it by means of an example [8].

Fig. 2. The usability engineering lifecycle: functional specification, concept development, dialogue specification, rapid prototyping and iterative improvement, verification.


4.1. The lifecycle

The development process is illustrated in Fig. 2. It consists of five phases which are described in the following.

Functional specification

The result of the functional specification phase is the feature set of the device. First, all conceivable features are listed and assessed according to several criteria, such as frequency or complexity of use. These criteria span a space in which all the features can be located. The definition of the feature set is basically a trade-off between utility and complexity of use. It is of course to some extent arbitrary, but the assessments are based on user enquiries or market studies.

Concept development

The result of the concept development phase is an outline of the man-machine dialogue. This outline is a dialogue specification on the abstract level. Different concepts may be developed in parallel as competitive solutions in this phase.

In order to avoid misguided designs at an early stage of the development process, the concepts are tested by means of simple mock-ups or Wizard-of-Oz simulations [9]. If competing concepts are available, the most promising one can thereby be identified.

Dialogue specification

The result of the dialogue specification phase is the formal specification of the man-machine dialogue by means of a state transition diagram. Based on the chosen concept, the structure of the dialogue is refined until the man-machine dialogue is fully specified. During this refinement commonly accepted guidelines for dialogue design should be observed [1]. There may be certain requirements which are especially important for a particular application.

The dialogue specification is supported by a graphical state editor which is part of our development environment. The state editor allows us not only to draw state transition diagrams in an interactive style, but also to specify events, e.g. recognized words, that trigger a state transition or functions that are activated by a state transition (see Fig. 3).


Fig. 3. The dialogue editor.
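As an illustration of what such a specification contains, the sketch below renders a small fragment of an answering-machine dialogue as a state transition table in Python. The state names, command words and action names are chosen by us for illustration and do not reproduce the state editor's actual format.

# (state, recognized word) -> (next state, action)
TRANSITIONS = {
    ("playing_message", "delete"):     ("playing_message", "delete_current_message"),
    ("playing_message", "next"):       ("playing_message", "play_next_message"),
    ("playing_message", "previous"):   ("playing_message", "play_previous_message"),
    ("playing_message", "stop"):       ("global_menu",     "stop_playback"),
    ("global_menu",     "delete all"): ("global_menu",     "delete_all_messages"),
}

def step(state, word):
    """Fire the transition triggered by a recognized word, if one is specified."""
    next_state, action = TRANSITIONS.get((state, word), (state, None))
    if action is not None:
        print("executing:", action)
    return next_state

state = "playing_message"
state = step(state, "delete")      # executes delete_current_message, state unchanged
state = step(state, "stop")        # executes stop_playback, moves to global_menu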

Rapid prototyping and iterative improvement

The result of the rapid prototyping phase is a software simulation of the system according to the dialogue specification. The state editor generates a module which can be interpreted by a rapid prototyping system. The prototype is used for simple, qualitative tests with users. Observations of test persons using the system are fed back into the design cycle and thus lead to an improved dialogue and prototype.

Verification

In the verification phase the prototype is finally assessed from the human factors point of view. The assessment can be done by means of interviews, focus groups or performance measures in usability tests. It is thereby verified whether or not the usability goals defined in the concept phase have been met. If the assessment reveals that the goals are not met, a further iteration step has to be added to the design process.


4.2. Example: The voice-controlled telephone answering machine

Remote control via the telephone has been mentioned as one of the unique advantages of voice control. This advantage is the motivation for bringing speech recognition into the telephone answering machine. Currently, answering machines can be interrogated remotely by means of touch-tones, the dual-tone multifrequency signals (DTMF). In many countries the DTMF penetration is lower than 60% [10], which means that additional bleepers need to be carried along all the time. Remote control means that there is no control panel and no button to activate the speech recognizer. The activation is done context-dependently and automatically by the machine, and it is hidden from the user.

The speech recognizer employs speaker-independent command words and a speaker-dependent, i.e. user-defined, password. The spoken password is used as access control for remote interrogation and it can replace the former 4-digit PIN.

A typical usage scenario is as follows. When the answering machine picks up the line, it plays a greeting message. After the beep, where callers usually leave their message, the owner who wants to interrogate the machine remotely speaks his password. The machine starts playing the received messages, one after the other. Between messages the caller can say 'previous', 'next', 'delete' or 'replay'. 'Delete', for example, causes the system to delete the message that has just been played. The command word can be embedded in fluent speech; e.g. the user can also say 'please delete this message'.

In the following it will be shown how the usability engineering lifecycle has been turned into practice in this application.

Functional specification

For the answering machine the functional specification led to the following feature set:

Functions affecting a single message:

• Replay last message
• Delete last message
• Next message
• Previous message
• Stop playing

Functions affecting all messages or the greeting:

• Replay all messages
• Delete all messages at once
• Change the greeting
• Deactivate answering machine


Concept development

For the answering machine the dialogue is divided into two parts. The first part is when a message is being played. Here the user can activate functions that affect the single message, e.g. deleting or repeating it. After having heard all messages the second part begins. Here the user can activate global functions that affect all messages or the greeting. This concept has been taken from DTMF control.

In a Wizard-of-Oz simulation of the answering machine, the speech recognition and system control were performed by a human operator. From a control panel he could trigger the playback of messages and system announcements, depending on what the user said. A qualitative user test was conducted with eight test persons. It turned out that the concept was accepted, probably due to its compatibility with known answering machines.

As the usability goal we defined that users should perceive a clear benefit when using voice control rather than DTMF control.

Dialogue specification

For the answering machine there are three requirements on the dialogue:

• Efficiency - The remote interrogation must not take much time; calling long-distance is expensive.

• Ease of use - The remote interrogation must be intuitive to use; when calling from afar there is no manual at hand.

• Robustness - The remote interrogation must be possible even under adverse conditions, e.g. a noisy environment.

In order to make the dialogue both efficient and easy, there are two modes: the command mode and the menu mode.

In the command mode the system just prompts the user for a command. The system executes the function demanded and prompts for the next command. This mode requires the user to know about the functionality of the device and to remember the command words. The command mode is a rather efficient mode and meant for the expert user.

In the menu mode, the system guides the user by offering the available functions in acoustic menus. The user can then choose one and activate it by speaking the proposed command word. The menu mode is easy to use and meant for the novice user.

The mode is determined automatically by the system, depending on the user's behaviour. The default mode is the efficient command mode. But if the system detects that the user has any problems, e.g. because he uses unknown command words or does not continue to react, it switches to the menu mode. After any successful completion, the system falls back into the command mode.
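A minimal sketch of this mode switch is given below, assuming a simplified recognition result of our own devising ('timeout', 'garbage' or a valid command word). It is meant to illustrate the switching policy, not the actual implementation.

class AnsweringMachineDialogue:
    def __init__(self):
        self.mode = "command"                    # default: the efficient expert mode

    def on_recognition_result(self, result):
        """`result` is 'timeout', 'garbage' (unknown words) or a valid command word."""
        if result in ("timeout", "garbage"):
            self.mode = "menu"                   # the user seems to have a problem: guide him
            self.play_menu()
        else:
            self.execute(result)
            self.mode = "command"                # successful completion: back to command mode

    def play_menu(self):
        print("You can say: next, previous, delete or replay.")

    def execute(self, command):
        print("executing:", command)

dialogue = AnsweringMachineDialogue()
dialogue.on_recognition_result("garbage")        # switches to menu mode
dialogue.on_recognition_result("delete")         # executes and falls back to command mode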

Since speech recognition is never 100% reliable, especially in a noisy environment, a fallback mechanism onto a more reliable input form must be provided [4]. For the answering machine, DTMF input represents that fallback mechanism. Therefore the user always has the choice between speech and DTMF input.

Rapid prototyping and iterative improvement

For the answering machine three iteration steps were conducted. The design was regarded as satisfactory when observation of the test users no longer gave rise to criticism.

First iteration: Command words

It turned out that some of the command words were not used intuitively. In order to investigate the optimal set of command words, a test was conducted with eight users. In this test the menu mode was turned off, so that the users had no clue which command words were accepted. Given a certain task, the users were observed to see which command words they would intuitively choose.

It turned out that some command words, such as 'delete', were obvious, whereas for some functions, such as changing the greeting, no common intuitive command word could be identified. For those functions several synonyms were included in the system's vocabulary. As a result of this synonym test the vocabulary grew from 14 to 30 words.

Second iteration: Subjective impression of efficiency

After determining the vocabulary, people complained about the general inefficiency. The system was perceived as being too slow. Therefore the response times were shortened and the system announcements were spoken faster, although this might have diminished the intelligibility. In order to further shorten the system announcements, the help menus were split into two levels: at the first level just the command words are mentioned and at the second level the functions are also explained. The second level of the menu is only played if, after having heard the first level, the user still does not react appropriately.


Third iteration: Time window for speech input

A fixed time window for speech input turned out to be unacceptable. It was observed that users either tried to barge into the system announcements or they waited too long. Therefore the time window has been made flexible in the sense that the system reacts as soon as a command word has been recognized. With this flexibility it is possible to extend the time limit and thereby to satisfy the slow user as well as the fast one.
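The flexible window can be pictured as follows; the recognizer.poll() interface is a hypothetical, non-blocking call assumed only for this sketch, and the timeout value is likewise an assumption.

import time

def listen_for_command(recognizer, max_wait_s=8.0, poll_s=0.1):
    """Return the first recognized command word, or None after max_wait_s."""
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        word = recognizer.poll()     # hypothetical non-blocking call; None if nothing recognized yet
        if word is not None:
            return word              # react immediately, so fast users are not delayed
        time.sleep(poll_s)
    return None                      # slow user: a help prompt or the menu mode can take over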

After three iteration steps the usability of the remote interrogation was considered to be satisfactory and the design process considered to be completed.

Verification

For the answering machine we demanded that users should perceive a clear benefit when using voice control rather than DTMF control. It had to be shown that the majority of the test persons, even if they used the system only once, would prefer voice control.

The achievement of this goal has been checked by means of focus groups. There were three focus groups in each of two countries. People in the same focus group had similar backgrounds in using telecommunication terminals. In the focus groups the answering machine was first demonstrated and a few tasks were performed by the test persons. Feedback was collected by means of questionnaires and intensive group discussions. The focus groups revealed that there is a clear preference for voice control, though not quite as pronounced in a country with high DTMF penetration as in a country with low DTMF penetration.

5. Summary and conclusion

We have described how to design a voice control for consumer electronics so that it is perceived as a benefit by the user. More sophisticated recognition algorithms have led to more natural user interfaces and to more freedom for the designer. We presented guidelines and showed how they have been realized in the voice controlled car stereo. We further explained the design process and showed how the voice controlled answering machine has been developed in a user-centred approach.

The design of a voice control is a subtle task, since replacing button presses by speech commands does not improve the user interface at all. The right balance between voice and key control has to be found. The given examples show that a well-designed voice control can make consumer electronics more usable and more attractive, and that this technology is on the verge of penetrating the mass market.

REFERENCES

[1] J.D. Gould and C. Lewis, Designing for usability: Key principles and what designers think, Commun. ACM, 28(3) (March), 300-311 (1985).

[2] D. Jones, K. Hapeshi and C. Frankish, Design guidelines for speech recognition interfaces, Applied Ergonomics, 20(1), 47-52 (1989).

[3] L.T. Stifelman, B. Arons, C. Schmandt and E.A. Hulteen, VoiceNotes: A speech interface for a hand-held voice notetaker, Proc. INTERCHI '93, Amsterdam, pp. 179-186 (1993).

[4] T. Falck, S. Gamm and A. Kerner, Multimodal dialogues make feature phones easier to use, Proc. Applications of Speech Technology, Lautrach, pp. 125-128 (1993).

[5] R. Molich and J. Nielsen, Improving a human-computer dialogue, Commun. ACM, 33(3), 338-348 (1990).

[6] H. Thimbleby, Can anyone work the video?, New Scientist, 23 Feb., 48-51 (1991).

[7] B.R. Gaines and M.L.G. Shaw, The art of computer conversation, Prentice-Hall (1984).

[8] S. Gamm, R. Haeb-Umbach and D. Langmann, The usability engineering of a voice controlled answering machine, Proc. Int. Symp. Human Factors in Telecommunications, Melbourne, pp. 177-184 (1995).

[9] J.D. Gould, J. Conti and T. Hovanyecz, Composing letters with a simulated listening type- writer, Commun. ACM, 26(4) (April), 295-308 (1983).

[10] R.W. Bennett, A.K. Syrdal and E.S. Halpern, Issues in designing public telecommunications services using ASR, Proc. Speech Technology '92: Voice Systems Worldwide, pp. 222-229 (1992).
