Speech recognition technology for mobile phones
Stefan Dobler

Ericsson Review No. 3, 2000

Following the introduction of mobile phones using voice commands, speech recognition is becoming standard on mobile handsets. Features such as name dialing make phones more user-friendly and can improve the safety of operating a handset in an automobile.

The author briefly reviews the types of speech-recognition systems, illustrates a few typical problems of the mobile environment, discusses the efficient use of memory in mobile devices, and provides details on the technology deployed in Ericsson's speech-recognition system for mobile devices. Finally, potential enhancements aimed at speech-controlled mobile phones of the future are mentioned.

Introduction

Last year, Ericsson was one of the first mobile phone manufacturers to add an important technology to mobile phones. The T18, launched in spring 1999, was the first commercially available Ericsson GSM phone that could be operated by voice commands using automatic speech recognition, in addition to commands input via the keypad. Other members of the family of telephones using Ericsson's first generation of speech-control algorithms are the T28, R320, and A2618 (Figure 1).

Figure 1. Telephones using Ericsson's first-generation speech-control technology.

These phones use speech recognition for the new name-dialing feature. Thanks to the efficient use of memory, it is currently possible to train and store voice tags for up to 10 entries in the phone book of any of these phones. Each voice tag is trained with a single utterance by the user and assigned to a single phone-book entry. When the user wants to place a call, he pushes a button and speaks a person's name. The phone answers with the recognized voice tag as acoustic feedback, and then automatically sets up the call.

All Ericsson phones with speech-recognition capabilities also feature call answering, which allows the user to accept or reject incoming calls using voice commands. This has obvious advantages when the phone is used with hands-free equipment.

Compared to dictation products commercially available for desktop PCs, the application described here seems elementary. However, mobile phones are used every day, in a variety of locations, with every kind of background noise imaginable. Hence, the key issue for speech recognition in mobile devices is not the size of the vocabulary but the robustness of the recognition system.

On one hand, the phone must recognize speech correctly, say, in a quiet office setting, at an airport with conversations going on in the background, or in a car traveling at 150 km/h. On the other hand, care must be taken so that no incidental noise, such as a closing door or laughter, is mistaken for a valid name, which would lead to a call being set up. Also, the recognizer should work properly with any type of microphone, at a variety of distances and angles between the mouth and microphone, and despite changes from handset to hands-free equipment, all without having to retrain the vocabulary.

The need for speech recognition

There are several reasons why speech recognition is becoming a standard feature in mobile phones. Using a mobile phone while driving a car has been regarded as dangerous, because it distracts the driver. During call set-up, the driver must remove his hand from the steering wheel to punch the telephone number into the keypad. To check the number, he has to take his eyes off the road and look at the display on the telephone. During the conversation, he needs to hold the telephone in his hand. Legal restrictions that prohibit a driver from using a mobile phone while operating the vehicle, unless the driver is using hands-free equipment, already exist or are being introduced (in Japan and Germany, for example).

The use of hands-free equipment is a step forward, as it allows the driver to keep his hands on the steering wheel during the conversation. During call set-up, speech recognition allows the user to speak the name of the person instead of keying in the telephone number. In addition, acoustic feedback of the recognized name can help to keep the driver's eyes on the road and hands on the steering wheel. Consequently, speech recognition can make it safer to use a mobile phone in a car.

Apart from gains in functionality, the development of mobile phones has been dominated by a decrease in the physical size of handsets in recent years. Increasingly smaller devices are being produced. In addition, customers are demanding bigger displays for new services. As a result, telephone keypads have diminished to a size that sometimes makes them awkward to operate. Every now and then, the news media feature "pen-like" phones with no keypad at all.

Speech is considered the most natural means of communication for humans. Thus, automatic speech recognition can become a natural user interface for mobile phones. It reduces manual interaction with mobile phones. In name dialing, for example, instead of searching the telephone book for a particular name, a user needs only speak the name, and the telephone automatically sets up the call.

New ways of using mobile phones are continuously being introduced. For example, Bluetooth headsets will allow the use of mobile phones with cordless portable hands-free equipment. These headsets and a speech-recognition interface offer a useful alternative to phone keypads. Other innovations will also heighten the need for reliable speech recognition.

Types of recognition system

The choice of the appropriate type of recognition system, or recognizer, for a mobile phone is crucial. Each type of recognizer has its own advantages and disadvantages.

Isolated-word vs. connected-word recognition

Isolated-word recognizers can recognize a single word in a recognition window. To have them recognize a sequence of words, the speaker must pause between each word to terminate and restart recognition, which results in unnatural pronunciation.

As the name implies, connected-word recognizers allow the speaker to say several keywords with no artificial pause between words. The cost is higher complexity compared to isolated-word recognizers. One problem for a connected-word recognizer is coarticulation, the situation in which words are pronounced differently if spoken in a connected fashion, such as a sequence of digits. Word boundaries often disappear and words melt together. However, because of the natural style of speaking, connected-word recognizers are much more user-friendly than isolated-word recognizers and should thus be preferred.

Speaker-dependent vs. speaker-independent recognition

Users of mobile phones with current speech-recognition technology have to train their phones before they can use features such as name dialing. Training is seen as a cumbersome task for users. However, one advantage of such a speaker-dependent system is that it is language-independent. Any user can train his or her phone in any language desired. This simplifies the introduction of speech control as a new feature in a product such as a mobile phone, which is typically released worldwide in different countries where different languages are spoken. In addition, because the system learns each individual user's speaking behavior, performance is superior to that of speaker-independent recognizers.

Speaker-independent recognizers are pre-trained using a large sample of human speakers. A user can immediately start using such a system, which makes operation much easier. One drawback of such a system is that it is impossible to model all possible speaker variations for a language. Thus, there will always be a certain percentage of users whose phones will achieve sub-optimal performance. One solution to this problem would be to give users the option of training all or some words of the vocabulary that do not work well for them. This would require combining speaker-dependent and speaker-independent recognition in a single device.

Small-vocabulary vs. large-vocabulary recognition

For command-and-control applications, small-vocabulary recognition systems with vocabularies of up to 100 words are sufficient. These recognizers are normally word-based, which means each vocabulary element consists of a word. The complexity and memory space required are limited.

In contrast, a large-vocabulary recognizer for the dictation of letters or e-mail can contain up to 100,000 words. The basic vocabulary elements are smaller sub-units of speech, like phonemes, which are used to build up the words of the vocabulary. The complexity and memory-space requirements are very high. Several dozen megabytes of memory are needed to implement a true dictation system, which is an unacceptable requirement for a mobile phone.

BOX A, TERMS AND ABBREVIATIONS

Feature extraction: A preprocessing procedure in the speech-recognition process that transforms the speech signal into a spectral representation.
Feature vector: The result produced by feature extraction.
HMM: Hidden Markov model.
MIPS: Mega-instructions per second.
MMI: Man-machine interface.
Pattern matching: Also referred to as search; the procedure in speech recognition that matches the incoming utterance of the user against voice tags. The voice tags are also sometimes called patterns.
RAM: Random access memory.
SMS: Short message service.
Vocabulary: Collective term for all trained voice tags.
Voice tag: A representation of speech suited to a speech recognizer.

Challenges

Acoustics

Compared to mobile phones, fixed-line telephones typically offer a stable acoustic environment. They are used in the same room, such as an office or living room, with almost the same acoustic background every day.

In mobile communications, background noise is always present and extremely variable. Mobile devices are used in every imaginable environment. The setting could be an office or an airport, a railway station, or even outdoors, with an acoustically challenging environment. Automotive interiors—with interfering background noise that depends on the type of car, the speed at which it is traveling, and so forth—are a challenge faced by mobile phones alone. Consequently, in mobile communications no assumptions can be made about the acoustic properties of the operating environment or background noise.

Also, a certain proportion of mobile users frequently change from handset to hands-free operation, either with built-in hands-free equipment in a car or with portable hands-free accessories. This causes large variations in the speech signal in addition to the conventional variation of attenuation from user to user.

Figures 2–4 show differences in speech signals caused by background noise in typical settings. The first illustration shows the speech signal of a word that has been spoken without background noise. Recognition of the spoken word should be a fairly simple task for any recognizer.

The second illustration shows the speech signal produced by the same word spoken with high stationary background noise. This situation corresponds roughly to hands-free operation in a car. The distance from the mouth to the microphone is about 30 centimeters. Background noise is caused by the engine, wind noise, and passing cars. A comparison of the speech signals in these two cases suggests problems of recognition. Increasing background noise degrades the performance of recognizers. There are techniques, such as spectral subtraction, for coping with noise in preprocessing or feature extraction as long as the noise is stationary.

Figure 2. Without noise—speech file of the word repeat.

Figure 3. With stationary noise—speech file of the word repeat.

A more extreme situation is shown in the third speech signal (Figure 4). Here, non-stationary noise comes from people talking in the background. This is the most demanding situation for an automatic speech recognizer, because methods dealing with stationary noise in feature extraction do not work here. The background noise is also human speech, so statistical methods do not help to distinguish between the desired input from the user and the undesired input from people talking in the background. Nevertheless, this is a rather typical situation for mobile phone use, and users expect their mobile phones to operate in all possible acoustic environments.

Hardware restrictions

Some characteristics of mobile phones influence the speech-recognition systems that can be used. In general, mobile handsets are
• very small devices;
• produced in large volumes; and
• highly cost-optimized.
This considerably restricts the hardware that runs the speech-recognition algorithms, as it limits capacity in terms of computational load, memory size, and memory access speed.

Computational load

To minimize the size and cost of handsets, no extra processor is allowed for the sole purpose of speech control. Existing computing resources have to be shared with other applications in the phone. Nevertheless, the digital signal processors used in mobile phones today offer performance of at least 40 mega-instructions per second (MIPS), and up to 180 MIPS. Hence, this bottleneck is not a severe problem.

Memory size

The first generation of Ericsson phones with speech control had a vocabulary of 10 names. Ericsson Research has developed a solution that minimizes the additional memory required to store the voice tags and the auditory feedback of trained names. The problem of memory diminishes with modern system-programmable flash memory. This allows a relatively large vocabulary for name dialing and the implementation of new speech-controlled features in mobile phones. However, the memory available is still some orders of magnitude too small to allow the implementation of a large-vocabulary system for e-mail dictation and similar applications.

Memory access speed

Mobile phones typically consist of a multiprocessor solution with different memories attached to the available processors. The transfer of data between these memories tends to be a problem, because the normal hardware architectures of mobile phones were not designed for the rapid data-transfer rates needed for automatic speech recognition. The computational performance of the digital signal processors is good, but the performance of the processors' random access memory (RAM) is rather limited. Thus, access to the control processor's flash memory is required.

Technology

The speech-recognition process

By definition, user-dependent speech-recognition systems require each user to run a training program to create the vocabulary for the recognizer. The program reads the sequence of feature vectors that correspond to an utterance of the word to be trained and then creates and stores a word model of it. When the word is later used as a command, the system extracts features, searches to match the sequence of features in the utterance, and then runs confidence assessment (Figure 5).

Figure 4. With non-stationary background noise—speech file of the word repeat.

The speech signal v(t) picked up by the microphone is fed into the feature-extraction function, which extracts essential information from the signal. The output of this function is the feature vector, yk, which describes the basic acoustic properties of the speech signal for a given time interval; k denotes the index of a time interval. Typically, a new feature vector is generated every 10 ms.

The feature vectors are fed into the search function, which has the additional input of the trained vocabulary—in the case of mobile phones, this normally consists of single-word models. The task of the search function is to match the sequence of incoming feature vectors, yk, against the available vocabulary. The most likely vocabulary word becomes the recognition result. If a user unintentionally activates the recognizer, any speech signal activity could produce a result, thereby setting up a call. Obviously, this is unacceptable, so an additional block has been added: the confidence assessment function.

After the search has finished a recognition pass, the confidence assessment function reads parameters from the feature extraction function (such as the signal-to-noise ratio during recognition) and from the search function (such as the distance between the first- and second-best match). Based on these parameters, a decision is made whether to accept the results and transfer them to the overlying man-machine interface (MMI) or to reject the results of the search.

Figure 5. General block diagram of a speech recognizer: feature extraction turns the input signal V(t) into feature vectors Yk, which the search matches against the trained vocabulary; confidence assessment then accepts or rejects the result.

Figure 6. Main blocks of the feature extraction function: Hamming window, Fourier transformation, power density spectrum, shape of the power density spectrum, noise estimation, logarithm, and normalization.

Feature extraction

Figure 6 shows the main blocks of the feature extraction function. The speech signal, v(t), which consists of a continuous stream of speech samples, is partitioned for frame processing. Based on a sampling frequency of 8 kHz, typically 256 samples (that is, a 32 ms window length) are combined to build a frame, with an overlap of 176 samples between consecutive frames. Applying a Hamming window to each frame yields smoothing between frames. The Fourier transformation transforms the signal from the time domain into the frequency domain. This yields a feature vector that represents the basic spectral properties of the speech signal.
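The framing described above can be sketched in a few lines. This is an illustrative reconstruction, not Ericsson's implementation; the function name and the synthetic test signal are invented for the example, but the parameters (8 kHz sampling, 256-sample frames, 176-sample overlap, Hamming window) come from the text.

```python
import numpy as np

def frames_power_spectrum(v, frame_len=256, overlap=176):
    """Split a speech signal into overlapping frames, apply a Hamming
    window, and return the power density spectrum of each frame.

    At 8 kHz sampling, 256 samples = 32 ms windows, and a hop of
    256 - 176 = 80 samples yields one frame every 10 ms, as in the text."""
    hop = frame_len - overlap
    n_frames = 1 + (len(v) - frame_len) // hop
    window = np.hamming(frame_len)
    spectra = []
    for k in range(n_frames):
        frame = v[k * hop : k * hop + frame_len] * window
        spectrum = np.fft.rfft(frame)            # time -> frequency domain
        spectra.append(np.abs(spectrum) ** 2)    # phase discarded: power only
    return np.array(spectra)

# One second of a synthetic 8 kHz signal -> one frame every 10 ms.
signal = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
power = frames_power_spectrum(signal)
print(power.shape)  # (97, 129): 97 frames of 129 power bins each
```

Note that the power spectrum is taken per frame, matching the later step in which the complex Fourier spectrum is replaced by the power density spectrum.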

Subsequent processing steps use the power density spectrum instead of the complex Fourier spectrum. Investigations have found that human hearing is insensitive to phase variations, so the phase information of the signal is removed with this step. The fine structure of the power density spectrum carries extensive information on the speaker and his or her vocal characteristics. The information of the spoken utterance itself can be found in the envelope, or shape, of the spectrum. The envelope is derived by downsampling the spectrum at distinct center frequencies, which are distributed in a nonlinear fashion over the frequency axis, again similar to human hearing. The Ericsson recognition system uses 15 center frequencies, indexed by µ.
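The downsampling step can be illustrated with a small filter bank. The article states only that the center frequencies are spaced nonlinearly, in a hearing-like fashion; the mel spacing and triangular weighting below are assumptions made for the sketch, and the function name is invented.

```python
import numpy as np

def band_energies(power_spectrum, n_bands=15, fs=8000):
    """Downsample a power density spectrum to n_bands energies at
    nonlinearly spaced center frequencies, approximating the spectral
    envelope. Mel spacing and triangular weights are assumptions."""
    n_bins = len(power_spectrum)
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_bands + 2 edge frequencies, equally spaced on the mel scale
    edges_hz = mel2hz(np.linspace(0.0, hz2mel(fs / 2), n_bands + 2))
    bins = np.floor(edges_hz / (fs / 2) * (n_bins - 1)).astype(int)
    energies = np.empty(n_bands)
    for mu in range(n_bands):
        lo, ctr, hi = bins[mu], bins[mu + 1], bins[mu + 2]
        # triangular weighting: rising to the center frequency, then falling
        weights = np.concatenate([
            np.linspace(0.0, 1.0, ctr - lo, endpoint=False),
            np.linspace(1.0, 0.0, hi - ctr, endpoint=False),
        ])
        energies[mu] = np.sum(power_spectrum[lo:hi] * weights)
    return energies

bands = band_energies(np.ones(129))  # flat spectrum in, 15 band energies out
print(bands.shape)  # (15,)
```

The nonlinear spacing places narrow bands at low frequencies and wide ones at high frequencies, mimicking the frequency resolution of human hearing.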

Stationary background noise during recognition is one important problem. Background noise can be modeled as an additive noise component N to the speech signal S in the power density domain. For each of the center frequencies, this can be described as:

XΔ(µ) = S(µ) + N(µ)

The noise level N is estimated in the noise-level estimation block. Subtracting the estimate of the background noise minimizes the influence of the noise signal on XΔ, making the system more robust to stationary background noise.
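A minimal spectral-subtraction sketch, following the additive model XΔ(µ) = S(µ) + N(µ). Estimating N from the first few frames (assumed to be speech-free) and flooring the result to keep energies positive are common conventions assumed here; the article does not specify how the noise level is estimated.

```python
import numpy as np

def spectral_subtraction(band_frames, n_noise_frames=10, floor=1e-3):
    """Subtract an estimate of the stationary noise N(mu) from each
    frame's band energies X(mu), per the model X = S + N. The noise
    estimate is the mean of the first n_noise_frames (assumed to
    contain no speech); results are floored to stay positive."""
    noise = band_frames[:n_noise_frames].mean(axis=0)   # estimate of N(mu)
    cleaned = band_frames - noise                        # X(mu) - N(mu) ~ S(mu)
    return np.maximum(cleaned, floor * band_frames)

rng = np.random.default_rng(0)
frames = 2.0 + 0.1 * rng.standard_normal((50, 15))  # stationary noise floor
frames[20:30] += 5.0                                 # ten frames of "speech"
cleaned = spectral_subtraction(frames)
print(cleaned[25].mean() > cleaned[5].mean())  # True: speech frames stand out
```

As the article notes, this only helps against stationary noise: a time-varying estimate of N would be needed for babble noise, and there the statistics of the interference match the desired signal.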

Another problem is an acoustic environment with variations in microphone levels. One means of coping with this problem is to use the logarithm of the power density spectrum. XΔ(µ), which is the µth center frequency before logarithmization, is multiplied by a constant factor, V:

XΔ,log(µ) = 10 log[V · XΔ(µ)] with µ ∈ [1...15]

Bear in mind that for the Ericsson system, µ is between 1 and 15. The factor V stems from conditions such as one user's voice being louder than another user's voice. The above formula can thus be rewritten as:

XΔ,log(µ) = 10 log[V] + 10 log[XΔ(µ)]

The constant attenuation factor V is decoupled in a sum from the desired signal XΔ(µ). Using differential parameters is an effective means of removing the additive term of the attenuation factor V. The mean energy of all center frequencies is subtracted and added as a 16th component. Further normalization steps lead to the final feature vector Yk.
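The decoupling of the gain factor V can be demonstrated numerically. The sketch below is illustrative: the function name is invented, and the article's "further normalization steps" are reduced here to the mean subtraction alone.

```python
import numpy as np

def log_normalize(band_energies):
    """10*log10 compression followed by subtraction of the mean over
    the bands. A constant gain V multiplies every band energy, so after
    the logarithm it becomes the additive term 10*log10(V), which the
    mean subtraction removes. The subtracted mean is appended as a
    16th component, as the article describes."""
    log_bands = 10.0 * np.log10(band_energies)
    mean = log_bands.mean()
    return np.append(log_bands - mean, mean)

x = np.array([1.0, 2.0, 4.0] + [1.0] * 12)              # 15 band energies
quiet, loud = log_normalize(x), log_normalize(8.0 * x)  # same voice, gain V = 8
print(np.allclose(quiet[:15], loud[:15]))  # True: the gain is decoupled
```

The first 15 components are identical regardless of V, so the recognizer sees the same feature vector whether the user speaks loudly into a handset or quietly into a distant hands-free microphone.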

Figure 7 shows an example of the final result of feature extraction: the sequence of feature vectors for the spoken English word one. The 16 feature-vector components are drawn over the sequence of 10 ms steps.

Figure 7. Sequence of feature vectors for the spoken English digit one (male speaker); the feature-vector components are plotted as attenuation over time in steps of 10 ms.

Search

As already mentioned, the search (or pattern-matching) function has the task of matching the sequence of incoming feature vectors from a user's utterance with the trained vocabulary of the recognizer. Ericsson's speech-recognition system employs hidden Markov models (HMM) to model human speech. With this technique, each word of the vocabulary consists of an HMM. The structure of an HMM (Figure 8) consists of a chain of states (denoted q), with each state describing a segment of the vocabulary word. Thus, q1 models the start, and q8 models the end of the word.

States are connected with transitions, which facilitate state changes, depending on the transition probabilities, aij. Emission probabilities, bi, which express the spectral similarity of a feature vector with a time interval of the reference, are attached to the states. Starting from the initial state, q1, different paths through the model can be used, depending on the sequence of incoming feature vectors. The repetition or skipping of states allows adaptation to variations in the rate of speech of the user. A word is recognized if a path through the reference has reached its final state, q8, with a reliable probability.
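The path search through such a left-to-right model can be sketched with the Viterbi recursion. The topology (self-loop, advance, skip) follows Figure 8, but the log-domain scoring, the toy four-state model, and the uniform transition probabilities are assumptions made for this example.

```python
import numpy as np

def viterbi_score(log_emissions, log_trans):
    """Log score of the best path through a left-to-right HMM.
    log_emissions[k, i] is log b_i for feature vector y_k in state q_i;
    log_trans[i, j] is log a_ij, finite only for j in {i, i+1, i+2}
    (repeat, advance, or skip a state). Paths start in the first state
    and must end in the final state for the word to be recognized."""
    n_frames, n_states = log_emissions.shape
    score = np.full(n_states, -np.inf)
    score[0] = log_emissions[0, 0]
    for k in range(1, n_frames):
        best_prev = (score[:, None] + log_trans).max(axis=0)
        score = best_prev + log_emissions[k]
    return score[-1]

# Toy 4-state word model: repeat, advance, and skip equally likely.
n_states = 4
log_trans = np.full((n_states, n_states), -np.inf)
for i in range(n_states):
    for j in (i, i + 1, i + 2):
        if j < n_states:
            log_trans[i, j] = np.log(1.0 / 3.0)
matching = np.full((6, n_states), -5.0)     # utterance that fits the model
for k in range(6):
    matching[k, min(k // 2, n_states - 1)] = -0.1
mismatch = np.full((6, n_states), -5.0)     # utterance that fits nothing
print(viterbi_score(matching, log_trans) > viterbi_score(mismatch, log_trans))
# True
```

In the recognizer, one such score is computed per vocabulary word, the best-scoring word becomes the result, and quantities such as the gap to the second-best score feed the confidence assessment.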


Figure 8. Structure of a vocabulary word as a hidden Markov model: a chain of states q1 through q8 with self-loop transitions (a11, a22, ...), forward transitions (a12, a23, ...), skip transitions (a13, a24, ...), and emission probabilities b1, b2, ... attached to the states.

Figure 9. Cordless Bluetooth headset.



Future speech MMI for mobile phones

Experience gained from the first generation of voice control shows that the technology is robust enough to be used in mobile devices. A vocabulary of 10 names was sufficient as a starting point, but there is demand for larger name-dialing vocabularies. While the preferred size differs greatly among individuals, most users desire a larger vocabulary than is currently available.

Up to now, the main application of speech recognition has been name dialing. Nevertheless, more functions in a mobile phone will be controlled by speech in the future. For example, a function can be directly activated with voice commands that might otherwise require a sequence of keypad entries in a layered menu interface structure.

The idea of speech shortcuts to menus is appealing, but more important is a speech MMI that does not leave the user dependent on visual feedback or complicated keypad interaction once he has used a voice command. A clear illustration of this issue is the use of mobile phones in a hands-free environment. In addition to current use in hands-free equipment in cars, hands-free operation will soon make use of cordless headsets based on Bluetooth technology (Figure 9).

The use of a cordless headset will allow a user to leave his or her telephone in a bag or out of reach. This requires control of at least the basic phone functions (accepting or setting up a call, phone-book administration) via the headset. The hands-free user will be dependent on audio feedback as a complement to speech input. Consequently, functions such as call set-up, switching profiles, and memo recording are suited to speech control, while functions such as viewing the calendar or the short message service (SMS) inbox are not. Through careful selection of functions, the speech MMI will become invaluable to the mobile user.

The technology is still relatively elementary, owing to limited resources in mobile phones. Both computing power and memory size will increase in the future, allowing more sophisticated and user-friendly technology, such as connected-word recognition. It has also been suggested that speech-recognition technology be used for applications requiring extensive vocabularies (more than 10,000 words), such as e-mail dictation. Although the resources in mobile phones will improve, the memory needed to host complex speech recognizers might never fit in a phone. However, one solution might be to combine terminal-based recognizers for frequently used telephone functions with complex network-based recognizers.

Conclusion

Despite the unique challenges posed by the mobile user's environment, robust speech-recognition systems are already used in many advanced handsets. Speech MMI will become increasingly important as mobile phone displays get bigger and their keypads smaller. Although hardware restrictions are likely to limit the size of on-board vocabularies for some time, users can look forward to functionality enhancements.

