
Philips J. Res. 49 (1995) 381-397

SPEECH RECOGNITION ALGORITHMS FOR VOICE CONTROL INTERFACES

by R. HAEB-UMBACH, P. BEYERLEIN and D. GELLER
Philips GmbH Forschungslaboratorien, Weisshausstrasse 2, D-52066 Aachen, Germany

Abstract

Recognition accuracy has been the primary objective of most speech recognition research, and impressive results have been obtained, e.g. less than 0.3% word error rate on a speaker-independent digit recognition task. When it comes to real-world applications, robustness and real-time response might be more important issues. For the first requirement we review some of the work on robustness and discuss one specific technique, spectral normalization, in more detail. The requirement of real-time response has to be considered in the light of the limited hardware resources in voice control applications, which are due to the tight cost constraints. In this paper we discuss in detail one specific means to reduce the processing and memory demands: a clustering technique applied at various levels within the acoustic modelling.

Keywords: automatic speech recognition; small-vocabulary systems; robustness; acoustic-phonetic modelling; state and density clustering.

1. Introduction

Automatic speech recognition has been a topic of research for many years. It is, however, primarily in the past few years that the technology has matured enough to be employed in a large range of applications. There are two main reasons why this has occurred. Firstly, improvements in speech recognition algorithms have led to more robust and reliable systems which can cope with real-world and not only laboratory-controlled environments. Secondly, the cost of computation and memory has decreased sharply, which is reflected by the rapid growth in computing capability provided by modern digital signal processing (DSP) chips.

Today, the cost of the speech recognition feature approaches a range which makes it accessible to everyday consumer and telecommunications products. Speech recognition algorithms for voice control interfaces, which operate with a recognition vocabulary of up to 100 words, can be implemented on a single DSP. An increasing number of products already include a DSP for other purposes; e.g. modern mobile telecommunication terminals often employ a DSP to carry out the complex signal processing tasks of those systems. This resource can then be shared, and speech recognition comes with only minor extra cost, primarily due to extra memory.

Voice control means the ability of a machine to react to spoken commands.

The goal of this technology is to provide enhanced access to machines via voice commands. However, voice input often has to compete with well-established conventional means of input, e.g. keyboards. A speech recognizer will only be successful if it meets the following requirements:

• It exploits the unique properties of the speech input mode such that the user perceives an actual benefit of using it rather than a conventional input mode. This topic is extensively dealt with in an accompanying article in this issue [1].

• It is accurate: the system must achieve a specified level of performance, e.g. recognition accuracy greater than 95%, so that the user is motivated to continue using the system.

• It is robust: both with respect to changing environmental conditions and in the way users interact with the system. The latter is particularly important if the users are unfamiliar with the technology.

• It provides real-time response: the user must be provided with a timely system response. Without sufficiently fast feedback the user does not feel in control of the system.

While the first requirement addresses the user interface point of view, the other requirements are mainly determined by the recognition algorithms and their implementation. Recognition accuracy has been the primary objective of most speech recognition research, and impressive results have been obtained, e.g. less than 0.3% word error rate for a speaker-independent digit recognition task [2]. When it comes to real-world applications, robustness might even be the more important issue. It has thus become an active field of research in recent years. We will review some of this work and discuss one specific technique, spectral normalization, in more detail in Section 2.

The last requirement of real-time response has to be considered in the light of the limited hardware resources in voice control applications, which are a result of the tight cost constraints. The problem to be addressed has two points of view:


• Develop algorithms to obtain the best performance given restricted hardware resources (processor and memory limitations).

• Devise or adapt the hardware so that it is optimally suited to solve the given recognition task.

In this paper we mainly take the first point of view and discuss in detail one specific means to reduce the processing and memory demands: a clustering technique applied at various levels within the acoustic modelling. This will be discussed in Section 3. The results are summarized in Section 4.

2. Towards robust speech recognition

2.1. Acoustical environmental variability

The causes of acoustical environmental variability can be subdivided into factors which remain constant through the course of an utterance or a session, such as recording equipment or room acoustics, and factors which may vary even within a single utterance, e.g. background noise.

The major contributions to acoustical environmental variability are as follows [3]:

• Changing channel characteristics: This problem arises if the room acoustics are changed or if the recording equipment is changed (e.g. a different microphone, change of the telephone channel for telephone speech). This kind of distortion of the speech signal is most often modelled by passing the speech signal through a linear filter.

• Input level: Level changes result from speakers speaking with different volume or from changes of orientation or distance to the microphone.

• Additive noise is present in many applications. One prominent example is speech recognition in a car. If training and testing are carried out at the same noise level, the recognizer can cope quite well even with low signal-to-noise ratios. If there is a mismatch between training and recognition, this can cause severe performance degradation.

• Different speaking styles may result in different spectral characteristics. One example is speech spoken in the presence of noise.

• Extraneous speech by interfering speakers.

One approach to render a speech recognizer more robust is multistyle training: instead of using a model for the environmental variability, this approach consists of using a database for training that contains the variability to be expected in the test conditions. The technique has been successfully used with hidden Markov models (HMMs) because of their powerful modelling abilities in different contexts. However, this approach is often impractical due to the lack of sufficient training data. Therefore a mismatch between training and recording environment can often not be avoided.

Much research has been devoted to the problem of devising algorithms which cope with such a mismatch. The approaches can be divided into the following two broad categories:

• Signal processing techniques in the speech recognition front end. Examples of this are the use of microphone arrays, speech enhancement techniques and measures to extract more robust feature vectors.

• Statistical modelling techniques of speech and noise. These include transformation and adaptation techniques of hidden Markov models of noise-free speech to the noisy case.

An overview of these techniques may be found in [3, 4]. In the following we will present one technique of the first category in more detail: spectrum normalization.

2.2. Spectrum normalization

The speech signal at the speech recognizer input is typically modelled as a clean speech signal plus an additive noise term, observed at the output of a linear time-varying filter representing the slowly changing channel transfer function. In the speech recognizer front end considered here the input signal is sampled and blocked into overlapping frames. For each frame a Fourier transform is computed. In the frequency domain the speech signal may thus be expressed as

X(f, t) = [N(f, t) + S(f, t)] \cdot H(f, t)    (1)

where S(f, t), N(f, t) and X(f, t) denote the Fourier transforms of a block of the pure speech signal, the noise signal and the input signal to the speech recognizer, respectively; H(f, t) is the transfer function of the recording channel. Note that we assume non-stationary signals; i.e., the spectral characteristics are a function of the time t. Further, the channel transfer function is assumed to be time-varying.

We compute mel-frequency log-spectral coefficients [5]. If we assume that the speech signal corrupted by noise can be modelled as the maximum of the speech signal and noise over time for each frequency band [6], then we obtain from eq. (1) the following equation in the log-spectral domain:

x(k, t) \approx r(k, t)\,s(k, t) + (1 - r(k, t))\,n(k, t) + h(k, t)    (2)


where

r(k, t) = \begin{cases} 1 & \text{if } s(k, t) > n(k, t) \\ 0 & \text{otherwise} \end{cases}    (3)

s(k, t), n(k, t), h(k, t) and x(k, t) are the logarithms of the power spectral densities of the pure speech signal, the noise signal, the transfer function and the compound signal at the input of the recognizer, respectively; k denotes the frequency subband index and t is the discrete time index.

For most environmental variabilities it is justified to assume that the channel transfer function varies only slowly with time compared to the rate at which speech changes; i.e., h(k, t) can be considered a low-pass signal with respect to the time index t. The influence of the transfer function can then be eliminated by high-pass filtering of the spectral subband envelopes. Furthermore, it is well known that high-pass filtering of the subband envelopes suppresses speaker-specific characteristics of the speech signal. Different techniques for high-pass filtering have been applied.

2.2.1. Utterance mean subtraction

In the following we assume that h(k, t) is constant with respect to time t within one utterance: h(k, t) = h(k). Then for each subband k the mean value of x(k, t) over an utterance of length T is

\bar{x}(k) = \frac{1}{T} \sum_{t=1}^{T} x(k, t)
           = h(k) + \frac{1}{N_r(k)} \sum_{t=1}^{T} r(k, t)\,s(k, t) + \frac{1}{T - N_r(k)} \sum_{t=1}^{T} \big(1 - r(k, t)\big)\,n(k, t)    (4)

where

N_r(k) = \sum_{t=1}^{T} r(k, t).    (5)

Now it is easily seen that y(k, t) := x(k, t) - \bar{x}(k) is independent of the channel transfer function h(k):

y(k, t) = x(k, t) - \bar{x}(k)
        = r(k, t)\,s(k, t) + \big(1 - r(k, t)\big)\,n(k, t)
          - \frac{1}{N_r(k)} \sum_{t=1}^{T} r(k, t)\,s(k, t) - \frac{1}{T - N_r(k)} \sum_{t=1}^{T} \big(1 - r(k, t)\big)\,n(k, t)    (6)

However, it can be observed that y(k, t) depends on the relative amount of speech frames within an utterance (note that each utterance is composed of speech and silence frames). Thus this technique only works well if this amount is roughly constant for all utterances, both in training and test.

The above utterance mean computation can be considered as a filtering operation with a finite impulse response filter of length one utterance with tap weights equal to 1/T. Note that the overall mean subtraction operation is equivalent to a high-pass filter operation. Further note that the mean computation introduces a processing delay of one utterance for all subsequent operations.

A similar operation can also be carried out on cepstral features, where the technique is known as cepstral mean subtraction [3].

The above processing is also helpful to compensate for differences between the noise conditions in training and test: if the noise statistics can be considered to be stationary, i.e. n(k, t) ≈ n(k), then the noise term is also suppressed by the high-pass filtering.
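In terms of implementation, the subtraction of the utterance mean of eqs. (4)–(6) amounts to a few lines of array arithmetic. The following sketch is an illustration only, not the described system: it assumes the subband log-spectra of one utterance are already available as a NumPy array of shape (T, K), and the function name is ours.

```python
import numpy as np

def utterance_mean_subtraction(log_spectra):
    """Remove the per-utterance mean from each spectral subband.

    log_spectra: array of shape (T, K) holding the log power of K subbands
    for T frames of one utterance, i.e. x(k, t) in the text.
    Returns y(k, t) = x(k, t) - mean_t x(k, t), which is independent of a
    channel transfer function h(k) that is constant over the utterance.
    """
    mean_per_band = log_spectra.mean(axis=0)   # one mean value per subband k
    return log_spectra - mean_per_band         # broadcast over all frames

# Example: a constant per-band offset (a channel h(k)) disappears.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 15))             # 200 frames, 15 subbands
    offset = rng.normal(size=15)               # h(k), constant over time
    y_clean = utterance_mean_subtraction(x)
    y_shifted = utterance_mean_subtraction(x + offset)
    print(np.allclose(y_clean, y_shifted))     # True
```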

2.2.2. Recursive high-pass filter

The processing delay of one utterance introduced by the utterance mean subtraction technique is unacceptable for a real-time implementation. Therefore we consider here high-pass filtering of the spectral subband envelopes by means of a recursive infinite impulse response filter [7]. Here the subband mean is computed as a weighted running sum of the past samples:

\bar{x}(k, t) = \sum_{j=1}^{t} a(j)\,x(k, t - j)    (7)

One possible initialization for \bar{x}(k, 0) is the overall mean value of the kth subband component of the training data. The filter coefficients a(j) have to be chosen such that the above represents a low-pass filter; e.g. for a first-order low-pass filter we have a(j) = a^j, 0 < a < 1. The bandwidth of the filter has


to meet contradictory design goals: a large bandwidth is desirable to be able to adapt quickly to a new channel; however, a small bandwidth results in less distortion of the speech signal s(k, t). This problem can be solved by a time-varying filter bandwidth, i.e. a(j) = a(j, t). Again, the high-pass signal y(k, t) = x(k, t) - \bar{x}(k, t) is then used for the subsequent classifier.
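A first-order recursive realization of this idea can be written in a few lines. The sketch below only illustrates eq. (7) together with a time-varying bandwidth at the start of an utterance; the exponential-average form, the warm-up schedule, the default constants and the function name are our own illustrative choices, not those of the described system.

```python
import numpy as np

def recursive_highpass(log_spectra, init_mean, alpha_steady=0.98, warmup_frames=5):
    """High-pass filter subband log-spectra with a first-order recursive
    mean estimate, as a real-time alternative to utterance mean subtraction.

    log_spectra  : array (T, K), x(k, t) for one utterance
    init_mean    : array (K,), e.g. the overall training-data mean per subband
    alpha_steady : steady-state forgetting factor of the low-pass
                   (close to 1 means small bandwidth, slow adaptation)
    warmup_frames: initial frames with enlarged bandwidth so the estimate
                   can lock on to a new channel quickly
    Returns y(k, t) = x(k, t) - running_mean(k, t).
    """
    T, K = log_spectra.shape
    mean = init_mean.astype(float).copy()
    out = np.empty_like(log_spectra, dtype=float)
    for t in range(T):
        # time-varying bandwidth: adapt fast in the beginning, slowly later
        alpha = min(alpha_steady, t / (t + 1.0)) if t < warmup_frames else alpha_steady
        out[t] = log_spectra[t] - mean                          # high-pass output y(k, t)
        mean = alpha * mean + (1.0 - alpha) * log_spectra[t]    # low-pass update of the mean
    return out
```

Whether the running mean is updated before or after the subtraction, and the exact warm-up schedule, are implementation choices; the experiments in Section 2.2.3 fixed the filter parameter to its steady-state value after five frames.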

2.2.3. Experimental results

Speaker-dependent tests were conducted on a car database: two sets of data were collected in a compact-sized car driving on a highway at an average speed of 120 km/h (75 mph). The first set of data contains speech uttered via a telephone handset ('HS'), the second in hands-free mode ('HF') with a single microphone attached to the sun visor. Both data collections involved 10 speakers (handset: 7 male + 3 female; hands-free: 5 male + 5 female). Three male and three female speakers were common to both data collections. The vocabulary contains the German digits including the two alternative pronunciations of '2': 'zwei' and 'zwo'. Digits were spoken in isolation, as 3-digit strings, and as 7-digit strings; 44 single digits and 44 (88) 3-digit strings were used for training in the handset (hands-free) mode. Recognition experiments were carried out on 100 7-digit strings. For the 'cross-tests' all handset data were used for training and all hands-free data for recognition. The error rates presented are always average values over all available speakers: 10 for the handset, 10 for the hands-free experiments and 6 for the cross-tests. The signal-to-noise ratio of the handset data was between 10 dB and 19 dB and that of the hands-free data around 0 dB. However, it should be emphasized that the difference between handset and hands-free data is not only an increased noise level but also a different acoustic channel and a different speaking mode. For more details on the database see [8].

Further, we carried out speaker-independent recognition experiments on a telephone database ('MTEL'). The training part consists of isolated utterances of German digits spoken by 37 male speakers, resulting in a total of 897 utterances. The recognition part contains 1892 digit strings of different lengths (up to 7 digits) which were spoken by another 23 male speakers who were different from the training speakers. Finally, the adult speakers' portion of the Texas Instruments Connected Digits recognition task [9] ('TI Digits') was used as yet another database.

Our speech recognizer employs a connected-word recognition algorithm which is based on whole-word hidden Markov models, the emission probabilities of which are modelled by continuous Laplacian densities.


TABLE I
Word error rates for handset (HS) and hands-free (HF) databases, cross-tests (HS: training, HF: recognition) and MTEL telephone database; string error rates for the TI Digits recognition task in parentheses. In all cases: 32-component feature vector

                                Word (string) error rate (%)
Type of normalization           HS     HF     CROSS    MTEL    TI
No spectrum normalization       0.9    2.0    100      3.0     (4.0)
Mean subtraction                1.2    2.0    6.9      1.6     (3.2)
High-pass filter                1.1    2.3    11.4     2.1     (3.9)
Band-pass filter [11]           1.2    2.2    8.3      3.4


For all results of Table I we trained single-density emission probabilities rather than mixtures. We used the Viterbi approximation both in training and recognition, i.e. the probability of a word is replaced by the probability of its most likely state sequence.

After sampling the input speech signal at 8 kHz and subsequent pre-emphasis (a = 0.95), 15 cepstrally smoothed log-power spectral intensities are computed every 12 ms from a Hamming-windowed 32 ms portion of the speech signal. The energy per frame is subtracted from each intensity and included as an additional component in a resulting 16-component feature vector. The 16-component vector is augmented with first-order time differences of each vector component to obtain a 32-component vector. Details of the signal analysis and the acoustic modelling can be found in [8, 10].
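A front end of this kind can be prototyped along the following lines. This is a simplified sketch and not the Philips implementation: the triangular mel filterbank is a generic textbook construction, the cepstral smoothing is omitted, the frame-energy term is a simple proxy, and all names and the FFT size are our own choices. Only the sampling rate, pre-emphasis, frame timing, number of bands, energy normalization and delta augmentation follow the description above.

```python
import numpy as np

SAMPLE_RATE = 8000                          # 8 kHz input
FRAME_SHIFT = int(0.012 * SAMPLE_RATE)      # new frame every 12 ms
FRAME_LEN   = int(0.032 * SAMPLE_RATE)      # 32 ms analysis window
N_BANDS     = 15                            # log-power spectral intensities per frame
PREEMPHASIS = 0.95

def mel_filterbank(n_bands, n_fft, sample_rate):
    """Triangular mel filterbank (generic stand-in construction)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_bands + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fb = np.zeros((n_bands, n_fft // 2 + 1))
    for b in range(1, n_bands + 1):
        left, centre, right = bins[b - 1], bins[b], bins[b + 1]
        for k in range(left, centre):
            fb[b - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fb[b - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def features(signal, n_fft=256):
    """Return a (num_frames, 32) matrix: 15 energy-normalized log band
    intensities plus a frame-energy term, augmented with first-order deltas."""
    sig = np.append(signal[0], signal[1:] - PREEMPHASIS * signal[:-1])  # pre-emphasis
    window = np.hamming(FRAME_LEN)
    fb = mel_filterbank(N_BANDS, n_fft, SAMPLE_RATE)
    static = []
    for start in range(0, len(sig) - FRAME_LEN + 1, FRAME_SHIFT):
        frame = sig[start:start + FRAME_LEN] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        logband = np.log(fb @ power + 1e-10)                 # 15 log intensities
        energy = logband.mean()                              # crude frame-energy proxy
        static.append(np.append(logband - energy, energy))   # 16 components per frame
    static = np.asarray(static)
    deltas = np.diff(static, axis=0, prepend=static[:1])     # first-order time differences
    return np.hstack([static, deltas])                       # 32-component vectors
```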

Table I summarizes the results on the databases described above. The high-pass filter was initialized for each new utterance, according to the scheme outlined in Section 2.2.2. After five time frames the filter parameter was fixed to its steady-state value. For comparison we have included in the table results for the band-pass filter described in [11]. The table indicates word error rates in %, except for the TI Digits case, where string error rates are cited in parentheses.

The results show that in the case of a mismatch of training and test conditions ('CROSS' in the table) spectrum normalization is essential. If there is no spectrum normalization the recognizer fails completely. This is due to the many word insertions caused by the high noise level of the test data. For the speaker-independent recognition experiments (MTEL, TI), spectrum normalization also improves performance. This shows that slow changes of the log-spectral intensities are mainly caused by speaker and acoustic channel variations and thus should be removed, since they do not bear any information for the recognition task. Further, it can be observed that the recursive filters perform somewhat worse than the mean subtraction technique. In the cases of speaker-dependent recognition and a match between training and test environment ('HS', 'HF'), spectrum normalization does not yield any benefit and can even deteriorate performance slightly. Spectrum normalization may thus be viewed as a safeguard measure for cases where one might encounter a test environment that differs from the training conditions.

3. Clustering techniques for compact acoustic models

In this section we are concerned with the problem of reducing the computation and memory demands of speech recognition in order to obtain a cost-effective hardware implementation. In the following we describe clustering techniques to arrive at compact acoustic representations which achieve the same recognition performance with a considerably smaller number of parameters than the non-clustered system. We assume that the reader is familiar with hidden Markov modelling; otherwise we refer him to [12]. We applied clustering techniques at different levels of the acoustic modelling and integrated them into the acoustic-phonetic training procedure of our continuous-density hidden Markov model (HMM) speech recognizer. The main idea of clustering is to join models and parameters which are acoustically similar.

We applied the theoretical framework of maximum-likelihood estimation to define a measure of similarity for our bottom-up hierarchical clustering. This approach leads to a similarity measure which is composed of an L2 distance of the model or density means and an observation-count-based weighting factor. In Section 3.1 we will show that such a clustering criterion fits well into a maximum-likelihood estimation-based training procedure.

For a continuous-density HMM system, acoustic similarity can be seen at different levels: at the phone level, at the state (or mixture) level and at the density level.

Clustering at the first level (phones) leads to model tying and aims at defining a reduced set of models to be trained. It avoids a duplication of acoustically similar models, and therefore reduces the number of parameters of a system. This translates directly into reduced processing and memory demands of the recognizer. In addition, tying of rarely observed models with similar but more often observed ones leads to a more reliable estimation of the model parameters and thus to a more efficient utilization of the given data. Similar reasoning is behind state tying, where individual states rather than whole models are tied.


The goal of density clustering is to identify similar single emission probability density functions out of the pool of density functions. The pool may contain either the component densities of the mixture densities of all models, or there may be separate pools for each phoneme or family of phonemes. As a result, corresponding parameters of different models are shared. Density clustering is done across models and is independent of the previously mentioned model-tying.

3.1. A maximum-likelihood approach to clustering

The objective of the applied clustering procedure is to achieve a tying of models and parameters which produces the minimum possible increase of a global measure of heterogeneity. We will show that such a measure can be directly derived from the maximum-likelihood criterion.

3.1.1. The measure of heterogeneity

Assuming Gaussian output probabilities, we can write for the likelihood of the observation vector o, given the model (mixture density) m,

p(o \mid m) = \sum_{i=1}^{n} p_i \, e^{-d(o,\, r_i)}

d(o, r_i) = \frac{1}{2} (o - r_i)\,\Sigma^{-1} (o - r_i)^T

p_i = \frac{w_i}{(2\pi)^{D/2}\,|\Sigma|^{1/2}},    (8)

where n is the number of component densities of the mixture, D is the dimension of the feature space, \Sigma is a given pooled covariance matrix, w_i is a given fixed mixture component weight and the r_i are the component density mean vectors with the distance d(o, r_i). Due to the exponential decrease of the likelihood we can approximate the logarithm of the mixture density likelihood by:

-\log(p(o \mid m)) \approx \min_{i=1,\dots,n} \left[ -\log(p_i) + d(o, r_i) \right].    (9)

Note that -\log(p_i) \geq 0. Thus we define a distance \tilde{d} which incorporates the term -\log(p_i) into d:

\tilde{d}(o, r_i) = d(o, r_i) - \log(p_i).    (10)
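Equations (8)–(10) translate into a small scoring routine. The following sketch is illustrative only (the variable names are ours); it evaluates the approximated negative log-likelihood of one observation vector for one mixture with a shared pooled covariance.

```python
import numpy as np

def neg_log_likelihood(o, means, weights, sigma_inv, log_det_sigma):
    """Approximate -log p(o|m) for a mixture, following eqs. (8)-(10).

    o             : observation vector, shape (D,)
    means         : component mean vectors r_i, shape (n, D)
    weights       : mixture component weights w_i, shape (n,)
    sigma_inv     : inverse of the pooled covariance, shape (D, D)
    log_det_sigma : log-determinant of the pooled covariance
    """
    D = o.shape[0]
    diff = means - o                                                # (n, D)
    d = 0.5 * np.einsum('nd,de,ne->n', diff, sigma_inv, diff)       # d(o, r_i)
    log_p = np.log(weights) - 0.5 * (D * np.log(2 * np.pi) + log_det_sigma)
    d_tilde = d - log_p                                             # eq. (10)
    return d_tilde.min()                                            # eq. (9): best component only
```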

To obtain the log-likelihood for the whole training set we sum over all observation vectors which occur in the training data:²

-\sum_{o} \log\big(p(o \mid m_o)\big) \approx \sum_{m} \sum_{r_i \in m} \sum_{o \in r_i} \tilde{d}(o, r_i)    (11)

Following the maximum-likelihood approach, we have to train an inventory of models and densities which minimizes the criterion V:

V = \sum_{m} \sum_{r_i \in m} \sum_{o \in r_i} \tilde{d}(o, r_i).    (12)

After a suitable linear transformation x = T o of the feature space, with \Sigma_x = I, we obtain, writing a_i = T r_i for the transformed density means,

V = \sum_{m} \sum_{a_i \in m} \sum_{x \in a_i} \| x - a_i \|^2.    (13)

Rewriting the double sum over all mixtures and all component densities of each mixture as a single summation over the pool of all N densities, we get:

V = \sum_{i=1}^{N} \sum_{x \in a_i} \| x - a_i \|^2.    (14)

During bottom-up clustering the size of the pool of densities is successively reduced. This results in an increase of the measure V. To be consistent with our maximum-likelihood training procedure, the natural choice of a measure for the heterogeneity is thus V, and the rule to be followed hence is: merge those two clusters which result in the smallest possible increase of the heterogeneity V. Since we are joining models and densities, V will increase during clustering. Hence the maximum-likelihood approach suggests minimizing the increase of V during clustering. Thus we can assume that V is an optimal measure of heterogeneity in the framework of maximum-likelihood training.

² To indicate that m is the model which corresponds to the observation vector o we write m_o. Furthermore, we write for convenience r ∈ m if r is a component density mean vector of the model (mixture density) m, and we write o ∈ r if r is the closest mean vector (with respect to the Euclidean distance) to o.


3.1.2. Ward's clustering procedure

In this section we derive Ward's bottom-up clustering procedure, which is described in [13].

In the beginning, each cluster consists of a single element, in our case a single density mean vector a. Now the number of clusters is successively reduced. After joining the two clusters p, q to the fused cluster f:

a_f = \frac{1}{n_f} \sum_{x \in a_p,\, a_q} x    (15)

a_f = \frac{n_p\, a_p + n_q\, a_q}{n_f}    (16)

with n_f = n_p + n_q, the increase of the negative log-likelihood

\Delta_{p,q} = \sum_{\substack{i=1 \\ i \neq p,q}}^{N} \sum_{x \in a_i} \| x - a_i \|^2 + \sum_{x \in a_f} \| x - a_f \|^2 - \sum_{i=1}^{N} \sum_{x \in a_i} \| x - a_i \|^2    (17)

can, after some computation, be simplified to

\Delta_{p,q} = \frac{n_p \cdot n_q}{n_p + n_q} \| a_p - a_q \|^2.    (18)

From this equation it is obvious that clustering can never lead to a decrease of the negative log-likelihood. Since

\| a_i - a_f \|^2 = \frac{n_p}{n_p + n_q} \| a_i - a_p \|^2 + \frac{n_q}{n_p + n_q} \| a_i - a_q \|^2 - \frac{n_p\, n_q}{(n_p + n_q)^2} \| a_p - a_q \|^2    (19)

holds, we obtain for an additional fusion of the fused group f and another group i:

\Delta_{i,f} = \frac{n_f \cdot n_i}{n_f + n_i} \| a_f - a_i \|^2    (20)
             = \frac{1}{n_f + n_i} \left[ (n_p + n_i)\,\Delta_{i,p} + (n_q + n_i)\,\Delta_{i,q} \right] - \frac{n_i}{n_f + n_i}\,\Delta_{p,q}.    (21)

\Delta_{i,f} can be interpreted as a distance between the clusters i and f; i.e., the distance (measure of dissimilarity) of two clusters is simply the increase in negative log-likelihood V if the two clusters are merged. In an implementation the terms \Delta_{i,j} are computed at the beginning of the clustering. During a successive clustering step the distances of the clusters i to the new cluster f (f := fused(p, q)) can be directly computed from the distances \Delta_{i,q}, \Delta_{i,p} and \Delta_{p,q} according to eq. (21). The iterative clustering procedure is summarized in the following:

1. Compute a dissimilarity matrix from all given densities by eq. (18).
2. Search for the minimum \Delta_{i,j} within the distance matrix and join the clusters i, j to a new cluster f = f(i, j).
3. Remove the items i, j from the distance matrix.
4. Add item f to the distance matrix, using eq. (21).
5. Stop if some stopping criterion is met (e.g. number of clusters), otherwise go to 2.
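A direct implementation of this bottom-up merging can be sketched as follows. This is illustrative only: the data layout and names are ours, and for brevity it recomputes the merge costs of eq. (18) at every step instead of maintaining the distance matrix incrementally via eq. (21), as an efficient implementation would.

```python
import numpy as np

def ward_cluster(means, counts, target_clusters):
    """Bottom-up (Ward) clustering of density mean vectors.

    means  : (N, D) array of density mean vectors a_i
    counts : (N,) observation counts n_i of the densities
    Repeatedly merges the pair of clusters with the smallest increase in
    heterogeneity, eq. (18):
        delta(p, q) = n_p * n_q / (n_p + n_q) * ||a_p - a_q||**2
    until `target_clusters` remain. Returns the remaining (mean, count) pairs.
    """
    clusters = [(np.asarray(m, dtype=float), float(n)) for m, n in zip(means, counts)]

    def delta(p, q):
        (ap, n_p), (aq, n_q) = clusters[p], clusters[q]
        return n_p * n_q / (n_p + n_q) * np.sum((ap - aq) ** 2)

    while len(clusters) > target_clusters:
        # find the pair whose merge increases the criterion V the least
        _, p, q = min(((delta(p, q), p, q)
                       for p in range(len(clusters))
                       for q in range(p + 1, len(clusters))),
                      key=lambda item: item[0])
        (ap, n_p), (aq, n_q) = clusters[p], clusters[q]
        nf = n_p + n_q
        af = (n_p * ap + n_q * aq) / nf        # eq. (16): mean of the fused cluster
        clusters.pop(q)                        # remove q first, so index p stays valid
        clusters[p] = (af, nf)
    return clusters
```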

The applied clustering technique is successive and agglomerative and hence can lead to suboptimal and degenerate cluster configurations. To avoid such undesirable results an additional k-means clustering (see below) has to be included.

3.1.3. k-means clustering procedure

The k-means clustering technique is well known and is used in addition to Ward's clustering procedure (see Section 3.1.2). It works as follows:

1. Start with an initial set of cluster means.
2. For each element i of each cluster search for the nearest neighbour cluster, i.e. the cluster whose mean has a minimum L2 distance to i.
3. Move the density i, if necessary, to the new nearest neighbour cluster.
4. Re-estimate the cluster means.
5. If there are no more density moves from one cluster to another, stop; otherwise go to 2.

The k-means clustering procedure can be included after each clustering iteration of Section 3.1.2 or, to save computation time, after a certain number of iterations.
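The k-means refinement can be sketched in the same style. Again this is only an illustration of steps 1–5 above, in its plain textbook form; the elements being reassigned are taken to be the density mean vectors themselves, and the treatment of empty clusters is our own choice.

```python
import numpy as np

def kmeans_refine(points, centers, max_iter=100):
    """Plain k-means refinement of a cluster configuration (steps 1-5 above).

    points  : (N, D) array of elements (here: density mean vectors)
    centers : (K, D) initial cluster means, e.g. the output of Ward clustering
    Reassigns each element to the nearest cluster mean (L2 distance) and
    re-estimates the means until no element changes cluster.
    """
    centers = np.asarray(centers, dtype=float).copy()
    assignment = np.full(len(points), -1)
    for _ in range(max_iter):
        # steps 2/3: nearest-neighbour cluster for every element
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assignment = dists.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            break                               # step 5: no more moves
        assignment = new_assignment
        # step 4: re-estimate the cluster means (empty clusters kept unchanged)
        for k in range(len(centers)):
            members = points[assignment == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return centers, assignment
```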

3.2. State-tying versus density-tying

Hidden Markov models may share some or all component densities of their mixture densities if they model acoustically similar events. This similarity can be modelled at the state level and at the density level. As a result of state-tying (see Fig. 1), complete states will be tied together, i.e. the tied states will share the same inventory of component densities.

Density-tying, on the other hand, allows different models to share common regions of the acoustic space (see Fig. 1). It is done across HMM states and is independent of the previously mentioned state-tying.


Fig. 1. State-tying versus density-tying. (Figure panels: 'No Tying' — two similar states are modelled by two different mixture densities; 'Density Tying' — two similar states are modelled by two different mixture densities, some of the component densities are tied; 'State Tying'.)

Note that the resulting configuration is in essence a tied mixture-density system where the degree of tying is determined by the amount of clustering.

It is obvious that for single-density emission probabilities there is actually no difference between state-tying and density-tying, which is important for the implementation, since both kinds of tying are obtained by the same clustering procedure.

3.3. Experimental results

We carried out experiments on several small-vocabulary recognition tasks. Here, small-vocabulary speech recognition is used as a synonym for an acoustic modelling approach which employs hidden Markov models of words rather than of phonemes.


TABLE II
String error rate (SER) on TI Digits for various configurations with a small number of densities

SER [%]    Configuration
3.37       0.6 k non-tied single densities, 19200 parameters
2.97       0.3 k tied densities, 1 k weights, 10600 parameters
2.59       1.2 k non-tied densities, 39600 parameters
2.67       0.3 k tied densities, 1.6 k weights, 11200 parameters


TABLE III
String error rate (SER) on TI Digits for various configurations with a large number of densities

SER [%]    Configuration
1.91       2.4 k non-tied densities, 79200 parameters
1.90       0.8 k tied densities, 3.2 k weights, 28800 parameters
1.45       4.8 k non-tied densities, 158400 parameters
1.30       2 k tied densities, 10 k weights, 74000 parameters
1.16       9.5 k non-tied densities, 304000 parameters
1.14       5 k tied densities, 18.5 k weights, 178000 parameters
0.95       19 k non-tied densities, 608000 parameters
0.97       10.5 k tied densities, 26 k weights, 362000 parameters

For word-model-based small-vocabulary speech recognition, state tying identifies acoustically similar states within different words. This results in deciding automatically and in a data-driven way which parts of speech of the recognition vocabulary are similar and can therefore be modelled with shared parameters.

Tables II and III show the effects of clustering on the number of parameters and on the error rate for experiments on the adult speakers' portion of the Texas Instruments Connected Digits recognition task. For details on the non-tied system, see [2]. In Tables II and III experiments with similar string error rates are grouped together. It can be seen that the number of model parameters could be reduced by a factor of 2 to 3 without increase in error rate. For a medium error rate performance range (1.1% to 3% string error rate), the results can be stated alternatively: given the same number of parameters, the tied system achieves an error rate up to 30% better. Similar results have been obtained on the other databases mentioned in Section 2. Thus the described clustering techniques are a powerful means to reduce the computing and memory demands to make the recognition fit on cheap hardware.
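For orientation, the parameter counts quoted in Tables II and III can essentially be reproduced from the 32-component feature vectors: each density stores a 32-component mean vector, and the tied configurations additionally store the listed mixture weights (the non-tied mixture configurations include roughly one weight per density). Assuming this bookkeeping, for example,

300 \times 32 + 1000 = 10\,600 \qquad \text{and} \qquad 2000 \times 32 + 10\,000 = 74\,000,

which matches the 0.3 k and 2 k tied-density rows of the tables.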

4. Summary

When it comes to real-world voice control applications, robustness and real-time response might even be more important issues than a task-dependent optimized recognition accuracy. We discussed two techniques to achieve robust high-accuracy real-time speech recognition in real-world environments: the spectral normalization technique and clustering techniques.


The reported results show that in the case of a mismatch of training and test conditions spectrum normalization is essential, since it is able to remove the negative effects on the error rate of changing transfer channel characteristics and even of different noise levels in training and test. For the reported speaker-independent recognition experiments spectrum normalization also improves performance, since it discards speaker-specific spectral characteristics in the speech spectrum. Further, it can be observed that recursive filters, which have to be employed to achieve real-time response, perform somewhat worse than the mean subtraction technique. In the cases of speaker-dependent recognition and a match between training and test environment, spectrum normalization does not yield any benefit and can even worsen performance slightly. Spectrum normalization may thus be viewed as a safeguard measure for cases where the test environment differs from the training conditions.

Clustering techniques have been applied to obtain a compact representation of acoustic models. This translates directly into reduced computation and memory demands in an implementation. This is an important factor in voice control applications, which have to live with tight cost constraints. Moreover, clustering has another beneficial side-effect: a better utilization of the training data. In this paper we have discussed in detail a maximum-likelihood-based clustering procedure applied at various levels within the acoustic modelling. At the state level, clustering allows us to avoid the duplication of acoustically similar models. A consequence is that rarely seen acoustic events can be modelled together with more robust ones. At the density level, clustering allows us to model better the part of the acoustic space that is shared by different models. A combination of the two clustering techniques leads to a reduction of the number of parameters by a factor of up to three and to a significant error rate reduction on several small-vocabulary speech recognition tasks.


REFERENCES
[1] S. Gamm and R. Haeb-Umbach, User interface design of voice controlled consumer electronics, Philips J. Res., 49(4), 439-454 (1995).
[2] R. Haeb-Umbach, D. Geller and H. Ney, Improvements in connected digit recognition using linear discriminant analysis and mixture densities, in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Minneapolis, MN, Apr. 1993, pp. II-239-II-242.
[3] A. Acero, Acoustical and Environmental Robustness in Automatic Speech Recognition, Kluwer Academic Publishers, Boston, MA, 1993.
[4] B.H. Juang, Speech recognition in adverse environments, Comput. Speech and Lang., 5, 275-294 (1991).
[5] R. Haeb-Umbach, D. Geller and H. Ney, Improvements in speech recognition for voice dialing in the car environment, in Proc. ESCA-ETRW Workshop on Speech Processing in Adverse Conditions, Cannes-Mandelieu (France), Nov. 1992, pp. 203-206.
[6] A. Nadas, D. Nahamoo and M. Picheny, Speech recognition using noise-adaptive prototypes, IEEE Trans. Acoust. Speech and Signal Processing, 37(10), 1495-1503 (Oct. 1989).
[7] H.G. Hirsch, P. Meyer and H.-W. Ruehl, Improved speech recognition using high-pass filtering of subband envelopes, in Proc. European Conf. on Speech Communication and Technology, Genova, Sep. 1991, pp. 413-416.
[8] H.W. Ruehl, S. Dobler, J. Weith, P. Meyer, A. Noll, H.H. Hamer and H. Piotrowski, Speech recognition in the noisy car environment, Speech Commun., 10(1), 11-22 (Feb. 1991).
[9] R.G. Leonard, A database for speaker-independent digit recognition, in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, San Diego, CA, Mar. 1984, pp. 42.11.1-42.11.4.
[10] A. Noll, H.H. Hamer, H. Piotrowski, H.W. Ruehl, S. Dobler and S. Weith, Real-time connected-word recognition in a noisy environment, in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Glasgow, UK, May 1989, pp. 679-682.
[11] H. Hermansky and N. Morgan, Towards handling the acoustic environment in spoken language processing, in Proc. Int. Conf. Spoken Language Processing, Banff, Canada, Oct. 1992, pp. 85-88.
[12] L.R. Rabiner, Mathematical foundations of hidden Markov models, in H. Niemann, M. Lang and G. Sagerer (eds.), Recent Advances in Speech Understanding and Dialog Systems, Vol. F46 of NATO ASI Series, Springer, Berlin, 1988, pp. 183-205.
[13] D. Steinhausen and K. Langer, Clusteranalyse, Walter de Gruyter, 1977.
