
COMPARISON OF PARAMETRIC REPRESENTATIONS FOR MONOSYLLABIC WORD RECOGNITION IN CONTINUOUSLY SPOKEN SENTENCES*

Steven B. Davis+ and Paul Mermelstein++

Abstract. Several parametric representations of the acoustic signal were compared as to word recognition performance in a syllable-oriented continuous speech recognition system. The vocabulary included many phonetically similar monosyllabic words; therefore, the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations. For each parameter set (based on a mel-frequency cepstrum, a linear frequency cepstrum, a linear prediction cepstrum, a linear prediction spectrum, or a set of reflection coefficients), word templates were generated using an efficient dynamic warping method, and test data were time registered with the templates. A set of ten mel-frequency cepstrum coefficients computed every 6.4 ms resulted in the best performance, namely 96.5% and 95.0% recognition with each of two speakers. The superior performance of the mel-frequency cepstrum coefficients may be attributed to the fact that they better represent the perceptually relevant aspects of the short-term speech spectrum.

1. INTRODUCTION

The selection of the best parametric representation of acoustic data is an important task in the design of any speech recognition system. The usual objectives in selecting a representation are to compress the speech data by eliminating information not pertinent to the phonetic analysis of the data and to enhance those aspects of the signal that contribute significantly to the detection of phonetic differences. When a significant amount of reference information is stored, such as different speakers' productions of the vocabulary, compact storage of the information becomes an important practical consideration.

*To appear in IEEE Transactions on Acoustics, Speech, and Signal Processing.
+Now at Signal Technology, Inc., 15 W. De La Guerra, Santa Barbara, CA 93101.
++Now at Bell-Northern Research and INRS-Telecommunications, University of Quebec, 3, Place du Commerce, Nuns' Island, Verdun, Quebec, Canada H3E 1H6.
Acknowledgement. This material is based upon work supported by NSF Grant BNS 7682023 to Haskins Laboratories. Drs. Frank Cooper and Patrick Nye participated in numerous discussions of the experimental program, and their contribution is greatly appreciated.

[HASKINS LABORATORIES: Status Report on Speech Research SR-61 (1980)]


The choice of a basic representation bears closely on the recognition problem because the decision associating an unknown with a reference is based on the information within the entire representation. The number of different reference patterns is smaller than the number of possible unknown patterns, and therefore the process of matching an unknown with a reference entails a loss of information. One can minimize the loss of useful information by comparing different representations in the framework of the recognition system under consideration. However, since the choice of representation is so basic to the decision as to what acoustic information is useful, the result of such a comparison of different representations is directly applicable only to the given recognition system, and generalization to differently organized systems may not be warranted.

Fujimura (1975) and Mermelstein (1975b) discussed in detail the rationale for use of syllable-sized units in the recognition of continuous speech. The goal of the experiments reported here was to select an acoustic representation most appropriate for the recognition of such units. The methods used to evaluate the representations were open testing, where the reference data and test data were independently derived, and closed testing, where these data were identical. In each case, the same speaker produced both the reference and test data, which included the same words in a variety of different syntactic contexts. Although variation between speakers is an important problem in its own right, attention is focused here on speaker-dependent representations to restrict the different sources of variation in the acoustic data.

White and Neely (1976) showed that the choice of parametric representations significantly affects the recognition results in an isolated word recognition system. Two of the best representations they explored were a 20-channel bandpass filtering approach using a Chebyshev norm on the logarithm of the filter energies as a similarity measure, and a linear prediction coding approach using a linear prediction residual (Itakura, 1975) as a similarity measure. From the similarity of the corresponding results, they concluded that bandpass filtering and linear prediction were essentially equivalent when used with a dynamic programming time alignment method. However, that result may be due to the absence of phonetically similar words in the test vocabulary.

Because of the known variation of the ear's critical bandwidths with frequency (Feldtkeller & Zwicker, 1956; Schroeder, 1977), filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. Pols (1977) showed that the first six eigenvectors of the covariance matrix for Dutch vowels of three speakers, expressed in terms of 17 such filter energies, accounted for 91.8% of the total variance. The direction cosines of his eigenvectors were very similar to a cosine series expansion on the filter energies. Additional eigenvectors showed an increasing number of oscillations of their direction cosines with respect to their original energies. This result suggested that a compact representation would be provided by a set of mel-frequency cepstrum coefficients. These cepstrum coefficients are the result of a cosine transform of the real logarithm of the short-term energy spectrum expressed on a mel-frequency scale.1


A preliminary experiment (Mermelstein, 1976) showed that the cepstrum coefficients were useful for representing consonantal information as well. Four speakers produced 12 phonetically similar words, namely "stick," "sick," "skit," "spit," "sit," "slit," "strip," "scrip," "skip," "skid," "spick," and "slid." A representation using only two cepstrum coefficients resulted in 96% correct recognition of this vocabulary. Given these encouraging results, it became important to verify the power of the mel-frequency cepstrum representation by comparing it to a number of other commonly used representations in a recognition framework where the other variables, including vocabulary, are kept constant.

This paper compares the performance of different acoustic representations in a continuous speech recognition system based on syllabic units. The next section describes the organization of the recognition system, the selection of the speech data, and the different parametric representations. The following section describes the method for generating the acoustic templates for each word by use of a dynamic warping time alignment procedure. Finally, the results obtained with the various representations are listed and discussed from the point of view of completeness in representing the necessary acoustic information.

2. The Experimental Framework

A rather simple speech recognition framework served as the testbed to evaluate the various acoustic representations. Lexical information was utilized in the form of a list of possible words and their corresponding acoustic templates, and these words were assumed to occur with equal likelihood. No syntactic or semantic information was utilized. If such information had been present, it could have been used to restrict the number of admissible lexical hypotheses or assign unequal probabilities to them. Thus, in practice, instead of matching hypotheses to the entire vocabulary, the number of lexical hypotheses that one evaluates may be reduced to a much smaller number. This reduction would cause many of the hypotheses phonetically similar to the target word to be eliminated from consideration. Thus the high phonetic confusability of the test data may have resulted in a test environment that is more rigorous than would be encountered in practice.

2.1 Selection of Corpus

The performance of continuous speech recognition systems is determined by a number of distinct sources of acoustic variability, including speaker characteristics, speaking rate, syntax, communication environment, and recording and/or transmission conditions. The focus of the current experiments is acoustic recognition in the face of variability induced in words of the same speaker by variation of the surrounding words and by syntactic position. The use of a separate reference template for each different syntactic environment which a word might occupy would require exorbitant amounts of storage and training data. Thus an important practical requirement is to generate reference templates without regard to the syntactic position of the word. To avoid the problem of automatically segmenting complex consonantal clusters, the corpus was composed of monosyllabic target words that were semantically acceptable in a number of different positions in a given syntactic context.


Since acoustic variation due to different speakers is a distinctly separate problem (Rabiner, 1978), it was considered advisable to restrict the scope of these initial experiments by using only speaker-dependent templates. That is, both reference and test data were produced by the same speaker.

The sentences were read clearly in a quiet environment and recorded with a high quality microphone. These recording conditions were selected to establish the best performance level that one could expect the system to attain. Environments with higher ambient noise, which may be encountered in a practical speech input situation, would undoubtedly detract from the clarity of the acoustic information and therefore result in lower performance.

The speech data comprised 52 different CVC words from two male speakers (DZ and LL), and a total of 169 tokens were collected from 57 distinct sentences (Appendix A). The sentences were read twice by each speaker in recording sessions separated in time by two months (denoted as DZ1, DZ2, LL1 and LL2). Thus the data consisted of a total of 676 syllables. To achieve the required variability, the selected words could be used as both nouns and verbs. For example, "Keep the hope at the bar" and "Bar the keep for the yell" are two sentences that allow syntactic variation but preserve the same overall intonation pattern. All the words examined carried some stress; the unstressed function words were not analyzed. The target words, all CVC's, included 12 distinct vowels, /i, ɪ, e, ɛ, æ, ɔ, ʌ, ʊ, u, ɝ, ɑ, o/, some of which are normally diphthongized in English. Each vowel was represented in at least four different words, and these words manifested differences in both the prevocalic and postvocalic consonants. The consonants comprised simple consonants as well as affricates but no consonantal clusters.

2.2 Segmentation

An automatic segmentation process (Mermelstein, 1975a) was initially considered as one way of delimiting syllable-sized units in continuously spoken text, but any such algorithm performs the segmentation task with a finite probability of error. In particular, weak unstressed function words sometimes appear appended to the adjacent words carrying stronger stress. Additionally, in this study, a boundary point located for an intervocalic consonant with high sonority may not consistently join that consonant to the word of interest. In order to avoid possible interaction between segmentation errors and poor parametric representations, manual segmentation and auditory evaluation were used to accurately delimit the signal corresponding to the target words. The segmentation, as well as the subsequent analysis and recognition, was performed on a PDP-11/45 minicomputer with the Interactive Laboratory System (Pfeifer, 1977).

In systems employing automatic segmentation, the actual recognition rates can be expected to be lower due to the generation of templates from imperfectly delimited words (Mermelstein, 1978). However, there is no reason to believe that segmentation errors would not detract equally from the recognition rates obtained for the various parametric representations.


2.3 Parametric Representations

The parametric representations evaluated in this study may be divided into two groups, those based on the Fourier spectrum and those based on the linear prediction spectrum. The first group comprises the mel-frequency cepstrum coefficients (MFCC) and the linear-frequency cepstrum coefficients (LFCC). The second group includes the linear prediction coefficients (LPC), the reflection coefficients (RC), and the cepstrum coefficients derived from the linear prediction coefficients (LPCC). A Euclidean distance metric was used for all cepstrum parameters, since cepstrum coefficients are derived from an orthogonal basis. This metric was also used for the RC, in view of the lack of an inherent associated distance metric. The LPC were evaluated using the minimum prediction residual distance metric (Itakura, 1975).

Each acoustic signal was lowpass filtered at 5 kHz and sampled at 10 kHz. Fourier spectra or linear prediction spectra were computed for sequential frames 64 points (6.4 ms) or 128 points (12.8 ms) apart. In each case, a 256-point Hamming window was used to select the data points to be analyzed. (A window size of 128 points produced degraded results.)
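To make these analysis conditions concrete, the following sketch (Python; the function name and the synthetic test signal are illustrative and not part of the original system, which ran under the Interactive Laboratory System) frames a 10 kHz signal with a 256-point Hamming window advanced by 64 or 128 samples:

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=64):
    """Slice a 10 kHz signal into overlapping Hamming-windowed frames:
    frame_len=256 samples (25.6 ms); hop=64 samples (6.4 ms) or 128 (12.8 ms)."""
    window = np.hamming(frame_len)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        frames[i] = x[i * hop:i * hop + frame_len] * window
    return frames

# One second of a synthetic 10 kHz signal as a stand-in for real speech.
fs = 10_000
t = np.arange(fs) / fs
frames = frame_signal(np.sin(2 * np.pi * 440 * t), hop=64)
print(frames.shape)   # (153, 256): frames 6.4 ms apart
```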

For the MFCC computations, 20 triangular bandpass filters were simulated as shown in Figure 1. The MFCC were computed as

MFCC_i = \sum_{k=1}^{20} X_k \cos\left[ i \left( k - \frac{1}{2} \right) \frac{\pi}{20} \right], \qquad i = 1, 2, \ldots, M \qquad (1)

where M is the number of cepstrum coefficients and X_k, k = 1, 2, ..., 20, represents the log-energy output of the kth filter.
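As an illustration of (1), the sketch below computes mel-frequency cepstrum coefficients for one windowed frame. The exact center frequencies of the 20 filters in Figure 1 are not recoverable here, so the assumed spacing (linear every 100 Hz up to 1 kHz, then logarithmic, in the spirit of footnote 1) and the function names are this sketch's own choices:

```python
import numpy as np

def mel_filterbank(n_filters=20, n_fft=256, fs=10_000):
    """Triangular filters on an approximately mel-spaced axis (assumed spacing:
    centers every 100 Hz up to 1 kHz, then logarithmic, about 5 per octave)."""
    lin = np.arange(100.0, 1000.0, 100.0)                # 9 linear centers
    log = 1000.0 * 2.0 ** (0.2 * np.arange(1, 12))       # 11 log-spaced centers
    centers = np.concatenate(([0.0], lin, log[:n_filters - len(lin)], [fs / 2]))
    bins = np.floor(centers / fs * n_fft).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        lo, c, hi = bins[j - 1], bins[j], bins[j + 1]
        fb[j - 1, lo:c + 1] = np.linspace(0.0, 1.0, c - lo + 1)   # rising edge
        fb[j - 1, c:hi + 1] = np.linspace(1.0, 0.0, hi - c + 1)   # falling edge
    return fb

def mfcc(frame, n_coef=10, fb=None):
    """Eq. (1): MFCC_i = sum_k X_k cos[i (k - 1/2) pi / 20]."""
    if fb is None:
        fb = mel_filterbank()
    X = np.log(fb @ (np.abs(np.fft.rfft(frame)) ** 2) + 1e-10)   # log filter energies X_k
    k = np.arange(1, fb.shape[0] + 1)
    i = np.arange(1, n_coef + 1)[:, None]
    return (np.cos(i * (k - 0.5) * np.pi / fb.shape[0]) * X).sum(axis=1)
```

Applied to one row of the framing sketch above, mfcc(frames[0]) yields the ten coefficients used at each analysis step.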

The LFCC were computed from the log-magnitude Discrete Fourier Transform (DFT) directly as

LFCC_i = \sum_{k=0}^{K-1} Y_k \cos\left( \frac{\pi i k}{K} \right), \qquad i = 1, 2, \ldots, M \qquad (2)

where K is the number of DFT magnitude coefficients Y_k.
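A corresponding sketch of (2), assuming Y_k is the log magnitude of the positive-frequency half of the DFT of the windowed frame (the paper does not state which half-spectrum convention was used):

```python
import numpy as np

def lfcc(frame, n_coef=10):
    """Eq. (2): LFCC_i = sum_{k=0}^{K-1} Y_k cos(pi*i*k/K),
    with Y_k the log-magnitude DFT of the windowed frame."""
    Y = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
    K = len(Y)
    k = np.arange(K)
    i = np.arange(1, n_coef + 1)[:, None]
    return (Y * np.cos(np.pi * i * k / K)).sum(axis=1)
```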

The LPC were obtained from a 10th order all-pole approximation to the spectrum of the windowed waveform. The autocorrelation method for evaluation of the linear prediction coefficients was used (Markel & Gray, 1976). The RC were obtained by a transformation of the LPC which is equivalent to matching the inverse of the LPC spectrum with a transfer function spectrum that corresponds to an acoustic tube consisting of ten sections of variable cross-sectional area (Wakita, 1973). The reflection coefficients determine the fraction of energy in a travelling wave that is reflected at each section boundary.
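The sketch below illustrates the autocorrelation method, which yields both the 10th-order predictor coefficients and the reflection coefficients in one recursion; the sign convention A(z) = 1 - sum a_k z^-k and the function names are assumptions of the sketch, not taken from the paper:

```python
import numpy as np

def autocorrelation(frame, order=10):
    """Unnormalized autocorrelation r(0..order) of a windowed frame."""
    return np.array([np.dot(frame[:len(frame) - i], frame[i:])
                     for i in range(order + 1)])

def levinson(r):
    """Levinson-Durbin recursion on autocorrelations r(0..p).
    Returns predictor coefficients a_1..a_p (convention A(z) = 1 - sum a_k z^-k),
    the reflection coefficients k_1..k_p, and the residual energy."""
    p = len(r) - 1
    a = np.zeros(p + 1)
    k = np.zeros(p)
    err = r[0]
    for m in range(1, p + 1):
        acc = r[m] - np.dot(a[1:m], r[m - 1:0:-1])
        k[m - 1] = acc / err
        a_new = a.copy()
        a_new[m] = k[m - 1]
        a_new[1:m] = a[1:m] - k[m - 1] * a[m - 1:0:-1]
        a = a_new
        err *= 1.0 - k[m - 1] ** 2
    return a[1:], k, err

# Usage sketch: lpc, rc, _ = levinson(autocorrelation(windowed_frame))
```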


[Figure 1: Filters for generating mel-frequency cepstrum coefficients. Plot of the 20 triangular bandpass filters versus frequency (Hz); not reproduced.]


The LPCC were obtained from the LPC directly as

LPCC_i = a_i + \sum_{k=1}^{i-1} \frac{k}{i}\, LPCC_k\, a_{i-k}, \qquad i = 1, 2, \ldots, 10 \qquad (3)
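A sketch of the recursion in (3), written for the same assumed sign convention as the LPC sketch above (the paper does not spell the convention out, so the signs here are illustrative):

```python
import numpy as np

def lpc_cepstrum(a, n_coef=10):
    """Cepstrum of 1/A(z) with A(z) = 1 - sum a_k z^-k (sign convention assumed):
    c_i = a_i + sum_{k=1}^{i-1} (k/i) c_k a_{i-k}."""
    p = len(a)
    c = np.zeros(n_coef)
    for i in range(1, n_coef + 1):
        acc = a[i - 1] if i <= p else 0.0
        for k in range(1, i):
            if i - k <= p:
                acc += (k / i) * c[k - 1] * a[i - k - 1]
        c[i - 1] = acc
    return c
```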

The Itakura metric represents the distance between two spectral frames, with optimal (reference) LPC and test \widehat{LPC}, as

D[LPC, \widehat{LPC}] = \log \left[ \frac{LPC \, \widehat{R} \, LPC^{T}}{\widehat{LPC} \, \widehat{R} \, \widehat{LPC}^{T}} \right] \qquad (4)

where \widehat{R} is the autocorrelation matrix (obtained from the test sample) corresponding to the \widehat{LPC}. The metric measures the residual error when the test sample is filtered by the optimal LPC. Because of its asymmetry, the Itakura metric requires specific identification of the reference coefficients (LPC) and the test coefficients (\widehat{LPC}). For computational efficiency, the denominator of (4) will be unity if \widehat{R} is expressed in unnormalized form. Then if \hat{r}(n) denotes the unnormalized diagonal elements of \widehat{R}, r_{LP}(n) denotes the unnormalized autocorrelation coefficients from the LPC polynomial, and the logarithm is eliminated, the distance may be expressed as (Gray & Markel, 1976)

D[LPC, \widehat{LPC}] = \hat{r}(0)\, r_{LP}(0) + 2 \sum_{i=1}^{10} \hat{r}(i)\, r_{LP}(i) \qquad (5)
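A sketch of the unnormalized form (5); r_test is the autocorrelation of the windowed test frame (as produced, for example, by the autocorrelation helper sketched earlier), and the helper names are again this sketch's own:

```python
import numpy as np

def lpc_autocorrelation(a_ref):
    """Autocorrelation r_LP(0..p) of the reference inverse-filter sequence
    [1, -a_1, ..., -a_p] (same sign convention as the LPC sketch above)."""
    poly = np.concatenate(([1.0], -np.asarray(a_ref)))
    p = len(poly) - 1
    return np.array([np.dot(poly[:len(poly) - i], poly[i:])
                     for i in range(p + 1)])

def itakura_unnormalized(a_ref, r_test):
    """Eq. (5): r_test(0) r_LP(0) + 2 * sum_i r_test(i) r_LP(i),
    with r_test the unnormalized autocorrelation of the test frame."""
    r_lp = lpc_autocorrelation(a_ref)
    n = len(r_lp)
    return r_test[0] * r_lp[0] + 2.0 * np.dot(r_test[1:n], r_lp[1:])
```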

3. Generation of Acoustic Templates

The use of templates to represent the acoustic information in reference words allows a significant computation reduction compared to use of the reference tokens themselves. The design of a template generation process is governed by the goal of finding the point in acoustic space that simultaneously minimizes the "distance" to all given reference items. Where the appropriate distance is a linear function of the acoustic variables, this goal can be realized by the use of classic pattern recognition techniques. However, phonetic features are not uniformly distributed across the acoustic data, and therefore perceptually motivated distance measures are nonlinear functions of those data. To avoid the computationally exorbitant procedure of simultaneously minimizing the set of nonlinear distances, templates are incrementally generated by introducing additional acoustic information from each reference word to the partial template formed from the previously used reference words. Given a distance between two tokens, or between a token and a template, the new template can be located along the line whose extent measures that distance. Since only acoustically similar tokens are to be combined into individual templates, one may expect that this procedure will exploit whatever local linearization the space permits.


3.1 Template Generation Algorithms

In one algorithm (Rabiner, 1978), an initial template is chosen as the token whose duration is the closest to the average duration of all tokens representing the same word (Figure 2). Then all remaining tokens are warped to the initial template. The warping is achieved by first using dynamic programming to provide a mapping (or time registration) between any test token and the reference template. Following the notation in Rabiner, Rosenberg, and Levinson (1978), let T_i(m), 0 ≤ m ≤ M_i, be a test contour for word replication i with duration M_i, i = 1, 2, ..., I, and let R_1(m) = T_j(m) be the initial reference contour, where the duration of the jth token is closest to the average duration. For example, these contours may be vectors of cepstrum coefficients obtained at 10 ms intervals during the word. Then dynamic programming may be used to find mappings m = w_i(n), i = 1, 2, ..., I, subject to boundary conditions at the endpoints, such that the total distance D_T(i) between test token i and the reference contour is minimal. A distance function D is defined for each pair of points (m,n). Then

D_T(i) = \min_{\{w_i(n)\}} \sum_{n=1}^{N} D[R_1(n), T_i(w_i(n))] \qquad (6)

With the aid of these mappings, a new reference contour may be defined as

\bar{R}(n) = \frac{1}{I} \sum_{i=1}^{I} T_i[w_i(n)] \qquad (7)

and the process is repeated until the distance between the current and previous templates is below some threshold. This procedure is not dependent on the order in which tokens are considered. However, it is computationally expensive to iterate to the final reference contour. Furthermore, there may be cases where there is no convergence (Rabiner, 1978).

A different algorithm can be used for phonetically similar words; this algorithm requires less computation effort and has no convergence problems. Furthermore, the algorithm allows a reference template to be easily updated with an accepted token during verification to allow for word variation over time. In this procedure (Davis, 1979), each successive token is warped with the current template to produce a new template for the next token (Figure 3). For example,

R_1(n) = T_1(n)

R_2(n) = \frac{1}{2} \left[ R_1(n) + T_2(w_2(n)) \right]

R_3(n) = \frac{1}{3} \left[ 2 R_2(n) + T_3(w_3(n)) \right] \qquad (8)

\vdots

R_I(n) = \frac{1}{I} \left[ (I-1) R_{I-1}(n) + T_I(w_I(n)) \right]
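The running average in (8) amounts to the following update; warp_onto in the usage comment is a hypothetical stand-in for the time registration step of Section 3.2:

```python
import numpy as np

def update_template(template, warped_token, count):
    """Eq. (8): fold one more time-registered token into a template that already
    averages `count` tokens:  R_{k+1} = (count * R_k + T_{k+1}(w(n))) / (count + 1).
    Both arguments are (frames x coefficients) arrays of equal shape."""
    return (count * template + warped_token) / (count + 1)

# Usage sketch (warp_onto is a hypothetical stand-in for the dynamic time
# warping registration of Section 3.2, which is not shown here):
# template = tokens[0]
# for i, tok in enumerate(tokens[1:], start=1):
#     template = update_template(template, warp_onto(template, tok), i)
```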


[Figure 2: Iterative algorithm for template generation. Flowchart not reproduced; the initial reference is the token closest to the average duration, and all N tokens are repeatedly time warped to the current reference.]


[Figure 3: Noniterative algorithm for template generation. Flowchart not reproduced; each token is time warped to the current reference, and the new reference is a weighted average of the previous reference and the warped token.]


Thus, the process ends with the Ith template.

While this algorithm has computational advantages over the first algorithm, the results become order dependent since the warping is sequential and nonlinear. If the tokens are used in a different order, a different template will result. For tokens obtained from the same speaker and spoken within the same context, order dependence is not a problem. However, for tokens obtained from different syntactic positions, order dependence is potentially a problem. Finally, if different speakers are involved, tokens will be less similar, and the order in which they are taken may greatly affect the final template. If clustering algorithms are used to generate multiple templates for each word (Rabiner, 1978), then each cluster may be viewed as a group in which order dependence may be a consideration.

3.2 Time Alignment

All but one of the parametric distance measures explored are derived from Euclidean functions of parameters pertaining to pairs of time frames. The appropriate time frames are chosen to best align the significant acoustic events in time. Because the segments aligned are monosyllabic words, one can take advantage of a number of well defined acoustic features to guide the alignment procedure. For example, the release of a prevocalic voiced stop or the onset of frication of a postvocalic fricative manifests itself by means of such acoustic features. The particular alignment procedure used meets these requirements without requiring explicit decisions concerning the nature of the acoustic events.

The alignment operation employed a modified form of the dynamic programming algorithm first applied to spoken words by Velichko and Zagoruyko (1970) and subsequently modified by Bridle and Brown (1974) and Itakura (1975). In view of the intent to use the same algorithm for template generation as for recognition of unknown tokens, a symmetric dynamic programming algorithm was utilized. Sakoe and Chiba (1978) have recently shown that a symmetric dynamic programming algorithm yields better word recognition results than previously used asymmetric forms.

Execution of the algorithm proceeded in two stages (Figure 4). First, the pair of tokens to be compared was time aligned by appending silence to the marked endpoints and linearly shifting the shorter of the pair with respect to the longer to achieve a preliminary distance minimum. Since monosyllabic words generally possess a prominent syllabic peak in energy, this operation ensured that the syllabic peaks were lined up before the nonlinear minimization process was started. Informal evaluation has shown that use of the preliminary alignment procedure yields better results than omitting the procedure or using a linear time warping procedure to equalize the time durations of the tokens. The two tokens, extended by silence where necessary, were then subjected to the dynamic programming search to find an improved distance minimum. The preliminary distance minimum, found as a result of the initial linear time alignment procedure, corresponded to the distance computed along the diagonal of the search space and represented in most cases a good starting point for the subsequent detailed search.


[Figure 4: Dynamic time alignment of speech samples. Diagram not reproduced; it shows the time-aligned and extended test token A(m) against the template B(n), the extension by silence at each end, the time spans of template and test token, and the range of search in the preliminary time alignment around the diagonal.]


Use of this preliminary time alignment, and the additional invocation of a penalty function when the point selected along the dynamic programming path implied unequal time increments along the measured data, generally forced the optimum warping path to be near the diagonal, unless prominent acoustic information was present to indicate the contrary. For efficiency in programming, zeros (representing silence) were never really appended to the data; rather, the time shift was retained and used to trigger a modified Euclidean or Itakura distance measure when appropriate.

The use of silence to extend the syllable tokens in the preliminary time alignment, instead of linear time expansion or contraction as implied by asymmetric formulations of the dynamic programming algorithm, requires some justification. The comparison here is among syllable-sized units which generally possess an energy peak near the center regions and lesser energy near the ends. Based on a perceptual model, extension of the tokens by silence is clearly appropriate. Linear time scale changes would obscure equally the more significant duration information in the consonantal regions and less significant duration information in the vocalic regions. Discrimination between words like "pool" and "fool" depends critically on the duration of the prevocalic burst or fricative. The alignment ensures that the prominent vowel regions are lined up before time scale changes in the consonantal regions are examined.
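A deliberately simplified sketch of this preliminary linear alignment: the shorter pattern is slid against the longer one and the shift with the smallest summed Euclidean frame distance is kept. Representing silence explicitly (e.g., as zero cepstral vectors) and allowing partial overlaps at the ends are details the sketch omits:

```python
import numpy as np

def preliminary_alignment(a, b):
    """Linearly shift the shorter feature matrix (frames x coefficients) against
    the longer one and return the (shift, distance) giving the smallest summed
    Euclidean frame distance; conceptually the shorter token is padded with
    silence at both ends."""
    if len(a) > len(b):
        a, b = b, a                       # make `a` the shorter pattern
    best_shift, best_dist = 0, np.inf
    for shift in range(len(b) - len(a) + 1):
        dist = np.sum(np.linalg.norm(b[shift:shift + len(a)] - a, axis=1))
        if dist < best_dist:
            best_shift, best_dist = shift, dist
    return best_shift, best_dist
```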

3.3 Dynamic Warping Algorithm

The dynamic warping algorithm serves to estimate the similarity between an unknown token and a reference template. Additionally, it serves to align a reference token with a partial template to ensure that phonetically similar spectral frames are averaged in generating a composite template. Through the preliminary alignment procedure discussed above, the token or template, whichever is shorter, is extended by silence frames on both sides. The resulting multidimensional acoustic representations of the pair of patterns compared can be denoted by A(m), m = 1, 2, ..., M and B(n), n = 1, 2, ..., M. For each pair of frames {A(m), B(n)}, a local distance function D[A,B] can be defined for estimating the similarity at point x'(m,n). A change of variables identifies x'(m,n) as x(p,q), where p and q are measured along and normal to the diagonal illustrated in Figure 4. For each position along the diagonal {x(p,0), 1 ≤ p ≤ M}, points along the normal {x(p,q), |q| ≤ Q(p)} are analyzed, where the search space is limited by |q| ≤ Q(p). The Q(p) define a region in the grid area delimited by lines with slopes 1/2 and 2 passing through the corners x(0,0) and x(M,0).

In order for a grid point x(p,q) to be an acceptable continuation of a path through some previous point x(p-1,q'), it must satisfy two continuity conditions:
a) |q - q'| ≤ 1; this condition restricts the path to follow non-negative time steps along the time coordinates of the patterns, and
b) |q - q''| ≤ 1, where x(p-2,q'') is the selected predecessor of the point x(p-1,q'); this condition restricts any one time frame to participation in at most two local comparisons.
With the aid of these constraints, each point in the search is restricted to at most three possible predecessors. To establish the minimal distance subpath D_T(p,q) leading back to the origin from the point x(p,q), the cumulative distance leading to that point through each possible predecessor x(p-1,q') is minimized. Thus

207

Page 14: COMPARISON OF PARAMETRIC REPRESENTATIONS FOR … · recognition, was performed on a PDP-11/45 minicomputer wi th the Interactive Laboratory System (Pfeifer, 1977). In systems employing

D_T(p,q) = \min_{q'} \left\{ D_T(p-1, q') + D[A(p-q), B(p+q)] \cdot V(q-q') \right\} \qquad (9)

V is a penalty function introduced to keep the alignment path close to the diagonal unless a significant distance reduction is obtained by following a different path. By setting V to 1.5 for |q - q'| = 1 and 1.0 otherwise, unproductive searches far from the diagonal are avoided. Since all paths terminate at x(M,0), the total distance of the minimum distance path and therefore the distance between A and B is given by D_T(M,0).

The minimal distance subpath passes through the points {x(p,q̄), 1 ≤ p ≤ M}. These points allow the identification of pairs of frames A(p-q̄) and B(p+q̄) that contributed to the minimal distance result. A new template G(p), p = 1, 2, ..., M, can then be generated by appropriately averaging the frames A(p-q̄) and B(p+q̄), p = 1, 2, ..., M.
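The following is a deliberately simplified sketch of the dynamic programming search: it works directly on the (m,n) frame grid rather than the rotated (p,q) coordinates, applies the penalty weight in the spirit of V(q - q') in (9), and omits the slope limits Q(p), the two-comparison constraint (b), the preliminary silence extension, and the path backtracking used for template averaging:

```python
import numpy as np

def dtw_distance(A, B, penalty=1.5):
    """Simplified symmetric time warp between two feature matrices
    (frames x coefficients).  Diagonal steps are weighted 1.0 and
    horizontal/vertical steps by `penalty`, echoing V(q - q') in eq. (9)."""
    M, N = len(A), len(B)
    local = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)   # D[m, n]
    acc = np.full((M, N), np.inf)
    acc[0, 0] = local[0, 0]
    for m in range(M):
        for n in range(N):
            if m == 0 and n == 0:
                continue
            best = np.inf
            if m > 0 and n > 0:
                best = min(best, acc[m - 1, n - 1] + local[m, n])            # both advance
            if m > 0:
                best = min(best, acc[m - 1, n] + penalty * local[m, n])      # A advances alone
            if n > 0:
                best = min(best, acc[m, n - 1] + penalty * local[m, n])      # B advances alone
            acc[m, n] = best
    return acc[M - 1, N - 1]
```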

The one exception to template generation by weighted averaging occurs with the LPC. If two LPC vectors are averaged, stability of the resultant vector is not guaranteed. Therefore, LPC templates were generated in the space of LP-derived reflection coefficients. Since the reflection coefficients are bounded in magnitude by one, stability requirements are satisfied and the symmetric dynamic warping algorithm could be used without modification. Alternately, the templates could be derived in the space of LP-derived autocorrelation coefficients, since stability is guaranteed from the result that a stable autocorrelation matrix is positive definite, and a linear combination of positive definite matrices is positive definite and hence stable.
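Averaged reflection coefficients can be converted back to predictor coefficients with the standard step-up recursion; the paper does not describe this conversion explicitly, so the sketch below (same assumed sign convention as the earlier LPC sketch) only indicates how RC-space templates could be turned back into an LPC spectrum when needed:

```python
import numpy as np

def reflection_to_lpc(k):
    """Step-up recursion: reflection coefficients k_1..k_p -> predictor
    coefficients a_1..a_p (convention A(z) = 1 - sum a_i z^-i).  Any set with
    |k_i| < 1, including a linear average of such sets, yields a stable model."""
    a = np.zeros(0)
    for m, km in enumerate(k, start=1):
        a_new = np.zeros(m)
        a_new[m - 1] = km
        if m > 1:
            a_new[:m - 1] = a - km * a[::-1]
        a = a_new
    return a
```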

3.4 Effects of Order in Generating a Template

As discussed above, the incremental addition of individual tokens to a previously formed template results in a final template whose values depend on the order of the tokens.

In a preliminary experiment utilizing the same data base (Davis, 1979), ten sets of reference templates based on six MFCC were generated. Each set of templates used the reference tokens in random order. Independent test data were then matched with each set of templates on a per speaker basis. The average recognition scores and standard deviations were 94.76 ± 0.53% and 90.53 ± 0.48% for each speaker respectively. Thus, random ordering of tokens for template generation did not change the results. At a 0.01 significance level, none of the rates for either speaker was significantly different from the respective mean. Thirty-two of the 52 different CVC word types were never misidentified. Errors were generally confined to the same tokens of a word regardless of the template, and the most confusions were among test-reference pairs such as wake-bait, book-hood and burn-herd.

The consistent rates among template sets indicated that the templates for any given word were relatively similar. To visualize such relationships, all of the pairwise distances for eight templates and four test tokens of keep were measured and fitted to an X-Y plane. The eight templates were arbitrarily chosen from among the 24 possible templates for four reference tokens from DZ1, and the four test tokens were obtained from DZ2.


The fitting procedure adjusted the (x,y) coordinates for each point (template or test token) until the mean-square error in distances among the points was minimized. The coordinate plane shown in Figure 5 indicates that, regardless of ordering, the templates lie close to each other and relatively far from the test tokens, demonstrating the robustness of the algorithm for template generation.

4. Results

For each parametric representation (MFCC, LFCC, LPCC, LPC and RC), the test procedure is the same (Davis & Mermelstein, 1978). Each segmented token from sets DZ1, DZ2, LL1 and LL2 was analyzed, and a matrix of coefficients (columns corresponding to coefficient number and rows corresponding to time frames) was stored (Figure 6). Each set was used in turn as test and reference data. In the case of reference data, templates were formed on a per speaker, per session basis, using all tokens of each word (generally three to five in number) recorded in the session. Two types of testing were used: closed tests, where test and reference data were from the same session, e.g., reference DZ1 vs. test DZ1, and open tests, where test and reference data were from different sessions, e.g., reference DZ1 vs. test DZ2 (Figure 7). For each test word a comparison was performed with each of the 52 reference templates, and the word was identified with the least distant template (maximum similarity). In a practical situation alternative methods, such as vowel preselection and thresholding for rejection, could be applied to reduce the computations and the number of comparisons. In this experiment, however, the emphasis was on recognition accuracy rather than computational efficiency.

The results are listed in Table 1 and displayed in Figure 8 for open tests with 10 coefficients and 6.4 ms frames. Regardless of the frame separation, type of testing or speaker, these data indicate superior performance of the MFCC when compared with the other parametric representations. In fact, the performance of six MFCC was also better than that of any other ten-coefficient set. In all cases, the 6.4 ms frame separation produced better results. As previously stated, the window size was 25.6 ms, and using half the window size produced degraded results. Finally, speaker DZ, a male with exceptionally low fundamental frequency, was better recognized than speaker LL, a male with somewhat higher fundamental frequency. Speaker-dependent differences, however, require further systematic investigation.

Most confusions arose between pairs of words that were phonetically very similar. For example, of the eight misrecognitions using the MFCC parameters for speaker DZ, two were between "bar" and "mar," two were between "pool" and "fool," and one each between "keep" and "heat," "bait" and "wake," "hook" and "rig," and "hood" and "cause." Note that by not using the average spectrum energy (the zeroth cepstrum coefficient) in these comparisons, the overall energy between time-aligned spectral frames has been equalized. Inclusion of the variation of overall energy with time might possibly assist discrimination between such highly confusable word pairs.


Table 1

Recognition Rates Resulting from Use of Various Acoustic Representations

Acoustic Representation       Number of     Distance    Frame            Speaker   Open     Closed
                              Coefficients  Metric      Separation (ms)            Test %   Test %

mel-frequency cepstrum        10            Euclidean    6.4             DZ        96.5     99.4
                                                                         LL        95.0     99.1
                                                         12.8            DZ        95.6     99.4
                                                                         LL        93.8       .9

mel-frequency cepstrum         6            Euclidean    6.4             DZ        96.5     99.4
                                                                         LL        92.0     97.6
                                                         12.8            DZ        95.0     98.8
                                                                         LL        90.2     97.6

linear-frequency cepstrum     10            Euclidean    6.4             DZ        94.7     99.1
                                                                         LL        87.6     98.2
                                                         12.8            DZ        93.2     98.8
                                                                         LL        84.9     97.3

linear-prediction cepstrum    10            Euclidean    6.4             DZ        92.6     99.1
                                                                         LL        87.3     98.2
                                                         12.8            DZ        91.7     98.2
                                                                         LL        86.4     96.7

linear-prediction spectrum    10            Itakura      6.4             DZ        85.2     97.9
                                                                         LL        84.3     95.2

reflection coefficients       10            Euclidean    6.4             DZ        83.1     97.1
                                                                         LL        77.5     97.0
                                                         12.8            DZ        80.5     97.6
                                                                         LL        74.6     96.2


[Figure 5: X-Y coordinate plane for keep. Scatter plot not reproduced; the legend distinguishes the test tokens (filled symbols) from the templates, plotted on x-y axes.]


[Figure 6: Selection of monosyllabic words for template generation. Diagram not reproduced; it traces the word "keep" in four sentences ("Keep the hope at the bar," "Hood the load in the keep," "Bar the keep for the yell," "Bait the keep at the call") from the data through MFCC parameter extraction and time alignment to the templates.]


[Figure 7: Two-way speaker-dependent identification tests. Diagram not reproduced; session 1 and session 2 recordings for speakers DZ and LL serve in turn as test data and as templates, and each test word is identified against the templates.]


[Figure 8: Performance of parametric representations for recognition. Plot not reproduced; open-test recognition rates (%) are shown for speakers DZ and LL as a function of the parameter set (MFCC, LFCC, LPCC, LPC, RC).]


5. Conclusions

The similarity in rank order of the recognition rates for each of the two speakers suggests that the differences among the various acoustic representations are consistent. These differences lead to the following conclusions.

1) Parameters derived from the short-term Fourier spectrum (MFCC, LFCC) of the acoustic signal preserve information that parameters derived from the LPC spectrum (LPCC, LPC, RC) omit. Both spectral representations are considered adequate for vowels. However, it is the confusions between the consonants that are most frequent. The differences found may be due to the insufficiently accurate representation of the consonantal spectra by the linear prediction technique.

2) The mel-frequency cepstra possess a significant advantage over the linear-frequency cepstra; specifically, the MFCC allow better suppression of phonetically insignificant spectral variation in the higher frequency region.

3) The cepstrum parameters (MFCC, LFCC and LPCC), which are frequency-smoothed representations, succeed better than the LPC and RC in capturing the phonetically significant information. A Euclidean distance metric defined on the cepstrum parameters apparently allows a better separation of phonetically distinct units. Since there is a unique transformation between a set of LPCC and the corresponding LPC and RC, these representations can be said to contain equivalent information. However, this transformation is nonlinear. Representing the acoustic information in the hyperspace of cepstrum parameters favors the use of a particularly simple distance metric.

4) Defining the metric on the basis of the Itakura distance is less effective than defining it on the basis of the cepstrum distance. The point of optimality is the same, i.e., equality between cepstra implies zero increase in prediction residual energy. However, the Itakura distance is less successful than the cepstrum distance in indicating the phonetic significance of the difference between a pair of spectra.

5) The mel-frequency cepstrum coefficients form a particularly compact representation. Six coefficients succeed in capturing most of the relevant information. The importance of the higher cepstrum coefficients appears to depend on the speaker. Further data are required from additional speakers before firm conclusions can be reached on the optimal number of coefficients.

The results are limited by the restrictions on the data base. In particular, consonant clusters, multisyllabic words and unstressed monosyllabic words have not been studied. Expansion of the data base along any one of these directions introduces additional representation problems. It is not obvious that the best representation for stressed words is also best for the much more elastic unstressed words. These questions are left for future studies.


It should be emphasized that the comparative ranking of the representations can be influenced by the choice of both the local and the integrated distance metrics. A Euclidean distance function is one of the simplest to implement. However, taking into account the probability distributions of the individual parameters should result in improved performance. Estimating these distributions requires considerable data. Yet even if only a few parameters of these distributions are known, for example, the variance of the coefficients, better local distance metrics could be designed. Despite the high recognition rates achieved so far, there is reason to believe that even better performance can be attained in the future.

The design of the mel-frequency cepstrum representation was motivated by perceptual factors. Evidently, an ability to capture the perceptually relevant information is an important advantage. The design of an improved distance metric may result from more accurate modeling of perceptual behavior. In particular, where a constant difference between spectra persists for a number of consecutive time frames, the contribution of that difference in the current distance computation is proportional to the duration of that difference. With the possible exception of very short durations, no perceptual justification exists for this property (Feldtkeller & Zwicker, 1956). Nevertheless, the distance function must in some fashion combine different information from all the time frames constituting the signals compared.

Further optimization of the integrated distance function represents an important challenge.

For each representation a small but significant gain in recognition is achieved by decreasing the frame spacing from 12.8 ms to 6.4 ms. The average difference in the recognition rates is 1.7%. However, the computational complexity for any dynamic programming comparison varies as the square of the average number of frames constituting a word. Thus a significant computational penalty accompanies any increase in the frame rate. In contrast, the computations grow only linearly with the number of cepstrum coefficients. Since the recognition rate for six cepstrum coefficients and 6.4 ms frame spacing is quite comparable to the rate for ten coefficients and 12.8 ms frame spacing, increasing the number of coefficients and maintaining a somewhat coarser time resolution is computationally more advantageous than using fewer coefficients more frequently.

The principal conclusion of the study is that perceptually based word templates are effective in capturing the acoustic information required to recognize these words in continuous speech. Due to the various limitations of this study, a conclusion that such high recognition rates are attainable with a complete automatic system operating in a practical environment is not warranted at this time. However, the results do encourage a continuing effort to optimize the performance of speech recognition systems by critical evaluation of each of the constituent components.

REFERENCES

Bridle, J. S., & Brown, M. D. An experimental automatic word recognition system. JSRU Report (Joint Speech Research Unit, Ruislip, England), 1974, No. 1003.

Davis, S. B. Order dependence in templates for monosyllabic word identification. Conference Record, 1979 International Conference on Acoustics, Speech and Signal Processing, Washington, 1979, 570-573.

Davis, S. B., & Mermelstein, P. Evaluation of acoustic parameters for monosyllabic word identification. Journal of the Acoustical Society of America, 1978, 64, Suppl. 1, S180. (Abstract)

Fant, C. G. M. Acoustic description and classification of phonetic units. Ericsson Technics, 1959, 1. Also in G. Fant, Speech sounds and features, MIT Press, 32-83, 1973.

Feldtkeller, R., & Zwicker, E. Das Ohr als Nachrichtenempfänger. Stuttgart: S. Hirzel, 1956.

Fujimura, O. The syllable as a unit of speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 1975, ASSP-23, 82-87.

Gray, A. H. Jr., & Markel, J. D. Distance measures for speech processing. IEEE Transactions on Acoustics, Speech and Signal Processing, 1976, ASSP-24, 380-391.

Itakura, F. Minimum prediction residual principle applied to speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 1975, ASSP-23, 67-72.

Markel, J. D., & Gray, A. H. Jr. Linear prediction of speech. New York: Springer-Verlag, 1976.

Mermelstein, P. Automatic segmentation of speech into syllabic units. Journal of the Acoustical Society of America, 1975, 58, 880-883. (a)

Mermelstein, P. A phonetic-context controlled strategy for segmentation and phonetic labelling of speech. IEEE Transactions on Acoustics, Speech and Signal Processing, 1975, ASSP-23, 79-82. (b)

Mermelstein, P. Distance measures for speech recognition, psychological and instrumental. In C. H. Chen (Ed.), Pattern recognition and artificial intelligence. New York: Academic Press, 1976, 374-388.

Mermelstein, P. Recognition of monosyllabic words in continuous sentences using composite word templates. Conference Record, 1978 International Conference on Acoustics, Speech and Signal Processing, Tulsa, 1978, 708-711.

Pfeifer, L. L. Interactive laboratory system users guide. Santa Barbara: Signal Technology, Inc., 1977.

Pols, L. C. W. Spectral analysis and identification of Dutch vowels in monosyllabic words. Unpublished doctoral dissertation, Free University, Amsterdam, 1977.

Rabiner, L. R. On creating reference templates for speaker independent recognition of isolated words. IEEE Transactions on Acoustics, Speech and Signal Processing, 1978, ASSP-26, 34-42.

Rabiner, L. R., Rosenberg, A. E., & Levinson, S. E. Considerations in dynamic time warping algorithms for discrete word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 1978, ASSP-26, 575-586.

Sakoe, H., & Chiba, S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 1978, ASSP-26, 43-49.

Schroeder, M. R. Recognition of complex acoustic signals. Life Sciences Research Report (T. H. Bullock, Ed.), 1977, 55, 323-328.

Velichko, V. M., & Zagoruyko, N. G. Automatic recognition of 200 words. International Journal of Man-Machine Studies, 1970, 2, 223-234.

Wakita, H. Direct estimation of the vocal tract shape by inverse filtering of acoustic speech waveforms. IEEE Transactions on Audio and Electroacoustics, 1973, AU-21, 417-427.


White, G. M., & Neely, R. B. Speech recognition experiments with linear prediction, bandpass filtering and dynamic programming. IEEE Transactions on Acoustics, Speech and Signal Processing, 1976, ASSP-24, 173-188.

FOOTNOTE

1. Fant (1973) compares Beranek's mel-frequency scale, Koenig's scale and Fant's approximation to the mel-frequency scale. Since the differences between these scales are not significant here, the mel-frequency scale should be understood as a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz.


APPENDIX A

Sentences used for word recognition.

1. Keep the hope at the bar.
2. Dig this rock in the heat.
3. Wake the herd at the head.
4. Check the lock on the seal.
5. Bang this bar on the head.
6. Call a mess in the case.
7. Cut the coat for a mop.
8. Foot the work in the mess.
9. Boot the back of the book.
10. Burn your check in the jar.
11. Mop the room on the watch.
12. Load the tar for the bait.
13. Tar this rig in a rush.
14. Fear a hood on the ship.
15. Rig a bait for the work.
16. Nail that book to the rock.
17. Yell this call for the wake.
18. Gang the bait on the coat.
19. Walk the watch in the hope.
20. Buff one book for the walk.
21. Hook the mop on the lock.
22. Pool the case for the man.
23. Hurl his bar in the muck.
24. Bomb the head at the wake.
25. Pose this seal for the gang.
26. Mar the watch on the hood.
27. Heat the foot of the fool.
28. Kill the herd for the load.
29. Case your ship for the cause.
30. Head the rush for the burn.
31. Back the pool for the check.
32. Watch that hook with the nail.
33. Rush the buff at the foot.
34. Hood the load for the keep.
35. Room one seal in the pool.
36. Herd the fool with a yell.
37. Rock the mop with a hurl.
38. Coat the cut with the tar.
39. Jar the bomb with a bang.
40. Seal the dig in a fear.
41. Ship the nail in a boot.
42. Bait the keep with a call.
43. Mess his work in the room.
44. Man the cut at the kill.
45. Cause a mar on the back.
46. Muck the gang on the walk.
47. Book the fool on the rig.
48. Fool the man on the rock.
49. Work the hurl at the dig.
50. Lock your man in a pose.
51. Hope this call for the heat.
52. Bar the keep for the yell.
53. Put a bang in the bomb.
54. Set a pose in the muck.
55. Pose a jar on the buff.
56. Kill the fear in the cause.
57. Mar the burn on the head.
