Speech Communication 48 (2006) 1009–1023

www.elsevier.com/locate/specom

Effect of voice quality on frequency-warped modeling of vowel spectra

Pushkar Patwardhan *, Preeti Rao

Department of Electrical Engineering, Indian Institute of Technology Bombay, Powai, Mumbai 400 076, India

Received 19 July 2005; received in revised form 7 December 2005; accepted 6 January 2006

Abstract

The perceptual accuracy of an all-pole representation of the spectral envelope of voiced sounds may be enhanced by the use of frequency-scale warping prior to LP modeling. For the representation of harmonic amplitudes in the sinusoidal coding of voiced sounds, the effectiveness of frequency warping was shown to depend on the underlying signal spectral shape as determined by phoneme quality. In this paper, the previous work is extended to the other important dimension of spectral shape variation, namely voice quality. The influence of voice quality attributes on the perceived modeling error in frequency-warped LP modeling of the spectral envelope is investigated through subjective and objective measures applied to synthetic and natural steady sounds. Experimental results are presented that demonstrate the feasibility and advantage of adapting the warping function to the signal spectral envelope in the context of a sinusoidal speech coding scheme.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Voice quality; Spectral envelope modeling; Frequency warping; All-pole modeling; Partial loudness

1. Introduction

A successful low bit rate speech coding algorithm requires a good model for the speech signal together with an effective parameter quantization algorithm. A popular model for low bit rate speech coding has been the sinusoidal model, an important example of which is the multiband excitation (MBE) model (Griffin and Lim, 1988). In the case of voiced speech, the parameters of the model are the fundamental frequency (or pitch), and the amplitudes and phases of the harmonics. The harmonic amplitudes represent the product of the source excitation and vocal tract spectra. At low bit rates, estimated phases are usually dispensed with, and the accurate representation of the pitch and harmonic amplitudes becomes critical to the perceptual quality of decoded speech. Steady vowel sounds are particularly sensitive to harmonic amplitude representation errors.

0167-6393/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.specom.2006.01.003
A portion of this work was presented at the International Conference on Spoken Language Technology-04, New Delhi.
* Corresponding author. Tel.: +91 22 25346665; Mobile: +91 9819073984; fax: +91 22 25723806.
E-mail address: [email protected] (P. Patwardhan).

The quantization of harmonic amplitudes is most demanding on the bit allotment in sinusoidal coding, and methods for efficient quantization have been an important topic of research. A widely used method for the quantization of the amplitudes of the harmonics is based on the modeling of a spectral envelope fitted to the harmonic peaks (MacAulay and Quatieri, 1995). The spectral amplitudes are then reconstructed from samples of the modeled spectral envelope at harmonic frequencies. Representing the spectral envelope by the coefficients of an all-pole filter enables the use of one of the many efficient quantization methods available in the speech coding literature. The order of the all-pole model has a significant effect on the accuracy of the modeled spectral amplitudes. While the all-pole representation of the spectral envelope is expected to capture local resonances accurately, capturing additional features such as overall spectral tilt and spectral zeros due to nasality typically leads to an increase in the number of poles required for an adequate approximation. Further, for a similar spectral envelope, low pitched sounds require a higher LP model order to reach similar perceived quality levels (Champion et al., 1994; Rao and Patwardhan, 2005).

In the interest of achieving low bit rates, however, it is necessary to keep the model order as low as possible. Frequency-scale warping before all-pole modeling of the spectral envelope is a widely used method to improve the perceptual accuracy of modeling for a given model order. Frequency-scale warping leads to a more accurate representation of the low frequency spectrum at the cost of increased errors in the high frequency region. Although perceptual scales such as the Bark-scale and its variants have been widely used in LP modeling of speech spectra, our recent work (Rao and Patwardhan, 2005) on synthetically generated steady vowel sounds (using fixed excitation source parameters) indicated that the performance of frequency warping depended to a great extent on the nature of the underlying sound spectrum. It was observed that front vowels such as [eI] and [i] in fact were better modeled without Bark-scale warping. This was explained by the low frequency first formant structure of these vowels, which fails to mask high frequency distortion adequately. In a subjective experiment using natural speech, it was found that the comparative behaviour of modeling error under different warping conditions in the case of the non-front vowels was inconsistent across speakers, indicating a further dependence on speaker voice quality as reflected in the overall spectral envelope.

In the present study we attempt to extend our previous work by exploring aspects of voice quality that could influence the spectrum modeling error. The influence of voice quality on perceived modeling error in frequency-warped all-pole modeling of the spectral envelope is studied via the framework of MBE model analysis–synthesis.

While the overall spectral envelope for voiced speech is determined by the glottal source waveform, vocal tract transfer function and lip radiation, it is the glottal source waveform that most directly affects the relative strengths of the low and high frequency harmonics. The scope of the present study is restricted to variations in laryngeal voice qualities, for they are closely linked to gross differences in spectral envelope. An investigation based on subjective and objective experimental evaluation is presented. Finally, the applicability of the results to improving speech quality in a low bit rate sinusoidal coder is discussed.

2. Frequency-warped LP modeling of discrete spectra

A discrete spectrum, characterized by a fundamental frequency and the amplitudes of the components at harmonic frequencies, can be represented by the coefficients of an all-pole model. A smooth spectral envelope is first derived to fit the harmonic amplitudes using a suitable interpolation method such as linear interpolation of log amplitudes (Hermansky et al., 1985; Rao and Patwardhan, 2005). The power spectrum obtained from the interpolated envelope is used to compute the autocorrelation function via the inverse DFT. The Levinson–Durbin algorithm is applied to obtain the LP coefficients. The spectral amplitudes can be recovered by sampling the reconstructed spectral envelope, represented by the all-pole coefficients, at the harmonic frequency locations. Frequency-scale warping may be incorporated in the spectral envelope modeling by mapping the input harmonic frequencies to corresponding warped frequency locations by means of a warping function based on a chosen perceptual scale. The log-linear interpolation of the discrete spectral amplitudes (now non-uniformly spaced in frequency) is carried out to obtain a densely sampled spectral envelope, as detailed in Section 3 of (Rao and Patwardhan, 2005). Uniformly spaced samples at 20 Hz intervals are found to be adequate for the LP modeling of narrowband speech spectra in the sinusoidal coding context.
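The pipeline described above (log-linear envelope interpolation, autocorrelation via the inverse DFT of the power spectrum, Levinson–Durbin recursion) can be sketched in a few lines. The sketch below is only an illustration under our own assumptions — the function names, the flat extrapolation beyond the outermost harmonics, and the toy two-resonance harmonic spectrum are ours, not the paper's implementation. Frequency warping would simply remap `harm_freqs` through a chosen warping function before the interpolation step.

```python
import math

def levinson_durbin(r, order):
    """Solve the LP normal equations from autocorrelation values r[0..order]."""
    a = [0.0] * (order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        acc = sum(a[j] * r[i - j] for j in range(1, i))
        k = -(r[i] + acc) / e          # reflection coefficient
        a_new = a[:]
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]
        a_new[i] = k
        a = a_new
        e *= (1.0 - k * k)             # prediction error update
    return a, e

def lp_from_discrete_spectrum(harm_freqs, harm_amps, order=10, grid_hz=20.0, fs=8000.0):
    """LP model of a discrete (harmonic) spectrum:
    1) log-linear interpolation of harmonic amplitudes onto a dense 20 Hz grid,
    2) autocorrelation via inverse DFT of the interpolated power spectrum,
    3) Levinson-Durbin recursion."""
    nyq = fs / 2.0
    n_grid = int(nyq / grid_hz) + 1
    grid = [i * grid_hz for i in range(n_grid)]
    log_amps = [math.log(a) for a in harm_amps]
    env = []
    for f in grid:
        if f <= harm_freqs[0]:
            env.append(log_amps[0])      # flat extrapolation (our assumption)
        elif f >= harm_freqs[-1]:
            env.append(log_amps[-1])
        else:
            j = 1
            while harm_freqs[j] < f:
                j += 1
            t = (f - harm_freqs[j - 1]) / (harm_freqs[j] - harm_freqs[j - 1])
            env.append(log_amps[j - 1] + t * (log_amps[j] - log_amps[j - 1]))
    power = [math.exp(2.0 * e) for e in env]
    # Autocorrelation from the symmetric power spectrum via inverse DFT:
    # r[m] = (1/N) * sum_k P(k) cos(2*pi*k*m/N) over the full 0..fs band.
    N = 2 * (n_grid - 1)
    r = []
    for m in range(order + 1):
        s = power[0] + power[n_grid - 1] * math.cos(math.pi * m)
        for k in range(1, n_grid - 1):
            s += 2.0 * power[k] * math.cos(2.0 * math.pi * k * m / N)
        r.append(s / N)
    a, _err = levinson_durbin(r, order)
    return a

# Toy harmonic spectrum at 120 Hz pitch with two formant-like resonances
f0 = 120.0
freqs = [f0 * k for k in range(1, 33)]
amps = [1.0 / (1.0 + ((f - 600.0) / 400.0) ** 2)
        + 0.3 / (1.0 + ((f - 2200.0) / 300.0) ** 2) for f in freqs]
a = lp_from_discrete_spectrum(freqs, amps, order=10)
```

Because the interpolated power spectrum is strictly positive, the resulting Toeplitz system is positive definite and the recursion is guaranteed not to divide by zero.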

In this work, we use the framework of MBE narrow band coding of speech to evaluate frequency-warped LP modeling of voiced sounds. The MBE analysis–synthesis model offers a convenient framework for the evaluation of spectral envelope modeling (Molyneux et al., 1998; Rao and Patwardhan, 2005), although the results are applicable more generally to other vocoders. Voiced regions are modeled by harmonics of a fundamental frequency in the MBE vocoder, and unvoiced bands by spectrally shaped random noise. Thus the parameters of the MBE speech model consist of the fundamental frequency, band voicing decisions, and the harmonic amplitudes (Griffin and Lim, 1988).

[Fig. 1. Generation of MBE modeled test and reference signals.]

MBE analysis involves the use of a high resolution DFT in an analysis-by-synthesis loop for the accurate determination of pitch, voicing and spectral amplitudes for each input frame of speech (typically 20 ms duration). In the case of fully voiced speech, synthesis is achieved by summing sinusoids, each corresponding to one harmonic. Adjacent frames are combined using either overlap-add or the interpolation of amplitude and phase, depending on the extent of pitch variation (Griffin and Lim, 1988). With unquantised parameters, synthesized speech of very high quality is obtained, particularly in voiced regions. Spectral envelope-interpolated LP modeling is utilized to achieve the low bit rate coding of the harmonic spectral amplitudes in an MBE model based speech coder. In the present work, we investigate some aspects of the performance of frequency-scale warping in obtaining improved quality at low LP model order. Based on the considerations discussed in (Rao and Patwardhan, 2005), the LP model order selected for the speech quality evaluation experiments is 10. The test and evaluation framework is depicted in Fig. 1, which shows the generation of reference and test signals used in the quality comparisons.
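The sum-of-sinusoids synthesis step for a fully voiced frame can be illustrated as follows. This is a bare sketch with assumed names; the actual coder additionally handles overlap-add and amplitude/phase interpolation across frames, which are omitted here.

```python
import math

def synth_voiced(f0, amps, phases, n_samples, fs=8000.0):
    """Sum-of-sinusoids synthesis of one voiced frame:
    one cosine per harmonic of the fundamental f0 (Hz)."""
    out = []
    for n in range(n_samples):
        t = n / fs
        out.append(sum(a * math.cos(2.0 * math.pi * f0 * (k + 1) * t + p)
                       for k, (a, p) in enumerate(zip(amps, phases))))
    return out

# One 20 ms frame (160 samples at 8 kHz) with three harmonics of 120 Hz
frame = synth_voiced(120.0, [1.0, 0.5, 0.25], [0.0, 0.0, 0.0], 160)
```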

3. Voice quality and its spectral correlates

Voice quality refers to the auditory impression a listener gets upon hearing the speech of another talker. Voice quality is determined by the articulators of the vocal tract as well as the characteristics of the vocal folds. For speech, the vocal folds are of predominant importance. We refer to this aspect of voice quality, which results from differences in vocal cord vibratory patterns, as laryngeal voice quality (Childers and Lee, 1991). Periodic glottal excitation is a characteristic of "modal" voices. The perceptual and spectral variations within modal voices can be attributed to the differences in the timing parameters of the glottal pulse. Since we are concerned with the spectral representation of steady periodic sounds, we do not consider those voice types that are characterized by voicing aperiodicities such as aspiration noise and pitch/amplitude jitter.

3.1. Modal voice production parameters

The speech signal is generated by excitation of the vocal tract caused by the modulated flow of air generated by the lungs. The volume of air passing through the vocal cords, also known as the glottal volume flow, is the excitation signal, which is periodic in the case of voiced speech. The glottal pulse shape has been extensively studied and modeled for different voice qualities. A typical glottal pulse period (glottal volume flow) and its derivative (Childers, 2000; Childers and Lee, 1991) are shown in Fig. 2(a) and (b).

The parameter tp denotes the instant of the maximum in the glottal flow waveform. The parameter te is the instant of the maximum negative differentiated glottal flow. The parameter ta is the time constant of the exponential curve of the second segment of the LF model, and tc is the instant at which complete glottal closure is reached. The spectral shape of the glottal pulse derivative is dependent on the values of the timing parameters. Modal voices, common among male speakers, are characterized by the nearly complete closure of the vocal folds. Fig. 2(c) shows glottal pulse derivative waveforms corresponding to three different sets of timing parameters and Fig. 2(d) shows the corresponding spectra. An abrupt closure of the vocal folds (low ta) creates strong high frequency harmonics, while a relatively gradual closure of the vocal folds results in spectra with strong low frequency harmonics and weak high frequency harmonics. The perceptual correlate of the relative strengths of the low and high frequency harmonics in the glottal source spectrum is the brightness of the voice. "Dark" voices have relatively weak high harmonics while "bright" voices are characterized by flatter spectra. Apart from the abruptness of the closure of the vocal folds, the symmetry of the open phase of the glottal pulse, as captured by the "pulse skew", also affects the spectrum. It influences the relative amplitudes of the low frequency harmonics (especially the first and second harmonics) (Doval and d'Alessandro, 1997).

[Fig. 2. Temporal and spectral structure of glottal pulse. (a) Glottal flow, (b) glottal flow derivative, (c) glottal flow derivative for three different sets of timing parameters, (d) spectra corresponding to (c).]

3.2. Spectral correlates

Spectral shape can provide useful cues to relevant aspects of voice quality. Some of the important cues that reflect the characteristics of the glottal signal are as follows:


H1–A3: This is the ratio of the amplitude of the first harmonic (H1) relative to that of the third-formant spectral peak (A3), and has been used by Hanson (1997) to characterize spectral tilt. The middle and high frequency components are directly affected by the duration of the glottal pulse. An abrupt closure in the glottal cycle results in relatively strong middle and high frequency components, i.e. A3 is typically higher than what it would be with a gradual glottal closure. Fig. 3 shows the comparison of harmonic spectra for the vowel [2] for dark and bright sounds. It clearly shows the differences in the slopes of the overall spectral envelopes.

H1–H2: This is the ratio of the amplitudes of the first two harmonics. It is affected by the glottal pulse skew as well as the open quotient (OQ) of the glottal pulse. The glottal pulse skew is measured by the speed quotient (SQ), which is computed as the ratio of the opening interval to the closing interval, while the open quotient is the ratio of the open glottal interval to the pitch period. A higher SQ indicates a lower H1–H2 (Doval and d'Alessandro, 1997), while a higher OQ implies a higher H1–H2.

[Fig. 3. Temporal waveforms and spectra of "dark" and "bright" vowel [2]. Acoustic parameters labeled in "bright" [2] are: H1 – amplitude of first harmonic, H2 – amplitude of second harmonic, A1 – amplitude of first formant, A2 – amplitude of second formant, A3 – amplitude of third formant.]
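Both harmonic-ratio measures are straightforward to compute once the harmonic amplitudes have been estimated. The sketch below is our own illustration, not the paper's code: the function names and the formant-band limits passed to `h1_a3` are hypothetical (the paper follows Hanson (1997) in locating the third-formant peak).

```python
import math

def db(x):
    """Linear amplitude to dB."""
    return 20.0 * math.log10(x)

def h1_h2(harm_amps):
    """H1-H2 in dB from linear harmonic amplitudes (first minus second harmonic)."""
    return db(harm_amps[0]) - db(harm_amps[1])

def h1_a3(harm_amps, harm_freqs, f3_lo, f3_hi):
    """H1-A3 in dB: first harmonic vs the strongest harmonic in an assumed
    third-formant band [f3_lo, f3_hi] (band limits are hypothetical)."""
    a3 = max(a for a, f in zip(harm_amps, harm_freqs) if f3_lo <= f <= f3_hi)
    return db(harm_amps[0]) - db(a3)
```

A steeper spectral tilt ("dark" voice) shows up directly as a larger H1–A3 value.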

Roll-off: Roll-off is an indicator of the shape of the spectrum. It is defined as the frequency within which 85% of the total accumulated magnitude is concentrated (Burred and Lerch, 2004). The roll-off is determined by the largest DFT bin R which satisfies

\[ \sum_{k=N_1}^{R-1} |X(k)| \le 0.85 \sum_{k=N_1}^{N_2} |X(k)| \qquad (1) \]

where X(·) is the DFT spectrum of the input frame. N1 and N2 define the DFT bins corresponding to the range of frequencies over which the roll-off is computed. Based on the values of N1 and N2, R can represent the roll-off in any frequency region of interest. The roll-off computed over the full frequency spectrum (0, 4000 Hz) captures the overall slope of the spectrum. For right-skewed spectra the value of the roll-off turns out to be high, while for left-skewed spectra the roll-off is lower. For bright vowels we expect a high roll-off.
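Eq. (1) translates directly into code. The following is a minimal sketch with our own function name and test spectra; it returns the largest bin R whose cumulative magnitude stays within the 85% threshold.

```python
def rolloff_bin(mag, n1, n2, fraction=0.85):
    """Largest DFT bin R such that sum_{k=n1}^{R-1} |X(k)| <= fraction * sum_{k=n1}^{n2} |X(k)|.
    mag is a list of DFT magnitudes |X(k)|."""
    total = sum(mag[n1:n2 + 1])
    acc = 0.0
    R = n1
    for k in range(n1, n2 + 1):
        if acc + mag[k] > fraction * total:
            break
        acc += mag[k]
        R = k + 1
    return R

# Flat spectrum of 100 equal bins: 85% of the mass lies in the first 85 bins,
# so the roll-off bin is 85.
flat = rolloff_bin([1.0] * 100, 0, 99)
```

A spectrum with its energy concentrated at high frequencies (right-skewed, "bright") yields a larger R than one concentrated at low frequencies, matching the discussion above.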

It must be remarked that perceived voice quality may also be influenced by recording conditions. The spectral measures described in this section would reflect both the glottal waveform characteristics as well as the transfer function of the recording setup.

4. Experimental evaluation

Experiments were designed to investigate the influence of voice quality on the modeling error from frequency-warped LP modeling of the spectral envelope as presented in Section 2. Vowels uttered in low pitched modal voices of different brightness were obtained by synthesis as well as by extraction from natural speech.

4.1. Test set generation

Natural and synthetic samples corresponding to eight distinct vowels, as shown in Table 1, were generated for the experiment. Synthesis of vowels allowed the manipulation of the glottal timing parameters, which in turn enabled control over the voice quality. The synthetic vowels were generated using articulatory synthesis, with the articulation parameters estimated from target speech by an analysis-by-synthesis approach (Childers, 2000). For the synthesis of each of the vowels, the estimated articulatory parameters together with an excitation signal, separately generated using an LF model with fixed parameters, are used in a period-by-period synthesis. The synthesized voice quality is dependent on the LF model parameters (to, tp, ta, te, tc and Ee) (Childers, 2000). By varying the timing parameters it is possible to generate variations in the spectral slope (and consequently in the perceived brightness) of the voice. The four sets of glottal pulse timing parameters used in the experiment are shown in Table 2. Four instances of each vowel corresponding to each set of glottal parameters (i.e. a total of 32 synthetic sounds) were generated. Based on the resulting overall spectral slope, we term these "dark", "weak-dark", "weak-bright" and "bright". Each vowel was generated at 120 Hz pitch (i.e. to = 8.3 ms) for a duration of 350 ms. The start and end of the sound were tapered to avoid abrupt transitions at the boundaries.

Table 1
Set of vowels used in the experiment

Vowel ID | IPA symbol | Typical word | ARPABET symbol
1        | 2          | "guard"      | [aa]
2        | æ          | "cat"        | [ae]
3        | c          | "orchid"     | [ao]
4        | L          | "cut"        | [uh]
5        | o¨         | "lotus"      | [ow]
6        | ¨          | "boot"       | [uw]
7        | i          | "beet"       | [iy]
8        | eI         | "mate"       | [ey]

The set of natural vowels consisted of four instances of each vowel (pitch ranging between 80 Hz and 120 Hz) taken from slowly uttered words with neutral intonation from a set of six male (Indian) speakers. The sounds were selected such that two instances of each vowel were of bright quality and two of dark quality, as judged by listening as well as by visual examination of the spectral slope. The ends were tapered to eliminate abrupt transitions. This resulted in a total set of 32 vowel sounds (four instances of eight distinct vowels selected in all across six speakers). In a few instances of the vowel sounds (i, c, ¨) it was found that the actual vowel durations were between 250 ms and 280 ms and insufficient for judgment. The durations of such sounds were extended to reach the minimum duration of 350 ms by repeating the first two periods at the start of the sound and the final two periods at the end. This artificial extension was not detectable upon listening. The sounds were analyzed by the MBE model algorithm to estimate the pitch and spectral amplitudes. The spectral amplitudes thus obtained for each 20 ms speech frame were modeled with 10th order frequency-warped LP coefficients with a chosen frequency warping factor. The synthesis was carried out by the standard sinusoidal synthesis method (MacAulay and Quatieri, 1995) using the spectral amplitudes obtained by the frequency-warped all-pole model approximation. It was then compared with a reference sound synthesized using the originally estimated spectral amplitudes. There were three test sounds for each of the 32 reference sounds: LP modeled sound without frequency warping ("U"), LP modeled with a mild version of Bark-scale warping ("M") and LP modeled with Bark-scale warping ("B"), based on the frequency warping scales described in (Rao and Patwardhan, 2005).

Table 2
Glottal waveform timing parameters as a % of to (Childers, 2000) used to generate the synthetic vowels with four voice qualities

Voice quality | tp (%) | te (%) | ta (%)
Dark          | 35     | 50     | 15
Weak-dark     | 40     | 50     | 15
Weak-bright   | 40     | 50     | 2
Bright        | 45     | 50     | 2

to = 8.3 ms; tc = 0.9 to; Ee = 40.

4.2. Subjective test

A subjective listening experiment was set up to compare the perceived qualities of the LP modeling with different warping factors. Six normal hearing listeners participated in the test. The test material was presented to the subject at normal listening levels through high quality headphones connected to a PC sound card in a quiet room. For each of the test sounds the following "trio" presentation format was used: "reference-test-reference", where 250 ms of silence separated the sounds. This format made the degradation of the test with respect to the reference sound relatively easy to detect. The reference sound was also separately available for listening.

The subjects were asked to rank (using ranks 1, 2 and 3) the relative perceived degradations of the test sounds U, M and B with respect to the corresponding reference sound for each of the vowel sounds in the test set. Rank 1 would correspond to the least degradation. An undetectable difference would result in a suitable tied ranking. Subjects were allowed to listen to the test trios associated with a given reference any number of times before making a decision. Each listener did the test using the same set of items in different orders on three separate occasions. Since it was found that the subjective ranks were nearly constant across listeners and trials, an overall ranking order was derived for each test vowel by combining the numerical ranks across listeners and trials as follows.

A numerical value is assigned to each rank in a single trial as follows: rank 1 = 2 points, rank 2 = 1 point, rank 3 = 0 points. In the case of a tie, an equal number of points is given to both (i.e. two points each for first position or one point each for second position). The points are summed for each item across listeners to get the "total score". The total scores are then used to derive an overall subjective ranking. While the data within each item (i.e. reference sound) can be compared, this is not so across items, because the relative degradations were rated from best to worst (ranked) only within each item set. For example, a high score in item 1 may indicate perceptually transparent modeling, while the same score in item 2 may indicate a small but clearly audible degradation.
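The rank-to-points scheme above can be made concrete with a small helper. This is a sketch only; the dictionary-based interface is our own assumption about how the per-trial rankings might be recorded.

```python
def trial_points(ranks):
    """Map ranks to points for one trial: rank 1 -> 2, rank 2 -> 1, rank 3 -> 0.
    Tied conditions carry the same rank and therefore receive equal points."""
    pts = {1: 2, 2: 1, 3: 0}
    return {cond: pts[r] for cond, r in ranks.items()}

def total_scores(trials):
    """Sum per-trial points across listeners and occasions to get the 'total score'."""
    totals = {}
    for ranks in trials:
        for cond, p in trial_points(ranks).items():
            totals[cond] = totals.get(cond, 0) + p
    return totals

# Example: 18 trials (6 listeners x 3 occasions), all ranking B best, M second, U worst
scores = total_scores([{"U": 3, "M": 2, "B": 1}] * 18)
```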

4.3. Objective measurement of degradation

To quantify the perceived modeling error in each of the vowel sounds, we used partial loudness (PL), a psychoacoustical distance measure with demonstrated correlation with subjective judgement of spectral degradation in steady vowel sounds (Moore et al., 1997; Rao and Patwardhan, 2005). The spectrum modeling error is treated as the signal whose loudness is to be estimated in the presence of a background masker (the reference sound). The reference sound is intended to take the role of the background noise and the modeled sound that of the signal plus background noise (the reference sound plus the modeling error). The linear spectral distortion is treated as additive noise with a power spectrum given by the difference between the reference and modeled power spectra (Rao et al., 2001). The PL of the spectrum modeling error is computed for each frame and averaged across frames to obtain the PL value for the entire sample.

The objective rankings of relative degradation across the U/M/B warping conditions were derived based upon the computed distortion measure for each of the reference-test sound pairs. The performance of an objective distance measure in predicting subjective judgments may be evaluated by computing a measure of correlation between corresponding objective and subjective rankings. The Spearman's correlation coefficient (Miller, 1989) is a suitable measure since it makes minimal assumptions about the data.
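For three-condition rank triples the coefficient reduces to a small computation. The sketch below uses the tie-aware form (Pearson correlation applied to the rank vectors); it reproduces the tabulated values for untied rankings, but the paper's exact treatment of tied ranks is not specified here, so this should be read as an illustration only.

```python
import math

def spearman(x, y):
    """Spearman rank correlation computed as the Pearson correlation of the
    rank vectors (the standard tie-aware form)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

# Identical rankings correlate perfectly; swapping two of three ranks gives 0.5
rho = spearman([3, 2, 1], [3, 1, 2])
```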

4.4. Sentence level test

The experiments described so far involved quality judgments on isolated vowels. In order to obtain a perspective on the usefulness of the observations in the context of a speech coding application, a sentence-level listening test was designed. Phonetically balanced sentences, as well as sentences which predominantly contained one or another vowel, were collected for a subjective listening experiment. Low pitched voices from a set of male speakers were selected. Table 8 lists the 15 sentences along with an indication of the dominant phonemic content as well as the perceived brightness of the voice. The sentences were uttered by speakers whose voices would be clearly classified as bright sounding or dark sounding in neutral speech.

Each of the sentences was subjected to MBE analysis and synthesis. The reference sentence was taken to be the one synthesized with the estimated spectral amplitudes, while the test sentences were synthesized from 10th order LP modeled spectral amplitudes with and without Bark-scale warping to obtain "B" and "U" versions, respectively. Eight subjects were presented with the reference and the two corresponding test sentences, and asked to choose the one perceptually most similar to the reference sentence. Based on a spectral shape measure described in Section 5.3, test sentences constructed by switching the warping parameter between "U" and "B" (depending on the nature of the frame) were also included in the listening test.

5. Results and discussion

Tables 3 and 4 summarise the results of the subjective and objective ranking of the experiment involving the natural vowels. Tables 5 and 6 are the corresponding results for the synthetic vowels.

The best rank (rank 1) implies the least perceived degradation among the three test sounds, while the worst (rank 3) implies that the perceived degradation was maximum among the three test sounds. A higher score, and a lower PL value, implies lower perceived degradation. The last column contains the Spearman's rank correlation (Miller, 1989), which is an indicator of how close to each other the subjective and objective ranks are. In this section, we comment on several aspects of the obtained experimental results.

Table 3
Objective and subjective results on degradation due to spectral envelope modeling of natural vowels with dark voice quality

ID | IPA sym. | Sample | Scores U/M/B | Subj. ranks U/M/B | PL U/M/B       | Obj. ranks U/M/B | Corr. coeff.
1  | 2        | 1      | 2 14 19      | 3 2 1             | 0.77 0.28 0.25 | 3 2 1            | 1
1  | 2        | 2      | 5 20 28      | 3 2 1             | 0.86 0.14 0.09 | 3 2 1            | 1
2  | æ        | 1      | 3 22 24      | 3 2 1             | 1.34 0.45 0.57 | 3 1 2            | 0.5
2  | æ        | 2      | 4 22 23      | 3 2 1             | 0.77 0.23 0.23 | 3 1 1            | 0.75
3  | c        | 1      | 8 25 28      | 3 2 1             | 0.51 0.25 0.24 | 3 2 1            | 1
3  | c        | 2      | 7 20 27      | 3 2 1             | 1.28 0.82 0.61 | 3 2 1            | 1
4  | L        | 1      | 7 15 25      | 3 2 1             | 0.44 0.33 0.33 | 3 1 1            | 0.75
4  | L        | 2      | 5 23 22      | 3 1 2             | 1.29 1.01 0.98 | 3 2 1            | 0.5
5  | o¨       | 1      | 9 26 18      | 3 1 2             | 0.59 0.13 0.14 | 3 1 2            | 1
5  | o¨       | 2      | 4 20 27      | 3 2 1             | 1.28 0.17 0.15 | 3 2 1            | 1
6  | ¨        | 1      | 1 26 18      | 3 1 2             | 0.85 0.11 0.13 | 3 1 2            | 1
6  | ¨        | 2      | 2 15 27      | 3 2 1             | 0.99 0.68 0.31 | 3 2 1            | 1
7  | i        | 1      | 30 15 0      | 1 2 3             | 0.64 0.83 0.94 | 1 2 3            | 1
7  | i        | 2      | 24 21 4      | 1 2 3             | 0.38 0.32 0.42 | 2 1 3            | 0.5
8  | eI       | 1      | 28 16 3      | 1 2 3             | 0.15 0.19 0.36 | 1 2 3            | 1
8  | eI       | 2      | 24 20 3      | 1 2 3             | 0.09 0.16 0.2  | 1 2 3            | 1

All vowels were modeled with 10th order LP with each of the three warping conditions: unwarped (U), mild-Bark-scale warped (M) and Bark-scale warped (B). "1", "2" indicate samples from different voice qualities. The last column shows the Spearman's correlation coefficient between the subjectively and objectively measured ranks.

5.1. Subjective ranking experiment

We first discuss the observations on frequency-warped all-pole modeling of natural vowels. From Table 3, which provides the subjective and objective evaluation of the reference-U, reference-M and reference-B pairs of the dark sounding vowels, we observe that the reference-B pair is ranked better than the reference-U pair for all the vowels except for the vowels [eI] and [i]. This is consistent with the experimental results of (Rao and Patwardhan, 2005). From Table 4 we observe that in the case of the "bright" vowels the reference-B pair is always degraded as compared to the reference-U pair. This is evident from the fact that it has always been ranked subjectively lower than the reference-U pair. This is contrary to what has been generally accepted in the literature. The reference-M pair is seen to be typically ranked between the other two warping scales. In the case of the synthetic vowels, we see


Table 4
Same as Table 3 but for natural vowels with bright voice quality

                         Subjective results          Objective results            Correlation
ID  IPA sym.  Sample     Scores       Ranks          PL                Ranks      coefficient
                         U   M   B    U   M   B      U     M     B     U  M  B
1   2         1          30  11  5    1   2   3      0.34  0.93  0.60  1  3  2    0.5
              2          30  15  0    1   2   3      0.11  0.35  0.49  1  2  3    1
2   æ         1          30  17  2    1   2   3      0.31  0.57  0.57  1  2  2    0.75
              2          25  23  8    1   2   3      0.43  0.24  0.48  2  1  3    0.5
3   c         1          26  8   12   1   3   2      0.59  0.64  0.94  1  2  3    0.5
              2          14  7   5    1   2   3      0.04  0.06  0.14  1  2  3    1
4   L         1          28  13  4    1   2   3      0.33  0.31  0.38  2  1  3    0.5
              2          29  16  0    1   2   3      0.43  0.49  0.78  1  2  3    1
5   o¨        1          30  9   6    1   2   3      0.15  0.39  0.34  1  3  2    0.5
              2          28  16  1    1   2   3      0.29  0.16  0.39  2  1  3    0.5
6   ¨         1          30  14  1    1   2   3      0.43  1.04  1.02  1  3  2    0.5
              2          30  15  0    1   2   3      0.21  0.92  0.77  1  3  2    0.5
7   i         1          29  19  3    1   2   3      0.11  0.07  0.49  2  1  3    0.5
              2          30  15  0    1   2   3      0.10  0.37  0.83  1  2  3    1
8   eI        1          28  17  2    1   2   3      0.30  0.30  0.37  1  1  3    0.75
              2          27  10  8    1   2   3      0.40  0.39  0.52  2  1  3    0.5

Table 5
Objective and subjective results on degradation due to spectral envelope modeling of synthetic vowels with dark voice quality

                         Subjective results          Objective results            Correlation
ID  IPA sym.  Sample     Scores       Ranks          PL                Ranks      coefficient
                         U   M   B    U   M   B      U     M     B     U  M  B
1   2         1          5   19  11   3   1   2      0.46  0.09  0.11  3  1  2    1
              2          6   14  13   3   1   2      0.11  0.10  0.09  3  2  1    0.5
2   æ         1          0   19  11   3   1   2      0.70  0.08  0.06  3  2  1    0.5
              2          3   19  11   3   1   2      0.29  0.06  0.04  3  2  1    0.5
3   c         1          3   19  14   3   1   2      0.82  0.11  0.12  3  1  2    1
              2          5   18  15   3   1   2      0.32  0.10  0.13  3  1  2    1
4   L         1          9   13  11   3   1   2      0.67  0.29  0.44  3  1  2    1
              2          5   13  14   3   2   1      0.39  0.32  0.28  3  2  1    1
5   o¨        1          4   11  16   3   2   1      0.42  0.13  0.12  3  2  1    1
              2          2   13  15   3   2   1      0.25  0.17  0.13  3  2  1    1
6   ¨         1          7   12  17   3   2   1      0.38  0.10  0.12  3  1  2    0.5
              2          0   16  14   3   1   2      0.36  0.10  0.12  3  1  2    1
7   i         1          16  16  4    1   1   3      0.42  0.21  0.27  3  1  2    -0.25
              2          16  15  2    1   2   3      0.22  0.18  0.35  2  1  3    0.5
8   eI        1          18  14  2    1   2   3      0.19  0.15  0.24  2  1  3    0.5
              2          20  10  0    1   2   3      0.06  0.12  0.15  1  2  3    1

All vowels were modeled with 10th LP model order with each of the three warping conditions: unwarped (U), mild-Bark-scale warped (M) and Bark-scale warped (B). ‘‘1’’, ‘‘2’’ indicate samples from different voice qualities. The last column shows the Spearman’s correlation coefficient between subjectively and objectively measured ranks.


from Table 5 that the dark vowels follow the same trend as the dark natural vowels (that is, the reference-B pair has been ranked better except for [i] and [eI]). Table 6, for the bright vowels, shows some inconsistencies in [c] and [¨], while all other vowels show the expected superiority of reference-U. An explanation for the inconsistencies is provided in Section 5.3.

5.2. Correlation with partial loudness

Tables 3–6 indicate an overall high positive correlation between subjective ranks and partial loudness based ranks. In fact, the relative positions of non-warped and Bark-warped in the subjective preference are correctly captured by the objective distance. By and large, the only inconsistencies are


Table 6
Same as Table 5 but for synthetic vowels with bright voice quality

                         Subjective results          Objective results            Correlation
ID  IPA sym.  Sample     Scores       Ranks          PL                Ranks      coefficient
                         U   M   B    U   M   B      U     M     B     U  M  B
1   2         1          15  14  5    1   2   3      0.06  0.07  0.15  1  2  3    1
              2          15  12  6    1   2   3      0.05  0.06  0.15  1  2  3    1
2   æ         1          2   18  10   3   1   2      0.13  0.04  0.09  3  1  2    1
              2          15  13  3    1   2   3      0.08  0.03  0.09  2  1  3    0.5
3   c         1          1   19  12   3   1   2      0.43  0.05  0.20  3  1  2    1
              2          8   18  12   3   1   2      0.28  0.06  0.23  3  1  2    1
4   L         1          13  14  5    2   1   3      0.25  0.34  0.50  1  2  3    0.5
              2          14  13  6    2   1   3      0.31  0.52  0.43  1  3  2    0.5
5   o¨        1          12  12  8    2   1   3      0.15  0.17  0.22  1  2  3    0.5
              2          14  13  6    1   2   3      0.19  0.16  0.27  2  1  3    0.5
6   ¨         1          3   14  19   3   2   1      0.70  0.16  0.17  3  1  2    0.5
              2          2   15  13   3   1   2      0.55  0.20  0.27  3  1  2    1
7   i         1          18  14  2    1   2   3      0.19  0.09  0.56  2  1  3    0.5
              2          12  17  2    2   1   3      0.37  0.32  0.75  2  1  3    1
8   eI        1          20  10  2    1   2   3      0.05  0.13  0.22  1  2  3    1
              2          20  10  7    1   2   3      0.05  0.14  0.26  1  2  3    1


with respect to the relative position of mild-Bark-warped test sounds.
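The rank agreement reported in the last column of Tables 3–6 follows from the standard Spearman formula. A minimal sketch (the function name is ours; the simple formula below omits the tie correction that rows with tied ranks would strictly require):

```python
def spearman_rank_corr(subj_ranks, obj_ranks):
    """Spearman's rank correlation (no tie correction):
    rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)),
    where d is the per-condition difference between the two rank lists."""
    n = len(subj_ranks)
    d2 = sum((s - o) ** 2 for s, o in zip(subj_ranks, obj_ranks))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

# Vowel [i], sample 2 of Table 3: subjective ranks (U, M, B) = (1, 2, 3),
# partial-loudness ranks (2, 1, 3)
print(spearman_rank_corr((1, 2, 3), (2, 1, 3)))  # 0.5
```

With only three conditions the correlation can take just a handful of values (1, 0.5, −0.25, etc.), which is why the same few numbers recur throughout the tables.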

The partial loudness is an indication of how audible the spectral distortion is. It is the sum of the partial specific loudness contributions of the distortion distributed across the signal spectrum. In frequency-warped LP modeling, there is an improved spectral match in the low frequency region at the cost of increased errors in the high frequency region. Whether frequency warping improves the overall perceptual accuracy of the fit depends on whether the increase in the partial loudness of the high frequency distortion is exceeded by the decrease in the partial loudness of the low frequency distortion. Fig. 4 shows, for a particular frame of a dark vowel sound [2], the comparison between Bark-warped and non-warped LP modeling in terms of spectral amplitude distortion and partial loudness distribution. On comparing Fig. 4(a) and (b), we note the reduced low frequency distortion and increased high frequency distortion introduced by the Bark-scale frequency warping. Further, a comparison of the partial loudness distributions of Fig. 4(c) and (d) indicates that the reduction in the low frequency distortion contributes more to decreasing audible error than the increase in the high frequency distortion. The correct prediction of this perceptual effect by the partial loudness model indicates that the high frequency spectral distortion is masked to a great extent by the relatively strong low frequency signal spectral components. This explanation is consistent with the opposite behaviour of the front vowels [i] and [eI], since their low first formant causes reduced spread of masking to the high frequency region. In the corresponding plots of Fig. 5 for a frame of a bright vowel [2], we note a similar redistribution of the distortion; here, however, the reduction in the low frequency distortion due to Bark-scale warping does not compensate adequately for the increased high frequency distortion, as observed in the partial loudness distributions of Fig. 5(c) and (d). The higher ‘‘loudness’’ of the high frequency error can be attributed to insufficient masking from the low frequency components of the reference signal, which are only comparable in strength to the high frequency components.
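The comparison just described can be sketched numerically. In the toy code below, all names and the uniform-grid assumption are ours, and the partial specific loudness distribution itself must come from the loudness model of Moore et al. (1997); the point is only that the warping decision reduces to comparing two integrated totals:

```python
def total_partial_loudness(specific_loudness, erb_step):
    """Integrate a partial specific loudness distribution of the distortion
    (sone per ERB), sampled on a uniform ERB-rate grid with spacing
    erb_step, into a total partial loudness in sone (rectangle rule)."""
    return sum(specific_loudness) * erb_step

def bark_warping_helps(pl_unwarped, pl_warped):
    """Warping improves overall perceptual accuracy only if the drop in
    audible low-frequency error outweighs the added high-frequency error,
    i.e. if the total partial loudness of the distortion decreases."""
    return pl_warped < pl_unwarped

# Fig. 4 ("dark" vowel): PL falls from 0.94 to 0.25 sone with warping.
print(bark_warping_helps(0.94, 0.25))  # True
# Fig. 5 ("bright" vowel): PL rises from 0.38 to 0.72 sone.
print(bark_warping_helps(0.38, 0.72))  # False
```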

5.3. Relation to spectral cues

Although the partial loudness is able to predict the subjectively preferred warping condition accurately in most cases, it is desirable to have a less computationally complex measure for use in practice. The simple spectral cues of Section 3.2 capture the overall spectral balance and therefore merit investigation.

Fig. 6 plots the H1–A3 of each of the set of 32 natural vowels plus 32 synthetic vowels used in the experiments. The H1–A3 is measured for a representative frame picked from the center of the steady vowel sound. The bright sounding vowels fall at the lower end of the y-axis relative to the corresponding vowels with dark voice quality.


[Fig. 4 here. Panels (a) and (b): SPL (dB) vs. frequency (kHz), 0–4 kHz, for the reference and modeled spectra. Panels (c) and (d): partial loudness distribution of the distortion (sone/ERB vs. ERB rate, 0–25), with PL = 0.94 sone in (c) and PL = 0.25 sone in (d).]

Fig. 4. Modeling of ‘‘dark’’ natural [2] with pitch = 105 Hz, without and with Bark-scale frequency warping at LP model order = 10. (a) and (b) Spectral harmonic amplitudes of the reference and modeled sounds are connected with a dotted line to show the spectral envelopes without warping and with Bark-scale warping, respectively. (c) and (d) Partial loudness distribution of the spectral distortion in unwarped and Bark-warped modeling, respectively.


Based on the results of Tables 3 and 4, the vowel instances that have been ranked better with Bark-scale frequency-warped all-pole modeling are shown encapsulated by a triangle, while the ones ranked better without warping before modeling are shown encapsulated by a square. Fig. 6 clearly shows that the sounds with higher H1–A3 benefit more from Bark-scale warping whereas the ones with lower H1–A3 benefit less (except for the expected cases of [i] and [eI]). In fact, from the data on natural vowels, it appears that H1–A3 = 10 dB may be considered a rough threshold for predicting whether Bark-warping before LP modeling would help to improve perceived quality. From Fig. 6(b) we note that the synthetic vowels [¨] and [c] in our set had H1–A3 values clustered at the higher end of the scale, which may explain why they do not show any variation in the subjectively preferred warping condition. It is seen that while H1–A3 captures the overall spectral tilt, it is insensitive to the relative distribution of spectral energy in the low frequency region, and hence cannot distinguish the low first formant characteristic of the vowels [i] and [eI].
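A rough sketch of how such a threshold predictor could be applied is given below. The function names and the harmonic-indexing convention are our assumptions; the paper's own H1–A3 measurement procedure is that of Section 3.2:

```python
def h1_minus_a3(harmonic_levels_db, f0, f3):
    """H1-A3 in dB: level of the first harmonic minus the level of the
    harmonic nearest the third-formant frequency f3.
    harmonic_levels_db[k] is the level of harmonic k+1, at (k+1)*f0."""
    h1 = harmonic_levels_db[0]
    a3 = harmonic_levels_db[round(f3 / f0) - 1]
    return h1 - a3

def bark_warping_predicted(h1_a3_db, threshold_db=10.0):
    """Rough rule read off Fig. 6 for natural vowels (excluding the
    exceptions [i] and [eI]): prefer Bark-scale warping when H1-A3
    exceeds about 10 dB."""
    return h1_a3_db >= threshold_db
```

A steeply tilted ("dark") spectrum with, say, H1 at 60 dB and the F3-region harmonic at 35 dB gives H1–A3 = 25 dB, well above the threshold.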

In view of the above observations, it is desirable to have a measure that describes not only the overall frequency concentration but also that in a specific frequency region. The spectral roll-off defined in Section 3.2 was adapted for this purpose. Experiments revealed that the spectral roll-off value of (1), evaluated in the region (0, 1500 Hz), was a good indicator of the lowness of the first formant as well as of the size of the gap between the first and second formants. That is, this measure is low only in the cases of front-high and front-mid vowels, where there is a large gap between F1 and F2. Further, the spectral roll-off computed in the reverse direction in the same interval was a good indicator of the low frequency skew as represented by H1–H2. Based on these considerations and experimentally determined thresholds, the following roll-off based rule was applied to determine whether warping should be enabled:

if (SR < 1700 Hz) and (PSR > 400 Hz) and (RPSR < 350 Hz), use Bark warping.


[Fig. 5 here. Panels (a) and (b): SPL (dB) vs. frequency (kHz), 0–4 kHz, for the reference and modeled spectra. Panels (c) and (d): partial loudness distribution of the distortion (sone/ERB vs. ERB rate, 0–25), with PL = 0.38 sone in (c) and PL = 0.72 sone in (d).]

Fig. 5. Modeling of ‘‘bright’’ natural [2] with pitch = 105 Hz, without and with Bark-scale frequency warping at LP model order = 10. (a) and (b) Spectral harmonic amplitudes of the reference and modeled sounds are connected with a dotted line to show the spectral envelopes without warping and with Bark-scale warping, respectively. (c) and (d) Partial loudness distribution of the spectral distortion in unwarped and Bark-warped modeling, respectively.

[Fig. 6 here. Both panels: H1–A3 (dB), approximately −20 to 40, plotted against the vowels [aa] [ae] [ey] [ao] [uh] [ow] [iy] [uw].]

Fig. 6. Spectral cue H1–A3 along with the subjective preference of quality between Bark-warped and unwarped modeling for (a) natural vowels and (b) synthetic vowels. Triangles enclose instances for which Bark-scale warping was preferred and squares those for which unwarped modeling was preferred; open markers denote vowels with ‘‘bright’’ voice quality and filled markers vowels with ‘‘dark’’ voice quality.


SR is the roll-off computed in (0, 4000 Hz), PSR is the roll-off computed in (0, 1500 Hz), and RPSR is the roll-off computed in the reverse direction in (0, 1500 Hz). Table 7 shows the normalized correlation of the objectively predicted preferred warping condition with the subjectively preferred condition for each vowel, for the three objective measures: PL, H1–A3 and roll-off based. We see that the roll-off based rule does better than H1–A3 and is comparable to PL, while being significantly less complex computationally.
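The rule can be sketched as follows. The roll-off computation here assumes the usual cumulative-energy definition with an illustrative fraction of 0.85; equation (1) in Section 3.2 fixes the actual definition, so both the fraction and the function names are our assumptions:

```python
def rolloff(freqs, amps, band, frac=0.85, reverse=False):
    """Spectral roll-off within band = (f_lo, f_hi): the harmonic frequency
    at which the cumulative squared amplitude first reaches `frac` of the
    band's total. reverse=True accumulates from the high edge downward
    (used for RPSR). Assumes the band contains at least one harmonic."""
    sel = [(f, a) for f, a in zip(freqs, amps) if band[0] <= f <= band[1]]
    if reverse:
        sel = sel[::-1]
    total = sum(a * a for _, a in sel)
    acc = 0.0
    for f, a in sel:
        acc += a * a
        if acc >= frac * total:
            return f
    return sel[-1][0]

def use_bark_warping(freqs, amps):
    """Roll-off based prediction rule of Section 5.3:
    SR in (0, 4000 Hz), PSR in (0, 1500 Hz), RPSR reversed in (0, 1500 Hz)."""
    sr = rolloff(freqs, amps, (0.0, 4000.0))
    psr = rolloff(freqs, amps, (0.0, 1500.0))
    rpsr = rolloff(freqs, amps, (0.0, 1500.0), reverse=True)
    return sr < 1700.0 and psr > 400.0 and rpsr < 350.0

# Illustrative harmonic spectra at f0 = 100 Hz (amplitudes are made up):
freqs = [100 * k for k in range(1, 41)]
dark = [1.0] * 10 + [0.05] * 30    # energy concentrated below ~1 kHz
bright = [1.0] * 40                # energy spread across the band
print(use_bark_warping(freqs, dark))    # True
print(use_bark_warping(freqs, bright))  # False
```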

5.4. Some comments on nasalised vowels

Nasalisation of vowels is known to be associated with a change in the first formant region of the spectrum that may be modeled with an added pole and zero, and is therefore expected to be difficult to model with a low order LP model. Since a prominent feature of nasal vowels lies in the low frequency spectrum, it is expected that Bark-scale warping before the LP modeling would help to retain the perception of nasality. Informal listening experiments with artificially nasalized natural vowels support this observation. To nasalize the vowels, a filter was applied that had a zero and a pole located above the first formant (Feng and Castelli, 1996). A separation of 100 Hz between F1 (frequency of the first formant), FNZ (frequency of the zero) and FNP (frequency of the pole) is sufficient to introduce perceptible nasality in the vowel. However, in a listening experiment with nasalised vowels, when asked to rank overall quality with and without Bark-scale warping before LP modeling, subjects often judged that modeling without warping was less degraded than modeling with Bark-scale warping, especially in the case of bright vowels. In the case of dark vowels too, high frequency distortion from warping was sometimes more significant than in the case of the corresponding non-nasalised vowel. This may be attributed to lowered masking from reduced low frequency components due to the presence of a spectral zero. Overall, there was less consistency between subjects, and also less correlation between subjective and objective rankings (based on the partial loudness measure), compared with the results on the non-nasalised vowels. This may be explained by the existence of conflicting perceptual effects from two different spectral cues, namely the loss of nasality and the high frequency spectrum distortion. In such cases, it is expected that higher level cognition comes into play in a subject's judgement of rank, a situation that cannot be predicted from a low level distance measure based on a model of the peripheral auditory system.

Table 7
Normalised correlation values between the subjectively observed and objectively predicted preferred warping condition (between non-warped and Bark-warped only) for each vowel set

         Natural                     Synthetic
Vowel    PL    H1–A3   Roll-off     PL     H1–A3   Roll-off
2        1     1       1            1      1       1
æ        1     1       1            1      0.75    0.5
c        1     1       1            1      0.75    1
L        1     1       1            1      0.5     0.5
o¨       1     1       1            1      0.5     0.5
¨        1     1       1            1      0.75    1
i        1     0.5     1            0.75   0.75    1
eI       1     0.5     1            1      0.75    1
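The nasalization filter described above can be sketched as a single biquad. The ordering of the zero below the pole, the 100 Hz separations and the bandwidth value are our assumptions for illustration; Feng and Castelli (1996) give the actual configuration:

```python
import math

def nasalization_filter(f1, fs, bw=100.0, sep=100.0):
    """Pole-zero pair above the first formant f1 (Hz), sketching the
    nasalization filter of Section 5.4: a spectral zero at f1 + sep and a
    pole at f1 + 2*sep, each realized as a conjugate pair with bandwidth
    bw (an assumed value). Returns (b, a) of the biquad H(z) = B(z)/A(z)."""
    def conj_pair(fc, bandwidth):
        # 1 - 2 r cos(theta) z^-1 + r^2 z^-2 for a conjugate root pair
        r = math.exp(-math.pi * bandwidth / fs)
        theta = 2.0 * math.pi * fc / fs
        return [1.0, -2.0 * r * math.cos(theta), r * r]
    b = conj_pair(f1 + sep, bw)        # numerator: the spectral zero (FNZ)
    a = conj_pair(f1 + 2.0 * sep, bw)  # denominator: the spectral pole (FNP)
    return b, a
```

Filtering a vowel frame through this biquad (e.g. with a direct-form difference equation) introduces the pole-zero perturbation above F1 that the text reports as sufficient for perceptible nasality.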

5.5. Sentence level test

Table 8 shows the preferred warping condition (as selected between non-warped, ‘‘U’’, and Bark-warped, ‘‘B’’) in terms of overall subjective quality on sentences. We note that in the case of the dark-voiced sentences, subjects preferred the Bark-warped condition except when the phonemes [eI] and [i] dominated. In the case of the bright voices, the subjects preferred modeling without warping in all cases. These observations are consistent with the isolated phoneme results discussed earlier. For the speech coding application, this suggests the possibility of improving overall perceived quality by making available a limited set of warping factors for dynamic selection based on frame-level spectral characteristics. Information on the selected warping factor can be conveyed to the decoder via one or two bits, depending on the number of distinct warping factors available.

In the second part of the experiment, the prediction rule of Section 5.3 was implemented and the warping condition was switched automatically frame by frame. A subjective listening test comparing the dynamically switched warped sentences with the preferred fixed warping condition of Column 5 of Table 8 revealed that in all cases the subjects rated the former either better than or indistinguishable from the latter. This is shown in Column 6 of Table 8. The percentage of frames that were selectively switched to Bark-scale warped modeling is shown in the last column of Table 8. We find that, for sentences no. 7, 8, 10, 11, 12, 13 and 15, the automatic algorithm keeps most frames unwarped.


Table 8
Sentences used in the subjective evaluation of preferred frequency warping at the sentence level

Sr. no.  Sentence                                              Voice quality   Predominant phoneme   Subjective preference   % Frames switched to B
1        All the balls were brought from shopping mall         Dark            [c]                   B  S                    49
2        By and large he was unharmed                          Dark            [2]                   B  S                    42
3        Lord Paul was tall                                    Dark            [c]                   B  N                    71
4        Early bird earns a worm                               Dark            [L]                   B  N                    57
5        Can that man carry those pans                         Dark            [æ]                   B  S                    39
6        Draw each graph on new axis                           Dark            Balanced              B  S                    35
7        They may say the same in Spain                        Dark            [eI]                  U  N                    15
8        His meal is eel meat                                  Bright          [i]                   U  N                    0
9        We were away to walla walla                           Dark            [eI], [i], [2]        U  S                    55
10       They all agree that the essay is barely intelligible  Bright          Balanced              U  N                    17
11       Thick glue oozed out of the tube                      Bright          [i], [¨]              U  N                    4
12       Dont ask me to carry an oily rag like that            Bright          Balanced              U  N                    4
13       A muscular abdomen is good for your back              Bright, nasal   Balanced              U  N                    7
14       Withdraw only as much money as you need               Dark, nasal     Balanced              B  N                    48
15       Withdraw only as much money as you need               Bright, nasal   Balanced              U  N                    13

The first sub-column of ‘‘subjective preference’’ indicates the warping condition preferred by listeners between the two fixed warping conditions (‘‘U’’ and ‘‘B’’). The second sub-column indicates the subjective preference when the automatic warping condition prediction rule was used: ‘‘S’’: automatically switched preferred, ‘‘N’’: unable to differentiate, ‘‘O’’: fixed warping preferred.


These sentences are characterized by either a bright voice or the presence of [eI] and [i], which are better modeled without warping.

At this point, it is relevant to comment on the high frequency pre-emphasis commonly used to improve LP modeling of speech spectra (Markel and Gray, 1976). Pre-emphasis serves to suppress the spectral tilt to an extent and thus improve the LP modeling of the formants. However, it is suggested that pre-emphasis be avoided for unvoiced speech, while in the case of voiced speech the pre-emphasis factor should ideally be signal dependent (Markel and Gray, 1976). This is also borne out by our own observations, where we found that while pre-emphasis improves the modeling of voices with high spectral tilt (such as breathy female voices), it degrades sounds with low energy at low frequencies [also noted by Wong et al. (1980)], including sound spectra characterised by low H1–H2. Pre-emphasis was not used in the present study. It is anticipated that the selective use of pre-emphasis, combined with frequency-scale warping based on the considerations presented in this paper, can help to achieve further improvements in the perceptual accuracy of low order LP modeling of the speech spectral envelope.
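For reference, the pre-emphasis under discussion is the standard first-order difference applied before LP analysis; the coefficient value below is a conventional choice, not one taken from this paper:

```python
def preemphasize(x, alpha=0.97):
    """First-order pre-emphasis y[n] = x[n] - alpha*x[n-1], the standard
    high-frequency emphasis discussed by Markel and Gray (1976). The
    first output sample is passed through unchanged."""
    y = [x[0]]
    for n in range(1, len(x)):
        y.append(x[n] - alpha * x[n - 1])
    return y
```

A fixed alpha boosts the spectrum by up to 6 dB/octave regardless of the input tilt, which is exactly why the text argues the factor should be signal dependent: it helps steeply tilted (dark, breathy) spectra but over-emphasizes spectra that are already flat.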

6. Conclusion

Frequency warping according to a perceptual scale is often applied to improve the perceptual accuracy of low order LP modeling of the speech spectral envelope in sinusoidal speech coding. Understanding the factors that influence the subjective perception of spectral envelope modeling errors can be useful in improving the attained speech quality. Steady vowel sounds are particularly sensitive to spectrum envelope modeling errors. Experimental investigations of the relative improvement in perceived quality from the use of different warping functions on natural and synthetic steady vowels have been presented. Contrary to what is generally accepted, it is found that the effectiveness of the widely used Bark-scale frequency warping depends on the vowel and voice quality attributes of the signal spectrum. Dark voices, with their steep spectral slopes, were found to benefit from Bark-scale warping, but not so the relatively bright voices. The exceptions to this are the front-high and front-mid vowels [i] and [eI]. These vowels are observed to typically degrade when modeled with Bark-scale frequency warping relative to the reproduced quality without warping.


In summary, the observations presented in this paper and in a previous paper (Rao and Patwardhan, 2005) suggest that, for the vowels [eI] and [i], the vowel identity dominates over the dark/bright voice quality differences.

The subjective ratings of relative perceived degradation were closely predicted by an objective distance measure based on the partial loudness of the spectrum distortion. This suggested that frequency masking plays a role in determining whether the improved low frequency spectrum match from Bark-scale warping is sufficient to compensate for the accompanying high frequency spectrum distortion. Only when the low frequency components of the spectrum are relatively strong and sufficiently spread is there a clear benefit from Bark-scale warping. This condition was shown to be closely linked to the spectral slope as quantified by measures such as H1–A3 and roll-off, which can therefore act as useful cues in predicting the suitability of a warping function for a particular sound. A rule has been proposed to predict the subjectively preferred warping condition, based on a combination of spectral roll-off values computed in different regions of the spectrum.

A sentence-level listening test confirmed the results of the isolated vowel experiments and also demonstrated the effectiveness of the warping prediction rule. For the speech coding application, this suggests the possibility of improving overall perceived quality by making available a limited set of warping factors for dynamic selection based on frame-level spectral characteristics. Information on the selected warping factor can be conveyed to the decoder via one or two bits, depending on the number of distinct warping factors available. Alternatively, a simple but slightly less effective solution is to use fixed warping according to a mild version of the Bark scale.

Audio demonstrations may be accessed at http://www.ee.iitb.ac.in/~prao/speech_comm_1/index.html.

References

Burred, J., Lerch, A., 2004. Hierarchical automatic audio signal classification. J. Audio Eng. Soc. 52 (8), 724–739.

Champion, T., McAulay, R., Quatieri, T., 1994. High-order all-pole modeling of the spectral envelope. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 529–532.

Childers, D., 2000. Speech Processing and Synthesis Toolboxes. John Wiley and Sons.

Childers, D., Lee, C., 1991. Vocal quality factors: analysis, synthesis and perception. J. Acoust. Soc. Am. 90 (5), 2394–2411.

Doval, B., d'Alessandro, C., 1997. Spectral correlates of glottal waveform models: an analytic study. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 1295–1298.

Feng, G., Castelli, E., 1996. Some acoustic features of nasal and nasalized vowels: a target for vowel nasalization. J. Acoust. Soc. Am. 99 (6), 3694–3706.

Griffin, D., Lim, J., 1988. Multiband excitation vocoder. IEEE Trans. Acoust. Speech Signal Process. 36 (8), 1223–1235.

Hanson, H., 1997. Glottal characteristics of female speakers: acoustic correlates. J. Acoust. Soc. Am. 101 (1), 466–481.

Hermansky, H., Hanson, B., Wakita, H., Fujisaki, H., 1985. Linear predictive modeling of speech in modified spectral domain. In: Digital Processing of Signals in Communications, pp. 55–63.

McAulay, R., Quatieri, T., 1995. Sinusoidal coding. In: Speech Coding and Synthesis. Elsevier, Amsterdam.

Markel, J., Gray, A., 1976. Linear Prediction of Speech. Springer-Verlag, Berlin.

Miller, J., 1989. Correlation. In: Statistics for Advanced Level. Cambridge University Press.

Molyneux, D., Parris, C., Sun, X., Cheetham, B., 1998. Comparison of spectral estimation techniques for low bit-rate speech coding. In: Proceedings of the International Conference on Spoken Language Processing, pp. 946–949.

Moore, B., Glasberg, B., Baer, T., 1997. A model for the prediction of thresholds, loudness and partial loudness. J. Audio Eng. Soc. 45 (4), 224–240.

Rao, P., Patwardhan, P., 2005. Frequency warped modeling of vowel spectra: dependence on vowel quality. Speech Commun. 47, 322–335.

Rao, P., van Dinther, R., Veldhuis, R., Kohlrausch, A., 2001. A measure for predicting audibility discrimination thresholds for spectral envelope distortions in vowel sounds. J. Acoust. Soc. Am. 109 (4), 2085–2097.

Wong, R., Hsiao, C., Markel, J., 1980. Spectral mismatch due to preemphasis in LPC analysis/synthesis. IEEE Trans. Acoust. Speech Signal Process. ASSP-28 (2), 263–264.