30
Dept. for Speech, Music and Hearing Quarterly Progress and Status Report Signal analysis and perceptual tests of vowel responses with an interactive source filter model Nord, L. and Ananthapadmanabha, T. V. and Fant, G. journal: STL-QPSR volume: 25 number: 2-3 year: 1984 pages: 025-052 http://www.speech.kth.se/qpsr

Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

Dept. for Speech, Music and Hearing

Quarterly Progress andStatus Report

Signal analysis andperceptual tests of vowel

responses with an interactivesource filter model

Nord, L. and Ananthapadmanabha, T. V. andFant, G.

journal: STL-QPSRvolume: 25number: 2-3year: 1984pages: 025-052

http://www.speech.kth.se/qpsr

Page 2: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse
Page 3: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

B. SIGNAL ANALYSIS AND PERCEPTUAL TESTS OF VOWEL RESPONSES WITH AN

-1VE SOURCE FILTER MODEL

L. Nord, T.V. Ananthapadmanabha, and G. Fant

Abtract An interactive source f i l t e r model has been used to generate i m -

pulse responses, single formant sounds, and vowels. With the model it is possible to change physiological source parameters, such as lung pres- sure, peak glottal area, and glottal area function. The acoustic load is represented by the input impedance network of vocal t ract and sub- glottal system. The four major acoustic effects of interaction: skewing, truncation, dispersion, and superposition are discussed and illustrated. A comparison has been made for an inventory of sounds obtained with and without the source-filter interaction. In a psychoacoustic discrimina- tion test it was found that listeners, i n a controlled situation, were able to perceive the fine temporal details for impulse and periodic single resonant sounds simulated with and without a vocal tract loading. However, for more complex sounds, such as male and female vowels, the discrimination approached chance level. Consequences of using an inter- active model for the synthesis of connected speech are discussed.

Introduction There has been a growing interest towards a better understanding of

the voice source, especially the acoustic loading effect on the true glottal flow (Flanagan & Landgraf, 1968; Ishizaka & Flanagan, 1975; Guerin e t al., 1976; Rothenberg, 1980; Fant, 1979; 1981; 1982; Anantha- padmanabha & Fant, 1982). However, there is a paucity of data on the perceptual consequences and the details concerning the temporal and spectral consequences of the interaction phenomena on the acoustic signal. I n 1982 we reported a preliminary study on these aspects. See also a study by Childers e t al. (1983).

A refined modelling of the voice source and a better theoretical understanding of the production of voiced speech sounds would hopefully leadtoanimprovedqualityof synthetic speech, especiallyof female and children voices. The aim of this work is a) to study the percep- tual effects of interaction and b) to i l lus t ra te the temporal and spectral consequences.

Acoustic Theory In the classical acoustic theory of speech production (E'ant, 1960;

Flanagan, 1965), it is assumed that the glottal impedance is very large compared to the vocal tract input impedance. Hence, the volume velocity air flow through the glottis, to a f irst approximation, is independent of the vocal t ract loading. Any f ini te effect of the coupling is ac- counted for by changes i n the landwidth values of the vocal tract for- mant frequencies. I n most of the earlier studies the effect of vocal t ract f i l t e r loading on the aerodynamic glottal a i r flow source was underestimated. The time varying glottal impedance was approximated by quasi-static states, and further, a) the average dynamic resistance of the glottis was either compared with the resistance value of the vocal

Page 4: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse
Page 5: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

P I I --- I

-=-+gh+)l;i; tf I I I I

Lung pressure

IPS Ag ( t l , ,. lpv-"- - - I - .. 1

su bglottal time-varying vocal tract input impedance glottal impedance input impedance

Fig. 1. Eiquivalent circuit over the open phase. (F'rcan Ananthapadmmbha & Fant, 1982, Fig. 4.)

Fig. 2. A source pulse and its derivative. a) no load condition use (t) , b) w i t h vocal tract

load u (t). g

msec DERIVATIVE

Page 6: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse
Page 7: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

INTERACTIVE NON-INTERACTIVE

-- --- - -- - - . - - -- - - - - - - .- - - - - - - -- - - -

1 2 3 kHz 1 2 3 kHz

Fig. 4. caparison of interactive (truncated) and non- interactive (exponentially damped) responses.

Fig. 5. Selective inverse filtering on natural speech samples. a) Speech wave, b) selective inverse filtering. Top: Fl ringing, bottoan: F2 ringing, c) estimatedglottalflaw.

Page 8: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

the truncation of the response is smooth, the amplitude of the side lobes in the spectrum will depend more on the dispersion effect.

(iv) Superposition - linear. In those cases (notably high pitched voices) where the truncation is not significant, energy will be carried wer from one glottal period to the next. Even for a perfectly periodic phonation, the vowel responses in the time domain for adjacent periods may thus not be identical. The degree of the superposition depends on the relation between pitch frequency and formant frequency.

(v) Superposition - nonlinear. Also the superposition within the open phase may affect the glottal flow derivative at the instant of closing discontinuity and, thus, a change in the excitation level (Fant & Anan- thapadmanabha, 1982).

(vi) Superpasition - mechanical: It is suspected that the presence of a superposition component may, at times, affect the mechanical vibrations of the vocal folds. Thus, a superposition compononent of the F1 re- sponse may control the level of other formants as well (Fant et al., 1963; Fant & Ananthapadmanabha, 1982). However, it is mt yet settled whether this is a nonlinear - acoustic or a mechanical interaction. Up till now simulations have not prwided a support of the mnlinear c o m p nent alone.

(vii) Supraglottal. It is evident that a supraglottal constriction will affect the transglottal pressure drop and, thus, the pattern of vocal cord vibrations. This is typical of voiced fricatives.

Generation of Vow1 Responses In this study we have used an interactive source-filter model with

physiological parameters, such as lung pressure, glottal area function * (maximal glottal area, open quotient , area function), and pitch as input source parameters and the vocal tract input impedance as system parameter. See Fig. 6a for a schematic block diagram of the model. The mechanical modelling of phonation is not represented. Hence, by the abave model the first four effects can be simulated. With the model it is thus pssible to independently change the input parameters and study the effects of, e.g., a raised subglottal pressure.

E X P E R I m In this study we have taken into account the coupling effect of F1

load only with no subglottal coupling. Conventional formant synthesis of vowel sounds uses constant for-

mant bandwidths resulting in an exponential damping, as already sbwn in Fig.4, with more pronounced superpositon effects than found in natural speech.

* open quotient = the ratio of the glottal open time to the fudamental period.

Page 9: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse
Page 10: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse
Page 11: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

Fig. 7. Spectral effect of truncating a formant oscillation: Band-. width definitions (fran Fant, 1982, Fig. 10 modified). Iog spectra of a) exponential response with closed phase band- width (B,) , b) truncated response with 3 dB duwn points (%), C) exponential response with effective bandwidth (Beff).

Fig. 8. Specifications for the single resonant stimu- li. A natural grouping is indicated.

(mH 1 8

6

4

2

0

(Hz) 7 0

60

50

40

30

I 1 I I I I

- ~'i; !!,' *--

- ' a) - \-a&* - - ,'-\

- :-:/' - - - -------------

6-. . - l u l I I 1 . I - \! 1 -

- - - - ,----

,-.I (? - I:. --- -, a > - I I I I I I

.2 .3 4 5 6 . 7F , ( kHz )

Page 12: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse
Page 13: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

l NTERACTIVE

. . .

....... i---,i . . . 1 :

........ 2 . 6.-.- -.-------,..---- . : ,

_i-

...... .......... .......... ............. . . . .

.... . ........... . . . . . i.--ii.-.i-.i-ii -..L : j ; : ;

i I : . : i-i_-- : -. . . -i ...... . . i --- -C----..----------.--.=~ - . . . j j : . ! : : . . ,

.-.<---.. L

. . .... . . . ................. . :-" :..-.-------, -.I-

. : ;

;: . !

; a .

20 : :--...- i---.i- i-..: ..... A +..i;.-- i j . . I ' i . .

. . . . . . / /

...... ..... : : . . ..... . . 3

L ....- i--: iiiiiiii:. i : .

; I / j ' .

: : ....... : : !

................ .... i.- *-.. .... ----- "---; ....... " ! :

. . . ' . A -i-.-- -: , . :

. . . . . I ! : . ; . .. ....... 4 z

: j ............................ ..F. 4 .....-..-. .......... =. m.: ..................................... --.-. -...... ; :

. . : z .. :

..: ................ i i

.................. .................... . . iiiiiiiii-. . 4 . . ........ ! : ----.,--.--d-..-.-..

! ........

..... ......... ........ ................. ............................... : ; ; ,-i + . j - f

0 1 2 3 L 0 1 2 3 O kHz

NON- INTERACTIVE

I , Bleff

Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse stimuli.

1'

a) F of /a/, simulated with three values of peak 1 g ottal area.

a (Hz 1

81

I/* . --

Page 14: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

INTERACTIVE NON- INTERACTIVE

. . . . : > . : . . : 1 ...-...... i ...... _.-I.-- .... I.:-.._-- --.--,+ . ! : : . . . . . 4 . .................... . .

. : ! : . . . j : -

. . . . . . . . . . . . . . . : ....... ..i .... L i ......... LLLLLLLLL ..L-

'+f 0 1 2 3 L kHz

Fig. 9. -ison of interactive and non-interactive responses in time and frequency damains: Impulse response stimuli.

b) Stimulus with luw F1 (F1 of /i/) c) Stimulus that gave the lowest

discrimination score (F1 of /E/) .

Page 15: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

as expected. But, for the s t imul i with high Flr the discriminability is high, even for a low peak glottal area of 10 sq.mrn.

TABLE I1 Result of discrimination test: Impulse response of a single resonance

Number of errors and percentage correct discrimination. Three values of peak glottal areas. (20 samples/data point, 5 listeners)

Formant frequency Number of errors Correct response (%)

of the resonance ( H z )

Peak gl. area (sqmm.) (pooled) 30 20 10

Sum: 4 19 23 mean: 85

The exponential stimuli sounded somewhat resonant i n quality due to longer ringing, while the interactive response stimuli sounded numb, like tapping on a non-resonant piece of wood.

Theoretically, the relative amount of truncation effect (F?!I'E) de- pends on the values of the parallel resistance R1 of the vocal t ract load a t F1 and the peak glottal resistance. This can be expressed with the following formula:

Relative truncation effect (F?!I'E)w ( F ~ ~ L ~ / B ~ ) (ha* Topen) (1) where

F1,B1 = formant frequency and bandwidth, respectively

L1 = inductance load of the f i rs t formant

&= peak glottal area

T-= duration of the glottis open interval

R ~ ( F , ' L ~ / B ~ ) for the single resonance load.

Page 16: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

38 STL-QPSR 2-3/1984 I

Alarge R!TE-value would mean that the difference between the inter- a c t i v e and t h e non- in terac t ive s t imulus is enhanced. I n Fig. 10 t h e number of errors is plotted as a function of the FTE-value. The tenden- cy of g e t t i n g b e t t e r d i s c r i m i n a t i o n sco res f o r l a r g e RTE-values is c lear ly seen.

Some character is t ics of the interact ive response s t imul i skrould be noted l i k e t h e shape of t h e broad peak and t h e pronounced s i d e lobes. For s o m e s t i m u l i t h e s i d e lobes are on ly 15 dB weaker than t h e main peak. Moreover, it is evident from t h e d i s c r i m i n a t i o n r e s u l t s that l i s teners are able t o perceive very s m a l l differences between st imuli pairs .

TESTS ON SINGLE FOFWiNT SOUNDS Periodic s ingle formant sounds of 400 msec duration with a s m o o t h

in tens i ty onset and of f se t were synthesized. Conventionally, a sequence of impulses c o n s t i t u t e s t h e d r i v i n g source i n t h e syn thes i s of vowel sounds. But, here we used source pulses spec i f ica l ly computed with the interact ive source-fil ter model.

Natural vowels and, thus , a l s o s i n g l e formants are m o r e complex than impulse responses. This is due to a complex of components ant, 1979) : ( i ) G l o t t a l f low pulses i n a d i f f e r e n t i a t e d form appear i n t h e waveform, (ii) there may be an additional excitation a t the instant of g l o t t a l opening, (iii) due to the superposition e f fec t tail ends of the damped waves from previous per iods may add i n a r b i t r a r y phase to the main response of t h e p r e s e n t period. The presence of t h e s e a d d i t i o n a l components could reduce the perceptual discr iminabi l i ty results.

A peak g l o t t a l area of 20 sq.mrn and a constant subglottal pressure of 8 cm H20 were used as source parameters. Two values of pitch, %=F~/N and F ~ = F ~ / ( N - ~ . ~ ) , =integer, were used f o r each sound. These two cases correspond t o a p i t c h harmonic co inc id ing wi th t h e formant peak o r t o t h e harmonics l y i n g on e i t h e r s ide . Three d i f f e r e n t values of open quotient: 30 (or 40), 50, and 70% were used, giving s i x d i f fe rent stimu- li s p e c i f i c a t i o n s f o r each sound. The e f f e c t of s u b g l o t t a l load w a s neglected.

Figs. l l a and l l b d i s p l a y the s i n g l e formant s t i m u l i i n t h e t i m e and frequency domains. For vowels wi th h igh F1, t h e d i f f e r e n t i a t e d g l o t t a l pu l se g ives rise t o a s p e c t r a l peak, marked FG. A s t h e open q u o t i e n t va lue is increased, t h e source s p e c t r a l peak is lowered i n frequency. For s t imul i w i t h a l o w F1 the source spectral maximum and formant peak merge and a r e d i f f i c u l t t o d i s t ingu i sh . This can be seen both i n t h e t ime and i n t h e frequency domain. The s i d e lobes f o r t h e s i n g l e formant sounds are no t as apparent as f o r t h e impulse s t i m u l i , compare Fig. 9.

Results of the l is tening tests are shown i n Table 111. The resul t s once again show a good discriminabili ty, especially f o r sounds with a h igh f i r s t formant. I t may be noted t h a t f o r s t i m u l i w i t h F ~ = F ~ / ( N -

0.5), the discr iminabi l i ty is re la t ive ly bet ter , except f o r s t imul i with F1=269 Hz. This could be explained as fol lows: When Fo = F ~ / N t h e

Page 17: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse
Page 18: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse
Page 19: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

l NTERACTIVE NO N-INTERACTIVE

Fig. 11. Ccanparison of interactive and non-interactive responses in time and frequency dcmains: Periodic single formant stimuli. Fo = F1/N. A phonetic judgement is indicated.

b) Stimulus w i t h low F1 (F1 of /i/) , simulated with three degrees of open quotient.

Page 20: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

superposition effects overrides the effect of truncation thus reducing the differences between the stimuli pairs. Fs there was no correlation between open quotient values and the number of discrimination errors, the results are pooled over the open quotients.

T W I11 Result of discrimination test: single formant sounds

Number of errors and percentage correct discrimination, pooled for two pitch conditions. (72 samples/data point, 6 listeners.)

Formant frequency Number of errors Correct response (%)

of the resonance (Hz) Pitch (pooled)

FO=F1/N FO=F1/(N-0.5)

Sum: 68 46 mean: 74

Open quotient= 30 (or 40) ,50 and 70 %, Psubgl.= 8 cm H20

PHONETIC QUALITY If the spectral shape of single resonant sounds are vowel-like with

a realistic pitch, they will have a (weak) phonetic quality, that is, listeners tend to perceive the sounds as phonemes. The mn-interactive response for stimuli with F1= 659 and 269 Hz (Table 111) were, thus, identified as the vowels [ael and Cu7, respectively. The side lobes appearing for the interactive response stimuli with the same F1 and a high open quotient value were enough to shift the quality towards the phonemes Cal and [o], respectively. Referring to the spectral sections of the stimuli, see Fig. lla (lower) and llb (lower), this effect can be interpreted if we assume the side lobe peaks as mntritxlting to a shift of gravity in the spectra.

Page 21: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse
Page 22: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

P- EFFECTS OF (F1 + F2)LOAD W e a l s o simulated a number of s t i m u l i taking i n t o account t h e

interaction of both the f i r s t and the second formant loads, in order to see whether t h e perceptual e f f e c t s would be stronger. This turned ou t not t o be t h e case. W e conclude t h a t t h e main r e s u l t s i n t h i s s tudy w i l l apply equally w e l l for a modelling with the loading of the complete vocal tract network.

TESTSONFEMALEvCkJELS The same vowels w e r e synthesized with a high pitch and using aver-

age formant frequency values f o r female speakers, taken from Fant (1972), see Table V. The input inductances w e r e assumed t o be t h e same as i n Table I f o r t h e t h e male data, s ince t he r e is a decrease of t h e length as w e l l as the cross-sectional area of the vocal tract. In order t o obta in a female voice qua l i ty , w e found it necessary t o use higher subg lo t ta l pressure, smaller peak g l o t t a l a rea (0 - 14 sq.mm), and a higher open quotient value.

TABLE V Stimuli specifications (Fbrmant frequencies and bandwidths i n Hz, inductances i n m ~ )

Beff used instead

Values taken from Fant (1972) except *). F5= 5.4 kHz, F6= 6.6 kHz, F7= 7 -8 kHz

B5= 300 HZ, B6= 400 HZ, B7= 500 HZ

Sampling frequency= 16 kHz

Open quotient= 60 %, Psuwl.= 10 cm H20,

FO-= 265 Hz.

Page 23: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse
Page 24: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

The female voice qual i ty was good, altkrough we did not optimize the parameter settings. Vowel spectra are show. i n Fig. 13. For t h e vowel /e/ a gl-1 mm is seen a t about 800 Hz, bo th i n t h e i n t e r a c t i v e and m - i n t e r a c t i v e simulation. This zero originates f r m the Carnblnati0n

of two e x c i t a t i o n s w i t h t h e p a r t i c u l a r choice of open qout ien t . For t h i s simulation, the second excitation a t the ins tan t of g l o t t a l opening turned o u t to be q u i t e s t rong. A change of open q u o t i e n t by a few percent removed the g l o t t a l zero completely.

The result of the l is tening test, see Table VI , shows approximately the same discrimination ra tes as f o r the m a l e vowels, that is, a l i t t le a w e chance level.

TABLE VI Result of discrimination test: female vowels

Number of errors and percentage correct discrimination. (96 samples/data point, 6 l i s teners )

Vowel Number of errors correct response (%)

mean: 65

D i s c u s s ion A comparison of the discrimination test resu l t s f o r the three types

of s i g n a l s is shown i n Fig. 14. Though t h e d i s c r i m i n a t i o n w a s easy f o r s i g n a l s w i t h s imple s t r u c t u r e , t h e t a s k became d i f f i c u l t f o r vowel sounds. The r e s u l t is no t su rp r i s ing . For vowel sounds, t h e s e i n t e r - action ef fec ts could be considered to be of lesser importance canpared to the dynamic variation of the g l o t t a l pulse shape which contrilxltes m o r e t o t h e voice q u a l i t y (Rosenberg, 1970; Fant, 1979; Ananthapad- manabha 1984).

However, this should not be misinterpreted as saying that source- f i l t e r interaction is of no c o n s q e n c e f o r the synthesis of connected speech. I n t h e above psychoacoustic experiments w e have t e s t e d the discriminabili ty fo r f ine temporal d i f f e rences , c o n t r o l l i n g a l l other parameters to be t h e same. Thus, w e have used c a r e f u l l y c a l c u l a t e d effect ive barrdwidth values f o r the exponentially damped cases to keep

the loudness level the same between stimuli pairs.

Page 25: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

1 NTERACTIVE NON-INTERACTIVE

Fig. 1 3. Spectral sectio~

0 1 2 3 1 5 6 k H z

ns of female vcwel stimuli.

Page 26: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

CORRECT DISCRIMINATION

Page 27: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

For natural speech the interaction effect w i l l result i n a simulta- neous change of the spectral shape (i.e., s inx/x) - and the formant lev- els. The above psychoacoustic experiment has shown that if the formant levels are correctly controlled then the effect of spectral shape dif- ferences is small.

For the synthesis of connected speech (other than steady s ta te vowels), the loudness contrasts between speech segments can be main- tained only by using carefully calculated effective barrdwidth values. The effective bandwidth depends not only on the articulation ht also on the source parameters. Values as large as 400 Hz for F1 of a voiced /h/ sound are not uncommon. Similarly, B2 of /i/ could be influenced strongly by the truncation effect. Because of these interaction phe- nomena, it becomes diff icult to predict the effective bandwidths by rule. The existing rules for deriving bandwidths based on the formant frequencies are for the closed glottis condition (Fant, 1972). E'urther- more, i n practical synthesis of high pitched voices, the bandwidth sometimes has to be modified (increased) in order to avoid unrealistic amplitude fluctuations due to superposition effects.

This gives us two options of synthesis strategy. Either a) The effective bandwidth has to be carefully determined and used i n a non- interactive model or b) available rules for closed phase bandwidths can be used i n an interactive model. The later alternative is preferable as it models the speech production process to a better accuracy, including a proper control of the superposition effect, which is especially criti- cal for the synthesis of high pitched voices. A third alternative (p- sibly of technical interest) would be to modulate the barrdwidth within each glottal cycle, according to source-filter interaction theory, ht still use a non-interactive synthesis model.

SOME RELATED TOPICS

The above experiments have led us to inquire into some related tapics which we w i l l briefly discuss.

A. Perception of F1

What is the perceptual significance of a fusion of spectral voice source maximum and a low f i r s t formant peak i n real speech? Is the perceived f i r s t formant location affected by a strong source peak as seen i n the spectral sections, or does our hearing neglect the contrih- tion from the source peak? There is some evidence for the influence of the spectral shape on the perceived f i r s t formant frequency (Chistovich & Ogorodnikova, 1982). But there is also some counter evidence based on an auditory short term spectral analysis (Schroeder & Mehrgardt, 1982; Millar & Underwood, 1975).

B. Short term spectral analysis When discussing the spectral components of signals that exhibit a

fine temporal structure, a harmonic (long term) spectrum w i l l not clear-

Page 28: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

- - 0 1 2 0 1 2 0 1 2 o4 0 1 2

(compare to a) 0 kHz

(compare to b)

Fig. 15. Consecutive short term spectra for the interaklve mqonse of a single farrmant aound. (T = 10 met, Harming win&#, 1 nrsec displaaamk bebmm sections.)

Page 29: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse

STL-QPSR 2-3/1984 5 1

ly reveal the variations. Another way of representation is to use a

running short term spectral analysis. In Fig. 15 a number of consecutive short term spectra for a single formant sound are shown, covering ap- proximately one and a half glottal cycle. These spectra were obtained by computing the F?IT on Hamming windowed segments of 10 msec duration with 1 msec steps between the sections. The main features to be oberved are the changing level of the voice source maxi~num and the (sinx/x) shaped contour of the F1 peak with side lobes. With this type of analysis, the first formant peak can be expected to appear even for high pitched voices, viz., at those instances when the short term window includes the closed glottis interval.

mnclus ions The auditory system is known to be sensitive to differences in the

waveforms, e.g., AM/QFM (Lozhkin, 1971). In vowel production, the cou- pling of the vocal tract to the time-varying glottal impedance gives rise to a non-exponential formant damping, compared to an exponential damping, used in conventional vowel synthes is. We have tested the dis- criminability for non-exponential (interactive) and exponential (non- interactive) formant damping for a) impulse responses, b) periodic single formant sounds, and c) male and female vowel sounds. These two types of synthesized signals were easily discriminable for simple types of stimuli, though difficult for vowels sounds, Fig. 14.

The qualityof synthetic speech cannot be improved very signifi- cantly by using an interactive source-filter model. However, the inter- active model has the advantage of a simpler control strategy for band- widths a d it has a better control of the superposition effect. A m o r e rigorous evaluation of the interaction phenomena is under way.

References AnanthapadmanaH~, T.V. (1984): "Acoustic analysis of voice source dyna-

Ananthapadmanatha, T.V. and Fant, G. (1982): "Calculation of true glot- tal flow and its components", STGQPSR 1/1982, pp. 1-30; also in Speech Communication - 1 (1982), pp. 167-184.

Ananthapadmanabha, T.V., Nord, L., and Fant, G. (1982) : "Perceptual discriminability of nonexponential/exponential damping of the first formant of vowel sounds", pp. 217-222 in R.Carlson & B.Granstrom, eds.) Proc. of The Representation-of Speech in the Peripheral Auditory System, Stockholm, May 17-19, 1982, Elsevier Biomedical Press, Amsterdam.

Childers, D.G., Yea, J.J., and Bocchieri, E.L. (1983) : "~ource/vocal tract interaction in speech and singing synthesis", to appear in Proc. of Stockholm Music Acoustics Conference, Stockholm, July 28 - August 1, 1983; Publ. issued by the Ebyal Swedish Academy of Music, 1985.

Chistovich, LA. and Ogorodnikwa, E.A. (1982): "Temporal processing of spectral data in vowel perception", Speech Communication 1 (1982), pp. - 45-54.

Page 30: Signal analysis and perceptual tests of vowel responses ... · I , Bleff Fig. 9. Canparison of interactive and non-interactive responses in time and frequency domains: Impulse respnse