A two-formant model and the cardinal vowels · 2 percept. We can therefore conclude that for the vowels [i e yJ it is, tentatively, less convincing than for the remaining 15 cardinal

Dept. for Speech, Music and Hearing

Quarterly Progress andStatus Report

A two-formant model and thecardinal vowelsBladon, A. and Fant, G.

journal: STL-QPSRvolume: 19number: 1year: 1978pages: 001-008

http://www.speech.kth.se/qpsr

http://www.speech.kth.se

http://www.speech.kth.se/qpsr

STL-QPSR 1/1978

taking into account relations between formant frequencies and spect-

rum shape. That formula (henceforth the 1975 formula) was a s fol-

lows:

1 The 1975 formula appeared successful, since it predicted F for nine

2 Swedish vowels to within 120 Hz, with an average e r r o r of only 63 Hz.

We decided to test the 1975 formula for calculating F; against a

full se t of 18 IPA-cardinal vowels. These were spoken by one of the

authors and analyzed spectrographically. A set 'of matching experi-

ments was then carried out, in the following way. 18 four-formant

vowels were synthesized, each with the formant frequencies observed

in the analysis of a cardinal vowel. Four listeners were asked to

match, a s nearly a s possible to a four-formant stimulus, a synthetic

two-formant vowel whose F l was unchanged but whose F2 was variable

Application of the 1975 formula to the cardinal vowel data gives re-

sults which a r e most unsatisfactory. Of the 18 cardinal vowels, ten

show an e r r o r in predicted F' of over 120 Hz, and the mean e r r o r i s 2

232 Hz ! These discrepancies a r e reflected in Fig. I-A- 1, a me1 scale

plot of Fl against F; (matched and calculated). We have adopted the

"technical me1 scaleM of Fant (1959)

m = 1000 log ( I + f/1000) log 2

It can be seen that the calculated values fail to bring out the per-

ceptual distinctiveness of back unrounded vowels (which do not occur

in Swedish and thus had not been part of the 1975 test material), and

of [ 01 and [u] . In all these cases, calculated F' i s much too high. 2 It means, in particular, that the contrast between the rounded and un-

rounded counterparts Lo] and [ Y ] i s completely obscured, the

relationship of [ill to [I] i s displaced (inverted in the F; dimen-

sion) and the open vowel area is improbably overcrowded.

An improved version of a formula for predicting F; has therefore

been developed. I n this paper we report on some initial trials

I I I I I I I ,

o Matched Cotculoted (1975)

- -

Y U w +-a - -

tt - (C-----O 0 w

I - - Y

*e-o 0 #

- -

- -

- -

I I I I I I I

500 F1 (mels)

600

I Fig. I-A- I . Matched and ca lcu la ted F2 of 18 c a r d i n a l vowels i l lus t r a t ing the shor tcomings of the 1975 ~h f o r m u l a .

t

STL-QPSR i/1978

of the formula (henceforth the 1977 formula), on some possible re -

finements to it, and, superficially, on the extent to which two-for-

mant matchings themselves were successfully accomplished.

The 1977 formula, like i ts predecessor, produces a continuous

2 3 4 1/4 shift of I?: between the extreme frequencies of F and (F F ) :

however, i ts basis in acoustic theory i s more firm than the rather

intuitive 1975 version. The new formula i s developed according to

a spectrum prominence model which postulates interdependencies

between formant frequencies and spectrum levels (Fant, 1960),

and is more systematically designed from s -plane residue calcula - tions :

Where Ag4 i s the vocal tract transfer function in the valley be-

tween F3 and F4 a t the frequency F34 = (FJ~4 )1 /2 and A2 i s the trans-

fer function at the second formant peak F2.

The ratio between Aj4 and A2 may be derived from S-plane vec-

torial products a s

1 With the approximation ( P ~ F ' ~ ) ' / ~ = - 2 3 4 (I? +F ) this expression reduces to

The factor K(f) in the weighting function c i s intended to include

additional preemphasis originating from source, radiation, and higher

pole corrections and in addition a correction for differences in equal

loudness levels. In the f irst approximation we shall disregard their

combined frequency dependency and start out the calculations with

~ ( f ) = 12 and B2 = 100 Hz.

.

i

e

E

a

0

3

o

u

Y

d Oe

OE

P A

7

UJ

i

la

~ot'al >

Measured formant frequencies

H2 F3 F4' . Hz Hz Hz Hz

300 2300 3070 3590

470 2180 2720 3790

680 1890 2580 3940

770 1400 2460 3710

660 1170 2770 3650

570 840 2640 3310

3 70 730 2670 3240

290 700 2550 3280

300 1890 2250 3000

460 1520 2290 3290

640 1450 2330 3030

700 1430 2390 3350

670 1050 2900 3490

620 1260 2390 3610

450 1300 2640 3470

300 1320 2480 3440

380 1690 2460 3570

360 1550 2430 3030

prediction error (mels)

"; matched

Hz mels

3095 2034

2361 1749

2076 1621

1452 1294

1103 1073

806 853

700 766

669 739

2101 1633

1570 1362

1637 1399

1458 1294

947 961

1284 1192

1326 1217

1300 1202

1754 1462

1503 1324

r

1975

mels A

2062 26

1780 31

1614 7

1316 22

1248 175

922 69

896 130

807 68

1600 33

1442 80

1578 179

1420 126

1346 385

1542 350

1460 243

1325 123

1517 55

1696 372

2474

F' calculations 2

B ~ = 100 *) mels A

2059 25

1744 5

1557 64

1276 18

1144 71

892 39

814 48

775 36

1659 26

1398 36

1458 59

1330 36

1096 135

1189 3

1285 68

1273 71

1485 23

1656 332

1095

1977

B2=67 **) Hz mels A

3190 2067 33

2361 1749 0

1932 1552 69

1410 1269 25

1182 1126 53

842 881 28

733 793 27

700 766 27

2125 1644 1 1

1583 1369 7

1612 1385 14

1471 1305 1 1

1072 1051 90

1266 1180 12

1354 1235 18

1359 1238 36

1763 1466 4

1479 1575 251

716

F i g . I -A-2 . Matched and calculated F of 18 cardinal vowels. 1977 formula with B=67 dz and ~ ( f ) = l 2 ~ ~ / 1 4 0 0 .

2200

I I I I I I I

o matched calculated (1977)

- - i Y UA u CW a 0-0 .a

- - tt - 0 a 0 4

i - - 0 a &

- - 3 M '

A - oe m D a

- E

p----. d -A a - a - 0.

I I I I I I I - 2000 1800 1600 1400 1200 1000 800 600

200

300

400

500

F, (rnels)

600

700

800

900

Fig . I-A-3. Sequential presentation of the vowels in terms of measured formant data Fi , Q, Fj, and F4 together with matched and calculated ~ i .

I I I 1 I I I I I I 1

- - - F' colculoted X * motched 0

F1 F2 F3 F& = I - D I I - - - = - - - = m

.- A, - - - m - - - I - m m - - - - -

0 - m = fi - - - - 18 H - X .I

ah - d' PQ J)

"

& # * - r - 8 * Y ~

Y e u - - - - - m -I- 1. -

m m - .1-."=--

I t I I I I t I I I I I I I I I I I

STL-QPSR l/i978

from the F2 values by a mean of 44 Hz. There i s a tendency of

matched F; being on the low side of Fa, e . g. in l o ] where

FZ = 1050 Hz and F: is matched to 947 Hz whilst the calculated F;

i s 1072 Hz.

Thp calculated and matched F: of ti] came out close to Fj and

between F2 and Fj in CY] [el and [oeJ . In the remaining vowels

LY] [*I [E] and [OEJ the matched and calculated F; a re again close

to FZ.

The only case of a major discrepancy between calculated and

measured F: appears to be the vowel [ J . Spectrograms of the

vowels [u] and C4-J a r e shown in Fig. I-A-4. In spite of the rel-

ative intense F4 the matched F; of [u3 coincides with F2 which i s

lower than the F2 and matched F: of CiJ . On the other hand, in

t e rms of calculated I?' this expected phonetic order i s not retained. 2

The excessively high calculated F' of [wJ can be traced back to the 2

high sensitivity of the formula to a small distance between F and F 4 3

This property of the formula i s needed to ensure the proper contrast

between [i ] and C ~ J ' but it appears to b e less representative of

spectral dominance in mid and back vowels. A gross e r ro r in the e s -

timate of F may accordingly have rather noticeable effects in the 4 calculations.

We now turn to the more general question of the validity of a two-

formant model in perception. In setting an experimental subject the

task of matching a two-formant vowel to a four-formant one, and in

using those matchings to evaluate the success of numerical methods,

we a r e making the assumption that the auditory pattern above Fi can

be approximated by a single perceptual variable. How valid i s that

assumption? How difficult i s the matching task? From informal ob-

servations of the subjects' reactions it seemed clear that the match-

ing task is a realistic one to set, and that for the majority of cardinal

vowels i t i s not at all difficult to do.

However, employing a rea'dily available but gross and rather su-

perficial measure, we can consider the number of listening trials

that preceded the decision by a subject that a match had been obtained.

STL-QPSR 1/1978

On that basis, the spectrum of the vowels [ i l [el and [y] could

be matched only with some difficulty. The number of trials needed

is plotted on Fig. I-A-4 as isolated data points. Also shown in Fig.

I-A-4 i s some further evidence which could be interpreted in the

same way, namely that the standard deviation in F; responses was

especially high in the vowels Cil and [e ] . This finding i s consis-

tent with the fact that the upper-frequency spectrum of [i e y] i s

more significantly shaped by the upper formants F 3 " ' F n than i s

the case in other vowels.

Indeed a tendency was observed (which needs further corrobora-

tion) for listeners to fall into two groups according to whether they

selected their match for say Ci J a t around 1700 mels ( w 2250 HZ)

o r a t around 2200 mels ( --" 3600 Hz). The first of these frequencies

i s closely coincident with F the second with F4, and it therefore ap- 2 ' pears that subjects tended to allow either F or F4 to dominate the 2 percept. We can therefore conclude that for the vowels [i e yJ it

is , tentatively, l e s s convincing than for the remaining 15 cardinal

vowels to suppose that a perceptual correlate of the formants above

F exists in the form of a single parameter F 1 2 '

Other spectrum attributes such as the relative spacing of formants

within the F 2 F3 F4 F5 probably enter as additional cues.

Although the F; formula has been substantially improved we do

not consider it to have reached a state of perfection, where it can be

recommended for routine data reduction of formant measurements.

Further work should be directed to auditory projections of spectrum

profiles to test the relevance of loudness density mappings and pos-

sible lateral suppression. A fundamental question is to what extent

the percept of phonetic quality i s discontinuous in situations where the

stimulus i s allowed to change from F 2 to F3 o r F4 prominence. Kar-

nickaya et a1 (1975) report bimodal distribution of response just a s

we have noted in our study. This finding would support their view

that the major correlate of vowel quality i s the position in the auti-

tory space of F1 and the next highest peak in the processed spectral

projection. In the phonetic boundary of equal prominence of F2 and

Fig. I-A-5. Standard deviation of F; rnatchings in mels (vertical lines) and the mean number of trials needed for a match (open circles).

STL-QPSR i/1978

1 a higher peak there i s equal probability of either response. The F2

formula i s designed to retain the sensitivity in the change of percept

when crossing an equal prominence boundary but it cannot be expect-

ed to display exactly the same location of the boundary a s we en-

counter in any specific test. Moreover, the particular language

and dialect of the subjects may bias the performance.

Acknowledgments

We a r e indebted to R. Carlson and B. Granstrom for discussions

and for the use of their synthetic vowel matching program.

References

Carlson, R. , Fant, G. , and Granstrijm, B. (1975): "Two-formant models, pitch and vowel perception", pp. 55-82 in Fant, G. and Tatham, M. A. A. (eds. ), Auditory Analysis and Percep- tion of Speech, Academic P re s s , London.

Carlson, R. , Granstrijm, B. , and Fant, C. (1 970): "Some studies concerning perception of isolated vowels", STL-QPSR 2-3/i970, pp. 19-35.

Fant, G. (1 959): "Acoustic analysis and synthesis of speech with applications to Swedish", Eric sson Technics, no. 1.

Fant, G. (1960): Acoustic Theory of Speech Production, Mouton, The Hague, p. 52.

Karnickaya, E .G. , Mushnikov, V.N., Slepokurova, N . A . , and Zhukov, S. Ja. ( 1975): "Auditory processing of steady-state vowels", pp. 37-53 in Fant, G. and Tatham, M. A. A. (eds. ), Auditory Analysis and Perception of Speech, Academic P re s s , London.

Plomp, R. (1975): "Auditory analysis and timbre perception", pp. 7-22 in Fant, G. and Tatham, M. A. A. (eds. ), Auditory Analysis and Perception of Speech, Academic P re s s , London.

Documents

A two-formant model and the cardinal vowels · 2 percept. We can therefore conclude that for the vowels [i e yJ it is, tentatively, less convincing than for the remaining 15 cardinal