
Dept. for Speech, Music and Hearing

Quarterly Progress and Status Report

Dynamic characteristics of voice fundamental frequency in speech and singing. Acoustical analysis and physiological interpretations

Fujisaki, H.

journal: STL-QPSR, volume: 22, number: 1, year: 1981, pages: 001-020

http://www.speech.kth.se/qpsr

http://www.speech.kth.se


A. DYNAMIC CHARACTERISTICS OF VOICE FUNDAMENTAL FREQUENCY IN SPEECH AND SINGING. ACOUSTICAL ANALYSIS AND PHYSIOLOGICAL INTERPRETATIONS

H. Fujisaki*

Abstract

Voice fundamental frequency plays an important role in expressing linguistic as well as non-linguistic information. The exact relationships between such information and the characteristics of the contour of the fundamental frequency, however, have not been fully clarified. The present paper summarizes the author's approach to the elucidation of the mechanism whereby the linguistic information is converted into the dynamic characteristics of the fundamental frequency contour. A functional model is presented first for the process of generating the contour from a set of discrete linguistic commands, and is shown to be capable of closely approximating contours observed both in isolated words and in sentences of Japanese. The model is then shown to be also valid for the process of pitch control in singing. Finally, an interpretation is presented on the basis of physiological and physical properties/structures of the human larynx.

1. Introduction

In many of the Indo-European languages (1-5) as well as in the Japanese language (6), the contour of the voice fundamental frequency (henceforth Fo-contour) plays an important role in transmitting not only linguistic information but also non-linguistic information such as naturalness, emotion, and speaker idiosyncrasy. Because of difficulties in accurate analysis and in quantitative description, the relationships between the linguistic/non-linguistic information and the Fo-contour characteristics have not been fully clarified. The elucidation of these relationships requires, firstly, the selection of characteristic parameters that are capable of describing the essential features of an Fo-contour, and secondly, a method for extracting these parameters from an observed Fo-contour. In other words, an analytical formulation (i.e., a model) of the control process of voice fundamental frequency is indispensable for the quantitative analysis and linguistic interpretation of Fo-contour characteristics.

It is widely recognized that Fo-contours of words and sentences are generally characterized by a gradual declination from the onset toward the end of an utterance, with superposed local humps corresponding to word accent, etc. Most of the models that have been proposed for the interpretation of Fo-contour characteristics, however, are based on a rather crude approximation of the contour on a linear scale of fundamental frequency, and fall short of the explanatory power needed to deal precisely and quantitatively with the contribution of the various factors involved in the formation of an Fo-contour. The present paper summarizes the author's approach to this problem, and presents models developed for the analysis and interpretation of word and sentence intonation as well as of pitch control in singing. Although the data presented here are exclusively on Japanese, the approach has been found to be valid also for English.

* Dept. of Electrical Engineering, Faculty of Engineering, University of Tokyo. Also Visiting Professor at Dept. of Speech Communication & Music Acoustics, KTH, during April 1981.

2. Fundamental frequency contour of isolated words (6-10)

2.1 Formulation of the model

The word prosody found in many dialects of spoken Japanese is characterized essentially by binary patterns of subjective pitch associated with each 'mora' (i.e., a unit of metric timing usually equal to, but sometimes smaller than, a syllable). The existence of such pitch patterns constitutes the 'word accent', and a specific pitch pattern is called an 'accent type'. Each dialect is characterized by its own accent system, i.e., a system of word accent types peculiar to a dialect or a group of dialects. For example, the accent system of the Tokyo dialect is characterized by the following constraints: (1) the subjective pitch invariably displays a transition, either upward or downward, at the end of the initial mora, and (2) an upward transition can be followed by a downward transition, but not vice versa, within a word. The type with a 'high' final mora is further divided into two types, depending on whether or not the high pitch is carried over to the following particle. Thus, the total number of accent types of n-mora words is n+1 in the Tokyo dialect. On the other hand, the accent system of the Osaka dialect is characterized by a different set of constraints which yield 2n-1 accent types for n-mora words, with a few exceptions in one- and two-mora words.

While the prosodic information intended by the speaker and perceived by the listener is thus discrete and binary, the contour of the voice fundamental frequency is continuous both in time and in
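The count of n+1 accent types stated above for the Tokyo dialect can be checked by brute-force enumeration. The Python sketch below is only an illustration under an assumed encoding that is not part of the paper: each mora is labelled 'H' or 'L', and the following particle is treated as one extra position so that the two variants of the 'high'-final type are distinguished. The function name is hypothetical.

```python
from itertools import product


def tokyo_accent_types(n):
    """List the pitch patterns of an n-mora word plus one following particle.

    Assumed encoding: one 'H' or 'L' per mora, particle as a final extra position.
    """
    patterns = []
    for p in product("HL", repeat=n + 1):
        # Constraint (1): a transition must occur at the end of the initial mora.
        if p[0] == p[1]:
            continue
        # Constraint (2): a downward transition may not be followed by an upward one.
        fallen = False
        valid = True
        for prev, cur in zip(p, p[1:]):
            if prev == "H" and cur == "L":
                fallen = True
            elif prev == "L" and cur == "H" and fallen:
                valid = False
                break
        if valid:
            patterns.append("".join(p))
    return patterns


for n in range(1, 6):
    print(n, tokyo_accent_types(n))
```

Under this encoding the number of admissible patterns grows by one with each added mora, in agreement with the n+1 rule.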

[Figure: block diagram. Inputs: utterance command, accent command. Blocks: utterance control mechanism, accent control mechanism, glottal oscillation mechanism. Output: fundamental frequency as a function of time.]

Fig. I-A-2. A functional model for the process of generating an Fo-contour of a word: conversion of utterance and accent commands into an actual Fo-contour.

[Figure: block diagram with blocks labelled utterance control, accent control Ga(t), and glottal oscillation mechanism; input: accent command; output: fundamental frequency.]

respectively indicate the step response function of the corresponding control mechanism to the phrase and accent commands. The α_i's and β_j's are expected to be fairly constant within a sentence, or among utterances of an individual speaker. I and J are the numbers of phrase and accent commands, T_0i and T_3i respectively denote the onset and offset of the i:th phrase command, while T_1j and T_2j respectively denote the onset and offset of the j:th accent command. In the absence of respiratory pauses within a spoken sentence, the offset times T_3i of the phrase commands are assumed to be identical for all i within an utterance. On the other hand, the accent commands are constrained not to overlap each other.
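The roles of the parameters defined above can be illustrated with a short Python sketch. The defining equation is not present in the surviving text, so the form used here is an assumption: ln Fo(t) is taken as ln Fmin plus a sum of phrase components A_pi [Gp(t - T_0i) - Gp(t - T_3i)] and accent components A_aj [Ga(t - T_1j) - Ga(t - T_2j)], with Gp(t) = αt exp(-αt) and Ga(t) = 1 - (1 + βt) exp(-βt) for t ≥ 0. The function names are hypothetical; the numerical values are those listed for sentence (a) in Table I-A-I.

```python
import math


def phrase_response(alpha, t):
    """Assumed phrase-control response: alpha*t*exp(-alpha*t), zero before onset."""
    return alpha * t * math.exp(-alpha * t) if t > 0 else 0.0


def accent_response(beta, t):
    """Assumed accent-control response: critically damped second-order step response."""
    return 1.0 - (1.0 + beta * t) * math.exp(-beta * t) if t > 0 else 0.0


def f0_contour(t, f_min, phrase_cmds, accent_cmds):
    """Fo in Hz at time t (sec), computed on a logarithmic scale.

    phrase_cmds: list of (T0, T3, Ap, alpha); accent_cmds: list of (T1, T2, Aa, beta).
    """
    log_f0 = math.log(f_min)
    for t0, t3, ap, alpha in phrase_cmds:
        log_f0 += ap * (phrase_response(alpha, t - t0) - phrase_response(alpha, t - t3))
    for t1, t2, aa, beta in accent_cmds:
        log_f0 += aa * (accent_response(beta, t - t1) - accent_response(beta, t - t2))
    return math.exp(log_f0)


# Parameter values of sentence (a) in Table I-A-I.
F_MIN_A = 83.0
PHRASE_A = [(-0.21, 1.01, 1.18, 3.3)]
ACCENT_A = [(0.13, 0.64, 0.46, 20.5), (0.71, 0.93, 0.13, 20.8)]

# Sample the synthetic contour every 12.8 msec, the analysis interval used in the paper.
contour = [f0_contour(k * 0.0128, F_MIN_A, PHRASE_A, ACCENT_A) for k in range(100)]
```

Under these assumptions the contour equals Fmin before the first command and returns to Fmin once the responses to all command offsets have died out.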

3.2 Experimental results

A set of ten declarative sentences, each consisting only of voiced segments and ranging in length from 8 to 24 morae, were selected. Utterances of these sentences by three male speakers of the Tokyo dialect were recorded and analyzed. For the purpose of the present study, these utterances were made without respiratory pauses within a sentence.

Fig. I-A-4 illustrates results of one sample each from the three sentences:

(a) /aoinoewaaru/ ("There is a picture of hollyhocks.")

(b) /aoiaoinoewaaru/ ("There is a picture of blue hollyhocks.")

(c) /aoiaoinoewa jamanouenoieniaru/ ("There is a picture of blue hollyhocks in a house on a mountain.")

The measurement of the voice fundamental frequency was made at intervals of 12.8 msec, but the '+' symbols indicate the measured Fo-contour only at every third point. The curve displayed as a solid line indicates the best approximation given by the model, and the curve displayed as a dashed line indicates the baseline component estimated at the same time. The analysis clearly indicates the existence of partial re-phrasing at the boundary between the subject phrase and the predicate phrase in the case of sentence (c). The stepwise waveforms at the bottom of each panel schematically indicate the timing

Table I-A-I. Parameters of sentence Fo-contours extracted by finding the best approximations to the observed Fo-contours.

Phrase commands:

Sentence        Fmin (Hz)   i   T0 (sec)   T3 (sec)   Ap     α (sec^-1)
(a)  8 morae    83          1   -0.21      1.01       1.18   3.3
(b) 11 morae    84          1   -0.21      1.42       1.20   3.6
(c) 20 morae    76          1   -0.19      2.87       1.50   3.0
                            2    1.07      2.87       0.80   3.2

Accent commands:

Sentence        j   T1 (sec)   T2 (sec)   Aa     β (sec^-1)
(a)  8 morae    1   0.13       0.64       0.46   20.5
                2   0.71       0.93       0.13   20.8
(b) 11 morae    1   0.08       0.27       0.54   21.2
                2   0.51       1.00       0.49   20.5
                3   1.02       1.34       0.20   20.0
(c) 20 morae    1   0.12       0.33       0.56   24.0
                2   0.57       1.05       0.20   26.0
                3   1.41       2.36       0.52   24.0
                4   2.36       2.81       0.22   23.5

The timing parameters for the accent commands naturally vary widely from one sentence to another, since they reflect the lexical information of each of the constituent words. On the other hand, parameters such as α_i and β_j are found to remain fairly constant, since they characterize the dynamic properties of a subject's glottal control mechanism to phrase and accent commands. The amplitude A_aj of the accent command is seen to be distributed over two distinct ranges of values: one from 0.46 to 0.56, and the other from 0.13 to 0.22. This result may reflect the discrete (binary) character of accentuation.
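The two ranges quoted above can be reproduced directly from the accent-command amplitudes in Table I-A-I. The Python sketch below splits the sorted values at their widest gap; this splitting rule and the function name are illustrative choices, not a procedure described in the paper.

```python
# A_aj values of sentences (a), (b) and (c) as listed in Table I-A-I.
accent_amplitudes = [0.46, 0.13, 0.54, 0.49, 0.20, 0.56, 0.20, 0.52, 0.22]


def split_at_largest_gap(values):
    """Split the sorted values into a lower and an upper group at the widest gap."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


minor, major = split_at_largest_gap(accent_amplitudes)
minor_range = (minor[0], minor[-1])  # smaller accent commands
major_range = (major[0], major[-1])  # larger accent commands
print("minor:", minor_range, "major:", major_range)
```

The widest gap lies between 0.22 and 0.46, so no amplitude in the table falls between the two groups.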

These results also suggest that the approximation by the present model to an Fo-contour may not be seriously impaired by constraining all the α_i's and β_j's to remain constant, and by allowing only two values for the A_aj's. Analysis of Fo-contours under these constraints has actually been conducted, and the results tend to confirm the validity of this simplification.

Fig. I-A-5. Parameters α, β of phrase and accent components versus number of morae n in a sentence.

Fig. I-A-6. Amplitude of phrase and accent components versus number of morae n in a sentence. Legend: Ap; Aa (major); Aa (minor).

The relationships between the sentence length n and the parameters of the simplified model are shown in Figs. I-A-5 & I-A-6. As seen from Fig. I-A-5, the influence of sentence length n upon parameters α and β is quite small, indicating that the apparent

two females with a similar range and quality of voice. One (subject MT) was a voice trainer with 10 years of professional training in singing, and the other (subject YS) was a student at a music conservatory with three years of training in singing. The pitches of the two notes were selected to suit their voice range, and the intervals were either a musical fourth (A4-D5) or an octave (D4-D5). The sequences were produced in both directions, upward and downward. Each sequence was repeated several times in 3/4 time with an M.M. setting of 100, except in the case of portamento where the beat was approximately M.M. 80. These sequences were sung at three levels of volume: forte (f), mezzopiano (mp), and pianissimo (pp). A minimum of five samples were collected for each of the conditions and subjects. For the sake of comparison, speech materials were also recorded. These were isolated utterances of two words in the Tokyo dialect of Japanese: /ame/ ("candy") and /ame/ ("rain"). These two words possess an identical phonemic structure but differ in the accent type manifested mainly in their Fo-contours, the former being the "low-high" type and the latter being the "high-low" type.

The technique for fundamental frequency extraction was the same as that adopted for speech, but the data were interpolated at intervals of 10 msec. The extraction of characteristic parameters of an Fo-contour also followed the same line as that for Fo-contours of spoken words, but the model was modified to suit the observed data, by suppressing the utterance component and by removing the 'critical-damping' constraint on the form of Fo-transition.

If we assume a step function for the shape of the command to switch notes, the Fo-contour of a two-note sequence can be represented by

[equation missing in source]

where

$$
f(\beta,\gamma,t) =
\begin{cases}
1 - \left[\cos\left(\sqrt{1-\gamma^{2}}\,\beta t\right) + \dfrac{\gamma}{\sqrt{1-\gamma^{2}}}\,\sin\left(\sqrt{1-\gamma^{2}}\,\beta t\right)\right]\exp(-\beta\gamma t), & \text{for } \gamma < 1,\\[4pt]
1 - (1+\beta t)\exp(-\beta t), & \text{for } \gamma = 1,
\end{cases}
$$

u(t) denotes the unit step function, F_i and F_f respectively denote the initial and final values of the transition, and β and γ are parameters characterizing the second-order linear system. The origin of the time axis is selected at the onset of transition.

Although it is possible to measure such characteristics as the rise/fall times directly from an Fo-contour, the above formulation gives us more insight into the underlying mechanism of pitch control. Parameters such as β and γ can be obtained from a measured Fo-contour by finding its best approximation given by the above equations, which can then be used to determine the rise/fall times. For the sake of comparison with the published results by Ohala (13) and Sundberg (14), we adopt here the same definition of rise/fall time as in their studies. Namely, the rise/fall time is defined as the time required for the pitch to change from 1/8 to 7/8 of the total range of transition.
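The 1/8-to-7/8 definition can be evaluated numerically once β and γ are known. The following Python sketch assumes that the rise/fall time is measured on the normalized transition f(β, γ, t) of a single step command, and locates the first crossing of each level by a coarse scan followed by bisection. The function names are hypothetical, and the parameter values in the last line are illustrative only; they are not taken from Table I-A-II.

```python
import math


def transition(beta, gamma, t):
    """Normalized second-order step response f(beta, gamma, t) for 0 < gamma <= 1."""
    if t <= 0:
        return 0.0
    if gamma == 1.0:
        return 1.0 - (1.0 + beta * t) * math.exp(-beta * t)
    root = math.sqrt(1.0 - gamma * gamma)
    phase = root * beta * t
    envelope = math.exp(-beta * gamma * t)
    return 1.0 - (math.cos(phase) + gamma / root * math.sin(phase)) * envelope


def first_crossing(beta, gamma, level):
    """Earliest time at which the response reaches `level` (scan, then bisection)."""
    step = 0.01 / beta
    hi = step
    while transition(beta, gamma, hi) < level:
        hi += step
        if hi > 100.0 / beta:
            raise ValueError("level not reached")
    lo = hi - step
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if transition(beta, gamma, mid) < level:
            lo = mid
        else:
            hi = mid
    return hi


def rise_fall_time(beta, gamma):
    """Time for the transition to go from 1/8 to 7/8 of its total range."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must satisfy 0 < gamma <= 1")
    return first_crossing(beta, gamma, 7.0 / 8.0) - first_crossing(beta, gamma, 1.0 / 8.0)


# Illustrative values: beta in 1/sec gives the result in seconds.
example = rise_fall_time(20.0, 1.0)
```

For a fixed γ the result scales inversely with β, and in this sketch an underdamped transition (γ < 1) yields a shorter rise/fall time than a critically damped one with the same β.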

Fig. I-A-7 illustrates one example each of pitch transitions across the interval of a fourth (A4-D5) in the sung material of subject MT produced at mezzopiano under eight different conditions, i.e., upward and downward transitions sung at four different degrees of articulation (appoggiatura, non legato, legato, and portamento). The '+' symbols indicate the Fo-contours measured at 10 msec intervals, while the curve in each panel indicates the best approximation based on Eq. (3).

Quite naturally, the rate of pitch change is seen to vary to a large extent with the degree of articulation, and also to vary with the direction, i.e., a downward transition is faster than an upward transition, especially in appoggiatura and in non legato. The Fo-contours are clearly underdamped in these fast transitions, while they are almost critically damped in normal and slower transitions (legato and portamento).

Table I-A-II lists the mean values of β and γ averaged over several samples each of the eight conditions, obtained from the analysis of the material sung by MT, together with the mean rise/fall times defined as above and calculated from β and γ. For the sake of comparison, parameter values obtained from the speech material are also listed. Analysis of materials sung at different levels indicated that changes in the volume do not appreciably affect the rate of transition in most cases.






On the other hand, differences in the rate for upward and downward transitions can be explained by referring to the stress-strain relationship of Eq. (4). The incremental stiffness, as given by ∂T/∂x, is obviously greater at larger values of x. Since the initial value of x is greater in the downward transitions, the stiffness is greater and hence produces a larger value of β than in the upward transitions.

5. Conclusions

Dynamic characteristics of the voice fundamental frequency have been investigated both in speech and in singing. The logarithm of fundamental frequency in speech has been regarded as the response of the control mechanisms of vocal cord vibration to a set of linguistic commands, while the mechanisms have been assumed to be second-order linear systems. The model has been first developed for isolated words and then extended to sentences. The model allows one to separate linguistic information from the physiological and physical properties of the speaker's phonatory system, and to synthesize realistic Fo-contours from a set of simple rules. The same approach has also been extended to the analysis of pitch control in singing, and has proved to be equally valid. Finally, interpretation of the model is presented on the basis of physiological and physical properties of the vocal cord and the structure of the human larynx, whose main components can actually be regarded as constituting a second-order linear system.

Acknowledgments

This paper is an abridged version of a seminar talk given on April 14, 1981 at the Department of Speech Communication & Music Acoustics, KTH, where the author was staying as Visiting Professor. The author is grateful to Prof. Gunnar Fant and his associates for giving him the opportunity to review his work on voice fundamental frequency, and wishes to emphasize, with warm gratitude, that the present work was inspired by the pioneering works from KTH, by Sven Öhman and by Johan Sundberg, to whom he also extends his heartfelt thanks.


References

(1) Öhman, S. (1967): "Word and sentence intonation: A quantitative model", STL-QPSR 2-3/1967, pp. 20-54.

(2) Isačenko, A.V. & Schädlich, H.J. (1966): "Untersuchungen über die deutsche Satzintonation", Studia Grammatica 7, pp. 7-67.

(3) 't Hart, J. (1966): "Perceptual analysis of Dutch intonation features", I.P.O. Annual Progress Report 1, pp. 47-51.

(4) Maeda, S. (1974): "A characterization of fundamental frequency contours of speech", Quarterly Progress Report, No. 114, Research Lab. of Electronics, M.I.T., pp. 193-211.

(5) Vaissière-Maeda, J. (1980): "La structuration acoustique de la phrase française", Annali della Scuola Normale Superiore di Pisa, Serie III, X, pp. 529-560.

(6) Fujisaki, H. & Nagashima, S. (1969): "A model for synthesis of pitch contours of connected speech", Annual Report, Engineering Research Institute, University of Tokyo 28, pp. 53-60.

(7) Fujisaki, H. & Sudo, H. (1971): "A model for the generation of fundamental frequency contours of Japanese word accent", J. Acoust. Soc. Japan 27, pp. 445-453.

(8) Fujisaki, H. & Sudo, H. (1971): "Synthesis by rule of prosodic features of connected Japanese", Proc. of 7th ICA, 3, pp. 133-136.

(9) Fujisaki, H. & Sugito, M. (1978): "Analysis and perception of two-mora word accent types in the Kinki dialect", J. Acoust. Soc. Japan 34, pp. 167-176.

(10) Hirose, K., Fujisaki, H., & Sugito, M. (1978): "Acoustic correlates of word accent in English and Japanese", Trans. of the Committee on Speech Research, Acoust. Soc. Japan, S78-41.

(11) Fujisaki, H. & Sudo, H. (1972): "A generative model for the prosody of connected speech in Japanese", Conf. Record, 1972 Conf. on Speech Communication and Processing, IEEE-AFCRL, pp. 140-143.

(12) Hirose, K. & Fujisaki, H. (1980): "Acoustical features of fundamental frequency contours of Japanese sentences", Proc. of 10th ICA, 2, FJ;-9.2.

(13) Ohala, J. & Ewan, W. (1973): "Speed of pitch change", J. Acoust. Soc. Am. 53, p. 345 (A).

(14) Sundberg, J. (1979): "Maximum speed of pitch changes in singers and untrained subjects", J. Phonetics 7, pp. 71-79.

(15) Buchthal, F. & Kaiser, E. (1944): "Factors determining tension development in skeletal muscle", Acta Physiol. Scand. 8, pp. 38-74.

(16) Sandow, A. (1958): "A theory of active state mechanisms in isometric muscular contraction", Science 127, pp. 760-762.

(17) Slater, J.C. & Frank, N.H. (1933): Introduction to Theoretical Physics, McGraw-Hill Book Co., New York.
