Dept. for Speech, Music and Hearing
Quarterly Progress and Status Report

Dynamic characteristics of voice fundamental frequency in speech and singing. Acoustical analysis and physiological interpretations

Fujisaki, H.

journal: STL-QPSR
volume: 22
number: 1
year: 1981
pages: 001-020

http://www.speech.kth.se/qpsr
A. DYNAMIC CHARACTERISTICS OF VOICE FUNDAMENTAL FREQUENCY IN SPEECH AND SINGING. ACOUSTICAL ANALYSIS AND PHYSIOLOGICAL INTERPRETATIONS
H. Fujisaki*

Abstract
Voice fundamental frequency plays an important role in expressing linguistic as well as non-linguistic information. The exact relationships between such information and the characteristics of the contour of the fundamental frequency, however, have not been fully clarified. The present paper summarizes the author's approach to the elucidation of the mechanism whereby the linguistic information is converted into the dynamic characteristics of the fundamental frequency contour. A functional model is presented first for the process of generating the contour from a set of discrete linguistic commands, and is shown to be capable of closely approximating contours observed both in isolated words and in sentences of Japanese. The model is then shown to be also valid for the process of pitch control in singing. Finally, an interpretation is presented on the basis of physiological and physical properties/structures of the human larynx.
1. Introduction
In many of the Indo-European languages (1-5) as well as in the Japanese language (6), the contour of the voice fundamental frequency (henceforth Fo-contour) plays an important role in transmitting not only linguistic information but also non-linguistic information such as naturalness, emotion, and speaker idiosyncrasy. Because of difficulties in accurate analysis and in quantitative description, the relationships between the linguistic/non-linguistic information and the Fo-contour characteristics have not been fully clarified. The elucidation of these relationships requires, firstly, the selection of characteristic parameters that are capable of describing the essential features of an Fo-contour, and secondly, a method for extracting these parameters from an observed Fo-contour. In other words, an analytical formulation (i.e., a model) of the control process of voice fundamental frequency is indispensable for the quantitative analysis and linguistic interpretation of Fo-contour characteristics.

It is widely recognized that Fo-contours of words and sentences are generally characterized by a gradual declination from the onset toward the end of an utterance, superposed by local humps corresponding to word accent, etc. Most of the models that have been proposed for the interpretation of Fo-contour characteristics, however, are based on a rather crude approximation of the contour on a linear scale of fundamental frequency, and fall short of the explanatory power to deal precisely and quantitatively with the contribution of various factors involved in the formation of an Fo-contour. The present paper summarizes the author's approach to this problem, and presents models developed for the analysis and interpretation of word and sentence intonation as well as of pitch control in singing. Although the data presented here is exclusively on Japanese, the approach has been found to be valid also for English.

* Dept. of Electrical Engineering, Faculty of Engineering, University of Tokyo. Also Visiting Professor at Dept. of Speech Communication & Music Acoustics, KTH, during April 1981.
2. Fundamental frequency contour of isolated words (6-10)
2.1 Formulation of the model
The word prosody found in many dialects of spoken Japanese is characterized essentially by binary patterns of subjective pitch associated with each 'mora' (i.e., a unit of metric timing usually equal to, but sometimes smaller than, a syllable). The existence of such pitch patterns constitutes the 'word accent', and a specific pitch pattern is called an 'accent type'. Each dialect is characterized by its own accent system, i.e., a system of word accent types peculiar to a dialect or a group of dialects. For example, the accent system of the Tokyo dialect is characterized by the following constraints: (1) the subjective pitch invariably displays a transition, either upward or downward, at the end of the initial mora, and (2) an upward transition can be followed by a downward transition, but not vice versa, within a word. The type with a 'high' final mora is further divided into two types, depending on whether or not the high pitch is carried over to the following particle. Thus, the total number of accent types of n-mora words is n+1 in the Tokyo dialect. On the other hand, the accent system of the Osaka dialect is characterized by a different set of constraints which yield 2n-1 accent types for n-mora words, with a few exceptions in one- and two-mora words.

While the prosodic information intended by the speaker and perceived by the listener is thus discrete and binary, the contour of the voice fundamental frequency is continuous both in time and in
Fig. I-A-2. A functional model for the process of generating an Fo-contour of a word: conversion of utterance and accent commands into an actual Fo-contour. (Labels in the diagram: utterance command, accent command, utterance control mechanism, accent control mechanism, Ga(t), glottal oscillation mechanism, fundamental frequency; horizontal axis: time.)
respectively indicate the step response function of the corresponding control mechanism to the phrase and accent commands. The αi's and βj's are expected to be fairly constant within a sentence, or among utterances of an individual speaker. I and J are the numbers of phrase and accent commands, T0i and T3i respectively denote the onset and offset of the i:th phrase command, while T1j and T2j respectively denote the onset and offset of the j:th accent command. In the absence of respiratory pauses within a spoken sentence, the offset times T3i for all the phrase commands are assumed to be identical for all i's within an utterance. On the other hand, the accent commands are constrained not to overlap each other.
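The defining equation of the sentence model is not part of the text reproduced here, so the following Python sketch is only an illustration under stated assumptions, not the author's formulation. It assumes that ln Fo is the sum of ln Fmin and the responses to rectangular phrase and accent commands, that the accent response is the critically damped second-order step response, and that the phrase response has the form αt·exp(-αt). The function names are hypothetical; the parameter values are those listed for sentence (a) in Table I-A-I.

```python
import math

def phrase_response(t, alpha):
    """ASSUMED phrase-control response: alpha*t*exp(-alpha*t) for t >= 0, else 0."""
    return alpha * t * math.exp(-alpha * t) if t >= 0.0 else 0.0

def accent_response(t, beta):
    """Critically damped second-order step response for t >= 0, else 0."""
    return 1.0 - (1.0 + beta * t) * math.exp(-beta * t) if t >= 0.0 else 0.0

def f0_contour(t, f_min, phrase_cmds, accent_cmds):
    """Fo in Hz at time t (sec), built in the log-frequency domain (assumed form).

    Each command is a tuple (onset, offset, amplitude, rate constant).
    """
    log_f0 = math.log(f_min)
    for onset, offset, amp, alpha in phrase_cmds:
        log_f0 += amp * (phrase_response(t - onset, alpha)
                         - phrase_response(t - offset, alpha))
    for onset, offset, amp, beta in accent_cmds:
        log_f0 += amp * (accent_response(t - onset, beta)
                         - accent_response(t - offset, beta))
    return math.exp(log_f0)

# Parameter values transcribed from Table I-A-I, sentence (a).
F_MIN_A = 83.0                                  # Fmin (Hz)
PHRASES_A = [(-0.21, 1.01, 1.18, 3.3)]          # (T0, T3, Ap, alpha)
ACCENTS_A = [(0.13, 0.64, 0.46, 20.5),          # (T1, T2, Aa, beta)
             (0.71, 0.93, 0.13, 20.8)]

if __name__ == "__main__":
    for k in range(11):
        t = 0.1 * k
        f0 = f0_contour(t, F_MIN_A, PHRASES_A, ACCENTS_A)
        print(f"t = {t:4.2f} s   Fo = {f0:6.1f} Hz")
```

Working in the logarithmic domain means that each command scales Fo multiplicatively, so the contour returns to Fmin wherever no command is active.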
3.2 Experimental results
A set of ten declarative sentences, each consisting only of voiced segments and ranging in length from 8 to 24 morae, was selected. Utterances of these sentences by three male speakers of the Tokyo dialect were recorded and analyzed. For the purpose of the present study, these utterances were made without respiratory pauses within a sentence.
Fig. I-A-4 illustrates results of one sample each from the three sentences:
(a) /aoinoewaaru/ ("There is a picture of hollyhocks.")
(b) /aoiaoinoewaaru/ ("There is a picture of blue hollyhocks.")
(c) /aoiaoinoewa jamanouenoieniaru/ ("There is a picture of blue hollyhocks in a house on a mountain.")
The measurement of the voice fundamental frequency was made at intervals of 12.8 msec, but the '+' symbols indicate the measured Fo-contour only at every third point. The curve displayed as a solid line indicates the best approximation given by the model, and the curve displayed as a dashed line indicates the baseline component estimated at the same time. The analysis clearly indicates the existence of partial re-phrasing at the boundary between the subject phrase and the predicate phrase in the case of sentence (c). The stepwise waveforms at the bottom of each panel schematically indicate the timing
Table I-A-I. Parameters of sentence Fo-contours extracted by finding the best approximations to the observed Fo-contours.

Sentence        Fmin (Hz)
(a)  8 morae    83
(b) 11 morae    84
(c) 20 morae    76

Phrase commands
Sentence   i   T0 (sec)   T3 (sec)   Ap     α (sec^-1)
(a)        1   -0.21      1.01       1.18   3.3
(b)        1   -0.21      1.42       1.20   3.6
(c)        1   -0.19      2.87       1.50   3.0
(c)        2    1.07      2.87       0.80   3.2

Accent commands
Sentence   j   T1 (sec)   T2 (sec)   Aa     β (sec^-1)
(a)        1   0.13       0.64       0.46   20.5
(a)        2   0.71       0.93       0.13   20.8
(b)        1   0.08       0.27       0.54   21.2
(b)        2   0.51       1.00       0.49   20.5
(b)        3   1.02       1.34       0.20   20.0
(c)        1   0.12       0.33       0.56   24.0
(c)        2   0.57       1.05       0.20   26.0
(c)        3   1.41       2.36       0.52   24.0
(c)        4   2.36       2.81       0.22   23.5
The timing parameters for the accent commands naturally vary widely from one sentence to another, since they reflect the lexical information of each of the constituent words. On the other hand, parameters such as αi and βj are found to remain fairly constant, since they characterize the dynamic properties of a subject's glottal control mechanism to phrase and accent commands. The amplitude Aaj of the accent command is seen to be distributed over two distinct ranges of values: one from 0.46 to 0.56, and the other from 0.13 to 0.22. This result may reflect the discrete (binary) character of accentuation.
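The two ranges can be checked directly against the nine accent amplitudes listed in Table I-A-I. The short Python sketch below sorts the values and splits them at the largest gap; this splitting rule and the function name are illustrative choices made here, not a procedure described in the paper.

```python
# Accent command amplitudes Aa transcribed from Table I-A-I, sentences (a) to (c).
ACCENT_AMPLITUDES = [0.46, 0.13, 0.54, 0.49, 0.20, 0.56, 0.20, 0.52, 0.22]

def split_at_largest_gap(values):
    """Sort the values and cut them into two groups at the widest gap."""
    ordered = sorted(values)
    gaps = [high - low for low, high in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]

minor, major = split_at_largest_gap(ACCENT_AMPLITUDES)

if __name__ == "__main__":
    print("minor range:", min(minor), "to", max(minor))
    print("major range:", min(major), "to", max(major))
```

The widest gap falls between 0.22 and 0.46, so no amplitude lies between the two ranges quoted in the text.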
These results also suggest that the approximation by the present model to an Fo-contour may not be seriously impaired by constraining all the αi's and βj's to remain constant, and by allowing only two values for the Aaj's. Analysis of Fo-contours under these constraints has actually been conducted, and the results tend to confirm the validity of this simplification.
Fig. I-A-5. Parameters α, β of phrase and accent components versus number of morae n in a sentence.
Fig. I-A-6. Amplitude of phrase and accent components versus number of morae n in a sentence. (Legend: Ap; Aa (major); Aa (minor). Horizontal axis: number of morae n.)
The relationships between the sentence length n and the parameters of the simplified model are shown in Figs. I-A-5 & I-A-6. As seen from Fig. I-A-5, the influence of sentence length n upon parameters α and β is quite small, indicating that the apparent
two females with a similar range and quality of voice. One (subject MT) was a voice trainer with 10 years of professional training in singing, and the other (subject YS) was a student at a music conservatory with three years of training in singing. The pitches of the two notes were selected to suit their voice range, and the intervals were either a musical fourth (A4-D5) or an octave (D4-D5). The sequences were produced in both directions, upward and downward. Each sequence was repeated several times in 3/4 time with an M.M. setting of 100, except in the case of portamento where the beat was approximately M.M. 80. These sequences were sung at three levels of volume: forte (f), mezzopiano (mp), and pianissimo (pp). A minimum of five samples were collected for each of the conditions and subjects. For the sake of comparison, speech materials were also recorded. These were isolated utterances of two words in the Tokyo dialect of Japanese: /ame/ ("candy") and /ame/ ("rain"). These two words possess an identical phonemic structure but differ in the accent type manifested mainly in their Fo-contours, the former being the "low-high" type and the latter being the "high-low" type.
The technique for fundamental frequency extraction was the same as that adopted for speech, but the data were interpolated at intervals of 10 msec. The extraction of characteristic parameters of an Fo-contour also followed the same line as that for Fo-contours of spoken words, but the model was modified to suit the observed data, by suppressing the utterance component and by removing the 'critical-damping' constraint on the form of Fo-transition. If we assume a step function for the shape of the command to switch notes, the Fo-contour of a two-note sequence can be represented by
where

$$
f(\beta, \gamma, t) =
\begin{cases}
1 - \left[\cos\left(\sqrt{1-\gamma^{2}}\,\beta t\right) + \dfrac{\gamma}{\sqrt{1-\gamma^{2}}}\,\sin\left(\sqrt{1-\gamma^{2}}\,\beta t\right)\right]\exp(-\beta\gamma t), & \text{for } \gamma < 1, \\[1ex]
1 - (1+\beta t)\exp(-\beta t), & \text{for } \gamma = 1,
\end{cases}
$$

u(t) denotes the unit step function, Fi and Ff respectively denote the initial and final values of the transition, and β and γ are parameters characterizing the second-order linear system. The origin of the time axis is selected at the onset of transition.
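As a rough illustration of this step response, the Python sketch below evaluates f(β, γ, t) for both the underdamped and the critically damped case and uses it to move between two notes. The contour formula is an assumption made here: ln Fo is taken to travel from ln Fi to ln Ff along f(β, γ, t), in line with the logarithmic treatment of Fo in the speech model. The note frequencies (A4 = 440 Hz, D5 ≈ 587.33 Hz, equal temperament) and the values β = 30 sec^-1 and γ = 0.7 are hypothetical examples, not measured data.

```python
import math

def step_response(beta, gamma, t):
    """f(beta, gamma, t): step response of the second-order system, 0 <= gamma <= 1."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("only the cases gamma < 1 and gamma = 1 are defined here")
    if t < 0.0:
        return 0.0
    if gamma == 1.0:
        # Critically damped case.
        return 1.0 - (1.0 + beta * t) * math.exp(-beta * t)
    # Underdamped case.
    root = math.sqrt(1.0 - gamma * gamma)
    phase = root * beta * t
    envelope = math.exp(-beta * gamma * t)
    return 1.0 - (math.cos(phase) + gamma / root * math.sin(phase)) * envelope

def two_note_f0(t, f_initial, f_final, beta, gamma):
    """ASSUMED contour: ln Fo moves from ln Fi to ln Ff along f(beta, gamma, t)."""
    span = math.log(f_final) - math.log(f_initial)
    return math.exp(math.log(f_initial) + span * step_response(beta, gamma, t))

if __name__ == "__main__":
    # Hypothetical upward fourth, A4 to D5, sampled every 10 msec.
    for k in range(0, 31, 3):
        t = 0.01 * k
        print(f"t = {t:4.2f} s   Fo = {two_note_f0(t, 440.0, 587.33, 30.0, 0.7):6.1f} Hz")
```

With γ below 1 the response overshoots the target note before settling, which is the behaviour referred to as underdamped; with γ = 1 it approaches the target without overshoot.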
Although it is possible to measure such characteristics as the rise/fall times directly from an Fo-contour, the above formulation gives us more insight into the underlying mechanism of pitch control. Parameters such as β and γ can be obtained from a measured Fo-contour by finding its best approximation given by the above equations, which can then be used to determine the rise/fall times. For the sake of comparison with the published results by Ohala (13) and Sundberg (14), we adopt here the same definition of rise/fall time as in their studies. Namely, the rise/fall time is defined as the time required for the pitch to change from 1/8 to 7/8 of the total range of transition.
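Under this definition the rise/fall time follows from β and γ alone, since it is the interval between the instants at which f(β, γ, t) first reaches 1/8 and 7/8. The Python sketch below finds these instants numerically by a coarse scan followed by bisection; the numerical method, the function names, and the example value β = 20 sec^-1 are choices made here for illustration and are not taken from the paper.

```python
import math

def step_response(beta, gamma, t):
    """f(beta, gamma, t) for 0 <= gamma <= 1 and t >= 0."""
    if gamma == 1.0:
        return 1.0 - (1.0 + beta * t) * math.exp(-beta * t)
    root = math.sqrt(1.0 - gamma * gamma)
    phase = root * beta * t
    return 1.0 - (math.cos(phase) + gamma / root * math.sin(phase)) * math.exp(-beta * gamma * t)

def crossing_time(beta, gamma, level, dt=1e-4, t_max=5.0):
    """First time (sec) at which the response reaches `level`, 0 < level < 1."""
    t = 0.0
    # Coarse scan: advance until the next sample is at or above the level.
    while step_response(beta, gamma, t + dt) < level:
        t += dt
        if t > t_max:
            raise ValueError("level not reached within t_max")
    # Bisection inside the bracketing interval [t, t + dt].
    low, high = t, t + dt
    for _ in range(60):
        mid = 0.5 * (low + high)
        if step_response(beta, gamma, mid) < level:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)

def rise_fall_time(beta, gamma):
    """Time for the transition to go from 1/8 to 7/8 of its total range."""
    return crossing_time(beta, gamma, 7.0 / 8.0) - crossing_time(beta, gamma, 1.0 / 8.0)

if __name__ == "__main__":
    for gamma in (0.5, 0.7, 1.0):
        print(f"gamma = {gamma:3.1f}   rise/fall time = {1000.0 * rise_fall_time(20.0, gamma):6.1f} msec")
```

For the critically damped case the result scales as roughly 3/β, so doubling β halves the rise/fall time; for a given β, smaller values of γ give shorter times.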
Fig. I-A-7 illustrates one example each of pitch transitions across the interval of a fourth (A4-D5) in the sung material of subject MT produced at mezzopiano under eight different conditions, i.e., upward and downward transitions sung at four different degrees of articulation (appoggiatura, non legato, legato, and portamento). The '+' symbols indicate the Fo-contours measured at 10 msec intervals, while the curve in each panel indicates the best approximation based on Eq. (3).
Quite naturally, the rate of pitch change is seen to vary to a large extent with the degree of articulation, and also to vary with the direction, i.e., a downward transition is faster than an upward transition, especially in appoggiatura and in non legato. The Fo-contours are clearly underdamped in these fast transitions, while they are almost critically damped in normal and slower transitions (legato and portamento).

Table I-A-II lists the mean values of β and γ averaged over several samples each of the eight conditions, obtained from the analysis of the material sung by MT, together with the mean rise/fall times defined as above and calculated from β and γ. For the sake of comparison, parameter values obtained from the speech material are also listed. Analysis of materials sung at different levels indicated that changes in the volume do not appreciably affect the rate of transition in most cases.
On the other hand, differences in the rate for upward and downward transitions can be explained by referring to the stress-strain relationship of Eq. (4). The incremental stiffness, as given by ∂T/∂x, is obviously greater at larger values of x. Since the initial value of x is greater in the downward transitions, the stiffness is greater and hence produces a larger value of β than in the upward transitions.
5. Conclusions
Dynamic characteristics of the voice fundamental frequency have been investigated both in speech and in singing. The logarithm of fundamental frequency in speech has been regarded as the response of the control mechanisms of vocal cord vibration to a set of linguistic commands, while the mechanisms have been assumed to be second-order linear systems. The model has been first developed for isolated words and then extended to sentences. The model allows one to separate linguistic information from the physiological and physical properties of the speaker's phonatory system, and to synthesize realistic Fo-contours from a set of simple rules. The same approach has also been extended to the analysis of pitch control in singing, and has proved to be equally valid. Finally, interpretation of the model is presented on the basis of physiological and physical properties of the vocal cord and the structure of the human larynx, whose main components can actually be regarded as constituting a second-order linear system.
Acknowledgments
This paper is an abridged version of a seminar talk given on April 14, 1981 at the Department of Speech Communication & Music Acoustics, KTH, where the author was staying as Visiting Professor. The author is grateful to Prof. Gunnar Fant and his associates for giving him the opportunity to review his work on voice fundamental frequency, and wishes to emphasize, with warm gratitude, that the present work was inspired by the pioneering works from KTH, by Sven Öhman and by Johan Sundberg, to whom he also wishes to express his heartfelt thanks.
References
(1) Öhman, S. (1967): "Word and sentence intonation: A quantitative model", STL-QPSR 2-3/1967, pp. 20-54.
(2) Isačenko, A.V. & Schädlich, H.J. (1966): "Untersuchungen über die deutsche Satzintonation", Studia Grammatica 7, pp. 7-67.
(3) 't Hart, J. (1966): "Perceptual analysis of Dutch intonation features", I.P.O. Annual Progress Report 1, pp. 47-51.
(4) Maeda, S. (1974): "A characterization of fundamental frequency contours of speech", Quarterly Progress Report, No. 114, Research Lab. of Electronics, M.I.T., pp. 193-211.
(5) Vaissière-Maeda, J. (1980): "La structuration acoustique de la phrase Française", Annali della Scuola Normale Superiore di Pisa, Serie III, X, pp. 529-560.
(6) Fujisaki, H. & Nagashima, S. (1969): "A model for synthesis of pitch contours of connected speech", Annual Report, Engineering Research Institute, University of Tokyo 28, pp. 53-60.
(7) Fujisaki, H. & Sudo, H. (1971): "A model for the generation of fundamental frequency contours of Japanese word accent", J. Acoust. Soc. Japan 27, pp. 445-453.
(8) Fujisaki, H. & Sudo, H. (1971): "Synthesis by rule of prosodic features of connected Japanese", Proc. of 7th ICA, 3, pp. 133-136.
(9) Fujisaki, H. & Sugito, M. (1978): "Analysis and perception of two-mora word accent types in the Kinki dialect", J. Acoust. Soc. Japan 34, pp. 167-176.
(10) Hirose, K., Fujisaki, H., & Sugito, M. (1978): "Acoustic correlates of word accent in English and Japanese", Trans. of the Committee on Speech Research, Acoust. Soc. Japan, S78-41.
(11) Fujisaki, H. & Sudo, H. (1972): "A generative model for the prosody of connected speech in Japanese", Conf. Record, 1972 Conf. on Speech Communication and Processing, IEEE-AFCRL, pp. 140-143.
(12) Hirose, K. & Fujisaki, H. (1980): "Acoustical features of fundamental frequency contours of Japanese sentences", Proc. of 10th ICA, 2, FJ;-9.2.
(13) Ohala, J. & Ewan, W. (1973): "Speed of pitch change", J. Acoust. Soc. Am. 53, p. 345 (A).
(14) Sundberg, J. (1979): "Maximum speed of pitch changes in singers and untrained subjects", J. Phonetics 7, pp. 71-79.
(15) Buchthal, F. & Kaiser, E. (1944): "Factors determining tension development in skeletal muscle", Acta Physiol. Scand. 8, pp. 38-74.
(16) Sandow, A. (1958): "A theory of active state mechanisms in isometric muscular contraction", Science 127, pp. 760-762.
(17) Slater, J.C. & Frank, N.H. (1933): Introduction to Theoretical Physics, McGraw-Hill Book Co., New York.