Dept. for Speech, Music and Hearing
Quarterly Progress and Status Report

The effects of adjacent voiced/voiceless consonants on the vowel voice source: a cross language study
Gobl, C. and Ní Chasaide, A.

journal: STL-QPSR
volume: 29
number: 2-3
year: 1988
pages: 023-059

http://www.speech.kth.se/qpsr


STL-QPSR 2-3/1988

B. THE EFFECTS OF ADJACENT VOICED/VOICELESS CONSONANTS ON THE VOWEL VOICE SOURCE: A CROSS LANGUAGE STUDY*

Christer Gobl** & Ailbhe Ní Chasaide***

Abstract

This study examines how the voice source and source related spectral characteristics of a vowel are affected by the voiced/voiceless nature of an adjacent consonant. The first vowel in 'CVCV nonsense utterances of Swedish, French, and English was looked at, where C was either a voiced/voiceless labial stop or fricative. The voice source was studied by means of inverse filtering and by parameterizing the glottal waveform in terms of a four parameter voice source model (the LF model). Oral airflow recordings allowed inferences on the relative timing of glottal and supraglottal gestures and on incomplete glottal closure during the vowel. The results show some striking differences between languages for the vowel preceding a voiceless stop. Swedish tends to have glottal abduction substantially preceding the oral occlusion for the consonant, and this is reflected in the vowel source function as a weakening excitation and an increasing spectral slope. There is also a reduction in formant amplitudes, most strikingly for F1, as a consequence of bandwidth widening. By contrast, in French, where the glottal and oral gestures were much more tightly synchronized, there was little evidence of such effects. For the English speakers two distinct groups emerged: one similar to the Swedish speakers and one more like the French speakers. Differences in voice source characteristics for the vowel onset were comparatively small; regardless of the voiced/voiceless nature of the consonant, full excitation strength is generally achieved almost immediately. However, there is clear evidence of wider B1 and a somewhat steeper spectral slope when the preceding consonant involves high airflow, i.e., when it is a voiceless fricative or a postaspirated stop. Preliminary rules for onsets and offsets are presented.

1. Introduction

This paper sets out to describe in some detail the voice source variation in the vowel as a function of voicing characteristics of an adjacent consonant. It was undertaken with two objectives in mind. Its first, broad aim is to extend our understanding of voice source dynamics. It thereby contributes to the on-going work at KTH, Department of Speech Communication and Music Acoustics, on improving the naturalness of the voice source in synthetic speech, as part of the further development of the KTH text-to-speech system (Carlson, Granström, & Hunnicutt, 1981). Secondly, by concentrating specifically on the effects of phonological voiced/voiceless consonants on the source quality of the vowel, we hoped to shed some light on what we felt might be an interesting question: to what extent does the glottal gesture associated with the voiceless, as compared to the voiced, consonant affect the phonatory mode of the vowel? Or, to put it in a slightly different way, are there vowel voice quality correlates of these phonological contrasts? If the answer to the last question is positive, one would further wish to know whether any observable effects are likely to be universal, or specific to given languages or even individuals.

* This paper is an expanded version of paper AAA11 presented at the 114th ASA-meeting in Miami, November 1987 (Ní Chasaide & Gobl, 1987).

** Swedish Telecom Administration (Televerket), Technology Dept., Section for Research and Development, S-123 86 Farsta, Sweden. Graduate student at KTH.

*** Guest researcher from Centre for Language and Communication Studies, Trinity College, Dublin, Eire, funded by the Swedish Institute (Svenska institutet), S-103 91 Stockholm, Sweden.

In traditional formant synthesis the voice source has been treated as a low-pass filtered impulse train with a constant spectral slope, with F0 and amplitude the controllable parameters. This has increasingly been regarded as unsatisfactory, and in recent years more refined parametric glottal flow models have been proposed. One such model, which we use in this paper, is the LF model (Fant, Liljencrants, & Lin, 1985a), and it has, in addition to F0, four parameters modelling the glottal pulse shape. The increased number of parameters allows manipulation of the spectral slope and also, therefore, a more exact simulation of the voice source spectrum.

However, if we are to improve on the traditional voice source used in synthesis we need not only more realistic source models, but we must also discover more about the way in which real-life speakers modulate their source quality as a function of both linguistic and non-linguistic factors. Although there is a reasonably large literature on the voice source, only a few studies have dealt with voice source dynamics in connected speech (Ananthapadmanabha, 1984; Fant, 1979a; 1979b; 1980; 1981; 1982; Gobl, 1985; 1988). There remains much work to be done if we are to develop rules for controlling these new source parameters.

The improved source model used here is still within the domain of the non-interactive source-filter theory (Fant, 1960). Some interaction effects can, however, be simulated by the model, for example, glottal pulse skewing (Rothenberg, 1981; 1983). Interaction effects which cause overlaid ripple components cannot be captured by the model, for instance, truncation of formant oscillation, superposition of formant oscillation from one period to the following, and influences of the subglottal system (Ananthapadmanabha & Fant, 1982; Fant, 1987; Fant & Ananthapadmanabha, 1982; Fant & Lin, 1987; Fant, Lin, & Gobl, 1985b; Lin, 1987). These effects, some of which are particularly noticeable when the vocal folds are abducting but still vibrating, were, however, approximated so as to give the best spectral resemblance between the true glottal flow and the model. This procedure is described in some detail in the following section.

Fig. 1. Multichannel recording of the Icelandic word /lahpa/. Time points A, B, and C refer respectively to the beginning of vocal fold abduction, termination of vocal fold vibration, and oral closure. Osc = speech waveform. Uo = oral airflow (unfiltered and filtered). PEG = photo-electric glottograph signal.

If the timing of glottal abduction for the voiceless consonant is different in the different languages, what precisely are the consequences of these differences for the source characteristics of the vowel? And in what way might offsets for these stops differ from vowel offsets preceding voiced stops?

Does an adjacent voiceless consonant affect the onset of the vowel differently from the offset? Given the observations from earlier work mentioned above, we would expect vowel offset before a (somewhat) preaspirated stop to be rather different from vowel onset following a postaspirated stop.

Might there be different onset characteristics in the vowel depending on whether the preceding voiceless consonant is unaspirated as in French, or aspirated as in Swedish and English, quite apart from the question of how onset after voiced consonants differs from both of these?

2. Methods

Three techniques were used to shed light on these questions. First of all, airflow recordings were used for information on the general timing characteristics of oral and glottal gestures (see comments above on Fig. 1) and to provide a rough idea of the amount of leakage due to incomplete vocal fold closure during phonation. Detailed information on source characteristics was obtained by inverse filtering the speech waveform and matching a source model to the filter output. Furthermore, from spectral sections of the speech waveform the levels of formants and F0 were obtained.

2.1. Oral airflow

A Rothenberg mask was used to record oral airflow. The mask is an improved version of that described in Rothenberg (1973), and is specified to have a linear frequency response up to ca. 3 kHz. The frequency range was nevertheless restricted to less than 1.5 kHz for reasons that have to do with the nature of the signal and the recording device. The airflow pulses during voicing, with a spectral slope of ca. -12 dB per octave, are superimposed on a slowly varying, "DC", airflow component. The FM tape recorder used has a signal-to-noise ratio of ca. 40-50 dB. The signal-to-noise ratio for the periodic signal is substantially less, however, as most of it was used up by the DC component. Apart from the tape noise, the airflow transducer/amplifier generated a considerable amount of noise. Hence, harmonic components above 1.5 kHz were completely masked by the tape noise.*
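The noise argument above can be made concrete with a little arithmetic. The following is only a minimal sketch, assuming an idealised source whose harmonics fall off at exactly -12 dB per octave relative to the fundamental and a hypothetical F0 of 100 Hz; the function name is illustrative and not part of the original study.

```python
import math

def harmonic_level_db(freq_hz, f0_hz, slope_db_per_octave=-12.0):
    """Level of a harmonic relative to the fundamental for a constant spectral slope."""
    octaves_above_f0 = math.log2(freq_hz / f0_hz)
    return slope_db_per_octave * octaves_above_f0

# Assumed example: F0 = 100 Hz, harmonic at the 1.5 kHz band limit.
level_at_limit = harmonic_level_db(1500.0, 100.0)
print(round(level_at_limit, 1))  # -46.9
```

Under these assumptions a harmonic at 1.5 kHz already lies some 47 dB below the fundamental, which is of the same order as the 40-50 dB signal-to-noise ratio of the recorder before any of that range is taken up by the DC component.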

Along with the speech waveform, two separate airflow signals were recorded: one was simply the airflow signal from the mask without any manipulation, the other was the average airflow (a low-pass filtered version of the first using a second-order filter with a cutoff frequency of ca. 70 Hz).

* If one were recording the airflow signal directly onto computer, the problem of tape noise would be largely alleviated (depending, of course, on the resolution of the analog-to-digital converter).

2.2. Inverse filtering and matching with LF source model

Due to the limitation in frequency response of the airflow recordings they were unsatisfactory for the detailed study of the glottal flow. Instead, a high quality microphone recording of the sound pressure was utilized for this task. The recording was conducted in an anechoic room using a 0.5" B&K condenser microphone and a SONY F1 digital tape recorder. The recorded speech materials were low-pass filtered at 6.3 kHz and transferred to computer using a sampling frequency of 16 kHz and a 16 bit (for some of the materials a 12 bit) analog-to-digital converter. The speech signal was further high-pass filtered digitally at 20 Hz (by applying a third order Butterworth filter twice, forwards and backwards in time, so as to retain linear phase response) and this ensured a correct indication of the zero-pressure line.
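The forwards-and-backwards high-pass filtering described above can be illustrated with a small sketch. Note the assumptions: a simple first-order high-pass stands in for the third order Butterworth filter of the text, and all function names are hypothetical. The point shown is only that running a causal filter once forwards and once backwards in time yields an impulse response that is symmetric about the impulse, i.e. no phase distortion.

```python
import math

def highpass_first_order(x, cutoff_hz, fs_hz):
    """Causal first-order high-pass (a stand-in for the Butterworth filter in the text)."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = rc / (rc + 1.0 / fs_hz)
    y = []
    prev_x = 0.0
    prev_y = 0.0
    for sample in x:
        prev_y = alpha * (prev_y + sample - prev_x)
        prev_x = sample
        y.append(prev_y)
    return y

def zero_phase(x, cutoff_hz, fs_hz):
    """Filter forwards, then backwards in time, so that the phase shifts cancel."""
    forward = highpass_first_order(x, cutoff_hz, fs_hz)
    backward = highpass_first_order(forward[::-1], cutoff_hz, fs_hz)
    return backward[::-1]

fs = 16000.0          # sampling frequency from the text
n = 8001
centre = n // 2
impulse = [0.0] * n
impulse[centre] = 1.0

response = zero_phase(impulse, 20.0, fs)

# Largest difference between mirror-image samples around the impulse.
asymmetry = max(abs(response[centre + k] - response[centre - k])
                for k in range(1, 2000))
print(asymmetry < 1e-9)
```

A single forward pass would give a one-sided (causal) response and hence a frequency-dependent delay; the second, time-reversed pass removes that delay at the cost of squaring the magnitude response.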

Fig. 2. INA inverse filtering programme display. (a) Speech waveform. (b) Log FFT-spectrum of (a). (c) Filtered waveform. (d) Log FFT-spectrum of (c). (e) Filter configuration.

The number of zeros in the filter function can be varied, and their frequencies and bandwidths can be manipulated interactively. The filter output is updated in real-time both in the time domain (the filtered waveform) and in the frequency domain (the FFT spectrum).


The second stage in this procedure involved matching a model of differentiated glottal flow to the inverse filtered data. The model used was the four parameter LF model (Fant & al., 1985a) and is briefly outlined below. The matching programme (developed at KTH by Ananthapadmanabha following suggestions by Fant) can be illustrated with reference to Fig. 3. On the basis of five time points (x in figure) which the experimenter manually marks on the differentiated flow pulse (a), an LF model curveform is calculated and superimposed (c). The programme lists the corresponding set of parameter values (e), which quantify the gross characteristics of the glottal pulse. The matching is deemed successful when a close correspondence is obtained in both the time domain ((a) and (c)) and the frequency domain ((b) and (d)). Typically for the onsets and the offsets of the vowel, source matching was carried out for every glottal pulse.

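The LF model itself (cf., Fig. 4) is composed of two waveform segments. The first segment models the differentiated flow from glottal opening to the time of the main excitation, and is formulated as a portion of the initial cycle of an exponentially growing (i.e., underdamped) sinusoid:

$$E(t) = E_0 e^{\alpha t} \sin \omega_g t, \qquad 0 \le t \le t_e \qquad (1)$$

The second segment is an exponential function that allows a residual flow (dynamic leakage) after the main excitation. The segment used for this "return phase" is given by

$$E(t) = -\frac{E_e}{\varepsilon t_a}\left[e^{-\varepsilon (t - t_e)} - e^{-\varepsilon (t_c - t_e)}\right], \qquad t_e \le t \le t_c \qquad (2)$$

where $E_e$ is

$$E_e = -E_0 e^{\alpha t_e} \sin \omega_g t_e$$

and $\varepsilon$ can iteratively be determined from

$$\varepsilon t_a = 1 - e^{-\varepsilon (t_c - t_e)}$$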

Fig. 4. The four parameter LF model of differentiated glottal flow (time axis in ms).
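The return phase of the LF model and the iterative determination of its exponent can be sketched as follows. This is an illustration only, assuming the standard LF formulation in which the return phase is E(t) = -(Ee / (ε ta)) [exp(-ε (t - te)) - exp(-ε (tc - te))] and ε satisfies ε ta = 1 - exp(-ε (tc - te)); the fixed-point iteration, the function names and the numerical values are assumptions of this sketch and are not taken from the KTH matching programme.

```python
import math

def solve_epsilon(ta, tc, te, tol=1e-12, max_iter=200):
    """Fixed-point iteration for eps in: eps * ta = 1 - exp(-eps * (tc - te))."""
    eps = 1.0 / ta  # good starting guess when the return phase is long compared with ta
    for _ in range(max_iter):
        new_eps = (1.0 - math.exp(-eps * (tc - te))) / ta
        if abs(new_eps - eps) <= tol * new_eps:
            return new_eps
        eps = new_eps
    raise RuntimeError("epsilon iteration did not converge")

def return_phase(t, ee, ta, tc, te, eps):
    """Differentiated glottal flow in the return phase, for te <= t <= tc."""
    return -(ee / (eps * ta)) * (math.exp(-eps * (t - te)) - math.exp(-eps * (tc - te)))

# Hypothetical timing values in seconds, chosen only for illustration.
te, tc, ta, ee = 0.0045, 0.0080, 0.0003, 1.0
eps = solve_epsilon(ta, tc, te)

start_value = return_phase(te, ee, ta, tc, te, eps)  # should equal -Ee
end_value = return_phase(tc, ee, ta, tc, te, eps)    # should equal 0
```

The condition on ε is what makes the return phase start exactly at -Ee at the instant of excitation and reach zero at tc, so the two checks at the end are a quick way of confirming that the iteration has converged.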

The four parameters of the LF model, as given by the equations (1) and (2), are E0, α, ωg, pertaining to segment 1, and ta, pertaining to segment 2. The LF model matching programme yields a set of conceptually more transparent parameters Ee, rk, rg, and ra to quantify the inverse filter output. As our results are given in terms of these, they are described in some detail here, and are illustrated in terms of true and differentiated glottal flow in Fig. 5a. The spectral consequences of variation in these parameters are shown in Fig. 5b. (In our study, Ee and ra turned out to be the most revealing of the four parameters, and we shall therefore mainly concentrate on them.)

Ee, excitation strength, is the negative amplitude at the time point of maximum discontinuity of the differentiated flow (see Fig. 5a). It normally corresponds to the maximum slope of the falling branch of the glottal pulse, which typically precedes full closure. At the production level it is determined by the speed of closure of the vocal folds and the volume velocity through them. At the acoustic level it corresponds to the overall intensity of the signal (see Fig. 5b). The amplitude Ee is measured in arbitrary units on a linear scale.


rk is a measure of the asymmetry between the opening and closing branches of the glottal pulse. It is here defined as (te - tp)/tp, in percent, where te - tp is the time interval from peak glottal flow to the excitation, and tp is the time interval from glottal opening to peak flow (see Fig. 4). This corresponds to tn/tp in Fig. 5a. Therefore, the larger the rk, the more symmetrical the pulse. In comparison to the underlying glottal area function, the glottal flow pulse is typically skewed to the right, i.e., the opening phase tends to be longer than the closing phase. This would appear to be due to the inertive loading of the vocal tract (see, for instance, Rothenberg, 1981) and to glottal inductance. The phase difference in the vibratory pattern between the upper and the lower parts of the vocal folds has also been demonstrated to contribute to the skewing of the pulse (Cranen & Boves, 1985). The acoustic consequences of rk variation are somewhat complex. In combination with rg, it affects mainly the lower part of the source spectrum (see Fig. 5b). However, the degree of skewing also determines the depth of the zeros in the source spectrum: the more symmetrical the pulse, the deeper the spectral dips. Taken together, rg, rk, and ra determine the open quotient, OQ, defined as the ratio of the open phase to the fundamental period, T0.* In turn, OQ together with the pulse shape determine the location of the zeros in the source spectrum (cf., Flanagan, 1972, Section 6.241).**

rg is defined as Fg/F0, i.e., the "glottal frequency", Fg, as a percentage of F0. Fg is a frequency related to the opening branch of the glottal waveform, and is defined as 1/(2tp). As already mentioned above, rg together with rk mainly determines the levels of the lower harmonics in the source spectrum (Fig. 5b).

ra is a measure of the residual flow from excitation to complete closure (or maximum closure if there is a DC leakage). As can be seen in Fig. 5a, ra is equal to ta/T0, in percent, where ta is the time constant of the exponential function modelling the return phase. At the production level ra relates to the sharpness of the glottal closure, that is, to whether the vocal folds make contact in an instantaneous way or in a more gradual fashion along their entire length and depth. Differences in ra are important acoustically because ra affects the spectral slope of the signal (cf., Fig. 5b). The exponential characteristics of the return phase are approximately those of a first order low-pass filter. The cutoff frequency, Fa, is inversely proportional to ta:

$$F_a = \frac{1}{2\pi t_a}$$

* Later in the text and in Fig. 5b, we use a slightly different formulation, which we designate Oq. The "open" period, defined as the interval from opening to excitation, excludes the return phase. This helps to disambiguate between effects on different parts of the spectrum.

** A simple analogy: a pulse wave (i.e. a perfectly symmetrical pulse shape) with a pulse width, PW, (= duty-cycle or open quotient) contains all the harmonics except i/PW, where i = 1, 2, ..., n. For example, a square wave, PW = 0.5, consists of only odd numbered harmonics. In contrast, a sawtooth waveform, which can be seen as a maximally skewed pulse, has a spectrum with all harmonics.
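The pulse-wave analogy in the footnote can be checked numerically. The sketch below assumes an ideal rectangular pulse train of unit height, for which the magnitude of the k-th Fourier coefficient is |sin(π k PW)| / (π k); the function names are illustrative only.

```python
import math

def pulse_harmonic_amplitude(k, pulse_width):
    """Magnitude of the k-th Fourier coefficient of a unit-height rectangular pulse train."""
    return abs(math.sin(math.pi * k * pulse_width)) / (math.pi * k)

def missing_harmonics(pulse_width, highest=12, tol=1e-12):
    """Harmonic numbers whose amplitude vanishes for the given duty cycle."""
    return [k for k in range(1, highest + 1)
            if pulse_harmonic_amplitude(k, pulse_width) < tol]

print(missing_harmonics(0.5))   # [2, 4, 6, 8, 10, 12]  (square wave: odd harmonics only)
print(missing_harmonics(0.25))  # [4, 8, 12]
```

The missing harmonics are the multiples of 1/PW, which is the sense in which the open quotient sets the location of the zeros in the source spectrum for a symmetrical pulse.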


The approximate attenuation (in dB) as a function of the frequency is:

$$\Delta L_a(f) = 10 \log_{10}\left[1 + (f/F_a)^2\right]$$
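As a worked illustration of the attenuation formula, the sketch below evaluates it for two hypothetical values of ta. It assumes the usual first-order low-pass relation Fa = 1/(2π ta) between the return-phase time constant and the cutoff frequency; the ta values and function names are chosen for illustration and are not measurements from this study.

```python
import math

def cutoff_frequency_hz(ta_s):
    """Cutoff of a first-order low-pass with time constant ta (assumed Fa = 1 / (2*pi*ta))."""
    return 1.0 / (2.0 * math.pi * ta_s)

def attenuation_db(freq_hz, fa_hz):
    """Approximate extra spectral attenuation due to the return phase."""
    return 10.0 * math.log10(1.0 + (freq_hz / fa_hz) ** 2)

# Hypothetical return-phase time constants: a sharp and a more gradual closure.
for ta_ms in (0.1, 0.4):
    fa = cutoff_frequency_hz(ta_ms / 1000.0)
    print(ta_ms, round(fa), round(attenuation_db(3000.0, fa), 1))
```

At f = Fa the attenuation is 3 dB, and well above Fa it grows by about 6 dB per octave, so a longer ta (lower Fa) translates directly into a steeper source spectrum in the region of the higher formants.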

2.3. Spectral analysis of the speech waveform

Changes in formant bandwidths are one important aspect of the phenomena we set out to describe here. Unfortunately, the techniques outlined in Section 2.2 often do not yield consistent or reliable measures of bandwidth. When the glottis is open or opening, bandwidths are typically very large. As formant oscillations are damped out almost immediately, it is difficult to quantify the decay precisely, either in the time or the frequency domain, and even fairly large variations in the bandwidth settings have little effect on the inverse filter output. We therefore adopted the more indirect strategy of measuring formant levels as a way of capturing the effect of changes in bandwidth. From spectral sections of the speech waveform, calculated every 10 ms, the level of the first harmonic (the level of F0, termed L0) and the levels of the first four formants (the levels of F1-F4, termed L1-L4) were measured.

One might note here that the level of a formant is not only determined by its bandwidth and the overall formant frequency pattern, but also by the following source characteristics (cf., Fig. 5b):

• The excitation strength, Ee, affects the overall level of the source spectrum, and thus the overall level of the formants.

• The degree of dynamic leakage, ra, affects the tilt of the source spectrum, and thereby the level of the higher formants in particular.

• The level of a formant is affected if the formant coincides with, or is in the vicinity of, a zero in the source function.*

The effect of zeros in the source function is difficult to predict and to estimate accurately, and is likely to be fairly small in any case. We therefore did not try to compensate for such effects in this study, as we feared that such an attempt might obscure rather than illuminate the data. The internal consistency of the measured formant levels lends some support for this way of proceeding.

In our data we did not compensate either for formant frequency variation in measuring formant levels. This was because we felt the variation to be minimized: the vowels we were looking at were of a rather steady quality which was similar across languages.

* Additionally, noise excitation, subglottal poles/zeros, and source/filter interaction may also affect the level of a formant.

3. Speech materials

The word list contained nonsense words of the form 'C1V1C2V2, read in the following carrier frames

Hal sa ---- igen (Swedish)

Dis lui ---- donc (French)

Say ---- again (English)

so that the main sentence stress fell on the first syllable of the word, which in the case of Swedish had grave accent. V1, the object of our study, had a front open quality /a/ in Swedish and French, and the slightly higher /æ/ in English. The unstressed second syllable had the vowel /a/, /e/, and /ə/ in Swedish, French, and English respectively. For C1 and C2, /p, b, v, f/ occurred in all permutations, thus generating a list with 16 different utterances. C2 was a long consonant in Swedish, this being phonologically the only possibility following a short vowel.
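The size of the word list follows from pairing each of the four consonants in C1 position with each of the four in C2 position (4 x 4 = 16, with repetition allowed). The short sketch below enumerates the combinations for the Swedish frame; the plain-text transcription with ":" marking the long C2 is an assumption of this illustration.

```python
from itertools import product

consonants = ["p", "b", "v", "f"]

# Swedish pattern 'C1 a C2: a, where C2 is long after the short stressed vowel.
swedish_words = [f"{c1}a{c2}:a" for c1, c2 in product(consonants, repeat=2)]

print(len(swedish_words))  # 16
print(swedish_words[:4])   # ['pap:a', 'pab:a', 'pav:a', 'paf:a']
```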

In all, 11 Swedish speakers (4 female and 7 male), 5 French speakers (2 female and 3 male), and 6 English speakers (2 female and 4 male) were recorded. The recordings were of two kinds: one used an airflow mask and microphone (3 repetitions of list), the other was simply a high-quality microphone recording (3 repetitions), used for the inverse filtering and spectral measurements. The inverse filtering and matching procedures are laborious and very time consuming, and this necessarily meant that only a limited number of utterances could be thus treated. However, as inferences regarding glottal activity can readily be made from airflow recordings (see Introduction and Fig. 1), the fairly extensive airflow recordings were used to get as general a picture as possible for each of the languages studied and to ensure that the inverse filtered utterances were representative for any particular individual or language. The following, more limited materials were analyzed according to the procedures described in Sections 2.2 and 2.3:

Swedish: /pap:a/, /pab:a/, /bap:a/, /bab:a/, /baf:a/, /bav:a/; 4 speakers (2 female and 2 male)

French: /pape/, /pabe/, /bape/, /babe/, /bafe/, /bave/; 4 speakers (2 female and 2 male)

English: /pæpə/, /pæbə/, /bæpə/, /bæbə/, /bæfə/, /bævə/; 4 speakers (1 female and 3 male)

4. Results

Cross language differences did show up in the voice offset patterns particularly, and so we will concentrate mainly on offsets, dealing more briefly later with onsets. The observed differences relate directly to the fact that the timing of glottal abduction for a following voiceless consonant, traditionally assumed similar, varies considerably. The majority of our Swedish speakers showed an overwhelming tendency towards early glottal abduction prior to the consonantal closing gesture, a pattern strikingly different from that of our French speakers for whom the glottal abduction and oral closure gestures are much more tightly synchronized. Our English speakers happened to fall into two groups; one group showed early glottal abduction as was the tendency for the Swedes, and the other exhibited a more French-like pattern.

Fig. 6. Oral airflow and speech waveform for Swedish words illustrating contrasting medial /p:/ and /b:/. Two speakers (male, OE; female, KB).

Although the data we recorded and analyzed included fricatives as well as stops, in our presentation of results we will concentrate on the latter. This is because segmenting for the beginning and end of the fricative is more difficult as these are less sharply defined events, and this makes alignment and comparison of the voiced/voiceless contexts more problematical. However, insofar as can be ascertained, the patterns we describe for the stops are representative for the fricatives also. Where this was not the case, it is pointed out in the text below.

4.1. Swedish

As mentioned above, early glottal abduction before oral closure for a following voiceless consonant was a striking characteristic of our Swedish data.* Fig. 6 shows airflow recordings of words illustrating intervocalic /p:/ and /b:/ for a male and a female speaker. Early glottal abduction in the course of the first vowel is evidenced by the sharp rise in airflow rate prior to the stop closure for /p:/, where flow rate drops to zero. Simultaneous with the rise in airflow rate there is a drop in the amplitude of the speech waveform. (This pattern is very reminiscent of that of the preaspirated stops of Icelandic discussed in the Introduction, see Fig. 1). The phonatory mode of the vowel is thus increasingly breathy voiced as it approaches the voiceless geminate. For some speakers (e.g., the second, lower recording in Fig. 6) the voicing ceases completely before the oral closure and there is a short interval of truly voiceless aspiration. A very brief voiceless interval preceding closure in Swedish stops was also pointed out in Karlsson & Nord (1970).

Fig. 7 is an illustration of the Ee and ra parameter values for the utterances /pab:a/ and /pap:a/. Values for /pap:a/ (dashed lines) have been superimposed on those of /pab:a/ (solid lines) and these are aligned to the oral closure of the medial stop (vertical solid line). The two vertical dashed lines represent the oral release of /b:/ and /p:/.

There is a rather striking difference in the duration of the first vowel. As commonly occurs in languages, the vowel is considerably shorter preceding the voiceless consonant. But quite apart from the durational difference, it is further noticeable that Ee weakens very early for the vowel in /pap:a/, so that for much of its duration, it has only weak excitation compared to /pab:a/. Concomitant with the drop in Ee there is a rise in ra indicating a more rounded closing section of the glottal pulse and a much greater attenuation of higher frequencies for this portion of the vowel. In contrast, not only is the vowel of /pab:a/ longer, but there is strong excitation throughout, which continues at a reduced level through the /b:/ closure. Although we will deal with voice onsets at a later stage, note for the moment that the onset of the second vowel shows strong excitation, Ee, and low ra values regardless of whether the preceding consonant is voiced or voiceless. Note also the rapid rise in Ee following the initial voiceless consonant in both utterances: the vowel reaches its full excitation strength within 10 ms. It therefore seems that the voiceless consonant affects the phonatory characteristics of the preceding vowel much more than the following vowel.

* Of the 11 Swedish speakers recorded, all had a tendency towards early glottal abduction. Naturally, the degree to which this was present varied across speakers. 8 (4 female and 4 male) had very early abduction such as is illustrated in Fig. 6, whereas for 3 cases (males) this effect was less extensive. Nevertheless, even these three showed considerably earlier abduction than did the French or the English Type II speakers (see below).


The parameter rk (not included in figure) is generally inversely correlated with the excitation strength, Ee. Therefore, for the interval of the first vowel in /pap:a/ for which Ee is falling, rk values are rising, showing that glottal pulses are increasingly symmetrical. The inverse correlation of the two parameters held generally across contexts and languages.*

Bandwidth differences emerged as a function of an adjacent voiced or voiceless consonant, bandwidths being much wider in the latter case. However, as mentioned above in Section 2.3, when the bandwidth is very wide it becomes difficult to measure accurately. We therefore do not present bandwidth measurements as such, but looked rather at the formant levels.

Fig. 8 shows the relative levels of F0, F1, F2, F3 and F4 in the first vowel of the utterances /bab:a/ and /bap:a/ as measured from spectra at 10 ms intervals. Note the sharp drop in formant levels, particularly for L1 and L2, as a function of the following voiceless consonant. L0 remains relatively constant and therefore in this environment becomes relatively dominant.

The greater attenuation of the lower than the higher formants might appear surprising at first glance. From the source data illustrated in Fig. 7, the rise in ra prior to the voiceless stop might have led one to expect the higher formants to have the greatest attenuation. The disparity is explained, however, by the fact that, when the vocal folds abduct, the increased coupling to the subglottal system affects mainly the bandwidths of the lower formants (see Fant, 1960, p. 137). Furthermore, as recently pointed out by Klatt (1987), this particular phonatory mode often involves noise excitation of the higher formants, and this contributes to the overall level in this frequency region. Even if the noise itself had an essentially flat spectrum so that all formants were excited by it, it would still attain greater salience in the high frequency regions, precisely because, with high ra, the harmonic components are weak for this part of the source spectrum. (See also Holmberg, Hillman, & Perkell, 1988 for some similar argumentation relating to the perceived breathiness of female voices). The noise source is likely to be completely masked by the harmonic components in the lower part of the spectrum.

The comparative constancy of Lo is further illustrated in Fig. 9, which shows for the first vowel of the words /bab:a/ and /bap:a/ respectively: the speech waveform (Osc), the differentiated glottal flow (E(t)), the true glottal flow (Ug(t)), Ee (here in dB), the peak flow of the glottal pulse (Up), and Lo. Note how, for the vowel in /bap:a/, in spite of the decrease in amplitude of the excitation spike, Ee, the peak flow and the total volume of the flow pulse remain relatively constant. It has been pointed out by Fant (1980) and Bickley & Stevens (1986) that the level of the low frequency spectrum is basically proportional to the total volume of the glottal pulse. It is therefore not surprising to find this close correspondence between the glottal flow signal (Ug(t) and Up) and Lo.

Fig. 9. For the first vowel of the Swedish words /bab:a/ and /bap:a/ are shown: Osc = speech waveform; E(t) = differentiated glottal flow; Ug(t) = true glottal flow; Ee = excitation strength (here in dB, for comparability with Lo); Up = peak flow of the glottal pulse; Lo = level of Fo.

* The rg parameter (not included in figure) often lacked a clear patterning. For most subjects it remained either fairly constant, or showed "noisy", apparently arbitrary variation, but within a fairly narrow range. Values for males were typically around 110-120. For a few speakers there was a positive correlation between rg and Ee; however, the range of variation was small, ranging typically for males from 90 to 130. Female values for rg tended to be 10-20% lower than male values. Those speakers who had covariation of rg and Ee showed less clearly the inverse correlation of rk and Ee alluded to above. This is to be expected given the definition of rk and rg. All else being equal, an increase in rg means that the opening branch of the glottal pulse is shorter. This would have the consequence of increasing the symmetry of the pulse, i.e. increasing rk. Thus, when Ee is falling, if rg also drops, it attenuates the extent to which rk increases. Furthermore, given that when Ee decreases, rk increases and rg either increases or remains fairly constant, we can infer that the open quotient, Oq, is negatively correlated with Ee. Generally speaking, these comments regarding rk, rg, and Oq hold for all our cross language data, and for that reason they are not alluded to again in our discussion of the other languages.
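The inference about the open quotient in the footnote on rg can be checked numerically. The sketch below is illustrative only: it assumes the usual LF-model definitions rk = (Te - Tp)/Tp and rg = To/(2Tp), which are given in Section 2.2 rather than here, so that Oq = Te/To = (1 + rk)/(2 rg). The function name is hypothetical, and the input values are the typical [hp] offset values suggested later in Table I.

```python
def open_quotient(rk_percent: float, rg_percent: float) -> float:
    """Open quotient Oq = Te/To from rk and rg given as percentages.

    Assumes the standard LF relations rk = (Te - Tp)/Tp and rg = To/(2*Tp),
    which give Oq = (1 + rk) / (2 * rg). These definitions are an assumption
    here; the paper states them in Section 2.2.
    """
    rk = rk_percent / 100.0
    rg = rg_percent / 100.0
    return (1.0 + rk) / (2.0 * rg)


# Full excitation (rk = 30%, rg = 120%) versus the weak, breathy offset
# before a preaspirated stop (rk = 60%, rg = 100%).
oq_full = open_quotient(30, 120)
oq_offset = open_quotient(60, 100)
print(f"Oq at full Ee: {oq_full:.2f}, Oq at voice termination: {oq_offset:.2f}")
```

Under these assumptions Oq grows as rk rises, and grows further if rg drops at the same time, which is the sense in which Oq is negatively correlated with Ee.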

To illustrate the effects of a preceding as compared to a following consonant, L1 relative to Lo is shown in Fig. 10 for the four contexts /b-b:/, /b-p:/, /p-b:/ and /p-p:/. In the fully voiced environment /b-b:/, L1 remains fairly strong throughout the vowel, peaking somewhat at the beginning. In /b-p:/, L1 is high at the beginning of the vowel, but the following voiceless consonant causes a sharp drop in the F1 level. Note that the effect of the following consonant is very pervasive; the drop in L1 begins very early in the vowel, almost immediately after its onset. In this respect the vowel before a voiceless consonant is very different from that preceding a voiced one. The nature of the initial consonant has some effect too: in /p-b:/ as compared to /b-b:/, note the initial depression of L1.* The effect is less extensive than the effect of a following voiceless consonant. Even so, it is somewhat more than one might have expected, given the rapid rise in Ee noted above for Fig. 7 (more on this in Section 4.4 below). When the vowel is both preceded and followed by voiceless consonants as in /p-p:/, L1 rises only briefly, then drops for the remainder of the vowel. In some instances we have observed for this context, L1 fails to rise, remaining low throughout. At the onset of the vowel the F1 level is approximately 14 dB lower if the preceding consonant is voiceless, and this is true even if one compares at 20 ms, where Ee for voiced and voiceless contexts are roughly similar. At the vowel offset the difference is approximately 21 dB.**

In Fig. 11, F2 level fluctuations relative to the Fo level (Lo) in the different contexts are illustrated in a similar fashion. Note that there is some lowering of L2 in the vicinity of the voiceless consonant; however, the extent is less than the effect observed for F1. This has the consequence that, at the offset of the vowel before a voiceless consonant where L1 is very low, F2 becomes comparatively more dominant, as can be observed in Fig. 8.

The level of stress may be important in determining the degree to which these effects will be present. The vowels we have concentrated on here are all in stressed position, but the offsets of the unstressed vowels in the carrier frame just preceding the stressed word were also analyzed. The airflow traces in Fig. 12a suggest that the offset of /a:/ preceding word initial /p/ is rather different from that of stressed /a/ preceding medial /p:/. The sharp rise in airflow, characteristic of the latter, is largely absent in the former, suggesting that vocal fold abduction begins substantially later. From source parameters for the same utterance in Fig. 12b, we see also that the sharp drop in Ee and rise in ra are much less in evidence. We should note, however, that these differences may not be uniquely attributable to the stress difference. Other variables may also play a role, such as the vowel length and the intervening word boundary (i.e. we are dealing with different positional variants). More data will be needed to allow us to disambiguate between possible factors.

* The initial voiceless consonant is here aspirated. Unaspirated consonants are rather different, see Section 4.4.

** Before the voiceless stop, L1 appears to flatten or even rise slightly in the last few pulses. This is because Lo unavoidably drops prior to devoicing. Compare L1 and Lo in /bap:a/ with values for the same utterance in Fig. 8. The value given for the difference at offset was therefore calculated for the point before this drop in Lo. The apparent rise of L2 for the same context in Fig. 11 is similarly explained.

Fig. 10. L1 relative to Lo for the Swedish vowel /a/ in the contexts /b-b:/, /b-p:/, /p-b:/, and /p-p:/ (Swedish male speaker).

Fig. 11. L2 (dB) relative to Lo for the Swedish vowel /a/ in the contexts /b-b:/, /b-p:/, /p-b:/, and /p-p:/ (Swedish male speaker OE).


4.2. French

Voice offset patterns in French proved to be quite different from those in Swedish. For the voiceless stops, the oral airflow trace showed little or no increase prior to oral closure, compared to the sharp excursion noted for Swedish. Traces for /bape/ and /babe/ in French are illustrated in Fig. 13. Note that strong voicing persists in /bape/ right up to the oral closure for /p/, just as for the medial /b/ in /babe/. One can therefore infer that the glottal abduction gesture for the voiceless stop must occur considerably later in French than in Swedish, and must be more nearly simultaneous with the oral closing gesture. Further evidence of this can be deduced from the fact that intrusive voicing in /p/ was very common for our French speakers, i.e. the voicing persisted for a number of periods after complete oral closure was obtained.

Fig. 13. Oral airflow and speech waveform for the French words /bape/ and /babe/ (French male speaker).

Not surprisingly, vowel source parameters are little different for a following /p/ or /b/, as can be seen in Fig. 14. There is some difference: in this figure we note a slight increase in ra immediately prior to /p/ closure. Occasional higher ra and lower Ee can be attested in this environment; however, such differences are never extensive and are confined to the last pulse or two adjacent to the stop.

As regards spectral levels, L1 and L2 did not appear to be much affected by the voiced/voiceless nature of the adjacent consonants (see Fig. 15, and compare with Figs. 10 and 11). The voiceless consonant sometimes occasioned a slight lowering of F1 and F2 levels at the offset but not at the onset of the vowel. We would emphasize, however, that even in the former case the effect was small and happened only occasionally.

Fig. 14. Superimposed Ee and ra parameter values for the French words /babe/ (solid lines) and /pape/ (dashed lines) aligned to the oral closure of the medial stop (vertical solid line). Oral releases of the stops are represented by dashed vertical lines.

Fig. 15. L1 and L2 relative to Lo for the French vowel /a/ in the contexts /b-b/, /b-p/, /p-b/, and /p-p/.

These comments regarding French are relevant to the stops. Vowel offset characteristics for the voiceless fricative were often rather different and much more like those for the Swedish voiceless stops and fricatives. Fig. 16 shows airflow for the word /bafe/. Note the very sharp rise in airflow and the drop in voicing level of the vowel prior to the /f/ segment.

Fig. 16. Oral airflow (unfiltered) and speech waveform for the French word /bafe/ (French male speaker). Compare to /bape/ in Fig. 13.

4.3. English

Our speakers of English exhibited two distinct trends. Of the four for whom detailed source characteristics were studied, two subjects patterned very like the Swedish (referred to as Type I), and two had realizations much closer to the French (referred to as Type II).* Fig. 17 illustrates airflow traces of the words /bæpa/ and /bæba/ for one speaker of each type. Note the extensive airflow peak before the /p/ closure in /bæpa/ for Type I, along with much reduction in the voicing level. There is very little such effect for Type II, although slightly more than for the French speakers.

Fig. 17. Oral airflow (unfiltered and filtered) and speech waveform for the English words /bæpa/ and /bæba/ for two male speakers: one exhibiting early glottal abduction like the Swedish speakers (Type I, speaker RJ), and one exhibiting a pattern similar to the French speakers with more synchronized timing of glottal and oral gestures (Type II, speaker AB).

* We had recordings for two more subjects (1 male, 1 female) for whom detailed analysis of source characteristics was not carried out (see Section 3). From the airflow records it was, however, clear that these two patterned with our Type I group here, having very early glottal abduction.

Source characteristics for the two types are illustrated in Fig. 18. As might be expected, there is extensive weakening of Ee and an increase of ra prior to /p/ closure for Type I, indicating both an overall reduction in the level of voicing and attenuation of the higher frequencies in the course of the vowel. Some differences in ra can be noted for Type II speakers as well; even so, they do not affect most of the vowel, showing up rather in the last glottal pulse or pulses prior to closure.

Fig. 18. Superimposed Ee and ra parameter values for the first vowel in the English words /bæba/ (solid lines) and /bæpa/ (dashed lines) aligned to the oral closure of the medial stop (vertical line). Type I and Type II speakers illustrated as in Fig. 17.


Spectral measures of the formant levels also showed basically the two types of patterns discussed above for Swedish and French.

The question arises as to why there should be two such distinct patterns for our English speakers. From such a limited sample, it is impossible to say with any certainty whether this might reflect a dialect difference, or simply the range of idiolectal variation that might be found in a single dialect. All four spoke something approximating to R.P. (Received Pronunciation), although all four had northern origins. The two of Type I, although their accents had been modified towards R.P., still betrayed strong northern influences in their speech.** Furthermore, unlike Type II speakers, they showed very little tendency to glottalize their voiceless stops in environments where this would be expected for R.P. This last was ascertained by listening to separate recordings of conversational materials from these informants. Therefore, we tentatively conclude that we are here dealing with a dialect difference. We would further speculate that one might be most likely to find breathy offsets to the vowel before the voiceless consonant in those dialects of English which have the least tendency towards glottalization of voiceless stops.

4.4. Onsets and offsets

Until now we have been concerned almost exclusively with voice offsets, as these revealed the most striking differences for the languages we were investigating. In comparison, onsets appeared to be much less affected by the nature of the preceding consonant, and there was therefore less to report in terms of cross language differences.

It was noted at vowel offset that Ee weakens gradually when glottal abduction for the voiceless consonant substantially precedes oral closure. At onset, Ee typically rises very sharply following voiceless as well as voiced consonants, as can be observed, for example, in Fig. 19. To facilitate comparison of Ee rise time in voiceless (here aspirated) and voiced contexts, the value for the last /b/ pulse is also shown here, left of time point 0, i.e. the first vowel pulse. For both types of stop, there is typically strong excitation by the second or third vowel pulse. Note also in Fig. 7 the similarity of onsets following the medial voiced and voiceless stops. Both of these figures serve to illustrate for /pap:a/ (same utterance and speaker, but different repetitions) the striking difference in onset and offset of Ee in the vicinity of the voiceless consonant. Furthermore, a rapid rise in Ee at onset was generally typical regardless of which type of voiceless stop preceded: the voiceless aspirated stop as in Swedish and English (for example, Fig. 7), the voiceless unaspirated stop as in French (Fig. 14), or indeed the phonologically voiced but phonetically voiceless stop as in English #CV (Fig. 18).

Despite the striking similarities, there were some differences. Voiced and voiceless unaspirated stops yielded essentially similar onsets insofar as we were able to ascertain from our measures (see for example the French data in Figs. 14 and 15), but the voiceless aspirated stops differed in a few respects from the others. First of all, some instances of a more gradual rise in Ee were found following aspirated stops in English. They were comparatively infrequent but attested for speakers of Type I and Type II. As there was no particular linguistic environment with which such occurrences could be correlated, one must conclude that there is some free variation with regard to this feature, for our English speakers at least.

Fig. 19. Correlation of spectral measures (L1 relative to Lo) with source parameters Ee, ra, and Fa for the Swedish vowel /a/ in the contexts /b-b:/, /b-p:/, /p-b:/, and /p-p:/ (Swedish male speaker OE).

** The two Type I speakers for whom detailed analysis was not carried out also spoke R.P., but with a perhaps even stronger northern influence.

Further differences relating to onsets following aspirated stops are illustrated in Fig. 19, where we correlate spectral measures of L1 versus Lo (Swedish data) with source measures of Ee and ra. In the lower part of this figure, Fa (calculated from ra, see formula (5) in Section 2.2) is also plotted, showing the cutoff frequency for the extra downward tilt of the source spectrum. Note that for the vowel onset after the voiceless (aspirated) stops, although there is full Ee by the second pulse, L1 is still low and rises more gradually. Note further that the dynamic leakage, ra, is somewhat higher than it would be at the onset following a voiced stop. From Fa, we see that at this same point in time (the second pulse) the cutoff frequency for the increased spectral tilt is 400-600 Hz, which is lower than the F1 frequency of /a/ (here ca. 600 Hz), and from this we can estimate that F1 should be attenuated by approximately 3-5 dB (see formula (6) in Section 2.2). The increased source slope thus accounts for some but not all of the observed L1 lowering, which is here about 14 dB. A substantial part of the lowering must therefore be attributable to a wider B1 due to higher coupling to the subglottal system.
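The 3-5 dB estimate can be reproduced with a short calculation. Formulas (5) and (6) belong to Section 2.2 and are not repeated here, so the sketch below assumes their usual LF-model form: Fa = Fo/(2π·ra), and an extra first-order low-pass attenuation of 10·log10(1 + (f/Fa)²) dB at frequency f. The function names are hypothetical. This assumed form agrees with two statements in the text: a cutoff of 400-600 Hz gives roughly 3-5 dB at an F1 of 600 Hz, and ra = 16% places Fa close to Fo.

```python
import math


def fa_from_ra(ra_percent: float, f0_hz: float) -> float:
    """Cutoff frequency Fa (Hz) of the extra spectral tilt.

    Assumed form of formula (5): Fa = Fo / (2 * pi * ra), with ra = Ta/To
    expressed as a fraction of the fundamental period.
    """
    return f0_hz / (2.0 * math.pi * ra_percent / 100.0)


def tilt_attenuation_db(freq_hz: float, fa_hz: float) -> float:
    """Extra source attenuation (dB) at freq_hz for a cutoff at fa_hz.

    Assumed form of formula (6): a first-order low-pass,
    10 * log10(1 + (f / Fa)^2).
    """
    return 10.0 * math.log10(1.0 + (freq_hz / fa_hz) ** 2)


F1_HZ = 600.0  # F1 of /a/ as quoted in the text
for fa in (400.0, 600.0):
    print(f"Fa = {fa:.0f} Hz -> F1 attenuated by {tilt_attenuation_db(F1_HZ, fa):.1f} dB")
```

With an assumed male Fo of 120 Hz, `fa_from_ra(2, 120)` comes out at roughly 950 Hz, in line with the cutoff of approximately 1 kHz reported for the offset before a voiced consonant.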

At the offset of the vowel, the observed L1 lowering would seem attributable to a number of factors. Ee is weaker and this will partially explain the lower L1. Damping is of course another factor, as the vocal folds have begun to abduct early on. Finally, note from Fa that there is a much lower cutoff for the additional spectral tilt during the offset than during the onset of the vowel. For the offset, this extra attenuation begins just above Fo, and thus would cause substantial additional attenuation of F1. Note that in comparison, for the offset of the vowel preceding the voiced consonant, the attenuation of the higher frequencies begins at approximately 1 kHz, well above the F1 frequency.

5. Discussion and Summary

The most striking differences to emerge across languages were in the offset of the vowel preceding voiceless stops. These differences were ultimately the consequence of differences in the timing of glottal abduction relative to stop closure. In this respect Swedish tended towards very early glottal abduction, and therefore exhibited extensive breathy voiced offsets of the vowel in this context. This was in marked contrast to the French data, where glottal abduction and oral closure are much more closely synchronized. Our English data showed both types of patterns.

For the cases where there is early glottal abduction, the voice source of the vowel is characterized by a marked but gradual decrease in Ee (i.e., overall strength) along with a rising ra, showing greater rounding of the corner of the glottal pulse and therefore an increasing spectral tilt. As Ee decreases, rk rises, showing that the glottal pulse becomes increasingly symmetrical in the course of the vowel. On the whole, rk seems to be fairly closely and inversely correlated with Ee. The spectral measures of the vowel show for this same interval an extensive attenuation in the F1 region. This would appear to be partly due to B1 widening, but also a consequence of the very low Fa and Ee. The termination and indeed much of the vowel is thus strikingly different from the vowel preceding the voiced consonant.

When, as in French, glottal and oral gestures are more synchronized for the postvocalic voiceless stop, the effects above are absent or only minimally present. In these cases, the vowel termination is very similar to that preceding the voiced consonant.

Vowel onsets differ less as a function of the nature of a preceding consonant. The typical onset pattern for Ee was a sharp rise following voiced, voiceless unaspirated, and voiceless aspirated stops. Some, though lesser, differences did emerge, however, between voiceless aspirated contexts on the one hand and voiced or voiceless unaspirated on the other. In the former case, although Ee normally rises very sharply, the attenuation in the F1 region continues for some time, together with a frequently somewhat high ra. Some differences in quality at onsets may therefore also characterize the phonological opposition in English and Swedish, which have postaspirated stops, but not in French. The fairly high airflow rate for the first few glottal pulses following the postaspirated stop (see Fig. 12a), together with the high ra and the damping of F1, suggest that the glottis is not fully/maximally adducting. We conclude therefore that the observed source differences are directly attributable to the glottal state at the instant where voice (or the vowel) onsets, being still abducted following postaspirated stops, but more or less adducted following unaspirated or voiced stops. Support for this suggestion comes from the fact that source characteristics for onsets following a voiceless fricative are similar to those following the postaspirated stop. Here also the vocal folds are coming together from an abducted state, and the rate of airflow through them is high. What remains puzzling, however, is the fact that Ee tends to have full amplitude almost immediately at onset, even though the vocal folds are not yet fully adducted, clearly different from the situation at offset where the gradual decline of Ee directly mirrors the abducting vocal folds. It clearly must have to do with some interaction of the high airflow, Bernoulli forces, and the tension of the laryngeal muscles during adduction. A fuller understanding of the first few onset cycles may require more direct techniques to observe vocal fold behaviour.

There is clearly a basic asymmetry between voice onset and offset, at least when the latter (like the former) occurs with an unoccluded vocal tract, both of which are exemplified in the first vowel of Swedish /pap:a/, following the initial postaspirated and preceding the medial (somewhat preaspirated) stops. At the level of the underlying laryngeal gesture, the onset and offset here present roughly mirror image events, occurring during the closing/opening of the vocal folds. However, as a glance at Fig. 19 will show, their acoustic consequences are not mirror images: onset is not offset in reverse. The differences we have described here, we would expect to be representative also for onsets and offsets in #V- and -V# respectively, to which the same basic conditions would pertain. We are clearly excluding from consideration here cases like glottalized stops or vowels with a hard onset, where voice offsets/onsets to/from an adducted rather than an abducted glottal state.

There may be considerable modulations in voice quality, even within a stretch of speech which traditionally would have been described as having modal phonation. And to answer the speculative question of our introduction, it looks as though voice quality correlates of the phonological voicing opposition may occur even in such


voicing oppositions in the light of the hypothesis proposed in Ní Chasaide (1985; see also further elaboration in Ní Chasaide, forthcoming). This hypothesis suggests that, rather than the integration of numerous and essentially separate cues, the binary linguistic percept is based on the relative weighting of two grosser components of the signal over a syllable-sized interval. These two components or "cues" are V+ (plus voice, i.e., periodic energy especially concentrated in the lower frequency region) and V- (minus voice, i.e., aperiodic energy, typically in higher frequency regions, and silence). It is not possible to elaborate here, but it is clear that, for example, a breathy voiced offset to the vowel need not be viewed as a separately acting cue, but rather as a way of enhancing the "voiceless" percept by affecting the V+/V- balance of the syllable: reducing V+ (weak Ee, low Fa, and F1 damping) and enhancing V- (a higher noise component).

Quite apart from the question of whether/how the differences described here might contribute to the linguistic opposition, they might still be important as one of the determinants of natural sounding synthesized speech. Might the incorporation of more detailed rules for onsets and particularly offsets yield more natural sounding synthesis? Might the difference in offset characteristics be one of the factors that makes Swedes sound like Swedes and different from the French?

Perceptual testing will be needed to establish whether and to what degree the production effects described contribute to the listener's linguistic percepts and to his/her judgements of naturalness in synthesized speech. In the meantime, Table I presents typical values based on our findings for the relevant parameters at onset and offset for the contexts described in this paper, which might be suitable for synthesis. These values are given according to the phonetic categories [ph, p, b]. At offset we have used [hp] to designate those voiceless stops which have very early glottal abduction as in Swedish and English (Type I). Ee is specified here in decibels; ra, rk, and rg values are given as percentages.

For vowel onset after [ph] three time points are specified: the onset of voice, the point at which full Ee is attained, and the point at which L1 reaches its full value. For [p, b] two time points - voice/vowel onset and full Ee/L1 - are sufficient, but we have also given values for the last glottal pulse of a preceding [b]. At voice offset two time points are specified for each phonetic category. The first of these is the point at which Ee begins to drop, as a consequence of glottal abduction for [hp], and of the oral closing gesture for [p] and [b]. The second time point for [b] is the first glottal pulse following oral closure; for [hp] and [p], it is the termination of voicing. Note that for the latter, Ee is set at -18 dB; after that point, the voice source should be shut off.

At offset and onset, rg values are assumed to covary with Ee, but within a limited range. Alternatively, rg could be left constant at about 120. Fa can be calculated directly from ra, given any Fo value. Where in Table I B1=F1, this means that the bandwidth has the same value as the formant frequency, i.e. that the formant is critically damped. The abbreviation STD means the standard synthesis value for a particular sound.


Table I. Parameter values for vowel onset/offset in different consonantal environments. For further explanations, see text.

VOWEL ONSET: parameter settings
(Time in ms in relation to voice/vowel onset)

Preceding consonant   [ph]                                        [p, b]
Time points           Voice onset    Full Ee        Full L1       Last [b] pulse   Voice/vowel onset   Full Ee/L1
Time (ms)             0              10             30            -10              0                   10
Ee (dB)               -6             0              0             -15              -6                  0
ra (%)                7              [illegible]    2             10               5                   2
rk (%)                60             30             30            60               60                  30
rg (%)                100            120            120           100              100                 120
B1 (Hz)               F1             [illegible]    STD           STD              2·STD               STD
B2 (Hz)               3·STD          STD            STD           STD              2·STD               STD
Aspiration noise      See text

VOWEL OFFSET: parameter settings
(Time in ms with reference to voice termination or first [b] pulse)

Following consonant   [hp]                               [p]                                [b]
Time points           Full Ee    Voice termination       Full Ee    Voice termination       Full Ee        First [b] pulse
Time (ms)             -60        0                       -20        0                       [illegible]    0
Ee (dB)               0          -18                     0          -18                     0              -8
ra (%)                2          12                      2          8                       2              5
rk (%)                30         60                      30         50                      30             40
rg (%)                120        100                     120        100                     120            100
B1 (Hz)               STD        F1                      STD        2·STD                   STD            2·STD
B2 (Hz)               STD        [illegible]·STD         STD        2·STD                   STD            2·STD
Aspiration noise      See text
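As an illustration of how the Table I values might drive a synthesizer, the following sketch interpolates the source parameters between the two time points given for the vowel offset before [hp]. The breakpoint values are taken from the table; the linear interpolation between them, the clamping outside the interval, and all names (`HP_OFFSET`, `interpolate_track`) are assumptions made for this example, since the text does not specify the shape of the transition.

```python
# Breakpoints for the vowel offset before [hp], from Table I.
# Each pair is (value when Ee begins to drop, value at voice termination).
HP_OFFSET = {
    "time_ms": (-60.0, 0.0),
    "Ee_dB": (0.0, -18.0),
    "ra_pct": (2.0, 12.0),
    "rk_pct": (30.0, 60.0),
    "rg_pct": (120.0, 100.0),
}


def interpolate_track(table: dict, t_ms: float) -> dict:
    """Return parameter values at time t_ms (ms re. voice termination).

    Linear interpolation between the two breakpoints is an assumption;
    times outside the interval are clamped to the nearest breakpoint.
    """
    t_start, t_end = table["time_ms"]
    frac = (t_ms - t_start) / (t_end - t_start)
    frac = min(max(frac, 0.0), 1.0)
    return {
        name: start + frac * (end - start)
        for name, (start, end) in table.items()
        if name != "time_ms"
    }


# Parameter values halfway through the breathy offset.
print(interpolate_track(HP_OFFSET, -30.0))
```

After the second time point the voice source would be shut off, as stated in the text; that step is not modelled here.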


Aspiration noise is likely to be an important parameter, especially for the offset of the vowel before [hp]. As mentioned earlier, we have no precise values, but as a first approximation, we suggest that the noise source is increased from the point where Ee begins to drop, to a level at voice termination equal to that of postaspiration. Furthermore, as some speakers exhibit for these stops a short interval of truly voiceless aspiration prior to oral closure, one might optionally leave the aspiration noise on at this level for 20-30 ms following voice termination. As regards vowel onset after [ph], we suggest that the noise source is set to decay over the same 30 ms as it takes for full L1 to be attained.

The values in Table I are for the male voice. The most important difference for the female voice is the higher ra; this, we suggest, might be 2-4 times the male value, but should never exceed 16% (ra=16% is the same as Fa being approximately equal to Fo). Female rg values should be 10-20% lower than the male, and rk values tend to be either the same or slightly higher, i.e. male values could be multiplied by a factor of 1 to 1.5.
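The male-to-female adjustments just described can be written as a small conversion routine. This is a sketch under stated assumptions: the function name and the default factors (3 for ra, 1 for rk, a 15% reduction for rg) are hypothetical mid-range picks from the ranges given in the text, not values recommended by it. Only the ranges themselves and the 16% ceiling on ra come from the text.

```python
def female_from_male(ra_pct, rk_pct, rg_pct,
                     ra_factor=3.0, rk_factor=1.0, rg_reduction=0.15):
    """Derive female source parameters from male Table I values.

    Ranges from the text: ra 2-4 times the male value but never above 16%,
    rg 10-20% lower, rk multiplied by 1 to 1.5. The default factors are
    illustrative assumptions within those ranges.
    """
    if not 2.0 <= ra_factor <= 4.0:
        raise ValueError("ra_factor should lie between 2 and 4")
    if not 1.0 <= rk_factor <= 1.5:
        raise ValueError("rk_factor should lie between 1 and 1.5")
    if not 0.10 <= rg_reduction <= 0.20:
        raise ValueError("rg_reduction should lie between 0.10 and 0.20")
    return {
        "ra_pct": min(ra_pct * ra_factor, 16.0),  # ceiling: Fa roughly equal to Fo
        "rk_pct": rk_pct * rk_factor,
        "rg_pct": rg_pct * (1.0 - rg_reduction),
    }


# Male full-excitation values from Table I (ra = 2%, rk = 30%, rg = 120%).
print(female_from_male(2.0, 30.0, 120.0))
```

The ceiling matters mainly at breathy offsets: a male ra of 12% multiplied by any factor in the suggested range would exceed 16% and is therefore clipped.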

Acknowledgements

This work was in part supported by the Swedish Telecom Administration (Televerket) and by the Swedish Institute (Svenska institutet), both of which are gratefully acknowledged here. Most of the analysis was carried out at the Department of Speech Communication and Music Acoustics, and we are deeply indebted to Prof. Gunnar Fant for his helpful comments and advice, and to Gudrun Tannergård for making the figures. We also gratefully acknowledge the Phonetics Laboratory of the Institute of Linguistics, Stockholm University, where the recordings for this study were made, and the Speech Group of the Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, where much of the spectral analysis was carried out.

References

Ananthapadmanabha, T.V. (1984): "Acoustic analysis of voice source dynamics", STL-QPSR 2-3/1984, pp. 1-24.

Ananthapadmanabha, T.V. & Fant, G. (1982): "Calculation of true glottal flow and its components", Speech Communication 1, pp. 167-184; also in STL-QPSR 1/1982, pp. 1-30.

Bickley, C.A. & Stevens, K.N. (1986): "Effects of a vocal-tract constriction on the glottal source: experimental and modelling studies", J. of Phonetics 14, No. 3/4, pp. 373-382.

Carlson, R., Granström, B., & Hunnicutt, S. (1981): "A multi-language text-to-speech module", STL-QPSR 4/1981, pp. 18-28.

Cranen, B. & Boves, L. (1985): "Pressure measurements during speech production using semiconductor miniature pressure transducers: impact on models for speech production", J.Acoust.Soc.Am. 77:4, pp. 1543-1551.

Fant, G. (1960): The Acoustic Theory of Speech Production, Mouton, Hague (2nd edition 1970).

Fant, G. (1979a): "Glottal source and excitation analysis", STL-QPSR 1/1979, pp. 85-107.

Fant, G. (1979b): "Vocal source analysis - a progress report", STL-QPSR 3-4/1979, pp. 31-54.

Fant, G. (1980): "Voice source dynamics", STL-QPSR 2-3/1980, pp. 17-37.

Fant, G. (1981): "The source filter concept in voice production", STL-QPSR 1/1981, pp. 21-37.

Fant, G. (1982): "The voice source - acoustic modeling", STL-QPSR 4/1982, pp. 28-48.

Fant, G. (1987): "Interactive phenomena in speech production", pp. 376-381 in Proc. XIth ICPhS, Tallinn, Estonia, USSR, Aug. 1987, Vol. 3, Estonian Academy of Sciences.

Fant, G. & Ananthapadmanabha, T.V. (1982): "Truncation and superposition", STL-QPSR 2-3/1982, pp. 1-17.

Fant, G. & Lin, Q. (1987): "Glottal source - vocal tract acoustic interaction", STL-QPSR 1/1987, pp. 13-27.

Fant, G., Liljencrants, J., & Lin, Q. (1985a): "A four-parameter model of glottal flow", French-Swedish Seminar on Speech, Grenoble, April 1985; also in STL-QPSR 4/1985, pp. 1-13.

Fant, G., Lin, Q., & Gobl, C. (1985b): "Notes on glottal flow interaction", STL-QPSR 2-3/1985, pp. 21-45.

Flanagan, J.L. (1972): Speech Analysis Synthesis and Perception, Springer-Verlag, Berlin, Heidelberg, New York.

Gobl, C. (1985): "Röstkällans variation i tal", unpublished thesis work.

Gobl, C. (1988): "Voice source dynamics in connected speech", STL-QPSR 1/1988, pp. 123-159.

Holmberg, E.B., Hillman, R.E., & Perkell, J.S. (1988): "Glottal air flow and pressure measurements for loudness variation by male and female speakers", J.Acoust.Soc.Am. 84, pp. 511-529.

Karlsson, I. & Nord, L. (1970): "A new method of recording occlusion applied to the study of Swedish stops", STL-QPSR 2-3/1970, pp. 8-18.

Klatt, D.H. (1987): "Acoustic correlates of breathiness: first harmonic amplitude, turbulence noise, and tracheal coupling", J.Acoust.Soc.Am. 82, S91(A).

Klatt, D.H. & Klatt, L.C. (forthcoming): "Voice quality variations within and across female and male talkers: implications for speech analysis, synthesis and perception", forthcoming in J.Acoust.Soc.Am.

Lin, Q. (1987): "Nonlinear interaction in voice production", STL-QPSR 1/1987, pp. 1-12.

Maddieson, I. (1984): Patterns of Sounds, Cambridge University Press.

Ní Chasaide, A. (1985): "Preaspiration in phonological stop contrasts", unpublished Ph.D. thesis, University College of North Wales, Bangor, March 1985.

Ní Chasaide, A. (forthcoming): "The perception of preaspirated stops and a proposed account of voicing perception", forthcoming.
