Stress-Accent and Vowel Quality in The Switchboard Corpus Steven Greenberg and Leah Hitchcock International Computer Science Institute 1947 Center Street, Berkeley, CA 94704 http://www.icsi.berkeley.edu/~steveng NIST Workshop on Large Vocabulary Continuous Speech Recognition Maritime Institute of Technology, May 4, 2001

Page 1: Stress-Accent and Vowel Quality in The Switchboard Corpus

Steven Greenberg and Leah Hitchcock
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
http://www.icsi.berkeley.edu/~steveng

NIST Workshop on Large Vocabulary Continuous Speech Recognition
Maritime Institute of Technology, May 4, 2001

Page 2: Take Home Messages

• There is an intimate relationship between vocalic identity, nucleic duration and stress accent in spontaneous dialogue (at least in the Switchboard corpus)

• Stressed syllables tend to have significantly longer nuclei than their unstressed counterparts, consistent with the findings reported by Silipo and Greenberg in previous years’ meetings regarding the OGI Stories corpus (telephone monologues)

• Certain vocalic classes exhibit a far greater dynamic range in duration than others
  – Diphthongs tend to be longer than monophthongs, BUT …
  – The low monophthongs ([ae], [aa], [ay], [aw], [ao]) exhibit patterns of duration and dynamic range under stress (accent) similar to diphthongs

• The statistical patterns are consistent with the hypothesis that duration serves under many conditions as either a primary or secondary cue for vowel height (normally associated with the frequency of the first formant)

Page 3: Take Home Messages

• Moreover, the stress-accent system in spontaneous (American) English appears to be closely associated with vocalic identity

• Low vowels are far more likely to be fully stressed than high vowels (with the mid vowels exhibiting an intermediate probability of being stressed)

• Thus, the identity of a vowel cannot be considered independently of stress-accent

• The two parameters are likely to be flip sides of the same Koine

• Although English is not generally considered to be a vowel-quantity language (as is Finnish), given the close relationship between stress-accent and duration, and between duration and vowel quality, there is some sense in which English (and perhaps other stress-accent languages) manifests certain properties of a “quantity” system

• Thus, vowel duration may be an important factor in disambiguating spoken language and therefore should be of interest to the speech recognition community

Page 4: What is (usually) Meant by Prosodic Stress?

• Prosody is supposed to pertain to extra-phonetic cues in the acoustic signal

• The pattern of variation over a sequence of SYLLABLES pertaining to: syllabic DURATION, AMPLITUDE and PITCH (f0) variation over time (but the plot thickens, as we shall see)

Page 5: Why is Prosodic Stress Important?

• It supposedly provides important information about:
  – Focus of the speaker’s attention and emphasis for the listener
  – What is “new” and “important” information
  – Emotional context of the utterance - surprise, sarcasm, shock, delight, anger, impatience, etc.
  – Syntactic disambiguation, particularly at the clausal/sentential level, e.g., interrogative, declarative forms
  – Perceptual processing - parsing the utterance into “chunks” for reliable understanding

• Prosody provides a window onto the higher levels of language
  – Can be useful for developing semantic-oriented models for speech understanding (“Information spotting”)

• Prosody affects pronunciation (and vice versa)
  – Can be useful for modeling pronunciation variation in ASR
  – Phonetic properties may be correlated with prosodic stress - THIS IS THE TOPIC FOR TODAY’S PRESENTATION

Page 6: The Nitty Gritty (a.k.a. the Corpus Material)

• SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS (same as Phoneval-2000)
  – Switchboard contains informal telephone dialogues
  – 54 minutes of material that had previously been phonetically transcribed (by highly trained phonetics students from UC-Berkeley)
  – 45.5 minutes of “pure” speech (filled pauses, junctures filtered out), consisting of: 9,991 words, 13,446 syllables, 33,370 phonetic segments
  – All of this material had been hand-segmented at either the phonetic-segment or syllabic level by the transcribers
  – The syllabic-segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained on 72 minutes of hand-segmented Switchboard material. This automatic segmentation was manually verified
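As a rough sanity check on the corpus figures above, the totals can be turned into simple rates. The sketch below is only illustrative arithmetic on the numbers quoted on the slide. The function name `corpus_rates` and the dictionary keys are made up for this example, and it assumes that the 45.5 minutes of “pure” speech is the right denominator for the per-second rate.

```python
# Corpus totals quoted on the slide.
WORDS = 9_991
SYLLABLES = 13_446
SEGMENTS = 33_370
SPEECH_MINUTES = 45.5  # "pure" speech, filled pauses and junctures removed


def corpus_rates(words, syllables, segments, minutes):
    """Return simple density figures derived from the corpus totals."""
    seconds = minutes * 60.0
    return {
        "syllables_per_word": syllables / words,
        "segments_per_syllable": segments / syllables,
        "syllables_per_second": syllables / seconds,
    }


rates = corpus_rates(WORDS, SYLLABLES, SEGMENTS, SPEECH_MINUTES)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

On these totals the material works out to roughly 1.35 syllables per word, about 2.5 phonetic segments per syllable, and about 4.9 syllables per second of speech.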

Page 7: Evaluation Material Details

[Figure: By Subjective Difficulty. x-axis: Subjective Difficulty (V_Easy, Easy, Medium, Hard, V_Hard)]

[Figure: By Dialect Region. x-axis: Dialect Region (S_Mid, N_Mid, N_East, West, South, NYC, Other); y-axis: Number of Utterances]

• AN EQUAL BALANCE OF MALE AND FEMALE SPEAKERS

• BROAD DISTRIBUTION OF UTTERANCE DURATIONS
  – 2-4 sec - 40%, 4-8 sec - 50%, 8-17 sec - 10% (mean = 4.75 s)

• COVERAGE OF ALL (7) U.S. DIALECT REGIONS IN SWITCHBOARD

• A WIDE RANGE OF DISCUSSION TOPICS

• VARIABILITY IN DIFFICULTY (VERY EASY TO VERY HARD)

Page 8: Manual Transcription of Stress Accent

• 2 UC-Berkeley Linguistics students each transcribed the full 45 minutes of material (i.e., there is 100% overlap between the 2)

• Three levels of stress-accent were marked for each syllabic nucleus
  – Fully stressed (78% concordance between transcribers)
  – Completely unstressed (85% interlabeler agreement)
  – An intermediate level of accent, neither fully stressed nor completely unstressed (ca. 60% concordance)
  – Hence, 95% concordance in terms of some level of stress

• The labels of the two transcribers were averaged
  – In those instances where there was disagreement, the magnitude of disparity was almost always (ca. 90%) one step. Usually, disagreement signaled a genuine ambiguity in stress accent

• The illustrations in this presentation are based solely on those data in which both transcribers concurred (i.e., fully stressed or completely unstressed)

• A table containing the complete set of data is in a paper submitted to Eurospeech (in the workshop notebook)
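The label handling described on this slide (averaging two transcribers, measuring how far apart they are when they disagree, and keeping only the nuclei on which both concur) might be sketched as follows. This is a minimal illustration, not the authors' procedure: the ordinal coding in `RANK`, the function names, and the toy labels are all assumptions, and the slide does not say exactly how its concordance percentages were computed.

```python
# Hypothetical ordinal coding of the three stress-accent levels
# (the slides do not say how the labels were coded numerically).
RANK = {"unstressed": 0, "intermediate": 1, "stressed": 2}


def combine_labels(labels_a, labels_b):
    """Pair up two transcribers' labels for the same syllabic nuclei.

    Returns one dict per nucleus with the averaged rank (0.0 to 2.0),
    the size of the disagreement in steps, and the shared label if any.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both transcribers must label the same nuclei")
    combined = []
    for a, b in zip(labels_a, labels_b):
        ra, rb = RANK[a], RANK[b]
        combined.append({
            "mean_rank": (ra + rb) / 2,
            "step": abs(ra - rb),
            "agree": a == b,
            "label": a if a == b else None,
        })
    return combined


def summarize(combined):
    """Overall agreement and the share of disagreements that are one step."""
    n = len(combined)
    disagreements = [c for c in combined if not c["agree"]]
    one_step = sum(1 for c in disagreements if c["step"] == 1)
    return {
        "agreement": (n - len(disagreements)) / n,
        "one_step_share": one_step / len(disagreements) if disagreements else None,
    }


def concurring_extremes(combined):
    """Keep nuclei both transcribers marked fully stressed or unstressed."""
    return [c for c in combined if c["label"] in ("stressed", "unstressed")]


# Toy labels, invented purely to exercise the functions.
transcriber_a = ["stressed", "unstressed", "intermediate",
                 "stressed", "unstressed", "stressed"]
transcriber_b = ["stressed", "unstressed", "stressed",
                 "intermediate", "unstressed", "unstressed"]

combined = combine_labels(transcriber_a, transcriber_b)
summary = summarize(combined)
kept = concurring_extremes(combined)
print(summary, len(kept))
```

Keeping `mean_rank` alongside the filtered subset mirrors the two uses on the slide: the averaged labels for the full data table, and the concurring extremes for the illustrations.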

Page 9: The “Conventional Wisdom” on Stress-Accent

"Pitch is widely regarded, at least in English, as the most salient determinant of prominence. In other words, when a syllable or word is perceived as 'stressed' or 'emphasized,' it is pitch height or a change in pitch, more than length or loudness that is likely to be mainly responsible (see, for example, Fry 1958, Gimson 1980, pp. 222-226, Lehiste 1976, Fudge, 1984, ch. 1)" Clark, J. and Yallop, C. (1990) An Introduction to Phonetics and Phonology. Oxford, Blackwell, p. 280.

"In fact, although it is clear that stressed syllables often have greater overall acoustic intensity than weakly stressed ones, loudness seems to be the least salient and least consistent of the three parameters of pitch, duration and loudness - at least for purposes such as signaling stress" (ibid, p. 282)

“Thus, according to the ‘general consensus’ the important parameters are (in order) - PITCH, DURATION, LOUDNESS” (the latter most closely correlated with TOTAL ENERGY, i.e., duration x amplitude, cf. further on)

Page 10: OGI Stories - Pitch Doesn’t Cut the Mustard

• Although pitch range is the most important of the f0-related cues, it is not as good a predictor of stress as DURATION

[Figure labels: Duration, Amplitude, Pitch Range, Av. Pitch]

Page 11: Total Energy is the Best Predictor of Stress

• Duration x Amplitude is superior to all other combination pairs of acoustic parameters. Pitch appears redundant with duration.

[Figure labels: Duration x Amplitude, Dur x Pitch Range, Duration, Dur x Pitch Av, Pitch Range x Average, Pitch Av x Amp, Pitch Range x Amp]
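To make the duration x amplitude product concrete, the sketch below computes it per nucleus and scores how well a given parameter separates stressed from unstressed nuclei. The slides do not say how the predictors were compared, so the pairwise concordance score used here is a stand-in chosen for this example, not the authors' method. The function names and the toy values are likewise assumptions; the numbers are invented and demonstrate nothing about the corpus.

```python
def integrated_energy(duration_ms, amplitude):
    """Duration x amplitude, the product the slides call total energy."""
    return duration_ms * amplitude


def pairwise_concordance(stressed_values, unstressed_values):
    """Fraction of (stressed, unstressed) pairs in which the stressed
    nucleus has the larger value; ties count as half."""
    wins = 0.0
    for s in stressed_values:
        for u in unstressed_values:
            if s > u:
                wins += 1.0
            elif s == u:
                wins += 0.5
    return wins / (len(stressed_values) * len(unstressed_values))


# Toy nuclei: (duration in ms, linear amplitude, fully stressed?).
# These values are invented for illustration and are not corpus data.
nuclei = [
    (150, 0.80, True),
    (120, 0.60, True),
    (90, 0.90, True),
    (80, 0.70, False),
    (100, 0.50, False),
    (60, 0.60, False),
]
stressed = [n for n in nuclei if n[2]]
unstressed = [n for n in nuclei if not n[2]]

scores = {
    "duration": pairwise_concordance(
        [n[0] for n in stressed], [n[0] for n in unstressed]),
    "amplitude": pairwise_concordance(
        [n[1] for n in stressed], [n[1] for n in unstressed]),
    "duration_x_amplitude": pairwise_concordance(
        [integrated_energy(n[0], n[1]) for n in stressed],
        [integrated_energy(n[0], n[1]) for n in unstressed]),
}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

A score of 0.5 would mean the parameter carries no information about stress, and 1.0 would mean every stressed nucleus exceeds every unstressed one on that parameter.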

Page 12: A Brief Primer on Vocalic Acoustics

• Vowel quality is generally thought to be a function primarily of two articulatory properties - both related to the motion of the tongue
  – The front-back plane is most closely associated with the second formant frequency (or more precisely F2 - F1) and the volume of the front-cavity resonance
  – The height parameter is closely linked to the frequency of F1

• In the classic vowel “triangle” segments are positioned in terms of the tongue positions associated with their production, as follows:

[Figure: vowel triangle]

Page 13: Duration/Amplitude/Int. Energy - Which?

• There are supposed to be large differences in the “intrinsic” amplitude and duration of vowels

• Could such differences be compensated for in terms of stress?

• Let’s take a closer look!

Page 14: Amplitude Differences - Stressed/Unstressed

• There are very small differences in amplitude between stressed and unstressed nuclei

• The lax monophthongs tend to have a slightly larger dynamic range than diphthongs

Page 15: Durational Differences - Stressed/Unstressed

• There is a large dynamic range in duration between stressed and unstressed nuclei

• Diphthongs and tense, low monophthongs tend to have a larger range than the lax monophthongs

Page 16: Int. Energy Differences - Stressed/Unstressed

• There is a large dynamic range in integrated energy between stressed and unstressed nuclei

• Diphthongs and tense, low monophthongs tend to have a larger range than the lax monophthongs
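The stressed/unstressed “dynamic range” compared across vowel classes on these slides could be computed along the following lines. The slides do not define the term numerically, so this sketch assumes it means the gap between the mean stressed and mean unstressed value, reported both as a difference and as a ratio. The function name, the class names used as keys, and the toy durations are hypothetical and invented for illustration only.

```python
from statistics import mean


def dynamic_range(stressed_values, unstressed_values):
    """Gap between mean stressed and mean unstressed values of one
    measure (duration, amplitude, or integrated energy)."""
    s = mean(stressed_values)
    u = mean(unstressed_values)
    return {"difference": s - u, "ratio": s / u}


# Toy nucleus durations in ms, invented for illustration only.
toy_durations = {
    "diphthong": {"stressed": [180, 220], "unstressed": [90, 110]},
    "lax monophthong": {"stressed": [80, 100], "unstressed": [55, 65]},
}

ranges = {
    vowel_class: dynamic_range(v["stressed"], v["unstressed"])
    for vowel_class, v in toy_durations.items()
}
for vowel_class, r in ranges.items():
    print(vowel_class, r)
```

The same function applies unchanged to amplitude or integrated energy; only the input lists differ.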

Page 17: Spatial Patterning of Duration and Amplitude

• Let’s return to the vowel triangle and see if it can shed light on certain patterns in the vocalic data

• The duration, amplitude and their product (integrated energy) will be plotted on a 2-D grid, where the x-axis will always be in terms of hypothetical front-back tongue position (and hence remain constant throughout the plots to follow)

• The y-axis will serve as the dependent measure, sometimes expressed in terms of duration, or amplitude, or their product
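The plotting scheme described above, a fixed front-back position on the x-axis and a changing dependent measure on the y-axis, can be sketched as a function that returns one (x, y) point per vowel. This is an assumption-laden illustration: the numeric coordinates in `FRONT_BACK` are invented (the slides give no numeric positions), the field names are hypothetical, and the toy nuclei are not corpus data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical front-back coordinates (0 = front, 1 = back).
FRONT_BACK = {"iy": 0.0, "ae": 0.1, "aa": 0.7, "uw": 1.0}


def duration(nucleus):
    return nucleus["duration_ms"]


def amplitude(nucleus):
    return nucleus["amplitude"]


def integrated_energy(nucleus):
    return nucleus["duration_ms"] * nucleus["amplitude"]


def grid_points(nuclei, measure):
    """Return {vowel: (x, y)} with x the fixed front-back position and
    y the mean of the chosen dependent measure over that vowel's nuclei."""
    by_vowel = defaultdict(list)
    for nucleus in nuclei:
        by_vowel[nucleus["vowel"]].append(measure(nucleus))
    return {v: (FRONT_BACK[v], mean(vals)) for v, vals in by_vowel.items()}


# Toy nuclei, invented for illustration only.
toy_nuclei = [
    {"vowel": "iy", "duration_ms": 60, "amplitude": 0.5},
    {"vowel": "iy", "duration_ms": 80, "amplitude": 0.5},
    {"vowel": "aa", "duration_ms": 120, "amplitude": 0.5},
]

duration_points = grid_points(toy_nuclei, duration)
energy_points = grid_points(toy_nuclei, integrated_energy)
print(duration_points)
print(energy_points)
```

Passing the measure as a function keeps the x-coordinates identical across plots, which is the property the slide emphasizes.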

Page 18: Diphthongal Amplitude and Vowel Height
[Figure: all nuclei]

Page 19: Monophthongal Amplitude and Vowel Height
[Figure: all nuclei]

Page 20: Amplitude - Monophthongs vs. Diphthongs
[Figure: all nuclei; panels: Diphthongs, Monophthongs]

Page 21: Diphthongal Duration and Vowel Height
[Figure: all nuclei]

Page 22: Monophthongal Duration and Vowel Height
[Figure: all nuclei]

Page 23: Duration - Monophthongs vs. Diphthongs
[Figure: all nuclei; panels: Diphthongs, Monophthongs]

Page 24: Diphthongal Int. Energy and Vowel Height
[Figure: all nuclei]

Page 25: Monophthongal Int. Energy and Vowel Height
[Figure: all nuclei]

Page 26: Int. Energy - Monophthongs vs. Diphthongs
[Figure: all nuclei; panels: Diphthongs, Monophthongs]

Page 27: Diphthongal Amplitude and Vowel Height
[Figure: stressed nuclei]

Page 28: Diphthongal Amplitude and Vowel Height
[Figure: unstressed nuclei]

Page 29: Monophthongal Amplitude and Vowel Height
[Figure: stressed nuclei]

Page 30: Monophthongal Amplitude and Vowel Height
[Figure: unstressed nuclei]

Page 31: Amplitude - Monophthongs vs. Diphthongs
[Figure: stressed and unstressed nuclei; panels: Diphthongs, Monophthongs]

Page 32: Diphthongal Duration and Vowel Height
[Figure: stressed nuclei]

Page 33: Diphthongal Duration and Vowel Height
[Figure: unstressed nuclei]

Page 34: Monophthongal Duration and Vowel Height
[Figure: stressed nuclei]

Page 35: Monophthongal Duration and Vowel Height
[Figure: unstressed nuclei]

Page 36: Duration - Monophthongs vs. Diphthongs
[Figure: stressed and unstressed nuclei; panels: Diphthongs, Monophthongs]

Page 37: Diphthongal Int. Energy and Vowel Height
[Figure: stressed nuclei]

Page 38: Diphthongal Int. Energy and Vowel Height
[Figure: unstressed nuclei]

Page 39: Monophthongal Int. Energy and Vowel Height
[Figure: stressed nuclei]

Page 40: Monophthongal Int. Energy and Vowel Height
[Figure: unstressed nuclei]

Page 41: Int. Energy - Monophthongs vs. Diphthongs
[Figure: stressed and unstressed nuclei; panels: Diphthongs, Monophthongs]

Page 42: Mystery Parameter

• There is one other parameter which, when plotted in a vowel triangle plot, shows an interesting pattern

• This is the proportion of stressed and unstressed nuclei

Page 43: Proportion of Stress Accent and Vowel Height
[Figure]
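The “mystery parameter”, the proportion of nuclei of each vowel that are stressed, is a simple per-vowel ratio. The sketch below assumes the input has already been restricted to nuclei that both transcribers marked as fully stressed or completely unstressed; the function name and the toy observations are hypothetical and invented for illustration.

```python
from collections import Counter


def stressed_proportion(nuclei):
    """Given (vowel, is_stressed) pairs, return the fraction of each
    vowel's nuclei that are stressed."""
    total = Counter()
    stressed = Counter()
    for vowel, is_stressed in nuclei:
        total[vowel] += 1
        if is_stressed:
            stressed[vowel] += 1
    return {vowel: stressed[vowel] / total[vowel] for vowel in total}


# Toy observations, invented for illustration only.
toy_nuclei = [
    ("aa", True), ("aa", True), ("aa", False),
    ("ih", True), ("ih", False), ("ih", False), ("ih", False),
]

proportions = stressed_proportion(toy_nuclei)
print(proportions)
```

Each resulting value can serve as the y-coordinate for its vowel in a vowel-triangle plot of the kind used throughout the presentation.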

Page 44: Amplitude - Monophthongs vs. Diphthongs
[Figure: all nuclei; panels: Diphthongs, Monophthongs]

Page 45: Duration - Monophthongs vs. Diphthongs
[Figure: all nuclei; panels: Diphthongs, Monophthongs]

Page 46: Int. Energy - Monophthongs vs. Diphthongs
[Figure: all nuclei; panels: Diphthongs, Monophthongs]

Page 47: Summary and Conclusions

• There is an intimate relationship between vocalic identity, nucleic duration and stress accent in spontaneous dialogue (at least in the Switchboard corpus)

• Stressed syllables tend to have significantly longer nuclei than their unstressed counterparts, consistent with the findings reported by Silipo and Greenberg in previous years’ meetings regarding the OGI Stories corpus (telephone monologues)

• Certain vocalic classes exhibit a far greater dynamic range in duration than others
  – Diphthongs tend to be longer than monophthongs, BUT …
  – The low monophthongs ([ae], [aa], [ay], [aw], [ao]) exhibit patterns of duration and dynamic range under stress (accent) similar to diphthongs

• The statistical patterns are consistent with the hypothesis that duration serves under many conditions as either a primary or secondary cue for vowel height (normally associated with the frequency of the first formant)

Page 48: Summary and Conclusions

• Moreover, the stress-accent system in spontaneous (American) English appears to be closely associated with vocalic identity

• Low vowels are far more likely to be fully stressed than high vowels (with the mid vowels exhibiting an intermediate probability of being stressed)

• Thus, the identity of a vowel cannot be considered independently of stress-accent

• Thus, vowel duration may be an important factor in disambiguating spoken language and therefore should be of interest to the speech recognition community