
Chapter 7

THE ACOUSTIC CHARACTERISTICS OF SPEECH

1. Intensity patterns

2. Frequency patterns

3. Intensity and frequency changes over time

1. Long-term average spectrum

• Speech energy extends from approximately 50 Hz to about 10 kHz, with most of the energy lying at 100-600 Hz

• Above about 500 Hz, the intensity decreases at a rate of about 12 dB/octave
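As a rough worked example of the roll-off figure above, the sketch below converts "12 dB/octave above 500 Hz" into a relative level at a given frequency. The function name and the assumption that the spectrum is flat below the 500 Hz corner are illustrative simplifications, not part of the slides.

```python
import math

def relative_level_db(freq_hz, corner_hz=500.0, slope_db_per_octave=12.0):
    """Level relative to the corner frequency, assuming a flat spectrum below
    the corner (a simplification) and a constant roll-off above it."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    if freq_hz <= corner_hz:
        return 0.0
    octaves_above_corner = math.log2(freq_hz / corner_hz)
    return -slope_db_per_octave * octaves_above_corner

# 4000 Hz is three octaves above 500 Hz (500 -> 1000 -> 2000 -> 4000).
for freq in (500, 1000, 2000, 4000):
    print(freq, "Hz:", relative_level_db(freq), "dB")
```

Under these assumptions, each doubling of frequency above 500 Hz lowers the level by another 12 dB.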

Intensity: The DYNAMIC RANGE

» Intensity Hierarchy: vowels > approximants > nasals > strident fricatives > stops > nonstrident fricatives

» The ratio of the intensity of the most intense phoneme, /a/, to the intensity of the least intense phoneme, /θ/, is about 700:1 (dB = 10 log 700, or about 28 dB)
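The decibel conversion quoted for the 700:1 ratio can be checked with a few lines of code. This is a minimal sketch; the function names are illustrative.

```python
import math

def intensity_ratio_to_db(ratio):
    """Convert an intensity (power) ratio to decibels: dB = 10 * log10(ratio)."""
    if ratio <= 0:
        raise ValueError("intensity ratio must be positive")
    return 10.0 * math.log10(ratio)

def db_to_intensity_ratio(level_db):
    """Inverse conversion: ratio = 10 ** (dB / 10)."""
    return 10.0 ** (level_db / 10.0)

dynamic_range_db = intensity_ratio_to_db(700)
print(f"{dynamic_range_db:.1f} dB")  # 28.5 dB
```

Note that the factor is 10, not 20, because the ratio is stated in terms of intensity rather than sound pressure.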

Functions of Intensity Variations

- Linguistic: stress, prosody, segmentation

- Nonlinguistic: attention-arousal, emotion, prosody, expressiveness

2. Articulation and Acoustic Space

The frequency of F1 varies inversely with tongue height. The frequency of F2 varies inversely with tongue backness.

Vocal Tract Configuration

Formant Frequency Configuration

Speech Variability

3. The Sound Spectrograph

Spectrograms: Monophthongs vs. Diphthongs

What Spectrograms Show Us

The utterance is “Never touch a snake with your bare hands”. Can you determine the word boundaries by looking at the spectrogram?
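To make the idea of a spectrogram concrete, here is a minimal sketch of the underlying computation: cut the signal into short overlapping frames, window each frame, and take the magnitude of its Fourier transform. The frame length, hop size, and the synthetic two-tone test signal are assumptions chosen for illustration; a real analysis would use recorded speech and an FFT library.

```python
import cmath
import math

def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0..n/2 (naive O(n^2) transform, fine for a demo)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def spectrogram(signal, frame_len=64, hop=32):
    """One row of bin magnitudes per frame: rows are time, columns are frequency."""
    window = hann(frame_len)
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        rows.append(dft_magnitudes(frame))
    return rows

# Synthetic signal: 500 Hz in the first half, 2000 Hz in the second half.
sample_rate = 8000
n_samples = 1024
signal = [
    math.sin(2 * math.pi * (500 if i < n_samples // 2 else 2000) * i / sample_rate)
    for i in range(n_samples)
]

spec = spectrogram(signal)
bin_hz = sample_rate / 64  # frequency spacing between DFT bins
peak_hz = [bin_hz * row.index(max(row)) for row in spec]
print(peak_hz[0], peak_hz[-1])  # strongest frequency in the first and last frame
```

The change in the strongest frequency from the early frames to the late frames is the kind of variation over time that a spectrogram displays as a shift in the dark bands.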

Chapter 8

SPEECH PERCEPTION

1. Acoustic Variability & Perceptual Constancy

- Vowel perception
- Consonant perception

2. Intelligibility Test

3. Phenomena and Theories

- Categorical perception
- Perceptual magnet effect
- Multimodal speech perception
- Sinewave speech

1. Units of spoken language

English phonemes:

Vowels:

(bead, bid, bed, bad, bod, bawd, book, booed, bud, bird, bide, Boyd, bowed, bayed, bode)

Consonants: /p/, /b/, /t/, /d/, /k/, /g/ (go), /m/, /n/, /ŋ/ (sing), /f/, /v/, /θ/ (thigh), /ð/ (thy), /s/, /z/, /ʃ/ (shy), /ʒ/ (rouge), /tʃ/ (chat), /dʒ/ (jury), /l/, /r/, /j/ (young), /w/, /h/

The Perception of Vowels

Different vowels are produced by different modifications of the glottal source waveform -- the modifications are produced by the different vocal tract resonances

Naturally spoken vowels have as many as four to six formants

» F1,2,3 are sufficient for excellent vowel identification

» F1,2 produce good vowel perception

a) high front vowel /i/: low F1, high F2

b) low front vowel /ae/: high F1, lower F2

a) high back vowel /u/: low F1, low F2

b) low back vowel /ɑ/: high F1, low F2
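The four corner-vowel patterns above can be sketched as a simple lookup from (F1, F2) to a height/backness description. The split frequencies and the example formant values below are illustrative round numbers assumed for this sketch, not measurements from the slides; real formant values vary with talker and context.

```python
def describe_vowel(f1_hz, f2_hz, f1_split_hz=500.0, f2_split_hz=1500.0):
    """Map formants to a coarse description.

    Low F1 -> high tongue position; high F2 -> front tongue position.
    The split frequencies are hypothetical thresholds for illustration only.
    """
    height = "high" if f1_hz < f1_split_hz else "low"
    backness = "front" if f2_hz > f2_split_hz else "back"
    return f"{height} {backness}"

# Illustrative (F1, F2) pairs in Hz, assumed for this example.
examples = {
    "/i/": (300, 2300),
    "/ae/": (700, 1700),
    "/u/": (300, 900),
    "low back vowel": (700, 1100),
}

for label, (f1, f2) in examples.items():
    print(label, "->", describe_vowel(f1, f2))
```

A two-threshold rule like this only separates the corners of the vowel space; it is not a vowel recognizer.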

The Perception of Consonants

Recall production of a plosive

» Block vocal tract

» Stop air flow

» Suddenly release the built-up air pressure

» In natural speech, the source of the plosive burst is the narrow opening as the closure for a plosive is released

» When heard alone, a listener hears only noise

Sound sources for consonants

Voiced source
» vocal fold vibration
  vowels, glides, nasals

Turbulent source
» turbulent airflow
  fricative consonants (e.g. “fish”)

Transient source
» release of pressure build-up
  stop consonants (e.g. “pea”)

Combined sources
» voiced + turbulent (e.g. “zoo”)
» voiced + turbulent + transient (e.g. “judge”)
» turbulent + transient (e.g. “chop”)

Consonant characteristics

Source(s) + Vocal Tract Constriction

Described in terms of:
» the place of the constriction
  where along the vocal tract
» manner of the constriction
  narrow? complete closure?
» presence or absence of voicing

Acoustic cues for consonants often overlap with cues for vowels (coarticulation)

Perception of Plosive Consonants

Context Effect

In general, the best target frequencies were 3000 Hz for a velar, 1800 Hz for an alveolar, and about 700 Hz for a bilabial.

However, the target frequency for each of the three plosives varied somewhat with the formant pattern of the vowel that followed.

Why?

The Perception of Fricative Consonants

From a production standpoint, we create a fricative by forcing air through a narrow constriction in the vocal tract.

» /s/ is distinguishable from /ʃ/ by spectral differences

/s/ --- most energy is at 4 kHz and above
/ʃ/ --- most energy is in the 2-3 kHz region

» The less intense /f/ and /θ/ are distinguishable in large part by characteristics of the second formant transition of the adjacent vowel

The Perception of Fricative Consonants (Cont’d)

Duration also plays a role in perception of fricatives as a class

Record the utterance “see” (/si/)
» Then, reduce the duration of the fricative to about 10 ms
» The perception changes to “tee” (/ti/)

Another processing game
» Record the utterance “peace” (/pis/)
» Insert a brief silent interval of about 30 ms or so just after the beginning of the fricative
» The perception changes to “peats” (/pits/)

Can you explain why?
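The two waveform edits described above can be sketched on a list of samples. Everything here is a stand-in: the "recordings" are constant-valued placeholders, the segment boundaries are assumed to be known, and keeping the final 10 ms of the fricative (the part next to the vowel) is an assumption, since the slides do not say which portion is kept.

```python
def ms_to_samples(duration_ms, sample_rate):
    """Number of samples that span duration_ms at the given sample rate."""
    return int(round(duration_ms * sample_rate / 1000.0))

def shorten_segment(samples, start, end, keep_ms, sample_rate):
    """Keep only the last keep_ms of samples[start:end]; leave the rest untouched."""
    keep = min(ms_to_samples(keep_ms, sample_rate), end - start)
    return samples[:start] + samples[end - keep:end] + samples[end:]

def insert_silence(samples, index, gap_ms, sample_rate):
    """Insert gap_ms of zeros before samples[index]."""
    gap = [0.0] * ms_to_samples(gap_ms, sample_rate)
    return samples[:index] + gap + samples[index:]

sample_rate = 10000
fricative = [0.1] * 2000  # placeholder for 200 ms of /s/ noise
vowel = [0.5] * 3000      # placeholder for 300 ms of /i/

# "see" -> fricative cut down to about 10 ms
see = fricative + vowel
tee_like = shorten_segment(see, 0, len(fricative), 10, sample_rate)

# "peace" -> about 30 ms of silence inserted 5 ms into the final fricative
# (the 5 ms offset is a hypothetical choice for "just after the beginning")
peace_tail = vowel + fricative
offset = len(vowel) + ms_to_samples(5, sample_rate)
peats_like = insert_silence(peace_tail, offset, 30, sample_rate)

print(len(see), len(tee_like), len(peace_tail), len(peats_like))
```

With real recordings, the same two functions would be applied to the sample array after locating the fricative by hand or from a spectrogram.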

The Perception of Nasal Consonants

The place of articulation, hence the target frequency for the F2 transition, is the same as for the corresponding plosive stop: /m/ for /b,p/; /n/ for /d,t/; /ŋ/ for /g,k/

By coupling the oral and nasal cavities, we lower the intensity of the oral resonances (considerable sound energy is absorbed by the soft tissue in the nasal cavity)

We also add low intensity nasal resonances

What is the primary cue WITHIN the nasal class?
» The F2 transition

Formant Transitions and Place of Articulation

F2 transition distinguishes among the three plosive stops

F2 transition distinguishes among the three nasals

The Perception of Suprasegmental Features

STRESS and INTONATION serve as linguistic features

» Important acoustic parameters are fundamental frequency, duration, and intensity

» In English, stress is occasioned by all three

» Intonation is related principally to fundamental frequency

Differences among word stress, sentence stress, emphatic stress, and intonational patterns to distinguish among statements, questions, and so forth will not be discussed

Theories of Speech Perception

The obvious and general question is, “How do we perceive speech?”

» How are acoustic features converted into phonemes and words?

» Is the conversion a single-stage process?

» Is there a two-stage process in which we convert to auditory percepts of loudness, pitch, and so forth, followed by conversion into some linguistic sequence?

2. Intelligibility Test

Tests of Speech Intelligibility (“the Articulation Test” is archaic): the nature of the test and how it is administered differ in both content and form

Content
» Nonsense syllables
» Word tests (monosyllabic, disyllabic, etc.)
» Sentence tests
» Connected discourse

Form
» Open message set (listener is not informed of alternative answers)
» Closed message set (listener is informed of all possible alternative answers)

Aims of Test

Tests of speech intelligibility can be used for multiple purposes:

» Evaluate human listeners

» Evaluate transmission systems (e.g., telephones)

» Evaluate speech processing devices or strategies (e.g., hearing aids)

Experiments have addressed questions such as 1) importance of various acoustic features, 2) importance of understanding auditory perception, 3) role of movement of vocal organs, 4) role of syntax and semantics, and so forth

The experiments fall into different categories

» Acoustic analysis of speech waves produced by talkers speaking normally -- Examples include:
  Measurement of long-term spectrum of speech
  Measurement of vowel formant frequencies

» Eliminate or modify naturally produced speech and measure the effect on its intelligibility -- Examples include:
  Filtered speech
  Masked speech
  Interrupted speech

» Synthesize speech and, for example, manipulate one feature of a single phoneme, holding all other characteristics constant, and study effects on perception -- Examples include:
  Vary VOT for a /b/ -- /p/ continuum to learn when perception changes from one phoneme to another
  Vary F2 transition for a /b/ -- /d/ -- /g/ continuum to learn when perception changes from one phoneme to another

The Early Experiments: The 1920’s Through the 1940’s

Common observations

» 50% intelligibility corresponds to an intensity of about 30-35 dB SPL

» Intelligibility changes (increases or decreases) at a rate of about 6% per decibel for these monosyllabic words

» Individual phonemes vary in the intensity at which they become intelligible

What class of phonemes would you expect to be most intelligible at low intensities?

What class of phonemes would you expect to require the greatest intensity before becoming intelligible?
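The two numbers quoted for monosyllabic words (50% intelligibility at roughly 30-35 dB SPL, changing by about 6% per decibel) can be combined into a rough straight-line model. This is a sketch under stated assumptions: the 32.5 dB SPL midpoint is an arbitrary choice within the quoted range, and a straight line clipped at 0% and 100% only approximates the middle of what is really an S-shaped function.

```python
def percent_intelligible(level_db_spl, level_at_50_percent=32.5, slope_percent_per_db=6.0):
    """Straight-line approximation of the intelligibility function, clipped to 0-100%.

    level_at_50_percent is an assumed midpoint of the 30-35 dB SPL range.
    """
    score = 50.0 + slope_percent_per_db * (level_db_spl - level_at_50_percent)
    return max(0.0, min(100.0, score))

for level in (25.0, 30.0, 32.5, 35.0, 40.0):
    print(level, "dB SPL ->", percent_intelligible(level), "%")
```

Under this approximation the score moves from 0% to 100% over a span of only about 17 dB, which illustrates how steep the function is near threshold.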

The Early Experiments: The 1920’s Through the 1940’s (Cont’d)

The Effect(s) of Noise

» What characteristics of the noise will help determine its effects on intelligibility?
  intensity
  spectrum
  temporal pattern

Experiments With Filtered Speech

low-pass filter
» 500 Hz --- about 7%
» 1000 Hz --- about 28%
» 2000 Hz --- about 68%
» 5000 Hz --- about 95%

high-pass filter
» 500 Hz --- about 95%
» 1000 Hz --- about 90%
» 2000 Hz --- about 68%
» 5000 Hz --- about 5%
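The filtered-speech scores above can be placed in two small tables to find the cutoff at which low-pass and high-pass filtering give the same score, and to estimate scores at cutoffs between the listed points. The tabulated values come from the slide; interpolating linearly on a logarithmic frequency axis is an assumption made for this sketch.

```python
import math

# Percent intelligibility by filter cutoff (Hz), as listed on the slide.
LOW_PASS = {500: 7, 1000: 28, 2000: 68, 5000: 95}
HIGH_PASS = {500: 95, 1000: 90, 2000: 68, 5000: 5}

def interpolate(table, cutoff_hz):
    """Estimate a score between tabulated cutoffs (linear in log-frequency).

    Cutoffs outside the tabulated range return the nearest tabulated score.
    """
    points = sorted(table.items())
    if cutoff_hz <= points[0][0]:
        return float(points[0][1])
    if cutoff_hz >= points[-1][0]:
        return float(points[-1][1])
    for (f0, s0), (f1, s1) in zip(points, points[1:]):
        if f0 <= cutoff_hz <= f1:
            fraction = math.log(cutoff_hz / f0) / math.log(f1 / f0)
            return s0 + fraction * (s1 - s0)

# Cutoffs where both filter types yield the same tabulated score.
crossover = [f for f in sorted(LOW_PASS) if LOW_PASS[f] == HIGH_PASS[f]]
print("crossover cutoff(s):", crossover)
print("low-pass at 1500 Hz:", round(interpolate(LOW_PASS, 1500), 1))
```

The crossover shows that the bands below and above that cutoff contribute about equally to intelligibility, even though most speech energy lies at low frequencies.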

What is the relevance of experiments with filtered speech?

Design of telephone

D i f h i t


Design of speech processing systems

Further understanding human hearing

Have fun

Experiments with Distorted Speech

The figure illustrates one form of distortion to speech called PEAK CLIPPING

Experiments with Distorted Speech

SPEECH QUALITY is altered drastically, but intelligibility may be preserved at 80-90% --- Why?
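Peak clipping is easy to sketch in code: every sample whose magnitude exceeds a threshold is replaced by the threshold value. The example below uses a synthetic sine wave in place of speech, and the threshold is an arbitrary illustrative choice. It also counts zero crossings, one property of the waveform that clipping leaves unchanged, which may be part of the answer to the question above.

```python
import math

def peak_clip(samples, threshold):
    """Limit every sample to the range [-threshold, +threshold]."""
    return [max(-threshold, min(threshold, s)) for s in samples]

def zero_crossings(samples):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))

sample_rate = 8000
signal = [math.sin(2 * math.pi * 100 * i / sample_rate) for i in range(800)]

clipped = peak_clip(signal, 0.1)  # severe clipping: peaks cut to 10% of full scale

print("peak before/after:", max(signal), max(clipped))
print("zero crossings before/after:", zero_crossings(signal), zero_crossings(clipped))
```

Because clipping never changes the sign of a sample, the zero-crossing count is identical before and after, while the waveform shape (and so the perceived quality) changes markedly.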

Another form of distortion is TEMPORAL INTERRUPTION, in which the speech stream is divided into segments, and alternate segments are discarded -- leaving gaps of silence between segments that remain
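Temporal interruption as described above can be sketched as follows: cut the sample stream into equal-length segments and replace every second segment with silence. The segment length and the choice to keep the first segment are assumptions made for illustration.

```python
def interrupt(samples, segment_len):
    """Keep even-numbered segments and replace odd-numbered segments with zeros."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    out = []
    for start in range(0, len(samples), segment_len):
        segment = samples[start:start + segment_len]
        keep = (start // segment_len) % 2 == 0
        out.extend(segment if keep else [0.0] * len(segment))
    return out

# Ten placeholder samples, segments of three samples each.
print(interrupt([1.0] * 10, 3))
```

The output has the same length as the input, so the timing of the surviving segments is preserved; only half of the signal remains.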

The 1950’s and Later

In the 1950’s, engineers used electronic circuits to produce signals that were speech-like -- called SPEECH SYNTHESIZERS

One such synthesizer, used extensively in experiments on speech perception, was the PATTERN-PLAYBACK SYNTHESIZER, which was essentially a sound spectrograph in reverse

http://www.haskins.yale.edu/featured/patplay.html

Summary of the Early Experiments

Speech is intelligible over a wide range of intensities

Speech is intelligible in the presence of a reasonably intense masker

Speech is intelligible if we preserve only part of the spectrum

Speech is intelligible in the presence of reasonably severe distortion

Thus, speech is a relatively impervious signal, and no one part of the waveform is indispensable

3. Theories of Speech Perception

» Is there something special or unique about the mechanisms for perception of speech that makes them different from those for perception of other acoustic signals, music for example?

A Motor Theory of Speech Perception

The listener continuously articulates the incoming stream of speech and compares the “auditory result of their own articulation with the incoming auditory patterns.”

In other words, speech is perceived in articulatory terms

The distinction between perceptual dimensions and physical dimensions

Physical dimensions: Any aspect of a physical stimulus that could be measured in a straightforward way with an instrument (e.g., a light meter, a sound level meter, a spectrum analyzer, a fundamental frequency meter, etc.)

Perceptual dimensions: These are the mental experiences that occur inside the mind of the observer. These experiences are actively created by the sensory system and brain based on an analysis of the physical properties of the stimulus. Perceptual dimensions can be measured, but not with a meter. Measuring perceptual dimensions requires an observer (e.g., a listener).

Categorical Perception

Figure: phonetic boundary between /ba/ and /pa/ along a Voice Onset Time axis (5 to 45 msec)

Voice Onset Time (VOT)

Figure: timing of articulatory release and voicing onset; when release and voicing onset are approximately simultaneous, VOT ~0 msec

VOT = Interval between articulatory release and onset of voicing.

VOT continuum

• Synthetic speech tokens

• synthesis allows for variations that cannot be produced naturally

• a gradual acoustic progression from /ta/ to /da/

• first token has a VOT of 60 ms (long VOT)

• last token has a VOT of 0 ms (voicing begins immediately)

• perception has sharp cross-over boundary
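A VOT continuum and a sharp cross-over in identification can be sketched as follows. The 60 ms and 0 ms endpoints come from the slide; the number of tokens, the boundary location, and the steepness of the logistic identification function are hypothetical values chosen only to illustrate the shape of a categorical response, not measured data.

```python
import math

def vot_continuum(longest_ms=60.0, shortest_ms=0.0, n_tokens=13):
    """Equally spaced VOT values from the long-VOT end to the short-VOT end.

    n_tokens is an assumed value for illustration.
    """
    step = (longest_ms - shortest_ms) / (n_tokens - 1)
    return [longest_ms - i * step for i in range(n_tokens)]

def percent_voiceless(vot_ms, boundary_ms=30.0, steepness_ms=2.0):
    """Hypothetical identification function: percent /ta/ responses.

    boundary_ms and steepness_ms are illustrative assumptions.
    """
    return 100.0 / (1.0 + math.exp(-(vot_ms - boundary_ms) / steepness_ms))

for vot in vot_continuum():
    print(f"VOT {vot:4.0f} ms -> {percent_voiceless(vot):5.1f}% /ta/")
```

Although the VOT values change in equal steps, the modeled responses stay near 100% or 0% for most tokens and switch over within a few tokens around the boundary, which is the pattern the term "categorical" refers to.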

Categorical Perception

Figure: Identification (percent response, 0 to 100, by stimulus number 0 to 12) and Discrimination (percent correct, 40 to 100, by discriminated pair 1-2 through 11-12)

Categorical Perception

What was originally claimed?

» Non-speech sounds are perceived continuously, not categorically

» Speech sounds are perceived categorically, not continuously (more so for plosive stops --- less so for vowels)

» Therefore, there is something special about speech processing that is not applicable to perception of other acoustic signals

Categorical perception in the visual domain

Categorical perception of dog/cat?

Categorical Perception is affected by learning experience.

Figure: percent correct discrimination (40 to 100) by discriminated pair, for American and Japanese listeners

Effects of Language Experience

Hindi dental vs. retroflex stop

• high frequency energy in burst
• transitions falling into the vowel
• similar to English alveolar stop /d/
• burst has more central energy intensity
• flatter F3 and F4 transitions

Perceptual Magnet Effect

Kuhl • Cathedral Centennial • 2001

Perceptual Magnet Effect reflects learning experience.

Bimodal Speech Perception: The McGurk Effect

Let’s do an experiment! Close your eyes and listen to instructions.

http://ilabs.washington.edu:16080/kuhl/research.html

Sinewave Speech

http://www.haskins.yale.edu/research/tonecombo.html

Tone 1, Tone 2, Tone 3

Tones 1+2, Tones 1+3, Tones 1+2+3

Natural Speech
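The tone combinations listed above can be imitated with a short sketch: each tone is a single sinusoid whose frequency changes over time, and the combinations are sample-by-sample sums. The glide frequencies and amplitudes below are hypothetical placeholders; in actual sinewave speech each tone follows a formant track measured from a natural utterance.

```python
import math

def glide(start_hz, end_hz, n_samples):
    """Linear frequency track from start_hz to end_hz (a stand-in for a formant track)."""
    return [start_hz + (end_hz - start_hz) * i / (n_samples - 1) for i in range(n_samples)]

def sine_track(freqs_hz, amplitude, sample_rate):
    """Sinusoid with time-varying frequency, built by accumulating phase."""
    phase = 0.0
    samples = []
    for freq in freqs_hz:
        samples.append(amplitude * math.sin(phase))
        phase += 2.0 * math.pi * freq / sample_rate
    return samples

def mix(*tracks):
    """Sample-by-sample sum of equal-length tracks."""
    return [sum(values) for values in zip(*tracks)]

sample_rate = 8000
n = 4000  # half a second

# Hypothetical frequency tracks and amplitudes, for illustration only.
tone1 = sine_track(glide(300, 700, n), 1.0, sample_rate)
tone2 = sine_track(glide(2200, 1200, n), 0.5, sample_rate)
tone3 = sine_track(glide(3000, 2500, n), 0.25, sample_rate)

tones_1_2 = mix(tone1, tone2)
tones_1_3 = mix(tone1, tone3)
tones_1_2_3 = mix(tone1, tone2, tone3)
print(len(tones_1_2_3), max(abs(s) for s in tones_1_2_3))
```

Phase accumulation keeps each tone continuous while its frequency changes, which a direct `sin(2*pi*f*t)` with a varying `f` would not.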

Wrapping up

Categorical perception shows the existence of phonetic boundaries.

Perceptual magnet shows the role of prototypes in category assimilation.

Perception of speech sounds is affected by language experience.

Our auditory system pays attention to the acoustic cues that differentiate the sounds of our language(s)

Spectrograms show all acoustic variation over time -- even that which we don’t perceive

variation that we have learned not to pay attention to may actually cue category distinctions in other languages

Perception of Speech

Perception of speech is realized by acoustic information in combination with articulatory, linguistic, semantic, and circumstantial cues

Perception likely is the result of a complicated interaction between incoming information and information that is stored