Chapter 7
THE ACOUSTIC CHARACTERISTICS OF SPEECH
1. Intensity patterns
2. Frequency patterns
3. Intensity and frequency changes over time
1. Long-term average spectrum
• Speech energy extends from approximately 50 Hz to about 10 kHz, with most of the energy lying at 100-600 Hz
• Above about 500 Hz, the intensity decreases at a rate of about 12 dB/octave
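The 12 dB/octave figure can be made concrete with a small calculation. The sketch below is only an illustration: the function name, the 500 Hz corner, and the choice to treat the spectrum as flat below the corner are simplifying assumptions, not part of the slide.

```python
import math

def relative_level_db(freq_hz, corner_hz=500.0, slope_db_per_octave=12.0):
    """Approximate long-term spectrum level relative to the level at the corner.

    Assumption: the spectrum is treated as flat at and below the corner frequency.
    """
    if freq_hz <= corner_hz:
        return 0.0
    octaves_above = math.log2(freq_hz / corner_hz)  # each doubling is one octave
    return -slope_db_per_octave * octaves_above

for f in (500, 1000, 2000, 4000, 8000):
    print(f, "Hz:", round(relative_level_db(f), 1), "dB")
```

Under these assumptions, each doubling of frequency above 500 Hz lowers the level by a further 12 dB, so 4 kHz (three octaves up) sits about 36 dB below the 500 Hz level.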
Intensity: The DYNAMIC RANGE
» Intensity Hierarchy:
vowels > approximants > nasals > strident fricatives > stops > nonstrident fricatives
» The ratio of the intensity of the most intense phoneme, /a/, to the intensity of the least intense phoneme, /θ/, is about 700:1 (dB = 10 log 700 ≈ 28.5 dB)
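The conversion from an intensity ratio to decibels quoted above can be checked directly. This is a minimal sketch; the function names are illustrative.

```python
import math

def intensity_ratio_to_db(ratio):
    """Convert an intensity (power) ratio to decibels: dB = 10 * log10(ratio)."""
    return 10.0 * math.log10(ratio)

def db_to_intensity_ratio(db):
    """Inverse conversion: ratio = 10 ** (dB / 10)."""
    return 10.0 ** (db / 10.0)

print(round(intensity_ratio_to_db(700), 1))  # 28.5
```

So a 700:1 intensity ratio corresponds to a dynamic range of roughly 28.5 dB between the most and least intense phonemes.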
Functions of Intensity Variations
- Linguistic: stress, prosody
- Nonlinguistic: attention-arousal, emotion, prosody
Segmentation, expressiveness
2. Articulation and Acoustic Space
The frequency of F1 varies inversely with tongue height.
The frequency of F2 varies inversely with tongue backness.
What Spectrograms Show Us
The utterance is “Never touch a snake with your bare hands”. Can you determine the word boundaries by looking at the spectrogram?
Chapter 8
SPEECH PERCEPTION
1. Acoustic Variability & Perceptual Constancy
- Vowel perception
- Consonant perception
2. Intelligibility Test
3. Phenomena and Theories
- Categorical perception
- Perceptual magnet effect
- Multimodal speech perception
- Sinewave speech
1. Units of spoken language
English phonemes:
Vowels:
(bead, bid, bed, bad, bod, bawd, book, booed, bud, bird, bide, Boyd, bowed, bayed, bode)
Consonants: /p/, /b/, /t/, /d/, /k/, /g/ (go), /m/, /n/, /ŋ/ (sing), /f/, /v/, /θ/ (thigh), /ð/ (thy), /s/, /z/, /ʃ/ (shy), /ʒ/ (rouge), /tʃ/ (chat), /dʒ/ (jury), /l/, /r/, /j/ (young), /w/, /h/
The Perception of Vowels
Different vowels are produced by different modifications of the glottal source waveform; the modifications are produced by the different vocal tract resonances
Naturally spoken vowels have as many as four to six formants
» F1, F2, and F3 are sufficient for excellent vowel identification
» F1 and F2 produce good vowel perception
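The idea that F1 and F2 alone carry much of the vowel identity can be sketched as a nearest-neighbour lookup in the F1/F2 plane. The reference formant values below are approximate adult-male averages supplied here as assumptions (they do not come from the slides), and the three-vowel set and function name are purely illustrative.

```python
import math

# Approximate adult-male formant averages in Hz (assumed illustrative values,
# not taken from the slides).
REFERENCE_VOWELS = {
    "i": (270.0, 2290.0),   # high front vowel: low F1, high F2
    "ɑ": (730.0, 1090.0),   # low back vowel: high F1, low F2
    "u": (300.0, 870.0),    # high back vowel: low F1, low F2
}

def nearest_vowel(f1_hz, f2_hz, references=REFERENCE_VOWELS):
    """Return the reference vowel closest to (F1, F2) on a log-frequency scale."""
    def distance(ref):
        ref_f1, ref_f2 = ref
        return math.hypot(math.log(f1_hz / ref_f1), math.log(f2_hz / ref_f2))
    return min(references, key=lambda v: distance(references[v]))

print(nearest_vowel(280, 2250))
print(nearest_vowel(700, 1100))
print(nearest_vowel(320, 900))
```

The reference values also mirror the articulatory rules stated earlier: the two high vowels have low F1, and the two back vowels have low F2.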
The Perception of Consonants
Recall production of a plosive
» Block vocal tract
» Stop air flow
» Suddenly release the built-up air pressure
» In natural speech, the source of the plosive burst is the narrow opening as the closure for a plosive is released
» When the burst is heard alone, a listener hears only noise
Sound sources for consonants
Voiced source
» vocal fold vibration
vowels, glides, nasals
Turbulent source
» turbulent airflow
fricative consonants (e.g. “fish”)
Transient source
» release of pressure build-up
stop consonants (e.g. “pea”)
Combined sources
» voiced + turbulent (e.g. “zoo”)
» voiced + turbulent + transient (e.g. “judge”)
» turbulent + transient (e.g. “chop”)
Consonant characteristics
Source(s) + Vocal Tract Constriction
Described in terms of:
» the place of the constriction
where along the vocal tract
» manner of the constriction
narrow? complete closure?
» presence or absence of voicing
Acoustic cues for consonants often overlap with cues for vowels (coarticulation)
Context Effect
In general, the best target frequencies were 3000 Hz for a velar, 1800 Hz for an alveolar, and about 700 Hz for a bilabial
However, the target frequency for each of the three plosives varied somewhat with the formant pattern of the vowel that followed.
Why?
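The three target frequencies quoted on this slide (3000 Hz velar, 1800 Hz alveolar, about 700 Hz bilabial) can be turned into a crude nearest-target lookup. This is only a sketch: the function name is hypothetical, and it deliberately ignores the vowel-dependent variation the slide asks about, so it should not be read as a model of how listeners actually decide.

```python
# Best target frequencies in Hz, as quoted on the slide.
PLACE_TARGETS_HZ = {
    "velar": 3000.0,
    "alveolar": 1800.0,
    "bilabial": 700.0,
}

def nearest_place(target_hz, targets=PLACE_TARGETS_HZ):
    """Return the place of articulation whose target frequency is closest."""
    return min(targets, key=lambda place: abs(targets[place] - target_hz))

for hz in (750, 1900, 2800):
    print(hz, "Hz ->", nearest_place(hz))
```

Because the real targets shift with the following vowel, a fixed lookup like this would misclassify some tokens, which is one way to see why context matters.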
The Perception of Fricative Consonants
From a production standpoint, we create a fricative by forcing air through a narrow constriction in the vocal tract.
» /s/ is distinguishable from /ʃ/ by spectral differences
/s/ --- most energy is at 4 kHz and above
/ʃ/ --- most energy is in the 2-3 kHz region
» The less intense /f/ and /θ/ are distinguishable in large part by characteristics of the second formant transition of the adjacent vowel
The Perception of Fricative Consonants (Cont’d)
Duration also plays a role in perception of fricatives as a class
Record the utterance “see” (/si/)
» Then, reduce the duration of the fricative to about 10 ms
» The perception changes to “tee” (/ti/)
Another processing game
» Record the utterance “peace” (/pis/)
» Insert a brief silent interval of about 30 ms or so just after the beginning of the fricative
» The perception changes to “peats” (/pits/)
Can you explain why?
The Perception of Nasal Consonants
The place of articulation, hence the target frequency for the F2 transition, is the same as for the corresponding plosive stop: /m/ for /b,p/; /n/ for /d,t/; /ŋ/ for /g,k/
By coupling the oral and nasal cavities, we lower the intensity of the oral resonances (considerable sound energy is absorbed by the soft tissue in the nasal cavity)
We also add low intensity nasal resonances
What is the primary cue WITHIN the nasal class?
» The F2 transition
Formant Transitions and Place of Articulation
F2 transition distinguishes among the three plosive stops
F2 transition distinguishes among the three nasals
The Perception of Suprasegmental Features
STRESS and INTONATION serve as linguistic features
» Important acoustic parameters are fundamental frequency, duration, and intensity
» In English, stress is signaled by all three
» Intonation is related principally to fundamental frequency
Differences among word stress, sentence stress, emphatic stress, and intonational patterns to distinguish among statements, questions, and so forth will not be discussed
Theories of Speech Perception
The obvious and general question is, “How do we perceive speech?”
» How are acoustic features converted into phonemes and words?
» Is the conversion a single-stage process?
» Is there a two-stage process in which we convert to auditory percepts of loudness, pitch, and so forth, followed by conversion into some linguistic sequence?
2. Intelligibility Test
Tests of Speech Intelligibility (the term “Articulation Test” is archaic): the nature of the test and how it is administered differ in both content and form
Content
» Nonsense syllables
» Word tests (monosyllabic, disyllabic, etc.)
» Sentence tests
» Connected discourse
Form
» Open message set (listener is not informed of alternative answers)
» Closed message set (listener is informed of all possible alternative answers)
Aims of Test
Tests of speech intelligibility can be used for multiple purposes:
» Evaluate human listeners
» Evaluate transmission systems (e.g., telephones)
» Evaluate speech processing devices or strategies (e.g., hearing aids)
Experiments have addressed questions such as 1) importance of various acoustic features, 2) importance of understanding auditory perception, 3) role of movement of vocal organs, 4) role of syntax and semantics, and so forth
The experiments fall into different categories
» Acoustic analysis of speech waves produced by talkers speaking normally -- Examples include:
Measurement of long-term spectrum of speech
Measurement of vowel formant frequencies
» Eliminate or modify naturally produced speech and measure the effect on its intelligibility -- Examples include:
Filtered speech
Masked speech
Interrupted speech
» Synthesize speech and, for example, manipulate one feature of a single phoneme, holding all other characteristics constant, and study effects on perception -- Examples include:
Vary VOT for a /b/ -- /p/ continuum to learn when perception changes from one phoneme to another
Vary F2 transition for a /b/ -- /d/ -- /g/ continuum to learn when perception changes from one phoneme to another
The Early Experiments: The 1920’s Through the 1940’s
Common observations
» 50% intelligibility corresponds to an intensity of about 30-35 dB SPL
» Intelligibility changes (increases or decreases) at a rate of about 6% per decibel for these monosyllabic wordsp y
» Individual phonemes vary in the intensity at which they become intelligible
What class of phonemes would you expect to be most intelligible at low intensities?
What class of phonemes would you expect to require the greatest intensity before becoming intelligible?
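The two numbers on this slide (50% intelligibility at about 30-35 dB SPL and a slope of about 6% per decibel) are enough for a rough straight-line approximation of the intelligibility function. The sketch below assumes a midpoint of 32.5 dB SPL (the middle of the quoted range) and a linear function clipped to 0-100%; both are simplifications introduced here, since measured functions flatten gradually near the floor and ceiling.

```python
def percent_intelligible(level_db_spl, midpoint_db=32.5, slope_pct_per_db=6.0):
    """Straight-line approximation of word intelligibility versus speech level.

    Assumptions: 50% at midpoint_db (32.5 dB SPL is the middle of the 30-35 dB
    range on the slide) and a constant slope, clipped to the 0-100% range.
    """
    score = 50.0 + slope_pct_per_db * (level_db_spl - midpoint_db)
    return max(0.0, min(100.0, score))

for level in (25, 30, 32.5, 35, 40, 45):
    print(level, "dB SPL ->", percent_intelligible(level), "%")
```

With these assumed values, a change of only about 17 dB takes the score from near 0% to near 100%, which is why small level changes around threshold matter so much.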
The Early Experiments: The 1920’s Through the 1940’s (Cont’d)
The Effect(s) of Noise
» What characteristics of the noise will help determine its effects on intelligibility?
intensity
spectrum
temporal pattern
Experiments With Filtered Speech
low-pass filter (cutoff frequency --- percent intelligibility)
» 500 Hz --- about 7%
» 1000 Hz --- about 28%
» 2000 Hz --- about 68%
» 5000 Hz --- about 95%
high-pass filter (cutoff frequency --- percent intelligibility)
» 500 Hz --- about 95%
» 1000 Hz --- about 90%
» 2000 Hz --- about 68%
» 5000 Hz --- about 5%
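The low-pass and high-pass scores are easier to compare side by side. The sketch below copies the slide's values into two dictionaries and looks for cutoffs where both filter types give the same score; the function and variable names are illustrative.

```python
# Percent intelligibility by cutoff frequency (Hz), copied from the slide.
LOW_PASS = {500: 7, 1000: 28, 2000: 68, 5000: 95}
HIGH_PASS = {500: 95, 1000: 90, 2000: 68, 5000: 5}

def crossover_cutoffs(low_pass, high_pass):
    """Return the cutoffs at which both filter types give the same score."""
    shared = sorted(set(low_pass) & set(high_pass))
    return [f for f in shared if low_pass[f] == high_pass[f]]

def print_table(low_pass, high_pass):
    """Print the two sets of scores as one table, one row per cutoff."""
    print(f"{'cutoff (Hz)':>12} {'low-pass %':>11} {'high-pass %':>12}")
    for f in sorted(set(low_pass) & set(high_pass)):
        print(f"{f:>12} {low_pass[f]:>11} {high_pass[f]:>12}")

print_table(LOW_PASS, HIGH_PASS)
print("crossover:", crossover_cutoffs(LOW_PASS, HIGH_PASS))  # [2000]
```

In these data the two curves cross at 2000 Hz, where either half of the spectrum alone yields about 68%. One reading is that the bands below and above roughly 2 kHz contribute about equally to intelligibility.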
What is the relevance of experiments with filtered speech?
Design of telephone
Design of speech processing systems
Further understanding human hearing
Have fun
…
Experiments with Distorted Speech
The figure illustrates one form of distortion to speech called PEAK CLIPPING
SPEECH QUALITY is altered drastically, but intelligibility may be preserved at 80-90% --- Why?
Another form of distortion is TEMPORAL INTERRUPTION, in which the speech stream is divided into segments, and alternate segments are discarded, leaving gaps of silence between segments that remain
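Both distortions described above are simple operations on a sampled waveform. The sketch below applies them to a sine wave standing in for speech; the sample rate, clipping limit, and segment length are arbitrary illustrative values, and the function names are hypothetical.

```python
import math

def peak_clip(samples, limit):
    """Peak clipping: flatten every sample whose magnitude exceeds the limit."""
    return [max(-limit, min(limit, s)) for s in samples]

def interrupt(samples, segment_len):
    """Temporal interruption: keep even-numbered segments, silence odd ones."""
    out = []
    for i, s in enumerate(samples):
        keep = (i // segment_len) % 2 == 0
        out.append(s if keep else 0.0)
    return out

# Stand-in signal: a 100 Hz sine sampled at 8 kHz for 0.1 s (not real speech).
rate = 8000
signal = [math.sin(2 * math.pi * 100 * n / rate) for n in range(800)]

clipped = peak_clip(signal, 0.1)   # keeps zero crossings, discards peak shape
gated = interrupt(signal, 80)      # 10 ms on, 10 ms off at this sample rate

print(max(clipped), min(clipped))
print(sum(1 for s in gated[80:160] if s != 0.0))  # samples left in a gap: 0
```

Clipping removes amplitude detail but leaves the timing of zero crossings intact, which is one commonly offered reason why intelligibility survives even when quality does not.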
The 1950’s and Later
In the 1950’s, engineers built electronic circuits, called SPEECH SYNTHESIZERS, that produced speech-like signals
One such synthesizer, used extensively in experiments on speech perception, was the PATTERN-PLAYBACK SYNTHESIZER, which was essentially a sound spectrograph in reverse
http://www.haskins.yale.edu/featured/patplay.html
Summary of the Early Experiments
Speech is intelligible over a wide range of intensities
Speech is intelligible in the presence of a reasonably intense masker
Speech is intelligible if we preserve only part of the spectrum
Speech is intelligible in the presence of reasonably severe distortion
Thus, speech is a relatively impervious signal, and no one part of the waveform is indispensable
3. Theories of Speech Perception
» Is there something special or unique about the mechanisms for perception of speech that makes them different from those for perception of other acoustic signals, music for example?
A Motor Theory of Speech Perception
The listener continuously articulates the incoming stream of speech and compares the “auditory result of their own articulation with the incoming auditory patterns.”
In other words, speech is perceived in articulatory terms
The distinction between perceptual dimensions and physical dimensions
Physical dimensions: Any aspect of a physical stimulus that could be measured in a straightforward way with an instrument (e.g., a light meter, a sound level meter, a spectrum analyzer, a fundamental frequency meter, etc.)
Perceptual dimensions: These are the mental experiences that occur inside the mind of the observer. These experiences are actively created by the sensory system and brain based on an analysis of the physical properties of the stimulus. Perceptual dimensions can be measured, but not with a meter. Measuring perceptual dimensions requires an observer (e.g., a listener).
Voice Onset Time (VOT)
VOT = Interval between articulatory release and onset of voicing.
[Figure: waveforms marking the release and the voicing onset; in the example with VOT ~0 msec, release and voicing onset are approximately simultaneous]
VOT continuum
• Synthetic speech tokens
• synthesis allows for variations that cannot be produced naturally
• a gradual acoustic progression from /ta/ to /da/
• first token has a VOT of 60 ms (long VOT)
• last token has a VOT of 0 ms (voicing begins immediately)
• perception has sharp cross-over boundary
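A continuum like the one described above, and the sharp cross-over it produces, can be sketched numerically. Everything beyond the 60 ms and 0 ms endpoints is an assumption made for illustration: the number of tokens, the logistic shape of the identification function, and the boundary and steepness values are hypothetical and are not data from the slides.

```python
import math

def vot_continuum(first_ms=60.0, last_ms=0.0, n_tokens=13):
    """Equally spaced VOT values from the first token to the last.

    n_tokens is a hypothetical choice; the slide gives only the endpoints.
    """
    step = (last_ms - first_ms) / (n_tokens - 1)
    return [first_ms + i * step for i in range(n_tokens)]

def percent_ta(vot_ms, boundary_ms=30.0, steepness_ms=3.0):
    """Hypothetical identification function: percent of /ta/ responses.

    A logistic curve is assumed; boundary_ms and steepness_ms are made-up
    illustrative parameters, not measured values.
    """
    return 100.0 / (1.0 + math.exp(-(vot_ms - boundary_ms) / steepness_ms))

for vot in vot_continuum():
    print(f"VOT {vot:4.0f} ms -> {percent_ta(vot):5.1f}% /ta/")
```

The point of the sketch is the shape: equal acoustic steps along the continuum produce almost no change in labeling at either end and a rapid change near the boundary.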
Categorical Perception
[Figure: Identification panel, percent response (0 to 100) against stimulus number (0 to 12); Discrimination panel, percent correct (40 to 100) against discriminated pair (1-2 through 11-12)]
Categorical Perception
What was originally claimed?
» Non-speech sounds are perceived continuously, not categorically
» Speech sounds are perceived categorically, not continuously (more so for plosive stops --- less so for vowels)
» Therefore, there is something special about speech processing that is not applicable to perception of other acoustic signals
Categorical Perception is affected by learning experience.
[Figure: discrimination, percent correct (40 to 100) against discriminated pair, with separate curves for American and Japanese listeners]
Effects of Language Experience
Hindi dental vs. retroflex stop
• high frequency energy in burst
• transitions falling into the vowel
• similar to English alveolar stop /d/
• burst has more central energy intensity
• flatter F3 and F4 transitions
Perceptual Magnet Effect
Kuhl • Cathedral Centennial • 2001
Bimodal Speech Perception: The McGurk Effect
Let’s do an experiment! Close your eyes and listen to instructions.
http://ilabs.washington.edu:16080/kuhl/research.html
http://www.haskins.yale.edu/research/tonecombo.html
[Demo panels: Tone 1, Tone 2, Tone 3, Tones 1+2, Tones 1+3, Tones 1+2+3, Natural Speech]
Wrapping up
Categorical perception shows the existence of phonetic boundaries.
Perceptual magnet shows the role of prototypes in category assimilation.
Perception of speech sounds is affected by language experience.
Our auditory system pays attention to the acoustic cues that differentiate the sounds of our language(s)
Spectrograms show all acoustic variation over time, even that which we don’t perceive
Variation that we have learned not to pay attention to may actually cue category distinctions in other languages
Perception of Speech
Perception of speech is realized by acoustic information in combination with articulatory, linguistic, semantic, and circumstantial cues
Perception likely is the result of a complicated interaction between incoming information and information that is stored