Sonorant Grab Bag

March 27, 2014

Speech Synthesis: A Basic Overview

• Speech synthesis is the generation of speech by machine.

• The reasons for studying synthetic speech have evolved over the years:

1. Novelty

2. To control acoustic cues in perceptual studies

3. To understand the human articulatory system

• “Analysis by Synthesis”

4. Practical applications

• Reading machines for the blind, navigation systems

Speech Synthesis: A Basic Overview

• There are four basic types of synthetic speech:

1. Mechanical synthesis

2. Formant synthesis

• Based on Source/Filter theory

3. Concatenative synthesis

• = stringing bits and pieces of natural speech together

4. Articulatory synthesis

• = generating speech from a model of the vocal tract.

1. Mechanical Synthesis

• The very first attempts to produce synthetic speech were made without electricity.

• = mechanical synthesis

• In the late 1700s, models were produced which used:

• reeds as a voicing source

• differently shaped tubes for different vowels

Mechanical Synthesis, part II

• Later, Wolfgang von Kempelen and Charles Wheatstone created a more sophisticated mechanical speech device…

• with independently manipulable source and filter mechanisms.

Mechanical Synthesis, part III

• An interesting historical footnote:

• Alexander Graham Bell and his “questionable” experiments with his dog.

• Mechanical synthesis has largely gone out of style ever since.

• …but check out Mike Brady’s talking robot.

The Voder

• The next big step in speech synthesis was to generate speech electronically.

• This was most famously demonstrated at the New York World’s Fair in 1939 with the Voder.

• The Voder was a manually controlled speech synthesizer.

• (operated by highly trained young women)

Voder Principles

• The Voder basically operated like a vocoder.

• Voicing and fricative source sounds were filtered by 10 different resonators…

• each controlled by an individual finger!

• Only about 1 in 10 trainees was able to learn to play the Voder.

Overtone Singing

• F0 stays the same (on a “drone”), while the singer shapes the vocal tract so that individual harmonics (“overtones”) resonate.

• What kind of voice quality would be conducive to this?

Vowels and Sonorants

• So far, we’ve talked a lot about the acoustics of vowels:

• Source: periodic openings and closings of the vocal folds.

• Filter: characteristic resonant frequencies of the vocal tract (above the glottis)

• Today, we’ll talk about the acoustics of sonorants:

• Nasals

• Laterals

• Approximants

• The source/filter characteristics of sonorants are similar to vowels… with a few interesting complications.

Damping

• One interesting acoustic property exhibited by (some) sonorants is damping.

• Recall that resonance occurs when:

• a sound wave travels through an object

• that sound wave is reflected...

• ...and reinforced, on a periodic basis

• The periodic reinforcement sets up alternating patterns of high and low air pressure

• = a standing wave

Resonance in a closed tube

(Figure: resonance in a closed tube, shown over time)

Damping, schematized

• In a closed tube:

• With only one pressure pulse from the loudspeaker, the wave will eventually dampen and die out.

• Why?

• The walls of the tube absorb some of the acoustic energy, with each reflection of the standing wave.
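
The loss at each reflection can be sketched numerically. The Python snippet below is only an illustration: it assumes (the slides do not say this) that the walls remove the same fixed fraction of the wave's amplitude at every reflection, and the function name and the fractions 0.05 and 0.30 are made up for the example.

```python
def amplitude_after(reflections, absorbed_fraction, initial=1.0):
    """Amplitude left after a number of reflections.

    Assumes each reflection removes the same fraction of the amplitude
    (an illustrative simplification, not a measured value).
    """
    return initial * (1.0 - absorbed_fraction) ** reflections


# Compare a lightly absorbing wall with a heavily absorbing one.
for k in (0, 5, 10, 20):
    light = amplitude_after(k, absorbed_fraction=0.05)
    heavy = amplitude_after(k, absorbed_fraction=0.30)
    print(f"after {k:2d} reflections: light={light:.3f}  heavy={heavy:.3f}")
```

Under this assumption the wave never stops abruptly; it shrinks by a constant factor per reflection, and the larger the absorbed fraction, the faster it fades.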

Damping Comparison

• A heavily damped wave will die out more quickly...

• ...than a lightly damped wave:

Damping Factors

• The amount of damping in a tube is a function of:

• The volume of the tube

• The surface area of the tube

• The material of which the tube is made

• More volume, more surface area = more damping

• Think about the resonant characteristics of:

• a Home Depot

• a post-modern restaurant

• a movie theater

• an anechoic chamber

An Anechoic Chamber

Resonance and Recording

• Remember: any room will reverberate at its characteristic resonant frequencies

• Hence: high quality sound recordings need to be made in specially designed rooms which damp any reverberation

• Examples:

• Classroom recording (29 dB signal-to-noise ratio)

• “Soundproof” booth (44 dB SNR)

• Anechoic chamber (90 dB SNR)
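
To get a feel for what these dB figures mean, they can be converted to plain ratios. The sketch below assumes the usual amplitude convention (dB = 20 * log10 of the signal-to-noise amplitude ratio); the function name is hypothetical.

```python
def snr_db_to_amplitude_ratio(snr_db):
    """Convert an SNR in dB to a signal/noise amplitude ratio.

    Assumes the 20 * log10 (amplitude) convention.
    """
    return 10 ** (snr_db / 20)


rooms = [("classroom", 29), ("soundproof booth", 44), ("anechoic chamber", 90)]
for room, snr_db in rooms:
    ratio = snr_db_to_amplitude_ratio(snr_db)
    print(f"{room}: signal amplitude is about {ratio:,.0f} times the noise amplitude")
```

Because the scale is logarithmic, each additional 20 dB multiplies the amplitude ratio by 10.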

Spectrograms

(Figure: spectrograms of the classroom, “soundproof” booth, and anechoic chamber recordings)

Inside Your Nose

• In nasals, air flows through the nasal cavities.

• The resonating “filter” of nasal sounds therefore has:

• increased volume

• increased surface area

• increased damping

• Note:

• the exact size and shape of the nasal cavities vary widely from speaker to speaker.

Nasal Variability

• Measurements based on MRI data (Dang et al., 1994)

Damping Effects, part 1

[m] [m]

• Damping by the nasal cavities decreases the overall amplitude of the sound coming out through the nose.

Damping Effects, part 2

• How might the power spectrum of an undamped wave compare to that of a damped wave?

• A: Undamped waves have only one component;

• Damped waves have a broader range of components.

Here’s Why

(Figure: 100 Hz sinewave + 90 Hz sinewave + 110 Hz sinewave)

The Result

(Figure: the sum of the 90 Hz, 100 Hz, and 110 Hz sinewaves)

• If the 90 Hz and 110 Hz components have less amplitude than the 100 Hz wave, there will be less damping:
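
The effect of the side components can be checked with a few lines of Python. This is a rough sketch, assuming cosine components that all start in phase at t = 0; the function names and the side amplitudes tried are illustrative only.

```python
import math


def three_component_wave(t, side_amplitude):
    """100 Hz cosine plus 90 Hz and 110 Hz cosines of equal, smaller amplitude."""
    return (math.cos(2 * math.pi * 100 * t)
            + side_amplitude * math.cos(2 * math.pi * 90 * t)
            + side_amplitude * math.cos(2 * math.pi * 110 * t))


def peak_ratio(side_amplitude):
    """Height of the peak at t = 0.05 s relative to the peak at t = 0 s."""
    return three_component_wave(0.05, side_amplitude) / three_component_wave(0.0, side_amplitude)


for a in (0.1, 0.25, 0.4):
    print(f"side amplitude {a}: later peak is {peak_ratio(a):.2f} of the first peak")
```

The weaker the 90 Hz and 110 Hz components, the less the peaks shrink over the first 0.05 s. With only three components the wave swells again after that point; a wave that dies out for good needs a continuous band of frequencies around 100 Hz.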

Damping Spectra

(Figure: spectra for light, medium, and heavy damping)

• Damping increases the bandwidth of the resonating filter.

• Bandwidth = the range of frequencies over which a filter responds at .707 or more of its maximum output (that is, within about 3 dB of the peak).

• Nasal formants will have a larger bandwidth than vowel formants.
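
The link between damping and bandwidth can be explored with a toy filter. The sketch below assumes a simple two-pole resonator as a stand-in for a single formant (the slides do not specify a filter model), in which the decay rate of the resonance is pi times the nominal bandwidth. It then measures the .707 bandwidth directly from the response curve. All names and the 1000 Hz, 50 Hz, and 200 Hz values are hypothetical.

```python
import math


def resonator_gain(f, f0, nominal_bandwidth):
    """Magnitude response of a two-pole resonator (an assumed toy model)."""
    alpha = math.pi * nominal_bandwidth   # decay rate in 1/s: more damping, larger alpha
    wd = 2 * math.pi * f0                 # ringing frequency in rad/s
    w = 2 * math.pi * f
    # H(jw) = 1 / ((jw + alpha)^2 + wd^2); return its magnitude
    return 1.0 / math.hypot(alpha ** 2 + wd ** 2 - w ** 2, 2 * alpha * w)


def measured_bandwidth(f0, nominal_bandwidth, step=0.5):
    """Width in Hz of the region where the gain is at least .707 of its maximum."""
    freqs = [i * step for i in range(1, int(3 * f0 / step))]
    gains = [resonator_gain(f, f0, nominal_bandwidth) for f in freqs]
    peak = max(gains)
    passband = [f for f, g in zip(freqs, gains) if g >= 0.707 * peak]
    return passband[-1] - passband[0]


for b in (50, 200):
    print(f"decay rate pi*{b}: measured bandwidth is about {measured_bandwidth(1000, b):.0f} Hz")
```

In this model a resonance that dies out four times faster has a peak roughly four times wider, which is the pattern the slides describe for nasal versus vowel formants.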

Bandwidth in Spectrograms

The formants in nasals have increased bandwidth, in comparison to the formants in vowels.

F3 of [m] F3 of

Nasal Formants

• The values of formant frequencies for nasal stops can be calculated according to the same formula that we used to calculate formant frequencies for an open tube.

• fn = (2n - 1) * c / (4L)

• The simplest case: uvular nasal [ɴ].

• The length of the tube is a combination of:

• distance from glottis to uvula (9 cm)

• distance from uvula to nares (12.5 cm)

• An average tube length (for adult males): 21.5 cm

The Math

• Glottis to uvula: 9 cm; uvula to nares: 12.5 cm

• fn = (2n - 1) * c / (4L)

• L = 21.5 cm

• c = 35000 cm/sec

• F1 = 35000 / 86 = 407 Hz

• F2 = 1221 Hz

• F3 = 2035 Hz
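
The same arithmetic can be reproduced in a few lines of Python. This is a minimal sketch using the slide's values (L = 21.5 cm, c = 35000 cm/sec); the function name is hypothetical.

```python
def quarter_wave_resonances(length_cm, count=3, c=35000.0):
    """Resonances fn = (2n - 1) * c / (4L) for n = 1..count, in Hz."""
    return [(2 * n - 1) * c / (4 * length_cm) for n in range(1, count + 1)]


tract_length = 9.0 + 12.5   # glottis-to-uvula plus uvula-to-nares, in cm
for n, f in enumerate(quarter_wave_resonances(tract_length), start=1):
    print(f"F{n} = {f:.0f} Hz")
# prints 407, 1221 and 2035 Hz on successive lines
```

Each higher formant is an odd multiple (3, 5, ...) of F1, so the formants are evenly spaced at 2 * F1 apart.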

The Real Thing

• Check out Peter’s production of a uvular nasal in Praat.

• And also Dustin’s neutral vowel!

• Note: the higher formants are low in amplitude

• Some reasons why:

• Overall damping

• “Nostril-rounding” reduces intensity

• Resonance is lost in the side passages of the sinuses.

• Nasal stops with fronter places of articulation also have anti-formants.

Anti-Formants

• For nasal stops, the occlusion in the mouth creates a side cavity.

• This side cavity resonates at particular frequencies.

• These resonances absorb acoustic energy in the system.

• They form anti-formants

Anti-Formant Math

• Anti-formant resonances are based on the length of the side cavity (the closed-off mouth tube).

• For [m], this length is about 8 cm.

• fn = (2n - 1) * c / (4L)

• L = 8 cm

• AF1 = 35000 / (4 * 8) = 1094 Hz

• AF2 = 3281 Hz

• etc.
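
As a quick check on these numbers, the anti-formant series for [m] can be computed directly. The sketch assumes the slide's values (an 8 cm side cavity, c = 35000 cm/sec); the function name is made up for the example.

```python
def anti_formants(side_cavity_cm, count=3, c=35000.0):
    """Anti-formants AFn = (2n - 1) * c / (4L) of a side cavity, in Hz."""
    return [(2 * n - 1) * c / (4 * side_cavity_cm) for n in range(1, count + 1)]


for n, af in enumerate(anti_formants(8.0), start=1):
    print(f"AF{n} for [m] = {af:.0f} Hz")
```

The first two values match AF1 and AF2 above; the third continues the same odd-multiple pattern.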

Spectral Signatures

• In a spectrogram, acoustic energy drops, or disappears completely, at the anti-formant frequencies.

(Figure: spectrogram with the anti-formants marked)

Nasal Place Cues

• At more posterior places of articulation, the “anti-resonating” tube is shorter.

• As a result, anti-formant frequencies will be higher.

• for [n], L = 5.5 cm

• AF1 = 1600 Hz

• AF2 = 4800 Hz

• for , L = 3.3 cm

• AF1 = 2650 Hz

• for , L = 2.3 cm

• AF1 = 3700 Hz
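
Read in the other direction, an anti-formant measured in a spectrogram gives a rough estimate of the side-cavity length, and so a hint about place of articulation. The sketch below simply inverts AF1 = c / (4L) with c = 35000 cm/sec; the function name is hypothetical, and because the slide values look rounded, the recovered lengths only approximately match those listed above.

```python
def side_cavity_length_cm(af1_hz, c=35000.0):
    """Invert AF1 = c / (4L) to estimate the side-cavity length L in cm."""
    return c / (4 * af1_hz)


for af1 in (1600, 2650, 3700):
    print(f"AF1 = {af1} Hz -> L is roughly {side_cavity_length_cm(af1):.1f} cm")
```

A higher first anti-formant always maps to a shorter side cavity, that is, to a constriction further back in the mouth.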

[m] vs. [n]

• Production of [meno], by a speaker of Tsonga

• Tsonga is spoken in South Africa and Mozambique

(Figure: spectrogram of [meno], segmented as [m] [e] [n] [o], with AF1 marked for [m] and for [n])

Nasal Stop Acoustics: Summary

• Here’s the general pattern of what to look for in a spectrogram for nasals:

1. Periodic voicing.

2. Overall amplitude lower than in vowels.

3. Formants (resonance).

4. Formants have broad bandwidths.

5. Low frequency first formant.

6. Less space between formants.

7. Higher formants have low amplitude.

Perceiving Nasal Place

• Nasal “murmurs” do not provide particularly strong cues to place of articulation.

• Can you identify the following as [m], [n] or ?

• Repp (1986) found that listeners can only distinguish between [n] and [m] 72% of the time.

• Transitions provide important place cues for nasals.

• Repp (1986): 95% of nasals identified correctly when presented with the first 10 msec of the following vowel.

• Can you identify these nasal + transition combos?
