

EE516 Computer Speech Processing, University of Washington, Dept. of Electrical Engineering, Winter 2005

Lecture 5: Jan 19, 2005. Lecturer: Prof. J. Bilmes <[email protected]>. Scribe: Kevin Duh

5.1 Overview

In the last lecture, we studied the anatomy of the hearing organ (outer, middle, and inner ear) and two views of frequency encoding in the hair cells (Place Theory and Timing Theory). In this lecture, we look at two broad topics:

1. How sound and speech are perceived

2. How speech is produced

In the first topic, we shall discover that sound/speech perception is not as simple as a bunch of dancing hair cells. Instead, psycho-acoustical experiments have shown complex phenomena in sound perception, such as hearing thresholds, a non-linear relationship between intensity and loudness, and temporal/simultaneous masking. We also briefly discuss the difficulty of speech perception research and overview some notable experiments regarding intelligibility tests on spectrally-filtered speech and Gaussian-scaled speech.

In the second topic, we will develop a simple mathematical model for speech production based on the source-filter model. Specifically, we will model the vocal tract as a uniform lossless tube and derive the dynamics of acoustic pressure waves within it. As we shall see, this simple model elegantly explains the formant structure of the vowel schwa, and with some extensions, we can also model the vocal tract for other vowels. Finally, we also derive the 1-D wave equations that serve as the basis for the lossless tube derivations.

5.2 Perception of Sound and Speech

5.2.1 Sound Perception

5.2.1.1 Thresholds of Hearing and Feeling

The threshold of hearing is the minimum intensity at which one can perceive a sound; the threshold of feeling is the point where any increase in sound intensity begins to cause physical pain. As shown in Figure 5.1, these thresholds vary across frequencies. The intensity of sound is measured in terms of sound pressure level (SPL) in units of decibels (dB).

Note that the threshold of hearing decreases sharply as we progress from 20Hz to 2000Hz, but increases rapidly thereafter. This bandpass phenomenon is due to the filtering of the outer and middle ear, and to the sensitivity of hair cells to different frequencies. From this curve, we can note the interesting fact that it is easier to hear female speakers at low volumes, since females have a pitch range corresponding to a lower threshold of hearing. (Recall that the F0 ranges of males and females are 50-250Hz and 120-500Hz, respectively.)
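As a quick illustration (a sketch, not from the lecture itself), dB SPL is computed relative to the standard reference pressure of 20 µPa, roughly the threshold of hearing at 1 kHz:

```python
import math

P_REF = 20e-6  # standard reference pressure in pascals (approx. 1 kHz hearing threshold)

def spl_db(pressure_pa: float) -> float:
    """Sound pressure level in dB SPL for an RMS pressure given in pascals."""
    return 20.0 * math.log10(pressure_pa / P_REF)

# A pressure equal to the reference is 0 dB SPL;
# a pressure 10x larger is 20 dB SPL.
print(spl_db(20e-6))   # 0.0
print(spl_db(200e-6))  # ~20.0
```

The factor of 20 (rather than 10) appears because SPL is defined on pressure amplitude, whose square is proportional to intensity.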


The plot also shows the frequencies of speech, which lie in a region where the threshold of hearing is relatively constant. The threshold of hearing is higher around 7000Hz, but only fricatives contain significant amounts of energy at these frequencies, which has little effect on speech intelligibility and naturalness. Some believe that evolution has adapted speech production to fit this region of auditory perception; others believe that the auditory system, which basically functions only as a hearing organ, is under less selective pressure than the vocal system, which simultaneously serves as a speech, eating, and smelling organ.

Figure 5.1: Hearing thresholds across different frequencies and frequency range of speech. [O87]

5.2.1.2 Loudness ≠ Intensity

The perception of loudness is not the same as the raw intensity of sound pressure waves. Fletcher and Munson developed a plot (Figure 5.2) which shows that our perception of loudness varies across frequencies. Sound pressure waves at different intensities may be perceived similarly in terms of loudness, depending on the waves' frequencies. This forms the equal loudness curve.

Since the intensity of sound pressure waves does not correspond to the perception of loudness, a phon is defined as a unit of loudness. As shown in Figure 5.3, the equal loudness curves define the different levels of phons. Speech typically lies between 20 and 80 phons.

Figure 5.2: Fletcher and Munson curves [O87]


Figure 5.3: Phons and Equal Loudness Curves [O87]

5.2.1.3 Masking

Masking occurs when the perception of one sound is obscured by the presence of another sound. This happens when one sound raises the threshold of hearing for the other sound. Two types of masking are:

• Simultaneous Masking – two sounds occur at once, and one masks the other. The lower frequency sound masks the higher frequency sound, and this typically occurs when both sounds fall within the same neural critical band.

• Temporal Masking – two sounds occur in sequence in time, and one masks the other. If the earlier sound masks the later sound, this is known as forward temporal masking. The opposite case is called backward temporal masking. Forward masking can be explained by neural fatigue, where the neuron needs some time to recharge after firing at the first sound. Backward masking very likely involves a blocking phenomenon, where the processing of the earlier sound at a higher auditory level is interrupted by a later, louder sound.

Masking is widely exploited in audio coding. Sounds that are masked do not need to be encoded, because the listener will not be able to perceive them anyway. (However, a select few people with "golden ears" may be able to hear the additive noise created by such a lossy encoder.)

5.2.1.4 Neural Tuning Curves

A neural tuning curve shows the threshold of hearing for a single neuron. These curves are helpful in determining the critical band, the range of frequencies where a neuron is active. They can be determined by probing a single neuron and measuring its response, in terms of spiking rate, to a given stimulus.

Neural tuning curves from probing an anesthetized cat are shown in Figure 5.4. These curves show the threshold where neurons begin to respond above spontaneous firing. From the figure, we observe that (1) neural tuning curves exhibit roughly constant Q, and (2) different neurons have different tuning curves and critical frequencies. This is one piece of evidence supporting von Békésy's Place Theory of Frequency Encoding.

The psychophysical tuning curve for humans can be obtained by using masking (Figure 5.5). One such procedure uses a fixed-level tone in narrowband noise (the masker). The tone is fixed at a low level (around 10dB SPL) to ensure that only one auditory filter responds. Then, the masker frequency and intensity are adjusted to the points where the listener loses perception of the tone (due to the masking effect). These points form the upside-down critical frequency curve of that "neuron."


Figure 5.4: Neural tuning curves taken from anesthetized cats.

Figure 5.5: Psychophysical tuning curves of humans.

5.2.2 Speech Perception

5.2.2.1 Challenges in Speech Perception Research

Speech perception occurs at a much higher level, in the auditory cortex or brain. This is an active research area, as not much is yet understood. The goal is to find invariant acoustic cues for different speech sounds. For example, what discriminates between phonemes, phones, syllables, words, phrases, or sentences? These are difficult to figure out, since the acoustic cues for a sound change depending on its context.

For example, Figure 5.6 shows the formants of [di] and [du]. Although the phone [d] is the same in both cases, the contextual vowel following it changes the F2 transition drastically. Therefore, humans must be using more information than formants when perceiving speech.


Figure 5.6: Formant transition is different depending on context. Example of [di], [du]. [LB88]

5.2.2.2 Spectral Regions of Speech Perception

The frequencies from 200Hz to 5500Hz are the most important for speech perception. Experiments in speech perception are usually performed by filtering out various spectral regions in a speech signal, then measuring intelligibility with test subjects.

For example, after filtering out frequencies below 1000Hz, the discriminability of voicing and manner of articulation decreases (/b/ vs. /p/ vs. /v/). After filtering out frequencies above 1200Hz, place of articulation discriminability drops (/p/ vs. /t/). Note that the telephone bandwidth is 200-4000Hz, which is good enough for speech intelligibility. Although the "E-set" (/p/, /d/, /e/, /g/, /c/) and fricatives like /f/ and /s/ may be difficult to discriminate, intelligibility is usually not a problem when words are spoken in context.
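As a sketch of how such band-limiting can be simulated (an illustration only, not the apparatus used in the original studies), a brick-wall FFT filter can restrict a signal to the 200-4000Hz telephone band:

```python
import numpy as np

def bandlimit(signal: np.ndarray, fs: float,
              lo: float = 200.0, hi: float = 4000.0) -> np.ndarray:
    """Zero out all spectral content outside [lo, hi] Hz (brick-wall filter)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spectrum[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

# Example: a 100 Hz tone is removed, while a 1000 Hz tone passes through.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 1000 * t)
y = bandlimit(x, fs)
```

Real perceptual experiments would of course use smoother filters; a brick-wall spectrum cut is simply the easiest way to see the effect on a test signal.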

Experiments with Gaussian-scaled speech test intelligibility by zeroing out windows of speech at different locations in time. The surprising finding is that if the Gaussian windows are short enough in duration, no particular zeroed location on the time/frequency axis contains the critical information that makes the speech incomprehensible. Apparently we can infer what is missing from the other sound segments.

The bottom line is that humans are remarkably good at perceiving speech. Even in the presence of noise or missing sounds, we automatically use a variety of cues and context to determine what was spoken.

5.3 Speech Production: A Mathematical Model of the Vocal Tract

A mathematical model of the speech production process is important because it sheds light on the characteristics of speech and serves as the basis for speech analysis and synthesis. With a good model, we should be able to build good speech synthesizers. Further, if we can produce an inverse model that goes from the acoustic signal to model parameters, we should be able to do better speech recognition as well.

To begin our derivation of a mathematical model, we start with a joke:

JOKE: A group of wealthy investors wanted to predict the outcome of horse races so that they could become even richer. Therefore, they hired a group of eminent physicists from around the world to research the issue. After a year, the physicists returned with the promise of a solution. They said they had developed a solution that would accurately predict the outcome of any race without fail! The investors were very eager to hear this great discovery. So the head physicist reported, "Well, first we need to simplify the problem. Assume the horse is a perfect sphere..."

What we will do in the following derivation is to assume that the vocal tract is a uniform lossless tube excited by a source. Comparing Figure 5.7 to Figure 5.8, we see that we have vastly simplified the situation. In the next lecture, we will extend the uniform tube derived here into a concatenation of tubes of varying sizes. It turns out that despite the simplifying assumptions, the models do shed light on the speech production process and in some cases perform adequately in practice.

Figure 5.7: Schematic of Human Vocal Tract [O87]

Figure 5.8: Uniform Lossless Tube Vocal Tract Model


5.3.1 Simplified Production Model: Overview

Figure 5.9: Speech production as a source-filter model. The source e(t) is the glottal pulse; the filter impulse response v(t) defines the resonances.

The simplified production model we will adopt is a time-varying system (vocal tract) that is excited by a periodic source (glottis), as shown in Figure 5.9. The glottal source generates a sawtooth wave, which is superior to square waves and triangle waves in that it contains all the harmonics. The vocal tract is a time-varying filter that shapes the excitation signal with different formants. In other words, we have:

$$S(j\Omega) = V(j\Omega)\,I(j\Omega)\,G(j\Omega)$$

where V(jΩ) is the vocal tract response, G(jΩ) is the glottal pulse spectral shape, I(jΩ) is a periodic pulse train, and S(jΩ) is the output speech spectrum. Each of these models can be approximated separately. However, doing so makes an independence assumption between glottal excitation and vocal tract coloring, which is mostly but not entirely true. For example, in Lombard speech, the presence of noise affects speaker effort in a non-linear fashion. Also, in the extremely fast speech of Steve Woodmore (637 words/minute), there is little pitch variation, which suggests a coupling of the source and filter.
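In the time domain, this spectral product becomes a cascade of convolutions, s(t) = v(t) * g(t) * i(t). A minimal numerical sketch follows; the pulse and filter shapes below are hypothetical stand-ins chosen only to illustrate the cascade, not fitted models:

```python
import numpy as np

fs = 8000                       # sample rate in Hz (assumed for illustration)
f0 = 100                        # pitch: one glottal impulse every fs // f0 samples

# i[n]: periodic impulse train, 100 ms long
i = np.zeros(fs // 10)
i[:: fs // f0] = 1.0

# g[n]: crude sawtooth-like glottal pulse (illustrative shape only)
g = np.linspace(1.0, 0.0, 40)

# v[n]: a single decaying resonance standing in for the vocal tract filter
n = np.arange(200)
v = np.exp(-n / 40.0) * np.sin(2 * np.pi * 500 * n / fs)

# s[n] = v[n] * g[n] * i[n]: convolutions matching S = V * I * G in frequency
e = np.convolve(i, g)           # glottal excitation e[n] = g[n] * i[n]
s = np.convolve(e, v)           # output speech-like signal
```

Because convolution is associative and commutative, the order of the cascade does not matter, which is exactly the independence assumption discussed above.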

5.3.2 Glottal Excitation Model

The glottal excitation e(t) is the convolution of a glottal pulse g(t), which has a specific spectral shape, with a periodic impulse train i(t). As seen in Figure 5.10, the glottal pulse looks like a half-wave rectified sine wave, where the closing phase occurs more rapidly than the opening phase. This is because the Bernoulli force closes the glottis rapidly, while the air pressure from the lungs opens the glottis gradually. As a result, we often model the glottal pulse as a sawtooth wave.

Figure 5.10: Glottal pulse and mouth sound pressure [O87]

The spectrum of a glottal pulse G(jΩ) is a low-pass signal with a cut-off frequency around 500Hz. The fall-off is about 12dB/octave.


5.3.3 Uniform Lossless Tube Vocal Tract Model

After modeling the glottal source, we will now model the vocal tract filter. In the uniform lossless tube model (Figure 5.8) of the vocal tract, we assume the following:

• The vocal tract is a single tube (and therefore time-variation of the vocal tract shape is not captured.)

• The tube is not curved (which actually does not have a large effect.)

• Losses due to heat conduction, viscous friction at vocal tract walls, softness of vocal tract walls, etc. are ignored.

• Radiation of sound occurs only at the lips.

• There is no nasal coupling.

Using the laws of conservation of mass, energy, and momentum, it can be shown that the acoustic waves in a uniform lossless tube satisfy the following 1-D wave equations:

$$\frac{\partial^2 p}{\partial x^2} = \frac{1}{c^2}\,\frac{\partial^2 p}{\partial t^2} \qquad (5.1)$$

$$\frac{\partial^2 u}{\partial x^2} = \frac{1}{c^2}\,\frac{\partial^2 u}{\partial t^2} \qquad (5.2)$$

• p = p(x, t): pressure as a function of position and time

• u = u(x, t): volume velocity of particles in the tube

• t: time

• x: particle position along the tube of length l

• c: speed of sound, approximately 340 m/s

Note: The wave equations are one-dimensional because we can assume planar wave propagation along the length of the tube when the wavelengths are sufficiently long compared to the diameter of the tube. (This is true in typical human vocal tracts for frequencies less than 4000Hz.)

The general solution to these partial differential equations has the form:

$$p(x,t) = p^+(t - x/c) + p^-(t + x/c) \qquad (5.3)$$

$$u(x,t) = u^+(t - x/c) + u^-(t + x/c) \qquad (5.4)$$

where p⁺ and u⁺ are the forward-propagating waves, p⁻ and u⁻ are the backward-propagating waves, and both are arbitrary functions with continuous first and second derivatives. One can imagine the waves as propagating along the length of the tube through time without changing shape.
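One can check numerically that any smooth forward-traveling function p⁺(t − x/c) satisfies the wave equation. The sketch below (using an arbitrary Gaussian pulse shape, chosen purely for illustration) compares central finite-difference estimates of the two sides of Eq. 5.1 at one interior point:

```python
import math

c = 340.0                           # speed of sound (m/s)

def pulse(s):
    """Arbitrary smooth pulse shape (a Gaussian, for illustration)."""
    return math.exp(-(200.0 * s) ** 2)

def p(t, x):
    """Forward-traveling wave p+(t - x/c): the shape moves right at speed c."""
    return pulse(t - x / c)

dx, dt = 1e-4, 1e-7                 # finite-difference step sizes
x0, t0 = 0.05, 1e-3                 # evaluation point inside the tube

# central second differences in x and in t
d2p_dx2 = (p(t0, x0 + dx) - 2 * p(t0, x0) + p(t0, x0 - dx)) / dx ** 2
d2p_dt2 = (p(t0 + dt, x0) - 2 * p(t0, x0) + p(t0 - dt, x0)) / dt ** 2

lhs = d2p_dx2
rhs = d2p_dt2 / c ** 2              # the two sides of Eq. 5.1 agree
```

The agreement holds for any twice-differentiable pulse shape, which is exactly what the general solution above asserts.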

Now we will find the vocal tract transfer function V(jΩ) to get a frequency-domain understanding of the system. We begin by assuming the boundary condition at x = 0 is an excitation by a complex exponential:

$$u(0,t) = u_G(t) = U_G(\Omega)\,e^{j\Omega t}$$


where $U_G(\Omega)$ is the complex amplitude.

As a result, the solutions for the forward and backward waves have the form:

$$u^+(t - x/c) = K^+\,e^{j\Omega(t - x/c)} \qquad (5.5)$$

$$u^-(t + x/c) = K^-\,e^{j\Omega(t + x/c)} \qquad (5.6)$$

Substituting these equations into Eqs. 5.3 and 5.4 and applying the boundary condition that the pressure at the lips is zero (the lips are open), p(l, t) = 0, we solve for the unknown constants K⁺ and K⁻, and arrive at the steady-state solutions:

$$p(x,t) = j Z_0\,\frac{\sin(\Omega(l - x)/c)}{\cos(\Omega l/c)}\; U_G(\Omega)\, e^{j\Omega t} \qquad (5.7)$$

$$u(x,t) = \frac{\cos(\Omega(l - x)/c)}{\cos(\Omega l/c)}\; U_G(\Omega)\, e^{j\Omega t} \qquad (5.8)$$

where $Z_0 = \rho c / A$ is the characteristic impedance of the tube.

The volume velocity at the lips, which occurs at x = l, is therefore:

$$u(l,t) = U(l,\Omega)\,e^{j\Omega t} = \frac{U_G(\Omega)}{\cos(\Omega l/c)}\,e^{j\Omega t} \qquad (5.9)$$

Finally, we can get the vocal tract transfer function:

$$V(\Omega) = \frac{U(l,\Omega)}{U_G(\Omega)} = \frac{1}{\cos(\Omega l/c)} \qquad (5.10)$$

Figure 5.11: Resonant frequencies of a uniform tube

Figure 5.11 shows a plot of V(Ω). This is an all-pole model, and the peaks occur when Ωl/c = (2n + 1)π/2. Thus, for a uniform tube model, the resonances fall at the evenly spaced frequencies f_n = (2n+1)c/(4l), n = 0, 1, 2, .... The spacing decreases as the vocal tract length l increases. The resonant frequencies are analogous to the formants we see in speech, and the single-tube model actually models the vowel schwa relatively well. As we shall see in the next lecture, differently-spaced resonant frequencies corresponding to different speech sounds can be modeled by concatenating uniform tubes of various sizes to approximate a varying vocal tract.
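Plugging in numbers makes this concrete. From Ωl/c = (2n+1)π/2, the resonances are f_n = (2n+1)c/(4l); assuming a vocal tract length of 17.5 cm (a typical adult male value, used here as an illustrative assumption) gives the familiar schwa formants near 500, 1500, and 2500 Hz:

```python
# Resonances of a uniform lossless tube, open at the lips:
#   Omega_n * l / c = (2n + 1) * pi / 2   =>   f_n = (2n + 1) * c / (4 * l)
c = 340.0      # speed of sound, m/s
l = 0.175      # assumed vocal tract length, m (typical adult male)

formants = [(2 * n + 1) * c / (4 * l) for n in range(3)]
print([round(f) for f in formants])  # [486, 1457, 2429]
```

The odd-harmonic spacing (1:3:5) is characteristic of a tube closed at one end (glottis) and open at the other (lips), a quarter-wavelength resonator.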


5.3.4 Derivation of 1-D Wave Equations

The 1-D wave equations (Eqs. 5.1, 5.2) we used for the single lossless tube model are not difficult to derive. The outline of our derivation is as follows:

1. Approximate particles in the tube as infinitesimally small boxes and use Newton's 2nd Law to model their motion. The result is $-\frac{\partial p}{\partial x} = \rho_0\,\frac{\partial u}{\partial t}$.

2. Use the Adiabatic Ideal Gas Law to model the change of volume of the box. The result is $\frac{\partial^2 p}{\partial x^2} = \frac{1}{c^2}\frac{\partial^2 p}{\partial t^2}$.

3. Do the same to get $\frac{\partial^2 u}{\partial x^2} = \frac{1}{c^2}\frac{\partial^2 u}{\partial t^2}$.

In the vocal tract, particles move when pressure is applied in some direction. To model this, we will consider an infinitesimally small cube of volume V = Δx · Δy · Δz, as shown in Figure 5.12. The glottis produces air pressure from the left end of the tube, so there is a force that pushes the box to the right. This force can be described by Newton's Second Law of Motion, F = ma. This model works well for sounds in the audible range (i.e., not more than about 110dB SPL).

Figure 5.12: Box in tube model. Pressure comes from the x=0 end.

Since pressure is defined as force over area (p = f/A), the net force on the cube, acting in the negative x direction, is $f_l = \Delta p \cdot A = \left(\frac{\partial p}{\partial x}\,\Delta x\right)(\Delta y\,\Delta z)$. The force causing a positive x acceleration is therefore the negative:

$$f = -\left(\frac{\partial p}{\partial x}\,\Delta x\right)(\Delta y\,\Delta z) \qquad (5.11)$$

Using Newton's Second Law, the force per unit volume is:

$$\frac{f}{V} = \frac{Ma}{V} = \frac{M}{V}\,\frac{\partial u}{\partial t} = \rho_0\,\frac{\partial u}{\partial t} \qquad (5.12)$$

where M is the mass of the box, u is the (particle) velocity of the gas, and ρ0 = M/V is the density of the gas in the box.

From Eq. 5.11, we have $-\frac{\partial p}{\partial x} = \frac{f}{\Delta x\,\Delta y\,\Delta z} = \frac{f}{V}$. Combining this with Eq. 5.12, we get our desired result for step 1 of the derivation:

$$-\frac{\partial p}{\partial x} = \rho_0\,\frac{\partial u}{\partial t} \qquad (5.13)$$

Next we proceed to step 2, where we apply the Ideal Gas Law to derive the second-derivative equation for pressure. Recall that the Charles-Boyle Ideal Gas Law states that PV = nRT, where

• P : pressure

• V : volume

• n: number of moles in the volume

• R: universal gas constant

• T : temperature in Kelvins

In general, this law describes the relationship between pressure and volume, depending on how temperature changes.There are two types of thermal conditions:

• Isothermal: The gas remains at constant temperature, which occurs if the change in pressure/volume is slow (quasi-static). The result is PV = constant.

• Adiabatic: No heat flows in or out of the system. This occurs if (1) the system is well insulated, or (2) the pressure/volume change happens so fast that heat has no time to flow in or out. Whether the latter holds depends on the wavelength of a typical sound.

The wavelength of an example speech sound at 1000Hz is λ = c/f = (340 m/s)/(1000 s⁻¹) = 0.34 m, so the length of a half-cycle is 0.17 m. Heat diffusion occurs at a rate of approximately 0.5 m/s. A half-cycle at 1000Hz lasts 0.5 ms, so heat travels only 2.5 × 10⁻⁴ m. This is much smaller than the half-cycle of the speech sound, so the process in the vocal tract is essentially adiabatic.
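The arithmetic above can be checked directly (same values as in the lecture):

```python
c = 340.0            # speed of sound, m/s
f = 1000.0           # example frequency, Hz
heat_rate = 0.5      # approximate heat diffusion rate, m/s

wavelength = c / f                         # 0.34 m
half_cycle_len = wavelength / 2            # 0.17 m
half_cycle_dur = 1 / (2 * f)               # 0.5 ms
heat_travel = heat_rate * half_cycle_dur   # 2.5e-4 m

# heat moves a tiny fraction of the half-cycle length -> adiabatic
print(heat_travel / half_cycle_len)        # roughly 0.0015
```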

The Adiabatic Gas Law is defined as

$$P \cdot V^{\gamma} = \text{constant} \qquad (5.14)$$

where the heat capacity ratio $\gamma = \frac{\text{specific heat capacity at constant pressure}}{\text{specific heat capacity at constant volume}}$, which is about 1.4 in air.

We want the differential form of the adiabatic gas law, so we first take the logarithm:

$$\ln(P \cdot V^{\gamma}) = \text{constant} \quad\Rightarrow\quad \ln P + \gamma \ln V = \text{constant}$$

Then we take the derivative and rearrange to get the differential form:

$$\frac{dP}{P} = -\gamma\,\frac{dV}{V}$$

which can be approximated as


$$\frac{p}{P_0} = -\gamma\,\frac{\Delta V}{V_0} \qquad (5.15)$$

Here, P0 is the undisturbed pressure and p is the incremental pressure from the wave; V0 is the undisturbed volume and ΔV is the incremental volume from the wave. Writing ΔV as τ for notational simplicity and taking the derivative with respect to time, we arrive at:

$$\frac{1}{P_0}\,\frac{\partial p}{\partial t} = -\frac{\gamma}{V_0}\,\frac{\partial \tau}{\partial t} \qquad (5.16)$$

Figure 5.13: Left: before deformation; Right: after deformation. Note the incremental volume.

Now we look at the incremental volume of the box in the tube. We assume that the mass in the box remains constant (conservation of mass) while the side of the box extends due to the pressure change. As shown in Figure 5.13, the incremental volume is related to ξx, the displacement/extension of the box at one side. Subtracting the volumes of the boxes in Figure 5.13, $(\Delta x + \frac{\partial \xi_x}{\partial x}\Delta x)\,\Delta y\,\Delta z$ and $\Delta x\,\Delta y\,\Delta z$, the incremental volume is simply

$$\tau = \frac{\partial \xi_x}{\partial x}\,\Delta x\,\Delta y\,\Delta z = V_0\,\frac{\partial \xi_x}{\partial x}$$

Taking the derivative with respect to time, we have

$$\frac{\partial \tau}{\partial t} = V_0\,\frac{\partial}{\partial t}\frac{\partial \xi_x}{\partial x} = V_0\,\frac{\partial}{\partial x}\frac{\partial \xi_x}{\partial t} = V_0\,\frac{\partial u}{\partial x} \qquad (5.17)$$

Note that the last step in Eq. 5.17 follows because u, the instantaneous particle velocity, must be equal to the instantaneous velocity of the box's edge, $\frac{\partial \xi_x}{\partial t}$, due to conservation of mass.

Finally, we combine the equations we derived previously (Eqs. 5.13, 5.16, 5.17) for the final result. Substituting Eq. 5.17 into Eq. 5.16 and taking the time derivative, we have:

$$\frac{\partial p}{\partial t} = -\gamma P_0\,\frac{\partial u}{\partial x}$$

$$\frac{\partial^2 p}{\partial t^2} = -\gamma P_0\,\frac{\partial^2 u}{\partial t\,\partial x} \qquad (5.18)$$

Taking the derivative (with respect to x) of Eq. 5.13, then substituting in Eq. 5.18, we get:

$$\frac{\partial^2 p}{\partial x^2} = -\rho_0\,\frac{\partial^2 u}{\partial x\,\partial t} = \frac{\rho_0}{\gamma P_0}\,\frac{\partial^2 p}{\partial t^2}$$

Finally, observing that $\frac{\gamma P_0}{\rho_0} = c^2$, where c is the speed of sound in the gas, we have our 1-D wave equation for pressure:

$$\frac{\partial^2 p}{\partial x^2} = \frac{1}{c^2}\,\frac{\partial^2 p}{\partial t^2}$$

To find the wave equation for volume velocity, we take the derivative of Eq. 5.15 with respect to x and proceed similarly. In the end, we have two 1-D wave equations describing the dynamics of sound pressure and volume velocity inside a uniform lossless tube:

$$\frac{\partial^2 p}{\partial x^2} = \frac{1}{c^2}\,\frac{\partial^2 p}{\partial t^2}$$

$$\frac{\partial^2 u}{\partial x^2} = \frac{1}{c^2}\,\frac{\partial^2 u}{\partial t^2}$$
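As a sanity check on the relation γP0/ρ0 = c² (the standard atmospheric values below are assumptions, not stated in the lecture):

```python
import math

gamma = 1.4        # heat capacity ratio of air
P0 = 101325.0      # standard atmospheric pressure, Pa (assumed)
rho0 = 1.2         # approximate density of air at room temperature, kg/m^3 (assumed)

c = math.sqrt(gamma * P0 / rho0)
print(round(c))    # about 344, close to the 340 m/s used throughout
```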

References

[DHP99] J. R. Deller, J. H. L. Hansen, and J. G. Proakis, "Discrete-Time Processing of Speech Signals," Wiley-IEEE Press, 1999.

[HAH01] X. Huang, A. Acero, and H. Hon, "Spoken Language Processing," Prentice Hall PTR, 2001.

[LB88] P. Lieberman and S. Blumstein, "Speech Physiology, Speech Perception, and Acoustic Phonetics," Cambridge University Press, 1988.

[O87] D. O'Shaughnessy, "Speech Communications: Human and Machine," Addison-Wesley, 1987.

[RS78] L. R. Rabiner and R. W. Schafer, "Digital Processing of Speech Signals," Prentice Hall, New Jersey, 1978.