Digital Speech Processing - Lectures 7-8: Time Domain Methods in Speech Processing


  • Digital Speech Processing - Lectures 7-8: Time Domain Methods in Speech Processing

  • General Synthesis Model

    [Figure: speech synthesis model - voiced-sound source (amplitude, pitch periods T1, T2) and unvoiced-sound source (amplitude) driving a vocal tract model specified by log areas, reflection coefficients, formants, vocal tract polynomial, articulatory parameters, …, followed by a radiation model R(z) = 1 − αz⁻¹]

    Analysis problems: pitch detection, voiced/unvoiced/silence detection, gain estimation, vocal tract parameter estimation, glottal pulse shape, radiation model

  • Overview

    Signal Processing: speech waveform x[n] → representation of speech (voiced-unvoiced-silence, pitch, formants, reflection coefficients, sounds of the language, speaker identification, emotions)

    • time domain processing => direct operations on the speech waveform

    • frequency domain processing => direct operations on a spectral representation of the signal

    Simple time-domain measures on x[n]: zero crossing rate, level crossing rate, energy, autocorrelation

    • simple processing

    • enables various types of feature estimation

  • Basics

    • 8 kHz sampled speech (bandwidth < 4 kHz)

    • properties of speech change with time

    • excitation goes from voiced to unvoiced

    • peak amplitude varies with the sound being produced

    • pitch varies within and across voiced sounds

    • periods of silence where background signals are seen

    • the key issue is whether we can create simple time-domain processing methods that enable us to measure/estimate speech representations reliably and accurately

  • Fundamental Assumptions

    • properties of the speech signal change relatively slowly with time (5-10 sounds per second)
      – over very short (5-20 msec) intervals => uncertainty due to small amount of data, varying pitch, varying amplitude
      – over medium length (20-100 msec) intervals => uncertainty due to changes in sound quality, transitions between sounds, rapid transients in speech
      – over long (100-500 msec) intervals => uncertainty due to large amount of sound changes

    • there is always uncertainty in short-time measurements and estimates from speech signals

  • Compromise Solution

    • “short-time” processing methods => short segments of the speech signal are “isolated” and “processed” as if they were short segments from a “sustained” sound with fixed (non-time-varying) properties
      – this short-time processing is periodically repeated for the duration of the waveform
      – these short analysis segments, or “analysis frames,” often overlap one another
      – the result of short-time processing can be a single number (e.g., an estimate of the pitch period within the frame), or a set of numbers (an estimate of the formant frequencies for the analysis frame)
      – the end result of the processing is a new, time-varying sequence that serves as a new representation of the speech signal

  • Frame-by-Frame Processing in Successive Windows

    Frames 1-5, with 75% frame overlap => frame length = L, frame shift = R = L/4:

    Frame 1 = {x[0], x[1], …, x[L−1]}
    Frame 2 = {x[R], x[R+1], …, x[R+L−1]}
    Frame 3 = {x[2R], x[2R+1], …, x[2R+L−1]}
    …

  • Frame 1: samples 0, 1, …, L−1
    Frame 2: samples R, R+1, …, R+L−1
    Frame 3: samples 2R, 2R+1, …, 2R+L−1
    Frame 4: samples 3R, 3R+1, …, 3R+L−1
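The frame indexing above can be sketched in a small NumPy helper (the function name and the toy index signal are illustrative, not from the lecture):

```python
import numpy as np

def frames(x, L, R):
    """Split x into overlapping frames: frame m holds x[mR], ..., x[mR+L-1]."""
    n_frames = 1 + (len(x) - L) // R
    return np.stack([x[m * R : m * R + L] for m in range(n_frames)])

x = np.arange(100)          # toy "signal": each sample equals its index
F = frames(x, L=40, R=10)   # R = L/4 -> 75% frame overlap
```

With R = L/4, consecutive frames share 75% of their samples, matching the figure.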

  • Frame-by-Frame Processing in Successive Windows

    Frames 1-4, with 50% frame overlap

    • Speech is processed frame-by-frame in overlapping intervals until the entire region of speech is covered by at least one such frame

    • Results of analysis of individual frames are used to derive model parameters in some manner

    • Representation goes from time samples x[n], n = 0, 1, 2, …, to parameter vectors f[m], m = 0, 1, 2, …, where n is the time index and m is the frame index

  • Frames and Windows

    FS = 16,000 samples/second
    L = 641 samples (equivalent to 40 msec frame (window) length)
    R = 240 samples (equivalent to 15 msec frame (window) shift)
    Frame rate of 66.7 frames/second

  • Short-Time Processing

    speech waveform x[n] → short-time processing → speech representation f[m]

    x[n] = samples at 8000/sec rate (e.g., 2 seconds of 4 kHz bandlimited speech, x[n], 0 ≤ n ≤ 16000)

    f[m] = {f1[m], f2[m], …, fL[m]} = vectors at 100/sec rate, 1 ≤ m ≤ 200, where L is the size of the analysis vector (e.g., 1 for a pitch period estimate, 12 for autocorrelation estimates, etc.)

  • Generic Short-Time Processing

    Q_n̂ = Σ_{m=−∞}^{∞} T(x[m]) w[n̂ − m]

    x[n] → T( ) (linear or non-linear transformation) → T(x[n]) → w[n] (window sequence, usually finite length) → Q_n̂

    • Q_n̂ is a sequence of local weighted average values of the sequence T(x[n]) at time n = n̂

  • Short-Time Energy

    E = Σ_{m=−∞}^{∞} x²[m]  -- this is the long-term definition of signal energy
    -- there is little or no utility of this definition for time-varying signals

    E_n̂ = Σ_{m=n̂−N+1}^{n̂} x²[m] = x²[n̂−N+1] + … + x²[n̂]
    -- short-time energy in the vicinity of time n̂

    T(x) = x²
    w[n] = 1, 0 ≤ n ≤ N−1
         = 0  otherwise
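A direct translation of the rectangular-window definition (window size N, evaluation time n̂); a minimal sketch with an illustrative constant test signal:

```python
import numpy as np

def short_time_energy(x, n_hat, N):
    """E_n̂ = x^2[n̂-N+1] + ... + x^2[n̂]  (rectangular window of length N)."""
    lo = max(0, n_hat - N + 1)          # clip at the start of the signal
    return float(np.sum(x[lo : n_hat + 1] ** 2))

x = np.ones(100)
E = short_time_energy(x, n_hat=50, N=20)
```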

  • Computation of Short-Time Energy

    • the window w[n̂ − m] jumps/slides across the sequence of squared values x²[m], selecting the interval for processing

    • what happens to E_n̂ as the sequence jumps by 2, 4, 8, …, L samples? (E_n̂ is a lowpass function, so it can be decimated without loss of information; why is E_n̂ lowpass?)

    • effects of decimation depend on L; if L is small, then E_n̂ is a lot more variable than if L is large (window bandwidth changes with L!)

  • Effects of Window

    Q_n̂ = T(x[n]) ∗ w[n], evaluated at n = n̂

    • w[n] serves as a lowpass filter on T(x[n]), which often has a lot of high frequencies (most non-linearities introduce significant high frequency energy; think of what x[n]·x[n] does in frequency)

    • often we extend the definition of Q_n̂ to include a pre-filtering term so that x[n] itself is filtered to a region of interest

    x[n] → Linear Filter → x̃[n] → T( ) → T(x̃[n]) → w[n] → Q_n, with Q_n = Q_n̂ at n = n̂

  • Short-Time Energy

    • serves to differentiate voiced and unvoiced sounds in speech from silence (background signal)
    • natural definition of energy of the weighted signal is:

    E_n̂ = Σ_{m=−∞}^{∞} (x[m] w[n̂ − m])²  (sum of squares of portion of signal)
    -- concentrates measurement at sample n̂, using weighting w[n̂ − m]

    E_n̂ = Σ_{m=−∞}^{∞} x²[m] w²[n̂ − m] = Σ_{m=−∞}^{∞} x²[m] h[n̂ − m],   h[n] = w²[n]

    x[n] → ( )² → x²[n] → h[n] → E_n̂ (short-time energy), with E_n = E_n̂ at n = n̂; the output can be downsampled from rate FS to FS/R

  • Short-Time Energy Properties

    • depends on choice of h[n], or equivalently, window w[n]
      – if the w[n] duration is very long with constant amplitude (w[n] = 1, n = 0, 1, …, L−1), En would not change much over time, and would not reflect the short-time amplitudes of the sounds of the speech
      – very long duration windows correspond to narrowband lowpass filters
      – want En to change at a rate comparable to the changing sounds of the speech => this is the essential conflict in all speech processing, namely we need a short duration window to be responsive to rapid sound changes, but short windows will not provide sufficient averaging to give a smooth and reliable energy function

  • Windows

    • consider two windows, w[n]:
      – rectangular window:
        • h[n] = 1, 0 ≤ n ≤ L−1, and 0 otherwise
      – Hamming window (raised cosine window):
        • h[n] = 0.54 − 0.46 cos(2πn/(L−1)), 0 ≤ n ≤ L−1, and 0 otherwise
      – the rectangular window gives equal weight to all L samples in the window (n, …, n−L+1)
      – the Hamming window gives most weight to the middle samples and tapers off strongly at the beginning and the end of the window
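The two windows can be generated directly from these formulas; a quick NumPy check, using L = 21 as in the figure on the next slide:

```python
import numpy as np

L = 21
n = np.arange(L)
w_rect = np.ones(L)                                      # rectangular window
w_hamm = 0.54 - 0.46 * np.cos(2 * np.pi * n / (L - 1))   # Hamming window
```

The Hamming window peaks at 1.0 in the middle, n = (L−1)/2, and tapers to 0.08 at the ends.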

  • Rectangular and Hamming Windows

    [Figure: rectangular and Hamming windows, L = 21 samples]

  • Window Frequency Responses

    • rectangular window:

    H(e^{jΩT}) = [sin(ΩLT/2) / sin(ΩT/2)] · e^{−jΩT(L−1)/2}

    • first zero occurs at f = FS/L = 1/(LT) (or Ω = 2π/(LT)) => nominal cutoff frequency of the equivalent “lowpass” filter

    • Hamming window:

    w_H[n] = 0.54 w_R[n] − 0.46 cos(2πn/(L−1)) w_R[n]

    • can decompose the Hamming window frequency response into a combination of three terms

  • RW and HW Frequency Responses

    • log magnitude response of RW and HW
    • bandwidth of HW is approximately twice the bandwidth of RW
    • attenuation of more than 40 dB for HW outside the passband, versus 14 dB for RW
    • stopband attenuation is essentially independent of L, the window duration => increasing L simply decreases the window bandwidth
    • L needs to be larger than a pitch period (or severe fluctuations will occur in En), but smaller than a sound duration (or En will not adequately reflect the changes in the speech signal)

    There is no perfect value of L, since a pitch period can be as short as 20 samples (500 Hz at a 10 kHz sampling rate) for a high-pitch child or female, and up to 250 samples (40 Hz pitch at a 10 kHz sampling rate) for a low-pitch male; a compromise value of L on the order of 100-200 samples for a 10 kHz sampling rate is often used in practice

  • Window Frequency Responses

    [Figure: rectangular windows, L = 21, 41, 61, 81, 101; Hamming windows, L = 21, 41, 61, 81, 101]

  • Short-Time Energy

    Short-time energy computation:

    E_n̂ = Σ_{m=−∞}^{∞} (x[m] w[n̂ − m])²

    For an L-point rectangular window, w[m] = 1, m = 0, 1, …, L−1, giving:

    E_n̂ = Σ_{m=n̂−L+1}^{n̂} x²[m]

  • Short-Time Energy using RW/HW

    [Figure: E_n̂ for rectangular and Hamming windows, L = 51, 101, 201, 401]

    • as L increases, the plots tend to converge (however, you are smoothing sound energies)
    • short-time energy provides the basis for distinguishing voiced from unvoiced speech regions, and for medium-to-high SNR recordings, can even be used to find regions of silence/background signal

  • Short-Time Energy for AGC

    • Can use an IIR filter to define a time-dependent short-time energy, e.g.:

    σ²[n] = Σ_{m=−∞}^{∞} x²[m] h[n − m] / Σ_{m=0}^{∞} h[m]

    • consider an impulse response of the filter of the form:

    h[n] = α^{n−1} u[n−1] = α^{n−1}, n ≥ 1, 0 < α < 1
         = 0 otherwise

    σ²[n] = (1 − α) Σ_{m=−∞}^{∞} x²[m] α^{n−m−1} u[n − m − 1]

  • Recursive Short-Time Energy

    u[n − m − 1] = 1 implies the condition n − m − 1 ≥ 0, or m ≤ n − 1, giving:

    σ²[n] = (1 − α) Σ_{m=−∞}^{n−1} x²[m] α^{n−m−1} = (1 − α)(x²[n−1] + α x²[n−2] + α² x²[n−3] + …)

    for the index n − 1 we have:

    σ²[n−1] = (1 − α) Σ_{m=−∞}^{n−2} x²[m] α^{n−m−2} = (1 − α)(x²[n−2] + α x²[n−3] + …)

    thus giving the relationship:

    σ²[n] = α · σ²[n−1] + (1 − α) x²[n−1]

    and this defines an Automatic Gain Control (AGC) of the form:

    G[n] = G₀ / σ[n]
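The recursion and the AGC it defines are easy to sketch (the variable names and the constant-amplitude test input are mine, not the lecture's):

```python
import numpy as np

def recursive_energy(x, alpha):
    """sigma^2[n] = alpha * sigma^2[n-1] + (1 - alpha) * x^2[n-1]."""
    s2 = np.zeros(len(x))
    for n in range(1, len(x)):
        s2[n] = alpha * s2[n - 1] + (1 - alpha) * x[n - 1] ** 2
    return s2

def agc_gain(s2, G0=1.0, eps=1e-10):
    """G[n] = G0 / sigma[n]; eps guards the startup region where sigma = 0."""
    return G0 / np.sqrt(s2 + eps)

x = np.ones(2000)                       # constant-amplitude input
s2 = recursive_energy(x, alpha=0.9)
```

For a constant input of amplitude c, the recursion settles at σ² = c², so the AGC gain settles at G₀/c.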

  • Recursive Short-Time Energy

    [Figure: flow graph - x[n] → ( )² → x²[n], scaled by (1 − α), combined through a one-sample delay z⁻¹ with feedback coefficient α to produce σ²[n]]

    σ²[n] = α · σ²[n−1] + (1 − α) x²[n−1]

  • Recursive Short-Time Energy

    [Figure: recursive short-time energy example]

  • Use of Short-Time Energy for AGC

    [Figure: AGC applied to a speech waveform]

  • Use of Short-Time Energy for AGC

    [Figure: AGC with α = 0.9 and with α = 0.99]

  • Short-Time Magnitude

    • short-time energy is very sensitive to large signal levels due to the x²[n] terms
      – consider a new definition of ‘pseudo-energy’ based on average signal magnitude (rather than energy):

    M_n̂ = Σ_{m=−∞}^{∞} |x[m]| w[n̂ − m]

      – a weighted sum of magnitudes, rather than a weighted sum of squares

    x[n] → | | → |x[n]| → w[n] → M_n̂, with M_n = M_n̂ at n = n̂; the output can be downsampled from rate FS to FS/R

    • computation avoids multiplication of the signal with itself (the squared term)
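The magnitude definition, plus a check of its scaling behavior (the ramp test signal is illustrative):

```python
import numpy as np

def short_time_magnitude(x, n_hat, L):
    """M_n̂ = |x[n̂-L+1]| + ... + |x[n̂]|  (rectangular window of length L)."""
    lo = max(0, n_hat - L + 1)
    return float(np.sum(np.abs(x[lo : n_hat + 1])))

x = np.linspace(-1.0, 1.0, 101)
M1 = short_time_magnitude(x, n_hat=100, L=50)
M2 = short_time_magnitude(2 * x, n_hat=100, L=50)   # doubled signal
```

Doubling the signal doubles M_n̂ but quadruples E_n̂; this is the reduced dynamic range noted on the next slide.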

  • Short-Time Magnitudes

    [Figure: M_n̂ for L = 51, 101, 201, 401]

    • differences between En and Mn are noticeable in unvoiced regions
    • dynamic range of Mn ~ square root of (dynamic range of En) => level differences between voiced and unvoiced segments are smaller
    • En and Mn can be sampled at a rate of 100/sec for window durations of 20 msec or so => efficient representation of signal energy/magnitude

  • Short-Time Energy and Magnitude - Rectangular Window

    [Figure: E_n̂ and M_n̂ with a rectangular window, L = 51, 101, 201, 401]

  • Short-Time Energy and Magnitude - Hamming Window

    [Figure: E_n̂ and M_n̂ with a Hamming window, L = 51, 101, 201, 401]

  • Other Lowpass Windows

    • can replace RW or HW with any lowpass filter
    • the window should be positive, since this guarantees En and Mn will be positive
    • FIR windows are efficient computationally since they can slide by L samples with no loss of information (what should L be?)
    • can even use an infinite duration window if its z-transform is a rational function, i.e.,

    h[n] = aⁿ, n ≥ 0, 0 < a < 1
         = 0, n < 0

    H(z) = 1 / (1 − a z⁻¹), |z| > |a|

  • Other Lowpass Windows

    • this simple lowpass filter can be used to implement En and Mn recursively as:

    En = a En−1 + (1 − a) x²[n]   (short-time energy)
    Mn = a Mn−1 + (1 − a) |x[n]|   (short-time magnitude)

    • need to compute En or Mn every sample and then down-sample to a 100/sec rate
    • the recursive computation has a non-linear phase, so delay cannot be compensated exactly

  • Short-Time Average ZC Rate

    zero crossing => successive samples have different algebraic signs

    • zero crossing rate is a simple measure of the ‘frequency content’ of a signal, especially true for narrowband signals (e.g., sinusoids)
    • a sinusoid at frequency F0 with sampling rate FS has FS/F0 samples per cycle, with two zero crossings per cycle, giving an average zero crossing rate of:

    z1 = 2 crossings/cycle × (F0/FS) cycles/sample
    z1 = 2F0/FS crossings/sample (i.e., z1 proportional to F0)
    zM = M (2F0/FS) crossings/(M samples)

  • Sinusoid Zero Crossing Rates

    Assume the sampling rate is FS = 10,000 Hz.

    1. An F0 = 100 Hz sinusoid has FS/F0 = 10,000/100 = 100 samples/cycle; z1 = 2/100 crossings/sample, or z100 = (2/100)·100 = 2 crossings/10 msec interval

    2. An F0 = 1000 Hz sinusoid has FS/F0 = 10,000/1000 = 10 samples/cycle; z1 = 2/10 crossings/sample, or z100 = (2/10)·100 = 20 crossings/10 msec interval

    3. An F0 = 5000 Hz sinusoid has FS/F0 = 10,000/5000 = 2 samples/cycle; z1 = 2/2 crossings/sample, or z100 = (2/2)·100 = 100 crossings/10 msec interval

  • Zero Crossings for Sinusoids

    [Figure: 100 Hz sinewave (offset = 0), ZC = 9 in the interval shown; 100 Hz sinewave with dc offset = 0.75, ZC = 8]

  • Zero Crossings for Noise

    [Figure: random gaussian noise (offset = 0), ZC = 252; random gaussian noise with dc offset = 0.75, ZC = 122]

  • ZC Rate Definitions

    Z_n̂ = (1/(2 L_eff)) Σ_{m=n̂−L+1}^{n̂} |sgn(x[m]) − sgn(x[m−1])| w[n̂ − m]

    sgn(x[n]) = 1, x[n] ≥ 0
              = −1, x[n] < 0

    simple rectangular window:
    w[n] = 1, 0 ≤ n ≤ L−1
         = 0 otherwise
    L_eff = L

    Same form for Z_n̂ as for E_n̂ or M_n̂
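The definition translates directly; the test below reproduces the 1000 Hz sinusoid example at FS = 10,000 Hz (with sgn(0) = +1, as in the definition above):

```python
import numpy as np

def zc_rate(x, n_hat, L):
    """Z_n̂ = (1/2L) * sum over m = n̂-L+1..n̂ of |sgn(x[m]) - sgn(x[m-1])|,
    in crossings per sample; sgn(v) = +1 for v >= 0, else -1."""
    sgn = np.where(x >= 0, 1, -1)
    m = np.arange(n_hat - L + 1, n_hat + 1)
    return np.sum(np.abs(sgn[m] - sgn[m - 1])) / (2 * L)

fs, f0 = 10_000, 1000
x = np.sin(2 * np.pi * f0 * np.arange(2000) / fs)
z1 = zc_rate(x, n_hat=200, L=100)       # crossings per sample
```

Here z1 = 2·F0/FS = 0.2 crossings/sample, i.e., 20 crossings per 10 msec (100-sample) interval.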

  • ZC Normalization

    The formal definition of z1 is:

    Z_n̂ = z1 = (1/(2L)) Σ_{m=n̂−L+1}^{n̂} |sgn(x[m]) − sgn(x[m−1])|

    z1 is interpreted as the number of zero crossings per sample. For most practical applications, we need the rate of zero crossings per fixed interval of M samples, which is:

    zM = z1 · M = rate of zero crossings per M-sample interval

    Thus, for an interval of τ sec., corresponding to M = τ FS = τ/T samples, we get zM = z1 · M.

  • ZC Normalization

    For a 1000 Hz sinewave as input, using a 40 msec window length (L), with various values of the sampling rate (FS), we get the following:

    FS      L    z1   M    zM
    8,000   320  1/4   80  20
    10,000  400  1/5  100  20
    16,000  640  1/8  160  20

    Thus we see that the normalized (per interval) zero crossing rate, zM, is independent of the sampling rate and can be used as a measure of the dominant energy in a band.

  • ZC and Energy Computation

    [Figure: Hamming windows with duration L = 201 samples (12.5 msec at FS = 16 kHz) and L = 401 samples (25 msec at FS = 16 kHz)]

  • ZC Rate Distributions

    Unvoiced Speech: the dominant energy component is at about 2.5 kHz
    Voiced Speech: the dominant energy component is at about 700 Hz

    • for voiced speech, energy is mainly below 1.5 kHz
    • for unvoiced speech, energy is mainly above 1.5 kHz
    • mean ZC rate for unvoiced speech is 49 per 10 msec interval
    • mean ZC rate for voiced speech is 14 per 10 msec interval

  • ZC Rates for Speech

    • 15 msec windows
    • 100/sec sampling rate on ZC computation

  • Short-Time Energy, Magnitude, ZC

    [Figure: short-time energy, magnitude, and zero crossing rate for a speech utterance]

  • Issues in ZC Rate Computation

    • for the zero crossing rate to be accurate, need zero DC in the signal => need to remove offsets, hum, noise => use a bandpass filter to eliminate DC and hum
    • can quantize the signal to 1-bit for computation of the ZC rate
    • can apply the concept of ZC rate to bandpass filtered speech to give a ‘crude’ spectral estimate in narrow bands of speech (kind of gives an estimate of the strongest frequency in each narrow band of speech)

  • Summary of Simple Time Domain Measures

    s(n) → Linear Filter → x(n) → T[ ] → T(x[n]) → w[n] → Q_n̂

    Q_n̂ = Σ_{m=−∞}^{∞} T(x[m]) w[n̂ − m]

    1. Energy:
       E_n̂ = Σ_{m=n̂−L+1}^{n̂} x²[m] w[n̂ − m]
       • can downsample E_n̂ at a rate commensurate with the window bandwidth
    2. Magnitude:
       M_n̂ = Σ_{m=n̂−L+1}^{n̂} |x[m]| w[n̂ − m]
    3. Zero Crossing Rate:
       Z_n̂ = z1 = (1/(2L)) Σ_{m=n̂−L+1}^{n̂} |sgn(x[m]) − sgn(x[m−1])| w[n̂ − m]
       where sgn(x[m]) = 1, x[m] ≥ 0
                       = −1, x[m] < 0

  • Short-Time Autocorrelation

    - for a deterministic signal, the autocorrelation function is defined as:

    Φ[k] = Σ_{m=−∞}^{∞} x[m] x[m+k]

    - for a random or periodic signal, the autocorrelation function is:

    Φ[k] = lim_{L→∞} (1/(2L+1)) Σ_{m=−L}^{L} x[m] x[m+k]

    - if x[n] = x[n+P], then Φ[k] = Φ[k+P] => the autocorrelation function preserves periodicity
    - properties of Φ[k]:
      1. Φ[k] is even, Φ[k] = Φ[−k]
      2. Φ[k] is maximum at k = 0, |Φ[k]| ≤ Φ[0] ∀k
      3. Φ[0] is the signal energy or power (for random signals)

  • Periodic Signals

    • for a periodic signal we have (at least in theory) Φ[P] = Φ[0], so the period of a periodic signal can be estimated as the first non-zero maximum of Φ[k]
      – this means that the autocorrelation function is a good candidate for speech pitch detection algorithms
      – it also means that we need a good way of measuring the short-time autocorrelation function for speech signals
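A minimal pitch-period estimator along these lines (the search bounds and the 50-sample test period are illustrative choices, not from the lecture):

```python
import numpy as np

def autocorr(x, max_lag):
    """Deterministic autocorrelation Phi[k] = sum_m x[m] x[m+k] of a finite segment."""
    return np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(max_lag + 1)])

def pitch_period(x, min_lag, max_lag):
    """Estimate the period as the largest peak of Phi[k] away from k = 0;
    min_lag excludes the k = 0 maximum."""
    phi = autocorr(x, max_lag)
    return min_lag + int(np.argmax(phi[min_lag : max_lag + 1]))

x = np.sin(2 * np.pi * np.arange(1000) / 50)   # period P = 50 samples
P = pitch_period(x, min_lag=20, max_lag=200)
```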

  • Short-Time Autocorrelation

    - a reasonable definition for the short-time autocorrelation is:

    R_n̂[k] = Σ_{m=−∞}^{∞} x[m] w[n̂ − m] x[m+k] w[n̂ − k − m]

    1. select a segment of speech by windowing
    2. compute the deterministic autocorrelation of the windowed speech

    R_n̂[k] = R_n̂[−k]  -- symmetry

    R_n̂[k] = Σ_{m=−∞}^{∞} x[m] x[m−k] [w[n̂ − m] w[n̂ + k − m]]

    - define a filter of the form w̃_k[n] = w[n] w[n+k]
    - this enables us to write the short-time autocorrelation in the form:

    R_n̂[k] = Σ_{m=−∞}^{∞} x[m] x[m−k] w̃_k[n̂ − m]

    - the value of R_n̂[k] at time n̂ for the k-th lag is obtained by filtering the sequence x[n] x[n−k] with a filter with impulse response w̃_k[n]
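With a rectangular window starting at n̂, the definition reduces to sums over the L-sample segment; note that only L − k products exist at lag k. A sketch, with the basic autocorrelation properties as the test:

```python
import numpy as np

def short_time_autocorr(x, n_hat, L, max_lag):
    """R_n̂[k] = sum_{m=0}^{L-1-k} x[n̂+m] x[n̂+m+k]  (rectangular window);
    only L-k products are available at lag k, so values shrink as k grows."""
    seg = x[n_hat : n_hat + L]
    return np.array([np.dot(seg[: L - k], seg[k:]) for k in range(max_lag + 1)])

rng = np.random.default_rng(0)
x = rng.standard_normal(500)
R = short_time_autocorr(x, n_hat=0, L=400, max_lag=50)
```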

  • Short-Time Autocorrelation

    R_n̂[k] = Σ_{m=−∞}^{∞} (x[m] w[n̂ − m]) (x[m+k] w[n̂ − k − m])

           = Σ_{m=0}^{L−1−k} (x[n̂ + m] w′[m]) (x[n̂ + m + k] w′[k + m])

    => L − k points are used to compute R_n̂[k]

    [Figure: the two windowed segments, covering samples n̂ − L + 1 … n̂ and n̂ + k − L + 1 … n̂ + k]

  • Short-Time Autocorrelation

    [Figure: example short-time autocorrelation]

  • Examples of Autocorrelations

    • autocorrelation peaks occur at k = 72, 144, … => 140 Hz pitch
    • much less clear estimates of periodicity since Φ(P) …

  • Voiced (female), L = 401 (magnitude)

    [Figure: voiced (female) segment and its autocorrelation/spectrum; time axis t = nT, pitch period T0 = N0·T marked, F0 = 1/T0; frequency axis normalized to F/FS]

  • Voiced (female), L = 401 (log mag)

    [Figure: voiced (female) segment, log magnitude; time axis t = nT, pitch period T0 marked, F0 = 1/T0; frequency axis normalized to F/FS]

  • Voiced (male), L = 401

    [Figure: voiced (male) segment; pitch period T0 marked; F0 = 1/T0 and harmonic 3F0 = 3/T0 marked]

  • Unvoiced, L = 401

    [Figure: unvoiced segment, L = 401]

  • Unvoiced, L = 401

    [Figure: second unvoiced segment, L = 401]

  • Effects of Window Size

    • choice of L, window duration:
      – small L so the pitch period is almost constant in the window
      – large L so clear periodicity is seen in the window
    • as k increases, the number of window points decreases, reducing the accuracy and size of Rn(k) for large k => the estimate has a taper of the type R(k) = 1 − k/L, |k| < L

    [Figure: L = 401]

  • Modified Autocorrelation

    • want to solve the problem of a differing number of samples for each different k term in R_n̂[k], so modify the definition as follows:

    R̂_n̂[k] = Σ_{m=−∞}^{∞} x[m] w1[n̂ − m] x[m+k] w2[n̂ − m − k]

    - where w1 is the standard L-point window, and w2 is an extended window of duration L + K samples, where K is the largest lag of interest
    - we can rewrite the modified autocorrelation as:

    R̂_n̂[k] = Σ_{m=−∞}^{∞} x[n̂ + m] ŵ1[m] x[n̂ + m + k] ŵ2[m + k]

    - where ŵ1[m] = w1[−m] and ŵ2[m] = w2[−m]
    - for rectangular windows we choose the following:

    ŵ1[m] = 1, 0 ≤ m ≤ L−1
    ŵ2[m] = 1, 0 ≤ m ≤ L−1+K

    - giving:

    R̂_n̂[k] = Σ_{m=0}^{L−1} x[n̂ + m] x[n̂ + m + k], 0 ≤ k ≤ K

    - always use L samples in the computation of R̂_n̂[k]
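For rectangular windows the modified (cross-correlation) estimate always uses L products per lag; a sketch, using a pure sinusoid to show that the peak at the period does not fall off with lag:

```python
import numpy as np

def modified_autocorr(x, n_hat, L, K):
    """R̂_n̂[k] = sum_{m=0}^{L-1} x[n̂+m] x[n̂+m+k], 0 <= k <= K;
    always L products per lag (needs L+K samples starting at n̂)."""
    return np.array([np.dot(x[n_hat : n_hat + L], x[n_hat + k : n_hat + k + L])
                     for k in range(K + 1)])

x = np.sin(2 * np.pi * np.arange(1000) / 50)    # period P = 50
R = modified_autocorr(x, n_hat=0, L=400, K=200)
```

Unlike the standard short-time estimate, R̂[P] here essentially equals R̂[0]; there is no 1 − k/L taper.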

  • Examples of Modified Autocorrelation

    [Figure: windows of length L (samples 0 … L−1) and L + K (samples 0 … L+K−1)]

    - R̂_n̂[k] is a cross-correlation, not an auto-correlation
    - R̂_n̂[k] ≠ R̂_n̂[−k]
    - R̂_n̂[k] will have a strong peak at k = P for periodic signals and will not fall off for large k

  • Examples of Modified Autocorrelation

    [Figure: modified autocorrelation examples]

  • Examples of Modified AC

    [Figure: modified autocorrelations - fixed value of L = 401; and values of L = 401, 251, 125]

  • Short-Time AMDF

    - belief that for periodic signals of period P, the difference function

    d[n] = x[n] − x[n−k]

    - will be approximately zero for k = 0, ±P, ±2P, …. For realistic speech signals, d[n] will be small at k = P, but not zero. Based on this reasoning, the short-time Average Magnitude Difference Function (AMDF) is defined as:

    γ_n̂[k] = Σ_{m=−∞}^{∞} |x[n̂ + m] w1[m] − x[n̂ + m − k] w2[m − k]|

    - with w1[m] and w2[m] being rectangular windows. If both are the same length, then γ_n̂[k] is similar to the short-time autocorrelation, whereas if w2[m] is longer than w1[m], then γ_n̂[k] is similar to the modified short-time autocorrelation (or covariance) function. In fact, it can be shown that:

    γ_n̂[k] ≈ √2 β[k] [R_n̂[0] − R_n̂[k]]^{1/2}

    - where β[k] varies between 0.6 and 1.0 for different segments of speech.
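With rectangular windows the AMDF is a sum of L absolute differences per lag; a sketch using an exactly periodic test signal (period 50, illustrative) so the dip at k = P reaches zero:

```python
import numpy as np

def amdf(x, n_hat, L, K):
    """γ_n̂[k] = sum_{m=0}^{L-1} |x[n̂+m] - x[n̂+m-k]|, 0 <= k <= K;
    dips (near zero) at k = P, 2P, ... for a signal of period P."""
    seg = x[n_hat : n_hat + L]
    return np.array([np.sum(np.abs(seg - x[n_hat - k : n_hat - k + L]))
                     for k in range(K + 1)])

base = np.sin(2 * np.pi * np.arange(50) / 50)
x = np.tile(base, 20)                   # exactly periodic, P = 50
g = amdf(x, n_hat=500, L=200, K=100)
```

Note the AMDF is a valley function: the pitch period is the location of the minimum away from k = 0, not a peak.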

  • AMDF for Speech Segments

    [Figure: AMDF for speech segments]

  • Summary

    • Short-time parameters in the time domain:
      – short-time energy
      – short-time average magnitude
      – short-time zero crossing rate
      – short-time autocorrelation
      – modified short-time autocorrelation
      – short-time average magnitude difference function