44
Mel-Generalized Cepstral Representation of Speech A Unified Approach to Speech Spectral Estimation Keiichi Tokuda Nagoya Institute of Technology Carnegie Mellon University Tamkang University March 13, 2002 1

Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Mel-Generalized Cepstral Representation of Speech

—A Unified Approach to Speech Spectral Estimation

Keiichi Tokuda

Nagoya Institute of Technology

Carnegie Mellon University

�Tamkang University

March 13, 2002

1

Page 2: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Conventional Speech Spectral Estimation

• Linear prediction (LPC) Autoregressive (AR) model

• Cepstral analysis Exponential (EX) model

• Subband filter bank Nonparametric

Variations

• Model ⇒ Pole-zero (ARMA) model

• Analysis window ⇒ Adaptive analysis

(sample by sample basis)

• Auditory characteristics ⇒ Warped LPC, PLP, etc.

◦ Auditory frequency scales (mel, Bark)

◦ Loudness scales (log, sone)

2

Page 3: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Structure of This Talk

1. Conventional cepstral analysis

2. Introduction of generalized logarithmic function

⇒ Generalized cepstral analysis

3. Introduction of auditory frequency scale

⇒ Mel-generalized cepstral analysis

4. Applications to speech recognition and coding

3

Page 4: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

History of Cepstral Analysis

• B.P. Bogert, M.J.R. Healy, J.W. Tukey (1963)

Analysis of seismic signals

— decomposition into direct wave and echo

⇒ Cepstrum, Quefrency, Lifter

• A.M. Noll (1964, 1967)

Pitch extraction based on cepstrum

• A.V. Oppenheim (1966, 1968)

Homomorphic deconvolution

— decomposition into source and vocal tract function

⇒ Complex cepstrum

4

Page 5: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Definition of Cepstrum

Fourier transform of signal s(n)

S(ejω) = F [ s(n) ]

Cepstrum

C(m) = F−1[log |S(ejω)|2

](Bogert et al., Noll)

C(m) = F−1[log |S(ejω)|

](Oppenheim)

5

Page 6: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Complex Cepstrum

z-transform of signal s(n)

S(z) = Z [ s(n) ]

Complex cepstrum

c(m) = Z−1 [ logS(z) ]

= F−1[logS(ejω)

]= F−1

[log |S(ejω)| + j argS(ejω)

]

6

Page 7: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Cepstrum and Complex Cepstrum

log |S(ejω)| = F [C(m) ] = Re [F [ c(m) ] ]

⇓When it is minimum phase (all polesand zeros are located in the unit circle)

c(m) =

⎧⎪⎪⎨⎪⎪⎩

0, m < 0

C(m), m = 0

2C(m), m > 0

7

Page 8: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Homomorphic Deconvolution

Pulse train

White noise

��

�������e(n) s(n) = e(n) ∗ h(n)Linear

time-invariantsystemh(n)

� Speech

s(n) = h(n) ∗ e(n)

↓ F

S(ejω) = H(ejω)E(ejω)

↓ log | · |

log |S(ejω)| = log |H(ejω)| + log |E(ejω)|↓ F−1

C(m) = Ch(m) + Ce(m)

8

Page 9: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

0 100 200

0

Time (100μs)

0 1 2 3 4 50

20

40

60

80

100

Frequency (kHz)

Amplitude (dB)

0 50 1000

5

Quefrency (100μs)

0 1 2 3 4 50

20

40

60

80

100

Frequency (kHz)

Amplitude (dB)

C(m)

s(n)

0 1 2 3 4 5-50

0

50

Frequency (kHz)

Amplitude (dB)

(a)

(b) (c)

(d) (e)

9

Page 10: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Spectrum of Periodic Signal

h(n)

e(n)

e(n)

Np

w(n)

s(n)

Frequency0 π

2π/Np

Frequency0 π

⏐W(ejω)⏐

log⏐

Ew(e

jω) ⏐

log⏐

Sw(e

jω) ⏐

⏐H(ejω)⏐

10

Page 11: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Problems of Homomorphic Processing

(Cepstral Analysis)

Linear smoothing of log spectrum• affected by fine structure of FFT spectrum

• results in a large bias and variance

Voiced speech (periodic)• Envelope of peaks of spectral fine structure

⇒ Improved cepstral analysis , PSE: Biased

11

Page 12: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Cost Function

P (ω): Estimate of Power Spectrum

IN(ω): Periodogram

E =1

∫ π

−π

{IN(ω)

P (ω)− log

IN(ω)

P (ω)− 1

}dω ⇒ min

x: Gaussian Process ⇒ Maximizing p(x|c)

• Unbiased estimation of log spectrum

• equivalent to one used in LPC

• Minimization of energy of inverse filter output

12

Page 13: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Analysis of Natural Speech

0 1 2 3 4 5Frequency (kHz)

-20

0

20

40

60

80

Log magnitude (dB)

(a) Unbiased cepstral analysis

0 1 2 3 4 5Frequency (kHz)

-20

0

20

40

60

80

Log magnitude (dB)

0 1 2 3 4 5Frequency (kHz)

-20

0

20

40

60

80

Log magnitude (dB)

(b) Linear prediction

0 1 2 3 4 5Frequency (kHz)

-20

0

20

40

60

80

Log magnitude (dB)

13

Page 14: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Generalized Cepstrum

Complex Cepstrum

c(m) = Z−1 [ logS(z) ]

logS(z) = Z [ c(m) ]

Generalized Cepstrum

cγ(m) = Z−1 [ sγ (S(z)) ]

sγ (S(z)) = Z [ cγ(m) ]

14

Page 15: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Generalized logarithmic function

sγ(w) =

⎧⎨⎩ (wγ − 1)/γ, 0 < |γ| ≤ 1

logw, γ = 0

0 1

1

x

s xγ ( )γ = 1

γ = −1

0 1< <γ

− < <1 0γγ = 0 (log )x

15

Page 16: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Spectral Model

Generalized Cepstrum: cγ(m)

H(z) = s−1γ

⎛⎝ M∑

m=0

cγ(m) z−m

⎞⎠

=

⎧⎪⎪⎪⎪⎪⎪⎨⎪⎪⎪⎪⎪⎪⎩

⎛⎝1 + γ

M∑m=0

cγ(m) z−m

⎞⎠1/γ

, 0 < |γ| ≤ 1

expM∑

m=0

cγ(m) z−m, γ = 0

Inverse function of Generalized logarithm

s−1γ (w) =

⎧⎨⎩ (1 + γw)1/γ, 0 < |γ| ≤ 1

expw, γ = 0

16

Page 17: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Cost Function

E =1

∫ π

−π

{IN(ω)

P (ω)− log

IN(ω)

P (ω)− 1

}dω ⇒ min

Estimate of Power Spectrum

P (ω) = |H(ejω)|2 = σ2|D(ejω)|2

Interpretation in time-domain

ε = E[e2(n)

]⇒ min

1/D(z) �

x(n) e(n)

17

Page 18: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Advantage

−1 ≤ γ ≤ 0:

• Convex function ⇒ Global solutioncan easily be obtained

• The obtained system H(z) is minimum phase, e.g., stable

• γ = −1 ⇒ Linear Prediction

H(z) =1

1 −M∑

m=0

cγ(m)z−m

• γ = 0 ⇒ Cepstrum

H(z) = expM∑

m=0

cγ(m)z−m

18

Page 19: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Prediction Gain

• D(z) is minimum phase

• Gain of D(z) is one⇒

Predictor:

Q(z) =∞∑

k=1

a(k)z−k

Cost Function:

ε = E[e2(n)

]⇒ Prediction Gain:

G =E

[x2(n)

]E

[e2(n)

]

1/D(z) ��x(n) e(n)

⇓�����

Q(z)

��x(n) e(n)+−

19

Page 20: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Analysis of synthetic signal

(Generalized Cepstral Analysis)

0 2.5 5Frequency(Hz)

Log magnitude

10dB

True

γ =-1 -3/4 -1/2 -1/4 0

(a) Example 1

M=8

(LPC) (UELS)-1 -0.5 0

6

7

8

γPrediction gain(dB)

20

Page 21: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Analysis of synthetic signal

(Generalized Cepstral Analysis)

0 2.5 5Frequency(Hz)

Log magnitude

10dB

True

γ =-1 -3/4 -1/2 -1/4 0

(c) Example 3

M=8

(LPC) (UELS)-1 -0.5 0

3

4

5

γPrediction gain(dB)

21

Page 22: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Analysis of synthetic signal

(Generalized Cepstral Analysis)

0 2.5 5Frequency(Hz)

Log magnitude

10dB

True

γ =-1 -3/4 -1/2 -1/4 0

(b) Example 2

M=8

(LPC) (UELS)-1 -0.5 0

6

7

γPrediction gain(dB)

22

Page 23: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Analysis of natural speech

(Generalized Cepstral Analysis) /e/

-20

0

20

40

60

80

Magnitude(dB)

0 1 2 3 4 5Frequency(kHz)

γ = -1

0 1 2 3 4 5Frequency(kHz)

γ = -1/2

0 1 2 3 4 5Frequency(kHz)

γ = -1/3

0 1 2 3 4 5Frequency(kHz)

γ = 0

M=15

(LPC) (UELS)-1 -0.5 0

15

16

γ

Prediction gain(dB)

(a) male /e/

23

Page 24: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Analysis of natural speech

(Generalized Cepstral Analysis) /N/

-20

0

20

40

60

80

Magnitude(dB)

0 1 2 3 4 5Frequency(kHz)

γ = -1

0 1 2 3 4 5Frequency(kHz)

γ = -1/2

0 1 2 3 4 5Frequency(kHz)

γ = -1/3

0 1 2 3 4 5Frequency(kHz)

γ = 0

M=15

(LPC) (UELS)-1 -0.5 0

27

28

γ

Prediction gain(dB)

(b) male /N/

24

Page 25: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Structure of synthesis filter H(z) (γ = −1/n)

inputσ

output1

C(z)

1

C(z)�

��

��

��

�� � � −−−− �1

C(z)�

1st 3nd . . .. . .. . . n-th

H(z) = σD(z) = σ

{1

C(z)

}n

C(z) =

⎛⎝1 + γ

M∑m=0

c′γ(m) z−m

⎞⎠

25

Page 26: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Structure of synthesis filter H(z) (γ = 0)

—LMA filter

F (z) F (z) F (z) F (z)� � ���

��

��

��

��

��

��

��

+�Input �

� + �Output

AL,1 AL,2 AL,3 AL,4

- -

� � � �

���

���

����

���

���

����

D(z) = expF (z) RL(F (z)) =

1 +L∑

l=1

AL,l {F (z)}l

1 +L∑

l=1

AL,l {−F (z)}l

F (z) =M∑

m=1

cγ(m) z−m

26

Page 27: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Introduction of auditory frequency scale

First-order all-pass function:

z−1α = Ψ(z) =

z−1 − α

1 − αz−1

Phase Characteristics can be

used for Frequency

Transformation:

ω̃ = tan−1 (1 − α2) sinω

(1 + α2) cosω − 2α

where Ψ(ejω) = e−jω̃

0 π/2 π0

π /2

π

Frequency (rad)

War

ped

frequ

ency

(rad

) mel scale

10kHz samplingα = 0.35

ω

ω

27

Page 28: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Mel-Generalized Cepstral Analysis

Mel-generalized cepstrum: cα,γ(m)

H(z) = s−1γ

⎛⎝ M∑

m=0

cα,γ(m) z−mα

⎞⎠

=

⎧⎪⎪⎪⎪⎪⎪⎨⎪⎪⎪⎪⎪⎪⎩

⎛⎝1 + γ

M∑m=0

cα,γ(m) z−mα

⎞⎠1/γ

, 0 < |γ| ≤ 1

expM∑

m=0

cα,γ(m) z−mα , γ = 0

�z−1α =

z−1 − α

1 − αz−1

28

Page 29: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

• (α, γ) = (0, 0) ⇒ Cepstral model:

H(z) = expM∑

m=0

cα,γ(m)z−m

• (α, γ) = (0, −1) ⇒ AR model:

H(z) =1

1 −M∑

m=0

cα,γ(m)z−m

• (α, γ) = (0.35, 0) ⇒ Mel-cepstral model:

H(z) = expM∑

m=0

cα,γ(m) z−mα

• (α, γ) = (0.47, −1) ⇒ Warped AR model:

H(z) =1

1 −M∑

m=0

cα,γ(m) z−mα

29

Page 30: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

A Unified Approach to Speech Spectral Estimation

� |α| < 1, −1 ≤ γ ≤ 0

Mel-generalized cepstralanalysis

� α = 0

� γ = −1

� γ = 0

Generalized cepstral analysis

Linear prediction Warped Linear Prediction

Unbiased Cepstral analysis Mel-cepstral analysis

30

Page 31: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Mel-generalized analysis of natural speech /N/ M = 12

-40

0

40

Mag

nitu

de(d

B)

α = 0 α = 0.35 α = 0.47

-40

0

40

Mag

nitu

de(d

B)

0 2.5 5-40

0

40

Frequency(kHz)

Mag

nitu

de(d

B)

0 2.5 5

Frequency(kHz)

γ = 0

γ = − 0.5

γ = − 1

0 2.5 5

Frequency(kHz)

31

Page 32: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Example

α = 0

γ = −1

γ = −1/3

γ = 0

0 200 400 (ms)(a) Waveform

n a N b u d e w a

012345

Freq

uenc

y(kH

z)

(α, γ , M) = (0, -1, 12) LP40(dB)012345

Freq

uenc

y(kH

z)

(α, γ , M) = (0, -1/3, 12) GCEP40(dB)012345

Freq

uenc

y(kH

z)

(α, γ , M) = (0, 0, 12) UELS40(dB)(b) Spectral estimates ( α = 0) 32

Page 33: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Example

α = 0.35

γ = −1

γ = −1/3

γ = 0

0 200 400 (ms)(a) Waveform

n a N b u d e w a

012345

Freq

uenc

y(kH

z)

(α, γ , M) = (0.35, -1, 12) WLP40(dB)012345

Freq

uenc

y(kH

z)

(α, γ , M) = (0.35, -1/3, 12) MGCEP40(dB)012345

Freq

uenc

y(kH

z)

(α, γ , M) = (0.35, 0, 12) MCEP40(dB)(b) Spectral estimates ( α = 0.35) 33

Page 34: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Structure of synthesis filter H(z) (γ = −1/n)

inputσ

output1

C(z)

1

C(z)�

��

��

��

�� � � −−−− �1

C(z)�

1st 2nd . . .. . .. . . n-th

Structure of H(z)

H(z) = σD(z) = σ

{1

C(z)

}n

C(z) =

(1 + γ

M∑m=0

c′α,γ(m) z−mα

) �+�Input

Output

z−1 �+� z−1 �+� z−1 �+� z−1�

��

��

�+�

b(1)

��

��

�+�

b(2)

��

�� b(3)

��

��

��

��

��

��

��

���

��

��

��

��

��

��

α

+�

��

��

α

+�

��

��

α

�−

�−

��

γ(1 − α2)

�−

� �

Structure of C(z) (M = 3)

34

Page 35: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Structure of synthesis filter H(z) (γ = 0)

—MLSA filter

D(z) = expF (z) RL(F (z))

F (z) =M∑

m=0

c′α,γ(m) z−mα

• sufficient accuracy: maximum spectral error 0.24dB

• O(8M) multiply-add operations a sample

• guaranteed stability

• M multiply-add operations for filter coefficients calculation

35

Page 36: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Structure of

MLSA filterF (z) F (z) F (z) F (z)� � �

��

��

��

��

��

��

��

��

+�Input �

� + �Output

AL,1 AL,2 AL,3 AL,4

- -

� � � �

���

���

����

���

���

����

RL(F (z)) expF (z) = D(z), L = 4

���

���

1 − α2

���

���

α ���

���

α ���

���

α

���

���b(1)

���

���b(2)

���

���b(3)

z−1 z−1 z−1 z−1

Input

� �+ � � �+ � � �+� � ���

��

��

���

��

��

��

��

��

��

��

���

�+-

��

��

��

��

��

��

��

���

�+-

��

��+ ���+ Output

Basic filter F (z), M = 3

36

Page 37: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

The choice of α, γ for speech analysis/synthesis

Analysis/synthesis system with fixed α and γ

• speech quality change with γ

γ → −1 Clear

γ → 0 Smooth

• When γ = 0, speech quality with (α, M) = (0.35,15) is

almost equivalent to that with (α, M) = (0,30).

• When the analysis order is high enough, the difference

becomes small.

37

Page 38: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Feature of Unified Approach

• Linear prediction analysis, Cepstral analysis are the special cases.

• Mathematically well-defined

• Physical interpretation

⇒ Minimization of energy of inverse filter output

⇒ x is Gaussian ⇒ Minimization of p(x|c) (ML estimation)

• Global solution, stability of the system function

• Synthesis filter for direct synthesis from the estimated coefficients

⇒ LMA/MLSA/GMSLA filter

• Extension to adaptive analysis (sample by sample basis)

• Parameter transformation for speech recognition

38

Page 39: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Word Recognition based on HMM

Spectral Analysis:1. (α1, γ1, M1) = (0, −1, 12) ⇒ Linear Prediction

2. (α1, γ1, M1) = (0.35, −1/3, 12) ⇒ Mel-generalized cepstral

analysis

3. (α1, γ1, M1) = (0.35, 0, 12) ⇒ Mel-cepstral analysis

Output vector of HMM:

(α2, γ2, M2) = (0.35, 0, 12)

Mel-cepstral coefficients

and Δ (dynamic coefficients)

H(z) = s−1γ1

⎛⎝ M1∑

m=0

cα1,γ1(m) z−mα1

⎞⎠ = s−1

γ2

⎛⎝ ∞∑

m=0

cα2,γ2(m) z−mα2

⎞⎠

39

Page 40: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Recognition Accuracy

(Continuous HMM, 33 phoneme models, 2618 words)

75 80Recognition accuracy (%)

Linear Prediction (α , γ , M )=1 1 1 (0, -1, 12) 77.5%

Mel-Generalized Cepstral Analysis (α , γ , M )=1 1 1 (0.35, -1/3, 12)

80.7%

Mel-Cepstral Analysis (α , γ , M )=1 1 1 (0.35, 0, 12)

79.6%

40

Page 41: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Application to 16kb/s wideband CELP coder

encoder CELPExcitationGenerator

SynthesisFilter

PerceptualWeightingFilter

MinimizeMSE

MGC Analysis

Quantizationof MGC Coef.

MGC coef.

MGC coef.

Input

decoder CELPExcitationGenerator

SynthesisFilter

MGC coef.

Postfilter

MGC coef.

Output

41

Page 42: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Speech quality as a function of α (γ = −1/2)

0.0 0.1 0.2 0.3 0.4 0.53.0

3.5

4.0

4.5

α

D M

O S

averagefemalemale

42

Page 43: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Subjective Evaluation

1 2 3 4 5D M O S

G.722 48kb/s

G.722 56kb/s

G.722 64kb/s

Conv.CELP16kb/s

MGC-CELP 16kb/s

MNRU(dB)

average female male

16 24 32 40 48

43

Page 44: Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Unified Approach to Speech Spectral

Summary

A unified approach to speech spectral estimation

• A unified approach toLinear predicton and Cepstral analysis

• Introduction of auditory frequency scale

• Efficint representation of speech spectrum with an

appropriate choice of α and γ

• Application to speech anaylysis/synthesis, speech

coding, speech recognition

Future work: Optimal α and γ

(Phoneme/Speaker dependent?)

Speech Signal Processing Toolkit:

http://kt-lab.ics.nitech.ac.jp/~tokuda/SPTK/

44