Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Uniﬁed Approach to Speech Spectral

Mel-Generalized Cepstral Representation of Speech

—A Unified Approach to Speech Spectral Estimation

Keiichi Tokuda

Nagoya Institute of Technology

Carnegie Mellon University

�

�

�

�Tamkang University

March 13, 2002

1

Conventional Speech Spectral Estimation

• Linear prediction (LPC) Autoregressive (AR) model

• Cepstral analysis Exponential (EX) model

• Subband filter bank Nonparametric

Variations

• Model ⇒ Pole-zero (ARMA) model

• Analysis window ⇒ Adaptive analysis

(sample by sample basis)

• Auditory characteristics ⇒ Warped LPC, PLP, etc.

◦ Auditory frequency scales (mel, Bark)

◦ Loudness scales (log, sone)

2

Structure of This Talk

1. Conventional cepstral analysis

2. Introduction of generalized logarithmic function

⇒ Generalized cepstral analysis

3. Introduction of auditory frequency scale

⇒ Mel-generalized cepstral analysis

4. Applications to speech recognition and coding

3

History of Cepstral Analysis

• B.P. Bogert, M.J.R. Healy, J.W. Tukey (1963)

Analysis of seismic signals

— decomposition into direct wave and echo

⇒ Cepstrum, Quefrency, Lifter

• A.M. Noll (1964, 1967)

Pitch extraction based on cepstrum

• A.V. Oppenheim (1966, 1968)

Homomorphic deconvolution

— decomposition into source and vocal tract function

⇒ Complex cepstrum

4

Definition of Cepstrum

Fourier transform of signal s(n)

S(ejω) = F [ s(n) ]

Cepstrum

C(m) = F−1[log |S(ejω)|2

](Bogert et al., Noll)

C(m) = F−1[log |S(ejω)|

](Oppenheim)

5

Complex Cepstrum

z-transform of signal s(n)

S(z) = Z [ s(n) ]

Complex cepstrum

c(m) = Z−1 [ logS(z) ]

= F−1[logS(ejω)

]= F−1

[log |S(ejω)| + j argS(ejω)

]

6

Cepstrum and Complex Cepstrum

log |S(ejω)| = F [C(m) ] = Re [F [ c(m) ] ]

⇓When it is minimum phase (all polesand zeros are located in the unit circle)

c(m) =

⎧⎪⎪⎨⎪⎪⎩

0, m < 0

C(m), m = 0

2C(m), m > 0

7

Homomorphic Deconvolution

Pulse train

White noise

��

��e(n) s(n) = e(n) ∗ h(n)Linear

time-invariantsystemh(n)

� Speech

s(n) = h(n) ∗ e(n)

↓ F

S(ejω) = H(ejω)E(ejω)

↓ log | · |

log |S(ejω)| = log |H(ejω)| + log |E(ejω)|↓ F−1

C(m) = Ch(m) + Ce(m)

8

0 100 200

0

Time (100μs)

0 1 2 3 4 50

20

40

60

80

100

Frequency (kHz)

Amplitude (dB)

0 50 1000

5

Quefrency (100μs)

0 1 2 3 4 50

20

40

60

80

100

Frequency (kHz)

Amplitude (dB)

C(m)

s(n)

0 1 2 3 4 5-50

0

50

Frequency (kHz)

Amplitude (dB)

(a)

(b) (c)

(d) (e)

9

Spectrum of Periodic Signal

h(n)

e(n)

e(n)

Np

w(n)

s(n)

Frequency0 π

2π/Np

Frequency0 π

⏐W(ejω)⏐

log⏐

Ew(e

jω) ⏐

log⏐

Sw(e

jω) ⏐

⏐H(ejω)⏐

10

Problems of Homomorphic Processing

(Cepstral Analysis)

Linear smoothing of log spectrum• affected by fine structure of FFT spectrum

• results in a large bias and variance

Voiced speech (periodic)• Envelope of peaks of spectral fine structure

⇒ Improved cepstral analysis , PSE: Biased

11

Cost Function

P (ω): Estimate of Power Spectrum

IN(ω): Periodogram

E =1

2π

∫ π

−π

{IN(ω)

P (ω)− log

IN(ω)

P (ω)− 1

}dω ⇒ min

x: Gaussian Process ⇒ Maximizing p(x|c)

• Unbiased estimation of log spectrum

• equivalent to one used in LPC

• Minimization of energy of inverse filter output

12

Analysis of Natural Speech

0 1 2 3 4 5Frequency (kHz)

-20

0

20

40

60

80

Log magnitude (dB)

(a) Unbiased cepstral analysis


-20

0

20

40

60

80

Log magnitude (dB)


-20

0

20

40

60

80

Log magnitude (dB)

(b) Linear prediction


-20

0

20

40

60

80

Log magnitude (dB)

13

Generalized Cepstrum

Complex Cepstrum

c(m) = Z−1 [ logS(z) ]

logS(z) = Z [ c(m) ]

⇓

Generalized Cepstrum

cγ(m) = Z−1 [ sγ (S(z)) ]

sγ (S(z)) = Z [ cγ(m) ]

14

Generalized logarithmic function

sγ(w) =

⎧⎨⎩ (wγ − 1)/γ, 0 < |γ| ≤ 1

logw, γ = 0

0 1

1

x

s xγ ( )γ = 1

γ = −1

0 1< <γ

− < <1 0γγ = 0 (log )x

15

Spectral Model

Generalized Cepstrum: cγ(m)

H(z) = s−1γ

⎛⎝ M∑

m=0

cγ(m) z−m

⎞⎠

=

⎧⎪⎪⎪⎪⎪⎪⎨⎪⎪⎪⎪⎪⎪⎩

⎛⎝1 + γ

M∑m=0

cγ(m) z−m

⎞⎠1/γ

, 0 < |γ| ≤ 1

expM∑

m=0

cγ(m) z−m, γ = 0

Inverse function of Generalized logarithm

s−1γ (w) =

⎧⎨⎩ (1 + γw)1/γ, 0 < |γ| ≤ 1

expw, γ = 0

16

Cost Function

E =1

2π

∫ π

−π

{IN(ω)

P (ω)− log

IN(ω)

P (ω)− 1

}dω ⇒ min

Estimate of Power Spectrum

P (ω) = |H(ejω)|2 = σ2|D(ejω)|2

Interpretation in time-domain

ε = E[e2(n)

]⇒ min

1/D(z) �

x(n) e(n)

17

Advantage

−1 ≤ γ ≤ 0:

• Convex function ⇒ Global solutioncan easily be obtained

• The obtained system H(z) is minimum phase, e.g., stable

• γ = −1 ⇒ Linear Prediction

H(z) =1

1 −M∑

m=0

cγ(m)z−m

• γ = 0 ⇒ Cepstrum

H(z) = expM∑

m=0

cγ(m)z−m

18

Prediction Gain

• D(z) is minimum phase

• Gain of D(z) is one⇒

Predictor:

Q(z) =∞∑

k=1

a(k)z−k

Cost Function:

ε = E[e2(n)

]⇒ Prediction Gain:

G =E

[x2(n)

]E

[e2(n)

]

1/D(z) ��x(n) e(n)

⇓��

Q(z)

�

�

��x(n) e(n)+−

19

Analysis of synthetic signal

(Generalized Cepstral Analysis)

0 2.5 5Frequency(Hz)

Log magnitude

10dB

True

γ =-1 -3/4 -1/2 -1/4 0

(a) Example 1

M=8

(LPC) (UELS)-1 -0.5 0

6

7

8

γPrediction gain(dB)

20




Log magnitude

10dB

True

γ =-1 -3/4 -1/2 -1/4 0

(c) Example 3

M=8

(LPC) (UELS)-1 -0.5 0

3

4

5


21




Log magnitude

10dB

True

γ =-1 -3/4 -1/2 -1/4 0

(b) Example 2

M=8

(LPC) (UELS)-1 -0.5 0

6

7


22

Analysis of natural speech

(Generalized Cepstral Analysis) /e/

-20

0

20

40

60

80

Magnitude(dB)

0 1 2 3 4 5Frequency(kHz)

γ = -1


γ = -1/2


γ = -1/3


γ = 0

M=15

(LPC) (UELS)-1 -0.5 0

15

16

γ

Prediction gain(dB)

(a) male /e/

23

Analysis of natural speech

(Generalized Cepstral Analysis) /N/

-20

0

20

40

60

80

Magnitude(dB)


γ = -1


γ = -1/2


γ = -1/3


γ = 0

M=15

(LPC) (UELS)-1 -0.5 0

27

28

γ

Prediction gain(dB)

(b) male /N/

24

Structure of synthesis filter H(z) (γ = −1/n)

inputσ

output1

C(z)

1

C(z)�

��

��

��

�� −−−− �1

C(z)�

1st 3nd . . .. . .. . . n-th

H(z) = σD(z) = σ

{1

C(z)

}n

C(z) =

⎛⎝1 + γ

M∑m=0

c′γ(m) z−m

⎞⎠

25

Structure of synthesis filter H(z) (γ = 0)

—LMA filter

F (z) F (z) F (z) F (z)� � ��

��

��

��

��

��

��

��

+�Input �

� + �Output

AL,1 AL,2 AL,3 AL,4

- -

� � � �

��

�

��

��

��

�

��

��

D(z) = expF (z) RL(F (z)) =

1 +L∑

l=1

AL,l {F (z)}l

1 +L∑

l=1

AL,l {−F (z)}l

F (z) =M∑

m=1

cγ(m) z−m

26

Introduction of auditory frequency scale

First-order all-pass function:

z−1α = Ψ(z) =

z−1 − α

1 − αz−1

Phase Characteristics can be

used for Frequency

Transformation:

ω̃ = tan−1 (1 − α2) sinω

(1 + α2) cosω − 2α

where Ψ(ejω) = e−jω̃

0 π/2 π0

π /2

π

Frequency (rad)

War

ped

frequ

ency

∼

(rad

) mel scale

10kHz samplingα = 0.35

ω

ω

27

Mel-Generalized Cepstral Analysis

Mel-generalized cepstrum: cα,γ(m)

H(z) = s−1γ

⎛⎝ M∑

m=0

cα,γ(m) z−mα

⎞⎠

=

⎧⎪⎪⎪⎪⎪⎪⎨⎪⎪⎪⎪⎪⎪⎩

⎛⎝1 + γ

M∑m=0

cα,γ(m) z−mα

⎞⎠1/γ

, 0 < |γ| ≤ 1

expM∑

m=0

cα,γ(m) z−mα , γ = 0

�

�

�

�z−1α =

z−1 − α

1 − αz−1

28

• (α, γ) = (0, 0) ⇒ Cepstral model:

H(z) = expM∑

m=0

cα,γ(m)z−m

• (α, γ) = (0, −1) ⇒ AR model:

H(z) =1

1 −M∑

m=0

cα,γ(m)z−m

• (α, γ) = (0.35, 0) ⇒ Mel-cepstral model:

H(z) = expM∑

m=0

cα,γ(m) z−mα

• (α, γ) = (0.47, −1) ⇒ Warped AR model:

H(z) =1

1 −M∑

m=0

cα,γ(m) z−mα

29

A Unified Approach to Speech Spectral Estimation

�

� |α| < 1, −1 ≤ γ ≤ 0

Mel-generalized cepstralanalysis

�

� α = 0

�

� γ = −1

�

� γ = 0

Generalized cepstral analysis

Linear prediction Warped Linear Prediction

Unbiased Cepstral analysis Mel-cepstral analysis

30

Mel-generalized analysis of natural speech /N/ M = 12

-40

0

40

Mag

nitu

de(d

B)

α = 0 α = 0.35 α = 0.47

-40

0

40

Mag

nitu

de(d

B)

0 2.5 5-40

0

40

Frequency(kHz)

Mag

nitu

de(d

B)

0 2.5 5

Frequency(kHz)

γ = 0

γ = − 0.5

γ = − 1

0 2.5 5

Frequency(kHz)

31

Example

α = 0

γ = −1

γ = −1/3

γ = 0

0 200 400 (ms)(a) Waveform

n a N b u d e w a

012345

Freq

uenc

y(kH

z)

(α, γ , M) = (0, -1, 12) LP40(dB)012345

Freq

uenc

y(kH

z)

(α, γ , M) = (0, -1/3, 12) GCEP40(dB)012345

Freq

uenc

y(kH

z)

(α, γ , M) = (0, 0, 12) UELS40(dB)(b) Spectral estimates ( α = 0) 32

Example

α = 0.35

γ = −1

γ = −1/3

γ = 0

0 200 400 (ms)(a) Waveform

n a N b u d e w a

012345

Freq

uenc

y(kH

z)

(α, γ , M) = (0.35, -1, 12) WLP40(dB)012345

Freq

uenc

y(kH

z)

(α, γ , M) = (0.35, -1/3, 12) MGCEP40(dB)012345

Freq

uenc

y(kH

z)

(α, γ , M) = (0.35, 0, 12) MCEP40(dB)(b) Spectral estimates ( α = 0.35) 33

Structure of synthesis filter H(z) (γ = −1/n)

inputσ

output1

C(z)

1

C(z)�

��

��

��

�� −−−− �1

C(z)�

1st 2nd . . .. . .. . . n-th

Structure of H(z)

H(z) = σD(z) = σ

{1

C(z)

}n

C(z) =

(1 + γ

M∑m=0

c′α,γ(m) z−mα

) �+�Input

�

Output

z−1 �+� z−1 �+� z−1 �+� z−1�

��

��

�+�

b(1)

�

��

��

�+�

b(2)

�

��

�� b(3)

��

��

��

��

�

��

��

��

��

��

��

��

��

��

��

�

α

+�

��

��

�

α

+�

��

��

�

α

�−

�−

��

γ(1 − α2)

�−

� �

Structure of C(z) (M = 3)

34

Structure of synthesis filter H(z) (γ = 0)

—MLSA filter

D(z) = expF (z) RL(F (z))

F (z) =M∑

m=0

c′α,γ(m) z−mα

• sufficient accuracy: maximum spectral error 0.24dB

• O(8M) multiply-add operations a sample

• guaranteed stability

• M multiply-add operations for filter coefficients calculation

35

Structure of

MLSA filterF (z) F (z) F (z) F (z)� � �

��

��

��

��

��

��

��

��

+�Input �

� + �Output

AL,1 AL,2 AL,3 AL,4

- -

� � � �

��

�

��

��

��

�

��

��

RL(F (z)) expF (z) = D(z), L = 4

��

��

1 − α2

��

��

α ��

��

α ��

��

α

��

��b(1)

��

��b(2)

��

��b(3)

z−1 z−1 z−1 z−1

Input

� �+ � � �+ � � �+� � ��

��

��

��

��

��

��

��

��

��

��

��

�+-

��

��

��

��

��

��

��

��

�+-

��

��+ ��+ Output

Basic filter F (z), M = 3

36

The choice of α, γ for speech analysis/synthesis

Analysis/synthesis system with fixed α and γ

• speech quality change with γ

γ → −1 Clear

γ → 0 Smooth

• When γ = 0, speech quality with (α, M) = (0.35,15) is

almost equivalent to that with (α, M) = (0,30).

• When the analysis order is high enough, the difference

becomes small.

37

Feature of Unified Approach

• Linear prediction analysis, Cepstral analysis are the special cases.

• Mathematically well-defined

• Physical interpretation

⇒ Minimization of energy of inverse filter output

⇒ x is Gaussian ⇒ Minimization of p(x|c) (ML estimation)

• Global solution, stability of the system function

• Synthesis filter for direct synthesis from the estimated coefficients

⇒ LMA/MLSA/GMSLA filter

• Extension to adaptive analysis (sample by sample basis)

• Parameter transformation for speech recognition

38

Word Recognition based on HMM

Spectral Analysis:1. (α1, γ1, M1) = (0, −1, 12) ⇒ Linear Prediction

2. (α1, γ1, M1) = (0.35, −1/3, 12) ⇒ Mel-generalized cepstral

analysis

3. (α1, γ1, M1) = (0.35, 0, 12) ⇒ Mel-cepstral analysis

Output vector of HMM:

(α2, γ2, M2) = (0.35, 0, 12)

Mel-cepstral coefficients

and Δ (dynamic coefficients)

H(z) = s−1γ1

⎛⎝ M1∑

m=0

cα1,γ1(m) z−mα1

⎞⎠ = s−1

γ2

⎛⎝ ∞∑

m=0

cα2,γ2(m) z−mα2

⎞⎠

39

Recognition Accuracy

(Continuous HMM, 33 phoneme models, 2618 words)

75 80Recognition accuracy (%)

Linear Prediction (α , γ , M )=1 1 1 (0, -1, 12) 77.5%

Mel-Generalized Cepstral Analysis (α , γ , M )=1 1 1 (0.35, -1/3, 12)

80.7%

Mel-Cepstral Analysis (α , γ , M )=1 1 1 (0.35, 0, 12)

79.6%

40

Application to 16kb/s wideband CELP coder

encoder CELPExcitationGenerator

SynthesisFilter

PerceptualWeightingFilter

MinimizeMSE

MGC Analysis

Quantizationof MGC Coef.

MGC coef.

MGC coef.

Input

decoder CELPExcitationGenerator

SynthesisFilter

MGC coef.

Postfilter

MGC coef.

Output

41

Speech quality as a function of α (γ = −1/2)

0.0 0.1 0.2 0.3 0.4 0.53.0

3.5

4.0

4.5

α

D M

O S

averagefemalemale

42

Subjective Evaluation

1 2 3 4 5D M O S

G.722 48kb/s

G.722 56kb/s

G.722 64kb/s

Conv.CELP16kb/s

MGC-CELP 16kb/s

MNRU(dB)

average female male

16 24 32 40 48

43

Summary

A unified approach to speech spectral estimation

• A unified approach toLinear predicton and Cepstral analysis

• Introduction of auditory frequency scale

• Efficint representation of speech spectrum with an

appropriate choice of α and γ

• Application to speech anaylysis/synthesis, speech

coding, speech recognition

Future work: Optimal α and γ

(Phoneme/Speaker dependent?)

Speech Signal Processing Toolkit:

http://kt-lab.ics.nitech.ac.jp/~tokuda/SPTK/

44

Documents

Mel-Generalized Cepstral Representation of Speech —A ...tokuda/tokuda_tamkang2002.pdf · Mel-Generalized Cepstral Representation of Speech —A Uniﬁed Approach to Speech Spectral