Mel-Generalized Cepstral Representation of Speech
—A Unified Approach to Speech Spectral Estimation
Keiichi Tokuda
Nagoya Institute of Technology
Carnegie Mellon University
Tamkang University
March 13, 2002
1
Conventional Speech Spectral Estimation
• Linear prediction (LPC): Autoregressive (AR) model
• Cepstral analysis: Exponential (EX) model
• Subband filter bank: Nonparametric
Variations
• Model ⇒ Pole-zero (ARMA) model
• Analysis window ⇒ Adaptive analysis
(sample by sample basis)
• Auditory characteristics ⇒ Warped LPC, PLP, etc.
◦ Auditory frequency scales (mel, Bark)
◦ Loudness scales (log, sone)
2
Structure of This Talk
1. Conventional cepstral analysis
2. Introduction of generalized logarithmic function
⇒ Generalized cepstral analysis
3. Introduction of auditory frequency scale
⇒ Mel-generalized cepstral analysis
4. Applications to speech recognition and coding
3
History of Cepstral Analysis
• B.P. Bogert, M.J.R. Healy, J.W. Tukey (1963)
Analysis of seismic signals
— decomposition into direct wave and echo
⇒ Cepstrum, Quefrency, Lifter
• A.M. Noll (1964, 1967)
Pitch extraction based on cepstrum
• A.V. Oppenheim (1966, 1968)
Homomorphic deconvolution
— decomposition into source and vocal tract function
⇒ Complex cepstrum
4
Definition of Cepstrum
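Fourier transform of signal s(n)
$$S(e^{j\omega}) = \mathcal{F}[\,s(n)\,]$$
Cepstrum
$$C(m) = \mathcal{F}^{-1}\left[\log |S(e^{j\omega})|^2\right] \quad \text{(Bogert et al., Noll)}$$
$$C(m) = \mathcal{F}^{-1}\left[\log |S(e^{j\omega})|\right] \quad \text{(Oppenheim)}$$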
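The two definitions differ only by a factor of two, which a short numerical sketch can make concrete. The block below assumes a DFT approximation of the Fourier transform; the function name `real_cepstrum`, the FFT length, and the test signal are illustrative choices, not part of the talk.

```python
import numpy as np

def real_cepstrum(s, n_fft=1024, power=False):
    """Return C(m) = IDFT[log |S|], or IDFT[log |S|^2] when power=True."""
    spectrum = np.fft.fft(s, n_fft)          # DFT stands in for S(e^{jw})
    log_mag = np.log(np.abs(spectrum))
    if power:
        log_mag = 2.0 * log_mag              # log |S|^2 = 2 log |S|
    return np.fft.ifft(log_mag).real

# Hypothetical short signal whose spectrum has no zeros on the unit circle.
s = np.array([1.0, 0.5])
C_mag = real_cepstrum(s)                     # Oppenheim's definition
C_pow = real_cepstrum(s, power=True)         # Bogert et al. / Noll definition
```

Because log |S| is real and even in ω, the resulting C(m) is real and symmetric in m.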
5
Complex Cepstrum
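z-transform of signal s(n)
$$S(z) = \mathcal{Z}[\,s(n)\,]$$
Complex cepstrum
$$c(m) = \mathcal{Z}^{-1}[\,\log S(z)\,] = \mathcal{F}^{-1}\left[\log S(e^{j\omega})\right] = \mathcal{F}^{-1}\left[\log |S(e^{j\omega})| + j \arg S(e^{j\omega})\right]$$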
6
Cepstrum and Complex Cepstrum
$$\log |S(e^{j\omega})| = \mathcal{F}[\,C(m)\,] = \mathrm{Re}\left[\mathcal{F}[\,c(m)\,]\right]$$
⇓ When s(n) is minimum phase (all poles and zeros are located inside the unit circle)
$$c(m) = \begin{cases} 0, & m < 0 \\ C(m), & m = 0 \\ 2C(m), & m > 0 \end{cases}$$
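As a rough check of the minimum-phase relation, the sketch below folds the real cepstrum C(m) = F^{-1}[log |S|] into c(m) and rebuilds the signal from it. The helper name `complex_cepstrum_min_phase`, the FFT length, and the two-tap test signal are assumptions made for illustration.

```python
import numpy as np

def complex_cepstrum_min_phase(C):
    """Fold an even real cepstrum C(m) (DFT ordering) into c(m)."""
    n = len(C)
    c = np.zeros(n)
    c[0] = C[0]                        # m = 0
    c[1:n // 2] = 2.0 * C[1:n // 2]    # m > 0
    return c                           # m < 0 (upper half) stays zero

n_fft = 1024
s = np.array([1.0, 0.5])               # single zero at z = -0.5, inside the unit circle
C = np.fft.ifft(np.log(np.abs(np.fft.fft(s, n_fft)))).real
c = complex_cepstrum_min_phase(C)

# exp of the DFT of c(m) approximates S(e^{jw}); its inverse DFT recovers s(n).
s_rec = np.fft.ifft(np.exp(np.fft.fft(c))).real
```

For this signal log(1 + 0.5 z^{-1}) expands to 0.5 z^{-1} − 0.125 z^{-2} + ..., so c(1) and c(2) should come out close to 0.5 and −0.125.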
7
Homomorphic Deconvolution
Excitation e(n) (pulse train or white noise) → linear time-invariant system h(n) → speech s(n) = e(n) ∗ h(n)
$$s(n) = h(n) * e(n)$$
↓ F
$$S(e^{j\omega}) = H(e^{j\omega})\,E(e^{j\omega})$$
↓ log | · |
$$\log |S(e^{j\omega})| = \log |H(e^{j\omega})| + \log |E(e^{j\omega})|$$
↓ F−1
$$C(m) = C_h(m) + C_e(m)$$
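The additivity C(m) = C_h(m) + C_e(m) can be checked numerically. This is a minimal sketch that assumes circular convolution, so that S = H E holds exactly on the DFT grid; the impulse response `h` and the noise excitation `e` are hypothetical stand-ins, not data from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
h = 0.8 ** np.arange(n)              # hypothetical decaying impulse response
e = rng.standard_normal(n)           # white-noise excitation

def real_cepstrum(x):
    """C(m) = IDFT[log |X|]."""
    return np.fft.ifft(np.log(np.abs(np.fft.fft(x)))).real

# Circular convolution via the DFT: s = h (*) e.
s = np.fft.ifft(np.fft.fft(h) * np.fft.fft(e)).real

C_s = real_cepstrum(s)
C_h = real_cepstrum(h)
C_e = real_cepstrum(e)
```

With a linear (non-circular) convolution and a finite analysis window the relation holds only approximately.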
8
[Figure, panels (a) to (e): waveform s(n) versus time (100 μs); cepstrum C(m) versus quefrency (100 μs); spectra as amplitude (dB) versus frequency (kHz), 0 to 5 kHz]
9
Spectrum of Periodic Signal
[Figure: excitation e(n) with period Np, system h(n), window w(n), and signal s(n); |W(e^{jω})|, |H(e^{jω})|, log |E_w(e^{jω})| and log |S_w(e^{jω})| versus frequency (0 to π), with harmonic spacing 2π/Np]
10
Problems of Homomorphic Processing
(Cepstral Analysis)
Linear smoothing of log spectrum
• affected by fine structure of FFT spectrum
• results in a large bias and variance
Voiced speech (periodic)
• Envelope of peaks of spectral fine structure
⇒ Improved cepstral analysis, PSE: Biased
11
Cost Function
P(ω): Estimate of power spectrum
I_N(ω): Periodogram
$$E = \frac{1}{2\pi}\int_{-\pi}^{\pi}\left\{\frac{I_N(\omega)}{P(\omega)} - \log\frac{I_N(\omega)}{P(\omega)} - 1\right\}d\omega \;\Rightarrow\; \min$$
x: Gaussian process ⇒ Maximizing p(x|c)
• Unbiased estimation of log spectrum
• equivalent to one used in LPC
• Minimization of energy of inverse filter output
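The cost E can be approximated by averaging the integrand over a uniform DFT grid. The sketch below does that for a hypothetical white-noise frame; `spectral_cost` and the two candidate model spectra are illustrative names and choices, not the estimator used in the talk.

```python
import numpy as np

def spectral_cost(periodogram, model_power):
    """Approximate E = (1/2pi) * integral of {I/P - log(I/P) - 1} on a uniform grid."""
    ratio = periodogram / model_power
    return np.mean(ratio - np.log(ratio) - 1.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(512)                   # hypothetical analysis frame
I_N = np.abs(np.fft.fft(x)) ** 2 / len(x)      # periodogram I_N(w)

E_flat = spectral_cost(I_N, np.ones_like(I_N)) # flat (white) model spectrum
E_self = spectral_cost(I_N, I_N)               # model identical to the periodogram
```

Since u − log u − 1 ≥ 0 with equality only at u = 1, the cost is non-negative and vanishes only when P(ω) matches I_N(ω) everywhere.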
12
Analysis of Natural Speech
[Figure: log magnitude (dB, −20 to 80) versus frequency (kHz, 0 to 5); (a) Unbiased cepstral analysis, (b) Linear prediction]
13
Generalized Cepstrum
Complex Cepstrum
$$c(m) = \mathcal{Z}^{-1}[\,\log S(z)\,]$$
$$\log S(z) = \mathcal{Z}[\,c(m)\,]$$
⇓
Generalized Cepstrum
$$c_\gamma(m) = \mathcal{Z}^{-1}[\,s_\gamma(S(z))\,]$$
$$s_\gamma(S(z)) = \mathcal{Z}[\,c_\gamma(m)\,]$$
14
Generalized logarithmic function
$$s_\gamma(w) = \begin{cases} (w^\gamma - 1)/\gamma, & 0 < |\gamma| \le 1 \\ \log w, & \gamma = 0 \end{cases}$$
[Figure: s_γ(x) versus x for γ = 1, 0 < γ < 1, γ = 0 (log x), −1 < γ < 0, and γ = −1]
15
Spectral Model
Generalized Cepstrum: cγ(m)
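$$H(z) = s_\gamma^{-1}\left(\sum_{m=0}^{M} c_\gamma(m) z^{-m}\right) = \begin{cases} \left(1 + \gamma \sum_{m=0}^{M} c_\gamma(m) z^{-m}\right)^{1/\gamma}, & 0 < |\gamma| \le 1 \\ \exp \sum_{m=0}^{M} c_\gamma(m) z^{-m}, & \gamma = 0 \end{cases}$$
Inverse function of generalized logarithm
$$s_\gamma^{-1}(w) = \begin{cases} (1 + \gamma w)^{1/\gamma}, & 0 < |\gamma| \le 1 \\ \exp w, & \gamma = 0 \end{cases}$$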
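A small sketch of the generalized logarithm and its inverse for positive real arguments may help show how γ interpolates between familiar cases. The function names `gen_log` and `gen_exp` are illustrative, not taken from the talk.

```python
import numpy as np

def gen_log(w, gamma):
    """Generalized logarithm s_gamma(w) for real w > 0."""
    if gamma == 0.0:
        return np.log(w)
    return (np.power(w, gamma) - 1.0) / gamma

def gen_exp(v, gamma):
    """Inverse generalized logarithm s_gamma^{-1}(v)."""
    if gamma == 0.0:
        return np.exp(v)
    return np.power(1.0 + gamma * v, 1.0 / gamma)

w = np.array([0.5, 1.0, 2.0])
linear = gen_log(w, 1.0)                          # equals w - 1
reciprocal = gen_log(w, -1.0)                     # equals 1 - 1/w
near_log = gen_log(w, 1e-6)                       # approaches log w as gamma -> 0
round_trip = gen_exp(gen_log(w, -0.5), -0.5)      # should return w
```

The γ = 0 branch is handled separately because (w^γ − 1)/γ is a 0/0 form there; the limit is log w.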
16
Cost Function
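$$E = \frac{1}{2\pi}\int_{-\pi}^{\pi}\left\{\frac{I_N(\omega)}{P(\omega)} - \log\frac{I_N(\omega)}{P(\omega)} - 1\right\}d\omega \;\Rightarrow\; \min$$
Estimate of power spectrum
$$P(\omega) = |H(e^{j\omega})|^2 = \sigma^2 |D(e^{j\omega})|^2$$
Interpretation in time domain
$$\varepsilon = E\left[e^2(n)\right] \;\Rightarrow\; \min$$
x(n) → [ 1/D(z) ] → e(n)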
17
Advantage
−1 ≤ γ ≤ 0:
• Convex function ⇒ Global solution can easily be obtained
• The obtained system H(z) is minimum phase, i.e., stable
• γ = −1 ⇒ Linear prediction
$$H(z) = \frac{1}{1 - \sum_{m=0}^{M} c_\gamma(m) z^{-m}}$$
• γ = 0 ⇒ Cepstrum
$$H(z) = \exp \sum_{m=0}^{M} c_\gamma(m) z^{-m}$$
18
Prediction Gain
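• D(z) is minimum phase
• Gain of D(z) is one
⇒
Predictor:
$$Q(z) = \sum_{k=1}^{\infty} a(k) z^{-k}$$
Cost function:
$$\varepsilon = E\left[e^2(n)\right]$$
⇒ Prediction gain:
$$G = \frac{E\left[x^2(n)\right]}{E\left[e^2(n)\right]}$$
x(n) → [ 1/D(z) ] → e(n)
⇓
x(n) → (+) → e(n), with the predictor output Q(z) x(n) subtracted at the summing node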
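To make the prediction gain concrete, the sketch below computes G for a hypothetical first-order autoregressive signal using the known one-tap predictor Q(z) = a z^{-1}. The coefficient `a`, the sample count, and the random seed are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
a = 0.9                               # hypothetical AR(1) coefficient
n = 50_000
w = rng.standard_normal(n)            # white driving noise

# Generate x(n) = a x(n-1) + w(n).
x = np.zeros(n)
for i in range(1, n):
    x[i] = a * x[i - 1] + w[i]

# Prediction error e(n) = x(n) - Q(z) x(n) with Q(z) = a z^{-1}.
e = x[1:] - a * x[:-1]

gain = np.mean(x ** 2) / np.mean(e ** 2)      # G = E[x^2] / E[e^2]
gain_db = 10.0 * np.log10(gain)
```

For this process the theoretical gain is 1/(1 − a²), about 5.26 (roughly 7.2 dB); the sample estimate fluctuates around that value.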
19
Analysis of synthetic signal
(Generalized Cepstral Analysis)
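[Figure (a) Example 1, M = 8: log magnitude (10 dB scale) versus frequency (axis 0 to 5, labeled Hz in the original); true spectrum and estimates for γ = −1, −3/4, −1/2, −1/4, 0. Inset: prediction gain (dB, 6 to 8) versus γ from −1 (LPC) to 0 (UELS)]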
20
Analysis of synthetic signal
(Generalized Cepstral Analysis)
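[Figure (c) Example 3, M = 8: log magnitude (10 dB scale) versus frequency (axis 0 to 5, labeled Hz in the original); true spectrum and estimates for γ = −1, −3/4, −1/2, −1/4, 0. Inset: prediction gain (dB, 3 to 5) versus γ from −1 (LPC) to 0 (UELS)]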
21
Analysis of synthetic signal
(Generalized Cepstral Analysis)
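[Figure (b) Example 2, M = 8: log magnitude (10 dB scale) versus frequency (axis 0 to 5, labeled Hz in the original); true spectrum and estimates for γ = −1, −3/4, −1/2, −1/4, 0. Inset: prediction gain (dB, 6 to 7) versus γ from −1 (LPC) to 0 (UELS)]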
22
Analysis of natural speech
(Generalized Cepstral Analysis) /e/
[Figure (a) male /e/, M = 15: magnitude (dB, −20 to 80) versus frequency (kHz, 0 to 5) for γ = −1, −1/2, −1/3, 0. Inset: prediction gain (dB, 15 to 16) versus γ from −1 (LPC) to 0 (UELS)]
23
Analysis of natural speech
(Generalized Cepstral Analysis) /N/
[Figure (b) male /N/, M = 15: magnitude (dB, −20 to 80) versus frequency (kHz, 0 to 5) for γ = −1, −1/2, −1/3, 0. Inset: prediction gain (dB, 27 to 28) versus γ from −1 (LPC) to 0 (UELS)]
24
Structure of synthesis filter H(z) (γ = −1/n)
[Block diagram: gain σ and a cascade of n identical sections 1/C(z) (1st, 2nd, ..., n-th) between input and output]
$$H(z) = \sigma D(z) = \sigma \left\{\frac{1}{C(z)}\right\}^{n}$$
$$C(z) = 1 + \gamma \sum_{m=0}^{M} c'_\gamma(m) z^{-m}$$
25
Structure of synthesis filter H(z) (γ = 0)
—LMA filter
[Block diagram: four F(z) sections between Input and Output with tap weights A_{L,1}, A_{L,2}, A_{L,3}, A_{L,4}]
$$D(z) = \exp F(z) \simeq R_L(F(z)) = \frac{1 + \sum_{l=1}^{L} A_{L,l} \{F(z)\}^{l}}{1 + \sum_{l=1}^{L} A_{L,l} \{-F(z)\}^{l}}$$
$$F(z) = \sum_{m=1}^{M} c_\gamma(m) z^{-m}$$
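The rational function R_L has the form of a Padé approximant of the exponential. The sketch below evaluates it for a real scalar argument using the standard (unmodified) [L/L] Padé coefficients; that coefficient choice is an assumption and may differ from the A_{L,l} values used in the talk. The names `pade_coeffs` and `R` are illustrative.

```python
from math import factorial
import numpy as np

def pade_coeffs(L):
    """Standard [L/L] Pade coefficients A_{L,l}, l = 1..L, for exp (an assumption)."""
    return [factorial(2 * L - l) * factorial(L)
            / (factorial(2 * L) * factorial(l) * factorial(L - l))
            for l in range(1, L + 1)]

def R(x, L=4):
    """R_L(x) = (1 + sum A_l x^l) / (1 + sum A_l (-x)^l)."""
    A = pade_coeffs(L)
    num = 1.0 + sum(A[l - 1] * x ** l for l in range(1, L + 1))
    den = 1.0 + sum(A[l - 1] * (-x) ** l for l in range(1, L + 1))
    return num / den

x = np.linspace(-1.0, 1.0, 201)
max_err = np.max(np.abs(R(x) - np.exp(x)))    # approximation error on [-1, 1]
```

In the filter, the scalar x is replaced by the transfer function F(z), so the approximation quality depends on how large |F(e^{jω})| becomes.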
26
Introduction of auditory frequency scale
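First-order all-pass function:
$$z_\alpha^{-1} = \Psi(z) = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}$$
Phase characteristics can be used for frequency transformation:
$$\tilde{\omega} = \tan^{-1} \frac{(1 - \alpha^2) \sin\omega}{(1 + \alpha^2) \cos\omega - 2\alpha}$$
where $\Psi(e^{j\omega}) = e^{-j\tilde{\omega}}$
[Figure: warped frequency ω̃ (rad) versus frequency ω (rad), both 0 to π, compared with the mel scale; 10 kHz sampling, α = 0.35]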
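The warping formula can be checked directly against the phase of the all-pass function. The sketch below assumes `arctan2` is used to select the branch of tan^{-1} that keeps ω̃ in [0, π]; the function name `warp` and the frequency grid are illustrative.

```python
import numpy as np

def warp(omega, alpha):
    """Warped frequency of the first-order all-pass function with parameter alpha."""
    num = (1.0 - alpha ** 2) * np.sin(omega)
    den = (1.0 + alpha ** 2) * np.cos(omega) - 2.0 * alpha
    return np.arctan2(num, den)        # four-quadrant tan^{-1}

alpha = 0.35
omega = np.linspace(0.01, np.pi - 0.01, 200)

# Psi(e^{jw}) evaluated on the unit circle.
z_inv = np.exp(-1j * omega)
psi = (z_inv - alpha) / (1.0 - alpha * z_inv)

omega_tilde = warp(omega, alpha)       # should equal -arg Psi(e^{jw})
```

For α > 0 the low-frequency region is stretched, which is what gives the mel-like behavior.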
27
Mel-Generalized Cepstral Analysis
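Mel-generalized cepstrum: c_{α,γ}(m)
$$H(z) = s_\gamma^{-1}\left(\sum_{m=0}^{M} c_{\alpha,\gamma}(m) z_\alpha^{-m}\right) = \begin{cases} \left(1 + \gamma \sum_{m=0}^{M} c_{\alpha,\gamma}(m) z_\alpha^{-m}\right)^{1/\gamma}, & 0 < |\gamma| \le 1 \\ \exp \sum_{m=0}^{M} c_{\alpha,\gamma}(m) z_\alpha^{-m}, & \gamma = 0 \end{cases}$$
where
$$z_\alpha^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}$$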
28
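• (α, γ) = (0, 0) ⇒ Cepstral model:
$$H(z) = \exp \sum_{m=0}^{M} c_{\alpha,\gamma}(m) z^{-m}$$
• (α, γ) = (0, −1) ⇒ AR model:
$$H(z) = \frac{1}{1 - \sum_{m=0}^{M} c_{\alpha,\gamma}(m) z^{-m}}$$
• (α, γ) = (0.35, 0) ⇒ Mel-cepstral model:
$$H(z) = \exp \sum_{m=0}^{M} c_{\alpha,\gamma}(m) z_\alpha^{-m}$$
• (α, γ) = (0.47, −1) ⇒ Warped AR model:
$$H(z) = \frac{1}{1 - \sum_{m=0}^{M} c_{\alpha,\gamma}(m) z_\alpha^{-m}}$$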
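All four special cases can be evaluated with one routine. The sketch below computes log |H(e^{jω})| from a coefficient vector for arbitrary (α, γ); the function name `mgc_log_magnitude` and the example coefficients are hypothetical, and the magnitude is computed as log|1 + γ·Σ|/γ to avoid choosing a branch of the complex power.

```python
import numpy as np

def mgc_log_magnitude(c, alpha, gamma, omega):
    """log |H(e^{j omega})| for mel-generalized cepstral coefficients c(0..M)."""
    z_inv = np.exp(-1j * omega)
    z_alpha_inv = (z_inv - alpha) / (1.0 - alpha * z_inv)    # first-order all-pass
    m = np.arange(len(c))
    poly = (np.asarray(c) * z_alpha_inv[:, None] ** m).sum(axis=1)
    if gamma == 0.0:
        return poly.real                      # log |exp(poly)| = Re(poly)
    return np.log(np.abs(1.0 + gamma * poly)) / gamma

omega = np.linspace(0.0, np.pi, 5)

# (alpha, gamma) = (0, -1): AR model H(z) = 1 / (1 - 0.5 z^{-1}).
ar = mgc_log_magnitude([0.0, 0.5], alpha=0.0, gamma=-1.0, omega=omega)

# (alpha, gamma) = (0, 0): cepstral model with c(0) = log 2, i.e. H(z) = 2.
cep = mgc_log_magnitude([np.log(2.0)], alpha=0.0, gamma=0.0, omega=omega)
```

Setting `alpha=0.35` or `alpha=0.47` with the same routine gives the mel-cepstral and warped AR models listed above.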
29
A Unified Approach to Speech Spectral Estimation
Mel-generalized cepstral analysis: |α| < 1, −1 ≤ γ ≤ 0
                 α = 0                            general α
−1 ≤ γ ≤ 0       Generalized cepstral analysis    Mel-generalized cepstral analysis
γ = −1           Linear prediction                Warped linear prediction
γ = 0            Unbiased cepstral analysis       Mel-cepstral analysis
30
Mel-generalized analysis of natural speech /N/ M = 12
[Figure: grid of spectral estimates, magnitude (dB, −40 to 40) versus frequency (kHz, 0 to 5); columns α = 0, α = 0.35, α = 0.47; rows γ = 0, γ = −0.5, γ = −1]
31
Example
α = 0; γ = −1, −1/3, 0
[Figure (a) Waveform, 0 to 400 ms, phoneme labels n a N b u d e w a]
[Figure (b) Spectral estimates (α = 0), frequency (kHz, 0 to 5), 40 dB scale: (α, γ, M) = (0, −1, 12) LP; (0, −1/3, 12) GCEP; (0, 0, 12) UELS]
32
Example
α = 0.35; γ = −1, −1/3, 0
[Figure (a) Waveform, 0 to 400 ms, phoneme labels n a N b u d e w a]
[Figure (b) Spectral estimates (α = 0.35), frequency (kHz, 0 to 5), 40 dB scale: (α, γ, M) = (0.35, −1, 12) WLP; (0.35, −1/3, 12) MGCEP; (0.35, 0, 12) MCEP]
33
Structure of synthesis filter H(z) (γ = −1/n)
Structure of H(z)
[Block diagram: gain σ and a cascade of n identical sections 1/C(z) (1st, 2nd, ..., n-th) between input and output]
$$H(z) = \sigma D(z) = \sigma \left\{\frac{1}{C(z)}\right\}^{n}$$
$$C(z) = 1 + \gamma \sum_{m=0}^{M} c'_{\alpha,\gamma}(m) z_\alpha^{-m}$$
Structure of C(z) (M = 3)
[Block diagram: delays z^{-1}, multipliers α, coefficients b(1), b(2), b(3), and a gain γ(1 − α²) between Input and Output]
34
Structure of synthesis filter H(z) (γ = 0)
—MLSA filter
$$D(z) = \exp F(z) \simeq R_L(F(z))$$
$$F(z) = \sum_{m=0}^{M} c'_{\alpha,\gamma}(m) z_\alpha^{-m}$$
• Sufficient accuracy: maximum spectral error 0.24 dB
• O(8M) multiply-add operations per sample
• Guaranteed stability
• M multiply-add operations for filter coefficient calculation
35
Structure of MLSA filter
[Block diagram: R_L(F(z)) ≃ exp F(z) = D(z), L = 4; four F(z) sections between Input and Output with tap weights A_{L,1}, A_{L,2}, A_{L,3}, A_{L,4}]
[Block diagram: basic filter F(z), M = 3; delays z^{-1}, multipliers α and 1 − α², coefficients b(1), b(2), b(3) between Input and Output]
36
The choice of α, γ for speech analysis/synthesis
Analysis/synthesis system with fixed α and γ
• Speech quality changes with γ
γ → −1: Clear
γ → 0: Smooth
• When γ = 0, speech quality with (α, M) = (0.35, 15) is almost equivalent to that with (α, M) = (0, 30).
• When the analysis order is high enough, the difference becomes small.
37
Feature of Unified Approach
• Linear prediction analysis and cepstral analysis are special cases.
• Mathematically well-defined
• Physical interpretation
⇒ Minimization of energy of inverse filter output
⇒ x is Gaussian ⇒ Maximization of p(x|c) (ML estimation)
• Global solution, stability of the system function
• Synthesis filter for direct synthesis from the estimated coefficients
⇒ LMA/MLSA/MGLSA filter
• Extension to adaptive analysis (sample by sample basis)
• Parameter transformation for speech recognition
38
Word Recognition based on HMM
Spectral analysis:
1. (α1, γ1, M1) = (0, −1, 12) ⇒ Linear prediction
2. (α1, γ1, M1) = (0.35, −1/3, 12) ⇒ Mel-generalized cepstral analysis
3. (α1, γ1, M1) = (0.35, 0, 12) ⇒ Mel-cepstral analysis
Output vector of HMM:
(α2, γ2, M2) = (0.35, 0, 12)
Mel-cepstral coefficients and Δ (dynamic coefficients)
$$H(z) = s_{\gamma_1}^{-1}\left(\sum_{m=0}^{M_1} c_{\alpha_1,\gamma_1}(m) z_{\alpha_1}^{-m}\right) = s_{\gamma_2}^{-1}\left(\sum_{m=0}^{\infty} c_{\alpha_2,\gamma_2}(m) z_{\alpha_2}^{-m}\right)$$
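One way to see what this parameter transformation does is a purely numerical sketch for the case (α1, γ1) = (0, −1) to (α2, γ2) = (α2, 0): sample log |H| on a grid that is uniform in the warped frequency, take an inverse DFT, and fold the result as for a minimum-phase cepstrum. This FFT-based route is an assumption made for illustration and is not claimed to be the transformation algorithm used in the talk; the names `warp` and `lpc_to_mel_cepstrum` and the test coefficient are hypothetical.

```python
import numpy as np

def warp(omega, alpha):
    """Warped frequency of the first-order all-pass function."""
    num = (1.0 - alpha ** 2) * np.sin(omega)
    den = (1.0 + alpha ** 2) * np.cos(omega) - 2.0 * alpha
    return np.arctan2(num, den)

def lpc_to_mel_cepstrum(a, alpha2, order, n_fft=1024):
    """H(z) = 1 / (1 - sum_k a[k-1] z^-k)  ->  c_{alpha2,0}(0..order)."""
    omega_tilde = 2.0 * np.pi * np.arange(n_fft) / n_fft    # uniform warped grid
    omega = warp(omega_tilde, -alpha2)                      # inverse warping
    k = np.arange(1, len(a) + 1)
    denom = 1.0 - (np.asarray(a) * np.exp(-1j * omega[:, None] * k)).sum(axis=1)
    log_mag = -np.log(np.abs(denom))                        # log |H|
    C = np.fft.ifft(log_mag).real                           # real cepstrum on warped axis
    c = 2.0 * C[:order + 1]                                 # c(m) = 2 C(m) for m > 0
    c[0] = C[0]                                             # c(0) = C(0)
    return c

a = [0.5]                                                   # hypothetical AR(1) model
c_lin = lpc_to_mel_cepstrum(a, alpha2=0.0, order=5)         # plain cepstrum: a^m / m
c_mel = lpc_to_mel_cepstrum(a, alpha2=0.35, order=60)       # mel-cepstrum, truncated
```

The infinite sum on the right-hand side is truncated at `order`; the truncation error depends on how quickly the transformed coefficients decay.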
39
Recognition Accuracy
(Continuous HMM, 33 phoneme models, 2618 words)
Analysis method                      (α1, γ1, M1)         Recognition accuracy (%)
Linear prediction                    (0, −1, 12)          77.5
Mel-generalized cepstral analysis    (0.35, −1/3, 12)     80.7
Mel-cepstral analysis                (0.35, 0, 12)        79.6
40
Application to 16kb/s wideband CELP coder
[Block diagram, encoder: Input → MGC analysis → Quantization of MGC coef.; CELP excitation generator → Synthesis filter → Perceptual weighting filter → Minimize MSE; MGC coef. are supplied to the filters]
[Block diagram, decoder: CELP excitation generator → Synthesis filter → Postfilter → Output; MGC coef. are supplied to the filters]
41
Speech quality as a function of α (γ = −1/2)
[Figure: DMOS (3.0 to 4.5) versus α (0.0 to 0.5); curves for average, female, and male]
42
Subjective Evaluation
[Figure: DMOS (1 to 5) for G.722 at 48, 56, and 64 kb/s, conventional CELP at 16 kb/s, MGC-CELP at 16 kb/s, and MNRU references at 16, 24, 32, 40, and 48 dB; results shown for average, female, and male]
43
Summary
A unified approach to speech spectral estimation
• A unified approach to linear prediction and cepstral analysis
• Introduction of auditory frequency scale
• Efficient representation of speech spectrum with an appropriate choice of α and γ
• Application to speech analysis/synthesis, speech coding, speech recognition
Future work: Optimal α and γ
(Phoneme/Speaker dependent?)
Speech Signal Processing Toolkit:
http://kt-lab.ics.nitech.ac.jp/~tokuda/SPTK/
44