Presenter: 張志豪

Perceptual Linear Predictive Analysis of Speech

Hynek Hermansky, Speech Technology Laboratory, J. Acoustical Society of America, April 1990


Outline

• Linear Prediction Coding

• Mel-scale Frequency Cepstral Coefficients

• Perceptual Linear Predictive

Introduction

• Feature Extraction

  – Speech Production Model

    • Linear Prediction Coding

  – Speech Perception Model

    • Mel-scale Frequency Cepstral Coefficients

Linear Prediction Coding

• Property

  – Approximates the areas of high-energy concentration while smoothing out the fine harmonic structure and other less-relevant spectral details.

  – The approximated high-energy spectral areas often correspond to the resonance frequencies of the vocal tract (formants).

• Autocorrelation

  – Levinson-Durbin Recursion

• Impulse Response

[Figure: three panels. "Speech" (time domain), "LPC" (time domain), and "Speech and LPC" (frequency domain, horizontal axis 0 to 8000)]
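The autocorrelation method with the Levinson-Durbin recursion can be sketched as follows. This is a minimal illustration, not the code behind the slides: the function names `autocorrelation` and `levinson_durbin` and the sign convention A(z) = 1 + sum_k a[k] z^-k are assumptions made for the example.

```python
def autocorrelation(frame, max_lag):
    """Biased autocorrelation r[0..max_lag] of one (already windowed) frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r[0..order].

    Returns (a, err). a[0] is 1.0 and a[1..order] are the coefficients of the
    prediction-error filter A(z) = 1 + sum_k a[k] z^-k (assumed convention).
    err is the final prediction-error energy. Assumes r[0] > 0.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # Correlation of the current predictor's error with lag i.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                  # error energy shrinks each step
    return a, err
```

For a frame `x`, `levinson_durbin(autocorrelation(x, p), p)` gives an order-`p` all-pole model whose spectrum G / |A(e^jw)|^2 follows the high-energy areas of the speech spectrum.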


• Disadvantage

  – LPC approximates the speech spectrum equally well at all frequencies of the analysis band. This property is inconsistent with human hearing: beyond about 800 Hz, the spectral resolution of hearing decreases with frequency.

  – At the amplitude levels typically encountered in conversational speech, hearing is more sensitive in the middle frequency range of the audible spectrum.

  – The spectral details of speech are not always preserved or discarded by LPC analysis according to their auditory prominence.

Mel-scale Frequency Cepstral Coefficients

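• Mel-scale

  – At low frequencies, the ear's perception is relatively sharp

  – At high frequencies, the ear's perception becomes progressively coarser

  – The ear's perception of frequency varies logarithmically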
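The logarithmic behaviour described above can be illustrated with a short sketch. The slide does not give a formula; the mapping 2595 * log10(1 + f / 700) used here is one widely used analytic approximation of the mel scale and is an assumption of this example, as are the function names.

```python
import math


def hz_to_mel(f_hz):
    """Map frequency in Hz to mels (assumed formula, see lead-in)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_low, f_high):
    """Center frequencies (Hz) of filters spaced uniformly on the mel axis."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]
```

Filters placed this way sit close together at low frequencies and spread apart at high frequencies, which matches the sharp-then-coarse resolution listed in the bullets.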



• Discrete cosine transform

  – Converts from the frequency domain back to the time domain

  – "Frequency of frequency"
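As a sketch of this step, the cepstral coefficients can be obtained by applying a DCT-II to the log filter-bank energies. The function name `dct_cepstrum` and the unnormalized scaling are assumptions of this example; real front ends usually add a normalization factor.

```python
import math


def dct_cepstrum(log_energies, n_ceps):
    """Unnormalized DCT-II of log filter-bank energies.

    Returns c[1..n_ceps]; the 0th (energy-like) coefficient is skipped.
    """
    n = len(log_energies)
    return [sum(e * math.cos(math.pi * q * (j + 0.5) / n)
                for j, e in enumerate(log_energies))
            for q in range(1, n_ceps + 1)]
```

Each output index `q` measures how strongly the log spectrum oscillates at that rate along the frequency axis, which is the sense in which the result is a "frequency of frequency".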

MFCC & LPC

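• Mel-scale Frequency Cepstral Coefficients

  – Advantage

    • Emphasizes the characteristics of the speech spectrum; maintains a relatively good recognition rate even in noisy environments

  – Disadvantage

    • Higher computational cost

• Linear Prediction Coding

  – Advantage

    • Low computational cost

  – Disadvantage

    • Does not take the characteristics of the speech spectrum into account; the recognition rate drops as noise increases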

Perceptual Linear Predictive

[Figure: PLP processing diagram with parts labeled "MFCC" and "LPC"]


• Equal-Loudness Preemphasis

$$E(\omega) = \frac{(\omega^2 + 56.8 \times 10^6)\,\omega^4}{(\omega^2 + 6.3 \times 10^6)^2\,(\omega^2 + 0.38 \times 10^9)\,(\omega^6 + 9.58 \times 10^{26})}$$
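A small sketch of the equal-loudness weighting, using the constants shown on the slide. It assumes that the factor (w^2 + 6.3e6) in the denominator is squared, as in the published PLP formula, and that w is the angular frequency 2*pi*f. The function name `equal_loudness` is made up for the example.

```python
import math


def equal_loudness(f_hz):
    """Equal-loudness weight E(w) at frequency f_hz (w = 2*pi*f_hz).

    The result is unnormalized; only its shape across frequency matters
    when it is used to weight filter-bank outputs.
    """
    w2 = (2.0 * math.pi * f_hz) ** 2
    num = (w2 + 56.8e6) * w2 * w2
    den = (w2 + 6.3e6) ** 2 * (w2 + 0.38e9) * (w2 ** 3 + 9.58e26)
    return num / den
```

The weight is zero at 0 Hz, rises through the low and middle frequencies, and falls off again at very high frequencies because of the sixth-power term.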

• Equal-Loudness Preemphasis (cont.)

  – Is the effect the same as that of pre-emphasis?

[Figure: two frequency-domain panels, "Filter Bank" and "Equal-Loudness"; horizontal axis 2 to 18, vertical axis in units of 10^9]


• Intensity-Loudness Power Law

  – Perceived loudness, L(ω), is approximately the cube root of the intensity, I(ω):

$$L(\omega) = I(\omega)^{1/3}$$

[Figure: two frequency-domain panels, "Equal-Loudness" (vertical axis in units of 10^9) and "Intensity-Loudness" (vertical axis 0 to 1400); horizontal axis 2 to 18]
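The power law is simple enough to show directly. In this sketch the function name `intensity_to_loudness` is an assumption, and the input is taken to be a list of non-negative intensity values, one per frequency channel.

```python
def intensity_to_loudness(intensity):
    """Cube-root amplitude compression, applied per channel."""
    return [x ** (1.0 / 3.0) for x in intensity]
```

The compression is strong: an intensity of the order of 10^9 maps to a loudness of the order of 10^3, and a tenfold increase in intensity raises the loudness only by a factor of about 2.15.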


• Intensity-Loudness Power Law (cont.)

  – The power spectrum does not need a further square root

    • ek = (float)sqrt((double)(t1*t1 + t2*t2));

  – The values after the filter bank do not need a log

    • bins[bin] = log((double)t1);
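To make the two differences concrete, the sketch below contrasts one filter-bank channel computed both ways. It assumes that `t1` and `t2` in the quoted sqrt line are the real and imaginary parts of an FFT bin. The function names and the `(re, im)` / `weights` layout are made up for the example, and equal-loudness weighting is left out for brevity.

```python
import math


def channel_mfcc_style(bins, weights):
    """Magnitude per bin (sqrt), weighted sum over the channel, then log."""
    acc = sum(w * math.sqrt(re * re + im * im)
              for (re, im), w in zip(bins, weights))
    return math.log(acc)


def channel_plp_style(bins, weights):
    """Power per bin (no sqrt), weighted sum, then cube root (no log)."""
    acc = sum(w * (re * re + im * im)
              for (re, im), w in zip(bins, weights))
    return acc ** (1.0 / 3.0)
```

Here `bins` is a list of `(re, im)` pairs covered by the channel and `weights` holds the matching filter weights.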


• Inverse Discrete Fourier Transform

  – Converts from the frequency domain back to the time domain

[Figure: two panels. "Intensity-Loudness" (frequency domain, horizontal axis 2 to 18) and "ac" (time domain, horizontal axis 2 to 12)]
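A sketch of this step: because the auditory spectrum is real and can be treated as even-symmetric, its inverse DFT reduces to a cosine sum and yields autocorrelation values. The function name and the assumption that `spec` holds samples from 0 up to and including the Nyquist frequency are choices made for this example.

```python
import math


def spectrum_to_autocorrelation(spec, n_lags):
    """Inverse DFT of a real, even power-like spectrum.

    spec: at least two samples, from frequency 0 to Nyquist inclusive.
    Returns the autocorrelation r[0..n_lags].
    """
    m = len(spec)
    n_fft = 2 * (m - 1)          # length of the implied full even spectrum
    r = []
    for lag in range(n_lags + 1):
        # The end points appear once, the interior points twice (mirror image).
        acc = spec[0] + spec[-1] * math.cos(math.pi * lag)
        acc += 2.0 * sum(spec[k] * math.cos(2.0 * math.pi * k * lag / n_fft)
                         for k in range(1, m - 1))
        r.append(acc / n_fft)
    return r
```

A flat spectrum gives r = [1, 0, 0, ...], the autocorrelation of white noise, which is a quick sanity check. The resulting values can be fed to an autocorrelation-based LPC solver.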


• Autoregressive Modeling (LPC)

[Figure: two time-domain panels, "ac" and "c"; horizontal axis 2 to 12]
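Assuming the panel labeled "c" shows cepstral coefficients derived from the all-pole model (the slide does not say so explicitly), the conversion can be sketched with the standard recursion below. The function name and the convention A(z) = 1 + sum_k a[k] z^-k with a[0] = 1 are assumptions of this example.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c[1..n_ceps] of the all-pole model 1 / A(z).

    a: prediction-error filter coefficients with a[0] == 1.0.
    Recursion: c[n] = -a[n] - (1/n) * sum_{k=1}^{n-1} k * c[k] * a[n-k],
    where a[n] is taken as 0 for n greater than the model order.
    """
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]
```

For a single pole at 0.5, that is a = [1.0, -0.5], the result is 0.5, 0.5^2 / 2, 0.5^3 / 3, ..., which matches the series expansion of -ln(1 - 0.5 z^-1).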

Experiment

          3      6      9
MFCC      54.21  55.11  55.37
PLP_05    39.01  39.32  39.88
PLP_10    52.55  53.02  53.02
PLP_12    53.79  54.49  54.94
PLP_14    53.62  53.94  54.27
*PLP_12   31.65  32.08  32.03

Thanks


Choice Of The Order Of The Autoregressive PLP Model

• Introduction

• Spectral distortion measure of PLP

• Single-frame phoneme identification

• Isolated-word identification

• Introduction

  – With increasing model order, the spectrum of the all-pole model asymptotically approaches the auditory spectrum.

• Spectral Distortion Measure of PLP

  – Group-delay distortion measure

    • The spectral peaks of the model are enhanced and its spectral slope is suppressed.

    • The group-delay metric is more sensitive to distance between narrow peaks.

    • The group-delay measure is more sensitive to the actual value of the spectral peak width.

  – Exponential measure

    • Allows for various degrees of peak enhancement.
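The slide does not give formulas for these measures. As a hedged illustration, the sketch below uses the common index-weighted cepstral form, in which each cepstral difference is scaled by n**s. With s = 1 this is the usual cepstral expression of a group-delay distance, and other values of s stand in for the exponential measure. The function name and the parameter `s` are assumptions of this example.

```python
def weighted_cepstral_distance(c_ref, c_test, s=1.0):
    """Sum over n of (n**s * (c_ref[n] - c_test[n]))**2, with n from 1.

    s = 0 gives the plain squared Euclidean cepstral distance;
    s = 1 weights by the cepstral index (group-delay style);
    values in between give intermediate peak enhancement.
    """
    return sum((n ** s * (x - y)) ** 2
               for n, (x, y) in enumerate(zip(c_ref, c_test), start=1))
```

Larger `s` gives more weight to higher-index coefficients, which carry the sharper spectral detail such as narrow peaks.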

• Single-Frame Phoneme Identification

  – PLP identification accuracy increases up to about the 5th order of the autoregressive model and then starts decreasing with further increases in the model order.


• Isolated-Word Identification

• Discussion

  – The advantage of PLP over LP is that it allows effective suppression of speaker-dependent information through the choice of a particular model order.

  – The linguistically relevant speaker-independent cues lie in the gross shape of the auditory spectrum. This gross shape can be characterized by the one or two spectral peaks of the 5th-order PLP model.

PLP and Human Hearing

• Introduction

• Formant Frequency Changes

• Sensitivity to Bandwidth Changes

• Sensitivity to Spectral Tilt

• Sensitivity to F0

• Discussion

• Introduction

  – Human sensitivity to changes in the first three formant frequencies is approximately constant in relative frequency. LP analysis, which treats all frequencies of the analysis band equally, conflicts with this.


• Formant Frequency Changes


PLP and Vowel Perception

• Introduction

• The effective second formant

• Spectral peak integration theory

• The significance of the bandwidth B2

• Discussion
