Session 08 9

8/8/2019 Session 08 9

1/52

Speech Signal Analysis and Coding

Dr. Arun Kumar

Centre for Applied Research in Electronics (CARE), IIT Delhi

[email protected]

Contents

Speech Processing Applications

Speech Signal Understanding
    Speech Production
    Speech Signal Characteristics and Analysis

Speech Coding
    Coding Standards
    Coder Attributes including Quality Evaluation
    Coding Methodologies

Speech Processing Applications

Speech Transmission
    Trunk-line telephony
    Wireless telephony

Speech Storage
    Voice Mail, Voice Memo, Answering machines

Speech Synthesis
    Text-to-speech synthesis
    Automatic information services

Speaker Verification and Identification
    Phone banking
    Secure entry

Aids for the Handicapped
    Variable rate playback
    Hearing aids
    Reading machine for visually impaired
    Visual display of speech information for hearing impaired


Speech Enhancement
    Echo and noise cancellation

Speech Recognition
    Automatic language translation

Voice Personality Transformation
    Voice conversion from source to target


The Speech Signal

It is the variation of pressure from atmospheric pressure, as a function of time, caused by traveling waves from the speaker's mouth (as well as the nostrils, cheeks and throat).

The Intensity Level of Speech

Units: SPL (Sound Pressure Level) in dB relative to a reference level.

Reference: 10^-16 W/cm^2
- Corresponds to just barely audible
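The dB level relative to the reference can be illustrated with a small calculation. This is only a sketch: it assumes the reference reads 10^-16 W/cm^2 and uses the usual 10 log10 ratio for intensities; the function name and sample intensities are made up for illustration.

```python
import math

# Assumed reference intensity (10^-16 W/cm^2, "just barely audible").
I_REF_W_PER_CM2 = 1e-16


def intensity_level_db(intensity_w_per_cm2, reference=I_REF_W_PER_CM2):
    """Level in dB of an intensity relative to the reference intensity."""
    return 10.0 * math.log10(intensity_w_per_cm2 / reference)


# Hypothetical intensities, chosen only to show the scale.
for intensity in (1e-16, 1e-10, 1e-4):
    print(f"{intensity:.0e} W/cm^2 -> {intensity_level_db(intensity):.1f} dB")
```

Each factor of 10 in intensity adds 10 dB, so a span of 0 to 120 dB covers twelve orders of magnitude in intensity.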

[Figure: sound level scale in dB with marks at 0, 20, 55, 60, 70, 80, 100 and 120 dB. Labelled points: just barely audible, whisper, variations in normal voice level (1 meter distance from mouth), heavy traffic, airplane, rock concert.]


Energy of speech during 1 s: 2 x 10^-5 Joules

(It takes 100 Joules to light a 100 W bulb for 1 s)

Strongest vowel: /a/ as in talk
Weakest vowel: /i/ as in see
Strongest consonant: /r/ as in run
Weakest consonant: /θ/ as in thin
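To put the bulb comparison in numbers, the following sketch assumes the speech energy figure reads 2 x 10^-5 J per second of speech (the exponent sign is an assumption about the slide). The function name is illustrative only.

```python
# Assumed values: 2e-5 J of acoustic energy per second of speech,
# and 100 J to light a 100 W bulb for one second.
SPEECH_ENERGY_PER_SECOND_J = 2e-5
BULB_ENERGY_J = 100.0


def seconds_of_speech_for(energy_j, speech_j_per_s=SPEECH_ENERGY_PER_SECOND_J):
    """Seconds of continuous speech whose acoustic energy adds up to energy_j."""
    return energy_j / speech_j_per_s


seconds = seconds_of_speech_for(BULB_ENERGY_J)
print(f"{seconds:.0f} s of speech, about {seconds / 86400:.1f} days")
```

Under these assumptions, roughly five million seconds of continuous speech carry the energy needed to light the bulb for one second.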


Speech & Audio Signal Specs.

Audio Signal Category    Bandwidth (Hz)   Sampling Rate (kHz)   Source Rate (kbps)
Telephone Band Speech    300-3400         8.0                   128
Wideband Speech          50-7000          16.0                  256
Wideband Audio           20-20,000        44.1/48.0             705/768
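The source rates in the table are consistent with 16 bits per sample on a single channel. That resolution is an assumption inferred from the numbers (it is stated later only for telephone-band speech). A minimal sketch of the arithmetic:

```python
# Assumption: 16-bit linear PCM, one channel.
BITS_PER_SAMPLE = 16


def source_rate_kbps(sampling_rate_khz, bits_per_sample=BITS_PER_SAMPLE):
    """Uncompressed source rate in kbit/s for a given sampling rate in kHz."""
    return sampling_rate_khz * bits_per_sample


for name, fs_khz in [("Telephone band speech", 8.0),
                     ("Wideband speech", 16.0),
                     ("Wideband audio (44.1 kHz)", 44.1),
                     ("Wideband audio (48 kHz)", 48.0)]:
    print(f"{name}: {source_rate_kbps(fs_khz):.1f} kbps")
```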

Speech Articulation by the Vocal System

Reproduced from: D. O'Shaughnessy, Human and machine speech communication, IEEE Press, 2000

Speech Classes by Articulation

Voiced speech

Unvoiced speech

Transient (stop) sounds

Acoustic Analysis of Speech

The relationship between speech sounds (phonemes) and their acoustic realizations
    Waveform
    Spectrum
    Spectrogram

Time Waveform of a Speech Sentence

[Figure: time waveform of the spoken sentence "THIS IS GOOD", with the individual sounds labelled along the waveform; x-axis: Time (s), 0 to 1.4; y-axis: Amplitude, -1 to 0.8.]

Waveform Analysis of Speech

Vowels: high energy, periodic, steady-state utterance
Unvoiced fricatives: low energy, noise-like, steady-state utterance
Voiced fricatives: low energy, element of periodicity, steady-state utterance
Stops: transient release, medium to low energy
Nasals: low-to-medium energy, periodic, steady-state utterance
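The energy and periodicity cues listed above are often approximated with two simple frame measures: short-time energy and zero-crossing rate. The sketch below is an illustration, not a method from the slides. It uses synthetic stand-ins (a 100 Hz sine for a vowel-like frame, low-level random noise for a fricative-like frame), and all names and values are hypothetical.

```python
import math
import random


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(v * v for v in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0.0) != (b >= 0.0))
    return crossings / (len(frame) - 1)


FS = 8000   # sampling rate in Hz (telephone band)
N = 240     # 30 ms frame

# Synthetic stand-ins, not real speech.
voiced_like = [0.8 * math.sin(2 * math.pi * 100 * n / FS) for n in range(N)]
rng = random.Random(0)
unvoiced_like = [rng.uniform(-0.05, 0.05) for _ in range(N)]

for name, frame in [("voiced-like", voiced_like), ("unvoiced-like", unvoiced_like)]:
    print(name, round(short_time_energy(frame), 5), round(zero_crossing_rate(frame), 3))
```

The periodic, high-energy frame gives a high energy and a low zero-crossing rate. The noise-like frame gives the opposite pattern.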

Acoustic Analysis of Vowels

Fundamental frequency F0 / Pitch period

F0              Male      Female
Average (Hz)    132       223
Range (Hz)      50-250    120-500
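One common way to estimate F0 from a voiced frame is to pick the autocorrelation peak within the expected pitch range. The slides do not specify an estimator, so this is only a sketch. It uses the 50-500 Hz range from the table and a synthetic three-harmonic signal at 132 Hz as a stand-in for a vowel.

```python
import math


def estimate_f0(frame, fs, f0_min=50.0, f0_max=500.0):
    """Estimate F0 (Hz) as fs / lag, where lag maximizes the autocorrelation."""
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), len(frame) - 1)
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        value = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        if value > best_value:
            best_lag, best_value = lag, value
    return fs / best_lag


FS = 8000
TRUE_F0 = 132.0  # average male F0 from the table
# Synthetic periodic frame (40 ms) with three harmonics; not real speech.
frame = [sum(amp * math.sin(2 * math.pi * k * TRUE_F0 * n / FS)
             for k, amp in [(1, 1.0), (2, 0.5), (3, 0.25)])
         for n in range(320)]

f0 = estimate_f0(frame, FS)
print(f"estimated F0 = {f0:.1f} Hz, pitch period = {1000.0 / f0:.2f} ms")
```

The estimate is quantized to integer lags, so at 8 kHz it can only land on values such as 8000/60 or 8000/61 Hz near the true F0.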

Acoustic Analysis of Consonants

Stop Consonants
    Momentary blockage of the vocal tract (50-100 ms): closure phase
    Release burst (shortest acoustic event)
    Voice onset time (VOT)

Fricatives
    Narrow constriction somewhere in the vocal tract
    Turbulent airflow through the constriction

The International Phonetic Alphabet (IPA)

Universal Speech Production Model

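[Block diagram: an Impulse Train Generator followed by a Glottal Pulse Model and a Voiced Gain, and a White Noise Generator followed by an Unvoiced Gain, feed a Voiced or Unvoiced switch; the selected excitation passes through the Vocal Tract Filter and the Radiation Model to give the output speech.]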

Vocal Tract Model

Time-varying all-pole linear filter excited by a source signal.

H(z) models the vocal tract system.

e[n] -> H(z) = 1/A(z) -> s[n]

$$H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{i=1}^{P} a_i z^{-i}}$$
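The all-pole model H(z) = 1/A(z) corresponds to a simple recursion in the time domain. The sketch below assumes the sign convention A(z) = 1 - sum_i a_i z^-i, so that s[n] = e[n] + sum_i a_i s[n-i]. The coefficients are a hypothetical stable 2-pole example, not values fitted to speech, and the glottal pulse and radiation models are left out.

```python
import random


def all_pole_filter(excitation, a):
    """s[n] = e[n] + sum_i a[i-1] * s[n-i], i.e. H(z) = 1 / (1 - sum_i a_i z^-i)."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc += ai * out[n - i]
        out.append(acc)
    return out


def impulse_train(length, period):
    """Voiced excitation: one unit impulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def white_noise(length, seed=0):
    """Unvoiced excitation: Gaussian white noise with a fixed seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]


A = [1.3, -0.8]  # hypothetical coefficients; poles at radius sqrt(0.8) < 1

voiced = all_pole_filter(impulse_train(160, 61), A)    # pitch period of 61 samples
unvoiced = all_pole_filter(white_noise(160), A)
print(voiced[:5])
print(unvoiced[:5])
```

Switching between the two excitation functions plays the role of the voiced/unvoiced switch in the production model.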

[Figure: Voiced Speech Spectrum; x-axis: Frequency (Hz), 0 to 4000; y-axis: Mag (dB), -100 to 80.]

[Figure: Superimposed 2nd-order LP Envelope; x-axis: Frequency (Hz), 0 to 4000; y-axis: Mag (dB), -100 to 80.]

[Figure: Superimposed 2nd and 6th order LP Envelopes; x-axis: Frequency (Hz), 0 to 4000; y-axis: Mag (dB), -100 to 80.]

[Figure: Superimposed 2nd, 6th and 10th order LP Envelopes; x-axis: Frequency (Hz), 0 to 4000; y-axis: Mag (dB), -100 to 80.]

[Figure: Superimposed 2nd, 6th, 10th and 16th order LP Envelopes; x-axis: Frequency (Hz), 0 to 4000; y-axis: Mag (dB), -100 to 80.]
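LP envelopes of a chosen order P are commonly obtained by solving the autocorrelation normal equations with the Levinson-Durbin recursion and then evaluating G/|A(e^jw)|. The slides do not say which method produced the plots, so this is a sketch under that assumption. It uses the convention s[n] ~ sum_i a_i s[n-i], and the autocorrelation values in the demo are hypothetical.

```python
import cmath
import math


def autocorrelation(x, max_lag):
    """r[k] = sum_n x[n] * x[n-k] for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return (a, err): predictor coefficients a_1..a_P and prediction error energy."""
    a = [0.0] * (order + 1)
    err = r[0]
    for m in range(1, order + 1):
        k = (r[m] - sum(a[i] * r[m - i] for i in range(1, m))) / err
        new_a = a[:]
        new_a[m] = k
        for i in range(1, m):
            new_a[i] = a[i] - k * a[m - i]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err


def lp_envelope_db(a, gain, freq_hz, fs):
    """Magnitude in dB of gain / A(e^jw), with A(z) = 1 - sum_i a_i z^-i."""
    w = 2.0 * math.pi * freq_hz / fs
    denom = 1.0 - sum(ai * cmath.exp(-1j * w * i) for i, ai in enumerate(a, start=1))
    return 20.0 * math.log10(gain / abs(denom))


# Hypothetical autocorrelation of a first-order process, r[k] = 0.5**k.
coeffs, error = levinson_durbin([1.0, 0.5, 0.25], order=2)
print(coeffs, error)
print(lp_envelope_db(coeffs, math.sqrt(error), 0.0, 8000.0))
```

Raising the order adds pole pairs, which lets the envelope follow more spectral peaks. That is the trend shown by the 2nd, 6th, 10th and 16th order plots.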

Unvoiced Speech and 10th order LP Residual

[Figure: two panels showing the unvoiced speech segment and its 10th-order LP residual; x-axis: Time (ms), 0 to 40; y-axis: Amplitude.]

Voiced Speech and 10th-order LP Residual

[Figure: two panels showing the voiced speech segment and its 10th-order LP residual; x-axis: Time (ms), 0 to 40; y-axis: Amplitude.]

Short-term correlation

Long-term correlation
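The LP residual is what remains after the short-term predictor is subtracted from the signal, that is, the output of the inverse filter A(z). A minimal sketch, assuming the convention e[n] = s[n] - sum_i a_i s[n-i] and a hypothetical 2-pole synthesis filter:

```python
def synthesize(excitation, a):
    """All-pole synthesis: s[n] = e[n] + sum_i a_i s[n-i]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc += ai * out[n - i]
        out.append(acc)
    return out


def lp_residual(speech, a):
    """Inverse filtering with A(z): e[n] = s[n] - sum_i a_i s[n-i]."""
    residual = []
    for n, s in enumerate(speech):
        prediction = sum(ai * speech[n - i]
                         for i, ai in enumerate(a, start=1) if n - i >= 0)
        residual.append(s - prediction)
    return residual


A = [1.3, -0.8]                        # hypothetical coefficients
pulses = [1.0 if n % 5 == 0 else 0.0 for n in range(20)]   # periodic excitation
speech_like = synthesize(pulses, A)
recovered = lp_residual(speech_like, A)
print([round(v, 6) for v in recovered])
```

In this toy case the residual is exactly the periodic pulse train. The short-term correlation has been removed, while the pitch periodicity (the long-term correlation) is still present.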

Speech Coding

Coding Rates

For telephone band (or narrowband) speech:
    Signal Bandwidth: 300-3400 Hz
    Sampling Rate: 8000 Hz
    Resolution: 16 bits/sample linear PCM

Uncompressed bit rate: 16 bits/sample x 8000 samples/s = 128 Kbit/s

What is the minimum coding rate for transmitting the message information?

Coder Classes according to Bit-Rate

B > 16 Kbps High bit rate coders

4 < B

Standards Organizations

ITU-T: International Telecommunication Union (UN)
MPEG: Moving Picture Experts Group (ISO/UN)
INMARSAT: Intl. Maritime Satellite Corporation, for geo-synchronous satellites
US Government: DoD, NATO
TIA: Telecom Industry Association, for North American Telecom standards
ETSI: European Telecom. Standards Institute

Speech Coding Standards

Name                       Coding Type         Bit-rate (kbps)   Organization   Year
G.711/G.712                PCM μ-law/A-law     64                ITU-T          1972
G.721/G.723/G.726/G.727    ADPCM               32/24/40/16       ITU-T          1984/86/88/90
G.728                      LD-CELP             16                ITU-T          1992
G.729                      CS-ACELP            8.0               ITU-T          1995
G.723.1                    ACELP               6.3/5.3           ITU-T          1995
G.722 (Wideband)           SB-ADPCM            48/56/64          ITU-T          1985



G.722.1 (Wideband)         Transform           24/32             ITU-T          1999
Inmarsat                   IMBE                4.15              INMARSAT       1990
IS-54 (old)                VSELP               7.95              TIA            1992
GSM-FR                     RPE-LTP             13                GSM            1991
GSM-HR                     CELP                5-6               GSM            1994
GSM-EFR                    CELP                12.2              GSM            1997




IS-641 (new)               ACELP               7.4               TIA            1997
Iridium                    AMBE                2.4               Iridium        1996
MPEG-4                     HVXC                2-4               MPEG/ISO       1999
MPEG-4                     CELP                4-24              MPEG/ISO       1999
FS-1015                    LPC-10              2.4               US-DoD/NATO    1984
FS-1016                    CELP                4.8               US-DoD/NATO    1989
MELP                       MELP                2.4               US-DoD/NATO    1996
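The bit rates in the standards tables can be compared against the 128 kbps uncompressed telephone-band rate (16 bits/sample x 8000 samples/s). The sketch below picks a few entries from the tables. The function name is illustrative, and the ratio says nothing about the quality each coder achieves.

```python
# Uncompressed narrowband rate: 16 bits/sample x 8000 samples/s.
UNCOMPRESSED_KBPS = 128.0

# A few narrowband coders and their bit rates (kbps) as listed in the tables.
STANDARD_RATES_KBPS = {
    "G.711": 64.0,
    "G.728": 16.0,
    "G.729": 8.0,
    "FS-1016": 4.8,
    "MELP": 2.4,
}


def compression_ratio(coded_kbps, reference_kbps=UNCOMPRESSED_KBPS):
    """How many times smaller the coded rate is than the reference rate."""
    return reference_kbps / coded_kbps


for name, rate in STANDARD_RATES_KBPS.items():
    print(f"{name}: {rate} kbps, {compression_ratio(rate):.1f}:1")
```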



Waveform coding

Vocoding or parametric coding

Hybrid coding


Classes according to Coding Type

[Figure: speech quality (Poor, Fair, Good, Excellent) versus bit rate (1, 2, 4, 8, 16, 32, 64 Kbps) for parametric coders, hybrid coders and waveform-approximating coders.]

Coding Standards

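[Figure: speech quality (Poor, Fair, Good, Excellent) versus bit rate (1, 2, 4, 8, 16, 32, 64 Kbps) for parametric coders, hybrid coders and waveform-approximating coders, with these standards marked on the curves: Linear PCM, G.711, G.726, G.728, G.729, G.723.1, GSM EFR, GSM FR, GSM/2, IS96, FS1015, MELP.]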

PCM Coding

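[Block diagram: x[n] -> Q[.] -> quantized x[n], with index i[n]]

Instantaneous, non-uniform quantization

For signals with time-varying energy, e.g. speech, uniform quantization is inefficient.

With a uniform quantizer, if the signal amplitude is halved (energy reduced to a quarter), SQNR falls by 6 dB.

SQNR is independent of signal level in a log quantizer.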
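The level dependence of SQNR can be checked numerically. The sketch below compares an 8-bit uniform quantizer with an 8-bit quantizer applied after continuous μ-law companding (μ = 255). This is the textbook companding formula, not the bit-exact segmented G.711 codec, and the test signal is a synthetic sine at two hypothetical levels.

```python
import math

MU = 255.0


def mu_compress(x, mu=MU):
    """Continuous mu-law compressor for x in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_expand(y, mu=MU):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def uniform_quantize(x, bits):
    """Mid-rise uniform quantizer on [-1, 1) with 2**bits levels."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = math.floor(x / step)
    index = max(-levels // 2, min(levels // 2 - 1, index))
    return (index + 0.5) * step


def sqnr_db(signal, quantized):
    noise = sum((s - q) ** 2 for s, q in zip(signal, quantized))
    return 10.0 * math.log10(sum(s * s for s in signal) / noise)


def compare(amplitude, bits=8, n=4000):
    """Return (uniform SQNR, mu-law SQNR) in dB for a sine of given amplitude."""
    x = [amplitude * math.sin(2 * math.pi * 0.01237 * k) for k in range(n)]
    uniform = [uniform_quantize(v, bits) for v in x]
    mulaw = [mu_expand(uniform_quantize(mu_compress(v), bits)) for v in x]
    return sqnr_db(x, uniform), sqnr_db(x, mulaw)


loud = compare(0.5)
quiet = compare(0.5 / 16)   # 24 dB lower in level
print("uniform: loud %.1f dB, quiet %.1f dB" % (loud[0], quiet[0]))
print("mu-law : loud %.1f dB, quiet %.1f dB" % (loud[1], quiet[1]))
```

The uniform quantizer loses roughly as many dB of SQNR as the signal level drops, while the companded quantizer stays close to its original SQNR.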

ADPCM Coding

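[Block diagram: ADPCM encoder and decoder. Encoder: the output of the predictor P is subtracted from the input x[n] to give the difference d[n], which is quantized by Q[.] to produce the code c[n]; the quantized difference is added back to the prediction to form the reconstructed x[n] that feeds P. Decoder: the quantized difference recovered from c[n] is added to the output of P to give the reconstructed x[n].]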
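The encoder and decoder loop can be sketched in a few lines. For brevity this example uses a fixed first-order predictor and a fixed quantizer step, so strictly it is plain DPCM. An ADPCM coder would also adapt the step size and/or the predictor. The predictor coefficient, step size and test signal are hypothetical.

```python
import math


def dpcm_encode(x, step, coeff=0.9):
    """Quantize the prediction error; predict from the reconstructed signal."""
    codes = []
    recon_prev = 0.0
    for sample in x:
        prediction = coeff * recon_prev
        difference = sample - prediction
        code = round(difference / step)          # integer code c[n]
        codes.append(code)
        recon_prev = prediction + code * step    # same value the decoder forms
    return codes


def dpcm_decode(codes, step, coeff=0.9):
    """Rebuild the signal from the codes with the same predictor."""
    out = []
    recon_prev = 0.0
    for code in codes:
        recon_prev = coeff * recon_prev + code * step
        out.append(recon_prev)
    return out


STEP = 0.05
signal = [0.5 * math.sin(0.2 * n) for n in range(100)]   # synthetic input
codes = dpcm_encode(signal, STEP)
reconstructed = dpcm_decode(codes, STEP)
max_error = max(abs(s - r) for s, r in zip(signal, reconstructed))
print("largest code:", max(abs(c) for c in codes), "max error:", round(max_error, 4))
```

Because the encoder predicts from its own reconstructed output, the decoder tracks it exactly and the reconstruction error stays within half a quantizer step.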

Prediction in the context of Coding

[Figure: Signal and first-difference signal, shown in two panels; x-axis: Time (ms), 0 to 20; y-axis: Amplitude.]

ADPCM Coding

DPCM with fixed predictor can give 4-11 dB improvement over PCM.

PCM with adaptive quantization can give ~5 dB improvement over μ-law non-adaptive PCM.

DPCM with adaptive prediction can give 10-12 dB improvement over fixed predictor.

Code Excited Linear Prediction (CELP) Coding

Most coders in the 4.8-16 kbps range are based on Linear Prediction Analysis-by-Synthesis (LPAS) coding.

CELP belongs to the LPAS paradigm of speech coding.

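Generic Linear Prediction Analysis-by-Synthesis (LPAS) Coder

[Block diagram: the input speech feeds the LP Analysis block and a summing node; the Excitation Generator drives the Synthesis Filter, whose output is subtracted from the input speech at the summing node; the Error Minimization block uses the resulting error to control the Excitation Generator.]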
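The analysis-by-synthesis loop can be sketched as an exhaustive codebook search: each candidate excitation is passed through the synthesis filter, scaled by its best gain, and compared with the target. This is a simplified illustration. Practical CELP coders also use perceptual weighting and an adaptive (pitch) codebook, which are omitted here, and the codebook and filter coefficient below are hypothetical.

```python
def synthesize(excitation, a):
    """All-pole synthesis filter 1/A(z): s[n] = e[n] + sum_i a_i s[n-i]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc += ai * out[n - i]
        out.append(acc)
    return out


def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the codevector that best matches target."""
    best = None
    for index, code in enumerate(codebook):
        y = synthesize(code, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, y)) / energy   # least-squares gain
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


# Hypothetical 4-entry codebook of length-8 excitation vectors.
CODEBOOK = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, -1, 0],
    [1, -1, 1, -1, 1, -1, 1, -1],
    [0, 1, 0, 0, 1, 0, 0, 1],
]
A = [0.9]   # hypothetical first-order LP coefficient

target = [0.5 * v for v in synthesize(CODEBOOK[2], A)]
index, gain, error = search_codebook(target, CODEBOOK, A)
print(index, round(gain, 6), round(error, 12))
```

Only the winning index and the gain would need to be transmitted. The decoder regenerates the excitation from the same codebook and passes it through G/A(z).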

CELP Decoder

[Block diagram: Excitation parameters -> Excitation Generator -> G/A(z) -> Synthesized speech; the LP and Gain parameters set G/A(z).]

Speech Quality Measurement

Objective measures
    Segmental SNR
    Itakura-Saito distance measure
    Spectral distortion (SD)
    ITU-T P.862 Recommendation

Subjective measures
    Mean opinion score (MOS)
    Diagnostic Rhyme Test (DRT)
    Diagnostic Acceptability Measure (DAM)
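Of the objective measures, segmental SNR is the simplest to compute: the SNR is evaluated frame by frame in dB and then averaged. The sketch below assumes 20 ms frames at 8 kHz (160 samples) and skips frames with zero signal or zero error. The frame length and the test signal are assumptions made for illustration.

```python
import math


def segmental_snr_db(reference, coded, frame_len=160):
    """Average of per-frame SNRs (dB) between a reference and a coded signal."""
    snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        cod = coded[start:start + frame_len]
        signal = sum(r * r for r in ref)
        noise = sum((r - c) ** 2 for r, c in zip(ref, cod))
        if signal == 0.0 or noise == 0.0:
            continue   # skip frames where the ratio is undefined
        snrs.append(10.0 * math.log10(signal / noise))
    return sum(snrs) / len(snrs) if snrs else float("nan")


# Synthetic check: a "coded" signal that is the reference scaled by 0.9
# has an error of 10% of the amplitude, i.e. 20 dB SNR in every frame.
reference = [math.sin(0.1 * n) for n in range(480)]
coded = [0.9 * v for v in reference]
demo_snr = segmental_snr_db(reference, coded)
print(round(demo_snr, 3))
```

Averaging in the dB domain gives quiet frames the same weight as loud ones, which is why segmental SNR is often preferred over a single long-term SNR.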

Absolute Category Rating Tests (MOS)

Listening quality scale
    Excellent  5
    Good       4
    Fair       3
    Poor       2
    Bad        1

Diagnostic Rhyme Test

Measures speech intelligibility

Listeners are presented with one of two words which differ only in the leading consonant

Examples:
    Meet - Beat
    Than - Dan
    Met - Net
    Jest - Guest

Diagnostic Rhyme Test

Total possible pairs = 96

Intelligibility score, S, is given by:

$$S = 100 \times \frac{N(\text{correct}) - N(\text{incorrect})}{N(\text{test pairs})}$$

Coder     Rate (kbps)   DRT    MOS
FS1016    4.8           91.7   3.3
G.728     16            93.0   3.9
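The intelligibility score subtracts incorrect responses from correct ones, which corrects for guessing between the two words of a pair. The sketch below assumes that reading of the formula (a minus sign between the two counts). The response counts in the example are made up.

```python
def drt_score(n_correct, n_incorrect, n_test_pairs):
    """S = 100 * (N_correct - N_incorrect) / N_test_pairs."""
    return 100.0 * (n_correct - n_incorrect) / n_test_pairs


# Hypothetical listener result over all 96 pairs: 90 correct, 6 incorrect.
print(drt_score(90, 6, 96))    # 87.5
# Pure guessing (half right, half wrong) scores 0 rather than 50.
print(drt_score(48, 48, 96))   # 0.0
```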

Perceptual evaluation of speech quality (PESQ)

Part of ITU-T P.862 standard

Objective is to mimic sound perception by persons in real life

PESQ simulates experiments in which subjects judge speech quality

Physical signals are mapped to psychophysical representations that match internal representations in the head

Speech Coder Complexity Issues

Complexity

Computational complexity
    Simplex/half-duplex/full-duplex real-time performance on a single DSP
    Fixed point vs. floating point
    CELP coders are computationally complex

Memory requirement
    Storage of look-up tables, codebooks etc.

Thank You!
