8/8/2019 Session 08 9
1/52
Speech Signal Analysis and Coding
Dr. Arun Kumar
Centre for Applied Research in Electronics
(CARE), IIT Delhi
Contents
Speech Processing Applications
Speech Signal Understanding
  Speech Production
  Speech Signal Characteristics and Analysis
Speech Coding
  Coding Standards
  Coder Attributes including Quality Evaluation
  Coding Methodologies
Speech Processing Applications
Speech Transmission
  Trunk-line telephony
  Wireless telephony
Speech Storage
  Voice mail, voice memo, answering machines
Speech Synthesis
  Text-to-speech synthesis
  Automatic information services
Speech Processing Applications
Speaker Verification and Identification
  Phone banking
  Secure entry
Aids for the Handicapped
  Variable rate playback
  Hearing aids
  Reading machine for the visually impaired
  Visual display of speech information for the hearing impaired
Speech Processing Applications
Speech Enhancement
  Echo and noise cancellation
Speech Recognition
  Automatic language translation
Voice Personality Transformation
  Voice conversion from source to target
The Speech Signal
It is the variation of pressure from atmospheric pressure, as a function of time, caused by traveling waves from the speaker's mouth (in addition to the nostrils, cheeks and throat).
The Intensity Level of Speech
Units:
SPL (Sound Pressure Level) in dB relative to a reference level.
Reference: $10^{-16}$ W/cm$^2$
- Corresponds to just barely audible sound
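The dB scale above can be sketched in a few lines. This is only an illustration, assuming the level is computed from acoustic intensity as 10 log10(I / I_ref) with the reference intensity quoted on the slide; the function names are hypothetical.

```python
import math

# Reference intensity from the slide: 1e-16 W/cm^2 (just barely audible).
REFERENCE_INTENSITY_W_PER_CM2 = 1e-16


def intensity_level_db(intensity_w_per_cm2: float) -> float:
    """Level in dB of an intensity relative to the reference intensity."""
    if intensity_w_per_cm2 <= 0:
        raise ValueError("intensity must be positive")
    return 10.0 * math.log10(intensity_w_per_cm2 / REFERENCE_INTENSITY_W_PER_CM2)


def intensity_from_db(level_db: float) -> float:
    """Inverse mapping: intensity in W/cm^2 for a given level in dB."""
    return REFERENCE_INTENSITY_W_PER_CM2 * 10.0 ** (level_db / 10.0)
```

Every factor of 10 in intensity adds 10 dB, so a sound one million times more intense than the reference sits at 60 dB.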
The Intensity Level of Speech
[Figure: intensity scale in dB with tick marks at 0, 20, 55, 60, 70, 80, 100 and 120 dB, labeled with just barely audible, whisper, variations in normal voice level (1 meter distance from mouth), heavy traffic, airplane and rock concert.]
The Intensity Level of Speech
Energy of speech during 1 s: $2 \times 10^{-5}$ Joules
(It takes 100 Joules to light a 100 W bulb for 1 s)
Strongest vowel: /a/ as in talk
Weakest vowel: /i/ as in see
Strongest consonant: /r/ as in run
Weakest consonant: /θ/ as in thin
Speech & Audio Signal Specs.
Audio Signal Category   Bandwidth (Hz)   Sampling Rate (kHz)   Source Rate (kbps)
Telephone Band Speech   300-3400         8.0                   128
Wideband Speech         50-7000          16.0                  256
Wideband Audio          20-20,000        44.1/48.0             705/768
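The source rates in the table follow from the sampling rate times the sample resolution. A minimal sketch, assuming single-channel 16-bit linear PCM for every row (the 16-bit figure is an assumption here, chosen because it reproduces the tabulated rates); the names are illustrative.

```python
def source_rate_kbps(sampling_rate_khz: float, bits_per_sample: int = 16) -> float:
    """Uncompressed single-channel bit rate in kbps."""
    return sampling_rate_khz * bits_per_sample


# Sampling rates (kHz) taken from the table above.
SAMPLING_RATES_KHZ = {
    "Telephone Band Speech": [8.0],
    "Wideband Speech": [16.0],
    "Wideband Audio": [44.1, 48.0],
}

if __name__ == "__main__":
    for category, rates in SAMPLING_RATES_KHZ.items():
        kbps = [source_rate_kbps(rate) for rate in rates]
        print(category, kbps)
```

At 44.1 kHz the product is 705.6 kbps, which the table rounds down to 705.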
Speech Articulation by the Vocal System
Reproduced from: D. O'Shaughnessy, Human and machine speech communication, IEEE Press, 2000
Speech Classes by Articulation
Voiced speech
Unvoiced speech
Transient (stop) sounds
Acoustic Analysis of Speech
The relationship between speech sounds (phonemes) and their acoustic realizations
  Waveform
  Spectrum
  Spectrogram
Time Waveform of a Speech Sentence
[Figure: amplitude versus time (s), 0 to 1.4 s, for the sentence "THIS IS GOOD", with the waveform segments labeled by sound.]
Waveform Analysis of Speech
Vowels
  High energy, periodic, steady-state utterance
Unvoiced fricatives
  Low energy, noise-like, steady-state utterance
Voiced fricatives
  Low energy, element of periodicity, steady-state utterance
Stops
  Transient release, medium to low energy
Nasals
  Low-to-medium energy, periodic, steady-state utterance
Acoustic Analysis of Vowels
Fundamental frequency F0 / Pitch period
F0             Male     Female
Average (Hz)   132      223
Range (Hz)     50-250   120-500
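The pitch period is the reciprocal of F0, so the F0 ranges above translate directly into a range of lags to search. The sketch below is one common way to estimate F0, a plain autocorrelation peak search. It is not a method stated in the slides, and the function names and default limits (50 to 500 Hz, covering both tabulated ranges) are assumptions.

```python
import math


def pitch_period_ms(f0_hz: float) -> float:
    """Pitch period in milliseconds for a fundamental frequency in Hz."""
    return 1000.0 / f0_hz


def estimate_f0_autocorr(frame, fs_hz, f0_min_hz=50.0, f0_max_hz=500.0):
    """Estimate F0 as fs / lag, where lag maximizes the autocorrelation."""
    min_lag = int(fs_hz / f0_max_hz)          # shortest period considered
    max_lag = min(int(fs_hz / f0_min_hz), len(frame) - 1)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return fs_hz / best_lag


if __name__ == "__main__":
    fs = 8000
    # Synthetic periodic frame: 125 Hz fundamental plus its second harmonic.
    frame = [math.sin(2 * math.pi * 125 * n / fs)
             + 0.5 * math.sin(2 * math.pi * 250 * n / fs) for n in range(448)]
    print(estimate_f0_autocorr(frame, fs), pitch_period_ms(132))
```

On real speech such an estimator needs a voicing decision and some protection against octave errors, both of which are left out here.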
Acoustic Analysis of Consonants
Stop Consonants
  Momentary blockage of the vocal tract (50-100 ms): closure phase
  Release burst (shortest acoustic event)
  Voice onset time (VOT)
Fricatives
  Narrow constriction somewhere in the vocal tract
  Turbulent airflow through the constriction
The International Phonetic Alphabet (IPA)
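Universal Speech Production Model
[Block diagram: an impulse train generator followed by a glottal pulse model (scaled by the voiced gain) and a white noise generator (scaled by the unvoiced gain) feed a voiced or unvoiced switch; the selected excitation drives the vocal tract filter and then the radiation model to give the output speech.]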
Vocal Tract Model
Time-varying all-pole linear filter excited by a source signal.
H(z) models the vocal tract system.
[Block diagram: e[n] → H(z) = 1/A(z) → s[n]]
$$H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{i=1}^{P} a_i z^{-i}}$$
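In the time domain, an all-pole filter with A(z) = 1 - Σ a_i z^{-i} corresponds to the recursion s[n] = e[n] + Σ a_i s[n-i]. The sketch below applies that recursion with fixed coefficients; the function names, the coefficient values and the simple impulse-train excitation are illustrative assumptions, and the time variation of the filter is not modeled.

```python
def all_pole_filter(excitation, a):
    """Run e[n] through H(z) = 1/A(z), with a = [a_1, ..., a_P]."""
    speech = []
    for n, e in enumerate(excitation):
        sample = e
        for i, a_i in enumerate(a, start=1):
            if n - i >= 0:                  # zero initial conditions
                sample += a_i * speech[n - i]
        speech.append(sample)
    return speech


def impulse_train(period, length):
    """Idealized voiced excitation: a unit impulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


if __name__ == "__main__":
    # Hypothetical stable 2nd-order resonance excited at a 64-sample pitch period.
    voiced = all_pole_filter(impulse_train(64, 256), [1.6, -0.9])
    print(voiced[:8])
```

Swapping the impulse train for white noise gives the unvoiced branch of the production model.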
Voiced Speech Spectrum
[Figure: magnitude (dB) versus frequency (Hz), 0 to 4000 Hz, of a voiced speech segment.]
Superimposed 2nd-order LP Envelope
[Figure: the voiced speech spectrum with the 2nd-order LP envelope superimposed.]
Superimposed 2nd, 6th order LP Envelopes
[Figure: the voiced speech spectrum with the 2nd- and 6th-order LP envelopes superimposed.]
Superimposed 2nd, 6th, & 10th order LP Envelopes
[Figure: the voiced speech spectrum with the 2nd-, 6th- and 10th-order LP envelopes superimposed.]
Superimposed 2nd, 6th, 10th & 16th order LP Envelopes
[Figure: the voiced speech spectrum with the 2nd-, 6th-, 10th- and 16th-order LP envelopes superimposed.]
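LP envelopes of increasing order come from solving the LP normal equations at each order. One standard way to do this is the autocorrelation method with the Levinson-Durbin recursion, sketched below. The slides do not say which method produced the plots, so treat this as an assumption; the coefficient sign convention matches A(z) = 1 - Σ a_i z^{-i}.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    return [sum(frame[n] * frame[n + k] for n in range(len(frame) - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations; return ([a_1..a_P], prediction error)."""
    if r[0] <= 0:
        raise ValueError("frame has no energy")
    a = [0.0] * (order + 1)                 # a[0] is unused
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        k = acc / err                       # reflection coefficient
        updated = a[:]
        updated[m] = k
        for i in range(1, m):
            updated[i] = a[i] - k * a[m - i]
        a = updated
        err *= (1.0 - k * k)                # error shrinks as order grows
    return a[1:], err
```

Calling `levinson_durbin(autocorrelation(frame, P), P)` for P = 2, 6, 10 and 16 gives the coefficient sets behind envelopes like those above; higher orders follow more of the formant detail.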
Unvoiced Speech and 10th order LP Residual
[Figure: two panels of amplitude versus time (ms), 0 to 40 ms: the unvoiced speech segment and its 10th-order LP residual.]
Voiced Speech and 10th-order LP Residual
[Figure: two panels of amplitude versus time (ms), 0 to 40 ms: the voiced speech segment and its 10th-order LP residual, annotated with short-term correlation and long-term correlation.]
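The LP residual is obtained by inverse filtering the speech with A(z), which removes the short-term correlation; for voiced speech the residual still shows pulses at the pitch period, which is the long-term correlation. A minimal sketch follows. The one-tap long-term predictor is a common textbook form rather than something taken from the slides, and all names are hypothetical.

```python
def lp_residual(speech, a):
    """Short-term residual e[n] = s[n] - sum_i a_i * s[n-i]."""
    residual = []
    for n, s in enumerate(speech):
        prediction = sum(a_i * speech[n - i]
                         for i, a_i in enumerate(a, start=1) if n - i >= 0)
        residual.append(s - prediction)
    return residual


def long_term_residual(residual, pitch_lag, gain):
    """One-tap pitch predictor: r[n] - gain * r[n - pitch_lag]."""
    return [r - (gain * residual[n - pitch_lag] if n >= pitch_lag else 0.0)
            for n, r in enumerate(residual)]
```

Applying `lp_residual` to the output of an all-pole filter with the same coefficients returns the original excitation.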
Speech Coding
Coding Rates
For telephone band (or narrowband) speech:
  Signal bandwidth: 300-3400 Hz
  Sampling rate: 8000 Hz
  Resolution: 16 bits/sample linear PCM
Uncompressed bit rate: 16 bits/sample x 8000 samples/s = 128 kbit/s
What is the minimum coding rate for transmitting the message information?
Coder Classes according to Bit-Rate
B > 16 Kbps High bit rate coders
4 < B
Standards Organizations
ITU-T: International Telecommunication Union (UN)
MPEG: Moving Picture Experts Group (ISO/UN)
INMARSAT: Intl. Maritime Satellite Corporation, for geo-synchronous satellites
US Government: DoD, NATO
TIA: Telecommunications Industry Association, for North American telecom standards
ETSI: European Telecom. Standards Institute
Speech Coding Standards
Name                      Coding Type       Bit-rate (kbps)   Organization   Year
G.711/G.712               PCM μ-law/A-law   64                ITU-T          1972
G.721/G.723/G.726/G.727   ADPCM             32/24/40/16       ITU-T          1984/86/88/90
G.728                     LD-CELP           16                ITU-T          1992
G.729                     CS-ACELP          8.0               ITU-T          1995
G.723.1                   ACELP             6.3/5.3           ITU-T          1995
G.722 (Wideband)          SB-ADPCM          48/56/64          ITU-T          1985
Speech Coding Standards
Name                 Coding Type   Bit-rate (kbps)   Organization   Year
G.722.1 (Wideband)   Transform     24/32             ITU-T          1999
Inmarsat             IMBE          4.15              INMARSAT       1990
IS-54 (old)          VSELP         7.95              TIA            1992
GSM-FR               RPE-LTP       13                GSM            1991
GSM-HR               CELP          5-6               GSM            1994
GSM-EFR              CELP          12.2              GSM            1997
Speech Coding Standards
Name           Coding Type   Bit-rate (kbps)   Organization   Year
IS-641 (new)   ACELP         7.4               TIA            1997
Iridium        AMBE          2.4               Iridium        1996
MPEG-4         HVXC          2-4               MPEG/ISO       1999
MPEG-4         CELP          4-24              MPEG/ISO       1999
FS-1015        LPC-10        2.4               US-DoD/NATO    1984
FS-1016        CELP          4.8               US-DoD/NATO    1989
MELP           MELP          2.4               US-DoD/NATO    1996
Coding Methodologies
Waveform coding
Vocoding or parametric coding
Hybrid coding
Classes according to Coding Type
[Figure: speech quality (poor, fair, good, excellent) versus bit rate (1, 2, 4, 8, 16, 32, 64 kbps) for parametric coders, hybrid coders and waveform approximating coders.]
Coding Standards
[Figure: speech quality (poor, fair, good, excellent) versus bit rate (1, 2, 4, 8, 16, 32, 64 kbps) for parametric coders, hybrid coders and waveform approximating coders, with individual standards marked: Linear PCM, G.711, G.726, G.728, G.729, G.723.1, GSM FR, GSM EFR, GSM/2, IS96, FS1015 and MELP.]
PCM Coding
[Block diagram: x[n] → Q[.] → quantized x[n], with quantizer index i[n]]
Instantaneous, non-uniform quantization
For signals with time-varying energy, e.g. speech, uniform quantization is inefficient.
With a uniform quantizer, SQNR falls by 6 dB each time the signal amplitude is halved.
SQNR is independent of signal level in a log quantizer.
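The contrast between uniform and logarithmic quantization can be checked numerically. The sketch below assumes continuous μ-law companding with μ = 255 followed by an 8-bit uniform quantizer, which is a simplification and not the segmented G.711 encoder; the test sinusoid and all names are illustrative.

```python
import math

MU = 255.0


def mu_law_compress(x, mu=MU):
    """Map x in [-1, 1] to the companded domain."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize_uniform(value, bits=8):
    """Mid-rise uniform quantizer on [-1, 1]."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = min(levels - 1, max(0, int(math.floor((value + 1.0) / step))))
    return -1.0 + (index + 0.5) * step


def sqnr_db(amplitude, use_mu_law, bits=8, num_samples=10000):
    """SQNR of a quantized sinusoid with the given peak amplitude."""
    signal = [amplitude * math.sin(0.1234 * n) for n in range(num_samples)]
    if use_mu_law:
        coded = [mu_law_expand(quantize_uniform(mu_law_compress(s), bits))
                 for s in signal]
    else:
        coded = [quantize_uniform(s, bits) for s in signal]
    noise = sum((s - c) ** 2 for s, c in zip(signal, coded))
    return 10.0 * math.log10(sum(s * s for s in signal) / noise)
```

Comparing `sqnr_db(1.0, False)` with `sqnr_db(0.5, False)` shows the drop for the uniform quantizer when the amplitude halves, while the same comparison with `use_mu_law=True` stays nearly flat.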
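ADPCM Coding
[Block diagram. Encoder: the output of the predictor P is subtracted from the input x[n] to give the difference d[n], which Q[.] quantizes to the code c[n]; the quantized d[n] is added back to the prediction to form the reconstructed x[n] that feeds P. Decoder: the quantized d[n] recovered from c[n] is added to the output of P to give the reconstructed x[n].]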
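The encoder and decoder loops in the diagram can be sketched with a fixed first-order predictor and a fixed uniform quantizer. That makes this plain DPCM; ADPCM additionally adapts the quantizer step and/or the predictor, which is not shown. The predictor coefficient, step size and function names are assumptions.

```python
def dpcm_encode(signal, step, predictor_coeff=0.9):
    """Return the quantizer indices c[n] for the input x[n]."""
    codes = []
    prediction = 0.0
    for x in signal:
        difference = x - prediction              # d[n]
        code = round(difference / step)          # c[n], uniform quantizer
        codes.append(code)
        reconstructed = prediction + code * step  # same value the decoder forms
        prediction = predictor_coeff * reconstructed
    return codes


def dpcm_decode(codes, step, predictor_coeff=0.9):
    """Rebuild the signal from c[n] with the same predictor as the encoder."""
    output = []
    prediction = 0.0
    for code in codes:
        reconstructed = prediction + code * step
        output.append(reconstructed)
        prediction = predictor_coeff * reconstructed
    return output
```

Because the encoder predicts from its own reconstructed samples rather than from the clean input, quantization errors do not accumulate: each decoded sample is within half a step of the input.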
Prediction in the context of Coding
[Figure: two panels of amplitude versus time (ms), 0 to 20 ms: signal and first-difference signal.]
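The first difference is the simplest predictor: each sample is predicted by the previous one. For a signal with variance σ² and adjacent-sample correlation ρ, the difference has variance 2σ²(1 - ρ), so it is smaller than the signal only when ρ > 0.5. A short sketch with illustrative names:

```python
import math


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def first_difference(signal):
    """d[n] = x[n] - x[n-1], i.e. prediction by the previous sample."""
    return [signal[n] - signal[n - 1] for n in range(1, len(signal))]


def prediction_gain_db(signal):
    """Ratio of signal variance to first-difference variance, in dB."""
    return 10.0 * math.log10(variance(signal) / variance(first_difference(signal)))


if __name__ == "__main__":
    # 20 ms of a 200 Hz tone sampled at 8 kHz: strongly correlated samples.
    tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
    print(prediction_gain_db(tone))
```

A positive gain means the difference signal needs fewer quantizer levels for the same noise, which is the motivation for DPCM.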
ADPCM Coding
DPCM with fixed predictor can give 4-11 dB improvement over PCM.
PCM with adaptive quantization can give ~5 dB improvement over μ-law non-adaptive PCM.
DPCM with adaptive prediction can give 10-12 dB improvement over fixed predictor.
Code Excited Linear Prediction (CELP) Coding
Most coders in 4.8-16 kbps are based on Linear Prediction Analysis-by-Synthesis (LPAS) coding.
CELP belongs to the LPAS paradigm of speech coding.
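Generic Linear Prediction Analysis-by-Synthesis (LPAS) Coder
[Block diagram: LP analysis of the input speech sets the synthesis filter; the excitation generator drives the synthesis filter, the synthesized output is subtracted from the input speech, and the error minimization block selects the excitation.]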
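The closed loop in the diagram amounts to trying each candidate excitation, synthesizing it, and keeping the one with the smallest error against the input. The sketch below is a heavily simplified version of such a search: a tiny fixed codebook, an optimal gain per entry, plain squared error with no perceptual weighting, and no adaptive (pitch) codebook. All names and values are hypothetical.

```python
def synthesize(excitation, a):
    """All-pole synthesis filter 1/A(z) with a = [a_1, ..., a_P]."""
    out = []
    for n, e in enumerate(excitation):
        out.append(e + sum(a_i * out[n - i]
                           for i, a_i in enumerate(a, start=1) if n - i >= 0))
    return out


def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the best codevector for the target."""
    best = None
    for index, codevector in enumerate(codebook):
        candidate = synthesize(codevector, a)
        energy = sum(v * v for v in candidate)
        if energy == 0.0:
            continue
        # Gain that minimizes the squared error for this codevector.
        gain = sum(t * v for t, v in zip(target, candidate)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, candidate))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best
```

Only the winning index and gain need to be transmitted, together with the LP parameters; the decoder repeats the synthesis step.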
CELP Decoder
[Block diagram: excitation parameters → excitation generator → G/A(z) → synthesized speech, with the LP and gain parameters controlling G/A(z).]
Speech Quality Measurement
Speech Quality
  Objective measures
    Segmental SNR
    Itakura-Saito distance measure
    Spectral distortion (SD)
    ITU-T P.862 Recommendation
  Subjective measures
    Mean opinion score (MOS)
    Diagnostic Rhyme Test (DRT)
    Diagnostic Acceptability Measure (DAM)
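Of the objective measures listed, segmental SNR is the simplest to compute: the SNR is evaluated frame by frame in dB and then averaged. A minimal sketch, assuming 160-sample frames (20 ms at 8 kHz) and time-aligned signals; frames with no signal or no error are skipped, whereas practical implementations often clamp per-frame values instead.

```python
import math


def segmental_snr_db(reference, processed, frame_len=160):
    """Average of per-frame SNRs (dB) between reference and coded speech."""
    frame_snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        out = processed[start:start + frame_len]
        signal_energy = sum(r * r for r in ref)
        noise_energy = sum((r - o) ** 2 for r, o in zip(ref, out))
        if signal_energy == 0.0 or noise_energy == 0.0:
            continue                      # undefined SNR for this frame
        frame_snrs.append(10.0 * math.log10(signal_energy / noise_energy))
    if not frame_snrs:
        raise ValueError("no frame with a defined SNR")
    return sum(frame_snrs) / len(frame_snrs)
```

Averaging in the dB domain gives quiet frames the same weight as loud ones, unlike a single SNR over the whole utterance.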
Absolute Category Rating Tests (MOS)
Listening quality scale
Excellent 5
Good 4
Fair 3
Poor 2
Bad 1
Diagnostic Rhyme Test
Measures speech intelligibility
Listeners are presented with one of two words which differ only in the leading consonant
Examples:
Meet - Beat
Than - Dan
Met - Net
Jest - Guest
Diagnostic Rhyme Test
Total possible pairs = 96
Intelligibility score, S, is given by:
$$S = 100 \times \frac{N(\text{correct}) - N(\text{incorrect})}{N(\text{test pairs})}$$
Coder    Rate (kbps)   DRT    MOS
FS1016   4.8           91.7   3.3
G.728    16            93.0   3.9
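The intelligibility score is easy to compute directly. The sketch below assumes every test pair is answered either correctly or incorrectly unless a separate total is supplied; the function name is illustrative.

```python
def drt_score(n_correct, n_incorrect, n_test_pairs=None):
    """DRT intelligibility score S = 100 * (correct - incorrect) / pairs."""
    if n_test_pairs is None:
        n_test_pairs = n_correct + n_incorrect
    if n_test_pairs <= 0:
        raise ValueError("need at least one test pair")
    return 100.0 * (n_correct - n_incorrect) / n_test_pairs
```

Subtracting the incorrect responses corrects for guessing: with two alternatives per item, pure guessing gives about as many wrong as right answers and a score near 0, while 92 correct out of 96 pairs gives about 91.7.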
Perceptual evaluation of speech quality (PESQ)
Part of ITU-T P.862 standard
Objective is to mimic sound perception by persons in real life
PESQ simulates experiments in which subjects judge speech quality
Physical signals are mapped to psychophysical representations that match internal representations in the head
Speech Coder Complexity Issues
Complexity
  Computational complexity
    Simplex/half-duplex/full-duplex real-time performance on a single DSP
    Fixed point vs. floating point
    CELP coders are computationally complex
  Memory requirement
    Storage of look-up tables, codebooks etc.
Thank You!