Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 1
[Figure: engineering model of speech production. A unit-sample train generator (pitch period P(n), output v(n)) models voiced excitation and a white-noise generator (output u(n)) models unvoiced excitation; a voiced/unvoiced switch and a gain feed the selected excitation to a linear time-varying filter h(n,m), controlled by the vocal tract parameters, which outputs the speech samples s(n). The filter is assumed constant for 20-50 msec.]
Overview

• In contrast to waveform coders, vocoders (voice coders) distill a very compact description of the input and digitize only the parameters of this description.
• They are based on the generally accepted engineering model of speech production:
  – Speech is the output of a linear time-varying filter (approximating the vocal tract).
  – The excitation is either a quasi-periodic pulse train (voiced) or a stationary random sequence (unvoiced).
Basic LPC Vocoder
• Coding speech thus entails determining the parameters of the model, deciding the excitation type and its value (pitch or variance), and quantizing these parameters.
• At the decoder, speech is synthesized by exciting the linear filter with either white noise or a pulse train.
• This basic vocoder generates synthetic-sounding, yet clearly intelligible speech.
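As a rough illustration of the decoder step just described, the sketch below builds either a pulse-train or a white-noise excitation and pushes it through an all-pole synthesis filter. This is a toy under stated assumptions: the function names, the frame length, and the filter order are invented for the example and are not part of the course material.

```python
import random

def make_excitation(voiced, n, pitch_period=80, gain=1.0, seed=0):
    """One frame of decoder excitation: a unit-sample (pulse) train at
    the pitch period for voiced frames, white Gaussian noise otherwise."""
    if voiced:
        # an impulse every pitch_period samples, scaled by the gain
        return [gain if i % pitch_period == 0 else 0.0 for i in range(n)]
    rng = random.Random(seed)
    return [gain * rng.gauss(0.0, 1.0) for _ in range(n)]

def synthesize(excitation, a):
    """All-pole synthesis s(n) = e(n) + sum_k a_k * s(n-k), i.e. the
    filter 1/A(z) with A(z) = 1 - sum_k a_k z^-k."""
    s = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * s[n - k]
        s.append(acc)
    return s
```

A voiced 20 ms frame at 8 kHz would call `make_excitation(True, 160, pitch_period=80)`; the decoder then applies `synthesize` with the dequantized LPC coefficients.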
[Figure: basic LPC vocoder. Encoder: analysis of the original speech yields the voiced/unvoiced decision, the pitch period (voiced only), the LPC filter coefficients, and the signal power (gain G). Decoder: a pulse train (voiced) or random noise (unvoiced), selected by the V/U flag and scaled by G, excites the vocal tract model to produce the synthesized speech.]
[Figure: basic LPC coder and decoder. The buffered speech input feeds LPC filter analysis, voiced/unvoiced analysis, and pitch analysis; the filter coefficients, the voiced/unvoiced flag, the pitch period L, and the gain are each quantized and sent over the channel. At the decoder the parameters are inverse-quantized, and a pulse-train generator or noise excitation drives the synthesis filter H(z) to produce the output speech.]
LPC coding

• The main tasks and variations in LPC-based coding include:
  – How to code the residual signal in a way that preserves the perceptual speech features (with a given desired fidelity), yet uses the minimum number of bits possible.
  – How to code the coefficients of the prediction filter (or some transformation thereof) in order to ensure stability and minimize the effect of quantization.
  – How to estimate the various other entities (pitch, voicing, etc.), at what rate, and how to code them.
[Figure: LPC analysis of s(n) yields the prediction filter A(z), the residual signal e(n), and the pitch period; the speech model reverses this, exciting the synthesis filter H(z) = 1/A(z) with e(n).]

$$A(z) = 1 - \sum_{k=1}^{P} a_k z^{-k}, \qquad H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$$
LPC parameter estimation
$$R_n(i) = \sum_{k=1}^{K} a_k \, R_n(|i-k|), \qquad i = 1, \ldots, K$$

In matrix form:

$$\begin{bmatrix} R_n(0) & R_n(1) & \cdots & R_n(K-1) \\ R_n(1) & R_n(0) & \cdots & R_n(K-2) \\ \vdots & & \ddots & \vdots \\ R_n(K-1) & R_n(K-2) & \cdots & R_n(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_K \end{bmatrix} = \begin{bmatrix} R_n(1) \\ R_n(2) \\ \vdots \\ R_n(K) \end{bmatrix}$$
Levinson-Durbin Recursion
Due to the Toeplitz structure of the autocorrelation matrix, the LPC coefficients can be obtained recursively, looping over i = 1, 2, …, K.
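The recursion can be sketched in pure Python as follows. The helper names `autocorr` and `levinson_durbin` are invented for this example, and the sign convention follows the slides' predictor form A(z) = 1 − Σ a_k z^(−k).

```python
def autocorr(x, K):
    """Short-time autocorrelation R(0..K) of a frame x."""
    N = len(x)
    return [sum(x[n] * x[n - i] for n in range(i, N)) for i in range(K + 1)]

def levinson_durbin(R, K):
    """Solve the Toeplitz normal equations R(i) = sum_k a_k R(|i-k|)
    recursively.  Returns (a, refl): the LPC coefficients a_1..a_K and
    the reflection coefficients k_1..k_K."""
    a = [0.0] * (K + 1)        # a[1..K]; a[0] is unused
    E = R[0]                   # prediction-error energy
    refl = []
    for i in range(1, K + 1):
        # reflection coefficient k_i for this order
        acc = R[i] - sum(a[j] * R[i - j] for j in range(1, i))
        k = acc / E
        refl.append(k)
        # order update: a_j <- a_j - k * a_{i-j}, and a_i <- k
        a_new = a[:]
        a_new[i] = k
        for j in range(1, i):
            a_new[j] = a[j] - k * a[i - j]
        a = a_new
        E *= (1.0 - k * k)     # error shrinks at every order
    return a[1:], refl
```

For an AR(1)-like autocorrelation R = [1, 0.5, 0.25] the recursion returns a_1 = 0.5 and a_2 = 0, as expected: the second-order predictor adds nothing for a first-order process.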
A Simplistic Example

Direct matrix inverse for order 2:

$$\begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix} = \begin{bmatrix} R(0) & R(1) \\ R(1) & R(0) \end{bmatrix}^{-1} \begin{bmatrix} R(1) \\ R(2) \end{bmatrix} = \frac{1}{R^2(0) - R^2(1)} \begin{bmatrix} R(0) & -R(1) \\ -R(1) & R(0) \end{bmatrix} \begin{bmatrix} R(1) \\ R(2) \end{bmatrix}$$
Recursion for order-2

[The slide shows the order-2 recursion relating the LPC coefficients α_i to the reflection coefficients k_i.]
Equivalent LPC representations
• The L-D recursion guarantees a stable filter, but only with infinite precision. Quantization error may cause temporary instabilities, resulting in pops and clicks in the output.
• Other representations are better suited for quantization: the PARCOR coefficients (the negatives of the reflection coefficients) or the Line Spectral Frequencies (LSF, also called LSP) are used.
Quantization of the reflection coefficients
• The reflection coefficients are attractive in that they are bounded in magnitude by unity, so stability may be guaranteed simply by keeping the quantized coefficients bounded.
• Studies show that the spectral sensitivity of the LPC spectrum to small changes in the reflection coefficients is U-shaped, with large values whenever the magnitude of a reflection coefficient is close to unity. This necessitates a non-uniform quantization that accounts for the statistical distribution of these coefficients. Two transformations are used:
  – Log area ratios (LARs)
  – The inverse sine transformation (bounded by [−π/2, π/2]):
$$LAR(m) = \log\left\{ \frac{1 + k_m}{1 - k_m} \right\}, \qquad Si(m) = \arcsin(k_m)$$
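A minimal sketch of the two transformations (the helper names are hypothetical, and the quantizer itself is omitted). Both maps stretch the region near |k| = 1, which is where the LPC spectrum is most sensitive, so a uniform quantizer in the transformed domain behaves like a non-uniform one in k.

```python
import math

def lar(k):
    """Log area ratio of a reflection coefficient, valid for |k| < 1."""
    return math.log((1.0 + k) / (1.0 - k))

def lar_inv(g):
    """Inverse of the LAR transform: recover k from the log area ratio."""
    t = math.exp(g)
    return (t - 1.0) / (t + 1.0)

def inv_sine(k):
    """Inverse-sine transform, bounded in [-pi/2, pi/2]."""
    return math.asin(k)
```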
Line Spectral Frequencies
• Given an m-th order LPC polynomial
$$A_m(z) = 1 + a_1 z^{-1} + \cdots + a_m z^{-m}$$
• Two artificial polynomials of order m+1 are created as:
$$P_{m+1}(z) = A_m(z) + z^{-(m+1)} A_m(z^{-1})$$
$$Q_{m+1}(z) = A_m(z) - z^{-(m+1)} A_m(z^{-1})$$
• Yielding the relation:
$$A_m(z) = \tfrac{1}{2}\left[ P_{m+1}(z) + Q_{m+1}(z) \right]$$

It can be shown that all zeros of P and Q lie on the unit circle.
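The construction of P and Q can be sketched directly on coefficient lists. The function name `lsf_polynomials` is invented for this example; computing the actual line spectral frequencies would additionally require finding the roots of P and Q on the unit circle, which is omitted here.

```python
def lsf_polynomials(a):
    """Given A_m(z) = 1 + a1 z^-1 + ... + am z^-m as the coefficient
    list [a1, ..., am], build the symmetric polynomial P and the
    antisymmetric polynomial Q, both of order m+1 (length m+2 lists,
    ordered by increasing power of z^-1)."""
    c = [1.0] + list(a)
    fwd = c + [0.0]            # A_m(z), padded to order m+1
    rev = [0.0] + c[::-1]      # z^-(m+1) * A_m(1/z): coefficients reversed
    P = [f + r for f, r in zip(fwd, rev)]
    Q = [f - r for f, r in zip(fwd, rev)]
    return P, Q
```

By construction (P + Q)/2 recovers A_m(z); Q always has a zero at z = 1, and for even m, P has a zero at z = −1, matching the properties quoted on the next slide.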
LSP parameters are interpretable in terms of the formant frequencies of the model. Each zero of A(z) maps into one zero in each of the polynomials P(z) and Q(z). If the two resulting zeros are close in frequency, it is likely that the 'parent' zero in A(z) represents a formant of the model.

P has a real zero at z = −1, Q a real zero at z = +1; all other zeros are complex and interleaved on the unit circle. The frequencies of these zeros comprise the LSP parameters.
[Figure: zeros of P(z) and Q(z) interleaved along the frequency axis.]
Different Excitation Models
• The basic LPC vocoder yields synthetic-quality speech, even when the update rate or the number of quantization bits is increased. The inherent limitations are mainly:
  – The binary decision of whether speech is voiced or not.
  – Voiced fricatives (/z/): noise excitation with a periodic envelope.
  – Vowel excitation in natural speech has a noisy component above 2-3 kHz.
  – The higher spectrum of speech has a transient harmonic structure.
• Variations include the mixed-excitation model:
  – Sum a low-pass periodic waveform with a high-pass noise signal; the cut-off frequency marks the degree of voicing.
  – Add jitter to the (periodic) excitation by randomly varying the positions and amplitudes of the pulses.
  – Modify the phase spectrum of the excitation.
Residual-excited Linear Prediction coder
• In RELP, the basic idea is to use the actual prediction error signal, rather than a periodic pulse train or random noise, to excite the digital filter that reproduces the speech. (The prediction error is also called the residual.)
• Sometimes only the lower 1 kHz of the excitation is used.
• The advantage of RELP over basic LPC is that it avoids the problems associated with pitch (F0) estimation and voicing errors, since it does not require an explicit estimation of these entities.
RELP Coder-Decoder

[Figure: RELP coder and decoder. The audio input is pre-emphasized and windowed; linear prediction analysis (autocorrelation followed by Levinson-Durbin) yields the filter coefficients, and the analysis filter produces the residual. The residual is waveform-coded and, together with the filter coefficients, quantized and transmitted. The decoder reconstructs the residual signal and the filter coefficients, drives the synthesis filter with the residual, and de-emphasizes the audio output.]
Multipulse-excited LPC
• As an alternative to the basic LPC vocoder, a suitable number of pulses may be generated as the excitation sequence for a given speech segment (for instance, 10 pulses for a 10-msec segment). The amplitudes and locations may be optimized in a 'closed-loop' search.
[Figure: multipulse-excited LPC encoder. Buffered input speech is compared with the output of a synthesizer driven by the excitation pulses; the difference is frequency-weighted, and an error-minimization loop determines the locations and amplitudes of the optimum pulses, which are quantized and sent over the channel together with the quantized LPC filter.]
[Figure: the original residual d(n) from LPC analysis and the candidate pulses, whose positions we are trying to determine, each pass through the LPC synthesis filter h(n); the difference of the two synthesized signals is the error.]

$$E = \sum_{n=1}^{F} \left[ d(n) - A_m \, h(n-m) \right]^2$$

The time m and amplitude A_m of the first pulse are found by minimizing the squared error E, summed over the length of the analysis frame (4-5 msec).
The solution yields the expression for the amplitude:

$$A_m = \frac{\alpha(m)}{\phi(m,m)}$$

where α(m) is the cross-correlation between d(n) and h(n), and φ(m,m) is the covariance of h(n). The location (time m) of the pulse is found by selecting the lag m that maximizes:

$$\frac{\alpha^2(m)}{\phi(m,m)}$$

The process is repeated to find the other pulse locations.
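A toy version of the single-pulse search described above (the function name is invented; real coders use efficient update recursions rather than this brute-force double loop):

```python
def best_pulse(d, h):
    """Find the position m and amplitude A of the single pulse whose
    filtered version A * h(n - m) best matches the target d(n) in the
    least-squares sense, by maximizing alpha(m)^2 / phi(m, m)."""
    F = len(d)
    best = (0.0, 0, 0.0)   # (criterion value, position m, amplitude A)
    for m in range(F):
        # cross-correlation alpha(m) between the target and shifted h
        alpha = sum(d[n] * h[n - m] for n in range(m, F))
        # energy term phi(m, m) of the shifted impulse response
        phi = sum(h[n - m] ** 2 for n in range(m, F))
        if phi > 0 and alpha * alpha / phi > best[0]:
            best = (alpha * alpha / phi, m, alpha / phi)
    return best[1], best[2]
```

If the target is exactly a scaled, shifted copy of h, the search recovers that shift and scale; subsequent pulses would be found by subtracting the contribution of each found pulse from d and repeating.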
Hybrid coders
• A general class of coders that combine concepts of model-based coding (LP analysis) and waveform coding.
• Typically include long- and short-term predictors, and use the concept of analysis-by-synthesis.
• The residual can be coded as a waveform, a sequence of pulses, or vectors from a codebook:
  – Multi-Pulse Excitation (MPE): a sequence of nonuniformly spaced pulses as the excitation signal.
  – Regular-Pulse Excitation (RPE): a sequence of uniformly spaced pulses as the excitation signal.
  – Residual-Excited (RELP): residual coded as a waveform.
  – Code-Excited Linear Prediction (CELP): a codebook of excitation sequences.
Hybrid coders

[Figure: basic structure of a hybrid coder. An excitation generator (codeword c_j, gain G) drives the long-term predictor 1/P(z) (parameters {b_i}, M) and the short-term filter 1/A(z) (parameters {a_i}); the synthesized signal is subtracted from the original speech, the error is weighted by 1/W(z), and a minimization procedure selects the excitation parameters or codeword index sent over the channel. The decoder repeats the excitation generator and the filters 1/P(z) and 1/A(z) to produce the reconstructed speech ŝ(n).]
Analysis by synthesis
• Motivation
  – The optimality of some parameters is easy to determine directly (e.g., pitch), but not that of others (e.g., gain parameters).
  – The interaction among parameters is difficult to analyze but important to the synthesis.
• What is A-by-S?
  – Perform the complete analysis and synthesis inside the encoder.
  – The decoder is embedded in the encoder in order to optimize the extracted parameters.
Analysis by synthesis

[Figure: analysis-by-synthesis loop. The input speech x = [x_1, …, x_N] is analyzed into model parameters θ = [θ_1, …, θ_K] and an excitation e = [e_1, …, e_N]; the assumed model synthesizes x̂, and the parameters are chosen by minimizing the mean-squared error (MMSE).]
CELP

• Code-Excited Linear Prediction (CELP):
  – A family of techniques that quantize the LPC residual using a codebook of vectors.
  – CELP uses the fact that the residual of voiced speech has periodicity and can be used to predict the residual of the current frame.
  – Short-term prediction: the prediction using the LPC coefficients.
  – Long-term prediction: the prediction of the residual based on the pitch.
  – Analysis-by-synthesis technique: choosing the combination of parameters so that the reconstructed signal is as close as possible to the input signal.
• CELP was first introduced by B.S. Atal and M.A. Schroeder at ICC 1984.
• In 1988 the DoD selected the CELP algorithm developed by AT&T Bell Laboratories as the basis for the Federal Standard 4.8 kbps voice coder (FS-1016).
• It produced low-rate coded speech comparable in quality to that of medium-rate waveform coders.
• Analysis-by-Synthesis Linear Prediction
  – The excitation sequence is selected from a codebook by closed-loop optimization.
  – Adaptive and stochastic codebooks.
• Long-term Linear Prediction
  – The pitch (fine) structure of the speech is predicted.
• Perceptual Weighting (Filtering)
  – Shapes the error such that quantization noise is masked by high-energy formants.
• CELP is a hybrid coder.
• Other variants/standards: VSELP, LD-CELP.
Analysis by synthesis in CELP

[Figure: the input speech signal undergoes LPC analysis; candidate excitations pass through LPC synthesis, and the error between the input and the synthesized speech is perceptually weighted before minimization.]
Code Books
• The codebook originally consisted of Gaussian sequences: 1024 vectors of 40 samples (5 ms) each.
• Selecting the codeword involves an exhaustive search.
CELP

[Figure: CELP structure combining LPC analysis, short-term (ST) synthesis, pitch estimation, and long-term (LT) synthesis of the input speech signal.]

[Figure: CELP encoder. Each input frame passes through short-term prediction (LPC), long-term prediction (pitch and gain), and codeword search with gain calculation on the residual signal; the LPC coefficients, the pitch and gain, and the codeword and gain are sent over the channel, with intermediate results passed between stages.]
Long-term Prediction

[Figure: the excitation e(n) passes through the long-term synthesis filter H_LT(z) to produce r(n).]

$$R(z) = E(z)\, H_{LT}(z), \qquad H_{LT}(z) = \frac{1}{1 - c\, z^{-L}}$$
We assume the signal is periodic (i.e., repeats itself), and let L be the estimate of the pitch period. We then 'predict' the current period from the previous one:

$$\hat{x}(n) = c \cdot x(n - L)$$

$$c = \frac{E\left[ x(n)\, x(n-L) \right]}{E\left[ x^2(n-L) \right]} = \frac{R_x(0, L)}{R_x(L, L)}$$
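A sketch of the gain estimate for a known lag L (the function name is invented for the example; the search over candidate lags, which normally precedes this step, is omitted):

```python
def lt_gain(x, L):
    """Long-term prediction gain c for lag L, estimated over a frame:
    c = R(0, L) / R(L, L), so that x_hat(n) = c * x(n - L)."""
    num = sum(x[n] * x[n - L] for n in range(L, len(x)))   # R(0, L)
    den = sum(x[n - L] ** 2 for n in range(L, len(x)))     # R(L, L)
    return num / den if den else 0.0
```

For a signal that repeats every L samples with its amplitude scaled by a constant factor per period, the estimate recovers exactly that factor.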
Perceptual Weighting Filter

When computing the synthesis error, a perceptual filter is applied to the raw error signal in order to make the distortion measure more relevant to human hearing. The general form of the filter:

$$W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)} = \frac{1 - \sum_{k=1}^{K} a_k \gamma_1^{k} z^{-k}}{1 - \sum_{k=1}^{K} a_k \gamma_2^{k} z^{-k}}$$

The filter de-emphasizes the error energy in the formant regions, since the quantization noise there is masked by the strong speech energy.
Perceptual (Noise) Weighting Filter

The plain MSE metric does not match human perception as well as the weighted MSE metric (speech quality assessment). Typical parameter values:

$$W(z) = \frac{1 - \sum_{k=1}^{K} a_k \gamma_1^{k} z^{-k}}{1 - \sum_{k=1}^{K} a_k \gamma_2^{k} z^{-k}}, \qquad \gamma_1 = 0.9, \ \gamma_2 = 0.5$$
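Computing the coefficients of A(z/γ) only requires scaling each a_k by γ^k, a so-called bandwidth expansion that pulls the filter's roots toward the origin. A sketch with hypothetical helper names:

```python
def scaled_coeffs(a, gamma):
    """Coefficients of A(z/gamma): a_k is replaced by a_k * gamma**k.
    The list a holds [a_1, ..., a_K] (a_0 = 1 is implicit)."""
    return [ak * gamma ** (k + 1) for k, ak in enumerate(a)]

def weighting_filter(a, g1=0.9, g2=0.5):
    """Numerator and denominator coefficient lists of
    W(z) = A(z/g1) / A(z/g2)."""
    return scaled_coeffs(a, g1), scaled_coeffs(a, g2)
```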
Subframing

A frame of N_f = 160 samples (20 ms) is divided into four subframes of N_sf = 40 samples (5 ms) each. Each subframe is a vector in 40-dimensional space.
LP analysis iteration and resolution
• Computed for 20-30ms frames.
• Captures the formant structure.
• 10th order autocorrelation LPC is performed.
• LP parameters are represented with Line Spectrum Pairs (LSP).
• Quantize using 4 bits for each of f2-f5 and 3 bits for each of the others (34 bits in total), based on empirically determined probability density functions.
• Smooth filter transitions by linearly interpolating a new set of LSP frequencies every 1/4 frame.
Codebook

[Figure: a 7-bit codebook of 128 codewords c_1, …, c_128, each of subframe length N_sf, searched against the target vector x (e.g., in MATLAB, C = randn(128,40)).]
Codebook search

Let x be the target vector to approximate and c a candidate vector in the codebook, with angle θ between them:

$$\cos\theta = \frac{\langle x, c \rangle}{\|x\| \, \|c\|}, \qquad d = \|x\| \sin\theta$$

Minimizing the distance d amounts to minimizing θ, i.e. maximizing ⟨x, c⟩: the optimum codeword is the one that maximizes the correlation with the input vector.
Gain Computation

With x the target vector to approximate and c a candidate vector in the codebook, minimize

$$D = \min_{\alpha} \, \|x - \alpha c\|^2 \quad \Rightarrow \quad \alpha = \frac{\langle x, c \rangle}{\langle c, c \rangle}$$

In CELP coding, x is the prediction residual signal and c an excitation codeword from the codebook.
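The codebook search and the gain computation combine into a few lines; a brute-force sketch with invented names, using the normalized criterion ⟨x,c⟩²/⟨c,c⟩ so that codewords of different energies are compared fairly:

```python
def search_codebook(x, codebook):
    """Pick the codeword minimizing ||x - alpha*c||^2, i.e. maximizing
    <x,c>^2 / <c,c>.  Returns (index, optimal gain alpha)."""
    best = (-1.0, 0, 0.0)                       # (criterion, index, gain)
    for i, c in enumerate(codebook):
        xc = sum(a * b for a, b in zip(x, c))   # <x, c>
        cc = sum(b * b for b in c)              # <c, c>
        if cc > 0 and xc * xc / cc > best[0]:
            best = (xc * xc / cc, i, xc / cc)
    return best[1], best[2]
```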
Variation of CELP

[Figure: CELP variant with an adaptive codebook (a delayed version of the previous excitation samples, gain G1) and a fixed codebook (gain G2), whose scaled outputs are summed and passed through LPC synthesis; LPC analysis of the input speech signal, perceptual weighting of the error, and an LP-coefficient codebook complete the loop.]

Typical codebook sizes: fixed CB, 40 bits; adaptive CB, 8 bits; gain CBs, 5 bits each; LP coefficient CB, 28 bits. That is 86 bits per frame, i.e. 2^86 encoding alternatives to test!
Encoding process of the QCELP coder

[Figure: QCELP encoder. The input s(n) is high-pass filtered (Chebyshev II), Hamming-windowed, and LPC-analyzed (a1-a10); the LPC coefficients are converted to LSFs, interpolated, and converted back to LPC for the weighting filter W(z). After a rate decision, codebook entries (generated from a random seed) drive the pitch filter 1/P(z) and the synthesis filter; a perceptually weighted error-minimization procedure selects the pitch index and gain and the codebook index and gain.]
QCELP bit allocation

Over one analysis frame of 160 speech samples:

Rate 1 packets (160 bits total):
  LPC, per frame: 40 bits
  Pitch, per subframe: 10 bits × 4 subframes = 40 bits
  Codebook, per subframe: 10 bits × 8 subframes = 80 bits

Rate 1/4 packets (40 bits total):
  LPC, per frame: 10 bits
  Pitch, per subframe: 10 bits × 1 subframe = 10 bits
  Codebook, per subframe: 10 bits × 2 subframes = 20 bits
Examples
• Original (64 kbps PCM)
• ADPCM (32 kbps)
• LD-CELP (16 kbps)
• CS-ACELP (8 kbps)
• CELP (4.8 kbps)
• LPC-10 (2.4 kbps)
Data Coding

[Figure: two entropy-coding configurations. In the first, a modeling process maps the discrete source X to symbols Y, whose probability estimate P(Y) drives the entropy coder producing the binary bit stream. In the second, the entropy coder operates directly on the discrete source X with probabilities P(X).]
Entropy Coding
• Entropy coding is a process whereby a set of data parameters is encoded (represented by symbols) using an alphabet of variable-length symbols, so as to minimize the overall required bit rate given the probability of occurrence of these parameters.
• E.g., Morse code.
[Figure: the Morse code alphabet, with approximate letter frequencies in English text:

A .08  B .01  C .03  D .04  E .12  F .02  G .02  H .06  I .07  J .00  K .01  L .04  M .02
N .07  O .08  P .02  Q .00  R .06  S .06  T .09  U .03  V .01  W .02  X .00  Y .02  Z .00]
Entropy Coding

• Step 1
  – Arrange the p_i in decreasing order and consider them as tree leaves.
• Step 2
  – Merge the two nodes with the smallest probabilities into a new node and sum their probabilities.
  – Arbitrarily assign 1 and 0 to each pair of merging branches.
• Step 3
  – Repeat until no more than one node is left.
  – Read out each codeword sequentially from root to leaf.
• Variable-length code: assigns about log2(1/p_i) bits to the i-th value.
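The three steps above can be sketched with a binary heap. The function name `huffman_lengths` is invented for this example, and it returns only the code length of each symbol; reading out the actual 0/1 codewords from the tree is omitted for brevity.

```python
import heapq
from itertools import count

def huffman_lengths(probs):
    """Build a Huffman tree over the given probabilities and return the
    code length of each symbol, index-aligned with probs."""
    tiebreak = count()      # stable tiebreaker: never compare the lists
    heap = [(p, next(tiebreak), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)   # two smallest-probability nodes
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:                 # every merge adds one bit to
            lengths[i] += 1               # each symbol under the new node
        heapq.heappush(heap, (p1 + p2, next(tiebreak), s1 + s2))
    return lengths
```

Running it on the eight-symbol distribution of the next slide (0.25, 0.21, 0.15, 0.14 and four symbols of 0.0625) yields code lengths 2, 2, 3, 3, 4, 4, 4, 4, matching the Huffman column there.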
[Figure: Huffman tree construction for eight symbols S0-S7, merging the smallest probabilities pairwise up to the root.]

Symbol  Prob.    PCM   Huffman
S0      0.25     000   00
S1      0.21     001   10
S2      0.15     010   010
S3      0.14     011   011
S4      0.0625   100   1100
S5      0.0625   101   1101
S6      0.0625   110   1110
S7      0.0625   111   1111
Morse Code – average length
• The unweighted average of the code lengths of the letters:
  – (2 + 4 + 4 + 3 + …)/26 = 82/26 ≈ 3.2
• But the average over a typical real sequence of, say, 1,000,000 letters will be a function of the probability of occurrence of the various letters.
• The weighted average: (freq of A)·(length of code for A) + (freq of B)·(length of code for B) + …
  = .08·2 + .01·4 + .03·4 + .04·3 + … ≈ 2.4
Vector Quantization
• Basic idea
  – Treat several signal samples or coefficients as a vector and encode them together as a block.
  – More complicated, but better coding efficiency, per Shannon's rate-distortion theory.
  – Use an N-dimensional quantizer and a codebook of size L.

E.g.: partitioning of a two-dimensional space (N = 2) into 16 cells (L = 16).
Scalar quantization: x → Q → x̂. Each sample is quantized independently of the others.

Vector quantization: (x1, …, xK) → Q → (x̂1, …, x̂K). A block of samples is quantized simultaneously.
Stages
  – Codebook design
  – Encoding
  – Decoding

Scalar vs. vector quantization
  – VQ allows flexible partitioning of the coding cells.
  – VQ can naturally exploit the correlation between vector elements.
  – SQ is simpler to implement.
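The encoding stage of VQ is a nearest-neighbour search over the codebook; a minimal sketch (the function name is invented, and squared Euclidean distance is assumed as the distortion measure):

```python
def vq_encode(x, codebook):
    """Return the index of the codeword with minimum squared Euclidean
    distance to the input vector x; only this index is transmitted."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))
```

The decoder is a simple table lookup, `codebook[index]`; designing the codebook itself (e.g., by clustering training vectors) is a separate stage.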
ITU Recommended coders
• G.711 (Companding Pulse Code Modulation, μ-law or A-law PCM): used in the PSTN; converts analog speech into non-linear 8-bit samples. 64 kbps; continuous analog input; 8-bit output frame; 0.125 ms frame time; MOS 4.8.
• G.723.1 (Multi-Pulse Maximum Likelihood Quantizer, MP-MLQ): encodes 240 samples of 16-bit linear data (3840 bits) into 12 16-bit code words (192 bits). 6.4 kbps; 30 ms frame time; MOS 3.8.
• G.723.1 (Algebraic Code-Excited Linear Prediction, ACELP): encodes 240 samples of 16-bit linear data (3840 bits) into 10 16-bit code words (160 bits). 5.3 kbps; 30 ms frame time; MOS 3.7.
• G.726 (Adaptive Differential Pulse Code Modulation, ADPCM): converts analog speech into 3-, 4-, or 5-bit samples. 16, 24, 32, or 40 kbps; continuous analog input; 0.125 ms frame time; MOS 4.3.
• G.728 (Low-Delay Code-Excited Linear Prediction, LD-CELP): encodes 5 samples of 16-bit linear data (80 bits) into 10-bit code words. 16 kbps; 0.625 ms frame time; MOS 4.
• G.729 (Conjugate-Structure Algebraic Code-Excited Linear Prediction, CS-ACELP): encodes 80 samples of 16-bit linear data (1280 bits) into 10 8-bit code words (80 bits). 8 kbps; 10 ms frame time; MOS 4.
• G.729a (CS-ACELP, Annex A): same figures as G.729: 80 samples (1280 bits) into 10 8-bit code words (80 bits). 8 kbps; 10 ms frame time; MOS 4.

MOS, or mean opinion score: a subjective measure of voice quality. Scores of 4 to 5 are deemed toll quality, 3 to 4 communication quality, and less than 3 synthetic quality.