Digital Speech and Audio Processing E. Nemer UCI Spring 2008 - 1
[Figure: engineering model of speech production. A unit-sample train generator (pitch period P(n), output v(n)) models voiced excitation and a white-noise generator (output u(n)) models unvoiced excitation; a voiced/unvoiced switch and a gain feed the selected excitation to a linear time-varying filter h(n,m), controlled by the vocal tract parameters, which outputs the speech samples s(n). The filter is assumed constant for 20-50 msec.]
Overview

• In contrast to waveform coders, vocoders (voice coders) distill a very compact description of the input and digitize only the parameters of this description.
• They are based on the generally accepted engineering model of speech production:
  – Speech is the output of a linear time-varying filter (approximating the vocal tract).
  – The excitation is either a quasi-periodic pulse train (voiced) or a stationary random sequence (unvoiced).
Basic LPC Vocoder
• Coding speech thus entails determining the parameters of the model, deciding the excitation type and its value (pitch or variance), and quantizing these parameters.
• At the decoder, speech is synthesized by exciting the linear filter with either white noise or a pulse train.
• This basic vocoder generates synthetic-sounding, yet clearly intelligible speech.
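As a rough illustration of the decoder step just described, the sketch below builds either a pulse-train or a white-noise excitation and pushes it through an all-pole synthesis filter. This is a toy under stated assumptions: the function names, the frame length, and the filter order are invented for the example and are not part of the course material.

```python
import random

def make_excitation(voiced, n, pitch_period=80, gain=1.0, seed=0):
    """One frame of decoder excitation: a unit-sample (pulse) train at
    the pitch period for voiced frames, white Gaussian noise otherwise."""
    if voiced:
        # an impulse every pitch_period samples, scaled by the gain
        return [gain if i % pitch_period == 0 else 0.0 for i in range(n)]
    rng = random.Random(seed)
    return [gain * rng.gauss(0.0, 1.0) for _ in range(n)]

def synthesize(excitation, a):
    """All-pole synthesis s(n) = e(n) + sum_k a_k * s(n-k), i.e. the
    filter 1/A(z) with A(z) = 1 - sum_k a_k z^-k."""
    s = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * s[n - k]
        s.append(acc)
    return s
```

A voiced 20 ms frame at 8 kHz would call `make_excitation(True, 160, pitch_period=80)`; the decoder then applies `synthesize` with the dequantized LPC coefficients.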
[Figure: basic LPC vocoder. Encoder: analysis of the original speech yields the voiced/unvoiced decision, the pitch period (voiced only), the LPC filter coefficients, and the signal power (gain G). Decoder: a pulse train (voiced) or random noise (unvoiced), selected by the V/U flag and scaled by G, excites the vocal tract model to produce the synthesized speech.]
[Figure: basic LPC coder and decoder. The buffered speech input feeds LPC filter analysis, voiced/unvoiced analysis, and pitch analysis; the filter coefficients, the voiced/unvoiced flag, the pitch period L, and the gain are each quantized and sent over the channel. At the decoder the parameters are inverse-quantized, and a pulse-train generator or noise excitation drives the synthesis filter H(z) to produce the output speech.]
LPC coding

• The main tasks and variations in LPC-based coding include:
  – How to code the residual signal in a way that preserves the perceptual speech features (with a given desired fidelity), yet uses the minimum number of bits possible.
  – How to code the coefficients of the prediction filter (or some transformation thereof) in order to ensure stability and minimize the effect of quantization.
  – How to estimate the various other entities (pitch, voicing, etc.), at what rate, and how to code them.
[Figure: LPC analysis of s(n) yields the prediction filter A(z), the residual signal e(n), and the pitch period; the speech model reverses this, exciting the synthesis filter H(z) = 1/A(z) with e(n).]

$$A(z) = 1 - \sum_{k=1}^{P} a_k z^{-k}, \qquad H(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$$
LPC parameter estimation
$$R_n(i) = \sum_{k=1}^{K} a_k \, R_n(|i-k|), \qquad i = 1, \ldots, K$$

In matrix form:

$$\begin{bmatrix} R_n(0) & R_n(1) & \cdots & R_n(K-1) \\ R_n(1) & R_n(0) & \cdots & R_n(K-2) \\ \vdots & & \ddots & \vdots \\ R_n(K-1) & R_n(K-2) & \cdots & R_n(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_K \end{bmatrix} = \begin{bmatrix} R_n(1) \\ R_n(2) \\ \vdots \\ R_n(K) \end{bmatrix}$$
Levinson-Durbin Recursion
Due to the Toeplitz structure of the autocorrelation matrix, the LPC coefficients can be obtained recursively, looping over i = 1, 2, …, K.
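The recursion can be sketched in pure Python as follows. The helper names `autocorr` and `levinson_durbin` are invented for this example, and the sign convention follows the slides' predictor form A(z) = 1 − Σ a_k z^(−k).

```python
def autocorr(x, K):
    """Short-time autocorrelation R(0..K) of a frame x."""
    N = len(x)
    return [sum(x[n] * x[n - i] for n in range(i, N)) for i in range(K + 1)]

def levinson_durbin(R, K):
    """Solve the Toeplitz normal equations R(i) = sum_k a_k R(|i-k|)
    recursively.  Returns (a, refl): the LPC coefficients a_1..a_K and
    the reflection coefficients k_1..k_K."""
    a = [0.0] * (K + 1)        # a[1..K]; a[0] is unused
    E = R[0]                   # prediction-error energy
    refl = []
    for i in range(1, K + 1):
        # reflection coefficient k_i for this order
        acc = R[i] - sum(a[j] * R[i - j] for j in range(1, i))
        k = acc / E
        refl.append(k)
        # order update: a_j <- a_j - k * a_{i-j}, and a_i <- k
        a_new = a[:]
        a_new[i] = k
        for j in range(1, i):
            a_new[j] = a[j] - k * a[i - j]
        a = a_new
        E *= (1.0 - k * k)     # error shrinks at every order
    return a[1:], refl
```

For an AR(1)-like autocorrelation R = [1, 0.5, 0.25] the recursion returns a_1 = 0.5 and a_2 = 0, as expected: the second-order predictor adds nothing for a first-order process.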
A Simplistic Example

Direct matrix inverse for order 2:

$$\begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix} = \begin{bmatrix} R(0) & R(1) \\ R(1) & R(0) \end{bmatrix}^{-1} \begin{bmatrix} R(1) \\ R(2) \end{bmatrix} = \frac{1}{R^2(0) - R^2(1)} \begin{bmatrix} R(0) & -R(1) \\ -R(1) & R(0) \end{bmatrix} \begin{bmatrix} R(1) \\ R(2) \end{bmatrix}$$
Recursion for order-2

[The slide shows the order-2 recursion relating the LPC coefficients α_i to the reflection coefficients k_i.]
Equivalent LPC representations
• The L-D recursion guarantees a stable filter, but only with infinite precision. Quantization error may cause temporary instabilities, resulting in pops and clicks in the output.
• Other representations are better suited for quantization: the PARCOR coefficients (the negatives of the reflection coefficients) or the Line Spectral Frequencies (LSF, also called LSP) are used.
Quantization of the reflection coefficients
• The reflection coefficients are attractive in that they are bounded in magnitude by unity, so stability may be guaranteed simply by keeping the quantized coefficients bounded.
• Studies show that the spectral sensitivity of the LPC spectrum to small changes in the reflection coefficients is U-shaped, with large values whenever the magnitude of a reflection coefficient is close to unity. This necessitates a non-uniform quantization that accounts for the statistical distribution of these coefficients. Two transformations are used:
  – Log area ratios (LARs)
  – The inverse sine transformation (bounded by [−π/2, π/2]):
$$LAR(m) = \log\left\{ \frac{1 + k_m}{1 - k_m} \right\}, \qquad Si(m) = \arcsin(k_m)$$
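A minimal sketch of the two transformations (the helper names are hypothetical, and the quantizer itself is omitted). Both maps stretch the region near |k| = 1, which is where the LPC spectrum is most sensitive, so a uniform quantizer in the transformed domain behaves like a non-uniform one in k.

```python
import math

def lar(k):
    """Log area ratio of a reflection coefficient, valid for |k| < 1."""
    return math.log((1.0 + k) / (1.0 - k))

def lar_inv(g):
    """Inverse of the LAR transform: recover k from the log area ratio."""
    t = math.exp(g)
    return (t - 1.0) / (t + 1.0)

def inv_sine(k):
    """Inverse-sine transform, bounded in [-pi/2, pi/2]."""
    return math.asin(k)
```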
Line Spectral Frequencies
• Given an m-th order LPC polynomial
$$A_m(z) = 1 + a_1 z^{-1} + \cdots + a_m z^{-m}$$
• Two artificial polynomials of order m+1 are created as:
$$P_{m+1}(z) = A_m(z) + z^{-(m+1)} A_m(z^{-1})$$
$$Q_{m+1}(z) = A_m(z) - z^{-(m+1)} A_m(z^{-1})$$
• Yielding the relation:
$$A_m(z) = \tfrac{1}{2}\left[ P_{m+1}(z) + Q_{m+1}(z) \right]$$

It can be shown that all zeros of P and Q lie on the unit circle.
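The construction of P and Q can be sketched directly on coefficient lists. The function name `lsf_polynomials` is invented for this example; computing the actual line spectral frequencies would additionally require finding the roots of P and Q on the unit circle, which is omitted here.

```python
def lsf_polynomials(a):
    """Given A_m(z) = 1 + a1 z^-1 + ... + am z^-m as the coefficient
    list [a1, ..., am], build the symmetric polynomial P and the
    antisymmetric polynomial Q, both of order m+1 (length m+2 lists,
    ordered by increasing power of z^-1)."""
    c = [1.0] + list(a)
    fwd = c + [0.0]            # A_m(z), padded to order m+1
    rev = [0.0] + c[::-1]      # z^-(m+1) * A_m(1/z): coefficients reversed
    P = [f + r for f, r in zip(fwd, rev)]
    Q = [f - r for f, r in zip(fwd, rev)]
    return P, Q
```

By construction (P + Q)/2 recovers A_m(z); Q always has a zero at z = 1, and for even m, P has a zero at z = −1, matching the properties quoted on the next slide.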
LSP parameters are interpretable in terms of the formant frequencies of the model. Each zero of A(z) maps into one zero in each of the polynomials P(z) and Q(z). If the two resulting zeros are close in frequency, it is likely that the 'parent' zero in A(z) represents a formant of the model.

P has a real zero at z = −1, Q a real zero at z = +1; all other zeros are complex and interleaved on the unit circle. The frequencies of these zeros comprise the LSP parameters.
[Figure: zeros of P(z) and Q(z) interleaved along the frequency axis.]
Different Excitation Models
• The basic LPC vocoder yields synthetic-quality speech, even when the update rate or the number of quantization bits is increased. The inherent limitations are mainly:
  – The binary decision of whether speech is voiced or not.
  – Voiced fricatives (/z/): noise excitation with a periodic envelope.
  – Vowel excitation in natural speech has a noisy component above 2-3 kHz.
  – The higher spectrum of speech has a transient harmonic structure.
• Variations include the mixed-excitation model:
  – Sum a low-pass periodic waveform with a high-pass noise signal; the cut-off frequency marks the degree of voicing.
  – Add jitter to the (periodic) excitation by randomly varying the positions and amplitudes of the pulses.
  – Modify the phase spectrum of the excitation.
Residual-excited Linear Prediction coder
• In RELP, the basic idea is to use the actual prediction error signal, rather than a periodic pulse train or random noise, to excite the digital filter that reproduces the speech. (The prediction error is also called the residual.)
• Sometimes only the lower 1 kHz of the excitation is used.
• The advantage of RELP over basic LPC is that it avoids the problems associated with pitch (F0) estimation and voicing errors, since it does not require an explicit estimation of these entities.
RELP Coder-Decoder

[Figure: RELP coder and decoder. The audio input is pre-emphasized and windowed; linear prediction analysis (autocorrelation followed by Levinson-Durbin) yields the filter coefficients, and the analysis filter produces the residual. The residual is waveform-coded and, together with the filter coefficients, quantized and transmitted. The decoder reconstructs the residual signal and the filter coefficients, drives the synthesis filter with the residual, and de-emphasizes the audio output.]
Multipulse-excited LPC
• As an alternative to the basic LPC vocoder, a suitable number of pulses may be generated as the excitation sequence for a given speech segment (for instance, 10 pulses for a 10-msec segment). The amplitudes and locations may be optimized in a 'closed-loop' search.
[Figure: multipulse-excited LPC encoder. Buffered input speech is compared with the output of a synthesizer driven by the excitation pulses; the difference is frequency-weighted, and an error-minimization loop determines the locations and amplitudes of the optimum pulses, which are quantized and sent over the channel together with the quantized LPC filter.]
[Figure: the original residual d(n) from LPC analysis and the candidate pulses, whose positions we are trying to determine, each pass through the LPC synthesis filter h(n); the difference of the two synthesized signals is the error.]

$$E = \sum_{n=1}^{F} \left[ d(n) - A_m \, h(n-m) \right]^2$$

The time m and amplitude A_m of the first pulse are found by minimizing the squared error E, summed over the length of the analysis frame (4-5 msec).
The solution yields the expression for the amplitude:

$$A_m = \frac{\alpha(m)}{\phi(m,m)}$$

where α(m) is the cross-correlation between d(n) and h(n), and φ(m,m) is the covariance of h(n). The location (time m) of the pulse is found by selecting the lag m that maximizes:

$$\frac{\alpha^2(m)}{\phi(m,m)}$$

The process is repeated to find the other pulse locations.
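A toy version of the single-pulse search described above (the function name is invented; real coders use efficient update recursions rather than this brute-force double loop):

```python
def best_pulse(d, h):
    """Find the position m and amplitude A of the single pulse whose
    filtered version A * h(n - m) best matches the target d(n) in the
    least-squares sense, by maximizing alpha(m)^2 / phi(m, m)."""
    F = len(d)
    best = (0.0, 0, 0.0)   # (criterion value, position m, amplitude A)
    for m in range(F):
        # cross-correlation alpha(m) between the target and shifted h
        alpha = sum(d[n] * h[n - m] for n in range(m, F))
        # energy term phi(m, m) of the shifted impulse response
        phi = sum(h[n - m] ** 2 for n in range(m, F))
        if phi > 0 and alpha * alpha / phi > best[0]:
            best = (alpha * alpha / phi, m, alpha / phi)
    return best[1], best[2]
```

If the target is exactly a scaled, shifted copy of h, the search recovers that shift and scale; subsequent pulses would be found by subtracting the contribution of each found pulse from d and repeating.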
Hybrid coders
• A general class of coders that combine concepts of model-based coding (LP analysis) and waveform coding.
• Typically include long- and short-term predictors, and use the concept of analysis-by-synthesis.
• The residual can be coded as a waveform, a sequence of pulses, or vectors from a codebook:
  – Multi-Pulse Excitation (MPE): a sequence of nonuniformly spaced pulses as the excitation signal.
  – Regular-Pulse Excitation (RPE): a sequence of uniformly spaced pulses as the excitation signal.
  – Residual-Excited (RELP): residual coded as a waveform.
  – Code-Excited Linear Prediction (CELP): a codebook of excitation sequences.
Hybrid coders

[Figure: basic structure of a hybrid coder. An excitation generator (codeword c_j, gain G) drives the long-term predictor 1/P(z) (parameters {b_i}, M) and the short-term filter 1/A(z) (parameters {a_i}); the synthesized signal is subtracted from the original speech, the error is weighted by 1/W(z), and a minimization procedure selects the excitation parameters or codeword index sent over the channel. The decoder repeats the excitation generator and the filters 1/P(z) and 1/A(z) to produce the reconstructed speech ŝ(n).]
Analysis by synthesis
• Motivation
  – The optimality of some parameters is easy to determine directly (e.g., pitch), but not that of others (e.g., gain parameters).
  – The interaction among parameters is difficult to analyze but important to the synthesis.
• What is A-by-S?
  – Perform the complete analysis and synthesis inside the encoder.
  – The decoder is embedded in the encoder in order to optimize the extracted parameters.
Analysis by synthesis

[Figure: analysis-by-synthesis loop. The input speech x = [x_1, …, x_N] is analyzed into model parameters θ = [θ_1, …, θ_K] and an excitation e = [e_1, …, e_N]; the assumed model synthesizes x̂, and the parameters are chosen by minimizing the mean-squared error (MMSE).]
CELP

• Code-Excited Linear Prediction (CELP):
  – A family of techniques that quantize the LPC residual using a codebook of vectors.
  – CELP uses the fact that the residual of voiced speech has periodicity and can be used to predict the residual of the current frame.
  – Short-term prediction: the prediction using the LPC coefficients.
  – Long-term prediction: the prediction of the residual based on the pitch.
  – Analysis-by-synthesis technique: choosing the combination of parameters so that the reconstructed signal is as close as possible to the input signal.
• CELP was first introduced by B.S. Atal and M.A. Schroeder at ICC 1984.
• In 1988 the DoD selected the CELP algorithm developed by AT&T Bell Laboratories as the basis for the Federal Standard 4.8 kbps voice coder (FS-1016).
• It produced low-rate coded speech comparable in quality to that of medium-rate waveform coders.
• Analysis-by-Synthesis Linear Prediction
  – The excitation sequence is selected from a codebook by closed-loop optimization.
  – Adaptive and stochastic codebooks.
• Long-term Linear Prediction
  – The pitch (fine) structure of the speech is predicted.
• Perceptual Weighting (Filtering)
  – Shapes the error such that quantization noise is masked by high-energy formants.
• CELP is a hybrid coder.
• Other variants/standards: VSELP, LD-CELP.
Analysis by synthesis in CELP

[Figure: the input speech signal undergoes LPC analysis; candidate excitations pass through LPC synthesis, and the error between the input and the synthesized speech is perceptually weighted before minimization.]
Code Books
• The codebook originally consisted of Gaussian sequences: 1024 vectors of 40 samples (5 ms) each.
• Selecting the codeword involves an exhaustive search.
CELP

[Figure: CELP structure combining LPC analysis, short-term (ST) synthesis, pitch estimation, and long-term (LT) synthesis of the input speech signal.]

[Figure: CELP encoder. Each input frame passes through short-term prediction (LPC), long-term prediction (pitch and gain), and codeword search with gain calculation on the residual signal; the LPC coefficients, the pitch and gain, and the codeword and gain are sent over the channel, with intermediate results passed between stages.]
Long-term Prediction

[Figure: the excitation e(n) passes through the long-term synthesis filter H_LT(z) to produce r(n).]

$$R(z) = E(z)\, H_{LT}(z), \qquad H_{LT}(z) = \frac{1}{1 - c\, z^{-L}}$$
We assume the signal is periodic (i.e., repeats itself), and let L be the estimate of the pitch period. We then 'predict' the current period from the previous one:

$$\hat{x}(n) = c \cdot x(n - L)$$

$$c = \frac{E\left[ x(n)\, x(n-L) \right]}{E\left[ x^2(n-L) \right]} = \frac{R_x(0, L)}{R_x(L, L)}$$
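A sketch of the gain estimate for a known lag L (the function name is invented for the example; the search over candidate lags, which normally precedes this step, is omitted):

```python
def lt_gain(x, L):
    """Long-term prediction gain c for lag L, estimated over a frame:
    c = R(0, L) / R(L, L), so that x_hat(n) = c * x(n - L)."""
    num = sum(x[n] * x[n - L] for n in range(L, len(x)))   # R(0, L)
    den = sum(x[n - L] ** 2 for n in range(L, len(x)))     # R(L, L)
    return num / den if den else 0.0
```

For a signal that repeats every L samples with its amplitude scaled by a constant factor per period, the estimate recovers exactly that factor.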
Perceptual Weighting Filter

When computing the synthesis error, a perceptual filter is applied to the raw error signal in order to make the distortion measure more relevant to human hearing. The general form of the filter:

$$W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)} = \frac{1 - \sum_{k=1}^{K} a_k \gamma_1^{k} z^{-k}}{1 - \sum_{k=1}^{K} a_k \gamma_2^{k} z^{-k}}$$

The filter de-emphasizes the error energy in the formant regions, since the quantization noise there is masked by the strong speech energy.
Perceptual (Noise) Weighting Filter

The plain MSE metric does not match human perception as well as the weighted MSE metric (speech quality assessment). Typical parameter values:

$$W(z) = \frac{1 - \sum_{k=1}^{K} a_k \gamma_1^{k} z^{-k}}{1 - \sum_{k=1}^{K} a_k \gamma_2^{k} z^{-k}}, \qquad \gamma_1 = 0.9, \ \gamma_2 = 0.5$$
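Computing the coefficients of A(z/γ) only requires scaling each a_k by γ^k, a so-called bandwidth expansion that pulls the filter's roots toward the origin. A sketch with hypothetical helper names:

```python
def scaled_coeffs(a, gamma):
    """Coefficients of A(z/gamma): a_k is replaced by a_k * gamma**k.
    The list a holds [a_1, ..., a_K] (a_0 = 1 is implicit)."""
    return [ak * gamma ** (k + 1) for k, ak in enumerate(a)]

def weighting_filter(a, g1=0.9, g2=0.5):
    """Numerator and denominator coefficient lists of
    W(z) = A(z/g1) / A(z/g2)."""
    return scaled_coeffs(a, g1), scaled_coeffs(a, g2)
```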
Subframing

A frame of N_f = 160 samples (20 ms) is divided into four subframes of N_sf = 40 samples (5 ms) each. Each subframe is a vector in 40-dimensional space.
LP analysis iteration and resolution
• Computed for 20-30ms frames.
• Captures the formant structure.
• 10th order autocorrelation LPC is performed.
• LP parameters are represented with Line Spectrum Pairs (LSP).
• Quantize using 4 bits for each of f2-f5 and 3 bits for each of the others (34 bits in total), based on empirically determined probability density functions.
• Smooth filter transitions by linearly interpolating a new set of LSP frequencies every 1/4 frame.
Codebook

[Figure: a 7-bit codebook of 128 codewords c_1, …, c_128, each of subframe length N_sf, searched against the target vector x (e.g., in MATLAB, C = randn(128,40)).]
Codebook search

Let x be the target vector to approximate and c a candidate vector in the codebook, with angle θ between them:

$$\cos\theta = \frac{\langle x, c \rangle}{\|x\| \, \|c\|}, \qquad d = \|x\| \sin\theta$$

Minimizing the distance d amounts to minimizing θ, i.e. maximizing ⟨x, c⟩: the optimum codeword is the one that maximizes the correlation with the input vector.
Gain Computation

With x the target vector to approximate and c a candidate vector in the codebook, minimize

$$D = \min_{\alpha} \, \|x - \alpha c\|^2 \quad \Rightarrow \quad \alpha = \frac{\langle x, c \rangle}{\langle c, c \rangle}$$

In CELP coding, x is the prediction residual signal and c an excitation codeword from the codebook.
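The codebook search and the gain computation combine into a few lines; a brute-force sketch with invented names, using the normalized criterion ⟨x,c⟩²/⟨c,c⟩ so that codewords of different energies are compared fairly:

```python
def search_codebook(x, codebook):
    """Pick the codeword minimizing ||x - alpha*c||^2, i.e. maximizing
    <x,c>^2 / <c,c>.  Returns (index, optimal gain alpha)."""
    best = (-1.0, 0, 0.0)                       # (criterion, index, gain)
    for i, c in enumerate(codebook):
        xc = sum(a * b for a, b in zip(x, c))   # <x, c>
        cc = sum(b * b for b in c)              # <c, c>
        if cc > 0 and xc * xc / cc > best[0]:
            best = (xc * xc / cc, i, xc / cc)
    return best[1], best[2]
```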
Variation of CELP

[Figure: CELP variant with an adaptive codebook (a delayed version of the previous excitation samples, gain G1) and a fixed codebook (gain G2), whose scaled outputs are summed and passed through LPC synthesis; LPC analysis of the input speech signal, perceptual weighting of the error, and an LP-coefficient codebook complete the loop.]

Typical codebook sizes: fixed CB, 40 bits; adaptive CB, 8 bits; gain CBs, 5 bits each; LP coefficient CB, 28 bits. That is 86 bits per frame, i.e. 2^86 encoding alternatives to test!
Encoding process of the QCELP coder

[Figure: QCELP encoder. The input s(n) is high-pass filtered (Chebyshev II), Hamming-windowed, and LPC-analyzed (a1-a10); the LPC coefficients are converted to LSFs, interpolated, and converted back to LPC for the weighting filter W(z). After a rate decision, codebook entries (generated from a random seed) drive the pitch filter 1/P(z) and the synthesis filter; a perceptually weighted error-minimization procedure selects the pitch index and gain and the codebook index and gain.]
QCELP bit allocation

Over one analysis frame of 160 speech samples:

Rate 1 packets (160 bits total):
  LPC, per frame: 40 bits
  Pitch, per subframe: 10 bits × 4 subframes = 40 bits
  Codebook, per subframe: 10 bits × 8 subframes = 80 bits

Rate 1/4 packets (40 bits total):
  LPC, per frame: 10 bits
  Pitch, per subframe: 10 bits × 1 subframe = 10 bits
  Codebook, per subframe: 10 bits × 2 subframes = 20 bits
Examples
• Original (64 kbps PCM)
• ADPCM (32 kbps)
• LD-CELP (16 kbps)
• CS-ACELP (8 kbps)
• CELP (4.8 kbps)
• LPC-10 (2.4 kbps)
Data Coding

[Figure: two entropy-coding configurations. In the first, a modeling process maps the discrete source X to symbols Y, whose probability estimate P(Y) drives the entropy coder producing the binary bit stream. In the second, the entropy coder operates directly on the discrete source X with probabilities P(X).]
Entropy Coding
• Entropy coding is a process whereby a set of data parameters is encoded (represented by symbols) using an alphabet of variable-length symbols, so as to minimize the overall required bit rate given the probability of occurrence of these parameters.
• E.g., Morse code.
[Figure: the Morse code alphabet, with approximate letter frequencies in English text:

A .08  B .01  C .03  D .04  E .12  F .02  G .02  H .06  I .07  J .00  K .01  L .04  M .02
N .07  O .08  P .02  Q .00  R .06  S .06  T .09  U .03  V .01  W .02  X .00  Y .02  Z .00]
Entropy Coding

• Step 1
  – Arrange the p_i in decreasing order and consider them as tree leaves.
• Step 2
  – Merge the two nodes with the smallest probabilities into a new node and sum their probabilities.
  – Arbitrarily assign 1 and 0 to each pair of merging branches.
• Step 3
  – Repeat until no more than one node is left.
  – Read out each codeword sequentially from root to leaf.
• Variable-length code: assigns about log2(1/p_i) bits to the i-th value.
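The three steps above can be sketched with a binary heap. The function name `huffman_lengths` is invented for this example, and it returns only the code length of each symbol; reading out the actual 0/1 codewords from the tree is omitted for brevity.

```python
import heapq
from itertools import count

def huffman_lengths(probs):
    """Build a Huffman tree over the given probabilities and return the
    code length of each symbol, index-aligned with probs."""
    tiebreak = count()      # stable tiebreaker: never compare the lists
    heap = [(p, next(tiebreak), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)   # two smallest-probability nodes
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:                 # every merge adds one bit to
            lengths[i] += 1               # each symbol under the new node
        heapq.heappush(heap, (p1 + p2, next(tiebreak), s1 + s2))
    return lengths
```

Running it on the eight-symbol distribution of the next slide (0.25, 0.21, 0.15, 0.14 and four symbols of 0.0625) yields code lengths 2, 2, 3, 3, 4, 4, 4, 4, matching the Huffman column there.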
[Figure: Huffman tree construction for eight symbols S0-S7, merging the smallest probabilities pairwise up to the root.]

Symbol  Prob.    PCM   Huffman
S0      0.25     000   00
S1      0.21     001   10
S2      0.15     010   010
S3      0.14     011   011
S4      0.0625   100   1100
S5      0.0625   101   1101
S6      0.0625   110   1110
S7      0.0625   111   1111
Morse Code – average length
• The unweighted average of the code lengths of the letters:
  – (2 + 4 + 4 + 3 + …)/26 = 82/26 ≈ 3.2
• But the average over a typical real sequence of, say, 1,000,000 letters will be a function of the probability of occurrence of the various letters.
• The weighted average: (freq of A)·(length of code for A) + (freq of B)·(length of code for B) + …
  = .08·2 + .01·4 + .03·4 + .04·3 + … ≈ 2.4
Vector Quantization
• Basic idea
  – Treat several signal samples or coefficients as a vector and encode them together as a block.
  – More complicated, but better coding efficiency, per Shannon's rate-distortion theory.
  – Use an N-dimensional quantizer and a codebook of size L.

E.g.: partitioning of a two-dimensional space (N = 2) into 16 cells (L = 16).
Scalar quantization: x → Q → x̂. Each sample is quantized independently of the others.

Vector quantization: (x1, …, xK) → Q → (x̂1, …, x̂K). A block of samples is quantized simultaneously.
Stages
  – Codebook design
  – Encoding
  – Decoding

Scalar vs. vector quantization
  – VQ allows flexible partitioning of the coding cells.
  – VQ can naturally exploit the correlation between vector elements.
  – SQ is simpler to implement.
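The encoding stage of VQ is a nearest-neighbour search over the codebook; a minimal sketch (the function name is invented, and squared Euclidean distance is assumed as the distortion measure):

```python
def vq_encode(x, codebook):
    """Return the index of the codeword with minimum squared Euclidean
    distance to the input vector x; only this index is transmitted."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))
```

The decoder is a simple table lookup, `codebook[index]`; designing the codebook itself (e.g., by clustering training vectors) is a separate stage.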
ITU Recommended coders
• G.711 (Companding Pulse Code Modulation, μ-law or A-law PCM): used in the PSTN; converts analog speech into non-linear 8-bit samples. 64 kbps; continuous analog input; 8-bit output frame; 0.125 ms frame time; MOS 4.8.
• G.723.1 (Multi-Pulse Maximum Likelihood Quantizer, MP-MLQ): encodes 240 samples of 16-bit linear data (3840 bits) into 12 16-bit code words (192 bits). 6.4 kbps; 30 ms frame time; MOS 3.8.
• G.723.1 (Algebraic Code-Excited Linear Prediction, ACELP): encodes 240 samples of 16-bit linear data (3840 bits) into 10 16-bit code words (160 bits). 5.3 kbps; 30 ms frame time; MOS 3.7.
• G.726 (Adaptive Differential Pulse Code Modulation, ADPCM): converts analog speech into 3-, 4-, or 5-bit samples. 16, 24, 32, or 40 kbps; continuous analog input; 0.125 ms frame time; MOS 4.3.
• G.728 (Low-Delay Code-Excited Linear Prediction, LD-CELP): encodes 5 samples of 16-bit linear data (80 bits) into 10-bit code words. 16 kbps; 0.625 ms frame time; MOS 4.
• G.729 (Conjugate-Structure Algebraic Code-Excited Linear Prediction, CS-ACELP): encodes 80 samples of 16-bit linear data (1280 bits) into 10 8-bit code words (80 bits). 8 kbps; 10 ms frame time; MOS 4.
• G.729a (CS-ACELP, Annex A): same figures as G.729: 80 samples (1280 bits) into 10 8-bit code words (80 bits). 8 kbps; 10 ms frame time; MOS 4.

MOS, or mean opinion score: a subjective measure of voice quality. Scores of 4 to 5 are deemed toll quality, 3 to 4 communication quality, and less than 3 synthetic quality.