State of the Art in Perceptual Coding: MPEG-2/4 Advanced ... › fileadmin › media › mt › lehre › ... · Huffman coding of scalefactors and spectral coefficients. Prof. Dr

Prof. Dr. Karlheinz Brandenburg, [email protected] Page 1

State of the Art in Perceptual Coding: MPEG-2/4 Advanced Audio Coding (AAC)


History•

1994: Official start of AAC development

•

Goal: Development of a new powerful state-of-the-art multi-channel coder without compatibility constraints

•

1997: AAC International standard (IS)

•

1999: AAC part of the MPEG-4 standard

•

Today: favorite coder for many application areas like Internet audio, solid state players, ISDN music transmission, High definition TV (HDTV), satellite and terrestrial digital audio, broadcasting


Overview (1)•

Next generation mono/stereo/multichannel coding

•

Same quality at half the bit-rate

•

International cooperation of the Fraunhofer Institute and companies like AT&T, Sony and Dolby

•

Most efficient MPEG method for audio data compression up until now

•

Driving force to develop AAC was the quest for an efficient coding method for surround signals, like 5-channel signals (cinemas)


Overview (2)•

Makes use of the signal masking properties of the human ear in order to reduce the amount of data

•

Quantization noise is distributed to frequency bands in such a way that it is masked by the total signal

•

Iterative encoder structure using Huffman coding and non-uniform quantization

–

Features found in Layer 3 and PAC

•

Window type and block switching–

Features found in AC-3, Layer 3, PAC, + new

•

Temporal Noise Shaping (TNS)–

New technique


Overview (3)•

Prediction

•

Bit reservoir

•

M/S stereo coding

•

Intensity stereo coding

•

Gain control


AAC-

Encoder Overview


MPEG-AAC: Basic Features•

High frequency resolution filter bank-based coder (1024 subband

MDCT with 50% overlap)

•

1:8 block switching (1024/128 subband

MDCT)

•

Non-uniform quantizer

•

Noise shaping in half critical bands (scalefactorbands)

•

Huffman coding of scalefactors

and spectral coefficients


MPEG-2 AAC: Advanced Coding Tools

•

Window shape adaptation usually fixed (sine or KBD)

•

Temporal noise shaping (TNS) often used

•

Gain control (SRS/ Sample Rate Scalable profile, only), not often used

•

Backward adaptive prediction not often used


Frequency response of Sine and Fielder windowFielder (KBD) window

Sine window


MPEG-2 AAC: Joint Stereo Tools

•

Mid/Side stereo (MS) per scalefactor

band

•

Intensity stereo coding between channel pairs

•

Coupling channel(s)

Other Features:

•

Flexible bitstream

format for up to 48 channels

•

Low Frequency Enhancement (LFE) channel(s)


MPEG-2 AAC PerformanceTest Results:•

Broadcast quality at 320 kbit/s for 5 channels (better than MPEG-2 Layer II at 640 kbit/s)

•

Broadcast quality at 128 kbit/s stereo•

Comparison to other codecs: AAC 96 kbit/s stereo comparable to

–

AC-3 at 160 kbit/s–

Layer II at 192 kbit/s–

Layer III at 128 kbit/s Hence: AAC is successful in providing higher

compression ratios)•

Very low bitrates (comparison within MPEG): AAC best audio coder at bitrates down to 16 kbit/s for mono and stereo


Differences MPEG-2 AAC and MPEG Audio Layer-3

•

Filter bank

•

ISO/MPEG Audio Layer-3 uses hybrid filter bank chosen for reasons of compatibility

•

MPEG-2 AAC uses a plain Modified Discrete Cosine Transform (MDCT) to reduce aliasing

•

Together with the increased number of subbands

(1024 instead of 576 samples) the

MDCT outperforms the filter banks of previous coding methods


Differences MPEG-2 AAC and MPEG Audio Layer-3 •

Temporal Noise Shaping TNS•

Shapes the distribution of quantization noise intime by prediction in the frequency domain

•

Voice signals in particular experience considerable improvement through TNS

•

Prediction

(in band in time domain)

•

A technique commonly established in the area of speech coding systems

•

It benefits from the fact that stationary audio signals are predictable to a certain extend

•

But requires higher computational complexity


Differences MPEG-2 AAC and MPEG Audio Layer-3 •

Quantization•

By allowing finer control of quantization resolution, the given bit rate can be used more efficiently

•

Bit-stream format•

The information to be transmitted undergoes entropy coding in order to keep redundancy as low as possible

•

The optimization of these coding methods together with a flexible bit-stream structure has made further improvement of the coding efficiency possible


Filter Bank Details•

MDCT

(Princen

/ Bradley)

–

TDAC, MLT, cosine modulated filter bank–

critical sampling

–

time domain aliasing cancellation•

Block switching

to adjust the impulse response

•

Window type switching–

sine window

–

Kaiser Bessel Derived (KBD) window


MPEG-4 General Audio Coding


MPEG-4 Audio•

Interactivity

•

High compression•

Universal accessibility

MPEG-4 addresses applications in the shaded area.

‘TV/film’

‘Telecom’‘Computer’

interactivityWireless

AV-data

wide range of bit rates, more channels possible


A short view into MPEG-4 Audio (1)•

Very diverse requirements: no single algorithm:

–

Music synthesis (Structured Audio) = kind of an extension of midi

–

Very low rate parametric coding (HILN, HVXC)–

Speech coding (CELP)

–

Perceptual Coding ("General Audio") over a wide range of bitrates

•

High quality coding done via AAC with additional coding tools:

–

TvinVQ, scalability tools–

Perceptual Noise Substitution (PNS)

•

Backwards compatibility, no new coding paradigm for high quality audio


A short view into MPEG-4 Audio (2)

•

MPEG-4 General Audio Coding: The “all-round coder”

in MPEG-4 audio

•

MPEG-4 Extensions:–

Perceptual Noise Substitution (PNS)

–

Long Term Prediction–

TwinVQ

Coding Core ( important for low

bitrates)


MPEG-4 Audio Algorithms DSL

text to speech


MPEG-4 Audio Profiles•

Speech Audio Profile

–

Parametric speech coder (HVXC)–

CELP speech coder

• Synthesis Audio Profile–

generate speech and sound

•

Scalable Audio Profile–

contains the speech audio profile

–

General audio coding (AAC)–

TwinVQ

tools

–

Scalable coding of speech and music•

Main Audio Profile

–

contains all other profiles


Temporal Noise

Shaping: TNS


Temporal Noise Shaping (1)

t

t

t

t

original

TNS

e.g. castanet

e.g. speech

quantization noise

1 frame 1 frame


Why use TNS instead of block switching?

Low number of subbands

leads to higher bit rate. Ok if it only happens occasionally. Therefore

buffer can be used. Problem if there are many peaks, as in speech the glottal pulses (every few ms!). Bit rate would become too high, or the quantization noise too high.

Alternative approach is needed TNS

But: TNS is not really a replacement for block switching


Temporal Noise Shaping (2)

•

Solution for avoiding quantization noise spread:–

Make smaller frames (works for attacks but notfor speech decrease of coding efficiency)

–

Higher time resolution to shape quantizationnoise

–

TNS

•

Limitation of TNS:–

Time domain aliasing


Speech Coding as Model for TNS

•

In frequency domain the prediction error is flat

•

TNS predicts in frequency domain instead of time domain, shapes noise in time domain.

encoder

decoder

compute LPC coefficientsand transmit them to the decoder

should beoriginal spectrum

predictionerror

linear FIRfilter

predictor only knows past

samples

should be flat spectrumfrequency response of structure inverse of signal

structure should havefrequency response like signal

quantization noise is shaped accordingly in frequency domain


TNS•

Switch roles of time and frequency domain:

predict not over time, but over frequencies, over the subbands

quantization error is shaped (after decoding)in the time domain (instead the frequency domain) like the signal

hopefully reduces pre-echo artifacts

But: aliasing in time domain limits effectiveness (peaks are mirrored over time)


MDCT1024 bands

z-1 pred

Q

TNS

Audio

already in AAC subbands

sequence of subbands

coeffside-info

predicts from one subband

to the next, starting at the lowest subband, from subband

0 it predicts 1,…

Structure of TNS (encoder)


TNS Decoder

pred z-1

SynthMDCT1024 bands

Audio

subbands


Extension: Perceptual Noise Substitution (PNS)


Perceptual Noise Substitution (1)Background: •

Parametric coding of signals gives a very compact signal representation

•

Parametric coding of noise-like signal components has been used widely e.g. in speech coding

•

Can similar techniques be used in perceptual audio coding ?

MPEG-4:•

Perceptual Noise Substitution (PNS) permits a frequency selective parametric coding of noise-like signal components


Perceptual Noise Substitution (2)


Perceptual Noise Substitution (3)Principle:•

Noise-like signal components are detected on a scalefactor

band basis

•

Corresponding groups of spectral coefficients are excluded from quantization/coding

•

Instead, only a "noise substitution flag" plus totalpower of the substituted band is transmitted in the bitstream

•

Decoder inserts pseudo random vectors with desired target power as spectral coefficientsHighly compact representation for noise-like spectral components


Extension: Long Term Prediction


Prediction (1)

Background:•

Tone-like signals require much higher coding precision than noise-like signals (e.g. 20 dB vs. 6 dB SNR)High bit rate necessary to code signals with many tonal components (e.g. Harpsichord, Pitch Pipe)

•

Stationary tonal signal components are predictable, even in the downsampled

subbands

Further quality enhancement by predictive coding


Prediction (2)MPEG-2 AAC:•

Prediction of each spectral coefficient; backward adaptive 2nd order Lattice predictor

•

High complexity (ca. 50% of decoder computation & RAM)Prohibitive for cost sensitive applications, used in “MPEG-2 Main Profile” only

MPEG-4 AAC:•

Long Term Predictor (LTP) as known from speech coding, before MDCT on time domain

signal.

•

New: Integration into perceptual audio coder•

Lower complexity: Saving of approx. 50% in terms of computation and memory over MPEG-2 predictors


Long Term Prediction


Transform-Domain Weighted Interleave VQ (1)Background:•

Audio coding at extremely low bitrates (≥

6

kbit/s)MPEG-4:•

Transform-Domain Weighted Interleave Vector Quantization (TwinVQ) as additional coding kernel

•

Fully integrated into MPEG-4 AAC coding system:

–

Uses same spectral representation as AAC coder

–

Makes use of other MPEG-4 tools (e.g. LTP, TNS, joint stereo coding)

–

Possible core coder for MPEG-4 scalable coding


Transform-Domain Weighted Interleave VQ (2)Structure:•

Normalization of spectral coefficients:

–

LPC envelope (overall spectral shape)–

Periodic component coding (harmonic components)

–

Bark-scale envelope coding (additional flattening)

•

Vector Quantization (VQ) process:–

Interleaving of spectral coefficients into new sub-vectors

–

Vector quantization (two sets of codebooks, weighted distortion measure)

–

allows distortion control by perceptual model


Transform-Domain Weighted Interleave VQ (3)


Conclusions•

The “all-round”

coding system among the MPEG-4

audio schemes, providing a set powerful tools•

Based on MPEG-2 Advanced Audio Coding kernel

•

Several enhancements for improved coding efficiency

•

Perceptual Noise Substitution (PNS): Exploiting noise-like components in the signal

•

Long Term Prediction (LTP): Taking advantage of verystationary / tonal signals

•

TwinVQ

coder kernel supports audio coding at extremely low data rates (MPEG-4 scalable coding)


MPEG-4 Scalable Audio Coding


Scalable Audio Coding

•

Embed lower quality (e.g. lower bandwidth) bitstream

in higher bandwidth bitstream

•

Key functionality for MPEG-4 audio

•

Main types of scalability:–

Small step scalability

Enhancement layers of ~ 1 kbit/s–

Large step scalability

Enhancement layers of 8 kbit/s and more

•

All natural audio coding in MPEG-4 supports scalability


The Scalable Audio Profile•

Four levels defined according to

–

sampling frequency–

number of channels / objects

•

Objects in the scalable audio profile–

AAC LC

–

AAC LTP–

AAC Scalable

–

TwinVQ–

CELP

–

HVXC–

TTSI


Speech Coding•

HVXC (Harmonic Vector EXcitation

Coding)

–

Bit-rates typically 2 kbps -

4 kbps–

Parametric speech coding: High quality for coding of clean speech

–

Speed change / pitch change capability

•

CELP (Code Excited Linear Prediction)–

Bit-rates typically 6 kbps -

24 kbps

–

Very flexible configuration possibilities–

Support for 8 kHz and 16 kHz sampling


MPEG-4 Low Delay Audio Coding


MPEG-4 Version 2 Low Delay Audio Coding

Target:•

High audio and speech quality and

•

Low bitrate

and Low algorithmic delay (20 ms)

Solution:•

MPEG-4 Version 2 Low Delay Audio Coder:

–

Derived from MPEG-2/4 "Advanced Audio Coding" (AAC)

–

Specific modifications for low-delay operation


Delay Sources in Perceptual Audio Coding

•

Framing delay•

Filter bank delay

•

Look-ahead delay for block switching•

Use of bit reservoir

Overall delay:


Example: Delay of AAC Codec (48 kHz / 64 kbps)

•

Framing delay : 1024 samples•

Filter bank delay : 1024 samples

•

Look-ahead delay for block switching : 576 samples

•

Use of bit reservoir : 74.7 ms

Overall delay:


Low Delay AAC Codec (48 kHz, min. delay mode)

•

Reduced filter bank delay : 959 samples•

No block switching no look-ahead delay: 0 samples

•

Minimal bit reservoir : 0...32 bits

Overall delay:


Delay vs. Bitrate


Preecho

Behavior


Preecho

Reduction by Window Shape Adaptation


Test Results -

Comparison to "MP3"


SummaryThe MPEG-4 V2 low-delay coder provides

•

High audio quality for music and speech•

Algorithmic delay of 20 ms enables two-way communications

•

Audio quality scales with bitrate•

Stereo and multi-channel capabilities (inherited from MPEG-2/4 AAC)

•

Compares to well w.r.t. MP3•

Low computational & memory complexity

•

Error robustness

Documents

State of the Art in Perceptual Coding: MPEG-2/4 Advanced ... › fileadmin › media › mt › lehre › ... · Huffman coding of scalefactors and spectral coefficients. Prof. Dr