40
Dr. Jürgen Herre 11/07 Page 1 Fraunhofer Institut Integrierte Schaltungen Efficient Representation of Sound Images: Recent Developments in Parametric Coding of Spatial Audio Jürgen Herre Fraunhofer Institut für Integrierte Schaltungen (IIS) Erlangen, Germany

Efficient Representation of Sound Images: Recent Developments in Parametric … · 2007-11-12 · Dr. Jürgen Herre 11/07 Page 1 Fraunhofer Institut Integrierte Schaltungen Efficient

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Dr. Jürgen Herre 11/07 Page 1

Fraunhofer InstitutIntegrierte Schaltungen

Efficient Representation of Sound Images:

Recent Developments in Parametric Coding

of Spatial Audio

Jürgen HerreFraunhofer Institut für Integrierte Schaltungen (IIS)

Erlangen, Germany

Dr. Jürgen Herre 11/07 Page 2

Fraunhofer InstitutIntegrierte Schaltungen

Introduction: Sound Images … ?

• Humans live in a world of sound

• … continuously listen to the soundsurrounding us

• … perceive the acoustic waves reachingthe ears and interpret them

• … (re)construct an acoustic scene

• In analogy to visual perception, we form a“sound image” . . .

This talk: How to efficiently represent and reproduce sound images!

Dr. Jürgen Herre 11/07 Page 3

Fraunhofer InstitutIntegrierte Schaltungen

Part I

Part II

Part III

Overview

• What constitutes a “sound image”?

• Efficient coding of spatial sound images– Perceptual audio coding– Coding of multi-channel / surround sound

– “Spatial Audio Coding” & MPEG Surround

• Next-generation interactive coding /rendering of sound images:“Spatial Audio Object Coding”

– Demonstration

Dr. Jürgen Herre 11/07 Page 4

Fraunhofer InstitutIntegrierte Schaltungen

Part I:

What Constitutes A “Sound Image”?

Dr. Jürgen Herre 11/07 Page 5

Fraunhofer InstitutIntegrierte Schaltungen

Spatial Sound Perception

• We are permanently listening to soundthrough our two sensors (ears)

• Ears receive a complex sound mixture of– direct sound that is radiated from several

sources (frequently concurrently), and

– many reflections from our acousticenvironment (“ambient sound”)

The Physics Part

Dr. Jürgen Herre 11/07 Page 6

Fraunhofer InstitutIntegrierte Schaltungen

Spatial Sound Perception (2)

• Humans interpret sound to make sense of it– Map it to an internal sound scene which is a

perceptual correlate of the actual physicalscene (not necessarily identical!)

– Very complex & sophisticated process -by far not fully understood!

• Psychoacoustics [Zwicker, Moore, Fletcher, Blauert, …]

• Perceptual streaming, auditory scene analysis[Bregman, …]

– How to form a consistent interpretation ofthe acoustic world from the stream ofreceived information fragments

The Perceptual Part

Some Key Words

Dr. Jürgen Herre 11/07 Page 7

Fraunhofer InstitutIntegrierte Schaltungen

Spatial Sound Perception (3)

• In the following, we will look at thephenomenon of spatial audio perception ina grossly simplified, yet illustrative and veryuseful way …

• Allows immediate technical application toefficient coding of spatial sound

• Highlight some analogies between “soundimages” and “visual images”

• The Human Auditory System processessound in a frequency selective way(roughly logarithmic, ca. 1/3 octave→ BARK and ERB scales)

FrequencySelectivity

A Note of Caution

Dr. Jürgen Herre 11/07 Page 8

Fraunhofer InstitutIntegrierte Schaltungen

L R

Spatial Sound Perception (4)

• Complex process, mostly driven byinteraural signal differences

• Depending on its position, a sound sourceproduces differences in the ear signals

• Most important perceptual cues containedin the ear signals: e.g. [Blauert 1984]

– Interaural Level Differences (ILD)

– Interaural Phase/Time Differences (IPD/ITD)(relevant up to ca. 4 kHz frequency)

– Temporal signal envelope (high frequencies)

⇒ ILD, IPD/ITD determine perceived lateral position of sound sources!

Sound Localization

Dr. Jürgen Herre 11/07 Page 9

Fraunhofer InstitutIntegrierte Schaltungen

Spatial Sound Perception (5)

• In realistic closed rooms, reflections fromthe walls (and other objects) occur

• These blur the simple ILD, IPD/ITD relationsby decorrelating the ear signals

⇒Decorrelated sound at the two earsrepresents reflections and thus providesinformation on the acoustic environment

– Distinct early reflections → nearby walls

– Late reverberation → statistical summary ofacoustic environment / room properties

⇒ Auditory system needs to “parse” sound into direct & ambient sound!

Dr. Jürgen Herre 11/07 Page 10

Fraunhofer InstitutIntegrierte Schaltungen

Spatial Sound Perception (6)

• Important role of Interaural Coherence (IC):Determines spatial extent (=width) of soundevent [Schuijers03] [Faller03]

• Perceived width of auditory event increasesas coherence decreases (1 → 4);eventually separates into 2 distinct events

• Max. of normalized cross-correlation:

!

IC =maxd

e1(n) " e

2(n + d)

n=#$

$

%

e1

2(n)

n=#$

$

% " e2

2(n + d)

n=#$

$

%[Faller 2004]

e1 e2

⇒ Key to source width and source/ambience perception!

Dr. Jürgen Herre 11/07 Page 11

Fraunhofer InstitutIntegrierte Schaltungen

Some A↔V Analogies in Scene Attributes

Visual Domain Auditory Domain (→ auditory cues)

Foreground object Sound sources (→ high IC) - directional

Object position - Object position (→ ILD/ITD/IPD for lat. position)

Object size - Object size (→ IC)

Background Ambience (→ low IC) - non-directional

Dr. Jürgen Herre 11/07 Page 12

Fraunhofer InstitutIntegrierte Schaltungen

Part II:

Efficient Coding Of Spatial Sound Images

Dr. Jürgen Herre 11/07 Page 13

Fraunhofer InstitutIntegrierte Schaltungen

Basics of Audio Coding

• Represent audio data as compactly aspossible while maintaining sound quality(ideally: “transparent” coding)

• Concept of perceptual audio coding

– Optimize subjective quality rather thanobjective distortion metrics (e.g. MSE/SNR)

– Use knowledge about signal receiver:Psychoacoustics gives limits of perception

– Keep coding distortion below limits!

– No universal source model available(unlike in speech coding)

Goal

Predominant Approach

Dr. Jürgen Herre 11/07 Page 14

Fraunhofer InstitutIntegrierte Schaltungen

Psychoacoustics

L

0

20

40

60

dB

80

0,02 0,05 0,1 0,2 0,5 1 2 5 10 20 kHz

Tf

Threshold

in quiet

T

Masking

Threshold

Inaudible

Signal

Masker

Masked

Sound

Dr. Jürgen Herre 11/07 Page 15

Fraunhofer InstitutIntegrierte Schaltungen

Demonstration: The "13 dB Miracle"

• Original signal

• Original + white noise, SNR = 13,6 dB

• Original + noise at threshold, SNR = 13,6 dB

• Difference signal: White noise

• Difference signal: Noise at threshold

Historic demonstration by James D. Johnston and Karlheinz Brandenburg at AT&T Bell Laboratories in 1990 using the best psychoacoustic modelavailable at that time ...

⇒ SNR does not adequately describe subjective sound quality!⇒ Putting psychoacoustics to work makes a huge difference!

Dr. Jürgen Herre 11/07 Page 16

Fraunhofer InstitutIntegrierte Schaltungen

Basic Paradigm of (Monophonic) Perceptual Audio Coding

Quantization & Coding

Encoding of Bitstream

Analysis Filterbank

Perceptual Model

bitstream

out

audio

in

Inverse Quantization

Synthesis Filterbank

audio

out

bitstream

in

Decoding of Bitstream

Dr. Jürgen Herre 11/07 Page 17

Fraunhofer InstitutIntegrierte Schaltungen

A Real Audio Coder (MPEG-2 AAC, 1997)In

pu

t ti

me

sig

nal In

ten

sity

/C

ou

pli

ng

BitstreamMultiplexer

Perceptual Model

GainControl

FilterBank

TNS M/S

Rate/Distortion Control

Nois

ele

ss

Codin

g

Bitstream Output

Scale

Facto

rs

Pre

dic

tion

Quant.

Dr. Jürgen Herre 11/07 Page 18

Fraunhofer InstitutIntegrierte Schaltungen

The Better Spatial Sound Image:Surround Sound / Multi-Channel Audio

• Significantly increased spatial realism over stereo, envelopment• Origins in movie sound (5.1); now also for music, broadcasting• Increasingly adopted in consumers’ homes

Dr. Jürgen Herre 11/07 Page 19

Fraunhofer InstitutIntegrierte Schaltungen

Traditional Delivery Formats For Surround Sound

• Downmix of 5.1 sound into stereo signal,upmix at the receiver side

– Efficient in terms of transmission bandwidth(same bitrate as stereo)

– Backward compatible to stereo delivery

– Limited computation necessary– Significant loss in subjective audio quality

• Separate transmission of each channel– Significantly higher bitrate than stereo– Moderate amount of computation– High subjective audio quality possible

Matrixed Surround(Prologic, Neo6, …)

Discrete Surround(AAC, AC-3, …)

Dr. Jürgen Herre 11/07 Page 20

Fraunhofer InstitutIntegrierte Schaltungen

A Major Step Ahead: “Spatial Audio Coding”

• Rather recent development

• Compression efficiency:Transmits multi-channel audio at bitratesused for 2-channel stereo (or even mono)

• Backward compatibility:SAC multi-channel audio is coded in abackward compatible way⇒ existing infrastructures can be seamlesslyupgraded to multi-channel / surround!

• High subjective audio quality

Heavily based on exploiting perception rather than waveform coding!

Main Characteristics

Dr. Jürgen Herre 11/07 Page 21

Fraunhofer InstitutIntegrierte Schaltungen

The Spatial Audio Coding (SAC) Concept“Spatial Audio Coding” = Downmix + Parametric Spatial Synthesis

Extracts & reproduces Inter-Channel Counterparts of Inter-Aural Parameters

e.g. ICLD, ICTD/ICPD, ICC

Dr. Jürgen Herre 11/07 Page 22

Fraunhofer InstitutIntegrierte Schaltungen

Related Technology

• Binaural Cue CodingParametric Coding of Multi-Channel into aMono Channel

• Parametric StereoParametric Coding of 2-Channel Stereo intoa Mono Channel

• Matrix Surround(Dolby Prologic, Logic 7, Circle Surround …)

• MP3 SurroundBackward compatible extension of stereoMP3 towards 5.1 surround sound at 128 -192kbit/s

Generalization /Extension of

First commercialApplication (2003)

Dr. Jürgen Herre 11/07 Page 23

Fraunhofer InstitutIntegrierte Schaltungen

Principle: Surround → Stereo + Side Information

Example: MP3 Surround Encoding

[Herre et al. 2004]

L

R

C

Ls

Rs

CompatibleDownmix

StandardMP3

Encoder

BCCEncoder

Lc

Rc

MP3 Surround

Bitstream

Surround

Enhancement

AncillaryData

SurroundEncoder

Dr. Jürgen Herre 11/07 Page 24

Fraunhofer InstitutIntegrierte Schaltungen

Stereo Playback(simple decoder)

Example: MP3 Surround DecodingPrinciple: Stereo + Side Information → Surround

L

R

C

Ls

Rs

Surround Enhancement

AncillaryData

AdvancedMulti-

ChannelDecoding

SurroundDecoder

Standard

MP3

Decoder

Lc

Rc

MP3 Surround

Bitstream

[Herre et al. 2004]

Dr. Jürgen Herre 11/07 Page 25

Fraunhofer InstitutIntegrierte Schaltungen

MPEG Spatial Audio Coding / MPEG Surround

• Work item “Spatial Audio Coding” (SAC)• Technical development 3/2004 - 7/2006• Main contributors: Fraunhofer IIS, Agere

Systems, Coding Technologies and Philips• Renamed into MPEG Surround (MPEG-D)

• Efficient & backward compatible upgradeof audio distribution to multi-channel, e.g.:

– Music download service

– Multi-channel streaming / Internet radio

– Digital audio broadcasting

Applications:

Dr. Jürgen Herre 11/07 Page 26

Fraunhofer InstitutIntegrierte Schaltungen

hybrid analysisInput 1

hybrid analysisInput 2

Spatial

synthesis

Output 1hybrid synthesis

hybrid synthesisOutput 2

Spatial parameters

hybrid synthesisOutput N

Decoder Structure

MPEG Surround Synthesis: General Concept

• ”hybrid” = QMF filterbank + 2nd stage → non-uniform frequencyresolution relating to frequ. resolution of human auditory system

• Same QMF filterbank as in MPEG-4 HE-AAC (AAC + SBR)

Dr. Jürgen Herre 11/07 Page 27

Fraunhofer InstitutIntegrierte Schaltungen

MPEG Surround Synthesis: General Concept (2)

• Frequency selective processing: Dynamic matrix + decorrelation

• Many more details, please see dedicated publications!• Side information rate typ. 3 - 32 kbit/s

Dr. Jürgen Herre 11/07 Page 28

Fraunhofer InstitutIntegrierte Schaltungen

Underlying Idea: Hierarchical En/decoding

Encoder Decoder“5-2-5 Coding”

OTT

encoder

element

OTT

encoder

element

OTT

encoder

element

TTT

encoder

element

l0

r0

lf

lb

rf

rb

c

lfe

Spatial parameters

OTT

decoder

element

OTT

decoder

element

OTT

decoder

element

TTT

decoder

element

l0

r0

lf

lb

rf

rb

c

lfe

Spatial parameters

Dr. Jürgen Herre 11/07 Page 29

Fraunhofer InstitutIntegrierte Schaltungen

MPEG Surround Concepts

• Similar trees for mono-based operation(“5-1-5”), other modes (“7-2-7”, “7-5-7”) …

• Channel Level Differences (CLDs)• Inter-Channel Correlations (ICCs)• Channel Prediction Coefficients (CPCs)• Prediction errors (residuals)

• Decorrelation by QMF-domain all-pass filters• Several tools for handling fine temporal

envelope structure (both without and withadditional side information)

Generalization

Spatial Parameters

Other Aspects

Dr. Jürgen Herre 11/07 Page 30

Fraunhofer InstitutIntegrierte Schaltungen

MPEG Surround: Additional Functionalities

• Externally created downmixes can be used

• Stereo downmix can be made compatiblewith common matrix surround decoders

• MPEG Surround decoder can decode matrixsurround signal (i.e. work without side info)

• Downmix can be generated as virtualsurround, or MPEG Surround can bedecoded directly into virtual surround veryefficiently

Artistic Downmix

Matrix SurroundCompatibility

Enhanced MatrixMode

Binaural Renderingfor headphone

⇒ Rich set of attractive features for practical application

Dr. Jürgen Herre 11/07 Page 31

Fraunhofer InstitutIntegrierte Schaltungen

MPEG Surround: Recent Verification Test

“Music-Store” test scenario: Stereo downmix coded using AAC@160kbit

MPEG Surround

MPEG Surroundwithout side-info

Matrixed Surround(Dolby Prologic II)

Dr. Jürgen Herre 11/07 Page 32

Fraunhofer InstitutIntegrierte Schaltungen

Part III:

Next Generation Interactive Coding /Rendering of Sound Images

From Spatial Audio Coding to Spatial Audio Object Coding

Dr. Jürgen Herre 11/07 Page 33

Fraunhofer InstitutIntegrierte Schaltungen

hierarchically multiplexed

downstream control / data

hierarchically multiplexed

upstream control / data

audiovisualpresentation

3D objects

2D background

voice

sprite

hypothetical viewer

projection

videocompositor

plane

audio

compositor

scene

coordinatesystem

x

y

zuser events

audiovisual objects

speakerdisplay

user input

scene

globe desk

person audiovisual

presentation

2D background furniture

voice sprite

Hierarchy of objects

Classic MPEG-4 Interactive Scene Composition (1996ff)

Scene is composed of multiple A/V objects and can be rendered interactively

Dr. Jürgen Herre 11/07 Page 34

Fraunhofer InstitutIntegrierte Schaltungen

MPEG-4 Object Based (De)Coding

Discrete approach comes at a rather high price:• Bitrate and decoding complexity grow with number of objects• Structural complexity

Dr. Jürgen Herre 11/07 Page 35

Fraunhofer InstitutIntegrierte Schaltungen

From Spatial Audio Coding (SAC) to SAOCRegular Spatial Audio Coding: Channel-oriented scheme(MPEG Surround)

Chan. #1Chan. #2Chan. #3Chan. #4 . . .

Downmixsignal(s)SAC

EncoderSideInfo

SACDecoder

Chan. #1Chan. #2Chan. #3Chan. #4 . . .

Dr. Jürgen Herre 11/07 Page 36

Fraunhofer InstitutIntegrierte Schaltungen

Alternative: Object-oriented Spatial Audio Coding

Obj. #1Obj. #2Obj. #3Obj. #4 . . .

Downmixsignal(s)SAOC

EncoderSideInfo

SAOCDecoder

Chan. #1Chan. #2 . . .

Renderer

User Interaction / Ctrl.

obj. #1

obj. #2

obj. #3

obj. #4

. . .

From SAC to SAOC (2)

• Processes object signals instead of channel signals• “Mixing”/rendering parameters vary according to user interaction• Combined obj. decoding & rendering ⇒ computationally efficient!• Previous work by Faller & Baumgarte [2001ff] and Faller [2006]

Dr. Jürgen Herre 11/07 Page 37

Fraunhofer InstitutIntegrierte Schaltungen

Real-time interactive rendering of audio objects from a mono audio downmix + SAOC side information

Dr. Jürgen Herre 11/07 Page 38

Fraunhofer InstitutIntegrierte Schaltungen

New MPEG Standardization Activities

• Work on “Spatial Audio Object Coding” (SAOC) started• Transcoding approach: “SAOC” + rendering info → MPEG Surround

• Reference model and working draft recently established (10/2007)

SAOC Decoder

Dr. Jürgen Herre 11/07 Page 39

Fraunhofer InstitutIntegrierte Schaltungen

Conclusions

• “Sound Images” carry some analogy to images in the visual world

• Spatial Audio Coding schemes code surround sound based onperception (rather than on waveform match)

– “Object positions” are represented by perceptual spatial parameters– “Audio Object Texture” is coded using regular mono/stereo coder

• Such schemes can bring surround sound into existing infrastructures– High compression factor (surround sound at 64kbps and below!)– Stereo / mono backward compatibility

• Extension towards efficient interactive, object-based scene coding /rendering (Spatial Audio Object Coding) is currently on its way …

Dr. Jürgen Herre 11/07 Page 40

Fraunhofer InstitutIntegrierte Schaltungen