MPEG Spatial Audio Object Coding (SAOC)

Prof. Dr.-Ing. Gerald Schuller
Fraunhofer IDMT & Ilmenau Technical University, Ilmenau, Germany

Prof. Dr.-Ing. K. Brandenburg, Dr.-Ing. G. Schuller

Overview

• Concept

• MPEG Surround integration

• Advantages of SAOC

• Applications

• Conclusion

Concept: From MPEG Surround to SAOC (1)

Current Spatial Audio Coding: Channel-oriented (MPEG Surround)

[Figure: Chan. #1, #2, #3, #4, ... → SAC Encoder → Downmix signal(s) + Side Info → SAC Decoder → Chan. #1, #2, #3, #4, ...]

Concept: From MPEG Surround to SAOC (2)

Object-oriented Spatial Audio Coding

[Figure: Obj. #1, #2, #3, #4, ... → SAOC Encoder → Downmix signal(s) + Side Info → SAOC Decoder → obj. #1, #2, #3, #4, ... → Renderer (with Interaction/Control input) → Chan. #1, #2, ...]

• Processes object signals instead of channel signals

• Side Info: few kbit/s per audio object

• Mono or stereo downmix

• “Mixing”/rendering parameters vary according to real-time (RT) user interaction
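The encoder/decoder/renderer chain on this slide can be illustrated with a toy sketch. It is not the algorithm from the SAOC standard: it works on the whole signal instead of per time/frequency tile, assumes a mono downmix, and the names `saoc_encode` and `saoc_render` are hypothetical. It only shows the principle that the decoder sees one downmix plus a small amount of per-object side information (here: object powers) and applies a user-controlled rendering matrix.

```python
def saoc_encode(objects):
    """Toy encoder: objects is a list of equal-length sample lists.

    Returns a mono downmix and one power value per object (the side info).
    """
    n = len(objects[0])
    downmix = [sum(obj[i] for obj in objects) for i in range(n)]
    powers = [sum(x * x for x in obj) / n for obj in objects]
    return downmix, powers


def saoc_render(downmix, powers, render_matrix):
    """Toy decoder/renderer.

    Each object is approximated as its power share of the downmix.
    render_matrix[c][k] is the user-chosen gain of object k in output
    channel c (an assumption of this sketch, set by the interaction/control
    input).
    """
    total = sum(powers) or 1.0
    shares = [p / total for p in powers]
    outputs = []
    for row in render_matrix:
        weight = sum(gain * share for gain, share in zip(row, shares))
        outputs.append([weight * x for x in downmix])
    return outputs
```

In this sketch a karaoke-style remix would set the gain of the vocal object to 0 in every row of `render_matrix`; a real system would do the same per frequency band and time frame, which is what makes the separation usable.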

MPEG Surround integration/extension

[Figure: Obj. #1, #2, #3, #4, ... → SAOC Encoder → Downmix signal(s) + SAOC Bitstream → Combined Decoder: SAOC Transcoder (with Interaction/Control input) → Downmix signal(s) + MPS Bitstream → MPEG Surround Decoder → Chan. #1, #2, ...]

• MPEG SAOC decoder = MPEG SAOC Transcoder + MPEG Surround decoder

Advantages using MPEG SAOC (1)

Key features

• Highly efficient storage/transport of individual audio objects ...

• ... in a backwards compatible downmix

• User interactive rendering of the audio objects (e.g. move or amplify objects)

• Flexible rendering configurations (e.g. 2.0, 5.1, binaural, ...)

Advantages using MPEG SAOC (2)

Other features

• Low complexity decoding/rendering for a large number of objects compared with individually encoded and rendered objects

• Compatible with any core codec (for the downmix)

• Powerful rendering engine (= MPEG Surround) integrated, no additional solution required

Applications (1)

• Interactive Remix / Karaoke

Examples:

– Suppress / attenuate instruments or vocals (Karaoke)

– Modify the original track to reflect current preference (e.g. “more drums & less strings” for a dance party)

– Choose between different vocal tracks (“female lead vocal vs. male lead vocal”)

– Control the dialog/speech level in movies/news broadcasts for better speech intelligibility

Main feature:

• Backwards compatibility

Applications (2)

• Gaming / Rich Media

Examples:

– Efficient and flexible audio transport in multi-player games or applications (e.g. Second Life)

– Efficient storage together with flexible rendering of audio in small interactive games

Main feature:

• Storage / Bitrate Efficiency

Applications (3)

• Teleconferencing

Examples:

– Mobile conference over headphones: Virtual 3D-audio line-up of communication partners all around the listener

– Conference setup with 2 or more loudspeakers: Spatial distribution of communication partners

Main feature:

• Quality Improvement:

– Increased speech intelligibility

– Increased listening comfort
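The loudspeaker example (spatial distribution of communication partners) can be sketched for the two-loudspeaker case. This is a plain constant-power panning law chosen for illustration, not the rendering defined by SAOC or MPEG Surround; the function name `talker_pan_gains` is hypothetical.

```python
import math


def talker_pan_gains(n_talkers):
    """Spread n_talkers evenly from left to right over two loudspeakers.

    Returns one (left_gain, right_gain) pair per talker. The gains follow
    a constant-power law, so left**2 + right**2 == 1 for every talker.
    """
    gains = []
    for i in range(n_talkers):
        # Position 0.0 is fully left, 1.0 is fully right.
        position = 0.5 if n_talkers == 1 else i / (n_talkers - 1)
        theta = position * math.pi / 2
        gains.append((math.cos(theta), math.sin(theta)))
    return gains
```

In an object-based system these gain pairs would form the rows of the rendering matrix, one object per talker, so each listener could rearrange the line-up locally.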

Conclusions SAOC

• Highly efficient transport/storage of audio objects and flexible/interactive audio scene rendering

• Backwards compatible downmix for reproduction on legacy devices

• Flexible rendering configurations

• Under standardization within MPEG

• Very interesting applications, e.g.:

– Remixing/Karaoke

– Gaming

– Teleconferencing

MPEG Parametric Surround

• Signal is decomposed into several bands (flexible configuration) outside the core coder

• For certain groups of bands, the Interaural Level Difference (ILD), the Interaural Time Difference (ITD) and a Coherence Value (the correlation, i.e. the concentration in space) are determined

• These parameters are used to generate the side information

• The down-mix is either a stereo signal or a mono signal

• The decoder uses the down-mix and the side information to generate surround sound which sounds “similar” to the original (psychoacoustics!)
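As a rough illustration of two of the parameters named above, the following sketch computes a level difference in dB and a normalized correlation (coherence) for one pair of band signals. It assumes the band signals are already available as real-valued sample lists; the filter bank, the time difference, the quantization of the parameters, and the function names are outside the scope of this sketch and are not taken from the standard.

```python
import math


def level_difference_db(left, right, eps=1e-12):
    """Level difference between two band signals in dB (positive: left is louder)."""
    power_left = sum(x * x for x in left)
    power_right = sum(x * x for x in right)
    # eps avoids log(0) and division by zero for silent bands.
    return 10.0 * math.log10((power_left + eps) / (power_right + eps))


def coherence(left, right, eps=1e-12):
    """Normalized cross-correlation at lag 0.

    Values near 1 mean a compact (point-like) source, values near 0 mean
    the two signals are uncorrelated (diffuse sound).
    """
    cross = sum(a * b for a, b in zip(left, right))
    power_left = sum(x * x for x in left)
    power_right = sum(x * x for x in right)
    return cross / math.sqrt(power_left * power_right + eps)
```

For a right channel that is simply the left channel at half amplitude, `level_difference_db` gives about 6 dB and `coherence` gives about 1, which is the case a decoder can reconstruct from a mono down-mix with very little side information.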

Universal Speech and Audio Coding (USAC)

• Problem:

– Speech coders are good at speech but not at music,

– Audio coders are good at music, but not at speech (speech is too non-stationary; the 1024-sample block size smears the quantization noise and makes speech sound reverberant)

• MPEG decided to tackle the problem

• Goal: to come up with a universal coder which handles speech and audio as well as the best speech or audio coder in that bit-rate range

Universal Speech and Audio Coding

• A competition was conducted by MPEG

• Winner of this competition was a joint submission by Fraunhofer IIS and VoiceAge Corp. in Canada

• Their submission was a combination of VoiceAge’s AMR-WB+ coder and Fraunhofer’s HE-AAC coder

• The bit-rate range for the competition was about 12 to 64 kb/s

• Target is mainly mobile devices (wireless phones, digital radio, ...)

Universal Speech and Audio Coding

• We already know HE-AAC

• But how does the VoiceAge coder work?

• Answer: It is based on CELP (Code Excited Linear Prediction)

• CELP is based on predictive coding, just as we saw for ULD or lossless predictive coding

• Here: usually prediction of order 12 (this was found to be sufficient to model the human vocal tract for speech production)
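The linear prediction step can be sketched with the autocorrelation method and the Levinson-Durbin recursion. This is a generic textbook formulation and not code from any particular speech codec; windowing, bandwidth expansion and coefficient quantization are left out, and the function names are chosen for this illustration.

```python
def lpc_coefficients(x, order=12):
    """Return ([1, a1, ..., a_order], prediction_error_energy).

    The prediction error filter is A(z) = 1 + a1*z^-1 + ... + a_order*z^-order.
    """
    # Autocorrelation values r[0..order] of the frame.
    r = [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(order + 1)]
    if r[0] <= 0.0:
        raise ValueError("frame has no energy")
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i of the Levinson-Durbin recursion.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a, err


def prediction_residual(x, a):
    """Filter x with A(z); samples before the start of x are taken as zero."""
    return [
        sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
        for n in range(len(x))
    ]
```

For a signal that really is the output of a low-order all-pole filter, the residual collapses to the excitation; for speech it becomes a much flatter signal, which is what the codebooks on the next slide encode.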

Universal Speech and Audio Coding

• The prediction residual is then encoded using codebook vectors, called the Code Excitation, drawn from a fixed codebook (innovation) and an adaptive codebook (past samples)

CELP (Code Excited Linear Prediction)

• Structure of the CELP decoder (from Wikipedia, CELP):

[Figure: CELP decoder block diagram. Labels: decoder prediction filter (usually order 12); constantly adapted delay]
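The decoder structure can be sketched as follows: the excitation is a gain-weighted sum of a fixed codebook vector and a delayed copy of the past excitation (the adaptive codebook, with the delay acting as the pitch lag), and this excitation drives the all-pole synthesis filter 1/A(z). This is a simplified sketch with integer delay and no postfilter; the function name and argument layout are assumptions made for this illustration.

```python
def celp_decode_frame(past_excitation, fixed_vector, lag,
                      gain_adaptive, gain_fixed, a, past_output):
    """Decode one frame.

    past_excitation: earlier excitation samples, most recent last (len >= lag)
    fixed_vector:    innovation vector from the fixed codebook
    lag:             adaptive codebook delay in samples (integer, >= 1)
    a:               [1, a1, ..., aP], coefficients of A(z)
    past_output:     earlier output samples, most recent last (len >= P)
    Returns (output_samples, excitation_samples).
    """
    history = list(past_excitation)
    excitation = []
    for innovation in fixed_vector:
        # Adaptive codebook: the excitation from `lag` samples ago.
        adaptive = history[-lag]
        e = gain_adaptive * adaptive + gain_fixed * innovation
        excitation.append(e)
        history.append(e)

    memory = list(past_output)
    output = []
    for e in excitation:
        # Synthesis filter 1/A(z): s[n] = e[n] - sum_j a_j * s[n-j].
        s = e - sum(a[j] * memory[-j] for j in range(1, len(a)))
        output.append(s)
        memory.append(s)
    return output, excitation
```

An encoder would run this same decoder inside its search loop (analysis by synthesis) and pick the codebook entries and gains whose output is perceptually closest to the input frame.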

Universal Speech and Audio Coding

• ACELP (Algebraic CELP): The codebook is not explicitly stored, but algebraically described by pulses and their distances to the next pulses

• AMR: VoiceAge Speech Coder (for instance for 3GPP), for about 4.75 to 12.2 kb/s

• AMR-WB: Wideband Extension (up to 7 kHz bandwidth), 6.6 to 23.85 kb/s

• AMR-WB+: Used for the MPEG submission, has a transform coding kernel in it too, to obtain higher bandwidth and bit rates up to about 32 kb/s

Source: IEEE Transactions on Speech and Audio Processing, Bessette et al., 2002
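The idea of an algebraic codebook can be sketched as follows: instead of looking a vector up in a stored table, the decoder builds it from a few signed unit pulses whose positions are restricted to interleaved tracks. The track layout used here (one pulse per track, positions `track + slot * n_tracks`) is a common textbook arrangement and an assumption of this sketch, not the exact layout of AMR, AMR-WB or AMR-WB+.

```python
def algebraic_codevector(length, n_tracks, pulses):
    """Build a sparse codevector from one (slot, sign) pair per track.

    Track t may only hold pulses at positions t, t + n_tracks, t + 2*n_tracks, ...
    sign is +1 or -1. Nothing but the slot indices and signs has to be
    transmitted, so no codebook table is stored.
    """
    if len(pulses) != n_tracks:
        raise ValueError("expected one pulse per track")
    vector = [0] * length
    for track, (slot, sign) in enumerate(pulses):
        position = track + slot * n_tracks
        if position >= length:
            raise ValueError("pulse position outside the subframe")
        vector[position] += sign
    return vector
```

With a 40-sample subframe and 5 tracks, each track has 8 possible slots, so each pulse costs 3 bits for the position plus 1 bit for the sign in this sketch.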

Universal Speech and Audio Coding

• AMR-WB+ has a transform-based mode called TCX, which is based on an FFT (not an MDCT)

• The TCX mode is switchable: the audio stream is divided into 80 ms “super frames”, each of which consists of two 40 ms frames, and each 40 ms frame consists of two 20 ms frames

• For each 20 ms frame it is decided whether ACELP or TCX is used

• For TCX it is decided whether it is applied to frames of 20 ms, 40 ms, or 80 ms, to obtain different numbers of subbands

Source: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005, Bessette et al.
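The frame structure described above implies a finite set of coding configurations per super frame. The following sketch enumerates them under the rules stated on this slide (each 20 ms frame is ACELP or TCX, and TCX may cover 20, 40 or 80 ms aligned to the frame hierarchy). The mode labels `TCX20`, `TCX40`, `TCX80` and the function names are shorthand introduced for this illustration.

```python
from itertools import product

DURATION_MS = {"ACELP": 20, "TCX20": 20, "TCX40": 40, "TCX80": 80}


def half_frame_options():
    """All ways to code one 40 ms frame."""
    # Either each 20 ms frame is coded separately ...
    options = [list(pair) for pair in product(["ACELP", "TCX20"], repeat=2)]
    # ... or one TCX block covers the whole 40 ms.
    options.append(["TCX40"])
    return options


def superframe_configurations():
    """All ways to code one 80 ms super frame."""
    halves = half_frame_options()
    configs = [first + second for first, second in product(halves, halves)]
    # One TCX block covering the whole 80 ms.
    configs.append(["TCX80"])
    return configs
```

Under these rules there are 5 options per 40 ms frame and therefore 5 * 5 + 1 = 26 configurations per super frame, which an encoder can search either exhaustively or with a faster open-loop decision.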

Universal Speech and Audio Coding (USAC)

• USAC combines AMR-WB+ with HE-AAC

• An important component is a suitable switch between them, such that for the current audio signal the suitable coder is selected

• Some integration between subband coding modes in AMR-WB+ and HE-AAC

Universal Speech and Audio Coding

• Tests showed: the resulting codec is indeed at least as good as a virtual coder, which is the best of either HE-AAC or AMR-WB+ (which was a requirement)

• It was tested on speech, audio, and mixed speech and audio (the latter being the most difficult)

• That showed that the goal was reached
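The “virtual coder” requirement can be written down compactly. The sketch below assumes one quality score per test item and per codec (for example a mean listening-test score, higher is better); the function names and the optional `tolerance` argument are assumptions of this illustration, and a real evaluation would also take confidence intervals into account.

```python
def virtual_coder_scores(scores_he_aac, scores_amr_wb_plus):
    """Per test item, the virtual coder scores as the better of the two codecs."""
    return [max(a, b) for a, b in zip(scores_he_aac, scores_amr_wb_plus)]


def meets_requirement(scores_candidate, scores_he_aac, scores_amr_wb_plus,
                      tolerance=0.0):
    """True if the candidate is at least as good as the virtual coder on every item."""
    reference = virtual_coder_scores(scores_he_aac, scores_amr_wb_plus)
    return all(c >= r - tolerance for c, r in zip(scores_candidate, reference))
```

The point of this definition is that the reference changes from item to item: on speech items the candidate is measured against the speech coder, on music items against the audio coder, and on mixed items against whichever happens to be better.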