Basics of Audio Signal Processing

Sudhir K

Summary Slide

• Digital representation of audio
• Psycho-acoustic principles
• Lossy compression of audio (MP3 and AAC)
• Lossless compression of audio (general principles with example)

Digital Representation of Audio

PCM data: the audio input is sampled at discrete intervals, and each sample is quantized to one of a discrete number of evenly spaced levels. A PCM stream is described by:
• Sampling frequency
• Bits per sample
• Number of channels
• Interleaved or block format

Audio CD: 44.1 kHz, 16 bits per sample, 2 channels. The data rate is about 1.4 Mbit/s (44,100 × 16 × 2 = 1,411,200 bit/s).

Signal chain: ADC → digital processing → DAC → speakers
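The CD data rate is simply sampling frequency × bits per sample × number of channels. The minimal sketch below computes it. It also converts between the interleaved layout, where samples of all channels alternate, and the block layout, where each channel is one contiguous run. The function names are illustrative only.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Raw (uncompressed) PCM data rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def deinterleave(samples, channels):
    """Interleaved frames (L0 R0 L1 R1 ...) -> one list per channel (block format)."""
    return [samples[c::channels] for c in range(channels)]


def interleave(blocks):
    """Block format (one list per channel) -> interleaved frames."""
    return [s for frame in zip(*blocks) for s in frame]


print(pcm_bit_rate(44_100, 16, 2))  # 1411200, i.e. about 1.4 Mbit/s
left, right = deinterleave([10, -10, 11, -11, 12, -12], 2)
print(left, right)                  # [10, 11, 12] [-10, -11, -12]
print(interleave([left, right]))    # [10, -10, 11, -11, 12, -12]
```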

Psycho-Acoustic Principles

• Sound pressure level
• Perceptual and statistical redundancy
• Absolute threshold of hearing
• Critical bands
• Masking in the time domain
• Masking in the frequency domain
• Perceptual entropy
• Pre-echo effect
• Psycho-acoustic model 1
• Psycho-acoustic model 2
• Filter banks and transforms

Sound Pressure Level

• Standard metric to quantify the intensity of an acoustical stimulus.
• Measured in decibels (dB) relative to an internationally defined reference level:

$$L_{SPL} = 20\log_{10}\left(\frac{p}{p_0}\right)\ \text{dB}$$

  where $L_{SPL}$ is the SPL of a stimulus with sound pressure $p$, and $p_0 = 20\ \mu\text{Pa}$ is the standard reference level.
• The dynamic range of the human auditory system is about 150 dB SPL.
• About 140 dB SPL is typically the threshold of pain.
• The human auditory system can hear frequencies ranging from about 20 Hz to 20 kHz.
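To make the decibel scale concrete, here is a minimal sketch that evaluates the formula above for a few pressures. For example, a pressure of 1 Pa corresponds to roughly 94 dB SPL. The helper names are illustrative.

```python
import math

P0 = 20e-6  # reference pressure p0 = 20 µPa


def spl_db(pressure_pa):
    """Sound pressure level L_SPL = 20*log10(p/p0) in dB."""
    return 20 * math.log10(pressure_pa / P0)


def pressure_from_spl(level_db):
    """Inverse: pressure in pascals for a given SPL."""
    return P0 * 10 ** (level_db / 20)


print(round(spl_db(20e-6), 6))  # 0.0: the reference level itself
print(round(spl_db(1.0), 1))    # 94.0: 1 Pa
print(pressure_from_spl(140))   # roughly 200 Pa near the threshold of pain
```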

Absolute Threshold of Hearing

• Characterizes the amount of energy needed in a pure tone for it to be detected by a listener in a noiseless environment.
• Can be interpreted naively as a maximum allowable energy level for coding distortions introduced in the frequency domain.
• The absolute threshold of hearing is a function of frequency: the response of the human ear to a pure tone depends on the frequency of the tone.
• Sensation level (SL) is the intensity level difference of a stimulus relative to its detection threshold, and it quantifies the listener's audibility. Components with equal SL can have different SPLs.
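A widely used closed-form fit to this curve is Terhardt's approximation, quoted for example in Painter and Spanias (see References). The sketch below simply evaluates it. Treat the result as an empirical approximation, not a measured threshold for any individual listener. The function name is illustrative.

```python
import math


def ath_db_spl(freq_hz):
    """Approximate absolute threshold of hearing T_q(f) in dB SPL
    (Terhardt's formula; f is converted to kHz first)."""
    f = freq_hz / 1000.0
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)


# The threshold is lowest (most sensitive hearing) around 3 to 4 kHz
for f in (100, 1000, 3300, 10000, 16000):
    print(f, round(ath_db_spl(f), 1))
```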

Absolute Threshold of Hearing

Human Ear Model

Frequency-to-place transformation:
• The sound wave moves the eardrum and the attached bones.
• The eardrum and the bones transfer the mechanical vibrations to the cochlea.
• Motion at the oval window of the cochlea induces traveling waves along the length of the basilar membrane.
• The traveling waves generate peak responses at frequency-specific membrane positions. Each position along the membrane therefore responds most strongly to a specific frequency band.
• The cochlea can thus be considered as a set of highly overlapping band-pass filters.

Critical Bands

• The cochlea can be considered as a set of highly overlapping band-pass filters.
• Critical bandwidth is a function of frequency that quantifies the cochlear filter bandwidth.
• Consider a narrowband noise at constant SPL. Its loudness (perceived intensity) remains the same as its bandwidth is increased, as long as the noise energy stays within one critical band.
• One bark corresponds to the distance of one critical band.
• Critical bandwidth remains roughly constant (about 100 Hz) up to 500 Hz. Above 500 Hz it increases to about 20% of the center frequency.

[Figure: critical bandwidth vs. frequency, 0 to 20 kHz]
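Both quantities have commonly used closed-form approximations due to Zwicker, as given for example in Painter and Spanias. The sketch below, assuming those formulas, maps frequency to Bark and estimates the critical bandwidth. Its output follows the trend described above: about 100 Hz wide at low frequencies, and on the order of 20% of the center frequency higher up.

```python
import math


def hz_to_bark(freq_hz):
    """Zwicker's approximation of the critical-band rate z(f) in Bark."""
    return 13 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500) ** 2)


def critical_bandwidth_hz(center_hz):
    """Zwicker's approximation of the critical bandwidth around center_hz."""
    return 25 + 75 * (1 + 1.4 * (center_hz / 1000) ** 2) ** 0.69


for f in (100, 500, 1000, 4000, 10000, 20000):
    print(f, round(hz_to_bark(f), 2), round(critical_bandwidth_hz(f)))
```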

Simultaneous Masking

• A process in which one sound (the maskee) is rendered inaudible by the presence of another sound (the masker).
• Masking in the frequency domain.
• Types: tone-masking-noise (TMN), noise-masking-tone (NMT) and noise-masking-noise (NMN).
• An in-band phenomenon (it occurs within the same critical band).

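Simultaneous Masking: SMR (Signal-to-Mask Ratio)

• SMR is the smallest difference between the intensity of the masking signal and the intensity of the masked signal at which the masked signal is still inaudible.
• Typical SMR is 26 dB for NMN, 24 dB for TMN and 5 dB for NMT.
• Noise is a better masker than a tone (the SMR for NMT is much lower than for TMN).
• Spread of masking: masking also acts across neighbouring critical bands (inter-band masking). This is usually modelled with a triangular spreading function.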
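A minimal sketch of the spread of masking on the Bark scale, under stated assumptions. It uses a triangular spreading function with slopes of 25 dB/Bark towards lower frequencies and 10 dB/Bark towards higher frequencies; these are illustrative values, not taken from this deck. It also uses the SMR as the offset between masker level and masked threshold. The function names are hypothetical.

```python
def triangular_spread_db(delta_bark, lower_slope=25.0, upper_slope=10.0):
    """Attenuation in dB (<= 0) of a masker's threshold at distance
    delta_bark = z_maskee - z_masker.  Assumed slopes: 25 dB/Bark below
    the masker, 10 dB/Bark above it."""
    if delta_bark < 0:
        return lower_slope * delta_bark   # below the masker: steep slope
    return -upper_slope * delta_bark      # above the masker: shallower slope


def masked_threshold_db(masker_level_db, masker_bark, maskee_bark, offset_db):
    """Threshold contributed by one masker at the maskee's position.
    offset_db is the masker-type-dependent gap (e.g. an SMR value above)."""
    return masker_level_db - offset_db + triangular_spread_db(maskee_bark - masker_bark)


# A 70 dB masker at 8 Bark with a 24 dB offset (tone masking noise)
for z in (6.0, 7.0, 8.0, 9.0, 12.0):
    print(z, masked_threshold_db(70.0, 8.0, z, 24.0))
```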

Temporal (Non-simultaneous) Masking

• Masking in the time domain.
• Pre-masking: sounds occurring just before the onset of the masker are masked.
• Post-masking: masking continues after the masker has stopped.
• Pre-masking is short (approx. 1 to 2 ms). Post-masking lasts longer (50 to 300 ms).

Just Noticeable Difference (JND)

• Also called the global masking threshold.
• The global masking threshold is a combination of the individual masking thresholds: the thresholds due to NMT and TMN, and the absolute threshold.
• Quantization noise should be kept below the JND to keep it inaudible.

[Figure: masking curve with the signal and quantization noise spectra]

Perceptual Entropy

• A measure of perceptually relevant information.
• Expressed in bits per sample.
• Represents a theoretical limit on the compressibility of a particular signal.

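Pre-Echo

• Pre-echoes occur when a signal with a sharp attack begins near the end of a transform block, immediately following a region of low energy.
• After inverse quantization and the inverse transform, the quantization noise spreads evenly throughout the reconstructed block. It therefore becomes audible in the quiet region before the attack, where the short pre-masking cannot hide it.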
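The effect is easy to reproduce numerically. The sketch below is an illustration only, with two assumptions: it uses a plain orthonormal DCT-II block transform in place of a codec's MDCT, and a single uniform quantizer step. It codes a block that is silent until sample 48 and measures how much quantization noise ends up in the silent part.

```python
import math


def dct2(x):
    """Orthonormal DCT-II."""
    n = len(x)
    return [(math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n))
            * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
            for k in range(n)]


def idct2(c):
    """Inverse of the orthonormal DCT-II (an orthonormal DCT-III)."""
    n = len(c)
    return [sum((math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)) * c[k]
                * math.cos(math.pi * (i + 0.5) * k / n) for k in range(n))
            for i in range(n)]


N, ATTACK = 64, 48
# Silence followed by a sharp attack near the end of the block
block = [0.0] * ATTACK + [math.sin(0.9 * i) for i in range(N - ATTACK)]

step = 0.5  # coarse uniform quantizer step for the transform coefficients
coeffs = [step * round(c / step) for c in dct2(block)]
decoded = idct2(coeffs)
error = [d - x for d, x in zip(decoded, block)]

pre_noise = sum(e * e for e in error[:ATTACK]) / ATTACK
post_noise = sum(e * e for e in error[ATTACK:]) / (N - ATTACK)
print(f"noise power before attack: {pre_noise:.4f}, after: {post_noise:.4f}")
```

The noise before the attack is not masked by the (silent) signal there, which is exactly the pre-echo.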

Pre-Echo Control

• Bit reservoir: store surplus bits, which can be used during periods of attack.
• Window switching: switch between long and short time windows. Short windows are used for transients to minimize the temporal spread of noise. Long windows are used in the normal (stationary) case to increase compression efficiency.
• Gain modification: smooths transient peaks by changing the gain of the signal prior to the transient.
• Temporal noise shaping (TNS): linear prediction on the frequency-domain spectrum. The prediction residual is flattened and quantized. After the decoder's inverse filtering, the quantization noise follows the temporal envelope of the original signal.

Stereo Coding

MS stereo (middle/side stereo):
• One channel encodes the information that is identical in the left and right channels.
• The other channel encodes the differences between the left and right channels.
• In other words, the sum and difference of the original left and right signals are transmitted.

Intensity stereo:
• A lossy coding technique: the left and right channels are replaced by a single representative signal plus directional information.
• Usually used only at higher frequencies, since the human ear is less sensitive to signal phase at these frequencies.
• Used only at low bit rates.
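A minimal sketch of the intensity-stereo idea for one high-frequency band. The parameterization is hypothetical and chosen for clarity: a sum signal plus one energy gain per channel. Real codecs instead transmit a quantized intensity position per scale-factor band. The decoded channels keep the original energies, and hence the perceived direction, but they are scaled copies of one waveform.

```python
import math


def intensity_encode(left, right):
    """Encode one band as a single sum signal plus two energy gains
    (hypothetical parameterization for illustration)."""
    total = [l + r for l, r in zip(left, right)]
    e_sum = sum(s * s for s in total) or 1e-12  # avoid division by zero
    g_left = math.sqrt(sum(l * l for l in left) / e_sum)
    g_right = math.sqrt(sum(r * r for r in right) / e_sum)
    return total, g_left, g_right


def intensity_decode(total, g_left, g_right):
    """Both outputs are scaled copies of the same signal: each channel's
    energy is preserved, the waveform/phase difference is lost."""
    return [g_left * s for s in total], [g_right * s for s in total]


left = [0.8 * math.sin(0.3 * n) for n in range(32)]
right = [0.2 * math.sin(0.3 * n + 0.4) for n in range(32)]
dec_l, dec_r = intensity_decode(*intensity_encode(left, right))
```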

Psycho-Acoustic Model 1

1. Spectral analysis and SPL normalization: normalize the input samples and segment them into blocks.
2. Identification of tonal and noise maskers:
   • The energy of 3 adjacent spectral components around a local maximum is combined to form a single tonal masker.
   • The energy of all other spectral lines not within a range of ±Δ of a tonal masker is combined, per critical band, to form a noise masker.
3. Decimation and reorganization of maskers:
   • Any tonal or noise masker below the absolute threshold is discarded.
   • Adjacent pairs of maskers (within 0.5 Bark of each other) are compared, and each pair is replaced by the stronger of the two.
4. Calculation of individual masking thresholds: calculate the thresholds due to the tonal and noise maskers.
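Step 2 can be sketched as follows, following the description of Model 1 in Painter and Spanias. A bin is taken as a tonal masker if it is a local maximum that also exceeds the bins at distance Δk by more than 7 dB, with Δk growing with frequency. The exact neighbourhood widths below are assumptions for a 512-point FFT, and the function names are hypothetical.

```python
import math


def neighbourhood(k):
    """Offsets j (in bins) whose levels must lie 7 dB below the peak.
    Assumed widths for a 512-point FFT; they grow with frequency."""
    if k < 63:
        return (2,)
    if k < 127:
        return (2, 3)
    return (2, 3, 4, 5, 6)


def find_tonal_maskers(p_db):
    """Return {bin: tonal masker level in dB} for a power spectrum in dB."""
    n = len(p_db)
    maskers = {}
    for k in range(1, n - 1):
        offsets = neighbourhood(k)
        if k - max(offsets) < 0 or k + max(offsets) >= n:
            continue  # neighbourhood falls outside the spectrum
        local_max = p_db[k] > p_db[k - 1] and p_db[k] > p_db[k + 1]
        prominent = all(p_db[k] > p_db[k + s * j] + 7
                        for j in offsets for s in (-1, 1))
        if local_max and prominent:
            # Combine the energy of the peak and its two neighbours
            power = sum(10 ** (p_db[k + j] / 10) for j in (-1, 0, 1))
            maskers[k] = 10 * math.log10(power)
    return maskers


spectrum = [20.0] * 100
spectrum[39], spectrum[40], spectrum[41] = 50.0, 60.0, 50.0  # a tone near bin 40
print(find_tonal_maskers(spectrum))
```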

Psycho-Acoustic Model 1

Threshold due to tonal maskers

Threshold due to noise maskers
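The formulas for these two thresholds were not carried over into this text. Below is the form commonly given for Model 1, for example in Painter and Spanias (see References); check the constants against ISO 11172-3 before relying on them. Notation: $z(\cdot)$ is the Bark value of a bin, $i$ the maskee bin, $j$ the masker bin, $\Delta_z = z(i) - z(j)$, and $P(j)$ the masker level in dB.

$$T_{TM}(i,j) = P_{TM}(j) - 0.275\,z(j) + SF(i,j) - 6.025\ \text{dB SPL}$$

$$T_{NM}(i,j) = P_{NM}(j) - 0.175\,z(j) + SF(i,j) - 2.025\ \text{dB SPL}$$

$$SF(i,j) = \begin{cases} 17\Delta_z - 0.4P(j) + 11, & -3 \le \Delta_z < -1 \\ (0.4P(j) + 6)\Delta_z, & -1 \le \Delta_z < 0 \\ -17\Delta_z, & 0 \le \Delta_z < 1 \\ (0.15P(j) - 17)\Delta_z - 0.15P(j), & 1 \le \Delta_z < 8 \end{cases}$$

A masker is assumed to contribute only within $-3 \le \Delta_z < 8$ Bark. The pieces of $SF$ join continuously at $\Delta_z = -1$, $0$ and $1$.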

Psycho-Acoustic Model 1: Calculation of the Global Masking Threshold

• The individual masking thresholds are combined to estimate the global masking threshold.
• Masking effects are assumed to be additive. The global threshold is the sum (in power) of the absolute threshold of hearing, the thresholds from the tonal maskers and the thresholds from the noise maskers.
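Because the thresholds are expressed in dB, "additive" means the contributions are summed as powers. Here is a minimal sketch for one frequency bin, with made-up input levels and a hypothetical function name.

```python
import math


def global_threshold_db(ath_db, tonal_db, noise_db):
    """Global masking threshold at one frequency bin.

    ath_db   : absolute threshold of hearing at the bin (dB SPL)
    tonal_db : thresholds contributed at the bin by each tonal masker (dB)
    noise_db : thresholds contributed at the bin by each noise masker (dB)
    Levels are converted to power, added, and converted back to dB.
    """
    power = 10 ** (ath_db / 10)
    power += sum(10 ** (t / 10) for t in tonal_db)
    power += sum(10 ** (t / 10) for t in noise_db)
    return 10 * math.log10(power)


print(round(global_threshold_db(10.0, [30.0, 30.0], [20.0]), 2))  # 33.24
```

Two equal 30 dB contributions already add up to about 33 dB, so the strongest maskers dominate the result.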

Filter Bank Characteristics

• Lossless: analysis and synthesis together should be invertible.
• Aliasing errors should cancel, for perfect or near-perfect reconstruction.
• Low computational complexity.
• Bandwidths should replicate the critical bands of the human ear.

QMF Filters
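The figure for this slide did not survive. As a minimal stand-in, the sketch below implements a two-channel QMF bank with the two-tap Haar pair $h_0 = [1, 1]/\sqrt{2}$ and $h_1 = [1, -1]/\sqrt{2}$, which satisfies the QMF relation $H_1(z) = H_0(-z)$. It is written in polyphase form, combining filtering with decimation by 2, and gives perfect reconstruction. Practical audio QMF banks use far longer filters and achieve only near-perfect reconstruction.

```python
import math

S = 1 / math.sqrt(2)


def qmf_analysis(x):
    """Two-band Haar QMF analysis in polyphase form: low/high band outputs,
    each decimated by 2, so the bank is critically sampled."""
    assert len(x) % 2 == 0
    low = [S * (x[2 * n] + x[2 * n + 1]) for n in range(len(x) // 2)]
    high = [S * (x[2 * n] - x[2 * n + 1]) for n in range(len(x) // 2)]
    return low, high


def qmf_synthesis(low, high):
    """Upsample and recombine; the aliasing of the two bands cancels."""
    out = []
    for a, d in zip(low, high):
        out += [S * (a + d), S * (a - d)]
    return out


x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
low, high = qmf_analysis(x)
print(len(low) + len(high) == len(x))  # True: critically sampled
print(max(abs(a - b) for a, b in zip(qmf_synthesis(low, high), x)))  # rounding error only
```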

Pseudo-QMF

• Cosine modulation of a low-pass prototype filter implements a parallel M-channel filter bank with nearly perfect reconstruction.
• Overall linear phase, and hence constant group delay.
• Complexity: one prototype filter plus modulation.
• Critical sampling.
• The analysis and synthesis filters satisfy mirror-image conditions to eliminate phase distortion.
• MPEG-1 uses a 32-channel PQMF bank for spectral decomposition in Layer I and Layer II.

MDCT (TDAC)

• De-correlates the signal by mapping it onto orthogonal basis functions.
• A lapped orthogonal block transform: successive transform blocks overlap each other.
• Overall linear phase.
• Forward MDCT:
  ◦ 50% overlap between blocks: a block transform of 2M samples with a block advance of M samples.
  ◦ The basis functions extend across 2 blocks, which eliminates blocking artifacts.
  ◦ Critically sampled: M output samples for 2M input samples.
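One common form of the forward MDCT is $X(k) = \sum_{n=0}^{2M-1} w(n)\,x(n)\cos\left[\frac{\pi}{M}\left(n + \frac{1}{2} + \frac{M}{2}\right)\left(k + \frac{1}{2}\right)\right]$ for $k = 0, \dots, M-1$. The sketch below pairs it with an inverse scaled by $2/M$ and a sine window $w(n) = \sin\left(\pi(n + \frac{1}{2})/2M\right)$, which satisfies the Princen-Bradley condition $w^2(n) + w^2(n+M) = 1$. These choices are assumptions of this example; normalizations vary between texts. With them, overlap-adding the 50%-overlapped blocks cancels the time-domain aliasing and returns the input. The code is an O(M²) transcription for clarity, not a fast algorithm.

```python
import math


def sine_window(two_m):
    """Sine window of length 2M; satisfies w[n]^2 + w[n+M]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / two_m) for n in range(two_m)]


def mdct(block, window):
    """Forward MDCT: 2M windowed input samples -> M coefficients."""
    two_m = len(block)
    m = two_m // 2
    return [sum(window[n] * block[n]
                * math.cos(math.pi / m * (n + 0.5 + m / 2) * (k + 0.5))
                for n in range(two_m))
            for k in range(m)]


def imdct(coeffs, window):
    """Inverse MDCT with 2/M scaling: M coefficients -> 2M windowed samples."""
    m = len(coeffs)
    return [window[n] * (2.0 / m)
            * sum(coeffs[k] * math.cos(math.pi / m * (n + 0.5 + m / 2) * (k + 0.5))
                  for k in range(m))
            for n in range(2 * m)]


def mdct_roundtrip(signal, m):
    """Analyse with 2M-sample blocks advanced by M (m even), then overlap-add."""
    window = sine_window(2 * m)
    # M zeros in front, and enough zeros at the end to complete the last block
    padded = [0.0] * m + list(signal) + [0.0] * (m + (-len(signal)) % m)
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - m, m):
        y = imdct(mdct(padded[start:start + 2 * m], window), window)
        for n in range(2 * m):
            out[start + n] += y[n]
    return out[m:m + len(signal)]


signal = [math.sin(0.37 * n) + 0.25 * math.cos(1.3 * n) for n in range(50)]
rec = mdct_roundtrip(signal, 8)
print(max(abs(a - b) for a, b in zip(rec, signal)))  # only floating-point rounding error
```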

Lossy Audio Compression Techniques

• The decoded output is not bit-exact with the original input.
• The decoded output is perceptually the same as the original input.
• More compression is achieved.
• Extensive use of a psycho-acoustic model to discard perceptually irrelevant audio data.
• Examples: MP3 and AAC.

[Figure: encoder block diagram: time-to-frequency filter bank → allocate bits and quantize → format bitstream, with a psycho-acoustic model driving the bit allocation]

Audio Decoder

Usually the encoder is complex and the decoder is much less complex.

MPEG Compression: ISO 11172-3 (MPEG-1)

• Mainly specifies the bitstream, which leaves the design of the encoder flexible for individual developers.
• Lossy and perceptually transparent.
• Sampling frequencies of 32 kHz, 44.1 kHz and 48 kHz are supported.
• Various bit rates from 32 to 192 kbps per channel are supported.
• Supported channel modes: mono, stereo, dual mono, joint stereo.
• Three independent layers of compression, based on complexity:
  ◦ Layer 1 (around 192 kbps per channel)
  ◦ Layer 2 (around 128 kbps per channel)
  ◦ Layer 3, MP3 (around 64 kbps per channel)
• Complexity increases from Layer 1 to Layer 3.
• Optional CRC for error checking.
• Ancillary data support.

MPEG-1 Layer 1 and Layer 2

MPEG Layer 1 and Layer 2: Sub-band Filtering

• Polyphase filter bank: decomposes the input signal into 32 sub-bands.
• The sub-bands are equally spaced. For example, for a 48 kHz signal each sub-band is 24 kHz / 32 = 750 Hz wide.
• Critically sampled: the output of each sub-band is downsampled so that the numbers of input and output samples are the same.
• The sub-bands do not reflect the critical bands of the human ear.
• The prototype filter is chosen to achieve high side-lobe attenuation (96 dB).
• Not perfectly lossless (the error is small).

FFT:
• Done for psycho-acoustic analysis and determination of the JND thresholds.
• Done in parallel with the sub-band filtering.
• Layer 1 uses a 512-point FFT and Layer 2 a 1024-point FFT.

MPEG-1 Layer 1 and Layer 2: Block Companding

• The sub-band filter output is block-companded (normalized by a scale factor) so that the maximum sample amplitude in each block is unity.
• This is done on blocks of 12 sub-band samples (8 ms at 48 kHz).

Psycho-acoustic analysis:
• The output of the FFT block is the input to the psycho-acoustic block, which outputs the masking threshold for each band.

Quantization and bit allocation:
• The procedure is iterative.
• Bit allocation applies the JND thresholds to select an optimal quantizer from a predetermined set.
• Quantization should satisfy both the masking and the bit-rate requirements.
• The scale factors and quantizer selections are also coded and sent in the bitstream.
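A minimal sketch of companding one block of 12 sub-band samples, with two assumptions. First, the Layer I/II scale factors are taken to follow $2^{1 - i/3}$ for $i = 0, \dots, 62$; the normative values are tabulated in ISO 11172-3. Second, a plain symmetric uniform quantizer with $2^b - 1$ levels stands in for the standard's quantizer classes. The function names are hypothetical.

```python
def scale_factor_index(block):
    """Largest index i (smallest scale factor 2**(1 - i/3)) that still
    covers the block's peak amplitude.  Table form assumed, see text."""
    peak = max(abs(s) for s in block)
    index = 0
    while index < 62 and 2 ** (1 - (index + 1) / 3) >= peak:
        index += 1
    return index


def compand_and_quantize(block, bits):
    """Normalize by the scale factor, then quantize to 2**bits - 1 levels."""
    assert bits >= 2
    i = scale_factor_index(block)
    sf = 2 ** (1 - i / 3)
    half = (2 ** bits - 1) // 2            # codes run from -half to +half
    return i, [round(s / sf * half) for s in block]


def dequantize(i, codes, bits):
    """Decoder side: undo the quantizer and the scale factor."""
    sf = 2 ** (1 - i / 3)
    half = (2 ** bits - 1) // 2
    return [c / half * sf for c in codes]


block = [0.1 * k for k in range(-6, 6)]    # 12 sub-band samples, peak 0.6
index, codes = compand_and_quantize(block, bits=4)
print(index, codes)
```

Only the scale-factor index and the small integer codes need to be transmitted; the bit allocation decides `bits` per sub-band.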

MPEG Layer 1 and Layer 2: Psycho-Acoustic Model

1. Separate the spectral values into tonal and non-tonal components (Model 1), or calculate a tonality index (Model 2).
2. Apply the spreading function.
3. Set a lower bound for the threshold values.
4. Find the masking threshold for each sub-band.
5. Calculate the signal-to-mask ratio and pass it to the bit-allocation block.

MPEG-1 Layer 1 and Layer 2

MPEG-1 Layer 1:
• Frame length of 384 samples: 32 sub-bands of 12 samples each.
• Each group of 12 samples gets a bit allocation and a scale factor.

MPEG-1 Layer 2:
• An enhancement of Layer 1, with a more compact code for representing scale factors, quantized samples and bit allocation.
• Frame length of 1152 samples: each sub-band holds 3 groups of 12 samples.
• Each sub-band has one bit allocation and up to 3 scale factors.

MPEG-1 Layer 1 and Layer 2: Bitstream

SCFSI (scale factor selection information): the number of scale factors for each sub-band.

MPEG-1 Layer 3

[Figure: MPEG-1 Layer 3 block diagram, taken from the Fraunhofer (FhG) website]

MPEG-1 Layer 3

Main blocks:
• Filter bank
• Perceptual (psycho-acoustic) model
• Quantization and coding
• Encoding of the bitstream

Features:
• Mono and stereo support
• Bit rates up to 320 kbps
• Sampling frequencies of 32 kHz, 44.1 kHz and 48 kHz
• CBR and VBR coding
• MS-stereo and intensity-stereo coding

Enhancements over Layer 1 and Layer 2

• Higher frequency resolution due to the MDCT
• Non-uniform quantization
• Uses scale-factor bands, which resemble the critical bands of the human ear (unlike the sub-bands used in Layer 1 and Layer 2)
• Entropy coding (variable-length Huffman codes)
• Better handling of pre-echo artifacts
• Use of a bit reservoir

Hybrid Filter Bank

• The hybrid filter bank gives a better approximation of the critical bands of the human ear.
• A polyphase filter bank followed by an MDCT filter bank.
• Polyphase filter bank: compatible with Layer 1 and Layer 2.
• MDCT filter bank: splits each polyphase frequency band into 18 finer sub-bands (32 × 18 = 576 lines in total), giving:
  ◦ Higher frequency resolution
  ◦ Pre-echo control
  ◦ Better alias reduction
  ◦ Block switching

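Sub-band Filtering

Flow chart of the MPEG-1 analysis sub-band filter (each pass produces 32 sub-band samples):

BEGIN
    Shift in 32 new audio samples:
        for i = 511 downto 32 do X[i] = X[i-32]
        for i = 31 downto 0 do X[i] = next_input_audio_sample
    Window by 512 coefficients, producing vector Z:
        for i = 0 to 511 do Z[i] = C[i] * X[i]
    Partial calculation:
        for i = 0 to 63 do Y[i] = sum over j = 0..7 of Z[i + 64j]
    Calculate 32 samples by matrixing:
        for i = 0 to 31 do S[i] = sum over k = 0..63 of M[i][k] * Y[k]
    Output 32 sub-band samples
END

Here M[i][k] = cos((2i + 1)(k - 16)π / 64), and C[i] are the 512 analysis window coefficients tabulated in the standard.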
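A direct Python transcription of the flow chart, as a sketch. The 512 window coefficients C[i] come from a table in ISO 11172-3 that is not reproduced here, so the class takes them as a parameter. With any other values, the code exercises the same structure (FIFO shift, windowing, 64 partial sums, 32×64 matrixing) but does not produce standard-conformant sub-band samples.

```python
import math

# Matrixing coefficients M[i][k] = cos((2i + 1)(k - 16) * pi / 64)
M = [[math.cos((2 * i + 1) * (k - 16) * math.pi / 64) for k in range(64)]
     for i in range(32)]


class AnalysisFilterBank:
    """Sketch of the MPEG-1 analysis sub-band filter flow chart.

    window: the 512 analysis window coefficients C[i] (normative table
    not included; any 512-value list exercises the structure).
    """

    def __init__(self, window):
        assert len(window) == 512
        self.c = list(window)
        self.x = [0.0] * 512  # FIFO holding the most recent 512 input samples

    def process(self, new_samples):
        """Consume 32 new input samples, return 32 sub-band samples."""
        assert len(new_samples) == 32
        # X[i] = X[i-32] for i = 511..32, then the new samples (newest at X[0])
        self.x[32:] = self.x[:480]
        for i in range(32):
            self.x[31 - i] = new_samples[i]
        z = [self.c[i] * self.x[i] for i in range(512)]                    # windowing
        y = [sum(z[i + 64 * j] for j in range(8)) for i in range(64)]      # partial sums
        return [sum(M[i][k] * y[k] for k in range(64)) for i in range(32)]  # matrixing
```

Each call advances the input by 32 samples and yields one sample per sub-band, which is the critical sampling described earlier.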

Window Switching

• Short and long windows: adaptive MDCT block sizes of 6 and 18 points (per polyphase sub-band).
• Short windows prevent pre-echo (pre-masking can then hide the remaining pre-echoes).
• Long window length: 1152 samples. Short window length: 384 samples.
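The encoder needs some rule for deciding when to switch to short windows. The sketch below uses a hypothetical energy-ratio attack detector. The number of sub-blocks and the threshold of 10 are illustrative values, not taken from the standard.

```python
def choose_window(samples, sub_blocks=8, ratio_threshold=10.0):
    """Hypothetical attack detector for window switching.

    Splits the frame into sub-blocks and flags an attack when a sub-block's
    energy exceeds ratio_threshold times the mean energy of the preceding
    sub-blocks.  Both parameters are illustrative, not from any standard.
    """
    size = len(samples) // sub_blocks
    energies = [sum(s * s for s in samples[b * size:(b + 1) * size])
                for b in range(sub_blocks)]
    for b in range(1, sub_blocks):
        previous = sum(energies[:b]) / b
        if energies[b] > ratio_threshold * max(previous, 1e-12):
            return "short"
    return "long"


steady = [1.0 if n % 2 else -1.0 for n in range(1152)]
attack = [0.001] * 1000 + [1.0] * 152      # quiet, then a sharp onset
print(choose_window(steady), choose_window(attack))
```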

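Quantization and Coding

• Uses a bit reservoir: bits saved in one frame are used for encoding other frames.
• Non-linear quantization:

$$ix(i) = \mathrm{nint}\left[\left(\frac{|xr(i)|}{2^{(qquant + quantanf)/4}}\right)^{0.75} - 0.0946\right]$$

  Here xr(i) are the MDCT coefficients, ix(i) the quantized values, and qquant + quantanf sets the quantizer step size.
• Huffman encoding:
  ◦ 32 different Huffman code tables are available for coding.
  ◦ Each table caters for a different maximum value that can be coded and for different signal statistics.
  ◦ Different code books can be used for each sub-region.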
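The quantizer formula transcribed into Python, as a sketch. `step` stands for qquant + quantanf. The inverse uses the 4/3 power that undoes the 0.75 exponent, so reconstruction is approximate. A larger step gives smaller quantized values (fewer bits) at the cost of more quantization noise.

```python
import math


def quantize(xr, step):
    """Layer III power-law quantizer; step stands for qquant + quantanf."""
    # nint() as floor(v + 0.5); v >= -0.0946 here, so this rounds correctly
    return int(math.floor((abs(xr) / 2 ** (step / 4)) ** 0.75 - 0.0946 + 0.5))


def dequantize(ix, sign, step):
    """Approximate inverse: |xr| ~ ix**(4/3) * 2**(step/4), sign restored."""
    return math.copysign(ix ** (4 / 3) * 2 ** (step / 4), sign)


for step in (0, 4, 8):
    ix = quantize(100.0, step)
    print(step, ix, round(dequantize(ix, 1.0, step), 2))
```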

Layer III Outer Iteration Loop

Flow chart:
BEGIN
    1. Inner iteration loop
    2. Calculate the distortion for each critical band
    3. Save the scaling factors of the critical bands
    4. Preemphasis
    5. Amplify the critical bands with more than the allowed distortion
    6. Decisions:
       • All critical bands amplified? If yes, exit.
       • Amplification of all bands below the upper limit? If no, exit.
       • At least one band with more than the allowed distortion? If yes, repeat from step 1. If no, exit.
    On exit: restore the scaling factors and RETURN

Quantization and coding:

Inner iteration loop (rate control loop):
• Does the quantization and Huffman coding. Huffman coding assigns shorter codes to more frequently used values.
• Keeps increasing the global gain (quantizer step size) until the quantized values are small enough to be encoded with the available number of bits.

Outer iteration loop (noise control loop):
• If the quantization noise exceeds the masking threshold in any band, it increases the scale factor for that band.
• Executed until the noise is below the masking threshold (or one of the exit conditions above is met).
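A toy sketch of the two nested loops. It follows the control flow described above, not the exact ISO procedure, and rests on these assumptions:
• The bit cost is a hypothetical stand-in for the Huffman tables.
• Each amplification step multiplies a band by √2.
• Noise is measured as squared error per band.
• Preemphasis is omitted.

```python
import math


def quantize(values, step):
    """Layer III power-law quantizer (magnitudes only)."""
    scale = 2 ** (step / 4)
    return [math.floor((abs(v) / scale) ** 0.75 - 0.0946 + 0.5) for v in values]


def reconstruct(ix, values, step):
    """Dequantize, restoring the signs of the original values."""
    scale = 2 ** (step / 4)
    return [math.copysign(q ** (4 / 3) * scale, v) for q, v in zip(ix, values)]


def count_bits(ix):
    """HYPOTHETICAL bit cost standing in for the Huffman tables."""
    return sum(1 + 2 * q.bit_length() for q in ix)


def inner_loop(values, bit_budget, max_step=400):
    """Rate control: increase the step until the block fits the budget."""
    step = 0
    while count_bits(quantize(values, step)) > bit_budget and step < max_step:
        step += 1
    return step


def outer_loop(values, bands, allowed_noise, bit_budget, max_amp=15):
    """Noise control: amplify bands whose noise exceeds the allowed level.
    Returns (per-band amplification steps, final quantizer step)."""
    amp = [0] * len(bands)
    while True:
        gains = [2 ** (a / 2) for a in amp]          # assumed sqrt(2) per step
        scaled = list(values)
        for (lo, hi), g in zip(bands, gains):
            for i in range(lo, hi):
                scaled[i] *= g
        step = inner_loop(scaled, bit_budget)
        rec = reconstruct(quantize(scaled, step), scaled, step)
        # Noise measured in the original (unamplified) domain
        noisy = [b for b, ((lo, hi), g) in enumerate(zip(bands, gains))
                 if sum(((scaled[i] - rec[i]) / g) ** 2 for i in range(lo, hi))
                 > allowed_noise[b]]
        proposed = [a + 1 if b in noisy else a for b, a in enumerate(amp)]
        if not noisy or all(a > 0 for a in proposed) or max(proposed) > max_amp:
            return amp, step  # keep the last evaluated ("restored") scale factors
        amp = proposed


values = [30.0, -12.0, 5.0, 2.0, 40.0, -25.0, 8.0, 1.0]
print(outer_loop(values, [(0, 4), (4, 8)], [0.5, 50.0], bit_budget=60))
```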

Bit Reservoir and Back-Frames

• The encoder can donate bits to the bit reservoir and can borrow bits from it.
• A 9-bit pointer (main data begin) points to the starting byte of the audio data for a frame, as a backward offset in bytes. The data for a frame can therefore begin inside earlier frames.
• Theoretically, main data begin cannot be greater than 7680 bits, the frame length of a 320 kbps frame at 48 kHz.
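The 7680-bit figure follows from the usual frame-length formula for 1152-sample MPEG-1 frames: $144 \times \text{bit rate} / f_s$ bytes, plus one padding byte when signalled. A quick check, with a hypothetical helper name:

```python
def layer3_frame_bytes(bit_rate_bps, sample_rate_hz, padding=0):
    """Length in bytes of a 1152-sample MPEG-1 Layer II/III frame:
    1152 / 8 * bit_rate / sample_rate = 144 * bit_rate / sample_rate."""
    return 144 * bit_rate_bps // sample_rate_hz + padding


MAX_MAIN_DATA_BEGIN = 2 ** 9 - 1  # largest value a 9-bit pointer can hold (bytes)

print(layer3_frame_bytes(320_000, 48_000) * 8)  # 7680 bits
print(layer3_frame_bytes(128_000, 44_100))      # 417 bytes (418 with padding)
print(MAX_MAIN_DATA_BEGIN)                      # 511
```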

Advanced Audio Coding (AAC)

AAC Features

• Sampling rates from 8 kHz to 96 kHz
• Bit rates from 8 kbps to 576 kbps
• Mono, stereo and multi-channel (up to 48 channels)
• Supports both CBR and VBR
• Multiple profiles or object types:
  ◦ Low Complexity (LC)
  ◦ Scalable Sample Rate (SSR)
  ◦ HE (High Efficiency)
  ◦ HE v2 (High Efficiency with Parametric Stereo)

AAC: Basic Features and Modules

Features:
• High frequency-resolution transform coder (1024-line MDCT with 50% overlap)
• Non-uniform quantizer
• Noise shaping in scale-factor bands
• Huffman coding
• Temporal noise shaping (TNS)
• Perceptual noise substitution (PNS)

Modules:
• Filter bank
• Perceptual model
• Quantization and coding
• Optional tools such as TNS, PNS, prediction, etc.

Improvements over MP3

• Higher efficiency and a simpler filter bank: a pure MDCT instead of the hybrid filter bank of MP3.
• Higher frequency resolution (1024 lines vs. 576 in MP3).
• Improved Huffman coding tables.
• Window shape adaptation (sine and KBD).
• Enhanced block switching: the window length is dynamically switched between 2048 and 256 samples (compared with 1152 and 384 in MP3). This gives better coding efficiency for long blocks and fewer pre-echo artifacts for short blocks.
• Tools used only in AAC: temporal noise shaping, perceptual noise substitution, long-term prediction.
• More flexible joint stereo (selectable separately for every scale-factor band).

Filter Bank

• MDCT supporting block lengths of 2048 and 256 points.
• Dynamic switching between long and short blocks.
• 50% overlap between blocks.
• Two window shapes: the Kaiser-Bessel-derived (KBD) window and the sine window.
• For short blocks, 8 short transforms are performed in a row to maintain synchronicity (8 × 128 = 1024 coefficients, the same as one long block).
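A sketch of the KBD window, assuming the definition used for AAC: the square root of the normalized running sum of a Kaiser window. The α values commonly cited for AAC are 4 for long and 6 for short windows; this is stated here as an assumption. By construction the window satisfies the Princen-Bradley condition $w^2(n) + w^2(n+N) = 1$, which is what lets it replace the sine window in the MDCT without breaking reconstruction.

```python
import math


def bessel_i0(x, terms=50):
    """Modified Bessel function of the first kind, order 0 (power series)."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / (2 * k)) ** 2
        total += term
    return total


def kbd_window(length, alpha):
    """Kaiser-Bessel-derived window of even `length` (= 2N)."""
    n_half = length // 2
    # Unnormalized Kaiser window with N + 1 points (normalization cancels below)
    kaiser = [bessel_i0(math.pi * alpha * math.sqrt(1 - (2 * j / n_half - 1) ** 2))
              for j in range(n_half + 1)]
    total = sum(kaiser)
    first_half, running = [], 0.0
    for n in range(n_half):
        running += kaiser[n]
        first_half.append(math.sqrt(running / total))
    return first_half + first_half[::-1]  # symmetric second half


short_window = kbd_window(256, alpha=6)   # AAC short block (assumed alpha)
long_window = kbd_window(2048, alpha=4)   # AAC long block (assumed alpha)
```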

Temporal Noise Shaping (TNS): Forward Prediction

• In forward (open-loop) prediction, the correlation between successive input samples is exploited by quantizing the prediction error. The prediction is computed from the unquantized input samples.
• As a result, the quantization error in the final decoded signal is adapted to the power spectral density (PSD) of the input signal.
• TNS applies forward prediction to the spectral data, over frequency. By the same reasoning, the temporal shape of the quantization error at the decoder output appears adapted to the temporal shape of the input signal.
• With TNS, the temporal shape of the filter bank's quantization noise is adapted to the envelope of the input signal. Without TNS, the quantization noise is distributed almost uniformly over time.

Temporal Noise Shaping (TNS)

• A tool for handling transient and pitched input signals, based on the duality between the time and frequency domains.
• A non-flat spectrum can be coded efficiently either by coding the spectral values directly or by applying predictive coding methods to the time-domain signal.
• Duality: a transient signal (non-flat in the time domain) can be coded efficiently either directly in the time domain or by applying predictive methods to the spectral data.
• TNS uses this prediction approach in the frequency domain to shape the quantization noise over time.
• The quantized filter coefficients are transmitted. The TNS tool can be dynamically switched on and off in the stream.
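A sketch of the TNS principle only. It runs open-loop linear prediction (autocorrelation method, Levinson-Durbin, order 4 assumed) across the coefficients of a plain DCT-II, which stands in for the MDCT spectrum. Coefficient quantization, the band range the filter applies to and the actual noise-shaping effect are not modelled. For a click, the spectrum is a sampled cosine across frequency. The prediction gain is therefore high, and the decoder's inverse filter restores the spectrum exactly.

```python
import math


def dct_ii(x):
    """Unnormalized DCT-II, standing in for the codec's MDCT spectrum."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
            for k in range(n)]


def levinson_durbin(r, order):
    """Predictor a[1..order] such that s[k] ~ sum_j a[j] * s[k - j]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0:
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        a = [a[j] - k * a[i - j] if 1 <= j < i else a[j] for j in range(order + 1)]
        a[i] = k
        err *= 1 - k * k
    return a[1:]


def tns_encode(spec, order=4):
    """Open-loop prediction across frequency; returns (coeffs, residual)."""
    n = len(spec)
    r = [sum(spec[k] * spec[k - lag] for k in range(lag, n)) for lag in range(order + 1)]
    a = levinson_durbin(r, order)
    resid = [spec[k] - sum(a[j - 1] * spec[k - j] for j in range(1, order + 1) if k >= j)
             for k in range(n)]
    return a, resid


def tns_decode(a, resid):
    """Inverse (all-pole) filter across frequency."""
    spec = []
    for k, e in enumerate(resid):
        spec.append(e + sum(a[j - 1] * spec[k - j] for j in range(1, len(a) + 1) if k >= j))
    return spec


block = [0.0] * 64
block[40] = 1.0  # a click: its spectrum is a sampled cosine across frequency
spec = dct_ii(block)
a, resid = tns_encode(spec)
gain = sum(s * s for s in spec) / sum(e * e for e in resid)
print(f"prediction gain across frequency: {gain:.1f}")
```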

Perceptual Noise Substitution (PNS)

• Available only in MPEG-4, not in MPEG-2.
• Based on the fact that the fine structure of a noise signal is of minor importance for the subjective perception of the signal.
• Instead of transmitting the actual spectrum, the encoder transmits:
  ◦ the information that this frequency region is noise-like, and
  ◦ the total power in that frequency band.
• PNS can be switched on and off on a scale-factor-band basis.
• When a region is coded using PNS, the decoder inserts randomly generated noise.
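A minimal sketch of the PNS idea for one band, with hypothetical function names. The encoder keeps only the band power, and the decoder fills the band with random values rescaled to that power. In a real bitstream the power is transmitted as a quantized, scale-factor-like value.

```python
import math
import random


def pns_encode(band):
    """Keep only the band's total power (plus, implicitly, a 'noise' flag)."""
    return sum(c * c for c in band)


def pns_decode(power, width, rng):
    """Fill the band with random values scaled to the transmitted power."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(width)]
    scale = math.sqrt(power / sum(v * v for v in noise))
    return [scale * v for v in noise]


rng = random.Random(1)
original = [rng.uniform(-1, 1) for _ in range(16)]  # a noise-like band
power = pns_encode(original)
substitute = pns_decode(power, len(original), rng)
```

The substitute has the same power as the original but an entirely different waveform, which is exactly what PNS relies on being inaudible for noise-like content.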

Spectral Band Replication (SBR)

• Recreates the high frequencies from the decoded base-band signal.
• An enhancement technology: it needs a base audio codec.
• The base codec operates at half the sampling frequency of SBR.
• The bitstream of the base encoder plus SBR control parameters is transmitted.

SBR Decoder

1. The decoded low-band signal is analyzed using a QMF bank.
2. High frequencies are reconstructed from the lower bands.
3. The reconstructed signal is adaptively filtered to ensure the proper spectral characteristics of each sub-band.
4. Envelope adjustment.
5. The low-band signals are added to the envelope-adjusted high-band signals.

Parametric Stereo (PS)

• A mono signal is encoded, along with stereo parameters sent as side information in the encoded bitstream.
• Three types of parameters are employed in parametric stereo:
  ◦ Inter-channel intensity difference (IID)
  ◦ Inter-channel cross-coherence (ICC)
  ◦ Inter-channel phase difference (IPD)
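For one band, IID and ICC can be computed directly from their definitions: IID is the energy ratio of the two channels in dB, and ICC is their normalized cross-correlation. Below is a minimal sketch with a hypothetical helper name. Band splitting, IPD and parameter quantization are omitted, and a silent channel would need special handling.

```python
import math


def ps_parameters(left, right):
    """Return (IID in dB, ICC) for one band of real-valued samples."""
    e_l = sum(l * l for l in left)
    e_r = sum(r * r for r in right)
    cross = sum(l * r for l, r in zip(left, right))
    iid_db = 10 * math.log10(e_l / e_r)
    icc = cross / math.sqrt(e_l * e_r)
    return iid_db, icc


right = [0.5, -0.2, 0.9, 0.1]
left = [2 * v for v in right]        # same waveform, twice the amplitude
print(ps_parameters(left, right))    # IID about 6.02 dB, ICC about 1.0
```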

Lossless Audio Compression

Sudhir K

Multimedia Codecs

Main Features

• No loss in quality: perfect reconstruction.
• Less compression than lossy coding.
• No psycho-acoustic model required.

Applications:
• High-end audio
• Home theatre
• DVD-Audio

Examples: MLP, WMA Lossless, OptimFROG, Real Lossless, Monkey's Audio, FLAC, LTAC, Apple Lossless, TTA Lossless Audio, MPEG-4 Audio Lossless Coding (ALS).

Types of Lossless Coding

• Time-domain lossless coding: operates on the audio data in the time domain. Most current lossless compression techniques are of this type.
• Frequency-domain lossless coding: operates on the audio data in the frequency domain. Very few schemes, such as LTAC.

Time Domain Lossless Compression

Block decomposition → Inter-channel decorrelation → Signal modelling → Entropy coding

Inter-Channel Coding

• Exploits the redundancy between the channels.
• Various techniques:
  ◦ Difference channel coding
  ◦ Mid-side stereo coding
  ◦ Intensity stereo coding
  ◦ Inter-channel matrixing
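For lossless coding, mid/side has to be exactly invertible on integers. The sketch below works in the style of FLAC's mid/side mode (treat the exact layout as an assumption). The halved sum drops one bit, and that bit is recovered from the parity of the side channel.

```python
def ms_encode(left, right):
    """Integer mid/side: side = L - R, mid = floor((L + R) / 2)."""
    side = [l - r for l, r in zip(left, right)]
    mid = [(l + r) >> 1 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """L + R and L - R have the same parity, so the bit dropped from mid
    is the least significant bit of side."""
    left, right = [], []
    for m, s in zip(mid, side):
        total = (m << 1) | (s & 1)  # recovers L + R exactly
        left.append((total + s) >> 1)
        right.append((total - s) >> 1)
    return left, right


left = [1000, 1002, -5, -7, 32767]
right = [998, 1003, -6, -7, -32768]
mid, side = ms_encode(left, right)
print(mid, side)
print(ms_decode(mid, side) == (left, right))  # True
```

For highly correlated channels, `side` stays small, which is what the later entropy coder benefits from.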

Signal Modeling and Prediction

• Models the input audio signal so that the difference between the original and the predicted audio signal is minimal.
• The model parameters and the error (residual) coefficients are transmitted.
• Computationally the most complex block.
• Various techniques:
  ◦ Linear prediction
  ◦ LMS filter or adaptive filter
  ◦ Polynomial curve-fitting techniques
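The simplest polynomial predictors are the fixed integer predictors of orders 0 to 3, used for example by FLAC's fixed subframes. Order 1 predicts the previous sample, order 2 extrapolates a line and order 3 a parabola. Below is a sketch with hypothetical function names. It computes the residual, restores the signal and picks the order with the smallest residual.

```python
COEFFS = {0: [], 1: [1], 2: [2, -1], 3: [3, -3, 1]}  # fixed predictor coefficients


def fixed_residual(samples, order):
    """Residual of the fixed polynomial predictor of the given order (0-3).
    The first `order` samples are kept verbatim as warm-up samples."""
    coeffs = COEFFS[order]
    residual = list(samples[:order])
    for n in range(order, len(samples)):
        prediction = sum(c * samples[n - 1 - j] for j, c in enumerate(coeffs))
        residual.append(samples[n] - prediction)
    return residual


def fixed_restore(residual, order):
    """Decoder: rebuild the samples from warm-up samples plus residual."""
    coeffs = COEFFS[order]
    samples = list(residual[:order])
    for n in range(order, len(residual)):
        prediction = sum(c * samples[n - 1 - j] for j, c in enumerate(coeffs))
        samples.append(residual[n] + prediction)
    return samples


def best_order(samples):
    """Order with the smallest sum of absolute residuals (after 3 samples)."""
    return min(range(4), key=lambda k: sum(abs(e) for e in fixed_residual(samples, k)[3:]))


quad = [n * n for n in range(10)]          # a parabola is predicted exactly by order 3
print(best_order(quad), fixed_residual(quad, 3))
```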

Entropy Coding

• Removes the remaining statistical redundancy in the bitstream, compressing the residue (error) signal further.
• Many schemes:
  ◦ Huffman coding
  ◦ Run-length coding
  ◦ Golomb-Rice coding
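A sketch of Golomb-Rice coding of signed residuals, using bit strings for readability:
• Signed values are first mapped to unsigned ones by zigzag mapping.
• The quotient is sent in unary (q zeros then a one), followed by the k low-order bits.

These conventions resemble FLAC's residual coding but should be treated as assumptions; the function names are hypothetical.

```python
def zigzag(n):
    """Map signed to unsigned: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ..."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(u):
    return u // 2 if u % 2 == 0 else -(u + 1) // 2


def rice_encode(n, k):
    """Rice code with parameter k: unary quotient, then k remainder bits."""
    u = zigzag(n)
    q, r = u >> k, u & ((1 << k) - 1)
    return "0" * q + "1" + (format(r, f"0{k}b") if k else "")


def rice_decode(bits, k):
    """Decode one value from the front of a bit string; returns (value, rest)."""
    q = bits.index("1")
    r = int(bits[q + 1:q + 1 + k], 2) if k else 0
    return unzigzag((q << k) | r), bits[q + 1 + k:]


residual = [0, -1, 3, -4, 2, 0, 1]
stream = "".join(rice_encode(e, 1) for e in residual)
decoded, rest = [], stream
while rest:
    value, rest = rice_decode(rest, 1)
    decoded.append(value)
print(stream, decoded == residual)
```

Small residuals get short codes, so the better the predictor, the shorter the stream. The parameter k is normally chosen per block to match the residual magnitudes.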

References

• Ted Painter, Andreas Spanias, "Perceptual Coding of Digital Audio", Proc. IEEE, Vol. 88, No. 4, April 2000.
• Davis Yen Pan, "Digital Audio Compression", Digital Technical Journal, Vol. 5, No. 2, Spring 1993.
• Heiko Purnhagen, "Low Complexity Parametric Stereo Coding in MPEG-4", Proc. 7th Int. Conference on Digital Audio Effects, Naples, Italy, Oct. 5-8, 2004.
• Ted Painter, Andreas Spanias, "A Review of Algorithms for Perceptual Coding of Digital Audio Signals".
• Davis Pan, "A Tutorial on MPEG/Audio Compression".
• Seymour Shlien, "Guide to MPEG-1 Audio Standard".
• ISO 11172-3, Information Technology - Coding of moving pictures and associated audio for digital storage media, Part 3.
• ISO 13818-3.
• ISO 14496-3.
• Jurgen Herre, "Temporal Noise Shaping, Quantization and Coding Methods in Perceptual Audio Coding: A Tutorial Introduction", AES 17th International Conference on High Quality Audio Coding.

Deleted Slides

Filter Banks

• A time-frequency analysis block: a parallel bank of band-pass filters covering the entire spectrum, dividing the signal spectrum into frequency sub-bands.
• The band-pass analysis outputs are decimated by a factor M.
• Critically sampled (maximally decimated).
• Upsampling in the decoder.
• With perfect reconstruction, the output is identical to the input apart from a delay.

Parametric Stereo Encoder/Decoder

$$c = 10^{\mathrm{IID}/20}$$

$$\alpha = \tfrac{1}{2}\arccos(\mathrm{ICC})$$