AUDIO SIGNAL PROCESSING AND CODING

Andreas Spanias, Ted Painter, Venkatraman Atti

WILEY-INTERSCIENCE
A John Wiley & Sons, Inc., Publication
Copyright 2007 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993, or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Wiley Bicentennial Logo: Richard J. Pacifico
Library of Congress Cataloging-in-Publication Data
Spanias, Andreas.
  Audio signal processing and coding / by Andreas Spanias, Ted Painter, Venkatraman Atti.
    p. cm.
  "A Wiley-Interscience publication."
  Includes bibliographical references and index.
  ISBN 978-0-471-79147-8
  1. Coding theory. 2. Signal processing–Digital techniques. 3. Sound–Recording and reproducing–Digital techniques. I. Painter, Ted, 1967–. II. Atti, Venkatraman, 1978–. III. Title.
  TK5102.92.S73 2006
  621.382'8–dc22
2006040507
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
To
Photini, John, and Louis
Lizzy, Katie, and Lee
Srinivasan, Sudha, Kavitha, Satish, and Ammu
CONTENTS
PREFACE xv
1 INTRODUCTION 1
1.1 Historical Perspective 1
1.2 A General Perceptual Audio Coding Architecture 4
1.3 Audio Coder Attributes 5
    1.3.1 Audio Quality 6
    1.3.2 Bit Rates 6
    1.3.3 Complexity 6
    1.3.4 Codec Delay 7
    1.3.5 Error Robustness 7
1.4 Types of Audio Coders – An Overview 7
1.5 Organization of the Book 8
1.6 Notational Conventions 9
Problems 11
Computer Exercises 11
2 SIGNAL PROCESSING ESSENTIALS 13
2.1 Introduction 13
2.2 Spectra of Analog Signals 13
2.3 Review of Convolution and Filtering 16
2.4 Uniform Sampling 17
2.5 Discrete-Time Signal Processing 20
    2.5.1 Transforms for Discrete-Time Signals 20
    2.5.2 The Discrete and the Fast Fourier Transform 22
    2.5.3 The Discrete Cosine Transform 23
    2.5.4 The Short-Time Fourier Transform 23
2.6 Difference Equations and Digital Filters 25
2.7 The Transfer and the Frequency Response Functions 27
    2.7.1 Poles, Zeros, and Frequency Response 29
    2.7.2 Examples of Digital Filters for Audio Applications 30
2.8 Review of Multirate Signal Processing 33
    2.8.1 Down-sampling by an Integer 33
    2.8.2 Up-sampling by an Integer 35
    2.8.3 Sampling Rate Changes by Noninteger Factors 36
    2.8.4 Quadrature Mirror Filter Banks 36
2.9 Discrete-Time Random Signals 39
    2.9.1 Random Signals Processed by LTI Digital Filters 42
    2.9.2 Autocorrelation Estimation from Finite-Length Data 44
2.10 Summary 44
Problems 45
Computer Exercises 47
3 QUANTIZATION AND ENTROPY CODING 51
3.1 Introduction 51
    3.1.1 The Quantization–Bit Allocation–Entropy Coding Module 52
3.2 Density Functions and Quantization 53
3.3 Scalar Quantization 54
    3.3.1 Uniform Quantization 54
    3.3.2 Nonuniform Quantization 57
    3.3.3 Differential PCM 59
3.4 Vector Quantization 62
    3.4.1 Structured VQ 64
    3.4.2 Split-VQ 67
    3.4.3 Conjugate-Structure VQ 69
3.5 Bit-Allocation Algorithms 70
3.6 Entropy Coding 74
    3.6.1 Huffman Coding 77
    3.6.2 Rice Coding 81
    3.6.3 Golomb Coding 82
    3.6.4 Arithmetic Coding 83
3.7 Summary 85
Problems 85
Computer Exercises 86
4 LINEAR PREDICTION IN NARROWBAND AND WIDEBAND CODING 91
4.1 Introduction 91
4.2 LP-Based Source-System Modeling for Speech 92
4.3 Short-Term Linear Prediction 94
    4.3.1 Long-Term Prediction 95
    4.3.2 ADPCM Using Linear Prediction 96
4.4 Open-Loop Analysis-Synthesis Linear Prediction 96
4.5 Analysis-by-Synthesis Linear Prediction 97
    4.5.1 Code-Excited Linear Prediction Algorithms 100
4.6 Linear Prediction in Wideband Coding 102
    4.6.1 Wideband Speech Coding 102
    4.6.2 Wideband Audio Coding 104
4.7 Summary 106
Problems 107
Computer Exercises 108
5 PSYCHOACOUSTIC PRINCIPLES 113
5.1 Introduction 113
5.2 Absolute Threshold of Hearing 114
5.3 Critical Bands 115
5.4 Simultaneous Masking, Masking Asymmetry, and the Spread of Masking 120
    5.4.1 Noise-Masking-Tone 123
    5.4.2 Tone-Masking-Noise 124
    5.4.3 Noise-Masking-Noise 124
    5.4.4 Asymmetry of Masking 124
    5.4.5 The Spread of Masking 125
5.5 Nonsimultaneous Masking 127
5.6 Perceptual Entropy 128
5.7 Example Codec Perceptual Model: ISO/IEC 11172-3 (MPEG-1) Psychoacoustic Model 1 130
    5.7.1 Step 1: Spectral Analysis and SPL Normalization 131
    5.7.2 Step 2: Identification of Tonal and Noise Maskers 131
    5.7.3 Step 3: Decimation and Reorganization of Maskers 135
    5.7.4 Step 4: Calculation of Individual Masking Thresholds 136
    5.7.5 Step 5: Calculation of Global Masking Thresholds 138
5.8 Perceptual Bit Allocation 138
5.9 Summary 140
Problems 140
Computer Exercises 141
6 TIME-FREQUENCY ANALYSIS: FILTER BANKS AND TRANSFORMS 145
6.1 Introduction 145
6.2 Analysis-Synthesis Framework for M-band Filter Banks 146
6.3 Filter Banks for Audio Coding: Design Considerations 148
    6.3.1 The Role of Time-Frequency Resolution in Masking Power Estimation 149
    6.3.2 The Role of Frequency Resolution in Perceptual Bit Allocation 149
    6.3.3 The Role of Time Resolution in Perceptual Bit Allocation 150
6.4 Quadrature Mirror and Conjugate Quadrature Filters 155
6.5 Tree-Structured QMF and CQF M-band Banks 156
6.6 Cosine Modulated "Pseudo QMF" M-band Banks 160
6.7 Cosine Modulated Perfect Reconstruction (PR) M-band Banks and the Modified Discrete Cosine Transform (MDCT) 163
    6.7.1 Forward and Inverse MDCT 165
    6.7.2 MDCT Window Design 165
    6.7.3 Example MDCT Windows (Prototype FIR Filters) 167
6.8 Discrete Fourier and Discrete Cosine Transform 178
6.9 Pre-echo Distortion 180
6.10 Pre-echo Control Strategies 182
    6.10.1 Bit Reservoir 182
    6.10.2 Window Switching 182
    6.10.3 Hybrid, Switched Filter Banks 184
    6.10.4 Gain Modification 185
    6.10.5 Temporal Noise Shaping 185
6.11 Summary 186
Problems 188
Computer Exercises 191
7 TRANSFORM CODERS 195
7.1 Introduction 195
7.2 Optimum Coding in the Frequency Domain 196
7.3 Perceptual Transform Coder 197
    7.3.1 PXFM 198
    7.3.2 SEPXFM 199
7.4 Brandenburg-Johnston Hybrid Coder 200
7.5 CNET Coders 201
    7.5.1 CNET DFT Coder 201
    7.5.2 CNET MDCT Coder 1 201
    7.5.3 CNET MDCT Coder 2 202
7.6 Adaptive Spectral Entropy Coding 203
7.7 Differential Perceptual Audio Coder 204
7.8 DFT Noise Substitution 205
7.9 DCT with Vector Quantization 206
7.10 MDCT with Vector Quantization 207
7.11 Summary 208
Problems 208
Computer Exercises 210
8 SUBBAND CODERS 211
8.1 Introduction 211
    8.1.1 Subband Algorithms 212
8.2 DWT and Discrete Wavelet Packet Transform (DWPT) 214
8.3 Adapted WP Algorithms 218
    8.3.1 DWPT Coder with Globally Adapted Daubechies Analysis Wavelet 218
    8.3.2 Scalable DWPT Coder with Adaptive Tree Structure 220
    8.3.3 DWPT Coder with Globally Adapted General Analysis Wavelet 223
    8.3.4 DWPT Coder with Adaptive Tree Structure and Locally Adapted Analysis Wavelet 223
    8.3.5 DWPT Coder with Perceptually Optimized Synthesis Wavelets 224
8.4 Adapted Nonuniform Filter Banks 226
    8.4.1 Switched Nonuniform Filter Bank Cascade 226
    8.4.2 Frequency-Varying Modulated Lapped Transforms 227
8.5 Hybrid WP and Adapted WP/Sinusoidal Algorithms 227
    8.5.1 Hybrid Sinusoidal/Classical DWPT Coder 228
    8.5.2 Hybrid Sinusoidal/M-band DWPT Coder 229
    8.5.3 Hybrid Sinusoidal/DWPT Coder with WP Tree Structure Adaptation (ARCO) 230
8.6 Subband Coding with Hybrid Filter Bank/CELP Algorithms 233
    8.6.1 Hybrid Subband/CELP Algorithm for Low-Delay Applications 234
    8.6.2 Hybrid Subband/CELP Algorithm for Low-Complexity Applications 235
8.7 Subband Coding with IIR Filter Banks 237
Problems 237
Computer Exercise 240
9 SINUSOIDAL CODERS 241
9.1 Introduction 241
9.2 The Sinusoidal Model 242
    9.2.1 Sinusoidal Analysis and Parameter Tracking 242
    9.2.2 Sinusoidal Synthesis and Parameter Interpolation 245
9.3 Analysis/Synthesis Audio Codec (ASAC) 247
    9.3.1 ASAC Segmentation 248
    9.3.2 ASAC Sinusoidal Analysis-by-Synthesis 248
    9.3.3 ASAC Bit Allocation, Quantization, Encoding, and Scalability 248
9.4 Harmonic and Individual Lines Plus Noise Coder (HILN) 249
    9.4.1 HILN Sinusoidal Analysis-by-Synthesis 250
    9.4.2 HILN Bit Allocation, Quantization, Encoding, and Decoding 251
9.5 FM Synthesis 251
    9.5.1 Principles of FM Synthesis 252
    9.5.2 Perceptual Audio Coding Using an FM Synthesis Model 252
9.6 The Sines + Transients + Noise (STN) Model 254
9.7 Hybrid Sinusoidal Coders 255
    9.7.1 Hybrid Sinusoidal-MDCT Algorithm 256
    9.7.2 Hybrid Sinusoidal-Vocoder Algorithm 257
9.8 Summary 258
Problems 258
Computer Exercises 259
10 AUDIO CODING STANDARDS AND ALGORITHMS 263
10.1 Introduction 263
10.2 MIDI Versus Digital Audio 264
    10.2.1 MIDI Synthesizer 264
    10.2.2 General MIDI (GM) 266
    10.2.3 MIDI Applications 266
10.3 Multichannel Surround Sound 267
    10.3.1 The Evolution of Surround Sound 267
    10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268
    10.3.3 The ITU-R BS.775 5.1-Channel Configuration 268
10.4 MPEG Audio Standards 270
    10.4.1 MPEG-1 Audio (ISO/IEC 11172-3) 275
    10.4.2 MPEG-2 BC/LSF (ISO/IEC 13818-3) 279
    10.4.3 MPEG-2 NBC/AAC (ISO/IEC 13818-7) 283
    10.4.4 MPEG-4 Audio (ISO/IEC 14496-3) 289
    10.4.5 MPEG-7 Audio (ISO/IEC 15938-4) 309
    10.4.6 MPEG-21 Framework (ISO/IEC 21000) 317
    10.4.7 MPEG Surround and Spatial Audio Coding 319
10.5 Adaptive Transform Acoustic Coding (ATRAC) 319
10.6 Lucent Technologies PAC, EPAC, and MPAC 321
    10.6.1 Perceptual Audio Coder (PAC) 321
    10.6.2 Enhanced PAC (EPAC) 323
    10.6.3 Multichannel PAC (MPAC) 323
10.7 Dolby Audio Coding Standards 325
    10.7.1 Dolby AC-2, AC-2A 325
    10.7.2 Dolby AC-3/Dolby Digital/Dolby SR·D 327
10.8 Audio Processing Technology APT-x100 335
10.9 DTS – Coherent Acoustics 338
    10.9.1 Framing and Subband Analysis 338
    10.9.2 Psychoacoustic Analysis 339
    10.9.3 ADPCM – Differential Subband Coding 339
    10.9.4 Bit Allocation, Quantization, and Multiplexing 341
    10.9.5 DTS-CA Versus Dolby Digital 342
Problems 342
Computer Exercise 342
11 LOSSLESS AUDIO CODING AND DIGITAL WATERMARKING 343
11.1 Introduction 343
11.2 Lossless Audio Coding (L2AC) 344
    11.2.1 L2AC Principles 345
    11.2.2 L2AC Algorithms 346
11.3 DVD-Audio 356
    11.3.1 Meridian Lossless Packing (MLP) 358
11.4 Super-Audio CD (SACD) 358
    11.4.1 SACD Storage Format 362
    11.4.2 Sigma-Delta Modulators (SDM) 362
    11.4.3 Direct Stream Digital (DSD) Encoding 364
11.5 Digital Audio Watermarking 368
    11.5.1 Background 370
    11.5.2 A Generic Architecture for DAW 374
    11.5.3 DAW Schemes – Attributes 377
11.6 Summary of Commercial Applications 378
Problems 382
Computer Exercise 382
12 QUALITY MEASURES FOR PERCEPTUAL AUDIO CODING 383
12.1 Introduction 383
12.2 Subjective Quality Measures 384
12.3 Confounding Factors in Subjective Evaluations 386
12.4 Subjective Evaluations of Two-Channel Standardized Codecs 387
12.5 Subjective Evaluations of 5.1-Channel Standardized Codecs 388
12.6 Subjective Evaluations Using Perceptual Measurement Systems 389
    12.6.1 CIR Perceptual Measurement Schemes 390
    12.6.2 NSE Perceptual Measurement Schemes 390
12.7 Algorithms for Perceptual Measurement 391
    12.7.1 Example: Perceptual Audio Quality Measure (PAQM) 392
    12.7.2 Example: Noise-to-Mask Ratio (NMR) 396
    12.7.3 Example: Objective Audio Signal Evaluation (OASE) 399
12.8 ITU-R BS.1387 and ITU-T P.861 Standards for Perceptual Quality Measurement 401
12.9 Research Directions for Perceptual Codec Quality Measures 402
REFERENCES 405
INDEX 459
PREFACE
Audio processing and recording have been part of telecommunication and entertainment systems for more than a century. Moreover, bandwidth issues associated with audio recording, transmission, and storage occupied engineers from the very early stages in this field. A series of important technological developments paved the way from early phonographs to magnetic tape recording and, lately, the compact disk (CD) and super storage devices. In the following, we capture some of the main events and milestones that mark the history of audio recording and storage.[1]
Prototypes of phonographs appeared around 1877, and the first attempt to market cylinder-based gramophones was by the Columbia Phonograph Co. in 1889. Five years later, Marconi demonstrated the first radio transmission, which marked the beginning of audio broadcasting. The Victor Talking Machine Company, with the little Nipper dog as its trademark, was formed in 1901. The "telegraphone," a magnetic recorder for voice that used steel wire, was patented in Denmark around the end of the nineteenth century. The Odeon and His Master's Voice (HMV) labels produced and marketed music recordings in the early nineteen hundreds. The cabinet phonograph with a horn, called the "Victrola," appeared at about the same time. Diamond disk players were marketed in 1913, followed by efforts to produce sound-on-film for motion pictures. Other milestones include the first commercial radio transmission in Pittsburgh and the emergence of public address amplifiers. Electrically recorded material appeared in the 1920s, and the first sound-on-film was demonstrated in the mid-1920s by Warner Brothers. Cinema applications in the 1930s promoted advances in loudspeaker technologies, leading to the development of woofer, tweeter, and crossover network concepts. Jukeboxes for music also appeared in the 1930s. Magnetic tape recording was demonstrated in Germany in the 1930s by BASF and AEG/Telefunken. Ampex tape recorders appeared in the US in the late 1940s. The demonstration of stereo high-fidelity (Hi-Fi) sound in the late 1940s spurred the development of amplifiers, speakers, and reel-to-reel tape recorders for home use in the 1950s, both in Europe and
[Figure: Apple iPod (Courtesy of Apple Computer, Inc.). Apple iPod is a registered trademark of Apple Computer, Inc.]
the US. Meanwhile, Columbia produced the 33-rpm long-play (LP) vinyl record, while its rival RCA Victor produced the compact 45-rpm format, whose sales took off with the emergence of rock and roll music. Technological developments in the mid-1950s resulted in the emergence of compact transistor-based radios and, soon after, small tape players. In 1963, Philips introduced the compact cassette tape format with its EL3300 series portable players (marketed in the US as Norelco), which became an instant success, with accessories for home, portable, and car use. Eight-track cassettes became popular in the late 1960s, mainly for car use. The Dolby system for compact cassette noise reduction was also a landmark in the audio signal processing field. Meanwhile, FM broadcasting, which had been invented earlier, took off in the 1960s and 1970s with stereo transmissions. Helical tape-head technologies, invented in Japan in the 1960s, provided high-bandwidth recording capabilities, which enabled video tape recorders for home use in the 1970s (e.g., the VHS and Beta formats). This technology was also used in the 1980s for audio PCM stereo recording. Laser compact disk technology was introduced in 1982 and by the late 1980s became the preferred format for Hi-Fi stereo recording. Analog compact cassette players, high-quality reel-to-reel recorders, expensive turntables, and virtually all analog recording devices started fading away by the late 1980s. The launch of the digital CD audio format in
the 1980s coincided with the advent of personal computers and took over all aspects of music recording and distribution. CD playback soon dominated broadcasting, automobile, and home stereo applications, displacing the analog vinyl LP. The compact cassette formats became relics of an old era and eventually disappeared from music stores. Digital audio tape (DAT) systems, enabled by helical tape-head technology, were also introduced in the 1980s but were commercially unsuccessful because of strict copyright laws and unusually large taxes.

Parallel developments in digital video formats for laser disk technologies included work in audio compression systems. Audio compression research papers started appearing, mostly in the 1980s, at IEEE ICASSP and Audio Engineering Society conferences, by authors from several research and development labs, including Erlangen-Nuremberg University and Fraunhofer IIS, AT&T Bell Laboratories, and Dolby Laboratories. Audio compression, or audio coding, research, the art of representing an audio signal with the least number of information bits while maintaining its fidelity, went through quantum leaps in the late 1980s and 1990s. Although originally most audio compression algorithms were developed as part of the digital motion video compression standards, e.g., the MPEG series, these algorithms eventually became important as stand-alone technologies for audio recording and playback. Progress in VLSI technologies, psychoacoustics, and efficient time-frequency signal representations made possible a series of scalable real-time compression algorithms for use in audio and cinema applications. In the 1990s, we witnessed the emergence of the first products that used compressed audio formats, such as the MiniDisc (MD) and the Digital Compact Cassette (DCC). The sound and video playing capabilities of the PC and the proliferation of multimedia content through the Internet had a profound impact on audio compression technologies. The MPEG-1/-2 Layer III (MP3) algorithm became a de facto standard for Internet music downloads. Specialized web sites that feature music content changed the ways people buy and share music. Compact MP3 players appeared in the late 1990s. In the early 2000s, we had the emergence of the Apple iPod player, with a hard drive that supports the MP3 and MPEG advanced audio coding (AAC) algorithms.

In order to enhance cinematic and home theater listening experiences and deliver greater realism than ever before, audio codec designers pursued sophisticated multichannel audio coding techniques. In the mid-1990s, techniques for encoding 5.1 separate channels of audio were standardized in MPEG-2 BC and, later, MPEG-2 AAC audio. Proprietary multichannel algorithms were also developed and commercialized by Dolby Laboratories (AC-3), Digital Theater Systems (DTS), Lucent (EPAC), Sony (SDDS), and Microsoft (WMA). Dolby Labs, DTS, Lexicon, and other companies also introduced 2:N-channel upmix algorithms capable of synthesizing a multichannel surround presentation from conventional stereo content (e.g., Dolby Pro Logic II, DTS Neo:6). The human auditory system is capable of localizing sound with greater spatial resolution than current multichannel audio systems offer, and, as a result, the quest continues to achieve the ultimate spatial fidelity in sound reproduction. Research involving spatial audio, real-time acoustic source localization, binaural cue coding, and application of
head-related transfer functions (HRTF) towards rendering immersive audio has gained interest. Audiophiles appeared skeptical of the 44.1-kHz, 16-bit CD stereo format, and some were critical of the sound quality of compression formats. These ideas, along with the need for copyright protection, eventually gained momentum, and new standards and formats appeared in the early 2000s. In particular, multichannel lossless coding formats such as the DVD-Audio (DVD-A) and the Super-Audio-CD (SACD) appeared. The standardization of these storage formats provided audio codec designers with enormous storage capacity. This motivated lossless coding of digital audio.

The purpose of this book is to provide an in-depth treatment of audio compression algorithms and standards. The topic currently occupies several communities in signal processing, multimedia, and audio engineering. The intended readership for this book includes at least three groups. At the highest level, any reader with a general scientific background will be able to gain an appreciation for the heuristics of perceptual coding. Secondly, readers with a general electrical and computer engineering background will become familiar with the essential signal processing techniques and perceptual models embedded in most audio coders. Finally, undergraduate and graduate students with focuses in multimedia, DSP, and computer music will gain important knowledge in signal analysis and audio coding algorithms. The vast body of literature provided and the tutorial aspects of the book make it an asset for audiophiles as well.
Organization
This book is in part the outcome of many years of research and teaching at Arizona State University. We opted to include exercises and computer problems and hence enable instructors either to use the content in existing DSP and multimedia courses or to promote the creation of new courses with a focus on audio and speech processing and coding. The book has twelve chapters, and each chapter contains problems, proofs, and computer exercises. Chapter 1 introduces the reader to the field of audio signal processing and coding. In Chapter 2, we review basic signal processing theory and emphasize concepts relevant to audio coding. Chapter 3 describes waveform quantization and entropy coding schemes. Chapter 4 covers linear predictive coding and its utility in speech and audio coding. Chapter 5 covers psychoacoustics, and Chapter 6 explores filter bank design. Chapter 7 describes transform coding methodologies. Subband and sinusoidal coding algorithms are addressed in Chapters 8 and 9, respectively. Chapter 10 reviews several audio coding standards, including the ISO/IEC MPEG family, the cinematic Sony SDDS, the Dolby AC-3, and the DTS coherent acoustics (DTS-CA). Chapter 11 focuses on lossless audio coding and digital audio watermarking techniques. Chapter 12 provides information on subjective quality measures.
Use in Courses
For an undergraduate elective course with little or no background in DSP, the instructor can cover in detail Chapters 1, 2, 3, 4, and 5, then present select
sections of Chapter 6, and describe in an expository and qualitative manner certain basic algorithms and standards from Chapters 7–11. A graduate class in audio coding with students that have a background in DSP can start from Chapter 5 and cover in detail Chapters 6 through 11. Audio coding practitioners and researchers who are interested mostly in qualitative descriptions of the standards and information on the bibliography can start at Chapter 5 and proceed reading through Chapter 11.
Trademarks and Copyrights
Sony Dynamic Digital Sound (SDDS), ATRAC, and MiniDisc are trademarks of Sony Corporation. Dolby, Dolby Digital, AC-2, AC-3, DolbyFAX, and Dolby Pro Logic are trademarks of Dolby Laboratories. The perceptual audio coder (PAC), EPAC, and MPAC are trademarks of AT&T and Lucent Technologies. The APT-x100 is a trademark of Audio Processing Technology, Inc. The DTS-CA is a trademark of Digital Theater Systems, Inc. Apple iPod is a registered trademark of Apple Computer, Inc.
Acknowledgments
The authors have all spent time at Arizona State University (ASU), and Prof. Spanias is in fact still teaching and directing research in this area at ASU. The group of authors has worked on grants with Intel Corporation and would like to thank this organization for providing grants in scalable speech and audio coding that created opportunities for in-depth studies in these areas. Special thanks to our colleagues at Intel Corporation at that time, including Brian Mears, Gopal Nair, Hedayat Daie, Mark Walker, Michael Deisher, and Tom Gardos. We also wish to acknowledge the support of current Intel colleagues Gang Liang, Mike Rosenzweig, and Jim Zhou, as well as Scott Peirce for proofreading some of the material. Thanks also to former doctoral students at ASU, including Philip Loizou and Sassan Ahmadi, for many useful discussions in speech and audio processing. We also appreciate discussions on narrowband vocoders with Bruce Fette, in the late 1990s then with Motorola GEG and now with General Dynamics.

The authors also acknowledge the National Science Foundation (NSF) CCLI program for grants in education that supported in part the preparation of several computer examples and paradigms in psychoacoustics and signal coding. Also, some of the early work in coding by Dr. Spanias was supported by the Naval Research Laboratories (NRL), and we would like to thank that organization for providing ideas for projects that inspired future work in this area. We also wish to thank ASU and some of the faculty and administrators who provided moral and material support for work in this area. Thanks are extended to current ASU students Shibani Misra, Visar Berisha, and Mahesh Banavar for proofreading some of the material. We thank the Wiley-Interscience production team, George Telecki, Melissa Yanuzzi, and Rachel Witmer, for their diligent efforts in copyediting, cover design, and typesetting. We also thank all the anonymous reviewers for
their useful comments. Finally, we all wish to express our thanks to our families for their support.
The book content is used frequently in ASU online courses and industry short courses offered by Andreas Spanias. Contact Andreas Spanias (spanias@asu.edu, http://www.fulton.asu.edu/~spanias) for details.

[1] Resources used for obtaining important dates in recording history include web sites at the University of San Diego, Arizona State University, and Wikipedia.
CHAPTER 1
INTRODUCTION
Audio coding, or audio compression, algorithms are used to obtain compact digital representations of high-fidelity (wideband) audio signals for the purpose of efficient transmission or storage. The central objective in audio coding is to represent the signal with a minimum number of bits while achieving transparent signal reproduction, i.e., generating output audio that cannot be distinguished from the original input, even by a sensitive listener ("golden ears"). This text gives an in-depth treatment of algorithms and standards for transparent coding of high-fidelity audio.
11 HISTORICAL PERSPECTIVE
The introduction of the compact disc (CD) in the early 1980s brought to the fore all of the advantages of digital audio representation, including true high fidelity, dynamic range, and robustness. These advantages, however, came at the expense of high data rates. Conventional CD and digital audio tape (DAT) systems are typically sampled at either 44.1 or 48 kHz using pulse code modulation (PCM) with a 16-bit sample resolution. This results in uncompressed data rates of 705.6/768 kb/s for a monaural channel, or 1.41/1.54 Mb/s for a stereo pair. Although these data rates were accommodated successfully in first-generation CD and DAT players, second-generation audio players and wirelessly connected systems are often subject to bandwidth constraints that are incompatible with high data rates. Because of the success enjoyed by the first-generation
systems, however, end users have come to expect "CD-quality" audio reproduction from any digital system. Therefore, new network and wireless multimedia digital audio systems must reduce data rates without compromising reproduction quality. Motivated by the need for compression algorithms that can satisfy simultaneously the conflicting demands of high compression ratios and transparent quality for high-fidelity audio signals, several coding methodologies have been established over the last two decades. Audio compression schemes, in general, employ design techniques that exploit both perceptual irrelevancies and statistical redundancies.
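The uncompressed rates quoted above follow directly from sample rate times sample resolution times channel count. A quick sketch (the function name is ours, not from the text):

```python
def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Uncompressed PCM data rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

# CD audio: 44.1 kHz, 16-bit
assert pcm_bit_rate(44_100, 16, 1) == 705_600      # 705.6 kb/s, monaural
assert pcm_bit_rate(44_100, 16, 2) == 1_411_200    # ~1.41 Mb/s, stereo pair
# DAT: 48 kHz, 16-bit
assert pcm_bit_rate(48_000, 16, 1) == 768_000      # 768 kb/s, monaural
assert pcm_bit_rate(48_000, 16, 2) == 1_536_000    # ~1.54 Mb/s, stereo pair
```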
PCM was the primary audio encoding scheme employed until the early 1980s. PCM does not provide any mechanisms for redundancy removal. Quantization methods that exploit signal correlation, such as differential PCM (DPCM), delta modulation [Jaya76] [Jaya84], and adaptive DPCM (ADPCM), were applied to audio compression later (e.g., PC audio cards). Owing to the need for drastic reduction in bit rates, researchers began to pursue new approaches for audio coding based on the principles of psychoacoustics [Zwic90] [Moor03]. Psychoacoustic notions in conjunction with the basic properties of signal quantization have led to the theory of perceptual entropy [John88a] [John88b]. Perceptual entropy is a quantitative estimate of the fundamental limit of transparent audio signal compression. Another key contribution to the field was the characterization of the auditory filter bank and particularly the time-frequency analysis capabilities of the inner ear [Moor83]. Over the years, several filter-bank structures that mimic the critical band structure of the auditory filter bank have been proposed. A filter bank is a parallel bank of bandpass filters covering the audio spectrum, which, when used in conjunction with a perceptual model, can play an important role in the identification of perceptual irrelevancies.
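To make the correlation-exploiting idea behind DPCM concrete, here is a minimal illustrative sketch (ours, not from the text) of a first-order predictor that transmits residuals instead of samples; a practical DPCM coder would also quantize the residual, which makes the loop lossy:

```python
def dpcm_encode(samples):
    """First-order DPCM: send the prediction residual x[n] - x[n-1].
    Correlated audio yields small residuals that need fewer bits."""
    prev, residuals = 0, []
    for x in samples:
        residuals.append(x - prev)  # predictor: previous sample
        prev = x
    return residuals

def dpcm_decode(residuals):
    prev, out = 0, []
    for r in residuals:
        prev = prev + r             # rebuild sample from residual
        out.append(prev)
    return out

x = [10, 12, 13, 13, 11]            # slowly varying, i.e., correlated
r = dpcm_encode(x)                  # [10, 2, 1, 0, -2]: smaller dynamic range
assert dpcm_decode(r) == x          # exact only because residuals are unquantized
```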
During the early 1990s, several workgroups and organizations, such as the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC), the International Telecommunications Union (ITU), AT&T, Dolby Laboratories, Digital Theatre Systems (DTS), Lucent Technologies, Philips, and Sony, were actively involved in developing perceptual audio coding algorithms and standards. Some of the popular commercial standards published in the early 1990s include Dolby's Audio Coder-3 (AC-3), the DTS Coherent Acoustics (DTS-CA), Lucent Technologies' Perceptual Audio Coder (PAC), Philips' Precision Adaptive Subband Coding (PASC), and Sony's Adaptive Transform Acoustic Coding (ATRAC). Table 1.1 lists chronologically some of the prominent audio coding standards. The commercial success enjoyed by these audio coding standards triggered the launch of several multimedia storage formats.
Table 1.2 lists some of the popular multimedia storage formats since the beginning of the CD era. High-performance stereo systems became quite common with the advent of CDs in the early 1980s. A compact disc–read-only memory (CD-ROM) can store up to 700–800 MB of data in digital form as "microscopic pits" that can be read by a laser beam off of a reflective surface or medium. Three competing storage media – DAT, the digital compact cassette (DCC), and the
Table 1.1. List of perceptual and lossless audio coding standards/algorithms

Standard/algorithm: Related references
1. ISO/IEC MPEG-1 audio: [ISOI92]
2. Philips' PASC (for DCC applications): [Lokh92]
3. AT&T/Lucent PAC/EPAC: [John96c] [Sinh96]
4. Dolby AC-2: [Davi92] [Fiel91]
5. AC-3/Dolby Digital: [Davis93] [Fiel96]
6. ISO/IEC MPEG-2 (BC/LSF) audio: [ISOI94a]
7. Sony's ATRAC (MiniDisc and SDDS): [Yosh94] [Tsut96]
8. SHORTEN: [Robi94]
9. Audio processing technology – APT-x100: [Wyli96b]
10. ISO/IEC MPEG-2 AAC: [ISOI96]
11. DTS coherent acoustics: [Smyt96] [Smyt99]
12. The DVD algorithm: [Crav96] [Crav97]
13. MUSICompress: [Wege97]
14. Lossless transform coding of audio (LTAC): [Pura97]
15. AudioPaK: [Hans98b] [Hans01]
16. ISO/IEC MPEG-4 audio version 1: [ISOI99]
17. Meridian lossless packing (MLP): [Gerz99]
18. ISO/IEC MPEG-4 audio version 2: [ISOI00]
19. Audio coding based on integer transforms: [Geig01] [Geig02]
20. Direct-stream digital (DSD) technology: [Reef01a] [Jans03]
Table 1.2. Some of the popular audio storage formats

Audio storage format: Related references
1. Compact disc: [CD82] [IECA87]
2. Digital audio tape (DAT): [Watk88] [Tan89]
3. Digital compact cassette (DCC): [Lokh91] [Lokh92]
4. MiniDisc: [Yosh94] [Tsut96]
5. Digital versatile disc (DVD): [DVD96]
6. DVD-audio (DVD-A): [DVD01]
7. Super audio CD (SACD): [SACD02]
MiniDisc (MD) ndash entered the commercial market during 1987ndash1992 Intendedmainly for back-up high-density storage (sim13 GB) the DAT became the primarysource of mass data storagetransfer [Watk88] [Tan89] In 1991ndash1992 Sony pro-posed a storage medium called the MiniDisc primarily for audio storage MDemploys the ATRAC algorithm for compression In 1991 Philips introduced theDCC a successor of the analog compact cassette Philips DCC employs a com-pression scheme called the PASC [Lokh91] [Lokh92] [Hoog94] The DCC began
4 INTRODUCTION
as a potential competitor for DATs but was discontinued in 1996 The introduc-tion of the digital versatile disc (DVD) in 1996 enabled both video and audiorecordingstorage as well as text-message programming The DVD became oneof the most successful storage media With the improvements in the audio com-pression and DVD storage technologies multichannel surround sound encodingformats gained interest [Bosi93] [Holm99] [Bosi00]
With the emergence of streaming audio applications during the late 1990s, researchers pursued techniques such as combined speech and audio architectures, as well as joint source-channel coding algorithms that are optimized for the packet-switched Internet. The advent of the ISO/IEC MPEG-4 standard (1996-2000) [ISOI99] [ISOI00] established new research goals for high-quality coding of audio at low bit rates. MPEG-4 audio encompasses more functionality than perceptual coding [Koen98] [Koen99]. It comprises an integrated family of algorithms with provisions for scalable, object-based speech and audio coding at bit rates from as low as 200 b/s up to 64 kb/s per channel.
The emergence of the DVD-audio and the super audio CD (SACD) provided designers with additional storage capacity, which motivated research in lossless audio coding [Crav96] [Gerz99] [Reef01a]. A lossless audio coding system is able to reconstruct perfectly a bit-for-bit representation of the original input audio. In contrast, a coding scheme incapable of perfect reconstruction is called lossy. For most audio program material, lossy schemes offer the advantage of lower bit rates (e.g., less than 1 bit per sample) relative to lossless schemes (e.g., 10 bits per sample). Delivering real-time lossless audio content to the network browser at low bit rates is the next grand challenge for codec designers.
1.2 A GENERAL PERCEPTUAL AUDIO CODING ARCHITECTURE
Over the last few years, researchers have proposed several efficient signal models (e.g., transform-based, subband-filter structures, wavelet-packet) and compression standards (Table 1.1) for high-quality digital audio reproduction. Most of these algorithms are based on the generic architecture shown in Figure 1.1.
The coders typically segment input signals into quasi-stationary frames ranging from 2 to 50 ms. Then, a time-frequency analysis section estimates the temporal and spectral components of each frame. The time-frequency mapping is usually matched to the analysis properties of the human auditory system. Either way, the ultimate objective is to extract from the input audio a set of time-frequency parameters that is amenable to quantization according to a perceptual distortion metric. Depending on the overall design objectives, the time-frequency analysis section usually contains one of the following:
• Unitary transform
• Time-invariant bank of critically sampled, uniform/nonuniform bandpass filters
[Figure 1.1 A generic perceptual audio encoder: the input audio feeds a time-frequency analysis section and a psychoacoustic analysis section; the masking thresholds delivered by the psychoacoustic analysis drive the bit allocation; the time-frequency parameters are quantized and encoded, entropy (lossless) coded, and multiplexed with side information for transmission to the channel.]
• Time-varying (signal-adaptive) bank of critically sampled, uniform/nonuniform bandpass filters
• Harmonic/sinusoidal analyzer
• Source-system analysis (LPC and multipulse excitation)
• Hybrid versions of the above
The choice of time-frequency analysis methodology always involves a fundamental tradeoff between time and frequency resolution requirements. Perceptual distortion control is achieved by a psychoacoustic signal analysis section that estimates signal masking power based on psychoacoustic principles. The psychoacoustic model delivers masking thresholds that quantify the maximum amount of distortion at each point in the time-frequency plane such that quantization of the time-frequency parameters does not introduce audible artifacts. The psychoacoustic model therefore allows the quantization section to exploit perceptual irrelevancies. This section can also exploit statistical redundancies through classical techniques, such as DPCM or ADPCM. Once a quantized compact parametric set has been formed, the remaining redundancies are typically removed through noiseless run-length (RL) and entropy coding techniques, e.g., Huffman [Cove91], arithmetic [Witt87], or Lempel-Ziv-Welch (LZW) [Ziv77] [Welc84]. Since the output of the psychoacoustic distortion control model is signal-dependent, most algorithms are inherently variable rate. Fixed channel rate requirements are usually satisfied through buffer feedback schemes, which often introduce encoding delays.
1.3 AUDIO CODER ATTRIBUTES
Perceptual audio coders are typically evaluated based on the following attributes: audio reproduction quality, operating bit rates, computational complexity, codec delay, and channel error robustness. The objective is to attain a high-quality (transparent) audio output at low bit rates (<32 kb/s), with an acceptable algorithmic delay (~5 to 20 ms), and with low computational complexity (~1 to 10 million instructions per second, or MIPS).
1.3.1 Audio Quality
Audio quality is of paramount importance when designing an audio coding algorithm. Successful strides have been made since the development of simple near-transparent perceptual coders. Typically, classical objective measures of signal fidelity, such as the signal-to-noise ratio (SNR) and the total harmonic distortion (THD), are inadequate [Ryde96]. As the field of perceptual audio coding matured rapidly and created greater demand for listening tests, there was a corresponding growth of interest in perceptual measurement schemes. Several subjective and objective quality measures have been proposed and standardized during the last decade. Some of these schemes include the noise-to-mask ratio (NMR, 1987) [Bran87a], the perceptual audio quality measure (PAQM, 1991) [Beer91], the perceptual evaluation (PERCEVAL, 1992) [Pail92], the perceptual objective measure (POM, 1995) [Colo95], and the objective audio signal evaluation (OASE, 1997) [Spor97]. We will address these and several other quality assessment schemes in detail in Chapter 12.
1.3.2 Bit Rates
From a codec designer's point of view, one of the key challenges is to represent high-fidelity audio with a minimum number of bits. For instance, if a 5-ms audio frame sampled at 48 kHz (240 samples per frame) is represented using 80 bits, then the encoding bit rate would be 80 bits/5 ms = 16 kb/s. Low bit rates imply high compression ratios and generally low reproduction quality. Early coders, such as the ISO/IEC MPEG-1 (32-448 kb/s), the Dolby AC-3 (32-384 kb/s), the Sony ATRAC (256 kb/s), and the Philips PASC (192 kb/s), employ high bit rates for obtaining transparent audio reproduction. However, the development of several sophisticated audio coding tools (e.g., MPEG-4 audio tools) created ways for efficient transmission or storage of audio at rates between 8 and 32 kb/s. Future audio coding algorithms promise to offer reasonable quality at low rates, along with the ability to scale both rate and quality to match different requirements, such as time-varying channel capacity.
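The frame arithmetic in this example is easy to verify; the sketch below (plain Python, using the frame parameters quoted above) reproduces the 240-sample and 16-kb/s figures:

```python
fs = 48000             # sampling rate (Hz)
frame_ms = 5           # frame length (ms)
bits_per_frame = 80    # bits used to represent one frame

samples_per_frame = fs * frame_ms // 1000       # 48 samples/ms times 5 ms
bit_rate_bps = bits_per_frame * 1000 // frame_ms  # bits per second

print(samples_per_frame)     # 240
print(bit_rate_bps // 1000)  # 16 (kb/s)
```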
1.3.3 Complexity
Reduced computational complexity not only enables real-time implementation but may also decrease the power consumption and extend battery life. Computational complexity is usually measured in terms of millions of instructions per second (MIPS). Complexity estimates are processor-dependent. For example, the complexity associated with Dolby's AC-3 decoder was estimated at approximately 27 MIPS using the Zoran ZR38001 general-purpose DSP core [Vern95]; for the Motorola DSP56002 processor, the complexity was estimated at 45 MIPS [Vern95]. Usually, most of the audio codecs rely on the so-called asymmetric encoding principle. This means that the codec complexity is not evenly shared between the encoder and the decoder (typically, encoder 80% and decoder 20% complexity), with more emphasis on reducing the decoder complexity.
1.3.4 Codec Delay
Many of the network applications for high-fidelity audio (streaming audio, audio-on-demand) are delay tolerant (up to 100-200 ms), providing the opportunity to exploit long-term signal properties in order to achieve high coding gain. However, in two-way real-time communication and voice-over-Internet protocol (VoIP) applications, low-delay encoding (10-20 ms) is important. Consider the example described before, i.e., an audio coder operating on frames of 5 ms at a 48-kHz sampling frequency. In an ideal encoding scenario, the minimum amount of delay should be 5 ms at the encoder and 5 ms at the decoder (same as the frame length). However, other factors, such as the analysis-synthesis filter bank window, the look-ahead, the bit-reservoir, and the channel delay, contribute to additional delays. Employing shorter analysis-synthesis windows, avoiding look-ahead, and re-structuring the bit-reservoir functions could result in low-delay encoding, nonetheless with reduced coding efficiencies.
1.3.5 Error Robustness
The increasing popularity of streaming audio over packet-switched and wireless networks, such as the Internet, implies that any algorithm intended for such applications must be able to deal with a noisy, time-varying channel. In particular, provisions for error robustness and error protection must be incorporated at the encoder in order to achieve reliable transmission of digital audio over error-prone channels. One simple idea could be to provide better protection to the error-sensitive and priority (important) bits. For instance, the audio frame header requires the maximum error robustness; otherwise, transmission errors in the header will seriously impair the entire audio frame. Several error detecting/correcting codes [Lin82] [Wick95] [Bayl97] [Swee02] [Zara02] can also be employed. Inclusion of error correcting codes in the bitstream might help to obtain error-free reproduction of the input audio, however, with increased complexity and bit rates.
From the discussion in the previous sections, it is evident that several tradeoffs must be considered in designing an algorithm for a particular application. For this reason, audio coding standards consist of several tools that enable the design of scalable algorithms. For example, MPEG-4 provides tools to design algorithms that satisfy a variety of bit rate, delay, complexity, and robustness requirements.
1.4 TYPES OF AUDIO CODERS - AN OVERVIEW
Based on the signal model or the analysis-synthesis technique employed to encode audio signals, audio coders can be broadly classified as follows:
• Linear predictive
• Transform
• Subband
• Sinusoidal
Algorithms are also classified based on the lossy or the lossless nature of audio coding. Lossy audio coding schemes achieve compression by exploiting perceptually irrelevant information. Some examples of lossy audio coding schemes include the ISO/IEC MPEG codec series, the Dolby AC-3, and the DTS CA. In lossless audio coding, the audio data is merely "packed" to obtain a bit-for-bit representation of the original. The meridian lossless packing (MLP) [Gerz99] and the direct stream digital (DSD) techniques [Brue97] [Reef01a] form a class of high-end lossless compression algorithms that are embedded in the DVD-audio [DVD01] and the SACD [SACD02] storage formats, respectively. Lossless audio coding techniques, in general, yield high-quality digital audio without any artifacts, at high rates. For instance, perceptual audio coding yields compression ratios from 10:1 to 25:1, while lossless audio coding can achieve compression ratios from 2:1 to 4:1.
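The rate arithmetic behind these ratios is a short calculation. In the sketch below, the 44.1-kHz/16-bit figures are the CD-grade parameters used throughout the book; the 64-kb/s perceptual-coder rate and the 2:1 lossless packing are assumed illustrative values, not figures from the text:

```python
# Per-channel PCM rate for CD-grade audio: 44.1 kHz at 16 bits/sample.
pcm_rate = 44100 * 16          # 705,600 b/s

lossy_rate = 64000             # assumed perceptual-coder rate (b/s)
lossless_rate = pcm_rate / 2   # a typical ~2:1 lossless packing result

print(round(pcm_rate / lossy_rate, 1))  # 11.0, i.e. about 11:1
print(pcm_rate / lossless_rate)         # 352800.0 b/s corresponds to 2:1
```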
1.5 ORGANIZATION OF THE BOOK
This book is organized as follows. In Chapter 2, we review basic signal processing concepts associated with audio coding. Chapter 3 provides introductory material to waveform quantization and entropy coding schemes. Some of the key topics covered in this chapter include scalar quantization, uniform/nonuniform quantization, pulse code modulation (PCM), differential PCM (DPCM), adaptive DPCM (ADPCM), vector quantization (VQ), bit-allocation techniques, and entropy coding schemes (Huffman, Rice, and arithmetic).
Chapter 4 provides information on linear prediction and its application in narrow- and wideband coding. First, we address the utility of the LP analysis-synthesis approach in speech applications. Next, we describe the open-loop analysis-synthesis LP and closed-loop analysis-by-synthesis LP techniques.
In Chapter 5, psychoacoustic principles are described. Johnston's notion of perceptual entropy is presented as a measure of the fundamental limit of transparent compression for audio. The ISO/IEC 11172-3 MPEG-1 psychoacoustic analysis model 1 is used to describe the five important steps associated with the global masking threshold computation. Chapter 6 explores filter bank design issues and algorithms, with a particular emphasis placed on the modified discrete cosine transform (MDCT) that is widely used in several perceptual audio coding algorithms. Chapter 6 also addresses pre-echo artifacts and control strategies.
Chapters 7, 8, and 9 review established and emerging techniques for transparent coding of FM- and CD-quality audio signals, including several algorithms that have become international standards. Transform coding methodologies are described in Chapter 7, subband coding algorithms are addressed in Chapter 8, and sinusoidal algorithms are presented in Chapter 9. In addition to methods based on uniform bandwidth filter banks, Chapter 8 covers coding methods that utilize discrete wavelet transforms (DWT), discrete wavelet packet transforms (DWPT), and other nonuniform filter banks. Examples of hybrid algorithms that make use of more than one signal model appear throughout Chapters 7, 8, and 9.
Chapter 10 is concerned with standardization activities in audio coding. It describes coding standards and products such as the ISO/IEC MPEG family (-1 "MP1/2/3", -2, -4, -7, and -21), the Sony MiniDisc (ATRAC), the cinematic Sony SDDS, the Lucent Technologies PAC/EPAC/MPAC, the Dolby AC-2/AC-3, the Audio Processing Technology APT-x100, and the DTS-coherent acoustics (DTS-CA). Details on the MP3 and MPEG-4 AAC algorithms that are popular in Web and in handheld media applications, e.g., Apple iPod, are provided.
Chapter 11 focuses on lossless audio coding and digital audio watermarking techniques. In particular, the SHORTEN, the DVD algorithm, the MUSICompress, the AudioPaK, the C-LPAC, the LTAC, and the IntMDCT lossless coding schemes are described in detail. Chapter 11 also addresses the two popular high-end storage formats, i.e., the SACD and the DVD-Audio. The MLP and the DSD techniques for lossless audio coding are also presented.
Chapter 12 provides information on subjective quality measures for perceptual codecs. The five-point absolute and differential subjective quality scales are addressed, as well as the subjective test methodologies specified in the ITU-R Recommendation BS.1116. A set of subjective benchmarks is provided for the various standards in both stereophonic and multichannel modes to facilitate algorithm comparisons.
1.6 NOTATIONAL CONVENTIONS
Unless otherwise specified, bit rates correspond to single-channel or monaural coding throughout this text. Subjective quality measurements are specified in terms of either the five-point mean opinion score (MOS, Table 1.3) or the 41-point subjective difference grade (SDG, Chapter 12, Table 12.1). Table 1.4 lists some of the symbols and notation used in the book.
Table 1.3 Mean opinion score (MOS) scale

MOS    Perceptual quality
1      Bad
2      Poor
3      Average
4      Good
5      Excellent
Table 1.4 Symbols and notation used in the book

Symbol/notation : Description
t, n : Time index, sample index
ω, Ω : Frequency index (analog domain, discrete domain)
f (= ω/2π) : Frequency (Hz)
Fs, Ts : Sampling frequency, sampling period
x(t) ↔ X(ω) : Continuous-time Fourier transform (FT) pair
x(n) ↔ X(Ω) : Discrete-time Fourier transform (DTFT) pair
x(n) ↔ X(z) : z-transform pair
s[·] : Indicates a particular element in a coefficient array
h(n) ↔ H(Ω) : Impulse-frequency response pair of a discrete-time system
e(n) : Error/prediction residual
H(z) = B(z)/A(z) = (1 + b_1 z^{-1} + ... + b_L z^{-L}) / (1 + a_1 z^{-1} + ... + a_M z^{-M}) : Transfer function consisting of a numerator polynomial and a denominator polynomial (corresponding to b-coefficients and a-coefficients)
ŝ(n) = Σ_{i=1}^{M} a_i s(n − i) : Predicted signal
Q⟨s(n)⟩ = ŝ(n) : Quantization/approximation operator, or estimated/encoded value
s^{[·]} : Square brackets in the superscript denote recursion
s^{(·)} : Parentheses in the superscript denote time dependency
N, Nf, Nsf : Total number of samples, samples per frame, samples per subframe
log(·), ln(·), log_p(·) : Log to the base 10, log to the base e, log to the base p
E[·] : Expectation operator
ε : Mean squared error (MSE)
μ_x, σ_x² : Mean and the variance of the signal x(n)
r_xx(m) : Autocorrelation of the signal x(n)
r_xy(m) : Cross-correlation of x(n) and y(n)
R_xx(e^{jΩ}) : Power spectral density of the signal x(n)
Bit rate : Number of bits per second (b/s, kb/s, or Mb/s)
dB SPL : Decibels, sound pressure level
PROBLEMS
The objective of these introductory problems is to introduce the novice to simple relations between the sampling rate and the bit rate in PCM coded sequences.
1.1 Consider an audio signal, s(n), sampled at 44.1 kHz and digitized using (a) 8-bit, (b) 24-bit, and (c) 32-bit resolution. Compute the data rates for the cases (a)-(c). Give the number of samples within a 16-ms frame, and compute the number of bits per frame.
1.2 List some of the typical data rates (in kb/s) and sampling rates (in kHz) employed in applications such as (a) video streaming, (b) audio streaming, (c) digital audio broadcasting, (d) digital compact cassette, (e) MiniDisc, (f) DVD, (g) DVD-audio, (h) SACD, (i) MP3, (j) MP4, (k) video conferencing, and (l) cellular telephony.
COMPUTER EXERCISES
The objective of this exercise is to familiarize the reader with the handling of sound files using MATLAB and to expose the novice to the perceptual attributes of sampling rate and bit resolution.
1.3 For this computer exercise, use the MATLAB workspace ch1pb1.mat from the website.

Load the workspace ch1pb1.mat using

>> load('ch1pb1.mat')

Use the whos command to view the variables in the workspace. The data vector audio_in contains 44,100 samples of audio data. Perform the following in MATLAB:

>> wavwrite(audio_in, 44100, 16, 'pb1_aud44_16.wav')
>> wavwrite(audio_in, 10000, 16, 'pb1_aud10_16.wav')
>> wavwrite(audio_in, 44100, 8, 'pb1_aud44_08.wav')

Listen to the wave files pb1_aud44_16.wav, pb1_aud10_16.wav, and pb1_aud44_08.wav using a media player. Comment on the perceptual quality of the three wave files.
1.4 Down-sample the data vector audio_in in problem 1.3 using

>> aud_down_4 = downsample(audio_in, 4)

Use the following commands to listen to audio_in and aud_down_4. Comment on the perceptual quality of the data vectors in each of the cases below:

>> sound(audio_in, fs)
>> sound(aud_down_4, fs)
>> sound(aud_down_4, fs/4)
CHAPTER 2
SIGNAL PROCESSING ESSENTIALS
2.1 INTRODUCTION
The signal processing theory described here will be restricted only to the concepts that are relevant to audio coding. Because of the limited scope of this chapter, we provide mostly qualitative descriptions and establish only the essential mathematical formulas. First, we briefly review the basics of continuous-time (analog) signals and systems, and the methods used to characterize the frequency spectrum of analog signals. We then present the basics of analog filters and subsequently describe discrete-time signals. Coverage of the basics of discrete-time signals includes the fundamentals of transforms that represent the spectra of digital sequences and the theory of digital filters. The essentials of random and multirate signal processing are also reviewed in this chapter.
2.2 SPECTRA OF ANALOG SIGNALS
The frequency spectrum of an analog signal is described in terms of the continuous Fourier transform (CFT). The CFT of a continuous-time signal x(t) is given by

    X(\omega) = \int_{-\infty}^{\infty} x(t) e^{-j\omega t} \, dt,    (2.1)

where ω is the frequency in radians per second (rad/s). Note that ω = 2πf, where f is the frequency in Hz. The complex-valued function X(ω) describes the magnitude and phase spectrum of the signal. The inverse CFT is given by

    x(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(\omega) e^{j\omega t} \, d\omega.    (2.2)

The inverse CFT is also known as the synthesis formula, because it describes the time-domain signal x(t) in terms of complex sinusoids. In CFT theory, x(t) and X(ω) are called a transform pair, i.e.,

    x(t) \leftrightarrow X(\omega).    (2.3)

[Figure 2.1 The pulse-sinc CFT pair: the rectangular pulse x(t) = 1 for −T0/2 ≤ t ≤ T0/2 (0 otherwise) has the CFT X(ω) = T0 sinc(ωT0/2).]

[Figure 2.2 CFT of a sinusoid and a truncated sinusoid: the line spectrum X(ω) = π(δ(ω − ω0) + δ(ω + ω0)) of x(t) = cos(ω0t), ω0 = 2π/T0, is smeared into shifted sinc functions when x(t) is multiplied by the rectangular window w(t) = 1 for 0 ≤ t ≤ T0, whose CFT is W(ω) = T0 e^{−jωT0/2} sinc(ωT0/2).]
The pulse-sinc pair shown in Figure 2.1 is useful in explaining the effects of time-domain truncation on the spectra. For example, when a sinusoid is truncated, then there is loss of resolution and spectral leakage, as shown in Figure 2.2.
In real-life signal processing, all signals have finite length, and hence time-domain truncation always occurs. The truncation of an audio segment by a rectangular window is shown in Figure 2.3. To smooth out frame transitions and control spectral leakage effects, the signal is often tapered prior to truncation using window functions, such as the Hamming, the Bartlett, and the trapezoidal windows. A tapered window avoids the sharp discontinuities at the edges of the truncated time-domain frame. This, in turn, reduces the spectral leakage in the frequency spectrum of the truncated signal. This reduction of spectral leakage is attributed to the reduced level of the sidelobes associated with tapered windows. The reduced sidelobe effects come at the expense of a modest loss of spectral resolution. An audio segment formed using a Hamming window is shown in Figure 2.4.

[Figure 2.3 (a) Audio signal x(t) and a rectangular window w(t) (shown in dashed line), (b) truncated audio signal xw(t), and (c) frequency-domain representation Xw(ω) of the truncated audio.]

[Figure 2.4 (a) Audio signal x(t) and a Hamming window w(t) (shown in dashed line), (b) truncated audio signal xw(t), and (c) frequency-domain representation Xw(ω) of the truncated audio.]
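The leakage reduction from tapering can also be seen numerically. The sketch below (plain Python; the window length N = 64 and the probe frequency 0.5 rad are illustrative assumptions) evaluates the DTFT magnitude of a rectangular and a Hamming window at a frequency well away from the mainlobe:

```python
import cmath
import math

def dtft_mag(x, omega):
    # |X(e^{jw})| of a finite-length sequence, evaluated at one frequency
    return abs(sum(xn * cmath.exp(-1j * omega * n) for n, xn in enumerate(x)))

N = 64
rect = [1.0] * N
hamming = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

# Sidelobe leakage: magnitude away from the mainlobe, normalized by the
# peak (DC) response of each window.
leak_rect = dtft_mag(rect, 0.5) / dtft_mag(rect, 0.0)
leak_hamming = dtft_mag(hamming, 0.5) / dtft_mag(hamming, 0.0)

print(20 * math.log10(leak_rect))     # rectangular: only tens of dB down
print(20 * math.log10(leak_hamming))  # Hamming: substantially lower leakage
```

The exact dB values depend on N and on the probe frequency, but the tapered window's leakage is consistently far below the rectangular window's.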
2.3 REVIEW OF CONVOLUTION AND FILTERING
A linear time-invariant (LTI) system configuration is shown in Figure 2.5. A linear filter satisfies the property of generalized superposition, and hence its output y(t) is the convolution of the input x(t) with the filter impulse response h(t). Mathematically, convolution is represented by the integral in Eq. (2.4):

    y(t) = \int_{-\infty}^{\infty} h(\tau) x(t - \tau) \, d\tau = h(t) * x(t).    (2.4)
[Figure 2.5 A linear time-invariant (LTI) system and convolution operation: the input x(t) passes through the LTI system h(t) to produce y(t) = x(t) * h(t).]
[Figure 2.6 A simple RC low-pass filter: H(ω) = 1/(1 + jωRC), with the magnitude response |H(ω)|² plotted in dB versus frequency for RC = 1.]
The symbol * between the impulse response h(t) and the input x(t) is often used to denote the convolution operation.
The CFT of the impulse response h(t) is the frequency response of the filter, i.e.,

    h(t) \leftrightarrow H(\omega).    (2.5)
As an example for reviewing these fundamental concepts in linear systems, we present in Figure 2.6 a simple first-order RC circuit that corresponds to a low-pass filter. The impulse response for this RC filter is a decaying exponential, and its frequency response is given by a simple first-order rational function, H(ω). This function is complex-valued, and its magnitude represents the gain of the filter with respect to frequency at steady state. If a sinusoidal signal drives the linear filter, the steady-state output is also a sinusoid with the same frequency. However, its amplitude is scaled and phase is shifted in a manner consistent with the magnitude and phase of the frequency response function, respectively.
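For the RC filter of Figure 2.6, the gain |H(ω)| = 1/√(1 + (ωRC)²) follows directly from H(ω) = 1/(1 + jωRC). A small sketch (assuming RC = 1, as in the figure) confirms the familiar −3 dB gain at the cutoff frequency ω = 1/RC:

```python
import math

def rc_lowpass_gain(omega, rc=1.0):
    # |H(w)| for H(w) = 1 / (1 + j*w*RC)
    return 1.0 / math.sqrt(1.0 + (omega * rc) ** 2)

# At the cutoff w = 1/RC the gain is 1/sqrt(2), i.e. about -3 dB, and the
# gain falls monotonically with frequency (low-pass behavior).
cutoff_gain_db = 20 * math.log10(rc_lowpass_gain(1.0, rc=1.0))
print(round(cutoff_gain_db, 2))  # -3.01
```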
2.4 UNIFORM SAMPLING
In all of our subsequent discussions, we will be treating audio signals and associated systems in discrete time. The rules for uniform sampling of analog speech/audio are provided by the sampling theorem [Shan48]. This theorem states that a signal that is strictly bandlimited to a bandwidth of B rad/s can be uniquely represented by its sampled values spaced at uniform intervals that are not more than π/B seconds apart. In other words, if we denote the sampling period as Ts, then the sampling theorem states that Ts ≤ π/B. In the frequency domain, and with the sampling frequency defined as ωs = 2πfs = 2π/Ts, this condition can be stated as

    \omega_s \geq 2B \ \text{(rad/s)} \quad \text{or} \quad f_s \geq \frac{B}{\pi}.    (2.6)
Mathematically, the sampling process is represented by time-domain multiplication of the analog signal x(t) with an impulse train p(t), as shown in Figure 2.7.
Since multiplication in time is convolution in frequency, the CFT of the sampled signal xs(t) corresponds to the CFT of the original analog signal x(t) convolved with the CFT of the impulse train p(t). The CFT of the impulse train is also a train of uniformly spaced impulses in frequency, spaced 1/Ts Hz apart. The CFT of the sampled signal is therefore a periodic extension of the CFT of the analog signal, as shown in Figure 2.8. In Figure 2.8, the analog signal was considered to be ideally bandlimited, and the sampling frequency ωs was chosen to be more than 2B to avoid aliasing. The CFT of the sampled signal is given by

    X_s(\omega) = \frac{1}{T_s} \sum_{k=-\infty}^{\infty} X(\omega - k\omega_s).    (2.7)

[Figure 2.7 Uniform sampling of analog signals: the analog signal x(t) is multiplied by the impulse train p(t) to produce the discrete signal xs(t).]

[Figure 2.8 Spectrum of ideally bandlimited and uniformly sampled signals: X(ω), bandlimited to ±B, is repeated periodically at multiples of ωs in Xs(ω), scaled by 1/Ts.]
Note that the spectrum of the sampled signal in Figure 2.8 is such that an ideal low-pass filter (LPF) can recover the baseband of the signal and hence perfectly reconstruct the analog signal from the digital signal. The reconstruction process is shown in Figure 2.9. This reconstruction LPF essentially interpolates between the samples and reproduces the analog signal from the digital signal. The interpolation process becomes evident once the filtering operation is interpreted in the time domain as convolution. Reconstruction occurs by interpolating with the sinc function, which is the impulse response of the ideal low-pass filter. The reconstruction process for ωs = 2B is given by
    x(t) = \sum_{n=-\infty}^{\infty} x(nT_s) \, \mathrm{sinc}(B(t - nT_s)).    (2.8)
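A truncated version of Eq. (2.8) can be evaluated directly. The sketch below (plain Python; the 1-rad/s cosine test signal, the assumed bandwidth B = 2 rad/s, and the 2000-term truncation are illustrative choices) recovers a value of the signal between its critically spaced samples:

```python
import math

def sinc(x):
    # Unnormalized sinc: sin(x)/x, with sinc(0) = 1
    return 1.0 if x == 0.0 else math.sin(x) / x

def reconstruct(x_of, ts, b, t, n_terms=2000):
    # Truncated form of Eq. (2.8): x(t) = sum_n x(nTs) sinc(B(t - nTs))
    return sum(x_of(n * ts) * sinc(b * (t - n * ts))
               for n in range(-n_terms, n_terms + 1))

B = 2.0           # assumed bandwidth (rad/s) of the test signal below
Ts = math.pi / B  # critical sampling: ws = 2B

def x(t):
    return math.cos(t)  # a 1-rad/s cosine, bandlimited to B

error = abs(reconstruct(x, Ts, B, 0.3) - x(0.3))
print(error < 1e-2)  # True: the truncated sum is already very close
```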
Note that if the sampling frequency is less than 2B, then aliasing will occur, and therefore the signal can no longer be reconstructed perfectly. Figure 2.10 illustrates aliasing.
In real-life applications, the analog signal is not ideally bandlimited, and the sampling process is not perfect, i.e., sampling pulses have finite amplitude and finite duration. Therefore, some level of aliasing is always present. To reduce aliasing, the signal is prefiltered by an anti-aliasing low-pass filter and usually over-sampled (ωs > 2B). The degree of over-sampling depends also on the choice of the analog anti-aliasing filter. For high-quality reconstruction and modest over-sampling, the anti-aliasing filter must have good rejection characteristics. On the other hand, over-sampling by a large factor relaxes the requirements on the analog anti-aliasing filter and hence simplifies analog hardware, at the expense of a higher data rate. Nowadays, over-sampling is practiced often, even in high-fidelity systems. In fact, the use of inexpensive Sigma-Delta (Σ-Δ) analog-to-digital (A/D) converters in conjunction with down-sampling in the digital domain is a common practice. Details on Σ-Δ A/D conversion and some over-sampling schemes tailored for high-fidelity audio will be presented in Chapter 11. Standard sampling rates for the different grades of speech and audio are given in Table 2.1.

[Figure 2.9 Reconstruction (interpolation) using a low-pass filter: an ideal reconstruction low-pass filter applied to Xs(ω) recovers the baseband and interpolates between the samples.]

[Figure 2.10 Aliasing when ωs < 2B: the shifted spectral replicas in Xs(ω) overlap.]

Table 2.1 Sampling rates and bandwidth specifications

Format                      Bandwidth    Sampling frequency
Telephony                   3.2 kHz      8 kHz
Wideband audio              7 kHz        16 kHz
High-fidelity CD            20 kHz       44.1 kHz
Digital audio tape (DAT)    20 kHz       48 kHz
Super audio CD (SACD)       100 kHz      2.8224 MHz
DVD audio (DVD-A)           96 kHz       44.1, 48, 88.2, 96, 176.4, or 192 kHz
2.5 DISCRETE-TIME SIGNAL PROCESSING
Audio coding algorithms operate on a quantized discrete-time signal. Prior to compression, most algorithms require that the audio signal is acquired with high-fidelity characteristics. In typical standardized algorithms, audio is assumed to be bandlimited at 20 kHz, sampled at 44.1 kHz, and quantized at 16 bits per sample. In the following discussion, we will treat audio as a sequence, i.e., as a stream of numbers denoted x(n) = x(t)|_{t = nT_s}. Initially, we will review the discrete-time signal processing concepts without considering further aliasing and quantization effects. Quantization effects will be discussed later, during the description of specific coding algorithms.
2.5.1 Transforms for Discrete-Time Signals
Discrete-time signals are described in the transform domain using the z-transform and the discrete-time Fourier transform (DTFT). These two transformations have similar roles as the Laplace transform and the CFT for analog signals, respectively. The z-transform is defined as

    X(z) = \sum_{n=-\infty}^{\infty} x(n) z^{-n},    (2.9)
where z is a complex variable. Note that if the z-transform is evaluated on the unit circle, i.e., for

    z = e^{j\Omega}, \quad \Omega = 2\pi f T_s,    (2.10)

then the z-transform becomes the discrete-time Fourier transform (DTFT). The DTFT is given by

    X(e^{j\Omega}) = \sum_{n=-\infty}^{\infty} x(n) e^{-j\Omega n}.    (2.11)
The DTFT is discrete in time and continuous in frequency. As expected, the frequency spectrum associated with the DTFT is periodic, with period 2π rad.
Example 2.1

Consider the DTFT of a finite-length pulse:

    x(n) = 1, for 0 ≤ n ≤ N − 1
         = 0, else.

Using geometric series results and trigonometric identities on the DTFT sum,

    X(e^{j\Omega}) = \sum_{n=0}^{N-1} e^{-j\Omega n} = \frac{1 - e^{-j\Omega N}}{1 - e^{-j\Omega}} = e^{-j\Omega(N-1)/2} \frac{\sin(\Omega N/2)}{\sin(\Omega/2)}.    (2.12)

The ratio of sinusoidal functions in Eq. (2.12) is known as the Dirichlet function, or as a digital sinc function. Figure 2.11 shows the DTFT of a finite-length pulse. The digital sinc is quite similar to the continuous-time sinc function, except that it is periodic with period 2π and has a finite number of sidelobes within a period.

[Figure 2.11 DTFT of a sampled pulse for Example 2.1: the digital sinc magnitude |X(e^{jΩ})| versus normalized frequency Ω, for N = 8 (dashed line) and N = 16 (solid line).]
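The closed form in Eq. (2.12) can be checked against a term-by-term evaluation of the DTFT sum; a small sketch (plain Python; N = 16 and Ω = 0.3 rad are arbitrary choices):

```python
import cmath
import math

def pulse_dtft_direct(n_len, omega):
    # Term-by-term DTFT sum: X(e^{jw}) = sum_{n=0}^{N-1} e^{-jwn}
    return sum(cmath.exp(-1j * omega * n) for n in range(n_len))

def pulse_dtft_closed(n_len, omega):
    # Closed form of Eq. (2.12): e^{-jw(N-1)/2} sin(wN/2) / sin(w/2)
    if omega == 0.0:
        return complex(n_len)
    return (cmath.exp(-1j * omega * (n_len - 1) / 2)
            * math.sin(omega * n_len / 2) / math.sin(omega / 2))

d = pulse_dtft_direct(16, 0.3)
c = pulse_dtft_closed(16, 0.3)
print(abs(d - c) < 1e-9)  # True: the geometric-series closed form matches
```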
2.5.2 The Discrete and the Fast Fourier Transform

A computational tool for Fourier transforms is developed by starting from the DTFT analysis expression (2.11) and considering a finite-length signal consisting of N points, i.e.,

X(e^{jΩ}) = Σ_{n=0}^{N−1} x(n) e^{−jΩn}.   (2.13)

Furthermore, the frequency-domain signal is sampled uniformly at N points within one period, Ω = 0 to 2π, i.e.,

Ω ⇒ Ω_k = (2π/N) k, k = 0, 1, ..., N − 1.   (2.14)

With the sampling in the frequency domain, the Fourier sum of Eq. (2.13) becomes

X(e^{jΩ_k}) = Σ_{n=0}^{N−1} x(n) e^{−jΩ_k n}.   (2.15)

It is typical in the DSP literature to replace Ω_k with the frequency index k, and hence Eq. (2.15) can be written as

X(k) = Σ_{n=0}^{N−1} x(n) e^{−j2πkn/N}, k = 0, 1, 2, ..., N − 1.   (2.16)

The expression in (2.16) is called the discrete Fourier transform (DFT). Note that the sampling in the frequency domain forces periodicity in the time domain, i.e., x(n) = x(n + N). We also have periodicity in the frequency domain, X(k) = X(k + N), because the signal in the time domain is also discrete. These periodicities create circular effects when convolution is performed by frequency-domain multiplication, i.e.,

x(n) ⊗ h(n) ↔ X(k)H(k),   (2.17)
where

x(n) ⊗ h(n) = Σ_{m=0}^{N−1} h(m) x((n − m) mod N).   (2.18)

The symbol ⊗ stands for circular or periodic convolution, and mod N implies modulo-N subtraction of indices. The DFT is a one-to-one transformation whose basis functions are orthogonal. With the proper normalization, the DFT matrix can be written as a unitary matrix. The N-point inverse DFT (IDFT) is written as

x(n) = (1/N) Σ_{k=0}^{N−1} X(k) e^{j2πkn/N}, n = 0, 1, 2, ..., N − 1.   (2.19)

The DFT transform pair is represented by the following notation:

x(n) ↔ X(k).   (2.20)

The DFT can be computed efficiently using the fast Fourier transform (FFT). The FFT takes advantage of redundancies in the DFT sum by decimating the sequence into subsequences with even and odd indices. It can be shown that if N is a radix-2 integer, the N-point DFT can be computed using a series of butterfly stages. The complexity associated with the DFT algorithm is of the order of N² computations. In contrast, the number of computations associated with the FFT algorithm is roughly of the order of N log₂ N. This is a significant reduction in computational complexity, and FFTs are almost always used in lieu of a DFT.
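Equation (2.16) can be implemented directly as an O(N²) matrix-vector product and checked against an FFT routine. A minimal NumPy sketch (the signal and the length N = 64 are arbitrary choices):

```python
import numpy as np

N = 64
rng = np.random.default_rng(1)
x = rng.standard_normal(N)

# O(N^2) DFT of Eq. (2.16): X(k) = sum_n x(n) e^{-j 2 pi k n / N}
n = np.arange(N)
W = np.exp(-2j * np.pi * np.outer(n, n) / N)   # N x N DFT matrix
X_dft = W @ x

# O(N log2 N) FFT of the same sequence
X_fft = np.fft.fft(x)

assert np.allclose(X_dft, X_fft)   # both compute the same transform
```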
2.5.3 The Discrete Cosine Transform

The discrete cosine transform (DCT) of x(n) can be defined as

X(k) = c(k) √(2/N) Σ_{n=0}^{N−1} x(n) cos[(π/N)(n + 1/2)k], 0 ≤ k ≤ N − 1,   (2.21)

where c(0) = 1/√2 and c(k) = 1 for 1 ≤ k ≤ N − 1. Depending on the periodicity and the symmetry of the input signal x(n), the DCT can be computed using different orthonormal transforms (usually DCT-1, DCT-2, DCT-3, and DCT-4). More details on the DCT and the modified DCT (MDCT) [Malv91] are given in Chapter 6.
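With the scaling in Eq. (2.21), the N × N DCT matrix is orthonormal, so C Cᵀ = I and the inverse transform is simply Cᵀ. A quick NumPy check (N = 8 is an arbitrary choice):

```python
import numpy as np

N = 8
k = np.arange(N).reshape(-1, 1)   # frequency index (column)
n = np.arange(N).reshape(1, -1)   # time index (row)

# DCT basis of Eq. (2.21): C[k, n] = c(k) sqrt(2/N) cos((pi/N)(n + 1/2) k)
c = np.where(k == 0, 1.0 / np.sqrt(2.0), 1.0)
C = c * np.sqrt(2.0 / N) * np.cos(np.pi / N * (n + 0.5) * k)

# Orthonormality: forward and inverse transforms are transposes of each other
assert np.allclose(C @ C.T, np.eye(N))

# Round trip on an arbitrary signal
x = np.arange(N, dtype=float)
assert np.allclose(C.T @ (C @ x), x)
```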
2.5.4 The Short-Time Fourier Transform

Spectral analysis of nonstationary signals cannot be accommodated by the classical Fourier transform, since the signal has time-varying characteristics. Instead, a time-frequency transformation is required. Time-varying spectral analysis [Silv74] [Alle77] [Port81a] can be performed using the short-time Fourier transform (STFT). The analysis expression for the STFT is given by

X(n, Ω) = Σ_{m=−∞}^{∞} x(m) h(n − m) e^{−jΩm} = h(n) ∗ [x(n) e^{−jΩn}],   (2.22)

Figure 2.12 The k-th channel of the analysis-synthesis filterbank (after [Rabi78]).
where Ω = ωT = 2πf T is the normalized frequency in radians, and h(n) is the sliding analysis window. The synthesis expression (inverse transform) is given by

h(n − m) x(m) = (1/2π) ∫_{−π}^{π} X(n, Ω) e^{jΩm} dΩ.   (2.23)

Note that if n = m and h(0) = 1 [Rabi78] [Port80], then x(n) can be obtained from Eq. (2.23). The basic assumption in this type of analysis-synthesis is that the signal is slowly time-varying and can be modeled by its short-time spectrum. The temporal and spectral resolution of the STFT are controlled by the length and shape of the analysis window. For speech and audio signals, the length of the window is often constrained to be about 5–20 ms, and hence spectral resolution is sacrificed. The sequence h(n) can also be viewed as the impulse response of an LTI filter, which is excited by a frequency-shifted signal (see Eq. (2.22)). The latter leads to the filter-bank interpretation of the STFT, i.e., a discrete frequency variable Ω_k = k(ΔΩ), k = 0, 1, ..., N − 1, is used, with ΔΩ and N chosen such that the speech band is covered. Then the analysis expression is written as
X(n, Ω_k) = Σ_{m=−∞}^{∞} x(m) h(n − m) e^{−jΩ_k m} = h(n) ∗ [x(n) e^{−jΩ_k n}],   (2.24)

and the synthesis expression is

x_STFT(n) = Σ_{k=0}^{N−1} X(n, Ω_k) e^{jΩ_k n},   (2.25)

where x_STFT(n) is the signal reconstructed within the band of interest. If h(n) and N are chosen carefully [Scha75], the reconstruction by Eq. (2.25) can be exact. The k-th channel analysis-synthesis scheme is depicted in Figure 2.12, where h_k(n) = h(n)e^{jΩ_k n}.
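The analysis-synthesis pair (2.24)–(2.25) can be exercised directly. The sketch below (a NumPy illustration; the window length, the N = 64 channels, the 1/N normalization of the synthesis sum, and the test signal are all choices made here, not prescriptions from the text) uses a rectangular window with h(0) = 1 and support shorter than N, for which the synthesis sum returns x(n) exactly at the analysis instant.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(200)

N = 64                  # number of channels, Omega_k = 2 pi k / N
Lw = 32                 # window support (< N), with h(0) = 1
h = np.ones(Lw)         # rectangular analysis window on 0..Lw-1

n0 = 100                # analysis instant
k = np.arange(N)
m = n0 - np.arange(Lw)  # the samples where h(n0 - m) is non-zero

# Analysis, Eq. (2.24): X(n0, Omega_k) = sum_m x(m) h(n0 - m) e^{-j Omega_k m}
Xk = (x[m] * h[n0 - m]) @ np.exp(-2j * np.pi * np.outer(m, k) / N)

# Synthesis, Eq. (2.25), normalized here by 1/N
x_rec = (Xk @ np.exp(2j * np.pi * k * n0 / N)) / N

# Since the window support is shorter than N and h(0) = 1, x(n0) is recovered
assert np.isclose(x_rec.real, x[n0]) and abs(x_rec.imag) < 1e-9
```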
2.6 DIFFERENCE EQUATIONS AND DIGITAL FILTERS

Digital filters are characterized by difference equations of the form

y(n) = Σ_{i=0}^{L} b_i x(n − i) − Σ_{i=1}^{M} a_i y(n − i).   (2.26)
In the input-output difference equation above, the output y(n) is given as a linear combination of present and past inputs minus a linear combination of past outputs (the feedback term). The parameters a_i and b_i are the filter coefficients, or filter taps, and they control the frequency response characteristics of the digital filter. Filter coefficients are programmable and can be made adaptive (time-varying). A direct-form realization of the digital filter is shown in Figure 2.13.

The filter in Eq. (2.26) is referred to as an infinite impulse response (IIR) filter. The impulse response h(n) of the filter shown in Figure 2.13 is given by
h(n) = Σ_{i=0}^{L} b_i δ(n − i) − Σ_{i=1}^{M} a_i h(n − i).   (2.27)

The IIR classification stems from the fact that, when the feedback coefficients are non-zero, the impulse response is infinitely long. In a statistical signal representation, Eq. (2.26) is referred to as a time-series model. That is, if the input of this filter is white noise, then y(n) is called an autoregressive moving average (ARMA) process. The feedback coefficients a_i are chosen such that the filter is stable, i.e., a bounded input gives a bounded output (BIBO). An input-output equation of a causal filter can also be written in terms of the impulse response of the filter, i.e.,

y(n) = Σ_{m=0}^{∞} h(m) x(n − m).   (2.28)
Figure 2.13 Direct-form realization of an IIR digital filter.

The impulse response of the filter is associated with its coefficients and can be computed explicitly by programming the difference equation. It can also be obtained in closed form by solving the difference equation.
Example 2.2

Consider the first-order IIR filter shown in Figure 2.14. The difference equation of this digital filter is given by

y(n) = 0.2x(n) + 0.8y(n − 1).

The coefficients are b₀ = 0.2 and a₁ = −0.8. The impulse response of this filter is given by

h(n) = 0.2δ(n) + 0.8h(n − 1).

The impulse response can be determined in closed form by solving the above difference equation. Note that h(0) = 0.2 and h(1) = 0.16. Therefore, the closed-form expression for the impulse response is h(n) = 0.2(0.8)ⁿu(n). Note also that this first-order IIR filter is BIBO stable because

Σ_{n=−∞}^{∞} |h(n)| < ∞.   (2.29)
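The closed-form impulse response of Example 2.2 can be verified by iterating the difference equation with a unit-impulse input; a brief NumPy sketch (the truncation length is an arbitrary choice):

```python
import numpy as np

Nh = 50
h = np.zeros(Nh)
x = np.zeros(Nh)
x[0] = 1.0   # unit impulse delta(n)

# Iterate y(n) = 0.2 x(n) + 0.8 y(n-1) with the impulse as input
for n in range(Nh):
    h[n] = 0.2 * x[n] + (0.8 * h[n - 1] if n > 0 else 0.0)

# Closed form: h(n) = 0.2 (0.8)^n u(n)
h_closed = 0.2 * 0.8 ** np.arange(Nh)
assert np.allclose(h, h_closed)

# BIBO stability: the absolute sum converges to 0.2 / (1 - 0.8) = 1
assert np.isclose(np.abs(h_closed).sum(), 1.0, atol=1e-4)
```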
Digital filters with a finite-length impulse response (FIR) are realized by setting the feedback coefficients a_i = 0 for i = 1, 2, ..., M. FIR filters (Figure 2.15) are inherently BIBO stable because their impulse response is always absolutely summable. The output of an FIR filter is a weighted moving average of the input. The simplest FIR filter is the so-called averaging filter, which is used in some simple estimation applications. The input-output equation of the averaging filter is given by

y(n) = (1/(L + 1)) Σ_{i=0}^{L} x(n − i).   (2.30)
Figure 2.14 A first-order IIR digital filter.

Figure 2.15 An FIR digital filter.

The impulse response of this filter is equal to h(n) = 1/(L + 1) for n = 0, 1, ..., L. The frequency response of the averaging filter is the DTFT of its impulse response h(n). Therefore, the frequency responses of averaging filters for L = 7 and 15 are normalized versions of the DTFT spectra shown in Figure 2.11(a) and Figure 2.11(b), respectively.
2.7 THE TRANSFER AND THE FREQUENCY RESPONSE FUNCTIONS

The z-transform of the impulse response of a filter is called the transfer function and is given by

H(z) = Σ_{n=−∞}^{∞} h(n) z^{−n}.   (2.31)

Considering the difference equation, we can also obtain the transfer function in terms of the filter parameters, i.e.,

X(z) (Σ_{i=0}^{L} b_i z^{−i}) = Y(z) (1 + Σ_{i=1}^{M} a_i z^{−i}).   (2.32)

The ratio of output over input in the z-domain gives the transfer function in terms of the filter coefficients:

H(z) = Y(z)/X(z) = (b₀ + b₁z^{−1} + ··· + b_L z^{−L}) / (1 + a₁z^{−1} + ··· + a_M z^{−M}).   (2.33)

For an FIR filter, the transfer function is given by

H(z) = Σ_{i=0}^{L} b_i z^{−i}.   (2.34)

The frequency response function is a special case of the transfer function of the filter. That is, for z = e^{jΩ},

H(e^{jΩ}) = Σ_{n=−∞}^{∞} h(n) e^{−jΩn}.
By considering the difference equation associated with the LTI digital filter, the frequency response can be written as the ratio of two polynomials, i.e.,

H(e^{jΩ}) = (b₀ + b₁e^{−jΩ} + b₂e^{−j2Ω} + ··· + b_L e^{−jLΩ}) / (1 + a₁e^{−jΩ} + a₂e^{−j2Ω} + ··· + a_M e^{−jMΩ}).

Note that for an FIR filter the frequency response becomes

H(e^{jΩ}) = b₀ + b₁e^{−jΩ} + b₂e^{−j2Ω} + ··· + b_L e^{−jLΩ}.
Example 2.3

Frequency responses of four different first-order filters are shown in Figure 2.16. The frequency responses in Figure 2.16 are plotted up to the fold-over frequency, which is half the sampling frequency. Note from Figure 2.16 that low-pass and high-pass filters can be realized as either FIR or IIR filters. The location of the root of the polynomial of the FIR filter determines where the notch in the frequency response occurs. Therefore, in the top two plots, which correspond to the FIR filters, the low-pass filter (top left), H(z) = 1 + z^{−1}, has a notch at π rad (zero at z = −1), while the high-pass filter, H(z) = 1 − z^{−1}, has a notch at 0 rad (zero at z = 1). The bottom two IIR filters, H(z) = 1/(1 − 0.9z^{−1}) and H(z) = 1/(1 + 0.9z^{−1}), have a pole at z = 0.9 (peak at 0 rad) and at z = −0.9 (peak at π rad) for the low-pass and high-pass filters, respectively.

Figure 2.16 Frequency responses of first-order FIR and IIR digital filters.
2.7.1 Poles, Zeros, and Frequency Response

A z-domain function H(z) can be written in terms of its poles and zeros as follows:

H(z) = G [(z − ζ₁)(z − ζ₂)···(z − ζ_L)] / [(z − p₁)(z − p₂)···(z − p_M)] = G (∏_{i=1}^{L} (z − ζ_i)) / (∏_{i=1}^{M} (z − p_i)),   (2.35)

where ζ_i and p_i are the zeros and poles of H(z), respectively, and G is a constant. The locations of the poles and zeros affect the shape of the frequency response. The magnitude of the frequency response can be written as

|H(e^{jΩ})| = G (∏_{i=1}^{L} |e^{jΩ} − ζ_i|) / (∏_{i=1}^{M} |e^{jΩ} − p_i|).   (2.36)

It is therefore evident that when an isolated zero is close to the unit circle, the magnitude frequency response will assume a small value at that frequency. When an isolated pole is close to the unit circle, it will give rise to a peak in the magnitude frequency response at that frequency. In speech processing, the presence of poles in z-domain representations of the vocal tract has been associated with the speech formants [Rabi78]. In fact, formant synthesizers use the pole locations to form synthesis filters for certain phonemes. On the other hand, the presence of zeros has been associated with the coupling of the nasal tract. For example, zeros are associated with nasal sounds such as /m/ and /n/ [Span94].
Example 2.4

For the second-order system below, find the poles and zeros, give a z-domain diagram with the poles and zeros, and sketch the frequency response:

H(z) = (1 − 1.3435z^{−1} + 0.9025z^{−2}) / (1 − 0.45z^{−1} + 0.55z^{−2}).

The poles and zeros appear in conjugate pairs because the coefficients of H(z) are real-valued:

H(z) = [(z − 0.95e^{j45°})(z − 0.95e^{−j45°})] / [(z − 0.7416e^{j72.34°})(z − 0.7416e^{−j72.34°})].

The pole-zero diagram and the frequency response are shown in Figure 2.17. Poles give rise to spectral peaks and zeros create spectral valleys in the magnitude of the frequency response. The symmetry around π is due to the fact that the roots appear in complex conjugate pairs.
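The factored form in Example 2.4 can be confirmed numerically with `numpy.roots`:

```python
import numpy as np

# Numerator and denominator of H(z) in Example 2.4 (powers z^0, z^-1, z^-2)
zeros = np.roots([1, -1.3435, 0.9025])
poles = np.roots([1, -0.45, 0.55])

# Zeros: 0.95 e^{+/- j 45 deg}
assert np.allclose(np.abs(zeros), 0.95, atol=1e-3)
assert np.allclose(np.abs(np.degrees(np.angle(zeros))), 45.0, atol=0.1)

# Poles: 0.7416 e^{+/- j 72.34 deg}, inside the unit circle, so the filter is stable
assert np.allclose(np.abs(poles), 0.7416, atol=1e-3)
assert np.allclose(np.abs(np.degrees(np.angle(poles))), 72.34, atol=0.1)
```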
30 SIGNAL PROCESSING ESSENTIALS
minus1 minus05 0 05 1minus1
minus05
0
05
1
Real part
Imag
inar
y pa
rt
z plane plot
0 05 1 15 2minus30
minus20
minus10
0
10
Normalized frequency x p (rad)
Mag
nitu
de (
dB)
Frequency response
Figure 217 z domain and frequency response plots of the second-order system givenin Example 24
2.7.2 Examples of Digital Filters for Audio Applications

There are several standard designs for digital filters that are targeted specifically for audio-type applications. These designs include the so-called shelving filters, peaking filters, cross-over filters, and quadrature mirror filter (QMF) bank filters. Low-pass and high-pass shelving filters are used for bass and treble tone controls, respectively, in stereo systems.
Example 2.5

The transfer function of a low-pass shelving filter can be expressed as

H_lp(z) = C_lp (1 − b₁z^{−1}) / (1 − a₁z^{−1}),   (2.37)

where

C_lp = (1 + kμ)/(1 + k),  b₁ = (1 − kμ)/(1 + kμ),  a₁ = (1 − k)/(1 + k),

k = (4/(1 + μ)) tan(Ω_c/2),  and  μ = 10^{g/20}.

Note also that Ω_c = 2πf_c/f_s is the normalized cutoff frequency and g is the gain in decibels (dB).
Example 2.6

The transfer function of a high-pass shelving filter is given by

H_hp(z) = C_hp (1 − b₁z^{−1}) / (1 − a₁z^{−1}),   (2.38)

where

C_hp = (μ + p)/(1 + p),  b₁ = (μ − p)/(μ + p),  a₁ = (1 − p)/(1 + p),

p = ((1 + μ)/4) tan(Ω_c/2),  and  μ = 10^{g/20}.

Again, Ω_c = 2πf_c/f_s is the normalized cutoff frequency and g is the gain in dB. More complex tone controls that operate as graphic equalizers are accomplished using bandpass peaking filters.
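The shelving formulas in Examples 2.5 and 2.6 can be sanity-checked by evaluating H(z) at z = 1 (DC) and z = −1 (the fold-over frequency): the low-pass shelf has gain μ at DC and unity gain at Ω = π, and the high-pass shelf the reverse. A NumPy sketch with arbitrary settings g = 10 dB, Ω_c = π/4:

```python
import numpy as np

g = 10.0           # shelf gain in dB (arbitrary)
wc = np.pi / 4     # normalized cutoff (arbitrary)
mu = 10 ** (g / 20)

def H(z, C, b1, a1):
    """First-order shelf H(z) = C (1 - b1 z^-1) / (1 - a1 z^-1)."""
    return C * (1 - b1 / z) / (1 - a1 / z)

# Low-pass shelving filter, Eq. (2.37)
k = (4 / (1 + mu)) * np.tan(wc / 2)
Clp, b1_lp, a1_lp = (1 + k*mu) / (1 + k), (1 - k*mu) / (1 + k*mu), (1 - k) / (1 + k)
assert np.isclose(H(1.0, Clp, b1_lp, a1_lp), mu)    # boost of mu at DC
assert np.isclose(H(-1.0, Clp, b1_lp, a1_lp), 1.0)  # unity gain at Omega = pi

# High-pass shelving filter, Eq. (2.38)
p = ((1 + mu) / 4) * np.tan(wc / 2)
Chp, b1_hp, a1_hp = (mu + p) / (1 + p), (mu - p) / (mu + p), (1 - p) / (1 + p)
assert np.isclose(H(1.0, Chp, b1_hp, a1_hp), 1.0)   # unity gain at DC
assert np.isclose(H(-1.0, Chp, b1_hp, a1_hp), mu)   # boost of mu at Omega = pi
```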
Example 2.7

The transfer function of a peaking filter is given by

H_pk(z) = C_pk (1 + b₁z^{−1} + b₂z^{−2}) / (1 + a₁z^{−1} + a₂z^{−2}),   (2.39)

where

C_pk = (1 + k_q μ)/(1 + k_q),  b₁ = −2cos(Ω_c)/(1 + k_q μ),  b₂ = (1 − k_q μ)/(1 + k_q μ),

a₁ = −2cos(Ω_c)/(1 + k_q),  a₂ = (1 − k_q)/(1 + k_q),
Figure 2.18 Frequency responses of a low-pass shelving filter (a) for different gains (g = −20, −10, −5, 5, 10, and 20 dB) and (b) for different cutoff frequencies, Ω_c = π/6 and π/4.
Figure 2.19 Frequency responses of a peaking filter (a) for different gains, g = 10 dB and 20 dB, (b) for different quality factors, Q = 2 and 4, and (c) for different cutoff frequencies, Ω_c = π/4 and π/2.
k_q = (4/(1 + μ)) tan(Ω_c/(2Q)),  and  μ = 10^{g/20}.

The frequency Ω_c = 2πf_c/f_s is the normalized cutoff frequency, Q is the quality factor, and g is the gain in dB. Example frequency responses of shelving and peaking filters for different gains and cutoff frequencies are given in Figures 2.18 and 2.19.
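A corresponding numerical check for the peaking filter of Eq. (2.39): its gain returns to unity at DC and at Ω = π, with a boost of μ at the center frequency. Sketch with arbitrary settings g = 10 dB, Q = 2, Ω_c = π/2:

```python
import numpy as np

g, Q = 10.0, 2.0     # gain in dB and quality factor (arbitrary)
wc = np.pi / 2       # normalized center frequency (arbitrary)
mu = 10 ** (g / 20)

kq = (4 / (1 + mu)) * np.tan(wc / (2 * Q))
Cpk = (1 + kq * mu) / (1 + kq)
b1 = -2 * np.cos(wc) / (1 + kq * mu)
b2 = (1 - kq * mu) / (1 + kq * mu)
a1 = -2 * np.cos(wc) / (1 + kq)
a2 = (1 - kq) / (1 + kq)

def Hpk(omega):
    """Evaluate Eq. (2.39) on the unit circle."""
    z = np.exp(1j * omega)
    return Cpk * (1 + b1 / z + b2 / z**2) / (1 + a1 / z + a2 / z**2)

assert np.isclose(abs(Hpk(0.0)), 1.0)     # unity gain at DC
assert np.isclose(abs(Hpk(np.pi)), 1.0)   # unity gain at the fold-over frequency
assert np.isclose(abs(Hpk(wc)), mu)       # gain of mu at the center frequency
```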
Example 2.8

An audio graphic equalizer is designed by cascading peaking filters, as shown in Figure 2.20. The main idea behind the audio graphic equalizer is that it applies a set of peaking filters to modify the frequency spectrum of the input audio signal, dividing its audible frequency spectrum into several frequency bands. The frequency response of each band can then be controlled by varying the corresponding peaking filter's gain.
Figure 2.20 A cascaded setup of peaking filters to design an audio graphic equalizer.
2.8 REVIEW OF MULTIRATE SIGNAL PROCESSING

Multirate signal processing (MSP) involves changing the sampling rate while the signal is in the digital domain. Sampling rate changes have been popular in DSP and audio applications. Depending on the application, changes in the sampling rate may reduce algorithmic and hardware complexity, or increase resolution in certain signal processing operations by introducing additional signal samples. Perhaps the most popular application of MSP is over-sampling analog-to-digital (A/D) and digital-to-analog (D/A) conversion. In over-sampling A/D, the signal is over-sampled, thereby relaxing the anti-aliasing filter design requirements and hence the hardware complexity. The additional time-resolution in the over-sampled signal allows a simple 1-bit delta modulation (DM) quantizer to deliver a digital signal with sufficient resolution, even for high-fidelity audio applications. This reduction of analog hardware complexity comes at the expense of a data rate increase. Therefore, a down-sampling operation is subsequently performed using a DSP chip to reduce the data rate. This reduction in the sampling rate requires a high-precision anti-aliasing digital low-pass filter, along with some other correcting DSP algorithmic steps that are of appreciable complexity. Therefore, over-sampling analog-to-digital (A/D) conversion, otherwise called Delta-Sigma A/D conversion, involves a process where complexity is transferred from the analog hardware domain to the digital software domain. The reduction of analog hardware complexity is also important in D/A conversion. In that case, the signal is up-sampled and interpolated in the digital domain, thereby reducing the requirements on the analog reconstruction (interpolation) filter.
2.8.1 Down-sampling by an Integer

Multirate signal processing is characterized by two basic operations, namely up-sampling and down-sampling. Down-sampling involves increasing the sampling period, and hence decreasing the sampling frequency and data rate, of the digital signal. A sampling rate reduction by an integer L is represented by

x_d(n) = x(nL).   (2.40)
Given the DTFT transform pairs

x(n) ←DTFT→ X(e^{jΩ}) and x_d(n) ←DTFT→ X_d(e^{jΩ}),   (2.41)

it can be shown [Oppe99] that the DTFTs of the original and decimated signals are related by

X_d(e^{jΩ}) = (1/L) Σ_{l=0}^{L−1} X(e^{j(Ω−2πl)/L}).   (2.42)

Therefore, down-sampling introduces L copies of the original DTFT that are both amplitude and frequency scaled by L. It is clear that the additional copies may introduce aliasing. Aliasing can be eliminated if the DTFT of the original signal is bandlimited to a frequency π/L, i.e.,

X(e^{jΩ}) = 0, π/L ≤ |Ω| ≤ π.   (2.43)

An example of the DTFTs of the signal during the down-sampling process is shown in Figure 2.21. To approximate the condition in Eq. (2.43), a digital anti-aliasing filter is used. The down-sampling process is illustrated in Figure 2.22.
Figure 2.21 (a) The original and the down-sampled signal in the time domain and (b) the corresponding DTFTs.

Figure 2.22 Down-sampling by an integer L: an anti-aliasing filter H_D(e^{jΩ}) followed by the down-sampler.
In Figure 2.22, H_D(e^{jΩ}) is given by

H_D(e^{jΩ}) = 1 for 0 ≤ |Ω| ≤ π/L, and 0 for π/L < |Ω| ≤ π.   (2.44)
2.8.2 Up-sampling by an Integer

Up-sampling involves reducing the sampling period by introducing additional, regularly spaced samples in the signal sequence:

x_u(n) = Σ_{m=−∞}^{∞} x(m) δ(n − mM) = x(n/M) for n = 0, ±M, ±2M, ..., and 0 else.   (2.45)

The introduction of zero-valued samples in the up-sampled signal x_u(n) increases the sampling rate of the signal. The DTFT of the up-sampled signal relates to the DTFT of the original signal as follows:

X_u(e^{jΩ}) = X(e^{jΩM}).   (2.46)
Figure 2.23 (a) The original and the up-sampled signal in the time domain and (b) the corresponding DTFTs.

Figure 2.24 Up-sampling by an integer M.
Figure 2.25 Sampling rate changes by a noninteger factor; the combined low-pass filter H_LP(e^{jΩ}) has cutoff Ω_c = π/max(L, M).
Therefore, the DTFT of the up-sampled signal, X_u(e^{jΩ}), is described by a series of frequency-compressed images of the DTFT of the original signal, located at integer multiples of 2π/M rad (see Figure 2.23). To complete the up-sampling process, an interpolation stage is required that fills in appropriate values in the time domain to replace the artificial zero-valued samples introduced by the sampling. In Figure 2.24, H_U(e^{jΩ}) is given by

H_U(e^{jΩ}) = M for 0 ≤ |Ω| ≤ π/M, and 0 for π/M < |Ω| ≤ π.   (2.47)
2.8.3 Sampling Rate Changes by Noninteger Factors

Sampling rate changes by noninteger factors can be accomplished by cascading up-sampling and down-sampling operations. The up-sampling stage precedes the down-sampling stage, and the low-pass interpolation and anti-aliasing filters are combined into one filter whose bandwidth is the minimum of the two (Figure 2.25). For example, suppose we want a noninteger sampling period modification such that T_new = 5T/12. In this case, we choose L = 12 and M = 5. Hence, the bandwidth of the low-pass filter is the minimum of π/12 and π/5.
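The L/M cascade can be sketched in a few lines of NumPy: zero-stuff by L, low-pass filter with cutoff π/max(L, M) and passband gain L, and keep every M-th sample. The windowed-sinc interpolation filter, its length, and the test tone below are all arbitrary choices made for this illustration.

```python
import numpy as np

L, M = 12, 5                  # up-sampling and down-sampling factors
f0 = 0.02                     # test-tone frequency, cycles/sample (well below the cutoff)
x = np.sin(2 * np.pi * f0 * np.arange(400))

# Zero-stuff by L
xu = np.zeros(len(x) * L)
xu[::L] = x

# Windowed-sinc low-pass: cutoff pi/max(L, M) = pi/L, gain L -> h(n) = sinc(n/L)
taps = np.arange(-120, 121)
h = np.sinc(taps / L) * np.hamming(len(taps))
v = np.convolve(xu, h, mode='same')   # symmetric filter, so 'same' is zero-phase

# Keep every M-th sample: overall rate change by L/M
y = v[::M]

# Interior samples should follow the tone resampled to f0 * M / L cycles/sample
n = np.arange(len(y))
ref = np.sin(2 * np.pi * f0 * (M / L) * n)
assert np.max(np.abs(y[100:-100] - ref[100:-100])) < 0.05
```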
2.8.4 Quadrature Mirror Filter Banks

The analysis of the signal in a perceptual audio coding system is usually accomplished using either filter banks or frequency-domain transformations, or a combination of both. The filter bank is used to decompose the signal into several frequency subbands. Different coding strategies are then derived and implemented in each subband. This technique is known as subband coding in the coding literature (Figure 2.26).

One of the important aspects of subband decomposition is the aliasing between the different subbands, which arises because of the imperfect frequency responses of the digital filters (Figure 2.27). These aliasing problems prevented early analysis-synthesis filter banks from perfectly reconstructing the original input signal, even in the absence of quantization effects. In 1977, a solution to this problem was provided by combining down-sampling and up-sampling operations with appropriate filter designs [Este77]. The perfect reconstruction filter bank design came to be known as a quadrature mirror filter (QMF) bank. An analysis-synthesis QMF consists of anti-aliasing filters, down-sampling stages, up-sampling stages, and interpolation filters. A two-band QMF structure is shown in Figure 2.28.
Figure 2.26 Signal coding in subbands.

Figure 2.27 Aliasing effects in a two-band filter bank.

Figure 2.28 A two-band QMF structure.
The analysis stage consists of the filters H₀(z) and H₁(z) and down-sampling operations. The synthesis stage includes up-sampling stages and the filters F₀(z) and F₁(z). If the process includes quantizers, those are placed after the down-sampling stages. We first examine the filter bank without the quantization stage. The input signal x(n) is first filtered and then down-sampled. The DTFT of the down-sampled signal can be shown to be

X_dk(e^{jΩ}) = (1/2)(X_k(e^{jΩ/2}) + X_k(e^{j(Ω−2π)/2})), k = 0, 1.   (2.48)
Figure 2.29 presents plots of the DTFTs of the original and down-sampled signals. It can be seen that an aliasing term is present.

The reconstructed signal x̂(n) is derived by adding the contributions from the up-sampling and interpolation of the low and the high band. It can be shown
Figure 2.29 DTFTs of the original and down-sampled signals, illustrating the aliasing term.

Figure 2.30 Tree-structured QMF bank.
that the reconstructed signal in the z-domain has the form

X̂(z) = (1/2)(H₀(z)F₀(z) + H₁(z)F₁(z)) X(z) + (1/2)(H₀(−z)F₀(z) + H₁(−z)F₁(z)) X(−z).   (2.49)

The signal X(−z) in Eq. (2.49) is associated with the aliasing term. The aliasing term can be cancelled by designing the filters to have the following mirror symmetries:

F₀(z) = H₁(−z), F₁(z) = −H₀(−z).   (2.50)

Under these conditions, the overall transfer function of the filter bank can be written as

T(z) = (1/2)(H₀(z)F₀(z) + H₁(z)F₁(z)).   (2.51)

If T(z) = 1, then the filter bank allows perfect reconstruction. Perfect delayless reconstruction is not realizable, but an all-pass filter bank with linear phase characteristics can be designed easily. For example, the choice of first-order FIR filters

H₀(z) = 1 + z^{−1}, H₁(z) = 1 − z^{−1},   (2.52)

results in alias-free reconstruction. The overall transfer function of the QMF in this case is

T(z) = (1/2)((1 + z^{−1})² − (1 − z^{−1})²) = 2z^{−1}.   (2.53)

Therefore, the signal is reconstructed within a delay of one sample and with an overall gain of 2. QMF filter banks can be cascaded to form tree structures. If we represent the analysis stage of a filter bank as a block that divides the signal into low- and high-frequency subbands, then by cascading several such blocks we can divide the signal into smaller subbands. This is shown in Figure 2.30. QMF banks are part of many subband and hybrid subband/transform audio and image/video coding standards [Thei87] [Stoll88] [Veld89] [John96]. Note that the theory of quadrature mirror filter banks has been associated with wavelet transform theory [Wick94] [Akan96] [Stra96].
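The alias cancellation of Eqs. (2.50)–(2.53) can be demonstrated end to end in a few lines of NumPy: running a signal through the two-band structure of Figure 2.28 with H₀ = 1 + z⁻¹ and H₁ = 1 − z⁻¹ returns 2x(n − 1) exactly.

```python
import numpy as np

h0 = np.array([1.0, 1.0])    # H0(z) = 1 + z^-1
h1 = np.array([1.0, -1.0])   # H1(z) = 1 - z^-1
f0 = np.array([1.0, 1.0])    # F0(z) =  H1(-z) = 1 + z^-1
f1 = np.array([-1.0, 1.0])   # F1(z) = -H0(-z) = -1 + z^-1

# Overall transfer function T(z) = (H0 F0 + H1 F1) / 2 = 2 z^-1, Eq. (2.53)
T = 0.5 * (np.convolve(h0, f0) + np.convolve(h1, f1))
assert np.allclose(T, [0.0, 2.0, 0.0])

# Time-domain check: filter, decimate by 2, zero-stuff by 2, interpolate, sum
rng = np.random.default_rng(3)
x = rng.standard_normal(64)

def band(h, f):
    xd = np.convolve(x, h)[::2]    # analysis filter and down-sampler
    xu = np.zeros(2 * len(xd))
    xu[::2] = xd                   # up-sampler (zero insertion)
    return np.convolve(xu, f)      # interpolation (synthesis) filter

xhat = band(h0, f0) + band(h1, f1)
assert np.allclose(xhat[1:len(x) + 1], 2 * x)   # reconstruction: 2 x(n - 1)
```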
2.9 DISCRETE-TIME RANDOM SIGNALS

In signal processing, we generally classify signals as deterministic or random. A signal is defined as deterministic if its values at any point in time can be defined precisely by a mathematical equation. For example, the signal x(n) = sin(πn/4) is deterministic. On the other hand, random signals have uncertain values and are usually described using their statistics. A discrete-time random process involves an ensemble of sequences x(n, m), where m is the index of the m-th sequence
in the ensemble and n is the time index. In practice, one does not have access to all possible sample signals of a random process. Therefore, the determination of the statistical structure of a random process is often done from the observed waveform. This approach becomes valid, and simplifies random signal analysis, if the random process at hand is ergodic. Ergodicity implies that the statistics of a random process can be determined using time-averaging operations on a single observed signal. Ergodicity requires that the statistics of the signal are independent of the time of observation. Random signals whose statistical structure is independent of the time of origin are generally called stationary. More specifically, a random process is said to be wide-sense stationary if its statistics up to the second order are independent of time. Although it is difficult to show analytically that signals with various statistical distributions are ergodic, it can be shown that a stationary zero-mean Gaussian process is ergodic up to second order. In many practical applications involving a stationary or quasi-stationary process, it is assumed that the process is also ergodic. The definitions of signal statistics presented henceforth will focus on real-valued stationary processes that are ergodic.
The mean value μ_x of the discrete-time wide-sense stationary signal x(n) is a first-order statistic that is defined as the expected value of x(n), i.e.,

μ_x = E[x(n)] = lim_{N→∞} (1/(2N + 1)) Σ_{n=−N}^{N} x(n),   (2.54)

where E[·] denotes the statistical expectation. The assumption of ergodicity allows us to determine the mean value with the time-averaging process shown on the right-hand side of Eq. (2.54). The mean value can be viewed as the DC component of the signal. In many applications involving speech and audio signals, the DC component does not carry any useful information and is either ignored or filtered out.
The variance σ²_x is a second-order signal statistic and is a measure of the signal's dispersion from its mean value. The variance is defined as

σ²_x = E[(x(n) − μ_x)(x(n) − μ_x)] = E[x²(n)] − μ²_x.   (2.55)

The square root of the variance is the standard deviation of the signal. For a zero-mean signal, the variance is simply E[x²(n)]. The autocorrelation of a signal is a second-order statistic defined by

r_xx(m) = E[x(n + m)x(n)] = lim_{N→∞} (1/(2N + 1)) Σ_{n=−N}^{N} x(n + m)x(n),   (2.56)
where m is called the autocorrelation lag index. The autocorrelation can be viewed as a measure of predictability of the signal, in the sense that a future value of a correlated signal can be predicted by processing information associated with its past values. For example, speech is a correlated waveform, and hence it can be modeled by linear prediction mechanisms that predict its current value from a linear combination of past values. Correlation can also be viewed as a measure of redundancy in the signal, in that correlated waveforms can be parameterized in terms of statistical time-series models and hence represented by a reduced number of information bits.

The autocorrelation sequence r_xx(m) is symmetric and positive definite, i.e.,

r_xx(−m) = r_xx(m), r_xx(0) ≥ |r_xx(m)|.   (2.57)
Example 2.9

The autocorrelation of a white noise signal is

r_ww(m) = σ²_w δ(m),

where σ²_w is the variance of the noise. The fact that the autocorrelation of white noise is the unit impulse implies that white noise is an uncorrelated signal.
Example 2.10

The autocorrelation of the output of the second-order FIR digital filter H(z) (in Figure 2.31) for a white noise input of zero mean and unit variance is

r_yy(m) = E[y(n + m)y(n)]
        = δ(m + 2) + 2δ(m + 1) + 3δ(m) + 2δ(m − 1) + δ(m − 2).
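Relying on ergodicity, the result of Example 2.10 can be reproduced by time-averaging one long realization; a NumPy sketch (the realization length and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200_000
w = rng.standard_normal(N)        # white noise, zero mean, unit variance

h = np.array([1.0, 1.0, 1.0])     # H(z) = 1 + z^-1 + z^-2 (Figure 2.31)
y = np.convolve(w, h)[:N]

# Time-average estimate of r_yy(m) = E[y(n+m) y(n)] for lags m = 0..4
r = np.array([np.mean(y[m:] * y[:N - m]) for m in range(5)])

# Theory: r_yy = 3, 2, 1, 0, 0 for m = 0, 1, 2, 3, 4
assert np.allclose(r, [3.0, 2.0, 1.0, 0.0, 0.0], atol=0.1)
```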
Cross-correlation is a measure of similarity between two signals. The cross-correlation of a signal x(n) relative to a signal y(n) is given by

r_xy(m) = E[x(n + m)y(n)].   (2.58)

Similarly, the cross-correlation of a signal y(n) relative to a signal x(n) is given by

r_yx(m) = E[y(n + m)x(n)].   (2.59)

Note that the symmetry property of the cross-correlation is

r_yx(m) = r_xy(−m).   (2.60)

Figure 2.31 FIR filter, H(z) = 1 + z^{−1} + z^{−2}, excited by white noise.
Figure 2.32 The PSD of white noise: r_ww(m) = σ²_w δ(m) ↔ R_ww(e^{jΩ}) = σ²_w.

The power spectral density (PSD) of a random signal is defined as the DTFT of the autocorrelation sequence:

R_xx(e^{jΩ}) = Σ_{m=−∞}^{∞} r_xx(m) e^{−jΩm}.   (2.61)

The PSD is real-valued and positive, and it describes how the power of the random process is distributed across frequency.
Example 2.11

The PSD of a white noise signal (see Figure 2.32) is

R_ww(e^{jΩ}) = σ²_w Σ_{m=−∞}^{∞} δ(m) e^{−jΩm} = σ²_w.
2.9.1 Random Signals Processed by LTI Digital Filters

In Example 2.10, we determined the autocorrelation of the output of a second-order FIR digital filter when the excitation is white noise. In this section, we briefly review the characterization of the statistics of the output of a causal LTI digital filter that is excited by a random signal. The output of a causal digital filter can be computed by convolving the input with its impulse response, i.e.,

y(n) = Σ_{i=0}^{∞} h(i) x(n − i).   (2.62)
Based on the convolution sum, we can derive the following expressions for the mean, the autocorrelation, the cross-correlation, and the power spectral density of the steady-state output of an LTI digital filter:

μy = Σ_{k=0}^{∞} h(k) μx = H(e^jΩ)|_{Ω=0} μx (2.63)

ryy(m) = Σ_{k=0}^{∞} Σ_{i=0}^{∞} h(k)h(i) rxx(m − k + i) (2.64)
DISCRETE-TIME RANDOM SIGNALS 43
ryx(m) = Σ_{i=0}^{∞} h(i) rxx(m − i) (2.65)

Ryy(e^jΩ) = |H(e^jΩ)|² Rxx(e^jΩ) (2.66)
These equations describe the statistical behavior of the output at steady state. During the transient state of the filter, the output is essentially nonstationary, i.e.,

μy(n) = Σ_{k=0}^{n} h(k) μx. (2.67)
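As a quick numerical illustration of Eq. (2.63), the steady-state output mean of an FIR filter can be compared against a Monte Carlo estimate. The filter and input mean below are arbitrary choices for illustration, not taken from the text:

```python
import numpy as np

# Eq. (2.63): mu_y = sum_k h(k) * mu_x = H(e^j0) * mu_x.
h = np.array([0.5, 0.3, 0.2])      # assumed impulse response, H(e^j0) = 1.0
mu_x = 2.0                         # assumed input mean

mu_y = h.sum() * mu_x              # steady-state output mean

# Monte Carlo check: filter white noise with mean mu_x.
rng = np.random.default_rng(0)
x = mu_x + rng.standard_normal(200_000)
y = np.convolve(x, h, mode="valid")
print(mu_y, y.mean())              # both close to 2.0
```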
Example 2.12

Determine the output variance of an LTI digital filter excited by white noise of zero mean and unit variance:

σy² = ryy(0) = Σ_{k=0}^{∞} Σ_{i=0}^{∞} h(k)h(i) δ(i − k) = Σ_{k=0}^{∞} h²(k).
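This result is easy to verify by simulation; the FIR impulse response below is an arbitrary illustration:

```python
import numpy as np

# Example 2.12: steady-state output variance = sum_k h^2(k) when the
# input is zero-mean, unit-variance white noise.
h = np.array([1.0, -0.5, 0.25])         # assumed impulse response
predicted_var = np.sum(h**2)            # 1.3125

rng = np.random.default_rng(1)
x = rng.standard_normal(500_000)        # white noise, zero mean, unit variance
y = np.convolve(x, h, mode="valid")
print(predicted_var, y.var())           # the simulation agrees closely
```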
Example 2.13

Determine the variance, the autocorrelation, and the PSD of the output of the digital filter in Figure 2.33 when its input is white noise of zero mean and unit variance. The impulse response and transfer function of this first-order IIR filter are

h(n) = 0.8ⁿ u(n), H(z) = 1 / (1 − 0.8z⁻¹).

The variance of the output at steady state is

σy² = Σ_{k=0}^{∞} h²(k) σx² = Σ_{k=0}^{∞} 0.64^k = 1 / (1 − 0.64) = 2.78.
Figure 2.33 An example of an IIR filter: y(n) = 0.8 y(n − 1) + x(n).
The autocorrelation of the output is given by

ryy(m) = Σ_{k=0}^{∞} Σ_{i=0}^{∞} 0.8^{k+i} δ(m − k + i).

It is easy to see that the unit impulse will be non-zero only for k = m + i, and hence

ryy(m) = Σ_{i=0}^{∞} 0.8^{m+2i}, m ≥ 0.

Taking into account the autocorrelation symmetry,

ryy(m) = ryy(−m) = 2.78 (0.8)^{|m|}, for all m.
Finally, the PSD is given by

Ryy(e^jΩ) = |H(e^jΩ)|² Rxx(e^jΩ) = 1 / |1 − 0.8e^{−jΩ}|².
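The closed-form sums in this example can be checked by truncating the infinite series, which converge quickly for |0.8| < 1:

```python
import numpy as np

# Example 2.13 checks via truncated infinite sums, h(n) = 0.8^n u(n).
k = np.arange(200)                      # 0.8^200 is negligible
sigma2_y = np.sum(0.64**k)              # sum 0.64^k = 1 / (1 - 0.64)
print(sigma2_y)                         # ~2.7778

# r_yy(m) = sum_i 0.8^(m + 2i) = 0.8^m / (1 - 0.64), checked at m = 3
m = 3
r_yy_m = np.sum(0.8**(m + 2 * k))
print(r_yy_m, sigma2_y * 0.8**m)        # the two agree
```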
2.9.2 Autocorrelation Estimation from Finite-Length Data

Estimators of signal statistics, given N observations, are based on the assumptions of stationarity and ergodicity. The following is an estimator of the autocorrelation (based on sample averaging) of a signal x(n):

r̂xx(m) = (1/N) Σ_{n=0}^{N−m−1} x(n + m)x(n), m = 0, 1, 2, ..., N − 1. (2.68)

Correlations for negative lags can be obtained using the symmetry property in Eq. (2.57). The estimator above is asymptotically unbiased (for fixed m and N ≫ m), but for small N it is biased.
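The estimator of Eq. (2.68) is only a few lines of code; applied to white noise, it should return a value near one at lag zero and values near zero elsewhere:

```python
import numpy as np

# Biased sample-average autocorrelation estimator of Eq. (2.68).
def autocorr_biased(x, max_lag):
    N = len(x)
    # x[m:] dot x[:N-m] implements sum_n x(n + m) x(n)
    return np.array([np.dot(x[m:], x[:N - m]) / N for m in range(max_lag + 1)])

rng = np.random.default_rng(2)
x = rng.standard_normal(10_000)   # unit-variance white noise
r = autocorr_biased(x, 5)
print(r)                          # r[0] near 1, the other lags near 0
```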
2.10 SUMMARY

A brief review of some of the essentials of signal processing techniques was given in this chapter. In particular, some of the important concepts covered in this chapter include:

- Continuous Fourier transform
- Spectral leakage effects
- Convolution, sampling, and aliasing issues
- Discrete-time Fourier transform and z-transform
- The DFT, FFT, DCT, and STFT basics
- IIR and FIR filter representations
- Pole/zero and frequency response interpretations
- Shelving and peaking filters; audio graphic equalizers
- Down-sampling and up-sampling
- QMF banks and alias-free reconstruction
- Discrete-time random signal processing review
PROBLEMS
2.1 Determine the continuous Fourier transform (CFT) of a pulse described by

x(t) = u(t + 1) − u(t − 1),

where u(t) is the unit step function.

2.2 State and derive the CFT properties of duality, time shift, modulation, and convolution.

2.3 For the circuit shown in Figure 2.34(a) and for RC = 1:
a. Write the input-output differential equation.
b. Determine the impulse response in closed form by solving the differential equation.
c. Write the frequency response function.
d. Determine the steady-state response y(t) for x(t) = sin(10t).
Figure 2.34 (a) A simple RC circuit; (b) input signal for problem 2.3(e); and (c) input signal for problem 2.3(f).
e. Given x(t) as shown in Figure 2.34(b), find the circuit output y(t) using convolution.
f. Determine the Fourier series of the output y(t) of the RC circuit for the input shown in Figure 2.34(c).
2.4 Determine the CFT of p(t) = Σ_{n=−∞}^{∞} δ(t − nTs). Given xs(t) = x(t)p(t), derive the following:

Xs(ω) = (1/Ts) Σ_{k=−∞}^{∞} X(ω − kωs)

x(t) = Σ_{n=−∞}^{∞} x(nTs) sinc(B(t − nTs))

where X(ω) and Xs(ω) are the spectra of the ideally bandlimited and uniformly sampled signals, respectively, and ωs = 2π/Ts. (Refer to Figure 2.8 for variable definitions.)
2.5 Determine the z-transforms of the following causal signals:
a. sin(n)
b. δ(n) + δ(n − 1)
c. pⁿ sin(n)
d. u(n) − u(n − 9)

2.6 Determine the impulse and frequency responses of the averaging filter

h(n) = 1/(L + 1), n = 0, 1, ..., L, for L = 9.
2.7 Show that the IDFT can be derived as a least-squares signal-matching problem, where N points in the time domain are matched by a linear combination of N sampled complex sinusoids.

2.8 Given the transfer function H(z) = (z − 1)²/(z² + 0.81):
a. Determine the impulse response h(n).
b. Determine the steady-state response due to the sinusoid sin(πn/4).

2.9 Derive the decimation-in-time FFT algorithm and determine the number of complex multiplications required for an FFT size of N = 1024.
2.10 Derive the following expression for a simple two-band QMF:

Xdk(e^jΩ) = (1/2)(Xk(e^{jΩ/2}) + Xk(e^{j(Ω−2π)/2})), k = 0, 1.

Refer to Figure 2.28 for variable definitions. Give and justify the conditions for alias-free reconstruction in a simple QMF bank.
2.11 Design a tree-structured uniform QMF bank that will divide the spectrum of 0–20 kHz into eight uniform subbands. Give appropriate figures and denote on the branches the range of frequencies.

2.12 Modify your design in problem 2.11 and give one possible realization of a simple nonuniform tree-structured QMF bank that will divide the 0–20 kHz spectrum into eight subbands whose bandwidth increases with the center frequency.

2.13 Design a low-pass shelving filter for the following specifications: fs = 16 kHz, fc = 4 kHz, and g = 10 dB.

2.14 Design peaking filters for the following cases: (a) Ωc = π/4, Q = 2, g = 10 dB; and (b) Ωc = π/2, Q = 3, g = 5 dB. Give frequency responses of the designed peaking filters.

2.15 Design a five-band digital audio equalizer using the concept of peaking digital filters. Select center frequencies fc as follows: 500 Hz, 1500 Hz, 4000 Hz, 10 kHz, and 16 kHz; sampling frequency fs = 44.1 kHz; and the corresponding peaking filter gains as 10 dB. Choose a constant Q for all the peaking filters.

2.16 Derive equations (2.63) and (2.64).

2.17 Derive equation (2.66).

2.18 Show that the PSD is real-valued and positive.

2.19 Show that the estimator (2.68) of the autocorrelation is biased.

2.20 Show that the estimator (2.68) provides autocorrelations such that r̂xx(0) ≥ |r̂xx(m)|.

2.21 Provide an unbiased autocorrelation estimator by modifying the estimator in (2.68).

2.22 A digital filter with impulse response h(n) = 0.7ⁿu(n) is excited by white Gaussian noise of zero mean and unit variance. Determine the mean and variance of the output of the digital filter in closed form during the transient and steady state.

2.23 The filter H(z) = z/(z − 0.8) is excited by white noise of zero mean and unit variance.
a. Determine all the autocorrelation values at the output of the filter at steady state.
b. Determine the PSD at the output.
COMPUTER EXERCISES

Use the speech file 'Ch2speech.wav' from the Book Website for all the computer exercises in this chapter.
2.24 Consider the two-band QMF bank shown in Figure 2.35. In this figure, x(n) denotes speech frames of 256 samples and x̂(n) denotes the synthesized speech frames.
a. Design the transfer functions F0(z) and F1(z) such that aliasing is cancelled. Also calculate the overall delay of the QMF bank.
b. Select an arbitrary voiced speech frame from Ch2speech.wav. Give time-domain and frequency-domain plots of xd0(n) and xd1(n) for that particular frame. Comment on the frequency-domain plots with regard to low-pass/high-pass band-splitting.
Figure 2.35 A two-band QMF bank. In the analysis stage, x(n) is filtered by H0(z) and H1(z) and down-sampled by 2, yielding x0(n), x1(n) and then xd0(n), xd1(n); in the synthesis stage, the subband signals are up-sampled by 2 (xe0(n), xe1(n)), filtered by F0(z) and F1(z), and summed to form x̂(n). Given H1(z) = 1 + z⁻¹ and H0(z) = 1 − z⁻¹, choose F0(z) and F1(z) such that the aliasing term can be cancelled.
Figure 2.36 Speech synthesis from a select number (subset) of FFT components: x(n), n = 1, ..., 256 → FFT (size N = 256) → X(k), k = 1, ..., N → select L components out of N → X′(k), k = 1, ..., L → inverse FFT (size N = 256) → x′(n).
Table 2.2 Signal-to-noise ratio (SNR) and MOS values

Number of FFT components, L | SNR (overall) | Subjective evaluation: MOS (mean opinion score) on a scale of 1–5 for the entire speech record
----------------------------|---------------|---------------------------------
16                          |               |
128                         |               |
c. Repeat step (b) for x1(n) and xd1(n) in order to compare the signals before and after the down-sampling stage.
d. Calculate the SNR between the input speech record x(n) and the synthesized speech record x̂(n). Use the following equation to compute the SNR:

SNR = 10 log10( Σn x²(n) / Σn (x(n) − x̂(n))² ) (dB)

Listen to the synthesized speech record and comment on its quality.
e. Choose a low-pass F0(z) and a high-pass F1(z) such that aliasing occurs. Compute the SNR. Listen to the synthesized speech and describe its perceptual quality relative to the output speech in step (d). (Hint: use first-order IIR filters.)
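The overall SNR of part (d) is a one-line helper; the signal below is a toy stand-in for the speech record, used only to exercise the function:

```python
import numpy as np

# Overall SNR between an input record x and a synthesized record x_hat,
# per the equation in part (d). Names are illustrative.
def snr_db(x, x_hat):
    return 10 * np.log10(np.sum(x**2) / np.sum((x - x_hat)**2))

x = np.array([1.0, -1.0, 0.5, 0.25])
print(snr_db(x, x + 0.01))   # a small constant error gives a high SNR
```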
2.25 In Figure 2.36, x(n) denotes speech frames of 256 samples and x′(n) denotes the synthesized speech frames. For L = N (= 256), the synthesized speech will be identical to the input speech. In this computer exercise, you need to perform speech synthesis on a frame-by-frame basis from a select number (subset) of FFT components, i.e., L < N. We will use two methods for the FFT component selection: (i) Method 1, selecting the first L components, including their conjugate-symmetric ones, out of a total of N components; and (ii) Method 2, the least-squares method (a peak-picking method that selects the L components that minimize the sum of squared errors).
a. Use L = 64 and Method 1 for component selection. Perform speech synthesis and give time-domain plots of both input and output speech records.
b. Repeat the above step using the peak-picking Method 2 (choose L peaks, including symmetric components, in the FFT magnitude spectrum). List the SNR values in both cases. Listen to the output files corresponding to (a) and (b) and provide a subjective evaluation (on a MOS scale, 1–5). To calibrate the process, think of wireline telephone quality (toll) as 4, cellphone quality around 3.7.
c. Perform speech synthesis for (i) L = 16 and (ii) L = 128. Use the peak-picking Method 2 for the FFT component selection. Compute the overall SNR values and provide a subjective evaluation of the output speech for both cases. Tabulate your results in Table 2.2.
CHAPTER 3
QUANTIZATION AND ENTROPY CODING
3.1 INTRODUCTION

This chapter provides an introduction to waveform quantization and entropy coding algorithms. Waveform quantization deals with the digital, or more specifically binary, representation of signals. All the audio encoding algorithms typically include a quantization module. Theoretical aspects of waveform quantization methods were established about fifty years ago [Shan48]. Waveform quantization can be (i) memoryless or with memory, depending upon whether the encoding rules rely on past samples, and (ii) uniform or nonuniform, based on the step size or the quantization (discretization) levels employed. Pulse code modulation (PCM) [Oliv48] [Jaya76] [Jaya84] [Span94] is a memoryless method for discrete-time, discrete-amplitude quantization of analog waveforms. On the other hand, differential PCM (DPCM), delta modulation (DM), and adaptive DPCM (ADPCM) have memory.
Waveform quantization can also be classified as scalar or vector. In scalar quantization, each sample is quantized individually, as opposed to vector quantization, where a block of samples is quantized jointly. Scalar quantization [Jaya84] methods include PCM, DPCM, DM, and their adaptive versions. Several vector quantization (VQ) schemes have been proposed, including the VQ [Lind80], the split-VQ [Pali91] [Pali93], and the conjugate structure-VQ [Kata93] [Kata96]. Quantization can be parametric or nonparametric. In nonparametric quantization, the actual signal is quantized. Parametric representations are generally based on signal transformations (often unitary) or on source-system signal models.
Audio Signal Processing and Coding, by Andreas Spanias, Ted Painter, and Venkatraman Atti. Copyright 2007 by John Wiley & Sons, Inc.
A bit-allocation algorithm is typically employed to compute the number of quantization bits required to encode an audio segment. Several bit-allocation schemes have been proposed over the years; these include bit allocation based on the noise-to-mask ratio (NMR) and masking thresholds [Bran87a] [John88a] [ISOI92], perceptually motivated bit allocation [Vora97] [Naja00], and dynamic bit allocation based on signal statistics [Jaya84] [Rams86] [Shoh88] [West88] [Beat89] [Madi97]. Note that the NMR-based perceptual bit-allocation scheme [Bran87a] is one of the most popular techniques and is embedded in several audio coding standards (e.g., the ISO/IEC MPEG codec series).
In audio compression, entropy coding techniques are employed in conjunction with the quantization and bit-allocation modules in order to obtain improved coding efficiencies. Unlike the DPCM and ADPCM techniques, which remove redundancy by exploiting the correlation of the signal, entropy coding schemes exploit the likelihood of symbol occurrence [Cove91]. Entropy is a measure of the uncertainty of a random variable. For example, consider two random variables x and y and two random events A and B. For the random variable x, let the probability of occurrence of event A be px(A) = 0.5 and of event B be px(B) = 0.5. Similarly, define py(A) = 0.99999 and py(B) = 0.00001. The random variable x has a high uncertainty measure, i.e., it is very hard to predict whether event A or B is likely to occur. On the contrary, in the case of the random variable y, event A is more likely to occur, and therefore we have less uncertainty relative to the random variable x. In entropy coding, the information symbols are mapped into codes based on the frequency of each symbol. Several entropy coding schemes have been proposed, including Huffman coding [Huff52], Rice coding [Rice79], Golomb coding [Golo66], arithmetic coding [Riss79] [Howa94], and Lempel-Ziv coding [Ziv77]. These entropy coding schemes are typically called noiseless. A noiseless coding system is able to reconstruct the signal perfectly from its coded representation. In contrast, a coding scheme incapable of perfect reconstruction is called lossy.
In the rest of the chapter, we provide an overview of the quantization-bit allocation-entropy coding (QBE) framework. We also provide background on probabilistic signal structures, and we show how they are exploited in quantization algorithms. Finally, we introduce vector quantization basics.
3.1.1 The Quantization-Bit Allocation-Entropy Coding Module

After perceptual irrelevancies in an audio frame are exploited, a quantization-bit allocation-entropy coding (QBE) module is employed to exploit statistical correlation. In Figure 3.1, typical output parameters from stage I include the transform coefficients, the scale factors, and the residual error. These parameters are first quantized using one of the aforementioned PCM, DPCM, or VQ schemes. The number of bits allocated per frame is typically specified by a bit-allocation module that uses perceptual masking thresholds.

The quantized parameters are entropy coded using an explicit noiseless coding stage for final redundancy reduction. Huffman or Rice codes are available in the form of look-up tables at the entropy coding stage. Entropy coders are employed
Figure 3.1 A typical QBE module employed in audio coding. Stage I (psychoacoustic analysis and MDCT analysis, or prediction) produces the transform coefficients, scale factors, and residual error, which are quantized and entropy coded (using Huffman, Rice, or Golomb code tables) under the control of a bit-allocation module driven by perceptual masking thresholds and a distortion measure; the output is the quantized and entropy-coded bitstream.
in scenarios where the objective is to achieve maximum coding efficiency. In entropy coders, more probable symbols (i.e., frequently occurring amplitudes) are encoded with shorter codewords, and vice versa. This will essentially reduce the average data rate. Next, a distortion measure between the input and the encoded parameters is computed and compared against an established threshold. If the distortion metric is greater than the specified threshold, the bit-allocation module supplies additional bits in order to reduce the quantization error. The above procedure is repeated until the distortion falls below the threshold.
3.2 DENSITY FUNCTIONS AND QUANTIZATION

In this section, we discuss the characterization of a random process in terms of its probability density function (PDF). This approach will help us derive the quantization noise equations for different quantization schemes. A random process is characterized by its PDF, which is a non-negative function p(x) whose properties are

∫_{−∞}^{∞} p(x)dx = 1 (3.1)

and

∫_{x1}^{x2} p(x)dx = Pr(x1 < X ≤ x2). (3.2)

From the above equations, it is evident that the PDF area from x1 to x2 is the probability that the random variable X is observed in this range. Since X lies somewhere in (−∞, ∞), the total area under p(x) is one. The mean and the variance of the random variable X are defined as

μx = E[X] = ∫_{−∞}^{∞} x p(x)dx (3.3)
Figure 3.2 (a) The Gaussian PDF, pG(x) = (1/√(2πσx²)) e^{−x²/(2σx²)}, and (b) the Laplacian PDF, pL(x) = (1/(√2 σx)) e^{−√2|x|/σx}.
σx² = ∫_{−∞}^{∞} (x − μx)² p(x)dx = E[(X − μx)²]. (3.4)
Note that the expectation is computed either as a weighted average (3.3) or, under ergodicity assumptions, as a time average (Chapter 2, Eq. (2.54)). PDFs are useful in the design of optimal signal quantizers, as they can be used to determine the assignment of optimal quantization levels. PDFs often used to design or analyze quantizers include the zero-mean uniform (pU(x)), the Gaussian (pG(x)) (Figure 3.2a), and the Laplacian (pL(x)). These are given, in that order, below:

pU(x) = 1/(2S), −S ≤ x ≤ S (3.5)

pG(x) = (1/√(2πσx²)) e^{−x²/(2σx²)} (3.6)

pL(x) = (1/(√2 σx)) e^{−√2|x|/σx} (3.7)

where S is some arbitrary non-zero real number and σx² is the variance of the random variable X. Readers are referred to Papoulis' classical book on probability and random variables [Papo91] for an in-depth treatment of random processes.
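A quick numerical sanity check of Eqs. (3.5)-(3.7): each density integrates to one, and the variances come out as S²/3, σx², and σx², respectively (the values of S and σx below are arbitrary):

```python
import numpy as np

# Riemann-sum checks of the uniform, Gaussian, and Laplacian PDFs.
sigma, S = 1.5, 2.0
x = np.linspace(-60, 60, 1_200_001)
dx = x[1] - x[0]

p_U = np.where(np.abs(x) <= S, 1.0 / (2 * S), 0.0)
p_G = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
p_L = np.exp(-np.sqrt(2) * np.abs(x) / sigma) / (np.sqrt(2) * sigma)

for name, p in [("uniform", p_U), ("Gaussian", p_G), ("Laplacian", p_L)]:
    area = np.sum(p) * dx          # should be ~1
    var = np.sum(x**2 * p) * dx    # S^2/3, sigma^2, sigma^2
    print(name, round(float(area), 4), round(float(var), 4))
```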
3.3 SCALAR QUANTIZATION

In this section, we describe the various scalar quantization schemes. In particular, we review uniform and nonuniform quantization, and then we present differential PCM coding methods and their adaptive versions.

3.3.1 Uniform Quantization

Uniform PCM is a memoryless process that quantizes amplitudes by rounding off each sample to one of a set of discrete values (Figure 3.3). The difference between adjacent quantization levels, i.e., the step size, is constant in non-adaptive uniform PCM. The number of quantization levels, Q, in uniform PCM
SCALAR QUANTIZATION 55
Quantizationnoise eq(t )
Step size ∆
Quantized waveformsq(t)
Analog waveformsa(t)
Qua
ntiz
atio
n le
vels
(a)
(b)
2 4 6 8 10 12 14 160
1
2
3
4
5
6
7
t
Am
plitu
de
sa(t )sq(t )
Figure 33 (a) Uniform PCM and (b) uniform quantization of a triangular waveformFrom the figure Rb = 3 bits Q = 8 uniform quantizer levels
binary representations is Q = 2^Rb, where Rb denotes the number of bits. The performance of uniform PCM can be described in terms of the signal-to-noise ratio (SNR). Consider that the signal s is to be quantized and that its values lie in the interval

s ∈ (−smax, smax). (3.8)

A uniform step size ∆ can then be determined by

∆ = 2smax / 2^Rb. (3.9)

Let us assume that the quantization noise eq has a uniform PDF, i.e.,

−∆/2 ≤ eq ≤ ∆/2 (3.10)
peq(eq) = 1/∆, for |eq| ≤ ∆/2. (3.11)

From (3.9), (3.10), and (3.11), the variance of the quantization noise can be shown [Jaya84] to be

σ²eq = ∆²/12 = smax² 2^{−2Rb} / 3. (3.12)
Therefore, if the input signal is bounded, an increase of 1 bit reduces the noise variance by a factor of four. In other words, the SNR for uniform PCM will

Figure 3.4 (a) Nonuniform PCM and (b) nonuniform quantization of a decaying-exponential waveform. From the figure, Rb = 3 bits and Q = 8 nonuniform quantizer levels.
improve approximately by 6 dB per bit, i.e.,

SNR_PCM = 6.02 Rb + K1 (dB). (3.13)

The factor K1 is a constant that accounts for the step size and loading factors. For telephone speech, K1 = −10 [Jaya84].
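Both Eq. (3.12) and the 6-dB-per-bit rule can be checked with a short simulation of a mid-rise uniform quantizer; for a full-range uniform input the loading constant is essentially zero, so the SNR lands right at 6.02 Rb dB (the quantizer structure and test signal are our illustrative choices):

```python
import numpy as np

# Mid-rise uniform quantizer: noise variance ~ Delta^2/12 (Eq. (3.12)),
# SNR ~ 6.02*Rb dB for a full-range uniform input (Eq. (3.13), K1 ~ 0).
Rb, s_max = 3, 1.0
Delta = 2 * s_max / 2**Rb                  # Eq. (3.9)

rng = np.random.default_rng(3)
s = rng.uniform(-s_max, s_max, 1_000_000)
s_q = (np.floor(s / Delta) + 0.5) * Delta  # quantize to mid-rise levels
e_q = s - s_q                              # quantization noise

print(e_q.var(), Delta**2 / 12)            # nearly equal
snr_db = 10 * np.log10(s.var() / e_q.var())
print(snr_db, 6.02 * Rb)                   # ~18 dB for Rb = 3
```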
3.3.2 Nonuniform Quantization

Uniform nonadaptive PCM has no mechanism for exploiting signal redundancy. Moreover, uniform quantizers are optimal in the mean square error (MSE) sense only for signals with uniform PDF. Nonuniform PCM quantizers use a nonuniform step size (Figure 3.4) that can be determined from the statistical structure of the signal.

PDF-optimized PCM uses fine step sizes for frequently occurring amplitudes and coarse step sizes for less frequently occurring amplitudes. The step sizes can be optimally designed by exploiting the shape of the signal's PDF. A signal with a Gaussian PDF (Figure 3.5), for instance, can be quantized more efficiently in terms of the overall MSE by computing the quantization step sizes and the corresponding centroids such that the mean square quantization noise is minimized [Scha79].

Another class of nonuniform PCM relies on log-quantizers, which are quite common in telephony applications [Scha79] [Jaya84]. In Figure 3.6, a nonuniform quantizer is realized by using a nonlinear mapping function g(·) that maps nonuniform step sizes to uniform step sizes, such that a simple linear quantizer can be used. An example of the mapping function is given in Figure 3.7. The decoder uses an expansion function g⁻¹(·) to recover the signal.
Two telephony standards have been developed based on logarithmic companding, i.e., the μ-law and the A-law. The μ-law companding function is used in

Figure 3.5 PDF-optimized PCM for signals with Gaussian distribution. Quantization levels are on the horizontal axis.
Figure 3.6 Nonuniform PCM via compressor g(·) and expander g⁻¹(·) functions around a uniform quantizer.
Figure 3.7 Companding function g(s) for nonuniform PCM.
the North American PCM standard (μ = 255). The μ-law is given by

|g(s)| = log(1 + μ|s/smax|) / log(1 + μ). (3.14)

For μ = 255, (3.14) gives an approximately linear mapping for small amplitudes and a logarithmic mapping for larger amplitudes. The European A-law companding standard is slightly different and is based on the mapping

|g(s)| = A|s/smax| / (1 + log A),            for 0 < |s/smax| < 1/A
|g(s)| = (1 + log(A|s/smax|)) / (1 + log A), for 1/A < |s/smax| < 1. (3.15)

The idea with A-law companding is similar to μ-law in that, again, for signals with small amplitudes the mapping is almost linear, and for large amplitudes the transformation is logarithmic. Both of these techniques can yield superior SNRs, particularly for small amplitudes. In telephony, the companding schemes have been found to reduce bit rates, without degradation, by as much as 4 bits/sample relative to uniform PCM.
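A sketch of the μ-law compressor of Eq. (3.14), paired with its analytic inverse (the expander is our derivation, not quoted from the text), shows that the compressor/expander pair is an exact round trip before any quantizer is inserted between them:

```python
import numpy as np

# mu-law compressor of Eq. (3.14), with sign preserved, and its inverse.
mu, s_max = 255.0, 1.0

def g(s):        # compressor
    return np.sign(s) * np.log1p(mu * np.abs(s) / s_max) / np.log1p(mu)

def g_inv(y):    # expander: |s| = s_max * ((1 + mu)^|y| - 1) / mu
    return np.sign(y) * s_max * np.expm1(np.abs(y) * np.log1p(mu)) / mu

s = np.linspace(-1.0, 1.0, 11)
round_trip_err = np.max(np.abs(g_inv(g(s)) - s))
print(round_trip_err)   # essentially zero: exact up to floating point
```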
Dynamic range variations in PCM can be handled by using an adaptive step size. A PCM system with an adaptive step size is called adaptive PCM (APCM). The step size in a feed-forward system is transmitted as side information, while in a feedback system the step size is estimated from past coded samples (Figure 3.8). In
SCALAR QUANTIZATION 59
Buffer Q()
∆ Estimator
sQndash1()
∆ Estimator
(a)
Qndash1()sand
sand
(b)
Buffer Q()
∆ Estimator
s
∆ Estimator
Figure 38 Adaptive PCM with (a) forward estimation of step size and (b) backwardestimation of step size
this figure, Q represents either a uniform or nonuniform quantization (compression) scheme, and ∆ corresponds to the step size.
3.3.3 Differential PCM

A more efficient scalar quantizer is differential PCM (DPCM), which removes the redundancy in the audio waveform by exploiting the correlation between adjacent samples. In its simplest form, a DPCM transmitter encodes only the difference between successive samples, and the receiver recovers the signal by integration. Practical DPCM schemes incorporate a time-invariant short-term prediction process, A(z). This is given by

A(z) = Σ_{i=1}^{p} ai z^{−i}, (3.16)

where ai are the prediction coefficients and z is the complex variable of the z-transform. This DPCM scheme is also called predictive differential coding (Figure 3.9) and reduces the quantization error variance by reducing the variance of the quantizer input. An example of a representative DPCM waveform, eq(n), along with the associated analog and PCM quantized waveforms, s(t) and s(n), respectively, is given in Figure 3.10.

The DPCM system (Figure 3.9) works as follows. The sample s′(n) is the estimate of the current sample s(n) and is obtained from past sample values. The prediction error e(n) is then quantized (i.e., eq(n)) and transmitted to the
Figure 3.9 DPCM system: (a) transmitter and (b) receiver.
receiver. The quantized prediction error is also added to the estimate s′(n) in order to reconstruct the sample ŝ(n). In the absence of channel errors, the reconstruction ŝ(n) formed at the receiver is identical to that formed at the transmitter. In the simplest case, A(z) is a first-order polynomial. In Figure 3.9, A(z) is given by

A(z) = Σ_{i=1}^{p} ai z^{−i} (3.17)

and the predicted signal is given by

s′(n) = Σ_{i=1}^{p} ai ŝ(n − i). (3.18)
The prediction coefficients are usually determined by solving the autocorrelation equations

rss(m) − Σ_{i=1}^{p} ai rss(m − i) = 0, for m = 1, 2, ..., p, (3.19)
where rss(m) are the autocorrelation samples of s(n). The details of the equation above will be discussed in Chapter 4. Two other types of scalar coders are the delta modulation (DM) and adaptive DPCM (ADPCM) coders [Cumm73] [Gibs74] [Gibs78] [Yatr88]. DM can be viewed as a special case of DPCM where the difference (prediction error) is encoded with one bit. DM typically operates at sampling rates much higher than the rates commonly used with DPCM. The step size in DM may also be adaptive.
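A minimal closed-loop DPCM sketch in the spirit of Figure 3.9 (a fixed first-order predictor and a uniform quantizer; the coefficient, step size, and test signal are illustrative assumptions) shows that the reconstruction error stays bounded by half the quantizer step, because the transmitter predicts from the same reconstructed samples that the receiver forms:

```python
import numpy as np

# Closed-loop first-order DPCM (simplified sketch of Figure 3.9).
a1, Delta = 0.8, 0.05                  # assumed predictor coeff. and step size

def quantize(e):                       # uniform quantizer for the prediction error
    return np.round(e / Delta) * Delta

n = np.arange(200)
s = np.sin(2 * np.pi * n / 40)         # assumed test signal

s_rec = np.zeros_like(s)               # reconstructed samples (decoder state)
prev = 0.0
for i in range(len(s)):
    s_pred = a1 * prev                 # prediction from the past reconstruction
    e_q = quantize(s[i] - s_pred)      # quantized prediction error (transmitted)
    s_rec[i] = s_pred + e_q            # reconstruction, identical at the receiver
    prev = s_rec[i]

print(np.max(np.abs(s - s_rec)))       # bounded by Delta/2 = 0.025
```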
In an adaptive differential PCM (ADPCM) system, both the step size and the predictor are allowed to adapt and track the time-varying statistics of the input signal [Span94]. The predictor can be either forward adaptive or backward adaptive. In forward adaptation, the prediction parameters are estimated from the current data, which are not available at the receiver. Therefore, the
Figure 3.10 Uniform quantization: (a) analog input signal; (b) PCM waveform (3-bit digitization of the analog signal), output after quantization: [101 100 011 011 101 001 100 001 101 010 100], total number of bits in the PCM digitization = 33; (c) differential PCM waveform (2-bit digitization of the analog signal), output after quantization: [10 10 01 01 10 00 10 00 10 01 01], total number of bits in the DPCM digitization = 22. As an aside, note that relative to PCM, DPCM reduces the number of bits for encoding by reducing the variance of the input signal; the dynamic range of the input signal can be reduced by exploiting the redundancy present within adjacent samples of the signal.
prediction parameters must be encoded and transmitted separately in order to reconstruct the signal at the receiver. In backward adaptation, the parameters are estimated from past data, which are already available at the receiver. Therefore, the prediction parameters can be estimated locally at the receiver. Backward predictor adaptation is amenable to low-delay coding [Gibs90] [Chen92]. ADPCM
encoders with pole-zero decoder filters have proved to be particularly versatile in speech applications. In fact, the ADPCM 32 kb/s algorithm adopted for the ITU-T G.726 [G726] standard (formerly G.721 [G721]) uses a pole-zero adaptive predictor.
3.4 VECTOR QUANTIZATION

Data compression via vector quantization (VQ) is achieved by encoding a data set jointly in block or vector form. Figure 3.11(a) shows an N-dimensional quantizer and a codebook. The incoming vectors can be formed from consecutive data samples or from model parameters. The quantizer maps the i-th incoming [N × 1] vector, given by

si = [si(0), si(1), ..., si(N − 1)]^T, (3.20)

to an n-th channel symbol un, n = 1, 2, ..., L, as shown in Figure 3.11(a). The codebook consists of L code vectors,

ŝn = [ŝn(0), ŝn(1), ..., ŝn(N − 1)]^T, n = 1, 2, ..., L, (3.21)

which reside in the memory of the transmitter and the receiver. A vector quantizer works as follows. The input vectors si are compared to each codeword ŝn, and the address of the closest codeword, with respect to a distortion measure ε(si, ŝn), determines the channel symbol to be transmitted. The simplest and most commonly used distortion measure is the sum of squared errors, which is given by

ε(si, ŝn) = Σ_{k=0}^{N−1} (si(k) − ŝn(k))². (3.22)

The L [N × 1] real-valued vectors are entries of the codebook and are designed by dividing the vector space into L nonoverlapping cells cn, as shown in Figure 3.11(b). Each cell cn is associated with a template vector ŝn. The quantizer assigns the channel symbol un to the vector si if si belongs to cn. The channel symbol un is usually a binary representation of the codebook index of ŝn.
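The encoding rule above, full search with the squared-error distortion of Eq. (3.22), can be sketched in a few lines; the codebook below is an arbitrary illustration, not a trained one:

```python
import numpy as np

# Full-search VQ encoding with the sum-of-squared-errors distortion.
codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0],
                     [-1.0, 1.0],
                     [1.0, -1.0]])    # L = 4 code vectors, N = 2

def vq_encode(s_i, codebook):
    # channel symbol = index of the nearest code vector, Eq. (3.22)
    dists = np.sum((codebook - s_i)**2, axis=1)
    return int(np.argmin(dists))

u = vq_encode(np.array([0.9, 1.2]), codebook)
print(u, codebook[u])                 # index 1, code vector [1. 1.]
```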
A vector quantizer can be considered a generalization of scalar PCM; in fact, Gersho [Gers83] calls it vector PCM (VPCM). In VPCM, the codebook is fully searched, and the number of bits per sample is given by

B = (1/N) log2 L. (3.23)

The signal-to-noise ratio for VPCM is given by

SNR_N = 6B + KN (dB). (3.24)
Figure 3.11 Vector quantization scheme: (a) block diagram (encoder, channel, and decoder, each side holding the codebook in ROM); (b) cells cn, with centroids, for two-dimensional VQ, i.e., N = 2.
Note that for N = 1, VPCM defaults to scalar PCM, and therefore (3.13) is a special case of (3.24). Although the two equations are quite similar, VPCM yields improved SNR (reflected in KN) since it exploits the correlation within the vectors. VQ offers significant coding gain by increasing N and L. However, the memory and the computational complexity required grow exponentially with N for a given rate. In general, the benefits of VQ are realized at rates of 1 bit per sample or less. The codebook design process, also known as the training or populating process, can be fixed or adaptive.
Fixed codebooks are designed a priori, and the basic design procedure involves an initial guess for the codebook and then iterative improvement by using a large number of training vectors. An iterative codebook design algorithm that works for a large class of distortion measures was given by Linde, Buzo, and Gray [Lind80]. This is essentially an extension of Lloyd's [Lloy82] scalar quantizer design and is often referred to as the "LBG algorithm." Typically, the number of training vectors per code vector must be at least ten, and preferably fifty [Makh85]. Since speech and audio are nonstationary signals, one may also wish to adapt the codebooks ("codebook design on the fly") to the signal statistics. A quantizer with an adaptive codebook is called adaptive VQ (A-VQ), and applications to speech coding have been reported in [Paul82] [Cupe85] and [Cupe89]. There are two types of A-VQ, namely forward adaptive and backward adaptive. In backward A-VQ, codebook updating is based on past data that are also available at the decoder. Forward A-VQ updates the codebooks based on current (or sometimes future) data, and as such, additional information must be encoded.
341 Structured VQ
The complexity in high-dimensionality VQ can be reduced significantly with the use of structured codebooks that allow for efficient search. Tree-structured [Buzo80] and multi-step [Juan82] vector quantizers are associated with lower encoding complexity at the expense of a modest loss of performance. Multi-step vector quantizers consist of a cascade of two or more quantizers, each one encoding the error or residual of the previous quantizer. In Figure 3.12(a), the first VQ codebook L1 encodes the signal s(k), and the subsequent VQ stages L2 through LM encode the errors e1(k) through e_{M-1}(k) from the previous stages, respectively. In particular, the codebook L1 is first searched, and the vector ŝ_{l1}(k) that minimizes the MSE (3.25) is selected, where l1 is the codebook index associated with the first-stage codebook,
ε_l = Σ_{k=0}^{N-1} (s(k) - ŝ_l(k))²,  for l = 1, 2, 3, ..., L1.   (3.25)
Next, the difference between the original input s(k) and the first-stage codeword ŝ_{l1}(k) is computed, as shown in Figure 3.12(a). This is given by

e1(k) = s(k) - ŝ_{l1}(k),  for k = 0, 1, 2, 3, ..., N - 1.   (3.26)
The residual e1(k) is used in the second stage as the reference signal to be approximated. Codebooks L2, L3, ..., LM are searched sequentially, and the code vectors ê_{l2,1}(k), ê_{l3,2}(k), ..., ê_{lM,M-1}(k) that result in the minimum MSE (3.27) are chosen as the codewords. Note that l2, l3, ..., lM are the codebook indices
Figure 3.12. (a) Multi-step (M-stage) VQ block diagram. The first-stage codebook L1 encodes s(k); each subsequent stage encodes the residual of the previous one. At the decoder, the signal can be reconstructed as ŝ(k) = ŝ_{l1}(k) + ê_{l2,1}(k) + ... + ê_{lM,M-1}(k), for k = 0, 1, 2, ..., N - 1. (b) An example multi-step codebook structure (M = 3). The first-stage N-dimensional codebook consists of L1 code vectors; similarly, the numbers of entries in the second- and third-stage codebooks are L2 and L3, respectively. Usually, in order to reduce the computational complexity, the numbers of code vectors satisfy L1 > L2 > L3.
associated with the second-, third-, ..., M-th-stage codebooks, respectively.
For codebook L2,

ε_{l2} = Σ_{k=0}^{N-1} (e1(k) - ê_{l,1}(k))²,  for l = 1, 2, 3, ..., L2,

and, similarly, for codebook LM,

ε_{lM} = Σ_{k=0}^{N-1} (e_{M-1}(k) - ê_{l,M-1}(k))²,  for l = 1, 2, 3, ..., LM.   (3.27)
At the decoder, the transmitted codeword ŝ(k) can be reconstructed as follows:

ŝ(k) = ŝ_{l1}(k) + ê_{l2,1}(k) + ... + ê_{lM,M-1}(k),  for k = 0, 1, 2, ..., N - 1.   (3.28)
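To make the multi-step search concrete, here is a minimal two-stage (M = 2) sketch in Python. The random codebooks are hypothetical stand-ins for LBG-trained ones, and the zero codevector added to stage 2 is only there so the second stage can never increase the error:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8            # vector dimension
L1, L2 = 16, 16  # stage-1 and stage-2 codebook sizes (illustrative)

# Hypothetical codebooks; in practice these come from LBG training.
cb1 = rng.standard_normal((L1, N))
cb2 = 0.3 * rng.standard_normal((L2, N))
cb2[0] = 0.0     # keep a zero codevector so stage 2 never hurts

def nearest(cb, x):
    # Index of the codevector minimizing the squared error, Eq. (3.25)
    return int(np.argmin(np.sum((cb - x) ** 2, axis=1)))

def ms_vq_encode(s):
    l1 = nearest(cb1, s)
    e1 = s - cb1[l1]          # first-stage residual, Eq. (3.26)
    l2 = nearest(cb2, e1)     # second stage quantizes the residual
    return l1, l2

def ms_vq_decode(l1, l2):
    return cb1[l1] + cb2[l2]  # reconstruction, Eq. (3.28) with M = 2

s = rng.standard_normal(N)
l1, l2 = ms_vq_encode(s)
s_hat = ms_vq_decode(l1, l2)
```

Only the two indices (l1, l2) need to be transmitted, which is the source of the coding gain over transmitting the vector itself.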
The complexity of VQ can also be reduced by normalizing the vectors of the codebook and encoding the gain separately. The technique is called gain/shape VQ (GS-VQ) and was introduced by Buzo et al. [Buzo80] and later studied by Sabin and Gray [Sabi82]. The waveform shape is represented by a code vector from the shape codebook, while the gain can be encoded from the gain codebook (Figure 3.13). The idea of encoding the gain separately allows for the encoding of
Figure 3.13. Gain/shape (GS)-VQ encoder and decoder. In the GS-VQ, the idea is to encode the waveform shape and the gain separately, using the shape and gain codebooks, respectively.
vectors of high dimensionality with manageable complexity, and it is widely used in encoding the excitation signal in code-excited linear predictive (CELP) coders [Atal90]. An alternative method for building highly structured codebooks consists of forming the code vectors by linearly combining a small set of basis vectors [Gers90b].
3.4.2 Split-VQ
In split-VQ, it is typical to employ two stages of VQ; multiple codebooks of smaller dimension are used in the second stage. A two-stage split-VQ block diagram is given in Figure 3.14(a), and the corresponding split-VQ codebook structure is shown in Figure 3.14(b). From Figure 3.14(b), note that the first-stage codebook L1 employs an N-dimensional VQ and consists of L1 code vectors. The second-stage codebook is implemented as a split-VQ and consists of a combination of two N/2-dimensional codebooks, L2 and L3. The numbers of entries in these two codebooks are L2 and L3, respectively. First, the codebook L1 is searched, and the vector ŝ_{l1}(k) that minimizes the MSE (3.29) is selected, where l1 is the index of the first-stage codebook,
ε_l = Σ_{k=0}^{N-1} (s(k) - ŝ_l(k))²,  for l = 1, 2, 3, ..., L1.   (3.29)
Next, the difference between the original input s(k) and the first-stage codeword ŝ_{l1}(k) is computed, as shown in Figure 3.14(a):

e(k) = s(k) - ŝ_{l1}(k),  for k = 0, 1, 2, 3, ..., N - 1.   (3.30)
The residual e(k) is used in the second stage as the reference signal to be approximated. Codebooks L2 and L3 are searched separately, and the code vectors ê_{l2,low} and ê_{l3,upp} that result in the minimum MSE (3.31) and (3.32), respectively, are chosen as the codewords. Note that l2 and l3 are the codebook indices associated with the codebooks L2 and L3.
For codebook L2,

ε_{l,low} = Σ_{k=0}^{N/2-1} (e(k) - ê_{l,low}(k))²,  for l = 1, 2, 3, ..., L2,   (3.31)

and for codebook L3,

ε_{l,upp} = Σ_{k=N/2}^{N-1} (e(k) - ê_{l,upp}(k - N/2))²,  for l = 1, 2, 3, ..., L3.   (3.32)
Figure 3.14. Split-VQ: (a) a two-stage split-VQ block diagram; (b) the split-VQ codebook structure. In the split-VQ, the codevector search is performed by "dividing" the codebook into smaller-dimension codebooks. In Figure 3.14(b), the second-stage N-dimensional VQ has been divided into two N/2-dimensional split VQs. Note that the first-stage codebook L1 employs an N-dimensional VQ and consists of L1 entries. The second-stage codebook is implemented as a split-VQ and consists of a combination of two N/2-dimensional codebooks, L2 and L3. The numbers of entries in these two codebooks are L2 and L3, respectively. The codebook indices, i.e., l1, l2, and l3, are encoded and transmitted. At the decoder, these codebook indices are used to reconstruct the transmitted codeword ŝ(k), Eq. (3.33).
At the decoder, the transmitted codeword ŝ(k) can be reconstructed as follows:

ŝ(k) = ŝ_{l1}(k) + ê_{l2,low}(k),         for k = 0, 1, 2, ..., N/2 - 1,
ŝ(k) = ŝ_{l1}(k) + ê_{l3,upp}(k - N/2),   for k = N/2, N/2 + 1, ..., N - 1.   (3.33)
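A minimal sketch of this split search, with hypothetical random codebooks in place of trained ones. Because the second-stage error separates into independent lower- and upper-half terms, searching L2 and L3 separately finds the jointly optimal pair:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
L1, L2, L3 = 32, 8, 8   # illustrative codebook sizes

cb1 = rng.standard_normal((L1, N))             # stage 1: N-dimensional
cb2 = 0.3 * rng.standard_normal((L2, N // 2))  # stage 2, lower half of e(k)
cb3 = 0.3 * rng.standard_normal((L3, N // 2))  # stage 2, upper half of e(k)

def nearest(cb, x):
    return int(np.argmin(np.sum((cb - x) ** 2, axis=1)))

def split_vq_encode(s):
    l1 = nearest(cb1, s)
    e = s - cb1[l1]                    # residual, Eq. (3.30)
    l2 = nearest(cb2, e[: N // 2])     # search of Eq. (3.31)
    l3 = nearest(cb3, e[N // 2:])      # search of Eq. (3.32)
    return l1, l2, l3

def split_vq_decode(l1, l2, l3):
    # Reconstruction, Eq. (3.33)
    return cb1[l1] + np.concatenate([cb2[l2], cb3[l3]])

s = rng.standard_normal(N)
l1, l2, l3 = split_vq_encode(s)
s_hat = split_vq_decode(l1, l2, l3)
```

Two searches of size L2 and L3 thus replace one search of size L2·L3, which is the complexity advantage of the split.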
Split-VQ offers high coding accuracy, however, with increased computational complexity and with a slight drop in the coding gain. Paliwal and Atal discuss these issues in [Pali91] [Pali93], while presenting an algorithm for vector quantization of speech LPC parameters at 24 bits/frame. Despite the aforementioned shortcomings, split-VQ techniques are very efficient when it comes to encoding line spectrum prediction parameters, and they are employed in several speech and audio coding standards, such as the ITU-T G.729 CS-ACELP standard [G729], the IS-893 Selectable Mode Vocoder (SMV) [IS-893], and the MPEG-4 General Audio (GA) Coding Twin-VQ tool [ISOI99].
3.4.3 Conjugate-Structure VQ
Conjugate-structure VQ (CS-VQ) [Kata93] [Kata96] enables joint quantization of two or more parameters. The CS-VQ works as follows. Let s(n) be the target vector that has to be approximated; the MSE ε to be minimized is given by
ε = (1/N) Σ_{n=0}^{N-1} |e(n)|² = (1/N) Σ_{n=0}^{N-1} |s(n) - g1·u(n) - g2·v(n)|²,   (3.34)
where u(n) and v(n) are some arbitrary vectors, and g1 and g2 are the gains to be vector quantized using the CS codebook given in Figure 3.15. From this figure, codebooks A and B contain P and Q entries, respectively. In both codebooks, the first-column element corresponds to parameter 1, i.e., g1, and the second-column element represents parameter 2, i.e., g2. The optimum combination of g1 and g2 that results in the minimum MSE (3.34) is computed from the PQ permutations as follows:
g1(i, j) = g_{A1,i} + g_{B1,j},  i ∈ [1, 2, ..., P], j ∈ [1, 2, ..., Q],   (3.35)
g2(i, j) = g_{A2,i} + g_{B2,j},  i ∈ [1, 2, ..., P], j ∈ [1, 2, ..., Q].   (3.36)
Figure 3.15. An example CS-VQ. In this figure, the codebooks 'A' and 'B' are conjugate. Codebook A contains P entries; its i-th entry holds the pair (g_{A1,i}, g_{A2,i}). Codebook B contains Q entries; its j-th entry holds the pair (g_{B1,j}, g_{B2,j}).
CS-VQ codebooks are particularly handy in scenarios that involve joint quantization of excitation gains. Second-generation near-toll-quality CELP codecs (e.g., ITU-T G.729) and third-generation (3G) CELP standards for cellular applications (e.g., TIA IS-893 Selectable Mode Vocoder) employ CS-VQ codebooks to encode the adaptive and stochastic excitation gains [Atal90] [Sala98] [G729] [IS-893]. A CS-VQ is also used to vector quantize the transformed spectral coefficients in the MPEG-4 Twin-VQ encoder.
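The PQ-permutation gain search of (3.34)-(3.36) can be sketched as follows; the conjugate codebooks here are random placeholders, not values from any standard:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
N, P, Q = 40, 8, 8

# Hypothetical conjugate codebooks: row i of cbA holds (gA1_i, gA2_i),
# row j of cbB holds (gB1_j, gB2_j).
cbA = rng.standard_normal((P, 2))
cbB = rng.standard_normal((Q, 2))

def cs_vq_gains(s, u, v):
    """Search all P*Q index pairs for the gains minimizing Eq. (3.34)."""
    best = None
    for i, j in product(range(P), range(Q)):
        g1 = cbA[i, 0] + cbB[j, 0]          # Eq. (3.35)
        g2 = cbA[i, 1] + cbB[j, 1]          # Eq. (3.36)
        mse = float(np.mean((s - g1 * u - g2 * v) ** 2))
        if best is None or mse < best[0]:
            best = (mse, i, j)
    return best

u, v = rng.standard_normal(N), rng.standard_normal(N)
s = 0.7 * u - 0.4 * v + 0.01 * rng.standard_normal(N)
mse, i, j = cs_vq_gains(s, u, v)
```

Only the index pair (i, j) is transmitted; the decoder rebuilds (g1, g2) from the same conjugate tables.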
3.5 BIT-ALLOCATION ALGORITHMS
Until now, we discussed various scalar and vector quantization algorithms without emphasizing how the number of quantization levels is determined. In this section, we review some of the fundamental bit-allocation techniques. A bit-allocation algorithm determines the number of bits required to quantize an audio frame with reduced audible distortions. Bit allocation can be based on certain perceptual rules or spectral characteristics. From Figure 3.1, parameters typically quantized include the transform coefficients x, scale factors S, and the residual error e. For now, let us consider that the transform coefficients x are to be quantized, i.e.,
x = [x1, x2, x3, ..., x_{Nf}]ᵀ,   (3.37)
where Nf represents the total number of transform coefficients. Let the total number of bits available to quantize the transform coefficients be N bits. Our objective is to find an optimum way of distributing the available N bits across the individual transform coefficients such that a distortion measure D is minimized. The distortion D is given by
D = (1/Nf) Σ_{i=1}^{Nf} E[(xi - x̂i)²] = (1/Nf) Σ_{i=1}^{Nf} di,   (3.38)
where xi and x̂i denote the i-th unquantized and quantized transform coefficients, respectively, and E[·] is the expectation operator. Let ni be the number of bits assigned to the coefficient xi for quantization, such that
Σ_{i=1}^{Nf} ni ≤ N.   (3.39)
Note that if the xi are uniformly distributed ∀i, then we can employ a simple uniform bit allocation across all the transform coefficients, i.e.,

ni = [N/Nf],  ∀i ∈ [1, Nf].   (3.40)
However, in practice, the transform coefficients x may not have a uniform probability distribution. Therefore, employing an equal number of bits for both large and small amplitudes may result in spending extra bits on the smaller amplitudes. Moreover, in such scenarios, for a given N, the distortion D can be very high.
Example 3.1
An example of the aforementioned discussion is presented in Table 3.1. Uniform bit allocation is employed to quantize both uniformly distributed and Gaussian-distributed transform coefficients. In this example, we assume that a total of N = 64 bits is available for quantization and Nf = 16 samples; therefore, ni = 4, ∀i ∈ [1, 16]. Note that the input vectors xu and xg have been randomly generated in MATLAB using the rand(1, Nf) and randn(1, Nf) functions, respectively. The distortions Du and Dg are computed using (3.38) and are given by 0.00023927 and 0.00042573, respectively. From Example 3.1, we note that uniform bit allocation is not optimal for all cases, especially when the distribution of the unquantized vector x is not uniform. Therefore, we must have a cost function that minimizes the distortion di while the constraint given in (3.39) is met. This is given by
min_{ni} D = min_{ni} (1/Nf) Σ_{i=1}^{Nf} E[(xi - x̂i)²] = min_{ni} (1/Nf) Σ_{i=1}^{Nf} σi²,   (3.41)
where σi² is the variance. Note that the above minimization problem can be simplified if the quantization noise has a uniform PDF [Jaya84]. From (3.12),

σi² = xi² / (3 · 2^{2ni}).   (3.42)
Table 3.1. Uniform bit-allocation scheme, where ni = [N/Nf], ∀i ∈ [1, Nf].

Uniformly distributed coefficients:
  Input vector xu:      [0.6029, 0.3806, 0.56222, 0.12649, 0.26904, 0.47535, 0.4553, 0.38398, 0.41811, 0.35213, 0.23434, 0.32256, 0.31352, 0.3026, 0.32179, 0.16496]
  Quantized vector x̂u:  [0.625, 0.375, 0.5625, 0.125, 0.25, 0.5, 0.4375, 0.375, 0.4375, 0.375, 0.25, 0.3125, 0.3125, 0.3125, 0.3125, 0.1875]

Gaussian-distributed coefficients:
  Input vector xg:      [0.5199, 2.4205, -0.94578, -0.0081113, -0.42986, -0.87688, 1.1553, -0.82724, -1.345, -0.15859, -0.23544, 0.85353, 0.016574, -2.0292, 1.2702, 0.28333]
  Quantized vector x̂g:  [0.5, 2.4375, -0.9375, 0, -0.4375, -0.875, 1.125, -0.8125, -1.375, -0.1875, -0.25, 0.875, 0, -2.0, 1.25, 0.3125]

D = (1/Nf) Σ_{i=1}^{Nf} E[(xi - x̂i)²]; ni = 4, ∀i ∈ [1, 16]; Du = 0.00023927 and Dg = 0.00042573.
Substituting (3.42) in (3.41) and minimizing w.r.t. ni,

∂D/∂ni = (xi²/3)(-2) 2^{-2ni} ln 2 + K1 = 0,   (3.43)

ni = (1/2) log2 xi² + K.   (3.44)
From (3.44) and (3.39),

Σ_{i=1}^{Nf} ni = Σ_{i=1}^{Nf} ((1/2) log2 xi² + K) = N,   (3.45)

K = N/Nf - (1/(2Nf)) Σ_{i=1}^{Nf} log2 xi² = N/Nf - (1/(2Nf)) log2(Π_{i=1}^{Nf} xi²).   (3.46)
Substituting (3.46) in (3.44), we can obtain the optimum bit allocation ni^{optimum} as

ni^{optimum} = N/Nf + (1/2) log2 [ xi² / (Π_{i=1}^{Nf} xi²)^{1/Nf} ].   (3.47)
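Eq. (3.47) is straightforward to evaluate. The sketch below (with a made-up coefficient vector) returns the real-valued allocation before any rounding to nonnegative integers, which is where methods such as those of [Rams82] and [Sega76] come in:

```python
import numpy as np

def optimal_bits(x, n_total):
    """Optimum bit allocation of Eq. (3.47):
    n_i = N/Nf + 0.5 * log2(x_i^2 / GM), with GM the geometric mean
    of the squared coefficients (no integer rounding applied here)."""
    x = np.asarray(x, dtype=float)
    nf = x.size
    log2_sq = np.log2(x ** 2)
    # Subtracting the mean of log2(x_i^2) divides by the geometric mean.
    return n_total / nf + 0.5 * (log2_sq - np.mean(log2_sq))

# Illustrative coefficients (first four entries of xu from Table 3.1)
n = optimal_bits([0.6029, 0.3806, 0.56222, 0.12649], 16)
```

By construction the allocation sums exactly to the bit budget, and higher-power coefficients receive more bits.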
Table 3.2 presents the optimum bit assignment for both the uniformly distributed and Gaussian-distributed transform coefficients considered in the previous example (Table 3.1). From Table 3.2, note that the resulting optimal bit allocation for the Gaussian-distributed transform coefficients produced two negative integers. Several techniques have been proposed to avoid this scenario, namely, the sequential bit-allocation method [Rams82] and Segall's method [Sega76]. For more detailed descriptions, readers are referred to [Jaya84] [Madi97]. Note that the bit-allocation scheme given by (3.47) may not be optimal either in the perceptual sense or in the SNR sense. This is because the minimization of (3.41) is performed without considering either the perceptual noise-masking thresholds or the dependence of the signal-to-noise power on the optimal number of bits ni^{optimum}. Also note that the distortion Du in the case of optimal bit assignment (Table 3.2) is slightly greater than in the case of uniform bit allocation (Table 3.1). This can be attributed to the fact that fewer quantization levels must have been assigned to the low-powered transform coefficients, relative to the number of levels implied by (3.47). Moreover, a maximum coding gain can be achieved when the audio signal spectrum is non-flat. One of the important remarks presented in [Jaya84] relevant to the on-going discussion is that when the geometric mean of xi² is less than the arithmetic mean of xi², the optimal bit-allocation scheme performs better than the uniform bit allocation. The ratio
Table 3.2. Optimal bit-allocation scheme, where ni^{optimum} = N/Nf + (1/2) log2 xi² - (1/2) log2 (Π_{i=1}^{Nf} xi²)^{1/Nf}.

Uniformly distributed coefficients:
  Input vector xu:      [0.6029, 0.3806, 0.56222, 0.12649, 0.26904, 0.47535, 0.4553, 0.38398, 0.41811, 0.35213, 0.23434, 0.32256, 0.31352, 0.3026, 0.32179, 0.16496]
  Quantized vector x̂u:  [0.59375, 0.375, 0.5625, 0.125, 0.25, 0.46875, 0.4375, 0.375, 0.4375, 0.375, 0.25, 0.3125, 0.3125, 0.3125, 0.3125, 0.125]
  Bits allocated: n = [5, 4, 5, 3, 4, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3]; distortion Du = 0.0002468.

Gaussian-distributed coefficients:
  Input vector xg:      [0.5199, 2.4205, -0.94578, -0.0081113, -0.42986, -0.87688, 1.1553, -0.82724, -1.345, -0.15859, -0.23544, 0.85353, 0.016574, -2.0292, 1.2702, 0.28333]
  Quantized vector x̂g:  [0.5, 2.4219, -0.9375, 0, -0.4375, -0.875, 1.1563, -0.8125, -1.3438, -0.125, -0.25, 0.84375, 0, -2.0313, 1.2656, 0.25]
  Bits allocated: n = [4, 6, 5, -2, 4, 5, 5, 5, 6, 3, 3, 5, -1, 6, 6, 3]; distortion Dg = 0.00022878.
of the two means is captured in the spectral flatness measure (SFM), i.e.,

Geometric mean: GM = (Π_{i=1}^{Nf} xi²)^{1/Nf},
Arithmetic mean: AM = (1/Nf) Σ_{i=1}^{Nf} xi²,
SFM = GM/AM, with SFM ∈ [0, 1].   (3.48)
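A small helper for Eq. (3.48), computing the geometric mean in the log domain to avoid numerical overflow; the input vectors are illustrative:

```python
import numpy as np

def sfm(x):
    """Spectral flatness measure of Eq. (3.48): ratio of the geometric
    mean to the arithmetic mean of the squared coefficients."""
    p = np.asarray(x, dtype=float) ** 2
    gm = np.exp(np.mean(np.log(p)))   # geometric mean via log domain
    am = np.mean(p)
    return gm / am

flat = sfm([1.0, 1.0, 1.0, 1.0])    # flat spectrum: SFM = 1
peaky = sfm([1.0, 0.1, 0.1, 0.1])   # non-flat spectrum: SFM < 1
```

The further the SFM drops below 1, the more non-flat the spectrum and the larger the potential coding gain from non-uniform allocation.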
Other important considerations, in addition to the SNR and spectral flatness measures, are the perceptual noise masking, the noise-to-mask ratio (NMR), and the signal-to-mask ratio (SMR). All these measures are used in perceptual bit-allocation methods. Since, at this point, readers have not yet been introduced to the principles of psychoacoustics and the concepts of SMR and NMR, we defer a discussion of perceptually based bit allocation to Chapter 5, Section 5.8.
3.6 ENTROPY CODING
It is worthwhile to consider the theoretical limits for the minimum number of bits required to represent an audio sample. Shannon, in his mathematical theory of communication [Shan48], proved that the minimum number of bits required to encode a message X is given by the entropy He(X). The entropy of an input signal can be defined as follows. Let X = [x1, x2, ..., xN] be the input data vector of length N, and let pi be the probability that the i-th symbol (over the symbol set V = [v1, v2, ..., vK]) is transmitted. The entropy He(X) is given by
He(X) = -Σ_{i=1}^{K} pi log2(pi).   (3.49)
In simple terms, entropy is a measure of the uncertainty of a random variable. For example, let the input bitstream to be encoded be X = [4 5 6 6 2 5 4 4 5 4 4], i.e., N = 11, the symbol set V = [2, 4, 5, 6], and the corresponding probabilities [1/11, 5/11, 3/11, 2/11], respectively, with K = 4. The entropy He(X) can be computed as follows:

He(X) = -Σ_{i=1}^{K} pi log2(pi)
      = -[(1/11) log2(1/11) + (5/11) log2(5/11) + (3/11) log2(3/11) + (2/11) log2(2/11)]
      = 1.7899.   (3.50)
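The computation of (3.49)-(3.50) can be checked with a few lines of Python:

```python
from math import log2

def entropy(probs):
    """Shannon entropy, Eq. (3.49); zero-probability symbols contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Symbol probabilities for X = [4 5 6 6 2 5 4 4 5 4 4]
He = entropy([1/11, 5/11, 3/11, 2/11])   # Eq. (3.50): about 1.7899 bits
```

The same function reproduces the 2-bit entropy of the swimmer example that follows, since its probabilities are powers of 1/2.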
Table 3.3. An example entropy code for Example 3.2.

Swimmer    Probability of winning    Binary string or the identifier
S1         1/2                       0
S2         1/4                       10
S3         1/8                       110
S4         1/16                      1110
S5         1/64                      111100
S6         1/64                      111101
S7         1/64                      111110
S8         1/64                      111111
Example 3.2

Consider eight swimmers, S1, S2, S3, S4, S5, S6, S7, and S8, in a race with win probabilities 1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, and 1/64, respectively. The entropy of the message announcing the winner can be computed as

He(X) = -Σ_{i=1}^{8} pi log2(pi)
      = -[(1/2) log2(1/2) + (1/4) log2(1/4) + (1/8) log2(1/8) + (1/16) log2(1/16) + (4/64) log2(1/64)]
      = 2.
An example of an entropy code for the above message can be obtained by associating binary strings with the swimmers' probabilities of winning, as shown in Table 3.3. The average length of the example entropy code given in Table 3.3 is 2 bits, in contrast with 3 bits for a uniform code. The statistical entropy alone does not provide a good measure of compressibility in the case of audio coding, since several other factors, i.e., quantization noise, masking thresholds, and tone- and noise-masking effects, must be accounted for in order to achieve efficiency. Johnston, in 1988, proposed a theoretical limit on compressibility for audio signals (~2.1 bits/sample) based on a measure of the perceptually relevant information content. The limit is obtained based on both psychoacoustic signal analysis and statistical entropy, and it is called the perceptual entropy [John88a] [John88b]. The various steps involved in the perceptual entropy estimation are described later in Chapter 5. In all the entropy coding schemes, the objective is to construct an ensemble code for each message such that the code is uniquely decodable, prefix-free, and optimum in the sense that it provides minimum-redundancy encoding. In particular, some basic restrictions and design considerations imposed on a source-coding process include:
76 QUANTIZATION AND ENTROPY CODING
• Condition 1: Each message should be assigned a unique code (see Example 3.3).
• Condition 2: The codes must be prefix-free. For example, consider the following two code sequences (CS): CS-1 = {00, 11, 10, 011, 001} and CS-2 = {00, 11, 10, 011, 010}. Let the output sequence to be decoded be 001100011. At the decoder, if CS-1 is employed, the decoded sequence can be either {00, 11, 00, 011} or {001, 10, 00, 11}, etc. This is due to the confusion at the decoder whether to select '00' or '001' from the output sequence. This confusion is avoided using CS-2, where the decoded sequence is unique and is given by {00, 11, 00, 011}. Therefore, in a valid prefix-free code, no codeword in its entirety can be found as a prefix of another codeword.
• Condition 3: Additional information regarding the beginning and end points of a message source will not usually be available at the decoder (once synchronization occurs).
• Condition 4: A necessary condition for a code to be prefix-free is given by the Kraft inequality [Cove91],

KI = Σ_{i=1}^{N} 2^{-Li} ≤ 1,   (3.51)

where Li is the codeword length of the i-th symbol. In the example discussed in Condition 2, the Kraft inequality KI for both CS-1 and CS-2 equals 1. Although the Kraft inequality for CS-1 is satisfied, its encoding sequence is not prefix-free. Note that (3.51) is not a sufficient condition for a code to be prefix-free.
• Condition 5: To obtain a minimum-redundancy code, the compression rate R must be minimized while (3.51) is satisfied:

R = Σ_{i=1}^{N} pi Li.   (3.52)
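Conditions 2 and 4 are easy to check mechanically; the sketch below tests the two code sequences CS-1 and CS-2 from Condition 2:

```python
def kraft_sum(codes):
    """Kraft sum of Eq. (3.51); a value <= 1 is necessary (but not
    sufficient) for a prefix-free code."""
    return sum(2.0 ** -len(c) for c in codes)

def is_prefix_free(codes):
    """Condition 2: no codeword may be a prefix of another."""
    return not any(a != b and b.startswith(a) for a in codes for b in codes)

cs1 = ["00", "11", "10", "011", "001"]   # satisfies (3.51) but not prefix-free
cs2 = ["00", "11", "10", "011", "010"]   # satisfies (3.51) and prefix-free
```

Both sequences give a Kraft sum of exactly 1, yet only CS-2 passes the prefix test, illustrating why (3.51) alone is not sufficient.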
Example 3.3

Let the input bitstream X = [4 5 6 6 2 5 4 4 1 4 4] be chosen over a data set V = [0, 1, 2, 3, 4, 5, 6, 7]. Here, N = 11, K = 8, and the probabilities are pi = [0, 1/11, 1/11, 0, 5/11, 2/11, 2/11, 0], i ∈ V.

In Figure 3.16(a), a simple binary equal-length code is used. The code length of each symbol is given by l = log2 K, i.e., l = log2 8 = 3. Therefore, a possible binary mapping would be 1 → 001, 2 → 010, 4 → 100, 5 → 101, 6 → 110, and the total length Lb = 33 bits.

Figure 3.16(b) depicts the Shannon-Fano coding procedure. Each symbol is encoded using a unary representation based on the symbol probability.
Figure 3.16. Entropy coding schemes: (a) binary equal-length code; (b) Shannon-Fano coding.

Input bitstream: [4 5 6 6 2 5 4 4 1 4 4].

(a) Encoding based on the binary equal-length code: [100 101 110 110 010 101 100 100 001 100 100]; total length Lb = 33 bits.

(b) Encoding based on Shannon-Fano coding. The symbol '4' occurs five times, '5' and '6' twice each, and '2' and '1' once each; probabilities p4 = 5/11, p5 = 2/11, p6 = 2/11, p2 = 1/11, and p1 = 1/11. Representations: 4 → 0, 5 → 10, 6 → 110, 1 → 1110, 2 → 11110. Encoded bitstream: [0 10 110 110 11110 10 0 0 1110 0 0]; total length LSF = 24 bits.
3.6.1 Huffman Coding
Huffman proposed a technique to construct minimum-redundancy codes [Huff52]. Huffman codes found applications in audio and video encoding due to their simplicity and efficiency. Moreover, Huffman coding is regarded as the most effective compression method, provided that the codes designed using a specific set of symbol frequencies match the input symbol frequencies. PDFs of audio signals over shorter frame lengths are better described by the Gaussian distribution, while the long-time PDFs of audio can be characterized by the Laplacian or gamma densities [Rabi78] [Jaya84]. Hence, for example, Huffman codes designed based on the Gaussian or Laplacian PDFs can provide minimum-redundancy entropy codes for audio encoding. Moreover, depending upon the symbol frequencies, a series of Huffman codetables can also be employed for entropy coding; e.g., MPEG-1 Layer III employs 32 Huffman codetables [ISOI94].
Example 3.4

Figure 3.17 depicts the Huffman coding procedure for the numerical Example 3.3. The input symbols, i.e., 1, 2, 4, 5, and 6, are first arranged in ascending order w.r.t. their probabilities. Next, the two symbols with the smallest probabilities are combined to form a binary tree. The left branch is assigned a "0," and the right branch is represented by a "1." The probability of the resulting node is obtained by adding the two probabilities of the previous nodes, as shown in Figure 3.17. The above procedure is continued until all the input symbol nodes are used. Finally, the Huffman code for each input symbol is formed by reading the bits along the tree. For example, the Huffman codeword for the input symbol "1" is given by "0000." The resulting Huffman bit-mapping is given in Table 3.4, and the total length of the encoded bitstream is LHF = 23 bits. Note that, depending on the node selection for the code tree formation,
Figure 3.17. A possible Huffman coding tree for Example 3.4. The resulting codewords are 1 → 0000, 2 → 0001, 6 → 001, 5 → 01, and 4 → 1.
Table 3.4. Huffman codetable for the input bitstream X = [4 5 6 6 2 5 4 4 1 4 4].

Input symbol    Probability    Huffman codeword
4               5/11           1
5               2/11           01
6               2/11           001
2               1/11           0001
1               1/11           0000
several Huffman bit-mappings are possible, for example, 4 → 1, 5 → 011, 6 → 010, 2 → 001, and 1 → 000, as shown in Figure 3.18. However, the total number of bits remains the same, i.e., LHF = 23 bits.
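A heap-based sketch of the Huffman construction; tie-breaking is arbitrary, so the table it returns may be any of the equivalent bit-mappings discussed above, but the total length for X is always 23 bits:

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build one valid Huffman table by repeatedly merging the two
    least-probable nodes (ties broken arbitrarily)."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tiebreak id, {symbol: code-so-far})
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    uid = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, uid, merged))
        uid += 1
    return heap[0][2]

X = [4, 5, 6, 6, 2, 5, 4, 4, 1, 4, 4]
table = huffman_code(X)
total_bits = sum(len(table[s]) for s in X)   # LHF = 23 bits, as in Example 3.4
```

Any table this produces is prefix-free, and its total cost equals the Huffman optimum regardless of how ties are broken.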
Example 3.5

The entropy of the input bitstream X = [4 5 6 6 2 5 4 4 1 4 4] is given by

He(X) = -Σ_{i=1}^{K} pi log2(pi)
      = -[(2/11) log2(1/11) + (4/11) log2(2/11) + (5/11) log2(5/11)]   (3.53)
      = 2.04.
Figure 3.18. An alternative Huffman coding tree for Example 3.4, with codewords 1 → 000, 2 → 001, 6 → 010, 5 → 011, and 4 → 1.
From Figures 3.16 and 3.17, the compression rate R is obtained using:

(a) the uniform binary representation: R = 33/11 = 3 bits/symbol;
(b) Shannon-Fano coding: R = 24/11 = 2.18 bits/symbol; and
(c) Huffman coding: R = 23/11 = 2.09 bits/symbol.

In the case of Huffman coding, the entropy He(X) and the compression rate R can be related using the entropy bounds [Cove91]:

He(X) ≤ R ≤ He(X) + 1.   (3.54)

It is interesting to note that the compression rate for the Huffman code will equal the lower entropy bound, i.e., R = He(X), if the input symbol probabilities are negative powers of 2 (see Example 3.2).
Example 3.6

We now construct the Huffman codetable for a different set of input symbol frequencies than the one given in Example 3.4. Consider the input bitstream Y = [2 5 6 6 2 5 5 4 1 4 4] chosen over the data set V = [0, 1, 2, 3, 4, 5, 6, 7], with probabilities pi = [0, 1/11, 2/11, 0, 3/11, 3/11, 2/11, 0], i ∈ V. Using the design procedure described above, a Huffman code tree can be formed, as shown in Figure 3.19. Table 3.5 presents the resulting Huffman codetable. The total length of the Huffman-encoded bitstream is LHF = 25 bits.
Depending on the Huffman codetable design procedure employed, three different encoding approaches are possible. First, entropy coding can be based on
Figure 3.19. Huffman coding tree for Example 3.6, with codewords 1 → 000, 2 → 001, 6 → 01, 5 → 10, and 4 → 11.
Table 3.5. Huffman codetable for the input bitstream Y = [2 5 6 6 2 5 5 4 1 4 4].

Input symbol    Probability    Huffman codeword
4               3/11           11
5               3/11           10
6               2/11           01
2               2/11           001
1               1/11           000
the Huffman codes designed beforehand, i.e., nonadaptive Huffman coding. In particular, a training process involving a large database of input symbols is employed to design Huffman codes. These Huffman codetables are available both at the encoder and at the decoder. It is important to note that this approach may not (always) result in minimum-redundancy encoding. For example, if the Huffman bit-mapping given in Table 3.5 is used to encode the input bitstream X = [4 5 6 6 2 5 4 4 1 4 4] given in Example 3.3, the resulting total number of bits is LHF = 24 bits, i.e., one bit more compared to Example 3.4. Therefore, in order to obtain better compression, a reliable symbol-frequency model is necessary. A series of Huffman codetables (typically in the range of 10-32), based on the symbol probabilities, is usually employed in order to overcome the aforementioned shortcoming. The nonadaptive Huffman coding method is employed in a variety of audio coding standards [ISOI92] [ISOI94] [ISOI96] [John96]. Second, Huffman coding can be based on an iterative design/encode procedure, i.e., semi-adaptive Huffman coding. In the entropy coding literature, this approach is typically called the "two-pass" encoding scheme. In the first pass, a Huffman codetable is designed
based on the input symbol statistics. In the second pass, entropy coding is performed using the designed Huffman codetable (similar to Examples 3.4 and 3.6). In this approach, note that the designed Huffman codetables must also be transmitted along with the entropy-coded bitstream. This results in reduced coding efficiency, however, with improved symbol-frequency modeling. Third, adaptive Huffman coding is based on the symbol frequencies computed dynamically from previous samples. Adaptive Huffman coding schemes based on the input quantization step size have also been proposed, in order to accommodate a wide range of input word lengths [Crav96] [Crav97].
3.6.2 Rice Coding
Rice, in 1979, proposed a method for constructing practical noiseless codes [Rice79]. Rice codes are usually employed when the input signal x exhibits the Laplacian distribution, i.e.,

pL(x) = (1/(√2 σx)) e^{-√2 |x| / σx}.   (3.55)
A Rice code can be considered a Huffman code for the Laplacian PDF. Several efficient algorithms are available to form Rice codes [Rice79] [Cove91]. A simple method to represent the integer I as a Rice code is to divide the integer into four parts: a sign bit, the m low-order bits (LSBs), the number corresponding to the remaining MSBs of I expressed as zeros, and a stop bit '1'. The parameter "m" characterizes the Rice code and is given by [Robi94]

m = log2(loge(2) E(|x|)).   (3.56)
For example, the Rice code for I = 69 and m = 4 is given by [0 0101 0000 1].
Example 3.7

Rice coding for Example 3.3; input bitstream = [4 5 6 6 2 5 4 4 1 4 4].

Input symbol    Binary representation    Rice code (m = 2)
4               100                      0 00 0 1
5               101                      0 01 0 1
6               110                      0 10 0 1
2               010                      0 10 1
1               001                      0 01 1
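The four-part construction described above can be sketched as follows (sign bit, m LSBs, remaining MSBs as zeros, stop bit '1'):

```python
def rice_code(value, m):
    """Rice code per the four-part scheme in the text: a sign bit,
    the m LSBs, the remaining MSBs as zeros, and a stop bit '1'."""
    sign = "0" if value >= 0 else "1"
    mag = abs(value)
    lsbs = format(mag % (1 << m), "0{}b".format(m)) if m > 0 else ""
    zeros = "0" * (mag >> m)   # unary part for the high-order bits
    return sign + lsbs + zeros + "1"

code_69 = rice_code(69, 4)    # sign 0, LSBs 0101, MSBs 4 -> 0000, stop 1
```

This reproduces both the I = 69, m = 4 example and the m = 2 codes in the table of Example 3.7.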
3.6.3 Golomb Coding
Golomb codes [Golo66] are optimal for exponentially decaying probability distributions of positive integers. Golomb codes are prefix codes that can be characterized by a unique parameter "m." An integer "I" can be encoded using a Golomb code as follows. The code consists of two parts: a binary representation of (I mod m) and a unary representation of ⌊I/m⌋. For example, consider I = 69 and m = 16. The Golomb code will be [010111100], as explained in Figure 3.20. In Method 1, the positive integer "I" is divided into two parts, i.e., binary and unary bits, along with a stop bit. On the other hand, in Method 2, if m = 2^k, the codeword for "I" consists of the "k" LSBs of "I," followed by the number formed by the remaining MSBs of "I" in unary representation, and a stop bit. Therefore, the length of the code is k + ⌊I/2^k⌋ + 1.
Example 3.8

Consider the input bitstream X = [4 4 4 2 2 4 4 4 4 4 4 2 4 4 4 4] chosen over the data set V = [2, 4]. The run-length encoding scheme [Golo66] can be employed to efficiently encode X. Note that "4" is the most frequently occurring symbol in X. The number of consecutive occurrences of "4" is called the run length, n. The run lengths are monitored and encoded, i.e., [3 0 6 4]; here, the "0" represents the run of zero 4s between the two consecutive occurrences of "2." The probabilities of occurrence of "4" and "2" are given by p(4) = p = 13/16 and p(2) = (1 - p) = q = 3/16, respectively. Note that p >> q. For this case,
Figure 3.20. Golomb coding of I = 69 with m = 16.

Method 1:
  First part: Binary(I mod m) = Binary(69 mod 16) = Binary(5) = 0101.
  Second part: Unary(⌊I/m⌋) = Unary(4) = 1110.
  Stop bit = 0.
  Golomb code: [0101 1110 0] (first part + second part + stop bit).

Method 2 (m = 2^k, k = 4):
  First part: k LSBs of I = 4 LSBs of [1000101] = 0101.
  Second part: Unary(rest of the MSBs) = Unary(4) = 1110.
  Stop bit = 0.
  Golomb code: [0101 1110 0] (first part + second part + stop bit).
Huffman coding of X results in 16 bits. Moreover, the PDFs of the run lengths are better described using an exponential distribution, i.e., the probability of a run length of n is given by q·pⁿ. Rice coding [Rice79] or Golomb coding [Golo66] can be employed to efficiently encode the exponentially distributed run lengths. Furthermore, both Golomb and Rice codes are fast prefix codes that allow for practical implementation.
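A sketch of Golomb-Rice encoding of the run lengths from Example 3.8. Note that it uses the common quotient-first convention (quotient in unary ones terminated by a zero, then the k-bit remainder), which carries the same information as Figure 3.20 but with the fields in the opposite order:

```python
def golomb_rice(n, k):
    """Golomb code with m = 2^k (Method 2 above), quotient-first:
    the quotient n >> k in unary (ones terminated by a zero),
    followed by the k LSBs of n."""
    assert n >= 0
    return "1" * (n >> k) + "0" + format(n % (1 << k), "0{}b".format(k))

# Run lengths from Example 3.8, encoded with k = 1 (m = 2):
runs = [3, 0, 6, 4]
bits = "".join(golomb_rice(r, 1) for r in runs)
```

Short (probable) runs get short codewords and long (improbable) runs get longer ones, matching the exponential run-length distribution q·pⁿ.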
3.6.4 Arithmetic Coding
Arithmetic coding [Riss79] [Witt87] [Howa94] deals with encoding a sequence of input symbols as a large binary fraction known as a "codeword." For example, let V = [v1, v2, ..., vK] be the data set, let pi be the probability that the i-th symbol is transmitted, and let X = [x1, x2, ..., xN] be the input data vector of length N. The main idea behind an arithmetic coder is to encode the input data stream X as one codeword that corresponds to a rational number in the half-open unit interval [0, 1). Arithmetic coding is particularly useful when dealing with adaptive encoding and with highly skewed symbols [Witt87].
Example 3.9

Arithmetic coding of the input stream X = [1, 0, -1, 0, 1], chosen over the data set V = [-1, 0, 1]. Here, N = 5 and K = 3. We will use the symbol probabilities pi = [1/3, 1/2, 1/6], i ∈ V.
Step 1
The probabilities associated with the data set V = [−1, 0, 1] are arranged as intervals on a scale of [0, 1), as shown in Figure 3.21.

Step 2
The first input symbol in the data stream X is '1'. Therefore, the interval [5/6, 1) is chosen as the target range.

Step 3
The second input symbol in the data stream X is '0'. Now the interval [5/6, 1) is partitioned according to the symbol probabilities 1/3, 1/2, and 1/6. The resulting interval ranges are given in Figure 3.21. For example, the interval range for symbol '−1' can be computed as

[5/6, 5/6 + (1/6)(1/3)) = [5/6, 16/18)

and for symbol '0' the interval range is given by

[16/18, 5/6 + (1/6)(1/3 + 1/2)) = [16/18, 35/36)
Figure 3.21. Arithmetic coding of the input data stream X = [1, 0, −1, 0, 1]. First, the probabilities associated with the data set V = [−1, 0, 1] are arranged as intervals on a scale of [0, 1). Next, in step 2, an interval is chosen that corresponds to the probability of the input symbol '1' in the data sequence X, i.e., [5/6, 1). In step 3, the interval [5/6, 1) is partitioned according to the probabilities 1/3, 1/2, and 1/6, and the range corresponding to the input symbol '0' is chosen, i.e., [16/18, 35/36). This procedure is repeated for the rest of the input symbols, yielding the successive intervals [16/18, 35/36), [97/108, 33/36), and finally [393/432, 394/432), and the final interval range (typically a rational number) is encoded in binary form.
Step 4
The third input symbol in the data stream X is '−1'. The interval [16/18, 35/36) is partitioned according to the symbol probabilities 1/3, 1/2, and 1/6. The resulting interval ranges are given in Figure 3.21.

The above procedure is repeated for the rest of the input symbols, and an interval range is obtained, i.e., [393/432, 394/432) ≈ [0.9097222, 0.912037). In binary form, the interval is given by [0.1110100011..., 0.1110100101...). Since all binary numbers that begin with 0.1110100100 are within the interval [393/432, 394/432), the binary codeword 1110100100 uniquely represents the input data stream X = [1, 0, −1, 0, 1].
3.7 SUMMARY
This chapter covered quantization essentials and provided background on PCM, DPCM, vector quantization, bit allocation, and entropy coding algorithms. A quantization-bit allocation-entropy coding (QBE) framework that is part of most of the audio coding standards was described. Some of the important concepts addressed in this chapter include:

• Uniform and nonuniform quantization
• PCM, DPCM, and ADPCM techniques
• Vector quantization, structured VQ, split VQ, and conjugate structure VQ
• Bit-allocation strategies
• Source coding principles
• Lossless (entropy) coders: Huffman coding, Rice coding, Golomb coding, and arithmetic coding
PROBLEMS
3.1 Derive the PCM 6-dB-per-bit rule when the quantization error has a uniform probability density function.
3.2 For a signal with Gaussian distribution (zero mean and unit variance):
(a) Design a uniform PCM quantizer with four levels.
(b) Design a nonuniform four-level quantizer that is optimized for the signal PDF. Compare with the uniform PCM in terms of SNR.
3.3 For the PDF p(x) = (1/2)e^{−|x|}, determine the mean, the variance, and the probability that a random variable will fall within ±σ_x of the mean value.
Figure 3.22. An example PDF, p(x), defined over x ∈ [−1, 1] with peak value 1.
3.4 For the PDF p(x) given in Figure 3.22, design a four-level PDF-optimized PCM and compare to uniform PCM in terms of SNR.

3.5 Give and justify a formula for the number of bits in simple vector quantization with N × 1 vectors and L template vectors.

3.6 Give, in terms of L and N, the order of complexity in a VQ codebook search, where L is the number of codebook entries and N is the codebook dimension. Consider the following cases: (i) a simple VQ, (ii) a multi-step VQ, and (iii) a split VQ. For (ii) and (iii), use the configurations given in Figure 3.12 and Figure 3.14, respectively.
COMPUTER EXERCISES
3.7 Design a DPCM coder for a stationary random signal with power spectral density S(e^{jΩ}) = 1/|1 + 0.8e^{jΩ}|². Use a first-order predictor. Give a block diagram and all pertinent equations. Write a program that implements the DPCM coder and evaluate the MSE at the receiver. Compare the SNR (for the same data) with a PCM system operating at the same bit rate.
3.8 In this problem, you will write a computer program to design a vector quantizer and generate a codebook of size [L × N]. Here, L is the number of codebook entries and N is the codebook dimension.

Step 1
Generate a training set, T_in, of size [Ln × N], where n is the number of training vectors per codevector. Assume L = 16, N = 4, and n = 10. Denote the training set elements as t_in(i, j), for i = 0, 1, ..., 159 and j = 0, 1, 2, 3. Use Gaussian vectors of zero mean and unit variance for training.

Step 2
Using the LBG algorithm [Lind80], design a vector quantizer and generate a codebook, C, of size [L × N], i.e., [16 × 4]. In the LBG VQ design, choose the distortion threshold as 0.001. Label the codevectors as c(i, j), for i = 0, 1, ..., 15 and j = 0, 1, 2, 3.
Table 3.6. Segmental and overall SNR values for a test signal within and outside the training sequence, for different numbers of codebook entries L, different codebook dimensions N, and ε = 0.001. (Rows: training vectors per codevector, n = 10 and n = 100, each with codebook dimension N = 2 and N = 4. Columns: segmental and overall SNR (dB) for a test signal within the training sequence and for a test signal outside the training sequence, each for L = 16 and L = 64. The entries are to be filled in as part of Problem 3.8.)
Figure 3.23. Four-bit VQ design specifications for Problem 3.9: a vector quantizer with codebook L1; L = 16, N = 4, n = 100, and ε = 0.00001; input s, quantized output ŝ.
Step 3
Similar to Step 1, generate another training set, T_out, of size [160 × 4], that we will use for testing the VQ performance. Label these training set values as t_out(i, j), for i = 0, 1, ..., 159 and j = 0, 1, 2, 3.

Step 4
Using the codebook C designed in Step 2, perform vector quantization of t_in(i, j) and t_out(i, j). Let us denote the VQ results as t̂_in(i, j) and t̂_out(i, j), respectively.
(a) When the test vectors are within the training sequence, compute the overall SNR and segmental SNR values as follows:

SNR_overall = [ Σ_{i=0}^{Ln−1} Σ_{j=0}^{N−1} t_in²(i, j) ] / [ Σ_{i=0}^{Ln−1} Σ_{j=0}^{N−1} (t_in(i, j) − t̂_in(i, j))² ]    (3.57)

SNR_segmental = (1/(Ln)) Σ_{i=0}^{Ln−1} [ Σ_{j=0}^{N−1} t_in²(i, j) / Σ_{j=0}^{N−1} (t_in(i, j) − t̂_in(i, j))² ]    (3.58)
(b) Compute the overall SNR and segmental SNR values when the test vectors are different from the training ones, i.e., replace t_in(i, j) with t_out(i, j) and t̂_in(i, j) with t̂_out(i, j) in (3.57) and (3.58).
(c) List in Table 3.6 the overall and segmental SNR values for different numbers of codebook entries and different codebook dimensions. Explain the effects of choosing different values of L, n, and N on the SNR values.
(d) Compute the MSE, ε(t_in, t̂_in) = (1/(Ln))(1/N) Σ_{i=0}^{Ln−1} Σ_{j=0}^{N−1} (t_in(i, j) − t̂_in(i, j))², for different cases, e.g., L = 16, 64; n = 10, 100, 1000; N = 2, 8. Explain how the MSE varies for different values of L, n, and N.
3.9 Write a program to design a 4-bit VQ codebook, L1 (i.e., use L = 16 codebook entries), with codebook dimension N = 4 (Figure 3.23). Use n = 100 training vectors per codebook entry and a distortion threshold ε = 0.00001. For VQ training, choose zero-mean and unit-variance Gaussian vectors.
Figure 3.24. A three-stage vector quantizer. Each stage is a 4-bit VQ (L = 16, N = 4, n = 100) with codebooks L1, L2, and L3 and distortion thresholds ε1 = 0.001, ε2 = 0.0001, and ε3 = 0.00001, respectively. The first stage quantizes the input s; each subsequent stage quantizes the error of the previous stage (e1, e2), producing the estimates ê1 and ê2.
3.10 Extend the VQ design in Problem 3.9 to a multi-step VQ (see Figure 3.24 for an example multi-step VQ configuration). Use a total of three stages in your VQ design. Choose the MSE distortion thresholds in each of the stages as ε1 = 0.001, ε2 = 0.0001, and ε3 = 0.00001. Comment on the MSE convergence in each of the stages. How would you compare the multi-step VQ with a simple VQ in terms of the segmental and overall SNR values? (Note: In Figure 3.24, the first VQ codebook (L1) encodes the signal s, and the subsequent VQ stages, L2 and L3, encode the error from the previous stage. At the decoder, the signal s′ can be reconstructed as s′ = ŝ + ê1 + ê2.)
3.11 Design a two-stage split-VQ. Choose L = 16 and n = 100. Implement the first stage as a 4-dimensional VQ and the second stage as two 2-dimensional VQs. Select the distortion thresholds as follows: for the first stage, ε1 = 0.001, and for the second stage, ε2 = 0.00001. Compare the coding accuracy, in terms of a distance measure, and the coding gain, in terms of the number of bits/sample, of the split-VQ with respect to the simple VQ in Problem 3.9 and the multi-step VQ in Problem 3.10. (See Figure 3.14.)
3.12 Given the input data stream X = [1 0 2 1 0 1 2 1 0 2 0 1 1] chosen over a data set V = [0, 1, 2]:
(a) Write a program to compute the entropy He(X).
(b) Compute the symbol probabilities pi, i ∈ V, for the input data stream X.
(c) Write a program to encode X using Huffman codes. Employ an appropriate Huffman bit mapping. Give the length of the output bitstream. (Hint: See Example 3.4 and Example 3.6.)
(d) Use arithmetic coding to encode the input data stream X. Give the final codeword interval range in binary form. Give also the length of the output bitstream. (See Example 3.9.)
CHAPTER 4
LINEAR PREDICTION IN NARROWBANDAND WIDEBAND CODING
4.1 INTRODUCTION
Linear predictive coders are embedded in several telephony and multimedia standards [G729] [G7231] [IS-893] [ISOI99]. Linear predictive coding (LPC) [Kroo95] is mostly used for source coding of speech signals, and the dominant application of LPC is cellular telephony. Recently, linear prediction (LP) analysis/synthesis has also been integrated in some of the wideband speech coding standards [G722] [G7222] [Bess02] and in audio modeling [Iwak96] [Mori96] [Harm97a] [Harm97b] [Bola98] [ISOI00].
LP analysis/synthesis exploits the short- and long-term correlation to parameterize the signal in terms of a source-system representation. LP analysis can be open loop or closed loop. In closed-loop analysis, also called analysis-by-synthesis, the LP parameters are estimated by minimizing the "perceptually weighted" difference between the original and reconstructed signals. Speech coding standards use a perceptual weighting filter (PWF) to shape the quantization noise according to the masking properties of the human ear [Schr79] [Kroo95] [Sala98]. Although the PWF has been successful in speech coding, audio coding requires a more sophisticated strategy to exploit perceptual redundancies. To this end, several extensions [Bess02] [G7222] to the conventional LPC have been proposed. Hybrid transform/predictive coding techniques have also been employed for high-quality, low-bit-rate coding [Ramp98] [Ramp99] [Rong99] [ISOI99] [ISOI00]. Other LP methods that make use of perceptual constraints and
Audio Signal Processing and Coding by Andreas Spanias Ted Painter and Venkatraman AttiCopyright 2007 by John Wiley amp Sons Inc
auditory psychophysics include the perceptual LP (PLP) [Herm90], the warped LP (WLP) [Stru80] [Harm96] [Harm01], and the perceptually-motivated all-pole (PMAP) modeling [Atti05]. In the PLP, a perceptually based auditory spectrum is obtained by filtering the signal using a filter bank that mimics the auditory filter bank. An all-pole filter that approximates the auditory spectrum is then computed using the autocorrelation method [Makh75]. On the other hand, in the WLP, the main idea is to warp the frequency axis according to a Bark scale prior to performing LP analysis. The PMAP modeling employs an auditory excitation pattern matching method to directly estimate the perceptually relevant pole locations. The estimated "perceptual poles" are then used to construct an all-pole filter for speech analysis/synthesis.
Whether or not LPC is amenable to audio modeling depends on the signal properties. For example, a code-excited linear predictive (CELP) coder seems to be more adequate than a sinusoidal coder for telephone speech, while the sinusoidal coder seems to be more promising for music.
4.2 LP-BASED SOURCE-SYSTEM MODELING FOR SPEECH
Speech is produced by the interaction of the vocal tract with the vocal cords. The LP analysis/synthesis framework (Figures 4.1 and 4.2) has been successful for speech coding because it fits well the source-system paradigm for speech [Makh75] [Mark76]. In particular, the slowly time-varying spectral characteristics of the upper vocal tract (system) are modeled by an all-pole filter, while the prediction residual captures the voiced, unvoiced, or mixed excitation signal. The LP analysis filter, A(z), in Figure 4.1 is given by
A(z) = 1 − Σ_{i=1}^{L} a_i z^{−i}    (4.1)
Figure 4.1. Parameter estimation using linear prediction. The input speech is buffered into frames, and LP analysis of each speech frame yields A(z) = 1 − Σ_{i=1}^{L} a_i z^{−i}, the vocal tract LP spectral envelope, and the linear prediction residual.
Figure 4.2. Engineering model for speech synthesis: an excitation scaled by a gain drives the vocal tract filter, 1/A(z) (LP synthesis), to produce the synthesized signal.
where L is the order of the linear predictor. Figure 4.2 depicts a simple speech synthesis model, where a time-varying digital filter is excited by quasi-periodic waveforms when speech is voiced (e.g., as in steady vowels) and random waveforms for unvoiced speech (e.g., as in consonants). The inverse filter, 1/A(z), shown in Figure 4.2 is an all-pole LP synthesis filter,
H(z) = G/A(z) = G / (1 − Σ_{i=1}^{L} a_i z^{−i})    (4.2)
where G represents the gain. Note that the term all-pole is used loosely, since (4.2) has zeros at z = 0. The frequency response associated with the LP synthesis filter, i.e., the LPC spectrum, represents the formant structure of the speech signal (Figure 4.3). In this figure, F1, F2, F3, and F4 represent the four formants.
Figure 4.3. The LPC spectrum and the FFT spectrum (dotted line), plotted as magnitude (dB) versus normalized frequency (×4 kHz). The formants F1-F4 represent the resonant modes of the vocal tract.
4.3 SHORT-TERM LINEAR PREDICTION
Figure 4.4 presents a typical L-th order FIR linear predictor. During forward linear prediction of s(n), an estimated value, ŝ(n), is computed as a linear combination of the previous samples, i.e.,
ŝ(n) = Σ_{i=1}^{L} a_i s(n − i)    (4.3)
where the weights, a_i, are the LP coefficients. The output of the LP analysis filter, A(z), is called the prediction residual, e(n) = s(n) − ŝ(n). This is given by
e(n) = s(n) − Σ_{i=1}^{L} a_i s(n − i)    (4.4)
Because only short-term delays are considered in (4.4), the linear predictor in Figure 4.4 is also referred to as the short-term linear predictor. The linear predictor coefficients, a_i, are estimated using least-squares minimization of the prediction error, i.e.,
ε = E[e²(n)] = E[ ( s(n) − Σ_{i=1}^{L} a_i s(n − i) )² ]    (4.5)
The minimization of ε in (4.5) with respect to a_i, i.e., ∂ε/∂a_i = 0 for i = 1, 2, ..., L, yields a set of equations involving autocorrelations,
r_ss(m) − Σ_{i=1}^{L} a_i r_ss(m − i) = 0,  for m = 1, 2, ..., L    (4.6)
where r_ss(m) is the autocorrelation sequence of the signal s(n). Equation (4.6) can be written in matrix form, i.e.,
Figure 4.4. Linear prediction (LP) analysis. The LP analysis filter, A(z) = 1 − Σ_{i=1}^{L} a_i z^{−i}, is implemented as a tapped delay line with coefficients a_1, ..., a_L, producing the prediction residual e(n) from the input s(n).
[a_1]   [ r_ss(0)       r_ss(−1)      r_ss(−2)     ...  r_ss(1 − L) ]⁻¹ [r_ss(1)]
[a_2]   [ r_ss(1)       r_ss(0)       r_ss(−1)     ...  r_ss(2 − L) ]   [r_ss(2)]
[a_3] = [ r_ss(2)       r_ss(1)       r_ss(0)      ...  r_ss(3 − L) ]   [r_ss(3)]
[...]   [ ...           ...           ...          ...  ...         ]   [ ...   ]
[a_L]   [ r_ss(L − 1)   r_ss(L − 2)   r_ss(L − 3)  ...  r_ss(0)     ]   [r_ss(L)]
                                                                            (4.7)

or, more compactly,

a = R_ss⁻¹ r_ss    (4.8)
where a is the LP coefficient vector, r_ss is the autocorrelation vector, and R_ss is the autocorrelation matrix. Note that R_ss has a Toeplitz and symmetric structure. Efficient algorithms [Makh75] [Mark76] [Marp87] are available for inverting the autocorrelation matrix R_ss, including algorithms tailored to work well with finite-precision arithmetic [Gers90]. Typically, the Levinson-Durbin recursive algorithm [Makh75] is used to compute the LP coefficients. Preconditioning of the input sequence, s(n), and the autocorrelation data, r_ss(m), using tapered windows improves the numerical behavior of these algorithms [Klei95] [Kroo95]. In addition, bandwidth expansion or scaling of the LP coefficients is typical in LPC, as it reduces distortion during synthesis.
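The Levinson-Durbin recursion mentioned above solves (4.8) without forming the matrix inverse; a minimal sketch (names are illustrative) follows, checked on the autocorrelation of an ideal first-order autoregressive source:

```python
def levinson_durbin(r, L):
    """Solve the normal equations (4.6)-(4.8) for the LP coefficients
    a_1..a_L via the Levinson-Durbin recursion, given the autocorrelation
    values r[0..L]. Returns (coefficients, final prediction error)."""
    a = []          # predictor coefficients, grown order by order
    err = r[0]      # prediction error energy
    for m in range(1, L + 1):
        # Reflection coefficient for order m.
        k = (r[m] - sum(a[i] * r[m - 1 - i] for i in range(m - 1))) / err
        # Order update of the coefficient vector.
        a = [a[i] - k * a[m - 2 - i] for i in range(m - 1)] + [k]
        err *= (1.0 - k * k)
    return a, err

# Autocorrelation of an ideal AR(1) source with pole at 0.8: r(m) = 0.8**m.
a, err = levinson_durbin([1.0, 0.8, 0.64], L=2)
# a[0] ≈ 0.8 and a[1] ≈ 0: the recursion recovers the first-order predictor
```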
In low-bit-rate coding, the prediction coefficients and the residual must be efficiently quantized. Because the direct-form LP coefficients, a_i, do not have adequate quantization properties, transformed coefficients are typically quantized. First-generation voice coders (vocoders), such as the LPC10e [FS1015] and the IS-54 VSELP [IS-54], quantize reflection coefficients that are a by-product of the Levinson-Durbin recursion. Transformation of the reflection coefficients can lead to a set of parameters that are also less sensitive to quantization. In particular, the log area ratios and the inverse sine transformation have been used in the early GSM 6.10 algorithm [GSM89] and in the Skyphone standard [Boyd88]. Recent LP-based cellular standards quantize line spectrum pairs (LSPs). The main advantage of the LSPs is that they relate directly to the frequency domain, and, hence, they can be encoded using perceptual criteria.
4.3.1 Long-Term Prediction
Long-term prediction (LTP), as opposed to short-term prediction, is a process that captures the long-term correlation in the signal. The LTP provides a mechanism for representing the periodicity in the signal, and, as such, it represents the fine harmonic structure in the short-term spectrum (see Eq. (4.1)). LTP synthesis (4.9) requires estimation of two parameters, i.e., the delay, τ, and the gain parameter, a_τ. For strongly voiced segments of speech, the delay is usually an integer that approximates the pitch period. A transfer function of a simple LTP synthesis filter, H_τ(z), is given in (4.9). More complex LTP filters involve
multiple parameters and noninteger delays [Kroo90]:

H_τ(z) = 1/A_L(z) = 1 / (1 − a_τ z^{−τ})    (4.9)
The LTP can be implemented by open-loop or closed-loop analysis. The open-loop LTP parameters are typically obtained by searching the autocorrelation sequence. The gain is simply obtained by a_τ = r_ss(τ)/r_ss(0). In a closed-loop LTP search, the signal is synthesized for a range of candidate LTP lags, and the lag that produces the best waveform matching is chosen. Because of the intensive computations in full-search closed-loop LTP, recent algorithms use open-loop LTP to establish an initial LTP lag that is then refined using a closed-loop search around the neighborhood of the initial estimate. In order to further reduce the complexity, LTP searches are often carried out in every other subframe.
4.3.2 ADPCM Using Linear Prediction
One of the simplest compression schemes that uses the short-term LP analysis-synthesis is the adaptive differential pulse code modulation (ADPCM) coder [Bene86] [G726]. ADPCM algorithms encode the difference between the current and the predicted speech samples. The block diagram of the ITU-T G.726 32 kb/s ADPCM encoder [Bene86] is shown in Figure 4.5. The algorithm consists of an adaptive quantizer and an adaptive pole-zero predictor. The prediction parameters are obtained by backward estimation, i.e., from quantized data, using a gradient algorithm at the decoder. From Figure 4.5, it can be noted that the decoder is embedded in the encoder. The pole-zero predictor (two poles and six zeros) estimates the input signal, and hence it reduces the variance of e(n). The quantizer encodes the error, e(n), into a sequence of 4-bit words. The ITU-T G.726 also accommodates 16, 24, and 40 kb/s with individually optimized quantizers.
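The closed-loop structure described above (decoder embedded in the encoder) can be illustrated with a deliberately simplified DPCM loop; a fixed first-order predictor and a uniform quantizer stand in for the adaptive pole-zero predictor and adaptive quantizer of G.726, so this is a sketch of the principle, not of the standard:

```python
import math

def dpcm_encode_decode(signal, a=0.8, step=0.05):
    """Closed-loop DPCM with a fixed first-order predictor and a uniform
    quantizer. Returns (quantizer indices, reconstructed signal)."""
    codes, recon = [], []
    prev = 0.0                      # previous *reconstructed* sample
    for s in signal:
        pred = a * prev             # predict from reconstructed data
        e = s - pred                # prediction residual
        q = round(e / step)         # quantizer index (what is transmitted)
        s_hat = pred + q * step     # decoder-side reconstruction
        codes.append(q)
        recon.append(s_hat)
        prev = s_hat                # encoder tracks the decoder state
    return codes, recon

x = [math.sin(0.1 * n) for n in range(200)]
codes, y = dpcm_encode_decode(x)
# because the quantizer sits inside the prediction loop, the reconstruction
# error never exceeds half a quantizer step and does not accumulate
```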
4.4 OPEN-LOOP ANALYSIS-SYNTHESIS LINEAR PREDICTION
In almost all LP-based speech codecs, speech is approximated on short analysis intervals, typically in the neighborhood of 20 ms. As shown in Figure 4.6, a set of LP synthesis parameters is estimated on each analysis frame to capture the shape of the vocal tract envelope and to model the excitation.
Some of the typical synthesis parameters encoded and transmitted in the open-loop LP include the prediction coefficients, the pitch information, the frame energy, and the voicing. At the receiver, the transmitted "source" parameters are used to form the excitation. The excitation, e(n), is then used to excite the LP synthesis filter, 1/A(z), to reconstruct the speech signal. Some of the standardized open-loop analysis-synthesis LP algorithms include the LPC10e Federal Standard FS-1015 [FS1015] [Trem82] [Camp86] and the Mixed Excitation LP (MELP) [McCr91]. The LPC10e FS-1015 uses a tenth-order predictor to estimate the vocal tract parameters and a two-state voiced or unvoiced excitation model for residual modeling. Mixed excitation schemes in conjunction with LPC were
Figure 4.5. The ADPCM ITU-T G.726 encoder. The input s(n) is predicted by an adaptive pole-zero predictor (blocks B(z) and A(z)); the residual e(n) is quantized (Q) with an adaptive step update, and the inverse quantizer (Q⁻¹) output drives the embedded decoder.
Figure 4.6. Open-loop analysis-synthesis LP. The analysis (encoder) stage extracts the parameters (LPC excitation, pitch, energy, voicing, etc.) and the residual or excitation; the synthesis (decoder) stage applies the gain and the vocal tract filter 1/A(z), with A(z) = 1 − Σ_{i=1}^{L} a_i z^{−i}, to produce the reconstructed speech.
proposed by Makhoul et al [Makh78] and were later revisited by McCree andBarnwell [McCr91] [McCr93]
4.5 ANALYSIS-BY-SYNTHESIS LINEAR PREDICTION
In closed-loop source-system coders (Figure 4.7), the excitation source is determined by closed-loop or analysis-by-synthesis (A-by-S) optimization. The optimization process determines an excitation sequence that minimizes the perceptually weighted mean-square error (MSE) between the input speech and the reconstructed speech [Atal82b] [Sing84] [Schr85]. The closed-loop LP combines the
spectral modeling properties of vocoders with the waveform matching attributes of waveform coders, and, hence, the A-by-S LP coders are also called hybrid LP coders. The system consists of a short-term LP synthesis filter, 1/A(z), and a LTP synthesis filter, 1/A_L(z), shown in Figure 4.7. The perceptual weighting filter (PWF), W(z), shapes the error such that the quantization noise is masked by the high-energy formants. The PWF is given by
W(z) = A(z/γ1)/A(z/γ2) = (1 − Σ_{i=1}^{L} γ1^i a_i z^{−i}) / (1 − Σ_{i=1}^{L} γ2^i a_i z^{−i}),  0 < γ2 < γ1 < 1    (4.10)
where γ1 and γ2 are the adaptive weights and L is the order of the linear predictor. Typically, γ1 ranges from 0.94 to 0.98, and γ2 varies between 0.4 and 0.7, depending upon the tilt or the flatness characteristics associated with the LPC spectral envelope [Sala98] [Bess02]. The role of W(z) is to de-emphasize the error energy in the formant regions [Schr79]. This de-emphasis strategy is based on the fact that quantization noise in the formant regions is partially masked by speech. From Figure 4.7, note that a gain factor, g, scales the excitation vector, x, and the excitation samples are filtered by the long-term and short-term synthesis filters.
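Building the numerator and denominator polynomials of W(z) from a given set of LP coefficients is mechanical; the sketch below applies Eq. (4.10), with the example coefficients chosen purely for illustration (they come from no particular standard):

```python
def pwf_coeffs(a, gamma1=0.94, gamma2=0.6):
    """Coefficients of the perceptual weighting filter of Eq. (4.10).

    Returns (num, den): the polynomials A(z/gamma1) and A(z/gamma2) in
    powers of z^-1, i.e., [1, -gamma**1 * a_1, ..., -gamma**L * a_L]."""
    num = [1.0] + [-(gamma1 ** i) * a[i - 1] for i in range(1, len(a) + 1)]
    den = [1.0] + [-(gamma2 ** i) * a[i - 1] for i in range(1, len(a) + 1)]
    return num, den

# Illustrative second-order LP coefficients:
num, den = pwf_coeffs([1.2, -0.5], gamma1=0.9, gamma2=0.5)
# num = [1, -1.08, 0.405] and den = [1, -0.6, 0.125]
```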
The three most common excitation models typically embedded in the excitation generator module (Figure 4.7) in the A-by-S LP schemes include the multi-pulse excitation (MPE) [Atal82b] [Sing84], the regular pulse excitation (RPE) [Kroo86], and the vector or code excited linear prediction (CELP) [Schr85]. A 9.6 kb/s multi-pulse excited linear prediction (MPE-LP) algorithm is used in Skyphone airline applications [Boyd88]. A 13 kb/s coding scheme that uses regular pulse excitation (RPE) [Kroo86] was adopted for the first-generation full-rate ETSI GSM Pan-European digital cellular standard [GSM89].
The aforementioned MPE-LP and RPE schemes achieve high-quality speech at medium rates (13 kb/s). For low-rate, high-quality speech coding, a more efficient representation of the excitation sequence is required. Atal [Atal82a] suggested that high-quality speech at low rates may be produced by using noninstantaneous
Figure 4.7. A typical source-system model employed in the analysis-by-synthesis LP. An excitation generator (MPE or RPE) produces x, which is scaled by g and filtered by the LTP synthesis filter, 1/A_L(z), and the LP synthesis filter, 1/A(z), to form the synthetic speech ŝ; the error e between the input speech s and ŝ is shaped by W(z) and minimized.
(delayed decision) coding of Gaussian excitation sequences in conjunction with A-by-S linear prediction and perceptual weighting. In the mid-1980s, Atal and Schroeder [Atal84] [Schr85] proposed a vector or code excited linear prediction (CELP) algorithm for A-by-S linear predictive coding.
We provide further details on CELP in this section because of its recent use in wideband coding standards. The excitation codebook search process in CELP can be explained by considering the A-by-S scheme shown in Figure 4.8. The N × 1 error vector, e, associated with the k-th excitation vector can be written as

e[k] = s_w − s_w^0 − g_k ŝ_w[k]    (4.11)

where s_w is the N × 1 vector that contains the perceptually weighted speech samples, s_w^0 is the vector that contains the output due to the initial filter states, ŝ_w[k] is the filtered synthetic speech vector associated with the k-th excitation vector, and g_k is the gain factor. Minimizing ε_k = e^T[k]e[k] with respect to g_k, we obtain

g_k = (s̄_w^T ŝ_w[k]) / (ŝ_w^T[k] ŝ_w[k])    (4.12)

where s̄_w = s_w − s_w^0 and T represents the transpose operator. From (4.12), ε_k can be written as

ε_k = s̄_w^T s̄_w − (s̄_w^T ŝ_w[k])² / (ŝ_w^T[k] ŝ_w[k])    (4.13)
Figure 4.8. A generic block diagram for the A-by-S code-excited linear predictive (CELP) coding. Note that the perceptual weighting, W(z), is applied directly on the input speech, s, and the synthetic speech, ŝ, in order to facilitate the CELP analysis that follows. The k-th excitation vector, x[k], that minimizes ε_k in (4.13) is selected, and the corresponding gain factor, g_k, is obtained from (4.12). The codebook index, k, and the gain, g_k, associated with the candidate excitation vector, x[k], are encoded and transmitted along with the short-term and long-term prediction filter parameters.
The k-th excitation vector, x[k], that minimizes (4.13) is selected, and the corresponding gain factor, g_k, is obtained from (4.12).
One of the disadvantages of the original CELP algorithm is the large computational complexity required for the codebook search [Schr85]. This problem motivated a great deal of work focused upon developing structured codebooks [Davi86] [Klei90a] and fast search procedures [Tran90]. In particular, Davidson and Gersho [Davi86] proposed sparse codebooks, and Kleijn et al. [Klei90a] proposed a fast algorithm for searching stochastic codebooks with overlapping vectors. In addition, Gerson and Jasiuk [Gers90] [Gers91] proposed a vector sum excited linear predictive (VSELP) coder, which is associated with fast codebook search and robustness to channel errors. Other implementation issues associated with CELP include the quantization of the CELP parameters, the effects of channel errors on CELP coders, and the operation of the algorithm on finite-precision and fixed-point machines. A study on the effects of parameter quantization on the performance of CELP was presented in [Kroo90], and the issues associated with the channel coding of the CELP parameters were discussed by Kleijn [Klei90b]. Some of the problems associated with the fixed-point implementation of CELP algorithms were presented in [Span92].
4.5.1 Code-Excited Linear Prediction Algorithms
In this section, we taxonomize CELP algorithms into three categories that are consistent with the chronology of their development, i.e., first-generation CELP (1986–1992), second-generation CELP (1993–1998), and third-generation CELP (1999–present).
4.5.1.1 First-Generation CELP Coders. The first-generation CELP algorithms operate at bit rates between 5.8 kb/s and 16 kb/s. These are generally high-complexity and non-toll-quality algorithms. Some of the first-generation CELP algorithms include the FS-1016 CELP, the IS-54 VSELP, the ITU-T G.728 low-delay CELP, and the IS-96 Qualcomm CELP. The FS-1016 4.8 kb/s CELP standard [Camp90] [FS1016] was jointly developed by the Department of Defense (DoD) and the Bell Labs for possible use in the third-generation secure telephone unit (STU-III). The IS-54 VSELP algorithm [IS-54] [Gers90] and its variants are embedded in three digital cellular standards, i.e., the 8 kb/s TIA IS-54 [IS-54], the 6.7 kb/s Japanese standard [GSM96a], and the 5.6 kb/s half-rate GSM [GSM96b]. The VSELP algorithm uses highly structured codebooks that are tailored for reduced computational complexity and increased robustness to channel errors. The ITU-T G.728 low-delay (LD) CELP coder [G728] [Chen92] achieves low one-way delay by using very short frames, a backward-adaptive predictor, and short excitation vectors (five samples). The IS-96 Qualcomm CELP [IS-96] is a variable bit rate algorithm and is part of the original code division multiple access (CDMA) standard for cellular communications.
4.5.1.2 Second-Generation Near-Toll-Quality CELP Coders. The second-generation CELP algorithms are targeted for TDMA and CDMA cellphones, Internet audio streaming, voice-over-Internet-protocol (VoIP), and
secure communications. Second-generation CELP algorithms include the ITU-T G.723.1 dual-rate speech codec [G7231], the GSM enhanced full rate (EFR) [GSM96a] [IS-641], the IS-127 Relaxed CELP (RCELP) [IS-127] [Klei92], and the ITU-T G.729 CS-ACELP [G729] [Sala98].
The coding gain improvements in second-generation CELP coders can be attributed partly to the use of algebraic codebooks in excitation coding [Adou87] [Lee90] [Sala98] [G729]. The term algebraic CELP refers to the structure of the excitation codebooks. Various algebraic codebook structures have been proposed [Adou87] [Lafl90], but the most popular is the interleaved pulse permutation code. In this codebook, the code vector consists of a set of interleaved permutation codes containing only a few non-zero elements. This is given by
p_i = i + jd,  j = 0, 1, ..., 2^M − 1    (4.14)
where p_i is the pulse position, i is the pulse number, and d is the interleaving depth. The integer M represents the number of bits describing the pulse positions. Table 4.1 shows an example ACELP codebook structure, where the interleaving depth d = 5, the number of pulses or tracks is equal to 5, and the number of bits used to represent the pulse positions is M = 3. From (4.14), p_i = i + 5j, where i = 0, 1, 2, 3, 4 and j = 0, 1, 2, ..., 7.
For a given value of i, the set defined by (4.14) is known as a 'track', and the value of j defines the pulse position. From the codebook structure shown in Table 4.1, the codevector, x(n), is given by
x(n) = Σ_{i=0}^{4} α_i δ(n − p_i),  n = 0, 1, ..., 39    (4.15)
where δ(n) is the unit impulse, α_i are the pulse amplitudes (±1), and p_i are the pulse positions. In particular, the codebook vector, x(n), is computed by placing the five unit pulses at the determined locations, p_i, multiplied by their signs (±1). The pulse position indices and the signs are coded and transmitted. Note that the algebraic codebooks do not require any storage.
4.5.1.3 Third-Generation CELP for 3G Cellular Standards. The third-generation (3G) CELP algorithms are multimodal and accommodate several
Table 4.1. An example algebraic codebook structure: tracks and pulse positions.

Track (i)    Pulse positions (p_i)
0            P0: 0, 5, 10, 15, 20, 25, 30, 35
1            P1: 1, 6, 11, 16, 21, 26, 31, 36
2            P2: 2, 7, 12, 17, 22, 27, 32, 37
3            P3: 3, 8, 13, 18, 23, 28, 33, 38
4            P4: 4, 9, 14, 19, 24, 29, 34, 39
different bit rates. This is consistent with the vision on wideband wireless standards [Knis98] that operate in different modes, including low-mobility, high-mobility, indoor, etc. There are at least two algorithms that have been developed and standardized for these applications. In Europe, GSM standardized the adaptive multi-rate coder [ETSI98] [Ekud99], and, in the United States, the TIA has tested the selectable mode vocoder (SMV) [Gao01a] [Gao01b] [IS-893]. In particular, the adaptive multirate coder [ETSI98] [Ekud99] has been adopted by ETSI for use in the GSM network. This is an algebraic CELP algorithm that operates at multiple rates: 12.2, 10.2, 7.95, 6.7, 5.9, 5.15, and 4.75 kb/s. The bit rate is adjusted according to the traffic conditions. The SMV algorithm (IS-893) was developed to provide higher quality, flexibility, and capacity over the existing IS-96 QCELP and IS-127 enhanced variable rate coding (EVRC) CDMA algorithms. The SMV is based on four codecs: full-rate at 8.5 kb/s, half-rate at 4 kb/s, quarter-rate at 2 kb/s, and eighth-rate at 0.8 kb/s. The rate and mode selections in SMV are based on the frame voicing characteristics and the network conditions. Efforts to establish wideband cellular standards continue to drive further the research and development towards algorithms that work at multiple rates and deliver enhanced speech quality.
4.6 LINEAR PREDICTION IN WIDEBAND CODING
Until now, we discussed the use of LP in narrowband coding, with signal bandwidth limited to 150–3400 Hz. Signal bandwidth in wideband speech coding spans 50 Hz to 7 kHz, which substantially improves the quality of signal reconstruction, intelligibility, and naturalness. In particular, the introduction of the low-frequency components improves the naturalness, while the higher frequency extension provides more adequate speech intelligibility. In the case of high-fidelity audio, it is typical to consider sampling rates of 44.1 kHz, and signal bandwidth can range from 20 Hz to 20 kHz. Some of the recent super high-fidelity audio storage formats (Chapter 11), such as the DVD-audio and the super audio CD (SACD), consider signal bandwidths up to 100 kHz.
4.6.1 Wideband Speech Coding
Over the last few years, several wideband speech coding algorithms have been proposed [Orde91] [Jaya92] [Lafl93] [Adou95]. Some of the coding principles associated with these algorithms have been successfully integrated into several speech coding standards, for example, the ITU-T G.722 subband ADPCM standard and the ITU-T G.722.2 AMR-WB codec.
4.6.1.1 The ITU-T G.722 Codec The ITU-T G.722 standard (Figure 4.9) uses a combination of both subband and ADPCM (SB-ADPCM) techniques [G722] [Merm88] [Span94] [Pain00]. The input signal is sampled at 16 kHz and decomposed into two subbands of equal bandwidth using quadrature mirror filter (QMF) banks. The subband filters hlow(n) and hhigh(n) should satisfy
hhigh(n) = (−1)^n hlow(n), and |Hlow(z)|^2 + |Hhigh(z)|^2 = 1. (4.16)
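The conditions in Eq. (4.16) can be checked numerically. The sketch below uses a toy 2-tap low-pass prototype, h_low = [0.5, 0.5], which happens to satisfy power complementarity exactly; the actual 24-tap G.722 filters satisfy it only approximately.

```python
import cmath, math

h_low = [0.5, 0.5]  # toy 2-tap low-pass prototype (not the G.722 filter)
h_high = [((-1) ** n) * c for n, c in enumerate(h_low)]  # (-1)^n h_low(n)

def freq_response(h, w):
    """H(e^jw) for an FIR filter h."""
    return sum(c * cmath.exp(-1j * w * n) for n, c in enumerate(h))

# Power complementarity |H_low|^2 + |H_high|^2 = 1 across the band.
for k in range(65):
    w = math.pi * k / 64
    total = (abs(freq_response(h_low, w)) ** 2
             + abs(freq_response(h_high, w)) ** 2)
    assert abs(total - 1.0) < 1e-12
```

The high-pass filter is the low-pass prototype with alternating signs, which mirrors the magnitude response around the quarter-band frequency.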
LINEAR PREDICTION IN WIDEBAND CODING 103
The low-frequency subband is typically quantized at 48 kb/s, while the high-frequency subband is coded at 16 kb/s. The G.722 coder includes an adaptive bit allocation scheme and an auxiliary data channel. Moreover, provisions for quantizing the low-frequency subband at 40 or at 32 kb/s are available. In particular, the G.722 algorithm is multimodal and can operate in three different modes, i.e., 48, 56, and 64 kb/s, by varying the bits used to represent the lower band signal. The MOS at 64 kb/s is greater than four for speech and slightly less than four for music signals [Jaya90], and the analysis-synthesis QMF banks introduce a delay of less than 3 ms. Details on the real-time implementation of this coder are given in [Taka88].
4.6.1.2 The ITU-T G.722.2 AMR-WB Codec The ITU-T G.722.2 [G7722] [Bess02] is an adaptive multi-rate wideband (AMR-WB) codec that operates at bit rates ranging from 6.6 to 23.85 kb/s. The G.722.2 AMR-WB standard is primarily targeted for voice-over IP (VoIP), 3G wireless communications, ISDN wideband telephony, and audio/video teleconferencing. It is important to note that the AMR-WB codec has also been adopted by the third-generation partnership project (3GPP) for GSM and the 3G WCDMA systems for wideband mobile communications [Bess02]. This, in fact, brought to the fore all the interoperability-related advantages for wideband voice applications across wireline and wireless communications. The ITU-T G.722.2 AMR-WB codec is based on the ACELP coder and operates on audio frames of 20 ms sampled at 16 kHz. The codec supports the following nine bit rates: 23.85, 23.05, 19.85, 18.25, 15.85, 14.25, 12.65, 8.85, and 6.6 kb/s. Excepting the two lowest modes, i.e., the 8.85 kb/s and the
[Figure 4.9 block diagrams. (a) Encoder: s(n) → QMF analysis bank → ADPCM encoders 1 and 2 → MUX with data insertion (auxiliary data at 16 kb/s) → 64 kb/s output. (b) Decoder: 64 kb/s input → DeMUX with data extraction (auxiliary data at 16 kb/s) → ADPCM decoders 1 and 2 → QMF synthesis bank → s(n).]
Figure 4.9 The ITU-T G.722 standard for ISDN teleconferencing. Wideband coding at 64 kb/s based on a two-band QMF analysis/synthesis bank and ADPCM: (a) encoder and (b) decoder. Note that the low-frequency band is encoded at 32 kb/s in order to allow for an auxiliary data channel at 16 kb/s.
104 LINEAR PREDICTION IN NARROWBAND AND WIDEBAND CODING
6.6 kb/s, that are intended for transmission over noisy time-varying channels, the other encoding modes, i.e., 23.85 through 12.65 kb/s, offer high-quality signal reconstruction. The G.722.2 AMR-WB embeds several innovative techniques [Bess02], such as (i) a modified perceptual weighting filter that decouples the formant weighting from the spectrum tilt, (ii) an enhanced closed-loop pitch search to better accommodate the variations in the voicing level, and (iii) efficient codebook structures for fast searches. The codec also includes a voice activity detection (VAD) scheme that activates a comfort noise generator module (1–2 kb/s) in case of discontinuous transmission.
4.6.2 Wideband Audio Coding
Motivated by the need to reduce the computational complexity associated with the CELP-based excitation source coding, researchers have proposed several hybrid (LP + subband/transform) coders [Lefe94] [Ramp98] [Rong99]. In this section, we consider LP-based wideband coding methods that encode the prediction residual based upon transform, subband, or sinusoidal coding techniques.
4.6.2.1 Multipulse Excitation Model Singhal at Bell Labs [Sing90] reported that analysis-by-synthesis multipulse excitation with sufficient pulse density can be applied to correct for LP envelope errors introduced by bandwidth expansion and quantization (Figure 4.10). This algorithm uses a 24th-order LPC synthesis filter, while optimizing pulse positions and amplitudes to minimize perceptually weighted reconstruction errors. Singhal determined that densities of approximately one pulse per four output samples of each excitation subframe are required to achieve near transparent quality.
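A minimal greedy pulse-placement loop conveys the flavor of this analysis-by-synthesis search. The sketch below is a deliberate simplification of the approach in [Sing90] (no perceptual weighting, unquantized amplitudes; `multipulse_search` is our name, not the paper's): each iteration places the pulse, with its locally optimal amplitude, that most reduces the energy of the remaining modeling error.

```python
import numpy as np

def multipulse_search(target, h, n_pulses):
    """Greedy multipulse search: h is the LP synthesis impulse response."""
    N, L = len(target), len(h)
    residual = np.asarray(target, dtype=float).copy()
    # Row p of H is a unit pulse at position p passed through h (truncated).
    H = np.zeros((N, N))
    for p in range(N):
        m = min(L, N - p)
        H[p, p:p + m] = h[:m]
    energies = np.sum(H ** 2, axis=1)
    positions, amplitudes = [], []
    for _ in range(n_pulses):
        corr = H @ residual                  # correlation with each shift
        gains = corr / energies              # optimal amplitude per shift
        p = int(np.argmax(corr * gains))     # largest error-energy reduction
        positions.append(p)
        amplitudes.append(float(gains[p]))
        residual -= gains[p] * H[p]          # analysis-by-synthesis update
    return positions, amplitudes

# Toy example: a target built from two known pulses through a decaying h.
h = 0.9 ** np.arange(16)
target = np.zeros(32)
target[4:20] += h                 # pulse of +1.0 at position 4
target[20:32] += -0.8 * h[:12]    # pulse of -0.8 at position 20
pos, amp = multipulse_search(target, h, 2)
```

On this constructed target the search recovers both pulse positions and amplitudes; on real speech the residual never vanishes, which is why pulse density matters.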
Spectral coefficients are transformed to inverse sine reflection coefficients, then differentially encoded and quantized using PDF-optimized Max quantizers. Entropy (Huffman) codes are also used. Pulse locations are differentially encoded relative to the location of the first pulse. Pulse amplitudes are fractionally encoded relative to the largest pulse and then quantized using a Max quantizer. The proposed MPLPC audio coder achieved output SNRs of 35–40 dB at a bit rate of 128 kb/s. Other MPLPC audio coders have also been proposed [Lin91],
[Figure 4.10 block diagram: an excitation generator produces u(n), which drives the LP synthesis filter; the synthesized signal is subtracted from the input s(n), and the difference passes through an error weighting filter that steers the excitation search.]

Figure 4.10 Multipulse excitation model used in [Sing90].
including a scheme based on MPLPC in conjunction with the discrete wavelet transform [Bola95].
4.6.2.2 Discrete Wavelet Excitation Coding While most of the successful speech codecs nowadays use some form of closed-loop time-domain analysis-by-synthesis, such as MPLPC, high-performance LP-based perceptual audio coding has been realized with alternative frequency-domain excitation models. For instance, Boland and Deriche reported output quality comparable to MPEG-1, Layer II at 128 kb/s for an LPC audio coder operating at 96 kb/s [Bola98], in which the prediction residual was transform coded using a three-level discrete wavelet transform (DWT) (see also Section 8.2) based on a four-band uniform filter bank. At each level of the DWT, the lowest subband of the previous level was decomposed into four uniform bands. This 10-band nonuniform structure was intended to mimic critical bandwidths to a certain extent. A perceptual bit allocation according to MPEG-1, psychoacoustic model 2 was applied to the transform coefficients.
4.6.2.3 Sinusoidal Excitation Coding Excitation sequences modeled as a sum of sinusoids were investigated in [Chan96]. This form of excitation is based on the tendency of the prediction residuals to be spectrally impulsive rather than flat for high-fidelity audio. In coding experiments using 32-kHz-sampled input audio, subjective and objective quality improvements relative to the MPLPC coders were reported for the sinusoidal excitation schemes, with high-quality output audio reported at 72 kb/s. In the experiments reported in [Chan97], a set of ten LP coefficients is estimated on 9.4-ms analysis frames and split-vector quantized using 24 bits. Then, the prediction residual is analyzed, and sinusoidal parameters are estimated for the seven best out of a candidate set of thirteen sinusoids for each of six subframes. The masked threshold is estimated and used to form a time-varying bit allocation for the amplitudes, frequencies, and phases on each subframe. Given a frame allocation of 675 bits, a total of 573, 78, and 24 bits, respectively, are allocated to the sinusoidal parameters, the bit allocation side information, and the LP coefficients. Sinusoidal excitation coding, when used in conjunction with a masking-threshold adapted weighting filter, resulted in improved quality relative to MPEG-1, Layer I at a bit rate of 96 kb/s [Chan96] for selected test material.
4.6.2.4 Frequency-Warped LP Beyond the performance improvements realized through the use of different excitation models, there has been interest in warping the frequency axis before LP analysis to effectively provide better resolution at certain frequencies. In the context of perceptual coding, it is naturally of interest to achieve a Bark-scale warping. Frequency axis warping to achieve nonuniform FFT resolution was first introduced by Oppenheim, Johnson, and Steiglitz [Oppe71] [Oppe72], using a network of cascaded first-order all-pass sections for frequency warping of the signal, followed by a standard FFT. The idea was later extended to warped linear prediction (WLP) by Strube [Stru80] and was ultimately applied to an ADPCM codec [Krug88]. Cascaded first-order all-pass sections were used to warp the signal, and then the LP autocorrelation
analysis was performed on the warped autocorrelation sequence. In this scenario, a single-parameter warping of the frequency axis can be introduced into the LP analysis by replacing the delay elements in the FIR analysis filter (Figure 4.4) with all-pass sections. This is done by replacing the complex variable z^−1 of the FIR system function with another filter, H(z), of the form
H(z) = (z^−1 − λ) / (1 − λ z^−1). (4.17)
Thus, the predicted sample value is not produced from a combination of past samples, as in Eq. (4.4), but rather from the samples of a warped signal. In fact, it has been shown [Smit95] [Smit99] that selecting the value of 0.723 for the parameter λ leads to a frequency warp that approximates well the Bark frequency scale. A WLP-based audio codec [Harm96] was recently proposed. The inherent Bark frequency resolution of the WLP residual yields a perceptually shaped quantization noise without the use of an explicit perceptual model or time-varying bit allocation. In this system, a 40th-order WLP synthesis filter is combined with differential encoding of the prediction residual. A fixed rate of 2 bits per sample (88.2 kb/s) is allocated to the residual sequence, and 5 bits per coefficient are allocated to the prediction coefficients on an analysis frame of 800 samples, or 18 ms. This translates to a bit rate of 99.2 kb/s per channel. In objective terms, an auditory error measure showed considerable improvement for the WLP coding error in comparison to a conventional LP coding error when the same number of bits was allocated to the prediction residuals. Subjectively, the algorithm was reported to achieve transparent quality for some material, but it also had difficulty with transients at the frame boundaries.
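The behavior of the all-pass section in Eq. (4.17) can be verified directly: on the unit circle it has unit magnitude everywhere, and its phase defines the warped frequency axis. A small sketch, using λ = 0.723 as quoted above:

```python
import cmath, math

lam = 0.723  # Bark-approximating warping parameter quoted in the text

def allpass(w):
    """First-order all-pass section of Eq. (4.17), evaluated at e^{jw}."""
    z1 = cmath.exp(-1j * w)  # z^-1 on the unit circle
    return (z1 - lam) / (1 - lam * z1)

def warped_freq(w):
    """Warped frequency axis = negative phase of the all-pass section."""
    return -cmath.phase(allpass(w))

# |H(e^jw)| = 1 everywhere: the section reshapes only the phase (delay).
for k in range(1, 64):
    assert abs(abs(allpass(math.pi * k / 64)) - 1.0) < 1e-12

# Low frequencies are stretched by roughly (1 + lam)/(1 - lam) ~ 6.2,
# which is what gives the warped predictor its Bark-like resolution.
assert warped_freq(0.01) / 0.01 > 5.0
```

Replacing each unit delay of the FIR analysis filter with this section makes the predictor operate on the warped signal samples described in the text.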
The algorithm was later extended to handle stereophonic signals [Harm97a] by forming a complex-valued representation of the two channels and then using a version of WLP modified for complex signals (CWLP). It was suggested that significant quality improvement could be realized for the WLPC audio coder by using a closed-loop analysis-by-synthesis procedure [Harm97b]. One of the shortcomings of the original WLP coder was inadequate attention to temporal effects. As a result, further experiments were reported [Harm98] in which WLP was combined with temporal noise shaping (TNS) to realize additional quality improvement.
4.7 SUMMARY
In this chapter, we presented the LP-based source-system model and described its applications in narrowband and wideband coding. Some of the topics presented in this chapter include:
• Short-term linear prediction
• Conventional LP analysis-synthesis
• Closed-loop analysis-by-synthesis hybrid coders
• Code-excited linear prediction (CELP) speech standards
• Linear prediction in wideband coding
• Frequency-warped LP
PROBLEMS
4.1 The following autocorrelation sequence is given:

rss(m) = 0.8^|m| / 0.36.

Describe a source-system mechanism that will generate a signal with this autocorrelation.
4.2 Sketch the magnitude frequency response of the following filter function:

H(z) = 1 / (1 − 0.9 z^−10).
4.3 The autocorrelation sequence in Figure 4.11 corresponds to a strongly voiced speech segment. Show with an arrow which autocorrelation sample relates to the pitch period of the voiced signal. Estimate the pitch period from the graph.
4.4 Consider Figure 4.12 with a white Gaussian input signal.
a. Determine analytically the LP coefficients for a first-order predictor and for H(z) = 1/(1 − 0.8 z^−1).
b. Determine analytically the LP coefficients for a second-order predictor and for H(z) = 1 + z^−1 + z^−2.
[Figure 4.11: plot of the autocorrelation rss versus lag index m, from m = 0 to 120.]

Figure 4.11 Autocorrelation of a voiced speech segment.
[Figure 4.12 block diagram: a white Gaussian input (σ² = 1, μ = 0) drives the filter H(z); the output s(n) feeds a linear predictor, producing the residual e(n).]

Figure 4.12 LP coefficient estimation.
COMPUTER EXERCISES
The necessary MATLAB software for the computer simulations and the speech/audio files (Ch4Sp8.wav, Ch4Au8.wav, and Ch4Au16.wav) can be obtained from the Book website.
4.5 Linear predictive coding (LPC)
a. Write a MATLAB program to load, display, and play back speech files. Use Ch4Sp8.wav for this computer exercise.
b. Include a framing module in your program and set the frame size to 256 samples. Every frame should be read in a 256 × 1 real vector called Stime. Compute the fast Fourier transform (FFT) of this vector, i.e., Sfreq = fft(Stime). Next, compute the magnitude of the complex vector Sfreq and plot its magnitude in dB up to the fold-over frequency. This computation should be part of your frame-by-frame speech processing program.
Deliverable 1
Present at least one plot of time- and one corresponding plot of frequency-domain data for a voiced, an unvoiced, and a mixed speech segment. (A total of six plots; use the subplot command.)
c. Pitch period and voicing estimation. The period of a strongly voiced speech signal is associated in a reciprocal manner with the fundamental frequency of the corresponding harmonic spectrum. That is, if the pitch period is T, the fundamental frequency is 1/T. Note that T can be measured in terms of the number of samples within a pitch period for voiced speech. If T is to be measured in ms, then multiply the number of samples by 1/Fs, where Fs is the sampling frequency of the input speech.
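The reciprocal relation between pitch period and fundamental frequency can also be automated with a simple autocorrelation peak picker. This is an illustrative Python helper (`autocorr_pitch` is ours, not part of the book's MATLAB software), shown on a synthetic pulse train whose period is known:

```python
def autocorr_pitch(frame, fs, t_min=20, t_max=160):
    """Pick the autocorrelation peak in a plausible pitch-lag range.
    Returns (pitch period T in samples, fundamental frequency fs/T)."""
    n = len(frame)
    def r(m):
        return sum(frame[k] * frame[k + m] for k in range(n - m))
    T = max(range(t_min, t_max), key=r)
    return T, fs / T

# Synthetic voiced-like frame: pulse train with an 80-sample period at
# Fs = 8 kHz, i.e., a fundamental frequency of 8000/80 = 100 Hz.
fs = 8000
frame = [1.0 if k % 80 == 0 else 0.0 for k in range(256)]
T, f0 = autocorr_pitch(frame, fs)
```

For real speech frames the peak is less clean, which is why the exercise asks for visual inspection and a voicing decision first.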
Deliverable 2
Create and fill Table 4.2 for the first 30 speech frames by visual inspection, as follows: when the segment is voiced, enter 1 in the 2nd column; if speech pause (i.e., no speech present), enter 0; if unvoiced, enter 0.25; and if mixed, enter 0.5. Measure the pitch period visually from the time-domain plot in terms of the number of samples in a pitch period. If the segment is unvoiced or pause, enter infinity for the pitch period and hence zero for the fundamental frequency. If the segment is mixed, do your best to obtain an estimate of the pitch; if it is not possible, set the pitch to infinity.
Deliverable 3
From Table 4.2, plot the fundamental frequency as a function of the frame number for all thirty frames. This is called the pitch frequency contour.
Table 4.2 Pitch period, voicing, and frame energy measurements

Speech frame   Voiced/unvoiced/   Pitch (number   Frame    Fundamental
number         mixed/pause        of samples)     energy   frequency (Hz)
1
2
3
.
.
30
Deliverable 4
From Table 4.2, plot also (i) the voicing and (ii) the frame energy (in dB) as a function of the frame number for all thirty frames.
d. The FFT and LP spectra. Write a MATLAB program to implement the Levinson-Durbin recursion. Assume a tenth-order LP analysis and estimate the LP coefficients (lp coeff) for each speech frame.
Deliverable 5
Compute the LP spectra as follows: H allpole = freqz(1, lp coeff). Superimpose the LP spectra (H allpole) with the FFT speech spectra (Sfreq) for a voiced segment and an unvoiced segment. Plot the spectral magnitudes in dB up to the fold-over frequency. Note that the LPC spectra look like a smoothed version of the FFT spectra. Divide Sfreq by H allpole. Plot the magnitude of the result in dB up to the fold-over frequency. What does the resulting spectrum represent?
Deliverable 6
From the LP spectra, measure (visually) the frequencies of the first three formants, F1, F2, and F3. Give these frequencies in Hz (Table 4.3). Plot the three formants across the frame number. These will be the formant contours. Use different line types or colors to discriminate the three contours.
e. LP analysis-synthesis. Using the prediction coefficients (lp coeff) from part (d), perform LP analysis. Use the mathematical formulation given in Sections 4.2 and 4.3. Quantize both the LP coefficients and the prediction
Table 4.3 Formants F1, F2, and F3

Speech frame number   F1 (Hz)   F2 (Hz)   F3 (Hz)
1
2
3
.
.
30
residual using a 3-bit (i.e., 8-level) scalar quantizer. Next, perform LP synthesis and reconstruct the speech signal.
Deliverable 7
Plot the quantized residual and its corresponding dB spectrum for a voiced and an unvoiced frame. Provide plots of the original and reconstructed speech. Compute the SNR in dB for the entire reconstructed signal relative to the original record. Listen to the reconstructed signal and provide a subjective score on a scale of 1 to 5. Repeat this step when an 8-bit scalar quantizer is employed. In your simulation, when the LP coefficients are quantized using a 3-bit scalar quantizer, the LP synthesis filter will become unstable for certain frames. What are the consequences of this?
4.6 The FS-1016 CELP standard
a. Obtain the MATLAB software for the FS-1016 CELP from the Book website. Use the following wave files: Ch4Sp8.wav and Ch4Au8.wav.
Deliverable 1
Give plots of the entire input and the FS-1016 synthesized output for the two wave files. Comment on the quality of the two synthesized wave files. In particular, give more emphasis to Ch4Au8.wav and give specific reasons why the FS-1016 does not synthesize Ch4Au8.wav with high quality. Listen to the output files and provide a subjective evaluation. The CELP FS-1016 coder scored 3.2 out of 5 in government tests on a MOS scale. How would you rate its performance in terms of MOS for Ch4Sp8.wav and Ch4Au8.wav? Also give segmental SNR values for a voiced/unvoiced/mixed frame and the overall SNR for the entire record. Present your results as Table 4.4, with an appropriate caption.
b. Spectrum analysis. File CELPANAL.M, from lines 134 to 159. Valuable comments describing the variable names, globals, and inputs/outputs are provided at the beginning of the MATLAB file CELPANAL.M to further assist you with understanding the MATLAB
Table 4.4 FS-1016 CELP subjective and objective evaluation

Ch4Sp8.wav
  Speech frame number     Segmental SNR for the chosen frame (dB)
  Voiced
  Unvoiced
  Mixed
  Overall SNR for the entire speech record = (dB)
  MOS on a scale of 1–5 =

Ch4Au8.wav
  Overall SNR for the entire music record = (dB)
  MOS on a scale of 1–5 =
program. Specifically, some of the useful variables in the MATLAB code include: snew (input speech buffer), fcn (LP filter coefficients of 1/A(z)), rcn (reflection coefficients), newfreq (LSP frequencies), unqfreq (unquantized LSPs), newfreq (quantized LSPs), and lsp (interpolated LSPs for each subframe).
Deliverable 2
Choose Ch4Sp8.wav and use the voiced/unvoiced/mixed frames selected in the previous step. Indicate the frame numbers. FS-1016 employs 30-ms speech frames, so the frame size is fixed at 240 samples. Give time-domain (variable in the code: 'snew') and frequency-domain plots (use FFT size 512; include commands as necessary in the program to obtain the FFT) of the selected voiced/unvoiced/mixed frame. Also plot the LPC spectrum using
figure; freqz(1, fcn)
Study the interlacing property of the LSPs on the unit circle. Note that in the FS-1016 standard, the LSPs are encoded using scalar quantization. Plot the LPC spectra obtained from the unquantized LSPs and the quantized LSPs. You have to convert the LSPs to LPCs in both cases (unquantized and quantized) and use the freqz command to plot the LPC spectra. Give a z-domain plot (of a voiced and an unvoiced frame) containing the pole locations (shown as crosses, 'x') of the LPC spectra and the roots of the symmetric (shown as black circles, 'o') and asymmetric (shown as red circles, 'o') LSP polynomials. Note the interlacing nature of the black and red circles; they always lie on the unit circle. Also note that if a pole is close to the unit circle, the corresponding LSPs will be close to each other. In the file CELPANAL.M, line 124, high-pass filtering is performed to eliminate the undesired low frequencies. Experiment with and without the high-pass filter to note the presence of humming and low-frequency noise in the synthesized speech.
c. Pitch analysis. File CELPANAL.M, from lines 162 to 191.
Deliverable 3
What are the key advantages of employing subframes in speech coding (e.g., interpolation, pitch prediction)? Explain, in general, the differences between long-term prediction and short-term prediction. Give the necessary transfer functions. In particular, describe what aspects of speech each of the two predictors captures.
Deliverable 4
Insert in file CELPANAL.M, after line 182:
tauptr = 75
Perform an evaluation of the perceptual quality of the synthesized speech and give your remarks. How does the speech quality change by forcing the pitch to a predetermined value? (Choose different tauptr values: 40, 75, and 110.)
4.7 The warped LP for audio analysis-synthesis
a. Write a MATLAB program to perform analysis-synthesis using (i) the conventional LP and (ii) the warped LP. Use Ch4Au8.wav as the input wave file.
b. Perform a tenth-order LP analysis and use a 5-bit scalar quantizer to quantize the LP and the WLP coefficients, and a 3-bit scalar quantizer for the excitation vector. Perform audio synthesis using the quantized LP and WLP analysis parameters. Compute the warping coefficient [Smit99] [Harm01] using
λ = 1.0674 ((2/π) arctan(0.06583 Fs/1000))^(1/2) − 0.1916,
where Fs is the sampling frequency of the input audio. Comment on the quality of the synthesized audio from the LP and WLP analysis-synthesis. Repeat this step for Ch4Au16.wav. Refer to [Harm00] for the implementation of WLP synthesis filters.
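Evaluating the warping-coefficient formula above for common sampling rates is a one-line transcription (the commented values are approximate):

```python
import math

def bark_warp_coeff(fs_hz):
    """lambda = 1.0674 sqrt((2/pi) arctan(0.06583 Fs/1000)) - 0.1916."""
    return (1.0674 * math.sqrt((2.0 / math.pi)
                               * math.atan(0.06583 * fs_hz / 1000.0))
            - 0.1916)

lam_8k = bark_warp_coeff(8000)      # ~0.40, narrowband speech
lam_16k = bark_warp_coeff(16000)    # ~0.58, wideband speech
lam_44k = bark_warp_coeff(44100)    # ~0.76, CD-quality audio
```

The coefficient grows with the sampling rate because a larger share of the band lies above the frequencies where Bark resolution is finest.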
CHAPTER 5
PSYCHOACOUSTIC PRINCIPLES
5.1 INTRODUCTION
The field of psychoacoustics [Flet40] [Gree61] [Zwis65] [Scha70] [Hell72] [Zwic90] [Zwic91] has made significant progress toward characterizing human auditory perception and particularly the time-frequency analysis capabilities of the inner ear. Although applying perceptual rules to signal coding is not a new idea [Schr79], most current audio coders achieve compression by exploiting the fact that "irrelevant" signal information is not detectable by even a well-trained or sensitive listener. Irrelevant information is identified during signal analysis by incorporating into the coder several psychoacoustic principles, including absolute hearing thresholds, critical band frequency analysis, simultaneous masking, the spread of masking along the basilar membrane, and temporal masking. Combining these psychoacoustic notions with basic properties of signal quantization has also led to the theory of perceptual entropy [John88b], a quantitative estimate of the fundamental limit of transparent audio signal compression.
This chapter reviews psychoacoustic fundamentals and perceptual entropy, and then gives as an application example some details of the ISO/MPEG psychoacoustic model 1. Before proceeding, however, it is necessary to define the sound pressure level (SPL), a standard metric that quantifies the intensity of an acoustical stimulus [Zwic90]. Nearly all of the auditory psychophysical phenomena addressed in this book are treated in terms of SPL. The SPL gives the level (intensity) of sound pressure in decibels (dB) relative to an internationally defined reference level, i.e., LSPL = 20 log10(p/p0) dB, where LSPL is the SPL of a stimulus, p is the sound pressure of the stimulus in Pascals (Pa, equivalent to Newtons per square meter (N/m^2)), and p0 is the standard reference level of 20 μPa, or 2 × 10^−5 N/m^2 [Moor77]. About 150 dB SPL spans the dynamic range of intensity for the human auditory system, from the limits of detection for low-intensity (quiet) stimuli up to the threshold of pain for high-intensity (loud) stimuli. The SPL reference level is calibrated such that the frequency-dependent absolute threshold of hearing in quiet (Section 5.2) tends to measure in the vicinity of 0 dB SPL. On the other hand, a stimulus level of 140 dB SPL is typically at or above the threshold of pain.

Audio Signal Processing and Coding, by Andreas Spanias, Ted Painter, and Venkatraman Atti. Copyright 2007 by John Wiley & Sons, Inc.
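The SPL definition above is a one-liner in code; for instance, a 1-Pa RMS tone lands near 94 dB SPL, a common calibrator level:

```python
import math

P0 = 20e-6  # standard reference pressure, 20 uPa, in Pa

def spl_db(p_pascal):
    """L_SPL = 20 log10(p / p0), in dB SPL."""
    return 20.0 * math.log10(p_pascal / P0)

# spl_db(1.0) is about 94 dB SPL; spl_db(P0) is 0 dB SPL by definition.
```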
5.2 ABSOLUTE THRESHOLD OF HEARING
The absolute threshold of hearing characterizes the amount of energy needed in a pure tone such that it can be detected by a listener in a noiseless environment. The absolute threshold is typically expressed in terms of dB SPL. The frequency dependence of this threshold was quantified as early as 1940, when Fletcher [Flet40] reported test results for a range of listeners that were generated in a National Institutes of Health (NIH) study of typical American hearing acuity. The quiet threshold is well approximated [Terh79] by the nonlinear function
Tq(f) = 3.64 (f/1000)^−0.8 − 6.5 e^(−0.6 (f/1000 − 3.3)^2) + 10^−3 (f/1000)^4 (dB SPL), (5.1)
which is representative of a young listener with acute hearing. When applied to signal compression, Tq(f) could be interpreted naively as a maximum allowable energy level for coding distortions introduced in the frequency domain (Figure 5.1).
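Eq. (5.1) can be evaluated directly; the sketch below reproduces the qualitative shape of the curve (high thresholds at the band edges, a dip below 0 dB SPL in the region of greatest sensitivity near 3–4 kHz):

```python
import math

def threshold_quiet_db_spl(f_hz):
    """Tq(f) per Eq. (5.1); f in Hz, result in dB SPL."""
    khz = f_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)

# The exponential term drives the dip near 3.3 kHz; the power-law term
# dominates at low frequencies and the quartic term at high frequencies.
```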
[Figure 5.1: plot of sound pressure level, SPL (dB), from 0 to 100 dB, versus frequency (Hz) on a logarithmic axis from 10^2 to 10^4 Hz.]

Figure 5.1 The absolute threshold of hearing in quiet.
At least two caveats must govern this practice, however. First, whereas the thresholds captured in Figure 5.1 are associated with pure tone stimuli, the quantization noise in perceptual coders tends to be spectrally complex rather than tonal. Secondly, it is important to realize that algorithm designers have no a priori knowledge regarding actual playback levels (SPL), and therefore the curve is often referenced to the coding system by equating the lowest point (i.e., near 4 kHz) to the energy in ±1 bit of signal amplitude. In other words, it is assumed that the playback level (volume control) on a typical decoder will be set such that the smallest possible output signal will be presented close to 0 dB SPL. This assumption is conservative for quiet to moderate listening levels in uncontrolled open-air listening environments, and therefore this referencing practice is commonly found in algorithms that utilize the absolute threshold of hearing. We note that the absolute hearing threshold is related to a commonly encountered acoustical metric other than SPL, namely, dB sensation level (dB SL). Sensation level (SL) denotes the intensity level difference for a stimulus relative to a listener's individual unmasked detection threshold for the stimulus [Moor77]. Hence, "equal SL" signal components may have markedly different absolute SPLs, but all equal SL components will have equal supra-threshold margins. The motivation for the use of SL measurements is that SL quantifies listener-specific audibility rather than an absolute level. Whether the target metric is SPL or SL, perceptual coders must eventually reference the internal PCM data to a physical scale. A detailed example of this referencing for SPL is given in Section 5.7 of this chapter.
5.3 CRITICAL BANDS
Using the absolute threshold of hearing to shape the coding distortion spectrum represents the first step towards perceptual coding. Considered on its own, however, the absolute threshold is of limited value in coding. The detection threshold for spectrally complex quantization noise is a modified version of the absolute threshold, with its shape determined by the stimuli present at any given time. Since stimuli are in general time-varying, the detection threshold is also a time-varying function of the input signal. In order to estimate this threshold, we must first understand how the ear performs spectral analysis. A frequency-to-place transformation takes place in the cochlea (inner ear), along the basilar membrane [Zwic90].
The transformation works as follows. A sound wave generated by an acoustic stimulus moves the eardrum and the attached ossicular bones, which in turn transfer the mechanical vibrations to the cochlea, a spiral-shaped, fluid-filled structure that contains the coiled basilar membrane. Once excited by mechanical vibrations at its oval window (the input), the cochlear structure induces traveling waves along the length of the basilar membrane. Neural receptors are connected along the length of the basilar membrane. The traveling waves generate peak responses at frequency-specific membrane positions, and therefore different neural receptors
are effectively "tuned" to different frequency bands according to their locations. For sinusoidal stimuli, the traveling wave on the basilar membrane propagates from the oval window until it nears the region with a resonant frequency near that of the stimulus frequency. The wave then slows, and the magnitude increases to a peak. The wave decays rapidly beyond the peak. The location of the peak is referred to as the "best place" or "characteristic place" for the stimulus frequency, and the frequency that best excites a particular place [Beke60] [Gree90] is called the "best frequency" or "characteristic frequency." Thus, a frequency-to-place transformation occurs. An example is given in Figure 5.2 for a three-tone stimulus. The interested reader can also find online a number of high-quality animations demonstrating this aspect of cochlear mechanics [Twve99].
As a result of the frequency-to-place transformation, the cochlea can be viewed from a signal processing perspective as a bank of highly overlapping bandpass filters. The magnitude responses are asymmetric and nonlinear (level-dependent). Moreover, the cochlear filter passbands are of nonuniform bandwidth, and the bandwidths increase with increasing frequency. The "critical bandwidth" is a function of frequency that quantifies the cochlear filter passbands. Empirical work by several observers led to the modern notion of critical bands [Flet40] [Gree61] [Zwis65] [Scha70]. We will consider two typical examples.
In one scenario, the loudness (perceived intensity) remains constant for a narrowband noise source presented at a constant SPL even as the noise bandwidth is increased up to the critical bandwidth. For any increase beyond the critical bandwidth, the loudness then begins to increase. In this case, one can imagine that loudness remains constant as long as the noise energy is present within only one cochlear "channel" (critical bandwidth), and then that the loudness increases as soon as the noise energy is forced into adjacent cochlear "channels." Critical bandwidth can also be viewed as the result of auditory detection efficacy in terms of a signal-to-noise ratio (SNR) criterion. The power spectrum model
[Figure 5.2: schematic plot of basilar membrane displacement versus distance from the oval window (mm, 32 down to 0), showing traveling-wave envelopes that peak at distinct places for 400 Hz, 1600 Hz, and 6400 Hz components.]

Figure 5.2 The frequency-to-place transformation along the basilar membrane. The picture gives a schematic representation of the traveling wave envelopes (measured in terms of vertical membrane displacement) that occur in response to an acoustic tone complex containing sinusoids of 400, 1600, and 6400 Hz. Peak responses for each sinusoid are localized along the membrane surface, with each peak occurring at a particular distance from the oval window (cochlear "input"). Thus, each component of the complex stimulus evokes strong responses only from the neural receptors associated with frequency-specific loci (after [Zwic90]).
of hearing assumes that the masked threshold for a given listener will occur at a constant, listener-specific SNR [Moor96]. In the critical bandwidth measurement experiments, the detection threshold for a narrowband noise source presented between two masking tones remains constant as long as the frequency separation between the tones remains within a critical bandwidth (Figure 5.3a). Beyond this bandwidth, the threshold rapidly decreases (Figure 5.3c). From the SNR viewpoint, one can imagine that as long as the masking tones are presented within the passband of the auditory filter (critical bandwidth) that is tuned to the probe noise, the SNR presented to the auditory system remains constant, and hence the detection threshold does not change. As the tones spread further apart and transition into the filter stopband, however, the SNR presented to the auditory system improves, and hence the detection task becomes easier. In order to maintain a constant SNR at threshold for a particular listener, the power spectrum model calls for a reduction in the probe noise commensurate with the reduction in the energy of the masking tones as they transition out of the auditory filter passband. Thus, beyond the critical bandwidth, the detection threshold for the probe decreases, and the threshold SNR remains constant. A notched-noise experiment with a similar interpretation can be constructed with masker and maskee roles reversed (Figure 5.3, b and d). Critical bandwidth tends to remain constant (about 100 Hz) up to 500 Hz, and increases to approximately 20% of the center frequency above 500 Hz. For an average listener, critical bandwidth (Figure 5.4b) is conveniently approximated [Zwic90] by
BW_c(f) = 25 + 75[1 + 1.4(f/1000)^2]^0.69 (Hz)    (5.2)
Although the function BW_c is continuous, it is useful when building practical systems to treat the ear as a discrete set of bandpass filters that conforms to (5.2). The function [Zwic90]
Z_b(f) = 13 arctan(0.00076f) + 3.5 arctan[(f/7500)^2] (Bark)    (5.3)
is often used to convert from frequency in Hertz to the Bark scale (Figure 5.4(a)). Corresponding to the center frequencies of the Table 5.1 filter bank, the numbered points in Figure 5.4(a) illustrate that the nonuniform Hertz spacing of the filter bank (Figure 5.5) is actually uniform on a Bark scale. Thus, one critical bandwidth (CB) comprises one Bark. Table 5.1 gives an idealized filter bank that corresponds to the discrete points labeled on the curves in Figure 5.4(a, b). A distance of 1 critical band is commonly referred to as one Bark in the literature.
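Both approximations are straightforward to evaluate numerically. A minimal sketch in Python (the function names are ours, not from the text):

```python
import math

def critical_bandwidth_hz(f):
    """Critical bandwidth BW_c(f) of Eq. (5.2), in Hz, for center frequency f in Hz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f / 1000.0) ** 2) ** 0.69

def hz_to_bark(f):
    """Critical band rate Z_b(f) of Eq. (5.3), mapping frequency in Hz to Barks."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)
```

For example, `critical_bandwidth_hz(1000)` evaluates to roughly 162 Hz, consistent with the 920–1080 Hz bandwidth of band 9 in Table 5.1.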
Although the critical bandwidth captured in Eq. (5.2) is widely used in perceptual models for audio coding, we note that there are alternative expressions. In particular, the equivalent rectangular bandwidth (ERB) scale emerged from research directed towards measurement of auditory filter shapes. Experimental data is obtained typically from notched-noise masking procedures. Then the masking data is fitted with parametric weighting functions that represent the spectral shaping properties of the
Figure 5.3. Critical band measurement methods. (a, c) Detection threshold decreases as masking tones transition from auditory filter passband into stopband, thus improving detection SNR. (b, d) Same interpretation with roles reversed (after [Zwic90]).
auditory filters [Moor96]. Rounded exponential models with one or two free parameters are popular. For example, the single-parameter roex(p) model is given by
W(g) = (1 + pg)e^(−pg)    (5.4)
where g = |f − f_0|/f_0 is the normalized frequency, f_0 is the center frequency of the filter, and f represents frequency in Hz. Although the roex(p) model does not capture filter asymmetry, asymmetric filter shapes are possible if two roex(p) models are used independently for the high- and low-frequency filter skirts. Two-parameter models such as the roex(p, r) are also used to gain additional degrees of freedom [Moor96] in order to improve the accuracy of the filter shape estimates. After curve-fitting, an ERB estimate is obtained directly from the parametric filter shape. For the roex(p) model, it can be shown easily that the equivalent rectangular bandwidth is given by
ERB_roex(p) = 4f_0/p    (5.5)
We note that some texts denote ERB by equivalent noise bandwidth. An example is given in Figure 5.6. The solid line in Figure 5.6(a) shows an example roex(p) filter estimated for a center frequency of 1 kHz, while the dashed line shows the ERB associated with the given roex(p) filter shape.
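Eqs. (5.4) and (5.5) are simple to evaluate once a value of p has been fitted; a small sketch (function names are illustrative, and in practice p would come from fitting notched-noise masking data):

```python
import math

def roex_weight(f, f0, p):
    """roex(p) filter magnitude W(g) = (1 + pg)exp(-pg) of Eq. (5.4), g = |f - f0|/f0."""
    g = abs(f - f0) / f0
    return (1.0 + p * g) * math.exp(-p * g)

def erb_of_roex(f0, p):
    """Equivalent rectangular bandwidth of the fitted roex(p) filter, Eq. (5.5)."""
    return 4.0 * f0 / p
```

With p near 30 at f_0 = 1 kHz, for instance, the ERB evaluates to about 133 Hz, close to the value given by the ERB curve fit of Eq. (5.6).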
In [Moor83] and [Glas90], Moore and Glasberg summarized experimental ERB measurements for roex(p, r) models obtained over a period of several years
Figure 5.4. Two views of critical bandwidth. (a) Critical band rate Z_b(f) maps from Hertz to Barks, and (b) critical bandwidth BW_c(f) expresses critical bandwidth as a function of center frequency in Hertz. The "x"s denote the center frequencies of the idealized critical band filter bank given in Table 5.1.
by a number of different investigators. Given a collection of ERB measurements on center frequencies across the audio spectrum, a curve fitting on the data set yielded the following expression for ERB as a function of center frequency:
ERB(f) = 24.7(4.37(f/1000) + 1)    (5.6)
Figure 5.5. Idealized critical band filter bank.
As shown in Figure 5.6(b), the function specified by Eq. (5.6) differs from the critical bandwidth of Eq. (5.2). Of particular interest for perceptual codec designers, the ERB scale implies that auditory filter bandwidths decrease below 500 Hz, whereas the critical bandwidth remains essentially flat. The apparent increased frequency selectivity of the auditory system below 500 Hz has implications for optimal filter-bank design as well as for perceptual bit allocation strategies. These implications are addressed later in the book.
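The divergence below 500 Hz is easy to verify numerically. A quick comparison of Eqs. (5.2) and (5.6) (a sketch restating both formulas):

```python
def critical_bandwidth_hz(f):
    return 25.0 + 75.0 * (1.0 + 1.4 * (f / 1000.0) ** 2) ** 0.69   # Eq. (5.2)

def erb_hz(f):
    return 24.7 * (4.37 * f / 1000.0 + 1.0)                        # Eq. (5.6)

# Print (frequency, critical bandwidth, ERB) pairs across the low end of the spectrum
for f in (100, 250, 500, 1000):
    print(f, round(critical_bandwidth_hz(f)), round(erb_hz(f)))
```

At 100 Hz the critical bandwidth stays near 100 Hz while the ERB has narrowed to about 35 Hz; the ERB remains the narrower of the two across the audio band, most dramatically below 500 Hz.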
Regardless of whether it is best characterized in terms of critical bandwidth or ERB, the frequency resolution of the auditory filter bank largely determines which portions of a signal are perceptually irrelevant. The auditory time-frequency analysis that occurs in the critical band filter bank induces simultaneous and nonsimultaneous masking phenomena that are routinely used by modern audio coders to shape the coding distortion spectrum. In particular, the perceptual models allocate bits for signal components such that the quantization noise is shaped to exploit the detection thresholds for a complex sound (e.g., quantization noise). These thresholds are determined by the energy within a critical band [Gass54]. Masking properties and masking thresholds are described next.
5.4 SIMULTANEOUS MASKING, MASKING ASYMMETRY, AND THE SPREAD OF MASKING
Masking refers to a process where one sound is rendered inaudible because of the presence of another sound. Simultaneous masking may occur whenever two or more stimuli are simultaneously presented to the auditory system. From a frequency-domain point of view, the relative shapes of the masker and maskee magnitude spectra determine to what extent the presence of certain spectral energy will mask the presence of other spectral energy. From a time-domain perspective, phase relationships between stimuli can also affect masking outcomes. A simplified explanation of the mechanism underlying simultaneous masking phenomena is that the presence of a strong noise or tone masker creates an excitation of sufficient strength on the basilar membrane at the critical band location to block effectively detection of a weaker signal. Although arbitrary audio spectra may contain complex simultaneous masking scenarios, for the purposes of shaping coding distortions it is convenient to distinguish between three types of simultaneous masking, namely noise-masking-tone (NMT) [Scha70], tone-masking-noise (TMN) [Hell72], and noise-masking-noise (NMN) [Hall98].

Table 5.1. Idealized critical band filter bank (after [Scha70]). Band edges and center frequencies for a collection of 25 critical bandwidth auditory filters that span the audio spectrum.

Band number   Center frequency (Hz)   Bandwidth (Hz)
 1       50        –100
 2      150     100–200
 3      250     200–300
 4      350     300–400
 5      450     400–510
 6      570     510–630
 7      700     630–770
 8      840     770–920
 9     1000     920–1080
10     1175    1080–1270
11     1370    1270–1480
12     1600    1480–1720
13     1850    1720–2000
14     2150    2000–2320
15     2500    2320–2700
16     2900    2700–3150
17     3400    3150–3700
18     4000    3700–4400
19     4800    4400–5300
20     5800    5300–6400
21     7000    6400–7700
22     8500    7700–9500
23    10500    9500–12000
24    13500   12000–15500
25    19500   15500–
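Table 5.1 is easy to use programmatically. A small lookup sketch (the edge list transcribes the table's band boundaries; the 25-kHz cap on the open-ended top band is our assumption):

```python
import bisect

# Band edges (Hz) from Table 5.1; band i spans EDGES[i-1] to EDGES[i].
# The top band is open-ended in the table; 25000 Hz is an assumed cap.
EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720,
         2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500,
         12000, 15500, 25000]

def critical_band_number(f):
    """Return the Table 5.1 band number (1-25) containing frequency f in Hz."""
    return min(bisect.bisect_right(EDGES, f), 25)
```

For example, `critical_band_number(1000)` returns 9, matching the 920–1080 Hz band of the table.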
Figure 5.6. Equivalent rectangular bandwidth (ERB). (a) Example ERB for a roex(p) single-parameter estimate of the shape of the auditory filter centered at 1 kHz. The solid line represents an estimated spectral weighting function for a single-parameter fit to data from a notched-noise masking experiment; the dashed line represents the equivalent rectangular bandwidth. (b) ERB vs. critical bandwidth: the ERB scale of Eq. (5.6) (solid) vs. critical bandwidth of Eq. (5.2) (dashed) as a function of center frequency.
Figure 5.7. Example to illustrate the asymmetry of simultaneous masking. (a) Noise-masking-tone: at the threshold of detection, a 410-Hz pure tone presented at 76-dB SPL is just masked by a critical-bandwidth narrowband noise centered at 410 Hz (90-Hz BW) of overall intensity 80 dB SPL. This corresponds to a threshold minimum signal-to-mask ratio (SMR) of 4 dB. The threshold SMR increases as the probe tone is shifted either above or below 410 Hz. (b) Tone-masking-noise: at the threshold of detection, a 1-kHz pure tone presented at 80-dB SPL just masks a critical-band narrowband noise centered at 1 kHz of overall intensity 56-dB SPL. This corresponds to a threshold minimum SMR of 24 dB. As for the NMT experiment, threshold SMR for the TMN increases as the masking tone is shifted either above or below the noise center frequency, 1 kHz. When comparing (a) to (b), it is important to notice the apparent "masking asymmetry," namely that NMT produces a significantly smaller threshold minimum SMR (4 dB) than does TMN (24 dB).
5.4.1 Noise-Masking-Tone
In the NMT scenario (Figure 5.7a), a narrowband noise (e.g., having 1-Bark bandwidth) masks a tone within the same critical band, provided that the intensity of the masked tone is below a predictable threshold directly related to the intensity, and to a lesser extent the center frequency, of the masking noise. Numerous studies characterizing NMT for random noise and pure-tone stimuli have appeared since the 1930s (e.g., [Flet37], [Egan50]). At the threshold of detection for the masked tone, the minimum signal-to-mask ratio (SMR), i.e., the smallest difference between the intensity (SPL) of the masking noise ("signal") and the intensity of the masked tone ("mask"), occurs when the frequency of the masked tone is close to the masker's center frequency. In most studies, the minimum SMR tends to lie between −5 and +5 dB. For example, a sample threshold SMR result from the NMT investigation [Egan50] is schematically represented in Figure 5.7a. In the figure, a critical band noise masker centered at 410 Hz
with an intensity of 80 dB SPL masks a 410-Hz tone, and the resulting SMR at the threshold of detection is 4 dB.
Masking power decreases (i.e., SMR increases) for probe tones above and below the frequency of the minimum SMR tone, in accordance with a level- and frequency-dependent spreading function that is described later. We note that temporal factors also affect simultaneous masking. For example, in the NMT scenario, an overshoot effect is possible when the probe tone onset occurs within a short interval immediately following masker onset. Overshoot can boost simultaneous masking (i.e., decrease the threshold minimum SMR) by as much as 10 dB over a brief time span [Zwic90].
5.4.2 Tone-Masking-Noise
In the case of TMN (Figure 5.7b), a pure tone occurring at the center of a critical band masks noise of any subcritical bandwidth or shape, provided the noise spectrum is below a predictable threshold directly related to the strength, and to a lesser extent the center frequency, of the masking tone. In contrast to NMT, relatively few studies have attempted to characterize TMN. At the threshold of detection for a noise band masked by a pure tone, however, it was found in both [Hell72] and [Schr79] that the minimum SMR, i.e., the smallest difference between the intensity of the masking tone ("signal") and the intensity of the masked noise ("mask"), occurs when the masker frequency is close to the center frequency of the probe noise, and that the minimum SMR for TMN tends to lie between 21 and 28 dB. A sample result from the TMN study [Schr79] is given in Figure 5.7b. In the figure, a narrowband noise of one Bark bandwidth centered at 1 kHz is masked by a 1-kHz tone of intensity 80 dB SPL. The resulting SMR at the threshold of detection is 24 dB. As with NMT, the TMN masking power decreases for critical bandwidth probe noises centered above and below the minimum SMR probe noise.
5.4.3 Noise-Masking-Noise
The NMN scenario, in which a narrowband noise masks another narrowband noise, is more difficult to characterize than either NMT or TMN because of the confounding influence of phase relationships between the masker and maskee [Hall98]. Essentially, different relative phases between the components of each can lead to different threshold SMRs. The results from one study of intensity difference detection thresholds for wideband noise [Mill47] produced threshold SMRs of nearly 26 dB for NMN [Hall98].
5.4.4 Asymmetry of Masking
The NMT and TMN examples in Figure 5.7 clearly show an asymmetry in masking power between the noise masker and the tone masker. In spite of the fact that both maskers are presented at a level of 80 dB SPL, the associated threshold SMRs differ by 20 dB. This asymmetry motivates our interest in both the
TMN and NMT masking paradigms, as well as NMN. In fact, knowledge of all three is critical to success in the task of shaping coding distortion such that it is undetectable by the human auditory system. For each temporal analysis interval, a codec's perceptual model should identify, across the frequency spectrum, noise-like and tone-like components within both the audio signal and the coding distortion. Next, the model should apply the appropriate masking relationships in a frequency-specific manner. In conjunction with the spread of masking (below), NMT, NMN, and TMN properties can then be used to construct a global masking threshold. Although several methods for masking threshold estimation have proven effective, we note that a deeper understanding of masking asymmetry may provide opportunities for improved perceptual models. In particular, Hall [Hall97] has shown that masking asymmetry can be explained in terms of relative masker/maskee bandwidths, and not necessarily exclusively in terms of absolute masker properties. Ultimately, this implies that the de facto standard energy-based schemes for masking power estimation among perceptual codecs may be valid only so long as the masker bandwidth equals or exceeds maskee (probe) bandwidth. In cases where the probe bandwidth exceeds the masker bandwidth, an envelope-based measure should be embedded in the masking calculation [Hall97] [Hall98].
5.4.5 The Spread of Masking
The simultaneous masking effects characterized before by the paradigms NMT, TMN, and NMN are not bandlimited to within the boundaries of a single critical band. Interband masking also occurs, i.e., a masker centered within one critical band has some predictable effect on detection thresholds in other critical bands. This effect, also known as the spread of masking, is often modeled in coding applications by an approximately triangular spreading function that has slopes of +25 and −10 dB per Bark. A convenient analytical expression [Schr79] is given by
SF_dB(x) = 15.81 + 7.5(x + 0.474) − 17.5 √(1 + (x + 0.474)^2) dB    (5.7)
where x has units of Barks and SF_dB(x) is expressed in dB. After critical band analysis is done and the spread of masking has been accounted for, masking thresholds in perceptual coders are often established by the [Jaya93] decibel (dB) relations
TH_N = E_T − 14.5 − B    (5.8)

TH_T = E_N − K    (5.9)
where TH_N and TH_T, respectively, are the noise- and tone-masking thresholds due to tone-masking-noise and noise-masking-tone, E_N and E_T are the critical band noise and tone masker energy levels, and B is the critical band number. Depending upon the algorithm, the parameter K is typically set between 3 and 5 dB. Of course, the thresholds of Eqs. (5.8) and (5.9) capture only the contributions of individual tone-like or noise-like maskers. In the actual coding scenario, each
frame typically contains a collection of both masker types. One can see easily that Eqs. (5.8) and (5.9) capture the masking asymmetry described previously. After they have been identified, these individual masking thresholds are combined to form a global masking threshold.
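Eqs. (5.7)–(5.9) translate directly into code. A sketch of how an encoder might evaluate them (function names are illustrative):

```python
import math

def spread_db(dz):
    """Spreading function SF_dB of Eq. (5.7); dz = maskee-minus-masker distance in Barks."""
    return 15.81 + 7.5 * (dz + 0.474) - 17.5 * math.sqrt(1.0 + (dz + 0.474) ** 2)

def tmn_threshold(tone_energy_db, band):
    """TH_N = E_T - 14.5 - B, Eq. (5.8): noise threshold under a tonal masker in band B."""
    return tone_energy_db - 14.5 - band

def nmt_threshold(noise_energy_db, k=5.0):
    """TH_T = E_N - K, Eq. (5.9): tone threshold under a noise masker (K typically 3-5 dB)."""
    return noise_energy_db - k
```

The masking asymmetry of Section 5.4.4 is visible immediately: an 80-dB tonal masker in band 9 protects noise only below 56.5 dB, while an 80-dB noise masker masks tones up to 75 dB. Likewise, `spread_db` falls off at roughly 25 dB/Bark toward frequencies below the masker and roughly 10 dB/Bark toward frequencies above it, matching the triangular model described in the text.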
The global masking threshold comprises an estimate of the level at which quantization noise becomes just noticeable. Consequently, the global masking threshold is sometimes referred to as the level of "just-noticeable distortion," or JND. The standard practice in perceptual coding involves first classifying masking signals as either noise or tone, next computing appropriate thresholds, then using this information to shape the noise spectrum beneath the JND level. Two illustrated examples are given later in Sections 5.6 and 5.7, which address the perceptual entropy and the ISO/IEC MPEG Model 1, respectively. Note that the absolute threshold (T_q) of hearing is also considered when shaping the noise spectra, and that max(JND, T_q) is most often used as the permissible distortion threshold. Notions of critical bandwidth and simultaneous masking in the audio coding context give rise to some convenient terminology illustrated in Figure 5.8, where we consider the case of a single masking tone occurring at the center of a critical band. All levels in the figure are given in terms of dB SPL.
A hypothetical masking tone occurs at some masking level. This generates an excitation along the basilar membrane that is modeled by a spreading function and a corresponding masking threshold. For the band under consideration, the minimum masking threshold denotes the spreading function in-band minimum. Assuming the masker is quantized using an m-bit uniform scalar quantizer, noise might be introduced at the level m. Signal-to-mask ratio (SMR) and noise-to-mask ratio (NMR) denote the log distances from the minimum masking threshold to the masker and noise levels, respectively.
Figure 5.8. Schematic representation of simultaneous masking (after [Noll93]).
5.5 NONSIMULTANEOUS MASKING
As shown in Figure 5.9, masking phenomena extend in time beyond the window of simultaneous stimulus presentation. In other words, for a masker of finite duration, nonsimultaneous (also sometimes denoted "temporal") masking occurs both prior to masker onset as well as after masker removal. The skirts on both regions are schematically represented in Figure 5.9. Essentially, absolute audibility thresholds for masked sounds are artificially increased prior to, during, and following the occurrence of a masking signal. Whereas significant premasking tends to last only about 1–2 ms, postmasking will extend anywhere from 50 to 300 ms, depending upon the strength and duration of the masker [Zwic90]. Below, we consider key nonsimultaneous masking properties that should be embedded in audio codec perceptual models.
Of the two nonsimultaneous masking modes, forward masking is better understood. For masker and probe of the same frequency, experimental studies have shown that the amount of forward (post-) masking depends in a predictable way on stimulus frequency [Jest82], masker intensity [Jest82], probe delay after masker cessation [Jest82], and masker duration [Moor96]. Forward masking also exhibits frequency-dependent behavior similar to simultaneous masking that can be observed when the masker and probe frequency relationship is varied [Moor78]. Although backward (pre-) masking has also been the subject of many studies, it is not well understood [Moor96]. As shown in Figure 5.9, backward masking decays much more rapidly than forward masking. For example, one study at Thomson Consumer Electronics showed that only 2 ms prior to masker onset, the masked threshold was already 25 dB below the threshold of simultaneous masking [Bran98].
We note, however, that the literature lacks consensus over the maximum time persistence of significant backward masking. Despite the inconsistent results across studies, it is generally accepted that the amount of measured backward
Figure 5.9. Nonsimultaneous masking properties of the human ear. Backward (pre-) masking occurs prior to masker onset and lasts only a few milliseconds. Forward (post-) masking may persist for more than 100 ms after masker removal (after [Zwic90]).
masking depends significantly on the training of the experimental subjects. For the purposes of perceptual coding, abrupt audio signal transients (e.g., the onset of a percussive musical instrument) create pre- and postmasking regions during which a listener will not perceive signals beneath the elevated audibility thresholds produced by a masker. In fact, temporal masking has been used in several audio coding algorithms (e.g., [Bran94a], [Papa95], [ISOI96a], [Fiel96], [Sinh98a]). Premasking, in particular, has been exploited in conjunction with adaptive block size transform coding to compensate for pre-echo distortions (Chapter 6, Sections 6.9 and 6.10).
5.6 PERCEPTUAL ENTROPY
Johnston [John88a] combined notions of psychoacoustic masking with signal quantization principles to define perceptual entropy (PE), a measure of perceptually relevant information contained in any audio record. Expressed in bits per sample, PE represents a theoretical limit on the compressibility of a particular signal. PE measurements reported in [John88a] and [John88b] suggest that a wide variety of CD-quality audio source material can be transparently compressed at approximately 2.1 bits per sample. The PE estimation process is accomplished as follows. The signal is first windowed and transformed to the frequency domain. A masking threshold is then obtained using perceptual rules. Finally, a determination is made of the number of bits required to quantize the spectrum without injecting perceptible noise. The PE measurement is obtained by constructing a PE histogram over many frames and then choosing a worst-case value as the actual measurement.
The frequency-domain transformation is done with a Hann window followed by a 2048-point fast Fourier transform (FFT). Masking thresholds are obtained by performing critical band analysis (with spreading), making a determination of the noise-like or tone-like nature of the signal, applying thresholding rules for the signal quality, then accounting for the absolute hearing threshold. First, real and imaginary transform components are converted to power spectral components
P(ω) = Re^2(ω) + Im^2(ω)    (5.10)
then a discrete Bark spectrum is formed by summing the energy in each critical band (Table 5.1):
B_i = Σ_{ω=bl_i}^{bh_i} P(ω)    (5.11)
where the summation limits are the critical band boundaries. The range of the index i is sample-rate dependent, and in particular i ∈ {1, ..., 25} for CD-quality signals. A spreading function, Eq. (5.7), is then convolved with the discrete Bark spectrum
C_i = B_i ∗ SF_i    (5.12)
to account for the spread of masking. An estimation of the tone-like or noise-like quality for C_i is then obtained using the spectral flatness measure (SFM)
SFM = μ_g/μ_a    (5.13)
where μ_g and μ_a, respectively, correspond to the geometric and arithmetic means of the PSD components for each band. The SFM has the property that it is bounded by 0 and 1. Values close to 1 will occur if the spectrum is flat in a particular band, indicating a decorrelated (noisy) band. Values close to zero will occur if the spectrum in a particular band is narrowband. A coefficient of tonality, α, is next derived from the SFM on a dB scale:
α = min(SFM_dB/−60, 1)    (5.14)
and this is used to weight the thresholding rules given by Eqs. (5.8) and (5.9) (with K = 5.5), as follows, for each band to form an offset
O_i = α(14.5 + i) + (1 − α)5.5 (in dB)    (5.15)
A set of JND estimates in the frequency power domain are then formed by subtracting the offsets from the Bark spectral components:
T_i = 10^(log10(C_i) − O_i/10)    (5.16)
These estimates are scaled by a correction factor to simulate deconvolution of the spreading function, and then each T_i is checked against the absolute threshold of hearing and replaced by max(T_i, T_q(i)). In a manner essentially identical to the SPL calibration procedure that was described in Section 5.2, the PE estimation is calibrated by equating the minimum absolute threshold to the energy in a 4-kHz signal of ±1 bit amplitude. In other words, the system assumes that the playback level (volume control) is configured such that the smallest possible signal amplitude will be associated with an SPL equal to the minimum absolute threshold. By applying uniform quantization principles to the signal and associated set of JND estimates, it is possible to estimate a lower bound on the number of bits required to achieve transparent coding. In fact, it can be shown that the perceptual entropy in bits per sample is given by
PE = Σ_{i=1}^{25} Σ_{ω=bl_i}^{bh_i} [ log2(2 |nint(Re(ω)/√(6T_i/k_i))| + 1) + log2(2 |nint(Im(ω)/√(6T_i/k_i))| + 1) ]  (bits/sample)    (5.17)
where i is the index of critical band, bl_i and bh_i are the lower and upper bounds of band i, k_i is the number of transform components in band i, T_i is the masking threshold in band i (Eq. (5.16)), and nint denotes rounding to the nearest integer. Note that if 0 occurs in the log, we assign 0 for the result. The masking thresholds used in the above PE computation also form the basis for a transform coding algorithm described in Chapter 7. In addition, the ISO/IEC MPEG-1 psychoacoustic model 2, which is often used in MP3 encoders, is closely related to the PE procedure.
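The chain of Eqs. (5.10)–(5.17) can be sketched compactly. The following simplified Python version is our own condensation: it omits the spreading-function convolution, its deconvolution correction, and the absolute-threshold floor described above, but shows the data flow from a frame of samples to a bits-per-sample figure:

```python
import numpy as np

# Upper band edges (Hz) of the idealized critical band filter bank (Table 5.1)
BAND_EDGES = [100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720,
              2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500,
              12000, 15500, 22050]

def perceptual_entropy(frame, fs=44100):
    """Simplified PE estimate for one frame, following Eqs. (5.10)-(5.17)."""
    n = len(frame)
    spec = np.fft.rfft(frame * np.hanning(n))
    power = spec.real ** 2 + spec.imag ** 2                  # Eq. (5.10)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    bits, lo = 0.0, 0.0
    for i, hi in enumerate(BAND_EDGES, start=1):
        idx = np.flatnonzero((freqs >= lo) & (freqs < hi))
        lo = hi
        if idx.size == 0:
            continue
        p = power[idx] + 1e-12                               # floor avoids log(0)
        c = p.sum()                                          # Bark energy, Eq. (5.11)
        # spectral flatness (geometric/arithmetic mean) -> tonality, Eqs. (5.13)-(5.14)
        sfm_db = 10.0 * np.log10(np.exp(np.log(p).mean()) / p.mean())
        alpha = min(sfm_db / -60.0, 1.0)
        offset = alpha * (14.5 + i) + (1.0 - alpha) * 5.5    # Eq. (5.15)
        t = 10.0 ** (np.log10(c) - offset / 10.0)            # JND estimate, Eq. (5.16)
        step = np.sqrt(6.0 * t / idx.size)                   # uniform quantizer step
        re = np.rint(spec.real[idx] / step)
        im = np.rint(spec.imag[idx] / step)
        bits += np.sum(np.log2(2.0 * np.abs(re) + 1.0)
                       + np.log2(2.0 * np.abs(im) + 1.0))
    return bits / n                                          # bits/sample, Eq. (5.17)
```

For a 2048-sample frame of a pure tone, most bands contribute nearly nothing because their JND estimates track their (tiny) energies; the bits concentrate in the band holding the tone.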
We note, however, that there have been evolutionary improvements since the PE estimation scheme first appeared in 1988. For example, the PE calculation in many systems (e.g., [ISOI94]) relies on improved tonality estimates relative to the SFM-based measure of Eq. (5.14). The SFM-based measure is both time- and frequency-constrained. Only one spectral estimate (analysis frame) is examined in time, and in frequency the measure by definition lumps together multiple spectral lines. In contrast, other tonality estimation schemes, e.g., the "chaos measure" [ISOI94] [Bran98], consider the predictability of individual frequency components across time, in terms of magnitude and phase-tracking properties. A predicted value for each component is compared against its actual value, and the Euclidean distance is mapped to a measure of predictability. Highly predictable spectral components are considered to be tonal, while unpredictable components are treated as noise-like. A tonality coefficient that allows weighting towards one extreme or the other is computed from the chaos measure, just as in Eq. (5.14). Improved performance has been demonstrated in several instances (e.g., [Bran90], [ISOI94], [Bran98]). Nevertheless, the PE measurement as proposed in its original form conveys valuable insight on the application of simultaneous masking asymmetry to a perceptual model in a practical system.
5.7 EXAMPLE CODEC PERCEPTUAL MODEL: ISO/IEC 11172-3 (MPEG-1) PSYCHOACOUSTIC MODEL 1
It is useful to consider an example of how the psychoacoustic principles described thus far are applied in actual coding algorithms. The ISO/IEC 11172-3 (MPEG-1, layer I) psychoacoustic model 1 [ISOI92] determines the maximum allowable quantization noise energy in each critical band such that quantization noise remains inaudible. In one of its modes, the model uses a 512-point FFT for high-resolution spectral analysis (86.13 Hz), then estimates, for each input frame, individual simultaneous masking thresholds due to the presence of tone-like and noise-like maskers in the signal spectrum. A global masking threshold is then estimated for a subset of the original 256 frequency bins by (power) additive combination of the tonal and nontonal individual masking thresholds. The remainder of this section describes the step-by-step model operations. Sample results are given for one frame of CD-quality pop music sampled at 44.1 kHz, 16 bits per sample. We note that although this model is suitable for any of the MPEG-1 coding layers I–III, the standard [ISOI92] recommends that model 1 be used with layers I and II, while model 2 is recommended for layer III (MP3). The five
steps leading to computation of global masking thresholds are described in the following sections.
5.7.1 Step 1: Spectral Analysis and SPL Normalization
Spectral analysis and normalization are performed first. The goal of this step is to obtain a high-resolution spectral estimate of the input, with spectral components expressed in terms of sound pressure level (SPL). Much like the PE calculation described previously, this SPL normalization guarantees that a 4-kHz signal of ±1 bit amplitude will be associated with an SPL near 0 dB (close to an acceptable T_q value for normal listeners at 4 kHz), whereas a full-scale sinusoid will be associated with an SPL near 90 dB. The spectral analysis procedure works as follows. First, incoming audio samples s(n) are normalized according to the FFT length N and the number of bits per sample b using the relation
x(n) = s(n)/(N(2^(b−1)))    (5.18)
Normalization references the power spectrum to a 0-dB maximum. The normalized input x(n) is then segmented into 12-ms frames (512 samples) using a 1/16th-overlapped Hann window, such that each frame contains 10.9 ms of new data. A power spectral density (PSD) estimate, P(k), is then obtained using a 512-point FFT, i.e.,
P(k) = PN + 10 log10 | Σ_{n=0}^{N−1} w(n) x(n) e^(−j2πkn/N) |^2,  0 ≤ k ≤ N/2    (5.19)
where the power normalization term, PN, is fixed at 90.302 dB, and the Hann window, w(n), is defined as
w(n) = (1/2)[1 − cos(2πn/N)]    (5.20)
Because playback levels are unknown during psychoacoustic signal analysis, the normalization procedure (Eq. (5.18)) and the parameter PN in Eq. (5.19) are used to estimate SPL conservatively from the input signal. For example, a full-scale sinusoid that is precisely resolved by the 512-point FFT in bin k_0 will yield a spectral line, P(k_0), having 84-dB SPL. With 16-bit sample resolution, SPL estimates for very-low-amplitude input signals will be at or below the absolute threshold. An example PSD estimate obtained in this manner for a CD-quality pop music selection is given in Figure 5.10(a). The spectrum is shown both on a linear frequency scale (upper plot) and on the Bark scale (lower plot). The dashed line in both plots corresponds to the absolute threshold of hearing approximation used by the model.
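Step 1 amounts to only a few lines of code. A sketch (the function name is ours) that maps one 512-sample block of 16-bit audio to a dB-SPL PSD estimate per Eqs. (5.18)–(5.20):

```python
import numpy as np

def mpeg1_step1_psd(s, n_fft=512, bits=16, pn=90.302):
    """SPL-normalized PSD estimate, Eqs. (5.18)-(5.20), for one frame of integer samples."""
    x = np.asarray(s, dtype=float) / (n_fft * 2.0 ** (bits - 1))       # Eq. (5.18)
    n = np.arange(n_fft)
    w = 0.5 * (1.0 - np.cos(2.0 * np.pi * n / n_fft))                  # Hann, Eq. (5.20)
    spectrum = np.fft.rfft(x * w)
    # small floor keeps log10 finite for silent frames
    return pn + 10.0 * np.log10(np.abs(spectrum) ** 2 + 1e-30)         # Eq. (5.19)
```

A full-scale, bin-centered sinusoid yields a peak line in the high-70s to mid-80s dB SPL depending on how the window's coherent gain is accounted for, while silence maps far below the absolute threshold, consistent with the conservative calibration described above.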
5.7.2 Step 2: Identification of Tonal and Noise Maskers
After PSD estimation and SPL normalization, tonal and nontonal masking components are identified. Local maxima in the sample PSD that exceed neighboring
Figure 5.10a. ISO/IEC MPEG-1 psychoacoustic analysis model 1 for an example pop music selection, steps 1–5 as described in the text. (a) Step 1: Obtain PSD, express in dB SPL. Top panel gives linear frequency scale; bottom panel gives Bark frequency scale. Absolute threshold superimposed. Step 2: Tonal maskers identified and denoted by "x" symbol; noise maskers identified and denoted by "o" symbol. (b) Collection of prototype spreading functions (Eq. (5.31)) shown with level as the parameter. These illustrate the incorporation of excitation pattern level-dependence into the model. Note that the prototype functions are defined to be piecewise linear on the Bark scale. These will be associated with maskers in steps 3 and 4. (c) Steps 3 and 4: Spreading functions are associated with each of the individual tonal maskers satisfying the rules outlined in the text. Note that the signal-to-mask ratio (SMR) at the peak is close to the widely accepted tonal value of 14.5 dB. (d) Spreading functions are associated with each of the individual noise maskers that were extracted after the tonal maskers had been eliminated from consideration, as described in the text. Note that the peak SMR is close to the widely accepted noise-masker value of 5 dB. (e) Step 5: A global masking threshold is obtained by combining the individual thresholds as described in the text. The maximum of the global threshold and the absolute threshold is taken at each point in frequency to be the final global threshold. The figure clearly shows that some portions of the input spectrum require SNRs better than 20 dB to prevent audible distortion, while other spectral regions require less than 3 dB SNR.
EXAMPLE CODEC PERCEPTUAL MODEL 133
Figure 5.10b, c (continued).
Figure 5.10d, e (continued).
components within a certain Bark distance by at least 7 dB are classified as tonal. Specifically, the "tonal" set, ST, is defined as
ST = { P(k) | P(k) > P(k ± 1), P(k) > P(k ± Δk) + 7 dB }    (5.21)
where

Δk ∈ {  2,       2 < k < 63      (0.17-5.5 kHz)
       [2, 3],   63 ≤ k < 127    (5.5-11 kHz)
       [2, 6],   127 ≤ k ≤ 256   (11-20 kHz)    (5.22)
Tonal maskers, PTM(k), are computed from the spectral peaks listed in ST as follows:
PTM(k) = 10 log10 Σ_{j=-1}^{1} 10^{0.1 P(k+j)}  (dB)    (5.23)
In other words, for each neighborhood maximum, the energy from the three adjacent spectral components centered at the peak is combined to form a single tonal masker. Tonal maskers extracted from the example pop music selection are identified using 'x' symbols in Figure 5.10(a). A single noise masker for each critical band, PNM(k̄), is then computed from (remaining) spectral lines not within the ±Δk neighborhood of a tonal masker using the sum
PNM(k̄) = 10 log10 Σ_j 10^{0.1 P(j)}  (dB),   ∀ P(j) ∉ {PTM(k, k ± 1, k ± Δk)}    (5.24)
where k̄ is defined to be the geometric mean spectral line of the critical band, i.e.,
k̄ = ( Π_{j=l}^{u} j )^{1/(u-l+1)}    (5.25)
where l and u are the lower and upper spectral line boundaries of the critical band, respectively. The idea behind Eq. (5.24) is that residual spectral energy within a critical bandwidth not associated with a tonal masker must, by default, be associated with a noise masker. Therefore, in each critical band, Eq. (5.24) combines into a single noise masker all of the energy from spectral components that have not contributed to a tonal masker within the same band. Noise maskers are denoted in Figure 5.10 by 'o' symbols. Dashed vertical lines are included in the Bark scale plot to show the associated critical band for each masker.
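The masker identification of Eqs. (5.21)-(5.25) can be sketched in a few lines of Python. (The book's companion software is in MATLAB; the function names, the dictionary bookkeeping, and the flat test spectra used below are illustrative only, not part of the standard.)

```python
import numpy as np

def _delta_k(k):
    # Neighborhood offsets delta_k from Eq. (5.22); k is an FFT bin index (0..255)
    if k < 63:
        return [2]
    if k < 127:
        return [2, 3]
    return [2, 3, 4, 5, 6]

def find_tonal_maskers(P):
    # Eqs. (5.21) and (5.23): local maxima that stand at least 7 dB above
    # their delta_k neighborhood; the peak energy is combined with that of
    # its two immediate neighbors to form a single tonal masker.
    maskers = {}
    for k in range(3, 250):
        if not (P[k] > P[k - 1] and P[k] > P[k + 1]):
            continue
        if all(P[k] > P[k + d] + 7 and P[k] > P[k - d] + 7 for d in _delta_k(k)):
            maskers[k] = 10 * np.log10(sum(10 ** (0.1 * P[k + j]) for j in (-1, 0, 1)))
    return maskers

def find_noise_maskers(P, band_edges, tonal_bins):
    # Eqs. (5.24) and (5.25): per critical band, sum the energy of all lines
    # not claimed by a tonal masker and place it at the geometric-mean line.
    excluded = set()
    for k in tonal_bins:
        excluded.update({k - 1, k, k + 1})
        for d in _delta_k(k):
            excluded.update({k - d, k + d})
    out = {}
    for l, u in band_edges:
        energy = sum(10 ** (0.1 * P[j]) for j in range(l, u + 1) if j not in excluded)
        kbar = int(round(np.exp(np.mean(np.log(np.arange(l, u + 1))))))
        out[kbar] = 10 * np.log10(energy) if energy > 0 else float('-inf')
    return out
```

On a flat 20-dB SPL spectrum with a 70-dB peak flanked by 60-dB neighbors, only the peak bin qualifies as tonal, and the combined masker level lands slightly above the peak itself, as Eq. (5.23) predicts.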
5.7.3 Step 3: Decimation and Reorganization of Maskers
In this step, the number of maskers is reduced using two criteria. First, any tonal or noise maskers below the absolute threshold are discarded, i.e., only maskers that satisfy
PTM,NM(k) ≥ Tq(k)    (5.26)
are retained, where Tq(k) is the SPL of the threshold in quiet at spectral line k. In the pop music example, two high-frequency noise maskers identified during step 2 (Figure 5.10(a)) are dropped after application of Eq. (5.26) (Figure 5.10(c-e)). Next, a sliding 0.5-Bark-wide window is used to replace any pair of maskers occurring within a distance of 0.5 Bark by the stronger of the two. In the pop music example, two tonal maskers appear between 19.5 and 20.5 Barks (Figure 5.10(a)). It can be seen that the pair is replaced by the stronger of the two during threshold calculations (Figure 5.10(c-e)). After the sliding window procedure, masker frequency bins are reorganized according to the subsampling scheme
PTM,NM(i) = PTM,NM(k)    (5.27)
PTM,NM(k) = 0            (5.28)

where

i = { k,                        1 ≤ k ≤ 48
      k + (k mod 2),            49 ≤ k ≤ 96
      k + 3 - ((k - 1) mod 4),  97 ≤ k ≤ 232    (5.29)
The net effect of Eq. (5.29) is 2:1 decimation of masker bins in critical bands 18-22 and 4:1 decimation of masker bins in critical bands 22-25, with no loss of masking components. This procedure reduces the total number of tone and noise masker frequency bins under consideration from 256 to 106. Tonal and noise maskers shown in Figure 5.10(c-e) have been relocated according to this decimation scheme.
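Both reduction criteria of step 3 are easy to prototype. In the Python sketch below (the data structures and function names are our own, not the standard's), prune_maskers applies Eq. (5.26) and the 0.5-Bark sliding window, while decimate_bin applies the reorganization of Eq. (5.29):

```python
def prune_maskers(maskers, bark, Tq):
    # maskers: {bin: SPL}; bark[k]: Bark frequency of bin k; Tq[k]: threshold
    # in quiet at bin k. Applies Eq. (5.26), then replaces any pair of
    # maskers closer than 0.5 Bark by the stronger of the two.
    kept = sorted(((k, p) for k, p in maskers.items() if p >= Tq[k]),
                  key=lambda kp: bark[kp[0]])
    out = []
    for k, p in kept:
        if out and bark[k] - bark[out[-1][0]] < 0.5:
            if p > out[-1][1]:          # keep only the stronger of the pair
                out[-1] = (k, p)
        else:
            out.append((k, p))
    return dict(out)

def decimate_bin(k):
    # Eq. (5.29): subsampled masker bin index i for original bin k
    if k <= 48:
        return k
    if k <= 96:
        return k + (k % 2)
    return k + 3 - ((k - 1) % 4)
```

Evaluating decimate_bin over bins 1 through 232 yields exactly 106 distinct indices, consistent with the reduction from 256 to 106 bins quoted above.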
5.7.4 Step 4: Calculation of Individual Masking Thresholds
Using the decimated set of tonal and noise maskers, individual tone and noise masking thresholds are computed next. Each individual threshold represents a masking contribution at frequency bin i due to the tone or noise masker located at bin j (reorganized during step 3). Tonal masker thresholds, TTM(i, j), are given by
TTM(i, j) = PTM(j) - 0.275 zb(j) + SF(i, j) - 6.025  (dB SPL)    (5.30)
where PTM(j) denotes the SPL of the tonal masker in frequency bin j, zb(j) denotes the Bark frequency of bin j (Eq. (5.3)), and the spread of masking from masker bin j to maskee bin i, SF(i, j), is modeled by the expression
SF(i, j) = { 17Δz - 0.4 PTM(j) + 11,                   -3 ≤ Δz < -1
             (0.4 PTM(j) + 6) Δz,                      -1 ≤ Δz < 0
             -17Δz,                                     0 ≤ Δz < 1
             (0.15 PTM(j) - 17) Δz - 0.15 PTM(j),       1 ≤ Δz < 8    (dB SPL)    (5.31)
i.e., as a piecewise linear function of masker level, P(j), and Bark maskee-masker separation, Δz = zb(i) - zb(j). SF(i, j) approximates the basilar spreading (excitation pattern) described in Section 5.4. Prototype individual masking thresholds, TTM(i, j), are shown as a function of masker level in Figure 5.10(b) for an example tonal masker occurring at zb = 10 Barks. As shown in the figure, the slope of TTM(i, j) decreases with increasing masker level. This is a reflection of psychophysical test results, which have demonstrated [Zwic90] that the ear's frequency selectivity decreases as stimulus levels increase. It is also noted here that the spread of masking in this particular model is constrained to a 10-Bark neighborhood for computational efficiency. This simplifying assumption is reasonable given the very low masking levels that occur in the tails of the excitation patterns modeled by SF(i, j). Figure 5.10(c) shows the individual masking thresholds (Eq. (5.30)) associated with the tonal maskers in Figure 5.10(a) ('x'). It can be seen here that the pair of maskers identified near 19 Barks has been replaced by the stronger of the two during the decimation phase. The plot includes the absolute hearing threshold for reference. Individual noise masker thresholds, TNM(i, j), are given by
TNM(i, j) = PNM(j) - 0.175 zb(j) + SF(i, j) - 2.025  (dB SPL)    (5.32)
where PNM(j) denotes the SPL of the noise masker in frequency bin j, zb(j) denotes the Bark frequency of bin j (Eq. (5.3)), and SF(i, j) is obtained by replacing PTM(j) with PNM(j) in Eq. (5.31). Figure 5.10(d) shows individual masking thresholds associated with the noise maskers identified in step 2 (Figure 5.10(a), 'o'). It can be seen in Figure 5.10(d) that the two high-frequency noise maskers that occur below the absolute threshold have been eliminated.
Before we proceed to step 5 and compute a global masking threshold, it is worthwhile to consider the relationship between Eq. (5.8) and Eq. (5.30), as well as the connection between Eq. (5.9) and Eq. (5.32). Equations (5.8) and (5.30) are related in that both model the TMN paradigm (Section 5.4) in order to generate a masking threshold for quantization noise masked by a tonal signal component. In the case of Eq. (5.8), a Bark-dependent offset that is consistent with experimental TMN data for the threshold minimum SMR is subtracted from the masker intensity, namely, the quantity 14.5 + B. In a similar manner, Eq. (5.30) estimates, for a quantization noise maskee located in bin i, the intensity of the masking contribution due to the tonal masker located in bin j. Like Eq. (5.8), the psychophysical motivation for Eq. (5.30) is the desire to model the relatively weak masking contributions of a TMN. Unlike Eq. (5.8), however, Eq. (5.30) uses an offset of only 6.025 + 0.275B, i.e., Eq. (5.30) assumes a smaller minimum SMR at threshold than does Eq. (5.8). The connection between Eqs. (5.9) and (5.32) is analogous. In the case of this equation pair, however, the psychophysical motivation is to model the masking contributions of NMT. Equation (5.9) assumes a Bark-independent minimum SMR of 3-5 dB, depending on the value of the parameter K. Equation (5.32), on the other hand, assumes a Bark-dependent threshold minimum SMR of 2.025 + 0.175B dB. Also, whereas
the spreading function (SF) terms embedded in Eqs. (5.30) and (5.32) explicitly account for the spread of masking, Eqs. (5.8) and (5.9) assume that the spread of masking was captured during the computation of the terms ET and EN, respectively.
5.7.5 Step 5: Calculation of Global Masking Thresholds
In this step, individual masking thresholds are combined to estimate a global masking threshold for each frequency bin in the subset given by Eq. (5.29). The model assumes that masking effects are additive. The global masking threshold, Tg(i), is therefore obtained by computing the sum
Tg(i) = 10 log10 ( 10^{0.1 Tq(i)} + Σ_{l=1}^{L} 10^{0.1 TTM(i,l)} + Σ_{m=1}^{M} 10^{0.1 TNM(i,m)} )  (dB SPL)    (5.33)
where Tq(i) is the absolute hearing threshold for frequency bin i, TTM(i, l) and TNM(i, m) are the individual masking thresholds from step 4, and L and M are the numbers of tonal and noise maskers, respectively, identified during step 3. In other words, the global threshold for each frequency bin represents a signal-dependent, power-additive modification of the absolute threshold due to the basilar spread of all tonal and noise maskers in the signal power spectrum. Figure 5.10(e) shows the global masking threshold obtained by adding the power of the individual tonal (Figure 5.10(c)) and noise (Figure 5.10(d)) maskers to the absolute threshold in quiet.
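Because the model is power-additive, Eq. (5.33) amounts to converting every contribution to intensity, summing, and converting back to dB. A minimal Python sketch (function name is illustrative):

```python
import numpy as np

def global_threshold(Tq_i, T_tm, T_nm):
    # Eq. (5.33): power-additive combination of the threshold in quiet with
    # all tonal and noise masking thresholds at bin i (all values in dB SPL).
    total = 10 ** (0.1 * Tq_i)
    total += sum(10 ** (0.1 * t) for t in T_tm)
    total += sum(10 ** (0.1 * t) for t in T_nm)
    return 10 * np.log10(total)
```

Two equal contributions raise the threshold by about 3 dB, while a masker far below the threshold in quiet leaves the global threshold essentially unchanged, which is exactly the behavior intended by the power-additive model.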
5.8 PERCEPTUAL BIT ALLOCATION
In this section, we will extend the uniform- and optimal-bit allocation algorithms presented in Chapter 3, Section 3.5, with perceptual bit-assignment strategies. In the perceptual bit allocation method, the number of bits allocated to different bands is determined based on the global masking thresholds obtained from the psychoacoustic model. The steps involved in the computation of the global masking thresholds have been presented in detail in the previous section. The signal-to-mask ratio (SMR) determines the number of bits to be assigned in each band for perceptually transparent coding of the input audio. The noise-to-mask ratios (NMRs) are computed by subtracting the SMR from the SNR in each subband, i.e.,
NMR = SNR - SMR  (dB)    (5.34)
The main objective in a perceptual bit allocation scheme is to keep the quantization noise below a masking threshold. For example, note that the NMR in Figure 5.11(a) is relatively large compared to the NMR in Figure 5.11(b). Hence, the (quantization) noise in the case of Figure 5.11(a) can be masked more easily than in the case of Figure 5.11(b). Therefore, it is logical to assign a sufficient number
of bits to the subband with the lowest NMR. This criterion is applied to all the subbands until all the bits are exhausted. Typically, in audio coding standards, an iterative procedure is employed that satisfies both the bit rate and global masking threshold requirements.
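A toy version of such an iterative allocator can be sketched as follows. The greedy loop and the 6.02-dB-per-bit rule are illustrative simplifications, not the exact iteration used in the standards:

```python
def allocate_bits(smr_db, total_bits):
    # Greedy perceptual allocation sketch: repeatedly give one bit to the
    # subband with the lowest NMR = SNR - SMR (Eq. (5.34)), assuming each
    # added bit improves the subband SNR by roughly 6.02 dB.
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        nmr = [6.02 * b - s for b, s in zip(bits, smr_db)]
        bits[nmr.index(min(nmr))] += 1
    return bits
```

With two subbands of SMR 20 dB and 5 dB and a budget of 4 bits, the loop assigns three bits to the high-SMR band before the low-SMR band receives its first bit, i.e., bits flow to whichever band's noise is furthest above its mask.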
Figure 5.11. Simultaneous masking, depicting relatively large NMR in (a) compared to (b).
5.9 SUMMARY
This chapter dealt with some of the basics of psychoacoustics. We covered the absolute threshold of hearing, the Bark scale, the simultaneous and temporal masking effects, and the perceptual entropy. A step-by-step procedure that describes the ISO/IEC psychoacoustic model 1 was provided.
PROBLEMS
5.1. Describe the difference between a Mel scale and a Bark scale. Give tables and itemize, side-by-side, the center frequencies and bandwidths for 0-5 kHz. Describe how the two different scales are constructed.
5.2. In Figure 5.12, the solid line indicates the just-noticeable-distortion (JND) curve and the dotted line indicates the absolute threshold in quiet. State which of the tones A, B, C, or D would be audible and which ones are likely to be masked. Explain.
5.3. In Figure 5.13, state whether tone B would be masked by tone A. Explain. Also, indicate whether tone C would mask the narrow-band noise. Give reasons.
5.4. In Figure 5.14, the solid line indicates the JND curve obtained from the psychoacoustic model 1. A broadband noise component is shown that spans
Figure 5.12. JND curve for Problem 5.2.
Figure 5.13. Masking experiment, Problem 5.3.
from 3 to 11 Barks, and a tone is present at 10 Barks. Sketch the portions of the noise and the tone that could be considered perceptually relevant.
COMPUTER EXERCISES
5.5. Design a 3-band equalizer using the peaking filter equations of Chapter 2. The center frequencies should correspond to the auditory filters (see Table 5.1) at center frequencies 450 Hz, 1000 Hz, and 2500 Hz. Compute the Q-factors associated with each of these filters using Q = f0/BW, where f0 is the center frequency and BW is the filter bandwidth (obtain from Table 5.1). Choose g = 5 dB for all the filters. Give the frequency response of the 3-band equalizer in terms of the Bark scale.
5.6. Write a program to plot the absolute threshold of hearing in quiet, Eq. (5.1). Give a plot in terms of a linear Hz scale.
5.7. Use the program of Problem 5.6 and plot the absolute threshold of hearing on a Bark scale.
5.8. Generate four sinusoids with frequencies 400 Hz, 1000 Hz, 2500 Hz, and 6000 Hz; fs = 44.1 kHz. Obtain s(n) by adding these individual sinusoids as follows:
s(n) = Σ_{i=1}^{4} sin(2π fi n / fs),   n = 1, 2, ..., 1024
Figure 5.14. Perceptual bit allocation, Problem 5.4.
Give power spectrum plots of s(n) (in dB) in terms of a Bark scale and in terms of a linear Hz scale. List the Bark-band numbers where the four peaks are located. (Hint: see Table 5.1 for the Bark band numbers.)
5.9. Extend the above problem and give the power spectrum plot in dB SPL. See Section 5.7.1 for details. Also include the absolute threshold of hearing in quiet in your plot.
5.10. Write a program to compute the perceptual entropy (in bits/sample) of the following signals:
a. ch5_malespeech.wav (8 kHz, 16 bit)
b. ch5_music.wav (44.1 kHz, 16 bit)
(Hint: Use Equations (5.10)-(5.17) in Chapter 5, Section 5.6.) Choose the frame size as 512 samples. Also, the perceptual entropy (PE) measurement is obtained by constructing a PE histogram over many frames and then choosing a worst-case value as the actual measurement.
5.11. FFT-based perceptual audio synthesis using the MPEG-1 psychoacoustic model 1.
In this computer exercise, we will consider an example to show how the psychoacoustic principles are applied in actual audio coding algorithms. Recall that in Chapter 2, Computer Exercise 2.25, we employed the peak-picking method to select a subset of FFT components for audio synthesis. In this exercise, we will use the just-noticeable-distortion (JND) curve as the "reference" to select the perceptually important FFT components. All the FFT components below the
JND curve are assigned a minimal value (for example, -50 dB SPL), such that these perceptually irrelevant FFT components receive a minimum number of bits for encoding. The ISO/IEC MPEG-1 psychoacoustic model 1 (see Section 5.7, Steps 1 through 5) is used to compute the JND curve.
Use the MATLAB software from the book website to simulate the ISO/IEC MPEG-1 psychoacoustic model 1. The software package consists of three MATLAB files, psymain.m, psychoacoustics.m, and audio_synthesis.m, a wave file ch5_music.wav, and a hints.doc file. The psymain.m file is the main file that contains the complete flow of your computer exercise. The psychoacoustics.m file contains the steps performed in the psychoacoustic analysis.
Deliverable 1:
a. Using the comments included in the hints.doc file, fill in the MATLAB commands in the audio_synthesis.m file to complete the program.
b. Give plots of the input and the synthesized audio. Is the psychoacoustic criterion for picking FFT components equivalent to using a Parseval's-related criterion? In the audio synthesis, what happens if the FFT conjugate symmetry was not maintained?
c. How is this FFT-based perceptual audio synthesis method different from a typical peak-picking method for audio synthesis?
Deliverable 2:
d. Provide a subjective evaluation of the synthesized audio in terms of a MOS scale. Did you hear any clicks between the frames in the synthesized audio? If yes, what would you do to eliminate such artifacts? Compute the overall and segmental SNR values between the input and the synthesized audio.
e. On the average, how many FFT components per frame were selected? If only 30 FFT components out of 512 were to be picked (because of bit rate considerations), would the application of the psychoacoustic model to select the FFT components yield the best possible SNR?
5.12. In this computer exercise, we will analyze the asymmetry of simultaneous masking. Use the MATLAB software from the book website. The software package consists of two MATLAB files, asymm_mask.m and psychoacoustics.m. The asymm_mask.m file includes steps to generate a pure tone s1(n) with f = 4 kHz and fs = 44.1 kHz:
s1(n) = sin(2π f n / fs),   n = 0, 1, 2, ..., 44099
a. Simulate a broadband noise s2(n) by band-pass filtering uniform white noise (μ = 0 and σ² = 1) using a Butterworth filter (of appropriate order, e.g., 8) with center frequency 4 kHz. Assume the 3-dB cut-off frequencies of the bandpass filter are 3500 Hz and 4500 Hz. Generate a test signal s(n) = αs1(n) + βs2(n). Choose α = 0.025 and β = 1. Observe if the broadband noise completely masks the tone. Experiment
for different values of α and β, and find out when 1) the broadband noise masks the tone, and 2) the tone masks the broadband noise.
b. Simulate the two cases of masking (i.e., the NMT and the TMN) given in Figure 5.7, Section 5.4.
CHAPTER 6
TIME-FREQUENCY ANALYSIS: FILTER BANKS AND TRANSFORMS
6.1 INTRODUCTION
Audio codecs typically use a time-frequency analysis block to extract a set of parameters that is amenable to quantization. The tool most commonly employed for this mapping is a filter bank of bandpass filters. The filter bank divides the signal spectrum into frequency subbands and generates a time-indexed series of coefficients representing the frequency-localized signal power within each band. By providing explicit information about the distribution of signal, and hence masking power, over the time-frequency plane, the filter bank plays an essential role in the identification of perceptual irrelevancies. Additionally, the time-frequency parameters generated by the filter bank provide a signal mapping that is conveniently manipulated to shape the coding distortion. On the other hand, by decomposing the signal into its constituent frequency components, the filter bank also assists in the reduction of statistical redundancies.
This chapter provides a perspective on filter-bank design and other techniques of particular importance in audio coding. The chapter is organized as follows. Sections 6.2 and 6.3 introduce filter-bank design issues for audio coding. Sections 6.4 through 6.7 review specific filter-bank methodologies found in audio codecs, namely, the two-band quadrature mirror filter (QMF), the M-band tree-structured QMF, the M-band pseudo-QMF bank, and the M-band modified discrete cosine transform (MDCT). The 'MP3', or MPEG-1, Layer III, audio
Audio Signal Processing and Coding, by Andreas Spanias, Ted Painter, and Venkatraman Atti. Copyright 2007 by John Wiley & Sons, Inc.
codec, pseudo-QMF, and MDCT are discussed in Sections 6.6 and 6.7, respectively. Section 6.8 provides filter-bank interpretations of the discrete Fourier and discrete cosine transforms. Finally, Sections 6.9 and 6.10 examine the time-domain "pre-echo" artifact in conjunction with pre-echo control techniques.
Beyond the references cited in this chapter, the reader is referred to in-depth tutorials on filter banks that have appeared in the literature [Croc83] [Vaid87] [Vaid90] [Malv91] [Vaid93] [Akan96]. The reader may also wish to explore the connection between filter banks and wavelets that has been well documented in [Riou91] [Vett92] and in several texts [Akan92] [Wick94] [Akan96] [Stra96].
6.2 ANALYSIS-SYNTHESIS FRAMEWORK FOR M-BAND FILTER BANKS
Filter banks are perhaps most conveniently described in terms of an analysis-synthesis framework (Figure 6.1), in which the input signal s(n) is processed at the encoder by a parallel bank of (L - 1)-th order FIR bandpass filters, Hk(z). The bandpass analysis outputs
vk(n) = hk(n) * s(n) = Σ_{m=0}^{L-1} s(n - m) hk(m),   k = 0, 1, ..., M - 1    (6.1)
are decimated by a factor of M, yielding the subband sequences
yk(n) = vk(Mn) = Σ_{m=0}^{L-1} s(nM - m) hk(m),   k = 0, 1, ..., M - 1    (6.2)
which comprise a critically sampled, or maximally decimated, signal representation, i.e., the number of subband samples is equal to the number of input samples. Because it is impossible to achieve perfect "brickwall" magnitude responses with finite-order bandpass filters, there is unavoidable aliasing between the decimated subband sequences. Quantization and coding are performed on the subband sequences yk(n). In the perceptual audio codec, the quantization noise is usually shaped according to a perceptual model. The quantized subband samples, ŷk(n), are eventually received by the decoder, where they are upsampled by M to form the intermediate sequences
wk(n) = { ŷk(n/M),   n = 0, M, 2M, 3M, ...
          0,          otherwise    (6.3)
In order to eliminate the imaging distortions introduced by the upsampling operations, the sequences wk(n) are processed by a parallel bank of synthesis filters, Gk(z), and then the filter outputs are combined to form the overall output, ŝ(n).
The analysis and synthesis filters are carefully designed to cancel aliasing and imaging distortions. It can be shown [Akan96] that
ŝ(n) = (1/M) Σ_{m=-∞}^{∞} Σ_{l=-∞}^{∞} Σ_{k=0}^{M-1} s(m) hk(lM - m) gk(n - lM)    (6.4)
or, in the frequency domain,
Ŝ(Ω) = (1/M) Σ_{k=0}^{M-1} Σ_{l=0}^{M-1} S(Ω + 2πl/M) Hk(Ω + 2πl/M) Gk(Ω)    (6.5)
For perfect reconstruction (PR) filter banks, the output ŝ(n) will be identical to the input s(n) within a delay, i.e., ŝ(n) = s(n - n0), as long as there is no quantization noise introduced, i.e., ŷk(n) = yk(n). This is naturally not the case for a codec, and therefore quantization sensitivity is an important filter bank property, since PR guarantees are lost in the presence of quantization.
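Perfect reconstruction is easy to verify numerically for a toy two-band case. The Python sketch below implements Eqs. (6.1)-(6.3) with M = 2 using the Haar filter pair (chosen here only for brevity; practical codecs use much longer filters) and recovers the input within a one-sample delay:

```python
import numpy as np

M = 2
r2 = np.sqrt(2.0)
h = [np.array([1.0, 1.0]) / r2, np.array([1.0, -1.0]) / r2]   # analysis (Haar)
g = [np.array([1.0, 1.0]) / r2, np.array([-1.0, 1.0]) / r2]   # synthesis

def analysis(s):
    # Eqs. (6.1)-(6.2): bandpass filter, then decimate by M
    N = len(s)
    return [np.convolve(s, hk)[:N][::M] for hk in h]

def synthesis(y, N):
    # Eq. (6.3) plus synthesis filtering: upsample by M, filter, and sum
    shat = np.zeros(N)
    for yk, gk in zip(y, g):
        w = np.zeros(N)
        w[::M] = yk
        shat += np.convolve(w, gk)[:N]
    return shat

s = np.random.default_rng(0).standard_normal(64)
shat = synthesis(analysis(s), len(s))
# perfect reconstruction within a one-sample delay: shat(n) = s(n - 1)
```

Working through the even and odd output samples by hand shows the aliasing terms introduced by the decimators cancel exactly between the two branches, leaving ŝ(n) = s(n - 1), i.e., PR with n0 = 1.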
Figures 6.2 and 6.3, respectively, give example magnitude responses for banks of uniform and nonuniform bandwidth filters that can be realized within the framework of Figure 6.1. A uniform bandwidth M-channel filter bank is shown in Figure 6.2. The M analysis filters have normalized center frequencies (2k + 1)π/2M, and are characterized by individual impulse responses hk(n), as well as frequency responses Hk(Ω), 0 ≤ k ≤ M - 1.
Some of the popular audio codecs contain parallel bandpass filters of uniform bandwidth, similar to Figure 6.2. Other coders strive for a "critical band" analysis by relying upon filters of nonuniform bandwidth. The octave-band filter bank,
Figure 6.1. Uniform M-band maximally decimated analysis-synthesis filter bank.
Figure 6.2. Magnitude frequency response for a uniform M-band filter bank (oddly stacked).
Figure 6.3. Magnitude frequency response for an octave-band filter bank.
for which the four-band case is illustrated in Figure 6.3, is sometimes used as an approximation to the auditory filter bank, albeit a poor one.
As shown in Figure 6.3, octave-band analysis filters have normalized center frequencies and filter bandwidths that are dyadically related. Naturally, much better approximations are possible.
6.3 FILTER BANKS FOR AUDIO CODING: DESIGN CONSIDERATIONS
This section addresses the issues that govern the selection of a filter bank for audio coding. Efficient coding performance depends heavily on adequately matching the properties of the analysis filter bank to the characteristics of the input signal. Algorithm designers face an important and difficult tradeoff between time and frequency resolution when selecting a filter-bank structure [Bran92a]. Failure to choose a suitable filter bank can result in perceptible artifacts in the output (e.g., pre-echoes) or low coding gain and, therefore, high bit rates. No single tradeoff between time and frequency resolution is optimal for all signals. We will present three examples to illustrate the challenge facing codec designers. In the first example, we consider the importance of matching time-frequency analysis resolution to the signal-dependent distribution of masking power in the time-frequency plane. The second example illustrates the effect of inadequate frequency resolution on perceptual bit allocation. Finally, the third example illustrates the effect of inadequate time resolution on perceptual bit allocation. These examples clarify the fundamental tradeoff required during filter-bank selection for perceptual coding.
6.3.1 The Role of Time-Frequency Resolution in Masking Power Estimation
Through schematic representations of masking thresholds for castanets and piccolo, Figure 6.4(a,b) illustrates the difficulty of selecting a single filter bank to satisfy the diverse time and frequency resolution requirements associated with different classes of audio. In the figures, darker regions correspond to higher masking thresholds. To realize maximum coding gain, the strongly harmonic piccolo signal clearly calls for fine frequency resolution and coarse time resolution, because the masking thresholds are quite localized in frequency. Quite the opposite is true of the castanets. The fast attacks associated with this percussive sound create highly time-localized masking thresholds that are also widely dispersed in frequency. Therefore, adequate time resolution is essential for accurate estimation of the highly time-varying masked threshold. Naturally, similar resolution properties should also be associated with the filter bank used to decompose the signal into a parametric set for quantization and encoding. Using real signals and filter banks, the next two examples illustrate the bit rate impact of adequate and inadequate resolutions in each domain.
6.3.2 The Role of Frequency Resolution in Perceptual Bit Allocation
To demonstrate the importance of matching a filter bank's resolution properties with the noise-shaping requirements imposed by a perceptual model, the next two examples combine high- and low-resolution filter banks with two input extremes, namely, those of a sinusoid and an impulse. First, we consider the importance of adequate frequency resolution. The need for high-resolution frequency analysis is most pronounced when the input contains strong sinusoidal components. Given a tonal input, inadequate frequency resolution can produce unreasonably high signal-to-noise ratio (SNR) requirements within individual subbands, resulting in high bit rates. To see this, we compare the results of processing a 2.7-kHz pure tone, first with a 32-channel and then with a 1024-channel MDCT filter bank, as shown in Figure 6.5(a) and Figure 6.5(b), respectively. The vertical line in each figure represents the frequency and level (80 dB SPL) of the input tone. In the figures, the masked threshold associated with the sinusoid is represented by a solid, nearly triangular line. For each filter bank in Figure 6.5(a) and Figure 6.5(b), the band containing most of the signal energy is quantized with sufficient resolution to create an in-band SNR of 15.4 dB. Then, the quantization noise is superimposed on the masked threshold. In the 32-band case (Figure 6.5a), it can be seen that the quantization noise at an SNR of 15.4 dB spreads considerably beyond the masked threshold, implying that significant artifacts will be audible in the reconstructed signal. On the other hand, the improved selectivity of the 1024-channel filter bank restricts the spread of quantization noise at 15.4 dB SNR to well within the limits of the masked threshold (Figure 6.5b). The figure clearly shows that, for a tonal signal, good frequency selectivity is essential for low bit rates. In fact, the 32-channel filter bank for this signal requires greater than a 60-dB SNR to satisfy the masked threshold (Figure 6.5c). This high cost (in terms of bits required) results
Figure 6.4. Masking thresholds in the time-frequency plane: (a) castanets, (b) piccolo (after [Prin95]).
from the mismatch between the filter bank's poor selectivity and the very limited downward spread of masking in the human ear. As this experiment would imply, high-resolution frequency analysis is usually appropriate for tonal audio signals.
6.3.3 The Role of Time Resolution in Perceptual Bit Allocation
A time-domain dual of the effect observed in the previous example can be used to illustrate the importance of adequate time resolution. Whereas the previous experiment showed that simultaneous masking criteria determine how much frequency
Figure 6.5. The effect of frequency resolution on perceptual SNR requirements for a 2.7-kHz pure tone input. The input tone is represented by the spectral line at 2.7 kHz with 80 dB SPL. The conservative masked threshold due to the presence of the tone is shown. Quantization noise for a given number of bits per sample is superimposed for several precisions: (a) 32-channel MDCT with 15.4-dB in-band SNR; quantization noise spreads beyond the masked threshold; (b) 1024-channel MDCT with 15.4-dB in-band SNR; quantization noise remains below the masked threshold; (c) 32-channel MDCT requires a 63.7-dB in-band SNR to satisfy the masked threshold, i.e., it requires a larger bit allocation to mask the quantization noise.
resolution is necessary for good performance, one can also imagine that temporal masking effects dictate time resolution requirements. In fact, the need for good time resolution is pronounced when the input contains sharp transients. This is best illustrated with an impulse. Unlike the sinusoid, with its highly frequency-localized masking power, the broadband impulse contains broadband masking power that is highly time-localized. Given a transient input, therefore, inadequate time resolution can result in a temporal smearing of quantization noise beyond the time window of effective masking. To illustrate this point, consider the results obtained by processing an impulse with the same filter banks as in the previous example. The results are shown in Figure 6.6(a) and Figure 6.6(b), respectively. The vertical line in each figure corresponds to the input impulse. The figures also show the temporal envelope of masking power associated with the impulse as a solid, nearly triangular window. For each filter bank, all subbands were quantized with identical bit allocations. Then, the error signal at the output (quantization noise) was superimposed on the masking envelope. In the 32-band case (Figure 6.6a), one can observe that the time resolution (impulse response length) restricts the spread of quantization noise to well within the limits of the masked threshold. For the 1024-band filter bank (Figure 6.6b), on the other hand, quantization noise spreads considerably beyond the time-envelope masked threshold, implying that significant artifacts will be audible in the reconstructed signal. The only remedy in this case would be to "overcode" the transient so that the signal surrounding the transient receives precision adequate to satisfy the limited masking power in the regions before and after the impulse. The figures clearly show that, for a transient signal, good time resolution is essential for low bit rates. The combination of long impulse responses and limited masking windows in the presence of transient signals can lead to an artifact known as "pre-echo" distortion. Pre-echo distortion and pre-echo compensation are covered at the end of this chapter, in Sections 6.9 and 6.10.
Unfortunately, most audio source material is highly nonstationary and contains significant tonal and atonal energy, as well as both steady-state and transient intervals. As a rule, signal models [John96a] tend to remain constant for long periods and then change abruptly. Therefore, the ideal coder should make adaptive decisions regarding optimal time-frequency signal decomposition, and the ideal analysis filter bank would have time-varying resolutions in both the time and frequency domains. This fact has motivated many algorithm designers to experiment with switched and hybrid filter-bank structures, with switching decisions occurring on the basis of the changing signal properties. Filter banks emulating the analysis properties of the human auditory system, i.e., those containing nonuniform "critical bandwidth" subbands, have proven highly effective in the coding of highly transient signals such as the castanets, glockenspiel, or triangle. For dense, harmonically structured signals such as the harpsichord or pitch pipe, on the other hand, the "critical band" filter banks have been less successful because of their reduced coding gain relative to filter banks with a large number of subbands. In short, several filter-bank characteristics are highly desirable for audio coding:
• Signal-adaptive time-frequency tiling
• Low-resolution, "critical-band" mode (e.g., 32 subbands)
FILTER BANKS FOR AUDIO CODING DESIGN CONSIDERATIONS 153
Figure 6.6 The effect of time resolution on perceptual bit allocation for an impulse. Impulse input occurs at time 0. The conservative temporal masked threshold due to the presence of the impulse is shown. Quantization noise for a fixed number of bits per sample is superimposed for low- and high-resolution filter banks: (a) 32-channel MDCT, (b) 1024-channel MDCT. Both panels plot magnitude (dB) against time (ms).
• High-resolution mode (e.g., 4096 subbands)
• Efficient resolution switching
• Minimum blocking artifacts
• Good channel separation
• Strong stop-band attenuation
154 TIME-FREQUENCY ANALYSIS FILTER BANKS AND TRANSFORMS
• Perfect reconstruction
• Critical sampling
• Fast implementation
Good channel separation and stop-band attenuation are particularly desirable for signals containing very little irrelevancy, such as the harpsichord. Maximum redundancy removal is essential for maintaining high quality at low bit rates for these signals. Blocking artifacts in time-varying filter banks can lead to audible distortion in the reconstruction. Beyond filter-bank-specific architectural and performance criteria, system-level considerations may also influence the best choice of filter bank for a codec design. For example, the codec architecture could contain two separate, parallel time-frequency analysis blocks, one for the perceptual model and one for generating the parametric set that is ultimately quantized and encoded. The parallel scenario offers the advantage that each filter bank can be optimized independently. This is possible since the perceptual analysis section does not typically require signal reconstruction, whereas the coefficients for coding must eventually be mapped back to the time domain. In the interest of computational efficiency, however, many audio codecs have only one time-frequency analysis block that does "double duty," in the sense that the perceptual model obtains information from the same set of coefficients that are ultimately quantized and encoded.
Algorithms for filter-bank design, as well as fast algorithms for efficient filter-bank realizations, offer many choices to designers of perceptual audio codecs. Among the many types available are those characterized by the following:
• Uniform or nonuniform frequency partitioning
• An arbitrary number of subbands
• Perfect or almost perfect reconstruction
• Critically sampled or oversampled representations
• FIR or IIR constituent filters
In the next few sections, we will focus on the design and performance of well-known filter banks that are popular in audio coding. Rather than dealing with the efficient implementation structures that are available, we have elected, for each filter-bank architecture, to describe the individual bandpass filters in terms of impulse and frequency response functions that are easily related to the analysis-synthesis framework of Figure 6.1. These descriptions are intended to provide insight regarding the filter-bank response characteristics and to allow for comparisons across different methods. The reader should be aware, however, that structures for efficient realizations are almost always used in practice, and because computational efficiency is of paramount importance, most audio coding filter-bank realizations, although functionally equivalent, may or may not resemble the maximally decimated analysis-synthesis structure given in Figure 6.1. In other words, most of the filter banks used in audio coders have equivalent parallel forms and can be conveniently analyzed in terms of this analysis-synthesis framework. The framework provides a useful interpretation for the sets of coefficients generated by the unitary transforms often embedded in audio coders, such as the discrete cosine transform (DCT), the discrete Fourier transform (DFT), the discrete wavelet transform (DWT), and the discrete wavelet packet transform (DWPT).
6.4 QUADRATURE MIRROR AND CONJUGATE QUADRATURE FILTERS
The two-band quadrature mirror and conjugate quadrature filter (QMF and CQF) banks are logical starting points for the discussion on filter banks for audio coding. Two-band QMF banks were used in early subband algorithms for speech coding [Croc76], and later for the first standardized 7-kHz wideband audio algorithm, the ITU G.722 [G722]. Also, the strong connection between two-band perfect reconstruction (PR) CQF filter banks and the discrete wavelet transform [Akan96] has played a significant role in the development of high-performance audio coding filter banks. Ultimately, tree-structured cascades of the CQF filters have been used to construct several "critical-band" filter banks in a number of high-quality algorithms. The two-channel bank, which can provide a building block for structured M-channel banks, is developed as follows. If the analysis-synthesis filter bank (Figure 6.1) is constrained to two channels, i.e., if M = 2, then Eq. (6.5) becomes

$$\hat{S}(\Omega) = \frac{1}{2}\, S(\Omega)\left[ H_0^2(\Omega) - H_1^2(\Omega) \right] \qquad (6.6)$$
Esteban and Galand showed [Este77] that aliasing is cancelled between the upper and lower bands if the QMF conditions are satisfied, namely,
$$H_1(\Omega) = H_0(\Omega + \pi) \;\Rightarrow\; h_1(n) = (-1)^n h_0(n)$$
$$G_0(\Omega) = H_0(\Omega) \;\Rightarrow\; g_0(n) = h_0(n)$$
$$G_1(\Omega) = -H_0(\Omega + \pi) \;\Rightarrow\; g_1(n) = -(-1)^n h_0(n) \qquad (6.7)$$
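The alias cancellation implied by Eq. (6.7) is easy to verify numerically. In the sketch below, h0 is an arbitrary short lowpass stand-in (not a designed QMF prototype such as those in [John80]); the alias term H0(−z)G0(z) + H1(−z)G1(z) of the two-band system is formed as a polynomial product of the coefficient sequences and vanishes identically:

```python
import numpy as np

h0 = np.array([0.1, 0.45, 0.45, 0.1])        # arbitrary lowpass stand-in
n = np.arange(len(h0))
h1 = (-1.0) ** n * h0                        # h1(n) = (-1)^n h0(n)
g0 = h0.copy()                               # g0(n) = h0(n)
g1 = -h1                                     # g1(n) = -(-1)^n h0(n)

def flip_z(h):
    """Coefficients of H(-z): negate the odd-indexed taps."""
    return h * (-1.0) ** np.arange(len(h))

# Alias component H0(-z)G0(z) + H1(-z)G1(z); polynomial multiplication
# is convolution of the coefficient sequences
alias = np.convolve(flip_z(h0), g0) + np.convolve(flip_z(h1), g1)
alias_peak = np.max(np.abs(alias))
print(alias_peak)
```

The cancellation is structural: it holds for any choice of h0(n), which is exactly why the design task reduces to a single lowpass prototype.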
Thus, the two-band filter-bank design task is reduced to the design of a single lowpass filter, h0(n), under the constraint that the overall transfer function, Eq. (6.6), be an allpass function with constant group delay (linear phase). Although filter families satisfying the QMF criteria with good stop-band and transition-band characteristics have been designed (e.g., [John80]) to minimize overall distortion, the QMF conditions actually make perfect reconstruction impossible. Smith and Barnwell showed in [Smit86], however, that PR two-band filter banks based on a lowpass prototype are possible if the CQF conditions are satisfied, namely,
$$h_1(n) = (-1)^n h_0(L - 1 - n)$$
$$g_0(n) = h_0(L - 1 - n) \qquad (6.8)$$
$$g_1(n) = -(-1)^n h_0(n)$$
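The CQF conditions of Eq. (6.8) can be exercised end-to-end. The sketch below assumes a length-8 Daubechies orthogonal lowpass filter as h0(n), a stand-in that satisfies the PR requirement but is not the filter tabulated in [Smit86]; it builds the two-band analysis-synthesis chain and verifies reconstruction up to the overall delay of L − 1 samples:

```python
import numpy as np

# Length-8 Daubechies orthogonal lowpass filter as h0(n): an assumed
# stand-in satisfying the CQF/PR conditions, not the filter of [Smit86]
h0 = np.array([0.23037781330885523, 0.71484657055254153,
               0.63088076792959036, -0.02798376941698385,
               -0.18703481171888114, 0.03084138183598697,
               0.03288301166698295, -0.01059740178499728])
L = len(h0)
sign = (-1.0) ** np.arange(L)

h1 = sign * h0[::-1]        # h1(n) = (-1)^n h0(L-1-n), Eq. (6.8)
g0 = h0[::-1]               # g0(n) = h0(L-1-n)
g1 = -sign * h0             # g1(n) = -(-1)^n h0(n)

def analysis(s):
    # filter, then keep every other sample (decimation by 2)
    return np.convolve(s, h0)[::2], np.convolve(s, h1)[::2]

def synthesis(y0, y1):
    # zero-stuff (interpolation by 2), filter, and sum the two branches
    u0 = np.zeros(2 * len(y0)); u0[::2] = y0
    u1 = np.zeros(2 * len(y1)); u1[::2] = y1
    return np.convolve(u0, g0) + np.convolve(u1, g1)

rng = np.random.default_rng(0)
s = rng.standard_normal(256)
s_hat = synthesis(*analysis(s))

# Perfect reconstruction up to the overall filter-bank delay of L-1 samples
err = np.max(np.abs(s_hat[L - 1:L - 1 + len(s)] - s))
print(err)
```

Any lowpass prototype whose even translates are orthonormal would serve equally well here; the point is that, unlike the QMF case, the error is at machine precision rather than a small but irreducible distortion.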
Figure 6.7 Two-band Smith-Barnwell CQF filter bank magnitude frequency response, with L = 8 [Smit86].
The magnitude response of an example Smith-Barnwell CQF filter bank from [Smit86] with L = 8 is shown in Figure 6.7. The lowpass response is shown as a solid line, while the highpass is dashed. One can observe the significant overlap between channels, as well as monotonic passband and equiripple stop-band characteristics, with minimum stop-band rejection of 40 dB. As mentioned previously, efficiency concerns dictate that filter banks are rarely implemented in the direct form of Eq. (6.1). The QMF banks are most often realized using a polyphase factorization [Bell76], i.e.,
$$H(z) = \sum_{l=0}^{M-1} z^{-l} E_l(z^M) \qquad (6.9)$$

where

$$E_l(z) = \sum_{n=-\infty}^{\infty} h(Mn + l)\, z^{-n} \qquad (6.10)$$
which yields better than a 2:1 computational load reduction. On the other hand, the CQF filters are incompatible with the polyphase factorization but can be efficiently realized using alternative structures, such as the lattice [Vaid88].
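The polyphase decimator implied by Eqs. (6.9) and (6.10) can be illustrated for M = 2; the FIR filter below is an arbitrary example, not a QMF prototype. Each polyphase branch filters at 1/M of the input rate, which is the source of the workload reduction noted above:

```python
import numpy as np

M = 2
h = np.array([1.0, 3.0, -2.0, 4.0, 0.5, -1.0, 2.5, 0.25])   # arbitrary FIR
x = np.random.default_rng(1).standard_normal(64)

# Direct form: filter at the full input rate, then keep every M-th sample
direct = np.convolve(x, h)[::M]

# Polyphase form: split h(n) into its M phases (Eq. (6.10)) and filter
# each decimated, delayed input phase at the low rate
poly = np.zeros(len(direct))
for l in range(M):
    e_l = h[l::M]                                  # E_l(z) coefficients
    x_l = np.concatenate([np.zeros(l), x])[::M]    # x delayed by l, decimated
    branch = np.convolve(x_l, e_l)
    k = min(len(branch), len(poly))
    poly[:k] += branch[:k]

mismatch = np.max(np.abs(direct - poly))
print(mismatch)   # the two forms agree to machine precision
```

The direct form performs len(h) multiplies per *input* sample and then discards half of its outputs; the polyphase form performs them per *output* sample only.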
6.5 TREE-STRUCTURED QMF AND CQF M-BAND BANKS
Clearly, audio coders require better frequency resolution than either the QMF or CQF two-band decompositions can provide in order to realize sufficient coding
Figure 6.8 Tree-structured realization of a uniform eight-channel analysis filter bank.
gain for spectrally complex signals. Tree-structured cascades are one straightforward method for creating M-band filter banks from the two-band QMF and CQF prototypes. These are constructed as follows. The two-band filters are connected in a cascade that can be represented well using either a binary tree or a pruned binary tree. The root node of the tree is formed by a single two-band QMF or CQF section. Then, each of the root node outputs is connected to a cascaded QMF or CQF bank. The cascade structure may be continued to the depth necessary to achieve the desired magnitude response characteristics. At each node in the tree, a two-channel QMF or CQF bank operates on the output from a higher-level two-channel bank. Thus, frequency subdivision occurs through a series of two-band splits. Tree-structured filter banks have several advantages. First of all, the designer can approximate an arbitrary partitioning of the frequency axis by creating an appropriate cascade. Consider, for example, the uniform subband tree (Figure 6.8) or the octave-band tree (Figure 6.9). The ability to partition the frequency axis in a nonuniform manner also has implications for multi-resolution temporal analysis, or nonuniform tiling of the time-frequency plane. This property can be advantageous if the ultimate objective is to approximate the analysis properties of the human ear, and, in fact, many algorithms make use of tree-structured filter banks for this very reason [Bran90] [Sinh93b]
[Tsut98]. In addition, the designer has the flexibility to optimize the length and other properties of the constituent filters at each node in the tree. This flexibility has been exploited to enhance the performance of several experimental audio codecs [Sinh93b] [Phil95a] [Phil95b]. Tree-structured filter banks are also attractive for their computational efficiency relative to other M-band techniques. One disadvantage of the tree-structured filter bank is that delays accumulate through the cascaded nodes, and hence the overall delay can become quite large.
The example tree in Figure 6.8 shows an eight-band uniform analysis filter bank in which the analysis filters are indexed first by level and then by type. Lowpass filters are indexed with a 0, and highpass with a 1. For instance, highpass filters at level 2 in the tree are denoted by H21(z), and lowpass by H20(z). It is often convenient to analyze the M-band tree-structured CQF or QMF bank using an equivalent parallel form. To see the connection between the cascaded
Figure 6.9 Tree-structured realization of an octave-band four-channel analysis filter bank.
Figure 6.10 The noble identities. In each picture, the structure on the left is equivalent to the structure on the right. (a) Interchange of a filter and a downsampler: the positions are swapped after the complex variable z is replaced by z^M in the system function H(z). (b) Interchange of a filter and an upsampler: the positions are swapped after the complex variable z is replaced by z^M in the system function H(z).
Figure 6.11 Eight-channel cascaded CQF (CCQF) filter bank, N = 53. (a) Magnitude frequency responses for all eight channels. Odd-numbered channel responses are drawn with dashed lines; even-numbered channel responses are drawn with solid lines. (b) Isolated view of the magnitude frequency response and time-domain impulse response for channel 3. This view highlights the presence of a significant sidelobe in the stop-band. In this figure, N is the length of the impulse response.
tree structures (Figures 6.8 and 6.9) and the parallel analysis-synthesis structure (Figure 6.1), one can apply the "noble identities" (Figure 6.10), which allow for the interchange of the down-sampling and filtering operations. In a straightforward manner, this practice collapses the cascaded filter transfer functions into
single parallel-form analysis filters for each channel. In the case of the eight-channel bank (Figure 6.8), for example, we have
$$H_0(z) = H_{00}(z)\, H_{10}(z^2)\, H_{20}(z^4)$$
$$H_1(z) = H_{00}(z)\, H_{10}(z^2)\, H_{21}(z^4) \qquad (6.11)$$
$$\vdots$$
$$H_7(z) = H_{01}(z)\, H_{11}(z^2)\, H_{21}(z^4)$$
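The collapse into Eq. (6.11) follows from the noble identities of Figure 6.10 and can be checked numerically. The filters below are arbitrary stand-ins for the per-level two-band prototypes; only the algebra of the identity is being verified:

```python
import numpy as np

rng = np.random.default_rng(2)
# Arbitrary 8-tap filters standing in for the per-level prototypes
h00, h10, h20 = (rng.standard_normal(8) for _ in range(3))
x = rng.standard_normal(256)

# Cascade form: three filter-then-decimate-by-2 stages (Figure 6.8, top branch)
y = x
for h in (h00, h10, h20):
    y = np.convolve(y, h)[::2]

def upsample_taps(h, m):
    """Coefficients of H(z^m): insert m-1 zeros between the taps of h."""
    out = np.zeros(m * (len(h) - 1) + 1)
    out[::m] = h
    return out

# Parallel form, Eq. (6.11): H0(z) = H00(z) H10(z^2) H20(z^4), followed by
# a single decimation by 8 (noble identity, Figure 6.10a)
H0 = np.convolve(np.convolve(h00, upsample_taps(h10, 2)),
                 upsample_taps(h20, 4))
y_parallel = np.convolve(x, H0)[::8]

k = min(len(y), len(y_parallel))
mismatch = np.max(np.abs(y[:k] - y_parallel[:k]))
print(mismatch)
```

The same manipulation yields the remaining channels H1(z) through H7(z) by substituting the appropriate lowpass/highpass node filters at each level.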
Figure 6.11(a) shows the magnitude spectrum of an eight-channel tree-structured filter bank based on a three-level cascade (Figure 6.8) of the same Smith-Barnwell CQF filter examined previously (Figure 6.7). Even-numbered channel responses are drawn with solid lines, and odd-numbered channel responses are drawn with dashed lines. One can observe the effect that the cascaded structure has on the shape of the channel responses. Bands 0 through 3 are each uniquely shaped and are mirror images of bands 4 through 7. Moreover, the M-band stop-band characteristics are significantly different from those of the prototype filter, i.e., the equiripple property does not extend to the M channels. Figure 6.11(b) shows |H2(Ω)|², making it possible to observe clearly a sidelobe of significant magnitude in the stop-band. The figure also illustrates the impulse response h2(n) associated with the filter H2(z). One can see that the effective length of the parallel-form impulse response represents the cumulative contributions from each of the cascaded filters.
6.6 COSINE MODULATED "PSEUDO QMF" M-BAND BANKS
A tree-structured cascade of two-channel prototypes is only one of several well-known methods available for realization of an M-band filter bank. Although the tree structures offer opportunities for optimization at each node and are conceptually simple, the potential for long delay and irregular channel responses is sometimes unappealing. As an alternative to the tree-structured architecture, cosine modulation of a lowpass prototype filter has been used since the early 1980s [Nuss81] [Roth83] [Chu85] [Mass85] [Cox86] to realize parallel M-channel filter banks with nearly perfect reconstruction. Because they do not achieve perfect reconstruction, these filter banks are known collectively as "pseudo QMF," and they are characterized by several attractive properties:
• Constrained design: single FIR prototype filter
• Uniform, linear phase channel responses
• Overall linear phase, hence constant group delay
• Low complexity, i.e., one filter plus modulation
• Amenable to fast block algorithms
• Critical sampling
In the pseudo QMF (PQMF) bank derivation, phase distortion is completely eliminated from the overall transfer function, Eq. (6.5), by forcing the analysis and synthesis filters to satisfy the mirror image condition

$$g_k(n) = h_k(L - 1 - n) \qquad (6.12)$$
Moreover, adjacent channel aliasing is cancelled by establishing precise relationships between the analysis and synthesis filters, Hk(z) and Gk(z), respectively. In the critically sampled analysis-synthesis notation of Figure 6.1, these conditions ultimately yield analysis filters given by

$$h_k(n) = 2\, w(n) \cos\left[\frac{\pi}{M}(k + 0.5)\left(n - \frac{L-1}{2}\right) + \theta_k\right] \qquad (6.13)$$

and synthesis filters given by

$$g_k(n) = 2\, w(n) \cos\left[\frac{\pi}{M}(k + 0.5)\left(n - \frac{L-1}{2}\right) - \theta_k\right] \qquad (6.14)$$

where

$$\theta_k = (-1)^k \frac{\pi}{4} \qquad (6.15)$$

and the sequence w(n) corresponds to the L-sample "window," a real-coefficient, linear phase FIR prototype lowpass filter with normalized cutoff frequency π/2M. Given that aliasing and phase distortions have been eliminated in this formulation, the filter-bank design procedure is reduced to the design of the window, w(n), such that overall amplitude distortion (Eq. (6.5)) is minimized. One approach [Vaid93] is to minimize a composite objective function, i.e.,
$$C = \alpha c_1 + (1 - \alpha) c_2 \qquad (6.16)$$

where constraint c1, of the form

$$c_1 = \int_0^{\pi/M} \left[\, |W(\Omega)|^2 + \left| W\!\left(\Omega - \frac{\pi}{M}\right)\right|^2 - 1 \right]^2 d\Omega \qquad (6.17)$$

minimizes spectral nonflatness in the reconstruction, and constraint c2, of the form

$$c_2 = \int_{\pi/2M + \varepsilon}^{\pi} |W(\Omega)|^2 \, d\Omega \qquad (6.18)$$
maximizes stop-band attenuation. The parameter ε is related to transition bandwidth, and the parameter α determines which design constraint is more dominant.
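Equations (6.13)–(6.15) can be sketched directly. The windowed-sinc prototype below is an arbitrary linear-phase stand-in, not an optimized PQMF window such as the 512-tap MPEG-1 prototype; the sketch also confirms that, for a symmetric window, the mirror condition of Eq. (6.12) follows automatically:

```python
import numpy as np

M, m = 8, 4
L = 2 * m * M                   # prototype length (an arbitrary multiple of M)
n = np.arange(L)

# Stand-in linear-phase lowpass prototype with cutoff near pi/(2M): a
# Hann-windowed sinc, NOT an optimized PQMF window -- illustration only
w = np.hanning(L) * np.sinc((n - (L - 1) / 2) / (2 * M))

theta = (-1.0) ** np.arange(M) * np.pi / 4                 # Eq. (6.15)
arg = (np.pi / M) * (np.arange(M)[:, None] + 0.5) * (n - (L - 1) / 2)
h = 2 * w * np.cos(arg + theta[:, None])                   # analysis, Eq. (6.13)
g = 2 * w * np.cos(arg - theta[:, None])                   # synthesis, Eq. (6.14)

# Because w(n) is symmetric, g_k(n) = h_k(L-1-n) holds automatically
mirror_err = np.max(np.abs(g - h[:, ::-1]))
print(mirror_err)
```

All M channel responses are modulated copies of the single prototype, which is what makes the constrained design (one filter plus modulation) so economical.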
The magnitude frequency response of an example eight-channel pseudo QMF bank designed using Eqs. (6.6) and (6.7) is shown in Figure 6.12. In contrast to the previous CCQF example, one can observe that all of the channel magnitude responses are identical, modulated versions of the lowpass prototype, and therefore the passband and stop-band characteristics are uniform. The impulse
Figure 6.12 Eight-channel PQMF bank, N = 39. (a) Magnitude frequency responses for all eight channels. Odd-numbered channel responses are drawn with dashed lines; even-numbered channel responses are drawn with solid lines. (b) Isolated view of the magnitude frequency response and time-domain impulse response for channel 3. Here, N is the impulse response length.
response symmetry associated with a linear phase filter is also evident in an examination of Figure 6.12(b).
The PQMF bank plays a significant role in several popular audio coding algorithms. In particular, the IS11172-3 and IS13818-3 algorithms ("MPEG-1" [ISOI92] and "MPEG-2 BC/LSF" [ISOI94a]) employ a 32-channel PQMF bank for spectral decomposition in both layers I and II. The prototype filter, w(n), contains 512 samples, yielding better than 96-dB sidelobe suppression in the stop-band of each analysis channel. Output ripple (non-PR) is less than 0.07 dB. In addition, the same PQMF is used in conjunction with a PR cosine modulated filter bank (discussed in the next section) in layer III to form a hybrid filter-bank architecture with time-varying properties. The MPEG-1 algorithm has reached a position of prominence with the widespread use of "MP3" files (MPEG-1, layer 3) on the World Wide Web (WWW) for the exchange of audio recordings, as well as with the deployment of MPEG-1, layer II in direct broadcast satellite (DBS/DSS) and European digital audio broadcast (DAB) initiatives. Because of the availability of common algorithms for pseudo QMF and PR QMF banks, we defer the discussion on generic complexity and efficient implementation strategies until later. In the particular case of MPEG-1, however, note that the 32-band pseudo QMF analysis bank as defined in the standard requires approximately 80 real multiplies and 80 real additions per output sample [ISOI92], although a more efficient implementation based on a fast algorithm for the DCT was also proposed [Pan93] [Kons94].
6.7 COSINE MODULATED PERFECT RECONSTRUCTION (PR) M-BAND BANKS AND THE MODIFIED DISCRETE COSINE TRANSFORM (MDCT)
Although PQMF banks have been used quite successfully in perceptual audio coders, the overall system design still must compensate for the inherent distortion induced by the lack of perfect reconstruction to avoid audible artifacts in the codec output. The compensation strategy may be a simple one (e.g., increased prototype filter length), but perfect reconstruction is actually preferable because it constrains the sources of output distortion to the quantization stage. Beginning in the early 1990s, independent work by Malvar [Malv90b], Ramstad [Rams91], and Koilpillai and Vaidyanathan [Koil91] [Koil92] showed that generalized perfect reconstruction (PR) cosine modulated filter banks are possible by appropriately constraining the prototype lowpass filter, w(n), and the synthesis filters, gk(n), for 0 ≤ k ≤ M − 1. In particular, perfect reconstruction is guaranteed for a cosine-modulated filter bank with analysis filters hk(n) given by Eqs. (6.13) and (6.15) if four conditions are satisfied. First, the length L of the window, w(n), must be an integer multiple of the number of subbands, i.e.,

1. $$L = 2mM \qquad (6.19)$$

where the parameter m is an integer greater than zero. Next, the synthesis filters, gk(n), must be related to the analysis filters by a time reversal, such that

2. $$g_k(n) = h_k(L - 1 - n) \qquad (6.20)$$

In addition, the FIR lowpass prototype must have linear phase, which means that

3. $$w(n) = w(L - 1 - n) \qquad (6.21)$$

and, finally, the polyphase components of w(n) must satisfy the pairwise power complementary requirement, i.e.,

4. $$\tilde{E}_k(z) E_k(z) + \tilde{E}_{M+k}(z) E_{M+k}(z) = \alpha \qquad (6.22)$$

where the constant α is greater than 0, the functions Ek(z) are the k = 0, 1, 2, ..., M − 1 polyphase components (Eq. (6.9)) of W(z), and the tilde notation denotes the paraconjugate, i.e.,

$$\tilde{E}(z) = E^{*}(z^{-1}) \qquad (6.23)$$

or, in other words, the coefficients of E(z) are conjugated, and then the complex variable z⁻¹ is substituted for the complex variable z.
The generalized PR cosine-modulated filter banks developed in Eqs. (6.19) through (6.23) are of considerable interest in many applications. This section, however, concentrates on the special case that has become of central importance in the advancement of perceptual audio coding algorithms, namely, the filter bank for which L = 2M, i.e., m = 1. The PR properties of this special case were first demonstrated by Princen and Bradley [Prin86] using time-domain arguments for the development of the time domain aliasing cancellation (TDAC) filter bank. Later, Malvar [Malv90a] developed the modulated lapped transform (MLT) by restricting attention to a particular prototype filter and formulating the filter bank as a lapped orthogonal block transform. More recently, the consensus name in the audio coding literature for the lapped block transform interpretation of this special case filter bank has evolved into the modified discrete cosine transform (MDCT). To avoid confusion, we will denote throughout this book by MDCT the PR cosine-modulated filter bank with L = 2M, and we will restrict the window, w(n), in accordance with Eqs. (6.19) and (6.21). In short, the reader should be aware that the different acronyms TDAC, MLT, and MDCT all refer essentially to the same PR cosine modulated filter bank. Only Malvar's MLT label implies a particular choice for w(n), as described below. From the perspective of an analysis-synthesis filter bank (Figure 6.1), the MDCT analysis filter impulse responses are given by
$$h_k(n) = w(n)\sqrt{\frac{2}{M}} \cos\left[\frac{(2n + M + 1)(2k + 1)\pi}{4M}\right] \qquad (6.24)$$

and the synthesis filters, to satisfy the overall linear phase constraint, are obtained by a time reversal, i.e.,

$$g_k(n) = h_k(2M - 1 - n) \qquad (6.25)$$
This perspective is useful for visualizing individual channel characteristics in terms of their impulse and frequency responses. In practice, however, the MDCT is typically realized as a block transform, usually via a fast algorithm. The remainder of this section treats several MDCT facets that are of importance in audio coding applications, including its forward and inverse transform interpretations, prototype filter (window) design criteria, window design examples, time-varying forms, and fast algorithms.
6.7.1 Forward and Inverse MDCT
The analysis filter bank (Figure 6.13(a)) is realized as a block transform of length 2M samples, while using a block advance of only M samples, i.e., with 50% overlap between blocks. Thus, the MDCT basis functions extend across two blocks in time, leading to virtual elimination of the blocking artifacts that plague the reconstruction of nonoverlapped transform coders. Despite the 50% overlap, however, the MDCT is still critically sampled, and only M coefficients are generated by the forward transform for each 2M-sample input block. Given an input block, x(n), the transform coefficients, X(k), for 0 ≤ k ≤ M − 1, are obtained by means of the forward MDCT, defined as
$$X(k) = \sum_{n=0}^{2M-1} x(n)\, h_k(n) \qquad (6.26)$$

Clearly, the forward MDCT performs a series of inner products between the M analysis filter impulse responses, hk(n), and the input, x(n). On the other hand, the inverse MDCT (Figure 6.13(b)) obtains a reconstruction by computing a sum of the basis vectors weighted by the transform coefficients from two blocks. The first M samples of the k-th basis vector, hk(n), for 0 ≤ n ≤ M − 1, are weighted by the k-th coefficient of the current block, X(k). Simultaneously, the second M samples of the k-th basis vector, hk(n), for M ≤ n ≤ 2M − 1, are weighted by the k-th coefficient of the previous block, X^P(k). Then, the weighted basis vectors are overlapped and added at each time index, n. Note that the extended basis functions require that the inverse transform maintain an M-sample memory to retain the previous set of coefficients. Thus, the reconstructed samples x(n), for 0 ≤ n ≤ M − 1, are obtained via the inverse MDCT, defined as

$$x(n) = \sum_{k=0}^{M-1} \left[ X(k)\, h_k(n) + X^{P}(k)\, h_k(n + M) \right] \qquad (6.27)$$

where X^P(k) denotes the previous block of transform coefficients. The overlapped analysis and overlap-add synthesis processes are illustrated in Figure 6.13(a) and Figure 6.13(b), respectively.
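A minimal matrix-based sketch of Eqs. (6.26) and (6.27), using the sine window of Eq. (6.30) as the prototype, confirms perfect reconstruction of all interior samples via overlap-add:

```python
import numpy as np

M = 32                                  # number of subbands; block length 2M
n = np.arange(2 * M)
w = np.sin((n + 0.5) * np.pi / (2 * M))        # sine window (see Eq. (6.30))

# Analysis filters h_k(n) of Eq. (6.24), stacked as an M x 2M matrix
k = np.arange(M)[:, None]
H = w * np.sqrt(2.0 / M) * np.cos(
        (2 * n + M + 1) * (2 * k + 1) * np.pi / (4 * M))

rng = np.random.default_rng(3)
x = rng.standard_normal(10 * M)

# Forward MDCT, Eq. (6.26): 2M samples in, M coefficients out, hop size M
X = [H @ x[t:t + 2 * M] for t in range(0, len(x) - 2 * M + 1, M)]

# Inverse MDCT with overlap-add, Eq. (6.27)
y = np.zeros(len(x))
for i, Xi in enumerate(X):
    y[i * M:i * M + 2 * M] += H.T @ Xi

# Interior samples reconstruct exactly; the first and last M samples lack
# an overlapping partner block and are not expected to match
recon_err = np.max(np.abs(y[M:-M] - x[M:-M]))
print(recon_err)
```

Note that each block individually is not invertible (M coefficients cannot determine 2M samples); it is the time-domain alias cancellation between adjacent overlapped blocks that restores the signal.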
6.7.2 MDCT Window Design
Given the forward (Eq. (6.26)) and inverse (Eq. (6.27)) transform definitions, one still must design a suitable FIR prototype filter (window), w(n). Several general-
Figure 6.13 Modified discrete cosine transform (MDCT). (a) Lapped forward transform (analysis): 2M samples are mapped to M spectral components (Eq. (6.26)). The analysis block length is 2M samples, but the analysis stride (hop size) and time resolution are M samples. (b) Inverse transform (synthesis): M spectral components are mapped to a vector of 2M samples (Eq. (6.27)) that is overlapped by M samples and added to the vector of 2M samples associated with the previous frame.
purpose orthogonal [Prin86] [Prin87] [Malv90a] and biorthogonal [Jawe95] [Smar95] [Matv96] windows have been proposed, while still other orthogonal [USAT95a] [Ferr96a] [ISOI96a] [Fiel96] and biorthogonal [Cheu95] [Malv98] windows are optimized explicitly for audio coding. In the orthogonal case, the generalized PR conditions [Vaid93] given in Eqs. (6.19)–(6.23) can be reduced to linear phase and Nyquist constraints on the window, namely,
$$w(2M - 1 - n) = w(n) \qquad (6.28a)$$
$$w^2(n) + w^2(n + M) = 1 \qquad (6.28b)$$

for the sample indices 0 ≤ n ≤ M − 1. These constraints give rise to two considerations. First, unlike the pseudo-QMF bank, linear phase in the MDCT lowpass prototype does not translate into linear phase for the modulated analysis filters on each subband channel. The overall MDCT analysis-synthesis filter bank, however, is characterized by perfect reconstruction, and hence linear phase with a constant group delay of L − 1 samples. Secondly, although Eqs. (6.28a) and (6.28b) guarantee an orthogonal basis for the MDCT, an orthogonal basis is not required to satisfy the PR constraints in Eqs. (6.19)–(6.23). In fact, it can be shown [Cheu95]
that Eq. (6.28b) can be revised, and that perfect reconstruction for the MDCT is still guaranteed, as long as it is true that

$$w_s(n) = \frac{w_a(n)}{w_a^2(n) + w_a^2(n + M)} \qquad (6.29)$$

for the sample indices 0 ≤ n ≤ M − 1, where ws(n) denotes the synthesis window and wa(n) denotes the analysis window. From the transform perspective, Eqs. (6.28a) and (6.29) guarantee a biorthogonal MDCT basis. Clearly, this relaxation of the prototype FIR lowpass filter design requirements increases the degrees of freedom available to the filter bank designer from M/2 to M. In effect, it is no longer necessary to use the same analysis and synthesis windows. In any case, whether an orthogonal or biorthogonal basis is used, the MDCT window design problem can be formulated in the same manner as it was for the PQMF bank (Eq. (6.16)), except that the PR property of the MDCT eliminates the spectral flatness constraint (Eq. (6.17)), such that the designer can concentrate solely on minimizing either the stop-band energy or the maximum stop-band magnitude of W(Ω). Well-known tools are available (e.g., [Pres89]) for minimizing Eq. (6.16), but in many cases one can safely forego the design process and rely instead upon the general purpose orthogonal [Prin86] [Prin87] [Malv90a] or biorthogonal [Jawe95] [Smar95] [Matv96] MDCT windows that have been proposed in the literature. In fact, several existing orthogonal [Ferr96a] [ISOI96a] [USAT95a] and biorthogonal [Cheu95] [Malv98] transform windows were explicitly designed to be in some sense optimal for audio coding.
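As a sketch of Eq. (6.29), a synthesis window can be derived from a Kaiser analysis window (β = 11, as in Figure 6.15) and checked against the product form of the PR condition. Extending ws(n) to the second half of the block via the window symmetry of Eq. (6.28a) is an assumption of this illustration:

```python
import numpy as np

M = 64
wa = np.kaiser(2 * M, 11.0)     # Kaiser analysis window with beta = 11

# Synthesis window from Eq. (6.29) for 0 <= n < M; the second half is
# obtained with the same denominator, consistent with window symmetry
d = wa[:M] ** 2 + wa[M:] ** 2
ws = np.concatenate([wa[:M] / d, wa[M:] / d])

# Generalized product condition replacing Eq. (6.28b):
# wa(n) ws(n) + wa(n+M) ws(n+M) = 1 for 0 <= n < M
pb = wa[:M] * ws[:M] + wa[M:] * ws[M:]
pb_err = np.max(np.abs(pb - 1.0))
print(pb_err)
```

In the orthogonal special case ws(n) = wa(n), the product condition reduces back to the power-complementarity of Eq. (6.28b).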
6.7.3 Example MDCT Windows (Prototype FIR Filters)
It is instructive to consider some example MDCT windows in order to appreciate more fully the characteristics well suited to audio coding, as well as the tradeoffs that are involved in the window selection process.
6.7.3.1 Sine Window. Malvar [Malv90a] denotes by MLT the MDCT filter bank that makes use of the sine window, defined as
$$w(n) = \sin\left[\left(n + \frac{1}{2}\right)\frac{\pi}{2M}\right] \qquad (6.30)$$
for 0 ≤ n ≤ M − 1. This particular window is perhaps the most popular in audio coding. It appears, for example, in the MPEG-1, layer III (MP3) hybrid filter bank [ISOI92], the MPEG-2 AAC/MPEG-4 time-frequency filter bank [ISOI96a], and numerous experimental MDCT-based coders that have appeared in the literature. In fact, this window has become the de facto standard in MDCT audio applications, and its properties are typically referenced as performance benchmarks when new windows are proposed. The sine window (Figure 6.14) has several unique properties that make it advantageous. First, DC energy is concentrated in a single transform coefficient, because all basis functions except for the first one have infinite attenuation at DC. Secondly, the filter bank
channels achieve 24 dB sidelobe attenuation when the sine window (Figure 6.14, dashed line) is used. Finally, the sine window has been shown [Malv90a] to make the MDCT asymptotically optimal in terms of coding gain for a lapped transform. Coding gain is desirable because it quantifies the factor by which the mean square error (MSE) is reduced when using the filter bank, relative to using direct pulse code modulated (PCM) quantization of the time-domain signal at the same rate.
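Both orthogonal-window constraints, Eqs. (6.28a) and (6.28b), hold for the sine window by construction, which a quick numerical check confirms:

```python
import numpy as np

M = 256
n = np.arange(2 * M)
w = np.sin((n + 0.5) * np.pi / (2 * M))   # sine window, Eq. (6.30)

# Eq. (6.28a): w(2M-1-n) = w(n) (symmetry about the block center)
sym_err = np.max(np.abs(w - w[::-1]))
# Eq. (6.28b): w^2(n) + w^2(n+M) = 1 (power complementarity, since the
# second half-window samples are cosines of the first-half arguments)
pc_err = np.max(np.abs(w[:M] ** 2 + w[M:] ** 2 - 1.0))
print(sym_err, pc_err)
```

The power complementarity follows from sin²(x) + cos²(x) = 1, because w(n + M) = sin((n + 0.5)π/2M + π/2) = cos((n + 0.5)π/2M).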
6.7.3.2 Parametric Phase-Modulated Sine Window. Optimization criteria other than coding gain or DC localization are possible and have also been investigated for the MDCT. Ferreira [Ferr96a] proposed a parametric window for the orthogonal MDCT that offers a controlled tradeoff between reduction of the time-domain ringing artifacts produced by coarse quantization and reduction of stop-band leakage relative to the sine window. The window (Figure 6.14, solid), which is defined in terms of three parameters for any value of M, i.e.,
$$w(n) = \sin\!\left[\left(n + \frac{1}{2}\right)\frac{\pi}{2M} + \phi_{\mathrm{opt}}(n)\right] \qquad (6.31)$$
where
$$\phi_{\mathrm{opt}}(n) = \frac{4\pi\beta}{(1 - \delta^2)}\left[\left(\frac{4n}{2M - 2}\right)^{\!\alpha} - \delta\right]\left[\left(\frac{4n}{2M - 2}\right)^{\!\alpha} - 1\right] \qquad (6.32)$$
was motivated by the observation that explicit simultaneous minimization of time-domain aliasing and stop-band energy resulted in a window well approximated by a nonlinear phase difference with respect to the sine window. Moreover, the parametric solution provided nearly optimal results and was tractable, while the explicit minimization was numerically unstable for long windows. Parameters are given in [Ferr96a] for three windows that offer, respectively, time-domain aliasing/stop-band leakage percentage improvements relative to the sine window of 6.3/10.1, 8.3/0.7, and 13.3/−3.1. Figure 6.14 compares the latter parametric window (β = 0.03125, α = 0.92, δ = 0.0) in both time and frequency against the sine window. It can be seen that the negative gain in stop-band attenuation is caused by a slight increase in the first sidelobe energy. It is also clear, however, that the stop-band attenuation characteristics improve with increasing frequency. In fact, the Ferreira window has a broader range of better than 110 dB attenuation than does the sine window. This characteristic of improved ultimate stop-band rejection can be beneficial for perceptual gain, particularly for strongly harmonic signals.
6.7.3.3 Separate Analysis and Synthesis Windows – Biorthogonal MDCT Basis. Even more dramatic improvements in ultimate stop-band rejection are possible when the orthogonality constraint is removed. Cheung and Lim [Cheu95] derived for the MDCT the biorthogonality window constraint given by Eq. (6.29), and then demonstrated with a Kaiser analysis window, wa(n), the potential for improved stop-band attenuation. In a similar fashion, Figure 6.15(a) shows the analysis (solid) and synthesis (dashed) windows that result for the biorthogonal MDCT when wa(n) is a Kaiser window [Oppe99] with β = 11. The most significant benefit of this arrangement is apparent from the frequency response plot for
COSINE MODULATED (PR) M-BAND BANKS AND THE (MDCT) 169
Figure 6.14. Orthogonal MDCT analysis/synthesis windows of Malvar [Malv90a] (dashed) and Ferreira [Ferr96a] (solid): (a) time domain; (b) frequency-domain magnitude response. The parametric Ferreira window provides better stop-band attenuation over a broader range of frequencies, at the expense of transition bandwidth and slightly reduced attenuation of the first sidelobe.
two 256-channel filter banks depicted in Figure 6.15(b). In this figure, the dashed line represents the frequency response associated with channel four of the sine window MDCT, and the lighter solid line corresponds to the frequency response associated with the same channel in the Kaiser window MDCT filter bank. Also
Figure 6.15. Biorthogonal MDCT basis: (a) time-domain view of the analysis (solid) and synthesis (dashed) windows (Cheung and Lim [Cheu95]) that are associated with a biorthogonal MDCT basis; (b) frequency-domain magnitude responses associated with MDCT channel four of 256 for the sine (orthogonal basis, dashed) and Kaiser (biorthogonal basis, solid) windows. The simultaneous masking threshold is superimposed for a pure tone occurring at the channel center frequency, 388 Hz. The picture demonstrates the potential for super-threshold leakage associated with the sine window, and the improved stop-band attenuation realized with the Kaiser window.
superimposed on the plot is the simultaneous masking threshold generated by a 388-Hz pure tone occurring at the channel center. It can be seen that, although the main lobe for the Kaiser MDCT is somewhat broader than that of the sine MDCT, the stop-band attenuation is significantly below the masking threshold, whereas the sine window MDCT stop-band leakage has substantial super-threshold energy. The sine window, therefore, has the potential to cause artificially high bit rates because of its greater leakage. This type of artifact motivated the designers of the Dolby AC-2/AC-3 [USAT95a] and MPEG-2 AAC/MPEG-4 T-F [ISOI96a] algorithms to use customized windows, rather than the standard sine window, in their respective orthogonal MDCT filter banks.
6.7.3.4 The Dolby AC-2/Dolby AC-3/MPEG-2 AAC KBD Window. The Kaiser-Bessel Derived (KBD) window was obtained in a procedure devised at Dolby Laboratories. The AC-2 and AC-3 designers showed [Fiel96] that the prototype filter for an M-channel orthogonal MDCT filter bank satisfying the PR conditions (Eqs. (6.28a) and (6.28b)) can be derived from any symmetric kernel window of length M + 1 by applying a transformation of the form
$$w_a(n) = w_s(n) = \sqrt{\frac{\sum_{j=0}^{n} v(j)}{\sum_{j=0}^{M} v(j)}}, \qquad 0 \le n < M \qquad (6.33)$$
where the sequence v(n) represents the symmetric kernel. The resulting identical analysis and synthesis windows, wa(n) and ws(n), respectively, are of length 2M and symmetric, i.e., w(2M − n − 1) = w(n). Note that although a more general form of Eq. (6.33) appeared in [Fiel96], we have simplified it here for the particular case of the 50%-overlapped MDCT. During the development of the AC-2 and AC-3 algorithms, novel MDCT prototype filters optimized to satisfy a minimum masking template (e.g., Figure 6.16(a) for AC-3) were designed using Eq. (6.33) with a parametric Kaiser-Bessel kernel, v(n). At the expense of some passband selectivity, the KBD windows achieve considerably better stop-band attenuation (greater than 40 dB improvement) than the sine window (Figure 6.16b). Thus, for a pure tone occurring at the center of a particular MDCT channel, the KBD filter bank concentrates more energy into a single transform coefficient. The remaining dispersed energy tends to generate coefficient magnitudes that lie below the worst-case pure tone excitation pattern (the "masking template" of Figure 6.16b). Particularly for signals with adequately spaced tonal components, the presence of fewer supra-threshold MDCT components reduces the perceptual bit allocation and therefore tends to improve coding gain. In spite of the reduced bit allocation, the filter bank still renders the quantization noise inaudible, since the uncoded coefficients have smaller magnitudes than the masked threshold. A KBD filter bank simulation exemplifying this behavior for the MPEG-2 AAC algorithm is given later.
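Eq. (6.33) is simple to prototype. The sketch below builds a KBD-style window from NumPy's Kaiser kernel and confirms that any window derived this way from a symmetric kernel satisfies the orthogonal-MDCT power-complementarity condition; the value β = 4 is an arbitrary illustration, not the standardized AC-3 or AAC parameter:

```python
import numpy as np

def kbd_window(M, beta=4.0):
    # Eq. (6.33): symmetric Kaiser-Bessel kernel of length M+1, running sums,
    # square root; then symmetric extension to a full 2M-point window.
    v = np.kaiser(M + 1, beta)
    half = np.sqrt(np.cumsum(v[:M]) / np.sum(v))
    return np.concatenate([half, half[::-1]])

M = 128
w = kbd_window(M)   # beta = 4 is a hypothetical illustration value

# Because the kernel is symmetric, the partial sums from both ends of the
# kernel add to the full sum, so w^2(n) + w^2(n + M) = 1 holds exactly.
print(np.allclose(w[:M] ** 2 + w[M:] ** 2, 1.0))  # → True
```

Varying β trades passband selectivity against stop-band attenuation, which is the design freedom exploited for the AC-3 and AAC windows.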
6.7.3.5 Parametric Windows for a Biorthogonal MDCT Basis. In another example of biorthogonal window design, Malvar [Malv98] proposed the "modulated biorthogonal lapped transform (MBLT)," a biorthogonal version of the
Figure 6.16. Dolby AC-3 (solid) vs. sine (dashed) MDCT windows: (a) time-domain views; (b) frequency-domain magnitude responses in relation to the worst-case masking template. The improved stop-band attenuation of the AC-3 (KBD) window is shown to approximate well the minimum masking template.
MDCT based on a parametric window defined as
$$w_s(n) = \frac{1 - \cos\!\left[\left(\dfrac{n + 1}{2M}\right)^{\!\alpha}\pi\right] + \beta}{2 + \beta} \qquad (6.34)$$
for 0 ≤ n ≤ 2M − 1. Like [Cheu95], Eq. (6.34) was also motivated by a desire to realize improved stop-band attenuation. Additionally, it was used to achieve good characteristics in a novel nonuniform filter bank based on a straightforward manipulation of the MBLT. In this design, the parameter α controls window width, while the parameter β controls its end values.
6.7.3.6 Summary and Example Eight-Channel Filter Bank (MDCT) Using a Sine Window. The foregoing examples demonstrate that MDCT window designs are predominantly concerned with optimizing, in some sense, the tradeoff between mainlobe width and stopband attenuation, as is true of any FIR filter design. We also note that biorthogonal MDCT extensions are a recent development, and consequently most current audio coders incorporate primarily design innovations that have occurred within the orthogonal MDCT framework. To facilitate comparisons with the previously described filter bank methodologies (QMF, CQF, tree-structured QMF, pseudo-QMF, etc.), the analysis filter magnitude responses for an example eight-channel MDCT filter bank using the sine window are shown in Figure 6.17(a). Examination of the channel-3 impulse response in Figure 6.17(b) reveals the asymmetry that precludes linear phase for the analysis filters.
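An orthogonal sine-window MDCT of this size is compact enough to verify end to end. The sketch below is illustrative; the basis follows the standard long-block cosine modulation (cf. Eq. (6.24)), and it confirms that 50%-overlapped analysis/synthesis reconstructs the interior samples perfectly via time-domain aliasing cancellation:

```python
import numpy as np

def mdct_basis(M, w):
    # M x 2M matrix of analysis impulse responses:
    # p_k(n) = w(n) sqrt(2/M) cos[(2n + M + 1)(2k + 1) pi / (4M)]
    n = np.arange(2 * M)
    k = np.arange(M)[:, None]
    return np.sqrt(2.0 / M) * w * np.cos(
        (2 * n + M + 1) * (2 * k + 1) * np.pi / (4 * M))

M = 8                                                   # eight channels, as in Figure 6.17
w = np.sin((np.arange(2 * M) + 0.5) * np.pi / (2 * M))  # sine window, Eq. (6.30)
P = mdct_basis(M, w)

x = np.random.randn(16 * M)
y = np.zeros_like(x)
for start in range(0, len(x) - 2 * M + 1, M):           # hop size M (50% overlap)
    X = P @ x[start:start + 2 * M]                      # M coefficients per block
    y[start:start + 2 * M] += P.T @ X                   # windowed inverse + overlap-add

# Aliasing cancels between adjacent blocks; only the first and last M
# samples lack an overlapping partner.
print(np.allclose(y[M:-M], x[M:-M]))  # → True
```

Note the critical sampling: each hop of M new samples produces exactly M transform coefficients, even though each block spans 2M samples.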
6.7.3.7 Time-Varying Forms of the MDCT. One final point regarding MDCT window design is of particular relevance for perceptual audio coders. The earlier examples for tone-like and noise-like signals (Chapter 6, Section 6.2) demonstrated clearly that the characteristics of the "best" filter bank for audio are signal-specific and therefore time-varying. In practice, it is very common for codecs using the MDCT (e.g., MPEG-1 [ISOI92a], MPEG-2 AAC [ISOI96a], Dolby AC-3 [USAT95a], Sony ATRAC [Tsut96], etc.) to change the number of channels, and hence the window length, to match the signal properties of the input. Typically, a binary classification scheme identifies the input as either stationary or nonstationary/transient. Then, a long window is used to maximize coding gain and achieve good channel separation during segments identified as stationary, or a short window is used to localize time-domain artifacts when pre-echoes are likely. Although the strategy has proven to be highly effective, it does complicate the codec structure. In particular, because of the time overlap between basis vectors, either boundary filters [Herl95] or special transitional windows [Herl93] are required to preserve perfect reconstruction when window switching occurs. Other schemes are also available to achieve perfect reconstruction with time-varying filter bank properties [Quei93] [Soda94], but for practical reasons these are not typically used. Consequently, window switching has been the method of choice. In this scenario, the transitional window function does not need to be symmetrical. It can be shown that the PR property is preserved as long as the transitional window satisfies the following constraints:
$$w^2(n) + w^2(M - n) = 1, \qquad n < M \qquad (6.35a)$$
$$w^2(M + n) + w^2(2M - n) = 1, \qquad n < M \qquad (6.35b)$$
Figure 6.17. Eight-channel MDCT filter bank constructed with the sine window: (a) magnitude frequency responses for all eight channels of the analysis filter bank; odd-numbered channel responses are drawn with dashed lines, even-numbered channel responses with solid lines; (b) isolated view of the magnitude frequency response and time-domain impulse response for channel 3. Asymmetry is clearly visible in the channel impulse response, precluding the possibility of linear phase, although the overall analysis-synthesis filter bank has linear phase on all channels.
and provided that the relationship between the transitional window and the adjoining new-length window obeys
$$w_1(M + n) = w_2(M - n) \qquad (6.36)$$
where w₁(n) and w₂(n) are the left and right window functions, respectively. In spite of the preserved PR property, it should be noted that MDCT transitional windows are highly nonideal, in the sense that they seriously impair the channel selectivity and stop-band attenuation of the filter bank.
The Dolby AC-3 algorithm, as well as the MPEG MDCT-based coders, employ MDCT window switching to maximize filter bank-to-signal matching. The MPEG-1 layer III and MPEG-2 AAC window switching schemes use transitional windows that are described in some detail later (Section 6.10). Unlike the MPEG approach, the AC-3 algorithm maintains perfect reconstruction while avoiding transitional windows. The AC-3 applies high-resolution frequency analysis to stationary signals using an MDCT as defined in Eqs. (6.26) and (6.27), with M = 256. During transient segments, a pair of two half-length transforms (M = 128), given by
$$X_1(k) = \sum_{n=0}^{2M-1} x(n)\,h_{k,1}(n) \qquad (6.37a)$$
$$X_2(k) = \sum_{n=0}^{2M-1} x(n + 2M)\,h_{k,2}(n + 2M) \qquad (6.37b)$$
replaces the single long-block transform, and the short-block filter impulse responses h_{k,1} and h_{k,2} are defined as
$$h_{k,1}(n) = w(n)\sqrt{\frac{2}{M}}\cos\!\left[\frac{(2n + 1)(2k + 1)\pi}{4M}\right] \qquad (6.38a)$$
$$h_{k,2}(n) = w(n)\sqrt{\frac{2}{M}}\cos\!\left[\frac{(2n + 2M + 1)(2k + 1)\pi}{4M}\right] \qquad (6.38b)$$
The window function w(n) remains identical for both the long and short transforms. Here, the key to maintaining the PR property is that the different phase shifts in Eqs. (6.38a) and (6.38b), relative to Eq. (6.24), guarantee an orthogonal basis. Also note that the AC-3 window is customized and incorporates into its design some perceptual properties [USAT95a]. The spectral and temporal analysis tradeoffs involved in transitional window designs are well illustrated in [Shli97] for both the MPEG-1 layer III [ISOI92a] and the Dolby AC-3 [USAT95a] filter banks.
6.7.3.8 Fast Algorithms, Complexity, and Implementation Issues. One of the attractive properties that has contributed to the widespread use of the MDCT,
particularly in the standards, is the availability of FFT-based fast algorithms [Duha91] [Sevi94] that make the filter bank viable for real-time applications. A unified fast algorithm [Liu98] is available for the MPEG-1, -2, and -4 and AC-3 long-block MDCT (Eqs. (6.26) and (6.27)), the AC-3 short-block MDCT (Eqs. (6.38a) and (6.38b)), and the MPEG-1 pseudo-QMF bank (Eqs. (6.13) and (6.14)). The computational load of [Liu98] for an M = 1024 (2048-point) MDCT (e.g., MPEG-2 AAC, AT&T PAC) is 8192 multiplies and 13,920 adds. This translates into a complexity of O(M log₂ M) for multiplies and O(2M log₂ M) for adds; the complexity scales accordingly for other values of M. Both [Duha91] and [Liu98] exploit the fact that the forward MDCT can be decomposed into two cascaded stages (Figure 6.18), namely, a set of M/2 butterflies followed by an M-point discrete cosine transform (DCT). The inverse transform is decomposed in the inverse manner, i.e., a DCT followed by a butterfly network. In both cases, the butterflies capture the windowing behavior, while the DCT performs the modulation and filtering. The decompositions are efficient, as well-known fast algorithms are available for the various DCT types [Rao90]. The butterflies are of low complexity, typically O(2M) for both multiplies and adds. In addition to the computationally efficient algorithms of [Duha91] and [Liu98], a regressive structure suitable for parallel VLSI implementation of the forward MDCT of Eq. (6.26) was proposed in [Chia96], with a complexity of 3M adds and 2M multiplies per output for the forward transform, and 3M adds and M multiplies for the inverse transform.
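The two-stage decomposition of Figure 6.18 can be checked numerically. In the sketch below (illustrative; the folding convention follows common MDCT implementations, and the DCT-IV is computed naively rather than by a fast routine), a 2M-point MDCT of a windowed block is shown to equal an M-point DCT-IV of a folded M-point sequence:

```python
import numpy as np

def mdct_direct(xw, M):
    # Reference O(M^2) MDCT of an already-windowed 2M-sample block.
    n = np.arange(2 * M)
    k = np.arange(M)[:, None]
    return np.cos(np.pi / M * (n + 0.5 + M / 2) * (k + 0.5)) @ xw

def dct_iv(u):
    # Naive DCT-IV; in practice this stage uses a fast FFT-based routine.
    M = len(u)
    n = np.arange(M)
    k = np.arange(M)[:, None]
    return np.cos(np.pi / M * (n + 0.5) * (k + 0.5)) @ u

def mdct_folded(xw, M):
    # Butterfly/folding stage of Figure 6.18: the 2M windowed samples are
    # folded into an M-point sequence, then transformed by a DCT-IV.
    h = M // 2
    i = np.arange(h)
    u = np.empty(M)
    u[:h] = -xw[3 * h + i] - xw[3 * h - 1 - i]
    u[h:] = xw[i] - xw[2 * h - 1 - i]
    return dct_iv(u)

M = 8
xw = np.random.randn(2 * M)   # stands in for a windowed input block
print(np.allclose(mdct_direct(xw, M), mdct_folded(xw, M)))  # → True
```

With a fast DCT-IV, the folding stage is what reduces the 2M-point transform to M-point cost, matching the O(M log₂ M) figures quoted above.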
As far as other implementation issues are concerned, several researchers have addressed the quantization sensitivity of the MDCT. There are available
Figure 6.18. A fast algorithm for the 2M-point (M-channel) forward MDCT (Eq. (6.26)) consists of a butterfly network and memory, followed by a Type IV DCT. The inverse structure can be formed to compute the inverse MDCT (Eq. (6.27)). Efficient FFT-based algorithms are available for the Type IV DCT.
expressions [Jako96] for the reconstruction error of the quantized system, in terms of signal-correlated and uncorrelated components, that can be used to assist algorithm designers in the identification and optimization of perceptually disturbing reconstruction artifacts induced by quantization noise. A more general treatment of quantization issues for PR cosine-modulated filter banks has also appeared [Akan92].
6.7.3.9 Remarks on the MDCT. The MDCT has become of central importance in audio coding, and the majority of standardized algorithms make some use of this filter bank. This section has traced the origins of the MDCT, reviewed common terminology and definitions, addressed the major window design issues, examined strategies for time-varying implementations, and noted the availability of fast algorithms for efficient realization. It has also provided numerous examples. The important properties of the MDCT filter bank can be summarized as follows:
• Perfect reconstruction
• Overlapping basis vectors
• Linear overall filter bank phase response
• Extended ringing artifacts due to quantization
• Critical sampling
• Virtual elimination of blocking artifacts
• Constant group delay = L − 1
• Nonlinear analysis filter phase responses
• Low complexity: one filter and modulation
• Orthogonal version: M/2 degrees of freedom for w(n)
• Amenable to time-varying implementations, with some performance sacrifices
• Amenable to fast algorithms
• Constrained design: a single FIR lowpass prototype filter
• Biorthogonal version: M degrees of freedom for wa(n) or ws(n)
One can see from this synopsis that the MDCT possesses many of the qualities suitable for audio coding (Section 6.3). As a PR cosine-modulated filter bank, it inherits all of the advantages realized for the pseudo-QMF, except for phase linearity on the individual analysis channels, and it does so at the expense of a typically less than 5 dB reduction in stop-band attenuation. Moreover, the MDCT offers the added advantage that the number of parameters to be optimized in the design of the lowpass prototype is essentially reduced to M/2 in the orthogonal case. If more freedom is desired, however, one can opt for the biorthogonal construction. Finally, we have presented the MDCT as both a filter bank and a block transform. To maintain consistency, we recognize the filter-bank/transform duality of some of the other tools presented in this chapter. Recall that the MDCT
is a special case of the PR cosine-modulated filter bank for which L = 2mM, with m = 1. Note, then, that the PQMF bank (Chapter 6, Section 6.6) can also be interpreted as a lapped transform for which it is possible to have L = 2mM. In the case of the MPEG-1 filter bank for layers I and II, for example, L = 512 and M = 32, or, in other words, m = 8. As the coder architectures described in Chapters 7 through 10 will demonstrate, many filter banks for audio coding are most efficiently realized as block transforms, particularly when fast algorithms are available.
6.8 DISCRETE FOURIER AND DISCRETE COSINE TRANSFORM
This section offers abbreviated filter bank interpretations of the discrete Fourier transform (DFT) and the discrete cosine transform (DCT). These classical block transforms were often used to achieve high-resolution frequency analysis in the early experimental transform-based audio coders (Chapter 7) that preceded the adaptive spectral entropy coding (ASPEC) and, ultimately, the MPEG-1 layers I–III algorithms (Chapter 10). For example, the FFT realization of the DFT plays an important role in layer III of MPEG-1 (MP3). The FFT is embedded in efficient realizations of both MP3 hybrid filter bank stages (pseudo-QMF and MDCT), as well as in the spectral estimation blocks of the psychoacoustic models 1 and 2 recommended in the MPEG-1 standard [ISOI92]. It can be seen that block transforms are a special case of the more general uniform-band analysis-synthesis filter bank of Figure 6.1. For example, consider the unitary DFT and its inverse [Akan92], which can be written, respectively, as
$$X(k) = \frac{1}{\sqrt{2M}}\sum_{n=0}^{2M-1} x(n)\,W^{-nk}, \qquad 0 \le k \le 2M - 1 \qquad (6.39a)$$
$$x(n) = \frac{1}{\sqrt{2M}}\sum_{k=0}^{2M-1} X(k)\,W^{nk}, \qquad 0 \le n \le 2M - 1 \qquad (6.39b)$$
where W = e^{jπ/M}. If the analysis filters in Eq. (6.1) all have the same length and L = 2M, then the filter bank could be interpreted as taking contiguous L-sample blocks of the input and applying to each block the transform in Eq. (6.39a). Although the DFT is usually defined with a block size of N instead of 2M, Eqs. (6.39a) and (6.39b) are given using notation slightly different from the usual in order to remain consistent with the convention of this chapter, throughout which the number of filter bank channels is denoted by the parameter M. The DFT has conjugate symmetry for real signals, and thus, from the audio filter bank perspective, effectively half as many channels as its block length. Also from the filter bank viewpoint, the impulse response of the k-th-channel analysis filter is given by the k-th DFT basis vector, i.e.,
$$h_k(n) = \frac{1}{\sqrt{2M}}\,W^{kn}, \qquad 0 \le n \le 2M - 1, \quad 0 \le k \le M - 1 \qquad (6.40)$$
Figure 6.19. Eight-band STFT filter bank: (a) magnitude frequency responses for all eight channels of the analysis filter bank; odd-numbered channel responses are drawn with dashed lines, even-numbered channel responses with solid lines; (b) isolated view of the magnitude frequency response and time-domain impulse response for channel 3. Note that the impulse response is complex-valued and that only its magnitude is shown in the figure.
An example eight-channel DFT analysis filter bank magnitude frequency response appears in Figure 6.19, with the magnitude response of the third channel magnified in Figure 6.19(b). The magnitude of the complex-valued impulse response for the same channel also appears in Figure 6.19(b). Note that the DFT filter bank is evenly stacked, whereas the cosine-modulated filter banks in Sections 6.6 and 6.7 of this chapter were oddly stacked. In other words, the center frequencies for the DFT analysis and synthesis filters occur at kπ/M, for 0 ≤ k ≤ M − 1, while the center frequencies for the oddly stacked filters occur at (2k + 1)π/2M, for 0 ≤ k ≤ M − 1. As is evident from Figure 6.19(a), even stacking means that the low-band filter is only half the bandwidth of the other channels and that it "wraps around" the fold-over frequency.
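The even-stacking claim is easy to confirm numerically. The sketch below (illustrative) builds the channel impulse responses of Eq. (6.40) and locates the peak of each magnitude response at exactly kπ/M:

```python
import numpy as np

M = 8                                   # number of channels; block length 2M
n = np.arange(2 * M)
W = np.exp(1j * np.pi / M)              # W = e^{j*pi/M}, as in Eq. (6.39)
nfft = 64 * 2 * M                       # zero-padded FFT gives a dense frequency grid

for k in range(M):
    hk = W ** (k * n) / np.sqrt(2 * M)  # k-th analysis impulse response, Eq. (6.40)
    H = np.fft.fft(hk, nfft)
    peak_bin = int(np.argmax(np.abs(H)))
    # Even stacking: channel k is centered at omega = k*pi/M,
    # i.e., at FFT bin k*nfft/(2M).
    assert peak_bin == k * nfft // (2 * M)

print("all channels peak at k*pi/M (even stacking)")
```

Running the same experiment with the odd-stacked cosine-modulated responses would instead place the peaks at (2k + 1)π/2M.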
A filter bank perspective can also be provided for the DCT. As a block transform, the forward DCT (Type II) and its inverse are given by the analysis and synthesis expressions, respectively:
$$X(k) = c(k)\sqrt{\frac{2}{M}}\sum_{n=0}^{M-1} x(n)\cos\!\left[\frac{\pi}{M}\left(n + \frac{1}{2}\right)k\right], \qquad 0 \le k \le M - 1 \qquad (6.41a)$$
$$x(n) = \sqrt{\frac{2}{M}}\sum_{k=0}^{M-1} c(k)\,X(k)\cos\!\left[\frac{\pi}{M}\left(n + \frac{1}{2}\right)k\right], \qquad 0 \le n \le M - 1 \qquad (6.41b)$$
where c(0) = 1/√2 and c(k) = 1 for 1 ≤ k ≤ M − 1. Using the same duality arguments as for the DFT, one can view the DCT from the perspective of the analysis-synthesis filter bank (Figure 6.1), in which case the impulse response of the k-th-channel analysis filter is the k-th DCT-II basis vector, given by
$$h_k(n) = c(k)\sqrt{\frac{2}{M}}\cos\!\left[\frac{\pi}{M}\left(n + \frac{1}{2}\right)k\right], \qquad 0 \le n, k \le M - 1 \qquad (6.42)$$
As an example, the magnitude frequency responses of an eight-channel DCT analysis filter bank are given in Figure 6.20(a), and the isolated magnitude response of the third channel is given in Figure 6.20(b). The impulse response for the same channel is also given in the figure.
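The orthonormality implied by Eqs. (6.41a) and (6.41b) can be verified directly. In the sketch below (illustrative), the rows of the DCT-II analysis matrix, i.e., the impulse responses of Eq. (6.42), satisfy AAᵀ = I, so synthesis is simply the transpose of analysis:

```python
import numpy as np

def dct2_matrix(M):
    # Orthonormal DCT-II analysis matrix of Eq. (6.41a):
    # A[k, n] = c(k) sqrt(2/M) cos[(pi/M)(n + 1/2) k]
    n = np.arange(M)
    k = np.arange(M)[:, None]
    c = np.where(k == 0, 1.0 / np.sqrt(2.0), 1.0)
    return c * np.sqrt(2.0 / M) * np.cos(np.pi / M * (n + 0.5) * k)

M = 8
A = dct2_matrix(M)
print(np.allclose(A @ A.T, np.eye(M)))   # rows are orthonormal → True

x = np.random.randn(M)
print(np.allclose(A.T @ (A @ x), x))     # Eq. (6.41b) inverts Eq. (6.41a) → True
```

Unlike the MDCT, this transform is non-overlapping (block length equals M), which is precisely why it offers no protection against blocking artifacts.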
6.9 PRE-ECHO DISTORTION
An artifact known as pre-echo distortion can arise in transform coders using perceptual coding rules. Pre-echoes occur when a signal with a sharp attack begins near the end of a transform block, immediately following a region of low energy. This situation can arise when coding recordings of percussive instruments such as the triangle, the glockenspiel, or the castanets, for example (Figure 6.21a). For a block-based algorithm, when quantization and encoding are performed in order to satisfy the masking thresholds associated with the block's average spectral estimate, time-frequency uncertainty dictates that the inverse transform will spread quantization distortion evenly in time throughout the reconstructed block
Figure 6.20. Eight-band DCT filter bank: (a) magnitude frequency responses for all eight channels of the analysis filter bank; odd-numbered channel responses are drawn with dashed lines, even-numbered channel responses with solid lines; (b) isolated view of the magnitude frequency response and time-domain impulse response for channel 3.
(Figure 6.21b). This results in unmasked distortion throughout the low-energy region preceding the signal attack at the decoder. Although it has the potential to compensate for pre-echo, temporal premasking is possible only if the transform block size is sufficiently small (minimal coder delay). Percussive sounds are not the only signals likely to produce pre-echoes. Such artifacts also often plague coders when processing "pitched" signals containing nearly impulsive bursts at the beginning of each pitch period, e.g., the "German male speech" recording [Herr96]. For a male speaker with a fundamental frequency of 125 Hz, the interval between impulsive events is only 8 ms, which is much less than the typical analysis block length. Several methods proposed to eliminate pre-echoes are reviewed next.
6.10 PRE-ECHO CONTROL STRATEGIES
Several methodologies have been proposed and successfully applied in the effort to mitigate the pre-echoes that tend to plague block-based coding schemes. This section describes several of the most widespread techniques, including the bit reservoir, window switching, gain modification, switched filter banks, and temporal noise shaping. Advantages and drawbacks associated with each method are also discussed.
6.10.1 Bit Reservoir
Some coders [ISOI92] [John96c] utilize this technique to satisfy the greater bit demand associated with transients. Although most algorithms are fixed rate, the instantaneous bit rates required to satisfy masked thresholds on each frame are, in fact, time-varying. Thus, the idea behind a bit reservoir is to store surplus bits during periods of low demand, and then to allocate bits from the reservoir during localized periods of peak demand, resulting in a time-varying instantaneous bit rate but, at the same time, a fixed average bit rate. One problem, however, is that very large reservoirs are needed to deal with certain transient signals, e.g., "pitched signals." Particular bit reservoir implementations are addressed later, in conjunction with the MPEG [ISOI92a] and PAC [John96c] standards.
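The accounting behind a bit reservoir can be illustrated with a toy simulation. The frame rate, cap, and allocation policy below are hypothetical and are not the MPEG or PAC rules; the point is only that surplus bits from easy frames are banked, up to a cap, and spent on frames whose demand exceeds the fixed rate:

```python
R_FRAME = 1000          # fixed channel rate, bits per frame (hypothetical)
CAP = 4000              # maximum reservoir size, bits (hypothetical)

def allocate(demands, r_frame=R_FRAME, cap=CAP):
    reservoir, grants = 0, []
    for demand in demands:
        available = r_frame + reservoir
        grant = min(demand, available)           # never overdraw the channel
        reservoir = min(cap, available - grant)  # bank the surplus, up to cap
        grants.append(grant)
    return grants

# Quiet frames under-spend; the transient frame draws on the banked bits.
demands = [600, 700, 650, 3000, 800]
grants = allocate(demands)
print(grants)                                  # transient frame gets > R_FRAME bits
print(sum(grants) <= R_FRAME * len(demands))   # long-run average rate is respected
```

Note that the transient frame still cannot receive more than the banked surplus plus one frame's budget, which is why very large reservoirs are needed for sustained "pitched" transients.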
6.10.2 Window Switching
First introduced by Edler [Edle89], this is also a popular method for pre-echo suppression, particularly in the case of MDCT-based algorithms. Window switching works by changing the analysis block length from long duration (e.g., 25 ms) during stationary segments to "short" duration (e.g., 4 ms) when transients are detected (Figure 6.22). At least two considerations motivate this method. First, a short window applied to the frame containing the transient will tend to minimize the temporal spread of quantization noise, such that temporal premasking effects might preclude audibility. Secondly, it is desirable to constrain the high bit rates associated with transients to the shortest possible temporal regions. Although window switching has been successful [ISOI92] [John96c] [Tsut98], it also has
Figure 6.21. Pre-echo example (time-domain waveforms): (a) uncoded castanets; (b) transform-coded castanets, 2048-point block size. Pre-echo distortion is clearly visible in the first 1300 samples of the reconstructed signal.
significant drawbacks. For one, the perceptual model and lossless coding portions of the coder must support multiple time resolutions. This usually translates into increased complexity. Furthermore, most coders nowadays use lapped transforms such as the MDCT. To satisfy PR constraints, window switching typically requires transition windows between the long and short blocks. Even when suitable transition windows (Figure 6.22) satisfy the PR constraints, they do so at the expense of poor time and frequency localization properties [Shli97], resulting in reduced coding gain. Other difficulties inherent to window switching schemes are increased coder delay, undesirable latency for closely spaced transients (e.g., long-start-short-stop-start-short), and impractical overuse of short windows for "pitched" signals.
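The binary stationary/transient classification that drives window switching can be sketched with a simple sub-block energy-ratio detector. The frame sizes and threshold below are hypothetical choices for illustration; real coders typically use perceptual entropy or high-pass-filtered sub-block energies instead:

```python
import numpy as np

def classify_frames(x, frame_len=1152, sub_len=144, ratio_thresh=8.0):
    # Flag a frame "short" (transient) when the energy of any sub-block
    # jumps sharply relative to its predecessor; otherwise keep "long".
    decisions = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        subs = x[start:start + frame_len].reshape(-1, sub_len)
        energies = np.sum(subs ** 2, axis=1) + 1e-12
        rises = energies[1:] / energies[:-1]
        decisions.append("short" if np.max(rises) > ratio_thresh else "long")
    return decisions

fs = 44100
t = np.arange(4 * 1152) / fs
x = 0.01 * np.sin(2 * np.pi * 440 * t)              # quiet, stationary tone
x[2 * 1152 + 600:] += np.sin(2 * np.pi * 1000 * t[2 * 1152 + 600:])  # sharp attack
print(classify_frames(x))  # → ['long', 'long', 'short', 'long']
```

Only the frame containing the attack is switched to short blocks; the loud but stationary frame that follows reverts to a long window, which is the behavior that maximizes coding gain.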
6.10.3 Hybrid Switched Filter Banks
Window switching essentially relies upon a fixed filter bank with adaptive window lengths. In contrast, the hybrid and switched filter-bank architectures rely upon distinct filter bank modes. In hybrid schemes (e.g., [Prin95]), compatible filter-bank elements are cascaded in order to achieve the time-frequency tiling best suited to the current input signal. Switched filter banks (e.g., [Sinh96]), on the other hand, make hard switching decisions on each analysis interval in order to select a single monolithic filter bank tailored to the current input. Examples of these methods are given in later chapters, along with some discussion of their associated tradeoffs.
Figure 6.22. Example window switching scheme (MPEG-1 layer III, or "MP3"). Transitional start and stop windows are required in between the long and short blocks to preserve the PR properties of the filter bank.
6.10.4 Gain Modification
The gain modification approach (Figure 6.23) has also shown promise in the task of pre-echo control [Vaup91] [Link93]. The gain modification procedure smoothes transient peaks in the time domain prior to spectral analysis. Then, perceptual coding may proceed as it does for normal, stationary blocks. Quantization noise is shaped to satisfy masking thresholds computed for the equalized long block, without compensating for an undesirable temporal spread of quantization noise. A time-varying gain and the modification time interval are transmitted as side information. Inverse operations are performed at the decoder to recover the original signal. Like the other techniques, caveats also apply to this method. For example, gain modification effectively distorts the spectral analysis time window. Depending upon the chosen filter bank, this distortion could have the unintended consequence of broadening the filter-bank responses at low frequencies beyond critical bandwidth. One solution for this problem is to apply independent gain modifications selectively, within only the frequency bands affected by the transient event. This selective approach, however, requires embedding of the gain blocks within a hybrid filter-bank structure, which increases coder complexity [Akag94].
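The core idea can be sketched in a few lines. The envelope estimate and normalization below are hypothetical simplifications (not the [Vaup91] or [Link93] procedures): a smooth gain derived from short-time energy flattens the attack before the transform, and the decoder inverts it using the transmitted gain as side information:

```python
import numpy as np

def gain_envelope(x, hop=32):
    # Smooth per-sample equalizing gain from a short-time RMS estimate.
    e = np.sqrt(np.convolve(x ** 2, np.ones(hop) / hop, mode="same")) + 1e-6
    g = 1.0 / e
    return g / np.max(g)         # normalize so quiet regions keep gain near 1

# Quiet segment followed by a sharp attack (stand-in for a castanet hit).
x = np.concatenate([0.01 * np.random.randn(512), np.random.randn(512)])
g = gain_envelope(x)

flattened = g * x                # fed to the spectral analysis / quantizer
recovered = flattened / g        # decoder applies the inverse gain
print(np.allclose(recovered, x)) # → True (lossless without quantization)
```

In a real coder the quantization noise added to `flattened` is multiplied by 1/g at the decoder, so the noise envelope follows the signal attack rather than spreading into the quiet region.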
6.10.5 Temporal Noise Shaping
The final pre-echo control technique considered in this section is temporal noise shaping (TNS). As shown in Figure 6.24, TNS [Herr96] is a frequency-domain technique that operates on the spectral coefficients, X(k), generated by the analysis filter bank. TNS is applied only during input attacks susceptible to pre-echoes.
[Figure: input s(n) scaled by gain G(n), with a transient detector and spectral analysis; the gain is transmitted as side information.]
Figure 6.23. Gain modification scheme for pre-echo control.
[Figure: spectral coefficients X(k) filtered by A(z) to form the residual e(k), which is quantized (Q); TNS synthesis at the decoder recovers X̂(k) from ê(k).]
Figure 6.24. Temporal noise shaping scheme (TNS) for pre-echo control.
The idea is to apply linear prediction (LP) across frequency (rather than time), since for an impulsive time signal, frequency-domain coding gain is maximized using prediction techniques. The method works as follows. Parameters of a spectral LP synthesis filter A(z) are estimated via application of standard minimum MSE estimation methods (e.g., Levinson-Durbin) to the spectral coefficients X(k). The resulting prediction residual, e(k), is quantized and encoded using standard perceptual coding according to the original masking threshold. Prediction coefficients are transmitted to the receiver as side information to allow recovery of the original signal. The convolution operation associated with spectral-domain prediction corresponds to multiplication in time. In a manner analogous to the source-system separation realized by time-domain LP analysis for traditional speech codecs, TNS effectively separates the time-domain waveform into an envelope and a temporally flat "excitation." Then, because quantization noise is added to the flattened residual, the time-domain multiplicative envelope corresponding to A(z) shapes the quantization noise such that it follows the original signal envelope.
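Ignoring quantization, the TNS analysis/synthesis relationship can be sketched as follows (Python/NumPy; the toy spectrum, predictor order, and the direct normal-equation solve are our stand-ins for the Levinson-Durbin recursion a real coder would use):

```python
import numpy as np

rng = np.random.default_rng(0)
k = np.arange(256)
X = np.exp(-k / 64.0) * rng.standard_normal(256)  # toy spectrum with a decaying envelope

p = 4                                             # predictor order
r = np.array([X[:X.size - i] @ X[i:] for i in range(p + 1)])  # autocorrelation over k
a = np.linalg.solve(np.array([[r[abs(i - j)] for j in range(p)]
                              for i in range(p)]), r[1:])     # predictor coefficients

# Analysis: residual e(k) = X(k) - sum_i a_i X(k - i), prediction ACROSS FREQUENCY
e = X.copy()
for i in range(1, p + 1):
    e[i:] -= a[i - 1] * X[:-i]

# Synthesis (decoder side): X(k) = e(k) + sum_i a_i X(k - i), run recursively
X_rec = np.zeros_like(e)
for m in range(e.size):
    X_rec[m] = e[m] + sum(a[i - 1] * X_rec[m - i] for i in range(1, p + 1) if m - i >= 0)

assert np.allclose(X_rec, X)                 # lossless without quantization
assert np.sum(e**2) <= np.sum(X**2)          # prediction gain across frequency
```

In the actual coder, it is the quantized residual that passes through the synthesis recursion, so the quantization noise inherits the temporal envelope encoded in A(z).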
Quantization noise for the castanets applied to a DCT-based coder is shown in Figure 6.25(a) and Figure 6.25(b), without and with TNS active, respectively. TNS clearly shapes the quantization noise to follow the input signal's energy envelope. TNS mitigates pre-echoes since the error energy is now concentrated in the time interval associated with the largest masking threshold. Although they are related as time-frequency dual operations, TNS is advantageous relative to gain shaping because it is easily applied selectively in specific frequency subbands. Moreover, TNS has the advantages of compatibility with most filter-bank structures and manageable complexity. Unlike window switching schemes, for example, TNS does not require modification of the perceptual model or lossless coding stages to a new time-frequency mapping. TNS was reported in [Herr96] to dramatically improve performance on a five-point mean opinion score (MOS) test, from 2.64 to 3.54, for a particularly troublesome pitched signal, "German Male Speech," for the MPEG-2 nonbackward compatible (NBC) coder [Herr96]. A MOS improvement of 0.3 was also realized for the well-known "Glockenspiel" test signal. This ultimately led to the adoption of TNS in the MPEG NBC scheme [Bosi96a] [ISOI96a].
6.11 SUMMARY
This chapter covered the basics of time-frequency analysis techniques for audio signal processing and coding. We also highlighted the time-frequency tradeoff challenges faced by audio codec designers when designing a filter bank. We discussed both the QMF and CQF filter-bank designs and their extended tree-structured forms. We also dealt in detail with the cosine-modulated pseudo-QMF and perfect reconstruction M-band filter-bank designs. The modified discrete cosine transform (MDCT) and the various window designs were also covered in detail.
[Figure: two panels, (a) and (b), plotting amplitude (−1 to 1) versus sample index n (200 to 2000), comparing quantization noise against the input signal envelope.]
Figure 6.25. Temporal noise shaping example showing quantization noise and the input signal energy envelope for castanets: (a) without TNS and (b) with TNS.
PROBLEMS
6.1. Prove the identities shown in Figure 6.10.
6.2. Consider the analysis-synthesis filter bank shown in Figure 6.1 with input signal s(n). Show that

ŝ(n) = (1/M) ∑_{m=−∞}^{∞} ∑_{l=−∞}^{∞} ∑_{k=0}^{M−1} s(m) h_k(lM − m) g_k(l − Mn),

where M is the number of subbands.
6.3. For the down-sampling and up-sampling processes given in Figure 6.26, show that

S_d(Ω) = (1/M) ∑_{l=0}^{M−1} S((Ω + 2πl)/M) H((Ω + 2πl)/M)

and S_u(Ω) = S(ΩM) G(Ω).
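The down-sampling relation can be checked numerically with the DFT. The sketch below (Python/NumPy, our own toy example with H(z) = 1 so the filter term drops out) verifies that the spectrum of the M-fold decimated signal equals the 1/M-scaled alias sum of the original spectrum:

```python
import numpy as np

M, N = 2, 8
rng = np.random.default_rng(3)
s = rng.standard_normal(N)
S = np.fft.fft(s)

# Decimate without an anti-alias filter (H(z) = 1) and take the shorter DFT
sd = s[::M]
Sd = np.fft.fft(sd)

# Alias sum predicted by the formula: Sd(k) = (1/M) * sum_l S(k + l*N/M)
Sd_pred = (S[:N // M] + S[N // M:]) / M
assert np.allclose(Sd, Sd_pred)
```

The same experiment with a nontrivial H(z) applied before decimation reproduces the full formula, with each alias term weighted by the filter response.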
6.4. Using results from Problem 6.3, prove Eq. (6.5) for the analysis-synthesis framework shown in Figure 6.1.
6.5. Consider Figure 6.27. Given s(n) = 0.75 sin(πn/3) + 0.5 cos(πn/6), n = 0, 1, ..., 6, H0(z) = 1 − z⁻¹, and H1(z) = 1 + z⁻¹:
a. Design the synthesis filters G0(z) and G1(z) in Figure 6.27 such that aliasing distortions are minimized.
b. Write closed-form expressions for v0(n), v1(n), y0(n), y1(n), w0(n), w1(n), and the synthesized waveform ŝ(n). In Figure 6.27, assume ŷi(n) = yi(n) for i = 0, 1.
c. Assuming an alias-free scenario, show that ŝ(n) = α s(n − n0), where α is the QMF bank gain and n0 is a delay that depends on Hi(z) and Gi(z). Estimate the value of n0.
d. Repeat steps (a) and (c) for H0(z) = 1 − 0.75z⁻¹ and H1(z) = 1 + 0.75z⁻¹.
[Figure: down-sampling path s(n) → H(z) → ↓M → sd(n); up-sampling path s(n) → ↑M → G(z) → su(n).]
Figure 6.26. Down-sampling and up-sampling processes.
[Figure: s(n) feeds analysis filters H0(z) and H1(z), each followed by ↓2, producing y0(n) and y1(n); ŷ0(n) and ŷ1(n) pass through ↑2, synthesis filters G0(z) and G1(z), and are summed to form ŝ(n).]
Figure 6.27. A two-band maximally decimated analysis-synthesis filter bank.
6.6. In this problem, we will compare the two-band QMF and CQF designs. Given H0(z) = 1 − 0.9z⁻¹:
a. Design a two-band (i.e., M = 2) QMF [use Eq. (6.7)].
b. Design a two-band CQF for L = 8 [use Eq. (6.8)].
c. Consider the two-band QMF and CQF banks in Figure 6.28 with input signal s(n) = 0.75 sin(πn/3) + 0.5 cos(πn/6), n = 0, 1, ..., 6. Compare the designs in (a) and (b) and check for alias-free reconstruction in the case of the CQF. Give the delay values d1 and d2.
d. Extend the two-band QMF design in part (a) to polyphase factorization [use Equations (6.9) and (6.10)]. What are the advantages of employing polyphase factorization?
6.7. In this problem, we will design and analyze a four-channel uniform tree-structured QMF bank.
a. Given H00(z) = 1 + 0.1z⁻¹ and H10(z) = 1 + 0.9z⁻¹, complete the tree-structured QMF bank (use Figure 6.8) for four channels.
b. Using the identities given in Figure 6.10 (or Eq. (6.11)), construct a parallel analysis-synthesis filter bank. The parallel analysis-synthesis filter-bank structure must be similar to the one shown in Figure 6.1, with M = 4.
c. Plot the frequency response of the resulting parallel filter-bank analysis filters, i.e., H0(z), H1(z), H2(z), and H3(z). Comment on the passband and stopband structures of the magnitude responses associated with these filters.
d. Plot the impulse response h1(n). Is h1(n) symmetric?
6.8. Repeat Problem 6.7 for a four-channel uniform tree-structured CQF bank with L = 4.
6.9. A time-domain plot of an audio signal is shown in Figure 6.29. Given the flexibility to encode the regions A through E with varying frame sizes, which of the following choices is preferred?
Choice I: Long frames in regions B and D.
Choice II: Long frames in regions A, C, and E.
[Figure: s(n) through a two-band QMF yielding s1(n − d1), and through a two-band CQF yielding s2(n − d2).]
Figure 6.28. A two-band QMF and CQF design comparison.
Figure 6.29. An example audio segment with harmonics (region A), a transient (region B), background noise (region C), exponentially weighted harmonics (region D), and background noise (region E).
Choice III: Short frames in region B only.
Choice IV: Short frames in regions B and D.
Choice V: Short frames in region A only.
Explain how you would assign frequency resolution (high or low) among the regions A, B, C, D, and E.
6.10. A pure tone at f0 with P0 dB SPL is encoded such that the quantization noise is masked. Assume that a 256-point MDCT produced an in-band signal-to-noise ratio of SNR_A and encodes the tone with bA bits/sample, and that a 1024-point MDCT yielded SNR_B and encodes the tone with bB bits/sample.
Figure 6.30. Audio frames x1(n), x2(n), x3(n), and x4(n) for Problem 6.11.
In which of the two cases will we require the most bits/sample (state whether bA > bB or bA < bB) to mask the quantization noise?
6.11. Given the signals x1(n), x2(n), x3(n), and x4(n) as shown in Figure 6.30, let all the signals be of length 1024 samples. When transform coded using a 512-point MDCT, which of the signals will result in pre-echo distortion?
COMPUTER EXERCISES
6.12. In this problem, we will study filter banks that are based on the DFT. Use Eq. (6.39a) to implement a 2M-point DFT of x(n) given in Figure 6.31. Assume M = 8.
a. Give the plots of |X(k)|.
b. Plot the frequency response of the second- and third-channel analysis filters that are associated with the basis vectors h1(n) and h2(n).
c. State whether the DFT filter bank is evenly stacked or oddly stacked.
6.13. In this problem, we will study filter banks that are based on the DCT. Use Eq. (6.41a) to implement an M-point DCT of x(n) given in Figure 6.31. Assume M = 8.
a. Give the plots of |X(k)|.
b. Also plot the frequency response of the second- and third-channel analysis filters that are associated with the basis vectors h1(n) and h2(n).
c. Plot the impulse response of h1(n) and see if it is symmetric.
d. Is the DCT filter bank evenly stacked or oddly stacked?
6.14. In this problem, we will study filter banks based on the MDCT.
a. First, design a sine window w(n) = sin[(2n + 1)π/4M] with M = 8.
b. Check if the sine window satisfies the generalized perfect reconstruction conditions, i.e., Eqs. (6.28a) and (6.28b).
c. Next, design an MDCT analysis filter bank, hk(n), for 0 ≤ k ≤ 7.
[Figure: signal x(n) alternating between amplitudes 1 and −1 over the sample ranges 0-7, 8-15, 16-23, and 24-31.]
Figure 6.31. Input signal x(n) for Problems 6.12, 6.13, and 6.14.
d. Plot both the impulse response and the frequency response of the analysis filters h1(n) and h2(n).
e. Compute the MDCT coefficients X(k) of the input signal x(n) shown in Figure 6.31.
f. Is the impulse response of the analysis filter h1(n) symmetric?
6.15. Show analytically that a DCT can be implemented using FFTs. Also, use x(n) given in Figure 6.32 as your test signal and verify your software implementation.
6.16. Give expressions for the DCT-I, DCT-II, DCT-III, and DCT-IV orthonormal transforms (e.g., see [Rao90]). Use the signals x1(n) and x2(n) shown in Figure 6.33 to study the differences in the 4-point DCT coefficients obtained from the different types of DCT. Describe, in general, whether choosing a particular type of DCT affects the energy compaction of a signal.
6.17. In this problem, we will design a two-band (M = 2) cosine-modulated PQMF bank with L = 8.
a. First, design a linear-phase FIR prototype lowpass filter (i.e., w(n)) with normalized cutoff frequency π/4. Plot the frequency response of this window. Use the fir2 command in MATLAB to design the lowpass filter.
b. Use Eqs. (6.13) and (6.14) to design the PQMF analysis and synthesis filters, respectively.
[Figure: signal x(n) alternating between amplitudes 1 and −1 over the sample ranges 0-7 and 8-15.]
Figure 6.32. Input signal x(n) for Problem 6.15.
[Figure: two length-4 test signals, x1(n) and x2(n), with amplitudes between −1 and 1.]
Figure 6.33. Test signals to study the differences among various types of orthonormal DCT transforms.
[Figure: s(n) through a two-band PQMF yielding s3(n − d3).]
Figure 6.34. A two-band PQMF design.
c. In Figure 6.34, use s(n) = 0.75 sin(πn/3) + 0.5 cos(πn/6), n = 0, 1, ..., 6, and generate s3(n). Compare s(n) and s3(n) and comment on any type of distortion that you may observe.
d. What are the advantages of employing a cosine-modulated filter bank over the two-band QMF and two-band CQF?
e. List some of the key differences between the CQF and the PQMF in terms of: 1) the analysis filter-bank frequency responses, 2) phase distortion, and 3) impulse response symmetries.
f. Use the sine window w(n) = sin[(2n + 1)π/4M] with M = 8 and repeat steps (b) and (c).
CHAPTER 7
TRANSFORM CODERS
7.1 INTRODUCTION
Transform coders make use of unitary transforms (e.g., the DFT, DCT, etc.) for the time/frequency analysis section of the audio coder shown in Figure 1.1. Many transform coding schemes for wideband and high-fidelity audio have been proposed, starting with some of the earliest perceptual audio codecs. For example, in the mid-1980s, Krahe applied psychoacoustic bit allocation principles to a transform coding scheme [Krah85] [Krah88]. Schroeder [Schr86] later extended these ideas into multiple adaptive spectral audio coding (MSC). The MSC utilizes a 1024-point DFT, then groups coefficients into 26 subbands inspired by the critical bands of the ear. This chapter gives an overview of algorithms that were proposed for transform coding of high-fidelity audio following the early work of Schroeder [Schr86].
The chapter is organized as follows. Sections 7.2 through 7.5 describe in some detail the transform coding algorithms proposed by Brandenburg, Johnston, and Mahieux [Bran87b] [John88a] [Mahi89] [Bran90]. Most of this research became connected with the MPEG standardization, and the ISO/IEC eventually clustered these algorithms into a single candidate algorithm for coding of high-quality music signals, called adaptive spectral entropy coding (ASPEC) [Bran91]. The ASPEC algorithm (Section 7.6) has become part of the ISO/IEC MPEG-1 [ISOI92] and the MPEG-2 BC-LSF [ISOI94a] audio coding standards. Sections 7.7 and 7.8 are concerned with two transform coefficient substitution schemes, namely, the differential perceptual audio coder (DPAC) and the DFT noise substitution algorithm. Finally,
Audio Signal Processing and Coding, by Andreas Spanias, Ted Painter, and Venkatraman Atti. Copyright © 2007 by John Wiley & Sons, Inc.
Sections 7.9 and 7.10 address several early applications of vector quantization (VQ) to transform coding of high-fidelity audio.
The algorithms described in this chapter that make use of modulated filter banks (e.g., ASPEC, DPAC, TwinVQ) can also be characterized as high-resolution subband coders. Typically, transform coders perform high-resolution frequency analysis, while subband coders rely on a coarse division of the frequency spectrum. In many ways, the transform and subband coder categories overlap, and in some cases it is hard to categorize a coder in a definite manner. The source of this overlap between the transform/subband categories comes from the fact that block transform realizations are used for cosine-modulated filter banks.
7.2 OPTIMUM CODING IN THE FREQUENCY DOMAIN
Brandenburg in 1987 proposed a 132 kb/s algorithm known as optimum coding in the frequency domain (OCF) [Bran87b], which is in some respects an extension of the well-known adaptive transform coder (ATC) for speech. The OCF was refined several times over the years, with two enhanced versions appearing after the original algorithm. The OCF is of interest because of its influence on current standards.
The original OCF (Figure 7.1) works as follows. The input signal is first buffered in 512-sample blocks and transformed to the frequency domain using the DCT. Next, transform components are quantized and entropy coded. A single quantizer is used for all transform components. Adaptive quantization and entropy coding work together in an iterative procedure to achieve a fixed bit rate. The initial quantizer step size is derived from the spectral flatness measure (Eq. (5.13)).
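The spectral flatness measure itself is simple to compute. A small sketch (Python/NumPy, with toy signals of our own choosing) of the SFM as the ratio of the geometric to the arithmetic mean of the power spectrum:

```python
import numpy as np

def sfm(psd):
    """Spectral flatness: geometric mean / arithmetic mean of the PSD."""
    psd = np.asarray(psd, dtype=float)
    return np.exp(np.mean(np.log(psd))) / np.mean(psd)

# A flat (white) spectrum has SFM = 1; a tonal spectrum has SFM near 0
flat = sfm(np.ones(512))
tone_psd = np.abs(np.fft.rfft(np.sin(2 * np.pi * 0.1 * np.arange(512))))**2 + 1e-12
tonal = sfm(tone_psd)

assert abs(flat - 1.0) < 1e-12
assert tonal < 0.5
```

A low SFM indicates a peaky, tonal spectrum that tolerates a finer initial step size; how OCF maps the SFM value to a step size is specified by Eq. (5.13) and is not reproduced here.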
In the inner loop of Figure 7.1, the quantizer step size is iteratively increased, and a new entropy-coded bit stream is formed at each update until the desired bit
[Figure: OCF encoder block diagram: s(n) → input buffer/windowing → DCT → weighting → quantizer → entropy coder → output, with psychoacoustic analysis driving an outer loop and a rate-control inner loop; loop counts are transmitted as side information.]
Figure 7.1. OCF encoder (after [Bran88b]).
rate is achieved. Increasing the step size at each update produces fewer levels, which in turn reduces the bit rate. Using a second iterative procedure, a perceptual analysis is introduced after the inner loop is done. First, critical band analysis is applied. Then, a masking function is applied that combines a flat −6 dB masking threshold with an interband masking threshold, leading to an estimate of JND for each critical band. If, after inner-loop quantization and entropy encoding, the measured distortion exceeds JND in at least one critical band, then quantization step sizes are adjusted only in the out-of-tolerance critical bands. The outer loop repeats until JND criteria are satisfied or a maximum loop count is reached. Entropy-coded transform components are then transmitted to the receiver, along with side information, which includes the log-encoded SFM, the number of quantizer updates during the inner loop, and the number of step-size reductions that occurred for each critical band in the outer loop. This side information is sufficient to decode the transform components and perform reconstruction at the receiver.
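A toy sketch of the nested loops (Python/NumPy; the per-band JND values, initial step sizes, loop limits, and the bit-cost function are invented stand-ins for OCF's psychoacoustic model and entropy coder):

```python
import numpy as np

rng = np.random.default_rng(1)
bands = rng.standard_normal((8, 16))       # toy "critical bands" of DCT coefficients
jnd = np.full(8, 0.05)                     # toy per-band JND (mean-square) thresholds

def code_bits(q):                          # crude stand-in for entropy-coding cost
    return int(np.sum(2 * np.ceil(np.log2(np.abs(q) + 1)) + 1))

step = np.full(8, 0.02)                    # per-band quantizer step sizes
budget = 600
for outer in range(32):                    # outer (distortion-control) loop
    s = step.copy()
    while True:                            # inner (rate-control) loop
        q = np.round(bands / s[:, None])
        if code_bits(q) <= budget:
            break
        s *= 1.25                          # coarser steps -> fewer levels -> fewer bits
    err = np.mean((bands - q * s[:, None])**2, axis=1)
    if np.all(err <= jnd):                 # JND criteria satisfied in every band
        break
    step[err > jnd] /= 1.25                # tighten only out-of-tolerance bands

assert code_bits(q) <= budget              # rate target is always met
```

As in OCF, the loops may fail to satisfy both constraints simultaneously, which is why a maximum loop count bounds the outer iteration.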
Brandenburg in 1988 reported an enhanced OCF (OCF-2), which achieved subjective quality improvements at a reduced bit rate of only 110 kb/s [Bran88a]. The improvements were realized by replacing the DCT with the MDCT and adding a pre-echo detection/compensation scheme. Reconstruction quality is improved due to the effective time-resolution increase (i.e., 50% time overlap) associated with the MDCT. OCF-2 quality is also improved for difficult signals such as triangle and castanets due to a simple pre-echo detection/compensation scheme. The encoder detects pre-echoes using analysis-by-synthesis. Pre-echoes are detected when noise energy in a reconstructed segment (16 samples = 0.36 ms @ 44.1 kHz) exceeds signal energy. The encoder then determines the frequency below which 90% of the signal energy is contained and transmits this cutoff to the decoder. Given pre-echo detection at the encoder (1 bit) and a cutoff frequency, the decoder discards frequency components above the cutoff, in effect low-pass filtering pre-echoes. Due to these enhancements, the OCF-2 was reported to achieve transparency over a wide variety of source material.
Later in 1988, Brandenburg reported further OCF enhancements (OCF-3), in which better quality was realized at a lower bit rate (64 kb/s) with reduced complexity [Bran88b]. This was achieved through differential coding of spectral components to exploit correlation between adjacent samples, an enhanced psychoacoustic model modified to account for temporal masking, and an improved rate-distortion loop.
7.3 PERCEPTUAL TRANSFORM CODER
While Brandenburg developed the OCF algorithm, similar work was simultaneously underway at AT&T Bell Labs. Johnston developed several DFT-based transform coders [John88a] [John89] for audio during the late 1980s that became an integral part of the ASPEC proposal. Johnston's work in perceptual entropy [John88b] forms the basis for a transform coder reported in 1988 [John88a] that
[Figure: PXFM encoder block diagram: s(n) → 2048-point FFT → quantizers → bit packing → channel; psychoacoustic analysis and threshold adjustment form a bit allocation loop, with thresholds Ti and side information transmitted.]
Figure 7.2. PXFM encoder (after [John88a]).
achieves transparent coding of FM-quality monaural audio signals (Figure 7.2). A stereophonic coder based on similar principles was developed later.
7.3.1 PXFM
A monaural algorithm, the perceptual transform coder (PXFM), was developed first. The idea behind the PXFM is to estimate the amount of quantization noise that can be inaudibly injected into each transform-domain subband using PE estimates. The coder works as follows. The signal is first windowed into overlapping (1/16) segments and transformed using a 2048-point FFT. Next, the PE procedure described in Section 5.6 is used to estimate JND thresholds for each critical band. Then, an iterative quantization loop adapts a set of 128 subband quantizers to satisfy the JND thresholds until the fixed bit rate is achieved. Finally, quantization and bit packing are performed. Quantized transform components are transmitted to the receiver along with appropriate side information. Quantization subbands consist of 8-sample blocks of complex-valued transform components. The quantizer adaptation loop first initializes the j ∈ [1, 128] subband quantizers (1024 unique FFT components/8 components per subband) with kj levels and step sizes of Ti as follows:

kj = 2 nint(Pj / Ti) + 1,    (7.1)
where Ti are the quantized critical band JND thresholds, Pj is the quantized magnitude of the largest real or imaginary transform component in the j-th subband, and nint() is the nearest-integer rounding function. The adaptation process involves repeated application of two steps. First, bit packing is attempted using the current quantizer set. Although many bit-packing techniques are possible, one simple scenario involves sorting quantizers in kj order, then filling 64-bit words with encoded transform components according to the sorted results. After bit packing, Ti are adjusted by a carefully controlled scale factor, and the adaptation cycle repeats. Quantizer adaptation halts as soon as the packed data length satisfies the desired bit rate. Both Pj and the modified Ti are quantized on a dB scale using 8-bit uniform quantizers with a 170 dB dynamic
range. These parameters are transmitted as side information and used at the receiver to recover the quantization levels (and thus the implicit bit allocations) for each subband, which are, in turn, used to decode the quantized transform components. The DC FFT component is quantized with 16 bits and is also transmitted as side information.
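Eq. (7.1) can be exercised directly. A tiny sketch (Python; the peak magnitudes and thresholds are invented values, and `np.rint` stands in for nint()):

```python
import numpy as np

def n_levels(P_j, T_i):
    """Eq. (7.1): odd number of quantizer levels for subband j."""
    return 2 * int(np.rint(P_j / T_i)) + 1

assert n_levels(0.2, 1.0) == 1     # peak far below the JND threshold: only the zero level
assert n_levels(1.0, 1.0) == 3     # peak at the threshold: three levels
assert n_levels(4.9, 1.0) == 11    # larger peak-to-threshold ratio: more levels
```

Raising Ti during adaptation therefore shrinks kj toward a single (zero) level in well-masked subbands, which is what shortens the packed bitstream.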
7.3.2 SEPXFM
In 1989, Johnston extended the PXFM coder to handle stereophonic signals (SEPXFM) and attained transparent coding of a CD-quality stereophonic channel at 192 kb/s, or 2.2 bits/sample. SEPXFM [John89] realizes performance improvements over PXFM by exploiting inherent stereo cross-channel redundancy and by assuming that both channels are presented to a single listener rather than being used as separate signal sources. The SEPXFM structure is similar to that of PXFM, with variable-radix bit packing replaced by adaptive entropy coding. Side information is therefore reduced to include only adjusted JND thresholds (step sizes) and pointers to the entropy codebooks used in each transform-domain subband. The coder works in the following manner. First, sum (L + R) and difference (L − R) signals are extracted from the left (L) and right (R) channels to exploit left/right redundancy. Next, the sum and difference signals are windowed and transformed using the FFT. Then, a single JND threshold for each critical band is established via the PE method using the summed power spectra from the L + R and L − R signals. A single combined JND threshold is applied to quantization noise shaping for both signals (L + R and L − R), based upon the assumption that a listener is more than one "critical distance" [Jetz79] away from the stereo speakers.
Like PXFM, a fixed bit rate is achieved by applying an iterative threshold adjustment procedure after the initial determination of JND levels. The adaptation process, analogous to PXFM bit-rate adjustment and bit packing, consists of several steps. First, transform components from both (L + R) and (L − R) are split into subband blocks, each averaging 8 real/imaginary samples. Then, one of six entropy codebooks is selected for each subband based on the average component magnitude within that subband. Next, transform components are quantized given the JND levels and encoded using the selected codebook. Subband codebook selections are themselves entropy encoded and transmitted as side information. After encoding, JND thresholds are scaled by an estimator, and the quantizer adaptation process repeats. Threshold adaptation stops when the combined bitstream of quantized JND levels, Huffman-encoded (L + R) components, Huffman-encoded (L − R) components, and Huffman-encoded average magnitudes achieves the desired bit rate. The Huffman codebooks are developed using a large music and speech database. They are optimized for difficult signals at the expense of mean compression rate. It is also interesting to note that headphone listeners reported no noticeable acoustic mixing, despite the critical distance assumption and the single combined JND level estimate for both channels, (L + R) and (L − R).
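The sum/difference preprocessing itself is a lossless, invertible transform. One sketch (Python/NumPy, with toy channel data of our own) showing that L and R are exactly recoverable and that cross-channel redundancy concentrates energy in the sum signal:

```python
import numpy as np

rng = np.random.default_rng(4)
L = rng.standard_normal(64)
R = 0.9 * L + 0.1 * rng.standard_normal(64)   # strongly correlated channels

mid, side = L + R, L - R                      # coded in place of L and R
L2, R2 = (mid + side) / 2, (mid - side) / 2   # decoder reconstruction
assert np.allclose(L2, L) and np.allclose(R2, R)

# left/right redundancy -> the difference signal carries far less energy
assert np.sum(side**2) < np.sum(mid**2)
```

The low-energy difference signal needs fewer bits after entropy coding, which is one source of SEPXFM's gain over coding L and R independently.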
7.4 BRANDENBURG-JOHNSTON HYBRID CODER
Johnston and Brandenburg [Bran90] collaborated in 1990 to produce a hybrid coder that, strictly speaking, is both a subband and transform coding algorithm. The idea behind the hybrid coder is to improve time and frequency resolution relative to OCF and PXFM by constructing a filter bank that more closely resembles the auditory filter bank. This is accomplished at the encoder by first splitting the input signal into four octave-width subbands using a QMF filter bank.
The decimated output sequence from each subband is then followed by one or more transforms to achieve the desired time/frequency resolution, as shown in Figure 7.3(a). Both the DFT and the MDCT were investigated. Given the tiling of the time-frequency plane shown in Figure 7.3(b), frequency resolution at low frequencies (23.4 Hz) is well matched to the ear, while the time resolution at high frequencies (2.7 ms) is sufficient for pre-echo control.
The quantization and coding schemes of the hybrid coder combine elements from both PXFM and OCF. Masking thresholds are estimated using the PXFM
[Figure: (a) filter-bank structure: s(n) is split by a cascade of 80-tap QMF stages into 0-3/3-6 kHz, 0-6/6-12 kHz, and 0-12/12-24 kHz bands, each followed by 64-point DFTs (128-point in the lowest band), yielding 320 lines per frame; (b) time/frequency tiling over 1024 samples: 128 lines at 23 Hz/21 ms at low frequencies, then 64-line tiles at 47 Hz/11 ms, 94 Hz/5 ms, and 188 Hz/2.7 ms up to 24 kHz.]
Figure 7.3. Brandenburg-Johnston coder: (a) filter-bank structure, (b) time/freq. tiling (after [Bran90]).
approach for eight time slices in each frequency subband. A more sophisticated tonality estimate was defined to replace the SFM (Eq. (5.13)) used in PXFM, however, such that tonality is estimated in the hybrid coder as a local characteristic of each individual spectral line. Predictability of magnitude and phase spectral components across time is used to evaluate tonality, instead of just global spectral shape within a single frame. High temporal predictability of magnitudes and phases is associated with the presence of a tonal signal. In contrast, low predictability implies the presence of a noise-like signal. The hybrid coder employs a quantization and coding scheme borrowed from OCF. As far as quality, the hybrid coder, without any explicit pre-echo control mechanism, was reported to achieve quality better than or equal to OCF-3 at 64 kb/s [Bran90]. The only disadvantage noted by the authors was increased complexity. A similar hybrid structure was eventually adopted in MPEG-1 and -2 Layer III.
7.5 CNET CODERS
Research at the Centre National d'Etudes des Telecommunications (CNET) resulted in several transform coders based on the DFT and the MDCT.
7.5.1 CNET DFT Coder
In 1989, Mahieux, Petit, et al. proposed a DFT-based audio coding system that introduced a novel scheme to exploit DFT interblock redundancy. Nearly transparent quality was reported for 15-kHz (FM-grade) audio at 96 kb/s [Mahi89], except for some highly harmonic signals. The encoder applies first-order backward-adaptive predictors (across time) to DFT magnitude and differential phase components, then quantizes the prediction residuals separately. Magnitude and differential phase residuals are quantized using, respectively, an adaptive nonuniform pdf-optimized quantizer designed for a Laplacian distribution and an adaptive uniform quantizer. The backward-adaptive quantizers are reinitialized during transients. Bits are allocated during step-size adaptation to shape quantization noise such that a psychoacoustic noise threshold is satisfied for each block. The perceptual model used is similar to Johnston's model described earlier in Section 5.6. The use of linear prediction is justified because it exploits magnitude and differential phase time redundancy, which tends to be large during periods when the audio signal is quasi-stationary, especially for signal harmonics. Quasi-stationarity might occur, for example, during a sustained note. A similar technique was eventually embedded in the MPEG-2 AAC algorithm.
7.5.2 CNET MDCT Coder 1
In 1990, Mahieux and Petit reported on the development of a similar MDCT-based transform coder for which they reported transparent CD-quality at 64 kb/s [Mahi90]. This algorithm introduced a novel spectrum descriptor scheme for representing the power spectral envelope. The algorithm first segments input audio into frames of 1024 samples, corresponding to 12 ms of new data per frame,
given 50% MDCT time overlap. Then, a bit allocation is computed at the encoder using a set of "spectrum descriptors." Spectrum descriptors consist of quantized sample variances for MDCT coefficients grouped into 35 nonuniform frequency subbands. Like their DFT coder, this algorithm exploits either interblock or intrablock redundancy by differentially encoding the spectrum descriptors with respect to time or frequency and transmitting them to the receiver as side information. A decision whether to code with respect to time or frequency is made on the basis of which method requires fewer bits; the binary decision requires only 1 bit. Either way, spectral descriptor encoding is done using log DPCM with a first-order predictor and a 16-level uniform quantizer with a step size of 5 dB. Huffman coding of the spectral descriptor codewords results in less than 2 bits/descriptor. A global masking threshold is estimated by convolving the spectral descriptors with a basilar spreading function on a Bark scale, somewhat like the approach taken by Johnston's PXFM. Bit allocations for quantization of normalized transform coefficients are obtained from the masking threshold estimate. As usual, bits are allocated such that quantization noise is below the masking threshold at every spectral line. Transform coefficients are normalized by the appropriate spectral descriptor, then quantized and coded, with one exception. Masked transform coefficients, which have lower energy than the global masking threshold, are treated differently. The authors found that masked coefficient bins tend to be clustered; therefore, they can be compactly represented using run-length encoding (RLE). RLE codewords are Huffman coded for maximum coding gain. The first CNET MDCT coder was reported to perform well for broadband signals with many harmonics, but had some problems in the case of spectrally flat signals.
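Run-length coding of clustered masked (zeroed) coefficients can be sketched as follows (plain Python; the token format is our own toy choice, and a real coder would Huffman-code the resulting tokens):

```python
def rle_encode(q):
    """Encode a coefficient list as ('Z', run_length) and ('V', value) tokens."""
    out, run = [], 0
    for v in q:
        if v == 0:
            run += 1                      # extend the current run of masked bins
        else:
            if run:
                out.append(('Z', run))    # flush the zero run
                run = 0
            out.append(('V', v))
    if run:
        out.append(('Z', run))
    return out

def rle_decode(tokens):
    q = []
    for t, x in tokens:
        q.extend([0] * x if t == 'Z' else [x])
    return q

q = [3, 0, 0, 0, 0, -1, 0, 0, 2]
assert rle_decode(rle_encode(q)) == q     # lossless round trip
```

Because masked bins cluster, long runs collapse into single tokens, which is where the compaction comes from.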
7.5.3 CNET MDCT Coder 2
Mahieux and Petit enhanced their 64 kb/s algorithm by incorporating a sophisticated pre-echo detection and postfiltering scheme, as well as by incorporating a novel quantization scheme for two-coefficient (low-frequency) spectral descriptor bands [Mahi94]. For improved quantization performance, two-component spectral descriptors are efficiently vector encoded in terms of polar coordinates. Pre-echoes are detected at the encoder and flagged using 1 bit. The idea behind the pre-echo compensation is to temporarily activate a postfilter at the decoder in the corrupted quiet region prior to the signal attack, and therefore a stopping index must also be transmitted. The second-order IIR postfilter difference equation is given by
spf(n) = b0 s(n) + a1 spf(n − 1) + a2 spf(n − 2),    (7.2)
where s(n) is the nonpostfiltered output signal that is corrupted by pre-echo distortion, spf(n) is the postfiltered output signal, and the ai are related to the parameters αi by
a1 = α1 [1 − (p(0, 0) / (p(0, 0) + σb²))],    (7.3a)

a2 = α2 [1 − (p(1, 0) / (p(0, 0) + σb²))],    (7.3b)
where the αi are the parameters of a second-order autoregressive (AR-2) spectral estimate of the output audio, s(n), during the previous nonpostfiltered frame. The AR-2 estimate, s̃(n), can be expressed in the time domain as

s̃(n) = w(n) + α1 s(n − 1) + α2 s(n − 2),    (7.4)
where w(n) represents Gaussian white noise. The prediction error is then defined as

e(n) = s(n) − s̃(n).    (7.5)
The parameters p(i, j) in Eqs. (7.3a) and (7.3b) are elements of the prediction error covariance matrix, P, and the parameter σb² is the pre-echo distortion variance, which is derived from side information. Pre-echo postfiltering and improved quantization schemes resulted in a subjective score of 3.65 for two-channel stereo coding at 64 kb/s per channel on the 5-point CCIR 5-grade impairment scale (described in Section 12.3) over a wide range of listening material. The CCIR J.41 reference audio codec (MPEG-1 Layer II) achieved a score of 3.84 at 384 kb/s per channel over the same set of tests.
7.6 ADAPTIVE SPECTRAL ENTROPY CODING
The MSC, OCF, PXFM, Brandenburg-Johnston hybrid, and CNET transform coders were eventually clustered into a single proposal by the ISO/IEC JTC1/SC2/WG11 committee. As a result, Schroeder, Brandenburg, Johnston, Herre, and Mahieux collaborated in 1991 to propose for acceptance as the new MPEG audio compression standard a flexible coding algorithm, ASPEC, which incorporated the best features of each coder in the group. ASPEC [Bran91] was claimed to produce better quality than any of the individual coders at 64 kb/s.
The structure of ASPEC combines elements from all of its predecessors. Like OCF and the later CNET coders, ASPEC uses the MDCT for time-frequency mapping. The masking model is similar to that used in PXFM and the Brandenburg-Johnston hybrid coders, including the sophisticated tonality estimation scheme at lower bit rates. The quantization and coding procedures use the pair of nested loops proposed for OCF, as well as the block differential coding scheme developed at CNET. Moreover, long runs of masked coefficients are run-length and Huffman encoded. Quantized scale factors and transform coefficients are Huffman coded also. Pre-echoes are controlled using a dynamic window switching mechanism, like the Thomson coder [Edle89]. ASPEC offers several modes for different quality levels, ranging from 64 to 192 kb/s per channel. A real-time ASPEC implementation for coding one channel at 64 kb/s was realized on a pair of 33-MHz Motorola DSP56001 devices. ASPEC ultimately formed the basis for layer III of the MPEG-1 and MPEG-2/BC-LSF standards. We note that similar contributions were made in the area of transform coding for audio outside of the ASPEC cluster. For example, Iwadare et al. reported on DCT-based [Sugi90] and MDCT-based [Iwad92] perceptual adaptive transform coders that control pre-echo distortion using an adaptive window size.
7.7 DIFFERENTIAL PERCEPTUAL AUDIO CODER
Other investigators have also developed promising schemes for transform coding of audio. Paraskevas and Mourjopoulos [Para95] reported on a differential perceptual audio coder (DPAC), which makes use of a novel scheme for exploiting long-term correlations. DPAC works as follows. Input audio is transformed using the MDCT. A two-state classifier then labels each new frame of transform coefficients as either a "reference" frame or a "simple" frame. The classifier labels as "reference" frames that contain significant audible differences from the previous frame. The classifier labels nonreference frames as "simple." Reference frames are quantized and encoded using scalar quantization and psychoacoustic bit allocation strategies similar to Johnston's PXFM. Simple frames, however, are subjected to coefficient substitution. Coefficients whose magnitude differences with respect to the previous reference frame are below an experimentally optimized threshold are replaced at the decoder by the corresponding reference frame coefficients. The encoder then replaces subthreshold coefficients with zeros, thus saving transmission bits. Unlike the interframe predictive coding schemes of Mahieux and Petit, the DPAC coefficient substitution system is advantageous in that it guarantees that the "simple" frame bit allocation will always be less than or equal to the bit allocation that would be required if the frame was coded as a "reference" frame. Suprathreshold "simple" frame coefficients are coded in the same way as reference frame coefficients. DPAC performance was evaluated for frame classifiers that utilized three different selection criteria:
1. Euclidean distance. Under the Euclidean criterion, test frames satisfying the inequality

$$ \left[ \frac{\mathbf{s}_d^T \mathbf{s}_d}{\mathbf{s}_r^T \mathbf{s}_r} \right]^{1/2} \leq \lambda \qquad (7.6) $$

are classified as simple, where the vectors s_r and s_t, respectively, contain reference and test frame time-domain samples, and the difference vector s_d is defined as

$$ \mathbf{s}_d = \mathbf{s}_r - \mathbf{s}_t \qquad (7.7) $$
2. Perceptual entropy. Under the PE criterion (Eq. (5.17)), a test frame is labeled as "simple" if it satisfies the inequality

$$ \frac{PE_S}{PE_R} \leq \lambda \qquad (7.8) $$

where PE_S corresponds to the PE of the "simple" (coefficient-substituted) version of the test frame, and PE_R corresponds to the PE of the unmodified test frame.
3. Spectral flatness measure. Finally, under the SFM criterion (Eq. (5.13)), a test frame is labeled as "simple" if it satisfies the inequality

$$ \left| 10 \log_{10} \frac{SFM_T}{SFM_R} \right| \leq \lambda \qquad (7.9) $$
where SFM_T corresponds to the test frame SFM, and SFM_R corresponds to the SFM of the previous reference frame. The decision threshold, λ, was experimentally optimized for all three criteria. Best performance was obtained while encoding source material using a PE criterion. As far as overall performance is concerned, noise-to-mask ratio (NMR) measurements were compared between DPAC and Johnston's PXFM algorithm at 64, 88, and 128 kb/s. Despite an average drop of 30–35% in PE measured at the DPAC coefficient substitution stage output relative to the coefficient substitution input, comparative NMR studies indicated that DPAC outperforms PXFM only below 88 kb/s, and then only for certain types of source material such as pop or jazz music. The desirable PE reduction led to an undesirable drop in reconstruction quality. The authors concluded that DPAC may be preferable to algorithms such as PXFM for low-bit-rate, nontransparent applications.
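To make the classifier concrete, the sketch below computes two of the three selection measures, the Euclidean criterion of Eq. (7.6) and the SFM ratio of Eq. (7.9), and labels a frame accordingly. The threshold values and the FFT-based SFM computation are illustrative assumptions; [Para95] optimized λ experimentally and applied one criterion at a time.

```python
import numpy as np

def classify_frame(s_ref, s_test, lam_euclid=0.1, lam_sfm_db=3.0):
    """Label a test frame 'simple' or 'reference' using the Euclidean
    criterion (Eq. (7.6)) and the SFM criterion (Eq. (7.9)).
    Thresholds are illustrative, not the optimized values of [Para95]."""
    s_d = s_ref - s_test                                 # Eq. (7.7)
    euclid = np.sqrt((s_d @ s_d) / (s_ref @ s_ref))      # Eq. (7.6)

    def sfm(x):
        # Spectral flatness: geometric over arithmetic mean of the power spectrum.
        p = np.abs(np.fft.rfft(x)) ** 2 + 1e-12
        return np.exp(np.mean(np.log(p))) / np.mean(p)

    sfm_db = abs(10.0 * np.log10(sfm(s_test) / sfm(s_ref)))  # Eq. (7.9)
    is_simple = euclid <= lam_euclid and sfm_db <= lam_sfm_db
    return "simple" if is_simple else "reference"
```

A frame nearly identical to its reference scores low on both measures and is coded as "simple" (coefficient-substituted); a frame with audible differences falls back to full "reference" coding.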
7.8 DFT NOISE SUBSTITUTION
Whereas DPAC exploits temporal correlation, a substitution technique that exploits decorrelation was devised for coding efficiently noise-like portions of the spectrum. In a noise substitution procedure [Schu96], Schulz parameterizes transform coefficients corresponding to noise-like portions of the spectrum in terms of average power, frequency range, and temporal evolution, resulting in an increased coding efficiency of 15% on average. A temporal envelope for each parametric noise band is required because transform block sizes for most codecs are much longer (e.g., 30 ms) than the human auditory system's temporal resolution (e.g., 2 ms). In this method, noise-like spectral regions are identified in the following way. First, least-mean-square (LMS) adaptive linear predictors (LP) are applied to the output channels of a multi-band QMF analysis filter bank that has as input the original audio, s(n). A predicted signal, ŝ(n), is obtained by passing the LP output sequences through the QMF synthesis filter bank. Prediction is done in subbands rather than over the entire spectrum to prevent classification errors that could result if high-energy noise subbands are allowed to dominate predictor adaptation, resulting in misinterpretation of low-energy tonal subbands as noisy. Next, the DFT is used to obtain the magnitude components, (S(k), Ŝ(k)), and phase components, (θ(k), θ̂(k)), of the input, s(n), and prediction, ŝ(n), respectively. Then, tonality T(k) is estimated as a function of the magnitude and phase predictability, i.e.,
$$ T(k) = \alpha \left| \frac{S(k) - \hat{S}(k)}{S(k)} \right| + \beta \left| \frac{\theta(k) - \hat{\theta}(k)}{\theta(k)} \right| \qquad (7.10) $$
where α and β are experimentally determined constants. Noise substitution is applied to contiguous blocks of transform coefficient bins for which T(k) is very small. The 15% average bit savings realized using this method in conjunction with transform coding is offset to a large extent by a significant complexity increase due to the additions of the adaptive linear predictors and a multi-band analysis-synthesis QMF filter bank. As a result, the author focused his attention on the application of noise substitution to QMF-based subband coding algorithms. A modified version of this scheme was adopted as part of the MPEG-2 AAC time-frequency coder within the MPEG-4 reference model [Herr98].
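A minimal sketch of the tonality measure in Eq. (7.10) is given below, assuming α and β are simply set to 0.5 each (the actual constants in [Schu96] were determined experimentally) and adding a small guard term against division by zero, which the equation leaves implicit.

```python
import numpy as np

def tonality(s_frame, s_pred, alpha=0.5, beta=0.5):
    """Estimate per-bin tonality T(k) from magnitude and phase
    predictability, Eq. (7.10). alpha and beta are placeholders for
    the experimentally determined constants in [Schu96]."""
    S = np.fft.rfft(s_frame)       # DFT of the input frame
    S_hat = np.fft.rfft(s_pred)    # DFT of the subband-LP prediction
    mag, mag_hat = np.abs(S), np.abs(S_hat)
    ph, ph_hat = np.angle(S), np.angle(S_hat)
    eps = 1e-9  # guard against division by zero bins
    return (alpha * np.abs((mag - mag_hat) / (mag + eps))
            + beta * np.abs((ph - ph_hat) / (ph + eps)))
```

Perfectly predictable (tonal) bins yield T(k) near zero, while poor magnitude or phase prediction drives T(k) upward, leaving only the well-predicted contiguous regions as candidates for noise substitution.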
7.9 DCT WITH VECTOR QUANTIZATION
For the most part, the algorithms described thus far rely upon scalar quantization of transform coefficients. This is not unreasonable, since scalar quantization in combination with entropy coding can achieve very good performance. As one might expect, however, vector quantization (VQ) has also been applied to transform coding of audio, although on a much more limited scale. Gersho and Chan investigated VQ schemes for coding DCT coefficients subject to a constraint of minimum perceptual distortion. They reported on a variable-rate coder [Chan90] that achieves high quality in the range of 55–106 kb/s for audio sequences bandlimited to 15 kHz (32-kHz sample rate). After computing the DCT on 512-sample blocks, the algorithm utilizes a novel multi-stage tree-structured VQ (MSTVQ) scheme for quantization of normalized vectors, with each vector containing four DCT components. Bit allocation and vector normalization are derived at both the encoder and decoder from a sampled power spectral envelope which consists of 29 groups of transform coefficients. A simplified masking model assumes that each sample of the power envelope represents a single masker. Masking is assumed to be additive, as in the ASPEC algorithms. Thresholds are computed as a fixed offset from the masking level. The authors observed a strong correlation between the SFM and the amount of offset required to achieve high quality. Two-segment scalar quantizers that are piecewise linear on a dB scale are used to encode the power spectral envelope. Quadratic interpolation is used to restore full resolution to the subsampled envelope.
Gersho and Chan later enhanced [Chan91b] their algorithm by improving the power envelope and transform coefficient quantization schemes. In the new approach to quantization of transform coefficients, constrained-storage VQ [Chan91a] techniques are combined with the MSTVQ from the original coder, allowing the new coder to handle peak noise-to-mask ratio (NMR) requirements without impractical codebook storage requirements. In fact, CS-MSTVQ enabled quantization of 127 four-coefficient vectors using only four unique quantizers. Power spectral envelope quantization is enhanced by extending its resolution to 127 samples. The power envelope samples are encoded using a two-stage process. The first stage applies nonlinear interpolative VQ (NLIVQ), a dimensionality reduction process which represents the 127-element power spectral envelope vector using only a 12-dimensional "feature power envelope." Unstructured VQ is applied to the feature
power envelope. Then, a full-resolution quantized envelope is obtained from the unstructured VQ index into a corresponding interpolation codebook. In the second stage, segments of a power envelope residual are encoded using 8-, 9-, and 10-element TSVQ. Relative to their first VQ/DCT coder, the authors reported savings of 10–20 kb/s with no reduction in quality, due to the CS-VQ and NLIVQ schemes. Although VQ schemes with this level of sophistication typically have not been seen in the audio coding literature since [Chan90] and [Chan91b] first appeared, there have been successful applications of less-sophisticated VQ in some of the standards (e.g., [Sree98a] [Sree98b]).
7.10 MDCT WITH VECTOR QUANTIZATION
Iwakami et al. developed transform-domain weighted interleave vector quantization (TWIN-VQ), an MDCT-based coder that also involves transform coefficient VQ [Iwak95]. This algorithm exploits LPC analysis, spectral interframe redundancy, and interleaved VQ.
At the encoder (Figure 7.4), each frame of MDCT coefficients is first divided by the corresponding elements of the LPC spectral envelope, resulting in a spectrally flattened quotient (residual) sequence. This procedure flattens the MDCT envelope but does not affect the fine structure. The next step, therefore, divides the first-step residual by a predicted fine structure envelope. This predicted fine structure envelope is computed as a weighted sum of three previous quantized fine structure envelopes, i.e., using backward prediction. Interleaved VQ is applied to the normalized second-step residual. The interleaved VQ vectors are structured in the following way. Each N-sample normalized second-step residual vector is split into K subvectors, each containing N/K coefficients. Second-step residuals from the N-sample vector are interleaved in the K subvectors such that the i-th subvector contains elements i + nK, where n = 0, 1, ..., (N/K) − 1. Perceptual weighting is also incorporated by weighting each subvector by a nonlinearly transformed version of its corresponding LPC envelope component prior to the codebook search. VQ indices are transmitted to the receiver. Side information
Figure 7.4 TWIN-VQ encoder (after [Iwak95]): s(n) is transformed by the MDCT, normalized by the LPC envelope (from LPC analysis) and by an interframe-predicted fine structure envelope, then quantized by weighted interleave VQ; the VQ indices and side information are transmitted.
consists of VQ normalization coefficients and the LPC envelope encoded in terms of LSPs. The authors claimed higher subjective quality than MPEG-1 layer II at 64 kb/s for 48-kHz CD-quality audio, as well as higher quality than MPEG-1 layer II for 32-kHz audio at 32 kb/s.
TWIN-VQ performance at lower bit rates has also been investigated. At least three trends were identified during ISO-sponsored comparative tests [ISOI98] of TWIN-VQ and MPEG-2 AAC. First, AAC outperformed TWIN-VQ for bit rates above 16 kb/s. Secondly, TWIN-VQ and AAC achieved similar performance at 16 kb/s, with AAC having a slight edge. Finally, the performance of TWIN-VQ exceeded that of AAC at a rate of 8 kb/s. These results ultimately motivated a combined AAC/TWIN-VQ architecture for inclusion in MPEG-4 [Herre98]. Enhancements to the weighted interleaving scheme and LPC envelope representation [Mori96] enabled real-time implementation of stereo decoders on Pentium-I and PowerPC platforms. Channel error robustness issues are addressed in [Iked95]. A later version of the TWIN-VQ scheme is embedded in the set of tools for MPEG-4 audio.
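The interleaving rule described above (the i-th subvector collects residual elements i + nK) is easy to state in code. The sketch below is a direct transcription of that indexing, not the full TWIN-VQ quantizer; the perceptual weighting and codebook search are omitted.

```python
import numpy as np

def interleave(residual, K):
    """Split an N-sample normalized residual into K interleaved
    subvectors: subvector i holds elements i, i+K, i+2K, ...
    (indices i + nK, n = 0, ..., N/K - 1), as in TWIN-VQ."""
    N = len(residual)
    assert N % K == 0, "N must be a multiple of K"
    return [residual[i::K] for i in range(K)]

def deinterleave(subvectors):
    """Inverse of interleave(): reassemble the N-sample residual."""
    K = len(subvectors)
    N = K * len(subvectors[0])
    out = np.empty(N, dtype=np.asarray(subvectors[0]).dtype)
    for i, sv in enumerate(subvectors):
        out[i::K] = sv
    return out
```

Interleaving spreads adjacent (spectrally correlated) coefficients across different subvectors, so each codebook sees a statistically similar mix of low- and high-frequency residual samples.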
7.11 SUMMARY
Transform coders for high-fidelity audio were described in this chapter. The transform coding algorithms presented include:

- the OCF algorithm;
- the monaural and stereophonic perceptual transform coders (PXFM and SEPXFM);
- the CNET DFT and MDCT coders;
- the ASPEC algorithm;
- the differential PAC;
- the TWIN-VQ algorithm.
PROBLEMS
7.1 Given the expressions for the DFT, the DCT, and the MDCT,

$$ X_{DFT}(k) = \frac{1}{\sqrt{2M}} \sum_{n=0}^{2M-1} x(n)\, e^{-j\pi nk/M}, \quad 0 \leq k \leq 2M-1, $$

$$ X_{DCT}(k) = c(k) \sqrt{\frac{2}{M}} \sum_{n=0}^{M-1} x(n) \cos\left[\frac{\pi}{M}\left(n + \frac{1}{2}\right)k\right], \quad 0 \leq k \leq M-1, $$

where c(0) = 1/√2 and c(k) = 1 for 1 ≤ k ≤ M − 1, and

$$ X_{MDCT}(k) = \sqrt{\frac{2}{M}} \sum_{n=0}^{2M-1} x(n) \underbrace{\sin\left[\left(n + \frac{1}{2}\right)\frac{\pi}{2M}\right]}_{w(n)} \cos\left[\frac{(2n + M + 1)(2k + 1)\pi}{4M}\right], \quad 0 \leq k \leq M-1. $$
Figure 7.5 FFT analysis/synthesis within the two bands of a QMF bank: the input x(n) is split by analysis filters H0(z) and H1(z), each followed by 2:1 decimation; each band signal x_d0(n), x_d1(n) passes through an FFT analysis/synthesis module (FFT size N = 128, with L of the N components selected); the processed band signals x'_d0(n), x'_d1(n) are 1:2 interpolated and recombined by synthesis filters F0(z) and F1(z) to form x'(n).
Write the three transforms in matrix form as follows:

$$ X_T = Hx, $$

where H is the transform matrix, and x and X_T denote the input and transformed vector, respectively. Note the structure in the transform matrices.
7.2 Give the signal flowgraph of the FFT butterfly structure for an 8-point DFT, an 8-point DCT, and an 8-point MDCT. Specify clearly the values on the nodes and the branches. [Hint: See Problem 6.16 and Figure 6.18 in Chapter 6.]
COMPUTER EXERCISES
7.3 In this problem, we will study the energy compaction of the DFT and the DCT. Use x(n) = e^{−0.5n} sin(0.4πn), n = 0, 1, ..., 15. Plot the 16-point DFT and the 16-point DCT of the input signal x(n). See how the energy of the sequence is concentrated. Now pick two peaks of the DFT vector and the DCT vector, and synthesize the input signal x(n). Let the synthesized signals be x̂_DFT(n) and x̂_DCT(n). Compute the MSE values between the input signal and the two reconstructed signals. Repeat this for four peaks, six peaks, and eight peaks. Plot the estimated MSE values across the number of peaks selected, and comment on your result.
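A starting-point sketch for this exercise is shown below (plotting omitted). The orthonormal DCT-II matrix is built from the definition in Problem 7.1, and "peaks" are taken to be the largest-magnitude transform coefficients; both choices are assumptions you may vary.

```python
import numpy as np

def dct_matrix(M):
    """Orthonormal DCT-II matrix from the definition in Problem 7.1."""
    k = np.arange(M)[:, None]
    n = np.arange(M)[None, :]
    C = np.sqrt(2.0 / M) * np.cos(np.pi / M * (n + 0.5) * k)
    C[0, :] /= np.sqrt(2.0)   # c(0) = 1/sqrt(2)
    return C

def compaction_mse(x, num_peaks):
    """MSE after resynthesizing x from its num_peaks largest-magnitude
    DFT and DCT coefficients; returns (mse_dft, mse_dct)."""
    C = dct_matrix(len(x))
    X_dft = np.fft.fft(x)
    X_dct = C @ x

    def keep_peaks(X):
        out = np.zeros_like(X)
        idx = np.argsort(np.abs(X))[-num_peaks:]  # largest-magnitude bins
        out[idx] = X[idx]
        return out

    x_dft = np.real(np.fft.ifft(keep_peaks(X_dft)))
    x_dct = C.T @ keep_peaks(X_dct)   # orthonormal: C^{-1} = C^T
    return (np.mean((x - x_dft) ** 2), np.mean((x - x_dct) ** 2))

n = np.arange(16)
x = np.exp(-0.5 * n) * np.sin(0.4 * np.pi * n)
mse = {p: compaction_mse(x, p) for p in (2, 4, 6, 8)}
```

Because both transforms are orthogonal (up to the FFT scaling), the reconstruction error equals the energy in the discarded bins, so the MSE is non-increasing in the number of retained peaks.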
7.4 This computer exercise is a combination of Problems 2.24 and 2.25 in Chapter 2. In particular, the FFT analysis/synthesis module in Problem 2.25 will be used within the two bands of the QMF bank. The configuration is shown in Figure 7.5.
a. Given H0(z) = 1 − z^{−1} and H1(z) = 1 + z^{−1}, choose F0(z) and F1(z) such that the aliasing term can be cancelled. Use L = 32 and the peak-picking method for component selection. Perform speech synthesis and give time-domain plots of both input and output speech records.
b. Use the same voiced frame selected in Problem 2.24. Give time-domain and frequency-domain plots of x′_d0(n) and x′_d1(n) in Figure 7.5.
c. Compute the overall SNR (between x(n) and x′(n)), and estimate a MOS score for the output speech.
d. Describe whether the perceptual quality of the output speech improves if the FFT analysis/synthesis module is employed within the subbands instead of using it for the entire band.
CHAPTER 8
SUBBAND CODERS
8.1 INTRODUCTION
Similar to the transform coders described in the previous chapter, subband coders also exploit signal redundancy and psychoacoustic irrelevancy in the frequency domain. The audible frequency spectrum (20 Hz–20 kHz) is divided into frequency subbands using a bank of bandpass filters. The output of each filter is then sampled and encoded. At the receiver, the signals are demultiplexed, decoded, demodulated, and then summed to reconstruct the signal. Audio subband coders realize coding gains by efficiently quantizing decimated output sequences from perfect reconstruction filter banks. Efficient quantization methods usually rely upon psychoacoustically controlled dynamic bit allocation rules that allocate bits to subbands in such a way that the reconstructed output signal is free of audible quantization noise or other artifacts. In a generic subband audio coder, the input signal is first split into several uniform or nonuniform subbands using some critically sampled, perfect reconstruction (or nearly perfect reconstruction) filter bank. Nonideal reconstruction properties in the presence of quantization noise are compensated for by utilizing subband filters that have good sidelobe attenuation. Then, decimated output sequences from the filter bank are normalized and quantized over short, 2–10 ms blocks. Psychoacoustic signal analysis is used to allocate an appropriate number of bits for the quantization of each subband. The usual approach is to allocate an adequate number of bits to mask quantization noise in each block while simultaneously satisfying some bit rate constraint. Since masking thresholds and hence bit allocation requirements are time-varying, buffering is often introduced to match the coder output to a fixed rate. The encoder
Audio Signal Processing and Coding, by Andreas Spanias, Ted Painter, and Venkatraman Atti. Copyright © 2007 by John Wiley & Sons, Inc.
sends to the decoder quantized subband output samples, normalization scale factors for each block of samples, and bit allocation side information. Bit allocation may be transmitted as explicit side information, or it may be implicitly represented by some parameter such as the scale factor magnitudes. The decoder uses side information and scale factors in conjunction with an inverse filter bank to reconstruct a coded version of the original input.
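The masking-driven bit allocation described above can be sketched as a greedy water-filling over subband signal-to-mask ratios. This is an illustrative simplification, assuming each allocated bit buys roughly 6.02 dB of quantization SNR; real coders iterate against a psychoacoustic model and a rate buffer.

```python
def allocate_bits(smr_db, total_bits, max_bits=15):
    """Greedy perceptual bit allocation sketch: repeatedly grant one
    bit to the subband with the worst (largest) noise-to-mask ratio.
    smr_db holds per-subband signal-to-mask ratios (dB) from a
    psychoacoustic model; ~6.02 dB of SNR is assumed per bit."""
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        candidates = [b for b in range(len(smr_db)) if bits[b] < max_bits]
        if not candidates:
            break  # every subband is at the bit-depth ceiling
        # NMR = SMR - SNR: positive NMR means audible quantization noise.
        worst = max(candidates, key=lambda b: smr_db[b] - 6.02 * bits[b])
        bits[worst] += 1
    return bits
```

Subbands with high SMR (strong signal relative to its masking threshold) receive the most bits, while subbands whose quantization noise is already below the masked threshold receive few or none.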
The purpose of this chapter is to expose the reader to subband coding algorithms for high-fidelity audio. This chapter is organized much like Chapter 7. The first portion of this chapter is concerned with early subband algorithms that not only contributed to the MPEG-1 standardization, but also had an impact on later developments in the field. The remainder of the chapter examines a variety of recent experimental subband algorithms that make use of discrete wavelet transforms (DWT), discrete wavelet packet transforms (DWPT), and hybrid filter banks. The chapter is organized as follows. Section 8.1.1 concentrates upon the early subband coding algorithms for high-fidelity audio, including the Masking Pattern Adapted Universal Subband Integrated Coding and Multiplexing (MUSICAM). Section 8.2 presents the filter-bank interpretations of the DWT and the DWPT. Section 8.3 addresses subband audio coding algorithms in which time-invariant and time-varying signal-adaptive filter banks are constructed from the DWT and the DWPT. Section 8.4 examines the use of nonuniform filter banks related to the DWPT. Sections 8.5 and 8.6 are concerned with hybrid subband architectures involving sinusoidal modeling and code-excited linear prediction (CELP). Finally, Section 8.7 addresses subband audio coding with IIR filter banks.
8.1.1 Subband Algorithms
This section is concerned with early subband algorithms proposed by researchers from the Institut für Rundfunktechnik (IRT) [Thei87] [Stol88], Philips Research Laboratories [Veld89], and CCETT. Much of this work was motivated by standardization activities for the European Eureka-147 digital broadcast audio (DBA) system. The ISO/IEC eventually clustered the IRT, Philips, and CCETT proposals into the MUSICAM algorithm [Wies90] [Dehe91], which was adopted as part of the ISO/IEC MPEG-1 and MPEG-2 BC-LSF audio coding standards.
8.1.1.1 Masking Pattern Adapted Subband Coding (MASCAM). The MUSICAM algorithm is derived from coders developed at IRT, Philips, and CNET. At IRT, Theile, Stoll, and Link developed Masking Pattern Adapted Subband Coding (MASCAM), a subband audio coder [Thei87] based upon a tree-structured quadrature mirror filter (QMF) filter bank that was designed to mimic the critical band structure of the auditory filter bank. The coder has 24 nonuniform subbands, with bandwidths of 125 Hz below 1 kHz, 250 Hz in the range 1–2 kHz, 500 Hz in the range 2–4 kHz, 1 kHz in the range 4–8 kHz, and 2 kHz from 8 kHz to 16 kHz. The prototype QMF has 64 taps. Subband output sequences are processed in 2-ms blocks. A normalization scale factor is quantized
and transmitted for each block from each subband. Subband bit allocations are derived from a simplified psychoacoustic analysis. The original coder reported in [Thei87] considered only in-band simultaneous masking. Later, as described in [Stol88], interband simultaneous masking and temporal masking were added to the bit rate calculation. Temporal postmasking is exploited by updating scale factors less frequently during periods of signal decay. The MASCAM coder was reported to achieve high-quality results for 15-kHz bandwidth input signals at bit rates between 80 and 100 kb/s per channel. A similar subband coder was developed at Philips during this same period. As described by Veldhuis et al. in [Veld89], the Philips group investigated subband schemes based on 20- and 26-band nonuniform filter banks. Like the original MASCAM system, the Philips coder relies upon a highly simplified masking model that considers only the upward spread of simultaneous masking. Thresholds are derived from a prototype basilar excitation function under worst-case assumptions regarding the frequency separation of masker and maskee. Within each subband, signal energy levels are treated as single maskers. Given SNR targets due to the masking model, uniform ADPCM is applied to the normalized output of each subband. The Philips coder was claimed to deliver high-quality coding of CD-quality signals at 110 kb/s for the 26-band version and 180 kb/s for the 20-band version.
8.1.1.2 Masking Pattern Adapted Universal Subband Integrated Coding and Multiplexing (MUSICAM). Based primarily upon coders developed at IRT and Philips, the MUSICAM algorithm [Wies90] [Dehe91] was successful in the 1990 ISO/IEC competition [SBC90] for a new audio coding standard. It eventually formed the basis for MPEG-1 and MPEG-2 audio layers I and II. Relative to its predecessors, MUSICAM (Figure 8.1) makes several practical tradeoffs between complexity, delay, and quality. By utilizing a uniform bandwidth, 32-band pseudo-QMF bank (aka "polyphase" filter bank) instead of a tree-structured QMF bank, both complexity and delay are greatly reduced relative to the IRT and Philips coders. Delay and complexity are 10.66 ms and 5 MFLOPS, respectively. These improvements are realized at the expense of using a sub-optimal
Figure 8.1 MUSICAM encoder (after [Wies90]): the input s(n) feeds a 32-channel polyphase analysis filter bank (750-Hz bands at a 48-kHz sample rate) and, in parallel, a 1024-point FFT for psychoacoustic analysis; the analysis drives bit allocation, and quantized samples (8/16/24-ms blocks), scale factors, and side information form the output.
filter bank, however, in the sense that filter bandwidths (constant 750 Hz for a 48-kHz sample rate) no longer correspond to the critical band rate. Despite these excessive filter bandwidths at low frequencies, high-quality coding is still possible with MUSICAM due to its enhanced psychoacoustic analysis. High-resolution spectral estimates (46 Hz/line at a 48-kHz sample rate) are obtained through the use of a 1024-point FFT in parallel with the PQMF bank. This parallel structure allows for improved estimation of masking thresholds and, hence, determination of more accurate minimum signal-to-mask ratios (SMRs) required within each subband.
The MUSICAM psychoacoustic analysis procedure is essentially the same as the MPEG-1 psychoacoustic model 1. The remainder of MUSICAM works as follows. Subband output sequences are processed in 8-ms blocks (12 samples at 48 kHz), which is close to the temporal resolution of the auditory system (4–6 ms). Scale factors are extracted from each block and encoded using 6 bits over a 120-dB dynamic range. Occasionally, temporal redundancy is exploited by repetition over 2 or 3 blocks (16 or 24 ms) of slowly changing scale factors within a single subband. Repetition is avoided during transient periods such as sharp attacks. Subband samples are quantized and coded in accordance with SMR requirements for each subband, as determined by the psychoacoustic analysis. Bit allocations for each subband are transmitted as side information. On the CCIR five-grade impairment scale, MUSICAM scored 4.6 (std. dev. 0.7) at 128 kb/s and 4.3 (std. dev. 1.1) at 96 kb/s per monaural channel, compared to 4.7 (std. dev. 0.6) on the same scale for the uncoded original. Quality was reported to suffer somewhat at 96 kb/s for critical signals which contained sharp attacks (e.g., triangle, castanets), and this was reflected in a relatively high standard deviation of 1.1. MUSICAM was selected by ISO/IEC for MPEG-1 audio due to its desirable combination of high quality, reasonable complexity, and manageable delay. Also, bit error robustness was found to be very good (errors nearly imperceptible) up to a bit error rate of 10^{−3}.
8.2 DWT AND DISCRETE WAVELET PACKET TRANSFORM (DWPT)
The previous section described subband coding algorithms that utilize banks of fixed-resolution bandpass QMF or pseudo-QMF finite impulse response (FIR) filters. This section describes a different class of subband coders that rely instead upon a filter-bank interpretation of the discrete wavelet transform (DWT). DWT-based subband coders offer increased flexibility over the subband coders described previously, since identical filter-bank magnitude frequency responses can be obtained for many different choices of a wavelet basis, or, equivalently, choices of filter coefficients. This flexibility presents an opportunity for basis optimization. The advantage of this optimization in the audio coding application is illustrated by the following example. First, a desired filter-bank magnitude response can be established. This response might be matched to the auditory filter bank. Then, for each segment of audio, one can adaptively choose a wavelet basis that minimizes the rate for some target distortion level. Given a psychoacoustically derived distortion target, the encoding remains perceptually transparent.
Figure 8.2 Filter-bank interpretation of the DWT: applying the transform matrix Q to x (y = Qx) is equivalent to filtering x by a lowpass/highpass pair Hlp(z), Hhp(z), each followed by 2:1 decimation, producing the approximation and detail sequences ylp and yhp.
A detailed discussion of the specific technical conditions associated with the various wavelet families is beyond the scope of this book, and this chapter therefore concentrates upon high-level coder architectures. In-depth treatment of wavelets is available from many sources, e.g., [Daub92]. Before describing the wavelet-based coders, however, it is useful to summarize some basic wavelet characteristics. Wavelets are a family of basis functions for the space of square integrable signals. A finite energy signal can be represented as a weighted sum of the translates and dilates of a single wavelet. Continuous-time wavelet signal analysis can be extended to discrete-time and square summable sequences. Under certain assumptions, the DWT acts as an orthonormal linear transform T: R^N → R^N. For a compact (finite) support wavelet of length K, the associated transformation matrix, Q, is fully determined by a set of coefficients {c_k} for 0 ≤ k ≤ K − 1. As shown in Figure 8.2, this transformation matrix has an associated filter-bank interpretation. One application of the transform matrix, Q, to an N × 1 signal vector, x, generates an N × 1 vector of wavelet-domain transform coefficients, y. The N × 1 vector y can be separated into two N/2 × 1 vectors of approximation and detail coefficients, ylp and yhp, respectively. The spectral content of the signal x captured in ylp and yhp corresponds to the frequency subbands realized in the 2:1 decimated output sequences from a QMF bank (Section 6.4), which obeys the "power complementary condition," i.e.,
$$ |H_{lp}(\Omega)|^2 + |H_{lp}(\Omega + \pi)|^2 = 1 \qquad (8.1) $$
where Hlp(Ω) is the frequency response of the lowpass filter. Therefore, recursive DWT applications effectively pass input data through a tree-structured cascade of lowpass (LP) and highpass (HP) filters, followed by 2:1 decimation at every node. The forward/inverse transform matrices of a particular wavelet are associated with a corresponding QMF analysis/synthesis filter bank. The usual wavelet decomposition implements an octave-band filter bank structure, as shown in Figure 8.3. In the figure, frequency subbands associated with the coefficients from each stage are schematically represented for an audio signal sampled at 44.1 kHz.
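Equation (8.1) is easy to verify numerically. The sketch below checks it for the two-tap Haar lowpass filter, scaled here as [0.5, 0.5] so that the sum comes out to the constant 1 used in Eq. (8.1); under the common orthonormal scaling [1/√2, 1/√2], the same sum is the constant 2.

```python
import numpy as np

def power_complementary(h_lp, omegas):
    """Evaluate |Hlp(w)|^2 + |Hlp(w + pi)|^2 from Eq. (8.1) for an
    FIR lowpass prototype h_lp at the given frequencies."""
    def H(w):
        # Frequency response of the FIR filter at radian frequency w.
        return np.sum(h_lp * np.exp(-1j * w * np.arange(len(h_lp))))
    return [abs(H(w)) ** 2 + abs(H(w + np.pi)) ** 2 for w in omegas]

h_haar = np.array([0.5, 0.5])  # Haar lowpass, scaled so the constant is 1
vals = power_complementary(h_haar, np.linspace(0.0, np.pi, 9))
```

For this filter, |H(Ω)|² = (1 + cos Ω)/2 and |H(Ω + π)|² = (1 − cos Ω)/2, so the sum is identically 1 across the band.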
Wavelet packet (WP), or discrete wavelet packet transform (DWPT), representations, on the other hand, decompose both the detail and approximation coefficients at each stage of the tree, as shown in Figure 8.4. In the figure, frequency subbands
216 SUBBAND CODERS
22 kHz11552814
Q QQxy1 y2 y3
y4
y5
y1y2y3y4y5
Frequency (Hz)
Q
Figure 83 Octave-band subband decomposition associated with a discrete wavelet trans-form (ldquoDWTrdquo)
associated with the coefficients from each stage are schematically represented for a 44.1-kHz sample rate.
A filter-bank interpretation of wavelet transforms is attractive in the context of audio coding algorithms. Wavelet or wavelet packet decompositions can be tree structured as necessary (unbalanced trees are possible) to decompose input audio into a set of frequency subbands tailored to some application. It is possible, for example, to approximate the critical band auditory filter bank utilizing a wavelet packet approach. Moreover, many K-coefficient finite support wavelets are associated with a single magnitude frequency response QMF pair, and therefore a specific subband decomposition can be realized while retaining the freedom to choose a wavelet basis which is in some sense "optimal." These considerations have motivated the development of several experimental wavelet-based subband coders in recent years. The basic idea behind DWT- and DWPT-based subband coders is to quantize and encode efficiently the coefficient sequences associated with each stage of the wavelet decomposition tree, using the same noise shaping techniques as the previously described perceptual subband coders.
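The two tree shapes discussed above, the octave-band DWT of Figure 8.3 and the balanced WP tree of Figure 8.4, can be sketched with a Haar filter pair as the recursive stage. This is a toy illustration of the tree structures, not a coder; the orthonormal Haar step stands in for an arbitrary wavelet's QMF pair.

```python
import numpy as np

def haar_step(x):
    """One DWT stage: orthonormal Haar analysis (one application of Q),
    returning approximation (lowpass) and detail (highpass) at 2:1."""
    x = np.asarray(x, dtype=float)
    ylp = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    yhp = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return ylp, yhp

def octave_dwt(x, levels):
    """Octave-band tree (Figure 8.3 style): recursively split only the
    approximation branch; returns [d1, d2, ..., dL, aL]."""
    details, a = [], x
    for _ in range(levels):
        a, d = haar_step(a)
        details.append(d)
    return details + [a]

def wp_full_tree(x, levels):
    """Balanced wavelet packet tree (Figure 8.4 style): split *both*
    branches at every stage, yielding 2**levels uniform subbands."""
    nodes = [np.asarray(x, dtype=float)]
    for _ in range(levels):
        nodes = [half for node in nodes for half in haar_step(node)]
    return nodes
```

Because each Haar step is orthonormal, the total energy of the subband sequences equals the energy of the input, regardless of which tree shape (or pruning) is chosen.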
The next few sections of this chapter, Sections 8.3 through 8.5, expose the reader to several WP-based subband coders developed in the early 1990s by Sinha, Tewfik, et al. [Sinh93a] [Sinh93b] [Tewf93], as well as more recently proposed hybrid sinusoidal/WPT algorithms developed by Hamdy and Tewfik [Hamd96], Boland and Deriche [Bola97], and Pena et al. [Pena96] [Prel96a] [Prel96b] [Pena97a]. The core of at least one experimental WP audio coder [Sinh96] has been embedded in a commercial standard, namely the AT&T Perceptual Audio Coder (PAC) [Sinh98]. Although not addressed in this chapter, we note that other studies of DWT- and DWPT-based audio coding schemes have appeared. For example, experimental coder architectures for low-complexity, low-delay, combined wavelet/multipulse LPC coding, and combined scalar/vector quantization of transform coefficients were reported, respectively, by Black and Zeytinoglu [Blac95], Kudumakis and Sandler [Kudu95a] [Kudu95b] [Kudu96], and Boland and Deriche [Bola95] [Bola96]. Several bit rate scalable DWPT-based schemes have also been investigated recently. For example, a fixed-tree DWPT coding scheme capable of nearly transparent quality with scalable bitrates below 100 kb/s was proposed by Dobson et al. and implemented in real-time on a 75-MHz Pentium-class platform [Dobs97]. Additionally, Lu and Pearlman investigated a rate-scalable DWPT-based coder that applies set partitioning in hierarchical trees (SPIHT) to generate an embedded bitstream. Nearly transparent quality was reported at bit rates between 55 and 66 kb/s [Lu98].
[Figure 8.4. Subband decomposition associated with a particular wavelet packet transform ("WPT" or "WP"): a balanced binary tree of QMF stages splits the input x into eight uniform subbands y1 through y8 spanning 0 to 22 kHz. Although the picture illustrates a balanced binary tree and the associated uniform bandwidth subbands, nodes could be pruned in order to achieve nonuniform frequency subdivision.]
218 SUBBAND CODERS
8.3 ADAPTED WP ALGORITHMS
The "best basis" methodologies [Coif92] [Wick94] for adapting the WP tree structure to signal properties are typically formulated in terms of Shannon entropy [Shan48] and other perceptually blind statistical measures. For a given WP tree, related research directed towards optimal filter selection [Hedg97] [Hedg98a] [Hedg98b] has also emphasized optimization of statistical rather than perceptual properties. The questions of perceptually motivated filter selection and tree construction are central to successful application of WP analysis in audio coding algorithms. The WP tree structure determines the time and frequency resolution of the transform, and therefore also creates a particular tiling of the time-frequency plane. Several WP audio algorithms [Sinh93b] [Dobs97] have successfully employed time-invariant WP tree structures that mimic the ear's critical band frequency resolution properties. In some cases, however, a more efficient perceptual bit allocation is possible with a signal-specific time-frequency tiling that tracks the shape of the time-varying masking threshold. Some examples are described next.
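For reference, the entropy-based split test at one node of a best-basis search (in the spirit of [Coif92] [Wick94]) can be sketched as follows; the coefficient values are fabricated for illustration. Normalizing node energies by a fixed reference makes the cost additive, so a split is kept only when the children are collectively cheaper than the parent:

```python
import numpy as np

def entropy_cost(coeffs, total_energy):
    """Additive Shannon-entropy cost of a coefficient vector; energies are
    normalized by a fixed reference so node costs can be summed."""
    p = coeffs**2 / total_energy
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Fabricated example: after one more split, the energy concentrates in the
# highpass child, so the children form a cheaper (sparser) representation.
parent = np.array([3.0, -2.9, 3.1, -3.0])
child_low = np.array([0.1, 0.05])
child_high = np.array([4.2, -4.3])

e_total = np.sum(parent**2)
keep_split = (entropy_cost(child_low, e_total) +
              entropy_cost(child_high, e_total) < entropy_cost(parent, e_total))
```

Note that this cost depends only on coefficient statistics, which is exactly the "perceptually blind" property the text contrasts with masking-threshold-driven adaptation.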
8.3.1 DWPT Coder with Globally Adapted Daubechies Analysis Wavelet
Sinha and Tewfik developed a variable-rate wavelet-based coding scheme for which they reported nearly transparent coding of CD-quality audio at 48–64 kb/s [Sinh93a] [Sinh93b]. The encoder (Figure 8.5) exploits redundancy using a VQ scheme and irrelevancy using a wavelet packet (WP) signal decomposition combined with perceptual masking thresholds. The algorithm works as follows. Input audio is segmented into N × 1 vectors, which are then
[Figure 8.5. Dynamic dictionary/optimal wavelet packet encoder (after [Sinh93a]): the input s is matched against the dynamic dictionary; if the perceptual distance d(s, s_d) ≤ T, only the index of s_d is transmitted; otherwise, a wavelet packet search/analysis stage, guided by psychoacoustic analysis, encodes and transmits the residual r or the signal s.]
windowed using a 1/16-th overlap square root Hann window. The dynamic dictionary (DD), which is essentially an adaptive VQ subsystem, then eliminates signal redundancy. A dictionary of N × 1 codewords is searched for the vector perceptually closest to the input vector. The effective size of the dictionary is made larger than its actual size by a novel correlation lag search/time-warping procedure that identifies two N/2-sample codewords for each N-sample input vector. At both the transmitter and receiver, the dictionary is systematically updated with N-sample reconstructed output audio vectors according to a perceptual distance criterion and last-used-first-out rule. After the DD procedure has been completed, an optimized WP decomposition is applied to the original signal as well as the DD residual. The decomposition tree is structured such that its 29 frequency subbands roughly correspond to the critical bands of the auditory filter bank. A masking threshold, obtained as in [Veld89], is assumed constant within each subband and then used to compute a perceptual bit allocation.
The encoder transmits the particular combination of DD and WP information that minimizes the bit rate while maintaining perceptual quality. Three combinations are possible. In one scenario, the DD index and time-warping factor are transmitted alone if the DD residual energy is below the masking threshold at all frequencies. Alternatively, if the DD residual has audible noise energy, then WP coefficients of the DD residual are also quantized, encoded, and transmitted. In some cases, however, WP coefficients corresponding to the original signal are more compactly represented than the combination of the DD plus WP residual information. In this case, the DD information is discarded and only quantized and encoded WP coefficients are transmitted. In the latter two cases, the encoder also transmits subband scale factors, bit allocations, and energy normalization side information.
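A toy sketch of the dynamic-dictionary decision logic described above; the names are hypothetical, and a weighted squared error stands in for the perceptual distance d(s, s_d). The real coder additionally performs time warping and the last-used-first-out dictionary updates:

```python
import numpy as np

def dd_encode(s, dictionary, weights, T):
    """Find the perceptually closest codeword; if the match is below the
    threshold proxy T, signal that the index alone suffices."""
    dists = [np.sum(weights * (s - c)**2) for c in dictionary]  # d(s, s_d) proxy
    idx = int(np.argmin(dists))
    residual = s - dictionary[idx]
    index_only = dists[idx] <= T      # residual presumed inaudible
    return idx, residual, index_only

rng = np.random.default_rng(0)
dictionary = [rng.standard_normal(8) for _ in range(16)]
s = dictionary[5] + 0.01 * rng.standard_normal(8)  # input near codeword 5
weights = np.ones(8)
idx, residual, index_only = dd_encode(s, dictionary, weights, T=0.1)
```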
This algorithm is unique in that it contains the first reported application of adapted WP analysis to perceptual subband coding of high-fidelity, CD-quality audio. During each frame, the WP basis selection procedure applies an optimality criterion of minimum bit rate for a given distortion level. The adaptation is "global" in the sense that the same analysis wavelet is applied to the entire decomposition. The authors reached several conclusions regarding the optimal compact support (K-coefficient) wavelet basis when selecting from among the Daubechies orthogonal wavelet bases ([Daub88]).
First, optimization produced average bit rate savings, dependent on filter length, of up to 15%. Average bit rate savings were 3%, 6.5%, 8.75%, and 15% for wavelets selected from the sets associated with coefficient sequences of lengths 10, 20, 40, and 60, respectively. In an extreme case, a savings of 1.7 bits/sample is realized for transparent coding of a difficult castanets sequence when using best-case rather than worst-case wavelets (0.8 vs. 2.5 bits/sample for K = 40). The second conclusion reached by the researchers was that it is not necessary to search exhaustively the space of all wavelets for a particular value of K. The search can be constrained to wavelets with K/2 vanishing moments (the maximum possible number) with minimal impact on bit rate. The frequency responses of the filters associated with a p-th-order vanishing moment wavelet have p-th-order zeros at
the foldover frequency, i.e., ω = π. Only a 3.1% bit rate reduction was realized for an exhaustive search versus a maximal vanishing moment constrained search. Third, the authors found that larger K, i.e., more taps, and deeper decomposition trees tended to yield better results. Given identical distortion criteria for a castanets sequence, bit rates of 2.1 bits/sample for K = 4 wavelets were realized versus 0.8 bits/sample for K = 40 wavelets.
As far as quality is concerned, subjective tests showed that the algorithm produced transparent quality for certain test material, including drums, pop, violin with orchestra, and clarinet. Subjects detected differences, however, for the castanets and piano sequences. These difficulties arise, respectively, because of inadequate pre-echo control and inefficient modeling of steady sinusoids. The coder utilizes only an adaptive window scheme, which switches between 1024- and 2048-sample windows. Shorter windows (N = 1024 or 23 ms) are used for signals that are likely to produce pre-echoes. The piano sequence contained long segments of nearly steady or slowly decaying sinusoids. The wavelet coder does not handle steady sinusoids as well as other signals. With the exception of these troublesome signals, in a comparative test, one additional expert listener also found that the WP coder outperformed MPEG-1, Layer II at 64 kb/s.
Tewfik and Ali later enhanced the WP coder to improve pre-echo control and increase coding efficiency. After elimination of the dynamic dictionary, they reported improved quality in the range of 55 to 63 kb/s, as well as a real-time implementation of a simplified 64 to 78 kb/s coder on two TMS320C31 devices [Tewf93]. Other improvements included exploitation of auditory temporal masking for pre-echo control, more efficient quantization and encoding of scale factors, and run-length coding of long zero sequences. The improved WP coder also upgraded its psychoacoustic analysis section with a more sophisticated model similar to Johnston's PXFM coder [John88a]. The most notable improvement occurred in the area of pre-echo control. This was accomplished in the following manner. First, input frames likely to produce pre-echoes are identified using a normalized energy measure criterion. These frames are parsed into 5-ms time slots (256 samples). Then, WP coefficients from all scales within each time slot are combined to estimate subframe energies. Masking thresholds computed over the global 1024-sample frame are assumed only to apply during high-energy time slots. Masking thresholds are reduced across all subbands for low-energy time slots, utilizing weighting factors proportional to the energy ratio between high- and low-energy time slots. The remaining enhancements of improved scale factor coding efficiency and run-length coding of zero sequences more than compensated for removal of the dynamic dictionary.
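The time-slot threshold reduction can be sketched as follows; the 1024-sample frame and 256-sample slot lengths come from the text, while the direct proportional scaling is a simplified stand-in for the energy-ratio weighting:

```python
import numpy as np

def slot_thresholds(frame, global_threshold, slot_len=256):
    """Scale a frame-level masking threshold per time slot by slot energy."""
    slots = frame.reshape(-1, slot_len)
    energies = np.sum(slots**2, axis=1)
    peak = np.max(energies)
    # Low-energy slots get proportionally reduced masking thresholds,
    # forcing finer quantization there and suppressing pre-echo.
    return global_threshold * (energies / peak)

# Transient confined to the last 5-ms slot of a 1024-sample frame
frame = np.concatenate([np.zeros(768), np.ones(256)])
th = slot_thresholds(frame, global_threshold=1.0)
```

In this example, the quiet slots before the attack receive a zero (i.e., maximally reduced) threshold, so quantization noise cannot spread audibly ahead of the transient.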
8.3.2 Scalable DWPT Coder with Adaptive Tree Structure
Srinivasan and Jamieson proposed a WP-based audio coding scheme [Srin97] [Srin98] in which a signal-specific perceptual best basis is constructed by adapting the WP tree structure on each frame such that perceptual entropy, and ultimately the bit rate, are minimized. While the tree structure is signal-adaptive, the analysis
[Figure 8.6. Masking-threshold adapted WP audio coder [Srin98]: the input s(n) passes through an adaptive WPT, a zerotree quantizer, and lossless coding, with a perceptual model and a complexity constraint λ steering the adaptation. On each frame, the WP tree structure is adapted in order to minimize a perceptually motivated rate constraint.]
filters are time-invariant and obtained from the family of spline-based biorthogonal wavelets [Daub92]. The algorithm (Figure 8.6) is also unique in the sense that it incorporates mechanisms for both bit rate and complexity scaling. Before the tree adaptation process can commence for a given frame, a set of 63 masking thresholds corresponding to a set of threshold frequency partitions roughly 1/3 Bark wide is obtained from the ISO/IEC MPEG-1 psychoacoustic model recommendation 2 [ISOI92]. Of course, depending upon the WP tree, the subbands may or may not align with the threshold partitions. For any particular WP tree, the associated bit rate (cost) is computed by extracting the minimum masking thresholds from each subband and then allocating sufficient bits to guarantee that the quantization noise in each band does not exceed the minimum threshold.
The objective of the tree adaptation process, therefore, is to construct a minimum cost subband decomposition by maximizing the minimum masking threshold in every subband. Figure 8.7a shows a possible subband structure in which subband 0 contains five threshold partitions. This choice of bandsplitting is clearly undesirable since the minimum masking threshold for partition 1 is far below partition 4. Bit allocation for subband 0 will be forced to satisfy partition 1, with a resulting overallocation for partitions 2 through 5.
It can be seen that subdividing the band (Figure 8.7b) relaxes the minimum masking threshold in band 1 to the level of partition 5. Naturally, the ideal bandsplitting would in this case ultimately match the subband boundaries to the threshold partition boundaries. On each frame, therefore, the tree adaptation process performs the following top-down, iterative "growing" procedure. During any iteration, the existing subbands are assigned individual costs based on the bit allocation required for transparent coding. Then, a decision on whether or not to subdivide further at a node is made on the basis of cost reduction. Subbands are examined for potential splitting in order of decreasing cost, and the search is "breadth-first," meaning that each level is completely decomposed before proceeding to the next level. Subdivision occurs only if the associated bit rate improvement exceeds a threshold. The tree adaptation is also constrained by a complexity scaling mechanism. Top-down tree growth is halted by the complexity scaling constraint, λ, when the estimated total cost of computing the
[Figure 8.7. Example masking-threshold adapted WP filter bank: (a) initial condition, (b) after one iteration. Threshold partitions are denoted by dashed lines and labeled by tk. Idealized subband boundaries are denoted by heavy black lines. Under the initial condition, with only one subband, the minimum masking threshold is given by t1, and therefore the bit allocation will be relatively large in order to satisfy a small threshold. After one band splitting, however, the minimum threshold in subband 1 increases from t1 to t5, thereby reducing the perceptual bit allocation. Hence, the cost function is reduced in part (b) relative to part (a).]
DWPT reaches a predetermined limit. With this feature, it is envisioned that in a real-time environment, the WP adaptation process could respond to changing CPU resources by controlling the cost of the analysis and synthesis filter banks.
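The growing procedure's split decision can be illustrated with a toy cost model; the 6-dB-per-bit rule and flat signal power are illustrative assumptions, not the coder's actual cost function. A subband spanning partitions with very different masking thresholds must be quantized to satisfy the smallest one, so splitting it reduces the width-weighted bit cost:

```python
import numpy as np

def band_cost(thresholds, signal_power=1.0):
    """Toy cost: bits needed to push quantization noise below the MINIMUM
    masking threshold covered by the band, weighted by band width."""
    snr_db = 10 * np.log10(signal_power / np.min(thresholds))
    bits_per_sample = max(snr_db / 6.02, 0.0)   # ~6 dB per bit
    return len(thresholds) * bits_per_sample

# Masking thresholds of four partitions: two very low, two much higher
partitions = np.array([1e-6, 1e-6, 1e-2, 1e-2])
whole = band_cost(partitions)                       # one wide band
split = band_cost(partitions[:2]) + band_cost(partitions[2:])
do_split = split < whole                            # keep the subdivision
```

After the split, the upper band is governed by its own (higher) minimum threshold, so it no longer inherits the expensive bit allocation forced by the low-threshold partitions.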
In [Srin98], a complexity-constrained tree adaptation procedure is shown to yield a basis requiring the fewest bits for perceptually transparent coding for a given complexity and temporal resolution. After the WP tree adaptation procedure has been completed, Shapiro's zerotree algorithm [Shap93] is applied iteratively to quantize the coefficients and exploit remaining temporal correlation until the perceptual rate-distortion criteria are satisfied, i.e., until sufficient bits have been allocated to satisfy the perceptually transparent bit rate associated with the given
WP tree. The zerotree technique has the added benefit of generating an embedded bitstream, making this coder amenable to progressive transmission. In scalable applications, the embedded bitstream has the property that it can be partially decoded and still guarantee the best possible quality of reconstruction given the number of bits decoded. The complete bitstream consists of the encoded tree structure, the number of zerotree iterations, and a block of zerotree encoded data. These elements are coded in a lossless fashion (e.g., Huffman, arithmetic, etc.) to remove any remaining redundancies and transmitted to the decoder. For informal listening tests over coded program material that included violin, violin/viola, flute, sitar, vocals/orchestra, and sax, the coded outputs at rates in the vicinity of 45 kb/s were reported to be indistinguishable from the originals, with the exceptions of the flute and sax.
8.3.3 DWPT Coder with Globally Adapted General Analysis Wavelet
Srinivasan and Jamieson [Srin98] demonstrated the advantages of a masking-threshold adapted WP tree with a time-invariant analysis wavelet. On the other hand, Sinha and Tewfik [Sinh93b] used a time-invariant WP tree but a globally adapted analysis wavelet to demonstrate that there exists a signal-specific "best" wavelet basis in terms of perceptual coding gain for a particular number of filter taps. The basis optimization in [Sinh93b], however, was restricted to Daubechies' wavelets. Recent research has attempted to identify which wavelet properties portend an optimal basis, as well as to consider basis optimization over a broader class of wavelets. In an effort to identify those wavelet properties that could be associated with the "best" filter, Philippe et al. measured the impact on perceptual coding gain of wavelet regularity, AR(1) coding gain, and filter bank frequency selectivity [Phil95a] [Phil95b]. The study compared performance between orthogonal Rioul [Riou94], orthogonal Onno [Onno93], and the biorthogonal wavelets of [More95] in a WP coding scheme that had essentially the same time-invariant, critical band WP decomposition tree as [Sinh93b]. Using filters of lengths varying between 4 and 120 taps, minimum bit rates required for transparent coding in accordance with the usual perceptual subband bit allocations were measured for each wavelet. For a given filter length, the results suggested that neither regularity nor frequency selectivity mattered significantly. On the other hand, the minimum bit rate required for transparent coding was shown to decrease with increasing analysis filter AR(1) coding gain, leading the authors to conclude that AR(1) coding gain is a legitimate criterion for WP filter selection in perceptual coding schemes.
8.3.4 DWPT Coder with Adaptive Tree Structure and Locally Adapted Analysis Wavelet
Philippe et al. [Phil96] measured the perceptual coding gain associated with optimization of the WP analysis filters at every node in the tree, as well as optimization of the tree structure. In the first experiment, the WP tree structure was fixed, and then optimal filters were selected for each
tree node (local adaptation) such that the bit rate required for transparent coding was minimized. Simulated annealing [Kirk83] was used to solve the discrete optimization problem posed by a search space containing 300 filters of varying lengths from the Daubechies [Daub92], Onno [Onno93], Smith-Barnwell [Smit86], Rioul [Riou94], and Akansu-Caglar [Cagl91] families. Then, the filters selected by simulated annealing were used in a second set of experiments on tree structure optimization. The best WP decomposition tree was constructed by means of a growing procedure, starting from a single cell and progressively subdividing. Further splitting at each node occurred only if it significantly reduced the perceptually transparent bit rate. As in [Phil95b], these filter and tree adaptation experiments estimated bit rates required for perceptually transparent coding of 48-kHz sampled source material using statistical signal properties. For a fixed tree, the filter adaptation experiments yielded several noteworthy results. First, a nominal bit rate reduction of 3% was realized for Onno's filters (66.5 kb/s) relative to Daubechies' filters (68 kb/s) when the same filter family was applied in all tree nodes and filter length was the only free parameter. Secondly, simulated annealing over the search space of 300 filters yielded a nominal 1% bit rate reduction (66 kb/s) relative to the Onno-only case. Finally, longer filter bank delay, i.e., longer analysis filters and hence better frequency selectivity, yielded lower bit rates. For low-delay applications, however, a sevenfold delay reduction, from 700 down to only 100 samples, is realized at the cost of only a 10% increase in bit rate. The tree adaptation experiments showed that a 16-band decomposition yielded the best bit rate when tree description overhead was accounted for. In light of these results and the wavelet adaptation results of [Sinh93b], one might conclude that WP filter and WP tree optimization are warranted if less than a 10% bit rate improvement justifies the added complexity.
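As an illustration of the discrete optimization step, here is a minimal simulated-annealing loop over a small filter inventory; the rate table is fabricated for illustration, whereas the actual search space contained 300 filters:

```python
import math
import random

# Hypothetical transparent-coding bit rates (kb/s) per candidate filter family
rates = {"daub": 68.0, "onno": 66.5, "smith_barnwell": 67.2, "rioul": 67.8}

def anneal(rates, steps=2000, t0=5.0, seed=1):
    """Discrete simulated annealing: accept worse candidates with a
    probability that shrinks as the temperature cools."""
    rng = random.Random(seed)
    names = list(rates)
    cur = rng.choice(names)
    best = cur
    for i in range(steps):
        t = t0 * (1 - i / steps) + 1e-9          # linear cooling schedule
        cand = rng.choice(names)
        delta = rates[cand] - rates[cur]
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cur = cand                            # Metropolis acceptance
        if rates[cur] < rates[best]:
            best = cur
    return best

best = anneal(rates)
```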
8.3.5 DWPT Coder with Perceptually Optimized Synthesis Wavelets
The wavelet-based audio coding schemes, as well as the WP tree and filter adaptation experiments, described in the foregoing sections (e.g., [Sinh93b] [Phil95a] [Phil95b] [Phil96]) seek to maximize perceptual coding efficiency by matching subband bandwidths (i.e., the time-frequency tiling) and/or individual filter magnitude and phase characteristics to incoming signal properties. All of these techniques make use of perfect reconstruction ("PR") DWT or WP filter banks that are designed to split a signal into frequency subbands in the analysis filter bank, and then later recombine the subband signals in the synthesis filter bank to reproduce exactly the original input signal. The PR property only holds, however, so long as distortion is not injected into the subband sequences, i.e., in the absence of quantization. This is an important point to consider in the context of coding. The quantization noise introduced into the subbands during bit allocation leads to filter bank-induced reconstruction artifacts, because the synthesis filter bank has carefully controlled spectral leakage properties specifically designed to cancel the aliasing and imaging distortions introduced by the critically sampled analysis-synthesis process. Whether using classical or perceptual bit allocation rules, most subband
coders do not account explicitly for the filter bank distortion artifacts introduced by quantization. Using explicit knowledge of the analysis filters and the quantization noise, however, recent research has shown that reconstruction distortion can be minimized in the mean square sense (MMSE) by relaxing PR constraints and tuning the synthesis filters [Chen95] [Hadd95] [Kova95] [Delo96] [Goss97b]. Naturally, mean square error minimization is of limited value for subband audio coders. As a result, Gosse et al. [Goss95] [Goss97] extended the MMSE synthesis filter tuning procedure [Goss96] to minimize a mean perceptual error (MMPE) rather than MMSE. Experiments were conducted to determine whether or not tuned synthesis filters outperform the unmodified PR synthesis filters, and, if so, whether or not MMPE filters outperform MMSE filters in subjective listening tests. A WP audio coding scheme configured for 128 kb/s operation and having a time-invariant filter-bank structure (Figure 8.8) formed the basis for the experiments.
The tree and filter selections were derived from the minimum-rate filter and tree adaptation investigation reported in [Phil96]. In the figure, each of the 16 subbands is labeled with its upper cutoff frequency (kHz). The experiments involved first a design phase and then an evaluation phase. During the design phase, optimized synthesis filter coefficients were obtained as follows. For the MMPE filters, coding simulations were run using the unmodified PR synthesis filter bank with psychoacoustically derived bit allocations for each subband on each frame. A mean perceptual error (MPE) was evaluated at the PR filter bank output in terms of a unique JND measure [Duro96]. Then, the filter tuning algorithm [Goss96] was applied to minimize the reconstruction error. Since the bit allocation was perceptually motivated, the tuning and reconstruction error minimization procedure yielded MMPE filter coefficients. For the MMSE filters, coefficients were also obtained using [Goss96], without the benefit of a perceptual bit allocation step.
[Figure 8.8. Wavelet packet analysis filter bank optimized for minimum bit rate, used in the MMPE experiments. The tree combines Haar, Daubechies (Daub-14, Daub-18, Daub-24), and Onno (Onno-4, Onno-6, Onno-32) filters to produce 16 subbands with upper cutoff frequencies of 0.1, 0.2, 0.4, 0.6, 0.8, 1.1, 1.5, 3, 4.5, 6, 7.5, 9, 10.5, 12, 18, and 24 kHz.]
During the evaluation phase of the experiments, three 128 kb/s coding simulations with psychoacoustic bit allocations were run with:
• PR synthesis filters,
• MMSE-tuned synthesis filters, and
• MMPE-tuned synthesis filters.
Performance was evaluated in terms of a perceptual objective measure (POM) [Colo95], an estimate of the probability that an expert listener can distinguish between the original and coded signal. The POM results were 4.4% distinguishability for the PR case versus only 1.6% for both the MMSE and MMPE cases. The authors concluded that synthesis filter tuning is worthwhile, since some performance enhancement exists over the PR case. They also concluded that MMPE filters failed to outperform MMSE filters because they were designed to minimize the perceptual error over a long period rather than on a time-localized basis. Since perceptual signal properties are strongly time-variant, it is possible that time-variant MMPE tuning will realize some performance gain relative to MMSE tuning. The perceptual synthesis filter tuning ideas explored in this work have shown promise, but further investigation is required to better characterize its costs and benefits.
8.4 ADAPTED NONUNIFORM FILTER BANKS
The most popular method for realizing nonuniform frequency subbands is to cascade uniform filters in an unbalanced tree structure, as with, for example, the DWPT. For a given impulse response length, however, cascade structures in general produce poor channel isolation. Recent advances in modulated filter bank design methodologies (e.g., [Prin94]) have made tractable direct-form, near perfect reconstruction, nonuniform designs that are critically sampled. This section is concerned with subband coders that employ signal-adaptive nonuniform modulated filter banks to approximate the time-frequency analysis properties of the auditory system more effectively than the other subband coders. Two examples are given. Beyond the pair of algorithms addressed below, we note that other investigators have proposed nonuniform filter bank coding techniques that address redundancy reduction utilizing lattice [Mont94] and bidimensional VQ schemes [Main96].
8.4.1 Switched Nonuniform Filter Bank Cascade
Princen and Johnston developed a CD-quality coder based upon a signal-adaptive filter bank [Prin95], for which they reported quality better than the sophisticated MPEG-1, Layer III algorithm at both 48 and 64 kb/s. The analysis filter bank for this coder consists of a two-stage cascade. The first stage is a 48-band nonuniform modulated filter bank split into four uniform-bandwidth sections. There are 8 uniform subbands from 0 to 750 Hz, 4 uniform subbands from 750 to 1500 Hz, 12 uniform subbands from 1.5 to 6 kHz, and 24 uniform subbands from 6 to 24 kHz.
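The first-stage band edges implied by this description can be reconstructed directly, as in the following sketch (edge values in Hz):

```python
def section_edges(lo, hi, n):
    """Edges of n uniform subbands spanning [lo, hi]."""
    step = (hi - lo) / n
    return [lo + i * step for i in range(n)] + [hi]

# Four uniform-bandwidth sections: 8 + 4 + 12 + 24 = 48 bands total
edges = (section_edges(0, 750, 8)[:-1] +
         section_edges(750, 1500, 4)[:-1] +
         section_edges(1500, 6000, 12)[:-1] +
         section_edges(6000, 24000, 24))
num_bands = len(edges) - 1
```

The resulting band widths (93.75 Hz, 187.5 Hz, 375 Hz, 750 Hz) step up roughly like critical bandwidths, which is why the first stage alone already approximates auditory frequency resolution.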
The second stage in the cascade optionally decomposes nonuniform bank outputs with on/off switchable banks of finer resolution uniform subbands. During filter bank adaptation, a suitable overall time-frequency resolution is attained by selectively enabling or disabling the second stage filters for each of the four uniform bandwidth sections. The low-resolution mode for this architecture corresponds to slightly better than auditory filter-bank frequency resolution. On the other hand, the high-resolution mode corresponds roughly to a 512 uniform subband decomposition. Adaptation decisions are made independently for each of the four cascaded sections based on a criterion of minimum perceptual entropy (PE). The second stage filters in each section are enabled only if a reduction in PE (hence bit rate) is realized. Uniform PCM is applied to subband samples under the constraint of perceptually masked quantization noise. Masking thresholds are transmitted as side information. Further redundancy reduction is achieved by Huffman coding of both quantized subband sequences and masking thresholds.
8.4.2 Frequency-Varying Modulated Lapped Transforms
Purat and Noll [Pura96] also developed a CD-quality audio coding scheme based on a signal-adaptive, nonuniform, tree-structured wavelet packet decomposition. This coder is unique in two ways. First of all, it makes use of a novel wavelet packet decomposition [Pura95]. Secondly, the algorithm adapts to the signal the wavelet packet tree decomposition depth and breadth (branching structure), based on a minimum bit rate criterion, subject to the constraint of inaudible distortions. In informal subjective tests, the algorithm achieved excellent quality at a bit rate of 55 kb/s.
8.5 HYBRID WP AND ADAPTED WP/SINUSOIDAL ALGORITHMS
This section examines audio coding algorithms that make use of a hybrid wavelet packet/sinusoidal signal analysis. Hybrid coder architectures often improve coder robustness to diverse program material. In this case, the wavelet portion of a coder might be better suited to certain signal classes (e.g., transient), while the harmonic portion might be better suited to other classes of input signal (e.g., tonal or steady-state). In an effort to improve coder overall performance (e.g., better output quality for a given bit rate), several of the signal-adaptive wavelet and wavelet packet subband coding schemes presented in the previous section have been embedded in experimental hybrid coding schemes that seek to adapt the analysis properties of the coding algorithm to the signal content. Several examples are considered in this section.
Although the WP coder improvements reported in [Tewf93] addressed the pre-echo control problems evident in [Sinh93b], they did not rectify the coder's inadequate performance for harmonic signals such as the piano test sequence. This is in part because the low-order FIR analysis filters typically employed in a WP decomposition are characterized by poor frequency selectivity, and therefore wavelet bases tend not to provide compact representations for strongly sinusoidal signals.
[Figure 8.9. Hybrid sinusoidal/wavelet encoder (after [Hamd96]): the input s(n) undergoes sinusoidal analysis/synthesis driven by a masking model; the sinusoidal residual is passed through WP analysis with transient tracking and removal, and the resulting sinusoidal, transient, and noise components are separately quantized and encoded, together with side information, before bit packing.]
On the other hand, wavelet decompositions provide some control over time resolution properties, leading to efficient representations of transient signals. These considerations have inspired several researchers to investigate hybrid coders.
8.5.1 Hybrid Sinusoidal/Classical DWPT Coder
Hamdy et al. developed a hybrid coder [Hamd96] designed to exploit the efficiencies of both harmonic and wavelet signal representations. For each frame, the encoder (Figure 8.9) chooses a compact signal representation from combined sinusoidal and wavelet bases. This algorithm is based on the notion that short-time audio signals can be decomposed into tonal, transient, and noise components. It assumes that tonal components are most compactly represented in terms of sinusoidal basis functions, while transient and noise components are most efficiently represented in terms of wavelet bases. The encoder works as follows. First, Thomson's analysis model [Thom82] is applied to extract sinusoidal parameters (frequencies, amplitudes, and phases) for each input frame. Harmonic synthesis using the McAulay and Quatieri reconstruction algorithm [McAu86] for phase and amplitude interpolation is next applied to obtain a residual sequence. Then, the residual is decomposed into WP subbands.
The overall WP analysis tree approximates an auditory filter bank. Edge-detection processing identifies and removes transients in low-frequency subbands. Without transients, the residual WP coefficients at each scale become largely decorrelated. In fact, the authors determined that the sequences are well approximated by white Gaussian noise (WGN) sources having exponential decay envelopes. As far as quantization and encoding are concerned, sinusoidal frequencies are quantized with sufficient precision to satisfy just-noticeable-differences in frequency (JNDF), which requires 8-bit absolute coding for a new frequency track, and then 5-bit differential coding for the duration of the lifetime of the track. The sinusoidal amplitudes are quantized and encoded in a similar absolute/differential manner using simultaneous masking thresholds for shaping of
quantization noise. This may require up to 8 bits per component. Sinusoidal phases are uniformly quantized on the interval [−π, π] and encoded using 6 bits. As for quantization and encoding of WP parameters, all coefficients below 11 kHz are encoded as in [Sinh93b]. Above 11 kHz, however, parametric representations are utilized. Transients are represented in terms of a binary edge mask that can be run-length encoded, while the Gaussian noise components are represented in terms of means, variances, and exponential decay constants. The hybrid harmonic-wavelet coder was reported to achieve nearly transparent coding over a wide range of CD-quality source material at bit rates in the vicinity of 44 kb/s [Ali96].
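The 6-bit uniform phase quantizer described above might be sketched as follows; the rounding convention is an assumption, and the maximum quantization error is half a step, i.e., π/64:

```python
import numpy as np

def quantize_phase(phi, bits=6):
    """Uniformly quantize phases on [-pi, pi) to 2**bits levels."""
    levels = 2**bits
    step = 2 * np.pi / levels
    idx = np.round((phi + np.pi) / step) % levels   # integer code, 0..levels-1
    return idx.astype(int), idx * step - np.pi      # code and decoded phase

phi = np.array([-np.pi, -1.0, 0.0, 2.5])
code, phi_hat = quantize_phase(phi)
```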
8.5.2 Hybrid Sinusoidal/M-band DWPT Coder
During the late 1990s, other researchers continued to explore the potential of hybrid sinusoidal-wavelet signal analysis schemes for audio coding. Boland and Deriche [Bola97] reported on an experimental sinusoidal-wavelet hybrid audio codec with a high-level architecture very similar to [Hamd96], but with low-level differences in the sinusoidal and wavelet analysis blocks. In particular, for harmonic analysis, the proposed algorithm replaces Thomson's method used in [Hamd96] with a combination of total least squares linear prediction (TLS-LP) and Prony's method. Then, in the harmonic residual wavelet decomposition block, the proposed method replaces the usual DWT cascade of two-band QMF sections with a cascade of four-band QMF sections. The algorithm works as follows. First, harmonic analysis operates on nonoverlapping 12-ms blocks of rectangularly windowed input audio (512 samples, 44.1 kHz). For each block, sinusoidal frequencies, f_k, are extracted using TLS-LP spectral estimation [Rahm87], a procedure that is formulated to deal with closely spaced sinusoids in low-SNR environments. Given the set of TLS-LP frequencies, a classical Prony algorithm [Marp87] next determines the corresponding amplitudes, A_k, and phases, φ_k. Masking thresholds for the tonal sequence are calculated in a manner similar to the ISO/IEC MPEG-1 psychoacoustic recommendation 2 [ISOI92]. After masked tones are discarded, the parameters of the remaining sinusoids are uniformly quantized and encoded in a procedure similar to [Hamd96]. Frequencies are encoded according to JNDFs (3 nonuniform bands, 8 bits per component in each band), phases are allocated 6 bits across all frequencies, and amplitudes are block companded with 5 bits for the gain and 6 bits per normalized amplitude. Unlike [Hamd96], however, amplitude bit allocations are fixed rather than signal adaptive. Quantized sinusoidal components are used to synthesize a tonal sequence, s_tonal(n), as follows:
s_{\mathrm{tonal}}(n) = \sum_{k=1}^{p} A_k \, e^{\,j(\Omega_k n + \phi_k)} \qquad (8.2)
where the parameters Ωk = 2πfk/fs are the normalized radian frequencies, and only p/2 frequency components are independent, since the complex exponentials are organized into conjugate symmetric pairs. As in [Hamd96], the synthetic
Figure 8.10. Subband decomposition associated with the cascaded M-band DWT in [Bola97]: three stages of 4-band (Q4) splits divide the 0–22 kHz range into ten subbands, y1–y10.
tonal sequence s_tonal(n) is subtracted from the input sequence s(n) to form a spectrally flattened residual r(n).
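The parameter-fitting and resynthesis steps above can be sketched as follows. This is a simplified stand-in rather than the coder's actual implementation: the sinusoidal frequencies are assumed given (e.g., by a TLS-LP estimate), the Prony step is replaced by an ordinary least-squares fit of amplitudes and phases at those fixed frequencies, and the conjugate-symmetric pairing of Eq. (8.2) is folded into a real cosine synthesis.

```python
import numpy as np

def fit_amps_phases(x, freqs_hz, fs):
    """Least-squares amplitude/phase fit for sinusoids at known frequencies.

    Stand-in for the Prony step: with frequencies fixed, the model
    x(n) ~ sum_k A_k cos(2*pi*f_k*n/fs + phi_k) is linear in
    (A_k cos(phi_k), -A_k sin(phi_k)).
    """
    n = np.arange(len(x))
    cols = []
    for f in freqs_hz:
        w = 2 * np.pi * f / fs            # normalized radian frequency Omega_k
        cols.extend([np.cos(w * n), np.sin(w * n)])
    c, *_ = np.linalg.lstsq(np.column_stack(cols), x, rcond=None)
    a, b = c[0::2], c[1::2]               # a = A cos(phi), b = -A sin(phi)
    return np.hypot(a, b), np.arctan2(-b, a)

def synth_tonal(amps, freqs_hz, phases, fs, n_samples):
    """Real-valued form of Eq. (8.2): the conjugate-symmetric exponential
    pairs collapse to cosines."""
    n = np.arange(n_samples)
    s = np.zeros(n_samples)
    for A, f, phi in zip(amps, freqs_hz, phases):
        s += A * np.cos(2 * np.pi * f / fs * n + phi)
    return s

# One 512-sample block at 44.1 kHz: fit, resynthesize, form the residual r(n).
fs, N = 44100, 512
n = np.arange(N)
rng = np.random.default_rng(1)
x = (1.5 * np.cos(2 * np.pi * 1000 * n / fs + 0.3)
     + 0.7 * np.cos(2 * np.pi * 2500 * n / fs - 1.0)
     + 0.01 * rng.standard_normal(N))
A, phi = fit_amps_phases(x, [1000.0, 2500.0], fs)
r = x - synth_tonal(A, [1000.0, 2500.0], phi, fs, N)  # spectrally flattened residual
```

With the two tones removed, the residual r(n) retains only the low-level noise floor, which is what the subsequent wavelet stage is designed to encode.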
In the wavelet analysis section, the harmonic residual, r(n), is decomposed such that critical bandwidths are roughly approximated using a three-level cascade (Figure 8.10) of 4-band analysis filters (i.e., 10 subbands), designed according to the M-band technique in [Alki95]. Compared to the usual DWT cascade of 2-band QMF sections, the M-band cascade offers the advantages of reduced complexity, reduced delay, and linear phase. The DWT coefficients are uniformly quantized and encoded in a block companding scheme, with 5 bits per subband gain and a dynamic bit allocation for the normalized coefficients according to a perceptual noise model. A Huffman coding section removes remaining statistical redundancies from the quantized harmonic and DWT coefficient sets. In subjective listening comparisons between the proposed scheme at 60–70 kb/s and MPEG-1, layer III at 64 kb/s on 12 SQAM CD [SQAM88] source items, the authors reported indistinguishable quality for "acoustic guitar," "Eddie Rabbit," "castanets," and "female speech." Slight impairments relative to MPEG-1, layer III were reported for the remaining eight items. No comparisons were reported in terms of delay or complexity.
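The block companding of the subband coefficients can be sketched as below. The 5-bit log-domain gain quantizer (with an assumed dynamic range) and the fixed 6-bit coefficient budget are illustrative choices; in the actual coder the per-coefficient bits come from the perceptual noise model.

```python
import numpy as np

def block_compand(coeffs, gain_bits=5, coeff_bits=6):
    """Block-compand one subband: quantize a shared gain, then the
    normalized coefficients.

    Assumptions: the gain is the peak magnitude, quantized on a log2
    scale over an assumed range [2**-15, 1]; normalized coefficients
    are uniformly quantized in [-1, 1].
    """
    coeffs = np.asarray(coeffs, dtype=float)
    gain = np.max(np.abs(coeffs))
    if gain == 0.0:
        return np.zeros_like(coeffs)
    # Quantize log2(gain) with gain_bits bits over [-15, 0].
    log_g = np.clip(np.log2(gain), -15.0, 0.0)
    n_steps = 2 ** gain_bits - 1
    q_gain = 2.0 ** (np.round(log_g * n_steps / 15.0) * 15.0 / n_steps)
    # Uniformly quantize the normalized coefficients.
    half = 2 ** coeff_bits / 2 - 1
    norm = np.clip(coeffs / q_gain, -1.0, 1.0)
    return np.round(norm * half) / half * q_gain

decoded = block_compand([0.5, -0.25, 0.1, 0.0])  # reconstruction after compand/expand
```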
8.5.3 Hybrid Sinusoidal/DWPT Coder with WP Tree Structure Adaptation (ARCO)
Other researchers have also developed hybrid algorithms that represent audio using a combination of sinusoidal and wavelet packet bases. Pena et al. [Pena96] have reported on the Adaptive Resolution COdec (ARCO). This algorithm employs a two-stage hybrid tonal-WP analysis section, architecturally similar to both [Hamd96] and [Bola97]. The experimental ARCO algorithm has introduced several novelties in the segmentation, psychoacoustic analysis, tonal analysis, bit allocation, and WP analysis blocks. In addition, recent work on this project has produced a unique MDCT-based filter bank. The remainder of this subsection gives some details on these developments.
8.5.3.1 ARCO Segmentation, Perceptual Model, and Sinusoidal Analysis-by-Synthesis In an effort to match the time-frequency analysis resolution to the signal properties, ARCO includes a segmentation scheme that
makes use of both time and frequency block clustering to determine optimal analysis frame lengths [Pena97b]. Similar blocks are assumed to contain stationary signals and are therefore combined into larger frames. Dissimilar blocks, on the other hand, are assumed to contain nonstationarities that are best analyzed using individual short segments. The ARCO psychoacoustic model resembles ISO/IEC MPEG-1, model recommendation 1 [ISOI92], with some enhancements. Unlike [ISOI92], tonality labeling is based on [Terh82], and noise maskers are segregated into narrowband and wideband subclasses. Then, frequency-dependent excitation patterns are associated with the wideband noise maskers. ARCO quantizes tonal signal components in a perceptually motivated analysis-by-synthesis loop. Using an iterative procedure, bits are allocated on each analysis frame until the synthetic tonal signal's excitation pattern matches the original signal's excitation pattern to within some tolerance.
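The iterative allocation idea can be illustrated with a toy loop that keeps adding bits wherever the synthetic (quantized) tonal representation still deviates most from the original, until the mismatch falls within tolerance. A simple dB-domain amplitude-error criterion stands in for the excitation-pattern match used by the real coder; the uniform amplitude quantizer and the tolerance value are likewise assumptions.

```python
import numpy as np

def allocate_tone_bits(amps, tol_db=1.0, max_bits=16):
    """Toy analysis-by-synthesis bit allocation for tonal components.

    Each iteration quantizes the amplitudes with the current per-tone
    budgets, finds the tone with the largest dB-domain mismatch, and
    grants it one more bit, stopping when all mismatches are within
    tol_db (or a per-tone cap is hit).
    """
    amps = np.asarray(amps, dtype=float)
    bits = np.ones(len(amps), dtype=int)      # start with 1 bit per tone

    def quantized(a, b):
        # Uniform amplitude quantizer on (0, max(amps)] with b bits;
        # clamp to one step so the dB error stays finite.
        step = amps.max() / 2 ** b
        return max(step, np.round(a / step) * step)

    while True:
        q = np.array([quantized(a, b) for a, b in zip(amps, bits)])
        err_db = np.abs(20.0 * np.log10(q / amps))
        worst = int(np.argmax(err_db))
        if err_db[worst] <= tol_db or bits[worst] >= max_bits:
            return bits
        bits[worst] += 1

bits = allocate_tone_bits([1.0, 0.5, 0.05])  # the weak tone needs the most bits
```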
8.5.3.2 ARCO WP Decomposition The ARCO WP decomposition procedure optimizes both the tree structure, as in [Srin98], and filter selections, as in [Sinh93b] and [Phil96]. For the purposes of WP tree adaptation [Prel96a], ARCO defines for the k-th band a cost, εk, as
\varepsilon_k = \frac{\displaystyle\int_{f_k - B_k/2}^{f_k + B_k/2} \bigl( U(f) - A_k \bigr)\, df}{\displaystyle\int_{f_k - B_k/2}^{f_k + B_k/2} U(f)\, df} \qquad (8.3)
where U(f) is the masking threshold expressed as a continuous function, the parameter f represents frequency, fk is the center frequency for the k-th subband, Bk is the k-th subband bandwidth, and Ak is the minimum masking threshold in the k-th band. Then, the total cost, C, to be minimized over all M subbands is given by
C = \sum_{k=1}^{M} \varepsilon_k \qquad (8.4)
By minimizing Eq. (8.4) on each frame, ARCO essentially arranges the subbands such that the corresponding set of idealized brickwall rectangular filters, having amplitude equal to the height of the minimum masking threshold in each band, matches as closely as possible the shape of the masking threshold. Then, bits are allocated in each subband to satisfy the minimum masking threshold, Ak. Therefore, uniform quantization in each subband with sufficient bits effects a noise shaping that satisfies perceptual requirements without wasting bits. The method was found to be effective without accounting explicitly for the spectral leakage associated with the filter bank sidelobes [Prel96b]. As far as filter selection is concerned, ARCO employs signal-adaptive filters during steady-state segments and time-invariant filters during transients. Some of the filter selection strategies were reported to have been inspired by Agerkvist's auditory modeling work [Ager94] [Ager96]. In [Pena97a], it was found that the "symmetrization" technique [Bamb94] [Kiya94] was effective for minimizing the boundary distortions associated with the time-varying WP analysis.
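The band cost of Eqs. (8.3) and (8.4) can be approximated on a sampled frequency grid. The sketch below uses a Riemann sum on a uniform grid (the grid spacing cancels in the ratio); the masking threshold and the three-band split are toy examples, not taken from the cited work.

```python
import numpy as np

def band_cost(U, f, f_k, B_k):
    """Discretized Eq. (8.3): the fraction of the band's masking-threshold
    area lying above the in-band minimum A_k.

    U holds masking-threshold samples on the uniform frequency grid f;
    since both integrals use the same grid, the common spacing df cancels.
    """
    in_band = (f >= f_k - B_k / 2) & (f <= f_k + B_k / 2)
    A_k = U[in_band].min()          # minimum masking threshold in band k
    num = np.sum(U[in_band] - A_k)
    den = np.sum(U[in_band])
    return num / den                # epsilon_k in [0, 1)

# Total cost of Eq. (8.4) for a toy threshold and a hypothetical 3-band split.
f = np.linspace(0.0, 22050.0, 2206)
U = 10.0 + 5.0 * np.sin(2 * np.pi * f / 22050.0)  # toy masking threshold
bands = [(2500.0, 5000.0), (10000.0, 10000.0), (18500.0, 7000.0)]  # (f_k, B_k)
C = sum(band_cost(U, f, fk, Bk) for fk, Bk in bands)
```

A flat threshold within a band yields εk = 0, so minimizing C drives the tree toward subbands over which the masking threshold is nearly constant, exactly the brickwall-matching interpretation given above.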