
EURASIP Journal on Applied Signal Processing

Digital Audio for Multimedia Communications

Guest Editors: Gianpaolo Evangelista, Mark Kahrs, and Emmanuel Bacry


Copyright © 2003 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in volume 2003 of “EURASIP Journal on Applied Signal Processing.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Editor-in-Chief
Marc Moonen, Belgium

Senior Advisory Editor
K. J. Ray Liu, College Park, USA

Associate Editors
Kiyoharu Aizawa, Japan; Gonzalo Arce, USA; Jaakko Astola, Finland; Kenneth Barner, USA; Mauro Barni, Italy; Sankar Basu, USA; Jacob Benesty, Canada; Helmut Bölcskei, Switzerland; Chong-Yung Chi, Taiwan; M. Reha Civanlar, Turkey; Tony Constantinides, UK; Luciano Costa, Brazil; Zhi Ding, USA; Petar M. Djurić, USA; Jean-Luc Dugelay, France; Tariq Durrani, UK; Touradj Ebrahimi, Switzerland; Sadaoki Furui, Japan; Moncef Gabbouj, Finland; Fulvio Gini, Italy; A. Gorokhov, The Netherlands; Peter Handel, Sweden; Ulrich Heute, Germany; John Homer, Australia; Jiri Jan, Czech Republic; Søren Holdt Jensen, Denmark; Mark Kahrs, USA; Ton Kalker, The Netherlands; Mos Kaveh, USA; Bastiaan Kleijn, Sweden; Ut-Va Koc, USA; Aggelos Katsaggelos, USA; C.-C. Jay Kuo, USA; Chin-Hui Lee, USA; Kyoung Mu Lee, Korea; Sang Uk Lee, Korea; Y. Geoffrey Li, USA; Ferran Marqués, Spain; Bernie Mulgrew, UK; King N. Ngan, Singapore; Naohisa Ohta, Japan; Antonio Ortega, USA; Mukund Padmanabhan, USA; Ioannis Pitas, Greece; Phillip Regalia, France; Hideaki Sakai, Japan; Wan-Chi Siu, Hong Kong; Dirk Slock, France; Piet Sommen, The Netherlands; John Sorensen, Denmark; Michael G. Strintzis, Greece; Tomohiko Taniguchi, Japan; Sergios Theodoridis, Greece; Xiaodong Wang, USA; Douglas Williams, USA; An-Yen (Andy) Wu, Taiwan; Xiang-Gen Xia, USA; Kung Yao, USA


Contents

Editorial, Gianpaolo Evangelista, Mark Kahrs, and Emmanuel Bacry
Volume 2003 (2003), Issue 10, Pages 939-940

Physically Informed Signal Processing Methods for Piano Sound Synthesis: A Research Overview, Balázs Bank, Federico Avanzini, Gianpaolo Borin, Giovanni De Poli, Federico Fontana, and Davide Rocchesso
Volume 2003 (2003), Issue 10, Pages 941-952

Frequency-Zooming ARMA Modeling for Analysis of Noisy String Instrument Tones, Paulo A. A. Esquef, Matti Karjalainen, and Vesa Välimäki
Volume 2003 (2003), Issue 10, Pages 953-967

Virtual Microphones for Multichannel Audio Resynthesis, Athanasios Mouchtaris, Shrikanth S. Narayanan, and Chris Kyriakakis
Volume 2003 (2003), Issue 10, Pages 968-979

Progressive Syntax-Rich Coding of Multichannel Audio Sources, Dai Yang, Hongmei Ai, Chris Kyriakakis, and C.-C. Jay Kuo
Volume 2003 (2003), Issue 10, Pages 980-992

Time-Scale Invariant Audio Data Embedding, Mohamed F. Mansour and Ahmed H. Tewfik
Volume 2003 (2003), Issue 10, Pages 993-1000

Watermarking-Based Digital Audio Data Authentication, Martin Steinebach and Jana Dittmann
Volume 2003 (2003), Issue 10, Pages 1001-1015

Model-Based Speech Signal Coding Using Optimized Temporal Decomposition for Storage and Broadcasting Applications, Chandranath R. N. Athaudage, Alan B. Bradley, and Margaret Lech
Volume 2003 (2003), Issue 10, Pages 1016-1026

On Securing Real-Time Speech Transmission over the Internet: An Experimental Study, Alessandro Aldini, Marco Roccetti, and Roberto Gorrieri
Volume 2003 (2003), Issue 10, Pages 1027-1042

Efficient Alternatives to the Ephraim and Malah Suppression Rule for Audio Signal Enhancement, Patrick J. Wolfe and Simon J. Godsill
Volume 2003 (2003), Issue 10, Pages 1043-1051


EURASIP Journal on Applied Signal Processing 2003:10, 939–940
© 2003 Hindawi Publishing Corporation

Editorial

Gianpaolo Evangelista
Department of Physical Sciences, University “Federico II” of Naples, I-80126 Napoli, Italy
Email: [email protected]

Mark Kahrs
Department of Electrical Engineering, University of Pittsburgh, Pittsburgh, PA 15261, USA
Email: [email protected]

Emmanuel Bacry
Centre de Mathématiques Appliquées, École Polytechnique, F-91128 Palaiseau Cedex, France
Email: [email protected]

Interest in digital processing of audio signals has been reinvigorated by the introduction of multimedia communication via the Internet and digital audio broadcasting systems. These new applications demand high bandwidth and require innovative solutions to an old problem: how to achieve high quality at low bit rates. Often this problem is addressed by transmission schemes in which only part of the original audio data is transmitted; the other sources, voices, or channels must be reconstructed at the receiver from purely synthetic or incomplete data. Additionally, the global networked audio community must solve a new class of problems concerning protection of audio streams and documents. Accordingly, robust methods are sought for enforcing security, privacy, ownership, and authentication of audio data. Furthermore, the maintenance of audio archives—our cultural heritage—requires the development of efficient techniques for the restoration of corrupted audio documents.

This special issue provides a sample of the new directions of digital audio research.

In audio synthesis, real-time computation of physical models of acoustic instruments is now possible due to the steady progress of Moore’s law. In the paper by B. Bank et al., a review of piano synthesis is given. The synthesis is described in terms of structured audio and the structured audio orchestral language (SAOL), which is included in MPEG-4. Through the use of filtering and interpolation, P. A. A. Esquef et al. describe the use of the frequency-zooming analysis method to derive an ARMA model for synthesizing stringed instruments. Model-based computation of string sounds can be used to create more expressive synthesis of string sounds by offering a wide space of controllable parameters.

Multichannel audio promises to bring more realistic reproduction to the listener. In the paper by A. Mouchtaris et al., a small number of microphone signals are resynthesized into a larger number of “virtual microphones,” thereby reducing the transmission bandwidth while enhancing the final rendering. In the paper by D. Yang et al., a high-performance scheme based on the MPEG advanced audio coding system that allows for the efficient transmission of multiple audio channels at scalable bit rates is proposed.

Watermarking and data-hiding techniques try to prevent unauthorized use of audio resources and additionally make it possible to include additional metadata in the audio stream. In their paper, M. F. Mansour and A. H. Tewfik introduce a new method for robust scale and shift invariant data-hiding based on wavelet transforms. The paper by M. Steinebach and J. Dittmann addresses the problem of authenticating audio streams by embedding content-related data that allow the decoder to check for integrity.

Quality networked speech communication poses not only bandwidth but also privacy concerns. In their paper, C. R. N. Athaudage et al. propose a new method for efficiently encoding the spectral information in a low-rate speech coder. The authors exploit the possibility of increasing the coding gain at the cost of introducing a substantially higher coding delay. Real-time software applications designed for securing speech transmission over the Internet are reviewed in the paper by A. Aldini et al.

In denoising or noise-reduction problems, a time-varying filter can be applied to the corrupted audio signal. Earlier work on a minimum mean square error (MMSE) estimator by Ephraim and Malah is quite expensive to compute. In P. J. Wolfe and S. J. Godsill’s paper, a Bayesian estimator that is easier to compute and easier to understand is derived.


The guest editors would like to thank the authors and the reviewers of the papers for their contributions in maintaining clarity, coherence, and consistency in this special issue.

Gianpaolo Evangelista
Mark Kahrs

Emmanuel Bacry

Gianpaolo Evangelista received the Laurea in physics (summa cum laude) from the University “Federico II” of Naples, Napoli, Italy, in 1984 and the M.S. and Ph.D. degrees in electrical engineering from the University of California, Irvine, in 1987 and 1990, respectively. Since 1995, he has been an Assistant Professor in the Department of Physical Sciences, University “Federico II” of Naples. From 1998 to 2002 he was Scientific Adjunct in the Laboratory for Audiovisual Communications, Swiss Federal Institute of Technology, Lausanne, Switzerland. From 1985 to 1986, he worked at the Centre d’Etudes de Mathematique et Acoustique Musicale (CEMAMu/CNET), Paris, France, where he contributed to the development of a DSP-based sound synthesis system, and from 1991 to 1994, he was a Research Engineer at the Microgravity Advanced Research and Support (MARS) Center, Napoli, where he was engaged in research in image processing applied to fluid motion analysis and material science. His interests include digital audio; music, speech, and image processing; synthesis and coding; wavelets; and multirate signal processing. Dr. Evangelista was a recipient of the Fulbright Fellowship.

Mark Kahrs received an A.B. degree in applied physics and information science (with high honors) from Revelle College, University of California, San Diego in 1974. He received his Ph.D. degree in computer science from the University of Rochester in 1984. He has held positions at Stanford University, Xerox PARC, Institut de Recherche et Coordination Acoustique/Musique (IRCAM) in Paris, Bell Laboratories, and Rutgers University. In the Spring of 2001, he was a Fulbright Scholar at the Acoustics Laboratory, Helsinki University of Technology. He is currently a visiting Associate Professor in the Department of Electrical Engineering at the University of Pittsburgh. His audio-specific interests include DSP for electroacoustic transducers, multichannel DSP hardware, and new analysis and synthesis methods for computer music.

Emmanuel Bacry graduated from École Normale Supérieure, Ulm, Paris, France in 1990. He received the Ph.D. degree in applied mathematics from the University of Paris VII, Paris, France in 1992 and obtained the “habilitation à diriger des recherches” from the same university in 1996. Since 1992, he has been a Researcher at the Centre National de la Recherche Scientifique (CNRS). After spending four years in the Applied Mathematics Department of Jussieu (Paris VII), he moved, in 1996, to the Centre de Mathématiques Appliquées (CMAP) at École Polytechnique, Palaiseau, France. During the same year, he became a part-time Assistant Professor at École Polytechnique. His research interests include signal processing, wavelet transform, and fractal and multifractal theory with applications to domains as varied as sound processing and finance.


EURASIP Journal on Applied Signal Processing 2003:10, 941–952
© 2003 Hindawi Publishing Corporation

Physically Informed Signal Processing Methods for Piano Sound Synthesis: A Research Overview

Balázs Bank
Department of Measurement and Information Systems, Faculty of Electrical Engineering and Informatics, Budapest University of Technology and Economics, H-1111 Budapest, Hungary
Email: [email protected]

Federico Avanzini
Department of Information Engineering, University of Padova, 35131 Padua, Italy
Email: [email protected]

Gianpaolo Borin
Dipartimento di Informatica, University of Verona, 37134 Verona, Italy
Email: [email protected]

Giovanni De Poli
Department of Information Engineering, University of Padova, 35131 Padua, Italy
Email: [email protected]

Federico Fontana
Department of Information Engineering, University of Padova, 35131 Padua, Italy
Email: [email protected]

Davide Rocchesso
Dipartimento di Informatica, University of Verona, 37134 Verona, Italy
Email: [email protected]

Received 31 May 2002 and in revised form 6 March 2003

This paper reviews recent developments in physics-based synthesis of the piano. The paper considers the main components of the instrument, that is, the hammer, the string, and the soundboard. Modeling techniques are discussed for each of these elements, together with implementation strategies. Attention is focused on numerical issues, and each implementation technique is described in light of its efficiency and accuracy properties. As the structured audio coding approach is gaining popularity, the authors argue that the physical modeling approach will have relevant applications in the field of multimedia communication.

Keywords and phrases: sound synthesis, audio signal processing, structured audio, physical modeling, digital waveguide, piano.

1. INTRODUCTION

Sounds produced by acoustic musical instruments can be described at the signal level, where only the time evolution of the acoustic pressure is considered and no assumptions on the generation mechanism are made. Alternatively, source models, which are based on a physical description of the sound production processes [1, 2], can be developed.

Physics-based synthesis algorithms provide semantic sound representations since the control parameters have a straightforward physical interpretation in terms of masses, springs, dimensions, and so on. Consequently, modification of the parameters leads in general to meaningful results and allows more intuitive interaction between the user and the virtual instrument. The importance of sound as a primary vehicle of information is being more and more recognized in the multimedia community. Particularly, source models of sounding objects (not necessarily musical instruments) are being explored due to their high degree of interactivity and the ease in synchronizing audio and visual synthesis [3].

The physical modeling approach also has potential applications in structured audio coding [4, 5], a coding scheme


where, in addition to the parameters, the decoding algorithm is transmitted to the user as well. The structured audio orchestral language (SAOL) became a part of the MPEG-4 standard, thus it is widely available for multimedia applications. Known problems in using physical models for coding purposes are primarily concerned with parameter estimation. Since physical models describe specific classes of instruments, automatic estimation of the model parameters from an audio signal is not a straightforward task: the model structure which is best suited for the audio signal has to be chosen before actual parameter estimation. On the other hand, once the model structure is determined, a small set of parameters can describe a specific sound. Casey [6] and Serafin et al. [7] address these issues.

In this paper, we review some of the strategies and algorithms of physical modeling, and their applications to piano simulation. The piano is a particularly interesting instrument, both for its prominence in western music and for its complex structure [8]. Also, its control mechanism is simple (it basically reduces to key velocity), and physical control devices (MIDI keyboards) are widely available, which is not the case for other instruments. The source-based approach can be useful not only for synthesis purposes but also for gaining a better insight into the behavior of the instruments. However, as we are interested in efficient algorithms, the features modeled are only those considered to have audible effects. In general, there is a trade-off between the accuracy and the simplicity of the description. The optimal solution may vary depending on the needs of the user.

The models described here are all based on digital waveguides. The waveguide paradigm has been found to be the most appropriate for real-time synthesis of a wide range of musical instruments [9, 10, 11]. As early as 1987, Garnett [12] presented a physical waveguide piano model. In his model, a semiphysical lumped hammer is connected to a digital waveguide string and the soundboard is modeled by a set of waveguides, all connected to the same termination.

In 1995, Smith and Van Duyne [13, 14] presented a model based on commuted synthesis. In their approach, the soundboard response is stored in an excitation table and fed into a digital waveguide string model. The hammer is modeled as a linear filter whose parameters depend on the hammer-string collision velocity. The hammer filter parameters have to be precalculated and stored for all notes and hammer velocities. This precalculation can be avoided by running an auxiliary string model connected to a nonlinear hammer model in parallel, and, based on the force response of the auxiliary model, designing the hammer filters in real time [15].

The original motivation for commuted synthesis was to avoid the high-order filter which is needed for high quality soundboard modeling. As low-complexity methods have been developed for soundboard modeling (see Section 5), the advantages of the commuted piano with respect to the direct modeling approach described here are reduced. Also, due to the lack of a physical description, some effects, such as the restrike (ribattuto) of the same string, cannot be precisely modeled with the commuted approach. Describing the commuted synthesis in detail is beyond the scope of this paper, although we would like to mention that it is a comparable alternative to the techniques described here.

As part of a collaboration between the University of Padova and Generalmusic, Borin et al. [16] presented a complete real-time piano model in 1997. The hammer was treated as a lumped model, with a mass connected in parallel to a nonlinear spring, and the strings were simulated using digital waveguides, all connected to a single lumped load. Bank [17] introduced in 2000 a similar physical model, based on the same functional blocks, but with slightly different implementation. An alternative approach was used for the solution of the hammer differential equation. Independent string models were used without any coupling, and the influence of the soundboard on decay times was taken into account by using high-order loss filters. The use of feedback delay networks was suggested for modeling the radiation of the soundboard.

This paper addresses the design of each component of a piano model (i.e., hammer, string, and soundboard). Discussion is carried on with particular emphasis on real-time applications, where the time complexity of algorithms plays a key role. Perceptual issues are also addressed since a precise knowledge of what is relevant to the human ear can drive the accuracy level of the design. Section 2 deals with general aspects of piano acoustics. In Section 3, the hammer is discussed and numerical techniques are presented to overcome the computability problems in the nonlinear discretized system. Section 4 is devoted to string modeling, where the problems of parameter estimation are also addressed. Finally, Section 5 deals with the soundboard, where various alternative techniques are described and the use of the multirate approach is proposed.

2. ACOUSTICS AND MODEL STRUCTURE

Piano sounds are the final product of a complex synthesis process which involves the entire instrument body. As a result of this complexity, each piano note exhibits its unique sound features and nuances, especially in high quality instruments. Moreover, just varying the impact force on a single key allows the player to explore a rich dynamic space. Accounting for such dynamic variations in a wavetable-based synthesizer is not trivial: dynamic postprocessing filters which shape the spectrum according to key velocity can be designed, but finding a satisfactory mapping from velocity to filter response is far from being an easy task. Alternatively, a physical model, which mimics as closely as possible the acoustics of the instrument, can be developed.

The general structure of the piano is displayed in Figure 1a: an iron frame is attached to the upper part of the wooden case and the strings are extended upon this in a direction nearly perpendicular to the keyboard. The keyboard-side end of the string is connected to the tuning pins on the pin block, while the other end, passing the bridge, is attached to the hitch-pin rail of the frame. The bridge is a thin wooden bar that transmits the string vibration to the soundboard, which is located under the frame.


Figure 1: General structures: (a) schematic representation of the instrument (hammer, string, bridge, and soundboard) and (b) model structure (excitation, string, and radiator blocks, with a control input).

Since the physical modeling approach tries to simulate the structure of the instrument rather than the sound itself, the blocks in the piano model resemble the parts of a real piano. The structure is displayed in Figure 1b. The first model block is the excitation, the hammer strike. Its output propagates to the string, which determines the fundamental frequency of the tone. The quasiperiodic output signal is filtered through a postprocessing block, covering the radiation effects of the soundboard. Figure 1b shows that the hammer-string interaction is bidirectional since the hammer force depends on the string displacement [8]. On the other hand, there is no feedback from the radiator to the string. Feedback and coupling effects on the bridge and the soundboard are taken into account in the string block. The model differs from a real piano in the fact that the two functions of the soundboard, namely, to provide a terminating impedance to the strings and to radiate sound, are located in separate parts of the model. As a result, it is possible to treat radiation as a linear filtering operation.

3. THE HAMMER

We will first discuss the physical aspects of the hammer-string interaction, then concentrate on various modeling approaches and implementation issues.

3.1. Hammer-string interaction

As a first approximation, the piano hammer can be considered a lumped mass connected to a nonlinear spring, which is described by the equation

$$ F(t) = -m_h \frac{d^2 y_h(t)}{dt^2}, \qquad (1) $$

where F(t) is the interaction force and y_h(t) is the hammer displacement. The hammer mass is represented by m_h. Experiments on real instruments have shown (see, e.g., [18, 19, 20]) that the hammer-string contact can be described by the following formula:

$$ F(t) = f(\Delta y(t)) = \begin{cases} k\,\Delta y(t)^p, & \Delta y(t) > 0, \\ 0, & \Delta y(t) \le 0, \end{cases} \qquad (2) $$

where Δy(t) = y_h(t) − y_s(t) is the compression of the hammer felt, y_s(t) is the string position, k is the hammer stiffness coefficient, and p is the stiffness exponent. The condition Δy(t) > 0 corresponds to the hammer-string contact, while the condition Δy(t) ≤ 0 indicates that the hammer is not touching the string. Equations (1) and (2) result in a nonlinear differential system of equations for y_h(t). Due to the nonlinearity, the tone spectrum varies dynamically with hammer velocity. Typical values of hammer parameters can be found in [19, 20]. Example values are listed in Table 1.

Table 1: Sample values for hammer parameters for three different notes, taken from [19, 20]. The hammer mass m_h is given in kg.

        C2             C4             C6
p       2.3            2.5            3
k       4.0 × 10^8     4.5 × 10^9     1.0 × 10^12
m_h     4.9 × 10^-3    2.97 × 10^-3   2.2 × 10^-3
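As a concrete illustration, the static force law (2) is straightforward to evaluate numerically. The short Python sketch below (our own illustration; the function name and the test compressions are not from the paper) evaluates the force for the C4 parameters of Table 1.

```python
import numpy as np

def hammer_force(delta_y, k, p):
    """Static hammer force law of (2): F = k * dy^p for dy > 0, and 0 otherwise."""
    delta_y = np.asarray(delta_y, dtype=float)
    compressed = np.maximum(delta_y, 0.0)   # felt compression, zero when contact is lost
    return k * compressed ** p

# C4 values from Table 1 (p is dimensionless, k in N/m^p)
k_C4, p_C4 = 4.5e9, 2.5

# Forces for felt compressions between 0 and 1 mm
dy = np.linspace(0.0, 1.0e-3, 5)
print(hammer_force(dy, k_C4, p_C4))
```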

However, (2) is not fully satisfactory in that real piano hammers exhibit hysteretic behavior. That is, contact forces during compression and during decompression are different, and a one-to-one law between compression and force does not correspond to reality. A general description of the hysteresis effect of piano felts was provided by Stulov [21]. The idea, coming from the general theory of mechanics of solids, is that the stiffness k of the spring in (2) has to be replaced by a time-dependent operator which introduces memory in the nonlinear interaction. Thus, the first part of (2) (when Δy(t) > 0) is replaced by

$$ F(t) = f(\Delta y(t)) = k \left[ 1 - h_r(t) \right] * \left[ \Delta y(t)^p \right], \qquad (3) $$

where h_r(t) = (ε/τ) e^{−t/τ} is a relaxation function that accounts for the “memory” of the material and the ∗ operator represents convolution.

Previous studies [22] have shown that a good fit to real data can be obtained by implementing h_r as a first-order lowpass filter. It has to be noted that informal listening tests indicate that taking into account the hysteresis in the hammer model does not improve the sound quality significantly.
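A possible discrete-time realization of the hysteretic law (3), with h_r implemented as a first-order lowpass filter as suggested above, is sketched below. The pole mapping and the parameter names (eps for ε, tau for τ) are our own choices; the ε and τ values themselves are instrument-dependent and are not specified here.

```python
import numpy as np

def hysteretic_hammer_force(delta_y, k, p, eps, tau, fs):
    """Hysteretic force of (3): F(n) = k * (x(n) - (h_r * x)(n)), with x = dy^p (dy > 0).

    The relaxation function h_r(t) = (eps/tau) * exp(-t/tau) is realized as a
    one-pole lowpass filter with DC gain eps and time constant tau (a simple
    discretization, adequate for illustration)."""
    x = np.maximum(np.asarray(delta_y, dtype=float), 0.0) ** p
    rho = np.exp(-1.0 / (fs * tau))   # pole giving the time constant tau
    w = 0.0                           # state of the "felt memory" filter
    F = np.zeros_like(x)
    for n in range(len(x)):
        w = rho * w + eps * (1.0 - rho) * x[n]
        F[n] = k * (x[n] - w)
    return F
```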

3.2. Implementation approaches

The hammer models described in Section 3.1 can be discretized and coupled to the string in order to provide a full physical description. However, there is a mutual dependence between (2) and (1), that is, the hammer position y_h(n) at discrete time instant n should be known for computing the force F(n), and vice versa. The same problem arises when (3) is used instead of (2). This implicit relationship can be made explicit by assuming that F(n) ≈ F(n − 1), thus inserting a fictitious delay element in a delay-free path. Although this approximation has been extensively used in the literature (see, e.g., [19, 20]), it is a potential source of instability.


The theory of wave digital filters addresses the problem of noncomputable loops in terms of wave variables. Every component of a circuit is described as a scattering element with a reference impedance, and delay-free loops between components are treated by “adapting” reference impedances. Van Duyne et al. [23] presented a “wave digital hammer” model, where wave variables are used. More severe computability problems can arise when simulating nonlinear dynamic exciters since the linear equations used to describe the system dynamics are tightly coupled with a nonlinear map. Borin et al. [24] have recently proposed a general strategy named “K method” for solving noncomputable loops in a wide class of nonlinear systems. The method is fully described in [24] along with some application examples. Here, only the basic principles are outlined.

Whichever discretization method is used, the hammer compression Δy(n) at time n can be written as

$$ \Delta y(n) = p(n) + K F(n), \qquad (4) $$

where p(n) is the linear combination of past values of the variables (namely, y_h, y_s, and F) and K is a coefficient whose value depends on the numerical method in use. The interaction force F(n) at discrete time instant n, computed either by (2) or (3), is therefore described by the implicit relation F(n) = f(p(n) + K F(n)). The K method uses the implicit function theorem to solve the following implicit relation:

$$ F = f(p + KF) \;\xrightarrow{\text{K meth.}}\; F = h(p). \qquad (5) $$

The new nonlinear map h defines F as a function of p, hence instantaneous dependencies across the nonlinearity are dropped. The function h can be precomputed and stored in a lookup table for efficient implementation.
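As an illustration of the idea (not the implementation of [24]), the sketch below tabulates h(p) of (5) for the static force law (2) by solving the scalar implicit equation F = f(p + KF) on a grid of p values with bisection; a negative K, as in the hammer-string case, guarantees a unique solution.

```python
import numpy as np

def tabulate_k_method(k, p_exp, K, p_grid, iters=60):
    """Precompute the map F = h(p) of (5) for f(x) = k * max(x, 0)^p_exp.

    For K < 0, F - f(p + K*F) is increasing in F and the root lies in
    [0, f(p)], so plain bisection converges safely."""
    f = lambda x: k * np.maximum(x, 0.0) ** p_exp

    F_table = np.zeros_like(p_grid, dtype=float)
    for i, pv in enumerate(p_grid):
        lo, hi = 0.0, max(f(pv), 1e-12)
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if mid - f(pv + K * mid) < 0.0:
                lo = mid
            else:
                hi = mid
        F_table[i] = 0.5 * (lo + hi)
    return F_table

# At synthesis time each F(n) is then read from the table, e.g. by interpolation:
# F_n = np.interp(p_n, p_grid, F_table)
```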

Bank [25] presented a simpler but less general method for avoiding artifacts caused by fictitious delay insertion. The idea is that the stability of the discretized hammer model with a fictitious delay can always be maintained by choosing a sufficiently large sampling rate f_s if the corresponding continuous-time system is stable. As f_s tends to infinity, the discrete-time system will behave as the original differential equation. Doubling the sampling rate of the whole string model would double the computation time as well. However, if only the hammer model operates at double rate, the computational complexity is raised only by a negligible amount. Therefore, in the proposed solution, the hammer operates at twice the sampling rate of the string. Data is downsampled using simple averaging and upsampled using linear interpolation. The multirate hammer has been found to result in well-behaved force signals at a low computational cost. As the hammer model is a nonlinear dynamic system, the stability bounds are not trivial to derive in a closed form. In practice, stability is maintained up to an impact velocity ten times higher than the point where the straightforward approach (e.g., used in [19, 20]) turns unstable.
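The rate conversion used by this multirate hammer is simple enough to show in full. The two helpers below (our own block-based sketch; a real implementation would do this sample by sample inside the synthesis loop) perform the linear-interpolation upsampling from the string rate f_s to 2 f_s and the averaging downsampling back to f_s.

```python
import numpy as np

def up2_linear(x):
    """Upsample by 2 with linear interpolation (string rate -> hammer rate)."""
    x = np.asarray(x, dtype=float)
    y = np.empty(2 * len(x))
    y[0::2] = x
    y[1::2] = 0.5 * (x + np.concatenate([x[1:], x[-1:]]))   # midpoints; last value is held
    return y

def down2_avg(x):
    """Downsample by 2 by averaging adjacent samples (hammer rate -> string rate)."""
    x = np.asarray(x, dtype=float)
    return 0.5 * (x[0::2] + x[1::2])
```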

Figure 2 shows a typical force signal in a hammer-string contact. The overall contact duration is around 2 ms and the pulses in the signal are produced by reflections of force waves at string terminations. The K method and the multirate hammer produce very similar force signals. On the other hand, inserting a fictitious delay element drives the system towards instability (the spikes are progressively amplified). In general, the multirate method provides results comparable to the K method for hammer parameters realistic for pianos, while it does not require that precomputed lookup tables be stored. On the other hand, when low sampling rates (e.g., f_s = 11.025 kHz) or extreme hammer parameters are used (i.e., k is ten times the value listed in Table 1), the system stability cannot be maintained by upsampling by a factor of 2. In such cases, the K method is the appropriate solution.

Figure 2: Time evolution of the interaction force for note C5 (522 Hz) with f_s = 44.1 kHz and hammer velocity v = 5 m/s, computed by inserting a fictitious delay element (solid line), with the K method (dashed line), and with the multirate hammer (dotted line).

The computational approaches presented in this section are applicable to a wide class of mechanical interactions between physical objects [26].

4. THE STRING

Many different approaches have been presented in the literature for string modeling. Since we are considering techniques suitable for real-time applications, only the digital waveguide [9, 10, 11] is described here in detail. This method is based on the time-domain solution of the one-dimensional wave equation. The velocity distribution of the string v(x, t) can be seen as the sum of two traveling waves:

$$ v(x, t) = v^+(x - ct) + v^-(x + ct), \qquad (6) $$

where x denotes the spatial coordinate, t is time, c is the propagation speed, and v^+ and v^- are the traveling wave components.

Spatial and time-domain sampling of (6) results in a simple delay-line representation. Nonideal, lossy, and stiff strings can also be modeled by the method. If linearity and time invariance of the string are assumed, all the distributed losses and dispersion can be consolidated to one end of the digital waveguide [9, 10, 11]. In the case of one polarization of a piano string, the system takes the form shown in Figure 3, where M represents the length of the string in spatial sampling intervals, M_in denotes the position of the force input, and H_r(z) refers to the reflection filter. This structure is capable of generating a set of quasiharmonic, exponentially decaying sinusoids. Note that the four delay lines of Figure 3 can be simplified to a two-delay-line structure for more efficient implementation [13].

Figure 3: Digital waveguide model of a string with one polarization.
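For concreteness, a toy waveguide string in the spirit of Figure 3 is sketched below. The four delay lines are collapsed into a single loop of integer length, the reflection filter is reduced to the one-pole loss filter of (8), and the dispersion and fine-tuning filters are omitted; all parameter values are illustrative only.

```python
import numpy as np

def waveguide_string(excitation, f0, fs, g=0.995, a1=-0.1, n_samples=44100):
    """Single-delay-loop waveguide string with the one-pole loss filter of (8)."""
    N = int(round(fs / f0))      # integer loop length (no fractional-delay tuning)
    delay = np.zeros(N)          # circular buffer acting as the consolidated delay line
    y = np.zeros(n_samples)
    lp = 0.0                     # previous output of the loss filter
    for n in range(n_samples):
        x_in = excitation[n] if n < len(excitation) else 0.0
        loop_out = delay[n % N]  # value written N samples ago
        # One-pole loss filter H_1p(z) = g*(1 + a1) / (1 + a1*z^-1)
        lp = g * (1.0 + a1) * loop_out - a1 * lp
        y[n] = x_in + lp
        delay[n % N] = y[n]      # feed the new sample back into the loop
    return y

fs = 44100.0
burst = 0.1 * np.random.randn(200)                  # short noise burst standing in for the hammer input
tone = waveguide_string(burst, f0=261.6, fs=fs)     # roughly C4; slightly detuned by the integer delay
```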

Accurate design of the reflection filter plays a key role in creating realistic sounds. To simplify the design, H_r(z) is usually split into three separate parts: H_r(z) = −H_l(z) H_d(z) H_fd(z), where H_l(z) accounts for the losses, H_d(z) for the dispersion due to stiffness, and H_fd(z) for fine-tuning the fundamental frequency. Using allpass filters H_d(z) for simulating dispersion ensures that the decay times of the partials are controlled by the loss filter H_l(z) only. The slight phase difference caused by the loss filter is negligible compared to the phase response of the dispersion filter. In this way, the loss filter and the dispersion filter can be treated as orthogonal with respect to design.

The string needs to be fine-tuned because delay lines can implement only an integer phase delay and this provides too low a resolution for the fundamental frequencies. Fine-tuning can be incorporated in the dispersion filter design or, alternatively, a separate fractional delay filter H_fd(z) can be used in series with the delay line. Smith and Jaffe [9, 27] suggested the use of a first-order allpass filter for this purpose. Välimäki et al. [28] proposed an implementation based on low-order Lagrange interpolation filters. Laakso et al. [29] provided an exhaustive overview on this topic.

4.1. Loss filter design

First, the partial envelopes of the recorded note have to be calculated. This can be done by sinusoidal peak tracking with a short-time Fourier transform implementation [28] or by heterodyne filtering [30]. A robust way of calculating decay times is fitting a line by linear regression on the logarithm of the amplitude envelopes [28]. The magnitude specification g_k for the loss filter can be computed as follows:

$$ g_k = \left| H_l\!\left( e^{\,j 2\pi f_k / f_s} \right) \right| = e^{-k/(f_k \tau_k)}, \qquad (7) $$

where f_k and τ_k are the frequency and the decay time of the kth partial, and f_s is the sampling rate. Fitting a filter to the g_k

coefficients is not trivial since the error in the decay times is a nonlinear function of the filter magnitude error. If the magnitude response exceeds unity, the digital waveguide loop becomes unstable. To overcome this problem, Välimäki et al. [28, 30] suggested the use of a one-pole loop filter whose transfer function is

$$ H_{1p}(z) = g\, \frac{1 + a_1}{1 + a_1 z^{-1}}. \qquad (8) $$

The advantage of this filter is that the stability constraints for the waveguide loop, namely, a_1 < 0 and 0 < g < 1, are relatively simple. As for the design, Välimäki et al. [28, 30] used a simple algorithm for minimizing the magnitude error in the mean squares sense. However, the overall decay time of the synthesized tone did not always coincide with the original one.

As a general solution for loss filter design, Smith [9] suggested minimizing the error of the decay times of the partials rather than the error of the filter magnitude response. This assures that the overall decay time of the note is preserved and the stability of the feedback loop is maintained. Moreover, optimization with respect to decay times is perceptually more meaningful. The methods described hereafter are all based on this idea.
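All of the designs below start from measured partial frequencies f_k and decay times τ_k. A minimal sketch of this common front end is given here, assuming the amplitude envelopes of the partials have already been extracted (by peak tracking or heterodyne filtering, as described above); the decay time comes from a linear fit to the log envelope and is then converted into the magnitude target g_k of (7).

```python
import numpy as np

def decay_time(envelope, fs, hop):
    """Decay time (s) of one partial from its amplitude envelope, sampled every
    `hop` samples: linear regression on the log envelope, env ~ exp(-t/tau)."""
    t = np.arange(len(envelope)) * hop / fs
    slope, _ = np.polyfit(t, np.log(np.maximum(envelope, 1e-12)), 1)
    return -1.0 / slope

def loss_filter_targets(f_k, tau_k):
    """Magnitude targets g_k of (7), one per partial (k = 1, 2, ...)."""
    f_k, tau_k = np.asarray(f_k), np.asarray(tau_k)
    k = np.arange(1, len(f_k) + 1)
    return np.exp(-k / (f_k * tau_k))
```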

Bank [17] developed a simple and robust method for one-pole loop filter design. The approximate analytical formula for the decay times τ_k of a digital waveguide with a one-pole filter is as follows:

$$ \tau_k \approx \frac{1}{c_1 + c_3 \vartheta_k^2}, \qquad (9) $$

where c_1 and c_3 are computed from the parameters of the one-pole filter of (8):

$$ c_1 = f_0 (1 - g), \qquad c_3 = -\frac{f_0\, a_1}{2 (a_1 + 1)^2}, \qquad (10) $$

where f_0 is the fundamental frequency and ϑ_k = 2π f_k / f_s is the digital frequency of the kth partial in radians. Equation (9) shows that the decay rate σ_k = 1/τ_k is a second-order polynomial of frequency ϑ_k with even-order terms. This simplifies the filter design since c_1 and c_3 are easily determined by polynomial regression from the prescribed decay times. A weighting function of w_k = τ_k^4 has to be used to minimize the error with respect to τ_k. Parameters g and a_1 of the one-pole loop filter are easily computed via the inverse of (10) from coefficients c_1 and c_3.
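In code, the one-pole design of [17] reduces to a weighted regression followed by the inversion of (10); the sketch below is our own reading of (9) and (10), with the root of the resulting quadratic chosen so that a_1 stays in (−1, 0).

```python
import numpy as np

def design_one_pole_loss_filter(f_k, tau_k, f0, fs):
    """One-pole loss filter parameters (g, a1) from prescribed decay times tau_k."""
    theta = 2.0 * np.pi * np.asarray(f_k) / fs   # digital frequencies of the partials
    sigma = 1.0 / np.asarray(tau_k)              # decay rates
    w = np.asarray(tau_k) ** 4                   # weighting w_k = tau_k^4

    # Weighted least-squares fit of sigma_k ~ c1 + c3 * theta_k^2, cf. (9)
    A = np.column_stack([np.ones_like(theta), theta ** 2])
    sw = np.sqrt(w)
    c1, c3 = np.linalg.lstsq(A * sw[:, None], sigma * sw, rcond=None)[0]

    # Invert (10): c1 = f0*(1 - g) and c3 = -f0*a1 / (2*(a1 + 1)^2)
    g = 1.0 - c1 / f0
    r = c3 / f0
    if abs(r) < 1e-9:
        a1 = -2.0 * r                            # small-a1 limit of (10)
    else:
        a1 = (-(4.0 * r + 1.0) + np.sqrt(8.0 * r + 1.0)) / (4.0 * r)
    return g, a1
```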

In most cases, the one-pole loss filter yields good results. Nevertheless, when precise rendering of the partial envelopes is required, higher-order filters have to be used. However, computing analytical formulas for the decay times with high-order filters is a difficult task. A two-step procedure was suggested by Erkut [31]; in this case, a high-order polynomial is fit to the decay rates σ_k = 1/τ_k, which contains only terms of even order. Then, a magnitude specification is calculated from the decay rate curve defined by the polynomial, and this magnitude response is used as a specification for minimum-phase filter design.


Another approach was proposed by Bank [17] who suggested the transformation of the specification. As the goal is to match decay times, the magnitude specification g_k is transformed into a form g_k,tr = 1/(1 − g_k) which approximates τ_k, and a transformed filter H_tr(z) is designed for the new specification by least squares filter design. The loss filter H_l(z) is then computed by the inverse transform H_l(z) = 1 − 1/H_tr(z).

Bank and Välimäki [32] presented a simpler method for high-order filter design based on a special weighting function. The resulting decay times of the digital waveguide are computed from the magnitude response of the loss filter as τ_k = d(|H_l(e^{jϑ_k})|), with d(g) = −1/(f_0 ln g). This function is approximated by its first-order Taylor series around the specification g_k, that is, d(g) ≈ d(g_k) + d′(g_k)(g − g_k). Accordingly, the error with respect to decay times can be approximated by the weighted mean square error

$$ e_{\mathrm{WLS}} = \sum_{k=1}^{K} w_k \left( H_l\!\left(e^{j\vartheta_k}\right) - g_k \right)^2, \qquad w_k = \frac{1}{\left( g_k - 1 \right)^4}. \qquad (11) $$

The weighted error e_WLS can be easily minimized by standard filter design algorithms, and leads to a good match with respect to decay times.
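As an illustration of how the weighting of (11) enters a standard design, the sketch below fits the magnitude of a zero-phase (symmetric FIR) prototype to the targets g_k by weighted least squares. This is a simplified stand-in for the filters actually used in [32]; it only shows the mechanics of the weighted fit.

```python
import numpy as np

def weighted_ls_loss_fir(theta_k, g_k, order=8):
    """Weighted LS fit of A(theta) = b0 + 2*sum_m b_m*cos(m*theta) to the targets g_k,
    using the decay-time weighting w_k = 1/(g_k - 1)^4 of (11).

    Returns the symmetric impulse response of length 2*order + 1."""
    theta_k, g_k = np.asarray(theta_k), np.asarray(g_k)
    w = 1.0 / np.maximum(1.0 - g_k, 1e-6) ** 4   # huge weights for slowly decaying partials

    m = np.arange(order + 1)
    basis = np.cos(np.outer(theta_k, m))
    basis[:, 1:] *= 2.0
    sw = np.sqrt(w)[:, None]
    b = np.linalg.lstsq(basis * sw, g_k * np.sqrt(w), rcond=None)[0]

    return np.concatenate([b[order:0:-1], [b[0]], b[1:]])
```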

All of these techniques for high-order loss filter design have been found to be robust in practice. Comparing them is left for future work.

Borin et al. [16] have used a different approach for modeling the decay time variations of the partials. In their implementation, second-order FIR filters are used as loss filters that are responsible for the general decay of the note. Small variations of the decay times are modeled by connecting all the string models to a common termination which is implemented as a filter with a high number of resonances. This also enables the simulation of the pedal effect since now all the strings are coupled to each other (see Section 4.3). An advantage of this method compared to high-order loop filters is the smaller computational complexity. On the other hand, the partial envelopes of the different notes cannot be controlled independently.

Although optimizing the loss filter with respect to decay times has been found to give perceptually adequate results, we remark that loss filter design can be aided by perceptual studies. The audibility of decay-time variations for the one-pole loss filter was studied by Tolonen and Järveläinen [33]. The study states that relatively large deviations (between −25% and +40%) in the overall decay time of the note are not perceived by listeners. Unfortunately, theoretical results are not directly applicable to the design of high-order loss filters as the tolerance for the decay time variations of single partials is not known.

4.2. Dispersion simulation

Dispersion is due to stiffness, which causes piano strings to deviate from ideal behavior. If the dispersive correction term in the wave equation is small, its first-order effect is

to increase the wave propagation speed c(f) with frequency. This phenomenon causes string partials to become inharmonic. If the string parameters are known, then the frequency of the kth stretched partial can be computed as

$$ f_k = k f_0 \sqrt{1 + B k^2}, \qquad (12) $$

where the value of the inharmonicity coefficient B depends on the parameters of the string (see, e.g., [34]).

Phase delay specification D_d(f_k) for the dispersion filter H_d(z) can be computed from the partial frequencies:

$$ D_d(f_k) = \frac{f_s\, k}{f_k} - N - D_l(f_k), \qquad (13) $$

where N is the total length of the waveguide delay line and D_l(f_k) is the phase delay of the loss filter H_l(z). The phase specification of the dispersion filter becomes φ_pre(f_k) = 2π f_k D_d(f_k)/f_s.

Van Duyne and Smith [35] proposed an efficient method for simulating dispersion by cascading equal first-order allpass filters in the waveguide loop; however, the constraint of using equal first-order sections is too severe and does not allow accurate tuning of inharmonicity.

Rocchesso and Scalcon [36] proposed a design method based on [37]. Starting from a target phase response, l points {f_k}, k = 1, ..., l, are chosen on the frequency axis corresponding to the points where string partials should be located. The filter order is chosen to be n < l. For each partial k, the method computes the quantities

$$ \beta_k = -\frac{1}{2} \left( \varphi_{\mathrm{pre}}(f_k) + 2 n \pi f_k \right), \qquad (14) $$

where φ_pre(f) is the prescribed allpass response. Filter coefficients a_j are computed by solving the system

$$ \sum_{j=1}^{n} a_j \sin\!\left( \beta_k + 2 j \pi f_k \right) = -\sin\!\left( \beta_k \right), \qquad k = 1, \dots, l. \qquad (15) $$

A least-squared equation error (LSEE) is used to solve the overdetermined system (15). It was shown in [36] that several tens of partials can be correctly positioned for any piano key, with the allpass filter order not exceeding 20. Moreover, the fine-tuning of the string is automatically taken into account in the design. Figure 4 plots results obtained using a filter order of 18. Note that the pure tone frequency JND (just noticeable difference) has been used in Figure 4b as a reference, as no accurate studies of partial JNDs for piano tones are available to our knowledge.
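A sketch of the LSEE design in the spirit of (13)-(15) is given below. Frequencies are normalized to the sampling rate, the inharmonic partials come from (12), and the prescribed allpass phase is taken as the negative of the phase-delay specification; the sign conventions and the exact handling of the loss-filter delay may differ from [36, 37], so this should be read as our interpretation rather than a verified reimplementation.

```python
import numpy as np

def design_dispersion_allpass(f0, B, N_loop, D_loss, n_order, n_partials, fs):
    """Allpass coefficients a_1..a_n fitted to the inharmonic partial delays.

    f0, B   : fundamental frequency (Hz) and inharmonicity coefficient, cf. (12)
    N_loop  : total waveguide delay-line length in samples
    D_loss  : phase delay of the loss filter at the partials (samples)
    """
    k = np.arange(1, n_partials + 1)
    f_k = k * f0 * np.sqrt(1.0 + B * k ** 2)     # stretched partial frequencies, (12)
    D_d = fs * k / f_k - N_loop - D_loss         # phase-delay specification, (13)

    nu = f_k / fs                                # normalized frequencies
    phi = -2.0 * np.pi * nu * D_d                # desired (negative) allpass phase
    beta = -0.5 * (phi + 2.0 * np.pi * n_order * nu)            # cf. (14)

    j = np.arange(1, n_order + 1)
    M = np.sin(beta[:, None] + 2.0 * np.pi * np.outer(nu, j))   # cf. (15)
    a, *_ = np.linalg.lstsq(M, -np.sin(beta), rcond=None)
    # The dispersion filter is then A(z) = z^-n * P(1/z) / P(z), with P(z) = 1 + sum_j a_j z^-j.
    return a
```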

Since the computational load for H_d(z) is heavy, it is important to find criteria for accuracy and order optimization with respect to human perception. Rocchesso and Scalcon [38] studied the dependence of the bandwidth of perceived inharmonicity (i.e., the frequency range in which misplacement of partials is audible) on the fundamental frequency by performing listening tests with decaying piano tones. The bandwidth has been found to increase almost linearly on a logarithmic pitch scale. Partials above this frequency band only contribute some brightness to the sound, and can be made harmonic without relevant perceptual consequences.

Figure 4: Dispersion filter (18th order) for the C2 string: (a) computed (solid line) and theoretical (dashed line) percentage dispersion versus partial number and (b) deviation of partials in cents versus frequency (solid line). Dash-dotted vertical lines show the end of the LSEE approximation; dash-dotted bounds in (b) indicate the pure tone frequency JND as a reference; and the dashed line in (b) is the partial deviation from the theoretical inharmonic series in a nondispersive string model.

Järveläinen et al. [39] also found that inharmonicity is more easily perceived at low frequencies even when the coefficient B for bass tones is lower than for treble tones. This is probably due to the fact that beats are used by listeners as cues for inharmonicity, and even low values of B produce enough mistuning in the higher partials of low tones. These findings can help in the allpass filter design procedure, although a number of issues still need further investigation.

Figure 5: The multirate resonator bank: downsampled resonators R_1(z), ..., R_k(z) in parallel with the string model S_v(z).

As high-order dispersion filters are needed for modeling low notes, the computational complexity is increased significantly. Bank [17] proposed a multirate approach to overcome this problem. Since the lowest tones do not contain significant energy in the high-frequency region anyway, it is worthwhile to run the lowest two or three octaves of the piano at half the sampling rate of the model. The outputs of the low notes are summed before upsampling, therefore only one interpolation filter is required.

4.3. Coupled piano strings

String coupling occurs at two different levels. First of all, two or three slightly mistuned strings are sounded together when a single piano key is pressed (except for the lowest octave), and complicated modulation of the amplitudes is brought about. This results in beating and two-stage decay: the former refers to an amplitude modulation overlaid on the exponential decay, while the latter means that the tone decays faster in its early part than later on. These phenomena were studied by Weinreich as early as 1977 [40]. At the second level, the presence of the bridge and the action of the soundboard are known to originate important coupling effects even between different tones. In fact, the bridge-soundboard system connects strings together and acts as a distributed driving-point impedance for string terminations.

The simplest way of modeling beating and two-stage decay is to use two digital waveguides in parallel for a single note. Depending on the type of coupling used, many different solutions have been presented in the literature; see, for example, [14, 41].

Bank [17] presented a different approach for modeling beating and two-stage decay, based on a parallel resonator bank. In a subsequent study, the computational complexity of the method was decreased by an order of ten by applying multirate techniques, making the approach suitable for real-time implementations [42]. In this approach, second-order resonators R_1(z) · · · R_k(z) are connected to the basic string model S_v(z) in parallel, rather than using a second waveguide. The structure is depicted in Figure 5. The idea comes from the observation that the behavior of two coupled strings can be described by a pair of exponentially damped sinusoids [40]. In this model, one sinusoid of the mode pair is simulated by one partial of the digital waveguide and the other


one by one of the resonators R_k(z). The transfer functions of the resonators are as follows:

$$ R_k(z) = \frac{\operatorname{Re}\{a_k\} - \operatorname{Re}\{a_k \bar{p}_k\}\, z^{-1}}{1 - 2\operatorname{Re}\{p_k\}\, z^{-1} + \left|p_k\right|^2 z^{-2}}, \qquad a_k = A_k e^{j\varphi_k}, \quad p_k = e^{\,j 2\pi f_k/f_s \,-\, 1/(f_s \tau_k)}, \qquad (16) $$

where A_k, φ_k, f_k, and τ_k refer to the initial amplitude, initial phase, frequency, and decay-time parameters of the kth resonator, respectively. The overline stands for complex conjugation, Re indicates the real part of a complex variable, and f_s is the sampling frequency.
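The resonator coefficients follow directly from (16); the sketch below builds the second-order sections and runs the bank at the full sampling rate, leaving out the downsampling and the shared interpolation filter of the actual multirate implementation.

```python
import numpy as np
from scipy.signal import lfilter

def resonator_coeffs(A_k, phi_k, f_k, tau_k, fs):
    """Numerator b and denominator a of one resonator R_k(z) as given in (16)."""
    a_k = A_k * np.exp(1j * phi_k)                                   # complex initial amplitude
    p_k = np.exp(1j * 2.0 * np.pi * f_k / fs - 1.0 / (fs * tau_k))   # complex pole
    b = [np.real(a_k), -np.real(a_k * np.conj(p_k))]
    a = [1.0, -2.0 * np.real(p_k), np.abs(p_k) ** 2]
    return b, a

def resonator_bank(excitation, params, fs):
    """Sum of the resonator outputs for a common excitation;
    params is a list of (A_k, phi_k, f_k, tau_k) tuples."""
    y = np.zeros(len(excitation))
    for A_k, phi_k, f_k, tau_k in params:
        b, a = resonator_coeffs(A_k, phi_k, f_k, tau_k, fs)
        y += lfilter(b, a, excitation)
    return y
```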

An advantage of the structure is that the resonators R_k(z) are implemented only for those partials whose beating and two-stage decay are prominent. The others will have simple exponential decay, determined by the digital waveguide model S_v(z). Five to ten resonators have been found to be enough for high-quality sound synthesis. The resonator bank is implemented by the multirate approach, running the resonators at a much lower sampling rate, for example, 1/8 or 1/16 of the original sampling frequency.

It is shown in [42] that when only half of the downsampled frequency band is used for resonators, no lowpass filtering is needed before downsampling. This is due to the fact that the excitation signal is of lowpass character, leading to aliasing of less than −20 dB. As the role of the excitation signal is to set the initial amplitudes and phases of the resonators, the result of this aliasing is a less than 1 dB change in the resonator amplitudes, which has been found to be inaudible. On the other hand, the interpolation filters after upsampling cannot be neglected. However, they are not implemented for all notes separately; the lower-sampling-rate signals of the different strings are summed before interpolation filtering (this is not depicted in Figure 5). Their specification is relatively simple (e.g., 5 dB passband ripple) since their passband errors can be easily corrected by changing the initial amplitudes and phases of the resonators. This results in a significantly lower computational cost compared to the methods which use coupled waveguides.

Generally, the average computational cost of the method for one note is less than five multiplications per sample. Moreover, the parameter estimation gets simpler since only the parameters of the mode pairs have to be found by, for example, the methods presented in [17, 41], and there is no need for coupling filter design. Stability problems of a coupled system are also avoided. The method presented here shows that combining physical and signal-based approaches can be useful in reducing computational complexity.

Modeling the coupling between strings of different tones is essential when the sustain pedal effect has to be simulated. Garnett [12] and Borin et al. [16] suggested connecting the strings to the same lumped terminating impedance. The impedance is modeled by a filter with a high number of peaks. For that, the use of feedback delay networks [43, 44] is a good alternative. Although in real pianos the bridge connects to the string as a distributed termination, thus coupling different strings in different ways, the simple model of Borin et al. was able to produce a realistic sustain pedal effect [45].

5. RADIATION MODELING

The soundboard radiates and filters the string waves that reach the bridge, and radiation patterns are essential for describing the “presence” of a piano in a musical context. However, here we concentrate on describing the sound pressure generated by the piano at a certain locus in the listening space; that is, the directional properties of radiation are not taken into account. Modeling the soundboard as a linear postprocessing stage is an intrinsically weak approach since on a real piano it also accounts for coupling between strings and affects the decay times of the partials. However, as already stated in Section 2, our modeling strategy keeps the radiation properties of the soundboard separated from its impedance properties. The latter are incorporated in the string model, and have already been addressed in Sections 4.1 and 4.3; here we will concentrate on radiation.

A simple and efficient radiation model was presented by Garnett [12]. The waveguide strings were connected to the same termination and the soundboard was simulated by connecting six additional waveguides to the common termination. This can be seen as a predecessor of using feedback delay networks for soundboard simulation. Feedback delay networks have been proven to be efficient in simulating room reverberation since they are able to produce high modal density at a low computational cost [43]. For an overview, see the work of Rocchesso and Smith [44]. Bank [17] applied feedback delay networks with shaping filters for the simulation of piano soundboards. The shaping filters were parametrized in such a way that the system matched the overall magnitude response of a real piano soundboard. A drawback of the method is that the modal density and the quality factors of the modes are not fully controllable. The method has proven to yield good results for high piano notes, where simulating the attack noise (the knock) of the tone is the most important issue.
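For readers unfamiliar with the structure, a bare-bones feedback delay network is sketched below: four delay lines coupled through an orthogonal feedback matrix with a common loop gain. The delay lengths, the matrix, and the gain are arbitrary illustrative choices; the shaping filters used in [17] to match a measured soundboard response are omitted.

```python
import numpy as np

def fdn(x, delays=(149, 211, 263, 293), gain=0.97):
    """Minimal 4-branch feedback delay network with a normalized Hadamard feedback matrix."""
    H = 0.5 * np.array([[1,  1,  1,  1],
                        [1, -1,  1, -1],
                        [1,  1, -1, -1],
                        [1, -1, -1,  1]], dtype=float)   # orthogonal: lossless loop when gain = 1
    bufs = [np.zeros(d) for d in delays]                 # circular delay-line buffers
    idx = [0] * len(delays)
    y = np.zeros(len(x))
    for n in range(len(x)):
        outs = np.array([bufs[i][idx[i]] for i in range(len(delays))])
        y[n] = outs.sum()
        back = gain * (H @ outs) + x[n]                  # feedback plus the common input sample
        for i in range(len(delays)):
            bufs[i][idx[i]] = back[i]
            idx[i] = (idx[i] + 1) % delays[i]
    return y
```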

The problem of soundboard radiation can also be addressed from the point of view of filter design. However, as the soundboard exhibits high modal density, a high-order filter has to be used. For f_s = 44.1 kHz, a 2000-tap FIR filter was necessary to achieve good results. The filter order did not decrease significantly when IIR filters were used.

To resolve the high computational complexity, a multirate soundboard model was proposed by Bank et al. [46]. The structure of the model is depicted in Figure 6. The string signal is split into two parts. The part below 2.2 kHz is downsampled by a factor of 8 and filtered by a high-order filter H_low(z) precisely synthesizing the amplitude and phase response of the soundboard for the low frequencies. The part above 2.2 kHz is filtered by a low-order filter, modeling the overall magnitude response of the soundboard at high frequencies. The signal of the high-frequency chain is delayed by N samples to compensate for the latency of the decimation and interpolation filters of the low-frequency chain.

The filters H_low(z) and H_high(z) are computed as follows. First, a target impulse response H_t(z) is calculated by


measuring the force-pressure transfer function of a real piano soundboard. Then, this is lowpass-filtered and downsampled by a factor of 8 to produce an FIR filter H_low(z). The impulse response of the low-frequency chain is now subtracted from the target response H_t(z), providing a residual response containing energy above 2.2 kHz. This residual response is made minimum phase and windowed to a short length (50 taps). The multirate soundboard model outlined here consumes 100 operations per cycle and produces a spectral character similar to that of a 2000-tap FIR filter. The only difference is that the attack of high notes sounds sharper since the energy of the soundboard response is concentrated in a short time period above 2.2 kHz. This could be overcome by using feedback delay networks for H_high(z), which is left for future research.
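The run-time chain of Figure 6 can be prototyped with standard polyphase resampling routines, as sketched below. The decimation/interpolation filters built into resample_poly, the filter lengths, and the latency value are stand-ins chosen by us; the design of H_low and H_high themselves (target-response measurement, residual windowing, minimum-phase conversion) is assumed to have been done offline as described above.

```python
import numpy as np
from scipy.signal import lfilter, resample_poly

def multirate_soundboard(f_string, h_low, h_high, latency, factor=8):
    """Multirate soundboard model of Figure 6.

    f_string : bridge force signal at the full sampling rate
    h_low    : high-order FIR for the band below ~2.2 kHz, operating at fs/factor
    h_high   : short FIR modeling the high-frequency magnitude response
    latency  : delay (samples) compensating the decimation/interpolation latency (the z^-N block)
    """
    # Low-frequency branch: decimate, filter at the low rate, interpolate back up.
    low = resample_poly(f_string, 1, factor)
    low = lfilter(h_low, [1.0], low)
    low = resample_poly(low, factor, 1)[: len(f_string)]

    # High-frequency branch: short filter plus the compensating delay.
    high = lfilter(h_high, [1.0], f_string)
    high = np.concatenate([np.zeros(latency), high])[: len(f_string)]

    return low + high
```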

The parameters of the multirate soundboard model cannot be interpreted physically. However, this does not lead to any drawbacks since the parameters of the soundboard cannot be changed by the player in real pianos either. Having a purely physical model, for example, based on finite differences [47], would lead to unacceptably high computational costs. Therefore, implementing a black-box model block as a part of a physical instrument model seems to be a good compromise.

6. CONCLUSIONS

This paper has reviewed the main stages of the development of a physical model for the piano, addressing computational aspects and discussing problems that not only are related to piano synthesis but arise in a broad class of physical models of sounding objects.

Various approaches have been discussed for dealing with nonlinear equations in the excitation block. We have pointed out that inaccuracies at this stage can lead to severe instability problems. Analogous problems arise in other mechanical and acoustical models, such as impact and friction between two sounding objects, or reed-bore interaction. The two alternative solutions presented for the piano hammer can be used in a wide range of applications.

Several filter design techniques have been reviewed forthe accurate tuning of the resonating waveguide block. It hasbeen shown that high-order dispersion filters are needed foraccurate simulation of inharmonicity. Therefore, perceptualissues have been addressed since they are helpful in optimiz-ing the design and reducing computational loads. The re-quirement of physicality can always be weakened when theeffect caused by a specific feature is considered to be inaudi-ble.

A filter-based approach was presented for the soundboard model. As such, it cannot be interpreted as physical, but this does not influence the functionality of the model. In general, only those parameters which are involved in block interaction or are influenced by control messages need to have a clear physical interpretation. Therefore, we recommend synthesis structures that are based on building blocks with physical input and output parameters, but whose inner structure does not necessarily follow a physical model. In other words, the basic building blocks are black-box models with the most efficient implementations available, and they form the physical structure of the instrument model at a higher level.

The use of multirate techniques was suggested for modeling beating and two-stage decay as well as the soundboard. The model can run at different sampling rates (e.g., 44.1, 22.05, and 11.025 kHz) and/or with different filter orders implemented in the digital waveguide model. Since the stability of the numerical structures is maintained in all cases, the user has the option of choosing between quality and efficiency. This remark is also relevant for potential applications in structured audio coding. In cases when instrument models are to be encoded and transmitted without precise knowledge of the computational power of the decoder, it is essential that stability is guaranteed even at low sampling rates in order to allow graceful degradation.

ACKNOWLEDGMENTS

Work at CSC-DEI, University of Padova, was developed under a Research Contract with Generalmusic. Partial funding was provided by the EU Project “MOSART,” Improving Human Potential, and the Hungarian National Scientific Research Fund OTKA F035060. The authors are thankful to P. Hussami and to the anonymous reviewers for their helpful comments, which have contributed to the improvement of the paper.

REFERENCES

[1] G. De Poli, “A tutorial on digital sound synthesis techniques,” in The Music Machine, C. Roads, Ed., pp. 429–447, MIT Press, Cambridge, Mass, USA, 1991.

[2] J. O. Smith III, “Viewpoints on the history of digital synthesis,” in Proc. International Computer Music Conference (ICMC ’91), pp. 1–10, Montreal, Quebec, Canada, October 1991.

[3] K. Tadamura and E. Nakamae, “Synchronizing computer graphics animation and audio,” IEEE Multimedia, vol. 5, no. 4, pp. 63–73, 1998.

[4] E. D. Scheirer, “Structured audio and effects processing in the MPEG-4 multimedia standard,” Multimedia Systems, vol. 7, no. 1, pp. 11–22, 1999.

[5] B. L. Vercoe, W. G. Gardner, and E. D. Scheirer, “Structured audio: creation, transmission, and rendering of parametric sound representations,” Proceedings of the IEEE, vol. 86, no. 5, pp. 922–940, 1998.

[6] M. A. Casey, “Understanding musical sound with forward models and physical models,” Connection Science, vol. 6, no. 2-3, pp. 355–371, 1994.


[7] S. Serafin, J. O. Smith III, and H. Thornburg, “A pattern recognition approach to invert a bowed string physical model,” in Proc. International Symposium on Musical Acoustics (ISMA ’01), pp. 241–244, Perugia, Italy, September 2001.

[8] N. H. Fletcher and T. D. Rossing, The Physics of Musical Instruments, Springer-Verlag, New York, NY, USA, 1991.

[9] J. O. Smith III, Techniques for digital filter design and system identification with application to the violin, Ph.D. thesis, Department of Music, Stanford University, Stanford, Calif, USA, June 1983.

[10] J. O. Smith III, “Principles of digital waveguide models of musical instruments,” in Applications of Digital Signal Processing to Audio and Acoustics, M. Kahrs and K. Brandenburg, Eds., pp. 417–466, Kluwer Academic, Boston, Mass, USA, 1998.

[11] J. O. Smith III, Digital Waveguide Modeling of Musical Instruments, August 2002, http://www-ccrma.stanford.edu/~jos/waveguide/.

[12] G. E. Garnett, “Modeling piano sound using waveguide digital filtering techniques,” in Proc. International Computer Music Conference (ICMC ’87), pp. 89–95, Urbana, Ill, USA, September 1987.

[13] J. O. Smith III and S. A. Van Duyne, “Commuted piano synthesis,” in Proc. International Computer Music Conference (ICMC ’95), pp. 335–342, Banff, Canada, September 1995.

[14] S. A. Van Duyne and J. O. Smith III, “Developments for the commuted piano,” in Proc. International Computer Music Conference (ICMC ’95), pp. 319–326, Banff, Canada, September 1995.

[15] B. Bank and L. Sujbert, “On the nonlinear commuted synthesis of the piano,” in Proc. 5th International Conference on Digital Audio Effects (DAFx ’02), pp. 175–180, Hamburg, Germany, September 2002.

[16] G. Borin, D. Rocchesso, and F. Scalcon, “A physical piano model for music performance,” in Proc. International Computer Music Conference (ICMC ’97), pp. 350–353, Thessaloniki, Greece, September 1997.

[17] B. Bank, “Physics-based sound synthesis of the piano,” M.S. thesis, Department of Measurement and Information Systems, Budapest University of Technology and Economics, Budapest, Hungary, May 2000, published as Tech. Rep. 54, Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology, Helsinki, Finland.

[18] D. E. Hall, “Piano string excitation VI: Nonlinear modeling,” Journal of the Acoustical Society of America, vol. 92, no. 1, pp. 95–105, 1992.

[19] A. Chaigne and A. Askenfelt, “Numerical simulations of piano strings. I. A physical model for a struck string using finite difference methods,” Journal of the Acoustical Society of America, vol. 95, no. 2, pp. 1112–1118, 1994.

[20] A. Chaigne and A. Askenfelt, “Numerical simulations of piano strings. II. Comparisons with measurements and systematic exploration of some hammer-string parameters,” Journal of the Acoustical Society of America, vol. 95, no. 3, pp. 1631–1640, 1994.

[21] A. Stulov, “Hysteretic model of the grand piano hammer felt,” Journal of the Acoustical Society of America, vol. 97, no. 4, pp. 2577–2585, 1995.

[22] G. Borin and G. De Poli, “A hysteretic hammer-string interaction model for physical model synthesis,” in Proc. Nordic Acoustical Meeting (NAM ’96), pp. 399–406, Helsinki, Finland, June 1996.

[23] S. A. Van Duyne, J. R. Pierce, and J. O. Smith III, “Traveling wave implementation of a lossless mode-coupling filter and the wave digital hammer,” in Proc. International Computer Music Conference (ICMC ’94), pp. 411–418, Arhus, Denmark, September 1994.

[24] G. Borin, G. De Poli, and D. Rocchesso, “Elimination of delay-free loops in discrete-time models of nonlinear acoustic systems,” IEEE Trans. Speech and Audio Processing, vol. 8, no. 5, pp. 597–605, 2000.

[25] B. Bank, “Nonlinear interaction in the digital waveguide with the application to piano sound synthesis,” in Proc. International Computer Music Conference (ICMC ’00), pp. 54–57, Berlin, Germany, September 2000.

[26] F. Avanzini, M. Rath, D. Rocchesso, and L. Ottaviani, “Low-level models: resonators, interactions, surface textures,” in The Sounding Object, D. Rocchesso and F. Fontana, Eds., pp. 137–172, Edizioni di Mondo Estremo, Florence, Italy, 2003.

[27] D. A. Jaffe and J. O. Smith III, “Extensions of the Karplus-Strong plucked-string algorithm,” Computer Music Journal, vol. 7, no. 2, pp. 56–69, 1983.

[28] V. Valimaki, J. Huopaniemi, M. Karjalainen, and Z. Janosy, “Physical modeling of plucked string instruments with application to real-time sound synthesis,” Journal of the Audio Engineering Society, vol. 44, no. 5, pp. 331–353, 1996.

[29] T. I. Laakso, V. Valimaki, M. Karjalainen, and U. K. Laine, “Splitting the unit delay—tools for fractional delay filter design,” IEEE Signal Processing Magazine, vol. 13, no. 1, pp. 30–60, 1996.

[30] V. Valimaki and T. Tolonen, “Development and calibration of a guitar synthesizer,” Journal of the Audio Engineering Society, vol. 46, no. 9, pp. 766–778, 1998.

[31] C. Erkut, “Loop filter design techniques for virtual string instruments,” in Proc. International Symposium on Musical Acoustics (ISMA ’01), pp. 259–262, Perugia, Italy, September 2001.

[32] B. Bank and V. Valimaki, “Robust loss filter design for digital waveguide synthesis of string tones,” IEEE Signal Processing Letters, vol. 10, no. 1, pp. 18–20, 2002.

[33] T. Tolonen and H. Jarvelainen, “Perceptual study of decay parameters in plucked string synthesis,” in Proc. AES 109th Convention, Los Angeles, Calif, USA, September 2000, preprint No. 5205.

[34] H. Fletcher, E. D. Blackham, and R. Stratton, “Quality of piano tones,” Journal of the Acoustical Society of America, vol. 34, no. 6, pp. 749–761, 1962.

[35] S. A. Van Duyne and J. O. Smith III, “A simplified approach to modeling dispersion caused by stiffness in strings and plates,” in Proc. International Computer Music Conference (ICMC ’94), pp. 407–410, Arhus, Denmark, September 1994.

[36] D. Rocchesso and F. Scalcon, “Accurate dispersion simulation for piano strings,” in Proc. Nordic Acoustical Meeting (NAM ’96), pp. 407–414, Helsinki, Finland, June 1996.

[37] M. Lang and T. I. Laakso, “Simple and robust method for the design of allpass filters using least-squares phase error criterion,” IEEE Trans. Circuits and Systems, vol. 41, no. 1, pp. 40–48, 1994.

[38] D. Rocchesso and F. Scalcon, “Bandwidth of perceived inharmonicity for physical modeling of dispersive strings,” IEEE Trans. Speech and Audio Processing, vol. 7, no. 5, pp. 597–601, 1999.

[39] H. Jarvelainen, V. Valimaki, and M. Karjalainen, “Audibility of the timbral effects of inharmonicity in stringed instrument tones,” Acoustic Research Letters Online, vol. 2, no. 3, pp. 79–84, 2001.

[40] G. Weinreich, “Coupled piano strings,” Journal of the Acoustical Society of America, vol. 62, no. 6, pp. 1474–1484, 1977.

[41] M. Aramaki, J. Bensa, L. Daudet, Ph. Guillemain, and R. Kronland-Martinet, “Resynthesis of coupled piano string vibrations based on physical modeling,” Journal of New Music Research, vol. 30, no. 3, pp. 213–226, 2001.


[42] B. Bank, “Accurate and efficient modeling of beating and two-stage decay for string instrument synthesis,” in Proc. MOSART Workshop on Current Research Directions in Computer Music, pp. 134–137, Barcelona, Spain, November 2001.

[43] J.-M. Jot and A. Chaigne, “Digital delay networks for designing artificial reverberators,” in Proc. 90th AES Convention, Paris, France, February 1991, preprint No. 3030.

[44] D. Rocchesso and J. O. Smith III, “Circulant and elliptic feedback delay networks for artificial reverberation,” IEEE Trans. Speech and Audio Processing, vol. 5, no. 1, pp. 51–63, 1997.

[45] G. De Poli, F. Campetella, and G. Borin, “Pedal resonance effect simulation device for digital pianos,” United States Patent 5,744,743, April 1998 (Appl. No. 618379, filed: March 1996).

[46] B. Bank, G. De Poli, and L. Sujbert, “A multi-rate approach to instrument body modeling for real-time sound synthesis applications,” in Proc. 112th AES Convention, Munich, Germany, May 2002, preprint No. 5526.

[47] B. Bazzi and D. Rocchesso, “Numerical investigation of the acoustic properties of piano soundboards,” in Proc. XIII Colloquium on Musical Informatics (CIM ’00), pp. 39–42, L’Aquila, Italy, September 2000.

Balazs Bank was born in 1977 in Budapest, Hungary. He received his M.S. degree in electrical engineering in 2000 from the Budapest University of Technology and Economics. In the academic year 1999/2000, he was with the Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology, completing his thesis as a Research Assistant within the “Sound Source Modeling” project. From October 2001 to April 2002, he held a Research Assistant position at the Department of Information Engineering, University of Padova within the EU project “MOSART Improving Human Potential.” He is currently studying for his Ph.D. degree at the Department of Measurement and Information Systems, Budapest University of Technology and Economics. He works on the physics-based sound synthesis of musical instruments, with a primary interest in the piano.

Federico Avanzini received in 1997 the Laurea degree in physics from the University of Milano, with a thesis on nonlinear dynamical systems and full marks. From November 1998 to November 2001, he pursued a Ph.D. degree in computer science at the University of Padova, with a research project on “Computational issues in physically-based sound models.” Within his doctoral activities (January to June, 2001), he worked as a visiting Researcher at the Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology, where he was involved in the “Sound Source Modeling” project. He is currently a Postdoctoral Researcher at the University of Padova. His research interests include sound synthesis models in human-computer interaction, computational issues, models for voice synthesis and analysis, and multimodal interfaces. Recent research experience includes participation in both national (“Sound models in human-computer and human-environment interaction”) and European (“SOb—The Sounding Object” and “MEGA—Multisensory Expressive Gesture Applications”) projects, where he has been working on sound synthesis algorithms based on physical models.

Gianpaolo Borin received the Laurea degree in electronic engineering from the University of Padova, Italy, in 1990, with a thesis on sound synthesis by physical modeling. Since then, he has been doing research at the Center of Computational Sonology (CSC), University of Padova. He has also been working both as a Unix Professional Developer and as a Consultant Researcher for Generalmusic. He is a coauthor of a Generalmusic US patent for digital piano postprocessing methods. His current research interests include algorithms and methods for the efficient implementation of physical models of musical instruments and tools for real-time sound synthesis.

Giovanni De Poli is an Associate Professor in the Department of Information Engineering of the University of Padova, where he teaches classes of Fundamentals of Informatics II and Processing Systems for Music. He is the Director of the Center of Computational Sonology (CSC), University of Padova. He is a member of the ExCom of the IEEE Computer Society Technical Committee on Computer Generated Music, Associate Editor of the International Journal of New Music Research, and member of the board of Directors of AIMI (Associazione Italiana Informatica Musicale), board of Directors of CIARM (Centro Interuniversitario Acustica e Ricerca Musicale), and Scientific Committee of ACROE (Institut National Politechnique Grenoble). His main research interests are in algorithms for sound synthesis and analysis, models for expressiveness in music, multimedia systems and human-computer interaction, and preservation and restoration of audio documents. He is an author of several scientific international publications. He has also served in the Scientific Committees of international conferences. He is involved in the COST G6, MEGA, and MOSART-IHP European projects as a local coordinator. He is an owner of several patents on digital music instruments.

Federico Fontana received the Laurea degree in electronic engineering from the University of Padova, Padua, Italy, in 1996, and the Ph.D. degree in computer science from the University of Verona, Verona, Italy, in 2003. He is currently a Postdoctoral Researcher in the Department of Information Engineering at the University of Padova, and collaborates with the Video, Image Processing and Sound (VIPS) Laboratory in the Dipartimento di Informatica at the University of Verona. From 1998 to 2000 he collaborated with the Center of Computational Sonology (CSC), University of Padova, working on sound synthesis by physical modeling. During the same period he was a consultant of Generalmusic, Italy, and STMicroelectronics—Audio & Automotive Division, Italy, in the design and realization of real-time algorithms for the deconvolution, virtual spatialization, and dynamic processing of musical and audio signals. In 2001, he visited the Laboratory of Acoustics and Audio Signal Processing at the Helsinki University of Technology, Espoo, Finland. He has been involved in several national and international research projects. His main interests are in audio signal processing, physical sound modeling and spatialization in virtual and interactive environments, and multimedia systems.


Davide Rocchesso received the Laurea degree in electronic engineering and the Ph.D. degree from the University of Padova, Padua, Italy, in 1992 and 1996, respectively. His Ph.D. research involved the design of structures and algorithms based on feedback delay networks for sound processing applications. In 1994 and 1995, he was a visiting Scholar at the Center for Computer Research in Music and Acoustics (CCRMA), Stanford University, Stanford, Calif. Since 1991, he has been collaborating with the Center of Computational Sonology (CSC), University of Padova, as a Researcher and Live-Electronic Designer. Since 1998, he has been with the University of Verona, Verona, Italy, where he is now an Associate Professor. At the Dipartimento di Informatica of the University of Verona, he coordinates the project “Sounding Object,” funded by the European Commission within the framework of the Disappearing Computer initiative. His main interests are in audio signal processing, physical modeling, sound reverberation and spatialization, multimedia systems, and human-computer interaction.


EURASIP Journal on Applied Signal Processing 2003:10, 953–967
© 2003 Hindawi Publishing Corporation

Frequency-Zooming ARMA Modeling for Analysis of Noisy String Instrument Tones

Paulo A. A. Esquef
Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology, P.O. Box 3000, FIN-02015 HUT, Espoo, Finland
Email: [email protected]

Matti Karjalainen
Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology, P.O. Box 3000, FIN-02015 HUT, Espoo, Finland
Email: [email protected]

Vesa Valimaki
Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology, P.O. Box 3000, FIN-02015 HUT, Espoo, Finland

Pori School of Technology and Economics, Tampere University of Technology, P.O. Box 300, FIN-28101 Pori, Finland
Email: [email protected]

Received 31 May 2002 and in revised form 5 March 2003

This paper addresses model-based analysis of string instrument sounds. In particular, it reviews the application of autoregressive (AR) modeling to sound analysis/synthesis purposes. Moreover, a frequency-zooming autoregressive moving average (FZ-ARMA) modeling scheme is described. The performance of the FZ-ARMA method on modeling the modal behavior of isolated groups of resonance frequencies is evaluated for both synthetic and real string instrument tones immersed in background noise. We demonstrate that the FZ-ARMA modeling is a robust tool to estimate the decay time and frequency of partials of noisy tones. Finally, we discuss the use of the method in synthesis of string instrument sounds.

Keywords and phrases: acoustic signal processing, spectral analysis, computer music, sound synthesis, digital waveguide.

1. INTRODUCTION

It has been known for quite a long time that a free vibrating body may generate a sound that is composed of damped sinusoids, assuming valid the hypothesis of small perturbations and linear elasticity [1]. This behavior has motivated the use of a set of controllable sinusoidal oscillators to artificially emulate the sound of musical instruments [2, 3, 4]. As for analysis purposes, tools like the short-time Fourier transform (STFT) [5] and discrete cosine transform (DCT) [6] have been widely employed since these transformations are based on projecting the input signal onto an orthogonal basis consisting of sine or cosine functions.

An appealing idea, which is also based on the resonant behavior of vibrating structures, consists in letting the resonant behavior be parametrically modeled by means of resonant filters (all-pole or pole-zero) excited by a source signal. For short-duration excitation signals and filters parameterized by a few coefficients, such a source-filter model implies a compact representation for sound sources. Furthermore, parametric modeling of linear and time-invariant systems finds applications in several areas of engineering and digital signal processing, such as system identification [7], equalization [8], and spectrum estimation [9]. The moving-average (MA), the autoregressive (AR), and the autoregressive moving-average (ARMA) models are among the most widely used ones. Indeed, there exists an extensive literature on estimation of these models [9, 10, 11, 12].

There is a long tradition in applying source-filter schemes in sound synthesis. For instance, the linear predictive coding (LPC) [13] used for speech coding and synthesis is one of the most well-known applications of source-filter synthesis. The problems involved in source-filter approaches can be roughly divided into two subproblems: the estimation of the filter parameters and the choice or design of suitable excitation signals. As regards the filter parameter estimation, standard techniques for estimation of AR and ARMA processes can be used. Ways of obtaining adequate excitations for the generator filter have been discussed in [14, 15, 16].


Model-based spectral analysis of recorded instrument sounds also finds applications in parametric sound synthesis. In this context, it is possible to derive the frequencies and decay times of the partial modes from the parameters of the estimated models (all-pole or pole-zero filters). This information can be used afterward to calibrate a synthesis algorithm, for example, a guitar synthesizer based on the commuted waveguide method [17, 18].

However, when dealing with signals exhibiting a large number of mode frequencies, for example, low-pitched harmonic tones, high-order models are needed for properly modeling the signal resonances. Therefore, it is plausible to expect difficulties in either estimating or realizing such high-order models.

A possible way to alleviate the burden of employing high-order models is to split the original frequency band into subbands with reduced bandwidth. Frequency-selective schemes allow signal modeling within a subband of interest with lower-order filters [14, 19, 20, 21]. Naturally, the choices of the subband bandwidth as well as the modeling orders depend on the problem at hand. For instance, in [20], Laroche shows that adequate modeling of beating modes of a single partial of a piano tone can be accomplished by applying a high-resolution spectral analysis method to the signal associated with the sole contribution of the specific partial. In this case, the decimated subband signal associated with the partial contribution was analyzed via the ESPRIT method [22].

In this paper, we review a frequency-zooming ARMA (FZ-ARMA) modeling technique that was presented in [23] and discuss the advantages of applying the method for analysis of string instrument sounds. Our focus, however, is not on the FZ-ARMA modeling formulation, which bears similarities to other subband modeling approaches, such as those proposed in [14, 20, 24, 25], among others. In fact, we are more interested in reliable ways to estimate the frequencies and decay times of partial modes when the tone under study is corrupted with broadband background noise. Within this scenario, our aim is to investigate the performance of the FZ-ARMA modeling as a spectrum analysis tool.

Every measurement setup is prone to noise interference to some extent, even in controlled conditions as in an anechoic environment. For instance, the recording circuitry involving microphones and amplifiers is one of the sources of noise. In [26], the authors highlight the importance of taking into account the level of background noise in the signal when attempting to estimate the decay time of string tone partials, especially for the fast decaying ones.

Another situation in which corrupting noise has to be carefully considered is in the context of audio restoration. In a recent paper [27], the authors proposed a sound source modeling approach to bandwidth extension of guitar tones. The method was applied to recover the high-frequency content of a strongly de-hissed guitar tone. To perform this task, a digital waveguide (DWG) model for the vibrating string has to be designed. In [27], the DWG model was estimated using a clean guitar tone similar to the noisy one. This resource was adopted because the presence of the corrupting noise prevented obtaining reliable estimates for the decay time of high-frequency partials. These estimates were determined via a linear fitting over the time evolution of the partial amplitude (in dB), which was obtained through a procedure similar to the McAulay and Quatieri analysis scheme [2, 28].

Through examples which feature noisy versions of both synthetic and real string tones, we demonstrate that the FZ-ARMA modeling offers a reliable means to overcome the limitations of the STFT-based methods when estimating the decay time of partials.

This paper is organized as follows. Section 2 reviews the basic properties of AR and ARMA modeling and discusses signal modeling strategies in full bandwidth as well as in subbands. In Section 3, we formulate the FZ-ARMA modeling scheme and address issues related to the choice of the processing parameters. In Section 4, we employ the FZ-ARMA modeling to focus the analysis on isolated partials of synthetic and real string tones. Moreover, we assess the FZ-ARMA modeling performance on estimating the decay times of the partial modes under noisy conditions. In addition, we confront the results of spectral analysis of the subband signals using ARMA models against those obtained through the ESPRIT method. Section 5 discusses applications of the FZ-ARMA modeling in sound synthesis. In particular, we show an example in which, from the FZ-ARMA analysis of a noisy guitar tone, a DWG-based guitar tone synthesizer is calibrated. Conclusions are drawn in Section 6.

2. AR/ARMA MODELING OF STRING INSTRUMENT SOUNDS

2.1. Basic definitions

An ARMA process of order p and q, here indicated as ARMA(p, q), can be generated by filtering a white noise sequence e(n) through a causal, linear, shift-invariant, and stable filter with transfer function [12]

H(z) = \frac{B_q(z)}{A_p(z)} = \frac{\sum_{k=0}^{q} b_q(k) z^{-k}}{1 - \sum_{k=1}^{p} a_p(k) z^{-k}}.   (1)

For real-valued filter coefficients, the transfer function of an ARMA(p, q) model has p poles and q zeros. Considering a flat power spectrum for the input, that is, Pe(z) = σe^2, the resulting output x(n) has power spectrum given by

P_x(z) = \sigma_e^2 \, \frac{B_q(z) B_q^{*}(1/z^{*})}{A_p(z) A_p^{*}(1/z^{*})},   (2)

where the symbol * stands for complex conjugation.

An AR process is a particular case of an ARMA process when q = 0. Thus, the generator filter assumes the form

H(z) = \frac{b(0)}{1 - \sum_{k=1}^{p} a_p(k) z^{-k}},   (3)

which is usually referred to as the transfer function of an all-pole filter.
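As a minimal illustration of (1) and (2), the Python sketch below synthesizes one ARMA realization by filtering white noise; the coefficient values are arbitrary and only chosen to give a stable filter. Note that SciPy's denominator convention is 1 + a(1)z^-1 + ..., that is, the negative of the a_p(k) in (1).

import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(1)
b = [1.0, 0.4]                  # B_q(z), q = 1
a = [1.0, -1.559, 0.81]         # denominator 1 - 1.559 z^-1 + 0.81 z^-2, poles at 0.9 exp(+-j*pi/6)
e = rng.standard_normal(4096)   # white noise input with sigma_e^2 = 1
x = lfilter(b, a, e)            # ARMA(2, 1) realization whose power spectrum follows (2)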


2.2. Parameter estimation of AR and ARMA processes

Thorough descriptions of methods for estimation of AR and ARMA models are outside the scope of this paper since this topic is well covered elsewhere [9, 12] and computer-aided tools are readily available for this purpose. Here, we briefly summarize the most commonly used methods.

Parameter estimation of AR processes can be done by several means, usually through the minimization of a modeling error cost function. Solving for the model coefficients from the so-called autocorrelation and covariance normal equations [9] are perhaps the most common ways.

The stability of the estimated AR models is an important issue in synthesis applications. The autocorrelation method guarantees AR model estimates that are minimum phase. The Matlab function ar.m allows estimating AR models using several approaches [29].

Parameter estimation of ARMA processes is more complicated since the normal equations are no longer linear in the pole-zero filter coefficients. Therefore, the estimation relies on nonlinear optimization procedures that have to be done in an iterative manner. Prony’s method and the Steiglitz-McBride iteration [30, 31] are examples of such schemes. A drawback of these methods is that the estimated pole-zero filters cannot be guaranteed to be minimum phase. In addition, and especially for high-order models, the estimated filters can be unstable. The functions prony.m and stmcb.m are available in Matlab for estimation of ARMA models using Prony’s and Steiglitz-McBride methods, respectively [32].
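For readers without the Matlab toolboxes, the sketch below shows one of the simplest of these estimators, a least-squares AR fit of the covariance type, written in Python. It is only a conceptual stand-in and not the ar.m/prony.m/stmcb.m routines used in the rest of the paper.

import numpy as np

def ar_covariance(x, p):
    """Least-squares (covariance-method) estimate of a_p(1..p) in
    x(n) = sum_k a_p(k) x(n-k) + e(n), matching the sign convention of (1)."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([x[p - k:len(x) - k] for k in range(1, p + 1)])
    a, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
    return a

# The model poles are the roots of 1 - sum_k a_p(k) z^-k:
# poles = np.roots(np.r_[1.0, -ar_covariance(x, p)])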

2.3. Full bandwidth modeling

Modeling of string instrument sounds has been approached by either physically motivated or signal modeling methods. Examples of the former can be found in physics-based algorithms for sound synthesis [18, 33, 34, 35]. Examples of the latter include the AR-based modeling of percussive sounds presented in [14, 15, 16, 36, 37].

In principle, when approaching the problem from a signal modeling point of view, it seems natural to employ a resonant filter, such as an all-pole or pole-zero filter, to model the mode behavior of a freely vibrating string, which consists of a sum of exponentially decaying sinusoids. However, modeling of broadband signals can be a tricky task. One practical issue related to both AR and ARMA modeling is model order selection. In general, there is no automated way to choose an appropriate order for the model assigned to a signal. For instance, one can deduce that AR modeling of low-pitched tones in full bandwidth is expected to require high-order models. The same is valid for piano tones, which are produced by one to three strings sounding together. In this case, considering the detuning among the strings and the two polarizations of transversal vibration per string, up to 6 resonance modes should be allocated to each partial of the tone.

In fact, the temporal envelope exhibited by partials of guitar and piano tones can be far from exponentially decaying. On the contrary, the usually observed temporal envelopes contain frequency beating and two-stage decay [38]. This indicates that the partials are composed of two or more modes that are tightly clustered in frequency. The need for high-resolution frequency analysis tools is evident in these cases.

If frequency analysis is to be performed by means of AR/ARMA modeling, higher spectral resolutions can be attained by increasing the model orders. However, parameter estimation of high-order AR/ARMA models may be problematic if the poles of the system are very close to the unit circle and if there are poles located close to each other. Realizing a filter with these features is very demanding, as the required dynamic range for the filter coefficients tends to be huge. In addition, computation of the roots associated with the corresponding polynomial in z, if necessary, can also be demanding and prone to numerical errors [39].

2.4. Frequency-selective modeling

The aforementioned problems have motivated the use of alternative modeling or analysis strategies based on subband decomposition [40]. In such schemes, the original signal is first split into several spectral subbands. Then, modeling or analysis of the resulting subband signals can be performed separately in each subband. Examples of subband modeling approaches can be found in [14, 16, 20, 24, 25].

A prompt advantage of subband decomposition of an AR/ARMA process is the possibility to focus the analysis on thinner portions of the spectrum. Thus, a small number of resonances can be analyzed at a time. This allows using lower-order models to analyze the subband signals. Moreover, the subband signals can be downsampled, as their bandwidth is reduced compared to that of the original signal. As a consequence, the implied decrease in temporal resolution due to downsampling is rewarded by an increase in frequency resolution. This helps in resolving resonant modes that are very close to each other in frequency. The effects of decimating AR and ARMA processes have been discussed in [21, 41, 42].

3. FREQUENCY-ZOOMING ARMA METHOD

As presented in [23], the FZ-ARMA analysis consists of the following steps.

(i) Define a frequency range of interest (for instance, to select a certain frequency region around the spectral peaks one wants to analyze).

(ii) Modulate the target signal (shift in frequency by multiplying with a complex exponential) to place the center of the previously defined frequency band at the origin of the frequency axis.

(iii) Lowpass filter the complex-valued modulated signal in order to attenuate its spectral content outside the band of interest.

(iv) Downsample the lowpass filtered signal according to its new bandwidth.

(v) Estimate an ARMA model for the previously obtained decimated signal. Throughout all examples shown in this work, the Steiglitz-McBride iteration method [12, 30, 31] is employed to perform this task. More specifically, we used the stmcb.m function available in the signal processing toolbox of Matlab [32].

In mathematical terms, and starting with a target sound signal h(n), the first two steps of the FZ-ARMA method imply defining a modulation frequency fm (in Hz) and multiplying h(n) by a complex exponential, so as to obtain the modulated response

h_m(n) = e^{-j \Omega_m n} h(n),   (4)

where Ωm = 2π fm/fs, with fs being the sample rate. This modulation implies only a clockwise rotation of the poles of a hypothetical transfer function H(z) associated with the AR process h(n). Thus, if zi is a pole of H(z) with phase arg(zi) = Ωi, its resulting phase after rotation becomes

\Omega_{i,\mathrm{rot}} = \Omega_i - \Omega_m.   (5)

The lowpass filtering is supposed to retain without distortion those poles located inside its passband. On the other hand, downsampling the resulting lowpass filtered response yields modified poles

z_{i,\mathrm{zoom}} = z_i^{K_{\mathrm{zoom}}} = |z_i|^{K_{\mathrm{zoom}}} e^{j(\Omega_i - \Omega_m) K_{\mathrm{zoom}}},   (6)

where Kzoom is the zooming factor, which relates the new sampling rate to the original one as fs,zoom = fs/Kzoom.

Now, we know what the zooming procedure does to the poles zi of the original transfer function. As a result, those poles zi,zoom estimated in subbands via ARMA modeling need to be remapped to the original fullband domain. This can be accomplished by inverse scaling the poles and counter-rotating them, that is,

z_i = \left( z_{i,\mathrm{zoom}} \right)^{1/K_{\mathrm{zoom}}} e^{j \Omega_m}.   (7)

The frequency and decay time of the resonances present within the analyzed subband can be drawn from the angle and magnitude of zi, respectively.

Note that the original target response is supposed to be real valued and, therefore, its transfer function must have complex-conjugate pole pairs. However, due to the one-sided modulation performed in (4), the subband model returns pure complex poles. Thus, if the goal is to devise a real-valued all-pole filter in fullband for synthesizing the contribution of resonances within the analyzed subband, its transfer function must include not only the remapped poles, but also their corresponding complex conjugates.
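The zooming chain in steps (ii)-(v) and the remapping (7) can be prototyped in a few lines of Python. The sketch below uses an illustrative FIR lowpass and a plain least-squares complex AR fit in place of the Steiglitz-McBride ARMA estimation employed in the paper; none of the parameter values are from the paper.

import numpy as np
from scipy.signal import firwin, lfilter

def fz_poles(h, fs, f_m, K_zoom, p):
    """Return fullband pole estimates for the subband selected by f_m (Hz) and K_zoom."""
    n = np.arange(len(h))
    omega_m = 2.0 * np.pi * f_m / fs
    h_mod = np.exp(-1j * omega_m * n) * h                 # (4): heterodyne the band of interest to DC
    lp = firwin(255, 0.9 / K_zoom)                        # lowpass for the zoomed band (placeholder length)
    h_zoom = lfilter(lp, [1.0], h_mod)[::K_zoom]          # filter and downsample by K_zoom
    # Subband pole estimation: least-squares complex AR fit as a stand-in for the ARMA model.
    X = np.column_stack([h_zoom[p - k:len(h_zoom) - k] for k in range(1, p + 1)])
    a, *_ = np.linalg.lstsq(X, h_zoom[p:], rcond=None)
    z_zoom = np.roots(np.r_[1.0, -a])
    return z_zoom ** (1.0 / K_zoom) * np.exp(1j * omega_m)   # (7): rescale and counter-rotate

For large Kzoom a single short lowpass filter is insufficient in practice, so the filter length above should be treated as a placeholder. From each remapped pole, the resonance frequency follows as arg(zi) fs/(2π), and the decay time corresponding to a pole radius |zi| is τ = −1/(fs ln |zi|).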

Hereafter, when referring to the models of the complex-valued subband signals, we will adopt the convention FZ-ARMA(p, q), where p and q stand for the orders of the denominator (AR part) and numerator (MA part), respectively.

3.1. Choice of parameters for the FZ-ARMA method

The choice of the FZ-ARMA parameters, that is, fm, Kzoom, and the model orders, depends on several factors. We will now discuss these issues.

3.1.1. Zoom factor

Considering first the zoom factor, it can be said that the greater Kzoom, the higher the frequency resolution attainable in a subband. This favors cases in which the frequencies of the modes are densely clustered. However, large values of Kzoom imply a more demanding signal decimation procedure and shorter decimated signals.

The values of Kzoom and fs,zoom are tied together, and the latter defines the bandwidth of the subband on which the analysis will be focused. For instance, if the aim is to analyze the behavior of isolated partials of a tone, the choice of fs,zoom should be such that its value is less than two times the minimum frequency difference between adjacent partials. On the other hand, fs,zoom should be large enough to guarantee that the modes belonging to a given partial do not lie inside different subbands.

While the model estimation may be unnecessarily overloaded if based on long signals, it may yield poor results if based on only a few signal samples. Therefore, the criterion upon which the value of fs,zoom is chosen should also take into account the number of samples that remain in the decimated signal.

3.1.2. Modulation frequency

Suppose that we are interested in analyzing a set of resonances concentrated around a frequency fr. Having defined the bandwidth of the zoomed subband fs,zoom, a straightforward choice is to set the value of the modulation frequency to fm = fr. Note that this option places the resonance peaks inside the subband around Ωr = 0. As pole estimation around Ωr = 0 may be more sensitive to numerical errors, we decided to adopt fm = fr − fs,zoom/8, which implies concentrating the peaks around Ωr = π/4. This frequency shift is not harmful since the resonance peaks are still well inside the subband. Thus, their characteristics are not severely distorted by the nonideal lowpass filtering employed during the decimation procedure. However, to afford this choice of fm and still ensure the isolation of a tone partial, the value of fs,zoom should be at most one and a half times the minimum frequency difference between adjacent partials.

The frequency of the partials can be predicted from that of the fundamental if the tone is harmonic or quasi-harmonic. However, as some level of dispersion is always present, errors at the frequencies of the higher partials are expected to occur. Alternatively, the frequencies of the partials can be determined by performing spectral analysis on the attack part of the tone and running a peak-picking algorithm over the resulting magnitude spectrum, as employed in [16, 25]. This approach is more general since it can deal with highly inharmonic tones.

In our experiments, we first estimate the fundamental frequency of the tone, a task that was performed through the multipitch estimator described in [43]. Then, after modeling the first partial, which allows obtaining a precise value of this partial frequency, the frequency of the following partial to be analyzed is set as the sum of the estimated frequency of the current partial and the value of the fundamental frequency. This procedure is repeated until one reaches the desired number of partials to be analyzed. This approach minimizes the problems related to multiplicative errors when predicting the frequencies of higher partials based on integer multiples of the fundamental frequency.
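The bookkeeping in this procedure amounts to a simple loop. The sketch below assumes a hypothetical analyze_partial(h, f_center) helper that runs the FZ-ARMA analysis around f_center and returns the refined partial frequency; it is not a function from the paper.

def track_partials(h, f0, n_partials, analyze_partial):
    """Analyze partials one by one, predicting each center frequency from the previous estimate."""
    freqs = []
    f_next = f0                               # start from the estimated fundamental
    for _ in range(n_partials):
        f_est = analyze_partial(h, f_next)    # refined frequency of the current partial
        freqs.append(f_est)
        f_next = f_est + f0                   # next partial predicted from the current estimate
    return freqs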

3.1.3. Model order

Regarding the orders of the ARMA models, they should be chosen so as to allow the modeling of the most prominent resonant modes of the signal. Depending on the case, a priori information on the characteristics of the signal at hand can be used to guide suitable model-order choices. For string instrument sounds, the estimation of the number of modes per partial can be based on the number of strings per note and the number of polarizations per string.

Moreover, it is known that if a real-valued signal has p resonant modes, one has to allocate at least two poles per resonant mode, that is, an ARMA(2p, 0) model, to properly model it. However, due to the one-sided modulation used in the FZ-ARMA scheme, the resulting subband signals are complex valued, and thus composed of pure complex poles. Therefore, a single complex pole per mode suffices. As a consequence, at the expense of working with complex arithmetic, the FZ-ARMA scheme optimizes the resources spent on modeling of the subband signals. This represents one advantage over, for instance, the modulation scheme proposed in [20], which yields real-valued decimated signals.

4. FZ-ARMA MODELING OF STRING INSTRUMENT TONES

In this section, we apply the FZ-ARMA modeling to analyze the resonant modes of isolated partials of string instrument sounds. We start by analyzing synthetic signals as a way to objectively evaluate the results. This allows knowing beforehand the mode frequencies and decay rates of the artificial tone. Thus, we can compare them with the estimates obtained via the FZ-ARMA modeling. In this context, the choice of the model orders is investigated as well as the modeling performance under noisy conditions. Then, following a similar analysis procedure, we evaluate the modeling performance of the FZ-ARMA method on recorded tones of real-world string instruments.

4.1. Experiments on artificially generated string instrument tones

4.1.1. Guitar tone synthesis

In this case study, the synthetic guitar tone is generated by means of a dual-polarization DWG model [18]. Thus, each of its partials has two modes with known parameters, that is, resonance frequencies and time constants of the exponentially decaying envelopes.

The string model for one polarization is depicted in Figure 1. Its transfer function is given by

S(z) = \frac{1}{1 - z^{-L_i} H_{FD}(z) H_{LF}(z)},   (8)

where z^{-L_i} and HFD(z) are, respectively, the integer and fractional parts of the delay line associated with the length of the string. This length is given by L = fs/fp, where fs and fp are the sample frequency and fundamental frequency of the tone, respectively. The transfer function HLF(z) is called loop filter and is in charge of simulating the frequency-dependent losses of the partial modes.

Figure 1: Block diagram of the string model. (The excitation e(n) feeds a feedback loop consisting of HLF(z), HFD(z), and the delay line z^{-L_i}; the loop output is x(n).)

Figure 2: Block diagram of the dual-polarization string model. The subscripts “v” and “h” stand for vertical and horizontal, respectively. (The excitation e(n) is weighted by α and 1 − α before feeding the two single-polarization string models, whose outputs are summed into x(n).)

For the sake of simplicity, we implemented the loop filter via the one-pole lowpass filter with transfer function given by

H_{LF}(z) = \frac{g(1 + a)}{1 + a z^{-1}}.   (9)

The magnitude response of HLF(z) must not exceed unity in order to guarantee the stability of S(z). This constraint imposes that 0 < g < 1 and −1 < a < 0. As regards the fractional-delay filter HFD(z), we chose to employ the first-order allpass filter proposed in [44], which implies the computation of a single coefficient afd. This choice assures that the decay rates of the partials depend mainly on the characteristics of HLF(z).

The dual-polarization model consists in placing two string models in parallel, as depicted in Figure 2. With this model, amplitude beating can be obtained by setting slightly different delay line lengths for each polarization. In addition, two-stage envelope decay can be accomplished by having loop filters with different magnitude responses for each polarization.

Consider first a string model with only one polarization. The partials of the resulting tone will decay exponentially and form a perfect harmonic series, that is, their frequencies are fν = ν fp, where fp is the fundamental frequency of the tone and ν = 1, ..., fs/(2 fp) are the partial indices. To determine the decay rate associated with each partial, we need to know the gain of the loop filter as well as the group delay of the feedback path (cascade of z^{-L_i}, HFD(z), and HLF(z)) at the partial frequencies. By defining the partial frequencies in radians as ων = 2π fν/fs, the gain of the loop filter at ων is given by

\left| H_{LF}\left(e^{j\omega_\nu}\right) \right| = \frac{g(1 + a)}{\sqrt{1 + 2a\cos(\omega_\nu) + a^2}}.   (10)

The group delay of a transfer function F(z) is commonly defined as the ratio ΓF(ω) = −∂ arg F(e^{jω})/∂ω. Then, if one defines G(ων) as the group delay (in samples) of the feedback path at ων, that is, G(ων) = Li + ΓHLF(ων) + ΓHFD(ων), the decay time (in seconds) of the partials can be obtained by

\tau_\nu = \frac{1}{f_s} \left( -\frac{G(\omega_\nu)}{\log\left( \left| H_{LF}\left(e^{j\omega_\nu}\right) \right| \right)} \right).   (11)

Table 1: Parameters used to generate the synthetic guitar tone. The sample rate was chosen as fs = 44.1 kHz and α was set to 0.5.

Polarization    fp        g      a      Li    afd
Vertical        200 Hz    0.997  −0.03  220   0.3614
Horizontal      200.4 Hz  0.980  −0.10  219   0.0263
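A direct numerical reading of (10) and (11), using the vertical-polarization values from Table 1, can be sketched in Python as follows. The group delay of the fractional-delay allpass is approximated here by the constant fractional part of L = fs/fp, which is a simplification of the text.

import numpy as np
from scipy.signal import group_delay

fs, fp = 44100.0, 200.0
g, a, L_i = 0.997, -0.03, 220          # vertical polarization, Table 1
nu = np.arange(1, 46)                  # partial indices (up to the 45th partial shown in the paper)
w_nu = 2 * np.pi * nu * fp / fs        # partial frequencies in rad/sample

gain = g * (1 + a) / np.sqrt(1 + 2 * a * np.cos(w_nu) + a ** 2)    # eq. (10)
_, gd_lf = group_delay(([g * (1 + a)], [1.0, a]), w=w_nu)          # loop-filter group delay (samples)
G = L_i + gd_lf + 0.5                  # feedback-path delay; 0.5 ~ fractional part of fs/fp
tau_nu = -G / (fs * np.log(gain))      # eq. (11): decay times in seconds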

Now we can generate an artificial guitar tone through the dual-polarization model, analyze it using the FZ-ARMA method, and compare the estimated values of the mode parameters with the theoretical ones. The tone is generated via the model shown in Figure 2 with the parameters given in Table 1.

By adopting the parameters shown in Table 1, one guarantees that the modes of each partial will decay with different time constants. Hence, each partial exhibits a two-stage envelope decay behavior. Moreover, the mode frequencies of each partial are also different, thus yielding amplitude modulation in its envelope.

4.1.2. FZ-ARMA analysis

To proceed with the FZ-ARMA analysis of the generated tone, we have to choose appropriate values for the frequency bands of interest and the corresponding modulation frequencies. In this example, equal-bandwidth subbands are used to analyze the partials. The subband bandwidth is chosen to be equal to the fundamental frequency of the vertical polarization. This implies a new sampling frequency of fs,zoom = fp,v = 200 Hz for the subband signals and a zoom factor Kzoom = 220. For convenience, we only show results of parameter estimation up to the 45th partial. As highlighted in Section 3.1.2, for each partial frequency fν (of the vertical polarization) to be analyzed, the modulation frequency is chosen to be fm = fν − fs,zoom/8.

The goal of this experiment is to gain insight into the model orders that are necessary to reasonably estimate the mode parameters of the partials of a guitar tone. The FZ-ARMA procedure was devised in such a way that the subband signals are supposed to contain only two complex modes. Therefore, at least an FZ-ARMA(2, 0) model must be employed to model each subband signal.

The results of mode parameter estimation obtained in this example are shown in Figure 3. Subplot 3(a) depicts the reference values of the time constants of each polarization, τν,h and τν,v, as a function of the partial index ν. In subplots 3(c) and 3(e), one finds the relative errors in the time constant estimates, ∆τν = |τν,ref − τν,meas|/τν,ref, when modeling the target signals through FZ-ARMA(2, 1) and FZ-ARMA(3, 2), respectively. Subplots 3(d) and 3(f) display the relative errors in the frequency estimates, ∆fν = |fν,ref − fν,meas|/fν,ref, when modeling the target signals through FZ-ARMA(2, 1) and FZ-ARMA(3, 2), respectively.

From Figure 3, it is possible to verify that low-order models suffice to estimate the mode frequencies. On the contrary, to properly estimate the decay time of the partial modes, higher-order models are required. Furthermore, as one could expect, it is more difficult to estimate the time constants of faster decaying modes.

4.1.3. Analysis of noisy tones

We start with the same synthetic tone devised in Section 4.1.1. This tone is then corrupted with zero-mean white Gaussian noise, whose variance is adjusted to produce a certain signal-to-noise ratio (SNR) within the first 10 milliseconds of the tone. We proceed with the FZ-ARMA analysis of four noisy tones with SNR equal to 40, 20, 10, and 0 dB, respectively. The goal now is to investigate the effect of the SNR on the decay time estimates of the partial modes.
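One way to realize this corruption step, written as a Python sketch under the stated definition (variable names are illustrative, not from the paper), is to measure the tone power over its first 10 ms and scale the noise accordingly:

import numpy as np

def add_noise(x, fs, snr_db, seg_ms=10.0, seed=0):
    """Add white Gaussian noise so that the given SNR holds over the first seg_ms of x."""
    n_seg = int(fs * seg_ms / 1000.0)
    p_sig = np.mean(np.asarray(x[:n_seg], dtype=float) ** 2)   # signal power over the first 10 ms
    p_noise = p_sig / 10.0 ** (snr_db / 10.0)                  # noise power for the target SNR
    noise = np.random.default_rng(seed).standard_normal(len(x)) * np.sqrt(p_noise)
    return x + noise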

As in the previous example, equal-bandwidth subbands are used to analyze the partials of the tone. But here, the adopted value of the zoom factor was Kzoom = 600. As before, the frequency fν of each partial to be analyzed defined the modulation frequency, which was chosen to be fm = fν − fs,zoom/8. To model the two-mode partial signals, FZ-ARMA(3, 3) models were used. From the poles of each estimated model, the two with the largest radii were selected to determine the decay times and frequencies of the partial modes. In addition, for the sake of convenience, the estimated mode parameters were sorted by decreasing values of decay time.

The results are depicted in Figure 4, in which the solid and dashed lines describe the reference values of the decay time, associated with the vertical and horizontal polarizations, respectively, as functions of the partial indices. The circle and square markers indicate the corresponding estimated values.

As one could expect, the estimation performance worsens as the SNR decreases. Nevertheless, it is worth noting that even for the signal with SNR equal to 10 dB, the majority of the estimated values of decay time are concentrated around the reference values, especially for the low-frequency partials. The occurring outliers can be either discarded, for example, negative values, or removed by means of median filtering. As for the mode frequency estimates (not shown), the maximum relative error encountered for the tone with SNR = 0 dB is on the order of ±0.1%, which is negligible.

4.1.4. Comparison against STFT-based methods

At this stage, one wonders if an estimation procedure based on short-time Fourier analysis or heterodyne filtering would yield results similar to those of the FZ-ARMA-based scheme when dealing with noisy signals.

Figure 3: Case study on a synthetic string tone with amplitude envelope featuring beating and two-stage decay. Subplots (a) and (b) show, respectively, the reference time constants and frequencies of the modes as functions of the partial index; subplots (c) and (d) depict the relative errors ∆τν = |τν,ref − τν,meas|/τν,ref and ∆fν = |fν,ref − fν,meas|/fν,ref when estimating τν and fν, respectively, via FZ-ARMA(2, 1) models; similar curves are shown in subplots (e) and (f) when adopting FZ-ARMA(3, 2) models. The results for the vertical and horizontal polarizations are indicated by solid and dashed lines, respectively.

In these approaches, each prominent partial is isolated somehow and the evolution of its amplitude over time is tracked. Then, a linear slope is fitted to the obtained log-amplitude envelope curve. The decay time of the analyzed partial is determined from the slope of the fitted curve.
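As a reference for the comparison, this slope-fitting step can be written in a couple of lines of Python; the envelope (in dB) and time axis are assumed to come from whatever partial tracker is used, and the names here are illustrative.

import numpy as np

def decay_time_from_envelope(t, env_db):
    """Fit a line to a log-amplitude envelope (dB versus seconds) and convert its slope to a decay time."""
    slope, _ = np.polyfit(t, env_db, 1)        # slope in dB per second (negative for a decaying partial)
    return -20.0 / (slope * np.log(10.0))      # tau such that the amplitude envelope ~ exp(-t / tau)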

To start answering our question, we should remember that, even for clean signals, there are situations in which the slope fitting just described does not give appropriate results. Perhaps the most striking one is when the envelope curve shows amplitude beating. Back to the noisy signals, there may be a point in the amplitude envelope curves of the partials after which the noise component dominates the amplitude.


Figure 4: Decay times of two-mode partials of a synthetic noisy guitar tone: comparison between reference values and FZ-ARMA(3, 3) estimates. The four panels correspond to SNR = 40, 20, 10, and 0 dB and plot decay time (in seconds) versus partial index, with the reference curves τv and τh and the corresponding estimates τv,est. and τh,est.

The noise floor is not so critical for the decay time estimation of low-frequency partials since they are usually stronger in amplitude and decay slowly. On the other hand, high-frequency partials are in general weaker in magnitude and decay fast. They are likely to reach and be masked by the noise floor very early in time. Taking into account the noise floor level is essential for the decay time estimation of these partials (see [26, Figure 5]).

For the sake of simplicity, we use neither the heterodyne filtering nor the sinusoidal modeling (SM) analysis in the comparisons shown in this section. Instead, we resort to the frequency-zooming procedure itself. The amplitude envelope curves of each partial are obtained directly from the evolution of the signal magnitude within each subband. Note that we are dealing with narrow subbands (bandwidth of about 70 Hz) and that each subband isolates a given partial. Therefore, the so-attained envelope curves approximate well the curves that would result from either the heterodyne filtering or the SM analyses. The latter, however, would provide smoother curves. Yet, they would inevitably be lower-bounded by the average amplitude of the noise floor.

As an example, we compare the analysis of two high-frequency partials (the 6th and 13th) of the string tone devised in Section 4.1. These high-order partials are chosen on purpose to illustrate the effect of the corrupting noise on the amplitude envelope curves. Figure 5 compares the envelope curves of the featured partials in three conditions: noiseless tone (thinner solid line), noisy signal with SNR = 0 dB (dash-dotted line), and modeled signal based on the noisy target (thicker solid line).

Figure 5: Analysis of the 6th and 13th partials of the synthetic tone: comparison among the envelopes of the reference signal (thinner solid line), its noisy version with SNR = 0 dB (dash-dotted line), and the modeled signal via FZ-ARMA(3, 3) (thicker solid line) based on the noisy signal.

From Figure 5, it becomes evident that, for the noisy sig-nal, decay time estimation of the partials via slope fitting isimpractical. On the contrary, the FZ-ARMA modeling is ca-pable of properly estimating the decay time of the slowest de-caying or the most prominent partial mode. Note that we areprimarily interested in the slope of the envelope curve. Theupward bias, which is observed in the envelopes of the mod-eled signals, occurs due to the difference in power betweenthe clean and the noisy version of the signal.

The frequency-zooming procedure per se accounts for a significant improvement in the value of the SNR. For instance, if the target signal is a single complex exponential immersed in white noise, the SNR improvement due to the zooming is 10 log10(Kzoom) dB. Of course, an even bigger SNR improvement can be achieved by FFT-based analysis. This comes from the fact that tracking a single frequency bin in the DFT domain (preferably refined by parabolic interpolation) implies analysis within a much narrower bandwidth than the frequency-zooming scheme. However, the improvement in the SNR is not the main issue here. This larger SNR improvement does not prevent the amplitude envelope from being lower-bounded by the noise floor level after some time.
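To make the zooming step concrete, the following Python sketch heterodynes the band around a chosen partial down to DC, lowpass filters it, and decimates by Kzoom; narrowing the bandwidth by Kzoom is what yields the roughly 10 log10(Kzoom) dB SNR gain discussed above. This is our own illustration rather than the authors' implementation, and the FIR length and cutoff are arbitrary choices.

    import numpy as np
    from scipy.signal import firwin

    def freq_zoom(x, fs, fm, Kzoom, numtaps=8191):
        """Frequency zooming: shift the band around fm to DC, lowpass, decimate."""
        n = np.arange(len(x))
        demod = x * np.exp(-2j * np.pi * fm * n / fs)   # complex demodulation
        h = firwin(numtaps, 0.8 / Kzoom)                # anti-aliasing lowpass (cutoff relative to Nyquist)
        y = np.convolve(demod, h, mode='same')
        return y[::Kzoom], fs / Kzoom                   # zoomed subband signal and its rate fs_zoom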

The key point here is that fitting a parametric model to the partial signals allows capturing their intrinsic temporal structure, even in noisy conditions. Moreover, the resonance features are derived from the model parameters rather than from a simple curve fitting process. As a consequence, a further improvement in the SNR is achieved, culminating in more reliable estimates for the decay time of the partials. Of course, the corrupting noise tends to degrade and bias the estimated models. Thus, any improvement in the SNR before the modeling stage is welcome. The frequency zooming helps in this matter as well.
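As an illustration of deriving resonance features from model parameters, the sketch below reads modal frequencies and decay times off the poles of a low-order all-pole fit to the zoomed (complex) subband signal. It is a simplified stand-in for the FZ-ARMA(3, 3) estimation used in the paper: covariance-method linear prediction replaces the full ARMA fit, and the function and variable names are ours.

    import numpy as np

    def subband_modes(z, fs_zoom, fm, p=3):
        """Modal frequencies [Hz] and decay times [s] from an order-p all-pole fit."""
        N = len(z)
        # least-squares predictor: z[n] ~ sum_k a[k] * z[n-1-k]
        A = np.column_stack([z[p - 1 - k:N - 1 - k] for k in range(p)])
        a, *_ = np.linalg.lstsq(A, z[p:], rcond=None)
        poles = np.roots(np.r_[1.0, -a])
        tau = -1.0 / (fs_zoom * np.log(np.abs(poles)))          # decay time of each mode
        freq = fm + fs_zoom * np.angle(poles) / (2 * np.pi)     # map back to absolute frequency
        return freq, tau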

4.1.5. Comparison against ESPRIT method

One could also think of applying other high-resolution spectral analysis methods to the subband signals. For instance, Laroche has used the ESPRIT method [20, 22] to analyze modes of isolated partials of clean piano tones. Just for comparison purposes, we repeat the experiments conducted in Section 4.1.3 using the ESPRIT method [22, 45]. More precisely, we employ the frequency-zooming procedure as before, but replace the ARMA modeling with the ESPRIT method as a means to analyze the subband signals.

In the ESPRIT method, we basically have to set three parameters: the length of the signal to be analyzed, N; the a priori estimate of the number of complex exponentials in the signal, M; and the pencil parameter, M ≤ Ppencil ≤ N − M. An analysis of the noise sensitivity of the ESPRIT method was conducted in [45] for single complex exponentials in noise. It revealed that Ppencil = N/3 and Ppencil = 2N/3 are the best choices for the pencil parameter in order to minimize the effects of the noise on the exponential estimates. Furthermore, as highlighted in [20], overestimating M is harmless and even desirable to avoid biased frequency estimates. The ESPRIT method outputs M complex eigenvalues from which the frequency and decay time of M exponentials can be derived. As M is usually overestimated, a pruning scheme has to be employed to select the most prominent exponentials. In our experiments, we take only the two exponentials with the largest decay times.
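For reference, a generic least-squares ESPRIT sketch for damped complex exponentials is shown below; it follows the parameter names used in the text (N, M, Ppencil) but is our own illustration, not the implementation of [22, 45], and the pruning step simply keeps the two slowest-decaying modes.

    import numpy as np
    from scipy.linalg import hankel

    def esprit_damped(x, M, P, fs):
        """LS-ESPRIT on a (complex) subband signal; returns (freqs [Hz], decay times [s])."""
        N = len(x)
        H = hankel(x[:P], x[P - 1:])                 # P x (N - P + 1) data matrix
        U, _, _ = np.linalg.svd(H, full_matrices=False)
        Us = U[:, :M]                                # signal subspace
        # shift invariance: Us[1:] ~ Us[:-1] @ Phi
        Phi = np.linalg.lstsq(Us[:-1, :], Us[1:, :], rcond=None)[0]
        z = np.linalg.eigvals(Phi)                   # signal poles
        freqs = np.angle(z) * fs / (2 * np.pi)       # relative to the demodulated band
        taus = -1.0 / (fs * np.log(np.abs(z)))
        keep = np.argsort(taus)[-2:]                 # the two slowest-decaying modes
        return freqs[keep], taus[keep]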

According to the results of our simulations, the performances of the ESPRIT and ARMA methods are equivalent for estimating the frequencies of the resonant modes. For instance, as regards the frequency estimates, the maximum relative errors measured for the tone with SNR = 0 dB were 0.19 and 0.11, respectively, for the ESPRIT and ARMA methods. In this particular example, FZ-ARMA(3, 3) models were used, whereas the parameter values adopted in the ESPRIT method were N = 295, Ppencil = 98, and M = 20.


Figure 6: Decay times of partial modes of synthetic noisy guitar tones: comparison between reference values (τv, τh) and ESPRIT estimates (τv,est., τh,est.; M = 20 and Ppencil = 98). [Two panels at SNR = 40 and 10 dB; decay time [s] versus partial index.]

The situation is different when it comes to the decay time estimates. It seems that the accuracy of these estimates is very dependent on the choice of the pencil parameter. For instance, when dealing with noisy signals, setting Ppencil = M yields underestimated values of decay time. On the contrary, increasing the value of Ppencil tends to produce overestimated values of decay time. According to the results of our experiments, this is also the case if Ppencil = N/3 is chosen.

Figure 6 compares the reference values of the decay time with the estimates obtained through the ESPRIT method with M = 20 and Ppencil = 98. It can be clearly seen that the decay times are substantially overestimated, even for moderate levels of SNR.

Figure 7: Decay times of partial modes of synthetic noisy guitar tones: comparison between reference values (τv, τh) and ESPRIT estimates (τv,est., τh,est.; M = 20 and Ppencil = 20). [Two panels at SNR = 40 and 10 dB; decay time [s] versus partial index.]

Interestingly enough, repeating the experiments for Ppencil = M = 20 yields better results, as can be seen in Figure 7. In this case, the estimates are much more accurate than those obtained with Ppencil = 98. Notwithstanding, these estimates are still worse than those drawn from the poles of the ARMA(3, 3) models fitted to the subband signals, as one can verify from Figure 4. Therefore, we stick to the FZ-ARMA modeling in the following experiments.

4.1.6. Discussion

Carrying out systematic performance comparisons among the addressed methods of decay time estimation is outside the scope of this work. Including such comparisons would demand not only covering a broader range of situations and examples, but also a precise description of the algorithms and the calibration of their associated processing parameters. Besides, comparisons between FFT-based schemes of spectral analysis, such as the SM technique, and parametric approaches are not fair. Sticking to comparisons among parametric methods of spectral analysis would necessarily include other techniques than just the ARMA and ESPRIT methods.

The comparisons shown in Section 4.1.4 are basically meant to highlight the situations in which STFT-based methods for decay time estimation are prone to failure. The underlying goal is to motivate the need for alternative solutions to decay time estimation in noisy conditions.

As for the performance comparisons between the ARMA and the ESPRIT methods, they were conducted after the frequency-zooming stage in order to keep the conditions equal. Yet, the performance results can depend significantly on the choice of the processing parameters. This fact is clearly verified by comparing the results shown in Figures 6 and 7. Moreover, translating the parameters of one method into those of the other may not be straightforward. For these reasons, we restrict the comparisons to a single case study. Rather than tabulating the attained performances, we believe that visual assessment of Figures 4, 6, and 7 offers a more effective means of drawing conclusions on the results.

In summary, the STFT-based schemes are appropriate for decay time estimation of the partials when the partials show monotonic and exponential decay and when the measurement noise is low. If the noise component is prominent, reliable decay time (and frequency) estimation of the high-order partials will be prevented. For both of the parametric methods tested, and under the setups adopted, reliable frequency estimation for the partials of noisy tones is attained. As regards the decay time estimation in noisy conditions, the ARMA analysis performs better in general than the ESPRIT method.

Now, we comment specifically on the analysis results of the noisy tone with SNR = 20 dB. The ESPRIT method seems to overestimate the decay times as the value of the pencil parameter increases. Adopting the minimum value for the pencil parameter yielded the best results. Yet, the ESPRIT analysis underestimates the decay times of the low-order partials. This is critical from the perceptual point of view, especially if one aims at resynthesizing a new tone based on the analyzed data. For the high-order partials, however, the ESPRIT-based decay time estimates seem to converge with low variance to the decay time of the slowest resonance mode. In contrast, there are more outliers in the decay time estimates attained via the ARMA analysis. Nevertheless, the ARMA analysis seems to do a better job in properly segregating the estimates into two distinct resonance modes.

Finally, when it comes to choosing the most appropriate technique, many variables should be considered. Examples of such variables are the characteristics of the problem at hand and the intended objectives, the effectiveness of the available tools in performing the targeted task, and the available computational resources. The latter issue, although important, is beyond the scope of this paper. Therefore, discussions on the computational complexity of the tested methods are not included.

4.2. Experiments on recorded string instrument tones

In this section, we follow the same methodology used in Sections 4.1.2 and 4.1.3 to analyze recorded tones of real-world string instruments. Here, we do not have a set of reference values for the decay times of the partials. Nevertheless, based on the results obtained for the synthetic tone, we can assume that the FZ-ARMA modeling of an originally clean tone provides correct estimates for the decay times of the partial modes. Then, this set of values can be taken as a reference.

For this experiment, we selected a clean classical guitar tone A2 (fp = 109.97 Hz, softly plucked open 5th string), which was recorded in anechoic conditions. Three noisy versions of this tone, with SNR = 60, SNR = 40, and SNR = 20 dB, respectively, were generated by adding zero-mean white Gaussian noise to the clean tone. The noise variance was adjusted so as to produce the desired SNR during the attack part of the tone (about 20 milliseconds starting from the maximum amplitude).
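A minimal sketch of this noise-addition step is given below (our own illustration; the 20-millisecond attack window starting at the amplitude maximum follows the description above, and the function name is hypothetical).

    import numpy as np

    def add_noise_attack_snr(clean, fs, snr_db, attack_ms=20.0):
        """Add white Gaussian noise so the requested SNR holds over the attack segment."""
        start = int(np.argmax(np.abs(clean)))
        seg = clean[start:start + int(attack_ms * 1e-3 * fs)]
        p_sig = np.mean(seg ** 2)                          # attack-segment power
        p_noise = p_sig / 10.0 ** (snr_db / 10.0)
        noise = np.sqrt(p_noise) * np.random.randn(len(clean))
        return clean + noise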

The first step of the analysis procedure is to obtain an estimate of the fundamental frequency of the noisy tone. This estimate is the starting point for the choices of the subband bandwidth and the modulation frequencies to be used in the FZ-ARMA analysis. The fundamental frequency of the tone with SNR = 20 dB was estimated to be fp = 110.25 Hz, which is not far from that of the clean tone. Thus, by following the guidelines stated in Section 3.1.2, we can proceed toward analyzing the higher partials of both the clean and the noisy tones. The parameters used in the FZ-ARMA analysis were Kzoom = 600, fm = fν − fs,zoom/8, and FZ-ARMA(3, 3) models. This time, only the decay time of the slowest decaying mode of each partial was extracted.
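Tying the pieces together, the fragment below shows how the freq_zoom and subband_modes sketches given earlier could be driven with these parameter values (a hypothetical usage example: the 44.1 kHz sampling rate, the array f_partials of estimated partial frequencies, and the signal tone are assumed inputs).

    Kzoom = 600
    fs = 44100.0                        # assumed sampling rate of the recording
    fs_zoom = fs / Kzoom
    for nu, f_nu in enumerate(f_partials, start=1):   # f_partials: estimated partial frequencies
        fm = f_nu - fs_zoom / 8.0                     # modulation frequency, as in the text
        z, _ = freq_zoom(tone, fs, fm, Kzoom)         # tone: the (noisy) guitar signal
        freqs, taus = subband_modes(z, fs_zoom, fm, p=3)
        print(nu, f_nu, taus.max())                   # keep the slowest-decaying mode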

The results of this experiment are displayed in Figure 8. The solid line curves correspond to the estimated values of decay time based on the original clean tone. On the other hand, the circles show the corresponding estimated values based on the noisy tones with the indicated SNRs. From Figure 8, we observe that, even for the tone with SNR = 20 dB, the FZ-ARMA analysis provides reliable decay time estimates, especially for the low-frequency partials.

5. APPLICATIONS IN SOUND SYNTHESIS

5.1. Digital waveguide synthesis

We have seen in Section 4 that the FZ-ARMA modeling can be used as an analysis tool, aiming at estimating the parameters associated with the resonances of the tone partials. Thus, based on the set of frequencies and decay times estimated for each partial, one could design a DWG model to resynthesize the tone.

More interestingly, the FZ-ARMA modeling allows estimating more than one frequency and decay time per partial. Thus, one can consider using this information to design the filters of a multipolarization DWG model, such as the dual-polarization DWG model shown in Figure 2.


Figure 8: FZ-ARMA(3, 3) estimates of the decay time of partials of an A2 guitar tone: comparisons among estimates based on the original clean signal and its noisy versions at different SNRs. [Four panels: original (clean) and SNR = 60, 40, and 20 dB; decay time [s] versus partial index.]

As in source-filter synthesis, in DWG-based synthesis the excitation signal is in charge of controlling the initial phase and amplitude of the resonance modes. In this work, however, we will not tackle the attainment of suitable excitation signals but concentrate more on the calibration of the string models.

Calibrating a multipolarization DWG model based on the estimated parameters of the partial modes is a difficult task, especially when dealing with real-world recorded tones immersed in noise. This is mainly due to the high variance exhibited by the decay time estimates of the partial modes. In contrast to what is seen in the analysis results of the synthetic tone shown in Section 4.1.2, the decay times of the partial modes estimated from a recorded tone cannot be easily discriminated into two or more distinct classes. Thus, deciding which partial mode belongs to which polarization turns out to be a difficult nonlinear optimization problem. We leave this topic for future research and stick to the calibration of the one-polarization DWG model.

5.1.1. Calibration of one-polarization DWG model from noisy tones

We start with an example in which the target signal is the corrupted version (SNR = 20 dB) of the recorded guitar tone featured in Section 4.2. From the FZ-ARMA analysis of this tone, we obtained estimates for the frequency and decay time of the partial modes. Then, the specification for the magnitude of the loop filter at the partial frequencies can be obtained by

\[ \left|H_{\mathrm{LF}}\left(f_\nu\right)\right| = e^{-\nu/\left(f_\nu\tau_\nu\right)}, \qquad (12) \]

where ν is the partial index, fν are the frequencies of the partials in Hz, and τν are the corresponding decay times in seconds.


Figure 9: Specification points and attained response of the 8th-order IIR loop filter: (a) smoothed magnitude specification (squares) versus attained response (solid line) up to the frequency of the 40th partial; (b) measured decay times (circles) versus attained values forged by the loop filter response (solid line). [Panel (a): magnitude versus frequency [Hz]; panel (b): decay time [s] versus frequency [Hz].]

As the sequence of estimated decay times, which was based on the corrupted signal, seems to have a couple of outliers, it was first median filtered using a three-sample window. The values of τν that result from the filtered sequence are then used in (12).

The specification of the loop filter within the frequency range above the frequency of the 40th partial is devised artificially. We fit a −6 dB per octave slope to the magnitude specification points associated with the highest 10 partials and extrapolate the curve up to the Nyquist frequency. To design a loop filter that approximates this extended specification, we resort to the IIR design method proposed in [46, 47].
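The sketch below assembles such a specification from the estimated partial frequencies and decay times: the three-point median smoothing, the magnitude targets of (12), and the −6 dB/octave extension above the highest specified partial. It is our own illustration; the IIR design step of [46, 47] is not reproduced, and the function name and the number of extension points are arbitrary.

    import numpy as np
    from scipy.signal import medfilt

    def loop_filter_spec(f_partial, tau, f_nyquist):
        """Magnitude specification points for the DWG loop filter."""
        tau_s = medfilt(tau, kernel_size=3)                 # remove decay-time outliers
        nu = np.arange(1, len(f_partial) + 1)
        mag = np.exp(-nu / (f_partial * tau_s))             # |H_LF(f_nu)|, eq. (12)
        # extend with a -6 dB/octave slope fitted to the 10 highest partials
        mag_db = 20.0 * np.log10(mag)
        offset = np.mean(mag_db[-10:] + 6.0 * np.log2(f_partial[-10:]))
        f_ext = np.geomspace(f_partial[-1] * 1.05, f_nyquist, 30)
        mag_ext = 10.0 ** ((offset - 6.0 * np.log2(f_ext)) / 20.0)
        return np.r_[f_partial, f_ext], np.r_[mag, mag_ext]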

Figure 9 shows the results obtained by approximating the specified (smoothed) magnitude response of the loss filter via an 8th-order IIR lowpass filter.

We could also think of designing a dispersion filter for the DWG model. In this case, the specification for the phase response of the allpass dispersion filter could be based on the estimated frequencies of the partials, in a similar manner to what was done in [48, 49]. However, for the noisy tone under study, the variance observed in these estimates prevented us from obtaining any meaningful specification for the dispersion filter.

6. CONCLUSION

In this paper, a spectral analysis technique based on FZ-ARMA modeling was applied to string instrument tones. More specifically, the method was used to analyze the resonant characteristics of isolated partials of the tones. In addition, analyses performed on noisy tones demonstrated that the FZ-ARMA modeling turns out to be a robust tool for estimating the frequencies and decay times of the partial modes, despite the presence of the corrupting noise. Comparisons between the estimates attained by FZ-ARMA modeling and those obtained via the ESPRIT method revealed a superior performance of the former method when dealing with noisy tones. Finally, the paper discussed the use of FZ-ARMA modeling in sound synthesis. In particular, the calibration of a DWG guitar synthesizer was successfully carried out based on FZ-ARMA analysis of a recorded guitar tone, which was artificially corrupted by zero-mean white Gaussian noise.

ACKNOWLEDGMENTS

The work of Paulo A. A. Esquef has been supported by a scholarship from the Brazilian National Council for Scientific and Technological Development (CNPq-Brazil) and by the Academy of Finland project "Technology for Audio and Speech Processing." The authors wish to thank Mr. Balazs Bank, Dr. Cumhur Erkut, and Dr. Lutz Trautmann for kindly providing some of the codes used in the simulations. Finally, the authors would like to thank the anonymous reviewers for their comments, which contributed to the improvement of the quality of this manuscript.

REFERENCES

[1] A. H. Benade, Fundamentals of Musical Acoustics, Dover Publications, Mineola, NY, USA, 1990.

[2] R. J. McAulay and T. F. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 34, no. 4, pp. 744–754, 1986.

[3] J. O. Smith III and X. Serra, "PARSHL: An analysis/synthesis program for non-harmonic sounds based on a sinusoidal representation," in Proc. International Computer Music Conference (ICMC '87), Champaign-Urbana, Ill, USA, 1987.

[4] R. C. Maher, "Sinewave additive synthesis revisited," in 91st AES Convention, New York, NY, USA, October 1991.

[5] J. B. Allen and L. R. Rabiner, "A unified approach to short-time Fourier analysis and synthesis," Proceedings of the IEEE, vol. 65, no. 11, pp. 1558–1564, 1977.

[6] H. S. Malvar, Signal Processing with Lapped Transforms, Artech House, Norwood, Mass, USA, 1992.

[7] L. Ljung, System Identification: Theory for the User, Prentice-Hall, Upper Saddle River, NJ, USA, 2nd edition, 1999.

[8] S. Haykin, Adaptive Filter Theory, Prentice-Hall, Upper Saddle River, NJ, USA, 3rd edition, 1996.

[9] S. M. Kay, Modern Spectral Estimation, Prentice-Hall, Englewood Cliffs, NJ, USA, 1988.

[10] A. V. Oppenheim, A. Willsky, and I. Young, Signals and Systems, Prentice-Hall, Englewood Cliffs, NJ, USA, 1983.

[11] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice-Hall, Englewood Cliffs, NJ, USA, 1993.

[12] M. H. Hayes, Statistical Digital Signal Processing and Modeling, John Wiley & Sons, New York, NY, USA, 1996.

[13] J. Makhoul, "Linear prediction: a tutorial review," Proceedings of the IEEE, vol. 63, no. 4, pp. 561–580, 1975.

[14] J. Laroche, "A new analysis/synthesis system of musical signals using Prony's method. Application to heavily damped percussive sounds," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, vol. 3, pp. 2053–2056, Glasgow, Scotland, UK, May 1989.

[15] J. Laroche and J.-L. Meillier, "Multichannel excitation/filter modeling of percussive sounds with application to the piano," IEEE Trans. Speech and Audio Processing, vol. 2, no. 2, pp. 329–344, 1994.

[16] M. W. Macon, A. McCree, W.-M. Lai, and V. Viswanathan, "Efficient analysis/synthesis of percussion musical instrument sounds using an all-pole model," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, vol. 6, pp. 3589–3592, Seattle, Wash, USA, May 1998.

[17] J. O. Smith III, "Efficient synthesis of stringed musical instruments," in Proc. International Computer Music Conference (ICMC '93), pp. 64–71, Tokyo, Japan, September 1993.

[18] M. Karjalainen, V. Valimaki, and Z. Janosy, "Towards high-quality sound synthesis of the guitar and string instruments," in Proc. International Computer Music Conference (ICMC '93), pp. 56–63, Tokyo, Japan, September 1993.

[19] J. Makhoul, "Spectral linear prediction: Properties and applications," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 23, no. 3, pp. 283–296, 1975.

[20] J. Laroche, "The use of the matrix pencil method for the spectrum analysis of musical signals," Journal of the Acoustical Society of America, vol. 94, no. 4, pp. 1958–1965, 1993.

[21] L. W. P. Biscainho, P. S. R. Diniz, and P. A. A. Esquef, "ARMA processes in sub-bands with application to audio restoration," in Proc. IEEE Int. Symp. Circuits and Systems, vol. 2, pp. 157–160, Sydney, Australia, May 2001.

[22] R. Roy, A. Paulraj, and T. Kailath, "ESPRIT—a subspace rotation approach to estimation of parameters of cisoids in noise," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 34, no. 5, pp. 1340–1342, 1986.

[23] M. Karjalainen, P. A. A. Esquef, P. Antsalo, A. Makivirta, and V. Valimaki, "Frequency-zooming ARMA modeling of resonant and reverberant systems," Journal of the Audio Engineering Society, vol. 50, no. 12, pp. 1012–1029, 2002.

[24] J. Laroche and J.-L. Meillier, "A simplified source/filter model for percussive sounds," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 173–176, New York, NY, USA, October 1993.

[25] R. B. Sussman and M. Kahrs, "Analysis and resynthesis of musical instrument sounds using energy separation," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, vol. 2, pp. 997–1000, Atlanta, Ga, USA, May 1996.

[26] C. Erkut, V. Valimaki, M. Karjalainen, and M. Laurson, "Extraction of physical and expressive parameters for model-based sound synthesis of the classical guitar," in 108th AES Convention, Paris, France, February 2000, preprint 5114. Available on-line at http://lib.hut.fi/Diss/2002/isbn9512261901/.

[27] P. A. A. Esquef, V. Valimaki, and M. Karjalainen, "Restoration and enhancement of solo guitar recordings based on sound source modeling," Journal of the Audio Engineering Society, vol. 50, no. 4, pp. 227–236, 2002.

[28] V. Valimaki, J. Huopaniemi, M. Karjalainen, and Z. Janosy, "Physical modeling of plucked string instruments with application to real-time sound synthesis," Journal of the Audio Engineering Society, vol. 44, no. 5, pp. 331–353, 1996.

[29] MathWorks, "MATLAB System Identification Toolbox," 2001, User's Guide.

[30] K. Steiglitz and L. E. McBride, "A technique for the identification of linear systems," IEEE Trans. Automatic Control, vol. 10, no. 4, pp. 461–464, 1965.

[31] K. Steiglitz, "On the simultaneous estimation of poles and zeros in speech analysis," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 25, no. 3, pp. 229–234, 1977.

[32] MathWorks, "MATLAB Signal Processing Toolbox," 2001, User's Guide.

[33] J. O. Smith III, Techniques for digital filter design and system identification with application to the violin, Ph.D. thesis, Elec. Eng. Dept., Stanford University, Stanford, Calif, USA, 1983.

[34] J. O. Smith III, "Physical modeling using digital waveguides," Computer Music Journal, vol. 16, no. 4, pp. 74–91, 1992.

[35] M. Karjalainen and J. O. Smith III, "Body modeling techniques for string instrument synthesis," in Proc. International Computer Music Conference (ICMC '96), pp. 232–239, Hong Kong, China, August 1996.

[36] M. Sandler, "Analysis and synthesis of atonal percussion using high order linear predictive coding," Applied Acoustics, vol. 30, no. 2-3, pp. 247–264, 1990.

[37] J.-L. Meillier and A. Chaigne, "AR modeling of musical transients," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, pp. 3649–3652, Toronto, Canada, April 1991.

[38] G. Weinreich, "Coupled piano strings," Journal of the Acoustical Society of America, vol. 62, no. 6, pp. 1474–1484, 1977.

[39] M. Sandler, "Algorithm for high precision root finding from high order LPC models," IEE Proceedings. Part I: Communications, Speech and Vision, vol. 138, no. 6, pp. 596–602, 1991.

[40] P. P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice-Hall, Englewood Cliffs, NJ, USA, 1993.

[41] K. B. Eom and R. Chellappa, "ARMA processes in multirate filter banks with applications to radar signal classification," in Proc. IEEE-SP International Symposium on Time-Frequency and Time-Scale Analysis, pp. 136–139, Philadelphia, Pa, USA, October 1994.

[42] A. Benyassine and A. N. Akansu, "Subspectral modeling in filter banks," IEEE Trans. Signal Processing, vol. 43, no. 12, pp. 3050–3053, 1995.

[43] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Trans. Speech and Audio Processing, vol. 8, no. 6, pp. 708–716, 2000.

[44] D. A. Jaffe and J. O. Smith III, "Extensions of the Karplus-Strong plucked-string algorithm," Computer Music Journal, vol. 7, no. 2, pp. 56–69, 1983.

[45] Y. Hua and T. K. Sarkar, "Matrix pencil method for estimating parameters of exponentially damped/undamped sinusoids in noise," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 38, no. 5, pp. 814–824, 1990.

[46] B. Bank, "Physics-based sound synthesis of the piano," Tech. Rep. 54, Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology, Espoo, Finland, June 2000, available on-line at http://www.acoustics.hut.fi/publications/.

[47] B. Bank and V. Valimaki, "Robust loss filter design for digital waveguide synthesis of string tones," IEEE Signal Processing Letters, vol. 10, no. 1, pp. 18–20, 2003.

[48] D. Rocchesso and F. Scalcon, "Accurate dispersion simulation for piano strings," in Proc. Nordic Acoustical Meeting (NAM '96), pp. 407–414, Helsinki, Finland, June 1996.

[49] L. Trautmann, B. Bank, V. Valimaki, and R. Rabenstein, "Combining digital waveguide and functional transformation methods for physical modeling of musical instruments," in Proc. AES 22nd International Conference on Virtual, Synthetic and Entertainment Audio, pp. 307–316, Espoo, Finland, June 2002.

Paulo A. A. Esquef was born in Brazil, in 1973. He received the Engineering degree from the Polytechnic School of the Federal University of Rio de Janeiro (UFRJ) in 1997 and the M.S. degree from COPPE-UFRJ in 1999, both in electrical engineering. His M.S. thesis addressed digital restoration of old recordings. From 1999 to 2000, he worked on research and development of a DSP system for analysis and classification of sonar signals as part of a cooperation project between the Signal Processing Laboratory (COPPE-UFRJ) and the Brazilian Navy Research Center (IPqM). Since 2000, he has been with the Laboratory of Acoustics and Audio Signal Processing at Helsinki University of Technology, where he is currently pursuing postgraduate studies. He is a grant holder from CNPq, a Brazilian governmental council for funding research in science and technology. His research interests include, among others, digital audio restoration, computational auditory scene analysis, and sound synthesis. Esquef is an associate member of the IEEE and a member of the Audio Engineering Society.

Matti Karjalainen was born in Hankasalmi, Finland, in 1946. He received the M.S. and the Dr.Tech. degrees in electrical engineering from the Tampere University of Technology, in 1970 and 1978, respectively. Since 1980, he has been Professor in acoustics and audio signal processing at the Helsinki University of Technology in the faculty of Electrical Engineering. In audio technology, his interest is in audio signal processing, such as DSP for sound reproduction, perceptually based signal processing, as well as music DSP and sound synthesis. In addition to audio DSP, his research activities cover speech synthesis, analysis, and recognition, perceptual auditory modeling and spatial hearing, DSP hardware, software, and programming environments, as well as various branches of acoustics, including musical acoustics and modeling of musical instruments. He has written more than 300 scientific and engineering articles or papers and contributed to organizing several conferences and workshops. Professor Karjalainen is an AES (Audio Engineering Society) Fellow and a member of the IEEE (Institute of Electrical and Electronics Engineers), ASA (Acoustical Society of America), EAA (European Acoustics Association), ISCA (International Speech Communication Association), and several Finnish scientific and engineering societies.

Vesa Valimaki was born in Kuorevesi, Finland, in 1968. He received his M.S. in Technology, Licentiate of Science (Lic.S.) in Technology, and Doctor of Science (D.S.) in Technology degrees in electrical engineering from Helsinki University of Technology (HUT), Espoo, Finland, in 1992, 1994, and 1995, respectively. Dr. Valimaki worked at the HUT Laboratory of Acoustics and Audio Signal Processing from 1990 until 2001. In 1996, he was a Postdoctoral Research Fellow with the University of Westminster, London, UK. During the academic year 2001–2002, he was Professor of signal processing at Pori School of Technology and Economics, Tampere University of Technology (TUT), Pori, Finland. In August 2002, he returned to HUT, where he is currently Professor of audio signal processing. In 2003, he was appointed Docent in signal processing at Pori School of Technology and Economics, TUT. His research interests are in the application of digital signal processing to audio and music. He has published more than 120 papers in international journals and conferences. He holds 2 patents. Dr. Valimaki is a senior member of the IEEE Signal Processing Society and a member of the Audio Engineering Society and the International Computer Music Association.


EURASIP Journal on Applied Signal Processing 2003:10, 968–979
© 2003 Hindawi Publishing Corporation

Virtual Microphones for Multichannel Audio Resynthesis

Athanasios Mouchtaris
Electrical Engineering Systems Department, Integrated Media Systems Center (IMSC), University of Southern California, 3740 McClintock Avenue, Los Angeles, CA 90089-2564, USA
Email: [email protected]

Shrikanth S. Narayanan
Electrical Engineering Systems Department, Integrated Media Systems Center (IMSC), University of Southern California, 3740 McClintock Avenue, Los Angeles, CA 90089-2564, USA
Email: [email protected]

Chris Kyriakakis
Electrical Engineering Systems Department, Integrated Media Systems Center (IMSC), University of Southern California, 3740 McClintock Avenue, Los Angeles, CA 90089-2564, USA
Email: [email protected]

Received 30 May 2002 and in revised form 17 February 2003

Multichannel audio offers significant advantages for music reproduction, including the ability to provide better localization and envelopment, as well as reduced imaging distortion. On the other hand, multichannel audio is a demanding media type in terms of transmission requirements. Often, bandwidth limitations prohibit transmission of multiple audio channels. In such cases, an alternative is to transmit only one or two reference channels and recreate the rest of the channels at the receiving end. Here, we propose a system capable of synthesizing the required signals from a smaller set of signals recorded in a particular venue. These synthesized "virtual" microphone signals can be used to produce multichannel recordings that accurately capture the acoustics of that venue. Applications of the proposed system include transmission of multichannel audio over the current Internet infrastructure and, as an extension of the methods proposed here, remastering existing monophonic and stereophonic recordings for multichannel rendering.

Keywords and phrases: multichannel audio, Gaussian mixture model, distortion measures, virtual microphones, audio resynthesis, multiresolution analysis.

1. INTRODUCTION

Multichannel audio can enhance the sense of immersion for a group of listeners by reproducing the sounds that would originate from several directions around the listeners, thus simulating the way we perceive sound in a real acoustical space. On the other hand, multichannel audio is one of the most demanding media types in terms of transmission requirements. A novel architecture allowing delivery of uncompressed multichannel audio over high-bandwidth communications networks was presented in [1]. As suggested there, for applications in which bandwidth limitations prohibit transmission of multiple audio channels, an alternative would be to transmit only one or two channels (denoted as reference channels or recordings in this work, for example, the left and right signals in a traditional stereo recording) and reconstruct the remaining channels at the receiving end. The system proposed in this paper provides a solution for reconstructing the channels of a specific recording from the reference channels and is particularly suitable for live concert hall performances. The proposed method is based on information about the acoustics of a specific concert hall and the microphone locations with respect to the orchestra; this information can be extracted from the specific multichannel recording.

Before proceeding to the description of the proposed method, a brief outline of the basis of our approach is given. A number of microphones are used to capture several characteristics of the venue, resulting in an equal number of stem recordings (or elements). Figure 1 provides an example of how microphones may be arranged in a recording venue for a multichannel recording.


Figure 1: An example of how microphones may be arranged in a recording venue for a multichannel recording. In the virtual microphone synthesis algorithm, microphones A and B are the main reference pair from which the remaining microphone signals can be derived. Virtual microphones C and D capture the hall reverberation, while virtual microphones E and F capture the reflections from the orchestra stage. Virtual microphone G can be used to capture individual instruments such as the tympani. These signals can then be mixed and played back through a multichannel audio system that recreates the spatial realism of a large hall.

These recordings are then mixed and played back through a multichannel audio system that attempts to recreate the spatial realism of the recording venue. Our objective is to design a system, based on the available stem recordings, which is able to recreate all of these recordings from the reference channels at the receiving end (thus, stem recordings are also referred to as target recordings here). The result would be a significant reduction in transmission requirements, while enabling mixing at the receiving end. Consequently, such a system would be suitable for completely resynthesizing any number of channels in the initial recording (i.e., no information about the target recordings needs to be transmitted other than the conversion parameters). This is different from what commercial systems accomplish today. In addition, the system proposed in this paper is a structured representation of multichannel audio that lends itself to other possible applications such as multichannel audio synthesis, which is briefly described later in this section. By examining the acoustical characteristics of the various stem recordings, the microphones are divided into reverberant and spot microphones.

Spot microphones are microphones that are placed close to the sound source (e.g., G in Figure 1). These microphones introduce a very challenging situation. Because the source of sound is not a point source but rather distributed, such as in an orchestra, the recordings of these microphones depend largely on the instruments that are near the microphone and not so much on the acoustics of the hall. Synthesizing the recordings of these microphones, therefore, involves enhancing certain instruments and diminishing others, which in most cases overlap both in the time and frequency domains. The algorithm described here, focusing on this problem, is based on spectral conversion (SC). The special case of percussive drum-like sounds is separately examined since these sounds are of impulsive nature and cannot be addressed by SC methods. These sounds are of particular interest, however, since they greatly affect our perception of proximity to the orchestra.

Reverberant microphones are the microphones placed far from the sound source, for example, C and D in Figure 1. These microphones are treated separately as one category because they mainly capture reverberant information (which can be reproduced by the surround channels in a multichannel playback system). The recordings captured by these microphones can be synthesized by filtering the reference recordings through linear time-invariant (LTI) filters, designed using the methods that will be described in later sections of this paper. Existing reverberation methods use a combination of comb and all-pass filters to add reverberation to an existing monophonic or stereophonic signal. Our objective is to estimate the appropriate filters that capture the concert hall acoustical properties from a given set of stem microphone recordings. We describe an algorithm that is based on a spectral estimation approach and is particularly suitable for generating such filters for large venues with long reverberation times. Ideally, the resulting filter implements the spectral modification induced by the hall acoustics.

We have obtained such stem microphone recordings from two orchestra halls in the USA by placing microphones at various locations throughout the hall. By recording a performance with a total of sixteen microphones, we then designed a system that recreates these recordings (thus named virtual microphone recordings) from the main microphone pair. It should be noted that the methods proposed here intend to provide a solution for the problem of resynthesizing existing multichannel recordings from a smaller subset of these recordings. The problem of completely synthesizing multichannel recordings from stereophonic (or monophonic) recordings, thus greatly augmenting the listening experience, is not addressed here. The synthesis problem is a topic of related research to appear in a future publication. However, it is important to distinguish the cases where these two problems (synthesis and resynthesis) differ. For reverberant microphones, since the result of our method is a group of LTI filters, both problems are addressed at the same time. The filters designed are capable of recreating the acoustic properties of the venue where the specific recordings took place. If these filters are applied to an arbitrary (nonreverberant) recording, the resulting signal will contain the venue characteristics at the particular microphone location. In such a manner, it is possible to completely synthesize reverberant stem recordings and, consequently, a multichannel recording. On the contrary, this will not be possible for the spot microphone methods. As will become clear later, the algorithms described here are based on the specific recordings that are available. The result is a group of SC functions that are designed by estimating the unknown parameters based on training data that are available from the target recordings. These functions cannot be applied to an arbitrary signal and produce meaningful results. This is an important issue when addressing the synthesis problem and will not be the topic of this paper.


The remainder of this paper is organized as follows. In Section 2, the spot microphone resynthesis problem is addressed. SC methods are described and applied to the problem in different subbands of the audio signal. The special case of percussive sounds is also examined. In Section 3, the reverberant microphone resynthesis problem is examined. The issue of defining an objective measure of the method's performance arises and is addressed by defining a normalized mutual information (NMI) measure. Finally, a brief discussion of the results is given in Section 4 and possible directions for future research on the subject are proposed.

2. SPOT MICROPHONE RESYNTHESIS

The methods for spot microphones are geared towards enhancing certain instruments in the reference recording. Note that this problem is different from the source separation problem, which seeks to extract an instrument from a signal containing multiple instruments; nor do we attempt to estimate the room impulse response and thus dereverberate the signals. Instead, it is an attempt to simulate what a microphone near a particular instrument would pick up, which includes mostly a "dry" (nonreverberant) version of the instrument and some leakage from nearby instruments. The instruments close to the target microphone are far more prominent in the target recording than in the reference recording. Our objective is to retain the perceptual advantages of the multichannel recording, as a first step towards addressing the problem. This, in effect, means that our objective is to enhance the desired voices/instruments in the reference recording even if the resynthesized signal is not identical to the desired one. We were able, as stated later, to produce identical responses for the reverberant microphones case; however, the spot microphone case proved to be far more demanding.

For the spot microphones case, the nonstationarity of the audio signals is the focus of this paper; the SC methods attempt to address this problem. The problem arises from the fact that the objective of our method is to enhance a particular instrument in the reference recording. The instrument to be enhanced has a frequency response that varies significantly in time, and as a result, a time-invariant filter would not produce meaningful results. Our methods are based on the fact that the reference and target responses are highly related (same performance recorded simultaneously with different microphones). Based on this observation, the desired transfer function, although constantly varying in time, can be estimated, based on the reference recording, with the use of the SC methods. For the spot microphones case, each target microphone captures mainly a specific type of instruments while the reference microphone "weighs" all instruments approximately equally. This corresponds to the dependence of the spot microphones on their location with respect to the orchestra. Although the response of these microphones depends on the acoustics of the hall as well, this dependence is not considered acoustically significant (for reasons explained in Section 2.1), and this greatly simplifies the solution. The methods proposed here result in one conversion function for each pair of spot and reference microphones (with the reference microphone remaining the same in all cases) so that all target waveforms can be resynthesized from only one recording.

2.1. Spectral conversion

Our initial experiments for the spot microphones case, detailed in the next paragraph, motivated us to focus on modifying the short-term spectral properties of the reference audio signal in order to recreate the desired one. The short-term spectral properties are extracted by using a short sliding window with overlapping (resulting in a sequence of signal segments or frames). Each frame is modeled as an autoregressive (AR) filter excited by a residual signal. The AR filter coefficients are found by means of linear predictive (LP) analysis [2] and the residual signal is the result of the inverse filtering of the audio signal of the current frame by the AR filter. The LP coefficients are modified in a way to be described later in this section and the residual is filtered with the designed AR filter to produce the desired signal of the current frame. Finally, the desired response is synthesized from the designed frames using overlap-add techniques [3].
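A minimal sketch of this per-frame residual/LP analysis and resynthesis is given below (our own illustration under the stated model: autocorrelation-method LP for the AR coefficients, inverse filtering for the residual, and refiltering with a modified AR filter; windowing and overlap-add of the frames are left out).

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.signal import lfilter

    def lp_analysis(frame, p):
        """AR polynomial a = [1, -a1, ..., -ap] and residual of one frame."""
        r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
        coeffs = solve_toeplitz(r[:p], r[1:p + 1])     # Yule-Walker normal equations
        a = np.r_[1.0, -coeffs]
        residual = lfilter(a, [1.0], frame)            # inverse (whitening) filter
        return a, residual

    def lp_synthesis(a_converted, residual):
        """Excite the (converted) AR filter with the reference residual."""
        return lfilter([1.0], a_converted, residual)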

It is interesting to describe one of our initial experiments that led us to focus on the short-term spectral envelope and, as a consequence, on the SC methods that are described next. In this simple experiment, we attempted to synthesize the desired response (in this case the response captured by the microphone placed close to the chorus of the orchestra) by using the reference residual and the cepstral coefficients obtained from the desired response. In other words, we were interested in testing the result of our resynthesis methods in the ideal case where the desired sequence of cepstral coefficients was correctly "predicted." The result was an audio signal which sounded more reverberant than the desired signal (for reasons explained later in this section), but extremely similar in all other respects. Thus, deriving an algorithm that correctly predicts the desired sequence of cepstral coefficients from the reference cepstral coefficients of the respective frame would result in a resynthesized signal very close to the desired signal. The problem as stated is exactly the problem statement of SC, which aims to design a mapping function from the reference to the target space, whose parameters remain constant for a particular pair of reference and target sources. The result will be a significant reduction of information as the target response can be reconstructed using the reference signal and this function.

Such a mapping function can be designed by following the approach of voice conversion algorithms [4, 5, 6]. The objective of voice conversion is to modify a speech waveform so that the context remains as it is but appears to be spoken by a specific (target) speaker. Although the application is completely different, the followed approach is very suitable for our problem. In voice conversion, pitch and time scaling need to be considered, while in the application examined here, this is not necessary.


This is true since the reference and target waveforms come from the same excitation recorded with different microphones and the need is not to modify but to enhance the reference waveform. However, in both cases, there is the need to modify the short-term spectral properties of the waveform.

At this point, it is of interest to mention that the SC methods are useful for modifying the spectral coloration of the signal, and the target response is resynthesized using the modified spectral envelope along with the residual derived from the reference recording. Note that short-term analysis indicates the use of windows in the order of 50 milliseconds, which means that the residual (in effect, the modeling error) contains the reverberation which cannot be modeled with the short-term spectral envelope. As a result, the resynthesized response might sound more reverberant than the target response, depending on how reverberant the reference response originally is. Our concern, though, is mostly to enhance a specific instrument within the reference recording, without focusing on dereverberating the signal. In most cases, this will not be an issue, given that usually the reference recordings are not highly reverberant.

Assuming that a sequence [x1 x2 · · · xn] of reference spectral vectors (e.g., line spectral frequencies (LSFs), cepstral coefficients, etc.) is given, as well as the corresponding sequence of target spectral vectors [y1 y2 · · · yn] (training data from the reference and target recordings, respectively), a conversion function F(·) can be designed which, when applied to vector xk, produces a vector close in some sense to vector yk. Many algorithms have been described for designing this function (see [4, 5, 6, 7] and the references therein). Here, the algorithms based on vector quantization (VQ) [4] and Gaussian mixture models (GMM) [5, 6] were implemented and compared.

2.1.1. SC based on VQ

Under this approach, the spectral vectors of the reference and target signals (training data) are vector quantized using the well-known modified K-means clustering algorithm (see, e.g., [8] for details). Then, a histogram is created indicating the correspondences between the reference and target centroids. Finally, the function F is defined as the linear combination of the target centroids using the designed histogram as a weighting function. It is important to mention that in this case the spectral vectors were chosen to be the cepstral coefficients so that the distance measure used in clustering is the truncated cepstral distance.
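A sketch of this VQ-based mapping is shown below, using scikit-learn's standard K-means as a stand-in for the modified K-means of [8] (Euclidean distance on truncated cepstra is equivalent to the truncated cepstral distance); the training pairs are assumed to be time-aligned frames, and all names are ours.

    import numpy as np
    from sklearn.cluster import KMeans

    def train_vq_mapping(X_src, Y_tgt, n_codes=1024):
        """Cluster source/target cepstra and build the correspondence histogram."""
        km_x = KMeans(n_clusters=n_codes, n_init=4).fit(X_src)
        km_y = KMeans(n_clusters=n_codes, n_init=4).fit(Y_tgt)
        hist = np.zeros((n_codes, n_codes))
        for i, j in zip(km_x.labels_, km_y.labels_):
            hist[i, j] += 1.0                          # codeword co-occurrences
        hist /= np.maximum(hist.sum(axis=1, keepdims=True), 1.0)
        return km_x, km_y.cluster_centers_, hist

    def vq_convert(x, km_x, tgt_codes, hist):
        """Convert one source vector as a weighted combination of target centroids."""
        i = km_x.predict(x[None, :])[0]
        return hist[i] @ tgt_codes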

2.1.2. SC based on GMM

In this case, the assumption made is that the sequence of spectral vectors xk is a realization of a random vector x with a probability density function (pdf) that can be modeled as a mixture of M multivariate Gaussian pdfs. GMMs have been repeatedly used in such a manner to model the properties of audio signals with reasonable success (see, e.g., [9, 10, 11]).

According to GMMs, the pdf of x, g(x), can be written as

\[ g(x) = \sum_{i=1}^{M} p(\omega_i)\, \mathcal{N}\!\left(x; \mu_i^{x}, \Sigma_i^{xx}\right), \qquad (1) \]

where N(x; μ, Σ) is the multivariate normal distribution with mean vector μ and covariance matrix Σ, and p(ωi) is the prior probability of class ωi. The parameters of the GMM, that is, the mean vectors, covariance matrices, and priors, can be estimated using the expectation maximization (EM) algorithm [12].

As already mentioned, the function F is designed so that the spectral vectors yk and F(xk) are close in some sense. In [5], the function F is designed such that the error

\[ \varepsilon = \sum_{k=1}^{n} \left\| y_k - \mathcal{F}\!\left(x_k\right) \right\|^2 \qquad (2) \]

is minimized. Since this method is based on least squares estimation, it will be denoted as the LSE method. This problem becomes possible to solve under the constraint that F is piecewise linear, that is,

\[ \mathcal{F}\!\left(x_k\right) = \sum_{i=1}^{M} p\!\left(\omega_i \mid x_k\right)\left[ v_i + \Gamma_i \Sigma_i^{xx^{-1}}\left(x_k - \mu_i^{x}\right) \right], \qquad (3) \]

where the conditional probability that a given vector xk belongs to class ωi, p(ωi | xk), can be computed by applying Bayes' theorem:

\[ p\!\left(\omega_i \mid x_k\right) = \frac{p(\omega_i)\, \mathcal{N}\!\left(x_k; \mu_i^{x}, \Sigma_i^{xx}\right)}{\sum_{j=1}^{M} p(\omega_j)\, \mathcal{N}\!\left(x_k; \mu_j^{x}, \Sigma_j^{xx}\right)}. \qquad (4) \]

The unknown parameters (vi and Γi, i = 1, . . . , M) can be found by minimizing (2), which reduces to solving a typical least squares equation.

A different solution for the function F results when a different criterion than (2) is minimized [6]. Assuming that x and y are jointly Gaussian for each class ωi, then, in the mean-squared sense, the optimal choice for the function F is

\[ \mathcal{F}\!\left(x_k\right) = E\!\left(y \mid x_k\right) = \sum_{i=1}^{M} p\!\left(\omega_i \mid x_k\right)\left[ \mu_i^{y} + \Sigma_i^{yx} \Sigma_i^{xx^{-1}}\left(x_k - \mu_i^{x}\right) \right], \qquad (5) \]

where E(·) denotes the expectation operator and the conditional probabilities p(ωi | xk) are again given by (4). If the source and target vectors are concatenated, creating a new sequence of vectors zk that are the realizations of the random vector z = [x^T y^T]^T (where T denotes transposition), then all the required parameters in the above equations can be found by estimating the GMM parameters of z. Then,

\[ \Sigma_i^{zz} = \begin{bmatrix} \Sigma_i^{xx} & \Sigma_i^{xy} \\ \Sigma_i^{yx} & \Sigma_i^{yy} \end{bmatrix}, \qquad \mu_i^{z} = \begin{bmatrix} \mu_i^{x} \\ \mu_i^{y} \end{bmatrix}. \qquad (6) \]

Once again, these parameters are estimated by the EM algorithm. Since this method estimates the desired function based on the joint density of x and y, it will be referred to as the joint density estimation (JDE) method.
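The following sketch implements the JDE conversion of (4), (5), and (6) on top of scikit-learn's GaussianMixture (its EM fit standing in for the EM of [12]); X and Y are time-aligned source/target cepstral vectors, one frame per row, and all function and variable names are ours.

    import numpy as np
    from scipy.stats import multivariate_normal
    from sklearn.mixture import GaussianMixture

    def fit_jde_gmm(X, Y, M=16):
        """Fit a GMM to the joint vectors z_k = [x_k^T y_k^T]^T."""
        Z = np.hstack([X, Y])
        return GaussianMixture(n_components=M, covariance_type='full').fit(Z)

    def jde_convert(gmm, X, d):
        """Apply the regression of eq. (5); d is the source-vector dimension."""
        M = gmm.n_components
        mu_x, mu_y = gmm.means_[:, :d], gmm.means_[:, d:]
        S_xx = gmm.covariances_[:, :d, :d]
        S_yx = gmm.covariances_[:, d:, :d]
        # class likelihoods p(w_i) N(x_k; mu_i^x, S_i^xx) and posteriors, eq. (4)
        lik = np.column_stack([gmm.weights_[i] * multivariate_normal.pdf(X, mu_x[i], S_xx[i])
                               for i in range(M)])
        post = lik / lik.sum(axis=1, keepdims=True)
        Y_hat = np.zeros((X.shape[0], gmm.means_.shape[1] - d))
        for i in range(M):
            # per-class regression mu_i^y + S_i^yx (S_i^xx)^(-1) (x_k - mu_i^x)
            reg = mu_y[i] + (X - mu_x[i]) @ np.linalg.solve(S_xx[i], S_yx[i].T)
            Y_hat += post[:, [i]] * reg
        return Y_hat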


2.2. Subband processing

Audio signals contain information over a larger bandwidth than speech signals. The sampling rate for audio signals is usually 44.1 or 48 kHz, compared with 16 kHz for speech. Moreover, since high acoustical quality for audio is essential, it is important to consider the entire spectrum in detail. For these reasons, the decision to follow an analysis in subbands seems natural. Instead of warping the frequency spectrum using the Bark scale, as is usual in speech analysis, the frequency spectrum was divided into subbands and each one was treated separately under the analysis presented in the previous section (the signals were demodulated and decimated after they were passed through the filter banks and before the linear predictive analysis). Perfect reconstruction filter banks, based on wavelets [13], provide a solution with acceptable computational complexity as well as the octave frequency division appropriate for audio signals. The choice of filter bank was not a subject of investigation, but a steep transition from passband to stopband is desirable. The reason is that the short-term spectral envelope is modified separately for each band; thus, frequency overlapping between adjacent subbands would result in a distorted synthesized signal.
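As an illustration of the octave split, the sketch below uses a wavelet filter bank from PyWavelets; a 7-level decomposition yields the eight octave bands of Table 1. The wavelet choice is ours (the paper does not specify the filter bank), and each band is reconstructed on its own so that it can be analyzed and modified separately.

    import numpy as np
    import pywt

    def octave_bands(x, wavelet='db20', levels=7):
        """Split x into levels+1 octave bands (lowest band first)."""
        coeffs = pywt.wavedec(x, wavelet, level=levels)
        bands = []
        for k in range(len(coeffs)):
            kept = [c if i == k else np.zeros_like(c) for i, c in enumerate(coeffs)]
            bands.append(pywt.waverec(kept, wavelet)[:len(x)])
        return bands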

2.3. Residual processing for percussive sounds

The SC methods described earlier will not produce the desired result in all cases. Transient sounds cannot be adequately processed by altering their spectral envelope and must be examined separately. An example of an analysis/synthesis model that treats transient sounds separately and is very suitable as an alternative to the subband-based residual/LP model that we employed is described in [14]. It is suitable since it also models the audio signal in different bands, in each one as a sinusoidal/residual model [15, 16]. The sinusoidal parameters can be treated in the same manner as the LP coefficients during SC [17]. We are currently considering this model for improving the produced sound quality of our system. However, no structured model is proposed in [14] for transient sounds. In the remainder of this section, the special case of percussive sounds is addressed.

The case of percussive drum-like sounds is considered of particular importance. It is usual in multichannel recordings to place a microphone close to the tympani, as drum-like sounds are considered perceptually important in recreating the acoustical environment of the recording venue. For percussive sounds, a model similar to the residual/LP model described here can be used [18] (see also [19, 20, 21]), but for the enhancement purposes investigated in this paper, the emphasis is given to the residual instead of the LP parameters. The idea is to extract the residual of an instance of the particular percussive instrument from the recording of the microphone that captures this instrument and then recreate this channel from the reference channel by simply substituting the residual of all instances of this instrument with the extracted residual. As explained in [18], this residual corresponds to the interaction between the exciter and the resonating body of the instrument and lasts until the structure reaches a steady vibration. This signal characterizes the attack part of the sound and is independent of the frequencies and amplitudes of the harmonics of the produced sound (after the instrument has reached a steady vibration). Thus, it can be used for synthesizing different sounds by using an appropriate all-pole filter. This method proved to be quite successful and further details are given in Section 2.4. The drawback of this approach is that a robust algorithm is required for identifying the particular instrument instances in the reference recording. A possible improvement of the proposed method would be to extract all instances of the instrument from the target response and use some clustering technique for choosing the residual that is most appropriate in the resynthesis stage. The reason is that the residual/LP model introduces a modeling error which is larger in the spectral valleys of the AR spectrum; thus, better results would be obtained by using a residual which corresponds to an AR filter as close as possible to the resynthesis AR filter. However, this approach would again require robustly identifying all the instances of the instrument.

2.4. Implementation details

The three SC methods outlined in Section 2.1 were implemented and tested using a multichannel recording, obtained as described in Section 1. The objective was to recreate the channel that mainly captured the chorus of the orchestra (residual processing for percussive sound resynthesis is also considered in the last paragraph of this section). Acoustically, therefore, the emphasis was on the male and female voices. At the same time, it was clear that some instruments, inaudible in the target recording but particularly audible in the reference recording, needed to be attenuated. More generally, a spot microphone might enhance more than one type of musical source. Usually, such microphones are placed with a particular type of instrument in mind, which is easy to discern by acoustical examination, but, in general, careful selection of the training data will produce the desired result even in complex cases.

A database of about 10,000 spectral vectors for each band was created so that only parts of the recording where the chorus is present are used, with the cepstral coefficients chosen as the spectral vectors. Parts of the chorus recording were selected so that no segments of silence were included. Given that our focus was on modifying the short-term spectral properties of the reference signal, we used a 2048-sample analysis window for a 44.1 kHz sampling rate. This is a typical value when the objective is to alter the short-term spectral properties of audio signals, and it was found to produce good sound quality in our case as well. Results were evaluated through informal listening tests and through objective performance criteria. The SC methods were found to provide promising enhancement results. The experimental conditions are given in Table 1. The number of octave bands used was eight, a choice that gives particular emphasis to the frequency band 0–5 kHz and at the same time does not impose excessive computational demands.
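A rough sketch of how such a training set could be assembled is given below: the recording is framed with 2048-sample windows and frames below an energy threshold are discarded. The hop size, window choice, and silence threshold are illustrative assumptions, not values taken from the paper.

```python
# Sketch: build a database of non-silent analysis frames for training.
# Assumptions: hop size, Hann window, and -50 dB silence threshold are ours.
import numpy as np

def build_training_frames(x, frame_len=2048, hop=1024, silence_db=-50.0):
    """Return an array of non-silent windowed frames (one frame per row)."""
    peak = np.max(np.abs(x)) + 1e-12
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        if 20.0 * np.log10(rms / peak + 1e-12) > silence_db:
            frames.append(frame * np.hanning(frame_len))
        # silent frames are discarded so the database only covers chorus activity
    return np.array(frames)
```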



Table 1: Parameters for the chorus microphone example.

Band no.   Low (kHz)   High (kHz)   LP order   GMM centroids
   1         0.0000      0.1723        4             4
   2         0.1723      0.3446        4             4
   3         0.3446      0.6891        8             8
   4         0.6891      1.3782       16            16
   5         1.3782      2.7563       32            16
   6         2.7563      5.5125       32            16
   7         5.5125     11.0250       32            16
   8        11.0250     22.0500       32            16

Table 2: Normalized distances for LSE-, JDE-, and VQ-based methods.

SC method   Cepstral distance (train)   Cepstral distance (test)   Centroids per band
  LSE                0.6451                      0.7144                 Table 1
  JDE                0.6629                      0.7445                 Table 1
  VQ                 1.2903                      1.3338                  1024

The frequency range 0–5 kHz is particularly important for the specific case of chorus recording resynthesis since this is the frequency range where the human voice is mostly concentrated. For producing better results, the entire frequency range 0–20 kHz must be considered. The order of the LP filter varied depending on the frequency detail of each band, and for the same reason, the number of centroids for each band was different.

In Table 2, the average quadratic cepstral distance (averaged over all vectors and all eight bands) is given for each method, for the training data as well as for the data used for testing (nine seconds of music from the same recording). The cepstral distance is normalized by the average quadratic distance between the reference and the target waveforms (i.e., without any conversion of the LP parameters). The improvement is large for both GMM-based algorithms, with the LSE algorithm being slightly better, for both the training and the testing data. The VQ-based algorithm, in contrast, produced a deterioration in performance, which was audible as well. This can be explained by the fact that the GMM-based methods result in a conversion function which is continuous with respect to the spectral vectors. The VQ-based method, on the other hand, produces audible artifacts introduced by spectral discontinuities because the conversion is based on a limited number of existing spectral vectors. This is why a large number of centroids was used for the VQ-based algorithm, as seen in Table 2, compared with the number of centroids used for the GMM-based algorithms. However, the results for the VQ-based algorithm were still unacceptable from both the objective and the subjective perspectives (higher numbers of centroids were tested, up to 8192, without any significant improvement).
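The normalization used in Table 2 can be written compactly as in the sketch below, where the converted, target, and reference cepstral vectors of one band are stacked row-wise; the function name and array layout are assumptions made for illustration.

```python
# Sketch of the normalized quadratic cepstral distance reported in Table 2.
import numpy as np

def normalized_cepstral_distance(converted, target, reference):
    """converted, target, reference: arrays of cepstral vectors, one per row."""
    d_conv = np.mean(np.sum((converted - target) ** 2, axis=1))
    d_none = np.mean(np.sum((reference - target) ** 2, axis=1))
    return d_conv / d_none      # values below 1 mean the conversion helped
```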

The algorithm described in Section 2.3 for the special case of percussive sound resynthesis was tested as well.

Figure 2: Choi-Williams distribution of the desired (a), reference (b), and synthesized (c) waveforms at the time points during a tympani strike (60–80 samples). Each panel plots frequency (0–300 Hz) against time (20–200 samples).

Figure 2 shows the time-frequency evolution of a tympani instance using the Choi-Williams distribution [22], a distribution that achieves the high resolution needed for such impulsive signals. Figure 2 clearly demonstrates the improvement in drum-like sound resynthesis. The impulsiveness of the signal at around samples 60–80 is observed in the desired response and verified in the synthesized waveform. The attack part is clearly enhanced, significantly adding naturalness to the audio signal, as our informal listening tests clearly demonstrated.

The methods described in this section can be used for synthesizing recordings of microphones that are placed close to the orchestra. Of importance in this case were the short-term spectral properties of the audio signals. Thus, LTI filters were not suitable, and the time-frequency properties of the waveforms had to be exploited in order to obtain a solution. In Section 3, we focus on microphones placed far from the orchestra, which therefore contain mainly reverberant signals. As we demonstrate, the desired waveforms can be synthesized by taking advantage of the long-term spectral properties of the reference and the desired signals.

3. REVERBERANT MICROPHONE SIGNAL SYNTHESIS

The problem of synthesizing a virtual microphone signal from a signal recorded at a different position in the room can be described as follows. Given two processes $s_1$ and $s_2$, determine the optimal filter $H$ that can be applied to $s_1$ (the reference microphone signal) so that the resulting process $s_2'$ (the virtual microphone signal) is as close as possible to $s_2$. The optimality of the resulting filter $H$ is based on how "close" $s_2'$ is to $s_2$. For the case of audio signals, the distance between these two processes must be measured in a way that is psychoacoustically valid. For microphones placed far from the orchestra (reverberant microphones), the main factor that differentiates the target from the reference recording is hall reverberation; thus, in this case, the transfer function is inherently time invariant. This is a typical identification problem; however, in our case we estimate the room response based on existing recordings, since it would be impractical or even impossible to measure the hall response for every different recording. At the same time, the nonstationarity of the audio signals, which might prevent accurate estimation of the transfer functions, is addressed by the spectral estimation methods explained in Section 3.1. Another important issue that arises is the fact that the physical system is characterized by a long impulse response. For a typical large symphony hall, the reverberation time is approximately two seconds, which would require a filter of more than 96,000 taps to describe the reverberation process (for a typical sampling rate of 48 kHz). This issue consequently affects both the filter design and the system implementation. While the filter design problem is appropriately addressed, the resulting filters are inevitably of high order, prohibiting cost-effective real-time applications of our methods.

For the reverberant microphone case, the orchestra is considered as a point source. For all practical purposes, this is a valid assumption to make. The distant microphones are not trying to recreate the physical sound field generated by a complex sound source such as the orchestra. Rather, they are trying to provide us with a signal that can be combined with signals from other microphones (real and synthesized) using aesthetic (not mathematical) rules for mixing into a multichannel performance. It is well known that trying to use microphones to capture the physical sound waves at one point in space is not physically possible and does not correspond to the way a human listener would hear/perceive it even if it were. As explained later in this section, our listening tests indicate that the assumption made is a valid one, with the target and resynthesized waveforms acoustically indistinguishable (for appropriate filter orders).

3.1. IIR filter design

There are several possible approaches to the problem. One is to use classical estimation-theoretic techniques such as least squares or Wiener filtering-based algorithms to estimate the hall environment with a long finite-duration impulse response (FIR) or infinite-duration impulse response (IIR) filter. Adaptive algorithms such as LMS [2] can provide an acceptable solution in such system identification problems, while least squares methods suffer from prohibitive computational demands. For LMS, the limitation lies in the fact that the input and the output are nonstationary signals, making convergence quite slow. In addition, the required length of the filter is very large, so such algorithms would prove to be inefficient for this problem. Although it is possible to prewhiten the input of the adaptive algorithm (see, e.g., [2, 23] and the references therein) so that convergence is improved, these algorithms have still not proved to be efficient for this problem.

An alternative to the aforementioned methods for treating system identification problems is to use spectral estimation techniques based on the cross spectrum [24]. These methods are divided into parametric and nonparametric. Nonparametric methods based on averaging techniques, such as the averaged periodogram (Welch spectral estimate) [25, 26, 27], are considered more appropriate for the case of long observations and for nonstationary conditions, since no model is assumed for the observed data (a different approach based on the cross spectrum which, instead of averaging, solves an overdetermined system of equations can be found in [28]). After the frequency response of the filter is estimated, an IIR filter can be designed based on that response. The advantage of this approach is that IIR filters are a more natural choice for modeling the physical system under consideration and can be expected to be very efficient in approximating the spectral properties of the recording venue. In addition, an IIR filter would implement the desired frequency response with a significantly lower order compared with an FIR filter. Caution must, of course, be taken in order to ensure the stability of the filters.

To summarize, if we could define a power spectral density $S_{s_1}(\omega)$ for signal $s_1$ and $S_{s_2}(\omega)$ for signal $s_2$, then it would be possible to design a filter $H(\omega)$ that can be applied to process $s_1$, resulting in a process $s_2'$ which is intended to be an estimate of $s_2$. The filter $H(\omega)$ can be estimated by means of spectral estimation techniques. Furthermore, if $S_{s_1}(\omega)$ is modeled by an all-pole approximation $|1/A_{p_1}|^2$ and $S_{s_2}(\omega)$ similarly as $|1/A_{p_2}|^2$, then $H = A_{p_1}/A_{p_2}$ if $H$ is restricted to be the minimum-phase spectral factor of $|H(\omega)|^2$. The result is a minimum-phase, stable IIR filter that can be designed efficiently. The analysis that follows provides the details for designing $H$.

The estimation of $H(\omega)$ is based on computing the cross spectrum $S_{s_2 s_1}$ of signals $s_2$ and $s_1$ and the autospectrum $S_{s_1}$ of signal $s_1$. It is true that if these signals were stationary, then

$S_{s_2 s_1}(\omega) = H(\omega)\, S_{s_1}(\omega)$.    (7)

The difficulties arising in the design of filter $H$ are due to the nonstationary nature of audio signals. This issue can be partly addressed if the signals are divided into segments short enough to be considered of approximately stationary nature. It must be noted, however, that these segments must be large enough so that they can be considered long compared with the length of the impulse response that must be estimated, in order to avoid edge effects (as explained in [29], where a similar procedure is followed for the case of blind deconvolution for audio signal restoration).



For interval $i$, composed of $M$ (real) samples $s_1^{(i)}(0), \ldots, s_1^{(i)}(M-1)$, the empirical transfer function estimate (ETFE) [24] is computed as

$H^{(i)}(\omega) = \dfrac{S_2^{(i)}(\omega)}{S_1^{(i)}(\omega)}$,    (8)

where

$S_1^{(i)}(\omega) = \sum_{n=0}^{M-1} s_1^{(i)}(n)\, e^{-j\omega n}$    (9)

is the Fourier transform of the segment samples, though this cannot be considered an accurate estimate of $H(\omega)$, since the filter $H^{(i)}(\omega)$ will be valid only for frequencies corresponding to the harmonics of segment $i$ (under the valid assumption of a quasiperiodic nature of the audio signal for each segment). An intuitive procedure would be to obtain the estimate of the spectral properties of the recording venue, $H(\omega)$, by averaging all the estimates available. Since the ETFE is the result of a frequency division, it is apparent that at frequencies where $S_{s_1}(\omega)$ is close to zero the ETFE becomes unstable, so a more robust procedure is to estimate $H$ using a weighted average of the $K$ segments available [24], that is,

$H(\omega) = \dfrac{\sum_{i=0}^{K-1} \beta^{(i)}(\omega)\, H^{(i)}(\omega)}{\sum_{i=0}^{K-1} \beta^{(i)}(\omega)}$.    (10)

A sensible choice of weights would be

$\beta^{(i)}(\omega) = \bigl| S_1^{(i)}(\omega) \bigr|^2$.    (11)

It can be easily shown that estimating $H$ under this approach is equivalent to estimating the autospectrum of $s_1$ and the cross spectrum of $s_2$ and $s_1$ using the Cooley-Tukey spectral estimate [26] (in essence, Welch spectral estimation with rectangular windowing of the data and no overlapping). In other words, defining the power spectrum estimate under the Cooley-Tukey procedure as

$S^{CT}_{s_1}(\omega) = \dfrac{1}{K} \sum_{i=0}^{K-1} \bigl| S_1^{(i)}(\omega) \bigr|^2$,    (12)

where $S_1^{(i)}(\omega)$ is defined as previously, and a similar expression for the cross spectrum

$S^{CT}_{s_2 s_1}(\omega) = \dfrac{1}{K} \sum_{i=0}^{K-1} S_2^{(i)}(\omega)\, S_1^{(i)*}(\omega)$,    (13)

it holds that

$H(\omega) = \dfrac{S^{CT}_{s_2 s_1}(\omega)}{S^{CT}_{s_1}(\omega)}$,    (14)

which is analogous to (7). Thus, for a stationary signal, the averaging of the estimated filters is justifiable. A window can additionally be used to further smooth the spectra.
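A minimal sketch of this block-averaged estimate is given below: the two recordings are cut into non-overlapping blocks, cross- and auto-spectra are accumulated as in (12) and (13), and their ratio gives the estimate (14). The block length, FFT size, regularization constant, and function name are illustrative assumptions.

```python
# Sketch: weighted-average (Cooley-Tukey) estimate of H(w) from two recordings.
import numpy as np

def estimate_transfer_function(s1, s2, block_len=100_000, nfft=2**18):
    """Estimate H(w) = S_{s2 s1}(w) / S_{s1}(w) from K non-overlapping blocks."""
    K = min(len(s1), len(s2)) // block_len
    cross = np.zeros(nfft, dtype=complex)
    auto = np.zeros(nfft)
    for i in range(K):
        seg = slice(i * block_len, (i + 1) * block_len)
        S1 = np.fft.fft(s1[seg], nfft)
        S2 = np.fft.fft(s2[seg], nfft)
        cross += S2 * np.conj(S1)      # accumulates the cross spectrum (13)
        auto += np.abs(S1) ** 2        # accumulates the autospectrum (12)
    return cross / (auto + 1e-12)      # ratio as in (14); the 1/K factors cancel
```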

The described method is meaningful for the special case of audio signals despite their nonstationarity. It is well known that the averaged periodogram provides a smoothed version of the periodogram. Considering that it is true even for nonstationary (but finite length) signals that

$S_2(\omega)\, S_1^{*}(\omega) = H(\omega) \bigl| S_1(\omega) \bigr|^2$,    (15)

then averaging in essence smoothes the frequency response of $H$. This is justifiable since a nonsmoothed $H$ will contain details that are of no acoustical significance. Further smoothing can yield a lower-order IIR filter by taking advantage of AR modeling. Considering signal $s_1$, the inverse Fourier transform of its power spectrum $S_{s_1}(\omega)$, derived as described earlier, will yield the sequence $r_{s_1}(m)$. If this sequence is viewed as the autocorrelation of $s_1$ and the samples $r_{s_1}(0), \ldots, r_{s_1}(p)$ are inserted in the Wiener-Hopf equations for linear prediction (with the AR order $p$ being significantly smaller than the number of samples $M$ of each block, for smoothing the spectra):

$\begin{bmatrix}
r_{s_1}(0) & r_{s_1}(1) & \cdots & r_{s_1}(p-1) \\
r_{s_1}(1) & r_{s_1}(0) & \cdots & r_{s_1}(p-2) \\
\vdots & \vdots & \ddots & \vdots \\
r_{s_1}(p-1) & r_{s_1}(p-2) & \cdots & r_{s_1}(0)
\end{bmatrix}
\begin{bmatrix}
a_{p_1}(1) \\ a_{p_1}(2) \\ \vdots \\ a_{p_1}(p)
\end{bmatrix}
=
\begin{bmatrix}
r_{s_1}(1) \\ r_{s_1}(2) \\ \vdots \\ r_{s_1}(p)
\end{bmatrix}$,    (16)

then the coefficients $a_{p_1}(i)$ result in an approximation of $S_{s_1}(\omega)$ (omitting the constant gain term, which is not of importance in this case):

$S_{s_1}(\omega) = \left| \dfrac{1}{A_{p_1}(\omega)} \right|^2$,    (17)

where

$A_{p_1}(\omega) = 1 + \sum_{l=1}^{p} a_{p_1}(l)\, e^{-j\omega l}$.    (18)

A similar expression holds for $S_{s_2}(\omega)$. The spectra $S_{s_1}$ and $S_{s_2}$ can be computed as in (12). Using the fact that

$S_{s_2}(\omega) = \bigl| H(\omega) \bigr|^2 S_{s_1}(\omega)$    (19)

and restricting $H$ to be minimum phase, we find from the spectral factorization of (19) that a solution for $H$ is

$H(\omega) = \dfrac{A_{p_1}(\omega)}{A_{p_2}(\omega)}$.    (20)

Following this method, the filter $H$ can be designed very efficiently even for very large filter orders, since (16) can be solved using the Levinson-Durbin recursion. The resulting filter is IIR and stable.
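The sketch below outlines this design path under stated assumptions: the averaged power spectra of the reference and target signals are inverse transformed to autocorrelations, the Wiener-Hopf system (16) is solved (here with scipy.linalg.solve_toeplitz, which plays the role of the Levinson-Durbin recursion), and the two prediction-error polynomials form the numerator and denominator of the IIR filter (20). Function names and the input format of the spectra are ours, not the paper's.

```python
# Sketch: minimum-phase IIR design H(z) = A_p1(z) / A_p2(z) from averaged spectra.
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def ar_polynomial(power_spectrum, p):
    """Prediction-error polynomial of order p from a full-length power spectrum."""
    r = np.real(np.fft.ifft(power_spectrum))    # autocorrelation sequence r(m)
    w = solve_toeplitz(r[:p], r[1:p + 1])       # Wiener-Hopf equations, cf. (16)
    return np.concatenate(([1.0], -w))

def design_and_apply(S_s1, S_s2, p, reference_signal):
    A1 = ar_polynomial(S_s1, p)                 # models the reference spectrum
    A2 = ar_polynomial(S_s2, p)                 # models the target spectrum
    # A2 is minimum phase, so 1/A2 (and hence H = A1/A2) is a stable IIR filter (20)
    return lfilter(A1, A2, reference_signal)
```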



A problem with the aforementioned design method is that the filter $H$ is restricted to be of minimum phase. It is of interest to mention that in our experiments the minimum-phase assumption proved to be perceptually acceptable. This can possibly be attributed to the fact that if the minimum-phase filter $H$ captures a significant part of the hall reverberation, then the listener's ear will be less sensitive to the phase distortion [30]. It is not possible, however, to generalize this observation, and the performance of this last step in the filter design will possibly vary depending on the particular characteristics of the venue captured in the multichannel recording.

3.2. Mutual information as a spectral distortion measure

As previously mentioned, we need to apply the above procedure to blocks of data of the two processes $s_1$ and $s_2$. In our experiments, we chose signal block lengths of 100,000 samples (long blocks of data are required due to the long reverberation time of the hall, as explained earlier). We then experimented with various orders of the filters $A_{p_1}$ and $A_{p_2}$. As expected, relatively high orders were required to reproduce $s_2$ from $s_1$ with an acceptable error between $s_2'$ (the resynthesized process) and $s_2$ (the target recording). The performance was assessed through blind A/B/X listening evaluation. An order of 10,000 coefficients for both the numerator and the denominator of $H$ resulted in an error between the original and synthesized signals that was not detectable by listeners. We also evaluated the performance of the filter by synthesizing blocks from a part of the signal other than the one that was used for designing the filter. Again, the A/B/X evaluation showed that for orders higher than 10,000, the synthesized signal was indistinguishable from the original. Although such high-order filters are impractical for real-time applications, the performance of our method is an indication that the model is valid, therefore motivating us to further investigate filter optimization. The method can be used for offline applications, such as remastering old recordings, requiring a reasonable amount of time for resynthesis that depends on the specific platform and implementation. A real-time version was also implemented using the Lake DSP Huron digital audio convolution workstation. With this system, we are able to synthesize 12 virtual microphone stem recordings from a monophonic or stereophonic compact disc (CD) in real time. It is interesting to mention that our informal listening tests showed that for filter orders of 5,000 or less, the amount of reverberation perceived in the signal is not sufficient. This is not surprising, given the physical size (150 ft in length) and reverberation time (1.9 seconds) of the hall in which we conducted our experiments.

To obtain an objective measure of the performance, it is necessary to derive a mathematical measure of the distance between the synthesized and the original processes. The difficulty in defining such a measure is that it must also be psychoacoustically valid. This problem has been addressed in speech processing, where measures such as the log spectral distance and the Itakura-Saito distance are used [31].

Figure 3: Normalized error between original and synthesized microphone signals as a function of frequency (normalized error in dB, 0 to -100, versus frequency, 0-20 kHz).

In our case, we need to compare the spectral characteristics of long sequences with spectra that contain a large number of peaks and dips that are narrow enough to be imperceptible to the human ear. In other words, the focus is on the long-term spectral properties of the audio signals, while spectral distortion measures have been developed for comparing the short-term spectral properties of signals. To overcome comparison inaccuracies that would be mathematical rather than psychoacoustical in nature, we chose to perform 1/3-octave smoothing [32] and compare the resulting smoothed spectral cues. The results are shown in Figure 3, in which we compare the spectra of the original (measured) microphone signal and the synthesized signal. The two spectra are practically indistinguishable below 10 kHz. Although the error increases at higher frequencies, the listening evaluations show that this is not perceptually significant. One problem encountered while comparing the 1/3-octave smoothed spectra was that the average error was not reduced with increasing filter order as rapidly as the results of the listening tests suggested. To address this inconsistency, we experimented with various distortion measures.
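A simple version of this 1/3-octave smoothing is sketched below: each bin of the power spectrum is replaced by the average over a band extending one sixth of an octave to either side. The loop form and the handling of the DC bin are illustrative choices, not the paper's implementation.

```python
# Sketch: 1/3-octave smoothing of a one-sided power spectrum.
import numpy as np

def third_octave_smooth(freqs, power):
    """freqs, power: one-sided frequency axis (Hz) and power spectrum."""
    smoothed = np.empty_like(power, dtype=float)
    factor = 2.0 ** (1.0 / 6.0)            # half of a 1/3-octave band
    for k, f in enumerate(freqs):
        if f <= 0.0:
            smoothed[k] = power[k]         # leave the DC bin untouched
            continue
        band = (freqs >= f / factor) & (freqs <= f * factor)
        smoothed[k] = np.mean(power[band])
    return smoothed
```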

These measures included the root mean square (RMS) log spectral distance, the truncated cepstral distance, and the Itakura distance (for a description of all these measures see, e.g., [8]). The results, however, were still not in line with what the listening evaluations indicated. This led us to a measure that is commonly used in pattern comparison and is known as the mutual information (see, e.g., [33]). By definition, the mutual information of two random variables $X$ and $Y$ with joint pdf $p(x, y)$ and marginal pdfs $p(x)$ and $p(y)$ is the relative entropy between the joint distribution and the product distribution, that is,

$I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \dfrac{p(x,y)}{p(x)\,p(y)}$.    (21)

It is easy to prove that

$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$    (22)



and also

$I(X;Y) = H(X) + H(Y) - H(X,Y)$,    (23)

where $H(X)$ is the entropy of $X$,

$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$.    (24)

Similarly, $H(Y)$ is the entropy of $Y$. The term $H(X|Y)$ is the conditional entropy, defined as

$H(X|Y) = \sum_{y \in \mathcal{Y}} p(y)\, H(X|Y=y) = -\sum_{y \in \mathcal{Y}} p(y) \sum_{x \in \mathcal{X}} p(x|y) \log p(x|y)$,    (25)

while $H(X,Y)$ is the joint entropy, defined as

$H(X,Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)$.    (26)

The mutual information is always nonnegative. Since our interest is in comparing two vectors $X$ and $Y$, with $Y$ being the desired response, it is useful to use a modified definition of the mutual information, the normalized mutual information (NMI) $I_N(X;Y)$, which can be defined as

$I_N(X;Y) = \dfrac{H(Y) - H(Y|X)}{H(Y)} = \dfrac{I(X;Y)}{H(Y)}$.    (27)

This version of the mutual information is mentioned in [33, page 47] and has been applied in many applications as an optimization measure (e.g., radar remote sensing applications [34]). Obviously,

$0 \leq I_N(X;Y) \leq 1$.    (28)

The NMI attains its minimum value when $X$ and $Y$ are statistically independent and its maximum value when $X = Y$. The NMI does not constitute a metric since it lacks symmetry. On the other hand, the NMI is invariant to amplitude differences [35], which is a very important property, especially for comparing audio waveforms.
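The sketch below shows one way the NMI of (27) can be evaluated between two (smoothed) spectra, approximating the joint pdf with a two-dimensional histogram. The paper does not state how the pdfs were estimated, so the histogram approach, the bin count, and the function names are assumptions.

```python
# Sketch: NMI between two spectra via a 2D histogram estimate of the joint pdf.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def normalized_mutual_information(x, y, bins=64):
    """x: reference/synthesized spectrum, y: desired (target) spectrum."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    mi = entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())   # cf. (23)
    return mi / entropy(p_y)                                   # cf. (27)
```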

The spectra of the original and the synthesized responses were compared using the NMI for various filter orders, and the results are depicted in Figure 4. The NMI increases with filter order both when considering the raw spectra and when using the spectra that were smoothed using AR modeling (spectral envelope by all-pole modeling with linear predictive coefficients). We believe that the NMI calculated using the smoothed spectra is the measure that most closely approximates the results we achieved from the listening tests. As can be seen from the figure, the NMI for a filter order of 20,000 is 0.9386 (i.e., close to unity, which corresponds to indistinguishable similarity) for the LP spectra, while the NMI for the same order but for the raw spectra is 0.5124.

Figure 4: NMI between original and synthesized microphone signals as a function of filter order (0-20,000), shown for both the LPC-smoothed spectrum and the true spectrum.

Furthermore, the fact that both the raw and smoothed NMI measures increase monotonically in the same fashion indicates that the smoothing is valid, since it only reduces the "distance" between the two waveforms in a proportionate way for all the synthesized waveforms (order 0 in the diagram corresponds to no filtering; it is the distance between the original and the reference waveforms).

4. CONCLUSIONS AND FUTURE RESEARCH

Multichannel audio resynthesis is a new and important application that allows transmission of only one or two channels of multichannel audio and resynthesis of the remaining channels at the receiving end. It offers the advantage that the stem microphone recordings can be resynthesized at the receiving end, which makes this system suitable for many professional applications and, at the same time, poses no restrictions on the number of channels of the initial multichannel recording. A distinction was made between the methods employed depending on the location of the "virtual" microphones, namely, spot and reverberant microphones. Reverberant microphones are those that are placed at some distance from the sound source (e.g., the orchestra) and therefore contain more reverberation. On the other hand, spot microphones are located close to individual sources (e.g., near a particular musical instrument). This is a completely different problem because placing such microphones near individual sources with varying spectral characteristics results in signals whose frequency content depends highly on the microphone positions. For spot microphones, we only considered their spatial dependence with respect to the orchestra and did not consider their dependence on hall acoustics. This allowed us to design time-varying filters (one for each spot microphone recording) that can enhance particular instrument types in the reference recording based on training datasets.



For reverberant microphones, we only considered their dependence with respect to hall acoustics and did not consider the orchestra as a distributed source. This allowed us to design time-invariant filters (one for each reverberant microphone recording) that can add the reverberation effect to the reference recording, simulating the acoustic properties of the particular concert hall.

Spot microphones were treated separately by applying spectral conversion techniques for altering the short-term spectral properties of the reference audio signals. Some of the SC algorithms that have been used successfully for voice conversion can be adopted quite favorably for the task of multichannel audio resynthesis. In particular, three of the most common SC methods have been compared, and our objective results, in accordance with our informal listening tests, have indicated that GMM-based spectral conversion can produce extremely successful results. Residual signal enhancement was also found to be essential for the special case of percussive sound resynthesis. Our current research has focused on audio quality improvement for the methods proposed here, by using alternative models for the short-term spectral properties of the audio signals. Other possible directions for future research include conducting formal listening tests, as well as extending the methods described here towards remastering existing monophonic and stereophonic recordings for multichannel rendering (the synthesis problem). Our experiments so far were conducted mostly with the chorus microphone case in mind. We also examined a special case of transient sounds, namely, percussive drum-like sounds, which are considered perceptually significant. Other types of transient sounds as well as various instrument types should be considered, possibly resulting in improved or novel algorithms.

For the reverberant microphone recordings, we have described a method for synthesizing the desired audio signals based on spectral estimation techniques. The emphasis in this case is on the long-term spectral properties of the signals, since the reverberation process is long in duration (e.g., two seconds for large concert halls). An IIR filtering solution was proposed for addressing the long reverberation-time problem, with its associated long impulse responses for the filters to be designed. The issue of objectively estimating the performance of our methods arose and was treated by proposing the NMI as a measure of spectral distance, which was found to be very suitable for comparing the long-term spectral properties of audio signals. The designed IIR filters are currently not suitable for real-time applications. We are investigating other possible alternatives for the filter design that will result in more practical solutions.

ACKNOWLEDGMENT

This research has been funded by the Integrated Media Systems Center, a National Science Foundation Engineering Research Center, Cooperative Agreement no. EEC-9529152.

REFERENCES

[1] A. Mouchtaris, Z. Zhu, and C. Kyriakakis, "High-quality multichannel audio over the Internet," in Proc. 33rd Asilomar Conference on Signals, Systems, and Computers, vol. 1, pp. 347–351, Pacific Grove, Calif, USA, October 1999.

[2] S. Haykin, Adaptive Filter Theory, Prentice-Hall, Upper Saddle River, NJ, USA, 1996.

[3] D. W. Griffin and J. S. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.

[4] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '88), pp. 655–658, New York, NY, USA, April 1988.

[5] Y. Stylianou, O. Cappe, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, 1998.

[6] A. Kain and M. W. Macon, "Spectral voice conversion for text-to-speech synthesis," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '98), pp. 285–288, Seattle, Wash, USA, May 1998.

[7] G. Baudoin and Y. Stylianou, "On the transformation of the speech spectrum for voice conversion," in Proc. International Conf. on Spoken Language Processing (ICSLP '96), pp. 1405–1408, Philadelphia, Pa, USA, October 1996.

[8] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ, USA, 1993.

[9] M. Slaney, "Semantic-audio retrieval," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '02), pp. 4108–4111, Orlando, Fla, USA, May 2002.

[10] P. J. Moreno and R. Rifkin, "Using the Fisher kernel method for web audio classification," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '00), pp. 2417–2420, Istanbul, Turkey, June 2000.

[11] A. Berenzweig, D. P. W. Ellis, and S. Lawrence, "Anchor space for classification and similarity measurement of music," in Proc. IEEE International Conference on Multimedia and Expo (ICME), Baltimore, Md, USA, July 2003.

[12] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.

[13] G. Strang and T. Nguyen, Wavelets and Filter Banks, Wellesley-Cambridge Press, Wellesley, Mass, USA, 1996.

[14] S. N. Levine, T. S. Verma, and J. O. Smith III, "Multiresolution sinusoidal modeling for wideband audio with modifications," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '98), pp. 3585–3588, Seattle, Wash, USA, May 1998.

[15] R. J. McAulay and T. F. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 34, no. 4, pp. 744–754, 1986.

[16] X. Serra and J. O. Smith III, "Spectral modeling synthesis: a sound analysis/synthesis system based on a deterministic plus stochastic decomposition," Computer Music Journal, vol. 14, no. 4, pp. 12–24, 1990.

[17] O. Cappe and E. Moulines, "Regularization techniques for discrete cepstrum estimation," IEEE Signal Processing Letters, vol. 3, no. 4, pp. 100–102, 1996.

[18] J. Laroche and J.-L. Meillier, "Multichannel excitation/filter modeling of percussive sounds with application to the piano," IEEE Trans. Speech and Audio Processing, vol. 2, no. 2, pp. 329–344, 1994.

[19] R. B. Sussman and M. Kahrs, "Analysis and resynthesis of musical instrument sounds using energy separation," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '96), pp. 997–1000, Atlanta, Ga, USA, May 1996.

[20] M. W. Macon, A. McCree, W. M. Lai, and V. Viswanathan, "Efficient analysis/synthesis of percussion musical instrument sounds using an all-pole model," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '98), pp. 3589–3592, Seattle, Wash, USA, May 1998.

[21] J. Laroche, "A new analysis/synthesis system of musical signals using Prony's method—Application to heavily damped percussive sounds," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '89), pp. 2053–2056, Glasgow, UK, May 1989.

[22] H.-I. Choi and W. J. Williams, "Improved time-frequency representation of multicomponent signals using exponential kernels," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 37, no. 6, pp. 862–871, 1989.

[23] M. Mboup, M. Bonnet, and N. Bershad, "LMS coupled adaptive prediction and system identification: a statistical model and transient mean analysis," IEEE Trans. Signal Processing, vol. 42, no. 10, pp. 2607–2615, 1994.

[24] L. Ljung, System Identification: Theory for the User, Prentice-Hall, Englewood Cliffs, NJ, USA, 1987.

[25] R. B. Blackman and J. W. Tukey, The Measurement of Power Spectra, Dover Publications, New York, NY, USA, 1958.

[26] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

[27] P. D. Welch, "The use of fast Fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms," IEEE Trans. Audio and Electroacoustics, vol. 15, no. 2, pp. 70–73, 1967.

[28] O. Shalvi and E. Weinstein, "System identification using nonstationary signals," IEEE Trans. Signal Processing, vol. 44, no. 8, pp. 2055–2063, 1996.

[29] T. G. Stockham Jr., T. M. Cannon, and R. B. Ingebretsen, "Blind deconvolution through digital signal processing," Proceedings of the IEEE, vol. 63, no. 4, pp. 678–692, 1975.

[30] B. D. Radlovic and R. A. Kennedy, "Nonminimum-phase equalization and its subjective importance in room acoustics," IEEE Trans. Speech and Audio Processing, vol. 8, pp. 728–737, November 2000.

[31] F. Itakura and S. Saito, "A statistical method for estimation of speech spectral density and formant frequencies," Electronics and Communications in Japan, vol. 53A, pp. 36–43, 1970.

[32] B. C. J. Moore, An Introduction to the Psychology of Hearing, Academic Press, New York, NY, USA, 1989.

[33] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley, New York, NY, USA, 1991.

[34] D. B. Trizna, C. Bachmann, M. Sletten, N. Allan, J. Toporkov, and R. Harris, "Projection pursuit classification methods applied to multiband polarimetric SAR imagery," in Proc. IEEE International Geoscience and Remote Sensing Symposium (IGARSS '00), vol. 1, pp. 105–107, Honolulu, Hawaii, USA, July 2000.

[35] C. Shekhar and R. Chellappa, "Experimental evaluation of two criteria for pattern comparison and alignment," in Proc. 14th International Conference on Pattern Recognition (ICPR '98), vol. 1, pp. 146–153, Brisbane, Australia, August 1998.

Athanasios Mouchtaris received the Diploma degree in electrical engineering from the Aristotle University of Thessaloniki, Greece, in 1997 and the M.S. degree in electrical engineering from the University of Southern California (USC) in 1999. He is currently pursuing the Ph.D. degree at USC, working within the Immersive Audio Laboratory of the Integrated Media Systems Center. His research interests include signal processing for rendering immersive audio environments, content-based audio enhancement for multichannel rendering, and audio synthesis for efficient transmission of multichannel recordings.

Shrikanth S. Narayanan received his M.S., Engineer, and Ph.D. degrees, all in electrical engineering, from the University of California at Los Angeles (UCLA) in 1990, 1992, and 1995, respectively. From 1995 to 2000, he was with AT&T Labs Research, Florham Park, NJ (formerly AT&T Bell Labs, Murray Hill), first as a Senior Member and later as a Principal Member of its technical staff. He is currently an Assistant Professor in the Electrical Engineering Department, Signal and Image Processing Institute, University of Southern California (USC). He is also a Research Area Director of the Integrated Media Systems Center, an NSF ERC, and holds joint appointments in computer science and linguistics at USC. He is an Associate Editor of the IEEE Transactions on Speech and Audio Processing and serves on the Speech Communication technical committee of ASA. His research interests include signal processing and systems modeling with emphasis on speech, audio, and language processing. Shrikanth Narayanan is an author or coauthor of over 90 publications and holds 3 US patents. He is a recipient of an NSF CAREER Award and a Center for Interdisciplinary Research Fellowship; he is a Member of Tau Beta Pi and Eta Kappa Nu and a Senior Member of IEEE.

Chris Kyriakakis received his B.S. degree from the California Institute of Technology in 1985 and his M.S. and Ph.D. degrees from the University of Southern California in 1987 and 1993, respectively. Since 1996, he has been on the faculty of the Electrical Engineering Systems Department at USC and is currently an Associate Professor. He is the Director of the Immersive Audio Laboratory, which is part of the Integrated Media Systems Center, an NSF Engineering Research Center at the USC School of Engineering. His research interests include acquisition, synthesis, and rendering of multichannel immersive audio, multiperson room equalization, microphone arrays, psychoacoustics, and error-free transmission of multichannel audio over high-bandwidth networks.


EURASIP Journal on Applied Signal Processing 2003:10, 980–992
© 2003 Hindawi Publishing Corporation

Progressive Syntax-Rich Coding of Multichannel Audio Sources

Dai Yang
Integrated Media Systems Center and Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089-2564, USA
Email: [email protected]

Hongmei Ai
Integrated Media Systems Center and Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089-2564, USA
Email: [email protected]

Chris Kyriakakis
Integrated Media Systems Center and Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089-2564, USA
Email: [email protected]

C.-C. Jay Kuo
Integrated Media Systems Center and Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089-2564, USA
Email: [email protected]

Received 6 May 2002 and in revised form 5 March 2003

Being able to transmit the audio bitstream progressively is a highly desirable property for network transmission. MPEG-4 version 2 audio supports fine-grain bit rate scalability in the generic audio coder (GAC). It has a bit-sliced arithmetic coding (BSAC) tool, which provides scalability in steps of 1 Kbps per audio channel. Several other scalable audio coding methods have also been proposed in recent years. However, these scalable audio tools are only available for mono and stereo audio material. Little work has been done on progressive coding of multichannel audio sources. MPEG advanced audio coding (AAC) is one of the most distinguished multichannel digital audio compression systems. Based on AAC, we develop in this work a progressive syntax-rich multichannel audio codec (PSMAC). It not only supports fine-grain bit rate scalability for the multichannel audio bitstream but also provides several other desirable functionalities. A formal subjective listening test shows that the proposed algorithm achieves excellent performance at several different bit rates when compared with MPEG AAC.

Keywords and phrases: multichannel audio, progressive coding, Karhunen-Loeve transform, successive quantization, PSMAC.

1. INTRODUCTION

Multichannel audio technologies have become much more mature these days, partially pushed by the needs of the film industry and home entertainment systems. Starting from monophonic technology, new systems, such as stereophonic, quadraphonic, 5.1-channel, and 10.2-channel systems, are penetrating the market very quickly. Compared with mono or stereo sound, multichannel audio provides end users with a more compelling experience and becomes more appealing to music producers. As a result, an efficient coding scheme for multichannel audio storage and transmission is in great demand.

Among several existing multichannel audio compression algorithms, Dolby AC-3 and MPEG advanced audio coding (AAC) [1, 2, 3, 4] are the two most prevalent perceptual digital audio coding systems. Both of them can provide perceptually indistinguishable audio quality at the bit rate of 64 Kbps/ch.

In spite of their success, they can only provide bitstreams with a fixed bit rate, which is specified during the encoding phase. When this kind of bitstream is transmitted over variable-bandwidth networks, the receiver can either successfully decode the full bitstream or ask the encoder to retransmit a bitstream with a lower bit rate. The best solution to this problem is to develop a scalable compression algorithm to transmit and decode the audio content in an embedded manner.



To be more specific, a bitstream generated by a scalable coding scheme consists of several partial bitstreams, each of which can be decoded on its own in a meaningful way. Therefore, transmission and decoding of a subset of the total bitstream will result in a valid decodable signal at a lower bit rate and quality. This capability offers a significant advantage in transmitting content over networks with variable channel capacity and heterogeneous access bandwidth.

MPEG-4 version 2 audio coding supports fine-grain bit rate scalability [5, 6, 7, 8, 9] in its generic audio coder (GAC). It has a bit-sliced arithmetic coding (BSAC) tool, which provides scalability in steps of 1 Kbps per audio channel for mono or stereo audio material. Several other scalable mono or stereo audio coding algorithms [10, 11, 12] were proposed in recent years. However, not much work has been done on progressive coding of multichannel audio sources. In this work, we propose a progressive syntax-rich multichannel audio codec (PSMAC) based on MPEG AAC. In PSMAC, the interchannel redundancy inherent in the original physical channels is first removed in the preprocessing stage by using the Karhunen-Loeve transform (KLT). Then, most coding blocks in the AAC main profile encoder are employed to generate spectral coefficients. Finally, a progressive transmission strategy and a context-based QM-coder are adopted to obtain a fully quality-scalable multichannel audio bitstream. The PSMAC system not only supports fine-grain bit rate scalability for the multichannel audio bitstream but also provides several other desirable functionalities, such as random access and channel enhancement, which have not been supported by other existing multichannel audio codecs (MACs).

Moreover, compared with the BSAC tool provided in MPEG-4 version 2 and most other scalable audio coding tools, a more sophisticated progressive transmission strategy is employed in PSMAC. PSMAC not only encodes spectral coefficients from MSB to LSB and from low to high frequency, so that the decoder can reconstruct these coefficients more and more precisely with an increasing bandwidth as the receiver collects more and more bits from the bitstream, but also utilizes the psychoacoustic model to control the subband transmission sequence so that the most sensitive frequency area is reconstructed more precisely. In this way, bits used to encode coefficients in the nonsensitive frequency areas can be saved and used to encode coefficients in the sensitive frequency area. As a result of this subband selection strategy, a perceptually more appealing audio signal can be reconstructed by PSMAC, especially at very low bit rates such as 16 Kbps/ch. The side information required to encode the subband transmission sequence is carefully handled in our implementation so that the overall overhead does not have a significant impact on the audio quality, even at very low bit rates. Note that Shen et al. [12] proposed a subband selection rule to achieve progressive coding. However, Shen's scheme demands a large amount of overhead in coding the selection order.

Experimental results show that, when compared with MPEG AAC, the decoded multichannel audio generated by the proposed PSMAC's mask-to-noise-ratio (MNR) progressive mode has comparable quality at high bit rates, such as 64 Kbps/ch or 48 Kbps/ch, and much better quality at low bit rates, such as 32 Kbps/ch or 16 Kbps/ch. We also demonstrate that PSMAC can provide better quality single-channel audio when compared with MPEG-4 version 2 GAC at several different bit rates.

The rest of the paper is organized as follows. Section 2 gives an overview of the proposed design. Section 3 briefly introduces how interchannel redundancy can be removed via the KLT. Sections 4 and 5 describe the progressive quantization and subband selection blocks in our system, respectively. Section 6 presents the complete compression system. Experimental results are shown in Section 7. Finally, concluding remarks are given in Section 8.

2. PROFILES OF PROPOSED PROGRESSIVE SYNTAX-RICH AUDIO CODEC

In the proposed progressive syntax-rich codec, the following three user-defined profiles are provided.

(1) The MNR progressive profile. If the flag of this profile is on, it should be possible to decode the first n bytes of the bitstream per second, where n is a user-specified value or a value that the current network parameters allow.

(2) The random access profile. If the flag of this profile is present, the codec will be able to independently encode a short period of audio more precisely than other periods. It allows users to randomly access a certain part of the audio that is more of interest to end users.

(3) The channel enhancement profile. If the flag of this profile is on, the codec will be able to independently encode an audio channel more precisely than other channels. Either these channels are of more interest to end users or the network situation does not allow the full multichannel audio bitstream to be received on time.

Figure 1 illustrates a simple example of the three user-defined profiles. Among all profiles, the MNR progressive profile is the default one. In the other two profiles, that is, the random access and the channel enhancement profiles, the MNR progressive feature is still provided as a basic functionality, and the decoding of the bitstream can be stopped at any arbitrary point. With these three profiles, the proposed codec can provide a versatile set of functionalities desirable in variable-bandwidth network conditions with different user access bandwidths.

3. INTERCHANNEL DECORRELATION

For a given time instance, removing interchannel redundancy would result in a significant bandwidth reduction. This can be done via an orthogonal transform $MV = U$, where $V$ and $U$ denote the vectors whose $n$ elements are samples in the original channels and the transformed channels, respectively.



Figure 1: Illustration of the three user-defined profiles: (a) the MNR progressive profile (low-, median-, and high-quality bitstream layers), (b) the random access profile (a higher-quality period within a lower-quality stream), and (c) the channel enhancement profile with the enhanced center channel (left, center, right, and surround channels).

Among several commonly used transforms, including the discrete cosine transform (DCT), the Fourier transform (FT), and the KLT, the signal-dependent KLT is adopted in the preprocessing stage because it is theoretically optimal in decorrelating signals across channels. If $M$ is the KLT matrix, we call the corresponding transformed channels eigenchannels. Figure 2 illustrates how the KLT is performed on multichannel audio signals, where the columns of the KLT matrix are composed of the eigenvectors calculated from the covariance matrix $C_V$ associated with the original multichannel audio signals $V$.

Suppose that an input audio signal has $n$ channels; then the covariance of the KL transformed signals is

$E\bigl[\bar{U}\bar{U}^T\bigr] = E\bigl[(M\bar{V})(M\bar{V})^T\bigr] = M E\bigl[\bar{V}\bar{V}^T\bigr] M^T = M C_V M^T = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix}$,    (1)

where $\bar{X}$ ($X = U, V$) represents the mean-removed signal of $X$, and $\lambda_1, \lambda_2, \ldots, \lambda_n$ are the eigenvalues of $C_V$. Thus, the transform produces statistically decorrelated channels in the sense of having a diagonal covariance matrix for the transformed signals. Another property of the KLT, which can be used in the reconstruction of the audio of the original channels, is that the inverse transform matrix of $M$ is equal to its transpose. Since $C_V$ is real and symmetric, the matrix formed by the normalized eigenvectors is orthonormal. Therefore, we have $\bar{V} = M^T \bar{U}$ in the reconstruction. From KL expansion theory [13], we know that selecting the eigenvectors associated with the largest eigenvalues minimizes the error between the original and reconstructed channels. This error goes to zero if all eigenvectors are used. The KLT is thus optimum in the least square error sense.

The KLT preprocessing method was demonstrated to improve the multichannel audio coding efficiency in our previous work [14, 15, 16]. After the preprocessing stage, the signals in these relatively independent channels, called eigenchannels, are further processed.
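A minimal sketch of this interchannel KLT is given below for one block of n-channel audio: the covariance of the mean-removed channels is eigendecomposed, the eigenvectors form the rows of M, and the transform and its inverse follow (1) and $\bar{V} = M^T \bar{U}$. The function names and block-based processing are illustrative assumptions.

```python
# Sketch: interchannel decorrelation via the KLT for one block of audio.
# V is an (n_channels x n_samples) array of time-aligned channel samples.
import numpy as np

def klt_decorrelate(V):
    mean = V.mean(axis=1, keepdims=True)
    Vbar = V - mean
    C = (Vbar @ Vbar.T) / Vbar.shape[1]      # covariance matrix C_V
    _, eigvecs = np.linalg.eigh(C)           # eigenvalues in ascending order
    M = eigvecs[:, ::-1].T                   # rows = eigenvectors, largest first
    U = M @ Vbar                             # eigenchannels (decorrelated)
    return U, M, mean

def klt_reconstruct(U, M, mean):
    return M.T @ U + mean                    # inverse transform, M is orthonormal
```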



Figure 2: Interchannel decorrelation via KLT (the KL transform matrix M maps the original multichannel audio signals V, which have high correlation between channels, to the eigenchannel audio signals U, which have little correlation between channels).

4. SCALABLE QUANTIZATION AND ENTROPY CODING

The major difference between the proposed progressive audio codec and other existing nonprogressive audio codecs, such as AAC, lies in the quantization block and the entropy coding block. The dual iteration loop used in AAC to calculate the quantization step size for the coefficients of each frame and each channel is replaced by a progressive quantization block. The Huffman coding block used in AAC to encode the quantized data is replaced by a context-based QM-coder. This is explained in detail below.

4.1. Successive approximation quantization (SAQ)

The most important component of the quantization block is called successive approximation quantization (SAQ). The SAQ scheme, which is adopted by most embedded wavelet coders for progressive image coding, is crucial to the design of embedded coders. The motivation for successive approximation is built upon the goal of developing an embedded code, in analogy to finding an approximation of the binary representation of a real number [17]. Instead of coding every quantized coefficient as one symbol, SAQ processes the bit representation of coefficients via bit layers sliced in the order of their importance. Thus, SAQ provides a coarse-to-fine, multiprecision representation of the amplitude information. The bitstream is organized such that a decoder can immediately start reconstruction based on the partially received bitstream. As more and more bits are received, more accurate coefficients and higher-quality multichannel audio can be reconstructed.

SAQ sequentially applies a sequence of thresholds $T_0, T_1, \ldots, T_{N+1}$ for refined quantization, where these thresholds are chosen such that $T_i = T_{i-1}/2$. The initial threshold $T_0$ is selected such that $|C(i)| < 2T_0$ for all transformed coefficients in one subband, where $C(i)$ represents the $i$th spectral coefficient in the subband. To implement SAQ, two separate lists, the dominant list and the subordinate list, are maintained both at the encoder and the decoder. At any point of the process, the dominant list contains the coordinates of those coefficients that have not yet been found to be significant, while the subordinate list contains the magnitudes of those coefficients that have been found to be significant. The process that updates the dominant list is called the significant pass, and the process that updates the subordinate list is called the refinement pass.

In the proposed algorithm, SAQ is adopted as the quantization method for each spectral coefficient within each subband. The algorithm (for the encoder part) is listed below.

Successive approximation quantization (SAQ) algorithm

(1) Initialization. For each subband, find the maximum absolute value $C_{\max}$ over all coefficients $C(i)$ in the subband, and set the initial quantization threshold to $T_0 = C_{\max}/2 + \Delta$, where $\Delta$ is a small constant.

(2) Construction of the significant map (significance identification). For each $C(i)$ contained in the dominant list, if $|C(i)| \geq T_k$, where $T_k$ is the threshold of the current layer (layer $k$), add $i$ to the significant map, remove $i$ from the dominant list, and encode it with "1s," where "s" is the sign bit. Moreover, modify the coefficient's value to

$C(i) \leftarrow \begin{cases} C(i) - 1.5\,T_k, & C(i) > 0, \\ C(i) + 1.5\,T_k, & \text{otherwise}. \end{cases}$    (2)

(3) Construction of the refinement map (refinement). For each $C(i)$ contained in the significant map, encode the bit at layer $k$ with a refinement bit "D" and change the value of $C(i)$ to

$C(i) \leftarrow \begin{cases} C(i) - 0.25\,T_k, & C(i) > 0, \\ C(i) + 0.25\,T_k, & \text{otherwise}. \end{cases}$    (3)

(4) Iteration. Set $T_{k+1} = T_k/2$ and repeat steps (2)–(4) for $k = 0, 1, 2, \ldots$.

At the decoder side, the decoder performs similar steps to reconstruct the coefficients' values. Figure 3 gives a simple example of how the decoder reconstructs a single coefficient after one significant pass and one refinement pass. As illustrated in this figure, the magnitude of the coefficient is recovered to 1.5 times the current threshold $T_k$ after the significant pass and then refined to $1.5T_k - 0.25T_k$ after the first refinement pass. As more refinement steps follow, the magnitude of this coefficient gradually approaches its original value.
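The sketch below mirrors the encoder-side SAQ listing above for a single subband, producing a flat symbol stream that would normally feed the context-based QM-coder. How the refinement bit "D" is chosen is not spelled out in the listing, so the convention used here (the sign of the remaining error) is an assumption, as are the symbol alphabet and function name.

```python
# Sketch of the encoder-side SAQ passes for one subband's coefficients.
import numpy as np

def saq_encode(coeffs, n_layers, delta=1e-6):
    C = np.array(coeffs, dtype=float)
    T = np.max(np.abs(C)) / 2.0 + delta          # step (1): initial threshold T0
    dominant = list(range(len(C)))
    significant = []
    symbols = []
    for _ in range(n_layers):
        # step (2): significance pass over the dominant list
        for i in list(dominant):
            if abs(C[i]) >= T:
                symbols += ['1', '+' if C[i] > 0 else '-']
                C[i] -= 1.5 * T if C[i] > 0 else -1.5 * T
                dominant.remove(i)
                significant.append(i)
            else:
                symbols.append('0')
        # step (3): refinement pass over coefficients in the significant map
        for i in significant:
            symbols.append('1' if C[i] > 0 else '0')   # assumed refinement bit
            C[i] -= 0.25 * T if C[i] > 0 else -0.25 * T
        T /= 2.0                                  # step (4): next layer
    return symbols
```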

4.2. Context-based QM-coder

The QM-coder is a binary arithmetic coding algorithm designed to encode data formed by a binary symbol set. It was the result of the effort by the JPEG and JBIG committees, in which the best features of various arithmetic coders are integrated. The QM-coder is a lineal descendant of the Q-coder, significantly enhanced by improvements in its two building blocks, that is, interval subdivision and probability estimation [18].


Figure 3: An example to show how the decoder reconstructs a single coefficient after one significant pass and one refinement pass.

Based on Bayesian estimation, a state-transition table, which consists of a set of rules for estimating the statistics of the bitstream from the incoming symbols, can be derived. The efficiency of the QM-coder can be improved by introducing a set of context rules. The QM arithmetic coder achieves a very good compression result if the context is properly selected to summarize the correlation between coded data.

Six classes of contexts are used in the proposed embedded audio codec, as shown in Figure 4: the general context, the constant context, the subband significance context, the coefficient significance context, the coefficient refinement context, and the coefficient sign context. The general context is used in coding the configuration information. The constant context is used to encode different channel header information. As their names suggest, the subband significance context, the coefficient significance context, the coefficient refinement context, and the coefficient sign context are used to encode the subband significance, coefficient significance, coefficient refinement, and coefficient sign bits, respectively. These contexts are adopted because different classes of bits may have different probability distributions; in principle, separating their contexts should increase the coding performance of the QM-coder.
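As an illustration of how such contexts could be used, the sketch below routes each bit to a separate adaptive model according to its class before binary arithmetic coding. The class names mirror the six contexts above, but the counter-based probability estimator is only a stand-in for, and not a reproduction of, the QM-coder state machine.

```python
from collections import defaultdict

# Context classes mirroring the six classes described in the text.
CONTEXTS = ("general", "constant", "subband_sig",
            "coef_sig", "coef_ref", "coef_sign")

class AdaptiveBinaryModel:
    """Per-context probability estimate (simple counts, not the QM state table)."""
    def __init__(self):
        self.counts = [1, 1]                  # Laplace-smoothed counts of 0s and 1s

    def p_one(self):
        return self.counts[1] / sum(self.counts)

    def update(self, bit):
        self.counts[bit] += 1

class ContextRouter:
    """Selects the probability model of the bit's context for the arithmetic coder."""
    def __init__(self):
        self.models = defaultdict(AdaptiveBinaryModel)

    def code_bit(self, context, bit):
        assert context in CONTEXTS
        model = self.models[context]
        p1 = model.p_one()                    # probability handed to the coder
        model.update(bit)                     # adapt the statistics of this class
        return p1                             # (interval subdivision itself is omitted)

router = ContextRouter()
router.code_bit("coef_sig", 1)
router.code_bit("coef_sign", 0)
```

Keeping the statistics of each bit class separate is what lets the arithmetic coder exploit the different probability distributions mentioned above.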

5. CHANNEL AND SUBBAND TRANSMISSION STRATEGY

5.1. Channel selection rule

In the embedded MAC, we should put the most important bits (in the rate-distortion sense) into the cascaded bitstream first so that the decoder can reconstruct multichannel audio of optimal quality given a fixed number of received bits. Thus, the importance of channels should be determined so that the bitstream can be ordered appropriately.

The first instinct about the metric of channel importance would be the energy of the audio signal in each channel. However, this metric does not work well in general. For example, for some multichannel audio sources, especially those that have been produced artificially in a music studio, a side channel that does not normally contain the main melody may have an even larger energy than the center channel. Based on our experience with multichannel audio, loss or significant distortion of the main melody in the center channel is much more annoying than loss of melodies in side channels. In other words, the location of channels also plays an important role. Therefore, for a regular 5.1 channel configuration, the order of channel importance from the largest to the least should be

(1) center channel,
(2) left and right (L/R) channel pair,
(3) left surround and right surround (Ls/Rs) channel pair,
(4) low-frequency channel.

Between channel pairs, their importance can be determined by their energy values. This rule is adopted in our experiments, given in Section 7.

After KLT, eigenchannels are no longer the original physical channels, and sounds from different physical channels are mixed in every eigenchannel. Thus, the spatial dependency of eigenchannels is less trivial. We observe from experiments that, although one eigenchannel may contain sounds from more than one original physical channel, there still exists a close correspondence between eigenchannels and physical channels. To be more precise, audio of eigenchannel 1 sounds similar to that of the center channel, audio of eigenchannels 2 and 3 sounds similar to that of the L/R channel pair, and so forth. Therefore, if eigenchannel 1 is lost in transmission, we would end up with a very distorted center channel. Moreover, eigenchannel 1 may sometimes not be the channel with a very large energy and could easily be discarded if the channel energy were adopted as the metric of channel importance. Thus, the channel importance of eigenchannels should be similar to that of physical channels, that is, eigenchannel 1 corresponds to the center channel, eigenchannels 2 and 3 correspond to the L/R channel pair, and eigenchannels 4 and 5 correspond to the Ls/Rs channel pair. Within each channel pair, the importance is still determined by their energy values.
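A minimal sketch of this ordering rule is given below; the channel labels and energy values are illustrative assumptions, and the same ordering would be applied to eigenchannels 1 to 5 plus the low-frequency channel.

```python
def order_channels(energies):
    """Order a 5.1 channel set by the importance rule described above.

    energies maps the channel names 'C', 'L', 'R', 'Ls', 'Rs', 'LFE' to their
    signal energies; within the L/R and Ls/Rs pairs the higher-energy channel
    comes first.
    """
    def pair(a, b):
        return [a, b] if energies[a] >= energies[b] else [b, a]

    return ['C'] + pair('L', 'R') + pair('Ls', 'Rs') + ['LFE']

# Example: a mix whose right channel carries more energy than the left.
print(order_channels({'C': 5.0, 'L': 2.0, 'R': 3.5, 'Ls': 1.0, 'Rs': 0.8, 'LFE': 0.3}))
# -> ['C', 'R', 'L', 'Ls', 'Rs', 'LFE']
```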

5.2. Subband selection rule

In principle, any quality assessment of an audio channel can be performed either subjectively, by employing a large number of expert listeners, or objectively, by using an appropriate measuring technique. While the first choice tends to be an expensive and time-consuming task, the use of objective measures provides quick and reproducible results. An optimal measuring technique would be a method that produces the same results as subjective tests while avoiding all problems associated with the subjective assessment procedure. Nowadays, the most prevalent objective measurement is the MNR technique, which was first introduced by Brandenburg [19] in 1987.


Figure 4: The adopted context-based QM-coder with six classes of contexts. (The quantizer output is split into program configuration, channel header, subband significance, coefficient significance, coefficient refinement, and coefficient sign bitstreams, which are merged into the compressed file.)

The MNR is the ratio of the masking threshold with respect to the error energy. In our implementation, the masking threshold is calculated from the general psychoacoustic model of the AAC encoder. The psychoacoustic model calculates the maximum distortion energy that is masked by the signal energy and outputs the signal-to-mask ratio (SMR).

A subband is masked if the quantization noise level is below the masking threshold, so the distortion introduced by the quantization process is not perceptible to human ears. As discussed earlier, SMR represents the human auditory response to the audio signal. If the SNR of an input audio signal is high enough, the noise level will be suppressed below the masking threshold, and the quantization distortion will not be perceived. The SNR can be easily calculated by

\[
\mathrm{SNR} = \frac{\sum_i \bigl|S_{\mathrm{original}}(i)\bigr|^2}{\sum_i \bigl|S_{\mathrm{original}}(i) - S_{\mathrm{reconstruct}}(i)\bigr|^2},
\tag{4}
\]

where S_original(i) and S_reconstruct(i) represent the ith original and the ith reconstructed audio signal value, respectively. MNR is then simply the difference between SNR and SMR (in dB), or

\[
\mathrm{SNR} = \mathrm{MNR} + \mathrm{SMR}.
\tag{5}
\]

A side benefit of the SAQ technique is that an operational rate versus distortion plot (or, equivalently, an operational rate versus current MNR value) for the coding algorithm can be computed online.
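The following sketch, assuming the original and reconstructed spectral coefficients of a subband and its SMR from the psychoacoustic model are available, evaluates equation (4) in dB and the corresponding MNR; it is an illustration, not the codec's implementation.

```python
import numpy as np

def subband_mnr(original, reconstructed, smr_db):
    """SNR of equation (4) in dB and MNR = SNR - SMR for one subband."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    signal = np.sum(original ** 2)
    noise = np.sum((original - reconstructed) ** 2) + 1e-20   # avoid division by zero
    snr_db = 10.0 * np.log10(signal / noise)
    return snr_db, snr_db - smr_db

snr, mnr = subband_mnr([1.0, -0.5, 0.25], [0.9, -0.45, 0.2], smr_db=12.0)
```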

Figure 5: Subband width distribution (subband width versus subband index).

The basic ideas behind choosing the subband selection rules are simple. They are presented as follows:

(1) the subband with a better rate reduction capability should be chosen earlier to enhance the performance;

(2) the subband with a smaller number of coefficients should be chosen earlier to reduce the computational complexity if the rate reduction performances of two subbands are close.

The first rule implies that we should allocate more bits to those subbands with larger SMR values (or smaller MNR values).


Figure 6: Illustration of the subband scanning rule, where the solid line with an arrow means that all subbands inside this area are scanned, and the dashed line means that only those nonsignificant subbands inside the area are scanned. (The scanning upper limits for the base layer and the first three enhancement layers are LB, LE1, LE2, and LE3, respectively.)

In other words, we should send out bits belonging to those subbands with larger SMR values (or smaller MNR values) first. The second rule tells us how to decide the subband scanning order. As is known from the subband1 formation in MPEG AAC, the number of coefficients in each subband is nondecreasing with increasing subband number. Figure 5 shows the subband width distribution used in AAC for 44.1 kHz and 48 kHz sampling frequencies and long block frames. Thus, a sequential subband scanning order from the lowest number to the highest number is adopted in this work.

In order to save bits, especially at very low bit rates, only information corresponding to lower subbands is sent into the bitstream at the first layer. As the number of layers increases, more and more subbands are added. Figure 6 shows how subbands are scanned for the first several layers. At the base layer, priority is given to lower-frequency signals so that only subbands numbered up to LB are scanned. As the information of enhancement layers is added to the bitstream, the subband scanning upper limit increases (as indicated by the values LE1, LE2, and LE3 shown in Figure 6) until it reaches the effective psychoacoustic upper bound of all subbands, N. In our implementation, we choose LE3 = N, which means that all subbands are scanned after the third enhancement layer. Here, the subband scanning upper limits in different layers, that is, LB, LE1, and LE2, are empirically determined values that provide a good coding performance.

A dual-threshold coding technique is proposed in this work. One threshold is the MNR threshold, which is used in subband selection. The other is the magnitude threshold, which is used for coefficient quantization in each selected subband. A subband whose MNR value is smaller than the current MNR threshold is called a significant subband. Similar to the SAQ process for coefficient quantization, two lists, that is, the dominant subband list and the subordinate subband list, are maintained at the encoder and the decoder, respectively.

1The term "subband" defined in this paper is equivalent to the "scale factor band" implemented in MPEG AAC.

The dominant subband list contains the indices of those subbands that have not yet become significant, and the subordinate subband list contains the indices of those subbands that have already become significant. The process that updates the dominant subband list is called the subband significant pass, and the process that updates the subordinate subband list is called the subband refinement pass.

Different coefficient magnitude thresholds are maintained in different subbands. Since we would like to deal with the most important subbands first and obtain the best result from only a small amount of information, and since sounds in different subbands have different impacts on human ears according to the psychoacoustic model, it is worthwhile to consider each subband independently rather than all subbands in one frame simultaneously.

We summarize the subband selection rule below.

(1) MNR threshold calculation. Determine empirically the MNR threshold value T^MNR_{i,k} for channel i at layer k. Subbands with a smaller MNR value at the current layer are given higher priority.

(2) Subband dominant pass. For those subbands that are still in the dominant subband list, if subband j in channel i has the current MNR value MNR^k_{i,j} < T^MNR_{i,k}, add subband j of channel i into the significant map, remove it from the dominant subband list, and send 1 to the bitstream, indicating that this subband is selected. Then, apply SAQ to the coefficients in this subband. For subbands that have MNR^k_{i,j} ≥ T^MNR_{i,k}, send 0 to the bitstream, indicating that this subband is not selected in this layer.

(3) Subband refinement pass. For a subband already in the subordinate list, perform SAQ on the coefficients in the subband.

(4) MNR values update. Recalculate and update the MNR values of the selected subbands.

(5) Repeat steps (1)–(4) until the bitstream meets the target rate.
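A compact sketch of one layer of this selection rule follows; the list handling and the saq_pass placeholder are our own illustration of steps (2)–(4), not the authors' code.

```python
def subband_selection_layer(mnr, threshold, dominant, subordinate, bitstream, saq_pass):
    """One layer of the dual-threshold subband selection (illustrative).

    mnr[j]      : current MNR of subband j
    threshold   : MNR threshold of this layer
    dominant    : subbands that are not yet significant
    subordinate : subbands that are already significant
    saq_pass(j) : placeholder that runs SAQ on the coefficients of subband j
    """
    already_significant = list(subordinate)
    # Subband dominant pass: promote subbands whose MNR falls below the threshold.
    for j in list(dominant):
        if mnr[j] < threshold:
            bitstream.append(1)               # this subband is selected
            dominant.remove(j)
            subordinate.append(j)
            saq_pass(j)
        else:
            bitstream.append(0)               # not selected in this layer
    # Subband refinement pass: keep refining previously selected subbands.
    for j in already_significant:
        saq_pass(j)
    # (The MNR values of the selected subbands would be recomputed here.)
```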

Figure 7 gives a simple example of the subband selection rule. Suppose that, at layer k, channel i has the MNR threshold T^MNR_{i,k}.


Figure 7: An example of the subband selection rule.

In this example, among all scanned subbands, that is, subbands 0 to 11, only subbands 3, 8, and 9 have current MNR values smaller than T^MNR_{i,k}. Therefore, according to rule (2), three 0 bits and one 1 bit are first sent into the bitstream, indicating nonsignificant subbands 0, 1, and 2 and significant subband 3. These subband selection bits are represented by the left-most shaded area in Figure 7. Similarly, the subband selection bits for subbands 4 to 11 are illustrated in the remaining shaded areas. The coefficient SAQ bits of each significant subband are sent immediately after its significance bit, as shown in this example.

6. COMPLETE DESCRIPTION OF PSMAC

The block diagram of a complete PSMAC encoder is shown in Figure 8. The perceptual model, the filter bank, the temporal noise shaping (TNS), and the intensity blocks in our progressive encoder are the same as those in the AAC main profile encoder. The interchannel redundancy removal block via KLT is implemented after the input audio signals are transformed into the modified discrete cosine transform (MDCT) domain. Then, a dynamic range control block follows to avoid any possible data overflow in later compression stages. Masking thresholds are then calculated in the perceptual model based on the KL-transformed signals. The progressive quantization and lossless coding parts are finally used to construct the compressed bitstream. The information generated by the first several coding blocks is sent into the bitstream as overhead.

Figure 9 provides more details of the progressive quantization block. The channel and subband selection rules are used to determine which subband in which channel should be encoded at this point, and the coefficients within the selected subband are then quantized via SAQ. The user-defined profile parameter is used for the syntax control of the channel selection and the subband selection. Finally, based on several different contexts, the layered information together with all overhead bits generated by previous coding blocks is losslessly coded using the context-based QM-coder.

The encoding process performed by the proposed algorithm stops when the bit budget is exhausted. It can cease at any time, and the resulting bitstream contains all lower-rate coded bitstreams. This is called the full embedded property. The capability to terminate the decoding of an embedded bitstream at any specific point is extremely useful in a coding system that is either rate constrained or distortion constrained.

7. EXPERIMENTAL RESULTS

The proposed PSMAC system has been implemented and tested. The basic audio coding blocks [1] inside the MPEG AAC main profile encoder, including the psychoacoustic model, filter bank, TNS, and intensity/coupling, are still adopted. Furthermore, an interchannel redundancy removal block, a progressive quantization block, and a context-based QM-coder block are added to construct the PSMAC.

Two types of experimental results are shown in this section. One is measured by an objective metric, that is, the MNR, and the other is measured in terms of a subjective metric, that is, the listening test score. It is worthwhile to mention that, for a fair comparison, the coding blocks adopted from AAC have not been modified to improve the performance of the proposed PSMAC. Moreover, test audio that produces the worst performance with the MPEG reference code was not selected in the experiment.

7.1. Results using MNR measurement

Two multichannel audio materials are used in this experiment to compare the performance of the proposed PSMAC algorithm with the MPEG AAC [1] main profile codec. One is a one-minute long ten-channel2 audio material called "Messiah," which is a piece of classical music recorded live in a real concert hall. The other is an eight-second long five-channel3 piece called "Herre," which is a piece of pop music and was used in the MPEG-2 AAC standard (ISO/IEC 13818-7) conformance work.

7.1.1. MNR progressive mode

The performance comparison of MPEG AAC and the proposed PSMAC for the normal MNR progressive mode is shown in Table 1.

2The ten channels include center (C), left (L), right (R), left wide (Lw), right wide (Rw), left high (Lh), right high (Rh), left surround (Ls), right surround (Rs), and back surround (Bs).

3The five channels include C, L, R, Ls, and Rs.


Figure 8: The block diagram of the proposed PSMAC encoder (input audio signal, filter bank, KLT, dynamic range control, TNS, intensity coupling, perceptual model, progressive quantization, noiseless coding, and bitstream multiplexing).

Figure 9: Illustration of the progressive quantization and lossless coding blocks (channel selection, subband selection, coefficient SAQ, MNR update, and the context-based binary QM-coder, under syntax control of the MNR progressive, random access, and channel enhancement modes).

The average MNR shown in the table is calculated by

\[
\text{mean MNR}_{\text{subband}} = \frac{\sum_{\text{channel}} \text{MNR}_{\text{channel, subband}}}{\text{number of channels}},
\qquad
\text{average MNR} = \frac{\sum_{\text{subband}} \text{mean MNR}_{\text{subband}}}{\text{number of subbands}}.
\tag{6}
\]

Table 1 shows the MNR values for the performance comparison of the nonprogressive AAC algorithm and the proposed PSMAC algorithm working in the MNR progressive profile. The values in this table clearly show that our codec outperforms AAC for both testing materials at lower bit rates, while it has only a small performance degradation at higher bit rates.

Table 1: MNR comparison for MNR progressive profiles.

Average MNR values (dB/subband/ch)

Bit rate (bit/s/ch)    Herre: AAC    Herre: PSMAC    Messiah: AAC    Messiah: PSMAC
16k                    −0.90          6.00            14.37           21.82
32k                     5.81         14.63            32.40           34.57
48k                    17.92         22.32            45.13           42.81
64k                    28.64         28.42            54.67           47.84

In addition, the bitstream generated by MPEG AAC only achieves an approximate bit rate and is normally a little bit higher than the desired one, while our algorithm achieves a much more accurate bit rate in all experiments carried out.


Figure 10: Listening test results for multichannel audio sources, where A = MPEG AAC and P = PSMAC: (a) Messiah, (b) Band, (c) Herbie, (d) Herre. Each panel plots the quality score versus the bit rate in bit/s/ch (16k, 32k, 48k, 64k).


7.1.2. Random access

The MNR result after the base-layer reconstruction for the random access mode using the test material "Herre" is shown in Table 2. When listening to the reconstructed music, we can clearly hear the quality difference between the enhanced period and the remaining periods. The MNR values given in Table 2 verify this claim by showing that the mean MNR value for the enhanced period is much better (by more than 10 dB per subband) than that of the remaining periods. It is common that we may prefer a certain part of a piece of music to others. With the random access profile, the user can individually access a period of music with better quality than others when the network condition does not allow a full high-quality transmission.

7.1.3. Channel enhancement

The performance result using the test material "Herre" for the channel enhancement mode is also shown in Table 2. Here, the center channel has been enhanced with enhancement parameter 1. Note that the total bit rate is kept the same for both codecs, that is, each has an average bit rate of 16 Kbps/ch. Since we have to separate the quantization and coding control of the enhanced physical channel, and to simplify the implementation, KLT is disabled in the channel enhancement mode. Compared with the normal MNR progressive mode, we find that the enhanced center channel has an MNR improvement of more than 10 dB per subband on average, while the quality of the other channels is only degraded by about 3 dB per subband.

When an expert subjectively listens to the reconstructed audio, the version with the enhanced center channel has a much better performance and is more appealing than the one without channel enhancement.


Table 2: MNR comparison for random access and channel enhancement profiles.

Average MNR values (dB/subband/ch)

Random access                        Channel enhancement
                                     Enhanced channel                Other channels
Other area      Enhanced area        w/o enhance    w/ enhance       w/o enhance    w/ enhance
3.99            13.94                8.42           19.23            1.09           −2.19

This is because the center channel of "Herre" contains more musical information than the other channels, and a better reconstructed center channel gives listeners a better overall quality, which is basically true for most multichannel audio materials. Therefore, this experiment suggests that, over a narrower bandwidth, audio generated by the channel enhancement mode of the PSMAC algorithm can provide the user with a more compelling experience through either a better reconstructed center channel or a channel that is more interesting to a particular user.

7.2. Subjective listening test

In order to further confirm the advantage of the proposed PSMAC algorithm, a formal subjective listening test according to ITU recommendations [20, 21, 22] was conducted in an audio lab to compare the coding performance of PSMAC and the MPEG AAC main profile. At a bit rate of 64 Kbps/ch, the reconstructed sound clips are supposed to have a perceptual quality similar to that of the original ones, which means that the difference between PSMAC and AAC would be so small that nonprofessionals can hardly hear it. According to our experience, nonprofessional listeners tend to give random scores if they cannot tell the difference between two sound clips, which makes their scores nonrepresentative. Therefore, instead of inviting a large number of nonexpert listeners, four well-trained professionals, who had no knowledge of either algorithm, participated in the listening test [22]. For each test material, subjects listened to three versions of the same sound clip, that is, the original one followed by the two processed ones (one by PSMAC and one by AAC, in random order). Subjects were allowed to listen to these files as many times as they wished until they were comfortable giving scores to the two processed sound files.

The five-grade impairment scale given in Recommendation ITU-R BS.1284 [21] was adopted in the grading procedure and utilized for the final data analysis. Besides "Messiah" and "Herre," another two ten-channel audio materials called "Band" and "Herbie" were included in this subjective listening test, where "Band" is rock band music recorded live in a football field, and "Herbie" is a piece of music played by an orchestra. According to ITU-R BS.1116-1 [20], the audio files selected for the listening test only contained short durations, that is, 10 to 20 seconds long.

Figure 10 shows the score given to each test material coded at four different bit rates in the listening test for multichannel audio materials. The solid vertical line represents the 95% confidence interval, where the middle line shows the mean value and the two lines at the boundary of the vertical line represent the upper and lower confidence limits [23]. It is clear from Figure 10 that, at lower bit rates, such as 16 Kbps/ch and 32 Kbps/ch, the proposed PSMAC algorithm outperforms MPEG AAC on all four test materials. To be more precise, at these two bit rates and for all four test materials, the proposed PSMAC algorithm achieves statistically significantly better results.4 At higher bit rates, such as 48 Kbps/ch and 64 Kbps/ch, PSMAC achieves either comparable or slightly degraded subjective quality when compared with MPEG AAC.

To demonstrate that the PSMAC algorithm achieves an excellent coding performance even for single-channel audio files, another listening test for mono sound was also carried out. Three single-channel single-instrument test audio materials, which were downloaded and processed from the MPEG sound quality assessment material, known as "GSPI" (http://www.tnt.uni-hannover.de/project/mpeg/audio/sqam/), "TRPT" (http://www.tnt.uni-hannover.de/project/mpeg/audio/sqam/), and "VIOO" (http://www.tnt.uni-hannover.de/project/mpeg/audio/sqam/), were used in this experiment, and the performance of the standard fine-grain scalable audio coder provided by MPEG-4 BSAC [6, 8] and that of the proposed PSMAC were compared.

Figure 11 shows the listening test results for the three single-channel audio materials. Where no confidence interval is shown, all four listeners happened to give the same score to the given sound clip. From this figure, we can clearly see that, at lower bit rates, for example, 16 Kbps/ch and 32 Kbps/ch, our algorithm generates better sound quality for all test sequences. In all cases except "GSPI" coded at 32 Kbps/ch, PSMAC achieves statistically significantly better performance than MPEG-4 BSAC. At higher bit rates, for example, 48 Kbps/ch and 64 Kbps/ch, our algorithm outperforms MPEG-4 BSAC for two out of three test materials and is only slightly worse in the "TRPT" case.

4We call algorithm A statistically significantly better than algorithm B if the mean value given to the sound clip processed by algorithm A is above the upper 95% confidence limit given to the sound clip processed by algorithm B.


Figure 11: Listening test results for single-channel audio sources, where B = BSAC and P = PSMAC: (a) GSPI, (b) TRPT, (c) VIOO. Each panel plots the quality score versus the bit rate in bit/s/ch (16k, 32k, 48k, 64k).

8. CONCLUSION

A PSMAC algorithm was presented in this research. This algorithm utilized KLT as a preprocessing block to remove interchannel redundancy inherent in the original multichannel audio source. Then, rules for channel selection and subband selection were developed, and the SAQ process was used to determine the importance of coefficients and their layered information. At the last stage, all information was losslessly compressed using the context-based QM-coder to generate the final multichannel audio bitstream.

The distinct advantages of the proposed algorithm over most existing MACs lie not only in its progressive transmission property, which can achieve precise rate control, but also in its rich-syntax design. Compared with the new MPEG-4 BSAC tool, PSMAC provides a more delicate subband selection strategy such that the information that is more sensitive to the human ear is reconstructed earlier and more precisely at the decoder side. Experimental results showed that PSMAC has a performance comparable to that of nonprogressive MPEG AAC at several different bit rates on the multichannel test materials, while PSMAC achieves better reconstructed audio quality than the MPEG-4 BSAC tool on single-channel test materials. Moreover, the advantage of the proposed algorithm over other existing audio codecs is more obvious at lower bit rates.

ACKNOWLEDGMENTS

This research has been funded by the Integrated Media Systems Center, a National Science Foundation Engineering Research Center, Cooperative Agreement no. EEC-9529152. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the National Science Foundation.

REFERENCES

[1] ISO/IEC 13818-5, Information technology—Generic coding of moving pictures and associated audio information—Part 5: Software simulation, 1997.

[2] ISO/IEC 13818-7, Information technology—Generic coding of moving pictures and associated audio information—Part 7: Advanced audio coding, 1997.

[3] K. Brandenburg and M. Bosi, "ISO/IEC MPEG-2 advanced audio coding: overview and applications," in Proc. 103rd Convention of Audio Engineering Society (AES), New York, NY, USA, September 1997.

[4] M. Bosi, K. Brandenburg, S. Quackenbush, et al., "ISO/IEC MPEG-2 advanced audio coding," in Proc. 101st Convention of Audio Engineering Society (AES), Los Angeles, Calif, USA, November 1996.

[5] S.-H. Park, Y.-B. Kim, S.-W. Kim, and Y.-S. Seo, "Multi-layer bit-sliced bit-rate scalable audio coding," in Proc. 103rd Convention of Audio Engineering Society (AES), New York, NY, USA, September 1997.

[6] ISO/IEC JTC1/SC29/WG11 N2205, Final Text of ISO/IEC FCD 14496-5 Reference Software.

[7] ISO/IEC JTC1/SC29/WG11 N2803, Text ISO/IEC 14496-3 Amd 1/FPDAM.


[8] ISO/IEC JTC1/SC29/WG11 N4025, Text of ISO/IEC 14496-5:2001.

[9] J. Herre, E. Allamanche, K. Brandenburg, et al., "The integrated filterbank based scalable MPEG-4 audio coder," in Proc. 105th Convention of Audio Engineering Society (AES), San Francisco, Calif, USA, September 1998.

[10] J. Zhou and J. Li, "Scalable audio streaming over the internet with network-aware rate-distortion optimization," in Proc. IEEE International Conference on Multimedia and Expo, Tokyo, Japan, August 2001.

[11] M. S. Vinton and E. Atlas, "A scalable and progressive audio codec," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, Utah, USA, May 2001.

[12] Y. Shen, H. Ai, and C.-C. J. Kuo, "A progressive algorithm for perceptual coding of digital audio signals," in Proc. 33rd Annual Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, Calif, USA, October 1999.

[13] S. Haykin, Adaptive Filter Theory, Prentice Hall, Upper Saddle River, NJ, USA, 3rd edition, 1996.

[14] D. Yang, H. Ai, C. Kyriakakis, and C.-C. J. Kuo, "An inter-channel redundancy removal approach for high-quality multichannel audio compression," in Proc. 109th Convention of Audio Engineering Society (AES), Los Angeles, Calif, USA, September 2000.

[15] D. Yang, H. Ai, C. Kyriakakis, and C.-C. J. Kuo, "An exploration of Karhunen-Loeve transform for multichannel audio coding," in Proc. SPIE on Digital Cinema and Microdisplays, vol. 4207 of SPIE Proceedings, pp. 89–100, Boston, Mass, USA, November 2000.

[16] D. Yang, H. Ai, C. Kyriakakis, and C.-C. J. Kuo, "High fidelity multichannel audio coding with Karhunen-Loeve transform," IEEE Trans. Speech and Audio Processing, vol. 11, no. 4, 2003.

[17] J. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Trans. Signal Processing, vol. 41, no. 12, pp. 3445–3462, 1993.

[18] W. Pennebaker and J. Mitchell, JPEG Still Image Data Compression Standard, Van Nostrand Reinhold, New York, NY, USA, 1993.

[19] K. Brandenburg, "Evaluation of quality for audio encoding at low bit rates," in Proc. 82nd Convention of Audio Engineering Society (AES), London, UK, 1987.

[20] ITU-R Recommendation BS.1116-1, Methods for the subjective assessment of small impairments in audio systems including multichannel sound systems.

[21] ITU-R Recommendation BS.1284, Methods for the subjective assessment of sound quality – general requirements.

[22] ITU-R Recommendation BS.1285, Pre-selection methods for the subjective assessment of small impairments in audio systems.

[23] R. A. Damon Jr. and W. R. Harvey, Experimental Design, ANOVA, and Regression, Harper & Row Publishers, New York, NY, USA, 1987.

Dai Yang received the B.S. degree in electronics from Peking University, Beijing, China in 1997, and the M.S. and Ph.D. degrees in electrical engineering from the University of Southern California, Los Angeles, Calif in 1999 and 2002, respectively. She is currently a Postdoctoral Researcher in NTT Cyber Space Laboratories in Tokyo, Japan. Her research interests are in the areas of digital signal and image processing; audio, speech, video, and graphics coding; and their network/wireless applications.

Hongmei Ai received the B.S. degree in 1991, and the M.S. and Ph.D. degrees in 1996, all in electronic engineering from Tsinghua University. She was an Assistant Professor (1996–1998) and Associate Professor (1998–1999) in the Department of Electronic Engineering at Tsinghua University, Beijing, China. She was a Visiting Scholar in the Department of Electrical Engineering at the University of Southern California, Los Angeles, Calif from 1999 to 2002. Now she is a Principal Software Engineer at Pharos Science & Applications, Inc., Torrance, Calif. Her research interests focus on signal and information processing and communications, including data compression, video and audio processing, and wireless communications.

Chris Kyriakakis received the B.S. degree from the California Institute of Technology in 1985, and the M.S. and Ph.D. degrees from the University of Southern California in 1987 and 1993, respectively, all in electrical engineering. He is currently an Associate Professor in the Department of Electrical Engineering, University of Southern California, where he heads the Immersive Audio Laboratory. His research focuses on multichannel audio acquisition, synthesis, rendering, room equalization, streaming, and compression. He is also the Research Area Director for sensory interfaces in the Integrated Media Systems Center, which is the National Science Foundation's Exclusive Engineering Research Center for multimedia and Internet research at the University of Southern California.

C.-C. Jay Kuo received the B.S. degree from the National Taiwan University, Taipei, in 1980, and the M.S. and Ph.D. degrees from the Massachusetts Institute of Technology, Cambridge, in 1985 and 1987, respectively, all in electrical engineering. Dr. Kuo was a computational and applied mathematics (CAM) Research Assistant Professor in the Department of Mathematics at the University of California, Los Angeles, from October 1987 to December 1988. Since January 1989, he has been with the Department of Electrical Engineering-Systems and the Signal and Image Processing Institute at the University of Southern California, where he currently has a joint appointment as a Professor of electrical engineering and mathematics. His research interests are in the areas of digital signal and image processing, audio and video coding, wavelet theory and applications, multimedia technologies, and large-scale scientific computing. He has authored around 500 technical publications in international conferences and journals, and has graduated more than 50 Ph.D. students. Dr. Kuo is a member of SIAM and ACM, and a Fellow of IEEE and SPIE. He is the Editor-in-Chief of the Journal of Visual Communication and Image Representation. Dr. Kuo received the National Science Foundation Young Investigator Award (NYI) and Presidential Faculty Fellow (PFF) Award in 1992 and 1993, respectively.


EURASIP Journal on Applied Signal Processing 2003:10, 993–1000
© 2003 Hindawi Publishing Corporation

Time-Scale Invariant Audio Data Embedding

Mohamed F. Mansour
Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55414, USA
Email: [email protected]

Ahmed H. Tewfik
Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55414, USA
Email: [email protected]

Received 31 May 2002 and in revised form 22 December 2002

We propose a novel algorithm for high-quality data embedding in audio. The algorithm is based on changing the relative length of the middle segment between two successive maximum and minimum peaks to embed data. Spline interpolation is used to change the lengths. To ensure smooth monotonic behavior between peaks, a hybrid orthogonal and nonorthogonal wavelet decomposition is used prior to data embedding. The possible data embedding rates are between 20 and 30 bps. However, for practical purposes, we use repetition codes, and the effective embedding data rate is around 5 bps. The algorithm is invariant after time-scale modification, time shift, and time cropping. It gives high-quality output and is robust to mp3 compression.

Keywords and phrases: data embedding, broadcast monitoring, time-scale invariant, spline interpolation.

1. INTRODUCTION

In this paper, we introduce a new algorithm for high-capacity data embedding in audio that is suited for marketing, broadcast, and playback monitoring applications. The purpose of broadcast and playback monitoring is primarily to analyze the broadcasted content and collect statistical data to improve the content quality. For this class of applications, security is not an important issue. However, the embedded data should survive the basic operations that the host audio signal may undergo.

The most important requirements of a data embedding system are transparency and robustness. Transparency means that there is no perceptual difference between the original and the modified host media. Data embedding techniques usually exploit irrelevancies in the digital representation to assure transparency. For audio data embedding, the masking phenomenon is usually exploited to assure that the distortion due to data embedding is imperceptible. Robustness refers to the property that the embedded data should remain in the host media regardless of the signal processing operations that the signal may undergo.

The research work in audio watermarking can be classified into two broad classes: spread-spectrum watermarking and projection-based watermarking. In spread-spectrum watermarking, the data is embedded by adding a pseudo random sequence (the watermark) to the audio signal or to some features derived from it. An example of spread-spectrum watermarking in the time domain was presented in [1]. The features used for data embedding include the phase of the Fourier coefficients [2], the middle frequency coefficients [3], and the cepstrum coefficients [4]. More complicated structures for spread-spectrum watermarking (e.g., [5]) were proposed to synchronize the watermarked signal with the watermark prior to decoding. On the other hand, projection-based watermarking is based on quantizing the host signal to two or more codebooks that represent the different symbols to be embedded. The decoding is done by quantizing the watermarked signal and deciding the symbol that corresponds to the codebook with minimum quantization error. Examples of this technique are described in [6, 7].

Signal synchronization is an important issue in watermark decoding. Loss of synchronization will result in random decoding even if the individual watermark components are extracted correctly. In this paper, we propose a new embedding algorithm that is automatically robust to most synchronization attacks that the signal may undergo.

The proposed algorithm is designed to be transparent and robust to most common signal processing operations. It is automatically invariant under time-scale modification (TSM), which is the most severe attack for most data embedding algorithms. In addition, it is robust to basic signal processing operations, for example, lowpass filtering, mp3 compression, and bandpass filtering. Also, the embedding algorithm is localized in nature, hence it is robust to synchronization attacks, for example, cropping and time shift.


Figure 1: Embedding example (original and modified waveforms for the bit sequence 1 0 1 0 1 1 1 0).

The idea of the algorithm is to change the length of the middle segment between two successive peaks, relative to the total length between the two peaks, so as to be greater or less than a certain threshold to embed a one or a zero, respectively. Hence, if the signal is subject to TSM, both the middle interval and the whole segment change by the same factor, leaving the ratio unchanged. The algorithm is therefore automatically robust to TSM without any need to rescale the signal. This work was first introduced in [8].

The average embedding capacity of the algorithm is 20–30 bps. However, due to practical issues that will be discussed in Section 3, the embedded data is first encoded with a low-rate code. The effective embedding rate then drops to 4–6 bps.

The paper is organized as follows. Section 2 describes the basic idea of the embedding and extraction algorithms. Section 3 discusses several practical issues and the implementation details of the general ideas described in Section 2. Section 4 describes the encoder/decoder structure, and Section 5 gives the experimental results of the algorithm.

2. ALGORITHM

2.1. Basic idea

The intervals between a successive maximum and minimum pair are partitioned into N segments of equal amplitude, where N is odd (typically N = 3 or 5). If we have an exact linear behavior between the two extrema, then all the segments will be of equal size (up to a quantization error). For sinusoidal-like segments, the outer segments tend to be longer than the inner ones because of the smaller slope in these segments.

If we assume that the total length of the interval between the two peaks is L and the length of each segment is l_i, then the basic idea of the algorithm is to control the ratio l_(N+1)/2 / L to be greater or less than a certain threshold γ to embed a one or a zero, respectively. The idea is illustrated in Figure 1.
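The following Python sketch, assuming a monotone approximation segment between a maximum at index i_max and the following minimum at index i_min, partitions the interval into N equal-amplitude segments and computes the middle-segment ratio used for the bit decision; it is illustrative only.

```python
import numpy as np

def middle_segment_ratio(x, i_max, i_min, N=3):
    """Ratio of the middle-segment length to the interval length."""
    seg = np.asarray(x[i_max:i_min + 1], dtype=float)
    hi, lo = seg[0], seg[-1]
    levels = np.linspace(hi, lo, N + 1)            # equal-amplitude boundaries
    lengths = []
    for k in range(N):
        upper, lower = levels[k], levels[k + 1]
        if k == N - 1:                             # include the lower extremum itself
            inband = (seg <= upper) & (seg >= lower)
        else:
            inband = (seg <= upper) & (seg > lower)
        lengths.append(int(np.sum(inband)))        # samples whose amplitude lies in band k
    return lengths[N // 2] / len(seg)              # l_(N+1)/2 divided by L

def decode_bit(ratio, gamma):
    """A one is embedded when the ratio exceeds the threshold, a zero otherwise."""
    return 1 if ratio > gamma else 0
```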

Note that the smoothness of the signal increases when it is lowpass filtered. This results in a higher embedding capacity. However, we need to efficiently reconstruct the signal from the lowpass component. In our implementation, we used a hybrid of orthogonal and nonorthogonal wavelet decompositions (discussed in the next subsection) to satisfy the two requirements of smoothness and efficient reconstruction.

Figure 2: Orthogonal wavelet decomposition: (a) analysis stage; (b) synthesis stage.

The approximation signal at the coarsest scale is modified rather than the signal itself. The practicalities of choosing the possible intervals and selecting the threshold are discussed in Section 3.

2.2. Hybrid orthogonal/nonorthogonal wavelet decomposition

The required smooth behavior does not occur often in audio signals, except for a set of single-instrument audio like a piano and a flute. For other composite audio signals, this requirement is hardly fulfilled. This greatly reduces the embedding rate if the original signal is used directly for embedding. Moreover, even if such a behavior exists, it is very vulnerable to distortion after compression. Hence, the direct audio signal is not a good candidate for data embedding.

In our implementation, we used a hybrid of orthogonal and nonorthogonal decompositions. These two types of decompositions are illustrated in Figures 2 and 3.

The orthogonal decomposition is an exact (nonredundant) representation of the signal. It involves subsampling after each decomposition stage. Hence, the approximation signal is not smoother than the original because of the frequency spread after subsampling. If a modification is done in the transform domain, then it is preserved after the inverse and the forward transform because the representation is nonredundant.

On the other hand, the nonorthogonal wavelet decomposition does not involve subsampling after filtering at each scale, as illustrated in Figure 3. For our particular purpose, this decomposition has a two-fold advantage. First, the lengths are preserved so that the lengths at any scale are in one-to-one correspondence with the lengths at the finest scale. The second advantage is that the approximation signal at coarser scales is smoother than the signal at a finer scale.

However, the nonorthogonal decomposition is redundant, that is, not every two-dimensional signal is a valid transform. Hence, modifications in the transform domain are not guaranteed to be preserved if the inverse transform is applied.


Figure 3: Nonorthogonal wavelet decomposition: (a) analysis stage; (b) synthesis stage.

This is more apparent if the modification is done at a sufficiently coarse scale. Hence, at most two decomposition scales can be used to embed the data. However, this is not sufficient for robustness against lossy compression.

In Figure 4, we illustrate the ideas in the previous two paragraphs. The first 10^3 samples of the original signal are plotted along with the approximation signal after three scales of nonorthogonal and orthogonal decompositions. We notice that the approximation signal after the nonorthogonal decomposition is much smoother.

In our system, the orthogonal decomposition is applied first. The resulting approximation signal is further decomposed using the nonorthogonal decomposition. The orthogonal decomposition gives the required robustness against lossy compression, but at the cost of reducing the interval lengths, that is, reducing the embedding rate. The nonorthogonal decomposition gives the required smooth behavior between peaks. Typically, one scale of orthogonal decomposition is used with two scales of nonorthogonal decomposition.
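One possible realization of this hybrid is sketched below with PyWavelets, using the decimated DWT for the orthogonal stage and the stationary (undecimated) wavelet transform as a stand-in for the nonorthogonal stage; the wavelet choice, the zero padding, and the level counts are assumptions rather than the authors' exact implementation.

```python
import numpy as np
import pywt

def hybrid_decompose(x, wavelet="db4", nonorth_levels=2):
    """One orthogonal DWT stage followed by an undecimated (stationary) WT."""
    # Orthogonal stage: decimated DWT, which gives robustness to lossy compression.
    cA, cD = pywt.dwt(x, wavelet)
    # pywt.swt needs a length that is a multiple of 2**levels; pad with zeros if needed.
    pad = (-len(cA)) % (2 ** nonorth_levels)
    cA_padded = np.concatenate([cA, np.zeros(pad)])
    # Nonorthogonal stage: no subsampling, so lengths are preserved and the
    # coarsest approximation is smoother than its input.
    swt_coeffs = pywt.swt(cA_padded, wavelet, level=nonorth_levels)
    approx = swt_coeffs[0][0]                  # coarsest-scale approximation signal
    return approx, swt_coeffs, cD, pad

def hybrid_reconstruct(swt_coeffs, cD, pad, wavelet="db4"):
    """Invert the hybrid decomposition after the approximation has been modified."""
    cA_padded = pywt.iswt(swt_coeffs, wavelet)
    cA = cA_padded[:len(cA_padded) - pad] if pad else cA_padded
    return pywt.idwt(cA, cD, wavelet)
```

Because the undecimated stage is redundant, an edited approximation is not guaranteed to survive the inverse and forward transforms exactly, which is the convergence issue discussed in Section 2.3.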

It should be mentioned that the filters of the orthogonal and nonorthogonal decompositions need not be similar. Different bases can be used within the same framework. The optimal choice of the wavelet basis for this problem is beyond the scope of this paper. However, experimental results show that different wavelet bases give very comparable results.

2.3. The embedding algorithm

The first step in the embedding algorithm is to apply a hybrid of orthogonal and nonorthogonal decompositions as discussed in the previous subsection. After decomposition, the approximation signal at the coarsest scale is used for embedding.

The next step is to change the relative length of the middle segment between two successive refined extrema to match the corresponding data bit.

Figure 4: Examples of orthogonal and nonorthogonal decompositions (the original signal, the approximation after the nonorthogonal decomposition, and the approximation after the orthogonal decomposition).

Let the total number of samples between the two extreme points be L and let the length of the middle segment be l. Define α = l/L and the threshold γ. To embed a one, α should be greater than γ, and vice versa. Spline interpolation is used to modify the lengths. An increase in the middle segment is reflected in a decrease in both of the outer segments, and vice versa, so as to keep the original length between the extrema unchanged. Some details for accelerating the convergence and improving the error performance are described in Section 3.3. The overall algorithm is illustrated in Figure 5.
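A hedged sketch of this length modification is given below, using SciPy spline interpolation to resample the middle segment and compensating in the two outer segments; the guard handling and the way the compensation is split are our own simplifications.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def resample_segment(y, new_len):
    """Resample a 1D segment to new_len samples (cubic spline, linear fallback)."""
    y = np.asarray(y, dtype=float)
    new_len = max(int(new_len), 2)                 # keep at least two samples
    old_t = np.linspace(0.0, 1.0, len(y))
    new_t = np.linspace(0.0, 1.0, new_len)
    if len(y) >= 4:
        return CubicSpline(old_t, y)(new_t)
    return np.interp(new_t, old_t, y)              # too few points for a cubic fit

def set_middle_ratio(segments, gamma, bit, guard=2):
    """Stretch or shrink the middle segment so that l/L matches the data bit.

    segments is a list of N arrays covering one extremum-to-extremum interval
    (N odd); the total length L is preserved by shortening or lengthening the
    two outer segments, and guard keeps the new ratio a couple of samples away
    from the threshold (see Section 3.3).
    """
    L = sum(len(s) for s in segments)
    mid = len(segments) // 2
    if bit == 1:
        target = int(np.ceil(gamma * L)) + guard
    else:
        target = int(np.floor(gamma * L)) - guard
    diff = target - len(segments[mid])
    out = list(segments)
    out[mid] = resample_segment(segments[mid], target)
    out[0] = resample_segment(segments[0], len(segments[0]) - diff // 2)
    out[-1] = resample_segment(segments[-1], len(segments[-1]) - (diff - diff // 2))
    return out
```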

The main difficulty with the algorithm described above stems from the redundancy of the nonorthogonal wavelet transform. Specifically, not all 2D functions are valid wavelet transforms. Therefore, it is possible to end up with a nonvalid transform after modifying the coarsest scale of the signal. In particular, some previously used intervals may disappear or new intervals may arise. This causes a shift in the embedded sequence from one iteration to another and slows down the convergence. To partially fix this problem, a repetition code is used where each bit is repeated an odd number of times (typically five). The advantage of a repetition code is two-fold. First, at the encoder side, it accelerates the convergence because a smaller number of intervals need modification after peak deviation. Second, at the decoder side, it alleviates the problem of false alarms, as will be discussed later. The transitions from zero to one and from one to zero are labeled as markers in the embedded sequence. These markers play a crucial role in synchronizing the data in the presence of false alarms and missed peaks, as will be described in Section 4.2.

2.4. The extraction algorithm

The extraction algorithm is straightforward. The hybrid decomposition is applied as in the encoder. Then, the peaks are picked and refined.


Figure 5: Embedding algorithm (hybrid orthogonal/nonorthogonal wavelet decomposition, identification of candidate peaks, length modification using spline interpolation, iteration, and signal reconstruction).

For the refined intervals, the ratio α (= l/L) is calculated. If α > γ, then decide one; otherwise, decide zero. If a repetition code is used, then the majority rule is applied to decide the decoded bit.
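The extraction decision and the majority vote over an r-fold repetition code can be sketched as follows, assuming an ideal channel with no false alarms or missed peaks.

```python
def extract_bits(ratios, gamma):
    """Raw bit decisions from the middle-segment ratios of the refined intervals."""
    return [1 if r > gamma else 0 for r in ratios]

def majority_decode(raw_bits, r=5):
    """Collapse an r-fold repetition code by majority vote (r odd)."""
    decoded = []
    for k in range(0, len(raw_bits) - r + 1, r):
        group = raw_bits[k:k + r]
        decoded.append(1 if sum(group) > r // 2 else 0)
    return decoded

# A single wrong decision inside the second group is voted away.
print(majority_decode([1, 1, 1, 1, 1, 0, 0, 1, 0, 0], r=5))   # -> [1, 0]
```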

This extraction algorithm works well with nice channels, which do not introduce false alarms or missed data, that is, channels with no synchronization problems. This type of channel is a good model for simple operations like volume change. However, if the audio signal undergoes compression or lowpass filtering, this ideal situation cannot be assumed, and additional work has to be done to synchronize the data and remove the false alarms. The details of the practical decoding algorithm are discussed in Section 4.3.

3. PRACTICAL ISSUES

3.1. Refining the extrema

The careful selection of the extrema is an important issue for the algorithm performance. The objective here is to identify the pairs of successive extrema between which reliable embedding is possible. The first requirement is to choose the pairs with a distance greater than a certain threshold. This threshold should guarantee that the middle segment and each of the outer segments contain at least two samples after modifying their lengths. The second requirement is that a refined peak should be a strong one in the sense that it should be significantly larger (or smaller) than its immediate neighbors. This is important to ensure that the peak will survive modifications and compression.

Here it is important to mention that adjacent peaks that are very close to each other and very close to their immediate neighbors are labeled as weak peaks. In our algorithm, weak peaks are not considered peaks at all, and they are ignored if they exist between two successive strong peaks. Those weak peaks usually arise if the signal undergoes compression. Hence, if they were treated as peak candidates, they might lead to missed peaks.

3.2. Threshold selection

The selection of the threshold γ is important for the quality of the output audio and for fast convergence at the encoder. To minimize the modifications due to changing the length of the middle interval, we first calculate the histogram of the ratio of the middle segment length to the interval length. Then we set the threshold as the median of the histogram. This is done offline only once using a large set of audio pieces.
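A minimal sketch of this offline calibration is shown below; the training ratios are placeholder numbers.

```python
import numpy as np

def calibrate_threshold(training_ratios):
    """Set gamma to the median of middle-segment ratios over a training set.

    Using the median means roughly half of the intervals already satisfy each
    bit value, which minimizes the amount of modification needed at the encoder.
    """
    return float(np.median(np.asarray(training_ratios, dtype=float)))

gamma = calibrate_threshold([0.28, 0.31, 0.36, 0.33, 0.30, 0.41, 0.35])
```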

3.3. Modifying the lengths

The embedding algorithm as described earlier requires modification of the lengths of the interval segments. For example, assume we have a data bit of 1; then l/L should be greater than γ. Otherwise, the length l of the middle segment should be increased to satisfy this inequality. The increment in the middle segment is reflected in a decrement in both of the outer segments so as to preserve the original interval length. The process is reversed for embedding zero. The modification of all segments is performed via spline interpolation.

To improve the error performance and to give additional robustness against TSM, a guard band is built around the threshold so that all modified segments are at least two samples above or below the threshold value.

The interval lengths should be chosen large enough to assure that there is a sufficient number of samples in each segment after the length decrement. At least two samples are needed in each segment to perform correct interpolation. For example, if the intervals are segmented into five levels, then the typical threshold length between refined peaks is at least 20 samples. This limits the highest-frequency component that can be used in embedding. For example, if the sampling frequency is 44.1 kHz and the intervals between two successive extrema are at least 20 samples, then the highest-frequency component that is used in embedding is around 1.1 kHz. Moreover, if an orthogonal decomposition is applied first, then the subsampling reduces the periods by half at each scale. Hence, 20 samples after two scales of orthogonal decomposition correspond to a frequency of 1.1 kHz/4 ≈ 275 Hz. For some instruments, these very low frequency components do not exist, and hence the nonorthogonal decomposition should be applied to the original signal directly.

4. ENCODER/DECODER STRUCTURE

Due to the complications introduced by the presence of false alarms and missed bits, the encoder/decoder structure of the whole system is more complex than the simple structure described earlier. In this section, we discuss these structures in detail. In the first subsection, we discuss the sources of false alarms; then we discuss the encoder/decoder structure used to cope with this problem.

4.1. False alarms

False alarms pose a serious problem for our algorithm and establish a limitation on the possible embedding rate. These false alarms usually arise after mp3 compression.


Figure 6: False alarms example.

By false alarms we mean the peaks that are identified by the decoder but not used by the encoder. These false alarms appear for two main reasons.

(1) The smoothing effect of compression and lowpass filtering, which may remove some weak peaks.

(2) The deviation of some strong peaks around the threshold length after signal processing. For example, assume that refined peaks should be 30 samples apart; then, at the encoder, peaks that are 29 samples apart are not considered in embedding. However, these periods may increase after compression by one sample (or more), and therefore the decoder will recognize them as active periods.

These two sources of false alarms are illustrated in Figure 6.

Figure 6: False alarms example (a strong peak may be smoothed by compression, or an interval length L below the threshold Th may exceed it after compression).

These false alarms lead to a loss of synchronization at the decoder. Remedies for this problem are described in Sections 4.2 and 4.3; the problem was treated in detail in [9, 10].

It should be mentioned that missed peaks might also occur. However, this happens much less frequently than false alarms. The number of false alarms ranges from 2% to 15% of the total number of peaks, depending on the nature of the audio signal.

4.2. Encoding

To alleviate the problem of false alarms, a self-synchronization mechanism should be contained in the embedded sequence. As mentioned earlier, a repetition code is used at the encoder to improve the convergence and the error performance. If each bit is repeated r times and a single false alarm occurs within a sequence of r similar bits, then it can be easily identified and removed.

The main idea of the encoding algorithm is to isolate the false alarms so that they can be identified individually. At each transition from a group of ones to a group of zeros (or the reverse), a marker is put. The sequences of bits between successive markers are decoded separately. In [8], long sequences of zeros or ones are broken up by employing the high-density bipolar coding (HDBn) scheme from digital communications, which adds a bit of reverse polarity after a long sequence of similar bits. However, experiments show that this may lead to loss of synchronization in the extracted bits if the extra bit is not identified properly.

4.3. Decoding

The decoder performs the following steps.

(1) Extracting the embedded bits as described in Section 2.4. During extraction, each bit is given a score that represents the certainty about the correctness of this bit; the higher the score, the higher the certainty of the corresponding bit. This score is simply the difference between the actual length of the middle segment and the threshold length. These scores are used in further operations.

(2) Applying a median filter (with width r) to the extracted bit sequence so as to remove sparse bits that do not agree with their neighbors, while at the same time preserving the correct transitions between different bits.

(3) Identifying the markers, which are defined as the points at which a sign change occurs and the median of the following bits is different from that of the preceding bits.

(4) Identifying the bit sequence between the markers. If the number of bits is divisible by r, then the sequence of bits is decoded using the majority rule. Problems arise when the number of bits between two successive markers is not divisible by r. For example, assume r = 5 and the number of bits between two successive markers is 13; then we have two possibilities. The first is that the correct number of bits is 10 and we have three false alarms; the other is that the correct number is 15 and we have two missed peaks. The decision between the two possibilities is based on the scores of the residual bits, that is, the three bits with the lowest scores. If the average score of these bits is far smaller than the average of the remaining bits, then they are classified as false alarms; otherwise, they are classified as missed peaks (see the sketch after this list).

(5) Removing the redundant bits that are added at the encoder side if HDBn encoding is employed. This is done by skipping a bit with opposite sign that follows n similar bits in the final output stream.
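A minimal sketch of the ambiguity resolution in step (4), assuming the per-bit scores from step (1) are available; resolve_segment and the score_margin heuristic are illustrative names and thresholds, not taken from the paper.

```python
def resolve_segment(bits, scores, r, score_margin=0.5):
    """Decide how many decoded bits a marker-delimited segment carries.

    bits:   extracted bits between two successive markers (all equal)
    scores: certainty score of each bit (middle-segment length minus
            threshold length), in the same order as bits
    r:      repetition factor used at the encoder

    If len(bits) is not a multiple of r, the residual bits with the
    lowest scores are either false alarms (drop them) or evidence of
    missed peaks (round the count up to the next multiple of r).
    """
    n = len(bits)
    residual = n % r
    if residual == 0:
        return n // r                      # clean case: majority rule applies
    ranked = sorted(scores)
    low_avg = sum(ranked[:residual]) / residual
    rest_avg = sum(ranked[residual:]) / (n - residual)
    if low_avg < score_margin * rest_avg:  # weakest bits look like insertions
        return n // r                      # treat them as false alarms
    return n // r + 1                      # otherwise assume missed peaks
```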

In what follows, we discuss the effect of repetition encoding in reducing the probability of false alarms. We use the following assumptions.

(1) Only false alarms exist (no missed bits).
(2) The probability of a false alarm is Pf.
(3) False-alarm events are independent.
(4) Each bit is repeated r times.
(5) All markers are identified correctly.
(6) The number of false alarms between two markers is less than the number of original bits between them.


Table 1: Probabilities of the number of bits between markers.

Bits before encoding   Bits after encoding   Probability
1                      r                     1/2
2                      2r                    1/4
...                    ...                   ...
n                      nr                    1/2^n
...                    ...                   ...

After repetition, a false alarm exists only if there are more than (r + 1)/2 false alarms between two successive markers. With repetition, we can have a multiple of r bits between two successive markers. If zero and one are equally probable, then Table 1 gives the probabilities for the number of bits between markers.

The number of false alarms within a given number of bits has a binomial distribution because the false-alarm events are independent. The probability of having k false alarms in a sample space of size N bits is

$$
P_N(k) = \binom{N}{k} P_f^{k} \left(1 - P_f\right)^{N-k}. \qquad (1)
$$

Note that in (1), N takes the discrete values r, 2r, 3r, and so forth. The probability of having a false alarm after encoding is the probability of having ⌈N/2⌉ false alarms or more between two successive markers, where ⌈·⌉ is the ceiling function. Hence, the new probability of false alarm is

$$
P_{\mathrm{FA}} = \sum_{m=1}^{\infty} \sum_{k=\lceil mr/2\rceil}^{mr} P_{mr}(k) \left(\frac{1}{2}\right)^{m}
= \sum_{m=1}^{\infty} \sum_{k=\lceil mr/2\rceil}^{mr} \binom{mr}{k} P_f^{k} \left(1 - P_f\right)^{mr-k} \left(\frac{1}{2}\right)^{m}. \qquad (2)
$$

In Figure 7, we show the reduction in the probability of false alarms after using repetition encoding with r = 3, 5, 7. Note that, for the typical range of Pf (between 0.1 and 0.2), the range of PFA is between 0.01 and 0.05. This range of false alarms is quite adequate for the algorithms described in [9, 10] to work efficiently with a high code rate, for example, 2/3. These algorithms are based on novel decoding techniques for common convolutional codes.
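A small numeric check of (2), assuming the outer sum may be truncated because of the (1/2)^m weighting; false_alarm_prob is an illustrative name, not from the paper.

```python
from math import comb, ceil

def false_alarm_prob(pf, r, max_m=60):
    """Evaluate equation (2): the residual false-alarm probability after
    repetition coding with factor r, truncating the outer sum at max_m
    terms (the (1/2)^m weight decays quickly)."""
    total = 0.0
    for m in range(1, max_m + 1):
        n = m * r
        inner = sum(comb(n, k) * pf**k * (1.0 - pf)**(n - k)
                    for k in range(ceil(n / 2), n + 1))
        total += inner * 0.5**m
    return total

# Example: false_alarm_prob(0.15, 5) yields a value that falls within the
# 0.01-0.05 range quoted above for the typical Pf of 0.1-0.2.
```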

The overall encoder system consists of a front end of a convolutional encoder followed by the repetition encoder, which simply repeats each bit r times. At the decoder side, the repetition decoder (with majority decision rule) is applied to the extracted data; then the convolutional decoder is applied to take care of the residual false alarms. The overall system is shown in Figure 8.

Figure 7: Coding gain after repetition (output versus input false-alarm probability for r = 3, 5, 7).

Figure 8: Overall encoding/decoding structure. (a) Encoder: the input data passes through a convolutional encoder and a repetition encoder before bit embedding into the input audio. (b) Decoder: bit extraction, repetition decoder, and convolutional decoder produce the output data.

5. EXPERIMENTAL RESULTS

The algorithm was applied to a set of 13 audio signals. The lengths of the sequences were around 11 seconds. The test signals include speech, single-instrument music (piano, flute, and violin), and composite music. All test signals are mono with a sampling rate of 44.1 kHz. In all the experiments, we use the Daubechies db5 wavelet for orthogonal decomposition and the derivative of the cubic spline wavelet [11] for nonorthogonal decomposition.

The number of levels between two successive extrema is chosen to be an odd number so that the middle segment is usually symmetric around zero. Therefore, the largest modification, which takes place in the middle segment, is in the lowest-amplitude region. The typical choice is three or five levels. The larger the number of levels, the better the error performance. However, the output quality (although still high in all cases) is higher with a lower number of levels because no large changes occur in this case. It was found that the choice of three levels and an interval length of 40 samples gives the best compromise between quality and robustness. This parameter setting is used in all the following tests.

Table 2: Performance versus signal processing operations.

Operation              Insertions   Deletions   Errors
mp3 compression        0.039        0.0065      0.018
LPF (4 kHz)            0.019        0.005       0.003
Adding noise (36 dB)   0.001        0.001       0
Resampling to 48 kHz   0.002        0.013       0

(i) Embedding rate. The median embedding rate of the uncoded data is around 25 bps. However, after coding, the effective embedding rate becomes 5 bps. The embedding rate is very large for single instruments, where pure sinusoids with low frequencies are dominant. The embedding rates of the algorithm depend heavily on the signal nature. If the signal contains long intervals of low frequencies, then the embedding rate increases significantly. It can be as high as 80 bps for the above parameter setting.

(ii) Noiseless channel. The algorithm described in Sections 2.3 and 2.4 works perfectly with all sequences. However, sometimes, especially with speech signals, it needs an excessive number of iterations at the encoder to converge.

(iii) Quality. The quality of the output signal is very high, and for a nonprofessional listener, it is very hard to distinguish between the original and the modified signals. However, when the algorithm was tested with speech signals, the results were not satisfactory.

(iv) Time shift and cropping. The proposed algorithm is automatically robust to time shift and cropping. However, for time cropping, some bits may be missed if a modified interval is cropped. This is unlikely to occur because the intervals used in embedding are usually active audio intervals; if such intervals are cropped, the audio content itself is affected. Moreover, with the repetition code, deletions can be compensated, although only random deletions. To randomize the effect of time cropping, a bit interleaver may be used prior to repetition encoding.

(v) mp3 compression. We tested the performance of the system against mp3 compression at 112 kbps (compression ratio 6.3 : 1). The average rates are shown in Table 2. These rates are well suited for the algorithm described in [10] to work efficiently. However, at lower compression bit rates, the insertion rate tends to increase significantly.

(vi) Lowpass filtering. Due to the lowpass component of the approximation signal, the algorithm is robust to lowpass filtering. The typical rates are shown in Table 2.

(vii) Time-scale modification. This is the most powerful feature of the proposed algorithm. It is automatically robust to TSM up to a quantization error factor. This means that false alarms (or missed bits) may appear because of the rounding of the thresholds. Consider, for example, a threshold of 40 before TSM and a time-scale factor of 0.96; then the new length becomes 38.4, and we have two choices for the threshold length (which should be an integer), either 38 or 39. The smaller choice may result in false alarms, while the larger one may cause missed bits. In Table 3, we show the performance of the algorithm versus different time-scale modification factors. In this table, the new threshold length is the old threshold length multiplied by the time-scale factor and rounded.

Table 3: Performance versus TSM.

Factor   Insertions   Deletions   Errors
0.96     0.068        0           0.018
0.98     0.044        0           0.011
1.02     0.023        0.006       0.006
1.04     0.004        0.012       0

(viii) It should be mentioned that, in the above results, we assumed a fixed time-scale factor. The algorithm can be made robust to time-varying TSM if the threshold on the interval lengths is adaptively updated. From Table 3, it is noticed that either insertions or deletions are dominant at different scale factors. This depends on the rounding: if it is toward the smaller integer, then insertions will be more frequent, and vice versa (the threshold scaling is sketched below). Note that the algorithm is also automatically robust to resampling by any factor. In Table 2, we show the performance against resampling to 48 kHz. It should be mentioned that, for dyadic resampling or upsampling, we may need to reduce the number of decomposition levels at the decoder to match the levels before resampling.
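A minimal sketch of the threshold rescaling discussed in items (vii) and (viii), assuming the time-scale factor is known or estimated; the function and mode names are illustrative.

```python
import math

def scaled_threshold(threshold_len, tsm_factor, mode="round"):
    """Rescale the decoder's threshold length after time-scale modification.

    threshold_len: integer threshold length used at the encoder (samples)
    tsm_factor:    assumed or estimated time-scale factor
    mode:          how to map the non-integer result back to an integer;
                   "floor" favors insertions (false alarms), "ceil" favors
                   deletions (missed bits), "round" is the compromise used
                   for the results in Table 3.
    """
    scaled = threshold_len * tsm_factor
    if mode == "floor":
        return math.floor(scaled)
    if mode == "ceil":
        return math.ceil(scaled)
    return round(scaled)

# Example: a threshold of 40 samples under a 0.96 time-scale factor gives
# 38.4, so the decoder must choose between 38 and 39.
# scaled_threshold(40, 0.96)  ->  38
```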

The robustness of the proposed algorithm against mp3 compression and other signal processing operations is comparable to the results reported in recent audio spread-spectrum watermarking works (e.g., [12, 13]) and projection-based watermarking schemes (e.g., [14]), where the bit error rate is between 0.001 and 0.03. However, TSM and synchronization attacks have not been studied for most audio watermarking algorithms proposed in the literature because such attacks cannot be compensated within the traditional frameworks. Robustness to these attacks is the main strength of the proposed algorithm.

6. CONCLUSION

In this work, we propose a novel algorithm for embedding data in audio by changing the interval lengths of certain segments of the audio signal. The algorithm is invariant under TSM, time shift, and time cropping. We proposed a set of encoding and decoding techniques to survive common mp3 compression.

The embedding rate of the algorithm is above 20 bps. However, as discussed, for practical reasons repetition coding is used and the effective embedding rate is 4–8 bps. The quality of the output is very high, and it is indistinguishable from the original signal.


The proposed technique is suitable for applications like broadcast monitoring, where the embedded data are information relevant to the host signal and are used for several purposes, for example, tracking the use of the signal, providing statistical data collection, and analyzing the broadcast content.

REFERENCES

[1] P. Bassia, I. Pitas, and N. Nikolaidis, "Robust audio watermarking in the time domain," IEEE Trans. Multimedia, vol. 3, no. 2, pp. 232–241, 2001.

[2] W. Bender, D. Gruhl, N. Morimoto, and A. Lu, "Techniques for data hiding," IBM Systems Journal, vol. 35, no. 3-4, pp. 313–336, 1996.

[3] J. F. Tilki and A. Beex, "Encoding a hidden digital signature onto an audio signal using psychoacoustic masking," in Proc. 7th International Conference on Signal Processing Applications and Technology, pp. 476–480, Boston, Mass, USA, 1996.

[4] S. K. Lee and Y. S. Ho, "Digital audio watermarking in the cepstrum domain," in Proc. IEEE International Conference on Consumer Electronics, pp. 334–335, Los Angeles, Calif, USA, 2000.

[5] D. Kirovski and H. Malvar, "Robust spread-spectrum audio watermarking," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, vol. 3, pp. 1345–1348, Salt Lake City, Utah, USA, 2001.

[6] M. D. Swanson, B. Zhu, and A. H. Tewfik, "Data hiding for video-in-video," in Proc. IEEE International Conference on Image Processing, vol. 2, pp. 676–679, Washington, DC, USA, 1997.

[7] M. F. Mansour and A. H. Tewfik, "Audio watermarking by time-scale modification," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, vol. 3, pp. 1353–1356, Salt Lake City, Utah, USA, 2001.

[8] M. F. Mansour and A. H. Tewfik, "Time-scale invariant audio data embedding," in Proc. IEEE International Conference on Multimedia and Expo, Tokyo, Japan, 2001.

[9] M. F. Mansour and A. H. Tewfik, "Efficient decoding of watermarking schemes in the presence of false alarms," in Proc. IEEE 4th Workshop on Multimedia Signal Processing, pp. 523–528, Cannes, France, 2001.

[10] M. F. Mansour and A. H. Tewfik, "Convolutional decoding for channels with false alarms," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, vol. 3, pp. 2501–2504, Orlando, Fla, USA, 2002.

[11] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, Boston, Mass, USA, 2nd edition, 1999.

[12] J. W. Seok and J. W. Hong, "Audio watermarking for copyright protection of digital audio data," Electronics Letters, vol. 37, no. 1, pp. 60–61, 2001.

[13] G. C. M. Silvestre, N. J. Hurley, G. S. Hanau, and W. J. Dowling, "Informed audio watermarking scheme using digital chaotic signals," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, vol. 3, pp. 1361–1364, Salt Lake City, Utah, USA, 2001.

[14] M. D. Swanson, B. Zhu, and A. H. Tewfik, "Current state of the art, challenges and future directions for audio watermarking," in Proc. IEEE International Conference on Multimedia Computing and Systems, vol. 1, pp. 19–24, Florence, Italy, 1999.

Mohamed F. Mansour was born in Cairo, Egypt, in 1973. He received his B.S. and M.S. degrees from Cairo University, Cairo, Egypt, in 1995 and 1998, respectively, and his Ph.D. degree from the University of Minnesota, Minneapolis, Minn, in 2003, all in electrical engineering. During the period 1999–2003, he was with the Department of Electrical and Computer Engineering, University of Minnesota, as a Research and Teaching Assistant. In 2003, he joined the DSPS R&D Center at Texas Instruments Inc., Dallas, Tex, as a Member of Technical Staff. His current research interests are in real-time signal processing, adaptive filtering, and optimization.

Ahmed H. Tewfik received his B.S. degree from Cairo University, Cairo, Egypt, in 1982 and his M.S., E.E., and S.D. degrees from the Massachusetts Institute of Technology, Cambridge, MA, in 1984, 1985, and 1987, respectively. Dr. Tewfik worked at Alphatech, Inc., Burlington, MA, in 1987. He is the E. F. Johnson Professor of Electronic Communications with the Department of Electrical Engineering at the University of Minnesota. He served as a consultant to MTS Systems, Inc., Eden Prairie, MN, and Rosemount, Inc., Eden Prairie, MN, and worked with Texas Instruments and Computing Devices International. From August 1997 to August 2001, he was the President and CEO of Cognicity, Inc., an entertainment marketing software tools publisher that he co-founded. Dr. Tewfik is a Fellow of the IEEE. He was a Distinguished Lecturer of the IEEE Signal Processing Society in 1997–1999. He received the IEEE Third Millennium Award in 2000.


EURASIP Journal on Applied Signal Processing 2003:10, 1001–1015
© 2003 Hindawi Publishing Corporation

Watermarking-Based Digital Audio Data Authentication

Martin Steinebach
Fraunhofer Institute IPSI, MERIT, C4M Competence for Media Security, D-64293 Darmstadt, Germany
Email: [email protected]

Jana Dittmann
Platanista GmbH and Otto-von-Guericke University Magdeburg, 39106 Magdeburg, Germany
Email: [email protected]

Received 11 July 2002 and in revised form 4 January 2003

Digital watermarking has become an accepted technology for enabling multimedia protection schemes. While most efforts concentrate on user authentication, interest in data authentication to ensure data integrity has recently been increasing. Existing concepts address mainly image data. Depending on the necessary security level and the sensitivity to detect changes in the media, we differentiate between fragile, semifragile, and content-fragile watermarking approaches for media authentication. Furthermore, invertible watermarking schemes exist in which each bit change can be recognized by the watermark, the watermark can be extracted, and the original data can be reproduced for high-security applications. The latter approaches can be extended with cryptographic approaches like digital signatures. As we see from the literature, only few audio approaches exist, and the audio domain requires additional strategies for time-flow protection and resynchronization. To allow different security levels, we have to identify relevant audio features that can be used to determine content manipulations. Furthermore, in the field of invertible schemes, there are numerous publications for image and video data but no approaches for digital audio to ensure data authentication for high-security applications. In this paper, we introduce and evaluate two watermarking algorithms for digital audio data, addressing content integrity protection. In our first approach, we discuss possible features for a content-fragile watermarking scheme to allow several postproduction modifications. The second approach is designed for high-security applications to detect each bit change and reconstruct the original audio by introducing an invertible audio watermarking concept. Based on the invertible audio scheme, we combine digital signature schemes and digital watermarking to provide publicly verifiable data authentication and a reproduction of the original, protected with a secret key.

Keywords and phrases: multimedia security, manipulation recognition, content-fragile watermarking, invertible watermarking, digital signature, original protection.

1. INTRODUCTION

Multimedia data manipulation has become more and more simple and undetectable by the human auditory and visual systems due to technology advances in recent years. While this enables numerous new applications and generally makes it convenient to work with image, audio, or video data, a certain loss of trust in media data can be observed. As we see in Figure 1, small changes in the audio stream can change the meaning of a whole sentence.

Regarding security, particularly in the field of multimedia, the requirements on security increase. The possibility and the way of applying security mechanisms to multimedia data and their applications need to be analyzed for each purpose separately. This is mainly due to the structure and complexity of multimedia data, see, for example, [1].

Figure 1: Digital audio data is easily manipulated ("I am not guilty" versus "I am guilty").

The security requirements such as integrity (unauthorized modification of data) or data authentication (detection of origin and data alterations) can be met by the following security measures using cryptographic mechanisms and digital watermarking techniques [1]. Digital watermarking techniques based on steganographic systems embed information directly into the media data. Besides cryptographic mechanisms, watermarking represents an efficient technology to ensure both data integrity and data origin authenticity. Copyright, customer, or integrity information can be embedded, using a secret key, into the media data as transparent patterns. Based on the application areas for digital watermarking known today, the following five watermarking classes are defined: authentication watermarks, fingerprint watermarks, copy control watermarks, annotation watermarks, and integrity watermarks. The most important properties of digital watermarking techniques are robustness, security, imperceptibility/transparency, complexity, capacity, and possibility of verification and invertibility, see, for example, [2].

Robustness describes whether the watermark can be reliably detected after media operations. It is important to note that robustness does not include attacks on the embedding schemes that are based on the knowledge of the embedding algorithm or on the availability of the detector function. Robustness means resistance to "blind," nontargeted modifications or common media operations. For example, the Stirmark tool [3] attacks the robustness of watermarking algorithms with geometrical distortions. For manipulation recognition, the watermark has to be fragile to detect altered media.

Security describes whether the embedded watermark information can resist targeted attacks, that is, whether it cannot be removed beyond reliable detection by attacks based on full knowledge of the embedding and detection algorithms and possession of at least one watermarked data item. Only the applied secret key remains unknown to the attacker. The concept of security includes procedural attacks or attacks based on partial knowledge of the carrier modifications due to message embedding. The security aspect also includes the false-positive detection rates.

Transparency relates to the properties of the human sensory system. A transparent watermark causes no perceptible artifacts or quality loss.

Complexity describes the effort and time needed to embed and retrieve a watermark. This parameter is essential for real-time applications. Another aspect addresses whether the original data is required in the retrieval process or not. We distinguish between nonblind and blind watermarking schemes; the latter require no original copy for detection.

Capacity describes how many information bits can be embedded into the cover data. It also addresses the possibility of embedding multiple watermarks in one document in parallel.

The verification procedure distinguishes between private verification, similar to symmetric cryptography, and public verification, as in asymmetric cryptography. Furthermore, during verification, we differentiate between invertible and noninvertible techniques, where the former allows the reproduction of the original and the latter provides no possibility to extract the watermark without alterations of the original.

The optimization of these parameters is mutually competitive and cannot be achieved for all of them at the same time. If we want to embed a large message, we cannot require strong robustness simultaneously. A reasonable compromise is always a necessity. On the other hand, if robustness to strong distortions is an issue, the message that can be reliably hidden must not be too long.

Therefore, we find different kinds of optimized watermarking algorithms. The robust watermarking methods for owner, copyright holder, or customer identification are usually unable to detect manipulations in the cover media, and their design is completely different from that of fragile watermarks. When dealing with fragile watermarks, different aspects of manipulation have to be taken into account.

A fragile watermark is a mark that is easily altered or destroyed when the host data is modified through a linear or nonlinear transformation. The sensitivity of fragile watermarks to modification leads to their use in media authentication. Today we find several fragile watermarking techniques to recognize manipulations. For images, Lin and Delp [4] summarize the features of fragile schemes and their possible attacks. Fridrich [5] gives an overview of existing image techniques. In general, we can classify the techniques as ones which work directly in the spatial domain or in the transform (DCT, wavelet) domains. Furthermore, Fridrich classifies fragile (very sensitive to alterations), semifragile (less sensitive to alterations), and visual-fragile (sensitive to visual alterations) watermarks (which we generalize here into content-fragile watermarks), as well as self-embedding watermarking as a means for detecting both malicious and inadvertent changes to digital imagery.

Altogether, we see that the watermarking community, in favor of robust techniques, has neglected fragile watermarking for audio data. There are only few approaches and many open research problems that need to be addressed in fragile watermarking, for example, the sensitivity to modifications [6]. The syntax (bit stream) of multimedia data can be manipulated without influencing its semantics, as is the case with scaling, compression, or transmission errors. Thus it is more important to protect the semantics of the data instead of their syntax to vouch for their integrity. Therefore, content-based watermarks [7] can be used to verify illegal manipulations and to allow several content-preserving operations. The main research challenge is thus to differentiate between content-preserving and content-changing manipulations. Most existing techniques use threshold-based decisions on content integrity. The main problem is to face the wide variety of allowed content-preserving operations. As we see in the literature, most algorithms address the problem of compression. But very often, scaling, format conversion, or filtering are also allowed transformations.

Furthermore, for high-security applications, we have the requirement to detect each bit change in an audio track and to extract the watermark embedded as additional noise. Invertible schemes face this problem and have been introduced for image and video data in recent publications [8]. To ensure public verification, these approaches have been combined with digital signatures by Dittmann et al. [9]. As we see from the literature, there are no approaches for an invertible audio watermarking scheme.

Our contribution focuses mainly on the design of a content-fragile audio watermarking scheme that allows several postproduction processes and on the design of an invertible watermarking scheme combined with digital signatures for high-security applications. We introduce two watermarking algorithms: our first approach is a content-fragile watermarking scheme combining fragile feature extraction and robust audio watermarking; the second approach is designed to detect each bit change and reconstruct the original audio, where we combine digital signature schemes and digital watermarking to provide publicly verifiable data authentication and a reproduction of the original protected with a secret key.

In the following subsections, we first review the state of the art of basic concepts for audio data authentication; second, we describe the general approaches for content-fragile and invertible schemes as the basis for our conceptual design in Sections 2 and 3. In Section 4, we show example applications, and we summarize our work in Section 5.

1.1. Digital audio watermarking parameters and general methods for data authentication

There are numerous algorithms for audio watermarking; for a selection, see [10, 11, 12, 13, 14, 15, 16]. Most of them are designed as copyright protection mechanisms, and therefore robustness, security, capacity, and transparency are the most important design issues, while in many approaches, complexity and possible verification methods come second.

In the case of fragile watermarking for data authentication, the importance of the parameters changes. Fragility and security, with a moderate transparency, are most important. Depending on what kind of fragility we expect (remember that we differentiate between fragile, semifragile, content-fragile, self-embedding, and invertible schemes), a high payload of the watermarking algorithm is necessary to embed sufficient data to verify the integrity. Security is important as the whole idea of fragile watermarking is to provide integrity security, and weak watermarking security would mean a weak overall system, as the embedded information could be forged. Using cryptography during embedding can further increase security; for example, asymmetric systems could be used to ensure the authenticity of the embedded content descriptions. Robustness is not as important as security: if, due to media manipulations, a certain loss of quality is reached and the content is changed or is no longer recognizable, the watermark may be destroyed. Depending on the application, transparency can be less important, as content protected by this scheme is usually not used for entertainment with high-end quality demands. Complexity can become relevant if the system is to work in real time, which is the case if it is applied directly in recording equipment like cameras.

Fragile watermarking can also be applied to audio data. If the algorithm is fragile against an attack, the watermark cannot be retrieved afterwards. Therefore, not being able to detect a watermark in a file that is assumed to be marked identifies a manipulation.

Content-fragile watermarks discriminate between content-preserving and content-manipulating operations. In the literature, we find only few approaches for audio authentication watermarks. In [17], the focus of audio content security has been on speech protection. Wu and Kuo describe two methods for speech content authentication. The first one is based on content feature extraction integrated in CELP speech coders. Here, content-relevant information is extracted, encrypted, and attached as header information. The second one embeds fragile watermarks in content-relevant frequency domains. They stress the fact that common hash functions are not suited for speech protection because of the unavoidable but content-preserving addition of noise during transmission and format changes. Feature extraction and watermarking are both regarded as a more robust alternative to such hash functions. Wu and Kuo provide experimental results regarding false alarms and come to the conclusion that discrimination between weak content-preserving operations and content manipulations is possible with both methods. This is similar to our results provided in Section 2.

Dittmann et al. [18] introduce a content-fragile watermarking concept for multimedia data authentication, especially for a/v data. While previous data authentication watermarking schemes address a single media stream only, the paper discusses the requirements of multimedia protection techniques, and the authors introduce a new approach called the 3D thumbnail cube. The main idea is based on a 3D hologram over continuing video and audio frames to verify the integrity of the a/v stream.

1.2. Feature-based authentication concept: content-fragile watermarking

As introduced, the concept of a content-fragile watermark combines a robust watermark and a content abstraction from a feature extraction function for integrity verification. During verification, the embedded content features are compared with the actual content, similar to hash functions in cryptography. If changes are detected, that is, content and watermark differ, a warning message is prompted. The idea of content-fragile watermarking is based on the knowledge that we have to handle content-preserving operations, that is, manipulations that do not manipulate the content.

Two different approaches to content embedding strategies can be recognized: direct embedding and seed-based embedding. With the first approach, a complete feature-based content description is embedded in the cover signal (original). The second approach uses the content description to generate information packages of smaller size based on the extracted features.

Direct embedding. In direct embedding, the extracted features are embedded bit by bit into the corresponding media data. The feature description has to be coded as a bit vector to be embedded in this way. The methods of embedding differ for every watermarking algorithm. What they have in common is that the feature vector is the embedded watermark information. The problem with direct embedding is the payload of the watermarking technology: to embed a complete and sufficiently exact content description, very high bit rates would be necessary, which most watermarking algorithms cannot provide.

Seed-based approach. Features are used to achieve robustness against allowed media manipulations while still being able to detect content manipulations. The amount of data for the describing features is much less than that of the described media. But usually, even this reduced data cannot be embedded into the media as a watermark; the maximum payload of today's watermarking algorithms is still too small. Therefore, to embed some content description, we have to use summaries or very global features, like the root mean square (RMS) of one second of audio. This leads to security problems: if we only have information about a complete second, parts smaller than a second could be changed or removed without being noticed. A possible solution is to use a seed-based approach. Here, we use the extracted features as an addition to the embedding key. The embedding process of the watermark now depends on the secret key and the extracted features. The idea is that only if the features have not been changed can the watermark be extracted correctly. If the features are changed, the retrieval process cannot be initialized to read the watermark.
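A minimal sketch of the seed-based idea, assuming quantized feature values are already available; derive_embedding_seed is an illustrative name, and SHA-256 is only one possible way to mix the secret key with the features, not the method prescribed here.

```python
import hashlib
import struct

def derive_embedding_seed(secret_key: bytes, quantized_features):
    """Derive a seed for the watermark embedding process from the secret
    key and the quantized content features.

    If any feature changes, the derived seed changes, and the retrieval
    process can no longer be initialized correctly.
    """
    h = hashlib.sha256()
    h.update(secret_key)
    for value in quantized_features:
        h.update(struct.pack(">I", value))  # fixed-width encoding of each feature
    return int.from_bytes(h.digest()[:8], "big")

# Hypothetical usage:
# seed = derive_embedding_seed(b"my-secret-key", [12, 3, 7, 7, 9])
```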

In Section 2, we introduce a content-fragile audio watermarking algorithm based on the direct embedding strategy.

Remark 1. There are also simpler concepts of audio data authentication, which we do not address here, as they include no direct connection with the content. For example, embedding a continuous time code is a way to recognize cut-out attacks. The retrieved time code will show gaps at the corresponding positions if a sufficiently small step size has been chosen.

1.3. Invertible concept

The approach in [19] introduced the first two invertible watermarking methods for digital image data. While virtually all previous authentication watermarking schemes introduced some small amount of noninvertible distortion in the data, the new methods are invertible in the sense that, if the data is deemed authentic, the distortion due to authentication can be completely removed to obtain the original data. Their first technique is based on lossless compression of biased bit streams derived from the quantized JPEG coefficients. The second technique modifies the quantization matrix to enable lossless embedding of one bit per DCT coefficient. Both techniques are fast and can be used for general distortion-free (invertible) data embedding. The two methods provide new information assurance tools for integrity protection of sensitive imagery, such as medical images or high-importance military images viewed under nonstandard conditions when the usual criteria for visibility do not apply. Further improvements in [8] generalize the scheme to compressed image and video data.

In [9], an invertible watermarking scheme is combined with a digital signature to provide publicly verifiable integrity. Furthermore, the original data can only be reproduced with a secret key. The concept uses the general idea of selecting public-key-dependent watermarking positions (here, e.g., the blue-channel bits) and losslessly compressing the original data at these positions to produce space for invertible watermark data embedding. In the retrieval, the watermarking positions are selected again, the watermark is retrieved, and the compressed part is decompressed and written back to recover the original data. The scheme is highly fragile, and the original can only be reproduced if there was no change. The integrity of the whole data is ensured with two hash functions: the first is built over the remaining image and the second over the marked data at the watermarking positions by using a message authentication code (HMAC). The authenticity is granted by the use of an RSA digital signature. The reproduction by authorized persons is granted by a symmetric key scheme, AES. The protocol for image data from [9] can be written as follows:

$$
\begin{aligned}
I_W ={}& I_{\mathrm{remaining}} \,\|\, W \,\|\, \mathrm{Data}_{\mathrm{info}} \,//\, \mathrm{Data}_{\mathrm{fill}},\\
W ={}& E_{\mathrm{AES}}\bigl(E_{\mathrm{AES}}\bigl(C_{\mathrm{blueBits}},\, k_{H(I_{\mathrm{remaining}})}\bigr),\, K_{\mathrm{secret}}\bigr)\\
&//\ \mathrm{HMAC}\bigl(\mathrm{selected}_{\mathrm{blueBits}},\, K_{\mathrm{secret}}\bigr)\\
&//\ \mathrm{RSA}_{\mathrm{signature}}\Bigl(H\bigl(I_{\mathrm{remaining}} \,//\, E_{\mathrm{AES}}\bigl(E_{\mathrm{AES}}\bigl(C_{\mathrm{blueBits}},\, k_{H(I_{\mathrm{remaining}})}\bigr),\, K_{\mathrm{secret}}\bigr) \,//\, \mathrm{HMAC}\bigl(\mathrm{selected}_{\mathrm{blueBits}},\, K_{\mathrm{secret}}\bigr)\bigr),\, K_{\mathrm{private}}\Bigr).
\end{aligned}
$$

Figure 2: Content-fragile data authentication scheme.

The watermarked image data I_W contains the remaining nonwatermarked image bits I_remaining and the image data at the watermarking bit positions derived from the public key, where the watermark is placed. The watermark data W itself contains the compressed original data C of the marking-position bits, which is encrypted with the function E by AES using an encryption key k_H(I_remaining) derived from the hash value of the remaining image to verify the integrity. As invertibility protection, [9] uses an additional AES encryption E of the first encryption with the secret key parameter K_secret known only to authorized persons. To ensure the integrity of the original compressed data at the marking positions, [9] also uses an HMAC function initialized with the secret key. To enable public verification, the authors add an additional RSA signature, initialized with the private key, which is built over the hash value of the remaining image, the twice-encrypted compressed data, and the HMAC value. For synchronization in the retrieval, information about the selected watermarking positions and the used compression function is added, as well as padding bits. To verify the integrity and authenticity of the data, the user can use the public key to retrieve the watermark information and verify the RSA signature with the public key. For original reproduction, the secret key is necessary to decrypt the compressed data. With the HMAC function, the authenticity and integrity of the decrypted original data can be ensured. The general scheme can be described as dividing the digital document into two sets A and B. The set A is kept unchanged. The set B serves as a cover for watermark embedding, where B is compressed to C to produce room for embedding the digital signature S. To ensure that C belongs to A, we encrypt C with a content-dependent key derived from A, and to restrict reproduction of the original C, it is encrypted again with a secret key. The digital signature S is built over A and the twice-encrypted C as well as the message authentication code to ensure correct reproduction of C.
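A highly simplified sketch of the A/B construction just described, intended only to illustrate the data flow; aes_encrypt and rsa_sign below are hypothetical placeholders (the scheme in [9] uses real AES, HMAC, and RSA primitives), so this is not a secure implementation.

```python
import hashlib
import hmac
import zlib

def aes_encrypt(data: bytes, key: bytes) -> bytes:
    # Placeholder only: stands in for the AES encryption used in [9].
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def rsa_sign(message: bytes, private_key: bytes) -> bytes:
    # Placeholder only: stands in for the RSA signature used in [9].
    return hmac.new(private_key, message, hashlib.sha256).digest()

def build_watermark(part_a: bytes, part_b: bytes, k_secret: bytes, k_private: bytes) -> bytes:
    """Data flow of the invertible construction: A stays unchanged, B is
    compressed to make room, and the freed space carries the payload W."""
    c = zlib.compress(part_b)                                  # C: losslessly compressed B
    k_content = hashlib.sha256(part_a).digest()                # content-dependent key from A
    c_enc = aes_encrypt(aes_encrypt(c, k_content), k_secret)   # twice-encrypted C
    mac = hmac.new(k_secret, part_b, hashlib.sha256).digest()  # integrity of the covered bits
    signature = rsa_sign(part_a + c_enc + mac, k_private)      # public verifiability
    return c_enc + mac + signature                             # watermark W embedded in place of B
```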

In our paper, we adopt the scheme of [9] for digital audio data and introduce a new invertible audio watermark, see Section 3.

2. CONTENT-FRAGILE AUDIO WATERMARKING

In this section, we introduce our approach to content-fragile audio watermarking based on the concepts introduced in Section 1.2. We address suitable features of audio data, introduce an algorithm, and provide test results.

2.1. Content-fragile authentication concept

Figure 2 illustrates the general content-fragile audio watermarking concept: from an audio file, a feature vector (FV) is extracted and may be encrypted. This information is embedded as a watermark. The audio file is then transmitted via a noisy channel. At some time, the content has to be verified. Now the watermark (WM) is extracted, and the embedded and decrypted FV is compared to a newly generated FV. If a certain level of difference is reached, integrity cannot be verified. A PKI may be helpful to handle key management.

Remember, fragility here means losing the equality of extracted and embedded content, with the challenge of handling content-preserving operations, that is, manipulations that do not manipulate the content. The well-known problem of "friendly attacks" occurs here as in any watermarking scheme: some signal manipulations must be allowed without breaking the watermark. In our case, every editing process that does not change the content itself is a friendly attack. Compression, dynamics, A/D-D/A conversion, and many other operations that only change the signal but not the content described by the signal should not be detected. The idea is to use content information as an indicator of manipulations. The main challenge is to identify audio features appropriate to distinguish between content-preserving and content-changing manipulations.

Figure 3 shows the verification process of our content-fragile watermarking approach. We divide the audio file into frames of n samples. From these n samples, the feature checksums and the embedded watermark are retrieved and compared in the integrity check. As audio files are often cut, a resynchronization function is necessary to find the correct starting point of the watermark corresponding to the features. Our watermarking algorithm is robust against cropping attacks, but cutting out samples can lead to significant differences between the extracted watermark and the retrieved features. Therefore, a sync compare function tries to resynchronize both (features and watermark) if the integrity check is negative. Only if this is not successful is an integrity error reported.

2.2. Digital audio features

Extracted audio features are used to achieve robustness against allowed media manipulations while still being able to detect content manipulations. We want to ignore content-preserving operations, which would lead to false alarms in cryptographic solutions, and only identify real changes in the content. Additionally, we need to produce a binary representation of the audio content that is small enough to be embedded as a watermark and detailed enough to identify changes.

To produce a robust description of sound data, we have to examine which features of sound data can be extracted and described. Research has addressed this topic in psychoacoustics, for example, [20], and in automated scene detection for videos, as in [20, 21]. We use the RMS, the zero-crossing rate (ZCR), and the spectrum of the data as follows.

(i) RMS provides information about the energy of a number of samples of an audio file. It can be interpreted as loudness. If we can embed RMS information in a file and compare it after some attack, we can recognize muted parts or changes in the sequence (see Figure 4).

Figure 3: Content-fragile watermarking-based integrity decision.

Figure 4: RMS curve of a speech sample.

(ii) ZCR provides information about the amount of high frequencies in a window of sound data. It is calculated by counting how often the sign of the samples changes, and it describes the brightness of the sound data. Parts with small volume often have a high ZCR as they consist of noise or are similar to it (see Figure 5; a short extraction sketch for RMS and ZCR follows this list).

(iii) The transformation from the time domain to the frequency domain provides the spectrum information of the audio data (see Figure 6). Pitch information can be retrieved from the spectrum. The amount of spectral information is similar to that of the original sample data. Therefore, concepts for data reduction, like combining frequencies into subbands or quantization, are necessary.
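A minimal sketch of per-window RMS and ZCR extraction, assuming a mono sample array; frame_features is an illustrative name and the framing parameters are not those of the paper.

```python
import numpy as np

def frame_features(samples, frame_len):
    """Compute per-frame RMS and zero-crossing rate for a mono signal.

    samples:   1-D array of audio samples (e.g., PCM scaled to [-1, 1])
    frame_len: number of samples per analysis window
    Returns a list of (rms, zcr) pairs, one per full frame.
    """
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = np.asarray(samples[start:start + frame_len], dtype=float)
        rms = float(np.sqrt(np.mean(frame ** 2)))        # energy / loudness
        signs = np.sign(frame)
        zcr = float(np.mean(signs[1:] != signs[:-1]))    # rate of sign changes
        features.append((rms, zcr))
    return features
```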

To protect the semantic integrity of audio data, usually only a part of its full spectrum is required. For our approach, we choose a range similar to the frequency band transmitted by analogue telephones, from 500 Hz to 4000 Hz. Thereby, all information needed to detect changes in the content of spoken language is kept while other frequencies are ignored, and the amount of data for the describing features is much less than that of the described audio. But even the amount of the thereby reduced data is too large for embedding; the maximum payload of today's watermarking algorithms is still too small. Therefore, to directly embed content descriptions, we have to use summaries of features or very global features, like the RMS of one second of audio. This leads to security problems. As we only have information about a complete second, parts smaller than a second could be changed or removed without the possibility of localization. One cannot trust the complete second regardless of the amount and position of change. It will also be a major challenge to disable possible specialized attacks trying to keep the overall feature the same while making small but content-manipulating changes.

Figure 5: ZCR curve of a speech sample.

Figure 6: Spectrum of eight seconds of speech (0–20 kHz over time).

Table 1: Required bit rates for feature embedding.

FFT size (samples)   Features (#)   Detail (bit/feature)   Sync bits (bit)   Bit rate (bit/s)
1,024                4              8                      4                 6,201.56
2,048                4              8                      4                 3,100.78
4,096                4              8                      4                 1,550.39
10,240               4              4                      4                 344.53
51,200               4              4                      4                 68.91
81,920               4              4                      4                 43.07

Figure 7: Feature checksums reduce the amount of embedded data.

Table 1 shows a calculation of the theoretically required watermarking bit rates. Here we extract four features (e.g., ZCR, RMS, and two frequency bands) and encode them with 8 or 4 bits. Quantization of the feature values is necessary to use a small number of bits. It also increases the feature robustness: fewer distinct values are more robust against small changes, since quantization will map both the original feature and a slightly modified feature to the same quantized value. We use quantization steps from 0.9 to 0.01. These are incremental values stepping from 0 to 1. If 0.9 is used, only one step is present, and basically no information regarding the feature is provided. With quantizer 0.01, 100 steps from 0 to 1 are made, and the algorithm can differentiate between 100 values for the feature representation.

Additionally, sync bits are required for resynchronization. This leads to very high bit rates at small FFT window sizes. Using big windows and low resolution reduces the required bit rates to about 43 bps. We could embed a content description about 5 times per second. But as 43 bps is still a rather high payload for current audio watermarking, robustness and transparency are not satisfactory. This leads to high error rates at retrieval and therefore to high false error rates. Our prototypic audio watermarking algorithm offers a bit rate of up to 30 bps if no strong attacks are to be expected, which would be the case in manipulation recognition scenarios. But with this average-to-high bit rate, compared to other algorithms available today, not only does robustness decrease but error rates also increase. Very robust watermarking algorithms today offer about 10 bps down to 1 bps.

Table 2: Feature checksums based on different algorithms.

Window size (s)   Window size (samples)   Key size (bit)   Sync bits (bit)   Type      Bit rate (bit/s)
15.3252           675,840                 160              4                 SHA       10.7
12.3066           542,720                 128              4                 MD5/MAC   10.7
3.3669            148,480                 32               4                 CRC32     10.7
1.8576            81,920                  16               4                 CRC16     10.8
1.1146            49,152                  8                4                 XOR       10.8
0.7430            32,768                  4                4                 XOR       10.8

2.3. Feature checksums

To circumvent the payload problem, we use feature checksums: we do not embed the robust features themselves but only their checksum. Figure 7 illustrates this concept. The checksums can be compared to the checksums of the actual media features to detect content changes. An ideal feature is robust to all allowed changes; its checksum would be exactly the same after the manipulation. As we employ a sequence of features in every window, we need additional robustness: quantization reduces the required amount of bits and, at the same time, increases robustness as it maps similar values to the same quantized value. In Table 2, we list a number of checksums like hash functions (SHA, MD5), cyclic redundancy checks, and simple XOR functions. For hash functions, a certain number of bits is required; therefore, we can only work with big window sizes or a sequence of frames. XOR functions allow small window sizes. We can embed a feature checksum in less than a second with a bit rate of 10.8 bps into a single channel of CD-quality PCM audio.
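A minimal sketch of the quantization and XOR checksum steps, assuming feature values normalized to [0, 1]; the function names are illustrative and the n-bit folding is only one way to realize the XOR variant listed in Table 2.

```python
def quantize(value, step):
    """Map a feature value in [0, 1] to a quantization index; a larger
    step means fewer levels and hence more robustness."""
    return int(value / step)

def xor_checksum(quantized_values, n_bits=4):
    """Fold a sequence of quantized feature values into an n_bits XOR
    checksum, the small-window checksum type of Table 2."""
    mask = (1 << n_bits) - 1
    checksum = 0
    for v in quantized_values:
        checksum ^= v & mask
    return checksum

# Example: RMS values of consecutive frames, quantized with step 0.01,
# folded into a 4-bit checksum that is then embedded as the watermark.
# bits = xor_checksum([quantize(r, 0.01) for r, _ in frame_features(samples, 4096)])
```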

2.4. Test results

We use a prototypic implementation based on our own watermarking algorithm, which uses spread-spectrum and statistical techniques, different feature extractors, a feature comparison algorithm, and a feature checksum generator to evaluate our content-fragile watermarking concept. The basic idea of our tests can be described in the following steps.

(1) Select an audio file as a cover to be secured.
(2) Select one or more features describing the audio file.
(3) Retrieve the features for a given amount of time.
(4) Create a feature checksum.
(5) Embed the feature checksum as a watermark.
(6) Attack the cover.
(7) Retrieve the watermark from the attacked cover.
(8) Retrieve the features from the attacked cover and generate the checksums.
(9) Compare both to decide if a content change has occurred.

Table 3: Embed/retrieve comparison for 4-bit RMS. Both modes use 4 bits per checksum, a bit rate of 5.3833 bps, 48 frames per checksum, and the RMS in the frequency domain 2000–6000 Hz as the included feature. In embed mode, checksums are generated and embedded as a watermark; in retrieve mode, checksums are generated and compared to those retrieved as a watermark.

Embed mode                        Retrieve mode
No   Time (min:s)   Checksum      No   Time (min:s)   Extr   Retr   Integrity
0    0:0            11            0    0:3.72719      11     12     modified
1    0:2.22912      9             1    0:3.71299      9      4      modified
2    0:4.45823      7             2    0:6.68261      7      12     modified
3    0:6.68735      13            3    0:12.6269      13     12     modified
4    0:8.91646      9             4    0:11.1427      9      0      modified
5    0:11.1456      3             5    0:12.6313      3      4      modified
6    0:13.3747      7             6    0:13.3739      7      7      ok
7    0:15.6038      12            7    0:15.6022      12     12     ok
8    0:17.8329      9             8    0:17.8322      9      9      ok
9    0:20.062       3             9    0:20.0586      3      3      ok
10   0:22.2912      8             10   0:23.775       8      4      modified
11   0:24.5203      4             11   0:24.5186      4      4      ok
12   0:26.7494      1             12   0:26.7523      1      1      ok
13   0:28.9785      12            13   0:28.9769      12     12     ok
14   0:31.2076      15            14   0:32.6924      15     8      modified
15   0:33.4367      11            15   0:37.8899      11     12     modified
16   0:35.6659      14            16   0:37.1661      14     12     modified
17   0:37.895       14            17   0:37.9079      14     14     ok
18   0:40.1241      0             18   0:40.1361      0      0      ok
19   0:42.3532      7             19   0:42.3515      7      7      ok
20   0:44.5823      4             20   0:44.5798      4      4      ok

Table 3 shows an example where a 4-bit checksum and two sync bits are embedded every 48 frames. In the left part, the embedded feature checksums are presented, and in the right part, the results of a retrieve process. The comparison includes the actual extracted features, the retrieved features, and a decision whether the integrity has been corrupted. In this example, we see that the extracted feature checksums after embedding and retrieval match, while the extracted watermark shows other features. This may seem confusing at first sight, as one would assume the embedded information and the extracted features in embed mode to be similar. In this example, the chosen watermarking parameters are too weak and produce bit errors at retrieval, but at the same time they do not influence the robust features. It is clear that an optimal trade-off between the robustness and transparency of the watermark will provide the best results.

Audio watermarking algorithms are usually not com-pletely reliable regarding the retrieval of single embeddedbits. Certain number of errors in the detected watermarkscan be expected and compensated by error-correction codesand redundancy. But as the data rate of the watermarkingalgorithms is already low without these additional mecha-nisms, content-fragile watermarks cannot rely on error com-pensation. Therefore, to achieve good test results, water-marking and feature parameters have to be chosen carefullyto prevent a high error rate. In Figure 8, a set of optimized

Page 77: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/196487.pdf · 2012-02-02 · Editor-in-Chief Marc Moonen, Belgium Senior Advisory Editor K. J. Ray Liu, College Park,

Watermarking-Based Digital Audio Data Authentication 1009

Figure 8: Optimized parameters lead to error rates below 20% (RMS, checksum 4 bit). (Bar chart of error rate in % versus quantizer values from 0.9 down to 0.01 for the five test files Teenage, Enya, Alive, Ateam, and TKKG.)

In Figure 8, a set of optimized parameters has been identified and tested with five audio files ranging from rock music to radio drama. RMS is chosen as the extracted feature. To receive optimal results, we keep a certain distance between the frequency band the watermark is embedded in and the band the feature is extracted from. In this example, the feature band is 2 kHz to 6 kHz, and the watermark is embedded in the band from 10 kHz to 14 kHz.

Even with these optimized parameters, a false error rate between 5% and 20% is usual for the retrieval of feature checksums. Today's audio watermarking algorithms offer error rates of 1% or less per embedded bit. This adds up to a larger error rate in our application, as one wrong bit in the multibit checksum results in an error. For common audio watermarking applications, a 5% error rate for embedded watermarks is acceptable. Both the error rate of the watermarking algorithm and the possibility of changing the monitored feature by embedding the watermark add up to a basic error rate, which is detected even if no attacks have occurred. This basic error rate has to be taken into account when a decision regarding the integrity of the audio material is made.

As already stated in Section 2.2, quantizer sizes influence robustness. For the results in Figure 8, a quantizer value of 0.9 basically means that all features are identified by the same value, while 0.01 provides a detailed representation. Error rates increase with the level of detail.
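To make the discussion concrete, the following is a minimal sketch, assuming 44.1 kHz PCM frames, of how a per-frame RMS feature in the 2–6 kHz band could be computed and folded into a 4-bit checksum under a chosen quantizer value. The function names, the normalization, and the checksum folding are illustrative assumptions and not the authors' implementation.

# Minimal sketch: frequency-domain RMS feature (2-6 kHz) per frame, quantized
# and folded into a 4-bit checksum. Names and details are assumptions.
import numpy as np


def band_rms(frame, sample_rate=44100, f_lo=2000.0, f_hi=6000.0):
    """RMS of the 2-6 kHz band of one frame, computed in the frequency domain."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    band = np.abs(spectrum[(freqs >= f_lo) & (freqs <= f_hi)])
    return np.sqrt(np.mean(band ** 2))


def feature_checksum(frames, quantizer=0.1, bits=4):
    """Quantize per-frame RMS values and reduce them to one n-bit checksum.

    A coarse quantizer (e.g. 0.9) maps almost all features to the same level,
    a fine one (e.g. 0.01) keeps more detail but is also more fragile.
    """
    rms = np.array([band_rms(f) for f in frames])
    rms = rms / (np.max(rms) + 1e-12)           # normalize features to [0, 1]
    levels = np.floor(rms / quantizer).astype(int)
    return int(np.sum(levels) % (1 << bits))    # fold into a 4-bit value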

In Figure 9, we show test results after performing a Stirmark benchmark audio attack [22] for the parameter RMS. We embed a feature vector with the parameters of Figure 8 and run a number of audio manipulations of different strength on the marked file. Then the watermark is retrieved, and both the retrieved and the recalculated feature vectors are compared.

The content-preserving attacks “normalize,” “invert,” and “amplify” result in error rates equal to those of the no-operation attack “nothing” or of only embedding the watermark. An error rate below 20% can be seen as a threshold for content-preserving operations. Content manipulations like filters (lowpass, highpass), the addition of noise (addnoise) or humming (addbrumm), and the removal of samples result in higher error rates of up to almost 100%. The different quantization values again have a significant influence on the error rate, but the behavior is the same for all attack types: a lower resolution results in lower error rates.

While these attacks may be assumed to be content preserving in some cases, for example, lowpass filtering is common in audio transmission, the results show that a certain discrimination between attacks is possible. The results also correspond to the attack strength. Lower noise values lead to lower error rates.

The test results are encouraging. A threshold may be necessary to filter an unavoidable error level of about 20%, but attacks can be identified. Quantization values can be used as a fragility parameter. A similar behavior is observed in different audio files including speech, environmental recordings, and music, making this approach useful for various applications.
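As a small illustration of this threshold-based decision, the sketch below compares retrieved and recalculated checksums and reports the content as changed only when the mismatch rate exceeds the basic error level of about 20%. The function name and the exact threshold handling are assumptions; the example values are taken from the first rows of Table 3.

# Hedged sketch of the threshold decision: per-checksum verdicts plus an
# overall integrity decision against a ~20% basic error rate.
def integrity_decision(retrieved, recalculated, threshold=0.20):
    """Return per-checksum verdicts, the error rate, and an overall decision."""
    verdicts = ["ok" if r == c else "modified"
                for r, c in zip(retrieved, recalculated)]
    error_rate = verdicts.count("modified") / len(verdicts)
    return verdicts, error_rate, error_rate > threshold


# Example with the first rows of Table 3:
recalculated = [11, 9, 7, 13, 9, 3, 7, 12]   # "Extr" column
retrieved = [12, 4, 12, 12, 0, 4, 7, 12]     # "Retr" column
print(integrity_decision(retrieved, recalculated))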

3. INVERTIBLE AUDIO WATERMARKING

Based on the general idea of invertible watermarking, an invertible scheme for audio has to combine lossless compression with different cryptographic functions, see Figure 10. An audio stream consists of samples with a variable number of bits describing one sample value. We take a number of consecutive samples and call them a frame. Now one bit layer of this frame is selected and compressed by a lossless compression algorithm. For example, we would build a frame of 10 000 16-bit samples and take bit #5 from each sample. The difference between the memory requirements of the original and the compressed bit layer can now be used to carry additional security information. In our example, the compressed 10 000 bits of layer #5 could require only 9 000 bits to represent.


Figure 9: Stirmark audio test results. Stronger attacks lead to higher error rates (RMS, checksum 4 bit). (Bar chart of error rate in % for the marked original and the attacks rc lowpass, rc highpass, nothing, normalize, invert, cutsamples, zerocross, compressor, amplify, addsinus, addnoise 100–900, and addbrumm 100–10100, each for the quantizer values 0.02, 0.1, and 0.9.)

Figure 10: Invertible audio watermarking. The bits of one bit layer are compressed and the resulting free space is used to embed additional security information (fields: Sync, Comp, H(Y), H(X), F ID, Fill).

The resulting 1 000 bits can be used as security information like, for example, a hash of the other 15 bit layers. The original bit vector is replaced by the compressed bit vector and the security information. As the complete information about the original bit layer is still available in compressed form, it can be decompressed at any time, and by overwriting the new information with the original bits, we get the original frame back.

3.1. Invertible authentication for audio streams

As discussed, today's invertible watermarking solutions are only available for image data, where only one complete image will be protected.


Figure 11: Audio watermarking requires stream synchronization to allow cutting the material. Therefore, an incremental frame ID is included in the watermark.

If the same concept is transferred to audio data, certain problems arise: the amount of data for a long recording can easily become more than one GB. If the invertible watermark is to be embedded in a special device, very large memory reserves would be necessary. Besides this technical problem, integrity may not be lost even if the original data is not completely present. A recording of an interview may be edited later, removing an introduction by the reporter or some closing remarks at the end. The message of the interview will not be corrupted if this information is removed.

Therefore, we suggest a frame-based solution like that introduced in Section 2. A number of consecutive samples are handled as an audio frame, for example, 44100 samples for one second of CD-quality mono data. This frame Fi is now protected like, for example, a single image. It includes all necessary information to prove its integrity. But additional security information and a synchronization header are necessary, as shown in Figure 11.

(i) A sequence ID, IDS, is embedded in every frame Fi. It verifies that the frame belongs to a certain recording. This provides security against exchanges from other recordings. If IDS is not included, an attacker could overwrite a part of the protected audio data with a different but also protected stream taken from another recording at the same position without being detected.

(ii) An incremental frame ID, IDT, is also embedded. This provides security against exchanges in the sequence of frames Fi of the protected sequence. Swapping a number of frames would not be detectable without this ID and would lead to manipulation possibilities.

(iii) A synchronization header (Datasync) is also necessary. Otherwise, cutting the audio data would usually lead to a complete loss of the security information, as the correct start position of a new frame would be undetectable. With the help of the synchronization header, the algorithm can scan for a new starting point if it detects a corrupt frame.

With these three additional mechanisms, invertible audio watermarking becomes more usable for many applications, as a number of attacks are disabled and a certain amount of editing is allowed. A tool for integrity verification could identify gaps in the sequence and inform the user about them. The user can then decide whether to trust the information close to the gaps or not, as, for example, a third party could have removed words from a speech. From our discussion in Section 1.3, the protocol for audio data can be written for each audio frame Fi (i = 1 . . . number of audio frames) as follows:

\[
F_i^{W} = F_{i,\mathrm{remaining}} \,\|\, \mathrm{Data}_{\mathrm{sync}} \,\|\, \mathrm{Data}_{\mathrm{info}} \,\|\, W_i \,\|\, \mathrm{Data}_{\mathrm{fill}},
\]
\[
\begin{aligned}
W_i ={} & E_{\mathrm{AES}}\bigl(E_{\mathrm{AES}}\bigl(C(\mathrm{AudioLayerBits}),\, k_{H(F_{i,\mathrm{remaining}})}\bigr),\, K_{\mathrm{secret}}\bigr)\\
& \|\; \mathrm{HMAC}\bigl(\mathrm{AudioLayerBits},\, K_{\mathrm{secret}}\bigr)\\
& \|\; \mathrm{RSA}_{\mathrm{signature}}\Bigl(H\bigl(F_{i,\mathrm{remaining}} \,\|\, E_{\mathrm{AES}}\bigl(E_{\mathrm{AES}}\bigl(C(\mathrm{AudioLayerBits}),\, k_{H(F_{i,\mathrm{remaining}})}\bigr),\, K_{\mathrm{secret}}\bigr)\\
& \qquad\qquad \|\; \mathrm{HMAC}\bigl(\mathrm{AudioLayerBits},\, K_{\mathrm{secret}}\bigr) \,\|\, \mathrm{Data}_{\mathrm{sync}}\bigr)\Bigr),
\end{aligned}
\]

where AudioLayerBits is the bit vector to be replaced in Fi, and Fi,remaining is the set of the remaining bit vectors in Fi.
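To illustrate how the fields of Wi fit together, the following simplified sketch assembles a per-frame payload in the spirit of the protocol above. Real AES encryption and the RSA signature are replaced by clearly marked stand-ins (a SHA-256 keystream and an HMAC tag), zlib stands in for the lossless compression C, and all function and parameter names are illustrative assumptions rather than the authors' implementation.

# Simplified, self-contained sketch of the payload W_i. The cryptographic
# primitives below are stand-ins only (NOT real AES or RSA); the point is the
# payload layout, not the cryptography.
import hashlib
import hmac
import zlib


def xor_keystream(data: bytes, key: bytes) -> bytes:
    """Stand-in for E_AES: XOR with a SHA-256-derived keystream (not real AES)."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))


def build_payload(audio_layer_bits: bytes, frame_remaining: bytes,
                  k_secret: bytes, sync: bytes = b"SYNC") -> bytes:
    """Compress the selected bit layer and wrap it with the security fields."""
    compressed = zlib.compress(audio_layer_bits)           # C(AudioLayerBits)
    inner_key = hashlib.sha256(frame_remaining).digest()   # k_H(F_i,remaining)
    encrypted = xor_keystream(xor_keystream(compressed, inner_key), k_secret)
    mac = hmac.new(k_secret, audio_layer_bits, hashlib.sha256).digest()
    # Stand-in for RSA_signature over the frame contents and security fields.
    signature = hmac.new(k_secret, frame_remaining + encrypted + mac + sync,
                         hashlib.sha256).digest()
    return sync + encrypted + mac + signature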

3.2. Compression techniques and capacity evaluation

Based on the general invertible concepts, the next major question is how to perform a lossless audio compression C to achieve invertibility, that is, to get back the exact original of the audio representation. Common audio compression schemes like mp3 are not acceptable due to their lossy characteristics. Compression schemes also applied to text or software, like the common zip compression C, satisfy this requirement but are far less efficient than lossy audio compression.

Therefore, we design the following compression algorithm.

(1) The required number of bits r and the number of samples n to be used as one frame are provided as parameters.

(2) From the n samples, each lowest bit is added to a bit vector B of the length n.

(3) B is compressed by a lossless algorithm, producing a compressed bit vector B′ of length n′.

(4) If n − n′ < r, the compression of the bit layer is not sufficiently efficient and the next higher bit layer is compressed.


(5) If we win r bits from the compression, the current bit layer becomes the one we embed the information into for this frame.

(6) If even at bit layer #15 the compression is not sufficiently efficient, embedding is not possible in this frame (a sketch of this selection procedure is given after this list).
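A minimal sketch of the selection procedure in steps (1) through (6) is given below, assuming 16-bit PCM samples and using zlib as a generic stand-in for the lossless compression; the function name, the bit packing, and the example frame are illustrative assumptions.

# Sketch of the bit-layer selection: compress each layer (LSB upwards) until
# one yields at least r bits of gain; return None if no layer is suitable.
import zlib
import numpy as np


def select_bit_layer(samples, r_bits):
    """Return (layer index, compressed layer bytes, gain in bits) or None."""
    n = len(samples)
    for layer in range(16):                        # bit #0 (LSB) .. bit #15
        bits = ((samples.astype(np.uint16) >> layer) & 1).astype(np.uint8)
        packed = np.packbits(bits).tobytes()       # n bits -> ceil(n/8) bytes
        compressed = zlib.compress(packed, 9)
        gain = n - 8 * len(compressed)             # bits won by compression
        if gain >= r_bits:
            return layer, compressed, gain
    return None


# Example: a 44100-sample frame whose lowest bit layers are highly structured.
frame = (np.arange(44100) % 4).astype(np.uint16)
result = select_bit_layer(frame, r_bits=2000)
if result is not None:
    print("selected layer:", result[0], "gain (bits):", result[2])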

Table 4 shows an example of this process. The parameters are 44100 samples for one frame and a requirement of 2000 bits. In the first frame, we already receive good compression results at bit #0. The difference n − n′ is 3412, more than required. In the second frame, bits #0 to #7 do not provide a positive compression gain, so bit #8 is selected as the compressed layer. To identify the chosen bit layer, a synchronization sequence embedded into the Datainfo is necessary, identifying the compressed layer for every frame.

In Table 5, we provide a comparison of the capacities of four example files A to D for two frames of 44100 samples. A first assumption about bit requirements can be made based on the knowledge about the required components. As multiple hash functions are available and the length of the RSA signature is key dependent, the capacity requirements are calculated as follows:

(i) sync info, for example, 64 bit;
(ii) two hash values:
    (a) remaining audio information, for example, 256 bit,
    (b) selected bit layer, for example, 256 bit;
(iii) RSA digital signature, for example, 512 bit;
(iv) compressed bits are encrypted by a symmetric key scheme (AES), that is, adding max. 63 bits.

This sums up to about 1100 bits. Therefore, any compression result providing 1100 bits of gain would be suitable for embedding invertible security information. In the example of Table 5, in frame 1, the information will be embedded in bit layer #8 of file A and in layer #0 of B, C, and D. In frame 2, A and C require bit layer #8, while B and D can use bit layer #0. An important observation is the fact that the capacity of the compression results does not always increase with the bit layer, as one would assume when looking at the examples for still images in [9]. In frame 2 of Table 5, column D, the amount of gained bits decreases from bit layer #0 to #7 and then becomes a constant value for bit layers #8 to #15. Quantization, changes of bit representation, and addition of noise are possible reasons for this effect.

4. APPLICATIONS

Content security for digital audio is not discussed as much today as it is for image or video data. In this section, we discuss a selection of possible scenarios where either content-fragile or invertible watermarking schemes like the ones we described in Sections 2 and 3 will become necessary.

4.1. News data authentication

Digital audio downloads on the Internet can replace radio news. Interviews and reports will be recorded, digitized, and uploaded to news servers.

Table 4: Compression efficiency changes from frame to frame.

Bit#   Bits/Orig.   Bits/Comp.   Difference   Comp. Factor
Frame 1
 0     5513         2101         3412         0.381
 1     5513         2307         3206         0.418
 2     5513         2517         2996         0.457
 3     5513         3424         2089         0.621
 4     5513         4415         1098         0.801
 5     5513         5225          288         0.948
 6     5513         5574          −61         1.011
 7     5513         5603          −90         1.016
 8     5513         1299         4214         0.236
 9     5513         1298         4215         0.235
10     5513         1298         4215         0.235
11     5513         1298         4215         0.235
12     5513         1301         4212         0.236
13     5513         1386         4127         0.251
14     5513         1605         3908         0.291
15     5513         1859         3654         0.337
Frame 2
 0     5513         5345          168         0.970
 1     5513         5566          −53         1.010
 2     5513         5599          −86         1.016
 3     5513         5603          −90         1.016
 4     5513         5603          −90         1.016
 5     5513         5606          −93         1.017
 6     5513         5602          −89         1.016
 7     5513         5600          −87         1.016
 8     5513         1725         3788         0.313
 9     5513         1726         3787         0.313
10     5513         1726         3787         0.313
11     5513         1726         3787         0.313
12     5513         1742         3771         0.316
13     5513         2626         2887         0.476
14     5513         3611         1902         0.655
15     5513         4647          866         0.843

With content-fragile watermarking, the trust in the information can be increased. The source of the news, for example, a reporter or even the recording device, embeds a content-fragile watermark into the audio data and encrypts the content information with a private key. Now everybody who uses a corresponding detection algorithm would be able to verify the content. If the watermarking keys were distributed freely, only the public key of the embedding party would be required for verification.

The robustness of the algorithm to content-preserving operations, for example, format changes or volume changes, allows the news distributor to adjust the data to his common format without the need of a new verification process. Only the source of the data has to be trusted; all changes in the distribution chain will be detected. A person receiving the news


Table 5: Capacity comparison of four example files and two frames.

Frame 1                               Frame 2
Bit#      A      B      C      D      Bit#      A      B      C      D
 0        0   5362   1702   3793       0        0   5097      0   2708
 1        0   5362   1690   3792       1        0   5098      0   2359
 2        0   5362   1670   3793       2        0   5047      0   1666
 3        0   5362   1648   3792       3        0   4946      0    772
 4        0   5361   1630   3779       4        0   4861      0    144
 5        0   5362   1631   3708       5        0   4801      0      0
 6        0   5347   1625   3584       6        0   4715      0      0
 7        0   5347   1630   2395       7        0   4701      0      0
 8     3113   5362   4195   3792       8     3151   5097   4051   2746
 9     3114   5361   4195   3793       9     3151   5097   4051   2746
10     3113   5362   4128   3792      10     3151   5097   3987   2746
11     3113   5362   3567   3792      11     3151   5098   3114   2746
12     2998   5362   2941   3793      12     2982   5097   2118   2746
13     2081   5361   2328   3792      13     1984   5097   1116   2746
14     1115   5362   1844   3793      14     1027   5097    276   2746
15      128   5362   1741   3792      15       84   5097      0   2746

therefore can be sure the content is not censored or manipulated by third parties.

4.2. Surveillance recordings

Surveillance recordings are most often made by cameras. Recently, requirements regarding the trustworthiness of digital versions of such cameras became an important issue. If the recorded content is easily manipulated, the concept of surveillance and its weight at court is flawed. With digital audio content authentication, the audio channels of surveillance recordings can be protected.

Invertible methods would be applied in scenarios where high security is required and the audio data is not compressed but stored directly onto high-capacity media. The watermark would act like a seal that will be broken if manipulations take place. At the same time, the inversion option enables a selected group of persons to work with original quality if necessary.

Content-fragile watermarking will be applied if operations like compression are expected and accepted after embedding. The robust watermark applied in this approach will survive the compression algorithm and provide the content information at a later time to verify the integrity of the recording.

4.3. Forensic recordings

The assumption that an audio file has to be highly secured can be made for forensic recordings. When such a recording is made, for example, of an interview with a witness, protection of the content is very important, and our invertible approach can be used to ensure data authentication. Usually the invertible aspect will not be required, as spoken language is not very fragile against bit changes in the lower layers. Invertibility is important when very small changes in the audio can have some effect. This may be the case when a digital copy of an analogue recording is made, and assumptions about cuts in the analogue media shall later be made based on the digital copy. The addition of noise from the watermark may then mask the slight changes one can perceive at the cutting points. The possibility of setting the recording back to its original state is an important increase in usability here. A similar case is the detection of environmental noises in telephone calls, for example. A marked recording with added noise will also make it hard to estimate the nature of the background noise, as both types of noise will mix.

4.4. CD-master protection

In the examples given above, only speech or environmental information is protected. But music can also be the subject of our protection schemes; CD masters are valuable pieces of audio data, which also require exact reproduction to ensure copies of high quality. Our invertible audio watermarking scheme offers two valuable mechanisms for this scenario. When the CD master is protected by an invertible watermark, it can be sent without any additional security requirements via mail or the internet. Any third party capturing the copy and not possessing the secret key can get an idea of how the CD will sound, but the audio quality is too low for illegal reproduction. After the CD arrives at its destination, the CD copy plant will use the sender's public key to verify the integrity of the audio tracks and the previously exchanged secret key to remove the watermark. Thereby an error-proof and secure copy has been transmitted via an insecure environment.

5. SUMMARY AND CONCLUSION

In this paper, we introduce two concepts for digital audio content authentication.


Content-fragile watermarking is based on combining robust watermarking and fragile content features. It is suitable for applications where certain operations changing the binary representation of the content are acceptable. The robust nature of the watermark and the right choice of content features and their quantization provide tolerance to such operations while still enabling us to identify content changes. The invertible watermarking approach is suited for high-security scenarios. It offers no robustness to any kind of operation except cutting. Our frame-based approach allows the detection of cuts and the resynchronization afterwards. The verification of integrity is much more exact than in the content-fragile approach since cryptographic hash functions are applied. An important additional feature is the invertibility, which allows recreating the original state of the data if the corresponding secret key is present.

We provide test results for both authentication schemes. Content-fragile watermarking error rates increase with the strength of attacks. Therefore, a threshold-based identification of content changes depending on the application is possible. One source of false alarms in this approach is errors in the retrieved watermark. Improving the watermarking algorithm will decrease false alarms. Both a better transparency, to reduce the effects of the embedded watermark on the retrieved features, and a more reliable watermark detection would decrease the basic error rate.

The test results of the invertible approach mainly address compression results. We show that a flexible selection of suitable bit vectors from frame to frame is necessary to achieve an optimal trade-off between quality and compression. We also demonstrate the general possibility of embedding a suitable amount of security data. The compression rates of the audio bits are sufficient to carry all required information in all examples.

The security of both introduced approaches depends on keys. A secure watermarking algorithm like the one applied to embed the content-fragile information in Section 2 will always be based on a secret user key. One can assume that the basic embedding function will become known to the public sooner or later, so security based on secret algorithms will lead to serious security risks. The invertible approach in Section 3 includes two key management methods: the compressed data is encrypted with a secret key scheme, while the verification of integrity is based on a public key scheme. Therefore, key management is an important issue for both approaches.

Content-fragile watermarking requires at least a key distribution concept for the watermark key that will be application dependent. The key could be distributed to every interested party so everyone can verify the integrity of the content, or the key is held by a trusted third party to which the marked material can be uploaded for verification. If our suggestion to use asymmetric encryption to add authentication of the embedding party of the content-fragile data is applied, a PKI is necessary.

Invertible watermarking also requires a PKI for integrity verification, as the hash values of the original data are encrypted with an asymmetric scheme. Secure key exchange between those parties that should be able to decrypt the compressed data and set the audio data back to its original state is also necessary.

To conclude, we see that audio content security is an important new domain in audio processing and watermarking as well as in general audio research. Our paper shows different promising directions. Future work in content-fragile audio watermarking concentrates on further feature extraction and the development of a demonstrator software to finally achieve a complete framework. Current test results show the general correctness of our ideas but also identify the necessity of further research.

ACKNOWLEDGMENT

We would like to thank Patrick Horster from the University of Klagenfurt (Austria) for his input regarding the formalization of the invertible watermarking scheme.

REFERENCES

[1] J. Dittmann, P. Wohlmacher, and K. Nahrstedt, “Using cryptographic and watermarking algorithms,” IEEE Multimedia, vol. 8, no. 4, pp. 54–65, 2001.

[2] J. Dittmann, M. Steinebach, T. Kunkelmann, and L. Stoffels, “H2O4M - watermarking for media: Classification, quality evaluation, design improvements,” in Proc. 8th ACM International Multimedia Conference (ACM Multimedia ’00), pp. 107–110, Los Angeles, Calif, USA, November 2000.

[3] F. Petitcolas, R. Anderson, and M. Kuhn, “Attacks on copyright marking systems,” in Proc. 2nd International Workshop on Information Hiding (IHW ’98), vol. 1525 of Lecture Notes in Computer Science, pp. 219–239, Springer-Verlag, Portland, Ore, USA, April 1998.

[4] E. Lin and E. J. Delp, “A review of fragile image watermarks,” in Proc. ACM International Multimedia Conference (ACM Multimedia ’99), J. Dittmann, K. Nahrstedt, and P. Wohlmacher, Eds., pp. 25–29, Orlando, Fla, USA, October–November 1999.

[5] J. Fridrich, “Methods for tamper detection of digital images,” in Proc. ACM International Multimedia Conference (ACM Multimedia ’99), J. Dittmann, K. Nahrstedt, and P. Wohlmacher, Eds., pp. 29–34, Orlando, Fla, USA, October–November 1999.

[6] J. Dittmann, M. Steinebach, I. Rimac, S. Fischer, and R. Steinmetz, “Combined video and audio watermarking: embedding content information in multimedia data,” in Security and Watermarking of Multimedia Contents II, vol. 3971 of SPIE Proceedings, pp. 455–464, San Jose, Calif, USA, January 2000.

[7] J. Dittmann, “Content-fragile watermarking for image authentication,” in Security and Watermarking of Multimedia Contents III, vol. 4314 of SPIE Proceedings, pp. 175–184, San Jose, Calif, USA, January 2001.

[8] J. Fridrich, M. Goljan, and R. Du, “Lossless data embedding—new paradigm in digital watermarking,” Journal on Applied Signal Processing, vol. 2002, no. 2, pp. 185–196, 2002.

[9] J. Dittmann, M. Steinebach, and L. Croce Ferri, “Watermarking protocols for authentication and ownership protection based on timestamps and holograms,” in Photonics West 2002, Electronic Imaging: Science and Technology; Multimedia Processing and Applications, Security and Watermarking of Multimedia Contents IV, E. J. Delp and P. W. Wong, Eds., vol. 4675 of SPIE Proceedings, pp. 240–251, San Jose, Calif, USA, January 2002.


[10] M. Arnold, “Audio watermarking: Features, applications and algorithms,” in Proc. IEEE International Conference on Multimedia and Expo (ICME ’00), pp. 1013–1016, New York, NY, USA, July–August 2000.

[11] J. D. Gordy and L. T. Bruton, “Performance evaluation of digital audio watermarking algorithms,” in Proc. 43rd IEEE Midwest Symposium on Circuits and Systems (MWSCAS ’00), pp. 456–459, Lansing, Mich, USA, August 2000.

[12] D. Kirovski and H. Malvar, “Spread-spectrum audio watermarking: Requirements, applications, and limitations,” in IEEE 4th Workshop on Multimedia Signal Processing, Cannes, France, October 2001.

[13] P. Nintanavongsa and T. Amornkraksa, “Using raw speech as a watermark, does it work?,” in Proc. International Federation for Information Processing Communications and Multimedia Security (CMS), Joint Working Conference IFIP TC6 and TC11, R. Steinmetz, J. Dittmann, and M. Steinebach, Eds., Kluwer Academic, Darmstadt, Germany, May 2001.

[14] C. Neubauer and J. Herre, “Audio watermarking of MPEG-2 AAC bit streams,” in AES 108th Convention, Porte Maillot, Paris, France, February 2000.

[15] L. Boney, A. H. Tewfik, and K. N. Hamdy, “Digital watermarks for audio signals,” in VIII European Signal Proc. Conf. (EUSIPCO ’96), Trieste, Italy, September 1996.

[16] J. Dittmann, Digitale Wasserzeichen, Springer-Verlag, Berlin, Heidelberg, 2000.

[17] C.-P. Wu and C.-C. J. Kuo, “Comparison of two speech content authentication approaches,” in Photonics West 2002: Electronic Imaging, Security and Watermarking of Multimedia Contents IV, vol. 4675 of SPIE Proceedings, pp. 158–169, San Jose, Calif, USA, January 2002.

[18] J. Dittmann, M. Steinebach, L. Croce Ferri, A. Mayerhofer, and C. Vielhauer, “Advanced multimedia security solutions for data and owner authentication,” in Applications of Digital Image Processing XXIV, vol. 4472 of SPIE Proceedings, pp. 132–143, San Diego, Calif, USA, July–August 2001.

[19] J. Fridrich, M. Goljan, and R. Du, “Invertible authentication,” in Photonics West 2001: Electronic Imaging, Security and Watermarking of Multimedia Contents III, vol. 4314 of SPIE Proceedings, pp. 197–208, San Jose, Calif, USA, January 2001.

[20] Z. Liu, J. Huang, Y. Wang, and T. Chen, “Audio feature extraction & analysis for scene classification,” in IEEE 1st Workshop on Multimedia Signal Processing, pp. 343–348, Princeton, NJ, USA, June 1997.

[21] S. Pfeiffer, Information Retrieval aus digitalisierten Audiospuren von Filmen, Shaker Verlag, Aachen, Germany, 1999.

[22] M. Steinebach, F. Petitcolas, F. Raynal, et al., “StirMark benchmark: Audio watermarking attacks,” in Proc. International Conference on Information Technology: Coding and Computing (ITCC ’01), pp. 49–54, Las Vegas, Nev, USA, April 2001.

Martin Steinebach is a Research Assistant at Fraunhofer IPSI (Integrated Publication and Information Systems Institute). His main research topic is digital audio watermarking. Current activities are watermarking algorithms for mp2, MIDI, and PCM data, feature extraction for content-fragile watermarking, attacks on audio watermarks, and concepts for applying audio watermarks in eCommerce environments. He studied computer science at the Technical University of Darmstadt and finished his Diploma thesis on copyright protection for digital audio in 1999. Martin Steinebach was the Organizing Committee Chair of CMS 2001 and coorganized the Watermarking Quality Evaluation Special Session at the ITCC International Conference on Information Technology: Coding and Computing 2002. Since 2002, he has been the Head of the Department MERIT (Media Security in IT) and of the C4M Competence Center for Media Security.

Jana Dittmann has been a Full Professor in the field of multimedia and media security at the Otto-von-Guericke University Magdeburg since September 2002. She studied computer science and economy at the Technical University in Darmstadt and worked as a Research Assistant at the GMD-IPSI (later Fraunhofer IPSI) from 1996 to 2002. In 1999, she received her Ph.D. from the Technical University of Darmstadt. At IPSI, she was one of the founders and the leader of the C4M Competence Center for Media Security. Jana Dittmann specializes in the field of multimedia security. Her research has mainly focused on digital watermarking and content-based digital signatures for data authentication and for copyright protection. She has many national and international publications, is a member of several conference PCs, and organizes workshops and conferences in the field of multimedia and security issues. Since June 2002, she has been an Editor on the Editorial Board of the ACM Multimedia Systems Journal. She was involved in the organization of all of the last five Multimedia and Security Workshops at ACM Multimedia. In 2001, she was a Cochair of the CMS 2001 conference that took place in May 2001 in Darmstadt, Germany. Furthermore, she organized several special sessions, for example, on watermarking quality evaluation and on biometrics.


EURASIP Journal on Applied Signal Processing 2003:10, 1016–1026
© 2003 Hindawi Publishing Corporation

Model-Based Speech Signal Coding Using Optimized Temporal Decomposition for Storage and Broadcasting Applications

Chandranath R. N. Athaudage
ARC Special Research Center for Ultra-Broadband Information Networks (CUBIN), Department of Electrical and Electronic Engineering, The University of Melbourne, Victoria 3010, Australia
Email: [email protected]

Alan B. Bradley
Institution of Engineers Australia, North Melbourne, Victoria 3051, Australia
Email: [email protected]

Margaret Lech
School of Electrical and Computer System Engineering, Royal Melbourne Institute of Technology (RMIT) University, Melbourne, Victoria 3001, Australia
Email: [email protected]

Received 27 May 2002 and in revised form 17 March 2003

A dynamic programming-based optimization strategy for a temporal decomposition (TD) model of speech and its application to low-rate speech coding in storage and broadcasting is presented. In previous work with the spectral stability-based event localizing (SBEL) TD algorithm, the event localization was performed based on a spectral stability criterion. Although this approach gave reasonably good results, there was no assurance of the optimality of the event locations. In the present work, we have optimized the event localizing task using a dynamic programming-based optimization strategy. Simulation results show that an improved TD model accuracy can be achieved. A methodology for incorporating the optimized TD algorithm within the standard MELP speech coder for the efficient compression of speech spectral information is also presented. The performance evaluation results revealed that the proposed speech coding scheme achieves 50%–60% compression of speech spectral information with negligible degradation in the decoded speech quality.

Keywords and phrases: temporal decomposition, speech coding, spectral parameters, dynamic programming, quantization.

1. INTRODUCTION

While practical issues such as delay, complexity, and fixed rate of encoding are important for speech coding applications in telecommunications, they can be significantly relaxed for speech storage applications such as store-forward messaging and broadcasting systems. In this context, it is desirable to know what optimal compression performance is achievable if the associated constraints are relaxed. Various techniques for compressing speech information exploiting the delay domain, for applications where delay does not need to be strictly constrained (in contrast to full-duplex conversational communication), are found in the literature [1, 2, 3, 4, 5]. However, only very few have addressed the issue from an optimization perspective. Specifically, temporal decomposition (TD) [6, 7, 8, 9, 10, 11], which is very effective in representing the temporal structure of speech and in removing temporal redundancies, has not been given adequate treatment for optimal performance to be achieved. Such an optimized TD (OTD) algorithm would be useful for speech coding applications such as voice store-forward messaging systems and multimedia voice-output systems, and for broadcasting via the internet. Not only would it be useful for speech coding in its own right, but research in this direction would lead to a better understanding of the structural properties of the speech signal and the development of improved speech models which, in turn, would result in improvement of audio processing systems in general.

TD of speech [6, 7, 8, 9, 10, 11] has recently emerged as a promising technique for analyzing the temporal structure of speech. TD is a technique of modelling the speech parameter trajectory in terms of a sequence of target parameters


(event targets) and an associated set of interpolation functions (event functions). TD can also be considered as an effective technique of decorrelating the inherent interframe correlations present in any frame-based parametric representation of speech. TD model parameters are normally evaluated over a buffered block of speech parameter frames, with the block size generally limited by the computational complexity of the TD analysis process over long blocks. Let yi(n) be the ith speech parameter at the nth frame location. The speech parameters can be any suitable parametric representation of the speech spectrum such as reflection coefficients, log area ratios, and line spectral frequencies (LSFs). It is assumed that the parameters have been evaluated at close enough frame intervals to represent accurately even the fastest of speech transitions. The index i varies from 1 to I, where I is the total number of parameters per frame. The index n varies from 1 to N, where n = 1 and n = N are the indices of the first and last frames of the speech parameter block buffered for TD analysis. In the TD model of speech, each speech parameter trajectory, yi(n), is described as

\[
\hat{y}_i(n) = \sum_{k=1}^{K} a_{ik}\,\phi_k(n), \qquad 1 \le n \le N,\; 1 \le i \le I,
\tag{1}
\]

where ŷi(n) is the approximation of yi(n) produced by the TD model. The variable φk(n) is the amplitude of the kth event function at the frame location n, and aik is the contribution of the kth event function to the ith speech parameter. The value K is the total number of speech events within the speech block with frame indices 1 ≤ n ≤ N. It should be noted that the event functions φk(n) are common to all speech parameter trajectories (yi(n), 1 ≤ i ≤ I) and therefore provide a compact and approximate representation, that is, a model, of speech. Equation (1) can be expressed in vector notation as

\[
\hat{y}(n) = \sum_{k=1}^{K} a_k\,\phi_k(n), \qquad 1 \le n \le N,
\tag{2}
\]

where

\[
a_k = \begin{bmatrix} a_{1k} & a_{2k} & \cdots & a_{Ik} \end{bmatrix}^{T},
\quad
\hat{y}(n) = \begin{bmatrix} \hat{y}_1(n) & \hat{y}_2(n) & \cdots & \hat{y}_I(n) \end{bmatrix}^{T},
\quad
y(n) = \begin{bmatrix} y_1(n) & y_2(n) & \cdots & y_I(n) \end{bmatrix}^{T},
\tag{3}
\]

where ak is the kth event target vector, and ŷ(n) is the approximation of y(n), the nth speech parameter vector, produced by the TD model of speech. Note that φk(n) remains a scalar since it is common to each of the individual parameter trajectories. In matrix notation, (2) can be written as

\[
\hat{Y} = A\Phi, \qquad \hat{Y} \in \mathbb{R}^{I \times N},\; A \in \mathbb{R}^{I \times K},\; \Phi \in \mathbb{R}^{K \times N},
\tag{4}
\]

where the kth column of matrix A contains the kth event target vector, ak, and the nth column of the matrix Ŷ (the approximation of Y) contains the nth speech parameter frame, ŷ(n), produced by the TD model. Matrix Y contains the original speech parameter block. In the matrix Φ, the kth row contains the kth event function, φk(n). It is assumed that the functions φk(n) are ordered with respect to their locations in time. That is, the function φk+1(n) occurs later than the function φk(n). Each φk(n) is supposed to correspond to a particular speech event. Since a speech event lasts for a short time (temporal), each φk(n) should be nonzero only over a small range of n. Event function overlapping normally occurs between events that are close in time, while events that are far apart in time have no overlapping at all. These characteristics ensure that the matrix Φ is a sparse matrix, with the number of nonzero terms in the nth column indicating the number of event functions overlapping at the nth frame location [6]. Thus, significant coding gains can be achieved by encoding the information in the matrices A and Φ instead of the original speech parameter matrix Y [6, 11, 12].

The results of the spectral stability-based event localizing (SBEL) TD [9, 10] and Atal's original algorithm [6] for TD analysis show that event function overlapping beyond two adjacent event functions occurs very rarely, although in the generalized TD model overlapping is allowed to any extent. Taking this into account, the proposed modified model of TD imposes a natural limit on the length of the event functions. We have shown that better performance can be achieved through optimization of the modified TD model. In previous TD algorithms such as SBEL TD [9, 10] and Atal's original algorithm [6], event locations are determined using heuristic assumptions. In contrast, the proposed OTD analysis technique makes no a priori assumptions on event locations. All TD components are evaluated based on error-minimizing criteria, using a joint optimization procedure. The mixed excitation LPC vocoder model used in the standard MELP coder was used as the baseline parametric representation of the speech signal. The application of OTD for efficient compression of MELP spectral parameters is also investigated, with TD parameter quantization issues and effective coupling between the TD analysis and parameter quantization stages. We propose a new OTD-based LPC vocoder with a detailed coder performance evaluation, both in terms of objective and subjective measures.

This paper is organized as follows. Section 2 introduces the modified TD model. An optimal TD parameter evaluation strategy based on the modified TD model is presented in Section 3. Section 4 gives numerical results with OTD. The details of the proposed OTD-based vocoder and its performance evaluation results are reported in Sections 5 and 6, respectively. The concluding remarks are given in Section 7.

2. MODIFIED TD MODEL OF SPEECH

The proposed modified TD model of speech restricts the event function overlapping to only two adjacent event functions, as shown in Figure 1. This modified model of TD can be described as

\[
\hat{y}(n) = a_k\,\phi_k(n) + a_{k+1}\,\phi_{k+1}(n), \qquad n_k \le n < n_{k+1},
\tag{5}
\]


Figure 1: Modified temporal decomposition model of speech. The speech parameter segment nk ≤ n < nk+1 is represented by a weighted sum (with weights φk(n) and φk+1(n) forming the event functions) of the two vectors ak and ak+1 (event targets). Vertical lines depict the speech parameter vector sequence.

where nk and nk+1 are the locations of the kth and (k + 1)th events, respectively. All speech parameter frames between the consecutive event locations nk and nk+1 are described by these two events. Equivalently, the modified TD model can be expressed as

\[
\hat{y}(n) = \sum_{k=1}^{K} a_k\,\phi_k(n), \qquad 1 \le n \le N,
\tag{6}
\]

where φk(n) = 0 for n < nk−1 and n ≥ nk+1. In the modified TD model, each event function is allowed to be nonzero only in the region between the centers of the preceding and succeeding events. This eliminates the computational overhead associated with achieving the time-limited property of events in the previous TD algorithms [6, 9, 10].

The modified TD model can be considered as a hybrid between the original TD concept [6] and the speech segment representation techniques proposed in [1]. In [1], a speech parameter segment between two locations nk and nk+1 is simply represented by a constant vector (the centroid of the segment) or by a first-order (linear) approximation. A constant vector approximation of the form

\[
\hat{y}(n) = \frac{\sum_{n=n_k}^{n_{k+1}-1} y(n)}{n_{k+1} - n_k}, \qquad \text{for } n_k \le n < n_{k+1},
\tag{7}
\]

provides a single vector representation for a whole speech segment. However, this representation requires the segments to be short in length in order to achieve a good speech parameter representation accuracy. A linear approximation of the form y(n) = na + b requires two vectors (a and b) to represent a segment of speech parameters. This segment representation technique captures linearly varying speech segments well and is similar to the linear interpolation technique reported in [13]. The proposed modified model of TD in (5) provides a further extension to speech segment representation, where each speech parameter vector y(n) is described as the weighted sum of the two vectors ak and ak+1, for nk ≤ n < nk+1. The weights φk(n) and φk+1(n) for the nth speech parameter frame form the event functions of the traditional TD model [6]. It is shown that the simplicity of the proposed modified TD model allows the optimal evaluation of the model parameters, thus resulting in improved modelling accuracy.

Figure 2: Buffering of speech parameters into blocks is a preprocessing stage required for TD analysis. TD analysis is performed on a block-by-block basis, with TD parameters calculated for each block separately and independently.

Figure 3: A block of speech parameter vectors, {y(n) | 1 ≤ n ≤ N}, buffered for TD analysis.

3. OPTIMAL ANALYSIS STRATEGY

This section describes the details of the optimization procedure involved in the evaluation of the TD model parameters based on the proposed modified model of TD described in Section 2.

3.1. Speech parameter buffering

TD is a speech analysis and modelling technique which can take advantage of the relaxed delay constraint in speech signal coding. TD generally requires speech parameters to be buffered over long blocks for processing, as shown in Figure 2. Although the block length is not fundamentally limited by the speech storage application under consideration, the computational complexity associated with processing long speech parameter blocks imposes a practical limit on the block size, N. The total set of speech parameters, y(n), where 1 ≤ n ≤ N, buffered for TD analysis is termed a block (see Figure 3). The series of speech parameters, y(n), where nk ≤ n < nk+1, is termed a segment. TD analysis is normally performed on a block-by-block basis, and for each block, the event locations, event targets, and event functions are optimally evaluated. For optimal performance, a buffering technique with overlapping blocks is required to ensure a smooth transition of events at the block boundaries. Sections 3.2 through 3.5 give the details of the proposed optimization strategy for a single-block analysis. Details of the overlapping buffering technique for improved performance are given in Section 3.6.

3.2. Event function evaluation

The proposed optimization strategy for the modified TD model of speech has the key feature of determining the optimum event locations from all possible event locations. This guarantees the optimality of the technique with respect to the modified TD model. Given a candidate set of locations,


{n1, n2, . . . , nK}, for the events, the event functions are determined using an analytical optimization procedure. Since the modified TD model of speech considered for optimization places an inherent limit on the event function length, the event functions can be evaluated in a piecewise manner. In other words, the parts of the event functions between the centers of consecutive events can be calculated separately, as described below. The remainder of this section describes the computational details of this optimum event function evaluation task.

Assume the locations nk and nk+1 of two consecutive events are known. Then, the right half of the kth event function and the left half of the (k + 1)th event function can be optimally evaluated by using ak = y(nk) and ak+1 = y(nk+1) as initial approximations for the event targets. The initial approximations of the event targets are later iteratively refined as described in Section 3.5. The reconstruction error, E(n), for the nth speech parameter frame is given by

\[
E(n) = \bigl\| y(n) - \hat{y}(n) \bigr\|^{2}
     = \bigl\| y(n) - a_k\,\phi_k(n) - a_{k+1}\,\phi_{k+1}(n) \bigr\|^{2},
\tag{8}
\]

where nk ≤ n < nk+1. By minimizing E(n) with respect to φk(n) and φk+1(n), we obtain

\[
\frac{\partial E(n)}{\partial \phi_k(n)} = \frac{\partial E(n)}{\partial \phi_{k+1}(n)} = 0,
\qquad
\begin{pmatrix} \phi_k(n) \\ \phi_{k+1}(n) \end{pmatrix}
=
\begin{pmatrix} a_k^{T} a_k & a_k^{T} a_{k+1} \\ a_k^{T} a_{k+1} & a_{k+1}^{T} a_{k+1} \end{pmatrix}^{-1}
\begin{pmatrix} a_k^{T}\, y(n) \\ a_{k+1}^{T}\, y(n) \end{pmatrix},
\tag{9}
\]

where nk ≤ n < nk+1. Therefore, the modelling error, E(n), for each spectral parameter vector, y(n), in a segment can be evaluated by using (5) and (6). The total accumulated error, Eseg(nk, nk+1), for a segment becomes

\[
E_{\mathrm{seg}}\bigl(n_k, n_{k+1}\bigr) = \sum_{n=n_k}^{n_{k+1}-1} E(n).
\tag{10}
\]

Therefore, given the event locations {n1, n2, . . . , nK} for a parameter block, 1 ≤ n ≤ N, the total accumulated error for the block can be calculated as

\[
E_{\mathrm{block}}\bigl(n_1, n_2, \ldots, n_K\bigr) = \sum_{n=1}^{N} E(n) = \sum_{k=0}^{K} E_{\mathrm{seg}}\bigl(n_k, n_{k+1}\bigr),
\tag{11}
\]

where n0 = 0, nK+1 = N + 1, and E(0) = 0. The first segment, 1 ≤ n < n1, and the last segment, nK ≤ n < N, of a speech parameter block, 1 ≤ n ≤ N, have to be analyzed specifically, taking into account the fact that these two segments are described by only one event, that is, the first and the Kth event, respectively. This is achieved by introducing two dummy events located at n0 = 0 and nK+1 = N + 1, with target vectors a0 and aK+1 set to zero, in the process of evaluating Eseg(1, n1) and Eseg(nK, N), respectively.
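A small sketch of this per-segment computation is given below. It assumes an interior segment, so that both event targets are nonzero and the 2 × 2 matrix in (9) is invertible, and it uses the unrefined initial targets y(nk) and y(nk+1) as in the text; the function name and the data layout (an I × N parameter block) are illustrative assumptions, not the authors' code.

# Sketch of (8)-(10): per-frame weights from the 2x2 normal equations of (9),
# accumulated into the segment error E_seg(n_k, n_{k+1}).
import numpy as np


def segment_error(Y, nk, nk1):
    """Accumulated modelling error for frames nk <= n < nk1 of block Y (I x N)."""
    a_k, a_k1 = Y[:, nk], Y[:, nk1]        # initial targets y(n_k), y(n_{k+1})
    G = np.array([[a_k @ a_k,  a_k @ a_k1],
                  [a_k @ a_k1, a_k1 @ a_k1]])
    err = 0.0
    for n in range(nk, nk1):
        rhs = np.array([a_k @ Y[:, n], a_k1 @ Y[:, n]])
        phi_k, phi_k1 = np.linalg.solve(G, rhs)        # equation (9)
        y_hat = phi_k * a_k + phi_k1 * a_k1
        err += float(np.sum((Y[:, n] - y_hat) ** 2))   # equation (8)
    return err                                         # equation (10)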

3.3. Optimization of event localization task

The previous subsection described the computational procedure for evaluating the optimum event functions, {φ1(n), φ2(n), . . . , φK(n)}, and the corresponding accumulated modelling error for a block of speech parameters, Eblock(n1, n2, . . . , nK), for a given candidate set of event locations, {n1, n2, . . . , nK}. The procedure relies on the initial approximation {y(n1), y(n2), . . . , y(nK)} for the event target set {a1, a2, . . . , aK}. Section 3.5 will describe a method of refining this initial approximation of the event target set to obtain an optimum result in terms of the speech parameter reconstruction accuracy of the TD model. With the above knowledge, the optimum event localizing task can be formulated as follows. Given a block of speech parameter frames, y(n), where 1 ≤ n ≤ N, and the number of events, K, allocated to the block (this determines the resolution, in event/s, of the TD analysis), we need to find the optimum locations of the events, {n∗1, n∗2, . . . , n∗K}, such that Eblock(n1, n2, . . . , nK) is minimized, where nk ∈ {1, 2, . . . , N} for 1 ≤ k ≤ K and n1 < n2 < · · · < nK. The minimum accumulated error for a block can be given as

\[
E^{*}_{\mathrm{block}} = E_{\mathrm{block}}\bigl(n^{*}_1, n^{*}_2, \ldots, n^{*}_K\bigr).
\tag{12}
\]

It should be noted that E∗block versus K/N describes the rate-distortion performance of the TD model.

3.4. Dynamic programming formulation

A dynamic programming-based solution [14] for the optimum event localizing task can be formulated as follows. We define D(nk) as the accumulated error from the first frame of the parameter block up to the kth event location, nk,

\[
D\bigl(n_k\bigr) = \sum_{n=1}^{n_k - 1} E(n).
\tag{13}
\]

Also note that

\[
D\bigl(n_{K+1}\bigr) = D(N + 1) = E_{\mathrm{block}}\bigl(n_1, n_2, \ldots, n_K\bigr).
\tag{14}
\]

The minimum of the accumulated error, E∗block, can be calculated using the following recursive formula:

\[
D\bigl(n_k\bigr) = \min_{n_{k-1} \in R_{k-1}} \Bigl[ D\bigl(n_{k-1}\bigr) + E_{\mathrm{seg}}\bigl(n_{k-1}, n_k\bigr) \Bigr],
\tag{15}
\]

for k = 1, 2, . . . , K + 1, where D(n0) = 0. The corresponding optimum event locations can be found using

\[
n_{k-1} = \arg\min_{n_{k-1} \in R_{k-1}} \Bigl[ D\bigl(n_{k-1}\bigr) + E_{\mathrm{seg}}\bigl(n_{k-1}, n_k\bigr) \Bigr],
\tag{16}
\]

for k = 1, 2, . . . , K + 1, where Rk−1 is the search range for the (k − 1)th event location, nk−1. Figure 4 illustrates the dynamic programming formulation. For a full search assuring the global optimum, the search range Rk−1 will be the interval between nk−2 and nk:

\[
R_{k-1} = \bigl\{\, n \mid n_{k-2} < n < n_k \,\bigr\}.
\tag{17}
\]

The recursive formula in (15) can be solved for increasing values of k, starting with k = 1. Substitution of k = 1 into (15) gives D(n1) = Eseg(n0, n1), where n0 = 0. Thus, values


Figure 4: Dynamic programming formulation.

of D(n1) for all possible n1 can be calculated. Substitution of k = 2 into (15) gives

\[
D\bigl(n_2\bigr) = \min_{n_1 \in R_1} \Bigl[ D\bigl(n_1\bigr) + E_{\mathrm{seg}}\bigl(n_1, n_2\bigr) \Bigr],
\tag{18}
\]

where R1 = {n | n0 < n < n2}. Using (18), D(n2) can be calculated for all possible n1 and n2 combinations. This procedure (the Viterbi algorithm [15]) can be repeated to obtain D(nk) sequentially for k = 1, 2, . . . , K + 1. The final step with k = K + 1 gives D(nK+1) = Eblock(n1, n2, . . . , nK) and the corresponding optimal locations for n1, n2, . . . , nK (as given by (14)). Also, by decreasing the search range Rk−1 in (17), a desired performance versus computational cost trade-off can be achieved for the event localizing task. However, the results reported in this paper are based on the full search range and thus guarantee the optimum event locations.
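The recursion (13) through (17) can be sketched as the dynamic program below. The seg_err argument is a per-segment error function of the kind sketched after Section 3.2, the handling of the first and last (single-event) segments is simplified, and all names are illustrative assumptions rather than the authors' implementation.

# Illustrative dynamic program over event locations: D[k, n] is the best
# accumulated error with the kth event at frame n; backtracking recovers the
# optimum locations. Boundary segments are handled in a simplified way.
import numpy as np


def locate_events(Y, K, seg_err):
    """Return K optimum event locations (frame indices) for the block Y (I x N)."""
    N = Y.shape[1]
    D = np.full((K + 1, N), np.inf)
    back = np.zeros((K + 1, N), dtype=int)
    for n1 in range(N):                      # first (single-event) segment
        D[1, n1] = seg_err(Y, 0, n1)
    for k in range(2, K + 1):                # recursion of equation (15)
        for nk in range(k - 1, N):
            costs = [D[k - 1, p] + seg_err(Y, p, nk) for p in range(k - 2, nk)]
            best = int(np.argmin(costs))
            D[k, nk] = costs[best]
            back[k, nk] = best + (k - 2)     # arg min of equation (16)
    # Close the block with the last (single-event) segment and backtrack.
    totals = [D[K, nK] + seg_err(Y, nK, N - 1) for nK in range(K - 1, N)]
    events = [int(np.argmin(totals)) + (K - 1)]
    for k in range(K, 1, -1):
        events.append(int(back[k, events[-1]]))
    return events[::-1]


# Usage, with the segment_error sketch from Section 3.2:
# events = locate_events(Y, K=5, seg_err=segment_error)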

3.5. Refinement of event targets

The optimization procedure described in Sections 3.2 through 3.4 determines the optimum set of event functions, {φ1(n), φ2(n), . . . , φK(n)}, and the optimum set of event locations, {n1, n2, . . . , nK}, based on the initial approximation {y(n1), y(n2), . . . , y(nK)} for the event target set, {a1, a2, . . . , aK}. We refine the initial set of event targets to further improve the modelling accuracy of the TD model. The event target vectors, ak, can be refined by reevaluating them to minimize the reconstruction error for the speech parameters. This refinement process is based on the set of event functions determined in Section 3.4. Consider the modelling error Ei for the ith speech parameter trajectory within a block, given by

\[
E_i = \sum_{n=1}^{N} \Bigl( y_i(n) - \sum_{k=1}^{K} a_{ki}\,\phi_k(n) \Bigr)^{2}, \qquad 1 \le i \le I,
\tag{19}
\]

where yi(n) and aki are the ith elements of the speech parameter vector, y(n), and the event target vector, ak, respectively. The partial derivative of Ei with respect to ari can be calculated as

\[
\frac{\partial E_i}{\partial a_{ri}}
= \sum_{n=1}^{N} \Bigl( y_i(n) - \sum_{k=1}^{K} a_{ki}\,\phi_k(n) \Bigr)\bigl(-2\phi_r(n)\bigr)
= -2\Biggl[ \sum_{n=1}^{N} y_i(n)\,\phi_r(n) - \sum_{k=1}^{K} a_{ki} \sum_{n=1}^{N} \phi_k(n)\,\phi_r(n) \Biggr].
\tag{20}
\]

Figure 5: The block overlapping technique. (The first frame of the next block coincides with the last target of the present block.)

Therefore, setting the above partial derivative to zero, we obtain

\[
\sum_{k=1}^{K} a_{ki} \sum_{n=1}^{N} \phi_k(n)\,\phi_r(n) = \sum_{n=1}^{N} y_i(n)\,\phi_r(n),
\tag{21}
\]

where 1 ≤ r ≤ K and 1 ≤ i ≤ I. Equation (21) gives I sets of K simultaneous equations in K unknowns, which can be solved to determine the elements of the event target vectors, aki. This refined set of event targets can be iteratively used to further optimize the event functions and event locations using the dynamic programming formulation described in Section 3.4.
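For illustration, equation (21) can be solved in matrix form for all I trajectories at once, as in the sketch below; Y, Phi, and A follow the matrix notation of the TD model, and the code illustrates the linear algebra rather than the authors' implementation.

# Sketch of the target refinement (19)-(21): A (I x K) minimizing ||Y - A Phi||^2
# given the event functions Phi (K x N) from the dynamic-programming step.
import numpy as np


def refine_targets(Y, Phi):
    """Y: I x N speech parameter block, Phi: K x N event functions."""
    G = Phi @ Phi.T          # K x K: sums over n of phi_k(n) * phi_r(n)
    B = Y @ Phi.T            # I x K: sums over n of y_i(n) * phi_r(n)
    # Equation (21) in matrix form is A G = B; G is symmetric, so solve G X = B^T.
    A = np.linalg.solve(G, B.T).T
    return A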

3.6. Overlapping buffering technique

If no overlapping is allowed between adjacent blocks, the spectral error will tend to be relatively high for the frames near the block boundaries. This is due to the fact that the first and last segments, 1 ≤ n ≤ n1 and nK ≤ n ≤ N, are only described by a single event target instead of two, as described in Section 3.2. The block overlapping technique effectively overcomes this problem by forcing each transmitted block to start and end at an event location. During analysis, the block length N is kept fixed. Overlapping is introduced so that the location of the first frame of the next block coincides with the location of the last event of the present block, as shown in Figure 5. This makes each transmitted block slightly shorter than N, but its starting and end frames now coincide with an event location. The block length N determines the algorithmic delay introduced in analyzing continuous speech.

4. NUMERICAL RESULTS WITH OTD

4.1. Speech data and performance measure

A speech data set consisting of 16 phonetically diverse sentences from the TIMIT¹ speech database was used to evaluate the modelling performance of OTD. MELP [16] spectral parameters, that is, LSFs, calculated at 22.5-millisecond frame intervals were used as the speech parameters for TD analysis.

¹The TIMIT acoustic-phonetic continuous speech corpus has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge, and for the development and evaluation of speech processing systems in general.


The block size was set to N = 20 frames (450 milliseconds). The number of iterations was set to 5, as further iterations achieve only a negligible (less than 0.01 dB) improvement in TD model accuracy. Spectral distortion (SD) [13] was used as the objective performance measure. The spectral distortion, Dn, for the nth frame is defined in dB as

\[
D_n = \sqrt{\frac{1}{2\pi}\int_{-\pi}^{\pi}\Bigl[\,10\log\bigl(S_n\bigl(e^{j\omega}\bigr)\bigr) - 10\log\bigl(\hat{S}_n\bigl(e^{j\omega}\bigr)\bigr)\Bigr]^{2}\,d\omega}\ \ \mathrm{dB},
\tag{22}
\]

where Sn(e^jω) and Ŝn(e^jω) are the LPC power spectra corresponding to the original spectral parameters y(n) and the TD model (i.e., reconstructed) spectral parameters ŷ(n), respectively.
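A hedged sketch of how this measure can be evaluated numerically is shown below. It assumes that the per-frame LPC coefficient vectors (rather than LSFs) are already available and approximates the integral in (22) by a uniform frequency grid; the function names and the grid size are illustrative assumptions.

# Sketch of the spectral distortion (22): RMS log-spectral difference between
# original and reconstructed LPC power spectra, averaged over frames.
import numpy as np
from scipy.signal import freqz


def lpc_power_spectrum(lpc, n_points=512):
    """|1 / A(e^{jw})|^2 for an LPC polynomial A(z) = 1 + a_1 z^-1 + ... ."""
    _, h = freqz([1.0], lpc, worN=n_points)
    return np.abs(h) ** 2


def average_spectral_distortion(lpc_orig, lpc_model, n_points=512):
    """Average SD in dB over paired lists of per-frame LPC coefficient vectors."""
    sd = []
    for a, a_hat in zip(lpc_orig, lpc_model):
        S = 10.0 * np.log10(lpc_power_spectrum(a, n_points))
        S_hat = 10.0 * np.log10(lpc_power_spectrum(a_hat, n_points))
        sd.append(np.sqrt(np.mean((S - S_hat) ** 2)))    # discretized (22)
    return float(np.mean(sd))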

4.2. Performance evaluation

One important feature of the OTD algorithm is its ability to freely select an arbitrary number of events per block, that is, the average number of events per second (event rate). This was not the case in previous TD algorithms [9, 10, 11], where the number of events was limited by constraints such as spectral stability. The average event rate, also called the TD resolution, determines the reconstruction error (distortion) of the TD model. The event rate, erate, can be given as

e_{\text{rate}} = \Bigl(\frac{K}{N}\Bigr)\times f_{\text{rate}}, \qquad (23)

where frate is the base frame rate of the speech parameters. Lower distortion can be expected for higher TD resolution and vice versa, but higher resolution implies lower compression efficiency from an application point of view. This rate-distortion characteristic of the OTD algorithm is quite important for coding applications, and simulations were carried out to determine it. Average SD was evaluated for event rates of 4, 8, 12, 16, 20, and 24 event/s. Figure 6 shows an example of the event functions obtained for a block of speech. Figure 7 shows the average SD versus event rate. The base frame rate point, that is, 44.4 frame/s, is also shown for reference. The significance of the frame rate is that if the event rate is made equal to the frame rate (in this case 44.44 event/s), theoretically the average SD should become zero. This is the maximum possible TD resolution and corresponds to a situation where all event functions become unit impulses spaced at frame intervals and the event target values exactly equal the original spectral parameter frames. As can be seen, an average event rate of more than 12 event/s is required if the OTD model is to achieve an SD of less than 1 dB. It should be noted that at this stage the TD parameters are unquantized, and therefore only the modelling error accounts for the average SD.

4.3. Performance comparison with SBEL-TD

In the SBEL-TD algorithm [10], event localization is performed based on the a priori assumption of spectral stability and does not guarantee optimal event locations. Also, SBEL-TD incorporates an adaptive iterative technique to achieve the temporal nature (short duration of existence) of the event functions. In contrast, the OTD algorithm uses the modified model of TD (the temporal nature of the event functions is an inherent property of the model) and also uses the optimum locations for the events. In this section, the objective performance of the OTD algorithm is compared with that of the SBEL-TD algorithm [10] in terms of speech parameter modelling accuracy.

Figure 6: Bottom: an example of the event functions φk(n) obtained for a block of spectral parameters (frames 30–60); triangles indicate the event locations. Top: the corresponding speech waveform.

Figure 7: Average SD (dB) versus TD resolution (event/s) characteristic of the OTD algorithm. Average SD was evaluated for event rates of 4, 8, 12, 16, 20, and 24 event/s. The base frame rate point, that is, 44.4 frame/s, is also shown for reference.

OTD analysis was performed on the speech data set described in Section 4.1, with the event rate set to 12 event/s (N = 20 and K = 5). SBEL-TD analysis was also performed on the same spectral parameter set with the event rate set to approximately 12 event/s (for a valid comparison between the two TD algorithms, the same event rate should be selected). Spectral parameter reconstruction accuracy was calculated using the SD measure for the two algorithms. Table 1 shows the average SD and the percentage of outlier frames for the two algorithms.



Table 1: Average SD (dB) and the percentage of outliers for the SBEL-TD and OTD algorithms evaluated over the same speech data set. The event rate is set to approximately 12 event/s in both cases.

Algorithm    Average SD (dB)    ≤ 2 dB    2–4 dB    > 4 dB
SBEL-TD      1.82               72%       25%       3%
OTD          0.98               97%       3%        0%

As can be seen from the results in Table 1, the OTD algorithm achieved a significant improvement in terms of speech parameter modelling accuracy. Also, the percentage of outlier frames has been reduced significantly in the OTD case. These improvements of the OTD algorithm are critically important for speech coding applications. As reported in [12], SBEL-TD fails to realize good-quality synthesized speech because the TD parameter quantization error increases the postquantized average SD and the number of outliers to unacceptable levels. With a significant improvement in speech parameter modelling accuracy, OTD has a greater margin to accommodate the TD parameter quantization error, resulting in good-quality synthesized speech in coding applications. Sections 5 and 6 give the details of the proposed OTD-based speech coding scheme and the coder performance evaluation, respectively.

5. PROPOSED TD-BASED LPC VOCODER

5.1. Coder schematics

The mixed excitation LPC model [17] incorporated in the MELP coding standard [16] achieves good-quality synthesized speech at a bit rate of 2.4 kbit/s. The coder is based on a parametric model of speech operating on 22.5-millisecond speech frames. The MELP model parameters can be broadly categorized into two groups:

(1) excitation parameters, which model the excitation (i.e., the LPC residual) of the LPC synthesis filter and consist of Fourier magnitudes, gain, pitch, bandpass voicing strengths, and the aperiodic flag;

(2) spectral parameters, which represent the LPC filter coefficients and consist of the 10th-order LSFs.

With the above classification of MELP parameters, the MELP encoder can be represented as shown in Figure 8. The proposed OTD-based LPC vocoder uses the LPC excitation modelling and parameter quantization stages of the MELP coder, but uses block-based (i.e., delayed) OTD analysis and OTD parameter quantization for the spectral parameter encoding instead of the multistage vector quantization (MSVQ) [15] stage of the standard MELP coder. This proposed speech encoding scheme is shown in Figure 9. The underlying concept of the speech coder shown in Figure 9 is that it exploits the short-term redundancies (interframe and intraframe correlations) present in the spectral parameter frame sequence (line spectral frequencies), using TD modelling, for efficient encoding of the spectral information at very low bit rates.

Figure 8: Standard MELP speech encoder block diagram. (Input speech is passed through LPC analysis; the LPC excitation model parameters are quantized by the LPC excitation parameter quantization stage, while the spectral parameters are quantized by multistage VQ.)

Figure 9: Proposed speech encoder block diagram. (As in Figure 8, but the multistage VQ of the spectral parameters is replaced by TD modelling and quantization.)

To this end, the OTD algorithm was incorporated. The frame-based MSVQ stage of Figure 8 only accounts for the redundancies present within spectral frames (intraframe correlations), while the TD analysis and quantization stage of Figure 9 accounts for both the interframe and intraframe redundancies present in the spectral parameter sequence, and is therefore capable of achieving significantly higher compression ratios. It should be noted that the concept of TD can also be used to exploit the short-term redundancies present in some of the LPC excitation parameters using block-mode TD analysis. However, preliminary results of applying OTD to the LPC excitation parameters showed that the achievable coding gain is not significant compared to that for the LPC spectral parameters.

Figure 10 gives the detailed schematic of the TD modelling and quantization stage shown in Figure 9. The first stage buffers the spectral parameter vector sequence using a block size of N = 20 (20 × 22.5 = 450 milliseconds). This introduces a 450-millisecond processing delay at the encoder. OTD is performed on the buffered block of spectral parameters to obtain the TD parameters (event targets and event functions). The number of events calculated per block (N = 20) is set to K = 5, resulting in an average event rate of 12 event/s. The event target and event function quantization techniques are described in Section 5.2. The quantization code-book indices are transmitted to the speech decoder. Improved performance in terms of spectral parameter reconstruction accuracy can be achieved by coupling the TD analysis and TD parameter quantization stages, as shown in Figure 10 and sketched below.
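The sketch below summarizes the encoder pipeline of Figure 10; all callables are placeholders for the stages named in the figure (our own naming, not code from the paper):

```python
def encode_spectral_block(lsf_block, otd_analysis, quantize_functions,
                          refine_event_targets, quantize_targets):
    """Encode one buffered block of LSF vectors (N = 20 frames, K = 5 events).

    otd_analysis         -> (event_locations, event_targets, event_functions)
    quantize_functions   -> (indices, reconstructed_event_functions)
    refine_event_targets -> targets re-fitted against the *quantized* functions
    quantize_targets     -> (indices, reconstructed_event_targets)
    """
    locations, targets, functions = otd_analysis(lsf_block)

    # Quantize the event functions first (frame-level 2-D VQ, Section 5.2.1).
    func_idx, functions_q = quantize_functions(functions)

    # Coupling: refine the event targets against the quantized event functions
    # so that the analysis and quantization stages are jointly optimized.
    targets = refine_event_targets(lsf_block, functions_q)

    target_idx, _ = quantize_targets(targets)

    # Transmit the codebook indices plus differentially coded event locations.
    return func_idx, target_idx, locations
```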



Figure 10: Proposed spectral parameter encoding scheme based on OTD. (The spectral parameter sequence, i.e., the LSFs, is buffered with block overlapping; optimized TD analysis of the LSF block produces event targets and event functions; the event functions are vector quantized; the event targets are refined using the quantized event functions and then vector quantized.) For improved performance, coupling between the TD analysis and the quantization stage is incorporated.

The event targets from the TD analysis stage are refined using the quantized version of the event functions in order to optimize the overall performance of the TD analysis and TD parameter quantization stages.

5.2. OTD parameter quantization

5.2.1. Event function quantization

One choice for quantization of the event function set, φ1, φ2, ..., φK, for each block is to use vector quantization (VQ) [15] on the individual event functions, φk, in order to exploit any dependencies in event function shapes. However, the event functions are of variable length (φk extending from nk−1 to nk+1) and therefore require normalization to a fixed length before VQ. Investigations showed that the normalization-denormalization process itself introduces a considerable error, which adds to the quantization error. Therefore, we incorporated a frame-based 2-dimensional VQ for the event functions, which proved to be simple and effective. This was possible only because the modified TD model allows only two event functions to overlap at any frame location. The vectors [φk(n) φk+1(n)] were quantized individually. The distribution of the 2-dimensional vector points [φk(n) φk+1(n)] showed significant clustering, and this dependency was effectively exploited through the frame-level VQ of the event functions. Sixty-two phonetically diverse sentences from the TIMIT database, resulting in 8428 LSF frames, were used as the training set to generate code books of sizes 5, 6, 7, 8, and 9 bit using the LBG k-means algorithm [15].
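A minimal sketch of the frame-level 2-D VQ, assuming a trained codebook is available (codebook layout and function names are our own, not the paper's):

```python
import numpy as np

def quantize_frame_pair(phi_k_n, phi_k1_n, codebook):
    """Quantize the 2-D vector [phi_k(n), phi_{k+1}(n)] to the nearest codeword.

    codebook : (2**n1, 2) array of codewords, e.g., n1 = 7 -> 128 entries.
    Returns (index, reconstructed 2-D vector).
    """
    v = np.array([phi_k_n, phi_k1_n])
    idx = int(np.argmin(np.sum((codebook - v) ** 2, axis=1)))
    return idx, codebook[idx]

# Hypothetical 7-bit codebook and one frame of two overlapping event functions.
rng = np.random.default_rng(1)
cb = rng.random((128, 2))
index, rec = quantize_frame_pair(0.7, 0.3, cb)
print(index, rec)
```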

5.2.2. Event target quantization

Quantization of the event target set, a1, a2, ..., aK, for each block was performed by vector quantizing each target vector, ak, separately. Event targets are 10-dimensional LSFs, but they differ from the original LSFs due to the iterative refinement of the event targets incorporated in the TD analysis stage. VQ code books of sizes 6, 7, 8, and 9 bit were generated from the same training data set described in Section 5.2.1 using the LBG k-means algorithm [15].
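For illustration, a codebook of the kind used here could be trained with an off-the-shelf k-means routine as a stand-in for the LBG algorithm cited in the paper; the data, sizes, and function names below are hypothetical:

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

# Hypothetical training set: 8428 refined event targets, 10 LSFs each.
rng = np.random.default_rng(2)
train = rng.random((8428, 10))

# 9-bit codebook -> 2**9 = 512 codewords (k-means used in place of LBG).
codebook, _ = kmeans2(train, 512, minit='points')

# Quantize a new target vector to its nearest codeword index.
target = rng.random((1, 10))
index, _ = vq(target, codebook)
print(int(index[0]))
```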

6. CODER PERFORMANCE EVALUATION

6.1. Objective quality evaluation

Spectral parameters can be synthesized from the quantized event targets and the quantized event functions for each speech block as

\hat{\hat{y}}(n) = \sum_{k=1}^{K} \hat{a}_k\,\hat{\phi}_k(n), \qquad 1 \le n \le N, \qquad (24)

where \hat{\hat{y}}(n) is the nth spectral parameter vector synthesized at the decoder using the quantized TD parameters, and \hat{a}_k and \hat{\phi}_k denote the quantized event targets and event functions. Note that the double-hat notation is used here because the single-hat notation is already used in (5) to denote the spectral parameters synthesized from the unquantized TD parameters. The average error between the original spectral parameters, y(n), and the synthesized spectral parameters, \hat{\hat{y}}(n), calculated in terms of the average SD (dB), was used to evaluate the objective quality of the coder. The final bit rate requirement for the spectral parameters of the proposed compression scheme can be expressed in bit per frame as

B = n_1 + \frac{n_2 K}{N} + \frac{n_3 K}{N}\ \ \text{bit/frame}, \qquad (25)

where n1 and n2 are the sizes (in bit) of the code books for the event function quantization and the event target quantization, respectively. The parameter n3 denotes the number of bits required to code each event location within a given block. For the chosen block size (N = 20) and number of events per block (K = 5), the maximum possible segment length (nk+1 − nk) is 16. Therefore, the event location information can be losslessly coded using differential encoding with n3 = 4.
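As a quick check of (25), the sketch below evaluates B for the quantizer resolutions used later in the paper (N = 20, K = 5, n3 = 4):

```python
def bits_per_frame(n1, n2, n3=4, K=5, N=20):
    """Spectral-parameter bit rate B of (25) in bit/frame."""
    return n1 + (n2 * K) / N + (n3 * K) / N

print(bits_per_frame(7, 9))   # 10.25 bit/frame (operating point R3)
print(bits_per_frame(9, 9))   # 12.25 bit/frame (R1)
print(bits_per_frame(5, 6))   # 7.5   bit/frame (R6)
```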

6.1.1. Results of evaluation

A speech data set consisting of 16 phonetically diverse sentences from the TIMIT speech corpus was used as the test speech data set for SD analysis. This test speech data set was different from the speech data set used for VQ code-book training in Section 5.2.



Figure 11: Average SD against bit rate for the proposed speech coder with coupled TD analysis and TD parameter quantization stages. (The horizontal axis is the bit rate for spectral parameter coding, 7–13 bit/frame; the vertical axis is the average spectral distortion, 1.5–2 dB. One curve is plotted for each event function code-book size n1 = 5, 6, 7, 8, 9; the code-book size for event target quantization, n2 = 6–9, is depicted as (n2) at each point.)

Table 2: SD analysis results for the standard MELP coder and the proposed OTD-based speech coder operating at the TD parameter quantization resolutions n1 = 7 and n2 = 9.

Coder (bit/frame)    SD (dB)    < 2 dB    2–4 dB    > 4 dB
MELP (25)            1.22       91%       9%        0%
Proposed (10.25)     1.62       80%       20%       0%

The SD between the original spectral parameters and the spectral parameters reconstructed from the quantized TD parameters (given in (24)) was used as the objective performance measure. This SD was evaluated for different combinations of the event function and event target code-book sizes. The event location quantization resolution was fixed at n3 = 4 bit. Figure 11 shows the average SD (dB) for different n1 and n2 against the bit rate B.

6.1.2. Performance comparison

Figure 11 shows the average SD (dB) against the bit rate required for spectral parameter encoding in bit/frame. The standard MELP coder uses 25 bit/frame for the spectral parameters (line spectral frequencies). In order to compare the rate-distortion performance of the proposed coder with that of the standard MELP coder, the SD analysis was also performed for the standard MELP coder using the same speech data set. Table 2 shows the results of this analysis. For comparison, the SD analysis results obtained for the proposed coder with TD parameter quantization resolutions n1 = 7 and n2 = 9 are also shown in Table 2.

In comparison to the 25 bit/frame of the standard MELP coder, the proposed coder operating at n1 = 7 and n2 = 9 results in a bit rate of 10.25 bit/frame. This represents over 50% compression of the bit rate required for spectral information, at the expense of 0.4 dB of objective quality (spectral distortion) and 450 milliseconds of algorithmic coder delay.
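The quoted compression follows directly from the two bit rates; a simple check:

```python
melp_bits, proposed_bits = 25.0, 10.25
compression = 1.0 - proposed_bits / melp_bits
print(f"{compression:.0%}")  # 59% reduction of the spectral-parameter bit rate
```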

Table 3: Six operating bit rates of the proposed speech coder selected for subjective performance evaluation.

Rate    Bit/frame    n1 (bit)    n2 (bit)    Average SD (dB)
R1      12.25        9           9           1.579
R2      11.25        8           9           1.584
R3      10.25        7           9           1.629
R4      9.25         6           9           1.659
R5      8.25         5           9           1.724
R6      7.50         5           6           1.912

6.2. Subjective quality evaluation

In order to support the objective performance evaluation results, and to further verify the efficiency and applicability of the proposed speech coder design, a subjective performance evaluation was carried out in the form of listening tests. The 5-point degradation category rating (DCR) scale [18] was used to compare the subjective quality of the proposed coder to that of the standard MELP coder.

6.2.1. Experimental design

Six operating bit rates of the proposed speech coder with coupling between the TD analysis and TD parameter quantization stages (Figure 10) were selected for subjective evaluation. Table 3 gives the six selected operating bit rates together with the corresponding quantization code-book sizes for the TD parameters and the objective quality evaluation results. It should be noted that the speech coder operating points given in Table 3 have the best rate-distortion advantage within the grid of TD parameter quantizer resolutions (Figure 11) and were therefore selected for the subjective evaluation.

Sixteen nonexpert listeners were recruited for the listening test on a volunteer basis. Each listener was asked to listen to 30 pairs of speech sentences (stimuli) and to rate the degradation perceived in speech quality when comparing the second stimulus to the first in each pair. In each pair, the first stimulus contained speech synthesized using the standard MELP coder and the second stimulus contained speech synthesized using the proposed speech coder. The six operating bit rates given in Table 3 were evaluated, each with 5 pairs of sentences (including one null pair) per listener; therefore, a total of 30 (6 × 5) pairs of speech stimuli per listener were used. The null pairs, containing identical speech samples as the first and the second stimuli, were included to monitor any bias in the one-sided DCR scale used.

6.3. Results and analysis

The 30 pairs of speech stimuli, consisting of 5 pairs of sentences (including 1 null pair) from each of the 6 operating bit rates of the proposed speech coder, were presented to the 16 listeners. Therefore, a total of 64 (16 × 4) votes (DCRs) were obtained for each of the 6 operating bit rates, R1 to R6. Table 4 gives the DCR results obtained for each of the 6 operating bit rates of the proposed speech coder. It should be noted that the degradation was measured in comparison to the subjective quality of the standard MELP coder.



Table 4: Degradation category rating (DCR) results obtained for the 6 operating bit rates of the proposed speech coder.

Rate    Compression ratio    DCR votes (5 / 4 / 3 / 2 / 1)    DMOS
R1      51%                  31 / 23 / 10 / 0 / 0             4.33
R2      54%                  21 / 34 /  9 / 0 / 0             4.19
R3      59%                  22 / 28 / 14 / 0 / 0             4.13
R4      63%                  20 / 32 /  9 / 3 / 0             4.08
R5      67%                  16 / 21 / 25 / 2 / 0             3.80
R6      70%                   7 / 22 / 28 / 7 / 0             3.45

The degradation mean opinion score (DMOS) was calculated as the average of the listener ratings, that is, the DCR vote counts weighted by the rating values (1–5). As can be seen from the DMOS values in Table 4, the proposed speech coder achieves a DMOS of over 4 for the operating bit rates R1 to R4. This corresponds to compression ratios of 51% to 63%. Therefore, the proposed speech coder achieves over 50% compression of the bit rate required for spectral encoding with only a negligible degradation (between the "not perceivable" and "perceivable but not annoying" distortion levels) of the subjective quality of the synthesized speech. The DMOS drops below 4 for the bit rates R5 and R6, suggesting that, on average, the degradation in the subjective quality of the synthesized speech becomes perceivable and annoying for compression ratios over 63%.
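The DMOS column of Table 4 is reproduced by this straightforward weighted average; for example, for R1:

```python
def dmos(votes):
    """DMOS = sum(score * count) / total votes, with scores 5..1."""
    total = sum(votes.values())
    return sum(score * count for score, count in votes.items()) / total

r1 = {5: 31, 4: 23, 3: 10, 2: 0, 1: 0}
print(round(dmos(r1), 2))  # 4.33, matching Table 4
```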

7. CONCLUSIONS

We have proposed a dynamic programming-based optimization strategy for a modified TD model of speech. Optimum event localization, model accuracy control through the TD resolution, and an overlapping speech parameter buffering technique for continuous speech analysis are the main features of the proposed method. Improved objective performance in terms of modelling accuracy has been achieved compared to the SBEL-TD algorithm, in which event localization is based on the a priori assumption of spectral stability. A speech coding scheme was proposed based on the OTD algorithm and associated VQ-based TD parameter quantization techniques. The MELP model was used as the baseline parametric model of speech, with OTD incorporated for efficient compression of the spectral parameter information. The performance of the proposed speech coding scheme was evaluated in detail. The objective performance evaluation was performed in terms of log SD (dB), while the subjective performance evaluation was performed in terms of the DMOS calculated from DCR votes. The DCR listening test was performed in comparison to the quality of standard MELP synthesized speech. These evaluation results showed that the proposed speech coder achieves 50%–60% compression of the bit rate requirement for spectral parameter encoding with little degradation (between the

"not perceivable" and "perceivable but not annoying" distortion levels) of the subjective quality of the decoded speech. The proposed speech coder would find useful applications in voice store-and-forward messaging systems, multimedia voice output systems, and broadcasting.

ACKNOWLEDGMENTS

The authors would like to thank the members of the Center for Advanced Technology in Telecommunications and the School of Electrical and Computer Systems Engineering, RMIT University, who took part in the listening test.

REFERENCES

[1] T. Svendsen, "Segmental quantization of speech spectral information," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '94), vol. 1, pp. I517–I520, Adelaide, Australia, April 1994.

[2] D. J. Mudugamuwa and A. B. Bradley, "Optimal transform for segmented parametric speech coding," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '98), vol. 1, pp. 53–56, Seattle, Wash, USA, May 1998.

[3] D. J. Mudugamuwa and A. B. Bradley, "Adaptive transformation for segmented parametric speech coding," in Proc. 5th International Conf. on Spoken Language Processing (ICSLP '98), pp. 515–518, Sydney, Australia, November–December 1998.

[4] A. N. Lemma, W. B. Kleijn, and E. F. Deprettere, "LPC quantization using wavelet based temporal decomposition of the LSF," in Proc. 5th European Conference on Speech Communication and Technology (Eurospeech '97), pp. 1259–1262, Rhodes, Greece, September 1997.

[5] Y. Shiraki and M. Honda, "LPC speech coding based on variable-length segment quantization," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 36, no. 9, pp. 1437–1444, 1988.

[6] B. S. Atal, "Efficient coding of LPC parameters by temporal decomposition," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '83), pp. 81–84, Boston, Mass, USA, April 1983.

[7] S. M. Marcus and R. A. J. M. Van-Lieshout, "Temporal decomposition of speech," IPO Annual Progress Report, vol. 19, pp. 26–31, 1984.

[8] A. M. L. Van Dijk-Kappers and S. M. Marcus, "Temporal decomposition of speech," Speech Communication, vol. 8, no. 2, pp. 125–135, 1989.

[9] A. C. R. Nandasena and M. Akagi, "Spectral stability based event localizing temporal decomposition," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '98), pp. 957–960, Seattle, Wash, USA, May 1998.

[10] A. C. R. Nandasena, P. C. Nguyen, and M. Akagi, "Spectral stability based event localizing temporal decomposition," Computer Speech and Language, vol. 15, no. 4, pp. 381–401, 2001.

[11] S. Ghaemmaghami and M. Deriche, "A new approach to very low-rate speech coding using temporal decomposition," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '96), pp. 224–227, Atlanta, Ga, USA, May 1996.

[12] A. C. R. Nandasena, "A new approach to temporal decomposition of speech and its application to low-bit-rate speech coding," M.S. thesis, Department of Information Processing, School of Information Science, Japan Advanced Institute of Science and Technology, Hokuriku, Japan, September 1997.



[13] K. K. Paliwal, "Interpolation properties of linear prediction parametric representations," in Proc. 4th European Conference on Speech Communication and Technology (Eurospeech '95), pp. 1029–1032, Madrid, Spain, September 1995.

[14] D. P. Bertsekas, Dynamic Programming and Optimal Control, vol. 1 of Optimization and Computation Series, Athena Scientific, Belmont, Mass, USA, 2nd edition, 2000.

[15] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression, vol. 159 of Kluwer International Series in Engineering and Computer Science, Kluwer Academic, Dordrecht, The Netherlands, 1992.

[16] L. M. Supplee, R. P. Cohn, J. S. Collura, and A. V. McCree, "MELP: The new federal standard at 2400 bps," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP '97), pp. 1591–1594, Munich, Germany, April 1997.

[17] A. V. McCree and T. P. Barnwell, "A mixed excitation LPC vocoder model for low bit rate speech coding," IEEE Trans. Speech and Audio Processing, vol. 3, no. 4, pp. 242–250, 1995.

[18] P. Kroon, "Evaluation of speech coders," in Speech Coding and Synthesis, pp. 467–494, Elsevier Science, Amsterdam, The Netherlands, 1995.

Chandranath R. N. Athaudage was born in Sri Lanka in 1965. He received the B.S. degree in electronic and telecommunication engineering with first-class honours from the University of Moratuwa, Sri Lanka, in 1991, and the M.S. degree in information science from the Japan Advanced Institute of Science and Technology (JAIST) in 1997. He received his Ph.D. degree in electrical engineering from the Royal Melbourne Institute of Technology (RMIT), Australia, in 2001. Dr. Athaudage received a Japanese Government Fellowship during his graduate studies and an Academic Excellence Award from JAIST in 1997. During 1993–1994 he was an Assistant Lecturer at the University of Moratuwa, and during 1999–2000 a Lecturer at RMIT, where he taught undergraduate and graduate courses in digital signal processing and communication theory and systems. He has been a member of IEEE since 1995. Since 2001, he has been a Research Fellow at the Australian Research Council Special Research Centre for Ultra-Broadband Information Networks, University of Melbourne, Australia. His research interests include speech signal processing, multimedia communications, multicarrier systems, channel estimation, and synchronization for broadband wireless systems.

Alan B. Bradley received his M.S. degree in engineering from Monash University in 1972. In 1973, he joined RMIT University and completed a 29-year career holding the positions of Lecturer, Senior Lecturer, Principal Lecturer, Head of Department, and Associate Dean. In 1991, he became a Professor of signal processing at RMIT University. His research interests have been in the field of signal processing with specific emphasis on speech coding, speech processing, and speaker recognition. Earlier research was focused on the control of time- and frequency-domain aliasing cancellation in filter bank structures with application to speech coding. More recently, attention has been turned to two-dimensional time-frequency analysis structures and approaches to exploiting longer-term temporal redundancies in very low data rate speech coding. Alan Bradley retired from

RMIT University in 2002 and was granted the title of Professor Emeritus. He is now Manager Accreditation for The Institution of Engineers Australia, responsible for engineering education program accreditation in Australian universities. Professor Bradley is a member of IEEE as well as a Fellow of The Institution of Engineers Australia.

Margaret Lech received her M.S. degree in applied physics from the Maria Curie-Sklodowska University (UMCS), Poland, in 1982. This was followed by a Diploma degree in biomedical engineering in 1985 from the Warsaw Institute of Technology and a Ph.D. degree in electrical engineering from The University of Melbourne in 1993. From 1982 to 1987, Dr. Lech was working at The Institute of Physics, UMCS, conducting research on speech therapies for stutterers and diagnostic methods for subjects with systemic hypertension. From 1993 to 1995, she was working at Monash University, Australia, on the development of a noncontact measurement system for three-dimensional objects. In 1995, she joined The Bionic Ear Institute in Melbourne, and until 1997, she conducted research on psychophysical characteristics of hearing loss and on the development of speech processing schemes for digital hearing aids. Since 1997, Dr. Lech has been working as a Lecturer at the School of Electrical and Computer Engineering, RMIT University, Melbourne. She continues her research work in the areas of digital signal processing and system modelling and optimization.


EURASIP Journal on Applied Signal Processing 2003:10, 1027–1042
© 2003 Hindawi Publishing Corporation

On Securing Real-Time Speech Transmission over the Internet: An Experimental Study

Alessandro Aldini
Istituto di Scienze e Tecnologie dell'Informazione (STI), Università degli Studi di Urbino, 61029 Urbino, Italy
Email: [email protected]

Marco Roccetti
Dipartimento di Scienze dell'Informazione, Università di Bologna, 40127 Bologna, Italy
Email: [email protected]

Roberto Gorrieri
Dipartimento di Scienze dell'Informazione, Università di Bologna, 40127 Bologna, Italy
Email: [email protected]

Received 27 May 2002 and in revised form 3 January 2003

We analyze and compare several soft real-time applications designed for the secure transmission of packetized audio over the Internet. The main metrics we consider for the purposes of our analysis are (i) the computational load due to the coding/decoding phases, and (ii) the computational overhead of the encryption/decryption activities carried out by the audio tools of interest. The main result we present is that an appropriate degree of security may be guaranteed for real-time audio communications at a negligible computational cost if the adopted security strategies are integrated with the playout control mechanism incorporated in the audio tools.

Keywords and phrases: Internet, multimedia applications, real time, security.

1. INTRODUCTION

The Internet offers a best-effort service over public networks without security guarantees. Therefore, the provision of secure real-time audio applications over wide area networks (WANs) like the Internet has to be carefully addressed. In particular, the success of such applications depends strictly on the speech quality and the privacy guaranteed by the provided services, which have to be perceived as sufficiently good by their users. Based on these considerations, we concentrate our attention on the steps of the audio data flow pipeline (depicted in Figure 1) that affect the performance and the security features of applications designed for delivering secure real-time communications over the Internet. More precisely, in this paper we analyse several audio applications to investigate the overhead, in terms of additional latency, caused by the embedded data compression algorithms and securing mechanisms.

In general, the provision of both adequate performance and security for the above-mentioned applications has to be carefully examined and modeled because of some important constraining conditions, illustrated as follows.

(i) These applications are often constrained to work with very restricted resources (e.g., bandwidth) and under congested traffic conditions. In particular, network-based audio applications experience variable transmission delay; hence, the most common approach to ameliorate the effect of this inevitable problem is to adapt the application behavior to the variable network delays (see, e.g., [1]).

(ii) Real-time audio applications which employ public, untrusted, and uncontrolled networks have strict security requirements, namely, they have to guarantee authentication, confidentiality, and integrity of the conversation (see, e.g., [2]).

On the one hand, from a performance standpoint, many factors affect the computational cost of real-time audio applications over the Internet, such as codec, network access, transmission, operating system, and sound-card delays; a significant issue is the latency due to each of these components. For instance, efficient coding of the signal, carried out by the codec, is the first factor to be considered if we want to effectively exploit the available transmission rates over the network and to obtain at the receiver site the same speech quality as that generated at the sender site [3, 4].



While such a delay is relatively fixed, others depend on variable conditions. This is the case, for example, of network delays, for which the situation is quite critical. Indeed, since the current Internet service model offers a flat, classless, best-effort service, real-time audio traffic experiences unwanted delay variation (known as jitter) on the order of 500–1000 milliseconds for congested Internet links [5]. On the contrary, it is well accepted that telephony users find round-trip delays longer than 300 milliseconds more like a half-duplex connection than a real-time conversation (experience suggests that a delay of even 250 milliseconds is annoying despite the fact that message coherence is not affected). In addition, audio packet loss rates that are too large (over 10%) may have an awful impact on speech recognition [6, 7].

These observations highlight the importance of the trade-off between the stochastic end-to-end delays of the played-out audio packets and the packet loss percentage, especially when dealing with the problem of unpredictable jitter typical of network environments providing a best-effort service (see, e.g., [1, 8, 9, 10]). The problem of obtaining the optimal trade-off between these two aspects, and of facing the constraints on strict delays and losses tolerated in an unfavourable platform, is addressed by adaptive packet audio control algorithms (see, e.g., [1, 9, 10]), which adaptively adjust to the fluctuating network delays of the Internet in order to guarantee, when possible, an acceptable quality of the audio service.

On the other hand, from a security standpoint, real-time audio communications are a much less secure service than most people realize. It is relatively easy for anyone to eavesdrop on phone conversations. Indeed, critics claim that the Echelon system [11], a worldwide high-tech espionage system, is being used for crass commercial theft and a brutal invasion of privacy on a staggering scale. As far as audio applications over the Internet are concerned, anyone with a PC and access to the public network has the possibility to capture the network traffic, potentially compromising the privacy and the reliability of the provided services. Hence, it is mandatory for audio applications to guarantee authentication, confidentiality, and integrity of data.

In the light of the above considerations, in this paper we analyse some popular tools designed at the application layer for delivering secure real-time audio communications over the Internet, and we compare them by evaluating the computational overhead introduced by the coding algorithm and by the securing mechanism adopted by those tools. In particular, to carry out such a comparison, we have taken into account the following audio tools: Nautilus [12], PGPfone [13], Speak Freely [14], and BoAT [15]. The first three tools, Nautilus, Pretty Good Privacy Phone (PGPfone), and Speak Freely, have been designed for protecting audio communications, at the application level, on the basis of external cryptographic modules. The motivation behind our choice of these tools relies on the consideration that they are freeware and downloadable with complete source code.

Figure 1: Internetworking audio data flow pipeline. (The total end-to-end latency accumulates from the origin of transmission through coding and encryption, the local network (Ethernet, token ring, FDDI), the WAN interconnection, the remote local network, and finally decryption and decoding at the termination of transmission.)

On the other hand, the fourth audio tool we take into consideration, that is, BoAT, integrates, at the application level, the security mechanism together with the playout control algorithm. We point out that the freeware version of this tool [15] does not include the security infrastructure, whose implementation is an ongoing copyrighted project.

An important consideration concerning Nautilus, PGPfone, and Speak Freely is that these audio software packages do not include adaptive mechanized playout adjustment schemes. In contrast, BoAT has been designed to integrate the mechanism that adaptively adjusts the audio playout point to the variable network behavior with the algorithm that makes the conversation secure. On the one hand, the playout control scheme of BoAT offers a minimal per-packet communication overhead, as well as tolerable packet loss percentages and playout delays. On the other hand, the securing mechanism, integrated with the playout control algorithm, provides the receiver with a high assurance of secrecy, integrity, and authenticity of the conversation at a negligible computational cost, as long as the underlying cryptographic assumptions are enforced. More precisely, the securing algorithm of BoAT allows two trusted parties to have a private conversation by employing a stream cipher (differently from the other considered tools, which adopt block ciphers only) whose cryptanalysis is made much more difficult by the integration of this algorithm with the playout control mechanism. In particular, the securing algorithm naturally allows the parties taking part in the audio communication to agree on a sequence of session keys used by the particular stream cipher to encrypt data, where the lifetime of each key



is limited to a temporal interval not greater than one second of conversation (corresponding to less than 2^12 bits of transmitted data), whereas the best-known attacks on stream ciphers require 2^20 to 2^33 ciphertext bits (with complexity from 2^59 to 2^21, respectively) [16, 17].

Other popular tools include strong security mechanisms at the application level, for example, SecurePhone Professional [18] and SecuriPhone [19], two professional encrypted voice over Internet protocol (IP) tools; WebPhone [20], a shareware TCP/IP-based network phone tool; and MS NetMeeting [21], a freeware real-time Web phone (included in Windows 2000) which employs the MS Crypto APIs to support cryptographic services. In particular, NetMeeting is distributed in binary form only, and a comparison with the other tools would be complicated by the fact that some of its functionalities are embedded in the Windows operating system. On the other hand, there are other popular Web audio tools that do not consider security services with the same intensity. For instance, FreePhone [22] does not take security features into consideration at all, while NeVot [23] and rat [24] provide a simplistic privacy service (without authentication mechanisms and key exchange protocols) which consists in encrypting the conversation using the well-known DES block cipher [25] with a symmetric key somehow decided by the involved parties.

This paper is a full version of [26], which in turn is based on some preliminary ideas [27] proposed to embed security services into the audio tool BoAT [10, 15, 28]. Here, we report on a complete performance/security comparative analysis conducted on the audio tools Nautilus, PGPfone, Speak Freely, and BoAT, by measuring the computational overhead due to the coding/decoding phases and the computational overhead due to the encryption/decryption phases. The main results we obtained emphasize that the computational costs paid for the security mechanism are quite low with respect to those due to the coding activity, for each of the considered tools. In particular, thanks to its integrated approach, the security platform of BoAT pays a computational cost which turns out to be about two orders of magnitude lower than that of the other tools. We conclude these considerations by observing that our experimental results clearly show the influence that the coding activity exerts on the cost of the security mechanism in terms of computational overhead. Simply put, the higher the compression level imposed by the codec, the lower the computational overhead due to the securing algorithm (since less data are to be encrypted).

The remainder of the paper is organized as follows. In Section 2, we discuss the general problem of guaranteeing secure real-time audio communications over IP platforms, provide a succinct survey of the tools that guarantee security at the application level through external cryptographic modules, and present the system and adversary model with which all the tools have to cope. In Section 3, we present a detailed survey of the BoAT architecture. In Section 4, we introduce the experimental scenario we have developed for carrying out our analysis. In Sections 5 and 6,

we provide, respectively, the results of our experimental analysis with respect to the coding activity and the security mechanism. Finally, in Section 7, some conclusions terminate the paper.

2. SECURE AUDIO TRANSMISSION OVER IP

The need to consider security constraints when developing applications over IP is well accepted. IP underlies large academic and industrial networks as well as the Internet. IP's strength lies in its easy and flexible way of routing packets; however, its strength is also its weakness. In particular, the way IP routes packets makes large IP networks vulnerable to a range of security risks, for example, spoofing (meaning that a machine on the network masquerades as another) and sniffing (meaning that a third party listens in on a transmission between two other parties). In order to protect audio communications in such a scenario, different approaches can be exploited depending on the particular layer chosen to be equipped with a complete set of security services. In the following two subsections, we briefly introduce two different audio security approaches; the former amounts to the use of the network-level secure protocol termed IP-Sec, while the latter consists in equipping the networked audio applications with appropriate security mechanisms.

2.1. Securing speech at the network level

According to this approach, networked audio applications rely on the underlying secure internetworking structure to satisfy their security requirements. Usually, this transparent management of security is obtained by making the network layer secure. This is the case of IP-Sec [29], a collection of protocols and mechanisms adopted to extend the classical IP layer with authentication and security features.

The IP security (IP-Sec) protocol suite [29], developed by the Internet Engineering Task Force (IETF), defines a set of IP extensions for the provision of a secure, virtual, and private network which is as safe as or safer than an isolated local area network (LAN), but built on an unsecured, public network. The set of security services that IP-Sec can provide includes access control, connectionless integrity, data origin authentication, rejection of replayed packets (a form of partial sequence integrity and defence against unauthorized resending of data), confidentiality (encryption), and limited traffic flow confidentiality. IP-Sec technology seeks to secure the network itself instead of the applications that use it, as shown in Figure 2. Just as IP is transparent to the average user, so are the IP-Sec-based security services. Unlike classical application-specific methods for protecting communications, this protocol suite guarantees security for any application using the network. To make the network level secure, IP-Sec exploits three main traffic security technologies: (i) the Authentication Header (AH), through which authentication of packets is provided, (ii) the Encapsulating Security Payload (ESP), through which encryption of data is offered, and (iii) the Internet Key Exchange (IKE) protocol, which allows users to agree on keys and all related information.



Figure 2: IP-Sec within the network layers. (Two hosts, each running an application over TCP/UDP on top of IP-Sec, communicate over a secure link.)

Table 1: Packet size overhead due to the ESP header when the audio packet is generated by the codecs implemented in three different tools.

GSM (Speak Freely)    GSM (PGPfone)    LPC-10 (Nautilus)
+9%                   +57%             +39%

As already mentioned, these mechanisms are designed to be algorithm-independent.

Besides the optimal level of interoperability guaranteed by this standard, the computational costs imposed by its implementation must be carefully considered, especially when real-time applications have to be supported. In particular, these costs are associated with (i) the memory needed for the IP-Sec code and data structures, and (ii) the computational overhead due to the activities of header management and of data encryption and decryption, carried out on a per-packet basis. This per-packet computational cost translates into increased latency and reduced throughput. In addition, the use of IP-Sec also imposes bandwidth utilization overhead on the transmission/switching/routing elements of the Internet infrastructure, even if those components do not implement IP-Sec. This is due to the increase in packet size resulting from the addition of dedicated IP-Sec headers and from the increased traffic associated with key management protocols. For instance, the ESP header consists of a 10-byte segment, additional padding of variable length (0–255 bytes), and, finally, a message authentication code (MAC) whose length depends on the particular algorithm used to compute it. Such an additional header increases the packet size and can jeopardize the application throughput. This situation is particularly exacerbated when short audio samples are transmitted in each packet. To make this problem explicit, Table 1 shows the packet size overhead due to the ESP header in the case where the audio packet is generated by the GSM codecs of Speak Freely and PGPfone (columns 1 and 2) and by the LPC-10 codec of Nautilus (column 3), when the particular algorithm used for the MAC is MD5 [30]. It is also worth considering the analysis developed in [31], where the authors evaluate the performance of digital video transmission with IP-Sec over IPv6 networks using an ordinary PC platform. By adding the IP-Sec infrastructure, the throughput degrades to 1/9 of the performance without authentication or encryption.
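For intuition, the relative overhead reported in Table 1 is simply the fixed per-packet ESP addition divided by the audio payload size; the sketch below uses purely hypothetical byte counts (not the exact figures behind Table 1) to show why short audio payloads suffer most:

```python
def esp_overhead_percent(payload_bytes, esp_bytes):
    """Relative packet-size increase caused by adding esp_bytes to each packet."""
    return 100.0 * esp_bytes / payload_bytes

# Hypothetical per-packet ESP addition of 26 bytes (10-byte header + padding + MAC);
# the shorter the audio payload, the larger the relative overhead.
for payload in (320, 66, 33):
    print(f"{payload:4d}-byte payload: +{esp_overhead_percent(payload, 26):.0f}%")
```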

In conclusion, the provision of reliable, real-time, audio-quality data transmission over the Internet can be a hard task when using IP-Sec. For this reason, and because IP-Sec is not yet widely used over the Internet, other application-level security methods are to be considered in order to provide an adequate trade-off between security and performance.

2.2. Securing speech at the application level

According to this approach, we can exploit suitable hardware and software packages that, working at the application layer, are able to offer secure real-time audio communication over the Internet. Usually, these applications are responsible for taking the audio samples and then continuously digitizing, compressing, and encrypting them. After the encryption phase, the obtained audio packets are sent out through the network to the receiver site, where the reverse process is executed.

In this section, we give an overview of the software tools termed Nautilus, Speak Freely, and PGPfone, which provide secure audio communications over the Internet by exploiting, at the application level, appropriate external security modules. As already mentioned, BoAT instead guarantees secure conversations by merging the security strategy with the playout control mechanism. Due to the novelty of the approach adopted in BoAT, the general architecture of this tool, together with the related design issues, is presented separately in Section 3. For the sake of completeness, in Section 2.2.4 we formally provide the system model and the threat model with which the above audio tools have to cope.

2.2.1. Nautilus

Nautilus [12] is a popular audio tool that digitizes, encrypts, transmits, and plays out audio packets either on ordinary phone lines using modems or over TCP/IP networks, including the Internet. This tool provides usable speech quality at bandwidths as low as 4800 bps. The current version of Nautilus supports linear predictive coding [3] and exploits three different encryption functions.

The securing algorithm of Nautilus first generates an encryption key in one of two ways. In the former case, the key is generated from a secret passphrase that the users share. In the latter case, Nautilus generates the key by employing the Diffie-Hellman key exchange algorithm. Once the key is agreed upon in one of these two ways, it is used for the encryption of the rest of the conversation by means of one of three block ciphers (Triple DES, Blowfish, and IDEA [25]), to be selected by the user. It is worth noting that Nautilus was the first audio tool of this type freely distributed with source code (written in C), and it has been through four public beta test releases.

This tool is supported on two hardware platforms: IBM PC-compatibles and desktop Sun Sparcstations. In the former case, it supports (i) Windows platforms, including Windows 95, 98, and NT, (ii) Linux, and (iii) Solaris x86. In the latter case, SunOS or Solaris is needed.

2.2.2. Speak Freely

Speak Freely [14] is an audio tool for Windows machines and a variety of Unix workstations (Windows and Unix machines can intercommunicate), usable across a local network or the Internet.



Speak Freely is full duplex and provides a variety of compression modes, but if no compression mode is selected, it requires the network to reliably transmit 8000 Bps. Speak Freely incorporates a software implementation of the compression algorithm used in GSM digital cellular telephones, which permits operation over Internet links of modest bandwidth. For instance, by using GSM compression in conjunction with sample interpolation, the data rate can be reduced to about 9600 bps. Moreover, Speak Freely supports ADPCM compression to halve the data rate, and LPC-10, which compresses audio down to 346 Bps, yielding a compression factor of more than 26 to 1. Within Speak Freely, audio packets can be encrypted with IDEA, DES, Blowfish, or a method based on a binary key supplied in a file. Speak Freely cooperates with PGP [32] to automatically exchange session keys with users on the same public key ring.

Speak Freely supports multicasting and can interoperate with other Internet voice programs supporting the Internet Real-Time Transport Protocol (RTP) or the Lawrence Berkeley Laboratory Visual Audio Tool (VAT) protocol, a widely used Unix conferencing program.

2.2.3. PGPfone

PGPfone [13] is another popular tool which exploits external software modules in order to provide secure audio communications over the Internet. In particular, the audio transmission starts by transparently and dynamically negotiating the keys between the two parties using the Diffie-Hellman key exchange protocol. Then, the voice stream is encrypted by means of either Triple DES, CAST, or Blowfish, depending on the user's choice.

The tool architecture allows any speech compression algorithm to be negotiated between the two parties as long as both parties support the same algorithm in their respective versions of PGPfone. Currently, it supports the GSM speech compression algorithm and ADPCM compression for higher-bandwidth connections such as ISDN. PGPfone is copyrighted freeware for noncommercial use and is available for Windows machines and the Apple Macintosh.

2.2.4. The system model and the threat model

In this section, we define the environment in which the considered audio tools are expected to work, and the threat model such mechanisms should deal with, which basically reflects the assumptions of the Dolev-Yao model [33].

An ideal network can be expected to provide some precise properties; for instance, it should guarantee message delivery, deliver messages in the same order they are sent, deliver one copy of each message, and support synchronization between the sender and the receiver. All these properties are desirable in order to support real-time applications such as packetized audio transmission or multimedia conferencing over wide area networks.

However, the underlying network upon which we operate has certain limitations in the level of service it can provide. Some of the more typical limitations of the network we are going to consider are that it may

(i) drop messages,
(ii) reorder messages,
(iii) deliver duplicate copies of a given message,
(iv) limit messages to some finite size,
(v) deliver messages after an arbitrarily long delay.

A network with the above limitations is said to provide a best-effort level of service, as exemplified by the Internet. This model adequately represents the Internet as well as shared LANs, but not switched LANs. All the discussion and the results presented in the next sections are obtained under this model of the network.

As far as the adversary model is concerned, we argue that the audio tools we consider are also secure in the presence of a powerful adversary with the following capabilities:

(i) the adversary can eavesdrop, capture, drop, resend, delay, and alter packets;

(ii) the adversary has access to a fast network with negligible delay;

(iii) the adversary's computational resources are large but not unbounded. The adversary knows every detail of the cryptographic algorithm and is in possession of encryption/decryption equipment. Nonetheless, he cannot guess secret keys or invert pseudorandom functions with nonnegligible probability.

3. A SURVEY OF BoAT

In this section, we describe in detail the playout control software mechanism of [10], which was originally designed for controlling and adapting the audio application to the network conditions. In [27], some proposals have been discussed to extend the above algorithm with security features. Here, we give a detailed and formal explanation of the approach proposed in [26], which integrates the playout control activity with the security mechanism.

The original playout control algorithm of BoAT has been subjected to intense functional and performance analysis [8], which revealed its adequacy to guarantee real-time constraints, and it has been implemented in a software tool called BoAT [15]. The mechanism follows an adaptive approach and operates as follows. At the sending site, audio samples are periodically gathered, packetized, encrypted, and then transmitted to the receiving site, where the provision of a synchronous playout of the received audio packets is achieved by queueing the packets into a smoothing buffer and delaying their playout so as to maximize the percentage of packets that arrive before their playout point.

The playout control mechanism of BoAT assumes neither the existence of an external mechanism for maintaining an accurate clock synchronization between the sender and the receiver, nor a specific distribution of the end-to-end transmission delays. Such a scheme relies on a periodic synchronization between the sender and the receiver in order to obtain an estimation of the upper bound for the packet transmission delays experienced during the conversation.

Table 2: Steps of the handshaking protocol.

Direction    Message Type    Contents of packets
S → R        probe           sender time ts
R → S        response        sender time ts
S → R        install         RTT computed by S
R → S        ack             RTT computed by S

This upper bound is computed using round trip time (RTT) values obtained from the packet exchanges of a handshaking protocol periodically performed (about every second) between the two parties. The handshaking protocol can be exploited for a twofold goal:

(i) it allows the receiver to generate a synchronous playout of audio packets in spite of stochastic end-to-end network delays;

(ii) it allows the two authenticated parties to agree on a sequence of secret keys used to encrypt the conversation.

Before detailing the handshaking protocol and the related playout mechanism, we briefly explain the notation we adopt: S is the sender, R is the receiver, Mj is a chunk of conversation contained in a packet, and Pj denotes a packet composed of a timestamp and an audio sample Mj. We denote by K0 a symmetric key agreed on during a preliminary authentication phase (e.g., by using a regular digital signature scheme such as RSA [34]), and by Ki any subsequent session key agreed on between the two authenticated parties. Moreover, we assume that the packets of the handshaking phase are encrypted with K0 by using any one of the symmetric block ciphers, such as AES or Blowfish [25].

3.1. The playout control algorithm of BoAT

The first purpose of the synchronization protocol of BoAT is the provision of an adaptive control mechanism at the receiver site in order to properly play out the incoming audio packets. This is typically achieved by buffering the received audio packets and delaying their playouts so that most packets, in spite of stochastic end-to-end network delays, will have been received before their scheduled playout points. The success of such a strategy depends on a correct estimation of an upper bound for the maximum transmission delay. The technique we describe to achieve such an estimation is based on a three-way handshake protocol.

The first handshaking protocol precedes the conversation. As shown in Table 2, the sender begins the packet exchange by sending a probe packet timestamped with the time value shown by its own clock (ts). At the reception of this packet, the receiver sets its own clock to ts and immediately sends back a response packet. Upon receiving the response packet, the sender computes the value of the RTT by subtracting the value of the timestamp ts from the current value of its local clock. At that moment, the difference between the sender clock CS and the receiver clock CR is equal to an unknown quantity (say t0), which may range between a theoretical lower bound of 0 (i.e., the entire RTT has been consumed on the way back from the receiver to the sender) and a theoretical upper bound of RTT (i.e., the entire RTT has been consumed when the probe packet is transmitted from the sender to the receiver). Then, the sender transmits to the receiver an installation packet with the calculated RTT value attached. Upon receiving this packet, the receiver sets the time of its local clock by subtracting the value of the transmitted RTT from the current value of its local clock. At that moment, the difference between CS and CR is equal to a value given by

∆ = CS − CR = t0 + RTT, (1)

where ∆ ranges in the interval [RTT, 2 × RTT], depending on the unknown value of t0, which in turn may range in the interval [0, RTT]. Hence, the receiver is provided with the sender's estimate of an upper bound for the transmission delay, which can be used in order to dynamically adjust the playout delay and buffer. In essence, a maximum transmission delay equal to ∆ is left to the audio packets to arrive at the receiver in time for playout, and consequently a playout buffering space proportional to ∆ is required for packets with early arrivals.
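
To make the clock installation concrete, the following minimal sketch (not the BoAT sources) simulates the three-way handshake of Table 2 with two virtual clocks in a single Python process; the class and function names, the one_way_delay parameter, and the use of time.monotonic() are illustrative assumptions.

    import time

    class VirtualClock:
        # A settable clock built on top of the monotonic system clock.
        def __init__(self):
            self.offset = 0.0
        def now(self):
            return time.monotonic() + self.offset
        def set(self, value):
            self.offset = value - time.monotonic()

    def handshake(sender_clock, receiver_clock, one_way_delay=0.02):
        # S -> R: probe carrying the sender timestamp ts
        ts = sender_clock.now()
        time.sleep(one_way_delay)                        # probe in flight
        receiver_clock.set(ts)                           # receiver adopts ts on reception
        # R -> S: response
        time.sleep(one_way_delay)                        # response in flight
        rtt = sender_clock.now() - ts                    # RTT measured by the sender
        # S -> R: install packet carrying the computed RTT
        time.sleep(one_way_delay)
        receiver_clock.set(receiver_clock.now() - rtt)   # receiver moves back by RTT
        return rtt                                       # now Delta = CS - CR lies in [RTT, 2*RTT]

    if __name__ == "__main__":
        s, r = VirtualClock(), VirtualClock()
        rtt = handshake(s, r)
        print("RTT ~ %.3f s, Delta ~ %.3f s" % (rtt, s.now() - r.now()))

With symmetric one-way delays, the sketch yields a ∆ close to 1.5 × RTT, consistent with the interval [RTT, 2 × RTT] derived above.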

During the audio conversation, the sender timestamps each emitted audio packet Pj with the value of its local clock ts at the moment of the audio packet generation. When an audio packet arrives, its timestamp ts is compared with the value tr of the receiver clock, and then a decision is taken according to the rules shown in Table 3. Simply put, packets that arrive too late to be played out (ts < tr) are immediately discarded. In the same way, packets arriving too far in advance (ts > tr + ∆) are discarded, since their playout instant is beyond the temporal window represented by the buffer size. Instead, if tr ≤ ts ≤ tr + ∆, the packet arrives in time for being played out and is placed in the first empty location in the playout buffer. Then, the playout buffering space allows the packets that arrive in time for being played out to be scheduled according to the following rules. The playout instant of each packet that arrives in time is scheduled after a time interval equal to the positive difference between the values of ts and tr. Using the same rate adopted for the sampling of the original audio signal at the sender site, the playout process at the receiver site fetches audio packets from the buffer and sends them to the audio device for playout. More precisely, when the receiver clock shows a value tr, the playout process searches the buffer for the audio packet with timestamp tr. If such a packet is found, it is fetched from the buffer and sent to the audio device for immediate playout.
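
The following minimal sketch (an illustration, not the BoAT code) expresses the playout rules of Table 3, assuming integer timestamps expressed in packet periods and a dictionary used as the playout buffer; all names are illustrative.

    # Decide the fate of an arriving packet according to Table 3.
    def on_packet_arrival(buffer, ts, tr, delta):
        if ts < tr:
            return "discarded (too late)"
        if ts > tr + delta:
            return "discarded (too far in advance)"
        buffer[ts] = "audio sample"                  # in time: schedule playout at instant ts
        return "buffered"

    # At every clock tick tr, play the packet scheduled for tr, if present.
    def on_playout_tick(buffer, tr):
        return buffer.pop(tr, None)

    buf = {}
    print(on_packet_arrival(buf, 98, 100, 5))        # ts < tr              -> discarded
    print(on_packet_arrival(buf, 103, 100, 5))       # tr <= ts <= tr+delta -> buffered
    print(on_packet_arrival(buf, 107, 100, 5))       # ts > tr+delta        -> discarded
    print(on_playout_tick(buf, 103))                 # fetched when the clock reaches 103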

In order for the proposed policy to adaptively adjust to the highly fluctuating end-to-end delays experienced over wide area, packet-switched networks (like the Internet), the above-mentioned synchronization technique is first carried out prior to the beginning of the conversation, and then periodically repeated throughout the whole audio communication. The adopted period is about 1 second in order to prevent the two clocks (possibly equipped with different clock rates) from drifting apart. Thus, each time a new RTT is computed by the sender, it may be used by the receiver for adaptively setting the value of its local clock and the playout buffer size. This strategy guarantees that both the introduced additional playout time and the buffer size are always proportioned to the traffic conditions. However, it may not be possible to replace on the fly the current value of the receiver's clock and the dimension of its playout buffer. In fact, such an instantaneous adaptive adjustment of the parameters might introduce either gaps or even time collisions inside a talkspurt period during which the audio activity is carried out.

On the one hand, a gap occurs when a given sequence of audio packets is artificially contracted (or truncated) by the playout control mechanism, thus causing at the receiver an arbitrary skipping of a number of consecutive audio samples. This unwanted situation arises when an improvement of the traffic conditions of the underlying network causes a reduction of the estimated RTT. In such a case, as soon as the current synchronization is completed and the receiver installs the new parameters, the receiver's clock suddenly advances from its current value to a larger value. In [10], it is shown that, in order for the receiver to play out all the audio packets generated by the sending site without skipping any audio sample, it suffices that the sender transmits the installation packet as soon as the first silence period not smaller than an amount of time proportional to the improvement of the traffic conditions has elapsed. Since no audio packet is generated during the silence period, at the moment the receiver sets a new value for its own clock, no audio packet is waiting for its playout instant in the receiver's buffer.

On the other hand, a time collision occurs when audio packets that would be too late for playout according to the current synchronization would instead be considered in time for playout if they were processed by the receiver's buffer as soon as a new synchronization has been completed. This situation arises in case of a deterioration of the traffic conditions over the underlying network. In such a case, the installation of a new synchronization causes the receiver's clock to be moved back from its current value; thus, in order to avoid collisions, it is necessary that the receiver does not play out, when the new synchronization is active, any audio packet that was generated when the old synchronization was active. Again, in [10], it is shown that, in order to circumvent the problem raised by such a scenario, it suffices that the receiver installs the new value for its own clock only at the beginning of a silence period signalled by the sender.

In general, the installation at the receiver of the values of the receiver's playout clock and of the buffer dimension is carried out only during the periods of audio inactivity, when no audio packets are generated by the sender (i.e., during silence periods between different talkspurts). The reader interested in the proofs concerning the policies described above should refer to [10].

3.2. The securing algorithm of BoAT

In this section, we show how to integrate the playout control algorithm of BoAT with security services allowing confidentiality, integrity, and authenticity to be preserved. This is obtained in two steps. On the one hand, we protect the handshaking packets against spoofing and sniffing attempts. On the other hand, we employ the handshaking protocol to secure the whole audio conversation.

Table 3: Playout rules at the receiver site.

Condition           Effect on the packet    Motivation
ts < tr             discarded               it arrived too late to be played out
ts > tr + ∆         discarded               it arrived too far in advance of its playout
tr ≤ ts ≤ tr + ∆    buffered                it arrived in time for being played out

As far as secrecy is concerned, we show that the robustness of the privacy mechanism of BoAT depends on (i) the particular stream cipher we adopt and (ii) the lifetime of the secret keys used during the conversation. As far as authenticity is concerned, we show that, after a preliminary authentication phase, the two trusted parties are provided with data origin authentication during the conversation lifetime. As far as integrity is concerned, we show that the receiving trusted party can unambiguously decide that a received packet Pj (timestamped with a value ts) is exactly the same packet Pj sent at the instant ts by the sending trusted party.

3.2.1. The handshaking protocol

The original handshaking protocol of BoAT is exploited in order to exchange fresh session keys between the two authenticated parties, providing one key for each synchronization phase. Such a key will be used to secure the conversation and will have a lifetime equal to at most 1 second, namely, the time between two consecutive synchronizations. More precisely, we adopt the exchanged key as the session key of a stream cipher used to encrypt the audio data. A stream cipher is a symmetric encryption algorithm which is usually faster than a block cipher. While block ciphers operate on large blocks of data, stream ciphers typically operate on smaller units of plaintext, usually bits. A stream cipher generates what is called a keystream starting from a session key K, which is used as a seed for the pseudorandom generation of the keystream. Encryption is accomplished by combining the keystream with the plaintext, usually with the bitwise XOR operation. Examples of well-known stream ciphers are A5/1 [35] (used by about 130 million GSM customers in Europe to protect the over-the-air privacy of their cellular voice and data communication), RC4 [25] (by RSA's group), and SEAL [36].
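
As a concrete illustration of keystream-based encryption, the following compact textbook sketch of the RC4 keystream generator (shown only for illustration; it is not taken from any of the tools, and RC4 is no longer recommended for new designs) encrypts a packet by XORing it with the keystream derived from a session key.

    def rc4_keystream(key: bytes, n: int) -> bytes:
        # Key-scheduling algorithm (KSA)
        S = list(range(256))
        j = 0
        for i in range(256):
            j = (j + S[i] + key[i % len(key)]) % 256
            S[i], S[j] = S[j], S[i]
        # Pseudo-random generation algorithm (PRGA)
        out, i, j = bytearray(), 0, 0
        for _ in range(n):
            i = (i + 1) % 256
            j = (j + S[i]) % 256
            S[i], S[j] = S[j], S[i]
            out.append(S[(S[i] + S[j]) % 256])
        return bytes(out)

    def rc4_xor(key: bytes, data: bytes) -> bytes:
        # Encryption and decryption coincide: XOR with the keystream
        return bytes(a ^ b for a, b in zip(data, rc4_keystream(key, len(data))))

    key = b"128-bit session!"                      # illustrative 16-byte session key
    ct = rc4_xor(key, b"one second of audio samples")
    assert rc4_xor(key, ct) == b"one second of audio samples"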

The packets of the handshaking phases, instead of being encrypted with the particular stream cipher, are encrypted by employing the initial key K0 and a block cipher that can use long keys in order to strengthen the security assumptions (e.g., up to 448-bit keys in the case of Blowfish, or up to 2040-bit keys in the case of RC6). During the generic handshaking phase i, the two authenticated parties agree on a 128-bit session key Ki (e.g., exchanged in the install packet). Whenever the handshaking protocol has a positive outcome, Ki is the new key used to secure the subsequent chunk of conversation. Since the handshaking protocol is periodically started during the conversation, a sequence of keys {Ki}, i ∈ N, is generated.

In order to guarantee the correct behavior of the above mechanism, both sender and receiver must come to an agreement. In particular, the sender site has to know whether the receiver site has received the new key Ki in order to decide to employ such a key to encrypt the following audio samples. Hence, upon receiving the installation packet, the receiver sends back an ack packet. At the reception of this packet, the sender starts to use the new key. An additional piece of information in each audio packet is used as a flag in order to inform the receiver that the key has changed and is exactly Ki. For instance, by following a policy inspired by the alternating bit protocol, if each packet encrypted with the key Ki is transmitted with a flag bit set to 0, then whenever a new synchronization phase is completed, each subsequent packet is transmitted with the bit set to 1. It is worth noting that if either the installation packet or the ack packet does not arrive at its destination, both sender and receiver carry on the communication by using the old key. Indeed, on the one hand, the sender begins to encrypt the outgoing audio packets with the new key only if it receives the ack packet. On the other hand, the receiver begins to decrypt the incoming audio packets with the new key as soon as it receives a packet whose flag has been changed with respect to the previously received packets. The presented policy does not require additional overhead on the original scheme because it relies on the handshaking protocol only.
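
A minimal sketch of this alternating-bit key-change policy is given below (illustrative only; the class names and the packet layout are assumptions): the sender switches to the new session key Ki only after the ack has been received and signals the change by flipping a one-bit flag carried in every audio packet, while the receiver switches as soon as it sees a flipped flag.

    class SenderKeyState:
        def __init__(self, key):
            self.key, self.flag = key, 0
        def on_ack_received(self, new_key):
            # ack for the install packet: start using the new key, flip the flag
            self.key, self.flag = new_key, self.flag ^ 1
        def tag_packet(self, payload):
            return {"flag": self.flag, "payload": payload}

    class ReceiverKeyState:
        def __init__(self, key):
            self.key, self.flag, self.pending = key, 0, None
        def on_install_received(self, new_key):
            self.pending = new_key                # hold the new key until the flag flips
        def key_for(self, packet):
            if packet["flag"] != self.flag and self.pending is not None:
                self.key, self.flag, self.pending = self.pending, packet["flag"], None
            return self.key                       # key to decrypt this packet with

    sender, receiver = SenderKeyState(b"K0"), ReceiverKeyState(b"K0")
    receiver.on_install_received(b"K1")           # install packet delivered to the receiver
    sender.on_ack_received(b"K1")                 # ack delivered back to the sender
    assert receiver.key_for(sender.tag_packet(b"audio")) == b"K1"

If the install or the ack packet is lost, neither on_install_received nor on_ack_received is invoked and both parties simply keep using the old key, as in the policy above.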

As far as the secrecy, authenticity, and integrity conditions of the handshaking protocol are concerned, the following remarks are in order.

(i) An adversary can try to corrupt the result of the handshaking protocol so that the two parties, after such a negotiation, disagree on the new key used for securing the conversation. In particular, he may try to forge or alter some packets of the handshaking phase, but he does not know the symmetric key used to encrypt them (e.g., he cannot create or alter a response packet with a given timestamp). In addition, he can cheat neither the sender nor the receiver by reusing any packet, because of the presence of the timestamp ts in the case of the probe and response packets, and the presence of the RTT in the case of the install and ack packets (e.g., during the generic handshaking phase i, he cannot masquerade as the receiver by transmitting to the sender the response and ack packets intercepted during a previously completed handshaking phase i − j).

(ii) An adversary can try to systematically drop the messages of the handshaking protocol so that the lifetime of the old session key is extended from 1 second to the whole duration of the conversation; in this way, much more data and time are at the disposal of a cryptanalysis attempt. Such a problem may be avoided by adopting the following policy. For each handshaking message, we create a packet containing the synchronization information encrypted with the block cipher and an audio sample filled with rubbish. Such a packet is first enriched with an additional field to inform the receiver that this is a handshaking packet, and then encrypted with the stream cipher, thus masquerading it as a normal audio packet. Finally, in order to make it harder to reveal the handshaking packets, the time instant at which a new phase is started by the sender can be randomly chosen, instead of being scheduled once per second as in the original proposal of the algorithm. With these assumptions in view, an adversary can only try to drop some packets in a random way and, as a consequence, he can break off several consecutive handshaking phases only with a negligible probability. In spite of this, an intensive traffic analysis during a full-duplex conversation could significantly restrict the temporal interval in which the two parties are expected to send packets of the handshaking phase. If we want the security mechanism to be more robust against this unlikely attack, we can shut down the conversation whenever more than n consecutive handshaking phases are not completed, for some suitable n depending on the strength of the cryptographic algorithm.

In essence, the handshaking protocol does not reveal any information flow allowing an adversary to spoof or sniff the conversation. Moreover, the same mechanism is robust to lost and misordered packets and makes no assumption on the service offered by the network. The described policy is similar to some well-known protocols for radio communications that are based on spread-spectrum frequency hopping, in the sense that during a conversation the transmission frequency is frequently changed in order to avoid interception and alteration. In the case of the securing mechanism of BoAT, the duration of every key is limited to the time between two consecutive synchronizations (at most one second for normal executions); thereby, this policy makes it difficult for an unauthenticated party to decode the encrypted data, and is practically robust to trivial breaks [25].

3.2.2. Securing the conversation

The session key exchanged during the handshaking phase is used by the particular stream cipher for the encryption of both the timestamp and the whole audio packet. More precisely, each audio packet belonging to the chunk of conversation i between the two consecutive synchronizations i and i + 1 is encrypted by resorting to the particular stream cipher and the session key Ki.

In order to guarantee authenticity and integrity of data, we employ this mechanism in conjunction with a MAC. In particular, we can adopt a mechanism similar to the HMAC-MD5 used also in [2] to ensure authenticity and integrity of the audio packets. Alternatively, we can encrypt (by the particular stream cipher) the output of a one-way hash function applied to the audio packet to ensure authenticity and integrity of the same packet. Examples of well-known hash functions are MD5 and SHA [25].

In Algorithm 1, we show such an approach, which guarantees a secure conversation. We denote by {Pj}Ki the audio packet Pj encrypted by using the stream cipher starting from the session key Ki, and by MAC(Ki, Pj) the message authentication code for the packet Pj obtained by resorting to the session key Ki.

Page 103: downloads.hindawi.comdownloads.hindawi.com/journals/specialissues/196487.pdf · 2012-02-02 · Editor-in-Chief Marc Moonen, Belgium Senior Advisory Editor K. J. Ray Liu, College Park,

On Securing Real-Time Speech over the Internet 1035

Sender

1. Pj = (ts, Mj)
2. Send P∗j = ({Pj}Ki, MAC(Ki, Pj))

Receiver

1. Receive P∗j
2. Compute ts and Mj by means of Ki
3. Verify the MAC

Algorithm 1: Securing algorithm.

The algorithm guarantees secrecy and satisfies the properties of authentication and integrity. More precisely, it guarantees the following condition. For each audio packet P∗j, which is generated with the above algorithm and received in time for its playout, the receiver can decide its playout instant and verify its integrity and the authenticity of the sender.
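
A minimal sketch of the packet construction of Algorithm 1 follows; it is an illustration under assumptions, not the BoAT implementation: the keystream is derived here with SHAKE-256 as a stand-in for the stream cipher, and HMAC-MD5 plays the role of MAC(Ki, Pj). All function names are illustrative.

    import hashlib, hmac, struct

    def _keystream(key: bytes, n: int) -> bytes:
        # Stand-in keystream generator (NOT the actual stream cipher of the tool)
        return hashlib.shake_256(key).digest(n)

    def seal(k_i: bytes, ts: int, audio: bytes):
        pj = struct.pack(">Q", ts) + audio                      # Pj = (ts, Mj)
        enc = bytes(a ^ b for a, b in zip(pj, _keystream(k_i, len(pj))))
        tag = hmac.new(k_i, pj, hashlib.md5).digest()           # MAC(Ki, Pj)
        return enc, tag                                         # P*j

    def open_packet(k_i: bytes, enc: bytes, tag: bytes):
        pj = bytes(a ^ b for a, b in zip(enc, _keystream(k_i, len(enc))))
        if not hmac.compare_digest(tag, hmac.new(k_i, pj, hashlib.md5).digest()):
            return None                                         # integrity check failed: discard
        return struct.unpack(">Q", pj[:8])[0], pj[8:]           # (ts, Mj)

    k_i = b"session key K_i!"
    enc, tag = seal(k_i, 123456, b"chunk of conversation Mj")
    assert open_packet(k_i, enc, tag) == (123456, b"chunk of conversation Mj")
    assert open_packet(k_i, enc, b"a forged MAC tag") is None   # altered MAC is rejected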

Secrecy

As far as secrecy is concerned, the security mechanism of BoAT offers to the trusted parties a high assurance of the privacy of the data transmitted during the conversation lifetime. In fact, we have shown that the handshaking protocol does not reveal any information about the secret keys exchanged between the trusted parties, and that an adversary as specified in Section 2 cannot guess secret keys. Secrecy is a crucial condition that the recent literature shows to be violated in glaring cases. For instance, consider the attack on the A5/1 algorithm (used in GSM systems [35]) proposed in [37], in which a single PC is shown to be able to extract the conversation key in real time from a small amount of generated output. In particular, the authors of [37] claim that a novel attack requires two minutes of data and one second of processing time to decrypt the conversation. Now assume that the particular cipher we choose to adopt is as weak as the A5/1 algorithm. In the approach of BoAT, in the absence of a powerful adversary able to identify and drop the handshaking messages, at least 120 different session keys are used during two minutes of conversation, so that the quantity of data that can be analyzed for a single key is not sufficient to perform the attack and to reveal the key and, consequently, the conversation. Moreover, in support of the robustness of the approach of BoAT, we point out that, in the recent literature, the best known attacks on some stream ciphers, proposed in [16], have complexity 2^59, require 2^20 bits of ciphertext, and are based on some restrictive assumptions on the characteristics of the stream cipher. In [17], a novel attack has a complexity gain of 2^21, but it requires 2^33 bits of ciphertext, and, in certain cases, the cipher can resist this attack. Because of this, an adversary can guess a session key only with a negligible probability; in any case, we recall that each session key may allow an adversary to decipher just one second of conversation, with no information about the remaining encrypted data. In general, it is worth noting that the relatively short lifetime of every session key improves the secrecy guarantees for any cryptographic algorithm. However, a study conducted in [8] revealed that lifetimes that are too short (e.g., less than 0.5 seconds) cause a worsening of the speech quality; therefore, a massive resort to such an approach should be carefully analyzed.

Authenticity

As far as authenticity is concerned, we first assume a preliminary authentication phase carried out by the two parties before the conversation (e.g., by resorting to a regular digital signature scheme). After this initial secure step, only the legitimate parties know the value of the symmetric key agreed on during this phase, and they can carry out the first packet exchange of the handshaking protocol by means of the symmetric key. In particular, as we have shown in Section 3.2.1, an adversary cannot start, carry out, and complete the packet exchange of such a synchronization protocol with any of the trusted parties. Later on, during the conversation, each packet is timestamped with the sender clock value at the moment of the audio packet generation, encrypted by means of the session key Ki, and authenticated by means of the MAC, so that each received packet can be played out only once, and only if it arrives in time for being played out, according to the adaptive adjustment carried out during the ith handshaking synchronization phase. The receiver is guaranteed that the audio packets encrypted by means of the key Ki and played out according to the piggybacked timestamp have been generated at (and sent by) the sender site. In fact, an adversary cannot behave as a "man in the middle" by generating new packets (as he does not know the session key and he cannot authenticate the packets) or by spoofing (as he can resend or delay packets, but the timestamp allows the receiver to discard such packets). Finally, we point out that the key Ki+1 is agreed on by resorting to a packet exchange encrypted by means of a secret key, and such a negotiation does not reveal any information about the new session key. For these reasons, we deduce that the authentication condition is preserved throughout the conversation lifetime.

Integrity

As far as integrity is concerned, the following remarks are in order. As a first result, we argue about the correctness of the algorithm, and then we show that an adversary cannot alter the content of the conversation obtained by applying the above-presented algorithm. In a first simplified scenario, we assume the system model without malicious parties. We consider a packet P∗j generated by the sender and arriving at the receiver site in time for its playout. As the trusted parties share the same session key, the receiver can compute the timestamp in order to schedule the playout instant of the packet, compute Mj in order to play out the audio packet, and check the MAC in order to verify the integrity of Mj. The effect of this behavior cannot be altered by an adversary, and we prove this fact by considering the potential moves of a malicious party. We assume the audio packets are generated by the sender and managed by the receiver as seen in the above algorithm, and we show that all the played out packets can be neither generated nor altered by an adversary with the capabilities specified in the threat model. In the case where the adversary eavesdrops, captures, drops, or delays a packet P∗j, the proof is trivial. In fact, in these cases the adversary can only prevent the receiver from receiving or playing out P∗j. The most interesting case arises whenever the adversary tries to alter P∗j. In particular, he can alter the encrypted timestamp, the plaintext Mj, or the MAC, but in this case the receiver notices the alteration by verifying the MAC, and therefore discards the packet. It is worth noting that it is computationally infeasible, given a packet Pj and the message authentication code MAC(Ki, Pj), to find another packet P′j such that MAC(Ki, Pj) = MAC(Ki, P′j). In addition, the adversary cannot send a new packet Pj to the receiver because he knows neither the session key nor the playout instant of the audio sample Mj he intends to forge.

4. EXPERIMENTAL SCENARIO

In this section, we describe the experimental scenario we have constructed to conduct the analysis of the audio tools of interest, namely, Nautilus, PGPfone, Speak Freely, and BoAT.

The experiments have been conducted with the following two machines: a 133-MHz Pentium processor with 48-MB RAM and an ISA Opti 16-bit audio card, and a 200-MHz MMX Pentium processor with 64-MB RAM and a PCI Yamaha 724 audio card. These workstations used two 10/100-Mbit Ethernet network cards to transmit packets over the underlying network. Both the Linux (RedHat 6.0) and Windows 98 operating systems have been used, depending on the analyzed audio tool.

In order to perform measurements of the computational overhead introduced by both securing and coding activities, while avoiding the issue of taking into account network delays, all the experiments were conducted as described in the following. For all the analyzed audio tools, each audio sample was first compressed by the codec employed within that tool, then encrypted by the corresponding securing algorithm, and finally transmitted over the network by the adopted Ethernet card. Hence, for each conducted experiment, we took, at the sending site, measurements of the time intervals between the packet generation instant and its transmission instant over the network. This policy has allowed us to take experimental measurements of the packetization/compression/encryption delays not affected by the problem of managing variable network delays. The reverse process was executed at the receiving site in order to evaluate the decompression/decryption delays. Each of those experiments was repeated 30 times with an individual duration (for each experiment) of 30 seconds. All the results (reported in the two following sections) have been obtained by averaging the experimental measurements taken in each repeated experiment.
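
The following minimal sketch reproduces the shape of this measurement policy (illustrative only: compress() and encrypt() are placeholder stand-ins for the codec and securing modules, while the repetition counts mirror the ones stated above).

    import time, statistics

    def compress(sample: bytes) -> bytes:
        return sample[::2]                           # placeholder codec

    def encrypt(data: bytes) -> bytes:
        return bytes(b ^ 0x5A for b in data)         # placeholder cipher

    def measure(repetitions=30, packets_per_second=50, seconds=30, packet=bytes(320)):
        per_run_ms = []
        for _ in range(repetitions):
            start = time.perf_counter()
            for _ in range(packets_per_second * seconds):
                encrypt(compress(packet))            # generation -> ready for transmission
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            per_run_ms.append(elapsed_ms / seconds)  # cost per second of conversation
        return statistics.mean(per_run_ms), statistics.variance(per_run_ms)

    mean_ms, var_ms = measure()
    print("mean = %.3f ms, variance = %.3f" % (mean_ms, var_ms))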

In order to allow the reader to understand the meaning of the reported experimental results, some important considerations concerning codecs are discussed in the following section.

4.1. Codecs

An efficient coding of the signal is the first factor to consider in order to allow speech to be reduced to a bandwidth fitting the network availability, and to obtain the same speech quality as generated at the sender site. For instance, telephone-quality speech needs 64 Kbps, but in most cases such bandwidth is not reachable over the Internet. Codecs are used to cope with this lack, but as the compression level increases (and the needed bandwidth decreases), the generated speech degrades and may become hard to understand. This issue has been addressed by the codec development efforts of the International Telecommunication Union (ITU). Hence, several codecs that work well under the constraint of scarce network bandwidth have been designed. As an example, the ITU codecs G.729 and G.723.1 [38] have been designed for transmitting audio data at bit rates ranging from 8 Kbps down to 5.3 Kbps.

In general, a trade-off exists between the loss of fidelity in the compression process and the amount of computation required to compress and decompress data. In turn, the more the data are compressed, the faster the encryption/decryption process, since less data has to be encrypted.

In the remainder of this section, we briefly survey the most characteristic features of the codecs embodied in the audio tools of interest, namely, GSM, ADPCM, LPC-10, and the wavelet-based codec of BoAT.

GSM compression employs the global system for mobile communications algorithm used by European digital cellular phones [39]. Speak Freely supports the standard GSM version, which can produce audio at a data rate of 1650 bytes per second, thus reducing the PCM basic data rate by a factor of almost five with little degradation of voice-grade audio. In turn, PGPfone supports two versions of GSM: standard GSM and a version called "GSM lite," which provides the same speech quality as full GSM, but with less effort and less bandwidth needed. PGPfone provides GSM and GSM lite with a range of sampling frequencies. More precisely, voice can be sampled at various rates: 4410, 6000, 7350, 8000, and 11025 samples per second. The faster GSM is sampled, the better the voice quality, but at the cost of a considerable computational load. Like most voice codecs, GSM is asymmetrical in its computational load for compression and decompression; indeed, decoding requires only about half the computation of encoding.

ADPCM [3] compression uses adaptive differential pulse code modulation and delivers high sound quality (the loss in fidelity is barely perceptible) with low computing loads, but at the cost of a higher bit rate with respect to GSM. Both Speak Freely and PGPfone support ADPCM.

LPC-10 [3] uses a version of the linear predictive coding algorithm (as specified by United States Department of Defense Federal Standard 1015/NATO-STANAG-4198) and achieves the greatest degree of compression, but, like GSM, it is extremely computationally intensive. LPC-10 requires many calculations to be done in floating point and may not run in real time on a machine without an FPU. Audio fidelity in LPC-10 is less than what may be achieved with GSM. The high degree of compression achieved by LPC-10 permits the use of low-speed Internet links. In addition, the computational overhead per audio packet due to encryption/decryption activities decreases, since those packets typically have a small size. Both Nautilus and Speak Freely support LPC-10.

As far as BoAT is concerned, its control mechanism exploits a wavelet-based software codec designed to encode audio samples with a variable bit rate [28]. This codec exploits a flexible compression scheme based on the discrete wavelet transform of audio data; the wavelet coefficients are quantized using a successive approximation algorithm and are encoded according to a run-length entropy coding strategy that uses variable length codes. The quantization scheme has the property that the bits in the audio stream are generated in order of importance, yielding a fully embedded code. In such a way, the encoder can terminate the encoding at any point, thus allowing any precise (and variable) target bit rate. Based on this codec, a control mechanism that encodes and transmits audio samples with a data rate that is always proportioned to the network traffic conditions has been devised. In essence, the control mechanism devised within BoAT establishes a feedback channel between the sender and the receiver in order to periodically assess, in 5-second measurement intervals, the network congestion, estimated in the form of the average packet loss rate (and transmission delay variation). On the basis of this periodic feedback information, the following control process is carried out at the sender's site in order to match the sending rate to the current network capacity. When the experienced average loss percentage surpasses a given upper threshold, the control process gradually decreases the sending rate according to a predefined decreasing scheme which takes into direct account the value of the measured jitter. Such a gradual decrease is obtained by exploiting the possibility, provided by the wavelet-based variable-bit-rate codec embedded in BoAT, of terminating the encoding activity at any point. Conversely, if the average loss rate falls below a certain lower threshold, the sending rate is gradually increased in order to match the improved connection capacity. In summary, the use of such a wavelet-based variable-bit-rate codec guarantees that BoAT is always able to encode audio samples at a speed proportioned to the current network performance.
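
A minimal sketch of this feedback-driven rate adjustment is shown below; the loss thresholds, the step size, and the jitter weighting are illustrative assumptions, while the 700-8000 Bps range and the 5-second feedback interval come from the description above.

    MIN_RATE, MAX_RATE = 700, 8000           # bytes per second (limits stated above)
    UPPER_LOSS, LOWER_LOSS = 0.10, 0.02      # assumed loss thresholds
    STEP = 500                               # assumed base adjustment step (Bps)

    def adjust_rate(current_rate, avg_loss, jitter_ms):
        # Return the target sending rate for the next 5-second interval.
        if avg_loss > UPPER_LOSS:
            # degrade gradually; a larger measured jitter triggers a larger reduction
            decrease = STEP * (1.0 + jitter_ms / 100.0)
            return max(MIN_RATE, int(current_rate - decrease))
        if avg_loss < LOWER_LOSS:
            return min(MAX_RATE, current_rate + STEP)
        return current_rate                  # loss within the thresholds: keep the rate

    rate = MAX_RATE
    for loss, jitter in [(0.15, 40), (0.12, 20), (0.01, 5), (0.0, 5)]:
        rate = adjust_rate(rate, loss, jitter)
        print("next interval rate: %d Bps" % rate)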

As a final remark, it is also worth mentioning that BoAT allows a gradual adjustment of the sending rate ranging from 8000 Bps (corresponding to the toll quality provided by PCM) to 700 Bps (corresponding to the synthetic quality provided by the LPC encoding strategy). For the purposes of the experiments concerning the coding computational load, we have used only the two limiting rates: 700 Bps and 8000 Bps. Instead, as far as the experiments related to the encryption/decryption computational overhead are concerned, we have used only the data rate of 850 Bps, corresponding to a speech quality comparable with that offered by the GSM lite codec.

Summarizing, in Table 4 we report the costs of all the above-mentioned codecs in terms of both relative CPU cost and needed bandwidth. The CPU costs reported in that table are calculated taking the value 1 as the basis for the time needed by ADPCM to encode a second of speech. For instance, based on the fact that ADPCM takes one time slot to encode a second of speech, the wavelet-based codec of BoAT requires about 9 time slots to encode the same quantity of data, even though it guarantees a better level of compression since it requires 700 Bps instead of 4000 Bps.

Table 4: Relative CPU cost and bandwidth (bytes per second) requirements of various codecs.

Codec      CPU    Bandwidth
ADPCM      1      4000
GSM        39     1650
LPC-10     53     346
BoAT       9      700

To conclude this section, in Table 5 we report the quantity of data compressed during a second of audio transmission by the different codecs embedded in each analyzed tool. Such values are particularly meaningful when evaluating the performance of the securing algorithms, because they specify the quantity of data to be encrypted and decrypted.

5. COMPRESSING DATA: EXPERIMENTAL RESULTS

In this section, we report the experimental results obtained by calculating the computational overhead due to the codec activity for all the analyzed tools. The motivation behind our study is that the coding activity is an important step of the audio data flow pipeline and also affects the performance of the encryption/decryption activities. Hence, a significant comparison among the different tools must also take the performance of the coding/decoding activities into consideration.

All the results of interest are reported in Tables 6, 7, 8, and 9, and were obtained by employing the same architecture presented in Section 4. In particular, our tables show the computing time (expressed in milliseconds) needed for coding (meaning that digitized speech samples are converted into a compressed form) and decoding a second of conversation.

As already mentioned, the codecs implemented in each tool offer different trade-offs between the loss of fidelity and the amount of computation required to compress and decompress the data. In turn, the higher the compression level, the faster the encryption/decryption process. These considerations motivate the very different results shown in our tables. In particular, a lower compression level implies a better speech quality, a lower coding computational load, and, consequently, a higher quantity of data to be encrypted and transmitted.

As a first result, since ADPCM offers the lowest compression level and the best quality of speech, we can observe that its computational load is limited to a few milliseconds (3 to 6, depending on the tool; see Tables 6 and 7) with respect to the tens of milliseconds experienced by the other codecs. On the contrary, LPC-10, which offers the maximum level of compression, experiences the worst computational load (Table 8). Instead, both the BoAT codec (at 700 Bps) and the GSM codec have a workload corresponding to about a hundred milliseconds (Tables 6, 7, and 9).

Table 5: Audio packet size and number of transmitted audio packets per second for each codec.

                       Speak Freely 7.1      PGPfone 2.1          Nautilus    BoAT
                       GSM      ADPCM        GSM 4.4    ADPCM     LPC-10
Bytes per packet       336      496          70         327       56          34
Packets per second     5        8.4          14         13        5.5         25

Table 6: Speak Freely 7.1 (Windows 98).

                  Computing time (ms)
CODEC             GSM                      ADPCM
                  Mean       Variance      Mean       Variance
Coding            114.1      10.5          3.58       0.01
Decoding          32.2       3.23          3.67       4.51

Table 7: PGPfone 2.1 (Windows 98).

                  Computing time (ms)
CODEC             GSM lite                 ADPCM
                  Mean       Variance      Mean       Variance
Coding            115        296           6.08       2.84
Decoding          79.1       274.3         5.88       2.75

As another significant result, it is worth noting that each codec, except for the ADPCM codec of Speak Freely (Table 6) and for the wavelet-based codec of BoAT (Table 9), is appreciably faster in the decoding phase than in the encoding phase. Only in the cases of ADPCM and of the BoAT codec (at 8000 Bps) do the coding and decoding activities present a comparable computational load.

Again, as far as the wavelet-based codec of BoAT is concerned, the amount of computation required to compress data is very low when using the 8000 Bps data rate with respect to the 700 Bps data rate (Table 9). We point out that our experiments reveal that the codec of BoAT and the GSM codec represent a good trade-off between speech quality and coding computational load, since they both outperform LPC-10 from a performance standpoint and guarantee a voice quality only a little lower than that provided by the ADPCM codec, while needing appreciably less bandwidth.

6. SECURING DATA: EXPERIMENTAL RESULTS

In this section, we report the experimental results obtained by calculating the computational overhead due to the security mechanism for all the analyzed tools. In particular, the results are obtained by employing the same architecture presented in Section 4.

Table 8: Nautilus 1.5a (Linux RedHat 6.0).

                  Computing time (ms)
CODEC             LPC-10
                  Mean       Variance
Coding            172        1.95
Decoding          80.8       18.6

Table 9: BoAT (Linux RedHat 6.0).

                  Computing time (ms)
CODEC             8000 Bps                 700 Bps
                  Mean       Variance      Mean       Variance
Coding            31.03      12.11         105.5      46.5
Decoding          36.1       22.6          143.7      337.6

As far as BoAT is concerned, we make the following assumptions. The particular stream cipher we have considered in our experiments is the RC4 algorithm [25], while the message authentication code of each packet is computed as the encryption of the output of the MD5 message-digest algorithm [30]. The packets of the handshaking protocol are encrypted by using the block cipher Blowfish [25], and the temporal interval between two consecutive synchronizations is exactly one second.

In Table 10, we report the computational overhead (expressed in milliseconds) experienced during a second of conversation by a sending site that follows the algorithm illustrated in Algorithm 1, singling out the different steps of the mechanism:

(i) encryption of the handshaking packets by means of the block cipher,

(ii) encryption of the audio packets by means of the stream cipher,

(iii) computation of the MAC.

Table 10: Computational overhead of the securing mechanism of BoAT per second of conversation.

                  Computing time (ms)
Block cipher      0.008
Stream cipher     0.0591
MAC               0.0474
Total latency     0.1145

The results of Table 10 highlight the following facts. The overall computational overhead is negligible (a few tens of microseconds). The block cipher is used only for the packets of the handshaking phase, which explains the almost negligible computational cost of that operation. In substance, we note that the computational overhead is equally divided between the encryption phase, performed by resorting to the RC4 algorithm (whose throughput is about 13.7 MBps), and the authentication phase, performed by resorting to the MD5 algorithm (whose throughput is about 17 MBps). It is worth noting that these results are compatible with those on the performance of RC4 and MD5 presented in [40, 41, 42, 43]. Moreover, we point out that we have not considered SEAL-like stream ciphers because, from the performance viewpoint, these algorithms do not seem to be appropriate if the key needs to be changed frequently (see, e.g., [36]).

Summarizing, the results highlight the negligible computational overhead of the implemented security mechanism, especially with respect to the latency introduced by the additional delay calculated by the adaptive playout control algorithm, which amounts to tens of milliseconds, as shown in [1, 8, 9].

Now, we contrast the above results with the performance obtained by analyzing the other application-level methods, namely, Nautilus, PGPfone, and Speak Freely. Before presenting the results of our experiments, the following remarks are in order. Unlike BoAT, the above methods adopt only block ciphers in order to encrypt each audio packet to be transmitted along the network. More precisely, they employ some well-known cryptographic algorithms such as DES, 3DES, IDEA, Blowfish, and CAST (see [25] for the technical details of these algorithms).

In order to provide the reader with a better understanding of the reported results, we just recall the following remarks on the considered codecs. ADPCM offers toll quality of speech at the cost of a high quantity of data to be encrypted and transmitted. GSM codecs offer high speech quality in spite of a higher compression level. Finally, LPC-10 offers poor speech quality with the maximum level of compression. The experimental results are shown in Tables 11, 12, and 13. For each block cipher implemented in the tools of interest, the tables report the computing time experienced during a second of conversation by the encryption phase.

The first interesting point illustrated by our tables is that in all cases the computational overhead of the privacy mechanism is restricted to a few milliseconds. The upper bound is represented by the case of Speak Freely with the block cipher DES and the codec ADPCM, with 20.8 milliseconds (Table 11). If we compare these results with those reported in Table 10, we can conclude that the securing mechanism of BoAT outperforms the other tools; in particular, BoAT turns out to be about 2 orders of magnitude better than the other tools (tens of microseconds with respect to a few milliseconds). This is because the integrated mechanism of BoAT adopts a lightweight ciphering mechanism that is very adequate when incorporated within the original handshaking protocol.

Table 11: Speak Freely 7.1 (Windows 98).

                  Computing time (ms)
CODEC             GSM                      ADPCM
                  Mean       Variance      Mean       Variance
Blowfish          2.47       0.01          5.22       0.15
IDEA              3.94       0.01          9.08       0.05
DES               9.77       0.20          20.8       0.16

Table 12: PGPfone 2.1 (Windows 98).

                  Computing time (ms)
CODEC             GSM lite 4.4             ADPCM
                  Mean       Variance      Mean       Variance
Blowfish          2.09       0.06          4.72       0.02
CAST              2.08       0.002         4.43       0.07
3DES              6.35       0.14          16.8       0.56

The results of a comparison among the performance of the different tools depend strictly on the particular codec that is used to compress the data. Indeed, the more the data are compressed, the faster the encryption/decryption process. For instance, it may be significant to contrast the performance of PGPfone with the GSM codec (Table 12) and BoAT (Table 10), since the related codecs offer a comparable quality of speech and a comparable quantity of data to be encrypted and transmitted per second of conversation (850 bytes in the case of BoAT and 980 bytes in the case of the PGPfone GSM 4.4). The results (about 0.1 milliseconds for BoAT and about 2–7 milliseconds for PGPfone) confirm once again our claim that BoAT outperforms the other tools.

As far as Nautilus is concerned, it is worth mentioning that the very low computational overhead of its securing algorithm (Table 13) depends on the fact that Nautilus uses the LPC-10 codec, which exploits a very high compression factor (note that the output of the LPC-10 compression algorithm per second of conversation is a few hundred bytes). In particular, the speech quality offered by this codec is noticeably poorer than the high quality guaranteed by the codecs of the other considered tools.

An interesting remark is in order in the case of the ADPCM codec implemented in Speak Freely and PGPfone (see Tables 11 and 12). Indeed, when using such a codec, we can observe an overhead of the encryption phase of several milliseconds, especially in the case of 3DES, because ADPCM is based on a low compression level (thousands of bytes per second of conversation) in order to offer toll quality of the transmitted speech.

Table 13: Nautilus 1.5a (Linux RedHat 6.0).

                  Computing time (ms)
CODEC             LPC-10
                  Mean       Variance
Blowfish          0.32       0.0004
IDEA              0.48       0.004
3DES              0.84       0.0009

To conclude this section, we can summarize the obtained results by observing that the integrated mechanism of BoAT, thanks to its handshaking protocol which allows the two parties to share the session keys, has turned out to be very suitable for extending the playout control algorithm with security features in a simple and cheap way. Hence, adding security modules to the audio data flow pipeline may be done without jeopardizing the overall end-to-end delay, because the presented approach has proved to be neither a noticeable computational penalty nor a performance bottleneck for real-time speech traffic. As far as the other application-level audio tools of interest are concerned, the performance results show that the computational overhead of the security mechanism is limited to a few milliseconds, and that such a result is about 2 orders of magnitude worse than the performance offered by BoAT.

7. CONCLUSION

In this paper, we have considered an adaptive packet audio control mechanism called BoAT and three application-level tools for secure speech transmission over the Internet, namely, Nautilus, PGPfone, and Speak Freely. The former offers a scheme which adaptively adjusts to the fluctuating network delays typical of the Internet and integrates security features into such an algorithm. The other tools have been designed at the application layer in order to add speech compression and strong cryptographic protocols, as separate external modules, to the audio transmission over untrusted networks.

The comparison among the above audio tools has been conducted by measuring the computational overhead of both the codec activity and the security mechanism. In the former case, we have highlighted the role played by codecs in the generation of the audio data flow pipeline and how they also affect the performance of the security mechanism. In the latter case, we have emphasized the low computational cost of the cryptographic algorithms for each considered tool. In particular, we have shown the adequacy of BoAT in adding security with a negligible overhead. As an example, an interesting summary of the results of Section 6 is reported in Table 14, where we show the computing time experienced by both encryption (at the sending site) and decryption (at the receiving site) during a second of conversation. In such a table, we consider the tools BoAT, Speak Freely with the GSM codec and the block cipher DES, PGPfone with the GSM lite 4.4 codec and the block cipher 3DES, and Nautilus with the LPC-10 codec and the block cipher 3DES. The results reveal that the provision of security has a computational cost of a few milliseconds and that BoAT performs better than the other tools (tens of microseconds with respect to a few milliseconds).

Table 14: Performance comparison.

                  Computing time (ms)
BoAT              0.229
Speak Freely      19.54
PGPfone           12.7
Nautilus          1.68

A final consideration concerns the particular approach adopted by the designers of BoAT for guaranteeing security. This approach has made it possible to add appropriate security services without affecting the overall playout latency introduced by the playout control mechanism. This is a very relevant result for all those audio tools that incorporate dynamic mechanisms to adapt the playout process to the network conditions. In fact, as stressed in recent works [8, 9], it is not possible to interfere with the playout values decided by the control mechanisms without jeopardizing the strict real-time constraints imposed by audio applications.

ACKNOWLEDGMENTS

We are grateful to the EURASIP JASP reviewers for their useful comments on the first version of this paper. This research has been funded by a Progetto MIUR and by a grant from Microsoft Research Europe.

REFERENCES

[1] R. Ramjee, J. Kurose, D. Towsley, and H. Schulzrinne, "Adaptive playout mechanisms for packetized audio applications in wide-area networks," in Proc. 13th IEEE Infocom Conference on Computer Communications (Infocom '94), pp. 680–688, Toronto, Ontario, Canada, June 1994.
[2] A. Perrig, R. Canetti, J. D. Tygar, and D. Song, "Efficient authentication and signing of multicast streams over lossy channels," in Proc. IEEE Symposium on Security and Privacy, pp. 56–73, Oakland, Calif, USA, May 2000.
[3] R. Westwater, "Digital audio presentation and compression," in Handbook of Multimedia Computing, B. Furht, Ed., pp. 135–147, CRC Press, Boca Raton, Fla, USA, 1999.
[4] R. Steinmetz and K. Nahrstedt, Multimedia: Computing, Communications and Applications, Innovative Technology Series, Prentice-Hall, Upper Saddle River, NJ, USA, 1995.
[5] L. Cottrell, W. Matthews, and C. Logg, Tutorial on Internet Monitoring & PingER at SLAC, Stanford Linear Accelerator Center, 2000.
[6] N. Jayant, "Effects of packet loss on waveform coded speech," in Proc. 5th Data Communications Symposium, pp. 275–280, Atlanta, Ga, USA, October 1980.
[7] J. Boyce and R. Gaglianello, "Packet loss effects on MPEG video sent over the public Internet," in Proc. 6th ACM International Multimedia Conference (Multimedia '98), pp. 181–190, Bristol, UK, September 1998.
[8] A. Aldini, M. Bernardo, R. Gorrieri, and M. Roccetti, "Comparing the QoS of Internet audio mechanisms via formal methods," ACM Transactions on Modeling and Computer Simulation, vol. 11, no. 1, pp. 1–42, 2001.
[9] S. B. Moon, J. Kurose, and D. Towsley, "Packet audio playout delay adjustment: performance bounds and algorithms," ACM Multimedia Systems, vol. 6, no. 1, pp. 17–28, 1998.

[10] M. Roccetti, V. Ghini, G. Pau, P. Salomoni, and M. E. Bonfigli, "Design and experimental evaluation of an adaptive playout delay control mechanism for packetized audio for use over the Internet," Multimedia Tools and Applications, vol. 14, no. 1, pp. 23–53, 2001.
[11] N. Hager, Secret Power, Craig Potton Publishing, Nelson, New Zealand, 1996.
[12] B. Dorsey, P. Rubin, A. Fingerhut, B. Soley, and P. Mullarky, "Nautilus documentation," 1996, http://www.nautilus.berlios.de/.
[13] P. R. Zimmermann, "PGPfone: Owner's manual," 1996, http://www.pgp.com.
[14] J. Walker and B. C. Wiles, "Speak freely," 1995, http://www.fourmilab.ch/.
[15] M. Roccetti, V. Ghini, D. Balzi, and M. Quieti, BoAT: Bologna optimal Audio Tool, Department of Computer Science, University of Bologna, Bologna, Italy, 1999, http://radiolab.csr.unibo.it/BoAT/src.
[16] A. Canteaut and M. Trabbia, "Improved fast correlation attacks using parity-check equations of weight 4 and 5," in Advances in Cryptology - EUROCRYPT '00, International Conference on the Theory and Application of Cryptographic Techniques, vol. 1807 of Lecture Notes in Computer Science, pp. 573–588, Springer-Verlag, Bruges, Belgium, May 2000.
[17] E. Filiol, "Decimation attack of stream ciphers," in Proc. First International Conference on Cryptology in India (INDOCRYPT 2000), vol. 1977 of Lecture Notes in Computer Science, pp. 31–42, Springer-Verlag, 2000.

[18] Information Security Corporation, "SecurePhone Professional," 2002, http://www.infoseccorp.com.
[19] EarthSpeak International LLC, "SecuriPhone V. 1.09," 2002, http://www.jechtech.com/securiphone.htm.
[20] NetSpeak Corporation, "NetSpeak Webphone User's Guide," 1998, http://www.k2nesoft.com/webphone/.
[21] Microsoft, "NetMeeting 3 Resource Kit," 1999, http://www.microsoft.com/windows/netmeeting/.
[22] J.-C. Bolot and A. Vega-Garcia, "Control mechanisms for packet audio in the Internet," in Proc. 15th IEEE Infocom Conference on Computer Communications (Infocom '96), San Francisco, Calif, USA, March 1996.
[23] H. Schulzrinne, "Voice communication across the Internet: a network voice terminal," Tech. Rep., University of Massachusetts, Amherst, Mass, USA, 1992, http://www.cs.columbia.edu/∼hgs/rtp/nevot.html.
[24] V. Hardman, M. A. Sasse, and I. Kouvelas, "Successful multiparty audio communication over the Internet," Communications of the ACM, vol. 41, no. 5, pp. 74–80, 1998.
[25] B. Schneier, Applied Cryptography, John Wiley & Sons, New York, NY, USA, 2nd edition, 1996.
[26] A. Aldini, R. Gorrieri, and M. Roccetti, "An adaptive mechanism for real-time secure speech transmission over the Internet," in Proc. 2nd IP-Telephony Workshop (IP-Tel '01), H. Schulzrinne, Ed., pp. 64–72, Columbia University, New York, NY, USA, April 2001.
[27] M. Roccetti, "Secure real time speech transmission over the Internet: performance analysis and simulation," in Proc. Summer Computer Simulation Conference (SCSC '00), B. Waite and A. Nisanci, Eds., pp. 939–944, Society for Computer Simulation International, Vancouver, British Columbia, Canada, July 2000.
[28] M. Roccetti, "Adaptive control mechanisms for packet audio over the Internet," in Proc. SCS Euromedia Conference (EUROMEDIA '00), F. Broeckx and L. Pauwels, Eds., pp. 151–155, Society for Computer Simulation International, Antwerp, Belgium, May 2000.

[29] Internet Engineering Task Force, “IP security protocol,” inProc. 43th IETF Meeting, Orlando, Fla, USA, December 1998,Internet Drafts available at http://www.ietf.org.

[30] R. L. Rivest, The MD5 Message-Digest Algorithm, MIT Labo-ratory for Computer Science and RSA Data Security, 1992.

[31] S. Ariga, K. Nagahashi, M. Minami, H. Esaki, and J. Mu-rai, “Performance evaluation of data transmission using IPSecover IPv6 networks,” in Proc. The Internet Global Summit:Global Distributed Intelligence for Everyone, 10th Annual Inter-net Society Conference, Yokohama, Japan, July 2000.

[32] S. Garfinkel, PGP: Pretty Good Privacy, O’Reilly & Associates,Sebastopol, Calif, USA, 1994.

[33] D. Dolev and A. C. Yao, “On the security of public key proto-cols,” IEEE Transactions on Information Theory, vol. 29, no. 2,pp. 198–208, 1983.

[34] R. L. Rivest, A. Shamir, and L. M. Adleman, “A method forobtaining digital signatures and public-key cryptosystems,”Communications of the ACM, vol. 21, no. 2, pp. 120–126, 1978.

[35] M. Briceno, I. Goldberg, and D. Wagner, “A pedagogical im-plementation of A5/1,” 1999, http://www.scard.org/.

[36] P. Rogaway and D. Coppersmith, “A software-optimized en-cryption algorithm,” Journal of Cryptology, vol. 11, no. 4, pp.273–287, 1998.

[37] A. Biryukov, A. Shamir, and D. Wagner, “Real time crypt-analysis of A5/1 on a PC,” in Proc. 7th Fast Software Encryp-tion Workshop (FSE ’00), pp. 1–18, New York, NY, USA, April2000.

[38] ITU-T Recommendation G.729-G.723.1, 1996, http://www.itu.int/publications/maim publ/itut.html.

[39] S. Redl, M. Weber, and M. Oliphant, GSM and Personal Com-munications Handbook, Artech House Publishers, Norwood,Mass, USA, 1998.

[40] A. Bosselaers, R. Govaerts, and J. Vandewalle, “Fast hashingon the Pentium,” in Advances in Cryptology - CRYPTO ’96,16th Annual International Cryptology Conference, N. Koblitz,Ed., vol. 1109 of Lectures Notes in Computer Science, pp. 298–312, Springer-Verlag, Santa Barbara, Calif, USA, 1996.

[41] A. Bosselaers, “Even faster hashing on the Pentium,”in Proc. Rump Session of Eurocrypt (Eurocrypt ’97), Kon-stanz, Germany, May 1997, http://www.esat.kuleuven.ac.be/∼bosselae/publications.html.

[42] B. Schneier and D. Whiting, “Fast software encryption: de-signing encryption algorithms for optimal software speed onthe Intel Pentium Processor,” in Proc. 4th Fast Software En-cryption Workshop (FSE ’97), pp. 242–259, Springer-Verlag,Haifa, Israel, January 1997.

[43] J. Touch, “Performance analysis of MD5,” in Proc. Conferenceon Applications, Technologies, Architectures, and Protocols forComputer Communication (SIGCOMM ’95), pp. 77–86, Cam-bridge, Mass, USA, August–September 1995.

Alessandro Aldini is an Assistant Professor of computer science at the STI Centro of the University of Urbino, Italy. He received the Laurea (with honors) and the Ph.D. degrees in computer science from the University of Bologna, in 1998 and 2002, respectively. His current research interests include theory of concurrency, formal description techniques and tools for concurrent and distributed computing systems, and performance evaluation and simulation.


Marco Roccetti is a Professor of computer science in the Department of Computer Science of the University of Bologna, Italy. From 1992 to 1998, he was a Research Associate in the Department of Computer Science of the University of Bologna, and from 1998 to 2000, he was an Associate Professor of computer science at the University of Bologna. Marco Roccetti authored more than 70 technical refereed papers that appeared in the proceedings of several international conferences and journals. His research interests include protocol design, implementation and evaluation for wired/wireless multimedia systems, performance modeling and simulation of multimedia systems, and digital audio for multimedia communications.

Roberto Gorrieri is a Professor of computer science in the Department of Computer Science, University of Bologna, Italy. He received the Laurea and the Ph.D. degrees in computer science, both from the University of Pisa, Italy, in 1986 and 1991, respectively. From 1992 to 2000, he was an Associate Professor of computer science at the University of Bologna. Roberto Gorrieri is a member of the European Association for Theoretical Computer Science and Chairman of IFIP WG 1.7 on Theoretical Foundations of Security. His research interests include theory of concurrent and distributed systems, formal methods for security, and real-time and performance evaluation.


EURASIP Journal on Applied Signal Processing 2003:10, 1043–1051
© 2003 Hindawi Publishing Corporation

Efficient Alternatives to the Ephraim and Malah Suppression Rule for Audio Signal Enhancement

Patrick J. Wolfe
Signal Processing Group, Department of Engineering, University of Cambridge, CB2 1PZ Cambridge, UK
Email: [email protected]

Simon J. Godsill
Signal Processing Group, Department of Engineering, University of Cambridge, CB2 1PZ Cambridge, UK
Email: [email protected]

Received 31 May 2002 and in revised form 20 February 2003

Audio signal enhancement often involves the application of a time-varying filter, or suppression rule, to the frequency-domain transform of a corrupted signal. Here we address suppression rules derived under a Gaussian model and interpret them as spectral estimators in a Bayesian statistical framework. With regard to the optimal spectral amplitude estimator of Ephraim and Malah, we show that under the same modelling assumptions, alternative methods of Bayesian estimation lead to much simpler suppression rules exhibiting similarly effective behaviour. We derive three such rules and demonstrate that, in addition to permitting a more straightforward implementation, they yield a more intuitive interpretation of the Ephraim and Malah solution.

Keywords and phrases: noise reduction, speech enhancement, Bayesian estimation.

1. INTRODUCTION

Herein we address an important issue in audio signal processing for multimedia communications, that of broadband noise reduction for audio signals via statistical modelling of their spectral components. Due to its ubiquity in applications of this nature, we concentrate on short-time spectral attenuation, a popular method of broadband noise reduction in which a time-varying filter, or suppression rule, is applied to the frequency-domain transform of a corrupted signal. We first address existing suppression rules derived under a Gaussian statistical model and interpret them in a Bayesian framework. We then employ the same model and framework to derive three new suppression rules exhibiting similarly effective behaviour, preliminary details of which may also be found in [1]. These derivations lead in turn to a more intuitive means of understanding the behaviour of the well-known Ephraim and Malah suppression rule [2], as well as to an extension of certain others [3, 4].

This paper is organised as follows. In the remainder of Section 1, we introduce the assumed statistical model and estimation framework, and then employ these in an alternate derivation of the minimum mean square error (MMSE) suppression rules due to Wiener [5] and Ephraim and Malah [2]. In Section 2, we derive three alternatives to the MMSE spectral amplitude estimator of [2], all of which may be formulated as suppression rules. Finally, in Section 3, we investigate the behaviour of these solutions and compare their performance to that of the Ephraim and Malah suppression rule. Throughout the ensuing discussion, we consider, for simplicity of notation and without loss of generality, the case of a single, windowed segment of audio data. To facilitate a comparison, our notation follows that of [2], except that complex quantities appear in bold.

1.1. A simple Gaussian model

To date, the most popular methods of broadband noise reduction involve the application of a time-varying filter to the frequency-domain transform of a noisy signal. Let xn = x(nT) in general represent values from a finite-duration analogue signal sampled at a regular interval T, in which case a corrupted sequence may be represented by the additive observation model

yn = xn + dn, (1)

where yn represents the observed signal at time index n, xn is the original signal, and dn is additive random noise, uncorrelated with the original signal. The goal of signal enhancement is then to form an estimate x̂n of the underlying signal xn based on the observed signal yn, as shown in Figure 1.


Figure 1: Signal enhancement in the case of additive noise (block diagram: the unobservable signal xn plus noise dn gives the observable yn, which a noise removal process maps to the estimate x̂n).

In many implementations where efficient online performance is required, the set of observations yn is filtered using the overlap-add method of short-time Fourier analysis and synthesis, in a manner known as short-time spectral attenuation. Taking the discrete Fourier transform on windowed intervals of length N yields K frequency bins per interval:

Yk = Xk + Dk, (2)

where these quantities are denoted in bold to indicate that they are complex. Noise reduction in this manner may be viewed as the application of a suppression rule, or nonnegative real-valued gain Hk, to each bin k of the observed signal spectrum Yk, in order to form an estimate X̂k of the original signal spectrum:

X̂k = Hk · Yk. (3)

As shown in Figure 2, this spectral estimate is then inverse-transformed to obtain the time-domain signal reconstruction.
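To make the processing chain of Figure 2 concrete, the following sketch applies a per-bin gain to the short-time Fourier transform of a noisy signal and resynthesises by overlap-add. It is an illustration under stated assumptions rather than the authors' implementation: the function name, the use of SciPy's stft/istft routines, and the placeholder Wiener-style gain are ours, and noise_psd is assumed to have been estimated elsewhere (e.g., from a noise-only segment).

```python
# Illustrative sketch of short-time spectral attenuation (Figure 2); not the
# authors' code. Any of the suppression rules derived later can replace H below.
import numpy as np
from scipy.signal import stft, istft

def spectral_attenuation(y, fs, noise_psd, nperseg=512):
    # noise_psd: assumed estimate of lambda_d(k), one value per frequency bin.
    _, _, Y = stft(y, fs=fs, window='hann', nperseg=nperseg)      # Y[k, frame]
    gamma = np.abs(Y) ** 2 / noise_psd[:, None]                   # a posteriori SNR
    xi = np.maximum(gamma - 1.0, 0.0)                             # crude a priori SNR proxy
    H = xi / (1.0 + xi)                                           # Wiener-style gain (Section 1.2.1)
    X_hat = H * Y                                                 # equation (3); noisy phase retained
    _, x_hat = istft(X_hat, fs=fs, window='hann', nperseg=nperseg)
    return x_hat
```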

Within such a framework, a simple Gaussian model often proves effective [6, Chapter 6]. In this case, the elements of Xk and Dk are modelled as independent, zero-mean, complex Gaussian random variables with variances λx(k) and λd(k), respectively:

$$\mathbf{X}_k \sim \mathcal{N}_2\bigl(\mathbf{0},\,\lambda_x(k)\,\mathbf{I}\bigr), \qquad \mathbf{D}_k \sim \mathcal{N}_2\bigl(\mathbf{0},\,\lambda_d(k)\,\mathbf{I}\bigr). \tag{4}$$

1.2. A Bayesian interpretation of suppression rules

It is instructive to consider an interpretation of suppression rules based on the Gaussian model of (4) in terms of a Bayesian statistical framework. Viewed in this light, the required task is to estimate each component Xk of the underlying signal spectrum as a function of the corresponding observed spectral component Yk. To do so, we may define a nonnegative cost function C(xk, x̂k) of xk (the realisation of Xk) and its estimate x̂k, and then minimise the risk E[C(xk, x̂k)|Yk] in order to obtain the optimal estimator of xk.

1.2.1. The Wiener suppression rule

A frequent goal in signal enhancement is to minimise the mean square error of an estimator; within the framework of Bayesian risk theory, this MMSE criterion may be viewed as a squared-error cost function.

Figure 2: Short-time spectral attenuation (block diagram: short-time analysis of yn yields Yk and |Yk|; a noise estimation stage and the suppression rule produce |X̂k|, and short-time synthesis yields x̂n).

Considering the model of (2), it follows from Bayes' rule and the prior distributions defined in (4) that we seek to minimise

$$E\bigl[C(\mathbf{x}_k,\hat{\mathbf{x}}_k)\mid\mathbf{Y}_k\bigr] \;\propto\; \int_{\mathbf{x}_k} \bigl|\mathbf{x}_k-\hat{\mathbf{x}}_k\bigr|^{2}\, \exp\!\left(-\frac{\bigl|\mathbf{y}_k-\mathbf{x}_k\bigr|^{2}}{\lambda_d(k)}-\frac{\bigl|\mathbf{x}_k\bigr|^{2}}{\lambda_x(k)}\right) d\mathbf{x}_k. \tag{5}$$

The corresponding Bayes estimator is the optimal solution in an MMSE sense, and is given by the mean of the posterior density appearing in (5), which follows directly from its Gaussian form:

$$E\bigl[\mathbf{X}_k\mid\mathbf{Y}_k\bigr] = \frac{\lambda_x(k)}{\lambda_x(k)+\lambda_d(k)}\,\mathbf{Y}_k. \tag{6}$$

The result given by (6) is recognisable as the well-known Wiener filter [5].

In fact, it can be shown (see, e.g., [7, pages 59–63]) that when the posterior density is unimodal and symmetric about its mean, the conditional mean is the resultant Bayes estimator for a large class of nondecreasing, symmetric cost functions. However, we soon move to consider densities that are inherently asymmetric. Thus we will also employ the so-called uniform cost function, for which the optimal estimator may be shown to be that which maximises the posterior density, that is, the maximum a posteriori (MAP) estimator.

1.2.2. The Ephraim and Malah suppression rule

While, from a perceptual point of view, the ear is by no means insensitive to phase, the relative importance of spectral amplitude rather than phase in audio signal enhancement [8, 9] has led researchers to recast the spectral estimation problem in terms of the former quantity. In this vein, McAulay and Malpass [4] derive a maximum-likelihood (ML) spectral amplitude estimator under the assumption of Gaussian noise and an original signal characterised by a deterministic waveform of unknown amplitude and phase:

$$H_k = \frac{1}{2} + \frac{1}{2}\sqrt{\frac{\lambda_x(k)}{\lambda_x(k)+\lambda_d(k)}}. \tag{7}$$


As an extension of the model underlying (7), Ephraim and Malah [2] derive an MMSE short-time spectral amplitude estimator based on the model of (4); that is, under the assumption that the Fourier expansion coefficients of the original signal and the noise may be modelled as statistically independent, zero-mean, Gaussian random variables. Thus the observed spectral component in bin k, Yk ≜ Rk exp(jϑk), is equal to the sum of the spectral components of the signal, Xk ≜ Ak exp(jαk), and the noise, Dk. This model leads to the following marginal, joint, and conditional distributions:

$$p(a_k) = \begin{cases} \dfrac{2a_k}{\lambda_x(k)}\exp\!\left(-\dfrac{a_k^2}{\lambda_x(k)}\right) & \text{if } a_k\in[0,\infty), \\[2mm] 0 & \text{otherwise,} \end{cases} \tag{8}$$

$$p(\alpha_k) = \begin{cases} \dfrac{1}{2\pi} & \text{if } \alpha_k\in[-\pi,\pi), \\[2mm] 0 & \text{otherwise,} \end{cases} \tag{9}$$

$$p(a_k,\alpha_k) = \frac{a_k}{\pi\lambda_x(k)}\exp\!\left(-\frac{a_k^2}{\lambda_x(k)}\right), \tag{10}$$

$$p(\mathbf{Y}_k\mid a_k,\alpha_k) = \frac{1}{\pi\lambda_d(k)}\exp\!\left(-\frac{\bigl|\mathbf{Y}_k-a_k e^{\,j\alpha_k}\bigr|^{2}}{\lambda_d(k)}\right), \tag{11}$$

where it is understood that (10) and (11) are defined over the range of ak and αk, as given in (8) and (9), respectively; again λx(k) ≜ E[|Xk|²] and λd(k) ≜ E[|Dk|²] denote the respective variances of the kth short-time spectral component of the signal and noise. Additionally, define

$$\frac{1}{\lambda(k)} \triangleq \frac{1}{\lambda_x(k)} + \frac{1}{\lambda_d(k)}, \tag{12}$$

$$\upsilon_k \triangleq \frac{\xi_k}{1+\xi_k}\,\gamma_k; \qquad \xi_k \triangleq \frac{\lambda_x(k)}{\lambda_d(k)}, \qquad \gamma_k \triangleq \frac{R_k^2}{\lambda_d(k)}, \tag{13}$$

where ξk and γk are interpreted after [4] as the a priori and a posteriori signal-to-noise ratios (SNRs), respectively.

Under the assumed model, the posterior density p(ak|Yk) (following integration with respect to the phase term αk) is Rician [10] with parameters (σ²k, s²k):

$$p(a_k\mid\mathbf{Y}_k) = \frac{a_k}{\sigma_k^2}\exp\!\left(-\frac{a_k^2+s_k^2}{2\sigma_k^2}\right) I_0\!\left(\frac{a_k s_k}{\sigma_k^2}\right), \tag{14}$$

$$\sigma_k^2 \triangleq \frac{\lambda(k)}{2}, \qquad s_k^2 \triangleq \upsilon_k\,\lambda(k), \tag{15}$$

where Ii(·) denotes the modified Bessel function of order i. The mth moment of a Rician distribution is given by

$$E\bigl[X^m\bigr] = \bigl(2\sigma^2\bigr)^{m/2}\,\Gamma\!\left(\frac{m+2}{2}\right)\Phi\!\left(\frac{m+2}{2},\,1;\,\frac{s^2}{2\sigma^2}\right)\exp\!\left(-\frac{s^2}{2\sigma^2}\right), \quad m \ge 0, \tag{16}$$

where Γ(·) is the gamma function [11, equation (8.310.1)] and Φ(·) is the confluent hypergeometric function [11, equation (9.210.1)].

The MMSE solution of Ephraim and Malah is simply the first moment of (14); when combined with the optimal phase estimator (found by Ephraim and Malah to be the observed phase ϑk [2]), it takes the form of a suppression rule:

$$\hat{A}_k = \lambda(k)^{1/2}\,\Gamma(1.5)\,\Phi\bigl(1.5,\,1;\,\upsilon_k\bigr)\exp\bigl(-\upsilon_k\bigr) = \lambda(k)^{1/2}\,\Gamma(1.5)\,\Phi\bigl(-0.5,\,1;\,-\upsilon_k\bigr) \tag{17}$$

$$\Longrightarrow\quad H_k = \frac{\sqrt{\pi\upsilon_k}}{2\gamma_k}\left[\bigl(1+\upsilon_k\bigr)I_0\!\left(\frac{\upsilon_k}{2}\right) + \upsilon_k\,I_1\!\left(\frac{\upsilon_k}{2}\right)\right]\exp\!\left(-\frac{\upsilon_k}{2}\right). \tag{18}$$

2. THREE ALTERNATIVE SUPPRESSION RULES

The spectral amplitude estimator given by (18), while being optimal in an MMSE sense, requires the computation of exponential and Bessel functions. We now proceed to derive three alternative suppression rules under the same model, each of which admits a more straightforward implementation.
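For reference, a minimal numerical sketch of the gain of (18) is given below; it is not the authors' implementation. The function name and the use of SciPy's exponentially scaled Bessel functions (ive) are our choices; the scaling absorbs the exp(−υk/2) factor and avoids overflow when υk is large.

```python
# Sketch of the Ephraim and Malah gain (18); xi and gamma are NumPy arrays of
# a priori and a posteriori SNRs. scipy.special.ive(n, x) returns I_n(x)*exp(-x).
import numpy as np
from scipy.special import ive

def gain_mmse_amplitude(xi, gamma):
    v = xi / (1.0 + xi) * gamma                                  # upsilon_k of (13)
    bessel = (1.0 + v) * ive(0, v / 2.0) + v * ive(1, v / 2.0)   # already includes exp(-v/2)
    return np.sqrt(np.pi * v) / (2.0 * gamma) * bessel
```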

2.1. Joint maximum a posteriori spectral amplitude and phase estimator

As shown earlier, joint estimation of the real and imaginary components of Xk under either the MAP or MMSE criterion leads to the Wiener estimator (due to symmetry of the Gaussian posterior distribution). However, as we have seen, the problem may be reformulated in terms of spectral amplitude Ak and phase αk; it is then possible to obtain a joint MAP estimate by maximising the posterior distribution p(ak, αk|Yk):

$$p(a_k,\alpha_k\mid\mathbf{Y}_k) \propto p(\mathbf{Y}_k\mid a_k,\alpha_k)\,p(a_k,\alpha_k) \propto \frac{a_k}{\pi^2\lambda_x(k)\lambda_d(k)}\exp\!\left(-\frac{\bigl|\mathbf{Y}_k-a_k e^{\,j\alpha_k}\bigr|^{2}}{\lambda_d(k)} - \frac{a_k^2}{\lambda_x(k)}\right). \tag{19}$$

Since ln(·) is a monotonically increasing function, one may equivalently maximise the natural logarithm of p(ak, αk|Yk). Define

$$J_1 = -\frac{\bigl|\mathbf{Y}_k-a_k e^{\,j\alpha_k}\bigr|^{2}}{\lambda_d(k)} - \frac{a_k^2}{\lambda_x(k)} + \ln a_k + \text{constant}. \tag{20}$$

Differentiating J1 with respect to αk yields

$$\frac{\partial}{\partial\alpha_k} J_1 = -\frac{1}{\lambda_d(k)}\Bigl[\bigl(\mathbf{Y}_k^{*}-a_k e^{-j\alpha_k}\bigr)\bigl(-j a_k e^{\,j\alpha_k}\bigr) + \bigl(\mathbf{Y}_k-a_k e^{\,j\alpha_k}\bigr)\bigl(j a_k e^{-j\alpha_k}\bigr)\Bigr], \tag{21}$$

where Y*k denotes the complex conjugate of Yk. Setting to zero and substituting Yk = Rk exp(jϑk), we obtain

$$0 = j a_k R_k e^{\,j(\vartheta_k-\alpha_k)} - j a_k R_k e^{-j(\vartheta_k-\alpha_k)} = -2 a_k R_k \sin\bigl(\vartheta_k-\alpha_k\bigr) \tag{22}$$


since ak ≠ 0 if the phase estimate is to be meaningful. Therefore

$$\alpha_k = \vartheta_k; \tag{23}$$

that is, the joint MAP phase estimate is simply the noisy phase, just as in the case of the MMSE solution due to Ephraim and Malah [2]. Differentiating J1 with respect to ak yields

$$\frac{\partial}{\partial a_k} J_1 = -\frac{1}{\lambda_d(k)}\Bigl[\bigl(\mathbf{Y}_k^{*}-a_k e^{-j\alpha_k}\bigr)\bigl(-e^{\,j\alpha_k}\bigr) + \bigl(\mathbf{Y}_k-a_k e^{\,j\alpha_k}\bigr)\bigl(-e^{-j\alpha_k}\bigr)\Bigr] - \frac{2a_k}{\lambda_x(k)} + \frac{1}{a_k}. \tag{24}$$

Setting the above to zero implies

$$\begin{aligned} 2a_k^2 &= \lambda_x(k) - \frac{\lambda_x(k)}{\lambda_d(k)}\,a_k\Bigl[2a_k - R_k e^{-j(\vartheta_k-\alpha_k)} - R_k e^{\,j(\vartheta_k-\alpha_k)}\Bigr] \\ &= \lambda_x(k) - \xi_k a_k\Bigl[2a_k - 2R_k\cos\bigl(\vartheta_k-\alpha_k\bigr)\Bigr]. \end{aligned} \tag{25}$$

From (23), we have cos(ϑk − αk) = 1; therefore

$$0 = 2\bigl(1+\xi_k\bigr)a_k^2 - 2R_k\xi_k a_k - \lambda_x(k), \tag{26}$$

where ξk is as defined in (13). Solving the above quadratic equation and substituting

$$\lambda_x(k) = \frac{\xi_k}{\gamma_k}\,R_k^2, \tag{27}$$

which follows from the definitions of ξk and γk in (13), we have

$$\hat{A}_k = \frac{\xi_k + \sqrt{\xi_k^2 + 2\bigl(1+\xi_k\bigr)\bigl(\xi_k/\gamma_k\bigr)}}{2\bigl(1+\xi_k\bigr)}\,R_k. \tag{28}$$

Equations (23) and (28) together define the following suppression rule:

$$H_k = \frac{\xi_k + \sqrt{\xi_k^2 + 2\bigl(1+\xi_k\bigr)\bigl(\xi_k/\gamma_k\bigr)}}{2\bigl(1+\xi_k\bigr)}. \tag{29}$$

2.2. Maximum a posteriori spectral amplitude estimator

Recall that the posterior density p(ak|Yk) of (14), arising from integration over the phase term αk, is Rician with parameters (σ²k, s²k). Following McAulay and Malpass [4], we may for large arguments of I0(·) (i.e., when, for λx(k) = A²k, ξkRk√(1/[(1 + ξk)λ(k)]) ≥ 3) substitute the approximation

$$I_0\bigl(|x|\bigr) \approx \frac{1}{\sqrt{2\pi|x|}}\exp\bigl(|x|\bigr) \tag{30}$$

into (14), yielding

$$p(a_k\mid\mathbf{Y}_k) \approx \frac{1}{\sqrt{2\pi\sigma_k^2}}\left(\frac{a_k}{s_k}\right)^{1/2}\exp\!\left(-\frac{1}{2}\left[\frac{a_k-s_k}{\sigma_k}\right]^{2}\right), \tag{31}$$

which we note is "almost" Gaussian. Considering (31), and again taking the natural logarithm and maximising with respect to ak, we obtain

$$J_2 = -\frac{1}{2}\left[\frac{a_k-s_k}{\sigma_k}\right]^{2} + \frac{1}{2}\ln a_k + \text{constant}, \tag{32}$$

in which case

$$\frac{d}{da_k} J_2 = \frac{s_k-a_k}{\sigma_k^2} + \frac{1}{2a_k} \tag{33}$$

$$\Longrightarrow\quad 0 = a_k^2 - s_k a_k - \frac{\sigma_k^2}{2}. \tag{34}$$

Substituting (15) and (27) into (34) and solving, we arrive at the following equation, which represents an approximate closed-form MAP solution corresponding to the maximisation of (14) with respect to ak:

$$\hat{A}_k = \frac{\xi_k + \sqrt{\xi_k^2 + \bigl(1+\xi_k\bigr)\bigl(\xi_k/\gamma_k\bigr)}}{2\bigl(1+\xi_k\bigr)}\,R_k. \tag{35}$$

Note that this estimator differs from that of the joint MAP solution only by a factor of two under the square root (owing to the factor √ak in (31); replacement with ak would yield the spectral estimator of (28)).

Combining (35) with the Ephraim and Malah phase estimator (i.e., the observed phase ϑk) yields the following suppression rule:

$$H_k = \frac{\xi_k + \sqrt{\xi_k^2 + \bigl(1+\xi_k\bigr)\bigl(\xi_k/\gamma_k\bigr)}}{2\bigl(1+\xi_k\bigr)}. \tag{36}$$

In fact, this solution extends that of McAulay and Malpass [4], who use the same approximation of I0(·) to enable the derivation of the ML estimator given by (7). In this sense, the suppression rule of (36) represents a generalisation of the (approximate) ML spectral amplitude estimator proposed in [4].
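In a numerical sketch (again ours, not the authors'), (36) is the same one-liner as (29) with the factor of two under the square root removed:

```python
# Sketch of the approximate MAP rule (36); identical to (29) except for the
# missing factor of two under the square root.
import numpy as np

def gain_approx_map(xi, gamma):
    return (xi + np.sqrt(xi ** 2 + (1.0 + xi) * xi / gamma)) / (2.0 * (1.0 + xi))
```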

2.3. Minimum mean square error spectral power estimator

Recall that Ephraim and Malah formulated the first moment of a Rician posterior distribution, E[Ak|Yk], as a suppression rule. The second moment of that distribution, E[A²k|Yk], reduces to a much simpler expression

$$E\bigl[A_k^2\mid\mathbf{Y}_k\bigr] = 2\sigma_k^2 + s_k^2, \tag{37}$$

where σ²k and s²k are as defined in (15). Letting B̂k = Â²k and substituting for σ²k and s²k in (37) yields

$$\hat{B}_k = \frac{\xi_k}{1+\xi_k}\left(\frac{1+\upsilon_k}{\gamma_k}\right)R_k^2, \tag{38}$$


Figure 3: Ephraim and Malah MMSE suppression rule. Gain (dB) as a function of instantaneous SNR (dB) and a priori SNR (dB).

Figure 4: Joint MAP suppression rule gain difference (dB), as a function of instantaneous SNR (dB) and a priori SNR (dB).

where B̂k is the optimal spectral power estimator in an MMSE sense, as it is also the first moment of a new posterior distribution p(bk|Yk) having a noncentral chi-square probability density function with two degrees of freedom and parameters (σ²k, s²k).

When combined with the optimal phase estimator of Ephraim and Malah (i.e., the observed phase ϑk), this estimator also takes the form of a suppression rule

$$H_k = \sqrt{\frac{\xi_k}{1+\xi_k}\left(\frac{1+\upsilon_k}{\gamma_k}\right)}. \tag{39}$$
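As before, a sketch of (39) (our transcription, with xi and gamma the a priori and a posteriori SNRs) requires only elementary operations:

```python
# Sketch of the MMSE spectral power rule (39).
import numpy as np

def gain_mmse_power(xi, gamma):
    v = xi / (1.0 + xi) * gamma          # upsilon_k of (13)
    return np.sqrt(xi / (1.0 + xi) * (1.0 + v) / gamma)
```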

3. ANALYSIS OF ESTIMATOR BEHAVIOUR

Figure 3 shows the Ephraim and Malah suppression rule as a function of instantaneous SNR (defined in [2] as γk − 1) and a priori SNR ξk.¹

Figure 5: MAP approximation suppression rule gain difference (dB), as a function of instantaneous SNR (dB) and a priori SNR (dB).

Figure 6: MMSE power suppression rule gain difference (dB), as a function of instantaneous SNR (dB) and a priori SNR (dB).

Figures 4, 5, and 6 show the gain difference (in decibels) between it and each of the three derived suppression rules, given by (29), (36), and (39), respectively (note the difference in scale). A comparison of the magnitude of these gain differences is shown in Table 1.

From these figures, it is apparent that the MMSE spectral power suppression rule of (39) follows the Ephraim and Malah solution most closely and consistently, with only slightly less suppression in regions of low a priori SNR. Table 1 also indicates that the approximate MAP suppression rule of (36) is still within 5 dB of the Ephraim and Malah rule value over a wide SNR range, despite the approximation of (30).²

¹ Recall that the a priori SNR is the "true but unobserved" SNR, whereas the instantaneous SNR is the "spectral subtraction estimate" thereof.


Table 1: Magnitude of deviation from MMSE suppression rule gain.

                     (γk − 1, ξk) ∈ [−30, 30] dB      (γk − 1, ξk) ∈ [−100, 100] dB
  Suppression rule   Mean      Maximum   Range        Mean      Maximum   Range
  MMSE power         0.68473   −1.0491   1.0469       0.63092   −1.0491   1.0491
  Joint MAP          0.52192   +1.7713   2.3352       0.74507   +1.9611   2.5250
  Approximate MAP    1.2612    +4.7012   4.7012       1.7423    +4.9714   4.9714

While the sign of the deviation of both the MMSE spectral power and approximate MAP rules is constant, that of the joint MAP suppression rule of (29) depends on the instantaneous and a priori SNRs.

Ephraim and Malah [2] show that at high SNRs, their derived suppression rule converges to the Wiener suppression rule detailed in Section 1.2.1, formulated as a function of a priori SNR ξk:

$$H_k = \frac{\xi_k}{1+\xi_k}. \tag{40}$$

This relationship is easily seen from the MMSE spectral power suppression rule given by (39), expanded slightly to the following equation:

$$H_k = \sqrt{\frac{\xi_k}{1+\xi_k}\left(\frac{1}{\gamma_k} + \frac{\xi_k}{1+\xi_k}\right)}. \tag{41}$$

As the instantaneous SNR becomes large, (41) may be seen to approach the Wiener suppression rule of (40). As it becomes small, the 1/γk term in (41) lessens the severity of the attenuation. Cappe [12] makes the same observation concerning the behaviour of the Ephraim and Malah suppression rule, although the simpler form of the MMSE spectral power estimator shows the influence of the a priori and a posteriori SNRs more explicitly.

We also note that the success of the Ephraim and Malah suppression rule is largely due to the authors' decision-directed approach for estimating the a priori SNR ξk [12]. For a given short-time block n, the decision-directed a priori SNR estimate ξ̂k is given by a geometric weighting of the SNRs in the previous and current blocks:

$$\hat{\xi}_k = \alpha\,\frac{\bigl|\hat{X}_k(n-1)\bigr|^{2}}{\lambda_d(n-1,k)} + (1-\alpha)\max\bigl[0,\,\gamma_k(n)-1\bigr], \qquad \alpha\in[0,1). \tag{42}$$
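A per-block sketch of the decision-directed update of (42) might read as follows; the function and variable names are ours, prev_amp2 stands for |X̂k(n−1)|², noise_psd for λd(n−1, k), and α = 0.98 is the value recommended in [2].

```python
# Sketch of the decision-directed a priori SNR estimate (42) for one block.
import numpy as np

def decision_directed_xi(prev_amp2, noise_psd, gamma_n, alpha=0.98):
    return alpha * prev_amp2 / noise_psd + (1.0 - alpha) * np.maximum(gamma_n - 1.0, 0.0)
```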

It is instructive to consider the case in which ξ̂k = γk − 1, that is, α = 0 in (42), so that the estimate of the a priori SNR is based only on the spectral subtraction estimate of the current block.

² For a fixed spectral magnitude observation Rk, and with λx(k) = A²k, the approximation of (30) is dominated by the a priori SNR ξk. Hence we see that when ξk is large, the resultant suppression rule gain exhibits less deviation from that of the other rules.

Figure 7: Optimal and derived suppression rules. Gain (dB) versus instantaneous SNR = a priori SNR (dB) for the MMSE spectral amplitude, joint MAP spectral amplitude and phase, MAP spectral amplitude approximation, and MMSE spectral power estimators.

Figure 8: Standard suppression rules. Gain (dB) versus instantaneous SNR (dB) for power spectral subtraction, the Wiener suppression rule, and magnitude spectral subtraction.


Figure 9: A performance comparison of the derived suppression rules (SNR gain in dB versus input SNR in dB for narrowband speech, wideband speech, and wideband music, with curves for the MMSE amplitude, joint MAP, approximate MAP, and MMSE power estimators). The top row of figures corresponds to a priori SNR estimation using the decision-directed approach of (42), with α = 0.98 as recommended in [2]. The bottom row corresponds to α = 0, in which case the gain surfaces of Figures 3, 4, 5, and 6 reduce to the gain curves of Figure 7.

In this case, the MMSE spectral power suppression rule given by (41) reduces to the method of power spectral subtraction (see, e.g., [3]). Figure 7 shows a comparison of the derived suppression rules under this constraint; by way of comparison, Figure 8 shows some standard suppression rules, including power spectral subtraction and the Wiener filter, as a function of instantaneous SNR (note the difference in ordinate scale).

Lastly, we mention the results of informal listening tests conducted across a range of audio material. These tests indicate that, especially when coupled with the decision-directed approach for estimating ξk, each of the derived estimators yields an enhancement similar in quality to that obtained using the Ephraim and Malah suppression rule. To this end, Figure 9 shows a comparison of SNR gain over a range of input SNRs for three typical 16-bit audio examples, artificially degraded with additive white Gaussian noise, and processed using the overlap-add method with a 50% window overlap: narrowband speech (sampled at 16 kHz and analysed using a 256-sample Hanning window), wideband speech (sampled at 44.1 kHz and analysed using a 512-sample Hanning window), and wideband music (solo piano, sampled at 44.1 kHz and analysed using a 2048-sample Hanning window).³

³ Segmental SNR gain measurements yield a similar pattern of results.


As we intend these results to be illustrative rather than exhaustive, we limit our direct comparison here to the Ephraim and Malah suppression rule. Comparisons have been made both with and without smoothing in the a priori SNR calculation, as described in the caption of Figure 9. It may be seen from Figure 9 that in the case of smoothing (upper row), the spectral power estimator appears to provide a small increase in SNR gain. In terms of sound quality, a small decrease in residual musical noise results from the approximate MAP solution, albeit at the expense of slightly more signal distortion. The joint MAP suppression rule lies in between these two extremes. Without smoothing, the methods produce a residual with approximately the same amount of musical noise as power spectral subtraction (as is expected in light of the comparison of these curves given by Figure 7). In comparison to Wiener filtering and magnitude spectral subtraction, the derived methods yield a slightly greater level of musical noise (as is to be expected according to Figure 8).

Audio examples illustrating these features, along with a Matlab toolbox allowing for the reproduction of results presented here, as well as further experimentation and comparison with other suppression rules, are available online at http://www-sigproc.eng.cam.ac.uk/∼pjw47.

4. DISCUSSION

In the first part of this paper, we have provided a common interpretation of existing suppression rules based on a simple Gaussian statistical model. Within the framework of Bayesian estimation, we have seen how two MMSE suppression rules due to Wiener [5] and Ephraim and Malah [2] may be derived. While the Ephraim and Malah MMSE spectral amplitude estimator is well known and widely used, its implementation requires the evaluation of computationally expensive exponential and Bessel functions. Moreover, an intuitive interpretation of its behaviour is obscured by these same functions. With this motivation, we have presented in the second part of this paper a derivation and comparison of three alternatives to the Ephraim and Malah MMSE spectral amplitude estimator.

The derivations also yield an extension of two existing suppression rules: the ML spectral estimator due to McAulay and Malpass [4], and the estimator defined by power spectral subtraction. Specifically, the ML suppression rule has been generalised to an approximate MAP solution in the case of an independent Gaussian prior for each spectral component. It has also been shown that the well-known method of power spectral subtraction, previously developed in a non-Bayesian context, arises as a special case of the MMSE spectral power estimator derived herein.

In addition to providing the aforementioned theoretical insights, these solutions may be of use themselves in situations where a straightforward implementation involving simpler functional forms is required; alternative approaches along a similar line of motivation are developed in [13, 14]. Additionally, for the purposes of speech enhancement, each may be coupled with hypotheses concerning uncertainty of speech presence, as in [2, 4, 13, 14]. Moreover, the form of the MMSE spectral power suppression rule given by (41) provides a clearer insight into the behaviour of the Ephraim and Malah solution. Finally, we note that just as Ephraim and Malah argued that log-spectral amplitude estimation may be more appropriate for speech perception [15], so in other cases may be MMSE spectral power estimation, for example, when calculating auditory masked thresholds for use in perceptually motivated noise reduction [16].

ACKNOWLEDGMENTS

Material by the first author is based upon work supported under a US National Science Foundation Graduate Fellowship. The authors also gratefully acknowledge the contribution of Shyue Ping Ong to this paper, as well as the helpful comments of the anonymous reviewers.

REFERENCES

[1] P. J. Wolfe and S. J. Godsill, "Simple alternatives to the Ephraim and Malah suppression rule for speech enhancement," in Proc. 11th IEEE Workshop on Statistical Signal Processing, pp. 496–499, Orchid Country Club, Singapore, August 2001.

[2] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.

[3] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, pp. 208–211, Washington, DC, USA, April 1979.

[4] R. J. McAulay and M. L. Malpass, "Speech enhancement using a soft-decision noise suppression filter," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 28, no. 2, pp. 137–145, 1980.

[5] N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series: With Engineering Applications, Principles of Electrical Engineering Series, MIT Press, Cambridge, Mass, USA, 1949.

[6] S. J. Godsill and P. J. W. Rayner, Digital Audio Restoration: A Statistical Model Based Approach, Springer-Verlag, Berlin, Germany, 1998.

[7] H. L. Van Trees, Detection, Estimation, and Modulation Theory: Part 1, Detection, Estimation and Linear Modulation Theory, John Wiley & Sons, New York, NY, USA, 1968.

[8] D. L. Wang and J. S. Lim, "The unimportance of phase in speech enhancement," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 30, no. 4, pp. 679–681, 1982.

[9] P. Vary, "Noise suppression by spectral magnitude estimation - Mechanism and theoretical limits," Signal Processing, vol. 8, no. 4, pp. 387–400, 1985.

[10] S. O. Rice, "Statistical properties of a sine wave plus random noise," Bell System Technical Journal, vol. 27, pp. 109–157, 1948.

[11] I. S. Gradshteyn and I. M. Ryzhik, Table of Integrals, Series, and Products, Academic Press, San Diego, Calif, USA, 5th edition, 1994.

[12] O. Cappe, "Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor," IEEE Trans. Speech and Audio Processing, vol. 2, no. 2, pp. 345–349, 1994.

[13] A. Akbari Azirani, R. le Bouquin Jeannes, and G. Faucon, "Optimizing speech enhancement by exploiting masking properties of the human ear," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, vol. 1, pp. 800–803, Detroit, Mich, USA, May 1995.

[14] A. Akbari Azirani, R. le Bouquin Jeannes, and G. Faucon, "Speech enhancement using a Wiener filtering under signal presence uncertainty," in Signal Processing VIII: Theories and Applications, G. Ramponi, G. L. Sicuranza, S. Carrato, and S. Marsi, Eds., vol. 2 of Proceedings of the European Signal Processing Conference, pp. 971–974, Trieste, Italy, September 1996.

[15] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, 1985.

[16] P. J. Wolfe and S. J. Godsill, "Towards a perceptually optimal spectral amplitude estimator for audio signal enhancement," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, vol. 2, pp. 821–824, Istanbul, Turkey, June 2000.

Patrick J. Wolfe attended the University of Illinois at Urbana-Champaign (UIUC) from 1993 to 1998, where he completed a self-designed programme leading to undergraduate degrees in electrical engineering and music. After working at the UIUC Experimental Music Studios in his final year and later at Studer Professional Audio AG, he joined the Signal Processing Group at the University of Cambridge. There he held a US National Science Foundation Graduate Research Fellowship at Churchill College, working towards his Ph.D. with Dr. Simon Godsill on the application of perceptual criteria to statistical audio signal processing, prior to his appointment in 2001 as a Fellow and College Lecturer in engineering and computer science at New Hall, University of Cambridge, Cambridge. His research interests lie in the intersection of statistical signal processing and time-frequency analysis, and include general applications as well as those related specifically to audio and auditory perception.

Simon J. Godsill is a Reader in statistical signal processing in the Engineering Department of Cambridge University. In 1988, following graduation in electrical and information sciences from Cambridge University, he led the technical development team at the audio enhancement company, CEDAR Audio, Ltd., researching and developing DSP algorithms for restoration of audio signals. Following this, he completed a Ph.D. with Professor Peter Rayner at Cambridge University and went on to be a Research Fellow of Corpus Christi College, Cambridge. He has research interests in Bayesian and statistical methods for signal processing, Monte Carlo algorithms for Bayesian problems, modelling and enhancement of audio signals, nonlinear and non-Gaussian signal processing, image sequence analysis, and genomic signal processing. He has published over 70 papers in refereed journals, conference proceedings, and edited books. He has authored a research text on sound processing, Digital Audio Restoration, with Peter Rayner, published by Springer-Verlag.