Speech Communication 54 (2012) 814–835

Analysis and design of Wavelet-Packet Cepstral coefficients for automatic speech recognition

Eduardo Pavez, Jorge F. Silva ⇑

University of Chile, Department of Electrical Engineering, Av. Tupper 2007, Santiago 412-3, Chile

Received 3 July 2011; received in revised form 31 January 2012; accepted 2 February 2012
Available online 18 February 2012

Abstract

This work proposes using Wavelet-Packet Cepstral coefficients (WPCCs) as an alternative way to do filter-bank energy-based feature extraction (FE) for automatic speech recognition (ASR). The rich coverage of time-frequency properties of Wavelet Packets (WPs) is used to obtain new sets of acoustic features, with which performances competitive with, and better than, the widely adopted Mel-Frequency Cepstral coefficients (MFCCs) are obtained in the TIMIT corpus. In the analysis, concrete filter-bank design considerations are stipulated to obtain most of the phone-discriminating information embedded in the speech signal, where the filter-bank frequency selectivity and better discrimination in the lower frequency range [200 Hz–1 kHz] of the acoustic spectrum are important aspects to consider.
© 2012 Elsevier B.V. All rights reserved.

Keywords: Wavelet Packets; Filter-bank analysis; Automatic speech recognition; Filter-bank selection; Cepstral coefficients; The Gray code

1. Introduction

Feature extraction (FE) is one of the key dimensions of design in automatic speech recognition (ASR) (Quatieri, 2002). The most recognized and widely adopted approach for acoustic FE is the use of Mel-Frequency Cepstral coefficients (MFCCs). MFCC extraction is a short-time analysis scheme in which a signature of the acoustic signal spectrum is computed from a filter-bank with central frequencies projected uniformly on the Mel scale (Quatieri, 2002). This scale is derived from well-documented studies of the human auditory system (Quatieri, 2002). Departing from this direction, there has been interest in the use of alternative signal processing techniques to propose new ways of doing short-time filter-bank analysis on the acoustic signal (Silva and Narayanan, 2009; Farooq and Datta, 2001; Choueiter and Glass, 2007; Kim et al., 2000; Tan et al., 1996). The use of Wavelets and Wavelet Packets (Daubechies, 1992; Mallat, 1989; Vetterli and Kovacevic, 1995) has been of particular interest in this context.

0167-6393/$ - see front matter © 2012 Elsevier B.V. All rights reserved.
doi:10.1016/j.specom.2012.02.002

⇑ Corresponding author. Tel.: +56 2 9784090; fax: +56 2 6953881. E-mail addresses: [email protected] (E. Pavez), [email protected] (J.F. Silva). URL: http://www.ids.uchile.cl/josilva/ (J.F. Silva).

Wavelet Packets (WPs) (Vetterli and Kovacevic, 1995; Mallat, 1989; Coifman et al., 1990) have emerged as important signal representation schemes impacting compression, detection and classification (Crouse et al., 1998; Etemad and Chellapa, 1998; Ramchandran et al., 1996; Vasconcelos, 2004; Willsky, 2002; Learned et al., 1992; Scott and Nowak, 2004). This collection of bases is particularly appealing for the analysis of pseudo-stationary time series processes and quasi-periodic random fields, such as the acoustic speech process (Silva and Narayanan, 2009; Choueiter and Glass, 2007; Chang and Kuo, 1993; Learned et al., 1992). WPs belong to the category of structured bases, those whose orthonormal basis elements are generated from a finite number of elementary transformations (Vetterli and Kovacevic, 1995; Daubechies, 1992; Ramchandran et al., 1996). From an engineering point of view, these kinds of representations are attractive because they can be implemented with a basic two-channel filter (TCF) and down-sampling operations (Vetterli and Kovacevic, 1995). WPs can be used to characterize a rich covering of signal-space decomposition, and in particular, they provide a way for generating sub-band dependent partitions of the observation space. In conclusion, WPs induce a family of structural filter-banks with a rich covering of time-frequency characteristics that has the potential for enriching the way conventional MFCC features describe the short-term behavior of the acoustic speech process.

WPs and multi-rate filter bank analysis have been adopted to improve the performance of conventional MFCC features in the context of ASR (Farooq and Datta, 2001; Choueiter and Glass, 2007; Kim et al., 2000; Tan et al., 1996). In particular, Farooq and Datta (2001) proposed a WP filter-bank representation, in which the objective was to mimic the MEL-scale frequency resolution. They used the Daubechies (DB) two-channel filter (Daubechies, 1992), with which performance improvements were observed for specific phone subcategories (stop and unvoiced) in a portion of the TIMIT corpus. More recently, Choueiter and Glass (2007) explored the problem of two-channel filter-bank design and, in particular, the novel framework of rational filter-banks. The focus of this work was to improve the frequency selectivity with respect to the conventionally adopted Daubechies (DB) WPs with standard dyadic structure, by designing a type of MEL-frequency filter-bank structure. Better performances were obtained in a simplified phone-segmented classification task with respect to MFCCs.

These seminal works provide concrete evidence of the advantage of adopting WPs for parameterizing the speech acoustic process. However, the problem of adapting the WP basis-structure to the decision task, in the sense of finding the filter-bank topology, within the collection of tree-structured WP bases, that best captures the time-frequency acoustic information for a given complexity constraint (feature dimension), remains an unexplored direction. As pointed out in (Choueiter and Glass, 2007), this direction has the potential to further adapt WP filter-bank solutions (acoustic energy-signature) to the phone discrimination task at hand. On the other hand, the results reported so far have considered simplified settings, in terms of the classification task or data-sets. Thus, a systematic analysis in standard phone recognition experiments would be beneficial to support the adoption of WP-based features as a competitive front-end alternative for doing acoustic FE.

In this work we propose the Wavelet-Packet Cepstral coefficients (WPCCs) and show concrete results that complement previous work supporting the use of WPs as an FE technique for ASR. This work builds upon the ideas recently proposed in Silva and Narayanan (2009), in which the problem of optimal filter-bank selection for pattern recognition (PR) was formulated based on the minimum probability of error decision principle (Silva et al., 2012; Vasconcelos, 2004). Here we explore WP filter-bank selection to propose a family of WPCCs. These features are log-energy-based acoustic signatures rotated with the discrete cosine transform (the Cepstrum), as proposed in Farooq and Datta (2001), where the energy signatures are obtained from a bank of filters selected from the family of WP filter-banks. For the filter-bank selection, we use a complexity regularized criterion adopted from standard tree-structured bases selection problems (Silva and Narayanan, 2009; Etemad and Chellapa, 1998; Saito and Coifman, 1994; Coifman et al., 1992). In particular, we use acoustic energy, the Fisher-scatter ratio (Duda and Hart, 1983), and the Kullback-Leibler divergence (KLD) as fidelity measures. The last two criteria are phone-discriminative in nature, while energy is based on the principle of increasing the frequency resolution in bands with higher acoustic energy, proposed in Chang and Kuo (1993) for the problem of texture classification. As supporting results, we run standard phone recognition experiments in the TIMIT corpus. We contrast the different filter-bank solutions with respect to a number of design elements. Among them are the fidelity measure used to select the filter-banks, the number of bands, the number of features, and the frequency selectivity of the two-channel filter (TCF) that induces the family of WPs. Interestingly, we found competitive results and concrete solutions that outperform the MFCCs. In the analysis, we show performance trends and dependencies that explain which design variables are important to consider for the construction of good acoustic features for ASR. In the end, WPCCs offer a rich collection of acoustic features that extend the idea of short-time (segmental) energy-signature for acoustic event detection.

The rest of the article is organized as follows. Section 2 revisits the standard approach for obtaining short-term acoustic features. Sections 3 and 4 are devoted to the presentation of the WPCCs, where background material is covered to aid understanding of the filter-bank properties of WPs, and Section 5 covers the filter-bank selection problem. Finally, Sections 6 and 7 show the filter-bank structure of the obtained solutions and the phone-classification performances, respectively. Final remarks are presented in Section 8, and supplemental material is presented in the Appendix.

2. Revisiting the filter bank Cepstral analysis view of feature extraction

We revisit the standard feature extraction (FE) technique for ASR based on filter-bank energy features and the application of the Cepstral transform (Quatieri, 2002), illustrated in Fig. 1a. Given the acoustic signal, the scheme has the following phases: a high-pass pre-emphasis filter $1 - 0.97z^{-1}$ is applied to the whole acoustic signal; the resulting signal is segmented with a Hamming window of 32 ms, creating overlapped short-term acoustic segments every 10 ms (segmental analysis); each acoustic segment is passed through a bank of triangular-shaped filters with center frequencies forming an equipartition of the MEL scale, as shown in Fig. 1; and finally, in each segment the filter-bank energies (FBE) are computed to form a vector, to which the logarithm function (point-wise) and the Discrete Cosine transform (DCT) are applied to create the MEL frequency Cepstral coefficients (MFCCs) (Davis and Mermelstein, 1980).

Fig. 1. Illustration of the phases that characterize the standard approach for acoustic feature extraction in speech recognition.
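The phases listed above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the front-end used in the experiments: the sampling rate (16 kHz), the number of triangular filters (24), the number of retained cepstral coefficients (13), the particular Mel-scale formula and all function names are assumptions introduced for the example.

```python
import numpy as np

def hz_to_mel(f):
    # One common Mel-scale formula (assumed here; the text does not fix one).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters whose centers are equally spaced on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):                      # rising edge
            fb[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):                      # falling edge
            fb[m - 1, k] = (hi - k) / max(hi - c, 1)
    return fb

def dct_ii(v, n_out):
    """Unnormalized DCT-II along the last axis, keeping the first n_out terms."""
    n = v.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (np.arange(n) + 0.5) / n)
    return v @ basis.T

def mfcc(signal, fs=16000, win_ms=32, hop_ms=10, n_filters=24, n_ceps=13):
    # Pre-emphasis 1 - 0.97 z^-1 applied to the whole signal.
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    win, hop = int(fs * win_ms / 1000), int(fs * hop_ms / 1000)
    n_frames = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(win)               # 32 ms segments every 10 ms
    power = np.abs(np.fft.rfft(frames, n=win, axis=1)) ** 2
    fbe = power @ mel_filterbank(n_filters, win, fs).T   # filter-bank energies
    return dct_ii(np.log(fbe + 1e-10), n_ceps)      # point-wise log, then DCT
```

Calling `mfcc` on one second of audio sampled at 16 kHz returns one row of cepstral coefficients per 10 ms frame. The WPCC construction studied in this work replaces only the filter-bank stage of such a pipeline.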

In this work we explore an extension of this framework for acoustic FE, where, instead of using the perceptually motivated MEL filter-bank structure, we study the rich collection of filter-banks induced from the Wavelet Packet (WP) bases (Vetterli and Kovacevic, 1995; Mallat, 2009). The next section is devoted to explaining the methodology adopted to induce a new set of filter-bank energy features from the WPs, and, later, we present the proposed Wavelet Packet Cepstral coefficients (WPCCs) for ASR.

3. Wavelet Packets

WPs were proposed by Coifman et al. (1992) as a collection of bases with an underlying tree-structure. They offer different time-frequency representation qualities and, consequently, the potential to adapt to complex time series phenomena like the speech acoustic process (Silva and Narayanan, 2009). Here we provide a brief introduction to this family with a focus on its filter-bank characteristics. Excellent expositions can be found in Mallat (2009), Vetterli and Kovacevic (1995) and Daubechies (1992).

3.1. WP sub-space decomposition: tree-structured collection

Let $\mathcal{X}$ be the signal space of interest that, without loss of generality, is associated with a finite level of scale $2^L$ or resolution $2^{-L}$, $L$ being an integer strictly greater than zero (Mallat, 2009). Consequently, $\mathcal{X}$ can be equipped with an orthonormal basis $B_L \equiv \{\phi_L(t - 2^L n)\}_{n \in \mathbb{Z}}$ (Mallat, 2009; Vetterli and Kovacevic, 1995; Daubechies, 1992). The WP framework provides a way of decomposing the basis $B_L$ into two orthonormal collections, $B^0_{L+1} \equiv \{\phi^0_{L+1}(t - 2^{L+1} n)\}_{n \in \mathbb{Z}}$ and $B^1_{L+1} \equiv \{\phi^1_{L+1}(t - 2^{L+1} n)\}_{n \in \mathbb{Z}}$, where, denoting by $U^0_{L+1} \equiv \operatorname{span}\{\phi^0_{L+1}(t - 2^{L+1} n) : n \in \mathbb{Z}\}$ and $U^1_{L+1} \equiv \operatorname{span}\{\phi^1_{L+1}(t - 2^{L+1} n) : n \in \mathbb{Z}\}$, we have that (Mallat, 2009)

$$\mathcal{X} = U^0_{L+1} \oplus U^1_{L+1}. \tag{1}$$

The structure of the WP framework comes from the fact that $B^1_{L+1}$ and $B^0_{L+1}$ are induced by a discrete time pair of conjugate mirror filters (CMF) that we denote by $(h(n), g(n))$ (Mallat, 2009, Chap. 7.1.3). More precisely, the basis elements $\{\phi^0_{L+1}(t), \phi^1_{L+1}(t)\}$ associated with the scale $L+1$ are induced from $\phi_L(t)$, of the scale $L$, by

$$\phi^0_{L+1}(t) = \sum_{n=-\infty}^{\infty} h(n) \cdot \phi_L(t - 2^L n), \qquad \phi^1_{L+1}(t) = \sum_{n=-\infty}^{\infty} g(n) \cdot \phi_L(t - 2^L n), \tag{2}$$

where $h(n)$ and $g(n)$ are related by the perfect reconstruction property, i.e., $g(n) = (-1)^{1-n} h(1-n)$, $\forall n \in \mathbb{Z}$ (Coifman et al., 1992), (Mallat, 2009, Th. 8.1).

Iterating the application of the CMF pair $(h(n), g(n))$ on each basis element $\phi^0_{L+1}(t)$ and $\phi^1_{L+1}(t)$ (Mallat, 2009, Th. 8.1), we can continue, in a binary tree-structured way, with the construction of alternative bases and subspace decompositions for $\mathcal{X}$. More precisely, after a fixed number of iterations, we can create $\phi^p_{L+j}(t)$ for all $j \geq 1$ and for any $p \in \{0, \ldots, 2^j - 1\}$, where $U^p_{L+j} = \operatorname{span}\{\phi^p_{L+j}(t - 2^{L+j} n) : n \in \mathbb{Z}\}$, see Fig. 2a. Furthermore, by construction, $\forall j \geq 1$, $\forall p \in \{0, \ldots, 2^j - 1\}$,

$$U^p_{L+j} = U^{2p}_{L+j+1} \oplus U^{2p+1}_{L+j+1}, \tag{3}$$

where $\phi^{2p}_{L+j+1}(t) = \sum_{n=-\infty}^{\infty} h(n) \cdot \phi^p_{L+j}(t - 2^{L+j} n)$ and $\phi^{2p+1}_{L+j+1}(t) = \sum_{n=-\infty}^{\infty} g(n) \cdot \phi^p_{L+j}(t - 2^{L+j} n)$.

In the end, the WPs can be seen as a family of tree-structured bases induced from the iteration of the two-channel filter (TCF) $(h(n), g(n))$, as illustrated in Fig. 2a.

Fig. 2. Binary tree-structure and representation of the family of Wavelet Packet bases.


3.2. Inter-scale relationship of the WP transform coefficients

A key property of WPs is the inter-scale relationship induced from (2) among the WP transform coefficients obtained across scales (Mallat, 2009). More precisely, let $x(t)$ be in $U^p_j \subset \mathcal{X}$ with transform coefficients given by

$$d^p_j(n) \equiv \langle x(t), \phi^p_j(t - 2^j n) \rangle, \quad \forall n \in \mathbb{Z}. \tag{4}$$

Projecting $x(t)$, instead, in the alternative basis associated with $U^{2p}_{j+1} \oplus U^{2p+1}_{j+1}$, we have that (Mallat, 2009, Prop. 8.4)

$$d^{2p}_{j+1}(n) = \sum_{k \in \mathbb{Z}} d^p_j(k) \cdot h(k - 2n), \qquad d^{2p+1}_{j+1}(n) = \sum_{k \in \mathbb{Z}} d^p_j(k) \cdot g(k - 2n), \quad \forall n \in \mathbb{Z}. \tag{5}$$

Considering the fact that those are orthonormal bases, Parseval's relationship (Mallat, 2009) implies that

$$\|x(t)\|^2 = \sum_{n \in \mathbb{Z}} |d^p_j(n)|^2 = \sum_{n \in \mathbb{Z}} |d^{2p}_{j+1}(n)|^2 + \sum_{n \in \mathbb{Z}} |d^{2p+1}_{j+1}(n)|^2. \tag{6}$$

By induction, a closed-form relationship in the transform coefficients can be obtained for every pair of basis elements in the WPs, as illustrated in Fig. 2b. The beauty of this result is that we pass from an analysis in continuous time in (4) to a discrete time analysis (algorithm) in (5). In fact, assuming that $x(t)$ lives in a finite resolution space $\mathcal{X}$, Eq. (4) with $j = L$ and $p = 0$ can be seen as a generalized Sampling theorem (Zhou and Sun, 1999; Walter, 1992). Furthermore, the WP binary structure manifested in (5) permits a fast algorithm implementation of the WP analysis (Mallat, 2009). Concerning the algorithmic part, the next section addresses the filter-bank implementation of WPs (Vetterli and Kovacevic, 1995).
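A small numerical illustration of the analysis step (5) and of the energy split (6) is given below. It is a sketch under assumptions that are not part of the text: the signal is extended periodically so that finite sequences can be used, the Daubechies filter with two vanishing moments is chosen as $h(n)$, the high-pass filter is the conjugate mirror of $h(n)$ delayed by an even number of samples so that it can be stored on the same support, and the names `conjugate_mirror` and `analysis_step` are hypothetical.

```python
import numpy as np

# Daubechies low-pass filter h(n) with two vanishing moments (DB2).
S3 = np.sqrt(3.0)
H = np.array([1 + S3, 3 + S3, 3 - S3, 1 - S3]) / (4 * np.sqrt(2.0))

def conjugate_mirror(h):
    """High-pass g from g(n) = (-1)^(1-n) h(1-n), delayed by len(h)-2 samples
    (an even shift, which preserves orthonormality) to live on 0..len(h)-1."""
    n = np.arange(len(h))
    return ((-1.0) ** (n + 1)) * h[::-1]

def analysis_step(d, f):
    """Eq. (5) with periodic extension: out(n) = sum_k d(k) f(k - 2n)."""
    d = np.asarray(d, dtype=float)
    m = np.arange(len(f))
    idx = (2 * np.arange(len(d) // 2)[:, None] + m[None, :]) % len(d)
    return d[idx] @ f

rng = np.random.default_rng(0)
d_parent = rng.standard_normal(64)           # plays the role of d_j^p(n)
G = conjugate_mirror(H)
d_low = analysis_step(d_parent, H)           # d_{j+1}^{2p}(n)
d_high = analysis_step(d_parent, G)          # d_{j+1}^{2p+1}(n)

# Energy of the parent equals the sum of the energies of the two children.
energy_parent = np.sum(d_parent ** 2)
energy_children = np.sum(d_low ** 2) + np.sum(d_high ** 2)
```

Each child sequence has half the length of the parent, which is the down-sampling by 2 discussed in the next section.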

3.3. WP filter bank implementation

From a discrete time filter-bank point of view (Vetterli and Kovacevic, 1995), the basic iteration in (5) can be implemented by the application of a two-channel filter (TCF), with impulse responses $h(n)$ and $g(n)$, followed by a down-sampler by 2 operation (Vetterli and Kovacevic, 1995; Mallat, 2009). This view is generalized in the following result.

Proposition 1 (Vaidyanathan (1993, Chap. 11.3.3)). Let $x(t)$ be in a finite $2^L$ scale space $\mathcal{X}$, with transform coefficients $(d^0_L(n))_{n \in \mathbb{Z}}$ obtained from (4). Let us consider an arbitrary sub-space $U^p_j$ induced from the WP filter bank decomposition with $j > L$ and $p \in \{0, \ldots, 2^{j-L} - 1\}$. Let us denote by $(h_0(n))_{n \in \mathbb{Z}}$ and $(h_1(n))_{n \in \mathbb{Z}}$ the conjugate mirror filter pair (with transfer functions $H_0(z)$ and $H_1(z)$), by $U^{p_1}_{L+1}, \ldots, U^{p_{j-L-1}}_{j-1}$ the sequence of intermediate sub-spaces used to go from $\mathcal{X}$ to $U^p_j$, and by $\Theta(j, p) = (\theta_1, \ldots, \theta_{j-L}) \in \{0, 1\}^{j-L}$ the binary path code. In the last definition, choosing $\theta_k$ implies filtering with $H_{\theta_k}(z)$ and then applying the down-sampler by 2 at step $k$ of the iteration. Then $(d^p_j(n))_{n \in \mathbb{Z}}$ is obtained by passing $(d^0_L(n))_{n \in \mathbb{Z}}$ through the following discrete time filter

$$H_{\Theta(j,p)}(z) = \prod_{i=1}^{j-L} H_{\theta_i}(z^{2^{i-1}}), \tag{7}$$

and then applying the down-sampler by $2^{j-L}$ operator.

Proof. The proof of this result is a consequence of Proposition 2 presented in Appendix B. Fig. 3 illustrates the relationship. □
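The equivalence stated in Proposition 1 can be checked numerically. The sketch below is not taken from the paper: it uses plain causal convolution (`np.convolve`) instead of the time-reversed filtering convention of (5), which does not affect the identity being checked, it assumes the DB2 conjugate mirror pair, and the function names are hypothetical.

```python
import numpy as np

S3 = np.sqrt(3.0)
H0 = np.array([1 + S3, 3 + S3, 3 - S3, 1 - S3]) / (4 * np.sqrt(2.0))  # DB2 low-pass
H1 = ((-1.0) ** (np.arange(4) + 1)) * H0[::-1]                        # matching high-pass

def cascade(x, path):
    """Filter with H_theta and down-sample by 2 at every step of the path."""
    y = np.asarray(x, dtype=float)
    for theta in path:
        y = np.convolve(y, H1 if theta else H0)[::2]
    return y

def equivalent_filter(path):
    """Impulse response of prod_i H_{theta_i}(z^(2^(i-1))), as in Eq. (7)."""
    h_eq = np.array([1.0])
    for i, theta in enumerate(path):          # i = 0 corresponds to step 1
        h = H1 if theta else H0
        up = np.zeros((len(h) - 1) * 2 ** i + 1)
        up[:: 2 ** i] = h                     # zero insertion: H(z) -> H(z^(2^i))
        h_eq = np.convolve(h_eq, up)
    return h_eq

def single_stage(x, path):
    """Equivalent filter followed by a single down-sampler by 2^len(path)."""
    y = np.convolve(np.asarray(x, dtype=float), equivalent_filter(path))
    return y[:: 2 ** len(path)]

rng = np.random.default_rng(1)
x = rng.standard_normal(64)
path = (1, 0, 1)                              # binary path code (theta_1, theta_2, theta_3)
y_cascade = cascade(x, path)
y_single = single_stage(x, path)
```

Both routes produce the same sequence, so the frequency behavior of a tree node can be studied through the single filter returned by `equivalent_filter`, which is the approach taken in Section 4.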

Fig. 3. The equivalent systems stated in Proposition 1. The aggregated down-sampler is by $K = 2^{j-L}$.

Fig. 4. Illustration of the frequency division of Wavelet Packet bases for two tree structures. The ideal Shannon conjugate filter pair is considered, which provides perfect dyadic partitions of the interval $[-\pi, \pi]$. Scenario (a–c) shows a recursive iteration of $H_0(z)$ (Wavelet type), and scenario (b–d) presents a balanced tree structure (uniform frequency resolution).

² It is necessary that $j_i > L$ and $p_i \in \{0, \ldots, 2^{j_i - L} - 1\}$, $\forall i \in \{1, \ldots, M\}$. In addition there are structural conditions to guarantee that


4. Frequency response of the WP filter banks

Note that the process that relates $(d^0_L(n))_{n \in \mathbb{Z}}$ with $(d^p_j(n))_{n \in \mathbb{Z}}$ in Proposition 1 is linear but not time invariant. Consequently, it is misleading to talk about the frequency response associated with the process of projecting $x(t)$ into the WP sub-space $U^p_j$. We can circumvent this issue by considering only the equivalent filtering part of the process in (7) and, consequently, avoiding the last down-sampling stage.¹ More precisely, we consider the frequency response of the equivalent linear time-invariant (LTI) system just before the down-sampling stage. This characterizes the frequency content associated with each subspace, with which we can define the frequency decomposition achieved by a given WP basis. To illustrate this, let us consider the Shannon WPs (Mallat, 2009) induced by the perfect low and high pass filters presented in Figs. 4 and 5, i.e.,

$$|H_0(e^{j\omega})| = \begin{cases} \sqrt{2} & \omega \in [-\pi/2 + 2k\pi,\ \pi/2 + 2k\pi] \\ 0 & \text{otherwise} \end{cases}$$

and

$$|H_1(e^{j\omega})| = \begin{cases} \sqrt{2} & \omega \in [\pi/2 + 2k\pi,\ 3\pi/2 + 2k\pi] \\ 0 & \text{otherwise.} \end{cases}$$

¹ An alternative interpretation is presented in Appendix A. This analysis is not based on the filter-bank view of WPs presented here.

Following Section 3.1, each WP basis of $\mathcal{X}$ can be represented by the leaves of a binary tree, as shown in Fig. 2(a). More precisely, a basis is indexed by $\{(j_i, p_i) : i = 1, \ldots, M\}$² associated with the basis element $B = \bigcup_{i=1}^{M} B^{p_i}_{j_i}$ and sub-space decomposition $\mathcal{X} = \bigoplus_{i=1}^{M} U^{p_i}_{j_i}$. For each leaf $(j_i, p_i)$ of this tree, we can obtain its equivalent filter $H_i(z) \equiv H_{\Theta(j_i, p_i)}(z)$ by (7) and, consequently, reduce the analysis to the frequency response of an $M$-channel filter-bank, see Fig. 6. Examples of the frequency response before the down-sampling stage are presented in Figs. 4 and 5. From these, we can notice that for the Wavelet type of structure, produced by iterating $H_0(e^{j\omega})$ in every step, we obtain a solution that increases the resolution in the low frequency range. In general, in each step of iterating the TCF, we reduce the frequency support of the resulting sub-space by half, as illustrated in Fig. 4c.

$\{(j_i, p_i) : i = 1, \ldots, M\}$ corresponds to the leaves of a binary tree rooted at node $(L, 0)$, not detailed here for space considerations. We refer the reader to Breiman et al. (1984), Chou et al. (1989) and Scott (2005) for a systematic exposition of this point.

Fig. 5. The same scenario as in Fig. 4. Scenario (a–c) shows a recursive iteration of $H_1(z)$, and scenario (b–d) the reciprocal, in terms of frequency selectivity, of the Wavelet type in Fig. 4(a) and (c).

Fig. 6. The equivalent $M$-channel filter-bank of a WP basis $B = \bigcup_{i=1}^{M} B^{p_i}_{j_i}$.

4.1. Frequency ordering: the Gray code

Concerning frequency ordering, however, the up-sampled versions of $H_0(z)$ and $H_1(z)$ do not necessarily play the role of the low and high pass filters, respectively, in the band of interest. The reason is that the side lobes of these filters, outside the original frequency range of their definition $[-\pi, \pi]$, are brought into $[-\pi, \pi]$ after the up-sampling operation in a non-trivial way (Mallat, 2009). This is a direct consequence of the result presented in Proposition 1.

An example of this phenomenon is shown in Fig. 5a, for the case of iterating $H_1(z)$. This scenario does not provide a solution that decomposes the high frequency range of the signal, see Fig. 5c, as one would expect from its reciprocal Wavelet solution shown in Fig. 4c. To illustrate this mirroring effect more clearly, let us consider Fig. 5b and d. In this scenario, the frequency support of the equivalent filter $H_1(z)H_1(z^2)$ is not the highest band in the interval $[0, \pi]$ as expected. In fact, the supports of $H_1(z)$ and $H_1(z^2)$ are $[\pi/2, 3\pi/2]$ and $[\pi/4, 3\pi/4]$, respectively. Thus $H_1(z)H_1(z^2)$ has support in $[\pi/2, 3\pi/4]$. For further details on this frequency ordering issue, we refer the reader to Mallat (2009, Section 8.1.2) and Atto et al. (2007, 2010).

Fortunately, there is a simple closed-form rule to re-label any admissible node $(j, p)$ in the WP tree as an equivalent node $(j, k)$, at the same depth (scale), so that the resulting labels are frequency ordered (Mallat, 2009). This mapping $k = \mathcal{G}(p)$ is called the Gray code and it is presented in Appendix C for completeness. Then, for each WP basis $B = \bigcup_{i=1}^{M} B^{p_i}_{j_i}$, we can compute the ordered indexes $\{(j_i, k_i) : i = 1, \ldots, M\}$, with $k_i = \mathcal{G}(p_i)$, (C.1), where each induced subspace atom $U^{p_i}_{j_i}$ captures the signal information concentrated in the band

$$I^{k_i}_{j_i} \equiv [-(k_i + 1)\pi 2^{-j_i},\ -k_i \pi 2^{-j_i}] \cup [k_i \pi 2^{-j_i},\ (k_i + 1)\pi 2^{-j_i}]. \tag{8}$$

Then, $B$ produces $I_B = \{I^{k_i}_{j_i} : i = 1, \ldots, M\}$, a partition of the discrete time frequency range $[-\pi, \pi]$.

Extending this analysis to WPs with an arbitrary conjugate mirror filter pair $(h_0(n), h_1(n))$, their frequency selectivity property depends upon how $H_0(e^{j\omega})$ is concentrated in $[-\pi/2, \pi/2]$. Consequently, we only have an approximation of the clean selectivity properties of the Shannon WPs in (8). For the applications on acoustic speech signals, this will be one of the critical aspects to evaluate. In the following, we concentrate on the family of Daubechies (DB) WPs (Mallat, 2009; Daubechies, 1992), exploring different filter order solutions (associated with the number of zeros at $\pi$ of $H_0(z)$), which provide a tradeoff between the order of the TCF and the concentration of $H_0(e^{j\omega})$ in the range $[-\pi/2, \pi/2]$, or frequency selectivity (Mallat, 2009, Chap. 8.1.2). We choose the family of compactly supported Daubechies wavelets (Daubechies, 1992) because it offers a rich range of frequency selectivities. In fact, we can go from the Haar Wavelet (Vetterli and Kovacevic, 1995; Mallat, 2009), where $H_0(z)$ has one zero at $\pi$, with almost no frequency selectivity but perfect time localization, to the Shannon Wavelet that offers perfect frequency selectivity (in the limit where the number of zeros at $\pi$ of $H_0(z)$ goes to infinity) (Mallat, 2009). On the theoretical side, this family offers the minimum order TCF solution $(h_0(n), h_1(n))$ for a given number of vanishing moments or zeros at $\pi$ of $H_0(z)$. This last attribute is associated with the frequency selectivity of the TCF (Mallat, 2009, Th. 7.9).
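To make the tradeoff concrete, the following sketch compares the Haar filter (one zero at $\pi$) with the Daubechies filter with two zeros at $\pi$. The selectivity indicator used here, the fraction of the energy of $H_0(e^{j\omega})$ that falls in $[-\pi/2, \pi/2]$, is an assumption made for this illustration and is not a measure defined in the paper; the function name is hypothetical.

```python
import numpy as np

def passband_energy_fraction(h, n_grid=8192):
    """Fraction of the energy of H(e^{jw}) over [-pi, pi] lying in [-pi/2, pi/2]."""
    # Midpoint grid on [-pi, pi]; +/- pi/2 fall on cell edges, not on grid points.
    w = -np.pi + (np.arange(n_grid) + 0.5) * 2 * np.pi / n_grid
    H = np.exp(-1j * np.outer(w, np.arange(len(h)))) @ np.asarray(h, dtype=float)
    power = np.abs(H) ** 2
    return float(power[np.abs(w) <= np.pi / 2].sum() / power.sum())

haar = np.array([1.0, 1.0]) / np.sqrt(2.0)                         # one zero at pi
s3 = np.sqrt(3.0)
db2 = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))  # two zeros at pi

frac_haar = passband_energy_fraction(haar)
frac_db2 = passband_energy_fraction(db2)
```

Under this indicator the longer filter concentrates a larger share of its energy in the ideal pass band, at the price of a longer impulse response, which is the tradeoff described above.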

5. Wavelet Packet filter-bank selection

The last aspect in the implementation of the WP acoustic features is to decide appropriate WP filter-bank structures for the phone recognition task we have at hand. We follow the data-driven approach independently proposed by Etemad and Chellapa (1998) and Saito and Coifman (1994),³ and revisited by Silva and Narayanan (2009). The idea is to use supervised data to select a filter-bank structure (or a frequency partition of $[-\pi, \pi]$) that provides a nearly-optimal phonetic discrimination basis solution. More details of the formulation of this problem can be found in Silva and Narayanan (2009), Silva and Narayanan (2007) and Vasconcelos (2004).

To formulate the optimization problem, let us first introduce some notation. Following Silva and Narayanan (2009), we represent the process of producing a particular basis in the WP family by a rooted binary tree (Scott, 2005). For simplicity, let $J > 0$ be the maximum number of iterations of the sub-band decomposition process. Let $G = (V, E)$ be a graph with

$$V = \{(0, 0), (1, 0), (1, 1), \ldots, (J, 0), \ldots, (J, 2^J - 1)\}, \tag{9}$$

and $E$ the collection of arcs on $V \times V$ that characterizes a full-rooted binary tree with root $v_{root} = (0, 0)$, as shown in Fig. 2a. Instead of representing the tree as a collection of arcs in $G$, we use the convention of Breiman et al. (1984), in which subgraphs are represented by a subset of nodes of the full graph. More formally, we define a rooted binary tree $T = \{v_0, v_1, \ldots\} \subset V$ as a collection of nodes with only one of degree 2, the root node, and the remaining nodes with either degree 3 (internal nodes) or degree 1 (leaf nodes) (Cormen et al., 1990). We define $\mathcal{L}(T)$ as the set of leaves of $T$ and $\mathcal{I}(T)$ as the set of internal nodes; consequently, $\mathcal{L}(T) \cup \mathcal{I}(T) = T$. We say that a rooted binary tree $S$ is a subtree of $T$ if $S \subset T$. In the previous definition, if the roots of $S$ and $T$ are the same, then $S$ is a pruned subtree of $T$, denoted by $S \ll T$. In addition, if the root of $S$ is an internal node of $T$, then $S$ is called a branch. In particular, we denote the largest branch of $T$ rooted at $v \in T$ as $T_v$. We define the size of the tree $T$ as the number of leaves, i.e., the cardinality of $\mathcal{L}(T)$, denoted as $|T|$. Finally, in our problem, $T_{full} = V$ in (9) denotes the full binary tree; consequently, the collection of WP bases is indexed by the admissible trees $\{T \subset V : T \ll T_{full}\}$.

³ This work was inspired by the seminal work of Coifman and Wickerhauser (Coifman et al., 1992) in the context of basis selection for sparse signal representation.

In this context, any pruned version of the full-rooted binary tree represents a particular way of iterating the TCF $(h_0(n), h_1(n))_{n \in \mathbb{Z}}$ of the WP. More precisely, if we let $T = \{(j_i, p_i) : i \in \{1, \ldots, M\}\}$ be an admissible WP binary tree, then we denote its basis by

$$B_T \equiv \bigcup_{i=1}^{M} B^{p_i}_{j_i}, \tag{10}$$

its sub-space decomposition by

$$U_T \equiv \{U^{p_i}_{j_i} : i = 1, \ldots, M\}, \tag{11}$$

where $\mathcal{X} = \bigoplus_{i=1}^{M} U^{p_i}_{j_i}$, and its ideal Shannon frequency partition by

$$I_T \equiv \{I^{k_i}_{j_i} : i = 1, \ldots, M\}, \tag{12}$$

with $k_i = \mathcal{G}(p_i)$ from (C.1) and $I^{k_i}_{j_i}$ from (8).

Finally, as we are interested in extending the filter-bank Cepstral analysis view for acoustic FE (Section 2), for each $T \ll T_{full}$ and for any point $x \in \mathcal{X}$, we define the filter-bank energy signature of $x$ relative to $T$ by

$$m_T(x) \equiv \left(E^p_j(x)\right)_{(j,p) \in \mathcal{L}(T)} \tag{13}$$

where $E^p_j(x)$ denotes the energy of $x$ in the subspace $U^p_j$, and by orthonormality $\|x\|^2 = \sum_{(j,p) \in \mathcal{L}(T)} E^p_j(x)$.
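As an illustration of the energy signature in (13), the sketch below computes the leaf energies of a Wavelet-type tree and checks that they add up to the signal energy. It assumes the Haar conjugate mirror pair, a periodic extension of the finite sequence, depths counted from the root $(0, 0)$ as in (9), and the first branch of a path stored as the most significant bit of $p$; the function names are hypothetical.

```python
import numpy as np

H0 = np.array([1.0, 1.0]) / np.sqrt(2.0)    # Haar low-pass h(n)
H1 = np.array([-1.0, 1.0]) / np.sqrt(2.0)   # Haar high-pass, g(n) = (-1)^(1-n) h(1-n)

def split(d, f):
    """One analysis step, out(n) = sum_k d(k) f(k - 2n), with periodic extension."""
    idx = (2 * np.arange(len(d) // 2)[:, None] + np.arange(len(f))[None, :]) % len(d)
    return d[idx] @ f

def node_coeffs(x, depth, p):
    """Coefficients of node (depth, p), following its binary path from the root."""
    d = np.asarray(x, dtype=float)
    for i in range(depth):
        theta = (p >> (depth - 1 - i)) & 1
        d = split(d, H1 if theta else H0)
    return d

def energy_signature(x, leaves):
    """m_T(x): one sub-band energy per leaf (j, p) of the tree T."""
    return np.array([np.sum(node_coeffs(x, j, p) ** 2) for (j, p) in leaves])

rng = np.random.default_rng(2)
x = rng.standard_normal(64)
# Leaves of a Wavelet-type tree of depth 3: only the low-pass branch is split.
wavelet_leaves = [(3, 0), (3, 1), (2, 1), (1, 1)]
signature = energy_signature(x, wavelet_leaves)
total_energy = np.sum(x ** 2)
```

The signature has one entry per leaf, so the size of the tree fixes the dimension of the feature vector before the logarithm and the DCT are applied.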

5.1. The tree-pruning problem

Here we revisit the approach in Silva and Narayanan (2009), where the selection of the WP basis was based on approximating the minimum probability of error decision (Silva et al., 2012). This formulation reduces to finding an optimal tradeoff between the estimation and approximation errors and, consequently, addresses a complexity-regularization problem. More precisely, we address the solution of

$$T(\lambda) = \arg\min_{T \ll T_{full}} -F(m_T(X), Y) + \lambda \Phi(T), \tag{14}$$

where $X$ is the random object representing the raw acoustic observation in our signal space $\mathcal{X}$, and $Y$ is the class label random variable with values in the finite alphabet space of phonetic classes $\mathcal{Y}$. The first term in (14) involves $F(\cdot, \cdot)$, which is a measure designed to capture the discriminative information of $m_T(X)$ relative to the class label $Y$ (fidelity measure). The second term involves $\Phi(\cdot)$, a non-decreasing real function (cost term) designed to incorporate estimation error effects. The solution of (14), for all $\lambda > 0$, resides in the solution of the following cost-fidelity problem (Scott, 2005) (Silva and Narayanan, 2009, Sec. IV.D):

$$T_k = \arg\max_{\{T \ll T_{full} :\ |T| \leq k\}} F(m_T(X), Y). \tag{15}$$

The problem in (15) is equivalent to finding the filter-bank of length $k$ that maximizes the fidelity gain $F(m_T(X), Y)$, for all $k \in \{2, 3, \ldots, |T_{full}|\}$. Interestingly, when the fidelity measure is additive,⁴ or alternatively affine,⁵ with respect to the structure of $T$, which will be the case for all measures experimentally evaluated in this work (see Section 5.2), the solution of (15) admits an efficient implementation with complexity $O(|T_{full}| \log |T_{full}|)$ (Silva and Narayanan, 2009, Th. 2 and 3). Furthermore, (15) offers an embedded solution structure, i.e., $T_2 \ll T_3 \ll \cdots \ll T_{(|T_{full}| - 1)} \ll T_{full}$ (Silva and Narayanan, 2009, Th. 3). For completeness, the algorithm for solving (15) is presented in Section 5.3.
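For intuition, problem (15) with an additive fidelity can be solved on a small tree by a straightforward recursion over the leaf budget. The sketch below is a plain dynamic program written for readability; it is not the efficient algorithm with the complexity quoted above, and the toy fidelity values, the dictionary layout and the function name are assumptions made for the example.

```python
from functools import lru_cache

def best_pruned_tree(leaf_fidelity, max_depth, k):
    """Maximize the sum of leaf fidelities over pruned subtrees of the full
    binary tree of depth max_depth having at most k leaves.

    leaf_fidelity maps every node (j, p) to its (hypothetical) leaf fidelity.
    Returns (best value, tuple of leaves)."""
    @lru_cache(maxsize=None)
    def solve(j, p, budget):
        # Option 1: stop here and keep (j, p) as a leaf.
        best = (leaf_fidelity[(j, p)], ((j, p),))
        # Option 2: split (j, p) and share the leaf budget between its children.
        if j < max_depth and budget >= 2:
            for left in range(1, budget):
                v_l, leaves_l = solve(j + 1, 2 * p, left)
                v_r, leaves_r = solve(j + 1, 2 * p + 1, budget - left)
                if v_l + v_r > best[0]:
                    best = (v_l + v_r, leaves_l + leaves_r)
        return best
    return solve(0, 0, k)

# Toy fidelity values on a depth-2 tree (illustrative numbers only).
toy = {(0, 0): 1.0, (1, 0): 0.8, (1, 1): 0.5,
       (2, 0): 0.6, (2, 1): 0.5, (2, 2): 0.1, (2, 3): 0.2}
value_2, leaves_2 = best_pruned_tree(toy, 2, 2)
value_3, leaves_3 = best_pruned_tree(toy, 2, 3)
```

In this toy case the solution with three leaves is obtained by splitting one leaf of the solution with two leaves, which mirrors the embedded structure mentioned above.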

5.2. Fidelity measures

Let $\{(x_i, y_i)\}_{i=1}^{N}$ be independent and identically distributed (i.i.d.) realizations of the joint vector $(X, Y)$, where every pair $(x_i, y_i)$ corresponds to a speech segment and its respective phone label. As fidelity measures, we use the indicators proposed by Saito and Coifman (1994), Etemad and Chellapa (1998) and Silva and Narayanan (2009). All of them can be written in the additive form:

$$F(m_T(X), Y) = \sum_{(j,p) \in \mathcal{L}(T)} F(E^p_j(X), Y). \tag{16}$$

5.2.1. KLD fidelity estimate

The first fidelity measure is the symmetric version of the Kullback-Leibler divergence (KLD) (Kullback, 1958) proposed in Saito and Coifman (1994). Let us define the normalized energy of $x \in \mathcal{X}$ by $\bar{E}^p_j(x) \equiv \frac{E^p_j(x)}{\|x\|^2}$, and the number of examples in class $y \in \mathcal{Y}$ by $N_y \equiv \sum_{i=1}^{N} \mathbb{1}_{\{y\}}(y_i)$. Let the energy map $e(j, p, y)$ be given by

$$e(j, p, y) = \frac{1}{N_y} \sum_{i=1}^{N} \mathbb{1}_{\{y\}}(y_i) \cdot \bar{E}^p_j(x_i), \tag{17}$$

for any pair $(j, p) \in \{0, \ldots, J\} \times \{0, \ldots, 2^j - 1\}$ and $y \in \mathcal{Y}$. For a binary tree $T$, its class conditional energy signature is defined by

$$e_T(y) = \left(e(j, p, y)\right)_{(j,p) \in \mathcal{L}(T)}, \tag{18}$$

where from Parseval's relationship we have that $\sum_{(j,p) \in \mathcal{L}(T)} e(j, p, y) = 1$. Therefore, we can treat $e_T(y)$ as a probability mass function and define the KLD fidelity as (Saito and Coifman, 1994)

$$F(m_T(X), Y) = \sum_{y, z \in \mathcal{Y}} D(e_T(y) \,\|\, e_T(z)). \tag{19}$$

Here $D$ is the discrete KLD (Gray, 1990; Cover and Thomas, 1991). To write the functional in its additive form in (16), we consider the following equalities:

$$\begin{aligned}
F(m_T(X), Y) &= \sum_{y, z \in \mathcal{Y}} D(e_T(y) \,\|\, e_T(z)) \\
&= \sum_{y, z \in \mathcal{Y}} \sum_{(j,p) \in \mathcal{L}(T)} e(j, p, y) \log \frac{e(j, p, y)}{e(j, p, z)} \\
&= \sum_{(j,p) \in \mathcal{L}(T)} \sum_{y, z \in \mathcal{Y}} e(j, p, y) \log \frac{e(j, p, y)}{e(j, p, z)} \\
&= \sum_{(j,p) \in \mathcal{L}(T)} F(E^p_j(X), Y),
\end{aligned}$$

where the leaf functional is

$$F(E^p_j(X), Y) = \sum_{y, z \in \mathcal{Y}} e(j, p, y) \log \frac{e(j, p, y)}{e(j, p, z)}. \tag{20}$$

⁴ A tree functional $\rho(\cdot)$ is additive if $\rho(T) = \sum_{(j,p) \in \mathcal{L}(T)} \rho(j, p)$ (Scott, 2005).

⁵ A tree functional $\rho(\cdot)$ is affine if, for any rooted binary trees $T, S$ such that $S \ll T$, $\rho(T) = \rho(S) + \sum_{s \in \mathcal{L}(S)} \rho(T_s) - \rho(\{s\})$, where $\{s\}$ is the trivial tree rooted at $s$, see (Scott, 2005).
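The energy map in (17) and the leaf functional in (20) translate directly into a few lines of NumPy. The sketch below treats a single node $(j, p)$; the array layout (one normalized energy per training segment, with a parallel array of class labels) and the function names are assumptions made for the example.

```python
import numpy as np

def energy_map(norm_energy, labels, classes):
    """e(j, p, y) for one node: class-conditional mean of the normalized
    sub-band energy, one value per class y, as in Eq. (17)."""
    return np.array([norm_energy[labels == y].mean() for y in classes])

def kld_leaf_fidelity(e):
    """Eq. (20): sum over ordered class pairs (y, z) of e_y * log(e_y / e_z)."""
    e = np.asarray(e, dtype=float)
    return float(np.sum(e[:, None] * np.log(e[:, None] / e[None, :])))

# Toy data: four segments, two phone classes (illustrative numbers only).
norm_energy = np.array([0.2, 0.4, 0.6, 0.8])
labels = np.array([0, 0, 1, 1])
e_node = energy_map(norm_energy, labels, classes=[0, 1])
fidelity = kld_leaf_fidelity(e_node)
```

Because the sum runs over ordered pairs, each unordered pair contributes $(e_y - e_z)\log(e_y / e_z) \geq 0$, so the leaf value vanishes only when every class puts the same average normalized energy in the sub-band.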

5.2.2. Parametric version of the mutual information: Fisher

fidelity estimate

The second indicator is the mutual information (MI)adopted in Silva and Narayanan (2009). Assuming theMarkov tree property presented in Prop. 3 (Silva andNarayanan, 2009) the functional is affine (Silva andNarayanan, 2009, Th. 3). To simplify the estimation, weassume that the class conditional distributions are Gauss-ian, where MI reduces to a version of the Fisher discrimina-

tive indicator (Silva and Narayanan, 2007; Padmanabhanet al., 2005), proposed by Etemad and Chellapa (1998).More precisely, let the energy vector of a signal xi in thetree T be given by mT ðxiÞ ¼ Ep

j ðxiÞ� �

ðj;pÞ2T , and bP ð yf gÞ ¼Ny

N denote the class probability mass 8y 2 Y. Assuming thatthe class conditional probability of object mT ðX Þ is a mul-tivariate Gaussian distribution, the maximum likelihoodestimator of its mean and covariance are

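$$\mu_y = \frac{1}{N_y} \sum_{i=1}^{N} \mathbb{I}_{\{y\}}(y_i)\, m_T(x_i) \qquad (21)$$

and

$$\Sigma_y = \frac{1}{N_y} \sum_{i=1}^{N} \mathbb{I}_{\{y\}}(y_i)\,(m_T(x_i) - \mu_y)(m_T(x_i) - \mu_y)^{\dagger}, \qquad (22)$$

respectively. The unconditional mean estimator is $\mu = \frac{1}{N} \sum_{i=1}^{N} m_T(x_i)$. Now we can define the within-class scatter matrix $S_w$ for the tree $T$ by

$$S_w(T) = \sum_{y \in \mathcal{Y}} \hat{P}(\{y\}) \cdot \Sigma_y, \qquad (23)$$

and the between-class scatter matrix by

$$S_b(T) = \sum_{y \in \mathcal{Y}} \hat{P}(\{y\}) \cdot (\mu - \mu_y)(\mu - \mu_y)^{\dagger}. \qquad (24)$$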

Finally, for a rooted binary tree $T$, its leaf $(j,p)$ fidelity functional is consequently defined by

$$F(E_j^p(X), Y) = \mathrm{tr}\big(S_w^{-1}(t_v)\, S_b(t_v)\big) - \mathrm{tr}\big(S_w^{-1}(\{(j,p)\})\, S_b(\{(j,p)\})\big). \qquad (25)$$

In this context $t_v$ is the binary tree rooted at $v = (j,p)$ with leaves $(j+1, 2p)$ and $(j+1, 2p+1)$ (see Fig. 2a), and $\{(j,p)\}$ is the one-node tree.
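As a rough illustration of (21)-(25), the sketch below computes the scatter matrices and the Fisher gain of splitting one node. It assumes that the feature vector of $t_v$ is the pair of child energies and that the one-node tree contributes the scalar parent energy; the function names are hypothetical, and the closed-form inverse restricts the sketch to one- and two-dimensional features.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n, d = len(vectors), len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(d)]

def scatter(features, labels):
    """Within-class and between-class scatter matrices, as in (23) and (24)."""
    n, d = len(features), len(features[0])
    mu = mean(features)
    s_w = [[0.0] * d for _ in range(d)]
    s_b = [[0.0] * d for _ in range(d)]
    for y in sorted(set(labels)):
        rows = [f for f, lab in zip(features, labels) if lab == y]
        prior = len(rows) / n                      # empirical class probability
        mu_y = mean(rows)
        for r in rows:                             # ML covariance (22), weighted by the prior
            diff = [a - b for a, b in zip(r, mu_y)]
            for i in range(d):
                for k in range(d):
                    s_w[i][k] += prior * diff[i] * diff[k] / len(rows)
        diff = [a - b for a, b in zip(mu, mu_y)]
        for i in range(d):
            for k in range(d):
                s_b[i][k] += prior * diff[i] * diff[k]
    return s_w, s_b

def fisher_trace(features, labels):
    """tr(S_w^{-1} S_b) for 1-D or 2-D features, using a closed-form inverse."""
    s_w, s_b = scatter(features, labels)
    if len(s_w) == 1:
        return s_b[0][0] / s_w[0][0]
    (a, b), (c, d) = s_w
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return sum(inv[i][k] * s_b[k][i] for i in range(2) for k in range(2))

def fisher_gain(child_energies, parent_energies, labels):
    """Leaf functional (25): trace for the split node minus trace for the one-node tree."""
    parent = [[e] for e in parent_energies]
    return fisher_trace(child_energies, labels) - fisher_trace(parent, labels)
```

A positive gain means that the two child energies separate the classes better than the parent energy alone.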

5.2.3. Energy fidelity estimate

Finally, as a non-discriminative indicator, we consider the average subspace energy proposed in Chang and Kuo (1993), i.e.,

$$F(E_j^p(X), Y) = \frac{1}{N} \sum_{i=1}^{N} E_j^p(x_i). \qquad (26)$$

With the average energy fidelity measure in (26), the algorithm to solve (15), presented in Section 5.3, splits the leaf of the tree $T_k$ with the highest average energy to find $T_{k+1}$, the solution of order $k+1$.

5.3. Minimum cost tree pruning algorithm

To conclude this section, a dynamic programming (DP) algorithm to solve (15) is presented. We refer the interested reader to Scott (2005), Chou et al. (1989), Bohanec and Bratko (1994), Breiman et al. (1984) and Silva and Narayanan (2009) for a systematic exposition of the computational complexity, as well as the theoretical results, of this algorithm.

Phase 0 (Choice of parameters):
  Choose a specific CMF pair $h_0, h_1$, a maximum level of decomposition $J$ and a fidelity functional $F$.

Phase 1 (Computation: subband measurements and fidelity gain):
  For all $j \in \{0,\ldots,J-1\}$ and $p \in \{0,\ldots,2^j-1\}$, compute:
    - $E_j^p(x_i)$, $\forall x_i \in \mathcal{X}$
  For all $j \in \{0,\ldots,J-2\}$ and $p \in \{0,\ldots,2^j-1\}$, compute:
    - Fidelity gain $\Delta(j,p)$:
      if ($F$ is the KLD functional)
        $\Delta(j,p) = F(E_{j+1}^{2p}(X), Y) + F(E_{j+1}^{2p+1}(X), Y)$
      else
        $\Delta(j,p) = F(E_j^p(X), Y)$
      end

Phase 2 (Initialization):
  Initialize $T_2 = \{(0,0), (1,0), (1,1)\}$; then $\mathcal{L}(T_2) = \{(1,0), (1,1)\}$.

Phase 3 (Iteration):
  for $k = 2$ to $k = 2^J - 2$
    1. compute: $(j,p) = \arg\max_{(j,p) \in \mathcal{L}(T_k):\, j \le J-1} \Delta(j,p)$
    2. save: $T_{k+1} = T_k \cup \{(j+1, 2p), (j+1, 2p+1)\}$
  end
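Phases 2 and 3 above amount to a greedy growth of embedded trees, which might be sketched as follows. This is a simplified illustration, not the authors' implementation: the name `grow_trees`, the dictionary of precomputed gains and the stopping rule (stop when no leaf can be split, rather than at a fixed final $k$) are assumptions.

```python
def grow_trees(gain, max_depth):
    """Greedy family of embedded trees T_2, T_3, ... following Phases 2 and 3.

    gain: dict mapping a node (j, p) to its precomputed fidelity gain (Phase 1).
    A leaf may be split only if j <= max_depth - 1 and it has a gain entry.
    Returns {k: sorted list of the leaves of T_k}.
    """
    leaves = {(1, 0), (1, 1)}                  # leaves of the initial tree T_2
    trees = {2: sorted(leaves)}
    k = 2
    while True:
        candidates = [v for v in sorted(leaves)
                      if v[0] <= max_depth - 1 and v in gain]
        if not candidates:
            break
        j, p = max(candidates, key=lambda v: gain[v])   # leaf with the highest gain
        leaves.remove((j, p))
        leaves.update({(j + 1, 2 * p), (j + 1, 2 * p + 1)})
        k += 1
        trees[k] = sorted(leaves)
    return trees

# Hypothetical gains for a depth-3 decomposition.
gains = {(1, 0): 5.0, (1, 1): 1.0, (2, 0): 3.0, (2, 1): 0.5, (2, 2): 0.1, (2, 3): 0.2}
family = grow_trees(gains, max_depth=3)
```

Each tree in the returned family contains the previous one, so a single pass yields the filter-bank solutions of every size.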

6. Analysis of filter bank solutions

The TIMIT corpus was adopted for all the experiments presented in this work. TIMIT is one of the standard corpora used to evaluate new methods and techniques in ASR, mainly because it is a phonetically balanced task and has good coverage of speakers and dialects. All of these make TIMIT a sufficiently challenging corpus with which to evaluate new ASR methods, which justifies its wide adoption by the community. The TIMIT corpus consists of 6300 utterances for the 8 major dialects of the United States. There are 630 different speakers, each one speaking 10 sentences. TIMIT phonetic transcriptions contain 64 phonetic classes, from which we have adopted the standard folding proposed in Lee and Hon (1989), which reduces the number of phonetic classes to 39 plus the silence model. The training set proposed in the TIMIT corpus was used to extract supervised data for the tree-pruning stage in Section 5.1. More precisely, we used the phonetic segmentations and labels of the TIMIT database folded into 39 classes to select the supervised training data. For each phone-segmented signal, we took three 20 ms segments, from the left, center, and right positions of the signal, and we considered those as realizations of the phoneme. With this data, we computed the fidelity measures presented in Section 5.2, i.e., the Fisher, the symmetric KLD, and the Energy tree functionals, respectively. Finally, those measures were used to create the filter-bank solutions by solving the pruning problem in (14) and (15).
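The selection of three 20 ms realizations per phone described above might look like the sketch below. The 16 kHz sampling rate, the exact centering rule and the decision to skip phones shorter than one segment are assumptions, since the text does not specify them; `three_segments` is a hypothetical name.

```python
def three_segments(start, end, sample_rate=16000, seg_ms=20):
    """Sample ranges (begin, stop) of the left, center and right segments of a phone.

    start, end: sample indices delimiting the phone in the utterance.
    """
    seg = sample_rate * seg_ms // 1000         # 320 samples at 16 kHz
    if end - start < seg:
        return []                              # assumption: phone too short, skip it
    mid = start + (end - start - seg) // 2
    return [(start, start + seg), (mid, mid + seg), (end - seg, end)]
```

For a 100 ms phone starting at sample 0, `three_segments(0, 1600)` returns the ranges `(0, 320)`, `(640, 960)` and `(1280, 1600)`.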

In addition, we have adopted four different pairs of two-channel filters (TCFs) (see Section 3.3), associated with the Daubechies (DB) Wavelets (Daubechies, 1992; Mallat, 2009; Vetterli and Kovacevic, 1995) of order 6, 12, 24 and 44, respectively. With these we have good coverage of frequency selectivity properties to obtain a fairly representative family of WP filter-bank solutions. It is important to point out that frequency selectivity was one of the key dimensions considered in this analysis.

Fig. 7. Distribution of the KLD fidelity gains $\Delta(j,k)$ indexed by the scale $j$ (vertical axes, depth of the WP decomposition) and frequency location $k$ (horizontal axes, index of frequency bands) considering the frequency-ordered WP sub-space decomposition structure. A whiter color indicates a higher fidelity gain.

6 A systematic exposition of this fact is presented in Shen and Strang (1996) and Shen and Strang (1998).


6.1. Analysis of fidelity gains across scale and frequency location

In this section we report the sensitivity of the WP filter-bank selection algorithm to the frequency selectivity, which is proportional to the order of the Daubechies TCF (DB-TCF) (Mallat, 2009). For that purpose, we have analyzed the fidelity gains across scale and position, represented by a scale index $j$ and a frequency localization index (position) $k$. We compared the fidelity gains of iterating the TCF (see Section 3.1) for the three fidelity functionals (Fisher, KLD and Energy).

Fig. 7 shows the KLD-based gains of decomposing a frequency-ordered node $(j,k)$ (associated with a WP subspace) for the DB-TCF of orders 6, 12 and 24. As expected, higher discriminative gains are obtained in the low frequency domain. It is important to note in the figure that the KLD gain structure is not very sensitive to the order of the TCF, and tends to stabilize as the order (frequency selectivity) increases. This stability phenomenon was also observed with the Fisher-based gains, as well as the Energy gains. However, each of them has a particular fidelity gain structure, as shown in Fig. 8. This shows that the frequency selectivity does not imply a major change in the fidelity gains and, consequently, in the filter-bank tree-structures obtained from solving the minimum cost tree-pruning problem in (15).

On the other hand, Fig. 8 illustrates the gains for the three fidelity criteria with the DB-TCF of order 44 (the highest selectivity). Interestingly, all the plots show that the salient information for discriminating phonemes, relative to the fidelity measure adopted, is localized in the low frequency domain. Consequently, the solutions of the optimal tree-pruning problem offer structures that give priority to iterating the TCF in this frequency range. In this regard, the non-discriminative criterion in Fig. 8, with respect to the discriminative criteria in Fig. 8c and b, has minor differences. However, these differences are sufficient to characterize a particular way of zooming on the lower frequency region of the acoustic space. These zooming patterns could potentially imply some marginal but important differences in ASR recognition performances, as we shall see in the following sections.

6.2. Analysis of the filter-bank frequency responses

In order to contrast the filter-bank solutions induced from different frequency selectivity conditions, Fig. 9 shows the equivalent filter-bank frequency response obtained for the scenarios with DB-TCF of orders 6 and 44, respectively.

Verifying our previous analysis, the frequency selectivity does not significantly affect the structure of the filter-bank solutions, i.e., the way of iterating the TCF. This can be observed in the main lobes of the solutions, which are centered at the same frequencies, focusing on the solutions with the same number of frequency bands illustrated in the rows of Fig. 9. In fact, the solutions of size 6 (Fig. 9) and size 14 (Fig. 9) have the same tree topology; however, their frequency supports are clearly different. Concerning the frequency support, the trend is the following: the family of DB Wavelets converges to the Shannon Wavelets as the order of the TCF increases,⁶ so the frequency supports of the filter-banks converge to the Shannon WP partitions in (8). Alternatively, for any order of the TCF, the frequency support of a subspace with arbitrarily large depth (scale) gets narrower following the Shannon WP frequency support, which in the limit converges to a fixed frequency point. Details of this result are presented in Section 3.2 (Atto et al., 2007; Atto et al., 2010).

For our finite scale regime, the higher the order of the DB-TCF, the closer we are to the Shannon frequency partition in Section 4. Hence, by increasing the order of the TCF, the frequency bands are more clearly localized and the overlap between adjacent bands, or what we called between-band interference, is reduced.

Associated with each frequency-ordered leaf $(j,k)$ of a given WP tree, we have its main lobe centered in the frequency range $I_j^k$ in (8). However, there are also secondary lobes with significant gains, which are not necessarily adjacent to the target band $I_j^k$, in particular for the case of small TCF order solutions. This phenomenon characterizes a very complex interference pattern, as illustrated in Fig. 9. Interpreting these results, the projection onto the subspace associated with a given WP node $(j,k)$ contains information of: its target Shannon band $I_j^k$; the neighborhood bands of $I_j^k$; and, less intuitively, information of undetermined non-adjacent bands because of the gains of the secondary lobes, as illustrated in Fig. 9. The good news is that those secondary-interference lobes vanish as the frequency selectivity increases. These asymptotic trends have a formal justification in the fact that the DB WPs converge to the Shannon WPs as the TCF order tends to infinity (Shen and Strang, 1996, 1998).

Fig. 8. Fidelity gains $\Delta(j,k)$ indexed by the scale $j$ (vertical axes, depth of the WP decomposition) and frequency location $k$ (horizontal axes) considering the frequency-ordered WP sub-space decomposition structure. The Daubechies of order 44 is considered and the results are presented for the three methods. Whiter color indicates higher fidelity gain.

Finally, Fig. 10 shows the frequency response of the equivalent filter-banks obtained with a discriminative and a non-discriminative method. We use the DB-TCF of order 44 to induce filter-banks with clearer structures and reduced side-lobe interference. As was illustrated in Figs. 7 and 8, the pruned solutions offer higher resolution in the low frequency region. In general the M-channel filter-bank solutions of the same size are similar (rows of Fig. 10), but as we increase the number of bands, some minor differences can be observed. In conclusion, for a clean acoustic speech process, the filter-banks obtained are largely independent of the pruning method, and no major contrast is observed between the use of a discriminative and a non-discriminative criterion. This verifies the preliminary results obtained in Silva and Narayanan (2009), where it was claimed that the acoustic speech process is an optimal design, in the sense that it allocates energy in the frequency bands that offer higher frequency discrimination. These results are based on short-time (frame by frame) information analysis of acoustic speech processes to discriminate phonemes, and do not consider, for instance, a noisy scenario, or higher level contextual information, where alternative trends could be observed.

7. Phone recognition experiments

The analysis made in this work considered a number of degrees of freedom for acoustic FE, such as: the fidelity measure for the filter-bank selection problem presented in Section 5.1 (and, therefore, the set of embedded tree-structured WP filter-banks); the frequency selectivity of the TCF; the filter-bank size; and the feature space dimension. As we presented in previous sections, we induce the WPCCs by: first, selecting an M-channel WP filter-bank; second, deriving the frequency-ordered energy coefficients; and finally, applying the DCT for de-correlation as well as for dimensionality reduction (Quatieri, 2002) by choosing the first $m < M$ transformed DCT coefficients. The resulting WPCC features are the previously mentioned $m$ Cepstral coefficients plus the log-energy of the frame.
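The last two steps of the WPCC construction (log band energies followed by a truncated DCT, plus the frame log-energy) can be sketched as below. Several details are assumptions made for illustration: the unnormalized DCT-II, skipping the zeroth coefficient as HTK-style MFCCs do, taking the frame log-energy as the log of the summed band energies, and the numerical `floor`. The function name `wpcc` is hypothetical.

```python
import math

def wpcc(band_energies, m, floor=1e-12):
    """First m DCT-II coefficients of the log band energies, plus frame log-energy.

    band_energies: frequency-ordered energies of the M leaves of the WP tree.
    Returns a list of m + 1 values.
    """
    M = len(band_energies)
    logs = [math.log(max(e, floor)) for e in band_energies]
    # DCT-II, starting at k = 1 (assumption: the zeroth coefficient is dropped).
    ceps = [
        sum(logs[n] * math.cos(math.pi * k * (n + 0.5) / M) for n in range(M))
        for k in range(1, m + 1)
    ]
    log_energy = math.log(max(sum(band_energies), floor))
    return ceps + [log_energy]
```

With `M = 24` bands and `m = 12`, the function returns the 13-dimensional static feature vector used as a reference point in the experiments.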

The experiments are conducted in a sequence of incremental steps. First, we start the analysis with a simplified mono-phone recognition task that does not consider contextual information appended to the WPCC feature vector, i.e., delta and acceleration coefficients. This initial phase is designed to explore the feature space dimension (number of Cepstral coefficients) and WP tree size (number of bands) to define an initial range of values to be explored in the more complex settings. This analysis is conducted under different frequency selectivity for the TCF, and for all the fidelity measures. We then expand the analysis, enriching the feature vector with delta and acceleration coefficients, under the same mono-phone recognition task, to see if we observe similar trends. For that we re-run the phone recognition experiments in the range of values obtained in the previous phase. Finally, we run a state-of-the-art phone recognition experiment considering context-dependent HMM acoustic phone models (tri-phones) with a bi-gram language model. As a benchmark in all the phases mentioned, we have chosen the standard MFCC features computed with the 22-channel MEL-filters and adopting the first 12 Cepstral coefficients plus frame log-energy as the feature vector.

Fig. 9. Frequency response (amplitude gain versus normalized frequency) of the Wavelet Packet filter-bank solutions. The solutions were obtained with Daubechies of order 6 (left column) and 44 (right column), respectively. Plots are normalized over the interval [0, 8 kHz].


In general, for each speech segment, we computed the MFCC and WPCC features using a Hamming window of 32 ms with a frame period of 10 ms. The ASR system was implemented with the HTK toolbox (Young, 2009), where for each phone acoustic model we adopted the standard 5-state hidden Markov model (HMM) (Rabiner, 1989) with 3 emitting states, the standard left-to-right topology, and a 16-component Gaussian mixture as the observation distribution (Rabiner, 1989). We used the steps proposed in the TIMIT documentation to train all models in this work, and the Core-test of the TIMIT corpus was used for obtaining ASR performances.
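The short-time analysis described above (32 ms Hamming window, one frame every 10 ms) can be sketched as follows. The 16 kHz sampling rate is an assumption consistent with the [0, 8 kHz] range of the plots, and dropping the incomplete trailing frame is a simplification; the names `hamming` and `frames` are hypothetical.

```python
import math

def hamming(n):
    """Standard Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frames(signal, sample_rate=16000, win_ms=32, hop_ms=10):
    """Split a signal into windowed frames; an incomplete last frame is dropped."""
    win = sample_rate * win_ms // 1000         # 512 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000         # 160 samples at 16 kHz
    w = hamming(win)
    return [
        [s * c for s, c in zip(signal[start:start + win], w)]
        for start in range(0, len(signal) - win + 1, hop)
    ]
```

At 16 kHz each frame has 512 samples, a power of two, so it can be halved repeatedly by the two-channel filter iterations of a WP tree.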

Fig. 10. Frequency response (amplitude gain versus normalized frequency) of the Wavelet Packet filter-bank solutions. The figures show a comparison between non-discriminative and discriminative criteria, Energy (left column) and KLD (right column), respectively. The solutions were obtained with Daubechies of order 44. Plots are normalized over the interval [0, 8 kHz].


7.1. Context-independent phone recognition experiments

The pruning solutions of size 24 obtained from the three fidelity functionals (KLD, Fisher and Energy) are presented here. The acoustic features are the WPCC plus log-energy with a fixed number of bands, where we varied the number of Cepstral coefficients from 6 to 24, to gain insight into the most appropriate dimension for the feature space. In this context Fig. 11a shows the performance trends of the Fisher fidelity WPCC solutions across the feature space dimension and for different frequency selectivity given by the order of the DB-TCF (db6, db12, db24 and db44). In each of these performance curves, the curse of dimensionality is observed as expected. There is an initial increasing trend in performances that later saturates and decreases, attributed to the well-understood estimation error phenomenon present in this learning-decision problem. The results show an optimal range for the feature space dimension starting approximately at dimension 11 and ending approximately at dimension 19. This good range of feature dimension is practically invariant when we increase the number of bands and the frequency selectivity of the filter-bank solutions. This behavior is also consistent with the other two fidelity measures, KLD and Energy, exemplified in Fig. 11b and c for the WPCC filter-bank solutions of 24 bands in each case.

Fig. 11. Recognition accuracies (% accuracy) in the Core-test set as a function of the number of Cepstral coefficients for a fixed size of WP filter-bank (number of bands) and static features. Effect of frequency selectivity for the Fisher functional filter-banks of size 24 (11a), KLD functional filter-banks of size 24 (11b), and Energy functional filter-banks of size 24 (11c). Each panel shows the db44, db24, db12 and db6 solutions and the MFCCE baseline.

Considering the good range of feature dimension obtained in the previous set of experiments, we fixed one of them, dimension 13 (12 Cepstral coefficients plus log-energy), to show the performance trend with respect to the number of bands of the WP filter-bank solutions (WP tree size). The experiments again consider all fidelity measures and TCF orders (db6, db12, db24 and db44). Fig. 12 shows these trends. Again we observed a performance trend that increases, then saturates, and finally decreases as we explore WP filter-bank solutions with an increasing number of bands. Since in this case the feature dimension is fixed, this trend cannot be attributed to the curse of dimensionality and, consequently, has to do with the acoustic discrimination power of the filter-bank solutions. From these results we conclude that a good range of exploration in the number of bands is from 18 to 26.

Fig. 12. Recognition accuracies (% accuracy) in the Core-test set as a function of the WP filter-bank size (number of bands), for fixed 12 Cepstral coefficients and static features. Effect of frequency selectivity for the Fisher functional filter-banks (a), KLD (b) and Energy (c). Each panel shows the db44, db24, db12 and db6 solutions and the MFCCE baseline.

Before we change the focus to the next set of experiments, a couple of remarks should be made. It is very interesting to observe the trend with respect to the frequency selectivity in the obtained results, Figs. 11 and 12. In almost all cases, increasing the frequency selectivity provides better performances for any given dimension, filter-bank size, and fidelity measure adopted. This supports our conjecture that inter-band interference is something to be avoided for acoustic discrimination and, consequently, that better performances can be achieved by increasing the order of the DB-TCF in our context. This is congruent with some of the results presented in Choueiter and Glass (2007) for the case of a simplified phone-segmented classification task. Also, it is important to note that we have already obtained concrete settings for our WPCCs that outperform the standard MFCC features, under the same scenario that does not consider contextual information in the acoustic features. In this mono-phone recognition task, this benchmark has 44.87% recognition accuracy.

Finally, we add delta and acceleration coefficients to the analysis. It is well understood that dynamic features improve recognition rates, but it is interesting to observe their particular effects on our WP filter-bank features. We consider a similar set of scenarios (number of bands, number of Cepstral coefficients) to explore the effect on frequency selectivity and the fidelity criterion. Fig. 13 shows recognition accuracies as a function of the number of bands for a given fixed Cepstral feature dimension in the set $\{11, 12, 13, 14\}$, which maps to a feature vector of dimensions $\{36, 39, 42, 45\}$, respectively, and with the maximum order (frequency selectivity) in the TCF. In general, the best set of results is obtained in the range of 20–26 bands, illustrated in Fig. 13. In addition, out of this range, the energy fidelity criterion systematically shows the best performance curves and, consequently, the most competitive results with respect to the standard MFCCs (39 feature vector) with a baseline of 55.3% in accuracy. In spite of that, the best result is obtained with the KLD fidelity measure, solution of 22 bands and 12 Cepstral coefficients (a 39 feature vector) shown in Fig. 13b, with recognition accuracy of 55.36%.

Fig. 13. Recognition accuracies (% accuracy) in the Core-test set as a function of the WP filter-bank size (number of bands), for fixed numbers of Cepstral coefficients adding delta and acceleration features. Comparison of solutions obtained for all pruning methods (KLD, Energy and Fisher, against the MFCC EDA baseline) and the higher frequency selectivity considered (DB 44).
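The mapping from the number of Cepstral coefficients to the feature-vector dimension quoted above follows from adding the frame log-energy and then appending delta and acceleration versions of every static component. A minimal sketch (the function name is hypothetical):

```python
def feature_dim(num_cepstra, with_dynamics=True):
    """Dimension of the feature vector for a given number of Cepstral coefficients."""
    static = num_cepstra + 1                   # cepstra plus frame log-energy
    # Static, delta and acceleration blocks have the same size.
    return 3 * static if with_dynamics else static

dims = [feature_dim(m) for m in (11, 12, 13, 14)]
```

Here `dims` evaluates to `[36, 39, 42, 45]`, and 12 coefficients without dynamic features give the 13-dimensional static vector.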

Fig. 14, on the other hand, revisits the effect of the frequency selectivity on the recognition accuracy for the KLD and Energy based solutions with 12 Cepstral coefficients. This verifies that the higher order DB-TCF achieves the best performance. Finally, Table 1 presents the gain of adding delta and acceleration coefficients to the feature vector. This gap increases with the frequency resolution of the WP filters, reaffirming the advantage of adopting higher order TCFs for this task.

7.2. Context-dependent phone recognition experiments

Finally, we evaluate performance in the standard phone recognition task that considers context-dependent HMMs, Cepstral acoustic features plus delta and acceleration, and a bi-gram language model. For this, we focus the analysis on the range of 20–26 bands, and the Cepstral feature dimension in the neighborhood of 13 coefficients. This is the range of values with good performances observed in the previous set of experiments. Fig. 15 shows recognition accuracies as a function of the number of Cepstral coefficients. Here we report the best trends, observed for the case of 24 and 26 filter-bank bands with the DB44 TCF. These trends were obtained with 9 to 15 Cepstral coefficients, i.e., feature space dimensions from 30 to 48. The estimation-approximation error trade-off can be observed as expected; however, these trends are different from those in the context-independent case, shown in Fig. 11. The reason is that, in this context, the number of models is larger, as is the number of model parameters to be estimated, but the training data remains the same. This causes the estimation error to dominate the approximation error earlier, in lower dimensional feature spaces, with respect to the results shown in Fig. 11.

Fig. 14. Recognition accuracies (% accuracy) in the Core-test set as a function of the WP filter-bank size (number of bands), for fixed 12 Cepstral coefficients adding delta and acceleration features. Comparison of solutions at different frequency selectivity (db44, db24, db12 and db6, against the MFCCEDA baseline) for energy (a) and KLD (b).

Table 1
Average gains in recognition accuracy when passing from WPCCE to WPCCEDA acoustic features. Accuracies obtained in a scenario with 12 Cepstral coefficients plus log-energy and number of bands from 14 to 30. The first row shows the average recognition accuracy of static features in the four Daubechies Wavelet scenarios for the KLD, Fisher and Energy solutions. The second row shows the accuracy obtained when running the same task using delta and acceleration features, and the third row shows the accuracy gain.

| | DB6 (%) | DB12 (%) | DB24 (%) | DB44 (%) |
|---|---|---|---|---|
| WPCCE | 42.09 | 43.29 | 43.97 | 44.26 |
| WPCCEDA | 51.19 | 53.13 | 54.2 | 54.5 |
| Gain | 9.1 | 9.84 | 10.22 | 10.24 |


We observed again that the Energy and the KLD methods offer the best performance trends, which is consistent with previous context-independent phone recognition results, where the best two performances are achieved with the Energy functional, in the scenario with 26 bands and 11 Cepstral coefficients (68.04%) and with 24 bands and 11 Cepstral coefficients (68.09%), Fig. 15b, respectively. Those results are very competitive with the state-of-the-art MFCC feature, baseline of 67.28%, where, in fact, they offer a relative improvement of 1.2% in the best case tested.

Fig. 15. Phone recognition accuracies (% accuracy, Core-test) with context-dependent phone models as a function of the number of Cepstral coefficients, considering delta and acceleration features.

To conclude this analysis, the equivalent filter-banks of the Energy solutions with 24 and 26 bands are presented in Fig. 16a and b, respectively. The Mel-scale has a linear-uniform frequency partitioning in the lower frequency range and moves to a uniform logarithmic partitioning in the rest (Quatieri, 2002). Following this trend, our best two solutions, shown in Fig. 16a and b and in Table 2, offer an approximately uniform partition (with the same bandwidth) in the interval [0, 1 kHz] and then an increasing bandwidth from 2 Hz to 8 kHz, as depicted in Table 2. Hence, as expected, our data-driven WP filter-bank solutions offer, in general, the Mel frequency partition type of structure.
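For reference, the relative improvement of the best Energy solution (68.09%) over the MFCC baseline (67.28%) reported in this section can be checked directly from the two accuracies:

$$\frac{68.09 - 67.28}{67.28} \times 100\% = \frac{0.81}{67.28} \times 100\% \approx 1.2\%.$$

The corresponding absolute gain is 0.81 percentage points.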

7.3. Final analysis

Finally, our solutions are compared with two state-of-the-art dyadic WP based features for ASR. In particular, we implemented the 24 and 26 band WP energy-signatures considered by Farooq and Datta (2001) and Choueiter and Glass (2007), respectively. The ideal frequency partitions of those WP solutions are shown in Tables 3 and 4, respectively. In Farooq and Datta (2001), the FE is implemented with the Daubechies TCF of order 6 (DB6) considering a vector of 13 Cepstral coefficients. On the other hand, the acoustic features proposed in Choueiter and Glass (2007) (for the case of dyadic WP) were obtained from the concatenation of 26 log energy vectors plus dynamic features obtained at the phone segmental level, where, at the end of this process, principal component analysis (PCA) was used to reduce the dimensionality of the resulting vector, targeting a phone-segmented classification task. Their dyadic WPs were implemented using Daubechies (DB) TCF of orders 4, 6, 10 and 12, respectively.

To contextualize these solutions in our time-series phonerecognition scenario and to make them comparable withour solutions, we only consider their WP filter bank struc-ture. More precisely, we consider the binary-tree topologiesof the WP bases with their respective dyadic partition ofthe frequency space and their induced WPCCs plusdynamic features (delta and acceleration) based on the gen-eral scheme presented in Section 2. The accuracies obtainedfor the 24 and 26 band WP solutions with DB6 were63.37% and 61.19%, respectively. For the 26 band WP,increasing the order of the TCF to DB12 improves the per-formance to 64.59%, which is consistent with our previousanalysis on frequency selectivity. Because of this trend, wealso tried the unexplored DB44 for the 24 and 26 bandsolutions obtaining improvements of 66.45% and 66.33%,

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.5

1

1.5

2

2.5

3

Ampl

itude

Gai

n


Fig. 16. Frequency response of the filter-banks with the two highest performan

respectively. In general, all these results are below theMFCC baseline of 67.28% for this phone-recognition taskand, in consequence, below the best performance of 68.09%reported for the WPCCs, even in the scenario in which wematch the filter orders adopting DB44.

In terms of the filter-bank structure, our best data-dri-ven solution with 24 bands presented in Table 2 offers fre-quency bands similar to those adopted in Farooq andDatta (2001) and Choueiter and Glass (2007), presentedin Tables 3 and 4, respectively. The reason again is thatour solution follows the general structure of the MEL-scale. However, it is important to emphasize the minorstructural mismatches to justify the performance differ-ences among the WP solutions. On this, the entries in boldin Tables 3 and 4 indicate the bands that have differences,in terms of bandwidth or frequency support, with respectto our best solution shown in Table 2. In particular, the24 band solution in Table 3 has different frequency parti-tions in the intervals [0, 250 Hz], [1000 Hz, 1500 Hz] and[3000 Hz, 5000 Hz]. The same comparison can be madefor the 26 band WP of Table 4, where the differences areconcentrated in the [0, 250 Hz] and [5000 Hz, 6000 Hz]regions. It is worth mentioning that the 26 band WP canbe generated from our 24 band solution, by splitting the(6,0) and (3,5) leaves, therefore, the structural differencesare minor, but important to induce particular feature attri-butes for the task.

8. Summary, discussion and final remarks

This work proposes the Wavelet-Packet Cepstral coeffi-cient (WPCC) as a dynamic filter-bank structure to per-form short-time (frame-by-frame) acoustic analysis forASR. A collection of log-energy based acoustic signatureswith different time-frequency resolutions was derived,extending the conventional MFCC scheme. In the process,the filter-bank properties and basis structure of Wavelet-Packets (WPs) were fully considered, where the interpreta-tion of WP as a filter-bank analysis scheme was put into theframe-by-frame acoustic analysis context. In particular, theequivalent filter-bank frequency response of a WP basiswas defined, where the Gray code and the concept of

[Figure: amplitude gain (vertical axis, 0 to 3) versus normalized frequency (horizontal axis, 0 to 1). Surviving caption text: "…ces tested. The frequency range is normalized over the interval [0, 8 kHz]."]

Table 2. Shannon WP frequency partition of the interval [0, 8 kHz] for the filter-bank solution of Fig. 16a. It contains the frequency ordered leaves of the WP tree, i.e., $\{(j, k = g(p)) : (j, p) \in L(T)\}$, and their respective frequency supports ($I_j^k$) and bandwidths in Hz.

Leaf (j, k)    Band I_j^k (Hz)    Bandwidth (Hz)
(5, 0)         [0, 250]           250
(6, 2)         [250, 375]         125
(6, 3)         [375, 500]         125
(6, 4)         [500, 625]         125
(6, 5)         [625, 750]         125
(6, 6)         [750, 875]         125
(6, 7)         [875, 1000]        125
(5, 4)         [1000, 1250]       250
(5, 5)         [1250, 1500]       250
(5, 6)         [1500, 1750]       250
(5, 7)         [1750, 2000]       250
(5, 8)         [2000, 2250]       250
(5, 9)         [2250, 2500]       250
(5, 10)        [2500, 2750]       250
(5, 11)        [2750, 3000]       250
(5, 12)        [3000, 3250]       250
(5, 13)        [3250, 3500]       250
(5, 14)        [3500, 3750]       250
(5, 15)        [3750, 4000]       250
(4, 8)         [4000, 4500]       500
(4, 9)         [4500, 5000]       500
(3, 5)         [5000, 6000]       1000
(3, 6)         [6000, 7000]       1000
(3, 7)         [7000, 8000]       1000

Table 3. Shannon WP frequency partition of the interval [0, 8 kHz] for a Mel-like filter bank with 24 bands considered by Farooq and Datta (2001). It contains the frequency ordered leaves, frequency supports and bandwidths as in Table 2.

Leaf (j, k)    Band I_j^k (Hz)    Bandwidth (Hz)
(6, 0)         [0, 125]           125
(6, 1)         [125, 250]         125
(6, 2)         [250, 375]         125
(6, 3)         [375, 500]         125
(6, 4)         [500, 625]         125
(6, 5)         [625, 750]         125
(6, 6)         [750, 875]         125
(6, 7)         [875, 1000]        125
(6, 8)         [1000, 1125]       125
(6, 9)         [1125, 1250]       125
(6, 10)        [1250, 1375]       125
(6, 11)        [1375, 1500]       125
(5, 6)         [1500, 1750]       250
(5, 7)         [1750, 2000]       250
(5, 8)         [2000, 2250]       250
(5, 9)         [2250, 2500]       250
(5, 10)        [2500, 2750]       250
(5, 11)        [2750, 3000]       250
(4, 6)         [3000, 3500]       500
(4, 7)         [3500, 4000]       500
(3, 4)         [4000, 5000]       1000
(3, 5)         [5000, 6000]       1000
(3, 6)         [6000, 7000]       1000
(3, 7)         [7000, 8000]       1000


Table 4. Shannon WP frequency partition of the interval [0, 8 kHz] for a Mel-like filter bank with 26 bands considered by Choueiter and Glass (2007). It contains the frequency ordered leaves, frequency supports and bandwidths as in Table 2.

Leaf (j, k)    Band I_j^k (Hz)    Bandwidth (Hz)
(6, 0)         [0, 125]           125
(6, 1)         [125, 250]         125
(6, 2)         [250, 375]         125
(6, 3)         [375, 500]         125
(6, 4)         [500, 625]         125
(6, 5)         [625, 750]         125
(6, 6)         [750, 875]         125
(6, 7)         [875, 1000]        125
(5, 4)         [1000, 1250]       250
(5, 5)         [1250, 1500]       250
(5, 6)         [1500, 1750]       250
(5, 7)         [1750, 2000]       250
(5, 8)         [2000, 2250]       250
(5, 9)         [2250, 2500]       250
(5, 10)        [2500, 2750]       250
(5, 11)        [2750, 3000]       250
(5, 12)        [3000, 3250]       250
(5, 13)        [3250, 3500]       250
(5, 14)        [3500, 3750]       250
(5, 15)        [3750, 4000]       250
(4, 8)         [4000, 4500]       500
(4, 9)         [4500, 5000]       500
(4, 10)        [5000, 5500]       500
(4, 11)        [5500, 6000]       500
(3, 6)         [6000, 7000]       1000
(3, 7)         [7000, 8000]       1000


filter-bank frequency ordering was revisited. This last point is an important concept that, to the best of our knowledge, has not been treated in previous work on the topic (Farooq and Datta, 2001; Choueiter and Glass, 2007; Kim et al., 2000; Tan et al., 1996).

The main contribution of this work is systematically exploring the problem of WP filter-bank selection to obtain adaptive and nearly optimal energy-based filter-bank signatures for an ASR task. This important dimension of analysis has not been considered in previous studies on the topic of Wavelet and WP for ASR (Farooq and Datta, 2001; Choueiter and Glass, 2007; Kim et al., 2000; Tan et al., 1996). In this regard, Farooq and Datta (2001) considered a fixed tree-topology (frequency partition pattern) based on the MEL scale, while in the work of Choueiter and Glass (2007) the objective was to obtain a specific critical-band frequency partition by adopting two previously unexplored filter-bank design methods, as well as rational and dyadic WP filter-banks. In this work, the filter-bank selection problem was addressed by a complexity regularized criterion, with the objective of modeling the well-understood trade-off between feature discrimination and feature complexity. Three methods were explored to provide a wide range of data-driven filter-bank solutions induced from the proposed WPCC analysis scheme, and the performances of those solutions were evaluated and contrasted. It is worth noting that all the proposed filter-bank selection methods reduce to an equivalent tree-pruning problem with additive or affine functionals, which consequently admits computationally efficient implementations, i.e., a complexity that grows polynomially with the size of the problem (Silva and Narayanan, 2009).
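To illustrate why an additive tree-pruning criterion admits an efficient solution, the following is a minimal sketch of a generic bottom-up pruning rule on a full binary tree. It is not the specific algorithm of the paper: the per-node fidelity values in `SCORES`, the penalty `lam`, and the function name `prune` are hypothetical, chosen only to show that each node is visited once and that the penalty controls the number of selected bands.

```python
def prune(scores, lam, depth, node=(0, 0)):
    """Best pruned subtree rooted at `node` for an additive criterion.

    Maximizes the sum over leaves of (score - lam), i.e. fidelity minus a
    complexity penalty per leaf. Returns (value, leaves in left-to-right order).
    """
    j, p = node
    keep_value = scores[node] - lam  # value if this node is kept as a leaf
    if j == depth:
        return keep_value, [node]
    left_value, left_leaves = prune(scores, lam, depth, (j + 1, 2 * p))
    right_value, right_leaves = prune(scores, lam, depth, (j + 1, 2 * p + 1))
    if left_value + right_value > keep_value:
        return left_value + right_value, left_leaves + right_leaves
    return keep_value, [node]


# Hypothetical fidelity scores for a depth-2 tree, indexed by (depth, position).
SCORES = {
    (0, 0): 1.0,
    (1, 0): 0.9, (1, 1): 0.3,
    (2, 0): 0.6, (2, 1): 0.5, (2, 2): 0.15, (2, 3): 0.2,
}

_, leaves_small_penalty = prune(SCORES, lam=0.1, depth=2)
_, leaves_large_penalty = prune(SCORES, lam=0.5, depth=2)
print(leaves_small_penalty)
print(leaves_large_penalty)
```

Because the criterion is additive over leaves, the decision at each node depends only on the optimal values of its two children, so the cost is linear in the number of tree nodes. Sweeping `lam` yields a family of solutions of decreasing size, in the spirit of the fidelity versus complexity trade-off described above.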

Moving on with the experimental findings, as reported in Section 6, there are only marginal differences in the fidelity gains observed when considering discriminative and non-discriminative fidelity indicators. This implies that the filter-bank solutions obtained show similar structures, where in general they provide increased frequency resolution in the low-frequency range. This verifies the well-known fact that the discriminative information of the speech acoustic process is embedded in lower frequency bands, and that the speech production-perception process can be considered an optimal communication design, in the sense that there is more signal energy in the frequency region where more perception (frequency discrimination) is available. On the experimental side, this is demonstrated under concrete experimental conditions and with the standard HMM-based phone recognition task. The energy-fidelity-based WPCC solutions offer the best performance results compared with two discriminative fidelity indicators (Fisher-scatter based, and Kullback-Leibler divergence-based) and, furthermore, they show a number of constructions that outperform the state-of-the-art MFCCs. Interestingly, under clean acoustic conditions, our data-driven frequency selectivity methods offer filter-bank solutions that follow, in general, the structure of the MEL scale, although our approach offers performance improvements

Fig. 17. Commuting relationship between the down-sampler and filtering.


with respect to the state-of-the-art MEL-based energy signatures (MFCCs).

In addition, we show that frequency selectivity in the design of the Wavelet Packet filter-banks is a critical dimension of analysis for obtaining good performances. More precisely, the better the selectivity of the two-channel filter (the basic block that constructs the WP basis family), the better the phone recognition performances obtained from their filter-bank solutions, which agrees with some of the findings presented in Choueiter and Glass (2007).

Although the reported ASR performance improvements can be considered marginal, the generality of our WPCC construction is worth emphasizing. WPCC acoustic features offer a natural way of extending the MFCC filter-bank analysis paradigm by considering a much more general way of characterizing the filter-bank analysis part. In fact, we provide a way of creating not only a fixed solution, but also a family of embedded filter-bank solutions (and their respective Cepstral energy-based features) with increased frequency discrimination. As we have shown in our experiments, these solutions are adapted to the task, i.e., they offer the optimal estimation-approximation error tradeoff, which depends on a number of dynamic factors that are strongly task dependent. To mention a few of them: the intrinsic acoustic-discrimination complexity of the task (approximation part); the modeling assumptions; the number of model parameters; the amount of data (the estimation error part); and the presence of distortion or noise in the training data.

9. Future work

For some applications, it would be beneficial to work with non-optimal parsimonious representations, to save algorithmic complexity at the expense of sacrificing some accuracy. An example of this would be scenarios with communication constraints, or scenarios where the task has a smaller vocabulary, in which the algorithmic complexity associated with the Viterbi-decoding is a critical issue in the design of an ASR solution. In this context, the proposed embedded filter-bank solutions have the flexibility to address the trade-off between performance and algorithmic complexity. In this regard, we believe that there are a number of directions to be explored with respect to the operational flexibility that the proposed WPCCs offer for ASR applications. Another important future work direction is to evaluate the WPCCs in the problem of robust ASR under different noisy conditions, source coding distortions and channel degradations.

Acknowledgment

The work was supported by funding from FONDECYT Grant 1110145, CONICYT-Chile. We are grateful to the anonymous reviewers for their suggestions and comments, which contributed to improving the quality and organization of the work. We thank S. Beckman for proofreading this material.

Appendix A. Wavelet Packets: an alternative view of its sub-space frequency content

The conjugate mirror filter pair $(h(n), g(n))$ maps the canonical basis $B_L$ of $X$ to an alternative orthogonal basis $B^0_{L+1} \cup B^1_{L+1}$. Importantly, we can associate the sub-spaces $U^0_{L+1} = \mathrm{span}\{\phi^0_{L+1}(t - 2^{L+1}n) : n \in \mathbb{Z}\}$ and $U^1_{L+1} = \mathrm{span}\{\phi^1_{L+1}(t - 2^{L+1}n) : n \in \mathbb{Z}\}$ with a frequency content of $X$ by the following relationship:

$$\phi^0_{L+1}(w) = h(2^L w) \cdot \phi_L(w), \qquad \phi^1_{L+1}(w) = g(2^L w) \cdot \phi_L(w), \tag{A.1}$$

where $\phi^0_{L+1}(w)$ and $h(2^L w)$ denote the Fourier transform (FT) and the Discrete-Time Fourier transform (DTFT) of $\phi^0_{L+1}(t)$ and $h(n)$ (alternatively, $\phi^1_{L+1}(t)$ and $g(n)$), respectively. Iterating the application of $(h(n), g(n))$, we induce $\phi^p_{L+j}(t)$ for all $j \geq 1$ and for any $p \in \{0, \ldots, 2^j - 1\}$, where the frequency content of any arbitrary sub-space in the chain, for instance $U^p_{L+j} = \mathrm{span}\{\phi^p_{L+j}(t - 2^{L+j}n) : n \in \mathbb{Z}\}$, is inherited from (A.1) by:

$$\phi^{2p}_{L+j+1}(w) = h(2^L w) \cdot \phi^p_{L+j}(w), \qquad \phi^{2p+1}_{L+j+1}(w) = g(2^L w) \cdot \phi^p_{L+j}(w). \tag{A.2}$$

Figs. 4 and 5 illustrate those frequency maps for the ideal Shannon pair of filters, which provides a perfect partition of the frequency content of $X$.

Appendix B. Multi-rate filter-bank property

Proposition 2 (Vetterli and Kovacevic (1995, Chap. 2, pp. 72–73)). Let $h(n)$ be the impulse response of an LTI system with transfer function $H(z)$. Then for any $(x(n)) \in \mathbb{R}^{\mathbb{Z}}$, it is equivalent to pass $x(n)$ through a down-sampler by $N$ and then through the LTI system with transfer function $H(z)$, or to pass $x(n)$ through $H(z^N)$ and then through the down-sampler by a factor of $N$. Fig. 17 illustrates the relationship.
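The commuting relationship of Proposition 2 can be checked numerically on finite sequences. The sketch below is only an illustration under stated assumptions: the filter taps `h`, the factor `N` and the random input are arbitrary choices, and the helper names `downsample` and `upsample_filter` are not from the paper. $H(z^N)$ is realized by inserting $N - 1$ zeros between consecutive taps of $h(n)$.

```python
import numpy as np


def downsample(x, n):
    """Keep every n-th sample, starting at index 0."""
    return x[::n]


def upsample_filter(h, n):
    """Impulse response of H(z^n): n - 1 zeros between consecutive taps of h."""
    h_up = np.zeros((len(h) - 1) * n + 1)
    h_up[::n] = h
    return h_up


rng = np.random.default_rng(0)
x = rng.standard_normal(64)              # arbitrary test signal
h = np.array([0.5, 1.0, -0.25, 0.125])   # arbitrary FIR taps
N = 3

# Path A: down-sample by N, then filter with H(z).
path_a = np.convolve(downsample(x, N), h)
# Path B: filter with H(z^N), then down-sample by N.
path_b = downsample(np.convolve(x, upsample_filter(h, N)), N)

print(np.allclose(path_a, path_b))
```

Path A filters at the lower rate and therefore needs fewer multiplications per output sample, which is the practical reason this identity is used when deriving equivalent filter-bank responses of tree-structured decompositions.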

Appendix C. The Gray code

Proposition 3 (Mallat (2009, Chap. 8.1.2)). Let $(j, p)$ be an admissible node of the Shannon WP decomposition with binary path $\Theta(j, p) = (\theta_1, \ldots, \theta_{j-L}) \in \{0, 1\}^{j-L}$; then its equivalent frequency-ordered label $(j, k)$ is constructed by the following rule

$$k = G(p) \equiv \sum_{i=1}^{j-L} \bar{\theta}_i \cdot 2^i \in \{0, \ldots, 2^{j-L} - 1\}, \tag{C.1}$$

where $\bar{\theta}_i \equiv \left(\sum_{l=i}^{j-L} \theta_l\right) \bmod 2 \in \{0, 1\}$, $\forall i \in \{1, \ldots, j-L\}$.


References

Atto, A.M., Pastor, D., Isar, A., 2007. On the statistical decorrelation of the wavelet packet coefficients of a band-limited wide-sense stationary random process. Signal Processing 87 (10), 2320–2335.

Atto, A.M., Pastor, D., Mercier, G., 2010. Wavelet packets of fractional Brownian motion: Asymptotic analysis and spectrum estimation. IEEE Transactions on Information Theory 56 (9), 429–441.

Bohanec, M., Bratko, I., 1994. Trading accuracy for simplicity in decision trees. Machine Learning 15, 223–250.

Breiman, L., Friedman, J., Olshen, R., Stone, C., 1984. Classification and Regression Trees. Wadsworth, Belmont, CA.

Chang, T., Kuo, C.J., 1993. Texture analysis and classification with tree-structured wavelet transform. IEEE Transactions on Image Processing 2 (4), 429–441.

Chou, P., Lookabaugh, T., Gray, R., 1989. Optimal pruning with applications to tree-structure source coding and modeling. IEEE Transactions on Information Theory 35 (2), 299–315.

Choueiter, G., Glass, J., 2007. An implementation of rational wavelets and filter design for phonetic classification. IEEE Transactions on Audio, Speech, and Language Processing 15 (3), 939–948.

Coifman, R., Meyer, Y., Quake, S., Wickerhauser, V., 1990. Signal processing and compression with wavelet packets. Tech. rep., Numerical Algorithms Research Group, New Haven, CT, Yale University.

Coifman, R.R., Meyer, Y., Wickerhauser, M.V., 1992. Wavelet analysis and signal processing. In: B. Ruskai (Ed.), Wavelets and their Applications. Jones and Bartlett, pp. 153–178.

Coifman, R.R., Wickerhauser, M.V., 1992. Entropy-based algorithm for best basis selection. IEEE Transactions on Information Theory 38 (2), 713–718, March.

Cormen, T., Leiserson, C., Rivest, R.L., 1990. Introduction to Algorithms. The MIT Press, Cambridge, Massachusetts.

Cover, T.M., Thomas, J.A., 1991. Elements of Information Theory. Wiley Interscience, New York.

Crouse, M.S., Nowak, R.D., Baraniuk, R.G., 1998. Wavelet-based statistical signal processing using hidden Markov models. IEEE Transactions on Signal Processing 46 (4), 886–902, April.

Daubechies, I., 1992. Ten Lectures on Wavelets. SIAM, Philadelphia.

Davis, S.B., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 28 (4), 357–366.

Duda, R.O., Hart, P.E., 1983. Pattern Classification and Scene Analysis. Wiley, New York.

Etemad, K., Chellappa, R., 1998. Separability-based multiscale basis selection and feature extraction for signal and image classification. IEEE Transactions on Image Processing 7 (10), 1453–1465, October.

Farooq, O., Datta, S., 2001. Mel filter-like admissible wavelet packet structure for speech recognition. IEEE Signal Processing Letters 8 (7), 196–198.

Gray, R.M., 1990. Entropy and Information Theory. Springer-Verlag, New York.

Kim, K., Youn, D., Lee, C., 2000. Evaluation of wavelet filters for speech recognition. In: IEEE Int. Conf. Syst. Man. Cybern., pp. 2891–2894.

Kullback, S., 1958. Information Theory and Statistics. Wiley, New York.

Learned, R.E., Karl, W.C., Willsky, A.S., 1992. Wavelet packet based transient signal classification, 109–112.

Lee, K.-F., Hon, H.-W., 1989. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech and Signal Processing 37 (11), 1641–1648.

Mallat, S., 1989. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11, 674–693, July.

Mallat, S., 2009. A Wavelet Tour of Signal Processing. 3rd ed. Academic Press.

Padmanabhan, M., Dharanipragada, S., 2005. Maximizing information content in feature extraction. IEEE Transactions on Speech and Audio Processing 13 (4), 512–519, July.

Quatieri, T.F., 2002. Discrete-Time Speech Signal Processing: Principles and Practice. Prentice Hall.

Rabiner, L.R., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 (2), 257–286, February.

Ramchandran, K., Vetterli, M., Herley, C., 1996. Wavelets, subband coding, and best bases. Proceedings of the IEEE 84 (4), 541–560, April.

Saito, N., Coifman, R.R., 1994. Local discriminant basis. In: Proc. SPIE 2303, Mathematical Imaging: Wavelet Applications in Signal and Image Processing, 2–14.

Scott, C., 2005. Tree pruning with subadditive penalties. IEEE Transactions on Signal Processing 53 (12), 4518–4525.

Scott, C., Nowak, R.D., 2004. Templar: A wavelet-based framework for pattern learning and analysis. IEEE Transactions on Signal Processing 52 (8), 2264–2274, August.

Shen, J., Strang, G., 1996. Asymptotic analysis of Daubechies polynomials. Proceedings of the American Mathematical Society 124 (12), 3819–3833.

Shen, J., Strang, G., 1998. Asymptotics of Daubechies filters, scaling functions, and wavelets. Applied and Computational Harmonic Analysis 5, 312–331.

Silva, J., Narayanan, S., 2007. Minimum probability of error signal representation. In: IEEE Workshop Machine Learning for Signal Processing, August.

Silva, J., Narayanan, S., 2009. Discriminative wavelet packet filter bank selection for pattern recognition. IEEE Transactions on Signal Processing 57 (5), 1796–1810.

Silva, J.F., Narayanan, S.S., 2012. On signal representations within the Bayes decision framework. Pattern Recognition 45 (5), 1853–1865, May.

Tan, B., Minyue, F., Spray, A., Dermody, P., 1996. The use of wavelet transform in phoneme recognition. In: Int. Conf. Spoken Lang. Process., pp. 2431–2434.

Vaidyanathan, P.P., 1993. Multirate Systems and Filter Banks. Prentice-Hall, Englewood Cliffs, NJ.

Vasconcelos, N., 2004. Minimum probability of error image retrieval. IEEE Transactions on Signal Processing 52 (8), 2322–2336.

Vetterli, M., Kovacevic, J., 1995. Wavelets and Subband Coding. Prentice-Hall, Englewood Cliffs, NJ.

Walter, G.G., 1992. A sampling theorem for wavelet subspaces. IEEE Transactions on Information Theory 38 (2), 881–884.

Willsky, A.S., 2002. Multiresolution Markov models for signal and image processing. Proceedings of the IEEE 90 (8), 1396–1458, August.

Young, S., 2009. The HTK Book (for HTK Version 3.4).

Zhou, X., Sun, W., 1999. On the sampling theorem for wavelet subspaces. The Journal of Fourier Analysis and Applications 5 (4), 347–354.
