IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 5, JULY 2012 1573
VTLN Using Analytically Determined Linear-Transformation on Conventional MFCC
D. R. Sanand and S. Umesh
Abstract: In this paper, we propose a method to analytically obtain a linear-transformation on conventional Mel frequency cepstral coefficient (MFCC) features that corresponds to conventional vocal tract length normalization (VTLN)-warped MFCC features, thereby simplifying VTLN processing. There have been many attempts to obtain such a linear-transformation, but in all previously proposed approaches either the signal processing is modified (and is therefore not conventional MFCC), or the linear-transformation does not correspond to conventional VTLN-warping, or the matrices are estimated from data and are therefore data dependent. In short, the conventional VTLN part of an automatic speech recognition (ASR) system cannot simply be replaced with any of the previously proposed methods. Umesh et al. proposed the idea of using band-limited interpolation for performing VTLN-warping on MFCC using plain cepstra. Motivated by this work, Panchapagesan and Alwan proposed a linear-transformation to perform VTLN-warping on conventional MFCC. However, in their approach, VTLN-warping is specified in the Mel-frequency domain and is not equivalent to conventional VTLN. In this paper, we present an approach which also draws inspiration from the work of Umesh et al., and which we believe for the first time performs conventional VTLN as a linear-transformation on conventional MFCC using the ideas of band-limited interpolation. Deriving such a linear-transformation to perform VTLN allows us to use the VTLN-matrices in a transform-based adaptation framework with its associated advantages, while still requiring the estimation of only a single parameter. Using four different tasks, we show that our proposed approach has almost identical recognition performance to conventional VTLN on both clean and noisy speech data.

Index Terms: Automatic speech recognition (ASR), linear-transformation, Mel frequency cepstral coefficient (MFCC), speaker normalization, vocal tract length normalization (VTLN).
I. INTRODUCTION
INTER-SPEAKER variability is a major source of perfor-
mance degradation in speaker-independent (SI) automatic
speech recognition (ASR) systems. Most state-of-the-art sys-
tems now incorporate vocal-tract length normalization (VTLN)
as an integral part of the system to reduce inter-speaker variability and hence improve the recognition performance [1]–[6].
Manuscript received December 27, 2010; revised July 08, 2011 and January 15, 2012; accepted January 23, 2012. Date of publication January 31, 2012; date of current version March 21, 2012. This work was done while the authors were at the Department of Electrical Engineering, Indian Institute of Technology, Kanpur. This work was supported in part by the Department of Science and Technology, Ministry of Science and Technology, India, under SERC project SR/S3/EECE/058/2008. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Steve Renals.
D. R. Sanand is with the Norwegian University of Science and Technology,NO-7491 Trondheim, Norway (e-mail: [email protected]).
S. Umesh is with the Department of Electrical Engineering, Indian Institute of Technology Madras, Chennai 600036, India (e-mail: [email protected]).
Digital Object Identifier 10.1109/TASL.2012.2186289
VTLN performs speaker normalization by reducing the variabil-
ities in the spectra of speech signals that arise due to differences
in the vocal tract lengths (VTL) of speakers uttering the same
sound [7]. The normalization is achieved by either compressing
or expanding the speech spectrum and is usually referred to as
scaling. This scaling is usually specified through a mathematical relation of the type $\omega' = \phi(\omega)$, where $\omega'$ is the warped-frequency and $\phi(\cdot)$ is the frequency-warping function. It is commonly assumed that the spectra of different speakers uttering the same sound are linearly scaled versions of one another [7], [8], i.e., $\phi(\omega) = \alpha\omega$. We would like to make it clear to the reader that, though the discussion in this paper assumes linear-scaling of the spectra, the methods developed in this paper can be applied to any arbitrary warping function. VTLN requires the estimation of only a single parameter, the warp-factor $\alpha$, for
normalization and hence requires very little acoustic data unlike
adaptation based methods (e.g., MLLR and CMLLR). However,
the practical implementation of conventional VTLN follows a
maximum likelihood (ML) based grid search over a pre-defined
range of warping-factors. This requires the features to be gen-
erated for all the warp-factors after appropriate modification of
the spectra. The ML estimate of the warp-factor is then found
by evaluating the likelihood of the warped features with respect
to the acoustic model, $\lambda$, and the transcription, $W$, and is given by

$\hat{\alpha} = \arg\max_{\alpha} \Pr(X^{\alpha} \mid \lambda, W) \qquad (1)$

where $X^{\alpha}$ consists of static features obtained after frequency-warping the spectra by warp-factor $\alpha$, appended with differential and acceleration coefficients. In some systems, linear
discriminant analysis (LDA) is applied over a window of such
warped consecutive frames to account for dynamic variations
before obtaining the final feature-vector.
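The grid search of (1) can be sketched as follows. The feature "warping" and the single-Gaussian scorer below are toy stand-ins (a real system re-extracts features for each warp-factor and scores them against the HMM acoustic model), and the log-Jacobian term is an artifact of the simplified warp, added so the toy objective has an interior maximum:

```python
import numpy as np

def log_likelihood(features):
    # Toy stand-in for the acoustic-model score Pr(X^alpha | lambda, W):
    # a single zero-mean, unit-variance Gaussian.
    return float(np.sum(-0.5 * (np.log(2 * np.pi) + features ** 2)))

def warp_features(features, alpha):
    # Hypothetical "warping": a real system would re-extract features from
    # spectra warped by alpha, or apply a precomputed matrix A^alpha.
    return features / alpha

def estimate_warp_factor(features, grid=None):
    # ML grid search of (1) over the conventional 0.80..1.20 range.
    if grid is None:
        grid = np.arange(0.80, 1.21, 0.02)
    scores = {}
    for a in grid:
        # The -N log(alpha) Jacobian term keeps this toy objective from
        # degenerating; it is specific to the simplified warp above.
        scores[round(float(a), 2)] = (log_likelihood(warp_features(features, a))
                                      - len(features) * np.log(a))
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
feats = 1.1 * rng.standard_normal(2000)   # simulated speaker, true alpha ~ 1.1
alpha_hat = estimate_warp_factor(feats)
```

The search recovers a warp-factor near the simulated value; the cost of evaluating every grid point is what the linear-transformation approach later reduces.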
Recently there has been a lot of interest in obtaining a direct linear-transformation between static conventional Mel frequency cepstral coefficient (MFCC) features $c$ and the static VTLN-warped MFCC $c^{\alpha}$, i.e.,

$c^{\alpha} = A^{\alpha} c \qquad (2)$

where $A^{\alpha}$ represents a matrix transformation.
One of the early attempts to obtain a linear-transformation
(LT) on the cepstra for speaker-normalization was by Acero
et al. [9], [10]. They showed that the warped cepstral coefficients can be obtained at the outputs of a bank of filters at time zero, by formulating the bilinear transform as a linear filtering operation with the time-reversed cepstrum sequence as the input. McDonough et al. [11] proposed a linear transformation using generalizations of the bilinear transform known as
all-pass transforms. The derivations were based on the argument
that frequency-warping functions, $\phi(\omega)$, used in most VTLN
methods can be approximated to a reasonable degree by the bi-
linear transform. Pitz et al. [12], [13] argued that a linear-trans-
formation of cepstra can be obtained for any arbitrary invert-
ible warping function. However, their derivations were made
using the modified signal processing approach discussed in [14], which does not include filter-bank smoothing during the feature
extraction. The cepstra are assumed to be inverse discrete-time
Fourier transform (IDTFT) coefficients of the log power spec-
trum (without Mel warping) to derive the cepstral linear-trans-
formation. Pitz states in his thesis [15] that inclusion of Mel-warping makes the transformation highly non-linear and could
not be solved analytically. There have been other attempts to
obtain an approximation to the linear-transformation including
the work of Claes et al. [16], where the linear-transformation
was derived using the average third formant information. Cox
[17] presented a model based approach for VTLN that performs
transformation on MFCC features. Kim et al. [18] estimated the linear-transformation using the ideas of constrained maximum likelihood linear regression (CMLLR) from training data. Cui and Alwan [19] derived a mapping matrix using formant-like peaks, which can be seen as a special case of [16]. Sanand
et al. [20] derived a linear-transformation using the idea of dy-
namic frequency warping, where the mapping is learnt from
the data. It is important to note that in all these methods, ei-
ther the signal processing is changed (and therefore not conven-
tional MFCC), or the linear-transformation does not correspond
to conventional VTLN-warping, or the matrices are estimated
and hence are dependent on the database. Therefore, the con-
ventional VTLN part of an ASR system cannot be simply re-
placed with any of the methods described above.

Umesh et al. [21] proposed the idea of using band-limited interpolation to derive a linear-transformation for obtaining VTLN-warped MFCC that performs both Mel- and VTLN-warping on plain cepstra. Motivated by this work,
Panchapagesan and Alwan [22], [23] proposed an approach
to incorporate VTLN-warping into the inverse discrete cosine
transform (DCT) transformation to obtain a linear-transfor-
mation of the type shown in (2). We refer to this approach
as Cosine-interpolation in this paper. It is important to note that VTLN-warping, $\phi(\cdot)$, in [22], [23] is performed in the Mel-frequency domain and is not exactly equivalent to con-
ventional VTLN frequency-warping. This may be important in
cases where the warping function is specified in the frequency
(Hz) domain based on physiological arguments.
In this paper, we present an approach which, we believe, for the first time performs conventional frequency-warping, $\phi(\omega)$, as a linear-transformation on conventional MFCC using the ideas of band-limited interpolation. We refer to this approach as Sinc-interpolation in this paper. The goal is to analytically obtain the linear-transformation $A^{\alpha}$ of (2) given $\phi(\omega)$. The
proposed method does not modify any aspect of the conven-
tional MFCC computation including the use of Mel filter-bank
smoothing as well as discrete cosine transform (DCT)-II. A part
of this work has already been presented in [20], [24], and [25].
A major advantage in obtaining a linear-transformation in the framework of (2) is that the VTLN-warped cepstral features
Fig. 1. Steps involved in generating conventional MFCC features.
need not be computed for each $\alpha$ by first frequency-warping the
spectra and then computing the corresponding VTLN-warped
cepstra. Instead, the VTLN-warped cepstra can be directly
obtained from static conventional MFCC features through
a matrix transformation. It can be easily shown that the dy-
namic coefficients of the warped features would also be related
through the same transformation in this case. Another advan-
tage of such an approach is that, these matrices can be viewed as
feature transformation matrices similar to CMLLR, but are pre-
computed rather than estimated from data, requiring very little
adaptation data for optimal selection of $\alpha$. The use of such matrices also enables the warp-factors to be estimated by accumulating sufficient statistics, thereby simplifying the procedure for optimal warp-factor estimation [26], [27] and reducing
the computational complexity by 75%. Further, VTLN matrices
can be used in regression tree framework to perform VTLN at
acoustic class level, allowing estimation of multiple warp-fac-
tors for a single utterance [28] which is very difficult to imple-
ment in conventional VTLN framework. Finally, there is a pos-
sibility of using these VTLN matrices as base matrices for adap-
tation until sufficient data is available to obtain a robust estimate
of the adaptation (MLLR/CMLLR) matrix [29]. Recently, there
is also interest in using VTLN in the transform-based approach
for statistical speech synthesis [30].

The paper is organized as follows. In Section II, we present
how VTLN is performed in practice and discuss the limita-
tions in formulating the problem as a linear-transformation. In
Section III, we present our idea of performing VTLN and show
that a matrix transformation can be formulated on conventional
MFCC to obtain VTLN-warped MFCC. Section IV presents
our setup for performing the speech recognition experiments
along with a description of the databases used in our experiments.
In Section V, we discuss the differences between the proposed
and the Cosine-interpolation approaches for VTLN. Finally,
we present the recognition results to show that the proposed
approach has performance comparable to conventional VTLN.
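The claim above that the dynamic coefficients are related through the same transformation can be checked numerically: delta features are linear combinations across frames, while the per-frame matrix acts within each frame, so the two operations commute. A minimal sketch, with a random matrix standing in for a precomputed VTLN matrix $A^{\alpha}$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((16, 16))          # stand-in for a VTLN matrix A^alpha
frames = rng.standard_normal((10, 16))     # static cepstra, one row per frame

def deltas(x):
    # Simple two-frame difference; any linear regression over frames
    # (as used for differential coefficients) behaves the same way.
    return x[2:] - x[:-2]

# Warping then differencing equals differencing then warping, because the
# delta operator acts across frames and A^alpha acts within a frame.
lhs = deltas(frames @ A.T)
rhs = deltas(frames) @ A.T
```

Hence a single precomputed matrix per warp-factor also transforms the appended dynamic coefficients.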
II. IMPLEMENTATION OF CONVENTIONAL VTLN
Conventional MFCC feature extraction, which does not
include VTLN-warping, is usually implemented as shown in
Fig. 1. Let $S$ represent the power or magnitude spectrum of a frame of speech. Let $F$ represent the filter-bank smoothing operation along with Mel-warping, which can be represented through a linear-transformation matrix. Further, let $C$ represent the DCT transformation, which is also linear. The static MFCC features, $c$, are obtained by applying the Mel-warped filter-bank to the power spectrum of the speech signal, followed by applying a logarithm to the amplitudes of the output of the
Fig. 2. Conventional framework for generating warped features in VTLN. The filter-bank is inversely scaled instead of re-sampling the speech signal for each warp-factor, for efficient implementation.
filter-bank and finally a DCT transformation. All the operations
can be written mathematically as
$c = C \log(F S) \qquad (3)$
The DCT matrix is given by

$C_{ij} = \beta_i \cos\left(\frac{\pi\, i\, (j + 0.5)}{N}\right), \quad i = 0, \dots, M-1, \; j = 0, \dots, N-1 \qquad (4)$

and the scaling factor is defined as $\beta_i = \sqrt{1/N}$ for $i = 0$ and $\beta_i = \sqrt{2/N}$ otherwise. Here, $N$ is the number of filters used in the Mel filter-bank and $M$ is the number of cepstral coefficients.
As an illustration, let the speech frame consist of 320 samples. A 512-point DFT is applied to obtain the 256-dimensional vector $S$ whose elements are the magnitudes of the DFT coefficients for one-half of the spectrum. This is because the magnitude spectrum has even-symmetry. If 20-filter Mel filter-bank smoothing is applied, then $F$ is a 20 × 256 matrix that operates on $S$ to obtain the Mel-warped smoothed spectrum. $C$ is the 20 × 20 DCT matrix applied to the log-compressed Mel-warped smoothed spectrum to obtain the MFCC feature vector $c$. In practice, only the first 16 cepstral coefficients are used, and one may use a 16 × 20 DCT transformation.
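The illustration above can be sketched end-to-end. The triangular filter-bank below uses one common construction (HTK-style edge placement); normalization details vary between front-ends, so treat this as an assumption rather than the paper's exact front-end:

```python
import numpy as np

def mel(f):     return 2595.0 * np.log10(1.0 + f / 700.0)
def inv_mel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, n_bins=256, fs=16000):
    # Triangular filters with edges uniform on the Mel scale (HTK-style);
    # a sketch of F, not a specific toolkit's exact matrix.
    edges = inv_mel(np.linspace(0.0, mel(fs / 2), n_filters + 2))
    bins = np.floor((2 * n_bins) * edges / fs).astype(int)   # 512-pt DFT -> 256 bins
    F = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for b in range(lo, c):  F[i, b] = (b - lo) / max(c - lo, 1)   # rising edge
        for b in range(c, hi):  F[i, b] = (hi - b) / max(hi - c, 1)   # falling edge
    return F

def dct_matrix(n_ceps=16, n_filters=20):
    # DCT-II as in (4), with the usual orthonormalizing scale factors.
    i = np.arange(n_ceps)[:, None]
    j = np.arange(n_filters)[None, :]
    beta = np.where(i == 0, np.sqrt(1.0 / n_filters), np.sqrt(2.0 / n_filters))
    return beta * np.cos(np.pi * i * (j + 0.5) / n_filters)

# c = C log(F S) for one frame, matching (3) and the dimensions in the text.
S = np.abs(np.fft.rfft(np.random.default_rng(2).standard_normal(320), 512))[:256]
F, C = mel_filterbank(), dct_matrix()
c = C @ np.log(F @ S + 1e-10)   # small floor guards against log(0)
```

The shapes reproduce the text: $F$ is 20 × 256, $C$ is 16 × 20, and $c$ has 16 coefficients.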
VTLN features are obtained in the original method of Andreou et al. [7] by frequency-warping the magnitude spectra before applying the unwarped Mel filter-bank. This is done by re-sampling the signal. Therefore, in this case the signal is warped for each VTLN warp-factor, while the Mel filter-bank is left unchanged. Lee and Rose [8] proposed an efficient alternate implementation, where the Mel filter-bank is inverse-scaled for each $\alpha$, while the signal spectrum is left unchanged, as shown in Fig. 2. This is the most popular method of VTLN-warping. Therefore, in the Lee–Rose method, VTLN-warping is integrated into the Mel filter-bank, and $F^{\alpha}$ denotes the (inverse) VTLN-warped Mel filter-bank. Conventionally, the warp-factor, $\alpha$, used for warping the spectra is in the range of
Fig. 3. Illustrating the change in the filter-bank structure with VTLN-warping in the linear-frequency (Hz) domain. The filters have nonuniform center frequencies with nonuniform bandwidths.
Fig. 4. The piece-wise linear warping function used in conventional VTLN, motivated by physiological arguments. The slope of the warping function is changed at $\omega_0$ to avoid bandwidth mismatch after frequency scaling.
0.80 to 1.20 based on physiological arguments. For each $\alpha$, the center frequencies and bandwidths of the Mel filter-bank are appropriately scaled to obtain the Mel- and VTLN-warped smoothed spectra [8]. The change in the filter-bank structure for different warp-factors is illustrated in Fig. 3. The slope of the last filter has been modified appropriately using piece-wise linear warping [31], so that the Nyquist frequency maps onto itself after frequency scaling. This avoids the bandwidth mismatch that arises due to frequency warping. The piece-wise linear warping function used in our experiments is given by

$\phi(\omega) = \alpha\omega, \quad 0 \le \omega \le \omega_0 \qquad (5)$

$\phi(\omega) = \alpha\omega_0 + \left(\frac{\omega_{\max} - \alpha\omega_0}{\omega_{\max} - \omega_0}\right)(\omega - \omega_0), \quad \omega_0 < \omega \le \omega_{\max} \qquad (6)$

and is shown in Fig. 4. $\omega_0$ represents the cutoff frequency where the slope is changed and $\omega_{\max}$ is the Nyquist frequency. Although piece-wise linear warping is the most commonly used
frequency-warping, motivated by the physiological argument that changes in VTL manifest as spectral-scaling, the methods developed in this paper can be applied to any arbitrary warping function.
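The piece-wise linear function of (5) and (6) can be transcribed directly; the cutoff and Nyquist values below are illustrative choices, not prescribed by the paper:

```python
def piecewise_linear_warp(omega, alpha, omega0, omega_max):
    # (5): linear scaling by alpha up to the cutoff omega0.
    if omega <= omega0:
        return alpha * omega
    # (6): the second segment is chosen so the Nyquist frequency
    # omega_max maps onto itself, avoiding bandwidth mismatch.
    slope = (omega_max - alpha * omega0) / (omega_max - omega0)
    return alpha * omega0 + slope * (omega - omega0)

fmax, f0 = 8000.0, 7000.0   # illustrative Nyquist and cutoff frequencies (Hz)
```

By construction the two segments meet at $\omega_0$ and the endpoint $\omega_{\max}$ is a fixed point for every $\alpha$.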
The warped cepstral features are given by

$c^{\alpha} = C \log(F^{\alpha} S) \qquad (7)$
These are obtained by first warping and smoothing the power
spectrum, followed by log and the DCT operations. The filter-
bank is integrated with both Mel- and VTLN-warping, to per-
form smoothing as well as scaling of the spectrum. Observing
(3) and (7), it is clear that the only difference between the con-
ventional and VTLN-warped MFCC features is the change in
the filter-bank structure, while the rest of the operations are the
same.
For the case of $\alpha = 1$, $F^{\alpha}$ exactly corresponds to the case of conventional MFCC without VTLN-warping. From (3) and (7), the relation between $c$ and $c^{\alpha}$ is given as

$c^{\alpha} = C \log\big(F^{\alpha} F^{-1} \exp(C^{-1} c)\big) \qquad (8)$
A linear-transformation between $c$ and $c^{\alpha}$ can be derived if all the intermediate operations can be represented as linear operations, but from (8) it is evident that log is a nonlinear operation and that in practice $F^{-1}$ does not exist. This is because the power-spectrum cannot be completely reconstructed from the filter-bank outputs, owing to the smoothing operation [16].
We need to obtain $S$, since conventional VTLN warping relations are always specified in the linear-frequency (Hz) domain, usually through a mathematical relation of the type $\omega' = \phi(\omega)$, where $\omega'$ is the warped-frequency and $\phi(\cdot)$ is the frequency-warping function. Therefore, in this case, it is not possible to completely recover $S$ from the filter-bank output, and hence a linear-transformation is not possible.
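The non-invertibility of $F$ can be seen concretely: any wide (here 20 × 256) smoothing matrix has a large null space, so distinct spectra produce identical filter-bank outputs. A small numerical sketch, with a random nonnegative matrix standing in for the filter-bank:

```python
import numpy as np

rng = np.random.default_rng(3)
F = np.abs(rng.standard_normal((20, 256)))   # stand-in 20x256 filter-bank
S1 = np.abs(rng.standard_normal(256)) + 1.0  # one power spectrum

# Any vector in the null space of F changes the spectrum but not the
# filter-bank output, so S cannot be recovered from F S (rank(F) <= 20).
_, _, Vt = np.linalg.svd(F)
null_dir = Vt[-1]                            # a null-space direction of F
S2 = S1 + 0.1 * null_dir                     # a different, still-positive spectrum
```

Both spectra map to the same 20-dimensional output, which is exactly why (8) cannot be evaluated in practice.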
In the next section, we show that separating the frequency-
warping operation from the filter-bank avoids the need to invert
the filter-bank operation or the logarithm and allows us to derive
a linear transformation on conventional MFCC.
III. REALIZING A LINEAR-TRANSFORMATION
In this section, we show that separating the VTLN-warping (speaker scaling) from the Mel filter-bank helps us to derive a linear-transformation (LT) between warped and unwarped cepstral features within the conventional MFCC framework. Let $g = \log(F S)$ be the log-compressed Mel-warped filter-bank output. From (3), we see that knowledge of $c$ implies knowledge of $g$, as they form a DCT pair, i.e.,

$c = C g, \qquad g = C^{-1} c \qquad (9)$

However, we cannot completely recover $S$ from $g$ because of the filter-bank smoothing operation. Since $S$ cannot be completely recovered, we re-frame the problem as follows: $g^{\alpha} = \log(F^{\alpha} S)$ can be obtained by applying a linear-transformation on $g$ without recovering $S$, i.e.,

$g^{\alpha} = T^{\alpha} g \qquad (10)$
Fig. 5. Modification in the signal processing steps (separating the Mel- and VTLN-warping) for realizing a linear-transformation. The filter-bank performs only Mel-warping of the spectra and the proposed band-limited interpolation matrix performs the VTLN-warping.
where $T^{\alpha}$ is the transformation applied on $g$ to obtain $g^{\alpha}$. The above equation states that the filter-bank, $F$, performs only Mel-warping and the transformation $T^{\alpha}$ performs VTLN-warping. This means that the VTLN-warping integrated into the filter-bank for efficient implementation in the conventional approach [8] is now performed separately and is not a part of the filter-bank construction. This is illustrated in Fig. 5. If such a relation can be obtained, then from (3) and (7), the relation between $c$ and $c^{\alpha}$ is given by

$c^{\alpha} = C\, T^{\alpha}\, C^{-1} c \qquad (11)$
By defining a LT between $g$ and $g^{\alpha}$, we completely avoid the inversion of the filter-bank for obtaining the raw magnitude spectrum $S$ and also bypass the $\log(\cdot)$ operation. We would like to remind the reader that the VTLN-warping relation is usually specified in the linear-frequency (Hz) domain, and therefore, at this point it is not clear what the relation between $g$ and $g^{\alpha}$ should be. In the next subsection, we describe a method to obtain a LT using the idea of band-limited interpolation.
A. Band-Limited (Sinc-) Interpolation
For a band-limited continuous-time signal, , given uni-
formly spaced samples of the signal that are appropriately sam-
pled, i.e., , we can exactly reconstruct the original contin-
uous-time signal. This implies that we can recover the values
of the time signal at time-instants otherthan those at the uni-
formly spaced samples. We use this idea to obtain the LT for
VTLN-warping, except now we consider que-frency limited sig-
nals, instead of frequency-limited signals.
$g$ can be obtained either by applying a nonuniform filter-bank (shown in Fig. 3) to the linear-frequency (Hz) magnitude spectrum or by applying a uniformly spaced filter-bank (shown in Fig. 6) to the Mel-warped magnitude spectrum. Therefore, in the Mel-frequency domain, the continuous Mel-warped log-compressed spectrum, $g(\mu)$, can be interpreted as the output of convolving a triangle function with the Mel-warped magnitude spectrum, followed by a log operation on the amplitudes. We can think of the vector $g$ as being obtained by uniformly sampling $g(\mu)$ at $\mu_k$, $k = 0, \dots, N-1$, where the positions of these samples exactly correspond to the center frequencies of the filter-bank. Because of the triangle smoothing and the subsequent $\log$-operation on the output (which reduces dynamic range), the que-frency content of this $\log$-compressed smoothed spectrum lies only in the low que-frency region. Fig. 7 compares the cepstral coefficients obtained with and without filter-bank smoothing. We see that the cepstral coefficients die
Fig. 6. The change in the filter-bank structure with VTLN-warping in the Mel-frequency domain is illustrated. The filters have uniformly spaced center frequencies with uniform bandwidth for $\alpha = 1$. However, they are nonuniformly spaced for $\alpha$ different from unity.
Fig. 7. The effect of filter-bank smoothing on the cepstral coefficients is illustrated. Filter-bank smoothing helps limit the que-frency content to the lower region, ensuring que-frency limitedness.
down faster with filter-bank smoothing, indicating that the que-frency content is limited to the low que-frency region. During VTLN-warping, the filter center frequencies are appropriately scaled in the linear-frequency (Hz) domain by inverse-$\alpha$, as described in Lee–Rose [8]. This corresponds to the center frequencies of the filter-bank being non-uniformly spaced in the Mel-frequency domain, as shown in Fig. 6. As we represent the log-compressed Mel-warped smoothed magnitude spectrum by the continuous function $g(\mu)$, the output of the VTLN-warped filter-bank corresponds to sampling $g(\mu)$ nonuniformly, i.e., at $\mu'_k$. These nonuniformly spaced samples exactly correspond to the elements of the vector $g^{\alpha}$.

From the above discussion, we point out that the elements of the vector $g$ (i.e., $g(\mu_k)$) can be interpreted as uniformly spaced samples and the elements of $g^{\alpha}$ (i.e., $g(\mu'_k)$) as nonuniformly spaced samples of the same continuous function $g(\mu)$. The main idea is that, given the samples in $g$, the samples (or elements) in $g^{\alpha}$ can be reconstructed using band-limited interpolation, provided that the cepstrum is que-frency limited.
Let $g(\mu)$ and $q(n)$ form a discrete-time Fourier transform (DTFT) pair. Then sampling $g(\mu)$ would result in periodic repetition of $q(n)$. As long as $q(n)$ is strictly que-frency limited and the sampling rate is sufficiently high, there is no aliasing in the cepstral domain. In such a case, the value of $g(\mu)$ at any Mel-frequency can be found from its uniformly-spaced samples at $\mu_k$ through band-limited interpolation. This is basically exploiting the sampling theorem, where a signal (in this case a frequency-domain signal) can be reconstructed from its samples using Sinc-interpolation. The continuous function $g(\mu)$ is nowhere used for any calculation purposes and is presented here only for a better understanding of the derivation of the band-limited interpolation matrix.
Note that que-frency limitedness ensures that there is no overlap in the periodic repetition of $q(n)$ (i.e., no aliasing), and hence $g(\mu)$ can be exactly recovered. The que-frency limitedness property depends both on the amount of smoothing done by the Mel-filters (which controls the number of significant cepstral coefficients) and on the number of Mel-filters, which determines the periodicity. If there is aliasing, there will be differences between the Sinc-interpolated and the actual values. Since our effort in this paper is to use conventional MFCC processing, both of these parameters are already fixed by the feature extraction stage. However, as we will show later, even using conventional MFCC processing there is very little difference between interpolated and true values.
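The band-limited interpolation being relied on here can be demonstrated directly: uniform samples of a suitably band-limited signal determine its value at off-grid points via a sinc-weighted sum (truncated to a finite window here, so the agreement is approximate):

```python
import numpy as np

# A band-limited test signal sampled on the integers: np.sinc(0.5 t) has
# bandwidth 0.25 cycles/sample, comfortably below the Nyquist limit of 0.5.
n = np.arange(-400, 401)
x = np.sinc(0.5 * n)

def sinc_interp(samples, grid, t):
    # Band-limited interpolation: x(t) = sum_n x[n] sinc(t - n),
    # truncated to the available samples.
    return float(np.dot(samples, np.sinc(t - grid)))

x_offgrid = sinc_interp(x, n, 0.3)        # reconstruct between the samples
true_value = float(np.sinc(0.5 * 0.3))
```

The same mechanism, applied to the que-frency limited $g(\mu)$ over the Mel-frequency axis, yields the warped samples $g(\mu'_k)$ from the uniform samples $g(\mu_k)$.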
The steps to obtain the transformation matrix are as follows.

1) Let $\mu_k$, $k = 0, \dots, N-1$, represent the uniformly-spaced Mel-frequencies, with the samples of $g(\mu)$ at these points being the elements of vector $g$. Their corresponding linear-frequencies (Hz) are nonuniformly spaced and are represented by $f_k$, $k = 0, \dots, N-1$. These are the center frequencies of the Mel-filters in the linear-frequency (Hz) domain and are related through the standard Mel-relation, i.e.,

$\mu = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (12)$

2) During VTLN-warping, the warping function $\phi(f)$ is applied to obtain the warped frequencies. Let $f'_k = \phi(f_k)$ represent the warped frequencies in the linear-frequency (Hz) domain. Although our proposed method will work for any warping function $\phi(\cdot)$, for illustration purposes we use the piece-wise linear warping function as defined in (5) and (6). The corresponding VTLN-warped center frequencies $\mu'_k$ of the filters in the Mel-frequency domain will not be related through a linear scaling relation, since

$\mu'_k = 2595 \log_{10}\left(1 + \frac{\phi(f_k)}{700}\right) \qquad (13)$

Therefore, while $f'_k = \alpha f_k$ for the linear-scaling relation (i.e., along the $f$-axis), $\mu'_k \neq \alpha \mu_k$ along the $\mu$-axis, as seen from (12) and (13) and graphically shown in Fig. 8. The Cosine-interpolation approach proposed in [22], [23] assumes warping in the Mel domain (i.e., the $\mu$ domain), and therefore does not correspond to conventional VTLN-warping, which is specified in the frequency domain (i.e., the $f$ domain). While we refer only to piece-wise linear warping for illustration purposes, any frequency-warping function can be used in our proposed approach by specifying the appropriate $\phi(\cdot)$ in (13).
Fig. 8. The band-limited interpolation for the linear-scaling relation is illustrated. Warping is defined in the linear-frequency (Hz) domain or $f$-axis, i.e., $f' = \alpha f$. Along the $\mu$-axis, $\mu_k$ are the center-frequencies of the uniformly spaced filter-bank, corresponding to $f_k$ in the Mel-domain. Similarly, $\mu'_k$ are the center-frequencies of the warped filter-bank and are nonuniformly spaced in the Mel-domain. The band-limited interpolation matrix is defined to obtain samples at $\mu'_k$ given samples at $\mu_k$. In the figure, unprimed variables represent unwarped frequencies in both the linear-frequency (Hz) and Mel-frequency domains, and primed variables represent warped frequencies in both domains.
3) The Fourier relation between $g(\mu)$ and $q(n)$ is given by

$g(\mu) = \sum_{n} q(n)\, e^{j \pi n \mu / \mu_{\max}} \qquad (14)$

where $\mu_{\max}$ is the Nyquist frequency in the Mel-frequency domain. Here, we assume that the signal is periodic with a period of $2\mu_{\max}$ and symmetric around $\mu_{\max}$. Therefore, theoretically, half-filters are present at the boundary indices, and the values at these indices are required for performing band-limited interpolation. If we assume that $q(n)$ is que-frency limited, the elements of $g^{\alpha}$ can be determined as [we use variable $m$ since $n$ is already used in (14)]

$g^{\alpha}(k) = g(\mu'_k) = \sum_{m} q(m)\, e^{j \pi m \mu'_k / \mu_{\max}} \qquad (15)$

Substituting $q$ of (14) in (15), we obtain the band-limited interpolation matrix between $g$ and $g^{\alpha}$,

$g^{\alpha} = \tilde{T}^{\alpha}\, g \qquad (16)$

where the entries of $\tilde{T}^{\alpha}$ are periodic-sinc (Dirichlet kernel) terms in $(\mu'_k, \mu_l)$, with $\mu_l$ obtained from (12) and $\mu'_k$ from (13); the VTLN-warping relation $\phi(\cdot)$ is specified in the $f$ domain in (13). Using the even-symmetry property, we obtain the $N \times N$ interpolation matrix $T^{\alpha}$ such that $g^{\alpha} = T^{\alpha} g$.

Fig. 9. Framework of the proposed linear-transformation approach. Note that only the conventional MFCC features are generated; the warped features are obtained using the LT matrices $A^{\alpha}$.

Alternatively, the above matrix can also be written as a product of two matrices,

(17)

where $N$ is the number of filters; the factors operate over the normalized frequencies $\mu / \mu_{\max}$, which lie in the range $[0, 1]$.
The linear-transformation matrix to obtain the VTLN-warped MFCC given the conventional MFCC is then

$A^{\alpha} = C_{M \times N}\, T^{\alpha}\, C_{N \times N}^{-1} \qquad (18)$

Here, $M$ represents the number of static cepstral coefficients in the feature-vector and $N$ is the number of Mel-filters used in the feature extraction. The feature generation process using the proposed linear-transformation (LT) approach is illustrated in Fig. 9. Although we have shown here the case of piece-wise linear warping, the same procedure can be used for any arbitrary
Fig. 10. Comparing the VTLN-warped cepstra obtained using the conventional and the proposed Sinc-interpolation approach for piece-wise linear and bilinear warping functions. (a) Piece-wise linear warping. (b) Bilinear warping.
warping function by choosing the appropriate $\phi(\cdot)$ in (13). Fig. 10 compares the VTLN-warped cepstra obtained using the conventional and the proposed LT approach for the piece-wise linear and bilinear warping functions.
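The construction of Section III can be sketched numerically. The code below follows steps 1)–3): uniform Mel centers (a half-sample grid is assumed), warping applied in the Hz domain via (5)–(6), and interpolation of the que-frency limited $g$ from its uniform samples. For the interpolation step it uses an orthonormal DCT-II pair evaluated at the warped positions, as an illustrative stand-in for the sinc form of (16)–(17); the grid convention and normalizations are assumptions, not the paper's exact matrices:

```python
import numpy as np

def mel(f):     return 2595.0 * np.log10(1.0 + f / 700.0)
def inv_mel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def warp_matrix(alpha, n_filters=20, n_ceps=16, fs=16000, f0=7000.0):
    # Steps 1-2: uniform Mel centers (half-sample grid assumed), mapped to
    # Hz (12), warped by the piece-wise linear phi of (5)-(6) in the Hz
    # domain, then mapped back to Mel (13).
    fmax = fs / 2.0
    mu_max = mel(fmax)
    lam = (np.arange(n_filters) + 0.5) / n_filters        # normalized mu_k
    f = inv_mel(lam * mu_max)                             # centers in Hz (12)
    fw = np.where(f <= f0, alpha * f,
                  alpha * f0 + (fmax - alpha * f0) / (fmax - f0) * (f - f0))
    lam_w = mel(fw) / mu_max                              # warped centers (13)

    # Step 3: interpolate the que-frency limited g from its uniform samples.
    # Realized here with an orthonormal DCT-II analysis followed by
    # evaluation of the cosine series at the warped positions.
    i = np.arange(n_filters)[:, None]
    beta = np.where(i == 0, np.sqrt(1.0 / n_filters), np.sqrt(2.0 / n_filters))
    C = beta * np.cos(np.pi * i * (np.arange(n_filters)[None, :] + 0.5) / n_filters)
    E = beta.T * np.cos(np.pi * np.arange(n_filters)[None, :] * lam_w[:, None])
    T = E @ C                                             # g^alpha = T g
    A = C[:n_ceps] @ T @ np.linalg.inv(C)                 # A^alpha as in (18)
    return T, A

T1, A1 = warp_matrix(1.0)   # alpha = 1 should reduce to the identity map
```

A useful sanity check, matching the text, is that $\alpha = 1$ yields the unwarped features: $T^{1}$ is the identity and $A^{1}$ simply truncates to the first $M$ cepstra.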
The idea of linear-transformation presented here can be seen as a special case of the method proposed by Umesh et al. in [21], where a linear-transformation is obtained by separating both Mel- and VTLN-warping from the filter-bank. The main differences between these approaches are as follows.

The filters are uniformly spaced in the Mel-frequency domain for the approach proposed in this paper, i.e., the $\mu_k$ are uniformly spaced. In the work of Umesh et al. in [21], the filters are uniformly spaced in the linear-frequency (Hz) domain, i.e., the $f_k$ are uniformly spaced. Therefore, the conventional Mel filter-bank is not used in [21].
The interpolation matrix proposed in this paper is defined as

$g^{\alpha} = T^{\alpha}\, g \qquad (19)$

i.e., it performs only VTLN-warping on the Mel-warped spectra. In [21], the interpolation matrix is instead defined as

$g^{\alpha} = T^{\alpha}_{[21]}\, h \qquad (20)$

where $h$ is the smoothed spectrum without Mel-warping and the transformation matrix performs both Mel- and VTLN-warping to obtain VTLN-warped MFCC features.
B. Cosine-Interpolation
Motivated fromthe work of Umesh et al. [21], Panchapagesan
and Alwan [22], [23] proposed a linear-transformation approach
that incorporates the interpolation and warping in the inverse
discrete cosine transform (IDCT) matrix and we refer to this
approach as Cosine-interpolation. Considering to be the con-
tinuousMel-frequency variable, the signal is assumed to be pe-riodic with a period of and symmetric about the points
and . A normalization variable is de-
fined as follows:
where (21)
and has the range . The warped IDCT matrix is given
as
(22)
where are the normalized half-sampledshifted positions of theMel filter-bank and being the fre-
quency warping function. The relation between the warped and
unwarped cepstral features is given by
(23)
From the above equations, we see that the VTLN-warping, $\psi(\cdot)$, is performed on the half-sample shifted positions $\tilde{\lambda}_k$ of the filter-bank center-frequencies, which are already Mel-warped. In conventional VTLN, frequency-warping is performed in the linear-frequency (Hz) domain through $\phi(f)$. From the above discussion, it is clear that the Cosine-interpolation approach performs VTLN-warping on the Mel-warped frequencies and is not equivalent to conventional VTLN-warping. Panchapagesan and Alwan themselves point to these differences [23, below Eq. 27]. As seen from the warp-factor histograms in [23], the conventional VTLN warp-factors lie in the range (0.88, 1.24) and the Cosine-interpolation based warp-factors in the range (0.91, 1.11) for the same piece-wise linear warping. This indicates that conventional frequency-warping and the warping used in the Mel-domain by Cosine-interpolation cannot be directly compared, since the domains in which the warping is applied are different. In practice, most frequency-warping functions are specified in the linear-frequency (Hz) domain (and
not Mel-domain) often motivated by physiological arguments.To summarize, the main differences between Cosine-interpola-
tion and the linear-transformation derived in this paper are as
follows.
1) In Cosine-interpolation, VTLN-warping is applied in the Mel-domain and hence the corresponding warp-factors are very different when compared with those from conventional VTLN.
2) The interpolation is performed using the inverse-DCT matrix in Cosine-interpolation [see (22)], whereas the approach presented in this paper uses band-limited interpolation, as shown in (17).
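As a rough illustration of the Cosine-interpolation construction, the following sketch builds a warped-IDCT matrix from a warping function specified in the normalized Mel-domain and composes it with the DCT, as in (22) and (23). The DCT convention, function names, and the identity-warp check below are our own illustrative assumptions, not code from [22], [23]:

```python
import numpy as np

def dct_matrix(n_ceps, n_filters):
    # DCT mapping log filter-bank outputs to cepstra:
    # c[k] = sum_j L[j] * cos(pi * k * (j + 0.5) / M)
    k = np.arange(n_ceps)[:, None]
    u = (np.arange(n_filters)[None, :] + 0.5) / n_filters
    return np.cos(np.pi * k * u)

def warped_idct_matrix(n_filters, n_ceps, warp):
    # IDCT evaluated at warped half-sample-shifted positions, in the
    # spirit of (22); `warp` acts on the normalized Mel variable in [0, 1].
    u = (np.arange(n_filters) + 0.5) / n_filters
    uw = warp(u)
    k = np.arange(n_ceps)[None, :]
    idct = np.cos(np.pi * uw[:, None] * k) * (2.0 / n_filters)
    idct[:, 0] *= 0.5                 # half-weight for the c[0] term
    return idct

def vtln_transform(n_filters, n_ceps, warp):
    # Composite linear transform on cepstra: c_hat = C @ IDCT_warped @ c
    return dct_matrix(n_ceps, n_filters) @ warped_idct_matrix(n_filters, n_ceps, warp)

# With the identity warp, the composite transform reduces to the identity.
T = vtln_transform(20, 16, lambda u: u)
```

The identity-warp check follows from the orthogonality of the cosine basis over the half-sample-shifted grid, which is why the c[0] column carries half weight.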
Before proceeding further, we present the recognition setup along with the details of the databases used in our experiments in the next section.
1580 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 5, JULY 2012
TABLE I
DESCRIPTION OF THE CORPUS USED FOR EXPERIMENTS
IV. EXPERIMENTAL SETUP
The recognition experiments include four different sets of speech data: Wall Street Journal (WSJ0) [32], European Parliamentary Plenary Sessions (EPPS) English [33], Texas Instruments connected digits (TIDIGITS) [34], and Aurora 4.0 [35]. WSJ0, TIDIGITS, and EPPS-English are clean speech data, whereas Aurora 4.0 is noisy speech data. The details of the databases are presented in Table I. Aurora 4.0 consists of 14 different test sets, where seven of them are recorded with a microphone similar to the one used for recording the training data, while the other seven use a different microphone.
All the experiments were done using the RWTH Aachen Speech Recognition System [36], except for the TIDIGITS task. While performing feature extraction, we use 20 filters and obtain 16 cepstral coefficients. The features are mean and variance normalized at the segment level, and LDA is applied over a window of nine consecutive frames to derive a 45-dimensional feature vector. The system uses classification and regression tree (CART) based state tying. We have 1501 generalized triphones for both WSJ0 and Aurora 4.0, and 4501 generalized triphones for the EPPS task. The HMM model consists of three emitting states with 256 mixtures per state and uses a pooled covariance matrix.
The TIDIGITS speech recognition task is done using HTK [37] and uses word models. There are 11 word models, which include zero to nine and oh. The features are of 39 dimensions, comprising normalized log-energy, the cepstral coefficients (excluding the zeroth coefficient), and their first- and second-order derivatives. Cepstral mean subtraction is applied at the segment level. The digits were modeled with simple left-to-right HMMs without skips, having 16 emitting states with five diagonal-covariance Gaussian mixtures per state. Silence is modeled using a three-state HMM with six Gaussian mixtures per state.
While performing VTLN in training, we follow a maximum-likelihood (ML)-based approach for estimating the optimal warping-factor, i.e.,

$$\hat{\alpha}_i = \arg\max_{\alpha} \Pr\left(\mathbf{X}_i^{\alpha} \mid \lambda_{\mathrm{SI}}; W_i\right) \qquad (24)$$
where $\lambda_{\mathrm{SI}}$ is the SI model and $W_i$ is the known transcription during training. $\mathbf{X}_i^{\alpha}$ is the VTLN-warped feature vector of utterance $i$, with the static features appended with the delta and acceleration coefficients, or obtained after transformation using LDA. Since the delta and acceleration coefficients are obtained from the static coefficients, the same VTLN transformation matrix can be used to obtain the VTLN-warped delta and acceleration coefficients. Therefore, the relation between the unwarped and VTLN-warped features is given by

$$\begin{bmatrix} \mathbf{c}^{\alpha} \\ \Delta\mathbf{c}^{\alpha} \\ \Delta^{2}\mathbf{c}^{\alpha} \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{\alpha} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{A}_{\alpha} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{A}_{\alpha} \end{bmatrix} \begin{bmatrix} \mathbf{c} \\ \Delta\mathbf{c} \\ \Delta^{2}\mathbf{c} \end{bmatrix} \qquad (25)$$

where $\mathbf{A}_{\alpha}$ is the linear transformation relating the unwarped and VTLN-warped static cepstra.
If the features are obtained using an LDA transformation matrix, then the relation between the VTLN-warped and static unwarped features is given by [26]

$$\mathbf{x}^{\alpha} = \boldsymbol{\Lambda}\, \mathrm{diag}\left(\mathbf{A}_{\alpha}, \ldots, \mathbf{A}_{\alpha}\right)\, \mathbf{s} \qquad (26)$$

where the block-diagonal matrix contains $W$ copies of $\mathbf{A}_{\alpha}$, the VTLN transformation matrix applied to the static cepstra, $W$ represents the window length, $\boldsymbol{\Lambda}$ is the LDA transformation matrix, and $\mathbf{s}$ represents the supervector formed by concatenating all the static MFCC cepstra from the $W$ adjacent frames.

Using the estimated warped features, a new VTLN model
is obtained. For performing the warp-factor estimation during testing, we use a Gaussian mixture model (GMM) classifier [38]. Unwarped features corresponding to each warping-factor obtained in training are used to train a GMM with 256 mixtures. The optimal warping-factor in recognition is obtained by calculating the likelihood with respect to each warping-factor GMM and choosing the one that gives the best likelihood. The warping-factors are estimated at the speaker level in training and at the utterance level during recognition. During warped feature extraction, we map the frequency points zero and pi onto themselves using piece-wise linear-warping. We do not account for the Jacobian in VTLN for the experiments presented in this paper.
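To illustrate how a single static-cepstra VTLN matrix extends to the dynamic features of (25) and to the LDA supervector of (26), a small Kronecker-product sketch can be used. The function names and toy dimensions below are our own illustrative choices:

```python
import numpy as np

def extend_to_dynamic(A):
    # Deltas and accelerations are linear in the static cepstra, so the
    # same per-block matrix applies to each stream: diag(A, A, A).
    return np.kron(np.eye(3), A)

def extend_to_supervector(A, window=9):
    # For LDA over `window` consecutive frames, each frame's static
    # cepstra in the supervector is warped by the same A before the
    # LDA projection is applied.
    return np.kron(np.eye(window), A)

A = np.arange(16.0).reshape(4, 4)          # toy 4x4 "VTLN" matrix
B = extend_to_dynamic(A)                   # 12x12 block-diagonal
S = extend_to_supervector(A, window=9)     # 36x36 block-diagonal
```

`np.kron(np.eye(k), A)` is just a compact way of writing the block-diagonal matrices that appear in (25) and (26).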
V. IMPLEMENTATION DETAILS
In this section, we present the implementation details
for Sinc- and Cosine-interpolation approaches. Later, we
present the recognition results comparing the performance of
linear-transformation approaches with conventional VTLN.
A. Cosine-Interpolation
In this section, we will discuss the implementation details for
Cosine-interpolation and argue that the range of warping-factors has to be properly mapped, either in the Mel-domain or in the linear-frequency (Hz) domain, when comparing the recognition performance with the conventional and Sinc-interpolation approaches. Before proceeding further, we present recognition
results for the TIDIGITS task in Table II. The models were
trained using male speakers and are used for recognizing
children speakers. For this task, Panchapagesan and Alwan
observed that the Cosine-interpolation approach performed better
than conventional VTLN (see [22] and [23, Sec. (6.1)]). As
we will show next, the warp-factors for Cosine-interpolation
and conventional VTLN need to be mapped before they can be
compared. This is due to the difference in the domains where
frequency warping is applied. If proper mapping is chosen, the
difference in the performance observed in [22] and [23] no longer exists.
SANAND AND UMESH: VTLN USING ANALYTICALLY DETERMINED LINEAR-TRANSFORMATION ON CONVENTIONAL MFCC 1581
TABLE II
RECOGNITION RESULTS (% WER) COMPARING THE PERFORMANCE OF DIFFERENT APPROACHES TO VTLN FOR THE MALE-TRAIN AND CHILD-TEST CASE OF TIDIGITS. DIFFERENT RANGES OF WARPING-FACTORS HAVE TO BE USED TO GET COMPARABLE PERFORMANCE FOR THE CONVENTIONAL AND COSINE-INTERPOLATION APPROACHES DUE TO THE DIFFERENCE IN THE DOMAIN WHERE FREQUENCY WARPING IS APPLIED
Baseline - No VTLN; Conv. - Conventional; LT - Linear Transformation; M-C - Male Train - Child Test
Fig. 11. The different frequency-warping functions in the linear-frequency (Hz) domain are shown. Using a warp-factor of 0.80 in the Mel-domain results in a warping function in the linear-frequency domain (dotted line) that is quite different from using the same warp-factor value (i.e., 0.80) directly in the linear-frequency (Hz) domain (solid line). The figure also shows that using 0.9194 as the warp-factor in the Mel-domain generates a frequency-warping function very similar to using 0.80 in the linear-frequency (Hz) domain.
We briefly discuss the physiological motivation in choosing
the range of warp-factors used in conventional VTLN for piece-
wise linear warping. The average vocal-tract length for males
is about 17 cm, that for females is about 14.5 cm and for chil-
dren is about 12 cm. Males can have vocal-tract lengths that
are 19 cm or longer. Since the differences in vocal-tract lengths
crudely manifest as scaling of the spectra for the same sound,
the range of scaling (or warp-) factors are determined by the
ratio of vocal-tract lengths. For adult speakers (i.e., only male and female speakers), this ratio varies from about 14.5/19 ≈ 0.76 to about 19/14.5 ≈ 1.31. Usually, the range of warp-factors for adult data is 0.80 to 1.20. However, if we train models using male speakers and use children speakers for test, then the range of warp-factors has to be different. In this case, the lower-end of the range of warp-factors can be approximately 12/19 ≈ 0.63.
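The warp-factor ranges above follow from simple ratios of the vocal-tract lengths quoted in the text; the particular pairings below (e.g., child against the longest male vocal tract) are our reading of the argument, not values from the paper's tables:

```python
# Average vocal-tract lengths (cm) quoted in the text.
male, female, child, long_male = 17.0, 14.5, 12.0, 19.0

# Adult-only ratios roughly bracket the usual 0.80-1.20 search range.
adult_low = female / long_male      # shortest adult / longest adult, ~0.76
adult_high = long_male / female     # longest adult / shortest adult, ~1.31

# Male-train / child-test pushes the lower end much further down.
child_low = child / long_male       # ~0.63
```

This arithmetic is why the male-train, child-test experiments below need a lower limit well under the usual 0.80.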
As pointed out in Section III-B, Cosine-interpolation performs VTLN-warping on the Mel-warped frequencies, as opposed to performing frequency-warping in the linear-frequency (Hz) domain. Using the same numerical value of the warping-factor in both the Mel- and linear-frequency (Hz) domains will result in different warpings in the frequency (Hz) domain. The differences are illustrated in Fig. 11. Using a warp-factor of 0.80 in the Mel-frequency domain and mapping the warping back to the linear-frequency domain (shown in the figure with a dotted line) produces a very different warping function in the linear-frequency domain when compared to directly using a 0.80 warp-factor in the linear-frequency domain (shown with a solid line in the figure). On the other hand, using a warp-factor of 0.9194 in the Mel-domain and mapping the function back to the linear-frequency domain results in frequency-warping very similar to 0.80. We suspect that for the TIDIGITS task in [22] and [23], the same range of warp-factors from 0.80 to 1.25 was used in both conventional VTLN and
in Cosine-interpolation. Since the test data is from children
speakers, the lower limit of 0.80 for conventional-VTLN did
not provide sufficient search space and resulted in degraded
performance. On the other hand, scaling the Mel-domain with
0.80 is approximately equivalent to scaling the linear-fre-
quency domain by a factor of 0.5695. This provided a larger
search-space for Cosine-interpolation, probably helping it to
get better performance for children speech when compared to
conventional-VTLN in [22], [23].
In order to have a fair comparison, the warping-factors in the linear-frequency (Hz) domain (or Mel-domain) have to be in the same range for all the approaches. This can be done by calculating the equivalent warping-factor in one domain (say, the linear-frequency (Hz) domain) by fixing the warping-factor in the other domain (say, the Mel-frequency domain). The mapping of warping-factors is done as follows.
Let us say that the cutoff frequency where the slope of the piece-wise linear warping function changes is $f_0$ in Hz, and similarly that the frequency where the slope changes in the Mel domain is $m_0 = \mathrm{Mel}(f_0)$. The corresponding warped frequencies are given as

$$\hat{f} = \alpha_{\mathrm{Hz}}\, f_0 \qquad (27)$$

$$\hat{m} = \alpha_{\mathrm{Mel}}\, m_0 \qquad (28)$$

where $\alpha_{\mathrm{Hz}}$ and $\alpha_{\mathrm{Mel}}$ are different. The idea is that the inverse Mel-warped $\mathrm{Mel}^{-1}(\alpha_{\mathrm{Mel}}\, m_0)$ should match $\alpha_{\mathrm{Hz}}\, f_0$, or the Mel-warped $\mathrm{Mel}(\alpha_{\mathrm{Hz}}\, f_0)$ should match $\alpha_{\mathrm{Mel}}\, m_0$. This involves finding the value of $\alpha_{\mathrm{Hz}}$ (or $\alpha_{\mathrm{Mel}}$) that provides the match, and it can be found by equating (27) and (28). Therefore, the equivalent $\alpha_{\mathrm{Hz}}$ when $\alpha_{\mathrm{Mel}}$ is fixed is given by

$$\alpha_{\mathrm{Hz}} = \frac{\mathrm{Mel}^{-1}\left(\alpha_{\mathrm{Mel}}\, \mathrm{Mel}(f_0)\right)}{f_0} \qquad (29)$$

and the equivalent $\alpha_{\mathrm{Mel}}$ when $\alpha_{\mathrm{Hz}}$ is fixed is given by

$$\alpha_{\mathrm{Mel}} = \frac{\mathrm{Mel}\left(\alpha_{\mathrm{Hz}}\, f_0\right)}{\mathrm{Mel}(f_0)}. \qquad (30)$$

The function $\mathrm{Mel}^{-1}(\cdot)$ converts Mel frequencies to Hz and, similarly, $\mathrm{Mel}(\cdot)$ converts Hz frequencies to Mel frequencies.
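The mapping in (29) and (30) can be sketched numerically. The 2595 log10(1 + f/700) Mel formula and the 8-kHz cutoff below are our assumptions, chosen because they reproduce the 0.9194 and 0.5695 values discussed around Fig. 11, not parameters taken from the experiments:

```python
import numpy as np

def mel(f):
    # Standard Mel scale in Hz -> Mel.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    # Inverse Mel scale in Mel -> Hz.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def alpha_mel_from_hz(alpha_hz, f0):
    # Eq. (30)-style: equivalent Mel-domain factor for a fixed Hz-domain
    # factor, matched at the piece-wise linear cutoff f0 (in Hz).
    return mel(alpha_hz * f0) / mel(f0)

def alpha_hz_from_mel(alpha_mel, f0):
    # Eq. (29)-style: equivalent Hz-domain factor for a fixed Mel-domain factor.
    return mel_inv(alpha_mel * mel(f0)) / f0

a_mel = alpha_mel_from_hz(0.80, 8000.0)   # ~0.9194, cf. Fig. 11
a_hz = alpha_hz_from_mel(0.80, 8000.0)    # ~0.5695, cf. Section V-A
```

The two mappings are exact inverses of each other at the cutoff, which is what makes a fair comparison of warp-factor ranges possible.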
We map the warping-factors as discussed above by fixing the warp-factor in the linear-frequency (Hz) domain and calculating the corresponding Mel-domain warp-factor. The new range of Mel-domain warping-factors will be (0.91, 1.08) for the corresponding Hz-domain warp-factors in the range of (0.80, 1.25) for adult speech. These ranges seem to be consistent with the observations made by Panchapagesan and Alwan (see [23, Fig. 4]). The recognition performance on the TIDIGITS task using the proper range of warping-factors is shown in Table II, i.e., we extend the lower range to account for children speakers. Unlike [22] and [23], we now observe that all the approaches to VTLN have similar performance. When the warping-factors are in the improper range of (0.80, 1.20) for child speakers, the conventional VTLN performance is inferior to Cosine-interpolation, as observed in [22].
For all the subsequent experiments in this paper, we will use
mapping of warping-factors for Cosine-interpolation by fixing
the warp-factors in the linear-frequency (Hz) domain. In the next
section, we will present the implementation details of Sinc-in-
terpolation when performing linear-transformation of conven-
tional MFCC.
B. Linear-Transformation of Conventional MFCC for VTLN
Using Sinc-Interpolation
In order to perform band-limited interpolation, full spectral information in the frequency band is necessary. Since we use conventional MFCC, the available spectral information lies between the first and last filters of the filter-bank. Using a conventional filter-bank with 20 filters, the first filter has a center frequency around 135.2 Mels (or 89.92 Hz) and the last filter has a center frequency around 2704.8 Mels (or 7016.2 Hz). It is quite unlikely that a formant for any specific sound exists below or above these frequencies for any particular speaker. We can safely assume that the first and last filter-bank center frequencies act as the zero and Nyquist frequencies, which should in no way affect the VTLN performance. This assumption inherently means that the center frequencies of the first and last filters will map onto themselves after frequency warping. Note
that by stating the center frequencies of the first and last filters
are not changing with frequency-warping, we do not mean that
we are ignoring the speech spectrum below 89.92 Hz. The in-
formation is still present in the first-filter since the lower-end of
the filter starts at zero frequency. The point we want to make
is that the center frequencies of the first and last filters do not
change after frequency warping. The only consequence of thecenter frequency of the first filter being used as zero frequency
in linear-transformation case is that there will be a small con-
sistent difference in numerical value between the warp-factor
estimate obtained by this method and the conventional method.
This is because there will be a small difference in the slope used
in (5) and (6). The linear transformation relation is given by
$$\hat{\mathbf{c}} = \mathbf{T}_{\alpha}^{\mathrm{CF}}\, \mathbf{c} \qquad (31)$$

where the superscript $\mathrm{CF}$ indicates that the conventional Mel filter-bank is used. The linear transformation $\mathbf{T}_{\alpha}^{\mathrm{CF}}$ is derived as shown in (17).
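A minimal sketch of the band-limited interpolation at the heart of (17) is given below. The symmetric extension and the exact normalization used in the paper are omitted, and the function names are our own, so this is an illustration of the idea rather than the paper's implementation:

```python
import numpy as np

def sinc_interp_matrix(warped_pos, n):
    # Row i resamples a uniformly sampled length-n sequence at the
    # (generally fractional) position warped_pos[i]; np.sinc(x) is the
    # normalized sinc, sin(pi*x)/(pi*x).
    return np.sinc(warped_pos[:, None] - np.arange(n)[None, :])

def vtln_cepstral_transform(C, C_inv, warped_pos):
    # Composite linear map on cepstra: back to log filter-bank outputs
    # (C_inv), resample at the VTLN-warped filter positions (S), and
    # return to cepstra (C): c_hat = C @ S @ C_inv @ c.
    S = sinc_interp_matrix(warped_pos, C_inv.shape[0])
    return C @ S @ C_inv

# Sanity check: unwarped (integer) positions give an identity resampling.
n = 20
S = sinc_interp_matrix(np.arange(n, dtype=float), n)
```

Because warping is expressed as resampling positions on the filter-bank grid, the whole VTLN operation collapses into a single matrix applied to the cepstra.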
The results comparing the recognition performance using the
conventional Mel filter-bank are shown in Table III. We make the following observations.
TABLE III
RECOGNITION RESULTS (% WER) OF CONVENTIONAL AND LINEAR-TRANSFORMATION APPROACHES TO VTLN USING CONVENTIONAL MFCC. BOTH SINC- AND COSINE-INTERPOLATION APPROACHES PERFORM COMPARABLY TO CONVENTIONAL VTLN
Fig. 12. Histogram and contour plots comparing the alpha estimates for the conventional and Sinc-interpolation approaches for the EPPS train data. In the linear-transformation approach, since we use conventional Mel-filters (without half-filters), the corresponding warp-factors would be approximately 0.02 less than conventional-VTLN, which is reflected in the histogram. (a) Histogram plot. (b) Contour plot.
1) We observe that the linear-transformation-based approaches perform comparably with the conventional approach, irrespective of noisy or clean speech.
2) More importantly, we use the conventional Mel filter-bank without any modification and still perform VTLN-warping using a linear-transformation.
Fig. 12 shows the histogram and contour plots for the warp-
factors obtained using the conventional and the proposed Sinc-
interpolation approaches for the train data corresponding to the EPPS task. The histogram plot shows the distribution of warp-
factors and the contour plot gives an idea on how the warp-
factor estimates differ between Sinc- and conventional VTLN
approaches. From the histogram, we observe that the majority of the warp-factors are shifted by a single warp-factor step, such as
the peaks at 1.02, 0.94, and 0.92 in the conventional approach appearing at 1.00, 0.92, and 0.90, respectively. A similar behavior can also be observed from the contour plot. This is because we are using conventional Mel-filters (i.e., no additional half-filters), with the center frequency of the first Mel-filter (at 89.92 Hz) being mapped to zero-frequency to enable the
linear-transformation approach without any change in the signal
processing. This leads to a small consistent difference between
warp-factors obtained in the linear-transformation and the con-
ventional approach. Our analysis of the warp-factors obtained
on the EPPS train data indicates that for any warp-factor in conventional-VTLN, the corresponding warp-factors are the same or
0.02 smaller in 90% of the utterances. The correlation coeffi-
cient between the alpha estimates is 0.93, which also indicates
that the deviations are only marginal.
Another source of approximation is the use of truncated un-warped MFCC cepstra in (18) to obtain the VTLN-warped cep-
stra, which will also result in some loss of information. Though
there are small differences in the warp-factor distribution, the
recognition performance of LT-Sinc is comparable to conven-
tional VTLN on a variety of tasks that we have presented in this
paper.
VI. CONCLUSION
In this paper, we have presented an approach to perform
VTLN using a linear transformation on conventional MFCC
without any modification in the feature extraction steps. The
linear-transformation is given by (18) with the interpolationmatrix given by (17). Therefore, the linear-transformation
can be analytically calculated using the above equations for
any as well as for any arbitrary warping function by putting
appropriate in (13). This is an important difference when
compared to the Cosine-interpolation, where is used in
theMel-domain and the warping is different from conventional
VTLN. Further, the corresponding warp-factors between co-
sine-interpolation approach and conventional-VTLN cannot be
easily compared. The key idea to our approach is to separate
the speaker scaling operation from the filter-bank which helps
us derive a linear transformation for VTLN using the idea of
band-limited interpolation. The use of such transformations
would enable the warp-factors to be efficiently estimated by accumulating sufficient statistics, the use of a regression-tree framework to perform VTLN at the acoustic-class level, or the use of VTLN matrices as base matrices for adaptation until sufficient data is available. Such approaches cannot be easily implemented in the conventional-VTLN framework. Using four different tasks to illustrate the efficacy of our proposed approach, we have shown
that the recognition performance of our proposed approach of
linear transformation is always comparable to the conventional
VTLN on both clean and noisy speech data.
ACKNOWLEDGMENT
D. R. Sanand would like to thank Prof. H. Ney for giving him an opportunity to work as a research assistant in the Human Lan-
guage Technology Group at RWTH Aachen University, Aachen,
Germany. The authors would like to thank Prof. H. Ney for pro-
viding resources to run the recognition experiments reported in
this paper. The authors would also like to thank the anonymous
reviewers for their valuable comments and suggestions.
REFERENCES
[1] G. Evermann, H. Y. Chan, M. J. F. Gales, T. Hain, X. Liu, D. Mrva, L.Wang, and P. C. Woodland, Development of the 2003 CU-HTK con-versational telephone speech transcription system, in Proc. ICASSP04, Montreal, QC, Canada, May 2004, pp. 249252.
[2] A. Sixtus, S. Molau, S. Kanthak, R. Schlüter, and H. Ney, Recent improvements of the RWTH large vocabulary speech recognition system on spontaneous speech, in Proc. ICASSP 00, Istanbul, Turkey, Jun. 2000, pp. 1671-1674.
[3] G. Zavaliagkos, J. McDonough, D. Miller, A. El-Jaroudi, J. Billa, F.Richardson, K. Ma, M. Siu, and H. Gish, The BBN Byblos 1997large vocabulary conversational speech recognition system, in Proc.
ICASSP 98, Seattle, WA, May 1998, pp. 905908.[4] J. L. Gauvain, L. Lamel, H. Schwenk, G. Adda, L. Chen, and F.
Lefevre, Conversational telephone speech recognition, in Proc.ICASSP 03, Hong Kong, Apr. 2003, pp. 212215.
[5] A. Stolcke, H. Bratt, J. Butzberger, H. Franco, V. R. R. Gadde, M. Plauché, C. Richey, E. Shriberg, K. Sönmez, F. Weng, and J. Zheng, The SRI March 2000 Hub-5 conversational speech transcription system, in Proc. NIST Speech Transcript. Workshop, 2000.
[6] Y. Gao, Y. Li, V. Goel, and M. Picheny, Recent advances in speechrecognition system for IBM DARPA communicator, in Proc. Eu-rospeech 01, Aalborg, Denmark, Sep. 2001.
[7] A. Andreou, T. Kamm, and J. Cohen, Experiments in vocal tract nor-malization, inProc. CAIP Workshop: Frontiers in Speech Recognition
II, 1994.[8] L. Lee and R. Rose, Frequency warping approach to speaker normal-
ization, IEEE Trans. Speech Audio Process., vol. 6, no. 1, pp. 4959,Jan. 1998.
[9] A. Acero and R. M. Stern, Robust speech recognition by normaliza-tion of the acoustic space, inProc. ICASSP 91, Toronto, ON, Canada,May 1991, pp. 893896.
[10] A. Acero, Acoustical and environmental robustness in automaticspeech recognition, Ph.D. dissertation, Carnegie Mellon Univ.,Pittsburgh, PA, 1990.
[11] J. McDonough, W. Bryne, and X. Luo, Speaker normalization withall-pass transforms, inICSLP 98, Sydney, Australia, Nov. 1998.
[12] M. Pitz and H. Ney, Vocal tract normalization equals linear transfor-mation in cepstral space,IEEE Trans. Speech Audio Process., vol. 13,no. 5, pp. 930944, Sep. 2005.
[13] M. Pitz, S. Molau, R. Schlter, and H. Ney, Vocal tract normalizationequals linear transformation in cepstral space,in Eurospeech 01, Aal-borg, Denmark, Sep. 2001.
[14] S. Molau, M. Pitz, R. Schlüter, and H. Ney, Computing Mel-frequency cepstral coefficients on the power spectrum, in Proc. ICASSP 01, Salt Lake City, UT, May 2001, pp. 73-76.
[15] M. Pitz, Investigations on linear transformations for speaker adapta-tion and normalization, Ph.D. dissertation, RWTH Aachen, Aachen,
Germany, Mar. 2005.
[16] T. Claes, I. Dologlou, L. ten Bosch, and D. van Compernolle, A novel feature transformation for vocal tract length normalisation in automatic speech recognition, IEEE Trans. Speech Audio Process., vol. 6, no. 6, pp. 549-557, Nov. 1998.
[17] S. Cox, Speaker normalization in the MFCC domain, inProc. ICSLP00, Beijing, China, Oct. 2000.
[18] D. Y. Kim, S. Umesh, M. J. F. Gales, T. Hain, and P. C. Woodland,Using VTLN for broadcast news transcription, in Proc. Interspeech04, Jeju Island, Korea, Sep. 2004.
[19] X. Cui and A. Alwan, Adaptation of children speech with limited databased on formant-like peakalignment, Comput. Speech Lang., vol.20,no. 4, pp. 400419, Oct. 2006.
[20] D. R. Sanand, D. D. Kumar, and S. Umesh, Linear transformationapproach to VTLN using dynamic frequency warping, inProc. Inter-speech 07, Antwerp, Belgium, Aug. 2007.
[21] S. Umesh, A. Zolnay, and H. Ney, Implementing frequency warpingand VTLN through linear transformation of conventional MFCC, inInterspeech 05, Lisbon, Portugal, Sept. 2005.
[22] S. Panchapagesan, Frequency warping by linear transformation ofstandard MFCC, inProc. Interspeech 06, Pittsburgh, PA, Sep. 2006.
[23] S. Panchapagesan and A. Alwan, Frequency warping for VTLNand speaker adaptation by linear transformation of standard MFCC,Comput. Speech Lang., vol. 23, no. 1, pp. 4264, Jan. 2009.
[24] D. R. Sanand and S. Umesh, Study of Jacobian compensation usinglinear transformation of conventional MFCC for VTLN, in Proc. In-terspeech 08, Brisbane, Australia, Sep. 2008.
[25] D. R. Sanand, R. Schlter, and H. Ney, Revisiting VTLN using lineartransformation on conventional MFCC, in Proc. Interspeech 10,Makuhari, Japan, Sep. 2010.
[26] J. Lööf, H. Ney, and S. Umesh, VTLN warping factor estimation using accumulation of sufficient statistics, in Proc. ICASSP 06, Toulouse, France, May 2006, pp. 1201-1204.
[27] P. T. Akhil, S. P. Rath, S. Umesh, and D. R. Sanand, A computation-ally efficient approach to warp factor estimation in VTLN using EMalgorithm and sufficient statistics, in Proc. Interspeech 08, Brisbane,Australia, Sep. 2008.
[28] S. P. Rath and S. Umesh, Acoustic class specific VTLN-warping using regression class trees, in Proc. Interspeech 09, Brighton, U.K., Sep. 2009.
[29] C. Breslin, K. Chin, M. Gales, K. Knill, and H. Xu, Prior informa-tion for rapid speaker adaptation, in Proc. Interspeech 10, Makuhari,Japan, Sep. 2010.
[30] L. Saheer, J. Dines, P. N. Garner, and H. Liang, Implementation ofVTLN for statistical speech synthesis, in Proc. ISCA Speech Synth.Workshop, Sep. 2010.
[31] S. Wegmann, D. McAllaster, J. Orloff, and B. Peskin, Speaker nor-malization on conversational telephone speech, in Proc. ICASSP 96,Atlanta, GA, May 1996, pp. 339341.
[32] D. B. Paul and J. M. Baker, The design for the Wall Street Journal-based CSR corpus, in Proc. ICSLP 92, Banff, AB, Canada, Oct. 1992.
[33] J. Lööf, M. Bisani, C. Gollan, G. Heigold, B. Hoffmeister, C. Plahl, R. Schlüter, and H. Ney, The 2006 RWTH parliamentary speeches transcription system, in Proc. Interspeech 06, Barcelona, Spain, Jun. 2006.
[34] R. Leonard, A database for speaker-independent digit recognition, inICASSP 84, San Diego, CA, Mar. 1984, pp. 328331.
[35] N. Parihar and J. Picone, DSR Front End LVCSR Evaluation, AU/384/02, Tech. Rep., Mississippi State University, Mississippi State, MS, Dec. 2002.
[36] D. Rybach, C. Gollan, G. Heigold, B. Hoffmeister, J. Lööf, R. Schlüter, and H. Ney, The RWTH Aachen University open source speech recognition system, in Proc. Interspeech 09, Brighton, U.K., Sep. 2009.
[37] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G.Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland,The HTK Book (for HTK Version 3.4). Cambridge, U.K.: CambridgeUniv. Eng. Dept., 2006.
[38] L. Welling, S. Kanthak, and H. Ney, Improved methods for vocaltract normalization, in Proc. ICASSP 99, Phoenix, AZ, Mar. 1999,pp. 761764.
D. R. Sanand received the Ph.D. degree in electrical engineering from the Indian Institute of Technology, Kanpur, in 2010.
From 2009 to 2010, he was a Postdoctoral Re-searcher in the Department of Computer Science,RWTH Aachen University, Aachen, Germany,and later in the Department of Information andComputer Science, Aalto University, Espoo, Fin-land, until 2011. Currently, he is a PostdoctoralResearcher in the Department of Electronics andTelecommunications, Norwegian University of
Science and Technology, Trondheim, Norway. His research interests includespeech recognition, synthesis, and biomedical signal processing.
S. Umesh received the Ph.D. degree in electricalengineering from the University of Rhode Island,Kingston, in 1993.
From 1993 to 1996, he was a Postdoctoral Fellowat the City University of New York. From 1996 to2009, he was with the Indian Institute of Technology,Kanpur, first as an Assistant Professor and then asa Professor of electrical engineering. Since 2009,he has been with the Indian Institute of Technology,Madras, where he is a Professor of electrical engi-neering. He has also been a Visiting Researcher at
AT&T Research Laboratories, Machine Intelligence Laboratory, Cambridge University, U.K., and the Department of Computer Science (Lehrstuhl für Informatik VI), RWTH Aachen, Germany. His recent research interests have been mainly in the area of speaker-normalization and noise-robustness and
their application in large-vocabulary continuous speech recognition systems.He has also worked in the areas of statistical signal processing and time-varyingspectral analysis.
Dr. Umesh was a recipient of the Indian AICTE Career Award for YoungTeachers and the Alexander von Humboldt Research Fellowship.