
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 5, JULY 2012

VTLN Using Analytically Determined Linear-Transformation on Conventional MFCC

    D. R. Sanand and S. Umesh

Abstract: In this paper, we propose a method to analytically obtain a linear-transformation on conventional Mel frequency cepstral coefficient (MFCC) features that corresponds to conventional vocal tract length normalization (VTLN)-warped MFCC features, thereby simplifying the VTLN processing. There have been many attempts to obtain such a linear-transformation, but in all the previously proposed approaches either the signal processing is modified (and is therefore not conventional MFCC), or the linear-transformation does not correspond to conventional VTLN-warping, or the matrices are estimated from data and are therefore data-dependent. In short, the conventional VTLN part of an automatic speech recognition (ASR) system cannot simply be replaced with any of the previously proposed methods. Umesh et al. proposed the idea of using band-limited interpolation for performing VTLN-warping on MFCC using plain cepstra. Motivated by this work, Panchapagesan and Alwan proposed a linear-transformation to perform VTLN-warping on conventional MFCC. However, in their approach, VTLN-warping is specified in the Mel-frequency domain and is not equivalent to conventional VTLN. In this paper, we present an approach that also draws inspiration from the work of Umesh et al., and which we believe for the first time performs conventional VTLN as a linear-transformation on conventional MFCC using the ideas of band-limited interpolation. Deriving such a linear-transformation to perform VTLN allows us to use the VTLN-matrices in a transform-based adaptation framework with its associated advantages, while still requiring the estimation of only a single parameter. Using four different tasks, we show that our proposed approach has almost identical recognition performance to conventional VTLN on both clean and noisy speech data.

Index Terms: Automatic speech recognition (ASR), linear-transformation, Mel frequency cepstral coefficient (MFCC), speaker normalization, vocal tract length normalization (VTLN).

    I. INTRODUCTION

INTER-SPEAKER variability is a major source of performance degradation in speaker-independent (SI) automatic speech recognition (ASR) systems. Most state-of-the-art systems now incorporate vocal-tract length normalization (VTLN) as an integral part of the system to reduce inter-speaker variability and hence improve the recognition performance [1]-[6].

Manuscript received December 27, 2010; revised July 08, 2011 and January 15, 2012; accepted January 23, 2012. Date of publication January 31, 2012; date of current version March 21, 2012. This work was done while the authors were at the Department of Electrical Engineering, Indian Institute of Technology, Kanpur. This work was supported in part by the Department of Science and Technology, Ministry of Science and Technology, India, under SERC project SR/S3/EECE/058/2008. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Steve Renals.

    D. R. Sanand is with the Norwegian University of Science and Technology,NO-7491 Trondheim, Norway (e-mail: drsanand@gmail.com).

S. Umesh is with the Department of Electrical Engineering, Indian Institute of Technology Madras, Chennai-600036, India (e-mail: umeshs@ee.iitm.ac.in).

    Digital Object Identifier 10.1109/TASL.2012.2186289

VTLN performs speaker normalization by reducing the variabilities in the spectra of speech signals that arise due to differences in the vocal tract lengths (VTL) of speakers uttering the same sound [7]. The normalization is achieved by either compressing or expanding the speech spectrum and is usually referred to as scaling. This scaling is usually specified through a mathematical relation of the type $f^{\alpha} = \psi_{\alpha}(f)$, where $f^{\alpha}$ is the warped frequency and $\psi_{\alpha}(\cdot)$ is the frequency-warping function. It is commonly assumed that the spectra of different speakers uttering the same sound are linearly scaled versions of one another [7], [8], i.e., $\psi_{\alpha}(f) = \alpha f$. We would like to make it clear to the reader that, though the discussion in this paper assumes linear scaling of the spectra, the methods developed in this paper can be applied to any arbitrary warping function. VTLN requires the estimation of only a single parameter, called the warp-factor $\alpha$, for normalization and hence requires very little acoustic data, unlike adaptation-based methods (e.g., MLLR and CMLLR). However, the practical implementation of conventional VTLN follows a maximum likelihood (ML) based grid search over a pre-defined range of warp-factors. This requires the features to be generated for all the warp-factors after appropriate modification of the spectra. The ML estimate of the warp-factor is then found by evaluating the likelihood of the warped features with respect to the acoustic model, $\lambda$, and the transcription, $\mathcal{W}$, and is given by

$$\hat{\alpha} = \arg\max_{\alpha} \Pr\left(X^{\alpha} \mid \lambda, \mathcal{W}\right) \qquad (1)$$

where $X^{\alpha}$ consists of static features obtained after frequency-warping the spectra by warp-factor $\alpha$, appended with differential and acceleration coefficients. In some systems, linear discriminant analysis (LDA) is applied over a window of such warped consecutive frames to account for dynamic variations before obtaining the final feature-vector.
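To make the procedure concrete, the following minimal Python sketch mirrors the grid search of (1). The feature extractor and likelihood scorer are passed in as callables, since both are system-specific; `warp_features` and `log_likelihood` are hypothetical stand-ins, not functions from this paper.

```python
import numpy as np

def estimate_warp_factor(utterance, model, transcription,
                         warp_features, log_likelihood,
                         warp_grid=np.arange(0.80, 1.21, 0.02)):
    """ML grid search of (1): regenerate features for every candidate
    warp-factor and keep the one scoring best against the model and
    the known transcription."""
    best_alpha, best_score = None, -np.inf
    for alpha in warp_grid:
        feats = warp_features(utterance, alpha)   # features recomputed per alpha
        score = log_likelihood(feats, model, transcription)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha
```

Note that every candidate warp-factor requires a full feature-extraction pass; the linear-transformation developed in this paper removes exactly this cost.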

Recently there has been a lot of interest in obtaining a direct linear-transformation between the static conventional Mel frequency cepstral coefficient (MFCC) features $c$ and the static VTLN-warped MFCC $c^{\alpha}$, i.e.,

$$c^{\alpha} = T^{\alpha}\, c \qquad (2)$$

where $T^{\alpha}$ represents a matrix transformation.

One of the early attempts to obtain a linear-transformation (LT) on the cepstra for speaker normalization was by Acero et al. [9], [10]. They showed that the warped cepstral coefficients can be obtained at the outputs of a bank of filters at time zero, by formulating the bilinear transform as a linear filtering operation and having the time-reversed cepstrum sequence as the input. McDonough et al. [11] proposed a linear transformation using generalizations of the bilinear transform known as


all-pass transforms. The derivations were based on the argument that the frequency-warping functions, $\psi_{\alpha}(\cdot)$, used in most VTLN methods can be approximated to a reasonable degree by the bilinear transform. Pitz et al. [12], [13] argued that a linear-transformation of cepstra can be obtained for any arbitrary invertible warping function. However, their derivations were made using the modified signal processing approach discussed in [14], which does not include filter-bank smoothing during the feature extraction. The cepstra are assumed to be inverse discrete-time Fourier transform (IDTFT) coefficients of the log power spectrum (without Mel-warping) to derive the cepstral linear-transformation. Pitz states in his thesis [15] that the inclusion of Mel-warping makes the transformation highly nonlinear, and it could not be solved analytically. There have been other attempts to obtain an approximation to the linear-transformation, including the work of Claes et al. [16], where the linear-transformation was derived using the average third-formant information. Cox [17] presented a model-based approach for VTLN that performs the transformation on MFCC features. Kim et al. [18] estimated the linear-transformation using the ideas of constrained maximum likelihood linear transformation (CMLLR) from training data. Cui and Alwan [19] derived a mapping matrix using formant-like peaks, which can be seen as a special case of [16]. Sanand et al. [20] derived a linear-transformation using the idea of dynamic frequency warping, where the mapping is learnt from the data. It is important to note that in all these methods, either the signal processing is changed (and therefore not conventional MFCC), or the linear-transformation does not correspond to conventional VTLN-warping, or the matrices are estimated and hence are dependent on the database. Therefore, the conventional VTLN part of an ASR system cannot simply be replaced with any of the methods described above.

Umesh et al. [21] proposed the idea of using band-limited interpolation to derive a linear-transformation for obtaining VTLN-warped MFCC that performs both Mel- and VTLN-warping on plain cepstra. Motivated by this work, Panchapagesan and Alwan [22], [23] proposed an approach to incorporate VTLN-warping into the inverse discrete cosine transform (DCT) transformation to obtain a linear-transformation of the type shown in (2). We refer to this approach as Cosine-interpolation in this paper. It is important to note that the VTLN-warping in [22], [23] is performed in the Mel-frequency domain and is not exactly equivalent to conventional VTLN frequency-warping. This may be important in cases where the warping function is specified in the frequency (Hz) domain based on physiological arguments.

In this paper, we present an approach which we believe for the first time performs conventional frequency-warping, $\psi_{\alpha}(f)$, as a linear-transformation on conventional MFCC using the ideas of band-limited interpolation. We refer to this approach as Sinc-interpolation in this paper. The goal is to analytically obtain the linear-transformation of (2) given $\psi_{\alpha}(\cdot)$. The proposed method does not modify any aspect of the conventional MFCC computation, including the use of Mel filter-bank smoothing as well as the discrete cosine transform (DCT)-II. A part of this work has already been presented in [20], [24], and [25].

A major advantage of obtaining a linear-transformation in the framework of (2) is that the VTLN-warped cepstral features need not be computed for each $\alpha$ by first frequency-warping the spectra and then computing the corresponding VTLN-warped cepstra. Instead, the VTLN-warped cepstra can be directly obtained from static conventional MFCC features through a matrix transformation. It can easily be shown that the dynamic coefficients of the warped features are also related through the same transformation in this case. Another advantage of such an approach is that these matrices can be viewed as feature transformation matrices similar to CMLLR, but are pre-computed rather than estimated from data, requiring very little adaptation data for optimal selection of $\alpha$. The use of such matrices also enables the warp-factors to be estimated by accumulating sufficient statistics, thereby simplifying the procedure for optimal warp-factor estimation [26], [27] and reducing the computational complexity by 75%. Further, VTLN matrices can be used in a regression-tree framework to perform VTLN at the acoustic-class level, allowing the estimation of multiple warp-factors for a single utterance [28], which is very difficult to implement in the conventional VTLN framework. Finally, there is the possibility of using these VTLN matrices as base matrices for adaptation until sufficient data is available to obtain a robust estimate of the adaptation (MLLR/CMLLR) matrix [29]. Recently, there is also interest in using VTLN in the transform-based approach for statistical speech synthesis [30].

Fig. 1. Steps involved in generating conventional MFCC features.

The paper is organized as follows. In Section II, we present how VTLN is performed in practice and discuss the limitations in formulating the problem as a linear-transformation. In Section III, we present our idea of performing VTLN and show that a matrix transformation can be formulated on conventional MFCC to obtain VTLN-warped MFCC. Section IV presents our setup for performing the speech recognition experiments along with a description of the databases used in our experiments. In Section V, we discuss the differences between the proposed and the Cosine-interpolation approaches for VTLN. Finally, we present the recognition results to show that the proposed approach has performance comparable to conventional VTLN.

II. IMPLEMENTATION OF CONVENTIONAL VTLN

Conventional MFCC feature extraction, which does not include VTLN-warping, is usually implemented as shown in Fig. 1. Let $S$ represent the power or magnitude spectrum of a frame of speech. Let $F$ represent the filter-bank smoothing operation along with Mel-warping, which can be represented through a linear-transformation matrix. Further, let $C$ represent the DCT transformation, which is also linear. The static MFCC features, $c$, are obtained by applying the Mel-warped filter-bank to the power spectrum of the speech signal, followed by applying a logarithm to the amplitudes of the filter-bank outputs and finally a DCT transformation. All these operations can be written mathematically as

$$c = C \log\left(F\, S\right). \qquad (3)$$

Fig. 2. Conventional framework for generating warped features in VTLN. The filter-bank is inversely scaled instead of re-sampling the speech signal for each warp-factor, for efficient implementation.

The DCT matrix is given by

$$C(i, j) = \beta_i \cos\left(\frac{\pi\, i\, (2j + 1)}{2N}\right), \quad i = 0, \ldots, M-1, \;\; j = 0, \ldots, N-1 \qquad (4)$$

and the scaling factor is defined as $\beta_i = \sqrt{1/N}$ for $i = 0$ and $\beta_i = \sqrt{2/N}$ otherwise. Here, $N$ is the number of filters used in the Mel filter-bank and $M$ is the number of cepstral coefficients.

As an illustration, let a speech frame consist of 320 samples. A 512-point DFT is applied to obtain the 256-dimensional vector $S$ whose elements are the magnitudes of the DFT coefficients for one half of the spectrum. This is because the magnitude spectrum has even symmetry. If 20-filter Mel filter-bank smoothing is applied, then $F$ is a 20 × 256 matrix that operates on $S$ to obtain the Mel-warped smoothed spectrum. $C$ is the 20 × 20 DCT matrix applied to the log-compressed Mel-warped smoothed spectrum to obtain the MFCC feature vector $c$. In practice, only the first 16 cepstral coefficients are used, and one may use a 16 × 20 DCT transformation.
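The pipeline of (3) and (4), with the dimensions of this illustration, can be sketched in a few lines of numpy. The triangular, unit-peak filter shapes and the Mel-uniform edge placement are assumptions of this sketch (the text fixes only the matrix dimensions and the Mel spacing); with a 16-kHz sampling rate, the first and last filter centers land at about 135.2 Mel (89.2 Hz) and 2704.8 Mel (7016 Hz), matching the values quoted in Section V-B.

```python
import numpy as np

def mel(f):   # standard Mel-relation, cf. (12)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def imel(m):  # inverse Mel-relation
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filt=20, n_fft=512, fs=16000):
    """Triangular Mel filter-bank matrix F (n_filt x n_fft/2); unit-peak
    triangles with Mel-uniform edge spacing are an assumption here."""
    bins = np.arange(n_fft // 2) * fs / n_fft               # DFT bin frequencies (Hz)
    edges = imel(np.linspace(0.0, mel(fs / 2.0), n_filt + 2))
    F = np.zeros((n_filt, n_fft // 2))
    for i in range(n_filt):
        lo, ctr, hi = edges[i], edges[i + 1], edges[i + 2]
        F[i] = np.clip(np.minimum((bins - lo) / (ctr - lo),
                                  (hi - bins) / (hi - ctr)), 0.0, None)
    return F

def dct_matrix(n_ceps=16, n_filt=20):
    """M x N DCT-II matrix C of (4)."""
    i = np.arange(n_ceps)[:, None]
    j = np.arange(n_filt)[None, :]
    beta = np.where(i == 0, np.sqrt(1.0 / n_filt), np.sqrt(2.0 / n_filt))
    return beta * np.cos(np.pi * i * (2 * j + 1) / (2 * n_filt))

# c = C log(F S), cf. (3), for one 320-sample frame and a 512-point DFT
frame = np.random.randn(320)
S = np.abs(np.fft.rfft(frame, n=512))[:256]   # magnitudes of half the spectrum
c = dct_matrix() @ np.log(mel_filterbank() @ S)
print(c.shape)   # (16,)
```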

VTLN features are obtained in the original method of Andreou et al. [7] by frequency-warping the magnitude spectra to get $S^{\alpha}$ before applying the unwarped Mel filter-bank. This is done by re-sampling the signal. Therefore, in this case the signal is warped for each VTLN warp-factor, while the Mel filter-bank is left unchanged. Lee and Rose [8] proposed an efficient alternate implementation, where the Mel filter-bank is inverse-scaled for each $\alpha$, while the signal spectrum is left unchanged, as shown in Fig. 2. This is the most popular method of VTLN-warping. Therefore, in the Lee-Rose method, VTLN-warping is integrated into the Mel filter-bank, and $F^{\alpha}$ denotes the (inverse) VTLN-warped Mel filter-bank. Conventionally, the warp-factor, $\alpha$, used for warping the spectra is in the range of 0.80 to 1.20 based on physiological arguments. For each $\alpha$, the center frequencies and bandwidths of the Mel filter-bank are appropriately scaled to obtain Mel- and VTLN-warped smoothed spectra [8]. The change in the filter-bank structure for different warp-factors is illustrated in Fig. 3. The slope of the last filter has been modified appropriately using piece-wise linear warping [31], so that the Nyquist frequency maps onto itself after frequency scaling. This avoids the bandwidth mismatch that arises due to frequency warping.

Fig. 3. Illustrating the change in the filter-bank structure with VTLN-warping in the linear-frequency (Hz) domain. The filters have nonuniform center frequencies with nonuniform bandwidths.

Fig. 4. The piece-wise linear warping function used in conventional VTLN, motivated by physiological arguments, is shown. The slope of the warping function is changed at $f_0$ to avoid bandwidth mismatch after frequency scaling.

The piece-wise linear warping function used in our experiments is given by

$$\psi_{\alpha}(f) = \alpha f, \qquad 0 \le f \le f_0 \qquad (5)$$

$$\psi_{\alpha}(f) = \alpha f_0 + \frac{f_N - \alpha f_0}{f_N - f_0}\,(f - f_0), \qquad f_0 < f \le f_N \qquad (6)$$

and is shown in Fig. 4. Here, $f_0$ represents the cutoff frequency where the slope is changed and $f_N$ is the Nyquist frequency. Although piece-wise linear warping is the most commonly used


frequency-warping and is motivated by physiological arguments that changes in VTL manifest as spectral scaling, the methods developed in this paper can be applied to any arbitrary warping function.
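As a quick check, a direct transcription of (5) and (6) confirms that the Nyquist frequency maps onto itself for any warp-factor. The cutoff and Nyquist values below are illustrative, not the paper's experimental settings.

```python
import numpy as np

def piecewise_linear_warp(f, alpha, f0, f_nyq):
    """Piece-wise linear warping of (5)-(6): scale by alpha below the
    cutoff f0, then change slope so that the Nyquist maps onto itself."""
    f = np.asarray(f, dtype=float)
    upper = alpha * f0 + (f_nyq - alpha * f0) / (f_nyq - f0) * (f - f0)
    return np.where(f <= f0, alpha * f, upper)

# The Nyquist (8 kHz here) maps onto itself for any alpha
print(piecewise_linear_warp([1000.0, 7000.0, 8000.0], 0.90, f0=7000.0, f_nyq=8000.0))
```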

The warped cepstral features are given by

$$c^{\alpha} = C \log\left(F^{\alpha}\, S\right). \qquad (7)$$

These are obtained by first warping and smoothing the power spectrum, followed by the log and DCT operations. The filter-bank $F^{\alpha}$ integrates both Mel- and VTLN-warping, performing smoothing as well as scaling of the spectrum. Observing (3) and (7), it is clear that the only difference between the conventional and VTLN-warped MFCC features is the change in the filter-bank structure, while the rest of the operations are the same.

For the case of $\alpha = 1$, $F^{\alpha}$ exactly corresponds to the case of conventional MFCC without VTLN-warping. From (3) and (7), the relation between $c$ and $c^{\alpha}$ is given as

$$c^{\alpha} = C \log\left(F^{\alpha}\, F^{-1} \exp\left(C^{-1} c\right)\right). \qquad (8)$$

A linear-transformation between $c$ and $c^{\alpha}$ can be derived if all the intermediate operations can be represented as linear operations, but from (8) it is evident that the log is a nonlinear operation and that, in practice, $F^{-1}$ does not exist. This is because the power spectrum cannot be completely reconstructed from the filter-bank outputs due to the smoothing operation [16]. We need to obtain $S$, since conventional VTLN-warping relations are always specified in the linear-frequency (Hz) domain, usually through a mathematical relation of the type $f^{\alpha} = \psi_{\alpha}(f)$, where $f^{\alpha}$ is the warped frequency and $\psi_{\alpha}(\cdot)$ is the frequency-warping function. Therefore, in this case, it is not possible to completely recover $S$ from the filter-bank output, and hence a linear-transformation is not possible.

In the next section, we show that separating the frequency-warping operation from the filter-bank avoids the need to invert the filter-bank operation or the logarithm and allows us to derive a linear transformation on conventional MFCC.

III. REALIZING A LINEAR-TRANSFORMATION

In this section, we show that separating the VTLN-warping (speaker scaling) from the Mel filter-bank helps us to derive a linear-transformation (LT) between warped and unwarped cepstral features within the conventional MFCC framework. Let $u = \log(F\,S)$ be the log-compressed Mel-warped filter-bank output. From (3), we see that the knowledge of $c$ implies the knowledge of $u$, as they form a DCT pair, i.e.,

$$u = C^{-1} c. \qquad (9)$$

However, we cannot completely recover $S$ from $u$ because of the filter-bank smoothing operation. Since $S$ cannot be completely recovered, we re-frame the problem as follows: $u^{\alpha} = \log(F^{\alpha} S)$ can be obtained by applying a linear-transformation on $u$ without recovering $S$, i.e.,

$$u^{\alpha} = W^{\alpha}\, u \qquad (10)$$

where $W^{\alpha}$ is the transformation that is applied on $u$ to obtain $u^{\alpha}$. The above equation states that the filter-bank, $F$, performs only Mel-warping and the transformation $W^{\alpha}$ performs VTLN-warping. This means that the VTLN-warping integrated into the filter-bank for efficient implementation in the conventional approach [8] is now performed separately and is not part of the filter-bank construction. This is illustrated in Fig. 5. If such a relation can be obtained, then from (3) and (7), the relation between $c$ and $c^{\alpha}$ is given by

$$c^{\alpha} = C\, W^{\alpha}\, C^{-1}\, c. \qquad (11)$$

By defining a LT between $u$ and $u^{\alpha}$, we completely avoid the inversion of the filter-bank for obtaining the raw magnitude spectrum $S$ and also bypass the $\log(\cdot)$ operation. We would like to remind the reader that the VTLN-warping relation is usually specified in the linear-frequency (Hz) domain, and therefore, at this point it is not clear what the relation between $u$ and $u^{\alpha}$ should be. In the next subsection, we describe a method to obtain a LT using the idea of band-limited interpolation.

Fig. 5. Modification in the signal processing steps (separating the Mel- and VTLN-warping) for realizing a linear-transformation. The filter-bank performs only Mel-warping of the spectra, and the proposed band-limited interpolation matrix performs the VTLN-warping.

    A. Band-Limited (Sinc-) Interpolation

For a band-limited continuous-time signal, $x(t)$, given uniformly spaced samples of the signal that are appropriately sampled, i.e., $x(nT)$, we can exactly reconstruct the original continuous-time signal. This implies that we can recover the values of the time signal at time-instants other than those at the uniformly spaced samples. We use this idea to obtain the LT for VTLN-warping, except that we now consider que-frency limited signals instead of frequency-limited signals.
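The mechanism is the classical sampling-theorem reconstruction, sketched below for a time signal; in the derivation that follows, the same kernel is applied to que-frency limited Mel log spectra instead. The tone frequency, grid, and evaluation points are arbitrary illustrative choices, and the reconstruction is only approximate near the edges because the sample count is finite.

```python
import numpy as np

T = 0.01                                    # sampling interval
n = np.arange(64)
x = np.cos(2 * np.pi * 5.0 * n * T)         # 5 Hz tone, well below 1/(2T) = 50 Hz
t_new = np.array([0.123, 0.4567])           # off-grid time instants

# x(t) = sum_n x(nT) sinc((t - nT)/T), with np.sinc(u) = sin(pi u)/(pi u)
x_new = np.array([np.sum(x * np.sinc((t - n * T) / T)) for t in t_new])
print(x_new)
print(np.cos(2 * np.pi * 5.0 * t_new))      # true values for comparison
```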

The vector $u$ can be obtained either by applying a nonuniform filter-bank (shown in Fig. 3) to the linear-frequency (Hz) magnitude spectrum or by applying a uniformly spaced filter-bank (shown in Fig. 6) to the Mel-warped magnitude spectrum. Therefore, in the Mel-frequency domain, the continuous Mel-warped log-compressed spectrum, $L(m)$, can be interpreted as the convolved output of a triangle function on the Mel-warped magnitude spectrum, followed by a log operation on the amplitudes. We can think of the vector $u$ as being obtained by uniformly sampling $L(m)$ at $m_n = n\Delta$, where $\Delta$ is the uniform filter spacing in the Mel domain, and the positions of these samples exactly correspond to the center frequencies of the filter-bank. Because of the triangle smoothing and the subsequent $\log(\cdot)$ operation on the output (which reduces dynamic range), the que-frency content of this log-compressed smoothed spectrum lies only in the low que-frency region. Fig. 7 compares the cepstral coefficients obtained with and without filter-bank smoothing. We see that the cepstral coefficients die


Fig. 6. The change in the filter-bank structure with VTLN-warping in the Mel-frequency domain is illustrated. The filters have uniformly spaced center frequencies with uniform bandwidths for $\alpha = 1$. However, they are nonuniformly spaced for $\alpha$ different from unity.

Fig. 7. The effect of filter-bank smoothing on the cepstral coefficients is illustrated. Filter-bank smoothing helps limit the que-frency content to the lower region, ensuring que-frency limitedness.

down faster with filter-bank smoothing, indicating that the que-frency content is limited to the low que-frency region. During VTLN-warping, the filter center frequencies are appropriately scaled in the linear-frequency (Hz) domain by inverse-$\alpha$, as described in Lee-Rose [8]. This corresponds to the center frequencies of the filter-bank being nonuniformly spaced in the Mel-frequency domain, as shown in Fig. 6. As we represent the log-compressed Mel-warped smoothed magnitude spectrum by the continuous function $L(m)$, the output of the VTLN-warped filter-bank corresponds to sampling $L(m)$ nonuniformly, i.e., at $m_k^{\alpha}$. These nonuniformly spaced samples exactly correspond to the elements of the vector $u^{\alpha}$.

From the above discussion, we point out that the elements of vector $u$ (i.e., $u[n]$) can be interpreted as uniformly spaced samples and the elements of $u^{\alpha}$ (i.e., $u^{\alpha}[k]$) as nonuniformly spaced samples of the same continuous function $L(m)$. The main idea is that, given the samples in $u$, the samples (or elements) in $u^{\alpha}$ can be reconstructed using band-limited interpolation, provided that the cepstrum is que-frency limited.

Let $L(m)$ and its cepstrum $g$ form a discrete-time Fourier transform (DTFT) pair. Then sampling $L(m)$ would result in a periodic repetition of $g$. As long as $g$ is strictly que-frency limited and the sampling rate is sufficiently high, there is no aliasing in the cepstral domain. In such a case, the value of $L(m)$ at any Mel-frequency can be found from its uniformly spaced samples at $m_n$ through band-limited interpolation. This basically exploits the sampling theorem, where a signal (in this case a frequency-domain signal) can be reconstructed from its samples using Sinc-interpolation. $L(m)$ is nowhere used for any calculation purposes and is presented here only for better understanding of the derivation of the band-limited interpolation matrix.

Note that que-frency limitedness ensures that there is no overlap in the periodic repetition of $g$ (i.e., no aliasing), and hence $L(m)$ can be exactly recovered. The que-frency limitedness property depends both on the amount of smoothing done by the Mel-filters (which controls the number of significant cepstral coefficients) and on the number of Mel-filters, which determines the periodicity. If there is aliasing, there will be differences between the Sinc-interpolated and the actual values. Since our effort in this paper is to use conventional MFCC processing, both of these parameters are already fixed by the feature extraction stage. However, as we will show later, even using conventional MFCC processing there is very little difference between interpolated and true values.

The steps to obtain the transformation matrix are as follows.

1) Let $m_n$, $n = 1, \ldots, N$, represent the uniformly spaced Mel-frequencies, with the samples of $L(m)$ at these points being the elements of vector $u$. Their corresponding linear frequencies (Hz) are nonuniformly spaced and are represented by $f_n$, $n = 1, \ldots, N$. These are the center frequencies of the Mel-filters in the linear-frequency (Hz) domain and are related through the standard Mel-relation, i.e.,

$$m_n = 2595 \log_{10}\left(1 + \frac{f_n}{700}\right). \qquad (12)$$

2) During VTLN-warping, the warping function $\psi_{\alpha}(\cdot)$ is applied to obtain the warped frequencies. Let $f_k^{\alpha} = \psi_{\alpha}(f_k)$ represent the warped frequencies in the linear-frequency (Hz) domain. Although our proposed method will work for any warping function $\psi_{\alpha}(\cdot)$, for illustration purposes we use the piece-wise linear warping function as defined in (5) and (6). The corresponding VTLN-warped center frequencies of the filters in the Mel-frequency domain, $m_k^{\alpha}$, will not be related through a linear scaling relation, since

$$m_k^{\alpha} = 2595 \log_{10}\left(1 + \frac{\psi_{\alpha}(f_k)}{700}\right) \neq \alpha\, m_k. \qquad (13)$$

Therefore, while $f^{\alpha} = \alpha f$ for the linear-scaling relation (i.e., along the $f$-axis), $m^{\alpha} \neq \alpha m$ along the $m$-axis, as seen from (12) and (13) and graphically shown in Fig. 8. The Cosine-interpolation approach proposed in [22], [23] assumes $m^{\alpha} = \psi_{\alpha}(m)$, i.e., warping in the Mel domain (i.e., the $m$ domain), and therefore does not correspond to conventional VTLN-warping, which is specified in the frequency domain (i.e., the $f$ domain). While we refer only to piece-wise linear warping for illustration purposes, any frequency-warping function can be used in our proposed approach by specifying the appropriate $\psi_{\alpha}(\cdot)$ in (13).


Fig. 8. The band-limited interpolation for the linear-scaling relation is illustrated. Warping is defined in the linear-frequency (Hz) domain or $f$-axis, i.e., $f^{\alpha} = \alpha f$. Along the $m$-axis, the $m_n$ are the center frequencies of the uniformly spaced filter-bank corresponding to the $f_n$ in the Mel domain. Similarly, the $m_k^{\alpha}$ are the center frequencies of the warped filter-bank and are nonuniformly spaced in the Mel domain. The band-limited interpolation matrix is defined to obtain samples at the $m_k^{\alpha}$ given samples at the $m_n$. In the figure, one marker type represents unwarped frequencies in both the linear-frequency (Hz) and Mel-frequency domains, and the other represents warped frequencies in both domains.

3) The Fourier relation between $L(m)$ and its que-frency components $g$ is given by

$$L(m) = \sum_{q=0}^{N+1} g[q] \cos\left(\frac{\pi q m}{m_{N+1}}\right) \qquad (14)$$

where $m_{N+1}$ is the Nyquist frequency in the Mel-frequency domain. Here, we assume that the signal is periodic with a period of $2m_{N+1}$ and symmetric around $0$ and $m_{N+1}$. Therefore, theoretically, half-filters are present at indices $0$ and $N+1$. The values at these indices are required for performing band-limited interpolation. If we assume that $L(m)$ is que-frency limited, the elements of $g$ can be determined from the uniform samples $u[n] = L(m_n)$ as [we use the variable $p$ since $q$ is already used in (14)]

$$g[p] = \frac{\beta_p}{N+1} \sum_{n=0}^{N+1} u[n] \cos\left(\frac{\pi p\, m_n}{m_{N+1}}\right) \qquad (15)$$

with $\beta_p = 1/2$ for $p \in \{0, N+1\}$ and $\beta_p = 1$ otherwise. Substituting $g$ of (15) in (14), evaluated at the warped positions $m_k^{\alpha}$, we get

$$u^{\alpha}[k] = L\left(m_k^{\alpha}\right) = \sum_{n=0}^{N+1} u[n] \sum_{p=0}^{N+1} \frac{\beta_p}{N+1} \cos\left(\frac{\pi p\, m_n}{m_{N+1}}\right) \cos\left(\frac{\pi p\, m_k^{\alpha}}{m_{N+1}}\right).$$

The band-limited interpolation matrix between $u$ and $u^{\alpha}$ is therefore given by

$$\tilde{W}^{\alpha}(k, n) = \sum_{p=0}^{N+1} \frac{\beta_p}{N+1} \cos\left(\frac{\pi p\, m_n}{m_{N+1}}\right) \cos\left(\frac{\pi p\, m_k^{\alpha}}{m_{N+1}}\right) \qquad (16)$$

where $m_n$ is obtained from (12), $m_k^{\alpha}$ from (13), and the VTLN-warping relation $\psi_{\alpha}(\cdot)$ is specified in the $f$ domain in (13). Using the even-symmetry property, we obtain the $N \times N$ interpolation matrix $W^{\alpha}$, i.e., $u^{\alpha} = W^{\alpha} u$. Alternatively, the above matrix can also be written as a product of matrices:

$$W^{\alpha} = A^{\alpha}\, B \qquad (17)$$

where $N$ is the number of filters and the matrices $A^{\alpha}$ and $B$ are given by

$$A^{\alpha}(k, p) = \cos\left(\pi p\, \tilde{m}_k^{\alpha}\right), \qquad B(p, n) = \frac{\beta_p}{N+1} \cos\left(\pi p\, \tilde{m}_n\right)$$

where $\tilde{m} = m / m_{N+1}$ are normalized frequencies with the range $[0, 1]$.

Fig. 9. Framework of the proposed linear-transformation approach. Note that only the conventional MFCC features are generated; the warped features are obtained using the LT matrices $T^{\alpha}$.

The linear-transformation matrix to obtain the VTLN-warped MFCC given the conventional MFCC is then

$$T^{\alpha} = C_{M \times N}\; W^{\alpha}\; \left[C^{-1}\right]_{N \times M}. \qquad (18)$$

Here, $M$ represents the number of static cepstral coefficients in the feature-vector and $N$ is the number of Mel-filters used in the feature extraction. The feature generation process using the proposed linear-transformation (LT) approach is illustrated in Fig. 9. Although we have shown the case of piece-wise linear warping, the same procedure can be used for any arbitrary warping function by choosing the appropriate $\psi_{\alpha}(\cdot)$ in (13). Fig. 10 compares the VTLN-warped cepstra obtained using the conventional and the proposed LT approach for piece-wise linear and bilinear warping functions.

Fig. 10. Comparing the VTLN-warped cepstra obtained using the conventional and the proposed Sinc-interpolation approach for piece-wise linear and bilinear warping functions. (a) Piece-wise linear warping. (b) Bilinear warping.
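To make the construction concrete, the following numpy sketch builds the interpolation matrix and the full transform of (18) under the reconstruction above. Rather than the closed-form product of (17), the interpolator is realized as a cosine-basis fit on the $N$ interior filter centers (solve for the que-frency components, then re-evaluate them at the warped centers), which sidesteps the half-filter endpoint values; the warping function, cutoff, Nyquist, and filter counts are illustrative.

```python
import numpy as np

def mel(f):   return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)
def imel(m):  return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def dct_matrix(N):
    """N x N DCT-II matrix C of (4)."""
    i = np.arange(N)[:, None]
    j = np.arange(N)[None, :]
    beta = np.where(i == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    return beta * np.cos(np.pi * i * (2 * j + 1) / (2 * N))

def vtln_lt_matrix(alpha, warp, n_filt=20, n_ceps=16, f_nyq=8000.0):
    """Sketch of T_alpha, cf. (16)-(18): que-frency limited interpolation
    from the uniform Mel centers m_n to the warped centers m_k, sandwiched
    between the DCT and its (truncated) inverse."""
    N = n_filt
    m_nyq = mel(f_nyq)
    m_n = np.linspace(0.0, m_nyq, N + 2)[1:-1]        # uniform Mel centers m_1..m_N
    m_k = mel(warp(imel(m_n), alpha))                 # VTLN-warped centers, cf. (13)
    p = np.arange(N)
    P_u = np.cos(np.pi * np.outer(m_n / m_nyq, p))    # cosine basis at the m_n
    P_w = np.cos(np.pi * np.outer(m_k / m_nyq, p))    # same basis at the m_k
    W = P_w @ np.linalg.inv(P_u)                      # u_alpha = W u, cf. (16)
    C = dct_matrix(N)
    return C[:n_ceps] @ W @ np.linalg.inv(C)[:, :n_ceps]   # truncated, cf. (18)

# Piece-wise linear warping of (5)-(6) with illustrative f0 = 7000 Hz, fN = 8000 Hz
pw = lambda f, a: np.where(f <= 7000.0, a * f,
                           a * 7000.0 + (8000.0 - a * 7000.0) / 1000.0 * (f - 7000.0))
T = vtln_lt_matrix(0.90, pw)
print(T.shape)   # (16, 16)
```

A useful unit check: at $\alpha = 1$ the warped centers coincide with the uniform ones, $W$ reduces to the identity, and $T$ collapses to the identity on the first 16 cepstral coefficients.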

The idea of the linear-transformation presented here can be seen as a special case of the method proposed by Umesh et al. in [21], where a linear-transformation is derived by separating both Mel- and VTLN-warping from the filter-bank. The main differences between these approaches are as follows.

The filters are uniformly spaced in the Mel-frequency domain for the approach proposed in this paper, i.e., the $m_n$ are uniformly spaced. In the work of Umesh et al. in [21], the filters are uniformly spaced in the linear-frequency (Hz) domain, i.e., the $f_n$ are uniformly spaced. Therefore, the conventional Mel filter-bank is not used in [21].

The interpolation matrix proposed in this paper is defined as

$$u^{\alpha} = W^{\alpha}\, u \qquad (19)$$

i.e., it performs only VTLN-warping on the Mel-warped spectra. In [21], the interpolation matrix is defined as

$$u^{\alpha} = \bar{W}^{\alpha}\, v \qquad (20)$$

where $v$ is the smoothed spectrum without Mel-warping and the transformation matrix $\bar{W}^{\alpha}$ performs both Mel- and VTLN-warping to obtain VTLN-warped MFCC features.

    B. Cosine-Interpolation

Motivated by the work of Umesh et al. [21], Panchapagesan and Alwan [22], [23] proposed a linear-transformation approach that incorporates the interpolation and warping in the inverse discrete cosine transform (IDCT) matrix; we refer to this approach as Cosine-interpolation. Considering $m$ to be the continuous Mel-frequency variable, the signal is assumed to be periodic with a period of $2 m_{N+1}$ and symmetric about the points $m = 0$ and $m = m_{N+1}$. A normalization variable is defined as follows:

$$v = \frac{m}{2\, m_{N+1}} \qquad (21)$$

and $v$ has the range $[0, 1/2]$. The warped IDCT matrix is given as

$$\tilde{C}^{-1}_{\alpha}(i, p) = \beta_p \cos\left(2 \pi p\, \theta_{\alpha}(v_i)\right) \qquad (22)$$

where the $v_i$ are the normalized half-sample shifted positions of the Mel filter-bank and $\theta_{\alpha}(\cdot)$ is the frequency-warping function. The relation between the warped and unwarped cepstral features is given by

$$c^{\alpha} = C\; \tilde{C}^{-1}_{\alpha}\; c. \qquad (23)$$

From the above equations, we see that the VTLN-warping, $\theta_{\alpha}(\cdot)$, is performed on the half-sample shifted positions $v_i$ of the filter-bank center frequencies, which are already Mel-warped. In conventional VTLN, frequency warping is performed in the linear-frequency (Hz) domain through $\psi_{\alpha}(f)$. From the above discussion, it is clear that the Cosine-interpolation approach performs VTLN-warping on the Mel-warped frequencies and is not equivalent to conventional VTLN-warping. Panchapagesan and Alwan themselves point to these differences [23, below Eq. 27]. As seen from the warp-factor histograms in [23], the conventional VTLN warp-factors lie between (0.88, 1.24), while the Cosine-interpolation based warp-factors lie in the range (0.91, 1.11) for the same piece-wise linear warping. This indicates that conventional frequency-warping and the warping used in the Mel domain by Cosine-interpolation cannot be directly compared, since the domains in which the warping is applied are different. In practice, most frequency-warping functions are specified in the linear-frequency (Hz) domain (and not the Mel domain), often motivated by physiological arguments. To summarize, the main differences between Cosine-interpolation and the linear-transformation derived in this paper are as follows.

In Cosine-interpolation, VTLN-warping is applied in the Mel domain, and hence the corresponding warp-factors are very different when compared with those from conventional VTLN.

The interpolation is performed using the inverse-DCT matrix in Cosine-interpolation [see (22)], whereas the approach presented in this paper uses band-limited interpolation, as shown in (17).
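For contrast with the Sinc-interpolation sketch above, here is a minimal sketch of the Cosine-interpolation transform of (22) and (23) as reconstructed above. The half-sample shifted positions, the DCT-II scaling, and the linear warping of the normalized Mel variable used in the example are assumptions of this sketch rather than details taken from [22], [23].

```python
import numpy as np

def cosine_interp_lt(alpha, theta, n_filt=20, n_ceps=16):
    """Warped-IDCT linear transform, cf. (22)-(23): the warping theta is
    applied to the normalized half-sample shifted positions v_i, i.e.,
    directly in the (normalized) Mel domain."""
    N = n_filt
    v = (np.arange(1, N + 1) - 0.5) / (2.0 * N)       # v_i in (0, 1/2)
    p = np.arange(N)
    beta = np.where(p == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))
    Cinv_w = beta * np.cos(2.0 * np.pi * np.outer(theta(v, alpha), p))  # warped IDCT, (22)
    i = np.arange(N)[:, None]
    C = np.where(i == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N)) \
        * np.cos(np.pi * i * (2 * np.arange(N)[None, :] + 1) / (2 * N))  # DCT-II of (4)
    return (C @ Cinv_w)[:n_ceps, :n_ceps]             # c_alpha = C Cinv_w c, cf. (23)

# Linear scaling of the normalized Mel variable (illustrative choice of theta)
T = cosine_interp_lt(0.95, lambda v, a: np.clip(a * v, 0.0, 0.5))
print(T.shape)   # (16, 16)
```

At $\alpha = 1$ the warped IDCT is the exact inverse of the DCT-II, so the transform again reduces to the identity; the difference from Sinc-interpolation is solely the domain in which the warping acts.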

Before proceeding further, we present the recognition setup along with the details of the databases used in our experiments in the next section.


TABLE I. DESCRIPTION OF THE CORPUS USED FOR EXPERIMENTS

IV. EXPERIMENTAL SETUP

The recognition experiments include four different sets of speech data: Wall Street Journal (WSJ0) [32], European Parliamentary Plenary Sessions (EPPS) English [33], Texas Instruments connected digits (TIDIGITS) [34], and Aurora 4.0 [35]. WSJ0, TIDIGITS, and EPPS-English are clean speech data, whereas Aurora 4.0 is noisy speech data. The details of the databases are presented in Table I. Aurora 4.0 consists of 14 different test sets, where seven of them are recorded with a microphone similar to the one used for recording the training data, while the other seven use a different microphone.

All the experiments were done using the RWTH Aachen Speech Recognition System [36], except for the TIDIGITS task. While performing feature extraction, we use 20 filters and obtain 16 cepstral coefficients. The features are mean and variance normalized at the segment level, and LDA is applied over a window of nine consecutive frames to derive a 45-dimensional feature vector. The system used classification and regression tree (CART) based state tying. We have 1501 generalized triphones for both WSJ0 and Aurora 4.0, and 4501 generalized triphones for the EPPS task. The HMM model consists of three emitting states with 256 mixtures per state and uses a pooled covariance matrix.

The TIDIGITS speech recognition task is done using HTK [37] and uses word models. It has 11 word models, which include zero to nine and oh. The features are 39-dimensional, comprising normalized log-energy, the cepstral coefficients (excluding $c_0$), and their first- and second-order derivatives. Cepstral mean subtraction is applied at the segment level. The digits were modeled with simple left-to-right HMMs without skips, having 16 emitting states with five diagonal-covariance Gaussian mixtures per state. Silence is modeled using a three-state HMM with six-mixture Gaussian models per state.

While performing VTLN in training, we follow a maximum-likelihood (ML) based approach for estimating the optimal warp-factor, i.e.,

$$\hat{\alpha}_i = \arg\max_{\alpha} \Pr\left(X_i^{\alpha} \mid \lambda_{\mathrm{SI}}, \mathcal{W}_i\right) \qquad (24)$$

where $\lambda_{\mathrm{SI}}$ is the SI model and $\mathcal{W}_i$ is the known transcription during training. $X_i^{\alpha}$ is the VTLN-warped feature vector sequence of utterance $i$, with the static features appended with the delta and acceleration coefficients, or is obtained after transformation using LDA. Since the delta and acceleration coefficients are obtained from the static coefficients, the same VTLN transformation matrix can be used to obtain the VTLN-warped delta and acceleration coefficients. Therefore, the relation between the unwarped and VTLN-warped features is given by

$$x_t^{\alpha} = \begin{bmatrix} T^{\alpha} & 0 & 0 \\ 0 & T^{\alpha} & 0 \\ 0 & 0 & T^{\alpha} \end{bmatrix} x_t. \qquad (25)$$

If the features are obtained using an LDA transformation matrix $A_{\mathrm{LDA}}$, then the relation between the VTLN-warped and static unwarped features is given by [26]

$$x_t^{\alpha} = A_{\mathrm{LDA}} \left( I_w \otimes T^{\alpha} \right) s_t \qquad (26)$$

where $w$ represents the window length and $s_t$ represents the supervector formed by concatenating the static MFCC cepstra from $w$ adjacent frames.

Using the estimated warped features, a new VTLN model is obtained. For performing the warp-factor estimation during testing, we use a Gaussian mixture model (GMM) classifier [38]. Unwarped features corresponding to each warp-factor obtained in training are used to train a GMM with 256 mixtures. The optimal warp-factor in recognition is obtained by calculating the likelihood with respect to each warp-factor GMM and choosing the one that gives the best likelihood. The warp-factors are estimated at the speaker level in training and at the utterance level during recognition. During warped feature extraction, we map the frequency points zero and $\pi$ onto themselves using piece-wise linear warping. We do not account for the Jacobian in VTLN for the experiments presented in this paper.

V. IMPLEMENTATION DETAILS

In this section, we present the implementation details for the Sinc- and Cosine-interpolation approaches. We then present the recognition results comparing the performance of the linear-transformation approaches with conventional VTLN.

    A. Cosine-Interpolation

In this section, we discuss the implementation details for Cosine-interpolation and argue that the range of warp-factors has to be properly mapped, either in the Mel domain or in the linear-frequency (Hz) domain, before comparing the recognition performance with the conventional and Sinc-interpolation approaches. Before proceeding further, we present recognition results for the TIDIGITS task in Table II. The models were trained using male speakers and are used for recognizing child speakers. For this task, Panchapagesan and Alwan observed that the Cosine-interpolation approach performed better than conventional VTLN (see [22] and [23, Sec. 6.1]). As we will show next, the warp-factors for Cosine-interpolation and conventional VTLN need to be mapped before they can be compared. This is due to the difference in the domains where frequency warping is applied. If a proper mapping is chosen, the difference in performance observed in [22] and [23] no longer exists.


TABLE II. RECOGNITION RESULTS (%WER) COMPARING THE PERFORMANCE OF DIFFERENT APPROACHES TO VTLN FOR THE MALE-TRAIN AND CHILD-TEST CASE OF TIDIGITS. DIFFERENT RANGES OF WARP-FACTORS HAVE TO BE USED TO GET COMPARABLE PERFORMANCE FOR THE CONVENTIONAL AND COSINE-INTERPOLATION APPROACHES, DUE TO THE DIFFERENCE IN THE DOMAIN WHERE FREQUENCY WARPING IS APPLIED.

Baseline - No VTLN; Conv. - Conventional; LT - Linear Transformation; M-C - Male Train, Child Test

Fig. 11. The different frequency-warping functions in the linear-frequency (Hz) domain are shown. Using a warp-factor of 0.80 in the Mel domain results in a warping function in the linear-frequency domain (dotted line) that is quite different from using the same warp-factor value (i.e., 0.80) directly in the linear-frequency (Hz) domain (solid line). The figure also shows that using 0.9194 as the warp-factor in the Mel domain generates a frequency-warping function very similar to using 0.80 in the linear-frequency (Hz) domain.

We briefly discuss the physiological motivation for choosing the range of warp-factors used in conventional VTLN for piece-wise linear warping. The average vocal-tract length for males is about 17 cm, that for females is about 14.5 cm, and that for children is about 12 cm. Males can have vocal-tract lengths that are 19 cm or longer. Since the differences in vocal-tract lengths crudely manifest as scaling of the spectra for the same sound, the range of scaling (or warp-) factors is determined by the ratio of vocal-tract lengths. For adult speakers (i.e., only male and female speakers), this ratio varies from about $14.5/17 \approx 0.85$ to about $17/14.5 \approx 1.17$. Usually, the warp-factors for adult data are in the range of 0.80 to 1.20. However, if we train models using male speakers and use child speakers for testing, then the range of warp-factors has to be different. In this case, the lower end of the range of warp-factors can be approximately $12/19 \approx 0.63$.

As pointed out in Section III-B, Cosine-interpolation performs VTLN-warping on the Mel-warped frequencies (i.e., $m^{\alpha} = \psi_{\alpha}(m)$) as opposed to performing frequency-warping in the linear-frequency (Hz) domain (i.e., $f^{\alpha} = \psi_{\alpha}(f)$). Using the same numerical value of the warp-factor in both the Mel and linear-frequency (Hz) domains will result in different warpings in the frequency (Hz) domain. The differences are illustrated in Fig. 11. Using a warp-factor of 0.80 in the Mel-frequency domain and mapping the warping back to the linear-frequency domain (shown in the figure with a dotted line) produces a very different warping function in the linear-frequency domain when compared to directly using a 0.80 warp-factor in the linear-frequency domain (shown with a solid line in the figure). On the other hand, using a warp-factor of 0.9194 in the Mel domain and mapping the function back to the linear-frequency domain results in a frequency-warping very similar to 0.80. We suspect that for the TIDIGITS task in [22] and [23], the same range of warp-factors from 0.80 to 1.25 was used in both conventional VTLN and Cosine-interpolation. Since the test data is from child speakers, the lower limit of 0.80 for conventional VTLN did not provide sufficient search space and resulted in degraded performance. On the other hand, scaling the Mel domain by 0.80 is approximately equivalent to scaling the linear-frequency domain by a factor of 0.5695. This provided a larger search space for Cosine-interpolation, probably helping it to achieve better performance for children's speech when compared to conventional VTLN in [22], [23].

In order to have a fair comparison, the warp-factors in the linear-frequency (Hz) domain (or Mel domain) have to be in the same range for all the approaches. This can be done by calculating the equivalent warp-factor in one domain (say the linear-frequency (Hz) domain) after fixing the warp-factor in the other domain (say the Mel-frequency domain). The mapping of warp-factors is done as follows.

Let the cutoff where the slope of the warping function changes be $f_0$ in Hz, and similarly let the point where the slope changes in the Mel domain be $m_0 = \mathrm{Mel}(f_0)$. The corresponding warped frequencies are given as

$$f_0^{\alpha} = \alpha_f\, f_0 \qquad (27)$$

$$m_0^{\alpha} = \alpha_m\, m_0 \qquad (28)$$

where $\alpha_f$ and $\alpha_m$ are different. The idea is that the inverse-Mel-warped $\alpha_m m_0$ should match $\alpha_f f_0$, or the Mel-warped $\alpha_f f_0$ should match $\alpha_m m_0$. So this involves finding the value of $\alpha_m$ (or $\alpha_f$) that matches $\alpha_f f_0$ (or $\alpha_m m_0$), and this can be found by equating (27) and (28). Therefore, the equivalent $\alpha_m$ when $\alpha_f$ is fixed is given by

$$\alpha_m = \frac{\mathrm{Mel}(\alpha_f\, f_0)}{m_0} \qquad (29)$$

and the equivalent $\alpha_f$ when $\alpha_m$ is fixed is given by

$$\alpha_f = \frac{\mathrm{Mel}^{-1}(\alpha_m\, m_0)}{f_0} \qquad (30)$$

where $\mathrm{Mel}^{-1}(\cdot)$ converts Mel frequencies to Hz and, similarly, $\mathrm{Mel}(\cdot)$ converts Hz frequencies to Mel frequencies.
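The mapping of (29) and (30) is two lines of code once the Mel-relation of (12) and its inverse are in hand. With an illustrative cutoff of $f_0 = 7000$ Hz (the exact value depends on the piece-wise linear warping configuration), the computed equivalents land close to the 0.9194 and 0.5695 figures discussed above.

```python
import numpy as np

def mel(f):   return 2595.0 * np.log10(1.0 + f / 700.0)
def imel(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def alpha_mel_from_hz(alpha_f, f0):
    """Equivalent Mel-domain warp-factor for a fixed Hz-domain factor, (29)."""
    return mel(alpha_f * f0) / mel(f0)

def alpha_hz_from_mel(alpha_m, f0):
    """Equivalent Hz-domain warp-factor for a fixed Mel-domain factor, (30)."""
    return imel(alpha_m * mel(f0)) / f0

print(alpha_mel_from_hz(0.80, 7000.0))   # ~0.92: Hz-domain 0.80 in the Mel domain
print(alpha_hz_from_mel(0.80, 7000.0))   # ~0.58: Mel-domain 0.80 in the Hz domain
```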


We map the warp-factors as discussed above by fixing $\alpha_f$ and calculating the corresponding $\alpha_m$. The new range of warp-factors for $\alpha_m$ will be (0.91, 1.08) for the corresponding $\alpha_f$ in the range (0.80, 1.25) for adult speech. These ranges seem to be consistent with the observations made by Panchapagesan and Alwan (see [23, Fig. 4]). The recognition performance on the TIDIGITS task using the proper range of warp-factors is shown in Table II, i.e., we extend the lower end of the range to account for child speakers. Unlike [22] and [23], we now observe that all the approaches to VTLN have similar performance. When the warp-factors are in the improper range of (0.80, 1.20) for child speakers, the conventional VTLN performance is inferior to Cosine-interpolation, as observed in [22].

For all the subsequent experiments in this paper, we use the mapping of warp-factors for Cosine-interpolation obtained by fixing the warp-factors in the linear-frequency (Hz) domain. In the next section, we present the implementation details of Sinc-interpolation for performing the linear-transformation of conventional MFCC.

B. Linear-Transformation of Conventional MFCC for VTLN Using Sinc-Interpolation

In order to perform band-limited interpolation, full spectral information in the frequency band is necessary. Since we use conventional MFCC, the available spectral information lies between the first and last filters of the filter-bank. Using a conventional filter-bank with 20 filters, the first filter has a center frequency around 135.2 Mel (or 89.2 Hz) and the last filter has a center frequency around 2704.8 Mel (or 7016.2 Hz). It is quite unlikely that a formant for any specific sound exists below or above these frequencies for any particular speaker. We can therefore safely assume that the first and last filter-bank center frequencies act as the zero and Nyquist frequencies, which should in no way affect the VTLN performance. This assumption inherently means that the center frequencies of the first and last filters map onto themselves after frequency warping. Note that by stating that the center frequencies of the first and last filters do not change with frequency-warping, we do not mean that we are ignoring the speech spectrum below 89.2 Hz. That information is still present in the first filter, since the lower end of the filter starts at zero frequency. The point we want to make is that the center frequencies of the first and last filters do not change after frequency warping. The only consequence of the center frequency of the first filter being used as the zero frequency in the linear-transformation case is that there will be a small, consistent difference in numerical value between the warp-factor estimate obtained by this method and the conventional method. This is because there will be a small difference in the slope used in (5) and (6). The linear-transformation relation is given by

$$c^{\alpha} = T^{\alpha}_{\mathrm{mel}}\, c \qquad (31)$$

where the subscript mel indicates that the conventional Mel filter-bank is used. The linear transformation $T^{\alpha}_{\mathrm{mel}}$ is derived as shown in (17) and (18).

The results comparing the recognition performance using the conventional Mel filter-bank are shown in Table III. We make the following observations.

TABLE III. RECOGNITION RESULTS (%WER) OF CONVENTIONAL AND LINEAR-TRANSFORM APPROACHES TO VTLN USING CONVENTIONAL MFCC. BOTH SINC- AND COSINE-INTERPOLATION APPROACHES PERFORM COMPARABLY TO CONVENTIONAL VTLN.

Fig. 12. Histogram and contour plot comparing the alpha estimates for the conventional and Sinc-interpolation approaches on the EPPS training data. In the linear-transformation approach, since we use conventional Mel-filters (without half-filters), the corresponding warp-factors are approximately 0.02 less than in conventional VTLN, which is reflected in the histogram. (a) Histogram plot. (b) Contour plot.

We observe that the linear-transformation based approaches perform comparably with the conventional approach, irrespective of noisy or clean speech. More importantly, we use the conventional Mel filter-bank without any modification and still perform VTLN-warping using a linear-transformation.

Fig. 12 shows the histogram and contour plots for the warp-factors obtained using the conventional and the proposed Sinc-interpolation approaches on the training data of the EPPS task. The histogram plot shows the distribution of warp-factors, and the contour plot gives an idea of how the warp-factor estimates differ between the Sinc and conventional VTLN approaches. From the histogram, we observe that the majority of the warp-factors are shifted by a single warp-factor step; for example, the peaks at 1.02, 0.94, and 0.92 in the conventional approach appear at 1.00, 0.92, and 0.90, respectively. A similar behavior can also be observed in the contour plot. This is because we are using conventional Mel-filters (i.e., no additional half-filters), with the center frequency of the first Mel-filter (at 89.2 Hz) being mapped to the zero frequency to enable the linear-transformation approach without any change in the signal processing. This leads to a small, consistent difference between the warp-factors obtained with the linear-transformation and the conventional approach. Our analysis of the warp-factors obtained on the EPPS training data indicates that, for any warp-factor in conventional VTLN, the corresponding warp-factor is the same or 0.02 smaller in 90% of the utterances. The correlation coefficient between the alpha estimates is 0.93, which also indicates that the deviations are only marginal.

Another source of approximation is the use of truncated unwarped MFCC cepstra in (18) to obtain the VTLN-warped cepstra, which also results in some loss of information. Though there are small differences in the warp-factor distribution, the recognition performance of LT-Sinc is comparable to conventional VTLN on the variety of tasks presented in this paper.

    VI. CONCLUSION

In this paper, we have presented an approach to perform VTLN using a linear transformation on conventional MFCC without any modification of the feature extraction steps. The linear-transformation is given by (18), with the interpolation matrix given by (17). Therefore, the linear-transformation can be analytically calculated using the above equations for any $\alpha$, as well as for any arbitrary warping function, by choosing the appropriate $\psi_{\alpha}(\cdot)$ in (13). This is an important difference when compared to Cosine-interpolation, where $\alpha$ is used in the Mel domain and the warping is different from conventional VTLN. Further, the corresponding warp-factors between the Cosine-interpolation approach and conventional VTLN cannot be easily compared. The key idea of our approach is to separate the speaker-scaling operation from the filter-bank, which helps us derive a linear transformation for VTLN using the idea of band-limited interpolation. The use of such transformations enables the warp-factors to be efficiently estimated by accumulating sufficient statistics, the use of a regression-tree framework to perform VTLN at the acoustic-class level, and the use of VTLN matrices as base matrices for adaptation until sufficient data is available. Such approaches cannot easily be implemented in the conventional VTLN framework. Using four different tasks to illustrate the efficacy of our proposed approach, we have shown that the recognition performance of our proposed linear-transformation approach is always comparable to conventional VTLN on both clean and noisy speech data.

    ACKNOWLEDGMENT

D. R. Sanand would like to thank Prof. H. Ney for giving him an opportunity to work as a research assistant in the Human Language Technology Group at RWTH Aachen University, Aachen, Germany. The authors would like to thank Prof. H. Ney for providing resources to run the recognition experiments reported in this paper. The authors would also like to thank the anonymous reviewers for their valuable comments and suggestions.

    REFERENCES

[1] G. Evermann, H. Y. Chan, M. J. F. Gales, T. Hain, X. Liu, D. Mrva, L. Wang, and P. C. Woodland, "Development of the 2003 CU-HTK conversational telephone speech transcription system," in Proc. ICASSP '04, Montreal, QC, Canada, May 2004, pp. 249-252.
[2] A. Sixtus, S. Molau, S. Kanthak, R. Schlüter, and H. Ney, "Recent improvements of the RWTH large vocabulary speech recognition system on spontaneous speech," in Proc. ICASSP '00, Istanbul, Turkey, Jun. 2000, pp. 1671-1674.
[3] G. Zavaliagkos, J. McDonough, D. Miller, A. El-Jaroudi, J. Billa, F. Richardson, K. Ma, M. Siu, and H. Gish, "The BBN Byblos 1997 large vocabulary conversational speech recognition system," in Proc. ICASSP '98, Seattle, WA, May 1998, pp. 905-908.
[4] J. L. Gauvain, L. Lamel, H. Schwenk, G. Adda, L. Chen, and F. Lefevre, "Conversational telephone speech recognition," in Proc. ICASSP '03, Hong Kong, Apr. 2003, pp. 212-215.
[5] A. Stolcke, H. Bratt, J. Butzberger, H. Franco, V. R. R. Gadde, M. Plauché, C. Richey, E. Shriberg, K. Sönmez, F. Weng, and J. Zheng, "The SRI March 2000 Hub-5 conversational speech transcription system," in Proc. NIST Speech Transcription Workshop, 2000.
[6] Y. Gao, Y. Li, V. Goel, and M. Picheny, "Recent advances in speech recognition system for IBM DARPA communicator," in Proc. Eurospeech '01, Aalborg, Denmark, Sep. 2001.
[7] A. Andreou, T. Kamm, and J. Cohen, "Experiments in vocal tract normalization," in Proc. CAIP Workshop: Frontiers in Speech Recognition II, 1994.
[8] L. Lee and R. Rose, "Frequency warping approach to speaker normalization," IEEE Trans. Speech Audio Process., vol. 6, no. 1, pp. 49-59, Jan. 1998.
[9] A. Acero and R. M. Stern, "Robust speech recognition by normalization of the acoustic space," in Proc. ICASSP '91, Toronto, ON, Canada, May 1991, pp. 893-896.
[10] A. Acero, "Acoustical and environmental robustness in automatic speech recognition," Ph.D. dissertation, Carnegie Mellon Univ., Pittsburgh, PA, 1990.
[11] J. McDonough, W. Byrne, and X. Luo, "Speaker normalization with all-pass transforms," in Proc. ICSLP '98, Sydney, Australia, Nov. 1998.
[12] M. Pitz and H. Ney, "Vocal tract normalization equals linear transformation in cepstral space," IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 930-944, Sep. 2005.
[13] M. Pitz, S. Molau, R. Schlüter, and H. Ney, "Vocal tract normalization equals linear transformation in cepstral space," in Proc. Eurospeech '01, Aalborg, Denmark, Sep. 2001.
[14] S. Molau, M. Pitz, R. Schlüter, and H. Ney, "Computing Mel-frequency cepstral coefficients on the power spectrum," in Proc. ICASSP '01, Salt Lake City, UT, May 2001, pp. 73-76.
[15] M. Pitz, "Investigations on linear transformations for speaker adaptation and normalization," Ph.D. dissertation, RWTH Aachen, Aachen, Germany, Mar. 2005.
[16] T. Claes, I. Dologlou, L. Bosch, and D. van Compernolle, "A novel feature transformation for vocal tract length normalisation in automatic speech recognition," IEEE Trans. Speech Audio Process., vol. 6, no. 6, pp. 549-557, Nov. 1998.
[17] S. Cox, "Speaker normalization in the MFCC domain," in Proc. ICSLP '00, Beijing, China, Oct. 2000.
[18] D. Y. Kim, S. Umesh, M. J. F. Gales, T. Hain, and P. C. Woodland, "Using VTLN for broadcast news transcription," in Proc. Interspeech '04, Jeju Island, Korea, Sep. 2004.
[19] X. Cui and A. Alwan, "Adaptation of children's speech with limited data based on formant-like peak alignment," Comput. Speech Lang., vol. 20, no. 4, pp. 400-419, Oct. 2006.
[20] D. R. Sanand, D. D. Kumar, and S. Umesh, "Linear transformation approach to VTLN using dynamic frequency warping," in Proc. Interspeech '07, Antwerp, Belgium, Aug. 2007.
[21] S. Umesh, A. Zolnay, and H. Ney, "Implementing frequency warping and VTLN through linear transformation of conventional MFCC," in Proc. Interspeech '05, Lisbon, Portugal, Sep. 2005.
[22] S. Panchapagesan, "Frequency warping by linear transformation of standard MFCC," in Proc. Interspeech '06, Pittsburgh, PA, Sep. 2006.
[23] S. Panchapagesan and A. Alwan, "Frequency warping for VTLN and speaker adaptation by linear transformation of standard MFCC," Comput. Speech Lang., vol. 23, no. 1, pp. 42-64, Jan. 2009.
[24] D. R. Sanand and S. Umesh, "Study of Jacobian compensation using linear transformation of conventional MFCC for VTLN," in Proc. Interspeech '08, Brisbane, Australia, Sep. 2008.
[25] D. R. Sanand, R. Schlüter, and H. Ney, "Revisiting VTLN using linear transformation on conventional MFCC," in Proc. Interspeech '10, Makuhari, Japan, Sep. 2010.
[26] J. Lööf, H. Ney, and S. Umesh, "VTLN warping factor estimation using accumulation of sufficient statistics," in Proc. ICASSP '06, Toulouse, France, May 2006, pp. 1201-1204.
[27] P. T. Akhil, S. P. Rath, S. Umesh, and D. R. Sanand, "A computationally efficient approach to warp factor estimation in VTLN using EM algorithm and sufficient statistics," in Proc. Interspeech '08, Brisbane, Australia, Sep. 2008.
[28] S. P. Rath and S. Umesh, "Acoustic class specific VTLN-warping using regression class trees," in Proc. Interspeech '09, Brighton, U.K., Sep. 2009.
[29] C. Breslin, K. Chin, M. Gales, K. Knill, and H. Xu, "Prior information for rapid speaker adaptation," in Proc. Interspeech '10, Makuhari, Japan, Sep. 2010.
[30] L. Saheer, J. Dines, P. N. Garner, and H. Liang, "Implementation of VTLN for statistical speech synthesis," in Proc. ISCA Speech Synthesis Workshop, Sep. 2010.
[31] S. Wegmann, D. McAllaster, J. Orloff, and B. Peskin, "Speaker normalization on conversational telephone speech," in Proc. ICASSP '96, Atlanta, GA, May 1996, pp. 339-341.
[32] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proc. ICSLP '92, Banff, AB, Canada, Oct. 1992.
[33] J. Lööf, M. Bisani, C. Gollan, G. Heigold, B. Hoffmeister, C. Plahl, R. Schlüter, and H. Ney, "The 2006 RWTH parliamentary speeches transcription system," in Proc. Interspeech '06, Barcelona, Spain, Jun. 2006.
[34] R. Leonard, "A database for speaker-independent digit recognition," in Proc. ICASSP '84, San Diego, CA, Mar. 1984, pp. 328-331.
[35] N. Parihar and J. Picone, "DSR front end LVCSR evaluation," Tech. Rep. AU/384/02, Mississippi State University, Mississippi State, MS, Dec. 2002.
[36] D. Rybach, C. Gollan, G. Heigold, B. Hoffmeister, J. Lööf, R. Schlüter, and H. Ney, "The RWTH Aachen University open source speech recognition system," in Proc. Interspeech '09, Brighton, U.K., Sep. 2009.
[37] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for HTK Version 3.4). Cambridge, U.K.: Cambridge Univ. Eng. Dept., 2006.
[38] L. Welling, S. Kanthak, and H. Ney, "Improved methods for vocal tract normalization," in Proc. ICASSP '99, Phoenix, AZ, Mar. 1999, pp. 761-764.

D. R. Sanand received the Ph.D. degree in electrical engineering from the Indian Institute of Technology, Kanpur, in 2010.

From 2009 to 2010, he was a Postdoctoral Researcher in the Department of Computer Science, RWTH Aachen University, Aachen, Germany, and later in the Department of Information and Computer Science, Aalto University, Espoo, Finland, until 2011. Currently, he is a Postdoctoral Researcher in the Department of Electronics and Telecommunications, Norwegian University of Science and Technology, Trondheim, Norway. His research interests include speech recognition, synthesis, and biomedical signal processing.

S. Umesh received the Ph.D. degree in electrical engineering from the University of Rhode Island, Kingston, in 1993.

From 1993 to 1996, he was a Postdoctoral Fellow at the City University of New York. From 1996 to 2009, he was with the Indian Institute of Technology, Kanpur, first as an Assistant Professor and then as a Professor of electrical engineering. Since 2009, he has been with the Indian Institute of Technology, Madras, where he is a Professor of electrical engineering. He has also been a Visiting Researcher at AT&T Research Laboratories, the Machine Intelligence Laboratory, Cambridge University, U.K., and the Department of Computer Science (Lehrstuhl für Informatik VI), RWTH Aachen, Germany. His recent research interests have been mainly in the areas of speaker normalization and noise robustness and their application in large-vocabulary continuous speech recognition systems. He has also worked in the areas of statistical signal processing and time-varying spectral analysis.

Dr. Umesh was a recipient of the Indian AICTE Career Award for Young Teachers and the Alexander von Humboldt Research Fellowship.