IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 5, JULY 2012 1573
VTLN Using Analytically Determined Linear-Transformation on Conventional MFCC
D. R. Sanand and S. Umesh
Abstract: In this paper, we propose a method to analytically obtain a linear-transformation on conventional Mel frequency cepstral coefficient (MFCC) features that corresponds to conventional vocal tract length normalization (VTLN)-warped MFCC features, thereby simplifying VTLN processing. There have been many attempts to obtain such a linear-transformation, but in all previously proposed approaches either the signal processing is modified (and is therefore not conventional MFCC), or the linear-transformation does not correspond to conventional VTLN-warping, or the matrices are estimated from data and are therefore data dependent. In short, the conventional VTLN part of an automatic speech recognition (ASR) system cannot simply be replaced with any of the previously proposed methods. Umesh et al. proposed the idea of using band-limited interpolation for performing VTLN-warping on MFCC using plain cepstra. Motivated by this work, Panchapagesan and Alwan proposed a linear-transformation to perform VTLN-warping on conventional MFCC. However, in their approach, VTLN-warping is specified in the Mel-frequency domain and is not equivalent to conventional VTLN. In this paper, we present an approach which also draws inspiration from the work of Umesh et al., and which we believe for the first time performs conventional VTLN as a linear-transformation on conventional MFCC using the ideas of band-limited interpolation. Deriving such a linear-transformation to perform VTLN allows us to use the VTLN-matrices in a transform-based adaptation framework with its associated advantages, while still requiring the estimation of only a single parameter. Using four different tasks, we show that our proposed approach has almost identical recognition performance to conventional VTLN on both clean and noisy speech data.

Index Terms: Automatic speech recognition (ASR), linear-transformation, Mel frequency cepstral coefficient (MFCC), speaker normalization, vocal tract length normalization (VTLN).
I. INTRODUCTION
INTER-SPEAKER variability is a major source of perfor-
mance degradation in speaker-independent (SI) automatic
speech recognition (ASR) systems. Most state-of-the-art sys-
tems now incorporate vocal-tract length normalization (VTLN)
as an integral part of the system to reduce inter-speaker variability and hence improve the recognition performance [1]–[6].
Manuscript received December 27, 2010; revised July 08, 2011 and January 15, 2012; accepted January 23, 2012. Date of publication January 31, 2012; date of current version March 21, 2012. This work was done while the authors were at the Department of Electrical Engineering, Indian Institute of Technology, Kanpur. This work was supported in part by the Department of Science and Technology, Ministry of Science and Technology, India, under SERC project SR/S3/EECE/058/2008. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Steve Renals.
D. R. Sanand is with the Norwegian University of Science and Technology,NO-7491 Trondheim, Norway (e-mail: [email protected]).
S. Umesh is with the Department of Electrical Engineering, Indian Institute of Technology Madras, Chennai 600036, India (e-mail: [email protected]).
Digital Object Identifier 10.1109/TASL.2012.2186289
VTLN performs speaker normalization by reducing the variabil-
ities in the spectra of speech signals that arise due to differences
in the vocal tract lengths (VTL) of speakers uttering the same
sound [7]. The normalization is achieved by either compressing
or expanding the speech spectrum and is usually referred to as
scaling. This scaling is usually specified through a mathematical relation of the type $\omega' = \phi(\omega)$, where $\omega'$ is the warped-frequency and $\phi(\cdot)$ is the frequency-warping function. It is commonly assumed that the spectra of different speakers uttering the same sound are linearly scaled versions of one another [7], [8], i.e., $\phi(\omega) = \alpha\omega$. We would like to make it clear to the reader that, though the discussion in this paper assumes linear-scaling of the spectra, the methods developed in this paper can be applied to any arbitrary warping function. VTLN requires the estimation of only a single parameter, the warp-factor $\alpha$, for
normalization and hence requires very little acoustic data unlike
adaptation based methods (e.g., MLLR and CMLLR). However,
the practical implementation of conventional VTLN follows a
maximum likelihood (ML) based grid search over a pre-defined
range of warping-factors. This requires the features to be gen-
erated for all the warp-factors after appropriate modification of
the spectra. The ML estimate of the warp-factor is then found
by evaluating the likelihood of the warped features with respect
to the acoustic model, $\lambda$, and the transcription, $W$, and is given by

$\hat{\alpha} = \arg\max_{\alpha} \Pr(X^{\alpha} \mid \lambda, W) \qquad (1)$

where $X^{\alpha}$ consists of static features obtained after frequency-warping the spectra by warp-factor $\alpha$, appended with differential and acceleration coefficients. In some systems, linear
discriminant analysis (LDA) is applied over a window of such
warped consecutive frames to account for dynamic variations
before obtaining the final feature-vector.
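The grid search of (1) can be sketched as follows. The feature "warping" and the single-Gaussian scorer below are toy stand-ins (a real system re-extracts features for each warp-factor and scores them against the HMM acoustic model), and the log-Jacobian term is an artifact of the simplified warp, added so the toy objective has an interior maximum:

```python
import numpy as np

def log_likelihood(features):
    # Toy stand-in for the acoustic-model score Pr(X^alpha | lambda, W):
    # a single zero-mean, unit-variance Gaussian.
    return float(np.sum(-0.5 * (np.log(2 * np.pi) + features ** 2)))

def warp_features(features, alpha):
    # Hypothetical "warping": a real system would re-extract features from
    # spectra warped by alpha, or apply a precomputed matrix A^alpha.
    return features / alpha

def estimate_warp_factor(features, grid=None):
    # ML grid search of (1) over the conventional 0.80..1.20 range.
    if grid is None:
        grid = np.arange(0.80, 1.21, 0.02)
    scores = {}
    for a in grid:
        # The -N log(alpha) Jacobian term keeps this toy objective from
        # degenerating; it is specific to the simplified warp above.
        scores[round(float(a), 2)] = (log_likelihood(warp_features(features, a))
                                      - len(features) * np.log(a))
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
feats = 1.1 * rng.standard_normal(2000)   # simulated speaker, true alpha ~ 1.1
alpha_hat = estimate_warp_factor(feats)
```

The search recovers a warp-factor near the simulated value; the cost of evaluating every grid point is what the linear-transformation approach later reduces.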
Recently there has been a lot of interest in obtaining a direct linear-transformation between static conventional Mel frequency cepstral coefficient (MFCC) features $c$ and the static VTLN-warped MFCC $c^{\alpha}$, i.e.,

$c^{\alpha} = A^{\alpha} c \qquad (2)$

where $A^{\alpha}$ represents a matrix transformation.
One of the early attempts to obtain a linear-transformation
(LT) on the cepstra for speaker-normalization was by Acero
et al. [9], [10]. They showed that the warped cepstral coefficients can be obtained at the outputs of a bank of filters at time zero, by formulating the bilinear transform as a linear filtering operation with the time-reversed cepstrum sequence as the input. McDonough et al. [11] proposed a linear transformation using generalizations of the bilinear transform known as
all-pass transforms. The derivations were based on the argument
that frequency-warping functions, $\phi(\omega)$, used in most VTLN
methods can be approximated to a reasonable degree by the bi-
linear transform. Pitz et al. [12], [13] argued that a linear-trans-
formation of cepstra can be obtained for any arbitrary invert-
ible warping function. However, their derivations were made
using the modified signal processing approach discussed in [14], which does not include filter-bank smoothing during the feature
extraction. The cepstra are assumed to be inverse discrete-time
Fourier transform (IDTFT) coefficients of the log power spec-
trum (without Mel warping) to derive the cepstral linear-trans-
formation. Pitz states in his thesis [15] that inclusion of Mel-warping makes the transformation highly non-linear and could
not be solved analytically. There have been other attempts to
obtain an approximation to the linear-transformation including
the work of Claes et al. [16], where the linear-transformation
was derived using the average third formant information. Cox
[17] presented a model based approach for VTLN that performs
transformation on MFCC features. Kim et al. [18] estimated the linear-transformation using the ideas of constrained maximum likelihood linear regression (CMLLR) from training data. Cui and Alwan [19] derived a mapping matrix using formant-like peaks, which can be seen as a special case of [16]. Sanand
et al. [20] derived a linear-transformation using the idea of dy-
namic frequency warping, where the mapping is learnt from
the data. It is important to note that in all these methods, ei-
ther the signal processing is changed (and therefore not conven-
tional MFCC), or the linear-transformation does not correspond
to conventional VTLN-warping, or the matrices are estimated
and hence are dependent on the database. Therefore, the con-
ventional VTLN part of an ASR system cannot be simply re-
placed with any of the methods described above.

Umesh et al. [21] proposed the idea of using band-limited interpolation to derive a linear-transformation for obtaining VTLN-warped MFCC that performs both Mel- and VTLN-warping on plain cepstra. Motivated by this work,
Panchapagesan and Alwan [22], [23] proposed an approach
to incorporate VTLN-warping into the inverse discrete cosine
transform (DCT) transformation to obtain a linear-transfor-
mation of the type shown in (2). We refer to this approach
as Cosine-interpolation in this paper. It is important to note that VTLN-warping, $\phi(\cdot)$, in [22], [23] is performed in the Mel-frequency domain and is not exactly equivalent to con-
ventional VTLN frequency-warping. This may be important in
cases where the warping function is specified in the frequency
(Hz) domain based on physiological arguments.
In this paper, we present an approach which, we believe, for the first time performs conventional frequency-warping, $\phi(\omega)$, as a linear-transformation on conventional MFCC using the ideas of band-limited interpolation. We refer to this approach as Sinc-interpolation in this paper. The goal is to analytically obtain the linear-transformation $A^{\alpha}$ of (2) given $\phi(\omega)$. The
proposed method does not modify any aspect of the conven-
tional MFCC computation including the use of Mel filter-bank
smoothing as well as discrete cosine transform (DCT)-II. A part
of this work has already been presented in [20], [24], and [25].
A major advantage in obtaining a linear-transformation in the framework of (2) is that the VTLN-warped cepstral features
Fig. 1. Steps involved in generating conventional MFCC features.
need not be computed for each $\alpha$ by first frequency-warping the
spectra and then computing the corresponding VTLN-warped
cepstra. Instead, the VTLN-warped cepstra can be directly
obtained from static conventional MFCC features through
a matrix transformation. It can be easily shown that the dy-
namic coefficients of the warped features would also be related
through the same transformation in this case. Another advan-
tage of such an approach is that, these matrices can be viewed as
feature transformation matrices similar to CMLLR, but are pre-
computed rather than estimated from data, requiring very little
adaptation data for optimal selection of $\alpha$. The use of such matrices also enables the warp-factors to be estimated by accumulating sufficient statistics, thereby simplifying the procedure for optimal warp-factor estimation [26], [27] and reducing
the computational complexity by 75%. Further, VTLN matrices
can be used in regression tree framework to perform VTLN at
acoustic class level, allowing estimation of multiple warp-fac-
tors for a single utterance [28] which is very difficult to imple-
ment in conventional VTLN framework. Finally, there is a pos-
sibility of using these VTLN matrices as base matrices for adap-
tation until sufficient data is available to obtain a robust estimate
of the adaptation (MLLR/CMLLR) matrix [29]. Recently, there
is also interest in using VTLN in the transform-based approach
for statistical speech synthesis [30].

The paper is organized as follows. In Section II, we present
how VTLN is performed in practice and discuss the limita-
tions in formulating the problem as a linear-transformation. In
Section III, we present our idea of performing VTLN and show
that a matrix transformation can be formulated on conventional
MFCC to obtain VTLN-warped MFCC. Section IV presents
our setup for performing the speech recognition experiments
along with a description of the databases used in our experiments.
In Section V, we discuss the differences between the proposed
and the Cosine-interpolation approaches for VTLN. Finally,
we present the recognition results to show that the proposed
approach has performance comparable to conventional VTLN.
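The claim above that the dynamic coefficients are related through the same transformation can be checked numerically: delta features are linear combinations across frames, while the per-frame matrix acts within each frame, so the two operations commute. A minimal sketch, with a random matrix standing in for a precomputed VTLN matrix $A^{\alpha}$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((16, 16))          # stand-in for a VTLN matrix A^alpha
frames = rng.standard_normal((10, 16))     # static cepstra, one row per frame

def deltas(x):
    # Simple two-frame difference; any linear regression over frames
    # (as used for differential coefficients) behaves the same way.
    return x[2:] - x[:-2]

# Warping then differencing equals differencing then warping, because the
# delta operator acts across frames and A^alpha acts within a frame.
lhs = deltas(frames @ A.T)
rhs = deltas(frames) @ A.T
```

Hence a single precomputed matrix per warp-factor also transforms the appended dynamic coefficients.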
II. IMPLEMENTATION OF CONVENTIONAL VTLN
Conventional MFCC feature extraction, which does not
include VTLN-warping, is usually implemented as shown in
Fig. 1. Let $S$ represent the power or magnitude spectrum of a frame of speech. Let $F$ represent the filter-bank smoothing operation along with Mel-warping, which can be represented through a linear-transformation matrix. Further, let $C$ represent the DCT transformation, which is also linear. The static MFCC features, $c$, are obtained by applying the Mel-warped filter-bank to the power spectrum of the speech signal, followed by applying a logarithm to the amplitudes of the output of the
Fig. 2. Conventional framework for generating warped features in VTLN. The filter-bank is inversely scaled instead of re-sampling the speech signal for each warp-factor, for efficient implementation.
filter-bank and finally a DCT transformation. All the operations
can be written mathematically as
$c = C \log(F S) \qquad (3)$
The DCT matrix is given by

$C_{ij} = \beta_i \cos\left(\frac{\pi\, i\, (j + 0.5)}{N}\right), \quad i = 0, \dots, M-1, \; j = 0, \dots, N-1 \qquad (4)$

and the scaling factor is defined as $\beta_i = \sqrt{1/N}$ for $i = 0$ and $\beta_i = \sqrt{2/N}$ otherwise. Here, $N$ is the number of filters used in the Mel filter-bank and $M$ is the number of cepstral coefficients.
As an illustration, let the speech frame consist of 320 samples. A 512-point DFT is applied to obtain the 256-dimensional vector $S$ whose elements are the magnitudes of the DFT coefficients for one-half of the spectrum. This is because the magnitude spectrum has even-symmetry. If 20-filter Mel filter-bank smoothing is applied, then $F$ is a 20 × 256 matrix that operates on $S$ to obtain the Mel-warped smoothed spectrum. $C$ is the 20 × 20 DCT matrix applied to the log-compressed Mel-warped smoothed spectrum to obtain the MFCC feature vector $c$. In practice, only the first 16 cepstral coefficients are used, and one may use a 16 × 20 DCT transformation.
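The illustration above can be sketched end-to-end. The triangular filter-bank below uses one common construction (HTK-style edge placement); normalization details vary between front-ends, so treat this as an assumption rather than the paper's exact front-end:

```python
import numpy as np

def mel(f):     return 2595.0 * np.log10(1.0 + f / 700.0)
def inv_mel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, n_bins=256, fs=16000):
    # Triangular filters with edges uniform on the Mel scale (HTK-style);
    # a sketch of F, not a specific toolkit's exact matrix.
    edges = inv_mel(np.linspace(0.0, mel(fs / 2), n_filters + 2))
    bins = np.floor((2 * n_bins) * edges / fs).astype(int)   # 512-pt DFT -> 256 bins
    F = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for b in range(lo, c):  F[i, b] = (b - lo) / max(c - lo, 1)   # rising edge
        for b in range(c, hi):  F[i, b] = (hi - b) / max(hi - c, 1)   # falling edge
    return F

def dct_matrix(n_ceps=16, n_filters=20):
    # DCT-II as in (4), with the usual orthonormalizing scale factors.
    i = np.arange(n_ceps)[:, None]
    j = np.arange(n_filters)[None, :]
    beta = np.where(i == 0, np.sqrt(1.0 / n_filters), np.sqrt(2.0 / n_filters))
    return beta * np.cos(np.pi * i * (j + 0.5) / n_filters)

# c = C log(F S) for one frame, matching (3) and the dimensions in the text.
S = np.abs(np.fft.rfft(np.random.default_rng(2).standard_normal(320), 512))[:256]
F, C = mel_filterbank(), dct_matrix()
c = C @ np.log(F @ S + 1e-10)   # small floor guards against log(0)
```

The shapes reproduce the text: $F$ is 20 × 256, $C$ is 16 × 20, and $c$ has 16 coefficients.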
VTLN features are obtained in the original method of Andreou et al. [7] by frequency-warping the magnitude spectra before applying the unwarped Mel filter-bank. This is done by re-sampling the signal. Therefore, in this case the signal is warped for each VTLN warp-factor, while the Mel filter-bank is left unchanged. Lee and Rose [8] proposed an efficient alternate implementation, where the Mel filter-bank is inverse-scaled for each $\alpha$, while the signal spectrum is left unchanged, as shown in Fig. 2. This is the most popular method of VTLN-warping. Therefore, in the Lee–Rose method, VTLN-warping is integrated into the Mel filter-bank, and $F^{\alpha}$ denotes the (inverse) VTLN-warped Mel filter-bank. Conventionally, the warp-factor, $\alpha$, used for warping the spectra is in the range of
Fig. 3. Illustrating the change in the filter-bank structure with VTLN-warping in the linear-frequency (Hz) domain. The filters have nonuniform center frequencies with nonuniform bandwidths.
Fig. 4. The piece-wise linear warping function used in conventional VTLN, motivated by physiological arguments. The slope of the warping function is changed at $\omega_0$ to avoid bandwidth mismatch after frequency scaling.
0.80 to 1.20 based on physiological arguments. For each $\alpha$, the center frequencies and bandwidths of the Mel filter-bank are appropriately scaled to obtain the Mel- and VTLN-warped smoothed spectra [8]. The change in the filter-bank structure for different warp-factors is illustrated in Fig. 3. The slope of the last filter has been modified appropriately using piece-wise linear warping [31], so that the Nyquist frequency maps onto itself after frequency scaling. This avoids the bandwidth mismatch that arises due to frequency warping. The piece-wise linear warping function used in our experiments is given by

$\phi(\omega) = \alpha\omega, \quad 0 \le \omega \le \omega_0 \qquad (5)$

$\phi(\omega) = \alpha\omega_0 + \left(\frac{\omega_{\max} - \alpha\omega_0}{\omega_{\max} - \omega_0}\right)(\omega - \omega_0), \quad \omega_0 < \omega \le \omega_{\max} \qquad (6)$

and is shown in Fig. 4. $\omega_0$ represents the cutoff frequency where the slope is changed and $\omega_{\max}$ is the Nyquist frequency. Although piece-wise linear warping is the most commonly used
frequency-warping, motivated by the physiological argument that changes in VTL manifest as spectral-scaling, the methods developed in this paper can be applied to any arbitrary warping function.
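The piece-wise linear function of (5) and (6) can be transcribed directly; the cutoff and Nyquist values below are illustrative choices, not prescribed by the paper:

```python
def piecewise_linear_warp(omega, alpha, omega0, omega_max):
    # (5): linear scaling by alpha up to the cutoff omega0.
    if omega <= omega0:
        return alpha * omega
    # (6): the second segment is chosen so the Nyquist frequency
    # omega_max maps onto itself, avoiding bandwidth mismatch.
    slope = (omega_max - alpha * omega0) / (omega_max - omega0)
    return alpha * omega0 + slope * (omega - omega0)

fmax, f0 = 8000.0, 7000.0   # illustrative Nyquist and cutoff frequencies (Hz)
```

By construction the two segments meet at $\omega_0$ and the endpoint $\omega_{\max}$ is a fixed point for every $\alpha$.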
The warped cepstral features are given by

$c^{\alpha} = C \log(F^{\alpha} S) \qquad (7)$
These are obtained by first warping and smoothing the power
spectrum, followed by log and the DCT operations. The filter-
bank is integrated with both Mel- and VTLN-warping, to per-
form smoothing as well as scaling of the spectrum. Observing
(3) and (7), it is clear that the only difference between the con-
ventional and VTLN-warped MFCC features is the change in
the filter-bank structure, while the rest of the operations are the
same.
For the case of $\alpha = 1$, $F^{\alpha}$ exactly corresponds to the case of conventional MFCC without VTLN-warping. From (3) and (7), the relation between $c$ and $c^{\alpha}$ is given as

$c^{\alpha} = C \log\big(F^{\alpha} F^{-1} \exp(C^{-1} c)\big) \qquad (8)$
A linear-transformation between $c$ and $c^{\alpha}$ can be derived if all the intermediate operations can be represented as linear operations, but from (8) it is evident that log is a nonlinear operation and that in practice $F^{-1}$ does not exist. This is because the power-spectrum cannot be completely reconstructed from the filter-bank outputs, owing to the smoothing operation [16].
We need to obtain $S$, since conventional VTLN warping relations are always specified in the linear-frequency (Hz) domain, usually through a mathematical relation of the type $\omega' = \phi(\omega)$, where $\omega'$ is the warped-frequency and $\phi(\cdot)$ is the frequency-warping function. Therefore, in this case, it is not possible to completely recover $S$ from the filter-bank output, and hence a linear-transformation is not possible.
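The non-invertibility of $F$ can be seen concretely: any wide (here 20 × 256) smoothing matrix has a large null space, so distinct spectra produce identical filter-bank outputs. A small numerical sketch, with a random nonnegative matrix standing in for the filter-bank:

```python
import numpy as np

rng = np.random.default_rng(3)
F = np.abs(rng.standard_normal((20, 256)))   # stand-in 20x256 filter-bank
S1 = np.abs(rng.standard_normal(256)) + 1.0  # one power spectrum

# Any vector in the null space of F changes the spectrum but not the
# filter-bank output, so S cannot be recovered from F S (rank(F) <= 20).
_, _, Vt = np.linalg.svd(F)
null_dir = Vt[-1]                            # a null-space direction of F
S2 = S1 + 0.1 * null_dir                     # a different, still-positive spectrum
```

Both spectra map to the same 20-dimensional output, which is exactly why (8) cannot be evaluated in practice.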
In the next section, we show that separating the frequency-
warping operation from the filter-bank avoids the need to invert
the filter-bank operation or the logarithm and allows us to derive
a linear transformation on conventional MFCC.
III. REALIZING A LINEAR-TRANSFORMATION
In this section, we show that separating the VTLN-warping (speaker scaling) from the Mel filter-bank helps us to derive a linear-transformation (LT) between warped and unwarped cepstral features within the conventional MFCC framework. Let $g = \log(F S)$ be the log-compressed Mel-warped filter-bank output. From (3), we see that knowledge of $c$ implies knowledge of $g$, as they form a DCT pair, i.e.,

$c = C g, \qquad g = C^{-1} c \qquad (9)$

However, we cannot completely recover $S$ from $g$ because of the filter-bank smoothing operation. Since $S$ cannot be completely recovered, we re-frame the problem as follows: $g^{\alpha} = \log(F^{\alpha} S)$ can be obtained by applying a linear-transformation on $g$ without recovering $S$, i.e.,

$g^{\alpha} = T^{\alpha} g \qquad (10)$
Fig. 5. Modification in the signal processing steps (separating the Mel- and VTLN-warping) for realizing a linear-transformation. The filter-bank performs only Mel-warping of the spectra and the proposed band-limited interpolation matrix performs the VTLN-warping.
where $T^{\alpha}$ is the transformation applied on $g$ to obtain $g^{\alpha}$. The above equation states that the filter-bank, $F$, performs only Mel-warping and the transformation $T^{\alpha}$ performs VTLN-warping. This means that the VTLN-warping integrated into the filter-bank for efficient implementation in the conventional approach [8] is now performed separately and is not a part of the filter-bank construction. This is illustrated in Fig. 5. If such a relation can be obtained, then from (3) and (7), the relation between $c$ and $c^{\alpha}$ is given by

$c^{\alpha} = C\, T^{\alpha}\, C^{-1} c \qquad (11)$
By defining a LT between $g$ and $g^{\alpha}$, we completely avoid the inversion of the filter-bank for obtaining the raw magnitude spectrum $S$ and also bypass the $\log(\cdot)$ operation. We would like to remind the reader that the VTLN-warping relation is usually specified in the linear-frequency (Hz) domain, and therefore, at this point it is not clear what the relation between $g$ and $g^{\alpha}$ should be. In the next subsection, we describe a method to obtain a LT using the idea of band-limited interpolation.
A. Band-Limited (Sinc-) Interpolation
For a band-limited continuous-time signal, , given uni-
formly spaced samples of the signal that are appropriately sam-
pled, i.e., , we can exactly reconstruct the original contin-
uous-time signal. This implies that we can recover the values
of the time signal at time-instants otherthan those at the uni-
formly spaced samples. We use this idea to obtain the LT for
VTLN-warping, except now we consider que-frency limited sig-
nals, instead of frequency-limited signals.
$g$ can be obtained either by applying a nonuniform filter-bank (shown in Fig. 3) to the linear-frequency (Hz) magnitude spectrum or by applying a uniformly spaced filter-bank (shown in Fig. 6) to the Mel-warped magnitude spectrum. Therefore, in the Mel-frequency domain, the continuous Mel-warped log-compressed spectrum, $g(\mu)$, can be interpreted as the output of convolving a triangle function with the Mel-warped magnitude spectrum, followed by a log operation on the amplitudes. We can think of the vector $g$ as being obtained by uniformly sampling $g(\mu)$ at $\mu_k$, $k = 0, \dots, N-1$, where the positions of these samples exactly correspond to the center frequencies of the filter-bank. Because of the triangle smoothing and the subsequent $\log$-operation on the output (which reduces dynamic range), the que-frency content of this $\log$-compressed smoothed spectrum lies only in the low que-frency region. Fig. 7 compares the cepstral coefficients obtained with and without filter-bank smoothing. We see that the cepstral coefficients die
Fig. 6. The change in the filter-bank structure with VTLN-warping in the Mel-frequency domain is illustrated. The filters have uniformly spaced center frequencies with uniform bandwidth for $\alpha = 1$. However, they are nonuniformly spaced for $\alpha$ different from unity.
Fig. 7. The effect of filter-bank smoothing on the cepstral coefficients is illustrated. Filter-bank smoothing helps limit the que-frency content to the lower region, ensuring que-frency limitedness.
down faster with filter-bank smoothing, indicating that the que-frency content is limited to the low que-frency region. During VTLN-warping, the filter center frequencies are appropriately scaled in the linear-frequency (Hz) domain by inverse-$\alpha$, as described in Lee–Rose [8]. This corresponds to the center frequencies of the filter-bank being non-uniformly spaced in the Mel-frequency domain, as shown in Fig. 6. As we represent the log-compressed Mel-warped smoothed magnitude spectrum by the continuous function $g(\mu)$, the output of the VTLN-warped filter-bank corresponds to sampling $g(\mu)$ nonuniformly, i.e., at $\mu'_k$. These nonuniformly spaced samples exactly correspond to the elements of the vector $g^{\alpha}$.

From the above discussion, we point out that the elements of the vector $g$ (i.e., $g(\mu_k)$) can be interpreted as uniformly spaced samples and the elements of $g^{\alpha}$ (i.e., $g(\mu'_k)$) as nonuniformly spaced samples of the same continuous function $g(\mu)$. The main idea is that, given the samples in $g$, the samples (or elements) in $g^{\alpha}$ can be reconstructed using band-limited interpolation, provided that the cepstrum is que-frency limited.
Let $g(\mu)$ and $q(n)$ form a discrete-time Fourier transform (DTFT) pair. Then sampling $g(\mu)$ would result in periodic repetition of $q(n)$. As long as $q(n)$ is strictly que-frency limited and the sampling rate is sufficiently high, there is no aliasing in the cepstral domain. In such a case, the value of $g(\mu)$ at any Mel-frequency can be found from its uniformly-spaced samples at $\mu_k$ through band-limited interpolation. This is basically exploiting the sampling theorem, where a signal (in this case a frequency-domain signal) can be reconstructed from its samples using Sinc-interpolation. The continuous function $g(\mu)$ is nowhere used for any calculation purposes and is presented here only for a better understanding of the derivation of the band-limited interpolation matrix.
Note that que-frency limitedness ensures that there is no overlap in the periodic repetition of $q(n)$ (i.e., no aliasing), and hence $g(\mu)$ can be exactly recovered. The que-frency limitedness property depends both on the amount of smoothing done by the Mel-filters (which controls the number of significant cepstral coefficients) and on the number of Mel-filters, which determines the periodicity. If there is aliasing, there will be differences between the Sinc-interpolated and the actual values. Since our effort in this paper is to use conventional MFCC processing, both of these parameters are already fixed by the feature extraction stage. However, as we will show later, even using conventional MFCC processing there is very little difference between interpolated and true values.
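The band-limited interpolation being relied on here can be demonstrated directly: uniform samples of a suitably band-limited signal determine its value at off-grid points via a sinc-weighted sum (truncated to a finite window here, so the agreement is approximate):

```python
import numpy as np

# A band-limited test signal sampled on the integers: np.sinc(0.5 t) has
# bandwidth 0.25 cycles/sample, comfortably below the Nyquist limit of 0.5.
n = np.arange(-400, 401)
x = np.sinc(0.5 * n)

def sinc_interp(samples, grid, t):
    # Band-limited interpolation: x(t) = sum_n x[n] sinc(t - n),
    # truncated to the available samples.
    return float(np.dot(samples, np.sinc(t - grid)))

x_offgrid = sinc_interp(x, n, 0.3)        # reconstruct between the samples
true_value = float(np.sinc(0.5 * 0.3))
```

The same mechanism, applied to the que-frency limited $g(\mu)$ over the Mel-frequency axis, yields the warped samples $g(\mu'_k)$ from the uniform samples $g(\mu_k)$.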
The steps to obtain the transformation matrix are as follows.

1) Let $\mu_k$, $k = 0, \dots, N-1$, represent the uniformly-spaced Mel-frequencies, with the samples of $g(\mu)$ at these points being the elements of vector $g$. Their corresponding linear-frequencies (Hz) are nonuniformly spaced and are represented by $f_k$, $k = 0, \dots, N-1$. These are the center frequencies of the Mel-filters in the linear-frequency (Hz) domain and are related through the standard Mel-relation, i.e.,

$\mu = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (12)$

2) During VTLN-warping, the warping function $\phi(f)$ is applied to obtain the warped frequencies. Let $f'_k = \phi(f_k)$ represent the warped frequencies in the linear-frequency (Hz) domain. Although our proposed method will work for any warping function $\phi(\cdot)$, for illustration purposes we use the piece-wise linear warping function as defined in (5) and (6). The corresponding VTLN-warped center frequencies $\mu'_k$ of the filters in the Mel-frequency domain will not be related through a linear scaling relation, since

$\mu'_k = 2595 \log_{10}\left(1 + \frac{\phi(f_k)}{700}\right) \qquad (13)$

Therefore, while $f'_k = \alpha f_k$ for the linear-scaling relation (i.e., along the $f$-axis), $\mu'_k \neq \alpha \mu_k$ along the $\mu$-axis, as seen from (12) and (13) and graphically shown in Fig. 8. The Cosine-interpolation approach proposed in [22], [23] assumes warping in the Mel domain (i.e., the $\mu$ domain), and therefore does not correspond to conventional VTLN-warping, which is specified in the frequency domain (i.e., the $f$ domain). While we refer only to piece-wise linear warping for illustration purposes, any frequency-warping function can be used in our proposed approach by specifying the appropriate $\phi(\cdot)$ in (13).
Fig. 8. The band-limited interpolation for the linear-scaling relation is illustrated. Warping is defined in the linear-frequency (Hz) domain or $f$-axis, i.e., $f' = \alpha f$. Along the $\mu$-axis, $\mu_k$ are the center-frequencies of the uniformly spaced filter-bank, corresponding to $f_k$ in the Mel-domain. Similarly, $\mu'_k$ are the center-frequencies of the warped filter-bank and are nonuniformly spaced in the Mel-domain. The band-limited interpolation matrix is defined to obtain samples at $\mu'_k$ given samples at $\mu_k$. In the figure, unprimed variables represent unwarped frequencies in both the linear-frequency (Hz) and Mel-frequency domains, and primed variables represent warped frequencies in both domains.
3) The Fourier relation between $g(\mu)$ and $q(n)$ is given by

$g(\mu) = \sum_{n} q(n)\, e^{j \pi n \mu / \mu_{\max}} \qquad (14)$

where $\mu_{\max}$ is the Nyquist frequency in the Mel-frequency domain. Here, we assume that the signal is periodic with a period of $2\mu_{\max}$ and symmetric around $\mu_{\max}$. Therefore, theoretically, half-filters are present at the boundary indices, and the values at these indices are required for performing band-limited interpolation. If we assume that $q(n)$ is que-frency limited, the elements of $g^{\alpha}$ can be determined as [we use variable $m$ since $n$ is already used in (14)]

$g^{\alpha}(k) = g(\mu'_k) = \sum_{m} q(m)\, e^{j \pi m \mu'_k / \mu_{\max}} \qquad (15)$

Substituting $q$ of (14) in (15), we obtain the band-limited interpolation matrix between $g$ and $g^{\alpha}$,

$g^{\alpha} = \tilde{T}^{\alpha}\, g \qquad (16)$

where the entries of $\tilde{T}^{\alpha}$ are periodic-sinc (Dirichlet kernel) terms in $(\mu'_k, \mu_l)$, with $\mu_l$ obtained from (12) and $\mu'_k$ from (13); the VTLN-warping relation $\phi(\cdot)$ is specified in the $f$ domain in (13). Using the even-symmetry property, we obtain the $N \times N$ interpolation matrix $T^{\alpha}$ such that $g^{\alpha} = T^{\alpha} g$.

Fig. 9. Framework of the proposed linear-transformation approach. Note that only the conventional MFCC features are generated; the warped features are obtained using the LT matrices $A^{\alpha}$.

Alternatively, the above matrix can also be written as a product of two matrices,

(17)

where $N$ is the number of filters; the factors operate over the normalized frequencies $\mu / \mu_{\max}$, which lie in the range $[0, 1]$.
The linear-transformation matrix to obtain the VTLN-warped MFCC given the conventional MFCC is then

$A^{\alpha} = C_{M \times N}\, T^{\alpha}\, C_{N \times N}^{-1} \qquad (18)$

Here, $M$ represents the number of static cepstral coefficients in the feature-vector and $N$ is the number of Mel-filters used in the feature extraction. The feature generation process using the proposed linear-transformation (LT) approach is illustrated in Fig. 9. Although we have shown here the case of piece-wise linear warping, the same procedure can be used for any arbitrary
Fig. 10. Comparing the VTLN-warped cepstra obtained using the conventional and the proposed Sinc-interpolation approach for piece-wise linear and bilinear warping functions. (a) Piece-wise linear warping. (b) Bilinear warping.
warping function by choosing the appropriate $\phi(\cdot)$ in (13). Fig. 10 compares the VTLN-warped cepstra obtained using the conventional and the proposed LT approach for the piece-wise linear and bilinear warping functions.
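The construction of Section III can be sketched numerically. The code below follows steps 1)–3): uniform Mel centers (a half-sample grid is assumed), warping applied in the Hz domain via (5)–(6), and interpolation of the que-frency limited $g$ from its uniform samples. For the interpolation step it uses an orthonormal DCT-II pair evaluated at the warped positions, as an illustrative stand-in for the sinc form of (16)–(17); the grid convention and normalizations are assumptions, not the paper's exact matrices:

```python
import numpy as np

def mel(f):     return 2595.0 * np.log10(1.0 + f / 700.0)
def inv_mel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def warp_matrix(alpha, n_filters=20, n_ceps=16, fs=16000, f0=7000.0):
    # Steps 1-2: uniform Mel centers (half-sample grid assumed), mapped to
    # Hz (12), warped by the piece-wise linear phi of (5)-(6) in the Hz
    # domain, then mapped back to Mel (13).
    fmax = fs / 2.0
    mu_max = mel(fmax)
    lam = (np.arange(n_filters) + 0.5) / n_filters        # normalized mu_k
    f = inv_mel(lam * mu_max)                             # centers in Hz (12)
    fw = np.where(f <= f0, alpha * f,
                  alpha * f0 + (fmax - alpha * f0) / (fmax - f0) * (f - f0))
    lam_w = mel(fw) / mu_max                              # warped centers (13)

    # Step 3: interpolate the que-frency limited g from its uniform samples.
    # Realized here with an orthonormal DCT-II analysis followed by
    # evaluation of the cosine series at the warped positions.
    i = np.arange(n_filters)[:, None]
    beta = np.where(i == 0, np.sqrt(1.0 / n_filters), np.sqrt(2.0 / n_filters))
    C = beta * np.cos(np.pi * i * (np.arange(n_filters)[None, :] + 0.5) / n_filters)
    E = beta.T * np.cos(np.pi * np.arange(n_filters)[None, :] * lam_w[:, None])
    T = E @ C                                             # g^alpha = T g
    A = C[:n_ceps] @ T @ np.linalg.inv(C)                 # A^alpha as in (18)
    return T, A

T1, A1 = warp_matrix(1.0)   # alpha = 1 should reduce to the identity map
```

A useful sanity check, matching the text, is that $\alpha = 1$ yields the unwarped features: $T^{1}$ is the identity and $A^{1}$ simply truncates to the first $M$ cepstra.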
The idea of linear-transformation presented here can be seen as a special case of the method proposed by Umesh et al. in [21], where a linear-transformation is obtained by separating both Mel- and VTLN-warping from the filter-bank. The main differences between these approaches are as follows.

The filters are uniformly spaced in the Mel-frequency domain for the approach proposed in this paper, i.e., the $\mu_k$ are uniformly spaced. In the work of Umesh et al. in [21], the filters are uniformly spaced in the linear-frequency (Hz) domain, i.e., the $f_k$ are uniformly spaced. Therefore, the conventional Mel filter-bank is not used in [21].
The interpolation matrix proposed in this paper is defined as

$g^{\alpha} = T^{\alpha}\, g \qquad (19)$

i.e., it performs only VTLN-warping on the Mel-warped spectra. In [21], the interpolation matrix is instead defined as

$g^{\alpha} = T^{\alpha}_{[21]}\, h \qquad (20)$

where $h$ is the smoothed spectrum without Mel-warping and the transformation matrix performs both Mel- and VTLN-warping to obtain VTLN-warped MFCC features.
B. Cosine-Interpolation
Motivated fromthe work of Umesh et al. [21], Panchapagesan
and Alwan [22], [23] proposed a linear-transformation approach
that incorporates the interpolation and warping in the inverse
discrete cosine transform (IDCT) matrix and we refer to this
approach as Cosine-interpolation. Considering to be the con-
tinuousMel-frequency variable, the signal is assumed to be pe-riodic with a period of and symmetric about the points
and . A normalization variable is de-
fined as follows:
where (21)
and has the range . The warped IDCT matrix is given
as
(22)
where are the normalized half-sampledshifted positions of theMel filter-bank and being the fre-
quency warping function. The relation between the warped and
unwarped cepstral features is given by
(23)
From the above equations, we see that the VTLN-warping, $\psi(\cdot)$, is performed on the half-sample shifted positions $\tilde{\lambda}_k$ of the filter-bank center-frequencies, which are already Mel-warped. In conventional VTLN, frequency-warping is performed in the linear-frequency (Hz) domain through $\phi(f)$. From the above discussion, it is clear that the Cosine-interpolation approach performs VTLN-warping on the Mel-warped frequencies and is not equivalent to conventional VTLN-warping. Panchapagesan and Alwan themselves point to these differences [23, below Eq. 27]. As seen from the warp-factor histograms in [23], the conventional VTLN warp-factors lie in the range (0.88, 1.24) and the Cosine-interpolation based warp-factors in the range (0.91, 1.11) for the same piece-wise linear warping. This indicates that conventional frequency-warping and the warping used in the Mel-domain by Cosine-interpolation cannot be directly compared, since the domains in which the warping is applied are different. In practice, most frequency-warping functions are specified in the linear-frequency (Hz) domain (and
not Mel-domain) often motivated by physiological arguments.To summarize, the main differences between Cosine-interpola-
tion and the linear-transformation derived in this paper are as
follows.
1) In Cosine-interpolation, VTLN-warping is applied in the Mel-domain and hence the corresponding warp-factors are very different when compared with those from conventional VTLN.
2) The interpolation is performed using the inverse-DCT matrix in Cosine-interpolation [see (22)], whereas the approach presented in this paper uses band-limited interpolation, as shown in (17).
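As a rough illustration of the Cosine-interpolation construction, the following sketch builds a warped-IDCT matrix from a warping function specified in the normalized Mel-domain and composes it with the DCT, as in (22) and (23). The DCT convention, function names, and the identity-warp check below are our own illustrative assumptions, not code from [22], [23]:

```python
import numpy as np

def dct_matrix(n_ceps, n_filters):
    # DCT mapping log filter-bank outputs to cepstra:
    # c[k] = sum_j L[j] * cos(pi * k * (j + 0.5) / M)
    k = np.arange(n_ceps)[:, None]
    u = (np.arange(n_filters)[None, :] + 0.5) / n_filters
    return np.cos(np.pi * k * u)

def warped_idct_matrix(n_filters, n_ceps, warp):
    # IDCT evaluated at warped half-sample-shifted positions, in the
    # spirit of (22); `warp` acts on the normalized Mel variable in [0, 1].
    u = (np.arange(n_filters) + 0.5) / n_filters
    uw = warp(u)
    k = np.arange(n_ceps)[None, :]
    idct = np.cos(np.pi * uw[:, None] * k) * (2.0 / n_filters)
    idct[:, 0] *= 0.5                 # half-weight for the c[0] term
    return idct

def vtln_transform(n_filters, n_ceps, warp):
    # Composite linear transform on cepstra: c_hat = C @ IDCT_warped @ c
    return dct_matrix(n_ceps, n_filters) @ warped_idct_matrix(n_filters, n_ceps, warp)

# With the identity warp, the composite transform reduces to the identity.
T = vtln_transform(20, 16, lambda u: u)
```

The identity-warp check follows from the orthogonality of the cosine basis over the half-sample-shifted grid, which is why the c[0] column carries half weight.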
Before proceeding further, we present the recognition setup along with the details of the databases used in our experiments in the next section.
1580 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 5, JULY 2012
TABLE I
DESCRIPTION OF THE CORPUS USED FOR EXPERIMENTS
IV. EXPERIMENTAL SETUP
The recognition experiments include four different sets of speech data: Wall Street Journal (WSJ0) [32], European Parliamentary Plenary Sessions (EPPS) English [33], Texas Instruments connected digits (TIDIGITS) [34], and Aurora 4.0 [35]. WSJ0, TIDIGITS, and EPPS-English are clean speech data, whereas Aurora 4.0 is noisy speech data. The details of the databases are presented in Table I. Aurora 4.0 consists of 14 different test sets, where seven of them are recorded with a microphone similar to the one used for recording the training data, while the other seven use a different microphone.
All the experiments were done using the RWTH Aachen Speech Recognition System [36], except for the TIDIGITS task. While performing feature extraction, we use 20 filters and obtain 16 cepstral coefficients. The features are mean and variance normalized at the segment level, and LDA is applied over a window of nine consecutive frames to derive a 45-dimensional feature vector. The system uses classification and regression tree (CART) based state tying. We have 1501 generalized triphones for both WSJ0 and Aurora 4.0, and 4501 generalized triphones for the EPPS task. The HMM model consists of three emitting states with 256 mixtures per state and uses a pooled covariance matrix.
The TIDIGITS speech recognition task is done using HTK [37] and uses word models. There are 11 word models, which include zero to nine and oh. The features are of 39 dimensions, comprising normalized log-energy, the cepstral coefficients (excluding the zeroth coefficient), and their first- and second-order derivatives. Cepstral mean subtraction is applied at the segment level. The digits were modeled with simple left-to-right HMMs without skips, having 16 emitting states with five diagonal-covariance Gaussian mixtures per state. Silence is modeled using a three-state HMM with six Gaussian mixtures per state.
While performing VTLN in training, we follow a maximum-likelihood (ML)-based approach for estimating the optimal warping-factor, i.e.,

$$\hat{\alpha}_i = \arg\max_{\alpha} \Pr\left(\mathbf{X}_i^{\alpha} \mid \lambda_{\mathrm{SI}}; W_i\right) \qquad (24)$$
where $\lambda_{\mathrm{SI}}$ is the SI model and $W_i$ is the known transcription during training. $\mathbf{X}_i^{\alpha}$ is the VTLN-warped feature vector of utterance $i$, with the static features appended with the delta and acceleration coefficients, or obtained after transformation using LDA. Since the delta and acceleration coefficients are obtained from the static coefficients, the same VTLN transformation matrix can be used to obtain the VTLN-warped delta and acceleration coefficients. Therefore, the relation between the unwarped and VTLN-warped features is given by

$$\begin{bmatrix} \mathbf{c}^{\alpha} \\ \Delta\mathbf{c}^{\alpha} \\ \Delta^{2}\mathbf{c}^{\alpha} \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{\alpha} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{A}_{\alpha} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{A}_{\alpha} \end{bmatrix} \begin{bmatrix} \mathbf{c} \\ \Delta\mathbf{c} \\ \Delta^{2}\mathbf{c} \end{bmatrix} \qquad (25)$$

where $\mathbf{A}_{\alpha}$ is the linear transformation relating the unwarped and VTLN-warped static cepstra.
If the features are obtained using an LDA transformation matrix, then the relation between the VTLN-warped and static unwarped features is given by [26]

$$\mathbf{x}^{\alpha} = \boldsymbol{\Lambda}\, \mathrm{diag}\left(\mathbf{A}_{\alpha}, \ldots, \mathbf{A}_{\alpha}\right)\, \mathbf{s} \qquad (26)$$

where the block-diagonal matrix contains $W$ copies of $\mathbf{A}_{\alpha}$, the VTLN transformation matrix applied to the static cepstra, $W$ represents the window length, $\boldsymbol{\Lambda}$ is the LDA transformation matrix, and $\mathbf{s}$ represents the supervector formed by concatenating all the static MFCC cepstra from the $W$ adjacent frames.

Using the estimated warped features, a new VTLN model
is obtained. For performing the warp-factor estimation during testing, we use a Gaussian mixture model (GMM) classifier [38]. Unwarped features corresponding to each warping-factor obtained in training are used to train a GMM with 256 mixtures. The optimal warping-factor in recognition is obtained by calculating the likelihood with respect to each warping-factor GMM and choosing the one that gives the best likelihood. The warping-factors are estimated at the speaker level in training and at the utterance level during recognition. During warped feature extraction, we map the frequency points zero and pi onto themselves using piece-wise linear-warping. We do not account for the Jacobian in VTLN for the experiments presented in this paper.
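To illustrate how a single static-cepstra VTLN matrix extends to the dynamic features of (25) and to the LDA supervector of (26), a small Kronecker-product sketch can be used. The function names and toy dimensions below are our own illustrative choices:

```python
import numpy as np

def extend_to_dynamic(A):
    # Deltas and accelerations are linear in the static cepstra, so the
    # same per-block matrix applies to each stream: diag(A, A, A).
    return np.kron(np.eye(3), A)

def extend_to_supervector(A, window=9):
    # For LDA over `window` consecutive frames, each frame's static
    # cepstra in the supervector is warped by the same A before the
    # LDA projection is applied.
    return np.kron(np.eye(window), A)

A = np.arange(16.0).reshape(4, 4)          # toy 4x4 "VTLN" matrix
B = extend_to_dynamic(A)                   # 12x12 block-diagonal
S = extend_to_supervector(A, window=9)     # 36x36 block-diagonal
```

`np.kron(np.eye(k), A)` is just a compact way of writing the block-diagonal matrices that appear in (25) and (26).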
V. IMPLEMENTATION DETAILS
In this section, we present the implementation details
for Sinc- and Cosine-interpolation approaches. Later, we
present the recognition results comparing the performance of
linear-transformation approaches with conventional VTLN.
A. Cosine-Interpolation
In this section, we will discuss the implementation details for
Cosine-interpolation and argue that the range of warping-factors has to be properly mapped, either in the Mel-domain or in the linear-frequency (Hz) domain, when comparing the recognition performance with the conventional and Sinc-interpolation approaches. Before proceeding further, we present recognition
results for the TIDIGITS task in Table II. The models were
trained using male speakers and are used for recognizing
children speakers. For this task, Panchapagesan and Alwan
observed that the Cosine-interpolation approach performed better
than conventional VTLN (see [22] and [23, Sec. (6.1)]). As
we will show next, the warp-factors for Cosine-interpolation
and conventional VTLN need to be mapped before they can be
compared. This is due to the difference in the domains where
frequency warping is applied. If proper mapping is chosen, the
difference in the performance observed in [22] and [23] no longer exists.
SANAND AND UMESH: VTLN USING ANALYTICALLY DETERMINED LINEAR-TRANSFORMATION ON CONVENTIONAL MFCC 1581
TABLE II
RECOGNITION RESULTS (% WER) COMPARING THE PERFORMANCE OF DIFFERENT APPROACHES TO VTLN FOR THE MALE-TRAIN AND CHILD-TEST CASE OF TIDIGITS. DIFFERENT RANGES OF WARPING-FACTORS HAVE TO BE USED TO GET COMPARABLE PERFORMANCE FOR THE CONVENTIONAL AND COSINE-INTERPOLATION APPROACHES DUE TO THE DIFFERENCE IN THE DOMAIN WHERE FREQUENCY WARPING IS APPLIED
Baseline - No VTLN; Conv. - Conventional; LT - Linear Transformation; M-C - Male Train - Child Test
Fig. 11. The different frequency-warping functions in the linear-frequency (Hz) domain are shown. Using a warp-factor of 0.80 in the Mel-domain results in a warping function in the linear-frequency domain (dotted line) that is quite different from using the same warp-factor value (i.e., 0.80) directly in the linear-frequency (Hz) domain (solid line). The figure also shows that using 0.9194 as the warp-factor in the Mel-domain generates a frequency-warping function very similar to using 0.80 in the linear-frequency (Hz) domain.
We briefly discuss the physiological motivation in choosing
the range of warp-factors used in conventional VTLN for piece-
wise linear warping. The average vocal-tract length for males
is about 17 cm, that for females is about 14.5 cm and for chil-
dren is about 12 cm. Males can have vocal-tract lengths that
are 19 cm or longer. Since the differences in vocal-tract lengths
crudely manifest as scaling of the spectra for the same sound,
the range of scaling (or warp-) factors are determined by the
ratio of vocal-tract lengths. For adult speakers (i.e., only male and female speakers), this ratio varies from about 14.5/19 ≈ 0.76 to about 19/14.5 ≈ 1.31. Usually, the range of warp-factors for adult data is 0.80 to 1.20. However, if we train models using male speakers and use children speakers for test, then the range of warp-factors has to be different. In this case, the lower-end of the range of warp-factors can be approximately 12/19 ≈ 0.63.
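The warp-factor ranges above follow from simple ratios of the vocal-tract lengths quoted in the text; the particular pairings below (e.g., child against the longest male vocal tract) are our reading of the argument, not values from the paper's tables:

```python
# Average vocal-tract lengths (cm) quoted in the text.
male, female, child, long_male = 17.0, 14.5, 12.0, 19.0

# Adult-only ratios roughly bracket the usual 0.80-1.20 search range.
adult_low = female / long_male      # shortest adult / longest adult, ~0.76
adult_high = long_male / female     # longest adult / shortest adult, ~1.31

# Male-train / child-test pushes the lower end much further down.
child_low = child / long_male       # ~0.63
```

This arithmetic is why the male-train, child-test experiments below need a lower limit well under the usual 0.80.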
As pointed out in Section III-B, Cosine-interpolation performs VTLN-warping on the Mel-warped frequencies, as opposed to performing frequency-warping in the linear-frequency (Hz) domain. Using the same numerical value of the warping-factor in both the Mel- and linear-frequency (Hz) domains will result in different warpings in the frequency (Hz) domain. The differences are illustrated in Fig. 11. Using a warp-factor of 0.80 in the Mel-frequency domain and mapping the warping back to the linear-frequency domain (shown in the figure with a dotted line) produces a very different warping function in the linear-frequency domain when compared to directly using a 0.80 warp-factor in the linear-frequency domain (shown with a solid line in the figure). On the other hand, using a warp-factor of 0.9194 in the Mel-domain and mapping the function back to the linear-frequency domain results in frequency-warping very similar to 0.80. We suspect that for the TIDIGITS task in [22] and [23], the same range of warp-factors from 0.80 to 1.25 was used in both conventional VTLN and
in Cosine-interpolation. Since the test data is from children
speakers, the lower limit of 0.80 for conventional-VTLN did
not provide sufficient search space and resulted in degraded
performance. On the other hand, scaling the Mel-domain with
0.80 is approximately equivalent to scaling the linear-fre-
quency domain by a factor of 0.5695. This provided a larger
search-space for Cosine-interpolation, probably helping it to
get better performance for children speech when compared to
conventional-VTLN in [22], [23].
In order to have a fair comparison, the warping-factors in the linear-frequency (Hz) domain (or Mel-domain) have to be in the same range for all the approaches. This can be done by calculating the equivalent warping-factor in one domain (say, the linear-frequency (Hz) domain) by fixing the warping-factor in the other domain (say, the Mel-frequency domain). The mapping of warping-factors is done as follows.
Let us say that the cutoff frequency where the slope of the piece-wise linear warping function changes is $f_0$ in Hz, and similarly that the frequency where the slope changes in the Mel domain is $m_0 = \mathrm{Mel}(f_0)$. The corresponding warped frequencies are given as

$$\hat{f} = \alpha_{\mathrm{Hz}}\, f_0 \qquad (27)$$

$$\hat{m} = \alpha_{\mathrm{Mel}}\, m_0 \qquad (28)$$

where $\alpha_{\mathrm{Hz}}$ and $\alpha_{\mathrm{Mel}}$ are different. The idea is that the inverse Mel-warped $\mathrm{Mel}^{-1}(\alpha_{\mathrm{Mel}}\, m_0)$ should match $\alpha_{\mathrm{Hz}}\, f_0$, or the Mel-warped $\mathrm{Mel}(\alpha_{\mathrm{Hz}}\, f_0)$ should match $\alpha_{\mathrm{Mel}}\, m_0$. This involves finding the value of $\alpha_{\mathrm{Hz}}$ (or $\alpha_{\mathrm{Mel}}$) that provides the match, and it can be found by equating (27) and (28). Therefore, the equivalent $\alpha_{\mathrm{Hz}}$ when $\alpha_{\mathrm{Mel}}$ is fixed is given by

$$\alpha_{\mathrm{Hz}} = \frac{\mathrm{Mel}^{-1}\left(\alpha_{\mathrm{Mel}}\, \mathrm{Mel}(f_0)\right)}{f_0} \qquad (29)$$

and the equivalent $\alpha_{\mathrm{Mel}}$ when $\alpha_{\mathrm{Hz}}$ is fixed is given by

$$\alpha_{\mathrm{Mel}} = \frac{\mathrm{Mel}\left(\alpha_{\mathrm{Hz}}\, f_0\right)}{\mathrm{Mel}(f_0)}. \qquad (30)$$

The function $\mathrm{Mel}^{-1}(\cdot)$ converts Mel frequencies to Hz and, similarly, $\mathrm{Mel}(\cdot)$ converts Hz frequencies to Mel frequencies.
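The mapping in (29) and (30) can be sketched numerically. The 2595 log10(1 + f/700) Mel formula and the 8-kHz cutoff below are our assumptions, chosen because they reproduce the 0.9194 and 0.5695 values discussed around Fig. 11, not parameters taken from the experiments:

```python
import numpy as np

def mel(f):
    # Standard Mel scale in Hz -> Mel.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    # Inverse Mel scale in Mel -> Hz.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def alpha_mel_from_hz(alpha_hz, f0):
    # Eq. (30)-style: equivalent Mel-domain factor for a fixed Hz-domain
    # factor, matched at the piece-wise linear cutoff f0 (in Hz).
    return mel(alpha_hz * f0) / mel(f0)

def alpha_hz_from_mel(alpha_mel, f0):
    # Eq. (29)-style: equivalent Hz-domain factor for a fixed Mel-domain factor.
    return mel_inv(alpha_mel * mel(f0)) / f0

a_mel = alpha_mel_from_hz(0.80, 8000.0)   # ~0.9194, cf. Fig. 11
a_hz = alpha_hz_from_mel(0.80, 8000.0)    # ~0.5695, cf. Section V-A
```

The two mappings are exact inverses of each other at the cutoff, which is what makes a fair comparison of warp-factor ranges possible.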
We map the warping-factors as discussed above by fixing the warp-factor in the linear-frequency (Hz) domain and calculating the corresponding Mel-domain warp-factor. The new range of Mel-domain warping-factors will be (0.91, 1.08) for the corresponding Hz-domain warp-factors in the range of (0.80, 1.25) for adult speech. These ranges seem to be consistent with the observations made by Panchapagesan and Alwan (see [23, Fig. 4]). The recognition performance on the TIDIGITS task using the proper range of warping-factors is shown in Table II, i.e., we extend the lower range to account for children speakers. Unlike [22] and [23], we now observe that all the approaches to VTLN have similar performance. When the warping-factors are in the improper range of (0.80, 1.20) for child speakers, the conventional VTLN performance is inferior to Cosine-interpolation, as observed in [22].
For all the subsequent experiments in this paper, we will use
mapping of warping-factors for Cosine-interpolation by fixing
the warp-factors in the linear-frequency (Hz) domain. In the next
section, we will present the implementation details of Sinc-in-
terpolation when performing linear-transformation of conven-
tional MFCC.
B. Linear-Transformation of Conventional MFCC for VTLN
Using Sinc-Interpolation
In order to perform band-limited interpolation, full spectral information in the frequency band is necessary. Since we use conventional MFCC, the available spectral information lies between the first and last filters of the filter-bank. Using a conventional filter-bank with 20 filters, the first filter has a center frequency around 135.2 Mels (or 89.92 Hz) and the last filter has a center frequency around 2704.8 Mels (or 7016.2 Hz). It is quite unlikely that a formant for any specific sound exists below or above these frequencies for any particular speaker. We can safely assume that the first and last filter-bank center frequencies act as the zero and Nyquist frequencies, which should in no way affect the VTLN performance. This assumption inherently means that the center frequencies of the first and last filters will map onto themselves after frequency warping. Note
that by stating the center frequencies of the first and last filters
are not changing with frequency-warping, we do not mean that
we are ignoring the speech spectrum below 89.92 Hz. The in-
formation is still present in the first-filter since the lower-end of
the filter starts at zero frequency. The point we want to make
is that the center frequencies of the first and last filters do not
change after frequency warping. The only consequence of thecenter frequency of the first filter being used as zero frequency
in linear-transformation case is that there will be a small con-
sistent difference in numerical value between the warp-factor
estimate obtained by this method and the conventional method.
This is because there will be a small difference in the slope used
in (5) and (6). The linear transformation relation is given by
$$\hat{\mathbf{c}} = \mathbf{T}_{\alpha}^{\mathrm{CF}}\, \mathbf{c} \qquad (31)$$

where the superscript $\mathrm{CF}$ indicates that the conventional Mel filter-bank is used. The linear transformation $\mathbf{T}_{\alpha}^{\mathrm{CF}}$ is derived as shown in (17).
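A minimal sketch of the band-limited interpolation at the heart of (17) is given below. The symmetric extension and the exact normalization used in the paper are omitted, and the function names are our own, so this is an illustration of the idea rather than the paper's implementation:

```python
import numpy as np

def sinc_interp_matrix(warped_pos, n):
    # Row i resamples a uniformly sampled length-n sequence at the
    # (generally fractional) position warped_pos[i]; np.sinc(x) is the
    # normalized sinc, sin(pi*x)/(pi*x).
    return np.sinc(warped_pos[:, None] - np.arange(n)[None, :])

def vtln_cepstral_transform(C, C_inv, warped_pos):
    # Composite linear map on cepstra: back to log filter-bank outputs
    # (C_inv), resample at the VTLN-warped filter positions (S), and
    # return to cepstra (C): c_hat = C @ S @ C_inv @ c.
    S = sinc_interp_matrix(warped_pos, C_inv.shape[0])
    return C @ S @ C_inv

# Sanity check: unwarped (integer) positions give an identity resampling.
n = 20
S = sinc_interp_matrix(np.arange(n, dtype=float), n)
```

Because warping is expressed as resampling positions on the filter-bank grid, the whole VTLN operation collapses into a single matrix applied to the cepstra.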
The results comparing the recognition performance using the
conventional Mel filter-bank are shown in Table III. We make the following observations.
TABLE III
RECOGNITION RESULTS (% WER) OF CONVENTIONAL AND LINEAR-TRANSFORMATION APPROACHES TO VTLN USING CONVENTIONAL MFCC. BOTH SINC- AND COSINE-INTERPOLATION APPROACHES PERFORM COMPARABLY TO CONVENTIONAL VTLN
Fig. 12. Histogram and contour plots comparing the alpha estimates for the conventional and Sinc-interpolation approaches for the EPPS train data. In the linear-transformation approach, since we use conventional Mel-filters (without half-filters), the corresponding warp-factors would be approximately 0.02 less than conventional-VTLN, which is reflected in the histogram. (a) Histogram plot. (b) Contour plot.
1) We observe that the linear-transformation-based approaches perform comparably with the conventional approach, irrespective of noisy or clean speech.
2) More importantly, we use the conventional Mel filter-bank without any modification and still perform VTLN-warping using a linear-transformation.
Fig. 12 shows the histogram and contour plots for the warp-
factors obtained using the conventional and the proposed Sinc-
interpolation approaches for the train data corresponding to the EPPS task. The histogram plot shows the distribution of warp-
factors and the contour plot gives an idea on how the warp-
factor estimates differ between Sinc- and conventional VTLN
approaches. From the histogram, we observe that the majority of the warp-factors are shifted by a single warp-factor step, such as
the peaks at 1.02, 0.94, and 0.92 in the conventional approach appearing at 1.00, 0.92, and 0.90, respectively. A similar behavior can also be observed from the contour plot. This is because we are using conventional Mel-filters (i.e., no additional half-filters), with the center frequency of the first Mel-filter (at 89.92 Hz) being mapped to zero-frequency to enable the
linear-transformation approach without any change in the signal
processing. This leads to a small consistent difference between
warp-factors obtained in the linear-transformation and the con-
ventional approach. Our analysis of the warp-factors obtained
on the EPPS train data indicates that for any warp-factor in conventional-VTLN, the corresponding warp-factors are the same or
0.02 smaller in 90% of the utterances. The correlation coeffi-
cient between the alpha estimates is 0.93, which also indicates
that the deviations are only marginal.
Another source of approximation is the use of truncated un-warped MFCC cepstra in (18) to obtain the VTLN-warped cep-
stra, which will also result in some loss of information. Though
there are small differences in the warp-factor distribution, the
recognition performance of LT-Sinc is comparable to conven-
tional VTLN on a variety of tasks that we have presented in this
paper.
VI. CONCLUSION
In this paper, we have presented an approach to perform
VTLN using a linear transformation on conventional MFCC
without any modification in the feature extraction steps. The
linear-transformation is given by (18) with the interpolationmatrix given by (17). Therefore, the linear-transformation
can be analytically calculated using the above equations for
any as well as for any arbitrary warping function by putting
appropriate in (13). This is an important difference when
compared to the Cosine-interpolation, where is used in
theMel-domain and the warping is different from conventional
VTLN. Further, the corresponding warp-factors between co-
sine-interpolation approach and conventional-VTLN cannot be
easily compared. The key idea to our approach is to separate
the speaker scaling operation from the filter-bank which helps
us derive a linear transformation for VTLN using the idea of
band-limited interpolation. The use of such transformations
would enable the warp-factors to be efficiently estimated by accumulating sufficient statistics, the use of a regression-tree framework to perform VTLN at the acoustic-class level, or the use of VTLN matrices as base matrices for adaptation until sufficient data is available. Such approaches cannot be easily implemented in the conventional-VTLN framework. Using four different tasks to illustrate the efficacy of our proposed approach, we have shown
that the recognition performance of our proposed approach of
linear transformation is always comparable to the conventional
VTLN on both clean and noisy speech data.
ACKNOWLEDGMENT
D. R. Sanand would like to thank Prof. H. Ney for giving him an opportunity to work as a research assistant in the Human Lan-
guage Technology Group at RWTH Aachen University, Aachen,
Germany. The authors would like to thank Prof. H. Ney for pro-
viding resources to run the recognition experiments reported in
this paper. The authors would also like to thank the anonymous
reviewers for their valuable comments and suggestions.
REFERENCES
[1] G. Evermann, H. Y. Chan, M. J. F. Gales, T. Hain, X. Liu, D. Mrva, L.Wang, and P. C. Woodland, Development of the 2003 CU-HTK con-versational telephone speech transcription system, in Proc. ICASSP04, Montreal, QC, Canada, May 2004, pp. 249252.
[2] A. Sixtus, S. Molau, S. Kanthak, R. Schlüter, and H. Ney, Recent improvements of the RWTH large vocabulary speech recognition system on spontaneous speech, in Proc. ICASSP 00, Istanbul, Turkey, Jun. 2000, pp. 1671-1674.
[3] G. Zavaliagkos, J. McDonough, D. Miller, A. El-Jaroudi, J. Billa, F.Richardson, K. Ma, M. Siu, and H. Gish, The BBN Byblos 1997large vocabulary conversational speech recognition system, in Proc.
ICASSP 98, Seattle, WA, May 1998, pp. 905908.[4] J. L. Gauvain, L. Lamel, H. Schwenk, G. Adda, L. Chen, and F.
Lefevre, Conversational telephone speech recognition, in Proc.ICASSP 03, Hong Kong, Apr. 2003, pp. 212215.
[5] A. Stolcke, H. Bratt, J. Butzberger, H. Franco, V. R. R. Gadde, M. Plauché, C. Richey, E. Shriberg, K. Sönmez, F. Weng, and J. Zheng, The SRI March 2000 Hub-5 conversational speech transcription system, in Proc. NIST Speech Transcript. Workshop, 2000.
[6] Y. Gao, Y. Li, V. Goel, and M. Picheny, Recent advances in speechrecognition system for IBM DARPA communicator, in Proc. Eu-rospeech 01, Aalborg, Denmark, Sep. 2001.
[7] A. Andreou, T. Kamm, and J. Cohen, Experiments in vocal tract nor-malization, inProc. CAIP Workshop: Frontiers in Speech Recognition
II, 1994.[8] L. Lee and R. Rose, Frequency warping approach to speaker normal-
ization, IEEE Trans. Speech Audio Process., vol. 6, no. 1, pp. 4959,Jan. 1998.
[9] A. Acero and R. M. Stern, Robust speech recognition by normaliza-tion of the acoustic space, inProc. ICASSP 91, Toronto, ON, Canada,May 1991, pp. 893896.
[10] A. Acero, Acoustical and environmental robustness in automaticspeech recognition, Ph.D. dissertation, Carnegie Mellon Univ.,Pittsburgh, PA, 1990.
[11] J. McDonough, W. Bryne, and X. Luo, Speaker normalization withall-pass transforms, inICSLP 98, Sydney, Australia, Nov. 1998.
[12] M. Pitz and H. Ney, Vocal tract normalization equals linear transfor-mation in cepstral space,IEEE Trans. Speech Audio Process., vol. 13,no. 5, pp. 930944, Sep. 2005.
[13] M. Pitz, S. Molau, R. Schlter, and H. Ney, Vocal tract normalizationequals linear transformation in cepstral space,in Eurospeech 01, Aal-borg, Denmark, Sep. 2001.
[14] S. Molau, M. Pitz, R. Schlüter, and H. Ney, Computing Mel-frequency cepstral coefficients on the power spectrum, in Proc. ICASSP 01, Salt Lake City, UT, May 2001, pp. 73-76.
[15] M. Pitz, Investigations on linear transformations for speaker adapta-tion and normalization, Ph.D. dissertation, RWTH Aachen, Aachen,
Germany, Mar. 2005.
[16] T. Claes, I. Dologlou, L. ten Bosch, and D. van Compernolle, A novel feature transformation for vocal tract length normalisation in automatic speech recognition, IEEE Trans. Speech Audio Process., vol. 6, no. 6, pp. 549-557, Nov. 1998.
[17] S. Cox, Speaker normalization in the MFCC domain, inProc. ICSLP00, Beijing, China, Oct. 2000.
[18] D. Y. Kim, S. Umesh, M. J. F. Gales, T. Hain, and P. C. Woodland,Using VTLN for broadcast news transcription, in Proc. Interspeech04, Jeju Island, Korea, Sep. 2004.
[19] X. Cui and A. Alwan, Adaptation of children speech with limited databased on formant-like peakalignment, Comput. Speech Lang., vol.20,no. 4, pp. 400419, Oct. 2006.
[20] D. R. Sanand, D. D. Kumar, and S. Umesh, Linear transformationapproach to VTLN using dynamic frequency warping, inProc. Inter-speech 07, Antwerp, Belgium, Aug. 2007.
[21] S. Umesh, A. Zolnay, and H. Ney, Implementing frequency warpingand VTLN through linear transformation of conventional MFCC, inInterspeech 05, Lisbon, Portugal, Sept. 2005.
[22] S. Panchapagesan, Frequency warping by linear transformation ofstandard MFCC, inProc. Interspeech 06, Pittsburgh, PA, Sep. 2006.
[23] S. Panchapagesan and A. Alwan, Frequency warping for VTLNand speaker adaptation by linear transformation of standard MFCC,Comput. Speech Lang., vol. 23, no. 1, pp. 4264, Jan. 2009.
[24] D. R. Sanand and S. Umesh, Study of Jacobian compensation usinglinear transformation of conventional MFCC for VTLN, in Proc. In-terspeech 08, Brisbane, Australia, Sep. 2008.
[25] D. R. Sanand, R. Schlter, and H. Ney, Revisiting VTLN using lineartransformation on conventional MFCC, in Proc. Interspeech 10,Makuhari, Japan, Sep. 2010.
[26] J. Lööf, H. Ney, and S. Umesh, VTLN warping factor estimation using accumulation of sufficient statistics, in Proc. ICASSP 06, Toulouse, France, May 2006, pp. 1201-1204.
[27] P. T. Akhil, S. P. Rath, S. Umesh, and D. R. Sanand, A computation-ally efficient approach to warp factor estimation in VTLN using EMalgorithm and sufficient statistics, in Proc. Interspeech 08, Brisbane,Australia, Sep. 2008.
[28] S. P. Rath and S. Umesh, Acoustic class specific VTLN-warping using regression class trees, in Proc. Interspeech 09, Brighton, U.K., Sep. 2009.
[29] C. Breslin, K. Chin, M. Gales, K. Knill, and H. Xu, Prior informa-tion for rapid speaker adaptation, in Proc. Interspeech 10, Makuhari,Japan, Sep. 2010.
[30] L. Saheer, J. Dines, P. N. Garner, and H. Liang, Implementation ofVTLN for statistical speech synthesis, in Proc. ISCA Speech Synth.Workshop, Sep. 2010.
[31] S. Wegmann, D. McAllaster, J. Orloff, and B. Peskin, Speaker nor-malization on conversational telephone speech, in Proc. ICASSP 96,Atlanta, GA, May 1996, pp. 339341.
[32] D. B. Paul and J. M. Baker, The design for the Wall Street Journal-based CSR corpus, in Proc. ICSLP 92, Banff, AB, Canada, Oct. 1992.
[33] J. Lööf, M. Bisani, C. Gollan, G. Heigold, B. Hoffmeister, C. Plahl, R. Schlüter, and H. Ney, The 2006 RWTH parliamentary speeches transcription system, in Proc. Interspeech 06, Barcelona, Spain, Jun. 2006.
[34] R. Leonard, A database for speaker-independent digit recognition, inICASSP 84, San Diego, CA, Mar. 1984, pp. 328331.
[35] N. Parihar and J. Picone, DSR Front End LVCSR Evaluation, AU/384/02, Tech. Rep., Mississippi State University, Mississippi State, MS, Dec. 2002.
[36] D. Rybach, C. Gollan, G. Heigold, B. Hoffmeister, J. Lööf, R. Schlüter, and H. Ney, The RWTH Aachen University open source speech recognition system, in Proc. Interspeech 09, Brighton, U.K., Sep. 2009.
[37] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. A. Liu, G.Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland,The HTK Book (for HTK Version 3.4). Cambridge, U.K.: CambridgeUniv. Eng. Dept., 2006.
[38] L. Welling, S. Kanthak, and H. Ney, Improved methods for vocaltract normalization, in Proc. ICASSP 99, Phoenix, AZ, Mar. 1999,pp. 761764.
D. R. Sanand received the Ph.D. degree in electrical engineering from the Indian Institute of Technology, Kanpur, in 2010.
From 2009 to 2010, he was a Postdoctoral Re-searcher in the Department of Computer Science,RWTH Aachen University, Aachen, Germany,and later in the Department of Information andComputer Science, Aalto University, Espoo, Fin-land, until 2011. Currently, he is a PostdoctoralResearcher in the Department of Electronics andTelecommunications, Norwegian University of
Science and Technology, Trondheim, Norway. His research interests includespeech recognition, synthesis, and biomedical signal processing.
S. Umesh received the Ph.D. degree in electricalengineering from the University of Rhode Island,Kingston, in 1993.
From 1993 to 1996, he was a Postdoctoral Fellowat the City University of New York. From 1996 to2009, he was with the Indian Institute of Technology,Kanpur, first as an Assistant Professor and then asa Professor of electrical engineering. Since 2009,he has been with the Indian Institute of Technology,Madras, where he is a Professor of electrical engi-neering. He has also been a Visiting Researcher at
AT&T Research Laboratories, Machine Intelligence Laboratory, Cambridge University, U.K., and the Department of Computer Science (Lehrstuhl für Informatik VI), RWTH Aachen, Germany. His recent research interests have been mainly in the area of speaker-normalization and noise-robustness and
their application in large-vocabulary continuous speech recognition systems.He has also worked in the areas of statistical signal processing and time-varyingspectral analysis.
Dr. Umesh was a recipient of the Indian AICTE Career Award for YoungTeachers and the Alexander von Humboldt Research Fellowship.