Speaker Adaptation
Steve Renals
Automatic Speech Recognition – ASR Lectures 13&14, 9 March 2015
ASR Lectures 13&14 Speaker Adaptation 1
Overview
Speaker Adaptation
Introduction: speaker-specific variation, modes of adaptation
Model-based adaptation: MAP
Model-based adaptation: MLLR
Model-based adaptation: Speaker space models
Speaker normalization: VTLN
Adaptive training
Adaptation for hybrid HMM / NN systems
Speaker independent / dependent / adaptive
Speaker independent (SI) systems have long been the focus for research in transcription, dialogue systems, etc.
Speaker dependent (SD) systems can result in word error rates 2–3 times lower than SI systems (given the same amount of training data)
From a speaker adaptive (SA) system we would like:
Error rates similar to SD systems
Building on an SI system
Requiring only a small fraction of the speaker-specific training data used by an SD system
Speaker-specific variation
Acoustic model
Speaking styles
Accents
Speech production anatomy (eg length of the vocal tract)
Also non-speaker variation, such as channel conditions (telephone, reverberant room, close-talking mic) and application domain
Speaker adaptation of acoustic models aims to reduce the mismatch between test data and the models
Pronunciation model: speaker-specific, consistent changes in pronunciation
Language model: user-specific documents (exploited in personal dictation systems)
Modes of adaptation
Supervised or unsupervised
Supervised: the word level transcription of the adaptation data is known (and HMMs may be constructed)
Unsupervised: the transcription must be estimated (eg using recognition output)
Static or dynamic
Static: all adaptation data is presented to the system in a block before the final system is estimated (eg as used in enrollment in a dictation system)
Dynamic: adaptation data is incrementally available, and models must be adapted before all adaptation data is available (eg as used in a spoken dialogue system)
Approaches to adaptation
Model based: Adapt the parameters of the acoustic models to better match the observed data
Maximum a posteriori (MAP) adaptation of HMM/GMM parameters
Maximum likelihood linear regression (MLLR) of Gaussian parameters
Linear input network (LIN) for neural networks
Speaker Normalization: Normalize the acoustic data to reduce mismatch with the acoustic models
Vocal Tract Length Normalization (VTLN)
Constrained MLLR (cMLLR) — model-based normalization!
Speaker space: Estimate multiple sets of acoustic models, characterizing new speakers in terms of these model sets
Cluster-adaptive training
Eigenvoices
Speaker codes
Adaptation and normalization of acoustic models

[Figure: feature space vs model space, shown in four stages. (1) Training and recognition: models Mtrain are estimated from Xtrain, and recognition of Xtest suffers from the mismatch between training and test conditions. (2) Adaptation: Mtrain is adapted into test-condition models Mtest used for recognition. (3) Normalization: both Xtrain and Xtest are mapped to a reference condition, and models M̂ are trained on the normalized features X̂train. (4) Adaptive training and adaptive recognition: the two are combined, estimating canonical models M̂ and adapting them to the training and test conditions.]
Desirable properties for model-based speaker adaptation
Compact: relatively few speaker-dependent parameters
Unsupervised: does not require labelled adaptation data, or changes to the training
Efficient: low computational requirements
Flexible: applicable to different model variants
Model-based adaptation: The MAP family
Basic idea Use the SI models as a prior probability distribution over model parameters when estimating using speaker-specific data
Theoretically well-motivated approach to incorporating the knowledge inherent in the SI model parameters
Maximum likelihood (ML) training sets the model parameters λ to maximize the likelihood p(X | λ)
Maximum a posteriori (MAP) training maximizes the posterior of the parameters given the data:

p(λ | X) ∝ p(X | λ) p0(λ)

p0(λ) is the prior distribution of the parameters
The use of a prior distribution, based on the SI models, means that less data is required to estimate the speaker-specific models: we are not starting from complete ignorance
Recall: ML estimation of GMM/HMM
The mean of the mth Gaussian component of the jth state is estimated using a weighted average:

µ̂jm = (∑n γjm(n) xn) / (∑n γjm(n))

where ∑n γjm(n) is the component occupation probability
The covariance of the Gaussian component is given by:

Σ̂jm = (∑n γjm(n)(xn − µjm)(xn − µjm)^T) / (∑n γjm(n))
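These two weighted averages are easy to sketch in NumPy. The following is a minimal illustration (the function name `ml_update` and its argument layout are assumptions, not from the lecture): both updates are frame averages weighted by the occupation probabilities.

```python
import numpy as np

def ml_update(x, gamma):
    """ML re-estimation of one Gaussian component's mean and covariance.

    x:     (N, d) array of observation vectors x_n
    gamma: (N,)   occupation probabilities gamma_jm(n) for this component
    """
    occ = gamma.sum()                              # sum_n gamma_jm(n)
    mu = (gamma[:, None] * x).sum(axis=0) / occ    # weighted mean
    diff = x - mu
    # weighted sum of outer products (x_n - mu)(x_n - mu)^T
    sigma = (gamma[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0) / occ
    return mu, sigma
```

With all occupation probabilities equal to one this reduces to the ordinary sample mean and (biased) sample covariance.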
MAP estimation
What is p0(λ)?
Conjugate prior: the prior distribution has the same form as the posterior. There is no simple conjugate prior for GMMs, but an intuitively understandable approach may be employed.
If the prior mean is µ0, then the MAP estimate for the adapted mean µ̂ of a Gaussian is given by:

µ̂ = (τµ0 + ∑n γ(n) xn) / (τ + ∑n γ(n))

τ is a hyperparameter that controls the balance between the ML estimate of the mean and its prior value; typically τ is in the range 2–20
xn is the adaptation vector at time n
γ(n) is the probability of this Gaussian at time n
As the amount of training data increases, the MAP estimate converges to the ML estimate
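The MAP mean update interpolates between the prior mean and the ML estimate, with τ controlling the balance. A minimal NumPy sketch (the function name and default τ are illustrative assumptions):

```python
import numpy as np

def map_mean(mu0, x, gamma, tau=10.0):
    """MAP estimate of a Gaussian mean:
    mu_hat = (tau * mu0 + sum_n gamma(n) x_n) / (tau + sum_n gamma(n)).

    mu0:   (d,)   prior (speaker-independent) mean
    x:     (N, d) adaptation vectors x_n
    gamma: (N,)   occupation probabilities gamma(n) for this Gaussian
    tau:   interpolation hyperparameter (typically 2-20)
    """
    occ = gamma.sum()
    return (tau * mu0 + (gamma[:, None] * x).sum(axis=0)) / (tau + occ)
```

With no adaptation data (occupation count zero) this returns µ0; as the occupation count grows the estimate converges to the ML mean.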
Local estimation
Basic idea The main drawback to MAP adaptation is that it is local
Only the parameters belonging to Gaussians of observed states will be adapted
Large vocabulary speech recognition systems have about 10⁵ Gaussians: most will not be adapted
Structural MAP (SMAP) approaches have been introduced to share Gaussians
The MLLR family of adaptation approaches addresses this by assuming that transformations for a specific speaker are systematic across Gaussians, states and models
MAP adaptation is very useful for domain adaptation:
Example: adapting a conversational telephone speech system (100s of hours of data) to multiparty meetings (10s of hours of data) works well with MAP
SMAP: Structural MAP
Basic idea share Gaussians by organising them in a tree, whose root contains all the Gaussians
At each node in the tree compute a mean offset and a diagonal variance scaling term
For each node, its parent is used as a prior distribution
This has been shown to speed adaptation compared with standard MAP, while converging to the same solution as standard MAP in the large data limit
The Linear Transform family
Basic idea Rather than directly adapting the model parameters, estimate a transform which may be applied to the Gaussian means and covariances
Linear transform applied to the parameters of a set of Gaussians: adaptation transform parameters are shared across Gaussians
This addresses the locality problem arising in MAP adaptation, since each adaptation data point can affect many (or even all) of the Gaussians in the system
There are relatively few adaptation parameters, so estimation is robust
Maximum Likelihood Linear Regression (MLLR) is the best known linear transform approach to speaker adaptation
MLLR: Maximum Likelihood Linear Regression
MLLR is the best known linear transform approach to speaker adaptation
Affine transform of the mean parameters:

µ̂ = Aµ + b

If the observation vectors are d-dimensional, then A is a d × d matrix and b is a d-dimensional vector
If we define W = [b A] and η = [1 µ^T]^T, then we can write:

µ̂ = Wη

In MLLR, W is estimated so as to maximize the likelihood of the adaptation data
A single transform W can be shared across a set of Gaussian components (even all of them!)
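Applying an estimated MLLR mean transform is just the matrix-vector product µ̂ = Wη; a small sketch (the helper name is an assumption):

```python
import numpy as np

def mllr_adapt_mean(W, mu):
    """Apply an MLLR mean transform: mu_hat = W @ eta with eta = [1, mu^T]^T.

    W:  (d, d+1) extended transform [b A]
    mu: (d,)     Gaussian mean to adapt
    """
    eta = np.concatenate(([1.0], mu))  # prepend 1 so W's first column acts as the bias b
    return W @ eta
```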
Regression classes
The number of transforms may be obtained automatically
A set of Gaussian components that share a transform is called a regression class
Obtain the regression classes by constructing a regression class tree
Each node in the tree represents a regression class sharing a transform
For an adaptation set, work down the tree until arriving at the most specific set of nodes for which there is sufficient data
The regression class tree is constructed in a similar way to the state clustering tree
In practice the number of regression classes may be very small: one per context-independent phone class, one per broad class, or even just two (speech/non-speech)
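The descent down the regression class tree can be sketched as follows; the dict-based node representation and the `min_occ` occupation-count threshold are hypothetical, chosen only to illustrate the "sufficient data" test:

```python
def assign_transforms(node, min_occ=1000.0):
    """Return the nodes at which transforms will be estimated.

    Descend while every child of the current node has enough adaptation
    data (occupation count >= min_occ); otherwise estimate a single
    shared transform at the current node.

    node: {'occ': float, 'children': [node, ...]}
    """
    kids = node.get('children', [])
    if kids and all(c['occ'] >= min_occ for c in kids):
        out = []
        for c in kids:
            out += assign_transforms(c, min_occ)
        return out
    return [node]  # most specific class with sufficient data
```

With plentiful adaptation data this yields many specific transforms; with very little data it falls back to one or two broad classes, as described above.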
Estimating the transforms
The linear transformation matrix W is obtained by finding the setting that optimizes the log likelihood
Mean adaptation: log likelihood

L = ∑r ∑n γr(n) log( Kr exp( −(1/2)(xn − Wηr)^T Σr^−1 (xn − Wηr) ) )

where r ranges over the components belonging to the regression class
Differentiating L and setting to 0 results in an equation for W: there is no closed form solution if Σ is full covariance; it can be solved if Σ is diagonal (but requires a matrix inversion)
Variance adaptation is also possible
See Gales and Woodland (1996), Gales (1998) for details
MLLR in practice
Mean-only MLLR results in 10–15% relative reduction in WER
Few regression classes and well-estimated transforms work best in practice
Robust adaptation available with about 1 minute of speech; performance similar to SD models available with 30 minutes of adaptation data
Such linear transforms can account for any systematic (linear) variation from the speaker independent models, for example those caused by channel effects.
Constrained MLLR (cMLLR)
Basic idea use the same linear transform for both mean and covariance:

µ̂ = A′µ − b′
Σ̂ = A′ΣA′^T

No closed form solution but can be solved iteratively
Log likelihood for cMLLR:

L = log N(Axn + b; µ, Σ) + log |A|, with A′ = A^−1 and b′ = A′b

Equivalent to applying the linear transform to the data!
Also called fMLLR (feature space MLLR)
The iterative solution is amenable to online/dynamic adaptation, by using just one iteration for each increment
Similar improvement in accuracy to standard MLLR
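Because cMLLR is equivalent to a feature-space transform, the adaptation can be applied once to the observations rather than to every Gaussian. A minimal sketch (the helper name is an assumption):

```python
import numpy as np

def cmllr_transform(A, b, X):
    """Apply a cMLLR/fMLLR transform in feature space: x_hat_n = A x_n + b.

    A: (d, d) transform, b: (d,) bias, X: (N, d) observation vectors.
    The log|A| Jacobian term is added to the log likelihood separately.
    """
    return X @ A.T + b
```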
Speaker-adaptive training (SAT)
Basic idea Rather than SI seed (canonical) models, construct models designed for adaptation
Estimate the parameters of the canonical models by training MLLR mean transforms for each training speaker
Train using the MLLR transform for each speaker; interleave Gaussian parameter estimation and MLLR transform estimation
SAT results in much higher training likelihoods, and improved recognition results
But: increased training complexity and storage requirements
SAT using cMLLR corresponds to a type of speaker normalization at training time
Speaker Space Methods
Gender-dependent models: sets of HMMs for male and for female speakers
Speaker clustering: sets of HMMs for different speaker clusters
Drawbacks:
Hard division of speakers into groups
Fragments training data
Weighted speaker cluster approaches use an interpolated model to represent the current speaker:
Cluster-adaptive training
Eigenvoices
Cluster-adaptive training
Basic idea: Represent a speaker as a weighted sum of speaker cluster models
Different cluster models have shared variances and mixture weights, but separate means
For a new speaker, the mean is defined as

µ = Σ_c λ_c µ_c

Given the canonical models, only the λ_c mixing parameters need to be estimated for each speaker
Given sets of weights for individual speakers, the means of the clusters may be updated
CAT can reduce WER in large vocabulary tasks by about 4–8% relative
For more, see Gales (2000), Cluster adaptive training of hidden Markov models, IEEE Trans Speech and Audio Processing, 8:417–428.
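The adapted mean above can be sketched in a few lines. The least-squares fit at the end is only a crude stand-in for the maximum-likelihood estimation of the λ_c described in Gales (2000); all values are synthetic.

```python
import numpy as np

# Hypothetical canonical cluster means: C clusters, each D-dimensional
C, D = 3, 4
rng = np.random.default_rng(1)
cluster_means = rng.normal(size=(C, D))   # mu_c, one row per cluster

# Adapted mean for a speaker with mixing weights lambda_c:
lam = np.array([0.5, 0.3, 0.2])
mu_speaker = lam @ cluster_means          # mu = sum_c lambda_c mu_c

# In CAT the lambdas are estimated by maximum likelihood; as a crude
# stand-in, fit them by least squares to an observed speaker mean:
observed = 0.7 * cluster_means[0] + 0.3 * cluster_means[2]
lam_hat, *_ = np.linalg.lstsq(cluster_means.T, observed, rcond=None)
```

Since `observed` lies in the span of the cluster means, the fit recovers the mixing weights exactly.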
Eigenvoices
Basic idea: Construct a speaker space from a set of SD HMMs
Could regard each canonical model as forming a dimension of speaker space
Generalize by computing PCA of sets of "supervectors" (concatenated mean vectors), to form speaker space: each dimension is an "eigenvoice"
Represent a new speaker as a combination of eigenvoices
Close relation to CAT
Computationally intensive, does not scale well to large vocabulary systems
For more, see Kuhn et al (2000), Rapid speaker adaptation in eigenvoice space, IEEE Trans Speech and Audio Processing, 8:695–707.
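A minimal sketch of the eigenvoice construction, with random stand-ins for the SD supervectors:

```python
import numpy as np

# Hypothetical SD models: S speakers, each with M Gaussian means of
# dimension D, concatenated into supervectors of length M*D
S, M, D = 10, 5, 3
rng = np.random.default_rng(2)
supervectors = rng.normal(size=(S, M * D))

# PCA of the supervectors: each principal direction is an "eigenvoice"
mean_voice = supervectors.mean(axis=0)
U, s, Vt = np.linalg.svd(supervectors - mean_voice, full_matrices=False)
K = 3                                  # keep K eigenvoices
eigenvoices = Vt[:K]                   # (K, M*D), orthonormal rows

# A new speaker is represented by K coordinates in eigenvoice space
new_sv = rng.normal(size=M * D)
coords = eigenvoices @ (new_sv - mean_voice)
reconstructed = mean_voice + coords @ eigenvoices
```

In a real system the coordinates would be estimated by maximising the likelihood of the adaptation data, not by projecting a known supervector.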
Feature normalization
Basic idea: Transform the features to reduce mismatch between training and test
Cepstral Mean Normalization (CMN): subtract the average feature value from each feature, so each feature has a mean value of 0; makes features robust to some linear filtering of the signal (channel variation)
Cepstral Variance Normalization (CVN): divide each feature by the standard deviation of the feature vectors, so each feature vector element has a variance of 1
Cepstral mean and variance normalisation (CMN/CVN):

x̂_i = (x_i − µ(x)) / σ(x)

Compute mean and variance statistics over the longest available segments with the same speaker/channel
Real-time normalisation: compute a moving average
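The normalisation formula above, sketched with NumPy on synthetic features:

```python
import numpy as np

def cmvn(feats):
    """Cepstral mean and variance normalisation.

    feats: (T, D) array of cepstral feature vectors for one
    speaker/channel segment. Returns features with per-dimension
    zero mean and unit variance.
    """
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)
    return (feats - mu) / sigma

# Toy example: 100 frames of 13-dim "cepstra" with channel offset
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(100, 13))
x_hat = cmvn(x)
```

For real-time use, `mu` and `sigma` would instead be running estimates updated frame by frame.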
Vocal Tract Length Normalization (VTLN)
Basic idea: Normalize the acoustic data to take account of changes in vocal tract length
Vocal tract length (VTL):
First larynx descent in first 2–3 years of life
VTL grows according to body size, and is sex-dependent
Puberty: second larynx descent for males
VTL has a large effect on the spectrum
Tube acoustic model: formant positions are inversely proportional to VTL
Observation: formant frequencies for women are 20% higher than for men (on average)
VTLN: compensate for differences between speakers via a warping of the frequency axis
Approaches to VTLN

f → f̂ = g_α(f)

Classify by frequency warping function:
Piecewise linear
Power function
Bilinear transform
Classify by estimation of warping factor α:
Signal-based: estimated directly from the acoustic signal, through explicit estimation of formant positions
Model-based: maximize the likelihood of the observed data given acoustic models and a transcription; α is another parameter set so as to maximize the likelihood
Warping functions: Piecewise linear
[Figure: piecewise-linear warping of the frequency axis, f̂ = αf in the linear region, plotted for α = 0.8, 1.0, 1.2]
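A sketch of one common piecewise-linear warping: f̂ = αf up to a breakpoint, then a final segment chosen so the warped axis still ends at the Nyquist frequency. The breakpoint fraction and its scaling for α > 1 are illustrative choices, not a specific recipe.

```python
import numpy as np

def piecewise_linear_warp(f, alpha, f_nyq=8000.0, break_frac=0.875):
    """Piecewise-linear VTLN warp: f_hat = alpha * f below a breakpoint,
    then a linear segment ending at (f_nyq, f_nyq). The breakpoint is
    scaled down for alpha > 1 so the warp stays monotonic."""
    f = np.asarray(f, dtype=float)
    fb = break_frac * f_nyq / max(alpha, 1.0)   # breakpoint frequency
    lo = alpha * f                              # linear region
    # final segment from (fb, alpha*fb) to (f_nyq, f_nyq)
    slope = (f_nyq - alpha * fb) / (f_nyq - fb)
    hi = alpha * fb + slope * (f - fb)
    return np.where(f <= fb, lo, hi)
```

For any α, frequency 0 maps to 0 and the Nyquist frequency maps to itself, as in the plotted curves.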
Model-based VTLN
Basic idea: Warp the acoustic features (for a speaker) to better fit the models, rather than warping the models to fit the features!
Estimate the warping factor α so as to maximise the likelihood of the acoustic models
After estimating the warp factors, normalize the acoustic data and re-estimate the models
The process may be iterated
Model-based VTLN does not directly estimate vocal tract size; rather it estimates an optimal frequency warping, which may be affected by other factors (eg F0)
Exhaustive search for the optimal warping factor would be expensive
Approximate the log likelihood wrt α as a quadratic, and find the maximum using a line search (Brent's method)
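The quadratic approximation above might be sketched as follows; `log_lik` is a hypothetical stand-in for scoring the warped features against the acoustic models with a fixed transcription.

```python
import numpy as np

def estimate_warp_factor(log_lik, alphas=np.arange(0.8, 1.21, 0.02)):
    """Pick the warp factor maximising log_lik(alpha).

    Rather than an exhaustive fine-grained search, evaluate a coarse
    grid and refine by fitting a quadratic through the three points
    around the best grid point (a simple proxy for Brent's method).
    """
    scores = np.array([log_lik(a) for a in alphas])
    i = int(np.argmax(scores))
    if i == 0 or i == len(alphas) - 1:
        return alphas[i]                  # maximum lies at the grid edge
    x, y = alphas[i - 1:i + 2], scores[i - 1:i + 2]
    a2, a1, _ = np.polyfit(x, y, 2)       # quadratic fit
    return -a1 / (2 * a2)                 # vertex of the parabola

# Toy likelihood surface peaking at alpha = 1.05
peak = estimate_warp_factor(lambda a: -(a - 1.05) ** 2)
```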
Model-based VTLN
[Figure: schematic of model-based VTLN]
VTLN: Warp factor estimation
[Figures: histograms of estimated warp factors for female and male speakers, non-normalized and after passes 1, 2 and 3]
Speaker adaptation in hybrid HMM/NN systems: CMLLR feature transformation
Basic idea: If the HMM/GMM system is used to estimate a single constrained MLLR adaptation transform, this can be viewed as a feature space transform
Use the HMM/GMM system with the same tied state space to estimate a single CMLLR transform for a given speaker, and use this to transform the input speech to the DNN for the target speaker
Can operate unsupervised (since the GMM system estimates the transform)
Limited to a single transform (regression class)
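Viewed as a feature-space transform, applying the speaker's CMLLR transform before the DNN amounts to x' = Ax + b per frame. A and b below are hypothetical near-identity values standing in for the transform the HMM/GMM system would estimate.

```python
import numpy as np

def apply_cmllr(feats, A, b):
    """Apply a constrained MLLR transform in feature space:
    x' = A x + b for every frame. A and b would come from the
    HMM/GMM system's transform estimation for this speaker."""
    return feats @ A.T + b

D = 39
rng = np.random.default_rng(3)
A = np.eye(D) + 0.01 * rng.normal(size=(D, D))  # near-identity transform
b = 0.1 * rng.normal(size=D)
x = rng.normal(size=(100, D))                   # 100 frames of features
x_adapted = apply_cmllr(x, A, b)                # fed to the DNN input
```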
Speaker adaptation in hybrid HMM/NN systems: LIN – Linear Input Network
Basic idea: a single linear input layer trained to map speaker-dependent input speech to the speaker-independent network
Training: the linear input network (LIN) can either be fixed as the identity or (adaptive training) be trained along with the other parameters
Testing: freeze the main (speaker-independent) network and propagate gradients for speech from the target speaker to the LIN, which is updated; a linear transform is learned for each speaker
Requires supervised training data
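A minimal sketch of test-time LIN adaptation, with the frozen SI network reduced to a single linear map and synthetic supervised targets; only the input transform L is updated by gradient descent, never the network itself.

```python
import numpy as np

# Frozen speaker-independent "network", reduced to one linear map N
# for the sake of the sketch (a real system back-propagates through
# the whole DNN); all data here are synthetic.
rng = np.random.default_rng(4)
D, K = 10, 4
N = rng.normal(size=(K, D))            # frozen SI weights
x = rng.normal(size=(200, D))          # frames from the target speaker
t = x @ (0.9 * np.eye(D)) @ N.T        # targets consistent with a mild
                                       #   speaker-specific input scaling

L = np.eye(D)                          # LIN, initialised to the identity
baseline_mse = np.mean((x @ N.T - t) ** 2)

lr = 0.02
for _ in range(1000):
    err = (x @ L.T) @ N.T - t          # forward pass: N (L x) vs targets
    # gradient of squared error w.r.t. L (up to a constant factor);
    # N itself stays frozen
    grad = (N.T @ err.T @ x) / len(x)
    L -= lr * grad

adapted_mse = np.mean(((x @ L.T) @ N.T - t) ** 2)
```

After adaptation the transformed input gives a far lower error than passing the speaker's features straight into the frozen network.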
LIN
[Figure: DNN with 9x39 MFCC inputs, 3–8 hidden layers of ~2000 units, and ~6000 CD phone outputs; a linear input network transforms the inputs, and is the only adapted component while the main network stays fixed]
Speaker adaptation in hybrid HMM/NN systems: Speaker codes
Basic idea: Learn a short speaker code vector for each talker
A generic adaptation NN is inserted above the input layer of the original NN-HMM model; its top layer outputs transformed features of the same size as the input
Each layer of the adaptation NN receives the activations of the layer below together with a speaker-specific vector S, the speaker code
Training: first train the speaker-independent NN as usual; then learn the adaptation NN weights and a code S_i for each training speaker jointly by back-propagation (cross-entropy objective), keeping the original NN weights fixed
Variants: fine-tune the original NN jointly with the adaptation NN, or retrain a new NN on speaker-normalized features (a form of speaker adaptive training)
Adaptation: for a new speaker, estimate only a new speaker code; no weights in the original or adaptation NN change
The code size can be adjusted to the amount of adaptation data, enabling very rapid adaptation; on TIMIT, over 10% relative reduction in phone error rate from only seven adaptation utterances
[Figure: speaker adaptation of the hybrid NN-HMM model based on a speaker code; the adaptation NN is composed with the original network]
(From Abdel-Hamid and Jiang, 2013)
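A forward-pass sketch of the adaptation network; the layer sizes are hypothetical, and the point is only that each adaptation layer receives the speaker code alongside the previous layer's activations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptation_layer(h, code, W):
    """One adaptation-NN layer: sees the previous layer's output
    concatenated with the per-speaker code vector."""
    return sigmoid(W @ np.concatenate([h, code]))

rng = np.random.default_rng(7)
D, C, H = 39, 8, 64                       # feature dim, code size, hidden
W1 = 0.1 * rng.normal(size=(H, D + C))    # shared adaptation weights
W2 = 0.1 * rng.normal(size=(D, H + C))    # top layer: back to feature dim

code = rng.normal(size=C)                 # learned per speaker
x = rng.normal(size=D)                    # one input feature vector
h = adaptation_layer(x, code, W1)
# linear top layer yields the transformed features fed to the original NN
x_transformed = W2 @ np.concatenate([h, code])
```

During training, back-propagation would update both the shared weights W1, W2 and each speaker's `code`; at adaptation time, only `code` is updated.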
Speaker adaptation in hybrid HMM/NN systems: LHUC – Learning Hidden Unit Contributions
Basic idea: Add a learnable speaker-dependent amplitude to each hidden unit
Speaker independent: amplitudes set to 1
Speaker dependent: learn amplitudes from data, per speaker
[Figure: DNN with 3–8 hidden layers of ~2000 units, ~6000 CD phone outputs; each hidden unit's output is multiplied by a per-speaker amplitude r]
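A forward-pass sketch of LHUC for one hidden layer, on synthetic values. Constraining the amplitudes to (0, 2) via a sigmoid reparameterisation is one common choice; here it is an illustrative assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_layer(x, W, r):
    """One hidden layer with LHUC amplitudes: each unit's output is
    scaled elementwise by a per-speaker amplitude r."""
    return r * sigmoid(W @ x)

rng = np.random.default_rng(5)
D, H = 10, 20
W = rng.normal(size=(H, D))              # speaker-independent weights
x = rng.normal(size=D)

r_si = np.ones(H)                        # speaker independent: amplitudes 1
h_si = hidden_layer(x, W, r_si)

# Speaker dependent: amplitudes learned per speaker, here constrained
# to (0, 2) by reparameterising through a sigmoid
a = rng.normal(size=H)                   # learnable per-speaker parameters
r_sd = 2.0 * sigmoid(a)
h_sd = hidden_layer(x, W, r_sd)
```

At adaptation time only the per-speaker parameters `a` are updated; W and all other network weights stay fixed, so unsupervised adaptation is feasible.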
Speaker adaptation in hybrid HMM/NN systems: Experimental results on TED
TED Talks – IWSLT tst2011, WER/%:
DNN: 15.2
+LHUC: 13.7
+CMLLR: 13.9
+CMLLR+LHUC: 12.9
Summary
Speaker Adaptation
One of the most intensive areas of speech recognition research since the early 1990s
HMM/GMM:
Substantial progress, resulting in significant, additive, consistent reductions in word error rate
Close mathematical links between different approaches
Linear transforms at the heart of many approaches
HMM/NN:
Open research topic
GMM-based feature space transforms somewhat effective
Direct weight adaptation less effective
Reading
HMM/GMM:
Gales and Young (2007), The Application of Hidden Markov Models in Speech Recognition, Foundations and Trends in Signal Processing, 1(3), 195–304: section 5.
Woodland (2001), Speaker adaptation for continuous density HMMs: A review, ISCA ITRW on Adaptation Methods for Speech Recognition.
Gales (1998), Maximum likelihood linear transformations for HMM-based speech recognition, Computer Speech and Language, 12:75–98.
HMM/DNN:
Liao (2013), Speaker adaptation of context dependent deep neural networks, Proc IEEE ICASSP.
Abdel-Hamid and Jiang (2013), Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code, Proc IEEE ICASSP.
Swietojanski and Renals (2014), Learning Hidden Unit Contributions for Unsupervised Speaker Adaptation of Neural Network Acoustic Models, Proc IEEE SLT.