Pitch Prediction From MFCC Vectors for Speech Reconstruction
Xu Shao and Ben Milner
School of Computing Sciences,
University of East Anglia, UK
Presented By Yi-Ting
Outline
Introduction
Speech Reconstruction
  Sinusoidal model
Pitch Prediction
  GMM-based prediction
  HMM-based prediction
  Voiced / unvoiced classification
Experimental Results
Conclusion
Introduction
Speech can be reconstructed from MFCC vectors through the inclusion of pitch information.
The aim of this work is to predict the pitch frequency from the MFCC vector.
Several studies have indicated that class-dependent correlation exists between the spectral envelope and pitch.
Speech Reconstruction
Speech reconstruction from MFCC vectors and pitch using the sinusoidal model
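The sinusoidal model synthesizes a speech signal, $x(n)$, as a sum of $L$ sinusoids with amplitudes $A_l$, frequencies $\omega_l$ and phases $\theta_l$:
$$x(n) = \sum_{l=1}^{L} A_l \cos(\omega_l n + \theta_l)$$
An estimate of the spectral envelope can be calculated from an MFCC vector by zero padding and applying an inverse discrete cosine transform (IDCT).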
Speech Reconstruction
The IDCT yields a smoothed magnitude spectral estimate.
Normalization must be applied to remove the effect of pre-emphasis and the non-linear filterbank channel.
The frequency of the sinusoidal components, $\omega_l$, can be estimated from the pitch frequency, $\omega_0$.
The amplitudes, $A_l$, can be computed from the smoothed magnitude spectral estimate.
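As an illustration of the sinusoidal model, the following minimal Python sketch sums L cosines for one frame. It assumes that the components are harmonics of the pitch (the l-th frequency is l times the pitch frequency) and that the phases default to zero; the function name `synthesize_frame` and its parameters are hypothetical and not taken from the paper.

```python
import math


def synthesize_frame(amplitudes, f0_hz, sample_rate, num_samples, phases=None):
    """Sum of L sinusoids: x(n) = sum_l A_l * cos(w_l * n + theta_l)."""
    if phases is None:
        phases = [0.0] * len(amplitudes)  # assumption: zero phase offsets
    w0 = 2.0 * math.pi * f0_hz / sample_rate  # pitch in radians per sample
    signal = []
    for n in range(num_samples):
        sample = 0.0
        for l, (a, theta) in enumerate(zip(amplitudes, phases), start=1):
            w_l = l * w0  # assumption: the l-th component is the l-th harmonic
            sample += a * math.cos(w_l * n + theta)
        signal.append(sample)
    return signal


# Three harmonics of a 100 Hz pitch at an 8 kHz sampling rate (illustrative values).
frame = synthesize_frame([1.0, 0.5, 0.25], f0_hz=100.0, sample_rate=8000, num_samples=80)
```

In a full system the amplitudes would be sampled from the smoothed magnitude spectral estimate at the harmonic frequencies rather than supplied by hand.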
Pitch Prediction
These schemes are based on modeling the joint density of the MFCC vector, x, and pitch frequency, f.
From a set of training data, a series of augmented feature vectors, y, is extracted:
$$y = [x, f]^T$$
Pitch Prediction
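GMM-based prediction
From the training set of augmented vectors, unsupervised clustering is implemented using the EM algorithm to produce a set of K clusters.
Each of these clusters is represented by a Gaussian probability density function with covariance matrix and mean vector
$$\Sigma_k^y = \begin{bmatrix} \Sigma_k^{xx} & \Sigma_k^{xf} \\ \Sigma_k^{fx} & \Sigma_k^{ff} \end{bmatrix} \quad \text{and} \quad \mu_k^y = \begin{bmatrix} \mu_k^x \\ \mu_k^f \end{bmatrix}$$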
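To make the block structure of the cluster statistics concrete, the sketch below computes the mean and covariance of the augmented vectors assigned to a single cluster and splits them into MFCC and pitch parts. It does not implement EM clustering; it assumes the cluster membership is already known, uses the maximum-likelihood covariance estimate, and the name `cluster_statistics` and the dictionary keys are hypothetical.

```python
def cluster_statistics(mfcc_vectors, pitches):
    """Mean and covariance blocks of y = [x, f]^T for one cluster."""
    # Build augmented vectors y = [x_1, ..., x_D, f].
    ys = [list(x) + [f] for x, f in zip(mfcc_vectors, pitches)]
    n, dim = len(ys), len(ys[0])
    d = dim - 1  # MFCC dimension; the last element is the pitch
    mean = [sum(y[j] for y in ys) / n for j in range(dim)]
    # Maximum-likelihood (divide by n) covariance estimate.
    cov = [[sum((y[a] - mean[a]) * (y[b] - mean[b]) for y in ys) / n
            for b in range(dim)] for a in range(dim)]
    return {
        "mu_x": mean[:d],
        "mu_f": mean[d],
        "sigma_xx": [row[:d] for row in cov[:d]],
        "sigma_xf": [row[d] for row in cov[:d]],  # MFCC-pitch cross-covariance
        "sigma_fx": cov[d][:d],                   # pitch-MFCC cross-covariance
        "sigma_ff": cov[d][d],
    }


# Toy data: three 2-dimensional "MFCC" vectors with their pitch values in Hz.
stats = cluster_statistics([[0.0, 1.0], [2.0, 1.0], [4.0, 4.0]], [100.0, 120.0, 140.0])
```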
Pitch Prediction
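GMM-based prediction
Using these cluster-based correlations, a prediction of the pitch frequency, $\hat{f}_i$, can be made from an input MFCC vector, $x_i$.
The closest cluster, $k^*$, is found as
$$k^* = \arg\max_k \left\{ p(x_i \mid C_k^x)\, \alpha_k \right\}$$
where $\alpha_k$ is the prior probability of cluster $k$.
The pitch is then predicted as
$$\hat{f}_i = \mu_{k^*}^f + \Sigma_{k^*}^{fx} (\Sigma_{k^*}^{xx})^{-1} (x_i - \mu_{k^*}^x)^T$$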
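The closest-cluster prediction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the per-cluster log-likelihoods of the MFCC vector are taken as precomputed inputs, each cluster is a dictionary with hypothetical keys (`alpha`, `mu_x`, `mu_f`, `sigma_xx`, `sigma_fx`), and the matrix inverse is replaced by solving a linear system with a small Gaussian-elimination helper.

```python
import math


def solve(matrix, rhs):
    """Solve matrix @ z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        acc = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - acc) / aug[r][r]
    return z


def predict_pitch_closest(x, clusters, log_likelihoods):
    """Pick the cluster maximizing p(x | C_k) * alpha_k, then apply its regression."""
    scores = [ll + math.log(c["alpha"]) for ll, c in zip(log_likelihoods, clusters)]
    k_star = max(range(len(clusters)), key=lambda k: scores[k])
    c = clusters[k_star]
    diff = [xi - mi for xi, mi in zip(x, c["mu_x"])]
    z = solve(c["sigma_xx"], diff)  # equals (Sigma^xx)^-1 (x - mu^x)
    f_hat = c["mu_f"] + sum(s * zi for s, zi in zip(c["sigma_fx"], z))
    return k_star, f_hat


# Two illustrative clusters; the numbers are made up for the example.
clusters = [
    {"alpha": 0.5, "mu_x": [0.0, 0.0], "mu_f": 100.0,
     "sigma_xx": [[2.0, 0.0], [0.0, 1.0]], "sigma_fx": [4.0, 3.0]},
    {"alpha": 0.5, "mu_x": [5.0, 5.0], "mu_f": 200.0,
     "sigma_xx": [[1.0, 0.0], [0.0, 1.0]], "sigma_fx": [0.0, 0.0]},
]
k_star, f_hat = predict_pitch_closest([1.0, 2.0], clusters, [-1.0, -20.0])
```

Working with log-likelihoods rather than raw densities is a design choice made here to avoid numerical underflow for high-dimensional MFCC vectors.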
Pitch Prediction
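GMM-based prediction
An alternative method combines the MAP pitch predictions from all K clusters in the GMM:
$$\hat{f}_i = \sum_{k=1}^{K} h_k(x_i) \left[ \mu_k^f + \Sigma_k^{fx} (\Sigma_k^{xx})^{-1} (x_i - \mu_k^x)^T \right]$$
where $h_k(x_i)$ is the posterior probability of cluster $k$ given $x_i$, with cluster prior $\alpha_k$:
$$h_k(x_i) = \frac{\alpha_k\, p(x_i \mid C_k^x)}{\sum_{j=1}^{K} \alpha_j\, p(x_i \mid C_j^x)}$$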
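The posterior weighting can be illustrated as follows. This sketch assumes that the per-cluster log-likelihoods and the per-cluster pitch predictions have already been computed, and it normalizes in the log domain for numerical stability; the function names `posterior_weights` and `predict_pitch_weighted` are hypothetical.

```python
import math


def posterior_weights(log_likelihoods, priors):
    """h_k = alpha_k p(x|C_k) / sum_j alpha_j p(x|C_j), computed via log scores."""
    scores = [ll + math.log(a) for ll, a in zip(log_likelihoods, priors)]
    top = max(scores)  # subtract the maximum so exp() cannot underflow to all zeros
    unnorm = [math.exp(s - top) for s in scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def predict_pitch_weighted(log_likelihoods, priors, cluster_predictions):
    """Posterior-weighted sum of the per-cluster pitch predictions."""
    h = posterior_weights(log_likelihoods, priors)
    return sum(hk * fk for hk, fk in zip(h, cluster_predictions))


# Equal likelihoods, so the weights reduce to the priors (illustrative values).
h = posterior_weights([-2.0, -2.0], [0.75, 0.25])
f_weighted = predict_pitch_weighted([-2.0, -2.0], [0.75, 0.25], [100.0, 200.0])
```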
Pitch Prediction
HMM-based prediction
To better model the inherent correlation of the feature vector stream, a series of HMMs, w
Pitch Prediction
HMM-based prediction
The first stage of training involves the creation of a set of HMM-based speech models.
The training data is aligned to the speech models using Viterbi decoding, and the vectors belonging to each state, s, of each model, w, are pooled. (Unvoiced vectors are removed to ensure)
Clustering is applied to the pooled vectors within each voiced state using the EM algorithm.
Pitch Prediction
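HMM-based prediction
Prediction of the pitch begins by determining the model and state sequence from the set of models using Viterbi decoding. The pitch is then predicted from the clusters of the decoded state:
$$\hat{f}_i = \sum_{k=1}^{K} h_{k,q_i,m_i}(x_i) \left[ \mu_{k,q_i,m_i}^f + \Sigma_{k,q_i,m_i}^{fx} (\Sigma_{k,q_i,m_i}^{xx})^{-1} (x_i - \mu_{k,q_i,m_i}^x)^T \right]$$
where $q_i$ and $m_i$ denote the state and model that Viterbi decoding assigns to frame $i$.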
Pitch Prediction
Voiced / unvoiced classification
Analysis of the resulting models allows the MFCC stream to be classified into voiced or unvoiced speech.
The voiced state occupancy, $O_{s,w}$, was calculated for each state, s, of each model, w.
Pitch Prediction
Voiced / unvoiced classification
Using the state occupancy, $O_{s,w}$, measured from the training data, the voicing is determined.
The threshold has been determined experimentally, with a reasonable value being 0.2.
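One way to read the voicing decision is sketched below. It rests on two assumptions that the slides do not spell out: the occupancy of a state is taken to be the fraction of training frames aligned to that state that are voiced, and a frame is labeled voiced when the occupancy of its decoded state exceeds the threshold. The function names and the `(model, state)` key layout are hypothetical.

```python
from collections import defaultdict


def voiced_state_occupancy(alignment, voiced_flags):
    """Fraction of training frames in each (model, state) that are voiced."""
    counts = defaultdict(lambda: [0, 0])  # (model, state) -> [voiced, total]
    for key, is_voiced in zip(alignment, voiced_flags):
        counts[key][0] += int(is_voiced)
        counts[key][1] += 1
    return {key: voiced / total for key, (voiced, total) in counts.items()}


def classify_frames(decoded_states, occupancy, threshold=0.2):
    """Label a frame voiced when its decoded state's occupancy exceeds the threshold."""
    # Assumption: states never seen in training are treated as unvoiced.
    return [occupancy.get(key, 0.0) > threshold for key in decoded_states]


# Toy alignment: state 0 of model "w1" is all unvoiced, state 1 is mostly voiced.
alignment = [("w1", 0)] * 5 + [("w1", 1)] * 5
voiced_flags = [False] * 5 + [True, True, True, True, False]
occupancy = voiced_state_occupancy(alignment, voiced_flags)
voicing = classify_frames([("w1", 0), ("w1", 1), ("w1", 1)], occupancy)
```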
Experimental Results
Measure both the accuracy of pitch prediction and the resultant reconstructed speech quality.
ETSI Aurora database: 200 utterances for training and 90 for testing.
Experimental Results
Pitch classification error is measured as
$$E_C = \frac{N_{V/U} + N_{U/V} + N_{>20\%}}{N_{Total}} \times 100\%$$
RMS pitch error is computed as
$$E_p = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( f_i - \hat{f}_i \right)^2}$$
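The two error measures can be computed as in the sketch below. Several details are assumptions made for illustration: unvoiced frames are encoded as a pitch of 0, the three error counts are read as voiced-to-unvoiced errors, unvoiced-to-voiced errors and voiced frames whose relative pitch error exceeds 20%, and the RMS error is taken over frames that are voiced in both the reference and the prediction. The function names are hypothetical.

```python
import math


def classification_error(ref_f0, pred_f0, tolerance=0.2):
    """E_C: percentage of frames with a voicing error or a pitch error above 20%."""
    n_vu = n_uv = n_gross = 0
    for ref, pred in zip(ref_f0, pred_f0):
        if ref > 0 and pred <= 0:
            n_vu += 1     # voiced frame predicted as unvoiced
        elif ref <= 0 and pred > 0:
            n_uv += 1     # unvoiced frame predicted as voiced
        elif ref > 0 and abs(pred - ref) / ref > tolerance:
            n_gross += 1  # voiced frame with relative pitch error above tolerance
    return 100.0 * (n_vu + n_uv + n_gross) / len(ref_f0)


def rms_pitch_error(ref_f0, pred_f0):
    """E_p: root-mean-square pitch error over frames voiced in both tracks."""
    squared = [(r - p) ** 2 for r, p in zip(ref_f0, pred_f0) if r > 0 and p > 0]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative pitch tracks in Hz; 0 marks an unvoiced frame.
reference = [0.0, 100.0, 100.0, 200.0, 0.0]
predicted = [0.0, 110.0, 0.0, 100.0, 50.0]
e_c = classification_error(reference, predicted)
e_p = rms_pitch_error(reference, predicted)
```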
Experimental Results
Increasing the number of clusters in each state of the HMM enables more detailed modeling of the joint distribution of MFCC and pitch.
The significant majority of frame classification errors arise from incorrect voiced/unvoiced decisions, which occur in low-energy regions at the start and end of speech.
Conclusion
Speech reconstructed from the predicted pitch, using a sinusoidal model, is almost indistinguishable from that reconstructed using the reference pitch.