Pitch Prediction From MFCC Vectors for Speech Reconstruction Xu shao and Ben Milner School of Computing Sciences, University of East Anglia, UK Presented

Pitch Prediction From MFCC Vectors for Speech Reconstruction

Xu Shao and Ben Milner

School of Computing Sciences, University of East Anglia, UK

Presented By Yi-Ting

Outline

Introduction

Speech Reconstruction: sinusoidal model

Pitch Prediction: GMM-based prediction, HMM-based prediction, voiced/unvoiced classification

Experimental Results

Conclusion

Introduction

Speech can be reconstructed from MFCC vectors through the inclusion of pitch information.

The aim of this work is to predict the pitch frequency from the MFCC vector.

Several studies have indicated that a class-dependent correlation exists between the spectral envelope and pitch.

Speech Reconstruction

Speech reconstruction from MFCC vectors and pitch using the sinusoidal model

The sinusoidal model synthesises a speech signal, $x(n)$, as a sum of $L$ sinusoids:

$$x(n) = \sum_{l=1}^{L} A_l \cos(\omega_l n + \theta_l)$$

An estimate of the spectral envelope can be calculated from an MFCC vector by zero padding and applying an inverse discrete cosine transform (IDCT).
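The sinusoidal synthesis equation can be illustrated with a minimal sketch. The function name `synthesize`, the 8 kHz sampling rate, the 100 Hz pitch, and the amplitude and phase values are all hypothetical, and placing the components at integer multiples of the pitch frequency is an assumption made for the example rather than something stated on the slide.

```python
import math

def synthesize(amplitudes, frequencies, phases, num_samples):
    """Return x(n) = sum_l A_l * cos(omega_l * n + theta_l) for n = 0..num_samples-1.

    Frequencies are in radians per sample.
    """
    signal = []
    for n in range(num_samples):
        sample = sum(a * math.cos(w * n + p)
                     for a, w, p in zip(amplitudes, frequencies, phases))
        signal.append(sample)
    return signal

# Hypothetical frame: 100 Hz pitch at an 8 kHz sampling rate, three components.
sample_rate = 8000.0
omega_0 = 2.0 * math.pi * 100.0 / sample_rate
freqs = [omega_0 * l for l in (1, 2, 3)]  # assumption: harmonics at multiples of omega_0
amps = [1.0, 0.5, 0.25]                   # toy amplitudes A_l
phases = [0.0, 0.0, 0.0]                  # toy phases theta_l
frame = synthesize(amps, freqs, phases, 80)
```

With these toy values, one pitch period is 80 samples long, so `frame` holds exactly one period of the synthesized waveform.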

Speech Reconstruction

This gives a smoothed magnitude spectral estimate.

Normalization must be applied to remove the effect of pre-emphasis and the non-linear filterbank channels.

The frequencies of the sinusoidal components, $\omega_l$, can be estimated from the pitch frequency, $\omega_0$.

The amplitudes, $A_l$, can be computed from the smoothed magnitude spectral estimate.
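The zero-padding and IDCT step that produces the smoothed estimate can be sketched as follows. This is an illustration under stated assumptions: an orthonormal DCT-II, 23 filterbank channels, and 13 retained coefficients are hypothetical choices, and a real front end may scale its DCT differently. The normalization for pre-emphasis and filterbank effects is not shown.

```python
import numpy as np

def dct_matrix(num_channels):
    """Orthonormal DCT-II matrix; row k is the k-th cosine basis function."""
    k = np.arange(num_channels)[:, None]   # coefficient index
    j = np.arange(num_channels)[None, :]   # filterbank channel index
    mat = np.sqrt(2.0 / num_channels) * np.cos(np.pi * k * (j + 0.5) / num_channels)
    mat[0, :] /= np.sqrt(2.0)              # scale row 0 so the matrix is orthonormal
    return mat

def smoothed_log_filterbank(mfcc, num_channels):
    """Zero-pad a truncated MFCC vector and apply the inverse DCT."""
    padded = np.zeros(num_channels)
    padded[:len(mfcc)] = mfcc
    # For an orthonormal DCT the inverse is the transpose.
    return dct_matrix(num_channels).T @ padded

# Toy data (hypothetical): 23 log filterbank values, truncated to 13 MFCCs.
num_channels, num_coeffs = 23, 13
log_fb = np.log(np.linspace(1.0, 5.0, num_channels))
mfcc = (dct_matrix(num_channels) @ log_fb)[:num_coeffs]
log_fb_smooth = smoothed_log_filterbank(mfcc, num_channels)
envelope_smooth = np.exp(log_fb_smooth)    # back from the log domain
```

Because the higher-order coefficients are discarded, `log_fb_smooth` only approximates `log_fb`; keeping all 23 coefficients would recover it exactly.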

Pitch Prediction

These schemes are based on modeling the joint density of the MFCC vector, $x$, and pitch frequency, $f$.

From a set of training data, a series of augmented feature vectors, $y$, is extracted:

$$y = [x, f]^T$$

Pitch Prediction

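GMM-based prediction

From the training set of augmented vectors, unsupervised clustering is implemented using the EM algorithm to produce a set of $K$ clusters.

Each of these clusters is represented by a Gaussian probability density function with covariance matrix $\Sigma_k^y$ and mean vector $\mu_k^y$:

$$\Sigma_k^y = \begin{bmatrix} \Sigma_k^{xx} & \Sigma_k^{xf} \\ \Sigma_k^{fx} & \Sigma_k^{ff} \end{bmatrix} \quad \text{and} \quad \mu_k^y = [\mu_k^x, \mu_k^f]^T$$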

Pitch Prediction

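GMM-based prediction

Using these cluster-based correlations, a prediction of the pitch frequency, $\hat{f}_i$, can be made from an input MFCC vector, $x_i$.

The closest cluster, $k^*$, is found as

$$k^* = \arg\max_k \left\{ p(x_i \mid C_k^x)\, \alpha_k \right\}$$

where $C_k^x$ denotes the MFCC part of cluster $k$ and $\alpha_k$ is the prior probability of cluster $k$.

The MAP prediction of the pitch is then

$$\hat{f}_i = \mu_{k^*}^f + \Sigma_{k^*}^{fx} \left(\Sigma_{k^*}^{xx}\right)^{-1} \left(x_i - \mu_{k^*}^x\right)^T$$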
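The closest-cluster prediction can be sketched with NumPy. This is an illustration under assumptions, not the authors' implementation: the function names, the two-cluster toy model, and the use of cluster prior weights in the cluster selection are hypothetical choices. Each mean and covariance describes the augmented vector, with the pitch stored as the last element.

```python
import numpy as np

def log_gaussian(x, mean, cov):
    """Log density of a multivariate Gaussian evaluated at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(mean) * np.log(2.0 * np.pi) + logdet + maha)

def predict_pitch_closest(x, weights, means, covs):
    """Pick the most likely cluster for x, then apply its conditional-mean pitch prediction."""
    d = len(x)
    # Score each cluster using only the MFCC part of its mean and covariance.
    scores = [np.log(w) + log_gaussian(x, m[:d], c[:d, :d])
              for w, m, c in zip(weights, means, covs)]
    k = int(np.argmax(scores))
    mu_x, mu_f = means[k][:d], means[k][d]
    cov_xx, cov_fx = covs[k][:d, :d], covs[k][d, :d]
    return mu_f + cov_fx @ np.linalg.solve(cov_xx, x - mu_x), k

# Toy two-cluster model with 2-dimensional "MFCC" vectors (all values hypothetical).
weights = [0.5, 0.5]
means = [np.array([0.0, 0.0, 100.0]), np.array([5.0, 5.0, 200.0])]
covs = [np.array([[1.0, 0.0, 10.0], [0.0, 1.0, 0.0], [10.0, 0.0, 200.0]]),
        np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 20.0], [0.0, 20.0, 500.0]])]

pitch, cluster = predict_pitch_closest(np.array([1.0, 0.0]), weights, means, covs)
```

Working in the log domain and using `np.linalg.solve` instead of an explicit matrix inverse are numerical conveniences; they do not change the prediction.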

Pitch Prediction

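GMM-based prediction

An alternative method combines the MAP pitch predictions from all $K$ clusters in the GMM:

$$\hat{f}_i = \sum_{k=1}^{K} h_k(x_i) \left[ \mu_k^f + \Sigma_k^{fx} \left(\Sigma_k^{xx}\right)^{-1} \left(x_i - \mu_k^x\right)^T \right]$$

where the weight of cluster $k$ is its posterior probability given $x_i$, with $\alpha_k$ the prior probability of cluster $k$:

$$h_k(x_i) = \frac{\alpha_k\, p(x_i \mid C_k^x)}{\sum_{k'=1}^{K} \alpha_{k'}\, p(x_i \mid C_{k'}^x)}$$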
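The weighted combination over all clusters can be sketched in the same way. As before, this is an illustration under assumptions: the function names, the toy two-cluster model, and the inclusion of cluster prior weights in the posterior are hypothetical choices, and the pitch is stored as the last element of each augmented mean.

```python
import numpy as np

def log_gaussian(x, mean, cov):
    """Log density of a multivariate Gaussian evaluated at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(mean) * np.log(2.0 * np.pi) + logdet + maha)

def predict_pitch_weighted(x, weights, means, covs):
    """Posterior-weighted sum of the per-cluster conditional-mean pitch predictions."""
    d = len(x)
    log_post = np.array([np.log(w) + log_gaussian(x, m[:d], c[:d, :d])
                         for w, m, c in zip(weights, means, covs)])
    log_post -= log_post.max()            # subtract the maximum to avoid underflow
    h = np.exp(log_post)
    h /= h.sum()                          # posterior weight of each cluster
    preds = np.array([m[d] + c[d, :d] @ np.linalg.solve(c[:d, :d], x - m[:d])
                      for m, c in zip(means, covs)])
    return float(h @ preds), h

# Toy two-cluster model with 2-dimensional "MFCC" vectors (all values hypothetical).
weights = [0.5, 0.5]
means = [np.array([0.0, 0.0, 100.0]), np.array([5.0, 5.0, 200.0])]
covs = [np.array([[1.0, 0.0, 10.0], [0.0, 1.0, 0.0], [10.0, 0.0, 200.0]]),
        np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 20.0], [0.0, 20.0, 500.0]])]

# A vector halfway between the two cluster means receives equal weights.
pitch_mid, h_mid = predict_pitch_weighted(np.array([2.5, 2.5]), weights, means, covs)
```

When one cluster dominates the posterior, this estimate approaches the closest-cluster prediction; near cluster boundaries it interpolates between the per-cluster predictions.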

Pitch Prediction

HMM-based prediction

To better model the inherent correlation of the feature vector stream, a series of HMMs, $w$, is used.

Pitch Prediction

HMM-based prediction

The first stage of training involves the creation of a set of HMM-based speech models.

The training data is aligned to the speech models using Viterbi decoding, and the vectors belonging to each state, $s$, of each model, $w$, are pooled. (Unvoiced vectors are removed.)

Clustering is applied to the pooled vectors within each voiced state using the EM algorithm.

Pitch Prediction

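HMM-based prediction

Prediction of the pitch proceeds by first determining the model and state sequence from the set of models using Viterbi decoding. With $m_i$ and $q_i$ denoting the model and state assigned to frame $i$, the pitch is predicted from the clusters of that state:

$$\hat{f}_i = \sum_{k=1}^{K} h_{k,q_i,m_i}(x_i) \left[ \mu_{k,q_i,m_i}^f + \Sigma_{k,q_i,m_i}^{fx} \left(\Sigma_{k,q_i,m_i}^{xx}\right)^{-1} \left(x_i - \mu_{k,q_i,m_i}^x\right)^T \right]$$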

Pitch Prediction

Voiced/unvoiced classification

Through analysis of the resulting models, the MFCC stream can be classified into voiced or unvoiced speech.

For each state, $s$, of each model, $w$, a voiced state occupancy, $O_{s,w}$, was calculated.

Pitch Prediction

Voiced/unvoiced classification

Using the state occupancy, $O_{s,w}$, measured from the training data, the voicing is determined.

The voicing threshold has been determined experimentally, with a reasonable value being 0.2.
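A threshold-based voicing decision can be sketched as below. The slide does not spell out how the occupancy is defined or which side of the threshold counts as voiced, so both are assumptions here: the occupancy is taken to be the fraction of a state's training frames that were voiced, and a state is labelled voiced when that fraction exceeds the threshold. The model names, state indices, and occupancy values are hypothetical.

```python
def classify_states(voiced_occupancy, threshold=0.2):
    """Map each (model, state) pair to True (voiced) or False (unvoiced).

    Assumption: a state is voiced when its occupancy exceeds the threshold.
    """
    return {key: occ > threshold for key, occ in voiced_occupancy.items()}

def voicing_for_frames(state_sequence, state_is_voiced):
    """Look up the voicing decision for each frame's decoded (model, state)."""
    return [state_is_voiced[key] for key in state_sequence]

# Hypothetical occupancies measured from training data.
occupancy = {("one", 0): 0.05, ("one", 1): 0.85, ("one", 2): 0.30, ("sil", 0): 0.0}
state_is_voiced = classify_states(occupancy, threshold=0.2)

# Hypothetical Viterbi state sequence for five frames.
decoded = [("sil", 0), ("one", 0), ("one", 1), ("one", 1), ("one", 2)]
frame_voicing = voicing_for_frames(decoded, state_is_voiced)
```

Frames labelled unvoiced would then receive no pitch prediction, while voiced frames are passed to the pitch predictor of their state.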


Experimental Results

The experiments measure both the accuracy of pitch prediction and the resultant reconstructed speech quality.

The ETSI Aurora database is used, with 200 utterances for training and 90 for testing.


Pitch classification error is measured as

$$E_C = \frac{N_{V/U} + N_{U/V} + N_{20\%}}{N_{Total}} \times 100\%$$

RMS pitch error is computed as

$$E_p = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(f_i - \hat{f}_i\right)^2}$$
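Both error measures are simple to compute once the frame counts and pitch tracks are available. The sketch below is illustrative: the function names and the numbers in the usage lines are hypothetical, and the three error counts are simply passed in as arguments, since the slide does not define how each count is obtained.

```python
import math

def classification_error(n_vu, n_uv, n_20, n_total):
    """Percentage classification error: (N_V/U + N_U/V + N_20%) / N_Total * 100."""
    return 100.0 * (n_vu + n_uv + n_20) / n_total

def rms_pitch_error(reference, predicted):
    """Root-mean-square difference between reference and predicted pitch values."""
    n = len(reference)
    return math.sqrt(sum((f - f_hat) ** 2 for f, f_hat in zip(reference, predicted)) / n)

# Hypothetical counts and pitch values, for illustration only.
e_c = classification_error(n_vu=3, n_uv=2, n_20=5, n_total=100)
e_p = rms_pitch_error([100.0, 120.0], [103.0, 116.0])
```

The RMS error is in the same unit as the pitch values passed in, so pitch in Hz gives an error in Hz.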



Increasing the number of clusters in each state of the HMM enables more detailed modeling of the joint distribution of MFCC and pitch.

The significant majority of frame classification errors arise from incorrect voiced/unvoiced decisions, which occur in low-energy regions at the start and end of speech.



Conclusion

Speech reconstructed from the predicted pitch, using a sinusoidal model, is almost indistinguishable from that reconstructed using the reference pitch.
