Voice Transformation
Project by:
Asaf Rubin
Michael Katz
Under the guidance of:
Dr. Izhar Levner
Objective
Contents
• Conversion Scheme
• Analysis – speech production model, preprocessing analysis
• Transformation
• Synthesis
• Results, Conclusions & Future Plans
Conversion Scheme
Source Orator → Speech Analysis → Source Parameters → Transformation Function → Target Parameters → Speech Synthesis → Target Orator
Requires robust parameterization of speech.
Transformation is performed on-line, based on prior off-line data alignment via codebooks, histogram equalization, or neural networks.
Vocal Tract Model – a linear all-pole filter, varying slowly in time relative to the pitch period:

V(z) = G / (1 - Σ_{k=1}^{p} a_k z^-k)
Radiation Model – simulates radiation at the lips: a differentiation filter with constant parameters.
Unvoiced Excitation – white random noise.
This model was derived from the analytical solution of the acoustic speech model equations.
Conversion Scheme Analysis
Speech Production Model
A voiced/unvoiced switch selects the excitation, which drives the Glottal Pulse Model G(z), the Vocal Tract Model V(z), and the Radiation Model R(z) in cascade.
Voiced Excitation – an impulse train with the pitch period, passed through the glottal pulse model.
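The production model above can be sketched as a minimal source-filter loop. This is an illustration, not the project's code: the glottal pulse model is omitted (the voiced excitation is a bare impulse train), the radiation model is a plain first difference, and the function names are hypothetical.

```python
import numpy as np

def synthesize_frame(a, gain, excitation):
    """Pass an excitation through the all-pole vocal tract filter
    V(z) = G / (1 - sum_k a_k z^-k), then through the radiation model
    (a differentiation filter with constant parameters)."""
    p = len(a)
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = gain * excitation[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                acc += a[k - 1] * y[n - k]
        y[n] = acc
    # Radiation model: first difference, simulating radiation at the lips.
    return np.diff(y, prepend=0.0)

def voiced_excitation(n_samples, pitch_period):
    """Impulse train with the given pitch period (glottal pulse model omitted)."""
    e = np.zeros(n_samples)
    e[::pitch_period] = 1.0
    return e

# Unvoiced excitation is simply white random noise.
rng = np.random.default_rng(0)
unvoiced = rng.standard_normal(160)
voiced = voiced_excitation(160, 80)
frame = synthesize_frame(np.array([0.5]), 1.0, voiced)
```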
Signal Cleaning – noise reduction through use of the signal's energy and zero-crossing computations.
Conversion Scheme Analysis
Source Parameters Estimation
Source Signal → Signal Cleaning → Phoneme Segmentation → Pitch Estimation → LPC Estimation → LSP Conversion → Glottal Pulse Estimation → Global Parameters Estimation → Source Parameters
Phoneme Segmentation – manual or semi-automatic, using energy, zero-crossing, and pitch; or automatic, using Hidden Markov Models.
Pitch Estimation – evaluation of each phoneme's pitch contour.
LPC Estimation – calculation of the Linear Prediction Coefficient set for each phoneme.
LSP Conversion – calculation of the Line Spectrum Pairs for each work frame.
Glottal Pulse Parameters Estimation – calculated for the corresponding work frames of each phoneme.
Global Parameters Estimation – phoneme characteristics such as duration and global LSP.
Conversion Scheme Transformation
Source parameters are mapped to target parameters:
• Pitch and Glottal Pulse Parameters – via the GPP transformation.
• LSP – via the source and target LSP code books.
• Duration – via the duration code book.
Transformation Function
Source Code Book ↔ Target Code Book; input: the phoneme's LSP and a distance measure.
Find the source codeword closest to the phoneme's LSP (given the distance measure).
There is a one-to-one correspondence between source and target codeword entries.
Transform the phoneme's duration according to the average source and target durations of the corresponding codeword:

D_T = D_S · D_TCB(n) / D_SCB(n)

where D_SCB(n) and D_TCB(n) are the average source and target durations stored in the n-th codeword.
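As a worked example of this duration rule (the function name and values are hypothetical):

```python
def transform_duration(d_source, avg_source_cb, avg_target_cb):
    """Scale the source phoneme duration by the ratio of the average
    target and source durations stored in the matched codeword n."""
    return d_source * avg_target_cb / avg_source_cb

# A 120 ms source phoneme whose codeword averages are 100 ms (source)
# and 150 ms (target) maps to 180 ms.
print(transform_duration(120.0, 100.0, 150.0))  # prints 180.0
```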
For each work frame, transform the LSP through the secondary one-to-one source-target LSP code books corresponding to the n-th code words of the primary books.
For each work frame, transform the pitch and energy through histogram equalization, using the source and target histograms of the n-th code word.
The residue is substituted by the one corresponding to the target LSP, obtained via the secondary codebooks.
For each phoneme the LSP and duration are extracted. Given identical source and target utterances, the phoneme coordination is done manually (with the aid of preliminary phoneme segmentation) or using HMM.
The LSP of target phonemes, corresponding to source LSP in each quantization region, are clustered to obtain the primary codebook, with centroids of phonemes’ LSP as codewords.
Averaging the durations corresponding to source and target phonemes at each quantization region gives the codebook for phonemes’ durations.
For each of the work frames of every phoneme the LSP, residue, pitch and energy are extracted. Vector quantization is performed upon the phonemes’ LSP of the source, clustering the similar phonemes.
Codebooks Creation – Training Stage
Source & Target Utterances → Phoneme Coordination → Phoneme Parameter Calculation → Source LSP Quantization
Target LSP Clustering → Primary Codebook; Duration Averaging → Duration Codebook
Work Frame Parameter Calculation → Work Frame Coordination → Secondary Codebook; Pitch & Energy Histograms
Conversion Scheme Transformation
The source-target coordination at the work-frame level is achieved using Dynamic Time Warping; thus, for each primary codeword, the aligned LSP pairs of the corresponding phonemes establish the secondary codebook.
For each primary codeword, the pitch and energy information of every work frame of the corresponding phonemes is used to create the source and target histograms. The normalized residues corresponding to the aligned LSP are kept as well.
For each phoneme, the excitation for each work frame is (according to the model) either (1) an impulse pair with the given pitch and energy (voiced), or (2) the residue interpolated/decimated to a two-pitch length. The work frames are linearly interpolated according to the duration. The speech is produced by exciting the prediction filter with the corresponding coefficients.
Conversion Scheme Synthesis
Target Speech Production
Target parameters (LSP, duration, pitch, GPP) → Excitation Generation, LPC Conversion, Duration Control → V(z) → Target speech
Vocal Results
Vocal Coding
[Audio-sample grid: source (1, 2) and target (1, 2) utterances. Legend: no codebook / phoneme codebook / clustered codebook; excitation variants: non-modified pitch, modified pitch, residue.]
Vocal Results
Conversion
[Audio-sample grid: source-to-target conversions for speakers 1 and 2. Legend: no codebook / phoneme codebook / clustered codebook; excitation variants: non-modified pitch, modified pitch, residue.]
Conclusions
• The parametric approach with codebooks attains waveform coding at about 5600 bps.
• The training-stage phoneme clustering allows global-parameter (pitch, duration) conversion and balances between a global work-frame search and single-phoneme correspondence.
• LSP conversion alone fails to capture significant voice characteristics.
• The quality difference between conversion based on the Euclidean and I-S distances is insignificant.
Future Plans
• The parametric approach limits the optimal conversion to 5600 bps quality.
• Improve the parametric model (GPP), or use non-parametric conversion with a residue codebook (CELP).
• A better clustering method (other than VQ) may improve global-parameter conversion as well as phoneme recognition.
• Improve the LSP transformation/interpolation.
DTW determines the optimal "least-cost" path through the grid, minimizing the sum of the visited nodes' costs.
[DTW grid: target frame index i along one axis (0…I), source frame index j along the other (0…J).]
For a given phoneme, we set:
Work-frame parameters (LPC or LSP) of the target along the i axis and of the source along the j axis.
Node cost – the distance between the corresponding source and target parameters.
Conversion Scheme Transformation
Dynamic Time Warping
Path constraints, to avoid distortion, force time to advance while limiting the stretching/contraction ratio.
The optimal path determines desired alignment through node pairs.
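The least-cost path computation can be sketched as follows. This is a minimal version with simple symmetric steps; the slope constraints limiting stretching/contraction mentioned above are omitted, and the function name is illustrative.

```python
import numpy as np

def dtw(cost):
    """Least-cost monotone path through a cost grid, where cost[i, j] is the
    distance between target frame i and source frame j parameters."""
    I, J = cost.shape
    acc = np.full((I, J), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            # Allowed predecessors enforce monotone time advancement.
            prev = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = cost[i, j] + prev
    # Backtrack to recover the alignment as (target, source) node pairs.
    path = [(I - 1, J - 1)]
    i, j = I - 1, J - 1
    while (i, j) != (0, 0):
        candidates = [(a, b) for a, b in ((i - 1, j - 1), (i - 1, j), (i, j - 1))
                      if a >= 0 and b >= 0]
        i, j = min(candidates, key=lambda ij: acc[ij])
        path.append((i, j))
    return acc[-1, -1], path[::-1]
```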
Code vectors {Y_i}, i = 1…N; quantization regions {V_i}, i = 1…N; d – a distance measure (Euclidean or I-S).
Q(X) = Y_j iff X ∈ V_j, i.e. d(X, Y_j) ≤ d(X, Y_i) for all i.
Conversion Scheme Transformation
Vector Quantization – VQ subdivides the space into quantization regions, each represented by a code vector.
Given a training sequence of LSP vectors {x_i}, i = 1…M, we find the {Y_i} and {V_i}, i = 1…N, which result in the smallest average distortion:

D = (1/M) Σ_{i=1}^{M} d(x_i, Q(x_i))
We use the LBG algorithm, with PNN initialization for the Euclidean distance and random initialization for the I-S distance.
Advantages – robustness to errors; support of inter-vector operations.
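A minimal version of the codebook design can be sketched as follows. This uses the Euclidean distance only; the slides use PNN initialization, while this sketch uses perturbation splitting for brevity, so it is an illustration rather than the project's estimator.

```python
import numpy as np

def lbg(training, n_codewords, n_iter=20, eps=1e-3):
    """Minimal LBG codebook design: start from the global centroid, split
    every codeword by a small perturbation, then refine with Lloyd
    iterations (nearest-codeword partition followed by centroid update)."""
    codebook = training.mean(axis=0, keepdims=True)
    while len(codebook) < n_codewords:
        # Split each code vector into a perturbed pair.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):
            # Quantization regions V_i: nearest-codeword assignment.
            d = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
            labels = d.argmin(axis=1)
            # Code vectors Y_i: centroid of each non-empty region.
            for i in range(len(codebook)):
                if np.any(labels == i):
                    codebook[i] = training[labels == i].mean(axis=0)
    return codebook
```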
Given the LPC, define:

A(z) = 1 - Σ_{k=1}^{p} a_k z^-k

The LSP are the positive angles of the roots of:

P(z) = A(z) + z^-(p+1) A(z^-1)
Q(z) = A(z) - z^-(p+1) A(z^-1)
Conversion Scheme Analysis
LSP Conversion
[Unit-circle diagram: the roots of P(z) and Q(z), interleaved on the unit circle, give the LSP; the poles of V(z) give the LPC.]
For a stable vocal filter, the roots of P and Q lie on the unit circle and are interleaved.
Close P-Q pairs correspond to dominant formants (vocal-filter poles).
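The definitions above translate directly into code. This is a sketch: `lsp_from_lpc` is an illustrative name, the LPC are assumed given as the predictor coefficients a_k, and numerical root finding is used.

```python
import numpy as np

def lsp_from_lpc(a):
    """Line Spectrum Pairs from LPC.  With A(z) = 1 - sum_k a_k z^-k,
    form P(z) = A(z) + z^-(p+1) A(z^-1) and Q(z) = A(z) - z^-(p+1) A(z^-1);
    the LSP are the positive angles of their roots, which for a stable
    filter lie interleaved on the unit circle."""
    # Coefficient vector of A(z), padded to degree p+1.
    a_poly = np.concatenate([[1.0], -np.asarray(a, dtype=float), [0.0]])
    p_poly = a_poly + a_poly[::-1]   # P(z): palindromic part
    q_poly = a_poly - a_poly[::-1]   # Q(z): anti-palindromic part
    angles = []
    for poly in (p_poly, q_poly):
        roots = np.roots(poly)
        # Keep strictly positive angles; the trivial roots at z = 1 and
        # z = -1 (angles 0 and pi) are excluded.
        angles.extend(ang for ang in np.angle(roots) if 0 < ang < np.pi)
    return np.sort(angles)
```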
Euclidean – the squared distance between source and target LSP:

d(Lsp1, Lsp2) = Σ_{i=1}^{p} (Lsp1(i) - Lsp2(i))²
Conversion Scheme Transformation
Speech Distance Measures
Itakura-Saito (gain-normalized), in matrix notation:

d_IS(a1, a2) = (a2 - a1)^t R1 (a2 - a1)

where a1, a2 are the LPC vectors and R1 is the covariance matrix of the process 1/A1(z) excited by normalized white noise.
Motivation – the error variance of any random process passed through the error filter

Y(z) = 1 - Σ_{k=1}^{p} a_k z^-k

is E[e²] = Y^t R Y.
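Both measures can be written compactly in code. These are illustrative helpers, not the project's implementation; in particular, the covariance matrix R1 must be supplied by the caller.

```python
import numpy as np

def lsp_distance(lsp1, lsp2):
    """Euclidean measure: squared distance between source and target LSP."""
    return float(np.sum((np.asarray(lsp1) - np.asarray(lsp2)) ** 2))

def is_distance(a1, a2, r1):
    """Gain-normalized Itakura-Saito-style measure in matrix form:
    delta^t R1 delta, where a1 and a2 are LPC vectors and R1 is the
    covariance matrix of the process 1/A1(z) driven by normalized
    white noise (supplied by the caller)."""
    delta = np.asarray(a2, dtype=float) - np.asarray(a1, dtype=float)
    return float(delta @ np.asarray(r1) @ delta)
```

With R1 equal to the identity, the I-S form degenerates to the squared Euclidean distance between the LPC vectors, which makes the two measures easy to compare.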
[Figure: Source Pitch Histogram and Target Pitch Histogram (pitch axis 50–120).]
Given the histograms, we calculate the source and target histogram equalization (cumulative) functions:

T(n) = Σ_{m=0}^{n} H_s(m)  and  G(n) = Σ_{m=0}^{n} H_t(m)
Conversion Scheme Transformation
Histogram Equalization
Given the source pitch value p_s, the target pitch value is calculated by:

p_t = G^{-1}(T(p_s))
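A minimal binned sketch of this transform follows; the function name and bin handling are illustrative, not the project's code.

```python
import numpy as np

def equalize(p_source, source_hist, target_hist, bins):
    """Histogram-equalization transform: cumulative source function T,
    cumulative target function G, then p_t = G^-1(T(p_s))."""
    T = np.cumsum(source_hist) / np.sum(source_hist)
    G = np.cumsum(target_hist) / np.sum(target_hist)
    # T(p_s): cumulative value at the source pitch bin.
    s_bin = np.searchsorted(bins, p_source)
    u = T[min(s_bin, len(T) - 1)]
    # G^-1(u): first target bin whose cumulative value reaches u.
    t_bin = int(np.searchsorted(G, u))
    return bins[min(t_bin, len(bins) - 1)]
```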
Conversion Scheme Analysis
Hidden Markov Models
Speech utterance → Segmentation → for each segment: Initialization → Calculation → Determination → Pitch, Voiced/Unvoiced → next segment.
Segmentation – constant segment length and overlap (step length); the pitch value is determined for each segment.
Initialization – set two adjacent segments of an arbitrary minimal length, the estimated pitch period.
Calculation
• Increase the segments' length.
• For each length, calculate their cross-correlation.
• Stop at an arbitrary maximal length.
Determination
• Pitch period – the length of the segment with the maximal cross-correlation value.
• The cross-correlation must exceed a given threshold; otherwise the segment is classified as unvoiced.
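The adjacent-segment scheme above can be sketched as follows. The name, the normalized-correlation form, and the threshold default are illustrative assumptions.

```python
import numpy as np

def estimate_pitch(x, start, min_len, max_len, threshold=0.5):
    """Grow two adjacent segments from min_len to max_len, compute their
    normalized cross-correlation, and take the length that maximizes it as
    the pitch period; if the best correlation misses the threshold, the
    segment is classified as unvoiced (returns None)."""
    best_len, best_corr = None, -1.0
    for L in range(min_len, max_len + 1):
        s1 = x[start:start + L]
        s2 = x[start + L:start + 2 * L]
        if len(s2) < L:
            break  # ran past the end of the utterance
        denom = np.linalg.norm(s1) * np.linalg.norm(s2)
        if denom == 0:
            continue  # silent segment pair
        corr = float(np.dot(s1, s2) / denom)
        if corr > best_corr:
            best_len, best_corr = L, corr
    if best_corr < threshold:
        return None  # unvoiced
    return best_len
```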
Conversion Scheme Analysis
Pitch Estimation
Signal, Pitch → Segmentation → Windowing → Pre-emphasis → Calculation → LPC.
Windowing – multiply each segment by a Hamming window; overlapping Hamming windows sum to an approximately rectangular window.
Segmentation – work frame: a segment of twice the pitch period for voiced speech, or of constant duration for unvoiced speech; the segments overlap by half.
Conversion Scheme Analysis
LPC Estimation
Pre-emphasis – a constant-parameter HPF compensating for the spectral tilt due to lip radiation.
Calculation – the gain and the denominator coefficients are estimated using Linear Prediction methods.
[Figure: the signal's FFT with the spectral envelope of V(f) overlaid.]
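The pipeline above (pre-emphasis, Hamming windowing, LPC calculation) can be sketched with the autocorrelation method solved by the Levinson-Durbin recursion. This is an illustration under stated assumptions; the project's actual estimator and pre-emphasis constant may differ.

```python
import numpy as np

def lpc(frame, order, preemph=0.95):
    """LPC estimation: pre-emphasis HPF 1 - preemph*z^-1 (compensating the
    spectral tilt due to lip radiation), Hamming windowing, then the
    autocorrelation method via Levinson-Durbin.  Returns (a, gain) for the
    model V(z) = G / (1 - sum_k a_k z^-k)."""
    x = np.append(frame[0], frame[1:] - preemph * frame[:-1])  # pre-emphasis
    x = x * np.hamming(len(x))                                 # windowing
    # Autocorrelation sequence r[0..order].
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order)
    err = r[0]
    for i in range(order):  # Levinson-Durbin recursion
        k = (r[i + 1] - np.dot(a[:i], r[1:i + 1][::-1])) / err
        a[:i] = a[:i] - k * a[:i][::-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, np.sqrt(err)
```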
We use two methods of residue coding: full residue preservation – obtained by passing the speech segment through the prediction error filter:
Conversion Scheme Analysis
GPP Estimation
1 / V(z)

Residue's energy only – the excitation is a pitch-period impulse train (voiced) or noise (unvoiced).