Voice Transformation
Project by:
Asaf Rubin
Michael Katz
Under the guidance of:
Dr. Izhar Levner
Objective
Contents
• Conversion Scheme
• Analysis – speech production model, preprocessing analysis
• Transformation
• Synthesis
• Results, Conclusions & Future Plans
Conversion Scheme
Source Orator → Speech Analysis → Source Parameters → Transformation Function → Target Parameters → Speech Synthesis → Target Orator
Requires robust parameterization of speech.
Transformation is performed on-line, based on prior off-line data alignment via codebooks, histogram equalization, or neural networks.
Vocal Tract Model – a linear all-pole filter, varying slowly in time relative to the pitch period:

V(z) = G / (1 - Σ_{k=1}^{p} a_k z^-k)
Radiation Model – simulates radiation at the lips: a differentiation filter with constant parameters.
Unvoiced Excitation – white random noise.
This model was derived from the analytical solution of the acoustic speech model equations.
Conversion Scheme Analysis
Speech Production Model
A voiced/unvoiced switch selects the excitation, which drives the Glottal Pulse Model G(z), the Vocal Tract Model V(z), and the Radiation Model R(z) in cascade.
Voiced Excitation – an impulse train with the pitch period, passed through the glottal pulse model.
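The production model above can be sketched as a minimal source-filter loop. This is an illustration, not the project's code: the glottal pulse model is omitted (the voiced excitation is a bare impulse train), the radiation model is a plain first difference, and the function names are hypothetical.

```python
import numpy as np

def synthesize_frame(a, gain, excitation):
    """Pass an excitation through the all-pole vocal tract filter
    V(z) = G / (1 - sum_k a_k z^-k), then through the radiation model
    (a differentiation filter with constant parameters)."""
    p = len(a)
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = gain * excitation[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                acc += a[k - 1] * y[n - k]
        y[n] = acc
    # Radiation model: first difference, simulating radiation at the lips.
    return np.diff(y, prepend=0.0)

def voiced_excitation(n_samples, pitch_period):
    """Impulse train with the given pitch period (glottal pulse model omitted)."""
    e = np.zeros(n_samples)
    e[::pitch_period] = 1.0
    return e

# Unvoiced excitation is simply white random noise.
rng = np.random.default_rng(0)
unvoiced = rng.standard_normal(160)
voiced = voiced_excitation(160, 80)
frame = synthesize_frame(np.array([0.5]), 1.0, voiced)
```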
Signal Cleaning – noise reduction through use of the signal's energy and zero-crossing computations.
Conversion Scheme Analysis
Source Parameters Estimation
Source Signal → Signal Cleaning → Phoneme Segmentation → Pitch Estimation → LPC Estimation → LSP Conversion → Glottal Pulse Estimation → Global Parameters Estimation → Source Parameters
Phoneme Segmentation – manual or semi-automatic, using energy, zero-crossing, and pitch; or automatic, using Hidden Markov Models.
Pitch Estimation – evaluation of each phoneme's pitch contour.
LPC Estimation – calculation of the Linear Prediction Coefficient set for each phoneme.
LSP Conversion – calculation of the Line Spectrum Pairs for each work frame.
Glottal Pulse Parameters Estimation – calculated for the corresponding work frames of each phoneme.
Global Parameters Estimation – phoneme characteristics such as duration and global LSP.
Conversion Scheme Transformation
Source parameters are mapped to target parameters:
• Pitch and Glottal Pulse Parameters – via the GPP transformation.
• LSP – via the source and target LSP code books.
• Duration – via the duration code book.
Transformation Function
Source Code Book ↔ Target Code Book; input: the phoneme's LSP and a distance measure.
Find the source codeword closest to the phoneme's LSP (given the distance measure).
There is a one-to-one correspondence between source and target codeword entries.
Transform the phoneme's duration according to the average source and target durations of the corresponding codeword:

D_T = D_S · D_TCB(n) / D_SCB(n)

where D_SCB(n) and D_TCB(n) are the average source and target durations stored in the n-th codeword.
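As a worked example of this duration rule (the function name and values are hypothetical):

```python
def transform_duration(d_source, avg_source_cb, avg_target_cb):
    """Scale the source phoneme duration by the ratio of the average
    target and source durations stored in the matched codeword n."""
    return d_source * avg_target_cb / avg_source_cb

# A 120 ms source phoneme whose codeword averages are 100 ms (source)
# and 150 ms (target) maps to 180 ms.
print(transform_duration(120.0, 100.0, 150.0))  # prints 180.0
```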
For each work frame, transform the LSP through the secondary one-to-one source-target LSP code books corresponding to the n-th code words of the primary books.
For each work frame, transform the pitch and energy through histogram equalization, using the source and target histograms of the n-th code word.
The residue is substituted by the one corresponding to the target LSP, obtained via the secondary codebooks.
For each phoneme the LSP and duration are extracted. Given identical source and target utterances, the phoneme coordination is done manually (with the aid of preliminary phoneme segmentation) or using HMM.
The LSP of target phonemes, corresponding to source LSP in each quantization region, are clustered to obtain the primary codebook, with centroids of phonemes’ LSP as codewords.
Averaging the durations corresponding to source and target phonemes at each quantization region gives the codebook for phonemes’ durations.
For each of the work frames of every phoneme the LSP, residue, pitch and energy are extracted. Vector quantization is performed upon the phonemes’ LSP of the source, clustering the similar phonemes.
Codebooks Creation – Training Stage
Source & Target Utterances → Phoneme Coordination → Phoneme Parameter Calculation → Source LSP Quantization
Target LSP Clustering → Primary Codebook; Duration Averaging → Duration Codebook
Work Frame Parameter Calculation → Work Frame Coordination → Secondary Codebook; Pitch & Energy Histograms
Conversion Scheme Transformation
The source-target coordination at the work-frame level is achieved using Dynamic Time Warping; thus, for each primary codeword, the aligned LSP pairs of the corresponding phonemes establish the secondary codebook.
For each primary codeword, the pitch and energy information of every work frame of the corresponding phonemes is used to create the source and target histograms. The normalized residues corresponding to the aligned LSP are kept as well.
For each phoneme, the excitation for each work frame is (according to the model) either (1) an impulse pair with the given pitch and energy (voiced), or (2) the residue interpolated/decimated to a two-pitch length. The work frames are linearly interpolated according to the duration. The speech is produced by exciting the prediction filter with the corresponding coefficients.
Conversion Scheme Synthesis
Target Speech Production
Target parameters (LSP, duration, pitch, GPP) → Excitation Generation, LPC Conversion, Duration Control → V(z) → Target speech
Vocal Results
Vocal Coding
[Audio-sample grid: source (1, 2) and target (1, 2) utterances. Legend: no codebook / phoneme codebook / clustered codebook; excitation variants: non-modified pitch, modified pitch, residue.]
Vocal Results
Conversion
[Audio-sample grid: source-to-target conversions for speakers 1 and 2. Legend: no codebook / phoneme codebook / clustered codebook; excitation variants: non-modified pitch, modified pitch, residue.]
Conclusions
• The parametric approach with codebooks attains waveform coding at about 5600 bps.
• The training-stage phoneme clustering allows global-parameter (pitch, duration) conversion and balances between a global work-frame search and single-phoneme correspondence.
• LSP conversion alone fails to capture significant voice characteristics.
• The quality difference between conversion based on the Euclidean and I-S distances is insignificant.
Future Plans
• The parametric approach limits the optimal conversion to 5600 bps quality.
• Improve the parametric model (GPP), or use non-parametric conversion with a residue codebook (CELP).
• A better clustering method (other than VQ) may improve global-parameter conversion as well as phoneme recognition.
• Improve the LSP transformation/interpolation.
DTW determines the optimal "least-cost" path through the grid, minimizing the sum of the visited nodes' costs.
[DTW grid: target frame index i along one axis (0…I), source frame index j along the other (0…J).]
For a given phoneme, we set:
Work-frame parameters (LPC or LSP) of the target along the i axis and of the source along the j axis.
Node cost – the distance between the corresponding source and target parameters.
Conversion Scheme Transformation
Dynamic Time Warping
Path constraints, to avoid distortion, force time to advance while limiting the stretching/contraction ratio.
The optimal path determines desired alignment through node pairs.
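The least-cost path computation can be sketched as follows. This is a minimal version with simple symmetric steps; the slope constraints limiting stretching/contraction mentioned above are omitted, and the function name is illustrative.

```python
import numpy as np

def dtw(cost):
    """Least-cost monotone path through a cost grid, where cost[i, j] is the
    distance between target frame i and source frame j parameters."""
    I, J = cost.shape
    acc = np.full((I, J), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            # Allowed predecessors enforce monotone time advancement.
            prev = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = cost[i, j] + prev
    # Backtrack to recover the alignment as (target, source) node pairs.
    path = [(I - 1, J - 1)]
    i, j = I - 1, J - 1
    while (i, j) != (0, 0):
        candidates = [(a, b) for a, b in ((i - 1, j - 1), (i - 1, j), (i, j - 1))
                      if a >= 0 and b >= 0]
        i, j = min(candidates, key=lambda ij: acc[ij])
        path.append((i, j))
    return acc[-1, -1], path[::-1]
```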
Code vectors {Y_i}, i = 1…N; quantization regions {V_i}, i = 1…N; d – a distance measure (Euclidean or I-S).
Q(X) = Y_j iff X ∈ V_j, i.e. d(X, Y_j) ≤ d(X, Y_i) for all i.
Conversion Scheme Transformation
Vector Quantization – VQ subdivides the space into quantization regions, each represented by a code vector.
Given a training sequence of LSP vectors {x_i}, i = 1…M, we find the {Y_i} and {V_i}, i = 1…N, which result in the smallest average distortion:

D = (1/M) Σ_{i=1}^{M} d(x_i, Q(x_i))
We use the LBG algorithm, with PNN initialization for the Euclidean distance and random initialization for the I-S distance.
Advantages – robustness to errors; support of inter-vector operations.
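A minimal version of the codebook design can be sketched as follows. This uses the Euclidean distance only; the slides use PNN initialization, while this sketch uses perturbation splitting for brevity, so it is an illustration rather than the project's estimator.

```python
import numpy as np

def lbg(training, n_codewords, n_iter=20, eps=1e-3):
    """Minimal LBG codebook design: start from the global centroid, split
    every codeword by a small perturbation, then refine with Lloyd
    iterations (nearest-codeword partition followed by centroid update)."""
    codebook = training.mean(axis=0, keepdims=True)
    while len(codebook) < n_codewords:
        # Split each code vector into a perturbed pair.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):
            # Quantization regions V_i: nearest-codeword assignment.
            d = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
            labels = d.argmin(axis=1)
            # Code vectors Y_i: centroid of each non-empty region.
            for i in range(len(codebook)):
                if np.any(labels == i):
                    codebook[i] = training[labels == i].mean(axis=0)
    return codebook
```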
Given the LPC, define:

A(z) = 1 - Σ_{k=1}^{p} a_k z^-k

The LSP are the positive angles of the roots of:

P(z) = A(z) + z^-(p+1) A(z^-1)
Q(z) = A(z) - z^-(p+1) A(z^-1)
Conversion Scheme Analysis
LSP Conversion
[Unit-circle diagram: the roots of P(z) and Q(z), interleaved on the unit circle, give the LSP; the poles of V(z) give the LPC.]
For a stable vocal filter, the roots of P and Q lie on the unit circle and are interleaved.
Close P-Q pairs correspond to dominant formants (vocal-filter poles).
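The definitions above translate directly into code. This is a sketch: `lsp_from_lpc` is an illustrative name, the LPC are assumed given as the predictor coefficients a_k, and numerical root finding is used.

```python
import numpy as np

def lsp_from_lpc(a):
    """Line Spectrum Pairs from LPC.  With A(z) = 1 - sum_k a_k z^-k,
    form P(z) = A(z) + z^-(p+1) A(z^-1) and Q(z) = A(z) - z^-(p+1) A(z^-1);
    the LSP are the positive angles of their roots, which for a stable
    filter lie interleaved on the unit circle."""
    # Coefficient vector of A(z), padded to degree p+1.
    a_poly = np.concatenate([[1.0], -np.asarray(a, dtype=float), [0.0]])
    p_poly = a_poly + a_poly[::-1]   # P(z): palindromic part
    q_poly = a_poly - a_poly[::-1]   # Q(z): anti-palindromic part
    angles = []
    for poly in (p_poly, q_poly):
        roots = np.roots(poly)
        # Keep strictly positive angles; the trivial roots at z = 1 and
        # z = -1 (angles 0 and pi) are excluded.
        angles.extend(ang for ang in np.angle(roots) if 0 < ang < np.pi)
    return np.sort(angles)
```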
Euclidean – the squared distance between source and target LSP:

d(Lsp1, Lsp2) = Σ_{i=1}^{p} (Lsp1(i) - Lsp2(i))²
Conversion Scheme Transformation
Speech Distance Measures
Itakura-Saito (gain-normalized), in matrix notation:

d_IS(a1, a2) = (a2 - a1)^t R1 (a2 - a1)

where a1, a2 are the LPC vectors and R1 is the covariance matrix of the process 1/A1(z) excited by normalized white noise.
Motivation – the error variance of any random process passed through the error filter

Y(z) = 1 - Σ_{k=1}^{p} a_k z^-k

is E[e²] = Y^t R Y.
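Both measures can be written compactly in code. These are illustrative helpers, not the project's implementation; in particular, the covariance matrix R1 must be supplied by the caller.

```python
import numpy as np

def lsp_distance(lsp1, lsp2):
    """Euclidean measure: squared distance between source and target LSP."""
    return float(np.sum((np.asarray(lsp1) - np.asarray(lsp2)) ** 2))

def is_distance(a1, a2, r1):
    """Gain-normalized Itakura-Saito-style measure in matrix form:
    delta^t R1 delta, where a1 and a2 are LPC vectors and R1 is the
    covariance matrix of the process 1/A1(z) driven by normalized
    white noise (supplied by the caller)."""
    delta = np.asarray(a2, dtype=float) - np.asarray(a1, dtype=float)
    return float(delta @ np.asarray(r1) @ delta)
```

With R1 equal to the identity, the I-S form degenerates to the squared Euclidean distance between the LPC vectors, which makes the two measures easy to compare.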
[Figure: Source Pitch Histogram and Target Pitch Histogram (pitch axis 50–120).]
Given the histograms, we calculate the source and target histogram equalization (cumulative) functions:

T(n) = Σ_{m=0}^{n} H_s(m)  and  G(n) = Σ_{m=0}^{n} H_t(m)
Conversion Scheme Transformation
Histogram Equalization
Given the source pitch value p_s, the target pitch value is calculated by:

p_t = G^{-1}(T(p_s))
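A minimal binned sketch of this transform follows; the function name and bin handling are illustrative, not the project's code.

```python
import numpy as np

def equalize(p_source, source_hist, target_hist, bins):
    """Histogram-equalization transform: cumulative source function T,
    cumulative target function G, then p_t = G^-1(T(p_s))."""
    T = np.cumsum(source_hist) / np.sum(source_hist)
    G = np.cumsum(target_hist) / np.sum(target_hist)
    # T(p_s): cumulative value at the source pitch bin.
    s_bin = np.searchsorted(bins, p_source)
    u = T[min(s_bin, len(T) - 1)]
    # G^-1(u): first target bin whose cumulative value reaches u.
    t_bin = int(np.searchsorted(G, u))
    return bins[min(t_bin, len(bins) - 1)]
```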
Conversion Scheme Analysis
Hidden Markov Models
Speech utterance → Segmentation → for each segment: Initialization → Calculation → Determination → Pitch, Voiced/Unvoiced → next segment.
Segmentation – constant segment length and overlap (step length); the pitch value is determined for each segment.
Initialization – set two adjacent segments of an arbitrary minimal length, the estimated pitch period.
Calculation
• Increase the segments' length.
• For each length, calculate their cross-correlation.
• Stop at an arbitrary maximal length.
Determination
• Pitch period – the length of the segment with the maximal cross-correlation value.
• The cross-correlation must exceed a given threshold; otherwise the segment is classified as unvoiced.
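The adjacent-segment scheme above can be sketched as follows. The name, the normalized-correlation form, and the threshold default are illustrative assumptions.

```python
import numpy as np

def estimate_pitch(x, start, min_len, max_len, threshold=0.5):
    """Grow two adjacent segments from min_len to max_len, compute their
    normalized cross-correlation, and take the length that maximizes it as
    the pitch period; if the best correlation misses the threshold, the
    segment is classified as unvoiced (returns None)."""
    best_len, best_corr = None, -1.0
    for L in range(min_len, max_len + 1):
        s1 = x[start:start + L]
        s2 = x[start + L:start + 2 * L]
        if len(s2) < L:
            break  # ran past the end of the utterance
        denom = np.linalg.norm(s1) * np.linalg.norm(s2)
        if denom == 0:
            continue  # silent segment pair
        corr = float(np.dot(s1, s2) / denom)
        if corr > best_corr:
            best_len, best_corr = L, corr
    if best_corr < threshold:
        return None  # unvoiced
    return best_len
```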
Conversion Scheme Analysis
Pitch Estimation
Signal, Pitch → Segmentation → Windowing → Pre-emphasis → Calculation → LPC.
Windowing – multiply each segment by a Hamming window; overlapping Hamming windows sum to an approximately rectangular window.
Segmentation – work frame: a segment of twice the pitch period for voiced speech, or of constant duration for unvoiced speech; the segments overlap by half.
Conversion Scheme Analysis
LPC Estimation
Pre-emphasis – a constant-parameter HPF compensating for the spectral tilt due to lip radiation.
Calculation – the gain and the denominator coefficients are estimated using Linear Prediction methods.
[Figure: the signal's FFT with the spectral envelope of V(f) overlaid.]
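The pipeline above (pre-emphasis, Hamming windowing, LPC calculation) can be sketched with the autocorrelation method solved by the Levinson-Durbin recursion. This is an illustration under stated assumptions; the project's actual estimator and pre-emphasis constant may differ.

```python
import numpy as np

def lpc(frame, order, preemph=0.95):
    """LPC estimation: pre-emphasis HPF 1 - preemph*z^-1 (compensating the
    spectral tilt due to lip radiation), Hamming windowing, then the
    autocorrelation method via Levinson-Durbin.  Returns (a, gain) for the
    model V(z) = G / (1 - sum_k a_k z^-k)."""
    x = np.append(frame[0], frame[1:] - preemph * frame[:-1])  # pre-emphasis
    x = x * np.hamming(len(x))                                 # windowing
    # Autocorrelation sequence r[0..order].
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order)
    err = r[0]
    for i in range(order):  # Levinson-Durbin recursion
        k = (r[i + 1] - np.dot(a[:i], r[1:i + 1][::-1])) / err
        a[:i] = a[:i] - k * a[:i][::-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, np.sqrt(err)
```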
We use two methods of residue coding: full residue preservation – obtained by passing the speech segment through the prediction error filter:
Conversion Scheme Analysis
GPP Estimation
1 / V(z)

Residue's energy only – the excitation is a pitch-period impulse train (voiced) or noise (unvoiced).