Pitch Estimation

Page 1: Pitch Estimation

12/11/2006 Chih-Ti Shih 1

Pitch Estimation

By Chih-Ti Shih

Page 2: Pitch Estimation

Objective

Determine the fundamental frequency of a speech waveform automatically.

Page 3: Pitch Estimation

Automatic Extraction of Fundamental Frequency Methods

  • Cepstrum-based FΦ determinator (CFD)
  • Harmonic product spectrum (HPS)
  • Feature-based FΦ tracker (FBFT)
  • Parallel processing method (PP)
  • Integrated FΦ tracking algorithm (IFTA)
  • Super resolution FΦ determinator (SRFD)

Page 4: Pitch Estimation

eSRFD

eSRFD: Enhanced super resolution FΦ determinator.

1. Pass the sample through a low-pass filter to simplify the temporal structure of the waveform.

2. Pass the sample frames through a silence detector to identify unvoiced frames. No analysis is done for the unvoiced frames.

Equation 1: if |X_Nmin or X_Nmax| + |Y_Nmin or Y_Nmax| < T_srfd, the frame is a silent frame.

Page 5: Pitch Estimation

eSRFD

Each frame is subdivided into 3 consecutive segments, x_n, y_n and z_n:

$$
\begin{aligned}
x_n &= \{\, x(i) = s(i - n) \mid i = 1, \dots, n \,\} \\
y_n &= \{\, y(i) = s(i) \mid i = 1, \dots, n \,\} \\
z_n &= \{\, z(i) = s(i + n) \mid i = 1, \dots, n \,\}
\end{aligned}
$$
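The subdivision into three consecutive segments can be sketched in Python as follows. This is a minimal illustration rather than the presenter's implementation: the function name `split_frame`, the `center` index (the list position assumed to correspond to s(0)), and the use of a plain list for the signal are all assumptions.

```python
def split_frame(s, center, n):
    """Return the segments (x_n, y_n, z_n), each of length n.

    Assumption: s[center] plays the role of s(0), so that
    x(i) = s(i - n), y(i) = s(i) and z(i) = s(i + n) for i = 1..n.
    """
    if center - n + 1 < 0 or center + 2 * n >= len(s):
        raise ValueError("frame does not fit inside the signal")
    x = [s[center + i - n] for i in range(1, n + 1)]
    y = [s[center + i] for i in range(1, n + 1)]
    z = [s[center + i + n] for i in range(1, n + 1)]
    return x, y, z


signal = list(range(20))  # toy signal: s[k] = k
x, y, z = split_frame(signal, center=5, n=3)
print(x, y, z)  # three adjacent blocks of three samples each
```

Because the three segments are adjacent and of equal length n, a signal with period n makes all three segments look alike, which is what the later correlation steps test for.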

Page 6: Pitch Estimation

eSRFD

Page 7: Pitch Estimation

eSRFD

3. For a ‘voiced’ frame, the first normalized cross-correlation coefficient p_{x,y}(n) of the frame is determined.

Equation 2:

$$
p_{x,y}(n) = \frac{\sum_{j=1}^{n/L} x(jL)\, y(jL)}{\sqrt{\sum_{j=1}^{n/L} x(jL)^2 \cdot \sum_{j=1}^{n/L} y(jL)^2}},
\qquad n \in \{\, N_{min} + i \cdot L \mid i = 0, 1, \dots;\ N_{min} \le n \le N_{max} \,\}
$$

The numerator is the cross-correlation; the denominator is the normalization.
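A normalized cross-correlation of this kind, evaluated only on every L-th sample, might be computed as in the sketch below. The function name `normalized_xcorr`, the list-based segments, the integer division for the upper summation limit, and the handling of an all-zero segment are assumptions made for illustration.

```python
import math


def normalized_xcorr(x, y, n, L=1):
    """Normalized cross-correlation of two length-n segments.

    x[k - 1] holds x(k). Only every L-th sample (k = L, 2L, ...) is used,
    mirroring sums over j = 1..n/L; integer division is an assumption.
    """
    ks = range(L, n + 1, L)
    num = sum(x[k - 1] * y[k - 1] for k in ks)
    energy = sum(x[k - 1] ** 2 for k in ks) * sum(y[k - 1] ** 2 for k in ks)
    if energy == 0:
        return 0.0  # assumption: treat an all-zero segment as uncorrelated
    return num / math.sqrt(energy)


print(normalized_xcorr([1, 2, 3, 4], [1, 2, 3, 4], n=4))         # identical segments
print(normalized_xcorr([1, 2, 3, 4], [1, -2, 3, -4], n=4, L=2))  # decimated by 2
```

The normalization keeps the coefficient between -1 and 1 regardless of signal level, so a single threshold can be applied to every frame.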

Page 8: Pitch Estimation

eSRFD

4. Candidate values of the fundamental period are obtained by locating the peaks in the normalized cross-correlation coefficient for which the value p_{x,y}(n) exceeds a certain threshold T_srfd.

If no candidates are found in the frame, the frame is classified as ‘unvoiced’.
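Peak picking against a threshold can be sketched as follows. The function name `find_candidates`, the list-of-pairs input, the local-peak test, and the threshold value 0.8 used in the example are hypothetical; the slides do not give a value for T_srfd.

```python
def find_candidates(p, threshold):
    """Return lags whose coefficient is a local peak above `threshold`.

    `p` is a list of (lag, coefficient) pairs ordered by lag.
    """
    candidates = []
    for k, (lag, value) in enumerate(p):
        left = p[k - 1][1] if k > 0 else float("-inf")
        right = p[k + 1][1] if k + 1 < len(p) else float("-inf")
        if value > threshold and value > left and value >= right:
            candidates.append(lag)
    return candidates


p_xy = [(40, 0.2), (44, 0.9), (48, 0.5), (52, 0.7), (56, 0.95), (60, 0.3)]
candidates = find_candidates(p_xy, threshold=0.8)  # 0.8 is an illustrative value
print(candidates if candidates else "unvoiced")
```

An empty result corresponds to the case where the frame is classified as ‘unvoiced’.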

Page 9: Pitch Estimation

eSRFD

5. For a voiced frame (p_{x,y}(n) > T_srfd), the second normalized cross-correlation coefficient p_{y,z}(n) is determined:

$$
p_{y,z}(n) = \frac{\sum_{j=1}^{n/L} y(jL)\, z(jL)}{\sqrt{\sum_{j=1}^{n/L} y(jL)^2 \cdot \sum_{j=1}^{n/L} z(jL)^2}},
\qquad n \in \{\, n \mid p_{x,y}(n) > T_{srfd} \,\}
$$

Page 10: Pitch Estimation

eSRFD

6. Candidates for which both p_{x,y}(n) and p_{y,z}(n) exceed the threshold T_srfd are given a score of 2; the others are given a score of 1.

Note: If there are one or more candidates with a score of 2, then all those with a score of 1 are removed from the list of candidates.
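The scoring and pruning rule can be sketched as below. The function name `score_candidates`, the dictionary layout, and the threshold value in the example are assumptions; the candidates passed in are taken to have already passed the first threshold test.

```python
def score_candidates(candidates, p_yz, threshold):
    """Score each candidate lag, then apply the pruning rule from the note.

    `candidates` are lags that already exceeded the threshold on the first
    coefficient; `p_yz` maps each lag to its second coefficient.
    """
    scores = {n: (2 if p_yz[n] > threshold else 1) for n in candidates}
    if 2 in scores.values():
        # Drop every score-1 candidate once any score-2 candidate exists.
        scores = {n: sc for n, sc in scores.items() if sc == 2}
    return scores


second = {44: 0.5, 56: 0.9, 88: 0.85}
print(score_candidates([44, 56, 88], second, threshold=0.8))
```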

Page 11: Pitch Estimation

eSRFD

If there is only one candidate with score 1 or 2, the candidate is assumed to be the best estimate of the fundamental period of that frame.

Otherwise, an optimal fundamental period is sought from the set of remaining candidates. The candidate at the end of this list represents a fundamental period n_M, and the m-th candidate represents a period n_m.

Page 12: Pitch Estimation

eSRFD

7. Then calculate q(n_m), a normalized cross-correlation coefficient between sections of length n_M spaced n_m apart. q(n_m) is defined as:

$$
q(n_m) = \frac{\sum_{j=1}^{n_M} s(j - n_M)\, s(j + n_m)}{\sqrt{\sum_{j=1}^{n_M} s(j - n_M)^2 \cdot \sum_{j=1}^{n_M} s(j + n_m)^2}}
$$

Page 13: Pitch Estimation

eSRFD

Page 14: Pitch Estimation

eSRFD

The first coefficient q(n_1) is assumed to be the optimal value. If a subsequent q(n_m) * 0.77 exceeds the current optimal value, that q(n_m) becomes the optimal value.
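The selection rule with the 0.77 factor can be written as a short loop. The function name `select_optimal` and the list-of-pairs input are assumptions; the q values in the example are made up for illustration.

```python
def select_optimal(q_values, factor=0.77):
    """Pick the period whose coefficient survives the weighted comparison.

    `q_values` is a list of (period, q) pairs in candidate order.
    """
    best_period, best_q = q_values[0]
    for period, q in q_values[1:]:
        if q * factor > best_q:
            best_period, best_q = period, q
    return best_period


print(select_optimal([(50, 0.80), (100, 0.90), (150, 0.99)]))
print(select_optimal([(50, 0.50), (100, 0.90)]))
```

With this rule a later candidate replaces the current optimum only if its coefficient is larger by a factor of more than 1/0.77, roughly 1.30.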

Page 15: Pitch Estimation

eSRFD

If only one candidate has a score of 1 and no candidate has a score of 2:

If the previous frame is ‘unvoiced’, the current value is held and the decision depends on the next frame. If the next frame is also unvoiced, the current frame is considered ‘unvoiced’.

Otherwise, the current frame is considered ‘voiced’, and the currently held FΦ is taken as a good estimate for the current frame.

Page 16: Pitch Estimation

eSRFD

These changes reduce the occurrence of doubling and halving in the FΦ contour. However, they increase the chance that a voiced region is misclassified as unvoiced.

Page 17: Pitch Estimation

eSRFD

8. Apply biasing to p_{x,y}(n) and p_{y,z}(n) if:

   1. The two previous frames were ‘voiced’.
   2. The FΦ value of the previous frame is not being temporarily held.
   3. FΦ of the previous frame is less than 7/4 * (FΦ of the current frame) and greater than 5/8 * (FΦ of the current frame).

However, the biasing tends to increase the percentage of unvoiced regions of speech being incorrectly classified as ‘voiced’.

If the unbiased coefficient p_{x,y}(n) does not exceed T_srfd and this candidate is believed to be the best estimate for the frame, the FΦ of this candidate is held until the state of the subsequent frame is known. If the next frame is silent, the current frame is re-classified as silent.
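The three conditions for applying the bias can be checked with a small predicate. The function name `should_bias` and its parameters are hypothetical, and the sketch only decides whether biasing applies; how the coefficients are then biased is not specified on the slide and is not modelled here.

```python
def should_bias(prev_two_voiced, prev_f0_held, prev_f0, current_f0):
    """Return True when all three biasing conditions hold.

    prev_two_voiced: pair of booleans for the two previous frames.
    prev_f0_held:    True if the previous frame's value is temporarily held.
    prev_f0, current_f0: fundamental frequencies in Hz.
    """
    lower = 5 / 8 * current_f0
    upper = 7 / 4 * current_f0
    return all(prev_two_voiced) and not prev_f0_held and lower < prev_f0 < upper


print(should_bias((True, True), False, prev_f0=120.0, current_f0=110.0))
print(should_bias((True, True), False, prev_f0=200.0, current_f0=110.0))
```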

Page 18: Pitch Estimation

eSRFD

9. The fundamental period for the frame is estimated by calculating r_{x,y}(n) for n in the region -L < n < L. The maximum within this range corresponds to a more accurate value of the fundamental period.

$$
r_{x,y}(n) = \frac{\sum_{j=1}^{n} x(j)\, y(j)}{\sqrt{\sum_{j=1}^{n} x(j)^2 \cdot \sum_{j=1}^{n} y(j)^2}},
\qquad n \in \{\, N_{min}, \dots, N_{max} \,\}
$$
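A refinement step of this kind could look like the sketch below. It assumes that the region -L < n < L is centred on a coarse period estimate `n0`, that s[center] plays the role of s(0), and that x(j) = s(j - n) and y(j) = s(j); the function names and the toy sine signal are illustrative only.

```python
import math


def r_xy(s, center, n):
    """Full-resolution normalized cross-correlation at lag n."""
    x = [s[center + j - n] for j in range(1, n + 1)]
    y = [s[center + j] for j in range(1, n + 1)]
    num = sum(a * b for a, b in zip(x, y))
    energy = sum(a * a for a in x) * sum(b * b for b in y)
    return num / math.sqrt(energy) if energy > 0 else 0.0


def refine_period(s, center, n0, L, n_min, n_max):
    """Search lags strictly within L samples of the coarse estimate n0."""
    lags = range(max(n_min, n0 - L + 1), min(n_max, n0 + L - 1) + 1)
    return max(lags, key=lambda n: r_xy(s, center, n))


# Toy check: a sine with a 50-sample period and a coarse estimate of 48.
signal = [math.sin(2 * math.pi * k / 50) for k in range(400)]
print(refine_period(signal, center=150, n0=48, L=4, n_min=20, n_max=120))
```

Searching every lag near the coarse estimate recovers the resolution lost when the earlier coefficients were evaluated only at steps of L.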

Page 19: Pitch Estimation

Comparison of asynchronous frequency contours

Compare Fx, which is generated from the laryngograph, with the FΦ contours generated by eSRFD.

  • Fx_reference refers to the reference value from the laryngograph.
  • FΦ refers to the value from eSRFD.

Page 20: Pitch Estimation

Comparison of asynchronous frequency contours

  • Fx_reference and FΦ are both zero: both describe a silent or unvoiced region of the utterance, and no error results.
  • FΦ is non-zero but Fx_reference is zero: the region is incorrectly classified as voiced by eSRFD.
  • Fx_reference is non-zero but FΦ is zero: the voiced region is incorrectly classified as unvoiced by eSRFD.
  • Fx_reference and FΦ are both non-zero: both correctly classify the region as voiced. In this case, calculate the ratio:

$$
\frac{Fx_{reference} - F\Phi_{eSRFD}}{Fx_{reference}}
$$

Page 21: Pitch Estimation

Gross Error

Halving error:

$$
\frac{Fx_{reference} - F\Phi_{eSRFD}}{Fx_{reference}} > 0.2
$$

Doubling error:

$$
\frac{Fx_{reference} - F\Phi_{eSRFD}}{Fx_{reference}} < -0.2
$$

Acceptable accuracy:

$$
-0.2 \le \frac{Fx_{reference} - F\Phi_{eSRFD}}{Fx_{reference}} \le 0.2
$$
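The frame-by-frame comparison of the laryngograph reference against the eSRFD estimate can be sketched as a single classifier. The function name `classify_frame`, the returned labels, the tolerance of 0.2 on the ratio, and the convention that a zero value marks an unvoiced frame are assumptions for illustration.

```python
def classify_frame(fx_reference, f0_esrfd, tolerance=0.2):
    """Compare one reference value with one estimate (both in Hz, 0 = unvoiced)."""
    if fx_reference == 0 and f0_esrfd == 0:
        return "no error"
    if fx_reference == 0:
        return "unvoiced classified as voiced"
    if f0_esrfd == 0:
        return "voiced classified as unvoiced"
    ratio = (fx_reference - f0_esrfd) / fx_reference
    if ratio > tolerance:
        return "halving error"   # estimate well below the reference
    if ratio < -tolerance:
        return "doubling error"  # estimate well above the reference
    return "acceptable"


for ref, est in [(100, 50), (100, 200), (100, 105), (0, 100), (100, 0), (0, 0)]:
    print(ref, est, classify_frame(ref, est))
```

Counting the labels over all frames of an utterance gives the error rates that the contour comparison plots summarize.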

Page 22: Pitch Estimation

Comparison of asynchronous frequency contours

[Figure: female speaker]

Page 23: Pitch Estimation

Comparison of asynchronous frequency contours

[Figure: female speaker]

Page 24: Pitch Estimation

Comparison of asynchronous frequency contours

[Figure: male speaker]

Page 25: Pitch Estimation

Comparison of asynchronous frequency contours

[Figure: male speaker]

Page 26: Pitch Estimation

Comparison of asynchronous frequency contours

[Figure: laryngograph and eSRFD contours]

Page 27: Pitch Estimation

Comparison of asynchronous frequency contours

Page 28: Pitch Estimation

Comparison of asynchronous frequency contours

Page 29: Pitch Estimation

Question?