An Exemplar-based Approach to Automatic Burst Detection in Voiceless Stops

Yao Yao, UC Berkeley
yaoyao@berkeley.edu
http://linguistics.berkeley.edu/~yaoyao
July 25, 2008
Overview (2)
- Background
- Data
- Methodology: algorithm, tuning the model, testing
- Results
- General Discussion
Background (3)
- Purpose of the study: to find the point of burst in a word-initial voiceless stop (i.e., [p], [t], [k])
- Existing approach: detecting the point of maximal energy change (cf. Niyogi and Ramesh, 1998; Liu, 1996)
[Figure: waveform of a stop, marking closure, release, and vowel onset]
Background (4)
- Our approach: compare the spectrum of the target token at each time point against that of fricatives and silence
- Assess how "fricative-like" and "silence-like" the spectrum is at each time point
- Find the point where "fricative-ness" suddenly rises and "silence-ness" suddenly drops → the point of burst
Background (5)
- Our approach (cont'd): what do we need?
  - Spectral features of a given time frame
  - Spectral templates of fricatives and silence, specific to the speaker and the recording environment
  - A way to measure and compare fricative-ness and silence-ness
  - An algorithm to find the most likely point of release
- Advantages: easy to implement; since templates are speaker-specific, there is no need to worry about changes in the recording environment or individual differences
Data (6)
- Buckeye corpus (Pitt, M. et al. 2005): 40 speakers
  - All residents of Columbus, Ohio; balanced in gender and age
  - One-hour interviews, transcribed at the word and phone level
  - 19 speakers used in the current study
- Target tokens: transcribed word-initial voiceless stops (i.e., [p], [t], [k])
Methodology: spectral measures (7)
- Spectral vector: 20 ms Hamming window, mel scale, 1 × 60 array
- Spectral template: speaker-specific and phone-specific
  - Ignore tokens shorter than the speaker's average duration for that phone
  - For the remaining tokens, calculate a spectral vector for the middle 20 ms window and average over the spectral vectors
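As a concrete illustration, these two steps can be sketched in Python with numpy. The 16 kHz sampling rate and the triangular mel filterbank design are assumptions; the deck only specifies the 20 ms Hamming window, the mel scale, and the 1 × 60 output.

```python
import numpy as np

SR = 16000             # assumed sampling rate
WIN = int(0.020 * SR)  # 20 ms analysis window

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def spectral_vector(frame):
    """One 20 ms frame -> 1x60 mel-scale spectral vector."""
    frame = frame * np.hamming(len(frame))
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / SR)
    # 60 triangular filters, evenly spaced on the mel scale (assumed design)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(SR / 2), 62))
    vec = np.zeros(60)
    for j in range(60):
        lo, mid, hi = edges[j], edges[j + 1], edges[j + 2]
        up = (freqs - lo) / (mid - lo)
        down = (hi - freqs) / (hi - mid)
        w = np.clip(np.minimum(up, down), 0.0, None)
        vec[j] = np.log(w @ mag + 1e-10)  # log energy in each mel band
    return vec

def spectral_template(frames):
    """Average the mid-phone spectral vectors of many tokens into a template."""
    return np.mean([spectral_vector(f) for f in frames], axis=0)
```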
Methodology: spectral templates (8)
[Figure: spectral templates for [a], [f], and silence of speaker F01]
Methodology: similarity scores (9)
Similarity between spectral vector x and template u:

D_{x,u} = (1/60) * Σ_{j=1}^{60} |x_j − u_j| / sd(u_j)
S_{x,u} = e^(−0.005 · D_{x,u})

The given acoustic data are compared against every spectral template of that speaker, with a step size of 5 ms.
Similarity scores (10)
Formulae, for spectral vector x and template t:

D_{x,t} = (1/60) * Σ_{j=1}^{60} |x_j − t_j| / sd(t_j)
S_{x,t} = e^(−0.005 · D_{x,t})

Step size = 5 ms
[Figure: [s] and <sil> similarity score tracks over the course of a token]
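A sketch of these similarity scores in Python with numpy. Reading the formula as a mean absolute difference normalized by the template's per-band standard deviation sd(t_j) is a reconstruction of the garbled original, and the idea that each template stores its own sd vector is an assumption.

```python
import numpy as np

def similarity(x, t, sd):
    """S_{x,t} = exp(-0.005 * D_{x,t}), where
    D_{x,t} = (1/60) * sum_j |x_j - t_j| / sd(t_j)."""
    d = np.mean(np.abs(x - t) / sd)  # sd-normalized mean absolute difference
    return float(np.exp(-0.005 * d))

def score_track(vectors, template, sd):
    """Score a token against one template at every 5 ms step
    (one spectral vector per step)."""
    return np.array([similarity(v, template, sd) for v in vectors])
```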
Methodology: finding the release point (11)
Basic idea: near the release point, the fricative similarity score rises and the silence similarity score drops.

                 Closure   Burst
  Fricative-ness  Low       High
  Silence-ness    High      Low

[Figure: waveform of a stop, marking closure, release, and vowel onset]
Q1: Which fricative to use?
Q2: Which period of rise or drop to pick?
Methodology: finding the release point (12)
[Figure: [h], [s], [sh], and <sil> similarity score tracks over a token]
- Slope is a better predictor than the absolute score value.
- The end point of the period with maximal slope → the release point.
- Which fricative? The [sh] score is more consistent than the other fricatives'.
Methodology: finding the release point (13)
[Figure: [h], [s], [sh], and <sil> similarity score tracks for the initial [t] in "doing" and the initial [k] in "countries"]
Methodology: finding the release point (14)
Original algorithm:
- Find the end point of the period of fastest increase in the <sh> score.
- Find the end point of the period of fastest decrease in the <sil> score.
- Return the middle point of the two end points as the point of release.
- If either or both end points cannot be found within the duration of the stop, return NULL.
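A minimal sketch of this original algorithm, assuming the score tracks are arrays sampled every 5 ms and slopes are taken as first differences (both assumptions, not stated in the deck):

```python
import numpy as np

STEP = 0.005  # 5 ms step size

def release_point(sh_scores, sil_scores):
    """Midpoint of (a) the end of the fastest rise in the <sh> score and
    (b) the end of the fastest drop in the <sil> score; None if not found."""
    d_sh = np.diff(sh_scores)
    d_sil = np.diff(sil_scores)
    if len(d_sh) == 0 or len(d_sil) == 0:
        return None  # no end point can be found within the stop
    p1 = (np.argmax(d_sh) + 1) * STEP   # end of fastest increase in <sh>
    p2 = (np.argmin(d_sil) + 1) * STEP  # end of fastest decrease in <sil>
    return (p1 + p2) / 2.0
```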
Methodology: finding the release point (15)
- Select two speakers' data to tune the model.
- Hand-tag the release point for all tokens in the test set. If the stop doesn't appear to have a release point on the spectrogram, mark it as a problematic case and take the end point of the stop as the release point for calculating error.

  Speaker  Age    Gender  Speaking rate         # of tokens  # of test tokens
  F07      Old    Female  Slow (4.022 syll/s)   231          231
  M08      Young  Male    Fast (6.434 syll/s)   618          261
Methodology: problematic cases (16)
Examples: no burst; no closure; weak and double releases (?)
[Figure: <sh> and <sil> score tracks for problematic cases]
Methodology: finding the release point (17)
- Calculate the difference between the hand-tagged release point and the estimated one (i.e., the error) for each case.
- The RMS (root mean square) of the error is used to measure the performance of the algorithm.
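The RMS measure, as a one-line numpy sketch:

```python
import numpy as np

def rms(errors):
    """Root mean square of a sequence of errors (e.g. in ms)."""
    return float(np.sqrt(np.mean(np.square(errors))))
```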
Methodology: error analysis (18)
[Figure: histograms of (real release − estimate) for F07 (n = 231 tokens) and M08 (n = 261 tokens)]
Adding 5 ms to the estimate:
- F07: RMS = 7.22 ms → 4.85 ms
- M08: RMS = 13.11 ms → 14 ms
Methodology: tuning the algorithm (19)
1st Rejection Rule: a target token will be rejected if the changes in the scores are not drastic enough.
E.g., an insignificant rise → reject!
[Figure: <sh> and <sil> score tracks showing an insignificant rise in the <sh> score]
Methodology: tuning the algorithm (20)
Applying the 1st Rejection Rule:
- F07: rejects 4 cases; RMS(+5ms) = 4.19 ms
- M08: rejects 28 cases, covering most of the problematic cases; RMS(+5ms) = 9.27 ms
[Figure: error analysis in M08 after the 1st rejection rule: RMS(+5ms) improves from 14 ms to 9.27 ms]
Methodology: tuning the algorithm (21)
Still a problem: multiple releases. Each release might correspond to a rise/drop in the scores.
[Figure: <sh> and <sil> score tracks for the initial [k] in "cause" (speaker M08)]
Methodology: tuning the algorithm (22)
2nd Rejection Rule: a target token will be dropped if the points found in the <sh> and <sil> scores are too far apart (> 20 ms).
- This partly solves the multiple-release problem.
- The ideal way would be to identify all candidate release points and return the first one.
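The "ideal" variant in the last bullet could be sketched as follows, assuming a 5 ms step size and using a hypothetical minimum-slope threshold of 0.02 to define a candidate rise:

```python
import numpy as np

STEP = 0.005  # 5 ms step size

def first_candidate_release(sh_scores, min_slope=0.02):
    """Return the end point of the EARLIEST sharp rise in the <sh> score,
    rather than the single fastest one; None if no rise is sharp enough."""
    d_sh = np.diff(sh_scores)
    candidates = np.flatnonzero(d_sh >= min_slope)
    if len(candidates) == 0:
        return None
    return float((candidates[0] + 1) * STEP)
```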
Methodology: tuning the algorithm (23)
Applying the 2nd Rejection Rule:
- F07: rejects 3 more cases; RMS(+5ms) = 3.22 ms
- M08: rejects 20 more cases; only 2 problematic cases remain; RMS(+5ms) = 3.44 ms
[Figure: error analysis in M08 after the 2nd rejection rule: RMS(+5ms) improves from 9.26 ms to 3.44 ms]
Compare: the optimal error is 2.5 ms, given the 5 ms step size.
Methodology: tuning the algorithm (24)

F07 (rejection rate: 3.03%)
                        # of cases   RMS (ms)   RMS(+5ms) (ms)
  Original              231          7.22       4.85
  After 1st rejection   227          6.81       4.19
  After 2nd rejection   224          6.02       3.22

M08 (rejection rate: 15.05%)
                        # of cases   RMS (ms)   RMS(+5ms) (ms)
  Original              261          13.11      14
  After 1st rejection   233          9.27       9.26
  After 2nd rejection   213          5.64       3.44
Methodology: testing the algorithm (25)
- Select a random sample of 50 tokens from all speakers.
- Hand-tag the release point.
- Use the current algorithm together with the two rejection rules to find the estimated release.
- Compare the hand-tagged point and the estimated one:
  - 4 tokens rejected by the 1st rule (3 legitimately)
  - 3 tokens rejected by the 2nd rule (2 legitimately)
  - 43 accepted cases: RMS(error) < 5 ms
Methodology: summary (26)
1. Calculate the <sil> score and the <sh> score.
2. Calculate the slopes of the <sil> score and the <sh> score.
3. In a labeled voiceless stop span, (i) find the time point of the largest positive slope in the <sh> score and store it in p1; (ii) find the time point of the most negative slope in the <sil> score and store it in p2.
4. Reject the case if any of the following holds:
   - p1 = null or p2 = null
   - |p1 − p2| >= 0.02 s
   - slope(p1) < 0.02 and slope(p2) > −0.04
5. Otherwise, return (p1 + p2)/2 + 0.005.
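The whole decision procedure can be put together as one Python sketch. The numeric thresholds come from the flowchart, but the sign convention on the <sil>-slope threshold (reading it as a drop of magnitude below 0.04) is an interpretation, and the 5 ms sampling of the score tracks is assumed:

```python
import numpy as np

STEP = 0.005  # 5 ms step size

def estimate_release(sh_scores, sil_scores):
    """Return the estimated release time (s) within the stop span,
    or None when the token is rejected."""
    d_sh = np.diff(sh_scores)
    d_sil = np.diff(sil_scores)
    if len(d_sh) == 0 or len(d_sil) == 0:
        return None                      # p1 or p2 cannot be found
    i1, i2 = int(np.argmax(d_sh)), int(np.argmin(d_sil))
    p1, p2 = (i1 + 1) * STEP, (i2 + 1) * STEP
    if abs(p1 - p2) >= 0.020:
        return None                      # 2nd rule: end points too far apart
    if d_sh[i1] < 0.02 and d_sil[i2] > -0.04:
        return None                      # 1st rule: changes not drastic enough
    return (p1 + p2) / 2.0 + 0.005       # +5 ms bias correction
```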
Results: grand means (27)
- Rejection rates (2 rules combined): vary from 3.03% to 30.5% across speakers (mean = 13.3%, sd = 8.6%).
- VOT and closure duration:

                 [p]    [t]    [k]
  Closure (ms)   69.5   48.9   54.9
  VOT (ms)       48     51.2   57.9
Results: VOT by speaker (28)
General Discussion (29)
Echoing previous findings:
- Byrd (1993): closure duration and VOT in read speech
- Shattuck-Hufnagel & Veilleux (2007): 13% of landmarks missing in spontaneous speech

                 [p]         [t]         [k]
  Closure (ms)   69 (69.5)   53 (48.9)   60 (54.9)
  VOT (ms)       44 (48)     49 (51.2)   52 (57.9)

(Values from the current study in parentheses.)
General Discussion (30)
Future work:
- Fine-tune the 2nd rejection rule.
- Generalize the exemplar-based method to other automatic phonetic processing problems?
Acknowledgements (31)
Anonymous speakers, the Buckeye corpus developers, Prof. Keith Johnson, and members of the Phonology Lab at UC Berkeley.

Thank you! Any comments are welcome.
References (32)
- Byrd, D. (1993). 54,000 American stops. UCLA Working Papers in Phonetics, 83, 97-116.
- Johnson, K. (2006). Acoustic attribute scoring: A preliminary report.
- Liu, S. (1996). Landmark detection for distinctive feature-based speech recognition. Journal of the Acoustical Society of America, 100, 3417-3430.
- Niyogi, P., & Ramesh, P. (1998). Incorporating voice onset time to improve letter recognition accuracies. Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '98), Vol. 1, 13-16.
- Pitt, M., et al. (2005). The Buckeye Corpus of conversational speech: Labeling conventions and a test of transcriber reliability. Speech Communication, 45, 90-95.
- Shattuck-Hufnagel, S., & Veilleux, N. M. (2007). Robustness of acoustic landmarks in spontaneously-spoken American English. Proceedings of the International Congress of Phonetic Sciences 2007, Saarbrücken, August 2007.
- Zue, V. W. (1976). Acoustic characteristics of stop consonants: A controlled study. Sc.D. thesis, MIT, Cambridge, MA.