Bootstrap TDNN for Classification of Voiced Stop Consonants (B, D, G)
Jun Hou, Lawrence Rabiner and Sorin Dusan
CAIP, Rutgers University
October 13, 2006
2
Outline
• Review TDNN basics
• Bootstrap TDNN using categories
• Model lattice
• Experiments
• Discussion and future work
3
ASAT Paradigm
[Figure: ASAT paradigm block diagram, modules 1–5; module 5 is "Overall System Prototypes and Common Platform"]
4
Previous Research
• Frame-based method
  – used an MLP to detect the 14 Sound Pattern of English features for the 61-phoneme TIMIT alphabet, using a single frame of MFCCs
• Major problem
  – frames capture static properties of speech
  – dynamic information is needed when detecting dynamic features and sounds
  – segment-based methods are needed rather than frame-based methods
5
Phoneme Hierarchy
[Figure: phoneme hierarchy tree]
• Phonemes → vowels, diphthongs, semivowels, consonants
• Diphthongs: AW, AY, EY, OY; semivowels: W, L, R, Y
• Consonants → nasals (M, N, NG), stops, fricatives, affricates (J, CH), whisper (H)
• Stops: voiced (B, D, G), unvoiced (P, T, K)
• Fricatives: voiced (V, DH, Z, ZH), unvoiced (F, TH, S, SH)
• Vowels arranged by height (high/mid/low) and backness (front/mid/back): IY, IH, EH, AE, ER, AX, UH, UW, OW, AO, AA
39 phonemes – 11 vowels, 4 diphthongs, 4 semivowels, 20 consonants
• Bottom-up approach (Steps 1–4 in the hierarchy above)
• Classify the voiced stop consonants /B/, /D/ and /G/ using segment-based methods (this is the hardest classification problem among the consonants)
6
Voiced Stop Consonants Classification
• B, D, and G tokens in the TIMIT training and test sets, without the SA sentences
  – *CV form of tokens
    • * – preceding phoneme can be any sound
    • C – B, D, or G
    • V – vowel or diphthong
  – 10 msec windows, 150 msec segments (15 frames)
  – the segment runs from the 1st frame through the 15th frame; the beginning of the vowel is at the 10th frame, with the burst preceded by any phoneme(s)
  – distribution (in # tokens and percentage):

                Training set                  Test set
            B      D      G   Total       B      D      G   Total
         1567   1460    658    3685     638    537    243    1418
        42.5%  39.6%  17.9%           45.0%  37.8%  17.1%
7
Voiced Stop Consonants Classification
• Example – /b/ (speech waveform, 10*log(energy), and spectrogram shown for tokens with short, medium, and long stop gaps)
  – "… that big goose …": stop gap = 2260 samples (141 msec)
    TIMIT labels: dh ae tcl b ih gcl g ux s (samples 20580–34506)
  – "… and become …": stop gap = 396 samples (24.74 msec)
    TIMIT labels: ix n bcl b iy kcl k ah m (samples 48215–55560)
  – "… judged by …": stop gap = 1100 samples (68.75 msec)
    TIMIT labels: jh ah dcl jh dcl d bcl b ay (samples 27770–37800)
8
Voiced Stop Consonants Classification
• Example – /d/ (speech waveform, 10*log(energy), and spectrogram for short, medium, and long stop gaps)
  – "… scampered across …": stop gap = 398 samples (24.88 msec)
    TIMIT labels: pcl p axr dcl d ix kcl k r (samples 16560–21640)
  – "A doctor …": stop gap = 800 samples (50 msec)
    TIMIT labels: ey dcl d aa kcl t axr (samples 2360–10443)
  – "Does …": stop gap = silence = 1960 samples (122.5 msec)
    TIMIT labels: h# d ah z (samples 0–5413)
9
Voiced Stop Consonants Classification
• Example – /g/ (speech waveform, 10*log(energy), and spectrogram for short, medium, and long stop gaps)
  – "May I get …": stop gap = 430 samples (26.88 msec)
    TIMIT labels: h# m ey ay gcl g ih tcl t (samples 0–9492)
  – "… a good mechanic …": stop gap = 960 samples (60 msec)
    TIMIT labels: pau q ah gcl g uh dcl m ix (samples 30340–42200)
  – "…, give or take …": stop gap = 3022 samples (188.88 msec)
    TIMIT labels: pau g ih v axr tcl t ey kcl k (samples 24800–35400)
10
Time-Delay Neural Network
• Developed by A. Waibel et al.
• Effective in classifying dynamic sounds, like voiced stop consonants
• Introduces delays into the input of each layer of a regular MLP
• The inputs of a unit are multiplied by the un-delayed weights and the delayed weights, then summed and passed through a nonlinear function
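As a sketch of that computation, here is a minimal time-delay layer in NumPy. The shapes, the tanh nonlinearity, and the weight layout are illustrative assumptions; the original work follows the formulation of Waibel et al.

```python
import numpy as np

def tdnn_layer(x, weights, delay):
    """One time-delay layer: output frame t sees input frames
    t .. t+delay, each multiplied by its own weight matrix.
    x:       (n_frames, n_in) input activations
    weights: (delay+1, n_in, n_out) un-delayed (d=0) and delayed weights
    Returns  (n_frames - delay, n_out) after a tanh nonlinearity."""
    n_frames = x.shape[0]
    n_out = weights.shape[2]
    out = np.zeros((n_frames - delay, n_out))
    for t in range(n_frames - delay):
        s = np.zeros(n_out)
        for d in range(delay + 1):
            s += x[t + d] @ weights[d]   # sum over the delayed copies
        out[t] = np.tanh(s)              # nonlinear squashing
    return out
```

With a 15-frame input and delay 2, the layer produces 13 output frames, which is how the delays shrink the time axis from layer to layer.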
11
[Figure: "Mean Square Error for TDNN training, OLGD" – MSE vs. number of training epochs (up to ~2.5×10^5); the curve jitters at each stage as the training set grows through 3, 6, 9, 24, 99, 249, 780, 3135 tokens]
Problems in TDNN Training
• Slow convergence
  – during error back-propagation, the weight increments are smeared when averaging over the sets of time-delayed weights
• Requires staged batch training
  – initially trained on a small number of tokens
  – after convergence, gradually add more tokens to the training set
  – the stages move from hand-selected, balanced sets to the unbalanced full set; bootstrapping begins at the 99-token stage
12
Training Solution
• A well-designed bootstrap training method
13
Bootstrap Training Introduction
• A bootstrap sample draws tokens from the original dataset by sampling with replacement
• The statistic of interest is calculated on each bootstrap sample
• Standard error, etc. are estimated from the bootstrap replications

  Training dataset:        X = (x1, x2, …, xn)
  Bootstrap samples:       X*1, X*2, …, X*B
  Bootstrap replications:  S(X*1), S(X*2), …, S(X*B)
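A minimal sketch of this resampling scheme in Python. The statistic (the mean), the sample size, and the number of replications are illustrative choices, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_se(x, stat, n_reps=200):
    """Estimate the standard error of statistic `stat` from
    `n_reps` bootstrap replications of the sample `x`."""
    reps = []
    for _ in range(n_reps):
        sample = rng.choice(x, size=len(x), replace=True)  # X*b
        reps.append(stat(sample))                          # S(X*b)
    return np.std(reps, ddof=1)

# Toy example: the bootstrap SE of the mean of 100 draws from
# N(5, 2^2) should come out near sigma/sqrt(n) = 0.2.
data = rng.normal(loc=5.0, scale=2.0, size=100)
se = bootstrap_se(data, np.mean)
```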
14
Bootstrap TDNN
• Because of the slow convergence of a TDNN, it is difficult (and time consuming) to repeat the training of a TDNN many times (more than 200 times for typical bootstrap experiments)
• Instead of resampling individual tokens, we build a bootstrap sample by resampling clusters of tokens
• We need a way to partition the input space into a small number of clusters, which we call categories
15
Bootstrap TDNN – Concept
• Begin with a good starting point
  – use a small set of hand-selected tokens to train the initial TDNN
• Partition the training set into subsets
  – use the initial TDNN to partition the training set into several subsets, which we call "categories"; a subsequent TDNN is trained on each category
• Expand each category
  – iteratively use the TDNN to partition the remaining (unutilized) training data into categories; merge the tokens in each category with the previous training data, and train a new TDNN on the merged data
• Merge final categories
  – merge the tokens in the categories (in a sequenced manner) and train the final TDNNs on the union of the categories
• Use an n-best list to combine the TDNN scores to give the final segment classification
16
Bootstrap TDNN – Notation
• Double thresholds – a high score threshold (φmax) and a low score threshold (φmin)
• Good score and bad score – if, and only if, one phoneme has a score above φmax and the other two phonemes have scores below φmin, the classification score is considered a good score; all other cases are treated as bad scores
• Segmentation rule:
  – Category 1 – good score and correct classification
  – Category 2 – good score and incorrect classification
  – Category 3 – bad score and correct classification
  – Category 4 – bad score and incorrect classification
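The double-threshold segmentation rule can be written directly as code. The threshold values passed as defaults here are hypothetical; the slides do not state the actual φmax and φmin used.

```python
def categorize(scores, true_label, phi_max=0.8, phi_min=0.2):
    """Assign a token to one of the four categories given its three
    TDNN output scores (dict: phoneme -> score) and its true label.
    phi_max and phi_min are hypothetical threshold values."""
    best = max(scores, key=scores.get)
    others = [s for p, s in scores.items() if p != best]
    # Good score: winner above phi_max AND both losers below phi_min.
    good = scores[best] > phi_max and all(s < phi_min for s in others)
    correct = (best == true_label)
    if good and correct:
        return 1
    if good and not correct:
        return 2
    if not good and correct:
        return 3
    return 4
```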
17
Bootstrap TDNN – Procedure
• (1) Use the balanced training set of 99 hand-selected tokens (i.e., 33 tokens each of /B/, /D/, and /G/) as the initial training set, and train a single TDNN
• (2) Use the current set of TDNNs (one TDNN initially, four TDNNs at later stages of training) to score and segment the complete set of training tokens into 4 categories (based on the double-threshold procedure)
• (3) Add selected and balanced (equal numbers of /B/, /D/ and /G/ tokens) training tokens from the above four categories to the old training data, and train a new set of four updated TDNNs
• (4) Iterate steps (2) and (3) until a stopping criterion is met
  – stopping criteria: there are no more new tokens to be added, or the desired TDNN performance is met
• (5) Merge the tokens from the four training categories in a sequenced manner to obtain a new TDNN; use a beam search to select the best sequence for merging the data from the four categories
18
Bootstrap TDNN – Illustration
• Use the 99 hand-selected tokens to train a TDNN, and partition the initial input space into 4 categories: I(1), II(1), III(1), IV(1)
[Figure: partition diagram with annotated token counts (688, 1081, 954, 525, 309, 429, 612, 863); a circle denotes the balanced tokens in each category]
19
Bootstrap TDNN – Illustration
• Merge the 99 tokens with the balanced tokens in one category, and train a TDNN
• Use that TDNN to partition the remaining space into 4 new categories: I(2), II(2), III(2), IV(2)
• Iterate until the TDNN performance is met, or there are no more new and balanced training tokens to add to the previous training data
[Figure: at iteration n, the TDNN trained on the 99 tokens together with I(1) … I(n) partitions the remainder into II(n+1), III(n+1), IV(n+1)]
20
Merge Categories – Model Lattice
• Use a forward lattice to merge the four different category TDNNs
  – after the initial categories are established, we create a bootstrap sample consisting of some or all of the categories, selected in a sequenced manner
• Partial lattice – select category samples without replacement
• Full lattice – select category samples with replacement
21
Partial Lattice
• Starting point: 4 TDNNs, one built on each of the 4 categories
• At each step:
  – select a category without replacement (i.e., one not among the previously selected categories)
  – merge the data in this category with the data from the previous step
  – build a TDNN on the union of the data
[Figure: lattice over Cat 1 – Cat 4 across Steps 1–4]
• Iterate until all the categories are merged together or the TDNN performance is met
• Use a beam search to select the best path(s) for merging categories, to obtain the best set of TDNNs
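The partial-lattice search can be sketched as a beam search over merge orders. The `train_eval` callback is a stand-in for training a TDNN on the merged tokens and scoring it on the complete training set; its name and interface are illustrative, not from the slides.

```python
def partial_lattice_search(categories, train_eval, beam_width=1):
    """Beam search over sequences of categories merged without
    replacement. `categories` maps name -> token list; `train_eval`
    takes a merged token list and returns a score to maximize."""
    beam = [((), [])]  # (sequence of category names, merged tokens)
    for _ in range(len(categories)):
        candidates = []
        for seq, tokens in beam:
            for name, toks in categories.items():
                if name in seq:          # without replacement
                    continue
                merged = tokens + toks
                score = train_eval(merged)
                candidates.append((score, seq + (name,), merged))
        # Keep only the best `beam_width` partial paths.
        candidates.sort(key=lambda c: -c[0])
        beam = [(seq, toks) for _, seq, toks in candidates[:beam_width]]
    return beam[0][0]  # best merge order found
```

With beam width 1 this reduces to a greedy choice of which category to merge next, which matches the "select the best net" rule described on the next slide.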
22
Partial Lattice
• Cross-node beam search – compare TDNNs that are trained on the same categories but in different sequences
  – e.g., Net 12 (Cat I then Cat II) vs. Net 21 (Cat II then Cat I), both trained on Cat I ∪ Cat II; if beam width = 1, select the better of Net 12 and Net 21 as Net (1,2)
• Beam search comparison criterion:
  – performance on the complete training set; or
  – weighted sum of performances on the 99 hand-selected tokens, the training set, and the complete training set
23
Full Lattice
• Select categories with replacement
• Regular beam search
[Figure: lattice over Cat 1 – Cat 4 from Step 0 through Step n; at every step any of the four categories may be selected again]
24
Experiments
• Training set and test set
  – TIMIT training set and test set without the SA sentences
  – *CV form tokens, where C denotes /B/, /D/ or /G/, V denotes any vowel or diphthong, and * denotes any previous sound
    • training set: /B/ – 1567, /D/ – 1460, /G/ – 658; 3685 tokens in total
    • test set: /B/ – 638, /D/ – 537, /G/ – 243; 1418 tokens in total
  – 13 MFCCs calculated on a 10 msec window with 5 msec window overlap
  – adjacent frames averaged, giving a 10 msec frame rate
  – segment length: 150 msec (15 frames, with the beginning of the succeeding vowel at the 10th frame)
• TDNN
  – inputs: 13 MFCCs × 15 frames → 195 input nodes
  – 1st hidden layer: 8 nodes, delay D1 = 2
  – 2nd hidden layer: 3 nodes, delay D2 = 4
  – output layer: 3 nodes, one each for /B/, /D/, and /G/
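The layer geometry above can be checked with a toy forward pass. The random weights are placeholders, and averaging the final activations over the remaining frames is an assumption about how the output is integrated over time; the slides do not specify that step.

```python
import numpy as np

def layer(x, w, delay):
    """Minimal time-delay layer: each output frame combines
    delay+1 consecutive input frames through a tanh."""
    T = x.shape[0] - delay
    return np.tanh(np.stack([
        sum(x[t + d] @ w[d] for d in range(delay + 1)) for t in range(T)
    ]))

x  = np.random.randn(15, 13)            # 15 frames of 13 MFCCs (195 inputs)
w1 = 0.1 * np.random.randn(3, 13, 8)    # D1 = 2 -> 3 weight copies
w2 = 0.1 * np.random.randn(5, 8, 3)     # D2 = 4 -> 5 weight copies
h1 = layer(x, w1, 2)                    # 1st hidden layer: (13, 8)
h2 = layer(h1, w2, 4)                   # 2nd hidden layer: (9, 3)
out = h2.mean(axis=0)                   # one score per /B/, /D/, /G/
```

The delays shrink the time axis from 15 frames to 13 and then to 9, before the per-frame activations are collapsed into the three phoneme scores.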
25
Baseline Results
• A single TDNN trained using staged batch training with 3, 6, 9, 24, 99, 249, 780, 3685 tokens
• Train the TDNN on the 99 tokens and test on the same 99 tokens (confusion matrix, # tokens; rows = true class):

        B    D    G
   B   32    0    1
   D    0   32    1
   G    0    2   31

• After staged batch training:
  – performance on the complete training set: 91.8%
  – performance on the test set: 82.0%
• After bootstrap training + lattice decision, only 68% of the training set is needed to achieve comparable results
27
Results of the 4 Category TDNNs
• TDNN performance after the stopping criteria are met
[Figure: four panels, "Performance of bootstrapping of category 1" through "category 4" – classification percentage (40–100%) vs. bootstrap iterations, each panel showing curves for the 99 hand-selected tokens, the current training set, the complete training set, and the test set]
28
Results of the 4 Category TDNNs
• Number of tokens used in each category
[Figure: "Number of tokens used in bootstrapping of TDNN" – token count (0–1200) vs. bootstrap iterations, one curve per category (cat 1 – cat 4)]
31
Results on Partial Lattice
• Use an n-best list method to score all 4 models and choose the highest score for each of /B/, /D/ and /G/; the maximum of the 3 highest scores provides the final classification decision
  – performance on the complete training set: 93.1%
  – performance on the complete test set: 82.1%
  – 35% error reduction on the complete training set, over the best model trained using data from all 4 categories
  – 18% error reduction on the complete training set compared with a single TDNN trained using all 3685 tokens, which achieved 91.8% on the training set and 82.0% on the test set

• Distribution of tokens over the categories (# tokens and percentage):

                                     Cat 1          Cat 2          Cat 3          Cat 4
  Lattice, complete train      2423 (65.75%)    21 (0.57%)  1014 (27.52%)   227 (6.16%)
  Lattice, test                 630 (44.43%)    33 (2.33%)   534 (37.66%)  221 (15.59%)
  Staged batch, complete train 2920 (79.24%)   105 (2.85%)   483 (13.11%)   177 (4.80%)
  Staged batch, test            977 (68.90%)   123 (8.67%)   184 (12.98%)   134 (9.45%)
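The n-best combination rule described on this slide translates directly into code. The score values in the usage below are illustrative; each dict stands for one category model's three outputs.

```python
def combine_scores(model_scores):
    """n-best combination: for each phoneme take the highest score
    across the category models, then pick the phoneme with the
    maximum of those highest scores.
    model_scores: list of dicts phoneme -> score, one per model."""
    best_per_phoneme = {
        p: max(m[p] for m in model_scores) for p in ('B', 'D', 'G')
    }
    return max(best_per_phoneme, key=best_per_phoneme.get)
```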
32
Bootstrap - Discussion
• Bootstrapping is an effective procedure to guarantee convergence in robust training of TDNNs
• The problem with bootstrapping is that the TDNN needs to be trained several times, which is quite time consuming
• To reduce the total number of training cycles, we use a beam search to prune the paths for merging different categories of data
• The results showed that, although trained on a relatively small portion of the training data (approximately 68%), the TDNN achieved better performance on the complete training set and a concomitant improvement on the test set
35
A Few Issues – Shifting
• The difficulty in TDNN training lies in the shift-invariant nature of the TDNN: for the voiced stop consonants, the stop region can appear in any 30 msec window within the 150 msec segment
• The previous vowel affects the B/D/G articulation and provides information useful for classification
• We can make a TDNN converge faster (to a better solution) by appropriately shifting frames for tokens in categories (II), (III), and (IV)
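One plausible way to implement the right-shift on a token's frame matrix is sketched below. Padding by repeating the first frame is an assumption; the slides do not specify how the vacated frames are filled.

```python
import numpy as np

def shift_right(token, n):
    """Shift a (frames x features) token right by n frames,
    keeping the segment length fixed. The front is padded by
    repeating the first frame (one plausible padding choice)."""
    pad = np.repeat(token[:1], n, axis=0)   # n copies of frame 0
    return np.vstack([pad, token])[:token.shape[0]]
```

Applied with n = 4, this moves the content of frames 1–11 into frames 5–15, which is the 4-frame shift used in the experiment on the next slide.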
36
Frame Shifting
• Train a TDNN on the 99 long-stop-gap hand-selected tokens; test on the same 99 tokens
• Shift right by 4 frames the tokens in categories (II), (III) and (IV)
• Train the TDNN again, and test on the 99 tokens

  Confusion matrices (# tokens; rows = true class):

          Before shifting        After shifting
          B    D    G            B    D    G
     B   32    1    0           31    1    1
     D    1   30    2            0   33    0
     G    0    0   33            0    0   33

  Number of tokens in each category:

  Category                 (i)  (ii)  (iii)  (iv)
  Before shift (# tokens)   87    0     8     4
  After shift (# tokens)    94    0     3     2
37
Discussion and Future Work
• Examine bootstrapping more closely, to reduce the total number of bootstrap iterations and to improve model accuracy
• Segment length affects classification accuracy – 150 msec can contain more than one short stop consonant → use DTW to map the input frames to a fixed number of frames, then use the aligned tokens to train a TDNN
• Investigate the TDNN for classification of other phoneme classes, e.g., voiced fricatives, diphthongs, etc.
• Use frame-based methods for classification of static sounds (e.g., vowels, unvoiced fricatives); use segment-based methods to recognize dynamic sounds (e.g., voiced stop consonants, diphthongs)
• Develop a bottom-up approach to build small but accurate classifiers first, then gradually classify broader classes in the phoneme hierarchy
38
Thank you!