Bootstrap TDNN for Classification of Voiced Stop Consonants (B,D,G) CAIP, Rutgers University Oct.13, 2006 Jun Hou, Lawrence Rabiner and Sorin Dusan


Page 1

Bootstrap TDNN for Classification of Voiced Stop

Consonants (B,D,G)

CAIP, Rutgers University

Oct.13, 2006

Jun Hou, Lawrence Rabiner and Sorin Dusan

Page 2

Outline

• Review TDNN basics

• Bootstrap TDNN using categories

• Model lattice

• Experiments

• Discussion and future work

Page 3

ASAT Paradigm

[Diagram: the numbered components (1–5) of the ASAT paradigm; component 5 is "Overall System Prototypes and Common Platform"]

Page 4

Previous Research

• Frame-based method
– used an MLP with a single frame of MFCCs to detect the 14 Sound Pattern of English features for the 61-phoneme TIMIT alphabet

• Major problems
– frames capture only static properties of speech
– dynamic information is needed when detecting dynamic features and sounds
– segment-based methods are needed rather than frame-based methods

Page 5

Phoneme Hierarchy

Phonemes
• Vowels (arranged High/Mid/Low × Front/Mid/Back): IY, IH, EH, AE, AA, AO, OW, UH, UW, ER, AX
• Diphthongs: AY, AW, EY, OY
• Semivowels: W, L, R, Y
• Consonants
  – Nasals: M, N, NG
  – Stops: voiced B, D, G; unvoiced P, T, K
  – Fricatives: voiced V, DH, Z, ZH; unvoiced F, TH, S, SH
  – Affricates: JH, CH
  – Whisper: H

39 phonemes – 11 vowels, 4 diphthongs, 4 semivowels, 20 consonants

• Bottom-up approach

• Classify the voiced stop consonants /B/, /D/ and /G/ using segment-based methods (the hardest classification problem among the consonants)

[Diagram annotations: Steps 1–4 mark successive levels of the hierarchy]

Page 6

Voiced Stop Consonants Classification
• B, D, and G tokens in the TIMIT training and test sets, without the SA sentences
  – *CV form of tokens
    • * – the preceding phoneme can be any sound
    • C – B, D, or G
    • V – a vowel or diphthong
  – 10 msec windows, 150 msec segments (15 frames)
  – the beginning of the vowel is at the 10th frame
  – distribution (in # tokens and percentage):

            Training set                     Test set
        B       D       G      Total     B       D       G      Total
      1567    1460     658     3685     638     537     243     1418
      42.5%   39.6%   17.9%             45.0%   37.8%   17.1%

[Segment diagram: any preceding phoneme(s), then the burst, then the vowel; the segment spans frames 1–15, with the vowel onset at the 10th frame]

Page 7

Voiced Stop Consonants Classification
• Example – /b/ (speech waveform, 10*log(energy), spectrogram)

[Figure: three /b/ tokens with short, medium, and long stop gaps; each column shows the waveform (amplitude vs. time), per-frame 10*log(energy), and a spectrogram (0–8000 Hz)]

“… that big goose …” — stop gap = 2260 samples (141 msec)
TIMIT alignment (start end phone): 20580 20850 dh, 20850 22920 ae, 22920 25180 tcl, 25180 25460 b, 25460 27320 ih, 27320 29270 gcl, 29270 29850 g, 29850 32418 ux, 32418 34506 s

“… and become …” — stop gap = 396 samples (24.74 msec)
TIMIT alignment: 48215 48855 ix, 48855 49684 n, 49684 50080 bcl, 50080 50321 b, 50321 51400 iy, 51400 52360 kcl, 52360 53463 k, 53463 54600 ah, 54600 55560 m

“… judged by …” — stop gap = 1100 samples (68.75 msec)
TIMIT alignment: 27770 28970 jh, 28970 31960 ah, 31960 32619 dcl, 32619 33810 jh, 33810 34280 dcl, 34280 34840 d, 34840 35940 bcl, 35940 36180 b, 36180 37800 ay

Page 8

Voiced Stop Consonants Classification
• Example – /d/ (speech waveform, 10*log(energy), spectrogram)

[Figure: three /d/ tokens with short, medium, and long stop gaps; each column shows the waveform, per-frame 10*log(energy), and a spectrogram (0–8000 Hz)]

“… scampered across …” — stop gap = 398 samples (24.88 msec)
TIMIT alignment (start end phone): 16560 17100 pcl, 17100 17530 p, 17530 18162 axr, 18162 18560 dcl, 18560 18800 d, 18800 19360 ix, 19360 20150 kcl, 20150 21080 k, 21080 21640 r

“A doctor …” — stop gap = 800 samples (50 msec)
TIMIT alignment: 2360 3720 ey, 3720 4520 dcl, 4520 4760 d, 4760 7060 aa, 7060 8920 kcl, 8920 9480 t, 9480 10443 axr

“Does …” — stop gap = silence = 1960 samples (122.5 msec)
TIMIT alignment: 0 1960 h#, 1960 2440 d, 2440 3800 ah, 3800 5413 z

Page 9

Voiced Stop Consonants Classification
• Example – /g/ (speech waveform, 10*log(energy), spectrogram)

[Figure: three /g/ tokens with short, medium, and long stop gaps; each column shows the waveform, per-frame 10*log(energy), and a spectrogram (0–8000 Hz)]

“May I get …” — stop gap = 430 samples (26.88 msec)
TIMIT alignment (start end phone): 0 2080 h#, 2080 2720 m, 2720 4344 ey, 4344 6080 ay, 6080 6510 gcl, 6510 6930 g, 6930 8141 ih, 8141 9170 tcl, 9170 9492 t

“… a good mechanic …” — stop gap = 960 samples (60 msec)
TIMIT alignment: 30340 36313 pau, 36313 36762 q, 36762 37720 ah, 37720 38680 gcl, 38680 39101 g, 39101 40120 uh, 40120 41000 dcl, 41000 41600 m, 41600 42200 ix

“…, give or take …” — stop gap = 3022 samples (188.88 msec)
TIMIT alignment: 24800 27822 pau, 27822 28280 g, 28280 29080 ih, 29080 29960 v, 29960 30840 axr, 30840 31480 tcl, 31480 32360 t, 32360 34200 ey, 34200 34966 kcl, 34966 35400 k

Page 10

Time-Delay Neural Network
• Developed by A. Waibel et al.
• Effective in classifying dynamic sounds, such as voiced stop consonants
• Introduces delays into the input of each layer of a regular MLP
• The inputs of a unit are multiplied by the un-delayed and delayed weights, then summed and passed through a nonlinear function
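The delayed-weight computation above can be sketched as a minimal NumPy layer. This is an illustrative helper of our own (`tdnn_layer`), not code from the original work, and it assumes a sigmoid nonlinearity and toy layer sizes:

```python
import numpy as np

def tdnn_layer(x, w, b):
    """One TDNN layer: each unit sees the current frame plus D delayed
    frames, multiplies them by the un-delayed and delayed weights,
    sums, and applies a sigmoid.
    x: (frames, in_dim); w: (D+1, in_dim, out_dim); b: (out_dim,)."""
    D = w.shape[0] - 1
    frames = x.shape[0] - D          # delay D consumes D extra frames
    out = np.zeros((frames, w.shape[2]))
    for t in range(frames):
        # sum contributions of delays 0..D at output time t
        acc = sum(x[t + d] @ w[d] for d in range(D + 1)) + b
        out[t] = 1.0 / (1.0 + np.exp(-acc))   # sigmoid nonlinearity
    return out

x = np.random.randn(15, 13)                    # 15 frames of 13 MFCCs
h = tdnn_layer(x, np.random.randn(3, 13, 8) * 0.1, np.zeros(8))
# h has shape (13, 8): a delay of 2 shrinks 15 input frames to 13
```

With delay D, the layer is equivalent to a 1-D convolution over time with kernel width D+1, which is what makes the TDNN shift-invariant.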

Page 11

Problems in TDNN Training
• Slow convergence
  – during error back-propagation, the weight increments are smeared when averaging over the sets of time-delayed weights
• Requires staged batch training
  – initially trained on a small number of tokens
  – after convergence, gradually add more tokens to the training set

[Plot: mean-square error vs. number of training epochs for TDNN training (on-line gradient descent), with staged training-set sizes 3, 6, 9, 24, 99, 249, 780, and 3135 tokens; the error jitters at each stage transition. Annotations mark the hand-selected balanced stages, the unbalanced stages, and the point where bootstrapping begins.]

Page 12

A well-designed bootstrap training method

Training Solution

Page 13

Bootstrap Training Introduction

• A bootstrap sample draws tokens from the original dataset by sampling with replacement
• The statistic of interest is calculated on each bootstrap sample
• Standard error, etc. are estimated from the bootstrap replications

[Diagram: training dataset X = (x1, x2, …, xn) → bootstrap samples X*1, X*2, …, X*B → bootstrap replications S(X*1), S(X*2), …, S(X*B)]
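The resampling scheme above can be sketched in a few lines of Python. `bootstrap_se` is a hypothetical helper (not from the original work) that estimates a statistic's standard error from B bootstrap replications:

```python
import random
import statistics

def bootstrap_se(data, stat, n_boot=200, seed=0):
    """Estimate the standard error of `stat` by drawing n_boot
    bootstrap samples of the data with replacement and computing
    the statistic on each replication."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]  # bootstrap sample X*b
        reps.append(stat(sample))                  # replication S(X*b)
    return statistics.stdev(reps)                  # spread of S(X*1..B)

# toy data: standard error of the sample mean
se = bootstrap_se([2.1, 3.5, 2.9, 4.0, 3.3], statistics.mean)
```

For a dataset of n points, each bootstrap sample also has n points, so individual tokens may appear several times or not at all in a given sample.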

Page 14

Bootstrap TDNN

• Because of the slow convergence of a TDNN, it is difficult (and time consuming) to repeat the training of a TDNN many times (more than 200 times for normal bootstrap training experiments)

• Instead of resampling individual tokens, we build a bootstrap sample by resampling clusters of tokens

• Need to find a way to partition the input space into a small number of clusters which we call categories
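Resampling clusters instead of individual tokens can be sketched as follows. `category_bootstrap` is a hypothetical helper, and the toy category contents are invented for illustration:

```python
import random

def category_bootstrap(categories, rng):
    """Build one bootstrap sample by drawing whole categories
    (clusters of tokens) with replacement, instead of drawing
    single tokens. `categories` maps a category id to its tokens."""
    drawn = [rng.choice(list(categories)) for _ in categories]
    # the bootstrap sample is the concatenation of the drawn clusters
    tokens = [t for c in drawn for t in categories[c]]
    return drawn, tokens

# toy partition of the input space into 4 categories
categories = {1: ["b1", "d1"], 2: ["g1"], 3: ["b2", "b3"], 4: ["d2"]}
rng = random.Random(0)
drawn, tokens = category_bootstrap(categories, rng)
```

Because only a handful of categories exist, far fewer distinct bootstrap samples are possible, so the TDNN need not be retrained hundreds of times.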

Page 15

Bootstrap TDNN – Concept
• Begin with a good starting point
  – use a small set of hand-selected tokens to train the initial TDNN
• Partition the training set into subsets
  – use the initial TDNN to partition the training set into several subsets, which we call “categories”; a subsequent TDNN is trained on each category
• Expand each category
  – iteratively use the TDNN to partition the remaining (unutilized) training data into categories; merge the tokens in the category with the previous training data, and train a new TDNN on the merged data
• Merge the final categories
  – merge the tokens in the categories (in a sequenced manner) and train the final TDNNs on the union of the categories
• Use an n-best list to combine the TDNN scores to give the final segment classification

Page 16

Bootstrap TDNN – Notation

• Double thresholds – a high score threshold (φmax) and a low score threshold (φmin)
• Good score and bad score – if, and only if, one phoneme has a score above φmax and the other two phonemes have scores below φmin, the classification score is considered a good score; all other cases are treated as bad scores
• Segmentation rule:
  – Category 1 — good score and correct classification
  – Category 2 — good score and incorrect classification
  – Category 3 — bad score and correct classification
  – Category 4 — bad score and incorrect classification
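The double-threshold segmentation rule maps directly to a small function. This is an illustrative sketch; the threshold values 0.8 and 0.2 are assumptions, not the ones used in the experiments:

```python
def categorize(scores, true_label, phi_max=0.8, phi_min=0.2):
    """Assign a scored token to one of the four categories.
    `scores` maps each phoneme (B/D/G) to its TDNN output score."""
    best = max(scores, key=scores.get)
    others = [s for p, s in scores.items() if p != best]
    # good score: one phoneme above phi_max AND both others below phi_min
    good = scores[best] > phi_max and all(s < phi_min for s in others)
    correct = (best == true_label)
    if good:
        return 1 if correct else 2
    return 3 if correct else 4

cat = categorize({"B": 0.9, "D": 0.1, "G": 0.05}, "B")  # → 1
```

A confidently correct token (category 1) is the safest to add to the next training round, while category 4 tokens are both low-confidence and misclassified.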

Page 17

Bootstrap TDNN – Procedure
• (1) Use the balanced training set of 99 hand-selected tokens (i.e., 33 tokens each of /B/, /D/, and /G/) as the initial training set, and train a single TDNN
• (2) Use the current set of TDNNs (one TDNN initially, four TDNNs at later stages of training) to score and segment the complete set of training tokens into 4 categories (based on the double-threshold procedure)
• (3) Add selected and balanced (equal numbers of /B/, /D/ and /G/ tokens) training tokens from the above four categories to the old training data, and train a new set of four updated TDNNs
• (4) Iterate steps (2) and (3) until a stopping criterion is met
  – stopping criteria: there are no more new tokens to be added; the desired TDNN performance is met
• (5) Merge the tokens from the four training categories in a sequenced manner to obtain a new TDNN; use a beam search to select the best sequence for merging the data from the four categories

Page 18

Bootstrap TDNN – Illustration
• Use the 99 hand-selected tokens to train a TDNN, and partition the initial input space into 4 categories

[Diagram: the 99-token TDNN splits the training space into categories I(1), II(1), III(1), IV(1); a circle denotes the balanced tokens in each category. Token counts shown: 688, 1081, 954, 525, 309, 429, 612, 863]

Page 19

Bootstrap TDNN – Illustration
• Merge the 99 tokens with the balanced tokens in one category, and train a TDNN
• Use the TDNN to partition the remaining space into 4 categories
• Iterate until the TDNN performance is met, or there are no more new and balanced training tokens to add to the previous training data

[Diagram: the 99 tokens plus I(1) train a TDNN that partitions the remaining space into I(2), II(2), III(2), IV(2); after n iterations, the TDNN trained on 99 + I(1) + … + I(n) partitions the remainder into II(n+1), III(n+1), IV(n+1)]

Page 20

Merge Categories – Model Lattice
• Use a forward lattice to merge the four category TDNNs
  – after the initial categories are established, we create a bootstrap sample consisting of some or all of the categories, selected in a sequenced manner
• Partial lattice – select category samples without replacement
• Full lattice – select category samples with replacement

Page 21

Partial Lattice
• Starting point: 4 TDNNs, one trained on each of the 4 categories
• At each step:
  – select a category without replacement of the previous categories
  – merge the data in this category with the data from the previous step
  – build a TDNN on the union of the data
• Iterate until all the categories are merged together or the TDNN performance is met

[Diagram: Steps 1–4 of the lattice over Cat 1, Cat 2, Cat 3, Cat 4]

• Use a beam search to select the best path(s) for merging categories to obtain the best set of TDNNs
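The stepwise search over merge orders can be sketched as a generic beam search. `partial_lattice` is a hypothetical helper, and the `score` callable is a toy stand-in (here simply `sum`) for what would really be "train a TDNN on the merged categories and measure its accuracy":

```python
def partial_lattice(categories, score, beam_width=2):
    """Beam search over category merge sequences without replacement:
    extend each surviving path with an unused category, score the
    extended path, and keep only the best `beam_width` paths."""
    paths = [((), 0.0)]                 # (sequence of categories, score)
    for _ in categories:                # one merge step per category
        cands = []
        for seq, _ in paths:
            for c in categories:
                if c not in seq:        # without replacement
                    new_seq = seq + (c,)
                    cands.append((new_seq, score(new_seq)))
        cands.sort(key=lambda p: p[1], reverse=True)
        paths = cands[:beam_width]      # prune to the beam width
    return paths[0]                     # best full merge sequence

best_seq, best_score = partial_lattice([1, 2, 3, 4], score=sum)
```

Pruning to a small beam width is what keeps the number of TDNN trainings manageable: the full lattice over 4 categories has 4! = 24 complete orderings, but a beam of width 2 trains only a handful of nets per step.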

Page 22

Partial Lattice
• Cross-node beam search – compare TDNNs that are trained on the same categories but in different sequences

[Diagram: Net 1 (Cat I) and Net 2 (Cat II) each extend to the union Cat I ∪ Cat II, giving Net 12 and Net 21, which meet at node Net(1,2); if the beam width is 1, select the better of Net 12 and Net 21]

• Beam search comparison criterion:
  – performance on the complete training set; or
  – a weighted sum of performances on the 99 hand-selected tokens, the training set, and the complete training set

Page 23

Full Lattice
• Select categories with replacement
• Regular beam search

[Diagram: Steps 0 through n over Cat 1 – Cat 4; because selection is with replacement, each step may reselect any category]

Page 24

Experiments

• Training set and test set
  – TIMIT training and test sets without the SA sentences
  – *CV form tokens, where C denotes /B/, /D/ or /G/, V denotes any vowel or diphthong, and * denotes any preceding sound
    • training set: /B/ – 1567; /D/ – 1460; /G/ – 658; 3685 tokens in total
    • test set: /B/ – 638; /D/ – 537; /G/ – 243; 1418 tokens in total
  – 13 MFCCs calculated on a 10 msec window with 5 msec window overlap
  – adjacent frames averaged, giving a 10 msec frame rate
  – segment length: 150 msec (15 frames, with the beginning of the succeeding vowel at the 10th frame)
• TDNN
  – inputs: 13 MFCCs × 15 frames → 195 input nodes
  – 1st hidden layer: 8 nodes, delay D1 = 2
  – 2nd hidden layer: 3 nodes, delay D2 = 4
  – output layer: 3 nodes, one each for /B/, /D/, and /G/
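The time dimension through this architecture can be traced with a small shape calculator. `tdnn_shapes` is our own sketch; treating the output layer as integrating over all remaining frames (delay `None`) is an assumption about this network, not something the slides state:

```python
def tdnn_shapes(frames=15, coeffs=13, layers=((8, 2), (3, 4), (3, None))):
    """Trace (frames, units) through the TDNN described above:
    a layer with delay D consumes D extra input frames; the final
    layer (delay None) is assumed to integrate over all frames."""
    dims = [(frames, coeffs)]           # input: 15 frames x 13 MFCCs
    for units, delay in layers:
        frames = 1 if delay is None else frames - delay
        dims.append((frames, units))
    return dims

shapes = tdnn_shapes()
# → [(15, 13), (13, 8), (9, 3), (1, 3)]
```

The 15 × 13 input corresponds to the 195 input nodes; D1 = 2 leaves 13 frames at the first hidden layer and D2 = 4 leaves 9 frames at the second, before the three /B/, /D/, /G/ outputs.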

Page 25

Baseline Results

• A single TDNN trained using staged batch training of 3, 6, 9, 24, 99, 249, 780, and 3685 tokens
• Train the TDNN on the 99 tokens and test on the same 99 tokens:

        B    D    G
   B   32    0    1
   D    0   32    1
   G    0    2   31

• After staged batch training:
  – performance on the complete training set: 91.8%
  – performance on the test set: 82.0%
• After bootstrap training + the lattice decision, only 68% of the training set is needed to achieve comparable results

Page 26

Results of the 4 Category TDNNs
• TDNN performance after the stopping criteria are met

[Plots: classification percentage (40–100%) vs. bootstrap iterations for the bootstrapping of categories 1–4, with curves for the 99 hand-selected tokens, the training set, the complete training set, and the test set; the four categories run for roughly 4, 8, 8, and 15 iterations respectively]

Page 27

Results of the 4 Category TDNNs
• Number of tokens used in each category

[Plot: number of training tokens (0–1200) vs. bootstrap iterations for each of cat 1 – cat 4]

Page 28

Results on Partial Lattice

• Use an n-best list method to score all 4 models and choose the highest score for each of /B/, /D/ and /G/; the maximum of the 3 highest scores gives the final classification decision
  – performance on the complete training set: 93.1%
  – performance on the complete test set: 82.1%
  – 35% error reduction on the complete training set, over the best model trained using data from all 4 categories
  – 18% error reduction on the complete training set, compared with a single TDNN trained using all 3685 tokens, which achieved 91.8% on the training set and 82.0% on the test set

                              Cat 1           Cat 2         Cat 3           Cat 4
Lattice        Complete train 2423 (65.75%)   21 (0.57%)    1014 (27.52%)   227 (6.16%)
               Test           630 (44.43%)    33 (2.33%)    534 (37.66%)    221 (15.59%)
Staged batch   Complete train 2920 (79.24%)   105 (2.85%)   483 (13.11%)    177 (4.80%)
training       Test           977 (68.90%)    123 (8.67%)   184 (12.98%)    134 (9.45%)

Page 29

Bootstrap - Discussion

• Bootstrapping is an effective procedure to guarantee convergence in robust training of TDNNs

• The problem with bootstrapping is that the TDNN needs to be trained several times, which is quite time consuming

• To reduce the total number of training cycles, we use a beam search to prune the paths for merging different categories of data

• The results showed that, although trained on a relatively small portion of all the training data (approximately 68%), the TDNN achieved better performance on the complete training set and concomitant improvement on the test set

Page 30

A Few Issues – Shifting

• The difficulty in TDNN training lies in the shift-invariant nature of the TDNN; for the voiced stop consonants, the stop region can appear in any 30 msec window within the 150 msec segment

• The previous vowel affects the BDG articulation and provides information useful for classification

• We can make a TDNN converge faster (to a better solution) by appropriately shifting frames for tokens in categories (II), (III), and (IV).

Page 31

Frame Shifting
• Train the TDNN on the 99 long-stop-gap hand-selected tokens; test on the same 99 tokens
• Shift tokens in categories (II), (III) and (IV) right by 4 frames
• Train the TDNN again, and test on the 99 tokens

Classification (# tokens):

        Before shifting       After shifting
         B    D    G           B    D    G
   B    32    1    0          31    1    1
   D     1   30    2           0   33    0
   G     0    0   33           0    0   33

Number of tokens in each category:

   Category                 (i)   (ii)  (iii)  (iv)
   Before shift (# tokens)   87     0     8      4
   After shift (# tokens)    94     0     3      2
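The 4-frame shift itself is a one-line array operation. A sketch with NumPy; zero-padding the vacated leading frames is an assumption, since the slides do not specify how those frames are filled:

```python
import numpy as np

def shift_right(segment, n=4):
    """Shift a (frames, coeffs) token right by n frames so the burst
    lands later in the analysis window; the first n frames are
    zero-padded (an assumption, not specified in the slides)."""
    shifted = np.zeros_like(segment)
    shifted[n:] = segment[:-n]
    return shifted

token = np.arange(15 * 13, dtype=float).reshape(15, 13)  # toy 15x13 token
out = shift_right(token, 4)
```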

Page 32

Discussion and Future Work
• Examine bootstrapping more closely, to reduce the total number of bootstrap iterations and to improve model accuracy
• Segment length affects classification accuracy – 150 msec can contain more than one short stop consonant → use DTW to map the input frames to a fixed number of frames, then use the aligned tokens to train a TDNN

• Investigate TDNN for classification of other phoneme classes, e.g., voiced fricatives, diphthongs, etc.

• Use frame-based methods for classification of static sounds (e.g. vowels, unvoiced fricatives); use segment-based methods to recognize dynamic sounds (e.g., voiced stop consonants, diphthongs)

• Develop a bottom-up approach to build small but accurate classifiers first, then gradually classify broader classes in the phoneme hierarchy

Page 33

Thank you!