Phonetic Tree Analysis (PTA) and its application to ...mi.eng.cam.ac.uk/seminars/speech/minematsu_seminar1.pdf · its application to estimation of the segmental intelligibility. Introduction

Nobuaki MINEMATSU
UNIVERSITY OF TOKYO

Phonetic Tree Analysis (PTA) and its application to estimation of the segmental intelligibility

Introduction

Development of Japanese English read speech database

Corpus-based analysis of JE production
  Phonetic Tree Analysis (PTA)

Corpus-based analysis of JE perception
  Segmental intelligibility estimation without acoustic matching

Application of PTA-based intelligibility estimation
  What is the next target phoneme in efficient learning ?
  Prediction of the student’s future

Some interesting issues on PTA and phonetics

Conclusions and future works

Outline

Current situation of English education in Japan
  3+3+2 years of learning at the shortest
  16 / 16 in TOEIC test (1998)

From “native-sounding” to “intelligible” pronunciation
  Foreign accents don’t always disturb smooth communication.
  Listeners easily adapt themselves to the speaker’s pronunciation.

What is the intelligibility of the pronunciation ?
  Ease of access to a listener’s mental lexicon
  Not only a phonetics-based but also a cognition-based strategy for LL
  What is the model for listeners’ ability to adapt ?

A boom -- CALL system --
  Acoustic matching between students and teachers
  “Native-sounding” oriented
  Still unstable, especially for children’s and elderly speech
  Basically, no adaptation techniques are allowed.

Introduction (#1)

No two students are the same.
  No methods to represent the differences concisely and accurately

Current CALL technologies = error detection and scoring
Required technologies = description of how the student was, is, and will be

Phonetic Tree Analysis (PTA)
  Extraction of the phonetic structure embedded in the pronunciation
  Abstract but phonetically meaningful visualization of how the student is
  Cancellation of microphone characteristics and speakers’ individuality
  Perceptual representation of the pronunciation structure
  No need for acoustic matching with teachers’ speech
  Can be applied to children’s and elderly speech immediately.
  Can be applied to estimate the segmental intelligibility.
  Can be applied to indicate the next target in learning.
  Can be applied to roughly estimate the student’s future.

Introduction (#2)

From phonetics to phonetics+cognition
  Conventional --- matching between two sounds ---
  Proposed --- matching between two structures ---

Introduction (#3)

Conventional: students’ speech vs. teachers’ HMMs
  Based on speech sci. and eng.
  Affected by mic, BN, size, shape, sex, age, individuality
  Native-sounding oriented

Proposed: students’ speech vs. mental lexicon
  Based also on cognition
  Complete cancellation of static multiplicative noise
  Intelligibility oriented

Technical and educational requirements for the design

Technical issues
  Focus only on commonly observed acoustic distortions
  Adequate selection of speakers and recording strategy
    Random selection of 100 male and 102 female university students
    Reading given sheets, not spontaneous conversation
    Phonetic symbols and prosodic symbols
    Pronunciation practice allowed
    Repetition until the speakers judged that they did it right

Educational issues
  Phonetic (segmental) aspect and prosodic aspect
  Word reading and sentence reading
  Sentence set / word set x phonetic set / prosodic set

Development of JE speech database (#1)

Word / sentence sets for the phonetic aspect

  set                                                             size
  Phonemically-balanced words                                      300
  Minimal pair words                                               600
  TIMIT-based phonemically-balanced sentences                      460
  Sentences with phoneme sequences difficult to produce fluently    32
  Sentences designed for test set                                  100

  Minimal pair words include some unknown words.

Word / sentence sets for the prosodic aspect

  set                                                     size
  Words and compound words with various accent patterns    109
  Sentences with various intonation patterns                94
  Sentences with various rhythm patterns                   121

  Intonational differences caused by commas, focused words, syntactic structures, references, and so on
  Prosodic symbols assigned by English teachers

Symbols and marks assigned to the reading sheets

Phonetic symbols
  B, D, G, P, T, K, JH, CH, S, SH, Z, ZH, F, TH, V, DH, M, N, NG, L, R, W, Y, HH, IY, IH, EH, EY, AE, AA, AW, AY, AH, AO, OY, OW, UH, UW, ER, AXR, AX

Prosodic symbols
  Lexical stress = primary (1), secondary (2), and unstressed (0)
  Sentence stress (rhythm) = three-level stress (@, +, -) and phrase break (/)
  Intonation = adequately directed arrows

Examples

More examples


Recording

Speakers
  Quasi-random selection of 100 male and 102 female univ. students

Recording task
  All the sentences divided into 8 subsets
  All the words divided into 5 subsets
  1 sentence subset (125) + 1 word subset (225) / speaker

Recording procedures
  [Before R] Speakers were asked to practice the pronunciation.
  [During R] They were also asked to repeat until they judged that they did it right.
    Correct English, at least for the students
  [After R] Every utterance was checked by technical staff.
    The remaining errors are due to the speakers’ lack of knowledge of the correct articulation of English sounds.


Pronunciation proficiency rating by teachers

Three criteria for the rating
  Phonetic aspect of the pronunciation
  Rhythmic aspect of the pronunciation
  Intonational aspect of the pronunciation

5-degree scale + 5 American teachers
5,700 words + 3,800 sentences / rater

Distribution of the proficiency

[Figure: histogram of #students (0 to 50) against proficiency score (1.5 to 5.0) for the segmental, rhythm, and intonation ratings]

Training of AE and JE SI-HMMs
  Monophones with a single mixture for easy visualization
  Transcription automatically generated with PRONLEX
    Pronunciation errors were not represented in the transcription.

Corpus analysis of JE production (SI, #1)

AD              16 bit / 16 kHz sampling
Window          Hamming, 25 ms length, 10 ms rate
Pre-emphasis    1 - 0.97 z^-1
Parameters      12 MFCC + 12 ΔMFCC + ΔPower (25 dim)
Phoneme set     ae, ah, ch, dh, eh, nx, wh, ih, jh, oy, er, sh, th, uh, aw, ay, zh, aa,
                b, ao, d, ey, f, g, hh, iy, k, l, m, o, ow, p, r, s, t, uw, v, w, ax, y, z
HMMs            1-mixture monophones with diagonal matrices (5 st. + 3 dis.)
Training        AE = 245 male Americans, 25,652 sentences (WSJ)
                JE = 68 male Japanese, 8,282 utterances
Initial models  Models built with TIMIT (4,346 utterances)
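The windowing and pre-emphasis conditions listed above can be sketched as follows. This is only a minimal illustration under the stated settings (16 kHz sampling, 25 ms Hamming window, 10 ms rate, pre-emphasis 1 - 0.97 z^-1); the function names are hypothetical, the MFCC computation itself is omitted, and it is not the toolkit used for the analysis.

```python
import math


def pre_emphasis(signal, coeff=0.97):
    """Apply the filter 1 - coeff * z^-1 to a list of samples."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]


def hamming(length):
    """Hamming window of the given length (in samples)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_signal(signal, rate=16000, win_ms=25, shift_ms=10):
    """Cut the signal into overlapping, Hamming-windowed frames."""
    win = rate * win_ms // 1000      # 400 samples at 16 kHz
    shift = rate * shift_ms // 1000  # 160 samples at 16 kHz
    window = hamming(win)
    return [[s * w for s, w in zip(signal[start:start + win], window)]
            for start in range(0, len(signal) - win + 1, shift)]


# One second of a constant toy signal (hypothetical input).
toy_signal = [1.0] * 16000
toy_frames = frame_signal(pre_emphasis(toy_signal))
```

Each frame would then be converted into the 25-dimensional parameter vector before HMM training.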

Magnitude of variances of AE and JE
  Relative difference in averaged variances over cepstral dimensions
  JE > AE, although #JE <<< #AE and the JE data are carefully read samples.
  Due to inter-speaker differences in pronunciation proficiency


Phoneme pairs difficult for Japanese to discriminate
  Ratio of state distance in JE to that in AE
  Larger confusion is clearly seen in each pair.
  Mid and low vowels are replaced by a Japanese mid and low vowel /あ/.

Schwa and other vowels in AE and JE
  Five nearest phonemes (not vowels) to schwa in AE and JE
  Schwa is one of the most difficult sounds for Japanese to produce.
  Various vowels are found in AE but only mid and low vowels in JE.
  Mid and low vowels of JE = /あ/
    Japanese perceive /あ/ in native schwa sounds.
    Japanese produce /あ/ for schwa.

Phonetic Tree Analysis (PTA) with SI HMMs
  State-level distance matrices for AE and JE
  Bhattacharyya distance measure
  Hierarchical clustering for each matrix with Ward’s method
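A state-level distance matrix of this kind can be sketched as follows, assuming each HMM state is a single Gaussian with a diagonal covariance matrix (as in the 1-mixture monophones above). The function names and the toy state parameters are hypothetical and are not taken from the corpus.

```python
import math


def bhattacharyya(mean_a, var_a, mean_b, var_b):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    dist = 0.0
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        v = 0.5 * (va + vb)                       # averaged variance
        dist += 0.125 * (ma - mb) ** 2 / v        # mean term
        dist += 0.5 * math.log(v / math.sqrt(va * vb))  # variance term
    return dist


def distance_matrix(states):
    """states: dict mapping a state name to (mean vector, variance vector)."""
    names = list(states)
    matrix = [[bhattacharyya(*states[a], *states[b]) for b in names]
              for a in names]
    return names, matrix


# Toy 2-dimensional state models (hypothetical values).
toy_states = {
    "aa-2": ([0.0, 0.0], [1.0, 1.0]),
    "ae-2": ([2.0, 0.0], [1.0, 1.0]),
    "m-2": ([0.0, 6.0], [1.0, 4.0]),
}
names, matrix = distance_matrix(toy_states)
```

The resulting matrix is symmetric with a zero diagonal, which is the form a hierarchical clustering step takes as input.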


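[Figure: the HMM states of each phoneme (/aa/, /ae/, /th/, /m/, ...; e.g. aa-2, aa-3, aa-4, ..., m-2, m-3, m-4, ...) index the rows and columns of a state-level distance matrix with a zero diagonal, which is converted into a tree diagram]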
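The step from a distance matrix to a tree diagram can be sketched with the Lance-Williams update for Ward’s method. This is a minimal pure-Python illustration, not the implementation used for the slides; applying Ward’s update to Bhattacharyya distances (which are not Euclidean) is treated here as an assumption, and the function name and toy matrix are hypothetical.

```python
import math


def ward_linkage(dist):
    """Agglomerative clustering with the Lance-Williams update for Ward's method.

    dist: symmetric list of lists of pairwise distances between n items.
    Returns a list of merges (cluster_a, cluster_b, height, new_size);
    new clusters are numbered n, n + 1, ... in the order they are created.
    """
    n = len(dist)
    size = {i: 1 for i in range(n)}
    # Squared distances between active clusters, keyed by (smaller id, larger id).
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(size) > 1:
        (a, b), best = min(d2.items(), key=lambda item: item[1])
        na, nb = size[a], size[b]
        updated = {}
        for k, nk in size.items():
            if k in (a, b):
                continue
            dak = d2[(min(a, k), max(a, k))]
            dbk = d2[(min(b, k), max(b, k))]
            updated[(k, next_id)] = ((na + nk) * dak + (nb + nk) * dbk
                                     - nk * best) / (na + nb + nk)
        # Drop every pair that involves one of the merged clusters.
        d2 = {pair: v for pair, v in d2.items()
              if a not in pair and b not in pair}
        d2.update(updated)
        del size[a], size[b]
        size[next_id] = na + nb
        merges.append((a, b, math.sqrt(best), na + nb))
        next_id += 1
    return merges


# Toy example: four points on a line at 0, 1, 10 and 11 (hypothetical data).
points = [0.0, 1.0, 10.0, 11.0]
toy_dist = [[abs(p - q) for q in points] for p in points]
toy_merges = ward_linkage(toy_dist)
```

The merge heights play the role of the vertical axis of the tree diagram.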

Where are the confusing phonemes in the two trees ?

[Figure: two tree diagrams of the HMM states, one for each of the two model sets (AE and JE); the leaves are phoneme states such as ae2, eh2, ah2, ..., silB2, sp2, silE4, and the vertical axis runs from 0 to 10]


r/l ? s/th ? f/h ? ih/iy ?

[Figure: subtrees of the two trees containing the states of /r/ and /l/, /s/ and /th/, /f/ and /h/, and /ih/ and /iy/; th = θ, ih = ɪ, iy = iː]

Phonetic interpretation of the tree structure


[Figure: the two tree diagrams again, with a “c4 & v2” subtree marked as vowel insertion]

Phonetic interpretation of the tree structure


[Figure: the two tree diagrams again, annotated with the composition of their large clusters; recoverable labels: “48V+9N+6L+6G”, “0V+0N+0L+0G”, “(48V+9N+6L+6G) x 0.7”, “all the v2s”, “all the v3s and v4s”]

Speaker Dependent tree diagrams
  State-level --> phoneme-level
    Phoneme-level distance matrix
    1-state HMM with 1 mixture (GM)
    approx. 60 sentences to train a set of HMMs
    All the phonemes - /zh/ (/ʒ/) - diphthongs (#phonemes = 34)
  Representation of how the student is
    100 + 102 trees
    No two students are the same.
    20 Americans, each reading 3 to 4 subsets

Corpus analysis of JE production (SD, #1)

A PTA example of an American teacher (subset = S3)

Corpus analysis of JE production (SD, #2)

[Figure: phoneme-level tree diagram; vertical axis: merging distortion. Speaker : USA/F12, Set : S3, Score : 5.00]

A PTA example of a poor student (subset = S3)

[Figure: phoneme-level tree diagram; vertical axis: merging distortion. Speaker : TUT/M03]

A PTA example of an intermediate student (subset = S3)

[Figure: phoneme-level tree diagram; vertical axis: merging distortion. Speaker : TKT/M03]

A PTA example of a good student (subset = S3)

[Figure: phoneme-level tree diagram; vertical axis: merging distortion. Speaker : NAR/F05]



[Figure: two further phoneme-level tree diagrams shown without captions; vertical axis: merging distortion]

Characteristics of PTA
  Can describe and label how the student is.
    Clustering the trees defines typical states of pronunciation learning.
  Can give an abstract and easy-to-understand visualization.
    No physics, no acoustics, but educationally meaningful enough
  Requires no acoustic matching with teachers.
    Never has “mismatch” problems.
    Can be applied to children and elderly people without any difficulty.
  Only focuses on inter-phoneme Bhattacharyya distances.
    Shift and rotation do not change the distances.
    Scaling does not change them either if variances are proportionally modified.
    A part of MLLR with a single matrix does not change the distances.
    Extraction of only the phonetic structure by ignoring some other factors.
  Perceptual representation of the pronunciation structure.
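The invariance of the Bhattacharyya distance can be checked numerically. The sketch below assumes two 2-dimensional full-covariance Gaussians and applies the same invertible affine transform x -> Ax + b to both of them (mean -> A mean + b, covariance -> A cov A^T); all names and values are hypothetical.

```python
import math


def mat_mul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]


def det(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def bhattacharyya_2d(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two 2-D full-covariance Gaussians."""
    avg = [[0.5 * (cov1[i][j] + cov2[i][j]) for j in range(2)]
           for i in range(2)]
    d = det(avg)
    inv = [[avg[1][1] / d, -avg[0][1] / d],
           [-avg[1][0] / d, avg[0][0] / d]]
    diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    maha = sum(diff[i] * inv[i][j] * diff[j]
               for i in range(2) for j in range(2))
    return 0.125 * maha + 0.5 * math.log(d / math.sqrt(det(cov1) * det(cov2)))


def affine(mu, cov, a, b):
    """Transform a Gaussian by x -> a x + b."""
    new_mu = [a[i][0] * mu[0] + a[i][1] * mu[1] + b[i] for i in range(2)]
    new_cov = mat_mul(mat_mul(a, cov), transpose(a))
    return new_mu, new_cov


# Two toy phoneme models (hypothetical parameters).
mu1, cov1 = [0.0, 0.0], [[1.0, 0.2], [0.2, 2.0]]
mu2, cov2 = [1.5, -1.0], [[2.0, -0.3], [-0.3, 1.0]]

# Rotation by 30 degrees combined with per-axis scaling, plus a shift.
c, s = math.cos(math.pi / 6), math.sin(math.pi / 6)
A = [[2.0 * c, -2.0 * s], [0.5 * s, 0.5 * c]]
b = [5.0, -3.0]

before = bhattacharyya_2d(mu1, cov1, mu2, cov2)
after = bhattacharyya_2d(*affine(mu1, cov1, A, b), *affine(mu2, cov2, A, b))
```

Here `before` and `after` agree up to rounding, because the same transform is applied to both models; a transform that differs from phoneme to phoneme would not preserve the distances.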


Labeling of phonemes
  Phonemic transcriptions generated by forced alignment
  Sequence of phonemes seen in speech of some seconds

Labeling of rhythms
  4-level (word/phrase/sentence/above) categorization of stress
  Sequence of (un)stressed syllables seen in speech of some seconds

Labeling of intonation
  (x-)ToBI, Fujisaki model, piecewise concatenation model, ...
  Change of F0 seen in speech of some seconds

Labeling of pronunciation
  Spectrograms for the individual phonemes
    Only for researchers, not for education
  PTA for the overall description of pronunciation
    Change of pronunciation seen in training of some months or years

Characteristics of PTA
Can do writing and labeling of how the student is.







A state transition model of pronunciation learning

[Figure: state transition diagram with nodes labeled L1, L2, and L3]

Characteristics of PTA
Can do writing of how the student is.







MLLR = affine transform (AT) of cepstra
  AT = scaling + warping + rotation + shift
    Structure in an object is kept before and after the transform.

MLLR with a single transform matrix
  MLLR adaptation with N (>>1) matrices in speech recognition
    Different parts of a triphone set have different speaker individuality.
    Speaker individuality can be modeled by a single GMM.
  MLLR adaptation in HMM-based speech synthesis
    HMMs of a speaker are converted into those of another with 5 sentences.
    N ～ 1
  Good cancellation of speaker individuality

Characteristics of PTA
Can do writing of how the student is.







NO !
  “Structuralism” by R. Jakobson, M. Halle, G. Fant, etc.
    Concise and accurate description of sounds in a language
    Phoneme clustering based upon distinctive features

Is PTA new ? (#1)

“The Sound Pattern of Russian” by M. Halle

Is PTA new ? (#2)

YES !
  It’s non-native speech sounds.
    No fixed mapping between distinctive features and phonemes
    Some state-tying in HMM training and mixture-tying in MLLR adaptation

YES !!
  Technical meaning of ignoring phonemes’ absolute positions in the acoustic space (AS)
    Complete cancellation of static and multiplicative distortions
    Robust also in rotation and scaling

YES !!!
  Accordance between phonetic structure and lexical structure
    Cognition-based goodness of the segmental aspect of the pronunciation
    Not native-sounding but intelligibility-based scoring

Is PTA new ? (#1)

From phonetics to phonetics+cognition
  Conventional --- matching between two sounds ---
  Proposed --- matching between two structures ---

Corpus analysis of JE perception (SI, #1)

Conventional: students’ speech vs. teachers’ HMMs
  Based on speech sci. and eng.
  Affected by mic, BN, size, shape, sex, age, individuality
  Native-sounding oriented

Proposed: students’ speech vs. mental lexicon
  Based also on cognition
  Complete cancellation of static multiplicative noise
  Intelligibility oriented
  Listeners’ adaptation or cancellation
  Perceptual representation of the phonetic structure ?

Reduced acoustic space of JE phonemes
  #phonemes of J (approx. 25) < #phonemes of AE (approx. 40)
    1-to-N mapping
  Reduction of acoustic space = increase of lexical density
    Confusedness = segmental unintelligibility = cognitive load
  Quantitative estimation of the lexical density increase

Cohort Model of word perception
  A left-to-right model of the isolated word perception process
    Estimation of the cohort size with the initial portion of input speech
  cohort = a set of activated words

[Figure: as the input “straight” unfolds (s…, str…, straight), #candidates stays above 1 and finally becomes 1; a tree-structured lexicon branches from the start node into phonemes such as /s/, /b/, /r/, ..., /t/, /eh/, /ih/, /ey/, /ow/, /r/, ...]

Acoustic unit of cohort development
  Perceptual unit of English = syllables
  Syllabification with tsylb v2.1

Estimation of the cohort size with the initial syllable input
  Vocabulary
    PRONLEX dictionary (approx. 100K word entries)
  Varieties of the initial syllables of the entries
    #different (initial) syllables = approx. 10K
  For each of the different syllables, CS0(si, q) is calculated.

CS0(si,q) = #words starting with si + #words starting with a syllable distant from si by less than q

CS(q) = average of CS0(si,q) over si

Distance measure between syllables
  = DP matching between two state sequences (HMMs)
  State-to-state distance = Bhattacharyya distance
  State-level distance matrix provides all the required information.
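The cohort size definitions can be sketched as follows. The slides do not give the details of the DP matching, so the step pattern and the normalisation by the summed sequence lengths are assumptions; the syllable names, state names, distances, and word counts are hypothetical toy values.

```python
def symmetric_table(pairs, states):
    """Build a full state-to-state distance table from unordered pairs."""
    table = {a: {b: 0.0 for b in states} for a in states}
    for (a, b), d in pairs.items():
        table[a][b] = table[b][a] = d
    return table


def dp_distance(seq_a, seq_b, state_dist):
    """DP (DTW-style) matching cost between two state sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = state_dist[seq_a[i - 1]][seq_b[j - 1]]
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m] / (n + m)  # assumed length normalisation


def cs0(syllable, q, syllable_states, word_counts, state_dist):
    """CS0(si, q): words starting with si or with a syllable closer than q."""
    total = word_counts[syllable]
    for other, states in syllable_states.items():
        if other == syllable:
            continue
        if dp_distance(syllable_states[syllable], states, state_dist) < q:
            total += word_counts[other]
    return total


def cs(q, syllable_states, word_counts, state_dist):
    """CS(q): average of CS0(si, q) over all initial syllables si."""
    sizes = [cs0(s, q, syllable_states, word_counts, state_dist)
             for s in syllable_states]
    return sum(sizes) / len(sizes)


# Toy data (hypothetical): three states, three initial syllables.
toy_state_dist = symmetric_table(
    {("a", "b"): 1.0, ("a", "c"): 4.0, ("b", "c"): 0.5}, ["a", "b", "c"])
toy_syllables = {"s1": ["a", "b"], "s2": ["a", "c"], "s3": ["c", "c"]}
toy_counts = {"s1": 3, "s2": 2, "s3": 5}
```

With a larger q, more syllables fall within the threshold of each other, so the cohort grows; this is the sense in which a reduced acoustic space increases lexical density.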


Cohort size as a function of threshold q
  SI AE models vs. SI JE models
  SD AE models vs. SD JE models
  No weighting based on N-gram probabilities

[Figure: annotations “Required resolution of acoustic analysis” (Low to High), “Confusedness”, and “Intelligibility”]

Cohort size estimation with unigram
  Vocabulary
    WSJ 20K (unigram) - words starting with /zh/ or diphthongs
  Varieties of the initial syllables of the word entries
    #different (initial syllables) = approx. 3,200
  For each of the different entries, CS1(wj, q) is calculated.

CS1(wj,q) = CS0(sij,q), where sij is the initial syllable of wj

Expected Cohort Size, ECS(q) = Σj p(wj) CS1(wj, q), where p(wj) is the 1-gram probability of wj
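As a small worked illustration of the ECS definition, the sketch below weights precomputed CS0 values by unigram probabilities for one fixed threshold q. The words, syllables, probabilities, and cohort sizes are hypothetical toy values.

```python
def expected_cohort_size(unigram, initial_syllable, cs0_at_q):
    """ECS(q) = sum over words wj of p(wj) * CS0(initial syllable of wj, q).

    unigram: dict word -> 1-gram probability
    initial_syllable: dict word -> its initial syllable
    cs0_at_q: dict syllable -> CS0(syllable, q) for one fixed q
    """
    return sum(p * cs0_at_q[initial_syllable[word]]
               for word, p in unigram.items())


# Toy values (hypothetical): three words sharing two initial syllables.
toy_unigram = {"w1": 0.5, "w2": 0.25, "w3": 0.25}
toy_initial = {"w1": "s1", "w2": "s1", "w3": "s2"}
toy_cs0 = {"s1": 5, "s2": 10}

toy_ecs = expected_cohort_size(toy_unigram, toy_initial, toy_cs0)
```

Frequent words therefore dominate the estimate, whereas CS(q) treats every initial syllable equally.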

How correlated are the ECS and segmental proficiency labels ?

Corpus analysis of JE perception (SD, #1)

Expected cohort size and segmental proficiency

[Figure: expected cohort size (0 to 3500) as a function of the threshold of inter-syllable distance q (0.10 to 0.70) for Set-6; curves are labeled with the scores 4.98, 4.17, 4.12, 2.02, 2.18, and 2.24, and q = 0.35 is marked]

  score range            #spk
  2.0 <= score < 2.5       4
  2.5 <= score < 3.0       8
  3.0 <= score < 3.5       6
  3.5 <= score < 4.0       5
  4.0 <= score < 4.5       2
  4.5 <= score < 5.0       1
  score = 5.0              4

Expected cohort size and segmental proficiency

Non-acoustic matching can estimate the segmental intelligibility.

[Figure: scatter plot of expected cohort size (200 to 600) against segmental proficiency rated by teachers (2.0 to 5.0); Set-6, q = 0.35; 4 American speakers are marked; cor. = -0.80]

Possible applications of PTA (#1)

Estimation of the next target for efficient learning
  Speech samples of RYU/F06
    Several sentence speech samples
  teacher’s matrix vs. learner’s matrix
  Which replacement realizes the largest reduction of cohort size ?
  Which phoneme’s replacement should come first ?
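One way to order the replacements is a greedy search: at each step, pick the phoneme whose replacement by the teacher’s version gives the smallest cohort size. The slides do not state the exact procedure, so the greedy strategy is an assumption; `cohort_size_after` is a hypothetical callback standing in for the recomputation of the cohort size, and the numbers below are toy values.

```python
def greedy_learning_order(phonemes, cohort_size_after):
    """Order phonemes by the cohort size reached after each replacement.

    cohort_size_after(replaced) must return the cohort size obtained when
    the phonemes in the frozenset `replaced` are taken from the teacher.
    """
    replaced = []
    remaining = list(phonemes)
    history = [cohort_size_after(frozenset())]
    while remaining:
        best = min(remaining,
                   key=lambda p: cohort_size_after(frozenset(replaced + [p])))
        replaced.append(best)
        remaining.remove(best)
        history.append(cohort_size_after(frozenset(replaced)))
    return replaced, history


# Toy stand-in (hypothetical): each replacement removes a fixed amount.
toy_gain = {"dh": 25, "ax": 60, "ih": 40}


def toy_cohort_size(replaced):
    return 450 - sum(toy_gain[p] for p in replaced)


toy_order, toy_history = greedy_learning_order(toy_gain, toy_cohort_size)
```

In the toy case the gains are additive, so the greedy order is simply the order of decreasing gain; with real distance matrices the effect of one replacement can depend on the phonemes already replaced.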

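Cohort size reduction only with a single replacement

[Figure: cohort size (150 to 500) after a single phoneme replacement, with the “original” and “native” levels marked. Phonemes on the horizontal axis, in extracted order: orig., /ax/, /ih/, /dh/, /ae/, /iy/, /aa/, /r/, /d/, /eh/, /ah/, /k/, /s/, /l/, /w/, /p/, /b/, /t/, /f/, /m/, /uw/, /n/, /v/, /jh/, /g/, /er/, /z/, /sh/, /ch/, /ng/, /th/, /y/, /uh/, /ao/, /h/]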

What is the order in the most efficient learning ?

Cohort size reduction by successive phoneme replacements

[Figure: cohort size (150 to 500) against successively replaced phonemes (IPA labels on the horizontal axis, starting from orig.), with the “original” and “native” levels marked]

The learning order of another student

Cohort size reduction by successive phoneme replacements

[Figure: cohort size (150 to 500) against successively replaced phonemes, with the “original” and “native” levels marked. Phonemes on the horizontal axis, in extracted order: orig., /r/, /dh/, /ax/, /iy/, /d/, /s/, /ah/, /m/, /er/, /w/, /n/, /aa/, /l/, /g/, /t/, /h/, /uw/, /k/, /y/, /sh/, /z/, /ao/, /ch/, /ng/, /uh/, /jh/, /v/, /th/, /p/, /b/, /f/, /ae/, /eh/, /ih/]

Prediction of her future.....?

[Figure: a sequence of 12 phoneme-level tree diagrams for Speaker : RYU/F06, Set : S6, Score : 2.02; vertical axis: merging distortion. The largest merging distortion label rises from 12.3 in the first tree to 17.7 in the last]

Goooooooooooooooooooooooal !!

LVCSR vs. SIE

Two types of hearings

                  LVCSR                                   SIE
  Input           an utterance (MFCC)                     utterances (phonetic structure)
  Acoustic model  phone models (tied-state triphones)     (native phonetic structure)
  Pron. lexicon   phoneme-based tree structured lexicon   perceptual-unit-based tree structured lexicon
  Lang. model     word trigram                            word unigram (baseform unigram)
  Integration     decoder                                 isolated word perception model
  Output          sentence candidates                     segmental intelligibility
  Problems        mismatch, children & elderly, native sounding

Some interesting issues on PTA (#1)

B vs. A
  Which is the more intelligible pronunciation ?

B vs. S
  Does foreign-accented pronunciation always reduce the segmental intelligibility ?

B vs. B
  [Figure: two pairs of phonetic structure and lexical structure]
  Is some optimality found between the two structures in a single accented language ?

B vs. B
  [Figure: pairs of phonetic structure and lexical structure labeled 1000, 2000, and 3000]
  Is a language evolving while keeping the optimality between the two structures ??

Two ways of looking at speech
  Separate observation & relational observation
  Co-existence of different phones

Several new lights on “Structuralism”

Non-native speech
  Learners all differ.
  No rules there; it’s chaotic.
  Only bottom-up processing
  PTA to extract the structure

Word-level cognition
  Phoneme phone perception
  Access to mental lexicon
  Cohort, Trace, Shortlist, etc.
  Structuralism on lexicon

What’s missing ?
  Cepstrum-based space
  MLLR adaptation of ....
  GMM modeling of .....
  Individuality, mic, age, etc.

Application to LL
  Intelligible pronunciation
  Str. description of learners
  Segmental intelligibility
  Design of efficient learning

Development of Japanese English read speech database

Corpus-based analysis of JE production
  Phonetic Tree Analysis (PTA)
  Relational observation can visualize well how the student is.

Corpus-based analysis of JE perception
  Cognition-based estimation of the segmental intelligibility
  Accordance between two different levels of structures

Possible applications of PTA
  The most efficient learning of English phonemes for the student
  Prediction of the student’s future

Some interesting issues on PTA
  B vs. A, B vs. S, and B vs. B

Conclusions & future works (#1)

Tuning up of acoustic conditions for analysis
  Kind of cepstra, dimensions, ∆ & ∆∆ components, etc.
    PTA-based native trees should be similar to the distinctive-feature (DF) based classical trees.
  How to handle insertions and deletions in non-native speech ?

Better preparation of transcriptions to build HMMs
  More adequate derivation of phoneme-based distance matrix

Clustering of the trees
  Meaningful and effective definition of distance between two trees
    Bottom-up definition of typical states of Japanese English
  State transition model of change of pronunciation through learning

Practical and pedagogical evaluation
  Is PTA-based representation really good for teachers and students ?
  Is PTA-based design of learning really effective and efficient ?

Conclusions & future works (#2)

Thank you. Any questions ?
