Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Nobuaki MINEMATSUUNIVERSITY OF TOKYO
Phonetic Tree Analysis (PTA)and
its application to estimationof the segmental intelligibility
Introduction
Development of Japanese English read speech database
Corpus-based analysis of JE productionPhonetic Tree Analysis (PTA)
Corpus-based analysis of JE perceptionSegmental intelligibility estimation without acoustic matching
Application of PTA-based intelligibility estimationWhat is the next target phoneme in efficient learning ?Prediction of the student’s future
Some interesting issues on PTA and phonetics
Conclusions and future works
Outline
Current situation of English education in Japan3+3+2 years’ learning at shortest
16 / 16 in TOEIC test (1998)From “native-sounding” to “intelligible” pronunciation
Foreign accents don’t always disturb smooth communication.Listeners easily adapt themselves to the speaker’s pronunciation.
What is the intelligibility of the pronunciation ?Easiness of accessing to a listener’s mental lexiconNot only phonetics-based but also cognition-based strategy for LLWhat is the model for listeners’ ability of the adaptation ?
A boom -- CALL system --Acoustic matching between students and teachers“Native-sounding” orientedStill unstable especially for children / elderly speechNo adaptation techniques are allowed basically.
Introduction (#1)
No two students are the same.No methods to represent the differences concisely and accurately
Current CALL technologies = error detection and scoringRequired technologies = writing of how the student was, is, and will be.
Phonetic Tree Analysis (PTA)Extraction of embedded phonetic structure in the pronunciationAbstract but phonetically-meaningful visualization of how the student isCancellation of microphone characteristics and speakers’ individualityPerceptual representation of the pronunciation structureNo need of acoustic matching with teachers’ speechCan be applied to children and elderly speech immediately.Can be applied to estimate the segmental intelligibility.Can be applied to instruct the next target in learning.Can be applied to roughly estimate the student’s future. :
Introduction (#2)
From phonetics to phonetics+cognitionConventional --- matching between two sounds ---
Proposed --- matching between two structures ---
Introduction (#3)
students’speech
teachers’HMMs
students’speech
Based on speech sci. and eng.
Affected by mic, BN, size, shape, sex, age, individuality
Native-sounding oriented
Based also on cognition
Complete cancellation of static multiplicative noise
Intelligibility oriented
mentallexicon
Technical and educational requirements for the designTechnical issues
Focus only on commonly observed acoustic distortions Adequate selection of speakers and recording strategy
Random selection of 100 male and 102 female university studentsReading given sheets, not spontaneous conversation Phonetic symbols and prosodic symbolsPronunciation practices allowedRepetition until the speakers judged that they did it right.
Educational issuesPhonetic (segmental) aspect and prosodic aspectWord reading and sentence readingSentence set / word set X phonetic set / prosodic set
Development of JE speech database (#1)
Word / sentence sets for the phonetic aspect
Minimal pair words include some unknown words
Word / sentence sets for the prosodic aspect
Intonational differences caused by commas, focused words, syntactic structures, references, and so onProsodic symbols assigned by English teachers
Development of JE speech database (#2)
set sizePhonemically-balanced words 300Minimal pair words 600TIMIT-based phonemically-balanced sentences 460Sentences with phoneme sequences difficult to produce fluently 32Sentences designed for test set 100
set sizeWords and compound words with various accent patterns 109Sentences with various intonation patterns 94Sentences with various rhythm patterns 121
Symbols and marks assigned to the reading sheetsPhonetic symbols
Prosodic symbolsLexical stress = primary (1), secondary (2), and unstressed (0)Sentence stress (rhythm) = three-level stress (@, +, -) and phrase break (/)Intonation = adequately directed arrows
Examples
Development of JE speech database (#3)
B, D, G, P, T, K, JH, CH, S, SH, Z, ZH, F, TH, V, DH, M, N, NG, L, R, W, Y, HH, IY, IH, EH, EY, AE, AA, AW, AY, AH, AO, OY, OW, UH, UW, ER, AXR, AX
More examples
Development of JE speech database (#4)
RecordingSpeakers
Quasi-random selection of 100 male and 102 female univ. studentsRecording task
All the sentences into 8 sub-setsAll the words into 5 sub-sets1 sentence sub-set (125) + 1 word sub-set (225) / speaker
Recording procedures[Before R] Speakers were asked to do pronunciation practices[During R] Also asked to do repetition until they judged they did it right.
Correct English at least for students[After R] Every utterance was checked by technical staff.
The remaining errors are due to lack of the speakers’ knowledge of correct articulation of English sounds.
Development of JE speech database (#5)
Pronunciation proficiency rating by teachersThree criteria for the rating
Phonetic aspect of the pronunciationRhythmic aspect of the pronunciationIntonational aspect of the pronunciation5-degree scale + 5 American teachers5,700 words + 3,800 sentences / rater
Distribution of the proficiency
Development of JE speech database (#6)
-- 1.5 -- 2.0 -- 2.5 -- 3.0 -- 3.5 -- 4.0 -- 4.5 -- 5.00
10
20
30
40
50segmental rhythm intonation
#stu
dent
s
Training of AE and JE SI-HMMsMonophones with a single mixture for easy visualizationTranscription automatically generated with PRONLEX
Pronunciation errors were not represented in the transcription.
Corpus analysis of JE production (SI, #1)
AD 16bit / 16kHz sampling
Window Hamming, 25 ms length, 10 ms rate
Pre-emphasis 1-0.97 z-1
Parameters 12MFCC + 12DMFCC + DPower (25dim)
Phoneme setae, ah, ch, dh, eh, nx, wh, ih, jh, oy, er, sh, th, uh, aw, ay, zh, aa,
b, ao, d, ey, f, g, hh, iy, k, l, m, o, ow, p, r, s, t, uw, v, w, ax, y, z
HMMs 1-mixture monophones with diagonal matrices (5st. + 3dis.)
TrainingAE = 245 male Americans, 25,652 sentences (WSJ)
JE = 68 male Japanese, 8,282 utterances
Initial models Models built with TIMIT (4,346 utterances)
1 2 3 4 5
Magnitude of variances of AE and JERelative difference in averaged variances over cep. dimensions
JE > AE although #JE <<< #AE and JE are carefully read samples.Due to inter-speaker differences in pronunciation proficiency
Corpus analysis of JE production (SI, #2)
Phoneme pairs difficult for Japanese to discriminateRatio of state distance in JE to that in AE
Larger confusion is clearly seen in each pair.Mid and low vowels are replaced by a Japanese mid and low vowel /あ/.
Corpus analysis of JE production (SI, #3)
1 2 3 4 5
あ
Schwa and other vowels in AE and JEFive nearest phonemes (not vowels) to schwa in AE and JE
Schwa is one of the most difficult sounds for Japanese to produce.Various vowels are found in AE but only mid and low vowels in JE.Mid and low vowels of JE = /あ/
Japanese perceive /あ/ in native schwa sounds.Japanese produce /あ/ for schwa.
Corpus analysis of JE production (SI, #4)
あ
Phonetic Tree Analysis (PTA) with SI HMMsState-level distance matrices for AE and JEBhattacharyya distance measureHierarchical clustering for each matrix with Ward’s method
Corpus analysis of JE production (SI, #5)
2 3 4
2 3 4
2 3 4
2 3 4
:
/aa/
/ae/
/th/
/m/
aa-2aa-3aa-4 :
m-2m-3m-4 :
aa-2 aa-3 aa-4 ,,,, m-2 m-3 m-4 ,,,
0 00
0
0
0
00
0 tree diagram
Where are the confusing phonemes in the two trees ?
ae2
eh2
ah2
uh2
ax2
er2
r2dh3
dh4
b4d4g4hh4
w2ih2
ey2
uw2
iy2y2ae4
eh4
ah4
ih4
uh4
uw4
ax3
ax4
m2
n2v2er4
r4ih3
uw3
uh3
oy4
ay4
er3
r3nx2
ey4
iy4ey3
iy3y3y4nx3
n3m3
nx4
n4m4
d2g2k2ae3
eh3
ay3
aw2
ay2
ah3
aw3
aa3
ao3
l2ow3
wh4
oy2
w3aa2
ao2
ow2
oy3
l3ow4
l4w4aw4
ao4
aa4
ch2
jh2
sh2
zh2
s2z2dh2
v4b3d3g3th2
f2p2t2b2v3wh3
k4p4hh2
hh3
k3t3th3
f3p3silB4
silE2
ch3
jh3
sh3
zh3
s3z3ch4
sh4
jh4
zh4
th4
t4f4s4z4wh2
silB2
sp2
silB3
silE3
silE4
05
10
ae2
ay2
ah2
aw2
aa2
er2
ax2
wh4
ao2
ow2
eh2
ey2
ih2y2iy2uh2
uw2r2w2ch4
sh4
jh4
zh4
dh4
th4s4z4d4t4g4k4f4
hh4
ch2
dh3k3t3f2
hh2b3v4d3b4p4g3dh2
jh2
zh2
sh2
th2s2z2nx4n4m4b2v3d2g2k2p2t2
ch3
jh3
zh3
sh3
th3s3z3
wh2
silE3
silB2sp
2sil
B3sil
E4wh3
silB4
silE2p3f3
hh3
ae3
er3
ax3
ah3
aw3
aa3
ay3
oy2
ao3
ow3
eh3
ey3
uh3
uw3y4ih3iy3y3nx3n3m3
ae4
ah4
er4
ax4
aw4
aa4
ao4
ow4
oy3l2r3l3r4l4w3w4eh4
oy4
ay4
ih4iy4ey4
nx2
m2n2uh4
uw4v2
05
10
Corpus analysis of JE production (SI, #6)
r/l ? s/th ? f/h ? ih/iy ?
r2r3r4 l2l3l4r2r3 r4l2l3l4s2s3s4 th2th3th4
th4th3s3 s4th2s2 h2h3 h4f2f3 f4
f2f3f4 h2h3 h4
1 2 3 4 5
iy2iy3iy4ih4 ih2ih3
ih3 ih4iy2iy3iy4 ih2
th=ih=iy=
TIi:
Phonetic interpretation of the tree structure
Corpus analysis of JE production (SI, #7)
v2c4c4 & v2 subtree
1 2 3 4 5
vowel insertion
ae2
eh2
ah2
uh2
ax2
er2
r2dh3
dh4
b4d4g4hh4
w2ih2
ey2
uw2
iy2y2ae4
eh4
ah4
ih4
uh4
uw4
ax3
ax4
m2
n2v2er4
r4ih3
uw3
uh3
oy4
ay4
er3
r3nx2
ey4
iy4ey3
iy3y3y4nx3
n3m3
nx4
n4m4
d2g2k2ae3
eh3
ay3
aw2
ay2
ah3
aw3
aa3
ao3
l2ow3
wh4
oy2
w3aa2
ao2
ow2
oy3
l3ow4
l4w4aw4
ao4
aa4
ch2
jh2
sh2
zh2
s2z2dh2
v4b3d3g3th2
f2p2t2b2v3wh3
k4p4hh2
hh3
k3t3th3
f3p3silB4
silE2
ch3
jh3
sh3
zh3
s3z3ch4
sh4
jh4
zh4
th4
t4f4s4z4wh2
silB2
sp2
silB3
silE3
silE4
05
10
ae2
ay2
ah2
aw2
aa2
er2
ax2
wh4
ao2
ow2
eh2
ey2
ih2y2iy2uh2
uw2r2w2ch4
sh4
jh4
zh4
dh4
th4s4z4d4t4g4k4f4
hh4
ch2
dh3k3t3f2
hh2b3v4d3b4p4g3dh2
jh2
zh2
sh2
th2s2z2nx4n4m4b2v3d2g2k2p2t2
ch3
jh3
zh3
sh3
th3s3z3
wh2
silE3
silB2sp
2sil
B3sil
E4wh3
silB4
silE2p3f3
hh3
ae3
er3
ax3
ah3
aw3
aa3
ay3
oy2
ao3
ow3
eh3
ey3
uh3
uw3y4ih3iy3y3nx3n3m3
ae4
ah4
er4
ax4
aw4
aa4
ao4
ow4
oy3l2r3l3r4l4w3w4eh4
oy4
ay4
ih4iy4ey4
nx2
m2n2uh4
uw4v2
05
10
Phonetic interpretation of the tree structure
Corpus analysis of JE production (SI, #8)
1 2 3 4 5
48V+9N+6L+6G80
0V+0N+0L+0G50
ae2
eh2
ah2
uh2
ax2
er2
r2dh3
dh4
b4d4g4hh4
w2ih2
ey2
uw2
iy2y2ae4
eh4
ah4
ih4
uh4
uw4
ax3
ax4
m2
n2v2er4
r4ih3
uw3
uh3
oy4
ay4
er3
r3nx2
ey4
iy4ey3
iy3y3y4nx3
n3m3
nx4
n4m4
d2g2k2ae3
eh3
ay3
aw2
ay2
ah3
aw3
aa3
ao3
l2ow3
wh4
oy2
w3aa2
ao2
ow2
oy3
l3ow4
l4w4aw4
ao4
aa4
ch2
jh2
sh2
zh2
s2z2dh2
v4b3d3g3th2
f2p2t2b2v3wh3
k4p4hh2
hh3
k3t3th3
f3p3silB4
silE2
ch3
jh3
sh3
zh3
s3z3ch4
sh4
jh4
zh4
th4
t4f4s4z4wh2
silB2
sp2
silB3
silE3
silE4
05
10
ae2
ay2
ah2
aw2
aa2
er2
ax2
wh4
ao2
ow2
eh2
ey2
ih2y2iy2uh2
uw2r2w2ch4
sh4
jh4
zh4
dh4
th4s4z4d4t4g4k4f4
hh4
ch2
dh3k3t3f2
hh2b3v4d3b4p4g3dh2
jh2
zh2
sh2
th2s2z2nx4n4m4b2v3d2g2k2p2t2
ch3
jh3
zh3
sh3
th3s3z3
wh2
silE3
silB2sp
2sil
B3sil
E4wh3
silB4
silE2p3f3
hh3
ae3
er3
ax3
ah3
aw3
aa3
ay3
oy2
ao3
ow3
eh3
ey3
uh3
uw3y4ih3iy3y3nx3n3m3
ae4
ah4
er4
ax4
aw4
aa4
ao4
ow4
oy3l2r3l3r4l4w3w4eh4
oy4
ay4
ih4iy4ey4
nx2
m2n2uh4
uw4v2
05
10
(48V+9N+6L+6G) x 0.7all the v2sall the v3s and v4s
?
Speaker Dependent tree diagramsState-level --> phoneme-level
Phoneme-level distance matrix1-state HMM with 1 mixture (GM)approx. 60 sentences to train a set of HMMsAll the phonemes - /zh/ (/ /) - dipthongs (#phonemes = 34)
Representation of how the student is100 + 102 treesNo two students are the same.
20 Americans, each reading 3 to 4 subsets
Corpus analysis of JE production (SD, #1)
Z
A PTA example of an American teacher (subset = S3)
Corpus analysis of JE production (SD, #2)M
ergi
ng D
istor
tion Speaker : USA/F12
Set : S3Score : 5.00
/sh/ /jh
//c
h/ /s/
/z/
/dh/ /g
//b
//v
//d
//h
/ /f/ /th/
/p/ /t/ /k/
5.47.0
/iy/
/y/
/ao/
/aa/ /er/ /r/ /ih/
/ax/
/ah/
/eh/
/ae/ /m
//n
//n
g//u
h//u
w/ /w/ /l/
4.76.1
8.09.1
11.0
18.9
S Ã Ù s z D g b v d h f T p t k i: j O:A: r@:r
I @ 2 E æmn N Ú u:w l
A PTA example of a poor student (subset = S3)
Corpus analysis of JE production (SD, #3)M
ergi
ng D
istor
tion Speaker : TUT/M03
Set : S3Score : 2.38
/s/
/th/
/p/ /f/ /h/
/sh/ /jh
/ /t//c
h/ /k/
/b/
/d/
/dh/ /z
/
5.2
/aa/
/ah/ /er/
/ax/
/ae/
/eh/
/uw/ /y
//ih
//iy
//a
o//u
h/ /w/
/v/
/g/ /r/ /l/
/ng/ /n
//m
/
4.66.1
8.0
12.5
S Ã Ùs zD gb vdhfT p t k i:j O:A: r@:r
I@2 Eæ mnNÚu: w l
A PTA example of an intermediate student (subset = S3)
Corpus analysis of JE production (SD, #4)M
ergi
ng D
istor
tion Speaker : TKT/M03
Set : S3Score : 3.18
/d/
/v/
/g/
/h/ /t/ /k/
/b/
/p/
/sh/ /jh
//c
h/ /f//d
h/ /z/
/s/
/th/
6.6/e
r/ /r/ /l//a
o/ /w/
/ah/
/ae/
/ax/
/aa/
4.7
/ih/
/iy/
/y/
/ng/ /n
//m
//e
h//u
h//u
w/
5.15.8
8.5
14.1
S Ã Ù szDg bvd h f Tpt k i: jO: A:r@:r
I@2 Eæ mnN Ú u:wl
A PTA example of a good student (subset = S3)
Corpus analysis of JE production (SD, #5)M
ergi
ng D
istor
tion Speaker : NAR/F05
Set : S3Score : 4.47
/v/
/b/
/d/
/g/
/dh/ /p
/ /t/ /k/
/sh/ /jh
//c
h/ /s/
/z/
/h/ /f/ /th/
4.7
8.2
/l//a
o//a
a/ /er/ /r/
/uh/
/ah/
/ax/
/ae/ /ih
//e
h/
5.46.6
/iy/
/y/
/ng/ /n
//m
//u
w/ /w/
9.511.4
20.0
S Ã Ù s zDgbv d h f Tp t k i: jO:A: r@:r
I@2 Eæ mnNÚ u:wl
A PTA example of an American teacher (subset = S3)
Corpus analysis of JE production (SD, #6)M
ergi
ng D
istor
tion Speaker : USA/F12
Set : S3Score : 5.00
/sh/ /jh
//c
h/ /s/
/z/
/dh/ /g
//b
//v
//d
//h
/ /f/ /th/
/p/ /t/ /k/
5.47.0
/iy/
/y/
/ao/
/aa/ /er/ /r/ /ih/
/ax/
/ah/
/eh/
/ae/ /m
//n
//n
g//u
h//u
w/ /w/ /l/
4.76.1
8.09.1
11.0
18.9
S Ã Ù s z D g b v d h f T p t k i: j O:A: r@:r
I @ 2 E æmn N Ú u:w l
A PTA example of an American teacher (subset = S6)
Corpus analysis of JE production (SD, #7)M
ergi
ng D
istor
tion Speaker : USA/F12
Set : S6Score : 5.00
/sh/ /jh
//c
h/ /s/
/z/
/v/
/d/
/g/
/dh/ /t/ /k
//b
//p
/ /f/ /th/
5.77.2
/iy/
/y/
/ao/
/aa/ /w
//u
h/ /l/ /n/
/ng/
/uw/ /m
/
5.1
/er/ /r/ /h/
/ae/ /ih
//e
h//a
h//a
x/
4.56.4
8.19.2
10.9
17.7
S Ã Ù s z Dg bv d hf Tpt k i: j O:A: r@:r
I @2Eæmn NÚ u:w l
Characteristics of PTACan do writing and labeling of how the student is.
Clustering the trees defines typical states of pronunciation learning.Can do abstract and easy-to-understand visualization.
No physics, no acoustics, but educationally meaningful enoughRequires no acoustic matching with teachers.
Never has “mismatch” problems.Can be applied to children and elderly people without any difficulty.
Only focuses on inter-phoneme Bhattacharyya distances.Shift and rotation do not change the distances.Scaling does not change either if variances are proportionally modified.A part of MLLR with a single matrix does not change the distances.Extraction of only the phonetic structure by ignoring some other factors.
Perceptual representation of the pronunciation structure.
Corpus analysis of JE production (SD, #8)
Labeling of phonemesPhonemic transcriptions generated by forced alignmentSequence of phonemes seen in speech of some seconds
Labeling of rhythms4-level (word/phrase/sentence/above) categorization of stressSequence of (un)stressed syllables seen in speech of some seconds
Labeling of intonation(x-)ToBI, Fujisaki-model, piecewise concatenation model,,,Change of F0 seen in speech of some seconds
Labeling of pronunciationSpectrograms for the individual phonemes
Only for researchers, not for educationPTA for the overall description of pronunciationChange of pronunciation seen in training of some months or years
Corpus analysis of JE production (SD, #9)
Characteristics of PTACan do writing and labeling of how the student is.
Clustering the trees defines typical states of pronunciation learning.Can do abstract and easy-to-understand visualization.
No physics, no acoustics, but educationally meaningful enoughRequires no acoustic matching with teachers.
Never has “mismatch” problems.Can be applied to children and elderly people without any difficulty.
Only focuses on inter-phoneme Bhattacharyya distances.Shift and rotation do not change the distances.Scaling does not change either if variances are proportionally modified.A part of MLLR with a single matrix does not change the distances.Extraction of only the phonetic structure by ignoring some other factors.
Perceptual representation of the pronunciation structure.
Corpus analysis of JE production (SD, #8)
A state transition model of pronunciation learning
Corpus analysis of JE production (SD, #10)
L1L3
L2
Characteristics of PTACan do writing of how the student is.
Clustering the trees defines typical states of pronunciation learning.Can do abstract and easy-to-understand visualization.
No physics, no acoustics, but educationally meaningful enoughRequires no acoustic matching with teachers.
Never has “mismatch” problems.Can be applied to children and elderly people without any difficulty.
Only focuses on inter-phoneme Bhattacharyya distances.Shift and rotation do not change the distances.Scaling does not change either if variances are proportionally modified.A part of MLLR with a single matrix does not change the distances.Extraction of only the phonetic structure by ignoring some other factors.
Perceptual representation of the pronunciation structure.
Corpus analysis of JE production (SD, #8)
MLLR = Affine Transform of cepstrumsAT = scaling + warping + rotation + shift
Structure in an object is kept before and after the transform
MLLR with a single transform matrixMLLR adaptation with N (>>1) matrices in speech recognition
Different parts of a triphone set have different speaker individuality.Speaker individuality can be modeled by a single GMM.MLLR adaptation in HMM-based speech synthesis
HMMs of a speaker are converted into those of another with 5 sentences.N ~ 1
Good cancellation of speaker individuality
Corpus analysis of JE production (SD, #11)
Characteristics of PTACan do writing of how the student is.
Clustering the trees defines typical states of pronunciation learning.Can do abstract and easy-to-understand visualization.
No physics, no acoustics, but educationally meaningful enoughRequires no acoustic matching with teachers.
Never has “mismatch” problems.Can be applied to children and elderly people without any difficulty.
Only focuses on inter-phoneme Bhattacharyya distances.Shift and rotation do not change the distances.Scaling does not change either if variances are proportionally modified.A part of MLLR with a single matrix does not change the distances.Extraction of only the phonetic structure by ignoring some other factors.
Perceptual representation of the pronunciation structure.
Corpus analysis of JE production (SD, #8)
NO !“Structuralism” by R. Jakobson, M. Halle, G. Fant, and etc
Concise and accurate description of sounds in a languagePhoneme clustering based upon distinctive features
Is PTA new ? (#1)
“The Sound Pattern of Russian” by M. Halle
Is PTA new ? (#2)
NO !“Structuralism” by R. Jakobson, M. Halle, G. Fant, and etc
Concise and accurate description of sounds in a languagePhoneme clustering based upon distinctive features
YES !It’s non-native speech sounds.
No fixed mapping between distinctive features and phonemesSome state-tying in HMM training and mixture-tying in MLLR adaptation
YES !!Technical meaning of ignoring phonemes’ absolute positions in AS
Complete cancellation of static and multiplicative distortionsRobust also in rotation and scaling
YES !!!Accordance between phonetic structure and lexical structure
Cognition-based goodness of the segmental aspect of the pronunciationNot native sounding but intelligibility-based scoring
Is PTA new ? (#1)
From phonetics to phonetics+cognitionConventional --- matching between two sounds ---
Proposed --- matching between two structures ---
Corpus analysis of JE perception (SI, #1)
students’speech
teachers’HMMs
students’speech
Based also on cognition
Complete cancellation of static multiplicative noise
Intelligibility oriented
mentallexicon
Based on speech sci. and eng.
Affected by mic, BN, size, shape, sex, age, individuality
Native-sounding oriented
Listeners’ adaptationor cancellation
Perceptual representation ofthe phonetic structure
?
Reduced acoustic space of JE phonemes#phonemes of J (approx. 25) < #phonemes of AE (approx. 40)
1-to-N mappingReduction of acoustic space = increase of lexical density
Confusedness = segmental unintelligibility = cognitive loadQuantitative estimation of the lexical density increase
Cohort Model of word perceptionA left-to-right model of isolated word perception process
Estimation of the cohort size with the initial portion of input speech
Corpus analysis of JE perception (SI, #2)
str…
s…s…s…s… str…
straight
#candidates > 1 #candidates > 1 #candidates > 1 #candidates = 1
cohort = a set of activated words
= start
/s/
/b/
/r/
:
:
/t/
/eh/
/ih//ey/
/ow//r/
:
Acoustic unit of cohort developmentPerceptual unit of English = syllablesSyllabification with tsylb v2.1
Estimation of the cohort size with the initial syl. inputVocabulary
PRONLEX dictionary (approx. 100K word entries)Varieties of the initial syllables of the entries
#different (initial) syllables = approx. 10KFor each of the different syllables, CS0(si, q) is calculated.
CS0(si,q) = #words starting with si + #words starting with a syllable distant from si by less than q
CS(q) = average of CS0(si,q) over si
Distance measure between syllables= DP matching between two state sequences (HMMs)State-to-state distance = Bhattacharyya distanceState-level distance matrix provides all the required information.
Corpus analysis of JE perception (SI, #3)
Cohort size as a function of threshold qSI AE models vs. SI JE modelsSD AE models vs. SD JE modelsNo weighting based on N-gram probabilities
Corpus analysis of JE perception (SI, #4)
Required resolution of acoustic analysis LowHigh
Con
fuse
dnes
s
Inte
lligi
bilit
y
Cohort size estimation with unigramVocabulary
WSJ 20K (unigram) - words starting with /zh/ or dipthongsVarieties of the initial syllables of the word entries
#different (initial syllables) = approx. 3,200For each of the different entries, CS1(wj, q) is calculated.
CS1(wj,q) = CS0(sij,q), where sij is an initial syllable of wj
Expected Cohort Size, ECS(q) = S p(wj) CS1(wj, q), p(wj) = 1-gram
How correlated are the ECS and segmental proficiency labels ?
Corpus analysis of JE perception (SD, #1)
Expected cohort size and segmental proficiency
Corpus analysis of JE perception (SD, #2)
0.10 0.700.30 0.600.20 0.500.400
2500
2000
1500
1000
500
3000
3500
Threshold of inter-syllable distance (q)
Expe
cted
coh
ort s
izeSet-6
4.984.17
4.12
2.02
2.18
2.24
q = 0.35
2.0 <= score < 2.5 42.5 <= score < 3.0 83.0 <= score < 3.5 63.5 <= score < 4.0 54.0 <= score < 4.5 24.5 <= score < 5.0 1 score = 5.0 4
score range #spk
Expected cohort size and segmental proficiency
Non-acoustic matching can estimate the segmental intelligibility.
Corpus analysis of JE perception (SD, #3)
4.03.02.0 5.0Segmental proficiency rated by teachers
200
300
400
500
600Ex
pect
ed c
ohor
t size
Set-6, q = 0.35
4 American speakers
(1)
(3)
cor. = -0.80
Possible applications of PTA (#1)
Estimation of the next target for efficient learning
Speech samples of RYU/F06Several sentence speech samples
teacher’s matrix learner’s matrix
? ?Which replacement realizes the largest reduction of cohort size ?
Which phoneme’s replacement should come first ?
Possible applications of PTA (#2)
SÃ Ùs zD gb vd hf Tp tki: j O:A: r@:r
I@ 2Eæ m n N Úu:wl
Cohort size reduction only with a single replacement
150
200
250
300
350
400
450
500
Coho
rt siz
e
orig
./a
x/ /ih/
/dh/
/ae/ /iy
//a
a/ /r/ /d/
/eh/
/ah/ /k
//s
/ /l/ /w/
/p/
/b/ /t/ /f/ /m/
/uw/ /n
//v
//jh
//g
//e
r/ /z/
/sh/
/ch/
/ng/ /th
//y
//u
h//a
o/ /h/
original
native
What is the order in the most efficient learning ?
Possible applications of PTA (#3)
sD dA:r@:r
I 2 Em S ÃÙz g b vh fTp t ki:j O:æn NÚu: w l150
200
250
300
350
400
450
500
Coho
rt siz
e
original
native
Cohort size reduction by successive phoneme replacements
@
orig
.
The learning order of another student
Possible applications of PTA (#4)
Cohort size reduction by successive phoneme replacements
150
200
250
300
350
400
450
500
Coho
rt siz
e
orig
. /r//d
h//a
x/ /iy/
/d/
/s/
/ah/ /m
//e
r/ /w/
/n/
/aa/ /l/ /g
/ /t/ /h/
/uw/ /k
//y
//s
h/ /z/
/ao/
/ch/
/ng/
/uh/ /jh
//v
//th
//p
//b
/ /f//a
e//e
h//is
h/
Phonemes
original
native
Prediction of her future.....?
Possible applications of PTA (#5)
sD dA:r @:rI 2 Em S ÃÙz g b vh fTp t ki:j O:æn NÚu: w l@orig.
Mer
ging
Dis
torti
on Speaker : RYU/F06Set : S6
Score : 2.02
/th/
/z/
/jh/
/dh/ /k
/ /t/ /d/
/b/
/g/ /f/ /p/
/h/
/ao/
/ch/ /s
//s
h/
4.95.7/u
h/ /r/ /l//a
h//e
h//a
e/ /er/
/uw
//w
//v
//a
x//a
a/ /y/
/ih/
/iy/
/ng/ /n
//m
/
7.1
12.3
Prediction of her future.....?
Possible applications of PTA (#5)
sD dA:r @:rI 2 Em S ÃÙz g b vh fTp t ki:j O:æn NÚu: w l@orig.
Mer
ging
Dis
torti
on Speaker : RYU/F06Set : S6
Score : 2.02
/ax/ /h/
/th/
/z/
/jh/
/dh/ /k
/ /t/ /d/
/ao/
/ch/ /s
//s
h/
6.6
/y/
/ih/
/iy/
/ng/ /n
//m
/ /f/ /p/
/v/
/b/
/g/
/ah/
/eh/
/ae/ /er/
/uh/ /r/ /l/
/aa/
/uw
//w
/
5.2
7.9
12.4
Prediction of her future.....?
Possible applications of PTA (#5)
sD dA:r @:rI 2 Em S ÃÙz g b vh fTp t ki:j O:æn NÚu: w l@orig.
Mer
ging
Dis
torti
on Speaker : RYU/F06Set : S6
Score : 2.02
/th/
/z/
/jh/
/dh/ /k
/ /t/ /d/
/ao/
/ch/ /s/
/sh/ /h/
/ih/
/ax/ /iy/
/y/
/ng/ /n
//m
/ /f/ /p/
/v/
/b/
/g/
/ah/
/eh/
/ae/ /er/
/uh/ /r/ /l/
/aa/
/uw
//w
/
5.46.9
8.8
12.5
Prediction of her future.....?
Possible applications of PTA (#5)
sD dA:r @:rI 2 Em S ÃÙz g b vh fTp t ki:j O:æn NÚu: w l@orig.
Mer
ging
Dis
torti
on Speaker : RYU/F06Set : S6
Score : 2.02
/k/ /t/ /d/
/th/
/z/
/jh/
/ao/
/ch/ /s
//s
h/ /f/ /p/
/v/
/b/
/g/
/dh/ /h
/5.9
7.0/iy
//y
//n
g/ /n/
/m/
/uh/ /r/ /l/
/aa/
/uw
//w
//a
h//e
h//a
e/ /er/
/ih/
/ax/
5.0
8.0
12.9
Prediction of her future.....?
Possible applications of PTA (#5)
sD dA:r @:rI 2 Em S ÃÙz g b vh fTp t ki:j O:æn NÚu: w l@orig.
Mer
ging
Dis
torti
on Speaker : RYU/F06Set : S6
Score : 2.02
/k/ /t/ /d/
/th/
/z/
/jh/
/ao/
/ch/ /s
//s
h/ /f/ /p/
/v/
/b/
/g/
/dh/ /h
/6.2
7.2/iy
//y
//n
g/ /n/
/m/
/ae/
/ah/
/eh/ /ih
//a
x//e
r/ /r//a
a//u
w/
/w/
/uh/ /l/
4.55.3
8.4
13.4
Prediction of her future.....?
Possible applications of PTA (#5)
sD dA:r @:rI 2 Em S ÃÙz g b vh fTp t ki:j O:æn NÚu: w l@orig.
Mer
ging
Dis
torti
on Speaker : RYU/F06Set : S6
Score : 2.02
/k/ /t/ /d/
/th/
/z/
/jh/
/ao/
/ch/ /s
//s
h/ /f/ /p/
/v/
/b/
/g/
/dh/ /h
/6.3
7.3/iy
//y
//n
g/ /n/
/m/
/ih/
/eh/
/ah/
/ax/
/er/ /r/
/uh/ /l/
/ae/
/aa/
/uw
//w
/
4.55.4
8.6
13.7
Prediction of her future.....?
Possible applications of PTA (#5)
sD dA:r @:rI 2 Em S ÃÙz g b vh fTp t ki:j O:æn NÚu: w l@orig.
Mer
ging
Dis
torti
on Speaker : RYU/F06Set : S6
Score : 2.02
/dh/ /h
/ /f/ /p/
/v/
/d/
/b/
/g/
/s/
/sh/
/ch/ /t/ /k
//th
//z
//jh
/6.77.4
/iy/
/y/
/m/
/n/
/ng/
5.3
/ao/
/aa/ /er/ /r/ /ih/
/eh/
/ah/
/ax/
/ae/
/uw
//w
//u
h/ /l/
4.66.0
8.810.6
15.5
Prediction of her future.....?
Possible applications of PTA (#5)
sD dA:r @:rI 2 Em S ÃÙz g b vh fTp t ki:j O:æn NÚu: w l@orig.
Mer
ging
Dis
torti
on Speaker : RYU/F06Set : S6
Score : 2.02
/dh/ /h
/ /f/ /p/
/v/
/d/
/b/
/g/
/s/
/z/
/sh/
/ch/ /t/ /k
//th
//jh
/
5.17.17.8
/iy/
/y/
/n/
/ng/
/uw
//m
/
6.5
/ao/
/aa/ /er/ /r/ /ih/
/eh/
/ah/
/ax/
/ae/ /w
//u
h/ /l/
4.45.8
9.110.8
16.6
Prediction of her future.....?
Possible applications of PTA (#5)
sD dA:r @:rI 2 Em S ÃÙz g b vh fTp t ki:j O:æn NÚu: w l@orig.
Mer
ging
Dis
torti
on Speaker : RYU/F06Set : S6
Score : 2.02
/sh/
/ch/
/dh/ /h
//s
//z
//v
//d
//g
//jh
/ /t/ /k/
/th/ /f/ /b/
/p/
5.47.48.2
/iy/
/y/
/n/
/ng/
/uw
//m
/
6.8
/ao/
/aa/
/ae/ /ih
//e
h//a
h//a
x//e
r/ /r/ /w/
/uh/ /l/
4.86.1
9.311.1
17.3
Prediction of her future.....?
Possible applications of PTA (#5)
sD dA:r @:rI 2 Em S ÃÙz g b vh fTp t ki:j O:æn NÚu: w l@orig.
Mer
ging
Dis
torti
on Speaker : RYU/F06Set : S6
Score : 2.02
/sh/ /jh
//c
h/ /s/
/z/
/v/
/d/
/g/
/dh/ /t/ /k
//b
//p
/ /f/ /th/
5.67.2
/ao/
/aa/ /er/ /r/ /h/
/ae/ /ih
//e
h//a
h//a
x/
4.56.3
8.2
/iy/
/y/
/w/
/uh/ /l/ /n
//n
g//u
w/
/m/
5.0
9.111.0
17.5
Prediction of her future.....?
Possible applications of PTA (#5)
sD dA:r @:rI 2 Em S ÃÙz g b vh fTp t ki:j O:æn NÚu: w l@orig.
Mer
ging
Dis
torti
on Speaker : RYU/F06Set : S6
Score : 2.02
/sh/ /jh
//c
h/ /s/
/z/
/v/
/d/
/g/
/dh/ /t/ /k
//b
//p
/ /f/ /th/
5.77.3
/iy/
/y/
/ao/
/aa/ /w
//u
h/ /l/ /n/
/ng/
/uw
//m
/
5.1
/er/ /r/ /h/
/ae/ /ih
//e
h//a
h//a
x/
4.56.4
8.19.2
10.9
17.7
Prediction of her future.....?
Possible applications of PTA (#5)
sD dA:r @:rI 2 Em S ÃÙz g b vh fTp t ki:j O:æn NÚu: w l@orig.
Mer
ging
Dis
torti
on Speaker : RYU/F06Set : S6
Score : 2.02
/sh/ /jh
//c
h/ /s/
/z/
/v/
/d/
/g/
/dh/ /t/ /k
//b
//p
/ /f/ /th/
5.77.2
/iy/
/y/
/ao/
/aa/ /w
//u
h/ /l/ /n/
/ng/
/uw
//m
/
5.1
/er/ /r/ /h/
/ae/ /ih
//e
h//a
h//a
x/
4.56.4
8.19.2
10.9
17.7
Goooooooooooooooooooooooal !!
LVCSR vs. SIE
Two types of hearings
Input
Acoustic model
Pron. lexicon
Lang. model
Integration
Output
Problems
phoneme-basedtree structured lexicon
phone models(tied-state triphones)
word trigram
decoder
sentence candidates
mismatchchildren & elderly
an utterance(MFCC)
mismatchchildren & elderly
native sounding
utterances(phonetic structure)
(native phoneticstructure)
perceptual-unit-basedtree structured lexicon
word unigram(baseform unigram)
isolated wordperception model
segmentalintelligibility
Some interesting issues on PTA (#1)
B vs. AWhich is more intelligible pronunciation ?
Some interesting issues on PTA (#2)
B vs. SForeign accented pronunciation alwaysreduces the segmental intelligibility ?
Some interesting issues on PTA (#3)
B vs. BBPhoneticstructure
Lexicalstructure
Phoneticstructure
Lexicalstructure
Some optimality is found between the twostructures in a single accented language ?
Some interesting issues on PTA (#4)
B vs. B1000
2000
3000
Phoneticstructure
Lexicalstructure
Phoneticstructure
Lexicalstructure
Phoneticstructure
LexicalstructureA language is evolving with keeping the
optimality between the two structures ??
Two ways of looking at speech
sD dA:r @:rI 2 Em S ÃÙz g b vh fTp t ki:j O:æn NÚu: w l@
A:
Co-existence of different phones
Separate observation&
Relational observation
Several new lights on “Structuralism”
Non-native speechLearners differ at all.No rules there, it’s chaotic.Only bottom-up processingsPTA to extract the structure
Word-level cognitionPhoneme phone perceptionAccess to mental lexiconCohort, Trace, Shortlist, etcStructuralism on lexicon
What’s missing ?Cepstrum-based spaceMLLR adaptation of ....GMM modeling of .....Individuality, mic, age, etc
Application to LLIntelligible pronunciationStr. description of learnersSegmental intelligibilityDesign of efficient learning
Development of Japanese English read speech database
Corpus-based analysis of JE productionPhonetic Tree Analysis (PTA)Relational observation can visualize well how the student is.
Corpus-based analysis of JE perceptionCognition-based estimation of the segmental intelligibilityAccordance between two different levels of structures
Possible applications of PTA The most efficient learning of English phonemes for the studentPrediction of the student’s future
Some interesting issues on PTAB vs. A, B vs. S, and B vs. B
Conclusions & future works (#1)
Tuning up of acoustic conditions for analysisKind of cepstrums, dimensions, ∆&∆∆ components, etc.
PTA-based native trees should be similar to the DF-based classical trees.How to handle insertions and deletions in non-native speech ?
Better preparation of transcriptions to build HMMsMore adequate derivation of phoneme-based distance matrix
Clustering of the treesMeaningful and effective definition of distance between two trees
Bottom-up definition of typical states of Japanese EnglishState transition model of change of pronunciation through learning
Practical and pedagogical evaluationIs PTA-based representation really good for teachers and students ?Is PTA-based design of learning really effective and efficient ?
Conclusions & future works (#2)
Thank you. Any questions ?
?? ?