1
Building A Highly Accurate Mandarin Speech Recognizer
Mei-Yuh Hwang, Gang Peng,
Wen Wang (SRI), Arlo Faria (ICSI),
Aaron Heidel (NTU), Mari Ostendorf
12/12/2007
2
Outline
Goal: A highly accurate Mandarin ASR
Baseline: System-2006
Improvement
  Acoustic segmentation
  Two complementary comparable systems
  Language models and adaptation
  More data
Error analysis
Future
3
Background: System-2006
849M words of training text
60K-word lexicon
Static 5-gram rescoring
465 hrs of acoustic training data
Two AMs (same phone-72 pronunciation)
  MFCC+pitch (42-dim), SAT+fMPE, CW MPE, 3000x128 Gaussians
  MFCC+MLP+pitch (74-dim), SAT+fMPE, nonCW MPE, 3000x64 Gaussians
CER 18.4% on Eval06
4
2007 Increased Training Data
870 hours of acoustic training data. 3500x128 Gaussians.
1.2G words of training text. Trigrams and 4-grams.
LM     #bigrams  #trigrams  #4-grams  Dev07-IV Perplexity
LM3    58M       108M       ---       325.7
qLM3   6M        3M         ---       379.8
LM4    58M       316M       201M      297.8
qLM4   19M       24M        6M        383.2
5
Acoustic segmentation
The former segmenter caused high deletion errors: it misclassified some speech segments as noise.
Minimum speech segment duration: 18 x 30 ms = 540 ms (about 0.5 s)
Vocabulary  Pronunciation
speech      18+ fg
noise       rej rej
silence     bg bg
Start       /null
End         /null
[Figure: segmenter network; node labels: speech, silence, noise]
6
New Acoustic Segmenter
Allow shorter speech duration.
Model Mandarin vs. foreign (English) speech separately.
Vocabulary  Pronunciation
Mandarin1   I1 F
Mandarin2   I2 F
Foreign     forgn forgn
Noise       rej rej
Silence     bg bg
Start       /null
End         /null
[Figure: segmenter network; node labels: Foreign, silence, Mandarin 1, Mandarin 2, noise]
7
Improved Acoustic Segmentation
Pruned trigram, SI nonCW-MLP MPE, on Eval06
Segmenter  Sub  Del  Ins  Total
OLD        9.7  7.0  1.9  18.6
NEW        9.9  6.4  2.0  18.3
Oracle     9.5  6.8  1.8  18.1
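As a quick consistency check on the table, the sketch below recomputes each Total as Sub + Del + Ins and the drop in deletion errors from the old to the new segmenter. The numbers are copied from the table; the function and variable names are illustrative only.

```python
# Rows copied from the segmenter table: (Sub, Del, Ins, Total), all in % CER.
rows = {
    "OLD": (9.7, 7.0, 1.9, 18.6),
    "NEW": (9.9, 6.4, 2.0, 18.3),
    "Oracle": (9.5, 6.8, 1.8, 18.1),
}


def total_cer(sub, dele, ins):
    """Character error rate as the sum of its three error components."""
    return round(sub + dele + ins, 1)


# Every reported Total should equal the sum of its components.
for name, (sub, dele, ins, total) in rows.items():
    assert total_cer(sub, dele, ins) == total, name

# Absolute reduction in deletion errors, OLD -> NEW (in % CER).
del_gain = round(rows["OLD"][1] - rows["NEW"][1], 1)
print("Deletion reduction:", del_gain)
```

The gain of the new segmenter comes from deletions; substitutions and insertions rise slightly.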
8
Decoding Architecture
[Figure: decoding flow diagram. Box labels: MLP nonCW, qLM3; PLP CW SAT+fMPE, MLLR, LM3; MLP CW SAT, MLLR, LM3; qLM4 Adapt/Rescore (two boxes); Confusion Network Combination; Aachen]
9
Two Sets of Acoustic Models
For cross adaptation and system combo
Different error behaviors
Similar error rate performance
          System-MLP            System-PLP
Features  74 (MFCC+pitch+MLP)   42 (PLP+pitch)
fMPE      no                    yes
Phones    72                    81
10
MLP Phoneme Posterior Features
Compute Tandem features with pitch+PLP input.
Compute HATS features with 19 critical bands.
Combine Tandem and HATS posterior vectors into one.
PCA(Log(71)) -> 32
MFCC + pitch + MLP = 74-dim
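The feature pipeline above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the slide does not say how the Tandem and HATS posteriors are merged, so plain averaging is assumed here; the PCA is estimated on the same random toy data it is applied to, whereas a real system would estimate it on training data; and the function name `combine_mlp_features` is hypothetical.

```python
import numpy as np


def combine_mlp_features(tandem_post, hats_post, n_components=32):
    """Merge two (frames, 71) posterior streams, take logs, reduce with PCA."""
    # ASSUMPTION: simple averaging; the merging rule is not given on the slide.
    merged = 0.5 * (tandem_post + hats_post)
    log_post = np.log(merged + 1e-10)           # Log(71)
    centered = log_post - log_post.mean(axis=0)
    # PCA via SVD of the centered data; rows of vt are principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T       # (frames, 32)


# Toy data standing in for real MLP outputs and MFCC+pitch features.
rng = np.random.default_rng(0)
frames = 200
tandem = rng.dirichlet(np.ones(71), size=frames)   # rows sum to 1
hats = rng.dirichlet(np.ones(71), size=frames)
mfcc_pitch = rng.normal(size=(frames, 42))         # 39 MFCC + 3 pitch

mlp_feat = combine_mlp_features(tandem, hats)
hmm_feat = np.hstack([mfcc_pitch, mlp_feat])       # 42 + 32 = 74 dims
print(hmm_feat.shape)
```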
11
Tandem Features [T1,T2,…,T71]
Input: 9 frames of PLP+pitch: PLP (39x9) and pitch (3x9)
MLP: (42x9)x15000x71
12
HATS Features [H1,H2,…,H71]
[Figure: HATS network. A 51x60x71 MLP per critical band (E1, E2, …, E19), followed by a (60*19)x8000x71 MLP]
13
MLP and Pitch Features
nonCW ML, Hub4 training, MLLR, LM2, on Eval04
HMM Feature               MLP Input     CER
MFCC (39-dim)             None          24.1
MFCC+F0 (42-dim)          None          21.4
MFCC+F0+Tandem (74-dim)   PLP(39*9)     20.3
MFCC+F0+Tandem (74-dim)   PLP+F0(42*9)  19.7
14
Phone-81: Diphthongs for BC
Add diphthongs (4x4=16) for fast speech and for modeling longer triphone context.
Maintain unique syllabification.
Syllable-ending W and Y are not needed anymore.
Example     Phone-72  Phone-81
要 /yao4/   a4 W      aw4
北 /bei3/   E3 Y      ey3
有 /you3/   o3 W      ow3
爱 /ai4/    a4 Y      ay4
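The phone-72 to phone-81 diphthong rewrite in the table can be sketched as a small lexicon conversion. The four merge rules are read directly off the examples; everything else is an assumption for illustration, including the function name `to_phone81_diphthongs`, the list-of-strings pronunciation format, and the initial consonant symbol `b` used in the usage line.

```python
import re

# Vowel + glide pairs merged into one diphthong, taken from the four examples.
DIPHTHONGS = {
    ("a", "W"): "aw",
    ("E", "Y"): "ey",
    ("o", "W"): "ow",
    ("a", "Y"): "ay",
}


def to_phone81_diphthongs(phones):
    """Merge a toned vowel followed by a W/Y glide into one toned diphthong."""
    out = []
    i = 0
    while i < len(phones):
        # A toned vowel looks like 'a4': letters followed by a tone digit 1-4.
        m = re.fullmatch(r"([A-Za-z]+)([1-4])", phones[i])
        nxt = phones[i + 1] if i + 1 < len(phones) else None
        if m and (m.group(1), nxt) in DIPHTHONGS:
            out.append(DIPHTHONGS[(m.group(1), nxt)] + m.group(2))
            i += 2   # consume both the vowel and the glide
        else:
            out.append(phones[i])
            i += 1
    return out


print(to_phone81_diphthongs(["a4", "W"]))        # 要: a4 W -> aw4
print(to_phone81_diphthongs(["b", "E3", "Y"]))   # 北, with a hypothetical initial
```

Because the glide is absorbed into the vowel, each syllable still maps to exactly one phone sequence, which is what keeps syllabification unique.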
15
Phone-81: Frequent Neutral Tones for BC
Neutral tones are more common in conversation.
Neutral tones were not modeled. The 3rd tone was used as a replacement.
Add 3 neutral tones for frequent chars.
Example    Phone-72  Phone-81
了 /e5/    e3        e5
吗 /ma5/   a3        a5
子 /zi5/   i3        i5
16
Phone-81: Special CI Phones for BC
Filled pauses (hmm, ah) are common in BC. Add two CI phones for them.
Add CI /V/ for English.
Example     Phone-72  Phone-81
victory     w         V
呃 /ah/     o3        fp_o
嗯 /hmm/    e3 N      fp_en
17
Phone-81: Simplification of Other Phones
Now 72+14+3+3=92 phones, too many triphones to model.
Merge similar phones to reduce #triphones.
I2 was modeled by I1, now i2.
92 – (4x3–1) = 81 phones.
Example     Phone-72  Phone-81
安 /an1/    A1 N      a1 N
词 /ci2/    I1        i2
池 /chi2/   IH2       i2
18
Different Phone Sets
Pruned trigram, SI nonCW-PLP ML, on dev07
          BN   BC    Avg
Phone-81  7.6  27.3  18.9
Phone-72  7.4  27.6  19.0
Indeed different error behaviors --- good for system combo.
19
PLP Models with fMPE Transform
PLP model with fMPE transform to compete with the MLP model.
Smaller ML-trained Gaussian posterior model: 3500x32 CW+SAT
5 neighboring frames of Gaussian posteriors.
$y_t = (A^{k} x_t + b^{k}) + M h_t$
M is 42 x (3500*32*5), h is (3500*32*5)x1.
Ref: Zheng ICASSP 07 paper
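To make the dimensions concrete, here is a minimal NumPy sketch of the feature transform, assuming it has the form y_t = (A^k x_t + b^k) + M h_t, with A^k and b^k a speaker-dependent affine transform for speaker k and M h_t the fMPE offset. That form is one reading of the slide and should be checked against the cited paper. The toy sizes, the identity/zero parameter values, and the function name are illustrative assumptions.

```python
import numpy as np


def fmpe_sat_transform(x_t, h_t, M, A_k, b_k):
    """Speaker-adapted feature plus fMPE offset: (A_k x_t + b_k) + M h_t."""
    return (A_k @ x_t + b_k) + M @ h_t


# Full-size posterior vector on the slide: 3500 states x 32 Gaussians x 5 frames.
full_h_dim = 3500 * 32 * 5

# Toy sizes so the example runs instantly (ASSUMPTION: 8 Gaussians in total).
rng = np.random.default_rng(0)
dim, n_gauss, context = 42, 8, 5
h_dim = n_gauss * context

x_t = rng.normal(size=dim)        # 42-dim PLP+pitch frame
h_t = rng.random(h_dim)           # stacked Gaussian posteriors, 5 frames
M = np.zeros((dim, h_dim))        # untrained projection: zero offset
A_k = np.eye(dim)                 # identity speaker transform
b_k = np.zeros(dim)

y_t = fmpe_sat_transform(x_t, h_t, M, A_k, b_k)
print(full_h_dim, y_t.shape)
```

With M at zero and an identity speaker transform the output equals the input, which matches the usual starting point of fMPE training where M is initialized to zero.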
20
Topic-based LM Adaptation
Latent Dirichlet Allocation topic model
[Figure: topic model applied to one sentence, using {w | w same story (4secs)}]
A 4s window is used to make adaptation more robust against ASR errors.
{w} are weighted based on distance.
21
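Topic-based LM Adaptation
Training: one topic per sentence. Train 64 topic-dependent LMs.
Testing: top n topics per sentence, weighting on neighboring 4s of speech.
$LM_{adapt} = \lambda \left( \sum_i \theta_i \, LM_i \right) + (1 - \lambda) \, qLM_4$
(with topic weights $\theta_i$ and interpolation weight $\lambda$)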
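A minimal sketch of this kind of adaptation at the level of a single word probability, assuming the adapted model is a linear interpolation of a topic mixture with the background qLM4. The weight names `topic_weights` and `lam`, and all numeric values, are hypothetical; the slide does not give them.

```python
def adapted_prob(topic_probs, topic_weights, background_prob, lam):
    """P_adapt(w|h) = lam * sum_i theta_i * P_i(w|h) + (1 - lam) * P_bg(w|h)."""
    total = sum(topic_weights)
    # Normalize the topic weights so the topic mixture is a proper distribution.
    mix = sum(t / total * p for t, p in zip(topic_weights, topic_probs))
    return lam * mix + (1.0 - lam) * background_prob


# Toy numbers: top-2 topics for a sentence and the background (qLM4) probability.
p = adapted_prob(
    topic_probs=[0.02, 0.01],     # P_i(w|h) under each selected topic LM
    topic_weights=[0.75, 0.25],   # theta_i from topic inference (hypothetical)
    background_prob=0.004,        # P(w|h) under the background LM
    lam=0.5,                      # interpolation weight (hypothetical)
)
print(p)
```

Keeping the background model in the mixture guards against a wrong topic assignment caused by ASR errors in the surrounding window.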
22
Topic-based LM Adaptation
LMi still 60K-words?
Per-sentence adaptation?
Computational cost?
23
LM Adaptation and CNC on Dev07
UW 2 systems only
Dev07          CW PLP  CW MLP  CNC
LM3            12.0    11.9    ---
LM4            11.9    11.7    11.4
Adapted qLM4   11.7    11.4    11.2
24
LM Adaptation and CNC on Eval07
AM(adapt. hyps)  PLP(MLP)  MLP(PLP)  MLP(Aachen)  PLP(Aachen)  Rover
LM3              10.2      9.6       9.9          10.1         --
qLM4             10.2      9.7       10.0         10.1         --
LM4              10.0      9.6       9.8          10.0         9.1
Adapted qLM4     9.7       9.3       9.6          9.7          8.9
25
Eval07
Team CER
UW 9.1%
RWTH 12.1%
UW+RWTH 8.9%
CU+BBN 9.4%
IBM+CMU 9.8%
26
2006 vs. 2007 on Eval07
             SUB  DEL  INS  TOTAL
2006 system  7.2  6.5  0.4  14.1
2007 system  5.5  3.0  0.4  8.9
37% relative improvement!!
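The 37% figure is the relative CER reduction from 14.1% to 8.9%. The short sketch below reproduces it from the table and applies the same formula to the deletion column; the function name is illustrative.

```python
def relative_reduction(old, new):
    """Relative error reduction in percent: 100 * (old - new) / old."""
    return 100.0 * (old - new) / old


# Totals and deletion rates copied from the Eval07 table (% CER).
total_rel = relative_reduction(14.1, 8.9)
del_rel = relative_reduction(6.5, 3.0)

print(f"Total CER: {total_rel:.1f}% relative reduction")
print(f"Deletions: {del_rel:.1f}% relative reduction")
```

Deletions shrink by more than half, the largest relative drop of the three error types, while insertions are unchanged.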
27
Progress
Testset 2006 2007-06 2007-12
Eval06 18.4% 15.3% 14.7%
Dev07 --- 11.2% 9.6%*
Eval07 14.1% 8.9% ---
28
RWTH Demo
UW acoustic segmenter.
RWTH single-system ASR.
Foreign (Korean) speech skipped.
Mis-reco highlighted.
Manual sentence segmentation.
Machine translation.
Not real-time.
29
MT Error Analysis on Extreme Cases
Snippet Dur CER HTER
a) Worst BN 87s 10.9% 47.73%
b) Worst BC 72s 24.9% 48.37%
c) Best BN 62s 0 12.67%
d) Best BC 77s 15.2% 14.20%
CER not directly related to HTER; genre matters. Better CER does ease MT.
30
MT Error Analysis
(a) worst BN: OOV names
(b) worst BC: overlapped speech
(c) best BN: composite sentences
(d) best BC: simple sentences with disfluency and re-starts.
*.html, *.wav
31
Error Analysis
OOV (especially names): problematic for ASR, MT, distillation.
  徐昌霖, 徐成民, 徐长明: Xu, Chang-Lin
  黄竹琴, 黄朱琴, 黄朱勤, 皇猪禽, 黄朱其: Huang, Zhu-Qin
32
Error Analysis
MT BN high errors
  Composite syntax structure.
  Syntactic parsing would be useful.
MT BC high errors
  Overlapped speech
  ASR high errors due to disfluency
  Conjecture: MT on perfect BC ASR is easy, for its simple/short sentence structure
33
Next ASR: Chinese Organization Names
Semi-auto abbreviation generation for long words.
  Segment a long word into a sequence of shorter words.
  Extract the 1st char of each shorter word: 世界卫生组织 -> 世卫
  (Make sure they are in the MT translation table, too)
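One way to sketch the abbreviation step: take the first character of each shorter word and propose every prefix of that initial string as a candidate for manual review, which fits the "semi-auto" framing. The prefix idea is an assumption added here so that 世卫 appears among the candidates; the segmentation into 世界 / 卫生 / 组织 is assumed to come from an existing word segmenter, and the function name is hypothetical.

```python
def abbreviation_candidates(words, min_len=2):
    """Propose abbreviations built from the first character of each word.

    `words` is a long organization name already segmented into shorter words.
    Returns every prefix (length >= min_len) of the string of initials, to be
    filtered by a human or by corpus counts.
    """
    initials = "".join(w[0] for w in words if w)
    return [initials[:n] for n in range(min_len, len(initials) + 1)]


# 世界卫生组织 (World Health Organization), pre-segmented (ASSUMPTION).
cands = abbreviation_candidates(["世界", "卫生", "组织"])
print(cands)
```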
34
Next ASR: Chinese Person Names
Mandarin has a high rate of homophones: 408 syllables for 6000 common characters, about 14 homophone chars per syllable!!
Given a spoken Chinese OOV name, there is no way to be sure which characters to use. But for MT, this does not matter as long as the syllables are correct!!
Recognizing repetition of the same name in the same snippet: CNC at syllable level
  Xu {Chang, Cheng} {Lin, Min, Ming}
  Huang Zhu {Qin, Qi}
After syllable CNC, apply the same name to all occurrences in Pinyin.
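A much-simplified stand-in for syllable-level CNC is position-wise majority voting over the Pinyin of each recognized occurrence of the name. Real confusion network combination uses posterior scores and alignment; this sketch assumes the occurrences are already aligned and of equal length, and the function name is hypothetical.

```python
from collections import Counter


def vote_syllables(hyps):
    """Pick the most frequent syllable at each position across hypotheses.

    `hyps` is a list of equal-length syllable lists for the same spoken name.
    Ties are left to Counter's ordering; a real system would use posteriors.
    """
    return [Counter(column).most_common(1)[0][0] for column in zip(*hyps)]


# Pinyin of three recognized variants of the same name in one snippet.
hyps = [
    ["Xu", "Chang", "Lin"],
    ["Xu", "Cheng", "Min"],
    ["Xu", "Chang", "Ming"],
]
consensus = vote_syllables(hyps)
print(consensus[:2])
```

The first two positions have a clear winner. The third is a three-way tie, which is the case where syllable posteriors from the recognizer would be needed.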
35
Next ASR: Foreign Names
English spelling in lexicon, with (multiple) Mandarin pronunciations:
  Bush /bu4 shi2/ or /bu4 xi1/
  Bin Laden /ben1 la1 deng1/ or /ben3 la1 deng1/
  John /yue1 han4/
  Sadr /sa4 de2 er3/
Name mapping from MT?
Need to do name tagging on training text (Yang Liu), convert Chinese names to English spelling, re-train n-gram.
36
Next ASR: LM
LM adaptation with fine topics, each topic with small vocabulary size.
Spontaneous speech: n-gram backtraces to content words in search or N-best? Text paring modeling?
  我想那 (也)(也) 也是 -> 我想那也是
  I think it, (too), (too), is, too. -> I think it is, too.
If optimizing CER, stm needs to be designed such that disfluency is optionally deletable: 小孩 (儿)
37
Next ASR: AM
Add explicit tone modeling (Lei07).
  Prosody info: duration and pitch contour at word level
  Various backoff schemes for infrequent words
More understanding of why outside regions are not helping with AM adaptation.
Add SD MLLR regression tree (Mandal06).
Improve auto speaker clustering
  Smaller clusters, better performance
  Gender ID first.
38
ASR & MT Integration
Do we need to merge lexicon? ASR MT.
Do we need to use the same word segmenter?
Is word/char-level CNC output better for MT?
Open questions and feedback!!!