In-car Speech Recognition Using Distributed Microphones
Tetsuya Shinde Kazuya Takeda Fumitada Itakura
Center for Integrated Acoustic Information Research
Nagoya University
Background
• In-car speech recognition using multiple microphones
  – Since the positions of the speaker and the noise sources are not fixed, many sophisticated algorithms are difficult to apply.
  – A robust criterion for parameter optimization is necessary.
• Multiple Regression of Log Spectra (MRLS)
  – Minimize the log spectral distance between the reference speech and the multiple regression results of the signals captured by distributed microphones.
• Filter parameter optimization for microphone array (M.L. Seltzer, 2002)
  – Maximize the likelihood by changing the filter parameters of a microphone array system for a reference utterance.
Sample utterances
[Audio samples: idling, city area, expressway]
Block diagram of MRLS
[Figure: speech signal → distant microphones → spectrum analysis (one per microphone) → log MFB output → multiple regression (MR) using the regression weights → approximate log MFB output → speech recognition]
\[ \log \hat{X}_0(k) = \sum_{i=1}^{N} \hat{w}_i(k) \log X_i(k) \]
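The regression in the block diagram can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the weights are estimated independently for each frequency band by ordinary least squares against the reference log MFB output, with no intercept term. The function names, array shapes, and synthetic data are hypothetical.

```python
import numpy as np

def estimate_mrls_weights(log_x, log_ref):
    """Least-squares regression weights, one weight vector per band.

    log_x:   (frames, mics, bands) log MFB outputs of the distant microphones
    log_ref: (frames, bands) log MFB output of the reference speech
    returns: (mics, bands) regression weights
    """
    frames, mics, bands = log_x.shape
    weights = np.empty((mics, bands))
    for k in range(bands):
        # Solve min_w || log_x[:, :, k] @ w - log_ref[:, k] ||^2 for band k.
        weights[:, k], *_ = np.linalg.lstsq(log_x[:, :, k], log_ref[:, k], rcond=None)
    return weights

def apply_mrls(log_x, weights):
    """Approximate log MFB output: weighted sum over microphones per band."""
    return np.einsum("tmk,mk->tk", log_x, weights)

# Synthetic check (assumed sizes: 200 frames, 6 microphones, 24 bands).
rng = np.random.default_rng(0)
true_w = rng.normal(size=(6, 24))
log_x = rng.normal(size=(200, 6, 24))
log_ref = apply_mrls(log_x, true_w)          # noise-free reference
w = estimate_mrls_weights(log_x, log_ref)
approx = apply_mrls(log_x, w)
```

On noise-free synthetic data the estimated weights reproduce the generating weights; with real recordings the residual is the log spectral distance that the regression minimizes.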
Modified spectral subtraction
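• Assume that the power spectrum at each microphone position obeys the power sum rule.
\[ X_i = |H_i|^2 S + |G_i|^2 N \]
[Figure: speech S and noise N reach the microphones X_1, ..., X_i, ..., X_N through the transfer functions H_i and G_i]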
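Taylor expansion of log spectrum
\[ \log X_i = \log\left( |H_i|^2 S + |G_i|^2 N \right) \]
\[ \Delta \log X_i \approx \frac{\partial \log X_i}{\partial \log S} \Delta \log S + \frac{\partial \log X_i}{\partial \log N} \Delta \log N = a_i \Delta \log S + b_i \Delta \log N \]
\[ \log X_i - \log X_i^{(0)} \approx a_i \left( \log S - \log S^{(0)} \right) + b_i \left( \log N - \log N^{(0)} \right) \]
Writing the deviations from the expansion point (0) with the superscript (d):
\[ \log X_i^{(d)} \approx a_i \log S^{(d)} + b_i \log N^{(d)} \]
\[ a_i = \frac{|H_i|^2 S}{|H_i|^2 S + |G_i|^2 N}, \qquad b_i = \frac{|G_i|^2 N}{|H_i|^2 S + |G_i|^2 N} \]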
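The first-order expansion of the log spectrum can be checked numerically. The sketch below assumes the power sum model X_i = |H_i|^2 S + |G_i|^2 N and evaluates the coefficients a_i and b_i at the expansion point; all numeric values and function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def log_power_sum(log_s, log_n, h2, g2):
    """log X_i for X_i = |H_i|^2 S + |G_i|^2 N, given log S and log N."""
    return np.log(h2 * np.exp(log_s) + g2 * np.exp(log_n))

def taylor_coefficients(log_s0, log_n0, h2, g2):
    """First-order coefficients a_i, b_i at the expansion point (S0, N0)."""
    speech = h2 * np.exp(log_s0)
    noise = g2 * np.exp(log_n0)
    return speech / (speech + noise), noise / (speech + noise)

h2, g2 = 0.8, 0.3            # hypothetical |H_i|^2 and |G_i|^2
log_s0, log_n0 = 2.0, 1.0    # hypothetical expansion point
a, b = taylor_coefficients(log_s0, log_n0, h2, g2)

d_s, d_n = 0.05, -0.03       # small deviations of log S and log N
exact = (log_power_sum(log_s0 + d_s, log_n0 + d_n, h2, g2)
         - log_power_sum(log_s0, log_n0, h2, g2))
approx = a * d_s + b * d_n   # first-order prediction of the deviation of log X_i

print(f"a + b = {a + b:.6f}")
print(f"exact = {exact:.6f}, first-order = {approx:.6f}")
```

The two coefficients are the speech and noise shares of the power at microphone i, so they sum to one by construction, and the linear prediction is accurate as long as the deviations stay small.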
Multiple regression of log spectrum
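\[ \log S^{(d)} \approx \sum_i \lambda_i \log X_i^{(d)} \]
With \( \log X_i^{(d)} \approx a_i \log S^{(d)} + b_i \log N^{(d)} \), the squared error is
\[ \varepsilon^2 = \left[ \log S^{(d)} - \sum_i \lambda_i \log X_i^{(d)} \right]^2 \approx \left[ \left( 1 - \sum_i \lambda_i a_i \right) \log S^{(d)} - \sum_i \lambda_i b_i \log N^{(d)} \right]^2 \]
Minimum error is given when
\[ \sum_i \lambda_i E[a_i] = 1, \qquad \sum_i \lambda_i E[b_i] = 0 \]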
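Optimal regression weights
\[ \sum_i \lambda_i E[a_i] = 1, \qquad \sum_i \lambda_i E[b_i] = 0, \qquad a_i + b_i = 1 \]
• Reduction of the degrees of freedom in optimization
[Figure: diagram labeled a, b, and λ]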
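The remark on reduced freedom can be made explicit with a short derivation. It is a sketch that assumes the weights λ_i are deterministic constants, so that they pass through the expectation; the slides do not spell this step out. Adding the two minimum-error conditions and using a_i + b_i = 1 gives

\[
\sum_i \lambda_i = \sum_i \lambda_i E[a_i + b_i] = \sum_i \lambda_i E[a_i] + \sum_i \lambda_i E[b_i] = 1 + 0 = 1 .
\]

Under that reading the weights sum to one, so at most N - 1 of the N weights can be chosen freely.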
Experimental Setup for Evaluation
• Recorded with 6 microphones
• Training data
  – Phonetically balanced sentences
  – 6,000 sentences while idling
  – 2,000 sentences while driving
  – 200 speakers
• Test data
  – 50 isolated word utterances
  – 15 different driving conditions
    • road (idling / city area / expressway)
    • in-car (normal / fan-low / fan-hi / CD play / window open)
  – 18 speakers
[Figure: distributed microphone positions (top view, side view)]
Recognition experiments
• HMMs:
  – Close-talking: close-talking microphone speech.
  – Distant-mic.: nearest distant microphone (mic. #6) speech.
  – MLLR: nearest distant mic. speech after MLLR adaptation.
  – MRLS: MRLS results obtained by the optimal regression weights for each training utterance.
• Test utterances
  – Close-talking speech (CLS-TALK)
  – Distant-microphone speech (DIST)
  – Distant-microphone speech after MLLR adaptation (MLLR)
  – MRLS results of the 6 different weights optimized for:
    • each utterance (OPT)
    • each speaker (SPKER)
    • each driving condition (DR)
    • all training corpus (ALL)
Performance Comparison (average over 15 different conditions)
[Figure: bar chart of accuracy [%], vertical axis 75 to 100, for CLS-TALK, MRLS (OPT, SPKER, DR, ALL), MLLR, and DIST]
Clustering in-car sound environment
• Clustering in-car sound environment using a spectrum feature concatenating distributed microphone signals
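\[ R_{i6}(k) = X_i(k) / X_6(k) \]
\[ \mathbf{R}_{i6} = \left[ R_{i6}(4), \ldots, R_{i6}(24) \right] \]
\[ \mathbf{P} = \left[ \mathbf{R}_{36}, \mathbf{R}_{46}, \mathbf{R}_{56}, \mathbf{R}_{76} \right] \]
Clustering Results
          normal    CD   fan lo  fan hi  window open
Class 1     2224   190      329       8          372
Class 2      440  2477       13       4            4
Class 3       25    20     2354    2684           35
Class 4       11    13        5       0         2289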
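The clustering step can be sketched as follows. The slides do not name the clustering algorithm, so plain k-means is used here as an assumption, and the feature is taken to be the per-band spectrum of microphones 3, 4, 5, and 7 relative to microphone 6 over bands 4 to 24, concatenated into one vector. The function names and synthetic spectra are hypothetical.

```python
import numpy as np

def relative_spectrum_feature(spectra, ref_mic=6, mics=(3, 4, 5, 7), bands=range(4, 25)):
    """Concatenate the band outputs of `mics` relative to `ref_mic`.

    spectra: dict mapping microphone number -> array indexed by band number k.
    """
    k = np.array(list(bands))
    return np.concatenate([spectra[i][k] / spectra[ref_mic][k] for i in mics])

def kmeans(features, n_classes, n_iter=50, seed=0):
    """Plain k-means; returns final labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), n_classes, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for c in range(n_classes):
            if np.any(labels == c):
                centroids[c] = features[labels == c].mean(axis=0)
    # Recompute the labels so that they match the returned centroids.
    dist = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1), centroids

rng = np.random.default_rng(1)

def fake_spectra(gain_mic3):
    # Hypothetical band outputs for microphones 3..7 (25 entries cover bands 4..24).
    spectra = {i: 1.0 + 0.01 * rng.random(25) for i in (3, 4, 5, 6, 7)}
    spectra[3] = spectra[3] * gain_mic3
    return spectra

# Two synthetic "environments" that differ in the level at microphone 3.
features = np.stack([relative_spectrum_feature(fake_spectra(g))
                     for g in [1.0] * 20 + [5.0] * 20])
labels, centroids = kmeans(features, n_classes=2)
```

Each feature vector has 4 x 21 = 84 dimensions. In the synthetic example the two environments end up in different classes; on real data the classes need not align with the driving conditions one to one, as the results table shows.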
Adapting weights to sound environment
[Figure: bar chart of accuracy [%], vertical axis 80 to 90, for SPKER, DR, ADAPT, and ALL]
• Vary regression weights in accordance with the classification results.
• Same performance as with speaker- or condition-dependent weights.
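A minimal sketch of this adaptation step is given below. It assumes that the environment class is chosen by the nearest class centroid and that one set of regression weights has been trained per class; the slides only state that the weights vary with the classification result, so the selection rule, the function name, and the array shapes are all hypothetical.

```python
import numpy as np

def adapt_and_regress(feature, centroids, class_weights, log_x):
    """Pick the weight set of the nearest environment class and apply it.

    feature:       (dims,) environment feature of the current utterance
    centroids:     (classes, dims) class centroids from clustering
    class_weights: (classes, mics, bands) regression weights trained per class
    log_x:         (frames, mics, bands) log MFB outputs of the distant microphones
    """
    cls = int(np.argmin(np.linalg.norm(centroids - feature, axis=1)))
    # Weighted sum over microphones in each band, using the selected weights.
    return cls, np.einsum("tmk,mk->tk", log_x, class_weights[cls])

# Toy setup: two classes, 6 microphones, 24 bands, 10 frames.
rng = np.random.default_rng(0)
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
class_weights = np.zeros((2, 6, 24))
class_weights[0] = 1.0 / 6.0       # class 0: average of all microphones
class_weights[1, 5] = 1.0          # class 1: last microphone only
log_x = rng.normal(size=(10, 6, 24))

cls, approx_log_mfb = adapt_and_regress(np.array([9.0, 11.0]), centroids, class_weights, log_x)
```

In the toy setup the feature lies closest to the second centroid, so the second weight set is applied and the output equals the log MFB output of the last microphone.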
Summary
• Results
  – Log spectral multiple regression is effective for in-car speech recognition using distributed multiple microphones.
  – In particular, when the regression weights are trained for a particular driving condition, very high performance can be obtained.
  – Adapting the weights to the driving condition improves the performance.
• Future work
  – Combining with a microphone array.