In-car Speech Recognition Using Distributed Microphones

Tetsuya Shinde, Kazuya Takeda, Fumitada Itakura

Center for Integrated Acoustic Information Research

Nagoya University

Background

• In-car speech recognition using multiple microphones
  – Since the positions of the speaker and the noise sources are not fixed, many sophisticated algorithms are difficult to apply.
  – A robust criterion for parameter optimization is necessary.
• Multiple Regression of Log Spectra (MRLS)
  – Minimize the log spectral distance between the reference speech and the multiple regression results of the signals captured by distributed microphones.
• Filter parameter optimization for microphone array (M.L. Seltzer, 2002)
  – Maximize the likelihood by changing the filter parameters of a microphone array system for a reference utterance.

Sample utterances

idling, city area, expressway

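Block diagram of MRLS

[Figure: the speech signal is captured by the distant microphones; each channel passes through spectrum analysis to give a log MFB output; multiple regression (MR) with the regression weights combines these into an approximate log MFB output, which is passed to speech recognition.]

\[ \log \hat{X}_0(k) = \sum_{i=1}^{N} \hat{w}_i(k) \, \log X_i(k) \]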
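The regression step in the block diagram can be sketched in a few lines of NumPy. This is only an illustration of a per-band weighted sum of log MFB outputs: the function name `mrls_combine`, the array layout, and the toy sizes (6 microphones, 10 frames, 24 bands) are assumptions made here, not details taken from the slides.

```python
import numpy as np

def mrls_combine(log_mfb, weights):
    """Weighted sum of per-microphone log MFB outputs (hypothetical helper).

    log_mfb: array of shape (n_mics, n_frames, n_bands), log MFB outputs.
    weights: array of shape (n_mics, n_bands), one weight per microphone and band.
    Returns an array of shape (n_frames, n_bands).
    """
    log_mfb = np.asarray(log_mfb, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Sum over the microphone axis m, separately for every frame t and band k.
    return np.einsum("mtk,mk->tk", log_mfb, weights)

# Toy data: random numbers stand in for real log MFB outputs.
rng = np.random.default_rng(0)
log_mfb = rng.normal(size=(6, 10, 24))
weights = np.full((6, 24), 1.0 / 6.0)  # equal weights, for illustration only
approx = mrls_combine(log_mfb, weights)
```

With equal weights the result reduces to the average of the log spectra over microphones, which is a convenient sanity check before plugging in trained weights.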

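Modified spectral subtraction

[Figure: the speech source S reaches microphone i through H_i and the noise source N reaches it through G_i, giving the observations X_1, ..., X_i, ..., X_N.]

• Assume that the power spectrum at each microphone position obeys the power sum rule:

\[ X_i = |H_i|^2 S + |G_i|^2 N \]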

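Taylor expansion of log spectrum

\[ \log X_i = \log\left( |H_i|^2 S + |G_i|^2 N \right) \]

\[ \Delta \log X_i \approx \frac{\partial \log X_i}{\partial \log S} \, \Delta \log S + \frac{\partial \log X_i}{\partial \log N} \, \Delta \log N = a_i \, \Delta \log S + b_i \, \Delta \log N \]

\[ \log X_i - \log X_i^{(0)} \approx a_i \left( \log S - \log S^{(0)} \right) + b_i \left( \log N - \log N^{(0)} \right) \]

\[ \log X_i^{(d)} \approx a_i \log S^{(d)} + b_i \log N^{(d)} \]

\[ a_i = \frac{|H_i|^2 S}{|H_i|^2 S + |G_i|^2 N}, \qquad b_i = \frac{|G_i|^2 N}{|H_i|^2 S + |G_i|^2 N} \]

Here the superscript (0) marks the expansion point and (d) the deviation from it.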
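The first-order expansion can be checked numerically. The sketch below assumes scalar toy values for |H_i|^2, |G_i|^2, S and N (all made up for illustration) and compares the exact change in log X_i with the linear approximation a_i Δlog S + b_i Δlog N.

```python
import numpy as np

def log_spectrum(S, N, H2, G2):
    """log X_i under the power sum rule X_i = |H_i|^2 S + |G_i|^2 N."""
    return np.log(H2 * S + G2 * N)

def taylor_coeffs(S0, N0, H2, G2):
    """Coefficients a_i and b_i evaluated at the expansion point (S0, N0)."""
    X0 = H2 * S0 + G2 * N0
    return H2 * S0 / X0, G2 * N0 / X0

# Toy values (assumptions, not measurements from the slides).
H2, G2 = 0.8, 0.5
S0, N0 = 2.0, 1.0
a, b = taylor_coeffs(S0, N0, H2, G2)

# Small deviations in the log domain.
dlogS, dlogN = 0.01, -0.02
S, N = S0 * np.exp(dlogS), N0 * np.exp(dlogN)

exact = log_spectrum(S, N, H2, G2) - log_spectrum(S0, N0, H2, G2)
approx = a * dlogS + b * dlogN
print(f"a={a:.4f} b={b:.4f} exact={exact:.6f} approx={approx:.6f}")
```

For small deviations the two values agree closely, and a_i + b_i equals 1 by construction, since both coefficients share the same denominator.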

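Multiple regression of log spectrum

\[ \log S^{(d)} \approx \sum_i \lambda_i \log X_i^{(d)} \]

The squared regression error is

\[ \left[ \log S^{(d)} - \sum_i \lambda_i \log X_i^{(d)} \right]^2 = \left[ \left( 1 - \sum_i \lambda_i a_i \right) \log S^{(d)} - \sum_i \lambda_i b_i \log N^{(d)} \right]^2 \]

Minimum error is given when

\[ \sum_i \lambda_i E[a_i] = 1, \qquad \sum_i \lambda_i E[b_i] = 0 \]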
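One way to estimate regression weights of this kind is ordinary least squares on the log-spectral deviations. The sketch below is an assumption about how such a fit could be set up, not the authors' training procedure: the helper name `fit_regression_weights` is hypothetical, and the synthetic data follow the linearized model log X_i = a_i log S + b_i log N with fixed, made-up a_i and b_i = 1 - a_i.

```python
import numpy as np

def fit_regression_weights(log_x_dev, log_s_dev):
    """Least-squares weights for log S ~ sum_i lambda_i * log X_i.

    log_x_dev: array (n_frames, n_mics) of log-spectral deviations per microphone.
    log_s_dev: array (n_frames,) of reference log-spectral deviations.
    """
    # lstsq returns the minimum-norm solution when the columns are collinear.
    lam, *_ = np.linalg.lstsq(log_x_dev, log_s_dev, rcond=None)
    return lam

# Synthetic data under the linearized model (toy values, not real recordings).
rng = np.random.default_rng(1)
n_frames, n_mics = 500, 6
a = rng.uniform(0.3, 0.9, size=n_mics)
b = 1.0 - a
log_s = rng.normal(size=n_frames)
log_n = rng.normal(size=n_frames)
log_x = np.outer(log_s, a) + np.outer(log_n, b)

lam = fit_regression_weights(log_x, log_s)
print(lam @ a, lam @ b)
```

On this noise-free toy data the fitted weights satisfy the two conditions above: the weighted sum of a_i comes out as 1 and the weighted sum of b_i as 0, up to rounding.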

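Optimal regression weights

\[ \log X_i^{(d)} \approx a_i \log S^{(d)} + b_i \log N^{(d)} \]

\[ a_i + b_i = 1 \]

\[ \sum_i \lambda_i E[a_i] = 1, \qquad \sum_i \lambda_i E[b_i] = 0 \]

[Figure: geometric illustration of the vectors a, b and λ.]

Reduction of freedom in optimization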
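A possible reading of the "reduction of freedom" remark, worked out here as an assumption rather than taken from the slides: because a_i and b_i share the same denominator, they sum to one, and the two minimum-error conditions then force the weights themselves to sum to one.

```latex
\[
a_i = \frac{|H_i|^2 S}{|H_i|^2 S + |G_i|^2 N}, \quad
b_i = \frac{|G_i|^2 N}{|H_i|^2 S + |G_i|^2 N}
\;\Longrightarrow\; a_i + b_i = 1
\;\Longrightarrow\; E[a_i] + E[b_i] = 1
\]
\[
\sum_i \lambda_i
= \sum_i \lambda_i \left( E[a_i] + E[b_i] \right)
= \underbrace{\sum_i \lambda_i E[a_i]}_{=\,1}
+ \underbrace{\sum_i \lambda_i E[b_i]}_{=\,0}
= 1
\]
```

Under this reading, one of the weights is determined by the others, so the optimization has one fewer free parameter than there are microphones.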

Experimental Setup for Evaluation

• Recorded with 6 microphones
• Training data
  – Phonetically balanced sentences
  – 6,000 sentences while idling
  – 2,000 sentences while driving
  – 200 speakers
• Test data
  – 50 isolated word utterances
  – 15 different driving conditions
    • road (idling / city area / expressway)
    • in-car (normal / fan-low / fan-hi / CD play / window open)
  – 18 speakers

[Figure: distributed microphone positions, top view and side view.]

Recognition experiments

• HMMs:
  – Close-talking: close-talking microphone speech.
  – Distant-mic.: nearest distant microphone (mic. #6) speech.
  – MLLR: nearest distant mic. speech after MLLR adaptation.
  – MRLS: MRLS results obtained by the optimal regression weights for each training utterance.
• Test utterances
  – Close-talking speech (CLS-TALK)
  – Distant-microphone speech (DIST)
  – Distant-microphone speech after MLLR adaptation (MLLR)
  – MRLS results of the 6 different weights optimized for:
    • each utterance (OPT)
    • each speaker (SPKER)
    • each driving condition (DR)
    • the whole training corpus (ALL)

Performance Comparison (average over 15 different conditions)

[Figure: bar chart of accuracy [%] (axis range 75 to 100) for CLS-TALK, the MRLS variants OPT, SPKER, DR and ALL, MLLR, and DIST.]

Clustering in-car sound environment

• Clustering in-car sound environment using a spectrum feature concatenating distributed microphone signals

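\[ \mathbf{P} = \left[ \mathbf{R}_{36}, \mathbf{R}_{46}, \mathbf{R}_{56}, \mathbf{R}_{76} \right], \qquad \mathbf{R}_{i6} = \left[ R_{i6}(4), \ldots, R_{i6}(24) \right] \]

Each R_{i6}(k) is formed from X_i(k) and X_6(k).

Clustering Results

          normal   CD     fan lo   fan hi   window open
Class 1   2224     190    329      8        372
Class 2   440      2477   13       4        4
Class 3   25       20     2354     2684     35
Class 4   11       13     5        0        2289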
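The clustering table can be summarized by asking, for each in-car condition, which class receives most of its utterances and what share that is. The sketch below only does arithmetic on the counts reported in the table; the variable names are chosen here for illustration.

```python
import numpy as np

conditions = ["normal", "CD", "fan lo", "fan hi", "window open"]

# Rows are Class 1..4, columns follow `conditions` (counts from the table).
counts = np.array([
    [2224,  190,  329,    8,  372],
    [ 440, 2477,   13,    4,    4],
    [  25,   20, 2354, 2684,   35],
    [  11,   13,    5,    0, 2289],
])

col_totals = counts.sum(axis=0)             # utterances per condition
dominant_class = counts.argmax(axis=0) + 1  # 1-based class index
dominant_share = counts.max(axis=0) / col_totals

for name, cls, share in zip(conditions, dominant_class, dominant_share):
    print(f"{name:12s} -> class {cls} ({share:.1%})")
```

Read this way, the normal, CD and window-open conditions each map mainly to their own class, while both fan settings fall mostly into Class 3.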

Adapting weights to sound environment

[Figure: bar chart of accuracy [%] (axis range 80 to 90) for SPKER, DR, ADAPT and ALL.]

• Vary regression weights in accordance with the classification results.
• Performance is the same as with speaker- or condition-dependent weights.
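Switching weights by environment class could look like the sketch below. The slides do not say how an utterance is assigned to a class, so the nearest-centroid rule, the helper name `select_weights`, and all numbers here are assumptions used purely for illustration.

```python
import numpy as np

def select_weights(feature, centroids, class_weights):
    """Return the nearest class index and that class's regression weights.

    feature: array (n_dims,), environment feature of the current utterance.
    centroids: array (n_classes, n_dims), one centroid per environment class.
    class_weights: array (n_classes, n_mics), regression weights per class.
    """
    # Assumed rule: Euclidean nearest centroid.
    distances = np.linalg.norm(centroids - feature, axis=1)
    cls = int(np.argmin(distances))
    return cls, class_weights[cls]

# Toy example with two classes and two microphones (made-up values).
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
class_weights = np.array([[0.5, 0.5], [0.9, 0.1]])
cls, w = select_weights(np.array([4.0, 6.0]), centroids, class_weights)
print(cls, w)
```

The point of the design is that the recognizer needs no speaker or condition label at run time: the class, and hence the weight set, is inferred from the microphone signals themselves.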

Summary

• Results
  – Log spectral multiple regression is effective for in-car speech recognition using distributed multiple microphones.
  – In particular, when the regression weights are trained for a particular driving condition, very high performance can be obtained.
  – Adapting the weights to the driving condition improves the performance.
• Future work
  – Combining with a microphone array.