Spatial Diffuseness Features for DNN-Based Speech Recognition in Noisy and Reverberant Environments
Andreas Schwarz, Christian Huemmer, Roland Maas, Walter Kellermann
Lehrstuhl für Multimediakommunikation und Signalverarbeitung
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
ICASSP 2015
ICASSP 2015: Spatial Diffuseness Features for DNN-Based Speech Recognition / Andreas Schwarz, Christian Huemmer, Roland Maas, Walter Kellermann
Deep Neural Networks for Acoustic Modeling

Trend: explicit feature processing → implicit learning
- MFCCs → simple filterbank features [Mohamed et al. 2013]
- Filterbanks → raw time-domain signals [Jaitly, Hinton 2011]
- Denoising → noise-aware training [Seltzer et al. 2013]

What about spatial information (microphone arrays)?
- Stacked feature vectors from multiple channels [Swietojanski et al. 2013]
  - Phase information is not exploited
- Raw multi-channel waveforms [Hoshen et al. 2015]
  - Hard to generalize for arbitrary acoustic scenarios
- Spatial diffuseness features
  - Represent spatial information independently of source position and microphone array

[Figure: mh acoustics Eigenmike]
Outline

- Signal Model
- Coherence-based Dereverberation in the STFT Domain
- Extraction of Spatial Diffuseness Features
Signal Model

- The desired signal is fully coherent (only delayed between microphones)
- Noise and reverberation are diffuse and uncorrelated with the desired signal
- The coherence of the mixed sound field can be modeled as: [equation image not preserved]

→ The coherent-to-diffuse ratio (CDR) can be estimated from the complex spatial coherence of the mixture
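The model equation itself was an image on the slide. Under the two assumptions listed above, and additionally assuming equal signal and noise power at both microphones, a standard form can be derived as sketched below. The symbols are naming assumptions, not taken from the slide text: Γ_x, Γ_s, Γ_n are the coherences of the mixture, the desired signal, and the diffuse noise, and Φ_s, Φ_n are the corresponding power spectral densities.

```latex
\begin{align}
\Gamma_x(k,f)
  &= \frac{\Phi_s(k,f)\,\Gamma_s(f) + \Phi_n(k,f)\,\Gamma_n(f)}{\Phi_s(k,f) + \Phi_n(k,f)} \\
  &= \frac{\mathrm{CDR}(k,f)\,\Gamma_s(f) + \Gamma_n(f)}{\mathrm{CDR}(k,f) + 1},
\qquad
\mathrm{CDR}(k,f) = \frac{\Phi_s(k,f)}{\Phi_n(k,f)}, \quad |\Gamma_s(f)| = 1 .
\end{align}
```

With the diffuseness D(k,f) = 1/(CDR(k,f) + 1) used later in the talk, the same model reads Γ_x = (1 − D) Γ_s + D Γ_n, i.e. the mixture coherence interpolates between the fully coherent and the diffuse case.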
Coherence-based STFT-Domain Dereverberation

1. Estimate short-time spatial coherence (quasi-instantaneous)
2. Estimate coherent-to-diffuse ratio (CDR)
3. Perform spectral subtraction to suppress diffuse components

[Schwarz/Kellermann, “Coherent-to-Diffuse Power Ratio Estimation for Dereverberation”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015]

- Only instantaneous signal properties are exploited
- No knowledge or estimation of source DOA required
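The three steps above can be sketched in NumPy as follows. This is a minimal illustration, not the estimator from the cited paper: the CDR is obtained here by solving the coherence model for a unit-magnitude signal coherence (a quadratic in the CDR), which needs no DOA. The function names, the smoothing factor `alpha`, the oversubtraction factor `mu`, the gain floor, and the sinc-shaped diffuse coherence (spherically isotropic noise, omnidirectional microphones) are all assumptions made for this example.

```python
import numpy as np

def estimate_coherence(X1, X2, alpha=0.7):
    """Step 1: short-time complex coherence by recursive averaging.
    X1, X2: complex STFTs of the two microphones, shape (frames, bins)."""
    n_frames, n_bins = X1.shape
    p11 = np.zeros(n_bins)
    p22 = np.zeros(n_bins)
    p12 = np.zeros(n_bins, dtype=complex)
    coh = np.zeros((n_frames, n_bins), dtype=complex)
    for k in range(n_frames):
        p11 = alpha * p11 + (1 - alpha) * np.abs(X1[k]) ** 2
        p22 = alpha * p22 + (1 - alpha) * np.abs(X2[k]) ** 2
        p12 = alpha * p12 + (1 - alpha) * X1[k] * np.conj(X2[k])
        coh[k] = p12 / np.sqrt(p11 * p22 + 1e-12)
    return coh

def diffuse_coherence(freqs_hz, mic_distance_m, speed_of_sound=343.0):
    """Assumed noise model: sin(2*pi*f*d/c) / (2*pi*f*d/c).
    np.sinc(x) is sin(pi*x)/(pi*x), hence the argument 2*f*d/c."""
    return np.sinc(2.0 * freqs_hz * mic_distance_m / speed_of_sound)

def estimate_cdr(coh_x, coh_n, eps=1e-6):
    """Step 2: non-negative root of |coh_x * (CDR + 1) - coh_n| = CDR,
    which follows from the model when the signal coherence has magnitude 1."""
    mag = np.abs(coh_x)
    # Keep |coh_x| strictly below 1 so the quadratic does not degenerate.
    coh_x = coh_x * np.minimum(1.0, (1.0 - eps) / np.maximum(mag, eps))
    a = np.abs(coh_x) ** 2 - 1.0                     # always negative
    b = 2.0 * (np.abs(coh_x) ** 2 - coh_n * coh_x.real)
    c = np.abs(coh_x - coh_n) ** 2                   # always non-negative
    root = (-b - np.sqrt(b ** 2 - 4.0 * a * c)) / (2.0 * a)
    return np.maximum(root, 0.0)

def subtraction_gain(cdr, mu=1.0, gain_floor=0.1):
    """Step 3: magnitude gain that subtracts the diffuse power fraction 1/(CDR+1)."""
    power_gain = 1.0 - mu / (cdr + 1.0)
    return np.sqrt(np.maximum(power_gain, gain_floor ** 2))
```

In use, the gain would be multiplied onto the STFT of one channel (or a channel average) before resynthesis or log-mel feature extraction, e.g. `subtraction_gain(estimate_cdr(estimate_coherence(X1, X2), diffuse_coherence(freqs, 0.08))) * X1`.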
Evaluation

Word Error Rate for REVERB challenge evaluation set

[Figure: bar charts of WER [%] for logmelspec vs. enh. logmelspec features, SimData and RealData]

Clean speech-trained DNN (WER [%])
            logmelspec   enh. logmelspec
SimData     44.4         30.3
RealData    85.7         69.3

Multi-condition-trained DNN (WER [%])
            logmelspec   enh. logmelspec
SimData     9.5          9.4
RealData    28.8         28.8

⇨ Improvement for clean-trained DNN
⇨ Disappears with multi-condition training

Multi-condition training neutralizes the effect of dereverberation
Spatial Feature Extraction

Instead of STFT-domain enhancement, extract spatial features

- meldiffuseness:
  - 0 for purely directional sound, 1 for purely diffuse sound
  - computed from coherent-to-diffuse ratio: D(k,f) = 1/(CDR(k,f) + 1)
- Naive approach: magnitude squared coherence (melmsc)
  - Depends not only on diffuse noise content, but also on microphone spacing and DOA
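A minimal sketch of both features, assuming a CDR estimate and a complex coherence estimate per time-frequency bin are already available. The mapping D = 1/(CDR + 1) is from the slide; pooling over frequency with a normalized triangular mel filterbank is an assumption suggested by the feature names, and the helper names, the 24 bands, and the 16 kHz / 257-bin layout are illustrative choices only.

```python
import numpy as np

def diffuseness_from_cdr(cdr):
    """D = 1 / (CDR + 1): 0 for purely directional, 1 for purely diffuse sound."""
    return 1.0 / (cdr + 1.0)

def magnitude_squared_coherence(coh):
    """Naive alternative: |coherence|^2 per time-frequency bin."""
    return np.abs(coh) ** 2

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_bins, sample_rate):
    """Triangular filters, each row normalized to sum to 1 so that pooling
    a quantity in [0, 1] stays in [0, 1] (assumed pooling scheme)."""
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb / np.maximum(fb.sum(axis=1, keepdims=True), 1e-12)

def mel_pool(tf_feature, fb):
    """Pool a (frames, bins) feature into (frames, n_mels) mel bands."""
    return tf_feature @ fb.T

# Example: 24-band meldiffuseness from a dummy CDR estimate.
cdr = np.ones((5, 257))                       # CDR = 1 everywhere
fb = mel_filterbank(24, 257, 16000)
meldiffuseness = mel_pool(diffuseness_from_cdr(cdr), fb)   # 0.5 everywhere
```

Unlike the magnitude squared coherence, the diffuseness is referenced to the diffuse-field coherence through the CDR estimate, which is what removes the dependence on microphone spacing and DOA noted on the slide.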
Visualization of Features

[Figures: logmelspec, enhanced logmelspec, and meldiffuseness features]
Evaluation Setup

REVERB challenge “two microphone” task [Kinoshita et al. 2013]
- noisy and reverberant signals created from WSJCAM0 corpus
- varying direction of arrival
- 2 microphones, 8 cm spacing

DNN-based Speech Recognizer
- Kaldi toolkit
- hybrid DNN-HMM acoustic model
- “maxout” network (4 hidden layers, 2000 inputs, 400 outputs per layer)
- ±5 frame splicing
- training on multi-condition noisy and reverberant data (17.5 hours)

Feature vectors (overall dimension: 72)
- noisy logmelspec features: logmelspec + Δ + ΔΔ
- enhanced logmelspec features: enh. logmel + Δ + ΔΔ
- augmented with melmsc: logmelspec + Δ + melmsc
- augmented with meldiffuseness: logmelspec + Δ + meldiffuseness
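To make the dimensions concrete, here is a small sketch of ±5 frame splicing and of a maxout nonlinearity. It is an illustration under stated assumptions, not Kaldi code: the function names are hypothetical, edge frames are padded by repetition, and "2000 inputs, 400 outputs per layer" is read as maxout groups of 5 units. With 72-dimensional frames, splicing 11 frames gives a 792-dimensional network input.

```python
import numpy as np

def splice(features, context=5):
    """Stack every frame with its +-context neighbours.
    features: (frames, dim) -> (frames, (2 * context + 1) * dim).
    Edge frames are padded by repeating the first/last frame (assumption)."""
    n_frames = features.shape[0]
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    windows = [padded[i:i + n_frames] for i in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)

def maxout(pre_activations, group_size=5):
    """Maxout: take the maximum over consecutive groups of linear units.
    (frames, 2000) -> (frames, 400) for group_size=5."""
    n_frames, n_units = pre_activations.shape
    grouped = pre_activations.reshape(n_frames, n_units // group_size, group_size)
    return grouped.max(axis=2)

# 20 frames of 72-dimensional features (e.g. logmelspec + delta + meldiffuseness).
frames = np.arange(20 * 72, dtype=float).reshape(20, 72)
spliced = splice(frames)                       # shape (20, 792)
hidden = maxout(np.zeros((20, 2000)))          # shape (20, 400)
```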
Evaluation Results

SimData: measured impulse responses, additive noise
RealData: real recordings in noisy environment

[Figure: bar chart of WER [%] per feature type, SimData and RealData]

WER [%]     logmelspec   enh. logmelspec   logmelspec + melmsc   logmelspec + meldiffuseness
SimData     9.5          9.4               9.0                   8.5
RealData    28.8         28.8              27.7                  27.0

6% to 11% relative WER reduction by using spatial features
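The "6% to 11%" range can be checked against the WER values above. The short calculation below compares the logmelspec baseline with the meldiffuseness-augmented features; the helper name is an illustrative choice.

```python
def relative_wer_reduction(baseline_wer, improved_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer

# logmelspec -> logmelspec + meldiffuseness
sim_reduction = relative_wer_reduction(9.5, 8.5)      # SimData, about 10.5 %
real_reduction = relative_wer_reduction(28.8, 27.0)   # RealData, about 6 %
print(f"SimData: {sim_reduction:.2f} %, RealData: {real_reduction:.2f} %")
```

Rounded, these give the upper (SimData) and lower (RealData) ends of the quoted range.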
Summary

Motivation
- STFT-domain dereverberation has little effect on WER
- Idea: exploit spatial information in the DNN

Spatial Diffuseness Features
- Can be extracted instantaneously
- “Blind”: no knowledge or estimation of the source DOA required
- Device-independent features
- 6% to 11% relative WER reduction for REVERB challenge 2-channel task
- MATLAB code available (see paper)

Can we use a similar approach to deal with directional interferers?

Thank you for your attention!
Results (Details)

WER [%]
                                          Evaluation Set                                              Development Set
                                          SimData                                   RealData          SimData  RealData
                                          Room 1      Room 2      Room 3            Room 1
Recognizer  Feature                       near  far   near  far   near  far   Avg   near  far   Avg   Avg      Avg
GMM-HMM     MFCC-LDA-MLLT-fMLLR           6.6   7.5   9.4   16.6  11.1  20.7  12.0  31.2  30.2  30.7  12.1     31.6
DNN-HMM     logmelspec+∆+∆∆               5.7   6.7   7.7   13.9  8.7   14.6  9.5   28.5  29.1  28.8  9.7      24.9
DNN-HMM     enhanced logmelspec+∆+∆∆      6.6   7.1   7.7   12.2  8.3   14.6  9.4   28.5  29.1  28.8  9.1      25.3
DNN-HMM     logmelspec+∆+melmsc           6.2   6.3   7.0   12.3  8.2   13.9  9.0   27.3  28.0  27.7  8.7      24.7
DNN-HMM     logmelspec+∆+meldiffuseness   5.9   6.1   6.9   11.0  8.2   12.9  8.5   27.8  26.3  27.0  7.9      24.2