Advances in WP1 and WP2
Paris Meeting – 11 febr. 2005
www.loquendo.com
Advances in WP1
WP1: Environment & Sensor Robustness
T1.2 Noise Independence
• Voice Activity Detection:
– A model-based approach using Neural Networks (NN) to discriminate two classes (noise and voice) will be explored;
– NN input could be standard features (cepstral coefficients, energy) after noise reduction, possibly complemented by other features (pitch/voicing) produced by other partners (IRST);
– The training set will be multi-style, including several types of noise conditions and languages.
• Noise Reduction:
– Several noise reduction techniques will be tested on the test sets selected as benchmarks for the project:
  • Spectral Subtraction (standard, Wiener and SNR dependent) and Spectral Attenuation (Ephraim-Malah SA, standard and SNR dependent);
  • New techniques for non-stationary noises.
WP1: Speech Databases for Noise Reduction
• Aurora 2 - Connected digits - TIdigits data downsampled to 8 kHz, filtered with a G.712 characteristic, with noise artificially added at several SNRs (20 dB, 15 dB, 10 dB, 5 dB, 0 dB, -5 dB). There are three test sets:
– A: same noises as in training: subway, babble, car noise, exhibition hall;
– B: 4 different noises: restaurant, street, airport, train station;
– C: same noises as A but filtered with a different microphone.
• Aurora 3 - Connected digits recorded in a car environment - Signal collected by hands-free (ch1) and close-talking (ch0) microphones. In HIWIRE we use the Italian and Spanish recordings. There are two test sets:
– WM: ch0 and ch1 recordings used in both training and testing lists;
– HM: ch0 for training and ch1 for testing.
• Aurora 4 - Continuous speech, 5k vocabulary - WSJ0 5K with added noise of 6 kinds: car, babble, restaurant, street, airport, train station. It uses the standard bigram language model.
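Adding noise "at a given SNR", as in the Aurora 2 description above, can be illustrated with a minimal Python sketch. It only rescales the noise from mean signal powers; it is not the Aurora tooling, which also applies G.712 filtering. The function names and the synthetic tone and noise are assumptions made for illustration.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    noise = np.resize(noise, speech.shape)      # repeat or truncate noise to the speech length
    p_speech = np.mean(speech ** 2)             # mean speech power
    p_noise = np.mean(noise ** 2)               # mean noise power before scaling
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

def measured_snr_db(speech, noisy):
    """SNR in dB of `noisy`, given the clean `speech` it was built from."""
    residual = noisy - speech
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(residual ** 2))

# Synthetic stand-ins (assumptions): a 1 s tone at 8 kHz and white noise.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000.0)
noise = rng.standard_normal(4000)

for snr in (20, 15, 10, 5, 0, -5):              # the SNR levels listed for Aurora 2
    noisy = add_noise_at_snr(speech, noise, snr)
    print(snr, round(measured_snr_db(speech, noisy), 2))
```

Because the gain is computed on the same noise segment that is added, the measured SNR matches the requested one.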
Denoising Techniques for baseline evaluations
Spectral Subtraction (SS) operates in the frequency domain and attempts to compute a denoised version of the power spectrum. Wiener spectral subtraction is defined as:
\[
|\hat{X}_k(m)|^2 =
\begin{cases}
\dfrac{\left(|Y_k(m)|^2 - \alpha(m)\,|\hat{D}_k(m)|^2\right)^2}{|Y_k(m)|^2} & \text{if } |Y_k(m)|^2 - \alpha(m)\,|\hat{D}_k(m)|^2 > \beta(m)\,|Y_k(m)|^2 \\[1ex]
\beta(m)\,|Y_k(m)|^2 & \text{otherwise}
\end{cases}
\]
where $m$ is the time frame, $k$ the frequency bin, $|\hat{D}_k(m)|^2$ is an estimate of the noise power spectrum, $|Y_k(m)|^2$ is the noisy power spectrum, $|\hat{X}_k(m)|^2$ is the estimate of the clean spectrum, $\alpha(m)$ is the noise overestimation and $\beta(m)$ is the flooring. The standard case assumes that flooring and overestimation are constant in time. The best results are obtained with flooring and overestimation parameters that depend on the estimated global Signal-to-Noise Ratio at time $m$, SNR(m), through piecewise linear functions.
[Figure: two plots of the parameters as piecewise linear functions of SNR(m) in dB. Values recoverable from the slide: 1.5 and 0.001 with SNR ticks at 0, 10, 20 dB on one plot; 1.0 and 0.01 with SNR ticks at 0, 15, 20 dB on the other.]
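As a rough illustration of SNR-dependent Wiener spectral subtraction, here is a minimal Python sketch for one frame of power-spectrum bins. It assumes a Wiener rule in which the subtracted power is squared and divided by the noisy power, with the floor taken as a fraction of the noisy power. The breakpoints of the two piecewise linear functions and all names are hypothetical. They are not the settings used in the Loquendo experiments.

```python
import numpy as np

# Hypothetical breakpoints (SNR in dB -> parameter value); illustrative only.
ALPHA_SNR, ALPHA_VAL = [0.0, 20.0], [1.5, 1.0]      # overestimation factor
BETA_SNR, BETA_VAL = [0.0, 20.0], [0.01, 0.001]     # flooring factor

def wiener_ss_frame(noisy_power, noise_power, snr_db):
    """Denoise one frame: inputs are noisy and estimated-noise power per frequency bin."""
    # np.interp is piecewise linear and holds the end values outside the breakpoints.
    alpha = np.interp(snr_db, ALPHA_SNR, ALPHA_VAL)
    beta = np.interp(snr_db, BETA_SNR, BETA_VAL)
    subtracted = noisy_power - alpha * noise_power   # noisy power minus overestimated noise
    floor = beta * noisy_power                       # spectral floor
    wiener = subtracted ** 2 / np.maximum(noisy_power, 1e-12)  # assumed Wiener form
    return np.where(subtracted > floor, wiener, floor)

noisy = np.array([10.0, 2.0, 1.0])
noise = np.array([1.0, 1.0, 1.0])
print(wiener_ss_frame(noisy, noise, snr_db=10.0))
```

With constant `alpha` and `beta` the same function covers the standard (non SNR-dependent) case.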
Baseline evaluations of Loquendo ASR on Aurora2 speech databases
Baseline Performance evaluations
• This test was performed with the Loquendo ASR with the CLEAN / MULTI_CONDITION models trained using the Aurora2 training lists.
• The test was done using the A/B/C testing lists.
Performance in terms of Word Accuracy (%), with Error Reduction (%) in parentheses

CLEAN Models         Test A        Test B        Test C        A-B-C
RPLP                 75.6          77.5          75.3          76.3
+ Wiener SNR Dep.    84.0 (34.4)   84.4 (30.7)   83.3 (32.4)   84.0 (32.5)

MULTI Models         Test A        Test B        Test C        A-B-C         Avg.
RPLP                 93.5          91.1          90.2          91.9          84.1
+ Wiener SNR Dep.    93.9 (6.1)    92.1 (11.2)   90.5 (3.1)    92.5 (7.4)    88.2 (25.8)

LASR Models          Test A        Test B        Test C        A-B-C
RPLP                 80.9          83.3          77.6          81.2
+ Wiener SNR Dep.    88.1 (37.7)   88.3 (29.9)   86.2 (38.4)   87.8 (35.1)
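The Error Reduction figures in parentheses appear consistent with the relative reduction of the word error rate (100 minus Word Accuracy). A minimal Python sketch of that arithmetic follows; the function name is an assumption. Small differences in the last digit are to be expected if the slide's figures were computed from unrounded accuracies.

```python
def error_reduction(baseline_acc, improved_acc):
    """Relative reduction (%) of the word error rate, from two word accuracies in %."""
    baseline_err = 100.0 - baseline_acc
    improved_err = 100.0 - improved_acc
    return 100.0 * (baseline_err - improved_err) / baseline_err

# CLEAN models, Test A: 75.6 -> 84.0
print(round(error_reduction(75.6, 84.0), 1))  # 34.4
```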
Baseline evaluations of Loquendo ASR on Aurora3 speech databases
Baseline Performance evaluations
• This test was performed with the Loquendo ASR and the models trained using the Aurora3 training lists.
• The test was done using the Well Matched (WM) and High Mismatch (HM) testing lists.
Performance in terms of Word Accuracy (%), with Error Reduction (%) in parentheses

Aurora3 Models       Ita WM        Ita HM        Spa WM        Spa HM
RPLP                 98.2          46.6          97.3          74.6
+ Wiener SNR dep.    98.3 (5.5)    77.5 (59.4)   97.6 (11.1)   89.9 (60.2)

LASR Models          Ita WM        Ita HM        Spa WM        Spa HM
RPLP                 -             56.4          -             79.4
+ Wiener SNR dep.    -             74.6 (41.7)   -             84.9 (26.6)
Baseline evaluations of Loquendo ASR on Aurora4 speech databases
(…work in progress)
WP1: Workplan
• Selection of suitable benchmark databases (m6)
• Completion of LASR baseline experiments with Spectral Subtraction (Wiener SNR dependent) (m12)
• Discriminative VAD (m16)
• Spectral Attenuation (Ephraim-Malah SA SNR dependent) (m18)
• Noise estimation and reduction for non-stationary noises (m24)
Advances in WP2
WP2: User Robustness
T2.2 Speaker Adaptation
• Acoustic model adaptation:
– Loquendo ASR is based on hybrid HMM-NN;
– Hybrid HMM-NN is an alternative to HMM modeling that exploits the discriminative training of an MLP to estimate the likelihoods of the acoustic units; it is also very efficient for open vocabularies;
– Unlike for HMMs, little has been published on the adaptation of NNs.
• State-of-the-art NN adaptation methods:
– The Linear Input Network (LIN) method has been proposed for speaker adaptation with promising results [Neto 1996] [Mana 2002];
– The principle of LIN adaptation is to learn, through error back-propagation, the parameters of a linear transformation of the input space;
– The speaker-independent acoustic model (MLP) is kept fixed.
• Innovative NN adaptation methods:
– Other innovative techniques for NN adaptation will be proposed and tested, including regularization techniques and rotations of the NN hidden unit activations.
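The LIN principle described above (a trainable linear input transform in front of a frozen MLP) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Loquendo's implementation: the network sizes, random stand-in weights, synthetic "speaker" distortion, learning rate and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 8, 3

# Frozen speaker-independent MLP (random stand-in weights, purely illustrative).
W1, b1 = 0.5 * rng.standard_normal((n_in, n_hid)), np.zeros(n_hid)
W2, b2 = 0.5 * rng.standard_normal((n_hid, n_out)), np.zeros(n_out)

def forward(x, A, c):
    """LIN transform (x @ A + c) followed by the frozen MLP; returns hidden units and posteriors."""
    h = np.tanh((x @ A + c) @ W1 + b1)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return h, e / e.sum(axis=1, keepdims=True)

def cross_entropy(p, y):
    return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))

def adapt_lin(x, y, epochs=500, lr=0.005):
    """Learn only the LIN parameters (A, c) by back-propagating through the fixed MLP."""
    A, c = np.eye(n_in), np.zeros(n_in)          # identity start = unadapted model
    for _ in range(epochs):
        h, p = forward(x, A, c)
        d_logits = p.copy()
        d_logits[np.arange(len(y)), y] -= 1.0    # gradient of cross-entropy w.r.t. logits
        d_logits /= len(y)
        d_pre = (d_logits @ W2.T) * (1.0 - h ** 2)   # back through tanh
        d_z = d_pre @ W1.T                       # gradient at the LIN output
        A -= lr * (x.T @ d_z)                    # W1, b1, W2, b2 are never updated
        c -= lr * d_z.sum(axis=0)
    return A, c

# Toy data: labels from the frozen model on "speaker-independent" features,
# then a synthetic linear distortion standing in for a new speaker.
x_si = rng.standard_normal((200, n_in))
y = forward(x_si, np.eye(n_in), np.zeros(n_in))[1].argmax(axis=1)
x_spk = x_si @ (np.eye(n_in) + 0.3 * rng.standard_normal((n_in, n_in))) + 0.5

A, c = adapt_lin(x_spk, y)
loss_before = cross_entropy(forward(x_spk, np.eye(n_in), np.zeros(n_in))[1], y)
loss_after = cross_entropy(forward(x_spk, A, c)[1], y)
print(f"cross-entropy before: {loss_before:.3f}, after LIN adaptation: {loss_after:.3f}")
```

Starting from the identity transform means the adapted system initially behaves exactly like the speaker-independent one, and only `n_in * n_in + n_in` parameters are estimated from the adaptation data.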
LOQUENDO Activity in the first year
• The first activity has been the selection of suitable benchmark databases: the WSJ0 Adaptation component and the WSJ1 Spoke-3 component.
• The second activity has been the set-up of experimental baselines for these databases, with standard LASR and without adaptation.
• In the meantime, the LIN adaptation method has been implemented; experiments on the benchmarks are under way and will be presented at M12.
Speech Databases for Speaker Adaptation
• WSJ0 (standard ARPA, 1993, LDC, $1000):
– Large vocabulary (5K words) continuous speech database;
– Test set: 8 speakers, ~40 utterances each, read speech, bigram LM;
– Adaptation set: the same 8 speakers, 40 utterances each.
• WSJ1 (1994, LDC, $1500):
– Similar to WSJ0, same vocabulary and LM;
– SPOKE-3: standard case study of adaptation to non-native speakers;
– 10 speakers, 40 adaptation utterances, 40 test utterances.
• HIWIRE Non-Native Speaker database:
– Collected within the project;
– 80 speakers, each reads 100 utterances.
WSJ0 baseline
• The WSJ0 SI Test Set is made up of 8 speakers and ~40 sentences for each speaker (two microphones: WV1: Sennheiser; WV2: others).
• Vocabulary: 5K words, with a standard bigram LM.
• The Adaptation component of WSJ0 is made up of the same 8 speakers as the SI test, with 40 adaptation sentences for each of them.
• Only the part of the adaptation and test sets recorded with the same microphone (Sennheiser, WV1) has been used.

Speakers (Spk): WV1_440 to WV1_447

Adaptation Model   WV1_440  WV1_441  WV1_442  WV1_443  WV1_444  WV1_445  WV1_446  WV1_447  Average
No Adaptation      83.6     79.0     80.7     87.1     79.7     82.2     88.5     82.0     82.8
WSJ1 – SPOKE-3 baseline
• Spoke-3 is the standard WSJ1 case study to evaluate adaptation to non-native speakers.
• There are 10 non-native speakers.
• For each of them there are 40 adaptation sentences and ~40 test sentences.
• Vocabulary is 5K words, with a standard bigram LM.
• Standard LASR for US English has been used.

Adaptation Model   4N0   4N1   4N3   4N4   4N5   4N8   4N9   4NA   4NB   4NC   Average
No Adaptation      19.8  24.3  34.0  29.0  56.2  77.7  71.1  71.3  60.6  57.8  49.7
THE FEMALE PRODUCES A LITTER OF TWO TO FOUR YOUNG IN NOVEMBER AND DECEMBER
Workplan
• Selection of suitable benchmark databases (m6)
• Baseline set-up for the selected databases (m8)
• LIN adaptation method implemented and tested on the benchmarks (m12)
• Regularization methods implemented and tested on the benchmarks (m12)
• Innovative NN adaptation methods for acoustic modeling (m24)