Advances in WP1 and WP2 – Paris Meeting, 11 Feb. 2005


www.loquendo.com

Advances in WP1



WP1: Environment & Sensor Robustness
T1.2 Noise Independence

• Voice Activity Detection:
– A model-based approach using Neural Networks (NN) to discriminate two classes (noise and voice) will be explored;
– NN input could be standard features (cepstral coefficients, energy) after noise reduction, possibly complemented by other features (pitch/voicing) produced by other partners (IRST);
– The training set will be multi-style, including several types of noise conditions and languages.
• Noise Reduction:
– Some noise reduction techniques will be tested on the test sets selected as benchmarks for the project:
  • Spectral Subtraction (standard, Wiener and SNR-dependent) and Spectral Attenuation (Ephraim-Malah SA, standard and SNR-dependent)
  • New techniques for non-stationary noises


WP1: Speech Databases for Noise Reduction

• Aurora 2 - Connected digits - TIdigits data downsampled to 8 kHz, filtered with a G.712 characteristic, with noise artificially added at several SNRs (20 dB, 15 dB, 10 dB, 5 dB, 0 dB, -5 dB). There are three test sets:
– A: same noises as in training: subway, babble, car noise, exhibition hall;
– B: 4 different noises: restaurant, street, airport, train station;
– C: same noises as A but filtered with a different microphone.
• Aurora 3 - Connected digits recorded in a car environment - Signal collected by hands-free (ch1) and close-talk (ch0) microphones. In HIWIRE we use the Italian and Spanish recordings. There are two test sets:
– WM: ch0 and ch1 recordings used in both the training and testing lists;
– HM: ch0 for training and ch1 for testing.
• Aurora 4 - Continuous speech, 5k vocabulary - WSJ0 5K with 6 kinds of added noise: car, babble, restaurant, street, airport, train station. It uses the standard bigram language model.
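As an illustration of what "noise artificially added at several SNRs" means, the sketch below scales a noise signal so that the speech-to-noise power ratio matches a target SNR in dB. It is a simplified stand-in, not the Aurora tooling: the function names are hypothetical, power is measured as the plain mean square over the whole signal, and the noise is assumed to be at least as long as the speech.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` so that the resulting SNR equals `snr_db`.

    Assumption: signal power is the mean square over the whole signal,
    and `noise` has at least as many samples as `speech`.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain g such that p_speech / (g^2 * p_noise) = 10^(snr_db / 10)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

def measured_snr_db(clean, noisy):
    """SNR in dB between a clean signal and its noisy version."""
    clean = np.asarray(clean, dtype=float)
    residual = np.asarray(noisy, dtype=float) - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.normal(size=8000)   # 1 s of synthetic "speech" at 8 kHz
    noise = rng.normal(size=8000)
    for snr in (20, 15, 10, 5, 0, -5):
        noisy = mix_at_snr(speech, noise, snr)
        print(snr, round(measured_snr_db(speech, noisy), 2))
```

Looping over the six SNR values listed for Aurora 2 produces one noisy copy of the signal per condition.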


Denoising Techniques for baseline evaluations

Spectral Subtraction (SS) operates in the frequency domain and attempts to compute a denoised version of the power spectrum. Wiener spectral subtraction is defined as:

\[
|\hat{X}_k(m)|^2 =
\begin{cases}
\dfrac{\left(|Y_k(m)|^2 - \alpha(m)\,|\hat{D}_k(m)|^2\right)^2}{|Y_k(m)|^2} & \text{if } |Y_k(m)|^2 - \alpha(m)\,|\hat{D}_k(m)|^2 > \beta(m)\,|Y_k(m)|^2 \\[2ex]
\beta(m)\,|Y_k(m)|^2 & \text{otherwise}
\end{cases}
\]

where m is the time frame, k the frequency bin, $|\hat{D}_k(m)|^2$ is an estimate of the noise power spectrum, $|Y_k(m)|^2$ is the noisy power spectrum, $|\hat{X}_k(m)|^2$ is the estimate of the clean spectrum, $\alpha(m)$ is the noise overestimation and $\beta(m)$ is the flooring. The standard case assumes that flooring and overestimation are constant in time. The best results are obtained with flooring and overestimation parameters dependent on the estimated global Signal-to-Noise Ratio at time m, SNR(m), with piecewise linear functions.

[Figure: two plots of the piecewise linear functions $\alpha(m)$ and $\beta(m)$ versus SNR(m) in dB. Surviving tick values, first plot: SNR 0, 10, 20 dB with values 1.5 and 0.001; second plot: SNR 0, 15, 20 dB with values 1.0 and 0.01.]
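A minimal sketch of Wiener spectral subtraction with SNR-dependent parameters follows. It assumes the rule takes the form (|Y|² − α|D̂|²)² / |Y|² when the subtracted power exceeds the floor β|Y|², and β|Y|² otherwise. The function names are hypothetical, and the breakpoints of the piecewise linear α and β curves are placeholders loosely based on the tick values that survive from the slide's plots, not the settings actually used.

```python
import numpy as np

def alpha_of_snr(snr_db):
    """Noise overestimation alpha(m) as a piecewise linear function of SNR.

    Assumed breakpoints: 1.5 at 0 dB, falling to 1.0 at 20 dB
    (np.interp holds the end values outside that range).
    """
    return float(np.interp(snr_db, [0.0, 20.0], [1.5, 1.0]))

def beta_of_snr(snr_db):
    """Flooring beta(m); assumed breakpoints: 0.01 at 0 dB, 0.001 at 20 dB."""
    return float(np.interp(snr_db, [0.0, 20.0], [0.01, 0.001]))

def wiener_spectral_subtraction(noisy_power, noise_power, alpha, beta):
    """Apply the Wiener subtraction rule to one frame of power spectra.

    noisy_power: |Y_k(m)|^2 for every frequency bin k
    noise_power: estimate |D_k(m)|^2 of the noise power spectrum
    Returns the clean power spectrum estimate |X_k(m)|^2.
    """
    y = np.asarray(noisy_power, dtype=float)
    d = np.asarray(noise_power, dtype=float)
    diff = y - alpha * d                       # subtracted power
    floor = beta * y                           # spectral floor
    wiener = diff ** 2 / np.maximum(y, 1e-12)  # guard against empty bins
    return np.where(diff > floor, wiener, floor)

if __name__ == "__main__":
    snr_db = 0.0  # estimated global SNR of the current frame (assumed given)
    frame = wiener_spectral_subtraction(
        [4.0, 1.0], [1.0, 1.0], alpha_of_snr(snr_db), beta_of_snr(snr_db)
    )
    print(frame)
```

In the standard (non SNR-dependent) case, the same function is simply called with constant `alpha` and `beta` for every frame.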

Baseline evaluations of Loquendo ASR on Aurora2 speech databases


Baseline Performance evaluations

• This test was performed with the Loquendo ASR with the CLEAN / MULTI_CONDITION models trained using the Aurora2 training lists.
• The test has been done using the A/B/C testing lists.

Performance in terms of Word Accuracy and, in parentheses, Error Reduction

CLEAN Models         Test A        Test B        Test C        A-B-C
RPLP                 75.6          77.5          75.3          76.3
+ Wiener SNR Dep.    84.0 (34.4)   84.4 (30.7)   83.3 (32.4)   84.0 (32.5)

MULTI Models         Test A        Test B        Test C        A-B-C         Avg.
RPLP                 93.5          91.1          90.2          91.9          84.1
+ Wiener SNR Dep.    93.9 (6.1)    92.1 (11.2)   90.5 (3.1)    92.5 (7.4)    88.2 (25.8)

LASR Models          Test A        Test B        Test C        A-B-C
RPLP                 80.9          83.3          77.6          81.2
+ Wiener SNR Dep.    88.1 (37.7)   88.3 (29.9)   86.2 (38.4)   87.8 (35.1)
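The figures in parentheses appear consistent with a relative reduction of the word error, where the error is taken as 100 minus the word accuracy. The sketch below shows that arithmetic; the function name is hypothetical and the definition is inferred from the table values rather than stated in the slides.

```python
def error_reduction(baseline_accuracy, improved_accuracy):
    """Relative error reduction (%) between two word accuracies (%).

    Assumption: error = 100 - word accuracy.
    """
    baseline_error = 100.0 - baseline_accuracy
    improved_error = 100.0 - improved_accuracy
    return 100.0 * (baseline_error - improved_error) / baseline_error

if __name__ == "__main__":
    # CLEAN models, Test A: RPLP 75.6 versus + Wiener SNR Dep. 84.0
    print(round(error_reduction(75.6, 84.0), 1))
```

For example, going from 75.6 to 84.0 lowers the error from 24.4 to 16.0, a relative reduction of about 34.4%.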


Baseline evaluations of Loquendo ASR on Aurora3 speech databases


Baseline Performance evaluations

• This test was performed with the Loquendo ASR and the models trained using the Aurora3 training lists.
• The test has been done using the Well Matched (WM) and High Mismatch (HM) testing lists.

Performance in terms of Word Accuracy and, in parentheses, Error Reduction

Aurora3 Models       Ita WM        Ita HM        Spa WM        Spa HM
RPLP                 98.2          46.6          97.3          74.6
+ Wiener SNR dep.    98.3 (5.5)    77.5 (59.4)   97.6 (11.1)   89.9 (60.2)

LASR Models          Ita WM        Ita HM        Spa WM        Spa HM
RPLP                 -             56.4          -             79.4
+ Wiener SNR dep.    -             74.6 (41.7)   -             84.9 (26.6)


Baseline evaluations of Loquendo ASR on Aurora4 speech databases

(…work in progress)


WP1: Workplan

• Selection of suitable benchmark databases (m6)
• Completion of LASR baseline experimentation of Spectral Subtraction (Wiener SNR-dependent) (m12)
• Discriminative VAD (m16)
• Spectral Attenuation (Ephraim-Malah SA, SNR-dependent) (m18)
• Noise estimation and reduction for non-stationary noises (m24)

Advances in WP2



WP2: User Robustness
T2.2 Speaker Adaptation

• Acoustic model adaptation:
– Loquendo ASR is based on a hybrid HMM-NN;
– The hybrid HMM-NN is an alternative to HMM modeling that exploits the discriminative training of an MLP to estimate the likelihoods of the acoustic units; it is also very efficient for open vocabularies;
– Unlike for HMMs, little has been done in the literature on the adaptation of NNs;
• State-of-the-art NN adaptation methods:
– The Linear Input Network (LIN) method has been proposed for speaker adaptation with promising results [Neto 1996] [Mana 2002];
– The principle of LIN adaptation is to learn, through error back-propagation, the parameters of a linear transformation of the input space;
– The speaker-independent acoustic model (MLP) is kept fixed;
• Innovative NN adaptation methods:
– Other innovative techniques for NN adaptation will be proposed and tested, including regularization techniques and rotations of the NN hidden-unit activations.
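The LIN principle described above can be sketched as follows: a linear layer, initialised to the identity, is placed in front of a frozen MLP, and only that layer is updated by back-propagating the classification error. This is a toy NumPy illustration under stated assumptions, not the Loquendo implementation: the network sizes, random weights, learning rate and all names (`FrozenMLP`, `adapt_lin`) are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class FrozenMLP:
    """Stand-in for the speaker-independent MLP; its weights are never updated."""

    def __init__(self, n_in, n_hid, n_out, rng):
        self.W1 = rng.normal(0.0, 0.5, (n_in, n_hid))
        self.b1 = np.zeros(n_hid)
        self.W2 = rng.normal(0.0, 0.5, (n_hid, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, x):
        h = np.tanh(x @ self.W1 + self.b1)
        return h, softmax(h @ self.W2 + self.b2)

    def input_gradient(self, h, probs, targets):
        """Gradient of the mean cross-entropy with respect to the MLP input."""
        d_logits = (probs - targets) / len(targets)
        d_hidden = (d_logits @ self.W2.T) * (1.0 - h ** 2)
        return d_hidden @ self.W1.T

def cross_entropy(probs, targets):
    return -np.mean(np.sum(targets * np.log(probs + 1e-12), axis=1))

def adapt_lin(mlp, x, targets, lr=0.02, epochs=300):
    """Learn the linear input transform z = x @ A + b by gradient descent."""
    n_in = x.shape[1]
    A = np.eye(n_in)      # identity start: the adapted net equals the original
    b = np.zeros(n_in)
    losses = []
    for _ in range(epochs):
        z = x @ A + b
        h, probs = mlp.forward(z)
        losses.append(cross_entropy(probs, targets))
        d_z = mlp.input_gradient(h, probs, targets)
        A -= lr * (x.T @ d_z)        # only the LIN parameters are updated
        b -= lr * d_z.sum(axis=0)
    return A, b, losses

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mlp = FrozenMLP(4, 8, 3, rng)
    clean = rng.normal(size=(200, 4))
    targets = np.eye(3)[mlp.forward(clean)[1].argmax(axis=1)]
    speaker = 1.3 * clean + 0.5      # synthetic "new speaker" distortion
    A, b, losses = adapt_lin(mlp, speaker, targets)
    print(losses[0], losses[-1])
```

Starting from the identity means that, before any adaptation data is seen, the adapted system behaves exactly like the speaker-independent one.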


LOQUENDO Activity in the first year

• The first activity has been the selection of suitable benchmark databases: the WSJ0 Adaptation component and the WSJ1 Spoke-3 component
• The second activity has been the set-up of experimental baselines for these databases, with standard LASR and without adaptation
• In the meantime, the LIN adaptation method has been implemented; experiments on the benchmarks are under way and will be presented at M12


Speech Databases for Speaker Adaptation

• WSJ0 (standard ARPA, 1993, LDC, $1000):
– Large vocabulary (5K words) continuous speech database
– Test set: 8 speakers, ~40 utterances each, read speech, bigram LM
– Adaptation set: the same 8 speakers, 40 utterances each
• WSJ1 (1994, LDC, $1500):
– Similar to WSJ0, same vocabulary and LM
– SPOKE-3: standard case study of adaptation to non-native speakers
– 10 speakers, 40 adaptation utterances, 40 test utterances
• Hiwire Non-Native Speaker database:
– Collected within the project
– 80 speakers, each reading 100 utterances


WSJ0 baseline

• The WSJ0 SI test set is made up of 8 speakers and ~40 sentences for each speaker (two microphones: WV1: Sennheiser; WV2: others)
• Vocabulary: 5K words, with a standard bigram LM
• The Adaptation component of WSJ0 is made up of the same 8 speakers as the SI test set, with 40 adaptation sentences for each of them
• Only the part of the adaptation and test sets recorded with the matching microphone (Sennheiser, WV1) has been used

Adaptation Model   Spk:WV1_440   Spk:WV1_441   Spk:WV1_442   Spk:WV1_443   Spk:WV1_444   Spk:WV1_445   Spk:WV1_446   Spk:WV1_447   Average
No Adaptation      83.6          79.0          80.7          87.1          79.7          82.2          88.5          82.0          82.8


WSJ1 – SPOKE-3 baseline

• Spoke-3 is the standard WSJ1 case study to evaluate adaptation to non-native speakers
• There are 10 non-native speakers
• For each of them there are 40 adaptation sentences and ~40 test sentences
• Vocabulary is 5K words, with a standard bigram LM
• Standard LASR for US English has been used

Adaptation Model   4N0    4N1    4N3    4N4    4N5    4N8    4N9    4NA    4NB    4NC    Average
No Adaptation      19.8   24.3   34.0   29.0   56.2   77.7   71.1   71.3   60.6   57.8   49.7

THE FEMALE PRODUCES A LITTER OF TWO TO FOUR YOUNG IN NOVEMBER AND DECEMBER


Workplan

• Selection of suitable benchmark databases (m6)
• Baseline set-up for the selected databases (m8)
• LIN adaptation method implemented and tested on the benchmarks (m12)
• Regularization methods implemented and tested on the benchmarks (m12)
• Innovative NN adaptation methods for acoustic modeling (m24)
