DataDrivenFeatures


http://fvalente.zxq.net/presentations/DataDrivenFeatures.pdf


Data-Driven Discriminative Speech Analysis Module in DARPA GALE

Fabio Valente and Hynek Hermansky

IDIAP Research Institute, Martigny, Switzerland

Motivation

ASR requires knowledge and knowledge comes from data

– speech-specific knowledge (e.g. vocal tract organs and the way they are used in speech production, …)

– task-specific knowledge (e.g. language and its phonotactics, environment, …)

data-driven features derived from English

classifier: train on small amounts of task-specific data

Sivadas and Hermansky, ICASSP 2004; Stolcke et al., ICASSP 2006 (and also this meeting)

PROBLEM

For some tasks, amounts of data may be limited

ONE SOLUTION

Acquire speech-specific knowledge from large amounts of American English data

TANDEM and its training

TANDEM (Hermansky, Ellis and Sharma, ICASSP 2000)

[Figure: block diagram. Labels: evidence; TANDEM; transformed phoneme posteriors; training data]

training data for TANDEM: can be from another application domain

[Figure: word error rate on OGI digit data versus the amount of task-specific training data used for training the HMM models (Sivadas and Hermansky, ICASSP 2002). Curves: TANDEM trained on OGI stories; PLP. Axis tick labels recoverable from the slide: 100%, 20%, 70%, 0%]

• preprocessing of input data for TANDEM is beneficial – e.g. TRAP technique (nonlinear and data-driven)

[Figure: TANDEM processing chain. Labels: evidence (anything that carries the relevant information) on time and frequency axes; linear processing; trained NN; TANDEM; posteriogram (/f/, /ay/, /v/); features for HMM (some function of phoneme posteriors); data]

The Current Research

• Where is the information?

• study linear preprocessing using LDA
  – data-driven technique
  – straightforward interpretation in terms of basis functions

[Figure: time-frequency plane with three kinds of projections: spectral projections, FIR RASTA filters, 2-D projections]

Applied earlier to the American English portion of OGI stories (about 3 hours of telephone-quality monologues from 210 adult talkers).

Hypothesis: if the analysis extracts speech-specific information, the general conclusions should hold for a different database (30 hours of RM and Switchboard from SRI)

Spectral Projections

Spectral sensitivity of projections

• Perturbation analysis – project a Gaussian shape (σ = 250 Hz) on the first 16 spectral basis functions and evaluate the effect of a 30 Hz shift in µ as a function of µ

Consistent with auditory (Bark) spectral scale

[Figure: log spectral Euclidean distance due to the shift in µ, plotted against µ. Two panels: shift in µ constant on the Hz scale; shift in µ constant on the Bark scale]
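The perturbation analysis described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the learned LDA basis is not available here, so a cosine basis stands in for it (an assumption), and the 0 to 4000 Hz range over 129 spectral samples is likewise assumed from the other slides. The names `gaussian_shape` and `shift_sensitivity` are hypothetical.

```python
import numpy as np

def gaussian_shape(freqs_hz, mu_hz, sigma_hz=250.0):
    """Gaussian-shaped spectral peak centred at mu_hz."""
    return np.exp(-0.5 * ((freqs_hz - mu_hz) / sigma_hz) ** 2)

def shift_sensitivity(basis, freqs_hz, mu_values_hz, shift_hz=30.0, sigma_hz=250.0):
    """Euclidean distance between the projections of a Gaussian at mu and at mu + shift_hz."""
    distances = []
    for mu in mu_values_hz:
        a = basis @ gaussian_shape(freqs_hz, mu, sigma_hz)
        b = basis @ gaussian_shape(freqs_hz, mu + shift_hz, sigma_hz)
        distances.append(np.linalg.norm(a - b))
    return np.array(distances)

# Stand-in basis (assumption): 16 cosine vectors over 129 spectral samples.
# The slides use the first 16 LDA-derived spectral basis functions instead.
n_bins, n_basis = 129, 16
freqs = np.linspace(0.0, 4000.0, n_bins)
k = np.arange(n_basis)[:, None]
n = np.arange(n_bins)[None, :]
basis = np.cos(np.pi * k * (n + 0.5) / n_bins)

mus = np.arange(1000.0, 3001.0, 100.0)
sens = shift_sensitivity(basis, freqs, mus)
```

With this uniform stand-in basis the sensitivity stays roughly constant across µ. A basis with Bark-like resolution would instead be expected to show sensitivity that falls as µ increases when the shift is constant in Hz.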

Relative importance of spectral regions

• Hilbert envelope of the first 15 spectral basis functions, averaged

• Importance of each frequency region for articulation and intelligibility [Fletcher 1953]

[Figure: two plots against frequency [Hz]; the second has axis ticks at 100, 1000 and 10000 Hz]
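As a rough sketch of the averaged Hilbert envelope mentioned above, the snippet below computes the analytic-signal magnitude of each basis vector with the standard FFT method and averages the results. The cosine basis is a hypothetical stand-in for the LDA-derived spectral basis, and the function names are illustrative only.

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal of x, computed via the FFT."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0                      # keep DC as is
    if n % 2 == 0:
        weights[n // 2] = 1.0             # keep Nyquist as is
        weights[1:n // 2] = 2.0           # double positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0     # double positive frequencies
    return np.abs(np.fft.ifft(spectrum * weights))

def average_envelope(basis):
    """Mean Hilbert envelope over the rows of basis."""
    return np.mean([hilbert_envelope(b) for b in basis], axis=0)

# Stand-in basis (assumption): 15 full-period cosines over 129 spectral samples.
n_bins = 129
n = np.arange(n_bins)
basis = np.array([np.cos(2 * np.pi * k * n / n_bins) for k in range(1, 16)])
importance = average_envelope(basis)
```

For these pure cosines the averaged envelope is flat. An envelope computed from a learned basis would be expected to peak in the frequency regions that the discriminants emphasise.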

Central 600 ms of temporal discriminants (impulse responses of FIR RASTA filters)

[Figure: amplitude of impulse response versus time [ms]]

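2-D discriminants

[Figure: 5 x 5 grid of 2-D discriminants; each panel spans time [ms] from -200 to 200 and frequency [Hz] from 0 to 4000. Panel labels, row by row:]

4.2%  3.7%  3.3%  3.2%  3.0%
2.8%  2.8%  2.7%  2.1%  2.0%
1.7%  1.6%  1.5%  1.4%  1.4%
1.2%  1.1%  1.1%  1.1%  1.1%
1.1%  1.1%  1.1%  1.1%  1.1%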

Multi-RASTA

[Figure: multi-RASTA filters in time and frequency. Labels: time [ms], -500 to 500; time; average; frequency derivative; 3 critical bands; frequency; example (out of 32 possible)]

                      matched training and test    mismatched channel
conventional (PLP)    5.2 %                        13.5 %
Multi-RASTA           3.7 %                        3.8 %
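One way to build a zero-mean filter bank with variable temporal resolution is sketched below. The slides do not state the filter shapes; first and second derivatives of Gaussians are assumed here, and the widths in `sigmas_ms` are hypothetical values chosen only for illustration. The 10 ms step and the -500 to 500 ms support follow the figure and the experimental setup.

```python
import numpy as np

def gaussian_derivative_filters(sigmas_ms, length_ms=1000, step_ms=10):
    """First and second Gaussian derivatives sampled at the frame rate, zero mean."""
    half = length_ms // (2 * step_ms)
    t = np.arange(-half, half + 1) * step_ms      # -500 ... 500 ms, 101 taps
    bank = []
    for s in sigmas_ms:
        g = np.exp(-0.5 * (t / s) ** 2)
        d1 = -t / s**2 * g                         # first derivative of the Gaussian
        d2 = (t**2 / s**4 - 1 / s**2) * g          # second derivative of the Gaussian
        for h in (d1, d2):
            h = h - h.mean()                       # enforce zero mean (no DC gain)
            bank.append(h / np.abs(h).max())       # normalise peak amplitude to 1
    return t, np.array(bank)

# Hypothetical widths; each contributes two filters to the bank.
t, bank = gaussian_derivative_filters([10, 20, 40, 80])

# Filtering one critical-band trajectory: a constant input gives (near) zero output.
trajectory = np.ones(300)
response = np.convolve(trajectory, bank[1], mode="valid")
```

Because every filter has zero mean, a constant offset in a band's log-energy trajectory, such as one introduced by a fixed channel, is removed. That is consistent with the small gap between the matched and mismatched results in the table above.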

Conclusions

• Data-driven (ANN-based TANDEM) feature extraction module as a means of implementing speech-specific (task-independent) knowledge
  – aims to reduce the need for large amounts of task-specific acoustic training data

• LDA-guided pre-processing: results qualitatively consistent across different databases (OGI Stories, and force-aligned SRI Broadcast News and Switchboard data)
  – optimality of a Bark-like frequency scale
  – need for a larger (about 500 ms) temporal context in feature extraction
  – dominant time-frequency discriminants are outer products of spectral discriminants and temporal discriminants

• Linear pre-processing of data for TANDEM
  – multi-RASTA (projections on a zero-mean, variable temporal resolution basis)
  – demonstrated improvements on a small-vocabulary task

2-D Linear Discriminant Analysis (Ye, Janardan and Li, NIPS 2005)

Initialize temporal basis R

Project spectro-temporal matrix on R

Estimate spectral basis L by LDA

Project spectro-temporal matrix on L

Estimate temporal basis R by LDA
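The alternating procedure listed on this slide can be sketched in NumPy as below. This is an illustrative reading of the five steps, not the authors' implementation. It assumes each sample is stored as a frequency x time matrix (so L acts on the spectral axis and R on the temporal axis), repeats the steps a fixed number of times, and adds a small ridge term for numerical stability. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def top_discriminants(Sw, Sb, k, ridge=1e-6):
    """Leading k eigenvectors of inv(Sw) @ Sb, with a small ridge added to Sw."""
    d = Sw.shape[0]
    Sw = Sw + ridge * np.trace(Sw) / d * np.eye(d)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-evals.real)[:k]
    return evecs[:, order].real

def scatter(X, y, P):
    """Within- and between-class scatter of the matrices X[i] after projection onto P."""
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        for A in Xc - mean_c:
            AP = A @ P
            Sw += AP @ AP.T
        BP = (mean_c - mean_all) @ P
        Sb += len(Xc) * (BP @ BP.T)
    return Sw, Sb

def two_d_lda(X, y, n_spec, n_temp, n_iter=5):
    """X has shape (samples, frequency, time); returns spectral basis L and temporal basis R."""
    T = X.shape[2]
    R = np.eye(T)[:, :n_temp]                  # initialize temporal basis R
    Xt = X.transpose(0, 2, 1)                  # time x frequency view of each sample
    for _ in range(n_iter):
        Sw, Sb = scatter(X, y, R)              # project on R
        L = top_discriminants(Sw, Sb, n_spec)  # estimate spectral basis L by LDA
        Sw, Sb = scatter(Xt, y, L)             # project on L
        R = top_discriminants(Sw, Sb, n_temp)  # estimate temporal basis R by LDA
    return L, R

# Small synthetic example (hypothetical data, 3 classes).
rng = np.random.default_rng(0)
F, T, n_classes, per_class = 6, 8, 3, 40
y = np.repeat(np.arange(n_classes), per_class)
class_means = rng.normal(size=(n_classes, F, T))
X = class_means[y] + 0.5 * rng.normal(size=(len(y), F, T))

L, R = two_d_lda(X, y, n_spec=2, n_temp=3)
features = L.T @ X @ R                         # one n_spec x n_temp matrix per sample
```

Each 2-D discriminant is then the outer product of one column of L with one column of R, which ties in with the conclusion that the dominant time-frequency discriminants are outer products of spectral and temporal discriminants.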

[Figure: Eigenvalues. Panels: spectral discriminants; temporal discriminants; 2-D discriminants]

[Figure: Temporal discriminants across critical bands. Panels: first discriminant; second discriminant; third discriminant]

• preprocessing of input data for TANDEM is beneficial
  – TRAP technique (nonlinear and data-driven)
  – multi-RASTA filters (linear and “knowledge” guided)


Temporal discriminants (frequency responses of FIR RASTA filters)

[Figure: log magnitude spectrum [dB] versus modulation frequency [Hz]]

Experimental Setup

• 30 hours of phoneme-labeled (forced alignment) data from SRI (Switchboard, broadcast news,…)

• spectral vectors: 129 samples of LPC log power spectrum (12th order, 30 ms window, 10 ms step)

• temporal vectors: 2010 ms long (201 samples), labeled by the phoneme in the center, 10 ms step

• spectro-temporal matrix: 201 time samples x 129 spectral samples, 10 ms step
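The data layout described above can be illustrated with a short sketch that cuts a log-spectrogram into 201 x 129 spectro-temporal matrices, one per 10 ms step, each labelled by the phoneme at its centre frame. The random spectrogram and labels are placeholders, and the function name is hypothetical. Computing the LPC log power spectrum itself is outside the scope of the sketch.

```python
import numpy as np

def spectro_temporal_matrices(log_spec, labels, context=100):
    """Stack (2 * context + 1)-frame windows around each frame; label = centre-frame label."""
    n_frames = log_spec.shape[0]
    centers = np.arange(context, n_frames - context)
    mats = np.stack([log_spec[c - context:c + context + 1] for c in centers])
    return mats, labels[centers]

# Placeholder input: 250 frames (10 ms step) of 129 spectral samples, 40 phoneme classes.
rng = np.random.default_rng(0)
log_spec = rng.normal(size=(250, 129))
labels = rng.integers(0, 40, size=250)

mats, mat_labels = spectro_temporal_matrices(log_spec, labels)

spectral_vector = log_spec[100]       # 129 samples at one frame
temporal_vector = mats[0][:, 5]       # 201 samples (2010 ms) in one spectral channel
```

Here `mats` has shape (50, 201, 129): the first 100 and last 100 frames have no complete context and are skipped. Spectral and temporal vectors are respectively a row and a column of one such matrix.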

Motivation


CONVENTIONAL WAY

features

classifier trained on English: adapt on small amounts of task-specific data

ALTERNATIVE

data-driven features derived from English

classifier: train on small amounts of task-specific data


Frequencies around 600 Hz are the most important for decoding of nonsense syllables