DataDrivenFeatures

Data-Driven Discriminative Speech Analysis Module in DARPA GALE

Fabio Valente and Hynek Hermansky

IDIAP Research Institute, Martigny, Switzerland

Motivation

ASR requires knowledge and knowledge comes from data

– speech-specific knowledge (e.g. vocal tract organs and the way they are used in speech production, ...)

– task-specific knowledge (e.g. language and its phonotactics, environment, ...)

[Diagram: data-driven features derived from English, feeding a classifier; train on small amounts of task-specific data]

Sivadas and Hermansky, ICASSP 2004; Stolcke et al., ICASSP 2006 (and also this meeting)

PROBLEM

For some tasks, amounts of data may be limited

ONE SOLUTION

Acquire speech-specific knowledge from large amounts of American English data

TANDEM and its training

TANDEM (Hermansky, Ellis and Sharma, ICASSP 2000)

Training data for TANDEM can come from another application domain.

[Figure: WER on OGI digit data (Sivadas and Hermansky, ICASSP 2002) versus the amount of task-specific training data used for training of the HMM models, with TANDEM trained on OGI stories; axis ticks: 100 %, 20 %]

• preprocessing of input data for TANDEM is beneficial – e.g. TRAP technique (nonlinear and data-driven)

[Diagram: evidence (anything that carries the relevant information) → linear processing → TANDEM (trained NN) → posteriogram → some function of phoneme posteriors → features for HMM]
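The "some function of phoneme posteriors" step is often realized in TANDEM systems as a logarithm followed by a decorrelating PCA projection. The following is a minimal sketch under that assumption; the function name `tandem_features`, the array names, and the random stand-in posteriors are hypothetical and not taken from the slides.

```python
import numpy as np

def tandem_features(posteriors, n_components, eps=1e-10):
    """Map frame-level phoneme posteriors to decorrelated features.

    posteriors: (n_frames, n_phonemes), each row sums to 1.
    Assumed function: log followed by PCA (the slide does not name it).
    """
    log_post = np.log(posteriors + eps)           # compress the skewed posteriors
    centered = log_post - log_post.mean(axis=0)   # PCA needs zero-mean data
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return centered @ eigvecs[:, order]

# Stand-in for the outputs of a trained NN (hypothetical data).
rng = np.random.default_rng(0)
logits = rng.normal(size=(500, 40))
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
features = tandem_features(posteriors, n_components=20)
```

The resulting columns are mutually uncorrelated, which suits HMM systems with diagonal-covariance Gaussian mixtures.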

The Current Research

• Where is the information?

• study linear preprocessing using LDA

– data-driven technique

– straightforward interpretation in terms of basis functions

– spectral projections

– FIR RASTA filters

– 2-D projections
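To make the "interpretation in terms of basis functions" concrete, here is a minimal LDA sketch on synthetic labeled vectors. The function name `lda_basis`, the ridge constant, and the toy data are assumptions made for illustration; the columns of the returned matrix play the role of the basis functions onto which spectral (or temporal) vectors would be projected.

```python
import numpy as np

def lda_basis(X, y, n_basis, reg=1e-6):
    """Return the n_basis leading LDA directions (as columns) and their eigenvalues.

    X: (n_samples, dim) data vectors, y: (n_samples,) integer class labels.
    """
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((d, d))                         # within-class scatter
    Sb = np.zeros((d, d))                         # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    Sw += reg * np.eye(d)                         # small ridge keeps Sw invertible
    w, V = np.linalg.eigh(Sw)
    W = V / np.sqrt(w)                            # whitening: W.T @ Sw @ W = I
    evals, U = np.linalg.eigh(W.T @ Sb @ W)       # diagonalize Sb in whitened space
    order = np.argsort(evals)[::-1][:n_basis]
    return W @ U[:, order], evals[order]

# Toy data: three classes that differ only along the first coordinate.
rng = np.random.default_rng(0)
means = np.array([[0, 0, 0, 0, 0], [3, 0, 0, 0, 0], [6, 0, 0, 0, 0]], dtype=float)
y = np.repeat(np.arange(3), 300)
X = means[y] + rng.normal(size=(900, 5))
basis, evals = lda_basis(X, y, n_basis=2)
```

On this toy set the leading basis function points almost entirely along the first coordinate, which is where the class information lives.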

Applied earlier to the American English portion of OGI stories (about 3 hours of telephone-quality monologues from 210 adult talkers).

Hypothesis: if the analysis extracts speech-specific information, the general conclusions should hold for a different database (30 hours of RM and Switchboard from SRI).

Spectral Projections

Spectral sensitivity of projections

• Perturbation analysis

– project a Gaussian shape (σ = 250 Hz) on the first 16 spectral basis and evaluate the effect of a shift in µ by 30 Hz as a function of µ

Consistent with auditory (Bark) spectral scale

[Figure: log spectral Euclidean distance due to the shift in µ; one curve for a shift in µ constant on the Hz scale, one for a shift in µ constant on the Bark scale]
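The perturbation analysis can be sketched as follows. The LDA-derived spectral basis is not available in the slides, so a plain cosine basis is used here as a stand-in, and the frequency axis (129 bins spanning 0 to 4 kHz) is an assumption; only σ = 250 Hz, the 30 Hz shift, and the use of 16 basis functions come from the slide.

```python
import numpy as np

n_bins, f_max = 129, 4000.0                       # assumed: 129 bins spanning 0-4 kHz
freqs = np.linspace(0.0, f_max, n_bins)

def gaussian_shape(mu, sigma=250.0):
    """Gaussian-shaped spectral pattern centered at mu (Hz)."""
    return np.exp(-0.5 * ((freqs - mu) / sigma) ** 2)

# Stand-in basis: first 16 cosine functions (the slides use an LDA-derived basis).
k = np.arange(16)[:, None]
basis = np.cos(np.pi * k * (np.arange(n_bins) + 0.5) / n_bins)   # (16, n_bins)

def sensitivity(mu, shift=30.0):
    """Euclidean distance between the projections of the shape at mu and mu + shift."""
    p0 = basis @ gaussian_shape(mu)
    p1 = basis @ gaussian_shape(mu + shift)
    return np.linalg.norm(p1 - p0)

centers = np.arange(500.0, 3500.0, 100.0)         # values of mu to probe
curve = np.array([sensitivity(mu) for mu in centers])
```

A uniform stand-in basis gives no reason to expect the Bark-like behaviour reported on the slide; substituting the data-derived basis for `basis` is what the analysis calls for.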

Relative importance of spectral regions

• Hilbert envelope of the first 15 spectral basis, averaged

• Importance of each frequency region for articulation and intelligibility [Fletcher 1953]

[Figures: plotted against frequency [Hz]; axis ticks at 100, 1000 and 10000 Hz]
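The averaged Hilbert envelope can be computed as sketched below. The envelope is the magnitude of the analytic signal, obtained here with an FFT; the basis functions are hypothetical stand-ins (windowed cosines concentrated around bin 40), since the actual LDA-derived basis is not reproduced in the slides.

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal of a real 1-D array, computed via FFT."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0                              # keep DC once
    if n % 2 == 0:
        weights[n // 2] = 1.0                     # keep Nyquist once
        weights[1:n // 2] = 2.0                   # double positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spectrum * weights))

# Hypothetical stand-in basis: 15 oscillating functions concentrated near bin 40.
n_bins = 129
bins = np.arange(n_bins)
window = np.exp(-0.5 * ((bins - 40) / 10.0) ** 2)
carriers = 0.05 + 0.02 * np.arange(1, 16)         # cycles per bin, illustrative
basis = np.array([window * np.cos(2 * np.pi * f * bins) for f in carriers])

average_envelope = np.mean([hilbert_envelope(b) for b in basis], axis=0)
peak_bin = int(np.argmax(average_envelope))       # region the basis emphasizes most
```

The averaged envelope discards the oscillation of each basis function and keeps only where along the frequency axis the basis carries weight, which is what allows the comparison with Fletcher's importance curve.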

[Figure: Central 600 ms of temporal discriminants (impulse responses of FIR RASTA filters), plotted against time [ms]]

2-D discriminants

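[Figure: 25 2-D discriminants in a 5 × 5 grid, each labeled with a percentage; time axis in ms spanning about -200 to 200]

4.2%  3.7%  3.3%  3.2%  3.0%
2.8%  2.8%  2.7%  2.1%  2.0%
1.7%  1.6%  1.5%  1.4%  1.4%
1.2%  1.1%  1.1%  1.1%  1.1%
1.1%  1.1%  1.1%  1.1%  1.1%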

Multi-RASTA

[Figure: Multi-RASTA filters; time axis in ms spanning about -500 to 500; along frequency, an average and a frequency derivative over 3 critical bands; one example shown (out of 32 possible)]
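A bank of zero-mean temporal filters with varying temporal resolution can be sketched as below. The slides do not give the filter shapes, so first and second derivatives of Gaussians are assumed here, and the widths in `sigmas_ms` are hypothetical; only the roughly ±500 ms span and the 10 ms frame step come from the source.

```python
import numpy as np

frame_step_ms = 10.0
t = np.arange(-500.0, 500.0 + frame_step_ms, frame_step_ms)   # 101 taps, -500..500 ms

def gaussian_derivative_filters(sigmas_ms):
    """Zero-mean FIR filters: 1st and 2nd Gaussian derivatives for each width (assumed shapes)."""
    filters = []
    for s in sigmas_ms:
        g = np.exp(-0.5 * (t / s) ** 2)
        g1 = -t / s**2 * g                        # first derivative of the Gaussian
        g2 = (t**2 / s**4 - 1.0 / s**2) * g       # second derivative of the Gaussian
        for h in (g1, g2):
            h = h - h.mean()                      # enforce zero mean (no DC response)
            filters.append(h / np.abs(h).max())   # normalize peak amplitude
    return np.array(filters)

sigmas_ms = [8, 16, 32, 64, 130]                  # hypothetical temporal resolutions
bank = gaussian_derivative_filters(sigmas_ms)     # shape (10, 101)

def apply_bank(band_trajectory, bank):
    """Filter one critical-band log-energy trajectory with every filter in the bank."""
    return np.array([np.convolve(band_trajectory, h, mode="same") for h in bank])

# A constant trajectory (e.g. a fixed channel offset in the log domain) is suppressed.
out = apply_bank(np.ones(300), bank)
```

Because every filter is zero-mean, a constant offset in a log-energy trajectory produces no output away from the signal edges, which is one plausible reason for the robustness to channel mismatch.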

                     matched training and test   mismatched channel
conventional (PLP)   5.2 %                       13.5 %
Multi-RASTA          3.7 %                       3.8 %

Conclusions

• Data-driven (ANN-based TANDEM) feature extraction module as a means of implementing speech-specific (task-independent) knowledge

– aim: reduce the need for large task-specific acoustic training data

• LDA-guided pre-processing: results qualitatively consistent for different databases (OGI Stories and forced-aligned SRI Broadcast News and Switchboard data)

– optimality of Bark-like frequency scale

– need for larger (about 500 ms) temporal context in feature extraction

– dominant time-frequency discriminants as outer products of spectral discriminants and temporal discriminants

• Linear pre-processing of data for TANDEM

– multi-RASTA (projections on zero-mean, variable temporal resolution basis)

• demonstrated improvements on small vocabulary task

2-D Linear Discriminant Analysis (Ye, Janardan and Li, NIPS 2005)

• Initialize temporal basis R

• Project spectro-temporal matrix on R

• Estimate spectral basis L by LDA

• Project spectro-temporal matrix on L

• Estimate temporal basis R by LDA
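The alternating steps of 2-D LDA can be sketched as follows, repeating the project-and-estimate cycle a fixed number of times. This is an illustrative sketch on synthetic data, not the authors' implementation: the function names, the identity-based initialization of R, the ridge term, and the iteration count are all assumptions.

```python
import numpy as np

def _lda_directions(Sw, Sb, n_dirs, reg=1e-6):
    """Leading solutions of Sb v = lambda Sw v, found by whitening Sw."""
    Sw = Sw + reg * np.trace(Sw) / len(Sw) * np.eye(len(Sw))   # ridge, assumed
    w, V = np.linalg.eigh(Sw)
    W = V / np.sqrt(w)                            # W.T @ Sw @ W = I
    evals, U = np.linalg.eigh(W.T @ Sb @ W)
    order = np.argsort(evals)[::-1][:n_dirs]
    return W @ U[:, order]

def two_d_lda(mats, labels, n_spec, n_temp, n_iter=5):
    """mats: (n, T, F) spectro-temporal matrices. Returns L (F, n_spec) and R (T, n_temp)."""
    n, T, F = mats.shape
    classes = np.unique(labels)
    grand = mats.mean(axis=0)
    cmeans = {c: mats[labels == c].mean(axis=0) for c in classes}
    # Within-class deviations and count-weighted between-class deviations.
    within = np.concatenate([mats[labels == c] - cmeans[c] for c in classes])
    between = np.stack([np.sqrt((labels == c).sum()) * (cmeans[c] - grand)
                        for c in classes])
    R = np.eye(T)[:, :n_temp]                     # initialize temporal basis R (assumed form)
    for _ in range(n_iter):
        # Project on R, then estimate spectral basis L by LDA.
        Pw = np.einsum("ntf,tk->nkf", within, R)
        Pb = np.einsum("ntf,tk->nkf", between, R)
        L = _lda_directions(np.einsum("nkf,nkg->fg", Pw, Pw),
                            np.einsum("nkf,nkg->fg", Pb, Pb), n_spec)
        # Project on L, then estimate temporal basis R by LDA.
        Qw = np.einsum("ntf,fk->ntk", within, L)
        Qb = np.einsum("ntf,fk->ntk", between, L)
        R = _lda_directions(np.einsum("ntk,nsk->ts", Qw, Qw),
                            np.einsum("ntk,nsk->ts", Qb, Qb), n_temp)
    return L, R

# Synthetic two-class data: class 1 adds a rank-one time-frequency pattern to noise.
rng = np.random.default_rng(0)
T, F = 9, 7
f_pat = np.array([0, 0, 1, 1, 0, 0, 0], dtype=float) / np.sqrt(2.0)
labels = np.repeat(np.arange(2), 100)
mats = rng.normal(size=(200, T, F)) + labels[:, None, None] * 1.5 * np.outer(np.ones(T), f_pat)
L, R = two_d_lda(mats, labels, n_spec=2, n_temp=2)
```

Each half-step is an ordinary LDA on a much smaller scatter matrix (F × F or T × T) than a direct LDA on the vectorized T·F-dimensional matrix would require, which is the practical appeal of the 2-D formulation.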

[Figure: Eigenvalues of the spectral discriminants, temporal discriminants, and 2-D discriminants]

[Figure: Temporal discriminants across critical bands: first, second, and third discriminant]

• preprocessing of input data for TANDEM is beneficial

– TRAP technique (nonlinear and data-driven)

– multi-RASTA filters (linear and “knowledge” guided)


[Figure: Temporal discriminants (frequency responses of FIR RASTA filters), plotted against modulation frequency [Hz]]

Experimental Setup

• 30 hours of phoneme-labeled (forced alignment) data from SRI (Switchboard, broadcast news,…)

• spectral vectors: 129 samples of LPC log power spectrum (12th order, 30 ms window, 10 ms step)

• temporal vectors: 2010 ms long (201 samples), labeled by the phoneme in the center, 10 ms step

• spectro-temporal matrix: 201 time samples x 129 spectral samples, 10 ms step
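The data layout described above can be sketched as follows: 201-frame windows are cut around each frame and labeled by the phoneme at the center. The function name, the random stand-in spectrogram and labels, and the choice to skip frames that lack full context at utterance edges are assumptions.

```python
import numpy as np

def spectro_temporal_matrices(spectrogram, frame_labels, context=100):
    """Cut (2 * context + 1)-frame windows centered on each frame with full context.

    spectrogram: (n_frames, n_bins) log power spectra at a 10 ms step.
    Returns windows (n_windows, 2 * context + 1, n_bins) and the center-frame labels.
    Frames closer than `context` to either edge are skipped (an assumption).
    """
    n_frames = len(spectrogram)
    centers = np.arange(context, n_frames - context)
    offsets = np.arange(-context, context + 1)
    windows = spectrogram[centers[:, None] + offsets[None, :]]
    return windows, frame_labels[centers]

rng = np.random.default_rng(0)
spectrogram = rng.normal(size=(260, 129))         # stand-in for LPC log power spectra
frame_labels = rng.integers(0, 40, size=260)      # stand-in phoneme labels
mats, labels = spectro_temporal_matrices(spectrogram, frame_labels)

spectral_vectors = mats[:, 100, :]                # center frame: 129 spectral samples
temporal_vectors = mats[:, :, 64]                 # trajectory of one spectral sample: 201 samples
```

Spectral vectors, temporal vectors, and spectro-temporal matrices are then three views of the same windows, all sharing the label of the center frame.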

Motivation

CONVENTIONAL WAY

[Diagram: features, feeding a classifier trained on English; adapt on small amounts of task-specific data]

ALTERNATIVE

[Diagram: data-driven features derived from English, feeding a classifier; train on small amounts of task-specific data]

Frequencies around 600 Hz are the most important for decoding of nonsense syllables
