Covariation and weighting of harmonically decomposed streams for ASR


Introduction

Pitch-scaled harmonic filter

Recognition experiments

Results

Conclusion

Motivation and aims

• Most speech sounds are either voiced or unvoiced, which have very different properties:

– voiced: quasi-periodic signal from phonation

– unvoiced: aperiodic signal from turbulence noise

• Do these properties allow humans to recognize speech in noise?

Maybe we can use this information to help ASR by computing separate features for the two parts.

• Are their two contributions complementary?

http://www.ee.surrey.ac.uk/Personal/P.Jackson/Columbo/ INTRODUCTION


Voiced and unvoiced parts of a speech signal

[Figure: production of /z/, showing the speech waveform s(n), the periodic waveform and the aperiodic waveform]


Pitch-scaled harmonic filter

[Figure: PSHF block diagram. Blocks: pitch extraction, pitch optimisation, PSHF, time shifting, re-splicing. Labels: f0raw, optimised pitch f0opt, Nopt, û(n), v̂(n)]
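The slides do not spell out the filter itself, so the following is only a toy sketch of the general idea of pitch-scaled harmonic filtering, not the PSHF as implemented in the project. It assumes a rectangular analysis frame that spans exactly a whole number of pitch periods, so that the harmonics of the pitch fall on every `periods_per_frame`-th DFT bin. The function name `harmonic_split` and all parameter values are hypothetical.

```python
import numpy as np

def harmonic_split(frame, periods_per_frame):
    """Split one frame into periodic and aperiodic estimates (toy version).

    Assumes the frame spans exactly `periods_per_frame` pitch periods, so the
    pitch harmonics sit on DFT bins that are multiples of that number.
    Bin 0 (DC) is kept with the periodic part, which is a simplification.
    """
    spectrum = np.fft.rfft(frame)
    bins = np.arange(spectrum.size)
    is_harmonic = bins % periods_per_frame == 0
    # Keep only the harmonic bins and transform back to the time domain.
    periodic = np.fft.irfft(np.where(is_harmonic, spectrum, 0.0), n=frame.size)
    # Everything that is left over is treated as the aperiodic estimate.
    aperiodic = frame - periodic
    return periodic, aperiodic

# Synthetic frame: two harmonics of a 40-sample pitch period plus white noise.
period = 40
n = np.arange(4 * period)
voiced = np.sin(2 * np.pi * n / period) + 0.5 * np.sin(4 * np.pi * n / period)
noise = 0.1 * np.random.default_rng(0).normal(size=n.size)
periodic, aperiodic = harmonic_split(voiced + noise, periods_per_frame=4)
```

In this toy setting the two estimates always sum back to the input frame, and a noise-free periodic input is returned unchanged as the periodic estimate. The pitch optimisation, time shifting and re-splicing stages named in the block diagram are not modelled here.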

Decomposition example (waveforms)

[Figure: waveform panels: Original, Periodic, Aperiodic]

Decomposition example (spectrograms)

[Figure: spectrogram panels: Original, Periodic, Aperiodic]

Decomposition example (MFCC specs.)

[Figure: MFCC panels: Original, Periodic, Aperiodic]


Speech database: Aurora 2.0

• From TIdigits database of connected English digit strings (male & female speakers), filtered with G.712 at 8 kHz.

       Data type                   Signal-to-Noise Ratio (dB)
TRAIN  clean-condition             clean
TRAIN  multi-condition             20 15 10 5
TEST   set A (same noises)         20 15 10 5 0 -5
TEST   set B (different noises)    20 15 10 5 0 -5
TEST   set C (different channel)   20 15 10 5 0 -5


Description of the experiments

• Baseline experiment [base]:

– standard parameterisation of the original waveforms (i.e., MFCC, +Δ, +ΔΔ)

• PCA experiments [pca26, pca78, pca13 and pca39]:

– decorrelation of the feature vectors, and reduction of the number of coefficients

• Split experiments [split, split1]:

– adjustment of stream weights (periodic vs. aperiodic)

Caveat: pitch values were derived from clean speech files, for the entire database!
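To make the PCA step concrete, here is a minimal sketch of decorrelating and reducing a concatenated two-stream feature vector. It is an illustration under assumptions, not the project's front end: the 13 coefficients per stream, the random placeholder data and the function name `pca_transform` are all hypothetical.

```python
import numpy as np

def pca_transform(features, n_keep):
    """Project features (frames x dims) onto the top n_keep principal components."""
    centred = features - features.mean(axis=0)
    cov = np.cov(centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_keep]    # indices of the largest first
    return centred @ eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(0)
# Placeholder data standing in for per-stream features (assumed 13 dims each).
periodic = rng.normal(size=(500, 13))
aperiodic = 0.5 * periodic + rng.normal(size=(500, 13))  # correlated with periodic
stacked = np.hstack([periodic, aperiodic])               # concatenated, 26 dims

decorrelated, variances = pca_transform(stacked, n_keep=26)  # decorrelation only
reduced, _ = pca_transform(stacked, n_keep=13)               # plus reduction
```

After the transform the covariance of `decorrelated` is diagonal, with the component variances on the diagonal in decreasing order. Keeping fewer components drops the directions with the least variance.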

Parameterisations

[Figure: block diagrams of the parameterisations. BASE: waveform, MFCC, +Δ, +Δ2, features. PCA26, PCA78, PCA13 and PCA39: built from the blocks PSHF, MFCC, +Δ, +Δ2, cat and PCA. SPLIT and SPLIT1: built from the blocks PSHF, MFCC, +Δ, +Δ2 and cat.]

Full-sized PCA results

Word Error Rate (%)
         clean   multi   overall
base     47.4    21.7    34.6
pca26    33.8    11.4    22.6
pca78    42.7    12.8    27.7
pca13    28.3    13.0    20.7
pca39    30.3    14.5    22.4

Variance of Principal Components

[Figure: panels PCA26 and PCA39; legend: • clean, + multi]

PCA26 experiment’s results

[Figure: panels CLEAN and MULTI]

Summary of best PCA results

Word Error Rate (%)
         clean   multi   overall
base     47.4    21.7    34.6
pca26    29.0    11.4    20.2
pca78    38.3    12.1    25.2
pca13    27.6    12.6    20.1
pca39    29.3    12.5    20.9

Split experiment’s results

Sample Split results

Word Error Rate (%)
              clean   multi   overall
base          47.4    21.7    34.6
split (=0)    62.9    44.3    53.6
split (=1)    28.5    11.7    20.1
split (=2)    22.7    11.5    17.1

Note: same value of stream weights used in training as in testing, for Split.
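The slides do not state how the stream weights enter the model. One common convention in multi-stream HMMs (used, for example, in HTK) is to raise each stream's output probability to its weight, which becomes a weighted sum in the log domain. The sketch below assumes that convention with a single diagonal Gaussian per stream. The function names, the toy state and the weight values are hypothetical and are not the settings used in the experiments.

```python
import numpy as np

def diag_gauss_loglik(x, mean, var):
    """Log-density of x under a diagonal-covariance Gaussian."""
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

def weighted_stream_loglik(x_per, x_aper, state, w_per, w_aper):
    """Weighted sum of the per-stream log-likelihoods for one HMM state."""
    ll_per = diag_gauss_loglik(x_per, *state["periodic"])
    ll_aper = diag_gauss_loglik(x_aper, *state["aperiodic"])
    return w_per * ll_per + w_aper * ll_aper

# Toy state with (mean, variance) per stream; values are illustrative only.
state = {
    "periodic": (np.zeros(3), np.ones(3)),
    "aperiodic": (np.ones(3), 2.0 * np.ones(3)),
}
x_per = np.array([0.1, -0.2, 0.3])
x_aper = np.array([1.2, 0.8, 1.0])

equal = weighted_stream_loglik(x_per, x_aper, state, w_per=1.0, w_aper=1.0)
periodic_only = weighted_stream_loglik(x_per, x_aper, state, w_per=1.0, w_aper=0.0)
```

Setting one weight to zero makes the state score ignore that stream entirely, which is one way to read the large differences between the sample Split rows.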

Split1 experiment’s results

Summary of PCA & Split results

          Word Error Rate (%)         WER reduction (%)
          clean   multi   overall     abs.    rel.
base      47.4    21.7    34.6         0.0     0.0
pca26     29.0    11.4    20.2        14.4    41.6
pca78     38.3    12.1    25.2         9.4    27.2
pca13     27.6    12.6    20.1        14.5    41.9
pca39     29.3    12.5    20.9        13.7    39.6
split     22.6    11.0    16.8        17.8    51.4
split1    21.0    10.9    16.0        18.6    53.8
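The absolute and relative figures in the summary table are consistent with reductions measured against the baseline's overall WER of 34.6%. A small sketch of that arithmetic follows; the function name `wer_reduction` is hypothetical, and the input values are copied from the table.

```python
def wer_reduction(baseline, wer):
    """Return (absolute, relative) WER reduction in percent, rounded to 0.1."""
    absolute = baseline - wer
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)

baseline_overall = 34.6
overall_wer = {"pca26": 20.2, "pca78": 25.2, "pca13": 20.1,
               "pca39": 20.9, "split": 16.8, "split1": 16.0}

for name, wer in overall_wer.items():
    absolute, relative = wer_reduction(baseline_overall, wer)
    print(f"{name:7s} abs. {absolute:5.1f}  rel. {relative:5.1f}")
```

For example, split1 gives 34.6 - 16.0 = 18.6 points absolute, which is 18.6 / 34.6 ≈ 53.8% relative.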


Conclusions

• PSHF module split Aurora’s speech waveforms into two synchronous streams (periodic and aperiodic)

– large improvements over the single-stream Baseline

• Split was better than all PCA combinations:

– PCA26/13 better than PCA78/39, and PCA13 best

– Split1 marginally better than Split

• Periodic speech segments give robustness to noise.

Further work

– Modeling: how best to combine the streams?

– LVCSR: evaluate front end on TIMIT (phone recognition).

– Robust pitch tracking

COLUMBO PROJECT: Harmonic decomposition applied to ASR

Philip J.B. Jackson 1 <p.jackson@surrey.ac.uk>

David M. Moreno 2 <davidm@talp.upc.es>

Javier Hernando 2 <javier@talp.upc.es>

Martin J. Russell 3 <m.j.russell@bham.ac.uk>


http://www.ee.surrey.ac.uk/Personal/P.Jackson/Columbo/
