
Automatic speech recognition on the articulation index corpus


Slide 1: Automatic speech recognition on the articulation index corpus

Guy J. Brown and Amy Beeston
Department of Computer Science, University of Sheffield
[email protected]

EPSRC Perceptual Constancy Steering Group Meeting | 19th May 2010

Slide 2: Aims

• Eventual aim is to develop a ‘perceptual constancy’ front-end for automatic speech recognition (ASR).

• Should be compatible with Watkins et al. findings, but also validated on a ‘real world’ ASR task:
  – wider vocabulary
  – range of reverberation conditions
  – variety of speech contexts
  – naturalistic speech, rather than interpolated stimuli
  – consider phonetic confusions in reverberation in general

• Initial ASR studies use the articulation index corpus

• Aim to compare human performance (Amy’s experiment) and machine performance on the same task

Slide 3: Articulation index (AI) corpus

• Recorded by Jonathan Wright (University of Pennsylvania)

• Intended for speech recognition in noise experiments similar to those of Fletcher.

• Suggested to us by Hynek Hermansky; utterances are similar to those used by Watkins:
  – American English
  – Target syllables are mostly nonsense, but some correspond to real words (including “sir” and “stir”)
  – Target syllables are embedded in a context sentence drawn from a limited vocabulary

Slide 4: Grammar for Amy’s subset of AI corpus

$cw1 = YOU | I | THEY | NO-ONE | WE | ANYONE | EVERYONE | SOMEONE | PEOPLE;

$cw2 = SPEAK | SAY | USE | THINK | SENSE | ELICIT | WITNESS | DESCRIBE | SPELL | READ | STUDY | REPEAT | RECALL | REPORT | PROPOSE | EVOKE | UTTER | HEAR | PONDER | WATCH | SAW | REMEMBER | DETECT | SAID | REVIEW | PRONOUNCE | RECORD | WRITE | ATTEMPT | ECHO | CHECK | NOTICE | PROMPT | DETERMINE | UNDERSTAND | EXAMINE | DISTINGUISH | PERCEIVE | TRY | VIEW | SEE | UTILIZE | IMAGINE | NOTE | SUGGEST | RECOGNIZE | OBSERVE | SHOW | MONITOR | PRODUCE;

$cw3 = ONLY | STEADILY | EVENLY | ALWAYS | NINTH | FLUENTLY | PROPERLY | EASILY | ANYWAY | NIGHTLY | NOW | SOMETIME | DAILY | CLEARLY | WISELY | SURELY | FIFTH | PRECISELY | USUALLY | TODAY | MONTHLY | WEEKLY | MORE | TYPICALLY | NEATLY | TENTH | EIGHTH | FIRST | AGAIN | SIXTH | THIRD | SEVENTH | OFTEN | SECOND | HAPPILY | TWICE | WELL | GLADLY | YEARLY | NICELY | FOURTH | ENTIRELY | HOURLY;

$test = SIR | STIR | SPUR | SKUR;

( !ENTER $cw1 $cw2 $test $cw3 !EXIT )

Audio demos
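
To make the utterance structure concrete, the following minimal sketch (ours, not from the slides) samples sentences from this grammar; the $cw2 and $cw3 lists are truncated here for brevity:

```python
import random

# Word lists from the grammar above ($cw2/$cw3 abbreviated; the full
# grammar has 49 verbs and 43 adverbs).
CW1 = ["YOU", "I", "THEY", "NO-ONE", "WE", "ANYONE", "EVERYONE", "SOMEONE", "PEOPLE"]
CW2 = ["SPEAK", "SAY", "USE", "THINK", "HEAR"]
CW3 = ["ONLY", "STEADILY", "EVENLY", "ALWAYS", "AGAIN"]
TEST = ["SIR", "STIR", "SPUR", "SKUR"]

def sample_sentence(rng=random):
    """Draw one sentence of the form $cw1 $cw2 $test $cw3."""
    return " ".join([rng.choice(CW1), rng.choice(CW2), rng.choice(TEST), rng.choice(CW3)])

print(sample_sentence())  # e.g. "WE HEAR STIR AGAIN"
```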

Slide 5: ASR system

• HMM-based phone recogniser
  – implemented in HTK
  – monophone models
  – 20 Gaussian mixtures per state
  – adapted from scripts by Tony Robinson/Dan Ellis

• Bootstrapped by training on TIMIT, then a further 10-12 iterations of embedded training on the AI corpus

• Word-level transcripts in the AI corpus expanded to phones using the CMU pronunciation dictionary (a minimal sketch of this expansion follows below)

• All of the AI corpus used for training, except the 80 utterances in Amy’s experimental stimuli
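
The dictionary expansion can be sketched as below. This is a hypothetical stand-in: the real system reads the full CMU pronunciation dictionary, whereas the entries here are a tiny hand-copied subset in ARPAbet notation with stress markers removed:

```python
# Tiny illustrative subset of the CMU pronunciation dictionary (ARPAbet,
# stress markers removed); the actual system used the full dictionary file.
CMUDICT = {
    "YOU":   ["Y", "UW"],
    "SPEAK": ["S", "P", "IY", "K"],
    "SIR":   ["S", "ER"],
    "STIR":  ["S", "T", "ER"],
    "ONLY":  ["OW", "N", "L", "IY"],
}

def words_to_phones(words):
    """Expand a word-level transcript to a flat phone-level transcript."""
    return [phone for w in words for phone in CMUDICT[w]]

print(words_to_phones("YOU SPEAK SIR ONLY".split()))
# ['Y', 'UW', 'S', 'P', 'IY', 'K', 'S', 'ER', 'OW', 'N', 'L', 'IY']
```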

Slide 6: MFCC features

• Baseline system trained using mel-frequency cepstral coefficients (MFCCs)
  – 12 MFCCs + energy + delta + acceleration (39 features per frame in total)
  – cepstral mean normalization
  (a feature-extraction sketch follows below)

• Baseline system performance on Amy’s clean subset of the AI corpus (80 utterances, no reverberation):
  – 98.75% context words correct
  – 96.25% test words correct
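
A rough sketch of a comparable 39-dimensional front-end using librosa rather than HTK; the exact HTK configuration is not given on the slide, so details below (window settings, and letting the 0th cepstral coefficient stand in for the energy term) are assumptions:

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file

# 13 coefficients: c0 standing in for HTK's energy term, plus 12 MFCCs
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta = librosa.feature.delta(mfcc)             # delta coefficients
accel = librosa.feature.delta(mfcc, order=2)    # acceleration coefficients
feats = np.vstack([mfcc, delta, accel])         # 39 features per frame

# Cepstral mean normalization: subtract each feature's per-utterance mean
feats -= feats.mean(axis=1, keepdims=True)
print(feats.shape)  # (39, n_frames)
```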

Slide 7: Amy’s experiment

• Amy’s first experiment used 80 utterances: 20 instances each of the test words “sir”, “skur”, “spur” and “stir”

• Overall confusion rate was controlled by lowpass filtering at 1, 1.5, 2, 3 and 4 kHz

• Same reverberation conditions as in Watkins et al. experiments

• Stimuli presented to the ASR system as in Amy’s human studies

                 Test 0.32 m   Test 10 m
Context 0.32 m   near-near     near-far
Context 10 m     far-near      far-far
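
The slides do not specify the filter used to control the confusion rate, so the sketch below assumes a zero-phase Butterworth lowpass filter (type and order are illustrative):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass(x, sr, cutoff_hz, order=8):
    """Zero-phase Butterworth lowpass filter (design is an assumption;
    the slides only give the cutoff frequencies)."""
    b, a = butter(order, cutoff_hz / (sr / 2), btype="low")
    return filtfilt(b, a, x)

sr = 16000
x = np.random.randn(sr)                          # stand-in for 1 s of speech
for cutoff in [1000, 1500, 2000, 3000, 4000]:    # cutoffs from the experiment
    y = lowpass(x, sr, cutoff)
```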

Slide 8: Baseline ASR: context words

• Performance falls as the cutoff frequency decreases

• Performance falls as level of reverberation increases

• Near context substantially better than far context at most cutoffs

Slide 9: Baseline ASR: test words

No particular pattern of confusions in the 2 kHz near-near case, but more frequent skur/spur/stir errors

Slide 10: Baseline ASR: human comparison

• Data for 4 kHz cutoff
• Even mild reverberation (near-near) causes substantial errors in the baseline ASR system
• Human listeners exhibit compensation in the AIC task; the baseline ASR system does not (as expected)

[Bar charts: percentage error for near and far test words; human data (20 subjects) vs. baseline ASR system]

Slide 11: Training on auditory features

• 80 channels between 100 Hz and 8 kHz

• 15 DCT coefficients + delta + acceleration (45 features per frame)

• Efferent attenuation set to zero for initial tests

• Performance of auditory features on Amy’s clean subset of the AI corpus (80 utterances, no reverberation):
  – 95% context words correct
  – 97.5% test words correct

[Block diagram: Stimulus → outer/middle ear (OME) → DRNL filterbank → hair cell → frame & DCT → recogniser; the efferent system feeds attenuation (ATT) back to the auditory periphery]
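
Assuming the peripheral stages (OME, DRNL, hair cell) have already produced an 80-channel matrix of frame energies, the frame & DCT stage might look like the sketch below; the delta estimate is a simple stand-in for whatever regression window the actual system used:

```python
import numpy as np
from scipy.fftpack import dct

def auditory_features(fbank, n_ceps=15):
    """fbank: (80, n_frames) framed auditory filterbank energies
    (assumed precomputed by the OME/DRNL/hair-cell stages).
    Returns (45, n_frames): 15 DCT coefficients + delta + acceleration."""
    cep = dct(fbank, type=2, norm="ortho", axis=0)[:n_ceps]
    delta = np.gradient(cep, axis=1)    # simple delta stand-in
    accel = np.gradient(delta, axis=1)  # simple acceleration stand-in
    return np.vstack([cep, delta, accel])

fbank = np.abs(np.random.randn(80, 100))  # stand-in filterbank output
print(auditory_features(fbank).shape)     # (45, 100)
```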

Slide 12: Auditory features: context words

• Take a big hit in performance using auditory features
  – saturation in the auditory nerve (AN) is likely to be an issue
  – mean normalization
• Performance falls sharply with decreasing cutoff
• As expected, best performance in the least reverberated conditions

Slide 13: Auditory features: test words

Slide 14: Effect of efferent suppression

• Not yet used the full closed-loop model in ASR experiments
• An indication of likely performance was obtained by increasing the efferent attenuation in ‘far’ context conditions

[Block diagram repeated from slide 11: stimulus → auditory periphery (OME, DRNL, hair cell) → frame & DCT → recogniser, with efferent attenuation (ATT) fed back]
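
Since these initial experiments apply the same fixed attenuation in all bands rather than a closed efferent loop (see slide 15), the manipulation reduces to a uniform gain on the channel inputs; a minimal sketch under that assumption:

```python
import numpy as np

def apply_efferent_attenuation(channels, att_db):
    """Open-loop stand-in for efferent suppression: attenuate every
    channel input by a fixed amount, as in these initial experiments
    (the full model would set this via a closed feedback loop)."""
    gain = 10.0 ** (-att_db / 20.0)  # 10 dB attenuation -> gain ~= 0.316
    return channels * gain

x = np.random.randn(80, 1000)                # stand-in multi-channel input
x_far = apply_efferent_attenuation(x, 10.0)  # 'far' context condition
```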

Slide 15: Auditory features: human comparison

• 4 kHz cutoff
• Efferent suppression effective for mild reverberation
• Detrimental to the far test word
• Currently unable to model the human data, but:
  – not closed loop
  – same efferent attenuation in all bands

[Bar charts: percentage error for near and far test words; human data (20 subjects) vs. no efferent suppression vs. 10 dB efferent suppression]

Slide 16: Confusion analysis: far-near condition

• Without efferent attenuation “skur”, “spur” and “stir” are frequently confused as “sir”

• These confusions are reduced by more than half when 10 dB of efferent attenuation is applied

far-near, 0 dB attenuation (rows: presented test word; columns: recognised word)

         SIR  SKUR  SPUR  STIR
SIR       12     3     3     2
SKUR      11     5     2     2
SPUR      10     2     4     4
STIR       7     1     7     5

far-near, 10 dB attenuation

         SIR  SKUR  SPUR  STIR
SIR        9     2     2     7
SKUR       3    10     4     3
SPUR       5     4     4     7
STIR       3     2     3    12
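
Matrices like these are tallied directly from (presented, recognised) word pairs; a minimal sketch:

```python
import numpy as np

WORDS = ["SIR", "SKUR", "SPUR", "STIR"]
IDX = {w: i for i, w in enumerate(WORDS)}

def confusion_matrix(pairs):
    """Tally (presented, recognised) pairs into a 4x4 matrix:
    rows are presented test words, columns recognised words."""
    m = np.zeros((4, 4), dtype=int)
    for presented, recognised in pairs:
        m[IDX[presented], IDX[recognised]] += 1
    return m

# e.g. count how often non-"sir" words were recognised as "sir"
m = confusion_matrix([("SKUR", "SIR"), ("SPUR", "SPUR"), ("STIR", "SIR")])
print(m[1:, 0].sum())  # 2 "sir" confusions
```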

Slide 17: Confusion analysis: far-far condition

• Again “skur”, “spur” and “stir” are commonly reported as “sir”

• These confusions are somewhat reduced by 10 dB efferent attenuation, but:
  – the gain is outweighed by more frequent “skur”/“spur”/“stir” confusions

• Efferent attenuation recovers the dip in the temporal envelope but not cues to /k/, /p/ and /t/

far-far, 0 dB attenuation (rows: presented test word; columns: recognised word)

         SIR  SKUR  SPUR  STIR
SIR       18     0     1     1
SKUR      14     1     3     2
SPUR      12     1     5     2
STIR      12     0     3     5

far-far, 10 dB attenuation

         SIR  SKUR  SPUR  STIR
SIR       13     2     1     4
SKUR      11     3     1     5
SPUR       6     5     2     7
STIR      10     2     1     7

Slide 18: Summary

• ASR framework in place for the AI corpus experiments

• We can compare human and machine performance on the AIC task

• Reasonable performance from baseline MFCC system

• Need to address shortfall in performance when using auditory features

• Haven’t yet tried the full within-channel model as a front end