Temporal masking of spectrally reduced speech: psychoacoustical experiments and links with ASR

Frédéric Berthommier and Angélique Grosgeorges
ICP, 46 av. Félix Viallet, Grenoble, France
email: (bertho,ggeorges)@icp.inpg.fr


Introduction and motivations

We used the experimental paradigm proposed by [Shannon et al., 95], from which we developed a series of experiments. Following (Horii et al., 1971), they varied the spectro-temporal resolution of speech utterances. The stimuli were composed of white noise modulated by the filtered envelopes extracted in 4 subbands. The task was the identification of the consonant in VCVCV utterances, among 16 French consonants. We then evaluated the transmission of their phonetic features: voicing, mode (manner) and place of articulation.

We extend this paradigm by masking this residual signal with stationary [Lorenzi et al., 99] or non-stationary noises [Grosgeorges et al., 00]. In this framework, we replace the pair (local SNR/acoustic representation), analysed in terms of identification rate, with another pair (global SNR/phonetic representation), analysed in terms of feature transmission.

Then, we focus on the problem of acoustic phonetic decoding in noise, and on the impact of the noise on the features grounding the classification process. In other words, we postulate the existence of an intermediate level preceding the phonetic categorisation, and we study its properties.


Introduction and motivations (2)

We therefore expect a set of complementary results from this approach: informative about the link between auditory and speech processes, useful for CASA, and informative for developing ASR for noisy and distorted speech.

For RESPITE, the goal of this project is to set up a plausible multi-stream model in which the phonetic identification of consonants is grounded in the extraction of these three phonetic characteristics (voicing, place and mode), carried out in specialised modules with different spectro-temporal resolutions. A pre-classification according to this phonetic representation could be more robust than direct classification, the streams easier to weight according to their information content, and the fusion process easier to control. Remark: vowel identification is considered to be well modelled in current implementations.

Moreover, the visual modality can be integrated in this model easily for the same reason: the audio-visual complementarity is optimally represented.


The Shannon et al. experiment

Spectral degradation: the signal was divided into one, two, three or four frequency bands.

Temporal degradation: the amplitude envelope extracted from each band was low-pass filtered with cutoff frequencies Fc: 16, 50, 160 or 500 Hz.

The identification of 3 features (voicing, manner and place) for 16 French consonants « a/C/a » was evaluated by the classical information transmission analysis (Miller and Nicely, 1955).

[Figure: information received (%), from 0 to 100, for voicing, manner and place as a function of the number of bands (1 to 4); Fc = 16 Hz or 50, 160, 500 Hz]

The main conclusion is: despite the great spectro-temporal reduction, voicing and manner are remarkably well transmitted by the residual envelope, i.e. by the temporal components of the speech.

Some questions arise: how is this residue processed? How can it be used to increase robustness?

... One way is to mask it and to analyse what happens.


Factorial design of the masking experiment

Factor n°1: The spectral resolution was constant at 4 frequency bands, and the envelope was filtered with cutoff frequency Fc at 10 or 500Hz.

Factor n°2: We added different temporal maskers in order to selectively degrade the different components of the residual signal:

(1) in order to mask the coarse component of temporal information, we used a low frequency AM (amplitude modulation < 8Hz) white noise applied in each subband, for all maskers.

(2) to degrade the residual spectral information, we decorrelated the low frequency AM across the 4 frequency bands.

(3) to mask the fine temporal information, we re-modulated the low frequency AM of the masker at 100Hz.
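The three masker manipulations above can be sketched as amplitude modulators, one per subband. This is only an illustrative sketch: the function names, the brick-wall FFT low-pass, the scaling of the modulators to [0, 1] and the full-depth sinusoidal 100 Hz re-modulation are assumptions, not details given in the slides. The modulators would still have to be applied to white noise in each subband and mixed with the speech at the chosen SNR.

```python
import numpy as np

FS = 11025  # sampling rate quoted on the signal-processing slide


def lowpass_noise(n, cutoff_hz, rng):
    """White Gaussian noise with all FFT bins above cutoff_hz set to zero."""
    spec = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / FS)
    spec[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spec, n)


def slow_am(n, rng, cutoff_hz=8.0):
    """Non-negative modulator with energy below cutoff_hz, scaled to [0, 1]."""
    m = lowpass_noise(n, cutoff_hz, rng)
    m -= m.min()
    return m / m.max()


def masker_modulators(n, level, n_bands=4, seed=0):
    """Return an (n_bands, n) array of masker modulators for level 1, 2 or 3."""
    rng = np.random.default_rng(seed)
    if level == 1:
        # (1): the same low-frequency AM in every subband (correlated)
        mods = np.tile(slow_am(n, rng), (n_bands, 1))
    else:
        # (1) + (2): an independent low-frequency AM in each subband
        mods = np.stack([slow_am(n, rng) for _ in range(n_bands)])
    if level == 3:
        # (1) + (2) + (3): re-modulate the slow AM at 100 Hz (assumed sinusoidal)
        t = np.arange(n) / FS
        mods = mods * 0.5 * (1.0 + np.sin(2.0 * np.pi * 100.0 * t))
    return mods
```

With the same seed, levels 2 and 3 share the same slow modulators, so the only difference between them is the 100 Hz component.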


Factorial design (2)

              Level 1            Level 2            Level 3
Factor n°2    White noise:       White noise:       White noise:
              - correlated       - decorrelated     - decorrelated
                                                    - 100 Hz
Factor n°1
Fc=500 Hz     (1)                (1) + (2)          (1) + (2) + (3)
Fc=10 Hz      (1)                (1) + (2)          (1) + (2) + (3)

Task: forced-choice consonant identification in a quiet room, without feedback

Subjects: 6 normal-hearing listeners without specific training; however, all subjects had experience in psychoacoustical experiments

Stimuli: 384 stimuli covering the 6 different conditions, presented in random order


Speech and signal processing

16 utterances aCaCa, with C = {b, d, g, v, Z, z, m, n, r, l, p, t, k, f, s, S}

Consonant features:
- voicing: voiced = {b, d, g, v, Z, z, m, n, r, l} / voiceless = {p, t, k, f, s, S}
- manner: fricative + liquid = {f, s, S, v, Z, z, r, l} / occlusive + nasal = {p, t, k, b, d, g, m, n}
- place: labial = {p, b, f, v, m} / dental = {t, d, s, z, n, l} / palatal = {k, g, Z, S, r}
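The feature classes listed above can be written down as a small lookup table. This makes it easy to check that each feature partitions the 16 consonants, and to collapse a consonant confusion table into a feature confusion table. The names `FEATURES`, `feature_value` and `collapse` are illustrative. The place labels follow the usual phonetic convention (p, b, f, v, m labial; t, d, s, z, n, l dental).

```python
CONSONANTS = list("bdgvZzmnrlptkfsS")

# Feature classes as listed on the slide (class names are illustrative labels)
FEATURES = {
    "voicing": {
        "voiced": set("bdgvZzmnrl"),
        "voiceless": set("ptkfsS"),
    },
    "manner": {
        "fricative+liquid": set("fsSvZzrl"),
        "occlusive+nasal": set("ptkbdgmn"),
    },
    "place": {
        "labial": set("pbfvm"),
        "dental": set("tdsznl"),
        "palatal": set("kgZSr"),
    },
}


def feature_value(consonant, feature):
    """Return the class of `consonant` for the given feature."""
    for value, members in FEATURES[feature].items():
        if consonant in members:
            return value
    raise KeyError(consonant)


def collapse(confusions, feature):
    """Collapse {(stimulus, response): count} over consonants to feature classes."""
    out = {}
    for (stim, resp), count in confusions.items():
        key = (feature_value(stim, feature), feature_value(resp, feature))
        out[key] = out.get(key, 0) + count
    return out
```

For example, `collapse({("p", "b"): 3, ("p", "t"): 2}, "voicing")` counts 3 voiceless-to-voiced confusions and 2 responses that preserve voicelessness.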

[Block diagram of the stimulus generation. Blocks shown: nonsense speech (FS = 11025 Hz, frame analysis 92.8 ms); FFT; decomposition into 4 spectral bands (1 to 4); signal rectification; low-pass filtering at 500 Hz or 10 Hz; iFFT; white noise; bandpass filtering; signal reconstruction; addition (+) of the temporal masker built from white noise, (1), (1) + (2) or (1) + (2) + (3), at SNR = +6 dB; stimulus]
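The processing chain of this slide can be approximated by a minimal noise-vocoder sketch. It is a sketch under several assumptions: the band edges below are hypothetical (the slides do not give them), the filters are ideal FFT brick-wall filters applied to the whole signal rather than frame by frame, and the function names are illustrative. The 92.8 ms frame at 11025 Hz would be consistent with 1024-sample frames, which is also an assumption.

```python
import numpy as np

FS = 11025
# Hypothetical band edges in Hz; the slides only state that 4 bands are used
BAND_EDGES_HZ = [0, 800, 1500, 2500, 4000]


def fft_filter(x, lo_hz, hi_hz):
    """Ideal filter: keep FFT bins with lo_hz <= f < hi_hz, zero the rest."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / FS)
    spec[(freqs < lo_hz) | (freqs >= hi_hz)] = 0.0
    return np.fft.irfft(spec, len(x))


def noise_vocode(speech, fc_hz, seed=0):
    """Replace each band of `speech` by white noise carrying its envelope."""
    rng = np.random.default_rng(seed)
    out = np.zeros(len(speech))
    for lo, hi in zip(BAND_EDGES_HZ[:-1], BAND_EDGES_HZ[1:]):
        band = fft_filter(speech, lo, hi)
        # Rectify, then low-pass at Fc (10 or 500 Hz) to get the envelope
        envelope = fft_filter(np.abs(band), 0.0, fc_hz)
        envelope = np.maximum(envelope, 0.0)  # clip filter ripple below zero
        carrier = rng.standard_normal(len(speech))
        # Modulate white noise, then band-limit it to the original band
        out += fft_filter(envelope * carrier, lo, hi)
    return out
```

Calling `noise_vocode(x, 10.0)` keeps only the coarse envelope in each band, while `noise_vocode(x, 500.0)` also keeps the faster envelope fluctuations.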


Example of stimulus

[Figure: amplitude versus time (t) in the 4 subbands for an utterance "a B a B a", four panels: aCaCa speech envelope, Fc = 10 Hz; envelope of stimulus: (1); envelope of stimulus: (1) + (2); envelope of stimulus: (1) + (2) + (3)]


Results of the experiment

For all conditions, the chance level for consonant recognition was 6.25% (1/16).

The overall mean correct identification for the 6 subjects was 28%.

A confusion matrix was generated for each listener and summed across listeners. Then, the mean transmitted information (Miller and Nicely, J. Acoust. Soc. Am., 1955) for voicing, manner and place of articulation was evaluated.

The average information received for each consonant feature is plotted as a function of the level number, as compared with the average information received when there was no temporal masker (dashed lines).
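The information transmission analysis used here can be sketched as follows, assuming the usual Miller and Nicely definition: the mutual information T(x;y) between stimulus class x and response class y, divided by the stimulus entropy H(x). The function name and the dictionary input format are illustrative choices, and the function expects at least two stimulus classes so that H(x) is not zero.

```python
from math import log2


def relative_information_transmitted(counts):
    """counts: {(stimulus_class, response_class): n}. Returns T(x;y) / H(x)."""
    total = sum(counts.values())
    p_stim, p_resp = {}, {}
    for (stim, resp), n in counts.items():
        p_stim[stim] = p_stim.get(stim, 0.0) + n / total
        p_resp[resp] = p_resp.get(resp, 0.0) + n / total
    # Mutual information between stimulus and response classes, in bits
    transmitted = sum(
        (n / total) * log2((n / total) / (p_stim[stim] * p_resp[resp]))
        for (stim, resp), n in counts.items()
        if n > 0
    )
    # Entropy of the stimulus classes, in bits
    h_stim = -sum(p * log2(p) for p in p_stim.values() if p > 0)
    return transmitted / h_stim
```

Multiplying the result by 100 gives the "information received (%)" scale used in the plots: a feature that is always reported correctly gives 100%, and responses independent of the stimulus give 0%.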


Results: transmission of voicing

Voicing is not transmitted by the fine temporal modulation (as in Shannon et al.), and its transmission decreases slightly when decorrelation degrades the residual spectral information.

So we conclude that voicing features are acoustically “distributed”, and hence the degradation due to the different masker characteristics (low-frequency AM, decorrelation and 100 Hz re-modulation) is cumulative.

[Figure: Voicing recognition. Information received (%), from 0 to 70, as a function of level number (1 to 3); curves for Fc = 10 Hz and Fc = 500 Hz]


Results: transmission of the manner

The transmission of the manner of consonant articulation is completely suppressed by all temporal maskers, which have a low-frequency AM characteristic in common. The information received does not differ significantly from 0%.

[Figure: Manner recognition, two panels. Information received (%) as a function of level number (1 to 3), one panel with a 0 to 40 axis and one with a 0 to 70 axis; curves for Fc = 10 Hz and Fc = 500 Hz]

When spectral information is reduced, manner is conveyed by the coarse envelope component, which interferes strongly with a low-frequency AM masker: the distinction between fricatives and occlusives is encoded temporally, and it is effectively masked by noise with similar temporal characteristics.


Nullification of manner transmission

[Figure: amplitude versus time (t) in the 4 subbands for an utterance "a B a B a", four panels: aCaCa speech envelope, Fc = 10 Hz; envelope of stimulus: (1); envelope of stimulus: (1) + (2); envelope of stimulus: (1) + (2) + (3)]


Results: place transmission

Place of articulation is transmitted significantly less well (P<0.05; t-test) at Levels 2 and 3 than at Level 1, for Fc=10 Hz (*).

Decorrelation degrades the residual spectral information (for Fc at 10 Hz).

[Figure: Place recognition. Information received (%), from 0 to 70, as a function of level number (1 to 3); curves for Fc = 10 Hz and Fc = 500 Hz; two points are marked with an asterisk (*)]


Conclusion of the masking experiment

We reproduce the main results of Shannon et al.

Our experiment suggests that:
- voicing is a redundant consonant feature which depends on both categories of information: coarse temporal envelope and spectral information,
- but manner is mainly carried by the coarse temporal envelope.

This experiment supports the hypothesis that consonant identification is a complex process which can compensate for the reduction or the masking of temporal or spectral information by using residual information for voicing and place, but not for manner.


Perspective (1): variation of the spectro-temporal resolution

[Figure: five panels of gain (0 to 1) versus frequency (0 to 5000 Hz); labels: Clean signal, 10 Hz, 500 Hz]

Intelligibility is low for 1 and 2 subbands, with poor transmission of the place of articulation. The difference between Fc at 10 and 500 Hz is small.


Perspective (2): interaction between spectral reduction and masking

[Figure: information received (%), from 0 to 100, for voicing, place of articulation and manner of articulation in four conditions: 4 subbands at SNR = +6 dB, 4 subbands clean, 16 subbands clean, 16 subbands at SNR = +6 dB]

This preliminary experiment (Fc=10 Hz) shows that, for the mode, the effects of spectral reduction and of temporal masking are rather independent, the latter having the stronger impact. This confirms that the mode is mainly encoded temporally. One proposal for multistream ASR is therefore to decode this feature temporally in a separate 4-subband stream.


Perspective (3): audio-visual complementarity

As shown by Erber (1972), intelligibility is high even for 1 and 2 subbands: the place of articulation is the feature best transmitted by the visual modality, whereas it is the worst transmitted by the spectrally reduced audio speech, so global intelligibility is restored thanks to the direct complementarity of transmission in the two modalities.