
Page 1: Cocktail Party Problem  as Binary Classification

Cocktail Party Problem as Binary Classification

DeLiang Wang

Perception & Neurodynamics Lab, Ohio State University

Page 2: Cocktail Party Problem  as Binary Classification

2

Outline of presentation

Cocktail party problem

Computational theory analysis

Ideal binary mask

Speech intelligibility tests

Unvoiced speech segregation as binary classification

Page 3: Cocktail Party Problem  as Binary Classification

3

Real-world audition

What?
• Speech: message; speaker (age, gender, linguistic origin, mood, …)
• Music
• Car passing by

Where?
• Left, right, up, down
• How close?

Channel characteristics
Environment characteristics
• Room reverberation
• Ambient noise

Page 4: Cocktail Party Problem  as Binary Classification

4

Sources of intrusion and distortion

additive noise from other sound sources

reverberation from surface reflections

channel distortion

Page 5: Cocktail Party Problem  as Binary Classification

5

Cocktail party problem

• Term coined by Cherry
• “One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it ‘the cocktail party problem’…” (Cherry’57)

• “For ‘cocktail party’-like situations… when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers” (Bronkhorst & Plomp’92)

• Ball-room problem by Helmholtz: “Complicated beyond conception” (Helmholtz, 1863)

Speech segregation problem

Page 6: Cocktail Party Problem  as Binary Classification

6

Approaches to Speech Segregation Problem

Speech enhancement
• Enhance signal-to-noise ratio (SNR) or speech quality by attenuating interference. Applicable to monaural recordings
• Limitation: stationarity and estimation of interference

Spatial filtering (beamforming)
• Extract target sound from a specific spatial direction with a sensor array
• Limitation: configuration stationarity. What if the target switches or changes location?

Independent component analysis (ICA)
• Find a demixing matrix from mixtures of sound sources
• Limitation: strong assumptions, chief among them stationarity of the mixing matrix

• “No machine has yet been constructed to do just that [solving the cocktail party problem].” (Cherry’57)

Page 7: Cocktail Party Problem  as Binary Classification

7

Auditory scene analysis

Listeners parse the complex mixture of sounds arriving at the ears in order to form a mental representation of each sound source

This perceptual process is called auditory scene analysis (Bregman’90)

Two conceptual processes of auditory scene analysis (ASA):
• Segmentation: decompose the acoustic mixture into sensory elements (segments)
• Grouping: combine segments into groups, so that segments in the same group likely originate from the same environmental source

Page 8: Cocktail Party Problem  as Binary Classification

8

Computational auditory scene analysis

Computational auditory scene analysis (CASA) approaches sound separation based on ASA principles
• Feature-based approaches
• Model-based approaches

Page 9: Cocktail Party Problem  as Binary Classification

9

Outline of presentation

Cocktail party problem

Computational theory analysis

Ideal binary mask

Speech intelligibility tests

Unvoiced speech segregation as binary classification

Page 10: Cocktail Party Problem  as Binary Classification

10

What is the goal of CASA?

What is the goal of perception?
• The perceptual systems are ways of seeking and extracting information about the environment from sensory input (Gibson’66)
• The purpose of vision is to produce a visual description of the environment for the viewer (Marr’82)
• By analogy, the purpose of audition is to produce an auditory description of the environment for the listener

What is the computational goal of ASA?
• The goal of ASA is to segregate sound mixtures into separate perceptual representations (or auditory streams), each of which corresponds to an acoustic event (Bregman’90)
• By extrapolation, the goal of CASA is to develop computational systems that extract individual streams from sound mixtures

Page 11: Cocktail Party Problem  as Binary Classification

11

Marrian three-level analysis

According to Marr (1982), a complex information-processing system must be understood at three levels:
• Computational theory: the goal, its appropriateness, and the basic processing strategy
• Representation and algorithm: representations of input and output, and the transformation algorithms
• Implementation: physical realization

All levels of explanation are required for eventual understanding of perceptual information processing

Computational theory analysis – understanding the character of the problem – is critically important

Page 12: Cocktail Party Problem  as Binary Classification

12

Computational-theory analysis of ASA

To form a stream, a sound must be audible on its own

The number of streams that can be computed at a time is limited
• Magical number 4 for simple sounds such as tones and vowels (Cowan’01)?
• 1+1, or figure-ground segregation, in a noisy environment such as a cocktail party?

Auditory masking further constrains the ASA output
• Within a critical band, a stronger signal masks a weaker one

Page 13: Cocktail Party Problem  as Binary Classification

13

Computational-theory analysis of ASA (cont.)

ASA outcome depends on sound types (overall SNR is 0 dB); the original slide includes audio demos:
• Noise-Noise: pink, white, pink+white
• Tone-Tone: tone1, tone2, tone1+tone2
• Speech-Speech
• Noise-Tone
• Noise-Speech
• Tone-Speech

Page 14: Cocktail Party Problem  as Binary Classification

14

Some alternative CASA goals

Extract all underlying sound sources or the target sound source (the gold standard)
• Implicit in speech enhancement, spatial filtering, and ICA
• Limitation: segregating all sources is implausible, and probably unrealistic with one or two microphones

Enhance automatic speech recognition (ASR)
• Advantage: close coupling with a primary motivation of speech segregation
• Limitation: perceiving is more than recognizing (Treisman’99)

Enhance human listening
• Advantage: close coupling with auditory perception
• Limitation: there are applications that involve no human listening

Page 15: Cocktail Party Problem  as Binary Classification

15

Ideal binary mask as CASA goal

• Motivated by the above analysis, we have suggested the ideal binary mask as a main goal of CASA (Hu & Wang’01, ’04)
• The key idea is to retain the parts of a target sound that are stronger than the acoustic background, and discard the rest
• What counts as the target depends on intention, attention, etc.

• The definition of the ideal binary mask (IBM)

$$\mathrm{IBM}(t,f) = \begin{cases} 1 & \text{if } s(t,f) - n(t,f) > \theta \\ 0 & \text{otherwise} \end{cases}$$

• s(t, f): target energy (in dB) in time-frequency unit (t, f)
• n(t, f): noise energy (in dB) in unit (t, f)
• θ: a local SNR criterion (LC) in dB, typically chosen to be 0 dB
• Note that the IBM does not actually separate the mixture!
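As a minimal illustration (not the authors' code), the definition translates directly into a few lines; `target_energy` and `noise_energy` are assumed to be premixed target and noise T-F energy matrices of the same shape:

```python
import numpy as np

def ideal_binary_mask(target_energy, noise_energy, lc_db=0.0, eps=1e-12):
    """Ideal binary mask: 1 where the local SNR exceeds the criterion LC (in dB),
    0 elsewhere. Requires access to the premixed target and noise energies."""
    local_snr_db = 10.0 * np.log10((target_energy + eps) / (noise_energy + eps))
    return (local_snr_db > lc_db).astype(np.uint8)
```

With LC = 0 dB this keeps exactly the T-F units where the target is stronger than the interference.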

Page 16: Cocktail Party Problem  as Binary Classification

16

IBM illustration

Page 17: Cocktail Party Problem  as Binary Classification

17

Properties of IBM

Flexibility: With the same mixture, the definition leads to different IBMs depending on what the target is

Well-definedness: IBM is well-defined no matter how many intrusions are in the scene or how many targets need to be segregated

Consistent with the computational-theory analysis of ASA
• Audibility and capacity
• Auditory masking
• Effects of target and noise types

Optimality: Under certain conditions the ideal binary mask with θ = 0 dB is the optimal binary mask from the perspective of SNR gain

• The ideal binary mask provides an excellent front-end for robust ASR (Cooke et al.’01; Roman et al.’03)

Page 18: Cocktail Party Problem  as Binary Classification

18

Subject tests of ideal binary masking

• Recent studies found large speech intelligibility improvements by applying ideal binary masking for normal-hearing (Brungart et al.’06; Li & Loizou’08) and hearing-impaired (Anzalone et al.’06; Wang et al.’09) listeners
• The improvement for stationary noise is above 7 dB for normal-hearing (NH) listeners, and above 9 dB for hearing-impaired (HI) listeners

• Improvement for modulated noise is significantly larger than for stationary noise

Page 19: Cocktail Party Problem  as Binary Classification

19

Test conditions of Wang et al.’09

SSN: unprocessed monaural mixtures of speech-shaped noise (SSN) and Dantale II sentences (audio demos at 0 dB and -10 dB)

CAFÉ: unprocessed monaural mixtures of cafeteria noise (CAFÉ) and Dantale II sentences (audio demos at 0 dB and -10 dB)

SSN-IBM: IBM applied to SSN (audio demos at 0 dB, -10 dB, and -20 dB)

CAFÉ-IBM: IBM applied to CAFÉ (audio demos at 0 dB, -10 dB, and -20 dB)

Intelligibility results are measured in terms of speech reception threshold (SRT), the required SNR level for 50% intelligibility score
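As a small illustration of how an SRT can be read off measured data (a simple interpolation sketch under hypothetical numbers, not the adaptive procedure used in such listening tests):

```python
import numpy as np

def estimate_srt(snrs_db, percent_correct, target=50.0):
    """Interpolate the psychometric data to find the SNR giving the target
    intelligibility score (50% by default). Inputs are paired measurements,
    with percent_correct increasing with SNR."""
    return float(np.interp(target, percent_correct, snrs_db))

# Hypothetical data; the estimated SRT here is -5.6 dB.
print(estimate_srt([-10, -8, -6, -4, -2], [5, 20, 45, 70, 90]))
```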

Page 20: Cocktail Party Problem  as Binary Classification

20

Wang et al.’s results

• 12 NH subjects (10 male and 2 female), and 12 HI subjects (9 male and 3 female)
• SRT means (dB) for the 4 conditions (SSN, CAFÉ, SSN-IBM, CAFÉ-IBM) for NH listeners: (-8.2, -10.3, -15.6, -20.7)
• SRT means (dB) for the 4 conditions for HI listeners: (-5.6, -3.8, -14.8, -19.4)

[Figure: Dantale II SRT (dB) for NH and HI listeners across the four conditions (SSN, CAFÉ, SSN-IBM, CAFÉ-IBM)]

Page 21: Cocktail Party Problem  as Binary Classification

21

Speech perception of noise with binary gains

Wang et al. (2008) found that, when LC is chosen to be the same as the input SNR, nearly perfect intelligibility is obtained even when the input SNR is -∞ dB (i.e., the mixture contains noise only, with no target speech)

[Figure: cochleagrams (center frequency 55–7743 Hz vs. time 0–2 s, energy scale 0–96 dB) and a 32-channel binary gain pattern (channel number vs. time)]

Page 22: Cocktail Party Problem  as Binary Classification

22

Wang et al.’08 results

Despite a great reduction of spectrotemporal information, a pattern of binary gains is apparently sufficient for human speech recognition

Mean percent-correct scores for the 4 channel-number conditions: (97.1%, 92.9%, 54.3%, 7.6%)

[Figure: percent correct vs. number of channels (4, 8, 16, 32)]

Page 23: Cocktail Party Problem  as Binary Classification

23

Interim summary

Ideal binary mask is an appropriate computational goal of auditory scene analysis in general, and speech segregation in particular

Hence solving the cocktail party problem would amount to binary classification

This formulation opens the problem to a variety of pattern classification methods

Page 24: Cocktail Party Problem  as Binary Classification

24

Outline of presentation

Cocktail party problem

Computational theory analysis

Ideal binary mask

Speech intelligibility tests

Unvoiced speech segregation as binary classification

Page 25: Cocktail Party Problem  as Binary Classification

25

Unvoiced speech

Speech sounds consist of vowels and consonants; consonants further consist of voiced and unvoiced consonants

• For English, unvoiced speech sounds come from the following consonant categories:
• Stops (plosives)
  – Unvoiced: /p/ (pool), /t/ (tool), and /k/ (cake)
  – Voiced: /b/ (book), /d/ (day), and /g/ (gate)
• Fricatives
  – Unvoiced: /s/ (six), /sh/ (sheep), /f/ (fix), and /th/ (think)
  – Voiced: /z/ (zoo), /zh/ (pleasure), /v/ (vine), and /dh/ (that)
  – Mixed: /h/ (high)
• Affricates (stop followed by fricative)
  – Unvoiced: /ch/ (chicken)
  – Voiced: /jh/ (orange)

• We refer to the above consonants as expanded obstruents

Page 26: Cocktail Party Problem  as Binary Classification

26

Unvoiced speech segregation

• Unvoiced speech constitutes 20-25% of all speech sounds
• It carries crucial information for speech intelligibility

• Unvoiced speech is more difficult to segregate than voiced speech
  – Voiced speech is highly structured, whereas unvoiced speech lacks harmonicity and is often noise-like

• Unvoiced speech is usually much weaker than voiced speech and therefore more susceptible to interference

Page 27: Cocktail Party Problem  as Binary Classification

27

Processing stages of Hu-Wang’08 model

Mixture → Auditory periphery → Segmentation → Grouping → Segregated speech

• Peripheral processing results in a two-dimensional cochleagram
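A minimal cochleagram sketch follows; it is not the exact peripheral model used here (the filter order, channel count, frame size, and ERB constants are standard textbook choices assumed for illustration), and `signal` is a 1-D NumPy array:

```python
import numpy as np

def erb_space(n_channels=64, f_lo=50.0, f_hi=8000.0):
    """Center frequencies equally spaced on the ERB-rate scale (Glasberg & Moore)."""
    erb_rate = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)
    inv_erb_rate = lambda e: (10.0 ** (e / 21.4) - 1.0) / 4.37e-3
    return inv_erb_rate(np.linspace(erb_rate(f_lo), erb_rate(f_hi), n_channels))

def cochleagram(signal, fs=16000, n_channels=64, frame_len=320, hop=160):
    """Minimal cochleagram: 4th-order gammatone filtering (approximated by
    convolution with a truncated impulse response), then frame-wise energy."""
    t = np.arange(int(0.128 * fs)) / fs                      # 128-ms impulse response
    cgm = []
    for fc in erb_space(n_channels):
        erb = 24.7 * (4.37e-3 * fc + 1.0)                    # equivalent rectangular bandwidth
        ir = t ** 3 * np.exp(-2 * np.pi * 1.019 * erb * t) * np.cos(2 * np.pi * fc * t)
        out = np.convolve(signal, ir, mode="same")
        energies = [np.sum(out[i:i + frame_len] ** 2)
                    for i in range(0, len(out) - frame_len + 1, hop)]
        cgm.append(energies)
    return np.array(cgm)                                     # shape: (channels, frames)
```

At 16 kHz, the defaults give 20-ms frames with a 10-ms shift, roughly the T-F unit resolution typically used in this line of work.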

Page 28: Cocktail Party Problem  as Binary Classification

28

Auditory segmentation

• Auditory segmentation is to decompose an auditory scene into contiguous time-frequency (T-F) regions (segments), each of which should contain signal mostly from the same sound source
• The definition of segmentation applies to both voiced and unvoiced speech

• This is equivalent to identifying onsets and offsets of individual T-F segments, which correspond to sudden changes of acoustic energy

• Our segmentation is based on a multiscale onset/offset analysis (Hu & Wang’07); a sketch of the detection step appears below
  – Smoothing along time and frequency dimensions
  – Onset/offset detection and onset/offset front matching
  – Multiscale integration
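The sketch below illustrates one scale of the smoothing and onset/offset detection steps; the smoothing scales and threshold are illustrative assumptions, and front matching plus multiscale integration are omitted:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def detect_onsets_offsets(log_energy, freq_sigma=1.0, time_sigma=2.0, thresh=0.05):
    """One scale of onset/offset analysis on a (channels x frames) log-energy
    cochleagram: smooth over frequency and time, differentiate along time, and
    mark onset candidates at large positive derivatives and offset candidates
    at large negative ones."""
    smoothed = gaussian_filter(log_energy, sigma=(freq_sigma, time_sigma))
    d_dt = np.diff(smoothed, axis=1)
    onsets = d_dt > thresh
    offsets = d_dt < -thresh
    return onsets, offsets
```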

Page 29: Cocktail Party Problem  as Binary Classification

29

Smoothed intensity

[Figure: smoothed intensity at four (frequency, time) scales, panels (a)-(d), frequency 50-8000 Hz vs. time 0-2.5 s]

Utterance: “That noise problem grows more annoying each day”
Interference: crowd noise in a playground, mixed at 0 dB SNR
Scales in frequency and time: (a) (0, 0), initial intensity; (b) (2, 1/14); (c) (6, 1/14); (d) (6, 1/4)

Page 30: Cocktail Party Problem  as Binary Classification

30

Segmentation result

The bounding contours of estimated segments from multiscale analysis. The background is represented by blue:

(a) One-scale analysis
(b) Two-scale analysis
(c) Three-scale analysis
(d) Four-scale analysis
(e) The ideal binary mask
(f) The mixture

Page 31: Cocktail Party Problem  as Binary Classification

31

Grouping

• Apply auditory segmentation to generate all segments for the entire mixture

• Segregate voiced speech using an existing algorithm

• Identify segments dominated by voiced target using segregated voiced speech

• Identify segments dominated by unvoiced speech based on speech/nonspeech classification
  – Assuming nonspeech interference, due to the lack of sequential organization

Page 32: Cocktail Party Problem  as Binary Classification

32

Speech/nonspeech classification

• A T-F segment is classified as speech if

  $$P(H_0 \mid \mathbf{X}_s) > P(H_1 \mid \mathbf{X}_s)$$

• X_s: the energy of all the T-F units within segment s
• H_0: the hypothesis that s is dominated by expanded obstruents
• H_1: the hypothesis that s is interference dominant

Page 33: Cocktail Party Problem  as Binary Classification

33

Speech/nonspeech classification (cont.)

• By the Bayes rule, we have

  $$P(\mathbf{X}_s \mid H_0)\,P(H_0) > P(\mathbf{X}_s \mid H_1)\,P(H_1)$$

• Since segments have varied durations, directly evaluating the above likelihoods is computationally infeasible

• Instead, we assume that each time frame within a segment is statistically independent given a hypothesis, which yields the decision rule

  $$\frac{P(H_0)}{P(H_1)} \prod_{m} \frac{P(\mathbf{X}_s(m) \mid H_0)}{P(\mathbf{X}_s(m) \mid H_1)} > 1$$

  where X_s(m) denotes the features of segment s at frame m

• A multilayer perceptron is trained to distinguish expanded obstruents from nonspeech interference
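Under the frame-independence assumption, the decision amounts to summing per-frame log likelihood ratios; a minimal sketch, assuming those per-frame ratios (e.g., derived from the MLP's outputs) are already available:

```python
import numpy as np

def segment_is_speech(frame_log_likelihood_ratios, log_prior_ratio=0.0):
    """Label a segment as expanded obstruents (speech) if
    log[P(H0)/P(H1)] + sum_m [log P(X_s(m)|H0) - log P(X_s(m)|H1)] > 0,
    i.e., if the probability ratio in the slide exceeds 1."""
    return log_prior_ratio + np.sum(frame_log_likelihood_ratios) > 0.0
```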

Page 34: Cocktail Party Problem  as Binary Classification

34

Speech/nonspeech classification (cont.)

• The prior probability ratio P(H_0)/P(H_1) is found to be approximately linear with respect to input SNR

• Assuming that interference energy does not vary greatly over the duration of an utterance, earlier segregation of voiced speech enables us to estimate input SNR
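As a sketch of how such a linear relation could be calibrated and then used at test time (the calibration numbers below are hypothetical, not from the paper):

```python
import numpy as np

# Hypothetical calibration data: input SNR (dB) vs. measured prior ratio
# P(H0)/P(H1) on a training set.
snr_db = np.array([-5.0, 0.0, 5.0, 10.0, 15.0])
prior_ratio = np.array([0.05, 0.12, 0.20, 0.28, 0.35])

slope, intercept = np.polyfit(snr_db, prior_ratio, deg=1)

def prior_ratio_from_snr(estimated_snr_db):
    """Map the input SNR estimated from the segregated voiced speech to a
    prior ratio P(H0)/P(H1) via the fitted linear relation."""
    return slope * estimated_snr_db + intercept
```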

Page 35: Cocktail Party Problem  as Binary Classification

35

Speech/nonspeech classification (cont.)

• With estimated input SNR, each segment is then classified as either expanded obstruents or interference

• Segments classified as expanded obstruents join the segregated voiced speech to produce the final output
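A minimal sketch of this final combination step (mask shapes and helper names are assumptions for illustration):

```python
import numpy as np

def combine_masks(voiced_mask, segment_masks, segment_is_speech):
    """Start from the binary mask of segregated voiced speech and add every
    segment classified as expanded obstruents. voiced_mask: (channels x frames)
    binary array; segment_masks: list of binary arrays of the same shape;
    segment_is_speech: list of booleans from the segment classifier."""
    final_mask = voiced_mask.copy()
    for seg_mask, is_speech in zip(segment_masks, segment_is_speech):
        if is_speech:
            final_mask = np.maximum(final_mask, seg_mask)
    return final_mask
```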

Page 36: Cocktail Party Problem  as Binary Classification

36

Example of segregation

[Figure: cochleagrams of (a) clean utterance, (b) mixture (SNR 0 dB), (c) segregated voiced utterance, (d) segregated whole utterance, and (e) utterance segregated from the IBM; frequency 50-8000 Hz vs. time 0-2.5 s]

Utterance: “That noise problem grows more annoying each day”
Interference: crowd noise in a playground (IBM: ideal binary mask)

Page 37: Cocktail Party Problem  as Binary Classification

37

SNR of segregated target

[Figure: (a) overall SNR (dB) and (b) SNR in unvoiced frames (dB) as a function of mixture SNR (-5 to 15 dB), for the proposed system vs. spectral subtraction]

Compared to spectral subtraction assuming perfect speech pause detection

Page 38: Cocktail Party Problem  as Binary Classification

38

Conclusion

• Analysis of ideal binary mask as CASA goal

• Formulation of the cocktail party problem as binary classification

• Segregation of unvoiced speech based on segment classification
• The proposed model represents the first systematic study on unvoiced speech segregation

Page 39: Cocktail Party Problem  as Binary Classification

39

Credits

• Speech intelligibility tests of IBM: Joint with Ulrik Kjems, Michael S. Pedersen, Jesper Boldt, and Thomas Lunner, at Oticon

• Unvoiced speech segregation: Joint with Guoning Hu