Computational models of human visual attention driven by auditory cues


Akisato Kimura, Ph.D.

NTT Communication Science Laboratories

(Most of the content presented in this talk is based on collaborative research with the National Institute of Informatics, Japan.)


Visual attention

Visual attention is a built-in mechanism of the human visual system for scene understanding.

http://www.tobii.com/eye-tracking-research/global/library/white-papers/tobii-eye-tracking-white-paper/


Simulating visual attention is essential

Such a pre-selection mechanism would be essential in enabling computers to undertake tasks such as:

• HCI [http://www.icub.org]

• Visual assistance [https://www.google.com/glass]

• Object detection [Donoser et al. 09]


Saliency as a measure of attention

Saliency = how strongly each location attracts visual attention

• Simple, easy to implement, reasonable outputs

(Figure: input image and its saliency map [Itti et al. 98], used for estimating the human visual focus of attention; saliency ranges from low to high.)


Related work

Visual saliency

• Saliency map model [Itti 1998]

• Shannon self-information [Bruce 2005]

• Incorporating temporal dynamics [Itti 2009]

(Figure: an input image and the saliency maps of [Itti et al. 98] and [Bruce et al. 05].)


Visual attention modulated by audio

Sounds are strongly related to events that draw human visual attention.

(Figure: gaze distributions without audio and with audio for a scene with a speaking person [Song et al. 11].)


Related work

Visual saliency

• Saliency map model [Itti 1998]

• Shannon self-information [Bruce 2005]

• Incorporating temporal dynamics [Itti 2009]

Auditory saliency

• Center-surround mechanism [Kayser 2005]

• Bayesian surprise [Schauerte 2013]

Audio-visual saliency

• Multi-modal saliency for robotics []

• Sound source localization[Nakajima 2013]

(Figures: input image with saliency maps [Itti et al. 98][Bruce et al. 05]; audio spectrogram [Kayser et al. 05]; input video [Itti et al. 03][Nakajima et al. 13].)

Modeling human visual attention with the help of auditory information is still an emerging research area.


Main content of this talk

Our recent work on simulating human visual attention driven by auditory cues

• Auditory information plays a supportive role, in contrast to standard multi-modal fusion approaches

• Our strategy is built on two psychophysical findings

1. Audio-visual temporal alignment yields attentional benefits when the audio and visual changes are both synchronized and transient [Van der Burg, PLoS ONE 2010]

2. Auditory attention modulates visual attention in a feature-specific manner [Aveninen, PLoS ONE 2012]


Our strategy

1. Audio-visual temporal alignment yields attentional benefits when the audio and visual changes are both synchronized and transient [Van der Burg, PLoS ONE 2010]

2. Auditory attention modulates visual attention in a feature-specific manner [Aveninen, PLoS ONE 2012]

Following those findings…

1. detect transient events in visual and auditory domains separately

2. look for visual features synchronized with detected auditory events

3. modulate saliency maps by feature selection.


Previous method – Bayesian surprise

(Diagram: the conventional pipeline. The image signal of the input video is fed to Bayesian surprise, yielding visual saliency computed from visual features only, i.e., conventional saliency maps.)


Our strategy

(Diagram: the proposed pipeline. The image signal of the input video is fed to Bayesian surprise and the audio signal to auditory surprise. Visual features synchronized with the auditory events are selected, and the saliency maps are modulated with the selected features, yielding the proposed saliency maps from the selected visual features.)


Bayesian surprise

(Diagram: the Bayesian surprise stage. The image signal and the audio signal of the input video are converted into visual Bayesian surprise and auditory surprise, respectively.)


Concept of Bayesian surprise

Continuously similar features → low saliency values

Unexpected features → high saliency values
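As a concrete definition (the formulation below follows Itti and Baldi's notion of Bayesian surprise; the notation is added here for clarity): for a prior $P(M)$ over models $M$ and the posterior $P(M \mid D)$ after observing data $D$,

$$ S(D) = \mathrm{KL}\big(P(M \mid D)\,\|\,P(M)\big) = \int P(M \mid D)\,\log\frac{P(M \mid D)}{P(M)}\,dM, $$

so observations that barely move the posterior yield low surprise, while unexpected observations that move it far from the prior yield high surprise.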


Visual Bayesian surprise

72 visual feature maps are extracted from the input video over Gaussian pyramid scales: intensity ×6, color ×12, orientation ×24, flicker ×6, and motion ×24. These feature maps serve as the observations: for each map, the prior is updated to a posterior by Bayes' rule, and the surprise is the Kullback-Leibler divergence between the posterior and the prior, giving a visual surprise map for each of the 72 feature maps [Itti, Vision Research 2009].

(Figure: input video and its visual surprise map, ranging from low to high.)
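As an illustration of how such a per-pixel surprise update can be implemented, here is a minimal sketch following the conjugate Gamma/Poisson model used in Itti's surprise framework; the class name, the initial hyper-parameters, and the forgetting factor zeta are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy.special import gammaln, digamma

def kl_gamma(a1, b1, a2, b2):
    """KL( Gamma(a1, b1) || Gamma(a2, b2) ), shape/rate parameterization."""
    return ((a1 - a2) * digamma(a1) - gammaln(a1) + gammaln(a2)
            + a2 * (np.log(b1) - np.log(b2)) + a1 * (b2 - b1) / b1)

class SurpriseMap:
    """Per-pixel Bayesian surprise for a single feature map (illustrative sketch)."""
    def __init__(self, shape, alpha0=1.0, beta0=1.0, zeta=0.7):
        self.alpha = np.full(shape, alpha0)   # Gamma shape of the prior
        self.beta = np.full(shape, beta0)     # Gamma rate of the prior
        self.zeta = zeta                      # forgetting factor for old evidence

    def update(self, feature_map):
        """Feed one frame of the feature map; return the per-pixel surprise map."""
        # Conjugate Gamma/Poisson update with exponential forgetting of the prior.
        alpha_post = self.zeta * self.alpha + feature_map
        beta_post = self.zeta * self.beta + 1.0
        # Surprise = KL divergence between posterior and prior.
        surprise = kl_gamma(alpha_post, beta_post, self.alpha, self.beta)
        # The posterior becomes the prior for the next frame.
        self.alpha, self.beta = alpha_post, beta_post
        return surprise
```

In the full model, one such update would run independently for each of the 72 feature maps and pyramid scales.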


Auditory Bayesian surprise

Spectrograms are used as the observation: the audio signal is converted into a spectrogram, and for each frequency ω the prior is updated to a posterior with the observation at ω. The surprise at frequency ω is then averaged over frequencies to yield the auditory surprise [Schauerte, ICASSP 2013].

(Figure: audio signal, its spectrogram, and the resulting auditory surprise, ranging from low to high.)
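For illustration, a simplified sketch of spectrogram-based auditory surprise is given below. It models each frequency bin with a Gaussian of fixed variance over log-magnitudes and exponentially smooths the prior mean; this is a simplification rather than a faithful reimplementation of [Schauerte, ICASSP 2013], and the window length, hop size, and smoothing factor are assumed values.

```python
import numpy as np

def log_spectrogram(signal, win=1024, hop=512):
    """Naive log-magnitude spectrogram: rows are frames, columns are frequency bins."""
    window = np.hanning(win)
    frames = [signal[i:i + win] * window
              for i in range(0, len(signal) - win, hop)]
    return np.log1p(np.abs(np.fft.rfft(np.asarray(frames), axis=1)))

def auditory_surprise(spec, zeta=0.9, sigma2=1.0):
    """Per-frame surprise, computed per frequency bin and averaged over frequencies."""
    mean = spec[0].copy()                 # running (prior) mean per frequency bin
    surprises = []
    for frame in spec:
        post = zeta * mean + (1.0 - zeta) * frame   # posterior mean via smoothing
        # KL between equal-variance Gaussians reduces to a squared mean shift.
        kl = (post - mean) ** 2 / (2.0 * sigma2)
        surprises.append(kl.mean())       # average the surprise over frequencies
        mean = post                       # posterior becomes the next prior
    return np.array(surprises)
```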


Audio-visual synchronization

(Diagram: the synchronization stage. Visual features synchronized with the audio are selected by comparing the visual Bayesian surprise with the auditory surprise, and the visual saliency is computed from the selected visual features.)


Correlation-based detection

For each of the 360 types of visual features f, the visual surprise is averaged over pixels to obtain a time series. Around each detected auditory event, the correlation between this time series and the auditory surprise is calculated within a temporal window whose width depends on the length of the auditory event; a feature is regarded as synchronized when the correlation exceeds a threshold θs.
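A sketch of this correlation test is shown below, assuming the pixel-averaged visual surprise time series and the auditory surprise time series are already available; the function name, the window handling, and the default value of θs are assumptions for illustration.

```python
import numpy as np

def synchronized_features(visual_surprise, auditory_surprise,
                          event_start, event_length, theta_s=0.5):
    """
    visual_surprise  : array (num_features, num_frames), surprise averaged over pixels
    auditory_surprise: array (num_frames,)
    event_start, event_length: frame index and duration of a detected auditory event
                               (the window width follows the event length)
    Returns the indices of features whose correlation exceeds theta_s.
    """
    lo, hi = event_start, event_start + event_length
    a = auditory_surprise[lo:hi]
    selected = []
    for f, v in enumerate(visual_surprise):
        w = v[lo:hi]
        if w.std() > 0 and a.std() > 0:
            corr = np.corrcoef(w, a)[0, 1]   # Pearson correlation inside the window
            if corr > theta_s:
                selected.append(f)
    return selected
```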


Visual feature selection

(Diagram: the feature-selection stage. Visual features synchronized with the audio are selected from the visual Bayesian surprise and the auditory surprise, and the saliency maps are modulated with the selected features to produce the proposed saliency maps.)


Selecting visual features

The synchronization decisions are binarized over time for each of the 360 types of features, and the frequency of "synchronization" is counted per feature. Features are then selected by voting with a threshold θc, leaving N < 360 types of features. The final saliency map emphasizes the selected features by summing up only the selected features.
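A sketch of the voting step and the final map construction follows, assuming the per-event synchronization decisions from the previous step are available; function names and the default θc are illustrative.

```python
import numpy as np

def select_features(sync_per_event, num_features, theta_c=2):
    """sync_per_event: one list of synchronized feature indices per auditory event."""
    votes = np.zeros(num_features, dtype=int)
    for indices in sync_per_event:
        votes[indices] += 1                  # count how often each feature synchronized
    return np.flatnonzero(votes >= theta_c)  # the N (< num_features) selected features

def final_saliency(surprise_maps, selected):
    """surprise_maps: array (num_features, H, W); sum only the selected feature maps."""
    return surprise_maps[selected].sum(axis=0)
```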


Experimental setup

Detecting scan paths as ground truth
• 15 subjects
• 6 videos (the DIEM project)
• Tobii TX300 eye tracker

Evaluation criterion
• Normalized Scanpath Saliency (NSS) [Peters 2009]

Baselines
• Saliency map model [Itti 2003], Bayesian surprise [Itti 2009], sound source localization [Nakajima 2013]
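For reference, a minimal sketch of the NSS computation: the saliency map is standardized to zero mean and unit variance, and the standardized values at the fixated locations are averaged (fixation handling in the actual evaluation may differ).

```python
import numpy as np

def nss(saliency_map, fixations):
    """
    saliency_map: 2-D array
    fixations   : iterable of (row, col) fixation coordinates
    """
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-12)
    return float(np.mean([s[r, c] for r, c in fixations]))
```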


Experimental results – summary

The proposed model produced the best NSS scores for all the videos.



Qualitative evaluation – Video 2

(Figure panels: input, baseline, auditory surprise, and the proposed method.)


Detailed evaluation – Video 1

(Plot: NSS of the proposed method and of the baseline over frames, together with the auditory surprise and the detected auditory events.)

Selected visual features (number of feature maps per type):

Feature     Intensity  Color  Orientation  Flicker  Motion  Total
Baseline        30       60       120         30      120     360
Proposed         8       17        46          0        0      71

The proposed model outperformed the baseline in many frames


Some extensions

Drawbacks of the proposed method

• 2-pass algorithm: the whole video has to be scanned first to detect synchronization.

Recent updates

• Sequential estimation of visual & auditory surprise via exponential smoothing

NSS by video:

                 Video 1  Video 2  Video 3  Video 4  Video 5  Video 6
Itti2009          2.896    1.816    0.790    1.209    0.318    0.513
Nakajima2013      1.857    0.992    0.540    1.073    0.368    0.216
Proposed (new)    3.077    1.820    0.791    1.273    0.318    0.513
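One way such a sequential, one-pass estimate could work is with exponentially smoothed running statistics, so that a correlation-like synchronization score is available at every frame. The sketch below is a guess at that idea under assumed choices (smoothing factor, score definition), not the authors' implementation.

```python
import numpy as np

class OnlineSync:
    """Exponentially smoothed, per-feature synchronization score (one-pass sketch)."""
    def __init__(self, num_features, rho=0.95):
        self.rho = rho                       # smoothing factor (assumed value)
        self.mv = np.zeros(num_features)     # running mean of visual surprise per feature
        self.ma = 0.0                        # running mean of auditory surprise
        self.cov = np.zeros(num_features)    # running covariance estimate
        self.vv = np.ones(num_features)      # running variance of visual surprise
        self.va = 1.0                        # running variance of auditory surprise

    def update(self, v, a):
        """v: per-feature visual surprise at this frame; a: auditory surprise."""
        r = self.rho
        self.mv = r * self.mv + (1 - r) * v
        self.ma = r * self.ma + (1 - r) * a
        dv, da = v - self.mv, a - self.ma
        self.cov = r * self.cov + (1 - r) * dv * da
        self.vv = r * self.vv + (1 - r) * dv ** 2
        self.va = r * self.va + (1 - r) * da ** 2
        # Correlation-like score per feature, available at every frame.
        return self.cov / np.sqrt(self.vv * self.va + 1e-12)
```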


Conclusion

Our recent work on simulating human visual attention driven by auditory cues

• Auditory information plays a supportive role

• Our model is built on recent psychophysical findings

Human visual attention modeling with the help of auditory information is still underway; future directions include:

• Auditory attention models

• Auditory cues other than synchronization


References

• Kimura, Yonetani, Hirayama “Computational models of human visual attention and their implementations: A survey,” IEICE Transactions on Information and Systems, Vol.E96-D, No.3, 2013.

• Nakajima, Sugimoto, Kawamoto “Incorporating audio signals into constructing a visual saliency map,” Proc. Pacific-Rim Symposium on Image and Video Technology (PSIVT2013).

• Nakajima, Kimura, Sugimoto, Kashino “Visual attention driven by auditory cues: Selecting visual features in synchronization with attracting auditory events,” Proc. International Conference on Multimedia Modeling (MMM2015).

• Nakajima, Kimura, Sugimoto, Kashino “An online computational model of human visual attention considering spatio-temporal synchronization with auditory events,” IPSJ Technical Report, CVIM195-57, 2015 (in Japanese).
