A Computational Model of Task-Dependent Influences on Eye Position

Robert J. Peters and Laurent Itti
University of Southern California, Computer Science, Neuroscience


Poster panels: introduction • videogame stimuli • bottom-up model • metrics: model scoring • bottom-up results • top-down model • feature extraction • top-down predictions • top-down results • summary • overall method

[Figure: sample image, salience map, and human fixation locations. The human scanpath is overlaid on the normalized salience map; the normalized salience values at the fixations (1.019, 2.701, 3.222, 1.777, 1.588, 0.877, 0.096, 1.067, 0.615, ...) yield a normalized scanpath salience of 1.304, which is compared against the distribution of salience values (µ = 0, σ² = 1). Histograms of salience at "fixated" vs. "non-fixated" locations, and their per-bin KL contributions, give KL dist = 0.08.]

normalized scanpath salience (NSS)
• each salience map (or control map) is normalized to have mean=0 and stdev=1
• the human scanpath is overlaid on the normalized salience map
• the normalized salience value is extracted at each fixation location
• these values are averaged across the scanpath to give the normalized scanpath salience (NSS)
• NSS can be compared with the distribution of random salience values
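Below is a minimal sketch of the NSS computation just described (NumPy-based; the function and argument names are illustrative, not the authors' code):

```python
import numpy as np

def normalized_scanpath_salience(salience_map, fixations):
    """NSS: normalize the map to mean 0 / stdev 1, then average the
    normalized salience values sampled at the fixation locations."""
    s = np.asarray(salience_map, dtype=float)
    s = (s - s.mean()) / (s.std() + 1e-12)        # mean = 0, stdev = 1
    values = [s[r, c] for (r, c) in fixations]    # salience at each fixation
    return float(np.mean(values))
```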

Kullback-Leibler (KL) distance
• how well can fixated and non-fixated locations be distinguished, with no assumptions about decision criteria?

percentile
• within each salience map, what fraction of the values is exceeded by the value at the location of the human eye position?

KL = \frac{1}{2} \sum_i p_i(f)\,\log\frac{p_i(f)}{p_i(nf)} + \frac{1}{2} \sum_i p_i(nf)\,\log\frac{p_i(nf)}{p_i(f)}
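A sketch of the symmetric KL metric above, together with the percentile metric; the histogram binning and the small ε used to avoid empty bins are assumptions:

```python
import numpy as np

def kl_distance(fixated_vals, nonfixated_vals, n_bins=20, eps=1e-6):
    """Symmetric KL distance between the distributions of salience values
    at fixated vs. non-fixated locations: 0.5*KL(f||nf) + 0.5*KL(nf||f)."""
    lo = min(np.min(fixated_vals), np.min(nonfixated_vals))
    hi = max(np.max(fixated_vals), np.max(nonfixated_vals))
    bins = np.linspace(lo, hi, n_bins + 1)
    p_f, _ = np.histogram(fixated_vals, bins=bins)
    p_nf, _ = np.histogram(nonfixated_vals, bins=bins)
    p_f = (p_f + eps) / (p_f + eps).sum()          # smoothed, normalized histograms
    p_nf = (p_nf + eps) / (p_nf + eps).sum()
    return 0.5 * np.sum(p_f * np.log(p_f / p_nf)) + \
           0.5 * np.sum(p_nf * np.log(p_nf / p_f))

def percentile(salience_map, eye_pos):
    """Percentage of map values exceeded by the value at the human
    eye position (row, col)."""
    s = np.asarray(salience_map, dtype=float)
    return 100.0 * np.mean(s.ravel() < s[eye_pos])
```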

[Diagram: top-down model. Training phase: training movie frames (from multiple movies) are passed through a feature extractor (dyadic pyramids, fourier components, or random features) while eye tracking records eye positions p(t1) = (x1, y1), p(t2) = (x2, y2), p(t3) = (x3, y3), ...; the paired feature vectors f1, f2, f3, ... and eye positions form the training set for a learner. Testing phase: features extracted from a test movie drive the learned top-down prediction, which is combined (*) with the bottom-up prediction from the saliency model; the predicted eye positions are then compared and scored against the observed eye positions.]
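The "*" above denotes combining the bottom-up and top-down prediction maps. One simple way to realize it is pointwise multiplication followed by renormalization; the sketch below assumes that rule, which is not necessarily the exact combination used for the poster:

```python
import numpy as np

def combine_maps(bu_map, td_map, eps=1e-12):
    """Pointwise combination of a bottom-up map and a top-down map
    (e.g., BU*TD or BU*MEP), renormalized to sum to 1."""
    combo = np.asarray(bu_map, dtype=float) * np.asarray(td_map, dtype=float)
    return combo / (combo.sum() + eps)
```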

learner (candidate implementations):
• linear network (least-squares fitting)
• non-linear multilayer network (backprop)
• gaussian mixture model (EM fitting)
• support vector machine
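As an illustration, here is a sketch of the simplest of these learners, a linear network fit by least squares, mapping per-frame feature vectors to flattened eye-position density maps (the ridge term, bias column, and map representation are assumptions):

```python
import numpy as np

class LinearGistToGazeMap:
    """Least-squares linear mapping from per-frame feature vectors
    ("gist" signatures) to flattened eye-position density maps."""

    def __init__(self, ridge=1e-3):
        self.ridge = ridge
        self.W = None

    def fit(self, F, M):
        # F: (n_frames, n_features) feature vectors
        # M: (n_frames, map_height * map_width) flattened gaze-density maps
        F1 = np.hstack([F, np.ones((F.shape[0], 1))])      # append a bias column
        A = F1.T @ F1 + self.ridge * np.eye(F1.shape[1])   # regularized normal equations
        self.W = np.linalg.solve(A, F1.T @ M)
        return self

    def predict(self, F):
        F1 = np.hstack([F, np.ones((F.shape[0], 1))])
        return F1 @ self.W                                  # predicted gaze-density maps
```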

[Figure: mean bottom-up map and mean top-down map (pyramid features), on a common 0-1 color scale.]

[Figure: NSS vs. temporal offset between model and eye, from -3 s ("model leads eye") to +3 s ("eye leads model"), for the top-down and bottom-up models.]

[Diagram: bottom-up saliency model. The input image is filtered into multiscale feature maps (linear filtering) for five channels: intensity (I), orientation (O), color (C), flicker (F), and motion (M), each computed at center-surround scale pairs 2:5, 2:6, 3:6, 3:7, 4:7, 4:8. Center-surround maps (spatial competition) are pooled and normalized into conspicuity maps, and feature combination yields the saliency map.]
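A rough sketch of the multiscale center-surround step for a single feature channel, using the scale pairs listed above (the Gaussian pyramid, bilinear upsampling, and rectified difference are assumptions; the full model additionally applies the spatial competition and normalization steps named in the diagram):

```python
import numpy as np
from scipy import ndimage

CS_PAIRS = [(2, 5), (2, 6), (3, 6), (3, 7), (4, 7), (4, 8)]  # center:surround scale pairs

def gaussian_pyramid(img, levels=9):
    """Dyadic Gaussian pyramid; level 0 is the input map."""
    pyr = [np.asarray(img, dtype=float)]
    for _ in range(1, levels):
        blurred = ndimage.gaussian_filter(pyr[-1], sigma=1.0)
        pyr.append(blurred[::2, ::2])                 # downsample by 2
    return pyr

def center_surround_maps(channel_map):
    """Across-scale center-surround differences for one feature channel."""
    pyr = gaussian_pyramid(channel_map)
    out = []
    for c, s in CS_PAIRS:
        zoom = (pyr[c].shape[0] / pyr[s].shape[0],
                pyr[c].shape[1] / pyr[s].shape[1])
        surround = ndimage.zoom(pyr[s], zoom, order=1)  # upsample surround to center scale
        h = min(pyr[c].shape[0], surround.shape[0])
        w = min(pyr[c].shape[1], surround.shape[1])
        out.append(np.abs(pyr[c][:h, :w] - surround[:h, :w]))  # rectified difference
    return out
```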

[Figure: bottom-up results. KL distance, NSS, and percentile scores for each feature map: entropy (E), variance (V), orientation (O), intensity (I), color (C), static saliency (C,I,O), flicker (F), motion (M), and full saliency (C,I,O,F,M); shown for all games, for racing vs. exploring games, and for men vs. women.]

[Figure: sample frames showing the input and the corresponding entropy, color, flicker, and full saliency maps.]

[Figure: top-down results. KL distance (mean ± s.e.m.), NSS, and percentile scores for five conditions: bottom-up alone (BU); BU * MEP (mean eye position); BU * RND (fake random features); BU * FTD (top-down, fourier features); BU * PTD (top-down, pyramid features).]

We use fully-computational, autonomous models to predict eye movements in a dynamic and interactive visual task with naturalistic stimuli. We move beyond purely stimulus-driven bottom-up models, and introduce a simple model that captures task-dependent top-down influences on eye movements.

Previous studies have either relied on qualitative/descriptive models, or have not used naturalistic interactive stimuli.

STIMULUS TYPE × MODEL TYPE (qualitative vs. quantitative):

static, artificial (gabor patches, search arrays)
  qualitative: Treisman & Gelade 1980; Wolfe & Horowitz 2004
  quantitative: Rao et al 2002; Najemnik & Geisler 2005; Navalpakkam & Itti 2006

static, natural (photographs)
  qualitative: Yarbus 1967; Parker 1978; Mannan, Ruddock & Wooding 1997; Rayner 1998; Voge 1999
  quantitative: Privitera & Stark 1998; Reinagel & Zador 1999; Parkhurst, Law & Niebur 2002; Torralba 2003; Peters et al 2005; Navalpakkam & Itti 2005; Pomplun 2006

dynamic, natural (movies, cartoons)
  qualitative: Tosi, Mecacci & Pasqual 1997; May, Dean & Barnard 2003; Peli, Goldstein & Woods 2005
  quantitative: Carmi & Itti 2004; Itti & Baldi 2005

interactive, natural (video games, flying or driving simulators, virtual reality, actual reality)
  qualitative: Land, Mennie & Rusted 1999; Land & Hayhoe 2001; Hayhoe et al 2002; Hayhoe et al 2003
  quantitative: this study

[Diagram: overall method. PSYCHOPHYSICS: the subject plays Nintendo GameCube games (Mario Kart, Wave Race, Super Mario Sunshine, Hulk, Pac Man World) with a game controller; a stimulus display system with framegrabber (dual-CPU Linux with SCHED_FIFO timing) drives a CRT (30 Hz frame rate), while an infrared camera and an eyetracker system (ISCAN, Inc.) record gaze during task interaction. Recorded data: video frames (216,000 frames; 185 GB) and eye movement traces (240 Hz; 1,740,972 samples). ANALYSIS: heuristic analysis on a multi-CPU Linux cluster computes saliency maps (entropy, variance, C, I, O, F, M, CIO, CIOFM; local-only and outlier-based) from the video signal; observed vs. predicted eye positions are compared with the KL, NSS, and percentile metrics, and a similar set of comparisons is run for each saliency map in turn.]

[Figure: top-down predictions. For sample frames from Super Mario Kart, Pac Man World, Super Mario Sunshine, Wave Race, and Hulk, each row shows the input, the mean eye position (MEP), the top-down model (TD), the bottom-up model (BU), the BU*MEP combo, and the BU*TD combo; markers indicate the human eye position, the BU map peak, the MEP or TD peak, and the combo peak.]

• eye movements recorded while subjects play video games
• eye movements compared with model predictions

• input processed through as many as 5 multi-scale feature channels

• different versions of the model include different combinations of features

• salience is a global property: outliers detected through coarse global competition

• sample frames illustrate the output of the models
• for illustration, these frames are selected at the time when a saccade is just beginning
• yellow arrows indicate the saccade path

• models score significantly above chance
• dynamic, global features work best
• but scores in this interactive task are lower than in previous passive-viewing experiments

fourier features
• 384 features from the fourier transform
• fft log-magnitude converted to cartesian (θ,ω) space
• sample at 24x16 (θ,ω) locations
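A sketch of such a Fourier "gist" vector: the exact (θ,ω) sampling scheme below is an assumption; only the feature count, the log-magnitude, and the 24x16 (θ,ω) grid come from the poster.

```python
import numpy as np

def fourier_gist(gray_frame, n_theta=24, n_omega=16):
    """384-dimensional Fourier gist: sample the log-magnitude of the
    2-D FFT on a 24x16 grid of (orientation, spatial frequency) points."""
    f = np.fft.fftshift(np.fft.fft2(np.asarray(gray_frame, dtype=float)))
    logmag = np.log1p(np.abs(f))
    h, w = logmag.shape
    cy, cx = h // 2, w // 2
    r_max = min(cy, cx) - 1
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)   # orientation theta
    omegas = np.linspace(1, r_max, n_omega)                   # frequency radius omega
    feats = np.empty(n_theta * n_omega)
    k = 0
    for t in thetas:
        for r in omegas:
            y = int(round(cy + r * np.sin(t)))
            x = int(round(cx + r * np.cos(t)))
            feats[k] = logmag[y, x]
            k += 1
    return feats
```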

dyadic pyramid features
• 7 pyramids: 1 luminance, 2 color, 4 orientation
• 2 scales per pyramid: coarse and fine
• 32 features per scale: 4x4 array of local means and 4x4 array of local variances
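A sketch of the pyramid gist features under the counts above, assuming the 7 channel maps (luminance, color, orientation) are computed elsewhere; the crude coarse-scale subsampling is an assumption:

```python
import numpy as np

def grid_stats(img, grid=4):
    """4x4 grid of local means followed by a 4x4 grid of local variances."""
    h, w = img.shape
    means, variances = [], []
    for i in range(grid):
        for j in range(grid):
            block = img[i * h // grid:(i + 1) * h // grid,
                        j * w // grid:(j + 1) * w // grid]
            means.append(block.mean())
            variances.append(block.var())
    return means + variances                      # 32 features per map

def pyramid_gist(channel_maps):
    """channel_maps: 7 maps (1 luminance, 2 color, 4 orientation), each
    contributing 32 features at a coarse and a fine scale per frame."""
    feats = []
    for m in channel_maps:
        fine = np.asarray(m, dtype=float)
        coarse = fine[::4, ::4]                   # stand-in for the coarse pyramid level
        feats.extend(grid_stats(fine))
        feats.extend(grid_stats(coarse))
    return np.array(feats)
```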

• model learns to associate image “gist” signatures with corresponding eye position density maps
• in testing, a leave-one-out approach is used: for each test clip, the remaining 23 clips are used for training
• therefore, the model must be able to generalize across game types in order to successfully predict eye positions
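A sketch of this leave-one-out protocol, reusing the hypothetical LinearGistToGazeMap learner sketched earlier (the clip data structure and the scoring hook are assumptions):

```python
import numpy as np

def leave_one_out(clips, make_learner, score):
    """clips: list of (features, gaze_maps) pairs, one per game session.
    For each test clip, train on all remaining clips and score predictions."""
    results = []
    for i, (F_test, M_test) in enumerate(clips):
        F_train = np.vstack([F for j, (F, _) in enumerate(clips) if j != i])
        M_train = np.vstack([M for j, (_, M) in enumerate(clips) if j != i])
        learner = make_learner().fit(F_train, M_train)
        results.append(score(learner.predict(F_test), M_test))
    return results

# usage sketch (names hypothetical):
# scores = leave_one_out(clips, LinearGistToGazeMap, some_score_fn)
```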

• sample frames illustrating output of:
  - bottom-up (BU) model alone
  - mean eye position (MEP) (a control condition)
  - BU combined with MEP
  - top-down model (TD) based on pyramid features
  - BU combined with TD

• average bottom-up and top-down maps across all frames
• bottom-up map reflects activity in the screen corners (game score, time counter, etc.) that is largely ignored by observers

• a simple mean-eye-position control improves upon purely bottom-up performance by a factor of 3-4x
• but the full top-down models perform best

• the bottom-up model is most predictive of eye position with a delay of about 250 ms

• the top-down model slightly lags behind eye position (due to temporal averaging in the model)
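A sketch of how such a lead/lag curve could be computed, by shifting the eye-position trace relative to per-frame normalized model maps (the sign convention and frame indexing are assumptions):

```python
import numpy as np

def nss_vs_offset(norm_maps, fixations, offsets):
    """NSS as a function of temporal offset (in frames) between the model
    maps and the eye trace. Positive offset: the eye sample is taken
    'offset' frames after the map frame (model leads eye)."""
    curve = []
    for d in offsets:
        vals = []
        for t, m in enumerate(norm_maps):
            if 0 <= t + d < len(fixations):
                r, c = fixations[t + d]
                vals.append(m[r, c])
        curve.append(np.mean(vals) if vals else np.nan)
    return curve
```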

• Goal was to explain gaze behavior with fully-computational models that don’t know about “objects” or “actions”
• Purely bottom-up models are able to predict eye position significantly better than chance, but not as well as in passive-viewing conditions
• So, some other factors must be influencing eye position in our interactive task
• Introduced a simple model for learning top-down, task-dependent influences on eye position
• Combining bottom-up and top-down mechanisms leads to significantly improved eye position prediction
• This kind of model could be used in autonomous machine vision situations, such as interactive virtual environments

• models score better for exploring than for racing games
• static color (C) is best for racing games
• dynamic motion (M) is best for exploring games

• men were more bottom-up driven than were women (but note that N is small: 3 men, 2 women)

• Nintendo GameCube games
• 24 sessions, 5 minutes each

• three different metrics used to test how well the models predict human eye movements


http://ilab.usc.edu/rjpeters/pubs/2006_VSS_Peters.pdf

Supported by the National Geospatial-Intelligence Agency (NGA) and the Intelligence Community (IC) Postdoctoral Research Fellowship Program