A Computational Model of Task-Dependent Influences on Eye Position

Robert J. Peters and Laurent Itti
University of Southern California, Computer Science, Neuroscience


Poster panels: introduction • videogame stimuli • bottom-up model • metrics: model scoring • bottom-up results • top-down model • feature extraction • top-down predictions • top-down results • summary • overall method

[Figure: sample image, salience map, and human fixation locations. The human scanpath is overlaid on the normalized salience map; the normalized salience values at the fixations (1.019, 2.701, 3.222, 1.777, 1.588, 0.877, 0.096, 1.067, 0.615, ...) yield a normalized scanpath salience of 1.304, which is compared against the distribution of salience values (µ = 0, σ² = 1). Histograms of salience at "fixated" vs. "non-fixated" locations, and their per-bin KL contributions, give KL dist = 0.08.]

normalized scanpath salience (NSS)
• each salience map (or control map) is normalized to have mean=0 and stdev=1
• the human scanpath is overlaid on the normalized salience map
• the normalized salience value is extracted at each fixation location
• these values are averaged across the scanpath to give the normalized scanpath salience (NSS)
• NSS can be compared with the distribution of random salience values
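Below is a minimal sketch of the NSS computation just described (NumPy-based; the function and argument names are illustrative, not the authors' code):

```python
import numpy as np

def normalized_scanpath_salience(salience_map, fixations):
    """NSS: normalize the map to mean 0 / stdev 1, then average the
    normalized salience values sampled at the fixation locations."""
    s = np.asarray(salience_map, dtype=float)
    s = (s - s.mean()) / (s.std() + 1e-12)        # mean = 0, stdev = 1
    values = [s[r, c] for (r, c) in fixations]    # salience at each fixation
    return float(np.mean(values))
```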

Kullback-Leibler (KL) distance
• how well can fixated and non-fixated locations be distinguished, with no assumptions about decision criteria?

percentile
• within each salience map, what fraction of the values is exceeded by the value at the location of the human eye position?

KL = \frac{1}{2} \sum_i p_i(f)\,\log\frac{p_i(f)}{p_i(nf)} + \frac{1}{2} \sum_i p_i(nf)\,\log\frac{p_i(nf)}{p_i(f)}
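A sketch of the symmetric KL metric above, together with the percentile metric; the histogram binning and the small ε used to avoid empty bins are assumptions:

```python
import numpy as np

def kl_distance(fixated_vals, nonfixated_vals, n_bins=20, eps=1e-6):
    """Symmetric KL distance between the distributions of salience values
    at fixated vs. non-fixated locations: 0.5*KL(f||nf) + 0.5*KL(nf||f)."""
    lo = min(np.min(fixated_vals), np.min(nonfixated_vals))
    hi = max(np.max(fixated_vals), np.max(nonfixated_vals))
    bins = np.linspace(lo, hi, n_bins + 1)
    p_f, _ = np.histogram(fixated_vals, bins=bins)
    p_nf, _ = np.histogram(nonfixated_vals, bins=bins)
    p_f = (p_f + eps) / (p_f + eps).sum()          # smoothed, normalized histograms
    p_nf = (p_nf + eps) / (p_nf + eps).sum()
    return 0.5 * np.sum(p_f * np.log(p_f / p_nf)) + \
           0.5 * np.sum(p_nf * np.log(p_nf / p_f))

def percentile(salience_map, eye_pos):
    """Percentage of map values exceeded by the value at the human
    eye position (row, col)."""
    s = np.asarray(salience_map, dtype=float)
    return 100.0 * np.mean(s.ravel() < s[eye_pos])
```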

[Diagram: top-down model. Training phase: training movie frames (from multiple movies) are passed through a feature extractor (dyadic pyramids, fourier components, or random features) while eye tracking records eye positions p(t1) = (x1, y1), p(t2) = (x2, y2), p(t3) = (x3, y3), ...; the paired feature vectors f1, f2, f3, ... and eye positions form the training set for a learner. Testing phase: features extracted from a test movie drive the learned top-down prediction, which is combined (*) with the bottom-up prediction from the saliency model; the predicted eye positions are then compared and scored against the observed eye positions.]
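The "*" above denotes combining the bottom-up and top-down prediction maps. One simple way to realize it is pointwise multiplication followed by renormalization; the sketch below assumes that rule, which is not necessarily the exact combination used for the poster:

```python
import numpy as np

def combine_maps(bu_map, td_map, eps=1e-12):
    """Pointwise combination of a bottom-up map and a top-down map
    (e.g., BU*TD or BU*MEP), renormalized to sum to 1."""
    combo = np.asarray(bu_map, dtype=float) * np.asarray(td_map, dtype=float)
    return combo / (combo.sum() + eps)
```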

learner (candidate implementations):
• linear network (least-squares fitting)
• non-linear multilayer network (backprop)
• gaussian mixture model (EM fitting)
• support vector machine
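As an illustration, here is a sketch of the simplest of these learners, a linear network fit by least squares, mapping per-frame feature vectors to flattened eye-position density maps (the ridge term, bias column, and map representation are assumptions):

```python
import numpy as np

class LinearGistToGazeMap:
    """Least-squares linear mapping from per-frame feature vectors
    ("gist" signatures) to flattened eye-position density maps."""

    def __init__(self, ridge=1e-3):
        self.ridge = ridge
        self.W = None

    def fit(self, F, M):
        # F: (n_frames, n_features) feature vectors
        # M: (n_frames, map_height * map_width) flattened gaze-density maps
        F1 = np.hstack([F, np.ones((F.shape[0], 1))])      # append a bias column
        A = F1.T @ F1 + self.ridge * np.eye(F1.shape[1])   # regularized normal equations
        self.W = np.linalg.solve(A, F1.T @ M)
        return self

    def predict(self, F):
        F1 = np.hstack([F, np.ones((F.shape[0], 1))])
        return F1 @ self.W                                  # predicted gaze-density maps
```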

[Figure: mean bottom-up map and mean top-down map (pyramid features), on a common 0-1 color scale.]

[Figure: NSS vs. temporal offset between model and eye, from -3 s ("model leads eye") to +3 s ("eye leads model"), for the top-down and bottom-up models.]

[Diagram: bottom-up saliency model. The input image is filtered into multiscale feature maps (linear filtering) for five channels: intensity (I), orientation (O), color (C), flicker (F), and motion (M), each computed at center-surround scale pairs 2:5, 2:6, 3:6, 3:7, 4:7, 4:8. Center-surround maps (spatial competition) are pooled and normalized into conspicuity maps, and feature combination yields the saliency map.]
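A rough sketch of the multiscale center-surround step for a single feature channel, using the scale pairs listed above (the Gaussian pyramid, bilinear upsampling, and rectified difference are assumptions; the full model additionally applies the spatial competition and normalization steps named in the diagram):

```python
import numpy as np
from scipy import ndimage

CS_PAIRS = [(2, 5), (2, 6), (3, 6), (3, 7), (4, 7), (4, 8)]  # center:surround scale pairs

def gaussian_pyramid(img, levels=9):
    """Dyadic Gaussian pyramid; level 0 is the input map."""
    pyr = [np.asarray(img, dtype=float)]
    for _ in range(1, levels):
        blurred = ndimage.gaussian_filter(pyr[-1], sigma=1.0)
        pyr.append(blurred[::2, ::2])                 # downsample by 2
    return pyr

def center_surround_maps(channel_map):
    """Across-scale center-surround differences for one feature channel."""
    pyr = gaussian_pyramid(channel_map)
    out = []
    for c, s in CS_PAIRS:
        zoom = (pyr[c].shape[0] / pyr[s].shape[0],
                pyr[c].shape[1] / pyr[s].shape[1])
        surround = ndimage.zoom(pyr[s], zoom, order=1)  # upsample surround to center scale
        h = min(pyr[c].shape[0], surround.shape[0])
        w = min(pyr[c].shape[1], surround.shape[1])
        out.append(np.abs(pyr[c][:h, :w] - surround[:h, :w]))  # rectified difference
    return out
```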

[Figure: bottom-up results. KL distance, NSS, and percentile scores for each feature map: entropy (E), variance (V), orientation (O), intensity (I), color (C), static saliency (C,I,O), flicker (F), motion (M), and full saliency (C,I,O,F,M); shown for all games, for racing vs. exploring games, and for men vs. women.]

[Figure: sample frames showing the input and the corresponding entropy, color, flicker, and full saliency maps.]

[Figure: top-down results. KL distance (mean ± s.e.m.), NSS, and percentile scores for five conditions: bottom-up alone (BU); BU * MEP (mean eye position); BU * RND (fake random features); BU * FTD (top-down, fourier features); BU * PTD (top-down, pyramid features).]

We use fully-computational, autonomous models to predict eye movements in a dynamic and interactive visual task with naturalistic stimuli. We move beyond purely stimulus-driven bottom-up models, and introduce a simple model that captures task-dependent top-down influences on eye movements.

Previous studies have either relied on qualitative/descriptive models, or have not used naturalistic interactive stimuli.

STIMULUS TYPE × MODEL TYPE (qualitative vs. quantitative):

static, artificial (gabor patches, search arrays)
  qualitative: Treisman & Gelade 1980; Wolfe & Horowitz 2004
  quantitative: Rao et al 2002; Najemnik & Geisler 2005; Navalpakkam & Itti 2006

static, natural (photographs)
  qualitative: Yarbus 1967; Parker 1978; Mannan, Ruddock & Wooding 1997; Rayner 1998; Voge 1999
  quantitative: Privitera & Stark 1998; Reinagel & Zador 1999; Parkhurst, Law & Niebur 2002; Torralba 2003; Peters et al 2005; Navalpakkam & Itti 2005; Pomplun 2006

dynamic, natural (movies, cartoons)
  qualitative: Tosi, Mecacci & Pasqual 1997; May, Dean & Barnard 2003; Peli, Goldstein & Woods 2005
  quantitative: Carmi & Itti 2004; Itti & Baldi 2005

interactive, natural (video games, flying or driving simulators, virtual reality, actual reality)
  qualitative: Land, Mennie & Rusted 1999; Land & Hayhoe 2001; Hayhoe et al 2002; Hayhoe et al 2003
  quantitative: this study

[Diagram: overall method. PSYCHOPHYSICS: the subject plays Nintendo GameCube games (Mario Kart, Wave Race, Super Mario Sunshine, Hulk, Pac Man World) with a game controller; a stimulus display system with framegrabber (dual-CPU Linux with SCHED_FIFO timing) drives a CRT (30 Hz frame rate), while an infrared camera and an eyetracker system (ISCAN, Inc.) record gaze during task interaction. Recorded data: video frames (216,000 frames; 185 GB) and eye movement traces (240 Hz; 1,740,972 samples). ANALYSIS: heuristic analysis on a multi-CPU Linux cluster computes saliency maps (entropy, variance, C, I, O, F, M, CIO, CIOFM; local-only and outlier-based) from the video signal; observed vs. predicted eye positions are compared with the KL, NSS, and percentile metrics, and a similar set of comparisons is run for each saliency map in turn.]

[Figure: top-down predictions. For sample frames from Super Mario Kart, Pac Man World, Super Mario Sunshine, Wave Race, and Hulk, each row shows the input, the mean eye position (MEP), the top-down model (TD), the bottom-up model (BU), the BU*MEP combo, and the BU*TD combo; markers indicate the human eye position, the BU map peak, the MEP or TD peak, and the combo peak.]

• eye movements recorded while subjects play video games
• eye movements compared with model predictions

• input processed through as many as 5 multi-scale feature channels

• different versions of the model include different combinations of features

• salience is a global property: outliers detected through coarse global competition

• sample frames illustrate the output of the models
• for illustration, these frames are selected at the time when a saccade is just beginning
• yellow arrows indicate the saccade path

• models score significantly above chance
• dynamic, global features work best
• but scores in this interactive task are lower than in previous passive-viewing experiments

fourier features
• 384 features from the fourier transform
• fft log-magnitude converted to cartesian (θ,ω) space
• sample at 24x16 (θ,ω) locations
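A sketch of such a Fourier "gist" vector: the exact (θ,ω) sampling scheme below is an assumption; only the feature count, the log-magnitude, and the 24x16 (θ,ω) grid come from the poster.

```python
import numpy as np

def fourier_gist(gray_frame, n_theta=24, n_omega=16):
    """384-dimensional Fourier gist: sample the log-magnitude of the
    2-D FFT on a 24x16 grid of (orientation, spatial frequency) points."""
    f = np.fft.fftshift(np.fft.fft2(np.asarray(gray_frame, dtype=float)))
    logmag = np.log1p(np.abs(f))
    h, w = logmag.shape
    cy, cx = h // 2, w // 2
    r_max = min(cy, cx) - 1
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)   # orientation theta
    omegas = np.linspace(1, r_max, n_omega)                   # frequency radius omega
    feats = np.empty(n_theta * n_omega)
    k = 0
    for t in thetas:
        for r in omegas:
            y = int(round(cy + r * np.sin(t)))
            x = int(round(cx + r * np.cos(t)))
            feats[k] = logmag[y, x]
            k += 1
    return feats
```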

dyadic pyramid features
• 7 pyramids: 1 luminance, 2 color, 4 orientation
• 2 scales per pyramid: coarse and fine
• 32 features per scale: 4x4 array of local means and 4x4 array of local variances
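A sketch of the pyramid gist features under the counts above, assuming the 7 channel maps (luminance, color, orientation) are computed elsewhere; the crude coarse-scale subsampling is an assumption:

```python
import numpy as np

def grid_stats(img, grid=4):
    """4x4 grid of local means followed by a 4x4 grid of local variances."""
    h, w = img.shape
    means, variances = [], []
    for i in range(grid):
        for j in range(grid):
            block = img[i * h // grid:(i + 1) * h // grid,
                        j * w // grid:(j + 1) * w // grid]
            means.append(block.mean())
            variances.append(block.var())
    return means + variances                      # 32 features per map

def pyramid_gist(channel_maps):
    """channel_maps: 7 maps (1 luminance, 2 color, 4 orientation), each
    contributing 32 features at a coarse and a fine scale per frame."""
    feats = []
    for m in channel_maps:
        fine = np.asarray(m, dtype=float)
        coarse = fine[::4, ::4]                   # stand-in for the coarse pyramid level
        feats.extend(grid_stats(fine))
        feats.extend(grid_stats(coarse))
    return np.array(feats)
```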

• model learns to associate image “gist” signatures with corresponding eye position density maps
• in testing, a leave-one-out approach is used: for each test clip, the remaining 23 clips are used for training
• therefore, the model must be able to generalize across game types in order to successfully predict eye positions
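A sketch of this leave-one-out protocol, reusing the hypothetical LinearGistToGazeMap learner sketched earlier (the clip data structure and the scoring hook are assumptions):

```python
import numpy as np

def leave_one_out(clips, make_learner, score):
    """clips: list of (features, gaze_maps) pairs, one per game session.
    For each test clip, train on all remaining clips and score predictions."""
    results = []
    for i, (F_test, M_test) in enumerate(clips):
        F_train = np.vstack([F for j, (F, _) in enumerate(clips) if j != i])
        M_train = np.vstack([M for j, (_, M) in enumerate(clips) if j != i])
        learner = make_learner().fit(F_train, M_train)
        results.append(score(learner.predict(F_test), M_test))
    return results

# usage sketch (names hypothetical):
# scores = leave_one_out(clips, LinearGistToGazeMap, some_score_fn)
```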

• sample frames illustrating output of:
  - bottom-up (BU) model alone
  - mean eye position (MEP) (a control condition)
  - BU combined with MEP
  - top-down model (TD) based on pyramid features
  - BU combined with TD

• average bottom-up and top-down maps across all frames
• bottom-up map reflects activity in the screen corners (game score, time counter, etc.) that is largely ignored by observers

• a simple mean-eye-position control improves upon purely bottom-up performance by a factor of 3-4x
• but the full top-down models perform best

• the bottom-up model is most predictive of eye position with a delay of about 250 ms

• the top-down model slightly lags behind eye position (due to temporal averaging in the model)
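A sketch of how such a lead/lag curve could be computed, by shifting the eye-position trace relative to per-frame normalized model maps (the sign convention and frame indexing are assumptions):

```python
import numpy as np

def nss_vs_offset(norm_maps, fixations, offsets):
    """NSS as a function of temporal offset (in frames) between the model
    maps and the eye trace. Positive offset: the eye sample is taken
    'offset' frames after the map frame (model leads eye)."""
    curve = []
    for d in offsets:
        vals = []
        for t, m in enumerate(norm_maps):
            if 0 <= t + d < len(fixations):
                r, c = fixations[t + d]
                vals.append(m[r, c])
        curve.append(np.mean(vals) if vals else np.nan)
    return curve
```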

• Goal was to explain gaze behavior with fully-computational models that don’t know about “objects” or “actions”
• Purely bottom-up models are able to predict eye position significantly better than chance, but not as well as in passive-viewing conditions
• So, some other factors must be influencing eye position in our interactive task
• Introduced a simple model for learning top-down, task-dependent influences on eye position
• Combining bottom-up and top-down mechanisms leads to significantly improved eye position prediction
• This kind of model could be used in autonomous machine vision situations, such as interactive virtual environments

• models score better for exploring than for racing games
• static color (C) is best for racing games
• dynamic motion (M) is best for exploring games

• men were more bottom-up driven than were women (but note that N is small: 3 men, 2 women)

• Nintendo GameCube games
• 24 sessions, 5 minutes each

• three different metrics used to test how well the models predict human eye movements


http://ilab.usc.edu/rjpeters/pubs/2006_VSS_Peters.pdf

Supported by the National Geospatial-Intelligence Agency (NGA) and the Intelligence Community (IC) Postdoctoral Research Fellowship Program