poster sections:
• introduction
• videogame stimuli
• bottom-up model
• metrics: model scoring
• bottom-up results
• top-down model
• feature extraction
• top-down predictions
• top-down results
• summary
• overall method
• sample image
[Figure: KL metric illustration. Left: probability histograms of "salience" values at "fixated" vs. "non-fixated" locations. Right: per-bin KL contribution vs. "salience"; KL dist = 0.08 in this example]
[Figure: NSS metric illustration. The distribution of salience values is normalized to µ = 0, σ² = 1; the human scanpath is overlaid on the normalized salience map; from the salience map and fixation locations, normalized salience values are read off at each fixation, giving a normalized scanpath salience of 1.304 in this example]
normalized scanpath salience (NSS)
• each salience map (or control map) is normalized to have mean = 0 and stdev = 1
• the human scanpath is overlaid on the normalized salience map
• the normalized salience value is extracted at each fixation location
• these values are summed to give the normalized scanpath salience (NSS)
• NSS can be compared with the distribution of random salience values (see the sketch below)
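A minimal sketch of the NSS computation in NumPy, following the steps above (taking the mean over fixations is one common convention for the final combination step, so that chance performance is 0 regardless of scanpath length):

```python
import numpy as np

def nss(salience_map, fixations):
    """Normalized scanpath salience (NSS).

    salience_map : 2-D array of salience values
    fixations    : list of (row, col) fixation locations
    """
    # normalize the map to mean = 0, stdev = 1
    s = (salience_map - salience_map.mean()) / salience_map.std()
    # extract the normalized salience value at each fixation location
    values = [s[r, c] for (r, c) in fixations]
    # combine across fixations (mean shown here; the poster text says
    # "summed", which differs only by a constant factor per scanpath)
    return np.mean(values)
```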
Kullback-Leibler (KL) distance
• how well can fixated and non-fixated locations be distinguished, with no assumptions about decision criteria?

$KL = \frac{1}{2} \sum_i p_i(f) \log\frac{p_i(f)}{p_i(nf)} + \frac{1}{2} \sum_i p_i(nf) \log\frac{p_i(nf)}{p_i(f)}$

percentile
• within each salience map, what fraction of the values is exceeded by the value at the location of the human eye position?
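A sketch of these two metrics (the histogram bin count and the ε-regularization are illustrative choices, not taken from the poster):

```python
import numpy as np

def kl_distance(fixated, nonfixated, bins=10, eps=1e-9):
    """Symmetric KL distance between salience histograms at fixated
    vs. non-fixated locations, per the formula above."""
    lo = min(fixated.min(), nonfixated.min())
    hi = max(fixated.max(), nonfixated.max())
    pf, _  = np.histogram(fixated,    bins=bins, range=(lo, hi), density=True)
    pnf, _ = np.histogram(nonfixated, bins=bins, range=(lo, hi), density=True)
    pf, pnf = pf + eps, pnf + eps             # avoid log(0)
    pf, pnf = pf / pf.sum(), pnf / pnf.sum()  # renormalize to probabilities
    return 0.5 * np.sum(pf * np.log(pf / pnf)) + 0.5 * np.sum(pnf * np.log(pnf / pf))

def percentile_score(salience_map, eye_pos):
    """Fraction of map values exceeded by the value at the eye position
    (chance level is 50)."""
    v = salience_map[eye_pos]
    return 100.0 * np.mean(salience_map < v)
```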
[Figure: top-down model architecture]
training phase: training movie frames (from multiple movies) are passed through a feature extractor (dyadic pyramids, fourier components, or random features), giving one feature vector per frame (f1, f2, f3, ...); eye tracking gives the corresponding eye positions (p(t1) = (x1, y1), p(t2) = (x2, y2), p(t3) = (x3, y3), ...); the feature vectors and eye positions together form the training set for the learner.
testing phase: features from the test movie feed the learned top-down prediction, while the saliency model gives a bottom-up prediction; the two predictions can also be combined ("*"); predicted eye positions are compared and scored against the observed eye positions.
learner (one of the following; the simplest option is sketched below):
• linear network (least-squares fitting)
• non-linear multilayer network (backprop)
• gaussian mixture model (EM fitting)
• support vector machine
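As a concrete instance of the first option, a minimal sketch of a linear network fit by least squares, mapping per-frame feature vectors to flattened eye-position density maps (the array shapes and bias handling are illustrative assumptions):

```python
import numpy as np

# X: (n_frames, n_features) feature vectors from the feature extractor
# Y: (n_frames, map_h * map_w) flattened eye-position density maps
def fit_linear(X, Y):
    # least-squares weights, with a constant bias column appended
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W

def predict_linear(W, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ W  # predicted density maps, one row per frame
```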
[Figure: mean bottom-up map and mean top-down map (pyramid features)]
[Figure: NSS as a function of time shift between model and eye position, from "model leads eye" to "eye leads model" (-3 s to +3 s), for the top-down and bottom-up models]
[Figure: bottom-up saliency model]
• input image
• multiscale feature maps via linear filtering: intensity (I), orientation (O), color (C), flicker (F), motion (M), each over a dyadic pyramid
• center-surround maps via spatial competition, using center:surround scale pairs 2:5, 2:6, 3:6, 3:7, 4:7, 4:8
• conspicuity maps via pooling & normalization
• feature combination into the saliency map
(a sketch of the center-surround step follows)
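A sketch of the across-scale center-surround step, assuming a plain dyadic pyramid and absolute differences (the real model uses filtered feature pyramids per channel; the scale pairs are taken from the figure):

```python
import numpy as np
from scipy.ndimage import zoom

def dyadic_pyramid(img, levels=9):
    """Simple dyadic pyramid by repeated 2x downsampling (a stand-in
    for the model's filtered pyramids; no anti-alias blur for brevity)."""
    pyr = [img.astype(float)]
    for _ in range(levels - 1):
        pyr.append(pyr[-1][::2, ::2])
    return pyr

def center_surround(pyr, pairs=((2, 5), (2, 6), (3, 6), (3, 7), (4, 7), (4, 8))):
    """Center-surround maps as across-scale differences, using the
    center:surround scale pairs 2:5 ... 4:8 from the figure."""
    maps = []
    for c, s in pairs:
        up = zoom(pyr[s], 2 ** (s - c), order=1)  # upsample surround to center scale
        h, w = pyr[c].shape
        maps.append(np.abs(pyr[c] - up[:h, :w]))
    return maps
```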
[Figure: bottom-up results. KL distance, NSS, and percentile scores for each model variant (entropy (E), variance (V), orientation (O), intensity (I), color (C), static saliency (C,I,O), flicker (F), motion (M), full saliency (C,I,O,F,M)), shown for all games, for racing vs. exploring games, and for men vs. women]
[Figure: sample frames showing the input alongside the entropy, color, flicker, and full saliency maps]
[Figure: top-down results. KL distance (mean ± s.e.m.), NSS, and percentile scores for five models: bottom-up (BU); BU*MEP (mean eye position); BU*RND (fake random features); BU*FTD (top-down, fourier features); BU*PTD (top-down, pyramid features)]
We use fully-computational, autonomous models to predict eye movements in a dynamic and interactive visual task with naturalistic stimuli. We move beyond purely stimulus-driven bottom-up models and introduce a simple model that captures task-dependent top-down influences on eye movements.
Previous studies have either relied on qualitative/descriptive models or have not used naturalistic interactive stimuli.
STIMULUS TYPE × MODEL TYPE (qualitative / quantitative):
• static, artificial (gabor patches, search arrays)
  qualitative: Treisman & Gelade 1980; Wolfe & Horowitz 2004
  quantitative: Rao et al 2002; Najemnik & Geisler 2005; Navalpakkam & Itti 2006
• static, natural (photographs)
  qualitative: Yarbus 1967; Parker 1978; Mannan, Ruddock & Wooding 1997; Rayner 1998; Voge 1999
  quantitative: Privitera & Stark 1998; Reinagel & Zador 1999; Parkhurst, Law & Niebur 2002; Torralba 2003; Peters et al 2005; Navalpakkam & Itti 2005; Pomplun 2006
• dynamic, natural (movies, cartoons)
  qualitative: Tosi, Mecacci & Pasqual 1997; May, Dean & Barnard 2003; Peli, Goldstein & Woods 2005
  quantitative: Carmi & Itti 2004; Itti & Baldi 2005
• interactive, natural (video games, flying or driving simulators, virtual reality, actual reality)
  qualitative: Land, Mennie & Rusted 1999; Land & Hayhoe 2001; Hayhoe et al 2002; Hayhoe et al 2003
  quantitative: this study
[Figure: experimental setup]
• Nintendo GameCube (Mario Kart, Wave Race, Super Mario Sunshine, Hulk, Pac Man World) with game controller
• stimulus display system with framegrabber (dual-CPU Linux with SCHED_FIFO timing) driving a CRT (30 Hz frame rate)
• eyetracker system (ISCAN, Inc.) with infrared camera aimed at the subject
• recorded video frames (216,000 frames; 185 GB)
• recorded eye movement traces (240 Hz; 1,740,972 samples)
[Figure: overall method]
PSYCHOPHYSICS: the subject's task interaction produces a video signal and an observed eye position.
ANALYSIS: heuristic analysis (multi-CPU Linux cluster) computes saliency maps (Entropy, Variance, C, I, O, F, M, CIO, CIOFM) from the video signal, in local-only and outlier-based variants, giving a predicted eye position; the observed vs. predicted comparison uses the KL, NSS, and percentile metrics; a similar set of comparisons is run for each saliency map in turn.
[Figure: sample frames from Super Mario Kart, Pac Man World, Super Mario Sunshine, Wave Race, and Hulk. For each game: input, mean eye position MEP, top-down model TD, bottom-up model BU, BU*MEP combo, BU*TD combo; markers show the human eye position, the BU map peak, the MEP or TD peak, and the combo peak]
• eye movements recorded while subjects play video games
• eye movements compared with model predictions
• input processed through as many as 5 multi-scale feature channels
• different versions of the model include different combinations of features
• salience is a global property: outliers detected through coarse global competition (see the sketch below)
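A minimal sketch of what coarse global competition for outliers can look like, assuming a max-norm style operator that promotes maps with one strong peak and suppresses maps with many similar peaks (the model's exact normalization operator may differ):

```python
import numpy as np
from scipy.ndimage import maximum_filter

def global_competition(feature_map, size=32):
    """Weight a feature map by how much its global peak stands out
    from the other local maxima (a classic (M - mbar)^2 weighting)."""
    m = feature_map - feature_map.min()
    M = m.max()
    if M == 0:
        return m
    m = m / M                                 # normalize to [0, 1], so M = 1
    local_max = maximum_filter(m, size=size)  # coarse grid of local maxima
    peaks = m[(m == local_max) & (m > 0.05)]  # ignore near-zero plateaus
    mbar = peaks.mean() if peaks.size else 0.0
    return m * (1.0 - mbar) ** 2              # (M - mbar)^2 with M = 1
```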
• sample frames illustrate the output of the models
• for illustration, these frames are selected at the time when a saccade is just beginning
• yellow arrows indicate the saccade path
• models score significantly above chance
• dynamic, global features work best
• but scores in this interactive task are lower than in previous passive-viewing experiments
fourier features (sketched below)
• 384 features from the fourier transform
• FFT log-magnitude remapped into (θ,ω) coordinates
• sampled at 24×16 (θ,ω) locations (24 × 16 = 384)
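A sketch of this extraction; the (θ,ω) binning grid here is a plausible reconstruction, not the poster's exact resampling scheme:

```python
import numpy as np

def fourier_gist(frame, n_theta=24, n_omega=16):
    """384 'gist' features from the FFT log-magnitude, averaged over a
    24x16 grid of (orientation, spatial-frequency) bins."""
    F = np.fft.fftshift(np.fft.fft2(frame))
    logmag = np.log1p(np.abs(F))
    h, w = logmag.shape
    ys, xs = np.mgrid[:h, :w]
    theta = np.arctan2(ys - h / 2, xs - w / 2) % np.pi    # orientation in [0, pi)
    omega = np.hypot((ys - h / 2) / h, (xs - w / 2) / w)  # normalized radial frequency
    ti = np.minimum((theta / np.pi * n_theta).astype(int), n_theta - 1)
    oi = np.minimum((omega / omega.max() * n_omega).astype(int), n_omega - 1)
    feats = np.zeros((n_theta, n_omega))
    counts = np.zeros((n_theta, n_omega))
    np.add.at(feats, (ti, oi), logmag)   # accumulate log-magnitude per bin
    np.add.at(counts, (ti, oi), 1)
    return (feats / np.maximum(counts, 1)).ravel()  # 384-dim feature vector
```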
dyadic pyramid features (sketched below)
• 7 pyramids: 1 luminance, 2 color, 4 orientation
• 2 scales per pyramid: coarse and fine
• 32 features per scale: a 4×4 array of local means plus a 4×4 array of local variances
(7 pyramids × 2 scales × 32 features = 448 features total)
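A sketch of the 4×4 local mean/variance extraction; which pyramid levels serve as "coarse" and "fine" is an assumption here:

```python
import numpy as np

def local_stats(channel_map, grid=4):
    """32 features per scale: a 4x4 grid of local means plus a 4x4 grid
    of local variances over one pyramid level."""
    h, w = channel_map.shape
    means, variances = [], []
    for i in range(grid):
        for j in range(grid):
            block = channel_map[i * h // grid:(i + 1) * h // grid,
                                j * w // grid:(j + 1) * w // grid]
            means.append(block.mean())
            variances.append(block.var())
    return np.array(means + variances)  # length 32

def pyramid_gist(pyramids, coarse=4, fine=2):
    """Concatenate local stats over 7 pyramids x 2 scales = 448 features.
    'pyramids' is a list of 7 lists of pyramid levels."""
    feats = [local_stats(pyr[s]) for pyr in pyramids for s in (fine, coarse)]
    return np.concatenate(feats)
```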
• model learns to associate image "gist" signatures with corresponding eye position density maps
• in testing, a leave-one-out approach is used: for each test clip, the remaining 23 clips are used for training (see the sketch below)
• therefore, the model must be able to generalize across game types in order to successfully predict eye positions
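A sketch of the leave-one-out protocol; fit/predict can be a learner such as the linear network sketched earlier, and score can be any of the three metrics. The per-clip container format is an assumption:

```python
import numpy as np

def leave_one_out_scores(clips, fit, predict, score):
    """For each of the 24 clips, train on the other 23 and test on
    the held-out clip; return one score per clip."""
    scores = []
    for k, test in enumerate(clips):
        train = [c for i, c in enumerate(clips) if i != k]
        X = np.vstack([c["features"] for c in train])
        Y = np.vstack([c["density_maps"] for c in train])
        model = fit(X, Y)
        pred = predict(model, test["features"])
        scores.append(score(pred, test["eye_positions"]))
    return scores
```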
• sample frames illustrating output of:
  - bottom-up (BU) model alone
  - mean eye position (MEP) (a control condition)
  - BU combined with MEP
  - top-down model (TD) based on pyramid features
  - BU combined with TD
(a sketch of the combination step follows)
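The "*" in BU*MEP and BU*TD denotes combining the two maps; a minimal sketch, assuming a pointwise product followed by renormalization (the exact combination rule is not spelled out in this text):

```python
import numpy as np

def combine(bu_map, td_map, eps=1e-9):
    """BU*TD (or BU*MEP) combo: pointwise product of the two prediction
    maps, renormalized to sum to 1 so it reads as a density map."""
    combo = bu_map * td_map
    return combo / (combo.sum() + eps)
```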
• average bottom-up and top-down maps across all frames
• bottom-up map reflects activity in the screen corners (game score, time counter, etc.) that is largely ignored by observers
• a simple mean-eye-position control improves upon purely bottom-up performance by a factor of 3-4x
• but the full top-down models perform best
• the bottom-up model predicts eye position best at a delay of about 250 ms
• the top-down model slightly lags behind eye position (due to temporal averaging in the model)
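A sketch of the time-shift analysis behind these two observations: NSS is recomputed with the model's map sequence shifted relative to the eye trace. The frame rate and data layout are assumptions; nss() is defined in the metrics sketch above:

```python
import numpy as np

def shifted_nss(maps, fixations_per_frame, shift_frames, nss_fn):
    """Mean NSS when salience maps are shifted by 'shift_frames'
    relative to the eye trace; positive shift = model leads the eye."""
    scores = []
    for t, fixations in enumerate(fixations_per_frame):
        m = t - shift_frames  # past map 'predicts' frame t's fixations
        if 0 <= m < len(maps) and fixations:
            scores.append(nss_fn(maps[m], fixations))
    return np.mean(scores)

# sweep from "eye leads model" to "model leads eye" at 30 Hz (1 frame = 33 ms)
# curve = [shifted_nss(maps, fix, s, nss) for s in range(-90, 91)]
```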
• Goal was to explain gaze behavior with fully-computational models that don't know about "objects" or "actions"
• Purely bottom-up models are able to predict eye position significantly better than chance, but not as well as in passive-viewing conditions
• So, some other factors must be influencing eye position in our interactive task
• Introduced a simple model for learning top-down, task-dependent influences on eye position
• Combining bottom-up and top-down mechanisms leads to significantly improved eye position prediction
• This kind of model could be used in autonomous machine-vision settings, such as interactive virtual environments
• models score better for exploring than for racing games
• static color (C) is best for racing games
• dynamic motion (M) is best for exploring games
• men were more bottom-up driven than were women (but note that N is small: 3 men, 2 women)
• Nintendo GameCube games
• 24 sessions, 5 minutes each
• three different metrics used to test how well the models predict human eye movements
A Computational Model of Task-Dependent Influences on Eye Position
Robert J. Peters and Laurent Itti
University of Southern California, Computer Science, Neuroscience
http://ilab.usc.edu/rjpeters/pubs/2006_VSS_Peters.pdf
Supported by the National Geospatial-Intelligence Agency (NGA) and the Intelligence Community (IC) Postdoctoral Research Fellowship Program