
Predicting observers' task from their scanpaths on natural scenes

Ali Borji and Laurent Itti {borji,itti}@usc.edu
University of Southern California, Computer Science, Psychology, and Neuroscience

(1) Introduction & Motivation

While the hypothesis that it is possible to decode the observer's task from eye movements (Yarbus, 1967) has received some support (e.g., DeAngelus & Pelz, 2009; Henderson, Shinkareva, Wang, Luke, & Olejarczyk, 2013; Iqbal & Bailey, 2004), Greene, Liu, and Wolfe (2012) argued against it by reporting a failure to decode task from eye movements.

In this study, we perform a more systematic investigation of this problem, probing a larger number of experimental factors than previous work. Our main goal is to determine how informative eye movements are for decoding the observer's task and mental state.

Supported by the National Science Foundation, the Army Research Office, and the U.S. Army

http://ilab.usc.edu/publications/doc/Borji_Itti14vss.pdf

(3) Experiment 1
(4) Experiment 2

(2) Related works

Conclusions & Discussions

1) Successful task decoding results provide further evidence that fixations convey diagnostic information regarding the observer's mental state and task, consistent with the cognitive relevance theory of attention (see Hayhoe & Ballard, 2005).

2) It is possible to reliably infer the observer's task from Greene et al.'s (2012) data using stronger classifiers. Classification was better when we treated images individually.

3) Is it always possible to decode task from eye movements? We argue that there is no general answer to this type of pattern-recognition question. The answer depends on the stimuli, observers, and questions used.

4) Here we followed the procedure of Greene et al. (2012), in which (a) no observer viewed the same image twice and (b) the same scene was shown under multiple questions. The first rule eliminates memory biases; the second ensures that the final result is not due to differences in stimuli.

5) Recently, Kanan et al. (2014) were also able to decode task from eye movement patterns on Greene et al.'s data, with about the same performance as ours (> 33%).

6) Beyond its scientific value, decoding task from eye movements has practical applications such as wearable visual technologies (e.g., Google Glass) and patient diagnosis (e.g., ASD, ADHD, FASD, Parkinson's, and Alzheimer's disease).

[1] DeAngelus, M., & Pelz, J. B. (2009). Top-down control of eye movements: Yarbus revisited. Visual Cognition, 17, 790–811.
[2] Kanan, C., Ray, N., Bseiso, D. N. F., Hsiao, J. H., & Cottrell, G. W. (2014). Predicting an observer's task using multi-fixation pattern analysis. In Proceedings of the ACM Symposium on Eye Tracking Research and Applications (ETRA 2014).
[3] Yarbus, A. L. (1967). Eye movements and vision. New York: Plenum.
[4] Iqbal, S., & Bailey, B. (2004). Using eye gaze patterns to identify user tasks. In Proceedings of the Grace Hopper Celebration of Women in Computing, October 6–9, 2004.
[5] Henderson, J., Shinkareva, S., Wang, J., Luke, S., & Olejarczyk, J. (2013). Predicting cognitive state from eye movements. PLoS ONE, 8(5), e64937.
[6] Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1254–1259.
[7] Seiffert, C., Khoshgoftaar, T. M., Van Hülse, J., & Napolitano, A. (2010). RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics – Part A: Systems and Humans, 40, 185–197.
[8] Borji, A., & Itti, L. (2014). Defending Yarbus: Eye movements reveal observers' task. Journal of Vision, 14(3), 29.
[9] Hayhoe, M., & Ballard, D. (2005). Eye movements in natural behavior. Trends in Cognitive Sciences, 9, 188–194.

[Figure, Experiment 1: easiest vs. hardest stimuli with per-image decoding accuracies (bracketed values as in the original figure): image 1 - 0.4906 [0.2500]; 2 - 0.3741 [0.3088]; 3 - 0.3518 [0.2537]; 4 - 0.2812 [0.2279]; 5 - 0.3094 [0.5331]; 6 - 0.3565 [0.2610]; 7 - 0.2694 [0.3492]; 8 - 0.5118 [0.2941]; 9 - 0.4835 [0.1728]; 10 - 0.2800 [0.2206]; 11 - 0.3977 [0.1765]; 12 - 0.3965 [0.5845]; 13 - 0.1859 [0.2280]; 14 - 0.3623 [0.5956]; 15 - 0.3765 [0.4118]; 16 - 0.3600 [0.3382]; 17 - 0.2259 [0.1765]; 18 - 0.3270 [0.4154]; 19 - 0.1318 [0.4118]; 20 - 0.3670 [0.3235]. Example saliency maps: ITTI98, GBVS.]

[Figure, Experiment 1: task (memory, decade, people, wealth) x observer (1-17) decoding matrix; mean accuracy = 0.3420, chance level = 0.25. Line plot of decoding accuracy per image (images 1-20). Bar plot of decoding accuracy by feature type (Type 1, Type 2, Type 3, Type 4, All) for kNN and RUSBoost; chance level indicated; asterisks as in the original figure.]

[Figure, Experiment 2: easiest vs. hardest stimuli with per-image decoding accuracies (bracketed values as in the original figure), with per-task breakdown for Tasks 1-7: image 1 - 0.1648 [0.1810]; 2 - 0.3733 [0.3556]; 3 - 0.3152 [0.3492]; 4 - 0.2352 [0.2318]; 5 - 0.2619 [0.2635]; 6 - 0.2771 [0.2635]; 7 - 0.1019 [0.1143]; 8 - 0.2724 [0.2698]; 9 - 0.2828 [0.2825]; 10 - 0.4162 [0.3905]; 11 - 0.1514 [0.1492]; 12 - 0.3867 [0.3810]; 13 - 0.2600 [0.2698]; 14 - 0.3648 [0.3619]; 15 - 0.2505 [0.2222].]

Yarbus’ experiment on Repin's painting

Greene et al.'s results

(5) Experiment 2 (cont'd)

TASKS:
1. Memorize the picture (memory).
2. Determine the decade in which the picture was taken (decade).
3. Determine how well the people in the picture know each other (people).
4. Determine the wealth of the people in the picture (wealth).

Decoding Accuracy (Greene et al., 2012):
- viewed image: 33% correct vs. chance = 1.5% (64 images)
- participant: 26% correct vs. chance = 6.3% (16 participants)
- task: 27.1% correct vs. chance = 25% (4 tasks); not significant (see the significance-check sketch below)
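Such accuracy-versus-chance comparisons reduce to a one-sided binomial test. A minimal sketch of that check (our tooling choice, not the authors' analysis code; the trial count below is an illustrative placeholder):

# One-sided binomial test of decoding accuracy against chance.
# n_trials is an illustrative placeholder, not a value reported on the poster.
from scipy.stats import binomtest

def above_chance_p(accuracy, n_trials, chance):
    """P-value of seeing `accuracy` on n_trials if the true accuracy were `chance`."""
    n_correct = round(accuracy * n_trials)
    return binomtest(n_correct, n_trials, p=chance, alternative="greater").pvalue

print(above_chance_p(0.271, 1000, 0.25))    # task decoding: 4 tasks, chance = 0.25
print(above_chance_p(0.33, 1000, 1 / 64))   # image decoding: 64 images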

Decoding Accuracy: [over all data, with Bonferroni correction]
Using kNN and Feature Type 3, we achieved an accuracy of 0.3118, which is above Greene et al.'s (2012) result and significantly better than chance (k = 8; p = 0.014).

Using RUSBoost and Feature Type 3, we achieved an accuracy of 0.3412 (p = 1.07e-04).
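A minimal sketch of this style of decoding pipeline (leave-one-out task classification from per-scanpath feature vectors); the library choices (scikit-learn, imbalanced-learn) and the random stand-in data are ours, not the poster's:

# Leave-one-out task decoding with kNN and RUSBoost on placeholder data.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from imblearn.ensemble import RUSBoostClassifier  # RUSBoost (Seiffert et al., 2010)

rng = np.random.default_rng(0)
X = rng.normal(size=(315, 4))      # e.g., Feature Type 3: Greene et al.'s 4 features
y = rng.integers(0, 4, size=315)   # task labels (4 tasks in Experiment 1)

knn = KNeighborsClassifier(n_neighbors=8)                  # k = 8 as on the poster
rus = RUSBoostClassifier(n_estimators=50, random_state=0)

for name, clf in [("kNN", knn), ("RUSBoost", rus)]:
    acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
    print(name, "leave-one-out accuracy:", acc)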

Decoding Accuracy: [over all data, with Bonferroni correction]
We report results using a leave-one-out procedure. We have 21 observers, each viewing 15 images (five images under each of three questions per observer), resulting in 315 scanpaths. Using Feature Type 1, we achieved an average accuracy of 0.2421, which is significantly above chance (binomial test, p = 2.4535e-06). Using Feature Type 2 (i.e., the NSS histograms of nine saliency models, as in Experiment 1) results in an accuracy of 0.2254 (p = 5.6044e-05).

A) Confusion matrix averaged over 15 images and 50 RUSBoost runs using Feature Type 1. We observe high accuracies for Task 3 (estimating ages), Task 5 (remembering clothes), and Task 7 (estimating how long the visitor has been away), but low accuracy for the free-viewing task. There is high confusion between Task 2 and Tasks 6 and 1, and also between Task 1 and Task 7. (A sketch of how such a run-averaged confusion matrix can be assembled follows after this caption block.)

B) The easiest and hardest stimuli using Feature Type 1.

C) RUSBoost classification using Feature Type 2 in Experiment 2.
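The sketch referenced in caption A: averaging ground-truth-normalized confusion matrices over repeated RUSBoost runs. Scikit-learn and imbalanced-learn are our tooling choices, and the features and train/test split are random placeholders rather than the poster's leave-one-out setup:

# Average a ground-truth-normalized 7x7 confusion matrix over repeated RUSBoost runs.
import numpy as np
from sklearn.metrics import confusion_matrix
from imblearn.ensemble import RUSBoostClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(315, 10))           # placeholder features
y = rng.integers(0, 7, size=315)         # 7 Yarbus-style tasks in Experiment 2
labels = np.arange(7)

cms = []
for run in range(50):                    # 50 runs, as on the poster
    clf = RUSBoostClassifier(n_estimators=50, random_state=run)
    idx = rng.permutation(len(y))
    train, test = idx[:250], idx[250:]   # simple split for illustration only
    clf.fit(X[train], y[train])
    pred = clf.predict(X[test])
    # normalize='true' makes each ground-truth row sum to 1
    cms.append(confusion_matrix(y[test], pred, labels=labels, normalize="true"))

print(np.nanmean(cms, axis=0).round(2))  # nanmean guards against empty rows in a split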

Decoding Accuracy: [over single images, with Bonferroni correction]
Each image was viewed under each question by three observers, resulting in 21 data points per image (i.e., 3 observers × 7 questions). Note that each set of three observers was assigned the same question. The RUSBoost classifier with Feature Type 1 yields an average accuracy of 0.2724 over 50 runs and 15 images. Using the first two feature types together (a 10,000 + 23 × 70 = 11,610-D vector) yields an average performance of 0.2743. Over all runs (i.e., table rows), the minimum accuracy (averaged over all 15 images) is 0.2540 and the maximum is 0.3079. Note that our accuracies are almost twice the 14.29% chance level (i.e., 1/7).
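Running a separate significance test per image calls for a multiple-comparisons correction; a minimal Bonferroni sketch (the per-image accuracies below are illustrative placeholders, not the values reported above):

# Per-image binomial tests against 1/7 chance, Bonferroni-corrected across 15 images.
from scipy.stats import binomtest

n_points_per_image = 21              # 3 observers x 7 questions
chance = 1 / 7
accuracies = [0.27] * 15             # stand-in per-image accuracies

raw_p = [binomtest(round(a * n_points_per_image), n_points_per_image,
                   p=chance, alternative="greater").pvalue
         for a in accuracies]
bonferroni_p = [min(1.0, p * len(accuracies)) for p in raw_p]
print(bonferroni_p)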

Example stimuli for the A) scene memorization task, B) reading task, C) scene search task, and D) pseudo-reading task.

Decoding Accuracy: > 64% (chance = 25%) using multivariate pattern classification.

The features were the mean and standard deviation of fixation duration, the mean and standard deviation of saccade amplitude, the number of fixations per trial, and the three parameters μ, σ, and τ quantifying the shape of the fixation duration distribution with an ex-Gaussian distribution.
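The ex-Gaussian parameters (μ, σ, τ) can be estimated from a vector of fixation durations; a minimal sketch using SciPy's exponentially modified normal distribution with synthetic durations (our choice of tooling; Henderson et al. do not prescribe this implementation):

# Fit an ex-Gaussian to fixation durations (ms) and recover mu, sigma, tau.
# SciPy's exponnorm uses shape K = tau / sigma, loc = mu, scale = sigma.
import numpy as np
from scipy.stats import exponnorm

rng = np.random.default_rng(2)
durations = rng.normal(220, 40, 500) + rng.exponential(80, 500)  # synthetic durations

K, mu, sigma = exponnorm.fit(durations)
tau = K * sigma
print(f"mu = {mu:.1f} ms, sigma = {sigma:.1f} ms, tau = {tau:.1f} ms")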

The features were (1) the number of fixations, (2) the mean fixation duration, (3) the mean saccade amplitude, (4) the percent of the image covered by fixations assuming a 1° fovea, and dwell time on various regions of interest: (5) faces, (6) human bodies, and (7) objects.
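The first four of these aggregate features can be computed directly from a list of fixations; a minimal sketch in which the pixels-per-degree value and the 1°-radius fovea are our assumptions for illustration:

# Greene et al.-style aggregate features from one scanpath.
# fix: array of rows (x_pix, y_pix, duration_ms).
import numpy as np

def greene_features(fix, img_w, img_h, px_per_deg=30.0):
    xy, dur = fix[:, :2], fix[:, 2]
    n_fix = len(fix)                                  # (1) number of fixations
    mean_dur = dur.mean()                             # (2) mean fixation duration (ms)
    sacc = np.linalg.norm(np.diff(xy, axis=0), axis=1) / px_per_deg
    mean_sacc = sacc.mean() if len(sacc) else 0.0     # (3) mean saccade amplitude (deg)
    # (4) percent of image covered by 1-degree-radius foveal disks around fixations
    yy, xx = np.mgrid[0:img_h, 0:img_w]
    covered = np.zeros((img_h, img_w), bool)
    for x, y in xy:
        covered |= (xx - x) ** 2 + (yy - y) ** 2 <= px_per_deg ** 2
    coverage = 100.0 * covered.mean()
    return np.array([n_fix, mean_dur, mean_sacc, coverage])

fix = np.array([[100, 120, 250], [300, 200, 180], [400, 350, 300]], float)
print(greene_features(fix, img_w=800, img_h=600))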

Observers: 21 USC students (10 male, 11 female)
Visual angle: 43 × 25 degrees
Stimuli: 15 paintings (13 oil on canvas, some by Repin)
Viewing time: 30 seconds

Features (see the construction sketch after the Classifiers list):
Type 1: the smoothed fixation map, downsampled to 100 × 100 (10,000-D)
Type 2: histograms of normalized scanpath saliency (NSS) (9 × 70 = 630-D)
Type 3: the first 4 features of Greene et al. (4-D)
Type 4: <x, y> locations of the first five fixations (i.e., a 10-D vector)

Classifiers:
- k-nearest-neighbor (kNN)
- RUSBoost (random undersampling boosting) algorithm (Seiffert, Khoshgoftaar, Van Hülse, & Napolitano, 2010)
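A minimal sketch of constructing Feature Types 1, 2, and 4 from one scanpath and a single saliency map; the image size, smoothing width, and histogram range are illustrative assumptions (the poster specifies only the final dimensionalities), and this is not the authors' implementation:

# Build Feature Type 1 (smoothed, downsampled fixation map), Type 2 (NSS histogram
# for one saliency model), and Type 4 (first five fixation locations).
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def type1_fixation_map(fix_xy, img_h, img_w, sigma_px=25):
    fmap = np.zeros((img_h, img_w))
    for x, y in fix_xy.astype(int):
        fmap[np.clip(y, 0, img_h - 1), np.clip(x, 0, img_w - 1)] += 1
    fmap = gaussian_filter(fmap, sigma_px)                 # smooth
    small = zoom(fmap, (100 / img_h, 100 / img_w))         # downsample to 100 x 100
    return small.ravel()                                   # 10,000-D

def type2_nss_hist(fix_xy, sal_map, n_bins=70):
    z = (sal_map - sal_map.mean()) / sal_map.std()         # normalized saliency map
    nss = [z[int(y), int(x)] for x, y in fix_xy]           # NSS value at each fixation
    hist, _ = np.histogram(nss, bins=n_bins, range=(-3, 3))
    return hist.astype(float)                              # 70-D per saliency model

def type4_first_five(fix_xy):
    return fix_xy[:5].ravel()                              # 10-D (pad if fewer fixations)

fix_xy = np.array([[120.0, 80.0], [300.0, 240.0], [500.0, 300.0],
                   [200.0, 400.0], [640.0, 360.0]])
sal = np.random.default_rng(3).random((600, 800))
print(type1_fixation_map(fix_xy, 600, 800).shape,
      type2_nss_hist(fix_xy, sal).shape,
      type4_first_five(fix_xy).shape)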

Henderson et al., PLOS ONE 2013

Greene et al., Vis. Res. 2012


[Figure, Experiment 2: task (1-7) x observer (1-21) decoding matrix.]

Confusion matrix of the RUSBoost classifier in Exp. II (rows: ground truth; columns: predicted; each row sums to ~1):

            Task 1  Task 2  Task 3  Task 4  Task 5  Task 6  Task 7
Task 1       0.24    0.29    0.13    0.04    0.07    0.07    0.16
Task 2       0.18    0.33    0.20    0.11    0.09    0.04    0.04
Task 3       0.00    0.22    0.40    0.09    0.09    0.04    0.16
Task 4       0.11    0.18    0.22    0.20    0.09    0.07    0.13
Task 5       0.02    0.04    0.24    0.11    0.40    0.09    0.09
Task 6       0.18    0.18    0.09    0.13    0.18    0.20    0.04
Task 7       0.09    0.11    0.33    0.07    0.11    0.04    0.24

[Figure: Feature importance (RUSBoost classification over all data in Exp. II); classification accuracy per saliency model (AIM, AWS, GBVS, HouCVPR, HouNIPS, ITTI, ITTI98, ITTI-C, ITTI-CIO, ITTI-CIOLTXE, ITTI-E, ITTI-Entrop, ITTI-I, ITTI-L, ITTI-O, ITTI-OLTXE, ITTI-Scorr, ITTI-T, ITTI-Variance, ITTI-X, PQFT, SEO, SUN); chance level = 1/7 = 0.1429.]