Visual Attention
Jeremy Wyatt
Where to look?
• Many visual processes are expensive
• Humans don’t process the whole visual field
• How do we decide what to process?
• How can we use insights about this to make machine vision more efficient?
Visual salience
• Salience ~ visual prominence
• Must be cheap to calculate
• Related to features that we collect from very early stages of visual processing
• Colour, orientation, intensity change and motion are all important indicators of salience
On/Off cells
• Recall centre-surround cells
[Figure: ON-centre and OFF-centre receptive fields (ON area, OFF area), and the firing of an ON cell and an OFF cell over time as a light spot is switched on]
Colour sensitive On/Off cells
• Recall that some ganglion ON cells are sensitive to the outputs of cones
[Figure: ON and OFF regions of a colour-opponent receptive field]
An intensity change map
• I = (r+g+b)/3 gives I, the intensity map (a minimal sketch of this step follows below)
• The intensity change map is formed from a grid of on/off cells (they overlap)
• There are several maps, each from cells with receptive fields at a different scale
• Each cell fires for its area
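A minimal sketch of the intensity map in Python (numpy is my choice here, not the lecture's; the image array and its size are placeholders):

```python
import numpy as np

def intensity_map(rgb):
    """I = (r + g + b) / 3: average the three colour channels."""
    return rgb.mean(axis=2)

# A random 64x64 colour image stands in for a real camera frame.
rgb = np.random.rand(64, 64, 3)
I = intensity_map(rgb)  # shape (64, 64)
```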
How do we calculate the maps?
• We can create each on cell using a pair of Gaussians
[Figure: a thin (centre) Gaussian minus a fat (surround) Gaussian gives the ON-area/OFF-area response profile to a light spot; a kernel sketch follows below]
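A sketch of such a pair of Gaussians combined into a single difference-of-Gaussians kernel; the kernel size and the two sigma values are illustrative assumptions:

```python
import numpy as np

def on_cell_kernel(size=15, sigma_centre=1.5, sigma_surround=4.0):
    """Difference of two Gaussians: a thin centre minus a fat surround.

    Positive values form the ON area, negative values the OFF area;
    negate the kernel to get an OFF cell instead.
    """
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx**2 + yy**2
    centre = np.exp(-r2 / (2 * sigma_centre**2)) / (2 * np.pi * sigma_centre**2)
    surround = np.exp(-r2 / (2 * sigma_surround**2)) / (2 * np.pi * sigma_surround**2)
    return centre - surround  # sums to roughly zero: uniform light cancels out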
How do we calculate the maps?
• Imagine grids of fat and thin Gaussians
• We calculate the value of each Gaussian in each grid and then subtract one grid (here with 16 elements) from the other
• This implements our grid of on cells (sketched below)
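One way to implement the whole grid at once is to blur the intensity map with the thin and the fat Gaussian and subtract, since convolution evaluates each Gaussian at every grid position; the sigma values are again assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def on_off_maps(I, sigma_thin=1.5, sigma_fat=4.0):
    """Evaluate the whole grid of on (and off) cells over an intensity map.

    Blurring with a Gaussian evaluates that Gaussian at every pixel, so
    subtracting the fat-blurred map from the thin-blurred one gives every
    cell's response at once.
    """
    thin = gaussian_filter(I, sigma_thin)  # centre responses
    fat = gaussian_filter(I, sigma_fat)    # surround responses
    on = np.maximum(thin - fat, 0)         # ON cells: bright centre, dark surround
    off = np.maximum(fat - thin, 0)        # OFF cells: the opposite contrast
    return on, off
```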
Calculating the intensity change map
• We do this for a mix of scales
• We have to interpolate the values of some maps to match the outputs of others (this corresponds to cells that have overlapping receptive fields)
• By aligning and then combining the maps at different scales we have implemented a grid of on cells, or a grid of off cells (a multi-scale sketch follows below)
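A sketch of the multi-scale step: halve the image resolution for each coarser map, then interpolate the coarse maps back up to the full-resolution grid before combining. The number of scales and the sigmas are assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def multiscale_on_map(I, n_scales=3, sigma_thin=1.5, sigma_fat=4.0):
    """Sum ON maps computed at several scales on one full-resolution grid."""
    h, w = I.shape
    combined = np.zeros((h, w))
    img = I
    for _ in range(n_scales):
        # Same thin-minus-fat subtraction, but on a coarser image each time,
        # which is equivalent to cells with larger receptive fields.
        on = np.maximum(gaussian_filter(img, sigma_thin)
                        - gaussian_filter(img, sigma_fat), 0)
        # Interpolate the coarse map up to the full-resolution grid so the
        # maps align before they are combined.
        combined += zoom(on, (h / on.shape[0], w / on.shape[1]), order=1)
        img = img[::2, ::2]  # halve the resolution for the next, coarser map
    return combined
```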
Other maps
• We can now do this for red, green, yellow and blue
• We also do this for intensity changes of a certain orientation (both steps are sketched below)
[Figure: an oriented filter applied to the image gives an orientation map]
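A sketch of the four colour channels and an oriented filter. The channel formulas follow the broadly-tuned definitions from Itti and Koch's saliency model, which I am assuming is the intended scheme here, and the Gabor parameters are illustrative:

```python
import numpy as np

def colour_channels(rgb):
    """Broadly tuned red, green, blue and yellow channels from an RGB image.

    Each channel is then run through the same on/off centre-surround
    machinery as the intensity map.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    R = np.maximum(r - (g + b) / 2, 0)
    G = np.maximum(g - (r + b) / 2, 0)
    B = np.maximum(b - (r + g) / 2, 0)
    Y = np.maximum((r + g) / 2 - np.abs(r - g) / 2 - b, 0)
    return R, G, B, Y

def gabor_kernel(size=15, sigma=3.0, wavelength=6.0, theta=0.0):
    """Oriented filter: a Gaussian envelope times a grating at angle theta.

    Convolving the intensity map with this picks out intensity changes
    of one particular orientation.
    """
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    x_rot = xx * np.cos(theta) + yy * np.sin(theta)
    envelope = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * x_rot / wavelength)
```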
Combining maps to calculate saliency
• We now add the maps to obtain the saliency of each group of pixels in the scene
Saliency map
• We normalise each map to the same range before adding
• We weight each map before combining it
• We attend to the most active point in the saliency map (a combined sketch follows below)
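Putting it together, a sketch of the normalise-weight-add step and of attending to the most active point. The simple min-max normalisation is an assumption, and the map names in the usage comment are hypothetical:

```python
import numpy as np

def saliency_map(maps, weights=None):
    """Normalise each feature map to [0, 1], weight it, and add them up.

    maps: list of 2-D arrays (intensity change, colour, orientation maps),
    all already aligned to the same resolution.
    """
    if weights is None:
        weights = [1.0] * len(maps)
    s = np.zeros_like(maps[0], dtype=float)
    for m, w in zip(maps, weights):
        rng = m.max() - m.min()
        norm = (m - m.min()) / rng if rng > 0 else np.zeros_like(m)
        s += w * norm
    return s

# Attend to the most active point (map names here are hypothetical):
# s = saliency_map([intensity_change, red_map, orientation_map])
# y, x = np.unravel_index(np.argmax(s), s.shape)
```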
Attending to areas of the scene
• We use the salience model I have described to attend to certain areas of the scene
• We can now use this salience model to make other visual processes more efficient (e.g. object recognition)
Learning names and appearances of objects
Salience can be modulated by language
Modulating visual salience by language: results
SIFT based recognition
[Bar chart: recognition time in seconds (y-axis, 0 to 2.5) per object (x-axis: Sprite Can, Diet Coke Can, Coke Can, Magic, Lucozade Bottle) for three conditions: Full Scene, Bottom up salience, Modulated by context]
Number of Fixations

Package            Full Scene   Bottom up salience   Modulated by Context
Sprite Can         1            4.5                  1
Diet Coke Can      1            7                    3.1
Coke Can           1            3.5                  1
Magic              1            2                    2
Lucozade Bottle    1            1                    2
Fanta bottle       1            11.9                 11.7
Summary
• Visual attention is guided by many features
• A good model of attention involves parts of early visual processing we have already seen
• We can use this to make object learning in robots more efficient