Visual Attention
Jeremy Wyatt
Where to look?
• Many visual processes are expensive
• Humans don’t process the whole visual field
• How do we decide what to process?
• How can we use insights about this to make machine vision more efficient?
Visual salience
• Salience ~ visual prominence
• Must be cheap to calculate
• Related to features that we collect from very early stages of visual processing
• Colour, orientation, intensity change and motion are all important indicators of salience
On/Off cells
• Recall centre-surround cells
[Figure: ON-centre and OFF-centre receptive fields (ON area, OFF area), and the firing of an ON cell and an OFF cell over time as a light spot is switched on]
Colour sensitive On/Off cells
• Recall that some ganglion ON cells are sensitive to the outputs of cones
[Figure: ON and OFF regions of a colour-opponent receptive field]
An intensity change map
• I = (r+g+b)/3 gives I, the intensity map (a minimal sketch of this step follows below)
• The intensity change map is formed from a grid of on/off cells (they overlap)
• There are several maps, each from cells with receptive fields at a different scale
• Each cell fires for its area
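A minimal sketch of the intensity map in Python (numpy is my choice here, not the lecture's; the image array and its size are placeholders):

```python
import numpy as np

def intensity_map(rgb):
    """I = (r + g + b) / 3: average the three colour channels."""
    return rgb.mean(axis=2)

# A random 64x64 colour image stands in for a real camera frame.
rgb = np.random.rand(64, 64, 3)
I = intensity_map(rgb)  # shape (64, 64)
```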
How do we calculate the maps?
• We can create each on cell using a pair of Gaussians
[Figure: a thin (centre) Gaussian minus a fat (surround) Gaussian gives the ON-area/OFF-area response profile to a light spot; a kernel sketch follows below]
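A sketch of such a pair of Gaussians combined into a single difference-of-Gaussians kernel; the kernel size and the two sigma values are illustrative assumptions:

```python
import numpy as np

def on_cell_kernel(size=15, sigma_centre=1.5, sigma_surround=4.0):
    """Difference of two Gaussians: a thin centre minus a fat surround.

    Positive values form the ON area, negative values the OFF area;
    negate the kernel to get an OFF cell instead.
    """
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx**2 + yy**2
    centre = np.exp(-r2 / (2 * sigma_centre**2)) / (2 * np.pi * sigma_centre**2)
    surround = np.exp(-r2 / (2 * sigma_surround**2)) / (2 * np.pi * sigma_surround**2)
    return centre - surround  # sums to roughly zero: uniform light cancels out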
How do we calculate the maps?
• Imagine grids of fat and thin Gaussians
• We calculate the value of each Gaussian in each grid and then subtract one grid (here with 16 elements) from the other
• This implements our grid of on cells (sketched below)
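One way to implement the whole grid at once is to blur the intensity map with the thin and the fat Gaussian and subtract, since convolution evaluates each Gaussian at every grid position; the sigma values are again assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def on_off_maps(I, sigma_thin=1.5, sigma_fat=4.0):
    """Evaluate the whole grid of on (and off) cells over an intensity map.

    Blurring with a Gaussian evaluates that Gaussian at every pixel, so
    subtracting the fat-blurred map from the thin-blurred one gives every
    cell's response at once.
    """
    thin = gaussian_filter(I, sigma_thin)  # centre responses
    fat = gaussian_filter(I, sigma_fat)    # surround responses
    on = np.maximum(thin - fat, 0)         # ON cells: bright centre, dark surround
    off = np.maximum(fat - thin, 0)        # OFF cells: the opposite contrast
    return on, off
```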
Calculating the intensity change map
• We do this for a mix of scales
• We have to interpolate the values of some maps to match the outputs of others (this corresponds to cells that have overlapping receptive fields)
• By aligning and then combining the maps at different scales we have implemented a grid of on cells, or a grid of off cells (a multi-scale sketch follows below)
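A sketch of the multi-scale step: halve the image resolution for each coarser map, then interpolate the coarse maps back up to the full-resolution grid before combining. The number of scales and the sigmas are assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def multiscale_on_map(I, n_scales=3, sigma_thin=1.5, sigma_fat=4.0):
    """Sum ON maps computed at several scales on one full-resolution grid."""
    h, w = I.shape
    combined = np.zeros((h, w))
    img = I
    for _ in range(n_scales):
        # Same thin-minus-fat subtraction, but on a coarser image each time,
        # which is equivalent to cells with larger receptive fields.
        on = np.maximum(gaussian_filter(img, sigma_thin)
                        - gaussian_filter(img, sigma_fat), 0)
        # Interpolate the coarse map up to the full-resolution grid so the
        # maps align before they are combined.
        combined += zoom(on, (h / on.shape[0], w / on.shape[1]), order=1)
        img = img[::2, ::2]  # halve the resolution for the next, coarser map
    return combined
```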
Other maps
• We can now do this for red, green, yellow and blue
• We also do this for intensity changes of a certain orientation (both steps are sketched below)
[Figure: an oriented filter applied to the image gives an orientation map]
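A sketch of the four colour channels and an oriented filter. The channel formulas follow the broadly-tuned definitions from Itti and Koch's saliency model, which I am assuming is the intended scheme here, and the Gabor parameters are illustrative:

```python
import numpy as np

def colour_channels(rgb):
    """Broadly tuned red, green, blue and yellow channels from an RGB image.

    Each channel is then run through the same on/off centre-surround
    machinery as the intensity map.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    R = np.maximum(r - (g + b) / 2, 0)
    G = np.maximum(g - (r + b) / 2, 0)
    B = np.maximum(b - (r + g) / 2, 0)
    Y = np.maximum((r + g) / 2 - np.abs(r - g) / 2 - b, 0)
    return R, G, B, Y

def gabor_kernel(size=15, sigma=3.0, wavelength=6.0, theta=0.0):
    """Oriented filter: a Gaussian envelope times a grating at angle theta.

    Convolving the intensity map with this picks out intensity changes
    of one particular orientation.
    """
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    x_rot = xx * np.cos(theta) + yy * np.sin(theta)
    envelope = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * x_rot / wavelength)
```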
Combining maps to calculate saliency
• We now add the maps to obtain the saliency of each group of pixels in the scene
Saliency map
• We normalise each map to the same range before adding
• We weight each map before combining it
• We attend to the most active point in the saliency map (a combined sketch follows below)
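Putting it together, a sketch of the normalise-weight-add step and of attending to the most active point. The simple min-max normalisation is an assumption, and the map names in the usage comment are hypothetical:

```python
import numpy as np

def saliency_map(maps, weights=None):
    """Normalise each feature map to [0, 1], weight it, and add them up.

    maps: list of 2-D arrays (intensity change, colour, orientation maps),
    all already aligned to the same resolution.
    """
    if weights is None:
        weights = [1.0] * len(maps)
    s = np.zeros_like(maps[0], dtype=float)
    for m, w in zip(maps, weights):
        rng = m.max() - m.min()
        norm = (m - m.min()) / rng if rng > 0 else np.zeros_like(m)
        s += w * norm
    return s

# Attend to the most active point (map names here are hypothetical):
# s = saliency_map([intensity_change, red_map, orientation_map])
# y, x = np.unravel_index(np.argmax(s), s.shape)
```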
Attending to areas of the scene
• We use the salience model I have described to attend to certain areas of the scene
• We can now use this salience model to make other visual processes more efficient (e.g. object recognition)
Learning names and appearances of objects
Salience can be modulated by language
Modulating visual salience by language: results
SIFT based recognition
[Bar chart: recognition time in seconds (y-axis, 0 to 2.5) per object (x-axis: Sprite Can, Diet Coke Can, Coke Can, Magic, Lucozade Bottle) for three conditions: Full Scene, Bottom up salience, Modulated by context]
Number of Fixations

Package            Full Scene   Bottom up salience   Modulated by Context
Sprite Can         1            4.5                  1
Diet Coke Can      1            7                    3.1
Coke Can           1            3.5                  1
Magic              1            2                    2
Lucozade Bottle    1            1                    2
Fanta bottle       1            11.9                 11.7
Summary
• Visual attention is guided by many features
• A good model of attention involves parts of early visual processing we have already seen
• We can use this to make object learning in robots more efficient