Page 1: Object Recognition

Object Recognition

Mark van Rossum

School of Informatics, University of Edinburgh

January 15, 2018

Based on slides by Chris Williams. Version: January 15, 2018

1 / 27

Page 2: Object Recognition

Overview

Neurobiology of Vision
Computational Object Recognition: What's the Problem?
Fukushima's Neocognitron
HMAX model and recent versions
Other approaches

2 / 27

Page 3: Object Recognition

Neurobiology of Vision

WHAT pathway: V1 → V2 → V4 → IT
WHERE pathway: V1 → V2 → V3 → MT/V5 → parietal lobe
IT (inferotemporal cortex) has cells that are:

Highly selective to particular objects (e.g. face cells)
Relatively invariant to the size and position of objects, but typically variable with respect to 3D view

What and where information must be combined somewhere

3 / 27

Page 4: Object Recognition

Invariances in higher visual cortex

[?]

4 / 27

Page 5: Object Recognition

Left: partial rotation invariance [?]. Right: clutter reduces translation invariance [?].

5 / 27

Page 6: Object Recognition

thways/index.html

6 / 27

Page 7: Object Recognition

Computational Object Recognition

The big problem is creating invariance to scaling, translation, rotation (both in-plane and out-of-plane), and partial occlusion, while at the same time being selective.
What about a back-propagation network that learns some function f(I_{x,y})?

Large input dimension, so an enormous training set is needed
No invariances a priori

Objects are not generally presented against a neutral background, but are embedded in clutter
Tasks: object-class recognition, specific object recognition, localization, segmentation, ...

7 / 27

Page 8: Object Recognition

Some Computational Models

Two extremes:
Extract a 3D description of the world, and match it to stored 3D structural models (e.g. a human as generalized cylinders)
Large collection of 2D views (templates)

Some other methods:
2D structural description (parts and spatial relationships)
Match image features to model features, or do pose-space clustering (Hough transforms)

What are good types of features?

Feedforward neural network
Bag-of-features (no spatial structure; but what about the "binding problem"?)
Scanning-window methods to deal with translation/scale

8 / 27

Page 9: Object Recognition

Fukushima’s Neocognitron

[?, ?]
To implement location invariance, "clone" (replicate) a detector over a region of space, and then pool the responses of the cloned units
This strategy can be repeated at higher levels, giving rise to greater invariance
See also [?]: convolutional neural networks

9 / 27

Page 10: Object Recognition

HMAX model

[?]

10 / 27

Page 11: Object Recognition

HMAX model

S1 detectors are based on Gabor filters at various scales, rotations and positions
S-cells (simple cells) convolve with local filters
C-cells (complex cells) pool S-responses with a maximum
No learning between layers
Object recognition: supervised learning on the output of C2 cells
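A minimal numpy sketch of the S/C stages just described: convolve with one Gabor filter, then MAX-pool the responses over local patches. The filter parameters, pooling size, and use of a single orientation are illustrative choices, not values from the model.

```python
import numpy as np

def gabor(size=9, theta=0.0, wavelength=4.0, sigma=2.0):
    """One Gabor filter patch (parameter values are illustrative)."""
    ys, xs = np.mgrid[-(size // 2):size // 2 + 1, -(size // 2):size // 2 + 1]
    xr = xs * np.cos(theta) + ys * np.sin(theta)
    yr = -xs * np.sin(theta) + ys * np.cos(theta)
    return np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)

def s1_response(image, filt):
    """S-cell stage: convolve the image with the filter (valid region only)."""
    fh, fw = filt.shape
    h, w = image.shape
    out = np.empty((h - fh + 1, w - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * filt)
    return out

def c1_response(s1, pool=4):
    """C-cell stage: MAX-pool S1 responses over local patches,
    which buys tolerance to small translations of the stimulus."""
    h, w = s1.shape
    return np.array([[s1[i:i + pool, j:j + pool].max()
                      for j in range(0, w - pool + 1, pool)]
                     for i in range(0, h - pool + 1, pool)])
```

Because the C1 unit reports only the maximum within its pool, shifting a stimulus by a few pixels inside the pool leaves its output unchanged, which is exactly the invariance-by-pooling strategy.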

11 / 27

Page 12: Object Recognition

Rather than learning, take refuge in having many, many cells (Cover, 1965):

"A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated."
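Cover's observation can be illustrated directly: XOR is not linearly separable in its raw 2-D space, but after a fixed random nonlinear expansion (the "many, many cells") a perceptron typically does find a separating hyperplane. The 50 tanh units are an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: the classic problem that is NOT linearly separable in 2-D.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])

def random_features(X, n_features=50):
    """Fixed random nonlinear expansion: many untrained 'cells'
    standing in for any learned representation."""
    W = rng.standard_normal((X.shape[1], n_features))
    b = rng.standard_normal(n_features)
    return np.tanh(X @ W + b)

def perceptron_separable(Z, labels, epochs=1000):
    """Return True iff a perceptron (no bias term) finds a separating hyperplane."""
    w = np.zeros(Z.shape[1])
    for _ in range(epochs):
        errors = 0
        for z, t in zip(Z, labels):
            if t * (z @ w) <= 0:
                w += t * z
                errors += 1
        if errors == 0:
            return True
    return False
```

Four points in a generic 50-dimensional space are almost surely linearly separable, so the high-dimensional cast succeeds where the 2-D perceptron provably cannot.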

12 / 27

Page 13: Object Recognition

[?]

13 / 27

Page 14: Object Recognition

HMAX model: Results

"Paper clip" stimuli
Broad tuning curves with respect to size and translation
Scrambling the input image does not give rise to object detections: not all conjunctions are preserved

14 / 27

Page 15: Object Recognition

More recent version

[?]

15 / 27

Page 16: Object Recognition

Use real images as inputs

S-cells: convolution, e.g.

h = (∑_i w_i x_i) / (κ + √(∑_i w_i²)),   y = g(h)

C-cells: soft-max pooling,

h = (∑_i x_i^(q+1)) / (κ + ∑_k x_k^q)

(some support from biology for such pooling)
Some unsupervised learning between layers [?]
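The two responses above transcribe directly into numpy; κ, q, and the tanh output nonlinearity below are placeholder choices for illustration.

```python
import numpy as np

def s_cell(x, w, kappa=1e-3, g=np.tanh):
    """Normalised convolution of the S-cell:
    h = (sum_i w_i x_i) / (kappa + sqrt(sum_i w_i^2)),  y = g(h)."""
    h = np.dot(w, x) / (kappa + np.sqrt(np.sum(np.asarray(w)**2)))
    return g(h)

def c_cell(x, q=4.0, kappa=1e-3):
    """Soft-max pooling of the C-cell:
    h = (sum_i x_i^(q+1)) / (kappa + sum_k x_k^q).
    Large q approaches max-pooling; small q approaches mean-like pooling."""
    x = np.asarray(x, dtype=float)
    return np.sum(x**(q + 1)) / (kappa + np.sum(x**q))
```

The exponent q interpolates smoothly between averaging and the hard MAX of the original HMAX model, which is one reason this form is considered biologically plausible.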

16 / 27

Page 17: Object Recognition

Results

Localization can be achieved by using a sliding-window method
Claimed as a model of a "rapid categorization task", where back-projections are inactive
Performance is similar to human performance on flashed (20 ms) images
The model doesn't do segmentation (as opposed to bounding boxes)

17 / 27

Page 18: Object Recognition

Learning invariances

Hard-code (convolutional network): http://yann.lecun.com/exdb/lenet/
Supervised learning (show various samples and require the same output)
Use the temporal continuity of the world: learn invariance by seeing an object change, e.g. it rotates, it changes colour, it changes shape.
Algorithm: the trace rule [?]. E.g. replace ∆w = x(t)·y(t) with ∆w = x(t)·ȳ(t), where ȳ(t) is a temporally filtered y(t).
Similar principles: VisNet [?], slow feature analysis.
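A minimal sketch of one trace-rule step: Hebbian learning with the postsynaptic activity replaced by a low-pass filtered trace ȳ(t). The learning rate and filter constant are illustrative, not from any particular paper.

```python
import numpy as np

def trace_rule_update(w, x, y_trace, y, eta=0.1, delta=0.8):
    """One step of a trace rule. The trace y_trace is a running
    low-pass filter of the postsynaptic activity y, so temporally
    adjacent views of the same object strengthen the same weights.
    eta (learning rate) and delta (trace decay) are illustrative."""
    y_trace = delta * y_trace + (1.0 - delta) * y   # y_bar(t): filtered y(t)
    w = w + eta * x * y_trace                       # dw = eta * x(t) * y_bar(t)
    return w, y_trace
```

Because the trace outlasts the activity itself, an input arriving just after the neuron fired is still potentiated, which is what ties successive views of a transforming object together.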

18 / 27

Page 19: Object Recognition

Slow feature analysis

Find slowly varying features; these are likely to be relevant [?]

Find the output y for which ⟨(dy(t)/dt)²⟩ is minimal,

subject to ⟨y⟩ = 0, ⟨y²⟩ = 1
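The linear case of this optimisation has a closed form: whiten the data to enforce the constraints, then take the smallest eigenvector of the covariance of the time derivative. A sketch (assuming the input covariance has full rank):

```python
import numpy as np

def linear_sfa(X):
    """Linear slow feature analysis, one output.
    Centering and whitening enforce <y> = 0, <y^2> = 1; the slowest
    direction is the smallest eigenvector of the covariance of the
    whitened finite-difference derivative."""
    X = X - X.mean(axis=0)                  # <y> = 0
    C = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(C)
    S = evecs / np.sqrt(evals)              # whitening: cov(X @ S) = I
    Z = X @ S
    dZ = np.diff(Z, axis=0)                 # finite-difference dy/dt
    dvals, dvecs = np.linalg.eigh(np.cov(dZ, rowvar=False))
    return Z @ dvecs[:, 0]                  # minimises <(dy/dt)^2>
```

Fed a mixture of a slow sinusoid and fast noise, this recovers the sinusoid (up to sign), since its derivative has far smaller variance in the whitened space.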

19 / 27

Page 20: Object Recognition

Experiments: Altered visual world [?]

20 / 27

Page 21: Object Recognition

A different flavour Object Recognition Model

[?]
Preprocess the image to obtain interest points
At each interest point, extract a local image descriptor (e.g. Lowe's SIFT descriptor). These can be clustered to give discrete "visual words"
A (w_i, x_i) pair at each interest point, defining visual word and location
Define a generative model. The object has instantiation parameters θ (location, scale, rotation, etc.)
The object also has parts, indexed by z

21 / 27

Page 22: Object Recognition

p(w_i, x_i | θ) = ∑_{j=0}^{P} p(z_i = j) p(w_i | z_i = j) p(x_i | z_i = j, θ)

Part 0 is the background (broad distributions for w and x)
p(x_i | z_i = j, θ) contains the geometric information, e.g. the relative offset of part j from the centre of the model

p(W, X | θ) = ∏_{i=1}^{n} p(w_i, x_i | θ)

p(W, X) = ∫ p(W, X | θ) p(θ) dθ
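A minimal numeric sketch of the per-feature mixture likelihood above, assuming isotropic Gaussian location models and categorical word distributions. The dictionary keys (`p_w`, `mu`, `sigma`) and all constants are invented for illustration; part 0 plays the broad background role.

```python
import numpy as np

def log_likelihood(words, locs, parts, p_z):
    """log p(W, X | theta): for each feature i, sum over parts j of
    p(z_i = j) * p(w_i | z_i = j) * p(x_i | z_i = j, theta), with
    Gaussian locations and categorical words; features are i.i.d.
    given theta, hence the sum of logs over i."""
    total = 0.0
    for w, x in zip(words, locs):
        like = 0.0
        for j, part in enumerate(parts):
            d = x - part['mu']                      # offset from part centre
            p_x = np.exp(-0.5 * d @ d / part['sigma']**2) / (2 * np.pi * part['sigma']**2)
            like += p_z[j] * part['p_w'][w] * p_x
        total += np.log(like)
    return total
```

A feature whose word and location match one of the object parts raises the likelihood far more than the same feature explained only by the broad background distribution, which is what drives recognition.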

22 / 27

Page 23: Object Recognition

Fergus, Perona, Zisserman (2005)

23 / 27

Page 24: Object Recognition

Results and Discussion

Sudderth et al.'s model is generative, and can be trained unsupervised (cf. Serre et al.)
There is not much in the way of top-down influences (except the rôle of θ)
The model doesn't do segmentation
Use of context should boost performance
There is still much to be done to obtain human-level performance!

24 / 27

Page 25: Object Recognition

Including top-down interaction

Extensive top-down connections everywhere in the brain
One known role: attention. For the rest: many theories

[?]

Local parts can be ambiguous, but knowing the global object helps.
Top-down connections set priors.
The improvement in object recognition is actually small, but recognition and localization of parts is much better.

25 / 27

Page 26: Object Recognition

References I

26 / 27