Biologically Inspired Machine Perception. Nicholas Butko, Machine Perception Lab. Winter 2010.



Page 1

Biologically Inspired Machine Perception

Nicholas Butko, Machine Perception Lab

Winter 2010

Page 2

<Chapter 1>Artificial Intelligence vs. Natural Intelligence

Borrowed Intelligence vs. Owned Intelligence

Hard Things are Easy, Easy Things are Hard

Page 3

Inspiration

Early on, Artificial Intelligence grabbed hold of my imagination and wouldn’t let go.

“The Age of Spiritual Machines”

By 2020, computers will have more transistors than brains have neurons. That won’t be sufficient for computers to be intelligent:

Can’t write a summary of a movie
Can’t tie shoe-laces
Can’t recognize humor

AI is not limited by computing power, but by our understanding of “intelligence.” A revolution in that understanding is required before we can create truly cognitive machines.

I wanted to be part of that revolution.

Page 4

First Steps

Freshman year of undergrad:
Volunteered in lab of AI prof in CSE.
“Labeling” eyes and mouths.

Thousands of images. Computer used this information to help figure out facial expression.

One of the most successful paradigms in AI: “Supervised Learning”

“Learn” about facial expressions from thousands of examples.
Use statistics, calculus, and linear algebra.
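For concreteness, here is a minimal, hypothetical sketch of the supervised-learning recipe (synthetic features and labels, not the lab's actual pipeline): thousands of labeled examples go in, a classifier judged by percent correct comes out.

```python
# Minimal supervised-learning sketch (hypothetical, not CERT's actual pipeline):
# learn a facial-expression classifier from labeled examples.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in data: 2000 "images", each summarized by 50 features
# (e.g., pixel statistics around labeled eyes and mouths), with a
# label 0 = "neutral", 1 = "smiling". Real systems use thousands of
# hand-labeled images in exactly this role.
X = rng.normal(size=(2000, 50))
w_true = rng.normal(size=50)
y = (X @ w_true + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # "learn" from examples
print("held-out accuracy:", clf.score(X_test, y_test))          # judged by percent correct
```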

Page 5

Computer Expression Recognition Toolbox

“Supervised Learning” has been very successful; my own lab uses it extensively to develop sophisticated facial-expression recognizers. [Demo at end, if we have time]

Page 6

Computer Expression Recognition Toolbox

Widely Applicable: Driver Drowsiness, Lie Detection, Real/Fake Pain, Autism Therapy, Tutoring, Smile Shutter, Art

Different from how humans learn: Nobody points out thousands of eyes and mouths to babies to help them learn about faces.

Page 7

May 11, 1997

Page 8

What’s wrong?

Page 9

Simple Is Hard

Daniel Wolpert, “The Master Puppeteer”
Crick Memorial Lecture, 2005
http://royalsociety.org/event.asp?id=3773

Page 10

Why is simple hard?

Artificial domains like chess have a clear, well-defined structure.

Natural domains like “seeing” are rife with ambiguity.

Consider a simple problem like “how to look at something.”


Page 11

Dealing with Ambiguities

To know “how to look somewhere”, it is helpful to know “where did I look?”

From many experiences of sending signals to your eye-muscle neurons, your brain can learn the relationship between actions and consequences.

Even the question “Where did I look?” is hard to answer! Lots of things could go wrong. Can we ever make explicit rules for all of them?

[Table of example image pairs. Columns: Whole Scene, View 1, View 2, Difficulty. Difficulties: No match; Same object? (Which lightpost?); Same object type? (Lake or Cloud?); Same location? (Moving Target).]

Page 12

</Chapter 1>

1) Which of these is easiest for a computer program: Seeing, Doing your laundry, Playing Sudoku, Writing a Book Report, Laughing at funny jokes?

2) We gave four reasons that it’s tough to know where you’re looking. Can you remember them? What’s the main difficulty that unites them?

3) If you were going to use today’s state-of-the-art approaches to make an intelligent computer program that “Knows how to teach,” what is the first thing you should do?

Page 13

<Chapter 2>The Computational Approach: Do we need feathers to fly?

Define the Problem with a Generative Model

Algebra to the Rescue: Finding all the rules.

Page 14

How to study Natural Intelligence?

Study the “aerodynamics” of natural intelligence -- the underlying principles and objectives organizing behavior.

Want a theory that’s not just about humans. Flying is not about birds and feathers.

Different organisms or systems may not have access to the type of actuators and sensors that humans have, but we still want to understand and build intelligent systems.

Choose problems that will help us understand behavior in real life.

E.g. “Learning how to look somewhere.”

Trying to understand perception by studying only neurons is like trying to understand bird flight by studying only feathers: it just cannot be done. In order to understand bird flight, we have to understand aerodynamics; only then do the structure of the feathers and the different shapes of birds’ wings make sense.

--Marr, Vision, 1982

Page 15

Defining the Problem: A Generative Model

A “Generative Model” is a tool to describe the structure of the problems organisms face.

You must describe how the things you can see relate to the things you want to know.

You must describe your uncertainty about how things are and how things will be.

Probability theory tells us how to make the best guess about the things you want to know, both how they are now and how they will be, based on everything you’ve seen before.

[Diagram of the generative model: the motor command value (a), how the motors work, the world appearance, where the camera is looking, and the camera image, unrolled over time steps t = 1, 2, 3, with sensory (ω) and motor (a) variables.]

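To make the diagrammed model concrete, here is a toy Bayes update with invented numbers (not the model in the figure): a prior belief over a few candidate gaze directions is combined with the likelihood of the current camera image to give the best guess.

```python
# Toy Bayes update (illustrative numbers only): combine what you believed
# before (prior) with what the camera image says (likelihood) to get the
# best guess about where the camera is looking (posterior).
import numpy as np

positions = ["left", "center", "right"]
prior = np.array([0.2, 0.5, 0.3])            # belief before the new image
likelihood = np.array([0.1, 0.3, 0.9])       # p(image | camera looking there)

posterior = prior * likelihood
posterior /= posterior.sum()                 # normalize so beliefs sum to 1

for pos, p in zip(positions, posterior):
    print(f"p(looking {pos} | image) = {p:.2f}")
print("best guess:", positions[int(np.argmax(posterior))])
```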
Page 16

Finding all the rules

A little probability theory:

And a little algebra:

Give us all the rules for making the best guess about where we are looking.

Where am I looking?

$$
p(\lambda_t \mid \lambda_{1:t-1}, \omega_{1:t}, a_{1:t})
= \frac{p(\lambda_t \mid \lambda_{1:t-1}, a_{1:t})\; p(\omega_t \mid \omega_{1:t-1}, \lambda_{1:t})\; p(\omega_{1:t-1} \mid \lambda_{1:t-1})}{p(\omega_{1:t} \mid \lambda_{1:t-1})}
$$

$$
g(\lambda_t) = -\tfrac{1}{2}\underbrace{(\lambda_t - C_t\mu_{K_t})^{\top}(C_t\Sigma_{K_t}C_t^{\top} + Q_\lambda)^{-1}(\lambda_t - C_t\mu_{K_t})}_{\text{Predicted Motion Match}}
\;-\;\tfrac{1}{2}\sum_{xy}\underbrace{\frac{(\omega_t^{xy} - \phi_{K_t}^{xy})^2}{\sigma_{K_t}^{xy\,2} + q_\omega^2}}_{\text{Image Match}}
\;-\;\tfrac{1}{2}\sum_{xy}\underbrace{\log\big(\sigma_{K_t}^{xy\,2} + q_\omega^2\big)}_{\text{Uncertainty Penalty}}
$$

(λ_t: where am I looking; ω_{1:t-1}, a_{1:t}: everything you’ve seen and done so far; ω_t: what you see right now.)
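The algebra works out to a precision-weighted update. Below is a one-dimensional stand-in with made-up numbers (not the slide's actual C, Σ, Q model): the predicted consequence of the motor command and the position suggested by the image are each weighted by their uncertainty, and the best guess is the weighted combination.

```python
# 1-D stand-in for the update above (illustrative, not the slide's exact model):
# fuse "where the motor command should have put my eye" with "where the image
# says I am", weighting each by how uncertain it is.
def fuse(pred_mean, pred_var, obs_mean, obs_var):
    """Precision-weighted combination of prediction and observation."""
    k = pred_var / (pred_var + obs_var)          # Kalman-style gain
    mean = pred_mean + k * (obs_mean - pred_mean)
    var = (1 - k) * pred_var
    return mean, var

# The motor command predicts the eye moved to 10 degrees, but motors are noisy.
pred_mean, pred_var = 10.0, 4.0
# Matching the current image against memory suggests 12 degrees, less noisy.
obs_mean, obs_var = 12.0, 1.0

best, uncertainty = fuse(pred_mean, pred_var, obs_mean, obs_var)
print(f"best guess: {best:.2f} deg  (variance {uncertainty:.2f})")   # ~11.6 deg
```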

Page 17

[Same slide as Page 16, with the added annotation: “Where you think you’re looking based on the neural signals sent to your eyes.”]

Page 18

[Same slide as Page 16, with the added annotations: “Possible Match,” “OK Match,” “Good Match.”]

Page 19

[Same slide as Page 16, with the added annotation: “Avoid if possible.”]

Page 20

[Same slide as Page 16, with the added annotation: “Best Guess!”]

Page 21

Learning to Look

[Plot: Error on Desired Eye-Movement vs. number of Eye-Movements (up to 400).]

Page 22

What the brain does

In 1992, Duhamel et al. showed that the parietal cortex does something similar to what we just described.

Just before an eye-movement, cells “remap” their visual representation to be in line with what they expect to see.

This does not mean the brain is doing probability theory and algebra.

It may mean the brain found a way to implement the solution probability theory and algebra give.


[Slide shows the first page of: Duhamel, J.-R., Colby, C. L., & Goldberg, M. E. (1992). The updating of the representation of visual space in parietal cortex by intended eye movements. Science, 255, 90-92. The paper reports that some parietal neurons shift their receptive fields transiently before an eye movement, and that nearly all parietal neurons respond when an eye movement brings the site of a previously flashed stimulus into the receptive field, so that parietal cortex anticipates the retinal consequences of eye movements and maintains a continuously accurate representation of visual space.]

Page 23

</Chapter 2>

1) True or false?: Scientists know that intelligence definitely requires neurons.

2) Make a Generative Model for the question, “Is this street safe to cross?”: A) What do you want to know? B) What can you do? C) What can you see? D) How is the answer likely to change in the future?

3) What is the role of “Nurture” in learning to look? What is the role of “Nature?”

Page 24

<Chapter 3>Exquisitely tuned information consumers.

Bamboozled by Math? Ask a different question.

Generative models let you reward yourself for a job well done.

Page 25

Where do we find information?

Information helps us answer a question:

Who was the 17th president?

Will it rain tomorrow?

What am I supposed to talk about next?

We constantly gather visual information by moving our eyes!

Page 26

Choosing where to look

People don’t closely examine every inch of the world.

Eye-movements are tuned to optimally gather information.

We turned two of the shortcuts that people use into new machine perception technologies.

1) Fast Visual Saliency
2) Digital Eye

Page 27

How do we measure Information?

Maximum: the uniform distribution has the most information, because we can’t make a good guess.

Additivity: the information from two independent events is the sum of the information we get from each one separately.

Continuity: small changes in probability give small changes in information.

Symmetry: reordering/renaming outcomes doesn’t change information.

$$
-\int p(x)\,\log p(x)\,dx
$$

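The formula above is Shannon entropy (here written for a continuous density). The discrete sketch below, with made-up distributions, checks the "maximum" and "symmetry" requirements numerically.

```python
# Shannon entropy of a discrete distribution; checks two of the axioms above.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log 0 as 0
    return -np.sum(p * np.log2(p))     # in bits

uniform = [0.25, 0.25, 0.25, 0.25]
peaked  = [0.85, 0.05, 0.05, 0.05]

print(entropy(uniform))    # 2.0 bits: the uniform case is hardest to guess
print(entropy(peaked))     # about 0.85 bits: easy to guess, less information
print(np.isclose(entropy([0.1, 0.6, 0.3]),
                 entropy([0.6, 0.3, 0.1])))   # True: reordering outcomes changes nothing
```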
Page 28

Visual Salience

Salient objects “pop out” of visual scenes.

A simple preprocessing step directs computational resources.
Rare (improbable) image features are more salient than common (probable) ones.
Improbable events carry more information.

We developed an efficient way to model the statistics of a video stream, and analyze it for salient “pop out”.
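As a hedged illustration of the rarity idea only (not the lab's published fast-saliency algorithm), the sketch below keeps a running histogram of pixel intensities over a video stream and scores each pixel of a new frame by its self-information, -log p; rare values "pop out."

```python
# Rarity-based saliency sketch (illustrative; not the published fast-saliency code).
# Salience of a pixel = -log probability of its intensity under statistics
# accumulated over the video stream so far.
import numpy as np

BINS = 32
counts = np.ones(BINS)                 # running histogram of intensity values (smoothed)

def salience_map(frame):
    """frame: 2-D array of grayscale values in [0, 1]."""
    bins = np.minimum((frame * BINS).astype(int), BINS - 1)
    p = counts / counts.sum()          # probability of each intensity bin so far
    smap = -np.log(p[bins])            # rare (improbable) values -> high salience
    np.add.at(counts, bins.ravel(), 1) # update the stream statistics
    return smap

rng = np.random.default_rng(1)
for t in range(50):                    # a stream of mostly dark frames
    frame = rng.uniform(0.0, 0.2, size=(48, 64))
    if t == 49:
        frame[20:24, 30:34] = 0.9      # a bright patch appears: rare, hence salient
    smap = salience_map(frame)

print("most salient location:", np.unravel_index(np.argmax(smap), smap.shape))
```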

Page 29

Two Examples

Offline: Video Analysis

Online: Camera Control

Page 30

Empirically Useful

Tracks people in pre-school:
68.04% of salience-tracking images contained people.
34.81% of playback images contained people.

Predicts Key-frames in Video Annotation:
Video sequence labeled by coders for “Change in activity.” [RED]
Initial attempts at salience-based video statistics can give up to 70% signal correlation. [BLUE]
Can also be used to make a “virtual cameraman” to focus on areas of a scene.

Page 31

What information?

Salience approaches don’t really pay attention to what they see.

Inhibition of return:
Can pre-compute the saccade trajectory from the first image.
Not reacting to information in the image.

The image is constant, and all image analysis is pre-computed.

What is the consequence of each eye movement?

Information-gathering model, but what information was gathered?

What question were we trying to answer?


Page 32

Task Directed Looking Behavior

Visual Popout can be useful for robots, and it seems to be important in people, but it can’t account for task-specific looking behavior.

It has long been known that where people look depends on what questions they are trying to answer. [Yarbus 1967]

Current studies have difficulty making quantitative claims: “Fixations are tightly linked in time to the evolution of the task. Very few irrelevant regions are fixated.” [Hayhoe & Ballard 2005]

Page 33

How do we measure Information?

Mutual Information is “How much is my uncertainty about a question reduced by the things I do and see?”

Is this street safe to cross?
Don’t look: Very uncertain
Look left: Somewhat uncertain
Look right: Not uncertain

$$
I(S; A, O) = \int_S p(S \mid A, O)\,\log p(S \mid A, O)\,dS \;-\; \int_S p(S)\,\log p(S)\,dS = H(S) - H(S \mid A, O)
$$

(H(S): uncertainty with my eyes closed; H(S|A,O): uncertainty after I open my eyes.)

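A toy version of the street-crossing example with invented probabilities: the expected information gain of each action is the expected drop in the entropy of the belief about whether the street is safe.

```python
# Toy expected-information-gain calculation for "Is this street safe to cross?"
# (invented numbers). Gain of an action = H(S) - E_o[ H(S | o, a) ].
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

prior = np.array([0.5, 0.5])              # p(safe), p(unsafe)

# p(observe "clear" | state, action); columns = (safe, unsafe)
p_clear = {
    "don't look": np.array([0.5, 0.5]),   # seeing nothing tells you nothing
    "look left":  np.array([0.8, 0.3]),   # one direction: somewhat informative
    "look right": np.array([0.95, 0.05]), # (pretend the traffic comes from the right)
}

for action, like in p_clear.items():
    gain = 0.0
    for obs_like in (like, 1 - like):                 # observe "clear" or "not clear"
        p_obs = np.dot(prior, obs_like)               # p(o | a)
        posterior = prior * obs_like / p_obs          # p(S | o, a)
        gain += p_obs * (entropy(prior) - entropy(posterior))
    print(f"{action:>10s}: expected information gain = {gain:.2f} bits")
```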
Page 34

Infomax Principle: “Feeling of learning”

Supervised: Student / teacher model of learning; teacher knows right answer. Learning judged by % correct.

Infomax: Confidence in response to a question (Information).

Reinforcement Learning: Given a reinforcement signal, learn how to act to optimally accrue that reinforcer. In the Infomax approach, learn to gain information in order to become confident (can’t be confident without information).

[Bar charts: Confidence (0-100) in the answers “No” and “Yes”.]

Page 35

Searching for Faces

[Image with four numbered fixations: 1 No Face, 2 No Face, 3 No Face, 4 Face!]

Page 36

A Generative Model for Visual Search [Adapted from Najemnik & Geisler 2005]

[Figure: target signal strength vs. target-eye distance (degrees), with observations corrupted by noise ~N(0,1), so signal quality falls off away from the fixation point. Panels show State/Action, Likelihood, and Belief over an 11 x 11 grid of locations at t = 0, 1, 2, 3, with an Infomax reward driving each fixation.]

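The sketch below is a much-simplified, hypothetical version of such a search model (a 5 x 5 grid, my own signal-falloff and Monte Carlo shortcut, not the parameters in the figure): each fixation yields noisy evidence whose reliability falls off with distance, the belief over target locations is updated by Bayes' rule, and the next fixation is chosen to maximize the expected information gained.

```python
# Simplified infomax visual search on a small grid (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
N = 5                                              # 5 x 5 grid of candidate target locations
cells = [(r, c) for r in range(N) for c in range(N)]

def strength(fix, cell):
    """Signal quality falls off with distance from the current fixation point."""
    d = np.hypot(fix[0] - cell[0], fix[1] - cell[1])
    return max(0.0, 3.0 - d)

def glimpse_update(belief, fix, target):
    """Simulate one noisy glimpse while fixating `fix`, then update the belief."""
    s = np.array([strength(fix, c) for c in cells])
    x = rng.normal(loc=s * (np.arange(len(cells)) == target), scale=1.0)
    loglik = s * x - 0.5 * s ** 2                  # log-likelihood of "the target is at cell j"
    post = belief * np.exp(loglik - loglik.max())
    return post / post.sum()

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def infomax_fixation(belief, n_samples=20):
    """Pick the fixation that minimizes expected posterior entropy (Monte Carlo estimate)."""
    best_fix, best_h = None, np.inf
    for fix in cells:
        h = np.mean([entropy(glimpse_update(belief, fix,
                                            rng.choice(len(cells), p=belief)))
                     for _ in range(n_samples)])
        if h < best_h:
            best_fix, best_h = fix, h
    return best_fix

target = rng.integers(len(cells))                  # hidden target location
belief = np.ones(len(cells)) / len(cells)
for t in range(6):
    fix = infomax_fixation(belief)
    belief = glimpse_update(belief, fix, target)   # a real glimpse of the true scene
    print(f"t={t}: fixate {fix}, p(target at true cell) = {belief[target]:.2f}")
```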
Page 37

Digital Eye in Action

960x540 Video (1/2 Mpx).

Digital Retina: 25 FPS

Viola Jones: 1.25 FPS

Page 38

</Chapter 3>

1) True or False?: If you close your eyes (and ears, nose, etc.), you get no information about whether a street is safe to cross.

2) Why is it a good idea to be bad at “Where’s Waldo?”

3) In Infomax Control approaches, you reward yourself for doing things that make you more certain about the answer to a question. What keeps you from just tricking yourself into believing things with complete certainty?

Page 39

<Chapter 4>Moving up: Social Awareness

Baby Einstein: Doing the right experiment at the right time.

Generative Models let you own your intelligence.

Page 40

It takes 2-month infants about 40 minutes to learn new contingencies (head moves → mobile moves).

By 10 months, infants have become experts at learning new contingencies (it takes them only a few seconds to detect one).

Learning Contingencies

[Movellan & Watson 1985]

Page 41

Page 42

A Generative Model for Contingency

Example: vocalization contingency
Actions: Vocalize, Remain Quiet
Question: Are the sound statistics after my vocalization different from background?
Goal: Choose length of waiting period to quickly become confident in the answer to this question.

[Figure: microphone volume over time, marked at each vocalization; bar charts of confidence (0-100) in the answers “No” and “Yes”.]
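A hedged sketch of the idea with invented response rates (not the paper's Infomax controller, which also optimizes the waiting-period length): vocalize, wait one window, note whether a sound followed, and update the belief that someone is responding until it becomes confident either way.

```python
# Contingency-detection sketch (illustrative rates, not the published model):
# after each vocalization, did a sound follow within the waiting period?
import numpy as np

rng = np.random.default_rng(2)

P_BACKGROUND = 0.1      # chance a sound happens in a window under "no one responds"
P_CONTINGENT = 0.7      # chance a sound follows within the window if someone responds

def run(world_is_contingent, threshold=0.95, max_trials=100):
    p_h1 = 0.5                                         # prior belief: "someone responds to me"
    for trial in range(1, max_trials + 1):
        # Vocalize, then wait one window and listen.
        p_true = P_CONTINGENT if world_is_contingent else P_BACKGROUND
        heard = rng.random() < p_true
        # Bayes update over the two hypotheses.
        like_h1 = P_CONTINGENT if heard else 1 - P_CONTINGENT
        like_h0 = P_BACKGROUND if heard else 1 - P_BACKGROUND
        p_h1 = p_h1 * like_h1 / (p_h1 * like_h1 + (1 - p_h1) * like_h0)
        if p_h1 > threshold or p_h1 < 1 - threshold:
            return trial, p_h1
    return max_trials, p_h1

trials, confidence = run(world_is_contingent=True)
print(f"decided 'contingent' after {trials} vocalizations (p = {confidence:.2f})")
trials, confidence = run(world_is_contingent=False)
print(f"decided 'not contingent' after {trials} vocalizations (p = {confidence:.2f})")
```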

Page 43

Infomax Control Demo

Page 44

Developmental Result

[Plot: Minutes of Interaction for Accurate Contingency Detection vs. Months of Development; 2 mo: 18 minutes, 10 mo: 3.4 minutes.]

*Butko, Movellan, ICDL 2007

Page 45

Learning to See Humans

Is it possible to learn about the visual appearance of people based on contingency? [John Watson (1972), 2-month infants]

Contingency is the driver of social development.
Contingency defines the concept of “caregiver.”

Computational Analysis

Is Watson’s hypothesis computationally plausible? If so, how long does it take to gather enough information to learn reliably?

Page 46

Testing the Hypothesis

Infomax model of detecting contingencies: high reliability in real-world, real-time robotic applications.

Movellan and Fasel (2006): Segmental Boltzmann Fields

Identify and locate objects in cluttered scenes from a weak training label:
“A leopard is probably in this scene.”

Use contingency to teach yourself about people:
“A social being is probably in this scene.”

Page 47

Autonomous Robotic Learner BEV

A Baby’s Eye-View Robot

Page 48

GA

Nothing Responsive. Take a Picture

Page 49

GA

Nice Baby

Something Responsive!

Take a Picture

Page 50

3700 Images collected over 90 minutes of interaction.

No experimenter intervention

Variety of lighting and background conditions

No post-processing of images (rectification, etc.)

“Baby’s Eye View” Learning in the Wild

18% - No face ; 4% - No Person

17% - Face ; 20% - Person

(Example images grouped by Contingency vs. No Contingency.)

Page 51

Key Results

Learns what people look like with high accuracy very quickly (6 minutes)

Shows the preference for schematic faces that infants show shortly after birth (40 minutes).

Shows the preference for caregivers over other people that infants show shortly after birth (2 days).

[Plots: 2AFC performance and standard error vs. number of training images for Person vs. No Person, Contingent vs. Noncontingent, Face vs. No Face, and Face vs. No Person; salience and standard error vs. number of training images for Face, Scrambled, and Blank stimuli; and 2AFC performance (Face v. No Person) vs. number of training images for Caregivers vs. Other People.]

Page 52

</Chapter 4>

1) From a computational point of view, in what ways is social intelligence “special,” or fundamentally different from low-level perceptual intelligence?

2) Under the Infomax Hypothesis, how do babies learn to be good scientists, i.e. ask the right question at the right time?

3) What ultimately enabled BEV to learn what people look like, without borrowing the expertise of human teachers?

Page 53

Advice

Artificial Intelligence is an exciting field where we are constantly pushing the boundaries of imagination.

Get involved in research labs early.

Take lots of math classes:
Calculus, Linear Algebra, Probability, Statistics, Discrete Math, Algorithms & Data Structures