Page 1: Reinforcement Learning’s Computational Theory of Mind

Reinforcement Learning’s Computational Theory of Mind

Rich Sutton

with thanks to: Andy Barto, Satinder Singh, Doina Precup

Page 2: Reinforcement Learning’s Computational Theory of Mind

Outline

• Computational Theory of Mind
• Reinforcement Learning
  – Some vivid examples
• Intuition of RL’s Computational Theory of Mind
  – Reward, policy, value (prediction of reward)
• Some of the Math
  – Policy iteration (values & policies build on each other)
  – Discounted reward
  – TD error
• Speculative extensions
  – Reason
  – Knowledge

Page 3: Reinforcement Learning’s Computational Theory of Mind

Honeybee Brain & VUM Neuron

Hammer, Menzel

Page 4: Reinforcement Learning’s Computational Theory of Mind

Dopamine Neurons

Signal “Error/Change” in Prediction of Reward

Page 5: Reinforcement Learning’s Computational Theory of Mind

Marr’s Three Levels
at which any information processing system can be understood

• Computational Theory Level (What and Why?)
  – What are the goals of the computation?
  – What is being computed?
  – Why are these the right things to compute?
  – What overall strategy is followed?
• Representation and Algorithm Level (How?)
  – How are these things computed?
  – What representation and algorithms are used?
• Hardware Implementation Level (Really how?)
  – How is this implemented physically?

Page 6: Reinforcement Learning’s Computational Theory of Mind

Cash Register

• Computational Theory (What and Why?)
  – Adding numbers
  – Making change
  – Computing tax
  – Controlling access to cash drawer
• Representations and Algorithms
  – Are numbers stored in decimal, binary, or BCD?
  – Is multiplication done by repeated adding?
• Hardware Implementation
  – Silicon or gears?
  – Motors or springs?

Page 7: Reinforcement Learning’s Computational Theory of Mind

Word Processor

• Computational Theory (What and Why?)
  – The whole “application” level
  – To display document as it will appear on paper
  – To make it easy to change, enhance
• Representations and Algorithms
  – How is the document stored?
  – What algorithms are used to maintain its display?
  – The underlying C code
• Hardware Implementation
  – How does the display work?
  – How does the silicon implement the computer?

Page 8: Reinforcement Learning’s Computational Theory of Mind

Flight

• Computational Theory (What and Why?)
  – Aerodynamics
  – Lift, propulsion, airfoil shape
• Representations and Algorithms
  – Fixed wings or flapping?
• Hardware Implementation
  – Steel, wood, or feathers?

Page 9: Reinforcement Learning’s Computational Theory of Mind

Importance of Computational Theory

• The levels are loosely coupled
• Each has an internal logic, coherence of its own
• But computational theory drives all lower levels
• We have so often gone wrong by mixing CT with the other levels
  – Imagined neuro-physiological constraints
  – The meaning of connectionism, neural networks
• Constraints within the CT level have more force
• We have little computational theory for AI
  – No clear problem definition
  – Many methods for knowledge representation, but no theory of knowledge

Page 10: Reinforcement Learning’s Computational Theory of Mind

Outline

• Computational Theory of Mind
• Reinforcement Learning
  – Some vivid examples
• Intuition of RL’s Computational Theory of Mind
  – Reward, policy, value (prediction of reward)
• Some of the Math
  – Policy iteration (values & policies build on each other)
  – Discounted reward
  – TD error
• Speculative extensions
  – Reason
  – Knowledge

Page 11: Reinforcement Learning’s Computational Theory of Mind

Reinforcement learning:
Learning from interaction to achieve a goal

[Diagram: an Agent and an Environment in a closed loop; the agent sends actions, the environment returns states and rewards]

• complete agent
• temporally situated
• continual learning & planning
• object is to affect environment
• environment stochastic & uncertain
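Below is a minimal Python sketch of this interaction loop; the env and agent objects and their reset/step/act/learn methods are hypothetical placeholders rather than any particular library’s API.

```python
# Minimal sketch of the agent-environment loop on this slide.
# The Environment and Agent interfaces are hypothetical placeholders.

def run_episode(env, agent, max_steps=1000):
    """Run one episode: the agent acts, the environment returns the
    next state and a scalar reward, and the agent learns from it."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                    # policy: state -> action
        next_state, reward, done = env.step(action)  # stochastic, uncertain environment
        agent.learn(state, action, reward, next_state)
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```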

Page 12: Reinforcement Learning’s Computational Theory of Mind

Strands of History of RL

[Timeline figure: three strands (trial-and-error learning; temporal-difference learning; optimal control and value functions) converging in modern RL. Names along the timeline: Thorndike (1911), Minsky, Klopf, Barto et al., secondary reinforcement, Samuel, Witten, Sutton, Hamilton (physics, 1800s), Shannon, Bellman/Howard (OR), Werbos, Watkins, Holland]

Page 13: Reinforcement Learning’s Computational Theory of Mind

Hiroshi Kimura’s RL Robots

[Video clips: Before, After, Backward, and a New Robot running the same algorithm]

Page 14: Reinforcement Learning’s Computational Theory of Mind

The RoboCup Soccer Competition

Page 15: Reinforcement Learning’s Computational Theory of Mind

The Acrobot Problem
e.g., Dejong & Spong, 1994; Sutton, 1995

Minimum-Time-to-Goal:
• 4 state variables: 2 joint angles, 2 angular velocities
• Tile coding with 48 layers
• Goal: raise tip above line
• Reward = –1 per time step

[Figure: two-link acrobot with a fixed base, torque applied at the middle joint, and a tip to be raised above the target line]

Page 16: Reinforcement Learning’s Computational Theory of Mind

Examples of Reinforcement Learning

• RoboCup Soccer Teams (Stone & Veloso; Riedmiller et al.)
  – World’s best player of simulated soccer, 1999; runner-up 2000
• Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis)
  – 10-15% improvement over industry standard methods
• Dynamic Channel Assignment (Singh & Bertsekas; Nie & Haykin)
  – World’s best assigner of radio channels to mobile telephone calls
• Elevator Control (Crites & Barto)
  – (Probably) world’s best down-peak elevator controller
• Many Robots
  – navigation, bipedal walking, grasping, switching between skills...
• TD-Gammon and Jellyfish (Tesauro, Dahl)
  – World’s best backgammon player

Page 17: Reinforcement Learning’s Computational Theory of Mind

New Applications of RL

• CMUnited RoboCup Soccer Team (Stone & Veloso)
  – World’s best player of RoboCup simulated soccer, 1998
• KnightCap and TDLeaf (Baxter, Tridgell & Weaver)
  – Improved chess play from intermediate to master in 300 games
• Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis)
  – 10-15% improvement over industry standard methods
• Walking Robot (Benbrahim & Franklin)
  – Learned critical parameters for bipedal walking

[Slide annotations: “Real-world applications using on-line learning”, “Backprop”; video clip]

Page 18: Reinforcement Learning’s Computational Theory of Mind

Backgammon

SITUATIONS: configurations of the playing board (about 10^20)
ACTIONS: moves
REWARDS: win: +1, lose: –1, else: 0

Pure delayed reward

Page 19: Reinforcement Learning’s Computational Theory of Mind

TD-Gammon (Tesauro, 1992-1995)

[Figure: multilayer neural network mapping a board position to a value estimate; the TD error V_{t+1} − V_t drives learning; action selection by 2-3 ply search]

• Start with a random network
• Play millions of games against itself
• Learn a value function from this simulated experience
• This produces arguably the best player in the world
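A rough sketch of this self-play procedure, assuming hypothetical ValueNetwork and Backgammon classes; it is illustrative only, not Tesauro’s actual implementation.

```python
# Rough sketch of self-play value learning as described on this slide.
# ValueNetwork and Backgammon (and all their methods) are invented
# stand-ins; terminal-state handling is simplified.

def self_play_training(num_games=1_000_000, alpha=0.1, gamma=1.0):
    value_net = ValueNetwork(random_init=True)        # start with a random network
    for _ in range(num_games):
        game = Backgammon()
        state = game.reset()
        while not game.over():
            # action selection by a shallow search over resulting positions
            action = max(game.legal_moves(state),
                         key=lambda m: value_net.predict(game.afterstate(state, m)))
            next_state, reward = game.step(action)    # +1 / -1 only at the end of the game
            # TD update toward the one-step bootstrapped target
            target = reward + gamma * value_net.predict(next_state)
            value_net.update(state, target, alpha)
            state = next_state
    return value_net
```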

Page 20: Reinforcement Learning’s Computational Theory of Mind

Outline

• Computational Theory of Mind
• Reinforcement Learning
  – Some vivid examples
• Intuition of RL’s Computational Theory of Mind
  – Reward, policy, value (prediction of reward)
• Some of the Math
  – Policy iteration (values & policies build on each other)
  – Discounted reward
  – TD error
• More speculative extensions
  – Reason
  – Knowledge

Page 21: Reinforcement Learning’s Computational Theory of Mind

The Reward Hypothesis

The mind’s goal is to maximize the cumulative sum of a received scalar signal (reward)

Page 22: Reinforcement Learning’s Computational Theory of Mind

The Reward Hypothesis

The mind’s goal is to maximize the cumulative sum of a received scalar signal (reward)

can be conceived of, understood as

Page 23: Reinforcement Learning’s Computational Theory of Mind

The Reward Hypothesis

The mind’s goal is to maximize the cumulative sum of a received scalar signal (reward)

Must come from outside, not under the mind’s direct control

Page 24: Reinforcement Learning’s Computational Theory of Mind

The Reward Hypothesis

The mind’s goal is to maximize the cumulative sum of a received scalar signal (reward)

a simple, single number (not a vector or symbol structure)

Page 25: Reinforcement Learning’s Computational Theory of Mind

The Reward Hypothesis

• Obvious? Brilliant? Demeaning? Inevitable? Trivial?
• Simple, but not trivial
• May be adequate, may be completely satisfactory
• A good null hypothesis

The mind’s goal is to maximize the cumulative sum of a received scalar signal (reward)

Page 26: Reinforcement Learning’s Computational Theory of Mind

Policies

• A policy maps each state to an action to take
  – Like a stimulus–response rule
• We seek a policy that maximizes cumulative reward
• The policy is a subgoal to achieving reward

[Diagram: Policy built on top of Reward]

Page 27: Reinforcement Learning’s Computational Theory of Mind

Value Functions

• Value functions = predictions of expected reward following states
  – Value: States → Expected future reward
• Moment-by-moment estimates of how well it’s going
• All efficient methods for finding optimal policies first estimate value functions
  – RL methods, state-space planning methods, dynamic programming
• Recognizing and reacting to the ups and downs of life is an important part of intelligence

[Diagram: Value Function built on top of Reward; Policy built on top of the Value Function]

Page 28: Reinforcement Learning’s Computational Theory of Mind

The Mountain Car Problem (Moore, 1990)

Minimum-Time-to-Goal Problem (“Gravity wins”)

SITUATIONS: car’s position and velocity
ACTIONS: three thrusts: forward, reverse, none
REWARDS: always –1 until car reaches the goal

No discounting

[Figure: car in a valley, with the Goal at the top of the right-hand hill]

Page 29: Reinforcement Learning’s Computational Theory of Mind

Value Functions Learned while solving the Mountain Car problem

[Figure: value-function surfaces at several stages of learning; lower is better]

Page 30: Reinforcement Learning’s Computational Theory of Mind

Honeybee Brain & VUM Neuron

Hammer, Menzel

Page 31: Reinforcement Learning’s Computational Theory of Mind

Dopamine Neurons

Signal “Error/Change” in Prediction of Reward

Page 32: Reinforcement Learning’s Computational Theory of Mind

Outline

• Computational Theory of Mind
• Reinforcement Learning
  – Some vivid examples
• Intuition of RL’s Computational Theory of Mind
  – Reward, policy, value (prediction of reward)
• Some of the Math
  – Policy iteration (values & policies build on each other)
  – Discounted reward
  – TD error
• Speculative extensions
  – Reason
  – Knowledge

Page 33: Reinforcement Learning’s Computational Theory of Mind

Notation

• Reward
  r: States → Pr(ℜ),  r_t ∈ ℜ
• Policies
  π: States → Pr(Actions),  π* is optimal
• Value Functions
  V^π: States → ℜ
  V^π(s) = E_π{ Σ_{k=1}^∞ γ^{k−1} r_{t+k} | s_t = s }
  where γ is the discount factor, ≈ 1 but < 1
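To make the definition concrete, here is a tiny Python illustration of the discounted return Σ_{k≥1} γ^{k−1} r_{t+k}, using an invented reward sequence.

```python
# Illustration of the discounted-return definition above:
# V(s_t) is the expectation of sum_{k>=1} gamma^(k-1) * r_{t+k}.
# The reward sequence below is invented purely for illustration.

def discounted_return(rewards, gamma=0.99):
    """Compute sum_k gamma^(k-1) * r_{t+k} for one sample trajectory."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [0, 0, 0, 1.0, 0, -1.0]   # hypothetical rewards r_{t+1}, r_{t+2}, ...
print(discounted_return(rewards))    # gamma near 1 weights distant rewards almost fully
```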

Page 34: Reinforcement Learning’s Computational Theory of Mind

Policy Iteration

Page 35: Reinforcement Learning’s Computational Theory of Mind

Generalized Policy Iteration

[Diagram: policy and value function improve each other in a loop: policy evaluation takes π to V^π, greedification takes V to a policy greedy with respect to V; the process converges to the optimal π* and V*]
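A compact sketch of tabular policy iteration, the alternation the diagram depicts, assuming the MDP is given as (hypothetical) arrays P[s, a, s'] of transition probabilities and R[s, a] of expected immediate rewards.

```python
import numpy as np

# Sketch of tabular policy iteration: alternate policy evaluation and
# greedification until the policy stops changing. P and R describe a
# small, hypothetical MDP supplied by the caller.

def policy_iteration(P, R, gamma=0.95):
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)        # arbitrary initial policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly
        P_pi = P[np.arange(n_states), policy]     # (n_states, n_states)
        R_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Greedification: act greedily with respect to the current V
        Q = R + gamma * P @ V                     # (n_states, n_actions)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V                      # optimal policy and value function
        policy = new_policy
```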

Page 36: Reinforcement Learning’s Computational Theory of Mind

Reward is a time series

[Figure: a reward time series, with the quantity to be predicted from each time step: total future reward vs. discounted (imminent) reward]

Page 37: Reinforcement Learning’s Computational Theory of Mind

Discounting Example

[Figure: an example reward time series illustrating discounting]

Page 38: Reinforcement Learning’s Computational Theory of Mind

What should the prediction error be at time t?

V_t = E{ Σ_{k=1}^∞ γ^{k−1} r_{t+k} }
    = E{ r_{t+1} + γ Σ_{k=1}^∞ γ^{k−1} r_{t+1+k} }
    ≈ r_{t+1} + γ V_{t+1}

TDerror_t = r_{t+1} + γ V_{t+1} − V_t
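Applied directly, this error gives the tabular TD(0) update; in this sketch V is assumed to be a simple table (dict or array) of state-value estimates.

```python
# TD(0) update built from the TD error defined above:
# delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t).
# V is assumed to be a dict (or array) of state-value estimates.

def td_update(V, s, r_next, s_next, alpha=0.1, gamma=0.99):
    td_error = r_next + gamma * V[s_next] - V[s]   # prediction error at time t
    V[s] += alpha * td_error                       # move V(s_t) toward the target
    return td_error
```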

Page 39: Reinforcement Learning’s Computational Theory of Mind

Computation-Theoretical TD Errors

[Figure, three panels of time traces (Reward, Value, TD error): “Reward Unexpected”, “Reward Expected” (with a predictive Cue), and “Reward Absent”]
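A tiny numerical illustration of the three panels, using invented value and reward traces: computing δ_t = r_{t+1} + γV_{t+1} − V_t reproduces a positive error at an unexpected reward, an error that moves back to the predictive cue when the reward is expected, and a negative error when a predicted reward is omitted.

```python
# Invented values for illustration. V[t] is the (already learned) value at
# each time step, r[t] the reward delivered at step t; gamma is 1 for clarity.

def td_errors(V, r, gamma=1.0):
    # delta_t = r_{t+1} + gamma * V_{t+1} - V_t, for each step t
    return [r[t + 1] + gamma * V[t + 1] - V[t] for t in range(len(V) - 1)]

# Reward unexpected: no prediction, reward at t=3 -> positive error at the reward
print(td_errors(V=[0, 0, 0, 0, 0], r=[0, 0, 0, 1, 0]))   # [0, 0, 1, 0]

# Reward expected: cue at t=1 predicts the reward -> error moves back to the cue
print(td_errors(V=[0, 1, 1, 1, 0], r=[0, 0, 0, 0, 1]))   # [1, 0, 0, 0]

# Reward absent: predicted but omitted -> negative error when it fails to arrive
print(td_errors(V=[0, 1, 1, 1, 0], r=[0, 0, 0, 0, 0]))   # [1, 0, 0, -1]
```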

Page 40: Reinforcement Learning’s Computational Theory of Mind

Honeybee Brain & VUM Neuron

Hammer, Menzel

Page 41: Reinforcement Learning’s Computational Theory of Mind

Dopamine Neurons

Signal TD Error in Prediction of Reward

Page 42: Reinforcement Learning’s Computational Theory of Mind

Representation & Algorithm Level:
The many methods for approximating value

[Diagram: the space of value-approximation methods along two dimensions, from shallow (bootstrapping) backups to deep backups and from full backups to sample backups; its corners are Dynamic Programming, Temporal-Difference Learning, Exhaustive Search, and Monte Carlo]

Page 43: Reinforcement Learning’s Computational Theory of Mind

Outline

• Computational Theory of Mind
• Reinforcement Learning
  – Some vivid examples
• Intuition of RL’s Computational Theory of Mind
  – Reward, policy, value (prediction of reward)
• Some of the Math
  – Policy iteration (values & policies build on each other)
  – Discounted reward
  – TD error
• Speculative extensions
  – Reason
  – Knowledge

Page 44: Reinforcement Learning’s Computational Theory of Mind

Tolman & Honzik, 1930: “Reasoning in Rats”

[Figure: maze with a start box, a food box, three paths (Path 1, Path 2, Path 3), and two possible blocks (Block A, Block B)]

Page 45: Reinforcement Learning’s Computational Theory of Mind

Reason as RL over Imagined Experience

1. Learn a model of the world’s transition dynamics
   transition probabilities, expected immediate rewards: a “1-step model” of the world
2. Use model to generate imaginary experiences
   internal thought trials, mental simulation (Craik, 1943)
3. Apply RL as if the experience had really happened

[Diagram: Policy, Value Function, and 1-Step Model built on top of Reward]
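A schematic, Dyna-style sketch of these three steps (Dyna is the standard instance of this idea); the tabular Q, the deterministic one-step model, and the actions helper are simplified placeholders, not a specific published implementation.

```python
import random

# (1) learn a 1-step model from real transitions, (2) generate imagined
# experience from the model, (3) apply the same RL update to it.
# Q: defaultdict(float) over (state, action); model: dict over (state, action);
# actions(state) returns the actions available in a state.

def dyna_step(Q, model, actions, s, a, r, s_next, alpha=0.1, gamma=0.95, n_planning=10):
    # Real experience: ordinary one-step Q-learning update
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions(s_next)) - Q[(s, a)])
    # 1-step model of the world: remember what (s, a) led to
    model[(s, a)] = (r, s_next)
    # Imagined experience: replay model transitions as if they had really happened
    for _ in range(n_planning):
        (si, ai), (ri, si_next) = random.choice(list(model.items()))
        Q[(si, ai)] += alpha * (ri + gamma * max(Q[(si_next, b)] for b in actions(si_next)) - Q[(si, ai)])
```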

Page 46: Reinforcement Learning’s Computational Theory of Mind

Mind is About Predictions

Hypothesis: Knowledge is predictive
  About what-leads-to-what, under what ways of behaving
  What will I see if I go around the corner?
  Objects: What will I see if I turn this over?
  Active vision: What will I see if I look at my hand?
  Value functions: What is the most reward I know how to get?
Such knowledge is learnable, chainable

Hypothesis: Mental activity is working with predictions
  Learning them
  Combining them to produce new predictions (reasoning)
  Converting them to action (planning, reinforcement learning)
  Figuring out which are most useful

Page 47: Reinforcement Learning’s Computational Theory of Mind

An old, simple, appealing idea

• Mind as prediction engine!
• Predictions are learnable, combinable
• They represent cause and effect, and can be pieced together to yield plans
• Perhaps this old idea is essentially correct
• It just needs
  – Development, revitalization in modern forms
  – Greater precision, formalization, mathematics
  – The computational perspective to make it respectable
  – Imagination, determination, patience
• Not rushing to performance