
CHAPTER 11
REINFORCEMENT LEARNING VIA TEMPORAL DIFFERENCES

• Organization of chapter in ISSO
  – Introduction
  – Delayed reinforcement
  – Basic temporal difference algorithm
  – Batch and online implementations of TD
  – Examples
  – Connections to stochastic approximation

Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall


11-2

Reinforcement Learning

• Reinforcement learning is important class of methods in computer science, AI, engineering, etc.
  – Based on common-sense idea that good results are reinforced while bad results provide negative reinforcement
• Delayed reinforcement only provides output after several intermediate "actions"
• Want to create model for predicting state of system
  – Model depends on parameters θ
  – "Training" or "learning" (estimating θ) not based on methods such as stochastic gradient (supervised learning) because of delay in response
  – Need learning method to cope with delayed response


11-3

Schematic of Delayed Reinforcement Process

• Suppose time moves left to right in diagram below
• Z represents some system output at a future time
• $\hat{z}_0, \hat{z}_1, \ldots, \hat{z}_n$ represent some intermediate predictions of Z


11-4

Temporal Difference (TD) Learning

• Focus is delayed reinforcement problem
• Prediction function has form h(θ, x), where θ are parameters and x is input
• Need to estimate θ from sequence of inputs and outputs {x0, x1, ..., xn; Z}
• TD learning is method for using the intermediate predictions $\hat{z}_0, \hat{z}_1, \ldots, \hat{z}_n$ in training θ rather than only inputs and outputs
  – Implies that some forms of TD allow for updating of θ value before observing Z
  – TD exploits prior information embedded in predictions to modify θ
• Basic form of TD for updating θ is
  $\hat{\theta}_{\text{new}} = \hat{\theta}_0 + \sum_{k=0}^{n} (\cdot)_k$,
  where $(\cdot)_k$ is increment to be determined
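The slide leaves the increment $(\cdot)_k$ unspecified. As an illustration only, the sketch below uses a TD(0)-style increment, in which each prediction is pushed toward the next prediction in the sequence (with Z as the final target), for an assumed linear prediction function h(θ, x) = θᵀx; the step size `eta` and the function names are illustrative assumptions, not notation from ISSO.

```python
import numpy as np

def h(theta, x):
    """Assumed linear prediction function h(theta, x) = theta^T x."""
    return float(theta @ x)

def td0_increments(theta, xs, Z, eta=0.1):
    """Sketch of TD(0)-style increments for one observed sequence.

    xs : inputs x_0, ..., x_n (NumPy vectors) for one sequence
    Z  : final system output observed after x_n
    Returns a list of increments; their sum gives the total change to theta.
    """
    z_hat = [h(theta, x) for x in xs]      # intermediate predictions z_hat_0, ..., z_hat_n
    targets = z_hat[1:] + [Z]              # each prediction is compared to the next one
    # For the linear form, the gradient of h(theta, x) with respect to theta is x itself
    return [eta * (targets[k] - z_hat[k]) * xs[k] for k in range(len(xs))]
```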


11-5

Exercise 11.4 in ISSO: Conceptual Example of Benefits of TD. Circles denote game states.

[Diagram: game states "Novel" and "Bad", followed by the game outcome: Loss (90%) or Win (10%)]


11-6

Batch Version of TD Learning
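The extracted text retains only this slide's title. As a stand-in, here is a minimal sketch of a standard batch TD update in the TD(λ) form due to Sutton, in which all increments for a sequence are accumulated and θ is updated once at the end; the linear predictor, step size `a`, and trace weight `lam` are assumptions for illustration and are not taken from the slide.

```python
import numpy as np

def batch_td_lambda(theta, xs, Z, a=0.05, lam=0.7):
    """One batch TD(lambda)-style pass: accumulate increments over the whole
    sequence, then apply them to theta in a single update."""
    z_hat = [float(theta @ x) for x in xs] + [Z]   # predictions, with Z closing the sequence
    total = np.zeros_like(theta, dtype=float)
    trace = np.zeros_like(theta, dtype=float)      # decayed sum of past prediction gradients
    for k, x in enumerate(xs):
        trace = lam * trace + x                    # gradient of theta^T x is x
        total = total + a * (z_hat[k + 1] - z_hat[k]) * trace
    return theta + total                           # theta_new = theta_old + sum of increments
```

An online variant would instead apply each increment immediately after it is computed, which connects to the "batch and online implementations" item in the chapter outline.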


11-7

Random-Walk Model (Example 11.3 in ISSO)

• All walks begin in state S3
• Each step involves 50–50 chance of moving left or right until terminal state Tleft or Tright is reached
• Use TD to estimate probabilities of reaching Tright from any of states S1, S2, S3, S4, or S5

[Diagram: states arranged left to right as Tleft, S1, S2, S3, S4, S5, Tright; "Start" marks S3]
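A minimal simulation of this example, using a tabular TD(0)-style update with a constant step size (the step size, episode count, and 0.5 initialization are illustrative assumptions; for this symmetric walk the true probabilities of reaching Tright from S1, ..., S5 are 1/6, 2/6, 3/6, 4/6, 5/6):

```python
import random

def random_walk_td0(episodes=2000, alpha=0.05):
    """Tabular TD(0) estimates of P(reach Tright | state) for the random-walk model."""
    # States 1..5 correspond to S1..S5; 0 is Tleft (value 0), 6 is Tright (value 1).
    V = {s: 0.5 for s in range(1, 6)}
    V[0], V[6] = 0.0, 1.0
    for _ in range(episodes):
        s = 3                                        # every walk begins in S3
        while s not in (0, 6):
            s_next = s + random.choice((-1, 1))      # 50-50 step left or right
            V[s] += alpha * (V[s_next] - V[s])       # move V(s) toward successor's value
            s = s_next
    return {f"S{s}": round(V[s], 3) for s in range(1, 6)}

print(random_walk_td0())   # estimates should be near 1/6, 2/6, 3/6, 4/6, 5/6
```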