CHAPTER 11
REINFORCEMENT LEARNING VIA TEMPORAL DIFFERENCES
• Organization of chapter in ISSO
– Introduction
– Delayed reinforcement
– Basic temporal difference algorithm
– Batch and online implementations of TD
– Examples
– Connections to stochastic approximation
Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall
11-2
Reinforcement Learning
• Reinforcement learning is an important class of methods in computer science, AI, engineering, etc.
– Based on the common-sense idea that good results are reinforced while bad results provide negative reinforcement
• Delayed reinforcement only provides output after several intermediate “actions”
• Want to create a model for predicting the state of the system
– Model depends on $\theta$
– “Training” or “learning” (estimating $\theta$) is not based on methods such as stochastic gradient (supervised learning) because of the delay in response
– Need a learning method to cope with delayed response
11-3
Schematic of Delayed Reinforcement Process
• Suppose time moves left to right in diagram below
• Z represents some system output at a future time
• $\hat{z}_0, \hat{z}_1, \ldots, \hat{z}_n$ represent some intermediate predictions of Z
11-4
Temporal Difference (TD) Learning
• Focus is the delayed reinforcement problem
• Prediction function has form $h(\theta, x)$, where $\theta$ are parameters and x is input
• Need to estimate $\theta$ from sequence of inputs and outputs {x0, x1, ..., xn; Z}
• TD learning is a method for using the intermediate predictions $\hat{z}_\tau$ in training rather than only inputs and outputs
– Implies that some forms of TD allow for updating of the $\theta$ value before observing Z
– TD exploits prior information embedded in the predictions $\hat{z}_\tau$ to modify $\theta$
• Basic form of TD for updating $\theta$ is
$$\hat{\theta}_{\text{new}} = \hat{\theta}_0 + \sum_{\tau=0}^{n} (\delta\theta)_\tau,$$
where $(\delta\theta)_\tau$ is an increment to be determined
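The basic form above (current estimate plus a sum of increments over the sequence) can be sketched in a few lines. This is only an illustration under stated assumptions, not the slides' own algorithm: the linear predictor in `predict`, the step size `a`, and the specific increment (a Sutton-style rule that multiplies each temporal difference by the accumulated gradients, with the observed outcome Z standing in for the prediction after the last input) are all choices made here, and the function names are hypothetical.

```python
def predict(theta, x):
    """Assumed linear predictor h(theta, x) = theta . x."""
    return sum(t * xi for t, xi in zip(theta, x))


def batch_td_update(theta, xs, Z, a=0.1):
    """Return theta plus the sum of increments over one observed sequence.

    Assumed increment for step tau (not taken from the slides):
        a * (z_{tau+1} - z_tau) * sum_{i <= tau} grad z_i,
    where grad z_i = x_i for the linear predictor and the "prediction"
    following the last input is the observed outcome Z.
    """
    # All predictions use the same theta: nothing changes until the end.
    preds = [predict(theta, x) for x in xs] + [Z]
    grad_sum = [0.0] * len(theta)   # running sum of gradients x_0..x_tau
    total = [0.0] * len(theta)      # running sum of increments
    for tau, x in enumerate(xs):
        grad_sum = [g + xi for g, xi in zip(grad_sum, x)]
        diff = preds[tau + 1] - preds[tau]   # temporal difference
        total = [t + a * diff * g for t, g in zip(total, grad_sum)]
    return [t + d for t, d in zip(theta, total)]


if __name__ == "__main__":
    xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(batch_td_update([0.5, -0.2], xs, Z=1.0, a=0.1))
```

With this particular increment the temporal differences telescope, so the summed update equals the supervised update a·Σ (Z − ẑ_i)·x_i computed after Z is observed. Other increment choices would not share that property.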
11-5
Exercise 11.4 in ISSO: Conceptual Example of Benefits of TD. Circles denote game states.
[Figure: game-state diagram with states labeled “Novel” and “Bad” and game outcomes “Loss” and “Win”; two branches are labeled 90% and 10%.]
11-6
Batch Version of TD Learning
11-7
Random-Walk Model (Example 11.3 in ISSO)
• All walks begin in state S3
• Each step involves 50–50 chance of moving left or right until terminal state Tleft or Tright is reached
• Use TD to estimate probabilities of reaching Tright from any of states S1, S2, S3, S4, or S5
Tleft ← S1 ↔ S2 ↔ S3 (Start) ↔ S4 ↔ S5 → Tright
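The random-walk setup can be simulated directly. The sketch below is one possible way to do it, not necessarily the implementation used in Example 11.3: it assumes a lookup-table predictor (one value per state), an online one-step TD update applied after every move, a constant step size `a`, and terminal values fixed at 0 for Tleft and 1 for Tright. The function name and defaults are hypothetical.

```python
import random


def td_random_walk(episodes=5000, a=0.05, seed=0):
    """Estimate P(reach Tright) for S1..S5 with an online one-step TD update."""
    rng = random.Random(seed)
    # Index 0 = Tleft, 1..5 = S1..S5, 6 = Tright.
    # Terminal values are fixed: 0 at Tleft, 1 at Tright.
    v = [0.0] + [0.5] * 5 + [1.0]
    for _ in range(episodes):
        s = 3                                   # every walk starts in S3
        while 0 < s < 6:
            s_next = s + rng.choice((-1, 1))    # 50-50 step left or right
            # Move the current prediction toward the next prediction.
            v[s] += a * (v[s_next] - v[s])
            s = s_next
    return v[1:6]


if __name__ == "__main__":
    for name, est in zip(["S1", "S2", "S3", "S4", "S5"], td_random_walk()):
        print(name, round(est, 3))
```

For this symmetric walk the exact probabilities are 1/6, 2/6, 3/6, 4/6, and 5/6 for S1 through S5, which gives a reference for judging the estimates. Because the step size is constant, the estimates keep fluctuating around those values rather than converging exactly.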