Passive Reinforcement Learning Bert Huang Introduction to Artificial Intelligence



Page 1: Passive Reinforcement Learning - Virginia Tech (courses.cs.vt.edu/cs4804/Fall16/pdfs/10 Passive...)

Passive Reinforcement Learning

Bert Huang Introduction to Artificial Intelligence

Page 2:

Notation Review

• Recall the Bellman equation:

\pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a) \, U(s')

U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a) \, U(s')

U(s) = \max_{a \in A(s)} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, U(s') \right] \quad \text{(alternate version)}

Page 3:

Value Iteration Drawbacks

• Computes utility for every state

• Needs an exact transition model

• Needs to fully observe the state

• Needs to know the exact reward for each state

Page 4:

Slippery Bridge, etc.

Page 5:

Value Iteration vs. Passive Learning vs. Active Learning

States and rewards:
• Value iteration: observes all states and rewards in the environment
• Passive learning: observes only states (and rewards) visited by the agent
• Active learning: observes only states (and rewards) visited by the agent

Transitions:
• Value iteration: observes all action-transition probabilities
• Passive learning: observes only transitions that occur from chosen actions
• Active learning: observes only transitions that occur from chosen actions

Decisions:
• Value iteration: N/A
• Passive learning: learning algorithm does not choose actions
• Active learning: learning algorithm chooses actions

Page 6:

Passive Learning

• Recordings of an agent running a fixed policy

• Observe states, rewards, actions

• Direct utility estimation

• Adaptive dynamic programming (ADP)

• Temporal-difference (TD) learning

Page 7:

Direct Utility Estimation

U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a) \, U(s')

U^\pi(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, U^\pi(s')

(the second equation gives the future reward of a state assuming we follow this policy)

Direct utility estimation: use observed rewards and future rewards to estimate U (i.e., take the average of samples from the data)
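The averaging idea above can be sketched in a few lines. This is a minimal illustration, not from the slides: the state names, trajectory format, and discount value below are assumptions for the example.

```python
# Direct utility estimation: average the observed discounted returns
# from each state across recorded trajectories under a fixed policy.
from collections import defaultdict

def direct_utility_estimation(trajectories, gamma=0.9):
    """trajectories: list of episodes, each a list of (state, reward) pairs."""
    returns = defaultdict(list)
    for traj in trajectories:
        # Scan backward to compute the discounted return from each position.
        g = 0.0
        returns_to_go = []
        for _, r in reversed(traj):
            g = r + gamma * g
            returns_to_go.append(g)
        returns_to_go.reverse()
        for (s, _), g in zip(traj, returns_to_go):
            returns[s].append(g)
    # The utility estimate for each state is the sample average.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Two short recorded episodes (illustrative states A, B, C):
trajs = [[("A", 0.0), ("B", 0.0), ("C", 1.0)],
         [("A", 0.0), ("C", 1.0)]]
U = direct_utility_estimation(trajs, gamma=0.5)
# U["A"] averages the two observed returns 0.25 and 0.5, giving 0.375.
```

Note that each visit to a state contributes one sample, so estimates for rarely visited states are noisy, which is one motivation for ADP and TD below.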

Page 8:

Adaptive Dynamic Programming

• Run value iteration using estimated rewards and transition probabilities

Page 9:

Adaptive Dynamic Programming

• Run value iteration using estimated rewards and transition probabilities

Example observations (the agent attempts RIGHT five times):

Action  Result
RIGHT   UP
RIGHT   RIGHT
RIGHT   RIGHT
RIGHT   DOWN
RIGHT   RIGHT
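From observation counts like these, the transition model can be estimated by maximum likelihood. A minimal sketch (the state name "s" and dict layout are illustrative assumptions):

```python
# Estimate P(s' | s, a) as the fraction of times each outcome s' was
# observed after taking action a in state s.
from collections import Counter, defaultdict

def estimate_transition_model(observations):
    """observations: iterable of (state, action, next_state) triples."""
    counts = defaultdict(Counter)
    for s, a, s_next in observations:
        counts[(s, a)][s_next] += 1
    model = {}
    for sa, c in counts.items():
        total = sum(c.values())
        model[sa] = {s_next: n / total for s_next, n in c.items()}
    return model

# The five observed outcomes of action RIGHT from the table above:
obs = [("s", "RIGHT", r) for r in ["UP", "RIGHT", "RIGHT", "DOWN", "RIGHT"]]
P = estimate_transition_model(obs)
# P[("s", "RIGHT")] -> {"UP": 0.2, "RIGHT": 0.6, "DOWN": 0.2}
```

With only five samples the estimate (0.6 for RIGHT) is still far from a true slip model such as 0.8; the estimates sharpen as more transitions are observed.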

Page 10:

Adaptive Dynamic Programming

• Run value iteration using estimated rewards and transition probabilities

U_{i+1}(s) \leftarrow \hat{R}(s) + \gamma \max_{a \in A(s)} \sum_{s'} \hat{P}(s' \mid s, a) \, U_i(s')

where \hat{R}(s) is the estimate of the reward and \hat{P}(s' \mid s, a) is the estimate of the transition probability.
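The update above is ordinary value iteration run on the learned model. A minimal sketch, assuming the dict layouts below (states, actions, and the toy two-state chain are illustrative, not from the slides):

```python
# ADP step: run value iteration using estimated rewards R_hat and
# estimated transition probabilities P_hat.
def adp_value_iteration(R_hat, P_hat, actions, gamma=0.9, n_iters=100):
    """R_hat[s] = estimated reward; P_hat[(s, a)] = {s_next: probability};
    actions[s] = list of actions available in s."""
    U = {s: 0.0 for s in R_hat}
    for _ in range(n_iters):
        # One Bellman backup per state, exactly as in the update above.
        U = {s: R_hat[s] + gamma * max(
                 sum(p * U[s2] for s2, p in P_hat[(s, a)].items())
                 for a in actions[s])
             for s in R_hat}
    return U

# Toy two-state chain: A leads deterministically to B, which is
# self-absorbing and yields reward 1 each step.
R_hat = {"A": 0.0, "B": 1.0}
P_hat = {("A", "go"): {"B": 1.0}, ("B", "go"): {"B": 1.0}}
actions = {"A": ["go"], "B": ["go"]}
U = adp_value_iteration(R_hat, P_hat, actions, gamma=0.5, n_iters=60)
# U["B"] converges to 1/(1 - 0.5) = 2.0, and U["A"] to 0.5 * 2.0 = 1.0.
```

In a full ADP agent the model estimates are refreshed as new transitions arrive, and value iteration is rerun (or warm-started) after each update.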

Page 11:

Temporal-Difference Learning

U^\pi(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, U^\pi(s')

U^\pi(s) = R(s) + \gamma \, \mathbb{E}_{s'}[U^\pi(s')]

U^\pi(s) = \mathbb{E}_{s'}[R(s) + \gamma U^\pi(s')]

U^\pi(s) \leftarrow U^\pi(s) + \alpha \left( R(s) + \gamma U^\pi(s') - U^\pi(s) \right)

where \alpha is the learning rate parameter, R(s) + \gamma U^\pi(s') is the observed utility, and U^\pi(s) is the current estimate of the utility.

Page 12:

Temporal-Difference Learning

U^\pi(s) \leftarrow U^\pi(s) + \alpha \left( R(s) + \gamma U^\pi(s') - U^\pi(s) \right)

Run this update each time the agent transitions from state s to s'.

Converges more slowly than ADP, but the update is much simpler.

Leads to the famous Q-learning algorithm (next video)
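The TD update is small enough to state directly in code. A minimal sketch (the state names and parameter values in the example are illustrative assumptions):

```python
# TD(0) update: after observing a transition s -> s_next with reward r,
# nudge U[s] toward the observed utility r + gamma * U[s_next].
def td_update(U, s, r, s_next, alpha=0.1, gamma=0.9):
    U.setdefault(s, 0.0)
    U.setdefault(s_next, 0.0)
    U[s] += alpha * (r + gamma * U[s_next] - U[s])
    return U

# One update on a transition from A to B, where U[B] is already 1.0:
U = {"B": 1.0}
td_update(U, "A", 0.0, "B", alpha=0.5, gamma=0.9)
# U["A"] moves halfway from 0.0 toward 0.0 + 0.9 * 1.0, landing at 0.45.
```

Unlike ADP, this touches only the one state just left, and it never stores a transition model, which is why each update is so cheap.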

Page 13:

Passive Learning

• Recordings of an agent running a fixed policy

• Observe states, rewards, actions

• Direct utility estimation

• Adaptive dynamic programming (ADP)

• Temporal-difference (TD) learning