
Passive Reinforcement Learning

Bert Huang, Introduction to Artificial Intelligence

Notation Review

• Recall the Bellman equation:

$$\pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

Alternate version:

$$U(s) = \max_{a \in A(s)} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, U(s') \Big]$$
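For reference, here is a minimal Python sketch of one Bellman backup for a single state. The dictionaries `P`, `R`, and `U` and the discount value are illustrative assumptions, not part of the slides:

```python
def bellman_backup(s, actions, P, R, U, gamma=0.9):
    """One Bellman update for state s:
    U(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) * U(s').

    P[(s, a)]: dict mapping successor state s' -> P(s' | s, a)
    R[s]:      reward for state s
    U:         dict of current utility estimates
    """
    best = max(sum(p * U[s2] for s2, p in P[(s, a)].items()) for a in actions)
    return R[s] + gamma * best


def greedy_action(s, actions, P, U):
    """pi*(s): pick the action that maximizes expected successor utility."""
    return max(actions, key=lambda a: sum(p * U[s2] for s2, p in P[(s, a)].items()))
```

The first function computes the right-hand side of the Bellman equation for one state; the second extracts the corresponding greedy policy.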

Value Iteration Drawbacks

• Computes utility for every state

• Needs exact transition model

• Needs to fully observe state

• Needs to know exact reward for each state

[Figure: Slippery Bridge example, etc.]

Value Iteration vs. Passive Learning vs. Active Learning

• States and rewards
  - Value iteration: observes all states and rewards in the environment
  - Passive learning: observes only states (and rewards) visited by the agent
  - Active learning: observes only states (and rewards) visited by the agent

• Transitions
  - Value iteration: observes all action-transition probabilities
  - Passive learning: observes only transitions that occur from chosen actions
  - Active learning: observes only transitions that occur from chosen actions

• Decisions
  - Value iteration: N/A
  - Passive learning: learning algorithm does not choose actions
  - Active learning: learning algorithm chooses actions

Passive Learning

• Recordings of agent running fixed policy

• Observe states, rewards, actions

• Direct utility estimation

• Adaptive dynamic programming (ADP)

• Temporal-difference (TD) learning

Direct Utility Estimation

$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

$$U^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U^{\pi}(s')$$

The second equation is the future reward of a state assuming we follow this policy.

Direct utility estimation: use observed rewards and future rewards to estimate U (i.e., take the average of samples from the data).
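As a concrete sketch of direct utility estimation, assume each recorded trial is a list of (state, reward) pairs generated by the fixed policy; the trial format and function name are assumptions for illustration:

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma=0.9):
    """Estimate U^pi(s) by averaging observed discounted rewards-to-go.

    trials: list of trials, each a list of (state, reward) pairs
            produced by running the fixed policy pi.
    """
    totals = defaultdict(float)   # sum of sampled returns per state
    counts = defaultdict(int)     # number of samples per state

    for trial in trials:
        # Scan the trial backwards to accumulate the discounted
        # reward-to-go for every visited state.
        future = 0.0
        for state, reward in reversed(trial):
            future = reward + gamma * future
            totals[state] += future
            counts[state] += 1

    return {s: totals[s] / counts[s] for s in totals}
```

Each visited state's reward-to-go is one sample of $U^{\pi}(s)$; averaging the samples across trials gives the estimate.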

Adaptive Dynamic Programming

• Run value iteration using estimated rewards and transition probabilities

Example: observed outcomes of taking action RIGHT from one state:

Action  Result
RIGHT   UP
RIGHT   RIGHT
RIGHT   RIGHT
RIGHT   DOWN
RIGHT   RIGHT

From these samples, the estimated transition probabilities are P(RIGHT result | s, RIGHT) = 3/5, P(UP | s, RIGHT) = 1/5, and P(DOWN | s, RIGHT) = 1/5.

Value iteration update using the estimated quantities:

$$U_{i+1}(s) \leftarrow \hat{R}(s) + \gamma \max_{a \in A(s)} \sum_{s'} \hat{P}(s' \mid s, a)\, U_i(s')$$

where $\hat{R}$ and $\hat{P}$ are the estimates of the reward function and transition model.
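A minimal sketch of the ADP idea, assuming the agent's experience is available as (s, a, r, s') tuples; the data format, function names, and iteration count are illustrative assumptions:

```python
from collections import defaultdict

def adp_estimate(experience, gamma=0.9, iters=100):
    """Estimate R and P from experience, then run value iteration on the estimates.

    experience: list of (s, a, r, s_next) tuples observed while following a policy.
    """
    reward = {}                                # estimated R(s)
    counts = defaultdict(lambda: defaultdict(int))
    tried = defaultdict(set)                   # actions observed in each state
    for s, a, r, s_next in experience:
        reward[s] = r
        counts[(s, a)][s_next] += 1
        tried[s].add(a)

    # Normalize counts into estimated transition probabilities P_hat(s' | s, a).
    P_hat = {sa: {s2: c / sum(out.values()) for s2, c in out.items()}
             for sa, out in counts.items()}

    # Value iteration on the estimated model, over the states we have observed.
    # Under a fixed policy only one action is observed per state, so the max
    # reduces to evaluating that action.
    U = {s: 0.0 for s in reward}
    for _ in range(iters):
        new_U = {}
        for s in reward:
            if tried[s]:
                best = max(sum(p * U.get(s2, 0.0) for s2, p in P_hat[(s, a)].items())
                           for a in tried[s])
                new_U[s] = reward[s] + gamma * best
            else:
                new_U[s] = reward[s]           # no observed outgoing action
        U = new_U
    return U
```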

Temporal-Difference Learning

$$U^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U^{\pi}(s')$$

$$U^{\pi}(s) = R(s) + \gamma\, \mathbb{E}_{s'}\!\left[U^{\pi}(s')\right] = \mathbb{E}_{s'}\!\left[R(s) + \gamma\, U^{\pi}(s')\right]$$

TD update:

$$U^{\pi}(s) \leftarrow U^{\pi}(s) + \alpha\left(R(s) + \gamma\, U^{\pi}(s') - U^{\pi}(s)\right)$$

Here $R(s) + \gamma U^{\pi}(s')$ is the observed utility, $U^{\pi}(s)$ is the current estimate of the utility, and $\alpha$ is the learning rate parameter.

Temporal-Difference Learning

$$U^{\pi}(s) \leftarrow U^{\pi}(s) + \alpha\left(R(s) + \gamma\, U^{\pi}(s') - U^{\pi}(s)\right)$$

Run this update each time we transition from state s to s'.

Converges more slowly than ADP, but the update is much simpler.

Leads to the famous Q-learning algorithm (next video).
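A minimal sketch of this TD update applied along one recorded trial, again assuming the trial is a list of (state, reward) pairs from the fixed policy (names and the step-size value are illustrative):

```python
def td_update(U, trial, alpha=0.1, gamma=0.9):
    """Apply the TD(0) update U(s) += alpha * (R(s) + gamma * U(s') - U(s))
    once per observed transition s -> s' in a single trial.

    U: dict mapping state -> current utility estimate (updated in place).
    trial: list of (state, reward) pairs visited under the fixed policy.
    """
    for (s, r), (s_next, _) in zip(trial, trial[1:]):
        U.setdefault(s, 0.0)
        U.setdefault(s_next, 0.0)
        U[s] += alpha * (r + gamma * U[s_next] - U[s])
    return U
```

In practice the learning rate $\alpha$ is usually decayed over time so that the estimates settle down.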

Passive Learning

• Recordings of agent running fixed policy

• Observe states, rewards, actions

• Direct utility estimation

• Adaptive dynamic programming (ADP)

• Temporal-difference (TD) learning
