Thomas Trappenberg Reinforcement learning


Page 1:

Thomas Trappenberg

Reinforcement learning

Page 2:

Three kinds of learning:

1. Supervised learning

2. Unsupervised learning

3. Reinforcement learning

Detailed teacher that provides the desired output y for a given input x: training set {x, y}; find an appropriate mapping function y = h(x; w)

Unlabeled samples are provided from which the system has to figure out good representations: training set {x}; find sparse basis functions b_i so that x = Σ_i c_i b_i

Delayed feedback from the environment in the form of reward/punishment when reaching state s with action a: reward r(s, a); find the optimal policy a = π*(s). These are the most general learning circumstances.

Page 3:

Maximize expected Utility

www.koerding.com

Page 4:

2. Reinforcement learning

[Figure: grid-world maze example in which each non-terminal step incurs a reward of -0.1. From Russell and Norvig.]

Page 5:

Markov Decision Process (MDP)

Page 6:

Goal: maximize total expected payoff

Two important quantities

policy: a = π(s)

value function: V^π(s) = E[ Σ_t γ^t r_t | s_0 = s ]

Optimal Control

Page 7:

Calculate value function (dynamic programming)

Richard Bellman (1920–1984)

Bellman Equation for policy π:

V^π(s) = r(s, π(s)) + γ Σ_{s'} P(s'|s, π(s)) V^π(s')

(deterministic policies are used to simplify the notation)

Solution: Analytic or Incremental
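As a sketch of the analytic route: for a finite MDP with a fixed policy, the Bellman equation is linear in V and can be solved directly. The 3-state cyclic chain below (P, r, and γ = 0.9) is an invented example, not from the slides.

```python
import numpy as np

gamma = 0.9
# P[s, s']: transition probabilities under the fixed policy (invented chain)
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
r = np.array([0.0, 0.0, 1.0])   # reward collected in each state

# Bellman equation V = r + gamma * P V  rearranges to  (I - gamma*P) V = r
V = np.linalg.solve(np.eye(3) - gamma * P, r)
```

The incremental alternative simply iterates V ← r + γ P V until the values stop changing.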

Page 8:

Remark on different formulations:

Some (like Sutton and Alpaydin, but not Russell & Norvig) define the value as the reward at the next state plus all following rewards:

V(s_t) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … ]

instead of

V(s_t) = E[ r_t + γ r_{t+1} + γ² r_{t+2} + … ]

Page 9:

Policy Iteration
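Policy iteration alternates policy evaluation (a linear solve, as above) with greedy policy improvement until the policy stops changing. A minimal sketch on an invented 2-state, 2-action MDP:

```python
import numpy as np

gamma = 0.9
n_states = 2
# Hypothetical dynamics: P[a, s, s'] transition probabilities, R[a, s] rewards
P = np.array([[[1.0, 0.0], [1.0, 0.0]],    # action 0 always leads to state 0
              [[0.0, 1.0], [0.0, 1.0]]])   # action 1 always leads to state 1
R = np.array([[0.0, 0.0],                  # action 0 gives no reward
              [1.0, 1.0]])                 # action 1 gives reward 1

policy = np.zeros(n_states, dtype=int)     # start with action 0 everywhere
while True:
    # Policy evaluation: solve the linear Bellman equation for this policy
    P_pi = P[policy, np.arange(n_states)]
    R_pi = R[policy, np.arange(n_states)]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # Policy improvement: act greedily with respect to the current values
    new_policy = (R + gamma * P @ V).argmax(axis=0)
    if np.array_equal(new_policy, policy):
        break                              # policy is stable -> optimal
    policy = new_policy
```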

Page 10:

Value Iteration

Bellman Equation for the optimal policy:

V*(s) = max_a [ r(s, a) + γ Σ_{s'} P(s'|s, a) V*(s') ]
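Value iteration applies this optimality backup repeatedly until the values converge; a minimal sketch on an invented 2-state, 2-action MDP:

```python
import numpy as np

gamma = 0.9
P = np.array([[[1.0, 0.0], [1.0, 0.0]],    # P[a, s, s']: action 0 -> state 0
              [[0.0, 1.0], [0.0, 1.0]]])   # action 1 -> state 1
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])                 # R[a, s]

V = np.zeros(2)
for _ in range(10_000):
    # Bellman optimality backup: V(s) <- max_a [ r(s,a) + gamma * E[V(s')] ]
    V_new = (R + gamma * P @ V).max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
policy = (R + gamma * P @ V).argmax(axis=0)  # greedy policy read off from V
```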

Page 11:

But:

- Environment not known a priori
- Observability of states
- Curse of dimensionality

Solution:

- Online (TD) learning
- POMDPs
- Model-based RL

Page 12:

POMDP:

Partially observable MDPs can be reduced to MDPs by considering belief states b:
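The belief update can be sketched as Bayes' rule: after taking action a and observing o, b'(s') ∝ O(o|s') Σ_s P(s'|s, a) b(s). The function below assumes hypothetical arrays P_a and O_o already sliced for the chosen action and received observation:

```python
import numpy as np

def belief_update(b, P_a, O_o):
    """b[s]: current belief; P_a[s, s']: transitions under the chosen action;
    O_o[s']: likelihood of the received observation in each state."""
    b_pred = b @ P_a              # predict: sum_s b(s) P(s'|s,a)
    b_new = O_o * b_pred          # correct with the observation likelihood
    return b_new / b_new.sum()    # normalize to a probability distribution
```

The resulting belief-MDP has continuous states (distributions over s), which is what makes POMDPs hard to solve exactly.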

Page 13:

Online value function estimation (TD learning)

If the environment is not known, use a Monte Carlo method with bootstrapping.

Expected payoff before taking the step: V(s_t)

Expected payoff after taking the step = actual reward plus the discounted expected payoff of the next state: r_t + γ V(s_{t+1})

Temporal difference: δ_t = r_t + γ V(s_{t+1}) − V(s_t)

What if the environment is not completely known?
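The TD update above can be sketched as tabular TD(0) on an invented 1-D random walk (states 0..6, terminal reward 1 on the right); all constants here are illustrative:

```python
import random

gamma, alpha = 1.0, 0.1
V = {s: 0.0 for s in range(7)}    # states 0..6; 0 and 6 are terminal

random.seed(0)
for _ in range(5000):             # episodes
    s = 3                         # always start in the middle
    while s not in (0, 6):
        s_next = s + random.choice((-1, 1))
        r = 1.0 if s_next == 6 else 0.0
        # TD error: actual reward plus discounted estimate of the next state,
        # minus the estimate held before taking the step
        delta = r + gamma * V[s_next] - V[s]
        V[s] += alpha * delta
        s = s_next
```

With enough episodes the estimates approach the true values s/6 of reaching the rewarded end.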

Page 14:

Online optimal control: Exploitation versus Exploration

On-policy TD learning: Sarsa

Off-policy TD learning: Q-learning
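The two updates differ only in which next-state value they bootstrap from: Sarsa uses the action actually chosen next (on-policy), Q-learning uses the greedy action (off-policy). A sketch with a Q-table stored as a dict; the epsilon-greedy helper and all constants are invented for illustration:

```python
import random

def eps_greedy(Q, s, actions, eps=0.1):
    """Exploration vs. exploitation: mostly greedy, sometimes random."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.5, gamma=0.9):
    # On-policy: bootstrap from the action a2 the policy actually takes at s2
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.5, gamma=0.9):
    # Off-policy: bootstrap from the best action at s2, whatever is taken
    best = max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
```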

Page 15:

Model-based RL: TD(λ)

Instead of the tabular methods mainly discussed before, use a function approximator with parameters w and a gradient-descent step (Sutton 1988):

For example, by using a neural network with weights w and the corresponding delta learning rule

when updating the weights after an episode of m steps. The only problem is that we receive the feedback r only after the m-th step, so we need to keep a memory (trace) of the sequence.

Page 16:

Model-based RL: TD(λ) … alternative formulation

We can write

And putting this into the formula and rearranging the sum gives

We still need to keep the cumulative sum of the derivative terms, but otherwise it already looks closer to bootstrapping.

Page 17:

Model-based RL: TD(λ)

We now introduce a new algorithm by weighting recent gradients more heavily than ones in the distant past.

This is called the TD(λ) rule. For λ = 1 we recover the previous rule, TD(1). Also interesting is the other extreme, TD(0),

which uses the prediction V(t+1) as the supervision signal for step t. Otherwise this is equivalent to supervised learning and can easily be generalized to hidden-layer networks.
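A sketch of this rule with a linear function approximator V(s) = w·φ(s), where a decaying trace implements the λ-weighting of past gradients; the random-walk environment, one-hot features, and all constants are invented for illustration:

```python
import numpy as np

gamma, alpha, lam = 1.0, 0.05, 0.8
n = 7                               # states 0..6; 0 and 6 are terminal
w = np.zeros(n)                     # weights of the linear approximator
phi = np.eye(n)                     # one-hot features: V(s) = w . phi(s)

rng = np.random.default_rng(0)
for _ in range(3000):               # episodes of a random walk from the middle
    s, trace = 3, np.zeros(n)
    while s not in (0, 6):
        s2 = s + rng.choice((-1, 1))
        r = 1.0 if s2 == 6 else 0.0
        delta = r + gamma * (w @ phi[s2]) - w @ phi[s]
        # eligibility trace: decaying memory of past gradients dV/dw = phi(s)
        trace = gamma * lam * trace + phi[s]
        w += alpha * delta * trace
        s = s2
```

With one-hot features this reduces to the tabular case; replacing φ with the activations of a network recovers the gradient form of the rule.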

Page 18:

Free-Energy-Based RL:

This can be generalized to Boltzmann machines (Sallans & Hinton 2004)

Paul Hollensen: Sparse, topographic RBM successfully learns to drive the e-puck and avoid obstacles, given training data (proximity sensors, motor speeds)

Page 19:
Page 20:

Ivan Pavlov (1849–1936), Nobel Prize 1904

Classical Conditioning

Rescorla-Wagner Model (1972)
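The model's core is the error-driven update ΔV_i = α (r − Σ_j V_j x_j) x_i for the stimuli present on a trial. A sketch reproducing the classic blocking effect; the two-phase trial schedule and the constants are invented for illustration:

```python
import numpy as np

alpha = 0.2                        # learning rate
V = np.zeros(2)                    # associative strengths of stimuli A and B

# Phase 1: stimulus A alone is paired with reward
for _ in range(100):
    x = np.array([1.0, 0.0])       # A present, B absent
    error = 1.0 - V @ x            # prediction error: reward minus summed V
    V += alpha * error * x         # update only the stimuli that are present

# Phase 2: A and B together predict the same reward -> B learns almost
# nothing, because A already predicts the reward ("blocking")
for _ in range(100):
    x = np.array([1.0, 1.0])
    error = 1.0 - V @ x
    V += alpha * error * x
```

The same prediction-error term is what the dopamine recordings on the following slides are compared against.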

Page 21:

Stimulus B, Stimulus A → Reward

Stimulus A → No reward

Wolfram Schultz

Reward Signals in the Brain

Page 22:
Page 23:

Maia & Frank 2011

Disorders with effects on the dopamine system:

- Parkinson's disease
- Tourette's syndrome
- ADHD
- Drug addiction
- Schizophrenia

Page 24:

Conclusion and Outlook

Three basic categories of learning:

Supervised: lots of progress through statistical learning theory; kernel machines, graphical models, etc.

Unsupervised: hot research area with some progress; deep temporal learning.

Reinforcement: important topic in animal behavior; model-based RL.