Reinforcement Learning 2 - courses.cit.cornell.edu · Reinforcement Learning 2 Pantelis P. Analytis...

Reinforcement Learning 2

Pantelis P. Analytis

Introduction

Temporal difference learning

Q-learning

Applications

Midterm revision

Reinforcement Learning 2

Pantelis P. Analytis

March 24, 2018

1 Introduction

2 Temporal difference learning

3 Q-learning

4 Applications

5 Midterm revision

Different types of learning

Characteristics of reinforcement learning

Evaluative feedback.

Sequentiality, delayed rewards.

Need for trial and error, to explore as well as to exploit.

Non-stationary world.
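
The characteristics listed above can be illustrated with a small multi-armed bandit. The sketch below is only an illustration under stated assumptions: the function name `epsilon_greedy_bandit`, the arm means, the Gaussian noise, and the values of `epsilon` and `alpha` are all hypothetical and do not come from the slides. The agent sees only the reward of the arm it pulled (evaluative feedback), explores with probability `epsilon` and exploits otherwise, and uses a constant step size so that old rewards are gradually forgotten, which is a common choice when the world may be non-stationary.

```python
import random

def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, alpha=0.1, seed=0):
    """Epsilon-greedy learning on a hypothetical Gaussian bandit."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms   # current reward estimate for each arm
    pulls = [0] * n_arms         # how often each arm was chosen
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                        # explore
        else:
            arm = max(range(n_arms), key=lambda i: estimates[i])  # exploit
        # Evaluative feedback: only the reward of the chosen arm is observed.
        reward = rng.gauss(true_means[arm], 1.0)
        # Constant step size: recent rewards weigh more than old ones.
        estimates[arm] += alpha * (reward - estimates[arm])
        pulls[arm] += 1
    return estimates, pulls

estimates, pulls = epsilon_greedy_bandit([0.2, 0.5, 2.0])
print(pulls)
```

With these assumed settings the arm with the highest mean should end up being pulled most often, while the other arms are still sampled occasionally through exploration.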

Temporal difference learning

Broadly used to predict future rewards.

It appears to be how the brain reward system works.

It is learning a prediction from another, later, learned prediction.

The TD error is the difference between two predictions, the temporal difference.

Temporal difference learning

$$V(s) \leftarrow V(s) + \alpha \big( \overbrace{r + \gamma V(s')}^{\text{the TD target}} - V(s) \big)$$

$r + \gamma V(s')$ is known as the TD target.
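
The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function name `td0_update`, the dictionary-based value table, the two-state example, and the values of `alpha` and `gamma` are assumptions made for the example.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD update to the value table V (a dict) in place."""
    td_target = r + gamma * V[s_next]   # r + gamma * V(s')
    td_error = td_target - V[s]         # difference between two predictions
    V[s] += alpha * td_error            # V(s) <- V(s) + alpha * TD error
    return td_error

# Hypothetical example: state "A" always leads to the terminal state "end"
# with reward 1, and the terminal state keeps a value of 0.
V = {"A": 0.0, "end": 0.0}
for _ in range(100):
    td0_update(V, "A", r=1.0, s_next="end")
print(V["A"])
```

Each update moves V("A") a fraction `alpha` of the way toward the TD target, so under these assumptions the estimate approaches 1 and the TD error shrinks toward 0.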

Temporal difference learning in the brain (Schultz, Dayan, & Montague, 1997)

Temporal difference learning in the brain

$$V(s) \leftarrow V(s) + \alpha \big( \overbrace{r + \gamma V(s')}^{\text{the TD target}} - V(s) \big)$$

$r + \gamma V(s')$ is known as the TD target.

Temporal difference learning: example

Predicting the outcome of a game like chess or backgammon.

Long-term predictions by simulation are complex, and even small errors in one-step predictions might be amplified.

Q-learning

Q-learning converges to the optimal action values even if you are acting sub-optimally, provided every state-action pair keeps being tried.
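
The claim above can be illustrated with tabular Q-learning on a toy problem. The sketch below rests on assumptions that are not in the slides: the environment is a hypothetical four-state chain with a reward of 1 for reaching the last state, the function name `q_learning_chain` is made up, and the values of `alpha` and `gamma` are arbitrary. The agent behaves completely at random, which is clearly sub-optimal, yet the update uses the best available next action in its target rather than the action actually taken.

```python
import random

def q_learning_chain(episodes=2000, alpha=0.1, gamma=0.9, seed=0):
    """Tabular Q-learning on a hypothetical chain 0 - 1 - 2 - 3 (goal)."""
    rng = random.Random(seed)
    n_states, actions = 4, (-1, +1)     # actions: step left or step right
    goal = n_states - 1
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    for _ in range(episodes):
        s = 0
        while s != goal:
            a = rng.choice(actions)     # behaviour policy: uniformly random
            s_next = min(max(s + a, 0), goal)
            r = 1.0 if s_next == goal else 0.0
            # The target uses the best next action, not the one taken next.
            best_next = 0.0 if s_next == goal else max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

Q = q_learning_chain()
greedy = {s: max((-1, +1), key=lambda a: Q[(s, a)]) for s in range(3)}
print(greedy)
```

Under these assumptions the greedy policy read off from the learned table should step right in every state, even though the agent never followed that policy while learning.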

Model-based and model-free learning

Many situations involve conflict between a model-free system like TD-learning and a model-based system that plans ahead.

Samuel’s checkers program

Inspired by Shannon’s paper on chess-playing computers.

It achieved a good, but not expert, level of play.

Used a learning process that was similar to TD-learning.

Tesauro’s TD-Gammon

Developed in 1992 by Gerald Tesauro. After playing 300,000 games against itself, it performed approximately at the level of human world-class players.

Atari breakthrough

Google DeepMind trained an agent that learned 49 Atari games, receiving only the pixels of the screen as input and evaluating the rewards expected from different joystick actions. It reached roughly human-level play on about half of them.

AlphaGo

AlphaGo searched and planned much deeper in the game tree.

It uses reinforcement learning to evaluate which paths are worth searching.

Attention allocation in online interfaces

Music lab experiment

Learning from others

Clinical vs. actuarial decision making

Exploration-exploitation dilemma

Iowa gambling task
