A brief intro to Deep Reinforcement Learning

CEng 783 – Deep Learning, Fall 2019

Emre Akbaş

03.12.2019 METU CEng 783 - Deep Learning - E. Akbas 3

Today

● Brief introduction to Reinforcement Learning
  – Concepts
  – Markov Decision Processes
  – Bellman Equation
  – Q-learning

● Deep Q-learning

● Deep policy gradients


Reinforcement learning

Environment

Agent

Action

Reward

State

The agent is situated & operating in an environment


The environment is typically stochastic.

State: a partial or full description of the environment at a certain time.

Reward: some actions may result in positive/negative rewards.

Policy: the rule(s) that specify how to choose an action.

Reinforcement learning

Example: the backgammon game

● State

● Agent

● Action

● Reward

● Policy

● Environment (generally stochastic)


How could we train an agent that would perform well in such a setting?

Performing well = maximizing total reward.

Looking at only immediate rewards would not work well.

We need to take into account “future” rewards.



At time t, the total future reward is

R_t = r_t + r_{t+1} + r_{t+2} + … + r_T

We want to take the action that maximizes R_t. BUT we have to consider the fact that the environment is stochastic: the further a reward lies in the future, the less certain we are of actually obtaining it.

So, discounting the future rewards is a good idea.


Discounted future rewards:

R_t = r_t + γ r_{t+1} + γ² r_{t+2} + … = r_t + γ R_{t+1}

where γ (gamma) is the discount factor, 0 ≤ γ ≤ 1. The further a reward is in the future, the less we care about it.
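The recursion R_t = r_t + γ R_{t+1} is easy to check in code. A minimal sketch (the reward sequence and γ values are arbitrary examples):

```python
def discounted_returns(rewards, gamma):
    """Compute R_t = r_t + gamma * R_{t+1} for every time step t."""
    R = 0.0
    out = []
    for r in reversed(rewards):  # work backwards from the final reward
        R = r + gamma * R
        out.append(R)
    return list(reversed(out))

rewards = [1.0, 0.0, 0.0, 2.0]
print(discounted_returns(rewards, gamma=0.9))
# gamma = 0 keeps only the immediate reward at each step:
print(discounted_returns(rewards, gamma=0.0))  # [1.0, 0.0, 0.0, 2.0]
```

Working backwards makes the computation linear in the episode length instead of quadratic.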



Discounted future rewards:

● γ = 0
  – Considers the immediate rewards only.
  – Short-sighted; won't work well.

● γ = 1
  – Future rewards are not discounted.
  – Should work in deterministic environments.



We’ll come back to this with value-based methods.


The goal of RL

θ* = argmax_θ E_{τ ~ p_θ(τ)} [ Σ_t r(s_t, a_t) ], where τ denotes a trajectory (s_1, a_1, s_2, a_2, …)

[Figure from Sergey Levine‘s slides.]http://rll.berkeley.edu/deeprlcourse/f17docs/lecture_3_rl_intro.pdf

[Slide by David Silver] http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/intro_RL.pdf


A policy-gradient method

● Based on directly differentiating the objective function (the expected total reward) with respect to the policy parameters.

Writing J(θ) = E_{τ ~ p_θ(τ)} [ r(τ) ], the log-derivative trick (∇_θ p_θ = p_θ ∇_θ log p_θ) turns the gradient into an expectation we can estimate by sampling:

∇_θ J(θ) = E_{τ ~ p_θ(τ)} [ ∇_θ log p_θ(τ) r(τ) ]

Since p_θ(τ) = p(s_1) Π_t π_θ(a_t | s_t) p(s_{t+1} | s_t, a_t), the initial-state and dynamics terms do not depend on θ and drop out of ∇_θ log p_θ(τ):

∇_θ J(θ) = E_{τ ~ p_θ(τ)} [ ( Σ_t ∇_θ log π_θ(a_t | s_t) ) ( Σ_t r(s_t, a_t) ) ]

Markov Decision Process

● Relies on the Markov assumption

● The probability of the next state depends only on the current state and the action but not on preceding states or actions.

[Image from nervanasys.com]
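Under the Markov assumption, a small environment is fully specified by a table P(s' | s, a): the distribution over the next state given only the current state and action. A tiny sketch (the two-state, two-action MDP below is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented 2-state, 2-action MDP: P[s, a] is the distribution over next states.
# The Markov assumption means this table is all we need -- no history of
# earlier states or actions is required.
P = np.array([[[0.9, 0.1],    # s=0, a=0
               [0.2, 0.8]],   # s=0, a=1
              [[0.5, 0.5],    # s=1, a=0
               [0.0, 1.0]]])  # s=1, a=1

def sample_next_state(s, a):
    """Next state depends only on the current (s, a) pair."""
    return int(rng.choice(2, p=P[s, a]))

assert np.allclose(P.sum(axis=2), 1.0)  # each row is a valid distribution
print(sample_next_state(0, 1))
```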



One episode (e.g. a game from start to finish) of this process forms a sequence: s_0, a_0, r_1, s_1, a_1, r_2, s_2, …, s_{n−1}, a_{n−1}, r_n, s_n


The REINFORCE algorithm

The previous derivation results in the following algorithm [from Sergey Levine's slides]:

1) Sample a set of trajectories {τ^i} by running the policy π_θ(a | s)

2) Estimate the gradient: ∇_θ J(θ) ≈ (1/N) Σ_i ( Σ_t ∇_θ log π_θ(a_t^i | s_t^i) ) ( Σ_t r(s_t^i, a_t^i) )

3) Update: θ ← θ + α ∇_θ J(θ)

An example application is in the next slides.


Example: The “Pong” game

Image from http://karpathy.github.io/2016/05/31/rl/

State: the current frame minus the previous frame (to capture motion)

Action: UP or DOWN

Reward: +1 if you win the game, otherwise -1


Example: “Pong”

Image from http://karpathy.github.io/2016/05/31/rl/


Learning the optimal policy

1) Initialize the policy network randomly

2) Play a batch of episodes

3) Label all the actions in a WIN game as CORRECT

4) Label all the actions in a LOST game as INCORRECT

5) Apply supervised learning, i.e. maximize

Σ_i r_i log π(a_i | s_i)

where r_i = +1 for any action in a WON game, and −1 otherwise

6) Go to step 2, repeat until convergence.
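The steps above can be sketched in a few lines of NumPy. This is an illustrative toy only: the policy here is a linear-logistic model over made-up state features and the win/loss outcomes are random, whereas the lecture's example uses a neural network on Pong frames.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, lr = 4, 0.1
w = rng.normal(scale=0.1, size=n_features)  # policy parameters (step 1)

def pi_up(s, w):
    """Probability of choosing UP in state s (sigmoid of a linear score)."""
    return 1.0 / (1.0 + np.exp(-(s @ w)))

def episode_grad(states, actions, r):
    """Gradient of sum_i r * log pi(a_i | s_i) for one episode.

    r = +1 for every action of a WON game, -1 for a LOST game
    (steps 3-4: label actions CORRECT / INCORRECT)."""
    g = np.zeros_like(w)
    for s, a in zip(states, actions):
        p = pi_up(s, w)
        # grad of log pi(a|s) w.r.t. w is (a - p) * s, for a in {0: DOWN, 1: UP}
        g += r * (a - p) * s
    return g

# Step 2: "play" a batch of episodes (random states and outcomes here).
for _ in range(50):
    states = rng.normal(size=(10, n_features))
    actions = (rng.random(10) < pi_up(states, w)).astype(float)
    r = 1.0 if rng.random() < 0.5 else -1.0     # pretend win/loss result
    w += lr * episode_grad(states, actions, r)  # step 5: gradient ascent

print(w)
```

The weighting by r_i is what distinguishes this from ordinary supervised learning: actions from lost games are pushed to become less likely.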


Q-learning: a value-based method

● The Q-function Q(s_t, a_t): the expected total (discounted) future reward from taking action a_t in state s_t and acting optimally afterwards. It satisfies

Q(s, a) = r + γ max_{a'} Q(s', a')

This is called the Bellman equation.



Remember the discounted future rewards: R_t = r_t + γ R_{t+1}. The Bellman equation is exactly this recursion, applied to the best achievable future reward.


The policy π

The rule that specifies which action to choose given the current state.

A typical and sensible policy is the greedy one: π(s) = argmax_a Q(s, a)
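In practice the greedy rule argmax_a Q(s, a) is usually softened with ε-greedy exploration: act randomly with small probability ε, greedily otherwise. A minimal sketch (the 3×2 Q-table values and ε are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(42)

def epsilon_greedy(Q, s, epsilon=0.1):
    """Pick argmax_a Q[s, a] with prob. 1 - epsilon, a random action otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # explore
    return int(np.argmax(Q[s]))               # exploit: the greedy policy pi(s)

# Illustrative 3-state, 2-action Q-table (values made up for the example).
Q = np.array([[0.0, 1.0],
              [2.0, 0.5],
              [0.3, 0.3]])
print(epsilon_greedy(Q, s=1, epsilon=0.0))  # epsilon=0 is purely greedy -> 0
```

Without some exploration, a purely greedy policy can never discover actions whose Q-estimates start out too low.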


Q-learning

Suppose the current state is s and we take action a. Then we observe the next state s' and obtain reward r.

It has been shown¹ that the following iterative update rule converges to an optimal Q-function:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]

where α is the learning rate.

1http://users.isr.ist.utl.pt/~mtjspaan/readingGroup/ProofQlearning.pdf


Q-learning pseudo-code

1) Create a table (num_states x num_actions) for Q

2) Initialize the table randomly

3) Observe the initial state s

4) Repeat until the game is over (i.e. the next state is terminal)

1) Choose an action, i.e. a = π(s) /* s is the current state */

2) Receive next state s' and reward r

3) Update: Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]

4) s = s' /* next state becomes the current state */

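The pseudo-code above can be made concrete on a toy problem. Below is a sketch on a hypothetical 5-state chain where moving right from the second-to-last state ends the episode with reward +1; the environment and all hyperparameters are invented for illustration, and the action choice uses ε-greedy exploration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy chain MDP: states 0..4, actions 0 (left) and 1 (right).
# Reaching state 4 gives reward +1 and ends the episode.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(s, a):
    s2 = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == GOAL else 0.0
    return s2, r, s2 == GOAL  # next state, reward, terminal?

# 1-2) Create the (num_states x num_actions) table and initialize it randomly.
Q = rng.random((N_STATES, N_ACTIONS)) * 0.01
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for _ in range(500):
    s = 0                                  # 3) observe the initial state
    done = False
    while not done:                        # 4) repeat until terminal
        if rng.random() < epsilon:         # 4.1) a = pi(s), eps-greedy
            a = int(rng.integers(N_ACTIONS))
        else:
            a = int(np.argmax(Q[s]))
        s2, r, done = step(s, a)           # 4.2) next state and reward
        target = r if done else r + gamma * np.max(Q[s2])
        Q[s, a] += alpha * (target - Q[s, a])  # 4.3) Bellman update
        s = s2                             # 4.4) advance to the next state

print([int(np.argmax(Q[s])) for s in range(GOAL)])  # learned greedy policy
```

After training, the greedy policy should move right in every state, and Q-values should approach 1, γ, γ², … as the goal gets further away.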

How can we use deep learning here?

The Q-function can be approximated using a neural network model.

Q(s,a) is implemented on the right.

[Image from nervanasys.com]


How can we use deep learning here?

Alternative model for implementing Q(s,a)

[Image from nervanasys.com]

Now it is efficient to evaluate max_{a} Q(s, a): a single forward pass outputs the Q-values of all actions at once.


Example

In DeepMind's 2013 paper [Mnih et al. (2013)], the state is encoded by the images of the last four frames of the game.

After pre-processing, a state is an 84x84x4 tensor.


Example

The network used in DeepMind's 2013 paper

[Mnih et al. (2013)]

Qs: What type of network is this? Why 18? Why no pooling layers?


Deep Q-learning

Remember the Bellman equation? For a transition <s, a, r, s'>,

Q(s, a) = r + γ max_{a'} Q(s', a')

The update rule in ordinary Q-learning:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]

In deep Q-learning, the network is instead trained by regression toward the target y = r + γ max_{a'} Q(s', a'), minimizing the squared error (y − Q(s, a))².

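One deep Q-learning step is a regression toward the Bellman target. A minimal sketch (the linear "network", feature size, learning rate, and transition values are all invented for illustration; a real DQN uses a conv net on frames):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, gamma, lr = 8, 3, 0.99, 0.01
W = rng.normal(scale=0.1, size=(n_features, n_actions))

def q_values(s, W):
    """One Q-value per action, as in the 'efficient' network architecture."""
    return s @ W

# One transition <s, a, r, s'>:
s, a, r = rng.normal(size=n_features), 1, 0.5
s2 = rng.normal(size=n_features)

y = r + gamma * np.max(q_values(s2, W))  # Bellman target (held fixed)
pred = q_values(s, W)[a]
# Squared-error loss (y - Q(s, a))^2; gradient step on the column for action a:
W[:, a] += lr * (y - pred) * s

print(abs(y - q_values(s, W)[a]) < abs(y - pred))  # the error shrank
```

Only the output for the taken action a receives an error signal; the other actions' Q-values are left unchanged by this transition.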

Deep Q-learning

[Pseudocode from nervanasys.com]


Deep Q-learning

● “It turns out that approximation of Q-values using non-linear functions is not very stable.

● There is a whole bag of tricks (reward clipping, gradient clipping, batch normalization) that you have to use to actually make it converge.

● The most important trick is experience replay.”[Quote from nervanasys.com]

Experience replay stores past <s,a,r,s'> transitions in a memory and trains on random minibatches sampled from it (instead of learning from each transition one by one); this also decorrelates consecutive samples and lets each transition be reused many times.
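A replay memory is just a bounded buffer with random sampling. A minimal sketch (the capacity and batch size are arbitrary choices for the example):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)  # oldest transitions drop off

    def push(self, s, a, r, s2, done):
        self.buf.append((s, a, r, s2, done))

    def sample(self, batch_size):
        """Random minibatch of stored <s, a, r, s', done> transitions."""
        return random.sample(list(self.buf), batch_size)

    def __len__(self):
        return len(self.buf)

buf = ReplayBuffer(capacity=100)
for t in range(250):             # push more than capacity: oldest are evicted
    buf.push(s=t, a=t % 2, r=0.0, s2=t + 1, done=False)
print(len(buf))                  # capped at 100
batch = buf.sample(32)
print(len(batch))                # 32
```

The bounded deque keeps memory usage fixed while the sampled minibatches mix transitions from many different episodes.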

[Figure from DeepMind's post.] https://deepmind.com/blog/article/deep-reinforcement-learning


References

● Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

● Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Petersen, S. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.

● Guest Post (Part I): Demystifying Deep Reinforcement Learning (Blog post), https://www.nervanasys.com/demystifying-deep-reinforcement-learning/

● Deep Reinforcement Learning: Pong from Pixels (Blog post) by A. Karpathy, http://karpathy.github.io/2016/05/31/rl/
