

Reinforcement Learning: An (almost) quick (and very incomplete) introduction

Slides from David Silver, Dan Klein, Mausam, Dan Weld

Reinforcement Learning

At each time step t:

• Agent executes an action At

• Environment emits a reward Rt

• Agent transitions to state St
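To make this loop concrete, here is a minimal sketch in Python; the RandomAgent, run_episode and the env.reset()/env.step() interface are illustrative assumptions, not part of the slides.

import random

class RandomAgent:
    def __init__(self, actions):
        self.actions = actions

    def act(self, state):
        # Trivial policy: ignore the state and act uniformly at random.
        return random.choice(self.actions)

def run_episode(env, agent):
    """Run one episode of the agent-environment loop described above."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)                # agent executes action At
        state, reward, done = env.step(action)   # environment emits reward Rt and next state St
        total_reward += reward
    return total_reward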

Rat Example

• What if agent state = last 3 items in sequence?

• What if agent state = counts for lights, bells and levers?

• What if agent state = complete sequence?

Major Components of RL

An RL agent may include one or more of these components:

• Policy: agent’s behaviour function

• Value function: how good is each state and/or action

• Model: agent’s representation of the environment

Policy

• A policy is the agent’s behaviour

• It is a map from state to action

• Deterministic policy: a = π(s)

• Stochastic policy: π(a|s) = P[At = a|St = s]
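A toy illustration of the two policy types; the states, actions and probabilities below are invented for the example.

import random

# Deterministic policy: a = π(s), here just a lookup table.
deterministic_policy = {"hungry": "eat", "tired": "sleep"}

# Stochastic policy: π(a|s) = P[At = a | St = s], a distribution per state.
stochastic_policy = {
    "hungry": {"eat": 0.9, "sleep": 0.1},
    "tired": {"eat": 0.2, "sleep": 0.8},
}

def sample_action(policy, state):
    """Sample an action from a stochastic policy."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy["hungry"])               # always "eat"
print(sample_action(stochastic_policy, "hungry"))   # "eat" with probability 0.9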

Value function

• Value function is a prediction of future reward

• Used to evaluate the goodness/badness of states…

• …and therefore to select between actions

Model

• A model predicts what the environment will do next

• It predicts the next state…

• …and predicts the next (immediate) reward

Dimensions of RL

Model-based vs. Model-free

• Model-based: have/learn action models (i.e., transition probabilities)

• Uses Dynamic Programming

• Model-free: skip the model and directly learn what action to do when (without necessarily finding out the exact model of the action)

• e.g. Q-learning

On Policy vs. Off Policy

• On Policy: makes estimates based on a policy, and improves it based on those estimates.

• Learning on the job.

• e.g. SARSA

• Off Policy: learn a policy while following another (or re-using experience from an old policy).

• Looking over someone's shoulder.

• e.g. Q-learning

Markov Decision Process

• Set of states S = {si}

• Set of actions for each state A(s) = {asi} (often independent of state)

• Transition model T(s -> s’ | a) = Pr(s’ | a, s)

• Reward model R(s, a, s’)

• Discount factor γ

MDP = <S, A, T, R, γ>
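As a rough illustration of the <S, A, T, R, γ> tuple, here is a made-up two-state MDP encoded as plain Python dictionaries, together with a few sweeps of value iteration (the model-based dynamic programming mentioned earlier). The "battery" states, actions, rewards and the 100-sweep cap are all invented for the sketch.

GAMMA = 0.9
STATES = ["high", "low"]
ACTIONS = {"high": ["search", "wait"], "low": ["recharge", "wait"]}

# T[(s, a)] = list of (next_state, probability); R[(s, a, s')] = reward
T = {
    ("high", "search"):  [("high", 0.7), ("low", 0.3)],
    ("high", "wait"):    [("high", 1.0)],
    ("low", "recharge"): [("high", 1.0)],
    ("low", "wait"):     [("low", 1.0)],
}
R = {
    ("high", "search", "high"): 5.0, ("high", "search", "low"): 5.0,
    ("high", "wait", "high"): 1.0,
    ("low", "recharge", "high"): 0.0,
    ("low", "wait", "low"): 1.0,
}

# Model-based dynamic programming: value iteration using the known T and R.
V = {s: 0.0 for s in STATES}
for _ in range(100):
    V = {
        s: max(
            sum(p * (R[(s, a, s2)] + GAMMA * V[s2]) for s2, p in T[(s, a)])
            for a in ACTIONS[s]
        )
        for s in STATES
    }
print(V)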

Bellman Equation for Value Function
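The equation on this slide was an image; the standard Bellman expectation equation for the state-value function, written with the T and R notation of the MDP slide, is presumably what it showed (a reconstruction, not copied from the slide):

v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} T(s \to s' \mid a) \left[ R(s, a, s') + \gamma \, v_\pi(s') \right]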

Bellman Equation for Action-Value Function
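Likewise, the standard Bellman expectation equation for the action-value function (again a reconstruction in the same notation):

q_\pi(s, a) = \sum_{s'} T(s \to s' \mid a) \left[ R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s') \, q_\pi(s', a') \right]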

Q vs V
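The body of this slide was a figure; the relationship it presumably illustrated is that V averages Q over the policy's action choices, while the optimal V is the best available Q:

v_\pi(s) = \sum_{a} \pi(a \mid s) \, q_\pi(s, a) \qquad\text{and}\qquad v_*(s) = \max_{a} q_*(s, a)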

Exploration vs Exploitation

• Restaurant Selection

• Exploitation: Go to your favourite restaurant

• Exploration: Try a new restaurant

• Online Banner Advertisements

• Exploitation: Show the most successful advert

• Exploration: Show a different advert

• Oil Drilling

• Exploitation: Drill at the best known location

• Exploration: Drill at a new location

• Game Playing

• Exploitation: Play the move you believe is best

• Exploration: Play an experimental move

ε-Greedy solution

• Simplest idea for ensuring continual exploration

• All m actions are tried with non-zero probability

• With probability 1 − ε choose the greedy action

• With probability ε choose an action at random
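A minimal sketch of ε-greedy action selection; the Q table (a dict keyed by (state, action) pairs) and the action list are assumptions for the example.

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Explore with probability ε, otherwise act greedily w.r.t. the Q estimates."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore: random action
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit: greedy action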

Off Policy Learning

• Evaluate target policy π(a|s) to compute vπ(s) or qπ(s,a) while following behaviour policy μ(a|s)

{s1,a1,r2,...,sT} ∼ μ

• Why is this important?

• Learn from observing humans or other agents

• Re-use experience generated from old policies π1, π2, ..., πt−1

• Learn about optimal policy while following exploratory policy

• Learn about multiple policies while following one policy

Q-Learning

• We now consider off-policy learning of action-values Q(s,a)

• Next action is chosen using behaviour policy At+1 ∼ μ(·|St+1)

• But we consider an alternative successor action A′ ∼ π(·|St+1)

• And update Q(St,At) towards value of alternative action

Q-Learning

• We now allow both behaviour and target policies to improve

• The target policy π is greedy w.r.t. Q(s,a)

• The behaviour policy μ is e.g. ε-greedy w.r.t. Q(s,a)

• The Q-learning target then simplifies:
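The target itself was shown as an equation image; with a greedy target policy it reduces to the standard Q-learning target and update (reconstructed here, not copied from the slide):

R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right)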

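The remaining Q-learning slides appear to have been figures and worked examples. As a rough textual substitute, here is a minimal tabular Q-learning loop with an ε-greedy behaviour policy; the env.reset()/env.step() interface and the hyperparameter values are assumptions, not taken from the slides.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)                    # explore
        return max(actions, key=lambda a: Q[(state, a)])     # exploit

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(state)                   # behaviour policy μ (ε-greedy)
            next_state, reward, done = env.step(action)
            # Off-policy target: greedy (target policy π) over the next state.
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q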

Deep RL

• We seek a single agent which can solve any human-level task

• RL defines the objective

• DL gives the mechanism

• RL + DL = general intelligence (David Silver)

Function Approximators

Deep Q-Networks

• Q-learning diverges when using neural networks, due to:

• Correlations between samples

• Non-stationary targets

Solution: Experience Replay

• Fancy biological analogy

• In reality, quite simple

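The follow-up slide was a diagram; in code, the idea is roughly a buffer of past transitions sampled uniformly at random. The capacity and batch size below are made-up defaults for the sketch.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks correlations between consecutive samples.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)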

Improving Information Extraction by Acquiring External Evidence with Reinforcement Learning

Karthik Narasimhan, Adam Yala, Regina Barzilay

CSAIL, MIT

Slides from Karthik Narasimhan

Why try to reason, when someone else can do it for you?

Doubts*

• Algo 1, line 19: the process should end when d == "end_episode", not q. [Prachi] Error.

• The dimension of the match vector should equal the number of columns to be extracted, but Fig. 3 has twice that many dimensions. [Prachi] Error.

• Is RL the best approach? [Non-believers]

• Experience Replay [Anshul]. Hope it is clear now.

• Why is RL-Extract better than the meta-classifier? The explanation provided in the paper about the "long tail of noisy, irrelevant documents" is unclear. [Yash]

• The meta-classifier should also cut off at the top-20 results per search, like the RL system, to be completely fair. [Anshul]

* most mean questions

Discussions

• Experiments

• People are happy!

• Queries

• Cluster documents and learn queries [Yashoteja]

• Many other query formulations [Surag (lowest confidence entity), Barun (LSTM), Gagan (highest confidence entity), DineshR]

• Fixed set of queries [Akshay]

• Simplicity. Search engines are robust.

• Reliance on News articles [Gagan]

• Where else would you get News from?

• Domain limitations

• Too narrow [Barun, Himanshu]. Domain specific [Happy]. Small ontology [Akshay]

• It is not Open IE. It is task specific. Can be applied to any domain.

• Better meta-classifiers [Surag]

• Effect of more sophisticated RL algorithms (A3C, TRPO) [esp. if increasing action space by LSTM queries], and their effect on performance and training time.
