Collaborative Reinforcement Learning

Presented by Dr. Ying Lu

Credits: Reinforcement Learning: A User's Guide, Bill Smart, ICAC 2005

What is RL?

“a way of programming agents by reward and punishment without needing to specify how the

task is to be achieved”

[Kaelbling, Littman, & Moore, 96]

Basic RL Model

1. Observe state, s_t

2. Decide on an action, a_t

3. Perform action

4. Observe new state, s_t+1

5. Observe reward, r_t+1

6. Learn from experience

7. Repeat

Goal: Find a control policy that will maximize the observed rewards over the lifetime of the agent

[Diagram: the agent acts on the World (actions A) and observes states S and rewards R]
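As a rough sketch (not from the slides), this loop might look like the following Python, assuming hypothetical `env` and `agent` objects:

```python
# Minimal sketch of the observe-act-learn loop; env/agent interfaces are assumptions.
def run_episode(env, agent, max_steps=1000):
    state = env.reset()                                 # 1. observe state s_t
    for _ in range(max_steps):
        action = agent.choose_action(state)             # 2. decide on an action a_t
        next_state, reward, done = env.step(action)     # 3-5. act, observe s_{t+1} and r_{t+1}
        agent.learn(state, action, reward, next_state)  # 6. learn from experience
        state = next_state                              # 7. repeat
        if done:
            break
```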

An Example: Gridworld

Canonical RL domain
• States are grid cells
• 4 actions: N, S, E, W
• +1 reward for entering the top-right cell
• -0.01 for every other move

Maximizing the sum of rewards ≡ the shortest path (in this instance)
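A hypothetical gridworld environment matching this description (the grid size and start cell are assumptions):

```python
# Hypothetical 5x5 gridworld: +1 for entering the top-right cell, -0.01 per other move.
class Gridworld:
    ACTIONS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

    def __init__(self, size=5):
        self.size = size
        self.goal = (0, size - 1)   # top-right cell
        self.state = (size - 1, 0)  # start in the bottom-left cell (assumption)

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.state[0] + dr, 0), self.size - 1)
        c = min(max(self.state[1] + dc, 0), self.size - 1)
        self.state = (r, c)
        if self.state == self.goal:
            return self.state, 1.0, True     # +1 reward, episode ends
        return self.state, -0.01, False      # small cost for every other move
```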

The Promise of RL

Specify what to do, but not how to do it
• Through the reward function
• Learning "fills in the details"

Better final solutions
• Based on actual experiences, not programmer assumptions

Less (human) time needed for a good solution

Mathematics of RL

Before we talk about RL, we need to cover some background material

• Some simple decision theory
• Markov Decision Processes
• Value functions

Making Single Decisions

Single decision to be made
• Multiple discrete actions
• Each action has a reward associated with it

Goal is to maximize reward
• Not hard: just pick the action with the largest reward

State 0 has a value of 2
• Sum of rewards from taking the best action from the state

[Diagram: state 0 with action A (reward 1) leading to state 1 and action B (reward 2) leading to state 2]
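As a one-line sketch, using the action rewards from the diagram:

```python
# Single decision: just pick the action with the largest immediate reward.
rewards = {"A": 1, "B": 2}                    # rewards available from state 0
best_action = max(rewards, key=rewards.get)   # "B"
value_of_state_0 = rewards[best_action]       # 2
```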

Markov Decision Processes

We can generalize the previous example to multiple sequential decisions

• Each decision affects subsequent decisions

This is formally modeled by a Markov Decision Process (MDP)

[Diagram: six-state MDP (states 0-5). From state 0, action A leads to state 1 (reward 1) and action B leads to state 2 (reward 2); from state 1, actions A and B lead to states 3 and 4 (reward 1 each); from state 2, action A leads to state 4 (reward -1000); from states 3 and 4, action A leads to state 5 (rewards 1 and 10)]

Markov Decision Processes

Formally, an MDP is
• A set of states, S = {s1, s2, ..., sn}
• A set of actions, A = {a1, a2, ..., am}
• A reward function, R: S × A × S → ℝ
• A transition function, P_ij^a = P(s_t+1 = j | s_t = i, a_t = a)

We want to learn a policy, π: S → A
• Maximize the sum of rewards we see over our lifetime
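A hedged sketch of this tuple as a Python data structure (the field names are illustrative, not from the slides):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MDP:
    states: List[str]                              # S = {s1, ..., sn}
    actions: List[str]                             # A = {a1, ..., am}
    reward: Callable[[str, str, str], float]       # R(s, a, s')
    transition: Callable[[str, str, str], float]   # P(s_t+1 = j | s_t = i, a_t = a)
```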

Policies

There are 3 policies for this MDP
1. 0 → 1 → 3 → 5
2. 0 → 1 → 4 → 5
3. 0 → 2 → 4 → 5

Which is the best one?


Comparing Policies

Order policies by how much reward they see
1. 0 → 1 → 3 → 5 = 1 + 1 + 1 = 3
2. 0 → 1 → 4 → 5 = 1 + 1 + 10 = 12
3. 0 → 2 → 4 → 5 = 2 - 1000 + 10 = -988
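The same sums, as a quick check:

```python
# Sum of rewards seen along each candidate policy's path.
policy_returns = {
    "0->1->3->5": 1 + 1 + 1,        # 3
    "0->1->4->5": 1 + 1 + 10,       # 12
    "0->2->4->5": 2 - 1000 + 10,    # -988
}
best = max(policy_returns, key=policy_returns.get)  # "0->1->4->5"
```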


Value Functions

We can define value without specifying the policy
• Specify the value of taking action a from state s and then performing optimally
• This is the state-action value function, Q

[Diagram: the same MDP as above, annotated with the Q values listed below]

Q(0, A) = 12    Q(0, B) = -988
Q(1, A) = 2     Q(1, B) = 11
Q(2, A) = -990
Q(3, A) = 1
Q(4, A) = 10
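These values can be reproduced with a short recursion over the diagrammed transitions (a sketch; the transition table is read off the figure):

```python
# Deterministic toy MDP: (state, action) -> (next_state, reward), read off the diagram.
transitions = {
    (0, "A"): (1, 1),    (0, "B"): (2, 2),
    (1, "A"): (3, 1),    (1, "B"): (4, 1),
    (2, "A"): (4, -1000),
    (3, "A"): (5, 1),    (4, "A"): (5, 10),
}

def q(state, action):
    """Q(s, a): immediate reward plus the best Q value from the next state."""
    next_state, reward = transitions[(state, action)]
    onward = [q(next_state, a) for (s, a) in transitions if s == next_state]
    return reward + (max(onward) if onward else 0)

print(q(0, "A"), q(0, "B"))   # 12 -988
print(q(1, "A"), q(1, "B"))   # 2 11
```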

How do you tell which action to take from each state?

Value Functions

So, we have the value function
• Q(s, a) = R(s, a, s') + max_a' Q(s', a'), where s' is the next state

In the form of
• Next reward plus the best I can do from the next state

These extend to probabilistic actions
• Q(s, a) = Σ_s' P_ss'^a · (R(s, a, s') + max_a' Q(s', a'))
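A sketch of the probabilistic form as a one-step backup, assuming P and R are nested lookup tables (an illustration, not library code):

```python
# One-step backup for probabilistic actions:
# Q(s, a) = sum_{s'} P(s'|s, a) * (R(s, a, s') + max_{a'} Q(s', a'))
def backup(Q, P, R, s, a, states, actions):
    return sum(
        P[s][a][s2] * (R[s][a][s2] + max(Q[s2][a2] for a2 in actions))
        for s2 in states
    )
```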

Getting the Policy

If we have the value function, then finding the best policy is easy
• π(s) = arg max_a Q(s, a)

We're looking for the optimal policy, π*(s)
• No policy generates more reward than π*

Optimal policy defines optimal value functions
• Q*(s, a) = R(s, a, s') + max_a' Q*(s', a')

The easiest way to learn the optimal policy is to learn the optimal value function first
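Extracting the greedy policy from Q is then a one-liner per state (a sketch, with Q assumed to be a dict of dicts):

```python
# pi(s) = argmax_a Q(s, a)
def greedy_policy(Q):
    return {s: max(actions, key=actions.get) for s, actions in Q.items()}
```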

Collaborative Reinforcement Learning to Adaptively Optimize MANET Routing

Jim Dowling, Eoin Curran, Raymond Cunningham and Vinny Cahill

Overview

Building autonomic distributed systems with self-* properties
• Self-Organizing
• Self-Healing
• Self-Optimizing

Add a collaborative learning mechanism to the self-adaptive component model

Improved ad-hoc routing protocol

Introduction

Autonomous distributed systems will consist of interacting components free from human interference

• Existing top-down management and programming solutions require too much global state

• Bottom-up, decentralized collection of components that make their own decisions based on local information

• System-wide self-* behavior emerges from interactions

Self-* Behavior

Self-adaptive components change structure and/or behavior at run-time, adapting to
• discovered faults
• reduced performance

Requires active monitoring of component states and external dependencies

Self-* Distributed Systems using Distributed (collaborative) Reinforcement Learning

For complex systems, programmers cannot be expected to describe all conditions

• Self-adaptive behavior is learnt by components
• Decentralized co-ordination of components to support system-wide properties
• Distributed Reinforcement Learning (DRL) is an extension to RL that uses only neighbor interactions

Model-Based Reinforcement Learning

Q(s, a) = R(s, a) + Σ_s'∈S P(s'|s, a) · V(s')

[Diagram: a Component is coupled to its MDP through an adaptation contract (AMM); it takes actions (a_t) and observes rewards (r_t+1) and states (s_t+1), maintaining 1. an action reward, 2. a state transition model, and 3. a next-state reward]

Markov Decision Process = ({States}, {Actions}, R(States, Actions), P(States, Actions, States))
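A sketch of the bookkeeping this implies, with the reward and transition models estimated from counts; the names and structure are assumptions, not the paper's code:

```python
from collections import defaultdict

# Model-based RL sketch: estimate R(s, a) and P(s'|s, a) from observed counts,
# then back up Q(s, a) = R(s, a) + sum_{s'} P(s'|s, a) * V(s').
class ModelBasedAgent:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': n}
        self.reward_sum = defaultdict(float)                  # (s, a) -> total reward seen
        self.V = defaultdict(float)                           # s -> current value estimate

    def observe(self, s, a, r, s_next):
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r

    def q(self, s, a):
        total = sum(self.counts[(s, a)].values())
        if total == 0:
            return 0.0
        r_hat = self.reward_sum[(s, a)] / total               # estimated action reward
        return r_hat + sum(n / total * self.V[s_next]         # transition model * next-state value
                           for s_next, n in self.counts[(s, a)].items())
```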

Decentralised System Optimisation

Coordinating the solution to a set of Discrete Optimisation Problems (DOPs)

• Components have a Partial System View
• Coordination Actions
  • Actions = {delegation} ∪ {DOP actions} ∪ {discovery}

• Connection Costs

[Diagram: components A, B, and C with causally-connected states; component coordination via delegation]

Collaborative Reinforcement Learning

Advertisement
• Update Partial Views of Neighbours

Decay
• Negative Feedback on State Values in the Absence of Advertisements

Q_i(s, a) = R(s, a) + Σ_s'∈S P(s'|s, a) · (Decay(V_i(s')) + D_i(s'|s, a))

where R(s, a) is the action reward, P(s'|s, a) the state transition model, Decay(V_i(s')) the decayed cached neighbour's V-value, and D_i(s'|s, a) the connection cost
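A sketch of this backup under stated assumptions: the exponential decay, the cache format (state -> (value, timestamp)), and the table layouts are illustrative, not taken from the paper.

```python
import math
import time

def decay(cached_value, advertised_at, time_constant=10.0):
    # Negative feedback: a cached neighbour's V-value fades as the last
    # advertisement grows older.
    age = time.time() - advertised_at
    return cached_value * math.exp(-age / time_constant)

def crl_q(R, P, cached_V, D, s, a, states):
    # Q_i(s, a) = R(s, a) + sum_{s'} P(s'|s, a) * (Decay(V(s')) + D_i(s'|s, a))
    return R[(s, a)] + sum(
        P[(s, a, s2)] * (decay(*cached_V[s2]) + D[(s, a, s2)])
        for s2 in states
    )
```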

Adaptation in CRL System

A feedback process responding to
• Changes in the optimal policy of any RL agent
• Changes in the system environment
• The passing of time

SAMPLE: Ad-hoc Routing using DRL

Probabilistic ad-hoc routing protocol based on DRL
• Adaptation of network traffic around areas of congestion

• Exploitation of stable routes

Routing decisions based on local information and information obtained from neighbors

Outperforms Ad-hoc On Demand Distance Vector Routing (AODV) and Dynamic Source Routing (DSR)

SAMPLE: A CRL System (I)

SAMPLE: A CRL System (II)

Instead of always choosing the neighbor with the best Q value, i.e., taking the delegation action

a = arg max_a Q_i(B, a),

a neighbor is chosen probabilistically
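One common way to make such a probabilistic choice is a Boltzmann (softmax) distribution over the Q values; this is a sketch, and the temperature parameter is an assumption rather than the paper's exact rule:

```python
import math
import random

def choose_neighbor(q_values, temperature=1.0):
    # Softmax over Q values: better neighbors are more likely, but not
    # guaranteed, to be chosen, so traffic can spread around congestion.
    actions = list(q_values)
    weights = [math.exp(q_values[a] / temperature) for a in actions]
    return random.choices(actions, weights=weights)[0]
```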

SAMPLE: A CRL System (III)

Pi(s’|s, aj) = E(CS/CA)

SAMPLE: A CRL System (IV)

Performance

Metrics:
• Maximize
  • throughput
  • ratio of delivered packets to undelivered packets
• Minimize
  • number of transmissions required per packet sent
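The metrics written out as stated above; the counter names are illustrative:

```python
# Delivery ratio as stated on the slide: delivered packets vs. undelivered packets.
def delivery_ratio(delivered, sent):
    undelivered = sent - delivered
    return delivered / undelivered if undelivered else float("inf")

# Transmission cost: radio transmissions needed per packet sent.
def transmissions_per_packet(total_transmissions, packets_sent):
    return total_transmissions / packets_sent
```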

Figures 5-10

Questions/Discussions
