
Page 1

Deep Multiagent Reinforcement Learning for Partially Observable Parameterized Environments

Peter Stone
Department of Computer Science, The University of Texas at Austin

Joint work with Matthew Hausknecht

Page 2

Motivation

Intelligent decision making is at the heart of AI.

Page 3

Outline

1. Background

2. Recurrent Q-Learning for partially observable MDPs

3. Deep Multiagent RL in Half-Field-Offense

4. Future Work

Page 4

Markov Decision Process

Agent-environment loop: at each timestep the agent takes action a_t and receives the next state s_t and reward r_t.

The Markov property ensures that s_{t+1} depends only on the current state s_t and action a_t, not on the full history.

Learning an optimal policy π* therefore requires no memory.

Page 5

Partially Observable MDP (POMDP)

Agent-environment loop: at each timestep the agent takes action a_t and receives an observation o_t and reward r_t.

Observations provide noisy or incomplete information about the underlying state.

Memory may help the agent learn a better policy.

Page 6

Reinforcement Learning

Reinforcement Learning provides a general framework for sequential decision making.

Objective: learn a policy that maximizes the discounted sum of future rewards.

A deterministic policy π is a mapping from states/observations to actions: for each encountered state/observation, which action is best to perform?

Page 7

Q-Value Function

Estimates the expected return of taking action a from state s:

$$Q^\pi(s, a) = \mathbb{E}\big[\, r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s, a \,\big]$$

Answers the question: "How good is action a from state s?"

The optimal Q-value function yields an optimal policy.

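As a concrete illustration (not from the slides) of how such Q-values can be learned from experience, here is a minimal sketch of one tabular Q-learning update in Python; the state/action encoding, learning rate, and discount are assumptions:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q[(s, a)] toward the TD target
    r + gamma * max_a' Q[(s_next, a')]."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
```

DQN, introduced a few slides later, replaces this table with a neural network trained on the same kind of temporal-difference target.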
Page 8

Deep Neural Network

Parametric model with stacked layers of representation.

Powerful, general purpose function approximator.

Parameters are optimized via backpropagation.

Diagram: input layer → stacked hidden layers → output layer.

Page 9

Outline

1. Background

2. Recurrent Q-Learning for partially observable MDPs

3. Deep Multiagent RL in Half-Field-Offense

4. Future Work

Page 10

Atari Environment

Agent-environment loop: action a_t, observation o_t, reward r_t.

Screen resolution is 160x210x3; there are 18 discrete actions.

Reward is the change in game score.

Page 11

Atari: MDP or POMDP?

Depends on the number of game screens used in the state representation.

Many games are partially observable given only a single frame.

Page 12

Deep Q-Network (DQN)

Neural network estimates Q-values Q(s, a) for all 18 actions:

$$Q(s \mid \theta) = \big(Q_{s,a_1}, \dots, Q_{s,a_n}\big)$$

Accepts the last 4 screens as input.

Architecture: Convolution 1 → Convolution 2 → Convolution 3 → Fully Connected → Fully Connected → Q-values.

Learns via temporal difference:

$$y_i = r_t + \gamma \max_{a'} Q(s_{t+1}, a' \mid \theta)$$

$$L_i(\theta_i) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim D}\Big[\big(y_i - Q(s_t, a_t \mid \theta_i)\big)^2\Big]$$

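A minimal PyTorch-style sketch of this TD loss (not the authors' implementation); the Q-network and the replay-memory batch format are assumptions:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, batch, gamma=0.99):
    """DQN TD loss: y = r + gamma * max_a' Q(s', a'); minimize (y - Q(s, a))^2."""
    s, a, r, s_next, done = batch                            # sampled from replay memory D
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s_t, a_t)
    with torch.no_grad():                                    # target treated as a constant
        y = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, y)
```

The `(1 - done)` factor, which zeroes the bootstrap at episode boundaries, is a standard detail not shown on the slide.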
Page 13

Flickering Atari

How well does DQN perform on POMDPs?

Induce partial observability by stochastically obscuring the game screen. The game state must then be inferred from past observations:

$$o_t = \begin{cases} s_t & \text{with probability } \tfrac{1}{2} \\ \langle 0, \dots, 0 \rangle & \text{otherwise} \end{cases}$$

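The obscuring step itself is simple; a sketch in NumPy (the screen encoding is an assumption):

```python
import numpy as np

def flicker(screen, p=0.5):
    """With probability p return the true game screen, otherwise a fully
    obscured (all-zero) observation, as in the definition of o_t above."""
    return screen if np.random.rand() < p else np.zeros_like(screen)
```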
Page 14

DQN Pong

Video: true game screen vs. observed game screen.

Page 15

DQN Flickering Pong

Video: true game screen vs. observed game screen.

Page 16

Deep Recurrent Q-Network (DRQN)

Uses a Long Short-Term Memory (LSTM) to selectively remember past game screens.

Architecture identical to DQN except:
1. The first fully connected layer is replaced with an LSTM
2. A single frame is input at each timestep

Trained end-to-end using BPTT over the last 10 timesteps.

Architecture: Convolution 1 → Convolution 2 → Convolution 3 → LSTM → Fully Connected → Q-values.

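A hedged PyTorch sketch of such a network; the convolution sizes follow the standard DQN setup and the 84x84 single-channel preprocessing is an assumption, not taken from the slide:

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Sketch of DRQN: DQN-style convolutions, but the first fully connected
    layer is replaced by an LSTM and only one frame is fed per timestep."""
    def __init__(self, n_actions, lstm_size=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64 * 7 * 7, lstm_size, batch_first=True)
        self.q_head = nn.Linear(lstm_size, n_actions)

    def forward(self, frames, hidden=None):
        # frames: [batch, time, 1, 84, 84]; unroll the LSTM over time
        b, t = frames.shape[:2]
        feats = self.conv(frames.reshape(b * t, *frames.shape[2:]))
        feats = feats.reshape(b, t, -1)
        out, hidden = self.lstm(feats, hidden)
        return self.q_head(out), hidden   # Q-values at every timestep
```

During training the network is unrolled over a short window of frames (10 timesteps on the slide) and gradients flow back through time via the LSTM.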
Page 17

DRQN Flickering Pong

Video: true game screen vs. observed game screen.

Page 18

Page 19

LSTM infers velocity

Page 20

DRQN Frostbite

Page 21

Page 22

Extensions

DRQN has been extended in several ways:

• Addressable Memory: Control of Memory, Active Perception, and Action in Minecraft; Oh et al. in ICML ’16

• Continuous Action Space: Memory Based Control with Recurrent Neural Networks; Heess et al., 2016

[Deep Recurrent Q-Learning for Partially Observable MDPs, Hausknecht and Stone, 2015; arXiv]

Page 23

Outline

1. Background

2. Recurrent Q-Learning for partially observable MDPs

3. Deep Multiagent RL in Half-Field-Offense

4. Future Work

Page 24

Half Field Offense

A cooperative multiagent soccer domain built on the libraries used by the RoboCup competition.

Objective: learn a goal-scoring policy for the offense agents.

Features continuous actions, partial observability, and opportunities for multiagent coordination.

Page 25

Half Field Offense

Page 26

Page 27

Page 28

State and Action Spaces

58 continuous state features encode distances and angles to points of interest.

Parameterized-continuous action space:
Dash(direction, power), Turn(direction), Tackle(direction), Kick(direction, power)

The agent chooses one discrete action plus its associated parameters every timestep.

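One simple way to realize "one discrete action plus its parameters" is to let the policy output 4 action scores and all 6 continuous parameters, then pick the argmax action and slice out its parameters. A sketch in Python; the parameter ordering here is a hypothetical layout, not HFO's actual interface:

```python
import numpy as np

# Hypothetical layout: which of the 6 continuous outputs belong to each action.
ACTION_PARAMS = {
    "Dash":   [0, 1],   # direction, power
    "Turn":   [2],      # direction
    "Tackle": [3],      # direction
    "Kick":   [4, 5],   # direction, power
}
ACTIONS = list(ACTION_PARAMS)

def select_action(action_scores, params):
    """Pick the highest-scoring discrete action and attach its associated parameters."""
    name = ACTIONS[int(np.argmax(action_scores))]
    return name, [params[i] for i in ACTION_PARAMS[name]]
```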
Page 29

Exploration is Hard

Page 30

Reward Signal

$$r_t = -\Delta d(\text{Agent}, \text{Ball}) + I_{\text{kick}} - 3\,\Delta d(\text{Ball}, \text{Goal}) + 5\,I_{\text{goal}}$$

The first two terms reward going to the ball; the last two reward kicking the ball toward the goal.

With only the goal-scoring reward, the agent never learns to approach the ball or dribble.

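A sketch of this shaped reward as a plain Python function of quantities the agent can track; the distance bookkeeping and the exact trigger for I_kick are assumptions:

```python
def hfo_reward(d_agent_ball, d_agent_ball_prev, d_ball_goal, d_ball_goal_prev,
               kicked_ball, scored_goal):
    """Shaped HFO reward (sketch): approach the ball, kick it, move it toward
    the goal, and score. Coefficients follow the slide's formula."""
    r = -(d_agent_ball - d_agent_ball_prev)        # reward decreasing distance to ball
    r += 1.0 if kicked_ball else 0.0               # I_kick: bonus for kicking
    r += -3.0 * (d_ball_goal - d_ball_goal_prev)   # reward moving the ball toward the goal
    r += 5.0 if scored_goal else 0.0               # 5 * I_goal
    return r
```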
Page 31

Deep Deterministic Policy Gradients

A model-free deep actor-critic architecture [Lillicrap '15].

The Actor learns a policy π; the Critic learns to estimate Q-values.

The Actor outputs all 6 possible parameters:
a_t = argmax over the 4 discrete actions, plus the associated parameter(s).

Architecture diagram: the Actor maps the state through 1024, 512, 256, and 128 unit ReLU layers to 4 actions + 6 parameters; the Critic uses the same layer sizes and outputs a Q-value.

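A compact PyTorch sketch of the two networks; the critic's input (the state concatenated with the actor's full output) and the exact layer order are assumptions based on the diagram:

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=(1024, 512, 256, 128)):
    """Build a ReLU multilayer perceptron with the given hidden sizes."""
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), nn.ReLU()]
        d = h
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

STATE_DIM, N_ACTIONS, N_PARAMS = 58, 4, 6

# Actor: state -> 4 discrete-action activations + 6 continuous parameters.
actor = mlp(STATE_DIM, N_ACTIONS + N_PARAMS)

# Critic: state plus the actor's full output -> scalar Q-value.
critic = mlp(STATE_DIM + N_ACTIONS + N_PARAMS, 1)
```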
Page 32

Training

The Critic is trained using temporal difference learning:

$$L = \big\lVert Q(s_t, \mu(s_t) \mid \theta^Q) - y \big\rVert_2^2, \qquad y = r_t + \gamma\, Q(s_{t+1}, \mu(s_{t+1}) \mid \theta^Q)$$

The Actor is trained via the Critic's gradients:

$$\nabla_{\theta^\mu} \mu(s) = \nabla_a Q(s, a \mid \theta^Q)\, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)$$

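A sketch of one such update in PyTorch, written in the standard DDPG form where the critic regresses on the replayed action; target networks and exploration noise are omitted, and the batch format is an assumption:

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_opt, critic_opt, batch, gamma=0.99):
    """One DDPG-style update (sketch). Critic: TD regression toward
    y = r + gamma * Q(s', mu(s')). Actor: ascend the critic's Q estimate."""
    s, a, r, s_next = batch    # a is the full action vector (4 activations + 6 parameters)

    # Critic update
    with torch.no_grad():
        y = r + gamma * critic(torch.cat([s_next, actor(s_next)], dim=1)).squeeze(1)
    q = critic(torch.cat([s, a], dim=1)).squeeze(1)
    critic_loss = F.mse_loss(q, y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: gradient of Q w.r.t. actor parameters via the chain rule
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```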
Page 33

Bounded Action Space

HFO's continuous parameters are bounded:

Dash(direction, power) Turn(direction) Tackle(direction) Kick(direction, power)

Direction in [-180,180], Power in [0, 100]

Exceeding these ranges results in no action

If DDPG is unaware of the bounds, it will invariably exceed them

Page 34

Bounded DDPG

We examine three approaches for bounding DDPG's action space:

1. Squash Gradients

2. Zero Gradients

3. Invert Gradients

Page 35

Squashing Gradients

1. Use a tanh non-linearity to bound the parameter output

2. Rescale into the desired range

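A one-line sketch of the squash-and-rescale step (PyTorch tensors assumed):

```python
import torch

def squash(raw, p_min, p_max):
    """Bound a raw parameter output with tanh, then rescale it into [p_min, p_max]."""
    return p_min + (torch.tanh(raw) + 1.0) * 0.5 * (p_max - p_min)
```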
Page 36

Squashing Gradients

Page 37

Zeroing Gradients

Each continuous parameter has a range [p_min, p_max].

Let p denote the current value of the parameter, and ∇p the suggested gradient. Then:

$$\nabla p \leftarrow \begin{cases} \nabla p & \text{if } p_{\min} < p < p_{\max} \\ 0 & \text{otherwise} \end{cases}$$

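A sketch of this rule applied elementwise to a batch of parameter gradients (PyTorch tensors assumed):

```python
import torch

def zero_gradients(grad, p, p_min, p_max):
    """Zero the gradient wherever the parameter already sits outside (p_min, p_max)."""
    in_range = (p > p_min) & (p < p_max)
    return torch.where(in_range, grad, torch.zeros_like(grad))
```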
Page 38

Zeroing Gradients

Page 39

Inverting Gradients

For each parameter:

$$\nabla p \leftarrow \nabla p \cdot \begin{cases} (p_{\max} - p)/(p_{\max} - p_{\min}) & \text{if } \nabla p \text{ suggests increasing } p \\ (p - p_{\min})/(p_{\max} - p_{\min}) & \text{otherwise} \end{cases}$$

Allows parameters to approach the bounds of their ranges without exceeding them.

Parameters don't get "stuck" or saturate.

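A sketch of the inverting rule, assuming the incoming gradient points in the direction that increases Q (so a positive gradient "suggests increasing p"):

```python
import torch

def invert_gradients(grad, p, p_min, p_max):
    """Scale each gradient by how much head-room its parameter has left
    toward the bound it is moving toward."""
    width = p_max - p_min
    increasing = grad > 0                      # gradient suggests increasing p
    scale_up   = (p_max - p) / width           # room left above
    scale_down = (p - p_min) / width           # room left below
    return grad * torch.where(increasing, scale_up, scale_down)
```

Near the active bound the update shrinks smoothly toward zero instead of being clipped, which is why parameters neither saturate nor get stuck.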
Page 40

Inverting Gradients

Page 41

Results

Page 42

Results

Agent               Scoring Percent   Avg. Steps to Goal
DDPG1               1.00              108.0
DDPG2               0.99              107.1
DDPG3               0.98              104.8
DDPG4               0.96              112.3
Helios' Champion    0.96               72.0
DDPG5               0.94              119.1
DDPG6               0.84              113.2
SARSA               0.81               70.7
DDPG7               0.80              118.2

[Deep Reinforcement Learning in Parameterized Action Space, Hausknecht and Stone, in ICLR ‘16]

Page 43

Deep Multiagent RL

Can multiple deep RL agents cooperate to achieve a shared goal?

Examine several baseline architectures:

Decentralized: Independent agents

Centralized: Single controller for multiple agents

Parameter Sharing: Layers shared between agents

Page 44

Centralized

Both agents are controlled by a single DDPG.

The agents' state and action spaces are concatenated; learning is more challenging for this reason.

Architecture diagram: a single Actor (1024, 512, 256, 128 ReLU) maps the concatenated states to both agents' 4 actions + 6 parameters, and a single Critic with the same layer sizes outputs the Q-value.

Page 45

Page 46

Parameter Sharing

Weights are shared between layers of the two Actor networks; the Critic networks share weights separately.

Reduces the total number of parameters!

Encourages both agents to participate even though 2v0 is solvable by a single agent.

Architecture diagram: each agent's Actor and Critic is built from 1024, 512, 256, and 128 unit ReLU layers, with some of these layers shared between the two agents' networks (see the sketch below).

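A sketch of what sharing lower layers between two actors can look like in PyTorch; the choice of which layers are shared (a common 1024-512 trunk with per-agent heads) is an assumption for illustration, not necessarily the exact split used in the talk:

```python
import torch
import torch.nn as nn

class SharedActors(nn.Module):
    """Two actors that share their lower layers (a common trunk) but keep
    separate upper layers and output heads."""
    def __init__(self, state_dim=58, out_dim=10, n_agents=2):
        super().__init__()
        self.trunk = nn.Sequential(                 # weights shared by both agents
            nn.Linear(state_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
        )
        self.heads = nn.ModuleList([                # per-agent layers and outputs
            nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                          nn.Linear(256, 128), nn.ReLU(),
                          nn.Linear(128, out_dim))
            for _ in range(n_agents)
        ])

    def forward(self, states):
        # states: list of per-agent state tensors
        return [head(self.trunk(s)) for s, head in zip(states, self.heads)]
```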
Page 47

Page 48

Page 49

Page 50

Related Work

• Multiagent Cooperation and Competition with Deep Reinforcement Learning; Tampuu et al., 2015

• Learning to Communicate to Solve Riddles with Deep Distributed Recurrent Q-Networks; Foerster et al., 2016

• Learning to Communicate with Deep Multi-Agent Reinforcement Learning; Foerster et al., 2016

Page 51

Thanks!

(Recap diagrams: the DRQN architecture and the DDPG actor-critic network.)