Language grounding for cross-domain policy transfer and spatial reasoning
Karthik Narasimhan, OpenAI


Page 1:

Language grounding for cross-domain policy transfer and spatial reasoning

Karthik Narasimhan, OpenAI

Page 2:

Grounding semantics in control applications

Input

Text: instructions or knowledge
1. Ghosts chase and try to kill you
2. Collect all the pellets
3. …

Environment: execute actions and observe

Output

Control policy: sequence of actions to interact optimally
MOVE LEFT, STAY, MOVE UP, …

Page 3:

Grounding semantics in control applications

1. Use language to improve performance in control applications

[Figure: game score improves from 7 to 107 when text descriptions ("Ghosts chase and try to kill you; collect all the pellets; …") are added]

Page 4:

Grounding semantics in control applications

1. Use language to improve performance in control applications

[Figure: game score improves from 7 to 107 when text descriptions are added]

2. Use feedback from control application to understand language

[Figure: instruction "Walk across the bridge" grounded through environment reward +1]

Alleviates dependence on large-scale annotation

Page 5:

Improved language understanding translates to improved task performance

Reinforcement Learning

• Delayed feedback

[Figure: action_1 … action_n, with reward +10 arriving only after the full sequence]

• Large number of possible action sequences

⇒ How to perform credit assignment for individual actions

⇒ Need for effective exploration

Page 6:

Deep Transfer in Reinforcement Learning by Language Grounding

Karthik Narasimhan, Regina Barzilay, Tommi Jaakkola (MIT)

Page 7:

Deep reinforcement learning for games

Standard approach: deep Q-learning by acting in the environment

Steps to convergence: ~ a few million

Page 8:

Traditional RL framework: Markov decision process

State s = observed environment

Action a = Move / Shoot / Use sword

[Figure: State 1 → MOVE RIGHT → State 2]

Policy π : s → a

Reward +1

Action value function Q(s, a)
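To make this framing concrete, here is a minimal tabular Q-learning sketch in Python: a generic illustration of the MDP setup above, not the method of the talk. The state/action counts and hyperparameters are placeholder assumptions.

import numpy as np

# Minimal tabular sketch of the traditional RL framework above.
# n_states, n_actions and all hyperparameters are illustrative assumptions.
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))          # action value function Q(s, a)
alpha, gamma, eps = 0.1, 0.99, 0.1

def act(s):
    # epsilon-greedy policy pi: s -> a derived from Q
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next):
    # one-step TD update toward r + gamma * max_a' Q(s', a')
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])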

Page 9:

Estimating a policy

Learn from sampled experiences

[Figure: episodes 1, 2, 3 sampled from the environment]

Policy π(a|s)

Page 10:

• Learn transitions between states

• Identify good vs bad states

[Figure: environment state space with states s1–s5 and transition probabilities 0.4, 0.4, 0.2]

Page 11:

More games

• Each new game requires re-learning from scratch

Page 12:

More games

• Each new game requires re-learning from scratch

• Policy transfer: challenging

(Taylor and Stone, 2009; Parisotto et al., 2015; Rusu et al., 2016; Rajendran et al., 2017; …)

Page 13:

Why is transfer hard?

• Different state spaces and actions

• Need to explore new environment to learn mapping

• Incorrect mapping leads to negative transfer

Page 14:

[Figure: Environment 1 with states s1–s5 (transition probabilities 0.4, 0.4, 0.2) and Environment 2 with states u1–u6]

• How do we re-use learnt information?

• Need some anchor

Page 15:

[Figure: Environment 1 (states s1–s5) aligned with Environment 2 (states u1–u6); transferred transition probabilities 0.35, 0.45, 0.2]

s1 is similar to u1; s2 is similar to u3

Page 16:

Using text descriptions

"Scorpions chase you and kill you on touch"

"Spiders are chasers and can be destroyed by an explosion"

• Text descriptions associated with objects/entities

• No mapping between objects in different environments

Page 17:

Transfer through language

• Language as domain-invariant and accessible medium

• Traditional approaches: direct policy transfer

• This work: transfer ‘model’ of the environment using text descriptions


Page 18:

Model-based reinforcement learning

[Figure: state s, action a, successor states s'_1 … s'_n]

Transition distribution T(s'|s, a)

Page 19:

Model-based reinforcement learning

[Figure: state s, action a, successor states s'_1 … s'_n, reward +10]

Transition distribution T(s'|s, a) and reward function R(s, a)

Page 20:

Value Iteration

Value function: V(s) = max_a Q(s, a)

Action value function: Q(s, a) = R(s, a) + γ Σ_{s'} T(s'|s, a) V(s')

Accurately estimating T and R is challenging
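As a sketch of these two updates in Python, assuming the tabular T and R are already given (in practice they must be estimated, which is the hard part):

import numpy as np

def value_iteration(T, R, gamma=0.95, n_iters=100):
    # Tabular sketch of the equations above; shapes are assumptions:
    # T: (S, A, S) transition probabilities, R: (S, A) rewards.
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        # Q(s, a) = R(s, a) + gamma * sum_{s'} T(s'|s, a) V(s')
        Q = R + gamma * np.einsum('sap,p->sa', T, V)
        # V(s) = max_a Q(s, a)
        V = Q.max(axis=1)
    return Q, V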

Page 21:

[Figure: state s, action a, and text z ("Scorpions chase you and kill you on touch") leading to successor states s'_1 … s'_n]

Text-conditioned transition distribution T(s'|s, a, z)

Page 22:

Bootstrap learning through text

• Appropriate representation to incorporate language

• Partial text descriptions

z1: "Scorpions chase you and kill you on touch" → T1(s'|s, a, z1), R1(s, a, z1)

Transfer

z2: "Spiders are chasers and can be destroyed" → T2(u'|u, a, z2), R2(u, a, z2)

Page 23:

Incorporating descriptions

[Figure: description z_i encoded by an RNN into v_{z_i}; state s mapped to φ(s) with object embeddings v_{o_i}]
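A possible PyTorch sketch of this encoder, assuming an LSTM as the RNN and simple concatenation to fuse v_{z_i} with an object embedding; the layer sizes and fusion scheme are assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class DescriptionEncoder(nn.Module):
    # Sketch: encode description z_i with an LSTM into v_{z_i}, then fuse
    # it with an object/state embedding (concatenation is an assumption).
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64, obj_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fuse = nn.Linear(hidden_dim + obj_dim, obj_dim)

    def forward(self, tokens, obj_vec):
        # tokens: (batch, seq_len) word ids; obj_vec: (batch, obj_dim)
        _, (h, _) = self.rnn(self.embed(tokens))
        v_z = h[-1]                  # final hidden state as v_{z_i}
        return self.fuse(torch.cat([v_z, obj_vec], dim=-1))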

Page 24:

Differentiable value iteration

[Figure: φ(s) → convolutional neural network (CNN) → reward map R → Q]

Q(s, a) = R(s, a) + γ Σ_{s'} T(s'|s, a) V(s')

(Value Iteration Network, Tamar et al., 2016)

Page 25:

Differentiable value iteration

[Figure: φ(s) → CNN → reward map R → Q → max → value V]

V(s) = max_a Q(s, a)

(Value Iteration Network, Tamar et al., 2016)

Page 26:

Differentiable value iteration

[Figure: φ(s) → CNN → reward map R; T combines R and V into Q; max over actions gives V; repeated as a k-step recurrence]

Q(s, a) = R(s, a) + γ Σ_{s'} T(s'|s, a) V(s')        V(s) = max_a Q(s, a)

(Value Iteration Network, Tamar et al., 2016)
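A minimal PyTorch sketch of the k-step recurrence in the style of Tamar et al. (2016); using single convolutions to emulate R and T, and the channel counts, are simplifying assumptions.

import torch
import torch.nn as nn

class DifferentiableVI(nn.Module):
    # k-step differentiable value iteration; convolutions stand in for
    # the reward map R and the transition operator T (assumptions).
    def __init__(self, in_channels, n_actions, k=10):
        super().__init__()
        self.r_conv = nn.Conv2d(in_channels, 1, 3, padding=1)   # reward map R
        self.q_conv = nn.Conv2d(2, n_actions, 3, padding=1)     # T applied to [R; V]
        self.k = k

    def forward(self, phi_s):
        # phi_s: (batch, in_channels, H, W) state representation
        r = self.r_conv(phi_s)
        v = torch.zeros_like(r)
        for _ in range(self.k):                        # k-step recurrence
            q = self.q_conv(torch.cat([r, v], dim=1))  # Q, one channel per action
            v, _ = q.max(dim=1, keepdim=True)          # V(s) = max_a Q(s, a)
        return q, v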

Page 27:

Parameter Learning

[Figure: full model combining φ(s), the description encoding v_{z_i}, and the differentiable value iteration module]

Objective: minimize the loss function

L(Θ_i) = E_{s,a,s'}[(r + γ max_{a'} Q(s', a', Z; Θ_{i-1}) − Q(s, a, Z; Θ_i))²]
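A sketch of this objective as a standard TD loss in PyTorch, with a frozen target network standing in for Θ_{i-1}; the batch layout and model signature are assumptions.

import torch
import torch.nn.functional as F

def td_loss(model, target_model, batch, gamma=0.99):
    # Sketch of the objective above; the batch layout and the model
    # signature model(states, text) -> Q-values are assumptions.
    s, a, r, s_next, z = batch
    q = model(s, z).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a, Z; Theta_i)
    with torch.no_grad():                                      # freeze Theta_{i-1}
        q_next = target_model(s_next, z).max(dim=1).values     # max_a' Q(s', a', Z; Theta_{i-1})
    return F.mse_loss(q, r + gamma * q_next)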

Page 28:

Model-aware policy

[Figure: value estimation module conditioned on the text "Scorpions chase you", feeding a policy π(a|s)]

Page 29:

Model-aware transfer

[Figure: value estimation modules conditioned on texts ("Scorpions chase you", "Spiders are chasers") shared across policies π1(a|s), π2(a|s), …, πn(a|s)]

Page 30:

Experiments

• 2-D game environments with multiple instances (each with different layouts, different entity sets, etc.)

• Text descriptions from Amazon Mechanical Turk

• Transfer setup: train on multiple source tasks, and use learned parameters to initialize for target tasks

• Baselines: DQN (Mnih et al., 2015), text-DQN, Actor-Mimic (Parisotto et al., 2016)

• Evaluation: Jumpstart, average and asymptotic reward

[Figure: source and target game instances for transfer]

Page 31:

[Bar chart: average reward, F&E-1 to Freeway (no-transfer baseline)]

Page 32:

[Bar chart: average reward, F&E-1 to Freeway; No transfer vs. DQN vs. Actor Mimic]

Page 33:

[Bar chart: average reward, F&E-1 to Freeway; No transfer, DQN, Actor Mimic, text-DQN, text-VIN]

Page 34:

Transfer results (F&E-1 to Freeway)

Page 35:

[Figure: four cases — seen 'friendly' entity; unseen entity; unseen entity + 'friendly' text; unseen entity + 'enemy' text. Agent at (3,4), entity at (2,6)]

Page 36:

Representation Learning for Grounded Spatial Reasoning

Michael Janner, Karthik Narasimhan, Regina Barzilay (MIT)

Page 37:

Understanding spatial references

Autonomous navigation, human–robot interaction

"Pick up the red block on top of a green one"

Page 38:

A spatial reasoning task

Reach the cell above the westernmost rock

• Interactive navigation world

• Goal specified in natural language

• Rewards for reaching goals

• No domain knowledge

Page 39:

A spatial reasoning task

Reach the cell above the westernmost rock

Page 40:

A spatial reasoning task

Reach the cell above the westernmost rock

Interpreting spatial references is highly contextual!

Goal: Generalize effectively over different layouts and spatial references

Page 41:

Types of spatial references

1. Refer to a specific entity
2. Location using a single referent entity
3. Location using multiple referent entities

"Go to the circle"

Page 42:

Types of spatial references

1. Refer to a specific entity
2. Location using a single referent entity
3. Location using multiple referent entities

"Reach the cell above the circle"

Page 43:

Types of spatial references

1. Refer to a specific entity
2. Location using a single referent entity
3. Location using multiple referent entities

"Move to the goal one square to the right of triangle and two squares to the bottom of star"

Page 44:

Types of spatial references

1. Refer to a specific entity
2. Location using a single referent entity
3. Location using multiple referent entities

Examples:
• (Local) Go two spaces above the heart.
• (Global) Reach the northernmost house.

Page 45:

Challenges

• Interpretation of spatial references is highly context-dependent.

• Rich, flexible ways of verbalizing spatial references

• Only source of supervision is reward-based feedback

Page 46:

Markov Decision Process

⟨S, A, X, T, R⟩

S : state configurations
A : actions
X : goal specifications in language
T : transition distribution
R : reward function

Page 47:

Value Iteration

Q(s, a, x) = R(s, x) + γ Σ_{s'∈S} T(s'|s, a, x) V(s', x)

V(s, x) = max_a Q(s, a, x)

Goal-conditioned action policy: π(s, x) = argmax_a Q(s, a, x)
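Holding the language goal x fixed, the same computation in tabular form (a sketch; T_x and R_x denote the x-conditioned tables, and broadcasting R over actions is an assumption):

import numpy as np

def goal_conditioned_policy(T_x, R_x, gamma=0.95, n_iters=100):
    # T_x: (S, A, S) transitions T(.|., ., x); R_x: (S,) rewards R(., x),
    # broadcast over actions (an assumption for this sketch).
    S, A, _ = T_x.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = R_x[:, None] + gamma * np.einsum('sap,p->sa', T_x, V)
        V = Q.max(axis=1)                 # V(s, x) = max_a Q(s, a, x)
    return Q.argmax(axis=1)               # pi(s, x) = argmax_a Q(s, a, x)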

Page 48:

Model requirements

• Joint representation of observations (s) and instructions (x)

• Flexible representation of goals, encoding both local structure and global spatial attributes

• Must be compositional, to generalize over linguistic variety in instructions

Page 49:

Universal Value Function Approximators

(Schaul et al., 2015)

Page 50:

Our Model

Page 51:

Our Model

[Figure: model components — convolutional filter and global gradient coefficients]

Page 52:

Our Model

L(Θ) = E_{s∼D}[(V(s, x; Θ) − (R(s, x) + γ max_a Σ_{s'} T(s'|s, a) V̂(s', x; Θ⁻)))²]
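A sketch of this value-regression objective in PyTorch, assuming the successor values under each action have been precomputed with the known transition function T and a frozen target network Θ⁻:

import torch
import torch.nn.functional as F

def value_loss(model, state, text, reward, next_values, gamma=0.9):
    # next_values: (B, A) tensor holding sum_{s'} T(s'|s, a) V_hat(s', x; Theta^-)
    # for each action, precomputed outside this function (an assumption).
    v = model(state, text)                       # V(s, x; Theta), shape (B,)
    with torch.no_grad():
        target = reward + gamma * next_values.max(dim=1).values
    return F.mse_loss(v, target)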

Page 53:

Experimental setup

• Puddle world with randomized layout and randomly placed unique (6) and non-unique (4) objects

• Text instructions for randomized goals collected from Amazon Mechanical Turk (max length 43)

Split   Local   Global
Train    1566     1071
Test      399      272

Page 54:

Baselines

• UVFA (text): UVFA model (Schaul et al., 2015) adapted to use text for goal specifications

• CNN+LSTM: Separately process image and text and then feed concatenated representations to MLP (Misra et al., 2017)

Page 55:

Results: Policy Quality

[Bar chart: policy quality for UVFA (text), CNN+LSTM, and our model on Local and Global instruction sets]

Page 56:

Results: Distance to goal

[Bar chart: average distance to goal for UVFA (text), CNN+LSTM, and our model on Local and Global instruction sets (lower is better)]

Page 57:

Sample efficiency

Page 58:

Value maps

Page 59:

[Figure: example instructions with multiple levels of indirection and redundant information]

Page 60:

Conclusions

• Language provides a compact medium for encoding knowledge for policy transfer

• Model-aware methods are more suitable for leveraging language in transfer.

• Spatial reasoning is highly contextual and a challenging grounding task.

• Joint reasoning over text and environment is crucial for effective generalization over unseen worlds and linguistic variety.

www.karthiknarasimhan.com

Page 61: