Value Iteration Networks NIPS 2016 BEST PAPER 7-Minute Tour Runzhe Yang @ SJTU ACM CLASS Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel @ Berkeley Artificial Intelligence Research Lab (BAIR)


Value Iteration Networks

NIPS 2016 BEST PAPER

7-Minute Tour

Runzhe Yang @ SJTU ACM CLASS

Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel

@ Berkeley Artificial Intelligence Research Lab (BAIR)

Introduction

• Deep RL learns policies from complicated visual input

• Learns to act, but does it understand?

• A simple test: generalization on grid worlds

Introduction

• A simple test: generalization on grid worlds

[Figure: Train vs. Test grid worlds. Test: FAIL]


Introduction

Why doesn’t it understand?


Introduction

Why doesn’t it understand?

[Diagram: Task, Observation, Deep Q-Net, Policy, Action Probability]

- A neural network (NN) is trained to represent a policy

- New task → need to re-plan

Introduction

Why doesn’t it understand?

[Diagram: Task (Sequential Nature), Observation, Deep Q-Net, Policy]

- A sequential problem requires a planning computation

Introduction

Why doesn’t it understand?

[Diagram: Task (Sequential Nature), Observation, Deep Q-Net, Reactive Policy]

- A sequential problem requires a planning computation

- RL gets around that (learns a mapping, State → Q-value)

- Lack of planning computation → bad understanding
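To make "learns a mapping, State → Q-value" concrete, here is a minimal sketch of a reactive policy. It assumes a hypothetical lookup table, `q_table`, standing in for the Deep Q-Net, with made-up Q-values for two grid cells of one training map. The names `q_table`, `reactive_policy`, and `has_entry` are illustrative and do not come from the talk.

```python
from typing import Dict, Tuple

State = Tuple[int, int]
ACTIONS = ("up", "down", "left", "right")

# Hypothetical Q-values memorised on a single training map (illustrative numbers).
q_table: Dict[State, Dict[str, float]] = {
    (0, 0): {"up": 0.1, "down": 0.7, "left": 0.0, "right": 0.5},
    (1, 0): {"up": 0.2, "down": 0.3, "left": 0.1, "right": 0.9},
}


def reactive_policy(state: State) -> str:
    """Return the action with the largest stored Q-value. No planning happens here."""
    q_values = q_table[state]
    return max(ACTIONS, key=lambda action: q_values[action])


def has_entry(state: State) -> bool:
    """True if the mapping was ever fitted for this state."""
    return state in q_table
```

The policy only reads off a stored mapping. Nothing in it recomputes a route when the map or the goal changes, which is the gap the slide points at. A real network would still output something for an unseen state; the table is only an analogy.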

Introduction

In this work:

- Policies that generalize to unseen tasks

- Learn to plan

[Diagram: Task (Sequential Nature), Observation, Deep Q-Net, Reactive Policy]

A Planning-based Policy Model

[Diagram: Observation, Reactive Policy]

- Start from reactive policy

A Planning-based Policy Model

[Diagram: Observation, Planning Module (plan on MDP $\bar{M}$), Reactive Policy]

- Add an explicit planning computation

- Assumption: observation can be mapped to a useful (but unknown) planning computation

A Planning-based Policy Model

[Diagram: Observation, Planning Module (plan on MDP $\bar{M}$), Reactive Policy]

- NNs map observation to reward and transitions

- Later, learn on new MDP

- How to use the planning computation?
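As a rough illustration of what "plan on MDP" means once a reward and transitions are available, the sketch below runs plain value iteration on a small grid. Everything in it is an assumption made for the example: the reward grid is written by hand (in the talk it would come from a neural network applied to the observation), transitions are deterministic four-way moves, the goal cell is terminal, and `GAMMA`, `MOVES`, and `value_iteration` are hypothetical names.

```python
from typing import List, Tuple

GAMMA = 0.9  # assumed discount factor
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def value_iteration(
    reward: List[List[float]], goal: Tuple[int, int], k: int
) -> List[List[float]]:
    """Run k sweeps of value iteration on a grid with deterministic moves.

    reward[r][c] is the reward for entering cell (r, c).
    The goal cell is terminal, so its value stays 0.
    """
    rows, cols = len(reward), len(reward[0])
    value = [[0.0] * cols for _ in range(rows)]
    for _ in range(k):
        new_value = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                if (r, c) == goal:
                    continue
                best = float("-inf")
                for dr, dc in MOVES:
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < rows and 0 <= nc < cols):
                        nr, nc = r, c  # moving off the grid leaves the agent in place
                    best = max(best, reward[nr][nc] + GAMMA * value[nr][nc])
                new_value[r][c] = best
        value = new_value
    return value
```

For a 1 × 3 corridor with the goal at the right end, `value_iteration([[0.0, 0.0, 1.0]], goal=(0, 2), k=3)` gives values that decay with distance from the goal, so following increasing value leads to it.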


A Planning-based Policy Model

[Diagram: Observation, Planning Module (plan on MDP $\bar{M}$), Attention, Reactive Policy]

- Fact 1: value function = sufficient information about plan

- Fact 2: action prediction can require only a subset of the value function $\bar{V}^*$
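One simple way to read Fact 2 and the Attention block: the policy does not need the planner's output for the whole map, only the entries for the agent's current position. The sketch below assumes the planner output is stored as one Q-value plane per action, and it uses a plain softmax as a stand-in for the learned reactive policy. `attend`, `softmax`, and the plane layout are hypothetical choices for illustration.

```python
import math
from typing import List, Tuple

Plane = List[List[float]]


def attend(q_planes: List[Plane], position: Tuple[int, int]) -> List[float]:
    """Select, for every action plane, only the entry at the agent's position."""
    r, c = position
    return [plane[r][c] for plane in q_planes]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax, standing in for a learned policy head."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Two hypothetical action planes on a 2 x 2 grid.
q_planes = [
    [[0.0, 0.1], [0.2, 0.3]],
    [[1.0, 1.1], [1.2, 1.3]],
]
features = attend(q_planes, (1, 0))    # one value per action
action_probs = softmax(features)       # action probabilities
```

The attended vector has one entry per action regardless of map size, so the part of the policy that consumes it stays small.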

A Planning-based Policy Model

[Diagram: Observation, Planning Module (plan on MDP $\bar{M}$), Attention, Reactive Policy]

- Policy is still a mapping

- Parameters for mappings $\bar{R}$ (reward), $\bar{P}$ (transitions), and attention

- How to back-prop through planning computation?



Value Iteration Network

- Differentiable planner (Value Iteration ≈ CNN)

[Diagram: Value Iteration Module. Inputs: Reward, Prev. V. Output: New Value. Repeated with k recurrence]

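Conv: $\bar{Q}(s,a) = \bar{R}(s,a) + \gamma \sum_{s'} \bar{P}(s' \mid s,a)\, \bar{V}(s')$

Pool: $\bar{V}(s) = \max_a \bar{Q}(s,a)$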
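The "Value Iteration ≈ CNN" idea can be sketched in a few lines: a 3 × 3 convolution over the reward and previous-value maps produces one Q plane per action (Conv), a max over those planes gives the new value map (Pool), and the pair is repeated k times. The sketch below is an assumption-laden toy, not the authors' implementation. It uses hand-set kernels that each look at one neighbouring cell (in a trained network these weights would be learned), zero padding at the border, no terminal state, and hypothetical names throughout.

```python
from typing import List

Grid = List[List[float]]
GAMMA = 0.9  # assumed discount factor
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def shift_kernel(dr: int, dc: int, weight: float) -> Grid:
    """3x3 kernel that reads the neighbour at offset (dr, dc), scaled by weight."""
    kernel = [[0.0] * 3 for _ in range(3)]
    kernel[dr + 1][dc + 1] = weight
    return kernel


def conv2d_same(image: Grid, kernel: Grid) -> Grid:
    """3x3 cross-correlation with zero padding, as in a CNN layer."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        out[r][c] += kernel[dr + 1][dc + 1] * image[rr][cc]
    return out


def vi_module(reward: Grid, kernels_r: List[Grid], kernels_v: List[Grid], k: int) -> Grid:
    """k rounds of Conv (one Q plane per action) followed by Pool (max over actions)."""
    rows, cols = len(reward), len(reward[0])
    value = [[0.0] * cols for _ in range(rows)]
    for _ in range(k):
        q_planes = []
        for w_r, w_v in zip(kernels_r, kernels_v):
            q_r = conv2d_same(reward, w_r)
            q_v = conv2d_same(value, w_v)
            q_planes.append(
                [[q_r[r][c] + q_v[r][c] for c in range(cols)] for r in range(rows)]
            )
        value = [
            [max(q[r][c] for q in q_planes) for c in range(cols)] for r in range(rows)
        ]
    return value


kernels_r = [shift_kernel(dr, dc, 1.0) for dr, dc in MOVES]
kernels_v = [shift_kernel(dr, dc, GAMMA) for dr, dc in MOVES]
new_value = vi_module([[0.0, 0.0, 1.0]], kernels_r, kernels_v, k=2)
```

Because every step is a convolution, a sum, or a max, gradients can flow through all k iterations with ordinary back-propagation, which is what makes the planner differentiable.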

Experiments

1. Grid-World Domain

2. Mars Rover Navigation

3. Continuous Control

4. WebNav Challenge

Network   8 × 8    16 × 16
VIN       90.9%    82.5%
CNN       86.9%    33.1%

Table: RL Results – performance on test maps.
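The gap in the table is easier to see as a drop in percentage points when moving from 8 × 8 to 16 × 16 maps. The short sketch below only redoes that arithmetic on the four numbers in the table; the slide does not name the metric, and `results` and `drop` are hypothetical names.

```python
# Figures copied from the table: (8 x 8, 16 x 16), in percent.
results = {
    "VIN": (90.9, 82.5),
    "CNN": (86.9, 33.1),
}


def drop(network: str) -> float:
    """Percentage-point drop from the 8 x 8 maps to the 16 x 16 maps."""
    small, large = results[network]
    return round(small - large, 1)


vin_drop = drop("VIN")
cnn_drop = drop("CNN")
```

On these numbers VIN loses 8.4 points on the larger maps while the CNN loses 53.8.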


Q&A

Thank you!
