
Page 1: INVERSE REINFORCEMENT LEARNING: A REVIEW

INVERSE REINFORCEMENT LEARNING: A REVIEW

Ghazal Zand

Department of Electrical Engineering and Computer Science

Cleveland State University

[email protected]

Page 2: INVERSE REINFORCEMENT LEARNING: A REVIEW

Outline

• Quick review of RL

• IRL introduction

• IRL motivation

• IRL sample applications

• IRL formulation

• IRL shortcomings

• IRL approaches

• Comparison

• Conclusion


Page 3: INVERSE REINFORCEMENT LEARNING: A REVIEW

Inverse Reinforcement Learning vs. Reinforcement Learning

[Figure: side-by-side diagrams of the IRL framework and the RL framework]

Page 4: INVERSE REINFORCEMENT LEARNING: A REVIEW

Inverse Reinforcement Learning Motivation

• IRL was originally posed by Andrew Ng and Stuart Russell

• Ng and Russell. 2000. Algorithms for Inverse Reinforcement Learning. ICML.

• Bee foraging: the bee collects a reward at each flower

• RL assumes the reward is a known function of the flower's nectar content

• But in reality other factors also influence it, e.g. distance, time, risk of wind or predators, …

Page 5: INVERSE REINFORCEMENT LEARNING: A REVIEW

IRL Applications

• Autonomous helicopter aerobatics through apprenticeship learning

• Abbeel, Coates, and Ng.

• Enabling Robots to Communicate Their Objectives

• Huang, Held, Abbeel, and Dragan

Page 6: INVERSE REINFORCEMENT LEARNING: A REVIEW

IRL Applications

• Apprenticeship learning for motion planning with application to parking lot navigation

• Abbeel, Dolgov, Ng, and Thrun

• Tries to mimic different driving styles

Page 7: INVERSE REINFORCEMENT LEARNING: A REVIEW

IRL Formulation

• We are given

• A standard MDP

• defined as a five-element tuple 𝑀 = (𝑆,𝐴,𝑃,𝑅,𝛾)

• The reward function 𝑅 is unknown, but it can be written as a linear combination of features: 𝑅∗ = 𝑊∗ ⋅ 𝐹

• A set of m trajectories generated by an expert

• Goal is to

• Find a reward function 𝑅∗ that explains the expert's behavior

• Use this 𝑅∗ to find a policy whose performance is close to that of the expert (a minimal sketch of this setup follows below)
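To make the formulation concrete, here is a minimal Python sketch of the linear reward 𝑅∗ = 𝑊∗ ⋅ 𝐹 and the expert's discounted feature expectations. The feature map below is a hypothetical placeholder (states and actions are taken to be small integers purely for illustration), not something defined in the reviewed papers.

import numpy as np

K = 4  # number of reward features (hypothetical)

def features(state, action):
    # Hypothetical feature map F(s, a) in R^K; in the highway example these
    # could encode speed, lane, or distance to other cars.
    phi = np.zeros(K)
    phi[(state + action) % K] = 1.0  # placeholder one-hot features
    return phi

def linear_reward(w, state, action):
    # R*(s, a) = W* . F(s, a), the linear reward form assumed above
    return float(w @ features(state, action))

def expert_feature_expectations(trajectories, gamma=0.9):
    # Empirical discounted feature expectations of the m expert trajectories,
    # the quantity most IRL algorithms try to match.
    mu = np.zeros(K)
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            mu += gamma ** t * features(s, a)
    return mu / len(trajectories)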

Page 8: INVERSE REINFORCEMENT LEARNING: A REVIEW

IRL Shortcomings

• In the original IRL formulation the environment is modeled as an MDP

• But in practice there is often no access to the true global state of the environment

• The original IRL algorithm assumes the expert always performs optimally

• But in practice the expert's demonstrations are usually imperfect, noisy, and incomplete

• Most IRL algorithms model the reward function as a linear combination of features

• While the expert might act according to more complex reward functions

• The original IRL problem is ill-posed

• Since there are infinitely many reward functions consistent with the expert’s behavior


Page 9: INVERSE REINFORCEMENT LEARNING: A REVIEW

Improvements on IRL

• To address the MDP limitation

• Choi and Kim. 2011. Inverse Reinforcement Learning in Partially Observable Environments. Journal of Machine Learning Research.

• A generalization to a Partially Observable Markov Decision Process (POMDP) is proposed

• The POMDP model accounts for the agent's limited sensors: states are estimated through observations

Page 10: INVERSE REINFORCEMENT LEARNING: A REVIEW

Improvements on IRL (BIRL)

• To handle the uncertainty in the reward recovered by the original IRL problem

• Probability distributions are used to model this uncertainty

• Ramachandran and Amir. 2007. Bayesian Inverse Reinforcement Learning. IJCAI.

• BIRL assumes that, given the reward function, all of the expert's actions are independent

• This assumption lets us write the likelihood of observing the expert's demonstration sequence as a product over the demonstrated state-action pairs (reconstructed below)
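The equation that belongs here did not survive extraction; a rough reconstruction following the cited paper, where α is a confidence parameter and Q∗ is the optimal Q-function under reward R, is:

P(O \mid R) \;=\; \prod_i P\big((s_i, a_i) \mid R\big) \;\propto\; \exp\Big(\alpha \sum_i Q^*(s_i, a_i; R)\Big)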

Page 11: INVERSE REINFORCEMENT LEARNING: A REVIEW

Improvements on IRL (BIRL)

• According to Bayes' theorem, the posterior probability of the reward function is given by 𝑃(𝑅|𝑂) ∝ 𝑃(𝑂|𝑅) ⋅ 𝑃𝑅(𝑅)

• where 𝑃𝑅(𝑅) is the prior knowledge on the reward function

• In Qiao and Beling. 2011. Inverse Reinforcement Learning with Gaussian Process. American Control Conference (ACC), IEEE.

• The authors assign a Gaussian prior to the reward function

• To deal with noisy observations, incomplete policies, and a small number of observations (a small sketch of the resulting log-posterior follows below)
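A minimal sketch of the resulting unnormalized log-posterior, assuming a zero-mean Gaussian prior on the reward parameters and a hypothetical q_star(s, a, reward) helper supplied by an MDP solver; this is illustrative only, not the inference procedure from either paper.

import numpy as np

def log_posterior(reward, demos, q_star, alpha=1.0, sigma=1.0):
    # Unnormalized log P(R | O) = log P(O | R) + log P_R(R), up to constants.
    # demos  : list of (state, action) pairs demonstrated by the expert
    # q_star : hypothetical callable q_star(s, a, reward) returning the optimal
    #          Q-value under `reward` (assumed to come from an MDP solver)
    log_lik = alpha * sum(q_star(s, a, reward) for s, a in demos)
    log_prior = -0.5 * float(np.dot(reward, reward)) / sigma ** 2  # Gaussian prior
    return log_lik + log_prior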

Page 12: INVERSE REINFORCEMENT LEARNING: A REVIEW

Improvements on IRL (MaxEnt IRL)

• Similar to BIRL, Maximum Entropy Inverse Reinforcement Learning (MaxEnt IRL) uses a probabilistic approach

• Ziebart et al. 2008. Maximum Entropy Inverse Reinforcement Learning. AAAI

• The optimal value of 𝑊 is obtained by maximizing the likelihood of the observed trajectories under the maximum-entropy distribution, using gradient-based methods (a sketch of the gradient step follows at the end of this slide)

• One solution to the large-state-space problem is approximating MaxEnt IRL with graphs

• Shimosaka et al. 2017. Fast Inverse Reinforcement Learning with Interval Consistent Graph for Driving Behavior Prediction.
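A minimal sketch of the gradient step mentioned above, under the standard MaxEnt IRL result that the gradient of the log-likelihood is the expert's feature expectations minus the feature expectations induced by the current weights. expert_mu could come from the expert_feature_expectations sketch on the formulation slide; expected_feature_counts is a hypothetical helper (e.g. computed via soft value iteration and a forward pass over state visitation frequencies).

def maxent_gradient_step(w, expert_mu, expected_feature_counts, lr=0.1):
    # One gradient-ascent step on the MaxEnt IRL log-likelihood.
    # w, expert_mu            : numpy arrays of reward weights / expert feature counts
    # expected_feature_counts : hypothetical callable returning the feature
    #                           expectations of the trajectory distribution
    #                           induced by the current weights w
    grad = expert_mu - expected_feature_counts(w)
    return w + lr * grad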

Page 13: INVERSE REINFORCEMENT LEARNING: A REVIEW

Improvements on IRL (MMP IRL)

• To address the uncertainty in the reward function recovered by the original IRL problem:

• Ratliff et al. 2006. Maximum Margin Planning. ICML (ACM).

• Introduces loss functions

• in different forms

• to penalize choosing actions that differ from the expert's demonstration

• to penalize reaching states that the expert chooses not to enter


Page 14: INVERSE REINFORCEMENT LEARNING: A REVIEW

Improvements on IRL (MMP IRL)

• The differences between MMP and the original IRL:

• in MMP, the margin scales with these loss functions (see the objective sketched at the end of this slide)

• instead of returning policies, MMP reproduces the expert’s behaviors

• But MMP still assumes a linear form for the reward function.

• Ratliff et al. 2009. Learning to Search: Functional Gradient Techniques for Imitation Learning. Autonomous Robots.

• Extended MMP to learn non-linear reward functions by introducing the LEARCH algorithm
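A rough sketch of the loss-augmented max-margin objective from the cited MMP paper (notation simplified and not taken from this deck): F_i maps the weights to per-state-action costs for example i, μ ranges over the feasible state-action visitation frequencies G_i, μ_i is the expert's demonstrated path, and l_i is the chosen loss vector.

\min_w \; \frac{\lambda}{2}\lVert w\rVert^2 \;+\; \frac{1}{m}\sum_{i=1}^{m}\Big( w^\top F_i \mu_i \;-\; \min_{\mu \in G_i}\big( w^\top F_i \mu - l_i^\top \mu \big) \Big)

The inner minimization is loss-augmented planning: the expert's path must be cheaper than every alternative by an amount that grows with how much that alternative deviates from the demonstration, which is how the margin "scales with the loss".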


Page 15: INVERSE REINFORCEMENT LEARNING: A REVIEW

Improvements on IRL (more complex reward functions)

• The IRL algorithm originally considers the reward as a weighted linear combination of features

• But to better capture the relationship between the feature vector and the expert demonstrations:

• Choi and Kim. 2013. Bayesian Nonparametric Feature Construction for Inverse Reinforcement Learning. IJCAI

• Considered the reward function as a weighted set of composite features

• Levine et al. 2011. Nonlinear inverse reinforcement learning with Gaussian processes. Advances in Neural Information Processing Systems

• Used a kernel machine for modelling the reward function

Page 16: INVERSE REINFORCEMENT LEARNING: A REVIEW

Improvements on IRL (more complex reward functions)

• Wulfmeier et al. 2015. Deep inverse reinforcement learning. CoRR

• Used a sufficiently large deep neural network with two layers and sigmoid activation functions

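A minimal sketch of such a reward network, read here as one hidden sigmoid layer plus a linear output that maps a state feature vector to a scalar reward; the sizes, initialization, and class name are illustrative assumptions, not architecture details from the paper.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TwoLayerReward:
    # Tiny two-layer reward network: r(phi) = w2 . sigmoid(W1 @ phi + b1) + b2.
    # Maps a state feature vector phi to a scalar reward (illustrative sketch).
    def __init__(self, n_features, n_hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = 0.1 * rng.standard_normal((n_hidden, n_features))
        self.b1 = np.zeros(n_hidden)
        self.w2 = 0.1 * rng.standard_normal(n_hidden)
        self.b2 = 0.0

    def reward(self, phi):
        h = sigmoid(self.W1 @ phi + self.b1)  # hidden layer, sigmoid activation
        return float(self.w2 @ h + self.b2)   # linear output: scalar reward

Roughly speaking, in the cited approach such a network replaces the linear reward inside the MaxEnt IRL pipeline, with the MaxEnt gradient backpropagated through the layers to update the weights.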

Page 17: INVERSE REINFORCEMENT LEARNING: A REVIEW

Highway Driving Simulator

• To compare the efficiency of some of these algorithms, we applied them to the highway driving simulation problem set up in

• Levine et al. 2010. Feature Construction for Inverse Reinforcement Learning. Advances in Neural Information Processing Systems.

• Goal: learn a reward function from human demonstrations

• The road color indicates the reward at the highest speed

• The agent is penalized for driving fast near the police vehicle

Page 18: INVERSE REINFORCEMENT LEARNING: A REVIEW

Learned Reward Functions

[Figure: (a) sample highway environment, (b) human demonstration, (c) MMP IRL results, (d) MaxEnt IRL results, (e) MWAL IRL results, (f) GPIRL results]

Page 19: INVERSE REINFORCEMENT LEARNING: A REVIEW

Comparison

• Expected Value Difference (EVD) score:

• Presented in

• Levine et al. 2011. Nonlinear Inverse Reinforcement Learning with Gaussian Processes. Advances in Neural Information Processing Systems.

• Measures how suboptimal the learned policy is under the true reward (written out below)
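Written out, in a standard formulation of the score (not copied from the slide): π∗ is the optimal policy under the true reward and π̂ is the policy that is optimal under the learned reward.

\mathrm{EVD} \;=\; \mathbb{E}\Big[\textstyle\sum_t \gamma^t R_{\mathrm{true}}(s_t) \,\Big|\, \pi^*\Big] \;-\; \mathbb{E}\Big[\textstyle\sum_t \gamma^t R_{\mathrm{true}}(s_t) \,\Big|\, \hat{\pi}\Big]

Lower is better: an EVD of zero means the learned reward induces behavior as good, under the true reward, as the optimal policy itself.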


Page 20: INVERSE REINFORCEMENT LEARNING: A REVIEW

Conclusion

• The IRL problem was introduced

• Some sample applications of IRL were presented

• The original IRL's shortcomings were mentioned

• Then papers presenting different approaches to IRL were reviewed

• Finally, the performance of some of the IRL algorithms on the highway driving simulator, with a varying number of human demonstrations, was compared
