
Page 1: INVERSE REINFORCEMENT LEARNING: A REVIEW

INVERSE REINFORCEMENT LEARNING: A REVIEW

Ghazal Zand

Department of Electrical Engineering and Computer Science

Cleveland State University

[email protected]

Page 2: INVERSE REINFORCEMENT LEARNING: A REVIEW

Outline

• Quick review of RL

• IRL introduction

• IRL motivation

• IRL sample applications

• IRL formulation

• IRL shortcomings

• IRL approaches

• Comparison

• Conclusion


Page 3: INVERSE REINFORCEMENT LEARNING: A REVIEW

Inverse Reinforcement Learning vs. Reinforcement Learning

[Figure: side-by-side diagrams of the IRL framework and the RL framework]

Page 4: INVERSE REINFORCEMENT LEARNING: A REVIEW

Inverse Reinforcement Learning Motivation

• IRL was originally posed by Andrew Ng and Stuart Russell

• Ng and Russell. 2000. Algorithms for Inverse Reinforcement Learning. ICML.

• Bee foraging: the bee collects a reward at each flower

• RL assumes the reward is a known function of the flower's nectar content

• But in reality other factors also influence it, e.g. distance, time, risk of wind or predators, …

Page 5: INVERSE REINFORCEMENT LEARNING: A REVIEW

IRL Applications

• Autonomous helicopter aerobatics through apprenticeship learning

• Abbeel, Coates, and Ng.

• Enabling Robots to Communicate Their Objectives

• Huang, Held, Abbeel, and Dragan

Page 6: INVERSE REINFORCEMENT LEARNING: A REVIEW

IRL Applications

• Apprenticeship learning for motion planning with application to parking lot navigation

• Abbeel, Dolgov, Ng, and Thrun

• Tries to mimic different driving styles

Page 7: INVERSE REINFORCEMENT LEARNING: A REVIEW

IRL Formulation

• We are given

• A standard MDP

• defined as a five-element tuple 𝑀 = (𝑆,𝐴,𝑃,𝑅,𝛾)

• The reward function 𝑅 is unknown, but it can be written as a linear combination of features: 𝑅∗ = 𝑊∗ ⋅ 𝐹

• A set of m trajectories generated by an expert

• Goal is to

• Find a reward function 𝑅∗ that explains the expert's behavior

• Use this 𝑅∗ to find a policy whose performance is close to that of the expert (a minimal sketch of this setup follows below)
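To make the formulation concrete, here is a minimal Python sketch of the linear reward 𝑅∗ = 𝑊∗ ⋅ 𝐹 and the expert's discounted feature expectations. The feature map below is a hypothetical placeholder (states and actions are taken to be small integers purely for illustration), not something defined in the reviewed papers.

import numpy as np

K = 4  # number of reward features (hypothetical)

def features(state, action):
    # Hypothetical feature map F(s, a) in R^K; in the highway example these
    # could encode speed, lane, or distance to other cars.
    phi = np.zeros(K)
    phi[(state + action) % K] = 1.0  # placeholder one-hot features
    return phi

def linear_reward(w, state, action):
    # R*(s, a) = W* . F(s, a), the linear reward form assumed above
    return float(w @ features(state, action))

def expert_feature_expectations(trajectories, gamma=0.9):
    # Empirical discounted feature expectations of the m expert trajectories,
    # the quantity most IRL algorithms try to match.
    mu = np.zeros(K)
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            mu += gamma ** t * features(s, a)
    return mu / len(trajectories)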

Page 8: INVERSE REINFORCEMENT LEARNING: A REVIEW

IRL Shortcomings

• In the original IRL formulation the environment is modeled as an MDP

• But in practice there is often no access to the true global state of the environment

• The original IRL algorithm assumes the expert always performs optimally

• But in practice the expert's demonstrations are usually imperfect, noisy, and incomplete

• Most IRL algorithms model the reward function as a linear combination of features

• While the expert might act according to more complex reward functions

• The original IRL problem is ill-posed

• Since there are infinitely many reward functions consistent with the expert’s behavior


Page 9: INVERSE REINFORCEMENT LEARNING: A REVIEW

Improvements on IRL

• To address the MDP limitation

• Choi and Kim. 2011. Inverse Reinforcement Learning in Partially Observable Environments. Journal of Machine Learning Research.

• A generalization to a Partially Observable Markov Decision Process (POMDP) is proposed

• The POMDP model accounts for the agent's limited sensors: states are estimated through observations

Page 10: INVERSE REINFORCEMENT LEARNING: A REVIEW

Improvements on IRL (BIRL)

• To handle the uncertainty in the reward recovered by the original IRL problem

• Probability distributions are used to model this uncertainty

• Ramachandran and Amir. 2007. Bayesian Inverse Reinforcement Learning. IJCAI.

• BIRL assumes that, given the reward function, all of the expert's actions are independent

• This assumption lets us write the likelihood of observing the expert's demonstration sequence as a product over the demonstrated state-action pairs (reconstructed below)
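The equation that belongs here did not survive extraction; a rough reconstruction following the cited paper, where α is a confidence parameter and Q∗ is the optimal Q-function under reward R, is:

P(O \mid R) \;=\; \prod_i P\big((s_i, a_i) \mid R\big) \;\propto\; \exp\Big(\alpha \sum_i Q^*(s_i, a_i; R)\Big)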

Page 11: INVERSE REINFORCEMENT LEARNING: A REVIEW

Improvements on IRL (BIRL)

• According to Bayes' theorem, the posterior probability of the reward function is given by 𝑃(𝑅|𝑂) ∝ 𝑃(𝑂|𝑅) ⋅ 𝑃𝑅(𝑅)

• where 𝑃𝑅(𝑅) is the prior knowledge on the reward function

• In Qiao and Beling. 2011. Inverse Reinforcement Learning with Gaussian Process. American Control Conference (ACC), IEEE.

• The authors assign a Gaussian prior to the reward function

• To deal with noisy observations, incomplete policies, and a small number of observations (a small sketch of the resulting log-posterior follows below)
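A minimal sketch of the resulting unnormalized log-posterior, assuming a zero-mean Gaussian prior on the reward parameters and a hypothetical q_star(s, a, reward) helper supplied by an MDP solver; this is illustrative only, not the inference procedure from either paper.

import numpy as np

def log_posterior(reward, demos, q_star, alpha=1.0, sigma=1.0):
    # Unnormalized log P(R | O) = log P(O | R) + log P_R(R), up to constants.
    # demos  : list of (state, action) pairs demonstrated by the expert
    # q_star : hypothetical callable q_star(s, a, reward) returning the optimal
    #          Q-value under `reward` (assumed to come from an MDP solver)
    log_lik = alpha * sum(q_star(s, a, reward) for s, a in demos)
    log_prior = -0.5 * float(np.dot(reward, reward)) / sigma ** 2  # Gaussian prior
    return log_lik + log_prior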

Page 12: INVERSE REINFORCEMENT LEARNING: A REVIEW

Improvements on IRL (MaxEnt IRL)

• Similar to BIRL, Maximum Entropy Inverse Reinforcement Learning (MaxEnt IRL) uses a probabilistic approach

• Ziebart et al. 2008. Maximum Entropy Inverse Reinforcement Learning. AAAI

• The optimal value of 𝑊 is obtained by maximizing the likelihood of the observed trajectories under the maximum-entropy distribution, using gradient-based methods (a sketch of the gradient step follows at the end of this slide)

• One solution to the large-state-space problem is approximating MaxEnt IRL with graphs

• Shimosaka et al. 2017. Fast Inverse Reinforcement Learning with Interval Consistent Graph for Driving Behavior Prediction.
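A minimal sketch of the gradient step mentioned above, under the standard MaxEnt IRL result that the gradient of the log-likelihood is the expert's feature expectations minus the feature expectations induced by the current weights. expert_mu could come from the expert_feature_expectations sketch on the formulation slide; expected_feature_counts is a hypothetical helper (e.g. computed via soft value iteration and a forward pass over state visitation frequencies).

def maxent_gradient_step(w, expert_mu, expected_feature_counts, lr=0.1):
    # One gradient-ascent step on the MaxEnt IRL log-likelihood.
    # w, expert_mu            : numpy arrays of reward weights / expert feature counts
    # expected_feature_counts : hypothetical callable returning the feature
    #                           expectations of the trajectory distribution
    #                           induced by the current weights w
    grad = expert_mu - expected_feature_counts(w)
    return w + lr * grad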

Page 13: INVERSE REINFORCEMENT LEARNING: A REVIEW

Improvements on IRL (MMP IRL)

• To address the uncertainty in the reward function recovered by the original IRL problem:

• Ratliff et al. 2006. Maximum Margin Planning. ICML (ACM).

• Introduces loss functions

• in different forms

• to penalize choosing actions that differ from the expert's demonstration

• to penalize reaching states that the expert chooses not to enter


Page 14: INVERSE REINFORCEMENT LEARNING: A REVIEW

Improvements on IRL (MMP IRL)

• The differences between MMP and the original IRL:

• in MMP, the margin scales with these loss functions (see the objective sketched at the end of this slide)

• instead of returning policies, MMP reproduces the expert’s behaviors

• But MMP still assumes a linear form for the reward function.

• Ratliff et al. 2009. Learning to Search: Functional Gradient Techniques for Imitation Learning. Autonomous Robots.

• Extended MMP to learn non-linear reward functions by introducing the LEARCH algorithm
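A rough sketch of the loss-augmented max-margin objective from the cited MMP paper (notation simplified and not taken from this deck): F_i maps the weights to per-state-action costs for example i, μ ranges over the feasible state-action visitation frequencies G_i, μ_i is the expert's demonstrated path, and l_i is the chosen loss vector.

\min_w \; \frac{\lambda}{2}\lVert w\rVert^2 \;+\; \frac{1}{m}\sum_{i=1}^{m}\Big( w^\top F_i \mu_i \;-\; \min_{\mu \in G_i}\big( w^\top F_i \mu - l_i^\top \mu \big) \Big)

The inner minimization is loss-augmented planning: the expert's path must be cheaper than every alternative by an amount that grows with how much that alternative deviates from the demonstration, which is how the margin "scales with the loss".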


Page 15: INVERSE REINFORCEMENT LEARNING: A REVIEW

Improvements on IRL (more complex reward functions)

• The IRL algorithm originally considers the reward as a weighted linear combination of features

• But to better capture the relationship between the feature vector and the expert demonstrations:

• Choi and Kim. 2013. Bayesian Nonparametric Feature Construction for Inverse Reinforcement Learning. IJCAI

• Considered the reward function as a weighted set of composite features

• Levine et al. 2011. Nonlinear inverse reinforcement learning with Gaussian processes. Advances in Neural Information Processing Systems

• Used a kernel machine for modelling the reward function

Page 16: INVERSE REINFORCEMENT LEARNING: A REVIEW

Improvements on IRL (more complex reward functions)

• Wulfmeier et al. 2015. Deep inverse reinforcement learning. CoRR

• Used a sufficiently large deep neural network with two layers and sigmoid activation functions

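A minimal sketch of such a reward network, read here as one hidden sigmoid layer plus a linear output that maps a state feature vector to a scalar reward; the sizes, initialization, and class name are illustrative assumptions, not architecture details from the paper.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TwoLayerReward:
    # Tiny two-layer reward network: r(phi) = w2 . sigmoid(W1 @ phi + b1) + b2.
    # Maps a state feature vector phi to a scalar reward (illustrative sketch).
    def __init__(self, n_features, n_hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = 0.1 * rng.standard_normal((n_hidden, n_features))
        self.b1 = np.zeros(n_hidden)
        self.w2 = 0.1 * rng.standard_normal(n_hidden)
        self.b2 = 0.0

    def reward(self, phi):
        h = sigmoid(self.W1 @ phi + self.b1)  # hidden layer, sigmoid activation
        return float(self.w2 @ h + self.b2)   # linear output: scalar reward

Roughly speaking, in the cited approach such a network replaces the linear reward inside the MaxEnt IRL pipeline, with the MaxEnt gradient backpropagated through the layers to update the weights.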

Page 17: INVERSE REINFORCEMENT LEARNING: A REVIEW

Highway Driving Simulator

• To compare the efficiency of some of these algorithms, we applied them to the highway driving simulation problem set up in

• Levine et al. 2010. Feature Construction for Inverse Reinforcement Learning. Advances in Neural Information Processing Systems.

• Goal: learn a reward function from human demonstrations

• The road color indicates the reward at the highest speed

• The agent is penalized for driving fast near the police vehicle

Page 18: INVERSE REINFORCEMENT LEARNING: A REVIEW

Learned Reward Functions

[Figure: (a) sample highway environment, (b) human demonstration, (c) MMP IRL results, (d) MaxEnt IRL results, (e) MWAL IRL results, (f) GPIRL results]

Page 19: INVERSE REINFORCEMENT LEARNING: A REVIEW

Comparison

• Expected Value Difference (EVD) score:

• Presented in

• Levine et al. 2011. Nonlinear Inverse Reinforcement Learning with Gaussian Processes. Advances in Neural Information Processing Systems.

• Measures how suboptimal the learned policy is under the true reward (written out below)
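Written out, in a standard formulation of the score (not copied from the slide): π∗ is the optimal policy under the true reward and π̂ is the policy that is optimal under the learned reward.

\mathrm{EVD} \;=\; \mathbb{E}\Big[\textstyle\sum_t \gamma^t R_{\mathrm{true}}(s_t) \,\Big|\, \pi^*\Big] \;-\; \mathbb{E}\Big[\textstyle\sum_t \gamma^t R_{\mathrm{true}}(s_t) \,\Big|\, \hat{\pi}\Big]

Lower is better: an EVD of zero means the learned reward induces behavior as good, under the true reward, as the optimal policy itself.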


Page 20: INVERSE REINFORCEMENT LEARNING: A REVIEW

Conclusion

• The IRL problem was introduced

• Some sample applications of IRL were presented

• The original IRL's shortcomings were mentioned

• Then papers presenting different approaches to IRL were reviewed

• Finally, the performance of some of the IRL algorithms on the highway driving simulator, with a varying number of human demonstrations, was compared
