Inverse Reinforcement Learning
CS885 Reinforcement Learning
Module 6: November 9, 2021
Ziebart, B. D., Bagnell, J. A., & Dey, A. K. (2010). Modeling interaction via the principle of maximum causal entropy. In ICML.
Finn, C., Levine, S., & Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization. In ICML (pp. 49-58).
CS885 Fall 2021, Pascal Poupart, University of Waterloo
Reinforcement Learning Problem
[Diagram: agent-environment loop — the environment sends a state and reward to the agent, which sends back an action]
Data: $(s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T)$
Goal: learn to choose actions that maximize rewards
Imitation Learning
[Diagram: expert-environment loop — the environment sends a state and reward to the expert, which sends back an optimal action]
Data: $(s_1, a_1^*, s_2, a_2^*, \ldots, s_T, a_T^*)$
Goal: learn to choose actions by imitating expert actions
Problems
• Imitation learning: supervised learning formulation
  – Issue #1: the assumption that state-action pairs are identically and independently distributed (i.i.d.) is false: $(s_1, a_1^*), (s_2, a_2^*), \ldots, (s_T, a_T^*)$ are not independent
  – Issue #2: can't easily transfer the learnt policy to environments with different dynamics
Inverse Reinforcement Learning (IRL)
RL: rewards → optimal policy $\pi^*$
IRL: optimal policy $\pi^*$ → rewards
Benefit: the reward function can easily be transferred to a new environment, where we can then learn an optimal policy
Formal Definition
Reinforcement Learning (RL)
• States: $s \in S$
• Actions: $a \in A$
• Transition model: $\Pr(s_t \mid s_{t-1}, a_{t-1})$
• Rewards: $r \in \mathbb{R}$
• Reward model: $\Pr(r_t \mid s_t, a_t)$
• Discount factor: $0 \le \gamma \le 1$
• Horizon (i.e., # of time steps): $h$
Data: $(s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T)$
Goal: find the optimal policy $\pi^*$

Inverse Reinforcement Learning (IRL)
• States: $s \in S$
• Optimal actions: $a^* \in A$
• Transition model: $\Pr(s_t \mid s_{t-1}, a_{t-1})$
• Rewards: $r \in \mathbb{R}$
• Reward model: $\Pr(r_t \mid s_t, a_t)$
• Discount factor: $0 \le \gamma \le 1$
• Horizon (i.e., # of time steps): $h$
Data: $(s_1, a_1^*, s_2, a_2^*, \ldots, s_T, a_T^*)$
Goal: find $\Pr(r_t \mid s_t, a_t)$ for which the expert actions $a^*$ are optimal
IRL Applications
Advantages:
• No assumption that state-action pairs are i.i.d.
• The reward function can be transferred to new environments/tasks
[Images: example applications — autonomous driving, robotics]
IRL Techniques
General approach:
1. Find a reward function for which the expert actions are optimal
2. Use the reward function to optimize a policy in the same or new environments

Broad categories of IRL techniques:
• Feature matching
• Maximum margin IRL
• Maximum entropy IRL
• Bayesian IRL
[Diagram: data $(s_1, a_1^*, s_2, a_2^*, \ldots, s_T, a_T^*)$ → reward $R(s, a)$ or $\Pr(r \mid s, a)$ → policy $\pi^*$]
Feature Expectation Matching
• Normally: find $R$ such that $\pi^*$ chooses the same actions $a^*$ as the expert
• Problem: we may not have enough data for some states (especially continuous states) to properly estimate transitions and rewards
• Note: rewards typically depend on features $\phi_i(s, a)$, e.g., $R(s, a) = \sum_i w_i \phi_i(s, a) = w^T \phi(s, a)$
• Idea: compute feature expectations and match them
Feature Expectation Matching
Let $\mu_E(s_0) = \frac{1}{N} \sum_{n} \sum_t \gamma^t \phi\big(s_t^{(n)}, a_t^{(n)}\big)$ be the average discounted feature count of the expert $E$ (where $n$ indexes the $N$ expert trajectories).

Let $\mu_\pi(s_0)$ be the expected discounted feature count of policy $\pi$.

Claim: if $\mu_\pi(s) = \mu_E(s)\ \forall s$, then $V^\pi(s) = V^E(s)\ \forall s$.
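A minimal sketch of the empirical expert feature count $\mu_E$, assuming NumPy, trajectories given as lists of (state, action) pairs, and a hypothetical feature map `phi(s, a)` returning a length-k vector (all names are illustrative, not from the slides):

```python
import numpy as np

def expert_feature_counts(trajectories, phi, gamma, k):
    """Average discounted feature count mu_E over expert trajectories.

    trajectories : list of trajectories, each a list of (s, a) pairs
    phi          : feature map (s, a) -> length-k numpy array
    gamma        : discount factor
    k            : feature dimension
    """
    mu = np.zeros(k)
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            mu += (gamma ** t) * phi(s, a)
    return mu / len(trajectories)
```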
Proof:
Features: $\phi(s, a) = (\phi_1(s, a), \phi_2(s, a), \phi_3(s, a), \ldots)^T$
Linear reward function: $R_w(s, a) = \sum_i w_i \phi_i(s, a) = w^T \phi(s, a)$
Discounted state visitation frequency: $\rho^\pi_{s_0}(s') = \delta(s', s_0) + \gamma \sum_s \rho^\pi_{s_0}(s) \Pr(s' \mid s, \pi(s))$
Value function: $V^\pi(s) = \sum_{s'} \rho^\pi_{s}(s')\, R(s', \pi(s')) = \sum_{s'} \rho^\pi_{s}(s')\, w^T \phi(s', \pi(s')) = w^T \sum_{s'} \rho^\pi_{s}(s')\, \phi(s', \pi(s')) = w^T \mu_\pi(s)$
Hence: $\mu_\pi(s) = \mu_E(s) \;\Rightarrow\; w^T \mu_\pi(s) = w^T \mu_E(s) \;\Rightarrow\; V^\pi(s) = V^E(s)$
Indeterminacy of Rewards
• Learning $R_w(s, a) = w^T \phi(s, a)$ amounts to learning $w$
• When $\mu_\pi(s) = \mu_E(s)$, then $V^\pi(s) = V^E(s)$, but $w$ can be anything since $\mu_\pi(s) = \mu_E(s) \;\Rightarrow\; w^T \mu_\pi(s) = w^T \mu_E(s)\ \forall w$
• We need a bias to determine $w$
• Ideas:
  – Maximize the margin
  – Maximize entropy
Maximum Margin IRL
• Idea: select the reward function that yields the greatest minimum difference (margin) between the Q-values of the expert actions and the other actions:
$\text{margin} = \min_s \Big[ Q(s, a^*) - \max_{a \ne a^*} Q(s, a) \Big]$
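For a tabular Q-function this margin is straightforward to evaluate. A small sketch assuming NumPy, a Q array of shape (num_states, num_actions), and an integer array `a_star` giving the expert action in each state (hypothetical names):

```python
import numpy as np

def min_margin(Q, a_star):
    """Compute min_s [ Q(s, a*) - max_{a != a*} Q(s, a) ] for a tabular Q-function."""
    n_states, n_actions = Q.shape
    margins = np.empty(n_states)
    for s in range(n_states):
        others = [Q[s, a] for a in range(n_actions) if a != a_star[s]]
        margins[s] = Q[s, a_star[s]] - max(others)
    return margins.min()
```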
Maximum Margin IRL
Let $\mu_\pi(s, a) = \phi(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, \mu_\pi(s')$. Then $Q^\pi(s, a) = w^T \mu_\pi(s, a)$.
Find $w^*$ that maximizes the margin:
$w^* = \arg\max_w \min_s \Big[ w^T \mu_\pi(s, a^*) - \max_{a \ne a^*} w^T \mu_\pi(s, a) \Big]$
s.t. $\mu_\pi(s, a) = \mu_E(s, a)\ \forall s, a$
Problem: maximizing the margin is somewhat arbitrary since it doesn't allow suboptimal actions to have values that are close to optimal.
Maximum Entropy
Idea: among models that match the expert's average features, select the model with maximum entropy:
$\max_{P(\tau)} H(P(\tau))$  s.t.  $\frac{1}{|data|} \sum_{\tau \in data} \phi(\tau) = E_P[\phi(\tau)]$

Trajectory: $\tau = (s_1, a_1, s_2, a_2, \ldots, s_T, a_T)$
Trajectory feature vector: $\phi(\tau) = \sum_t \gamma^t \phi(s_t, a_t)$
Trajectory cumulative reward: $R(\tau) = w^T \phi(\tau) = \sum_t \gamma^t w^T \phi(s_t, a_t)$
Probability of a trajectory: $P_w(\tau) = \dfrac{e^{R(\tau)}}{\sum_{\tau'} e^{R(\tau')}} = \dfrac{e^{w^T \phi(\tau)}}{\sum_{\tau'} e^{w^T \phi(\tau')}}$
Entropy: $H(P(\tau)) = -\sum_\tau P(\tau) \log P(\tau)$
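When the trajectory set is small enough to enumerate, $P_w(\tau)$ is just a softmax over trajectory rewards. A sketch assuming NumPy and a matrix `Phi` whose rows are the trajectory feature vectors $\phi(\tau)$ (an illustrative setup only — in realistic problems the trajectory space cannot be enumerated):

```python
import numpy as np

def trajectory_distribution(w, Phi):
    """P_w(tau) proportional to exp(w^T phi(tau)), over the rows of Phi."""
    rewards = Phi @ w                  # R(tau) = w^T phi(tau) for each trajectory
    rewards -= rewards.max()           # shift for numerical stability
    expo = np.exp(rewards)
    return expo / expo.sum()           # normalize: softmax over trajectories
```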
Maximum Likelihood ≡ Maximum Entropy
$\max_{P(\tau)} H(P(\tau))$  s.t.  $\frac{1}{|data|} \sum_{\tau \in data} \phi(\tau) = E_P[\phi(\tau)]$
Dual objective: this is equivalent to maximizing the log likelihood of the trajectories under the constraint that $P(\tau)$ takes an exponential form:
$\max_w \sum_{\tau \in data} \log P_w(\tau)$  s.t.  $P_w(\tau) \propto e^{R_w(\tau)}$
Maximum Log Likelihood (LL)
$w^* = \arg\max_w \frac{1}{|data|} \sum_{\tau \in data} \log P_w(\tau)$
$\quad\;\; = \arg\max_w \frac{1}{|data|} \sum_{\tau \in data} \log \dfrac{e^{w^T \phi(\tau)}}{\sum_{\tau'} e^{w^T \phi(\tau')}}$
$\quad\;\; = \arg\max_w \frac{1}{|data|} \sum_{\tau \in data} w^T \phi(\tau) - \log \sum_{\tau'} e^{w^T \phi(\tau')}$

Gradient:
$\nabla_w LL = \frac{1}{|data|} \sum_{\tau \in data} \phi(\tau) - \sum_{\tau''} \dfrac{e^{w^T \phi(\tau'')}}{\sum_{\tau'} e^{w^T \phi(\tau')}}\, \phi(\tau'')$
$\quad\quad\;\; = \frac{1}{|data|} \sum_{\tau \in data} \phi(\tau) - \sum_{\tau''} P_w(\tau'')\, \phi(\tau'')$
$\quad\quad\;\; = E_{data}[\phi(\tau)] - E_{P_w}[\phi(\tau)]$
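With the same enumerable-trajectory setup as above (rows of a hypothetical `Phi` are $\phi(\tau)$, and `expert_idx` indexes the rows corresponding to expert demonstrations), the gradient $E_{data}[\phi(\tau)] - E_{P_w}[\phi(\tau)]$ could be computed as in this sketch:

```python
import numpy as np

def loglik_gradient(w, Phi, expert_idx):
    """Gradient of the average log likelihood: E_data[phi] - E_{P_w}[phi]."""
    # Model expectation under P_w(tau), the softmax over trajectory rewards
    rewards = Phi @ w
    p = np.exp(rewards - rewards.max())
    p /= p.sum()
    model_expectation = p @ Phi                 # sum_tau P_w(tau) phi(tau)
    # Empirical expectation over the expert trajectories
    data_expectation = Phi[expert_idx].mean(axis=0)
    return data_expectation - model_expectation
```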
Gradient estimation
Computing $E_{P_w}[\phi(\tau)]$ exactly is intractable due to the exponential number of trajectories. Instead, approximate it by sampling:
$E_{P_w}[\phi(\tau)] \approx \frac{1}{N} \sum_{\tau \sim P_w(\tau)} \phi(\tau)$
Importance sampling: since we don't have a simple way of sampling $\tau$ from $P_w(\tau)$, sample $\tau$ from a base distribution $q(\tau)$ and then reweight $\phi$ by $P_w(\tau)/q(\tau)$:
$E_{P_w}[\phi(\tau)] \approx \frac{1}{N} \sum_{\tau \sim q(\tau)} \dfrac{P_w(\tau)}{q(\tau)}\, \phi(\tau)$
We can choose $q(\tau)$ to be a) uniform, b) close to the demonstration distribution, or c) close to $P_w(\tau)$.
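A sketch of the importance-sampled estimate, assuming NumPy, trajectory features `phi_samples` for N trajectories drawn from q, and their log densities `log_q`. Because the normalizer of $P_w$ is intractable, the sketch builds self-normalized weights from the unnormalized $e^{w^T\phi(\tau)}$ — one common practical choice, not something stated on the slide:

```python
import numpy as np

def model_feature_expectation(w, phi_samples, log_q):
    """Self-normalized importance-sampling estimate of E_{P_w}[phi(tau)].

    phi_samples : array of shape (N, k), phi(tau) for N trajectories sampled from q
    log_q       : array of shape (N,), log q(tau) for the same trajectories
    """
    log_p_unnorm = phi_samples @ w              # log of unnormalized P_w(tau)
    log_weights = log_p_unnorm - log_q          # log importance ratios (up to a constant)
    log_weights -= log_weights.max()            # numerical stability
    weights = np.exp(log_weights)
    weights /= weights.sum()                    # self-normalize
    return weights @ phi_samples
```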
Maximum Entropy IRL Pseudocode
Assumption: linear rewards $R_w(s, a) = w^T \phi(s, a)$
Input: expert trajectories $\tau_n \sim \pi_{expert}$ where $\tau_n = (s_1, a_1, s_2, a_2, \ldots)$
Initialize weights $w$ at random
Repeat until stopping criterion:
  Expert feature expectation: $E_{expert}[\phi(\tau)] = \frac{1}{|data|} \sum_{\tau_n \in data} \phi(\tau_n)$
  Model feature expectation:
    Sample $N$ trajectories: $\tau \sim q(\tau)$
    $E_{P_w}[\phi(\tau)] \approx \frac{1}{N} \sum_\tau \dfrac{P_w(\tau)}{q(\tau)}\, \phi(\tau)$
  Gradient: $\nabla_w LL = E_{expert}[\phi(\tau)] - E_{P_w}[\phi(\tau)]$
  Update model: $w \leftarrow w + \alpha \nabla_w LL$
Return $w$
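Putting the loop together for the linear case, a hedged sketch assuming NumPy, a feature dimension `k`, expert trajectory features stacked in `Phi_expert`, and a hypothetical sampler `sample_from_q(n)` returning (trajectory features, log q densities):

```python
import numpy as np

def maxent_irl(Phi_expert, sample_from_q, k, alpha=0.01, iters=500, n_samples=100):
    """Maximum entropy IRL with linear rewards R_w(s, a) = w^T phi(s, a)."""
    w = 0.01 * np.random.randn(k)                          # initialize weights at random
    expert_expectation = Phi_expert.mean(axis=0)           # E_expert[phi(tau)]
    for _ in range(iters):
        Phi_q, log_q = sample_from_q(n_samples)            # trajectories from base distribution q
        # Self-normalized importance weights toward P_w(tau) ∝ exp(w^T phi(tau))
        log_wts = Phi_q @ w - log_q
        wts = np.exp(log_wts - log_wts.max())
        wts /= wts.sum()
        model_expectation = wts @ Phi_q                    # estimate of E_{P_w}[phi(tau)]
        grad = expert_expectation - model_expectation      # gradient of the log likelihood
        w += alpha * grad                                  # gradient ascent update
    return w
```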
Non-Linear Rewards
Suppose rewards are non-linear in $\phi$, e.g., $R_w(s, a) = \text{neuralNet}_w(s, a)$
Then $R_w(\tau) = \sum_t \gamma^t R_w(s_t, a_t)$
Likelihood: $LL(w) = \frac{1}{|data|} \sum_{\tau \in data} R_w(\tau) - \log \sum_{\tau'} e^{R_w(\tau')}$
Gradient: $\nabla_w LL = E_{data}[\nabla_w R_w(\tau)] - E_{P_w}[\nabla_w R_w(\tau)]$
Maximum Entropy IRL Pseudocode
General case: non-linear rewards $R_w(s, a)$
Input: expert trajectories $\tau_n \sim \pi_{expert}$ where $\tau_n = (s_1, a_1, s_2, a_2, \ldots)$
Initialize weights $w$ at random
Repeat until stopping criterion:
  Expert feature expectation: $E_{expert}[\nabla_w R_w(\tau)] = \frac{1}{|data|} \sum_{\tau_n \in data} \nabla_w R_w(\tau_n)$
  Model feature expectation:
    Sample $N$ trajectories: $\tau \sim q(\tau)$
    $E_{P_w}[\nabla_w R_w(\tau)] \approx \frac{1}{N} \sum_\tau \dfrac{P_w(\tau)}{q(\tau)}\, \nabla_w R_w(\tau)$
  Gradient: $\nabla_w LL = E_{expert}[\nabla_w R_w(\tau)] - E_{P_w}[\nabla_w R_w(\tau)]$
  Update model: $w \leftarrow w + \alpha \nabla_w LL$
Return $w$
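For the non-linear case, autodiff libraries make the gradient $E_{data}[\nabla_w R_w(\tau)] - E_{P_w}[\nabla_w R_w(\tau)]$ easy to obtain by differentiating a surrogate objective. A sketch assuming PyTorch (an assumption; any autodiff library works), a `reward_net` mapping per-step feature vectors to scalar rewards, trajectories stored as (T, k) tensors, and sampled trajectories paired with their log q densities — all names are illustrative:

```python
import torch

def trajectory_reward(reward_net, traj, gamma):
    """R_w(tau) = sum_t gamma^t R_w(s_t, a_t); traj is a (T, k) tensor of step features."""
    discounts = gamma ** torch.arange(traj.shape[0], dtype=traj.dtype)
    return (discounts * reward_net(traj).squeeze(-1)).sum()

def maxent_irl_step(reward_net, optimizer, expert_trajs, sampled_trajs, log_q, gamma):
    """One ascent step on LL(w); the surrogate's gradient equals
    E_data[grad R_w(tau)] - E_{P_w}[grad R_w(tau)]."""
    expert_R = torch.stack([trajectory_reward(reward_net, t, gamma) for t in expert_trajs])
    sample_R = torch.stack([trajectory_reward(reward_net, t, gamma) for t in sampled_trajs])
    # Self-normalized importance weights toward P_w(tau) ∝ exp(R_w(tau)),
    # detached so they act as constants when differentiating.
    wts = torch.softmax((sample_R - log_q).detach(), dim=0)
    loss = -(expert_R.mean() - (wts * sample_R).sum())     # negative surrogate of LL
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```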
Policy Computation
Two choices:
1) Optimize a policy based on $R_w(s, a)$ with your favorite RL algorithm
2) Compute the policy induced by $P_w(\tau)$

Induced policy: probability of choosing $a$ after $s$ in trajectories.
Let $(s, a, \tau)$ be a trajectory that starts with $s, a$ and then continues with the state-action pairs of $\tau$:
$\pi_w(a \mid s) = P_w(a \mid s) = \dfrac{\sum_\tau P_w(s, a, \tau)}{\sum_{a', \tau'} P_w(s, a', \tau')} = \dfrac{\sum_\tau e^{R_w(s, a) + R_w(\tau)}}{\sum_{a', \tau'} e^{R_w(s, a') + R_w(\tau')}}$
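A sketch of the induced policy for a setting small enough that the continuations of each (s, a) can be enumerated, assuming a hypothetical mapping `continuations[(s, a)]` holding the cumulative rewards $R_w(\tau)$ of trajectories that follow (s, a), and a per-step reward function `reward_fn`. These are illustrative names; with large trajectory spaces the sums would themselves have to be approximated by sampling:

```python
import numpy as np

def induced_policy(s, actions, reward_fn, continuations):
    """pi_w(a|s) proportional to sum_tau exp(R_w(s, a) + R_w(tau))."""
    scores = np.array([
        np.exp(reward_fn(s, a) + np.asarray(continuations[(s, a)])).sum()
        for a in actions
    ])
    return scores / scores.sum()       # normalize over the candidate actions
```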