
Page 1: Reinforcement Learning Dealing with Complexity and Safety in RL

Reinforcement Learning

Dealing with Complexity and Safety in RL

Subramanian Ramamoorthy
School of Informatics

27 March, 2012

Page 2: Reinforcement Learning Dealing with Complexity and Safety in RL

(Why) Isn’t RL Deployed More Widely?

Very interesting discussion at: http://umichrl.pbworks.com/w/page/7597585/Myths%20of%20Reinforcement%20Learning, maintained by Satinder Singh

• Negative views/myths: RL is hard because of dimensionality, partial observability, the need for function approximation, and so on.

• Positive view: There is no getting away from the fact that RL is the proper statement of the “agent’s problem”. So, the question is really one of how to solve it!


Page 3: Reinforcement Learning Dealing with Complexity and Safety in RL

A Provocative Claim

“The (PO)MDP frameworks are fundamentally broken, not because they are insufficiently powerful representations, but because they are too powerful. We submit that, rather than generalizing these models, we should be specializing them if we want to make progress on solving real problems in the real world.”

T. Lane, W.D. Smart, Why (PO)MDPs Lose for Spatial Tasks and What to Do About It, ICML Workshop on Rich Representations for RL, 2005.


Page 4: Reinforcement Learning Dealing with Complexity and Safety in RL

What is the Issue? (Lane et al.)

• In our efforts to formalize the notion of “learning control”, we have striven to construct ever more general and, putatively, powerful models. By the mid-1990s we had (with a little bit of blatant “borrowing” from the Operations Research community) arrived at the (PO)MDP formalism (Puterman, 1994) and grounded our RL methods in it (Sutton & Barto, 1998; Kaelbling et al., 1996; Kaelbling et al., 1998).

• These models are mathematically elegant, have enabled precise descriptions and analysis of a wide array of RL algorithms, and are incredibly general. We argue, however, that their very generality is a hindrance in many practical cases.

• In their generality, these models have discarded the very qualities — metric, topology, scale, etc. — that have proven to be so valuable for many, many science and engineering disciplines.


Page 5: Reinforcement Learning Dealing with Complexity and Safety in RL

What is Missing in POMDPs?

• POMDPs do not describe natural metrics in the environment
  – When driving, we know both global and local distances

• POMDPs do not natively recognize differences between scales
  – Uncertainty in control is entirely different from uncertainty in routing

• POMDPs conflate properties of the environment with properties of the agent
  – Roads and buildings behave differently from cars and pedestrians: we need to generalize over them differently

• POMDPs are defined in a global coordinate frame, often discrete!
  – We may need many different representations in real problems


Page 6: Reinforcement Learning Dealing with Complexity and Safety in RL

Specific Insight #1

The metric of a space imposes a "speed limit" on the agent: the agent cannot transition to arbitrary points in the environment in a single step.

Consequences:

• Agent can neglect large parts of the state space when planning.

• More importantly, however, this result implies that control experience can be generalized across regions of the state space.
  – If the agent learns a good policy for one bounded region of the state space and can find a second bounded region that is homeomorphic to the first, the policy can be transferred to the second region (see the sketch below).
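A minimal illustration of the transfer idea, assuming the two regions are related by a known coordinate map (here just a translation on a gridworld); the function names and the map are hypothetical, not from Lane and Smart:

```python
# Hypothetical sketch: reuse a greedy policy learned on one bounded region of
# a gridworld in a second region that is homeomorphic to it (here the
# homeomorphism is simply a translation by a fixed offset).

def transfer_policy(source_policy, offset):
    """Map (state -> action) pairs through the coordinate change."""
    dx, dy = offset
    return {(x + dx, y + dy): a for (x, y), a in source_policy.items()}

# Policy learned in a 3x3 patch near the origin: always move "up".
policy_A = {(x, y): "up" for x in range(3) for y in range(3)}

# Reuse it, without further learning, in the congruent patch shifted to (10, 10).
policy_B = transfer_policy(policy_A, offset=(10, 10))
print(policy_B[(11, 12)])  # -> "up"
```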


Metric envelope bound for point-to-point navigation in an open-space gridworld environment. The outer region is the elliptical envelope that contains 90% of the trajectory probability mass. The inner, darker region is the set of states occupied by an agent in a total of 10,000 steps of experience (319 trajectories from bottom to top).

Page 7: Reinforcement Learning Dealing with Complexity and Safety in RL

Insight #2: Manifold Representations

• Informally, a manifold representation models the domain of the value function using a set of overlapping local regions, called charts.

• Each chart has a local coordinate frame, is a (topological) disk, and has a (local) Euclidean distance metric. The collection of charts and their overlap regions is called a manifold.

• We can embed partial value functions (and other models) on these charts, and combine them, using the theory of manifolds, to provide a global value function (or model).
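As a rough sketch of the chart idea (my own minimal construction, not the authors'): each chart keeps a local value estimate in its own coordinate frame, and overlapping charts are blended into a single global value function.

```python
import numpy as np

# Minimal sketch of a manifold-style value function: overlapping 1-D charts,
# each with its own local coordinate frame and local value model, blended by
# distance-based weights in the overlap regions. Illustrative only.

class Chart:
    def __init__(self, center, radius, local_value_fn):
        self.center = center               # chart origin in global coordinates
        self.radius = radius               # chart is a (topological) disk
        self.local_value = local_value_fn  # value model in local coordinates

    def contains(self, x):
        return abs(x - self.center) < self.radius

    def value(self, x):
        return self.local_value(x - self.center)  # evaluate in the local frame

    def weight(self, x):
        # Weight falls to zero at the chart boundary (partition-of-unity style).
        return max(0.0, 1.0 - abs(x - self.center) / self.radius)

def global_value(charts, x):
    """Blend the local value functions of all charts covering x
    (assumes x lies inside at least one chart)."""
    active = [c for c in charts if c.contains(x)]
    w = np.array([c.weight(x) for c in active])
    v = np.array([c.value(x) for c in active])
    return float(np.dot(w, v) / w.sum())

charts = [Chart(0.0, 1.5, lambda u: -abs(u)),
          Chart(2.0, 1.5, lambda u: 1.0 - u**2)]
print(global_value(charts, 1.0))  # point covered by both charts
```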


13 equivalence classes; if you consider rotational symmetry, only 4 classes.

Page 8: Reinforcement Learning Dealing with Complexity and Safety in RL

What Makes Some POMDP Problems Easy to Approximate?

David Hsu, Wee Sun Lee, Nan Rong, NIPS 2007


Page 9: Reinforcement Learning Dealing with Complexity and Safety in RL

Understanding Why PBVI Works

• Point-based algorithms have been surprisingly successful in computing approximately optimal solutions for POMDPs.

• What are the belief-space properties that allow some POMDP problems to be approximated efficiently, explaining the point-based algorithms’ success?


Page 10: Reinforcement Learning Dealing with Complexity and Safety in RL

Hardness of POMDPs

• Intractable in general, due to the curse of dimensionality
• The belief space grows exponentially with the size of the state space, |S|

• But, in recent years, good progress has been made in sampling the belief space and approximating solutions

• Hsu et al. point to POMDPs with hundreds of states being solved approximately in seconds
  – Tag problem: a robot must search for and tag a moving target (whose position is unobserved except when the robot bumps into it); the belief space is ~870-dimensional
  – Solved using point-based (PBVI-style) methods in under a minute


Page 11: Reinforcement Learning Dealing with Complexity and Safety in RL

Initial Observation

• Many point-based algorithms explore only a subset of the belief space: the reachable space $R(b_0)$

• The reachable space contains all belief points reachable from a given initial belief $b_0$ under arbitrary sequences of actions and observations (a belief-update sketch follows below)
  – Is the reason for PBVI's success that the reachable space is small?
  – Not always: Tag has an approximately 860-dimensional reachable space.
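A minimal sketch of how the reachable space is explored in practice: starting from $b_0$, sample action/observation sequences and apply the standard Bayesian belief update. The transition and observation matrices below are hypothetical toy inputs, not from Hsu et al.

```python
import numpy as np

# Sketch: collect belief points reachable from b0 by applying the standard
# POMDP belief update  b'(s') ∝ O(s', a, o) * sum_s T(s, a, s') * b(s)
# for randomly sampled action/observation pairs. Toy matrices, for illustration.

rng = np.random.default_rng(0)
n_states, n_actions, n_obs = 4, 2, 3
T = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # T[a, s, s']
O = rng.dirichlet(np.ones(n_obs), size=(n_actions, n_states))     # O[a, s', o]

def belief_update(b, a, o):
    b_pred = b @ T[a]              # predicted next-state distribution
    b_new = b_pred * O[a][:, o]    # weight by observation likelihood
    return b_new / b_new.sum() if b_new.sum() > 0 else None

def sample_reachable(b0, depth=20, n_rollouts=50):
    points = [b0]
    for _ in range(n_rollouts):
        b = b0
        for _ in range(depth):
            a = rng.integers(n_actions)
            o = rng.integers(n_obs)
            b_next = belief_update(b, a, o)
            if b_next is None:     # observation impossible from this belief
                break
            b = b_next
            points.append(b)
    return np.array(points)

b0 = np.full(n_states, 1.0 / n_states)
print(sample_reachable(b0).shape)  # sampled approximation of R(b0)
```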


Page 12: Reinforcement Learning Dealing with Complexity and Safety in RL

Covering Number

• The covering number of a space is the minimum number of balls of a given size needed to cover the space completely (a greedy covering estimate is sketched below)

• Hsu et al. show that an approximately optimal POMDP solution can be computed in time polynomial in the covering number of R(b0)

• The covering number also reveals that the belief space for Tag behaves more like a union of 29-dimensional subspaces than like a single 870-dimensional space, because the robot's own position is fully observed
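A minimal sketch of estimating a covering number for a sampled set of belief points by greedy ball covering; the radius δ and the sampled points are arbitrary choices of mine, illustrating the concept rather than Hsu et al.'s analysis.

```python
import numpy as np

# Greedy estimate of the covering number of a sampled point set: repeatedly
# pick an uncovered point, open a ball of radius delta around it, and mark
# everything inside as covered. This gives an upper bound on the true value.

def covering_number(points, delta):
    points = np.asarray(points)
    uncovered = np.ones(len(points), dtype=bool)
    n_balls = 0
    while uncovered.any():
        center = points[np.flatnonzero(uncovered)[0]]
        dists = np.linalg.norm(points - center, ord=1, axis=1)  # L1 metric on beliefs
        uncovered &= dists > delta
        n_balls += 1
    return n_balls

# Example: beliefs sampled from a 3-state simplex.
rng = np.random.default_rng(1)
beliefs = rng.dirichlet(np.ones(3), size=500)
print(covering_number(beliefs, delta=0.2))
```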


Page 13: Reinforcement Learning Dealing with Complexity and Safety in RL

Further Questions

• Is it possible to compute an approximate solution efficiently under the weaker condition of a small covering number for the optimal reachable space R*(b0), which contains only the points in B reachable from b0 under an optimal policy?

• Unfortunately, this problem is NP-hard, and it remains NP-hard even if the optimal policies have a compact piecewise-linear representation using α-vectors.

• However, given a suitable set of points that "cover" R*(b0) well, a good approximate solution can be computed in polynomial time.

• Using sampling to approximate the optimal reachable space, rather than just the reachable space, may be a promising approach in practice.


Page 14: Reinforcement Learning Dealing with Complexity and Safety in RL

Lyapunov Design for Safe Reinforcement Learning

Theodore J. Perkins and Andrew G. Barto, JMLR 2002


Page 15: Reinforcement Learning Dealing with Complexity and Safety in RL

Dynamical Systems

• Dynamical systems can be described by states and evolution of states over time

• The evolution of states is constrained by dynamics of the system

• In other words, a discrete-time dynamical system is a mapping from the current state to the next state

• If the mapping is a contraction, the state eventually converges to a unique fixed point (see the sketch below)
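A tiny numerical illustration (my own toy example, not from the slides): $f(x) = 0.5x + 1$ is a contraction with factor $0.5$, so iterating it from any start state converges to the unique fixed point $x^* = 2$.

```python
# Toy contraction mapping: f(x) = 0.5*x + 1 has Lipschitz constant 0.5 < 1,
# so iterating it from any start state converges to the unique fixed point
# x* satisfying x* = 0.5*x* + 1, i.e. x* = 2.

def f(x):
    return 0.5 * x + 1.0

x = 40.0                      # arbitrary initial state
for _ in range(30):
    x = f(x)
print(x)                      # ~2.0, the fixed point
```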


Page 16: Reinforcement Learning Dealing with Complexity and Safety in RL

Reinforcement Learning – Traditional Methods

• The target or goal state may not be a natural attractor
• Hypothesis: learning is easier if the target is a fixed point, e.g., TD-Gammon
• People have tried to embed domain knowledge in various ways:
  – Known good actions are specified
  – Sub-goals are explicitly specified


Page 17: Reinforcement Learning Dealing with Complexity and Safety in RL

Key Idea

• Use Lyapunov functions to constrain action selection
• This forces the RL agent to make progress towards the goal
• E.g., in a gridworld, a Lyapunov-constrained agent reaches the goal within a bounded, finite number of steps (a minimal sketch follows)
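A minimal gridworld sketch of the idea, using the Manhattan distance to the goal as the Lyapunov function and restricting the agent (here acting randomly, standing in for any RL learner) to actions that strictly decrease it; the environment and the particular Lyapunov function are my own illustration, not Perkins and Barto's construction.

```python
import random

# Gridworld sketch of Lyapunov-constrained action selection: the agent may only
# pick actions that strictly decrease L(s) = Manhattan distance to the goal, so
# every episode reaches the goal in at most L(s0) steps, regardless of how the
# (learning) agent chooses among the permitted actions.

GOAL = (4, 4)
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def L(state):
    return abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1])

def allowed_actions(state):
    return [a for a, (dx, dy) in ACTIONS.items()
            if L((state[0] + dx, state[1] + dy)) < L(state)]

state, steps = (0, 0), 0
while state != GOAL:
    a = random.choice(allowed_actions(state))   # stand-in for the RL policy
    dx, dy = ACTIONS[a]
    state = (state[0] + dx, state[1] + dy)
    steps += 1
print(steps)  # always L((0, 0)) = 8 here, since every permitted move descends L by 1
```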


Page 18: Reinforcement Learning Dealing with Complexity and Safety in RL

Problem Setup

• Deterministic dynamical system
• Evolution according to an MDP with the following elements:
  – State set $S$
  – Goal set $G \subset S$; non-goal states $S \setminus G$
  – In each non-goal state $s$, there are $k$ possible actions $a_1(s), \dots, a_k(s)$
  – Action $a_j$ incurs a non-negative cost $c_j(s)$ and leads to successor state $s_j(s)$
  – A policy is a mapping $\pi : S \setminus G \to \{1, \dots, k\}$, selecting an action based on the current state
  – Value of policy $\pi$ for start state $s_0$: $V^{\pi}(s_0) = \sum_{i \ge 0} \gamma^i \, c_{\pi(s_i)}(s_i)$, where $\gamma \in (0, 1]$ is a discount factor
  – Objective: find $\pi^* = \arg\min_{\pi} V^{\pi}(s_0)$
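A minimal code sketch of this setup on a toy deterministic chain MDP (the states, costs, and policy below are my own illustrative choices, not from Perkins and Barto): the value of a policy is simply the discounted sum of the costs it incurs until the goal set is entered.

```python
# Toy deterministic goal MDP: states 0..5, goal G = {5}. Each action j moves the
# agent deterministically and incurs a non-negative cost c_j(s). The value of a
# policy from s0 is the discounted sum of costs until a goal state is entered.

GOAL = {5}
# successor[s][j] and cost[s][j] for two actions j in {0, 1} at each non-goal state.
successor = {s: {0: s + 1, 1: max(0, s - 1)} for s in range(5)}
cost = {s: {0: 1.0, 1: 0.5} for s in range(5)}

def policy_value(policy, s0, gamma=0.95, max_steps=1000):
    """V^pi(s0) = sum_i gamma^i * c_{pi(s_i)}(s_i), summed until the goal is reached."""
    s, value, discount = s0, 0.0, 1.0
    for _ in range(max_steps):
        if s in GOAL:
            break
        j = policy(s)
        value += discount * cost[s][j]
        s = successor[s][j]
        discount *= gamma
    return value

always_forward = lambda s: 0          # action 0 always moves towards the goal
print(policy_value(always_forward, s0=0))
```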


Page 19: Reinforcement Learning Dealing with Complexity and Safety in RL

Lyapunov Functions

• Generalized energy functions

Definition: Given a deterministic MDP, a Lyapunov function is a function $L : S \to \mathbb{R}$ such that
  – $L(s) \ge 0$ for all $s \notin G$, and
  – for every $s \notin G$ there is an action $a_j$ such that either $s_j(s) \in G$ or $L(s) - L(s_j(s)) \ge \Delta$, for some fixed $\Delta > 0$.
If the second condition holds for all actions, then the MDP is descending on $L$.

Theorem: If a deterministic MDP is descending on a Lyapunov function $L$, then any trajectory starting at $s_0 \notin G$ must enter $G$ in no more than $L(s_0)/\Delta$ time steps.
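A minimal check of the theorem on a toy chain (my own example, not from the paper): with $L(s)$ equal to the distance to the goal and every step descending by $\Delta = 1$, any trajectory enters $G$ within $L(s_0)/\Delta$ steps.

```python
# Toy check of the descent theorem: states 0..10, goal G = {0}, and the only
# available action moves one step towards the goal. With L(s) = s we have
# L(s) - L(next(s)) = 1 = Delta on every step, so a trajectory from s0 must
# enter G within L(s0) / Delta = s0 steps.

GOAL = {0}
DELTA = 1.0

def L(s):
    return float(s)            # Lyapunov function: distance to the goal

def step(s):
    return s - 1               # the action descends L by exactly DELTA

s0 = 7
s, steps = s0, 0
while s not in GOAL:
    assert L(s) - L(step(s)) >= DELTA   # descent condition holds at every state
    s = step(s)
    steps += 1

assert steps <= L(s0) / DELTA           # bound guaranteed by the theorem
print(steps, "<=", L(s0) / DELTA)
```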


Page 20: Reinforcement Learning Dealing with Complexity and Safety in RL

Pendulum Problem

• State $(\theta, \dot\theta)$ of a torque-limited pendulum, $\ddot\theta = -\sin\theta + u$, with bounded torque $u \in [-u_{\max}, u_{\max}]$

• Energy: $E(\theta, \dot\theta) = \tfrac{1}{2}\dot\theta^2 + 1 - \cos\theta$

• Goal set defined by the separatrix: $G = \{(\theta, \dot\theta) : E(\theta, \dot\theta) \ge 2.0\}$

• Cost: $c_j(s) = 1.0$ for all actions $j$ and states $s$ (a minimum-time formulation)

• Control laws $a_1, \dots, a_5$: simple torque laws $a_1, a_2, a_3$, and energy-ascent (EA) laws $a_4, a_5$ defined with a $\mathrm{sign}(\cdot)$ switching rule, with $a_5$ including a saturated LQR controller

• Action sets: $A_{\mathrm{const}} = \{a_1, a_2\}$, $A_{EA} = \{a_4, a_5\}$, $A_{\mathrm{All}} = \{a_1, a_2, a_4, a_5\}$

• Lyapunov function: $L(\theta, \dot\theta) = 2.0 - E(\theta, \dot\theta)$; the MDP is descending on this Lyapunov function
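A minimal simulation sketch of the energy argument, under the assumption that the swing-up dynamics take the standard form $\ddot\theta = -\sin\theta + u$ with $E = \tfrac{1}{2}\dot\theta^2 + 1 - \cos\theta$; the constants and the simple $\mathrm{sign}(\dot\theta)$ torque law are my own placeholders, not the paper's controllers. Pumping torque in the direction of $\dot\theta$ makes $E$ climb towards the separatrix value 2, so $L = 2 - E$ decreases.

```python
import math

# Toy energy-pumping simulation, assuming the standard swing-up form
#   theta_ddot = -sin(theta) + u,  E = 0.5*thetadot^2 + 1 - cos(theta).
# The torque u = u_max * sign(thetadot) gives dE/dt = u * thetadot >= 0,
# so E climbs towards the separatrix value 2 (and L = 2 - E falls).

u_max, dt = 0.5, 0.01
theta, thetadot = 0.1, 0.0          # start near the hanging-down state

def energy(theta, thetadot):
    return 0.5 * thetadot**2 + 1.0 - math.cos(theta)

for step in range(200_000):
    if energy(theta, thetadot) >= 2.0:       # entered the goal set G
        break
    u = u_max if thetadot >= 0 else -u_max   # energy-ascent torque
    thetadot += (-math.sin(theta) + u) * dt  # semi-implicit Euler step
    theta += thetadot * dt
print(step, energy(theta, thetadot))
```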


Page 21: Reinforcement Learning Dealing with Complexity and Safety in RL

Results 1

• $A_{EA}$ and $A_{\mathrm{All}}$ had shorter trials than $A_{\mathrm{const}}$

• $A_{EA}$ outperformed $A_{\mathrm{All}}$, especially at fine resolutions of the discretization

• $A_{EA}$ trial times seemed independent of the binning

• $A_{\mathrm{const}}$ alone never worked

Note: the theorem guarantees that $A_{EA}$ monotonically increases the energy.


Page 22: Reinforcement Learning Dealing with Complexity and Safety in RL

Results 2

Curves: 1: $A_{EA}$, $G_2$; 2: $A_{\mathrm{All}}$, $G_2$; 3: $A_{\mathrm{const}}$, $G_2$; 4: $A_{\mathrm{All}}$ + saturated LQR, $G_1$


Page 23: Reinforcement Learning Dealing with Complexity and Safety in RL

Stochastic Case

• Stochastic pendulum dynamics:
  $d\theta = \dot\theta\,dt$
  $d\dot\theta = (-\sin\theta + u)\,dt + 0.1\,dW$
  where $W$ is a Wiener process (Gaussian white noise)

• Unit cost is incurred outside $S_{\mathrm{up}}$, a region around the upright state (defined with a 0.5 threshold)
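A minimal Euler-Maruyama sketch of simulating these stochastic dynamics; the torque law and constants are my own placeholders, under the same sign convention assumed for the deterministic slide above.

```python
import math
import random

# Euler-Maruyama simulation of the noisy pendulum
#   d(theta)    = thetadot * dt
#   d(thetadot) = (-sin(theta) + u) * dt + 0.1 * dW,  W a Wiener process.
# The control law here is just the same energy-pumping torque as before,
# standing in for whatever policy the RL agent has learned.

dt, sigma, u_max = 0.01, 0.1, 0.5
theta, thetadot = 0.1, 0.0

for _ in range(5_000):
    u = u_max if thetadot >= 0 else -u_max
    dW = random.gauss(0.0, math.sqrt(dt))           # Wiener increment
    thetadot += (-math.sin(theta) + u) * dt + sigma * dW
    theta += thetadot * dt

print(theta, thetadot)
```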


Page 24: Reinforcement Learning Dealing with Complexity and Safety in RL

Results – Stochastic Case


Page 25: Reinforcement Learning Dealing with Complexity and Safety in RL

Some Open Questions

• How can you improve performance using less sophisticated ‘primitive’ actions?

• Perkins and Barto use deep physical intuition to design the local control laws, e.g., to avoid undesired gravity-control equilibria. How do we do this when the dynamics are less well understood?

• Stochastic cases have rather weak guarantees. How can they be improved?
