
Page 1

Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs

István Szita & András Lőrincz
University of Alberta, Canada / Eötvös Loránd University, Hungary

Page 2

Szita & Lőrincz: Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs 2/31

Outline

  • Factored MDPs: motivation, definitions, planning in FMDPs
  • Optimism
  • Optimism & FMDPs & model-based learning

Page 3

Reinforcement learning

  • the agent makes decisions… in an unknown world
  • makes some observations (including rewards)
  • tries to maximize the collected reward

Page 4

What kind of observation?

  • observations are structured
  • but the structure is unclear

Page 5

How to “solve an RL task”?

  • a model is useful:
      - can reuse experience from previous trials
      - can learn offline
  • observations are structured, but the structure is unknown
  • structured + model + RL = FMDP! (or linear dynamical systems, neural networks, etc.)

Page 6

Factored MDPs

  • ordinary MDPs, but everything is factored:
      - states
      - rewards
      - transition probabilities
      - (value functions)

Page 7

Factored state space

  • the state is a vector of variables, x = (x1, …, xm)
  • all functions depend on a few variables only

Page 8

Factored dynamics
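The figure on this slide did not survive the transcript. As a sketch of factored (DBN-style) dynamics: the next value of each state variable depends only on a small set of parent variables, so a transition can be sampled one component at a time. The parent sets and the probability rule below are hypothetical stand-ins for the slide's lost tables.

```python
import random

# Hypothetical parent sets: x1' depends on (x1, x3); x2' on (x2); x3' on (x3).
# Indices are 0-based.
PARENTS = {0: (0, 2), 1: (1,), 2: (2,)}

def p_next_is_one(i, parent_values, action):
    """Toy conditional probability P(x_i' = 1 | parents, action) -- a stand-in
    for the (lost) per-component probability tables."""
    return 0.8 if action == "a1" and sum(parent_values) > 0 else 0.1

def step(state, action, rng):
    """Sample the next state componentwise: the full transition probability
    P(x' | x, a) factorizes into a product over the components."""
    nxt = []
    for i in range(len(state)):
        pv = tuple(state[j] for j in PARENTS[i])
        nxt.append(1 if rng.random() < p_next_is_one(i, pv, action) else 0)
    return tuple(nxt)
```

For example, `step((1, 0, 1), "a1", random.Random(0))` samples a new 3-bit state; each bit consults only its own parents.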

Page 9

Factored rewards

Page 10

(Factored value functions)

  • V* is not factored in general
  • we will make an approximation error

Page 11

Solving a known FMDP

  • NP-hard: either exponential-time or non-optimal…
  • exponential-time worst case:
      - flattening the FMDP
      - approximate policy iteration [Koller & Parr, 2000; Boutilier, Dearden & Goldszmidt, 2000]
  • non-optimal solution (approximating the value function in a factored form):
      - approximate linear programming [Guestrin, Koller, Parr & Venkataraman, 2002]
      - ALP + policy iteration [Guestrin et al., 2002]
      - factored value iteration [Szita & Lőrincz, 2008]

Page 12

Factored value iteration

  • H := matrix of basis functions
  • N(H^T) := row-normalization of H^T
  • the iteration w_{t+1} = N(H^T) T(H w_t) (Bellman update, then projection) converges to a fixed point w̃
  • each step can be computed quickly for FMDPs
  • let Ṽ := H w̃; then Ṽ has bounded error (its distance from the optimal value function is bounded)
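The slide's formulas are only partially recoverable, so here is a hedged sketch of the update w ← N(H^T) T(H w), demonstrated on a tiny *flat* MDP for clarity. In a real FMDP both the projection and the Bellman update exploit the factored structure; below they are computed naively, and all matrices are illustrative.

```python
import numpy as np

def row_normalize(M):
    """N(M): divide each row by its L1 norm; this makes the projection a
    nonexpansion, which is the key to convergence."""
    return M / np.abs(M).sum(axis=1, keepdims=True)

def factored_value_iteration(H, P, R, gamma=0.9, iters=300):
    """w <- N(H^T) T(H w): Bellman optimality update followed by a
    normalized linear projection onto the span of the basis functions.
    H: |S| x k basis matrix, P: |A| x |S| x |S|, R: |A| x |S|."""
    G = row_normalize(H.T)                       # k x |S| projection
    w = np.zeros(H.shape[1])
    for _ in range(iters):
        V = H @ w                                # current value estimate
        TV = np.max(R + gamma * (P @ V), axis=0) # Bellman update T(Hw)
        w = G @ TV                               # project back to weights
    return w

# With H = I the projection is exact and FVI reduces to plain value
# iteration; state 0 is absorbing with reward 1, so V(0) -> 1/(1-gamma).
H = np.eye(2)
P = np.array([[[1.0, 0.0], [0.0, 1.0]]])  # one action, identity dynamics
R = np.array([[1.0, 0.0]])
V = H @ factored_value_iteration(H, P, R)
```

The choice H = I makes the bounded-error guarantee trivial; with fewer basis functions than states, w̃ is the best the method can do within span(H).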

Page 13

Learning in unknown FMDPs

  • unknown factor decompositions (structure)
  • unknown rewards
  • unknown transitions (dynamics)


Page 15

Outline

  • Factored MDPs: motivation, definitions, planning in FMDPs
  • Optimism
  • Optimism & FMDPs & model-based learning

Page 16

Learning in an unknown FMDP, a.k.a. “Explore or exploit?”

after trying a few action sequences…
  • …try to discover better ones?
  • …or do the best thing according to current knowledge?

Page 17

Be Optimistic!

(when facing uncertainty)

Page 18

either you get experience…

Page 19

or you get reward!

Page 20

Outline

  • Factored MDPs: motivation, definitions, planning in FMDPs
  • Optimism
  • Optimism & FMDPs & model-based learning

Page 21

Factored Initial Model

component x1, parents (x1, x3):

  (x1,x3), action | next=0 | next=1
  (0,0), a1       |   -    |   -
  (0,0), a2       |   -    |   -
  (0,1), a1       |   -    |   -
  (0,1), a2       |   -    |   -
  (1,0), a1       |   -    |   -
  (1,0), a2       |   -    |   -
  (1,1), a1       |   -    |   -
  (1,1), a2       |   -    |   -

component x2, parent (x2):

  (x2), action | next=0 | next=1
  (0), a1      |   -    |   -
  (0), a2      |   -    |   -
  (1), a1      |   -    |   -
  (1), a2      |   -    |   -

Page 22

Factored Optimistic Initial Model

component x1, parents (x1, x3):

  (x1,x3), action | next=0 | next=1 | GOE
  (0,0), a1       |   -    |   -    |  1
  (0,0), a2       |   -    |   -    |  1
  (0,1), a1       |   -    |   -    |  1
  (0,1), a2       |   -    |   -    |  1
  (1,0), a1       |   -    |   -    |  1
  (1,0), a2       |   -    |   -    |  1
  (1,1), a1       |   -    |   -    |  1
  (1,1), a2       |   -    |   -    |  1

component x2, parent (x2):

  (x2), action | next=0 | next=1 | GOE
  (0), a1      |   -    |   -    |  1
  (0), a2      |   -    |   -    |  1
  (1), a1      |   -    |   -    |  1
  (1), a2      |   -    |   -    |  1

“Garden of Eden” (GOE): +$10000 reward (or some other very high value)

Page 23

Later on…

component x1, parents (x1, x3):

  (x1,x3), action | next=0 | next=1 | GOE
  (0,0), a1       |  25    |  30    |  1
  (0,0), a2       |  42    |  12    |  1
  (0,1), a1       |   3    |   1    |  1
  (0,1), a2       |   2    |   5    |  1
  (1,0), a1       |  11    |   9    |  1
  (1,0), a2       |   2    |  29    |  1
  (1,1), a1       |  56    |  63    |  1
  (1,1), a2       |  98    |   -    |  1

component x2, parent (x2):

  (x2), action | next=0 | next=1 | GOE
  (0), a1      |  42    |  34    |  1
  (0), a2      |  25    |  27    |  1
  (1), a1      |   7    |   1    |  1
  (1), a2      |   3    |   6    |  1

according to the initial model, all states have very high value; in frequently visited states the model becomes more realistic → reward expectations get lower → the agent explores other areas
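The counting scheme in the tables above can be sketched as follows. The class name and interface are made up for illustration, but the idea matches the slides: every (parents, action) entry carries one fictitious visit to the Garden-of-Eden outcome, so rarely visited entries keep a large optimistic bonus.

```python
from collections import defaultdict

class ComponentModel:
    """Visit counts for one state-variable component, with a permanent
    pseudo-count of 1 on the 'GOE' (Garden of Eden) outcome."""
    def __init__(self, values=(0, 1)):
        self.values = tuple(values) + ("GOE",)
        self.counts = defaultdict(lambda: dict.fromkeys(self.values, 0))

    def update(self, parent_values, action, next_value):
        self.counts[(parent_values, action)][next_value] += 1

    def prob(self, parent_values, action, next_value):
        c = self.counts[(parent_values, action)]
        extra = 1 if next_value == "GOE" else 0   # fictitious GOE visit
        return (c[next_value] + extra) / (sum(c.values()) + 1)
```

With the counts from the first row of the table (25 visits to 0, 30 to 1 under (0,0), a1), P(GOE) = 1/56, so the optimistic bonus has almost washed out there; an entry that was never visited still has P(GOE) = 1.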

Page 24

Factored optimistic initial model

  • initialize the model (optimistically)
  • for each time step t:
      - solve the approximate model using factored value iteration
      - take the greedy action, observe the next state
      - update the model

the number of non-near-optimal steps (w.r.t. Ṽ, the planner's fixed-point value) is polynomial, with high probability
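The loop on this slide can be sketched as below; `env`, `model`, and `plan` are hypothetical interfaces standing in for the real components (the planner would be factored value iteration on the current optimistic model).

```python
def optimistic_greedy_learning(env, model, plan, horizon):
    """Initialize optimistically, then repeat: plan approximately on the
    current model, act greedily, and update the model with what happened."""
    state = env.reset()
    for t in range(horizon):
        policy = plan(model)       # approximate planning (e.g. FVI)
        action = policy(state)     # greedy action w.r.t. the plan
        next_state, reward = env.step(action)
        model.update(state, action, next_state, reward)
        state = next_state
    return model
```

Note there is no explicit exploration step: greediness plus the optimistic initialization is what drives exploration.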

Page 25

elements of proof: some standard stuff

  • let m_i be the number of visits to a given parent configuration of component i
  • if m_i is large, then the estimated transition probabilities are close to the true ones, for all values y_i
  • more precisely: the estimation error is small with high probability (Hoeffding/Azuma inequality)
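The precise inequality on this slide did not survive the transcript; a standard Hoeffding-style form consistent with the surrounding argument (the exact constants in the paper may differ) is:

```latex
% If the parent configuration of component i has been visited m_i times,
% then for each next value y_i, with probability at least 1 - \delta:
\[
  \bigl|\hat{P}_i(y_i \mid \mathrm{pa}_i(x), a) - P_i(y_i \mid \mathrm{pa}_i(x), a)\bigr|
  \le \sqrt{\frac{\ln(2/\delta)}{2\, m_i}} .
\]
```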

Page 26

elements of proof: main lemma

  • the approximate Bellman updates will be more optimistic than the real ones
  • if the Garden-of-Eden value VE is large enough, the bonus term dominates for a long time
  • if all elements of H are nonnegative, the projection preserves optimism
  • lower bound by Azuma's inequality; the bonus is promised by the Garden of Eden state

Page 27

elements of proof: wrap-up

  • for a long time, V_t is optimistic enough to boost exploration
  • at most polynomially many exploration steps can be made
  • apart from those, the agent must be near-Ṽ-optimal

Page 28

Previous approaches

  • extensions of E3, Rmax, MBIE to FMDPs
  • using the current model, make a smart plan (explore or exploit):
      - explore: make the model more accurate
      - exploit: collect near-optimal reward
  • unspecified planners; requirement: the output plan is close to optimal… e.g., solve the flat MDP
  • polynomial sample complexity, but exponential amounts of computation!

Page 29

Unknown rewards?

“To simplify the presentation, we assume the reward function is known and does not need to be learned. All results can be extended to the case of an unknown reward function.” False!

  • problem: the reward components cannot be observed, only their sum!
  • see the UAI poster [Walsh, Szita, Diuk, Littman, 2009]

Page 30

Unknown structure?

  • can be learnt in polynomial time
  • SLF-Rmax [Strehl, Diuk, Littman, 2007]
  • Met-Rmax [Diuk, Li, Littman, 2009]

Page 31

Take-home message

if your model starts out optimistically enough,

you get efficient exploration for free!

(even if your planner is non-optimal, as long as it is monotonic)

Page 32

Thank you for your attention!

Page 33

Optimistic initial model for FMDPs

  • add a “Garden of Eden” value to each state variable
  • add reward factors for each state variable
  • initialize the transition model
