Reinforcement Learning: Dynamic Programming
Csaba Szepesvári, University of Alberta
Kioloa, MLSS'08
Slides: http://www.cs.ualberta.ca/~szepesva/MLSS08/
Reinforcement Learning
RL = sampling-based methods to solve optimal control problems
Contents
- Defining AI
- Markovian Decision Problems
- Dynamic Programming
- Approximate Dynamic Programming
- Generalizations
(Rich Sutton)
Literature
Books:
- Richard S. Sutton and Andrew G. Barto: Reinforcement Learning: An Introduction. MIT Press, 1998.
- Dimitri P. Bertsekas and John Tsitsiklis: Neuro-Dynamic Programming. Athena Scientific, 1996.
Journals: JMLR, MLJ, JAIR, AI
Conferences: NIPS, ICML, UAI, AAAI, COLT, ECML, IJCAI
Some More Books
- Martin L. Puterman: Markov Decision Processes. Wiley, 1994.
- Dimitri P. Bertsekas: Dynamic Programming and Optimal Control. Athena Scientific. Vol. I (2005), Vol. II (2007).
- James S. Spall: Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Wiley, 2003.
Resources
- RL-Glue: http://rlai.cs.ualberta.ca/RLBB/top.html
- RL-Library: http://rlai.cs.ualberta.ca/RLR/index.html
- The RL Toolbox 2.0: http://www.igi.tugraz.at/ril-toolbox/general/overview.html
- OpenDP: http://opendp.sourceforge.net
- RL-Competition (2008)! http://rl-competition.org/ (June 1st, 2008: test runs begin!)
Related fields:
- Operations research (MOR, OR)
- Control theory (IEEE TAC, Automatica, IEEE CDC, ECC)
- Simulation optimization (Winter Simulation Conference)
Abstract Control Model
[Diagram: the perception-action loop. The controller (agent) sends actions to the environment and receives sensations (and reward) in return.]
Zooming In
[Diagram: inside the agent. External sensations and reward enter; the agent maintains memory and state, produces internal sensations, and emits actions.]
A Mathematical Model
Plant (controlled object):
- x_{t+1} = f(x_t, a_t, v_t), where x_t is the state and v_t is noise
- z_t = g(x_t, w_t), where z_t is the sensation/observation and w_t is noise
State: a sufficient statistic for the future, either independently of what we measure, or relative to the measurements.
Controller:
- a_t = F(z_1, z_2, ..., z_t), where a_t is the action/control
=> PERCEPTION-ACTION LOOP, CLOSED-LOOP CONTROL
Design problem: F = ?
Goal: sum_{t=1}^T r(z_t, a_t) → max
Objective State vs. Subjective State
A Classification of Controllers
- Feedforward: a_1, a_2, ... is designed ahead of time. Why is this bad?
- Feedback:
  - Purely reactive systems: a_t = F(z_t). Why is this bad?
  - Feedback with memory: m_t = M(m_{t-1}, z_t, a_{t-1}) ~ interpreting sensations; a_t = F(m_t) ~ decision making (deliberative vs. reactive)
Feedback Controllers
Plant:
- x_{t+1} = f(x_t, a_t, v_t)
- z_{t+1} = g(x_t, w_t)
Controller:
- m_t = M(m_{t-1}, z_t, a_{t-1})
- a_t = F(m_t)
m_t ~ x_t: state estimation, filtering; difficulties: noise, unmodelled parts
How do we compute a_t?
- With a model (f): model-based control; assumes (some kind of) state estimation
- Without a model: model-free control
Markovian Decision Problems
Markovian Decision Problems
An MDP is a tuple (X, A, p, r, γ):
- X: set of states
- A: set of actions (controls)
- p: transition probabilities p(y|x,a)
- r: rewards r(x,a,y), or r(x,a), or r(x)
- γ: discount factor, 0 ≤ γ < 1
The Process View: (X_t, A_t, R_t)
- X_t: state at time t
- A_t: action at time t
- R_t: reward at time t
Laws:
- X_{t+1} ~ p(·|X_t, A_t)
- A_t ~ π(·|H_t), where π is the policy
- H_t = (X_t, A_{t-1}, R_{t-1}, ..., A_1, R_1, X_0): the history
- R_t = r(X_t, A_t, X_{t+1})
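For concreteness, here is a minimal sketch of how the (X, A, p, r, γ) tuple and one step of the process can be represented in code; the 2-state, 2-action numbers below are made up for illustration.

```python
# A minimal sketch (NumPy) of the (X, A, p, r, gamma) tuple and one step of the
# process X_{t+1} ~ p(.|X_t, A_t); the 2-state, 2-action numbers are invented.
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9                               # discount factor, 0 <= gamma < 1

# P[a, x, y] = p(y | x, a); each row P[a, x, :] sums to 1
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],               # action 0
              [[0.5, 0.5],
               [0.3, 0.7]]])              # action 1

# R[a, x, y] = r(x, a, y)
R = np.array([[[1.0, 0.0],
               [0.0, 2.0]],
              [[0.5, 0.5],
               [1.0, 0.0]]])

# One step of the process under a fixed state-action pair
x, a = 0, 1
y = rng.choice(2, p=P[a, x])              # X_{t+1} ~ p(. | X_t=x, A_t=a)
reward = R[a, x, y]                       # R_t = r(X_t, A_t, X_{t+1})
print(y, reward)
```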
The Control Problem
Value functions: V^π(x) = E[ Σ_{t=0}^∞ γ^t R_t | X_0 = x ], the expected total discounted reward when policy π is followed from state x.
Optimal value function: V*(x) = sup_π V^π(x).
Optimal policy: π* is optimal if V^{π*}(x) = V*(x) for all x ∈ X.
Applications of MDPs
Fields: operations research, econometrics, control, statistics, games, AI.
Examples: optimal investments, replacement problems, option pricing, logistics and inventory management, active vision, production scheduling, dialogue control, bioreactor control, robotics (RoboCup Soccer), driving, real-time load balancing, design of experiments (medical tests).
Variants of MDPs
- Discounted
- Undiscounted: stochastic shortest path
- Average reward
- Multiple criteria
- Minimax
- Games
MDP Problems
- Planning: the MDP (X, A, p, r, γ) is known. Find an optimal policy π*!
- Learning: the MDP is unknown, but you are allowed to interact with it. Find an optimal policy π*!
- Optimal learning: while interacting with the MDP, minimize the loss due to not using an optimal policy from the beginning.
Solving MDPs: Dimensions
- Which problem? (Planning, learning, optimal learning)
- Exact or approximate?
- Uses samples?
- Incremental?
- Uses value functions?
  - Yes: value-function based methods
    - Planning: DP, Random Discretization Method, FVI, ...
    - Learning: Q-learning, actor-critic, ...
  - No: policy search methods
    - Planning: Monte-Carlo tree search, likelihood ratio methods (policy gradient), sample-path optimization (Pegasus), ...
- Representation
  - Structured state: factored states, logical representation, ...
  - Structured policy space: hierarchical methods
Dynamic Programming
Richard Bellman (1920–1984)
- Control theory, systems analysis
- Dynamic Programming: RAND Corporation, 1949–1955
- Bellman equation
- Bellman–Ford algorithm
- Hamilton–Jacobi–Bellman equation
- Curse of dimensionality
- Invariant imbedding
- Grönwall–Bellman inequality
Proof of T^π V^π = V^π
What you need to know:
- Linearity of expectation: E[A+B] = E[A] + E[B]
- Law of total expectation: E[Z] = Σ_x P(X=x) E[Z | X=x], and E[Z | U=u] = Σ_x P(X=x | U=u) E[Z | U=u, X=x]
- Markov property: E[f(X_1, X_2, ...) | X_1=y, X_0=x] = E[f(X_1, X_2, ...) | X_1=y]
V^π(x) = E[Σ_{t=0}^∞ γ^t R_t | X_0 = x]
  = Σ_y P(X_1=y | X_0=x) E[Σ_{t=0}^∞ γ^t R_t | X_0=x, X_1=y]   (by the law of total expectation)
  = Σ_y p(y | x, π(x)) E[Σ_{t=0}^∞ γ^t R_t | X_0=x, X_1=y]   (since X_1 ~ p(·|X_0, π(X_0)))
  = Σ_y p(y | x, π(x)) { E[R_0 | X_0=x, X_1=y] + γ E[Σ_{t=0}^∞ γ^t R_{t+1} | X_0=x, X_1=y] }   (by the linearity of expectation)
  = Σ_y p(y | x, π(x)) { r(x, π(x), y) + γ V^π(y) }   (using the Markov property and the definitions of r and V^π)
  = (T^π V^π)(x)   (using the definition of T^π)
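As a sanity check on this identity, the sketch below computes V^π in closed form on a toy 2-state MDP (the numbers are invented) and verifies that one application of T^π leaves it unchanged.

```python
# Numerical check of the fixed-point identity T^pi V^pi = V^pi on a toy MDP.
# V^pi is obtained in closed form as (I - gamma P_pi)^{-1} r_pi and compared
# with one application of the policy evaluation operator T^pi.
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # P[a, x, y] = p(y | x, a)
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],     # R[a, x, y] = r(x, a, y)
              [[0.5, 0.5], [1.0, 0.0]]])
pi = np.array([0, 1])                       # a deterministic policy pi(x)

# Transition matrix and expected one-step reward under pi
P_pi = np.array([P[pi[x], x] for x in range(2)])
r_pi = np.array([(P[pi[x], x] * R[pi[x], x]).sum() for x in range(2)])

V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)   # closed-form V^pi
TV = r_pi + gamma * P_pi @ V_pi                           # one step of T^pi

print(np.allclose(TV, V_pi))                # True: V^pi is the fixed point
```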
An Algebra for Contractions
- Prop: If T_1: B_1 → B_2 is L_1-Lipschitz and T_2: B_2 → B_3 is L_2-Lipschitz, then T_2 T_1 is L_1 L_2-Lipschitz.
- Def: If T is 1-Lipschitz, T is called a non-expansion.
- Prop: M: B(X×A) → B(X), M(Q)(x) = max_a Q(x,a) is a non-expansion.
- Prop: Mul_c: B → B, Mul_c V = c V is |c|-Lipschitz.
- Prop: Add_r: B → B, Add_r V = r + V is a non-expansion.
- Prop: K: B(X) → B(X), (K V)(x) = Σ_y K(x,y) V(y) is a non-expansion if K(x,y) ≥ 0 and Σ_y K(x,y) = 1.
Policy Evaluations are Contractions
Def: ||V||_∞ = max_x |V(x)| (supremum norm); below, ||·|| denotes ||·||_∞.
Theorem: Let T^π be the policy evaluation operator of some policy π. Then T^π is a γ-contraction.
Corollary: V^π is the unique fixed point of T^π. V_{k+1} = T^π V_k → V^π for every V_0 ∈ B(X), and ||V_k − V^π|| = O(γ^k).
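A small sketch of iterative policy evaluation, V_{k+1} = T^π V_k, on a toy problem whose transition matrix and rewards under π are assumed for illustration; the sup-norm error shrinks by roughly a factor of γ per iteration, as the corollary states.

```python
# Iterative policy evaluation V_{k+1} = T^pi V_k, illustrating O(gamma^k)
# convergence to V^pi (toy transition matrix and rewards under the policy).
import numpy as np

gamma = 0.9
P_pi = np.array([[0.8, 0.2], [0.3, 0.7]])   # transitions under the policy
r_pi = np.array([1.0, 0.5])                 # expected rewards under the policy

V_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)   # exact V^pi
V = np.zeros(2)
for k in range(30):
    V = r_pi + gamma * P_pi @ V             # apply T^pi once
    err = np.max(np.abs(V - V_exact))       # sup-norm error, ~gamma^k * ||V_0 - V^pi||
print(err)
```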
The Bellman Optimality Operator
Let T: B(X) → B(X) be defined by (T V)(x) = max_a Σ_y p(y|x,a) { r(x,a,y) + γ V(y) }.
Def: π is greedy w.r.t. V if T^π V = T V.
Prop: T is a γ-contraction.
Theorem (BOE): T V* = V*.
Proof: Let V be the unique fixed point of T. For any policy π, T^π V ≤ T V, which (by monotonicity) gives V^π ≤ V; hence V* ≤ V. Now let π be greedy w.r.t. V. Then T^π V = T V = V, so V = V^π ≤ V*. Therefore V = V* and T V* = V*.
Value Iteration
Theorem: For any V_0 ∈ B(X), the iteration V_{k+1} = T V_k satisfies V_k → V*, and in particular ||V_k − V*|| = O(γ^k).
What happens when we stop early?
Theorem: Let π be greedy w.r.t. V. Then ||V^π − V*|| ≤ 2 ||T V − V|| / (1 − γ).
Proof: ||V^π − V*|| ≤ ||V^π − V|| + ||V − V*|| ≤ ...
Corollary: In a finite MDP, the number of policies is finite, so we can stop when ||V_k − T V_k|| ≤ ε(1 − γ)/2, where ε = min{ ||V* − V^π|| : V^π ≠ V* }.
Pseudo-polynomial complexity.
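Below is a minimal value-iteration sketch with an early-stopping tolerance in the spirit of the corollary; the MDP arrays reuse the toy P[a, x, y], R[a, x, y] convention from the earlier sketches and are assumptions, not data from the slides.

```python
# Value iteration V_{k+1} = T V_k with a stopping tolerance; once ||T V - V||
# is small, the greedy policy w.r.t. V is returned (toy MDP numbers).
import numpy as np

gamma, tol = 0.9, 1e-8
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, 0.0]]])

def bellman_optimality(V):
    # (T V)(x) = max_a sum_y p(y|x,a) [ r(x,a,y) + gamma V(y) ]
    Q = (P * (R + gamma * V)).sum(axis=2)   # shape (actions, states)
    return Q.max(axis=0), Q.argmax(axis=0)

V = np.zeros(2)
while True:
    TV, greedy = bellman_optimality(V)
    if np.max(np.abs(TV - V)) < tol:
        break
    V = TV
print(V, greedy)                            # approximate V* and a greedy policy
```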
Policy Improvement [Howard 60]
Def: For U, V ∈ B(X), V ≥ U if V(x) ≥ U(x) holds for all x ∈ X.
Def: For U, V ∈ B(X), V > U if V ≥ U and there exists x ∈ X such that V(x) > U(x).
Theorem (Policy Improvement): Let π' be greedy w.r.t. V^π. Then V^{π'} ≥ V^π. If T V^π > V^π, then V^{π'} > V^π.
Policy Iteration
PolicyIteration(π):
  V ← V^π
  do {improvement}
    V' ← V
    let π be such that T^π V = T V   (π greedy w.r.t. V)
    V ← V^π
  while (V > V')
  return π
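A runnable sketch of the loop above: exact policy evaluation by a linear solve, followed by a greedy improvement step, repeated until the policy stops changing (toy MDP arrays as before).

```python
# Policy iteration: evaluate the current policy exactly, then switch to the
# greedy policy, until the policy is stable (toy MDP numbers, assumed).
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, 0.0]]])
n_states = P.shape[1]

def evaluate(pi):
    # V^pi = (I - gamma P_pi)^{-1} r_pi
    P_pi = np.array([P[pi[x], x] for x in range(n_states)])
    r_pi = np.array([(P[pi[x], x] * R[pi[x], x]).sum() for x in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

pi = np.zeros(n_states, dtype=int)
while True:
    V = evaluate(pi)
    Q = (P * (R + gamma * V)).sum(axis=2)   # Q[a, x] = (L V)(x, a)
    new_pi = Q.argmax(axis=0)               # greedy improvement
    if np.array_equal(new_pi, pi):
        break
    pi = new_pi
print(pi, V)                                # optimal policy and its value
```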
Policy Iteration Theorem
Theorem: In a finite, discounted MDP, policy iteration stops after a finite number of steps and returns an optimal policy.
Proof: Follows from the Policy Improvement Theorem.
Linear Programming
If V ≥ T V, then V ≥ V* = T V*. Hence, V* is the smallest V that satisfies V ≥ T V.
V ≥ T V iff (*) V(x) ≥ Σ_y p(y|x,a) { r(x,a,y) + γ V(y) } for all x, a.
LinProg(V): Σ_x V(x) → min subject to V satisfying (*).
Theorem: LinProg(V) returns the optimal value function V*.
Corollary: Pseudo-polynomial complexity.
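A sketch of this LP in code, using scipy.optimize.linprog: minimize Σ_x V(x) subject to the constraints (*); the toy MDP numbers are assumptions reused from the earlier sketches.

```python
# LP formulation of the MDP: minimize sum_x V(x) subject to
# V(x) >= sum_y p(y|x,a)[r(x,a,y) + gamma V(y)] for every (x, a).
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, 0.0]]])
nA, nS, _ = P.shape

c = np.ones(nS)                                   # objective: sum_x V(x)
A_ub, b_ub = [], []
for a in range(nA):
    for x in range(nS):
        # rewrite V(x) >= r(x,a) + gamma P(.|x,a) V as an A_ub @ V <= b_ub row
        A_ub.append(gamma * P[a, x] - np.eye(nS)[x])
        b_ub.append(-(P[a, x] * R[a, x]).sum())
res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * nS)
print(res.x)                                      # approximately V*
```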
Variations of a Theme
Approximate Value Iteration
AVI: V_{k+1} = T V_k + ε_k
AVI Theorem: Let ε = max_k ||ε_k||. Then limsup_{k→∞} ||V_k − V*|| ≤ 2ε / (1 − γ).
Proof: Let a_k = ||V_k − V*||. Then a_{k+1} = ||V_{k+1} − V*|| = ||T V_k − T V* + ε_k|| ≤ γ ||V_k − V*|| + ε = γ a_k + ε. Hence a_k is bounded. Take the limsup of both sides: a ≤ γ a + ε; reorder. // (e.g., [BT96])
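To illustrate the theorem, the sketch below runs value iteration with artificial bounded noise ε_k injected at every step and compares the final error to the 2ε/(1 − γ) bound; the noise model and MDP numbers are assumptions made purely for illustration.

```python
# Approximate value iteration V_{k+1} = T V_k + eps_k with bounded noise;
# the asymptotic error stays within 2*eps/(1-gamma) of V* (toy MDP).
import numpy as np

rng = np.random.default_rng(0)
gamma, eps = 0.9, 0.05
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, 0.0]]])

def T(V):
    return (P * (R + gamma * V)).sum(axis=2).max(axis=0)

V_star = np.zeros(2)
for _ in range(2000):                      # exact value iteration for reference
    V_star = T(V_star)

V = np.zeros(2)
for _ in range(2000):                      # noisy (approximate) value iteration
    V = T(V) + rng.uniform(-eps, eps, size=2)

print(np.max(np.abs(V - V_star)), 2 * eps / (1 - gamma))   # error vs. bound
```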
Fitted Value Iteration with Non-expansion Operators
FVI: Let A be a non-expansion and iterate V_{k+1} = A T V_k. Where does this converge to?
Theorem: Let U, V be such that A T U = U and T V = V. Then ||V − U|| ≤ ||A V − V|| / (1 − γ).
Proof: ||U − V|| = ||A T U − V|| ≤ ||A T U − A T V|| + ||A T V − V|| ≤ γ ||U − V|| + ||A V − V||, using that A is a non-expansion, T is a γ-contraction, and A T V = A V; reorder. [Gordon 95]
Application to Aggregation
Let Π be a partition of X, and let S(x) be the unique cell that x belongs to.
Let A: B(X) → B(X) be (A V)(x) = Σ_z μ(z; S(x)) V(z), where μ(·; S(x)) is a distribution over S(x).
Aggregated MDP: p(C|B,a) = Σ_{x∈B} μ(x;B) Σ_{y∈C} p(y|x,a), r(B,a,C) = Σ_{x∈B} μ(x;B) Σ_{y∈C} p(y|x,a) r(x,a,y).
Theorem: Take (Π, A, p, r), let V be its optimal value function, and define V_E(x) = V(S(x)). Then ||V_E − V*|| ≤ ||A V* − V*|| / (1 − γ).
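A sketch of the construction: build the aggregated MDP over the cells of a partition with uniform weights μ(·;B), solve it by value iteration, and lift the result back to X via V_E(x) = V(S(x)). The 4-state MDP and the 2-cell partition below are invented for illustration.

```python
# State aggregation: aggregate a toy 4-state MDP into 2 cells with uniform
# weights, solve the aggregated MDP, and lift its value function back to X.
import numpy as np

gamma = 0.9
nS, nA = 4, 2
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(nS), size=(nA, nS))       # P[a, x, :] = p(.|x, a)
R = rng.uniform(0, 1, size=(nA, nS, nS))            # R[a, x, y] = r(x, a, y)

cells = [[0, 1], [2, 3]]                             # partition of X
nC = len(cells)
mu = [np.full(len(c), 1.0 / len(c)) for c in cells]  # uniform mu(.; B)

# Aggregated transitions and expected one-step rewards per (cell, action)
P_agg = np.zeros((nA, nC, nC))
r_agg = np.zeros((nA, nC))
for a in range(nA):
    for i, B in enumerate(cells):
        for j, C in enumerate(cells):
            P_agg[a, i, j] = sum(mu[i][k] * P[a, x, C].sum() for k, x in enumerate(B))
        r_agg[a, i] = sum(mu[i][k] * (P[a, x] * R[a, x]).sum() for k, x in enumerate(B))

# Value iteration on the aggregated MDP, then lift back to X
V = np.zeros(nC)
for _ in range(1000):
    V = (r_agg + gamma * P_agg @ V).max(axis=0)
cell_of = np.array([0, 0, 1, 1])                     # S(x): the cell of each state
V_E = V[cell_of]                                     # V_E(x) = V(S(x))
print(V_E)
```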
Action-Value Functions
L: B(X) → B(X×A), (L V)(x,a) = Σ_y p(y|x,a) { r(x,a,y) + γ V(y) }. One-step lookahead.
Note: π is greedy w.r.t. V if (L V)(x, π(x)) = max_a (L V)(x,a).
Def: Q* = L V*.
Def: Let Max: B(X×A) → B(X), (Max Q)(x) = max_a Q(x,a).
Note: Max L = T.
Corollary: Q* = L Max Q*.
Proof: Q* = L V* = L T V* = L Max L V* = L Max Q*.
L Max is a γ-contraction, so value iteration, policy iteration, ... carry over to action-value functions.
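A minimal sketch of value iteration carried out directly on action-value functions, i.e. iterating Q ← L Max Q, on the same toy MDP arrays assumed in the earlier sketches.

```python
# Q-value iteration: Q(x,a) <- sum_y p(y|x,a)[r(x,a,y) + gamma max_b Q(y,b)].
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, 0.0]]])

Q = np.zeros((2, 2))                         # Q[a, x]
for _ in range(1000):
    V = Q.max(axis=0)                        # (Max Q)(x)
    Q = (P * (R + gamma * V)).sum(axis=2)    # (L Max Q)(x, a)
print(Q.max(axis=0), Q.argmax(axis=0))       # V* and a greedy (optimal) policy
```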
Changing Granularity
Asynchronous Value Iteration: every time step, update only a few states.
- AsyncVI Theorem: If all states are updated infinitely often, the algorithm converges to V*.
- How to use? Prioritized sweeping.
IPS [McMahan & Gordon 05]:
- Instead of an update, put the state on a priority queue.
- When picking a state from the queue, update it.
- Put its predecessors on the queue.
- Theorem: Equivalent to Dijkstra on shortest path problems, provided that rewards are non-positive.
LRTA* [Korf 90] ~ RTDP [Barto, Bradtke & Singh 95]: focusing on the parts of the state space that matter.
- Constraints: the same problem is solved from several initial positions; decisions have to be fast.
- Idea: update values along the paths.
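A sketch of asynchronous value iteration in its simplest form: one state is backed up per step, with uniformly random state selection standing in for a real prioritization scheme such as prioritized sweeping (toy MDP numbers, assumed).

```python
# Asynchronous value iteration: each step applies the Bellman optimality backup
# to a single state; as long as every state keeps being selected, V converges to V*.
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, 0.0]]])

V = np.zeros(2)
for _ in range(5000):
    x = rng.integers(2)                              # pick one state to back up
    V[x] = max((P[a, x] * (R[a, x] + gamma * V)).sum() for a in range(2))
print(V)                                             # approximately V*
```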
Changing Granularity (cont.)
Generalized Policy Iteration:
- Partial evaluation and partial improvement of policies
- Multi-step lookahead improvement
AsyncPI Theorem: If both evaluation and improvement happen at every state infinitely often, then the process converges to an optimal policy. [Williams & Baird 93]
Variations of a Theme [SzeLi99]
- Game against nature [Heger 94]: inf_w Σ_t γ^t R_t(w) with X_0 = x
- Risk-sensitive criterion: log E[ exp(Σ_t γ^t R_t) | X_0 = x ]
- Stochastic shortest path
- Average reward
- Markov games
  - Simultaneous action choices (rock-paper-scissors)
  - Sequential action choices
  - Zero-sum (or not)
References
[Howard 60] R.A. Howard: Dynamic Programming and Markov Processes. The MIT Press, Cambridge, MA, 1960.
[Gordon 95] G.J. Gordon: Stable function approximation in dynamic programming. ICML, pp. 261–268, 1995.
[Watkins 90] C.J.C.H. Watkins: Learning from Delayed Rewards. PhD Thesis, 1990.
[McMahan & Gordon 05] H.B. McMahan and G.J. Gordon: Fast Exact Planning in Markov Decision Processes. ICAPS, 2005.
[Korf 90] R. Korf: Real-Time Heuristic Search. Artificial Intelligence 42, 189–211, 1990.
[Barto, Bradtke & Singh 95] A.G. Barto, S.J. Bradtke and S. Singh: Learning to act using real-time dynamic programming. Artificial Intelligence 72, 81–138, 1995.
[Williams & Baird 93] R.J. Williams and L.C. Baird: Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions. Northeastern University Technical Report NU-CCS-93-14, November 1993.
[SzeLi99] Cs. Szepesvári and M.L. Littman: A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms. Neural Computation, 11, 2017–2059, 1999.
[Heger 94] M. Heger: Consideration of risk in reinforcement learning. ICML, 105–111, 1994.
Why "Dynamic Programming"? (1950)
Wilson, the Secretary of Defense, hated research, not to speak of math! So: not "math", but "planning, thinking, decision making" → "programming"; "multistage, dynamic, time-varying" → "dynamic".
RAND, 1948: David Blackwell, George Dantzig, Ted Harris, Sam Karlin, Lloyd Shapley, ... (game theory, decision theory)
What Kushner said: "The word programming was used by the military to mean scheduling. Dantzig's linear programming was an abbreviation of 'programming with linear models'. Bellman has described the origin of the name 'dynamic programming' as follows. An Assistant Secretary of the Air Force, who was believed to be strongly anti-mathematics, was to visit RAND. So Bellman was concerned that his work on the mathematics of multi-stage decision processes would be unappreciated. But 'programming' was still OK, and the Air Force was concerned with rescheduling continuously due to uncertainties. Thus 'dynamic programming' was chosen as a politically wise descriptor. On the other hand, when I asked him the same question, he replied that he was trying to upstage Dantzig's linear programming by adding 'dynamic'. Perhaps both motivations were true."