
Q-Learning and Pontryagin's Minimum Principle

Abstract (https://netfiles.uiuc.edu/meyn/www/spm_files/Q2009/Q09.html):

Q-learning is a technique used to compute an optimal policy for a controlled Markov chain based on observations of the system controlled using a non-optimal policy. It has proven to be effective for models with finite state and action space. This paper establishes connections between Q-learning and nonlinear control of continuous-time models with general state space and general action space. The main contributions are summarized as follows.

* The starting point is the observation that the "Q-function" appearing in Q-learning algorithms is an extension of the Hamiltonian that appears in the Minimum Principle. Based on this observation we introduce the steepest descent Q-learning (SDQ-learning) algorithm to obtain the optimal approximation of the Hamiltonian within a prescribed finite-dimensional function class.

* A transformation of the optimality equations is performed based on the adjoint of a resolvent operator. This is used to construct a consistent algorithm based on stochastic approximation that requires only causal filtering of the time-series data.

* Several examples are presented to illustrate the application of these techniques, including application to distributed control of multi-agent systems.


Page 1: Q-Learning and Pontryagin's Minimum Principle

Q-Learning and Pontryagin's Minimum Principle

Sean Meyn

Department of Electrical and Computer Engineering and the Coordinated Science Laboratory, University of Illinois

Joint work with Prashant Mehta. NSF support: ECS-0523620

Page 2: Q-Learning and Pontryagin's Minimum Principle

Outline

Step 1: Recognize · Step 2: Find a stabilizing policy · Step 3: Optimality · Step 4: Adjoint · Step 5: Interpret

Example: Decentralized control

Coarse models - what to do with them?

Q-learning for nonlinear state space models

Example: Local approximation

Page 3: Q-Learning and Pontryagin's Minimum Principle

Outline

Step 1: Recognize · Step 2: Find a stabilizing policy · Step 3: Optimality · Step 4: Adjoint · Step 5: Interpret

Example: Decentralized control

Coarse models - what to do with them?

Q-learning for nonlinear state space models

Example: Local approximation

Page 4: Q-Learning and Pontryagin's Minimum Principle

Coarse Models: A rich collection of model reduction techniques

Many of today's participants have contributed to this research. A biased list:

Fluid models: Law of Large Numbers scaling, most likely paths in large deviations

Clustering: spectral graph theory, Markov spectral theory

Large population limits: Interacting particle systems

Singular perturbations

Workload relaxation for networks, heavy-traffic limits

Page 5: Q-Learning and Pontryagin's Minimum Principle

Workload Relaxations

An example from CTCN:

Workload at two stations evolves as a two-dimensional system. Cost is projected onto these coordinates.

[Figure: the (w1, w2) workload plane, comparing the optimal region R∗ with the stochastic control region RSTO; the slope −(1 − ρ) is marked.]

Figure 7.1: Demand-driven model with routing, scheduling, and re-work.

Optimal policy for relaxation = hedging policy for full network.

Figure 7.2: Optimal policies for two instances of the network shown in Figure 7.1. In each figure the optimal stochastic control region RSTO is compared with the optimal region R∗ obtained for the two-dimensional fluid model.

Page 6: Q-Learning and Pontryagin's Minimum Principle

Workload Relaxations and Simulation

[Figure: average cost vs. iteration, for value iteration (VIA) initialized with zero and with the fluid value function; simulated mean with and without control variate.]

An example from CTCN:

DP and simulations accelerated using fluid value function for workload relaxation.

Decision making at stations 1 & 2, e.g., setting safety-stock levels.

[Figure: two-station network with arrival rate α and service rates µ.]

Page 7: Q-Learning and Pontryagin's Minimum Principle

What To Do With a Coarse Model?

[Figure: average cost vs. iteration for VIA initialized with zero and with the fluid value function.]

Setting: we have qualitative or partial quantitative insight regarding optimal control.

The network examples relied on specific network structure.

What about other models?

Page 8: Q-Learning and Pontryagin's Minimum Principle

What To Do With a Coarse Model?

[Figure: average cost vs. iteration for VIA initialized with zero and with the fluid value function.]

Setting: we have qualitative or partial quantitative insight regarding optimal control.

The network examples relied on specific network structure.

What about other models?

An answer lies in a new formulation of Q-learning.

Page 9: Q-Learning and Pontryagin's Minimum Principle

Outline

Step 1: Recognize · Step 2: Find a stabilizing policy · Step 3: Optimality · Step 4: Adjoint · Step 5: Interpret

Example: Decentralized control

Coarse models - what to do with them?

Q-learning for nonlinear state space models

Example: Local approximation

Page 10: Q-Learning and Pontryagin's Minimum Principle

What is Q learning?

Watkins' 1992 formulation applied to finite state space MDPs.

Idea is similar to Mayne & Jacobson's differential dynamic programming.

Differential Dynamic Programming. D. H. Jacobson and D. Q. Mayne, American Elsevier Pub. Co., 1970.

Q-Learning. C. J. C. H. Watkins and P. Dayan, Machine Learning, 1992.

Page 11: Q-Learning and Pontryagin's Minimum Principle

What is Q learning?

Watkins' 1992 formulation applied to finite state space MDPs.

Deterministic formulation: Nonlinear system on Euclidean space,

\frac{d}{dt}x(t) = f(x(t), u(t)), \qquad t \ge 0

Infinite-horizon discounted cost criterion,

J^*(x) = \inf \int_0^\infty e^{-\gamma s}\, c(x(s), u(s))\, ds, \qquad x(0) = x,

with c a non-negative cost function.

Idea is similar to Mayne & Jacobson's differential dynamic programming.

Differential Dynamic Programming. D. H. Jacobson and D. Q. Mayne, American Elsevier Pub. Co., 1970.

Q-Learning. C. J. C. H. Watkins and P. Dayan, Machine Learning, 1992.

Page 12: Q-Learning and Pontryagin's Minimum Principle

What is Q learning?

[Figure: optimal policy]

Deterministic formulation: Nonlinear system on Euclidean space,

\frac{d}{dt}x(t) = f(x(t), u(t)), \qquad t \ge 0

Infinite-horizon discounted cost criterion,

J^*(x) = \inf \int_0^\infty e^{-\gamma s}\, c(x(s), u(s))\, ds, \qquad x(0) = x,

with c a non-negative cost function.

Differential generator: For any smooth function h,

\mathcal{D}_u h\,(x) := (\nabla h(x))^{\mathsf T} f(x, u)

Page 13: Q-Learning and Pontryagin's Minimum Principle

What is Q learning?

Deterministic formulation: Nonlinear system on Euclidean space,

\frac{d}{dt}x(t) = f(x(t), u(t)), \qquad t \ge 0

Infinite-horizon discounted cost criterion,

J^*(x) = \inf \int_0^\infty e^{-\gamma s}\, c(x(s), u(s))\, ds, \qquad x(0) = x,

with c a non-negative cost function.

Differential generator: For any smooth function h,

\mathcal{D}_u h\,(x) := (\nabla h(x))^{\mathsf T} f(x, u)

HJB equation:

\min_u \big\{ c(x,u) + \mathcal{D}_u J^*\,(x) \big\} = \gamma J^*(x)

Page 14: Q-Learning and Pontryagin's Minimum Principle

What is Q learning?

Deterministic formulation: Nonlinear system on Euclidean space,

\frac{d}{dt}x(t) = f(x(t), u(t)), \qquad t \ge 0

Infinite-horizon discounted cost criterion,

J^*(x) = \inf \int_0^\infty e^{-\gamma s}\, c(x(s), u(s))\, ds, \qquad x(0) = x,

with c a non-negative cost function.

Differential generator: For any smooth function h,

\mathcal{D}_u h\,(x) := (\nabla h(x))^{\mathsf T} f(x, u)

HJB equation:

\min_u \big\{ c(x,u) + \mathcal{D}_u J^*\,(x) \big\} = \gamma J^*(x)

The Q-function of Q-learning is this function of two variables:

H^*(x,u) := c(x,u) + \mathcal{D}_u J^*\,(x)
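To make the definitions concrete, here is a minimal Python sketch for a scalar system (the cubic example used later in the talk); the function names and the finite-difference derivative are our own illustrative choices, not part of the talk.

```python
import numpy as np

# Scalar example used later in the talk: dx/dt = -x^3 + u, quadratic cost.
def f(x, u):
    return -x**3 + u

def c(x, u):
    return 0.5 * x**2 + 0.5 * u**2

def D_u(h, x, u, eps=1e-6):
    """Differential generator D_u h(x) = h'(x) f(x, u); h' is approximated
    here by a central difference (illustrative helper, not from the talk)."""
    dh = (h(x + eps) - h(x - eps)) / (2 * eps)
    return dh * f(x, u)

def Q_function(h, x, u):
    """Q-function (Hamiltonian) built from a candidate value function h:
    H(x, u) = c(x, u) + D_u h(x)."""
    return c(x, u) + D_u(h, x, u)

# Evaluate the Q-function for a quadratic guess J(x) = x^2 / 2.
J_guess = lambda x: 0.5 * x**2
print(Q_function(J_guess, 0.3, -0.1))
```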

Page 15: Q-Learning and Pontryagin's Minimum Principle

Q learning - Steps towards an algorithm

Sequence of five steps:

Step 1: Recognize fixed point equation for the Q-function
Step 2: Find a stabilizing policy that is ergodic
Step 3: Optimality criterion - minimize Bellman error
Step 4: Adjoint operation
Step 5: Interpret and simulate!

Page 16: Q-Learning and Pontryagin's Minimum Principle

Q learning - Steps towards an algorithm

Sequence of five steps:

Step 1: Recognize fixed point equation for the Q-function
Step 2: Find a stabilizing policy that is ergodic
Step 3: Optimality criterion - minimize Bellman error
Step 4: Adjoint operation
Step 5: Interpret and simulate!

Goal - seek the best approximation, within a parameterized class.

Page 17: Q-Learning and Pontryagin's Minimum Principle

Q learning - Steps towards an algorithm

Step 1: Recognize fixed point equation for the Q-function

Q-function:

H^*(x,u) = c(x,u) + \mathcal{D}_u J^*\,(x)

Its minimum:

\underline{H}^*(x) := \min_{u \in \mathsf{U}} H^*(x,u) = \gamma J^*(x)

Fixed point equation:

\mathcal{D}_u \underline{H}^*\,(x) = -\gamma\big( c(x,u) - H^*(x,u) \big)
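For completeness, the fixed point equation follows in one line from the two displays above (a sketch in LaTeX):

```latex
% Since \underline{H}^*(x) = \gamma J^*(x) and H^*(x,u) = c(x,u) + \mathcal{D}_u J^*(x),
\mathcal{D}_u \underline{H}^*\,(x)
  = \gamma\, \mathcal{D}_u J^*\,(x)
  = \gamma \big( H^*(x,u) - c(x,u) \big)
  = -\gamma \big( c(x,u) - H^*(x,u) \big).
```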

Page 18: Q-Learning and Pontryagin's Minimum Principle

Q learning - Steps towards an algorithm

Step 1: Recognize fixed point equation for the Q-function

Q-function:

H^*(x,u) = c(x,u) + \mathcal{D}_u J^*\,(x)

Its minimum:

\underline{H}^*(x) := \min_{u \in \mathsf{U}} H^*(x,u) = \gamma J^*(x)

Fixed point equation:

\mathcal{D}_u \underline{H}^*\,(x) = -\gamma\big( c(x,u) - H^*(x,u) \big)

Key observation for learning: For any input-output pair,

\mathcal{D}_u \underline{H}^*\,(x)\,\Big|_{x=x(t),\,u=u(t)} = \frac{d}{dt}\,\underline{H}^*(x(t))
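The key observation is just the chain rule along the trajectory, and it can be checked numerically; the sketch below uses the cubic example and an arbitrary smooth test function (all names are illustrative, not from the talk):

```python
import numpy as np

# d/dt h(x(t)) = (grad h)(x(t)) f(x(t), u(t)) = D_u h(x) evaluated at (x(t), u(t)).
dt = 1e-4
x, t = 0.2, 0.0
h  = lambda x: np.cos(x) + x**2      # any smooth test function
dh = lambda x: -np.sin(x) + 2 * x    # its derivative
f  = lambda x, u: -x**3 + u          # cubic example from the talk
u_of_t = lambda t: 0.5 * np.sin(t)   # any input signal

for _ in range(5):
    u = u_of_t(t)
    x_next = x + dt * f(x, u)                 # Euler step of the state
    lhs = (h(x_next) - h(x)) / dt             # d/dt h(x(t)) by finite difference
    rhs = dh(x) * f(x, u)                     # D_u h evaluated at (x(t), u(t))
    print(f"{lhs:+.6f}  {rhs:+.6f}")          # the two columns agree to O(dt)
    x, t = x_next, t + dt
```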

Page 19: Q-Learning and Pontryagin's Minimum Principle

Q learning - LQR example

Linear model and quadratic cost,

\frac{d}{dt}x = Ax + Bu

Cost:

c(x, u) = \tfrac{1}{2} x^{\mathsf T} Q x + \tfrac{1}{2} u^{\mathsf T} R u

Q-function:

H^*(x, u) = c(x, u) + \mathcal{D}_u J^*\,(x) = c(x, u) + (Ax + Bu)^{\mathsf T} P^* x,

where J^*(x) = \tfrac{1}{2} x^{\mathsf T} P^* x and P^* solves the Riccati equation.
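A minimal numerical sketch of the LQR Q-function, assuming illustrative system matrices (not from the talk). Discounting is absorbed by the standard shift of A in the Riccati equation; for γ = 0 this is the usual ARE.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative double-integrator data (not from the talk).
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
gamma = 0.1  # discount rate

# Discounted Riccati equation: A'P + PA - gamma P - P B R^{-1} B' P + Q = 0,
# obtained from the standard ARE with A replaced by A - (gamma/2) I.
P = solve_continuous_are(A - 0.5 * gamma * np.eye(2), B, Q, R)

def H_star(x, u):
    """LQR Q-function: H*(x, u) = c(x, u) + (A x + B u)' P* x."""
    cost = 0.5 * x @ Q @ x + 0.5 * u @ R @ u
    return cost + (A @ x + B @ u) @ (P @ x)

# Minimizing H* over u recovers the optimal gain u = -R^{-1} B' P* x.
K = np.linalg.solve(R, B.T @ P)
print("optimal gain:", K)
print("H*(x, u) at a sample point:", H_star(np.array([1.0, 0.0]), np.array([-0.5])))
```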

Page 20: Q-Learning and Pontryagin's Minimum Principle

Q learning - LQR example

Linear model and quadratic cost,

Cost:

c(x, u) = \tfrac{1}{2} x^{\mathsf T} Q x + \tfrac{1}{2} u^{\mathsf T} R u

Q-function:

H^*(x, u) = c(x, u) + (Ax + Bu)^{\mathsf T} P^* x, \qquad \text{where } P^* \text{ solves the Riccati equation}

Q-function approx:

H^\theta(x, u) = c(x, u) + \tfrac{1}{2} \sum_{i=1}^{d_x} \theta^{x}_i\, x^{\mathsf T} E^i x + \sum_{j=1}^{d_{xu}} \theta^{xu}_j\, u^{\mathsf T} F^j x
\qquad \big( E^\theta := \textstyle\sum_i \theta^{x}_i E^i,\ \ F^\theta := \sum_j \theta^{xu}_j F^j \big)

Minimum:

\underline{H}^\theta(x) = \tfrac{1}{2} x^{\mathsf T}\big( Q + E^\theta - (F^\theta)^{\mathsf T} R^{-1} F^\theta \big) x

Minimizer:

\phi^\theta(x) = u^\theta(x) = -R^{-1} F^\theta x
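A small Python sketch of the parameterized Q-function and the policy it induces, with hypothetical basis matrices E^i, F^j (sizes and values are illustrative, not from the talk):

```python
import numpy as np

Q = np.eye(2)
R = np.array([[1.0]])
E = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]        # E^i: n x n, symmetric
F = [np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]])]  # F^j: m x n

def H_theta(th_x, th_xu, x, u):
    """H^theta(x,u) = c(x,u) + 1/2 sum_i th_i x'E^i x + sum_j th_j u'F^j x."""
    cost = 0.5 * x @ Q @ x + 0.5 * u @ R @ u
    E_th = sum(t * Ei for t, Ei in zip(th_x, E))
    F_th = sum(t * Fj for t, Fj in zip(th_xu, F))
    return cost + 0.5 * x @ E_th @ x + u @ F_th @ x

def policy(th_xu, x):
    """Minimizer over u: u^theta(x) = -R^{-1} F^theta x."""
    F_th = sum(t * Fj for t, Fj in zip(th_xu, F))
    return -np.linalg.solve(R, F_th @ x)

x = np.array([1.0, -0.5])
print(policy([0.2, 0.3], x))
print(H_theta([0.1, 0.1], [0.2, 0.3], x, np.array([0.0])))
```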

Page 21: Q-Learning and Pontryagin's Minimum Principle

Q learning - Steps towards an algorithm

Step 2: Stationary policy that is ergodic?

Assume the LLN holds for continuous functions F : \mathsf{X} \times \mathsf{U} \to \mathbb{R}: as T \to \infty,

\frac{1}{T}\int_0^T F(x(t), u(t))\, dt \;\longrightarrow\; \int_{\mathsf{X}\times\mathsf{U}} F(x,u)\, \varpi(dx, du),

for a stationary distribution \varpi on \mathsf{X}\times\mathsf{U}.

Page 22: Q-Learning and Pontryagin's Minimum Principle

Q learning - Steps towards an algorithm

Step 2: Stationary policy that is ergodic?

Suppose for example the input is scalar, and the system is stable [bounded-input/bounded-state].

Can try a linear combination of sinusoids.

Page 23: Q-Learning and Pontryagin's Minimum Principle

Q learning - Steps towards an algorithm

Step 2: Stationary policy that is ergodic?

Suppose for example the input is scalar, and the system is stable [bounded-input/bounded-state].

Can try a linear combination of sinusoids:

u(t) = A\big( \sin(t) + \sin(\pi t) + \sin(e\,t) \big)
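A short simulation sketch of this exploration input applied to a bounded-input/bounded-state system (here the cubic example from later in the talk; the amplitude and horizon are illustrative choices):

```python
import numpy as np
from scipy.integrate import solve_ivp

A_amp = 0.5  # amplitude of the exploration signal (illustrative)

def u_explore(t):
    # u(t) = A ( sin(t) + sin(pi t) + sin(e t) )
    return A_amp * (np.sin(t) + np.sin(np.pi * t) + np.sin(np.e * t))

def rhs(t, x):
    # Any BIBS-stable system will do; here dx/dt = -x^3 + u.
    return -x**3 + u_explore(t)

T = 100.0
sol = solve_ivp(rhs, (0.0, T), [0.0], max_step=0.01, dense_output=True)
t = np.linspace(0.0, T, 10001)
x = sol.sol(t)[0]
u = u_explore(t)
# (x(t), u(t)) is the ergodic input-output record used by the learning steps below.
```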

Page 24: Q-Learning and Pontryagin's Minimum Principle

Q learning - Steps towards an algorithm

Step 3: Bellman error

Based on observations, minimize the mean-square Bellman error:

\mathcal{E}(\theta) := \tfrac{1}{2}\, \| L^\theta \|^2, \qquad
L^\theta(x,u) := \mathcal{D}_u \underline{H}^\theta\,(x) + \gamma\big( c(x,u) - H^\theta(x,u) \big)

First order condition for optimality:

\big\langle L^\theta,\ \mathcal{D}_u \underline{\psi}^\theta_i - \gamma\, \psi^\theta_i \big\rangle = 0, \qquad 1 \le i \le d,

with \psi^\theta_i := \partial H^\theta / \partial \theta_i and \underline{\psi}^\theta_i(x) = \psi^\theta_i(x, \phi^\theta(x)).

Page 25: Q-Learning and Pontryagin's Minimum Principle

Q learning - Convex Reformulation

Step 3: Bellman error

Based on observations, minimize the mean-square Bellman error over (\theta, G), subject to the constraint

G^\theta(x) \le H^\theta(x, u), \qquad \text{for all } x, u.

Page 26: Q-Learning and Pontryagin's Minimum Principle

Q learning - LQR example

Linear model and quadratic cost,

Cost:

c(x, u) = \tfrac{1}{2} x^{\mathsf T} Q x + \tfrac{1}{2} u^{\mathsf T} R u

Q-function:

H^*(x, u) = c(x, u) + (Ax + Bu)^{\mathsf T} P^* x, \qquad \text{where } P^* \text{ solves the Riccati equation}

Q-function approx:

H^\theta(x, u) = c(x, u) + \tfrac{1}{2} \sum_{i=1}^{d_x} \theta^{x}_i\, x^{\mathsf T} E^i x + \sum_{j=1}^{d_{xu}} \theta^{xu}_j\, u^{\mathsf T} F^j x

Approximation to minimum:

G^\theta(x) = \tfrac{1}{2} x^{\mathsf T} G^\theta x

Minimizer:

\phi^\theta(x) = u^\theta(x) = -R^{-1} F^\theta x

Page 27: Q-Learning and Pontryagin's Minimum Principle

Q learning - Steps towards an algorithm

Step 4: Causal smoothing to avoid differentiation

For any function of two variables g, the resolvent gives a new function,

R_\beta g\,(x, w) = \int_0^\infty e^{-\beta t}\, g(x(t), \xi(t))\, dt

Page 28: Q-Learning and Pontryagin's Minimum Principle

Q learning - Steps towards an algorithm

Step 4: Causal smoothing to avoid differentiation

For any function of two variables g, the resolvent gives a new function,

R_\beta g\,(x, w) = \int_0^\infty e^{-\beta t}\, g(x(t), \xi(t))\, dt, \qquad \beta > 0,

where the system is controlled using the nominal policy (stabilizing & ergodic),

u(t) = \phi(x(t), \xi(t)), \qquad t \ge 0.

Page 29: Q-Learning and Pontryagin's Minimum Principle

Q learning - Steps towards an algorithm

Step 4: Causal smoothing to avoid differentiation

For any function of two variables g, the resolvent gives a new function,

R_\beta g\,(x, w) = \int_0^\infty e^{-\beta t}\, g(x(t), \xi(t))\, dt, \qquad \beta > 0.

Resolvent equation:

R_\beta\, \mathcal{D}_u h = [\beta R_\beta - I]\, h

Page 30: Q-Learning and Pontryagin's Minimum Principle

Q learning - Steps towards an algorithm

Step 4: Causal smoothing to avoid differentiation

For any function of two variables g, the resolvent gives a new function,

R_\beta g\,(x, w) = \int_0^\infty e^{-\beta t}\, g(x(t), \xi(t))\, dt, \qquad \beta > 0.

Resolvent equation:

R_\beta\, \mathcal{D}_u h = [\beta R_\beta - I]\, h

Smoothed Bellman error:

L^{\theta,\beta} := R_\beta L^\theta = [\beta R_\beta - I]\, \underline{H}^\theta + \gamma\, R_\beta\big( c - H^\theta \big)
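Expanding the definition with the resolvent equation gives the displayed formula (sketch):

```latex
L^{\theta,\beta} := R_\beta L^\theta
  = R_\beta \big( \mathcal{D}_u \underline{H}^\theta \big)
    + \gamma\, R_\beta \big( c - H^\theta \big)
  = [\beta R_\beta - I]\, \underline{H}^\theta
    + \gamma\, R_\beta \big( c - H^\theta \big).
```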

Page 31: Q-Learning and Pontryagin's Minimum Principle

Q learning - Steps towards an algorithm

Step 4: Causal smoothing to avoid differentiation

Smoothed Bellman error:

\mathcal{E}_\beta(\theta) := \tfrac{1}{2}\, \| L^{\theta,\beta} \|^2

\nabla \mathcal{E}_\beta(\theta) = \big\langle L^{\theta,\beta},\ \nabla_\theta L^{\theta,\beta} \big\rangle \qquad \text{(zero at an optimum)}

Page 32: Q-Learning and Pontryagin's Minimum Principle

Q learning - Steps towards an algorithm

Step 4: Causal smoothing to avoid differentiation

Smoothed Bellman error:

\mathcal{E}_\beta(\theta) := \tfrac{1}{2}\, \| L^{\theta,\beta} \|^2

\nabla \mathcal{E}_\beta(\theta) = \big\langle L^{\theta,\beta},\ \nabla_\theta L^{\theta,\beta} \big\rangle \qquad \text{(zero at an optimum)}

Involves terms of the form \langle R_\beta g,\ R_\beta h \rangle.

Page 33: Q-Learning and Pontryagin's Minimum Principle

Q learning - Steps towards an algorithm

Step 4: Causal smoothing to avoid differentiation

Smoothed Bellman error:

\mathcal{E}_\beta(\theta) := \tfrac{1}{2}\, \| L^{\theta,\beta} \|^2, \qquad
\nabla \mathcal{E}_\beta(\theta) = \big\langle L^{\theta,\beta},\ \nabla_\theta L^{\theta,\beta} \big\rangle

Adjoint operation:

R^\dagger_\beta R_\beta = \tfrac{1}{2\beta}\big( R^\dagger_\beta + R_\beta \big), \qquad
\big\langle R_\beta g,\ R_\beta h \big\rangle = \tfrac{1}{2\beta}\Big( \langle g,\ R^\dagger_\beta h \rangle + \langle h,\ R^\dagger_\beta g \rangle \Big)

Page 34: Q-Learning and Pontryagin's Minimum Principle

Q learning - Steps towards an algorithm

Step 4: Causal smoothing to avoid differentiation

Smoothed Bellman error:

\mathcal{E}_\beta(\theta) := \tfrac{1}{2}\, \| L^{\theta,\beta} \|^2, \qquad
\nabla \mathcal{E}_\beta(\theta) = \big\langle L^{\theta,\beta},\ \nabla_\theta L^{\theta,\beta} \big\rangle

Adjoint operation:

R^\dagger_\beta R_\beta = \tfrac{1}{2\beta}\big( R^\dagger_\beta + R_\beta \big), \qquad
\big\langle R_\beta g,\ R_\beta h \big\rangle = \tfrac{1}{2\beta}\Big( \langle g,\ R^\dagger_\beta h \rangle + \langle h,\ R^\dagger_\beta g \rangle \Big)

Adjoint realization: time-reversal,

R^\dagger_\beta g\,(x, w) = \int_0^\infty e^{-\beta t}\, \mathsf{E}_{x,w}\big[ g(x^\circ(-t), \xi^\circ(-t)) \big]\, dt,

expectation conditional on x^\circ(0) = x, \xi^\circ(0) = w.
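In discrete time, the adjoint R†_β is just causal exponential smoothing of the recorded signals, so the inner products above can be computed by filtering the data. A sketch under the assumption of a uniformly sampled record (all helper names and signals are ours, not from the talk):

```python
import numpy as np

def causal_resolvent(signal, beta, dt):
    """Discrete approximation of (R_beta^dagger h)(t) = int_0^inf e^{-beta s} h(t - s) ds:
    causal exponential smoothing of the past of the signal."""
    out = np.zeros_like(signal)
    acc, decay = 0.0, np.exp(-beta * dt)
    for k, hk in enumerate(signal):
        acc = decay * acc + hk * dt
        out[k] = acc
    return out

def inner(a, b, dt, T):
    """Time-average inner product <a, b> over the record."""
    return np.sum(a * b) * dt / T

# Identity from the slide: <R_b g, R_b h> = (1/2b)( <g, R_b^+ h> + <h, R_b^+ g> ).
dt, beta, T = 0.01, 1.0, 200.0
t = np.arange(0.0, T, dt)
g = np.sin(t) + 0.1 * np.random.default_rng(0).standard_normal(t.size)
h = np.sin(t + 0.5)

R_g = causal_resolvent(g[::-1], beta, dt)[::-1]   # anti-causal R_beta via time reversal
R_h = causal_resolvent(h[::-1], beta, dt)[::-1]
lhs = inner(R_g, R_h, dt, T)
rhs = (inner(g, causal_resolvent(h, beta, dt), dt, T)
       + inner(h, causal_resolvent(g, beta, dt), dt, T)) / (2 * beta)
print(lhs, rhs)   # approximately equal for long, stationary records
```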

Page 35: Q-Learning and Pontryagin's Minimum Principle

Q learning - Steps towards an algorithm

After Step 5: Not quite adaptive control: ergodic input applied.

[Block diagram: inputs drive the complex system; measured behavior is compared with desired behavior to learn.]

Page 36: Q-Learning and Pontryagin's Minimum Principle

Q learning - Steps towards an algorithm

After Step 5: Not quite adaptive control: ergodic input applied; based on observations, minimize the mean-square Bellman error.

[Block diagram: inputs drive the complex system; measured behavior is compared with desired behavior to learn.]

Page 37: Q-Learning and Pontryagin's Minimum Principle

Along the controlled trajectory,

\frac{d}{dt}\,h(x(t)) = \mathcal{D}_u h\,(x)\,\Big|_{x = x(t),\; w = \xi(t)}

Deterministic / Stochastic Approximation

[Figure: individual state and ensemble state vs. time.]

Gradient descent:

\frac{d}{dt}\theta = -\varepsilon\, \big\langle L^\theta,\ \mathcal{D}_u \nabla_\theta \underline{H}^\theta - \gamma\, \nabla_\theta H^\theta \big\rangle

Converges* to the minimizer of the mean-square Bellman error.

* Convergence observed in experiments! For a convex re-formulation of the problem, see Mehta & Meyn 2009.

Page 38: Q-Learning and Pontryagin's Minimum Principle

Along the controlled trajectory,

\frac{d}{dt}\,h(x(t)) = \mathcal{D}_u h\,(x)\,\Big|_{x = x(t),\; w = \xi(t)}

Deterministic / Stochastic Approximation

[Figure: individual state and ensemble state vs. time.]

Gradient descent:

\frac{d}{dt}\theta = -\varepsilon\, \big\langle L^\theta,\ \mathcal{D}_u \nabla_\theta \underline{H}^\theta - \gamma\, \nabla_\theta H^\theta \big\rangle

Stochastic approximation, using the mean-square Bellman error along the trajectory:

\frac{d}{dt}\theta = -\varepsilon_t\, L^\theta_t\, \Big[ \frac{d}{dt}\nabla_\theta \underline{H}^\theta(x^\circ(t)) - \gamma\, \nabla_\theta H^\theta(x^\circ(t), u^\circ(t)) \Big]

L^\theta_t := \frac{d}{dt}\underline{H}^\theta(x^\circ(t)) + \gamma\big( c(x^\circ(t), u^\circ(t)) - H^\theta(x^\circ(t), u^\circ(t)) \big)
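A crude discrete-time sketch of this update for the scalar cubic example of the next section, using the basis as reconstructed there. The step sizes, discount rate, and the Euler/finite-difference discretization are our own choices, not from the talk; the Step 4 smoothing can replace the raw time-differences used here.

```python
import numpy as np

# SDQ-learning sketch for dx/dt = -x^3 + u, c(x,u) = x^2/2 + u^2/2,
# with H^theta(x,u) = c(x,u) + th1*x^2 + th2*x*u/(1 + 2x^2).
dt, T, gamma = 0.01, 200.0, 1.0          # illustrative constants
theta = np.zeros(2)

def c(x, u): return 0.5 * x**2 + 0.5 * u**2
def u_explore(t, A=0.5): return A * (np.sin(t) + np.sin(np.pi * t) + np.sin(np.e * t))

def H(th, x, u):      return c(x, u) + th[0] * x**2 + th[1] * x * u / (1 + 2 * x**2)
def Hbar(th, x):      # min over u, attained at u = -th[1]*x/(1 + 2x^2)
    return 0.5 * x**2 + th[0] * x**2 - 0.5 * (th[1] * x / (1 + 2 * x**2))**2
def gradH(th, x, u):  return np.array([x**2, x * u / (1 + 2 * x**2)])
def gradHbar(th, x):  return np.array([x**2, -th[1] * (x / (1 + 2 * x**2))**2])

x = 0.0
Hbar_prev, gbar_prev = Hbar(theta, x), gradHbar(theta, x)
for k in range(int(T / dt)):
    t = k * dt
    u = u_explore(t)
    x = x + dt * (-x**3 + u)                       # Euler step of the system
    Hbar_now, gbar_now = Hbar(theta, x), gradHbar(theta, x)
    # Bellman error L_t and the stochastic-approximation update
    # (theta varies slowly, so the finite differences are taken as if theta were frozen).
    L = (Hbar_now - Hbar_prev) / dt + gamma * (c(x, u) - H(theta, x, u))
    gain = 1.0 / (10.0 + t)                        # vanishing step size
    theta = theta - gain * dt * L * ((gbar_now - gbar_prev) / dt - gamma * gradH(theta, x, u))
    Hbar_prev, gbar_prev = Hbar_now, gbar_now

print("learned parameters:", theta)
```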

Page 39: Q-Learning and Pontryagin's Minimum Principle

Outline

Step 1: Recognize · Step 2: Find a stabilizing policy · Step 3: Optimality · Step 4: Adjoint · Step 5: Interpret

Example: Decentralized control

Coarse models - what to do with them?

Q-learning for nonlinear state space models

Example: Local approximation

Page 40: Q-Learning and Pontryagin's Minimum Principle

Q learning - Local Learning

Cubic nonlinearity:

\frac{d}{dt}x = -x^3 + u, \qquad c(x,u) = \tfrac{1}{2}x^2 + \tfrac{1}{2}u^2

Page 41: Q-Learning and Pontryagin's Minimum Principle

Q learning - Local Learning

Cubic nonlinearity:

\frac{d}{dt}x = -x^3 + u, \qquad c(x,u) = \tfrac{1}{2}x^2 + \tfrac{1}{2}u^2

HJB:

\min_u \Big\{ \tfrac{1}{2}x^2 + \tfrac{1}{2}u^2 + (-x^3 + u)\,\nabla J^*(x) \Big\} = \gamma J^*(x)
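Carrying out the minimization explicitly (the minimizer is u = −∇J*(x)) reduces the HJB equation to a scalar equation for J* (sketch):

```latex
u^*(x) = -\nabla J^*(x), \qquad
\tfrac{1}{2}x^2 - \tfrac{1}{2}\big(\nabla J^*(x)\big)^2 - x^3\,\nabla J^*(x) = \gamma\, J^*(x).
```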

Page 42: Q-Learning and Pontryagin's Minimum Principle

Q learning - Local Learning

Cubic nonlinearity:

\frac{d}{dt}x = -x^3 + u, \qquad c(x,u) = \tfrac{1}{2}x^2 + \tfrac{1}{2}u^2

HJB:

\min_u \Big\{ \tfrac{1}{2}x^2 + \tfrac{1}{2}u^2 + (-x^3 + u)\,\nabla J^*(x) \Big\} = \gamma J^*(x)

Basis:

H^\theta(x, u) = c(x, u) + \theta^{x} x^2 + \theta^{xu}\, \frac{x}{1 + 2x^2}\, u

Page 43: Q-Learning and Pontryagin's Minimum Principle

Q learning - Local Learning

Cubic nonlinearity:

\frac{d}{dt}x = -x^3 + u, \qquad c(x,u) = \tfrac{1}{2}x^2 + \tfrac{1}{2}u^2

HJB:

\min_u \Big\{ \tfrac{1}{2}x^2 + \tfrac{1}{2}u^2 + (-x^3 + u)\,\nabla J^*(x) \Big\} = \gamma J^*(x)

Basis:

H^\theta(x, u) = c(x, u) + \theta^{x} x^2 + \theta^{xu}\, \frac{x}{1 + 2x^2}\, u

Exploration input:  u(t) = A\big( \sin(t) + \sin(\pi t) + \sin(e\,t) \big)

[Figures: optimal policy, for a low-amplitude input and for a high-amplitude input.]

Page 44: Q-Learning and Pontryagin's Minimum Principle

Outline

Step 1: Recognize · Step 2: Find a stabilizing policy · Step 3: Optimality · Step 4: Adjoint · Step 5: Interpret

Example: Decentralized control

Coarse models - what to do with them?

Q-learning for nonlinear state space models

Example: Local approximation

Page 45: Q-Learning and Pontryagin's Minimum Principle

Multi-agent model

M. Huang, P. E. Caines, and R. P. Malhamé. Large-population cost-coupled LQG problems with nonuniform agents: Individual-mass behavior and decentralized ε-Nash equilibria. IEEE Trans. Auto. Control, 52(9):1560–1571, 2007.

Huang et al.: Local optimization for global coordination.

Page 46: Q-Learning and Pontryagin's Minimum Principle

Multi-agent model

Model: Linear autonomous models - global cost objective

HJB: Individual state + global average

Basis: Consistent with low-dimensional LQG model

Results from five-agent model:

Page 47: Q-Learning and Pontryagin's Minimum Principle

Multi-agent model

Model: Linear autonomous models - global cost objective

HJB: Individual state + global average

Basis: Consistent with low-dimensional LQG model

Results from five-agent model: estimated state feedback gains. Gains for agent 4: Q-learning sample paths, and gains predicted from the ∞-agent limit.

[Figure: individual state and ensemble state vs. time.]

Page 48: Q-Learning and Pontryagin's Minimum Principle

Outline

Step 1: Recognize · Step 2: Find a stabilizing policy · Step 3: Optimality · Step 4: Adjoint · Step 5: Interpret

Example: Decentralized control

... Conclusions

Coarse models - what to do with them?

Q-learning for nonlinear state space models

Example: Local approximation

Page 49: Q-Learning and Pontryagin's Minimum Principle

Conclusions

Coarse models give tremendous insight.

They are also tremendously useful for design in approximate dynamic programming algorithms.

Page 50: Q-Learning and Pontryagin's Minimum Principle

Conclusions

Coarse models give tremendous insight.

They are also tremendously useful for design in approximate dynamic programming algorithms.

Q-learning is as fundamental as the Riccati equation - this should be included in our first-year graduate control courses.

Page 51: Q-Learning and Pontryagin's Minimum Principle

Conclusions

Coarse models give tremendous insight.

They are also tremendously useful for design in approximate dynamic programming algorithms.

Q-learning is as fundamental as the Riccati equation - this should be included in our first-year graduate control courses.

Current research:
Algorithm analysis and improvements
Applications in biology and economics
Analysis of game-theoretic issues in coupled systems

Page 52: Q-Learning and Pontryagin's Minimum Principle

References

. PhD thesis, University of London, London, England, 1967.

D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. American Elsevier Pub. Co., New York, NY, 1970.

C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, 1989.

C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

SIAM J. Control Optim., 38(2):447–469, 2000.

... on policy iteration. Automatica, 45(2):477–484, 2009.

P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin's minimum principle. Submitted to the 48th IEEE Conference on Decision and Control, December 16-18, 2009.

[9] C. Moallemi, S. Kumar, and B. Van Roy. Approximate and data-driven dynamic programming for queueing networks. Preprint available at http://moallemi.com/ciamac/research-interests.php, 2008.