
Dynamic Programming and Stochastic Control

Dr. Alex Leong

Department of Electrical Engineering (EIM-E), Paderborn University, Germany

[email protected]


Outline

1 Introduction


Introduction

What is dynamic programming (DP)?

Method for solving multi-stage decision problems (sequential decision making).

There is often some randomness in what happens in the future.

Optimize a set of decisions to achieve a good overall outcome.

Richard Bellman popularized DP in the 1950s.


Examples

1) Inventory control

A store sells a product, e.g. ice cream.

Order supplies once a week.

Sales during the week are “random”.

How much supply should the store get to maximize expected profit over summer?

  - Order too little: can't meet demand.
  - Order too much: storage/refrigeration cost.


Examples

2) Parts replacement e.g. bus engine.

At the start of each month, decide whether the engine on a bus should be replaced, to maximize expected profit.

If replace, profit = earnings - replacement cost - maintenance.

If don’t replace, profit = earnings - maintenance.

Earnings will decrease if engine breaks down.

P(Breakdown) is age dependent.


Examples

3) Formula 1 engines, replace or not?

20 races, 4 engines (in 2017)

Decide whether to replace the engine at the start of each race, to maximize the chance of winning the championship.


Examples

4) Queueing (see Figure 1)

Packets arrive at queues 1 and 2.

If both queues transmit at same time, have collision.

If collision, retransmit at next time with a certain probability.

Choose retransmission probabilities to maximize throughput.

Figure 1: Queueing


Examples

5) LQR (Linear Quadratic Regulator)

Linear system: $x_{k+1} = A x_k + B u_k$ (deterministic problem)

Assume knowledge of $x_k$ at time $k$ (perfect state information).

Choose the sequence of $u_k$ to

$$\min_{u_0, u_1, \ldots, u_{N-1}} \; \sum_{k=0}^{N-1} \left( x_k^T Q x_k + u_k^T R u_k \right) + x_N^T Q x_N$$

$N$ = number of stages = horizon. $N$ finite → finite horizon.


Examples

6) $x_{k+1} = A x_k + B u_k + w_k$

$w_k$ = random noise.

Assume $x_k$ known (perfect state information).

Choose the sequence of $u_k$ to

$$\min_{u_0, u_1, \ldots, u_{N-1}} \; \mathbb{E}\left[ \sum_{k=0}^{N-1} \left( x_k^T Q x_k + u_k^T R u_k \right) + x_N^T Q x_N \right]$$


Examples

7) LQG (Linear Quadratic Gaussian) Control

$$x_{k+1} = A x_k + B u_k + w_k$$
$$y_k = C x_k + v_k$$

$v_k, w_k$ Gaussian noise.

Case of imperfect state information.

Based on the measurements $y_k$, choose $u_k$ to

$$\min_{u_0, u_1, \ldots, u_{N-1}} \; \mathbb{E}\left[ \sum_{k=0}^{N-1} \left( x_k^T Q x_k + u_k^T R u_k \right) + x_N^T Q x_N \right]$$


Examples

8) Infinite horizon

$$\min_{u_0, u_1, \ldots} \; \lim_{N \to \infty} \frac{1}{N} \, \mathbb{E}\left[ \sum_{k=0}^{N-1} \left( x_k^T Q x_k + u_k^T R u_k \right) + x_N^T Q x_N \right]$$

Note: Here we divide by $N$, otherwise the summation often blows up.


Examples

9) Shortest paths (see Figure 2)

Find shortest path from A to stage D (Deterministic Problem).

Can solve using the Viterbi algorithm (1967)

Can be regarded as a special case of (forward) DP.

Applications:

  - decoding of convolutional codes (communications)
  - channel equalization (communications)
  - estimation of hidden Markov models (signal processing)

Figure 2: Shortest paths problem


Outline

2 The Dynamic Programming Principle and Dynamic Programming Algorithm

Basic Structure of Dynamic Programming Problem
Dynamic Programming Principle of Optimality
Dynamic Programming Algorithm
Shortest Path Problems


Basic structure of stochastic DP problem

Two ingredients: a discrete time system and a cost function.

1. Discrete time system

$$x_{k+1} = f_k(x_k, u_k, w_k), \quad k = 0, 1, \ldots, N-1 \;\; (\text{or } k = 1, 2, \ldots, N)$$

$k$ is the time index.

$x_k$ is the state at time $k$; it summarizes past information that is relevant for future optimization.

$u_k$ is the control/decision/action at time $k$; it lies in a set $U_k(x_k)$ which may depend on $k$ and $x_k$.

$w_k$ is a random disturbance (noise), with a probability distribution $P(\cdot \mid k, x_k, u_k)$ which may depend on $k$, $x_k$, $u_k$.


Basic structure of stochastic DP problem

$$x_{k+1} = f_k(x_k, u_k, w_k), \quad k = 0, 1, \ldots, N-1$$

$N$ is the horizon, i.e. the number of times control is applied.

$f_k$ is the function that describes how the system evolves over time.

Examples:
  - $f_k = A x_k + B u_k + w_k$ (linear system)
  - $f_k = x_k u_k + w_k$ (non-linear)
  - $f_k = \cos x_k + w_k \sin u_k$ (non-linear)


Basic structure of stochastic DP problem

2. Cost function, which is additive over time:

$$\mathbb{E}\left[ \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) + g_N(x_N) \right]$$

The expectation is used because of the random $w_k$.

$g_k$ is the function that represents the cost at time $k$.

Examples:
  - $g_k = x_k + u_k$
  - $g_k = x_k^2 + C u_k^2$, where $C$ is a constant.

$g_N(x_N)$ is the terminal cost.


Basic structure of stochastic DP problem

Objective: Minimize the cost function over the controls

$$u_0 = \mu_0(x_0), \; u_1 = \mu_1(x_1), \; \ldots, \; u_{N-1} = \mu_{N-1}(x_{N-1})$$

The choice of $u_k$ depends on $x_k$.

Optimization over policies: rules/functions $\mu_k$ for generating $u_k$ for every possible value of $x_k$.

Expected cost of policy $\pi = (\mu_0, \mu_1, \ldots, \mu_{N-1})$ starting at $x_0$ is

$$J_\pi(x_0) = \mathbb{E}\left[ \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) \right]$$

Optimal policy: $\pi^* = \arg\min_\pi J_\pi(x_0)$

Optimal cost starting at $x_0$: $J^*(x_0) = \min_\pi J_\pi(x_0)$


Examples

1) Inventory example

$x_k$ = amount of stock at time $k$.
$u_k$ = stock ordered at time $k$.
$w_k$ = demand at time $k$, with some probability distribution, e.g. uniform.

System: $x_{k+1} = x_k + u_k - w_k \; (= f_k(x_k, u_k, w_k))$

$x_k$ can be negative with this model.

Alternative model: $x_{k+1} = \max(0, x_k + u_k - w_k)$.

Cost function at time $k$: $g_k(x_k, u_k, w_k) = r(x_k) + C u_k$

$r(x_k)$ is the penalty for holding excess stock.

$C$ is the cost per item.


Examples

1) Inventory example (cont.)

Terminal cost: R(xN) is penalty for having excess stock at the end.

Cost function: $\mathbb{E}\left[ \sum_{k=0}^{N-1} (r(x_k) + C u_k) + R(x_N) \right]$

The amount $u_k$ to order can depend on the inventory level $x_k$.

Can have constraints on $u_k$, e.g. $x_k + u_k \le$ max. storage.

Optimization over policies: Find the rule which tells you how much to order for every possible stock level $x_k$.


Examples

2) Example 6 of previous section

System

$$x_{k+1} = \underbrace{A x_k + B u_k + w_k}_{f_k}$$

Cost function

$$\mathbb{E}\left[ \sum_{k=0}^{N-1} \big( \underbrace{x_k^T Q x_k + u_k^T R u_k}_{g_k} \big) + \underbrace{x_N^T Q x_N}_{g_N(x_N)} \right]$$

Objective: Determine $u_k = \mu_k(x_k)$, $k = 0, 1, \ldots, N-1$, to minimize the cost function.

Solution turns out to be $u_k^* = L_k x_k$ for some matrices $L_k$. (Derived in a later lecture.)


Examples

3) Shortest paths (see Figure 3)

Figure 3: Shortest path problem

$x_k$ = which node we're in at stage $k$.
$u_k$ = which path we take to get to stage $k+1$.
$w_k$ = zero.
Cost function = sum of the values along the paths we choose.


Open loop vs. Closed loop

Open loop: Controls (u0, u1, . . . , uN−1) chosen at beginning (time 0).

Closed loop: Policy $(\mu_0, \mu_1, \ldots, \mu_{N-1})$ chosen, where at time $k$, $\mu_k(x_k) = u_k$ can depend on $x_k$.

Can adapt to conditions.

e.g. Inventory problem. If current stock level:
  - $x_k$ high → order less.
  - $x_k$ low → order more.

Closed loop is always at least as good as open loop.

For deterministic problems, open loop is as good as closed loop:
  - can predict exactly the future states given the initial state and the sequence of controls.

For stochastic problems, generally should use closed loop.


D.P. Principle of Optimality

Intuition


Figure 4: Shortest path problem

Consider the shortest path problem in Figure 4.

Shortest path from A to F shown in red: A→C→D→F

Shortest path from C to F: C→D→F.
  - Subpath of the shortest path A→F.

Shortest path from D to F: D→F.
  - Subpath of the shortest path A→F.


D.P. Principle of Optimality

Observation
The shortest path from A to F contains the shortest paths from intermediate nodes to F.

Why?

Suppose there is a shorter path from C to F which is not C→D→F.

Then we can construct a new path A→C→...→F which is shorter than A→C→D→F
⇒ contradicts A→C→D→F being the shortest.


D.P. Principle of Optimality

Formal statement:

Basic problem

$$\min_\pi \; \mathbb{E}\left( \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) \right)$$

Let $\pi^* = \{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\}$ be the optimal policy. Consider the "tail subproblem"

$$\min_{\mu_i, \mu_{i+1}, \ldots, \mu_{N-1}} \; \mathbb{E}\left( \sum_{k=i}^{N-1} g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) \right),$$

where we are at state $x_i$ at time $i$ and we wish to minimize the "cost to go" from time $i$ to time $N$.

The D.P. principle of optimality then says that $\{\mu_i^*, \mu_{i+1}^*, \ldots, \mu_{N-1}^*\}$ is optimal for the tail subproblem.


D.P. Principle of Optimality

"Proof":
If $\{\mu_i, \ldots, \mu_{N-1}\}$ is a better policy for the tail subproblem, then $\{\mu_0^*, \mu_1^*, \ldots, \mu_{i-1}^*, \mu_i, \ldots, \mu_{N-1}\}$ would be a better policy for the original problem
⇒ contradiction of $\{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\}$ being optimal.

How can we make use of the D.P. principle?
Idea: Construct an optimal policy in stages.

Solve the tail subproblem involving the last stage, to obtain $\mu_{N-1}^*$.
Solve the tail subproblem involving the last two stages, making use of $\mu_{N-1}^*$, to obtain $\mu_{N-2}^*$.
Solve the tail subproblem involving the last three stages, making use of $\mu_{N-2}^*, \mu_{N-1}^*$, to obtain $\mu_{N-3}^*$.
...
Solve the tail subproblem involving the last $N$ stages, making use of $\mu_1^*, \ldots, \mu_{N-1}^*$, to obtain $\mu_0^*$.


D.P. Algorithm

Basic problem:

$$\min_\pi \; \mathbb{E}\left\{ \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) \right\}$$

D.P. Algorithm: For each possible $x_k$, compute:

$$J_N(x_N) = g_N(x_N),$$
$$J_k(x_k) = \min_{u_k \in U_k(x_k)} \mathbb{E}\{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) \},$$

for $k = N-1, N-2, \ldots, 1, 0$.

Theorem:
1. Optimal cost $J^*(x_0) = J_0(x_0)$, where $J_0(x_0)$ is the quantity computed by the D.P. algorithm.
2. Let $\mu_k^*(\cdot)$ be the function that generates the minimizing $u_k$ in the D.P. algorithm, i.e. $\mu_k^*(x_k) = u_k^*$. Then $\{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\}$ is the optimal policy for the basic problem.

Proof: See later.
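To make the recursion concrete, here is a minimal sketch (not from the slides) of the backward D.P. algorithm for a problem with finitely many states, controls and disturbance values; the functions `f`, `g`, `g_terminal`, the control sets and the disturbance distribution are placeholders to be supplied by the user.

```python
# Minimal sketch of the backward D.P. algorithm over finite state/control/disturbance
# spaces. All problem data below are hypothetical placeholders.
def dp_solve(states, controls, f, g, g_terminal, w_dist, N):
    """Returns cost-to-go tables J[k][x] and policies mu[k][x] for k = 0..N."""
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = g_terminal(x)                      # J_N(x_N) = g_N(x_N)
    for k in range(N - 1, -1, -1):                   # k = N-1, ..., 1, 0
        for x in states:
            best_u, best_cost = None, float("inf")
            for u in controls(k, x):                 # u_k in U_k(x_k)
                # E{ g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) } over the disturbance w
                exp_cost = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                               for w, p in w_dist(k, x, u))
                if exp_cost < best_cost:
                    best_u, best_cost = u, exp_cost
            J[k][x], mu[k][x] = best_cost, best_u
    return J, mu
```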


D.P. Algorithm

Comments:

D.P. algorithm needs to be run for all possible states xk .

Solves all tail subproblems (we don't know which subproblem will be needed at the start).

Can be computationally expensive if the number of states/controls is large.

Often done on computer.

Suboptimal methods can reduce complexity.


Inventory Example

xk = level of stock at time k.

uk = amount ordered at time k .

wk = demand at time k .

$x_{k+1} = \max(0, x_k + u_k - w_k) = f_k(x_k, u_k, w_k)$; excess demand is lost.

Storage constraint: $x_k + u_k \le 2$

Cost at time $k$ = purchasing cost (cost per item = 1 euro) + storage cost $(x_k + u_k - w_k)^2$

$$= u_k + (x_k + u_k - w_k)^2 = g_k(x_k, u_k, w_k)$$

Terminal cost $g_N(x_N) = 0$.

Probability distribution of $w_k$:

$$P(w_k = 0) = 0.1, \quad P(w_k = 1) = 0.7, \quad P(w_k = 2) = 0.2$$


Inventory Example

Problem: Find the optimal policy for horizon N = 3, i.e.

$$\min_{(\mu_0, \mu_1, \mu_2)} \; \mathbb{E}\left\{ \sum_{k=0}^{2} g_k(x_k, \mu_k(x_k), w_k) \right\}$$

Apply the D.P. algorithm:

$$J_3(x_3) = g_3(x_3) = 0$$
$$J_k(x_k) = \min_{u_k \in U_k} \mathbb{E}\{ u_k + (x_k + u_k - w_k)^2 + J_{k+1}(\max(0, x_k + u_k - w_k)) \}, \quad k = 2, 1, 0$$

Question: What values can xk take?


Inventory Example

Period 2:
Compute $J_2(x_2)$ for all possible values of $x_2$.

$$J_2(0) = \min_{u_2 \in \{0,1,2\}} \mathbb{E}\Big\{ u_2 + (0 + u_2 - w_2)^2 + \underbrace{J_3(x_3)}_{=0 \text{ for all } x_3} \Big\}$$
$$= \min_{u_2 \in \{0,1,2\}} u_2 + \mathbb{E}\{(u_2 - w_2)^2\}$$
$$= \min_{u_2 \in \{0,1,2\}} u_2 + (u_2 - 0)^2 \cdot 0.1 + (u_2 - 1)^2 \cdot 0.7 + (u_2 - 2)^2 \cdot 0.2$$

If $u_2 = 0$: $0.7 \times 1 + 0.2 \times 4 = 1.5$
If $u_2 = 1$: $1 + 0.1 \times 1 + 0.7 \times 0 + 0.2 \times 1 = 1.3$
If $u_2 = 2$: $2 + 0.1 \times 4 + 0.7 \times 1 + 0.2 \times 0 = 3.1$
⇒ $J_2(0) = 1.3$ and $\mu_2^*(0) = 1$


Inventory Example

$$J_2(1) = \min_{u_2 \in \{0,1\}} u_2 + (1 + u_2)^2 \cdot 0.1 + (1 + u_2 - 1)^2 \cdot 0.7 + (1 + u_2 - 2)^2 \cdot 0.2$$

If $u_2 = 0$: $0.3$ (check this!)
If $u_2 = 1$: $2.1$
⇒ $J_2(1) = 0.3$ and $\mu_2^*(1) = 0$

$$J_2(2) = \min_{u_2 \in \{0\}} \mathbb{E}\{ u_2 + (2 + u_2 - w_2)^2 \} = \cdots = 1.1$$

⇒ $J_2(2) = 1.1$ and $\mu_2^*(2) = 0$.


Inventory Example

Period 1:
Compute $J_1(x_1)$ for all possible values of $x_1$.

$$J_1(0) = \min_{u_1 \in \{0,1,2\}} \mathbb{E}\{ u_1 + (u_1 - w_1)^2 + J_2(\max(0, 0 + u_1 - w_1)) \}$$
$$= \min_{u_1 \in \{0,1,2\}} u_1 + \big( u_1^2 + J_2(\max(0, u_1)) \big) 0.1 + \big( (u_1 - 1)^2 + J_2(\max(0, u_1 - 1)) \big) 0.7 + \big( (u_1 - 2)^2 + J_2(\max(0, u_1 - 2)) \big) 0.2$$

$u_1 = 0$: $J_2(0) \cdot 0.1 + (1 + J_2(0)) \cdot 0.7 + (4 + J_2(0)) \cdot 0.2 = 2.8$ (with $J_2(\cdot)$ taken from the previous stage)
$u_1 = 1$: $1 + (1 + J_2(1)) \cdot 0.1 + J_2(0) \cdot 0.7 + (1 + J_2(0)) \cdot 0.2 = 2.5$
$u_1 = 2$: $2 + (4 + J_2(2)) \cdot 0.1 + (1 + J_2(1)) \cdot 0.7 + J_2(0) \cdot 0.2 = 3.68$
⇒ $J_1(0) = 2.5$ and $\mu_1^*(0) = 1$


Inventory Example

$$J_1(1) = \min_{u_1 \in \{0,1\}} \mathbb{E}\{ u_1 + (1 + u_1 - w_1)^2 + J_2(\max(0, 1 + u_1 - w_1)) \}$$

$u_1 = 0$: $1.5$ (check!)
$u_1 = 1$: $2.68$
⇒ $J_1(1) = 1.5$, and $\mu_1^*(1) = 0$

$J_1(2) = 1.68$, $\mu_1^*(2) = 0$ (check!)

Period 0:
Compute $J_0(x_0)$ for all possible $x_0$ (tutorial problem).
Solution: $J_0(0) = 3.7$, $J_0(1) = 2.7$, $J_0(2) = 2.818$
$\mu_0^*(0) = 1$, $\mu_0^*(1) = 0$, $\mu_0^*(2) = 0$
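As a cross-check, the following short script (not part of the original slides) runs the backward D.P. recursion for this inventory example and reproduces the cost-to-go values and policies above.

```python
# Backward D.P. for the 3-stage inventory example: states x in {0,1,2},
# controls u with x+u <= 2, demand w with P(0)=0.1, P(1)=0.7, P(2)=0.2,
# stage cost u + (x+u-w)^2, terminal cost 0.
N = 3
states = [0, 1, 2]
w_dist = {0: 0.1, 1: 0.7, 2: 0.2}

J = {N: {x: 0.0 for x in states}}          # J_3(x) = 0
mu = {}
for k in range(N - 1, -1, -1):
    J[k], mu[k] = {}, {}
    for x in states:
        best_u, best_cost = None, float("inf")
        for u in range(0, 2 - x + 1):      # storage constraint x + u <= 2
            cost = sum(p * (u + (x + u - w) ** 2 + J[k + 1][max(0, x + u - w)])
                       for w, p in w_dist.items())
            if cost < best_cost:
                best_u, best_cost = u, cost
        J[k][x], mu[k][x] = best_cost, best_u

print(J[2])   # {0: 1.3, 1: 0.3, 2: 1.1}
print(J[1])   # {0: 2.5, 1: 1.5, 2: 1.68}
print(J[0])   # approximately {0: 3.7, 1: 2.7, 2: 2.818}
print(mu[0])  # {0: 1, 1: 0, 2: 0}
```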


Scheduling Example

Example: Scheduling problem (deterministic problem)

Four operations need to be performed: A, B, C, D.

B has to occur after A, D has to occur after C.

Costs: $c_{AB} = 2$, $c_{AC} = 3$, $c_{AD} = 4$, $c_{BC} = 3$, $c_{BD} = 1$, $c_{CA} = 4$, $c_{CB} = 4$, $c_{CD} = 6$, $c_{DA} = 3$, $c_{DB} = 3$.

Startup costs: $S_A = 5$, $S_C = 3$.

What is the optimal order?


Scheduling Example

Figure 5: Scheduling problem (tree of states reachable from the start, with the minimum cost to go at each state shown in red)


Scheduling Example

Use the D.P. algorithm.

Let State = set of operations already performed, see Figure 5.

No terminal costs for this problem.

Tail subproblems of length 1.

Easy, only one choice at each state, e.g. if the state is ACD, the next operation has to be B.

Tail subproblems of length 2.

State AB , only one choice, next operation is C.

State AC , if next operation is B: cost = 4 + 1 = 5.

State AC , if next operation is D: cost = 6 + 3 = 9. ⇒ Choose B.

State CA , if next operation is B: cost = 2 + 1 = 3.

State CA , if next operation is D: cost = 4 + 3 = 7. ⇒ Choose B.

State CD , only one choice, next operation is A.


Scheduling Example

Tail subproblems of length 3.

State A , if next operation is B: cost = 2 + 9 = 11.

State A, if next operation is C: cost = 3 + 5 = 8. ⇒ Choose C.

State C , if next operation is A: cost = 4 + 3 = 7.

State C , if next operation is D: cost = 6 + 5 = 11. ⇒ Choose A.

Original problem of length 4.

If start with A: cost = 5 + 8 = 13

If start with C: cost = 3 + 7 = 10 ⇒ Choose C

Therefore, the optimal sequence = CABD , and the optimal cost = 10.
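For completeness, here is a small sketch (not from the slides) that solves this scheduling problem by dynamic programming over pairs (set of completed operations, last operation); with the costs above it should recover the sequence CABD with cost 10.

```python
from functools import lru_cache

# Transition and startup costs from the slides.
c = {('A','B'): 2, ('A','C'): 3, ('A','D'): 4, ('B','C'): 3, ('B','D'): 1,
     ('C','A'): 4, ('C','B'): 4, ('C','D'): 6, ('D','A'): 3, ('D','B'): 3}
startup = {'A': 5, 'C': 3}
ops = {'A', 'B', 'C', 'D'}

def allowed(done, op):
    # Precedence constraints: B after A, D after C.
    if op == 'B' and 'A' not in done: return False
    if op == 'D' and 'C' not in done: return False
    return op not in done

@lru_cache(maxsize=None)
def cost_to_go(done, last):
    # Minimum remaining cost given the completed operations and the last one.
    if len(done) == len(ops):
        return 0, ''
    best = (float('inf'), '')
    for op in ops:
        if allowed(set(done), op):
            tail_cost, tail_seq = cost_to_go(tuple(sorted(done + (op,))), op)
            total = c[(last, op)] + tail_cost
            if total < best[0]:
                best = (total, op + tail_seq)
    return best

best = min(((startup[s] + cost_to_go((s,), s)[0], s + cost_to_go((s,), s)[1])
            for s in startup), key=lambda t: t[0])
print(best)  # expected: (10, 'CABD')
```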


Proof that D.P. Algorithm gives Optimal Solution

Basic problem:

$$\min_\pi \; \mathbb{E}\left\{ \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) \right\}$$

D.P. Algorithm: For each possible $x_k$, compute:

$$J_N(x_N) = g_N(x_N),$$
$$J_k(x_k) = \min_{u_k \in U_k(x_k)} \mathbb{E}\{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) \},$$

for $k = N-1, N-2, \ldots, 1, 0$.

Theorem:
1. Optimal cost $J^*(x_0) = J_0(x_0)$, where $J_0(x_0)$ is the quantity computed by the D.P. algorithm.
2. Let $\mu_k^*(\cdot)$ be the function that generates the minimizing $u_k$ in the D.P. algorithm, i.e. $\mu_k^*(x_k) = u_k^*$. Then $\{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\}$ is the optimal policy for the basic problem.


Proof that D.P. Algorithm gives Optimal Solution

Notation:

Given a policy $\pi = (\mu_0, \mu_1, \ldots, \mu_{N-1})$,

let $\pi^k = (\mu_k, \mu_{k+1}, \ldots, \mu_{N-1})$ = "tail policy"

and $J_k^*(x_k) = \min_{\pi^k} \mathbb{E}\left\{ \sum_{i=k}^{N-1} g_i(x_i, \mu_i(x_i), w_i) + g_N(x_N) \right\}$ be the optimal cost for the tail subproblem.

Let $J_k(x_k)$ = quantity computed by the D.P. algorithm.

Want to show that $J_k^*(x_k) = J_k(x_k)$, for all $x_k, k$.

Proof is by mathematical induction.

Initial step ($k = N$):

By definition of $J_k^*(x_k)$, $J_N^*(x_N) = g_N(x_N)$.
By definition of the D.P. algorithm, $J_N(x_N) = g_N(x_N)$.
⇒ $J_N^*(x_N) = J_N(x_N)$


Proof that D.P. Algorithm gives Optimal Solution

Induction step:

Assume $J_l^*(x_l) = J_l(x_l)$ for $l = N, N-1, \ldots, k+1$.

Want to show that $J_k^*(x_k) = J_k(x_k)$.

From the definition of $J_k^*(x_k)$,

$$J_k^*(x_k) = \min_{\pi^k} \mathbb{E}\left\{ \sum_{i=k}^{N-1} g_i(x_i, \mu_i(x_i), w_i) + g_N(x_N) \right\}$$
$$= \min_{(\mu_k, \pi^{k+1})} \mathbb{E}\left\{ g_k(x_k, \mu_k(x_k), w_k) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) + g_N(x_N) \right\}$$
$$= \min_{\mu_k} \mathbb{E}\left\{ g_k(x_k, \mu_k(x_k), w_k) + \min_{\pi^{k+1}} \mathbb{E}\left[ \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) + g_N(x_N) \right] \right\}$$

by the D.P. principle (optimize the tail subproblem, then $\mu_k$).


Proof that D.P. Algorithm gives Optimal Solution

$$= \min_{\mu_k} \mathbb{E}\{ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}^*(f_k(x_k, \mu_k(x_k), w_k)) \} \quad \text{by definition of } J_{k+1}^*(x_{k+1})$$
$$= \min_{\mu_k} \mathbb{E}\{ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}(f_k(x_k, \mu_k(x_k), w_k)) \} \quad \text{by the induction hypothesis}$$
$$= \min_{u_k \in U_k(x_k)} \mathbb{E}\{ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) \} \quad \text{using the fact that } \min_\mu F(x, \mu(x)) = \min_{u \in U(x)} F(x, u)$$
$$= J_k(x_k) \quad \text{from the D.P. algorithm equations}$$

So $J_k^*(x_k) = J_k(x_k)$, and $\mu_k^*(x_k) = u_k^*$ is the optimal policy.

By induction, this is true for $k = N, N-1, \ldots, 1, 0$.

In particular, $J^*(x_0) = J_0^*(x_0) = J_0(x_0)$ is the optimal cost.


Shortest Paths in a Trellis

Figure 6: Shortest paths in a trellis (initial state s, artificial terminal state t, stages 0 to N)

Find shortest path from a node in Stage 1 to a node in Stage N

states → nodes
controls → arcs
$a_{ij}^k$: cost of transition from state $i$ at stage $k$ to state $j$ at stage $k+1$.
$a_{it}^N$: terminal cost of state $i$.
cost function = length of path from s to t


Shortest Paths in a Trellis

D.P. Algorithm:

$$J_N(i) = a_{it}^N$$
$$J_k(i) = \min_j \, [a_{ij}^k + J_{k+1}(j)], \quad k = N-1, \ldots, 1, 0$$

Optimal cost = J0(s) = length of shortest path from s to t.

Example: Find shortest path from stage 1 to stage 3 in Figure 7.


Figure 7: Shortest paths example


Shortest Paths in a Trellis

Redraw as a trellis with initial and terminal nodes, see Figure 8.


Figure 8: Redrawn shortest paths example

Here $N = 3$.
Call the top node state 1 and the bottom node state 2.

Stage N:
$$J_3(1) = 0, \quad J_3(2) = 0$$


Shortest Paths in a Trellis

Stage 2:

$$J_2(1) = \min\{a_{11}^2 + J_3(1), \; a_{12}^2 + J_3(2)\} = \min\{100 + 0, \; 200 + 0\} = 100$$
$$J_2(2) = \min\{a_{21}^2 + J_3(1), \; a_{22}^2 + J_3(2)\} = \min\{350 + 0, \; 400 + 0\} = 350$$

Stage 1:

$$J_1(1) = \min\{a_{11}^1 + J_2(1), \; a_{12}^1 + J_2(2)\} = \min\{300 + 100, \; 400 + 350\} = 400$$
$$J_1(2) = \min\{a_{21}^1 + J_2(1), \; a_{22}^1 + J_2(2)\} = \min\{150 + 100, \; 50 + 350\} = 250$$

Stage 0:

$$J_0(s) = \min\{0 + J_1(1), \; 0 + J_1(2)\} = 250$$

Shortest path to original problem shown in red in Figure 7.
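A compact sketch of this backward recursion (not from the slides), using the arc costs of Figure 8; it should return the shortest path length 250.

```python
# Backward D.P. over the trellis of Figure 8. a[k][i][j] is the cost from
# state i at stage k to state j at stage k+1 (states 1 and 2), with zero-cost
# arcs from s into stage 1 and zero terminal costs into t.
a = {
    1: {1: {1: 300, 2: 400}, 2: {1: 150, 2: 50}},   # stage 1 -> stage 2
    2: {1: {1: 100, 2: 200}, 2: {1: 350, 2: 400}},  # stage 2 -> stage 3
}
N = 3
J = {N: {1: 0, 2: 0}}                 # terminal costs a^N_{it} = 0
for k in (2, 1):                      # k = N-1, ..., 1
    J[k] = {i: min(a[k][i][j] + J[k + 1][j] for j in (1, 2)) for i in (1, 2)}
J0_s = min(0 + J[1][1], 0 + J[1][2])  # stage 0: zero-cost arcs from s
print(J[2], J[1], J0_s)               # {1: 100, 2: 350} {1: 400, 2: 250} 250
```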


Forward D.P. Algorithm

Observe that the optimal path s→t is also the optimal path t→s if the directions of the arcs are reversed.
⇒ The shortest path algorithm can be run forwards in time (see Bertsekas for equations).
Figure 9 shows the result of forward D.P. on the shortest paths example.

Forward D.P. is useful in real-time applications, where data arrives just before you need to make a decision.

Viterbi algorithm uses this idea.

Shortest paths is a deterministic problem, so forward D.P. works.

For stochastic problems, there is no such concept of forward D.P.
  - Impossible to guarantee that any given state can be reached.


Forward D.P. Algorithm


Figure 9: Forward D.P. on shortest paths example


Viterbi Algorithm Applications

Estimation of hidden Markov models (HMMs)
  - $x_k$ = Markov chain
  - state transitions in $x_k$ not observed (hidden).
  - observe $z_k$; $r(z, i, j)$ = probability we observe $z$ given a transition in the Markov chain $x_k$ from state $i$ to $j$.
  - Estimation problem:
    Given $Z^N = \{z_1, z_2, \ldots, z_N\}$, find a sequence $X^N = \{x_0, x_1, \ldots, x_N\}$ over all possible $\{x_0, x_1, \ldots, x_N\}$ that maximizes $P(X^N \mid Z^N)$.

Note that $P(X^N \mid Z^N) = \frac{P(X^N, Z^N)}{P(Z^N)}$, and $P(Z^N)$ is "constant" given $Z^N$.

So

$$\max_{\{x_0,\ldots,x_N\}} P(X^N \mid Z^N) \;\longleftrightarrow\; \max_{\{x_0,\ldots,x_N\}} P(X^N, Z^N) \;\longleftrightarrow\; \max_{\{x_0,\ldots,x_N\}} \ln P(X^N, Z^N)$$


Viterbi Algorithm Applications

  - After some calculations (see Bertsekas), can show that the problem is equivalent to:

$$\min_{\{x_0,\ldots,x_N\}} \; -\ln(\pi_{x_0}) - \sum_{k=1}^{N} \ln\big( \pi_{x_{k-1} x_k} \, r(z_k, x_{k-1}, x_k) \big)$$

    where $\pi_{x_0}$ = probability of the initial state, $\pi_{x_{k-1} x_k}$ = transition probabilities of the Markov chain, and $-\ln \pi_{x_0}$ and $-\ln(\pi_{x_{k-1} x_k} r(z_k, x_{k-1}, x_k))$ can be regarded as the lengths of the different stages
    ⇒ shortest path problem through a trellis (a code sketch of this is given after the list below)

Decoding of convolutional codes

Channel equalization in presence of ISI (Inter-symbol interference)
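As an illustrative sketch (not part of the slides), the following implements the shortest-path view of the Viterbi algorithm for a small hypothetical two-state HMM; the initial distribution `pi0`, transition matrix `P` and emission probabilities `r` are made-up example values.

```python
import math

# Hypothetical two-state HMM (example values, not from the slides).
pi0 = [0.6, 0.4]                       # P(x_0 = i)
P = [[0.7, 0.3], [0.2, 0.8]]           # P(x_k = j | x_{k-1} = i)
# r[z][i][j] = P(observe z | transition i -> j); here it only depends on j.
r = {0: [[0.9, 0.2], [0.9, 0.2]],
     1: [[0.1, 0.8], [0.1, 0.8]]}

def viterbi(obs):
    """Most likely state sequence = shortest path with arc lengths -ln(P_ij r(z,i,j))."""
    n_states = len(pi0)
    J = [-math.log(pi0[i]) for i in range(n_states)]   # shortest length ending in i
    back = []
    for z in obs:
        new_J, ptr = [], []
        for j in range(n_states):
            costs = [J[i] - math.log(P[i][j] * r[z][i][j]) for i in range(n_states)]
            i_best = min(range(n_states), key=lambda i: costs[i])
            new_J.append(costs[i_best])
            ptr.append(i_best)
        J, back = new_J, back + [ptr]
    # Backtrack the shortest path.
    x = [min(range(n_states), key=lambda i: J[i])]
    for ptr in reversed(back):
        x.append(ptr[x[-1]])
    return list(reversed(x))

print(viterbi([0, 0, 1, 1, 0]))   # most likely x_0,...,x_5 for these example parameters
```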


General Shortest Path Problems

No trellis structure.
e.g. Find the shortest path from each node to node 5 in Figure 10.


Figure 10: General shortest path problem

Graph with $N + 1$ nodes $\{1, 2, \ldots, N, t\}$.
$a_{ij}$ = cost of moving from node $i$ to node $j$.

Find the shortest path from each node i to node t.


General Shortest Path Problems

Assume some $a_{ij}$'s can be negative, but cycles have non-negative length.
  - Then the shortest path will not involve more than $N$ arcs.

Reformulate as a trellis-type shortest path problem with $N$ arcs, by allowing arcs from node $i$ to itself with cost $a_{ii} = 0$.

D.P. algorithm:

$$J_{N-1}(i) = a_{it}$$
$$J_k(i) = \min_j \{ a_{ij} + J_{k+1}(j) \}, \quad k = N-2, \ldots, 1, 0$$

This algorithm is essentially the Bellman-Ford algorithm.

Other algorithms have also been invented, e.g. Dijkstra's algorithm, which can be used when all $a_{ij}$'s are positive.


Outline

3 Problems with Perfect State Information

Linear Quadratic Control
Optimal Stopping Problems


Problems with Perfect State Information

Will study some problems where analytical solutions can be obtained:

Linear quadratic control

Optimal stopping problems

+ others in Chapter 4 of Bertsekas


Linear Quadratic Control

(Linear) System:

$$x_{k+1} = A x_k + B u_k + w_k, \quad k = 0, 1, \ldots, N-1$$

(Quadratic) Cost function:

$$\mathbb{E}\left\{ \sum_{k=0}^{N-1} \left( x_k^T Q x_k + u_k^T R u_k \right) + x_N^T Q x_N \right\}$$

Problem: Determine optimal policy to minimize cost function

xk , uk ,wk are column vectors

A,B,Q,R are matrices.

wk are independent and zero mean.

Q is positive semi-definite.

R is positive definite.


Linear Quadratic Control

Definition:
A symmetric matrix $M$ is positive semi-definite if $x^T M x \ge 0$ for all vectors $x$.
$M$ is positive definite if $x^T M x > 0$ for all $x \neq 0$.

One characterization:

$M$ is positive semi-definite ⇔ all eigenvalues of $M$ are $\ge 0$.

M is positive definite ⇔ all eigenvalues of M are > 0.
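A quick numerical check of this eigenvalue characterization (a minimal sketch, not from the slides):

```python
import numpy as np

def is_positive_semidefinite(M, tol=1e-10):
    """Eigenvalue test for a symmetric matrix M."""
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

def is_positive_definite(M, tol=1e-10):
    return bool(np.all(np.linalg.eigvalsh(M) > tol))

Q = np.array([[2.0, 0.0], [0.0, 0.0]])   # example: PSD but not PD
print(is_positive_semidefinite(Q), is_positive_definite(Q))   # True False
```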

D.P. algorithm applied to this problem gives:

$$J_N(x_N) = x_N^T Q x_N$$
$$J_k(x_k) = \min_{u_k} \mathbb{E}\{ x_k^T Q x_k + u_k^T R u_k + J_{k+1}(A x_k + B u_k + w_k) \}, \quad k = N-1, \ldots, 1, 0.$$


Linear Quadratic Control

Turns out that minimization can be done analytically

$$J_{N-1}(x_{N-1}) = \min_{u_{N-1}} \mathbb{E}\{ x_{N-1}^T Q x_{N-1} + u_{N-1}^T R u_{N-1} + (A x_{N-1} + B u_{N-1} + w_{N-1})^T Q (A x_{N-1} + B u_{N-1} + w_{N-1}) \}$$

$$= \min_{u_{N-1}} \mathbb{E}\{ x_{N-1}^T Q x_{N-1} + u_{N-1}^T R u_{N-1} + x_{N-1}^T A^T Q A x_{N-1} + x_{N-1}^T A^T Q B u_{N-1} + x_{N-1}^T A^T Q w_{N-1} + u_{N-1}^T B^T Q A x_{N-1} + u_{N-1}^T B^T Q B u_{N-1} + u_{N-1}^T B^T Q w_{N-1} + w_{N-1}^T Q A x_{N-1} + w_{N-1}^T Q B u_{N-1} + w_{N-1}^T Q w_{N-1} \}$$

$$= x_{N-1}^T (A^T Q A + Q) x_{N-1} + \mathbb{E}\{ w_{N-1}^T Q w_{N-1} \} + \min_{u_{N-1}} \left\{ u_{N-1}^T (R + B^T Q B) u_{N-1} + 2 x_{N-1}^T A^T Q B u_{N-1} \right\}$$


Linear Quadratic Control

Digression
Problem: $\min_x f(x)$

How to solve?
For unconstrained scalar problems, we can differentiate and set the derivative equal to 0.
e.g. $\min_x (x-2)^2$: $\frac{d}{dx}(x-2)^2 = 2(x-2) = 0 \Rightarrow x^* = 2$.

Similarly, differentiate $u_{N-1}^T (R + B^T Q B) u_{N-1} + 2 x_{N-1}^T A^T Q B u_{N-1}$ with respect to the vector $u_{N-1}$ and set it equal to zero.

Note that
$$\frac{\partial (u^T A u)}{\partial u} = 2 A u, \qquad \frac{\partial (a^T u)}{\partial u} = a,$$
where $a$ and $u$ are column vectors, and $A$ is a symmetric matrix.

Using the above formulas, obtain $2(R + B^T Q B) u_{N-1} + 2 B^T Q A x_{N-1} = 0$
$$\Rightarrow u_{N-1}^* = -(R + B^T Q B)^{-1} B^T Q A x_{N-1}$$


Linear Quadratic Control

Substituting $u_{N-1}^* = -(R + B^T Q B)^{-1} B^T Q A x_{N-1}$ back into the expression for $J_{N-1}(x_{N-1})$, we obtain

$$J_{N-1}(x_{N-1}) = x_{N-1}^T (A^T Q A + Q) x_{N-1} + \mathbb{E}\{ w_{N-1}^T Q w_{N-1} \}$$
$$\quad + x_{N-1}^T A^T Q B (R + B^T Q B)^{-1} (R + B^T Q B) (R + B^T Q B)^{-1} B^T Q A x_{N-1} - 2 x_{N-1}^T A^T Q B (R + B^T Q B)^{-1} B^T Q A x_{N-1}$$
$$= x_{N-1}^T (A^T Q A + Q) x_{N-1} - x_{N-1}^T A^T Q B (R + B^T Q B)^{-1} B^T Q A x_{N-1} + \mathbb{E}\{ w_{N-1}^T Q w_{N-1} \}$$
$$= x_{N-1}^T \left( A^T Q A + Q - A^T Q B (R + B^T Q B)^{-1} B^T Q A \right) x_{N-1} + \mathbb{E}\{ w_{N-1}^T Q w_{N-1} \}$$
$$= x_{N-1}^T K_{N-1} x_{N-1} + \mathbb{E}\{ w_{N-1}^T Q w_{N-1} \}$$

with $K_{N-1} = A^T Q A + Q - A^T Q B (R + B^T Q B)^{-1} B^T Q A$.


Linear Quadratic Control

Continuing on, one can show that

$$u_{N-2}^* = -(B^T K_{N-1} B + R)^{-1} B^T K_{N-1} A x_{N-2},$$

and more generally (tutorial problem) that

$$\mu_k^*(x_k) = -(B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A x_k$$

where

$$K_N = Q,$$
$$K_k = A^T K_{k+1} A + Q - A^T K_{k+1} B (B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A$$
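As a minimal numerical sketch (not from the slides), the backward Riccati recursion and feedback gains can be computed as follows; the matrices A, B, Q, R below are arbitrary example values.

```python
import numpy as np

def lqr_finite_horizon(A, B, Q, R, N):
    """Backward Riccati recursion: returns K_0..K_N and gains L_0..L_{N-1},
    so that the optimal control is u_k = L_k x_k."""
    K = [None] * (N + 1)
    L = [None] * N
    K[N] = Q                                     # K_N = Q
    for k in range(N - 1, -1, -1):
        S = B.T @ K[k + 1] @ B + R
        L[k] = -np.linalg.solve(S, B.T @ K[k + 1] @ A)
        K[k] = A.T @ K[k + 1] @ A + Q - A.T @ K[k + 1] @ B @ np.linalg.solve(S, B.T @ K[k + 1] @ A)
    return K, L

# Example data (made up for illustration).
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
K, L = lqr_finite_horizon(A, B, Q, R, N=20)
print(L[0])   # near-steady-state feedback gain for a horizon of 20
```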


Certainty Equivalence

Certainty Equivalence: The optimal policy is the same as that obtained by solving the problem for the deterministic system

$$x_{k+1} = A x_k + B u_k + \mathbb{E}[w_k],$$

where $w_k$ is replaced by its expected value $\mathbb{E}[w_k] = 0$, i.e. the standard LQR problem.


Asymptotic Behaviour

Definition:

A pair of matrices $(A, B)$, where $A$ is $n \times n$ and $B$ is $n \times m$, is controllable if the $n \times nm$ matrix

$$\begin{bmatrix} B & AB & A^2 B & \cdots & A^{n-1} B \end{bmatrix}$$

has full rank (all rows linearly independent).

A pair $(A, C)$, where $A$ is $n \times n$ and $C$ is $m \times n$, is observable if $(A^T, C^T)$ is controllable.
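A small sketch of this rank test (not from the slides), using the same example matrices as in the Riccati sketch above:

```python
import numpy as np

def is_controllable(A, B):
    """Rank test on the controllability matrix [B, AB, ..., A^{n-1} B]."""
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    ctrb = np.hstack(blocks)          # n x (n*m) controllability matrix
    return np.linalg.matrix_rank(ctrb) == n

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
print(is_controllable(A, B))          # True for this example
```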


Asymptotic Behaviour

Theorem
If $(A, B)$ is controllable and $Q$ can be written as $Q = C^T C$, where $(A, C)$ is observable, then:

1. $K_k \to K$ as $k \to -\infty$, with $K$ satisfying the algebraic Riccati equation
$$K = A^T K A + Q - A^T K B (B^T K B + R)^{-1} B^T K A$$

2. The steady state controller
$$\mu^*(x_k) = L x_k,$$
where $L = -(B^T K B + R)^{-1} B^T K A$, stabilizes the system, i.e. the eigenvalues of $A + BL$ have magnitude $< 1$.

Proof: See Bertsekas.

Note: If $u_k = L x_k$, then $x_{k+1} = A x_k + B u_k + w_k = (A + BL) x_k + w_k$.
$x_k$ stays "bounded" when the eigenvalues of $A + BL$ have magnitude $< 1$.


Other Variations

$x_{k+1} = A_k x_k + B_k u_k + w_k$

$A_k, B_k$ random, unknown, independent. Optimal policy:

$$\mu_k^*(x_k) = -(R + \mathbb{E}\{B_k^T K_{k+1} B_k\})^{-1} \mathbb{E}\{B_k^T K_{k+1} A_k\} x_k,$$

where

$$K_N = Q,$$
$$K_k = \mathbb{E}\{A_k^T K_{k+1} A_k\} + Q - \mathbb{E}\{A_k^T K_{k+1} B_k\} \left( \mathbb{E}\{B_k^T K_{k+1} B_k\} + R \right)^{-1} \mathbb{E}\{B_k^T K_{k+1} A_k\}$$

  - may not have certainty equivalence
  - may not have a steady state solution

$x_{k+1} = A x_k + B_k u_k + w_k$

$B_k$ is random, independent, and is only revealed to us at time $k$.
Motivation: wireless channels.
Similar to Leong, Dey, Anand, "Optimal LQG control over continuous fading channels", Proc. IFAC World Congress, 2011.


Optimal Stopping Problems

At each state, there is a "stop" control that stops the system, i.e. moves to and stays in a stop state.

Pure stopping problem: if the only other control is "continue".

For pure stopping problems, the policy is characterized by a partition of the set of states into:
  - stop region
  - continue region,
which may depend on time.


Example (Asset selling)

A person has an asset for sale, e.g. a house.

At each time $k = 0, 1, \ldots, N-1$, the person receives a random offer $w_k$ for the asset.

Assume the $w_k$'s are independent.

Either accept $w_k$ at time $k+1$, and invest the money at interest rate $r$, or reject $w_k$ and wait for the offer $w_{k+1}$.

Must accept the last offer $w_{N-1}$ at time $N$ if every previous offer was rejected.

Find policy that maximizes (expected) revenue at the N-th period.


Example (Asset selling)

States: If $x_k = T$: asset already sold (= stop state).
If $x_k = w_{k-1}$: offer currently under consideration.

Controls: {accept, reject}

System evolves as:

$$x_{k+1} = f_k(x_k, w_k, u_k) = \begin{cases} T, & \text{if 1) } x_k = T \text{ or 2) } x_k \neq T \text{ and } u_k = \text{accept} \\ w_k, & \text{otherwise.} \end{cases}$$


Example (Asset selling)

Rewards at time k :

$$g_N(x_N) = \begin{cases} x_N, & \text{if } x_N \neq T; \\ 0, & \text{otherwise.} \end{cases}$$

$$g_k(x_k, u_k, w_k) = \begin{cases} (1+r)^{N-k} x_k, & \text{if } x_k \neq T \text{ and } u_k = \text{accept}; \\ 0, & \text{otherwise.} \end{cases}$$

  - (For compound interest over $n$ years, final amount = $(1+r)^n \times$ initial amount.)
  - Note: From the way the rewards are defined, $g_k$ is non-zero for only one $k \in \{0, 1, \ldots, N-1\}$.


Example (Asset selling)

Expected total reward

$$= \mathbb{E}\left[ \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) + g_N(x_N) \right]$$

D.P. algorithm (for reward maximization):

$$J_N(x_N) = g_N(x_N) = \begin{cases} x_N, & \text{if } x_N \neq T; \\ 0, & \text{otherwise.} \end{cases}$$

$$J_k(x_k) = \max_{u_k} \mathbb{E}[ g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1}) ]$$


Example (Asset selling)

If $x_k = T$, then $g_k(x_k, u_k, w_k) = 0$ and $J_{k+1}(x_{k+1}) = 0$, by the property of $g_k$ being non-zero for only one $k$, and the reward being incurred prior to time $k$.

If $x_k \neq T$, then

$$\mathbb{E}[g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1})] = \begin{cases} (1+r)^{N-k} x_k, & \text{if } u_k = \text{accept}; \\ 0 + \mathbb{E}[J_{k+1}(w_k)], & \text{if } u_k = \text{reject}. \end{cases}$$

So

$$J_k(x_k) = \max_{u_k} \mathbb{E}[g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1})] = \begin{cases} \max\big( (1+r)^{N-k} x_k, \; \mathbb{E}[J_{k+1}(w_k)] \big), & \text{if } x_k \neq T, \\ 0, & \text{if } x_k = T, \end{cases}$$

and the optimal policy is of the form:

$u_k$ = accept if $(1+r)^{N-k} x_k > \mathbb{E}[J_{k+1}(w_k)]$

or

$$u_k = \begin{cases} \text{accept}, & \text{if } x_k > \dfrac{\mathbb{E}[J_{k+1}(w_k)]}{(1+r)^{N-k}}; \\ \text{reject}, & \text{otherwise.} \end{cases}$$


Example (Asset selling)

Let

$$\alpha_k = \frac{\mathbb{E}[J_{k+1}(w_k)]}{(1+r)^{N-k}}$$

Can show (see Bertsekas) that $\alpha_k \ge \alpha_{k+1}$ for all $k$ if the $w_k$ are i.i.d.
  - Intuition: an offer acceptable at time $k$ should also be acceptable at time $k+1$. See Figure 11.

Figure 11: Asset selling (thresholds $\alpha_1 \ge \alpha_2 \ge \cdots \ge \alpha_{N-1}$ plotted against $k$, with "accept" above the threshold and "reject" below)
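To illustrate the threshold recursion, here is a minimal sketch (not from the slides) that computes the $\alpha_k$ backwards for an assumed discretized offer distribution (uniform over a grid on [0, 1]) and an assumed interest rate; both are illustrative choices, not values from the course.

```python
import numpy as np

# Illustrative assumptions: offers uniform on a grid in [0, 1], interest rate 5%.
N = 10
r = 0.05
offers = np.linspace(0.0, 1.0, 101)     # possible offer values w_k
p = np.full(offers.shape, 1.0 / offers.size)

# For x != T: J_k(x) = max((1+r)^{N-k} x, E[J_{k+1}(w)]); work backwards.
J_next = offers.copy()                  # J_N(x_N) = x_N for x_N != T
alpha = {}
for k in range(N - 1, -1, -1):
    EJ = float(p @ J_next)              # E[J_{k+1}(w_k)]
    alpha[k] = EJ / (1 + r) ** (N - k)  # threshold: accept if x_k > alpha_k
    J_next = np.maximum((1 + r) ** (N - k) * offers, EJ)

print([round(alpha[k], 3) for k in range(N)])   # non-increasing in k, as stated above
```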


Example (Asset selling)

Can also show that if the $w_k$ are i.i.d. and $N \to \infty$, then the optimal policy "converges" to the stationary policy:

$$u_k = \begin{cases} \text{accept}, & \text{if } x_k > \alpha \\ \text{reject}, & \text{if } x_k \le \alpha \end{cases}$$

where $\alpha$ is a constant.


General Stopping Problems

Pure stopping problem - stop or continue only possible controls

General stopping problem - stop, or choose a control $u_k$ from $U(x_k)$ (where $U$ has more than one element).

Consider the time invariant case: $f(x_k, u_k, w_k)$, $g(x_k, u_k, w_k)$ don't depend on $k$, and $w_k$ is i.i.d.

Stop at time $k$ with cost $t(x_k)$.

Must stop by the last stage.

D.P. algorithm:

$$J_N(x_N) = t(x_N),$$
$$J_k(x_k) = \min\Big[ t(x_k), \; \min_{u_k \in U(x_k)} \mathbb{E}\{ g(x_k, u_k, w_k) + J_{k+1}(f(x_k, u_k, w_k)) \} \Big]$$

Optimal to stop when

$$t(x_k) \le \min_{u_k \in U(x_k)} \mathbb{E}\{ g(x_k, u_k, w_k) + J_{k+1}(f(x_k, u_k, w_k)) \}$$


General Stopping Problems

Stopping set at time $k$ (set of states where you stop) defined as

$$T_k = \Big\{ x \;\Big|\; t(x) \le \min_{u \in U(x)} \mathbb{E}[ g(x, u, w) + J_{k+1}(f(x, u, w)) ] \Big\}$$

Note that $J_{N-1}(x) \le J_N(x)$ for all $x$, since $J_N(x) = t(x)$ and

$$J_{N-1}(x) = \min\Big[ t(x), \; \min_{u \in U(x)} \mathbb{E}[ g(x, u, w) + J_N(f(x, u, w)) ] \Big] \le t(x) = J_N(x)$$

Can show that $J_k(x) \le J_{k+1}(x)$ (monotonicity principle: tutorial problem).

Then we have:

$$T_0 \subseteq T_1 \subseteq T_2 \subseteq \ldots \subseteq T_k \subseteq T_{k+1} \subseteq \ldots \subseteq T_{N-1}$$

i.e. the set of states in which we stop increases with time.


Special Case

If $f(x, u, w) \in T_{N-1}$ for all $x \in T_{N-1}$, $u \in U(x)$, $w$, i.e. the set $T_{N-1}$ is absorbing, then

$$T_0 = T_1 = T_2 = \cdots = T_{N-1}.$$

Proof: See Bertsekas

Simplifies optimal policy, called the one step lookahead policy.


Special Case

E.g. Asset selling with past offers retained

Same situation as before, except that previously rejected offers can be accepted at a later time.

State evolves as
$$x_{k+1} = \max(x_k, w_k)$$
(instead of $x_{k+1} = w_k$ before).

Can show (see Bertsekas) that $T_{N-1} = \{x \mid x \ge \alpha\}$ for some constant $\alpha$.

This set is absorbing, since the best currently received offer cannot decrease over time.
⇒ the optimal policy at every time $k$ is to accept if the best offer $> \alpha$.

Have a constant threshold $\alpha$ even for finite horizon $N$.


Outline

4 Problems with Imperfect State Information

Reformulation as Perfect State Information Problem
Linear Quadratic Control with Noisy Measurements
Sufficient Statistics


Problems with Imperfect State Information

State xk not known to controller.

Instead have “noisy” observations zk of the form:

$$z_0 = h_0(x_0, v_0),$$
$$z_k = h_k(x_k, u_{k-1}, v_k), \quad k = 1, 2, \ldots, N-1,$$

where $v_k$ is "observation noise", with a probability distribution

$$P_v(\cdot \mid x_0, \ldots, x_k, u_0, \ldots, u_{k-1}, w_0, \ldots, w_{k-1}, v_0, \ldots, v_{k-1})$$

which can depend on states, controls and disturbances.

Examples:
$$h_k(x_k, u_{k-1}, v_k) = x_k + v_k, \qquad h_k(x_k, u_{k-1}, v_k) = \sin x_k + u_{k-1} v_k$$


Problems with Imperfect State Information

Initial state x0 is random with distribution Px0 .

uk ∈ Uk , where Uk does not depend on (unknown) xk .

Information vector, i.e. the information available to the controller at time $k$, defined as

$$I_0 = z_0,$$
$$I_k = (z_0, \ldots, z_k, u_0, \ldots, u_{k-1}), \quad k = 1, 2, \ldots, N-1$$

Policies π = (µ0, ..., µN−1), where now µk(Ik) ∈ Uk (before µk(xk)).


Basic Problem with Imperfect State Information

Find π that minimizes the cost function

$$J_\pi = \mathbb{E}\left\{ \sum_{k=0}^{N-1} g_k(x_k, \mu_k(I_k), w_k) + g_N(x_N) \right\}$$

s.t. the system equation

$$x_{k+1} = f_k(x_k, \mu_k(I_k), w_k)$$

and the measurement equation

$$z_k = h_k(x_k, \mu_{k-1}(I_{k-1}), v_k)$$

Question: How to solve this problem?


Reformulation as Perfect State Information Problem

Idea: Define a new system where the state is $I_k$. Then we have the D.P. algorithm etc.

By definition

$$I_{k+1} = (z_0, \ldots, z_k, z_{k+1}, u_0, \ldots, u_{k-1}, u_k) = (\underbrace{z_0, \ldots, z_k, u_0, \ldots, u_{k-1}}_{I_k}, z_{k+1}, u_k)$$

$$\Rightarrow I_{k+1} = (I_k, u_k, z_{k+1}).$$


Reformulation as Perfect State Information Problem

Regard

$$I_{k+1} = (I_k, u_k, z_{k+1})$$

as a dynamical system with state $I_k$, control $u_k$ and disturbance $z_{k+1}$.

Next note that $\mathbb{E}[g_k(x_k, u_k, w_k)] = \mathbb{E}[\mathbb{E}[g_k(x_k, u_k, w_k) \mid I_k, u_k]]$ (recall that $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$).

Define $g_k(I_k, u_k) = \mathbb{E}[g_k(x_k, u_k, w_k) \mid I_k, u_k]$ = cost per stage of the new system, and $g_N(I_N) = \mathbb{E}[g_N \mid I_N]$ = terminal cost.

Cost function becomes

$$\mathbb{E}\left\{ \sum_{k=0}^{N-1} g_k(x_k, \mu_k(I_k), w_k) + g_N(x_N) \right\} = \mathbb{E}\left\{ \sum_{k=0}^{N-1} g_k(I_k, \mu_k(I_k)) + g_N(I_N) \right\}$$


Reformulation as Perfect State Information Problem

D.P. algorithm for reformulated perfect state information problem is:

$$J_N(I_N) = g_N(I_N) = \mathbb{E}[g_N(x_N) \mid I_N]$$

$$J_k(I_k) = \min_{u_k \in U_k} \mathbb{E}\{ g_k(I_k, u_k) + J_{k+1}(I_k, u_k, z_{k+1}) \}$$
$$= \min_{u_k \in U_k} \mathbb{E}\{ g_k(x_k, u_k, w_k) + J_{k+1}(I_k, u_k, z_{k+1}) \mid I_k \}, \quad k = N-1, \ldots, 0$$

Optimal cost $J^* = \mathbb{E}\{ J_0(z_0) \}$


Linear Quadratic Control with Noisy Measurements

System

$$x_{k+1} = A x_k + B u_k + w_k$$

Cost function

$$\mathbb{E}\left[ \sum_{k=0}^{N-1} \big( \underbrace{x_k^T Q x_k + u_k^T R u_k}_{g_k(x_k, u_k, w_k)} \big) + \underbrace{x_N^T Q x_N}_{g_N(x_N)} \right]$$

Observations

$$z_k = C x_k + v_k$$

$w_k$ are independent, zero mean.

From the D.P. algorithm:

$$J_N(I_N) = \mathbb{E}[x_N^T Q x_N \mid I_N],$$


Linear Quadratic Control with Noisy Measurements

$$J_{N-1}(I_{N-1}) = \min_{u_{N-1}} \mathbb{E}\Big\{ x_{N-1}^T Q x_{N-1} + u_{N-1}^T R u_{N-1} + \mathbb{E}\big[ (A x_{N-1} + B u_{N-1} + w_{N-1})^T Q (A x_{N-1} + B u_{N-1} + w_{N-1}) \mid I_N \big] \;\Big|\; I_{N-1} \Big\}$$

$$= \min_{u_{N-1}} \mathbb{E}\Big\{ x_{N-1}^T Q x_{N-1} + u_{N-1}^T R u_{N-1} + (A x_{N-1} + B u_{N-1} + w_{N-1})^T Q (A x_{N-1} + B u_{N-1} + w_{N-1}) \;\Big|\; I_{N-1} \Big\}$$

(using the tower property that $\mathbb{E}(\mathbb{E}(X \mid Y) \mid Z) = \mathbb{E}(X \mid Z)$ if $Y$ contains "more information" than $Z$)

$$= \ldots \; (\text{expand, simplify and use } \mathbb{E}(w_{N-1} \mid I_{N-1}) = 0)$$

$$= \mathbb{E}[x_{N-1}^T (A^T Q A + Q) x_{N-1} \mid I_{N-1}] + \mathbb{E}[w_{N-1}^T Q w_{N-1} \mid I_{N-1}] + \min_{u_{N-1}} \left\{ u_{N-1}^T (B^T Q B + R) u_{N-1} + 2 \, \mathbb{E}[x_{N-1} \mid I_{N-1}]^T A^T Q B u_{N-1} \right\}$$

Differentiate with respect to $u_{N-1}$ and set equal to zero:

$$2 (B^T Q B + R) u_{N-1} + 2 B^T Q A \, \mathbb{E}[x_{N-1} \mid I_{N-1}] = 0$$
$$\Rightarrow u_{N-1}^* = -(B^T Q B + R)^{-1} B^T Q A \, \mathbb{E}[x_{N-1} \mid I_{N-1}]$$


Linear Quadratic Control with Noisy Measurements

Substituting the expression for $u_{N-1}^*$ back in:

$$J_{N-1}(I_{N-1}) = \mathbb{E}[x_{N-1}^T (A^T Q A + Q) x_{N-1} \mid I_{N-1}] + \mathbb{E}[w_{N-1}^T Q w_{N-1}]$$
$$\quad + \mathbb{E}[x_{N-1} \mid I_{N-1}]^T A^T Q B (B^T Q B + R)^{-1} (B^T Q B + R) (B^T Q B + R)^{-1} B^T Q A \, \mathbb{E}[x_{N-1} \mid I_{N-1}]$$
$$\quad - 2 \, \mathbb{E}[x_{N-1} \mid I_{N-1}]^T A^T Q B (B^T Q B + R)^{-1} B^T Q A \, \mathbb{E}[x_{N-1} \mid I_{N-1}]$$
$$= \mathbb{E}[x_{N-1}^T (A^T Q A + Q) x_{N-1} \mid I_{N-1}] + \mathbb{E}(w_{N-1}^T Q w_{N-1}) - \mathbb{E}(x_{N-1} \mid I_{N-1})^T A^T Q B (B^T Q B + R)^{-1} B^T Q A \, \mathbb{E}(x_{N-1} \mid I_{N-1})$$
$$= \mathbb{E}[x_{N-1}^T (A^T Q A + Q) x_{N-1} \mid I_{N-1}] + \mathbb{E}(w_{N-1}^T Q w_{N-1})$$
$$\quad + \mathbb{E}\big[ (x_{N-1} - \mathbb{E}[x_{N-1} \mid I_{N-1}])^T A^T Q B (B^T Q B + R)^{-1} B^T Q A \, (x_{N-1} - \mathbb{E}[x_{N-1} \mid I_{N-1}]) \mid I_{N-1} \big]$$
$$\quad - \mathbb{E}\big[ x_{N-1}^T \underbrace{A^T Q B (B^T Q B + R)^{-1} B^T Q A}_{P_{N-1}} x_{N-1} \mid I_{N-1} \big]$$


Linear Quadratic Control with Noisy Measurements

We have

J_{N-1}(I_{N-1}) = E[ x_{N-1}^T K_{N-1} x_{N-1} | I_{N-1} ] + E[ w_{N-1}^T Q w_{N-1} ]
    + E[ (x_{N-1} - E[x_{N-1} | I_{N-1}])^T P_{N-1} (x_{N-1} - E[x_{N-1} | I_{N-1}]) | I_{N-1} ]

where

P_{N-1} = A^T Q B (B^T Q B + R)^{-1} B^T Q A
K_{N-1} = A^T Q A + Q - P_{N-1}.


Linear Quadratic Control with Noisy Measurements

For period N-2,

J_{N-2}(I_{N-2}) = min_{u_{N-2}} E{ x_{N-2}^T Q x_{N-2} + u_{N-2}^T R u_{N-2} + J_{N-1}(I_{N-1}) | I_{N-2} }

  = E{ x_{N-2}^T Q x_{N-2} | I_{N-2} } + min_{u_{N-2}} [ u_{N-2}^T R u_{N-2} + E{ x_{N-1}^T K_{N-1} x_{N-1} | I_{N-2} } ]
    + E[ (x_{N-1} - E[x_{N-1} | I_{N-1}])^T P_{N-1} (x_{N-1} - E[x_{N-1} | I_{N-1}]) | I_{N-2} ]
    + E( w_{N-1}^T Q w_{N-1} )

Then can obtain

u*_{N-2} = -(B^T K_{N-1} B + R)^{-1} B^T K_{N-1} A E[x_{N-2} | I_{N-2}]

Note that in the above the term

E[ (x_{N-1} - E[x_{N-1} | I_{N-1}])^T P_{N-1} (x_{N-1} - E[x_{N-1} | I_{N-1}]) | I_{N-2} ]

can be taken outside the minimization (see Bertsekas for proof).

I Intuition: the estimation error x_k - E[x_k | I_k] can't be influenced by the choice of control.


Linear Quadratic Control with Noisy Measurements

Continuing on, general solution is:

µ*_k(I_k) = u*_k = -(B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A E[x_k | I_k] = L_k E[x_k | I_k]

where

K_N = Q
P_k = A^T K_{k+1} B (B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A
K_k = A^T K_{k+1} A + Q - P_k

Comparison with perfect state information case:

Lk matrix the same

xk is replaced by E[xk |Ik ]

How to compute E[xk |Ik ]?


Linear Quadratic Control with Noisy Measurements

Summary so far:

System

xk+1 = Axk + Buk + wk

zk = Cxk + vk

Problem

min E[ ∑_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k) + x_N^T Q x_N ]

Optimal solution is

µ*_k(I_k) = -(B^T K_{k+1} B + R)^{-1} B^T K_{k+1} A E[x_k | I_k] = L_k E[x_k | I_k]

where I_k = (z_0, ..., z_k, u_0, ..., u_{k-1})
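The backward recursion for K_k and L_k is straightforward to implement. Below is a minimal NumPy sketch; the matrices A, B, Q, R and the horizon are illustrative values of mine, not from the slides.

```python
import numpy as np

def lq_gains(A, B, Q, R, N):
    """Backward recursion: K_N = Q,
    P_k = A' K_{k+1} B (B' K_{k+1} B + R)^{-1} B' K_{k+1} A,
    K_k = A' K_{k+1} A + Q - P_k,
    L_k = -(B' K_{k+1} B + R)^{-1} B' K_{k+1} A."""
    K = Q.copy()
    L = [None] * N
    for k in range(N - 1, -1, -1):
        M = B.T @ K @ B + R
        L[k] = -np.linalg.solve(M, B.T @ K @ A)             # L_k
        P = A.T @ K @ B @ np.linalg.solve(M, B.T @ K @ A)   # P_k
        K = A.T @ K @ A + Q - P                             # K_k
    return L

# Illustrative 2-state, 1-input system (values assumed, not from the slides):
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[0.5]])
L = lq_gains(A, B, Q, R, N=20)
# The control at time k is then u_k = L[k] @ xhat_k, where xhat_k = E[x_k | I_k].
```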


Linear Quadratic Control with Noisy Measurements

Optimal controller can be decomposed into two parts:

1) An estimator which computes E[xk |Ik ].

2) An actuator which multiplies E[x_k | I_k] with L_k. L_k is the same gain matrix as in the perfect state information case; only x_k is replaced by E[x_k | I_k].

Estimator and actuator can be designed separately.
I Known as the separation principle/theorem


LQG Control

Remaining problem: How do we compute E[xk |Ik ]?

Very difficult problem in general (subject called non-linear filtering).

When the system is linear and w_k, v_k are Gaussian, E[x_k | I_k] can be computed analytically.

I The procedure/algorithm is known as the Kalman Filter (ref: Anderson and Moore, "Optimal Filtering"), and the overall controller is called the LQG (linear quadratic Gaussian) controller


Kalman Filter

System:

xk+1 = Axk + Buk + wk

zk = Cxk + vk

w_k ∼ N(0, Σ_w) i.i.d., Σ_w = E[w_k w_k^T]

v_k ∼ N(0, Σ_v) i.i.d., Σ_v = E[v_k v_k^T]

Define state estimates

x_{k|k} = E[x_k | I_k]
x_{k+1|k} = E[x_{k+1} | I_k]

and estimation error covariance matrices

Σ_{k|k} = E[ (x_k - x_{k|k})(x_k - x_{k|k})^T | I_k ]
Σ_{k+1|k} = E[ (x_{k+1} - x_{k+1|k})(x_{k+1} - x_{k+1|k})^T | I_k ]


Kalman Filter

Then x_{k|k}, x_{k+1|k}, Σ_{k|k}, Σ_{k+1|k} can be computed recursively using the Kalman Filter equations:

x_{k|k} = x_{k|k-1} + Σ_{k|k-1} C^T (C Σ_{k|k-1} C^T + Σ_v)^{-1} (z_k - C x_{k|k-1})
x_{k+1|k} = A x_{k|k} + B u_k
Σ_{k|k} = Σ_{k|k-1} - Σ_{k|k-1} C^T (C Σ_{k|k-1} C^T + Σ_v)^{-1} C Σ_{k|k-1}
Σ_{k+1|k} = A Σ_{k|k} A^T + Σ_w,   k = 0, 1, ..., N-1

Proof: see Bertsekas, or Anderson and Moore.

Beware: Many people who work in Kalman filtering like to use Q for Σ_w, R for Σ_v, and K_k for the "Kalman gain" Σ_{k|k-1} C^T (C Σ_{k|k-1} C^T + Σ_v)^{-1}, but here Q, R, K_k have been used for different things. People also use P_{k+1|k} for Σ_{k+1|k}, P_{k|k} for Σ_{k|k}, etc.
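A minimal NumPy sketch of one step of these equations (measurement update followed by time update); the function and variable names are mine.

```python
import numpy as np

def kf_step(x_pred, Sigma_pred, z, u, A, B, C, Sigma_w, Sigma_v):
    """One Kalman filter step.
    Input:  x_pred = x_{k|k-1}, Sigma_pred = Sigma_{k|k-1}, measurement z_k, control u_k.
    Output: (x_{k|k}, Sigma_{k|k}, x_{k+1|k}, Sigma_{k+1|k})."""
    S = C @ Sigma_pred @ C.T + Sigma_v              # innovation covariance
    G = Sigma_pred @ C.T @ np.linalg.inv(S)         # gain Sigma_{k|k-1} C' S^{-1}
    x_filt = x_pred + G @ (z - C @ x_pred)          # x_{k|k}
    Sigma_filt = Sigma_pred - G @ C @ Sigma_pred    # Sigma_{k|k}
    x_next = A @ x_filt + B @ u                     # x_{k+1|k}
    Sigma_next = A @ Sigma_filt @ A.T + Sigma_w     # Sigma_{k+1|k}
    return x_filt, Sigma_filt, x_next, Sigma_next
```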


Kalman Filter Properties

In general, the mean squared error

E[ (x_k - x̂_k)^T (x_k - x̂_k) | I_k ]

is minimized when x̂_k = E[x_k | I_k].

The Kalman filter equations compute E[x_k | I_k] when the noises are Gaussian, and the (optimal) estimates are linear functions of the measurements z_k.

Even when the noises are not Gaussian, the x_{k|k} computed by the Kalman filter equations gives the best linear estimate of x_k.

I Useful suboptimal solution when noises are non-Gaussian.


Kalman Filter Properties

Recall that if the pair (A, B) is controllable and (A, Q^{1/2}) is observable, the optimal controller has a steady state solution.

Similarly, if (A, C) is observable and (A, Σ_w^{1/2}) is controllable, then Σ_{k|k-1} converges to a steady state value Σ as k → ∞, where Σ satisfies the algebraic Riccati equation

Σ = A Σ A^T - A Σ C^T (C Σ C^T + Σ_v)^{-1} C Σ A^T + Σ_w

So we have a steady state estimator:

x_{k|k} = x_{k|k-1} + Σ C^T (C Σ C^T + Σ_v)^{-1} (z_k - C x_{k|k-1})
x_{k+1|k} = A x_{k|k} + B u_k
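One simple way to compute Σ numerically is to iterate the Riccati recursion for Σ_{k+1|k} until it stops changing; a hedged sketch, assuming the observability/controllability conditions above hold so that the iteration converges.

```python
import numpy as np

def steady_state_sigma(A, C, Sigma_w, Sigma_v, tol=1e-9, max_iter=10000):
    """Iterate Sigma <- A Sigma A' - A Sigma C'(C Sigma C' + Sigma_v)^{-1} C Sigma A' + Sigma_w;
    the fixed point satisfies the algebraic Riccati equation above."""
    Sigma = Sigma_w.copy()
    for _ in range(max_iter):
        S = C @ Sigma @ C.T + Sigma_v
        Sigma_new = (A @ Sigma @ A.T
                     - A @ Sigma @ C.T @ np.linalg.solve(S, C @ Sigma @ A.T)
                     + Sigma_w)
        if np.max(np.abs(Sigma_new - Sigma)) < tol:
            return Sigma_new
        Sigma = Sigma_new
    return Sigma
```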


Sufficient Statistics

Information vector Ik = (z0, .., zk , u0, ..., uk−1)

Dimension of Ik increases with time k.

Inconvenient for large k

Sufficient statistic: a function S_k(I_k) which summarizes all essential content in I_k for computing the optimal control, i.e. µ*_k(I_k) = µ(S_k(I_k)) for some function µ.

Sk(Ik) preferably of smaller dimension than Ik .


Examples of Sufficient Statistics

1) I_k itself

2) Conditional state distribution / belief state P_{x_k|I_k}, assuming that the distribution of v_k depends only on x_{k-1}, u_{k-1}, w_{k-1}.

If the number of states is finite then P_{x_k|I_k} is a vector, e.g. if the states are 1, 2, ..., n, then

P_{x_k|I_k} = ( P(x_k = 1 | I_k), P(x_k = 2 | I_k), ..., P(x_k = n | I_k) )^T

The dimension of the vector is n, which doesn't grow with k.

3) Special case: E[x_k | I_k] is a sufficient statistic for the LQG problem (though not a sufficient statistic in general).


Conditional State Distribution

For the conditional state distribution, P_{x_{k+1}|I_{k+1}} can be generated recursively, as

P_{x_{k+1}|I_{k+1}} = Φ_k(P_{x_k|I_k}, u_k, z_{k+1})

for some function Φ_k(·, ·, ·).

Then the D.P. algorithm can be written as

J_k(P_{x_k|I_k}) = min_{u_k ∈ U_k} E[ g_k(x_k, u_k, w_k) + J_{k+1}(Φ_k(P_{x_k|I_k}, u_k, z_{k+1})) | I_k ].

A general formula for Φ_k(·, ·, ·) can be derived, but is quite complicated (see Bertsekas). Will derive some examples from first principles.


Example 1: Search Problem

At each period, decide whether to search a site that may contain a treasure.

If the treasure is present and we search, we find it with probability β and take it.

States: {treasure present, treasure not present}
Controls: {search, no search}
Regard each search result as an (imperfect) observation of the state.

Let p_k = probability the treasure is present at the start of time k.
I If not search, p_{k+1} = p_k.
I If search and find treasure, p_{k+1} = 0.


Example 1

If search and don’t find treasure,

pk+1 = P(treasure present at k|don’t find at k)

=P(treasure present at k

⋂don’t find at k)

P(don’t find at k)

=pk(1− β)

pk(1− β) + (1− pk),

with (1− pk) corresponding to treasure not present & don’t find.

Thus

pk+1 =

pk , not search at time k0, search and find treasure.

pk (1−β)pk (1−β)+(1−pk ) , search and don’t find treasure

= Φk(pk , uk , zk+1) function.


Example 1

Now let the treasure be worth V, let each search cost C, and suppose that once we decide not to search we can't search again at future times.

D.P. algorithm gives:

J_k(p_k) = max{ 0,  -C + p_k β V + (1 - p_k β) J_{k+1}( p_k(1-β) / (p_k(1-β) + 1 - p_k) ) + p_k β J_{k+1}(0) }

         = max{ 0,  -C + p_k β V + (1 - p_k β) J_{k+1}( p_k(1-β) / (p_k(1-β) + 1 - p_k) ) }

where the first term corresponds to "no search", the second to "search", and p_k β J_{k+1}(0) = 0 since the treasure has already been found.

Can show that J_k(p_k) = 0, ∀ p_k ≤ C/(βV), and that it is optimal to search iff the expected reward p_k β V ≥ the cost of search C. (Tutorial problem)
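A small sketch of the D.P. recursion above on a discretized belief grid; the values of β, V, C and the horizon are illustrative, and J_{k+1} is evaluated between grid points by linear interpolation.

```python
import numpy as np

def search_dp(beta, V, C, N, grid=1001):
    """Backward D.P. for the treasure search problem on a discretized belief grid.
    J_k(p) = max(0, -C + p*beta*V + (1 - p*beta) * J_{k+1}(phi(p))), with J_N = 0,
    where phi(p) = p(1-beta) / (p(1-beta) + 1 - p) is the belief after an unsuccessful search."""
    p = np.linspace(0.0, 1.0, grid)
    phi = p * (1 - beta) / (p * (1 - beta) + 1 - p + 1e-12)
    J = np.zeros(grid)                        # J_N = 0
    for k in range(N):
        J_next = np.interp(phi, p, J)         # J_{k+1}(phi(p)) by linear interpolation
        J = np.maximum(0.0, -C + p * beta * V + (1 - p * beta) * J_next)
    return p, J

p, J0 = search_dp(beta=0.7, V=10.0, C=1.0, N=10)
# J0 is zero for p <= C/(beta*V) and positive above it, consistent with the rule
# "search iff p_k * beta * V >= C".
```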


Example 2: Research Paper*

A process {Pe,k} evolves in the following way, for k = 1, ...,N:

P_{e,k+1} = P,                      if ν_{k+1} γ_{e,k+1} = 1
          = A P_{e,k} A^T + Q,      if ν_{k+1} γ_{e,k+1} = 0

P, A, Q are some matrices

{γ_{e,k}} is an i.i.d. Bernoulli process with P(γ_{e,k} = 1) = λ_e, P(γ_{e,k} = 0) = 1 - λ_e, ∀k

ν_k ∈ {0, 1}

{P_{e,k}} is not observed at all (no observation z_k).

*Leong, Quevedo, Dolz, Dey “On Remote State Estimation in thePresence of an Eavesdropper” Proc. IFAC World Congress, 2017


Example 2

Regard P_{e,k} as the state at time k, and ν_{k+1} as the control. Assume P_{e,0} = P.

Then P_{e,k} ∈ {P, A P A^T + Q, A(A P A^T + Q)A^T + Q, ...} = {P, f(P), f^2(P), ..., f^N(P)}, where

f(P) = A P A^T + Q

The conditional state distribution is

( P(P_{e,k} = P | ν_0, ..., ν_k), P(P_{e,k} = f(P) | ν_0, ..., ν_k), ..., P(P_{e,k} = f^N(P) | ν_0, ..., ν_k) )^T


Example 2

When ν_{k+1} = 0, P_{e,k+1} = f(P_{e,k}) with probability 1. So

( P(P_{e,k+1} = P | ν_0, ..., ν_{k+1}), P(P_{e,k+1} = f(P) | ν_0, ..., ν_{k+1}), ..., P(P_{e,k+1} = f^N(P) | ν_0, ..., ν_{k+1}) )^T
= ( 0, P(P_{e,k} = P | ν_0, ..., ν_k), ..., P(P_{e,k} = f^{N-1}(P) | ν_0, ..., ν_k) )^T

which is the Φ_k(P_{P_{e,k}|I_k}, ν_{k+1}, z_{k+1}) function when ν_{k+1} = 0


Example 2

When ν_{k+1} = 1, P_{e,k+1} = P with probability λ_e, and P_{e,k+1} = f(P_{e,k}) with probability 1 - λ_e. So

( P(P_{e,k+1} = P | ν_0, ..., ν_{k+1}), P(P_{e,k+1} = f(P) | ν_0, ..., ν_{k+1}), ..., P(P_{e,k+1} = f^N(P) | ν_0, ..., ν_{k+1}) )^T
= ( λ_e, (1 - λ_e) P(P_{e,k} = P | ν_0, ..., ν_k), ..., (1 - λ_e) P(P_{e,k} = f^{N-1}(P) | ν_0, ..., ν_k) )^T

which is the Φ_k(P_{P_{e,k}|I_k}, ν_{k+1}, z_{k+1}) function when ν_{k+1} = 1


Outline

5 Suboptimal Methods / Approximate Dynamic Programming
    Certainty Equivalent Control
    Rollout Algorithms
    Model Predictive Control


Suboptimal Methods

Why do we need/want suboptimal methods?

In D.P. need to compute

J_k(x_k) = min_{u_k} E[ g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1}) ]

for all states x_k

1) In many problems, this minimization can’t be done analytically.

Have to test each uk .

When the number of possible x_k, u_k or w_k is large, the amount of computation required can be substantial.


Suboptimal Methods

2) In some problems, x_k, u_k or w_k are continuous valued.

Have to discretize their ranges to convert to a discrete problem, see Fig. 12.

Using more points gives a better approximation, but more computation is required.

Situation worse for higher dimensions - "curse of dimensionality".

Figure 12: Discretization of the range of x_k = [-1, 1] into a finite set of discretization points.


Suboptimal Methods

3) In problems with imperfect state information, the conditional state distribution P_{x_k|I_k} is of the form

P_{x_k|I_k} = ( P(x_k = 1 | I_k), P(x_k = 2 | I_k), ..., P(x_k = n | I_k) )^T

Range is [0, 1]^n (continuous).

Solving imperfect state information problems exactly is intractable except in special or very simple cases.

4) Real time constraints: data not available until shortly before, or data may change as the system is being controlled.


Suboptimal Methods

Will discuss a few methods for suboptimal solutions

Certainty Equivalent Control (CEC)

Rollout Algorithms

Model Predictive Control (MPC)

Many other methods in Vol. I Ch.6 and Vol. II of Bertsekas.


Certainty Equivalent Control

Idea

Replace a stochastic problem with a deterministic one

At each time k, fix the future uncertain quantities to some "typical" values, e.g. replace w_k with E[w_k].

Procedure (Online Version)

At each time k:

(1) Fix w_i, i ≥ k, to some w̄_i. Solve the deterministic problem

min_{u_k, u_{k+1}, ..., u_{N-1}} [ ∑_{i=k}^{N-1} g_i(x_i, u_i, w̄_i) + g_N(x_N) ],

assuming x_{i+1} = f_i(x_i, u_i, w̄_i), i = k, k+1, ..., N-1, u_i ∈ U_i(x_i)

(2) Use the first control in the optimal control sequence {ū_k, ū_{k+1}, ..., ū_{N-1}} found, i.e. set µ_k(x_k) = ū_k


Certainty Equivalent Control

Equivalent Procedure (Offline Version)

(1) Fix w_k to some w̄_k for k = 0, 1, ..., N-1. Solve the deterministic problem

min_{µ_0, µ_1, ..., µ_{N-1}} [ ∑_{k=0}^{N-1} g_k(x_k, µ_k(x_k), w̄_k) + g_N(x_N) ],

assuming x_{k+1} = f_k(x_k, µ_k(x_k), w̄_k), k = 0, 1, ..., N-1, µ_k(x_k) ∈ U_k(x_k)

(2) Let {µ^d_0, µ^d_1, ..., µ^d_{N-1}} be the solution to the problem above. At each time k, apply µ_k(x_k) = µ^d_k(x_k)


Certainty Equivalent Control

Comments:

N problems have to be solved in the online version, one in the offline version.

Online and offline versions give the same controller if the data is not changing. Use the online version if data is changing.

For problems with imperfect state information, also replace x_k by an estimate x̄_k(I_k) (e.g. x̄_k(I_k) = E[x_k | I_k]).

Certainty Equivalent Control often performs well in practice.

For the linear quadratic control problem, the Certainty Equivalent Controller is equivalent to the optimal controller.

Can fix some disturbances while leaving others stochastic, e.g. for imperfect state information problems, replace x_k by x̄_k(I_k) while leaving w_k as stochastic.


Rollout Algorithms

One step lookahead policy, with the optimal cost to go approximated by the cost to go of some base policy.

"Rollout" was coined by Gerald Tesauro in 1996 in the context of rolling dice in a backgammon playing computer program.

A given backgammon position is evaluated by "rolling out" many games starting from that position, and taking the average.

Rollout policy has a cost improvement property

Often produces substantial improvement over base policy.


Rollout Algorithms

One step lookahead policy

At each k and x_k we use the control µ̄_k(x_k) that solves the problem

min_{u_k ∈ U_k} E{ g_k(x_k, u_k, w_k) + J̃_{k+1}(f_k(x_k, u_k, w_k)) }

where J̃_N = g_N, and J̃_{k+1} is an approximation to the true cost to go J_{k+1}.

Rollout policy

When the approximation J̃_k is the cost to go of some heuristic base policy.


Example: Quiz problem

N questions given.

Question i answered correctly with probability pi , reward vi if correct.

Quiz terminates at first incorrect answer.

Choose order of questions to maximize total reward.

Index policy: answer questions in decreasing order of p_i v_i / (1 - p_i)

I Index policy is optimal when there are no other constraints (Ch. 4.5 Bertsekas).

Now assume there is a limit (< N) on the maximum number of questions to be answered.

I Then index policy in general is not optimal.


Example: Quiz problem

Rollout algorithm: use the index policy as base policy.
I At a state denoting the subset of questions already answered, compute the expected reward R(j) for each possible next question j, assuming the order of the remaining questions follows the index policy.
I Answer the question with maximum R(j).

R(j) can be computed analytically, since given an order of questions (i_1, i_2, ..., i_M), with M ≤ N, the expected reward is

p_{i_1}( v_{i_1} + p_{i_2}( v_{i_2} + p_{i_3}( ... + p_{i_M} v_{i_M} ) ... ) ).
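A sketch of one rollout step with the index policy as base heuristic; the question data and the limit on the number of questions are illustrative, and the helper names are mine.

```python
def expected_reward(order, p, v):
    """p_{i1}(v_{i1} + p_{i2}(v_{i2} + ... + p_{iM} v_{iM})) for answering 'order' in sequence."""
    total = 0.0
    for i in reversed(order):
        total = p[i] * (v[i] + total)
    return total

def rollout_next_question(answered, p, v, limit):
    """One rollout step: for each candidate next question j, order the remaining questions
    by p_i v_i / (1 - p_i), keep at most 'limit' questions in total, and pick the j
    with the largest expected reward R(j)."""
    remaining = [i for i in range(len(p)) if i not in answered]
    slots = limit - len(answered)
    if slots <= 0:
        return None
    best_j, best_R = None, float("-inf")
    for j in remaining:
        rest = sorted((i for i in remaining if i != j),
                      key=lambda i: p[i] * v[i] / (1 - p[i] + 1e-12), reverse=True)
        R = expected_reward([j] + rest[:slots - 1], p, v)
        if R > best_R:
            best_j, best_R = j, R
    return best_j

# Illustrative data: 5 questions, at most 3 may be answered.
p = [0.9, 0.8, 0.5, 0.6, 0.3]
v = [1.0, 2.0, 5.0, 3.0, 10.0]
print(rollout_next_question(answered=[], p=p, v=v, limit=3))
```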


Example: Travelling Salesman Problem

N cities

Assume graph is complete

Find the minimum cost tour that visits each city exactly once and returns to the city you started from.

Important and difficult problem in combinatorial optimization.

Figure 13: Travelling Salesman Problem (a complete graph on the cities A, B, C, D, with edge costs shown on the edges).


Example: Travelling Salesman Problem

Nearest neighbour heuristic:
I Start from an arbitrary city.
I Next city visited is the one with minimum distance from the current city (and has not been previously visited)

Rollout algorithm: Use the nearest neighbour heuristic as base policy.
I For each node not yet visited, assume the nearest neighbour heuristic is then run afterwards, and compute the cost of the tour.
I Choose the next city as the one that gives the best tour.


Example: Travelling Salesman Problem

Consider the travelling salesman problem for the graph shown below.

Let a be the node with which we start and end the tour. An optimal tour can be shown to be abcdea, with length 375.

Figure 14: Travelling Salesman Problem

Nearest neighbour (N.N.) heuristic gives tour aedbca with length 550.


Example: Travelling Salesman Problem

Rollout algorithm with nearest neighbour heuristic as base policy:

1st stage:
ab + cdea (completed by N.N.): length = 375
ac + bdea (completed by N.N.): length = 550
ad + ebca (completed by N.N.): length = 625
ae + dbca (completed by N.N.): length = 550
So the next node should be b.

2nd stage:
abc + dea (completed by N.N.): length = 375
abd + eca (completed by N.N.): length = 650
abe + dca (completed by N.N.): length = 675
So the next node is c.


Example: Travelling Salesman Problem

Rollout algorithm with nearest neighbour heuristic as base policy:

3rd stage:
abcd + ea (completed by N.N.): length = 375
abce + da (completed by N.N.): length = 425
So the next node is d.

4th stage:
abcdea = tour computed by rollout, with length 375


Cost Improvement Property of Rollout Algorithm

Theorem:

Let J̄_k(x_k) be the cost to go of the rollout policy. Let J_k(x_k) be the cost to go of the base policy. Then

J̄_k(x_k) ≤ J_k(x_k),   ∀ x_k, k

Proof: Use induction

Initial step:

By definition, J̄_N(x_N) = J_N(x_N) = g_N(x_N), ∀ x_N


Cost Improvement Property of Rollout Algorithm

Induction step:

Assume J̄_l(x_l) ≤ J_l(x_l), ∀ x_l, l = N-1, N-2, ..., k+1

Want to show J̄_k(x_k) ≤ J_k(x_k).

Let µ̄_k(x_k) be the control applied by the rollout policy.
Let µ_k(x_k) be the control applied by the base policy.

Then
J̄_k(x_k) = E[ g_k(x_k, µ̄_k(x_k), w_k) + J̄_{k+1}(f_k(x_k, µ̄_k(x_k), w_k)) ]   by definition of the cost to go function J̄_k
  ≤ E[ g_k(x_k, µ̄_k(x_k), w_k) + J_{k+1}(f_k(x_k, µ̄_k(x_k), w_k)) ]   by the induction hypothesis
  ≤ E[ g_k(x_k, µ_k(x_k), w_k) + J_{k+1}(f_k(x_k, µ_k(x_k), w_k)) ]   by definition of µ̄_k(x_k) being optimal for rollout
  = J_k(x_k)   by definition of the cost to go function J_k

By induction, J̄_k(x_k) ≤ J_k(x_k), ∀ k, x_k.


Difficulties In Using Rollout

For stochastic problems, the cost to go J_k of the base policy may still be difficult to evaluate analytically.

Need to approximate J_k using e.g. Monte Carlo simulations, or certainty equivalence.


Model Predictive Control (MPC)

Originated and widely used in process control industries.

Concepts have since been applied to many areas.

Idea:

Compute a set of m control signals which optimizes the objective over a finite horizon m, using a model that predicts system outputs at future times.

First element of this set is applied to system.

Repeat process at next time step in a receding horizon manner.
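A sketch of the receding horizon loop for a known deterministic model, using scipy.optimize.minimize for the m-step problem at each time; the dynamics and stage cost below are illustrative, and a practical MPC implementation would normally use a dedicated (constrained) solver.

```python
import numpy as np
from scipy.optimize import minimize

def mpc_control(x0, f, stage_cost, m, u_dim=1):
    """Solve the m-step open-loop problem from state x0 (cost-to-go approximation = 0)
    and return only the first control, in receding horizon fashion."""
    def horizon_cost(u_flat):
        u_seq = u_flat.reshape(m, u_dim)
        x, cost = x0, 0.0
        for u in u_seq:
            cost += stage_cost(x, u)
            x = f(x, u)
        return cost
    res = minimize(horizon_cost, np.zeros(m * u_dim))   # unconstrained solver, for illustration
    return res.x[:u_dim]

# Illustrative scalar nonlinear system and quadratic stage cost (assumed, not from the slides):
f = lambda x, u: x + 0.1 * np.sin(x) + 0.1 * u[0]
stage_cost = lambda x, u: x**2 + 0.1 * u[0]**2
x = 2.0
for k in range(5):            # closed-loop simulation: apply the first control, then repeat
    u0 = mpc_control(x, f, stage_cost, m=10)
    x = f(x, u0)
    print(k, x, u0[0])
```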


Model Predictive Control (MPC)

Figure 15: Model Predictive Control.
A: Computed set of control signals at time 0.
B: Computed set of control signals at time 1.
C: Computed set of control signals at time 2.


Model Predictive Control (MPC)

Well suited to systems with control/state constraints, non-linear systems, etc.

Corresponds to an m-step lookahead policy with the cost to go approximation equal to zero.

As m increases, performance "usually" improves (see Bertsekas for counter-examples).

More computation is required for larger m.

For nonlinear systems, computation of the control signals often needs to be done numerically.


Outline

6 Infinite Horizon Problems
    Discounted Cost Problems
    Average Cost Problems


Infinite Horizon Problems

Infinite number of stages.

Assume the system is stationary, i.e. f(·, ·, ·), g(·, ·, ·) and the distribution of w_k don't depend on the time k.

"Different" algorithms needed.

Optimal policies often have a simple stationary form that does not depend on time.

But analysis is more difficult than for finite horizon problems (won't cover in this course).


Types of Infinite Horizon Problems

1 Total cost problems

min_{µ_k} lim_{N→∞} E[ ∑_{k=0}^{N-1} g(x_k, µ_k(x_k), w_k) ]

I Not commonly used because the cost function often goes to infinity.
I A special type called the stochastic shortest path problem, with a cost free termination state, is studied in Bertsekas.

2 Discounted cost problems

min_{µ_k} lim_{N→∞} E[ ∑_{k=0}^{N-1} γ^k g(x_k, µ_k(x_k), w_k) ]

where γ ∈ (0, 1).
I γ is called the discount factor.
I Cost incurred at earlier times is more important than at later times.


Types of Infinite Horizon Problems

3 Average cost problems

min_{µ_k} lim_{N→∞} E[ (1/N) ∑_{k=0}^{N-1} g(x_k, µ_k(x_k), w_k) ]

I Cost incurred in the future is more important than at the beginning.
I Cost function is usually finite, in contrast to total cost problems.
I Optimal average cost is usually independent of the initial state.


Discounted Cost Problems

min_{µ_k} lim_{N→∞} E[ ∑_{k=0}^{N-1} γ^k g(x_k, µ_k(x_k), w_k) ]

Assume

number of states is finite, taking values 1, 2, ..., n.

number of possible controls is finite.


Notation

Given a policy π = {µ_0, µ_1, ...}, the cost of the policy starting at state i is

J_π(i) = lim_{N→∞} E[ ∑_{k=0}^{N-1} γ^k g(x_k, µ_k(x_k), w_k) | x_0 = i ]

If the policy is stationary, i.e. π = {µ, µ, ...}, write J_µ(i) instead of J_π(i).

The optimal cost starting at state i is

J*(i) = min_π lim_{N→∞} E[ ∑_{k=0}^{N-1} γ^k g(x_k, µ_k(x_k), w_k) | x_0 = i ]

An optimal policy π* satisfies J_{π*}(i) = J*(i), ∀i.


Theorem

For the discounted cost problem we have:

(a) The value iteration algorithm

J_{k+1}(i) = min_{u ∈ U(i)} E[ g(i, u, w) + γ J_k(f(i, u, w)) ],   i = 1, 2, ..., n

converges as k → ∞ to the optimal costs J*(i), i = 1, 2, ..., n, starting from arbitrary J_0(i), i = 1, 2, ..., n

(b) The optimal costs J*(i), i = 1, 2, ..., n, satisfy the Bellman equation

J*(i) = min_{u ∈ U(i)} E[ g(i, u, w) + γ J*(f(i, u, w)) ],   i = 1, 2, ..., n


Theorem

(c) Given a stationary policy µ, the cost J_µ(i), i = 1, ..., n satisfies

J_µ(i) = E[ g(i, µ(i), w) + γ J_µ(f(i, µ(i), w)) ],   i = 1, ..., n

Starting from arbitrary J_0(i), i = 1, ..., n, the iteration

J_{k+1}(i) = E[ g(i, µ(i), w) + γ J_k(f(i, µ(i), w)) ]

converges to J_µ(i), i = 1, ..., n

(d) A stationary policy µ is optimal iff. for every state i, µ(i) attains the minimum in the Bellman equation.

(e) The policy iteration algorithm

µ^{k+1}(i) = arg min_{u ∈ U(i)} E[ g(i, u, w) + γ J_{µ^k}(f(i, u, w)) ],   i = 1, 2, ..., n

generates an improving sequence of policies and terminates (in finite time) with an optimal policy.

Proof: see Bertsekas


Comments

Parts (a) and (e) provide algorithms for solving discounted cost problems (like the D.P. algorithm for finite horizon problems). Value iteration (part (a)) requires less computation at every iteration, while policy iteration (part (e)) is guaranteed to terminate in finite time.

In part (c),

J_µ(i) = E[ g(i, µ(i), w) + γ J_µ(f(i, µ(i), w)) ],   i = 1, ..., n

is a system of n linear equations, with which one can solve for J_µ(i), i = 1, ..., n.


Policy Iteration

Starting with a stationary policy µ^0, generate a sequence µ^1, µ^2, ... of stationary policies.

Given µ^k, perform the policy evaluation step, to compute J_{µ^k}(i), i = 1, 2, ..., n, using

J_{µ^k}(i) = E[ g(i, µ^k(i), w) + γ J_{µ^k}(f(i, µ^k(i), w)) ],   i = 1, ..., n

Given J_{µ^k}(·), perform the policy improvement step

µ^{k+1}(i) = arg min_{u ∈ U(i)} E[ g(i, u, w) + γ J_{µ^k}(f(i, u, w)) ]

with J_{µ^k}(f(i, u, w)) the "cost to go of the old policy" (c.f. rollout algorithm)

Terminate when J_{µ^k}(i) = J_{µ^{k+1}}(i), ∀i.
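A sketch of policy iteration for a finite MDP described by transition matrices P[u] (n x n, one per control) and expected stage costs g[i, u], so that E[g(i, u, w) + γ J(f(i, u, w))] = g[i, u] + γ ∑_j P[u][i, j] J(j); this data layout is an assumption of mine.

```python
import numpy as np

def policy_iteration(P, g, gamma):
    """Policy iteration for a discounted finite MDP.
    P[u]: n x n transition matrix under control u;  g[i, u]: expected stage cost."""
    n, m = g.shape
    mu = np.zeros(n, dtype=int)                            # initial stationary policy
    while True:
        # Policy evaluation: solve the n linear equations J = g_mu + gamma * P_mu J
        P_mu = np.array([P[mu[i]][i] for i in range(n)])
        g_mu = g[np.arange(n), mu]
        J = np.linalg.solve(np.eye(n) - gamma * P_mu, g_mu)
        # Policy improvement
        Q = g + gamma * np.stack([P[u] @ J for u in range(m)], axis=1)   # Q[i, u]
        mu_new = np.argmin(Q, axis=1)
        if np.array_equal(mu_new, mu):
            return mu, J                                   # policy (and costs) no longer change
        mu = mu_new
```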


Example

A manufacturer at each time period:

Receives an order with probability p, no order with probability 1− p.

May process all unfilled orders at cost K > 0, or process no orders.

Cost per unfilled order at each time period is C > 0.

max. no. unfilled orders is n.

Find the processing policy that minimizes the discounted cost, with discount factor γ.


Example

Let state = no. of unfilled orders at the start of each period (∈ {0, 1, ..., n}).

Bellman equation:

For states i = 0, 1, ..., n-1, we can either process the orders or not, so the Bellman equation is

J*(i) = min{ K + γ p J*(1) + γ(1-p) J*(0),   C i + γ p J*(i+1) + γ(1-p) J*(i) }

where the first term is the cost of processing and the second is the cost of not processing (in each case a new order is received with probability p, and not received with probability 1-p).

For state i = n, all orders must be processed, so the Bellman equation is

J*(n) = K + γ p J*(1) + γ(1-p) J*(0)

Can show that the optimal policy is a threshold policy: process orders iff. i ≥ m*, where m* is a threshold (see Bertsekas).
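A sketch that applies value iteration (part (a) of the earlier theorem) to this example and reads off the resulting policy; the numerical values of n, K, C, p, γ are illustrative.

```python
import numpy as np

def manufacturer_value_iteration(n, K, C, p, gamma, iters=2000):
    """Value iteration for the order-processing example.
    State i = number of unfilled orders; at i = n all orders must be processed."""
    J = np.zeros(n + 1)
    for _ in range(iters):
        process = K + gamma * (p * J[1] + (1 - p) * J[0])           # cost of processing
        J_new = np.empty_like(J)
        for i in range(n):
            dont = C * i + gamma * (p * J[i + 1] + (1 - p) * J[i])  # cost of not processing
            J_new[i] = min(process, dont)
        J_new[n] = process
        J = J_new
    process = K + gamma * (p * J[1] + (1 - p) * J[0])
    policy = [int(process <= C * i + gamma * (p * J[i + 1] + (1 - p) * J[i]))
              for i in range(n)] + [1]
    return J, policy

J, policy = manufacturer_value_iteration(n=10, K=5.0, C=1.0, p=0.5, gamma=0.9)
print(policy)   # 0/1 per state, 1 = process; typically of threshold form: process iff i >= m*.
```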


Average Cost Problems

min_{µ_k} lim_{N→∞} (1/N) E[ ∑_{k=0}^{N-1} g(x_k, µ_k(x_k), w_k) ]

In most problems of this type, the average cost per stage of a policy is independent of the initial state.

Expresses the cost incurred in the long run; costs incurred in the early stages do not matter.

Analysis is harder than for discounted cost problems (won't cover here).

Assume
I number of states is finite, taking values 1, 2, ..., n.
I number of possible controls is finite.

Also assume that there is some state t such that for all initial states and policies, t is visited infinitely often with probability 1.


Theorem

For the average cost problem, we have:

(a) The optimal average cost per stage λ* is the same for all initial states, and there exists a vector h* = (h*(1), h*(2), ..., h*(n)) satisfying the Bellman equation:

λ* + h*(i) = min_{u ∈ U(i)} E[ g(i, u, w) + h*(f(i, u, w)) ],   i = 1, ..., n

(h* is unique if we fix h*(t) = 0.)
If µ(i) attains the minimum in the Bellman equation for all i, then the stationary policy µ is optimal.

(b) If λ and h satisfy Bellman's equation, then λ is the optimal average cost per stage for each initial state.


Theorem

(c) Given a stationary policy µ with average cost per stage λ_µ, there exists a vector h_µ = (h_µ(1), ..., h_µ(n)) such that

λ_µ + h_µ(i) = E[ g(i, µ(i), w) + h_µ(f(i, µ(i), w)) ],   i = 1, ..., n

(h_µ is unique if we fix h_µ(t) = 0.)

Proof: See Bertsekas

Comment: h is also called the differential cost vector.


Example

A manufacturer at each time period:

Receives an order with probability p, no order with probability 1− p.

May process all unfilled orders at cost K > 0, or process no orders.

Cost per unfilled order at each time period is C > 0.

max. no. unfilled orders is n.

Find processing policy that minimizes the average cost.


Example

State = no. of unfilled orders at the start of each period.
State 0 = the special state t here (will visit this state infinitely often)

Bellman equation:

For states i = 0, 1, ..., n-1, the Bellman equation is

λ* + h*(i) = min{ K + p h*(1) + (1-p) h*(0),   C i + p h*(i+1) + (1-p) h*(i) }

For state n, the Bellman equation is

λ* + h*(n) = K + p h*(1) + (1-p) h*(0)

Optimal policy: Process orders if

K + p h*(1) + (1-p) h*(0) ≤ C i + p h*(i+1) + (1-p) h*(i)

Can again show that a threshold policy is optimal, where the value of the threshold may be different from the value of the threshold in the discounted cost problem.


Algorithms For Average Cost Problems

Value iteration:

Starting from any J_0, compute

J_{k+1}(i) = min_{u ∈ U(i)} E[ g(i, u, w) + J_k(f(i, u, w)) ],   i = 1, ..., n

Have

lim_{k→∞} J_k(i)/k = λ*,   ∀i

Drawbacks of value iteration:

Often components of J_k will diverge to ∞ or -∞, so calculating lim_{k→∞} J_k(i)/k may be tricky.

Doesn't compute a differential cost vector h*.


Algorithms For Average Cost Problems

Relative value iteration:

Subtract a constant (dependent on k) from all components of J_k, so that the difference h_k is bounded, e.g.

h_k(i) = J_k(i) - J_k(s),   i = 1, ..., n

where s is some fixed state. Then the relative value iteration algorithm is:

h_{k+1}(i) = min_{u ∈ U(i)} E[ g(i, u, w) + h_k(f(i, u, w)) ] - min_{u ∈ U(s)} E[ g(s, u, w) + h_k(f(s, u, w)) ],   i = 1, ..., n

Can show that h_k → h* as k → ∞
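A sketch of relative value iteration in the same P[u], g[i, u] finite-MDP format as the policy iteration sketch earlier; s is the fixed reference state.

```python
import numpy as np

def relative_value_iteration(P, g, s=0, iters=5000):
    """Relative value iteration for the average cost problem.
    Returns the differential cost vector h and an estimate of the optimal
    average cost per stage lambda*."""
    n, m = g.shape
    h = np.zeros(n)
    for _ in range(iters):
        T = (g + np.stack([P[u] @ h for u in range(m)], axis=1)).min(axis=1)  # (T h)(i)
        h = T - T[s]                          # subtract the value at the fixed state s
    return h, T[s]
```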


Algorithms For Average Cost Problems

Policy iteration:

Given µ^k, perform the policy evaluation step to compute λ^k and h^k, using the equations:

λ^k + h^k(i) = E[ g(i, µ^k(i), w) + h^k(f(i, µ^k(i), w)) ],   i = 1, ..., n
h^k(t) = 0 for some state t which is visited infinitely often.

Given λ^k and h^k, perform the policy improvement step:

µ^{k+1}(i) = arg min_{u ∈ U(i)} E[ g(i, u, w) + h^k(f(i, u, w)) ],   i = 1, ..., n

Terminate when λ^{k+1} = λ^k and h^{k+1}(i) = h^k(i), i = 1, ..., n.

Policy iteration can be shown to terminate in finite time.


Outline

7 Introduction to Reinforcement Learning


Introduction to Reinforcement Learning

As in the finite horizon case, we want to consider suboptimal methods for solving infinite horizon problems

Also studied in machine learning as reinforcement learning

Many different methods, e.g. Q-learning, TD/SARSA(λ),REINFORCE, . . .

I References: Sutton & Barto, “Reinforcement Learning”, Bertsekas Vol.II


Introduction to Reinforcement Learning

Slight change of notation:

State xk → state sk

Control uk → action ak

Cost function g(., ., .) → Reward function g(., ., .)

Cost minimization   min_{µ_k} lim_{N→∞} E[ ∑_{k=0}^{N-1} γ^k g(x_k, µ_k(x_k), w_k) ]
→ Reward maximization   max_{a_k} lim_{N→∞} E[ ∑_{k=0}^{N-1} γ^k g(s_k, a_k(s_k), w_k) ]


Q-Learning for Discounted Problems

Bellman equation

J*(s) = max_a E[ g(s, a, w) + γ J*(f(s, a, w)) ],   ∀s

J*(s) is the optimal expected future reward when in state s

Introduce now the Q-Bellman equation

Q*(s, a) = E[ g(s, a, w) + γ max_{a'} Q*(s', a') ],   ∀(s, a)

where s' := f(s, a, w).
I The Q-factor Q(s, a) is the expected future reward when in state s and taking action a
I Q* are the optimal Q-factors


Q-Learning for Discounted Problems

Can also solve the Q-Bellman equation using value iteration or policy iteration

Given Q*(s, a), the optimal policy can be computed as

a*(s) = arg max_a Q*(s, a)

Using Q*(s, a) gives the same policy as using J*(s), though it requires more storage

However, one advantage is that Q*(s, a) can be found approximately using e.g. Q-learning


Q-Learning for Discounted Problems

Q-learning algorithm. Repeat:

Generate (s_k, a_k) using any probabilistic mechanism such that all state-action pairs (s, a) are chosen infinitely often

Given (s_k, a_k), update Q(s_k, a_k) as:

Q_{k+1}(s_k, a_k) = Q_k(s_k, a_k) + α_k ( r + γ max_{a'} Q_k(s', a') - Q_k(s_k, a_k) )

where r = g(s_k, a_k, w_k) is the sampled reward, s' = f(s_k, a_k, w_k) is the sampled next state when the current state is s_k and action a_k is applied, and {α_k} is a sequence converging to 0.

Leave all other Q-factors unchanged


Q-Learning for Discounted Problems

Q-learning algorithm converges to the optimal Q-factors Q*(s, a) provided all pairs (s, a) are chosen infinitely often, and the sequence {α_k} satisfies

α_k > 0,   ∑_{k=0}^{∞} α_k = ∞,   ∑_{k=0}^{∞} α_k^2 < ∞

I e.g. α_k = 1/k satisfies this condition

In Sutton & Barto, {(s_k, a_k)} is generated according to:

s_{k+1} := s'
a_{k+1} = random a,                            w.p. ε
        = arg max_a Q_{k+1}(s_{k+1}, a),       w.p. 1 - ε,

for some ε > 0
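A sketch of tabular Q-learning with ε-greedy exploration; step(s, a), which returns a sampled reward and next state, is an assumed simulator interface, and α_k = 1/(number of visits to (s, a)) is one choice satisfying the step size conditions above.

```python
import numpy as np

def q_learning(step, n_states, n_actions, gamma, eps=0.1, n_steps=100000, s0=0):
    """Tabular Q-learning (reward maximization) with epsilon-greedy action selection.
    `step(s, a)` is an assumed simulator returning (sampled reward r, sampled next state s')."""
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))     # used to set alpha_k = 1 / (visits to (s, a))
    s = s0
    for _ in range(n_steps):
        if np.random.rand() < eps:
            a = np.random.randint(n_actions)     # explore
        else:
            a = int(np.argmax(Q[s]))             # exploit
        r, s_next = step(s, a)
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]               # sum alpha = inf, sum alpha^2 < inf
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
    return Q
```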


Function Approximation and Deep Reinforcement Learning

For large problems:
I Too many (state, action) pairs to store in memory
I Too slow to learn the value of each Q*(s, a) individually

Function approximation:
I Regard Q*(s, a) as a function of s and a. Approximate Q*(s, a) by another function Q(s, a, θ) parameterized by a set of weights θ
I Learn the weights θ instead of the entire set of values Q*(s, a)

Deep reinforcement learning: When the weights θ are learnt using a deep neural network, see https://deepmind.com/blog/deep-reinforcement-learning

Spectacular recent advances in AI using deep reinforcement learning, e.g. AlphaGo, AlphaZero


Deep Reinforcement Learning - Further Reading

Overview
I https://deepmind.com/blog/deep-reinforcement-learning
I http://www0.cs.ucl.ac.uk/staff/d.silver/web/Talks.html

Deep Q-Network (DQN) algorithm
I https://arxiv.org/pdf/1312.5602.pdf
I https://keon.io/deep-q-learning

More Advanced
I https://arxiv.org/pdf/1509.02971.pdf
I https://medium.com/tensorflow/deep-reinforcement-learning-playing-cartpole-through-asynchronous-advantage-actor-critic-a3c-7eab2eea5296
