
Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms



Sean P. Meyn
Department of Electrical and Computer Engineering — University of Florida

Based on joint research with Adithya Devraj (University of Florida)

August 28, 2017

Outline

1. Reinforcement Learning ⊂ Stochastic Approximation
2. Fastest Stochastic Approximation
3. Introducing Zap Q-Learning
4. Conclusions
5. References

Stochastic Approximation

$\mathrm{E}[f(\theta, W)]\big|_{\theta = \theta^*} = 0$

Reinforcement Learning ⊂ Stochastic Approximation: Basic Algorithm

What is Stochastic Approximation?

A simple goal: find the solution $\theta^*$ to

$\bar f(\theta^*) := \mathrm{E}[f(\theta, W)]\big|_{\theta = \theta^*} = 0$

What makes this hard?

1. The function $f$ and the distribution of the random vector $W$ may not be known; we may only know something about the structure of the problem.
2. Even if everything is known, computing the expectation may be expensive. For root finding, we may need to compute the expectation for many values of $\theta$.
3. The recursive algorithms we come up with are often slow, and their variance may be infinite (this is typical).


Reinforcement Learning ⊂ Stochastic Approximation: Basic Algorithm

What is Stochastic Approximation? Example: Monte-Carlo

Monte-Carlo Estimation

$c$ is a real-valued function of a random variable $X$. We want to compute the mean $\eta = \mathrm{E}[c(X)]$.

SA formulation: find $\theta^*$ solving $0 = \mathrm{E}[f(\theta, X)] = \mathrm{E}[c(X) - \theta]$.

Algorithm: $\theta(n) = \frac{1}{n}\sum_{i=1}^{n} c(X(i))$

$\Rightarrow\ (n+1)\,\theta(n+1) = \sum_{i=1}^{n+1} c(X(i)) = n\,\theta(n) + c(X(n+1))$

$\Rightarrow\ (n+1)\,\theta(n+1) = (n+1)\,\theta(n) + \big[c(X(n+1)) - \theta(n)\big]$

SA Recursion: $\theta(n+1) = \theta(n) + a_n f(\theta(n), X(n+1))$, with $a_n = 1/(n+1)$ and $f(\theta, x) = c(x) - \theta$.
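To make the equivalence with the running average concrete, here is a minimal Python sketch (not from the original slides); the cost c(x) = x^2 and the Gaussian samples are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def c(x):
    return x ** 2            # illustrative cost: c(x) = x^2, so eta = E[X^2]

N = 100_000
X = rng.normal(size=N + 1)   # samples X(1), ..., X(N)

theta = 0.0
for n in range(N):
    a_n = 1.0 / (n + 1)                    # a_n = 1/(n+1)
    theta += a_n * (c(X[n + 1]) - theta)   # SA recursion with f(theta, x) = c(x) - theta

print("SA estimate :", theta)
print("Running mean:", np.mean(c(X[1:N + 1])))   # identical up to floating-point error
```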


Reinforcement Learning ⊂ Stochastic Approximation: Basic Algorithm

Optimal Control

MDP Model

$X$ is a controlled Markov chain, with input $U$. For all states $x$ and sets $A$,

$\mathrm{P}\{X(t+1) \in A \mid X(t) = x,\ U(t) = u,\ \text{and prior history}\} = P_u(x, A)$

$c : \mathsf{X} \times \mathsf{U} \to \mathbb{R}$ is a cost function, and $\beta < 1$ is a discount factor.

Value function: $h^*(x) = \min_{U} \sum_{t=0}^{\infty} \beta^t\, \mathrm{E}[c(X(t), U(t)) \mid X(0) = x]$

Bellman equation: $h^*(x) = \min_u \big\{ c(x, u) + \beta\, \mathrm{E}[h^*(X(t+1)) \mid X(t) = x,\ U(t) = u] \big\}$
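As a concrete illustration of the Bellman equation, here is a short value-iteration sketch for a small finite MDP; the transition kernels P_u and cost c below are randomly generated placeholders, not an example from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, beta = 5, 2, 0.9

# Made-up MDP: P[u] is the transition matrix P_u, c[x, u] the cost.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)          # normalize rows to probabilities
c = rng.random((n_states, n_actions))

h = np.zeros(n_states)
for _ in range(1000):
    # Q(x, u) = c(x, u) + beta * E[h(X(t+1)) | X(t) = x, U(t) = u]
    Q = c + beta * np.einsum('uxy,y->xu', P, h)
    h_new = Q.min(axis=1)                  # Bellman operator: h(x) = min_u Q(x, u)
    if np.max(np.abs(h_new - h)) < 1e-10:  # fixed point reached
        h = h_new
        break
    h = h_new

print("Value function h*:", h)
print("Greedy policy    :", Q.argmin(axis=1))
```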


Reinforcement Learning ⊂ Stochastic Approximation: Basic Algorithm

Optimal Control: TD-Learning

Approximate value function obtained for a fixed policy:

$h^\theta(x) \approx h(x) = \sum_{t=0}^{\infty} \beta^t\, \mathrm{E}[c(X(t), U(t)) \mid X(0) = x]$

Two flavors:

1. $\min_\theta \mathrm{E}\big[\big(h(X) - h^\theta(X)\big)^2\big]$: search for a solution to

$0 = \mathrm{E}\big[\big(h(X) - h^\theta(X)\big)\,\nabla_\theta h^\theta(X)\big]$

2. Galerkin relaxation of the Bellman equation: for a stationary vector-valued sequence $\{\zeta_t\}$,

$0 = \mathrm{E}\big[\big(c(X(t), U(t)) + \beta\, h^\theta(X(t+1)) - h^\theta(X(t))\big)\,\zeta_t\big]$
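A minimal sketch (not from the original slides) of how the second flavor is turned into a stochastic approximation algorithm when $h^\theta(x) = \theta^{\mathsf T}\phi(x)$ is linear and $\zeta_t = \phi(X(t))$, i.e., the TD(0) choice of eligibility vector. The chain, cost, and features are made-up placeholders, and the policy is held fixed.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, d, beta = 6, 3, 0.9

# Made-up uncontrolled chain (fixed policy), cost, and feature vectors.
P = rng.random((n_states, n_states)); P /= P.sum(axis=1, keepdims=True)
cost = rng.random(n_states)
Phi = rng.normal(size=(n_states, d))       # h_theta(x) = theta @ Phi[x]

theta = np.zeros(d)
x = 0
for n in range(200_000):
    x_next = rng.choice(n_states, p=P[x])
    # temporal difference: c(X(t)) + beta*h_theta(X(t+1)) - h_theta(X(t))
    td = cost[x] + beta * Phi[x_next] @ theta - Phi[x] @ theta
    zeta = Phi[x]                          # zeta_t = grad_theta h_theta(X(t)) = phi(X(t))
    theta += (1.0 / (n + 1)) * td * zeta   # SA step toward the root of 0 = E[td * zeta]
    x = x_next

print("TD(0) estimate of theta:", theta)
```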


Reinforcement Learning ⊂ Stochastic Approximation: Basic Algorithm

Q-Learning: Another Galerkin Relaxation

Bellman equation: with $\underline{Q}^*(x) := \min_u Q^*(x, u)$,

$Q^*(x, u) = c(x, u) + \beta\, \mathrm{E}\big[\underline{Q}^*(X(t+1)) \mid X(t) = x,\ U(t) = u\big]$

Notation and derivation: $h^*(x) = \min_u Q^*(x, u)$, where

$Q^*(x, u) := c(x, u) + \beta\, \mathrm{E}\big[h^*(X(t+1)) \mid X(t) = x,\ U(t) = u\big]$

Galerkin relaxation of the Bellman equation: for a stationary vector-valued sequence $\{\zeta_t\}$,

$0 = \mathrm{E}\Big[\big(c(X(t), U(t)) + \beta\, \underline{Q}^\theta(X(t+1)) - Q^\theta(X(t), U(t))\big)\,\zeta_t\Big]$

General Q-Learning Algorithm: for $t \geq 0$,

$\theta(t+1) = \theta(t) + a_t\big(c(X(t), U(t)) + \beta\, \underline{Q}^{\theta(t)}(X(t+1)) - Q^{\theta(t)}(X(t), U(t))\big)\,\zeta_t$

Does this work?
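In the tabular case (complete basis, with $\zeta_t$ the standard basis vector at the current state-action pair) the general algorithm reduces to Watkins' Q-learning. A sketch on a randomly generated MDP, with a randomized exploration policy and the basic gain $a_t = 1/(t+1)$, which is exactly the slow choice examined later in the talk:

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, beta = 5, 2, 0.9

# Made-up MDP, same data layout as the value-iteration sketch above.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
c = rng.random((n_states, n_actions))

Q = np.zeros((n_states, n_actions))
x = 0
for t in range(500_000):
    u = rng.integers(n_actions)                     # randomized exploration policy
    x_next = rng.choice(n_states, p=P[u, x])
    a_t = 1.0 / (t + 1)                             # illustrative scalar gain
    # Watkins' update: zeta_t is the basis vector for the pair (x, u)
    Q[x, u] += a_t * (c[x, u] + beta * Q[x_next].min() - Q[x, u])
    x = x_next

print("Greedy policy from Q:", Q.argmin(axis=1))
```

Replacing $a_t$ by $g/(t+1)$ with a well-chosen $g$, or by the Zap gain introduced later, is the point of the talk.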


Reinforcement Learning ⊂ Stochastic Approximation: ODE Method

ODE Method

Goal: $\bar f(\theta^*) := \mathrm{E}[f(\theta, W)]\big|_{\theta = \theta^*} = 0$

Method: $\theta(n+1) = \theta(n) + a_n f(\theta(n), X(n))$

ODE: $\frac{d}{dt}\vartheta(t) = \bar f(\vartheta(t))$, with $\theta^*$ a stationary point.

Assumptions (see Borkar's monograph):

1. $\sum a_n = \infty$, $\sum a_n^2 < \infty$. We will take $a_n = (n+1)^{-1}$.
2. $f(\theta(n), X(n)) = \bar f(\theta(n)) + \Delta(n+1) + d(n)$, where $\{\Delta(n)\}$ is white noise and $\{d(n)\}$ is vanishing.
3. The ODE is asymptotically stable, $\bar f$ is Lipschitz, and conditions hold to ensure that $\{\theta(n)\}$ is a bounded sequence.
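A rough numerical illustration of the ODE method (a sketch, not from the talk): for a made-up linear mean field $\bar f(\theta) = A\theta + b$, the SA iterates and an Euler discretization of the ODE on the same step sizes both approach $\theta^* = -A^{-1}b$.

```python
import numpy as np

rng = np.random.default_rng(4)
A = np.array([[-1.0, 0.3], [0.0, -0.5]])     # Hurwitz matrix: fbar(theta) = A theta + b
b = np.array([1.0, -2.0])                    # theta* solves A theta* + b = 0

def f(theta, w):
    return A @ theta + b + w                 # noisy observation of fbar

theta = np.zeros(2)
theta_ode = np.zeros(2)
for n in range(200_000):
    a_n = 1.0 / (n + 1)
    w = rng.normal(size=2)
    theta += a_n * f(theta, w)               # SA iterate
    theta_ode += a_n * (A @ theta_ode + b)   # Euler step of the ODE on the same time scale

print("theta*      :", np.linalg.solve(A, -b))
print("SA estimate :", theta)
print("ODE estimate:", theta_ode)
```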


Fastest Stochastic Approximation

[Figure: sample paths of the estimates θ(k) versus k]

Fastest Stochastic Approximation: Algorithm Performance

Algorithm Performance

Once we have settled that we have convergence, $\lim_{n\to\infty}\theta(n) = \theta^*$, we are very happy to see that one of these algorithms is working great!

[Figure: sample paths of θ(k) for two algorithms; one converges quickly to θ*, the other is not so great]

Two standard approaches to evaluate performance, writing $\tilde\theta(n) := \theta(n) - \theta^*$ for the estimation error:

1. Finite-$n$ bound: $\mathrm{P}\{\|\tilde\theta(n)\| \geq \varepsilon\} \leq \exp(-I(\varepsilon, n))$, with $I(\varepsilon, n) = O(n\varepsilon^2)$.

2. Asymptotic covariance: $\Sigma = \lim_{k\to\infty} k\, \mathrm{E}\big[\tilde\theta(k)\tilde\theta(k)^{\mathsf T}\big]$, so that $\sqrt{n}\,\tilde\theta(n) \approx N(0, \Sigma)$.


Fastest Stochastic Approximation: Algorithm Performance

Finite-n Bounds

Beautiful analysis exists for:

Watkins' Q-learning (Szepesvari)
Speedy Q-learning (Azar et al.)

See also Moulines & Bach for optimization.

Hope for other algorithms? Not today! Even for the Monte-Carlo estimator driven by a geometrically ergodic Markov chain X,

$\theta(n) = \frac{1}{n}\sum_{i=1}^{n} c(X(i)), \quad n \geq 1, \qquad \mathrm{P}\{\|\theta(n) - \eta\| \geq \varepsilon\} \leq\ ?$

Fortunately, the asymptotic theory is complete.


Fastest Stochastic Approximation: Algorithm Performance

Asymptotic Covariance

The asymptotic theory is as complete as for i.i.d. $X$. Consider the asymptotic covariance

$\Sigma = \lim_{n\to\infty} \Sigma_n := \lim_{n\to\infty} n\, \mathrm{E}\big[\tilde\theta(n)\tilde\theta(n)^{\mathsf T}\big]$

Computation is based on a Taylor series: with $A = \frac{d}{d\theta}\bar f\,(\theta^*)$,

$\tilde\theta(n+1) \approx \tilde\theta(n) + \frac{1}{n}\big(A\tilde\theta(n) + \Delta(n+1)\big)$

More Taylor series gives an SA recursion for $\{\Sigma_n\}$:

$\Sigma_{n+1} \approx \Sigma_n + \frac{1}{n}\Big\{\big(A + \tfrac12 I\big)\Sigma_n + \Sigma_n\big(A + \tfrac12 I\big)^{\mathsf T} + \Sigma_\Delta\Big\}$


Fastest Stochastic Approximation: Algorithm Performance

Asymptotic Covariance

SA recursion for $\{\Sigma_n\}$:

$\Sigma_{n+1} \approx \Sigma_n + \frac{1}{n}\Big\{\big(A + \tfrac12 I\big)\Sigma_n + \Sigma_n\big(A + \tfrac12 I\big)^{\mathsf T} + \Sigma_\Delta\Big\}$

Conclusions:

1. If $\mathrm{Re}\,\lambda(A) \geq -\tfrac12$ for some eigenvalue, then $\Sigma$ is (typically) infinite.

2. If $\mathrm{Re}\,\lambda(A) < -\tfrac12$ for all eigenvalues, then $\Sigma = \lim_{n\to\infty}\Sigma_n$ is the unique solution to the Lyapunov equation

$0 = \big(A + \tfrac12 I\big)\Sigma + \Sigma\big(A + \tfrac12 I\big)^{\mathsf T} + \Sigma_\Delta$
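When estimates of $A$ and $\Sigma_\Delta$ are available, the limit can be computed directly from the Lyapunov equation. A small sketch using SciPy's continuous Lyapunov solver; the matrices below are placeholders.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Placeholder linearization and noise covariance.
A = np.array([[-0.8, 0.2],
              [0.1, -1.5]])
Sigma_Delta = np.array([[1.0, 0.3],
                        [0.3, 2.0]])

if np.all(np.linalg.eigvals(A).real < -0.5):
    # Solve 0 = (A + I/2) Sigma + Sigma (A + I/2)^T + Sigma_Delta
    Sigma = solve_continuous_lyapunov(A + 0.5 * np.eye(2), -Sigma_Delta)
    print("Asymptotic covariance:\n", Sigma)
else:
    print("Some eigenvalue has real part >= -1/2: Sigma is (typically) infinite")
```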

Fastest Stochastic Approximation: Algorithm Performance

Asymptotic Covariance

The asymptotic covariance of Q-learning is typically infinite, and it shows!

[Figure: histogram of the estimates θn(15) at n = 10^6, with θ*(15) ≈ 486.6 marked. Example from Devraj & Meyn 2017.]

Fastest Stochastic Approximation: Hidden Theory

Hidden Theory (Seemingly Lost Knowledge)

Reinforcement Learning ⊂ Stochastic Approximation

There exist dozens of RL papers searching for the right matrix gain:

$\theta(n+1) = \theta(n) + G_n f(\theta(n), X(n+1))$

Less ambitious: the best choice of scalar gain, $G_n = g/n$.

Q-learning is often great, but only with the right choice of $g$.


Fastest Stochastic Approximation: Stochastic Newton-Raphson

Optimal Covariance

Introduce a matrix gain:

$\theta(n+1) = \theta(n) + \frac{1}{n+1}\, G_n f(\theta(n), X(n))$

Assume it converges, and linearize: with $A = \frac{d}{d\theta}\bar f\,(\theta^*)$,

$\tilde\theta(n+1) \approx \tilde\theta(n) + \frac{1}{n+1}\, G\big(A\tilde\theta(n) + \Delta(n+1)\big)$

This looks like the Monte-Carlo recursion if $G = G^* := -A^{-1}$.

The result is the optimal covariance: $\Sigma^* = G^* \Sigma_\Delta G^{*\mathsf T}$.

Example: LSTD(λ), but this was not their motivation!
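A small sketch (with a made-up linear model, not an example from the talk) comparing the plain gain $G = I$ with the matrix gain $G^* = -A^{-1}$. The model has an eigenvalue of $A$ at $-0.1$, above the $-1/2$ threshold, so the plain recursion is the slow case.

```python
import numpy as np

rng = np.random.default_rng(5)
A = np.array([[-0.1, 0.0], [0.0, -2.0]])   # eigenvalue -0.1 > -1/2: slow without a matrix gain
b = np.array([1.0, 1.0])
G_star = -np.linalg.inv(A)
theta_star = np.linalg.solve(A, -b)

def run(G, N=200_000):
    theta = np.zeros(2)
    for n in range(N):
        w = rng.normal(size=2)
        f_obs = A @ theta + b + w            # noisy observation f(theta, X(n))
        theta += (1.0 / (n + 1)) * (G @ f_obs)
    return theta

err_plain  = np.linalg.norm(run(np.eye(2)) - theta_star)
err_matrix = np.linalg.norm(run(G_star)    - theta_star)
print("error with gain I  :", err_plain)
print("error with gain G* :", err_matrix)
```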


Fastest Stochastic Approximation: Stochastic Newton-Raphson

Optimal Variance

Example: return to Monte-Carlo,

$\theta(n+1) = \theta(n) + \frac{g}{n+1}\big(-\theta(n) + X(n+1)\big), \qquad \Delta(n) = X(n) - \mathrm{E}[X(n)]$

Normalization for analysis (center the recursion so that $\theta^* = 0$):

$\theta(n+1) = \theta(n) + \frac{g}{n+1}\big(-\theta(n) + \Delta(n+1)\big)$

Example: $X(n) = W^2(n)$, $W \sim N(0,1)$, $\sigma_\Delta^2 = 3$.

[Figure: asymptotic variance as a function of $g$, $\Sigma = \dfrac{\sigma_\Delta^2}{2}\,\dfrac{g^2}{g - 1/2}$, with minimum value $\sigma_\Delta^2$ at $g = 1$]

    g      Σ
    0.5    ∞
    1      3
    10     15.8
    20     30.8

[Figure: SA estimates of E[W^2], W ∼ N(0,1), versus t]

A simulation sketch is given below.
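A simulation sketch (not from the original slides) that estimates the scaled variance $n\,\mathrm{Var}(\theta(n))$ for a few gains $g$ and compares it with the formula $\Sigma = \frac{\sigma_\Delta^2}{2}\frac{g^2}{g - 1/2}$, using a sample estimate of $\sigma_\Delta^2$.

```python
import numpy as np

rng = np.random.default_rng(8)

def scaled_variance(g, N=20_000, trials=400):
    theta = np.zeros(trials)
    for n in range(N):
        X = rng.normal(size=trials) ** 2              # X(n+1) = W^2(n+1), one sample per trial
        theta += g / (n + 1) * (-theta + X)
    return N * theta.var()                            # estimate of n * Var(theta(n))

sigma2_Delta = np.var(rng.normal(size=10**6) ** 2)    # sample estimate of Var(X)
for g in [1.0, 2.0, 5.0]:
    formula = sigma2_Delta / 2 * g**2 / (g - 0.5)
    print(f"g = {g:3.1f}:  simulated n*Var = {scaled_variance(g):6.2f}"
          f"   formula = {formula:6.2f}")
```

The two columns should agree to within Monte-Carlo error; the boundary case g = 0.5 is excluded since the formula diverges there.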

Fastest Stochastic Approximation: Stochastic Newton-Raphson

Optimal Variance

Two ways to achieve $\Sigma^* = A^{-1}\Sigma_\Delta A^{-1\mathsf T}$:

1. Polyak-Ruppert technique. Forget the matrix, and instead
   a) crank up the gain, e.g. $a_n = n^{-2/3}$;
   b) average $\{\theta(n)\}$ a second time.

2. Stochastic Newton-Raphson, $G_n \approx G^*$, such as

$G_n^{-1} = -\frac{1}{n}\sum_{k=1}^{n} A_k, \qquad A_k = \frac{d}{d\theta} f(\theta(k), X(k))$

Design choices: PR is super-simple to implement, but may have large transients. SNR is not universally applicable, but is often amazing, just like ordinary Newton-Raphson. A sketch of the Polyak-Ruppert route appears below.
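A sketch of the Polyak-Ruppert route on the same kind of made-up linear model: a larger gain $a_n = n^{-2/3}$ for the raw iterates, followed by a second round of averaging.

```python
import numpy as np

rng = np.random.default_rng(6)
A = np.array([[-0.1, 0.0], [0.0, -2.0]])
b = np.array([1.0, 1.0])
theta_star = np.linalg.solve(A, -b)

theta = np.zeros(2)
theta_bar = np.zeros(2)
N = 200_000
for n in range(1, N + 1):
    w = rng.normal(size=2)
    theta += n ** (-2.0 / 3.0) * (A @ theta + b + w)   # crank up the gain
    theta_bar += (theta - theta_bar) / n               # average the iterates a second time

print("theta*          :", theta_star)
print("last iterate    :", theta)
print("averaged iterate:", theta_bar)
```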


Zap Q-Learning

[Figure: Bellman error versus n (up to 10^6) for Watkins, Speedy Q-learning, Polyak-Ruppert averaging, and Zap]

Introducing Zap Q-Learning

Zap Q-Learning

Zap Q-Learning ≡ SNR for Q-learning? Not quite. Do we really want to average uniformly over past samples?

$G_n^{-1} = -\frac{1}{n}\sum_{k=1}^{n} A_k, \qquad A_k = \frac{d}{d\theta} f(\theta(k), X(k))$

To better emulate Newton-Raphson, crank up the gain:

$G_{n+1}^{-1} = G_n^{-1} + \gamma_n\big[A_n - G_n^{-1}\big]$

with $\gamma_n \gg 1/n$. Example: $\gamma_n = (1/n)^{0.85}$.

What gives us the courage? Newton-Raphson is not usually globally convergent, even for deterministic root-finding.
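A generic two-time-scale sketch of this idea, not the full Zap Q-learning algorithm of Devraj & Meyn: the matrix estimate is updated with the faster gain $\gamma_n = n^{-0.85}$, and the parameter with gain $1/n$ and $G_n \approx -\widehat A_n^{-1}$. The linear model is a made-up placeholder in which a noisy observation of $A$ is available at each step.

```python
import numpy as np

rng = np.random.default_rng(7)
A = np.array([[-0.1, 0.0], [0.0, -2.0]])   # unknown to the algorithm, observed noisily
b = np.array([1.0, 1.0])
theta_star = np.linalg.solve(A, -b)
d = 2

theta = np.zeros(d)
A_hat = -np.eye(d)                                # initial estimate of A = d/dtheta f
for n in range(1, 200_001):
    w = rng.normal(size=d)
    A_obs = A + 0.1 * rng.normal(size=(d, d))     # noisy observation of A_n
    f_obs = A @ theta + b + w                     # noisy observation of f(theta, X(n))

    gamma_n = n ** (-0.85)                        # faster time scale for the matrix estimate
    A_hat += gamma_n * (A_obs - A_hat)

    G_n = -np.linalg.inv(A_hat)                   # in practice: pseudo-inverse / safeguard
    theta += (1.0 / n) * (G_n @ f_obs)            # slower time scale for theta

print("theta*      :", theta_star)
print("Zap estimate:", theta)
```

In Zap Q-learning itself, $A_n$ comes from the Q-learning update term rather than being observed directly, and it is only piecewise smooth; the technical conditions on the next slide deal with the discontinuities.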


Introducing Zap Q-Learning

Zap Q-Learning: What gives us the courage?

Assumptions:

1. Classical setting, with complete basis
2. Unique optimal policy
3. $\gamma_n = (1/n)^{\delta}$
4. Technical condition to deal with discontinuities

Assumption 1 implies a one-to-one mapping between $Q^\theta$ and its associated cost function $c^\theta$.

Theorem: the ODE, expressed through the associated cost functions, is

$\frac{d}{dt}\, c^{\vartheta(t)} = -c^{\vartheta(t)} + c$


Introducing Zap Q-Learning

Zap Q-Learning

Theorem: everything we hope for. Convergence, optimal asymptotic covariance, and what looks like a good approximation of Newton-Raphson.

Conclusions from numerical experiments:

Watkins' algorithm (with the standard gain $a_n = 1/n$) is worthless.
Watkins' algorithm using $a_n = g^*/n$, with $g^*$ optimizing the asymptotic variance, is often awesome.
Polyak-Ruppert averaging induces large transients, with long-term effects.
Zap ... let's see ...


Introducing Zap Q-Learning

Zap Q-Learning: Optimize Walk to Cafe

[Figure: six-state MDP for the walk-to-the-cafe example]

Convergence with Zap gain γn = n^{-0.85}.

Watkins' algorithm has infinite asymptotic covariance with a_n = 1/n; the optimal scalar gain is approximately a_n = 1500/n.

[Figure: Bellman error versus n (up to 10^6) for Watkins (g = 1500), Speedy Q-learning, Polyak-Ruppert averaging, and Zap-Q with γn = αn and γn = n^{-0.85}. Convergence of Zap-Q learning; discount factor β = 0.99.]

Introducing Zap Q-Learning

Zap Q-Learning: Optimize Walk to Cafe

Convergence with Zap gain γn = n^{-0.85}.

[Figure: theoretical vs. empirical pdf of W_n = √n (θ_n − θ*), entries #10 and #18, from 1000 trials at n = 10^4 and n = 10^6]

The CLT gives a good prediction of finite-n performance. Discount factor: β = 0.99.

Introducing Zap Q-Learning

Zap Q-Learning: Optimize Walk to Cafe

Local convergence: θ(0) initialized in a neighborhood of θ*.

[Figure: Bellman error versus n (10^3 to 10^6) and histograms at n = 10^6 for Watkins with g = 500, 1500, 5000, Speedy Q-learning, Polyak-Ruppert averaging, and Zap-Q with γn ≡ αn and γn ≡ αn^{0.85}; 2σ confidence intervals for the Q-learning algorithms.]

Introducing Zap Q-Learning

Zap Q-Learning: Model of Tsitsiklis and Van Roy, Option Call

State space: R^100. Parameterized Q-function: Q^θ with θ ∈ R^10.

[Figure: Re λi(A) for i = 1, ..., 10 on a logarithmic scale from -10^0 down to -10^-6. Re λ > -1/2 for every eigenvalue, so the asymptotic covariance is infinite.]

The authors observed slow convergence and proposed a matrix gain sequence {Gn} (see refs for details).

[Figure: eigenvalues of A and of GA for the finance example.]

We discovered that the favorite choice of gain in [23] barely meets the criterion Re λ(GA) < -1/2.

Introducing Zap Q-Learning

Zap Q-Learning: Model of Tsitsiklis and Van Roy, Option Call

State space: R^100. Parameterized Q-function: Q^θ with θ ∈ R^10.

[Figure: theoretical vs. empirical pdf of W_n = √n (θ_n − θ*) for Zap-Q and G-Q, entries #1 and #7, from 1000 trials at n = 2 × 10^6]

Conclusions

Conclusions: Hidden Theory?

Reinforcement Learning is cursed by variance, and not simply by dimension. We need better design tools to improve performance.

There is little theory to support useful finite-n bounds, and these bounds give little insight for algorithm improvement.

The asymptotic covariance is an awesome design tool. Example: g = 1500 was chosen based on the asymptotic covariance. It is also predictive of finite-n performance.

Other open questions:

Algorithm design in specific contexts
Adaptive optimization of algorithm parameters


Conclusions: Hidden Theory?

thankful

References

[Slide shows the covers of two monographs: Control Techniques for Complex Networks, by Sean Meyn (Cambridge University Press), and Markov Chains and Stochastic Stability, by S. P. Meyn and R. L. Tweedie.]

Selected References I

[1] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv, July 2017.

[2] A. Benveniste, M. Metivier, and P. Priouret. Adaptive algorithms and stochastic approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990. Translated from the French by Stephen S. Wilson.

[3] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK, 2008.

[4] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000.

[5] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, Cambridge, second edition, 2009. Published in the Cambridge Mathematical Library.

[6] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007. See the last chapter on simulation and average-cost TD learning.

Selected References II

[7] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure. The Annals of Statistics, 13(1):236–245, 1985.

[8] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro process. Technical Report No. 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988.

[9] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i telemekhanika (in Russian); translated in Automat. Remote Control, 51 (1991), pages 98–107, 1990.

[10] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992.

[11] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab., 14(2):796–819, 2004.

[12] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems 24, pages 451–459. Curran Associates, Inc., 2011.

Selected References III

[13] C. Szepesvari. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.

[14] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, Cambridge, UK, 1989.

[15] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

[16] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn., 3(1):9–44, 1988.

[17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.

[18] C. Szepesvari. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th Internat. Conf. on Neural Info. Proc. Systems, pages 1064–1070. MIT Press, 1997.

[19] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011.

[20] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003.

Selected References IV

[21] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Wiley, 2011.

[22] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999.

[23] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and Applications, 16(2):207–239, 2006.

[24] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Mach. Learn., 22(1-3):33–57, 1996.

[25] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn., 49(2-3):233–246, 2002.

[26] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003.

[27] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin's minimum principle. In IEEE Conference on Decision and Control, pages 3598–3605, Dec. 2009.