Stochastic Approximation Algorithms with Set-Valued Maps€¦ · 3V.S.Borkar, Stochastic Approximation Algorithms: A Dynamical Systems Viewpoint, Cambridge Univ. Press, 2008 Shalabh

Stochastic Approximation Algorithms with

Set-Valued Maps

Shalabh Bhatnagar

Computer Science and Automation

Indian Institute of Science

Bangalore

November 10, 2019

Shalabh Bhatnagar (IISc) SAA with Set-Valued Maps November 10, 2019 1 / 24

Outline

1 Optimization under noise

2 Stochastic approximation algorithms

3 Random Directions Stochastic Approximation (RDSA)

4 Gradient algorithms with set-valued maps and their analysis

5 Ongoing and future work


Example: Parameter Optimization1

Consider a repeated experiment that gives i.i.d input-output pairs

(Xn,Yn), n ≥ 0, in real time

Goal: Find a best parameterized fit

Yn = fw (Xn) + εn,

i.e., one with the least g(w) =1

2E [‖ εn ‖2]

fw could correspond to polynomials, neural networks, splines,

wavelets etc.

Note ∇g(w) = −E [< Yn − fw (Xn),∇fw (Xn) >]

1V.S.Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint,

Cambridge Univ Press, 2008


Example: Parameter Optimization

Problem: Cannot find the expectation

Solution: Drop the expectation!

Gradient Scheme with Noise:

wn+1 = wn + a(n) < Yn − fwn(Xn),∇fwn(Xn) >

= wn + a(n)(−∇g(wn) + Mn+1),

where Mn+1 is the noise term

Algorithms of this type are called Stochastic Approximation

Algorithms


Stochastic Approximation2

Objective: Solve the equation J(x) = 0 when analytical form of J

is not known, however, ‘noisy’ measurements J(x) + Mn+1 can be

obtained

J(.)

J( n+1

Mn+1

x x) +M

The Robbins-Monro Algorithm:

xn+1 = xn + a(n)(J(xn) + Mn+1) (1)

2H.Robbins and S.Monro Annals of Mathematical Statistics, 22: 400–407, 1951


A Convergence Result3

(C1) J : RN → RN is Lipschitz continuous

(C2)∑

n

a(n) = ∞,∑

n

a(n)2 < ∞

(C3) Mn+1, n ≥ 0 is a martingale difference w.r.t.

{Fn�= σ(xm,Mm,m ≤ n)}. Further, for some K > 0,

E [‖ Mn+1 ‖2| Fn] ≤ K (1+ ‖ xn ‖2)

(C4) supn ‖ xn ‖< ∞ almost surely

Let x∗ be the unique globally asymptotically stable attractor for the

ODE x(t) = J(x(t)). Then

Theorem (Convergence of SAA): Under (C1)–(C4), {xn}converges almost surely to x∗.

3V.S.Borkar, Stochastic Approximation Algorithms: A Dynamical Systems

Viewpoint, Cambridge Univ. Press, 2008


Optimization under Noise

Let J : RN → R be a given objective function having the form

J(x) = Eμ[h(x , μ)], where μ denotes ‘noise’ and Eμ[·] is the

expectation under that noise

x

J(x)

h(x,

h(x,

μ1)

μ3)

h(x, μ2)

Goal: Find x∗ s.t. J(x∗) = minx∈RN J(x)


Gradient Estimation using RDSA4

Run two simulations with parameters

x + δd =

⎛⎜⎜⎜⎜⎝

x1 + δd1

x2 + δd2

··

xN + δdN

⎞⎟⎟⎟⎟⎠ , x − δd =

⎛⎜⎜⎜⎜⎝

x1 − δd1

x2 − δd2

··

xN − δdN

⎞⎟⎟⎟⎟⎠

where d1, . . . , dN are independent random variables with

distribution U[−η, η]

Gradient Estimator:

∇J(x) =3

η2d

J(x + δd)− J(x − δd)

2δ

4L.A.Prashanth, S.Bhatnagar, M.Fu and S.Marcus, IEEE Transactions on

Automatic Control, 62(5):2223-2238, 2017


Hessian Estimator for RDSA

Hessian Estimator:

∇2J(x) =9

2η4R

(J(x + δd) + J(x − δd)− 2J(x)

δ2

),

where

R =

⎡⎢⎢⎣

52(d

1)2 − η2/3 · · · d1dN

d2d1 · · · d2dN

· · · · · · · · ·dNd1 · · · 5

2(dN)2 − η2/3

⎤⎥⎥⎦ .


Main Convergence Result

RDSA Algorithm:

xn+1 = xn − a(n)Γ(∇2J(xn))−1∇J(xn)

except that δ is replaced with δn ↓ 0 and

∑n

a(n) = ∞;∑

n

(a(n)

δn

)2

< ∞ (2)

Let x∗ be the unique globally asymptotically stable equilibrium of

the ODE

x(t) = −Γ(∇2J(x))−1∇J(x)

Let a(n) = 1/nα and δn = 1/nγ with α− γ > 0.5 and

β�= α− 2γ > 0

Theorem: Under (C1) on ∇J, (2), (C3) and (C4)1 xn

a.s.→ x∗

2 nβ/2(xn − x∗)dist→ N (μ,Ω)


Sufficient Conditions for Stability of SA5

(C5)

(i) Jc(x)�= J(cx)/c, c ≥ 1 satisfies Jc → J∞, for some J∞ : RN → RN

uniformly on compacts

(ii) The origin in RN is a unique globally asymptotically stable

equilibrium for the ODE x(t) = J∞(x(t))

(iii) There is a unique globally asymptotically stable equilibrium

x∗ ∈ RN for the ODE x(t) = J(x(t))

The Stability Theorem: Under (C1)-(C3), (C5), for any initial

condition x0 ∈ RN , supn ‖ xn ‖< ∞ a.s. Further, xna.s.→ x∗.

5V. S. Borkar and S. P. Meyn, SIAM Journal of Control and Optimization,

38(2):447-469, 2000


Analysis of Algorithms with Biases67

Let ∇J(x) denote an estimator for ∇J(x) s.t.

‖ ∇J(x)−∇J(x) ‖≤ ε(δ) → 0 as δ → 0

ε J(x)

J(x)

ΔΔ

Consider the recursion xn+1 = xn − a(n)(∇J(xn) + εn), where

‖ εn ‖≤ ε ∀n6A.Ramaswamy and S.Bhatnagar, IEEE Transactions on Automatic Control,

63(5):1465-1471, 20187A.Ramaswamy and S.Bhatnagar, Mathematics of Operations Research,

42(3):648-661, 2017


Marchaud Map

x1

h(x1)

Rn R

m

x2

x3

x

h(x)

h(x3)

h(x2

) y1

y2

y3

y

A set-valued map h is called Marchaud if

h(x) is convex and compact for each x

supw∈h(x)

‖ w ‖ ≤ K (1+ ‖ x ‖) for each x

h is upper-semicontinuous, i.e., given {xn} ⊂ Rn and {yn} ⊂ Rm

with xn → x and yn → y with yn ∈ h(xn), ∀n, we have y ∈ h(x)


Differential Inclusion

Consider the differential inclusion (DI) in Rd :

x(t) ∈ H(x(t)), (3)

where H : Rd → {subsets of Rd} is Marchaud. Then the above DI

has at least one solution x and each solution is absolutely

continuous.8

8J.Aubin and A.Cellina, Differential Inclusions: Set-Valued Maps and Viability

Theory, Springer, 1984


Semiflow

The Set-Valued Semiflow Φ associated with (3) is defined on

[0,∞)×Rd as

Φ(t , x) = {x(t) | x ∈ Σ, x(0) = x},

where Σ is the set of all absolutely continuous solutions to (3).

x

φ( t,x)

For B × L ⊂ [0,∞)×Rd , let

Φ(B, L) = ∪t∈B,x∈LΦ(t , x)


Invariant Set

M ⊂ Rd is invariant if for every x ∈ M, there exists x∈ Σ s.t.

x(t) ∈ M ∀t with x(0) = x

x

M


Attractor of a DI

A ⊂ Rd is attracting if it is compact and there exists a

neighborhood U such that for any ε > 0, ∃T (ε) ≥ 0 with

Φ([T (ε),∞),U) ⊂ Nε(A)

A

U

A

U

Nε

(A )

T(ε)

If the above A is invariant, it is called an attractor


An Alternative View of Algorithm

Recall the recursion

xn+1 = xn − a(n)(∇J(xn) + εn),

where ‖ εn ‖≤ ε ∀n

Alternatively consider

xn+1 = xn − a(n)g(xn), (4)

where g(xn) ∈ G(xn)∀n and G(x)�= ∇J(x) + Bε(0), i.e., gradient

estimate lies in an ε-ball around true gradient

ε J(x)

J(x)

ΔΔG(x)


Assumptions

(A1) ∇J is a continuous function s.t. ‖ ∇J(x) ‖ ≤ K (1+ ‖ x ‖) for

all x ∈ Rd , K > 0

(A2) a(n) > 0∀n with

∑n

a(n) = ∞,∑

n

a(n)2 < ∞

Can show that G is upper-semicontinuous

Let Gc(x)�= {

y

c| y ∈ G(cx)}

Let G∞(x)�= co(Limsupc→∞Gc(x)), where Limsupc→∞Gc(x)

�= {y | lim infc→∞ d(y ,Gc(x)) = 0}


An Important Result

Lemma 1 The map x �→ G∞(x) is Marchaud

Thus, x(t) ∈ −G∞(x(t)) has at least one solution and which is

absolutely continuous

(A3) x(t) ∈ −G∞(x(t)) has an attractor set A such that A ⊂ Ba(0)for some a > 0 and Ba(0) is a fundamental neighborhood of A

(A4) Let cn ≥ 1 be an increasing sequence of integers such that

cn ↑ ∞ as n ↑ ∞. Let xn → x and yn → y as n ↑ ∞, such that

yn ∈ Gcn(xn), ∀n, then y ∈ G∞(x)


The Stability Result

Theorem 1 Under (A1)-(A4), the iterates (4) are stable i.e.,

supn ‖ xn ‖< ∞ a.s.

Now recall that G(x) = ∇J(x) + Bε(0)

Let the minimum set M of J be the global attractor of

x(t) = −∇J(x(t))

It can be shown that any compact set K with M ⊂ K ⊂ Rd is a

fundamental neighborhood of M

From Theorem 1, x(t) ∈ K0 ∀t ≥ 0 for some (possibly sample path

dependent) compact set K0 which then is a fundamental

neighborhood of M


The Main Result

Theorem 2 Given δ > 0, there exists ε(δ) > 0 such that (4)

converges to Nδ(M) provided ε ≤ ε(δ)/2

1T2 T3

T4T

t(1) t(2)

x(t)

x(t)


Related Recent Work

A general stochastic recursion with set-valued maps and Markov noise

xn+1 = xn + a(n)(h(xn,Zn) + Mn+1)

General convergence with Zn non-ergodic, iterate-dependent, Markov

process [V.Yaji and SB, Stochastics, 2018]

Two-timescale stochastic recursions with Markov noise

xn+1 = xn + a(n)(h(xn, yn,Z1n ) + M1

n+1)

yn+1 = yn + b(n)(g(xn, yn,Z2n ) + M2

n+1)

Z 1n ,Z

2n independent non-ergodic iterate-dependent Markov processes,

M1n+1,M

2n+1 independent martingale differences, a(n) = o(b(n)), h,g

point-to-point maps – analysis and application to reinforcement learning

[P.Karmakar and SB, Math of OR, 2018]

Analysis under set-valued h,g [V.Yaji and SB, Math of OR, 2019]


Ongoing and Future Work

Finding minima of non-differentiable functions under noise

Algorithms for convergence to global minima

Asynchronous update algorithms

Reinforcement learning algorithms for partially observed Markov

decision processes

Analysis of deep reinforcement learning algorithms

Applications in robotics, microgrids, vehicular traffic control etc.


Documents

Stochastic Approximation Algorithms with Set-Valued Maps€¦ · 3V.S.Borkar, Stochastic Approximation Algorithms: A Dynamical Systems Viewpoint, Cambridge Univ. Press, 2008 Shalabh