Lecture 11: Mirror Descent and Variable Metric Methods
DAMC, May 8 - 13, 2020
Definition. Let $h$ be a convex function, differentiable on $C$. The Bregman divergence associated with $h$ is defined as
$$D_h(x, y) = h(x) - h(y) - \langle \nabla h(y), x - y \rangle.$$
$D_h(x, y)$ is always non-negative and convex in its first argument $x$.
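As a quick illustration, here is a minimal sketch (Python/NumPy, with $h$ and $\nabla h$ supplied by the caller) of the Bregman divergence for the two generating functions used throughout this lecture; the helper names are mine, not from the slides.

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>."""
    return h(x) - h(y) - np.dot(grad_h(y), x - y)

# Squared Euclidean norm: D_h(x, y) = (1/2)||x - y||_2^2.
h_sq = lambda x: 0.5 * np.dot(x, x)
grad_sq = lambda x: x

# Negative entropy (on the simplex, D_h is the KL divergence).
h_ent = lambda x: np.sum(x * np.log(x))
grad_ent = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
print(bregman(h_sq, grad_sq, x, y))    # equals 0.5 * ||x - y||_2^2
print(bregman(h_ent, grad_ent, x, y))  # equals KL(x, y) on the simplex
```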
$h$ is $\lambda$-strongly convex over $C$ with respect to the norm $\|\cdot\|$ if
$$h(x) \ge h(y) + \langle \nabla h(y), x - y \rangle + \frac{\lambda}{2}\|x - y\|^2 \quad \forall\, x, y \in C.$$
Note that strong convexity of $h$ is equivalent to
$$D_h(x, y) \ge \frac{\lambda}{2}\|x - y\|^2 \quad \forall\, x, y \in C.$$
2. Mirror Descent method

Compute subgradient $g^k \in \partial f(x^k)$.
Update
$$x^{k+1} = \operatorname*{argmin}_{x \in C}\Big\{ f(x^k) + \langle g^k, x - x^k \rangle + \frac{1}{\alpha_k} D_h(x, x^k) \Big\} = \operatorname*{argmin}_{x \in C}\Big\{ \langle g^k, x \rangle + \frac{1}{\alpha_k} D_h(x, x^k) \Big\}.$$
We often attempt to choose $h$ to better match the geometry of the underlying constraint set $C$. The goal, roughly, is to choose a strongly convex function $h$ so that the diameter of $C$ is small in the norm $\|\cdot\|$ (which need not be the $\ell_2$, or Euclidean, norm) with respect to which $h$ is strongly convex.
Example: Gradient descent is mirror descent. Consider
$$h(x) = \frac{1}{2}\|x\|_2^2, \qquad D_h(x, y) = \frac{1}{2}\|x - y\|_2^2.$$
The update of mirror descent is
$$x^{k+1} = \operatorname*{argmin}_{x \in C}\Big\{ f(x^k) + \langle g^k, x - x^k \rangle + \frac{1}{\alpha_k} D_h(x, x^k) \Big\} = \operatorname*{argmin}_{x \in C}\Big\{ \langle g^k, x \rangle + \frac{1}{\alpha_k} D_h(x, x^k) \Big\} = \operatorname*{argmin}_{x \in C}\Big\{ \langle g^k, x \rangle + \frac{1}{2\alpha_k}\|x - x^k\|_2^2 \Big\}.$$
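Completing the square in the last expression shows this is exactly the projected subgradient method. A minimal sketch (Python/NumPy; `project_C` is a caller-supplied projection onto $C$, a hypothetical helper):

```python
import numpy as np

def mirror_descent_euclidean(x0, subgrad, project_C, alphas):
    """Mirror descent with h = (1/2)||x||_2^2, i.e. projected subgradient:
    x^{k+1} = proj_C(x^k - alpha_k * g^k)."""
    x = x0.copy()
    for alpha in alphas:
        x = project_C(x - alpha * subgrad(x))
    return x
```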
Example: Solving problems on the probability simplex with exponentiated gradient methods.

Consider the negative entropy ($h$ is convex)
$$h(x) = \sum_{j=1}^n x_j \log x_j,$$
restricting $x$ to the probability simplex: $x \ge 0$, $\sum_{j=1}^n x_j = 1$. Then
$$D_h(x, y) = \sum_{j=1}^n \big[x_j(\log x_j - \log y_j) + (y_j - x_j)\big] = \sum_{j=1}^n x_j \log\frac{x_j}{y_j},$$
the entropic or Kullback-Leibler divergence.
The subproblem in mirror descent is
$$\min_x\; \langle g, x \rangle + \sum_{j=1}^n x_j \log\frac{x_j}{y_j} \quad \text{s.t.}\; x \ge 0,\; \sum_{j=1}^n x_j = 1.$$
Introduce Lagrange multipliers $\tau \in \mathbb{R}$ for the equality constraint and $\lambda \in \mathbb{R}^n_+$ for the inequality constraint. We obtain the Lagrangian
$$L(x, \tau, \lambda) = \langle g, x \rangle + \sum_{j=1}^n \Big[ x_j \log\frac{x_j}{y_j} + \tau x_j - \lambda_j x_j \Big] - \tau.$$
Taking derivatives with respect to $x$ and setting them to zero yields
$$x_j = y_j \exp(-g_j - 1 - \tau + \lambda_j).$$
Taking $\lambda_j = 0$, by $\sum_{j=1}^n x_j = 1$ we have (a special solution)
$$x_i = \frac{y_i \exp(-g_i)}{\sum_{j=1}^n y_j \exp(-g_j)}.$$
The update in mirror descent is
$$x_i^{k+1} = \frac{x_i^k \exp(-\alpha_k g_i^k)}{\sum_{j=1}^n x_j^k \exp(-\alpha_k g_j^k)}.$$
This is the so-called exponentiated gradient update, also known as entropic mirror descent.
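A minimal sketch of the update in Python/NumPy; the max-shift for numerical stability is an implementation detail of mine, not part of the slides (it cancels on normalization):

```python
import numpy as np

def exponentiated_gradient_step(x, g, alpha):
    """Entropic mirror descent step on the probability simplex:
    x_i <- x_i * exp(-alpha * g_i), then renormalize to sum to 1."""
    z = np.log(x) - alpha * g
    z -= z.max()          # stability shift; cancels in the normalization
    w = np.exp(z)
    return w / w.sum()
```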
Example using $\ell_p$ norms ($p \in (1, 2]$). Consider
$$h(x) = \frac{1}{2(p-1)}\|x\|_p^2.$$
We have
$$\nabla h(x) = \frac{1}{p-1}\|x\|_p^{2-p}\,\big[\operatorname{sign}(x_1)|x_1|^{p-1}, \cdots, \operatorname{sign}(x_n)|x_n|^{p-1}\big]^T.$$
Define
$$\varphi(x) = (p-1)\nabla h(x)$$
and
$$\psi(y) = \|y\|_q^{2-q}\,\big[\operatorname{sign}(y_1)|y_1|^{q-1}, \cdots, \operatorname{sign}(y_n)|y_n|^{q-1}\big]^T,$$
where $q = \frac{p}{p-1}$. We have $\psi(\varphi(x)) = x$, and hence $\psi = \varphi^{-1}$ and $\varphi = \psi^{-1}$.
The mirror descent for $C = \mathbb{R}^n$ is
$$x^{k+1} = \operatorname*{argmin}_{x \in \mathbb{R}^n}\Big\{ f(x^k) + \langle g^k, x - x^k \rangle + \frac{1}{\alpha_k} D_h(x, x^k) \Big\} = \operatorname*{argmin}_{x \in \mathbb{R}^n}\Big\{ \langle g^k, x \rangle + \frac{1}{\alpha_k} D_h(x, x^k) \Big\} = (\nabla h)^{-1}\big(\nabla h(x^k) - \alpha_k g^k\big) = \psi\big(\varphi(x^k) - \alpha_k(p-1) g^k\big).$$

The update involving the inverse of the gradient mapping $(\nabla h)^{-1}$ holds more generally, that is, for any differentiable and strictly convex $h$.

This is the original form of the mirror descent update, and it justifies the name mirror descent: the gradient is "mirrored" through the distance-generating function $h$ and back again.
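A sketch of the maps $\varphi$, $\psi$ and the unconstrained $\ell_p$ update (Python/NumPy; function names are mine, and the code assumes a nonzero iterate so the norm powers are well defined):

```python
import numpy as np

def phi(x, p):
    """phi(x) = (p-1) * grad h(x) for h(x) = ||x||_p^2 / (2(p-1))."""
    return np.linalg.norm(x, ord=p) ** (2 - p) * np.sign(x) * np.abs(x) ** (p - 1)

def psi(y, p):
    """psi = phi^{-1}, with q = p/(p-1) the conjugate exponent."""
    q = p / (p - 1)
    return np.linalg.norm(y, ord=q) ** (2 - q) * np.sign(y) * np.abs(y) ** (q - 1)

def lp_mirror_step(x, g, alpha, p):
    """Unconstrained l_p mirror descent: x <- psi(phi(x) - alpha (p-1) g)."""
    return psi(phi(x, p) - alpha * (p - 1) * g, p)

# Sanity check of the identity psi(phi(x)) = x:
x = np.array([0.3, -1.2, 2.0])
assert np.allclose(psi(phi(x, 1.5), 1.5), x)
```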
The case $C = \{x \in \mathbb{R}^n_+ : \sum_{j=1}^n x_j = 1\}$. The subproblem in mirror descent is
$$\min_{x \in C}\; \langle v, x \rangle + \frac{1}{2}\|x\|_p^2, \quad \text{where } v = \alpha_k(p-1)g^k - \varphi(x^k).$$
Exercise. Show by the KKT conditions that the solution is given by $x = \psi([\tau e - v]_+)$, where $\tau \in \mathbb{R}$ satisfies
$$\sum_{j=1}^n \psi_j([\tau e - v]_+) = 1.$$
Binary search can be used to find $\tau$.
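A sketch of that binary search (Python/NumPy, reusing `psi` from the previous sketch; the bracketing scheme and tolerance are my own choices):

```python
import numpy as np

def solve_tau(v, p, tol=1e-10):
    """Bisection for tau such that sum_j psi_j([tau*e - v]_+) = 1.
    The left-hand side is nondecreasing in tau, so bisection applies."""
    def s(tau):
        w = np.maximum(tau - v, 0.0)
        return psi(w, p).sum() if w.any() else 0.0  # psi(0) treated as 0
    lo, hi = v.min(), v.min() + 1.0
    while s(hi) < 1.0:          # grow the bracket until it contains the root
        hi += 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if s(mid) < 1.0 else (lo, mid)
    return 0.5 * (lo + hi)
```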
Theorem 1.
Let $\alpha_k > 0$ be any sequence of non-increasing stepsizes. Assume that $h$ is strongly convex with respect to $\|\cdot\|$, and let $\|\cdot\|_*$ denote the dual norm. Let $x^k$ be generated by the mirror descent iteration. If $D_h(x, x_\star) \le R^2$ for all $x \in C$, then for all $K \ge 1$,
$$\sum_{k=1}^K [f(x^k) - f(x_\star)] \le \frac{R^2}{\alpha_K} + \sum_{k=1}^K \frac{\alpha_k}{2}\|g^k\|_*^2.$$
If $\alpha_k = \alpha$ is constant, then for all $K \ge 1$,
$$\sum_{k=1}^K [f(x^k) - f(x_\star)] \le \frac{1}{\alpha} D_h(x_\star, x^1) + \frac{\alpha}{2}\sum_{k=1}^K \|g^k\|_*^2.$$
If $\bar{x}^K = \frac{1}{K}\sum_{k=1}^K x^k$ or $\bar{x}^K = \operatorname*{argmin}_{x^k} f(x^k)$, and $\|g\|_* \le M$ for all $g \in \partial f(x)$, $x \in C$, then
$$f(\bar{x}^K) - f(x_\star) \le \frac{1}{K\alpha} D_h(x_\star, x^1) + \frac{\alpha}{2}M^2.$$
Exercise. Prove that the negative entropy
$$h(x) = \sum_{j=1}^n x_j \log x_j$$
is 1-strongly convex with respect to the $\ell_1$-norm.

Example. Let $C = \{x \in \mathbb{R}^n_+ : \sum_{j=1}^n x_j = 1\}$ and let $x^1 = \frac{1}{n}e$. The exponentiated gradient method with fixed stepsize $\alpha$ guarantees
$$f(\bar{x}^K) - f(x_\star) \le \frac{\log n}{K\alpha} + \frac{\alpha}{2K}\sum_{k=1}^K \|g^k\|_\infty^2,$$
where $\bar{x}^K = \frac{1}{K}\sum_{k=1}^K x^k$.
Example. Let $C \subset \{x \in \mathbb{R}^n : \|x\|_1 \le R_1\}$ and let $h(x) = \frac{1}{2(p-1)}\|x\|_p^2$ with $p = 1 + \frac{1}{\log(2n)}$. Then
$$\sum_{k=1}^K [f(x^k) - f(x_\star)] \le \frac{2R_1^2 \log(2n)}{\alpha_K} + \frac{e^2}{2}\sum_{k=1}^K \alpha_k \|g^k\|_\infty^2.$$
In particular, taking $\alpha_k = \frac{R_1}{e}\sqrt{\frac{\log(2n)}{k}}$ and $\bar{x}^K = \frac{1}{K}\sum_{k=1}^K x^k$ gives
$$f(\bar{x}^K) - f(x_\star) \le \frac{3eR_1\sqrt{\log(2n)}}{\sqrt{K}}.$$
A simulated mirror-descent example. Let
$$A = [a_1 \cdots a_m]^T \in \mathbb{R}^{m \times n}$$
have entries drawn i.i.d. $\mathcal{N}(0, 1)$. Let
$$b_i = \frac{1}{2}(a_{i1} + a_{i2}) + \varepsilon_i,$$
where the $\varepsilon_i$ are drawn i.i.d. $\mathcal{N}(0, 10^{-2})$, and $n = 3000$, $m = 20$. Define
$$f(x) = \|Ax - b\|_1 = \sum_{i=1}^m |\langle a_i, x \rangle - b_i|,$$
which has subgradients $A^T \operatorname{sign}(Ax - b)$. Consider
$$\min_x\; f(x) \quad \text{s.t.}\; x \in C = \Big\{x \in \mathbb{R}^n_+ : \sum_{j=1}^n x_j = 1\Big\}.$$
We compare the subgradient method to exponentiated gradient descent for this problem, noting that the Euclidean projection of a vector $v \in \mathbb{R}^n$ onto the set $C$ has coordinates
$$x_j = [v_j - \tau]_+,$$
where $\tau \in \mathbb{R}$ is chosen so that
$$\sum_{j=1}^n x_j = \sum_{j=1}^n [v_j - \tau]_+ = 1.$$
We use stepsize $\alpha_k = \alpha_0/\sqrt{k}$, where the initial stepsize $\alpha_0$ is chosen to optimize the convergence guarantee for each of the methods. A sketch of the experiment follows.
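A minimal sketch of the comparison (Python/NumPy; the sort-based simplex projection, the seed, and the specific $\alpha_0$ values are illustrative choices of mine, whereas the slides tune $\alpha_0$ via each method's guarantee):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 3000
A = rng.normal(size=(m, n))
b = 0.5 * (A[:, 0] + A[:, 1]) + rng.normal(scale=0.1, size=m)  # std 0.1 = sqrt(1e-2)

f = lambda x: np.abs(A @ x - b).sum()
subgrad = lambda x: A.T @ np.sign(A @ x - b)

def project_simplex(v):
    """Euclidean projection onto C: x_j = [v_j - tau]_+ with sum_j x_j = 1."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / np.arange(1, n + 1))[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

x_sub = np.full(n, 1.0 / n)   # projected subgradient iterate
x_eg = np.full(n, 1.0 / n)    # exponentiated gradient iterate
for k in range(1, 1001):
    a_sub, a_eg = 1.0 / np.sqrt(k), 0.1 / np.sqrt(k)  # illustrative alpha_0
    x_sub = project_simplex(x_sub - a_sub * subgrad(x_sub))
    w = x_eg * np.exp(-a_eg * subgrad(x_eg))          # entropic mirror step
    x_eg = w / w.sum()
print(f(x_sub), f(x_eg))
```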
Figure: $f_k^{\mathrm{best}} - f_\star$ versus $k$.
2.1 Stochastic version

Theorem 2.
Let the conditions of Theorem 1 hold, except that instead of receiving a vector $g^k \in \partial f(x^k)$ at iteration $k$, the vector $g^k$ is a stochastic subgradient satisfying $\mathbb{E}[g^k \mid x^k] \in \partial f(x^k)$. Then for any non-increasing stepsize sequence $\alpha_k$ (where $\alpha_k$ may be chosen dependent on $g^1, \cdots, g^k$),
$$\mathbb{E}\left[\sum_{k=1}^K \big(f(x^k) - f(x_\star)\big)\right] \le \mathbb{E}\left[\frac{R^2}{\alpha_K} + \sum_{k=1}^K \frac{\alpha_k}{2}\|g^k\|_*^2\right].$$
2.2 Adaptive stepsizes

Let $\bar{x}^K = \frac{1}{K}\sum_{k=1}^K x^k$. If $D_h(x, x_\star) \le R^2$ for all $x \in C$, then
$$\mathbb{E}[f(\bar{x}^K) - f(x_\star)] \le \mathbb{E}\left[\frac{R^2}{K\alpha_K} + \frac{1}{K}\sum_{k=1}^K \frac{\alpha_k}{2}\|g^k\|_*^2\right].$$
If $\alpha_k = \alpha$ for all $k$, we see that the choice of stepsize minimizing
$$\frac{R^2}{K\alpha} + \frac{\alpha}{2K}\sum_{k=1}^K \|g^k\|_*^2 \quad \text{is} \quad \alpha_\star = \sqrt{2}\,R\left(\sum_{k=1}^K \|g^k\|_*^2\right)^{-1/2},$$
which yields
$$\mathbb{E}[f(\bar{x}^K) - f(x_\star)] \le \frac{\sqrt{2}\,R}{K}\,\mathbb{E}\left[\left(\sum_{k=1}^K \|g^k\|_*^2\right)^{1/2}\right].$$
But this $\alpha_\star$ is not known in advance, because it depends on the entire gradient sequence $g^1, \cdots, g^K$.
Corollary 3.
Let the conditions of Theorem 2 hold, and let
$$\alpha_k = \frac{R}{\sqrt{\sum_{i=1}^k \|g^i\|_*^2}},$$
which is the "up to now" optimal choice. Then
$$\mathbb{E}[f(\bar{x}^K) - f(x_\star)] \le \frac{3R}{K}\,\mathbb{E}\left[\left(\sum_{k=1}^K \|g^k\|_*^2\right)^{1/2}\right],$$
where $\bar{x}^K = \frac{1}{K}\sum_{k=1}^K x^k$.
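Corollary 3's rule is a one-line change to any mirror descent loop; a sketch (Python/NumPy, assuming a caller-supplied `mirror_step(x, g, alpha)` such as the exponentiated gradient step above, and a dual-norm-squared function `dual_norm_sq`):

```python
import numpy as np

def adaptive_mirror_descent(x0, subgrad, mirror_step, dual_norm_sq, R, iters):
    """Mirror descent with alpha_k = R / sqrt(sum_{i<=k} ||g^i||_*^2) (Corollary 3)."""
    x, sum_sq, avg = x0.copy(), 0.0, np.zeros_like(x0)
    for k in range(1, iters + 1):
        g = subgrad(x)
        sum_sq += dual_norm_sq(g)
        x = mirror_step(x, g, R / np.sqrt(sum_sq))
        avg += (x - avg) / k        # running average, i.e. \bar{x}^K
    return avg
```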
3. Variable metric methods

The idea is to adjust the metric with which one constructs updates, to better reflect problem structure.

1. Choose subgradient $g^k \in \partial f(x^k)$, or stochastic subgradient $g^k$ satisfying $\mathbb{E}[g^k \mid x^k] \in \partial f(x^k)$.
2. Update positive semidefinite matrix $H_k \in \mathbb{R}^{n \times n}$.
3. Compute update
$$x^{k+1} = \operatorname*{argmin}_{x \in C}\Big\{ \langle g^k, x \rangle + \frac{1}{2}\langle x - x^k, H_k(x - x^k) \rangle \Big\}.$$

Special cases: the subgradient method, $H_k = \frac{1}{\alpha_k}I$; Newton's method, $H_k = \nabla^2 f(x^k)$.
Theorem 4.
Let $H_k$ be a sequence of positive definite matrices, where $H_k$ is a function of $g^1, \cdots, g^k$ (and potentially some additional randomness). Let $g^k$ be (stochastic) subgradients with $\mathbb{E}[g^k \mid x^k] \in \partial f(x^k)$. Then (writing $\|z\|_H^2 = \langle z, Hz \rangle$)
$$\mathbb{E}\left[\sum_{k=1}^K \big(f(x^k) - f(x_\star)\big)\right] \le \frac{1}{2}\mathbb{E}\left[\sum_{k=2}^K \Big(\|x^k - x_\star\|_{H_k}^2 - \|x^k - x_\star\|_{H_{k-1}}^2\Big) + \|x^1 - x_\star\|_{H_1}^2\right] + \frac{1}{2}\mathbb{E}\left[\sum_{k=1}^K \|g^k\|_{H_k^{-1}}^2\right].$$
3.1 Adaptive gradient variable metric method

Theorem 5.
Let
$$R_\infty = \sup_{x \in C}\|x - x_\star\|_\infty$$
be the $\ell_\infty$ radius of the set $C$, and let the conditions of Theorem 4 hold. Let
$$H_k = \frac{1}{\alpha}\left(\operatorname{diag}\left(\sum_{i=1}^k g^i g^{iT}\right)\right)^{1/2},$$
where $\alpha > 0$ is a pre-specified constant. Then we have
$$\mathbb{E}\left[\sum_{k=1}^K \big(f(x^k) - f(x_\star)\big)\right] \le \left(\frac{R_\infty^2}{2} + \alpha^2\right)\mathbb{E}[\operatorname{tr}(H_K)].$$
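This is the diagonal AdaGrad metric. A minimal sketch for the box constraint $\|x\|_\infty \le 1$ (matching the SVM example on the next slide), where the diagonal metric makes the constrained update separable, so the projection is a per-coordinate clip (Python/NumPy; the small `eps` guard is mine):

```python
import numpy as np

def adagrad(x0, stoch_subgrad, alpha, iters, eps=1e-12):
    """Diagonal AdaGrad on C = {||x||_inf <= 1}:
    H_k = diag(sum_i g^i (g^i)^T)^{1/2} / alpha; with a diagonal metric
    the update is a per-coordinate step followed by clipping to the box."""
    x = x0.copy()
    sum_sq = np.zeros_like(x0)              # running sum of squared grad coords
    for _ in range(iters):
        g = stoch_subgrad(x)
        sum_sq += g * g
        x = x - alpha * g / (np.sqrt(sum_sq) + eps)  # x - H_k^{-1} g
        x = np.clip(x, -1.0, 1.0)           # projection onto the box
    return x
```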
Example: Support vector machine classification problem.

Consider
$$\min_x\; f(x) = \frac{1}{m}\sum_{i=1}^m [1 - b_i\langle a_i, x \rangle]_+ \quad \text{s.t.}\; \|x\|_\infty \le 1,$$
where the vectors $a_i \in \{-1, 0, 1\}^n$, with $m = 5000$ and $n = 1000$.

For each coordinate $j \in [n]$, we set $a_{ij} \in \{\pm 1\}$ with a random sign with probability $1/j$, and $a_{ij} = 0$ otherwise.

Let $u \in \{-1, 1\}^n$ be drawn uniformly at random. We set $b_i = \operatorname{sign}(\langle a_i, u \rangle)$ with probability $0.95$ and $b_i = -\operatorname{sign}(\langle a_i, u \rangle)$ otherwise. We choose a stochastic gradient by selecting $i \in [m]$ uniformly at random, then set
$$g^k \in \partial[1 - b_i\langle a_i, x^k \rangle]_+,$$
and $\alpha_k = \alpha/\sqrt{k}$. A sketch of the setup appears below.
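A sketch of the data generation and the stochastic subgradient oracle (Python/NumPy; the seed, tie-breaking for zero inner products, and iteration count are illustrative choices of mine, and `adagrad` refers to the sketch above):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5000, 1000

# a_ij = +/-1 with a random sign with probability 1/j, else 0.
probs = 1.0 / np.arange(1, n + 1)
A = (rng.random((m, n)) < probs) * rng.choice([-1.0, 1.0], size=(m, n))

u = rng.choice([-1.0, 1.0], size=n)
b = np.sign(A @ u)
b[b == 0] = 1.0                         # break exact ties arbitrarily
flip = rng.random(m) < 0.05
b[flip] *= -1.0                         # label flipped with probability 0.05

def stoch_subgrad(x):
    """Subgradient of the hinge term [1 - b_i <a_i, x>]_+ for a random i."""
    i = rng.integers(m)
    return -b[i] * A[i] if 1.0 - b[i] * (A[i] @ x) > 0 else np.zeros(n)

# e.g. run the AdaGrad sketch from the previous slide:
x = adagrad(np.zeros(n), stoch_subgrad, alpha=1.0, iters=2000)
```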
Figure: $f(x^k) - f_\star$ versus $k/100$, various initial stepsizes.
Figure: $f(x^k) - f_\star$ versus $k/100$, best initial stepsize.
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 2 26
Definition Let h be a differentiable convex function differentiableon C The Bregman divergence associated with h is defined as
Dh(xy) = h(x)minus h(y)minus 〈nablah(y)xminus y〉
Dh(xy) is always non-negative and convex in its first argument x
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 3 26
h is λ-strongly convex over C with respect to the norm 983042 middot 983042 if
h(x) ge h(y) + 〈nablah(y)xminus y〉+ λ
2983042xminus y9830422 forall xy isin C
Note that strong convexity of h is equivalent to
Dh(xy) geλ
2983042xminus y9830422 forall xy isin C
2 Mirror Descent method
Compute subgradient gk isin partf(xk)
Update
xk+1 = argminxisinC
983069f(xk) + 〈gkxminus xk〉+ 1
αkDh(xx
k)
983070
= argminxisinC
983069〈gkx〉+ 1
αkDh(xx
k)
983070
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 4 26
We often attempt to choose h to better match the geometry of theunderlying constraint set C The goal is roughly to choose astrongly convex function h so that the diameter of C is small in thenorm 983042 middot 983042 (need not be ℓ2 or Euclidean norm) with respect towhich h is strongly convex
Example Gradient descent is mirror descent Consider
h(x) =1
2983042x98304222 Dh(xy) =
1
2983042xminus y98304222
The update of mirror descent is
xk+1 = argminxisinC
983069f(xk) + 〈gkxminus xk〉+ 1
αkDh(xx
k)
983070
= argminxisinC
983069〈gkx〉+ 1
αkDh(xx
k)
983070
= argminxisinC
983069〈gkx〉+ 1
2αk983042xminus xk98304222
983070
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 5 26
Example Solving problems on the probability simplex withexponentiated gradient methods
Consider the negative entropy (h is convex)
h(x) =
n983131
j=1
xj log xj
restricting x to the probability simplex x ge 0
n983131
j=1
xj = 1 Then
Dh(xy) =
n983131
j=1
[xj(log xj minus log yj) + (yj minus xj)] =
n983131
j=1
xj logxjyj
the entropic or Kullback-Leibler divergence
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 6 26
The subproblem in mirror descent is
minx
〈gx〉+n983131
j=1
xj logxjyj
st x ge 0
n983131
j=1
xj = 1
By introducing Lagrange multipliers τ isin R for the equalityconstraint and λ isin Rn
+ for the inequality constraint Then weobtain Lagrangian
L(x τλ) = 〈gx〉+n983131
j=1
983063xj log
xjyj
+ τxj minus λjxj
983064minus τ
Taking derivatives with respect to x and setting them to zero yield
xj = yj exp(minusgj minus 1minus τ + λj)
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 7 26
Taking λj = 0 by
n983131
j=1
xj = 1 we have (a special solution)
xi =yi exp(minusgi)
n983131
j=1
yj exp(minusgj)
The update in mirror descent is
xk+1i =
xki exp(minusαkgki )
n983131
j=1
xkj exp(minusαkgkj )
This is the so-called exponentiated gradient update also known asentropic mirror descent
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 8 26
Example using ℓp norms (p isin (1 2]) Consider
h(x) =1
2(pminus 1)983042x9830422p
We have
nablah(x) =1
pminus 1983042x9830422minusp
p
983045sign(x1)|x1|pminus1 middot middot middot sign(xn)|xn|pminus1
983046T
Defineφ(x) = (pminus 1)nablah(x)
and
ψ(y) = 983042y9830422minusqq
983045sign(y1)|y1|qminus1 middot middot middot sign(yn)|yn|qminus1
983046T
where q =p
pminus 1 We have ψ(φ(x)) = x and hence ψ = φminus1
and φ = ψminus1
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 9 26
The mirror descent for C = Rn is
xk+1 = argminxisinRn
983069f(xk) + 〈gkxminus xk〉+ 1
αkDh(xx
k)
983070
= argminxisinRn
983069〈gkx〉+ 1
αkDh(xx
k)
983070
= (nablah)minus1(nablah(xk)minus αkgk)
= ψ(φ(xk)minus αk(pminus 1)gk)
The update involving the inverse of the gradient mapping (nablah)minus1holds more generally that is for any differentiable and strictlyconvex h
This is the original form of the mirror descent update and itjustifies the name mirror descent as the gradient is ldquomirroredrdquothrough the distance-generating function h and back again
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 10 26
The case C = x isin Rn+
n983131
j=1
xj = 1
The subproblem in mirror descent is
minxisinC
〈vx〉+ 1
2983042x9830422p
wherev = αk(pminus 1)gk minus φ(xk)
Exercise Show by KKT conditions that the solution is given byx = ψ([τeminus v]+) where τ isin R satisfies
n983131
j=1
ψj([τeminus v]+) = 1
Binary search can be used to find τ
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 11 26
Theorem 1
Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2
for all x isin C then for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
αK+
K983131
k=1
αk
2983042gk9830422lowast
If αk = α is constant then for all K ge 1K983131
k=1
[f(xk)minus f(x983183)] le1
αDh(x983183x
1) +α
2
K983131
k=1
983042gk9830422lowast
If xK =1
K
K983131
k=1
xk or xK = argminxk
f(xk) 983042g983042lowast le M for all
g isin partf(x) for x isin C then f(xK)minus f(x983183) le1
KαDh(x983183x
1) +α
2M2
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26
Exercise Prove that the negative entropy
h(x) =
n983131
j=1
xj log xj
is 1-strongly convex with respect to the ℓ1-norm
Example Let C = x isin Rn+
983123nj=1 xj = 1 Let x1 =
1
ne The
exponentiated gradient method with fixed stepsize α guarantees
f(xK)minus f(x983183) lelog n
Kα+
α
2K
K983131
k=1
983042gk9830422infin
where xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26
Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1
2(pminus 1)983042x9830422p
with p = 1 +1
log(2n) Then
K983131
k=1
[f(xk)minus f(x983183)] le2R2
1 log(2n)
αK+
e2
2
K983131
k=1
αk983042gk9830422infin
In particular taking αk =R1
e
983157log(2n)
kand xK =
1
K
K983131
k=1
xk gives
f(xK)minus f(x983183) le3eR1
983155log(2n)radicK
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26
A simulated mirror-descent example Let
A =983045a1 middot middot middot am
983046T isin Rmtimesn
have entries drawn iid N (0 1) Let
bi =1
2(ai1 + ai2) + εi
where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define
f(x) = 983042Axminus b9830421 =m983131
i=1
|〈aix〉 minus bi|
which has subgradients ATsign(Axminus b) Consider
minx
f(x) st x isin C =
983099983103
983101x isin Rn+
n983131
j=1
xj = 1
983100983104
983102
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26
We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates
xj = [vj minus τ ]+
where τ isin R is chosen so that
n983131
j=1
xj =
n983131
j=1
[vj minus τ ]+ = 1
We use stepsize αk = α0radick where the initial stepsize α0 is chosen
to optimize the convergence guarantee for each of the methods
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26
fkbest minus f983183 versus k
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26
21 Stochastic version
Theorem 2
Let the conditions of Theorem 1 hold except that instead of receiving avector
gk isin partf(xk)
at iteration k the vector gk is a stochastic subgradient satisfying
E[gk|xk] isin partf(xk)
Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le E
983077R2
αK+
K983131
k=1
αk
2983042gk9830422lowast
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26
22 Adaptive stepsizes
Let xK =1
K
K983131
k=1
xk If Dh(xx983183) le R2 for all x isin C then
E[f(xK)minus f(x983183)] le E
983077R2
KαK+
1
K
K983131
k=1
αk
2983042gk9830422lowast
983078
If αk = α for all k we see that the choice of stepsize minimizing
R2
Kα+
α
2K
K983131
k=1
983042gk9830422lowast is α983183 =radic2R
983075K983131
k=1
983042gk9830422lowast
983076minus 12
which yields
E[f(xK)minus f(x983183)] leradic2R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
But this α983183 is not known in advance because
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26
Corollary 3
Let the conditions of Theorem 2 hold Let
αk =R983161983160983160983159
k983131
i=1
983042gi9830422lowast
which is the ldquoup to nowrdquo optimal choice Then
E[f(xK)minus f(x983183)] le3R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
where
xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26
3 Variable metric methods
The idea is to adjust the metric with which one constructs updatesto better reflect problem structure
1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk
satisfyingE[gk|xk] isin partf(xk)
2 Update positive semidefinite matrix Hk isin Rntimesn
3 Compute update
xk+1 = argminxisinC
983069〈gkx〉+ 1
2〈xminus xkHk(xminus xk)〉
983070
Special cases Subgradient method Hk =1
αkI
Newton method Hk = nabla2f(xk)
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26
Theorem 4
Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with
E[gk|xk] isin partf(xk)
Then
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078
le 1
2E
983077K983131
k=2
983059983042xk minus x9831839830422Hk
minus 983042xk minus x9831839830422Hkminus1
983060+ 983042x1 minus x9831839830422H1
983078
+1
2E
983077K983131
k=1
983042gk9830422Hminus1
k
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26
31 Adaptive gradient variable metric method
Theorem 5
LetRinfin = sup
xisinC983042xminus x983183983042infin
be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let
Hk =1
α
983075diag
983075k983131
i=1
gigiT
98307698307612
where α gt 0 is a pre-specified constant Then we have
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le
983061R2
infin2
+ α2
983062E[tr(HK)]
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26
Example Support vector machine classification problem
Consider
minx
f(x) =1
m
m983131
i=1
[1minus bi〈aix〉]+ st 983042x983042infin le 1
where the vectors ai isin minus1 0 1n with m = 5000 and n = 1000
For each coordinate j isin [n] we set aij isin plusmn1 to have a randomsign with probability 1j and aij = 0 otherwise
Let u isin minus1 1n uniformly at random We set bi = sign(〈aiu〉)with probability 095 and bi = minussign(〈aiu〉) otherwiseWe choose a stochastic gradient by selecting i isin [m] uniformly atrandom then set
gk isin part[1minus bi〈aixk〉]+
and αk = αradick
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26
f(xk)minus f983183 versus k100 various initial stepsizes
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26
f(xk)minus f983183 versus k100 best initial stepsize
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26
Definition Let h be a differentiable convex function differentiableon C The Bregman divergence associated with h is defined as
Dh(xy) = h(x)minus h(y)minus 〈nablah(y)xminus y〉
Dh(xy) is always non-negative and convex in its first argument x
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 3 26
h is λ-strongly convex over C with respect to the norm 983042 middot 983042 if
h(x) ge h(y) + 〈nablah(y)xminus y〉+ λ
2983042xminus y9830422 forall xy isin C
Note that strong convexity of h is equivalent to
Dh(xy) geλ
2983042xminus y9830422 forall xy isin C
2 Mirror Descent method
Compute subgradient gk isin partf(xk)
Update
xk+1 = argminxisinC
983069f(xk) + 〈gkxminus xk〉+ 1
αkDh(xx
k)
983070
= argminxisinC
983069〈gkx〉+ 1
αkDh(xx
k)
983070
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 4 26
We often attempt to choose h to better match the geometry of theunderlying constraint set C The goal is roughly to choose astrongly convex function h so that the diameter of C is small in thenorm 983042 middot 983042 (need not be ℓ2 or Euclidean norm) with respect towhich h is strongly convex
Example Gradient descent is mirror descent Consider
h(x) =1
2983042x98304222 Dh(xy) =
1
2983042xminus y98304222
The update of mirror descent is
xk+1 = argminxisinC
983069f(xk) + 〈gkxminus xk〉+ 1
αkDh(xx
k)
983070
= argminxisinC
983069〈gkx〉+ 1
αkDh(xx
k)
983070
= argminxisinC
983069〈gkx〉+ 1
2αk983042xminus xk98304222
983070
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 5 26
Example Solving problems on the probability simplex withexponentiated gradient methods
Consider the negative entropy (h is convex)
h(x) =
n983131
j=1
xj log xj
restricting x to the probability simplex x ge 0
n983131
j=1
xj = 1 Then
Dh(xy) =
n983131
j=1
[xj(log xj minus log yj) + (yj minus xj)] =
n983131
j=1
xj logxjyj
the entropic or Kullback-Leibler divergence
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 6 26
The subproblem in mirror descent is
minx
〈gx〉+n983131
j=1
xj logxjyj
st x ge 0
n983131
j=1
xj = 1
By introducing Lagrange multipliers τ isin R for the equalityconstraint and λ isin Rn
+ for the inequality constraint Then weobtain Lagrangian
L(x τλ) = 〈gx〉+n983131
j=1
983063xj log
xjyj
+ τxj minus λjxj
983064minus τ
Taking derivatives with respect to x and setting them to zero yield
xj = yj exp(minusgj minus 1minus τ + λj)
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 7 26
Taking λj = 0 by
n983131
j=1
xj = 1 we have (a special solution)
xi =yi exp(minusgi)
n983131
j=1
yj exp(minusgj)
The update in mirror descent is
xk+1i =
xki exp(minusαkgki )
n983131
j=1
xkj exp(minusαkgkj )
This is the so-called exponentiated gradient update also known asentropic mirror descent
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 8 26
Example using ℓp norms (p isin (1 2]) Consider
h(x) =1
2(pminus 1)983042x9830422p
We have
nablah(x) =1
pminus 1983042x9830422minusp
p
983045sign(x1)|x1|pminus1 middot middot middot sign(xn)|xn|pminus1
983046T
Defineφ(x) = (pminus 1)nablah(x)
and
ψ(y) = 983042y9830422minusqq
983045sign(y1)|y1|qminus1 middot middot middot sign(yn)|yn|qminus1
983046T
where q =p
pminus 1 We have ψ(φ(x)) = x and hence ψ = φminus1
and φ = ψminus1
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 9 26
The mirror descent for C = Rn is
xk+1 = argminxisinRn
983069f(xk) + 〈gkxminus xk〉+ 1
αkDh(xx
k)
983070
= argminxisinRn
983069〈gkx〉+ 1
αkDh(xx
k)
983070
= (nablah)minus1(nablah(xk)minus αkgk)
= ψ(φ(xk)minus αk(pminus 1)gk)
The update involving the inverse of the gradient mapping (nablah)minus1holds more generally that is for any differentiable and strictlyconvex h
This is the original form of the mirror descent update and itjustifies the name mirror descent as the gradient is ldquomirroredrdquothrough the distance-generating function h and back again
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 10 26
The case C = x isin Rn+
n983131
j=1
xj = 1
The subproblem in mirror descent is
minxisinC
〈vx〉+ 1
2983042x9830422p
wherev = αk(pminus 1)gk minus φ(xk)
Exercise Show by KKT conditions that the solution is given byx = ψ([τeminus v]+) where τ isin R satisfies
n983131
j=1
ψj([τeminus v]+) = 1
Binary search can be used to find τ
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 11 26
Theorem 1
Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2
for all x isin C then for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
αK+
K983131
k=1
αk
2983042gk9830422lowast
If αk = α is constant then for all K ge 1K983131
k=1
[f(xk)minus f(x983183)] le1
αDh(x983183x
1) +α
2
K983131
k=1
983042gk9830422lowast
If xK =1
K
K983131
k=1
xk or xK = argminxk
f(xk) 983042g983042lowast le M for all
g isin partf(x) for x isin C then f(xK)minus f(x983183) le1
KαDh(x983183x
1) +α
2M2
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26
Exercise Prove that the negative entropy
h(x) =
n983131
j=1
xj log xj
is 1-strongly convex with respect to the ℓ1-norm
Example Let C = x isin Rn+
983123nj=1 xj = 1 Let x1 =
1
ne The
exponentiated gradient method with fixed stepsize α guarantees
f(xK)minus f(x983183) lelog n
Kα+
α
2K
K983131
k=1
983042gk9830422infin
where xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26
Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1
2(pminus 1)983042x9830422p
with p = 1 +1
log(2n) Then
K983131
k=1
[f(xk)minus f(x983183)] le2R2
1 log(2n)
αK+
e2
2
K983131
k=1
αk983042gk9830422infin
In particular taking αk =R1
e
983157log(2n)
kand xK =
1
K
K983131
k=1
xk gives
f(xK)minus f(x983183) le3eR1
983155log(2n)radicK
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26
A simulated mirror-descent example Let
A =983045a1 middot middot middot am
983046T isin Rmtimesn
have entries drawn iid N (0 1) Let
bi =1
2(ai1 + ai2) + εi
where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define
f(x) = 983042Axminus b9830421 =m983131
i=1
|〈aix〉 minus bi|
which has subgradients ATsign(Axminus b) Consider
minx
f(x) st x isin C =
983099983103
983101x isin Rn+
n983131
j=1
xj = 1
983100983104
983102
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26
We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates
xj = [vj minus τ ]+
where τ isin R is chosen so that
n983131
j=1
xj =
n983131
j=1
[vj minus τ ]+ = 1
We use stepsize αk = α0radick where the initial stepsize α0 is chosen
to optimize the convergence guarantee for each of the methods
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26
fkbest minus f983183 versus k
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26
21 Stochastic version
Theorem 2
Let the conditions of Theorem 1 hold except that instead of receiving avector
gk isin partf(xk)
at iteration k the vector gk is a stochastic subgradient satisfying
E[gk|xk] isin partf(xk)
Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le E
983077R2
αK+
K983131
k=1
αk
2983042gk9830422lowast
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26
22 Adaptive stepsizes
Let xK =1
K
K983131
k=1
xk If Dh(xx983183) le R2 for all x isin C then
E[f(xK)minus f(x983183)] le E
983077R2
KαK+
1
K
K983131
k=1
αk
2983042gk9830422lowast
983078
If αk = α for all k we see that the choice of stepsize minimizing
R2
Kα+
α
2K
K983131
k=1
983042gk9830422lowast is α983183 =radic2R
983075K983131
k=1
983042gk9830422lowast
983076minus 12
which yields
E[f(xK)minus f(x983183)] leradic2R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
But this α983183 is not known in advance because
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26
Corollary 3
Let the conditions of Theorem 2 hold Let
αk =R983161983160983160983159
k983131
i=1
983042gi9830422lowast
which is the ldquoup to nowrdquo optimal choice Then
E[f(xK)minus f(x983183)] le3R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
where
xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26
3 Variable metric methods
The idea is to adjust the metric with which one constructs updatesto better reflect problem structure
1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk
satisfyingE[gk|xk] isin partf(xk)
2 Update positive semidefinite matrix Hk isin Rntimesn
3 Compute update
xk+1 = argminxisinC
983069〈gkx〉+ 1
2〈xminus xkHk(xminus xk)〉
983070
Special cases Subgradient method Hk =1
αkI
Newton method Hk = nabla2f(xk)
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26
Theorem 4
Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with
E[gk|xk] isin partf(xk)
Then
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078
le 1
2E
983077K983131
k=2
983059983042xk minus x9831839830422Hk
minus 983042xk minus x9831839830422Hkminus1
983060+ 983042x1 minus x9831839830422H1
983078
+1
2E
983077K983131
k=1
983042gk9830422Hminus1
k
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26
31 Adaptive gradient variable metric method
Theorem 5
LetRinfin = sup
xisinC983042xminus x983183983042infin
be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let
Hk =1
α
983075diag
983075k983131
i=1
gigiT
98307698307612
where α gt 0 is a pre-specified constant Then we have
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le
983061R2
infin2
+ α2
983062E[tr(HK)]
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26
Example Support vector machine classification problem
Consider
minx
f(x) =1
m
m983131
i=1
[1minus bi〈aix〉]+ st 983042x983042infin le 1
where the vectors ai isin minus1 0 1n with m = 5000 and n = 1000
For each coordinate j isin [n] we set aij isin plusmn1 to have a randomsign with probability 1j and aij = 0 otherwise
Let u isin minus1 1n uniformly at random We set bi = sign(〈aiu〉)with probability 095 and bi = minussign(〈aiu〉) otherwiseWe choose a stochastic gradient by selecting i isin [m] uniformly atrandom then set
gk isin part[1minus bi〈aixk〉]+
and αk = αradick
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26
f(xk)minus f983183 versus k100 various initial stepsizes
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26
f(xk)minus f983183 versus k100 best initial stepsize
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26
h is λ-strongly convex over C with respect to the norm 983042 middot 983042 if
h(x) ge h(y) + 〈nablah(y)xminus y〉+ λ
2983042xminus y9830422 forall xy isin C
Note that strong convexity of h is equivalent to
Dh(xy) geλ
2983042xminus y9830422 forall xy isin C
2 Mirror Descent method
Compute subgradient gk isin partf(xk)
Update
xk+1 = argminxisinC
983069f(xk) + 〈gkxminus xk〉+ 1
αkDh(xx
k)
983070
= argminxisinC
983069〈gkx〉+ 1
αkDh(xx
k)
983070
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 4 26
We often attempt to choose h to better match the geometry of theunderlying constraint set C The goal is roughly to choose astrongly convex function h so that the diameter of C is small in thenorm 983042 middot 983042 (need not be ℓ2 or Euclidean norm) with respect towhich h is strongly convex
Example Gradient descent is mirror descent Consider
h(x) =1
2983042x98304222 Dh(xy) =
1
2983042xminus y98304222
The update of mirror descent is
xk+1 = argminxisinC
983069f(xk) + 〈gkxminus xk〉+ 1
αkDh(xx
k)
983070
= argminxisinC
983069〈gkx〉+ 1
αkDh(xx
k)
983070
= argminxisinC
983069〈gkx〉+ 1
2αk983042xminus xk98304222
983070
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 5 26
Example Solving problems on the probability simplex withexponentiated gradient methods
Consider the negative entropy (h is convex)
h(x) =
n983131
j=1
xj log xj
restricting x to the probability simplex x ge 0
n983131
j=1
xj = 1 Then
Dh(xy) =
n983131
j=1
[xj(log xj minus log yj) + (yj minus xj)] =
n983131
j=1
xj logxjyj
the entropic or Kullback-Leibler divergence
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 6 26
The subproblem in mirror descent is
minx
〈gx〉+n983131
j=1
xj logxjyj
st x ge 0
n983131
j=1
xj = 1
By introducing Lagrange multipliers τ isin R for the equalityconstraint and λ isin Rn
+ for the inequality constraint Then weobtain Lagrangian
L(x τλ) = 〈gx〉+n983131
j=1
983063xj log
xjyj
+ τxj minus λjxj
983064minus τ
Taking derivatives with respect to x and setting them to zero yield
xj = yj exp(minusgj minus 1minus τ + λj)
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 7 26
Taking λj = 0 by
n983131
j=1
xj = 1 we have (a special solution)
xi =yi exp(minusgi)
n983131
j=1
yj exp(minusgj)
The update in mirror descent is
xk+1i =
xki exp(minusαkgki )
n983131
j=1
xkj exp(minusαkgkj )
This is the so-called exponentiated gradient update also known asentropic mirror descent
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 8 26
Example using ℓp norms (p isin (1 2]) Consider
h(x) =1
2(pminus 1)983042x9830422p
We have
nablah(x) =1
pminus 1983042x9830422minusp
p
983045sign(x1)|x1|pminus1 middot middot middot sign(xn)|xn|pminus1
983046T
Defineφ(x) = (pminus 1)nablah(x)
and
ψ(y) = 983042y9830422minusqq
983045sign(y1)|y1|qminus1 middot middot middot sign(yn)|yn|qminus1
983046T
where q =p
pminus 1 We have ψ(φ(x)) = x and hence ψ = φminus1
and φ = ψminus1
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 9 26
The mirror descent for C = Rn is
xk+1 = argminxisinRn
983069f(xk) + 〈gkxminus xk〉+ 1
αkDh(xx
k)
983070
= argminxisinRn
983069〈gkx〉+ 1
αkDh(xx
k)
983070
= (nablah)minus1(nablah(xk)minus αkgk)
= ψ(φ(xk)minus αk(pminus 1)gk)
The update involving the inverse of the gradient mapping (nablah)minus1holds more generally that is for any differentiable and strictlyconvex h
This is the original form of the mirror descent update and itjustifies the name mirror descent as the gradient is ldquomirroredrdquothrough the distance-generating function h and back again
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 10 26
The case C = x isin Rn+
n983131
j=1
xj = 1
The subproblem in mirror descent is
minxisinC
〈vx〉+ 1
2983042x9830422p
wherev = αk(pminus 1)gk minus φ(xk)
Exercise Show by KKT conditions that the solution is given byx = ψ([τeminus v]+) where τ isin R satisfies
n983131
j=1
ψj([τeminus v]+) = 1
Binary search can be used to find τ
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 11 26
Theorem 1
Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2
for all x isin C then for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
αK+
K983131
k=1
αk
2983042gk9830422lowast
If αk = α is constant then for all K ge 1K983131
k=1
[f(xk)minus f(x983183)] le1
αDh(x983183x
1) +α
2
K983131
k=1
983042gk9830422lowast
If xK =1
K
K983131
k=1
xk or xK = argminxk
f(xk) 983042g983042lowast le M for all
g isin partf(x) for x isin C then f(xK)minus f(x983183) le1
KαDh(x983183x
1) +α
2M2
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26
Exercise Prove that the negative entropy
h(x) =
n983131
j=1
xj log xj
is 1-strongly convex with respect to the ℓ1-norm
Example Let C = x isin Rn+
983123nj=1 xj = 1 Let x1 =
1
ne The
exponentiated gradient method with fixed stepsize α guarantees
f(xK)minus f(x983183) lelog n
Kα+
α
2K
K983131
k=1
983042gk9830422infin
where xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26
Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1
2(pminus 1)983042x9830422p
with p = 1 +1
log(2n) Then
K983131
k=1
[f(xk)minus f(x983183)] le2R2
1 log(2n)
αK+
e2
2
K983131
k=1
αk983042gk9830422infin
In particular taking αk =R1
e
983157log(2n)
kand xK =
1
K
K983131
k=1
xk gives
f(xK)minus f(x983183) le3eR1
983155log(2n)radicK
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26
A simulated mirror-descent example Let
A =983045a1 middot middot middot am
983046T isin Rmtimesn
have entries drawn iid N (0 1) Let
bi =1
2(ai1 + ai2) + εi
where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define
f(x) = 983042Axminus b9830421 =m983131
i=1
|〈aix〉 minus bi|
which has subgradients ATsign(Axminus b) Consider
minx
f(x) st x isin C =
983099983103
983101x isin Rn+
n983131
j=1
xj = 1
983100983104
983102
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26
We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates
xj = [vj minus τ ]+
where τ isin R is chosen so that
n983131
j=1
xj =
n983131
j=1
[vj minus τ ]+ = 1
We use stepsize αk = α0radick where the initial stepsize α0 is chosen
to optimize the convergence guarantee for each of the methods
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26
fkbest minus f983183 versus k
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26
21 Stochastic version
Theorem 2
Let the conditions of Theorem 1 hold except that instead of receiving avector
gk isin partf(xk)
at iteration k the vector gk is a stochastic subgradient satisfying
E[gk|xk] isin partf(xk)
Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le E
983077R2
αK+
K983131
k=1
αk
2983042gk9830422lowast
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26
22 Adaptive stepsizes
Let xK =1
K
K983131
k=1
xk If Dh(xx983183) le R2 for all x isin C then
E[f(xK)minus f(x983183)] le E
983077R2
KαK+
1
K
K983131
k=1
αk
2983042gk9830422lowast
983078
If αk = α for all k we see that the choice of stepsize minimizing
R2
Kα+
α
2K
K983131
k=1
983042gk9830422lowast is α983183 =radic2R
983075K983131
k=1
983042gk9830422lowast
983076minus 12
which yields
E[f(xK)minus f(x983183)] leradic2R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
But this α983183 is not known in advance because
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26
Corollary 3
Let the conditions of Theorem 2 hold Let
αk =R983161983160983160983159
k983131
i=1
983042gi9830422lowast
which is the ldquoup to nowrdquo optimal choice Then
E[f(xK)minus f(x983183)] le3R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
where
xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26
3 Variable metric methods
The idea is to adjust the metric with which one constructs updatesto better reflect problem structure
1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk
satisfyingE[gk|xk] isin partf(xk)
2 Update positive semidefinite matrix Hk isin Rntimesn
3 Compute update
xk+1 = argminxisinC
983069〈gkx〉+ 1
2〈xminus xkHk(xminus xk)〉
983070
Special cases Subgradient method Hk =1
αkI
Newton method Hk = nabla2f(xk)
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26
Theorem 4
Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with
E[gk|xk] isin partf(xk)
Then
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078
le 1
2E
983077K983131
k=2
983059983042xk minus x9831839830422Hk
minus 983042xk minus x9831839830422Hkminus1
983060+ 983042x1 minus x9831839830422H1
983078
+1
2E
983077K983131
k=1
983042gk9830422Hminus1
k
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26
31 Adaptive gradient variable metric method
Theorem 5
LetRinfin = sup
xisinC983042xminus x983183983042infin
be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let
Hk =1
α
983075diag
983075k983131
i=1
gigiT
98307698307612
where α gt 0 is a pre-specified constant Then we have
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le
983061R2
infin2
+ α2
983062E[tr(HK)]
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26
Example Support vector machine classification problem
Consider
minx
f(x) =1
m
m983131
i=1
[1minus bi〈aix〉]+ st 983042x983042infin le 1
where the vectors ai isin minus1 0 1n with m = 5000 and n = 1000
For each coordinate j isin [n] we set aij isin plusmn1 to have a randomsign with probability 1j and aij = 0 otherwise
Let u isin minus1 1n uniformly at random We set bi = sign(〈aiu〉)with probability 095 and bi = minussign(〈aiu〉) otherwiseWe choose a stochastic gradient by selecting i isin [m] uniformly atrandom then set
gk isin part[1minus bi〈aixk〉]+
and αk = αradick
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26
f(xk)minus f983183 versus k100 various initial stepsizes
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26
f(xk)minus f983183 versus k100 best initial stepsize
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26
We often attempt to choose h to better match the geometry of theunderlying constraint set C The goal is roughly to choose astrongly convex function h so that the diameter of C is small in thenorm 983042 middot 983042 (need not be ℓ2 or Euclidean norm) with respect towhich h is strongly convex
Example Gradient descent is mirror descent Consider
h(x) =1
2983042x98304222 Dh(xy) =
1
2983042xminus y98304222
The update of mirror descent is
xk+1 = argminxisinC
983069f(xk) + 〈gkxminus xk〉+ 1
αkDh(xx
k)
983070
= argminxisinC
983069〈gkx〉+ 1
αkDh(xx
k)
983070
= argminxisinC
983069〈gkx〉+ 1
2αk983042xminus xk98304222
983070
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 5 26
Example Solving problems on the probability simplex withexponentiated gradient methods
Consider the negative entropy (h is convex)
h(x) =
n983131
j=1
xj log xj
restricting x to the probability simplex x ge 0
n983131
j=1
xj = 1 Then
Dh(xy) =
n983131
j=1
[xj(log xj minus log yj) + (yj minus xj)] =
n983131
j=1
xj logxjyj
the entropic or Kullback-Leibler divergence
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 6 26
The subproblem in mirror descent is
minx
〈gx〉+n983131
j=1
xj logxjyj
st x ge 0
n983131
j=1
xj = 1
By introducing Lagrange multipliers τ isin R for the equalityconstraint and λ isin Rn
+ for the inequality constraint Then weobtain Lagrangian
L(x τλ) = 〈gx〉+n983131
j=1
983063xj log
xjyj
+ τxj minus λjxj
983064minus τ
Taking derivatives with respect to x and setting them to zero yield
xj = yj exp(minusgj minus 1minus τ + λj)
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 7 26
Taking λj = 0 by
n983131
j=1
xj = 1 we have (a special solution)
xi =yi exp(minusgi)
n983131
j=1
yj exp(minusgj)
The update in mirror descent is
xk+1i =
xki exp(minusαkgki )
n983131
j=1
xkj exp(minusαkgkj )
This is the so-called exponentiated gradient update also known asentropic mirror descent
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 8 26
Example using ℓp norms (p isin (1 2]) Consider
h(x) =1
2(pminus 1)983042x9830422p
We have
nablah(x) =1
pminus 1983042x9830422minusp
p
983045sign(x1)|x1|pminus1 middot middot middot sign(xn)|xn|pminus1
983046T
Defineφ(x) = (pminus 1)nablah(x)
and
ψ(y) = 983042y9830422minusqq
983045sign(y1)|y1|qminus1 middot middot middot sign(yn)|yn|qminus1
983046T
where q =p
pminus 1 We have ψ(φ(x)) = x and hence ψ = φminus1
and φ = ψminus1
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 9 26
The mirror descent for C = Rn is
xk+1 = argminxisinRn
983069f(xk) + 〈gkxminus xk〉+ 1
αkDh(xx
k)
983070
= argminxisinRn
983069〈gkx〉+ 1
αkDh(xx
k)
983070
= (nablah)minus1(nablah(xk)minus αkgk)
= ψ(φ(xk)minus αk(pminus 1)gk)
The update involving the inverse of the gradient mapping (nablah)minus1holds more generally that is for any differentiable and strictlyconvex h
This is the original form of the mirror descent update and itjustifies the name mirror descent as the gradient is ldquomirroredrdquothrough the distance-generating function h and back again
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 10 26
The case C = x isin Rn+
n983131
j=1
xj = 1
The subproblem in mirror descent is
minxisinC
〈vx〉+ 1
2983042x9830422p
wherev = αk(pminus 1)gk minus φ(xk)
Exercise Show by KKT conditions that the solution is given byx = ψ([τeminus v]+) where τ isin R satisfies
n983131
j=1
ψj([τeminus v]+) = 1
Binary search can be used to find τ
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 11 26
Theorem 1
Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2
for all x isin C then for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
αK+
K983131
k=1
αk
2983042gk9830422lowast
If αk = α is constant then for all K ge 1K983131
k=1
[f(xk)minus f(x983183)] le1
αDh(x983183x
1) +α
2
K983131
k=1
983042gk9830422lowast
If xK =1
K
K983131
k=1
xk or xK = argminxk
f(xk) 983042g983042lowast le M for all
g isin partf(x) for x isin C then f(xK)minus f(x983183) le1
KαDh(x983183x
1) +α
2M2
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26
Exercise Prove that the negative entropy
h(x) =
n983131
j=1
xj log xj
is 1-strongly convex with respect to the ℓ1-norm
Example Let C = x isin Rn+
983123nj=1 xj = 1 Let x1 =
1
ne The
exponentiated gradient method with fixed stepsize α guarantees
f(xK)minus f(x983183) lelog n
Kα+
α
2K
K983131
k=1
983042gk9830422infin
where xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26
Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1
2(pminus 1)983042x9830422p
with p = 1 +1
log(2n) Then
K983131
k=1
[f(xk)minus f(x983183)] le2R2
1 log(2n)
αK+
e2
2
K983131
k=1
αk983042gk9830422infin
In particular taking αk =R1
e
983157log(2n)
kand xK =
1
K
K983131
k=1
xk gives
f(xK)minus f(x983183) le3eR1
983155log(2n)radicK
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26
A simulated mirror-descent example Let
A =983045a1 middot middot middot am
983046T isin Rmtimesn
have entries drawn iid N (0 1) Let
bi =1
2(ai1 + ai2) + εi
where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define
f(x) = 983042Axminus b9830421 =m983131
i=1
|〈aix〉 minus bi|
which has subgradients ATsign(Axminus b) Consider
minx
f(x) st x isin C =
983099983103
983101x isin Rn+
n983131
j=1
xj = 1
983100983104
983102
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26
We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates
xj = [vj minus τ ]+
where τ isin R is chosen so that
n983131
j=1
xj =
n983131
j=1
[vj minus τ ]+ = 1
We use stepsize αk = α0radick where the initial stepsize α0 is chosen
to optimize the convergence guarantee for each of the methods
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26
fkbest minus f983183 versus k
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26
21 Stochastic version
Theorem 2
Let the conditions of Theorem 1 hold except that instead of receiving avector
gk isin partf(xk)
at iteration k the vector gk is a stochastic subgradient satisfying
E[gk|xk] isin partf(xk)
Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le E
983077R2
αK+
K983131
k=1
αk
2983042gk9830422lowast
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26
22 Adaptive stepsizes
Let xK =1
K
K983131
k=1
xk If Dh(xx983183) le R2 for all x isin C then
E[f(xK)minus f(x983183)] le E
983077R2
KαK+
1
K
K983131
k=1
αk
2983042gk9830422lowast
983078
If αk = α for all k we see that the choice of stepsize minimizing
R2
Kα+
α
2K
K983131
k=1
983042gk9830422lowast is α983183 =radic2R
983075K983131
k=1
983042gk9830422lowast
983076minus 12
which yields
E[f(xK)minus f(x983183)] leradic2R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
But this α983183 is not known in advance because
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26
Corollary 3
Let the conditions of Theorem 2 hold Let
αk =R983161983160983160983159
k983131
i=1
983042gi9830422lowast
which is the ldquoup to nowrdquo optimal choice Then
E[f(xK)minus f(x983183)] le3R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
where
xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26
3 Variable metric methods
The idea is to adjust the metric with which one constructs updatesto better reflect problem structure
1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk
satisfyingE[gk|xk] isin partf(xk)
2 Update positive semidefinite matrix Hk isin Rntimesn
3 Compute update
xk+1 = argminxisinC
983069〈gkx〉+ 1
2〈xminus xkHk(xminus xk)〉
983070
Special cases Subgradient method Hk =1
αkI
Newton method Hk = nabla2f(xk)
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26
Theorem 4
Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with
E[gk|xk] isin partf(xk)
Then
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078
le 1
2E
983077K983131
k=2
983059983042xk minus x9831839830422Hk
minus 983042xk minus x9831839830422Hkminus1
983060+ 983042x1 minus x9831839830422H1
983078
+1
2E
983077K983131
k=1
983042gk9830422Hminus1
k
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26
31 Adaptive gradient variable metric method
Theorem 5
LetRinfin = sup
xisinC983042xminus x983183983042infin
be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let
Hk =1
α
983075diag
983075k983131
i=1
gigiT
98307698307612
where α gt 0 is a pre-specified constant Then we have
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le
983061R2
infin2
+ α2
983062E[tr(HK)]
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26
Example Support vector machine classification problem
Consider
minx
f(x) =1
m
m983131
i=1
[1minus bi〈aix〉]+ st 983042x983042infin le 1
where the vectors ai isin minus1 0 1n with m = 5000 and n = 1000
For each coordinate j isin [n] we set aij isin plusmn1 to have a randomsign with probability 1j and aij = 0 otherwise
Let u isin minus1 1n uniformly at random We set bi = sign(〈aiu〉)with probability 095 and bi = minussign(〈aiu〉) otherwiseWe choose a stochastic gradient by selecting i isin [m] uniformly atrandom then set
gk isin part[1minus bi〈aixk〉]+
and αk = αradick
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26
f(xk)minus f983183 versus k100 various initial stepsizes
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26
f(xk)minus f983183 versus k100 best initial stepsize
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26
Example Solving problems on the probability simplex withexponentiated gradient methods
Consider the negative entropy (h is convex)
h(x) =
n983131
j=1
xj log xj
restricting x to the probability simplex x ge 0
n983131
j=1
xj = 1 Then
Dh(xy) =
n983131
j=1
[xj(log xj minus log yj) + (yj minus xj)] =
n983131
j=1
xj logxjyj
the entropic or Kullback-Leibler divergence
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 6 26
The subproblem in mirror descent is
minx
〈gx〉+n983131
j=1
xj logxjyj
st x ge 0
n983131
j=1
xj = 1
By introducing Lagrange multipliers τ isin R for the equalityconstraint and λ isin Rn
+ for the inequality constraint Then weobtain Lagrangian
L(x τλ) = 〈gx〉+n983131
j=1
983063xj log
xjyj
+ τxj minus λjxj
983064minus τ
Taking derivatives with respect to x and setting them to zero yield
xj = yj exp(minusgj minus 1minus τ + λj)
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 7 26
Taking λj = 0 by
n983131
j=1
xj = 1 we have (a special solution)
xi =yi exp(minusgi)
n983131
j=1
yj exp(minusgj)
The update in mirror descent is
xk+1i =
xki exp(minusαkgki )
n983131
j=1
xkj exp(minusαkgkj )
This is the so-called exponentiated gradient update also known asentropic mirror descent
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 8 26
Example using ℓp norms (p isin (1 2]) Consider
h(x) =1
2(pminus 1)983042x9830422p
We have
nablah(x) =1
pminus 1983042x9830422minusp
p
983045sign(x1)|x1|pminus1 middot middot middot sign(xn)|xn|pminus1
983046T
Defineφ(x) = (pminus 1)nablah(x)
and
ψ(y) = 983042y9830422minusqq
983045sign(y1)|y1|qminus1 middot middot middot sign(yn)|yn|qminus1
983046T
where q =p
pminus 1 We have ψ(φ(x)) = x and hence ψ = φminus1
and φ = ψminus1
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 9 26
The subproblem in mirror descent is
minx
〈gx〉+n983131
j=1
xj logxjyj
st x ge 0
n983131
j=1
xj = 1
By introducing Lagrange multipliers τ isin R for the equalityconstraint and λ isin Rn
+ for the inequality constraint Then weobtain Lagrangian
L(x τλ) = 〈gx〉+n983131
j=1
983063xj log
xjyj
+ τxj minus λjxj
983064minus τ
Taking derivatives with respect to x and setting them to zero yield
xj = yj exp(minusgj minus 1minus τ + λj)
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 7 26
Taking λj = 0 by
n983131
j=1
xj = 1 we have (a special solution)
xi =yi exp(minusgi)
n983131
j=1
yj exp(minusgj)
The update in mirror descent is
xk+1i =
xki exp(minusαkgki )
n983131
j=1
xkj exp(minusαkgkj )
This is the so-called exponentiated gradient update also known asentropic mirror descent
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 8 26
Example using ℓp norms (p isin (1 2]) Consider
h(x) =1
2(pminus 1)983042x9830422p
We have
nablah(x) =1
pminus 1983042x9830422minusp
p
983045sign(x1)|x1|pminus1 middot middot middot sign(xn)|xn|pminus1
983046T
Defineφ(x) = (pminus 1)nablah(x)
and
ψ(y) = 983042y9830422minusqq
983045sign(y1)|y1|qminus1 middot middot middot sign(yn)|yn|qminus1
983046T
where q =p
pminus 1 We have ψ(φ(x)) = x and hence ψ = φminus1
and φ = ψminus1
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 9 26
The mirror descent for C = Rn is
xk+1 = argminxisinRn
983069f(xk) + 〈gkxminus xk〉+ 1
αkDh(xx
k)
983070
= argminxisinRn
983069〈gkx〉+ 1
αkDh(xx
k)
983070
= (nablah)minus1(nablah(xk)minus αkgk)
= ψ(φ(xk)minus αk(pminus 1)gk)
The update involving the inverse of the gradient mapping (nablah)minus1holds more generally that is for any differentiable and strictlyconvex h
This is the original form of the mirror descent update and itjustifies the name mirror descent as the gradient is ldquomirroredrdquothrough the distance-generating function h and back again
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 10 26
The case C = x isin Rn+
n983131
j=1
xj = 1
The subproblem in mirror descent is
minxisinC
〈vx〉+ 1
2983042x9830422p
wherev = αk(pminus 1)gk minus φ(xk)
Exercise Show by KKT conditions that the solution is given byx = ψ([τeminus v]+) where τ isin R satisfies
n983131
j=1
ψj([τeminus v]+) = 1
Binary search can be used to find τ
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 11 26
Theorem 1
Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2
for all x isin C then for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
αK+
K983131
k=1
αk
2983042gk9830422lowast
If αk = α is constant then for all K ge 1K983131
k=1
[f(xk)minus f(x983183)] le1
αDh(x983183x
1) +α
2
K983131
k=1
983042gk9830422lowast
If xK =1
K
K983131
k=1
xk or xK = argminxk
f(xk) 983042g983042lowast le M for all
g isin partf(x) for x isin C then f(xK)minus f(x983183) le1
KαDh(x983183x
1) +α
2M2
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26
Exercise Prove that the negative entropy
h(x) =
n983131
j=1
xj log xj
is 1-strongly convex with respect to the ℓ1-norm
Example Let C = x isin Rn+
983123nj=1 xj = 1 Let x1 =
1
ne The
exponentiated gradient method with fixed stepsize α guarantees
f(xK)minus f(x983183) lelog n
Kα+
α
2K
K983131
k=1
983042gk9830422infin
where xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26
Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1
2(pminus 1)983042x9830422p
with p = 1 +1
log(2n) Then
K983131
k=1
[f(xk)minus f(x983183)] le2R2
1 log(2n)
αK+
e2
2
K983131
k=1
αk983042gk9830422infin
In particular taking αk =R1
e
983157log(2n)
kand xK =
1
K
K983131
k=1
xk gives
f(xK)minus f(x983183) le3eR1
983155log(2n)radicK
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26
A simulated mirror-descent example Let
A =983045a1 middot middot middot am
983046T isin Rmtimesn
have entries drawn iid N (0 1) Let
bi =1
2(ai1 + ai2) + εi
where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define
f(x) = 983042Axminus b9830421 =m983131
i=1
|〈aix〉 minus bi|
which has subgradients ATsign(Axminus b) Consider
minx
f(x) st x isin C =
983099983103
983101x isin Rn+
n983131
j=1
xj = 1
983100983104
983102
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26
We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates
xj = [vj minus τ ]+
where τ isin R is chosen so that
n983131
j=1
xj =
n983131
j=1
[vj minus τ ]+ = 1
We use stepsize αk = α0radick where the initial stepsize α0 is chosen
to optimize the convergence guarantee for each of the methods
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26
fkbest minus f983183 versus k
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26
21 Stochastic version
Theorem 2
Let the conditions of Theorem 1 hold except that instead of receiving avector
gk isin partf(xk)
at iteration k the vector gk is a stochastic subgradient satisfying
E[gk|xk] isin partf(xk)
Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le E
983077R2
αK+
K983131
k=1
αk
2983042gk9830422lowast
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26
22 Adaptive stepsizes
Let xK =1
K
K983131
k=1
xk If Dh(xx983183) le R2 for all x isin C then
E[f(xK)minus f(x983183)] le E
983077R2
KαK+
1
K
K983131
k=1
αk
2983042gk9830422lowast
983078
If αk = α for all k we see that the choice of stepsize minimizing
R2
Kα+
α
2K
K983131
k=1
983042gk9830422lowast is α983183 =radic2R
983075K983131
k=1
983042gk9830422lowast
983076minus 12
which yields
E[f(xK)minus f(x983183)] leradic2R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
But this α983183 is not known in advance because
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26
Corollary 3
Let the conditions of Theorem 2 hold Let
αk =R983161983160983160983159
k983131
i=1
983042gi9830422lowast
which is the ldquoup to nowrdquo optimal choice Then
E[f(xK)minus f(x983183)] le3R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
where
xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26
3 Variable metric methods
The idea is to adjust the metric with which one constructs updatesto better reflect problem structure
1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk
satisfyingE[gk|xk] isin partf(xk)
2 Update positive semidefinite matrix Hk isin Rntimesn
3 Compute update
xk+1 = argminxisinC
983069〈gkx〉+ 1
2〈xminus xkHk(xminus xk)〉
983070
Special cases Subgradient method Hk =1
αkI
Newton method Hk = nabla2f(xk)
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26
Theorem 4
Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with
E[gk|xk] isin partf(xk)
Then
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078
le 1
2E
983077K983131
k=2
983059983042xk minus x9831839830422Hk
minus 983042xk minus x9831839830422Hkminus1
983060+ 983042x1 minus x9831839830422H1
983078
+1
2E
983077K983131
k=1
983042gk9830422Hminus1
k
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26
31 Adaptive gradient variable metric method
Theorem 5
LetRinfin = sup
xisinC983042xminus x983183983042infin
be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let
Hk =1
α
983075diag
983075k983131
i=1
gigiT
98307698307612
where α gt 0 is a pre-specified constant Then we have
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le
983061R2
infin2
+ α2
983062E[tr(HK)]
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26
Example Support vector machine classification problem
Consider
minx
f(x) =1
m
m983131
i=1
[1minus bi〈aix〉]+ st 983042x983042infin le 1
where the vectors ai isin minus1 0 1n with m = 5000 and n = 1000
For each coordinate j isin [n] we set aij isin plusmn1 to have a randomsign with probability 1j and aij = 0 otherwise
Let u isin minus1 1n uniformly at random We set bi = sign(〈aiu〉)with probability 095 and bi = minussign(〈aiu〉) otherwiseWe choose a stochastic gradient by selecting i isin [m] uniformly atrandom then set
gk isin part[1minus bi〈aixk〉]+
and αk = αradick
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26
f(xk)minus f983183 versus k100 various initial stepsizes
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26
f(xk)minus f983183 versus k100 best initial stepsize
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26
Taking λj = 0 by
n983131
j=1
xj = 1 we have (a special solution)
xi =yi exp(minusgi)
n983131
j=1
yj exp(minusgj)
The update in mirror descent is
xk+1i =
xki exp(minusαkgki )
n983131
j=1
xkj exp(minusαkgkj )
This is the so-called exponentiated gradient update also known asentropic mirror descent
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 8 26
Example using ℓp norms (p isin (1 2]) Consider
h(x) =1
2(pminus 1)983042x9830422p
We have
nablah(x) =1
pminus 1983042x9830422minusp
p
983045sign(x1)|x1|pminus1 middot middot middot sign(xn)|xn|pminus1
983046T
Defineφ(x) = (pminus 1)nablah(x)
and
ψ(y) = 983042y9830422minusqq
983045sign(y1)|y1|qminus1 middot middot middot sign(yn)|yn|qminus1
983046T
where q =p
pminus 1 We have ψ(φ(x)) = x and hence ψ = φminus1
and φ = ψminus1
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 9 26
The mirror descent for C = Rn is
xk+1 = argminxisinRn
983069f(xk) + 〈gkxminus xk〉+ 1
αkDh(xx
k)
983070
= argminxisinRn
983069〈gkx〉+ 1
αkDh(xx
k)
983070
= (nablah)minus1(nablah(xk)minus αkgk)
= ψ(φ(xk)minus αk(pminus 1)gk)
The update involving the inverse of the gradient mapping (nablah)minus1holds more generally that is for any differentiable and strictlyconvex h
This is the original form of the mirror descent update and itjustifies the name mirror descent as the gradient is ldquomirroredrdquothrough the distance-generating function h and back again
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 10 26
The case C = x isin Rn+
n983131
j=1
xj = 1
The subproblem in mirror descent is
minxisinC
〈vx〉+ 1
2983042x9830422p
wherev = αk(pminus 1)gk minus φ(xk)
Exercise Show by KKT conditions that the solution is given byx = ψ([τeminus v]+) where τ isin R satisfies
n983131
j=1
ψj([τeminus v]+) = 1
Binary search can be used to find τ
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 11 26
Theorem 1
Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2
for all x isin C then for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
αK+
K983131
k=1
αk
2983042gk9830422lowast
If αk = α is constant then for all K ge 1K983131
k=1
[f(xk)minus f(x983183)] le1
αDh(x983183x
1) +α
2
K983131
k=1
983042gk9830422lowast
If xK =1
K
K983131
k=1
xk or xK = argminxk
f(xk) 983042g983042lowast le M for all
g isin partf(x) for x isin C then f(xK)minus f(x983183) le1
KαDh(x983183x
1) +α
2M2
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26
Exercise Prove that the negative entropy
h(x) =
n983131
j=1
xj log xj
is 1-strongly convex with respect to the ℓ1-norm
Example Let C = x isin Rn+
983123nj=1 xj = 1 Let x1 =
1
ne The
exponentiated gradient method with fixed stepsize α guarantees
f(xK)minus f(x983183) lelog n
Kα+
α
2K
K983131
k=1
983042gk9830422infin
where xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26
Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1
2(pminus 1)983042x9830422p
with p = 1 +1
log(2n) Then
K983131
k=1
[f(xk)minus f(x983183)] le2R2
1 log(2n)
αK+
e2
2
K983131
k=1
αk983042gk9830422infin
In particular taking αk =R1
e
983157log(2n)
kand xK =
1
K
K983131
k=1
xk gives
f(xK)minus f(x983183) le3eR1
983155log(2n)radicK
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26
A simulated mirror-descent example Let
A =983045a1 middot middot middot am
983046T isin Rmtimesn
have entries drawn iid N (0 1) Let
bi =1
2(ai1 + ai2) + εi
where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define
f(x) = 983042Axminus b9830421 =m983131
i=1
|〈aix〉 minus bi|
which has subgradients ATsign(Axminus b) Consider
minx
f(x) st x isin C =
983099983103
983101x isin Rn+
n983131
j=1
xj = 1
983100983104
983102
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26
We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates
xj = [vj minus τ ]+
where τ isin R is chosen so that
n983131
j=1
xj =
n983131
j=1
[vj minus τ ]+ = 1
We use stepsize αk = α0radick where the initial stepsize α0 is chosen
to optimize the convergence guarantee for each of the methods
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26
fkbest minus f983183 versus k
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26
21 Stochastic version
Theorem 2
Let the conditions of Theorem 1 hold except that instead of receiving avector
gk isin partf(xk)
at iteration k the vector gk is a stochastic subgradient satisfying
E[gk|xk] isin partf(xk)
Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le E
983077R2
αK+
K983131
k=1
αk
2983042gk9830422lowast
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26
22 Adaptive stepsizes
Let xK =1
K
K983131
k=1
xk If Dh(xx983183) le R2 for all x isin C then
E[f(xK)minus f(x983183)] le E
983077R2
KαK+
1
K
K983131
k=1
αk
2983042gk9830422lowast
983078
If αk = α for all k we see that the choice of stepsize minimizing
R2
Kα+
α
2K
K983131
k=1
983042gk9830422lowast is α983183 =radic2R
983075K983131
k=1
983042gk9830422lowast
983076minus 12
which yields
E[f(xK)minus f(x983183)] leradic2R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
But this α983183 is not known in advance because
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26
Corollary 3
Let the conditions of Theorem 2 hold Let
αk =R983161983160983160983159
k983131
i=1
983042gi9830422lowast
which is the ldquoup to nowrdquo optimal choice Then
E[f(xK)minus f(x983183)] le3R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
where
xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26
3 Variable metric methods
The idea is to adjust the metric with which one constructs updatesto better reflect problem structure
1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk
satisfyingE[gk|xk] isin partf(xk)
2 Update positive semidefinite matrix Hk isin Rntimesn
3 Compute update
xk+1 = argminxisinC
983069〈gkx〉+ 1
2〈xminus xkHk(xminus xk)〉
983070
Special cases Subgradient method Hk =1
αkI
Newton method Hk = nabla2f(xk)
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26
Theorem 4
Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with
E[gk|xk] isin partf(xk)
Then
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078
le 1
2E
983077K983131
k=2
983059983042xk minus x9831839830422Hk
minus 983042xk minus x9831839830422Hkminus1
983060+ 983042x1 minus x9831839830422H1
983078
+1
2E
983077K983131
k=1
983042gk9830422Hminus1
k
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26
31 Adaptive gradient variable metric method
Theorem 5
LetRinfin = sup
xisinC983042xminus x983183983042infin
be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let
Hk =1
α
983075diag
983075k983131
i=1
gigiT
98307698307612
where α gt 0 is a pre-specified constant Then we have
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le
983061R2
infin2
+ α2
983062E[tr(HK)]
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26
Example Support vector machine classification problem
Consider
minx
f(x) =1
m
m983131
i=1
[1minus bi〈aix〉]+ st 983042x983042infin le 1
where the vectors ai isin minus1 0 1n with m = 5000 and n = 1000
For each coordinate j isin [n] we set aij isin plusmn1 to have a randomsign with probability 1j and aij = 0 otherwise
Let u isin minus1 1n uniformly at random We set bi = sign(〈aiu〉)with probability 095 and bi = minussign(〈aiu〉) otherwiseWe choose a stochastic gradient by selecting i isin [m] uniformly atrandom then set
gk isin part[1minus bi〈aixk〉]+
and αk = αradick
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26
f(xk)minus f983183 versus k100 various initial stepsizes
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26
f(xk)minus f983183 versus k100 best initial stepsize
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26
Example using ℓp norms (p isin (1 2]) Consider
h(x) =1
2(pminus 1)983042x9830422p
We have
nablah(x) =1
pminus 1983042x9830422minusp
p
983045sign(x1)|x1|pminus1 middot middot middot sign(xn)|xn|pminus1
983046T
Defineφ(x) = (pminus 1)nablah(x)
and
ψ(y) = 983042y9830422minusqq
983045sign(y1)|y1|qminus1 middot middot middot sign(yn)|yn|qminus1
983046T
where q =p
pminus 1 We have ψ(φ(x)) = x and hence ψ = φminus1
and φ = ψminus1
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 9 26
The mirror descent for C = Rn is
xk+1 = argminxisinRn
983069f(xk) + 〈gkxminus xk〉+ 1
αkDh(xx
k)
983070
= argminxisinRn
983069〈gkx〉+ 1
αkDh(xx
k)
983070
= (nablah)minus1(nablah(xk)minus αkgk)
= ψ(φ(xk)minus αk(pminus 1)gk)
The update involving the inverse of the gradient mapping (nablah)minus1holds more generally that is for any differentiable and strictlyconvex h
This is the original form of the mirror descent update and itjustifies the name mirror descent as the gradient is ldquomirroredrdquothrough the distance-generating function h and back again
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 10 26
The case C = x isin Rn+
n983131
j=1
xj = 1
The subproblem in mirror descent is
minxisinC
〈vx〉+ 1
2983042x9830422p
wherev = αk(pminus 1)gk minus φ(xk)
Exercise Show by KKT conditions that the solution is given byx = ψ([τeminus v]+) where τ isin R satisfies
n983131
j=1
ψj([τeminus v]+) = 1
Binary search can be used to find τ
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 11 26
Theorem 1
Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2
for all x isin C then for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
αK+
K983131
k=1
αk
2983042gk9830422lowast
If αk = α is constant then for all K ge 1K983131
k=1
[f(xk)minus f(x983183)] le1
αDh(x983183x
1) +α
2
K983131
k=1
983042gk9830422lowast
If xK =1
K
K983131
k=1
xk or xK = argminxk
f(xk) 983042g983042lowast le M for all
g isin partf(x) for x isin C then f(xK)minus f(x983183) le1
KαDh(x983183x
1) +α
2M2
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26
Exercise Prove that the negative entropy
h(x) =
n983131
j=1
xj log xj
is 1-strongly convex with respect to the ℓ1-norm
Example Let C = x isin Rn+
983123nj=1 xj = 1 Let x1 =
1
ne The
exponentiated gradient method with fixed stepsize α guarantees
f(xK)minus f(x983183) lelog n
Kα+
α
2K
K983131
k=1
983042gk9830422infin
where xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26
Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1
2(pminus 1)983042x9830422p
with p = 1 +1
log(2n) Then
K983131
k=1
[f(xk)minus f(x983183)] le2R2
1 log(2n)
αK+
e2
2
K983131
k=1
αk983042gk9830422infin
In particular taking αk =R1
e
983157log(2n)
kand xK =
1
K
K983131
k=1
xk gives
f(xK)minus f(x983183) le3eR1
983155log(2n)radicK
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26
A simulated mirror-descent example Let
A =983045a1 middot middot middot am
983046T isin Rmtimesn
have entries drawn iid N (0 1) Let
bi =1
2(ai1 + ai2) + εi
where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define
f(x) = 983042Axminus b9830421 =m983131
i=1
|〈aix〉 minus bi|
which has subgradients ATsign(Axminus b) Consider
minx
f(x) st x isin C =
983099983103
983101x isin Rn+
n983131
j=1
xj = 1
983100983104
983102
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26
We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates
xj = [vj minus τ ]+
where τ isin R is chosen so that
n983131
j=1
xj =
n983131
j=1
[vj minus τ ]+ = 1
We use stepsize αk = α0radick where the initial stepsize α0 is chosen
to optimize the convergence guarantee for each of the methods
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26
fkbest minus f983183 versus k
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26
21 Stochastic version
Theorem 2
Let the conditions of Theorem 1 hold except that instead of receiving avector
gk isin partf(xk)
at iteration k the vector gk is a stochastic subgradient satisfying
E[gk|xk] isin partf(xk)
Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le E
983077R2
αK+
K983131
k=1
αk
2983042gk9830422lowast
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26
22 Adaptive stepsizes
Let xK =1
K
K983131
k=1
xk If Dh(xx983183) le R2 for all x isin C then
E[f(xK)minus f(x983183)] le E
983077R2
KαK+
1
K
K983131
k=1
αk
2983042gk9830422lowast
983078
If αk = α for all k we see that the choice of stepsize minimizing
R2
Kα+
α
2K
K983131
k=1
983042gk9830422lowast is α983183 =radic2R
983075K983131
k=1
983042gk9830422lowast
983076minus 12
which yields
E[f(xK)minus f(x983183)] leradic2R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
But this α983183 is not known in advance because
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26
Corollary 3
Let the conditions of Theorem 2 hold Let
αk =R983161983160983160983159
k983131
i=1
983042gi9830422lowast
which is the ldquoup to nowrdquo optimal choice Then
E[f(xK)minus f(x983183)] le3R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
where
xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26
3 Variable metric methods
The idea is to adjust the metric with which one constructs updatesto better reflect problem structure
1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk
satisfyingE[gk|xk] isin partf(xk)
2 Update positive semidefinite matrix Hk isin Rntimesn
3 Compute update
xk+1 = argminxisinC
983069〈gkx〉+ 1
2〈xminus xkHk(xminus xk)〉
983070
Special cases Subgradient method Hk =1
αkI
Newton method Hk = nabla2f(xk)
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26
Theorem 4
Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with
E[gk|xk] isin partf(xk)
Then
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078
le 1
2E
983077K983131
k=2
983059983042xk minus x9831839830422Hk
minus 983042xk minus x9831839830422Hkminus1
983060+ 983042x1 minus x9831839830422H1
983078
+1
2E
983077K983131
k=1
983042gk9830422Hminus1
k
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26
31 Adaptive gradient variable metric method
Theorem 5
LetRinfin = sup
xisinC983042xminus x983183983042infin
be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let
Hk =1
α
983075diag
983075k983131
i=1
gigiT
98307698307612
where α gt 0 is a pre-specified constant Then we have
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le
983061R2
infin2
+ α2
983062E[tr(HK)]
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26
Example Support vector machine classification problem
Consider
minx
f(x) =1
m
m983131
i=1
[1minus bi〈aix〉]+ st 983042x983042infin le 1
where the vectors ai isin minus1 0 1n with m = 5000 and n = 1000
For each coordinate j isin [n] we set aij isin plusmn1 to have a randomsign with probability 1j and aij = 0 otherwise
Let u isin minus1 1n uniformly at random We set bi = sign(〈aiu〉)with probability 095 and bi = minussign(〈aiu〉) otherwiseWe choose a stochastic gradient by selecting i isin [m] uniformly atrandom then set
gk isin part[1minus bi〈aixk〉]+
and αk = αradick
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26
f(xk)minus f983183 versus k100 various initial stepsizes
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26
f(xk)minus f983183 versus k100 best initial stepsize
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26
The mirror descent for C = Rn is
xk+1 = argminxisinRn
983069f(xk) + 〈gkxminus xk〉+ 1
αkDh(xx
k)
983070
= argminxisinRn
983069〈gkx〉+ 1
αkDh(xx
k)
983070
= (nablah)minus1(nablah(xk)minus αkgk)
= ψ(φ(xk)minus αk(pminus 1)gk)
The update involving the inverse of the gradient mapping (nablah)minus1holds more generally that is for any differentiable and strictlyconvex h
This is the original form of the mirror descent update and itjustifies the name mirror descent as the gradient is ldquomirroredrdquothrough the distance-generating function h and back again
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 10 26
The case C = x isin Rn+
n983131
j=1
xj = 1
The subproblem in mirror descent is
minxisinC
〈vx〉+ 1
2983042x9830422p
wherev = αk(pminus 1)gk minus φ(xk)
Exercise Show by KKT conditions that the solution is given byx = ψ([τeminus v]+) where τ isin R satisfies
n983131
j=1
ψj([τeminus v]+) = 1
Binary search can be used to find τ
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 11 26
Theorem 1
Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2
for all x isin C then for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
αK+
K983131
k=1
αk
2983042gk9830422lowast
If αk = α is constant then for all K ge 1K983131
k=1
[f(xk)minus f(x983183)] le1
αDh(x983183x
1) +α
2
K983131
k=1
983042gk9830422lowast
If xK =1
K
K983131
k=1
xk or xK = argminxk
f(xk) 983042g983042lowast le M for all
g isin partf(x) for x isin C then f(xK)minus f(x983183) le1
KαDh(x983183x
1) +α
2M2
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26
Exercise Prove that the negative entropy
h(x) =
n983131
j=1
xj log xj
is 1-strongly convex with respect to the ℓ1-norm
Example Let C = x isin Rn+
983123nj=1 xj = 1 Let x1 =
1
ne The
exponentiated gradient method with fixed stepsize α guarantees
f(xK)minus f(x983183) lelog n
Kα+
α
2K
K983131
k=1
983042gk9830422infin
where xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26
Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1
2(pminus 1)983042x9830422p
with p = 1 +1
log(2n) Then
K983131
k=1
[f(xk)minus f(x983183)] le2R2
1 log(2n)
αK+
e2
2
K983131
k=1
αk983042gk9830422infin
In particular taking αk =R1
e
983157log(2n)
kand xK =
1
K
K983131
k=1
xk gives
f(xK)minus f(x983183) le3eR1
983155log(2n)radicK
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26
A simulated mirror-descent example Let
A =983045a1 middot middot middot am
983046T isin Rmtimesn
have entries drawn iid N (0 1) Let
bi =1
2(ai1 + ai2) + εi
where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define
f(x) = 983042Axminus b9830421 =m983131
i=1
|〈aix〉 minus bi|
which has subgradients ATsign(Axminus b) Consider
minx
f(x) st x isin C =
983099983103
983101x isin Rn+
n983131
j=1
xj = 1
983100983104
983102
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26
We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates
xj = [vj minus τ ]+
where τ isin R is chosen so that
n983131
j=1
xj =
n983131
j=1
[vj minus τ ]+ = 1
We use stepsize αk = α0radick where the initial stepsize α0 is chosen
to optimize the convergence guarantee for each of the methods
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26
fkbest minus f983183 versus k
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26
21 Stochastic version
Theorem 2
Let the conditions of Theorem 1 hold except that instead of receiving avector
gk isin partf(xk)
at iteration k the vector gk is a stochastic subgradient satisfying
E[gk|xk] isin partf(xk)
Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le E
983077R2
αK+
K983131
k=1
αk
2983042gk9830422lowast
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26
22 Adaptive stepsizes
Let xK =1
K
K983131
k=1
xk If Dh(xx983183) le R2 for all x isin C then
E[f(xK)minus f(x983183)] le E
983077R2
KαK+
1
K
K983131
k=1
αk
2983042gk9830422lowast
983078
If αk = α for all k we see that the choice of stepsize minimizing
R2
Kα+
α
2K
K983131
k=1
983042gk9830422lowast is α983183 =radic2R
983075K983131
k=1
983042gk9830422lowast
983076minus 12
which yields
E[f(xK)minus f(x983183)] leradic2R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
But this α983183 is not known in advance because
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26
Corollary 3
Let the conditions of Theorem 2 hold Let
αk =R983161983160983160983159
k983131
i=1
983042gi9830422lowast
which is the ldquoup to nowrdquo optimal choice Then
E[f(xK)minus f(x983183)] le3R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
where
xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26
3 Variable metric methods
The idea is to adjust the metric with which one constructs updatesto better reflect problem structure
1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk
satisfyingE[gk|xk] isin partf(xk)
2 Update positive semidefinite matrix Hk isin Rntimesn
3 Compute update
xk+1 = argminxisinC
983069〈gkx〉+ 1
2〈xminus xkHk(xminus xk)〉
983070
Special cases Subgradient method Hk =1
αkI
Newton method Hk = nabla2f(xk)
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26
Theorem 4
Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with
E[gk|xk] isin partf(xk)
Then
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078
le 1
2E
983077K983131
k=2
983059983042xk minus x9831839830422Hk
minus 983042xk minus x9831839830422Hkminus1
983060+ 983042x1 minus x9831839830422H1
983078
+1
2E
983077K983131
k=1
983042gk9830422Hminus1
k
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26
31 Adaptive gradient variable metric method
Theorem 5
LetRinfin = sup
xisinC983042xminus x983183983042infin
be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let
Hk =1
α
983075diag
983075k983131
i=1
gigiT
98307698307612
where α gt 0 is a pre-specified constant Then we have
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le
983061R2
infin2
+ α2
983062E[tr(HK)]
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26
Example Support vector machine classification problem
Consider
minx
f(x) =1
m
m983131
i=1
[1minus bi〈aix〉]+ st 983042x983042infin le 1
where the vectors ai isin minus1 0 1n with m = 5000 and n = 1000
For each coordinate j isin [n] we set aij isin plusmn1 to have a randomsign with probability 1j and aij = 0 otherwise
Let u isin minus1 1n uniformly at random We set bi = sign(〈aiu〉)with probability 095 and bi = minussign(〈aiu〉) otherwiseWe choose a stochastic gradient by selecting i isin [m] uniformly atrandom then set
gk isin part[1minus bi〈aixk〉]+
and αk = αradick
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26
f(xk)minus f983183 versus k100 various initial stepsizes
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26
f(xk)minus f983183 versus k100 best initial stepsize
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26
The case C = x isin Rn+
n983131
j=1
xj = 1
The subproblem in mirror descent is
minxisinC
〈vx〉+ 1
2983042x9830422p
wherev = αk(pminus 1)gk minus φ(xk)
Exercise Show by KKT conditions that the solution is given byx = ψ([τeminus v]+) where τ isin R satisfies
n983131
j=1
ψj([τeminus v]+) = 1
Binary search can be used to find τ
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 11 26
Theorem 1
Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2
for all x isin C then for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
αK+
K983131
k=1
αk
2983042gk9830422lowast
If αk = α is constant then for all K ge 1K983131
k=1
[f(xk)minus f(x983183)] le1
αDh(x983183x
1) +α
2
K983131
k=1
983042gk9830422lowast
If xK =1
K
K983131
k=1
xk or xK = argminxk
f(xk) 983042g983042lowast le M for all
g isin partf(x) for x isin C then f(xK)minus f(x983183) le1
KαDh(x983183x
1) +α
2M2
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26
Exercise Prove that the negative entropy
h(x) =
n983131
j=1
xj log xj
is 1-strongly convex with respect to the ℓ1-norm
Example Let C = x isin Rn+
983123nj=1 xj = 1 Let x1 =
1
ne The
exponentiated gradient method with fixed stepsize α guarantees
f(xK)minus f(x983183) lelog n
Kα+
α
2K
K983131
k=1
983042gk9830422infin
where xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26
Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1
2(pminus 1)983042x9830422p
with p = 1 +1
log(2n) Then
K983131
k=1
[f(xk)minus f(x983183)] le2R2
1 log(2n)
αK+
e2
2
K983131
k=1
αk983042gk9830422infin
In particular taking αk =R1
e
983157log(2n)
kand xK =
1
K
K983131
k=1
xk gives
f(xK)minus f(x983183) le3eR1
983155log(2n)radicK
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26
A simulated mirror-descent example Let
A =983045a1 middot middot middot am
983046T isin Rmtimesn
have entries drawn iid N (0 1) Let
bi =1
2(ai1 + ai2) + εi
where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define
f(x) = 983042Axminus b9830421 =m983131
i=1
|〈aix〉 minus bi|
which has subgradients ATsign(Axminus b) Consider
minx
f(x) st x isin C =
983099983103
983101x isin Rn+
n983131
j=1
xj = 1
983100983104
983102
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26
We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates
xj = [vj minus τ ]+
where τ isin R is chosen so that
n983131
j=1
xj =
n983131
j=1
[vj minus τ ]+ = 1
We use stepsize αk = α0radick where the initial stepsize α0 is chosen
to optimize the convergence guarantee for each of the methods
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26
fkbest minus f983183 versus k
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26
21 Stochastic version
Theorem 2
Let the conditions of Theorem 1 hold except that instead of receiving avector
gk isin partf(xk)
at iteration k the vector gk is a stochastic subgradient satisfying
E[gk|xk] isin partf(xk)
Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le E
983077R2
αK+
K983131
k=1
αk
2983042gk9830422lowast
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26
22 Adaptive stepsizes
Let xK =1
K
K983131
k=1
xk If Dh(xx983183) le R2 for all x isin C then
E[f(xK)minus f(x983183)] le E
983077R2
KαK+
1
K
K983131
k=1
αk
2983042gk9830422lowast
983078
If αk = α for all k we see that the choice of stepsize minimizing
R2
Kα+
α
2K
K983131
k=1
983042gk9830422lowast is α983183 =radic2R
983075K983131
k=1
983042gk9830422lowast
983076minus 12
which yields
E[f(xK)minus f(x983183)] leradic2R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
But this α983183 is not known in advance because
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26
Corollary 3
Let the conditions of Theorem 2 hold Let
αk =R983161983160983160983159
k983131
i=1
983042gi9830422lowast
which is the ldquoup to nowrdquo optimal choice Then
E[f(xK)minus f(x983183)] le3R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
where
xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26
3 Variable metric methods
The idea is to adjust the metric with which one constructs updatesto better reflect problem structure
1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk
satisfyingE[gk|xk] isin partf(xk)
2 Update positive semidefinite matrix Hk isin Rntimesn
3 Compute update
xk+1 = argminxisinC
983069〈gkx〉+ 1
2〈xminus xkHk(xminus xk)〉
983070
Special cases Subgradient method Hk =1
αkI
Newton method Hk = nabla2f(xk)
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26
Theorem 4
Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with
E[gk|xk] isin partf(xk)
Then
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078
le 1
2E
983077K983131
k=2
983059983042xk minus x9831839830422Hk
minus 983042xk minus x9831839830422Hkminus1
983060+ 983042x1 minus x9831839830422H1
983078
+1
2E
983077K983131
k=1
983042gk9830422Hminus1
k
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26
31 Adaptive gradient variable metric method
Theorem 5
LetRinfin = sup
xisinC983042xminus x983183983042infin
be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let
Hk =1
α
983075diag
983075k983131
i=1
gigiT
98307698307612
where α gt 0 is a pre-specified constant Then we have
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le
983061R2
infin2
+ α2
983062E[tr(HK)]
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26
Example Support vector machine classification problem
Consider
minx
f(x) =1
m
m983131
i=1
[1minus bi〈aix〉]+ st 983042x983042infin le 1
where the vectors ai isin minus1 0 1n with m = 5000 and n = 1000
For each coordinate j isin [n] we set aij isin plusmn1 to have a randomsign with probability 1j and aij = 0 otherwise
Let u isin minus1 1n uniformly at random We set bi = sign(〈aiu〉)with probability 095 and bi = minussign(〈aiu〉) otherwiseWe choose a stochastic gradient by selecting i isin [m] uniformly atrandom then set
gk isin part[1minus bi〈aixk〉]+
and αk = αradick
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26
f(xk)minus f983183 versus k100 various initial stepsizes
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26
f(xk)minus f983183 versus k100 best initial stepsize
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26
Theorem 1
Let αk gt 0 be any sequence of non-increasing stepsizes Assume that his strongly convex with respect to 983042 middot 983042 Let 983042 middot 983042lowast denote the dual normLet xk be generated by the mirror descent iteration If Dh(xx983183) le R2
for all x isin C then for all K ge 1
K983131
k=1
[f(xk)minus f(x983183)] leR2
αK+
K983131
k=1
αk
2983042gk9830422lowast
If αk = α is constant then for all K ge 1K983131
k=1
[f(xk)minus f(x983183)] le1
αDh(x983183x
1) +α
2
K983131
k=1
983042gk9830422lowast
If xK =1
K
K983131
k=1
xk or xK = argminxk
f(xk) 983042g983042lowast le M for all
g isin partf(x) for x isin C then f(xK)minus f(x983183) le1
KαDh(x983183x
1) +α
2M2
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 12 26
Exercise Prove that the negative entropy
h(x) =
n983131
j=1
xj log xj
is 1-strongly convex with respect to the ℓ1-norm
Example Let C = x isin Rn+
983123nj=1 xj = 1 Let x1 =
1
ne The
exponentiated gradient method with fixed stepsize α guarantees
f(xK)minus f(x983183) lelog n
Kα+
α
2K
K983131
k=1
983042gk9830422infin
where xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26
Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1
2(pminus 1)983042x9830422p
with p = 1 +1
log(2n) Then
K983131
k=1
[f(xk)minus f(x983183)] le2R2
1 log(2n)
αK+
e2
2
K983131
k=1
αk983042gk9830422infin
In particular taking αk =R1
e
983157log(2n)
kand xK =
1
K
K983131
k=1
xk gives
f(xK)minus f(x983183) le3eR1
983155log(2n)radicK
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26
A simulated mirror-descent example Let
A =983045a1 middot middot middot am
983046T isin Rmtimesn
have entries drawn iid N (0 1) Let
bi =1
2(ai1 + ai2) + εi
where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define
f(x) = 983042Axminus b9830421 =m983131
i=1
|〈aix〉 minus bi|
which has subgradients ATsign(Axminus b) Consider
minx
f(x) st x isin C =
983099983103
983101x isin Rn+
n983131
j=1
xj = 1
983100983104
983102
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26
We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates
xj = [vj minus τ ]+
where τ isin R is chosen so that
n983131
j=1
xj =
n983131
j=1
[vj minus τ ]+ = 1
We use stepsize αk = α0radick where the initial stepsize α0 is chosen
to optimize the convergence guarantee for each of the methods
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26
fkbest minus f983183 versus k
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26
21 Stochastic version
Theorem 2
Let the conditions of Theorem 1 hold except that instead of receiving avector
gk isin partf(xk)
at iteration k the vector gk is a stochastic subgradient satisfying
E[gk|xk] isin partf(xk)
Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le E
983077R2
αK+
K983131
k=1
αk
2983042gk9830422lowast
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26
22 Adaptive stepsizes
Let xK =1
K
K983131
k=1
xk If Dh(xx983183) le R2 for all x isin C then
E[f(xK)minus f(x983183)] le E
983077R2
KαK+
1
K
K983131
k=1
αk
2983042gk9830422lowast
983078
If αk = α for all k we see that the choice of stepsize minimizing
R2
Kα+
α
2K
K983131
k=1
983042gk9830422lowast is α983183 =radic2R
983075K983131
k=1
983042gk9830422lowast
983076minus 12
which yields
E[f(xK)minus f(x983183)] leradic2R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
But this α983183 is not known in advance because
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26
Corollary 3
Let the conditions of Theorem 2 hold Let
αk =R983161983160983160983159
k983131
i=1
983042gi9830422lowast
which is the ldquoup to nowrdquo optimal choice Then
E[f(xK)minus f(x983183)] le3R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
where
xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 20 26
3 Variable metric methods
The idea is to adjust the metric with which one constructs updatesto better reflect problem structure
1 Choose subgradient gk isin partf(xk) or stochastic subgradient gk
satisfyingE[gk|xk] isin partf(xk)
2 Update positive semidefinite matrix Hk isin Rntimesn
3 Compute update
xk+1 = argminxisinC
983069〈gkx〉+ 1
2〈xminus xkHk(xminus xk)〉
983070
Special cases Subgradient method Hk =1
αkI
Newton method Hk = nabla2f(xk)
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 21 26
Theorem 4
Let Hk be a sequence of positive definite matrices where Hk is afunction of g1 middot middot middot gk (and potentially some additional randomness)Let gk be (stochastic) subgradients with
E[gk|xk] isin partf(xk)
Then
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078
le 1
2E
983077K983131
k=2
983059983042xk minus x9831839830422Hk
minus 983042xk minus x9831839830422Hkminus1
983060+ 983042x1 minus x9831839830422H1
983078
+1
2E
983077K983131
k=1
983042gk9830422Hminus1
k
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 22 26
31 Adaptive gradient variable metric method
Theorem 5
LetRinfin = sup
xisinC983042xminus x983183983042infin
be the ℓinfin radius of the set C and the conditions of Theorem 4 hold Let
Hk =1
α
983075diag
983075k983131
i=1
gigiT
98307698307612
where α gt 0 is a pre-specified constant Then we have
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le
983061R2
infin2
+ α2
983062E[tr(HK)]
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 23 26
Example Support vector machine classification problem
Consider
minx
f(x) =1
m
m983131
i=1
[1minus bi〈aix〉]+ st 983042x983042infin le 1
where the vectors ai isin minus1 0 1n with m = 5000 and n = 1000
For each coordinate j isin [n] we set aij isin plusmn1 to have a randomsign with probability 1j and aij = 0 otherwise
Let u isin minus1 1n uniformly at random We set bi = sign(〈aiu〉)with probability 095 and bi = minussign(〈aiu〉) otherwiseWe choose a stochastic gradient by selecting i isin [m] uniformly atrandom then set
gk isin part[1minus bi〈aixk〉]+
and αk = αradick
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 24 26
f(xk)minus f983183 versus k100 various initial stepsizes
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 25 26
f(xk)minus f983183 versus k100 best initial stepsize
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 26 26
Exercise Prove that the negative entropy
h(x) =
n983131
j=1
xj log xj
is 1-strongly convex with respect to the ℓ1-norm
Example Let C = x isin Rn+
983123nj=1 xj = 1 Let x1 =
1
ne The
exponentiated gradient method with fixed stepsize α guarantees
f(xK)minus f(x983183) lelog n
Kα+
α
2K
K983131
k=1
983042gk9830422infin
where xK =1
K
K983131
k=1
xk
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 13 26
Example Let C sub x isin Rn 983042x9830421 le R1 Let h(x) =1
2(pminus 1)983042x9830422p
with p = 1 +1
log(2n) Then
K983131
k=1
[f(xk)minus f(x983183)] le2R2
1 log(2n)
αK+
e2
2
K983131
k=1
αk983042gk9830422infin
In particular taking αk =R1
e
983157log(2n)
kand xK =
1
K
K983131
k=1
xk gives
f(xK)minus f(x983183) le3eR1
983155log(2n)radicK
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 14 26
A simulated mirror-descent example Let
A =983045a1 middot middot middot am
983046T isin Rmtimesn
have entries drawn iid N (0 1) Let
bi =1
2(ai1 + ai2) + εi
where εi drawn iid N (0 10minus2) and n = 3000 m = 20 Define
f(x) = 983042Axminus b9830421 =m983131
i=1
|〈aix〉 minus bi|
which has subgradients ATsign(Axminus b) Consider
minx
f(x) st x isin C =
983099983103
983101x isin Rn+
n983131
j=1
xj = 1
983100983104
983102
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 15 26
We compare the subgradient method to exponentiated gradientdescent for this problem noting that the Euclidean projection of avector v isin Rn to the set C has coordinates
xj = [vj minus τ ]+
where τ isin R is chosen so that
n983131
j=1
xj =
n983131
j=1
[vj minus τ ]+ = 1
We use stepsize αk = α0radick where the initial stepsize α0 is chosen
to optimize the convergence guarantee for each of the methods
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 16 26
fkbest minus f983183 versus k
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 17 26
21 Stochastic version
Theorem 2
Let the conditions of Theorem 1 hold except that instead of receiving avector
gk isin partf(xk)
at iteration k the vector gk is a stochastic subgradient satisfying
E[gk|xk] isin partf(xk)
Then for any non-increasing stepsize sequence αk (where αk may bechosen dependent on g1 middot middot middot gk)
E
983077K983131
k=1
(f(xk)minus f(x983183))
983078le E
983077R2
αK+
K983131
k=1
αk
2983042gk9830422lowast
983078
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 18 26
22 Adaptive stepsizes
Let xK =1
K
K983131
k=1
xk If Dh(xx983183) le R2 for all x isin C then
E[f(xK)minus f(x983183)] le E
983077R2
KαK+
1
K
K983131
k=1
αk
2983042gk9830422lowast
983078
If αk = α for all k we see that the choice of stepsize minimizing
R2
Kα+
α
2K
K983131
k=1
983042gk9830422lowast is α983183 =radic2R
983075K983131
k=1
983042gk9830422lowast
983076minus 12
which yields
E[f(xK)minus f(x983183)] leradic2R
KE
983093
983095983075
K983131
k=1
983042gk9830422lowast
983076 12
983094
983096
But this α983183 is not known in advance because
Mirror Descent and Variable Metrics DAMC Lecture 11 May 8 - 13 2020 19 26
Corollary 3
Let the conditions of Theorem 2 hold, and let
$\alpha_k = R\left(\sum_{i=1}^k \|g^i\|_*^2\right)^{-1/2},$
which is the "up to now" optimal choice. Then
$E[f(\bar{x}^K) - f(x_\star)] \le \frac{3R}{K}\, E\left[\left(\sum_{k=1}^K \|g^k\|_*^2\right)^{1/2}\right],$
where $\bar{x}^K = \frac{1}{K}\sum_{k=1}^K x^k$.
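In the Euclidean case ($h = \frac{1}{2}\|\cdot\|_2^2$, so $\|\cdot\|_* = \|\cdot\|_2$), a minimal sketch of projected subgradient descent with this adaptive rule; `subgrad` and `project` are assumed oracles:

import numpy as np

def adaptive_stepsize_method(subgrad, project, x1, R, K):
    x, avg, sq_sum = x1.copy(), np.zeros_like(x1), 0.0
    for _ in range(K):
        g = subgrad(x)
        sq_sum += g @ g                          # sum_{i<=k} ||g^i||_2^2
        alpha = R / np.sqrt(max(sq_sum, 1e-12))  # "up to now" optimal stepsize
        x = project(x - alpha * g)
        avg += x
    return avg / K                               # averaged iterate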
3 Variable metric methods

The idea is to adjust the metric with which one constructs updates, to better reflect problem structure.
1. Choose a subgradient $g^k \in \partial f(x^k)$, or a stochastic subgradient $g^k$ satisfying $E[g^k \mid x^k] \in \partial f(x^k)$.
2. Update a positive semidefinite matrix $H_k \in \mathbb{R}^{n\times n}$.
3. Compute the update
$x^{k+1} = \operatorname*{argmin}_{x\in C}\left\{\langle g^k, x\rangle + \frac{1}{2}\langle x - x^k, H_k(x - x^k)\rangle\right\}.$
Special cases: the subgradient method, $H_k = \frac{1}{\alpha_k}I$; Newton's method, $H_k = \nabla^2 f(x^k)$. A sketch of the unconstrained update follows.
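For unconstrained $C = \mathbb{R}^n$, the argmin is available in closed form, $x^{k+1} = x^k - H_k^{-1}g^k$; a minimal sketch:

import numpy as np

def variable_metric_step(x, g, H):
    # Minimizing <g, x> + 0.5 <x - x^k, H (x - x^k)> over all of R^n
    # gives x^{k+1} = x^k - H^{-1} g.
    return x - np.linalg.solve(H, g)

# H = I / alpha_k recovers the subgradient step x - alpha_k * g;
# H = the Hessian of f at x recovers a Newton step.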
Theorem 4
Let $H_k$ be a sequence of positive definite matrices, where $H_k$ is a function of $g^1, \cdots, g^k$ (and potentially some additional randomness). Let $g^k$ be (stochastic) subgradients with $E[g^k \mid x^k] \in \partial f(x^k)$. Then
$E\left[\sum_{k=1}^K \big(f(x^k) - f(x_\star)\big)\right] \le \frac{1}{2} E\left[\sum_{k=2}^K \Big(\|x^k - x_\star\|_{H_k}^2 - \|x^k - x_\star\|_{H_{k-1}}^2\Big) + \|x^1 - x_\star\|_{H_1}^2\right] + \frac{1}{2} E\left[\sum_{k=1}^K \|g^k\|_{H_k^{-1}}^2\right].$
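As a sanity check (my own remark, under the extra assumptions $\|x^k - x_\star\|_2 \le R$ and non-increasing $\alpha_k$): the choice $H_k = \frac{1}{\alpha_k}I$ makes the first bracket telescope,
$\sum_{k=2}^K \left(\frac{1}{\alpha_k} - \frac{1}{\alpha_{k-1}}\right)\|x^k - x_\star\|_2^2 + \frac{1}{\alpha_1}\|x^1 - x_\star\|_2^2 \le \frac{R^2}{\alpha_K},$
while the second term becomes $\frac{1}{2}\sum_{k=1}^K \alpha_k\|g^k\|_2^2$, recovering the Euclidean case of Theorem 2.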
3.1 Adaptive gradient variable metric method

Theorem 5
Let
$R_\infty = \sup_{x\in C}\|x - x_\star\|_\infty$
be the $\ell_\infty$ radius of the set $C$, and let the conditions of Theorem 4 hold. Let
$H_k = \frac{1}{\alpha}\,\mathrm{diag}\left(\sum_{i=1}^k g^i (g^i)^T\right)^{1/2},$
where $\alpha > 0$ is a pre-specified constant. Then we have
$E\left[\sum_{k=1}^K \big(f(x^k) - f(x_\star)\big)\right] \le \left(\frac{R_\infty^2}{2} + \alpha^2\right) E[\mathrm{tr}(H_K)].$
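This is the diagonal AdaGrad method. Below is a minimal sketch for the box constraint $C = \{x : \|x\|_\infty \le 1\}$ used in the example that follows; because $H_k$ is diagonal, the $H_k$-weighted projection onto the box separates coordinate-wise and reduces to clipping. The oracle `stoch_subgrad` and the small `eps` floor are my own assumptions:

import numpy as np

def adagrad(stoch_subgrad, n, alpha, K, eps=1e-12):
    x = np.zeros(n)
    g_sq = np.zeros(n)                 # diagonal of sum_i g^i (g^i)^T
    avg = np.zeros(n)
    for _ in range(K):
        g = stoch_subgrad(x)
        g_sq += g * g
        h = np.sqrt(g_sq) / alpha      # diagonal of H_k
        x = x - g / (h + eps)          # variable metric step x - H_k^{-1} g
        x = np.clip(x, -1.0, 1.0)      # projection onto ||x||_inf <= 1
        avg += x
    return avg / K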
Example: Support vector machine classification problem

Consider
$\min_x\ f(x) = \frac{1}{m}\sum_{i=1}^m [1 - b_i\langle a_i, x\rangle]_+ \quad \text{s.t. } \|x\|_\infty \le 1,$
where the vectors $a_i \in \{-1, 0, 1\}^n$, with $m = 5000$ and $n = 1000$.
For each coordinate $j \in [n]$, we set $a_{ij} \in \{\pm 1\}$ with a random sign with probability $1/j$, and $a_{ij} = 0$ otherwise.
Let $u \in \{-1,1\}^n$ be drawn uniformly at random. We set $b_i = \mathrm{sign}(\langle a_i, u\rangle)$ with probability $0.95$ and $b_i = -\mathrm{sign}(\langle a_i, u\rangle)$ otherwise.
We choose a stochastic gradient by selecting $i \in [m]$ uniformly at random, then set
$g^k \in \partial[1 - b_i\langle a_i, x^k\rangle]_+$
and $\alpha_k = \alpha/\sqrt{k}$.
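A sketch of this data-generating process and the resulting stochastic subgradient (the seed and names are my own; the single-sample hinge subgradient is an unbiased element of $\partial f$):

import numpy as np

rng = np.random.default_rng(0)
m, n = 5000, 1000
keep = rng.random((m, n)) < 1.0 / np.arange(1, n + 1)  # P(a_ij != 0) = 1/j
A = keep * rng.choice([-1.0, 1.0], size=(m, n))
u = rng.choice([-1.0, 1.0], size=n)
correct = rng.random(m) < 0.95
b = np.where(correct, 1.0, -1.0) * np.sign(A @ u)      # rare sign(0)=0 ties ignored

def stoch_subgrad(x):
    i = rng.integers(m)                   # i uniform in [m]
    if 1.0 - b[i] * (A[i] @ x) > 0.0:     # hinge loss is active at x
        return -b[i] * A[i]
    return np.zeros(n)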
[Figure: $f(x^k) - f_\star$ versus $k/100$, various initial stepsizes]
[Figure: $f(x^k) - f_\star$ versus $k/100$, best initial stepsize]