Back to GD
We've seen that if ‖∇f(x)‖ ≤ L for all x and f is convex, then with step size µ proportional to ε, for an error ε we require on the order of L²/ε² iterations.
Can we do better if we assume more about f?
Theorem: If f is convex and L-smooth, then GD with step size µ = 1/L satisfies
f(x^(t)) − f(x*) ≤ (L / (2t)) ‖x^(0) − x*‖².
We won't prove this theorem, but a key element in its proof is the fact that for L-smooth convex functions we have
f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2)‖y − x‖².
With L-smoothness we only need on the order of L/ε iterations to get ε error.
FYI: one can modify GD to get projected GD for L-smooth functions, with the same convergence rate, for solving min f(x) subject to x ∈ C with C convex.
Picking µ in Practice
The convergence theorems we've seen so far require knowing the Lipschitz constant or smoothness constant associated with f(x) in order to set µ.
This is not always possible in practice. Idea: use the best µ^(t) at every iteration:
x^(t+1) = x^(t) − µ^(t) ∇f(x^(t)), picking µ^(t) to minimize f(x^(t) − µ ∇f(x^(t))) over µ (the variable we're optimizing here is µ^(t)).
Often solving this exactly is hard, so we settle for an approximate solution.
Possible solution: set µ^(t) using a backtracking line search at every iteration.
Here, for any descent direction p (i.e. ∇f(x)ᵀp < 0), we have for small enough δ > 0 that
f(x) + δ ∇f(x)ᵀp ≤ f(x + δp) ≤ f(x) + γ δ ∇f(x)ᵀp ≤ f(x)   (*)
(the first inequality because f is convex, the others because p is a descent direction and δ is small).
Note that p = −∇f(x) is a descent direction, so there is a small constant δ > 0 that makes this inequality true.
Take p = −∇f(x) as in GD and plug it into (*): we expect that if µ is small enough,
f(x − µ∇f(x)) ≤ f(x) − γ µ ‖∇f(x)‖².
Idea: Fix γ, say γ = 0.5, and start with e.g. µ = 1. Check whether f(x − µ∇f(x)) is smaller than f(x) − γ µ ‖∇f(x)‖²; if not, shrink µ, e.g. µ ← 0.8µ, trying µ, 0.8µ, 0.8²µ, ... in turn.
Example by picture
[Sketch: f along the ray x^(t) − µ∇f(x^(t)) in the direction −∇f(x^(t)). The trials µ = 1 and µ = 0.8 land above f(x^(t)) − γµ‖∇f(x^(t))‖² (too big, still too big), while µ = 0.8² satisfies the condition (good).]
More precisely:
Backtracking line search: Pick β, γ ∈ (0, 1). At each GD step t:
1. Set v = ∇f(x^(t)).
2. Set µ = 1.
3. If f(x^(t) − µv) ≤ f(x^(t)) − γ µ ‖v‖², then keep µ.
   Else set µ ← βµ.
4. Repeat step 3 until the condition holds.
Remark: The condition
f(x^(t) − µ∇f(x^(t))) ≤ f(x^(t)) − γ µ ‖∇f(x^(t))‖²
is known as the Armijo condition, and it guarantees that the function value decreases by some non-zero amount.
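As a concrete sketch (not from the notes; the function arguments, parameter defaults, and fixed iteration count are my own choices for illustration), the backtracking GD loop can be written in Python as:

```python
import numpy as np

def gd_backtracking(f, grad, x0, gamma=0.5, beta=0.8, n_steps=100):
    """Gradient descent where each step size mu is found by backtracking
    until the Armijo condition f(x - mu*v) <= f(x) - gamma*mu*||v||^2 holds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        v = grad(x)              # 1. descent direction is -v
        mu = 1.0                 # 2. start with mu = 1
        while f(x - mu * v) > f(x) - gamma * mu * np.dot(v, v):
            mu *= beta           # 3./4. shrink mu until the Armijo condition holds
        x = x - mu * v           # accepted GD step
    return x
```

The inner while loop is exactly steps 2-4 above; the outer loop just runs a fixed number of GD steps.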
Example: Consider the function f: R² → R with
f(x₁, x₂) = (x₁ − 1)⁴ + (x₁ − x₂ − 1)²,
which has minimizer (1, 0).
Its gradient is ∇f(x) = (4(x₁ − 1)³ + 2(x₁ − x₂ − 1), −2(x₁ − x₂ − 1))ᵀ. Suppose x^(0) = (0, 0), so that f(x^(0)) = 2 and ∇f(x^(0)) = (−6, 2).
We want to pick µ to minimize
f(x^(0) − µ∇f(x^(0))) = f(6µ, −2µ) = (6µ − 1)⁴ + (8µ − 1)².
This is a nonlinear equation in µ that can be hard to minimize exactly.
So let's run backtracking line search with γ = 0.5 and β = 0.8.
We have ∇f(x^(0)) = (−6, 2) and ‖∇f(x^(0))‖² = 40, and we look for µ so that x^(1) = x^(0) − µ∇f(x^(0)) satisfies the Armijo condition.
So we try µ = 1: x^(0) − ∇f(x^(0)) = (6, −2), and
f(6, −2) = 674, which is not ≤ f(x^(0)) − γ · 1 · ‖∇f(x^(0))‖² = 2 − 20 = −18, so µ = 1 is too big.
µ = 0.8: f(x^(0) − 0.8∇f(x^(0))) ≈ 237.67 > 2 − 0.5 · 0.8 · 40 = −14, same issue.
µ = 0.8²: f(x^(0) − 0.8²∇f(x^(0))) ≈ 82 > 2 − 0.5 · 0.8² · 40 = −10.8, same issue.
Finally, µ = 0.8¹¹ ≈ 0.0859 gives
f(x^(0) − µ∇f(x^(0))) ≈ 0.1530 ≤ f(x^(0)) − 0.5 · 0.8¹¹ · ‖∇f(x^(0))‖² ≈ 0.282.
So we choose µ = 0.8¹¹ and set
x^(1) = x^(0) − 0.8¹¹ ∇f(x^(0)) ≈ (0.5154, −0.1718).
Then we repeat the process at x^(1) to obtain x^(2), and so on.
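A quick numerical check (not in the notes), using the same f and ∇f as in this example, that reproduces the first backtracking step:

```python
import numpy as np

f = lambda x: (x[0] - 1)**4 + (x[0] - x[1] - 1)**2
grad = lambda x: np.array([4*(x[0] - 1)**3 + 2*(x[0] - x[1] - 1),
                           -2*(x[0] - x[1] - 1)])

x0, gamma, beta = np.zeros(2), 0.5, 0.8
v = grad(x0)                                  # (-6, 2)
mu, k = 1.0, 0
while f(x0 - mu * v) > f(x0) - gamma * mu * np.dot(v, v):
    mu *= beta                                # try 1, 0.8, 0.8^2, ...
    k += 1
print(k, f(x0 - mu * v), x0 - mu * v)         # 11, ~0.1530, ~(0.5154, -0.1718)
```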
FYI: for an L-smooth convex function with µ^(t) set by backtracking line search as above, we have
f(x^(t)) − f(x*) ≤ ‖x^(0) − x*‖² / (2 µ_min t), where µ_min = min(1, β/L),
so we still get f(x^(t)) − f(x*) ≤ O(1/t).
More can be said when we assume more about the function (e.g. strong convexity), and more can be said about line search methods in general. We may return to these topics later if time permits.
Newton's Method
So far we have used GD, which was derived from the 1st-order Taylor approximation
f(x) ≈ f(x^(t)) + ∇f(x^(t))ᵀ(x − x^(t)).
We want this to be as small (as negative) as possible, so we pick x = x^(t) − µ∇f(x^(t)) to get
f(x) ≈ f(x^(t)) − µ‖∇f(x^(t))‖².
If instead we use a 2nd-order Taylor approximation, we obtain Newton's method:
f(x) ≈ f(x^(t)) + ∇f(x^(t))ᵀ(x − x^(t)) + ½ (x − x^(t))ᵀ ∇²f(x^(t)) (x − x^(t)).
At its minimum we expect the gradient of the right-hand side to be 0, because the right-hand side is convex (when ∇²f(x^(t)) ⪰ 0). So, taking the gradient of both sides and setting it to 0,
0 = ∇f(x^(t)) + ∇²f(x^(t)) (x − x^(t))
at the minimizer, so we set
x^(t+1) = x^(t) − [∇²f(x^(t))]⁻¹ ∇f(x^(t)).

Example: Let f: (0, ∞) → R be given by f(x) = x − ln(x).
Then f'(x) = 1 − 1/x and f''(x) = 1/x².
Newton's method starting at x^(0) = 0.5 generates the iterates
x^(t+1) = x^(t) − [f''(x^(t))]⁻¹ f'(x^(t)) = x^(t) − (x^(t))² (1 − 1/x^(t)) = 2x^(t) − (x^(t))²,
so
x^(1) = 2(0.5) − (0.5)² = 1 − 0.25 = 0.75
x^(2) = 0.9375
x^(3) ≈ 0.9961
x^(4) ≈ 0.99998.
Note that the optimum is x* = 1.
If this seems fast, it is not a coincidence.
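A small sketch (not in the notes) reproducing these iterates:

```python
fprime = lambda x: 1 - 1 / x        # f(x) = x - ln(x)
fsecond = lambda x: 1 / x**2

x = 0.5
for t in range(4):
    x = x - fprime(x) / fsecond(x)  # Newton step, equivalently x <- 2x - x^2
    print(t + 1, x)                 # 0.75, 0.9375, 0.99609..., 0.99998...
```

Note how the number of correct digits roughly doubles at each iteration.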
Definition: For a matrix M,
‖M‖ = max over v ≠ 0 of ‖Mv‖ / ‖v‖   (length of Mv over length of v).
‖M‖ measures how much the matrix M can stretch vectors. A consequence of the definition is that ‖Mz‖ ≤ ‖M‖ ‖z‖ for every vector z.
Theorem: Suppose f has a local minimum at x*, so that ∇f(x*) = 0. Suppose further that
‖[∇²f(x)]⁻¹‖ ≤ 1/h for some h > 0, and
‖∇²f(x) − ∇²f(y)‖ ≤ L ‖x − y‖ for all x, y.
Then Newton's method satisfies
‖x^(t+1) − x*‖ ≤ (L / (2h)) ‖x^(t) − x*‖².
Loose interpretation: If we start close enough to a local minimum and the function is nice, we converge quickly (quadratically) to the minimum.
Example: We want to minimize
f(x) = x₁⁴ + 2x₁²x₂² + x₂⁴
using Newton's method, so we need ∇f(x) and ∇²f(x):
∇f(x) = (4x₁³ + 4x₁x₂², 4x₁²x₂ + 4x₂³)ᵀ
∇²f(x) = [[12x₁² + 4x₂², 8x₁x₂], [8x₁x₂, 4x₁² + 12x₂²]].
Suppose we start at x^(0) = (1, 1). Then
x^(1) = x^(0) − [∇²f(x^(0))]⁻¹ ∇f(x^(0)) = (1, 1) − [[16, 8], [8, 16]]⁻¹ (8, 8) = (1, 1) − (1/3, 1/3) = (2/3, 2/3).
Continuing this way, x^(t) = (2/3)^t (1, 1), which approaches the minimizer (0, 0) exponentially fast.
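A short numerical check of this example (not in the notes; gradient and Hessian as written above):

```python
import numpy as np

grad = lambda x: np.array([4*x[0]**3 + 4*x[0]*x[1]**2,
                           4*x[0]**2*x[1] + 4*x[1]**3])
hess = lambda x: np.array([[12*x[0]**2 + 4*x[1]**2, 8*x[0]*x[1]],
                           [8*x[0]*x[1], 4*x[0]**2 + 12*x[1]**2]])

x = np.array([1.0, 1.0])
for t in range(5):
    x = x - np.linalg.solve(hess(x), grad(x))  # Newton step
    print(t + 1, x)                            # iterates are (2/3)^t * (1, 1)
```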
Example: Consider
f(x) = ¼x⁴ − x² + 2x,
and start at x^(0) = 0. Here
∇f(x) = x³ − 2x + 2 and ∇²f(x) = 3x² − 2,
so
x^(1) = x^(0) − [∇²f(x^(0))]⁻¹ ∇f(x^(0)) = 0 − 2/(−2) = 1
x^(2) = x^(1) − [∇²f(x^(1))]⁻¹ ∇f(x^(1)) = 1 − 1/1 = 0 = x^(0).
We are back to x^(0), so we have entered a cycle: Newton's method need not always converge.
Why does this not contradict the theorem?
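A tiny sketch (not in the notes) confirming the cycle numerically:

```python
grad = lambda x: x**3 - 2*x + 2   # f'(x) for f(x) = x^4/4 - x^2 + 2x
hess = lambda x: 3*x**2 - 2       # f''(x)

x = 0.0
for t in range(6):
    x = x - grad(x) / hess(x)     # Newton step
    print(t + 1, x)               # 1.0, 0.0, 1.0, 0.0, ...: a cycle, no convergence
```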
Some Remarks on Newton's Method
1. We can modify Newton's method to include a step size µ^(t):
x^(t+1) = x^(t) − µ^(t) [∇²f(x^(t))]⁻¹ ∇f(x^(t)),
where we can choose µ^(t) fixed or set it with a backtracking line search.
2. Finding the inverse of the Hessian can be very expensive if n is large. Instead, notice that
x^(t+1) = x^(t) − [∇²f(x^(t))]⁻¹ ∇f(x^(t))   ⟺   ∇²f(x^(t)) (x^(t+1) − x^(t)) = −∇f(x^(t)),
where ∇²f(x^(t)) and ∇f(x^(t)) are known, so we can use linear algebra techniques to solve this linear system for x^(t+1).
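For instance, with numpy one would implement the step as a linear solve rather than forming the inverse (a minimal sketch, not from the notes):

```python
import numpy as np

def newton_step(x, grad, hess):
    """One Newton step: solve hess(x) d = -grad(x) for d instead of
    computing the inverse Hessian explicitly."""
    d = np.linalg.solve(hess(x), -grad(x))
    return x + d
```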
3. Newton's method, when it converges, tends to do so much faster than GD (compare the convergence theorems).
Quasi-Newton Methods (very briefly)

Recall that GD has the interpretation
f(x) ≈ f(x^(t)) + ∇f(x^(t))ᵀ(x − x^(t)) + (1/(2µ)) ‖x − x^(t)‖²;
minimizing the right-hand side gives
x^(t+1) = x^(t) − µ ∇f(x^(t)).
Meanwhile, Newton's method comes from
f(x) ≈ f(x^(t)) + ∇f(x^(t))ᵀ(x − x^(t)) + ½ (x − x^(t))ᵀ ∇²f(x^(t)) (x − x^(t));
minimizing the right-hand side gives
x^(t+1) = x^(t) − [∇²f(x^(t))]⁻¹ ∇f(x^(t)).
So we can think of GD as approximating the Hessian with (1/µ) · I, a scaled identity matrix.
Quasi-Newton methods approximate the Hessian with some other matrix B_t, which changes from iteration to iteration, so that
x^(t+1) = x^(t) − µ^(t) B_t⁻¹ ∇f(x^(t)).
There are several such methods with different choices for B_t. We won't cover them here, but examples include the BFGS method and Broyden's method.
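As a pointer (not covered in the notes), BFGS is available in standard libraries. A minimal usage sketch with scipy, reusing the quartic example from the line-search section:

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 1)**4 + (x[0] - x[1] - 1)**2
grad = lambda x: np.array([4*(x[0] - 1)**3 + 2*(x[0] - x[1] - 1),
                           -2*(x[0] - x[1] - 1)])

# BFGS builds its Hessian approximation B_t from successive gradient differences.
result = minimize(f, x0=np.zeros(2), jac=grad, method="BFGS")
print(result.x)   # close to the minimizer (1, 0)
```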