Page 1

Optimization 101

CSE P576

David M. Rosen

Page 2

Recap

First half:

• Photogrammetry and bundle adjustment

• Maximum likelihood estimation

This half:

• Basic theory of optimization

(i.e., how to actually do MLE)

Page 3

The Main Idea

Given: $f: \mathbb{R}^n \to \mathbb{R}$, we want to

$$\min_x f(x)$$

Problem: We have no idea how to actually do this …

Main idea: Let's approximate f with a simple model function m, and use that to search for a minimizer of f.

Page 4

Optimization Meta-Algorithm

Given: A function $f: \mathbb{R}^n \to \mathbb{R}$ and an initial guess $x_0 \in \mathbb{R}^n$ for a minimizer

Iterate:

• Construct a model $m_i(h) \approx f(x_i + h)$ of f near $x_i$

• Use $m_i$ to search for a descent direction h (i.e., $f(x_i + h) < f(x_i)$)

• Update $x_{i+1} \leftarrow x_i + h$

until convergence
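
To make the structure concrete, here is a minimal Python sketch of this loop (not from the slides; `build_model` and `find_descent_step` are hypothetical callbacks standing in for the two steps above, and concrete choices of them give the specific algorithms on the following slides):

```python
# A minimal sketch of the optimization meta-algorithm.
def meta_optimize(f, x0, build_model, find_descent_step, max_iters=100):
    x = x0
    for _ in range(max_iters):
        m = build_model(f, x)           # model m_i(h) ~ f(x_i + h)
        h = find_descent_step(f, x, m)  # search for h with f(x + h) < f(x)
        if h is None:                   # no descent step found: treat as converged
            break
        x = x + h                       # update x_{i+1} <- x_i + h
    return x
```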

Page 5

A first example

Let's consider applying the Basic Algorithm to minimize

$$f(x) = x^4 - 3x^2 + x + 2$$

starting at $x_0 = -\tfrac{1}{2}$.

Q: How can we approximate (model) f near $x_0$?

A: Let's try linearizing! Take

$$m_0(h) \triangleq f(x_0) + f'(x_0)\, h$$

Page 6

Gradient descent

Given:

• A function $f: \mathbb{R}^n \to \mathbb{R}$

• An initial guess $x_0 \in \mathbb{R}^n$ for a minimizer

• Sufficient decrease parameter $c \in (0,1)$, stepsize shrinkage parameter $\tau \in (0,1)$

• Gradient tolerance $\epsilon > 0$

Iterate:

• Compute search direction $p = -\nabla f(x_i)$ at $x_i$

• Set initial stepsize $\alpha = 1$

• Backtracking line search: Update $\alpha \leftarrow \tau\alpha$ until the Armijo-Goldstein sufficient decrease condition

$$f(x_i + \alpha p) < f(x_i) - c\alpha \|p\|^2$$

is satisfied

• Update $x_{i+1} \leftarrow x_i + \alpha p$

until $\|\nabla f(x_i)\| < \epsilon$
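
A minimal NumPy sketch of this pseudocode (the `max_iters` cap and the tiny-stepsize safeguard are my own additions, not part of the slide; the gradient must be supplied by the caller):

```python
import numpy as np

# Gradient descent with backtracking line search, following the slide's pseudocode.
def gradient_descent(f, grad_f, x0, c=0.5, tau=0.5, eps=1e-3, max_iters=10000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:        # until ||grad f(x_i)|| < eps
            break
        p = -g                             # search direction
        alpha = 1.0                        # initial stepsize
        # Backtracking: shrink alpha until the sufficient decrease condition holds
        while f(x + alpha * p) >= f(x) - c * alpha * np.dot(p, p):
            alpha *= tau
            if alpha < 1e-16:              # safeguard against endless backtracking
                break
        x = x + alpha * p
    return x
```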

Page 7

Exercise: Minimizing a quadratic

(Gradient descent with backtracking line search, as on the previous slide.)

Try minimizing the quadratic

$$f(x, y) = x^2 - xy + \kappa y^2$$

using gradient descent, starting at $x_0 = (1, 1)$ and using $c = \tau = \tfrac{1}{2}$ and $\epsilon = 10^{-3}$, for a few different values of $\kappa$, say

$$\kappa \in \{1, 10, 100, 1000\}$$

Q: If you plot the function value $f(x_i)$ vs. iteration number i, what do you notice?
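
One way to run this exercise, assuming the `gradient_descent` sketch above is in scope (the $\kappa$ values, starting point, and parameters are as given on the slide; the generous iteration cap is my own):

```python
import numpy as np

# The quadratic f(x, y) = x^2 - x*y + kappa*y^2 and its gradient.
def make_quadratic(kappa):
    f = lambda v: v[0]**2 - v[0]*v[1] + kappa * v[1]**2
    grad_f = lambda v: np.array([2*v[0] - v[1], -v[0] + 2*kappa*v[1]])
    return f, grad_f

for kappa in [1, 10, 100, 1000]:
    f, grad_f = make_quadratic(kappa)
    # Generous iteration cap: convergence becomes very slow for large kappa.
    x_star = gradient_descent(f, grad_f, x0=[1.0, 1.0], c=0.5, tau=0.5,
                              eps=1e-3, max_iters=50000)
    print(kappa, x_star, f(x_star))
```

Recording $f(x_i)$ at each iteration (e.g., appending to a list inside the loop) makes the plot requested in the question easy to produce.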

Page 8

Exercise: Minimizing a quadratic

[Plot: $\kappa = 1$]

Page 9

Exercise: Minimizing a quadratic

[Plot: $\kappa = 10$]

Page 10

Exercise: Minimizing a quadratic

[Plot: $\kappa = 100$]

Page 11

The problem of conditioning

Gradient descent doesn't perform well when f is poorly conditioned (has "stretched" contours).

Q: How can we improve our local model

$$m_i(h) = f(x_i) + \nabla f(x_i)^T h$$

so that it handles curvature better?

Page 12

Page 13

Second-order methods

Let's try adding in curvature information using a second-order model for f:

$$m_i(h) = f(x_i) + \nabla f(x_i)^T h + \tfrac{1}{2}\, h^T \nabla^2 f(x_i)\, h$$

NB: If $\nabla^2 f(x_i) \succ 0$, then $m_i(h)$ has a unique minimizer:

$$h_N = -\nabla^2 f(x_i)^{-1} \nabla f(x_i)$$

In that case, using the update rule

$$x_{i+1} \leftarrow x_i + h_N$$

gives Newton's method.

Page 14

Exercise: Minimizing a quadratic

Newton's method

Given:

• A function $f: \mathbb{R}^n \to \mathbb{R}$

• An initial guess $x_0 \in \mathbb{R}^n$ for a minimizer

• Gradient tolerance $\epsilon > 0$

Iterate:

• Compute the gradient $\nabla f(x_i)$ and Hessian $\nabla^2 f(x_i)$

• Compute the Newton step:

$$h_N = -\nabla^2 f(x_i)^{-1} \nabla f(x_i)$$

• Update $x_{i+1} \leftarrow x_i + h_N$

until $\|\nabla f(x_i)\| < \epsilon$

Let's try minimizing the quadratic

$$f(x, y) = x^2 - xy + \kappa y^2$$

again, this time using Newton's method, starting at $x_0 = (1, 1)$ and using $\epsilon = 10^{-3}$, for

$$\kappa \in \{1, 10, 100, 1000\}$$

If you plot the function value $f(x_i)$ vs. iteration number i, what do you notice?
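
A minimal NumPy sketch of this exercise (my own; note the Newton step is computed with a linear solve rather than an explicit inverse, anticipating the linear-algebra advice later in the deck). For this quadratic the Hessian is constant, so a single Newton step lands on the minimizer:

```python
import numpy as np

# Newton's method following the slide's pseudocode.
def newton(f, grad_f, hess_f, x0, eps=1e-3, max_iters=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:
            break
        h = np.linalg.solve(hess_f(x), -g)   # Newton step: solve H h = -g
        x = x + h
    return x

for kappa in [1, 10, 100, 1000]:
    f = lambda v, k=kappa: v[0]**2 - v[0]*v[1] + k * v[1]**2
    grad_f = lambda v, k=kappa: np.array([2*v[0] - v[1], -v[0] + 2*k*v[1]])
    hess_f = lambda v, k=kappa: np.array([[2.0, -1.0], [-1.0, 2.0*k]])
    print(kappa, newton(f, grad_f, hess_f, [1.0, 1.0]))
```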

Page 15

Quasi-Newton methods

Newton's method is fast! (It has a quadratic convergence rate.)

But:

• $h_N$ is only guaranteed to be a descent direction if $\nabla^2 f(x_i) \succ 0$

• Computing exact Hessians can be expensive!

Quasi-Newton methods: Use a positive-definite approximate Hessian $B_i$ in the model function:

$$m_i(h) = f(x_i) + \nabla f(x_i)^T h + \tfrac{1}{2}\, h^T B_i h$$

$m_i(h)$ always has a unique minimizer:

$$h_{QN} = -B_i^{-1} \nabla f(x_i)$$

$h_{QN}$ is always a descent direction!

Page 16

Quasi-Newton method with line search

Given:

• A function $f: \mathbb{R}^n \to \mathbb{R}$

• An initial guess $x_0 \in \mathbb{R}^n$ for a minimizer

• Sufficient decrease parameter $c \in (0,1)$, stepsize shrinkage parameter $\tau \in (0,1)$

• Gradient tolerance $\epsilon > 0$

Iterate:

• Compute the gradient $g_i = \nabla f(x_i)$ and a positive-definite Hessian approximation $B_i$ at $x_i$

• Compute the quasi-Newton step:

$$h_{QN} = -B_i^{-1} g_i$$

• Set initial stepsize $\alpha = 1$

• Backtracking line search: Update $\alpha \leftarrow \tau\alpha$ until the Armijo-Goldstein sufficient decrease condition

$$f(x_i + \alpha h_{QN}) < f(x_i) + c\alpha\, g_i^T h_{QN}$$

is satisfied

• Update $x_{i+1} \leftarrow x_i + \alpha h_{QN}$

until $\|g_i\| < \epsilon$
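
A minimal NumPy sketch of this loop; `approx_hessian` is a hypothetical callback returning a positive-definite $B_i$ (different choices of it give the variants listed on the next slide), and the default parameter values are my own, not from the slide:

```python
import numpy as np

# Quasi-Newton method with backtracking line search, per the slide's pseudocode.
def quasi_newton(f, grad_f, approx_hessian, x0, c=1e-4, tau=0.5, eps=1e-3,
                 max_iters=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:
            break
        B = approx_hessian(x)
        h = np.linalg.solve(B, -g)        # quasi-Newton step: solve B h = -g
        alpha = 1.0
        # Backtrack until the Armijo-Goldstein sufficient decrease condition holds
        while f(x + alpha * h) >= f(x) + c * alpha * np.dot(g, h):
            alpha *= tau
            if alpha < 1e-16:             # safeguard against endless backtracking
                break
        x = x + alpha * h
    return x
```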

Page 17

Quasi-Newton methods (cont'd)

Different choices of $B_i$ give different QN algorithms

Can trade off accuracy of $B_i$ with computational cost

LOTS of possibilities here!

• Gauss-Newton

• Levenberg-Marquardt

• (L-)BFGS

• Broyden

• etc. …

Don't be afraid to experiment!

Page 18

Special case: The Gauss-Newton method

A quasi-Newton algorithm for minimizing a nonlinear least-squares objective:

$$f(x) = \|r(x)\|^2$$

Uses the local quadratic model obtained by linearizing r:

$$m_i(h) = \|r(x_i) + J(x_i)\, h\|^2$$

where $J(x_i) \triangleq \frac{\partial r}{\partial x}(x_i)$ is the Jacobian of r.

Equivalently:

$$g_i = 2\, J(x_i)^T r(x_i), \qquad B_i = 2\, J(x_i)^T J(x_i)$$

Page 19

A word on linear algebra

The dominant cost (memory + time) in a QN method is linear algebra:

• Constructing the Hessian approximation $B_i$

• Solving the linear system

$$B_i h_{QN} = -g_i$$

Fast/robust linear algebra is essential for efficient QN methods:

• Take advantage of sparsity in $B_i$!

• NEVER, NEVER, NEVER INVERT $B_i$ directly!!!

  • It's incredibly expensive and unnecessary

  • Use instead [cf. Golub & Van Loan's Matrix Computations]:

    • Matrix factorizations: QR, Cholesky, LDL^T

    • Iterative linear methods: conjugate gradient
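
A small illustration of this advice, assuming SciPy is available for the Cholesky factorization (the matrix and right-hand side are made-up toy data):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Solve B h = -g via a Cholesky factorization instead of forming B^{-1}.
B = np.array([[4.0, 1.0],
              [1.0, 3.0]])        # a small symmetric positive-definite B_i
g = np.array([1.0, 2.0])

# Don't do this: explicitly inverting B is slower and less accurate.
h_bad = -np.linalg.inv(B) @ g

# Do this instead: factor once, then solve (reuse the factor if B is reused).
c, low = cho_factor(B)
h_good = cho_solve((c, low), -g)

print(np.allclose(h_bad, h_good))   # same answer, computed the right way
```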

Page 20

A word on linear algebra

NEVER INVERT $B_i$!!!

Page 21

Optimization methods: Cheat sheet

First-order methods

Use only gradient information

• Pro: Local model is inexpensive

• Con: Slow (linear) convergence rate

Canonical example: Gradient descent

Best for:

• Moderate accuracy

• Very large problems

Second-order methods

Use (some) 2nd-order information

• Pro: Fast (superlinear) convergence

• Con: Local model can be expensive

Canonical example: Newton's method

Best for:

• High accuracy

• Small to moderately large problems