

Page 1

CS7267 MACHINE LEARNING

OPTIMIZATION

Mingon Kang, Ph.D.

Computer Science, Kennesaw State University

* This lecture is based on Kyle Andelin's slides

Page 2

Optimization

Consider a function f(·) of p variables:

y = f(x1, x2, …, xp)

Find x1, x2, …, xp that maximizes or minimizes y

Usually, we minimize a cost/loss function or maximize a profit/likelihood function.

Page 3

Global/Local Optimization

Page 4

Gradient

Single variable:

The derivative: the slope of the tangent line at a point x0
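As a quick numerical illustration (the function f(x) = x**2 and the step h are my own choices, not from the slides), the slope of the tangent line at a point can be approximated with a central difference:

```python
# Central-difference approximation of the derivative of f at x0:
# the slope of the tangent line at that point.
def derivative(f, x0, h=1e-6):
    return (f(x0 + h) - f(x0 - h)) / (2 * h)

f = lambda x: x ** 2          # example function; the exact derivative is 2x
slope = derivative(f, 3.0)    # tangent slope at x0 = 3, close to 6
```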

Page 5

Gradient

Multivariable:

๐›ป๐‘“ =๐œ•๐‘“

๐œ•๐‘ฅ1,๐œ•๐‘“

๐œ•๐‘ฅ2, โ€ฆ ,

๐œ•๐‘“

๐œ•๐‘ฅ๐‘›

A vector of partial derivatives with respect to each of

the independent variables

๐›ป๐‘“ points in the direction of greatest rate of

change or โ€œsteepest ascentโ€

Magnitude (or length) of ๐›ป๐‘“ is the greatest rate of

change
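The gradient above can be approximated componentwise with finite differences; the example function f(x1, x2) = x1**2 + 3*x2**2 below is my own illustration, chosen so the analytic gradient (2*x1, 6*x2) is easy to verify:

```python
# Finite-difference gradient: a vector of partial derivatives,
# one per independent variable.
def gradient(f, x, h=1e-6):
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h      # perturb variable i upward
        xm = list(x); xm[i] -= h      # and downward
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

f = lambda x: x[0] ** 2 + 3 * x[1] ** 2
g = gradient(f, [1.0, 2.0])           # analytic gradient here is [2.0, 12.0]
```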

Page 6

Gradient

Page 7

The general idea

We have k parameters θ1, θ2, …, θk we'd like to train for a model, with respect to some error/loss function J(θ1, …, θk) to be minimized.

Gradient descent is one way to iteratively determine the optimal set of parameter values:

1. Initialize the parameters
2. Keep changing their values to reduce J(θ1, …, θk)

∇J tells us which direction increases J the most, so we go in the opposite direction of ∇J.

Page 8

To actually descend…

Set initial parameter values θ1^(0), …, θk^(0)

while (not converged) {
    calculate ∇J (i.e., evaluate ∂J/∂θ1, …, ∂J/∂θk)
    θ1 := θ1 − α ∂J/∂θ1
    θ2 := θ2 − α ∂J/∂θ2
    ⋮
    θk := θk − α ∂J/∂θk
}

where α is the 'learning rate' or 'step size'.

- A small enough α ensures J(θ1^(i), …, θk^(i)) ≤ J(θ1^(i−1), …, θk^(i−1))
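The loop above can be sketched directly in Python. The objective J(θ1, θ2) = (θ1 − 1)² + (θ2 + 2)², the learning rate, and the fixed iteration count (standing in for a convergence test) are my own choices for illustration; the minimum is at (1, −2):

```python
# Gradient descent on J(theta1, theta2) = (theta1 - 1)**2 + (theta2 + 2)**2.
def grad_J(theta):
    # Partial derivatives dJ/dtheta1 and dJ/dtheta2.
    return [2 * (theta[0] - 1), 2 * (theta[1] + 2)]

theta = [0.0, 0.0]        # 1. initialize parameters
alpha = 0.1               # learning rate / step size
for _ in range(200):      # 2. keep changing values to reduce J
    g = grad_J(theta)
    # Step in the opposite direction of grad J, scaled by alpha.
    theta = [t - alpha * gi for t, gi in zip(theta, g)]
```

With this small α each iteration shrinks the error by a constant factor, so 200 steps land essentially on the minimizer.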

Page 9

After each iteration:

Picture credit: Andrew Ng, Stanford University, Coursera Machine Learning, Lecture 2 Slides


Page 17

Issues

A convex objective function guarantees convergence to the global minimum.

Non-convexity brings the possibility of getting stuck in a local minimum.

Restarting from different randomized initial values can mitigate this.
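A small sketch of the random-restart idea (the non-convex function f(x) = x⁴ − 3x² + x, the seed, and the step settings are my own illustrative choices; it has two local minima, and only one is the global minimum near x ≈ −1.30):

```python
import random

f = lambda x: x ** 4 - 3 * x ** 2 + x
df = lambda x: 4 * x ** 3 - 6 * x + 1   # derivative of f

def descend(x, alpha=0.01, iters=2000):
    # Plain gradient descent; where it ends up depends on the start.
    for _ in range(iters):
        x -= alpha * df(x)
    return x

random.seed(0)
starts = [random.uniform(-3, 3) for _ in range(10)]   # randomized restarts
best = min((descend(x0) for x0 in starts), key=f)     # keep the best result
```

Runs started in the basin of the shallow minimum get stuck there; taking the best over several random starts recovers the deeper minimum.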

Page 18

Initial Values and Convergence

Picture credit: Andrew Ng, Stanford University, Coursera Machine Learning, Lecture 2 Slides


Page 20

Issues cont.

Convergence can be slow.

A larger learning rate α can speed things up, but if α is too large, optima can be 'jumped' or skipped over, requiring more iterations.

Too small a step size will keep convergence slow.

Gradient descent can be combined with a line search to find the optimal α on every iteration.
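One common line-search variant is backtracking with the Armijo sufficient-decrease condition (the slides do not specify which line search; this choice, along with the test function and the constants rho and c, is my own):

```python
# Backtracking (Armijo) line search: start with a large alpha and shrink
# it until the step along -grad f gives a sufficient decrease in f.
def backtracking(f, grad, x, alpha0=1.0, rho=0.5, c=1e-4):
    g = grad(x)
    fx = f(x)
    gg = sum(gi * gi for gi in g)     # ||grad f||^2
    alpha = alpha0
    while f([xi - alpha * gi for xi, gi in zip(x, g)]) > fx - c * alpha * gg:
        alpha *= rho                  # step too large: shrink it
    return alpha

f = lambda x: x[0] ** 2 + 5 * x[1] ** 2
grad = lambda x: [2 * x[0], 10 * x[1]]
x = [3.0, 1.0]
alpha = backtracking(f, grad, x)
x_new = [xi - alpha * gi for xi, gi in zip(x, grad(x))]
```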

Page 21

Convex set

Definition

A set ๐ถ is convex if, for any ๐‘ฅ, ๐‘ฆ โˆˆ ๐ถ and ๐œƒ โˆˆ โ„œ with

0 โ‰ค ๐œƒ โ‰ค 1, ๐œƒ๐‘ฅ + 1 โˆ’ ๐œƒ ๐‘ฆ โˆˆ ๐ถ

If we take any two elements in C, and draw a line

segment between these two elements, then every point

on that line segment also belongs to C
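A quick numeric sanity check of the definition (the closed unit disk is my own example of a convex set; the sampling scheme is incidental):

```python
import random, math

# The closed unit disk C = {x : ||x|| <= 1} is convex, so every convex
# combination theta*x + (1 - theta)*y of two members stays inside it.
def in_disk(p):
    return math.hypot(p[0], p[1]) <= 1 + 1e-12

random.seed(1)
for _ in range(1000):
    # Sample two points uniformly in the disk (sqrt trick for the radius).
    a1, r1 = random.uniform(0, 2 * math.pi), math.sqrt(random.random())
    a2, r2 = random.uniform(0, 2 * math.pi), math.sqrt(random.random())
    x = (r1 * math.cos(a1), r1 * math.sin(a1))
    y = (r2 * math.cos(a2), r2 * math.sin(a2))
    t = random.random()                      # theta in [0, 1]
    z = (t * x[0] + (1 - t) * y[0], t * x[1] + (1 - t) * y[1])
    assert in_disk(z)                        # the combination never escapes C
```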

Page 22

Convex set

Examples of a convex set (a) and a non-convex set (b)

Page 23

Convex functions

Definition

A function ๐‘“:โ„œ๐‘› โ†’ โ„œ is convex if its domain (denoted

๐ท(๐‘“)) is a convex set, and if, for all ๐‘ฅ, ๐‘ฆ โˆˆ ๐ท(๐‘“) and

๐œƒ โˆˆ ๐‘…, 0 โ‰ค ๐œƒ โ‰ค 1, ๐‘“ ๐œƒ๐‘ฅ + 1 โˆ’ ๐œƒ ๐‘ฆ โ‰ค ๐œƒ๐‘“ ๐‘ฅ +1 โˆ’ ๐œƒ ๐‘“ ๐‘ฆ .

If we pick any two points on the graph of a convex

function and draw a straight line between them, the n

the portion of the function between these two points will

lie below this straight line
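The defining inequality can be checked numerically on a grid of θ values (f(x) = x², the endpoints, and the grid resolution are my own illustrative choices):

```python
# Check f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y)
# on a grid of theta values between two chosen points x and y.
def chord_above(f, x, y, steps=100):
    for i in range(steps + 1):
        t = i / steps
        lhs = f(t * x + (1 - t) * y)          # function value on the segment
        rhs = t * f(x) + (1 - t) * f(y)       # chord (straight line) value
        if lhs > rhs + 1e-12:
            return False
    return True

ok = chord_above(lambda x: x ** 2, -3.0, 5.0)   # x**2 is convex: True
```

The same check exposes non-convexity: for f(x) = −x² the chord dips below the graph, so the test fails.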

Page 24

Convex function

The line connecting two points on the graph must lie above the function

Page 25

Steepest Descent Method

Contours are shown below

f(x1, x2) = x1^2 + 5 x2^2

Page 26

Steepest Descent

The gradient at a point (x1, x2) is ∇f(x1, x2) = (2 x1, 10 x2)^T.

If we choose (x1^1, x2^1) = (3.22, 1.39) as the starting point, represented by the black dot on the figure, then ∇f(x1^1, x2^1) = (6.44, 13.9)^T, and the black line shown in the figure represents the direction for a line search.

Contours represent values from f = 2 (red) to f = 20 (blue).

Page 27

Steepest Descent

Now, the question is: how big should the step be along the direction of the gradient? We want to find the minimum along the line before taking the next step.

The minimum along the line corresponds to the point where the new direction is orthogonal to the original direction.

The new point is (x1^2, x2^2) = (2.47, −0.23), shown in blue.

Page 28

Steepest Descent

By the third iteration we can see that, from the point (x1^2, x2^2), the new vector again misses the minimum, and here it seems that we could do better because we are close.

Steepest descent is usually used as the first technique in a minimization procedure; a robust strategy that improves the choice of the new direction will greatly enhance the efficiency of the search for the minimum.
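The slide's example can be reproduced in a few lines. For a quadratic like f(x1, x2) = x1² + 5x2², the exact line-search step along −∇f has the closed form α = (gᵀg)/(gᵀAg) with A = diag(2, 10); this sketch uses that formula to take one step from the slides' starting point and to confirm that successive directions come out orthogonal:

```python
# Steepest descent with an exact line search on f(x1, x2) = x1**2 + 5*x2**2.
def grad(x):
    return [2 * x[0], 10 * x[1]]

def exact_step(x):
    g = grad(x)
    gg = g[0] ** 2 + g[1] ** 2              # g . g
    gAg = 2 * g[0] ** 2 + 10 * g[1] ** 2    # g . A g, A = diag(2, 10)
    alpha = gg / gAg                        # exact minimizer along -g
    return [x[0] - alpha * g[0], x[1] - alpha * g[1]]

x0 = [3.22, 1.39]       # starting point from the slides
x1 = exact_step(x0)     # lands near (2.47, -0.23), matching the slides
g0, g1 = grad(x0), grad(x1)
ortho = g0[0] * g1[0] + g0[1] * g1[1]   # ~0: new direction is orthogonal
```

This zig-zag of mutually orthogonal steps is exactly why steepest descent can be slow on elongated contours, motivating the improved direction choices listed on the next slide.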

Page 29

Numerical Optimization

Steepest descent

Newton Method

Gaussโ€“Newton algorithm

Levenbergโ€“Marquardt algorithm

Line Search Methods

Trust-Region Methods

See: http://home.agh.edu.pl/~pba/pdfdoc/Numerical_Optimization.pdf