Machine Learning Department School of Computer Science...

Linear Regression+

Optimization for ML

10-601 Introduction to Machine Learning

Matt GormleyLecture 8

Feb. 07, 2020

Machine Learning DepartmentSchool of Computer ScienceCarnegie Mellon University

Q: How can I get more one-on-one interaction with the course staff?

A: Attend office hours as soon after the homework release as possible!

Reminders

• Homework 3: KNN, Perceptron, Lin.Reg.– Out: Wed, Feb. 05 (+ 1 day)– Due: Wed, Feb. 12 at 11:59pm

• Today’s In-Class Poll– http://p8.mlcourse.org

LINEAR REGRESSION

Regression Problems

Chalkboard– Definition of Regression– Linear functions– Residuals– Notation trick: fold in the intercept

OPTIMIZATION FOR MLThe Big Picture

Optimization for ML

Not quite the same setting as other fields…– Function we are optimizing might not be the

true goal (e.g. likelihood vs generalization error)

– Precision might not matter (e.g. data is noisy, so optimal up to 1e-16 might not help)

– Stopping early can help generalization error(i.e. “early stopping” is a technique for regularization – discussed more next time)

min vs. argmin

y = f(x) =x2 + 11

3 v* = minx f(x)

x* = argminx f(x)

1. Q: What is v*?

2. Q: What is x*?v* = 1, the minimum value of the function

x* = 0, the argument that yields the minimum value

Linear Regression as Function Approximation

Contour PlotsContour Plots1. Each level curve labeled

with value 2. Value label indicates the

value of the function for all points lying on that level curve

3. Just like a topographical map, but for a function

J(θ) = J(θ1, θ2) = (10(θ1 – 0.5))2 + (6(θ1 – 0.4))2

Optimization by Random Guessing

Optimization Method #0: Random Guessing1. Pick a random θ2. Evaluate J(θ)

3. Repeat steps 1 and 2 many

4. Return θ that gives

smallest J(θ)

J(θ) = J(θ1, θ2) = (10(θ1 – 0.5))2 + (6(θ1 – 0.4))2

θ1 θ2 J(θ1, θ2)

0.2 0.2 10.4

0.3 0.7 7.2

0.6 0.4 1.0

0.9 0.7 19.2

Optimization by Random Guessing

smallest J(θ)

J(θ) = J(θ1, θ2) = (10(θ1 – 0.5))2 + (6(θ1 – 0.4))2

θ1 θ2 J(θ1, θ2)

0.2 0.2 10.4

0.3 0.7 7.2

0.6 0.4 1.0

0.9 0.7 19.2

For Linear Regression:• objective function is Mean

Squared Error (MSE)

• MSE = J(w, b)

= J(θ1, θ2) =

• contour plot: each line labeled with

MSE – lower means a better fit• minimum corresponds to

parameters (w,b) = (θ1, θ2) that

best fit some training dataset

Linear Regression by Rand. GuessingOptimization Method #0: Random Guessing1. Pick a random θ2. Evaluate J(θ)3. Repeat steps 1 and 2 many

times4. Return θ that gives

smallest J(θ)

15time

y = h*(x)(unknown)

h(x; θ(1))

h(x; θ(2))

h(x; θ(3))

h(x; θ(4))

For Linear Regression:• target function h*(x) is unknown• only have access to h*(x) through

training examples (x(i),y(i))

• want h(x; θ(t)) that best approximates h*(x)

• enable generalization w/inductive bias that restricts hypothesis class to linear functions

Linear Regression by Rand. Guessing

smallest J(θ)

θ1 θ2 J(θ1, θ2)

0.2 0.2 10.4

0.3 0.7 7.2

0.6 0.4 1.0

0.9 0.7 19.2time

y = h*(x)

(unknown)

h(x; θ(1))

h(x; θ(2))

h(x; θ(3))

h(x; θ(4))

J(θ) = J(θ1, θ2) = (10(θ1 – 0.5))2 + (6(θ1 – 0.4))2

OPTIMIZATION METHOD #1:GRADIENT DESCENT

Optimization for ML

Chalkboard– Unconstrained optimization– Derivatives– Gradient

Topographical Maps

Gradients

22These are the gradients that

Gradient Ascent would follow.

(Negative) Gradients

23These are the negative gradients that

Gradient Descent would follow.

(Negative) Gradient Paths

Shown are the paths that Gradient Descent would follow if it were making infinitesimally

small steps.

Pros and cons of gradient descent• Simple and often quite effective on ML tasks• Often very scalable • Only applies to smooth functions (differentiable)• Might find a local minimum, rather than a global one

25Slide courtesy of William Cohen

Gradient Descent

Chalkboard– Gradient Descent Algorithm– Details: starting point, stopping criterion, line

search

Gradient Descent

Algorithm 1 Gradient Descent

1: procedure GD(D, �(0))2: � � �(0)

3: while not converged do4: � � � + ��J(�)

5: return �

In order to apply GD to Linear Regression all we need is the gradient of the objective function (i.e. vector of partial derivatives).

��J(�) =

��

dd�1

J(�)d

d�2J(�)...

dd�N

J(�)

��

Gradient Descent

Algorithm 1 Gradient Descent

1: procedure GD(D, �(0))2: � � �(0)

3: while not converged do4: � � � + ��J(�)

5: return �

There are many possible ways to detect convergence. For example, we could check whether the L2 norm of the gradient is below some small tolerance.

||��J(�)||2 � �Alternatively we could check that the reduction in the objective function from one iteration to the next is small.

GRADIENT DESCENT FORLINEAR REGRESSION

Linear Regression by Gradient Desc.Optimization Method #1: Gradient Descent1. Pick a random θ2. Repeat:

a. Evaluate gradient ∇J(θ)b. Step opposite gradient

3. Return θ that gives smallest J(θ)

θ1 θ2 J(θ1, θ2)0.01 0.02 25.20.30 0.12 8.70.51 0.30 1.50.59 0.43 0.2

J(θ) = J(θ1, θ2) = (10(θ1 – 0.5))2 + (6(θ1 – 0.4))2

θ1 θ2 J(θ1, θ2)0.01 0.02 25.20.30 0.12 8.70.51 0.30 1.50.59 0.43 0.2time

y = h*(x)(unknown)

h(x; θ(1))

h(x; θ(2))

h(x; θ(3))

h(x; θ(4))

θ1 θ2 J(θ1, θ2)0.01 0.02 25.20.30 0.12 8.70.51 0.30 1.50.59 0.43 0.2time

y = h*(x)(unknown)

h(x; θ(1))

h(x; θ(2))

h(x; θ(3))

h(x; θ(4))

J(θ) = J(θ1, θ2) = (10(θ1 – 0.5))2 + (6(θ1 – 0.4))2

Linear Regression by Gradient Desc.

θ1 θ2 J(θ1, θ2)0.01 0.02 25.20.30 0.12 8.70.51 0.30 1.50.59 0.43 0.2time

y = h*(x)(unknown)

h(x; θ(1))

h(x; θ(2))

h(x; θ(3))

h(x; θ(4))

iteration, t

Linear Regression by Gradient Desc.

θ1 θ2 J(θ1, θ2)0.01 0.02 25.20.30 0.12 8.70.51 0.30 1.50.59 0.43 0.2time

y = h*(x)(unknown)

h(x; θ(1))

h(x; θ(2))

h(x; θ(3))

h(x; θ(4))

J(θ) = J(θ1, θ2) = (10(θ1 – 0.5))2 + (6(θ1 – 0.4))2

iteration, t

Optimization for Linear Regression

Chalkboard– Computing the gradient for Linear Regression– Gradient Descent for Linear Regression

Gradient Calculation for Linear Regression

[used by Gradient Descent]

GD for Linear RegressionGradient Descent for Linear Regression repeatedly takes steps opposite the gradient of the objective function

CONVEXITY

Convexity

Convex Function

• Each local minimum is a global minimum

Nonconvex Function

• A nonconvex function is not convex

• Each local minimum is notnecessarily a global minimum 41

Convexity

Each local minimum of a

convex function is also a global

minimum.

A strictly convex function has a unique global

minimum.

CONVEXITY AND LINEAR REGRESSION

Convexity and Linear Regression

The Mean Squared Error function, which we minimize for learning

the parameters of Linear Regression, is convex!

…but in the general case it is not strictly convex.

Regression Loss Functions

In-Class Exercise:

Which of the following could be used as loss functions for training a linear regression model?

Select all that apply.

Answer:

Solving Linear Regression

Question:

OPTIMIZATION METHOD #2:CLOSED FORM SOLUTION

Calculus and Optimization

In-Class ExercisePlot three functions:

Answer Here:

Optimization: Closed form solutions

Chalkboard– Zero Derivatives– Example: 1-D function– Example: higher dimensions

CLOSED FORM SOLUTION FOR LINEAR REGRESSION

Linear Regression: Closed FormOptimization Method #2: Closed Form1. Evaluate

2. Return θMLE

θ1 θ2 J(θ1, θ2)0.59 0.43 0.2

y = h*(x)(unknown)

h(x; θ(MLE))

J(θ) = J(θ1, θ2) = (10(θ1 – 0.5))2 + (6(θ1 – 0.4))2

Optimization for Linear Regression

Chalkboard– Closed-form (Normal Equations)

Computational Complexity of OLS:

Computational Complexity of OLS

To solve the Ordinary Least Squares problem we compute:

The resulting shape of the matrices:

Linear in # of examples, NPolynomial in # of features, M

Gradient Descent

Cases to consider gradient descent:1. What if we can not find a closed-form

solution?2. What if we can, but it’s inefficient to

compute?3. What if we can, but it’s numerically

unstable to compute?

Convergence Curves

• SGD reduces MSE much more rapidly than GD

• For GD / SGD, training MSE is initially large due to uninformed initialization

Gradient DescentSGD

Closed-form (normal eq.s)

Figure adapted from Eric P. Xing

• Def: an epoch is a single pass through the training data

1. For GD, only one update per epoch

2. For SGD, N updates per epoch N = (# train examples)

Machine Learning Department School of Computer Science...

Documents

Lecture8 chap7

Lecture8 - Stereo

Mae546 lecture8

Decision Trees (Part I) - Carnegie Mellon School of ...mgormley/courses/10601/slides/lecture2-dt.pdf · Decision Trees (Part I) 1 10-601 Introduction to Machine Learning Matt Gormley

Lecture8 xing

LECTURE8 (1)

Lecture8 Cuong

Machine Learning Department School of Computer Science …mgormley/courses/10601-s19/slides/lecture15-pa… · 10 ieee transactions on pattern analysis and machine intelligence, vol

Clustering (K-Means)mgormley/courses/10601-s17/slides/... · 2019. 1. 11. · Clustering (K-Means) 1 10-6014Introduction4to4Machine4Learning Matt%Gormley Lecture%15 March%8,%2017

Machine Learning Department School of Computer Science …mgormley/courses/10601/slides/lecture5... · 2020-04-29 · k-Nearest Neighbors + Model Selection 1 10-601 Introduction to

DC Lecture8

Bayesian(Networks (PartII)mgormley/courses/10601-s17/... · Bayesian(Networks (PartII) 1 105601(Introduction(to(Machine(Learning Matt%Gormley Lecture23 April12,2017 Machine%Learning%Department

Logistic Regression - Carnegie Mellon School of Computer ...mgormley/courses/10601-s18/slides/lecture8-prob.pdf · Logistic Regression 2 10-601 Introduction to Machine Learning Matt

Lecture8 Evaluation

Oop lecture8

Lecture8 - Linköping Universityusers.isy.liu.se/rt/fredrik/edu/sensorfusion/lecture8.pdf · Lecture8: Whiteboard: I SLAMproblemformulation. I FrameworkforEKF-SLAMandfastSLAM(withPFandMPF)

Perceptron - Carnegie Mellon School of Computer …mgormley/courses/10601-s18/slides/lecture5-perc.pdf · Perceptron: History Imagine you are trying to build a new machine learning

Lecture8 carbohydrates

Machine Learning Department School of Computer Science ...mgormley/courses/10601/slides/... · 10-601 Introduction to Machine Learning Matt Gormley Lecture 21 Apr. 01, 2020 Machine

Machine Learning Department School of Computer Science ...mgormley/courses/10601-s18/slides/lecture16-mi… · 10-601: Machine Learning Page 4 of 16 2/29/2016 1.3 MAP vs MLE Answer