Lecture 2: Supervised Learning
Linear regression with one variable
Reading:
• Chapter 17, "Bayesian Reasoning and Machine Learning", pages 345-348
• Chapter 3, "Pattern Recognition and Machine Learning", Christopher M. Bishop, page 137
• Chapter 11, "Data Mining: A Knowledge Discovery Approach", from page 346
• Chapter 18, "Artificial Intelligence: A Modern Approach", from page 718
Model representation
[Figure: Housing Prices (Portland, OR): Price (in 1000s of dollars) vs. Size (feet²)]
Supervised Learning
Given the “right answer” for each example in the data.
Regression problem: predict a real-valued output
Classification problem: predict a discrete-valued output
Notation:
m = number of training examples
x's = "input" variable / features
y's = "output" variable / "target" variable
Training set of housing prices (Portland, OR):

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …
(x, y) – one training example
(x^(i), y^(i)) – the i-th training example; i is an index into the training set
x^(1) = 2104, x^(3) = 1534, y^(4) = 178, y^(2) = 232
Training Set → Learning Algorithm → h
The learned hypothesis h takes the size of a house as input and outputs an estimated price.
How do we represent h?
Linear regression with one variable (univariate linear regression)
x → hypothesis h → estimated value of y
h is a function that maps from x's to y's:
hθ(x) = θ0 + θ1x (shorthand: h(x))
In summary
• A hypothesis h takes in some input variable(s)
• Uses parameters determined by a learning system
• Outputs a prediction based on that input
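As a minimal sketch, the hypothesis can be written as a plain Python function. The parameter values below are made-up examples, not fitted to the lecture's data:

```python
def h(theta0, theta1, x):
    """Univariate linear hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Hypothetical parameters: theta0 = 50, theta1 = 0.1.
# A 2104 ft^2 house is then predicted at 50 + 0.1 * 2104 ≈ 260.4 (in $1000s).
print(h(50, 0.1, 2104))
```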
Cost function
• A cost function lets us figure out how to fit the best straight line to our data
Given the training set above, how do we choose the parameters θ0, θ1 of the hypothesis hθ(x) = θ0 + θ1x?
[Three example hypotheses:]
h(x) = 1.5 + 0·x   (θ0 = 1.5, θ1 = 0)
h(x) = 0 + 0.5·x   (θ0 = 0, θ1 = 0.5)
h(x) = 1 + 0.5·x   (θ0 = 1, θ1 = 0.5)
Different parameter values give different functions.
Positive slope if θ1> 0
• hθ(x) is a "y imitator": it tries to convert each x into its y
• Since we already have the true y values, we can evaluate how well hθ(x) does this
Idea: choose θ0, θ1 so that hθ(x) is close to y for our training examples (x, y).
This is a minimization problem: minimize the deviation of hθ(x) from y.
hθ(x^(i)) = θ0 + θ1·x^(i)

Cost function:
J(θ0, θ1) = (1/2m) · Σ (hθ(x^(i)) − y^(i))²   (sum over i = 1 … m)

Goal:
minimize J(θ0, θ1) over θ0, θ1
• This cost function is also called the squared error cost function
– A reasonable choice for most regression problems
– Probably the most commonly used cost function
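A direct Python translation of the squared error cost function (a sketch for illustration; the variable names are my own):

```python
def compute_cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = (1 / 2m) * sum_i (h_theta(x_i) - y_i)^2."""
    m = len(xs)
    squared_errors = ((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return sum(squared_errors) / (2 * m)

# A hypothesis that fits the data exactly has zero cost:
print(compute_cost(0.0, 2.0, [1, 2, 3], [2, 4, 6]))  # -> 0.0
```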
Hypothesis: hθ(x) = θ0 + θ1x
Parameters: θ0, θ1
Cost Function: J(θ0, θ1) = (1/2m) · Σ (hθ(x^(i)) − y^(i))²
Goal: minimize J(θ0, θ1) over θ0, θ1
Cost function intuition I (simplified version)
Simplify: set θ0 = 0, so hθ(x) = θ1x.
[Plots for the simplified case: left, the data and hθ(x) for a fixed θ1 (a function of x); right, J(θ1) (a function of the parameter θ1).]
For the training set (1, 1), (2, 2), (3, 3), with θ0 = 0:
J(θ1) = (1/2m) · Σ (θ1·x^(i) − y^(i))²
J(1) = (1/(2×3)) · (0² + 0² + 0²) = 0
For θ1 = 0.5:
J(0.5) = (1/(2×3)) · ((0.5 − 1)² + (1 − 2)² + (1.5 − 3)²) = (1/6) · (0.5² + 1² + 1.5²) = 3.5/6 ≈ 0.58
For θ1 = 0:
J(0) = (1/6) · ((0×1 − 1)² + (0×2 − 2)² + (0×3 − 3)²) = (1/6) · (1 + 4 + 9) = 14/6 ≈ 2.33
For θ1 = −0.5:
J(−0.5) = (1/6) · ((−0.5 − 1)² + (−1 − 2)² + (−1.5 − 3)²) = 31.5/6 = 5.25
• If we compute J(θ1) over a range of θ1 values and plot J(θ1) vs. θ1, we get a bowl-shaped curve (a parabola, since J is quadratic in θ1)
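The worked values above can be checked with a short script, assuming (as the plots suggest) the three-point training set (1, 1), (2, 2), (3, 3):

```python
xs, ys = [1, 2, 3], [1, 2, 3]  # the toy training set from the slides

def J(theta1):
    """Simplified cost with theta0 fixed at 0."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

for t in (1.0, 0.5, 0.0, -0.5):
    print(f"J({t}) = {J(t):.2f}")
# J(1.0) = 0.00, J(0.5) = 0.58, J(0.0) = 2.33, J(-0.5) = 5.25
```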
Now with both parameters free:
Hypothesis: hθ(x) = θ0 + θ1x
Parameters: θ0, θ1
Cost Function: J(θ0, θ1) = (1/2m) · Σ (hθ(x^(i)) − y^(i))²
Goal: minimize J(θ0, θ1) over θ0, θ1
Cost function intuition II
[Left: the housing data, Price ($) in 1000's vs. Size in feet² (x), with hθ(x) for fixed θ0, θ1 (a function of x). Right: J(θ0, θ1), a function of the two parameters θ0 and θ1.]
Contour Plot
[A series of slides: each shows hθ(x) for fixed θ0, θ1 on the left (a function of x), and the corresponding point on a contour plot of J(θ0, θ1) on the right (a function of the parameters).]
• Doing this manually is painful
• What we really want is an efficient algorithm for finding the values of θ0 and θ1 that minimize J
Gradient descent algorithm
• Minimizes the cost function J
• Used all over machine learning for minimization
Have some function J(θ0, θ1)
Want min over θ0, θ1 of J(θ0, θ1)
Outline:
• Start with some θ0, θ1
• Keep changing θ0, θ1 to reduce J(θ0, θ1), until we hopefully end up at a minimum
Gradient descent algorithm
• Gradient descent is one of the local search methods for optimization: hill climbing, simulated annealing, gradient descent, etc.
Local Search Methods
• Applicable when we seek a goal state and don't care how we get there. E.g.:
– N-queens
– finding shortest/cheapest round trips (Traveling Salesman Problem, Vehicle Routing Problem)
– finding models of propositional formulae (SAT solvers)
– VLSI layout, planning, scheduling, time-tabling
– map coloring
– resource allocation
– protein structure prediction
– genome sequence assembly
Local search key idea (surprisingly simple):
1. Select a (random) initial state (generate an initial guess)
2. Make local modifications to improve the current state (evaluate the current state and move to a better neighboring state)
3. Repeat step 2 until a goal state is found (or we run out of time)
[Surface plots of J(θ0, θ1): running gradient descent from different starting points can end up at different local minima.]
Gradient descent algorithm
The update rule, repeated until convergence:
θj := θj − α · ∂J(θ0, θ1)/∂θj   (for j = 0 and j = 1)
α: the learning rate (should be a small number)
• Large α: huge steps
• Small α: baby steps
∂J(θ0, θ1)/∂θj: the derivative term
Gradient descent algorithm
Correct: simultaneous update of θ0 and θ1 (compute both new values from the current parameters before assigning either).
Incorrect: updating θ0 first and then using the new θ0 when computing the update for θ1.
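The simultaneous update can be sketched as a generic helper; the gradient functions are passed in as arguments (my own framing, not from the slides):

```python
def gradient_step(theta0, theta1, alpha, dJ_d0, dJ_d1):
    """One gradient descent step with a simultaneous update:
    both partial derivatives are evaluated at the CURRENT (theta0, theta1)
    before either parameter is overwritten."""
    temp0 = theta0 - alpha * dJ_d0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_d1(theta0, theta1)
    return temp0, temp1

# Example on J = theta0^2 + theta1^2, whose gradients are 2*theta0 and 2*theta1:
step = gradient_step(1.0, 1.0, 0.25,
                     lambda t0, t1: 2 * t0,
                     lambda t0, t1: 2 * t1)
print(step)  # -> (0.5, 0.5)
```

Using the temporary variables temp0 and temp1 is exactly what makes the update simultaneous: neither new value depends on the other new value.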
Gradient descent intuition
• To understand the intuition, we return to the simpler function and minimize a single parameter:
θ1 := θ1 − α · (d/dθ1) J(θ1), where θ1 ∈ ℝ
Two key terms in the algorithm:
• The learning rate α
• The derivative term
θ1 := θ1 − α · (d/dθ1) J(θ1)
If the derivative at θ1 is positive (upward slope), then θ1 := θ1 − α · (positive number), so θ1 decreases and moves toward the minimum.
Partial derivative vs. derivative
• Use a partial derivative (∂) when the function has multiple variables but we differentiate with respect to only one of them
• Use an ordinary derivative (d) when the function has a single variable
The derivative takes the tangent to the curve at the point (the straight red line) and calculates the slope of that tangent line: slope = vertical change / horizontal change.
θ1 := θ1 − α · (d/dθ1) J(θ1)
If the derivative at θ1 is negative (downward slope), then θ1 := θ1 − α · (negative number), so θ1 increases, again moving toward the minimum.
• The derivative can be positive, negative, or zero
• At each point, the line is tangent to the curve; its slope is the derivative
If α is too small, gradient descent can be slow.
If α is too large, gradient descent can overshoot the minimum. It may fail to converge.
Question: what happens when the current value of θ1 is at a local optimum?
The derivative term = 0, so the update becomes θ1 := θ1 − α · 0, and θ1 remains the same.
Gradient descent can converge to a local minimum, even with the learning rate α fixed.
As we approach a local minimum, gradient descent will automatically take smaller steps. So, no need to decrease α over time.
Gradient descent for linear regression
Apply the gradient descent algorithm to the linear regression model: minimize the squared error cost function J(θ0, θ1).
∂J(θ0, θ1)/∂θj = ∂/∂θj [ (1/2m) · Σ (hθ(x^(i)) − y^(i))² ] = ∂/∂θj [ (1/2m) · Σ (θ0 + θ1·x^(i) − y^(i))² ]

j = 0: ∂J(θ0, θ1)/∂θ0 = (1/m) · Σ (hθ(x^(i)) − y^(i))
j = 1: ∂J(θ0, θ1)/∂θ1 = (1/m) · Σ (hθ(x^(i)) − y^(i)) · x^(i)

(sums over i = 1 … m)
Gradient descent algorithm:
repeat until convergence {
  θ0 := θ0 − α · (1/m) · Σ (hθ(x^(i)) − y^(i))
  θ1 := θ1 − α · (1/m) · Σ (hθ(x^(i)) − y^(i)) · x^(i)
}  (update θ0 and θ1 simultaneously)
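Putting the update rules together, here is a sketch of batch gradient descent for univariate linear regression. The toy data and hyperparameter choices are my own, for illustration only:

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x,
    using the partial derivatives derived above."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/dtheta1
        # Simultaneous update: both gradients use the old parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Toy data generated from y = 1 + 2x, so we expect theta0 ≈ 1, theta1 ≈ 2.
t0, t1 = gradient_descent([0, 1, 2, 3], [1, 3, 5, 7])
```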
[A sequence of slides: with each gradient descent step, (θ0, θ1) moves across the contour plot of J(θ0, θ1) toward the minimum, and the corresponding hypothesis hθ(x) (a function of x) fits the housing data better.]
Another name:“Batch” Gradient Descent
“Batch”: Each step of gradient descent uses all the training examples.
Another algorithm that solves the same minimization problem: the normal equations method.
The gradient descent algorithm scales better than the normal equations method to larger datasets.
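For a single feature, the normal equations reduce to a closed-form least-squares solution. A sketch (no learning rate, no iteration; the helper name is mine):

```python
def normal_equation_fit(xs, ys):
    """Closed-form least-squares fit for one feature:
    theta1 = cov(x, y) / var(x), theta0 = mean(y) - theta1 * mean(x)."""
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    theta1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
              / sum((x - x_mean) ** 2 for x in xs))
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1

# Recovers y = 1 + 2x exactly from noise-free data:
print(normal_equation_fit([0, 1, 2, 3], [1, 3, 5, 7]))  # -> (1.0, 2.0)
```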
Generalizing the gradient descent algorithm:
• Learn with a larger number of features
• With many features, the cost function becomes difficult to plot
With multiple features, the inputs form a matrix: each row is one example, with columns for size, number of bedrooms, number of floors, and age of the home, all in one variable. The prices form a vector, shown as y.
• We need linear algebra for more complex linear regression models
• Linear algebra is good for making computationally efficient models (we'll see later)
– It provides a good way to work with large datasets
– Vectorizing a problem is a common optimization technique
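As a sketch of why vectorization helps: with a design matrix X (one row per example, with a leading 1 for the intercept term), all predictions come from a single matrix-vector product. The feature values and parameters below are hypothetical:

```python
import numpy as np

# Columns: [1 (intercept), size (ft^2), bedrooms, floors, age of home]
X = np.array([[1.0, 2104.0, 5.0, 1.0, 45.0],
              [1.0, 1416.0, 3.0, 2.0, 40.0],
              [1.0, 1534.0, 3.0, 2.0, 30.0]])
theta = np.array([80.0, 0.1, 1.0, 3.0, -2.0])  # hypothetical parameters

predictions = X @ theta  # h_theta(x) for every house at once
```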
End