Lecture 2: Supervised Learning
Linear regression with one variable
Reading:
• Chapter 17, "Bayesian Reasoning and Machine Learning", pages 345-348
• Chapter 3, "Pattern Recognition and Machine Learning", Christopher M. Bishop, page 137
• Chapter 11, "Data Mining: A Knowledge Discovery Approach", from page 346
• Chapter 18, "Artificial Intelligence: A Modern Approach", from page 718
Model representation
[Figure: Housing Prices (Portland, OR): Price (in 1000s of dollars) vs. Size (feet²)]
Supervised Learning
Given the “right answer” for each example in the data.
Regression problem: predict a real-valued output
Classification problem: predict a discrete-valued output
Notation:
m = number of training examples
x's = "input" variable / features
y's = "output" variable / "target" variable
Training set of housing prices (Portland, OR):

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …
(x, y) – one training example
(x^(i), y^(i)) – the i-th training example; i is an index into the training set
x^(1) = 2104, x^(3) = 1534, y^(4) = 178, y^(2) = 232
Training Set → Learning Algorithm → h
The learned hypothesis h takes the size of a house as input and outputs an estimated price.
How do we represent h?
Linear regression with one variable (univariate linear regression)
x → hypothesis h → estimated value of y
h is a function that maps from x's to y's:
hθ(x) = θ0 + θ1x (shorthand: h(x))
In summary
• A hypothesis h takes in some input variable(s)
• Uses parameters determined by a learning system
• Outputs a prediction based on that input
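As a minimal sketch, the hypothesis can be written as a plain Python function. The parameter values below are made-up examples, not fitted to the lecture's data:

```python
def h(theta0, theta1, x):
    """Univariate linear hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Hypothetical parameters: theta0 = 50, theta1 = 0.1.
# A 2104 ft^2 house is then predicted at 50 + 0.1 * 2104 ≈ 260.4 (in $1000s).
print(h(50, 0.1, 2104))
```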
Cost function
• A cost function lets us figure out how to fit the best straight line to our data
Given the training set above, how do we choose the parameters θ0, θ1 of the hypothesis hθ(x) = θ0 + θ1x?
[Three example hypotheses:]
h(x) = 1.5 + 0·x   (θ0 = 1.5, θ1 = 0)
h(x) = 0 + 0.5·x   (θ0 = 0, θ1 = 0.5)
h(x) = 1 + 0.5·x   (θ0 = 1, θ1 = 0.5)
Different parameter values give different functions.
Positive slope if θ1> 0
• hθ(x) is a "y imitator": it tries to convert each x into its y
• Since we already have the true y values, we can evaluate how well hθ(x) does this
Idea: choose θ0, θ1 so that hθ(x) is close to y for our training examples (x, y).
This is a minimization problem: minimize the deviation of hθ(x) from y.
hθ(x^(i)) = θ0 + θ1·x^(i)

Cost function:
J(θ0, θ1) = (1/2m) · Σ (hθ(x^(i)) − y^(i))²   (sum over i = 1 … m)

Goal:
minimize J(θ0, θ1) over θ0, θ1
• This cost function is also called the squared error cost function
– A reasonable choice for most regression problems
– Probably the most commonly used cost function
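A direct Python translation of the squared error cost function (a sketch for illustration; the variable names are my own):

```python
def compute_cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = (1 / 2m) * sum_i (h_theta(x_i) - y_i)^2."""
    m = len(xs)
    squared_errors = ((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return sum(squared_errors) / (2 * m)

# A hypothesis that fits the data exactly has zero cost:
print(compute_cost(0.0, 2.0, [1, 2, 3], [2, 4, 6]))  # -> 0.0
```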
Hypothesis: hθ(x) = θ0 + θ1x
Parameters: θ0, θ1
Cost Function: J(θ0, θ1) = (1/2m) · Σ (hθ(x^(i)) − y^(i))²
Goal: minimize J(θ0, θ1) over θ0, θ1
Cost function intuition I (simplified version)
Simplify: set θ0 = 0, so hθ(x) = θ1x.
[Plots for the simplified case: left, the data and hθ(x) for a fixed θ1 (a function of x); right, J(θ1) (a function of the parameter θ1).]
For the training set (1, 1), (2, 2), (3, 3), with θ0 = 0:
J(θ1) = (1/2m) · Σ (θ1·x^(i) − y^(i))²
J(1) = (1/(2×3)) · (0² + 0² + 0²) = 0
For θ1 = 0.5:
J(0.5) = (1/(2×3)) · ((0.5 − 1)² + (1 − 2)² + (1.5 − 3)²) = (1/6) · (0.5² + 1² + 1.5²) = 3.5/6 ≈ 0.58
For θ1 = 0:
J(0) = (1/6) · ((0×1 − 1)² + (0×2 − 2)² + (0×3 − 3)²) = (1/6) · (1 + 4 + 9) = 14/6 ≈ 2.33
For θ1 = −0.5:
J(−0.5) = (1/6) · ((−0.5 − 1)² + (−1 − 2)² + (−1.5 − 3)²) = 31.5/6 = 5.25
• If we compute J(θ1) over a range of θ1 values and plot J(θ1) vs. θ1, we get a bowl-shaped curve (a parabola, since J is quadratic in θ1)
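The worked values above can be checked with a short script, assuming (as the plots suggest) the three-point training set (1, 1), (2, 2), (3, 3):

```python
xs, ys = [1, 2, 3], [1, 2, 3]  # the toy training set from the slides

def J(theta1):
    """Simplified cost with theta0 fixed at 0."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

for t in (1.0, 0.5, 0.0, -0.5):
    print(f"J({t}) = {J(t):.2f}")
# J(1.0) = 0.00, J(0.5) = 0.58, J(0.0) = 2.33, J(-0.5) = 5.25
```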
Now with both parameters free:
Hypothesis: hθ(x) = θ0 + θ1x
Parameters: θ0, θ1
Cost Function: J(θ0, θ1) = (1/2m) · Σ (hθ(x^(i)) − y^(i))²
Goal: minimize J(θ0, θ1) over θ0, θ1
Cost function intuition II
[Left: the housing data, Price ($) in 1000's vs. Size in feet² (x), with hθ(x) for fixed θ0, θ1 (a function of x). Right: J(θ0, θ1), a function of the two parameters θ0 and θ1.]
Contour Plot
[A series of slides: each shows hθ(x) for fixed θ0, θ1 on the left (a function of x), and the corresponding point on a contour plot of J(θ0, θ1) on the right (a function of the parameters).]
• Doing this manually is painful
• What we really want is an efficient algorithm for finding the values of θ0 and θ1 that minimize J
Gradient descent algorithm
• Minimizes the cost function J
• Used all over machine learning for minimization
Have some function J(θ0, θ1)
Want min over θ0, θ1 of J(θ0, θ1)
Outline:
• Start with some θ0, θ1
• Keep changing θ0, θ1 to reduce J(θ0, θ1), until we hopefully end up at a minimum
Gradient descent algorithm
• Gradient descent is one of the local search methods for optimization: hill climbing, simulated annealing, gradient descent, etc.
Local Search Methods
• Applicable when we seek a goal state and don't care how we get there. E.g.:
– N-queens
– finding shortest/cheapest round trips (Traveling Salesman Problem, Vehicle Routing Problem)
– finding models of propositional formulae (SAT solvers)
– VLSI layout, planning, scheduling, time-tabling
– map coloring
– resource allocation
– protein structure prediction
– genome sequence assembly
Local search key idea (surprisingly simple):
1. Select a (random) initial state (generate an initial guess)
2. Make local modifications to improve the current state (evaluate the current state and move to a better neighboring state)
3. Repeat step 2 until a goal state is found (or we run out of time)
[Surface plots of J(θ0, θ1): running gradient descent from different starting points can end up at different local minima.]
Gradient descent algorithm
The update rule, repeated until convergence:
θj := θj − α · ∂J(θ0, θ1)/∂θj   (for j = 0 and j = 1)
α: the learning rate (should be a small number)
• Large α: huge steps
• Small α: baby steps
∂J(θ0, θ1)/∂θj: the derivative term
Gradient descent algorithm
Correct: simultaneous update of θ0 and θ1 (compute both new values from the current parameters before assigning either).
Incorrect: updating θ0 first and then using the new θ0 when computing the update for θ1.
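The simultaneous update can be sketched as a generic helper; the gradient functions are passed in as arguments (my own framing, not from the slides):

```python
def gradient_step(theta0, theta1, alpha, dJ_d0, dJ_d1):
    """One gradient descent step with a simultaneous update:
    both partial derivatives are evaluated at the CURRENT (theta0, theta1)
    before either parameter is overwritten."""
    temp0 = theta0 - alpha * dJ_d0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_d1(theta0, theta1)
    return temp0, temp1

# Example on J = theta0^2 + theta1^2, whose gradients are 2*theta0 and 2*theta1:
step = gradient_step(1.0, 1.0, 0.25,
                     lambda t0, t1: 2 * t0,
                     lambda t0, t1: 2 * t1)
print(step)  # -> (0.5, 0.5)
```

Using the temporary variables temp0 and temp1 is exactly what makes the update simultaneous: neither new value depends on the other new value.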
Gradient descent intuition
• To understand the intuition, we return to the simpler function and minimize a single parameter:
θ1 := θ1 − α · (d/dθ1) J(θ1), where θ1 ∈ ℝ
Two key terms in the algorithm:
• The learning rate α
• The derivative term
θ1 := θ1 − α · (d/dθ1) J(θ1)
If the derivative at θ1 is positive (upward slope), then θ1 := θ1 − α · (positive number), so θ1 decreases and moves toward the minimum.
Partial derivative vs. derivative
• Use a partial derivative (∂) when the function has multiple variables but we differentiate with respect to only one of them
• Use an ordinary derivative (d) when the function has a single variable
The derivative takes the tangent to the curve at the point (the straight red line) and calculates the slope of that tangent line: slope = vertical change / horizontal change.
θ1 := θ1 − α · (d/dθ1) J(θ1)
If the derivative at θ1 is negative (downward slope), then θ1 := θ1 − α · (negative number), so θ1 increases, again moving toward the minimum.
• The derivative can be positive, negative, or zero
• At each point, the line is tangent to the curve; its slope is the derivative
If α is too small, gradient descent can be slow.
If α is too large, gradient descent can overshoot the minimum. It may fail to converge.
Question: what happens when the current value of θ1 is at a local optimum?
The derivative term = 0, so the update becomes θ1 := θ1 − α · 0, and θ1 remains the same.
Gradient descent can converge to a local minimum, even with the learning rate α fixed.
As we approach a local minimum, gradient descent will automatically take smaller steps. So, no need to decrease α over time.
Gradient descent for linear regression
Apply the gradient descent algorithm to the linear regression model: minimize the squared error cost function J(θ0, θ1).
∂J(θ0, θ1)/∂θj = ∂/∂θj [ (1/2m) · Σ (hθ(x^(i)) − y^(i))² ] = ∂/∂θj [ (1/2m) · Σ (θ0 + θ1·x^(i) − y^(i))² ]

j = 0: ∂J(θ0, θ1)/∂θ0 = (1/m) · Σ (hθ(x^(i)) − y^(i))
j = 1: ∂J(θ0, θ1)/∂θ1 = (1/m) · Σ (hθ(x^(i)) − y^(i)) · x^(i)

(sums over i = 1 … m)
Gradient descent algorithm:
repeat until convergence {
  θ0 := θ0 − α · (1/m) · Σ (hθ(x^(i)) − y^(i))
  θ1 := θ1 − α · (1/m) · Σ (hθ(x^(i)) − y^(i)) · x^(i)
}  (update θ0 and θ1 simultaneously)
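Putting the update rules together, here is a sketch of batch gradient descent for univariate linear regression. The toy data and hyperparameter choices are my own, for illustration only:

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x,
    using the partial derivatives derived above."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/dtheta1
        # Simultaneous update: both gradients use the old parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Toy data generated from y = 1 + 2x, so we expect theta0 ≈ 1, theta1 ≈ 2.
t0, t1 = gradient_descent([0, 1, 2, 3], [1, 3, 5, 7])
```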
[A sequence of slides: with each gradient descent step, (θ0, θ1) moves across the contour plot of J(θ0, θ1) toward the minimum, and the corresponding hypothesis hθ(x) (a function of x) fits the housing data better.]
Another name:“Batch” Gradient Descent
“Batch”: Each step of gradient descent uses all the training examples.
Another algorithm that solves the same minimization problem: the normal equations method.
The gradient descent algorithm scales better than the normal equations method to larger datasets.
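For a single feature, the normal equations reduce to a closed-form least-squares solution. A sketch (no learning rate, no iteration; the helper name is mine):

```python
def normal_equation_fit(xs, ys):
    """Closed-form least-squares fit for one feature:
    theta1 = cov(x, y) / var(x), theta0 = mean(y) - theta1 * mean(x)."""
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    theta1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
              / sum((x - x_mean) ** 2 for x in xs))
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1

# Recovers y = 1 + 2x exactly from noise-free data:
print(normal_equation_fit([0, 1, 2, 3], [1, 3, 5, 7]))  # -> (1.0, 2.0)
```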
Generalizing the gradient descent algorithm:
• Learn with a larger number of features
• With many features, the cost function becomes difficult to plot
With multiple features, the inputs form a matrix: each row is one example, with columns for size, number of bedrooms, number of floors, and age of the home, all in one variable. The prices form a vector, shown as y.
• We need linear algebra for more complex linear regression models
• Linear algebra is good for making computationally efficient models (we'll see later)
– It provides a good way to work with large datasets
– Vectorizing a problem is a common optimization technique
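As a sketch of why vectorization helps: with a design matrix X (one row per example, with a leading 1 for the intercept term), all predictions come from a single matrix-vector product. The feature values and parameters below are hypothetical:

```python
import numpy as np

# Columns: [1 (intercept), size (ft^2), bedrooms, floors, age of home]
X = np.array([[1.0, 2104.0, 5.0, 1.0, 45.0],
              [1.0, 1416.0, 3.0, 2.0, 40.0],
              [1.0, 1534.0, 3.0, 2.0, 30.0]])
theta = np.array([80.0, 0.1, 1.0, 3.0, -2.0])  # hypothetical parameters

predictions = X @ theta  # h_theta(x) for every house at once
```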
End