Lecture 6: Logistic Regression
CSC 84020 - Machine Learning
Andrew Rosenberg
February 19, 2009
Last Time
Regression
Regularization and Overfitting
Today
Logistic Regression
Classification

Goal: Identify which of K classes a data point x belongs to.
Like Regression, Classification is a supervised task.
For each data point x_i we have a corresponding target (or label, or class) t_i that describes the correct classification of the data point.
Goal: identify a function y : \mathbb{R}^D \to C, where t_i \in C = \{c_0, \ldots, c_{K-1}\}
Representations of the target variable

y : \mathbb{R}^D \to C where t_i \in C

For binary (two-way) classification it is convenient to represent t_i as a single scalar variable t_i \in \{0, 1\}.
This allows us to interpret t_i as the likelihood that a point x_i is a member of class c_{K-1}.
When hypothesized from a model, this can represent the confidence of the prediction.
For K > 2 classes, we represent t as a K-element vector where, if a point is a member of class c_j, the j-th element is 1 and all the others are 0. In 5-way classification, a member of class c_2 is

t = (0, 0, 1, 0, 0)^T

We may also represent t as a nominal variable when using non-probabilistic models.
Three approaches to Classification

Generative Approach: highest resource requirements, since we need to approximate the joint probability p(x, c_j).

p(c_j|x) = \frac{p(x|c_j) p(c_j)}{p(x)}

Discriminative approach: moderate resource requirements; typically fewer parameters to estimate than in generative models.

p(c_j|x)

Discriminant function: can be trained probabilistically, but the output does not include confidence information.

f(x) = c_j
Discriminant Functions
Why Discriminant Functions are limiting

What can Generative and Discriminative approaches do that Discriminant Functions cannot?
...Or: why we like probabilities.

Minimizing Risk: continuous updating.
Reject Option: "I don't know."
Compensating for Priors.
Combining Models.

We'll talk about these more when we discuss Perceptrons and Neural Networks.
Generative Modeling

Generative modeling: model the posterior.

p(c_1|x) = \frac{p(x|c_1) p(c_1)}{p(x)}

= \frac{p(x|c_1) p(c_1)}{\sum_j p(x, c_j)}

= \frac{p(x|c_1) p(c_1)}{p(x, c_0) + p(x, c_1)}

= \frac{p(x|c_1) p(c_1)}{p(x|c_0) p(c_0) + p(x|c_1) p(c_1)}
Sigmoid function

The sigmoid¹ function is a squashing function:

\sigma(x) = \frac{1}{1 + \exp(-x)}

A squashing function maps the reals to a finite domain:

\sigma : \mathbb{R} \to (0, 1)

¹S-shaped
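As a quick illustration (not from the slides), a minimal NumPy sketch of the sigmoid; the two branches are a standard numerical guard so exp never overflows:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, mapping R -> (0, 1), computed stably."""
    a = np.asarray(a, dtype=float)
    out = np.empty_like(a)
    pos = a >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-a[pos]))   # exp(-a) is safe for a >= 0
    ea = np.exp(a[~pos])                       # exp(a) is safe for a < 0
    out[~pos] = ea / (1.0 + ea)
    return out

print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # ~[4.54e-05, 0.5, 0.99995]
```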
Generative Modeling

p(c_1|x) = \frac{p(x|c_1) p(c_1)}{p(x|c_0) p(c_0) + p(x|c_1) p(c_1)}

= \frac{1}{\frac{p(x|c_0) p(c_0) + p(x|c_1) p(c_1)}{p(x|c_1) p(c_1)}}

= \frac{1}{\frac{p(x|c_0) p(c_0)}{p(x|c_1) p(c_1)} + 1}

= \frac{1}{\exp\left(\ln \frac{p(x|c_0) p(c_0)}{p(x|c_1) p(c_1)}\right) + 1}

= \frac{1}{\exp\left(-\ln \frac{p(x|c_1) p(c_1)}{p(x|c_0) p(c_0)}\right) + 1}

Let a = \ln \frac{p(x|c_1) p(c_1)}{p(x|c_0) p(c_0)}. Then

p(c_1|x) = \frac{1}{1 + \exp(-a)} = \sigma(a)
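A quick numerical check of this identity; the 1-D Gaussian class conditionals and priors below are made up purely for illustration:

```python
import numpy as np
from scipy.stats import norm

p_c0, p_c1 = 0.4, 0.6                      # assumed priors
x = 1.3
lik0 = norm.pdf(x, loc=-1.0, scale=1.5)    # p(x|c0), assumed Gaussian
lik1 = norm.pdf(x, loc=2.0, scale=1.5)     # p(x|c1), assumed Gaussian

# Direct Bayes rule.
posterior = lik1 * p_c1 / (lik0 * p_c0 + lik1 * p_c1)

# Via the log-odds a and the sigmoid.
a = np.log(lik1 * p_c1) - np.log(lik0 * p_c0)
via_sigmoid = 1.0 / (1.0 + np.exp(-a))

print(posterior, via_sigmoid)              # identical up to floating point
```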
Some more vocabulary

log-odds, or log-odds-ratio:

a = \ln \frac{p(x|c_1) p(c_1)}{p(x|c_0) p(c_0)}

logit function: the inverse of the sigmoid. If

\sigma = \frac{1}{1 + \exp(-a)}

then

a = \ln \frac{\sigma}{1 - \sigma}
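To make the inverse relationship concrete, a tiny sketch (not from the slides):

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
logit = lambda s: np.log(s / (1.0 - s))   # the log-odds ln(s / (1 - s))

a = np.linspace(-4, 4, 9)
print(np.allclose(logit(sigmoid(a)), a))  # True: logit undoes the sigmoid
```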
Generative Model

Derive p(c_0|x) with a Gaussian class conditional probability:

p(x|c_k) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)\right)

We'll assume that p(x|c_0) and p(x|c_1) have equal covariance matrices.

Want to show that p(c_0|x) = \sigma(w^T x).

p(c_0|x) = \sigma(a), \quad a = \ln \frac{p(x|c_0) p(c_0)}{p(x|c_1) p(c_1)}

a = \ln\left\{\frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0)\right)\right\} - \ln\left\{\frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right)\right\} + \ln \frac{p(c_0)}{p(c_1)}
Generative model

The normalizing constants cancel (the covariances are equal), leaving:

a = -\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) + \frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \ln \frac{p(c_0)}{p(c_1)}

Expanding the quadratic forms:

a = -\frac{1}{2}\left(x^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0 - x^T \Sigma^{-1} \mu_0 - \mu_0^T \Sigma^{-1} x\right) + \frac{1}{2}\left(x^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 - x^T \Sigma^{-1} \mu_1 - \mu_1^T \Sigma^{-1} x\right) + \ln \frac{p(c_0)}{p(c_1)}

If A is symmetric, A = A^T. If A is symmetric, x^T A y = y^T A x. (HW)

The x^T \Sigma^{-1} x terms cancel, and since \Sigma^{-1} is symmetric, x^T \Sigma^{-1} \mu_k = \mu_k^T \Sigma^{-1} x:

a = -\frac{1}{2}\left(\mu_0^T \Sigma^{-1} \mu_0 - 2\mu_0^T \Sigma^{-1} x\right) + \frac{1}{2}\left(\mu_1^T \Sigma^{-1} \mu_1 - 2\mu_1^T \Sigma^{-1} x\right) + \ln \frac{p(c_0)}{p(c_1)}

Collecting the terms in x:

a = (\mu_0^T \Sigma^{-1} - \mu_1^T \Sigma^{-1}) x - \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0 + \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \ln \frac{p(c_0)}{p(c_1)}
Generative model

p(c_0|x) = \sigma(a)

a = (\mu_0^T \Sigma^{-1} - \mu_1^T \Sigma^{-1}) x - \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0 + \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \ln \frac{p(c_0)}{p(c_1)}

This is linear in x:

a = w^T x + w_0

w = (\mu_0^T \Sigma^{-1} - \mu_1^T \Sigma^{-1})^T = \Sigma^{-1}(\mu_0 - \mu_1)

w_0 = -\frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0 + \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \ln \frac{p(c_0)}{p(c_1)}
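Putting the result to work: a small NumPy sketch (the means, shared covariance, and priors are made-up values, not from the slides) that builds w and w_0 from the formulas above and checks \sigma(w^T x + w_0) against the direct Bayes computation:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
mu0, mu1 = np.array([1.0, 0.0]), np.array([-1.0, 1.0])  # assumed class means
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])              # shared covariance
p_c0, p_c1 = 0.3, 0.7                                   # assumed priors

Sinv = np.linalg.inv(Sigma)
w = Sinv @ (mu0 - mu1)
w0 = -0.5 * mu0 @ Sinv @ mu0 + 0.5 * mu1 @ Sinv @ mu1 + np.log(p_c0 / p_c1)

x = rng.normal(size=2)
via_sigmoid = 1.0 / (1.0 + np.exp(-(w @ x + w0)))       # sigma(w^T x + w0)

# Direct Bayes rule for comparison.
lik0 = multivariate_normal.pdf(x, mu0, Sigma)
lik1 = multivariate_normal.pdf(x, mu1, Sigma)
bayes = lik0 * p_c0 / (lik0 * p_c0 + lik1 * p_c1)
print(via_sigmoid, bayes)                               # agree up to floating point
```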
Maximum Likelihood Solution
Now we have a way to describe the linear transformation of x used to generate a prediction under a Gaussian assumption.
How do we estimate the parameters?
Maximize the likelihood function with respect to each parameter.

p(t, X|\pi, \mu_0, \mu_1, \Sigma) = \prod_{n=0}^{N-1} (\pi \mathcal{N}(x_n|\mu_0, \Sigma))^{t_n} ((1 - \pi)\mathcal{N}(x_n|\mu_1, \Sigma))^{1 - t_n}

t_n = 1 for class 0, t_n = 0 for class 1.
Prior class probabilities: p(c_0) = \pi, p(c_1) = 1 - \pi.
Maximum Likelihood Solution
Optimize \pi.

p(t, X|\pi, \mu_0, \mu_1, \Sigma) = \prod_{n=0}^{N-1} (\pi \mathcal{N}(x_n|\mu_0, \Sigma))^{t_n} ((1 - \pi)\mathcal{N}(x_n|\mu_1, \Sigma))^{1 - t_n}

\ln p(t, X|\pi, \mu_0, \mu_1, \Sigma) = \ln \prod_{n=0}^{N-1} (\pi \mathcal{N}(x_n|\mu_0, \Sigma))^{t_n} ((1 - \pi)\mathcal{N}(x_n|\mu_1, \Sigma))^{1 - t_n}

= \sum_{n=0}^{N-1} t_n \ln(\pi \mathcal{N}(x_n|\mu_0, \Sigma)) + (1 - t_n) \ln((1 - \pi)\mathcal{N}(x_n|\mu_1, \Sigma))

= \sum_{n=0}^{N-1} t_n \ln \pi + (1 - t_n) \ln(1 - \pi) + \text{const}
Maximum Likelihood Solution
\ln p(t, X|\pi, \mu_0, \mu_1, \Sigma) = \sum_{n=0}^{N-1} t_n \ln \pi + (1 - t_n) \ln(1 - \pi) + \text{const}

Setting the derivative with respect to \pi to zero:

\frac{\partial \ln p(t, X|\pi, \mu_0, \mu_1, \Sigma)}{\partial \pi} = \frac{1}{\pi} \sum_{n=0}^{N-1} t_n - \frac{1}{1 - \pi} \sum_{n=0}^{N-1} (1 - t_n) = 0

\frac{1}{\pi} \sum_{n=0}^{N-1} t_n = \frac{1}{1 - \pi} \sum_{n=0}^{N-1} (1 - t_n)

\frac{1 - \pi}{\pi} \sum_{n=0}^{N-1} t_n = \sum_{n=0}^{N-1} (1 - t_n)

\frac{1}{\pi} \sum_{n=0}^{N-1} t_n - \sum_{n=0}^{N-1} t_n = \sum_{n=0}^{N-1} (1 - t_n)

\frac{1}{\pi} \sum_{n=0}^{N-1} t_n = \sum_{n=0}^{N-1} (1 - t_n) + \sum_{n=0}^{N-1} t_n

\frac{1}{\pi} \sum_{n=0}^{N-1} t_n = \sum_{n=0}^{N-1} 1
Maximum Likelihood Solution
\frac{1}{\pi} \sum_{n=0}^{N-1} t_n = \sum_{n=0}^{N-1} 1

\frac{1}{\pi} \sum_{n=0}^{N-1} t_n = N

\pi = \frac{1}{N} \sum_{n=0}^{N-1} t_n = \frac{N_0}{N} = \frac{N_0}{N_0 + N_1}

Be prepared to maximize \mu_0 and \Sigma for HW.
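A sketch of the resulting estimator on made-up data (the data, shapes, and values here are assumptions for illustration); the class means shown are the standard sample-mean estimates, in anticipation of the HW question:

```python
import numpy as np

# Toy labeled data: t_n = 1 for class 0, t_n = 0 for class 1 (slide convention).
rng = np.random.default_rng(1)
X0 = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(40, 2))   # class-0 points
X1 = rng.normal(loc=[-1.0, 1.0], scale=1.0, size=(60, 2))  # class-1 points
X = np.vstack([X0, X1])
t = np.concatenate([np.ones(len(X0)), np.zeros(len(X1))])

pi_hat = t.mean()                  # pi = N0 / N, as derived above
mu0_hat = X[t == 1].mean(axis=0)   # sample mean of class 0
mu1_hat = X[t == 0].mean(axis=0)   # sample mean of class 1

print(pi_hat)                      # 40 / 100 = 0.4
```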
Discriminative Linear Classification
In the generative case, recall:

p(t|x) = \frac{p(x|t) p(t)}{p(x)}

Can generate synthetic data from p(x).
Need to model the joint probability.

In Discriminative Modeling:
Model p(t|x) directly.
Logistic Regression
From the generative case we can find that, under some assumptions:

p(t|x) = y(x) = \sigma(w^T x)

In M dimensions this has M parameters. In the generative case there are 2M mean parameters and an M(M+1)/2-parameter covariance matrix².
Parameters grow linearly in M rather than quadratically in M.
So we'd rather optimize this function directly.

²Covariance matrices are symmetric.
Maximum likelihood
Define the likelihood:

p(t|w) = \prod_{n=0}^{N-1} p(c_0|x_n)^{t_n} (1 - p(c_0|x_n))^{1 - t_n}

E(w) = -\ln p(t|w) = -\sum_{n=0}^{N-1} \{t_n \ln p(c_0|x_n) + (1 - t_n) \ln p(c_1|x_n)\}

where y_n = p(c_0|x_n) = \sigma(a_n). Then

E(w) = -\ln p(t|w) = -\sum_{n=0}^{N-1} \{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\}

This is also the cross entropy error function.³

³Logistic Regression is also called maximum entropy or maxent.
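Transcribed directly into NumPy (a sketch; the clipping is a standard numerical guard, not something in the slides):

```python
import numpy as np

def cross_entropy(w, X, t, eps=1e-12):
    """E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)], y_n = sigmoid(w^T x_n)."""
    y = 1.0 / (1.0 + np.exp(-X @ w))
    y = np.clip(y, eps, 1.0 - eps)   # keep the logs finite
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
```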
Maximum Likelihood
E(w) = -\ln p(t|w) = -\sum_{n=0}^{N-1} \{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\}

By the chain rule:

\nabla_w E = \sum_{n=0}^{N-1} \frac{\partial E}{\partial y_n} \frac{\partial y_n}{\partial a_n} \nabla_w a_n
Maximum Likelihood
Derivation of \frac{\partial E}{\partial y_n}:

E(w) = -\sum_{n=0}^{N-1} \{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\}

\frac{\partial E}{\partial y_n} = \frac{1 - t_n}{1 - y_n} - \frac{t_n}{y_n}

= \frac{y_n(1 - t_n) - t_n(1 - y_n)}{y_n(1 - y_n)}

= \frac{y_n - y_n t_n - t_n + y_n t_n}{y_n(1 - y_n)}

= \frac{y_n - t_n}{y_n(1 - y_n)}
Maximum Likelihood
Derivation of \frac{\partial y_n}{\partial a_n}:

\frac{d\sigma}{da} = \frac{d}{da}\left(\frac{1}{1 + \exp(-a)}\right)

= \frac{d(1 + \exp(-a))^{-1}}{da}

= (-1)(1 + \exp(-a))^{-2}(\exp(-a))(-1)

= (1 + \exp(-a))^{-2}(\exp(-a))

= \frac{1}{1 + \exp(-a)} \cdot \frac{\exp(-a)}{1 + \exp(-a)}

= \frac{1}{1 + \exp(-a)} \cdot \frac{1 + \exp(-a) - 1}{1 + \exp(-a)}

= \frac{1}{1 + \exp(-a)} \left(\frac{1 + \exp(-a)}{1 + \exp(-a)} - \frac{1}{1 + \exp(-a)}\right)

= \sigma(1 - \sigma)
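A finite-difference sanity check of this identity (a sketch, not from the slides):

```python
import numpy as np

sigma = lambda a: 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5, 5, 11)
h = 1e-6
numeric = (sigma(a + h) - sigma(a - h)) / (2 * h)   # central difference
analytic = sigma(a) * (1.0 - sigma(a))              # sigma(1 - sigma)
print(np.max(np.abs(numeric - analytic)))           # tiny (~1e-10 or smaller)
```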
Maximum Likelihood
Derivation of \nabla_w a_n:

a_n = w^T x_n

\nabla_w a_n = x_n
Maximum Likelihood
Putting it all together:

\nabla_w E = \sum_{n=0}^{N-1} \frac{\partial E}{\partial y_n} \frac{\partial y_n}{\partial a_n} \nabla_w a_n

= \sum_{n=0}^{N-1} \frac{y_n - t_n}{y_n(1 - y_n)} (y_n(1 - y_n)) x_n

= \sum_{n=0}^{N-1} (y_n - t_n) x_n

This is the same as the gradient of the sum-of-squares error in linear regression.
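As a sketch of how this gradient is used in practice (the learning rate, iteration count, 1/N averaging, and synthetic data are all choices of this example, not from the slides):

```python
import numpy as np

def fit_logistic(X, t, lr=0.1, n_iters=5000):
    """Minimize the cross-entropy by batch gradient descent.

    X: (N, D) design matrix (prepend a column of 1s for a bias term),
    t: (N,) targets in {0, 1}; t_n = 1 marks the class modeled by y_n.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-X @ w))   # y_n = sigmoid(w^T x_n)
        grad = X.T @ (y - t)               # sum_n (y_n - t_n) x_n
        w -= lr * grad / len(t)            # averaged step for stability
    return w

# Tiny made-up example: data generated from a known weight vector.
rng = np.random.default_rng(2)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]
true_w = np.array([-0.5, 2.0, -1.0])
t = (rng.random(100) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)
print(fit_logistic(X, t))                  # approaches the MLE, roughly near true_w
```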
How do we optimize this?
We know the gradient, but how do we find the minimum value?

\nabla_w E = \sum_{n=0}^{N-1} (y_n - t_n) x_n

Numerical approximation: Gradient Descent.

w_{n+1} = w_n - \eta \nabla_w E(w_n)

Guess. Jump in the direction of the negative gradient. Guess again.
Example of Gradient Descent
[Figure: one gradient descent step on a 1-D function f(x).]

x_0 = 5, \quad f'(x_0) = 10, \quad \eta = 0.2

x_1 = x_0 - \eta f'(x_0) = 5 - 0.2 \cdot 10 = 3
Example of Gradient Descent
[Figure: the next gradient descent step on the same function.]

x_1 = 3, \quad f'(x_1) = 6, \quad \eta = 0.2

x_2 = x_1 - \eta f'(x_1) = 3 - 0.2 \cdot 6 = 1.8
Example of Gradient Descent
[Figure: the next gradient descent step.]

x_2 = 1.8, \quad f'(x_2) = 3.6, \quad \eta = 0.2

x_3 = x_2 - \eta f'(x_2) = 1.8 - 0.2 \cdot 3.6 = 1.08
Example of Gradient Descent
[Figure: the next gradient descent step.]

x_3 = 1.08, \quad f'(x_3) = 2.16, \quad \eta = 0.2

x_4 = x_3 - \eta f'(x_3) = 1.08 - 0.2 \cdot 2.16 = 0.648
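The slides don't name f, but the printed derivatives are consistent with f(x) = x², i.e. f'(x) = 2x; under that assumption a few lines reproduce the iterates:

```python
f_prime = lambda x: 2.0 * x   # assuming f(x) = x^2, which matches f'(5) = 10
eta = 0.2
x = 5.0
for step in range(4):
    x = x - eta * f_prime(x)  # gradient descent update
    print(step + 1, x)
# 1 3.0
# 2 1.8
# 3 1.08
# 4 0.648
```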
Another approach to N-way classification
In this derivation we used a 1-of-K representation with a K-class discriminant function.
Another approach is to construct K - 1 binary classifiers.
Each classifier C_n compares c_n to "not c_n".
Binary classifiers are simpler, but there are some problems with this approach.
One versus the rest
K-class discriminant
Context
Logistic Regression

Powerful classification technique.
Must be approximated numerically; there is no closed form.
Assumes linearity.
Can also be extended with basis functions.
Also called maximum entropy.
Bye
Next
Graphical Models