
LINEAR AND LOGISTIC REGRESSION PROOFS

    HAROLD VALDIVIA GARCIA

    Contents

1. Linear Regression
2. Logistic Regression
2.1. The Cost Function $J(\theta)$

    1. Linear Regression

    2. Logistic Regression

For logistic regression, we use $h_\theta(x^{(i)})$ as the estimated probability that the training example $x^{(i)}$ is in class $y = 1$ (or is labeled as $y = 1$).

Here, we assume that the response variables $y^{(1)}, y^{(2)}, \ldots, y^{(m)} \sim \mathrm{Bern}(p = \phi_i)$, where $\phi_i$ is estimated by $h_\theta(x^{(i)})$.

The hypothesis $h_\theta(x)$ is the logistic function:

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

The derivative of the sigmoid function has the following nice property (the proof is very easy, so we will not prove it):

$$g'(z) = g(z)\,(1 - g(z))$$

Let's consider $X \in \mathbb{R}^{m \times (n+1)}$ and $Y \in \mathbb{R}^{m \times 1}$ as our dataset.

Now, for the parameters $\theta \in \mathbb{R}^{n+1}$, we have that $X\theta$ is a linear combination of the features of $X$:


$$X\theta = \begin{bmatrix} \theta^T x^{(1)} \\ \theta^T x^{(2)} \\ \theta^T x^{(3)} \\ \vdots \\ \theta^T x^{(m)} \end{bmatrix}_{m \times 1}
= \begin{bmatrix} x^{(1)T}\theta \\ x^{(2)T}\theta \\ x^{(3)T}\theta \\ \vdots \\ x^{(m)T}\theta \end{bmatrix}_{m \times 1}$$

Let's define the vector $h \in \mathbb{R}^{m \times 1}$ such that:

$$[h]_i = g(\theta^T x^{(i)}) = g(x^{(i)T}\theta) \quad (g \text{ is the sigmoid function})$$

$$h = g(X\theta) \quad (g \text{ is the matrix version of the sigmoid function})$$

$$h = \begin{bmatrix} g(x^{(1)T}\theta) \\ \vdots \\ g(x^{(i)T}\theta) \\ \vdots \\ g(x^{(m)T}\theta) \end{bmatrix}_{m \times 1}$$
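As an illustration of the definitions above (not part of the original proof), here is a minimal NumPy sketch of the vectorized hypothesis $h = g(X\theta)$; the example values of X and theta are arbitrary placeholders:

```python
import numpy as np

def sigmoid(z):
    # Elementwise logistic function g(z) = 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative data: X is m x (n+1) with a leading column of ones, theta is (n+1,).
X = np.array([[1.0, 0.5],
              [1.0, -1.2],
              [1.0, 2.3]])
theta = np.array([0.1, 0.4])

# X @ theta stacks the m inner products theta^T x^(i); applying g gives h.
h = sigmoid(X @ theta)
print(h)  # one estimated probability P(y = 1 | x^(i)) per training example
```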

2.1. The Cost Function $J(\theta)$.

We can get the cost function either by using maximum likelihood $L(\theta)$ over the joint distribution of the dataset, or by constructing a cost function that penalizes misclassification. We present the following similar cost functions:

$$J_1(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)})) \right]$$

$$J_2(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)})) \right]$$

$$J_3(\theta) = \sum_{i=1}^{m}\left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)})) \right]$$


The last cost function $J_3(\theta)$ is the log-likelihood $\ell(\theta) = \log(L(\theta))$ of the parameters $\theta$. It is easy to demonstrate that minimizing $J_1(\theta)$ is the same as maximizing $J_2(\theta)$ and $J_3(\theta)$.

$$\arg\min_\theta J_1(\theta) = \arg\max_\theta J_2(\theta) = \arg\max_\theta J_3(\theta)$$
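Indeed, from the definitions above, the three cost functions differ only by a sign and a constant factor,

$$J_1(\theta) = -J_2(\theta) = -\frac{1}{m}\,J_3(\theta),$$

so any $\theta$ that minimizes $J_1(\theta)$ also maximizes $J_2(\theta)$ and $J_3(\theta)$.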

2.1.1. Matrix notation for $J(\theta)$. Let's consider the matrix notation for $J_1(\theta)$:

$$J_1(\theta) = -\frac{1}{m}\left[ Y^T\log(h) + (1 - Y)^T\log(1 - h) \right]$$
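A minimal NumPy sketch of this matrix form, reusing the illustrative sigmoid and data from the earlier snippet (the function name cost_J1 is a placeholder):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_J1(theta, X, y):
    # J1(theta) = -(1/m) * [ y^T log(h) + (1 - y)^T log(1 - h) ]
    m = X.shape[0]
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1.0 - y) @ np.log(1.0 - h)) / m

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.3]])
y = np.array([1.0, 0.0, 1.0])
print(cost_J1(np.zeros(X.shape[1]), X, y))  # log(2) ~ 0.693 when theta = 0
```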

2.1.2. Gradient Descent for minimizing $J_1(\theta)$.

$$\theta := \theta - \alpha\,\nabla_\theta J_1(\theta)$$

$$\nabla_\theta J_1(\theta) = \;?$$

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = -\frac{1}{m}\,\frac{\partial}{\partial\theta_j}\sum_{i=1}^{m}\left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)})) \right]$$

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\frac{1}{h_\theta(x^{(i)})}\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) - (1 - y^{(i)})\frac{1}{1 - h_\theta(x^{(i)})}\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) \right]$$

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ \frac{y^{(i)}}{h_\theta(x^{(i)})} - \frac{1 - y^{(i)}}{1 - h_\theta(x^{(i)})} \right]\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)})$$

The partial derivative of the hypothesis $h_\theta(x)$ is:


$$\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) = \frac{\partial}{\partial\theta_j}g(\theta^T x^{(i)})$$

$$\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) = g(\theta^T x^{(i)})\left(1 - g(\theta^T x^{(i)})\right)\frac{\partial}{\partial\theta_j}\theta^T x^{(i)}$$

$$\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) = g(\theta^T x^{(i)})\left(1 - g(\theta^T x^{(i)})\right)x_j^{(i)}$$

$$\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) = h_\theta(x^{(i)})\left(1 - h_\theta(x^{(i)})\right)x_j^{(i)}$$

Then, the partial derivative of $J_1(\theta)$ can be written as:

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ \frac{y^{(i)}}{h_\theta(x^{(i)})} - \frac{1 - y^{(i)}}{1 - h_\theta(x^{(i)})} \right]\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)})$$

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\left(1 - h_\theta(x^{(i)})\right) - \left(1 - y^{(i)}\right)h_\theta(x^{(i)}) \right]x_j^{(i)}$$

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} - y^{(i)}h_\theta(x^{(i)}) - h_\theta(x^{(i)}) + y^{(i)}h_\theta(x^{(i)}) \right]x_j^{(i)}$$

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} - h_\theta(x^{(i)}) \right]x_j^{(i)}$$

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[ h_\theta(x^{(i)}) - y^{(i)} \right]x_j^{(i)}$$
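The final summation can be checked numerically. The sketch below (NumPy, hypothetical helper names) computes $\frac{\partial}{\partial\theta_j} J_1(\theta)$ as an explicit loop over the training examples and compares it against a central finite difference of the cost:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_J1(theta, X, y):
    m = X.shape[0]
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1.0 - y) @ np.log(1.0 - h)) / m

def partial_J1(theta, X, y, j):
    # (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i), written as an explicit loop.
    m = X.shape[0]
    total = 0.0
    for i in range(m):
        total += (sigmoid(X[i] @ theta) - y[i]) * X[i, j]
    return total / m

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.3]])
y = np.array([1.0, 0.0, 1.0])
theta = np.array([0.1, -0.3])

j, eps = 1, 1e-6
e_j = np.zeros_like(theta)
e_j[j] = eps
numeric = (cost_J1(theta + e_j, X, y) - cost_J1(theta - e_j, X, y)) / (2 * eps)
print(partial_J1(theta, X, y, j), numeric)  # the two values should agree closely
```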


    The expression above in vector notation is:

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = \frac{1}{m}
\begin{bmatrix} x_j^{(1)} & x_j^{(2)} & \cdots & x_j^{(i)} & \cdots & x_j^{(m)} \end{bmatrix}_{1 \times m}
\begin{bmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ h_\theta(x^{(2)}) - y^{(2)} \\ \vdots \\ h_\theta(x^{(i)}) - y^{(i)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{bmatrix}_{m \times 1}$$

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = \frac{1}{m}\,x_j^T(h - y)$$

$$\nabla_\theta J_1(\theta) = \frac{1}{m}\,X^T(h - y)$$
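In code, this matrix form of the gradient is essentially a one-liner; a small NumPy sketch under the same illustrative names (grad_J1 is a placeholder name):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_J1(theta, X, y):
    # grad J1(theta) = (1/m) * X^T (h - y), with h = g(X theta).
    m = X.shape[0]
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / m

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.3]])
y = np.array([1.0, 0.0, 1.0])
print(grad_J1(np.array([0.1, -0.3]), X, y))  # one partial derivative per component of theta
```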

The vector notation for the gradient descent rule is as follows:

$$\theta := \theta - \frac{\alpha}{m}\,X^T(h - y)$$
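Putting this update rule in a loop gives a bare-bones batch gradient descent; the learning rate alpha, the iteration count, and the data below are arbitrary illustrative choices, not values from the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, n_iters=5000):
    # Repeatedly apply theta := theta - (alpha/m) * X^T (h - y).
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)
        theta -= (alpha / m) * (X.T @ (h - y))
    return theta

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.3]])
y = np.array([1.0, 0.0, 1.0])
theta = gradient_descent(X, y)
print(theta, sigmoid(X @ theta))  # fitted parameters and predicted probabilities
```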

2.1.3. Gradient Ascent for maximizing $J_2(\theta)$.

$$\theta := \theta + \alpha\,\nabla_\theta J_2(\theta)$$

$$\nabla_\theta J_2(\theta) = \;?$$

$$\frac{\partial}{\partial\theta_j} J_2(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ h_\theta(x^{(i)}) - y^{(i)} \right]x_j^{(i)}$$

$$\frac{\partial}{\partial\theta_j} J_2(\theta) = -\frac{1}{m}\,x_j^T(h - y)$$

$$\nabla_\theta J_2(\theta) = -\frac{1}{m}\,X^T(h - y)$$


The vector notation for the gradient ascent rule is as follows:

$$\theta := \theta - \frac{\alpha}{m}\,X^T(h - y)$$
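Since $\nabla_\theta J_2(\theta) = -\nabla_\theta J_1(\theta)$, gradient ascent on $J_2(\theta)$ produces exactly the same parameter updates as gradient descent on $J_1(\theta)$; a tiny numerical check (illustrative NumPy names again):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.3]])
y = np.array([1.0, 0.0, 1.0])
theta = np.array([0.1, -0.3])
alpha, m = 0.1, X.shape[0]

h = sigmoid(X @ theta)
grad_J1 = X.T @ (h - y) / m   # gradient of J1
grad_J2 = -grad_J1            # gradient of J2 = -J1

descent_step = theta - alpha * grad_J1  # gradient descent on J1
ascent_step = theta + alpha * grad_J2   # gradient ascent on J2
print(np.allclose(descent_step, ascent_step))  # True: the two rules coincide
```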