
LINEAR AND LOGISTIC REGRESSION PROOFS

    HAROLD VALDIVIA GARCIA

    Contents

1. Linear Regression
2. Logistic Regression
2.1. The Cost Function $J(\theta)$

    1. Linear Regression

    2. Logistic Regression

For logistic regression, we use $h_\theta(x^{(i)})$ as the estimated probability that the training example $x^{(i)}$ is in class $y = 1$ (or is labeled as $y = 1$).

Here, we assume that the response variables $y^{(1)}, y^{(2)}, \ldots, y^{(m)} \sim \mathrm{Bern}(p = \phi_i)$, where $\phi_i$ is estimated by $h_\theta(x^{(i)})$.

The hypothesis $h_\theta(x)$ is the logistic function:

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

The derivative of the sigmoid function has the following nice property (the proof is very easy, so we will not prove it):

$$g'(z) = g(z)\,(1 - g(z))$$

Let's consider $X \in \mathbb{R}^{m \times (n+1)}$ and $Y \in \mathbb{R}^{m \times 1}$ as our dataset.

Now, for the parameters $\theta \in \mathbb{R}^{n+1}$, we have that $X\theta$ is a linear combination of the features of $X$:


$$X\theta = \begin{bmatrix} \theta^T x^{(1)} \\ \theta^T x^{(2)} \\ \theta^T x^{(3)} \\ \vdots \\ \theta^T x^{(m)} \end{bmatrix}_{m \times 1}
= \begin{bmatrix} x^{(1)T}\theta \\ x^{(2)T}\theta \\ x^{(3)T}\theta \\ \vdots \\ x^{(m)T}\theta \end{bmatrix}_{m \times 1}$$

Let's define the vector $h \in \mathbb{R}^{m \times 1}$ such that:

$$[h]_i = g(\theta^T x^{(i)}) = g(x^{(i)T}\theta) \quad (g \text{ is the sigmoid function})$$

$$h = g(X\theta) \quad (g \text{ is the matrix version of the sigmoid function})$$

$$h = \begin{bmatrix} g(x^{(1)T}\theta) \\ \vdots \\ g(x^{(i)T}\theta) \\ \vdots \\ g(x^{(m)T}\theta) \end{bmatrix}_{m \times 1}$$
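As an illustration of the definitions above (not part of the original proof), here is a minimal NumPy sketch of the vectorized hypothesis $h = g(X\theta)$; the example values of X and theta are arbitrary placeholders:

```python
import numpy as np

def sigmoid(z):
    # Elementwise logistic function g(z) = 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative data: X is m x (n+1) with a leading column of ones, theta is (n+1,).
X = np.array([[1.0, 0.5],
              [1.0, -1.2],
              [1.0, 2.3]])
theta = np.array([0.1, 0.4])

# X @ theta stacks the m inner products theta^T x^(i); applying g gives h.
h = sigmoid(X @ theta)
print(h)  # one estimated probability P(y = 1 | x^(i)) per training example
```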

2.1. The Cost Function $J(\theta)$.

We can get the cost function either by using maximum likelihood $L(\theta)$ over the joint distribution of the dataset, or by constructing a cost function that penalizes misclassification. We present the following similar cost functions:

$$J_1(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)})) \right]$$

$$J_2(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)})) \right]$$

$$J_3(\theta) = \sum_{i=1}^{m}\left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)})) \right]$$


The last cost function $J_3(\theta)$ is the log-likelihood $\ell(\theta) = \log(L(\theta))$ of the parameters $\theta$. It is easy to demonstrate that minimizing $J_1(\theta)$ is the same as maximizing $J_2(\theta)$ and $J_3(\theta)$.

$$\arg\min_\theta J_1(\theta) = \arg\max_\theta J_2(\theta) = \arg\max_\theta J_3(\theta)$$
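Indeed, from the definitions above, the three cost functions differ only by a sign and a constant factor,

$$J_1(\theta) = -J_2(\theta) = -\frac{1}{m}\,J_3(\theta),$$

so any $\theta$ that minimizes $J_1(\theta)$ also maximizes $J_2(\theta)$ and $J_3(\theta)$.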

2.1.1. Matrix notation for $J(\theta)$. Let's consider the matrix notation for $J_1(\theta)$:

$$J_1(\theta) = -\frac{1}{m}\left[ Y^T\log(h) + (1 - Y)^T\log(1 - h) \right]$$
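A minimal NumPy sketch of this matrix form, reusing the illustrative sigmoid and data from the earlier snippet (the function name cost_J1 is a placeholder):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_J1(theta, X, y):
    # J1(theta) = -(1/m) * [ y^T log(h) + (1 - y)^T log(1 - h) ]
    m = X.shape[0]
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1.0 - y) @ np.log(1.0 - h)) / m

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.3]])
y = np.array([1.0, 0.0, 1.0])
print(cost_J1(np.zeros(X.shape[1]), X, y))  # log(2) ~ 0.693 when theta = 0
```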

2.1.2. Gradient Descent for minimizing $J_1(\theta)$.

$$\theta := \theta - \alpha\,\nabla_\theta J_1(\theta)$$

$$\nabla_\theta J_1(\theta) = \;?$$

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = -\frac{1}{m}\,\frac{\partial}{\partial\theta_j}\sum_{i=1}^{m}\left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)})) \right]$$

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\frac{1}{h_\theta(x^{(i)})}\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) - (1 - y^{(i)})\frac{1}{1 - h_\theta(x^{(i)})}\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) \right]$$

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ \frac{y^{(i)}}{h_\theta(x^{(i)})} - \frac{1 - y^{(i)}}{1 - h_\theta(x^{(i)})} \right]\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)})$$

The partial derivative of the hypothesis $h_\theta(x)$ is:


$$\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) = \frac{\partial}{\partial\theta_j}g(\theta^T x^{(i)})$$

$$\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) = g(\theta^T x^{(i)})\left(1 - g(\theta^T x^{(i)})\right)\frac{\partial}{\partial\theta_j}\theta^T x^{(i)}$$

$$\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) = g(\theta^T x^{(i)})\left(1 - g(\theta^T x^{(i)})\right)x_j^{(i)}$$

$$\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) = h_\theta(x^{(i)})\left(1 - h_\theta(x^{(i)})\right)x_j^{(i)}$$

Then, the partial derivative of $J_1(\theta)$ can be written as:

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ \frac{y^{(i)}}{h_\theta(x^{(i)})} - \frac{1 - y^{(i)}}{1 - h_\theta(x^{(i)})} \right]\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)})$$

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\left(1 - h_\theta(x^{(i)})\right) - \left(1 - y^{(i)}\right)h_\theta(x^{(i)}) \right]x_j^{(i)}$$

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} - y^{(i)}h_\theta(x^{(i)}) - h_\theta(x^{(i)}) + y^{(i)}h_\theta(x^{(i)}) \right]x_j^{(i)}$$

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} - h_\theta(x^{(i)}) \right]x_j^{(i)}$$

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[ h_\theta(x^{(i)}) - y^{(i)} \right]x_j^{(i)}$$
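The final summation can be checked numerically. The sketch below (NumPy, hypothetical helper names) computes $\frac{\partial}{\partial\theta_j} J_1(\theta)$ as an explicit loop over the training examples and compares it against a central finite difference of the cost:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_J1(theta, X, y):
    m = X.shape[0]
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1.0 - y) @ np.log(1.0 - h)) / m

def partial_J1(theta, X, y, j):
    # (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i), written as an explicit loop.
    m = X.shape[0]
    total = 0.0
    for i in range(m):
        total += (sigmoid(X[i] @ theta) - y[i]) * X[i, j]
    return total / m

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.3]])
y = np.array([1.0, 0.0, 1.0])
theta = np.array([0.1, -0.3])

j, eps = 1, 1e-6
e_j = np.zeros_like(theta)
e_j[j] = eps
numeric = (cost_J1(theta + e_j, X, y) - cost_J1(theta - e_j, X, y)) / (2 * eps)
print(partial_J1(theta, X, y, j), numeric)  # the two values should agree closely
```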


    The expression above in vector notation is:

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = \frac{1}{m}
\begin{bmatrix} x_j^{(1)} & x_j^{(2)} & \cdots & x_j^{(i)} & \cdots & x_j^{(m)} \end{bmatrix}_{1 \times m}
\begin{bmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ h_\theta(x^{(2)}) - y^{(2)} \\ \vdots \\ h_\theta(x^{(i)}) - y^{(i)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{bmatrix}_{m \times 1}$$

$$\frac{\partial}{\partial\theta_j} J_1(\theta) = \frac{1}{m}\,x_j^T(h - y)$$

$$\nabla_\theta J_1(\theta) = \frac{1}{m}\,X^T(h - y)$$
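In code, this matrix form of the gradient is essentially a one-liner; a small NumPy sketch under the same illustrative names (grad_J1 is a placeholder name):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_J1(theta, X, y):
    # grad J1(theta) = (1/m) * X^T (h - y), with h = g(X theta).
    m = X.shape[0]
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / m

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.3]])
y = np.array([1.0, 0.0, 1.0])
print(grad_J1(np.array([0.1, -0.3]), X, y))  # one partial derivative per component of theta
```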

The vector notation for the gradient descent rule is as follows:

$$\theta := \theta - \frac{\alpha}{m}\,X^T(h - y)$$
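Putting this update rule in a loop gives a bare-bones batch gradient descent; the learning rate alpha, the iteration count, and the data below are arbitrary illustrative choices, not values from the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, n_iters=5000):
    # Repeatedly apply theta := theta - (alpha/m) * X^T (h - y).
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)
        theta -= (alpha / m) * (X.T @ (h - y))
    return theta

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.3]])
y = np.array([1.0, 0.0, 1.0])
theta = gradient_descent(X, y)
print(theta, sigmoid(X @ theta))  # fitted parameters and predicted probabilities
```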

2.1.3. Gradient Ascent for maximizing $J_2(\theta)$.

$$\theta := \theta + \alpha\,\nabla_\theta J_2(\theta)$$

$$\nabla_\theta J_2(\theta) = \;?$$

$$\frac{\partial}{\partial\theta_j} J_2(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ h_\theta(x^{(i)}) - y^{(i)} \right]x_j^{(i)}$$

$$\frac{\partial}{\partial\theta_j} J_2(\theta) = -\frac{1}{m}\,x_j^T(h - y)$$

$$\nabla_\theta J_2(\theta) = -\frac{1}{m}\,X^T(h - y)$$


The vector notation for the gradient ascent rule is as follows:

$$\theta := \theta - \frac{\alpha}{m}\,X^T(h - y)$$
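Since $\nabla_\theta J_2(\theta) = -\nabla_\theta J_1(\theta)$, gradient ascent on $J_2(\theta)$ produces exactly the same parameter updates as gradient descent on $J_1(\theta)$; a tiny numerical check (illustrative NumPy names again):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.3]])
y = np.array([1.0, 0.0, 1.0])
theta = np.array([0.1, -0.3])
alpha, m = 0.1, X.shape[0]

h = sigmoid(X @ theta)
grad_J1 = X.T @ (h - y) / m   # gradient of J1
grad_J2 = -grad_J1            # gradient of J2 = -J1

descent_step = theta - alpha * grad_J1  # gradient descent on J1
ascent_step = theta + alpha * grad_J2   # gradient ascent on J2
print(np.allclose(descent_step, ascent_step))  # True: the two rules coincide
```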