
Logistic Regression

Huong Huynh & Maria Nazari

Math 285: Classification of Handwritten Digits

Introduction

In machine learning, classification is the process of predicting the categories of a group of new observations using a training set to build a predictive model. Classification falls under supervised learning (labels are provided in the training set). Logistic regression is a commonly used supervised classification method. Whereas regression models normally use a continuous dependent variable, logistic regression is a model where the dependent variable is categorical. This project uses a combination of logistic regression and dimensionality reduction tools (Principal Component Analysis and 2DLDA) to classify the MNIST handwritten digit data set, comprising a training set of 60,000 handwritten digit samples and a test set of 10,000.


Method

Logistic Regression

Binary Class Logistic Regression:

• Y variable is binary (only takes the values of 0 or 1)

• Need a function such that

o f(E(Yi)) = θ0 + θ1x1 + ··· + θdxd

• Need to use a special function f( ) called the logistic function (or logistic transform):

o f(p) = log(p/(1 − p))

o p is used as the argument because p only takes values between 0 and 1.

• When Y is a binary variable, E(Y) = p, where p is the probability that Y takes the value 1.

• The Logistic Regression model is written as follows (a small numerical sketch appears after the notes below):

o log(p/(1 − p)) = θ0 + θ1x1 + ··· + θdxd

• Notes:

o p: probability of “success” (i.e., Y = 1)

o p/(1 − p): odds of “winning”

o log(p/(1 − p)): the logit (a link function)
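As a quick numerical sketch of these quantities (not from the poster; the weights and the feature vector below are made-up values), the relationship between the logit, the odds, and the probability p can be checked directly:

```python
import numpy as np

# Hypothetical fitted weights (theta0 = intercept) and one feature vector x.
theta0 = -1.5
theta = np.array([0.8, -0.3, 2.0])
x = np.array([1.0, 0.5, 0.7])

# Logit: theta0 + theta1*x1 + ... + thetad*xd
logit = theta0 + theta @ x

# Invert the logistic transform log(p / (1 - p)) = logit  =>  p = 1 / (1 + exp(-logit))
p = 1.0 / (1.0 + np.exp(-logit))
odds = p / (1.0 - p)

print(f"logit     = {logit:.4f}")         # theta0 + theta . x
print(f"p         = {p:.4f}")             # probability that Y = 1
print(f"odds      = {odds:.4f}")          # p / (1 - p)
print(f"log(odds) = {np.log(odds):.4f}")  # recovers the logit
```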

Multiclass Logistic Regression:

There are two ways to extend it for multiclass classification:

• Union of binary models

o One-versus-one: construct a LR model for every pair of classes

o One-versus-rest: construct a LR model for each class against the rest of the training set

• Softmax Regression (fixed versus rest), also known as multinomial logistic regression:

o The method fixes one class (say class 1) and fits c − 1 binary logistic models, one for each remaining class against the fixed class:

log[P(Y = 2 | x) / P(Y = 1 | x)] = θ2 · x

log[P(Y = 3 | x) / P(Y = 1 | x)] = θ3 · x

···

log[P(Y = c | x) / P(Y = 1 | x)] = θc · x

o The prediction for a new observation is the class with the largest relative probability, i.e., since P(Y = k | x) ∝ exp(θk · x), the class with the largest θk · x (see the scikit-learn sketch below).
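A minimal scikit-learn sketch of the three multiclass strategies, assuming feature matrices X_train, X_test and digit labels y_train, y_test are already loaded (these names are placeholders, not the authors' code; the data-loading and PCA step is shown in the Feature Selection sketch further below):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Base binary logistic regression classifier.
base = LogisticRegression(max_iter=1000)

models = {
    # One-versus-one: one binary LR model for every pair of classes (45 pairs for 10 digits).
    "one-vs-one": OneVsOneClassifier(base),
    # One-versus-rest: one binary LR model for each class against the rest of the training set.
    "one-vs-rest": OneVsRestClassifier(base),
    # Softmax / multinomial: with the default lbfgs solver, scikit-learn fits the
    # multinomial model directly for multiclass targets.
    "multinomial": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train, y_train)   # X_train, y_train assumed to be preloaded
    print(name, "test error:", round(1.0 - model.score(X_test, y_test), 4))
```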

Feature Selection:

With high-dimensional data, LR commonly overfits. Two methods can be used to address this problem:

• Reducing the dimensionality of the data using dimensionality reduction methods, such as PCA or 2DLDA (a PCA pipeline sketch is given after this list)

• Adding a regularization term to the objective function (the regularized objective is given with the l1 & l2 results below)
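The sketch below illustrates both options under stated assumptions: it loads MNIST via scikit-learn's fetch_openml (the first 60,000 rows are the standard training set, the last 10,000 the test set), applies "PCA 50", and separately fits a penalized logistic regression. 2DLDA is omitted because it is not available in scikit-learn, and the solver/penalty settings are illustrative, not the authors' exact setup.

```python
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Load MNIST (70,000 flattened 28x28 images): first 60,000 = train, last 10,000 = test.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
y = y.astype(int)
X_train, X_test = X[:60000] / 255.0, X[60000:] / 255.0
y_train, y_test = y[:60000], y[60000:]

# Option 1: reduce the 784 pixel features to 50 principal components ("PCA 50"),
# then fit a logistic regression on the reduced features.
clf = make_pipeline(PCA(n_components=50), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("PCA 50 + LR test error:", round(1.0 - clf.score(X_test, y_test), 4))

# Option 2: keep all 784 features but add a penalty term; in scikit-learn the penalty
# strength is controlled through C (the *inverse* regularization strength, a different
# convention from the C that multiplies the penalty in the poster's objective).
reg = LogisticRegression(penalty="l2", C=0.5, max_iter=1000)
reg.fit(X_train, y_train)
print("l2-regularized LR test error:", round(1.0 - reg.score(X_test, y_test), 4))
```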

Result using Binary Logistic Regression

Apply the binary logistic regression classifier to all 45 pairs of digits:

Figure 1: PCA 50 & Binary Logistic Regression Applied to All Digit Pairs

Figure 1 shows the binary logistic regression test errors for all possible pairs of handwritten digits (Xi, Xj), where i ≠ j and the digits range from 0 to 9. The model for pair (3,5) resulted in the highest test error, 0.049, and pairs (5,8) and (7,9) also produced high test errors. High test errors suggest that it is harder to predict the correct label between those digits. We can also see that 0-versus-other and 1-versus-other pairings produced low test errors; pair (1,4) had the lowest test error, 0.0019.
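A sketch of the pairwise experiment behind Figure 1, assuming X_train, y_train, X_test, y_test hold the (PCA-reduced) training and test data as in the earlier sketches; the solver settings are assumptions, not the authors' exact configuration:

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression

pair_errors = {}
for i, j in combinations(range(10), 2):      # all 45 unordered digit pairs
    tr = np.isin(y_train, [i, j])            # keep only the two digits in the pair
    te = np.isin(y_test, [i, j])
    clf = LogisticRegression(max_iter=1000).fit(X_train[tr], y_train[tr])
    pair_errors[(i, j)] = 1.0 - clf.score(X_test[te], y_test[te])

# Hardest and easiest pairs (the poster reports (3,5) and (1,4), respectively).
print("hardest pair:", max(pair_errors, key=pair_errors.get))
print("easiest pair:", min(pair_errors, key=pair_errors.get))
```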

Apply the binary logistic regression classifier to all handwritten digits using the one-versus-all method:

Figure 2: PCA 50 & One-Versus-All Method Applied to All Digits

Figure 2 shows the test error of the one-versus-all method for each digit. Digit 8 resulted in the largest test error, 0.0411, followed by digit 9 with an error of 0.0383. Digit 1 had the lowest test error when using one-versus-all binary logistic regression.
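A sketch of the per-digit experiment behind Figure 2: for each digit d, a binary logistic regression separates d from all other digits and its test error is recorded (again assuming preloaded X_train, y_train, X_test, y_test as in the sketches above):

```python
from sklearn.linear_model import LogisticRegression

for d in range(10):
    # Binary problem: digit d (label 1) versus all other digits (label 0).
    clf = LogisticRegression(max_iter=1000).fit(X_train, (y_train == d).astype(int))
    err = 1.0 - clf.score(X_test, (y_test == d).astype(int))
    print(f"digit {d} vs all: test error = {err:.4f}")
```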

Result When Applying Different Logistic Regression Methods with PCA & 2DLDA

Figure 3: Binary Logistic Regression Methods Combined with PCA 50 and 2DLDA

Figure 3 shows the test errors when using the logistic regression methods combined with PCA 50 and with 2DLDA (which reduces the data dimension to 11 × 11). For PCA 50: the one-versus-rest method had the largest test error (0.0948), the one-versus-one method produced the smallest test error (0.0641), and the multinomial logistic regression method gave 0.0884. For 2DLDA: the one-versus-one method again gave the smallest test error (0.0594), the one-versus-rest method had the highest test error (0.0871), and the multinomial method gave 0.0764. Overall, 2DLDA produced lower test errors.

Conclusion

This project implemented different variations of the logistic regression method of classification on the MNIST handwritten digit dataset. We found that the binary logistic regression method had trouble distinguishing between certain pairs of digits. This could be because of the similar ways the digits are written by the participants.

Different methods of logistic regression were applied to the dataset. To reduce the effects of overfitting, dimensionality reduction tools (PCA and 2DLDA) as well as l1 & l2 regularization were used. 2DLDA generally outperformed PCA 50.

Overall, the logistic regression method performed very well in producing low test errors when classifying the test dataset. The one-versus-one model consistently outperformed the rest of the models examined in this project.

PCA and 2DLDA combined with the one-versus-one method outperformed the l1 and l2 test errors. When using the one-versus-all method, l1 and l2 regularization produced comparably small test errors. The best test error found is attributed to 2DLDA combined with the one-versus-one method.


Result When Applying Logistic Regression with l1 & l2 Regularization

Figure 4: Binary Logistic Regression Combined with l1 & l2 Regularization

Figure 4 uses the following objective function:

min_{θ = (θ0, θ1)}  − Σ_{i=1}^{n} [ yi log p(xi; θ) + (1 − yi) log(1 − p(xi; θ)) ] + C ‖θ‖_p^p

For l1 regularization, p is set to 1, and l2 has a p value of 2. Figure 4 contains the test errors for different values of the regularization parameter (2^C).

l1 regularization has its lowest test error when C=1, and l2’s lowest test error occurs at C=0.5.

As the value of C increases above 1, the test errors for both l1 and l2 continuously increase.
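A sketch of the regularization sweep behind Figure 4, using scikit-learn's one-versus-rest wrapper with l1 and l2 penalties and the saga solver (solver choice is an assumption). Note that scikit-learn's C is the inverse regularization strength, which is not the same convention as the C multiplying the penalty in the objective above, so the mapping of C values here is illustrative only; X_train, y_train, X_test, y_test are assumed to be preloaded as in the earlier sketches.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

for penalty in ("l1", "l2"):
    errors = {}
    for c in 2.0 ** np.arange(-4, 6):    # sweep the penalty weight over powers of 2
        clf = OneVsRestClassifier(
            # saga supports both the l1 and l2 penalties.
            LogisticRegression(penalty=penalty, C=c, solver="saga", max_iter=2000)
        )
        clf.fit(X_train, y_train)
        errors[c] = 1.0 - clf.score(X_test, y_test)
    best = min(errors, key=errors.get)
    print(f"{penalty}: best C = {best}, test error = {errors[best]:.4f}")
```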

Summary of Results

Comparing the Results of the Different Methods Applied to the MNIST Handwritten Dataset

Figure 5: Comparing the Best Test Error for the Dimensionality Reduction Methods Applied

Figure 5 contains the best test errors achieved by the different dimensionality reduction methods analyzed in this project.

The best test errors for l1 and l2 regularizations using the one-versus-all method are relatively close in value (0.0793 vs 0.0799).

Using the one-versus-one method, PCA produced a test error of 0.0641 and 2DLDA resulted in a lower error of 0.0594. Both PCA and 2DLDA outperformed the l1 and l2 regularization methods.

For the MNIST handwritten digit dataset, 2DLDA produced the lowest test error rates and had one of the lowest computational times to generate its test error.



