Homework 2 - Rice University (gallen/stat413/hw2_assignment_fall_2017.pdf)

Homework 2

Statistics 413 Fall 2017

Assigned: September 26

Due: October 12

Data Analysis: For this problem, use the training and test sets of the zip code data available from the ESL textbook website. You should combine the training and test sets and then proceed to design your own model validation process as part of this problem.

1. Statistical Learning Process. Your goal is to build the most accurate binary (digits 3 and 8) and multi-class classifier. To do so, you will try out several model families that all have tuning parameters. In the end, you will need to pick one final model and report its misclassification error. Design an analysis pipeline to achieve this.
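One key ingredient of such a pipeline is a held-out test set that is used only once, at the very end, after all model selection is finished. As a minimal sketch (in Python, though you may of course work in R), the split might look like:

```python
import numpy as np

def train_test_split_indices(n, test_frac=0.2, seed=0):
    """Reserve a held-out test set that is touched only once, at the end."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(round(n * test_frac))
    return idx[n_test:], idx[:n_test]  # (train indices, test indices)
```

All tuning-parameter selection (e.g., by cross-validation in part 2) should then happen inside the training indices only.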

2. Cross-Validation.

(a) K-fold CV. Code up (from scratch) a K-fold CV method to select the tuning parameters of regularized logistic regression.

i. Compare the CV results using the binomial deviance error or the misclassification error. Which is better, and why?
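To make the comparison concrete, the two error measures can be sketched as follows (a hedged illustration, not the required solution): the deviance is a smooth function of the predicted probabilities, while misclassification error only depends on which side of the threshold a probability falls.

```python
import numpy as np

def binomial_deviance(y, p_hat, eps=1e-12):
    """Smooth loss: -2 times the average Bernoulli log-likelihood."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -2 * np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def misclassification(y, p_hat, threshold=0.5):
    """Step-function loss: only the side of the threshold matters."""
    return np.mean((p_hat > threshold).astype(int) != y)
```

Because the deviance distinguishes confident from marginal probability estimates, CV curves based on it tend to be smoother than misclassification-based curves, which change in discrete jumps.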

ii. Compare your results to those of the built-in function cv.glmnet.

(b) Repeated CV. Code up (from scratch) a repeated CV procedure where you repeatedly split the data into a training and test set. Are the results of this procedure different from those of K-fold CV?

For this problem, be sure to apply your CV procedures correctly within your analysis pipeline from part 1.
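The overall shape of a from-scratch K-fold CV can be sketched as below (a minimal Python illustration with a simple gradient-descent ridge logistic fit standing in for whatever solver you use; the fitting routine and learning-rate choices here are illustrative assumptions, not the required implementation):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_ridge_logistic(X, y, lam, lr=0.1, n_iter=500):
    """Gradient descent on the ridge-penalized negative log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ beta) - y) / len(y) + lam * beta
        beta -= lr * grad
    return beta

def kfold_cv(X, y, lambdas, k=5, seed=0):
    """Return the lambda minimizing average validation misclassification error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    cv_err = []
    for lam in lambdas:
        errs = []
        for i in range(k):
            val = folds[i]
            tr = np.concatenate([folds[j] for j in range(k) if j != i])
            beta = fit_ridge_logistic(X[tr], y[tr], lam)
            pred = (sigmoid(X[val] @ beta) > 0.5).astype(int)
            errs.append(np.mean(pred != y[val]))
        cv_err.append(np.mean(errs))
    return lambdas[int(np.argmin(cv_err))], cv_err
```

Repeated CV (part 2b) differs only in the inner loop: instead of rotating through K disjoint folds, you repeatedly draw fresh random train/validation splits.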

3. Binary Classification: Classifying between digits 3 and 8.

(a) Compare and contrast the following classifiers:

i. KNN Classifier (you can use built-in functions).
ii. Naive Bayes Classifier.
iii. Logistic regression.
iv. ℓ1 Logistic regression.
v. Ridge Logistic regression.

(b) Which method family and which tuning parameters were selected via your analysis pipeline from part 1? Report an estimate of the misclassification error for your optimal model. Reflect upon your results.
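Although built-in functions are allowed for KNN, it is worth seeing how little machinery the method needs. A minimal sketch for 0/1 labels (the vectorized distance computation here is one of several reasonable implementations):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Majority vote among the k nearest training points (0/1 labels)."""
    # pairwise Euclidean distances, shape (n_test, n_train)
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nn = np.argpartition(d, k, axis=1)[:, :k]  # indices of k nearest neighbors
    return (y_train[nn].mean(axis=1) > 0.5).astype(int)
```

Note that k is itself a tuning parameter that belongs inside your CV procedure from part 2.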

4. Multi-class Classification.

(a) Compare and contrast the following methods for multi-class classification:

• KNN Classifier (you can use built-in functions).
• Naive Bayes Classifier.
• Multinomial regression.
• ℓ1 Multinomial regression.
• Group-Lasso Multinomial regression.


• Ridge Multinomial regression.

(b) Which method family and which tuning parameters were selected via your analysis pipeline from part 1? Report an estimate of the misclassification error for your optimal model. Also, show the confusion matrix of misclassification results for the methods. Reflect upon your results.
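For the confusion matrix, the convention matters: rows index the true digit and columns the predicted digit, so off-diagonal entries show which digit pairs are confused. A minimal sketch:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=10):
    """cm[i, j] counts examples with true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm
```

The diagonal sum divided by the total count recovers the overall accuracy, so the matrix is a strict refinement of the misclassification error you report.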

Conceptual Problems:

1. A business analyst is trying to predict market demand for a product over the next six months. He has 90 features of interest measured from 275 stores and decides to use the elastic net for his prediction. To select the optimal regularization parameters, he uses 5-fold cross-validation. As his boss wants an estimate of the prediction error, he runs 5-fold cross-validation again. For each fold, he fits the elastic net with the previously selected regularization parameter value to four-fifths of the data and uses the one-fifth left out to estimate the prediction error. He averages the prediction error over each of the five folds and reports this to his boss. Is this an unbiased estimate of the prediction error? If so, why? If not, why not, and how would you alter the procedure to obtain an unbiased estimate?
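As a reference point for thinking about this problem, the general "nested" cross-validation structure, in which each outer fold's error is computed with a tuning parameter chosen without using that fold's data, can be sketched as follows. (The fit and predict arguments here are hypothetical placeholders for any model-fitting interface; this is an illustration of the structure, not the answer to the question.)

```python
import numpy as np

def _inner_cv_pick(X, y, lambdas, fit, predict, k, rng):
    """Choose a tuning parameter using only the data passed in."""
    folds = np.array_split(rng.permutation(len(y)), k)
    mean_err = []
    for lam in lambdas:
        errs = []
        for i in range(k):
            val = folds[i]
            tr = np.concatenate([folds[j] for j in range(k) if j != i])
            errs.append(np.mean(predict(fit(X[tr], y[tr], lam), X[val]) != y[val]))
        mean_err.append(np.mean(errs))
    return lambdas[int(np.argmin(mean_err))]

def nested_cv_error(X, y, lambdas, fit, predict, k_outer=5, k_inner=5, seed=0):
    """Each outer fold is scored with a lambda chosen without its own data."""
    rng = np.random.default_rng(seed)
    outer = np.array_split(rng.permutation(len(y)), k_outer)
    errs = []
    for i in range(k_outer):
        test = outer[i]
        tr = np.concatenate([outer[j] for j in range(k_outer) if j != i])
        lam = _inner_cv_pick(X[tr], y[tr], lambdas, fit, predict, k_inner, rng)
        errs.append(np.mean(predict(fit(X[tr], y[tr], lam), X[test]) != y[test]))
    return float(np.mean(errs))
```

Contrast this structure with the analyst's procedure when answering the question.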

2. A marketer is trying to predict a yes or no response to a survey question based on demographic and social networking data. His training set consists of 74 measured predictors for 826 individuals, of which 62 answered yes to the survey question. The marketer fits a regularized logistic regression. The training error for the model is good. When cross-validation is used to assess the misclassification rate, the marketer notices that almost all individuals are classified as no and the CV error is much higher than the training error.

(a) What could be causing this?

(b) If the marketer cares about predicting the “yes” votes much more than the “no” votes, what would you recommend?

(c) Design an analysis pipeline appropriate for this situation.
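One generic tool worth knowing when working through this problem is observation weighting: the loss contribution of one class can be scaled relative to the other (moving the decision threshold is another option). A minimal gradient-descent sketch (the weighting scheme and step sizes here are illustrative assumptions, not a prescribed answer):

```python
import numpy as np

def weighted_logistic_gd(X, y, pos_weight=1.0, lam=0.0, lr=0.1, n_iter=500):
    """Ridge logistic regression whose loss up-weights the positive class."""
    w = np.where(y == 1, pos_weight, 1.0)  # per-observation loss weights
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p_hat = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (w * (p_hat - y)) / len(y) + lam * beta
        beta -= lr * grad
    return beta
```

Whatever weighting or thresholding you choose, it is itself a tuning decision and should live inside the validation pipeline you design in part (c), not be picked after looking at the test set.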
