
STANFORD UNIVERSITY, CS229 - MACHINE LEARNING

Don’t Get Kicked - Machine Learning Predictions for Car Buying

Albert Ho, Robert Romano, Xin Alice Wu

December 14, 2012

1 Introduction

When you go to an auto dealership with the intent to buy a used car, you want a good selection to choose from and you want to be able to trust the condition of the car that you buy. Auto dealerships purchase many of their used cars through auto auctions with the same goals that you have: they want to buy as many cars as they can in the best condition possible. The problem that these dealerships often face is the risk of buying used cars that have serious issues, preventing them from being sold to customers. These bad purchases are called "kicks", and they can be hard to spot for a variety of reasons. Many kicked cars are purchased due to tampered odometers or mechanical issues that could not be predicted ahead of time. For these reasons, car dealerships can benefit greatly from the predictive powers of machine learning. If there is a way to determine a priori whether a car would be kicked, car dealerships can not only save themselves money, but also provide their customers with the best inventory selection possible.

The following paper is split into five main sections describing our approach to solving this problem: Initial Data Preprocessing, Early Algorithm Selection, Data Normalization and Balancing, Performance Evaluation, and Boosting. First we identified the key characteristics of our data and formed strategies for preprocessing. Next, we ran several simple machine learning algorithms. This led us to update our data processing strategy and determine a better way to evaluate and compare different learning algorithms. Finally, we implemented boosting and tailored our final algorithm selection based on initial successes.

2 Initial Data Preprocessing

We obtained our data set from the Kaggle.com challenge "Don't Get Kicked" hosted by Carvana. The data set contained 32 unique features and 73,041 samples, each labeled 0 for a good car purchase or 1 for a "kick". Some key features included odometer readings, selling prices, vehicle age, and vehicle model. One thing that we immediately noticed was that good cars were heavily overrepresented in the data set, representing 87.7% of samples. The consequences of this became more apparent once we began comparing machine learning algorithms across different metrics.

2.1 Word Bins

Our first major challenge was the preprocessing of the data. For data such as the name of the vehicle's model, manufacturer, and color, we had to assign unique identifiers to specific strings in the feature space. This was straightforward for a feature like transmission, since we could assign 0 for Auto and 1 for Manual. The process became more involved with features that take many values, such as the car submodel. We decided that even though there were many different submodels, categorizing them with unique identifiers rather than grouping them was the more conservative option.
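To make this encoding concrete, here is an illustrative sketch (not the authors' code; the column names and toy frame are hypothetical) that assigns a unique integer identifier to every distinct string in a categorical column:

```python
import pandas as pd

def encode_categoricals(df, columns):
    """Assign a unique integer identifier to every distinct string in each column."""
    out = df.copy()
    for col in columns:
        # pd.factorize returns (codes, uniques); codes run 0..K-1, with -1 for missing values
        codes, _ = pd.factorize(out[col])
        out[col] = codes
    return out

# Toy example in the spirit of the Carvana fields
df = pd.DataFrame({
    "Transmission": ["AUTO", "MANUAL", "AUTO"],
    "SubModel": ["4D SEDAN LS", "2D COUPE", "4D SEDAN LS"],
})
print(encode_categoricals(df, ["Transmission", "SubModel"]))
```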

2.2 Missing Features

Some of the samples had missing features. We had the option of throwing out such samples completely, but we believed that would be a waste. Instead, we implemented the following rules: if the feature was represented by a continuous value, we replaced the missing value with the average of that feature over the other samples; if the feature was represented by a discrete value, we created a new value specifically to identify missing data.
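A minimal sketch of these two rules, assuming the data sits in a pandas DataFrame where continuous features are numeric columns and discrete features are string columns:

```python
import pandas as pd

def fill_missing(df):
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # continuous feature: replace missing values with the column average
            out[col] = out[col].fillna(out[col].mean())
        else:
            # discrete feature: introduce an explicit category for missing data
            out[col] = out[col].fillna("MISSING")
    return out
```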

2.3 Data Visualization

Before running any algorithms, we visualized the data with plots to gain some intuition about the features. The training data was separated into good and bad datasets and compared, looking for trends. Histograms were plotted over each feature with the frequency normalized so that good and bad cars were equally represented. This allowed comparison of the relative frequency over a feature. An example is Figure 1a, showing that bad cars were generally older. To get an idea of how discriminating a feature was, the ratio of the relative frequency of bad to good cars was plotted. Figure 1b shows that Current Auction Average Price was a strong feature; however, this needed to be taken with a grain of salt, because the regions where the features were most discriminating were generally small tail regions that applied to a very small subset of cars.
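The sketch below reproduces this ratio plot for one feature; the column names (IsBadBuy, VehicleAge) follow the Kaggle data dictionary, and the bin count is an arbitrary assumption:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_bad_to_good_ratio(df, feature, label_col="IsBadBuy", bins=30):
    good = df.loc[df[label_col] == 0, feature].dropna()
    bad = df.loc[df[label_col] == 1, feature].dropna()
    edges = np.histogram_bin_edges(np.concatenate([good, bad]), bins=bins)
    # density=True normalizes each histogram so good and bad cars are equally represented
    good_freq, _ = np.histogram(good, bins=edges, density=True)
    bad_freq, _ = np.histogram(bad, bins=edges, density=True)
    ratio = bad_freq / np.where(good_freq > 0, good_freq, np.nan)
    centers = 0.5 * (edges[:-1] + edges[1:])
    plt.plot(centers, ratio)
    plt.xlabel(feature)
    plt.ylabel("relative frequency ratio (bad / good)")
    plt.show()

# plot_bad_to_good_ratio(train_df, "VehicleAge")
```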

3 Early Algorithm Selection

With the data parsed and some initial insights to guide us, we applied some basic machine learning algorithms to identify where we needed improvement and which strategy would be most effective. At this point, we chose generalization error as the metric to evaluate our algorithms' performance.

3.1 Support Vector Machine

First, we tested our data with an SVM. We used libLINEAR v. 1.92 and the method of cross-validation, training on 70% of our data set and testing on the remaining 30%. Initial runs yielded about 12% generalization error, which on first glance was very good.

Figure 1: Histogram plots depicting (a) the ratio of scaled vehicle age and (b) the ratio of current auction average price for bad versus good cars.
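As a rough illustration of the SVM experiment above (the authors used the libLINEAR v. 1.92 tool directly; scikit-learn's LinearSVC, which wraps libLINEAR, stands in here, and X, y denote the preprocessed feature matrix and labels):

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# 70/30 hold-out split of the preprocessed data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

clf = LinearSVC(C=1.0)
clf.fit(X_train, y_train)
gen_error = 1.0 - clf.score(X_test, y_test)  # fraction of misclassified test samples
print(f"generalization error: {gen_error:.3f}")
```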

3.2 Logistic Regression

Since the label we were trying to predict was binary, we decided to try a logistic regression model as a first pass. Logistic regression via Newton's method was implemented in MATLAB with the same cross-validation scheme as for the SVM. We found that the algorithm converged after 7 iterations, yielding a generalization error of about 12%.
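For concreteness, here is a NumPy sketch of Newton's method for logistic regression; the original implementation was in MATLAB, so this version is only an assumption-labeled illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, n_iter=7):
    """X: (m, n) design matrix including an intercept column; y: (m,) labels in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        h = sigmoid(X @ theta)
        grad = X.T @ (h - y)                    # gradient of the negative log-likelihood
        H = X.T @ (X * (h * (1 - h))[:, None])  # Hessian
        theta -= np.linalg.solve(H, grad)       # Newton update
    return theta
```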

3.3 Observations

Using generalization error as a metric, both logistic regression and SVM seemed to have yielded promising results. Upon further investigation, however, these runs would nearly always predict the majority class, i.e. a good car prediction for every testing sample. This was where we started to question the use of generalization error as a performance metric in favor of performance metrics that take into account false positives and false negatives. We also conducted a literature review in hopes of finding alternative algorithms more suitable for skewed data sets.

4 Data Normalization and Balancing

4.1 Feature Normalization

After evaluating the performance of our early attempts, we made several changes to the data preprocessing procedure in hopes of achieving better results. Through our literature search, we found that data normalization increases the performance of many classification algorithms [1]. As a result, we normalized our numeric features to the range 0 to 1.
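A minimal sketch of this step with scikit-learn's MinMaxScaler (the exact procedure the authors used is not specified beyond the 0-to-1 range); the scaler is fit on the training split only so the test split is transformed consistently:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
X_train_norm = scaler.fit_transform(X_train)  # learn per-feature min/max on training data
X_test_norm = scaler.transform(X_test)        # apply the same scaling to the held-out data
```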

4.2 Data Balancing

In addition to data normalization, we also discovered that "up-sampling" the data from the minority class is an effective way of addressing the class imbalance problem ([2], [3], [4]). To do this, we again split our data in a 70/30 cross-validation scheme. From the split intended for training, we created a balanced training data set by oversampling the bad cars. Both balanced and unbalanced data sets were used for the algorithms we tested from this point forward to observe the effects of artificial data balancing.
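A sketch of this up-sampling step, assuming NumPy arrays X_train, y_train from the 70/30 split, with the kick class (label 1) resampled with replacement until both classes are equally represented:

```python
import numpy as np

def balance_by_oversampling(X_train, y_train, seed=0):
    """Oversample the minority class (kicks, label 1) with replacement."""
    rng = np.random.default_rng(seed)
    minority = np.where(y_train == 1)[0]
    majority = np.where(y_train == 0)[0]
    # draw extra minority indices until both classes have equal counts
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    rng.shuffle(idx)
    return X_train[idx], y_train[idx]
```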

5 Performance Evaluation

As mentioned earlier, we found that using generalization error alone as a performance metric was misleading due to the bias of our data towards good cars. A prediction of all good cars, for example, would still yield only 12.3% generalization error. In the context of our problem, it is more relevant to evaluate an algorithm's performance based on precision and recall

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN} \tag{1}$$

rather than predictive accuracy, since the numbers of false positives (FP) and false negatives (FN) predicted by an algorithm are more directly related to profit and opportunity cost, which is ultimately what car dealers care about. In general, you want a balance between precision and recall, so we used AUC and F1, which are derived from FP and FN, to find that balance.

Through our literature search, we found that when studying problems with imbalanced data, using the classifiers produced by standard machine learning algorithms without adjusting the output threshold may cause poor performance [3]. In this respect, AUC is a good metric, since it takes into account sensitivity (recall) and specificity over the entire range of possible output threshold values; it is therefore a good indicator of one classifier's ability to predict correctly relative to another. In addition, we also used the F1 score as a performance metric to account for the inverse relationship between precision and recall [5]. We define F1 as the harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{2}$$

If an algorithm merely trades precision for recall, its F1 score changes little; in this way we can identify a superior algorithm as one that increases both precision and recall.
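For reference, these metrics can be computed as in the sketch below, where y_test, y_pred, and y_score are hypothetical arrays of true labels, hard predictions, and classifier scores respectively:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

precision = precision_score(y_test, y_pred)  # TP / (TP + FP), Equation (1)
recall = recall_score(y_test, y_pred)        # TP / (TP + FN), Equation (1)
f1 = f1_score(y_test, y_pred)                # harmonic mean of precision and recall, Equation (2)
auc = roc_auc_score(y_test, y_score)         # area under the ROC curve over all output thresholds
```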

6 Boosting

After applying data normalization and balancing, we returned to our initial approaches using SVM and logistic regression. We found that by using these algorithms with normalized and balanced data sets, we were able to achieve better AUC and F1 scores, and therefore better results than before. We also tried tuning the C parameter in libLINEAR, to little effect. From our own research and discussion with the TAs, we found that boosting might be a promising approach for our learning problem. The idea behind boosting is to combine many weak learners into a strong learner ([6], [7]). To implement boosting, along with a slew of other learning algorithms, we used Weka (Waikato Environment for Knowledge Analysis) v. 3.7.7.
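As an assumption-labeled analogue of the Weka workflow (not the AdaBoostM1/LogitBoost configurations reported below), the sketch shows the weak-learner idea with scikit-learn's AdaBoostClassifier over decision stumps:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# weak learner: a decision stump (depth-1 tree); boosting combines many of them
booster = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100)
booster.fit(X_train, y_train)
scores = booster.predict_proba(X_test)[:, 1]  # per-sample scores, e.g. for AUC evaluation
```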

Weka made it easy to try many different learning algorithms quickly. Due to the nature of our data, we were very interested in comparing the performance of traditional classification algorithms with meta-classifiers such as boosting and ensemble learning. However, Weka is also very memory intensive: the program could not run logistic regression without crashing, even with 5.0 GB of memory allocated. As a result, logistic regression was still implemented in MATLAB, while all other algorithms were implemented in Weka.

7 Results

We used Weka to implement several meta-classifiers, specifically AdaBoostM1, RealAdaBoost, LogitBoost, and ensemble selection. The weak classifiers we used were decision stump, decision table, REPTree, J48, and Naive Bayes. A decision stump is a one-level decision tree. A decision table is a simple majority classifier. REPTree is a fast decision tree learner based on information gain, with reduced-error pruning and backfitting. J48 is an implementation of the C4.5 decision tree, which is based on maximizing information gain.

AdaBoostM1 is a general boosting algorithm for nominal classes. Using decision stump as its weak classifier, it performed reasonably well, with an AUC of 0.724. We tried using more sophisticated classifiers such as J48, random forest, and REPTree; however, they all performed worse. RealAdaBoost is an implementation of AdaBoost that is optimized for binary classification. Using decision stump as its weak classifier, it performed well, with an AUC of 0.744. Similarly, other more sophisticated classifiers did worse, perhaps due to overfitting. LogitBoost using decision stump performed better than AdaBoostM1, with an AUC of 0.746, and LogitBoost using decision table performed slightly better still, with an AUC of 0.758. Because of this we decided to stick with LogitBoost as our boosting algorithm of choice.

Ensemble selection can use any combination of weak classifiers to make a strong classifier, so it is very flexible. One implementation is to additively build a strong classifier by selecting the strongest weak classifier and then adding the next strongest weak classifier one at a time. We chose AUC as the metric for evaluating classifier strength. Because ensemble selection uses a greedy optimization algorithm, it is prone to overfitting. To overcome this, strategies such as model bagging, selection with replacement, and sort initialization were used; ten model bags were used along with sort initialization. The ensemble selection algorithm with the most promise was one that incorporated many different classifiers, including Naive Bayes, J48, and REPTree. This resulted in an AUC of 0.752 along with an F1 of 0.279, just shy of LogitBoost.
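A rough sketch of greedy forward ensemble selection with AUC as the selection metric; model bags and sort initialization are omitted, model_scores holds validation-set scores from already-fitted weak classifiers, and selection is with replacement:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def greedy_ensemble_selection(model_scores, y_val, n_rounds=20):
    """model_scores: dict mapping model name -> validation scores from a fitted classifier."""
    chosen, ensemble = [], np.zeros(len(y_val))
    for _ in range(n_rounds):
        best_name, best_auc = None, -np.inf
        for name, s in model_scores.items():
            # candidate ensemble: running mean of the chosen models' scores plus this model
            candidate = (ensemble * len(chosen) + s) / (len(chosen) + 1)
            auc = roc_auc_score(y_val, candidate)
            if auc > best_auc:
                best_name, best_auc = name, auc
        chosen.append(best_name)
        ensemble = (ensemble * (len(chosen) - 1) + model_scores[best_name]) / len(chosen)
    return chosen, ensemble
```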

Contrary to the literature, we found that balancing the data did not generally improve classifier performance. In fact, classifiers generally performed worse when trained on the balanced data set: while balancing the data reduced the number of false negatives, it also dramatically increased the number of false positives.

8 Discussion

We found through our investigation that LogitBoost was the best at predicting whether or not a car would be a kick. It produced the prediction with the highest AUC value, 0.758, and an F1 of 0.368. The F1 value was not as high as we would have liked, but depending on the relationship between Gross_Profit and Loss in the Total_Profit equation, F1 may not even be a great metric to maximize for the parameter of interest.

$$\begin{aligned}
\text{Total\_Profit} &= TN \cdot \text{Gross\_Profit} + FN \cdot \text{Loss} \\
\text{Opportunity\_Cost} &= FP \cdot \text{Gross\_Profit}
\end{aligned} \tag{3}$$

Total_Profit represents the profit that a car dealership will make if they follow the predictions of an algorithm. All cars that are classified as good and are actually good will make the dealership some Gross_Profit per car. At the same time, all cars that are classified as good but are actually not will cause the dealership to incur some Loss. The Opportunity_Cost represents the Gross_Profit lost from any car classified as bad that actually was not. What these formulas boil down to is a trade-off between false negatives, false positives, and true negatives through Gross_Profit and Loss. If the Loss is higher for the end user, they would tailor the algorithm to produce fewer false negatives, while if the Gross_Profit is higher, they would want fewer false positives.
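A small worked example of Equation (3), with hypothetical per-car Gross_Profit and Loss values and the convention that label 1 (kick) is the positive class:

```python
from sklearn.metrics import confusion_matrix

GROSS_PROFIT = 1000.0  # hypothetical profit per good car purchased
LOSS = -2000.0         # hypothetical loss per kick purchased

# For labels {0: good, 1: kick}, sklearn orders the flattened counts as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
total_profit = tn * GROSS_PROFIT + fn * LOSS  # cars bought: true goods plus missed kicks
opportunity_cost = fp * GROSS_PROFIT          # good cars wrongly rejected
print(total_profit, opportunity_cost)
```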

Of all the procedures and algorithms we used, the most useful were data normalization, boosting, and the use of AUC and F1 as performance metrics.


Table 1: Algorithm comparison: a. Decision Stump, b. Decision Stump 100 Iterations, c. Decision Table,d. J48 Decision Tree, e. Maximize for ROC, f. assortment

9 Future Work

There are several strategies we would pursue in order to further improve prediction performance. One would be to evaluate our algorithms on a separated data set created by removing overlapping data via PCA [8]. The literature suggests that if a data set is overlapped, one can run algorithms on the portion of the data that is not overlapping to get better results. The reason we did not pursue this from the beginning is that doing so would create a high-variance classifier that may overfit the data. Another strategy that we did not get working was RUSBoost, which has been shown to improve performance on imbalanced data sets such as our own [9]. Finally, we would want to use libSVM with a nonlinear kernel, such as a Gaussian kernel, to compare with our other algorithms; due to computational performance limitations, we were unable to implement this method.
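If the computation became feasible, that comparison could look like the sketch below, using scikit-learn's SVC (a libSVM wrapper) with a Gaussian/RBF kernel; the hyperparameters are placeholders, not tuned values:

```python
from sklearn.svm import SVC

rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale", probability=True)
rbf_svm.fit(X_train, y_train)                  # slow on ~73k samples, hence the caveat above
scores = rbf_svm.predict_proba(X_test)[:, 1]   # scores for AUC comparison with other methods
```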

10 Acknowledgements

We would like to thank Professor Andrew Ng and the TAs (especially Andrew Maas, Sonal Gupta, and Chris Lengerich) for all their help on this project, along with Kaggle and CARVANA for providing the data.

References

[1] Graf, A., Borer, S. (2001). Normalization in support vector machines. Pattern Recognition, 277-282.

[2] Menardi, G., Torelli, N. (2010). Training and assessing classification rules with unbalanced data. Working Paper Series.

[3] Provost, F. (2000). Learning with Imbalanced Data Sets 101. Invited paper for the AAAI'2000 Workshop on Imbalanced Data Sets.

[4] Japkowicz, N. (2000). The Class Imbalance Problem: Significance and Strategies. In Proceedings of the 2000 International Conference on Artificial Intelligence (IC-AI'2000): Special Track on Inductive Learning, Las Vegas, Nevada.

[5] Forman, G., Scholz, M. (2009). Apples-to-Apples in Cross-Validation Studies: Pitfalls in Classifier Performance Measurement. ACM SIGKDD Explorations, 12(1), 49-57.

[6] Hastie, T. (2003). Boosting. Retrieved from Stanford University web site: http://www.stanford.edu/~hastie/TALKS/boost.pdf

[7] Friedman, J., Hastie, T., Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2), 337-407.

[8] Das, B., Krishnan, N. C., Cook, D. J. (2012). Handling Imbalanced and Overlapping Classes in Smart Environments Prompting Dataset.

[9] Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., Napolitano, A. (2008, December). RUSBoost: Improving classification performance when training data is skewed. In Pattern Recognition, 2008. ICPR 2008. 19th International Conference on (pp. 1-4). IEEE.
