Rakesh Gupta¹, Chris Sneed¹, Vipul Tyagi¹
¹College of Computing and Technology, Lipscomb University, Nashville, TN, USA
Predicting Online Purchases Using Conversion Prediction Modeling
Executive Summary
• Homesite Group Inc. sponsored a Kaggle* competition to understand how it could better predict what price will entice its quote seekers to purchase a home insurance policy.
• The outcome of this research will be important to the field of retail sales, with special importance to online sales.
• The benefit of this implementation for Homesite is more sales from its leads through effective product pricing.
• In this presentation, our team will demonstrate the process we followed to create the model and our prediction results.
*https://www.kaggle.com/c/homesite-quote-conversion
*U.S. Census Bureau News. Quarterly Retail E-Commerce Sales for 1st Quarter 2016. (May, 2016).
[Diagram: Predicting Online Purchases – A Comparison of Machine Learning Approaches]
• Sales Lead Articles History: Predictive Models; Sales and Lead Cycle Research; Sales Pricing Models
• Classification Algorithms: Naïve Bayes; Neural Networks; Binary Logistic Regression; AdaBoost; Weighted KNN; Gradient Boosting; Decision Trees (CART, C5.0, CHAID); Support Vector Machines
• Patents: Sales Lead Prioritization; Lead Conversion; Dynamic Pricing; Sales Lead Conversion
Data Source Analysis
• Data from Homesite was relatively clean to begin with
• The dataset had 299 predictor variables and one target variable: the “QuoteConversion” flag
  – Target variable has the values 0 or 1
• The data comprised a train dataset of 260K records and a test dataset of 173K records
• During analysis, we removed the variable “QuoteDate” and the following variables:
Summary Statistics
Variable Name   GeographicField10A   GeographicField10B   PersonalField84   PropertField29   PropertyField6
Min             -1                   -1                   1                 0                0
1st Quartile    -1                   25                   2                 0                0
Median          -1                   25                   2                 0                0
Mean            -1                   25                   1.99              0                0
3rd Quartile    -1                   25                   2                 0                0
Max             -1                   25                   8                 10               0
NAs: 207020, 334630
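The statistics above suggest why these variables were dropped: they are nearly constant or mostly missing. A minimal sketch of how such columns could be flagged is shown below. The thresholds (`dominance`, `max_missing_share`), the helper names, and the toy column `ExampleField` are illustrative assumptions, not the team's actual procedure.

```python
from collections import Counter

def column_profile(values):
    """Return (share of the most common non-missing value, count of missing entries)."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    if not present:
        return 1.0, missing
    top_count = Counter(present).most_common(1)[0][1]
    return top_count / len(present), missing

def columns_to_drop(table, dominance=0.99, max_missing_share=0.4):
    """table maps column name -> list of values (None marks a missing entry).

    Both thresholds are assumed values chosen for illustration.
    """
    drop = []
    for name, values in table.items():
        top_share, missing = column_profile(values)
        if top_share >= dominance or missing / len(values) > max_missing_share:
            drop.append(name)
    return drop

# Toy data only; the real columns have hundreds of thousands of rows.
toy = {
    "GeographicField10A": [-1, -1, -1, -1, -1],   # constant
    "PersonalField84": [2, None, None, None, 1],  # mostly missing
    "ExampleField": [3, 7, 1, 4, 9],              # hypothetical informative column
}
flagged = columns_to_drop(toy)
print(flagged)
```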
Data Cleansing & Preparation
• Categorical variables converted to numeric
  – 27 variables converted
• 293 predictor variables in the full training set
• Multiple split ratios of train/test
  – 90/10
  – 80/20
  – 67/33
• Randomized sample
• Multiple iterations
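The preparation steps above can be sketched in plain Python as follows. The integer-coding scheme, the function names, and the seed are assumptions made for illustration; the slides do not say which encoding or sampling routine was used.

```python
import random

def encode_categorical(values):
    """Map each distinct category to an integer code (sorted for reproducibility)."""
    codes = {category: index for index, category in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

def train_test_split(rows, train_share, seed):
    """Shuffle row indices with a fixed seed, then cut at train_share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = round(len(rows) * train_share)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test

encoded, mapping = encode_categorical(["B", "A", "C", "A"])

rows = list(range(100))  # stand-in for the quote records
for share in (0.90, 0.80, 0.67):
    train, test = train_test_split(rows, share, seed=42)
    print(share, len(train), len(test))
```

Repeating the split with different seeds gives the "multiple iterations" mentioned above.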
Classifications & Platforms
• R – open source statistical tool
  – Naïve Bayes
  – Logistic Regression
  – Boosting
• Python – open source programming platform
  – Naïve Bayes
  – kNN
  – Logistic Regression
Naïve Bayes*
• Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values.
• All naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.
• The method of maximum likelihood is applied for parameter estimation for naive Bayes models.
• Despite the naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations.
• An advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification.
• Our team used Gaussian Naïve Bayes because it suits continuous data
*Naïve Bayes classifier. (n.d.). In Wikipedia. Retrieved from https://en.wikipedia.org/wiki/Naive_Bayes_classifier
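To make the Gaussian variant concrete, here is a minimal from-scratch sketch: it estimates a prior plus a per-feature mean and variance for each class by maximum likelihood, then picks the class with the highest log posterior. The function names and toy data are assumptions for illustration; this is not the team's code, which presumably relied on a library implementation.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate per-class priors and per-feature mean/variance (maximum likelihood)."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor on the variance avoids division by zero for constant features.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior, treating features as independent."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for v, m, var in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(model, key=lambda label: log_posterior(*model[label]))

# Toy data: two well-separated classes with two continuous features.
X = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y = [0, 0, 0, 1, 1, 1]
nb_model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(nb_model, [1.1, 2.0]), predict_gaussian_nb(nb_model, [3.1, 4.0]))
```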
Logistic Regression*
• Binary logistic regression is used because our target variable is 0 or 1
• Predicts the probability of the dependent variable
*Logistic Regression. (n.d.). In Wikipedia. Retrieved from https://en.wikipedia.org/wiki/Logistic_regression
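As a rough illustration of how binary logistic regression produces a probability for a 0/1 target, the sketch below fits weights by batch gradient descent on the log loss and passes the linear score through the sigmoid function. The learning rate, epoch count, function names, and toy data are assumptions; a library fit would normally use a more robust optimizer.

```python
import math

def sigmoid(z):
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, learning_rate=0.1, epochs=2000):
    """Batch gradient descent on the average log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, target in zip(X, y):
            score = bias + sum(w * v for w, v in zip(weights, row))
            error = sigmoid(score) - target
            for j, v in enumerate(row):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def predict_probability(weights, bias, row):
    """Estimated probability that the target equals 1 for this row."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, row)))

# Toy data: one feature, target flips from 0 to 1 as the feature grows.
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = fit_logistic(X, y)
print(predict_probability(weights, bias, [0.5]), predict_probability(weights, bias, [4.0]))
```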
kNN*
• An object is classified by a majority vote of its neighbors, assigning it to the class most common among its k nearest neighbors
• With distance weighting, the nearer neighbors contribute more to the vote than the distant ones
• Sensitive to the local structure of the data
*k-nearest neighbors algorithm. (n.d.). In Wikipedia. Retrieved from https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
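The difference between a plain majority vote and a distance-weighted vote can be sketched as follows. The inverse-distance weight `1 / distance`, the function name, and the toy points are assumptions chosen for illustration; other weighting schemes exist.

```python
import math
from collections import defaultdict

def knn_predict(train_X, train_y, query, k=3, weighted=True):
    """Classify query by a vote among its k nearest training points.

    With weighted=True each neighbor votes with weight 1/distance,
    so nearer neighbors count for more than distant ones.
    """
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    votes = defaultdict(float)
    for distance, label in distances[:k]:
        votes[label] += 1.0 / (distance + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

# Toy data: one class-0 point close to the query, two class-1 points farther away.
train_X = [[1.0], [3.0], [3.2]]
train_y = [0, 1, 1]
query = [1.5]

plain_vote = knn_predict(train_X, train_y, query, k=3, weighted=False)
weighted_vote = knn_predict(train_X, train_y, query, k=3, weighted=True)
print(plain_vote, weighted_vote)
```

In this toy case the plain vote follows the two distant class-1 points, while the weighted vote follows the single nearby class-0 point, which is the sensitivity to local structure noted above.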
Boosting*
• Boosting is a general method for improving the accuracy of any given learning algorithm
• Works by combining rough, less-than-accurate rules of thumb
• Produces a classifier with a low generalization error
• Increases weights on incorrectly classified examples, forcing the base learner to focus its attention on them
*Schapire, Robert E. and Freund, Yoav. Boosting: Foundations and Algorithms. Massachusetts Institute of Technology, Cambridge, MA. 2012
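The reweighting idea can be illustrated with a small AdaBoost sketch that uses one-feature threshold rules ("decision stumps") as the rough rules of thumb. The slides do not state which boosting variant or base learner the team used in R, so AdaBoost with stumps, the function names, and the toy data are all assumptions.

```python
import math

def best_stump(X, y, weights):
    """Find the threshold and polarity with the lowest weighted error on one feature."""
    best = None
    for threshold in sorted(set(X)):
        for polarity in (1, -1):
            predictions = [polarity if x >= threshold else -polarity for x in X]
            error = sum(w for w, p, t in zip(weights, predictions, y) if p != t)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(X, y, rounds=3):
    """Train an ensemble of weighted stumps; labels must be +1 or -1."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(X, y, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Raise weights on misclassified examples, lower them on correct ones.
        weights = [
            w * math.exp(-alpha * t * (polarity if x >= threshold else -polarity))
            for w, x, t in zip(weights, X, y)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict_boosted(ensemble, x):
    """Sign of the alpha-weighted sum of stump votes."""
    score = sum(a * (p if x >= th else -p) for a, th, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single threshold can separate.
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(X, y, rounds=3)
print([predict_boosted(ensemble, x) for x in X])
```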
Trials & Tribulations
Mahout
• Neural Networks?
• CSV Vector?
RapidMiner
• Output of model
• Learning curve
SVM
• Complicated to fit model
Multicollinearity Analysis
• VIF Functions
• Corrgrams*
*Package ‘corrgram’ Retrieved from https://cran.r-project.org/web/packages/corrgram/corrgram.pdf
Correlation Analysis
*Package ‘corrgram’ Retrieved from https://cran.r-project.org/web/packages/corrgram/corrgram.pdf
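A corrgram visualizes pairwise correlations, and the variance inflation factor (VIF) turns them into a multicollinearity score. The sketch below computes Pearson correlations, lists highly correlated pairs, and evaluates VIF = 1 / (1 - R²) for the special case of exactly two predictors, where R² equals the squared correlation. The 0.9 threshold, the function names, and the toy columns are assumptions; the team worked in R, and the general VIF requires regressing each predictor on all the others.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlated_pairs(columns, threshold=0.9):
    """List column pairs whose absolute correlation meets the (assumed) threshold."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) >= threshold:
                pairs.append((first, second, round(r, 3)))
    return pairs

def vif_two_predictors(a, b):
    """VIF = 1 / (1 - R^2); with exactly two predictors, R^2 equals r^2."""
    r = pearson(a, b)
    return 1.0 / (1.0 - r * r)

# Hypothetical columns for illustration only.
columns = {
    "field_a": [1, 2, 3, 4, 5],
    "field_b": [2, 4, 6, 8, 11],   # nearly a multiple of field_a
    "field_c": [5, 1, 4, 2, 3],
}
pairs = correlated_pairs(columns)
print(pairs)
print(vif_two_predictors(columns["field_a"], columns["field_c"]))
```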
Results - Accuracy Matrices
“No Models are perfect, but some are better than others…”
Technology   Split Ratio   Naïve Bayes   KNN      Logistic Regression   HS Test File 0’s   HS Test File 1’s
Python       90/10         81%           78.34%   81.33%                168422             5414
Python       80/20         -             78.47%   81.15%                165859             7977
Python       67/33         -             78.64%   81.13%                165870             7966
R, split ratio 80/20: 71%
R, split ratio 80/20: HS Test File 0’s = 124,544; 1’s = 49,292
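The two kinds of figures reported above, hold-out accuracy and the count of predicted 0’s and 1’s, can be computed as in the sketch below. The labels are made-up toy values, not Homesite results, and the function names are assumptions.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Share of records where the predicted label matches the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def confusion_counts(actual, predicted):
    """Count each (actual, predicted) combination."""
    return Counter(zip(actual, predicted))

# Toy labels for illustration only.
actual = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]
predicted = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0]

score = accuracy(actual, predicted)
predicted_counts = Counter(predicted)
matrix = confusion_counts(actual, predicted)
print(score, predicted_counts[0], predicted_counts[1])
```

On an unlabeled test file only the predicted counts are available, which is why the table lists 0’s and 1’s rather than an accuracy for the HS test file.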
Conclusion & Discussion
• Boosting helped identify the 6 variables that provided the most value
• We know we can predict a sale from a lead about 80% of the time given Homesite’s data set
• We reduced the number of predictor variables from 292 to 6!
• This allows Homesite to focus on these data points.
• Following the 80/20 Pareto principle: from these 6 predictors we get 80% of the benefit without wasting time on the other factors that don’t carry as much weight.
• Simple, fast market strategy that will provide immediate benefits in terms of increased sales and revenue for Homesite
Future Works
• Continue work on additional data cleaning to improve accuracy of the model from 81% to 97%
• Investigate the use of the remaining classification models to see if we achieve better results
• Design and build a process to provide real-time prediction as new quotes are sent out by Homesite.
• Complete ANOVA analysis to determine strength of logistic regression model
Questions?