Prediction of Lead Conversion With Imbalanced Data
liu.diva-portal.org/smash/get/diva2:1566623/FULLTEXT01.pdf
Linköpings universitet
SE-581 83 Linköping
+46 13 28 10 00
www.liu.se

Linköping University | Department of Computer and Information Science
Master's thesis, 30 ECTS | Datateknik
LIU-IDA/STAT-A--21/031--SE
Prediction of Lead Conversion With Imbalanced Data – A method based on Predictive Lead Scoring
Ali Etminan
Copyright
The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.
© Ali Etminan
Abstract

An ongoing challenge for most businesses is to filter out potential customers from their audience. This thesis proposes a method that takes advantage of user data to distinguish potential customers from random visitors to a website. The method is based on the Predictive Lead Scoring method that segments customers based on their likelihood of purchasing a product. Our method, however, aims to predict user conversion, that is, predicting whether a user has the potential to become a customer or not.

Six supervised machine learning models have been used to carry out the classification task. To account for the high imbalance in the input data, multiple resampling methods have been applied to the training data. The combination of classifier and resampling method with the highest average precision score has been selected as the best model.

In addition, this thesis tries to quantify the effect of feature weights by evaluating some feature ranking and weighting schemes. Using these schemes, several sets of weights have been produced and evaluated by training a KNN classifier on the weighted features. The change in average precision relative to the original KNN (without weighting) is used as the reference for measuring the performance of the ranking and weighting schemes.
Acknowledgments
First, I would like to give a special thanks to my supervisor, Joel
Oskarsson, for giving me exceptional feedback throughout the entire
thesis. His support has had a great impact on the quality of this
report.
I would also like to thank my examiner, Ander Grimval, and my
opponent, Mudith Silva, for their constructive comments and
valuable suggestions. I wish to express my gratitude to Samuel
Jenks and Malin Schmidt for providing me with the necessary tools
and information required for this thesis.
Finally, I would like to thank my lovely wife, Romina. Without her unconditional support, I would not have made it through this journey. And my little daughter, Nila, whose presence is the main source of my ambitions.
Contents

1 Introduction
   1.1 Background
   1.2 Objectives

2 Theory
   2.1 Lead Scoring and Predictive Lead Scoring
   2.2 Data Preprocessing
   2.3 Resampling
   2.4 Classifiers
   2.5 Feature Ranking
   2.6 Weighting Schemes
   2.7 Model Selection and Evaluation

3 Method
   3.1 Data Description
   3.2 Preprocessing
   3.3 Classifiers
   3.4 Oversampling
   3.5 Ranking and Weighting
   3.6 Evaluation

4 Results
   4.1 Exploratory Data Analysis
   4.2 Classification with oversampling
   4.3 Feature-weighted Classification

5 Discussion
   5.1 Results
   5.2 Method
   5.3 The work in a wider context

6 Conclusion

7 Appendix
   7.1 Classification with Oversampling
   7.2 Feature-weighted Classification
List of Figures

2.1 Sales Funnel - Source: Duncan and Elkan, 2015 [10]
2.2 Random Oversampling
2.3 Random Oversampling Examples (ROSE)
2.4 Oversampling with SMOTE
2.5 Oversampling with K Means SMOTE
2.6 Oversampling with SVM SMOTE
2.7 Oversampling with Borderline SMOTE 1
2.8 Oversampling with Borderline SMOTE 2
2.9 Comparison of Random oversampling with different variants of SMOTE
2.10 Support Vector Machine with linear decision boundary
2.11 Logistic curve
2.12 A comparison of ROC Curve (left) and Precision-Recall Curve (right)
4.1 Distribution of source and medium of referrals to the website
4.2 Average number of sessions and average bounce rate over the weekdays, separated by social/non-social media users (left); average number of sessions and average bounce rate over 24 hours, separated by social/non-social media users (right)
4.3 Distribution of the scaled average number of converted/not converted users by days of the week (left) and over 24 hours of the day (right)
4.4 Pageviews Per Session and Average Session Duration before removing outliers (top) and after removing outliers (bottom)
4.5 Correlation heatmap for numerical and binary variables in the data
4.6 Logistic Regression with SMOTE Tomek link
4.7 Support Vector Classifier with SVM SMOTE
4.8 Decision Tree with no oversampling
4.9 Random Forest with Random Oversampling
4.10 K Nearest Neighbors without oversampling
4.11 Gradient Boosting without oversampling
4.12 Feature ranks calculated by Permutation Importance method with the top performing Gradient Boosting model as its estimator
4.13 Feature ranks calculated with Pearson Correlation Coefficient method (left) and Fisher Coefficient method (right)
4.14 Absolute values of feature weights calculated by applying NMF scheme to PCC and FC ranks
4.15 Absolute values of feature weights calculated by applying NRF scheme to PCC and FC ranks
4.16 KNN classifier performance using features weighted with PCC and NRF schemes
7.1 Logistic Regression with no oversampling
7.2 Logistic Regression with Random Oversampling
7.3 Logistic Regression with SMOTE
7.4 Logistic Regression with SVM SMOTE
7.5 Logistic Regression with K Means SMOTE
7.6 Logistic Regression with Borderline SMOTE
7.7 Logistic Regression with ADASYN
7.8 Support Vector Classifier with no oversampling
7.9 Support Vector Classifier with Random Oversampling
7.10 Support Vector Classifier with SMOTE
7.11 Support Vector Classifier with SMOTE Tomek Link
7.12 Support Vector Classifier with K Means SMOTE
7.13 Support Vector Classifier with Borderline SMOTE
7.14 Support Vector Classifier with ADASYN
7.15 Decision Tree with Random Oversampling
7.16 Decision Tree with SMOTE
7.17 Decision Tree with SVM SMOTE
7.18 Decision Tree with SMOTE Tomek Link
7.19 Decision Tree with K Means SMOTE
7.20 Decision Tree with Borderline SMOTE
7.21 Decision Tree with ADASYN
7.22 Random Forest with no oversampling
7.23 Random Forest with SMOTE
7.24 Random Forest with SVM SMOTE
7.25 Random Forest with SMOTE Tomek Link
7.26 Random Forest with K Means SMOTE
7.27 Random Forest with Borderline SMOTE
7.28 Random Forest with ADASYN
7.29 K Nearest Neighbors with Random Oversampling
7.30 K Nearest Neighbors with SMOTE
7.31 K Nearest Neighbors with SVM SMOTE
7.32 K Nearest Neighbors with SMOTE Tomek Link
7.33 K Nearest Neighbors with K Means SMOTE
7.34 K Nearest Neighbors with Borderline SMOTE
7.35 K Nearest Neighbors with ADASYN
7.36 Gradient Boosting with Random Oversampling
7.37 Gradient Boosting with SMOTE
7.38 Gradient Boosting with SVM SMOTE
7.39 Gradient Boosting with SMOTE Tomek Link
7.40 Gradient Boosting with K Means SMOTE
7.41 Gradient Boosting with Borderline SMOTE
7.42 Gradient Boosting with ADASYN
7.43 Pearson Correlation Coefficient with Normalizing Max Filter
7.44 Fisher Coefficient with Normalizing Max Filter
7.45 Fisher Coefficient with Normalizing Range Filter
7.46 Permutation Importance with Normalizing Max Filter
7.47 Permutation Importance with Normalizing Range Filter
List of Tables

2.1 Example of lead attributes and their rankings [22] (source: Aberdeen Group, 2008)
2.2 Suggested marketing actions for each lead score [19]
2.3 Confusion matrix for binary classification
3.1 Characteristics of the input data
4.1 Best hyperparameter values for Logistic Regression and SMOTE Tomek link
4.2 Best hyperparameter values for Support Vector Classifier and SVM SMOTE
4.3 Best hyperparameter values for Decision Tree with no oversampling
4.4 Best hyperparameter values for Random Forest with Random Oversampling
4.5 Best hyperparameter values for K Nearest Neighbors without oversampling
4.6 Best hyperparameter values for Gradient Boosting without oversampling
4.7 ROC-AUC and Average Precision (AP) scores of all combinations of classifiers with resampling methods
4.8 Estimated run-time of classifiers with and without oversampling in seconds
4.9 Comparison of Average Precision obtained with KNN using different feature ranking and weighting schemes
7.1 Complete list of hyperparameters and their corresponding grid of values used for classifier tuning
7.2 Complete list of hyperparameters and their corresponding grid of values used for tuning resampling methods
1 Introduction
1.1 Background
The urge for digital transformation has swept across the entire business community in the past two decades. The rapid evolution of tools that retrieve, analyze, and transform business data has forced organizations to constantly shift and adapt their strategies to unravel consumer behavior. This has emphasized the use of consumer data to find existing patterns and associations in consumer actions and characteristics. Organizations dedicate a substantial amount of budget, time, and human resources to building data infrastructures to track and make inferences about their consumers' habits. This in turn becomes the foundation of every consequent decision made by marketing and sales teams.
Larger organizations gain their competitive advantage by tracking a diverse range of features about their users across their digital platforms to build sophisticated data models. In general, these models require a domain expert to determine certain parameters and do the decision-making manually, following business objectives. The Lead Scoring methodology is widely practiced by marketing and sales professionals to score and prioritize users based on their data. Lead scoring assigns an importance score to lead (prospective user) interactions like watching a demo, filling out a form, opening an email, etc. This can also be extended to demographic features such as the age, gender, and location of the lead. A total score is then calculated for every lead that determines the likelihood of the lead becoming a customer. Marketers use this score to segment leads and create a different marketing strategy for every segment. Leads with the highest scores will be contacted by the sales team, while leads with lower scores will be treated with proper marketing content and follow-ups.
While lead scoring is built upon data, it still depends on experts to manually set the importance scores, which need ongoing re-evaluation as business priorities and consumer behaviors evolve. Therefore, the method cannot be deemed fully data-driven. Over recent years, there has been a growing interest in the utilization of machine learning models to automate the process of scoring leads, widely referred to as Predictive Lead Scoring or Automated Lead Scoring. Studies suggest that rather than relying on pure domain knowledge or gut feeling, organizations should pursue predictive lead scoring as a replacement or complement to manual lead scoring [24].
Input data for predictive lead scoring can consist of website and social media data, as far as attributes can be traced back to a lead. Data generated through Customer Relationship Management tools (tools that allow businesses to manage and
integrate marketing and sales activities) is another common source
of input data. Numerous studies have attempted to investigate the
performance of probabilistic and non-probabilistic models in this
domain. Nygård et al. [24] compares the model performance of
Logistic Regression, Decision Trees, Random Forests, and Neural
Networks. The choice of models reveals how linear, non-linear,
ensemble, and deep models perform on this particular problem. In
contrast, Duncan et al. [10] compares two probabilistic models for
this problem, and Benhaddou et al. [2] experiments with Bayesian Networks, which work best with small and expert-curated datasets.
The application of machine learning in lead scoring has been a
subject of interest by many researchers in recent years. Current
implementations are mostly patented and offered as commercial
features. Therefore, there is an evident demand from the business
community for further research and exploration of this methodology.
This thesis aims to build a predictive model by investigating some linear and non-linear classifiers. It also studies the contribution of features to the prediction task by applying a number of feature ranking and weighting schemes. The weights are then applied to the features to find out the actual importance of each feature as well as to compare ranking and weighting methods. The explicit use of feature weights is commonly overlooked by similar studies, because the type and source of their input data demand a different data preparation approach.
1.2 Objectives
The data used in this thesis consists of 11 features and a highly imbalanced binary class variable. The extreme class imbalance puts the task in the same category as disease or fraud detection tasks. These tasks suffer from rare positive samples that make the learning process challenging. Therefore, as the primary objective, this thesis evaluates oversampling methods that handle problems with rare positive labels. As the second objective, it aims to find the best combination of classifier and resampling method.

Considering the significance of feature weights in predictive lead scoring, the thesis defines its third objective as an evaluation of three feature ranking and two feature weighting schemes. The weighting schemes are used to normalize the ranks. In this stage, we introduce the combination of ranking and weighting scheme that best represents the features in the input data.
The outline of this report can be described as follows:
• Chapter 2. Theory: Description of theories behind methods used by
this thesis
• Chapter 3. Methodology: The methodology used by this thesis
• Chapter 4. Results: Exploratory data analysis and report of achieved results with additional plots and tables
• Chapter 5. Discussion: Discussion of strengths and weaknesses
concerning obtained results and comparison with related work
• Chapter 6. Conclusion: Conclusion on how this paper can provide a
new reflection on the topic of predictive lead scoring
2 Theory
This chapter introduces the theory and mathematical formulations behind the concepts used in this thesis. It gives the reader intuition about the choice of certain models and parameters in subsequent chapters, and it enhances the reproducibility of the methodology used in this thesis.
2.1 Lead Scoring and Predictive Lead Scoring
As briefly discussed in the introduction chapter, identifying and prioritizing users with the potential of taking a desired action or making a purchase is a challenging task for many businesses. In business jargon, these users are referred to as leads. First, we discuss the lead scoring method that has helped many businesses tackle this challenge. Then we briefly review the theories behind machine-learning-assisted lead scoring, known as Predictive Lead Scoring (a.k.a. Automated Lead Scoring).
Lead Scoring
In a study performed by Aberdeen Group [22], lead scoring is defined as:
Lead scoring is a technique for quantifying the expected value of a
lead or prospect based on the prospect’s profile, behavior (online
and/or offline), demographics, and likelihood to purchase.
The quantification is done by assigning a score to all attributes of a lead. Scores are determined by marketing experts who value each attribute according to business priorities. [22] introduces a number of these attributes, which are presented in table 2.1.
The sum of the scores attributed to a single lead quantifies the lead's readiness (likelihood) to make a purchase. Leads can be divided into several groups depending on their level of readiness. Those with a high score are considered hot leads, while those with a smaller score are referred to as cold leads. Leads with the lowest scores are in the awareness stage, in which they have only an initial impression of the company and its products. Leads with higher scores are considered MQLs (Marketing Qualified Leads) or SQLs (Sales Qualified Leads) depending on their score. MQLs generally require nurturing to be prepared for sales.
Nurturing may include contacting the lead with more marketing and
promotional materials. SQLs, on the other hand, are deemed to be closer to a sales point and are thus contacted directly by the sales team.

Attribute                                                  Rank
Webinars attended                                          5.0
Purchase propensity scores                                 4.5
Email click-throughs                                       4.0
Website activity (pages visited and recency)               4.0
Website activity (type of activity)                        3.5
Keywords clicked                                           3.5
Website activity (length of time each page was visited)    3.0
Attitudinal and lifestyle information                      3.0

Table 2.1: Example of lead attributes and their rankings [22] (source: Aberdeen Group, 2008)

Figure 2.1: Sales Funnel - Source: Duncan and Elkan, 2015 [10]

Figure 2.1 depicts a sales funnel that demonstrates the different stages of a lead from a business perspective. A study by [19] on a company that provides business solutions suggests how the business can approach leads in different stages. These suggestions are presented in table 2.2. In this table, leads are categorized based on their profiles and engagement. Leads that have provided many details in their profiles are categorized as target fit, while leads that provide only basic information are labeled as potential fit. In addition, the level of the lead's past engagements with the company determines whether the lead has high interest, medium interest, or low engagement [19].
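The summation at the heart of manual lead scoring can be sketched in a few lines. This is an illustrative sketch, not the thesis's implementation: the attribute keys below are hypothetical, while the numeric ranks are taken from table 2.1.

```python
# Sketch of manual lead scoring: total score = sum of expert-assigned
# attribute ranks (ranks from Table 2.1; dictionary keys are hypothetical).
ATTRIBUTE_RANKS = {
    "webinars_attended": 5.0,
    "purchase_propensity_score": 4.5,
    "email_click_throughs": 4.0,
    "pages_visited_recency": 4.0,
    "keywords_clicked": 3.5,
}

def lead_score(attributes):
    """Sum the ranks of the attributes a lead exhibits; unknown ones score 0."""
    return sum(ATTRIBUTE_RANKS.get(a, 0.0) for a in attributes)

score = lead_score(["webinars_attended", "email_click_throughs"])
print(score)  # 9.0
```

Segmenting leads into hot and cold groups then amounts to thresholding this total score.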
Predictive Lead Scoring
In the conventional lead scoring method introduced in the previous section, attribute scores are assigned manually. Scores need constant revision and adjustment as business priorities and objectives change. Moreover, since a human expert decides on the scores, the process becomes error-prone. Using machine learning to find potential prospects automates the lead scoring process. Instead of explicitly assigning importance scores to lead attributes, these are learned from the data. Similar to lead scoring, predictive lead scoring can use any attribute as input. However, since this method involves constructing an automated model, the integration of data from various sources becomes a challenge.
Lead Description                Marketing Action
Target fit, High interest       Send email that encourages the lead to leave a phone number or call in directly, or propose to purchase the product directly.
Target fit, Medium interest     Send offer of a free trial of the service, or propose relevant material that is close to purchase of the product.
Target fit, Low engagement      Priority lead that needs further nurturing and a "why now" message.
Potential fit, High interest    Send email that encourages the lead to leave a phone number or to call in directly, or propose to purchase the product directly.
Potential fit, Medium interest  Continue to nurture with marketing materials that can increase interest; send offer of a free trial of the service. Pursue information to evaluate if it is a good fit.
Potential fit, Low engagement   Send nurturing content that can create a demand for the product. Pursue information to evaluate if it is a good fit.

Table 2.2: Suggested marketing actions for each lead score [19]
Integration of data can be done by manually connecting different datasets. Some public platforms like Google Analytics also enable integration with certain CRM platforms like HubSpot. HubSpot provides a fully-featured digital marketing platform that also includes a CRM service. CRM-generated data is the most popular data source for predictive lead scoring. This data consists of features related to both personal and behavioral attributes. More importantly, it contains information about historical sales.
The predictive lead scoring model can be defined to solve a binary problem. In this setting, leads are classified as likely or not likely to purchase or convert. This can be solved by investigating the performance of binary classifiers. A study by Benhaddou et al. [2] also defines a binary task but uses a Bayesian Network with binary features to tackle this problem. On the other hand, another study by Duncan et al. [10] uses the stages of the sales funnel (see figure 2.1) as class labels. This approach would also predict the level of a lead's readiness for sales. The selection of the modeling approach is highly influenced by the type of available input data.
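A minimal sketch of this binary framing with scikit-learn. The features and conversion labels below are synthetic stand-ins (real inputs would come from CRM or web analytics data), so this only illustrates the shape of the task, not the thesis's actual pipeline.

```python
# Hypothetical sketch: lead conversion as binary classification.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 3))              # stand-ins for lead features
y = (rng.random(n) < 0.05).astype(int)   # ~5% converted: imbalanced labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]    # estimated conversion likelihood
```

The predicted probabilities play the role of the lead score: ranking leads by `proba` replaces the manually summed attribute ranks.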
2.2 Data Preprocessing
It is essential to create a proper representation of the data before proceeding to model building. Cleaning data can consist of simple steps like feature normalization, or more sophisticated steps such as outlier detection, feature encoding, or handling missing values.
Handling Outliers
Outliers can produce erroneous results with certain algorithms and therefore need to be removed properly from the data. Below we introduce two statistical approaches for this task.
Standard Deviation Method
When dealing with normally distributed data, and depending on how sensitive we are to outliers, the mean is first estimated and the cut-off threshold is set as:
• One standard deviation from the mean: covering about 68% of the
data
• Two standard deviations from the mean: covering about 95% of the
data
5
2.3. Resampling
• Three standard deviations from the mean: covering about 99.7% of
the data
Any data point beyond the threshold on either side of the mean will be considered an outlier and thus removed. Choosing a proper number of standard deviations depends on the domain and size of the data [7].
Interquartile Range Method
If the data is not normally distributed, a different method has to be applied for outlier removal. The Interquartile Range (IQR) is obtained by finding the 25th and 75th percentiles of the feature. These percentiles are referred to as quartiles since they divide the data into four quarters.

IQR = (Q3 - Q1) * k    (2.1)

According to equation 2.1, by subtracting the first quartile Q1 from the third quartile Q3 we obtain the range of feature values between the two quartiles, i.e. the interquartile range. This range is then multiplied by a factor k to adjust the cut-off threshold. k can take a value from 1.5 upward, depending on the range of values in the feature [7].
One-hot Encoding
Many machine learning models are not able to handle categorical data in its original form, so categorical features need to be presented in a way that models can interpret. In cases where categories are nominal and there is no ordering among them, the variable is converted to a one-hot representation. One-hot encoding expands a variable with k categories into k different binary variables, each representing one category. For instance, if X_j contains 3 categories A, B, and C, and if X_j^(i), the value of feature j at observation i, equals C, the one-hot representation would look like [0, 0, 1].

While this approach is simple to implement, it comes with a major drawback when the data is high-dimensional, contains high-cardinality categorical features, or both. In such cases, the data dimension escalates considerably, occupying a larger amount of space in memory. This may also leave the model with a large number of parameters, which is computationally inefficient.
An alternative method for ordinal variables is conversion into integer factors. The order of categories in ordinal variables has to be taken into consideration. For instance, if X_j contains 3 categories A, B, and C, they will be converted to numerical values as (A = 1, B = 2, C = 3). This avoids the increased-dimension issue but assumes that the categories are ordered.
2.3 Resampling
A common challenge in classification tasks is dealing with imbalanced classes in the dataset. In binary classification, the class that has a significantly smaller number of samples is called the minority class, while the other class is referred to as the majority class. Datasets with multiple classes (multinomial) can also suffer from an imbalanced class distribution. Overlooking this property can yield misleading results even if a sophisticated classifier is at play. It is therefore essential to experiment with resampling techniques to achieve results that are less biased towards the majority class(es). Depending on the task at hand, undersampling, oversampling, or a combination of both can be applied. Regardless of the method used, the resampling strategy can be either defined as a simple strategy (i.e. resample only the majority class or only the minority class) or customized by setting desired proportions for each class label.
Random Oversampling
Random oversampling involves duplicating samples from the minority class. The oversampling can be done iteratively to balance all classes with respect to the majority class. The duplication is carried out by sampling the minority data points with replacement (figure 2.2). Random oversampling is known to cause overfitting, since the generated samples overlap the original ones. Therefore, a certain amount of dispersion may be preferable to producing exact copies of the original data points. Assume that we want to generate a synthetic sample based on the original sample xi. We first sample x from a probability distribution K_Hj, where K_Hj is centered at xi and Hj is a matrix of scale parameters. K_Hj is usually a unimodal, symmetric distribution scaled by Hj. The new sample is thus generated in a neighborhood of xi whose width is determined by Hj. This method is also referred to as Random Oversampling Examples (ROSE) [21] (see figure 2.3).
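A minimal sketch of both ideas, plain duplication and a ROSE-style smoothed variant (here a single Gaussian bandwidth stands in for the full scale matrix Hj; all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X_min, n_new, bandwidth=0.0):
    """Draw n_new minority points with replacement; with bandwidth > 0,
    jitter each copy with Gaussian noise (a spherical-kernel ROSE sketch)."""
    idx = rng.integers(0, len(X_min), size=n_new)
    X_new = X_min[idx].astype(float)
    if bandwidth > 0:
        X_new += rng.normal(scale=bandwidth, size=X_new.shape)
    return X_new

X_min = np.array([[3.8, 3.9], [4.0, 4.1]])
copies = random_oversample(X_min, 5)                    # exact duplicates
smoothed = random_oversample(X_min, 5, bandwidth=0.05)  # dispersed samples
```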
SMOTE
Synthetic Minority Oversampling TEchnique (SMOTE) is another popular oversampling method that comes with a variety of implementations. As opposed to Random Oversampling, which simply duplicates data points, SMOTE uses the K nearest neighbors of the minority sample xi to generate synthetic samples.

xnew = xi + λ(xzi − xi) (2.2)

xzi is one of the nearest neighbors of sample xi from the minority class, and λ is a parameter between 0 and 1 that determines the distance between the new and the original sample. Baseline SMOTE uses a uniform distribution to select xi for generating a new sample [8] (see figure 2.4). This makes SMOTE sensitive to noise. A noisy sample from the minority class
that is among the majority samples has an equal probability of being selected for resampling. This may result in generating more noisy samples in regions where the majority samples have a high density [17].

Figure 2.3: Random Oversampling Examples (ROSE)
Other implementations of SMOTE take a different strategy in
selecting xi but still use equation 2.2 to generate new
samples.
While oversampling may balance the class distributions in the data, it cannot always compensate for a lack of information. One study shows that synthetic samples generated by SMOTE leave the expected value of the minority class unchanged while reducing its variance [4].
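Equation 2.2 itself is a one-line interpolation. A minimal sketch (the neighbor search is omitted and the minority neighbors are passed in directly; all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(x_i, minority_neighbors):
    """Equation 2.2: interpolate between x_i and a random minority neighbor."""
    x_zi = minority_neighbors[rng.integers(len(minority_neighbors))]
    lam = rng.uniform(0.0, 1.0)       # lambda in [0, 1]
    return x_i + lam * (x_zi - x_i)

x_i = np.array([0.0, 0.0])
neighbors = np.array([[1.0, 1.0], [2.0, 0.0]])
x_new = smote_sample(x_i, neighbors)  # lies on a segment from x_i to a neighbor
```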
K Means SMOTE
K Means SMOTE does the oversampling in three stages:
• Clustering: the data is clustered into k groups using the k-means clustering algorithm
• Filtering: only clusters with a high portion of minority samples are selected for oversampling
• Oversampling: SMOTE is applied to the filtered clusters using equation 2.2
First, the input space is clustered into k groups. In the filtering stage, the clusters in which minority samples make up more than 50 percent of the population are identified; applying SMOTE to these clusters reduces the chance of generating noisy samples. Moreover, the goal is also to achieve a balanced distribution of samples within the minority class: if there exist multiple clusters of minority samples, we want them to be oversampled to the same extent. Therefore, the filter step allocates more generated samples to sparse minority clusters rather than dense ones [17] (see figure 2.5). The imbalance ratio threshold (irt) hyperparameter determines the threshold for equation 2.3. Equation 2.3 divides the number of
Figure 2.4: Oversampling with SMOTE

majority samples in cluster c over the number of minority samples in the same cluster. The counts are incremented by 1 to avoid division by zero or ir = 0.

ir = (majorityCount(c) + 1) / (minorityCount(c) + 1) (2.3)
Clusters whose ir falls below this threshold are selected for oversampling. Lowering the threshold therefore makes the filtering more sensitive, requiring a cluster to contain a higher proportion of minority instances in order to be selected, while raising it has the opposite effect [17].
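The filtering rule follows directly from equation 2.3. A minimal sketch (the cluster counts and the threshold value are invented for illustration; here a cluster is kept when its ir falls below the threshold):

```python
def imbalance_ratio(majority_count, minority_count):
    """Equation 2.3; counts are incremented by 1 to avoid division by zero."""
    return (majority_count + 1) / (minority_count + 1)

# (majority, minority) counts per cluster -- illustrative values
clusters = {"c1": (2, 18), "c2": (15, 5), "c3": (0, 10)}
irt = 1.0  # imbalance ratio threshold

selected = [name for name, (maj, mino) in clusters.items()
            if imbalance_ratio(maj, mino) < irt]
# → ['c1', 'c3']: only clusters dominated by the minority class are kept
```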
SVM SMOTE
This method uses a Support Vector Machine (SVM) to select xi for resampling. In classification tasks, SVMs draw a hyperplane that has the maximum distance to the closest point(s) in each class. The hyperplane acts as an n-dimensional decision boundary that is usually accompanied by soft margins on each side. Points on or in between the soft margins and the hyperplane are referred to as support vectors. SVM SMOTE uses an SVM to first identify the support vectors and then uses them to generate new samples [23]. The motivation behind this is that support vectors have a significant effect on the position of the separating hyperplane and on classification performance in general (see figure 2.6). For further details on SVMs, please refer to section 2.4.
Borderline SMOTE1
Borderline SMOTE1 first checks the labels of xi's m nearest neighbors. It then proceeds to classify xi as one of the following:
• noise – all nearest neighbors are from a different class than that of xi.
Figure 2.7: Oversampling with Borderline SMOTE 1
• in danger – at least half of the nearest neighbors are from the same class as xi.
• safe – all nearest neighbors are from the same class as xi.
The algorithm selects a minority sample in danger together with its k nearest neighbors from the same class to generate a new sample. This means new samples are generated closer to the minority class border. It also prevents noisy samples from being selected for resampling, which occurs in the original SMOTE algorithm. However, it still uses equation 2.2 to generate new samples [11] (see figure 2.7).
Borderline SMOTE2
This algorithm operates similarly to Borderline SMOTE1, except that it allows the k nearest neighbors of xi to be from any class. The value of λ here ranges from 0 to 0.5, as opposed to 0 to 1 in Borderline SMOTE1; with λ < 0.5, the new sample is generated closer to the minority class [11]. An increased number of samples near the decision boundary tends to improve classification performance, which is the main motivation behind Borderline SMOTE2 (see figure 2.8). Figure 2.9 compares the behavior of Random Oversampling, ROSE, and different variants of SMOTE.
ADASYN
Adaptive Synthetic oversampling (ADASYN) is a special case of SMOTE. It generates new samples in proportion to the number of samples in a given neighborhood that are not from the same class as xi [13]. The steps performed by ADASYN are described in algorithm 1.
Algorithm 1: ADASYN
1. Calculate the number of synthetic samples G to be generated: G = (ml − ms) × β
   a) ml and ms are the number of majority and minority samples respectively.
   b) β specifies the desired balance level. β = 1 creates a perfectly balanced dataset.
2. For each xi in the minority class calculate ri = Δi / K
   a) Δi is the number of samples among the K nearest neighbors of xi that belong to the majority class.
3. Normalize ri as r̂i = ri / Σ ri, summed over all minority samples (so that the r̂i form a categorical distribution)
4. Calculate the number of samples to be generated for each minority sample as gi = r̂i × G
5. For each minority sample generate gi new samples using equation 2.2
One can consider the normalized ri as weights for the minority samples. It can thus be seen that ADASYN forces the learning algorithm to focus on samples that are originally more difficult to learn [13].
Figure 2.9: Comparison of Random oversampling with different
variants of SMOTE
Combination of Oversampling and Undersampling
Resampling methods can be combined into a hybrid of over- and undersampling. One popular combination is SMOTE-Tomek link, which combines SMOTE oversampling with Tomek link undersampling.
Tomek links have been described by Batista et al. [1] as:
Given two examples x and y belonging to different classes, let d(x, y) be the distance between x and y. The pair (x, y) is called a Tomek link if there is no example z such that d(x, z) < d(x, y) and d(y, z) < d(y, x). If two examples form a Tomek link, then one of them is noise or both are borderline (near the class border).
By removing Tomek links, the space is cleaned of noisy data, producing better decision boundaries for oversampling.
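The definition translates directly into code: a pair from different classes forms a Tomek link exactly when the two points are each other's nearest neighbors. A minimal O(n²) sketch on toy data (a real implementation would use a spatial index; all names are illustrative):

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) with different labels that are mutual
    nearest neighbors, i.e. no z is closer to either point."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    nn = D.argmin(axis=1)  # nearest neighbor of each point
    return [(i, int(j)) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

X = np.array([[0.0], [0.1], [5.0], [5.1], [10.0]])
y = np.array([0, 1, 0, 0, 1])
tomek_links(X, y)  # → [(0, 1)]: the only cross-class mutual-nearest pair
```

Note that the pair (2, 3) is also mutual-nearest but shares a label, so it is not a Tomek link.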
Weight Correction
Training the model on a resampled synthetic dataset changes the class distribution in the training set, while the classes in the validation/test set still carry the imbalanced proportions. Consider the scenario where we oversample the minority class in the training set to have as many samples as the majority class: each class then gets a weight (prior probability) of 0.5. This misleads the model when predicting labels in unseen data, as it assumes both classes are equally probable, which is not true.
To compensate, one can take the posterior probabilities returned by the classifier, divide them by the class fractions in the training set, multiply them by the class fractions in the validation/test set, and finally normalize so that the new posterior probabilities sum to one [3]. This is expressed mathematically in equation 2.4.
p(y|x) = (1/Z) · q(y|x) · p(y)/q(y) (2.4)

The term q(y|x) represents the posterior probabilities returned by the model, p(y) is the probability of the positive class in the test/validation set, and q(y) is the probability of the positive class in the training set. Therefore, p(y)/q(y) is the correction term that is multiplied by the posterior probabilities. Z is the normalizing constant. In a binary task, 1/Z is calculated according to equation 2.5.

1/Z = 1 / Σy q(y|x) · p(y)/q(y) (2.5)
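The correction can be sketched in a few lines (a minimal illustration; the posterior and class-fraction values are invented):

```python
def correct_posteriors(q_post, q_prior, p_prior):
    """Rescale posteriors from a model trained on resampled data.
    q_post: model posteriors q(y|x); q_prior: class fractions in the
    (resampled) training set; p_prior: class fractions in the test set."""
    unnorm = [q * p / qp for q, qp, p in zip(q_post, q_prior, p_prior)]
    z = sum(unnorm)                      # normalizing constant Z
    return [u / z for u in unnorm]

# Balanced training priors (0.5/0.5), true test priors 0.9/0.1
corrected = correct_posteriors([0.3, 0.7], [0.5, 0.5], [0.9, 0.1])
```

In the example the raw posteriors favor class 1, but after reweighting by the true test priors the corrected posteriors favor class 0.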
2.4 Classifiers
The theory in the preceding sections was concerned with the handling of input data: preprocessing and resampling methods were discussed to show how such a dataset can be prepared as input to a machine learning model. The next objective of this thesis is to search, tune and evaluate several binary classifiers and find the best combination of oversampling method and classifier. In this section, a number of linear and non-linear classifiers are discussed in detail.
Decision Trees
Tree-based models, a.k.a. Decision Trees, are widely used for both classification and regression problems. They are best known for their interpretability and their ability to handle non-linear problems. There are different decision tree algorithms, including CART (Classification And Regression Trees), ID3, and C4.5, which differ in several respects, including how they handle overfitting. In general, a decision tree follows a sequential binary procedure to build the tree.
Consider a predictor space with N samples and p dimensions (X1, ..., Xp). The algorithm first chooses a splitting value s for the jth predictor Xj to divide the predictor space into R1 and R2 so that R1(j, s) = {X | Xj < s} and R2(j, s) = {X | Xj ≥ s}. R1 and R2 are the two new subspaces, also known as terminal nodes. The algorithm repeats this step for every subspace until a stopping criterion is met. Stopping can be triggered, for instance, when the tree reaches a certain depth or when a leaf node contains fewer than a minimum number of samples. The choice of Xj and s at each step is made so that the error within the resulting subspaces is minimized.
In regression tasks, the predicted value of a new point is the mean value of the subspace it falls into. At every step, the model aims to minimize the Residual Sum of Squares (RSS) presented in equation 2.6.

RSS = Σ_{i: xi∈R1} (yi − ȳR1)² + Σ_{i: xi∈R2} (yi − ȳR2)² (2.6)

where ȳR1 and ȳR2 are the mean values within the R1 and R2 subspaces. Classification trees, on the other hand, use the majority class label within each subspace to predict the class of a new point. For choosing Xj and s, the model attempts to minimize either equation 2.7 (Entropy) or equation 2.8 (Gini Index).

D = − Σ_{k=1..K} pmk log pmk (2.7)

G = Σ_{k=1..K} pmk (1 − pmk) (2.8)
where pmk is the proportion of training data points in the mth subspace that are from the kth class. Both criteria try to form regions where the majority of points belong to one class [3]; in other words, they attempt to minimize node impurity. Although trees are easily interpreted, experiments show that learned trees are very sensitive to the details of the input data: a small change in the training data can result in a very different set of splits [14].
One way to tackle the model's sensitivity to the training data is pruning. In the cost complexity pruning method, trees with too many terminal nodes are penalized by a factor α. The value of α can control the bias-variance trade-off when the model is overfit or underfit; the complexity of the model is inversely proportional to the value of α. In contrast to the top-down approach for building trees, pruning begins at the bottom, from the leaf nodes back to the root of the tree [14].
Random Forest Classifier
Random Forest is an ensemble method introduced to address the overfitting issue of Decision Trees. It uses Bagging (Bootstrap Aggregating) with decision trees to provide more robust results. Consider input data with N samples and p features. The Random Forest model is described in algorithm 2.
Algorithm 2: Random Forest
1. For b = 1, 2, ..., B repeat:
   a) Create a bootstrap sample b of size N by sampling from the original data with replacement
   b) Fit a decision tree to b. At each split of the tree choose m random features from X, where m < p, and obtain the prediction fb
2. For regression, calculate the average prediction fbag(X) = (1/B) Σ_{b=1..B} fb(X)
3. For classification, take the majority vote as the final prediction
If the model trained on each bootstrap has variance σ², the variance of the mean of all bootstraps (fbag) is σ²/B, which is smaller than or at most equal to each individual variance.
Figure 2.10: Support Vector Machine with linear decision boundary
This shows that bagging reduces model variance. Random Forests are also able to estimate the expected error without cross-validation: on average, and depending on the size of the bootstrap, about one-third of the data points are not selected for each bootstrap. These are referred to as out-of-bag samples, and they can be used as an unseen test dataset to estimate the generalization error [14]. Random Forests and Decision Trees are highly flexible models: the depth of the tree(s), the number of leaf nodes, the number of samples in every leaf node, and the number m of features considered at each split all have significant impacts on the behavior of the final model.
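The one-third figure follows from bootstrap sampling itself: the probability that a given point is never drawn in N draws with replacement is (1 − 1/N)^N ≈ e⁻¹ ≈ 0.368. A quick numerical check (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
boot = rng.integers(0, N, size=N)            # one bootstrap of size N
oob_fraction = 1 - np.unique(boot).size / N  # points never drawn
# oob_fraction ≈ 0.368, i.e. about one-third of the data is out-of-bag
```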
Support Vector Classifier
Support Vector Classifiers (SVCs), or Support Vector Machines (SVMs), were originally designed for linear classification. Consider an input space with n observations and j dimensions. The SVC tries to find a hyperplane in this space that separates the observations by class. Ideally, the hyperplane has the largest possible margin to the observations of each class. A major drawback, however, is that such a hyperplane is extremely sensitive to the training observations: adding or removing observations from the training set can considerably change its position. To address this, soft margins are drawn parallel to the hyperplane. Points that are either on or inside the soft margins are called support vectors. The role of soft margins is to allow a desired level of misclassification in the model in order to reduce the generalization error [14].
As illustrated in figure 2.10, the soft margins allow some observations (support vectors) to be on the wrong side of the margins or even of the decision boundary. This reduces the model's sensitivity to the training data. The size of the margins is controlled by C, the Cost parameter. When C → ∞, no misclassification is tolerated and the soft margins are removed, resulting in a model with high variance. On the contrary, when C → 0, the margins are widened, allowing more misclassified observations. This also implies that the number of support vectors is inversely proportional to the value of C [14].
Extending Support Vector Classifiers to non-linear problems can increase computational costs significantly, since training an SVC to find the optimal decision boundary involves solving a quadratic optimization problem. If, for instance, we decide to enlarge the feature space to a higher-order polynomial, we can end up with a huge number of terms to compute in the optimization problem. To overcome this issue, SVCs use kernel functions to handle non-linear problems efficiently [14].
Kernel Functions
Kernel functions quantify the similarity between two observations. The linear kernel calculates the inner product between pairs of training observations. It turns out that only the inner products between pairs of support vectors have an impact on the model, so it is not necessary to compute the inner products between all possible pairs in the training set. Consider x to be a vector of P dimensions; equation 2.9 represents the linear kernel. Note that this thesis uses bold notation to represent vectors.

K(xi, xi′) = Σ_{j=1..P} xij xi′j (2.9)
To move beyond linearity, a polynomial kernel (equation 2.10) of order d can be used instead.

K(xi, xi′) = (1 + Σ_{j=1..P} xij xi′j)^d (2.10)
A more flexible kernel function is the radial kernel, or Radial Basis Function (RBF), in equation 2.11.

K(xi, xi′) = exp(−γ Σ_{j=1..P} (xij − xi′j)²) (2.11)

The RBF kernel is based on the squared Euclidean distance between two observations. If the distance is large, the exponential in equation 2.11 returns a small value, meaning that observations far from the decision boundary are effectively ignored. γ = 1/(2σ²) is a smoothing coefficient, and the value of σ in the denominator controls the shape of the kernel.
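A minimal implementation of equation 2.11 (the γ value and sample vectors are arbitrary):

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function kernel (equation 2.11)."""
    sq_dist = sum((xj - zj) ** 2 for xj, zj in zip(x, z))
    return math.exp(-gamma * sq_dist)

rbf_kernel([1.0, 2.0], [1.0, 2.0])   # → 1.0: identical points
rbf_kernel([0.0, 0.0], [10.0, 0.0])  # ≈ 0: distant points are ignored
```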
Logistic Regression
Linear Regression uses the Ordinary Least Squares (OLS) method to fit a straight line to the training data. The predicted variable y is modeled directly as a continuous variable. For every observation of X1, ..., XP, the target value is calculated using equation 2.12

y = β0 + β1x1 + ... + βPxP (2.12)

where β0 is the y-intercept and β1, ..., βP are the model coefficients. Each βi determines how much y changes for one unit of change in xi. Logistic Regression, on the other hand, calculates conditional probabilities between 0 and 1, which makes it suitable for binary classification problems. For this purpose, it uses the logistic function to calculate the probabilities [14].

p(Y|X) = e^(β0+β1X1+...+βPXP) / (1 + e^(β0+β1X1+...+βPXP)) (2.13)
Figure 2.11: Logistic curve
Remember that X is a vector with P dimensions. The logistic function can be rewritten in log-odds form as in equation 2.14.

log( p(X) / (1 − p(X)) ) = β0 + β1X1 + ... + βPXP (2.14)

The logistic function creates an S-shaped output representing probabilities from 0 to 1. This indicates that although increasing or decreasing X increases or decreases the log-odds value, the relationship to p is non-linear. Logistic Regression uses Maximum Likelihood Estimation (MLE) to learn the model parameters [14]. When MLE is applied to the Logistic Regression model, it attempts to minimize the expression in equation 2.15.
L(θ) = −(y log(ŷ) + (1 − y) log(1 − ŷ)) (2.15)

Equation 2.15 is the negative log of the likelihood function, or the log loss. Note that y and ŷ indicate the true and predicted class labels respectively, and θ is the vector of parameters the model is trying to learn. L(θ) therefore quantifies the loss of the model. One common way to solve the optimization problem defined by MLE is a technique known as gradient descent.
Gradient descent is an iterative process. It calculates the gradient of the loss function over the training observations and moves in the opposite (downhill) direction of the gradient to find a local minimum. If the local minimum is lower than all other local minima, or the loss function is convex (has a single minimum), the algorithm may find the global minimum. In equation 2.16, at each step the gradient ∇L(θ) is multiplied by a small step size η and subtracted from the current value of θ, until ∇L(θ) is smaller than a minimum threshold.

θ := θ − η∇L(θ) (2.16)
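A minimal sketch of the update in equation 2.16 for logistic regression (toy data; the gradient of the log loss with respect to θ averages (ŷ − y)·x over the observations, and the step size and iteration count are arbitrary):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gd_step(theta, X, y, eta=0.5):
    """One gradient-descent update: theta := theta - eta * grad L(theta)."""
    n = len(y)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        for j, v in enumerate(xi):
            grad[j] += (p - yi) * v / n
    return [t - eta * g for t, g in zip(theta, grad)]

# Toy data: first column is the intercept term
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
theta = [0.0, 0.0]
for _ in range(200):
    theta = gd_step(theta, X, y)
# After training, p(y=1|x) increases with the feature value
```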
Regularization
In Linear and Logistic Regression models, the bias-variance trade-off is usually controlled by regularization. L1-regularization (Lasso Regression) and L2-regularization (Ridge Regression) are the two basic regularization methods. Considering equation 2.12 for Linear Regression with N observations and p dimensions, L2-regularization is applied to the loss function as in equation 2.17. Note that both methods can also be applied to the logistic function in equation 2.13 for the Logistic Regression model.

L(θ) + λ Σ_{j=1..p} βj² (2.17)
The first term in equation 2.17 is the log loss presented in equation 2.15 and the second term is the penalty. The penalty term becomes larger when the values of β are large. The λ parameter controls the level of regularization: when λ = 0 the penalty term vanishes and no regularization is performed; as λ → ∞, regularization becomes stronger, decreasing the complexity of the model. L1-regularization, presented in equation 2.18, works similarly to L2-regularization except that it uses the absolute values of β in the penalty term. This forces some of the β estimates to be exactly 0, whereas in L2-regularization the coefficients can become close to 0 but not 0. This also makes L1-regularization suitable for feature selection [14].

L(θ) + λ Σ_{j=1..p} |βj| (2.18)
Another regularization method, Elastic Net, combines the L1 and L2 penalties to overcome the limitations of both. While L2 cannot perform feature selection like L1, L1 has its own limitations in how it removes features: for highly correlated features, for instance, L1 keeps one of them and omits the rest. Equation 2.19 is the penalty term of the Elastic Net. The second term in the equation averages highly correlated features, while the first term provides a sparse solution in the coefficients of the averaged features [14].

Σ_{j=1..p} (α|βj| + (1 − α)βj²) (2.19)
The α parameter in equation 2.19 controls the contribution of L1
and L2 methods in the regularization task.
K Nearest Neighbors (KNN)
K Nearest Neighbors (KNN) is a supervised non-parametric machine learning model. For a test observation xi and a given K, it finds the K nearest points to xi. In classification, xi is assigned the majority class label of its K neighbors; in regression, xi is assigned the average value of its K neighbors. When K is small the model is more sensitive to the training observations and thus has higher variance.
The behavior of the KNN classifier can also be explained using Bayes' theorem [3]. Assume we have N observations and k classes. To classify xi using its K neighbors, we draw a sphere centered on xi that contains all K neighbors. Let V be the volume of the sphere, Nk the number of points belonging to class Ck, and Kk the number of the K nearest neighbors that belong to class Ck. The probability density for each class (the conditional probability) can be expressed as

p(X|Ck) = Kk / (Nk V)

and the class prior as

p(Ck) = Nk / N

Using equation 2.20 we can then calculate the posterior probability density of the class label given the K neighbors.

p(Ck|X) = p(X|Ck) p(Ck) / p(X) (2.20)
The class Ck with the highest posterior probability density is assigned to the test observation [3]. There are different metrics for calculating the distance between xi and its K nearest neighbors. Equation 2.21 represents the Minkowski distance; the Euclidean and Manhattan distance metrics are special cases of the Minkowski distance with P = 2 and P = 1 respectively.

d(p, q) = ( Σi |pi − qi|^P )^(1/P) (2.21)
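Equation 2.21 combines naturally with the distance-weighted voting used by weighted KNN, which is described below. A minimal sketch (data, K and names are illustrative; the tiny constant guards against division by zero):

```python
def minkowski(p, q, P=2):
    """Equation 2.21; P = 2 gives Euclidean, P = 1 Manhattan distance."""
    return sum(abs(pi - qi) ** P for pi, qi in zip(p, q)) ** (1.0 / P)

def weighted_knn_vote(x, neighbors, P=2):
    """Inverse-distance-weighted vote among (point, label) neighbors."""
    votes = {}
    for point, label in neighbors:
        w = 1.0 / (minkowski(x, point, P) + 1e-12)
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

# One very close minority neighbor outvotes two distant majority ones
neighbors = [((2.0, 0.0), "majority"),
             ((0.0, 2.5), "majority"),
             ((0.1, 0.0), "minority")]
weighted_knn_vote((0.0, 0.0), neighbors)  # → 'minority'
```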
KNNs can handle non-linear classification tasks quite well. However, they tend to underperform on imbalanced data, since samples from the majority class are more frequent and it is therefore likely that most of the K neighbors of a test observation belong to the majority class. One way to tackle this problem is to use a weighted KNN, in which the label of each of the K neighbors is weighted proportionally to the inverse of its distance to the test observation. In non-weighted KNN, points among the K nearest neighbors are uniformly assigned a weight of 1 while the rest get a weight of 0 [12].
Gradient Boosting
Gradient Boosting is an ensemble model. The idea of boosting is to combine the outputs of several weak learners to produce more robust results; learners whose error rates are only slightly better than random guessing are considered weak learners. One of the most commonly used boosting methods is AdaBoost, which was designed for binary classification. Consider a classifier G(x) for a binary task with N observations. In AdaBoost, a number of weak learners (for example classification trees) Gm(x), m = 1, 2, ..., M are created, one at every step. These trees have only a single split and are also known as stumps. The final prediction is a weighted average of the predictions of all the stumps.

G(x) = sign( Σ_{m=1..M} αm Gm(x) ) (2.22)
αm determines the contribution of Gm(x). The algorithm starts by initializing the weights of the training points uniformly as 1/N. Then at every step a classifier is trained on the data; the weights of misclassified points are increased while correctly classified points are given smaller weights. This forces the algorithm to focus on observations that are more difficult to learn [12].
Since AdaBoost gives a much higher influence to misclassified points, it is very sensitive to outliers, which degrades its performance when trained on noisy data. Gradient Boosting addresses this issue and also extends its applicability to both regression and classification tasks. Gradient Boosting is a generalization of AdaBoost. It can
take any differentiable function as the loss function and use the gradient descent method for optimization (see equation 2.16). For the general loss function in equation 2.23

L(f) = Σ_{i=1..N} L(yi, f(xi)) (2.23)

the derivative (gradient) with respect to f(xi) is calculated at every observation according to equation 2.24.

gim = ∂L(yi, f(xi)) / ∂f(xi) (2.24)
Algorithm 3, taken from [12], describes Gradient Boosting for regression. In classification with k different classes, steps 2(a) to 2(d) are performed for each of the k classes. Note that for both regression and classification problems, Gradient Boosting grows a regression tree at each step. In step 1 of the algorithm, the initial predictions are computed; this can be done either through an external estimator or by computing the log-odds value of the target variable. In step 2(a) we calculate the pseudo-residuals, that is, the differences between the observed and predicted values in each tree. In 2(b) the next regression tree is grown on the residuals of the previous tree. In regression, the residuals also serve as the predicted values at each step, whereas in classification we need to calculate probabilities for prediction. Steps 2(c) and 2(d) involve minimizing the loss function and updating the predictions for each grown tree. For classification with k classes, the final output in step 3 is k different tree expansions [12].
Algorithm 3: Gradient Boosting
1. Initialize f0(x) = argmin_γ Σ_{i=1..N} L(yi, γ)
2. For m = 1, 2, ..., M:
   a) For i = 1, 2, ..., N compute the pseudo-residuals
      rim = −[ ∂L(yi, f(xi)) / ∂f(xi) ]_{f = fm−1}
   b) Fit a regression tree to the targets rim, giving terminal regions Rjm, j = 1, 2, ..., Jm
   c) For j = 1, 2, ..., Jm compute
      γjm = argmin_γ Σ_{xi∈Rjm} L(yi, fm−1(xi) + γ)
   d) Update fm(x) = fm−1(x) + Σ_{j=1..Jm} γjm I(x ∈ Rjm)
3. Output f(x) = fM(x)
Trees grown with Gradient Boosting have more leaf nodes than the stumps in AdaBoost, which have only two. The algorithm can also be tuned to use only a portion of the training observations at each step, which results in Stochastic Gradient Boosting. This approach can control the bias-variance trade-off: a smaller sub-sample fraction decreases variance while increasing bias [12].
2.5 Feature Ranking
The main purpose of using a data-driven model for lead scoring is to let the data decide the weight of every user action or characteristic. This section introduces two feature importance calculation techniques and then describes two correlation-based ranking methods.
Feature Importance with Decision Trees
The level of contribution of each feature can be estimated while building a decision tree. In short, the importance of a feature is calculated from the reduction of the Gini or entropy criterion at every split point [18] (see section 2.4 for more details on decision trees). However, this method does not work well with high-cardinality features (numerical or categorical features with many unique values): when the model is overfitting, it can give high importance to features that are not predictive on unseen data and thus lack generalization power [5]. Calculating feature importance values is also possible with other tree-based models such as Random Forests. However, this thesis does not discuss the details of the tree-based feature importance methods; instead, Permutation Feature Importance is introduced as a more practical way to calculate feature importance scores.
Permutation Feature Importance
The Permutation Importance algorithm, proposed by Breiman et al. [5], addresses the generalization problem of the tree-based methods.
Algorithm 4: Permutation Importance
1. Inputs: dataset D, predictive model m
2. Compute a reference score s of the model m on D (accuracy, ROC-AUC, etc.)
3. For each feature j in D:
   a) For k in 1, ..., K:
      i. Randomly shuffle the values in column j to create an altered version of the dataset Dj,k
      ii. Compute the score sk,j of model m on Dj,k
   b) Compute the importance ij of the permuted feature as:
      ij = s − (1/K) Σ_{k=1..K} sk,j
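Algorithm 4 can be sketched in a few lines (the scoring function here is a simple stand-in; in practice s would be the accuracy or ROC-AUC of a fitted model, and all names and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_importance(score, X, y, n_repeats=5):
    """Algorithm 4: mean drop in score after shuffling each column."""
    base = score(X, y)
    importances = []
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])          # break the tie with the target
            drops.append(base - score(Xp, y))
        importances.append(float(np.mean(drops)))
    return importances

# Toy data: column 0 equals the label, column 1 is pure noise
y = np.array([0, 1] * 50)
X = np.column_stack([y.astype(float), rng.normal(size=100)])
acc = lambda X, y: np.mean((X[:, 0] > 0.5) == y)  # stand-in "model" score
imp = permutation_importance(acc, X, y)
# imp[0] is large (shuffling the informative column hurts), imp[1] is 0
```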
The importance of the jth feature is determined by how much the model's base score s changes due to shuffling the values in j. The reshuffling step breaks the feature's ties with the target variable, so importance is measured solely by how much the model depends on that specific feature. Features may sometimes appear more important on the training set than on the test/validation set; it is therefore good practice to train the model on the training set and calculate the importance scores on a held-out set to improve generalization power [5].
Pearson Correlation Coefficient
Pearson correlation measures the linear correlation between every
feature and the response variable. It can take values from -1 to 1.
A value close to zero indicates an insignificant correlation,
whereas a value of larger magnitude, irrespective of the sign,
represents a stronger correlation between the feature and the
response. Pearson correlation is calculated by dividing the
covariance of the feature-response pair by the product of their
standard deviations.
J_cc(X_j) = [ (1/N) · Σ_{i=1}^{N} (x_{i,j} − X̄_j)(c_i − c̄) ] / (σ_{X_j} · σ_c)    (2.25)
For every feature X_j and class label c, the covariance is
calculated and summed over all N observations. This value is then
divided by the product of the feature and class standard deviations
(σ_{X_j} · σ_c). Note that in equation 2.25, c denotes the probability of
observing a sample from class 1. Since rankings near −1 are as
informative as rankings close to 1, we take the absolute values of
the rankings [16].
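As an illustration of equation 2.25, the correlation ranking can be computed directly with NumPy. The synthetic data and the number of features here are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
c = rng.integers(0, 2, size=N)           # binary class labels
X = rng.normal(size=(N, 3))
X[:, 0] += 2.0 * c                       # feature 0 correlates with the label

def pearson_rank(X, c):
    """Absolute Pearson correlation of each column of X with c."""
    Xc = X - X.mean(axis=0)
    cc = c - c.mean()
    cov = (Xc * cc[:, None]).mean(axis=0)    # per-feature covariance
    return np.abs(cov / (X.std(axis=0) * c.std()))

ranks = pearson_rank(X, c)
print(ranks)   # feature 0 should receive the highest rank
```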
Fisher Coefficient
This ranking method is based on Fisher's Discriminant Analysis,
which reduces to Linear Discriminant Analysis in the case of binary
problems. Fisher's discriminant separates classes based on their
means, drawing a linear discriminant that maximizes separability
while minimizing within-class variance. The Fisher coefficient is
calculated using equation 2.26
J_FSC(X_j) = [X̄_{j,1} − X̄_{j,2}] / [σ_{j,1} + σ_{j,2}]    (2.26)
X̄_{j,1} and X̄_{j,2} represent the mean of the values in X_j
corresponding to class 0 and class 1 respectively. The same holds
for the standard deviations in the denominator [16]. As opposed to
feature importance techniques, both the Fisher coefficient and the
Pearson correlation coefficient capture a linear relationship
between the features and the target variable.
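The Fisher coefficient of equation 2.26 can likewise be sketched in a few lines of NumPy, again on synthetic example data:

```python
import numpy as np

rng = np.random.default_rng(1)
c = rng.integers(0, 2, size=500)
X = rng.normal(size=(500, 3))
X[:, 1] += 3.0 * c                       # feature 1 separates the classes

def fisher_rank(X, c):
    """Per-class mean difference over the sum of per-class std deviations."""
    X0, X1 = X[c == 0], X[c == 1]
    return np.abs(X0.mean(axis=0) - X1.mean(axis=0)) / (
        X0.std(axis=0) + X1.std(axis=0))

ranks = fisher_rank(X, c)
print(ranks)   # feature 1 should receive the highest rank
```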
2.6 Weighting Schemes
Different ranking methods produce different types of values and
therefore need to be normalized by a certain scheme to be converted
into weights [16]. Below are the two simple weighting schemes used
in this thesis.
Normalized Max Filter (NMF)
Equation 2.27 states that for positive ranks (J+), the absolute
value of the rank is divided by the maximum rank. In the case of
negative ranks (J−), the absolute value of the rank is subtracted
from the sum of the maximum and minimum ranks and then divided by
the maximum rank.
W_NMF(J) = |J| / J_max                      for J+
W_NMF(J) = (J_max + J_min − |J|) / J_max    for J−    (2.27)
Normalized Range Filter (NRF)
This scheme is nearly identical to NMF. However, for both positive
and negative ranks, the minimum rank is added to the numerator and
the denominator. In the case of NRF, the weights lie in
[2·J_min/(J_max + J_min), 1] [16].
W_NRF(J) = (|J| + J_min) / (J_max + J_min)              for J+
W_NRF(J) = (J_max + 2·J_min − |J|) / (J_max + J_min)    for J−    (2.28)
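Both schemes can be sketched as follows; in this sketch it is assumed that J_max and J_min denote the largest and smallest absolute rank values, following the description of NMF and NRF above:

```python
import numpy as np

def nmf_weights(J):
    """Normalized Max Filter applied to a vector of ranks."""
    a = np.abs(J)
    jmax, jmin = a.max(), a.min()
    return np.where(J >= 0, a / jmax, (jmax + jmin - a) / jmax)

def nrf_weights(J):
    """Normalized Range Filter: J_min added to numerator and denominator."""
    a = np.abs(J)
    jmax, jmin = a.max(), a.min()
    return np.where(J >= 0, (a + jmin) / (jmax + jmin),
                    (jmax + 2 * jmin - a) / (jmax + jmin))

J = np.array([0.9, -0.4, 0.1, -0.8])
print(nmf_weights(J))
print(nrf_weights(J))
```

For this example vector the NRF weights indeed fall in [2·J_min/(J_max + J_min), 1] = [0.2, 1].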
Table 2.3: Confusion matrix of a binary classifier

                       True 0                 True 1
Predicted 0    True Negative (TN)    False Negative (FN)    N*
Predicted 1    False Positive (FP)   True Positive (TP)     P*
                       N                      P
2.7 Model Selection and Evaluation
The procedure of building a regression or classification model
begins with shortlisting a set of candidate models that may suit
the specifications of the problem. Then some model tuning and
evaluation steps have to be taken to find the best setting of the
best performing model. The interpretation of "the best" relies
solely on the nature of the problem; there is no one-size-fits-all
evaluation metric to calculate. Therefore, this section starts by
introducing a number of essential metrics and motivates why some
are preferred in this thesis. Finally, some popular model tuning
and validation methods will be discussed.
Metrics
In classification problems, a common approach to evaluate model
performance is through the confusion matrix. The confusion matrix
of a binary classifier consists of the components displayed in
table 2.3.
Several evaluation metrics can be calculated from the confusion
matrix. Here a number of these metrics and their respective
applications will be described.
FPR and FNR

FPR = FP / N    (2.29)

FNR = FN / P    (2.30)
FPR, also known as the probability of false alarm, specifies how
often the model incorrectly classifies a sample from class 0 as
class 1. In contrast, FNR shows the probability of the model
incorrectly classifying a sample from class 1 as class 0. There is
usually a trade-off between FPR and FNR; therefore, the two scores
cannot be improved simultaneously.
Precision and Recall
The precision and recall metrics can be computed using the
components in table 2.3. Precision is obtained by dividing the
number of correctly predicted positive points by the number of all
points predicted as positive. A larger value indicates that a
greater share of the model's positive predictions are correct.
Precision = TP / (TP + FP)    (2.31)
Recall is calculated by dividing the number of correctly predicted
positive points by the number of all relevant points (regardless of
whether they were correctly predicted or not). This ratio indicates
what portion of the true labels is identified by the model.

Recall = TP / (TP + FN)    (2.32)
Figure 2.12: A comparison of the ROC Curve (left) and the Precision-Recall Curve (right)
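Equations 2.31 and 2.32 can be checked directly against scikit-learn's metric functions; the labels below are a toy example:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, recall)
assert precision == precision_score(y_true, y_pred)
assert recall == recall_score(y_true, y_pred)
```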
Similar to FPR and FNR, the two scores are subject to a trade-off.
The preference for either score depends on how the results are to
be interpreted and how the model is to be used in practice. An
alternative solution is to use the F1 score.
F1 score
F1 aims to find the right balance between Precision and Recall by
combining the two using equation 2.33.
F1 = (2 · precision · recall) / (precision + recall)    (2.33)
F1 is widely used as an overall as well as per class score to
evaluate a classification model.
Metrics for imbalanced data
Models trained on imbalanced data cannot be simply evaluated
through their accuracy (ratio of correctly predicted points over
all points, regardless of the class label). This score does not
take into account the per class performance. If a problem is
imbalanced, where 0 is the most common class, the F1 score is a
good alternative to evaluate the model. However, F1 does not
distinguish between the severity of FNR and FPR, so one has to
decide which score to trade off in favor of the other [20]. In
other words, the F1 score does not express a preference for either
precision or recall.
An alternative solution is threshold tuning. In probabilistic
approaches to binary classification, the model calculates the
posterior probability of an observation belonging to each of the
classes 0 and 1. By default, a cut-off threshold is set on the
probabilities to assign the class labels. In threshold tuning,
instead of setting a single discriminative threshold, an FPR and
TPR are calculated for many thresholds in the [0, 1] interval. From
the obtained values the Receiver Operating Characteristic Curve
(ROC Curve) can be plotted. The curve has FPR on the x-axis and TPR
on the y-axis (Figure 2.12).
The objective is to maximize the Area Under the ROC Curve
(ROC-AUC). AUC can take any value between 0 and 1. A value of 0.5
amounts to a random guess, while a value of 1 (a curve hugging the
top left corner) represents a perfect model. The suitability of
ROC-AUC for imbalanced problems is still debated, because ROC-AUC
measures a model's overall performance and does not reflect the per
class performance. A better alternative is the area under the
precision-recall curve [20].
The precision-recall curve (PR curve) has recall scores on the
x-axis and precision scores on the y-axis. The PR curve of a good
model gets close to the top right corner of the plot. Depending
on the domain and sensitivity of the problem, a higher recall or
precision may be preferable. In any case, the area under the PR
curve can better reflect the model performance compared to ROC-AUC
when working with imbalanced datasets.
Average Precision
The area under the PR curve is itself debated. [9] argues that in
ROC space it is possible to linearly connect (interpolate) a value
of TPR to a value of FPR. However, the same does not hold for
precision-recall space: precision is not necessarily linear in
recall, since FP rather than FN appears in the denominator of the
precision score. In this case, linear interpolation yields an
overly optimistic estimate of performance. The incorrect
interpolation is most problematic when recall and precision scores
are far apart and the local skew is high. The Average Precision
(AP) score overcomes this problem. The AP score for n thresholds is
calculated as
AP = Σ_n (R_n − R_{n−1}) · P_n    (2.34)
According to equation 2.34, the precision (denoted by P) at the nth
threshold is weighted by the change in recall (denoted by R) from
threshold n−1 to threshold n. The final score is the sum of the
weighted precision scores.
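Equation 2.34 can be computed directly from the PR curve and compared with scikit-learn's implementation. Note that `precision_recall_curve` returns points ordered by decreasing recall, hence the sign flip in the sum; the scores below are a toy example:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.5, 0.9, 0.15])

precision, recall, _ = precision_recall_curve(y_true, y_score)
# Sum of precision weighted by the (negative) recall increments.
ap_manual = -np.sum(np.diff(recall) * precision[:-1])
ap_sklearn = average_precision_score(y_true, y_score)
print(ap_manual, ap_sklearn)
```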
Model Tuning
Identifying the proper metric to optimize is key to finding the
right hyperparameter values. The search for the best hyperparameter
setting is a sensitive task. However, it is not realistic to find
the perfect model, since that would involve evaluating many values
for every hyperparameter. For large datasets, this becomes
impractical in terms of time and computational cost. Here, two
model tuning approaches will be discussed briefly.
Grid Search
Grid search is a popular approach for hyperparameter tuning. Given
a set of values for a set of hyperparameters, grid search performs
an exhaustive search over all possible combinations of values and
hyperparameters to find the best setting. For instance a grid
search over 4 values of parameter A (A = (a, b, c, d)) and 3 values
of parameter B (B = (e, f , g)) creates 12 unique combinations.
Therefore, the objective model is fit to the training data 12 times
with all the unique possibilities. In every step, a score (average
precision, ROC-AUC etc) is calculated on the held-out data. The
combination that yields the highest score is selected and its
corresponding hyperparameter values are returned.
To further minimize the generalization error, the grid search can
be used with K-fold Cross Validation. The K-fold cross validation
method can be summarized as:
• Shuffle the training set and divide it into K different
folds
• For k = 1, ..., K repeat:
– Select one fold as the held out or test set
– Train the model on the remaining K − 1 folds
– Evaluate the model on the held out fold
– Calculate the evaluation score
• Calculate the mean of all evaluation scores to find model
performance
In this method, every sample is used once in the test set and K − 1
times in the training set. Cross validation helps reduce model
variance and increase bias. This method is usually coupled with
grid search. In grid search, the number of unique combinations can
grow exponentially when the grid size is increased. Therefore, a
budget has to be determined first, based on the amount of available
computational resources, time, etc. Then the grid size has to be
limited to fit the budget. Very often, the grid becomes so limited
that a good hyperparameter setting cannot be found. A faster
alternative is to combine grid search with the successive halving
technique.
Successive Halving
Successive halving (SH) proposed by Jamieson et al. [15], is an
iterative process that aims to reduce the budget requirements while
trying to retain the best possible results. SH has a variety of
applications and is gaining popularity in deep learning models with
large training data and several parameters. Its combination with
grid search enables searching a larger grid while keeping the
budget low. Steps of SH with grid search are as follows:
1. Select a small number of resources (subsample of the training
data)
2. Train all candidate models on the selected resources
3. Evaluate candidate models using a proper metric (i.e. average
precision)
4. Select only a portion of the top-performing candidates for the
next iteration
5. Increase the number of resources (subsample size)
6. Train the remaining candidate models on the increased
resources
7. Repeat steps 3-6 until all resources are exhausted and the best
performing candidate is selected
This tournament-style approach avoids spending the budget on
low-performing candidates. This way, only the top-performing
candidates survive until the last iteration, where they are
evaluated with all available resources. The most common resource to
increase at every iteration is the number of training samples.
However, a different resource, like the number of estimators in a
Random Forest, can be selected. It is also possible to decide how
many resources to allocate initially and by which factor to
increase them in every iteration. The number of candidates is
reduced by the same factor at every step. Although SH has proven to
be budget efficient, it may result in a sub-optimal solution. If a
grid search fits the available budget, it is guaranteed to return
the best performing model among all possible candidates.
3 Method
This chapter describes the methodology used in this thesis. The
primary challenge is to find the best combination of classifier and
oversampling technique. For this purpose, an evaluation procedure
is defined to evaluate several candidate models. An additional
procedure is also designed to weight features in the dataset using
several feature ranking and weighting schemes. Then, for every set
of learned weights, a KNN classifier is fit to the weighted
training set and the results are compared to the baseline KNN.
Before discussing the methodology, a brief description of the data
is provided.
3.1 Data Description
Data for this thesis was provided by a company that offers Supplier
Relationship Management (SRM) tools. SRMs are web-based platforms
that enable businesses to search, select, communicate and negotiate
with different suppliers. The data was collected from the Google
Analytics API. Google Analytics provides a diverse range of
variables on the characteristics and behavior of users that visit
the website. It allows businesses to set up several goals (e.g. a
desired action like a newsletter subscription or demo request) and
track how and which users interact with them.
The data consists of over 110,000 observations. Every observation
contains 11 attributes (features) of a user's session on the
company's website. Users are identified by an anonymous and unique
Client ID. The class label, Goal Conversion, is a value between 0
and 100. It indicates (in percentage) to what extent a client has
completed the defined goals on the website per session. For
instance, a client that has requested a demo of the product and
subscribed to the newsletter gets a higher Goal Conversion value
than a client that has only subscribed to the newsletter. When no
desired action is taken in a session, the Goal Conversion value is
0. However, this project is only interested in finding out whether
a client has taken any desired action, regardless of the extent.
Therefore, the class label is converted into a binary variable.
There are only 875 records where conversion = 1 (positive class).
The rare positive samples make the dataset severely imbalanced:
over 99 percent of the samples belong to class 0 and only about 0.7
percent to class 1. In this setting, undersampling the data is
equivalent to discarding a large portion of the samples, leaving
very little information for training the models. Therefore, several
oversampling techniques will be used
Name                     Type          Description
ClientID                 Alphanumeric  Unique user identifier
WeekDay                  Integer       Numerical representation of weekdays (Sunday = 0, Saturday = 6)
Hour                     Integer       Hours within a day (0 to 23)
IsFromSocial             Binary        Whether the user was referred from social media (1) or not (0)
BounceRate               Float         Probability of the user leaving the site quickly (percentage)
DeviceCategory           Categorical   Device used by the user (desktop, mobile or tablet)
Source                   Categorical   Source website the user comes from (google, linkedin, etc.)
Medium                   Categorical   Medium from which the user entered (organic, paid, etc.)
UserType                 Categorical   Whether the user is new or returning (two categories)
SessionsPerUser          Integer       Number of times the user visited the website
AvgSessionDuration       Integer       Average time the user spends on the website (seconds)
PageviewsPerSession      Integer       Average number of pages visited in every visit to the website
GoalConversion (label)   Binary        Whether the user took a desired action (1) or not (0)

Table 3.1: Characteristics of the input data
throughout this project. Table 3.1 contains the description of the
input variables. Characteristics of the data are further described
in chapter 4.
3.2 Preprocessing
The input data contains a variety of feature types, demanding
different preprocessing methods. For categorical variables with
more than 5 categories, the most frequent categories are retained
and the rest are aggregated under an "other" category. This method
has been applied to the Medium and Source variables. The variable
IsFromSocial contains two factor levels, yes and no, which are
converted to binary 1 and 0. The Bounce Rate variable is originally
on a percentage scale; it is normalized to values between 0 and 1
using min-max normalization. In order to remove outliers from
continuous variables, the interquartile range method has been
applied. For each of the variables with outliers, the scale
parameter K is chosen empirically by visually inspecting the
outliers. This technique is applied to three variables, namely
Pageviews Per Session, Average Session Duration and Sessions Per
User. All categorical variables are one-hot encoded to obtain a
numerical representation. The ClientID variable is removed from the
data since it is not a predictor. The data is split in an 80/20
ratio into training and test sets. Note that all splits and cross
validations used in this thesis are stratified. This is to ensure
that positive class samples are proportionally distributed among
the resulting splits, which is essential in imbalanced
classification tasks.
3.3 Classifiers
Six different classifiers are selected for tuning and evaluation.
Each model is tuned using Successive Halving Grid Search with
5-fold cross validation on the training set. Average Precision is
used as the optimization score. The classifier tuning is first
performed using the non-oversampled data. The process is then
repeated once more using the data that was oversampled with Random
Oversampling. The motivation behind this is
to see the effect of oversampling on the optimum hyperparameter
values. Also, to avoid excessive tuning, it is assumed that the
best hyperparameter values when using Random Oversampled data are
also valid when using other resampling techniques. The
hyperparameters that are selected for tuning for each classifier
are briefly described below. Note that this thesis uses the Scikit
Learn library to implement many of its models, including the
classifiers and the Permutation Importance algorithm. Scikit Learn
is an open-source library written for the Python programming
language. It features implementations of several machine learning
algorithms for both classification and regression.
Logistic Regression - Two of the essential parameters for tuning
are penalty and C that control the type and amount of
regularization. Lasso, Ridge, and Elastic net regularization
methods have been evaluated for different values of C. The
intention is to find out if, how and to what extent reduction in
model complexity is necessary. The solver parameter is also tuned
to find the best optimization algorithm for minimizing the loss
function. Note that some regularization methods work only with
certain optimization algorithms.
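As an illustration, a grid covering the penalty, C, and solver choices described above might look as follows. The particular values are assumptions, not the grid used in this thesis; note that elasticnet requires the saga solver and an l1_ratio:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# saga supports l1, l2 and elasticnet; elasticnet also needs l1_ratio.
param_grid = [
    {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10], "solver": ["saga"]},
    {"penalty": ["elasticnet"], "C": [0.01, 0.1, 1, 10],
     "l1_ratio": [0.5], "solver": ["saga"]},
]
search = GridSearchCV(LogisticRegression(max_iter=5000), param_grid,
                      scoring="average_precision", cv=3)
search.fit(X, y)
print(search.best_params_)
```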
Support Vector Classifier - The main hyperparameter for Support
Vector Classifiers is C that controls the bias-variance trade-off
in the model. The C parameter in SVC acts similarly to the C
parameter in Logistic Regression when L2-regularization is used.
The value of the kernel parameter has been fixed to rbf, which
provides more flexible decision boundaries compared to linear or
polynomial kernels. The gamma parameter is a parameter of the RBF
kernel that controls the sensitivity of the kernel towards the
distance between two training points.
Considering that the training set contains more than 88,000
observations, training an SVC is considerably slower than training
the other models. Therefore, the grid search has to be limited to
fewer hyperparameters and values to make the tuning feasible. This
issue escalates when tuning the oversampling methods on SVC due to
the increased number of training observations. For this purpose,
the tuning of oversampling methods on SVC has also been limited to
fewer hyperparameter values.
Decision Tree - Trees can be highly customized with several
parameters. A number of these parameters control the structure of
the tree. For instance, max depth determines the number of levels
in the tree. Deeper trees are more complex, meaning they are more
customized to the training set. Therefore, the depth of the tree
controls the level of bias and variance in the model. The minimum
number of allowed observations in each terminal (leaf) node (min
samples leaf ), the minimum number of samples required to make a
new split (min samples split) and the maximum number of leaf nodes
allowed (max leaf nodes) can further control the tree structure.
The splitting strategy has also been tuned using the splitter
parameter. Splitting at each node is done to minimize the loss
criterion (Gini or entropy). This can be carried out using two
different strategies. One, by using the best feature (feature with
the highest importance score) and the best threshold value at each
split. Two, by choosing the best feature and a random threshold
value at each split. Trees are not pruned by default. However, cost
complexity pruning has been applied and tuned with the ccp alpha
parameter.
Random Forest - Random Forest constructs trees similarly to a
Decision Tree and therefore inherits the same set of parameters for
this purpose. The distinguishing parameter of Random Forest is n
estimators, which indicates the number of trees to grow in the
forest. While larger values for this parameter can potentially
reduce generalization error, too large values impose unnecessary
computational costs with no significant improvement in the final
results. The portion of features used in each bootstrap has been
tuned using the max features parameter. In this project, pruning is
not performed for the trees in the Random Forest, since ensemble
models already handle variance reduction and further pruning may
become redundant.
K Nearest Neighbors - The behavior of the KNN classifier is mainly
controlled by the n neighbors parameter. This parameter determines
the number of nearest neighbors to use to set the label of a test
observation. An increase in the value of this parameter increases
model