Prediction of Lead Conversion With Imbalanced Data
liu.diva-portal.org/smash/get/diva2:1566623/FULLTEXT01.pdf
Linköpings universitet
SE-581 83 Linköping
+46 13 28 10 00
www.liu.se

Linköping University | Department of Computer and Information Science
Master's thesis, 30 ECTS | Datateknik
LIU-IDA/STAT-A--21/031--SE
Prediction of Lead Conversion With Imbalanced Data – A method based on Predictive Lead Scoring
Ali Etminan
Copyright
The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.
© Ali Etminan
Abstract

An ongoing challenge for most businesses is to filter out potential customers from their audience. This thesis proposes a method that takes advantage of user data to distinguish potential customers from random visitors to a website. The method is based on the Predictive Lead Scoring method that segments customers based on their likelihood of purchasing a product. Our method, however, aims to predict user conversion, that is, predicting whether a user has the potential to become a customer or not.

Six supervised machine learning models have been used to carry out the classification task. To account for the high imbalance in the input data, multiple resampling methods have been applied to the training data. The combination of classifier and resampling method with the highest average precision score has been selected as the best model.

In addition, this thesis tries to quantify the effect of feature weights by evaluating some feature ranking and weighting schemes. Using these schemes, several sets of weights have been produced and evaluated by training a KNN classifier on the weighted features. The change in average precision relative to the original KNN (without weighting) is used as the reference for measuring the performance of the ranking and weighting schemes.
Acknowledgments
First, I would like to give a special thanks to my supervisor, Joel
Oskarsson, for giving me exceptional feedback throughout the entire
thesis. His support has had a great impact on the quality of this
report.
I would also like to thank my examiner, Ander Grimval, and my
opponent, Mudith Silva, for their constructive comments and
valuable suggestions. I wish to express my gratitude to Samuel
Jenks and Malin Schmidt for providing me with the necessary tools
and information required for this thesis.
Finally, I would like to thank my lovely wife, Romina. Without her unconditional support, I would not have made it through this journey. And my little daughter, Nila, whose presence is the main source of my ambitions.
Contents

1 Introduction
   1.1 Background
   1.2 Objectives

2 Theory
   2.1 Lead Scoring and Predictive Lead Scoring
   2.2 Data Preprocessing
   2.3 Resampling
   2.4 Classifiers
   2.5 Feature Ranking
   2.6 Weighting Schemes
   2.7 Model Selection and Evaluation

3 Method
   3.1 Data Description
   3.2 Preprocessing
   3.3 Classifiers
   3.4 Oversampling
   3.5 Ranking and Weighting
   3.6 Evaluation

4 Results
   4.1 Exploratory Data Analysis
   4.2 Classification with oversampling
   4.3 Feature-weighted Classification

5 Discussion
   5.1 Results
   5.2 Method
   5.3 The work in a wider context

6 Conclusion

7 Appendix
   7.1 Classification with Oversampling
   7.2 Feature-weighted Classification
List of Figures

2.1 Sales Funnel - Source: Duncan and Elkan, 2015 [10]
2.2 Random Oversampling
2.3 Random Oversampling Examples (ROSE)
2.4 Oversampling with SMOTE
2.5 Oversampling with K Means SMOTE
2.6 Oversampling with SVM SMOTE
2.7 Oversampling with Borderline SMOTE 1
2.8 Oversampling with Borderline SMOTE 2
2.9 Comparison of Random oversampling with different variants of SMOTE
2.10 Support Vector Machine with linear decision boundary
2.11 Logistic curve
2.12 A comparison of ROC Curve (left) and Precision-Recall Curve (right)
4.1 Distribution of source and medium of referrals to the website
4.2 Average number of sessions and average bounce rate over the weekdays, separated by social/non-social media users (left); average number of sessions and average bounce rate over 24 hours, separated by social/non-social media users (right)
4.3 Distribution of the scaled average number of converted/not converted users by days of the week (left) and over 24 hours of the day (right)
4.4 Pageviews Per Session and Average Session Duration before removing outliers (top) and after removing outliers (bottom)
4.5 Correlation heatmap for numerical and binary variables in the data
4.6 Logistic Regression with SMOTE Tomek link
4.7 Support Vector Classifier with SVM SMOTE
4.8 Decision Tree with no oversampling
4.9 Random Forest with Random Oversampling
4.10 K Nearest Neighbors without oversampling
4.11 Gradient Boosting without oversampling
4.12 Feature ranks calculated by Permutation Importance method with the top performing Gradient Boosting model as its estimator
4.13 Feature ranks calculated with Pearson Correlation Coefficient method (left) and Fisher Coefficient method (right)
4.14 Absolute values of feature weights calculated by applying NMF scheme to PCC and FC ranks
4.15 Absolute values of feature weights calculated by applying NRF scheme to PCC and FC ranks
4.16 KNN classifier performance using features weighted with PCC and NRF schemes
7.1 Logistic Regression with no oversampling
7.2 Logistic Regression with Random Oversampling
7.3 Logistic Regression with SMOTE
7.4 Logistic Regression with SVM SMOTE
7.5 Logistic Regression with K Means SMOTE
7.6 Logistic Regression with Borderline SMOTE
7.7 Logistic Regression with ADASYN
7.8 Support Vector Classifier with no oversampling
7.9 Support Vector Classifier with Random Oversampling
7.10 Support Vector Classifier with SMOTE
7.11 Support Vector Classifier with SMOTE Tomek Link
7.12 Support Vector Classifier with K Means SMOTE
7.13 Support Vector Classifier with Borderline SMOTE
7.14 Support Vector Classifier with ADASYN
7.15 Decision Tree with Random Oversampling
7.16 Decision Tree with SMOTE
7.17 Decision Tree with SVM SMOTE
7.18 Decision Tree with SMOTE Tomek Link
7.19 Decision Tree with K Means SMOTE
7.20 Decision Tree with Borderline SMOTE
7.21 Decision Tree with ADASYN
7.22 Random Forest with no oversampling
7.23 Random Forest with SMOTE
7.24 Random Forest with SVM SMOTE
7.25 Random Forest with SMOTE Tomek Link
7.26 Random Forest with K Means SMOTE
7.27 Random Forest with Borderline SMOTE
7.28 Random Forest with ADASYN
7.29 K Nearest Neighbors with Random Oversampling
7.30 K Nearest Neighbors with SMOTE
7.31 K Nearest Neighbors with SVM SMOTE
7.32 K Nearest Neighbors with SMOTE Tomek Link
7.33 K Nearest Neighbors with K Means SMOTE
7.34 K Nearest Neighbors with Borderline SMOTE
7.35 K Nearest Neighbors with ADASYN
7.36 Gradient Boosting with Random Oversampling
7.37 Gradient Boosting with SMOTE
7.38 Gradient Boosting with SVM SMOTE
7.39 Gradient Boosting with SMOTE Tomek Link
7.40 Gradient Boosting with K Means SMOTE
7.41 Gradient Boosting with Borderline SMOTE
7.42 Gradient Boosting with ADASYN
7.43 Pearson Correlation Coefficient with Normalizing Max Filter
7.44 Fisher Coefficient with Normalizing Max Filter
7.45 Fisher Coefficient with Normalizing Range Filter
7.46 Permutation Importance with Normalizing Max Filter
7.47 Permutation Importance with Normalizing Range Filter
List of Tables

2.1 Example of lead attributes and their rankings [22] (source: Aberdeen Group, 2008)
2.2 Suggested marketing actions for each lead score [19]
2.3 Confusion matrix for binary classification
3.1 Characteristics of the input data
4.1 Best hyperparameter values for Logistic Regression and SMOTE Tomek link
4.2 Best hyperparameter values for Support Vector Classifier and SVM SMOTE
4.3 Best hyperparameter values for Decision Tree with no oversampling
4.4 Best hyperparameter values for Random Forest with Random Oversampling
4.5 Best hyperparameter values for K Nearest Neighbors without oversampling
4.6 Best hyperparameter values for Gradient Boosting without oversampling
4.7 ROC-AUC and Average Precision (AP) scores of all combinations of classifiers with resampling methods
4.8 Estimated run-time of classifiers with and without oversampling in seconds
4.9 Comparison of Average Precision obtained with KNN using different feature ranking and weighting schemes
7.1 Complete list of hyperparameters and their corresponding grid of values used for classifier tuning
7.2 Complete list of hyperparameters and their corresponding grid of values used for tuning resampling methods
1 Introduction
1.1 Background
The urge for digital transformation has swept across the entire business community in the past two decades. The rapid evolution of tools that retrieve, analyze, and transform business data has forced organizations to constantly shift and adapt their strategies to unravel consumer behavior. This has emphasized the use of consumer data to find existing patterns and associations in consumer actions and characteristics. Organizations dedicate a substantial amount of budget, time, and human resources to building data infrastructures to track and make inferences about their consumers' habits. This in turn becomes the foundation of every consequent decision made by marketing and sales teams.
Larger organizations gain their competitive advantage by tracking a diverse range of features about their users across their digital platforms to build sophisticated data models. In general, these models require a domain expert to determine certain parameters and do the decision-making manually, following business objectives. The Lead Scoring methodology is widely practiced by marketing and sales professionals to score and prioritize users based on their data. Lead scoring assigns an importance score to lead (prospective user) interactions like watching a demo, filling out a form, opening an email, etc. This can also be extended to demographic features such as the age, gender, and location of the lead. A total score is then calculated for every lead that determines the likelihood of the lead becoming a customer. Marketers use this score to segment leads and create a different marketing strategy for every segment. Leads with the highest scores will be contacted by the sales team, while leads with lower scores will be treated with proper marketing content and follow-ups.
While lead scoring is built upon data, it still depends on experts to manually set the importance scores, which need ongoing re-evaluation as business priorities and consumer behaviors evolve. Therefore, the method cannot be deemed fully data-driven. Over recent years, there has been a growing interest in the utilization of machine learning models to automate the process of scoring leads, widely referred to as Predictive Lead Scoring or Automated Lead Scoring. Studies suggest that rather than relying on pure domain knowledge or gut feeling, organizations should pursue predictive lead scoring as a replacement or complement to manual lead scoring [24].
Input data for predictive lead scoring can consist of website and social media data, as far as attributes can be traced back to a lead. Data generated through Customer Relationship Management tools (tools that allow businesses to manage and
integrate marketing and sales activities) is another common source
of input data. Numerous studies have attempted to investigate the
performance of probabilistic and non-probabilistic models in this
domain. Nygård et al. [24] compares the model performance of
Logistic Regression, Decision Trees, Random Forests, and Neural
Networks. The choice of models reveals how linear, non-linear,
ensemble, and deep models perform on this particular problem. In
contrast, Duncan et al. [10] compares two probabilistic models for
this problem, and Benhaddou et al. [2] experiments with Bayesian Networks, which work best with small and expert-curated datasets.
The application of machine learning in lead scoring has been a
subject of interest by many researchers in recent years. Current
implementations are mostly patented and offered as commercial
features. Therefore, there is an evident demand from the business
community for further research and exploration of this methodology.
This thesis aims to build a predictive model by investigating some linear and non-linear classifiers. It also studies the contribution of features to the prediction task by applying a number of feature ranking and weighting schemes. The weights are then applied to the features to find out the actual importance of each feature as well as to compare ranking and weighting methods. The explicit use of feature weights is commonly overlooked by similar studies, because the type and source of their input data demand a different data preparation approach.
1.2 Objectives
The data used in this thesis consists of 11 features and a highly imbalanced binary class variable. The extreme class imbalance puts the task in the same category as disease or fraud detection tasks. These tasks suffer from rare positive samples that make the learning process challenging. Therefore, as the primary objective, this thesis evaluates oversampling methods that handle problems with rare positive labels. As the second objective, it aims to find the best combination of classifier and resampling method.

Considering the significance of feature weights in predictive lead scoring, the thesis defines its third objective as an evaluation of three feature ranking and two feature weighting schemes. The weighting schemes are used to normalize the ranks. In this stage, we introduce the combination of ranking and weighting scheme that best represents the features in the input data.
The outline of this report can be described as follows:
• Chapter 2. Theory: Description of theories behind methods used by
this thesis
• Chapter 3. Methodology: The methodology used by this thesis
• Chapter 4. Results: Exploratory data analysis and report of achieved results with additional plots and tables
• Chapter 5. Discussion: Discussion of strengths and weaknesses
concerning obtained results and comparison with related work
• Chapter 6. Conclusion: Conclusion on how this paper can provide a
new reflection on the topic of predictive lead scoring
2 Theory
This chapter introduces the theory and mathematical formulations behind the concepts used in this thesis. It gives the reader intuition about the choice of certain models and parameters in subsequent chapters, and it enhances the reproducibility of the methodology used in this thesis.
2.1 Lead Scoring and Predictive Lead Scoring
As briefly discussed in the introduction chapter, identifying and prioritizing users with the potential of taking a desired action or making a purchase is a challenging task for many businesses. In business jargon, these users are referred to as leads. First, we discuss the lead scoring method that has helped many businesses tackle this challenge. Then we briefly review the theories behind machine-learning-assisted lead scoring, known as Predictive Lead Scoring (a.k.a. Automated Lead Scoring).
Lead Scoring
In a study performed by Aberdeen Group [22], lead scoring is defined as:
Lead scoring is a technique for quantifying the expected value of a
lead or prospect based on the prospect’s profile, behavior (online
and/or offline), demographics, and likelihood to purchase.
The quantification is done by assigning a score to all attributes of a lead. Scores are determined by marketing experts who value each attribute according to business priorities. [22] introduces a number of these attributes, which are presented in table 2.1.
The sum of the scores attributed to a single lead quantifies the lead's readiness (likelihood) to make a purchase. Leads can be divided into several groups depending on their level of readiness. Those with a high score are considered hot leads, while those with a smaller score are referred to as cold leads. Leads with the lowest scores are in the awareness stage, in which they have only an initial impression of the company and its products. Leads with higher scores are considered MQLs (Marketing Qualified Leads) or SQLs (Sales Qualified Leads) depending on their score. MQLs generally require nurturing to be prepared for sales.
Nurturing may include contacting the lead with more marketing and
promotional materials. SQLs, on the other hand, are deemed to be closer to a sales point and are thus contacted directly by the sales team.

Attribute                                                  Rank
Webinars attended                                          5.0
Purchase propensity scores                                 4.5
Email click-throughs                                       4.0
Website activity (pages visited and recency)               4.0
Website activity (type of activity)                        3.5
Keywords clicked                                           3.5
Website activity (length of time each page was visited)    3.0
Attitudinal and lifestyle information                      3.0

Table 2.1: Example of lead attributes and their rankings [22] (source: Aberdeen Group, 2008)

Figure 2.1: Sales Funnel - Source: Duncan and Elkan, 2015 [10]

Figure 2.1 depicts a sales funnel that demonstrates the different stages of a lead from a business perspective. A study by [19] on a company that provides business solutions suggests how the business can approach leads in different stages. These suggestions are presented in table 2.2. In this table, leads are categorized based on their profiles and engagement. Leads that have provided many details in their profiles are categorized as target fit, while leads that provide only basic information are labeled as potential fit. In addition, the level of the lead's past engagements with the company determines whether the lead has high interest, medium interest, or low engagement [19].
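The summation at the heart of manual lead scoring can be sketched in a few lines. This is an illustrative sketch, not the thesis's implementation: the attribute keys below are hypothetical, while the numeric ranks are taken from table 2.1.

```python
# Sketch of manual lead scoring: total score = sum of expert-assigned
# attribute ranks (ranks from Table 2.1; dictionary keys are hypothetical).
ATTRIBUTE_RANKS = {
    "webinars_attended": 5.0,
    "purchase_propensity_score": 4.5,
    "email_click_throughs": 4.0,
    "pages_visited_recency": 4.0,
    "keywords_clicked": 3.5,
}

def lead_score(attributes):
    """Sum the ranks of the attributes a lead exhibits; unknown ones score 0."""
    return sum(ATTRIBUTE_RANKS.get(a, 0.0) for a in attributes)

score = lead_score(["webinars_attended", "email_click_throughs"])
print(score)  # 9.0
```

Segmenting leads into hot and cold groups then amounts to thresholding this total score.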
Predictive Lead Scoring
In the conventional lead scoring method introduced in the previous section, attribute scores are assigned manually. Scores need constant revision and adjustment as business priorities and objectives change. Moreover, since a human expert decides on the scores, the process becomes error-prone. Using machine learning to find potential prospects automates the lead scoring process. Instead of explicitly assigning importance scores to lead attributes, these are learned from the data. Similar to lead scoring, predictive lead scoring can use any attribute as input. However, since this method involves constructing an automated model, the integration of data from various sources becomes a challenge.
Lead Description                Marketing Action
Target fit, High interest       Send email that encourages the lead to leave a phone number or call in directly, or propose to purchase the product directly.
Target fit, Medium interest     Send offer of a free trial of the service, or propose relevant material that is close to purchase of the product.
Target fit, Low engagement      Priority lead that needs further nurturing and a "why now" message.
Potential fit, High interest    Send email that encourages the lead to leave a phone number or to call in directly, or propose to purchase the product directly.
Potential fit, Medium interest  Continue to nurture with marketing materials that can increase interest; send offer of a free trial of the service. Pursue information to evaluate if it is a good fit.
Potential fit, Low engagement   Send nurturing content that can create a demand for the product. Pursue information to evaluate if it is a good fit.

Table 2.2: Suggested marketing actions for each lead score [19]
Integration of data can be done by manually connecting different datasets. Some public platforms like Google Analytics also enable integration with certain CRM platforms like HubSpot. HubSpot provides a fully-featured digital marketing platform that also includes a CRM service. CRM-generated data is the most popular data source for predictive lead scoring. This data consists of features related to both personal and behavioral attributes. More importantly, it contains information about historical sales.
The predictive lead scoring model can be defined to solve a binary problem. In this setting, leads are classified as likely or not likely to purchase or convert. This can be solved by investigating the performance of binary classifiers. A study by Benhaddou et al. [2] also defines a binary task but uses a Bayesian Network with binary features to tackle this problem. On the other hand, another study by Duncan et al. [10] uses the stages of the sales funnel (see figure 2.1) as class labels. This approach would also predict the level of a lead's readiness for sales. The selection of the modeling approach is highly influenced by the type of available input data.
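A minimal sketch of this binary framing with scikit-learn. The features and conversion labels below are synthetic stand-ins (real inputs would come from CRM or web analytics data), so this only illustrates the shape of the task, not the thesis's actual pipeline.

```python
# Hypothetical sketch: lead conversion as binary classification.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 3))              # stand-ins for lead features
y = (rng.random(n) < 0.05).astype(int)   # ~5% converted: imbalanced labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]    # estimated conversion likelihood
```

The predicted probabilities play the role of the lead score: ranking leads by `proba` replaces the manually summed attribute ranks.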
2.2 Data Preprocessing
It is essential to create a proper representation of the data before proceeding to model building. Cleaning data can consist of simple steps like feature normalization, or more sophisticated steps such as outlier detection, feature encoding, or handling missing values.
Handling Outliers
Outliers can produce erroneous results with certain algorithms and therefore need to be removed properly from the data. Below we introduce two statistical approaches for this task.
Standard Deviation Method
When dealing with normally distributed data, and depending on how sensitive we are to outliers, the mean is first estimated and the cut-off threshold is set as:
• One standard deviation from the mean: covering about 68% of the
data
• Two standard deviations from the mean: covering about 95% of the
data
5
2.3. Resampling
• Three standard deviations from the mean: covering about 99.7% of
the data
Any data point beyond the threshold on either side of the mean will be considered an outlier and thus removed. Choosing a proper number of standard deviations depends on the domain and size of the data [7].
Interquartile Range Method
If the data is not normally distributed, a different method has to be applied for outlier removal. The Interquartile Range (IQR) is obtained by finding the 25th and 75th percentiles of the feature. These percentiles are referred to as quartiles since they divide the data into four quarters.

IQR = (Q3 - Q1) * k    (2.1)

According to equation 2.1, by subtracting the first quartile Q1 from the third quartile Q3 we obtain the range of feature values between the two quartiles, i.e. the interquartile range. This range is then multiplied by a factor k to adjust the cut-off threshold. k can take a value from 1.5 upward, depending on the range of values in the feature [7].
One-hot Encoding
Many machine learning models are not able to handle categorical data in its original form, so categorical features need to be presented in a way that models can interpret. In cases where categories are nominal and there is no ordering among them, the variable is converted to a one-hot representation. One-hot encoding expands a variable with k categories into k different binary variables, each representing one category. For instance, if X_j contains 3 categories A, B, and C, and if X_j^(i), the value of feature j at observation i, equals C, the one-hot representation would look like [0, 0, 1].

While this approach is simple to implement, it comes with a major drawback when the data is high-dimensional, contains high-cardinality categorical features, or both. In such cases, the data dimension escalates considerably, occupying a larger amount of space in memory. This may also leave the model with a large number of parameters, which is computationally inefficient.
An alternative method for ordinal variables is conversion into integer factors. The order of categories in ordinal variables has to be taken into consideration. For instance, if X_j contains 3 categories A, B, and C, they will be converted to numerical values as (A = 1, B = 2, C = 3). This avoids the increased-dimension issue but assumes that the categories are ordered.
2.3 Resampling
A common challenge in classification tasks is dealing with imbalanced classes in the dataset. In binary classification, the class that has a significantly smaller number of samples is called the minority class, while the other class is referred to as the majority class. Datasets with multiple classes (multinomial) can also suffer from an imbalanced class distribution. Overlooking this property can yield misleading results even if a sophisticated classifier is at play. It is therefore essential to experiment with resampling techniques to achieve results that are less biased towards the majority class(es). Depending on the task at hand, undersampling, oversampling, or a combination of both can be applied. Regardless of the method used, the resampling strategy can be either defined as a simple strategy (i.e. resample only the majority class or only the minority class) or customized by setting desired proportions for each class label.
Random Oversampling
Random oversampling involves duplicating samples from the minority class. The oversampling can be done iteratively to balance all classes with respect to the majority class. The duplication is carried out by sampling the minority data points with replacement (figure 2.2). Random oversampling is known to cause overfitting, since the generated samples overlap the original ones. Therefore, a certain amount of dispersion may be preferable to producing exact copies of the original data points. Assume that we want to generate a synthetic sample based on the original sample xi. We first sample x from a probability distribution K_Hj, where K_Hj is centered at xi and Hj is a matrix of scale parameters. K_Hj is usually a unimodal, symmetric distribution scaled by Hj. The new sample is thus generated in a neighborhood of xi whose width is determined by Hj. This method is also referred to as Random Oversampling Examples (ROSE) [21] (see figure 2.3).
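A minimal sketch of both ideas, plain duplication and a ROSE-style smoothed variant (here a single Gaussian bandwidth stands in for the full scale matrix Hj; all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X_min, n_new, bandwidth=0.0):
    """Draw n_new minority points with replacement; with bandwidth > 0,
    jitter each copy with Gaussian noise (a spherical-kernel ROSE sketch)."""
    idx = rng.integers(0, len(X_min), size=n_new)
    X_new = X_min[idx].astype(float)
    if bandwidth > 0:
        X_new += rng.normal(scale=bandwidth, size=X_new.shape)
    return X_new

X_min = np.array([[3.8, 3.9], [4.0, 4.1]])
copies = random_oversample(X_min, 5)                    # exact duplicates
smoothed = random_oversample(X_min, 5, bandwidth=0.05)  # dispersed samples
```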
SMOTE
Synthetic Minority Oversampling TEchnique (SMOTE) is another popular oversampling method that comes with a variety of implementations. As opposed to Random Oversampling, which simply duplicates data points, SMOTE uses the K nearest neighbors of the minority sample xi to generate synthetic samples.

xnew = xi + λ(xzi − xi) (2.2)

xzi is one of the nearest neighbors of sample xi from the minority class, and λ is a parameter between 0 and 1 that determines the distance between the new and the original sample. Baseline SMOTE uses a uniform distribution to select xi for generating a new sample [8] (see figure 2.4). This makes SMOTE sensitive to noise. A noisy sample from the minority class
that is among the majority samples has an equal probability of being selected for resampling. This may result in generating more noisy samples in regions where the majority samples have a high density [17].

Figure 2.3: Random Oversampling Examples (ROSE)
Other implementations of SMOTE take a different strategy in
selecting xi but still use equation 2.2 to generate new
samples.
While oversampling may balance the class distributions in the data, it cannot always compensate for a lack of information. One study shows that synthetic samples generated by SMOTE leave the expected value of the minority class unchanged while reducing its variance [4].
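Equation 2.2 itself is a one-line interpolation. A minimal sketch (the neighbor search is omitted and the minority neighbors are passed in directly; all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(x_i, minority_neighbors):
    """Equation 2.2: interpolate between x_i and a random minority neighbor."""
    x_zi = minority_neighbors[rng.integers(len(minority_neighbors))]
    lam = rng.uniform(0.0, 1.0)       # lambda in [0, 1]
    return x_i + lam * (x_zi - x_i)

x_i = np.array([0.0, 0.0])
neighbors = np.array([[1.0, 1.0], [2.0, 0.0]])
x_new = smote_sample(x_i, neighbors)  # lies on a segment from x_i to a neighbor
```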
K Means SMOTE
K Means SMOTE does the oversampling in three stages:
• Clustering: the data is clustered into k groups using the k-means clustering algorithm
• Filtering: only clusters with a high portion of minority samples are selected for oversampling
• Oversampling: SMOTE is applied to the filtered clusters using equation 2.2
First, the input space is clustered into k groups. In the filtering stage, the clusters in which minority samples make up more than 50 percent of the population are identified; applying SMOTE to these clusters reduces the chance of generating noisy samples. Moreover, the goal is also to achieve a balanced distribution of samples within the minority class: if there exist multiple clusters of minority samples, we want them to be oversampled to the same extent. Therefore, the filter step allocates more generated samples to sparse minority clusters rather than dense ones [17] (see figure 2.5). The imbalance ratio threshold (irt) hyperparameter determines the threshold for equation 2.3. Equation 2.3 divides the number of
Figure 2.4: Oversampling with SMOTE

majority samples in cluster c over the number of minority samples in the same cluster. The counts are incremented by 1 to avoid division by zero or ir = 0.

ir = (majorityCount(c) + 1) / (minorityCount(c) + 1) (2.3)
Clusters whose ir falls below this threshold are selected for oversampling. Lowering the threshold therefore makes the filtering more sensitive, requiring a cluster to contain a higher proportion of minority instances in order to be selected, while raising it has the opposite effect [17].
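The filtering rule follows directly from equation 2.3. A minimal sketch (the cluster counts and the threshold value are invented for illustration; here a cluster is kept when its ir falls below the threshold):

```python
def imbalance_ratio(majority_count, minority_count):
    """Equation 2.3; counts are incremented by 1 to avoid division by zero."""
    return (majority_count + 1) / (minority_count + 1)

# (majority, minority) counts per cluster -- illustrative values
clusters = {"c1": (2, 18), "c2": (15, 5), "c3": (0, 10)}
irt = 1.0  # imbalance ratio threshold

selected = [name for name, (maj, mino) in clusters.items()
            if imbalance_ratio(maj, mino) < irt]
# → ['c1', 'c3']: only clusters dominated by the minority class are kept
```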
SVM SMOTE
This method uses a Support Vector Machine (SVM) to select xi for resampling. In classification tasks, SVMs draw a hyperplane that has the maximum distance to the closest point(s) in each class. The hyperplane acts as an n-dimensional decision boundary that is usually accompanied by soft margins on each side. Points on or in between the soft margins and the hyperplane are referred to as support vectors. SVM SMOTE uses an SVM to first identify the support vectors and then uses them to generate new samples [23]. The motivation behind this is that support vectors have a significant effect on the position of the separating hyperplane and on classification performance in general (see figure 2.6). For further details on SVMs, please refer to section 2.4.
Borderline SMOTE1
Borderline SMOTE1 first checks the labels of xi's m nearest neighbors. It then proceeds to classify xi as one of the following:
• noise – all nearest neighbors are from a different class than that of xi.
Figure 2.7: Oversampling with Borderline SMOTE 1
• in danger – at least half of the nearest neighbors are from the same class as xi.
• safe – all nearest neighbors are from the same class as xi.
The algorithm selects a minority sample in danger together with its k nearest neighbors from the same class to generate a new sample. This means new samples are generated closer to the minority class border. It also prevents noisy samples from being selected for resampling, which occurs in the original SMOTE algorithm. However, it still uses equation 2.2 to generate new samples [11] (see figure 2.7).
Borderline SMOTE2
This algorithm operates similarly to Borderline SMOTE1, except that it allows the k nearest neighbors of xi to be from any class. The value of λ here ranges from 0 to 0.5, as opposed to 0 to 1 in Borderline SMOTE1; with λ < 0.5, the new sample is generated closer to the minority class [11]. An increased number of samples near the decision boundary tends to improve classification performance, which is the main motivation behind Borderline SMOTE2 (see figure 2.8). Figure 2.9 compares the behavior of Random Oversampling, ROSE, and different variants of SMOTE.
ADASYN
Adaptive Synthetic oversampling (ADASYN) is a special case of SMOTE. It generates new samples in proportion to the number of samples in a given neighborhood that are not from the same class as xi [13]. The steps performed by ADASYN are described in algorithm 1.
Algorithm 1: ADASYN
1. Calculate the number of synthetic samples G to be generated: G = (ml − ms) × β
   a) ml and ms are the number of majority and minority samples respectively.
   b) β specifies the desired balance level. β = 1 creates a perfectly balanced dataset.
2. For each xi in the minority class calculate ri = Δi / K
   a) Δi is the number of samples among the K nearest neighbors of xi that belong to the majority class.
3. Normalize ri as r̂i = ri / Σ ri, summed over all minority samples (so that the r̂i form a categorical distribution)
4. Calculate the number of samples to be generated for each minority sample as gi = r̂i × G
5. For each minority sample generate gi new samples using equation 2.2
One can consider the normalized ri as weights for the minority samples. It can thus be seen that ADASYN forces the learning algorithm to focus on samples that are originally more difficult to learn [13].
Figure 2.9: Comparison of Random oversampling with different
variants of SMOTE
Combination of Oversampling and Undersampling
Resampling methods can be combined into a hybrid of over- and undersampling. One popular combination is SMOTE-Tomek link, which combines SMOTE oversampling with Tomek link undersampling.
Tomek links have been described by Batista et al. [1] as:
Given two examples x and y belonging to different classes, let d(x, y) be the distance between x and y. The pair (x, y) is called a Tomek link if there is no example z such that d(x, z) < d(x, y) and d(y, z) < d(y, x). If two examples form a Tomek link, then one of them is noise or both are borderline (near the class border).
By removing Tomek links, the space is cleaned of noisy data, producing better decision boundaries for oversampling.
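The definition translates directly into code: a pair from different classes forms a Tomek link exactly when the two points are each other's nearest neighbors. A minimal O(n²) sketch on toy data (a real implementation would use a spatial index; all names are illustrative):

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) with different labels that are mutual
    nearest neighbors, i.e. no z is closer to either point."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    nn = D.argmin(axis=1)  # nearest neighbor of each point
    return [(i, int(j)) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

X = np.array([[0.0], [0.1], [5.0], [5.1], [10.0]])
y = np.array([0, 1, 0, 0, 1])
tomek_links(X, y)  # → [(0, 1)]: the only cross-class mutual-nearest pair
```

Note that the pair (2, 3) is also mutual-nearest but shares a label, so it is not a Tomek link.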
Weight Correction
Training the model on a resampled synthetic dataset changes the class distribution in the training set, while the classes in the validation/test set still carry the imbalanced proportions. Consider the scenario where we oversample the minority class in the training set to have as many samples as the majority class: each class then gets a weight (prior probability) of 0.5. This misleads the model when predicting labels in unseen data, as it assumes both classes are equally probable, which is not true.
To compensate, one can take the posterior probabilities returned by the classifier, divide them by the class fractions in the training set, multiply them by the class fractions in the validation/test set, and finally normalize so that the new posterior probabilities sum to one [3]. This is expressed mathematically in equation 2.4.
p(y|x) = (1/Z) · q(y|x) · p(y)/q(y) (2.4)

The term q(y|x) represents the posterior probabilities returned by the model, p(y) is the probability of the positive class in the test/validation set, and q(y) is the probability of the positive class in the training set. Therefore, p(y)/q(y) is the correction term that is multiplied by the posterior probabilities. Z is the normalizing constant. In a binary task, 1/Z is calculated according to equation 2.5.

1/Z = 1 / Σy q(y|x) · p(y)/q(y) (2.5)
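The correction can be sketched in a few lines (a minimal illustration; the posterior and class-fraction values are invented):

```python
def correct_posteriors(q_post, q_prior, p_prior):
    """Rescale posteriors from a model trained on resampled data.
    q_post: model posteriors q(y|x); q_prior: class fractions in the
    (resampled) training set; p_prior: class fractions in the test set."""
    unnorm = [q * p / qp for q, qp, p in zip(q_post, q_prior, p_prior)]
    z = sum(unnorm)                      # normalizing constant Z
    return [u / z for u in unnorm]

# Balanced training priors (0.5/0.5), true test priors 0.9/0.1
corrected = correct_posteriors([0.3, 0.7], [0.5, 0.5], [0.9, 0.1])
```

In the example the raw posteriors favor class 1, but after reweighting by the true test priors the corrected posteriors favor class 0.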
2.4 Classifiers
The theory in the preceding sections was concerned with the handling of input data: preprocessing and resampling methods were discussed to show how such a dataset can be prepared as input to a machine learning model. The next objective of this thesis is to search, tune and evaluate several binary classifiers and find the best combination of oversampling method and classifier. In this section, a number of linear and non-linear classifiers are discussed in detail.
Decision Trees
Tree-based models, a.k.a. Decision Trees, are widely used for both classification and regression problems. They are best known for their interpretability and their ability to handle non-linear problems. There are different decision tree algorithms, including CART (Classification And Regression Trees), ID3, and C4.5, which differ in several respects, including how they handle overfitting. In general, a decision tree follows a sequential binary procedure to build the tree.
Consider a predictor space with N samples and p dimensions (X1, ..., Xp). The algorithm first chooses a splitting value s for the jth predictor Xj to divide the predictor space into R1 and R2 so that R1(j, s) = {X | Xj < s} and R2(j, s) = {X | Xj ≥ s}. R1 and R2 are the two new subspaces, also known as terminal nodes. The algorithm repeats this step for every subspace until a stopping criterion is met. Stopping can be triggered, for instance, when the tree reaches a certain depth or when a leaf node contains fewer than a minimum number of samples. The choice of Xj and s at each step is made so that the error within the resulting subspaces is minimized.
In regression tasks, the predicted value of a new point is the mean value of the subspace it falls into. At every step, the model aims to minimize the Residual Sum of Squares (RSS) presented in equation 2.6.

RSS = Σ_{i: xi∈R1} (yi − ȳR1)² + Σ_{i: xi∈R2} (yi − ȳR2)² (2.6)

where ȳR1 and ȳR2 are the mean values within the R1 and R2 subspaces. Classification trees, on the other hand, use the majority class label within each subspace to predict the class of a new point. For choosing Xj and s, the model attempts to minimize either equation 2.7 (Entropy) or equation 2.8 (Gini Index).

D = − Σ_{k=1..K} pmk log pmk (2.7)

G = Σ_{k=1..K} pmk (1 − pmk) (2.8)
where pmk is the proportion of training data points in the mth subspace that are from the kth class. Both criteria try to form regions where the majority of points belong to one class [3]; in other words, they attempt to minimize node impurity. Although trees are easily interpreted, experiments show that learned trees are very sensitive to the details of the input data: a small change in the training data can result in a very different set of splits [14].
One way to tackle the model's sensitivity to the training data is pruning. In the cost complexity pruning method, trees with too many terminal nodes are penalized by a factor α. The value of α can control the bias-variance trade-off when the model is overfit or underfit; the complexity of the model is inversely proportional to the value of α. In contrast to the top-down approach for building trees, pruning begins at the bottom, from the leaf nodes back to the root of the tree [14].
Random Forest Classifier
Random Forest is an ensemble method introduced to address the overfitting issue of Decision Trees. It uses Bagging (Bootstrap Aggregating) with decision trees to provide more robust results. Consider input data with N samples and p features. The Random Forest model is described in algorithm 2.
Algorithm 2: Random Forest
1. For b = 1, 2, ..., B repeat:
   a) Create a bootstrap sample b of size N by sampling from the original data with replacement
   b) Fit a decision tree to b. At each split of the tree choose m random features from X, where m < p, and obtain the prediction fb
2. For regression, calculate the average prediction fbag(X) = (1/B) Σ_{b=1..B} fb(X)
3. For classification, take the majority vote as the final prediction
If the model trained on each bootstrap has variance σ², the variance of the mean of all bootstraps (fbag) is σ²/B, which is smaller than or at most equal to each individual variance.
Figure 2.10: Support Vector Machine with linear decision boundary
This shows that bagging reduces model variance. Random Forests are also able to estimate the expected error without cross-validation: on average, and depending on the size of the bootstrap, about one-third of the data points are not selected for each bootstrap. These are referred to as out-of-bag samples, and they can be used as an unseen test dataset to estimate the generalization error [14]. Random Forests and Decision Trees are highly flexible models: the depth of the tree(s), the number of leaf nodes, the number of samples in every leaf node, and the number m of features considered at each split all have significant impacts on the behavior of the final model.
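The one-third figure follows from bootstrap sampling itself: the probability that a given point is never drawn in N draws with replacement is (1 − 1/N)^N ≈ e⁻¹ ≈ 0.368. A quick numerical check (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
boot = rng.integers(0, N, size=N)            # one bootstrap of size N
oob_fraction = 1 - np.unique(boot).size / N  # points never drawn
# oob_fraction ≈ 0.368, i.e. about one-third of the data is out-of-bag
```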
Support Vector Classifier
Support Vector Classifiers (SVCs), or Support Vector Machines (SVMs), were originally designed for linear classification. Consider an input space with n observations and j dimensions. The SVC tries to find a hyperplane in this space that separates the observations by class. Ideally, the hyperplane has the largest possible margin to the observations of each class. A major drawback, however, is that such a hyperplane is extremely sensitive to the training observations: adding or removing observations from the training set can considerably change its position. To address this, soft margins are drawn parallel to the hyperplane. Points that are either on or inside the soft margins are called support vectors. The role of soft margins is to allow a desired level of misclassification in the model in order to reduce the generalization error [14].
As illustrated in figure 2.10, the soft margins allow some observations (support vectors) to be on the wrong side of the margins or even of the decision boundary. This reduces the model's sensitivity to the training data. The size of the margins is controlled by C, the Cost parameter. When C → ∞, no misclassification is tolerated and the soft margins are removed, resulting in a model with high variance. On the contrary, when C → 0, the margins are widened, allowing more misclassified observations. This also implies that the number of support vectors is inversely proportional to the value of C [14].
Extending Support Vector Classifiers to non-linear problems can increase computational costs significantly, since training an SVC to find the optimal decision boundary involves solving a quadratic optimization problem. If, for instance, we decide to enlarge the feature space to a higher-order polynomial, we can end up with a huge number of terms to compute in the optimization problem. To overcome this issue, SVCs use kernel functions to handle non-linear problems efficiently [14].
Kernel Functions
Kernel functions quantify the similarity between two observations. The linear kernel calculates the inner product between pairs of training observations. It turns out that only the inner products between pairs of support vectors have an impact on the model, so it is not necessary to compute the inner products between all possible pairs in the training set. Consider x to be a vector of P dimensions; equation 2.9 represents the linear kernel. Note that this thesis uses bold notation to represent vectors.

K(xi, xi′) = Σ_{j=1..P} xij xi′j (2.9)
To move beyond linearity, a polynomial kernel (equation 2.10) of order d can be used instead.

K(xi, xi′) = (1 + Σ_{j=1..P} xij xi′j)^d (2.10)
A more flexible kernel function is the radial kernel, or Radial Basis Function (RBF), in equation 2.11.

K(xi, xi′) = exp(−γ Σ_{j=1..P} (xij − xi′j)²) (2.11)

The RBF kernel is based on the squared Euclidean distance between two observations. If the distance is large, the exponential in equation 2.11 returns a small value, meaning that observations far from the decision boundary are effectively ignored. γ = 1/(2σ²) is a smoothing coefficient, and the value of σ in the denominator controls the shape of the kernel.
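A minimal implementation of equation 2.11 (the γ value and sample vectors are arbitrary):

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function kernel (equation 2.11)."""
    sq_dist = sum((xj - zj) ** 2 for xj, zj in zip(x, z))
    return math.exp(-gamma * sq_dist)

rbf_kernel([1.0, 2.0], [1.0, 2.0])   # → 1.0: identical points
rbf_kernel([0.0, 0.0], [10.0, 0.0])  # ≈ 0: distant points are ignored
```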
Logistic Regression
Linear Regression uses the Ordinary Least Squares (OLS) method to fit a straight line to the training data. The predicted variable y is modeled directly as a continuous variable. For every observation of X1, ..., XP, the target value is calculated using equation 2.12

y = β0 + β1x1 + ... + βPxP (2.12)

where β0 is the y-intercept and β1, ..., βP are the model coefficients. Each βi determines how much y changes for one unit of change in xi. Logistic Regression, on the other hand, calculates conditional probabilities between 0 and 1, which makes it suitable for binary classification problems. For this purpose, it uses the logistic function to calculate the probabilities [14].

p(Y|X) = e^(β0+β1X1+...+βPXP) / (1 + e^(β0+β1X1+...+βPXP)) (2.13)
Figure 2.11: Logistic curve
Remember that X is a vector with P dimensions. The logistic function can be rewritten in log-odds form as in equation 2.14.

log( p(X) / (1 − p(X)) ) = β0 + β1X1 + ... + βPXP (2.14)

The logistic function creates an S-shaped output representing probabilities from 0 to 1. This indicates that although increasing or decreasing X increases or decreases the log-odds value, the relationship to p is non-linear. Logistic Regression uses Maximum Likelihood Estimation (MLE) to learn the model parameters [14]. When MLE is applied to the Logistic Regression model, it attempts to minimize the expression in equation 2.15.
L(θ) = −(y log(ŷ) + (1 − y) log(1 − ŷ)) (2.15)

Equation 2.15 is the negative log of the likelihood function, or the log loss. Note that y and ŷ indicate the true and predicted class labels respectively, and θ is the vector of parameters the model is trying to learn. L(θ) therefore quantifies the loss of the model. One common way to solve the optimization problem defined by MLE is a technique known as gradient descent.
Gradient descent is an iterative process. It calculates the gradient of the loss function over the training observations and moves in the opposite (downhill) direction of the gradient to find a local minimum. If the local minimum is lower than all other local minima, or the loss function is convex (has a single minimum), the algorithm may find the global minimum. In equation 2.16, at each step the gradient ∇L(θ) is multiplied by a small step size η and subtracted from the current value of θ, until ∇L(θ) is smaller than a minimum threshold.

θ := θ − η∇L(θ) (2.16)
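A minimal sketch of the update in equation 2.16 for logistic regression (toy data; the gradient of the log loss with respect to θ averages (ŷ − y)·x over the observations, and the step size and iteration count are arbitrary):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gd_step(theta, X, y, eta=0.5):
    """One gradient-descent update: theta := theta - eta * grad L(theta)."""
    n = len(y)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        for j, v in enumerate(xi):
            grad[j] += (p - yi) * v / n
    return [t - eta * g for t, g in zip(theta, grad)]

# Toy data: first column is the intercept term
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
theta = [0.0, 0.0]
for _ in range(200):
    theta = gd_step(theta, X, y)
# After training, p(y=1|x) increases with the feature value
```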
Regularization
In Linear and Logistic Regression models, the bias-variance trade-off is usually controlled by regularization. L1-regularization (Lasso Regression) and L2-regularization (Ridge Regression) are the two basic regularization methods. Considering equation 2.12 for Linear Regression with N observations and p dimensions, L2-regularization is applied to the loss function as in equation 2.17. Note that both methods can also be applied to the logistic function in equation 2.13 for the Logistic Regression model.

L(θ) + λ Σ_{j=1..p} βj² (2.17)
The first term in equation 2.17 is the log loss presented in equation 2.15 and the second term is the penalty. The penalty term becomes larger when the values of β are large. The λ parameter controls the level of regularization: when λ = 0 the penalty term vanishes and no regularization is performed; as λ → ∞, regularization becomes stronger, decreasing the complexity of the model. L1-regularization, presented in equation 2.18, works similarly to L2-regularization except that it uses the absolute values of β in the penalty term. This forces some of the β estimates to be exactly 0, whereas in L2-regularization the coefficients can become close to 0 but not 0. This also makes L1-regularization suitable for feature selection [14].

L(θ) + λ Σ_{j=1..p} |βj| (2.18)
Another regularization method, Elastic Net, combines the L1 and L2 penalties to overcome the limitations of both. While L2 cannot perform feature selection like L1, L1 has its own limitations in how it removes features: for highly correlated features, for instance, L1 keeps one of them and omits the rest. Equation 2.19 is the penalty term of the Elastic Net. The second term in the equation averages highly correlated features, while the first term provides a sparse solution in the coefficients of the averaged features [14].

Σ_{j=1..p} (α|βj| + (1 − α)βj²) (2.19)
The α parameter in equation 2.19 controls the contribution of L1
and L2 methods in the regularization task.
K Nearest Neighbors (KNN)
K Nearest Neighbors (KNN) is a supervised non-parametric machine learning model. For a test observation xi and a given K, it finds the K nearest points to xi. In classification, xi is assigned the majority class label of its K neighbors; in regression, xi is assigned the average value of its K neighbors. When K is small the model is more sensitive to the training observations and thus has higher variance.
The behavior of the KNN classifier can also be explained using Bayes' theorem [3]. Assume we have N observations and k classes. To classify xi using its K neighbors, we draw a sphere centered on xi that contains all K neighbors. Let V be the volume of the sphere, Nk the number of points belonging to class Ck, and Kk the number of the K nearest neighbors that belong to class Ck. The probability density for each class (the conditional probability) can be expressed as

p(X|Ck) = Kk / (Nk V)

and the class prior as

p(Ck) = Nk / N

Using equation 2.20 we can then calculate the posterior probability density of the class label given the K neighbors.

p(Ck|X) = p(X|Ck) p(Ck) / p(X) (2.20)
The class Ck with the highest posterior probability density is assigned to the test observation [3]. There are different metrics for calculating the distance between xi and its K nearest neighbors. Equation 2.21 represents the Minkowski distance; the Euclidean and Manhattan distance metrics are special cases of the Minkowski distance with P = 2 and P = 1 respectively.

d(p, q) = ( Σi |pi − qi|^P )^(1/P) (2.21)
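Equation 2.21 combines naturally with the distance-weighted voting used by weighted KNN, which is described below. A minimal sketch (data, K and names are illustrative; the tiny constant guards against division by zero):

```python
def minkowski(p, q, P=2):
    """Equation 2.21; P = 2 gives Euclidean, P = 1 Manhattan distance."""
    return sum(abs(pi - qi) ** P for pi, qi in zip(p, q)) ** (1.0 / P)

def weighted_knn_vote(x, neighbors, P=2):
    """Inverse-distance-weighted vote among (point, label) neighbors."""
    votes = {}
    for point, label in neighbors:
        w = 1.0 / (minkowski(x, point, P) + 1e-12)
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

# One very close minority neighbor outvotes two distant majority ones
neighbors = [((2.0, 0.0), "majority"),
             ((0.0, 2.5), "majority"),
             ((0.1, 0.0), "minority")]
weighted_knn_vote((0.0, 0.0), neighbors)  # → 'minority'
```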
KNNs can handle non-linear classification tasks quite well. However, they tend to underperform on imbalanced data, since samples from the majority class are more frequent and it is therefore likely that most of the K neighbors of a test observation belong to the majority class. One way to tackle this problem is to use a weighted KNN, in which the label of each of the K neighbors is weighted proportionally to the inverse of its distance to the test observation. In non-weighted KNN, points among the K nearest neighbors are uniformly assigned a weight of 1 while the rest get a weight of 0 [12].
Gradient Boosting
Gradient Boosting is an ensemble model. The idea of boosting is to combine the outputs of several weak learners to produce more robust results; learners whose error rates are only slightly better than random guessing are considered weak learners. One of the most commonly used boosting methods is AdaBoost, which was designed for binary classification. Consider a classifier G(x) for a binary task with N observations. In AdaBoost, a number of weak learners (for example classification trees) Gm(x), m = 1, 2, ..., M are created, one at every step. These trees have only a single split and are also known as stumps. The final prediction is a weighted average of the predictions of all the stumps.

G(x) = sign( Σ_{m=1..M} αm Gm(x) ) (2.22)
αm determines the contribution of Gm(x). The algorithm starts by initializing the weights of the training points uniformly as 1/N. Then at every step a classifier is trained on the data; the weights of misclassified points are increased while correctly classified points are given smaller weights. This forces the algorithm to focus on observations that are more difficult to learn [12].
Since AdaBoost gives a much higher influence to misclassified points, it is very sensitive to outliers, which degrades its performance when trained on noisy data. Gradient Boosting addresses this issue and also extends its applicability to both regression and classification tasks. Gradient Boosting is a generalization of AdaBoost. It can
take any differentiable function as the loss function and use the gradient descent method for optimization (see equation 2.16). For the general loss function in equation 2.23

L(f) = Σ_{i=1..N} L(yi, f(xi)) (2.23)

the derivative (gradient) with respect to f(xi) is calculated at every observation according to equation 2.24.

gim = ∂L(yi, f(xi)) / ∂f(xi) (2.24)
Algorithm 3, taken from [12], describes Gradient Boosting for regression. In classification with k different classes, steps 2(a) to 2(d) are performed for each of the k classes. Note that for both regression and classification problems, Gradient Boosting grows a regression tree at each step. In step 1 of the algorithm, the initial predictions are computed; this can be done either through an external estimator or by computing the log-odds value of the target variable. In step 2(a) we calculate the pseudo-residuals, that is, the differences between the observed and predicted values in each tree. In 2(b) the next regression tree is grown on the residuals of the previous tree. In regression, the residuals also serve as the predicted values at each step, whereas in classification we need to calculate probabilities for prediction. Steps 2(c) and 2(d) involve minimizing the loss function and updating the predictions for each grown tree. For classification with k classes, the final output in step 3 is k different tree expansions [12].
Algorithm 3: Gradient Boosting
1. Initialize f0(x) = argmin_γ Σ_{i=1..N} L(yi, γ)
2. For m = 1, 2, ..., M:
   a) For i = 1, 2, ..., N compute the pseudo-residuals
      rim = −[ ∂L(yi, f(xi)) / ∂f(xi) ]_{f = fm−1}
   b) Fit a regression tree to the targets rim, giving terminal regions Rjm, j = 1, 2, ..., Jm
   c) For j = 1, 2, ..., Jm compute
      γjm = argmin_γ Σ_{xi∈Rjm} L(yi, fm−1(xi) + γ)
   d) Update fm(x) = fm−1(x) + Σ_{j=1..Jm} γjm I(x ∈ Rjm)
3. Output f(x) = fM(x)
Trees grown with Gradient Boosting have more leaf nodes than the stumps in AdaBoost, which have only two. The algorithm can also be tuned to use only a portion of the training observations at each step, which results in Stochastic Gradient Boosting. This approach can control the bias-variance trade-off: a smaller sub-sample fraction decreases variance while increasing bias [12].
2.5 Feature Ranking
The main purpose of using a data-driven model for lead scoring is to let the data decide the weight of every user action or characteristic. This section introduces two feature importance calculation techniques and then describes two correlation-based ranking methods.
Feature Importance with Decision Trees
The level of contribution of each feature can be estimated while building a decision tree. In short, the importance of a feature is calculated from the reduction of the Gini or entropy criterion at every split point [18] (see section 2.4 for more details on decision trees). However, this method does not work well with high-cardinality features (numerical or categorical features with many unique values): when the model is overfitting, it can give high importance to features that are not predictive on unseen data and thus lack generalization power [5]. Calculating feature importance values is also possible with other tree-based models such as Random Forests. However, this thesis does not discuss the details of the tree-based feature importance methods; instead, Permutation Feature Importance is introduced as a more practical way to calculate feature importance scores.
Permutation Feature Importance
The Permutation Importance algorithm, proposed by Breiman et al. [5], addresses the generalization problem of the tree-based methods.
Algorithm 4: Permutation Importance
1. Inputs: dataset D, predictive model m
2. Compute a reference score s of the model m on D (accuracy, ROC-AUC, etc.)
3. For each feature j in D:
   a) For k in 1, ..., K:
      i. Randomly shuffle the values in column j to create an altered version of the dataset Dj,k
      ii. Compute the score sk,j of model m on Dj,k
   b) Compute the importance ij of the permuted feature as:
      ij = s − (1/K) Σ_{k=1..K} sk,j
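Algorithm 4 can be sketched in a few lines (the scoring function here is a simple stand-in; in practice s would be the accuracy or ROC-AUC of a fitted model, and all names and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_importance(score, X, y, n_repeats=5):
    """Algorithm 4: mean drop in score after shuffling each column."""
    base = score(X, y)
    importances = []
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])          # break the tie with the target
            drops.append(base - score(Xp, y))
        importances.append(float(np.mean(drops)))
    return importances

# Toy data: column 0 equals the label, column 1 is pure noise
y = np.array([0, 1] * 50)
X = np.column_stack([y.astype(float), rng.normal(size=100)])
acc = lambda X, y: np.mean((X[:, 0] > 0.5) == y)  # stand-in "model" score
imp = permutation_importance(acc, X, y)
# imp[0] is large (shuffling the informative column hurts), imp[1] is 0
```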
The importance of the jth feature is determined by how much the model's base score s changes due to shuffling the values in j. The reshuffling step breaks the feature's ties with the target variable, so importance is measured solely by how much the model depends on that specific feature. Features may sometimes appear more important on the training set than on the test/validation set; it is therefore good practice to train the model on the training set and calculate the importance scores on a held-out set to improve generalization power [5].
Pearson Correlation Coefficient
Pearson correlation measures the linear correlation between every
feature and the response variable. It can take values from -1 to 1.
A value close to zero indicates an insignificant correlation,
whereas a value of larger magnitude, irrespective of the sign,
represents a stronger correlation between the feature and the
response. Pearson correlation is calculated by dividing the
covariance of the feature-response pair by the product of their
standard deviations.
J_cc(X_j) = [ (1/N) · Σ_{i=1}^{N} (x_{i,j} − X̄_j)(c_i − c̄) ] / (σ_{X_j} · σ_c)    (2.25)
For every feature X_j and class label c, the covariance is
calculated and summed over all N observations. This value is then
divided by the product of the feature and class standard deviations
(σ_{X_j} · σ_c). Note that in equation 2.25, c denotes the probability of
observing a sample from class 1. Since rankings near −1 are as
informative as rankings close to 1, we take the absolute values of
the rankings [16].
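As an illustration of equation 2.25, the correlation ranking can be computed directly with NumPy. The synthetic data and the number of features here are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
c = rng.integers(0, 2, size=N)           # binary class labels
X = rng.normal(size=(N, 3))
X[:, 0] += 2.0 * c                       # feature 0 correlates with the label

def pearson_rank(X, c):
    """Absolute Pearson correlation of each column of X with c."""
    Xc = X - X.mean(axis=0)
    cc = c - c.mean()
    cov = (Xc * cc[:, None]).mean(axis=0)    # per-feature covariance
    return np.abs(cov / (X.std(axis=0) * c.std()))

ranks = pearson_rank(X, c)
print(ranks)   # feature 0 should receive the highest rank
```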
Fisher Coefficient
This ranking method is based on Fisher's Discriminant Analysis,
which reduces to Linear Discriminant Analysis in the case of binary
problems. Fisher's discriminant separates classes based on their
means, drawing a linear discriminant that maximizes separability
while minimizing within-class variance. The Fisher coefficient is
calculated using equation 2.26
J_FSC(X_j) = [X̄_{j,1} − X̄_{j,2}] / [σ_{j,1} + σ_{j,2}]    (2.26)
X̄_{j,1} and X̄_{j,2} represent the mean of the values in X_j
corresponding to class 0 and class 1 respectively. The same holds
for the standard deviations in the denominator [16]. As opposed to
feature importance techniques, both the Fisher coefficient and the
Pearson correlation coefficient capture a linear relationship
between the features and the target variable.
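The Fisher coefficient of equation 2.26 can likewise be sketched in a few lines of NumPy, again on synthetic example data:

```python
import numpy as np

rng = np.random.default_rng(1)
c = rng.integers(0, 2, size=500)
X = rng.normal(size=(500, 3))
X[:, 1] += 3.0 * c                       # feature 1 separates the classes

def fisher_rank(X, c):
    """Per-class mean difference over the sum of per-class std deviations."""
    X0, X1 = X[c == 0], X[c == 1]
    return np.abs(X0.mean(axis=0) - X1.mean(axis=0)) / (
        X0.std(axis=0) + X1.std(axis=0))

ranks = fisher_rank(X, c)
print(ranks)   # feature 1 should receive the highest rank
```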
2.6 Weighting Schemes
Different ranking methods produce different types of values and
therefore need to be normalized by a certain scheme to be converted
into weights [16]. Below are the two simple weighting schemes used
in this thesis.
Normalized Max Filter (NMF)
Equation 2.27 states that for positive ranks (J+), the absolute
value of the rank is divided by the maximum rank. In the case of
negative ranks (J−), the absolute value of the rank is subtracted
from the sum of the maximum and minimum ranks and then divided by
the maximum rank.
W_NMF(J) = |J| / J_max                      for J+
W_NMF(J) = (J_max + J_min − |J|) / J_max    for J−    (2.27)
Normalized Range Filter (NRF)
This scheme is nearly identical to NMF. However, for both positive
and negative ranks, the minimum rank is added to the numerator and
the denominator. In the case of NRF, the weights lie in
[2·J_min/(J_max + J_min), 1] [16].
W_NRF(J) = (|J| + J_min) / (J_max + J_min)              for J+
W_NRF(J) = (J_max + 2·J_min − |J|) / (J_max + J_min)    for J−    (2.28)
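Both schemes can be sketched as follows; in this sketch it is assumed that J_max and J_min denote the largest and smallest absolute rank values, following the description of NMF and NRF above:

```python
import numpy as np

def nmf_weights(J):
    """Normalized Max Filter applied to a vector of ranks."""
    a = np.abs(J)
    jmax, jmin = a.max(), a.min()
    return np.where(J >= 0, a / jmax, (jmax + jmin - a) / jmax)

def nrf_weights(J):
    """Normalized Range Filter: J_min added to numerator and denominator."""
    a = np.abs(J)
    jmax, jmin = a.max(), a.min()
    return np.where(J >= 0, (a + jmin) / (jmax + jmin),
                    (jmax + 2 * jmin - a) / (jmax + jmin))

J = np.array([0.9, -0.4, 0.1, -0.8])
print(nmf_weights(J))
print(nrf_weights(J))
```

For this example vector the NRF weights indeed fall in [2·J_min/(J_max + J_min), 1] = [0.2, 1].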
Table 2.3: Confusion matrix of a binary classifier

                       True 0                 True 1
Predicted 0    True Negative (TN)    False Negative (FN)    N*
Predicted 1    False Positive (FP)   True Positive (TP)     P*
                       N                      P
2.7 Model Selection and Evaluation
The procedure of building a regression or classification model
begins with shortlisting a set of candidate models that may suit
the specifications of the problem. Then some model tuning and
evaluation steps have to be taken to find the best setting of the
best performing model. The interpretation of "the best" relies
solely on the nature of the problem; there is no one-size-fits-all
evaluation metric to calculate. Therefore, this section starts by
introducing a number of essential metrics and motivates why some
are preferred in this thesis. Finally, some popular model tuning
and validation methods will be discussed.
Metrics
In classification problems, a common approach to evaluate model
performance is through the confusion matrix. The confusion matrix
of a binary classifier consists of the components displayed in
table 2.3.
Several evaluation metrics can be calculated from the confusion
matrix. Here a number of these metrics and their respective
applications will be described.
FPR and FNR

FPR = FP / N    (2.29)

FNR = FN / P    (2.30)
FPR, also known as the probability of false alarm, specifies how
often the model incorrectly classifies a sample from class 0 as
class 1. In contrast, FNR shows the probability of the model
incorrectly classifying a sample from class 1 as class 0. There is
usually a trade-off between FPR and FNR; therefore, the two scores
cannot be improved simultaneously.
Precision and Recall
The precision and recall metrics can be computed using the
components in table 2.3. Precision is obtained by dividing the
number of correctly predicted positive points by the number of all
points predicted as positive. A larger value indicates that a
greater share of the model's positive predictions are correct.
Precision = TP / (TP + FP)    (2.31)
Recall is calculated by dividing the number of correctly predicted
positive points by the number of all relevant points (regardless of
whether they were correctly predicted or not). This ratio indicates
what portion of the true labels is identified by the model.

Recall = TP / (TP + FN)    (2.32)
Figure 2.12: A comparison of the ROC Curve (left) and the Precision-Recall Curve (right)
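Equations 2.31 and 2.32 can be checked directly against scikit-learn's metric functions; the labels below are a toy example:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, recall)
assert precision == precision_score(y_true, y_pred)
assert recall == recall_score(y_true, y_pred)
```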
Similar to FPR and FNR, the two scores are subject to a trade-off.
The preference for either score depends on how the results are to
be interpreted and how the model is to be used in practice. An
alternative solution is to use the F1 score.
F1 score
F1 aims to find the right balance between Precision and Recall by
combining the two using equation 2.33.
F1 = (2 · precision · recall) / (precision + recall)    (2.33)
F1 is widely used as an overall as well as per class score to
evaluate a classification model.
Metrics for imbalanced data
Models trained on imbalanced data cannot be simply evaluated
through their accuracy (ratio of correctly predicted points over
all points, regardless of the class label). This score does not
take into account the per class performance. If a problem is
imbalanced, where 0 is the most common class, the F1 score is a
good alternative to evaluate the model. However, F1 does not
distinguish between the severity of FNR and FPR, so one has to
decide which score to trade off in favor of the other [20]. In
other words, the F1 score does not express a preference for either
precision or recall.
An alternative solution is threshold tuning. In probabilistic
approaches to binary classification, the model calculates the
posterior probability of an observation belonging to each of the
classes 0 and 1. By default, a cut-off threshold is set on the
probabilities to assign the class labels. In threshold tuning,
instead of setting a single discriminative threshold, an FPR and
TPR are calculated for many thresholds in the [0, 1] interval. From
the obtained values the Receiver Operating Characteristic Curve
(ROC Curve) can be plotted. The curve has FPR on the x-axis and TPR
on the y-axis (Figure 2.12).
The objective is to maximize the Area Under the ROC Curve
(ROC-AUC). AUC can take any value between 0 and 1. A value of 0.5
amounts to a random guess, while a value of 1 (a curve hugging the
top left corner) represents a perfect model. The suitability of
ROC-AUC for imbalanced problems is still debated, because ROC-AUC
measures a model's overall performance and does not reflect the per
class performance. A better alternative is the area under the
precision-recall curve [20].
The precision-recall curve (PR curve) has recall scores on the
x-axis and precision scores on the y-axis. The PR curve of a good
model gets close to the top right corner of the plot. Depending
on the domain and sensitivity of the problem, a higher recall or
precision may be preferable. In any case, the area under the PR
curve can better reflect the model performance compared to ROC-AUC
when working with imbalanced datasets.
Average Precision
The area under the PR curve is itself debated. [9] argues that in
ROC space it is possible to linearly connect (interpolate) a value
of TPR to a value of FPR. However, the same does not hold for
precision-recall space: precision is not necessarily linear in
recall, since FP rather than FN appears in the denominator of the
precision score. In this case, linear interpolation yields an
overly optimistic estimate of performance. The incorrect
interpolation is most problematic when recall and precision scores
are far apart and the local skew is high. The Average Precision
(AP) score overcomes this problem. The AP score for n thresholds is
calculated as
AP = Σ_n (R_n − R_{n−1}) · P_n    (2.34)
According to equation 2.34, the precision (denoted by P) at the nth
threshold is weighted by the change in recall (denoted by R) from
threshold n−1 to threshold n. The final score is the sum of the
weighted precision scores.
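Equation 2.34 can be computed directly from the PR curve and compared with scikit-learn's implementation. Note that `precision_recall_curve` returns points ordered by decreasing recall, hence the sign flip in the sum; the scores below are a toy example:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.5, 0.9, 0.15])

precision, recall, _ = precision_recall_curve(y_true, y_score)
# Sum of precision weighted by the (negative) recall increments.
ap_manual = -np.sum(np.diff(recall) * precision[:-1])
ap_sklearn = average_precision_score(y_true, y_score)
print(ap_manual, ap_sklearn)
```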
Model Tuning
Identifying the proper metric to optimize is key to finding the
right hyperparameter values. The search for the best hyperparameter
setting is a sensitive task. However, it is not realistic to find
the perfect model, since that would involve evaluating many values
for every hyperparameter. For large datasets, this becomes
impractical in terms of time and computational cost. Here, two
model tuning approaches will be discussed briefly.
Grid Search
Grid search is a popular approach for hyperparameter tuning. Given
a set of values for a set of hyperparameters, grid search performs
an exhaustive search over all possible combinations of values and
hyperparameters to find the best setting. For instance a grid
search over 4 values of parameter A (A = (a, b, c, d)) and 3 values
of parameter B (B = (e, f , g)) creates 12 unique combinations.
Therefore, the objective model is fit to the training data 12 times
with all the unique possibilities. In every step, a score (average
precision, ROC-AUC etc) is calculated on the held-out data. The
combination that yields the highest score is selected and its
corresponding hyperparameter values are returned.
To further minimize the generalization error, the grid search can
be used with K-fold Cross Validation. The K-fold cross validation
method can be summarized as:
• Shuffle the training set and divide it into K different
folds
• For k = 1, ..., K repeat:
– Select one fold as the held out or test set
– Train the model on the remaining K − 1 folds
– Evaluate the model on the held out fold
– Calculate the evaluation score
• Calculate the mean of all evaluation scores to find model
performance
In this method, every sample is used once in the test set and K − 1
times in the training set. Cross validation helps reduce model
variance and increase bias. This method is usually coupled with
grid search. In grid search, the number of unique combinations can
grow exponentially when the grid size is increased. Therefore, a
budget has to be determined first, based on the amount of available
computational resources, time, etc. Then the grid size has to be
limited to fit the budget. Very often, the grid becomes so limited
that a good hyperparameter setting cannot be found. A faster
alternative is to combine grid search with the successive halving
technique.
Successive Halving
Successive halving (SH) proposed by Jamieson et al. [15], is an
iterative process that aims to reduce the budget requirements while
trying to retain the best possible results. SH has a variety of
applications and is gaining popularity in deep learning models with
large training data and several parameters. Its combination with
grid search enables searching a larger grid while keeping the
budget low. Steps of SH with grid search are as follows:
1. Select a small number of resources (subsample of the training
data)
2. Train all candidate models on the selected resources
3. Evaluate candidate models using a proper metric (i.e. average
precision)
4. Select only a portion of the top-performing candidates for the
next iteration
5. Increase the number of resources (subsample size)
6. Train the remaining candidate models on the increased
resources
7. Repeat steps 3-6 until all resources are exhausted and the best
performing candidate is selected
This tournament-style approach avoids spending the budget on
low-performing candidates. This way, only the top-performing
candidates survive until the last iteration, where they are
evaluated with all available resources. The most common resource to
increase at every iteration is the number of training samples.
However, a different resource, like the number of estimators in a
Random Forest, can be selected. It is also possible to decide how
many resources to allocate initially and by which factor to
increase them in every iteration. The number of candidates is
reduced by the same factor at every step. Although SH has proven to
be budget efficient, it may result in a sub-optimal solution. If a
grid search fits the available budget, it is guaranteed to return
the best performing model among all possible candidates.
3 Method
This chapter describes the methodology used in this thesis. The
primary challenge is to find the best combination of classifier and
oversampling technique. For this purpose, an evaluation procedure
is defined to evaluate several candidate models. An additional
procedure is also designed to weight features in the dataset using
several feature ranking and weighting schemes. Then, for every set
of learned weights, a KNN classifier is fit to the weighted
training set and the results are compared to the baseline KNN.
Before discussing the methodology, a brief description of the data
is provided.
3.1 Data Description
Data for this thesis was provided by a company that offers Supplier
Relationship Management (SRM) tools. SRMs are web-based platforms
that enable businesses to search, select, communicate and negotiate
with different suppliers. The data was collected from the Google
Analytics API. Google Analytics provides a diverse range of
variables on the characteristics and behavior of users that visit
the website. It allows businesses to set up several goals (e.g. a
desired action like a newsletter subscription or demo request) and
track how and which users interact with them.
The data consists of over 110,000 observations. Every observation
contains 11 attributes (features) of a user's session on the
company's website. Users are identified by an anonymous and unique
Client ID. The class label, Goal Conversion, is a value between 0
and 100. It indicates (in percentage) to what extent a client has
completed the defined goals on the website per session. For
instance, a client that has requested a demo of the product and
subscribed to the newsletter gets a higher Goal Conversion value
than a client that has only subscribed to the newsletter. When no
desired action is taken in a session, the Goal Conversion value is
0. However, this project is only interested in finding out whether
a client has taken any desired action, regardless of the extent.
Therefore, the class label is converted into a binary variable.
There are only 875 records where conversion = 1 (positive class).
The rare positive samples make the dataset severely imbalanced:
over 99 percent of the samples belong to class 0 and only about 0.7
percent to class 1. In this setting, undersampling the data is
equivalent to discarding a large portion of the samples, leaving
very little information for training the models. Therefore, several
oversampling techniques will be used
Name                     Type          Description
ClientID                 Alphanumeric  Unique user identifier
WeekDay                  Integer       Numerical representation of weekdays (Sunday = 0, Saturday = 6)
Hour                     Integer       Hours within a day (0 to 23)
IsFromSocial             Binary        Whether the user was referred from social media (1) or not (0)
BounceRate               Float         Probability of the user leaving the site quickly (percentage)
DeviceCategory           Categorical   Device used by the user (desktop, mobile or tablet)
Source                   Categorical   Source website the user comes from (google, linkedin, etc.)
Medium                   Categorical   Medium from which the user entered (organic, paid, etc.)
UserType                 Categorical   Whether the user is new or returning (two categories)
SessionsPerUser          Integer       Number of times the user visited the website
AvgSessionDuration       Integer       Average time the user spends on the website (seconds)
PageviewsPerSession      Integer       Average number of pages visited in every visit to the website
GoalConversion (label)   Binary        Whether the user took a desired action (1) or not (0)

Table 3.1: Characteristics of the input data
throughout this project. Table 3.1 contains the description of the
input variables. Characteristics of the data are further described
in chapter 4.
3.2 Preprocessing
The input data contains a variety of feature types, demanding
different preprocessing methods. For categorical variables with
more than 5 categories, the most frequent categories are retained
and the rest are aggregated under an "other" category. This method
has been applied to the Medium and Source variables. The variable
IsFromSocial contains two factor levels, yes and no, which are
converted to binary 1 and 0. The Bounce Rate variable is originally
on a percentage scale; it is normalized to values between 0 and 1
using min-max normalization. In order to remove outliers from
continuous variables, the interquartile range method has been
applied. For each of the variables with outliers, the scale
parameter K is chosen empirically by visually inspecting the
outliers. This technique is applied to three variables, namely
Pageviews Per Session, Average Session Duration and Sessions Per
User. All categorical variables are one-hot encoded to obtain a
numerical representation. The ClientID variable is removed from the
data since it is not a predictor. The data is split in an 80/20
ratio into training and test sets. Note that all splits and cross
validations used in this thesis are stratified. This is to ensure
that positive class samples are proportionally distributed among
the resulting splits, which is essential in imbalanced
classification tasks.
3.3 Classifiers
Six different classifiers are selected for tuning and evaluation.
Each model is tuned using Successive Halving Grid Search with
5-fold cross validation on the training set. Average Precision is
used as the optimization score. The classifier tuning is first
performed using the non-oversampled data. The process is then
repeated once more using the data that was oversampled with Random
Oversampling. The motivation behind this is
to see the effect of oversampling on the optimum hyperparameter
values. Also, to avoid excessive tuning, it is assumed that the
best hyperparameter values when using Random Oversampled data are
also valid when using other resampling techniques. The
hyperparameters that are selected for tuning for each classifier
are briefly described below. Note that this thesis uses the Scikit
Learn library to implement many of its models, including the
classifiers and the Permutation Importance algorithm. Scikit Learn
is an open-source library written for the Python programming
language. It features implementations of several machine learning
algorithms for both classification and regression.
Logistic Regression - Two of the essential parameters for tuning
are penalty and C that control the type and amount of
regularization. Lasso, Ridge, and Elastic net regularization
methods have been evaluated for different values of C. The
intention is to find out if, how and to what extent reduction in
model complexity is necessary. The solver parameter is also tuned
to find the best optimization algorithm for minimizing the loss
function. Note that some regularization methods work only with
certain optimization algorithms.
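As an illustration, a grid covering the penalty, C, and solver choices described above might look as follows. The particular values are assumptions, not the grid used in this thesis; note that elasticnet requires the saga solver and an l1_ratio:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# saga supports l1, l2 and elasticnet; elasticnet also needs l1_ratio.
param_grid = [
    {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10], "solver": ["saga"]},
    {"penalty": ["elasticnet"], "C": [0.01, 0.1, 1, 10],
     "l1_ratio": [0.5], "solver": ["saga"]},
]
search = GridSearchCV(LogisticRegression(max_iter=5000), param_grid,
                      scoring="average_precision", cv=3)
search.fit(X, y)
print(search.best_params_)
```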
Support Vector Classifier - The main hyperparameter for Support
Vector Classifiers is C that controls the bias-variance trade-off
in the model. The C parameter in SVC acts similarly to the C
parameter in Logistic Regression when L2-regularization is used.
The value of the kernel parameter has been fixed to rbf, which
provides more flexible decision boundaries compared to linear or
polynomial kernels. The gamma parameter is a parameter of the RBF
kernel that controls the sensitivity of the kernel towards the
distance between two training points.
Considering that the training set contains more than 88,000
observations, training an SVC is considerably slower than training
the other models. Therefore, the grid search has to be limited to
fewer hyperparameters and values to make the tuning feasible. This
issue escalates when tuning the oversampling methods on SVC due to
the increased number of training observations. For this purpose,
the tuning of oversampling methods on SVC has also been limited to
fewer hyperparameter values.
Decision Tree - Trees can be highly customized with several
parameters. A number of these parameters control the structure of
the tree. For instance, max depth determines the number of levels
in the tree. Deeper trees are more complex, meaning they are more
customized to the training set. Therefore, the depth of the tree
controls the level of bias and variance in the model. The minimum
number of allowed observations in each terminal (leaf) node (min
samples leaf ), the minimum number of samples required to make a
new split (min samples split) and the maximum number of leaf nodes
allowed (max leaf nodes) can further control the tree structure.
The splitting strategy has also been tuned using the splitter
parameter. Splitting at each node is done to minimize the loss
criterion (Gini or entropy). This can be carried out using two
different strategies. One, by using the best feature (feature with
the highest importance score) and the best threshold value at each
split. Two, by choosing the best feature and a random threshold
value at each split. Trees are not pruned by default. However, cost
complexity pruning has been applied and tuned with the ccp alpha
parameter.
Random Forest - Random Forest constructs trees similarly to a
Decision Tree and therefore inherits the same set of parameters for
this purpose. The distinguishing parameter of Random Forest is n
estimators, which indicates the number of trees to grow in the
forest. While larger values for this parameter can potentially
reduce generalization error, too large values impose unnecessary
computational costs with no significant improvement in the final
results. The portion of features used in each bootstrap has been
tuned using the max features parameter. In this project, pruning is
not performed for the trees in the Random Forest, since ensemble
models already handle variance reduction and further pruning may
become redundant.
K Nearest Neighbors - The behavior of the KNN classifier is mainly
controlled by the n neighbors parameter. This parameter determines
the number of nearest neighbors to use to set the label of a test
observation. An increase in the value of this parameter increases
model