
Predictive Analytics with Azure ML

BY JASON SCHUH


In this experiment, several different statistical modeling methods/algorithms will be used in an attempt to identify one method or ensemble of methods that maximizes the "click-through rate" in the most significant manner, with the ultimate goal of increasing the return on investment (ROI) of ad-serving dollars. Maximizing the click-through rate for ads served will result in an increased ROI.

Modeling Process

The objective of the modeling section is to identify the method/algorithm that provides the most significant predictive capability when exposed to the click-through data. As stated previously, the data contains an attribute labeled "click" where click=1 indicates an ad serve was clicked through. The outcome of this phase will be a process for building, evaluating and validating models that correctly predict the value of "click" in a manner that is significantly better than the current process.

During this phase multiple versions of multiple models will be generated, of which only one model/method will be selected. Each is constructed from a set of training data and evaluated for its predictive performance. Once a model has been evaluated it will then be validated against a set of unseen test data, at which point its true predictive performance characteristics will be known. To prepare for model evaluation and validation the data will be randomly partitioned into a training set and a test set. The models will be evaluated against the training set and then validated against the test set.
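
For illustration only, the following R sketch shows how such a random partition could be performed. The data frame name "clicks" and the 70/30 split ratio are assumptions, not details taken from the project.

    # Minimal sketch of a random train/test partition, assuming the click data
    # has been loaded into a data frame named `clicks` with a column `click`.
    set.seed(42)                                        # make the split reproducible
    train_rows <- sample(nrow(clicks), size = round(0.7 * nrow(clicks)))
    train_set  <- clicks[train_rows, ]
    test_set   <- clicks[-train_rows, ]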

Azure ML is a powerful predictive analytics tool that can provide value for businesses of all kinds in many ways.


The Evaluation/Modeling Process

Evaluation of a model will be defined as measuring its performance over the training data. In this project the measurement of performance will be viewed through a ROC curve, a visual representation of the model's performance. Because the evaluation of a visualization can lead to subjective outcomes, the additional measures of AUC, precision and recall will be used as well.

The following metrics will be used in this experiment:

ROC Chart/AUC: The ROC chart is a visual representation of the performance of a binary classifier. The AUC (area under the curve) is a measurement of how much the model outperforms a random guess. AUC scores are measured from 0 to 1. Any score over .50 indicates a model that is better than a random guess. Scores greater than .70 are desirable.

Precision: The percentage of predicted clicks that are in fact actual clicks. Precision scores range from 0 to 1. A 1 indicates that 100% of all records predicted as true are in fact true.

Recall: The percentage of actual clicks that are retrieved. Recall scores range from 0 to 1. A recall score of 1 indicates that the model captured 100% of all true records in the dataset.


Model validation is where the project will ensure that the training results are indicative of how the model will behave in the real world. The models will be constructed using the training data and then scored against the training data, producing performance metrics. Validation will occur by using those same models to score the test data. Valid models will produce performance metrics that are similar across the test and evaluation data sets. Indications of a poorly constructed model are present when a model's performance metrics fluctuate wildly between the training and test data sets. A model that performs similarly across training and test data is a model that is likely to perform in the same manner against new, unseen data.

Algorithms

Predicting outcomes for this project is referred to as binary classification modeling. This project is attempting to classify a potential ad serve as either an ad serve that is likely to be clicked or an ad serve that is unlikely to be clicked. There are very specific modeling techniques that are used for classification. Not all classification algorithms are equal, but it is this project team's experience that many algorithms often perform equally well, with a few that perform above and below average. No single algorithm is best in all occasions against all data, so the project will evaluate multiple models and choose the model that maximizes the chosen validation metrics.

The data will be modeled in various ways using each algorithm. The algorithm that performs best against the test data will be selected.

Modeling Tools

There are many tools available to aid in the process of constructing predictive models. There is no one best or most appropriate tool for creating the best model, but it is critical to use a tool that is both flexible and contains a rich set of functionality. The project team has found that most major modeling tools in the market do an equally sufficient job of producing predictive models.

As the effort to build models is usually an iterative process of tweaking and remodeling, it is best to adopt a tool that allows for rapid prototyping and is capable of masking most of the mundane and tedious tasks. These tasks can consume significant amounts of a project's schedule; therefore, tools that enable the project to quickly iterate through these tasks are highly desirable.

Algorithm               Description

Logistic Regression     A traditional statistical method for predicting classification outcomes.

Machine Learning Techniques
Averaged Perceptron     A linear classifier in the neural network family.
Boosted Decision Tree   An ensemble method of building decision trees.
Decision Forest         An ensemble method of building decision trees.


This project will move forward using two tools. Microsoft's Azure Machine Learning cloud platform will be used as the primary modeling tool for this project. Azure ML provides the ability to very quickly assemble a modeling effort and determine whether or not the team is on the right path; it allows the team to either fail quickly or quickly validate current progress. As powerful as Azure ML's rapid modeling capabilities are, the tool is lacking in other areas such as visualization and deep-dive analysis. Therefore, as a supplement, the R language will be used at times to produce specific deliverables in specific areas of this project.

Modeling Effort

The modeling effort section will begin the process of iterating through many versions of models. The purpose of this effort is to model many different versions of the data with the hope of converging on a best-effort model. It is important to remember that it is unlikely the project will find the "perfect" model, but it should find the best model within the constraints given. With unlimited time it is possible to continue to slightly improve a model. However, like all projects, this project has a very finite schedule and it is critical that the team not let perfect be the enemy of good. The modeling effort will be given a very specific amount of time and the best model selected from this effort will be the model used.

Financial Measurement

In addition to measuring modeling iterations by the usual AUC, precision and recall, a simple financial profit and loss model will be applied. The purpose of this project is not just to produce a model that is best at predicting click-through, but a model that is able to maximize return on investment.

The financial model will simply apply a cost to serving an advertisement and a projected revenue for producing a click-through. For this project the following financial assumptions will be made:

• The cost of serving an advertisement: $0.40

• The projected revenue of producing a click-through: $2.50

• Ad serving budget: $10,000

The following function will be used to compute the profit and loss for an advertisement serving campaign. N is equal to the number of ads served. The click-through rate measures the percentage of ads served that were clicked through.

P&L = (N * -$0.40) + (N * Click-Through Rate * $2.50)

The value of N is a financial constraint set by the budget of the individual marketing campaign. N is equal to the number of ads the campaign is able to serve based on the amount of money the campaign is able to spend.


For the sake of this project the assumed budget is $10,000, and at $0.40 per ad served the campaign is able to serve the following number of ads:

ads served = $10,000 / $0.40 = 25,000 ads

With a virtually unlimited pool of potential ad serve opportunities, the model must be constructed to find the most probable 25,000 ad serves.
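
A minimal R sketch of this arithmetic is shown below. The function name "campaign_pnl" and its arguments are illustrative and are not part of the Azure ML experiment.

    # P&L = (N * -$0.40) + (N * CTR * $2.50), with N capped by the campaign budget.
    campaign_pnl <- function(n_served, ctr, cost_per_ad = 0.40, rev_per_click = 2.50) {
      cost    <- n_served * cost_per_ad
      revenue <- n_served * ctr * rev_per_click
      list(cost = cost, revenue = revenue, pnl = revenue - cost)
    }

    ads_affordable <- 10000 / 0.40   # budget / cost per ad = 25,000 ads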

No Model

In order to measure the effectiveness of a model, a baseline must be taken to understand the current environment of having no model. The idea of producing a predictive model hinges on its output being significantly better than doing nothing. If a model is not marginally better than doing nothing, then perhaps it might be best to do nothing; the law of parsimony states the simplest solution is usually the best. The No Model scenario consists of randomly selecting the first 25,000 ad serve opportunities and measuring the results. The following screenshot shows an Azure ML experiment where 25,000 ad serves were selected at random and the P&L formula was applied.

The Azure ML experiment starts with the test data set at the beginning. The Partition and Sample step randomly selects 25,000 ad serves.

The final step is the Apply SQL Transformation step where the P&L formula is computed. The output of this experiment is as follows:

These results indicate the No Model solution would generate revenue of $10,517.50 and a P&L of $517.50, a 5% margin. The CTR indicates that ad serves selected at random are clicked through 16.8% of the time.
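
Using the sketch above, the No Model figures can be reproduced. The reported revenue of $10,517.50 implies 4,207 clicks ($10,517.50 / $2.50) out of the 25,000 random ad serves.

    no_model <- campaign_pnl(n_served = 25000, ctr = 4207 / 25000)
    no_model$revenue              # 10517.5
    no_model$pnl                  # 517.5
    no_model$pnl / no_model$cost  # ~0.05, i.e. the 5% margin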


The M1 Modeling Experiment

Model M1 will be the first round of many modeling efforts. M1 will represent a simple modeling effort where the four chosen algorithms will each be run against the core data in the test and train files. Once each model is scored the P&L will be run for each. The M1 modeling phase will produce models M1.1 – M1.4.

The goal of M1 is to model the existing variables. This project is dealing with many anonymous variables, which limits many of the possible transformations. Modeling the variables in place will give an additional baseline against which to measure the progress made by any future variable transformations.

Algorithm Naming Conventions

Logistic Regression     M1.1
Averaged Perceptron     M1.2
Boosted Decision Tree   M1.3
Decision Forest         M1.4

Financials

At this point the project has a baseline from which to measure all additional results moving forward. For a model to be worthy of inclusion into the business it must substantially increase the above click-through rate as well as the financial numbers.


M1 Modeling Experiment Diagram

In this experiment the data flows down two distinct tracks. Track 1 is initiated by the step Clicks_Train_Date_Usage3.csv and is the model training and evaluation track. Track 2 is the model testing and validation track. Attributes like "row id" and others that are not needed in the model training process are removed by the "Project Columns" step. The "Metadata Editor" step is used to convert attributes into categorical attributes. Categorical attributes in Azure ML are analogous to factors in the R language.
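
The equivalent operation in R is converting the relevant columns to factors. The sketch below is illustrative only; the column names are taken from the attribute table later in this document, and the data frame name comes from the earlier partition sketch.

    # Azure ML's Metadata Editor marks columns as categorical; in R the analogous
    # step is converting a subset of columns to factors.
    categorical_cols <- c("Site_id", "Site_domain", "App_id", "Device_model", "C14")
    train_set[categorical_cols] <- lapply(train_set[categorical_cols], as.factor)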

Modeling

The following diagram highlights the modeling and calibration algorithm steps:

At this point the experiment separates into four distinct tracks, one for each individual model being trained. Logistic Regression, Averaged Perceptron, Boosted Decision Tree and Decision Forest are the four algorithm tracks. Each modeling algorithm is fed into the "Sweep Parameters" step. The "Sweep Parameters" step is incredibly powerful. This step will randomly iterate through a number of different parameter configurations for each algorithm and train a model on each. The step will hold back a calibration set of data and then score each parameter configuration. The iteration with the best score is chosen as the configuration to use. The step is configurable as to which metric is used to determine the best configuration; this experiment uses AUC as the selection metric. The following screenshot shows 5 different iterations of the "Logistic Regression" step:

The first four columns represent parameter configurations for the "Logistic Regression" step and the final seven columns measure the performance of each of the five iterations. The first row in the table was chosen as the winning iteration due to its best AUC (against the calibration data) of .744334. This AUC score is likely to change when the model is run against the test data.
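
The R sketch below mimics the idea behind the "Sweep Parameters" step: hold back a calibration set, try several parameter configurations, and keep the one with the best AUC. It sweeps a decision tree's maximum depth via the rpart package purely as a stand-in; it is not the Azure ML implementation, and the depth values and split ratio are assumptions.

    library(rpart)

    # Rank-based (Mann-Whitney) AUC for a vector of scores and 0/1 labels.
    auc <- function(scores, labels) {
      r  <- rank(scores)
      n1 <- sum(labels == 1); n0 <- sum(labels == 0)
      (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
    }

    # Hold back a calibration set from the training data, as the sweep step does.
    calib_rows <- sample(nrow(train_set), size = round(0.2 * nrow(train_set)))
    calib      <- train_set[calib_rows, ]
    sweep_trn  <- train_set[-calib_rows, ]
    sweep_trn$click <- factor(sweep_trn$click)   # rpart classification needs a factor target

    # Try several parameter configurations and keep the one with the best AUC.
    depths    <- c(3, 5, 8, 12)
    sweep_auc <- sapply(depths, function(d) {
      fit <- rpart(click ~ ., data = sweep_trn, method = "class",
                   control = rpart.control(maxdepth = d, cp = 0.001))
      p   <- predict(fit, newdata = calib, type = "prob")[, "1"]
      auc(p, calib$click)
    })
    best_depth <- depths[which.max(sweep_auc)]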

Model Scoring

Once a trained model is selected for each of the four algorithms based on the best AUC score, the experiment will then use that model to score the test data, which enters through the track that begins with "Clicks_Test_Date_Usage3.csv." The "Sweep Parameters" step validates the individual models based on calibration data that is also used in the construction of the model. The "Score Model" step actually scores the test data, which is completely out of sample and cannot contribute bias to the model.

The scored test data will then be fed into the "Evaluate Model" step. This step produces the actual performance metrics, based on the test data, that can be used to compare each model. The step allows two models to be compared at once, but these steps can be daisy-chained to evaluate numerous models. This step will also construct the ROC curve.

Actual ROC curves for all four models:

M1.1 Logistic Regression AUC = .747 (blue)

M1.2 Averaged Perceptron AUC = .741 (red)

M1.3 Boosted Decision Tree AUC = .745 (blue)

M1.4 Decision Forest AUC = .688 (red)


All four models display a good AUC. M1.1 – M1.3 are almost identical. M1.4 appears to be less predictive. To investigate further, each model also produces a confusion matrix:

In reviewing the confusion matrix for each of the four models, the first item that is most apparent is that each model is highly accurate, with each correctly scoring 83%+ of all the test data. However, further investigation shows that while each has a high accuracy, each also has a very low recall. This low recall indicates each model does a poor job of actually finding the positive clicks in the data. M1.4 is so selective it does not predict a single click, but still gets an accuracy of 83%. This overall trend indicates that the models are not giving equal weight to click=0 and click=1; the models do not understand the importance of predicting click=1. From an accuracy standpoint these are very good models, but this project does not define success by predicting people who are not going to click. Success for this project will be predicting people who will click.
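
To see why accuracy alone is misleading here, consider a small hedged R example: on the 83/17 class split described above, a classifier that never predicts a click still scores 83% accuracy while its recall is 0. The function and toy data are illustrative, not project output.

    # Accuracy, precision and recall from predicted vs. actual 0/1 labels.
    classification_metrics <- function(pred, actual) {
      tp <- sum(pred == 1 & actual == 1); fp <- sum(pred == 1 & actual == 0)
      fn <- sum(pred == 0 & actual == 1); tn <- sum(pred == 0 & actual == 0)
      c(accuracy  = (tp + tn) / length(actual),
        precision = if (tp + fp > 0) tp / (tp + fp) else NA,
        recall    = tp / (tp + fn))
    }

    # An "all zeros" model on an 83/17 split: 83% accuracy, 0% recall.
    actual <- c(rep(0, 83), rep(1, 17))
    classification_metrics(pred = rep(0, 100), actual = actual)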

Financials

The final section of the experiment deals with financial scoring of the models. The endgame of this project is to develop a predictive model that our client can use to generate a return on investment. While AUC scores are very good ways to compare the performance of a model, the model will ultimately be chosen based on its ability to generate a return on the investment put into the model.

The financial model will be based on the following assumptions:

• The total budget for an ad serving campaign is $10,000
• Fixed costs for an ad serving campaign will not be considered
• The cost to serve an advertisement = $0.40
• The expected revenue generated by a click-through = $2.50


The following diagram shows the high level financial section of the project:

Azure ML is very extensible and provides the ability to add custom R code to handle scenarios not natively supported. A custom R script has been built to compute the following metrics.

AdServes: The number of advertisements the model recommended serving. This is based on the model scoring a probability > 50%.

CTR: Click-through rate based on the predicted clicks.

Cost: The number of predicted clicks * $0.40, not to exceed $10,000.

Revenue: The number of actual clicks * $2.50.

ProfitLoss: Revenue - Cost.

Margin: ProfitLoss / Cost.
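
The project's actual script is not reproduced in this document. As a hedged sketch only, metrics of this kind could be computed in R roughly as follows, assuming a scored data frame named "scored" with an actual "click" column and a "scored_prob" column (both names are assumptions).

    # Hypothetical re-creation of the financial metrics; not the project's script.
    financial_metrics <- function(scored, budget = 10000,
                                  cost_per_ad = 0.40, rev_per_click = 2.50) {
      # Serve only opportunities scored above 50%, highest probability first,
      # capped by the number of ads the budget can pay for.
      served    <- scored[scored$scored_prob > 0.5, ]
      served    <- served[order(-served$scored_prob), ]
      ad_serves <- min(nrow(served), floor(budget / cost_per_ad))
      served    <- head(served, ad_serves)
      cost      <- ad_serves * cost_per_ad
      revenue   <- sum(served$click) * rev_per_click
      c(AdServes = ad_serves, CTR = mean(served$click), Cost = cost,
        Revenue = revenue, ProfitLoss = revenue - cost,
        Margin = (revenue - cost) / cost)
    }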

The following diagram displays the combined financial metrics for each algorithm:

Strictly based on profit and loss, M1.2 Averaged Perceptron has the largest ROI. Compared to the No Model profit and loss of $517.50, M1.2 gives an increase of $11,865.60 - $517.50 = $11,348.10, or a 2,193% increase. However, note that because the model is so selective only 10,558 ads were served, spending less than half (cost $4,954.40) of the budgeted $10,000. Going back to the confusion matrix for M1.2, it shows that 61,208 of the actual clicks were missed by the model.


Clearly M1 outperforms No Model and provides very good financials. However, this project will continue to iterate the modeling process to see if M1 can be improved upon.

The M2 Modeling Experiment

The second modeling iteration will focus on attribute transformations. M1 did little in the way of transforming attributes. All the attributes are categorical anonymous variables, which somewhat limits what can be done. Several of the attributes have very high cardinality, some with over 2,000 distinct values. Many modeling algorithms, like logistic regression, have difficulties dealing with categorical attributes with vast numbers of distinct values. M2 will transform the high cardinality attributes into conditional probabilities and then run an experiment similar to M1. The performance and financials will then be compared.

Transformation

A common transformation when dealing with high numbers of distinct categories is to transform these categories into conditional probabilities. These attributes will be transformed by computing the conditional probability of a click-through for each distinct value of each high cardinality attribute. The click-through rate (CTR) of each attribute value then becomes a numeric measure for each attribute. Once the CTR for each attribute is computed the original categorical attribute is omitted from the model.

There is one caveat to this process of computing conditional probabilities; the probabilities have to be computed from a time period that is previous to the prediction time period. If the time periods are the same then data snooping bias occurs. This data snooping bias would allow the model training process to cheat and over-fit the data. The model would score very high in training, but very poorly in validation.


The source data consists of 10 days of data. Each day's attributes had to be given a score of the cumulative click-through for all days previous to that day. Since there is no data previous to day 1, day 1 shows a CTR of 0% for all attributes. Day 2 shows the CTR from any data that showed up on day 1. Day 3 was computed from day 1 through day 2, and so on until all 10 days were populated. Under normal circumstances the team would likely have many years of data to work with and the previous-day CTR would be fully populated. However, due to the small number of days (10), the team decided to use the first five days of the data to populate and bring the previous-day CTR metrics up to speed. The data is then filtered to only use the last five days when constructing the training and test sets.

To perform this lagged-period conditional probability computation, the team constructed a series of advanced SQL stored procedures that executed the process in a relational database. The source data consisted of 40 million records, so the team knew this process would be intensive and require the power of a large relational database. Once the data was fully computed it was exported from the relational database and imported into Azure ML for experiment modeling.
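
The team implemented this in SQL stored procedures. Purely as an illustration of the same lagged logic, a hedged R sketch for a single attribute might look like the following; the column names "day", "click" and "Site_id" are assumptions.

    # Previous-day cumulative CTR for one high-cardinality attribute.
    # For each day d the CTR is computed only from rows with day < d, so the
    # prediction period never sees its own outcomes (avoiding snooping bias).
    prev_day_ctr <- function(df, attr, day_col = "day", click_col = "click") {
      out <- numeric(nrow(df))                     # defaults to 0, as on day 1
      for (d in sort(unique(df[[day_col]]))) {
        history <- df[df[[day_col]] < d, ]
        if (nrow(history) == 0) next               # no prior days -> CTR stays 0
        rate <- tapply(history[[click_col]], as.character(history[[attr]]), mean)
        rows <- which(df[[day_col]] == d)
        ctr  <- rate[as.character(df[[attr]][rows])]
        out[rows] <- ifelse(is.na(ctr), 0, ctr)    # unseen values also default to 0
      }
      out
    }

    clicks$PrevDay_Site_ID_CTR <- prev_day_ctr(clicks, "Site_id")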

The following table shows the source attributes that were transformed into previous day conditional probabilities:

Attribute        Distinct Values    New Column
Site_id          2,149              PrevDay_Site_ID_CTR
Site_domain      2,196              PrevDay_Site_Domain_CTR
App_id           2,039              PrevDay_App_Id_CTR
App_domain       116                PrevDay_App_Domain_CTR
Device_id        64,443             PrevDay_Device_Id_CTR
Device_ip        259,413            PrevDay_Device_Ip_CTR
Device_model     4,316              PrevDay_Device_Model_CTR
C14              1,617              PrevDay_C14_CTR

In addition to the conditional probability transformation, two more variables were constructed: PrevDay_Device_ID_Served and PrevDay_Device_IP_Served. These two variables are binary flags that indicate whether or not a Device_Id or Device_IP had ever been served an advertisement prior to the day in question. If the historical CTR for a Device_Id (PrevDay_Device_ID_CTR) was equal to 0, the PrevDay_Device_ID_Served binary flag could be used to indicate whether PrevDay_Device_ID_CTR was 0 because that device never clicks through an ad or because this was the first time the device had ever been served.


Modeling

M2 Modeling Experiment Diagram

The M2 experiment is very similar to the M1 experiment. The only differences are found in the "Project Columns" step, where the original values are filtered out and replaced with their respective _CTR and _Served columns stated above. Every other step is identical.

The following diagram is the output from the “Sweep Parameter” step for Logistic Regression:

The output is similar but slightly different. The Logistic Regression step has been optimized using the new _CTR attributes instead of the high cardinality categorical attributes. The training AUC is slightly better, but not significantly.


Model Scoring

M2 will be scored in the same way as M1, using ROC curves, AUC scores and confusion matrices.

The following diagram displays the ROC curves for the four models:

M2.1 Logistic Regression AUC = .747 (blue)

M2.2 Averaged Perceptron AUC = .746 (red)

M2.3 Boosted Decision Tree AUC = .755 (blue)

M2.4 Decision Forest AUC = .748 (red)

Again all four models display a good AUC. M2.1 – M2.4 are almost identical. M2.4 appears to be significantly better than M1.4. M2.1 is exactly the same as M1.1, and M2.3 comes in with the highest AUC. To investigate further, each model also produces a confusion matrix.

Again it is apparent that each model is highly accurate, with each correctly scoring 83%+ of all the test data. However, further investigation shows that while each has a high accuracy, each also has a very low recall. This low recall indicates each model does a poor job of actually finding the positive clicks in the data. One comparison to make is that M1.4 did not predict a single click, but M2.4 predicted 5,955 clicks. This overall trend indicates that the models are not giving equal weight to click=0 and click=1. The M2 models, like the M1 models, do not understand the importance of predicting click=1.


Financials

The following diagram displays the combined financial metrics for each algorithm in the M2 experiment:

Strictly based on profit and loss, the M2.3 model outperforms the rest, but it is slightly behind M1.2 from the M1 experiment. Again note that because the model is so selective only 10,599 ads were served, spending less than half (cost $4,239.60) of the budgeted $10,000. Going back to the confusion matrix for M2.3, it shows that 61,681 of the actual clicks were missed by the model.

M2 is a very good model, but is fairly similar to M1. M1.2 Averaged Perceptron still has the largest P&L, but M2.3 is just slightly behind. M2.3 Boosted Decision Tree has the largest overall AUC to this point.

The M3 Modeling Experiment

M3 will integrate aspects of M1 and M2 as well as attempt to address the issue of these experiments being so overly selective that the marketing team is unable to spend its entire budget on ad serves.

Sampling

The modeling process does not understand the importance that needs to be given to click=1 over click=0. As stated earlier, the client does not make money predicting which users are unlikely to click on their advertisements. For these models to be useful they must be able to identify who is likely to click on an advertisement. To nudge the modeling process to treat click=1 and click=0 equally, M3 will over-sample click=1 to the point where the distribution of click=0 and click=1 is roughly 50/50 in the training data.

The oversampling process in this experiment repeats the click=1 data until it is roughly equal to the number of click=0 records. To facilitate this process the experiment uses the following SQL union statement.

This SQL is executed in the “Apply SQL Transformation” step of the M3 experiment.
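
As a hedged illustration of the same duplication logic (not the project's SQL), the equivalent oversampling could be expressed in R as follows, using the training data frame from the earlier sketches.

    # Repeat the click=1 rows until the two classes are roughly balanced.
    pos    <- train_set[train_set$click == 1, ]
    neg    <- train_set[train_set$click == 0, ]
    copies <- floor(nrow(neg) / nrow(pos))        # whole repetitions of the positives
    balanced_train <- rbind(neg, pos[rep(seq_len(nrow(pos)), copies), ])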

The following diagram shows the distribution of clicks in the training data before and after the oversampling.

Before Over Sample Distribution: 83/17 Records: 408,143

After Over Sample Distribution: 50/50 Records: 580,823

To bring the distribution up to 50/50, the overall record count of the training data was increased by approximately 172k records.


Modeling

M3 Modeling Experiment Diagram

With the exception of the "Apply SQL Transformation" step, this model is constructed in a similar fashion to M1 and M2.

Model Scoring

M3 will be scored in the same way as M1 and M2, using ROC curves, AUC scores and confusion matrices.

The diagram to the right displays the ROC curves for the four models:

M3.1 Logistic Regression AUC = .747 (blue)

M3.2 Averaged Perceptron AUC = .602 (red)

M3.3 Boosted Decision Tree AUC = .755 (blue)

M3.4 Decision Forest AUC = .744 (red)

Three of the four models display a good AUC, but M3.2 drops off significantly. M3.1, M3.3 and M3.4 appear to have almost the same AUC as their M2 counterparts. Again M3.3 Boosted Decision Tree comes in with the highest AUC. To investigate further, each model also produces a confusion matrix.

Here is where the recall issues with the models are addressed. The recall for all four models is significantly higher than in either the M1 or M2 models. The oversampling forced the models to give click=1 much higher consideration. However, as a result of increasing the recall measure, the precision measure was negatively impacted. The models are finding roughly 70% of the positive clicks in the data, but at the same time they are predicting many more negative clicks as positive.

Financials

The following diagram displays the combined financial metrics for each algorithm in the M3 experiment:

Strictly based on profit and loss, the M3.3 Boosted Decision Tree model outperforms the rest. Because the data has been over-sampled, many more records have been chosen, which allows the marketing budget to be fully utilized. Notice each algorithm has served exactly 25,000 ads. This financial model sorts the data from the highest probability of a click-through to the lowest, where the probability >= 50%. The model selects the highest probability ad serves until there are no more ad serves to choose from or it has reached its $10,000 limit.
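
A hedged R sketch of that selection rule, with the same assumed "scored" data frame and "scored_prob" column as before:

    # Rank scored ad-serve opportunities by predicted click probability and take
    # as many as the $10,000 budget allows (25,000 at $0.40 per serve), keeping
    # only those with a probability of at least 50%.
    candidates <- scored[scored$scored_prob >= 0.5, ]
    candidates <- candidates[order(-candidates$scored_prob), ]
    to_serve   <- head(candidates, 25000)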


M3 is a very good experiment, with M3.3 Boosted Decision Tree outperforming all models with an AUC of .755 and a P&L of $20,082.50. Using the M3.3 model the expected return on investment over No Model is roughly 3,800%.

Modeling Conclusion

M1, M2 and M3 all showed significant improvements over No Model. M1 and M2 were very accurate models, but they also proved very selective and failed to detect many of the true positive clicks. Oversampling was used to decrease the models' selectivity and enable them to select far more positive clicks. The oversampling was able to increase the M3.3 model's profit and loss by almost 100% over M2.3.

Using the M3.3 model, a marketing team could expect to invest $10,000 into an ad-serving campaign and generate roughly $20,000 in profit from click-through revenue.


Find out how you can start using these tools and more with training and consulting services from Pragmatic Works.

www.pragmaticworks.com

904-638-5743

[email protected]