Maximizing ROI - Salford Systemsmedia.salford-systems.com/.../maximizing_roi_tutorial.pdf · 2015-09-17 · Maximizing ROI Using State-of-the-Art Data Science Techniques ... need

Maximizing ROI Using State-of-the-Art Data Science Techniques

How to download the data*:

1) Open the Walmart Store Sales Forecasting challenge on the Kaggle website

https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting

2) In the navigation pane on the left, click ‘Data’

3) Download the 3 files: train.csv, features.csv, stores.csv

4) At this point, you will be prompted to accept the official competition rules in order to access the data (you may

need to create a Kaggle account)

5) Combine these 3 files, merging by store number and date (this may require a small amount of programming)

6) Subset the full dataset by including only Department 1

*Typically, we would provide you with the final dataset to start the analysis, but the Kaggle competition rules prevent us

from distributing any data without your electronic acceptance of the terms of use. We apologize for the inconvenience.

How to replicate the analysis:

1) Open SPM®:

https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting

2) Click the folder shortcut to open a data file:

3) Locate Dept1.csv (or your chosen file name) and click Open.

The Activity Window, pictured below, will appear:

This file contains the 1,395 records of weekly sales at Walmart in department 1 of all 45 included stores. In a typical

Walmart, department 1 corresponds to candy, food, and tobacco. On the left side of the Activity Window, you can see

the variables in the data file. These include a case ID, consumer price index, date, department number, fuel price, a

holiday indicator, five markdowns, size of the store, store number, the temperature outside, a test partition variable,

type of store, unemployment rate, and the weekly sales.

4) At the bottom of the Activity Window, click the View Data button:

The spreadsheet pictured below will appear with the data:

Note that variables such as size and type are consistent across each store. The DATE variable has been transformed to

be strictly numeric. Additionally, TEST was created to separate the data into an 80% training sample and a 20% testing

sample. Viewing the data in this way allows for a quick understanding of the variables and any possible red flags (such as

missing values).

5) Return to the Activity Window via the shortcut :

There are several other buttons here to run statistics on the data, display frequency distributions, and most importantly,

build a model with the data.

6) Click Model:

The Model Setup window, pictured below, will appear:

First, you will run a standard linear regression on the data to see how well it performs.

7) In the Analysis Method drop-down menu, select Regression. In the Variable Selection pane, check WEEKLY_SALES in

the Target column. In the Predictor column, select all of the other variables except for CASEID, DATE, DEPT, and TEST

(as these will not be predictive). Your Model tab should now match the picture above.

8) Click the Testing tab:

In traditional regression, the entire data sample is used to build the model. However, you will use 80% of the data to

build the model and 20% to test the model. This will provide an accurate model performance and facilitate comparisons

to the data science techniques used later on. Typically, this partition is chosen randomly but since we have time-ordered

data, the 80% used to build the model must have occurred in time before the 20% testing sample. This is where the TEST

variable comes in.

9) Select “Variable separates learn, test, (holdout)” and select TEST from the list. TEST simply holds a value of 0 if the

record falls in the 80% training sample or 1 if the record falls in the 20% testing sample.

10) Click Start to build the model.

This window gives the results of the regression model predicting weekly sales. The model is fairly poor with an MAD

(mean absolute deviation) of 5,514 and an R2 of 36%. These results suggest that there are nonlinearities and interactions

in the data that cannot be captured in a traditional regression model. Instead, you will attempt to build a better model

with CART decision trees and TreeNet gradient boosting, two modern data mining techniques that will address these

common issues.

11) Re-open the Model Setup window via the shortcut :

12) Change the Analysis Method to CART. Be sure that the Analysis Type is set to Regression and that the same variables

are selected for the model (remember to leave off CASEID, DATE, DEPT, and TEST).


The default testing method in CART is cross-validation, but you will stay with the 80/20 partition by TEST in order to

compare your results to the regression model.

14) Click the Limits tab:

To keep your regression tree under control, you need to impose limits on how big it can grow. This is done by

designating a minimum number of records that must fall in both parent and terminal nodes.

15) In the Unweighted column, enter 70 for parent node minimum cases and 20 for terminal node minimum cases.

16) Click Start.

The CART Navigator window will appear once the model is built:

The CART Navigator window is a simple, easy-to-use visual presentation of your CART model. In the top pane, you will

see a tree structure. This tree corresponds to the selected tree in the model sequence, pictured in the bottom pane.

Here, you will see all of the different tree sizes that CART constructed plotted against their relative error. You can click

through the sequence to see trees of different sizes.

Note the R2 value of 49% for the test sample in the table to the right of the model sequence. While this is an

improvement on the regression model, it still isn’t great. But, you can still use CART to draw insights from your data. For

example, a quick CART tree run can tell you which variables play the biggest part in weekly sales.

17) Click the Summary button at the bottom of the Navigator window:

This brings up a window of numerous summary statistics for the selected CART tree:

Notice the MAD of the model is around 4,300.

18) Click the Variable Importance tab:

This is where each variable in the model is recorded with its relative importance score. The most important variable is

markdown 3, followed by size of store and consumer price index. This information tells you that there may be value in

markdown 3 that should be explored further. An opportunity to maximize ROI may lie in optimizing this particular

promotion throughout the year, as it has a significant effect on weekly sales.

19) Re-open the Model Setup window via the shortcut :

20) Switch the Analysis Method to TreeNet. Check that the variables are still correctly selected and Regression remains

as the Analysis Type.


Again, you will use the 80/20 partition by TEST for comparison.

22) Finally, click the TreeNet tab:

There are many different parameters that come in handy during the TreeNet model building process. Before playing

with these, run the default model to observe the performance and evaluate which parameters should be tweaked.

23) Click Start.

The TreeNet Output window shows a graph of the number of trees built against the MAD. The optimal model is denoted

by the green bar at 200 trees with an MAD of 4,323 and an R2 of 46%. This isn’t any better than the CART model.

Because the optimal number of trees is equal to the number of trees you built, this number needs to be increased to

allow the model performance to develop as the ensemble grows.

24) Re-open the Model Setup window and click the TreeNet tab:

25) In the Learn Rate box, enter 0.1. This parameter controls how fast (or slow) the model converges to the optimal

solution. Change the number of trees to build from 200 to 500 to allow the model to develop.

26) Click Start.

The optimal model now has an MAD of 3,263 and an R2 of 62%, a huge improvement on the original regression model.

Now that you have a decent model to predict weekly sales, you can observe how the variables are affecting your sales.

27) Click the Summary button:

Again, you can see summary statistics for the optimal model:

28) Click the Variable Importance tab:

Size, markdown 3, and markdown 2 are the most important variables. You will now create dependency plots to see

exactly how these top predictors are contributing.

29) Return to the TreeNet Output window and click Create Plots:

The Create Plots dialog box will appear with plot options:

30) Check the box next to ‘One variable dependence’ plots and check the box next to ‘Use top 3 variables’. Click Create

Plots.

The resulting window will appear with an expandable list of the created plots.

31) Click Show All to display the figures.

These dependency plots show you how each of your predictors contributed to the response variable across all of its

values. For example, markdown 3 isn’t necessarily influential across all values. In particular, contribution is much lower

after a value of 25,000. Now that you have your final model and see that markdown 3 plays a significant part in the

model at certain values, you will see just how much this markdown affects sales at different times throughout the year.

32) To generate predictions for the TreeNet weekly sales model, return to the TreeNet output window and click Score:

The Score Data dialog box will appear:

33) Check the box next to “Save Scores As” and save the file as Dept1_score1.csv. Also check the boxes below next to

“Save all model related values” and “Save all variables in score data set”. Click Score.

A score results window will appear with the overall performance of the predictions. You scored the model back to the

original dataset, so it is not surprising that the scoring was successful.

34) Open this file in Excel to observe the predictions:

Next, we want to see what would happen if markdown 3 was not an included promotion throughout the year. Outside

of the software, we simply set all values of markdown 3 to zero. We repeated the scoring process on this new dataset to

get a second spreadsheet of predictions (the true value of weekly sales is irrelevant here):

Finally, we combined the important columns of the spreadsheet and defined a new variable, DIFFERENCE_RATE, to

better illustrate the difference markdown 3 made at different times of the year:

STORE DATE WEEKLY_SALES DIFFERENCE_RATE DIFFERENCE RESPONSE1 MARKDOWN3 RESPONSE2

45 12/23/2011 53942.21 1.989 36952.258 55534.05 1962.19 18581.79

25 12/23/2011 53439.71 1.934 38389.2892 58237.9 1689.22 19848.61

24 12/23/2011 49692.85 1.613 31506.922 51039.3 1633.67 19532.37

35 12/23/2011 37706.15 1.572 26009.4305 42557.3 1429.48 16547.87

16 12/2/2011 11075.5 1.407 15717.2215 26885.89 944.52 11168.67

18 12/23/2011 63868.79 1.329 37871.5417 66363.26 2083.28 28491.72

19 12/23/2011 66366.16 1.289 36505.0922 64826.1 1902.55 28321.01

14 12/23/2011 101846.1 1.198 38311.0907 70284.09 2127.77 31973

23 12/23/2011 114300.6 1.044 37025.0772 72498.32 1683.17 35473.25

27 12/23/2011 78431.1 1.032 29698.0321 58487.97 1438.33 28789.94

39 12/23/2011 51349.28 1.012 25832.6383 51351.53 1339.1 25518.89

20 12/23/2011 93745.04 0.99 38090.2178 76558.66 2145.46 38468.44

10 12/23/2011 134217.7 0.978 45116.2228 91260.19 2432.01 46143.97

15 12/23/2011 43921.97 0.973 22580.9871 45781.64 1120.42 23200.65

22 12/23/2011 60924.13 0.956 25365.781 51889.72 936.62 26523.94

8 12/23/2011 36655.67 0.896 18440.7603 39011.4 891.94 20570.64

Not surprisingly, the weeks in which markdown 3 yielded the greatest difference in sales predictions were around

Christmas. Store managers could use this information to ensure the promotion is run during the same time the following

year at the same markdown value.

On the other hand, you can look at the weeks during the year where markdown 3 made the smallest difference:

STORE DATE WEEKLY_SALES DIFFERENCE_RATE DIFFERENCE RESPONSE1 MARKDOWN3 RESPONSE2

7 6/8/2012 6971.47 0 0 7789.085 0.12 7789.085

7 6/15/2012 7748.14 0 0 8007.492 0.65 8007.492

7 6/22/2012 7967.26 0 0 8594.52 0.09 8594.52

7 7/13/2012 9253.35 0 0 9099.11 0.35 9099.11

8 2/24/2012 13457.69 0 0 14420.63 1 14420.63

8 3/2/2012 12947.8 0 0 12870.26 0.6 12870.26

8 9/28/2012 14572.76 0 0 10041.95 0.55 10041.95

9 2/24/2012 10386.62 0 0 8855.457 1 8855.457

9 3/23/2012 11993.86 0 0 12085.34 0.68 12085.34

9 6/8/2012 9289.55 0 0 9155.973 0.3 9155.973

9 8/3/2012 8765.82 0 0 9280.226 0.22 9280.226

9 9/28/2012 13772.7 0 0 7266.772 0.55 7266.772

19 12/30/2011 22068.26 -0.001 -12.3187 21909.38 210.34 21921.69

19 7/13/2012 15529.77 -0.001 -9.9487 14744.39 35.88 14754.34

2 6/22/2012 24155.67 -0.001 -30.7258 24756.04 60.99 24786.76

2 8/3/2012 22851.99 -0.001 -18.8834 22394.65 43.02 22413.53

21 1/20/2012 13821.16 -0.001 -13.3914 14075.11 3.46 14088.5

21 7/27/2012 10630.35 -0.001 -13.3914 11127.64 3.43 11141.03

5 7/20/2012 9513.66 -0.001 -13.3914 9858.215 3.39 9871.606

7 8/10/2012 8698.9 -0.001 -13.5279 10811.5 55.79 10825.03

8 7/27/2012 9322.04 -0.001 -13.3914 9288.481 3.55 9301.872

1 9/7/2012 18322.37 -0.002 -45.9031 19789.22 50.94 19835.12

13 8/17/2012 35524.12 -0.002 -65.8643 33590.01 40.96 33655.88

With corresponding cost information, store managers could decide whether it is worth it to run the promotion at all

during these weeks.

Documents

Maximizing ROI - Salford Systemsmedia.salford-systems.com/.../maximizing_roi_tutorial.pdf · 2015-09-17 · Maximizing ROI Using State-of-the-Art Data Science Techniques ... need