Master’s Project Report
Sales Prediction of 111 Weather Sensitive Products in 45
Walmart Stores using Machine Learning Techniques and
Discussion on its Implications for Inventory Policy
by
Minchao Lin
December 10, 2015
Contents
1 Motivation
2 Objectives
3 Data Description
3.1 Training Data and Test Data
3.2 Data Features
3.3 Feature Engineering
3.4 Feature Correlation
4 Models and Techniques
4.1 Performance Metric
4.2 Models
4.2.1 Stepwise Linear Regression
4.2.2 K-Nearest Neighbors Search
4.2.3 Ensemble Learning
4.2.4 Combinations of Models
5 Implications
5.1 Cross Validation
5.2 Evaluating Forecasts
5.3 Standard Deviation of Forecast Errors and its Implications for Safety Stock
6 Conclusion
7 References
8 Appendices
1 Motivation
Demand forecasting and inventory control are two of the most important aspects of
supply chain management. An accurate demand forecast helps replenishment managers set the
right level of inventory and avoid both stockouts and overstock. To forecast demand better,
we need to take into account the factors that may contribute significantly to demand
variability. For a retail store, extreme weather events such as hurricanes and blizzards can
have a huge impact on sales at the store and product level. Accurately predicting the sales of
potentially weather-sensitive products around the time of major weather events is therefore
essential for adjusting inventory in time. In addition, the difference between predicted and
realized demand provides further information for setting inventory policy, such as the level
of safety stock.
2 Objectives
The objectives of this project are two-fold. The first objective is to fit an effective model to
predict the sales of 111 potentially weather-sensitive products that are affected by snow and rain
in 45 Walmart retail stores. For each product, the task is to predict the units sold in
a window of ±3 days surrounding each storm. Model performance is evaluated with the
Root Mean Squared Logarithmic Error (RMSLE) and compared with the results of the 485 teams in
the online Walmart recruiting competition. The training data used to fit the model includes
actual product demand and actual weather data, whereas the actual demand in the test data,
which is used to evaluate the predictions, is not provided. The only way to measure the
effectiveness of a model is to submit the predicted demand online and obtain its RMSLE.
Because the unknown test demand limits further analysis of the inventory policy for these
products, a second objective is introduced: to make full use of the training data by applying
the most effective model from the previous steps via cross validation, comparing the predicted
and actual demand for each product, and then analyzing the safety stock for each product.
3 Data Description
3.1 Training Data and Test Data
Sales data are provided for 111 products whose sales may be affected by the weather, such as
milk, bread and umbrellas. These 111 products are sold in stores at 45 different Walmart locations.
Each product id is provided, but no name or description. The competition teams are reminded
that some products are similar but have different ids in different stores. The 45 store
locations are covered by 20 weather stations, so some stores share a weather station. The full
observed weather covering both the training data and test data is provided. The training data
contains 4,617,600 observations and the test data contains 526,917 observations.
In the following graph, which covers the 20 weather stations, the green dots show the training
set days, the red dots show the test set days, and event=True marks the days with storms.
Figure 1. Training set days and test set days for 20 weather stations.¹
¹ “Data - Walmart Recruiting II: Sales in Stormy Weather | Kaggle,” accessed December 9, 2015, https://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather/data.
3.2 Data Features
The features in the training data provided include:
date
store id
item id
number of units sold
The features in the weather data provided include:
date
weather station id
dew point temperature
wet bulb temperature
heating degree days
cooling degree days
time for sunrise
time for sunset
significant weather types
snowfall in inches
water equivalent of rainfall and melted snow
average station pressure
average sea pressure
resultant wind speed
resultant wind direction
average wind speed
3.3 Feature Engineering
To better describe the underlying structure in the data, new features are created based on
observation and analysis of the original data. It is reasonable to assume that the sales on a
given day are related to the position of that day in the month, in the year, or in the whole
timeframe of the dataset, so the new features generated from the date include day in month, month,
day in year, year, a numeric value for each date, weekday, and whether the day is a holiday.
In addition, the data show that sales vary significantly from month to month.
Thus, the monthly average sales for each product is calculated and serves as another new feature.
Based on the monthly average sales, a binary variable identifying whether the monthly average
sales equals zero is created. Indicating whether the same month has zero sales in each year for a
product gives the model further information about the demand during that month, which is
intended to improve the accuracy of the model.
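The date-derived features and the monthly average described above can be sketched as follows. This is a minimal Python illustration, not the code used in the project: the function names, the `origin` date used for the numeric date, the `holidays` set, and the choice to key the monthly average by (year, month) are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def date_features(d, holidays=frozenset(), origin=date(2012, 1, 1)):
    """Calendar features for one date.

    `holidays` and `origin` are illustrative assumptions, not values from the report.
    """
    return {
        "date_number": (d - origin).days,      # position in the whole timeframe
        "month": d.month,
        "day_in_month": d.day,
        "year": d.year,
        "weekday": d.isoweekday(),             # 1 = Monday ... 7 = Sunday
        "is_holiday": d in holidays,
        "day_in_year": d.timetuple().tm_yday,
    }


def monthly_average_sales(records):
    """records: iterable of (date, units) for one product in one store.

    Returns {(year, month): mean units sold per recorded day}.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for d, units in records:
        entry = totals[(d.year, d.month)]
        entry[0] += units
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def zero_month_flags(monthly_average):
    """Binary feature: True when the monthly average sales equals zero."""
    return {key: avg == 0 for key, avg in monthly_average.items()}
```

Each training row would then carry the calendar features of its date together with the monthly average and zero-month flag looked up by that date's (year, month).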
Temperature can be another related feature because very high or very low temperatures may influence
a customer’s decision to go out or stay home. In addition, “feels like” temperature may be a better
indicator. Since “feels like” temperature is related to the moisture in the air, two new features
describing the moisture in the air in two different ways are created. The first feature is the
difference between the average temperature and the dew point temperature, which represents how
far the moisture in the air is from saturation. The second feature is the difference between the
average temperature and the wet bulb temperature. This difference reflects the relative humidity
of the air: the larger the difference, the lower the relative humidity.
Precipitation and average wind speed are included directly without further processing. The
snowfall feature is eliminated because it contains too many undefined values (NaN or
empty cells). Resultant wind speed is not included either because it is closely correlated with
average wind speed. The remaining weather features are ignored either because they consist of
many different text entries that are hard to describe numerically or because
they are not intuitively related to product sales. These features are heating degree
days, cooling degree days, time for sunrise, time for sunset, significant weather types, average
station pressure, average sea pressure and resultant wind direction.
Because some products have many days with zero sales, I assume that the number of days with zero
sales before or after each day may also influence the sales on that day. Three new features
are created based on this assumption: the number of continuous days with zero sales before today,
the number of continuous days with zero sales after today, and the minimum of these two
features. Besides the number of days with zero sales, the average sales before or after each
day may also affect the sales on that day. Thus, I created one more variable for the
average sales over the seven days before today, and another for the average sales over the seven
days after today. If the seven days before (or after) a date are not all in the training data,
which means some of those dates are in the test data, the average is taken over only the sales
available in the training data.
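A small Python sketch of these run-length and window features is given below. It assumes one chronologically ordered list of daily sales per product and store, with `None` standing for days whose sales are unknown because they fall in the test data; the function names and the `None` convention are assumptions for illustration only.

```python
def zero_run_before(sales):
    """For each day, the number of continuous zero-sale days immediately before it."""
    counts, run = [], 0
    for units in sales:
        counts.append(run)
        # A zero extends the run; anything else (including unknown) resets it.
        run = run + 1 if units == 0 else 0
    return counts


def zero_run_after(sales):
    """Same count, looking forward in time instead of backward."""
    return zero_run_before(sales[::-1])[::-1]


def trailing_mean(sales, window=7):
    """Mean of the known sales in the `window` days before each day.

    Days marked None are skipped, so the mean uses only the available
    training values; the result is None when no value is available.
    """
    means = []
    for i in range(len(sales)):
        known = [s for s in sales[max(0, i - window):i] if s is not None]
        means.append(sum(known) / len(known) if known else None)
    return means


def leading_mean(sales, window=7):
    """Mean of the known sales in the `window` days after each day."""
    return trailing_mean(sales[::-1], window)[::-1]
```

The third run-length feature is then the element-wise minimum of the two counts, for example `[min(b, a) for b, a in zip(zero_run_before(s), zero_run_after(s))]`.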
To conclude, features that are used to build models are:
1. numeric number for the date
2. month
3. day in month
4. year
5. weekday
6. is holiday or not
7. day in year
8. monthly average sales
9. is a month having zero sales or not
10. precipitation
11. average wind speed
12. difference between average temperature and dew point temperature
13. difference between average temperature and wet bulb temperature
14. number of continuous days with zero sales after today
15. number of continuous days with zero sales before today
16. minimum of the number of continuous days with zero sales before or after today
17. average sales seven days before today
18. average sales seven days after today
3.4 Feature Correlation
Because multiple variables are used to build the model, a multicollinearity problem may
arise if these variables are not independent. As a first step towards model specification, it is
useful to identify any dependencies among the predictors. The correlation matrix is a
standard measure of the strength of pairwise linear relationships. The following table gives the
correlation coefficient (R value) for each pair of numeric variables:
Variables 1 2 3 4 5 6 7 8 9 10
1 1 0.0066 -0.038 0.035 -0.11 -0.12 -0.29 -0.19 -0.24 0.015
2 0.0066 1 0.027 0.056 -0.20 0.067 -0.35 -0.40 -0.42 0.82
3 -0.038 0.027 1 0.12 -0.37 -0.027 0.023 0.029 0.049 0.033
4 0.035 0.056 0.12 1 0.24 0.020 0.10 -0.071 0.0011 0.064
5 -0.11 -0.20 -0.37 0.24 1 0.040 0.23 0.13 0.18 -0.16
6 -0.12 0.067 -0.027 0.020 0.040 1 -0.047 -0.053 -0.047 -0.035
7 -0.29 -0.35 0.023 0.10 0.23 -0.047 1 0.17 0.58 -0.25
8 -0.19 -0.40 0.029 -0.071 0.13 -0.053 0.17 1 0.58 -0.36
9 -0.24 -0.42 0.049 0.0011 0.18 -0.047 0.58 0.58 1 -0.35
10 0.015 0.82 0.033 0.064 -0.16 -0.035 -0.25 -0.36 -0.35 1
Table 1. R value between each pair of numeric variables
Variables 1 to 10 represent the features: numeric date, monthly average sales, precipitation,
average wind speed, average temperature minus dew point temperature, average
temperature minus wet bulb temperature, number of continuous days with zero sales after
today, number of continuous days with zero sales before today, and minimum value of the
previous two features.
From the table, we observe that only the number of continuous days with zero sales after today
and the number of continuous days with zero sales before today have a moderate correlation with
the minimum of the two. These moderate correlations are handled by
the ensemble methods, where only a subset of features is selected to generate each decision tree.
The other R values show little correlation between the remaining pairs of features.
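For reference, the R value between two feature columns is the Pearson correlation coefficient. The following Python sketch shows the computation for plain lists; the function names are illustrative and the sketch is not the routine used to produce Table 1.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Symmetric matrix of pairwise R values for a list of feature columns."""
    return [[pearson_r(ci, cj) for cj in columns] for ci in columns]
```

A constant column has zero variance, so `pearson_r` would divide by zero for it; such columns would need to be dropped before building the matrix.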
Besides pairwise correlation, relationships among arbitrary feature subsets may also cause a
multicollinearity problem. To diagnose multicollinearity, we can calculate the variance inflation
factor (VIF). The VIF quantifies the severity of multicollinearity in an ordinary least squares
regression analysis and is calculated as:
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$
where $R_i^2$ is the coefficient of determination obtained by regressing feature $i$ on the other
features. When the variation of feature $i$ is largely explained by a linear combination of the
other features, $R_i^2$ is close to 1 and the VIF for that feature is correspondingly large. A rule
of thumb is that multicollinearity is high if the VIF is greater than 10. The VIF for each of the
ten variables above is:
Variables 1 2 3 4 5 6 7 8 9 10
VIF 1.20 3.65 1.27 1.18 1.45 1.05 1.91 1.84 2.44 3.24
Table 2. VIF for each variable
The above values show that monthly average sales and the minimum of the continuous days of
zero sales before or after today have the two highest VIFs, but their values are still far below the
threshold of 10. Thus we conclude that no significant multicollinearity exists among the
variables.
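One convenient way to obtain all VIFs at once relies on a standard identity: the VIF of feature i equals the i-th diagonal entry of the inverse of the feature correlation matrix. The Python sketch below uses that identity with a small Gauss-Jordan inversion. It is an illustration under that assumption, not the procedure used to produce Table 2, and the function names are hypothetical.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (perfect multicollinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif_from_correlation(corr):
    """VIF of each feature, read off the diagonal of the inverse correlation matrix."""
    inverse = invert(corr)
    return [inverse[i][i] for i in range(len(corr))]
```

For two features with correlation 0.6, each VIF is 1 / (1 - 0.6²) = 1.5625, which matches the formula above because with a single other feature R_i² is just the squared pairwise correlation.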
4 Models and Techniques
4.1 Performance Metric
For a regression problem, model performance is quantified by measuring the distance between the
estimated outputs and the actual outputs. The Mean Squared Error penalizes large differences
more heavily because the differences are squared. If we want to reduce the penalty on large
differences, we can log-transform the quantities first. Introducing the logarithm balances the
emphasis on small and large prediction errors. For the Walmart recruiting competition,
submissions are evaluated with the Root Mean Squared Logarithmic Error (RMSLE):
$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i + 1) - \log(a_i + 1)\right)^2}$$
Where:
n is the number of observations in the test set
$p_i$ is the predicted count
$a_i$ is the actual count
log(x) is the natural logarithm
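The metric is straightforward to compute locally, for example on held-out training data. A minimal Python sketch follows; the function name `rmsle` is chosen here for illustration.

```python
import math


def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    # log1p(v) computes log(v + 1), which is defined for zero sales.
    squared = sum((math.log1p(p) - math.log1p(a)) ** 2
                  for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(predicted))
```

Because the error is taken on the log scale, predicting 1 unit when 0 were sold costs about as much as predicting 201 units when 100 were sold, which is the balancing effect described above.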
4.2 Models
4.2.1 Stepwise Linear Regression
Stepwise linear regression creates a linear model and automatically adds or removes terms
based on their statistical significance in the regression. The method begins with an initial
model and then compares the explanatory power of incrementally larger and smaller models
using forward selection and backward elimination. Specifically, at each step, the p-value of an
F-statistic is computed to test the model with and without a potential term. If a term is not
currently in the model, the null hypothesis is that the term would have a zero coefficient if added
to the model. If the null hypothesis is rejected, the term with the smallest p-value
among all terms whose p-values are below an entrance tolerance is added to the model.
Conversely, if a term is already in the model, the null hypothesis is that the term has a zero
coefficient. If there is no significant evidence to reject the null hypothesis, the term with
the largest p-value among all terms in the model whose p-values exceed an exit
tolerance is removed from the model.² Because terms are added and removed one at a time,
stepwise models are locally optimal but may not be globally optimal.
² “Create Linear Regression Model Using Stepwise Regression - MATLAB Stepwiselm,” accessed December 10, 2015, http://www.mathworks.com/help/stats/stepwiselm.html.
For this method, five stepwise models were built from different combinations of variables
(the feature numbers correspond to the list at the end of Section 3.3). The first
four models are listed below:
RMSLE 1 2 3 4 5 6 8 9 10 11 14 15 16 17 18
0.12995 √ √ √ √ √
0.11892 √ √ √ √ √ √ √
0.13218 √ √ √ √ √ √ √ √ √ √ √ √ √
0.19076 √ √ √ √ √ √ √ √ √ √ √ √ √ √ √
Table 3. Stepwise Linear Regression Models
The best model in the table is the second one, with an RMSLE of
0.11892. The results show that having more features does not necessarily improve the
model. Thus, instead of creating more features, the focus was shifted from the predictor variables
to the response variable. Since the performance metric for the Walmart recruiting online
competition applies a log transformation to the predicted and actual values before taking their
difference, a log transformation is applied to the response values (i.e. units sold for
each item in each store) in the training data in an attempt to improve the
prediction models. Because many response values are zero, for which the logarithm is undefined,
log(1 + x) is applied to each response value. The best result is as follows:
RMSLE 1 2 3 4 5 6 8 9 10 11 14 15 16 17 18
0.10477 √ √ √ √ √
Table 4. Stepwise Linear Regression Models with log-transformed response variable
The above result shows that log transformation of the response value in the training data does
improve the performance (RMSLE of 0.10477 versus 0.11892). However, even with log-transformed
response values, having more features does not necessarily improve the model. The
best stepwise linear regression model from above ranks 94/485.
Figure 2. Ranking of Stepwise Linear Regression Model
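The log transformation of the response used for Table 4 can be sketched in Python as follows. A model trained on log(1 + x) produces predictions on the log scale, so they have to be mapped back to units before submission. The function names are illustrative, and clipping negative back-transformed values at zero is an assumption of this sketch rather than something stated in the report.

```python
import math


def to_log_scale(units_sold):
    """Transform responses with log(1 + x) before fitting the model."""
    return [math.log1p(u) for u in units_sold]


def from_log_scale(log_predictions):
    """Map model outputs back to units with exp(y) - 1.

    Negative results are clipped to zero (an assumption of this sketch),
    since a negative number of units sold is not meaningful.
    """
    return [max(0.0, math.expm1(y)) for y in log_predictions]
```

Fitting on the transformed response means the least squares criterion is minimized on the same scale that the RMSLE measures, which is one plausible reason for the improvement reported above.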
4.2.2 K-Nearest Neighbors Search
K-nearest neighbors search finds the k closest points in X for each point in Y. The predicted
value is usually calculated as the average of those k closest points or as their weighted average
using inverse distance weights. Two different search methods can be used. The
exhaustive search method computes the distance from each query point to every point in X, ranks
the distances in ascending order, and returns the k points with the smallest distances. The
Kd-tree search method divides the data into nodes with a certain bucket size based on coordinates.
The k closest points are first found within the node that the query point in Y belongs to. Points
in all other nodes that lie within the distance between those k points and the query point are
then considered as well. Using a Kd-tree for large data sets can be much more efficient than
exhaustive search because only a subset of the distances is calculated. Distances can also be
determined with various metrics. The most common distance metric is Euclidean distance. The
other distance metrics tested later in this section are correlation distance,
Spearman distance, cosine distance, and Hamming distance. Correlation distance is
calculated as one minus the sample linear correlation between observations, which are treated as
sequences of values. Spearman distance is calculated as one minus the sample Spearman’s rank
correlation between observations, which are treated as sequences of values. Cosine distance is
calculated as one minus the cosine of the included angle between observations, which are treated
as vectors. Hamming distance is calculated as the percentage of coordinates that differ.³
Thus, the parameters to vary include the nearest neighbors search method, the method for
calculating the predicted value from the closest neighbors, the number of closest neighbors, and
the distance metric. The default setting in Matlab is followed to choose the search method:
exhaustive search is used when the number of columns of X is more than 10, and Kd-tree search
is used otherwise.
For the exhaustive search method, all 18 predictors listed in Section 3.3 are included. Different
distance metrics are tested first with the number of closest neighbors fixed at 10.
The results are as follows:
Distance metric RMSLE
Euclidean distance 0.11189
Correlation distance 0.14171
Spearman distance 0.18862
Cosine distance 0.14401
Hamming distance 0.12848
Table 5. Testing Distance Metrics.
From the table, we see that Euclidean distance gives a clearly lower RMSLE than the other distance
metrics. Thus, for the next step, Euclidean distance is used as the distance metric. The number
of closest neighbors is still set to 10. Yet instead of using the mean value of the 10 closest
neighbors, the weighted average of the k closest points using inverse distance weights is
used.
³ “Classification Using Nearest Neighbors - MATLAB & Simulink,” accessed December 10, 2015, http://www.mathworks.com/help/stats/classification-using-nearest-neighbors.html.
With inverse distance weights, the predicted value is defined as
$$u(x) = \frac{\sum_{i=1}^{N} w_i(x)\, u(x_i)}{\sum_{i=1}^{N} w_i(x)}$$
where $w_i(x)$ is defined as
$$w_i(x) = \frac{1}{d(x, x_i)^p}$$
The result is as follows:
Ways to calculate predicted values RMSLE
Arithmetic mean 0.11189
Weighted mean with inverse distance weights (𝑝 = 1) 0.10341
Weighted mean with inverse distance weights (𝑝 = 2) 0.10473
Weighted mean with inverse distance weights (𝑝 = 3) 0.10732
Weighted mean with inverse distance weights (𝑝 = 7) 0.11666
Table 6. Testing ways to calculate predicted values
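The inverse distance weighted prediction compared in Table 6 can be illustrated with a short exhaustive-search sketch in Python. It uses Euclidean distance and is not the MATLAB routine used in the project. The function name and the handling of exact matches (distance zero, where the weight would be infinite) are assumptions of this sketch.

```python
import math


def knn_predict(train_x, train_y, query, k=10, p=1):
    """Inverse-distance-weighted k-nearest-neighbor prediction (exhaustive search)."""
    # Distance from the query to every training point, smallest first.
    nearest = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))[:k]
    # If the query coincides with training points, average their responses
    # instead of dividing by a zero distance (assumption of this sketch).
    exact = [y for d, y in nearest if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d ** p for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)
```

With p = 1 a neighbor twice as far away gets half the weight; larger p concentrates the weight on the very closest neighbors, which moves the prediction toward a 1-nearest-neighbor estimate.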
The above table shows that the weighted mean with inverse distance weights and p = 1 gives the
best RMSLE. In the next step, this way of calculating the predicted values is kept and
different numbers of closest neighbors for each point in Y are tested. Let k denote the
number of closest neighbors. The results are as follows:
K RMSLE
3 0.11008
10 0.10341
40 0.10215
60 0.10193
80 0.10198
100 0.10200
Table 7. Testing K values.
For the Kd-tree search method, only predictors related to time are included. These are
features 1, 2, 3, 4, 5, and 7 in the list at the end of Section 3.3.
K RMSLE
20 0.10182
60 0.10126
70 0.10136
Table 8. Kd-tree search method
Figure 3. Ranking of K-Nearest Neighbors Search
To conclude, the best k-nearest neighbor model uses Euclidean distance as the
distance metric, uses the weighted mean with inverse distance weights and p = 1 to predict the
response value, uses only variables related to time (numeric date, month, day in month, year, day
in year, and weekday) as the predictor variables, and uses 60 as the number of closest
neighbors. Its RMSLE is 0.10126, which ranks 66/485 in the
competition.
4.2.3 Ensemble Learning
Ensemble methods use multiple learning algorithms to obtain better predictive performance than
could be obtained from any of the constituent learning algorithms alone.⁴ Decision trees, neural
networks and other machine learning algorithms are commonly used as the constituent
learners. A decision tree builds a regression or classification model in the form of a tree
structure in which the dataset is divided into smaller subsets at each node. In a regression tree, a
regression model is fit to the target variable using each of the independent variables. For each
independent variable, the data is split at several candidate split points, and the mean squared
error between the predicted and actual values is calculated for each. The node splits on the
predictor variable and split point that maximize the reduction in mean squared error.
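The split selection described above can be sketched for a single predictor as follows. This Python example predicts the mean response on each side of a candidate threshold and measures the reduction in the sum of squared errors, which for a fixed set of observations ranks splits the same way as the reduction in mean squared error. The function name and the use of midpoints between adjacent values as candidate thresholds are assumptions for illustration.

```python
def best_split(x, y):
    """Best threshold on one predictor for a regression tree node.

    Returns (threshold, reduction in sum of squared errors); the threshold
    is None when no split improves on predicting the overall mean.
    """
    def sse(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(x, y))
    total = sse([v for _, v in pairs])
    best_threshold, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical predictor values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        gain = total - sse(left) - sse(right)
        if gain > best_gain:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain = gain
    return best_threshold, best_gain
```

A full tree would run this search over every predictor at every node, split on the predictor with the largest gain, and recurse on the two resulting subsets.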
Regression tree ensembles are built with two methods: least squares boosting and bagging.
Least squares boosting fits regression ensembles in order to minimize mean squared
error. At every step, the ensemble fits a new learner to the difference between the observed
response and the aggregated prediction of all learners grown previously.⁵ Bagging trains each
model in the ensemble on a subset of the training set drawn randomly with replacement, and
obtains the predicted response of the trained ensemble by averaging the predictions of the
individual trees. Random sampling with replacement omits on average 37% of the observations for
each decision tree, and every tree in the ensemble can randomly select predictors for its
decision splits.
⁴ “Ensemble Learning - Wikipedia, the Free Encyclopedia,” accessed December 9, 2015, https://en.wikipedia.org/wiki/Ensemble_learning.
⁵ Jerome Friedman et al., “Discussion of Boosting Papers,” Ann. Statist 32 (2004): 102–7.
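The 37% figure for bagging follows from the sampling scheme: when n observations are drawn with replacement from a training set of size n, a given observation is missed with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The short Python check below compares the formula with one simulated bootstrap sample; the function names are illustrative.

```python
import math
import random


def expected_out_of_bag_fraction(n):
    """Probability that a given observation is never drawn in n draws with replacement."""
    return (1 - 1 / n) ** n


def simulated_out_of_bag_fraction(n, rng):
    """Fraction of observations missing from one bootstrap sample of size n."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1 - len(drawn) / n


LIMIT = math.exp(-1)  # about 0.368, the large-n limit of the expected fraction
```

The omitted (out-of-bag) observations of each tree can serve as a built-in validation set for that tree, since the tree never saw them during training.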
Since ensembles tend to overtrain, lasso regularization of the ensembles is implemented in order
to choose fewer weak learners with no loss in predictive performance.
To start, least squares boosting and bagging are each applied with
all the predictor variables listed in Section 3.3. The results are as follows:
Ensemble Learning Methods RMSLE
Least Squares Boosting 0.10388
Bagging 0.10142
Table 9. Ensemble Learning Methods
The results indicate that bagging performs better than least squares boosting (RMSLE of 0.10142
versus 0.10388). Thus, bagging is chosen as the ensemble learning method.
To account for potential interactions between variables, two ways of adding interaction
terms are applied. The first method adds the products of all pairs of distinct
predictors to the pool of features, which increases the number of features from 18 to 171.
The other method adds only the interactions between numerical terms, which increases the number
of features from 18 to 52. The ensemble method is then applied to both sets
of data. The result is as follows:
Number of features RMSLE
52 0.11728
171 0.09907
Table 10. Number of features
The result shows that including interaction terms between every pair of predictors improves the
model (RMSLE of 0.09907 versus 0.10142 with the original 18 features), whereas restricting the
interactions to numerical terms does not. Hence the best performance given by regression learning
ensembles has an RMSLE of 0.09907. The result ranks 47/482 in the competition.
Figure 4. Ranking of Ensemble Learning Method
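The count of 171 features for the full interaction set follows from adding one product term for each of the C(18, 2) = 153 pairs of distinct predictors to the 18 original predictors. A minimal Python sketch of this expansion for one observation is shown below; the function name is illustrative.

```python
from itertools import combinations


def with_pairwise_products(row):
    """Append the product of every pair of distinct predictors to a feature row."""
    values = list(row)
    return values + [a * b for a, b in combinations(values, 2)]
```

For a row of 18 predictors the result has 18 + 153 = 171 entries, matching the larger feature set in Table 10.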
4.2.4 Combinations of Models
In this section, three different combinations of the previously generated models are tested to
see whether they improve prediction performance. The first combination
takes the median of the predicted values from all previous models for each entry in the test data.
The second combination is a linear combination of the most effective models from k-nearest
neighbors search and ensemble learning. The third combination is a linear combination of the
three most effective ensemble learning models together with the most effective stepwise linear
regression model. The coefficients of the linear combinations are obtained by fitting each
model's predicted values for the training data to the actual values. The results are as follows:
Combinations of Models RMSLE
Median 0.09972
Linear combination of 1 k nearest neighbors and 1 ensemble learning (appendix 1) 0.10384
Linear combination of 1 stepwise linear regression and 3 ensemble learning (appendix 2) 0.09818
Table 11. Combinations of Models
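The two kinds of combination in Table 11 can be sketched in Python as follows. Each model contributes one list of predictions, aligned by test entry. The function names are illustrative, and the weights passed to `linear_blend` are placeholders: in the report they come from a regression fit on the training data, which is not reproduced here.

```python
from statistics import median


def median_blend(predictions_by_model):
    """Per-entry median of the predictions from several models."""
    return [median(entry) for entry in zip(*predictions_by_model)]


def linear_blend(predictions_by_model, weights, intercept=0.0):
    """Per-entry weighted sum of model predictions plus an intercept.

    `weights` and `intercept` are supplied by the caller; this sketch
    does not fit them.
    """
    return [intercept + sum(w * p for w, p in zip(weights, entry))
            for entry in zip(*predictions_by_model)]
```

The median is robust to a single model producing an outlying prediction, whereas a fitted linear combination can weight strong models more heavily but may overfit the training data.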
The above table shows that the third combination returns the best result, with a ranking of 40/485.
From the graph below, we see that the difference between the current best result and the top
result is around 0.09875 – 0.09340 = 0.00535 in RMSLE. Instead of generating more models to
close this 0.00535 gap on the test data, the focus of the project
shifts to analyzing the predicted values obtained so far and their implications for inventory
policy. In the next section, the second objective of the project is introduced and explained in
detail.
Figure 5. Ranking of Combinations of Models
5 Implications
5.1 Cross Validation
Although a lower RMSLE means a higher ranking among the participating teams in the
competition, the ability of the model to generalize needs further evidence. For this reason, cross
validation is applied to the training data, while the test data is ignored since its actual sales
values are not provided. Specifically, 5-fold cross validation is applied, which means that the
observations for each product in each store are partitioned into 5 disjoint subsamples (or folds),
chosen randomly but with roughly equal size. Each time, 4 folds are used for training and the
remaining fold is used for evaluation, and predicted values for that remaining fold are generated
at the same time. This process is repeated 5 times, leaving a different fold out for evaluation
each time. The models used for training the data are the most effective ones generated in
Sections 4.2.1, 4.2.2, and 4.2.3. The models are ranked by RMSLE in order to compare prediction
performance under cross validation with the results submitted to the online competition. The
results are as follows:
Model testRMSLE ranking trainRMSLE ranking
Stepwise Linear Regression 0.10477 5 0.129844 5
Ensemble Learning – LS Boosting – 18 features 0.10388 4 0.122193 3
Ensemble Learning – Bagging – 18 features 0.10142 3 0.105286 2
Ensemble Learning – Bagging – 171 features 0.09907 2 0.1029 1
Linear combination of the previous 4 models 0.09818 1 0.123611 4
Table 12. Cross Validation
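The fold construction described in this section can be sketched in Python as below. The generator shuffles the observation indices of one product and store, deals them into k folds of roughly equal size, and yields each fold once as the evaluation set. The function name and the fixed `seed` are assumptions for illustration; this is not the cross validation routine used in the project.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_indices, eval_indices) for k-fold cross validation.

    The n observation indices are shuffled, then dealt into k disjoint
    folds whose sizes differ by at most one.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        train = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train, folds[held_out]
```

Collecting the predictions made for each held-out fold gives exactly one out-of-sample prediction per training observation, which is what the forecast error analysis in the following sections relies on.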
From the table above, we notice that the linear combination of models does not work well under
cross validation (ranked fourth out of five). If we ignore that last row, the remaining four
models have the same relative ordering by RMSLE on the test data in the online competition and
under cross validation on the training data. With these results, we are more confident in applying
the best prediction model (Ensemble Learning – Bagging – 171 features) to the analysis of
inventory policy.
5.2 Evaluating Forecasts
In this section, two common measures of forecast accuracy are applied to the predictions for the
training data generated with cross validation in the previous section: the mean absolute
deviation (MAD) and the mean absolute percentage error (MAPE).
To calculate these two measures, let $e_i$ denote the difference between the forecast value and
the actual value for observation $i$ in the training data, let $D_i$ denote the actual demand for
that observation, and suppose there are n observations. MAD and MAPE are calculated as:
$$\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n} |e_i|$$
$$\mathrm{MAPE} = \left[\frac{1}{n}\sum_{i=1}^{n} \left|\frac{e_i}{D_i}\right|\right] \times 100\%$$
Because some products have many days with zero sales, $D_i$ in MAPE is replaced with the
average demand to avoid undefined values. Each of the above measures is applied to each product
in each store. Since there are 255 combinations of stores and products, 255 MADs and
255 MAPEs are generated.
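A minimal Python sketch of the two measures is given below. The report does not state exactly which average demand replaces $D_i$, so the sketch takes it as the mean of the actual values passed in unless the caller supplies another value; that default, like the function names, is an assumption, and the output will not necessarily reproduce the tabulated MAPE values.

```python
def mad(forecast, actual):
    """Mean absolute deviation between forecast and actual demand."""
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)


def mape_with_average_demand(forecast, actual, average_demand=None):
    """MAPE in percent, with each D_i replaced by one average demand.

    `average_demand` defaults to the mean of `actual` (an assumption of
    this sketch); a zero average leaves the measure undefined.
    """
    if average_demand is None:
        average_demand = sum(actual) / len(actual)
    if average_demand == 0:
        raise ValueError("average demand is zero, MAPE is undefined")
    errors = [abs(f - a) / average_demand for f, a in zip(forecast, actual)]
    return sum(errors) / len(errors) * 100.0
```

With a single denominator, this version of MAPE is simply MAD divided by the average demand, expressed in percent, so days with zero sales no longer cause a division by zero.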
It should be noted that the original model that generates the best result includes feature 18,
the average sales over the 7 days after today. However, when an inventory policy is developed
from the predictions, the data for this feature is not available in real life. For this
reason, feature 18 and its interaction terms with other predictors are eliminated and a new
cross-validated ensemble learning model is built. MADs and MAPEs are then
calculated. It turns out that feature 18 contributes little to the original model and that
eliminating it has no significant influence on the predicted values. To illustrate this point, the
ranking of variable importance for predicting sales of product 23 in store 8 is shown as an
example:
rank variables importance rank variables importance rank variables importance
1 7 4.47E-04 31 102 4.43E-06 61 12 1.95E-06
2 78 5.98E-05 32 62 4.41E-06 62 37 1.75E-06
3 21 5.11E-05 33 17 4.28E-06 63 58 1.60E-06
4 3 4.66E-05 34 147 4.27E-06 64 76 1.57E-06
5 87 3.88E-05 35 98 4.15E-06 65 38 1.39E-06
6 63 2.41E-05 36 77 4.14E-06 66 35 1.08E-06
7 24 2.29E-05 37 138 4.14E-06 67 4 9.18E-07
8 5 1.98E-05 38 103 4.07E-06 68 39 9.00E-07
9 8 1.45E-05 39 42 3.92E-06 69 80 6.61E-07
10 66 1.28E-05 40 112 3.75E-06 70 55 6.43E-07
11 20 1.08E-05 41 28 3.68E-06 71 22 5.86E-07
12 29 9.34E-06 42 111 3.59E-06 72 101 5.59E-07
13 113 9.33E-06 43 43 3.58E-06 73 71 5.53E-07
14 83 9.29E-06 44 27 3.50E-06 74 132 4.33E-07
15 1 9.14E-06 45 53 3.31E-06 75 127 4.25E-07
16 2 8.70E-06 46 36 3.16E-06 76 44 4.12E-07
17 117 7.68E-06 47 70 3.14E-06 77 126 3.83E-07
18 133 5.70E-06 48 134 3.01E-06 78 89 3.67E-07
19 81 5.40E-06 49 88 2.95E-06 79 68 3.49E-07
20 108 5.38E-06 50 19 2.82E-06 80 41 3.46E-07
21 82 5.17E-06 51 11 2.75E-06 81 128 3.39E-07
22 143 5.04E-06 52 69 2.74E-06 82 10 2.48E-07
23 50 4.84E-06 53 18 2.57E-06 83 110 2.24E-07
24 33 4.76E-06 54 49 2.34E-06 84 6 2.23E-07
25 57 4.68E-06 55 99 2.25E-06 85 26 2.04E-07
26 56 4.67E-06 56 65 2.18E-06 86 64 1.73E-07
27 52 4.60E-06 57 139 2.11E-06 87 92 1.03E-07
28 48 4.52E-06 58 51 2.09E-06 88 93 8.48E-08
29 75 4.48E-06 59 104 2.07E-06 89 94 3.20E-08
30 34 4.47E-06 60 23 2.00E-06
Table 13. Variable Importance
We see that feature 18 (the average sales 7 days after today) ranks 53rd in Table 13, and that
it is about 60% as important as feature 17 (the average sales 7 days before today): its
importance is 2.57E-06 against 4.28E-06 for feature 17.
Since MAD and MAPE each have 255 values, it is not convenient to show them all in the report.
Instead, the combinations of store and product are sorted in descending order of average daily
sales, and detailed values are shown for the top 10 and the bottom 10. The rest of the values
are shown in graphs to indicate the trend in MAD and MAPE. The tables and graphs are as
follows:
Top 10 in average daily sales:
store_nbr item_nbr sum of sales # of days recorded MAD MAPE mean daily demand
33 44 189903 914 36.219 0.115 207.771
16 25 135046 857 28.097 0.118 157.580
30 44 136473 868 26.824 0.317 157.227
17 9 135367 939 45.548 0.204 144.161
2 44 117125 875 21.016 0.120 133.857
4 9 117123 960 36.619 0.190 122.003
33 9 101586 914 36.785 0.227 111.144
25 9 98560 1011 28.217 0.157 97.488
34 45 87419 947 15.747 0.125 92.312
38 45 80068 875 15.488 0.130 91.506
Table 14. Top 10 in average daily sales
Bottom 10 in average daily sales:
store_nbr item_nbr sum of sales # of days recorded MAD MAPE mean daily demand
16 85 67 857 0.099 0.810 0.078
40 106 78 1011 0.093 1.049 0.077
9 105 73 947 0.099 0.884 0.077
22 104 68 898 0.094 0.883 0.076
38 86 62 875 0.088 0.929 0.071
25 84 69 1011 0.087 0.906 0.068
20 106 61 896 0.085 0.968 0.068
31 104 58 947 0.070 1.025 0.061
34 84 46 947 0.065 0.883 0.049
3 102 31 896 0.045 0.936 0.035
Table 15. Bottom 10 in average daily sales
MADs for each store and item combination, sorted in descending order of average daily sales:
Figure 6. MAD
MAPEs for each store and item combination, sorted in descending order of average daily sales:
Figure 7. MAPE
[Plot for Figure 6: MAD (vertical axis, 0 to 50) against average daily sales for each store and item combination, sorted in descending order (horizontal axis)]
[Plot for Figure 7: MAPE (vertical axis, 0 to 1.8) against average daily sales for each store and item combination, sorted in descending order (horizontal axis)]
The above plots show that, in general, MAD decreases and MAPE increases as average daily sales
decrease. For MAD, the models for some store and item combinations do not perform as well as
others, which is particularly obvious for items with a large volume of sales. For those
combinations, extra effort to fit a better model may be worthwhile as a further step. For MAPE,
we can see a big jump from an average of around 0.1 to an average of around 0.4 when average
daily sales drop to around five. Yet it should be noted that MAPE is scale sensitive and should
not be used when working with low-volume data: when the average demand is very low, the
denominator in the MAPE formula will often make MAPE take on extreme values.
5.3 Standard Deviation of Forecast Errors and its Implications for Safety Stock
In general, forecasting error variance is higher than the demand variance since forecasting error
also incorporates sampling error. If a forecast is used to estimate the mean demand, we keep
safety stocks in order to protect against the error in the forecast.⁶ Thus, the standard deviation
(STD) in forecast errors instead of standard deviation in demand should be used to calculate
safety stocks.
The model was originally built with 5-fold cross validation, which means that each prediction
group (generated by the model trained on the other four subsamples) accounts for only one fifth
of the overall predictions. Thus, instead of calculating the standard deviation over all
predictions at once, the average of the standard deviations of the five prediction groups is
used, in keeping with the cross validation method. Again, a graph of the averaged STDs against
the mean daily demand for each of the 255 store and item combinations is shown below:
6 Steven Nahmias, Production and Operations Analysis (New York: McGraw-Hill/Irwin, 2009).
Figure 8. Averaged STD
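The fold-wise averaging described above can be sketched as follows. This is a minimal illustration, not the code used for the report: the function name, the fold labels, and the toy errors are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the report does not say which estimator was applied.

```python
from statistics import fmean, stdev


def averaged_fold_std(errors, fold_ids):
    """Average the per-fold sample standard deviations of the forecast errors.

    errors[i] is the forecast error of observation i, and fold_ids[i] is the
    cross-validation fold whose held-out prediction produced that error.
    Each fold needs at least two errors for stdev to be defined.
    """
    folds = {}
    for error, fold in zip(errors, fold_ids):
        folds.setdefault(fold, []).append(error)
    return fmean(stdev(group) for group in folds.values())


# Toy example: ten forecast errors spread over five folds.
toy_errors = [1, -1, 2, -2, 0, 0, 3, -3, 1, -1]
toy_folds = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]

print("averaged STD:", averaged_fold_std(toy_errors, toy_folds))
```

Averaging within folds keeps each standard deviation tied to predictions made by a single fitted model, rather than pooling errors from five different models.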
Assuming overnight replenishment and a 98% service level (which corresponds to a z-score of
2.05), daily safety stock is calculated as 2 × 𝑎𝑣𝑒𝑟𝑎𝑔𝑒𝑑 𝑆𝑇𝐷. The percentage of daily safety
stock over average daily demand for each store and item combination is shown below:
[Plot for Figure 8: averaged STD (vertical axis, 0 to 70) against average daily sales for each store and item combination, sorted in descending order (horizontal axis)]
Figure 9. Percentage of safety stock over average daily demand
The part of the previous graph that covers only combinations with average daily sales above 5
units is shown below:
Figure 10. Percentage of safety stock over average daily demand for combinations with average daily sales above 5 units
[Plot for Figure 9: percentage of safety stock over average daily demand (vertical axis, 0 to 30) against average daily sales for each store and item combination, sorted in descending order (horizontal axis)]
[Plot for Figure 10: percentage of safety stock over average daily demand (vertical axis, 0 to 2) against average daily sales for each store and item combination, sorted in descending order (horizontal axis)]
From the plots, we notice that for products with average daily sales below five, the
percentage of safety stock over average daily demand increases dramatically and fluctuates
widely. This raises the question of whether it is profitable to keep those low-demand products
in stock, since the safety stock for these products is much larger than their daily demand.
However, similar to the problem with MAPE, when the average daily demand is very close to zero,
its position in the denominator will often make the percentage take on very high values. This
may partially account for the high spikes in the graph.
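The safety stock calculation and the denominator effect noted above can be illustrated with a short sketch. The multiplier of 2 follows the formula in this section; the function names and the two example products are hypothetical and are not taken from the data.

```python
from statistics import NormalDist


def z_for_service_level(service_level):
    """z-score of the standard normal distribution for a given service level."""
    return NormalDist().inv_cdf(service_level)


def daily_safety_stock(averaged_std, z=2.0):
    """Safety stock for one day of exposure: z times the averaged STD of forecast errors."""
    return z * averaged_std


def safety_stock_ratio(averaged_std, mean_daily_demand, z=2.0):
    """Safety stock as a multiple of mean daily demand."""
    if mean_daily_demand <= 0:
        raise ValueError("mean daily demand must be positive")
    return daily_safety_stock(averaged_std, z) / mean_daily_demand


print("z for a 98% service level:", z_for_service_level(0.98))

# Hypothetical products: a high-volume item and a low-volume item.
print("high volume:", safety_stock_ratio(averaged_std=40.0, mean_daily_demand=200.0))
print("low volume :", safety_stock_ratio(averaged_std=0.3, mean_daily_demand=0.05))
```

Although the low-volume item has a far smaller forecast error in absolute terms, its ratio is much larger because the mean daily demand in the denominator is close to zero.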
6 Conclusion
For the first objective, fitting an effective model that lowers RMSLE on the test data, three
different methods with different model parameters are tested in sequence. Stepwise linear
regression gives the highest RMSLE, and thus the weakest performance, among the three methods.
K-nearest neighbors search generates a better result, and ensemble learning provides the best
prediction performance. A linear combination of models improves the prediction performance on
the test data even further, although this combination cannot be applied generally, as indicated
by its poor performance when tested on the training data alone using cross validation. The
variable importance implies that weather information is not significant in predicting daily
sales. Instead, features related to time contribute far more and rank among the top features in
the importance ranking. Thus, although these products are assumed to be weather-sensitive,
weather does not influence their sales as much as originally supposed. Future research on other
machine learning techniques may further improve the prediction performance. However, the
robustness of the model should always be kept in mind when the predictions are going to be used
in business activities such as setting an inventory policy.
The second objective allows us to examine the implications of the predictions. With cross
validation, the ensemble tree model proves its robustness. It is natural that MAD decreases
with average daily demand, yet the products with a rather large MAD compared to those with
similar average daily demand may require more attention for further model improvement. In
addition, the two spikes in MAPE before the aforementioned jump at around 5 average daily sales
raise concern. The models for these two spikes should be further tested with other machine
learning techniques. Finally, the calculated safety stock and its value as a percentage of
average daily demand raise the question of whether these products are profitable to keep on the
store shelves. Although no further data is provided, inventory costs such as the holding cost of
high inventory, the obsolescence cost, the ordering cost, the storage space cost, and the
transportation cost for those products should all be taken into account when more detailed
information regarding those products becomes available.
7 References
“Data - Walmart Recruiting II: Sales in Stormy Weather | Kaggle.” Accessed December 9, 2015.
https://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather/data.
“Create Linear Regression Model Using Stepwise Regression - MATLAB Stepwiselm.” Accessed
December 10, 2015. http://www.mathworks.com/help/stats/stepwiselm.html.
“Classification Using Nearest Neighbors - MATLAB & Simulink.” Accessed December 10, 2015.
http://www.mathworks.com/help/stats/classification-using-nearest-neighbors.html.
Friedman, Jerome, Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. “Discussion of
Boosting Papers.” Ann. Statist. 32 (2004): 102–7.
“Ensemble Learning - Wikipedia, the Free Encyclopedia.” Accessed December 9, 2015.
https://en.wikipedia.org/wiki/Ensemble_learning.
Nahmias, Steven. Production and Operations Analysis. New York: McGraw-Hill/Irwin, 2009.
8 Appendices
1. Linear regression model of 1 k-nearest neighbors and 1 ensemble learning:
y ~ 1 + x1 + x2
Estimated Coefficients:
                       Estimate    SE           tStat     pValue
                       ________    _________    ______    ___________
(Intercept)            0.24598     0.044419     5.5377    3.0688e-08
Ensemble learning      0.85715     0.0058669    146.1     0
K-nearest neighbors    0.21063     0.0055461    37.977    1.2375e-314
Root Mean Squared Error: 18.8
R-squared: 0.773, Adjusted R-Squared 0.773
2. Linear regression model of 1 stepwise linear regression and 3 ensemble learning:
y ~ 1 + x1 + x2 + x3 + x4
Estimated Coefficients:
               Estimate    SE           tStat      pValue
               ________    _________    _______    __________
(Intercept)    -0.18353    0.033454     -5.486     4.1145e-08
x1             0.2025      0.0043553    46.495     0
x2             0.33965     0.0049085    69.195     0
x3             -0.36155    0.017711     -20.414    1.5038e-92
x4             0.8802      0.017254     51.014     0
Root Mean Squared Error: 14.3
R-squared: 0.868, Adjusted R-Squared 0.868
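The linear combinations above are least-squares fits of actual sales on the predictions of the individual models. The following Python sketch shows the idea on synthetic data. It is a stand-in for illustration only, not the code that produced the output above: the function names and the numbers are hypothetical, and the normal equations are solved by hand so that the example needs no external library.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; predictions may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_blend(model_predictions, actuals):
    """Fit actuals ~ 1 + x1 + ... + xm; returns [intercept, w1, ..., wm]."""
    rows = [[1.0, *preds] for preds in zip(*model_predictions)]
    m = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    xty = [sum(r[i] * y for r, y in zip(rows, actuals)) for i in range(m)]
    return solve_linear_system(xtx, xty)


def blend(coefficients, predictions_for_one_day):
    """Apply the fitted intercept and weights to one day's model predictions."""
    intercept, *weights = coefficients
    return intercept + sum(w * p for w, p in zip(weights, predictions_for_one_day))


# Synthetic predictions from two hypothetical models and the matching actual sales.
ensemble_pred = [1.0, 2.0, 3.0, 4.0, 5.0]
knn_pred = [2.0, 1.0, 4.0, 3.0, 6.0]
actual = [0.5 + 0.8 * e + 0.2 * k for e, k in zip(ensemble_pred, knn_pred)]

coefficients = fit_blend([ensemble_pred, knn_pred], actual)
print("intercept and weights:", coefficients)
print("blended forecast:", blend(coefficients, [3.5, 2.5]))
```

Because the weights are fitted on one data set, a blend that helps there may not carry over to another, which is consistent with the caution about this combination in the conclusion.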