Periodic Model Monitoring Framework for Accurate Demand Forecasting



CONTENTS

1 INTRODUCTION
2 PERFORMANCE METRICS AND APPLICABILITY TO DIFFERENT MODELS
3 MODEL MONITORING FRAMEWORK
3.1 DEVELOPING MODELS
3.2 PERIODIC MONITORING
3.2.1 Population Stability Index (PSI)
3.2.2 Receiver Operating Characteristic (ROC) and Lift Curve
3.2.3 Kolmogorov-Smirnov (KS) Test
3.2.4 Rank Order
4 MODEL MONITORING PLATFORMS
5 CONCLUSION
6 REFERENCES


1 Introduction

Response or propensity models are widely used by businesses to

identify customers who are likely to purchase a product. To help develop these models, techniques like logistic regression and

decision trees have been deployed across verticals such as banking, telecom, and retail. Advanced algorithms, such as

random forest, gradient boosting, and neural networks, are also used for classifying outcomes, such as responder vs non-responder and good vs bad. Since propensity models are

extremely important to a business, a lot of attention is paid to statistical validation of the selected model before it is deployed.

Most importantly, even after the model has been validated and found to be robust, periodic monitoring is imperative to ensure that the model continues to perform at peak efficiency over the course of

time. Ongoing monitoring is also required to determine whether changes in market conditions or business strategies demand

adjustment, redevelopment, or replacement of the model.

Several metrics are used to monitor the model’s performance and its validity. This paper discusses the different ways of confirming

the model’s validity. In addition, it attempts to answer the question of when to retire a model.

The use of response models is very important in the retail business because of the need for accurate demand forecasts. For example, over-forecasting may lead to higher inventory costs

while under-forecasting may lead to an inability to meet demand, which results in a loss of sales revenue. Retailers generate

forecasts at multiple levels including item, store, and day/week levels. Usually, when forecast models are developed, performance

metrics such as mean absolute percentage error (MAPE) or root mean square error (RMSE) are used to measure their accuracy.

A valid and accurate response model is a fundamental

requirement of a successful business. However, it has been observed that with time, forecast values start showing a higher

deviation with respect to the actual values than when the model was originally developed. For instance, Google Flu Trends,



launched in 2008, is a classic example where a model was accurate in the initial years but started degrading over time,

especially between 2011 and 2013. The use of cross-sell models in banking is another example of a model that was accurate and

robust when it was first deployed to identify responders. It helped banks improve customer relationships and generate more business value. However, after 18 months, the same model was

no longer able to identify potential responders accurately.

In the case of Google Flu Trends, modifications in the Google

search algorithm meant that the data used for prediction had changed, producing unexpected results. The cross-sell models used in the banking sector were unable to remain reliable because

they could not keep up with the effect that socio-economic changes and technological advancements had on consumer

behavior.

In both cases, changes in external conditions led to deterioration

in the model’s performance. Do such situations sound familiar? This paper will attempt to address such challenges through a framework that can help track the performance of response

models and identify inefficiencies before they lead to negative outcomes.

2 Performance Metrics and Applicability to Different Models

Typically, performance metrics used to measure the accuracy and validity of a model depend on the type of model being tested. Table 1 provides guidelines for selecting performance

measurement metrics based on the type of model or business objective.



Table 1: Selecting Appropriate Performance Measurement Metrics

PERFORMANCE METRICS | MODEL TYPE
Root mean squared error (RMSE), mean absolute percentage error (MAPE), root mean squared logarithmic error (RMSLE), mean absolute error (MAE), bias | Forecasting models, linear regression models, and other models with a continuous target variable
Sensitivity, specificity, area under curve (AUC), lift curve, Kolmogorov-Smirnov (KS) statistic | Classification models

Note: A detailed discussion of all the performance metrics mentioned in Table 1 is beyond the scope of this paper. As such, this paper primarily focuses on the performance metrics applicable to classification models.


3 Model Monitoring Framework

3.1 Developing Models

The primary objective behind developing response models is to identify customers who have a high likelihood of responding to a cross-sell campaign or an event of interest. Developing a model

involves multiple stages, with extensive efforts directed towards maintaining accuracy and efficiency at every step in the process.

The stages of development are as follows:

1. Understanding the business problem and objective

2. Translating business objectives into analytical objectives

3. Identifying the data scope and time window

4. Data exploration and preparation

5. Data treatment and transformation

6. Variable reduction and feature engineering

7. Model training and validation

8. Model selection and interpretation

9. Business approval

In this process, stages 5-9 involve multiple iterations. Once the model is finalized and approved, it is deployed and scored periodically.

3.2 Periodic Monitoring

Since response models are developed and deployed for use cases such as cross-sell campaigns and customer retention, they need

to be assessed periodically to measure their effectiveness over time. There are several performance metrics which can be used

to evaluate a model. These metrics help in answering various critical questions like:

• Is the model, which was developed previously, still usable?

• Has the model's performance deteriorated?

• Is now the right time to develop a challenger model?



• Can the model still separate outcomes, such as bad vs good or responder vs non-responders?

One way to evaluate the performance of the model is by adopting a framework which measures a few key statistics at periodic

intervals. The proposed framework uses the following criteria to examine the validity of a model:

• Population Stability Index (PSI)

• Receiver Operating Characteristic (ROC) and Lift Curve

• Kolmogorov-Smirnov (KS) statistic

• Rank ordering of responders

This framework is intended to be used specifically for cross-sell and upsell models, wherein campaign response data is available from multiple sources. With minor adjustments to the process, it can be used to evaluate churn and risk models too. The adjustments are recommended because, in churn and risk models, target information is not sourced from campaign data.

3.2.1 Population Stability Index (PSI)

PSI measures changes in the characteristics of the population over time. As models are based on historical datasets, it is

necessary to ensure that the characteristics of the present-day population are sufficiently similar to the historical population on which the model is based in order to accurately predict the

expected lift when used in a targeted campaign.

The formula for PSI is shown below, where $Expected_i$ is the percentage of observations in the $i$th decile of the development dataset and $Actual_i$ is the percentage of observations in the $i$th decile of the scoring dataset.

This formula is used to calculate the PSI for each of the 10 deciles, and the total PSI is the summation of the individual PSIs from

each decile. A higher PSI indicates greater shifts in population.

$$PSI = \sum_{i=1}^{n} (Actual_i - Expected_i)\,\ln\left(\frac{Actual_i}{Expected_i}\right)$$
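For teams implementing the PSI check outside SAS Enterprise Miner, the formula can be scripted directly. Below is a minimal Python sketch, assuming the per-decile population percentages have already been computed for the development (expected) and current scoring (actual) datasets; the function name and the epsilon guard against empty deciles are illustrative choices, not part of the original framework.

```python
import numpy as np

def population_stability_index(expected_pct, actual_pct, eps=1e-6):
    """Compute PSI from per-decile population percentages.

    expected_pct: % of observations in each decile of the development dataset
    actual_pct:   % of observations in each decile of the current scoring dataset
    """
    expected = np.asarray(expected_pct, dtype=float) / 100.0
    actual = np.asarray(actual_pct, dtype=float) / 100.0
    # Guard against empty deciles so the logarithm stays finite.
    expected = np.clip(expected, eps, None)
    actual = np.clip(actual, eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

# Hypothetical decile distributions (10 deciles each, summing to ~100%).
expected_pct = [12, 11, 10, 10, 10, 10, 10, 9, 9, 9]
actual_pct = [11, 11, 10, 11, 10, 9, 10, 10, 9, 9]
print(round(population_stability_index(expected_pct, actual_pct), 4))
```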



The first step in computing the PSI is to score the current population using SAS® Enterprise Miner™. The next step is to create the scoring table, following all the subset criteria and business decisions taken while developing the model. Next, ensure that the current scoring base is connected to the Score node in SAS Enterprise Miner. The Score node takes two inputs: the scoring table and the selected model. Once the scoring table is imported into SAS Enterprise Miner, its assigned data role should be "Score". The data role can be changed from the node properties, if required.

Table 2: Process of Calculating PSI

Table 2 shows an example of the process followed to calculate the PSI. The summarized PSI value across all the deciles is 0.0148, which indicates that there is no significant shift in the population and, therefore, no correction of the existing model is required. The next course of action is based on the PSI value, as described in Table 3.


Figure 1 illustrates how SAS Enterprise Miner can be used for scoring new data.

Figure 1: Using SAS Enterprise Miner for scoring new data

Table 3: Recommended actions based on the PSI value

PSI | INFERENCE | ACTION
< 0.1 | Insignificant change | No action required
0.1 to 0.25 | Minor shift in population | Investigate the characteristics of important variables; the model can be used with minor adjustments
> 0.25 | Major shift in population | Major investigation is required, or rebuild the model on the currently available data
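Where monitoring is automated, the thresholds in Table 3 can be expressed as a simple decision rule. The helper below is a hypothetical sketch of such a rule; the wording of the returned messages mirrors Table 3.

```python
def psi_recommendation(psi: float) -> str:
    """Map a PSI value to the recommended action from Table 3."""
    if psi < 0.1:
        return "Insignificant change: no action required."
    if psi <= 0.25:
        return ("Minor shift in population: investigate important variables; "
                "the model can be used with minor adjustments.")
    return ("Major shift in population: major investigation required, "
            "or rebuild the model on currently available data.")

print(psi_recommendation(0.0148))  # example PSI value from Table 2
```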


3.2.2 Receiver Operating Characteristic (ROC) and Lift Curve

3.2.2.1 Understanding ROC Curve

The ROC curve (Figure 2) is a graphical plot which illustrates the ability of a model to classify binary events. The Area Under the Curve (AUC) is a measure of the quality of the classification model. The ROC curve is developed by plotting the true positive rate against the false positive rate as the threshold used to classify binary events is varied.

The ROC curve displays sensitivity and specificity for the entire

range of cut-off values. As the cut-off decreases, more and more cases are allocated to Class 1. Hence, sensitivity increases, and specificity decreases. A random classifier has an AUC of 0.5, while

AUC for a perfect classifier is equal to 1. In practice, most classification models have an AUC between 0.5 and 1. An AUC of

0.8 means that 80% of the time, a randomly selected case from the group with the event = 1 has a score greater than that for a randomly chosen case from the group with the event = 0.

Figure 2: ROC Curve
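For a scripted equivalent of this check, the ROC curve and AUC can be computed from the model's predicted probabilities and the observed outcomes, for example with scikit-learn. The sketch below uses synthetic data purely for illustration; in practice y_true and y_score would come from the scored campaign data.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in for campaign data: binary outcomes and model scores.
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(0.3 * y_true + 0.7 * rng.random(1000), 0, 1)

# True positive rate and false positive rate across all cut-off values.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve: 0.5 for a random classifier, 1.0 for a perfect one.
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.3f}")
```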


Sensitivity and specificity are two additional parameters that need to be considered while evaluating model performance. They are

defined as follows:

• Sensitivity, also known as the true positive rate, is defined

as the ratio of the correctly classified true positive cases to total actual positive cases.

• Specificity, also known as the true negative rate, is defined as the ratio of correctly classified negative cases to the total actual negative cases.
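At any chosen cut-off, these two rates can be read from the confusion matrix. A minimal sketch, again using scikit-learn, with small illustrative arrays in place of real campaign data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Small illustrative arrays: observed outcomes and predicted probabilities.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1, 0.5, 0.95])

cutoff = 0.5  # illustrative threshold for classifying a case as positive
y_pred = (y_score >= cutoff).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # correctly classified positives / all actual positives
specificity = tn / (tn + fp)  # correctly classified negatives / all actual negatives
print(f"Sensitivity = {sensitivity:.3f}, specificity = {specificity:.3f}")
```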

3.2.2.2 Understanding the Lift Chart

A Lift Chart measures the effectiveness of a classification model and is calculated as the ratio between results obtained with and without the model. The Lift Chart shows how much more likely

you are to receive positive responses by contacting customers suggested by the model as opposed to a random sample of

customers. For example, the Lift Curve can tell us that by contacting only 10% of the customers based on the model, we would reach 3.5 times as many respondents as we would without a model.
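The decile-level lift described above can also be computed directly from scored and response data. The pandas sketch below assumes a table with a predicted probability column and an observed response flag; the column names and the synthetic data are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "score": rng.random(1000),                  # predicted probability from the model
    "responder": rng.integers(0, 2, size=1000), # observed campaign response (0/1)
})

# Deciles ordered by decreasing predicted probability (decile 1 = highest scores).
df["decile"] = pd.qcut(df["score"].rank(method="first", ascending=False),
                       10, labels=list(range(1, 11)))

overall_rate = df["responder"].mean()
by_decile = df.groupby("decile", observed=True)["responder"].agg(["count", "sum"])
by_decile["cum_rate"] = by_decile["sum"].cumsum() / by_decile["count"].cumsum()
# Cumulative lift: response rate in the top deciles vs the overall response rate.
by_decile["cum_lift"] = by_decile["cum_rate"] / overall_rate
print(by_decile[["count", "sum", "cum_lift"]])
```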

Table 4: Calculating the Lift Curve


3.2.2.3 Plotting the ROC and the Lift Curve Chart

It is possible to use actual responses from a product cross-sell campaign to measure the effectiveness of the model regularly.

First, create a binary variable for responders using the campaign responses for all the scored and targeted customers. Then, the dataset which has the final variables of the model and the actual responses from the campaign is used as test data in SAS Enterprise Miner to calculate the ROC, the Lift Curve, and other statistical metrics. This allows us to compare metrics across training, validation, and test datasets. If these values are very

close, then the model is effective and performing well.

The process for combining the latest scored data with actual responses from a campaign and using it as test data is described below; a Python sketch of the same idea follows the steps. It is important to note that this is refreshed data and not the data originally used to develop the model. Test data can be created by:

1. Capturing the response data for the scored customers.

Include untargeted customers scored using the model.

Capture the data if they purchased the product without any

intervention.

2. Combining the scored data and corresponding actual

response data into one table. This table will have all the

significant independent variables used in the model scoring

and response data.

3. Importing this dataset as an additional data source into the

SAS Enterprise Miner and defining the role as raw or

training data.

4. Applying the exact transformation or variable treatment

which was applied on the training data during the model

development stage.

5. Adding a data partition node from the sampling tab of the

SAS Enterprise Miner, this time with the following settings

- train: 0%, validation: 0%, and test: 100%.



6. Adding a model comparison node after the selected model

node.
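Outside Enterprise Miner, the same steps reduce to joining the scored customers with their observed campaign responses and recomputing the monitoring metrics on the refreshed test set. The sketch below is a hedged Python analogue with illustrative table and column names, showing AUC as one such metric; lift, the KS-statistic, and rank ordering can be recomputed on the same merged table.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical inputs: model scores per customer and observed campaign responses
# (including untargeted customers who purchased without any intervention).
scored = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "score": [0.92, 0.75, 0.40, 0.85, 0.20, 0.55],
})
responses = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "responder": [1, 1, 0, 0, 0, 1],
})

# Combine the scored data and the actual response data into one test table.
test = scored.merge(responses, on="customer_id", how="inner")

# Re-evaluate the deployed model on the refreshed test data.
auc_test = roc_auc_score(test["responder"], test["score"])
print(f"Test AUC on refreshed data: {auc_test:.3f}")
```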

Figure 3: The ROC curve and Lift Curve produced on SAS Enterprise Miner

3.2.3 Kolmogorov-Smirnov (KS) Test

The KS statistic quantifies the distance between the empirical

distribution function of the sample and the cumulative distribution function of the reference distribution — or between the empirical distribution functions of two samples. It is a

measure of the degree of separation between the positive and negative distributions.

The KS-statistic is 100 if the model partitions the population into two separate groups in which one group contains all the events

and the other contains all the non-events. The higher the value, the better the model is at separating the events from non-event cases. The KS-statistic is calculated as the maximum value of

difference between cum event (%) and cum non-event (%).

As shown in Table 5, there are two ways of calculating the KS-

statistic:

• Split scored population into deciles (10 parts) ordered by decreasing predicted probability value. Then compute the

cumulative % of events and non-events in each decile and the difference between them. The maximum value of this difference is the KS-statistic.


• Compute a two-sample KS test with PROC NPAR1WAY, which generates the difference metrics.
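Both approaches can be reproduced with open-source tooling. The sketch below uses synthetic data with illustrative column names: it computes the decile-based KS value and, as an open-source stand-in for PROC NPAR1WAY, runs SciPy's two-sample KS test on the score distributions of events and non-events.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "score": rng.random(1000),
    "event": rng.integers(0, 2, size=1000),
})

# Decile-based KS: deciles ordered by decreasing predicted probability.
df["decile"] = pd.qcut(df["score"].rank(method="first", ascending=False),
                       10, labels=list(range(1, 11)))
grouped = df.groupby("decile", observed=True)["event"].agg(events="sum", total="count")
grouped["non_events"] = grouped["total"] - grouped["events"]
cum_event_pct = grouped["events"].cumsum() / grouped["events"].sum() * 100
cum_non_event_pct = grouped["non_events"].cumsum() / grouped["non_events"].sum() * 100
ks_decile = (cum_event_pct - cum_non_event_pct).abs().max()
print(f"Decile-based KS value: {ks_decile:.1f}")

# Two-sample KS test on the score distributions of events vs non-events.
stat, p_value = ks_2samp(df.loc[df["event"] == 1, "score"],
                         df.loc[df["event"] == 0, "score"])
print(f"Two-sample KS statistic: {stat:.3f} (p-value {p_value:.3f})")
```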

Table 5: Computing the KS-statistic value

Figure 4: The KS-statistic chart

3.2.4 Rank Order

Rank order is a simple concept wherein the event rate is observed

across deciles created using predicted probability values. The value of the event rate should be in decreasing order as we move


from the top deciles to the bottom deciles. In the model development phase, rank ordering of the event rate is checked on the training and validation datasets; if the event rate is not rank ordered on these datasets, the model is not the best candidate. A rank order table can be created using the current analytical table as well as the combined response and scoring datasets. The event rate should be rank ordered in the new dataset to pass the model performance criteria.

Table 6: Rank order

Figure 6: Plotting rank order

It is important to note that there is a possibility of a break in the rank ordering of the event rate. If the break occurs in the top deciles (1-6), there is a need to investigate. But if the break is in the bottom deciles (7-10), it is not of significant concern.
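A hedged sketch of this check in Python: compute the event rate per decile on the refreshed data and flag the first decile at which the rate stops decreasing, treating a break in the top deciles as a warning. The data and column names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "score": rng.random(1000),
    "event": rng.integers(0, 2, size=1000),
})

# Deciles ordered by decreasing predicted probability (decile 1 = highest scores).
df["decile"] = pd.qcut(df["score"].rank(method="first", ascending=False),
                       10, labels=list(range(1, 11)))
event_rate = df.groupby("decile", observed=True)["event"].mean()

rates = event_rate.to_numpy()
deciles = list(event_rate.index)
# First decile at which the event rate fails to keep decreasing.
breaks = [deciles[i] for i in range(1, len(rates)) if rates[i] > rates[i - 1]]

if not breaks:
    print("Event rate is rank ordered across all deciles.")
elif int(breaks[0]) <= 6:
    print(f"Rank ordering breaks at decile {breaks[0]} (top deciles): investigate.")
else:
    print(f"Rank ordering breaks at decile {breaks[0]} (bottom deciles): not a significant concern.")
```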

4 Model Monitoring Platforms

In this paper, the model monitoring framework is illustrated using SAS Enterprise Miner — a market-leading predictive modelling

and machine learning solution. It enables data scientists to use advanced machine learning algorithms for business applications

with ease. In addition, SAS Model Manager can also be used for periodic model monitoring and champion vs challenger model decision-making. RapidMiner and KNIME are a couple of other model development products that are currently available.

The proposed model monitoring framework can be programmed, customized, and implemented using open source programming languages such as Python, Spark, and R. These languages have


rich libraries containing machine learning and data manipulation packages developed by the user and developer communities.

Core Compete has deep expertise in SAS technologies. SAS-certified data scientists at Core Compete are helping numerous

organizations across industries realize their optimal business value by deploying predictive and machine learning models. Core Compete has developed and deployed machine learning models

for clients in verticals such as retail, banking and financial services, telecom, and healthcare — across geographies including

the US, UK, Middle East, and South Asia.

5 Conclusion

Periodic model monitoring is an important exercise that will ensure that models maintain their accuracy and robustness over an extended period. Given that market conditions and consumer

behavior are rapidly evolving, model monitoring has become an essential part of data-driven decision-making. Since model

development requires significant effort and engagement with decision-makers, accurate and timely decisions on the production model are vital. The framework proposed in this paper has been

illustrated using SAS Enterprise Miner but can easily be developed and deployed on any other platform (Python, R, etc.) with alert

mechanisms.

By Lokendra Devangan

Lokendra is a Manager of Data Science for Core Compete

Contact Us

For more information, email Core Compete at [email protected]



6 References

1. Gainers and Losers in Gartner 2018 Magic Quadrant for Data Science and Machine Learning Platforms (https://www.kdnuggets.com/2018/02/gartner-2018-mq-data-science-machine-learning-changes.html).

2. Getting Started with SAS Enterprise Miner 14.3. Copyright © 2017, SAS Institute Inc., Cary, NC, USA.

3. SAS® Model Manager 14.2: User’s Guide Copyright © 2016, SAS Institute Inc., Cary, NC, USA.

4. The Parable of Google Flu: Traps in Big Data Analysis. David Lazer, Ryan Kennedy, Gary King and Alessandro Vespignani. Science 14 Mar 2014: Vol. 343, Issue 6176, pp. 1203-1205.

5. https://en.wikipedia.org/wiki/Google_Flu_Trends.