A brief walkthrough on predictive modeling for those who want to be educated users of this technology.
What Is Predictive Modeling?
4250 258th Ave SE
Issaquah, WA 98029
425.996.8732 Office
Copyright 2009 Numerical Alchemy, Inc. This material is not to be distributed or in any way duplicated without the prior consent of the author.
• Predictive modeling refers to a class of techniques that determine the most likely outcome given a set of inputs. Frequently, the inputs consist of prior data that will be used to predict a future outcome or event.
What Is a Model?
[Diagram: Inputs A, B, C, and D feed into a predictive model, which produces a predicted outcome or event. The model often uses past data (e.g. last month) to predict a future event (e.g. 2 months out).]
What Are Models Used For?

• Models currently have many uses.
• Some examples include:
– Which people are a good credit risk?
– What is someone's accident risk based on age, gender, and past driving history?
– Who is most likely to buy my products in the next 90 days?
– Who is most likely to stop doing business with my company in the near future?
– Which purchase transactions represent a significant fraud risk?
• All of these questions can be answered with predictive modeling.
A Tangled Web of Data

• What can make the prediction task complex is when we are faced with hundreds or thousands of potential factors that can be used as inputs.
• The obvious questions arise:
– Which ones should I use?
– How many of the factors are truly relevant or predictive?
– How do I know if I have the "right" model?
• All of these questions can be answered by a good analyst or statistician.
Outcome Variables

• Several types of outcome variables can be predicted using statistical modeling techniques.
• These include:
– Continuous values, like future customer profitability and future sales volumes
– Binary outcomes (1 = event occurs, 0 = event does not occur), like whether someone buys something (or not) or defaults on a credit card (or not)
– Multi-category outcomes, like small, medium, and large
• However, by far the most popular outcomes to model are of the continuous and binary variety.
Prediction and Scores

• Once a model has been built, it can be used to generate scores (i.e. predicted values) on new data. Depending on the outcome being modeled, these scores can take on a couple of different varieties.
• Predicted scores for binary outcomes are represented as a probability score: a decimal value between 0 and 1 representing the chance that the modeled event will occur for a given case.
• For continuous values, predicted scores take on the scale and characteristics of the original outcome variable.
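As a rough illustration (not from the original deck), here is a minimal Python sketch, assuming scikit-learn and made-up data, showing what the two kinds of scores typically look like in practice:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

# Hypothetical training data: 3 inputs, one binary and one continuous outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y_binary = (X[:, 0] + rng.normal(size=500) > 0).astype(int)   # 1 = event occurs
y_continuous = 100 + 20 * X[:, 1] + rng.normal(size=500)      # e.g. future sales volume

new_cases = rng.normal(size=(5, 3))  # fresh data to be scored

# Binary outcome: scores are probabilities between 0 and 1.
clf = LogisticRegression().fit(X, y_binary)
prob_scores = clf.predict_proba(new_cases)[:, 1]

# Continuous outcome: scores are on the scale of the original variable.
reg = LinearRegression().fit(X, y_continuous)
value_scores = reg.predict(new_cases)

print(prob_scores)   # e.g. 0.12, 0.87, ...
print(value_scores)  # e.g. 96.4, 118.2, ...
```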
Finding the "Right Model"

• There are many measures that tell you how predictive your model is. The problem is that no matter how predictive your model is on one set of data, it may lose its predictive power once applied to another set of data.
• One example is using demographic data to predict store-level retail sales during the summer months. The predictors we observe for the Southeastern U.S. may not prove useful when applied to West Coast locations.
• Similarly, an algorithm that predicts summer sales well may prove useless in predicting the spike in sales during the November and December Christmas season.
Validation Is the Key

• The way to truly test how well a model performs is to test it on an external data set.
• The data the model is built on is typically called the "development sample," while the data set used to validate the model is called the "validation sample."
• Ideally, both samples will be pulled from the same population of cases. By creating random samples, we can be fairly sure that we are creating data sets that are representative of the population of interest.
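A minimal sketch of creating development and validation samples with a random split, assuming pandas and scikit-learn and a hypothetical customers.csv file:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data file; the name and columns are illustrative only.
cases = pd.read_csv("customers.csv")

# Randomly split the population into a development sample (used to build
# the model) and a validation sample (used to test it on "external" data).
development, validation = train_test_split(cases, test_size=0.3, random_state=42)

print(len(development), "development cases")
print(len(validation), "validation cases")
```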
Lift Charts

• One way to tell how well a model performs is by looking at something called a lift chart. To construct one, follow these basic steps (a short code sketch follows below):
1. Sort the cases in the data set in descending order from the highest predicted score to the lowest (i.e. the highest scores are at the top).
2. Cut the file into 10% chunks called "deciles," where the top decile contains the 10% of cases with the highest scores.
3. Calculate your lift value by dividing the average value of the outcome variable within each decile by the average value for the entire sample.
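Here is a rough pandas sketch of those three steps, assuming a data frame with a predicted score column and a 0/1 outcome column (the data and column names are illustrative, not from the original deck):

```python
import numpy as np
import pandas as pd

# Hypothetical scored data: one row per case, with the model's predicted
# score and the actual 0/1 outcome (1 = event occurred).
rng = np.random.default_rng(0)
score = rng.uniform(size=1000)
outcome = (rng.uniform(size=1000) < score * 0.1).astype(int)  # event correlated with score
df = pd.DataFrame({"score": score, "outcome": outcome})

# Step 1: sort descending so the highest scores are at the top.
df = df.sort_values("score", ascending=False).reset_index(drop=True)

# Step 2: cut the sorted file into ten equal chunks; decile 1 = highest scores.
df["decile"] = pd.qcut(df.index, 10, labels=range(1, 11))

# Step 3: lift = average outcome in each decile / average outcome for the whole sample.
lift = df.groupby("decile", observed=True)["outcome"].mean() / df["outcome"].mean()
print(lift)
```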
Lift Charts (cont.)

• Once we've done the basic data manipulation shown on the previous page, we can make a chart like the one described below. The good thing about models is that we can use them to identify and target our actions to a much smaller number of cases.

[Sample Lift Chart: a bar chart of the average outcome rate within each decile (y-axis from 0% to 7%) compared against the average rate for the entire sample.]

The average rate for the outcome event is 1.5% of the total cases. However, for the top decile (the top 10% of cases with the highest scores), the percentage of cases experiencing the event is 6%. This represents a lift of 4 times the sample average.

In terms of application, if this model were developed to identify likely buyers of a product, we would want to focus our marketing efforts on those in the top one or two deciles, who have a much stronger likelihood to purchase than those who are very unlikely to purchase. It is better to target these high-scoring cases than the rest of the file.
Gains Charts

• Gains charts are another way to determine how well a model performs.
• Like lift charts, we sort the data in descending order from highest score to lowest score. Next, we cut the file into 10% chunks.
• However, unlike a lift chart, the idea is to see how much of the target event we are capturing as we move from the top of the data file to the bottom (a short code sketch follows below).
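Building on the lift-chart sketch above, here is a rough illustration of the gains calculation, again assuming a pandas data frame with illustrative score and outcome columns:

```python
import numpy as np
import pandas as pd

# Hypothetical scored data, as in the lift-chart sketch.
rng = np.random.default_rng(0)
score = rng.uniform(size=1000)
outcome = (rng.uniform(size=1000) < score * 0.1).astype(int)
df = pd.DataFrame({"score": score, "outcome": outcome})

# Sort from highest score to lowest and cut into 10% chunks (deciles).
df = df.sort_values("score", ascending=False).reset_index(drop=True)
df["decile"] = pd.qcut(df.index, 10, labels=range(1, 11))

# Gains: cumulative share of all "event" cases captured moving down the file.
events_per_decile = df.groupby("decile", observed=True)["outcome"].sum()
cumulative_capture = events_per_decile.cumsum() / df["outcome"].sum()

# A random sort would capture roughly 10%, 20%, ..., 100% instead.
print(cumulative_capture)
```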
Gains Charts (cont.)

• We compare the cumulative capture of the "event" cases to the cumulative capture rate if the file had simply been sorted in a random order.

[Sample Gains Chart: cumulative % of the event captured (y-axis from 0% to 100%) as we move down the sorted file, with one line for the model's sort order and one for a random sort.]

In this example, the model captures 45% of all the cases that exhibit the "event" within the top 10% of the file. Within the top 30% of the file, better than 75% of the "event" cases have been captured.

These results for the model can be compared to a random sorting of the file. In the case of a random sort, we could expect to capture 10% of the "event" cases within the top 10% of the file and 30% of the "event" cases within the top 30% of the file.
Using the Model

• Once the model has been developed and validated, it is time to use it. To do so, fresh data is used to generate scores on the cases or population of interest.
• Typically, models are deployed in one of three fashions:
– One-time or infrequent, occasional use
– Regularly scheduled rescoring (e.g. weekly, monthly, quarterly), depending upon when fresh data becomes available
– Scoring in real time; this is most appropriate for applications like transaction fraud detection or continuously learning predictive algorithms
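As a rough illustration of regularly scheduled rescoring (assuming scikit-learn and joblib; the file and column names are hypothetical):

```python
import joblib
import pandas as pd

# Load a previously developed and validated classifier (saved earlier with joblib.dump).
model = joblib.load("buyer_model.pkl")

# Fresh data for the population of interest; the input columns must match
# what the model was trained on (illustrative names only).
fresh = pd.read_csv("customers_this_month.csv")
inputs = fresh[["age", "tenure_months", "purchases_last_90_days"]]

# Generate probability scores and keep them with each case for downstream use.
fresh["score"] = model.predict_proba(inputs)[:, 1]
fresh.to_csv("scored_customers_this_month.csv", index=False)
```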
Tracking the Model

• Like almost everything else, models age and can become less predictive over time.
• Because of this, it is important to periodically reassess a model's performance.
• This can be done using the standard lift and gains charts. By comparing model performance over different time periods, the degree of performance decay can be assessed on an ongoing basis.
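A minimal sketch of one way to track decay, comparing top-decile lift across scoring periods (the helper, file names, and columns are illustrative assumptions, not the deck's method):

```python
import pandas as pd

def top_decile_lift(df: pd.DataFrame) -> float:
    """Lift of the top 10% of cases by score versus the overall event rate."""
    df = df.sort_values("score", ascending=False)
    top = df.head(max(1, len(df) // 10))
    return top["outcome"].mean() / df["outcome"].mean()

# Hypothetical monthly scoring files with the actual outcomes filled in later.
months = ["2009-01", "2009-02", "2009-03"]
decay = {m: top_decile_lift(pd.read_csv(f"scored_{m}.csv")) for m in months}

# A steadily falling lift value suggests the model is aging.
print(pd.Series(decay))
```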
Putting a Model Out to Pasture

• When a model finally loses its luster, it is time to retire it.
• However, the decision as to when to retire an existing model can be somewhat subjective.
• When you do make this decision, you are faced with the prospect of creating a new model to replace the one you are going to retire.
• Don't panic! This is just part of the model lifecycle. Simply create the new one and then switch them out.
Final Comments

• Congratulations! You can now claim to be an educated user of predictive analytics.
• At this point, you should have an idea of:
– What a model does
– What it can be used for
– How to assess its predictive accuracy
– The basic model lifecycle
• We hope you have enjoyed this little overview, and best of luck in your application of predictive analytics.