Lecture 10
MARK2039
Summer 2006
George Brown College
Wednesday 9-12
Assignment 8: Geocoding example
• Example:
– A retailer has the following information:
• Name and address of its customers
• Address of its stores
• Stats Can information
– As a marketer, how would you intelligently use this information?
• Get postal codes of customers and stores
• Get geocodes (latitude and longitude of each postal code)
• Calculate the distance between each customer and the nearest store
• Create a trading area around each store to determine the relevant customers for that store
• Identify the best stores and calculate the demographics of the best stores vs. the remaining stores
• Use the above learning to promote non-performing stores whose customer demographic makeup is similar to that of the best stores
• Use the above info to determine where to open or perhaps close stores
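The distance and trading-area steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in class: it assumes the haversine great-circle formula, and the store names, coordinates, and 5 km trading-area radius are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_store(customer, stores):
    """Return (store_id, distance_km) for the store closest to the customer."""
    return min(
        ((store_id, haversine_km(*customer, *coords)) for store_id, coords in stores.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical geocodes (latitude, longitude), for illustration only.
stores = {"store_A": (43.65, -79.38), "store_B": (43.78, -79.42)}
customer = (43.66, -79.40)

store_id, distance = nearest_store(customer, stores)
in_trading_area = distance <= 5.0  # assumed 5 km trading-area radius
print(store_id, round(distance, 2), in_trading_area)
```

Running the same lookup for every customer gives each store its list of relevant customers, which can then be joined to the Stats Can demographics.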
Assignment 8
• Why do we look at correlation analysis as our first statistical exercise in the data mining process?
– It allows us to use statistics as an initial prescreening tool for eliminating variables from the data mining exercise.
Assignment 8
• Give me an example of a correlation table of 5 variables where two variables are significant and three variables are not significant. Provide correlation values that support your results.

Variable          Corr. Coefficient   Stat. Sign.
Spend              0.43               99%
Age               -0.35               95%
Income             0.05               20%
Household Size     0.04               15%
Live in Quebec    -0.03               10%

Spend             Response Rate
<100              1%
100-200           2%
200-300           3%
300-400           4%
400+              5%

Age               Response Rate
<20               5%
20-40             4%
40-50             3%
50-60             2%
60+               1%

Income            Response Rate
<30M              1%
30M-40M           2%
40M-50M           1%
50M-60M           2%
60M+              1%

Household Size    Response Rate
1                 1%
2                 2%
3                 1%
4                 2%
4+                1%

Live in Quebec    Response Rate
Yes               1.99%
No                2.01%
Recapping from last week
• Geocoding
– What are the key things to think about?
• Look at the answer from two slides ago. Geocoding gives us numbers to calculate the distance between two postal codes.
• More material on correlation analysis
• How do EDA reports tie into the correlation analysis?
– They are trend-like reports which demonstrate why a given variable has a strong relationship with the objective function.
• How should we present the final results of a model?
Model Variable     Impact on Response   Contribution to Model
Live in Quebec     positive             85%
Income             negative             7%
Behaviour Score    negative             5%
# of promotions    negative             3%

How is the above derived?
From the partial R² of each variable divided by the total R² of the equation.
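A small sketch of that division follows. The partial R² values are hypothetical, picked only so that the shares reproduce the table above, and the sketch assumes the partial R² values add up to the total R² of the equation.

```python
# Hypothetical partial R-squared values (not from the lecture), chosen so that
# the resulting shares match the contribution column in the table above.
partial_r2 = {
    "Live in Quebec": 0.085,
    "Income": 0.007,
    "Behaviour Score": 0.005,
    "# of promotions": 0.003,
}

# Assumption: the total R-squared of the equation is the sum of the partials.
total_r2 = sum(partial_r2.values())

# Contribution to model = partial R-squared / total R-squared, as a percentage.
contribution = {name: round(100 * r2 / total_r2) for name, r2 in partial_r2.items()}

for name, pct in contribution.items():
    print(f"{name:<18}{pct}%")
```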
Notion of Lift
• What is Lift: the performance of a group relative to the performance of the benchmark
• Examples:

Type of Activity                      Untargeted/Benchmark   Targeted/Challenger   Lift
Acquisition Campaign Response Rate    1%                     2%                    200
Retention Campaign Churn Rate         15%                    25%                   166
Credit Card Loss Rate                 5%                     8%                    160
Product Affinity Rate                 10%                    30%                   300

The targeted group represents those names as determined by a data mining tool such as a predictive model.
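The lift column can be reproduced with one division: the targeted rate over the benchmark rate, scaled so that 100 means "same as benchmark". The function name below is illustrative. Note that 25% / 15% works out to about 166.7, which the slide shows as 166.

```python
def lift(benchmark_rate, targeted_rate):
    """Performance of the targeted group relative to the benchmark (100 = no difference)."""
    return 100 * targeted_rate / benchmark_rate


# (activity, benchmark rate, targeted rate) taken from the table above
examples = [
    ("Acquisition Campaign Response Rate", 0.01, 0.02),
    ("Retention Campaign Churn Rate", 0.15, 0.25),
    ("Credit Card Loss Rate", 0.05, 0.08),
    ("Product Affinity Rate", 0.10, 0.30),
]

for activity, benchmark, targeted in examples:
    print(f"{activity:<36}{lift(benchmark, targeted):6.1f}")
```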
Notion of Lift
• Examples of cases where lift is below 100:

Type of Activity                      Untargeted/Benchmark   Targeted/Challenger   Lift
Acquisition Campaign Response Rate    1%                     .5%                   50
Retention Campaign Churn Rate         15%                    10%                   66
Credit Card Loss Rate                 5%                     2%                    40
Product Affinity Rate                 10%                    6%                    60
Validating the Model: Example of a Gains Chart
• Listed below are the hard numbers that might comprise a lift curve
Revenue per order is $60. Cost of 1 mail piece is $.855. Benefits of modelling are the foregone promotion costs from promoting fewer names to achieve a given # of orders at a higher response rate.

% of List (Ranked   Validation      Cum. Resp.   Cum. % of    Cum.   Interval   Benefits
by Model Score)     Mail Quantity   Rate         all Resp.    Lift   ROI
0-10%               20000           3.50%        23.33%       233    145%       $22799
10-20%              40000           3.00%        40%          200    75%        $34200
20-30%              60000           2.75%        55%          183    58%        $42750
30-40%              80000           2.50%        67%          167    23%        $45600
40-50%              100000          2.25%        75%          150    -12.2%     $42750
...
90-100%             200000          1.50%        100%         100    -58%       $0

How might this be plotted? In class we saw this as a straight, decreasing line when plotting interval response rate against the deciles. If we plot the cum. % of responders, the shape is a parabola-type curve, with a larger parabola representing a better model. Likewise, a steeper slope when plotting interval response rate against the deciles represents a stronger model.
Validating the Model: Calculating the metrics on the gains charts.
• Cum. % of Responders in top 10%:
– Total responders: 200000 x 1.5% = 3000
– # of responders in top 10%: 20000 x 3.5% = 700
– Cum. % in top 10%: 700/3000 = 23%
• Cum. Lift in top 10%:
– Average response rate: 1.5%
– Cum. response rate in top 10%: 3.5%
– Cum. lift: 233
Calculating the metrics on the gains charts.
• Interval ROI in 10%-20%:
– # of persons mailed: 20000
– # of responders in 10%-20%: (40% - 23.33%) x 3000 = 500
– Net revenue: (500 x 60) - (.855 x 20000) = 12900
– Costs: 17100
– ROI: 12900/17100 = 75%
• Calculating the Benefits column at 30%:
– Mailed costs to achieve 1650 responders without modelling: ((.0275 x 60000)/.015) x .855 = 94050
– Mailed costs with modelling: 60000 x .855 = 51300
– Benefits: 94050 - 51300 = $42750
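As a check on the arithmetic above, here is a minimal sketch of the interval ROI and benefits calculations. The function names are illustrative; the revenue, cost, and average response rate are the ones given on the gains chart slide.

```python
REVENUE_PER_ORDER = 60.0
COST_PER_PIECE = 0.855
AVG_RESPONSE_RATE = 0.015


def interval_roi(prev_qty, prev_cum_rate, qty, cum_rate):
    """Net revenue divided by mailing cost for the names between two cumulative depths."""
    mailed = qty - prev_qty
    responders = qty * cum_rate - prev_qty * prev_cum_rate
    cost = mailed * COST_PER_PIECE
    return (responders * REVENUE_PER_ORDER - cost) / cost


def modelling_benefit(qty, cum_rate):
    """Mailing cost saved by reaching the same number of responders with the model."""
    responders = qty * cum_rate
    cost_without_model = responders / AVG_RESPONSE_RATE * COST_PER_PIECE
    cost_with_model = qty * COST_PER_PIECE
    return cost_without_model - cost_with_model


print(f"{interval_roi(20_000, 0.035, 40_000, 0.03):.1%}")  # interval 10%-20%
print(f"${modelling_benefit(60_000, 0.0275):,.0f}")        # benefits at 30%
```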
Gains Chart Examples
Assume a mail cost of $1.00 per piece and a revenue per order of $50.00.
Please fill in the blanks for the first 4 rows.
Cum. # of Names Mailed   Cum. Response Rate   Interval Resp. Rate   Interval Lift   Benefits   Interval ROI
10000                    2.50%
20000                    2.25%
30000                    2.10%
40000                    1.80%
...
100000                   1%

Answers:

Qty     Cum. Resp. Rate   Interval Resp. Rate   Interval Lift   ROI       Benefits
10000   2.50%             2.50%                 250             25.00%    15000
20000   2.25%             2.00%                 200             0.00%     25000
30000   2.10%             1.80%                 180             -10.00%   33000
40000   1.80%             0.90%                 90              -55.00%   32000
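The step that is easy to miss in this exercise is converting cumulative response rates back into interval response rates: take the difference in cumulative responders and divide by the names added in that interval. A short sketch, using the exercise figures and an average response rate of 1% (the rate at 100000 names):

```python
AVG_RESPONSE_RATE = 0.01

# (cumulative names mailed, cumulative response rate) from the exercise
cum_rows = [(10_000, 0.0250), (20_000, 0.0225), (30_000, 0.0210), (40_000, 0.0180)]

interval_rows = []
prev_qty, prev_responders = 0, 0.0
for qty, cum_rate in cum_rows:
    cum_responders = qty * cum_rate
    # Responders gained in this interval divided by names added in this interval
    interval_rate = (cum_responders - prev_responders) / (qty - prev_qty)
    interval_lift = 100 * interval_rate / AVG_RESPONSE_RATE
    interval_rows.append((qty, interval_rate, interval_lift))
    prev_qty, prev_responders = qty, cum_responders

for qty, rate, lift_value in interval_rows:
    print(f"{qty:>6}  {rate:.2%}  {lift_value:.0f}")
```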
Lift Curve with Zero Model Effectiveness
% of List   Cum. Response Rate   Cum. % of Responders
0-20%       0.95%                19.50%
20%-40%     1.05%                41%
40%-60%     1.02%                61%
60%-80%     0.99%                79%
80-100%     1.00%                100%

What does this look like if we plot it on a lift curve?
A line rather than a parabola if we plot the cum. % of responders.
Gains Chart Examples
Interval response rate by model:

Cum. # of Names Mailed   Model 1   Model 2   Model 3   Model 4
10000                    2.50%     2%        1.75%     1.10%
20000                    1.75%     1.60%     1.60%     0.90%
30000                    1.45%     1.30%     1.55%     0.80%
40000                    1.25%     1.10%     1.10%     1.15%
50000                    1.00%     1%        0.90%     1.05%
60000                    0.90%     0.90%     1%        0.85%
70000                    0.75%     0.80%     0.50%     0.90%
80000                    0.55%     0.70%     0.60%     1.02%
90000                    0.35%     0.60%     0.45%     0.90%
100000                   0.30%     0.55%     0.50%     1.03%
Average                  1%        1%        1%        1%

What is the best model?
– Model 1
What is the worst model?
– Model 4
What are the Model 3 results telling you?
– We have some rank ordering all the way down to 70000 names and then the model flattens out; we may need a strategy for this bottom segment.
Gains Chart Examples
In each response model case, answer the following questions:
• Where would your cutoff be with a budget of $80000 and a cost per piece of $2.00?
– 40000 names
• Where would your cutoff be if you needed to attain a forecasted order qty of 350?
– Between 10000 and 20000 names for Models 1 and 2, between 20000 and 30000 for Model 3, and between 30000 and 40000 for Model 4
• Where would your optimum cutoff be, presuming that neither budget nor forecasted order quantities were constraints?
– 50000 for Models 1 and 2, and 60000 for Model 3; it does not matter for Model 4
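The first two answers can be checked mechanically. The sketch below uses the interval response rates from the four-model table; the helper name is illustrative, and it returns the upper bound of the interval in which the order target is reached, so 20000 corresponds to "between 10000 and 20000 names".

```python
NAMES_PER_INTERVAL = 10_000

# Interval response rates (in %) for each model, from the four-model table
interval_rates = {
    "Model 1": [2.50, 1.75, 1.45, 1.25, 1.00, 0.90, 0.75, 0.55, 0.35, 0.30],
    "Model 2": [2.00, 1.60, 1.30, 1.10, 1.00, 0.90, 0.80, 0.70, 0.60, 0.55],
    "Model 3": [1.75, 1.60, 1.55, 1.10, 0.90, 1.00, 0.50, 0.60, 0.45, 0.50],
    "Model 4": [1.10, 0.90, 0.80, 1.15, 1.05, 0.85, 0.90, 1.02, 0.90, 1.03],
}


def names_needed(rates_pct, target_orders):
    """Cumulative names mailed at the end of the interval where the order target is reached."""
    orders = 0.0
    for interval_number, rate in enumerate(rates_pct, start=1):
        orders += NAMES_PER_INTERVAL * rate / 100
        if orders >= target_orders:
            return interval_number * NAMES_PER_INTERVAL
    return None  # target not reachable even when mailing the whole list


budget_cutoff = int(80_000 / 2.00)  # budget divided by cost per piece
order_cutoffs = {model: names_needed(rates, 350) for model, rates in interval_rates.items()}

print(budget_cutoff)
print(order_cutoffs)
```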
Gains Chart Examples
Avg. Response Rate: 1.75%

Cum. Names Mailed   Interval Response Rate
5000                4%
10000               3.50%
15000               3.10%
20000               2.50%
…
100000              0.40%

• Calculate the following:
– Interval names mailed
– Cum. response rate
• Assuming a cost per name of $1.50 and revenue per responder of $75, calculate the interval ROI and modelling benefits for each interval.
Tracking of Models
• Two models are used in two campaigns. In campaign A, the overall response rate is 3.5%, which is above the breakeven response rate of 2%. In campaign B, the overall response rate is 1.2%, which is below the breakeven response rate of 2%. Yet the model in campaign B is more effective. Explain why.

[Figure: response rate by decile (1 to 10) for the two campaigns, with overall response rates of 3.50% and 1.20%.]

The model is rank ordering names quite well for campaign B (1.2% overall), while the better campaign overall (3.5%) exhibits no rank ordering of response rate between deciles.
CHAID
• "CHAID" is an acronym for Chi-square Automatic Interaction Detection
• Produces a decision-tree-like report
– Branches and nodes
• Non-parametric approach
– Output of the routine is a segment or group as opposed to a score
• Uses chi-square statistics to determine statistically significant breaks
• Conceptual interpretation: (Observed-Expected)/Expected, i.e. how far observed counts deviate from expected counts; the chi-square statistic itself sums (Observed-Expected)²/Expected over all cells
CHAID
Total: 100% of mail file, 4% response rate
– <2 years tenure: 50% of mail file, 2% response rate
• Income <50K: 20% of mail file, 2.4% response rate
• Income >50K: 30% of mail file, 1.75% response rate
– >2 years tenure: 50% of mail file, 6% response rate
• <40 yrs. old: 20% of mail file, 4.9% response rate
• >40 yrs. old: 30% of mail file, 6.7% response rate

What criteria determine the end nodes?
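To make the chi-square idea concrete, the sketch below applies a Pearson chi-square test to the first split in the tree (tenure under vs. over 2 years). It is only an illustration of the statistic, not a full CHAID routine: the mail file size of 10,000 is an assumption, since the slide gives percentages only, and the function name is hypothetical.

```python
def chi_square_two_segments(n1, rate1, n2, rate2):
    """Pearson chi-square statistic for responders vs. non-responders in two segments."""
    overall_rate = (n1 * rate1 + n2 * rate2) / (n1 + n2)
    statistic = 0.0
    for n, rate in ((n1, rate1), (n2, rate2)):
        observed_resp, observed_non = n * rate, n * (1 - rate)
        # Expected counts if the segment responded at the overall rate
        expected_resp, expected_non = n * overall_rate, n * (1 - overall_rate)
        statistic += (observed_resp - expected_resp) ** 2 / expected_resp
        statistic += (observed_non - expected_non) ** 2 / expected_non
    return statistic


MAIL_FILE = 10_000  # hypothetical mail file size

# Tenure split from the tree: 50% of file at 2% response vs. 50% of file at 6% response
tenure_split = chi_square_two_segments(0.5 * MAIL_FILE, 0.02, 0.5 * MAIL_FILE, 0.06)

CRITICAL_VALUE_95 = 3.841  # chi-square critical value, 1 degree of freedom, 95% confidence
print(round(tenure_split, 2), tenure_split > CRITICAL_VALUE_95)
```

Under this assumed file size the split is far above the critical value, so it would count as a statistically significant break. With a much smaller file the same response rates could fall below the threshold, which is one reason a tree stops splitting.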