
1

Dr. Paul Bracewell
8th November '06

Undermining the Future

2

Overview

Rationale
Approaches
Case Study (StateFleet, NSW Government)
  Project Background
  Methodology
  Results

Summary

3

25 Years of SAS

The same techniques introduced in SAS 71 (first limited release of SAS) are still used today (Data Step, Regression and ANOVA).

http://en.wikipedia.org/wiki/SAS_System

4

Why the Title?

Undermining the Future – underneath any meaningful forecast, there is a lot of hard work required to ensure that the results are viable.

5

Rationale

Robust results depend on stable framework:
  Appropriate Data (and manipulation): DOMAIN EXPERT
  Appropriate Methodology: ANALYST
  Appropriate Interpretation: END USER

For Results to be deployed:
  “Marketing” of “Appropriateness”: SPONSOR
  “Marketing” of Results: TRUST

“apart from the price tag, there is little difference between a model that isn’t deployed and a model that isn’t built…”

6

Methodology

Data Mining Definitions

“Data mining is the … equivalent of sitting a huge number of monkeys down at keyboards, and then reporting on the monkeys who happened to type actual words.”

http://www.basketballfreesportspicks.com/glossary.shtml

“Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.”

Berry, M.J.A. & Linoff, G.S. (2000). Mastering data mining: The art and science of customer relationship management. John Wiley & Sons: New York.

7

Robust Methodologies…
  are rarely automatic
  require “expert” guidance
  typically have little involvement from monkeys
  need a team effort: interaction between skilled parties
  are repetitive and/or interactive
  embrace process variability
  allow for customers to evolve (not always a caterpillar)

8

Approaches

Statistical Process Control

Significant Change in Behaviour (Predicting Churn)
  within-customer changes
    standardisation (focus on change, not value of units)
  collective behaviour changes (between customers)
    seasonality
    growth/reduction in market
  if normality can be approximated (via transformation), can obtain a probability, i.e. a relative rank score in [0,1], of normally exhibiting that behaviour (eliminates impact of very large values)
  effectively creates a hypothesis test comparing current observed behaviour with behaviour from a specific period of time (past year, 4 years etc.)
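The scoring idea on this slide can be sketched roughly as follows. This is a minimal illustration only, not the method used in the talk: the log transform, the function names and the usage figures are all assumptions made for the example.

```python
import math
from statistics import mean, stdev


def normal_cdf(z):
    """Standard normal cumulative probability for z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def behaviour_score(history, current):
    """Relative rank score in [0, 1] for `current` against the customer's own history."""
    # Log transform (an assumed choice) to approximate normality and
    # reduce the influence of very large values.
    logged = [math.log1p(value) for value in history]
    centre, spread = mean(logged), stdev(logged)
    if spread == 0:
        return 0.5  # no variation in the history, so no evidence of change
    # Standardise: the score depends on the change, not on the unit of measure.
    z = (math.log1p(current) - centre) / spread
    return normal_cdf(z)


# Hypothetical monthly usage for one customer.
usual = [100, 110, 90, 105, 95]
print(behaviour_score(usual, 100))  # close to 0.5: typical behaviour
print(behaviour_score(usual, 20))   # close to 0: unusually low, a possible churn signal
```

A score near 0 or 1 plays the role of a small p-value in the hypothesis test described above: current behaviour is unlikely given the reference period.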

9

Approaches

Statistical Inference

Predicting Outcomes
  Rugby Example: use same principle as Chi-Square Statistic to create relativity rugby rating (Think of a results grid, Expected Value is: Total Points Scored @ Home × Total Points Scored Away / Total Points Scored)
  Why? To answer questions like: If Wellington plays Canterbury, Canterbury plays Auckland, how will Wellington fare against Auckland?
Continuous vs Discrete
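The expected-value calculation is the usual chi-square one: row total times column total, divided by the grand total. A small sketch, assuming a results grid where rows are the home side and columns the away side; the grid layout and the points are hypothetical, and how the talk turned these values into a rating is not stated in the slides.

```python
def expected_points(grid):
    """Expected value per cell: home (row) total x away (column) total / grand total."""
    home_totals = [sum(row) for row in grid]
    away_totals = [sum(column) for column in zip(*grid)]
    grand_total = sum(home_totals)
    return [[home * away / grand_total for away in away_totals]
            for home in home_totals]


teams = ["Wellington", "Canterbury", "Auckland"]
# Hypothetical grid: points[i][j] = points scored in the fixture where
# teams[i] is at home and teams[j] is away (diagonal unused).
points = [
    [0, 20, 30],
    [10, 0, 25],
    [15, 35, 0],
]

expected = expected_points(points)
for team, row in zip(teams, expected):
    print(team, [round(value, 1) for value in row])
```

Because every cell gets an expected value from the margins alone, the grid yields a figure for pairings that have not been played, which is the point of the Wellington versus Auckland question.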

10

Approaches

Time Series Analysis

Remove known components
Test residuals for any remaining explainable relationship
Repeat removal and residual testing until no relationship is left in the residuals (like strip mining)
Smoothing
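A rough sketch of the remove, test, repeat loop. The two components removed here (a straight-line trend and per-position seasonal means) and the stopping rule are assumptions chosen for illustration; the slides do not say which components or tests were used.

```python
def remove_linear_trend(series):
    """Subtract the least-squares straight line fitted against the time index."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    slope = sxy / sxx
    return [y - (y_mean + slope * (t - t_mean)) for t, y in enumerate(series)]


def remove_seasonal_means(series, period):
    """Subtract the average of each position within the seasonal cycle."""
    means = [sum(series[k::period]) / len(series[k::period]) for k in range(period)]
    return [y - means[t % period] for t, y in enumerate(series)]


def strip_components(series, period, max_passes=20, tolerance=1e-9):
    """Repeat removal until a further pass no longer changes the residuals."""
    residuals = list(series)
    for _ in range(max_passes):
        stripped = remove_seasonal_means(remove_linear_trend(residuals), period)
        # Stand-in for residual testing: did this pass still find something to remove?
        largest_change = max(abs(a - b) for a, b in zip(stripped, residuals))
        residuals = stripped
        if largest_change < tolerance:
            break
    return residuals


# Hypothetical series: linear growth plus a repeating 4-step pattern.
pattern = [5.0, -5.0, 0.0, 0.0]
series = [2.0 * t + pattern[t % 4] for t in range(24)]
residuals = strip_components(series, period=4)
print(max(abs(r) for r in residuals))  # effectively zero: nothing left to explain
```

A single pass is not enough here because the trend and seasonal fits interfere with each other slightly, which is why the removal is repeated.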

11

Approaches

Data Driven

large data sets, possibly census
what you see is what you get…
…and all that there is

12

Case Study

StateFleet (NSW)

13

Project Background

Used Car Market Increasingly Volatile
As Tax Payer Money Involved, StateFleet:
  can’t charge too much: poor use of Tax
  can’t charge too little: slush fund wiped out
Public determine vehicle price (auction)
Existing predictions not accurate enough

14

Reducing Impact of “Oil”

Split problem into 2 parts:
1. Predict Base Trend
     Public Buying Behaviour: confounding/nuisance variables
2. Model Impact of Attributes (mileage, engine size, etc.)
     Information known at completion of lease
       show that approach is viable: descriptive model
     Information known at start of lease
       show forecasting viability: predictive model

15

Other Techniques

Proportion of Value Modelled
  i.e. Used Sale Price/New Price
  reduce impact of sales, currency etc.
Use of change percentage: removes unit of measure ($)
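Both quantities are simple ratios. A small sketch with hypothetical prices; the function names are illustrative and not taken from the project.

```python
def proportion_of_value(used_sale_price, new_price):
    """Used sale price as a fraction of the new price (unit-free)."""
    return used_sale_price / new_price


def percent_change(previous, current):
    """Change from `previous` to `current`, as a percentage of `previous`."""
    return (current - previous) / previous * 100.0


# Hypothetical vehicle: bought new for $40,000, sold at auction for $18,000.
retained = proportion_of_value(18_000, 40_000)
print(retained)                        # 0.45
# Hypothetical shift in the retained proportion between two periods.
print(percent_change(0.50, retained))  # roughly -10 (percent)
```

Working with the proportion rather than the dollar amount makes vehicles with very different new prices comparable.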

16

Cracking the Problem

Base trend biggest cause of dynamic volatility
  e.g. petrol price, demand for small vs large cars, increased popularity of 4WD etc.
Perceived value of colour, engine, size etc. relatively constant once base trend removed

17

Forecasting Base Trend

Two-stage approach
  1. Decision Tree: non-linearity, interactions
  2. Regression
Required transparent, continuous results
  Justify predictions to Treasury
Crude form of cointegration used
  large vehicle sales dependent on small and 4WD sales at previous lags
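One possible reading of the two-stage approach, sketched under loud assumptions: a one-split decision tree picks up the non-linearity, then a straight-line regression within each segment gives a transparent, continuous prediction. The slides do not say how the tree and the regression were actually combined, and the data and function names below are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = 0.0 if sxx == 0 else sxy / sxx
    return y_mean - slope * x_mean, slope


def sse_about_mean(ys):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    centre = sum(ys) / len(ys)
    return sum((y - centre) ** 2 for y in ys)


def best_split(xs, ys):
    """Stage 1: one-level tree. Threshold on x that minimises total SSE.

    Assumes at least two distinct x values.
    """
    def cost(threshold):
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        return sse_about_mean(left) + sse_about_mean(right)
    return min(sorted(set(xs))[1:], key=cost)


def fit_two_stage(xs, ys):
    """Stage 2: a separate regression line on each side of the split."""
    threshold = best_split(xs, ys)
    left = [(x, y) for x, y in zip(xs, ys) if x < threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x >= threshold]
    return threshold, fit_line(*zip(*left)), fit_line(*zip(*right))


def predict(model, x):
    threshold, left, right = model
    intercept, slope = left if x < threshold else right
    return intercept + slope * x


# Hypothetical data with a jump and a change of slope at x = 5.
xs = list(range(10))
ys = [x if x < 5 else 20 + 2 * x for x in xs]
model = fit_two_stage(xs, ys)
print(model[0], predict(model, 2), predict(model, 7))
```

Each segment's prediction is an intercept plus a slope, which keeps the output continuous within a segment and easy to explain line by line.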

18

Predicting Base Trend (Utes)

Forecasting 2 Years in Advance (chart: Observed vs Predicted)

19

Predicting Cohort Value

With all available information (2 year forecast)

With all information known at start of lease (2 year forecast)

20

Information Loss

[Chart: R-Sq (0% to 100%) for Observed and Predicted Residual Values for Cohorts sold within 60 day hold-out sample period, by level of information: 1. Base, 2. General, 3. Specific, 4. Lease, 5. Client, 6. Condition, 7. Market]

21

Project Outcome

Improved prediction capability by 250%
Predicted results within 2% of actual result
Savings of $6,000,000 per year
Automated tool created:
  Saving time
  Results auditable

Results reported in Sydney Morning Herald (29/08/06)

22