Data Mining Industrial Projects and Case Studies
Kwok-Leung Tsui
Industrial and Systems Engineering
Georgia Institute of Technology
1. AT&T business data mining
2. Inventory management in military maintenance
3. Sea cargo demand forecasting
4. SMARTRAQ project in transportation policies
5. Location problem of letterbox
6. Home improvement store shrinkage analysis
7. Hotels & resorts chain data mining
8. Used car auction sales data mining
9. Fast food restaurant call center
Industrial Projects
Data Mining in Telecom. (Funded AT&T project)
~$160 billion per year industry (~$70B long distance & ~$90B local)
100 million+ customers/accounts/lines
>1 billion phone calls per day
Book closing (estimating this month's price/usage/revenue)
Budgeting (forecasting next year's price/usage/revenue)
Segmentation (clustering of usage, growth, ...)
Cross selling (association rules)
Churn (disconnect prediction & tracking)
Fraud (detection of unusual usage time-series behavior)
Each of these problems is worth hundreds of millions of dollars.
A contractor manages parts inventory for aircraft maintenance
Characterization and forecasting of demand and lead time distributions
60,000 different parts and 500 bench locations
Data tracked by an automated system
Demand data not available & stockout penalty
Inventory Management in Air Force (Funded project)
Sea cargo network optimization
Contract planning & booking control
Characterize & forecast sea cargo demand distribution & cost structure
Improve ocean carrier and terminal operation efficiency
Data Mining in Sea Cargo Application (Funded TLIAP project)
Strategies for Metropolitan Atlanta’s Regional Transportation & Air Quality
Five-year project sponsored by Transportation Dept., Federal Highway Admin., EPA, CDC, etc.
Assess air quality, travel behavior, land use & transportation policies
Reduce auto-dependence and vehicle emissions
SMARTRAQ Project for Transportation Policies
Improve performance of express-mail dropoff letter boxes
50,000 letter boxes & 8 months of transaction data
Relate performance to important factors, e.g., region, demographics, adjacent competition, pick-up schedule
Comparison with direct competitors
Customer demand analysis and forecast
Mining of Letter Box Transaction Data
Inventory shrinkage costs US retailers $32 billion
Shrinkage = book inventory – inventory on hand
Working with a home improvement store’s Loss Prevention Group
Develop predictive model to relate shrinkage to important variables
Extract hidden knowledge to reduce loss and improve operation efficiency
Data Mining for Shrinkage Analysis in Retail Industry
Manage chain hotels and resorts at different scales
Evaluate impact of promotional programs
Forecasting of customer behavior in frequent stay program
Monitor performance in customer survey
Predict performance with important factors
Data Mining for Hotels and Resorts Chain Business
Maintain all used car auction data from the last 20 years
Provide service to customers and dealers on auction price projection
Price depreciation by year
Develop methods for mileage, seasonal, and regional adjustments
Data Mining of Used Car Auction Data
Centralized call center for drive-through customers of over 50 chain restaurants
Contractor manages call center with constraints on time to answer customers
Scheduling and management of human resources
Simulation and optimization algorithms
Data mining and forecasting on aggregate and individual demand
Fast Food Restaurant Call Center
1. A Medical Case Study
2. Profile Monitoring in Telecommunication
3. Letterbox Transaction Data Mining
4. A Market Analysis Case Study
5. Air Force Parts Inventory Data Mining
Data Mining Case Studies
1. Telecommunication Data Mining
2. Churn Modeling in Wireless Industry
3. Market Basket Analysis
4. Supermarket Mining I
5. Supermarket Mining II
6. Banking and Finance
More DM Case Studies (Berry & Linoff)
A Review & Analysis of MTS
(Technometrics, 2003)
W. H. Woodall and R. Koudelik, Virginia Tech
K.-L. Tsui and S. B. Kim, Georgia Tech
Z. G. Stoumbos, Rutgers University
Christos P. Carvounis, MD, State University of New York at Stony Brook
A Medical Case Study using MTS and DM Methods
Primary MTS References
Taguchi, G., and Rajesh, J. (2000), "New Trends in Multivariate Diagnosis," Sankhya: The Indian Journal of Statistics, 62, 233-248.
Taguchi, G., Chowdhury, S., and Wu, Y. (2001), The Mahalanobis-Taguchi System, New York: McGraw-Hill.
Taguchi, G., and Rajesh, J. (2002), a new book on the MTS.
P.C. Mahalanobis
Very influential in large-scale sample survey methods
Founder of the Indian Statistical Institute in 1931
Architect of India's industrial strategy
Advisor to Nehru and friend of R.A. Fisher
Deming Prize in Japan: 4 times
Rockwell Medal (1986) citation: combined engineering & statistical methods to achieve rapid improvements in cost and quality by optimizing product design and manufacturing processes.
1978-79: Ford / Bell Labs teams "discover" the method
1980: first US experiences (Xerox / Bell Labs)
1990-present: Taguchi Methods and DOE well recognized by all industries for improving product or manufacturing process design.
Genichi Taguchi Japanese Quality Engineer
MTS is said to be ...
A groundbreaking new philosophy for data mining from multivariate data.
A process of recognizing patterns and forecasting results.
Used by Fuji, Nissan, Sharp, Xerox, Delphi Automotive Systems, Ford, GE, and others.
Beyond theory: intended to create an atmosphere of excitement for management, engineering, and academia.
Applications include the following:
Patient monitoring
Medical diagnosis
Weather and earthquake forecasting
Fire detection
Manufacturing inspection
Clinical trials
Credit scoring
MTS Overview
Similar to a classification method using a discriminant-type function.
Based on multivariate observations from a "normal" and an "abnormal" group.
Used to develop a scale that measures how abnormal an item is while matching a pre-specified or estimated scale.
The MTS scale is used for variable selection, diagnosis, forecasting, and classification.
MTS Procedure: Stage 1
Identify p variables $V_i$, i = 1, 2, ..., p, that measure the "normality" of an item.
Collect multivariate data on the normal group, $X_j$, j = 1, 2, ..., m.
Standardize each variable to obtain $Z_i$ vectors.
Calculate the Mahalanobis distances (MD) for the m observations:

$MD_i = \frac{1}{p}\,\mathbf{Z}_i^{T}\,\mathbf{S}^{-1}\,\mathbf{Z}_i, \quad i = 1, \dots, m,$

where S is the sample correlation matrix of the Z's for the normal group.
Stage 2
Collect data on t abnormal items, $X_i$, i = m+1, m+2, ..., m+t.
Standardize each variable using the normal-group means and standard deviations.
Calculate MD values $MD_i$, i = m+1, m+2, ..., m+t.
According to the MTS, the scale is good if the MD values for the abnormal items are higher than those for the normal items (good separation).
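Stages 1-2 can be sketched in a few lines. The data below are synthetic stand-ins for the normal and abnormal groups (not the study's data), and the helper names are ours.

```python
# Sketch of MTS Stages 1-2: MDs for a normal group, then MDs for abnormal
# items standardized with the NORMAL group's statistics. Synthetic data;
# the abnormal group is mean-shifted, so its MDs should come out larger.
import numpy as np

rng = np.random.default_rng(42)
normal = rng.normal(0.0, 1.0, size=(200, 6))
abnormal = rng.normal(2.5, 1.0, size=(17, 6))    # shifted -> "unhealthy"

# Stage 1: standardize the normal group and invert its correlation matrix.
mean, std = normal.mean(axis=0), normal.std(axis=0, ddof=1)
Zn = (normal - mean) / std
S_inv = np.linalg.inv(np.corrcoef(Zn, rowvar=False))
p = normal.shape[1]

def md(X):
    """MD_i = Z_i' S^-1 Z_i / p, with Z from the normal-group statistics."""
    Z = (X - mean) / std
    return np.einsum("ij,jk,ik->i", Z, S_inv, Z) / p

# Stage 2: the scale is "good" if abnormal MDs exceed normal MDs.
md_normal, md_abnormal = md(normal), md(abnormal)
```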
Stage 3
Identify the useful variables using orthogonal arrays (OAs) and signal-to-noise (S/N) ratios.
The MTS uses a design-of-experiments approach as an optimization tool to choose the variables that maximize the average S/N ratio.

Use of DOE for Variable Selection
Design an OA experiment using all variables.
For each row of the OA (a given subset of variables):
- compute $MD_i$ for each observation in the abnormal groups;
- determine an $M_i$ value (the true severity level or working average) for each abnormal group;
- compute the S/N ratio based on the $MD_i$ and $M_i$.
Determine significant variables using main-effect analysis with the S/N ratio as the response.
An Example OA (+ = variable included; − = variable excluded)

Run   V1  V2  V3  ...  V17  S/N Ratio
1     +   +   +   ...  +    SN1
2     −   +   +   ...  +    SN2
3     +   −   +   ...  +    SN3
4     −   −   +   ...  +    SN4
5     +   +   −   ...  +    SN5
6     −   +   −   ...  +    SN6
...
32    −   −   −   ...  −    SN32
Dynamic S/N Ratio (multiple abnormal groups)
First regress $Y_i = \sqrt{MD_i}$ on $M_i$ to obtain the slope estimate $\hat{\beta}$; then define the S/N ratio:

$10\log_{10}\!\left[\frac{(SSR - MSE)/r}{MSE}\right] \approx 10\log_{10}\!\left[\frac{\hat{\beta}^2}{MSE}\right]$

where SSR is the regression sum of squares, MSE the mean squared error, and $r = \sum_i M_i^2$.
Larger-is-Better S/N Ratio (single abnormal group)
For t abnormal observations, the larger-is-better S/N ratio is

$\eta = -10\log_{10}\!\left[\frac{1}{t}\sum_{i=1}^{t}\frac{1}{MD_i}\right]$
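The larger-is-better ratio is a one-liner; the function name below is ours.

```python
# Taguchi larger-is-better S/N ratio over a set of MD values:
# eta = -10 log10( (1/t) * sum(1/MD_i) ). Larger MDs -> larger eta.
import numpy as np

def sn_larger_is_better(md):
    md = np.asarray(md, dtype=float)
    return -10.0 * np.log10(np.mean(1.0 / md))
```

For example, a run whose abnormal MDs are all 10 scores higher than one whose MDs are all 2, as the formula intends.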
Main Effect Analysis
Compute the level averages of the S/N ratios (+ and −) for each variable,

$\overline{S/N}_i^{+} - \overline{S/N}_i^{-},$

and keep only the variables with positive (significant) estimated main effects.
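The main-effect selection step can be sketched as follows; the toy OA and S/N values are made up for illustration, not taken from the study.

```python
# Main-effect analysis on an OA: for each variable, average the S/N ratios
# of runs that include it (+) minus runs that exclude it (-); keep the
# variables whose estimated effect is positive.
import numpy as np

def select_variables(oa, sn):
    """oa: runs x variables matrix of +1/-1; sn: S/N ratio per run."""
    eff = (sn[:, None] * (oa == 1)).sum(0) / (oa == 1).sum(0) \
        - (sn[:, None] * (oa == -1)).sum(0) / (oa == -1).sum(0)
    return np.where(eff > 0)[0], eff

# Toy 4-run, 3-variable design with hypothetical S/N ratios: runs that
# include V1 score much higher, so V1 should be kept.
oa = np.array([[ 1,  1,  1],
               [ 1, -1, -1],
               [-1,  1, -1],
               [-1, -1,  1]])
sn = np.array([10.0, 8.0, 3.0, 1.0])
keep, effects = select_variables(oa, sn)
```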
Stage 4
Based on the chosen variables, use the MD scale for diagnosis and forecasting.A threshold is given such that the losses due to the two types of classification errors are balanced in some sense.
A Medical Case Study
Medical diagnosis of liver disease.
200 healthy patients and 17 unhealthy patients (10 with a mild level of disease and 7 with a medium level).
Age, gender, and 15 blood-test variables.
(The data are available.)
Case Study Blood Test Variables with Normal Ranges

Variable | Symbol | Acronym | Normal Ranges | Taguchi et al. (2001) Normal Ranges
Total Protein in Blood | V3 | TP | 6.0 to 8.3 gm/dL | 6.5-7.5 gm/dL
Albumin in Blood | V4 | Alb | 3.4 to 5.4 gm/dL | 3.5-4.5 gm/dL
Cholinesterase (Pseudocholinesterase) | V5 | ChE | Depends on technique: 8 to 18 U/mL | 0.60-1.00 dpH
Glutamate O Transaminase (Aspartate Aminotransferase) | V6 | GOT | 10 to 34 IU/L | 2-25 units
Glutamate P Transaminase (Alanine Transaminase) | V7 | GPT | 6 to 59 U/L | 0-22 units
Lactic Dehydrogenase | V8 | LDH | 105 to 333 IU/L | 130-250 units
Alkaline Phosphatase | V9 | Alp | 0-250 U/L normal; 250-750 U/L moderate elevation | 2.0-10.0 units
gamma-Glutamyl Transpeptidase (gamma-Glutamate Transferase) | V10 | r-GPT | 0 to 51 IU/L | 0-68 units
Leucine Aminopeptidase | V11 | LAP | Serum: male 80 to 200 U/mL; female 75 to 185 U/mL | —
Total Cholesterol | V12 | TCh | <200 desirable; 200-239 borderline high; 240+ high | —
Triglyceride | V13 | TG | 10 to 190 mg/dL | —
Phospholipid | V14 | PL | Platelet: 150,000 to 400,000/mm3 | —
Creatinine | V15 | Cr | 0.8 to 1.4 mg/dL | —
Blood Urea Nitrogen | V16 | BUN | 7 to 20 mg/dL | —
Uric Acid | V17 | UA | 4.1 to 8.8 mg/dL | —
Some Results and Conclusions
Largest MD in the healthy group: 2.36. Lowest MD in the unhealthy group: 7.73.
Thus there is substantial separation between the healthy and unhealthy groups.
The $M_i$ values are estimated from averages of MD values.
OA32 (+ = variable included; − = variable excluded)

Run   V1  V2  V3  ...  V17  S/N Ratio
1     +   +   +   ...  +    SN1
2     −   +   +   ...  +    SN2
3     +   −   +   ...  +    SN3
4     −   −   +   ...  +    SN4
5     +   +   −   ...  +    SN5
6     −   +   −   ...  +    SN6
...
32    −   −   −   ...  −    SN32
Average S/N ratio:
All variables: −6.25
MTS combination: −4.27
OA optimal combination: −3.34
Overall optimal combination: −1.76

Thus the proposed method does not yield the optimum combination. The MTS average S/N ratio was at about the 95th percentile.
MDs for Unhealthy Group for Various Combinations of Variables

Subject | Disease Level | All | MTS | OA Optimal | Optimal
1  | Mild   | 7.727   | 13.937  | 8.058  | 13.329
2  | Mild   | 8.416   | 14.726  | 7.485  | 8.616
3  | Mild   | 10.291  | 17.342  | 9.498  | 8.002
4  | Mild   | 7.204   | 10.804  | 4.951  | 12.311
5  | Mild   | 10.590  | 18.379  | 9.367  | 12.042
6  | Mild   | 10.557  | 8.605   | 6.643  | 6.139
7  | Mild   | 13.317  | 13.896  | 7.794  | 6.139
8  | Mild   | 14.812  | 27.910  | 8.162  | 22.666
9  | Mild   | 15.693  | 28.110  | 10.278 | 26.000
10 | Mild   | 18.911  | 35.740  | 20.992 | 14.422
11 | Medium | 12.610  | 20.828  | 16.517 | 20.833
12 | Medium | 12.256  | 18.578  | 14.607 | 19.312
13 | Medium | 19.655  | 34.127  | 35.229 | 44.614
14 | Medium | 43.039  | 85.564  | 13.105 | 32.720
15 | Medium | 78.639  | 74.175  | 9.560  | 28.560
16 | Medium | 97.268  | 104.424 | 29.201 | 31.810
17 | Medium | 135.698 | 123.022 | 44.742 | 57.226
Plots of MDs for Unhealthy Group for Various Combinations of Variables
[Dotplots of the MD values by disease level (Mild, Medium) for each combination of variables: All, MTS, OA Optimal, Optimal.]
Variables for Unhealthy Patients Well Outside Normal Ranges
Subject Number Variable Number
1 12, 13
2 None
3 None
4 13
5 10
6 7
7 7
8 13
9 12, 13
10 4, 12
11 10, 12
12 10
13 10
14 10, 13
15 6, 7, 13
16 3, 6, 7, 10, 12
17 6, 7, 8, 10, 13
Medical Analysis
V4, V6, V7, V9, and V10 are crucial for liver disease diagnosis and classification.
Medical diagnosis shows that patients 15-17 exhibit some chronic liver disorder.
Cluster analysis on V4, V6, V7, V9, and V10 yields only two groups; only patients 15-17 are classified as "abnormal." This result is consistent with the medical diagnosis.
[Dotplots for V4 Alb, V6 GOT, V7 GPT, V9 Alp, and V10 r-GPT by group (Normal, Mild, Medium), with patients 15-17 marked.]
Tree Classification Methods
Classification Trees
• The CART (Classification And Regression Trees) methodology is known as binary recursive partitioning. For more detailed information on CART, see Breiman, Friedman, Olshen, & Stone (1984), Classification and Regression Trees.
• C4.5 is a decision-tree learning system introduced by Quinlan (Quinlan, J. Ross (1993), C4.5: Programs for Machine Learning). A tutorial is available at: http://www2.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/c4.5/tutorial.html
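The core of binary recursive partitioning is the impurity-minimizing split search. A minimal sketch of one such step, using Gini impurity as in CART, on illustrative data (not the liver-disease data set):

```python
# One step of CART-style binary recursive partitioning: choose the
# (feature, threshold) split minimizing the weighted Gini impurity.
import numpy as np

def gini(y):
    """Gini impurity 1 - sum(p_k^2) of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature index, threshold, weighted Gini) of the best split."""
    best = (None, None, np.inf)
    n = len(y)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:     # candidate thresholds
            left = X[:, j] <= thr
            score = (left.sum() * gini(y[left]) +
                     (~left).sum() * gini(y[~left])) / n
            if score < best[2]:
                best = (j, thr, score)
    return best

# Toy data: feature 0 separates the two classes perfectly at <= 3.
X = np.array([[1.0, 5.0], [2.0, 6.0], [3.0, 7.0], [10.0, 5.5], [11.0, 6.5]])
y = np.array([1, 1, 1, 2, 2])
j, thr, score = best_split(X, y)
```

A full tree repeats this search recursively on each side of the split until the nodes are pure or a size limit (such as the four terminal nodes seen here) is reached.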
Tree from S-Plus

Splits: V5 < 381.5 at the root; V10 < 63 and V6 < 37.5 at the next level.
Terminal nodes (predicted class with counts): 1(196); 1(4), 3(1); 2(8); 2(2), 3(6).

Variables actually used in tree construction: V5, V10, and V6.
Number of terminal nodes: 4
Misclassification error rate: 3/217 = 0.0138

Classification matrix based on learning sample:

Actual \ Predicted    1    2    3
1                   200    0    0
2                     0    8    2
3                     1    0    6
Tree from C4.5

Splits: V5 <= 364 at the root; then V10 <= 63; then V6 <= 26.
Terminal nodes (predicted class with counts): 1(200), 3(1); 2(8); 2(2); 3(6).

Variables actually used in tree construction: V5, V10, and V6.
Number of terminal nodes: 4
Misclassification error rate: 1/217 = 0.0046

Classification matrix based on learning sample:

Actual \ Predicted    1    2    3
1                   200    0    0
2                     0   10    0
3                     1    0    6
Scatter Plot of V5 vs. V10 vs. V6
[3-D scatter plot of V5 ChE, V10 r-GPT, and V6 GOT by group (Normal, Mild, Medium), with patients 15-17 marked.]
Scatter Plot of V5 vs. V6
[Scatter plot of V5 ChE vs. V6 GOT by group (Normal, Mild, Medium), with patients 15-17 marked.]

Scatter Plot of V5 vs. V10
[Scatter plot of V5 ChE vs. V10 r-GPT by group, with patients 15-17 marked.]

Scatter Plot of V10 vs. V6
[Scatter plot of V10 r-GPT vs. V6 GOT by group, with patients 15-17 marked.]

[Dotplots for V5 ChE, V6 GOT, and V10 r-GPT by group, with patients 15-17 marked.]
Comparison with Taguchi Approaches
All variables: V1 – V17
MTS: V4, V5, V10, V12, V13, V14, V15, V17
OA Optimal: V1, V4, V5, V10, V11, V14, V15, V16, V17
Optimal: V3, V5, V10, V11, V12, V13, V17
Classification Trees : V5, V6, V10
Disease Level | All | MTS | OA Optimal | Optimal | Trees
Mild   | 7.727   | 13.937  | 8.058  | 13.329 | 7.366
Mild   | 8.416   | 14.726  | 7.485  | 8.616  | 18.789
Mild   | 10.291  | 17.342  | 9.498  | 8.002  | 9.068
Mild   | 7.204   | 10.804  | 4.951  | 12.311 | 6.517
Mild   | 10.590  | 18.379  | 9.367  | 12.042 | 29.864
Mild   | 10.557  | 8.605   | 6.643  | 6.139  | 10.869
Mild   | 13.317  | 13.896  | 7.794  | 6.139  | 10.869
Mild   | 14.812  | 27.910  | 8.162  | 22.666 | 8.222
Mild   | 15.693  | 28.110  | 10.278 | 26.000 | 9.155
Mild   | 18.911  | 35.740  | 20.992 | 14.422 | 16.420
Medium | 12.610  | 20.828  | 16.517 | 20.833 | 42.681
Medium | 12.256  | 18.578  | 14.607 | 19.312 | 38.523
Medium | 19.655  | 34.127  | 35.229 | 44.614 | 86.796
Medium | 43.039  | 85.564  | 13.105 | 32.720 | 28.252
Medium | 78.639  | 74.175  | 9.560  | 28.560 | 208.102
Medium | 97.268  | 104.424 | 29.201 | 31.810 | 228.428
Medium | 135.698 | 123.022 | 44.742 | 57.226 | 199.304
MDs for Unhealthy Group for Various Combinations of Variables
[Dotplots of the MD values by disease level (Mild, Medium) for each combination of variables: All, MTS, OA Optimal, Optimal, Trees; horizontal scale 0 to 250.]
Conclusion
The MD values and dotplots show that only the MD scale based on the variables used by the classification trees, i.e., V5, V6, and V10, does a good job of discriminating between patients with mild-level disease and patients with medium-level disease. (Maybe MD is a good measure for multivariate data.)
Comparison with Medical Analysis
V4, V6, V7, V9, and V10 are crucial for liver disease diagnosis and classification.
Medical diagnosis shows that patients 15-17 exhibit some chronic liver disorder.
Cluster analysis on V4, V6, V7, V9, and V10 yields only two groups; only patients 15-17 are classified as "abnormal." This result is consistent with the medical diagnosis.
Correlations

Variables crucial for medical diagnosis (rows) vs. variables in the classification trees (columns):

      V5      V6      V10
V4    0.501   -0.505  -0.184
V6    -0.370  1       0.507
V7    -0.365  0.905   0.485
V9    -0.305  0.197   0.269
V10   -0.189  0.507   1
[Dotplots for V4 Alb, V7 GPT, and V9 Alp by group (Normal, Mild, Medium), with patients 15-17 marked.]
OA & main-effect analysis do not give the overall optimum.
The MTS discriminant function (S/N ratios) does not separate the two unhealthy groups.
The variables selected by the MTS are not appropriate for detecting liver disease based on medical diagnosis.
Tree methods separate the two unhealthy groups.
MD may be a good distance measure for multivariate data.
Results are based on the current data and training error.
Case Study Summary
Discussions
The MTS ignores considerable previous work in application areas such as medical diagnosis and in classification methods.
The MTS ignores sampling variation and discounts variation between units.
The use of OAs cannot be justified.
The MTS is not a well-defined approach.
Traditional statistical approaches may work better in many cases.
Despite its flaws, we expect the MTS to be used in many companies.
[Scatter plot of V6 GOT vs. V7 GPT by group (Normal, Mild, Medium), with patients 15-17 marked.]
Correlation (V6, V7) = 0.905
[Scatter plot of V12 TCh vs. V14 PL by group, with patients 15-17 marked.]
Correlation (V12, V14) = 0.807

[Scatter plot of V10 r-GPT vs. V11 LAP by group, with patients 15-17 marked.]
Correlation (V10, V11) = 0.646

[Scatter plot of V13 TG vs. V14 PL by group, with patients 15-17 marked.]
Correlation (V13, V14) = 0.616

[Scatter plot of V3 TP vs. V4 Alb by group, with patients 15-17 marked.]
Correlation (V3, V4) = 0.604
A SPC Approach for Business Activity Monitoring
(IIE Transactions, 2006)
W. Jiang, Stevens Institute of Technology
T. Au, AT&T
K.-L. Tsui, Georgia Institute of Technology
A Telecommunication Case Study
A General Framework for Modeling & Monitoring of Dynamic Systems
Dynamic Monitoring (A General Framework)

Problem: Profile
– Time-domain profile
– Profile with controllable predictors
– Profile with uncontrollable predictors

Objective
– Detection/classification
– Interpretation
– Forecasting/prediction

Segmentation
– Known
– Unknown

Model Selection
– Global without segmentation
– Global with segmentation
– Local within segment

Monitoring
– Phase I: estimating unknown parameters
– Phase II: monitoring and detecting
– Anticipated drifts vs. unanticipated changes

Dynamic Update

Actions
Applications
Manufacturing Processes
– Stamping tonnage signal data (functional data)
– Nortel's antenna signal data (functional data)
– Mass Flow Controller (MFC) calibration (linear profile)
– Vertical Density Profile (VDP) data (nonlinear profile)
Service Operations
– Used car price mining and prediction
– Telecom customer usage
– Hotel performance monitoring
– Fast-food drive-through call center forecasting & scheduling
Manufacturing:Stamping Tonnage Signal Data
Figure 2: A Tonnage Signal and Some Possible Faults (Jin and Shi 1999)
Stamping Tonnage Signal Data
Problem: time-domain profile (a tonnage signal represents the stamping force over a process cycle).
Objective: fault detection and classification.
Segmentation & Model Selection: known segmentation; most process faults occur only in specific working stages, and the boundaries and sizes of segments are determined by process knowledge (Jin and Shi 1999). Global model: wavelet transforms.
Monitoring: for each segment, use T² charts based on selected wavelet coefficients (Jin and Shi 2001).
Dynamic Update: classify a new signal as normal, a known fault, or a new fault, and update the wavelet-coefficient selection and parameter estimates (e.g., μ, Σ) using all available data.
Actions: identify and remove assignable causes.
Service: Telecom Customer Usage
Problem: profile with uncontrollable predictors.
Objective: abnormal-behavior detection and classification; forecasting/prediction.
Segmentation & Model Selection: unknown segmentation; segment customers based on demographic, geographic, psychographic, and/or behavioral information, then fit a model for each customer segment, e.g., linear regression.
Monitoring: use the model built for each segment to monitor customer behavior, e.g., monitor the linear-regression parameter vector β using a T² chart.
Dynamic Update: update customer segmentation, segmental model fitting, and/or parameter monitoring, e.g., parameter updates based on known trends.
Actions: service improvement, customer approval, etc.
Telecom. Customer Usage
Profile: profile with uncontrollable predictors
Objective– Abnormal behavior detection and classification– Forecasting/prediction
Segmentation– Unknown (segments are defined by customer information.)Model Selection– segmental (e.g. linear regression on uncontrollable predictors for each segment)
Monitoring – Phase I: unknown control chart parameters estimated from data– Phase II: monitoring by control charts, like T2 chart, EWMA chart, etc.
Actions: service improvement, customer approval, etc.
Dynamic Update– Update segmentation, model selection and/or parameter monitoring
A SPC Approach for Business Activity Monitoring
Jiang, Au, and Tsui (2006), to appear in IIE Transactions
Churn Detection via Customer Profiling
Qian, Jiang, and Tsui (2006), International Journal of Production Research
Activity monitoring: monitoring for interesting events that require action (Fawcett and Provost, 1999). Examples:
Credit card or insurance fraud detection
Churn modeling and detection
Computer intrusion detection
Network performance monitoring

Objective: trigger alarms for action accurately and as quickly as possible once the activity occurs.
Activity Monitoring
Profiling Approach (SPC & hypothesis testing):
Characterize populations of key variables that describe normal activity
Trigger an alarm on activity that deviates from normal

Discriminating Approach (classification):
Establish models & patterns of abnormal activity w.r.t. normal
Apply pattern recognition to identify abnormal activity

Other Approaches:
Hypothesis testing vs. classification
Neural networks for SPC problems (Hwarng et al.)
Apply other classification methods to SPC
DOE for variable selection in discrimination
Detect complex patterns in SPC
Activity Monitoring
The objective of activity monitoring is similar to that of statistical process control (SPC).
Multivariate control-chart methods for continuous and attribute data may be needed.
More sophisticated tools are needed.
Activity Monitoring
STATISTICAL PROCESS CONTROL
Widely used in the manufacturing industry for variation reduction by discriminating between:
Common causes
Assignable causes
Evaluation: in-control vs. out-of-control
Performance: false alarm rate; average run length (ARL)
Techniques: Shewhart chart, EWMA chart, CUSUM chart
STATISTICAL PROCESS CONTROL
Two stages of implementation:
Phase 1 (retrospective): off-line modeling
– Identify and clear outliers
– Estimate in-control models
Phase 2 (prospective): on-line deployment
– Trigger out-of-control conditions
– Isolate and remove causes of signals
AN EXAMPLE
[Shewhart chart, EWMA chart, and CUSUM chart of the same series x over time 0-100.]
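The three charts can be sketched on a synthetic series with a level shift; the parameters below (3-sigma limits, λ = 0.2, k = 0.5, h = 5) are conventional textbook choices, not values from the example.

```python
# Shewhart, EWMA, and CUSUM statistics on a synthetic N(0,1) series that
# shifts to mean 1.5 at time 80; alarms should fire after the shift.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 80), rng.normal(1.5, 1, 20)])

# Shewhart: flag points outside mu +/- 3 sigma (here mu=0, sigma=1).
shewhart_alarm = np.abs(x) > 3

# EWMA: z_t = (1-lam) z_{t-1} + lam x_t, steady-state 3-sigma limits.
lam = 0.2
z = np.zeros_like(x)
for t in range(1, len(x)):
    z[t] = (1 - lam) * z[t - 1] + lam * x[t]
ewma_limit = 3 * np.sqrt(lam / (2 - lam))
ewma_alarm = np.abs(z) > ewma_limit

# One-sided upper CUSUM: c_t = max(0, c_{t-1} + x_t - k); alarm when c_t > h.
k, h = 0.5, 5.0
c = np.zeros_like(x)
for t in range(1, len(x)):
    c[t] = max(0.0, c[t - 1] + x[t] - k)
cusum_alarm = c > h
```

The contrast these charts illustrate: the Shewhart chart reacts only to large individual deviations, while the EWMA and CUSUM accumulate small, sustained shifts and therefore signal the 1.5-sigma level change much sooner.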
KEY CHALLENGES TO SPC
Off-line modeling: robust models with outliers and change points; automatic model building
Scalability: a single algorithm tracking millions of data streams
Importance of early signals
Interpretation is mostly qualitative; sacrificing accuracy for speed is acceptable
Diagnosis and updating: business rules
Online fashion: incomplete data, censored and/or truncated
SPC Approach for CRM Monitoring
PHASE 1: AUTOMATIC MODELING & PROFILING
PHASE 2: PROFILE MONITORING & UPDATING
PHASE 3: EVENT DIAGNOSIS

CRM MONITORING PROCESS
[Process flow: business event definition → customer profiling → event monitoring and triggering (with profile updating) → small set of interesting customers → customer diagnosis.]
SPC FOR CRM - PHASE 1
Off-line modeling: building customer profiles robustly (time consuming)

Requirements
– A single, time-varying model capturing most customers' behavior
– Automatic modeling, less human intervention

Techniques
– Robust and efficient estimation methods
– Change-point modeling

Parameter Selection
– MSE/AIC/BIC
– Business requirements / domain knowledge
SPC FOR CRM - PHASE 2
On-line customer-profile updating and monitoring, in search of interesting events requiring action

Requirements
– Recursive vs. time window
– Signal accurately and as quickly as possible

Techniques
– Markovian-type updating (storage space & time)
– State-space control models
SPC FOR CRM - PHASE 3
Diagnosis and Re-profiling

Requirements
– Following signals
– Robustness to outliers, trends, ...
– Attribute identification

Techniques
– Bayesian models
– Nonlinear filtering methods
PHASE 1: CUSTOMER PROFILE
Dynamic Linear Model (West and Harrison, 1997)

The profile of customer i at time t comprises:
Size/level $M_t(i)$
Trend $T_t(i)$
Variability/variance $V_t(i)$
Seasonality $S_t(i)$ (optional)

$\{X_t(i)\} \sim P_t(i) = [M_t(i),\, T_t(i),\, V_t(i)]'$
Estimation Methods
Least Squares Estimation (LSE)
Least Absolute Deviation (LAD)
Dummy change-point model with LSE
Dummy change-point model with LAD

LSE and LAD
A DUMMY CHANGE-POINT MODEL

Solve global models assuming dummy change points at lag p:

$a(p) = \arg\min_{a_0,\,a_1} \sum_{k=0}^{p-1} \left[X_{t-k} - (a_0 + a_1 k)\right]^2$

$a(p)$ can be obtained recursively by reversing the DES method with $\lambda = 1$.

Combine the forecasts with exponential weights: $\hat{a}_t = \sum_{p} w_p\, a(p)$.

The local variance can be estimated via bootstrap resampling.
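A minimal sketch of the idea, under our own simplifying assumptions (ordinary least-squares line fits over each candidate window and an assumed geometric weight decay), not the paper's exact recursion:

```python
# Dummy change-point idea: fit a local linear model over the last p
# observations for several candidate windows p (each p is a hypothetical
# change point), then combine the fitted current levels with exponential
# weights. The windows and lambda here are illustrative choices.
import numpy as np

def combined_level(x, windows=(5, 10, 20), lam=0.9):
    x = np.asarray(x, dtype=float)
    fits = []
    for p in windows:
        k = np.arange(p)                     # k = 0 is the most recent point
        y = x[::-1][:p]                      # last p observations, newest first
        coef = np.polynomial.polynomial.polyfit(k, y, 1)
        fits.append(coef[0])                 # intercept = level at current time
    w = lam ** np.arange(len(windows))       # heavier weight on short windows
    return np.dot(w / w.sum(), fits)
```

On a constant series every window agrees, and on a noiseless linear trend every window recovers the latest value exactly, so the combination is only doing work when the windows disagree (i.e., when a change point is plausible).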
PHASE 2: CUSTOMER PROFILE UPDATING AND MONITORING
History data cleaning and profiling
Forecasting
Online monitoring
Markovian updating

One-step forecast: $\hat{M}_{t+1}(i) = M_t(i) + T_t(i)$

Markovian (EWMA-type) updates:

$M_{t+1}(i) = (1-\lambda_M)\,\hat{M}_{t+1}(i) + \lambda_M\,X_{t+1}(i)$
$T_{t+1}(i) = (1-\lambda_T)\,T_t(i) + \lambda_T\,(M_{t+1}(i) - M_t(i))$
$V_{t+1}(i) = (1-\lambda_V)\,V_t(i) + \lambda_V\,(X_{t+1}(i) - M_{t+1}(i))^2$

Signal when $|X_{t+1}(i) - \hat{M}_{t+1}(i)| > K\sqrt{V_t(i)}$.
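The Markovian update and alarm rule can be sketched as below; the smoothing constants and K are illustrative choices, not values from the paper.

```python
# EWMA-type update of a (level, trend, variance) customer profile with a
# K-sigma monitoring rule on the one-step forecast error.
import numpy as np

def update(profile, x_new, lam_m=0.2, lam_t=0.1, lam_v=0.1, K=4.0):
    """profile = (M, T, V); returns (updated profile, alarm flag)."""
    M, T, V = profile
    M_hat = M + T                                    # one-step forecast
    alarm = abs(x_new - M_hat) > K * np.sqrt(V)      # monitoring rule
    M_new = (1 - lam_m) * M_hat + lam_m * x_new      # level update
    T_new = (1 - lam_t) * T + lam_t * (M_new - M)    # trend update
    V_new = (1 - lam_v) * V + lam_v * (x_new - M_new) ** 2   # variability
    return (M_new, T_new, V_new), alarm

profile = (100.0, 0.0, 25.0)               # level 100, no trend, variance 25
profile, alarm = update(profile, 103.0)    # ordinary usage: no alarm
_, big_alarm = update(profile, 160.0)      # far outside the K-sigma band
```

Only the current triple per customer is stored, which is what makes the updating Markovian and cheap enough for millions of profiles.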
Comparisons
Objectives: robustness at Phase 1; sensitivity at Phase 2.

Four methods:
1. LSE
2. LAD
3. Dummy change-point model with LSE
4. Dummy change-point model with LAD
Case Study
Data Mining in Telecommunications Industry
(Source: AT&T, Mastering Data Mining by Berry & Linoff.)
Outline
Background
Dataflows
Business problems
Data
A voyage of discovery
Summary
Telecommunication Industry
~$160 billion per year industry (~$70B long distance & ~$90B local)
100 million+ customers/accounts/lines
>1 billion phone calls per day

Book closing (estimating this month's price/usage/revenue)
Budgeting (forecasting next year's price/usage/revenue)
Segmentation (clustering of usage, growth, ...)
Cross selling (association rules)
Churn (disconnect prediction & tracking)
Fraud (detection of unusual usage time-series behavior)
Each of these problems is worth hundreds of millions of dollars.
Information Sources
[Diagram: a Customer generates orders ("add a phone") into the Ordering System and calls ("make a call") into the Network, which feeds the Billing System. External sources include the FCC, Census, and Dun & Bradstreet. Data streams: competitive Win/Loss/New/No Further Use records; call details/web access; revenue, price; official competitive high-level reports. Latency ranges from real time (network) to daily, delayed monthly (billing), and delayed annually/quarterly (official and external reports).]
(Terabytes of interesting information)
Customer Focus
Telecommunication companies want to meet all the needs of their customers:
Local, long distance, and international voice telephone services
Wireless voice communications
Data communications
Gateways to the Internet
Data networks between corporations
Entertainment services, cable and satellite television

Instead of miles of cable and numbers of switches, customers are becoming the biggest asset of a telephone company.
Dataflows
Customer behavior is in the data.
Over a billion phone calls every day.
A dataflow is a way of visually representing transformations on data.
A dataflow graph consists of nodes and edges: data flows along edges and gets processed at nodes.

A basic dataflow to read a file, uncompress it, and write it out:
Compressed input file (in.z) → uncompress → Uncompressed output file (out.text)
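The same three-node dataflow can be sketched with Python's gzip module; the file names mirror the slide's example, and the input file is created here just for the demo.

```python
# read -> uncompress -> write, as a streaming pipeline rather than
# slurping the whole file into memory.
import gzip
import pathlib

# Create a small compressed input file for the demo.
pathlib.Path("in.z").write_bytes(gzip.compress(b"call detail records\n"))

# The dataflow: data flows from the compressed source, through the
# uncompress node, to the output sink, one chunk at a time.
with gzip.open("in.z", "rb") as src, open("out.text", "wb") as dst:
    for chunk in iter(lambda: src.read(8192), b""):
        dst.write(chunk)
```

Chunked streaming is the point: each node touches a bounded buffer, which is what lets such pipelines run in parallel over terabytes of call detail.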
Why are Dataflows Efficient?
Dataflows dispense with most of the overhead that traditional databases have, like transaction logging, indexes, pages, etc.
Dataflows can be run in parallel, taking advantage of multiple processors and disks.
Dataflows provide a much richer set of transformations than traditional SQL.
Basic Operations in Dataflows
Compress and uncompress
Reformat
Select
Sort
Aggregate and hash aggregate
Merge/Join

These are very important steps: the data is very, very large and needs supercomputer power.
Business Problems
The telecommunication business has shifted from an infrastructure business to a customer business.
Understanding customer behavior becomes critical (market segmentation).
Revenue forecasting, churn prediction, fraud detection, new-business customer identification.
The detailed transaction data contains a wealth of information, but it goes unexploited due to its huge volume.
Important Marketing Questions
Discussions with business users highlight the areas for analysis:
Understanding the behavior of individual customers
Regional differences in calling patterns
High-margin services
Supporting marketing and new sales initiatives
Data
Call detail data
Customer data
Auxiliary files
Call Detail Data
Definition: a call detail record is a single record for every call made over the telephone network.

Three sources of call detail data:
Direct network/switch recordings — switch records: the least clean, but the most informative.
Inputs into the billing system — billing records: cleaner, but not complete.
Data warehouse feeds — rather clean, but limited by the needs of the data warehouse.
Network Call Details
Hundreds of millions of calls a day
>100 bytes per call record (>10 gigabytes per day): originating number, terminating number, day/time of the call, length of the call, type of call, ...
2 years of data online? → statistical compression; >70 billion records (>7 terabytes); currently on tape, batch processing
Real time, low-level details (+++); raw data, massive data processing (−−−)
Key applications: book closing, fraud detection, early warning, ...
Billing Details
Millions of customers/accounts
Tons of other information about the customers/accounts: 100+ services (regular long distance, Digital 1 rate, easylink, Readyline, VTNS, ...), 5 jurisdictions (international, interstate, ...), 50 states, NPA-NXX
24-36 months of message, minute, and revenue data
Length of call, average revenue per minute
~? billion observations
$, detailed +++; dirty, delayed ---
Key applications: budgeting/forecasting, segmentation/clustering
Call Detail Data: Record Format
Important fields in a call detail record include:
from_number
to_number
duration_of_call
start_time
band
service_field
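A minimal sketch of such a record as a Python structure. The field names follow the slide, but the pipe-delimited layout and timestamp format are assumptions for illustration, not AT&T's actual format:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CallDetail:
    from_number: str
    to_number: str
    duration_of_call: int   # seconds
    start_time: datetime
    band: int               # charging band (local, regional, ...)
    service_field: str

def parse_record(line: str) -> CallDetail:
    """Parse one pipe-delimited call detail record (assumed layout)."""
    frm, to, dur, start, band, svc = line.split("|")
    return CallDetail(frm, to, int(dur),
                      datetime.strptime(start, "%Y-%m-%d %H:%M:%S"),
                      int(band), svc)
```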
Customer Data
Customers can have multiple telephone lines. Customer data is needed to match telephone numbers to information about customers.
Telecommunication companies have made significant investments in building and populating data models for their customers.
Customer Ordering Data
Hundreds of thousands of add/disconnect orders weekly (add a line, disconnect a line, ...)
Tons of other information about the customers/accounts: 4+ order types (Add, Win, Loss, No Further Use), 100+ services, related carrier
Requires minute/revenue estimation/prediction: summarizing the historical usage of a loss/NFU into one number; predicting the future usage of a win/new (growth curve)
5 years online, a few hundred million records
Timely, small volume +++; missing information, massive data integration ---
Major applications: customer churn, early warning, predicting disconnects
Auxiliary Files
ISP access numbers: a list of access numbers of Internet service providers
Fax numbers: a list of known fax machines
Wireless exchanges: a list of exchanges that correspond to mobile carriers
Exchange geography: a list of geographic areas represented by the phone number exchange
International: a list of country codes and the names of the corresponding countries
Discovery
Call duration
Calls by time of day
Calls by market segment
International calling patterns
When are customers at home
Internet service providers
Private networks
Concurrent calls
Broadband customers
Call Duration
Calls by Time of Day
In call detail data, the field band is a number representing how the call should be charged. This provides a breakdown:
local
regional
national
international
fixed-to-mobile
other
unknown
Question: when are different types of calls being made?
Calls by Market Segment
The market segment is a broad categorization of customers:
Residential
Small business
Medium business
Large business
Global
Named accounts
Government
Questions: Are customers within market segments similar to each other? What are the calling patterns between market segments?
Calls by Market Segment: Solution Approach
[Diagram: call detail records (from_number, to_number) are joined with customer data to map each number to its market segment, yielding from_market_segment and to_market_segment]
Results
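The join in the diagram can be expressed as a plain dictionary lookup over the customer data; the segment labels below are illustrative:

```python
from collections import Counter

def segment_flows(calls, customer_segment):
    """Join call detail records (from_number, to_number) with customer
    data to count calls between market segments. Numbers not found in
    the customer data are labeled 'unknown'."""
    flows = Counter()
    for frm, to in calls:
        from_seg = customer_segment.get(frm, "unknown")
        to_seg = customer_segment.get(to, "unknown")
        flows[(from_seg, to_seg)] += 1
    return flows
```

The resulting counts answer the slide's question directly: each cell of the flow matrix is the calling volume from one market segment to another.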
International Calling Patterns
International calls are highly profitable, but highly competitive.
Questions:
Where are calls going to?
How do calling patterns change over time?
How do calling patterns change during the day?
What are the differences between business and consumer usage?
Which customers primarily call one country?
Which customers call a wider variety of international numbers?
When are Customers at Home?
Internet Providers
Questions:
Which customers own modems?
Which Internet service providers (ISPs) are customers using?
Do different segments of customers use different ISPs?
Private NetworksSpecial customers:
Businesses that operate from multiple sites likely make large volumes of phone calls and data transfers between the sites.
Some businesses must exchange large volumes of data with other businesses.
A virtual private network (VPN) is a telephone product designed for this situation. For large volumes of phone calls, it provides less expensive service than pay-by-call service.
Question: Which customers are good candidates for VPN?
Result: a list of businesses that have multiple offices and make phone calls between them.
Concurrent CallsFor businesses having a limited number of outbound lines connected to a large number of extensions, the following questions are of interest:
When does a customer need an additional outside line?
When is the right time to offer upgrades to their phone systems?
One measure of a customer’s need for new lines is the maximum number of lines that are used concurrently.
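The peak number of simultaneous calls can be computed with a standard sweep-line pass over the call start and end times, for example:

```python
def max_concurrent(calls):
    """calls: list of (start, end) times for one customer's calls.
    Returns the peak number of simultaneous calls."""
    events = []
    for start, end in calls:
        events.append((start, 1))    # call begins: one more line in use
        events.append((end, -1))     # call ends: one line freed
    # At equal times, ends (-1) sort before starts (+1), so a call ending
    # exactly when another begins does not count as overlapping it.
    events.sort(key=lambda e: (e[0], e[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak
```

Running this per customer over a billing period gives the measure described above: if the peak regularly approaches the number of installed lines, the customer is a candidate for an upgrade offer.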
Identify Broad Band Customers
Objective: identify customers who use their telephone lines for data/computer access (potential broadband customers)
Collect a sample of 4000 lines for which voice or data/computer access information is available
Divide into two halves for training and testing
Define hundreds of call behavior variables
Run neural network, logistic regression, and tree models
Key predictive drivers:
length of call (10+ min.)
number of repeat phone calls to the same number (5+)
calls by time of day (at night)
calls by day of the week (weekend)
The neural network performed the best; the tree is the most intuitive.
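The predictive drivers named above (long calls, repeat calls, night and weekend calling) might be derived per line roughly as follows; the tuple layout is a hypothetical simplification of a call record:

```python
from collections import Counter

def behavior_features(calls):
    """Derive call behavior variables for one line.
    calls: list of (to_number, start_hour, weekday, minutes),
    where weekday is 0=Monday ... 6=Sunday."""
    n = len(calls)
    repeats = Counter(to for to, _, _, _ in calls)
    return {
        # fraction of calls lasting 10 minutes or more
        "frac_long": sum(m >= 10 for _, _, _, m in calls) / n,
        # most calls placed to any single number
        "max_repeat": max(repeats.values()),
        # fraction of calls at night (8pm-6am)
        "frac_night": sum(h >= 20 or h < 6 for _, h, _, _ in calls) / n,
        # fraction of calls on the weekend
        "frac_weekend": sum(d >= 5 for _, _, d, _ in calls) / n,
    }
```

Feature dictionaries like this, one per line, would then feed the neural network, logistic regression, and tree models mentioned above.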
Summary
Call detail records contain rich information about customers:
Customer behavior varies from one region of a country to another.
Thousands of companies place calls to ISPs. They own modems and have the ability to respond to web-based marketing.
Residential customers indicate when they are home by using the phone. These patterns can be important, both for customer contact and for customer segmentation.
The market share of ISPs differs by market segment.
International calls show regional variations. The length of calls varies considerably depending on the destination.
International calls made during the evening and early morning are longer than international calls made during the day.
Companies making calls between their different sites are candidates for private networking.
Case Study: Churn Modeling in Wireless Communications
This case study took place at the largest mobile telephone company in a newly developed country. The primary data source is the prototype of an ongoing data warehousing effort. (Source: “Mastering Data Mining” by Berry & Linoff)
Outline
The Wireless Telephone Industry
Three Goals
Approach to Building the Churn Model
Churn Model Building
The Data
Lessons about Churn Model Building
Summary
The Wireless Telephone Industry
The rapid maturing of the wireless market makes the number of churners, and the effect of churn on the customer base, grow significantly. The business shifts away from signing on nonusers and focuses on existing customers. (See Figure 11.2 and Figure 11.3.)
The wireless telephone industry differs from other industries in several ways:
Sole service providers
Relatively high cost of acquisition
No direct customer contact
Little customer mindshare
The handset
Three Goals
Near-term goal: identify a list of probable churners for a marketing intervention. Discussions with the marketing group defined the near-term goal: by the 24th of the month, provide the marketing department with a list of the 10,000 club members most likely to churn.
Medium-term goal: build a churn management application (CMA). Besides running churn models, the CMA also needed to:
Manage models
Provide an environment for data analysis before and after modeling
Import data and transform it into the input for churn models
Export the churn scores developed by the models
Long-term goal: complete customer relationship management
Approach to Building the Churn Model
Define churn: involuntary churn refers to cancellation of a customer's service due to nonpayment; voluntary churn is everything that is not involuntary churn. The model is for the latter.
Inventory available data: a basic set of data includes data from the customer information file, data from the service account file, and data from the billing system.
Build models
Deploy scores: churn scores can be used for marketing intervention campaigns, for prioritizing customers for different campaigns, and for estimating customer longevity when computing estimated lifetime customer value.
Measure the scores against what really happens:
How close are the estimated churn probabilities to the actual churn probabilities?
Are the churn scores "relatively" true, i.e., do higher scores imply higher probabilities?
Churn Model Building
A churn modeling effort necessitates a number of decisions:
The choice of data mining tool: SAS Enterprise Miner Version 2 was used for this project.
Segmenting the model set: three models were built for three segments of customers: club members, non-club members, and recent customers who had joined in the previous eight or nine months.
The final four models on four different segments: to investigate whether customers joining at about the same time have similar reasons for churn, the club model set was split into two segments: customers who joined in the previous two years, and the rest.
Churn Model Building (continued)
Choice of modeling algorithm: decision tree models were used for churn modeling because of their ability to handle hundreds of fields in the data, their explanatory power, and their ease of automation.
This project built six trees for each model set (using Gini and entropy as split functions, and allowing 2-, 3-, and 4-way splits) in order to see which performed best and to have them verify each other.
Three parameters need to be set: the minimum size of a leaf node, the minimum size of a node to split, and the maximum depth of the tree. The resulting tree needs to be pruned.
The size and churner density of the model set: experiments with different model sets show that a model set with 30% churners and 50k records works best. (Table 11.3)
The effect of latency (Figure 11.12)
Translating models in time (Figure 11.13)
The Data
Historical churn rates: the historical churn rate was calculated along different dimensions: handset, demographic, dealer, and ZIP code.
Data at the customer and account level: SSN, ZIP code of residence, market ID, age and gender, pager indication flag, etc.
Data at the service level: activation date and reason, features ordered, billing plan, handset, and dealer, etc.
Billing history data: total amount billed, late charges and amount overdue, all calls, fee-paid services, etc.
Rejecting some variables: variables that cheat, identifiers, categoricals with too many values, absolute dates, and untrustworthy values, etc.
Derived variables
Lessons about Churn Model Building
Finding the most significant variables: handset churn rate, other churn rates, number of phones in use by a customer, low usage
Listening to the business users to define the goals
Listening to the data
Including historical churn rates: the past is the best predictor of the future. For churn, the past is historical churn rates: churn rate by handset, by demographics, by area, and by usage patterns. (Figure 11.17)
Composing the model set: important factors are historical data availability, size, and churner density. (Figure 11.18)
Building a model for the churn management application
Listening to the data to determine model parameters
Understanding the algorithm and the tool
Summary
Four critical success factors for building a churn model:
Defining churn, especially differentiating between interesting churn (such as customers who leave for a competitor) and uninteresting churn (customers whose service has been cut off due to nonpayment).
Understanding how the churn results will be used.
Identifying data requirements for the churn model, being sure to include historical predictors of churn, such as churn rate by handset and churn rate by demographics.
Designing the model set so the resulting models can slide through different time windows and are not obsolete as soon as they are built.
Case Study
Market Basket Analysis: Who Buys Meat at the Health Food Store?
(Source: Mastering Data Mining by Berry & Linoff.)
Purpose
Who buys meat at the health food store?
Understand customer behavior.
DM Tools
Association Rules of Market Basket Analysis.
Customer clustering.
Decision tree.
Customer Analysis: market basket analysis uses the information about what a customer purchases to give us insight into who they are and why they make certain purchases.
Product Analysis: market basket analysis gives us insight into the merchandise by telling us which products tend to be purchased together and which are most amenable to promotion.
Market Basket Analysis
Source: E. Wegman
Given
A database of transactions.
Each transaction contains a set of items.
Find all rules X →Y that correlate the presence of one set of items X with another set of items Y.
Example: When a customer buys bread and butter, they buy milk 85% of the time.
While association rules are easy to understand, they are not always useful:
Useful: on Fridays, convenience store customers often purchase diapers and beer together.
Trivial: customers who purchase maintenance agreements are very likely to purchase large appliances.
Inexplicable: when a new superstore opens, one of the most commonly sold items is light bulbs.
Measures for Market Basket Analysis
Confidence: Probability that right-hand product is present given that the left-hand product is in the basket.
Support: Percentage of baskets that contain both the left-hand side and the right-hand side of the association.
Lift (correlation): Compare the likelihood of finding the right-hand product in a basket known to contain the left-hand product to the likelihood of finding the right-hand product in any random basket.
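These three measures are straightforward to compute from a list of baskets; a minimal sketch:

```python
def rule_measures(baskets, lhs, rhs):
    """Support, confidence, and lift for the rule lhs -> rhs.
    baskets: list of sets of items; lhs, rhs: sets of items."""
    n = len(baskets)
    n_lhs = sum(lhs <= b for b in baskets)              # baskets containing lhs
    n_rhs = sum(rhs <= b for b in baskets)              # baskets containing rhs
    n_both = sum(lhs <= b and rhs <= b for b in baskets)
    support = n_both / n                 # P(lhs and rhs)
    confidence = n_both / n_lhs          # P(rhs | lhs)
    lift = confidence / (n_rhs / n)      # P(rhs | lhs) / P(rhs)
    return support, confidence, lift
```

A lift above 1 means the left-hand items make the right-hand items more likely than chance, which is the "correlation" reading described above.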
Example: "Caviar implies Vodka"
High confidence: given that we know someone bought caviar, the probability that the person buys vodka is very high.
Low support: the percentage of baskets that contain both vodka and caviar is very low, since those products are not widely purchased.
High lift:

lift = Pr(finding Vodka | Caviar is already in the basket) / Pr(finding Vodka in any random basket)
Association Results

Rank | Relation | Lift | Support (%) | Confidence (%) | Rule
1    | 4        | 2.47 | 3.23        | 33.72          | Red pepper -> Yellow pepper & Bananas & Bakery
2    | 3        | 2.24 | 4.75        | 49.21          | Red pepper -> Yellow pepper & Bananas
...  | ...      | ...  | ...         | ...            | ...
50   | 2        | 1.37 | 3.77        | 85.96          | Green peppers -> Bananas
...  | ...      | ...  | ...         | ...            | ...

For the rule "Green peppers -> Bananas": support is low, confidence is high, and lift is high:

lift = Pr(finding Banana | Green pepper is already in the basket) / Pr(finding Banana in any random basket)
Clustering
Variables: gender, meat buying, total spending
• The height of the pies: total spending
• Shaded pie slice: the percentage of people in the cluster who buy meat
• Top row: women; bottom row: men
Customer Clusters
Decision Tree
The most meat-buying branches:
Spend the most money
Buy the largest number of items
Although only about 5% of shoppers buy meat, they are among the most valuable shoppers!
Decision Tree for More about Meat
Conclusion
Data mining can be used to improve shelf placement decisions.
Data Mining can be used to identify a small, but very profitable group of customers.
Case Study
Supermarket Mining: Analyzing Ethnic Purchasing Patterns
(Source: Mastering Data Mining by Berry & Linoff.)
Overview
Describes how a manufacturer learned about ethnic purchasing patterns.
Aimed at Spanish-speaking shoppers in Texas.
Collected data from a supermarket chain in Texas.
Employed data mining tools from MineSet (SGI).
Purpose
Discover whether the data provided revealed any differences between the stores with a high percentage of Spanish-speaking customers and those with fewer.
Compute a Hispanic percentage for each specific item.
Identify which products sell well among Hispanic consumers.
Scatter plot showing variability of Hispanic appeal by category
Data
The data consist of weekly sales figures for products from five basic categories (ready-to-eat cereals, desserts, snacks, main meals, and pancake and variety baking mixes).
Within each category, subcategories were assigned (actual units sold, dollar volume, and equivalent case sales).
For each store: store size, % of Hispanic shoppers, and % of African-American shoppers.
Decode variables that carried more than one piece of information:
HISPLVL and AALEVEL: % of Hispanic and African-American shoppers.
HISPLVL ranges 1-15, where 1 = a store outside San Antonio with 90% or more Hispanic shoppers and 10 = a store with little or no Hispanic shoppers.
Normalize values by sales volume to compare stores of different sizes.
Hispanic score = average values for the most Hispanic stores minus average values for the least Hispanic stores; a large positive value indicates a product that sells much better in the heavily Hispanic stores.
Transformation of Data
The most valuable part of the project was preparing the data and getting familiar with it, rather than running fancy data mining algorithms.
DM Tools
Association rule visualization for Hispanic percentage.
Scatter plot showing which products sell well in Hispanic neighborhoods.
Scatter plot showing variability of Hispanic appeal by category.
Case Study
Supermarket Mining: Transactions & Customer Analysis
(Source: Mastering Data Mining by Berry & Linoff.)
Overview
A collaboration between a manufacturer and one of its retailer chains.
In the grocery market, tasks that usually belong to the retailer are actually performed by a supplier.
Purpose
Effectively use sales data to make the category as a whole more profitable for the retailer.
Identify customer behavior.
Find clusters of customers.
Transaction Detail Fields

FIELD         | DESCRIPTION
Date          | YYYY-MM-DD
Store         | CCCSSSS, where CCC = chain, SSSS = store
Lane          | Lane of transaction
Time          | The time-stamp of the order start time
Customer ID   | The loyalty card number presented by the customer; an ID of 0 means the customer did not present a card
Tender Type   | Payment type, i.e. 1 = cash, 2 = check, ...
UPC           | The universal product code for the item purchased
Quantity      | The total quantity of this item
Dollar Amount | The total $ amount for the quantity of a particular UPC purchased
Universal Product Code
The numbers, encoded as machine-readable bar codes, that identify nearly every product that might be sold in a grocery store.
Organizations:
Uniform Code Council (www.uc-council.org): US and Canada
European Article Numbering Association (www.ean.be): Europe and the rest of the world
In North America the code consists of 12 digits; the code itself fits in 11 digits, and the twelfth is a checksum.
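The UPC-A checksum works by tripling the digits in the odd positions (1st, 3rd, ..., 11th), adding the digits in the even positions, and choosing the check digit that brings the total to a multiple of 10. A small sketch:

```python
def upc_check_digit(first11: str) -> int:
    """Compute the 12th (check) digit of a UPC-A code from the first
    11 digits: 3 * (sum of odd-position digits) + (sum of even-position
    digits) + check digit must be a multiple of 10."""
    digits = [int(c) for c in first11]
    total = 3 * sum(digits[0::2]) + sum(digits[1::2])
    return (10 - total % 10) % 10
```

Verifying this digit on scan is how the register catches a misread bar code before it reaches the transaction file.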
From the transaction detail fields we can calculate:
The percentage of each shopper's total spending that went to that category.
The total number of trips.
The total dollar amount spent for the year, along with the total number of items purchased and the total number of distinct items purchased.
The percentage of the items purchased that carried high, medium, and low profit margins for the store.
Finding Clusters of Customers
Find groups of customers with similar behavior using k-means clustering:
Set a certain number k.
Select k records as candidate cluster centers.
Assign each record to the cluster whose center it is nearest.
Recalculate the centers of the clusters and reassign the records based on their proximity to the new cluster centers.
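The k-means steps above can be sketched in a few lines of plain Python (2-D points, squared Euclidean distance, a fixed number of passes):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: pick k candidate centers, assign each record to
    its nearest center, then recalculate centers and reassign."""
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)          # candidate cluster centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assign each record to the cluster whose center it is nearest
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: (p[0] - centers[i][0]) ** 2 +
                                            (p[1] - centers[i][1]) ** 2)
            clusters[i].append(p)
        # recalculate the centers (empty clusters keep their old center)
        for i, c in enumerate(clusters):
            if c:
                centers[i] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers, clusters
```

In the case study each "point" would be a vector of the per-customer summaries computed from the transaction detail fields, not a 2-D coordinate; the 2-D version just keeps the sketch short.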
Main Ways to Use Clusters
To gain insight into customer behavior by understanding what differentiates one cluster from another.
To build further models within each cluster.
To use cluster membership as an additional input variable to other models.
Case Study
Who Gets What? Building a Best Next Offer Model for an Online Bank
(Source: Mastering Data Mining by Berry & Linoff.)
The use of data mining by the online division of a major bank to improve its ability to perform cross selling.
Cross-selling: the activity of selling additional services to the customers you already have.
Outline
Background on the Banking Industry
The Business Problem
The Data
Approach to the Problem
Model Building
Lessons Learned
Background on the Banking Industry
The challenge for today's large banks is to shift their focus from market share to wallet share. That is, instead of merely increasing the number of customers, banks need to increase the profitability of the ones they already have.
Why use data mining?
A bank knows much more about its current customers than about external prospects.
The information gathered on customers in the course of normal business operations is much more reliable than data purchased on external prospects.
The Business Problem
The project had immediate, short-term, and long-term goals:
Long-term: increase the bank's share of each customer's financial business by cross-selling appropriate products.
Short-term: support a direct e-mail campaign for four selected products (brokerage accounts, money market accounts, home equity loans, and a particular type of savings account).
Immediate: take advantage of a data mining platform on loan from SGI to demonstrate the usefulness of data mining for the marketing of online banking services.
The Data
The initial data comprised 1,122,692 account records extracted from the Customer Information System (CIS). Before data mining started, a SAS data set was created containing an enriched version of the extracted data.
From accounts to customers
Defining the products to be offered.
From accounts to customersThe data extracted from the CIS had one row per account, which reflects the usual product-centric organization of a bank where managers are responsible for the profitability of particular products rather than the profitability of customers or households.
The best next offer project required pivoting the data to build customer-centric models. The account-level records from the CIS were transformed into around a quarter million household-level records.
Defining the products to be offered: 45 product types are used for the best next offer model. Of these, 25 products are ones that may be offered to a customer; information on the remaining products is used only as input variables when building the models.
Approach to the Problem
The approach to the problem:
A propensity-to-buy model is built for each product individually, giving each customer a score for the modeled product. The scores for the four products are combined to yield the best next offer model: each customer is offered the product for which they have the highest score.
Comparable scores
How to score?
Pitfalls of this approach
Comparable scores: three requirements are needed to make scores from various product propensity models comparable:
All scores must fall into the same range: zero to one.
Anyone who already has a product should score zero for it.
The relative popularity of products should be reflected in the scores.
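One simple way to satisfy the three requirements, shown only as an illustration (the case study does not prescribe this exact formula): scale the raw score into [0, 1], zero it out for existing holders, and weight it by the product's penetration:

```python
def comparable_score(raw_score, owns_product, product_penetration,
                     max_raw=1.0):
    """Sketch of a comparable propensity score.
    raw_score: model output in [0, max_raw];
    owns_product: True if the customer already has the product;
    product_penetration: fraction of customers holding the product,
    used as a proxy for its relative popularity."""
    if owns_product:
        return 0.0                     # existing holders score zero
    return (raw_score / max_raw) * product_penetration
```

The penetration factor keeps a rarely held product from dominating the best next offer simply because its model produces optimistic raw scores.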
How to score? With a product propensity model, prospects are given a score based on the extent to which they look like the existing account holders for that product. This project used a decision-tree-based approach, which uses the percentage of existing customers at a leaf to assign a score for the product.
This approach can be summed up by the words of Richard C. Cushing: “When I see a bird that walks like a duck and swims like a duck and quacks like a duck, I call that bird a duck.”
Pitfalls of this approach:
Becoming a customer may change people's behavior. The best approach is to build models based on the way current customers looked just before they became customers, but the data for this approach is not easy to get.
Current customers reflect past policy. This can perpetuate "past discrimination".
Model Building
Build an individual propensity model for each product:
Finding important variables
Building a decision tree model
Model performance in a controlled test
Get to a cross-sell model by combining the individual propensity models
Start with brokerage accounts
Finding important variables
Using the column importance tool: find a set of variables which, taken together, do a good job of differentiating the classes (people with brokerage accounts and people without):
Whether they are a private banking customer
The length of time they have been with the bank
The value of certain lifestyle codes assigned to them by Microvision (a marketing statistics company)
Using the evidence classifier: this tool uses the naïve Bayes algorithm to build a predictive model. Naïve Bayes models treat each variable independently and measure its contribution to a prediction; these independent contributions are then combined to make a classification.
Building a decision tree model for brokerage
MineSet's decision tree tool:
Leaves in the tree are either mostly nonbrokerage or mostly brokerage.
Each path through the tree to a leaf containing mostly brokerage customers can be thought of as a "rule" for classifying an unseen customer; customers meeting the conditions of the "rule" are likely to have, or be interested in, a brokerage account.
In our data, only 1.2 percent of customers had brokerage accounts. To improve the model, oversampling was used to increase the percentage of brokerage customers in the model set. The final tree was built on a model set containing about one quarter brokerage accounts.
Record weights in place of oversampling
Allowing one-off splits
Grouping categories
Influencing the pruning decisions
Backfitting the model for comparable scores
Record weights in place of oversampling:
Record weighting can achieve the effect of oversampling by increasing the relative importance of the rare records.
The splitting decision is based on the total weight of records in each class rather than the total number of records.
Instead of increasing the weight of records in the rare class, the proper approach is to lower the weight of records in the common class.
Bringing the weight of the rare records up to 20-25% of the total works well.
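The common-class weight that brings the rare class up to a given fraction of the total weight can be solved for directly; a sketch (keeping the rare-class weight at 1, per the advice above to lower the common class rather than raise the rare one):

```python
def class_weights(n_rare, n_common, target_rare_fraction=0.25):
    """Return (rare_weight, common_weight) such that the rare class
    carries target_rare_fraction of the total weight.
    Solve n_rare = t * (n_rare + w * n_common) for w."""
    t = target_rare_fraction
    common_weight = n_rare * (1 - t) / (t * n_common)
    return 1.0, common_weight
```

With 1.2% brokerage holders (1,200 rare vs 98,800 common records per 100,000), this gives each common record a weight of about 0.036, so the rare class carries exactly a quarter of the total weight.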
Allowing one-off splits:
By default, MineSet's tree-building algorithm splits a categorical variable on every single value, or does not split on it at all.
A one-off split is a split based on a single value of a categorical variable; users can control whether one-off splits are considered through a parameter.
Grouping categories:
By design, MineSet's tree-building algorithm is unlikely to make good splits on a categorical variable taking on hundreds of values.
Some variables rejected by MineSet seemed to be very predictive in some cases; although there were hundreds of values in the data, only a few values of those variables appeared frequently.
The approach is to lump all values below a certain threshold into a catch-all "other" category, and make splits on the more populous ones.
Influencing the pruning decisions: users can control the size, depth, and bushiness of the tree. Good settings: minimum number of records in a node of 50, a pruning factor of 0.1, and no explicit limit on the depth.
Backfitting the model for comparable scores: backfitting runs the original data through the tree. The score for each leaf is based on the percentage of brokerage customers at that leaf; the more brokerage customers at a leaf, the higher the scores the non-brokerage customers at that leaf will get, and the more likely they are to open a brokerage account.
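Backfitting can be sketched as a per-leaf tally over the original (un-oversampled) data: for each leaf, the score is the fraction of records at that leaf that hold the product:

```python
from collections import defaultdict

def backfit_scores(leaf_ids, has_product):
    """leaf_ids[i] is the leaf reached by record i when the original
    data is run through the tree; has_product[i] is 1 if that record
    holds the product, else 0. Returns {leaf: fraction of holders}."""
    counts = defaultdict(int)
    holders = defaultdict(int)
    for leaf, owns in zip(leaf_ids, has_product):
        counts[leaf] += 1
        holders[leaf] += owns
    return {leaf: holders[leaf] / counts[leaf] for leaf in counts}
```

Because the tally uses the original data rather than the oversampled model set, the leaf scores approximate real-population propensities, which is what makes them usable as comparable scores.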
Brokerage model performance in a controlled test
A "high" score is any score higher than the density of brokerage customers in the population; it need not be a large number.
Group    | Size   | Chosen by  | E-mailed | Response rate (%)
Model    | 10,000 | High score | Yes      | 0.7
Control  | 10,000 | Random     | Yes      | 0.3
Hold-out | 10,000 | Random     | No       | 0.05
Getting to a cross-sell model
The propensity models for the rest of the products are built following the same procedure, and the individual propensity models are combined into a cross-sell model to find the best next offer.
[Diagram: one customer's propensity scores for four products, A-D (0.72, 0.47, 0.31, 0.10 among them); the vote goes to the product with the highest score, here B.]
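Combining the comparable scores into a best next offer is then a simple argmax over products; the illustrative scores below mirror the example figure:

```python
def best_next_offer(scores):
    """scores: mapping of product -> comparable propensity score.
    Returns the product with the highest score, i.e. the best next
    offer for this customer."""
    return max(scores, key=scores.get)
```

In deployment this runs once per customer over the products they do not already hold (existing holdings score zero, so they can never win the vote).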
Summary of the Procedure
Determine whether cross-selling makes sense.
Determine whether sufficient data exists to build a good cross-sell model.
Build propensity models for each product individually.
Combine individual propensity models to construct a cross-sell model.
Lessons Learned
Before building customer-centric models, data needs to be transformed from product-centric to customer-centric.
Having a particular product may change a customer's behavior. The best way to handle this is to build models based on behavior before the product was bought.
The current composition of the customer population is largely a reflection of past marketing policy.
Oversampling and record weighting can be used to handle rare events.
References
Berry & Linoff, Mastering Data Mining, Wiley, 2000.
Han & Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
Hastie, Tibshirani, & Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.
Taguchi & Jugulum, The Mahalanobis-Taguchi Strategy, Wiley, 2002.