What Exactly Is Data Mining?
Judy Pastor
Senior Manager, Operations Research
Continental Airlines
AGIFORS YM and Res 2000, NYC
If You Haven’t Done It By Now,
• You are probably thinking about it ….
Building A Data Warehouse
• The airline business is becoming more and more competitive
– INFORMATION IS POWER
• efficiency in
– transaction processing
– storage
– retrieval
• e-commerce
Many Advantages of a DW
• All relevant airline data can be accessed through one portal
• Legacy systems kept data at an aggregated level
• “The devil is in the details” (Ross Perot)
– DW gives opportunity to keep the lowest level of information
BUT …
• Building a DW is a huge expense
• Can tie up key IT people for years
• Is not worth it unless the data is actually used
• Sheer volume of data can cause “analysis paralysis”
Data Mining Is the Answer
• Great Buzz Word … But what is it?
– Emerging discipline
• defined as the efficient discovery of previously unknown patterns in large data bases
• combines CS, OR, and Statistics
• Data Mining can be confused with OLAP
– On Line Analytical Processing
– “Data Cubes”
– “Roll-Up and Drill Down”
Data Mining versus OLAP
• OLAP is user-driven
– user generates a hypothesis
– uses OLAP to verify it
– user guides exploration of data
• Data Mining
– DM tool is used to generate a hypothesis
– tool guides the exploration of data
– user must verify hypothesis
Data Mining versus OLAP
• Data Aggregation is necessary for human examination
– but it can hide the most “interesting” details of the business process
• DM provides tools that can perform exhaustive searches at disaggregated levels
• Uses mathematical algorithms in the same way an LP searches for the optimal among many solutions in a solution space
Data Mining
• Most DM algorithms have origins in classical statistics
– Regression (Linear and Logistic)
– Clustering
– Time Series
– Sequential Pattern Matching
– Association Rules
– Classification Trees
– Outlier Detection
Typical DM Applications
• Fraud Detection
– outlier detection
• Customer Segmentation
– clustering
• Market Basket Analysis (beer and diapers)
– association rules
• Direct Mail/Marketing
– classification
Airline DM Applications
• Actual case studies done at Continental
– No Show Analysis
– Direct Mail Campaign for One Pass
• Both used Classification Tree Method
– Software: CART (Classification and Regression Trees) by Salford Systems
No Show Analysis
• Employed historical data to classify PNRs into “Show” or “No Show”
– IAHDFW, Monday, 7:45am departure, 7/5/99-1/24/00, 2902 PNRs

DeptDate  Flight#  Rec_Locator  Create_date  CLS  Num_party  outbound/return  Lcl/cnx  Show/NS  OneWay  TRP_DAYS  ADV_BKD  pathOND
19990705  1014     I7Q8FH       19990403     Q    2          O                C        S        Y       0         93       LAXDFW
19990705  1014     J45XMP       19990603     T    1          O                L        S        Y       0         32       IAHDFW
19990705  1014     JLCLPD       19990703     Y    1          O                C        N        Y       0         2        DFWMSY
19990705  1014     JVWCR1       19990528     K    2          I                L        S        N       1         38       IAHDFW
19990705  1014     JYXTCD       19990624     Q    1          O                L        S        N       1         11       IAHDFW
19990705  1014     K32LJP       19990614     H    2          I                C        S        N       10        21       GUMDFW
19990705  1014     LCRXY1       19990701     Q    1          O                L        S        N       3         4        IAHDFW
19990712  1014     I69LBX       19990707     Q    1          O                L        S        N       3         5        IAHDFW
19990712  1014     IF0P8T       19990707     Q    1          O                L        S        Y       0         5        DFWIAH
19990712  1014     IFF9SP       19990707     Q    1          O                L        S        Y       0         5        DFWIAH
19990712  1014     IFRQ01       19990707     Q    1          O                C        N        N       29        5        SEADFW
19990712  1014     ILCNMT       19990617     T    1          O                L        S        N       1         25       IAHDFW
19990712  1014     IP0FCX       19990708     Q    1          O                L        S        Y       0         4        DFWIAH
19990712  1014     IQ8BPD       19990617     Q    1          O                C        S        N       1         25       VCTDFW
19990712  1014     IRZ0W1       19990708     Q    1          O                L        S        Y       0         4        DFWIAH
….
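The sample rows suggest that ADV_BKD is the number of days between Create_date and DeptDate. The slides do not state this, so treat it as an assumption. A minimal Python sketch that checks it against three of the rows shown (the function name `days_advance_booked` is hypothetical):

```python
from datetime import datetime

def days_advance_booked(dept_date: str, create_date: str) -> int:
    """Days between booking creation and departure, both given as YYYYMMDD."""
    fmt = "%Y%m%d"
    delta = datetime.strptime(dept_date, fmt) - datetime.strptime(create_date, fmt)
    return delta.days

# (DeptDate, Create_date, ADV_BKD) copied from three of the sample PNR rows
sample_rows = [
    ("19990705", "19990403", 93),
    ("19990705", "19990603", 32),
    ("19990712", "19990617", 25),
]

computed = [days_advance_booked(d, c) for d, c, _ in sample_rows]
# True when the assumed definition reproduces the ADV_BKD column
matches = all(days_advance_booked(d, c) == adv for d, c, adv in sample_rows)
```

Deriving such fields from the raw dates, rather than storing them by hand, keeps the predictor consistent across the whole PNR history.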
No Show Study: CART Model
Data input: IAHDFW, Mon., 7:45am departure, 7/5/99-1/24/00, 2902 PNRs

Node 1: LOCAL = (1); Class = 0; N = 2901; class 0: 2451 (84.5%); class 1: 450 (15.5%)
  Node 2: B_G = (1); Class = 0; N = 2370; class 0: 2032 (85.7%); class 1: 338 (14.3%)
    Node 3: ADV_BKD <= 14.500; Class = 0; N = 2131; class 0: 1848 (86.7%); class 1: 283 (13.3%)
      Terminal Node 1: Class = 0; N = 1766; class 0: 1549 (87.7%); class 1: 217 (12.3%)
      Node 4: ADV_BKD <= 86.000; Class = 1; N = 365; class 0: 299 (81.9%); class 1: 66 (18.1%)
        Terminal Node 2: Class = 1; N = 345; class 0: 279 (80.9%); class 1: 66 (19.1%)
        Terminal Node 3: Class = 0; N = 20; class 0: 20 (100.0%); class 1: 0 (0.0%)
    Node 5: INBOUND = (0); Class = 1; N = 239; class 0: 184 (77.0%); class 1: 55 (23.0%)
      Node 6: ADV_BKD <= 6.500; Class = 1; N = 195; class 0: 158 (81.0%); class 1: 37 (19.0%)
        Node 7: TRP_DAYS <= 0.500; Class = 1; N = 150; class 0: 116 (77.3%); class 1: 34 (22.7%)
          Terminal Node 4: Class = 1; N = 85; class 0: 62 (72.9%); class 1: 23 (27.1%)
          Node 8: TRP_DAYS <= 2.500; Class = 1; N = 65; class 0: 54 (83.1%); class 1: 11 (16.9%)
            Terminal Node 5: Class = 0; N = 32; class 0: 31 (96.9%); class 1: 1 (3.1%)
            Terminal Node 6: Class = 1; N = 33; class 0: 23 (69.7%); class 1: 10 (30.3%)
        Terminal Node 7: Class = 0; N = 45; class 0: 42 (93.3%); class 1: 3 (6.7%)
      Terminal Node 8: Class = 1; N = 44; class 0: 26 (59.1%); class 1: 18 (40.9%)
  Node 9: ADV_BKD <= 19.500; Class = 1; N = 531; class 0: 419 (78.9%); class 1: 112 (21.1%)
    Terminal Node 9: Class = 1; N = 380; class 0: 281 (73.9%); class 1: 99 (26.1%)
    Terminal Node 10: Class = 0; N = 151; class 0: 138 (91.4%); class 1: 13 (8.6%)

Branch labels on the diagram:
local / connecting
business / leisure
Bkd <2 wks / Bkd >2 wks
Bkd >3 months / Bkd <3 months
Bkd <2 wks / Bkd >2 wks
out / return
Bkd <1 wk / Bkd >1 wk
Trip in 1 day
Trip >3 days / Trip = 2 or 3 days

Note:
Class=1 <==> No show
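To show how a tree like this is applied to a single booking, here is a minimal Python sketch that routes one PNR through the splits above and looks up the no-show share observed in the terminal node it reaches. It makes several assumptions the slides do not confirm: the left child is taken when the printed condition is true (the usual CART convention), LOCAL, B_G and INBOUND are 0/1 indicators, and the record `example` is a hypothetical booking, not one from the study.

```python
# No-show share (class 1 %) of each terminal node, copied from the tree above
NO_SHOW_PCT = {1: 12.3, 2: 19.1, 3: 0.0, 4: 27.1, 5: 3.1,
               6: 30.3, 7: 6.7, 8: 40.9, 9: 26.1, 10: 8.6}

def terminal_node(pnr: dict) -> int:
    """Route one PNR to a terminal node number.

    Assumption: the left child is taken when the printed condition is true.
    """
    if pnr["LOCAL"] == 1:                                  # Node 1
        if pnr["B_G"] == 1:                                # Node 2
            if pnr["ADV_BKD"] <= 14.5:                     # Node 3
                return 1
            return 2 if pnr["ADV_BKD"] <= 86.0 else 3      # Node 4
        if pnr["INBOUND"] == 0:                            # Node 5
            if pnr["ADV_BKD"] <= 6.5:                      # Node 6
                if pnr["TRP_DAYS"] <= 0.5:                 # Node 7
                    return 4
                return 5 if pnr["TRP_DAYS"] <= 2.5 else 6  # Node 8
            return 7
        return 8
    return 9 if pnr["ADV_BKD"] <= 19.5 else 10             # Node 9

# Hypothetical booking, for illustration only
example = {"LOCAL": 1, "B_G": 0, "INBOUND": 0, "ADV_BKD": 4, "TRP_DAYS": 0}
node = terminal_node(example)
no_show_pct = NO_SHOW_PCT[node]
```

Under these assumptions the example lands in Terminal Node 4, whose historical no-show share is 27.1%.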
One Pass Direct Marketing: CART Model
Note:
Class=1 : Respond
Class=0: Not Respond
Predictor Variables:
P_CLUB: President’s Club status (Not, Former or Current; a categorical variable with values 0/1/2)
HY_RCNCY: Recency of flying a high yield segment; equals 1 (in Feb 99), 2 (in Jan 99), 3 (in Dec 98), 4 (in Nov 98) and 10 (none in the 4 most recent months)
HY-Num: Number of high yield segments flown in the 4 months before March 99
ELT_BAL9: OnePass mileage accumulated in 1998
M98: Segments flown in 1998
HUB1: Lives in a hub city or a non-hub city, a binary 1/0 variable
MASTR_CD: Master card status (no card vs. either or both CO and EA card, a binary 0/1 variable)
MO_ONFIL: Months enrolled in OnePass Program
MO_LSFLT: Months since last flight
MO_LSACT: Months since last account activity
Node 1: P_CLUB = (0); Class = 0; N = 13930; class 0: 13041 (93.6%); class 1: 889 (6.4%)
  Node 2: HY_RCNCY <= 3.500; Class = 0; N = 13040; class 0: 12335 (94.6%); class 1: 705 (5.4%)
    Node 3: ELT_BAL9 <= 13155.500; Class = 1; N = 6295; class 0: 5818 (92.4%); class 1: 477 (7.6%)
      Node 4: MO_LSTAC <= 1.500; Class = 0; N = 2534; class 0: 2392 (94.4%); class 1: 142 (5.6%)
        Terminal Node 1: Class = 1; N = 1577; class 0: 1466 (93.0%); class 1: 111 (7.0%)
        Terminal Node 2: Class = 0; N = 957; class 0: 926 (96.8%); class 1: 31 (3.2%)
      Terminal Node 3: Class = 1; N = 3761; class 0: 3426 (91.1%); class 1: 335 (8.9%)
    Node 5: MO_LSTAC <= 4.500; Class = 0; N = 6745; class 0: 6517 (96.6%); class 1: 228 (3.4%)
      Node 6: ELT_BAL9 <= 11194.000; Class = 0; N = 4508; class 0: 4313 (95.7%); class 1: 195 (4.3%)
        Terminal Node 4: Class = 0; N = 772; class 0: 756 (97.9%); class 1: 16 (2.1%)
        Node 7: ELT_BAL9 <= 18740.500; Class = 0; N = 3736; class 0: 3557 (95.2%); class 1: 179 (4.8%)
          Node 8: MASTR_CD = (0); Class = 0; N = 3621; class 0: 3457 (95.5%); class 1: 164 (4.5%)
            Node 9: MO_ONFIL <= 43.500; Class = 0; N = 3465; class 0: 3317 (95.7%); class 1: 148 (4.3%)
              Terminal Node 5: Class = 0; N = 1554; class 0: 1502 (96.7%); class 1: 52 (3.3%)
              Node 10: MO_ONFIL <= 46.500; Class = 0; N = 1911; class 0: 1815 (95.0%); class 1: 96 (5.0%)
                Terminal Node 6: Class = 1; N = 69; class 0: 58 (84.1%); class 1: 11 (15.9%)
                Terminal Node 7: Class = 0; N = 1842; class 0: 1757 (95.4%); class 1: 85 (4.6%)
            Terminal Node 8: Class = 1; N = 156; class 0: 140 (89.7%); class 1: 16 (10.3%)
          Terminal Node 9: Class = 1; N = 115; class 0: 100 (87.0%); class 1: 15 (13.0%)
      Terminal Node 10: Class = 0; N = 2237; class 0: 2204 (98.5%); class 1: 33 (1.5%)
  Terminal Node 11: Class = 1; N = 890; class 0: 706 (79.3%); class 1: 184 (20.7%)

Branch labels on the diagram:
Presidents’ Club Member / Not Presidents’ Club Member
Flew HY segment in last 3 months
Did not fly HY segment in last 3 months
Mileage >13155 / Mileage <13155
Last account activity within 1 month
Last account activity more than 1 month
Last account activity within 4 months
Last account activity more than 4 months
Mileage >11194 / Mileage <=11194
W/ CO Master Card
W/o CO Master Card
Months on file >43.5
Months on file <43

Diagram callouts:
# of cases in this node
Respond Node
Percentage of respond
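One way to read a tree like this for campaign targeting is to compare each terminal node's response rate with the overall rate (its lift). The slides do not say the study used this calculation, so the Python sketch below is only an illustration built from the node counts printed in the tree; the names `TERMINAL_NODES`, `lift` and `ranked` are hypothetical.

```python
# (respondents, cases) per terminal node, copied from the One Pass tree above
TERMINAL_NODES = {
    1: (111, 1577), 2: (31, 957), 3: (335, 3761), 4: (16, 772),
    5: (52, 1554), 6: (11, 69), 7: (85, 1842), 8: (16, 156),
    9: (15, 115), 10: (33, 2237), 11: (184, 890),
}

total_resp = sum(r for r, _ in TERMINAL_NODES.values())
total_cases = sum(n for _, n in TERMINAL_NODES.values())
base_rate = total_resp / total_cases  # overall response rate at the root

# Lift = node response rate divided by the overall response rate
lift = {node: (r / n) / base_rate for node, (r, n) in TERMINAL_NODES.items()}
ranked = sorted(lift, key=lift.get, reverse=True)

# Share of the list mailed, and of respondents reached, if only nodes
# with lift above 1 are targeted
targeted = [node for node in TERMINAL_NODES if lift[node] > 1]
mailed_share = sum(TERMINAL_NODES[n][1] for n in targeted) / total_cases
reached_share = sum(TERMINAL_NODES[n][0] for n in targeted) / total_resp
```

With these counts, the terminal nodes whose lift exceeds 1 are the same ones the tree labels Class = 1, with Terminal Node 11 (President's Club branch) ranked first.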
CART Basics
• Can be used with classification (categories) or continuous data (regression)
• Results are presented in form of binary decision trees
• Tree structure allows CART to handle very complex data but with easy to understand results
CART Methodology
• Binary recursive partitioning
• Set of rules for
– splitting each node in a tree
– deciding when to stop
– assigning each terminal node to a classification outcome
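As an illustration of a node-splitting rule, the Python sketch below computes the decrease in Gini impurity for the root split of the no-show tree (Node 1 into Node 2 and Node 9), using the class counts printed there. Gini is one of the splitting criteria CART offers; the slides do not say which criterion or class priors were used in the study, so its use here is an assumption.

```python
def gini(counts):
    """Gini impurity of a node, given its per-class case counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def impurity_decrease(parent, left, right):
    """Drop in Gini impurity when parent is split into left and right children."""
    n = sum(parent)
    weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted_children

# (class 0, class 1) counts from the no-show tree: Node 1, Node 2, Node 9
root, node_2, node_9 = (2451, 450), (2032, 338), (419, 112)
gain = impurity_decrease(root, node_2, node_9)
```

Recursive partitioning repeats this search at every node, choosing among candidate splits the one with the largest decrease, until a stopping rule applies.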
CART Methodology
• Advantages
– CART does not require variables to be selected in advance
– Transformations of independent variables are not necessary
– A complex structure can be analyzed
– CART is extremely robust to the effects of outliers
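The points about transformations and outliers follow from the fact that a threshold split on an ordered variable depends only on the ordering of its values. The Python sketch below illustrates this with a single Gini-based split on made-up toy data (not data from the study); `best_partition` is a hypothetical helper, not part of the CART software.

```python
import math

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_partition(xs, ys):
    """Indices sent left by the threshold split with the lowest weighted Gini."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_score, best_left = float("inf"), None
    for k in range(1, len(order)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # no threshold can separate equal values
        left, right = order[:k], order[k:]
        score = (len(left) * gini([ys[i] for i in left])
                 + len(right) * gini([ys[i] for i in right]))
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left

# Made-up toy data, for illustration only
x = [1.0, 2.0, 3.0, 10.0, 20.0, 40.0]
y = [0, 0, 0, 1, 1, 1]

# A monotone transformation (log) leaves the chosen partition unchanged
same_after_log = best_partition(x, y) == best_partition([math.log(v) for v in x], y)
# Replacing the largest value with an extreme outlier also leaves it unchanged
same_with_outlier = best_partition(x, y) == best_partition(x[:-1] + [4e6], y)
```

A regression fitted to the same predictor would generally change under either modification, which is why such transformations matter there and not for the tree.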
Uses for CART
• Data exploration
– identifies variables that are good predictors
– gives a “warm start” to data analysis
– can detect variable interactions
• Supplement classical statistical methods
• Determine best aggregation points
Drawbacks to CART
• In the no-show study, we would need to forecast every PNR as to its specific attributes
• But overall, we found it to be very accurate on a testing dataset (after learning with a training dataset)
– CART beat logistic regression
Conclusions
• Data Mining methods fit naturally into the Operations Research toolkit
• Strong statistical and mathematical background necessary to fully understand models (but not necessarily to use them)
• KDD (Knowledge Discovery in Databases) Conference in Boston in August, 2000 - highly recommended