What Exactly Is Data Mining? Judy Pastor Senior Manager, Operations Research Continental Airlines AGIFORS YM and Res 2000, NYC


What Exactly Is Data Mining?

Judy Pastor

Senior Manager, Operations Research

Continental Airlines

AGIFORS YM and Res 2000, NYC


If You Haven’t Done It By Now,

• You are probably thinking about it ….


Building A Data Warehouse

• The airline business is becoming more and more competitive
  – INFORMATION IS POWER

• efficiency in
  – transaction processing
  – storage
  – retrieval

• e-commerce


Many Advantages of a DW

• All relevant airline data can be accessed through one portal

• Legacy systems kept data at an aggregated level

• “The devil is in the details” (Ross Perot)
  – DW gives opportunity to keep the lowest level of information


BUT …

• Building a DW is a huge expense

• Can tie up key IT people for years

• Is not worth it unless the data is actually used

• Sheer volume of data can cause “analysis paralysis”


Data Mining Is the Answer

• Great Buzz Word … But what is it?
  – Emerging discipline

• defined as the efficient discovery of previously unknown patterns in large databases

• combines CS, OR, and Statistics

• Data Mining can be confused with OLAP
  – On Line Analytical Processing
  – “Data Cubes”
  – “Roll-Up and Drill Down”


Data Mining versus OLAP

• OLAP is user-driven
  – user generates a hypothesis
  – uses OLAP to verify it
  – user guides exploration of data

• Data Mining
  – DM tool is used to generate a hypothesis
  – tool guides the exploration of data
  – user must verify hypothesis


Data Mining versus OLAP

• Data Aggregation is necessary for human examination
  – but it can hide the most “interesting” details of the business process

• DM provides tools that can perform exhaustive searches at disaggregated levels

• Uses mathematical algorithms in the same way an LP searches for the optimal among many solutions in a solution space


Data Mining

• Most DM algorithms have origins in classical statistics
  – Regression (Linear and Logistic)
  – Clustering
  – Time Series
  – Sequential Pattern Matching
  – Association Rules
  – Classification Trees
  – Outlier Detection


Typical DM Applications

• Fraud Detection
  – outlier detection

• Customer Segmentation
  – clustering

• Market Basket Analysis (beer and diapers)
  – association rules

• Direct Mail/Marketing
  – classification
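As a rough illustration of the association-rule idea behind the "beer and diapers" example, the sketch below computes support, confidence, and lift for one rule. The baskets are invented toy data and the helper names are hypothetical; nothing here comes from an airline or retail dataset.

```python
# Toy transactions (hypothetical, for illustration only).
BASKETS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    items = set(itemset)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """Estimated share of baskets containing lhs that also contain rhs."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)

def lift(lhs, rhs, baskets):
    """Confidence of lhs -> rhs relative to the base rate of rhs."""
    return confidence(lhs, rhs, baskets) / support(rhs, baskets)

rule = ({"diapers"}, {"beer"})
print("support   ", support(rule[0] | rule[1], BASKETS))
print("confidence", confidence(*rule, BASKETS))
print("lift      ", lift(*rule, BASKETS))
```

A lift above 1 would suggest the two items appear together more often than independence predicts; in this toy data it is slightly below 1, which shows why a rule needs checking rather than assuming.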


Airline DM Applications

• Actual case studies done at Continental
  – No Show Analysis
  – Direct Mail Campaign for One Pass

• Both used Classification Tree Method
  – Software: CART (Classification and Regression Trees) by Salford Systems


No Show Analysis

• Employed historical data to classify PNRs into “Show” or “No Show”
  – IAHDFW, Monday, 7:45am departure, 7/5/99-1/24/00, 2902 PNRs

DeptDate  Flight#  Rec_Locator  Create_date  CLS  Num_party  outbound/return  Lcl/cnx  Show/NS  OneWay  TRP_DAYS  ADV_BKD  pathOND
19990705  1014     I7Q8FH       19990403     Q    2          O                C        S        Y       0         93       LAXDFW
19990705  1014     J45XMP       19990603     T    1          O                L        S        Y       0         32       IAHDFW
19990705  1014     JLCLPD       19990703     Y    1          O                C        N        Y       0         2        DFWMSY
19990705  1014     JVWCR1       19990528     K    2          I                L        S        N       1         38       IAHDFW
19990705  1014     JYXTCD       19990624     Q    1          O                L        S        N       1         11       IAHDFW
19990705  1014     K32LJP       19990614     H    2          I                C        S        N       10        21       GUMDFW
19990705  1014     LCRXY1       19990701     Q    1          O                L        S        N       3         4        IAHDFW
19990712  1014     I69LBX       19990707     Q    1          O                L        S        N       3         5        IAHDFW
19990712  1014     IF0P8T       19990707     Q    1          O                L        S        Y       0         5        DFWIAH
19990712  1014     IFF9SP       19990707     Q    1          O                L        S        Y       0         5        DFWIAH
19990712  1014     IFRQ01       19990707     Q    1          O                C        N        N       29        5        SEADFW
19990712  1014     ILCNMT       19990617     T    1          O                L        S        N       1         25       IAHDFW
19990712  1014     IP0FCX       19990708     Q    1          O                L        S        Y       0         4        DFWIAH
19990712  1014     IQ8BPD       19990617     Q    1          O                C        S        N       1         25       VCTDFW
19990712  1014     IRZ0W1       19990708     Q    1          O                L        S        Y       0         4        DFWIAH

….
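The sample rows suggest that ADV_BKD is the number of days between Create_date and DeptDate. The sketch below parses three of the rows and checks that reading. It is only an illustration: the meaning of ADV_BKD, the reading of "N" in the Show/NS column as a no-show, and the Python-friendly field names are assumptions, not something the slides state.

```python
from datetime import datetime

# Three rows copied from the sample above (whitespace-separated fields).
SAMPLE = """\
19990705 1014 I7Q8FH 19990403 Q 2 O C S Y 0 93 LAXDFW
19990705 1014 JLCLPD 19990703 Y 1 O C N Y 0 2 DFWMSY
19990712 1014 IFRQ01 19990707 Q 1 O C N N 29 5 SEADFW
"""

# Names follow the slide's column headers; the exact spellings are this
# sketch's own choice (hypothetical), not names used by the airline.
FIELDS = [
    "DeptDate", "Flight", "Rec_Locator", "Create_date", "CLS", "Num_party",
    "Out_Ret", "Lcl_Cnx", "Show_NS", "OneWay", "TRP_DAYS", "ADV_BKD", "pathOND",
]

def parse_pnr(line):
    """Split one whitespace-separated row into a dict keyed by FIELDS."""
    return dict(zip(FIELDS, line.split()))

def days_booked_ahead(rec):
    """Days from PNR creation to departure (assumed meaning of ADV_BKD)."""
    fmt = "%Y%m%d"
    departure = datetime.strptime(rec["DeptDate"], fmt)
    created = datetime.strptime(rec["Create_date"], fmt)
    return (departure - created).days

records = [parse_pnr(line) for line in SAMPLE.splitlines()]
for rec in records:
    no_show = rec["Show_NS"] == "N"  # assumption: "N" marks a no-show
    print(rec["Rec_Locator"], days_booked_ahead(rec), rec["ADV_BKD"], no_show)
```

For these three rows the computed day counts agree with the ADV_BKD column, which supports the assumed definition without proving it for the full dataset.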


No Show Study: CART Model

Data input: IAHDFW, Mon., 7:45am departure, 7/5/99-1/24/00, 2902 PNRs

Note: Class=1 <==> No show

Each node lists its split (non-terminal nodes only), the number of cases N, and the class counts with percentages. Indentation shows parent and child nodes. Branch labels from the figure are given where the figure shows them.

Node 1 (Class = 0): LOCAL = (1); N = 2901; class 0: 2451 (84.5%); class 1: 450 (15.5%); branches: local / connecting
  Node 2 (Class = 0): B_G = (1); N = 2370; class 0: 2032 (85.7%); class 1: 338 (14.3%); branches: business / leisure
    Node 3 (Class = 0): ADV_BKD <= 14.500; N = 2131; class 0: 1848 (86.7%); class 1: 283 (13.3%); branches: Bkd <2 wks / Bkd >2 wks
      Terminal Node 1 (Class = 0): N = 1766; class 0: 1549 (87.7%); class 1: 217 (12.3%)
      Node 4 (Class = 1): ADV_BKD <= 86.000; N = 365; class 0: 299 (81.9%); class 1: 66 (18.1%); branches: Bkd >3 months / Bkd <3 months
        Terminal Node 2 (Class = 1): N = 345; class 0: 279 (80.9%); class 1: 66 (19.1%)
        Terminal Node 3 (Class = 0): N = 20; class 0: 20 (100.0%); class 1: 0 (0.0%)
    Node 5 (Class = 1): INBOUND = (0); N = 239; class 0: 184 (77.0%); class 1: 55 (23.0%); branches: out / return
      Node 6 (Class = 1): ADV_BKD <= 6.500; N = 195; class 0: 158 (81.0%); class 1: 37 (19.0%); branches: Bkd <1 wk / Bkd >1 wk
        Node 7 (Class = 1): TRP_DAYS <= 0.500; N = 150; class 0: 116 (77.3%); class 1: 34 (22.7%); branch: Trip in 1 day
          Terminal Node 4 (Class = 1): N = 85; class 0: 62 (72.9%); class 1: 23 (27.1%)
          Node 8 (Class = 1): TRP_DAYS <= 2.500; N = 65; class 0: 54 (83.1%); class 1: 11 (16.9%); branches: Trip >3 days / Trip = 2 or 3 days
            Terminal Node 5 (Class = 0): N = 32; class 0: 31 (96.9%); class 1: 1 (3.1%)
            Terminal Node 6 (Class = 1): N = 33; class 0: 23 (69.7%); class 1: 10 (30.3%)
        Terminal Node 7 (Class = 0): N = 45; class 0: 42 (93.3%); class 1: 3 (6.7%)
      Terminal Node 8 (Class = 1): N = 44; class 0: 26 (59.1%); class 1: 18 (40.9%)
  Node 9 (Class = 1): ADV_BKD <= 19.500; N = 531; class 0: 419 (78.9%); class 1: 112 (21.1%); branches: Bkd <2 wks / Bkd >2 wks
    Terminal Node 9 (Class = 1): N = 380; class 0: 281 (73.9%); class 1: 99 (26.1%)
    Terminal Node 10 (Class = 0): N = 151; class 0: 138 (91.4%); class 1: 13 (8.6%)
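One way to read the no-show tree is to compare each terminal node's no-show rate with the rate at the root. The sketch below does that using only the case counts shown in the figure. The variable names are hypothetical, and the comparison with the root rate is an observation about these numbers, not a statement about how the CART software was configured.

```python
# (terminal node, shows [class 0], no-shows [class 1]) as read from the figure.
TERMINAL_NODES = [
    (1, 1549, 217), (2, 279, 66), (3, 20, 0), (4, 62, 23), (5, 31, 1),
    (6, 23, 10), (7, 42, 3), (8, 26, 18), (9, 281, 99), (10, 138, 13),
]
ROOT_SHOWS, ROOT_NO_SHOWS = 2451, 450  # Node 1 in the figure

def no_show_rate(shows, no_shows):
    """Share of PNRs in a node that did not show."""
    return no_shows / (shows + no_shows)

root_rate = no_show_rate(ROOT_SHOWS, ROOT_NO_SHOWS)

# The terminal nodes partition the root, so their counts should add up to it.
total_shows = sum(shows for _, shows, _ in TERMINAL_NODES)
total_no_shows = sum(no_shows for _, _, no_shows in TERMINAL_NODES)

above_root = []
for node, shows, no_shows in TERMINAL_NODES:
    rate = no_show_rate(shows, no_shows)
    if rate > root_rate:
        above_root.append(node)
    print(f"Terminal Node {node:2d}: no-show rate {rate:6.1%} (root {root_rate:.1%})")

print("Nodes above the root rate:", above_root)
```

The terminal nodes whose rate exceeds the root rate (2, 4, 6, 8, and 9) are the same ones the figure labels Class = 1.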


One Pass Direct Marketing: CART Model

Note:

Class=1: Respond

Class=0: Not Respond

Predictor Variables:

P_CLUB: President’s Club status (Not, Former or Current; a categorical variable with values 0/1/2)
HY_RCNCY: Recency of flying a high yield segment; equals 1 (in Feb 99), 2 (in Jan 99), 3 (in Dec 98), 4 (in Nov 98) and 10 (none in the 4 most recent months)
HY-Num: Number of high yield segments flown in the 4 months before March 99
ELT_BAL9: OnePass mileage accumulated in 1998
M98: Segments flown in 1998
HUB1: Lives in a Hub city or Non-Hub city (a binary 1/0 variable)
MASTR_CD: Master card status (no card vs. either or both CO and EA card; a binary 0/1 variable)
MO_ONFIL: Months enrolled in the OnePass Program
MO_LSFLT: Months since last flight
MO_LSACT: Months since last account activity

Figure legend: each node box shows the number of cases in the node and the percentage of respondents; “Respond Node” marks nodes classified as Class = 1.

Each node lists its split (non-terminal nodes only), the number of cases N, and the class counts with percentages. Indentation shows parent and child nodes. Branch labels from the figure are given where the figure shows them.

Node 1 (Class = 0): P_CLUB = (0); N = 13930; class 0: 13041 (93.6%); class 1: 889 (6.4%); branches: Presidents’ Club Member / Not Presidents’ Club Member
  Node 2 (Class = 0): HY_RCNCY <= 3.500; N = 13040; class 0: 12335 (94.6%); class 1: 705 (5.4%); branches: Flew HY segment in last 3 months / Not flew HY segment in last 3 months
    Node 3 (Class = 1): ELT_BAL9 <= 13155.500; N = 6295; class 0: 5818 (92.4%); class 1: 477 (7.6%); branches: Mileage >13155 / Mileage <13155
      Node 4 (Class = 0): MO_LSTAC <= 1.500; N = 2534; class 0: 2392 (94.4%); class 1: 142 (5.6%); branches: Last account activity within 1 month / Last account activity more than 1 month
        Terminal Node 1 (Class = 1): N = 1577; class 0: 1466 (93.0%); class 1: 111 (7.0%)
        Terminal Node 2 (Class = 0): N = 957; class 0: 926 (96.8%); class 1: 31 (3.2%)
      Terminal Node 3 (Class = 1): N = 3761; class 0: 3426 (91.1%); class 1: 335 (8.9%)
    Node 5 (Class = 0): MO_LSTAC <= 4.500; N = 6745; class 0: 6517 (96.6%); class 1: 228 (3.4%); branches: Last account activity within 4 months / Last account activity more than 4 months
      Node 6 (Class = 0): ELT_BAL9 <= 11194.000; N = 4508; class 0: 4313 (95.7%); class 1: 195 (4.3%); branches: Mileage >11194 / Mileage <=11194
        Terminal Node 4 (Class = 0): N = 772; class 0: 756 (97.9%); class 1: 16 (2.1%)
        Node 7 (Class = 0): ELT_BAL9 <= 18740.500; N = 3736; class 0: 3557 (95.2%); class 1: 179 (4.8%)
          Node 8 (Class = 0): MASTR_CD = (0); N = 3621; class 0: 3457 (95.5%); class 1: 164 (4.5%); branches: W/ CO Master Card / W/o CO Master Card
            Node 9 (Class = 0): MO_ONFIL <= 43.500; N = 3465; class 0: 3317 (95.7%); class 1: 148 (4.3%); branches: Months on file >43.5 / Months on file <43
              Terminal Node 5 (Class = 0): N = 1554; class 0: 1502 (96.7%); class 1: 52 (3.3%)
              Node 10 (Class = 0): MO_ONFIL <= 46.500; N = 1911; class 0: 1815 (95.0%); class 1: 96 (5.0%)
                Terminal Node 6 (Class = 1): N = 69; class 0: 58 (84.1%); class 1: 11 (15.9%)
                Terminal Node 7 (Class = 0): N = 1842; class 0: 1757 (95.4%); class 1: 85 (4.6%)
            Terminal Node 8 (Class = 1): N = 156; class 0: 140 (89.7%); class 1: 16 (10.3%)
          Terminal Node 9 (Class = 1): N = 115; class 0: 100 (87.0%); class 1: 15 (13.0%)
      Terminal Node 10 (Class = 0): N = 2237; class 0: 2204 (98.5%); class 1: 33 (1.5%)
  Terminal Node 11 (Class = 1): N = 890; class 0: 706 (79.3%); class 1: 184 (20.7%)
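To see what the tree could mean for a mailing, the sketch below adds up the terminal nodes labeled Class = 1 in the figure and compares their response rate with the overall rate. It uses only the case counts from the figure. Mailing exactly the Class = 1 nodes is a hypothetical targeting rule chosen for illustration, not a description of the actual campaign.

```python
# (terminal node, non-responders [class 0], responders [class 1]) from the figure.
TERMINAL_NODES = [
    (1, 1466, 111), (2, 926, 31), (3, 3426, 335), (4, 756, 16), (5, 1502, 52),
    (6, 58, 11), (7, 1757, 85), (8, 140, 16), (9, 100, 15), (10, 2204, 33),
    (11, 706, 184),
]
# Terminal nodes labeled Class = 1 ("Respond Node") in the figure.
RESPOND_NODES = {1, 3, 6, 8, 9, 11}

total_cases = sum(n0 + n1 for _, n0, n1 in TERMINAL_NODES)
total_responders = sum(n1 for _, _, n1 in TERMINAL_NODES)

# Hypothetical rule: mail only members who fall in a Class = 1 terminal node.
mailed = sum(n0 + n1 for node, n0, n1 in TERMINAL_NODES if node in RESPOND_NODES)
captured = sum(n1 for node, _, n1 in TERMINAL_NODES if node in RESPOND_NODES)

overall_rate = total_responders / total_cases
targeted_rate = captured / mailed
lift = targeted_rate / overall_rate

print(f"mailed {mailed} of {total_cases} members ({mailed / total_cases:.1%})")
print(f"captured {captured} of {total_responders} responders "
      f"({captured / total_responders:.1%})")
print(f"response rate {targeted_rate:.1%} vs {overall_rate:.1%} overall, lift {lift:.2f}")
```

Under this rule the mailing would go to 6,568 of the 13,930 members and reach 672 of the 889 responders in the data the tree was built on; results on new data could differ.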


CART Basics

• Can be used with classification (categories) or continuous data (regression)

• Results are presented in form of binary decision trees

• Tree structure allows CART to handle very complex data but with easy to understand results


CART Methodology

• Binary recursive partitioning

• Set of rules for
  – splitting each node in a tree
  – deciding when to stop
  – assigning each terminal node to a classification outcome
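The splitting rule can be sketched for a single numeric predictor. The example below scores each candidate threshold by the weighted Gini impurity of the two child nodes, a criterion commonly used for classification trees. It is a minimal sketch under that assumption, not the Salford Systems implementation, and it leaves out the stopping and class-assignment rules. The function names and the toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 2.0 * p1 * (1.0 - p1)

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best rule 'value <= threshold'.

    Candidate thresholds are midpoints between adjacent distinct values.
    The threshold is None if no split improves on the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # cannot separate identical values
        threshold = (low + high) / 2.0
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Toy data (hypothetical): days booked ahead and a 0/1 no-show flag.
adv_bkd = [1, 2, 3, 10, 11, 12]
no_show = [0, 0, 0, 1, 1, 1]
print(best_split(adv_bkd, no_show))
```

Binary recursive partitioning would apply the same search to each child node in turn, over all predictors, until a stopping rule is met.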


CART Methodology

• Advantages
  – CART does not require variables to be selected in advance
  – Transformations of independent variables are not necessary
  – A complex structure can be analyzed
  – CART is extremely robust to the effects of outliers


Uses for CART

• Data exploration
  – identifies variables that are good predictors
  – gives a “warm start” to data analysis
  – can detect variable interactions

• Supplement classical statistical methods

• Determine best aggregation points


Drawbacks to CART

• In no show study, we would need to forecast every PNR as to its specific attributes

• But overall, we found it to be very accurate on a testing dataset (after learning with a training dataset)
  – CART beat logistic regression


Conclusions

• Data Mining methods fit naturally into the Operations Research toolkit

• Strong statistical and mathematical background necessary to fully understand models (but not necessarily to use them)

• KDD (Knowledge Discovery in Databases) Conference in Boston in August, 2000 - highly recommended