Data Mining
© 2006, HEC Montréal. www.hec.ca/sap/ERPsim
The majority of reports are based on known facts
BUT
We don’t know what we don’t know
Definition
Data mining is the process of discovering meaningful new correlations, patterns and trends by "mining" large amounts of stored data using pattern recognition technologies, as well as statistical and mathematical techniques.
(Ashby and Simms, 1998)
Data Mining Examples
• Market Basket Analysis and Up-Selling/Cross-Selling
• Pharmaceutical Industry: Drug Effectiveness by Patient Type
• Defect Analysis in Manufacturing
• University and Employee Recruitment
• Employee Turnover Predictions
• Credit Risk Determination
• Credit Card Fraud
• Customer Grouping and Behaviour Prediction
CRISP-DM: Phases and Tasks
CRISP-DM: CRoss Industry Standard Process for Data Mining. Initiative launched Sept. 1996.
Business Understanding
• Determine Business Objectives: Background; Business Objectives; Business Success Criteria
• Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
• Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
• Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
Data Understanding
• Collect Initial Data: Initial Data Collection Report
• Describe Data: Data Description Report
• Explore Data: Data Exploration Report
• Verify Data Quality: Data Quality Report
Data Preparation
• Data Set: Data Set Description
• Select Data: Rationale for Inclusion / Exclusion
• Clean Data: Data Cleaning Report
• Construct Data: Derived Attributes; Generated Records
• Integrate Data: Merged Data
• Format Data: Reformatted Data
Modeling
• Select Modeling Technique: Modeling Technique; Modeling Assumptions
• Generate Test Design: Test Design
• Build Model: Parameter Settings; Models; Model Description
• Assess Model: Model Assessment; Revised Parameter Settings
Evaluation
• Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
• Review Process: Review of Process
• Determine Next Steps: List of Possible Actions; Decision
Deployment
• Plan Deployment: Deployment Plan
• Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
• Produce Final Report: Final Report; Final Presentation
• Review Project: Experience Documentation
SAP BI Analysis Process Designer (APD)
Data Mining Methods: Predictive vs Informative
Association Analysis
Association Analysis Data Mining
[Figure: Customers linked to the Products (A to E) they bought, yielding Cross-Selling Rules]
What products / services are typically bought together?
Export rules to Web Shop
Use in merchandising
Informative: Association Analysis - Example
Small Example
Rule: Diapers -> Beer
Support: 60% (3/5)
• 60% of all purchases have diapers and beer
Confidence: 75% (3/4)
• If diapers are purchased, 75% chance of buying beer
Lift: 1.25 (75% / 60%, where 60% is the share of all purchases that contain beer)
• If diapers are purchased, a person is 1.25 times more likely to purchase beer
url: http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
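The three measures can be checked with a short Python sketch. The five baskets below are an assumption: they are made up so that the counts match the figures quoted on the slide (diapers in 4 of 5 baskets, beer in 3, both in 3), and the function names are illustrative only.

```python
from fractions import Fraction

# Hypothetical baskets, chosen only to reproduce the slide's counts.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return Fraction(hits, len(baskets))

def rule_metrics(antecedent, consequent, baskets):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    both = support(antecedent | consequent, baskets)
    conf = both / support(antecedent, baskets)
    lift = conf / support(consequent, baskets)
    return both, conf, lift

sup, conf, lift = rule_metrics({"diapers"}, {"beer"}, transactions)
print(float(sup), float(conf), float(lift))  # 0.6 0.75 1.25
```

A lift above 1 means the two items appear together more often than they would if purchases were independent.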
Rule Identification
Brute force: Examine all combinations to see which have high support, confidence & lift
What is the problem with this approach?
Algorithms have been developed to reduce the number of rules considered: first find the frequent itemsets (using support), then generate the high-confidence rules from them
url: http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf
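The two-step idea (frequent itemsets first, then high-confidence rules) can be sketched as a simplified level-wise search in the style of Apriori. This is an assumption: the slide does not name a specific algorithm, and the baskets, thresholds and function names below are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical baskets for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def frequent_itemsets(baskets, min_support):
    """Level-wise search: only itemsets that are already frequent get extended."""
    n = len(baskets)
    items = sorted({item for basket in baskets for item in basket})
    current = [frozenset([item]) for item in items]
    frequent = {}
    while current:
        # Count each candidate and keep those that meet the support threshold.
        counts = {c: sum(1 for b in baskets if c <= b) for c in current}
        kept = {c: Fraction(k, n) for c, k in counts.items() if Fraction(k, n) >= min_support}
        frequent.update(kept)
        # Next-level candidates: unions of two survivors that are one item larger.
        size = len(next(iter(kept))) + 1 if kept else 0
        current = list({a | b for a, b in combinations(kept, 2) if len(a | b) == size})
    return frequent

def confident_rules(frequent, min_confidence):
    """Split each frequent itemset into antecedent -> consequent rules."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[antecedent]
                if conf >= min_confidence:
                    rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

frequent = frequent_itemsets(transactions, Fraction(3, 5))
rules = confident_rules(frequent, Fraction(3, 4))
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), float(conf))
```

Because an itemset can only be frequent if all of its subsets are, infrequent items such as eggs are dropped after the first pass and never combined further, which is what keeps the search far smaller than brute force.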
Clustering
Clustering is a data mining technique that creates groups of records that are:
• Similar to each other within a particular group
• Very different across different groups
The degree of association between members is measured by all the characteristics specified in the analysis
Clustering helps the user explore vast amounts of data and organize it in a systematic way
[Figure: clusters of records plotted by Income and Age, with axes running from Low to High]
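One common way to form such groups is k-means, sketched below. This is an assumption: the slides do not say which clustering algorithm is used, and the customer values, starting centroids and function name are hypothetical.

```python
import math

# Hypothetical customers as (age, income in $1000s); values are illustrative only.
customers = [(22, 18), (25, 22), (27, 20), (52, 85), (58, 90), (61, 80)]

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its old centroid
        if new_centroids == centroids:
            break  # assignments have stabilised
        centroids = new_centroids
    return labels, centroids

labels, centroids = kmeans(customers, centroids=[customers[0], customers[-1]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In practice the attributes would usually be rescaled first, so that a measure with large values such as income does not dominate the distance.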
Clustering Process
ABC Analysis
Informative: ABC Classification
Use ABC to classify objects (such as customers, employees, vendors or products) based on a particular measure (such as revenue or profit).
Examples:
• Customers with revenue > $100M = Class “A”, etc.
• Customers who generate the top 20% of our revenue = Class “A”, etc.
• Rank customers by their revenue:
  • The top 20% on the list = Class “A”, etc. OR
  • The first 50 customers = Class “A”, etc.
Practical applications:
• Classify customers into Platinum, Gold, Silver
• Rank vendors based on product quality (returned goods)
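The rank-based variant (top 20% of the list = Class “A”) can be sketched as follows. The customer IDs, revenues and the 30% cut-off for Class “B” are assumptions made for illustration; the slides only give the Class “A” rule.

```python
# Hypothetical customers and revenues; illustrative values only.
revenue = {
    "C01": 900, "C02": 50, "C03": 400, "C04": 120, "C05": 30,
    "C06": 700, "C07": 80, "C08": 20, "C09": 250, "C10": 10,
}

def abc_by_rank(measure, a_pct=20, b_pct=30):
    """Rank objects by the measure (descending); label the top a_pct percent
    of the list "A", the next b_pct percent "B", and the remainder "C"."""
    ranked = sorted(measure, key=measure.get, reverse=True)
    n = len(ranked)
    a_cut = (n * a_pct + 99) // 100          # ceiling of n * a_pct / 100
    b_cut = a_cut + (n * b_pct + 99) // 100
    return {
        name: "A" if i < a_cut else "B" if i < b_cut else "C"
        for i, name in enumerate(ranked)
    }

classes = abc_by_rank(revenue)
print(classes["C01"], classes["C03"], classes["C10"])  # A B C
```

The other variants on the slide (a fixed revenue threshold, or the first N objects) only change how the cut-offs are computed.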
Informative: ABC Analysis - Example
Classification/Decision Trees
Selected Customers - Historical Data (query)
Customer       Income     Age   Credit Rating   Etc.   Buying Behavior
Mick Jagger    $10 000    48    Excellent       …      Yes
Elton John     $3000      22    Fair            …      No
Tina Turner    $8000      36    Excellent       …      Yes
Etc.           …          …     …               …      …
How will other customers behave?
New Data (query)
Customer       Income     Age   Credit Rating   Etc.   Buying Behavior
Willie Nelson  $6500      34    Fair            …      ?
Carol King     $2000      63    Excellent       …      ?
Etc.           …          …     …               …      ?
• Identify the factors driving customer behavior and predict future behavior
Predictive: Decision Tree
Model process:
A record in the query starts at the root node
A test (in the model) determines which node the record should go to next
All records end up in a leaf node
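The model process can be sketched in Python using the golf tree that appears later in these slides. The nested-dict layout and the function name are assumptions, and Humidity is written as "high"/"normal" to match the dataset (the tree figure labels the same split as > 70% / < 70%).

```python
# Golf tree as nested dicts: a dict is a decision node, a string is a leaf.
# The high/normal mapping for humidity is an assumed reading of the figure.
tree = {
    "test": "outlook",
    "branches": {
        "overcast": "play",
        "sunny": {
            "test": "humidity",
            "branches": {"high": "don't play", "normal": "play"},
        },
        "rainy": {
            "test": "windy",
            "branches": {True: "don't play", False: "play"},
        },
    },
}

def classify(record, node):
    """Start at the root and apply one test per decision node until a leaf is reached."""
    while isinstance(node, dict):
        value = record[node["test"]]      # the test picks which branch to follow
        node = node["branches"][value]
    return node

print(classify({"outlook": "sunny", "humidity": "normal", "windy": True}, tree))  # play
```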
Interpreting the Results
Read the tree from top to bottom
Rule: If Age is less than 35 and Income is greater than $5000 and Credit Rating is Fair, then the customer has a 35% chance of buying the product
Age, then Income and credit rating, are the most influential attributes determining buying behavior.
[Figure: decision tree. Root Node: Age (Test: < 35 or >= 35). Decision Nodes: Income (> $5000 or <= $5000) and Credit Rating (Fair or Excellent). Leaf Nodes: Buy 100%; Won’t Buy 100%; Buy 35% / Won’t Buy 65%.]
Predictive: Decision Tree
Play Golf Dataset
Case Outlook Temp Humidity Windy Play
a sunny hot high FALSE no
b sunny hot high TRUE no
c overcast hot high FALSE yes
d rainy mild high FALSE yes
e rainy cool normal FALSE yes
f rainy cool normal TRUE no
g overcast cool normal TRUE yes
h sunny mild high FALSE no
i sunny cool normal FALSE yes
j rainy mild normal FALSE yes
k sunny mild normal TRUE yes
l overcast mild high TRUE yes
m overcast hot normal FALSE yes
n rainy mild high TRUE no
Decision Tree of Golf Data
Outlook? (Play 9, Don’t Play 5)
  Sunny (Play 2, Don’t Play 3) -> Humidity?
    < 70%: Play 2, Don’t Play 0
    > 70%: Play 0, Don’t Play 3
  Overcast: Play 4, Don’t Play 0
  Rain (Play 3, Don’t Play 2) -> Windy?
    True: Play 0, Don’t Play 2
    False: Play 3, Don’t Play 0
Conclusion
The best way to explain the attribute “play” is with the attribute Outlook
First conclusion: people always play when it’s overcast
On days when it rains, the attribute Windy explains whether people play or not
On days when it’s sunny, the attribute Humidity explains when people play
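One way to check that Outlook is the best first split is to compute the information gain of each attribute on the Play Golf dataset. Information gain is an assumption here: the slides do not say which criterion was used to build the tree, and the function names are illustrative.

```python
import math
from collections import Counter

COLUMNS = ("outlook", "temp", "humidity", "windy", "play")
# Cases a to n of the Play Golf dataset from these slides.
DATA = """\
sunny hot high FALSE no
sunny hot high TRUE no
overcast hot high FALSE yes
rainy mild high FALSE yes
rainy cool normal FALSE yes
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE no
sunny cool normal FALSE yes
rainy mild normal FALSE yes
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE yes
rainy mild high TRUE no"""
records = [dict(zip(COLUMNS, line.split())) for line in DATA.splitlines()]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target="play"):
    """Entropy of the target minus its weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

gains = {a: information_gain(records, a) for a in COLUMNS[:-1]}
best = max(gains, key=gains.get)
print(best)  # outlook
```

Applying the same function to only the sunny rows, or only the rainy rows, gives a way to check the second-level splits on Humidity and Windy.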
Confidence and Support
Confidence refers to the relative frequency with which an event occurs
• If golfers play on 8 out of the 10 days it’s overcast, then we have 8/10 confidence that golfers will play on overcast days
Support refers to the number of times an event occurs out of all instances
• If it’s only overcast 1 day in 100, then there is only 1/100 support for the rule given above
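Both measures can be computed for the Play Golf dataset with a short sketch. It follows the definitions on this slide (support counts the days on which the condition holds), and the function name is hypothetical.

```python
from fractions import Fraction

# (outlook, play) pairs for cases a to n of the Play Golf dataset in these slides.
cases = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]

def rule_stats(rows, outlook, play="yes"):
    """Confidence and support for the rule 'outlook -> play'.

    confidence = days with that outlook and that outcome / days with that outlook
    support    = days with that outlook / all days (the slide's definition)
    """
    outcomes = [p for o, p in rows if o == outlook]
    confidence = Fraction(outcomes.count(play), len(outcomes))
    support = Fraction(len(outcomes), len(rows))
    return confidence, support

for outlook in ("overcast", "sunny", "rainy"):
    confidence, support = rule_stats(cases, outlook)
    print(outlook, confidence, support)
```

For overcast days the rule has full confidence (4 of 4) but is supported by only 4 of the 14 cases, so a rule can be reliable and still apply rarely.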
Decision Tree: Practical Applications
How can we reduce customer fraud? Analyze customer characteristics:
• Fraudulent behavior (Y or N), age, education, occupation, frequency of purchase, dollar value of purchase, etc.
Who is likely to “churn” (stop buying from us)? Analyze customer characteristics; who is:
• (1) still with us, and
• (2) no longer “on board”,
• Plus other demographic or transactional attributes...
Who is likely to be a credit risk? Analyze customer characteristics; who has:
• (1) not been a credit risk in the past, and
• (2) been a credit risk in the past
• Include relevant customer characteristics