DATA MINING
Introduction
What Is Data Mining?
• Data mining (knowledge discovery from data): the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
• Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, information harvesting, business intelligence
Usage of Data Mining: Real-World Apps I
• Play-by-play information recorded by teams
  • Who is on the court
  • Who shoots
  • Results
• Coaches want to know what works best
  • Plays that work well against a given team
  • Good/bad player matchups

Data → Knowledge
Usage of Data Mining: Real-World Apps I
• Advanced Scout (from IBM Research) is a data mining tool to answer these questions

[Bar chart: overall shooting percentage with the Starks + Houston + Ward lineup playing]
Usage of Data Mining: Real-World Apps 2
• Assume players X and Y
• All statistics from previous matches are available
• What we need to know:
  • What factors should each player focus on to win the game?
  • What are the weaknesses of the other side?
  • …
Usage of Data Mining: Real-World Apps 2
• IBM has data mining tools for analyzing tennis data and converting it into knowledge
Usage of Data Mining: Real-World Apps 3
• Items and customer transactions
  • Which items are bought together
  • When the items are bought
  • Quantities
• Owners need to know:
  • Are there peak seasons for specific items?
  • Which items should be put next to each other?
  • If we discount item X, should we also discount other items?

Data → Knowledge
Usage of Data Mining: Real-World Apps 3
• One piece of knowledge can be combined with another piece of knowledge
• Example: if you put a discount on A and B, do not put one on C
Data Mining: Name
• The process of discovering meaningful new correlations, patterns, and trends from large amounts of stored data
• Names used for the field: Data Mining, Knowledge Mining, Knowledge Discovery in Databases, Data Archaeology, Data Dredging, Database Mining, Knowledge Extraction, Data Pattern Processing, Information Harvesting, Siftware
Integration of Multiple Technologies
Data mining draws on:
• Machine Learning
• Database Management
• Artificial Intelligence
• Statistics
• Visualization
• Algorithms
Data Mining: Classification Schemes
• General functionality
  • Descriptive data mining
  • Predictive data mining
• Different views, different classifications
  • Kinds of data to be mined
  • Kinds of knowledge to be discovered
  • Kinds of techniques utilized
  • Kinds of applications adapted
Knowledge Discovery in Databases: Process

Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Data Mining) → Patterns → (Interpretation/Evaluation) → Knowledge

adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
What Can Data Mining Do?
• Cluster
• Classify
  • Categorical, regression
• Summarize
  • Summary statistics, summary rules
• Link analysis / model dependencies
  • Association rules
• Sequence analysis
  • Time-series analysis, sequential associations
• Detect deviations (outliers)
Clustering
• Find groups of similar data items
• Statistical techniques require some definition of “distance” (e.g., between travel profiles), while conceptual techniques use background concepts and logical descriptions
• Uses: demographic analysis
• Technologies: self-organizing maps, probability densities, conceptual clustering
• Example: “Group people with similar travel profiles” yields clusters such as {George, Patricia}, {Jeff, Evelyn, Chris}, {Rob}
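The grouping idea above can be sketched with a tiny k-means run (a minimal sketch; the travel-profile numbers below are invented for illustration):

```python
import math

def kmeans(points, k, iters=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = list(points[:k])  # deterministic init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean of its cluster (keep old one if empty).
        centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

# Hypothetical travel profiles as (miles flown, trips per year):
profiles = [(100, 2), (120, 3), (110, 2), (5000, 40), (5200, 42)]
clusters = kmeans(profiles, k=2)  # separates the light and heavy travelers
```

Here Euclidean distance plays the role of the “distance between travel profiles” the slide mentions; a conceptual clustering method would instead group by logical descriptions of the profiles.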
Classification
• Find ways to separate data items into pre-defined groups
  • We know X and Y belong together; find other things in the same group
• Requires “training data”: data items whose group is known
• Uses: profiling
• Technologies: decision trees (results are human-understandable), neural nets
• Example: “Route documents to the most likely interested parties”
  • English or non-English?
  • Domestic or foreign?
• Workflow: the tool consumes training data and produces a classifier, which then assigns new items to groups
Association Rules
• Identify dependencies in the data: X makes Y likely
• Indicate the significance of each dependency
• Uses: targeted marketing
• Technologies: Bayesian methods; AIS, SETM, Hugin, TETRAD II
• Example: “Find groups of items commonly purchased together”
  • People who purchase fish are extraordinarily likely to purchase wine
  • People who purchase turkey are extraordinarily likely to purchase cranberries

Date/Time   Register  Fish  Turkey  Cranberries  Wine
12/6 13:15  2         N     Y       Y            Y
12/6 13:16  3         Y     N       N            Y
…
Deviation Detection (Outlier Detection)
• Find unexpected values, outliers
• Uses: failure analysis, anomaly discovery for analysis
• Technologies: clustering/classification methods, statistical techniques, visualization
• Example: “Find unusual occurrences in IBM stock prices”

Date      Close   Volume  Spread
58/07/02  369.50  314.08  .022561
58/07/03  369.25  313.87  .022561
58/07/04  Market Closed
58/07/07  370.00  314.50  .022561

Sample date  Event            Occurrences
58/07/04     Market closed    317 times
59/01/06     2.5% dividend    2 times
59/04/04     50% stock split  7 times
73/10/09     not traded       1 time
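The price-anomaly idea can be sketched with a simple z-score filter (a minimal sketch; the closing prices are invented, and the threshold is tuned for the tiny sample):

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=1.5):
    """Flag values whose distance from the mean exceeds `threshold` standard deviations."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

closes = [369.50, 369.25, 370.00, 369.75, 500.00]
unusual = zscore_outliers(closes)  # only the 500.00 close stands out
```

Design note: for a sample of n values the maximum possible z-score is bounded by (n-1)/sqrt(n), so a production detector would use a longer window or a robust statistic such as the median absolute deviation rather than a small fixed threshold.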
Architecture: Typical Data Mining System
• Databases and data warehouse, populated through data cleaning, data integration, and filtering
• Database or data warehouse server
• Data mining engine
• Pattern evaluation
• Graphical user interface
• Knowledge base
What To Cover
• Frequent Itemset Mining
• Association Rule Mining
• Clustering
• Classification
• Deviation (Outlier) Detection
Frequent Itemset Mining
Frequent Itemset Mining
• A very common problem in market-basket applications
• Given a set of items I = {milk, bread, jelly, …}
• Given a set of transactions, where each transaction contains a subset of the items
  • t1 = {milk, bread, water}
  • t2 = {milk, nuts, butter, rice}
Frequent Pattern Mining
• Given a set of items I = {milk, bread, jelly, …}
• Given a set of transactions, where each transaction contains a subset of the items
  • t1 = {milk, bread, water}
  • t2 = {milk, nuts, butter, rice}
• Which itemsets are frequently sold together? An itemset is frequent if the percentage of transactions in which it appears is ≥ α
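The support condition can be written down directly (a minimal sketch; t3 is an assumed extra transaction, added so the example has more than two baskets):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

transactions = [
    {"milk", "bread", "water"},          # t1 from the slide
    {"milk", "nuts", "butter", "rice"},  # t2 from the slide
    {"milk", "bread"},                   # t3: assumed
]
s = support({"milk", "bread"}, transactions)  # 2/3, so frequent whenever alpha <= 2/3
```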
Example
Assume α = 60%. What are the frequent itemsets?
• {Bread} → 80%
• {PeanutButter} → 60%
• {Bread, PeanutButter} → 60%
The percentage next to each itemset is called its “support”.
How to find frequent itemsets
• Naïve approach: enumerate all possible itemsets and then count each one
  • All possible itemsets of size 1
  • All possible itemsets of size 2
  • All possible itemsets of size 3
  • All possible itemsets of size 4
  • …
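The naïve enumeration can be sketched as follows; note that it examines all 2^|I| - 1 non-empty itemsets over the item universe, which is exactly the cost Apriori later avoids (the three-transaction database is an invented sample):

```python
from itertools import combinations

def naive_frequent_itemsets(transactions, alpha):
    """Enumerate every non-empty itemset over the item universe and count each one."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = set(combo)
            sup = sum(1 for t in transactions if s <= t) / len(transactions)
            if sup >= alpha:
                frequent[combo] = sup
    return frequent

D = [{"milk", "bread", "water"}, {"milk", "nuts"}, {"milk", "bread"}]
result = naive_frequent_itemsets(D, alpha=0.6)
# {('bread',), ('milk',), ('bread', 'milk')} meet the 60% threshold
```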
Can we optimize?
Assume α = 60%; the frequent itemsets (with their support) were:
• {Bread} → 80%
• {PeanutButter} → 60%
• {Bread, PeanutButter} → 60%

Property: for an itemset S = {X, Y, Z, …} of size n to be frequent, all of its subsets of size n-1 must be frequent as well.
Apriori Algorithm
• Executes in scans (iterations); each scan has two phases
  • Given a list of candidate itemsets of size n, count their appearances and find the frequent ones
  • From the frequent ones, generate candidates of size n+1 (the previous property must hold): all subsets of size n must be frequent for an itemset to be a candidate
• Start the algorithm with n = 1, then repeat
• Use the property to reduce the number of itemsets to check
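The two phases above can be sketched as a level-wise loop (a minimal sketch of the scheme, not the textbook pseudocode):

```python
from itertools import combinations

def apriori(transactions, alpha):
    """Level-wise mining: count candidates, keep the frequent ones, join to the next level."""
    n = len(transactions)

    def keep_frequent(candidates):
        return {c for c in candidates
                if sum(1 for t in transactions if c <= t) / n >= alpha}

    # Level 1: candidate singletons are every item that occurs at all.
    level = keep_frequent({frozenset([i]) for t in transactions for i in t})
    result, size = set(level), 1
    while level:
        size += 1
        # Join step: union pairs of frequent (size-1)-itemsets into size-itemsets...
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # ...prune step: drop any candidate with an infrequent subset (the property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, size - 1))}
        level = keep_frequent(candidates)
        result |= level
    return result

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
found = apriori(D, alpha=0.5)  # 9 frequent itemsets, including {2, 3, 5}
```

The database D here is the one used in the worked example that follows, with minimum support 50% (2 of 4 transactions).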
The Apriori Algorithm — Example 2 (minimum support: 2 transactions, i.e., 50%)

Database D:

TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1 (candidate 1-itemsets with counts): {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3

L1 (frequent 1-itemsets): {1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (candidates from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with counts: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2

L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3: {2 3 5}; Scan D → L3: {2 3 5}: 2
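The scan counts in the walkthrough are easy to check directly (a quick sanity check, not part of the algorithm):

```python
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def count(itemset):
    """Number of transactions in D containing every item of `itemset`."""
    return sum(1 for t in D if itemset <= t)

c1 = count({1, 2})     # 1, so {1 2} is pruned from L2
c2 = count({2, 5})     # 3
c3 = count({2, 3, 5})  # 2, so {2 3 5} makes L3
```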
Apriori with Constraints
• Suppose we have constraints, e.g., the sum of prices of a frequent itemset must not exceed X
• Lazy approach: apply the constraints at the end, on the discovered patterns
• Eager approach: push the constraints into the computation (if possible)
Lazy Algorithm: Apriori + Constraint
Constraint: Sum(S.price) < 5, assuming price = ItemID.
• Run plain Apriori on database D exactly as in Example 2, yielding the frequent itemsets {1}, {2}, {3}, {5}, {1 3}, {2 3}, {2 5}, {3 5}, {2 3 5}
• Only at the end, discard every itemset whose price sum is not below 5: just {1}, {2}, {3}, and {1 3} remain
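The lazy approach amounts to a single post-filter over Apriori's output (with price = ItemID, as on the slide):

```python
# Frequent itemsets produced by plain Apriori on database D (min support 2):
frequent = [{1}, {2}, {3}, {5}, {1, 3}, {2, 3}, {2, 5}, {3, 5}, {2, 3, 5}]

# Lazy constraint check, applied only at the end: Sum(S.price) < 5.
satisfying = [s for s in frequent if sum(s) < 5]
# Leaves {1}, {2}, {3}, and {1, 3}.
```

The cost of laziness is that support was counted for itemsets like {2 3 5} that the constraint then throws away; the eager variant avoids that wasted counting.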
Eager Algorithm (crossed entries are not computed)
Constraint: Sum(S.price) < 5, assuming price = ItemID. Here the constraint is pushed into the computation, so candidates that already violate it are crossed out and never counted. On database D from Example 2:
• Scan D → C1: {5} is crossed out (price sum 5), so L1 = {1}: 2, {2}: 3, {3}: 3
• C2: only {1 2} and {1 3} are generated; {2 3} and every pair containing 5 violate the constraint
• Scan D → L2 = {1 3}: 2; no size-3 candidates remain, so {2 3 5} is never computed
Apriori Adv/Disadv
• Advantages:
  • Uses the large-itemset property
  • Easily parallelized
  • Easy to implement
• Disadvantages:
  • Assumes the transaction database is memory-resident
  • Requires up to m database scans
Association Rule Mining
Association Rules Outline
• Finding associations between the objects in the database
• When X happens, Y also happens with some probability; if the probability is high, then the association between X and Y is important
• Example: what is the probability that when a customer buys bread in a transaction, (s)he also buys milk in the same transaction?
Example: Market Basket Data
• Items frequently purchased together: Bread ⇒ PeanutButter
• Uses: placement, advertising, sales, coupons
• Objective: increase sales and reduce costs
Association Rule Definitions
• Set of items: I = {I1, I2, …, Im}
• Transactions: D = {t1, t2, …, tn}, tj ⊆ I
• Itemset: {Ii1, Ii2, …, Iik} ⊆ I
• Support of an itemset: percentage of transactions which contain that itemset
• Large (frequent) itemset: itemset whose number of occurrences is above a threshold
Association Rules Example
I = { Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread,PeanutButter} is 60%
Association Rule Definitions
• Association rule (AR): an implication X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅
• Support of AR (s) X ⇒ Y: percentage of transactions that contain X ∪ Y
• Confidence of AR (α) X ⇒ Y: ratio of the number of transactions that contain X ∪ Y to the number that contain X
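These two definitions translate directly into code (a minimal sketch; the three-transaction database is invented for illustration):

```python
def support(itemset, D):
    """Fraction of transactions in D containing `itemset`."""
    return sum(1 for t in D if itemset <= t) / len(D)

def rule_stats(X, Y, D):
    """Support and confidence of the association rule X => Y."""
    s = support(X | Y, D)
    return s, s / support(X, D)

D = [{"bread", "milk"}, {"bread"}, {"milk", "jelly"}]
s, conf = rule_stats({"bread"}, {"milk"}, D)  # support 1/3, confidence 1/2
```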
Association Rule Problem
• Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the association rule problem is to identify all association rules X ⇒ Y with a minimum support and confidence
• Link analysis
• NOTE: the support of X ⇒ Y is the same as the support of X ∪ Y
Association Rule Techniques
1. Find large (frequent) itemsets
2. Generate rules from the frequent itemsets
Example
Rule: Bread → PeanutButter
• Support of rule = support(Bread, PeanutButter) = 60%
• Confidence of rule = support(Bread, PeanutButter) / support(Bread) = 75%

Rule: Bread, Jelly → PeanutButter
• Support of rule = support(Bread, Jelly, PeanutButter) = 20%
• Confidence of rule = support(Bread, Jelly, PeanutButter) / support(Bread, Jelly) = 100%

Usually we search for rules with support > α and confidence > β.
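The figures above can be reproduced on a five-transaction basket database consistent with them (an assumption: the slide's own transaction table did not survive extraction, so this data is reconstructed to match the stated supports):

```python
D = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def sup(S):
    """Fraction of transactions containing itemset S."""
    return sum(1 for t in D if S <= t) / len(D)

s1 = sup({"Bread", "PeanutButter"})           # 0.6   -> rule support 60%
c1 = s1 / sup({"Bread"})                      # ~0.75 -> confidence 75%
s2 = sup({"Bread", "Jelly", "PeanutButter"})  # 0.2   -> rule support 20%
c2 = s2 / sup({"Bread", "Jelly"})             # 1.0   -> confidence 100%
```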