Association Rule Mining
Mining Association Rules in Large Databases
Association rule mining
Algorithms Apriori and FP-Growth
Max and closed patterns
Mining various kinds of association/correlation rules
Max-patterns & Closed patterns
If there are frequent patterns with many items, enumerating all of them is costly.
We may be interested in finding the ‘boundary’ frequent patterns.
Two types…
Max-patterns
A frequent pattern {a1, …, a100} has $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ frequent sub-patterns!
Max-pattern: a frequent pattern without a proper frequent super-pattern.
BCDE and ACD are max-patterns.
BCD is not a max-pattern.
Tid Items
10 A,B,C,D,E
20 B,C,D,E
30 A,C,D,F
Min_sup=2
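The definition can be checked by brute force on the three transactions above. The following Python sketch is illustrative only: the names `transactions`, `support` and `max_patterns` are introduced here as assumptions, and exhaustive enumeration is practical only because the item universe is tiny.

```python
from itertools import combinations
from math import comb

# Transactions from the slide (Tid -> items); min_sup is an absolute count.
transactions = {10: set("ABCDE"), 20: set("BCDE"), 30: set("ACDF")}
min_sup = 2

def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)

# Brute force: enumerate every non-empty itemset over the item universe.
universe = sorted(set().union(*transactions.values()))
frequent = [
    frozenset(c)
    for k in range(1, len(universe) + 1)
    for c in combinations(universe, k)
    if support(frozenset(c)) >= min_sup
]

# A max-pattern is a frequent itemset with no proper frequent superset.
max_patterns = [x for x in frequent if not any(x < y for y in frequent)]
print(sorted("".join(sorted(p)) for p in max_patterns))

# Count from the slide: sum over k = 1..100 of C(100, k) equals 2^100 - 1.
n_subpatterns = sum(comb(100, k) for k in range(1, 101))
print(n_subpatterns == 2**100 - 1)
```

BCD is frequent here, but it is contained in the frequent pattern BCDE, which is why it does not appear among the max-patterns.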
MaxMiner: Mining Max-patterns
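Idea: generate the complete set-enumeration tree one level at a time, pruning where applicable.
Each node is written as an itemset followed, in parentheses, by the items that may still be appended to it.
(ABCD)
A (BCD)  B (CD)  C (D)  D ()
AB (CD)  AC (D)  AD ()  BC (D)  BD ()  CD ()
ABC (D)  ABD ()  ACD ()  BCD ()
ABCD ()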
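The level-by-level construction of the set-enumeration tree can be sketched in a few lines of Python. The function names `expand` and `levels` are hypothetical, and the sketch builds the full tree without any pruning.

```python
def expand(head, tail):
    """Children of node head (tail): move each tail item into the head.
    The child's tail is whatever follows that item in the item ordering."""
    return [(head + item, tail[i + 1:]) for i, item in enumerate(tail)]

def levels(items):
    """Yield the tree one level at a time as lists of (head, tail) pairs."""
    level = [("", items)]          # root: empty head, all items in the tail
    while level:
        yield level
        level = [child for head, tail in level for child in expand(head, tail)]

for level in levels("ABCD"):
    print("  ".join(f"{head} ({tail})" for head, tail in level))
```

Every itemset over {A, B, C, D} appears exactly once as a head, because an item can only be appended if it comes later in the ordering than the items already in the head.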
Local Pruning Techniques (e.g. at node A)
Check the frequency of ABCD and of AB, AC, AD.
If ABCD is frequent, prune the whole sub-tree.
If AC is NOT frequent, remove C from the parentheses before expanding.
Algorithm MaxMiner
Initially, generate one node N = (ABCD), where h(N) = ∅ and t(N) = {A,B,C,D}.
Consider expanding N:
If h(N) ∪ t(N) is frequent, do not expand N.
If for some i ∈ t(N), h(N) ∪ {i} is NOT frequent, remove i from t(N) before expanding N.
Apply global pruning techniques…
Global Pruning Technique (across sub-trees)
When a max pattern is identified (e.g. ABCD), prune all nodes (e.g. B, C and D) where h(N) ∪ t(N) is a subset of it (e.g. ABCD).
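Example
Tid Items
10 A,B,C,D,E
20 B,C,D,E
30 A,C,D,F
Min_sup=2
Root node (ABCDEF)
Items Frequency
ABCDEF 0
A 2
B 2
C 3
D 3
E 2
F 1
ABCDEF is not frequent, so the root is expanded. F is not frequent, so it is removed from the tail first.
A (BCDE)  B (CDE)  C (DE)  D (E)  E ()
Max patterns: (none yet)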
Example (Cont.)
Node A
Items Frequency
ABCDE 1
AB 1
AC 2
AD 2
AE 1
Min_sup=2
ABCDE is not frequent, so node A is expanded. AB and AE are not frequent, so B and E are removed from the tail of A.
A (BCDE)  B (CDE)  C (DE)  D (E)  E ()
AC (D)  AD ()
Max patterns: (none yet)
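Example (Cont.)
Node B
Items Frequency
BCDE 2
BC
BD
BE
Min_sup=2
BCDE is frequent, so it is a max pattern and node B is not expanded. By global pruning, nodes C (DE), D (E) and E () are pruned as well, because CDE, DE and E are subsets of BCDE.
A (BCDE)  B (CDE)  C (DE)  D (E)  E ()
AC (D)  AD ()
Max patterns: BCDE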
Example (Cont.)
Node AC
Items Frequency
ACD 2
Min_sup=2
ACD is frequent, so it is a max pattern. Node AD () is pruned because AD is a subset of ACD.
A (BCDE)  B (CDE)  C (DE)  D (E)  E ()
AC (D)  AD ()
ACD ()
Max patterns: BCDE, ACD
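The walk-through above can be reproduced with a short Python sketch of the pruning rules. This is a simplified illustration under stated assumptions, not the published MaxMiner implementation: items are kept in alphabetical order as on the slides, `max_miner` and `support` are names introduced here, and the final subsumption filter is an extra safety step added by this sketch.

```python
# Transactions and threshold from the example slides.
transactions = [set("ABCDE"), set("BCDE"), set("ACDF")]
min_sup = 2

def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def max_miner(items, min_sup):
    found = []                                # candidate max patterns so far
    level = [(frozenset(), list(items))]      # root: h(N) = {}, t(N) = all items
    while level:
        next_level = []
        for head, tail in level:
            whole = head | set(tail)          # h(N) ∪ t(N)
            # Global pruning: skip nodes covered by a known pattern.
            if any(whole <= m for m in found):
                continue
            # If h(N) ∪ t(N) is frequent, record it and do not expand N.
            if whole and support(whole) >= min_sup:
                found.append(whole)
                continue
            # Local pruning: drop tail items i with h(N) ∪ {i} infrequent.
            tail = [i for i in tail if support(head | {i}) >= min_sup]
            if not tail and head:
                found.append(head)            # frequent head with no extension
            for k, item in enumerate(tail):
                next_level.append((head | {item}, tail[k + 1:]))
        level = next_level
    # Final cleanup (an assumption of this sketch): keep only maximal sets.
    return [m for m in found if not any(m < other for other in found)]

patterns = max_miner("ABCDEF", min_sup)
print(sorted("".join(sorted(p)) for p in patterns))
```

On the three example transactions the sketch visits the nodes in the same order as the slides: the root, then A and B, then AC.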
Frequent Closed Patterns
For a frequent itemset X, if there exists no item y outside X s.t. every transaction containing X also contains y, then X is a frequent closed pattern.
“ab” is a frequent closed pattern.
Concise representation of frequent patterns.
Reduces the number of patterns and rules.
N. Pasquier et al. In ICDT’99.
TID Items
10 a, b, c
20 a, b, c
30 a, b, d
40 a, b, d
50 e, f
Min_sup=2
Max Pattern vs. Frequent Closed Pattern
max pattern => closed pattern
If itemset X is a max pattern, adding any item to it would not give a frequent pattern; thus there exists no item y outside X s.t. every transaction containing X also contains y.
closed pattern does not imply max pattern
“ab” is a closed pattern, but not max.
TID Items
10 a, b, c
20 a, b, c
30 a, b, d
40 a, b, d
50 e, f
Min_sup=2
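Both definitions can be checked directly on the five transactions above. The Python sketch below is a brute-force illustration; the helper names `support` and `is_closed` are assumptions introduced here and are not part of any algorithm named on the slides.

```python
from itertools import combinations

# Transactions from the slide (TID 10..50); min_sup is an absolute count.
transactions = [set("abc"), set("abc"), set("abd"), set("abd"), set("ef")]
min_sup = 2

def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def is_closed(itemset):
    """True if no item outside `itemset` occurs in every transaction that
    contains `itemset`, i.e. the common items are exactly `itemset`."""
    covering = [t for t in transactions if itemset <= t]
    return set.intersection(*covering) == set(itemset)

universe = sorted(set().union(*transactions))
frequent = [
    frozenset(c)
    for k in range(1, len(universe) + 1)
    for c in combinations(universe, k)
    if support(frozenset(c)) >= min_sup
]

closed = {x for x in frequent if is_closed(x)}
maximal = {x for x in frequent if not any(x < y for y in frequent)}

print("closed: ", sorted("".join(sorted(x)) for x in closed))
print("maximal:", sorted("".join(sorted(x)) for x in maximal))
```

Here every max pattern is also closed, while “ab” is closed without being max because abc and abd are still frequent.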
Mining Frequent Closed Patterns: CLOSET
Flist: list of all frequent items in support ascending order
Flist: d-a-f-e-c
Divide search space
Patterns having d
Patterns having a but not d, etc.
Find frequent closed patterns recursively
Among the transactions having d, cfa is frequent closed => cfad is a frequent closed pattern
J. Pei, J. Han & R. Mao. "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00.
TID Items
10 a, c, d, e, f
20 a, b, e
30 c, e, f
40 a, c, d, f
50 c, e, f
Min_sup=2
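The first steps on this slide, building the Flist and examining the transactions that contain d, can be sketched in Python as follows. This covers only that first step and is not a full CLOSET implementation; the names `f_list`, `d_db` and `closed_with_d` are assumptions introduced here. Items c, e and f all have support 4, so their relative order in the Flist is arbitrary; the sketch breaks ties alphabetically, whereas the slide lists them as f-e-c.

```python
from collections import Counter

# Transactions from the slide (TID -> items); min_sup is an absolute count.
transactions = {
    10: set("acdef"),
    20: set("abe"),
    30: set("cef"),
    40: set("acdf"),
    50: set("cef"),
}
min_sup = 2

# Flist: frequent items in ascending order of support (ties: alphabetical).
counts = Counter(item for items in transactions.values() for item in items)
f_list = sorted(
    (item for item in counts if counts[item] >= min_sup),
    key=lambda item: (counts[item], item),
)
print("Flist:", "-".join(f_list))

# Transactions having d, with d itself removed (the d-projected database).
d_db = [items - {"d"} for items in transactions.values() if "d" in items]

# Items present in every transaction that has d, joined with d itself.
common = set.intersection(*d_db)
closed_with_d = frozenset(common | {"d"})
print("closed pattern with d:", "".join(sorted(closed_with_d)), "support", len(d_db))
```

Item b has support 1 and is dropped from the Flist; item e occurs in only one of the two transactions having d, so it is not part of the pattern.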
Multiple-Level Association Rules
Items often form a hierarchy.
Items at the lower level are expected to have lower support.
Rules regarding itemsets at appropriate levels could be quite useful.
A transactional database can be encoded based on dimensions and levels.
We can explore shared multi-level mining.
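[Figure: concept hierarchy. Food at the top; milk and bread below it; skim and 2% fat under milk; white and wheat under bread; brand-level items such as Garelick and Wonder, ...]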
Mining Multi-Level Associations
A top-down, progressive deepening approach:
  First find high-level strong rules: milk => bread [20%, 60%].
  Then find their lower-level “weaker” rules: 2% fat milk => wheat bread [6%, 50%].
Variations at mining multiple-level association rules:
  Level-crossed association rules: skim milk => Wonder wheat bread
  Association rules with multiple, alternative hierarchies: full fat milk => Wonder bread
Multi-level Association: Uniform Support vs. Reduced Support
Uniform Support: the same minimum support for all levels
  + One minimum support threshold. No need to examine itemsets containing any item whose ancestors do not have minimum support.
  - Lower level items do not occur as frequently. If the support threshold is
      too high => miss low level associations
      too low => generate too many high level associations
Multi-level Association: Uniform Support vs. Reduced Support
Reduced Support: reduced minimum support at lower levels
There are 4 search strategies:
  Level-by-level independent: independent search at all levels (no misses)
  Level-cross filtering by k-itemset: prune a k-pattern if the corresponding k-pattern at the upper level is infrequent
  Level-cross filtering by single item: prune an item if its parent node is infrequent
  Controlled level-cross filtering by single item: consider ‘subfrequent’ items that pass a passage threshold
Uniform Support
Multi-level mining with uniform support
Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 5%): full fat Milk [support = 6%]; Skim Milk [support = 4%] X (below min_sup)
Reduced Support
Multi-level mining with reduced support
Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 3%): full fat Milk [support = 6%]; Skim Milk [support = 4%]
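The two figures can be compared with a small Python sketch that applies a per-level minimum support together with level-cross filtering by single item. The support values are the ones shown in the figures; the dictionaries `parent` and `level` and the function `frequent_items` are hypothetical names introduced for illustration.

```python
# Supports taken from the two figures above.
support = {"milk": 0.10, "full fat milk": 0.06, "skim milk": 0.04}
parent = {"full fat milk": "milk", "skim milk": "milk"}   # hierarchy edges
level = {"milk": 1, "full fat milk": 2, "skim milk": 2}

def frequent_items(min_sup_by_level):
    """Level-cross filtering by single item: an item is examined only if its
    parent is frequent, then tested against the threshold of its own level."""
    result = []
    for item in sorted(support, key=lambda i: level[i]):   # top level first
        p = parent.get(item)
        if p is not None and p not in result:
            continue                                       # parent infrequent
        if support[item] >= min_sup_by_level[level[item]]:
            result.append(item)
    return result

uniform = frequent_items({1: 0.05, 2: 0.05})
reduced = frequent_items({1: 0.05, 2: 0.03})
print("uniform:", uniform)
print("reduced:", reduced)
```

With a uniform 5% threshold skim milk is lost; lowering the level-2 threshold to 3% keeps it.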
Interestingness Measurements
Objective measures
  Two popular measurements: support and confidence
Subjective measures
  A rule (pattern) is interesting if it is
    unexpected (surprising to the user); and/or
    actionable (the user can do something with it)
Criticism of Support and Confidence
Example 1: Among 5000 students
  3000 play basketball
  3750 eat cereal
  2000 both play basketball and eat cereal
play basketball => eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
play basketball => not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence: 33.3% of basketball players do not eat cereal, compared with 25% of all students.
            basketball  not basketball  sum(row)
cereal      2000        1750            3750
not cereal  1000        250             1250
sum(col.)   3000        2000            5000
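The figures in Example 1 follow directly from the contingency table. The Python sketch below recomputes them and compares each confidence with the baseline frequency of the consequent; the variable names are assumptions introduced here, and the ratio computed at the end is the quantity the following slide calls lift.

```python
# Counts from the contingency table above.
n = 5000
basketball = 3000
cereal = 3750
both = 2000

# Rule: play basketball => eat cereal
support_pos = both / n                      # fraction of all students
confidence_pos = both / basketball          # fraction of basketball players
baseline_pos = cereal / n                   # fraction of all students eating cereal

# Rule: play basketball => not eat cereal
neither_cereal = basketball - both          # basketball players not eating cereal
support_neg = neither_cereal / n
confidence_neg = neither_cereal / basketball
baseline_neg = (n - cereal) / n

print(f"=> cereal:     [{support_pos:.1%}, {confidence_pos:.1%}], baseline {baseline_pos:.1%}")
print(f"=> not cereal: [{support_neg:.1%}, {confidence_neg:.1%}], baseline {baseline_neg:.1%}")

# Confidence divided by baseline: below 1 means the rule lowers the odds.
ratio_pos = confidence_pos / baseline_pos
ratio_neg = confidence_neg / baseline_neg
print(f"ratios: {ratio_pos:.3f} and {ratio_neg:.3f}")
```

The first rule has the higher confidence, yet its ratio is below 1: playing basketball makes eating cereal less likely than it is in the whole population.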
Criticism of Support and Confidence (Cont.)
Example 2:
  X and Y: positively correlated
  X and Z: negatively correlated
  yet the support and confidence of X=>Z dominate
We need a measure of dependent or correlated events:
  $corr_{A,B} = \frac{P(A \wedge B)}{P(A)\,P(B)}$
P(B|A)/P(B) is also called the lift of rule A => B
X 1 1 1 1 0 0 0 0
Y 1 1 0 0 0 0 0 0
Z 0 1 1 1 1 1 1 1
Rule Support Confidence
X=>Y 25% 50%
X=>Z 37.50% 75%
Other Interestingness Measures: Interest
Interest (correlation, lift): $\frac{P(A \wedge B)}{P(A)\,P(B)}$
  takes both P(A) and P(B) into consideration
  P(A ∧ B) = P(A) P(B) if A and B are independent events, so the value is 1
  A and B are negatively correlated if the value is less than 1, and positively correlated if it is greater than 1
X 1 1 1 1 0 0 0 0
Y 1 1 0 0 0 0 0 0
Z 0 1 1 1 1 1 1 1
Itemset Support Interest
X,Y 25% 2
X,Z 37.50% 0.86
Y,Z 12.50% 0.57
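The support, confidence and interest values for X, Y and Z can be recomputed from the eight rows of the table. In the Python sketch below, `p` and `interest` are hypothetical helper names; `p` estimates a probability as the fraction of rows in which all given columns equal 1.

```python
# Columns of the 0/1 table above, one entry per row.
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def p(*cols):
    """Fraction of rows in which every given column is 1."""
    rows = list(zip(*cols))
    return sum(all(row) for row in rows) / len(rows)

def interest(a, b):
    """P(A and B) / (P(A) * P(B)); equals 1 when A and B are independent."""
    return p(a, b) / (p(a) * p(b))

for name, a, b in [("X,Y", X, Y), ("X,Z", X, Z), ("Y,Z", Y, Z)]:
    print(f"{name}: support {p(a, b):.2%}, "
          f"confidence {p(a, b) / p(a):.0%}, interest {interest(a, b):.2f}")
```

X=>Z has the higher support and confidence, but its interest is below 1, while X=>Y has interest 2.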