raj-endran
Data Mining-Association Mining 1

Association Rule Mining
  Association rules
    Find interesting association / correlation relationships among large sets of data
    Support business decision making
  Example: market basket analysis
    Identifies items likely to be purchased together
    Informs advertising strategy, catalog design, store layout
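
Association Rules
  Forming association rules
    Universe: the set of all items
    Each transaction is represented as a Boolean vector over the items
  Example: Computer ⇒ Accounting_Software [support = 5%, confidence = 60%]
  Rules must satisfy a minimum support threshold and a minimum confidence threshold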

Basic Concepts
  I = {i1, i2, …, im}: set of items
  D: set of database transactions
  T: a transaction, which contains a set of items such that T ⊆ I
  Association rule: A ⇒ B, where A ⊂ I, B ⊂ I and A ∩ B = ∅
  Support: percentage of transactions in D containing both A and B, P(A ∪ B)
  Confidence: percentage of transactions in D containing A that also contain B, P(B|A)
    Confidence(A ⇒ B) = Support(A ∪ B) / Support(A)
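
The support and confidence definitions can be checked on a small example. The sketch below is illustrative only: the five transactions and the helper names support and confidence are assumptions made for this example, not part of the original slides.

```python
from fractions import Fraction

# Hypothetical toy database D: each transaction is a set of items.
transactions = [
    {"computer", "accounting_software"},
    {"computer", "printer"},
    {"computer", "accounting_software", "printer"},
    {"printer"},
    {"computer"},
]

def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = frozenset(itemset)
    return Fraction(sum(1 for t in db if itemset <= t), len(db))

def confidence(antecedent, consequent, db):
    """Support(A ∪ B) / Support(A) for the rule A => B."""
    a = frozenset(antecedent)
    b = frozenset(consequent)
    return support(a | b, db) / support(a, db)

# Rule: computer => accounting_software
rule_support = support({"computer", "accounting_software"}, transactions)
rule_confidence = confidence({"computer"}, {"accounting_software"}, transactions)
print(rule_support, rule_confidence)  # prints: 2/5 1/2
```

Two of the five transactions contain both items, and four contain computer, so the rule has support 2/5 and confidence (2/5) / (4/5) = 1/2.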
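
Itemset
  k-itemset: an itemset containing k items
  Occurrence frequency of an itemset: the number of transactions that contain it
    Also called frequency, support_count (absolute support) or count
  An itemset satisfies minimum support when count >= min_sup * number of transactions in D
    This threshold is the minimum support count
  Frequent itemset: an itemset that satisfies minimum support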
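
As a rough illustration of the relation between a relative min_sup and the minimum support count, the sketch below uses exact fractions to avoid floating-point rounding. The transactions and the function names min_support_count, support_count and is_frequent are hypothetical, chosen only for this example.

```python
from fractions import Fraction
from math import ceil

def min_support_count(min_sup, num_transactions):
    """Smallest integer count c such that c >= min_sup * num_transactions."""
    return ceil(Fraction(min_sup) * num_transactions)

def support_count(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    s = set(itemset)
    return sum(1 for t in db if s <= t)

def is_frequent(itemset, db, min_sup):
    """True when count >= min_sup * number of transactions in D."""
    return support_count(itemset, db) >= Fraction(min_sup) * len(db)

# Hypothetical database of four transactions.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]

print(min_support_count(Fraction(1, 2), len(transactions)))   # prints: 2
print(is_frequent({"a", "b"}, transactions, Fraction(1, 2)))  # prints: True
print(is_frequent({"b", "c"}, transactions, Fraction(1, 2)))  # prints: False
```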

Association Rule Mining Process
  Find all frequent itemsets
  Generate strong association rules from the frequent itemsets
    Strong rules satisfy both minimum support and minimum confidence
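
The second step, generating strong rules from frequent itemsets, might be sketched as follows. This is a minimal illustration, assuming the support counts of all frequent itemsets are already available in a dictionary; the counts shown and the name strong_rules are made up for the example.

```python
from itertools import combinations

# Hypothetical support counts of frequent itemsets (output of step 1).
support_counts = {
    frozenset({"a"}): 3,
    frozenset({"b"}): 3,
    frozenset({"a", "b"}): 2,
}

def strong_rules(support_counts, min_conf):
    """Return (antecedent, consequent, confidence) for rules meeting min_conf.

    Every itemset in support_counts is assumed to be frequent already, so
    only the confidence threshold still has to be checked.
    """
    rules = []
    for itemset, count in support_counts.items():
        if len(itemset) < 2:
            continue  # a rule needs a nonempty antecedent and consequent
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                # Confidence(A => B) = count(A ∪ B) / count(A)
                conf = count / support_counts[a]
                if conf >= min_conf:
                    rules.append((a, itemset - a, conf))
    return rules

for a, b, conf in strong_rules(support_counts, min_conf=0.6):
    print(sorted(a), "=>", sorted(b), round(conf, 2))
```

The lookup support_counts[a] relies on every nonempty subset of a frequent itemset also being frequent, so its count is present in the dictionary.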

Itemsets
  Complete itemsets
  Closed frequent itemset
    X is closed in a data set S if there exists no proper super-itemset Y such that Y has the same support count as X in S
    X is a closed frequent itemset if it is both closed and frequent in S
  Maximal frequent itemset
    X is frequent and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in S
  Example: T = {{a1, a2, …, a100}, {a1, a2, …, a50}}, min_sup = 1
    Closed frequent itemsets: both {a1, a2, …, a100}: 1 and {a1, a2, …, a50}: 2
    Maximal frequent itemset: {a1, a2, …, a100}
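
The closed and maximal definitions can be tried on a scaled-down version of the example, with a1..a4 and a1..a2 standing in for a1..a100 and a1..a50. The sketch below enumerates every itemset by brute force, which is only workable for tiny inputs; the variable names are assumptions for illustration.

```python
from itertools import combinations

# Scaled-down stand-in for T = {{a1..a100}, {a1..a50}}.
transactions = [frozenset({"a1", "a2", "a3", "a4"}), frozenset({"a1", "a2"})]
min_count = 1

# Brute-force enumeration of all frequent itemsets with their support counts.
items = sorted(set().union(*transactions))
counts = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = frozenset(combo)
        c = sum(1 for t in transactions if s <= t)
        if c >= min_count:
            counts[s] = c

# Closed: no proper superset has the same support count.
closed = {x: c for x, c in counts.items()
          if not any(x < y and counts[y] == c for y in counts)}

# Maximal: no proper superset is frequent.
maximal = [x for x in counts if not any(x < y for y in counts)]

for x, c in closed.items():
    print("closed:", sorted(x), c)
for x in maximal:
    print("maximal:", sorted(x))
```

Here 15 itemsets are frequent, but only {a1, a2} with count 2 and {a1, a2, a3, a4} with count 1 are closed, and only the latter is maximal, mirroring the slide's example.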

Types of Association Rules
  Based on types of values
    Boolean, quantitative association rules
  Based on dimensions of data
    Single-dimensional, multi-dimensional
  Based on level of abstraction
    Multilevel association rules
  Based on kinds of rules
    Association rules, correlation rules, strong gradient relationships
  Based on completeness of patterns
    Complete, closed, maximal, top-k, constrained, approximate

Mining Single-Dimensional Boolean Association Rules
  Apriori algorithm
    Finds frequent itemsets using candidate generation
    Uses prior knowledge of frequent itemset properties
  Level-wise search
    k-itemsets are used to explore (k+1)-itemsets
    First find the frequent 1-itemsets, L1
    L1 is used to find L2, and so on

Apriori Property
  Reduces the search space
  All nonempty subsets of a frequent itemset must also be frequent
    If P(I) < min_sup then P(I ∪ A) < min_sup
  Anti-monotone property
    If a set cannot pass a test, all of its supersets will fail the test as well
    Any subset of a frequent itemset must be frequent
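
The anti-monotone property can be checked empirically: adding an item to an itemset never raises its support count. The sketch below does this exhaustively on a hypothetical four-transaction database; the data and names are assumptions for illustration.

```python
from itertools import combinations

# Hypothetical toy database.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
items = sorted(set().union(*transactions))

def support_count(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    s = set(itemset)
    return sum(1 for t in db if s <= t)

# Look for any itemset I and item A where count(I ∪ {A}) > count(I).
violations = []
for k in range(1, len(items)):
    for combo in combinations(items, k):
        for extra in items:
            if extra in combo:
                continue
            if support_count(combo + (extra,), transactions) > support_count(combo, transactions):
                violations.append((combo, extra))

print(violations)  # prints: []

# With a minimum support count of 2, {b, c} is infrequent (count 1),
# so its superset {a, b, c} cannot be frequent either.
print(support_count({"b", "c"}, transactions))       # prints: 1
print(support_count({"a", "b", "c"}, transactions))  # prints: 1
```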

Apriori Property Application
  Join step
    To find Lk, join Lk-1 with itself to produce the candidate set Ck
    li[j] denotes the jth item in li
    Members of Lk-1 are joinable if their first (k-2) items are in common
    Members l1 and l2 of Lk-1 are joinable if
      (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ … ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1])
    The resulting itemset is l1[1], l1[2], …, l1[k-1], l2[k-1]
  Prune step
    Ck is a superset of Lk
    Determine the count of each candidate in Ck to find Lk
    To reduce the size of Ck: if any (k-1)-subset of a candidate is not in Lk-1, the candidate can be removed from Ck
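
The join and prune steps might look like this in Python. It is a sketch under stated assumptions: itemsets are stored as sorted tuples, the function takes only Lk-1 (the prune step here needs no min_sup argument), and the sample L2 is invented for the example.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Generate candidate k-itemsets Ck from Lk-1.

    prev_frequent is a set of sorted tuples, all of length k-1.
    """
    candidates = set()
    for l1 in prev_frequent:
        for l2 in prev_frequent:
            # Join step: first k-2 items equal and l1[k-1] < l2[k-1].
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                # Prune step: drop c if any (k-1)-subset is not in Lk-1.
                if all(s in prev_frequent for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates

# Hypothetical L2 over items 1..5.
L2 = {(1, 2), (1, 3), (1, 5), (2, 3), (2, 4), (2, 5)}
C3 = apriori_gen(L2)
print(sorted(C3))  # prints: [(1, 2, 3), (1, 2, 5)]
```

The join alone would produce six candidates; four of them, such as (1, 3, 5), are pruned because a 2-subset like (3, 5) is not in L2. The condition l1[k-1] < l2[k-1] ensures each candidate is generated only once.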

The Apriori Algorithm
  Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 2; Lk-1 != ∅; k++) do begin
        Ck = candidates generated from Lk-1;
        for each transaction t in database do
            increment the count of all candidates in Ck that are contained in t
        Lk = candidates in Ck with min_support
    end
    return ∪k Lk;

The Apriori Algorithm: An Example
  Apriori Algorithm
    Input: database of transactions D, min_sup
    Output: L, frequent itemsets

    L1 = find_frequent_1-itemsets(D);
    for (k = 2; Lk-1 ≠ ∅; k++) {
        Ck = apriori_gen(Lk-1, min_sup);
        for each transaction t ∈ D {
            Ct = subset(Ck, t);
            for each candidate c ∈ Ct
                c.count++;
        }
        Lk = {c ∈ Ck | c.count >= min_sup}
    }
    return L = ∪k Lk;

  procedure apriori_gen(Lk-1, min_sup)
    for each itemset l1 ∈ Lk-1
        for each itemset l2 ∈ Lk-1
            if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ … ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) {
                c = l1 join l2;    // Join step
                if has_infrequent_subset(c, Lk-1) then
                    delete c;      // Prune step
                else add c to Ck;
            }
    return Ck;

  procedure has_infrequent_subset(c, Lk-1)
    for each (k-1)-subset s of c
        if s is not an element of Lk-1 then
            return TRUE;
    return FALSE;
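
For readers who want to run the algorithm, here is one possible Python rendering of the pseudo-code. It is a sketch rather than a reference implementation: itemsets are sorted tuples, min_count is an absolute support count, candidate counting is a plain scan instead of a subset function over a hash tree, and the nine-transaction database is illustrative data that does not appear in the slides.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Join Lk-1 with itself, then prune candidates with an infrequent subset."""
    candidates = set()
    for l1 in prev_frequent:
        for l2 in prev_frequent:
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                if all(s in prev_frequent for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates

def apriori(transactions, min_count):
    """Return {itemset tuple: support count} for all frequent itemsets."""
    db = [frozenset(t) for t in transactions]

    # L1: count single items and keep those meeting the minimum count.
    item_counts = {}
    for t in db:
        for item in t:
            item_counts[(item,)] = item_counts.get((item,), 0) + 1
    current = {c: n for c, n in item_counts.items() if n >= min_count}

    frequent = dict(current)
    while current:  # loop while Lk-1 is nonempty
        candidates = apriori_gen(set(current))
        counts = {c: 0 for c in candidates}
        for t in db:  # one scan of D per level
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
    return frequent

# Illustrative database of nine transactions (not from the slides).
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

result = apriori(D, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), s)):
    print(itemset, result[itemset])
```

With a minimum support count of 2 this run finds five frequent 1-itemsets, six frequent 2-itemsets and two frequent 3-itemsets, (I1, I2, I3) and (I1, I2, I5), after which the candidate set is empty and the loop stops.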