Data Mining-Association Mining 1

Association Rule Mining

Association rules
  Find interesting associations and correlation relationships among large sets of data
  Support business decision making
Example: market basket analysis
  Identifies items likely to be purchased together
  Informs advertising strategy, catalog design, and store layout

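Association Rules

Forming association rules
  Universe: the set of all items
  Each transaction is represented as a Boolean vector over the items (present or absent)
Example: Computer ⇒ Accounting_Software [support = 5%, confidence = 60%]
  5% of all transactions contain both items; 60% of the transactions that contain Computer also contain Accounting_Software
  A rule is reported only if it meets the minimum support and minimum confidence thresholds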

Basic Concepts

I = {i1, i2, ..., im}: set of items
D: set of database transactions
T: a transaction, containing a set of items such that T ⊆ I
Association rule: A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
Support: percentage of transactions in D containing both A and B
  support(A ⇒ B) = P(A ∪ B)
Confidence: percentage of transactions in D containing A that also contain B
  confidence(A ⇒ B) = P(B | A) = support(A ∪ B) / support(A)
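The two measures can be computed directly from their definitions. The following is a minimal sketch, not part of the original slides: the function names `support` and `confidence` and the five sample transactions are illustrative assumptions.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A ∪ B) / support(A); assumes A occurs in at least one transaction."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return support(both, transactions) / support(a, transactions)


# Hypothetical transactions, chosen only to illustrate the arithmetic.
transactions = [
    frozenset({"computer", "accounting_software"}),
    frozenset({"computer", "printer"}),
    frozenset({"computer", "accounting_software", "printer"}),
    frozenset({"printer"}),
    frozenset({"computer"}),
]

rule_support = support({"computer", "accounting_software"}, transactions)
rule_confidence = confidence({"computer"}, {"accounting_software"}, transactions)
print(rule_support, rule_confidence)
```

Here 2 of the 5 transactions contain both items and 4 contain the antecedent, so the rule has support 2/5 and confidence 2/4.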

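Itemset

Itemset: a set of items; a k-itemset contains k items
Occurrence frequency of an itemset: the number of transactions that contain it
  Also called frequency, support_count (absolute support), or count
An itemset satisfies minimum support when count >= min_sup * number of transactions in D
  This threshold is the minimum support count
Frequent itemset: an itemset that satisfies minimum support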
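The minimum support test can be sketched for single items as follows. This is an illustration under assumptions: the helper name `frequent_items` and the sample data are not from the slides, and min_sup is taken as a relative fraction that is converted to a count threshold exactly as in the condition above.

```python
from collections import Counter


def frequent_items(transactions, min_sup):
    """Return {item: count} for items with count >= min_sup * |D|."""
    threshold = min_sup * len(transactions)  # minimum support count
    counts = Counter(item for t in transactions for item in t)
    return {item: n for item, n in counts.items() if n >= threshold}


# Hypothetical transactions for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "butter"},
]

# 0.75 * 4 transactions gives a minimum support count of 3.
frequent = frequent_items(transactions, min_sup=0.75)
print(frequent)
```

With these values bread and milk each occur 3 times and pass the threshold, while butter occurs only twice and is dropped.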

Association Rule Mining Process

Find all frequent itemsets
Generate strong association rules from the frequent itemsets
  Strong rules satisfy both minimum support and minimum confidence
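The second step of the process can be sketched as below. The slides do not spell out a rule-generation procedure, so this is one common approach stated as an assumption: every non-empty proper subset of a frequent itemset is tried as an antecedent. The name `strong_rules` and the support counts are illustrative.

```python
from itertools import combinations


def strong_rules(support_counts, min_conf):
    """Generate rules A => B with confidence >= min_conf.

    support_counts maps each frequent itemset (a frozenset) to its count and
    is assumed to contain every non-empty subset of every itemset it holds.
    """
    rules = []
    for itemset, count in support_counts.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                a = frozenset(antecedent)
                conf = count / support_counts[a]
                if conf >= min_conf:
                    rules.append((a, itemset - a, conf))
    return rules


# Hypothetical support counts for illustration.
support_counts = {
    frozenset({"bread"}): 3,
    frozenset({"milk"}): 3,
    frozenset({"butter"}): 2,
    frozenset({"bread", "milk"}): 2,
    frozenset({"bread", "butter"}): 2,
}

rules = strong_rules(support_counts, min_conf=0.8)
for a, b, conf in rules:
    print(sorted(a), "=>", sorted(b), conf)
```

Only butter ⇒ bread survives here (confidence 2/2); the other three candidate rules have confidence 2/3 and fall below the 0.8 threshold.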

Itemsets

Complete itemsets: the full set of frequent itemsets
Closed frequent itemset
  X is closed in a data set S if there exists no proper super-itemset Y such that Y has the same support count as X in S
  X is a closed frequent itemset if X is both closed and frequent
Maximal frequent itemset
  X is frequent and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in S
Example: T = { {a1, a2, ..., a100}, {a1, a2, ..., a50} }, min_sup = 1
  Closed frequent itemsets: { {a1, a2, ..., a100}: 1, {a1, a2, ..., a50}: 2 }
  Maximal frequent itemset: {a1, a2, ..., a100}
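The closed and maximal definitions can be checked by brute force on a scaled-down version of the example, with 4 and 2 items standing in for 100 and 50. This is a sketch for tiny inputs only; the function names are assumptions, and the exhaustive enumeration is not how a real miner would proceed.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Enumerate every itemset and keep those with count >= min_count."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = sum(1 for t in transactions if candidate <= t)
            if count >= min_count:
                result[candidate] = count
    return result


def closed_itemsets(frequent):
    """Frequent itemsets with no proper superset of equal support count."""
    return {x: c for x, c in frequent.items()
            if not any(x < y and c == cy for y, cy in frequent.items())}


def maximal_itemsets(frequent):
    """Frequent itemsets with no frequent proper superset."""
    return {x: c for x, c in frequent.items()
            if not any(x < y for y in frequent)}


# Scaled-down analogue of the slide example (an assumption for illustration).
transactions = [frozenset({"a1", "a2", "a3", "a4"}), frozenset({"a1", "a2"})]
frequent = frequent_itemsets(transactions, min_count=1)
closed = closed_itemsets(frequent)
maximal = maximal_itemsets(frequent)
print(len(frequent), closed, maximal)
```

All 15 non-empty subsets of the longer transaction are frequent, yet only two of them are closed and one is maximal, which is why the closed and maximal sets are much more compact summaries.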

Types of Association Rules

Types of values: Boolean, quantitative association rules
Dimensions of data: single-dimensional, multi-dimensional
Level of abstraction: multilevel association rules
Based on kinds of rules: association rules, correlation rules, strong gradient relationships
Based on completeness of patterns: complete, closed, maximal, top-k, constrained, approximate

Mining Single-Dimensional Boolean Association Rules

Apriori algorithm: finding frequent itemsets using candidate generation
  Uses prior knowledge of frequent itemset properties
  Level-wise search: k-itemsets are used to explore (k+1)-itemsets
  First find the frequent 1-itemsets, L1; L1 is used to find L2, L2 to find L3, and so on until no more frequent k-itemsets can be found

Apriori Property

Reduces the search space
All non-empty subsets of a frequent itemset must also be frequent
  If P(I) < min_sup, then P(I ∪ A) < min_sup for any item A
Anti-monotone property: if a set cannot pass a test, all of its supersets will fail the test as well
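The anti-monotone behaviour can be observed directly: adding an item to an itemset can never raise its support count. A small sketch, with hypothetical single-letter items and an assumed helper name `support_count`:

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


# Hypothetical data: each transaction is a set of single-letter items.
transactions = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]

base = frozenset("a")
for extra in "bc":
    extended = base | {extra}
    # Every transaction containing the extended set also contains the base set,
    # so the extended count cannot exceed the base count.
    assert support_count(extended, transactions) <= support_count(base, transactions)
    print(sorted(extended), support_count(extended, transactions))
```

This is the reason an infrequent itemset can be discarded together with all of its supersets without counting them.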

Applying the Apriori Property

Join step
  To find Lk, join Lk-1 with itself to generate the candidate set Ck
  li[j] denotes the jth item in li
  Members of Lk-1 are joinable if their first (k-2) items are common
  Members l1 and l2 of Lk-1 are joinable if
    (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1])
  The resulting itemset is l1[1], l1[2], ..., l1[k-1], l2[k-1]
Prune step
  Ck is a superset of Lk
  Determine the count of each candidate in Ck
  To reduce the size of Ck: if any (k-1)-subset of a candidate is not in Lk-1, the candidate can be removed from Ck
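The join and prune steps can be sketched together. This is an illustration under assumptions: itemsets are represented as sorted tuples so that the "first k-2 items" comparison is a simple prefix test, and the sample L2 is hypothetical.

```python
from itertools import combinations


def apriori_gen(prev_frequent):
    """Build Ck from Lk-1, given as a set of sorted tuples of equal length."""
    prev = sorted(prev_frequent)
    candidates = set()
    for i, l1 in enumerate(prev):
        for l2 in prev[i + 1:]:
            # Join step: first k-2 items equal, last item of l1 smaller.
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                # Prune step: every (k-1)-subset of c must be in Lk-1.
                if all(s in prev_frequent for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates


# Hypothetical L2 for illustration.
L2 = {("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")}
C3 = apriori_gen(L2)
print(C3)
```

The join produces (a, b, c), (a, b, d), and (a, c, d); the last two are pruned because (b, d) and (c, d) are not in L2, leaving only (a, b, c) to be counted against the database.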

The Apriori Algorithm

Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 2; Lk-1 != ∅; k++) do begin
      Ck = candidates generated from Lk-1;
      for each transaction t in database do
          increment the count of all candidates in Ck that are contained in t
      Lk = candidates in Ck with min_support
  end
  return ∪k Lk;

The Apriori Algorithm: An Example [slide content not preserved]

Apriori Algorithm

Input: database of transactions D; min_sup, the minimum support count
Output: L, the frequent itemsets in D

  L1 = find_frequent_1-itemsets(D);
  for (k = 2; Lk-1 != ∅; k++) {
      Ck = apriori_gen(Lk-1, min_sup);
      for each transaction t ∈ D {
          Ct = subset(Ck, t);        // candidates contained in t
          for each candidate c ∈ Ct
              c.count++;
      }
      Lk = {c ∈ Ck | c.count >= min_sup}
  }
  return L = ∪k Lk;

  procedure apriori_gen(Lk-1, min_sup)
      for each itemset l1 ∈ Lk-1
          for each itemset l2 ∈ Lk-1
              if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) {
                  c = l1 join l2;                        // Join step
                  if has_infrequent_subset(c, Lk-1) then
                      delete c;                          // Prune step
                  else add c to Ck;
              }
      return Ck;

  procedure has_infrequent_subset(c, Lk-1)
      for each (k-1)-subset s of c
          if s is not an element of Lk-1 then
              return TRUE;
      return FALSE;
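The pseudo-code above can be turned into a short runnable sketch. Assumptions: itemsets are sorted tuples, the threshold is an absolute support count (named `min_count` here), and the four transactions are made-up illustration data rather than the example from the slides.

```python
from itertools import combinations


def apriori_gen(prev_frequent):
    """Join Lk-1 with itself, then prune candidates with an infrequent subset."""
    prev = sorted(prev_frequent)
    candidates = set()
    for i, l1 in enumerate(prev):
        for l2 in prev[i + 1:]:
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                if all(s in prev_frequent for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates


def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items in one pass over the database.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    current = {c: n for c, n in counts.items() if n >= min_count}
    frequent = dict(current)
    # Level-wise search: stop when the previous level is empty.
    while current:
        candidates = apriori_gen(set(current))
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
    return frequent


# Hypothetical database of four transactions over integer item ids.
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent = apriori(transactions, min_count=2)
for itemset in sorted(frequent, key=lambda s: (len(s), s)):
    print(itemset, frequent[itemset])
```

On this data the search finds four frequent 1-itemsets, four frequent 2-itemsets, and the single frequent 3-itemset (2, 3, 5); item 4 is dropped at the first level, so no candidate containing it is ever generated.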

