
Data Mining: Association Rule Mining


  • Association Rule Mining

    Association rules find interesting associations / correlation
    relationships among large sets of data, supporting business decision
    making. Example: market basket analysis identifies items likely to be
    purchased together, which informs advertising strategy, catalog design,
    and store layout.

    Forming association rules: the universe is the set of all items, and
    each basket is represented as a Boolean vector over that universe.

    Example rule: Computer => Accounting_Software
    [support = 5%, confidence = 60%]. Rules are reported only when they
    meet the minimum support and minimum confidence thresholds.

    Basic concepts:
    I = {i1, i2, ..., im} is the set of all items.
    D is the set of database transactions; each transaction T is a set of
    items with T ⊆ I.
    An association rule has the form A => B, where A ⊂ I, B ⊂ I, and
    A ∩ B = ∅.

    Support: the percentage of transactions in D that contain both A and B,
    i.e., P(A ∪ B).
    Confidence: the percentage of transactions in D containing A that also
    contain B, i.e., P(B|A), so that
    confidence(A => B) = support(A ∪ B) / support(A).
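    As a concrete illustration of these two definitions, here is a minimal
    Python sketch; the transactions and the rule {Computer} =>
    {Accounting_Software} are made-up example data, not taken from the
    slides.

        # Toy transaction database D: each transaction is a set of items.
        transactions = [
            {"Computer", "Accounting_Software", "Printer"},
            {"Computer", "Accounting_Software"},
            {"Computer", "Printer"},
            {"Printer"},
        ]

        def support(itemset, transactions):
            """Fraction of transactions containing every item in `itemset`."""
            hits = sum(1 for t in transactions if itemset <= t)
            return hits / len(transactions)

        def confidence(A, B, transactions):
            """P(B|A) = support(A ∪ B) / support(A)."""
            return support(A | B, transactions) / support(A, transactions)

        A, B = {"Computer"}, {"Accounting_Software"}
        print(support(A | B, transactions))    # 0.5     (support of A => B)
        print(confidence(A, B, transactions))  # 0.666...(confidence of A => B)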

  • Itemsets

    An itemset is a set of items; a k-itemset contains k items. The
    occurrence frequency of an itemset is also called its frequency,
    support_count (absolute support), or simply its count.

    An itemset satisfies minimum support when its count >= min_sup ×
    (number of transactions in D); this product is the minimum support
    count. An itemset that satisfies minimum support is a frequent itemset.

    Association rule mining is a two-step process:
    1. Find all frequent itemsets.
    2. Generate strong association rules from the frequent itemsets, i.e.,
       rules satisfying both minimum support and minimum confidence (a
       sketch of this step follows below).
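    A minimal sketch of step 2, assuming the frequent itemsets and their
    support counts have already been found; the freq mapping and item names
    below are illustrative, not from the slides.

        from itertools import combinations

        def strong_rules(freq, min_conf):
            """Yield rules A => B with confidence(A => B) >= min_conf.
            freq maps each frequent itemset (frozenset) to its support
            count; every subset A of a frequent itemset is also in freq."""
            for itemset, count in freq.items():
                for r in range(1, len(itemset)):
                    for A in map(frozenset, combinations(itemset, r)):
                        conf = count / freq[A]   # |D| cancels in the ratio
                        if conf >= min_conf:
                            yield set(A), set(itemset - A), conf

        freq = {frozenset({"beer"}): 2, frozenset({"diaper"}): 3,
                frozenset({"beer", "diaper"}): 2}
        for A, B, conf in strong_rules(freq, min_conf=0.7):
            print(A, "=>", B, f"conf={conf:.2f}")
        # {'beer'} => {'diaper'} conf=1.00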

    Beyond the complete set of itemsets, two compressed representations are
    important:

    Closed frequent itemset: X is closed in a data set S if there exists no
    proper super-itemset Y such that Y has the same support count as X in
    S; X is a closed frequent itemset if X is also frequent.

    Maximal frequent itemset: X is frequent and there exists no
    super-itemset Y such that X ⊂ Y and Y is frequent in S.

    Example: T = { {a1, a2, ..., a100}, {a1, a2, ..., a50} }, min_sup = 1.
    Closed frequent itemsets: both {a1, a2, ..., a100} : 1 and
    {a1, a2, ..., a50} : 2.
    Maximal frequent itemset: only {a1, a2, ..., a100}.
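    This example can be verified directly in Python. The sketch below tests
    only immediate supersets X ∪ {i}, which suffices because support is
    anti-monotone; the helper names are illustrative.

        from itertools import chain

        transactions = [frozenset(f"a{i}" for i in range(1, 101)),  # {a1,...,a100}
                        frozenset(f"a{i}" for i in range(1, 51))]   # {a1,...,a50}
        universe = frozenset(chain.from_iterable(transactions))
        min_sup = 1  # minimum support count

        def count(X):
            return sum(1 for t in transactions if X <= t)

        def is_closed(X):
            # Closed: no immediate superset has the same support count.
            return all(count(X | {i}) < count(X) for i in universe - X)

        def is_maximal(X):
            # Maximal: frequent, and no immediate superset is frequent.
            return count(X) >= min_sup and \
                   all(count(X | {i}) < min_sup for i in universe - X)

        X100, X50 = transactions
        print(count(X100), is_closed(X100), is_maximal(X100))  # 1 True True
        print(count(X50),  is_closed(X50),  is_maximal(X50))   # 2 True False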

  • Types of Association Rules

    Based on types of values: Boolean vs. quantitative association rules.
    Based on dimensions of data: single-dimensional vs. multi-dimensional
    rules.
    Based on levels of abstraction: multilevel association rules.
    Based on kinds of rules: association rules, correlation rules, strong
    gradient relationships.
    Based on completeness of patterns: complete, closed, maximal, top-k,
    constrained, approximate.

    Mining Single-Dimensional Boolean Association Rules: the Apriori
    Algorithm

    Apriori finds frequent itemsets using candidate generation. It uses
    prior knowledge of frequent itemset properties in a level-wise search:
    frequent k-itemsets are used to explore (k+1)-itemsets. First the
    frequent 1-itemsets L1 are found; L1 is used to find L2, and so on.

    Apriori property (used to reduce the search space): all non-empty
    subsets of a frequent itemset must also be frequent. Equivalently, if
    P(I) < min_sup, then P(I ∪ A) < min_sup. This is an anti-monotone
    property: if a set cannot pass a test, all of its supersets will fail
    the test as well (a numeric check of this appears below).
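    As promised above, a small numeric check of the anti-monotone property
    on made-up basket data: a superset never has higher support than any of
    its subsets.

        from itertools import combinations

        transactions = [{"bread", "milk"}, {"bread", "butter"},
                        {"bread", "milk", "butter"}, {"milk"}]

        def sup(X):
            return sum(1 for t in transactions if set(X) <= t)

        items = sorted(set().union(*transactions))
        for k in (1, 2):
            for X in combinations(items, k):
                for extra in set(items) - set(X):
                    # A superset can never gain support over its subset.
                    assert sup(X + (extra,)) <= sup(X)
        print("anti-monotone property holds on this dataset")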

  • Applying the Apriori Property

    Join step: to find Lk, Lk-1 is joined with itself to produce the
    candidate set Ck. Writing li[j] for the jth item in li, members l1 and
    l2 of Lk-1 are joinable if their first (k-2) items are common:
    (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧
    (l1[k-1] < l2[k-1]).
    The resulting candidate itemset is l1[1], l1[2], ..., l1[k-1], l2[k-1].

    Prune step: Ck is a superset of Lk, so the count of each candidate in
    Ck must still be determined. To reduce the size of Ck, if any
    (k-1)-subset of a candidate is not in Lk-1, the candidate can be
    removed from Ck (a Python sketch of both steps follows).
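    A sketch of the join and prune steps, assuming each itemset is stored
    as a sorted tuple so that comparing the first k-2 items and the last
    item matches the l[1..k-1] notation above; the L2 data is made up.

        from itertools import combinations

        def apriori_gen(L_prev, k):
            """Generate candidate k-itemsets Ck from frequent (k-1)-itemsets."""
            L_prev, Ck = set(L_prev), set()
            for l1 in L_prev:
                for l2 in L_prev:
                    # Join step: first k-2 items equal, last item of l1 < l2's.
                    if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                        c = l1 + (l2[-1],)
                        # Prune step: every (k-1)-subset must be frequent.
                        if all(s in L_prev for s in combinations(c, k - 1)):
                            Ck.add(c)
            return Ck

        L2 = [("beer", "diaper"), ("beer", "nuts"),
              ("diaper", "nuts"), ("diaper", "milk")]
        # ("diaper","milk","nuts") is joined but pruned: ("milk","nuts") not in L2.
        print(sorted(apriori_gen(L2, 3)))   # [('beer', 'diaper', 'nuts')]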

    The Apriori Algorithm, pseudo-code:

        Ck: candidate itemsets of size k
        Lk: frequent itemsets of size k

        L1 = {frequent items};
        for (k = 2; Lk-1 != ∅; k++) do begin
            Ck = candidates generated from Lk-1;
            for each transaction t in the database do
                increment the count of all candidates in Ck
                that are contained in t;
            Lk = candidates in Ck with min_support;
        end
        return ∪k Lk;

  • The Apriori Algorithm, full pseudo-code:

        Input:  D, a database of transactions;
                min_sup, the minimum support count threshold
        Output: L, the frequent itemsets in D

        L1 = find_frequent_1-itemsets(D);
        for (k = 2; Lk-1 != ∅; k++) {
            Ck = apriori_gen(Lk-1, min_sup);
            for each transaction t ∈ D {
                Ct = subset(Ck, t);        // candidates contained in t
                for each candidate c ∈ Ct
                    c.count++;
            }
            Lk = {c ∈ Ck | c.count >= min_sup};
        }
        return L = ∪k Lk;

        procedure apriori_gen(Lk-1, min_sup)
            for each itemset l1 ∈ Lk-1
                for each itemset l2 ∈ Lk-1
                    if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧
                       (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) {
                        c = l1 join l2;                  // join step
                        if has_infrequent_subset(c, Lk-1) then
                            delete c;                    // prune step
                        else add c to Ck;
                    }
            return Ck;

        procedure has_infrequent_subset(c, Lk-1)
            for each (k-1)-subset s of c
                if s ∉ Lk-1 then return TRUE;
            return FALSE;
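    For completeness, a compact runnable translation of the pseudo-code
    above into Python; apriori_gen is repeated inline (condensed) from the
    earlier sketch, the three-transaction database is made up, and
    min_sup_count is an absolute count, matching the pseudo-code's min_sup.

        from collections import defaultdict
        from itertools import combinations

        def apriori_gen(L_prev, k):
            # Condensed join + prune from the earlier sketch.
            return {l1 + (l2[-1],)
                    for l1 in L_prev for l2 in L_prev
                    if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]
                    and all(s in L_prev
                            for s in combinations(l1 + (l2[-1],), k - 1))}

        def apriori(transactions, min_sup_count):
            transactions = [frozenset(t) for t in transactions]
            counts = defaultdict(int)        # L1 = find_frequent_1-itemsets(D)
            for t in transactions:
                for item in t:
                    counts[(item,)] += 1
            Lk = {c for c, n in counts.items() if n >= min_sup_count}
            L, k = set(Lk), 2
            while Lk:                        # for (k = 2; Lk-1 != ∅; k++)
                Ck = apriori_gen(Lk, k)
                counts = defaultdict(int)
                for t in transactions:       # count candidates contained in t
                    for c in Ck:
                        if set(c) <= t:
                            counts[c] += 1
                Lk = {c for c, n in counts.items() if n >= min_sup_count}
                L |= Lk
                k += 1
            return L                         # L = ∪k Lk

        D = [{"beer", "diaper", "nuts"}, {"beer", "diaper"}, {"diaper", "milk"}]
        print(sorted(apriori(D, min_sup_count=2)))
        # [('beer',), ('beer', 'diaper'), ('diaper',)]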
