Mining Association Rules


Page 1:

Mining Association Rules

Page 2:

Data Mining Overview

Data Mining
Data warehouses and OLAP (On-Line Analytical Processing)
Association Rules Mining
Clustering: Hierarchical and Partitional approaches
Classification: Decision Trees and Bayesian classifiers
Sequential Patterns Mining
Advanced topics: outlier detection, web mining

Page 3:

Association Rules: Background

Given: (1) a database of transactions; (2) each transaction is a list of items (purchased by a customer in one visit)

Find: all association rules that satisfy a user-specified minimum support and minimum confidence

Example: 30% of transactions that contain beer also contain diapers; 5% of transactions contain both items

30% is the confidence of the rule; 5% is the support of the rule

We are interested in finding all such rules, rather than verifying whether one given rule holds

Page 4:

Rule Measures: Support and Confidence

Find all rules X ∧ Y ⇒ Z with minimum confidence and support

Support, s: the probability that a transaction contains {X, Y, Z}

Confidence, c: the conditional probability that a transaction containing {X, Y} also contains Z

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

With minimum support 50% and minimum confidence 50%, we have:

A ⇒ C (support 50%, confidence 66.6%)
C ⇒ A (support 50%, confidence 100%)

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
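To make the two measures concrete, here is a minimal Python sketch (the helper names support and confidence are our own, not from the slides) that reproduces the numbers above for this toy database:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t)) / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(lhs ∪ rhs) / support(lhs): P(rhs in t | lhs in t)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

D = [{'A', 'B', 'C'}, {'A', 'C'}, {'A', 'D'}, {'B', 'E', 'F'}]
print(support({'A', 'C'}, D))        # 0.5    -> A => C has 50% support
print(confidence({'A'}, {'C'}, D))   # 0.666  -> A => C has 66.6% confidence
print(confidence({'C'}, {'A'}, D))   # 1.0    -> C => A has 100% confidence
```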

Page 5:

Application Examples

Market Basket Analysis
  * ⇒ Maintenance Agreement (what should the store do to boost Maintenance Agreement sales?)
  Home Electronics ⇒ * (what other products should the store stock up on if it runs a sale on Home Electronics?)

Attached mailing in direct marketing

Detecting "ping-pong"ing of patients
  Transaction: a patient
  Item: a doctor/clinic visited by the patient
  Support of the rule: number of common patients

HIC Australia "success story"

Page 6:

Problem Statement

I = {i1, i2, …, im}: a set of literals, called items
Transaction T: a set of items such that T ⊆ I
Database D: a set of transactions
A transaction contains X, a set of items in I, if X ⊆ T
An association rule is an implication of the form X ⇒ Y, where X, Y ⊆ I

The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y

The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y

Find all rules that have support and confidence greater than the user-specified minimum support and minimum confidence

Page 7:

Association Rule Mining: A Road Map

Boolean vs. quantitative associations (based on the types of values handled)
  buys(x, "SQLServer") ∧ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
  age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]

Single-dimensional vs. multidimensional associations (see the examples above)

Single-level vs. multiple-level analysis
  What brands of beers are associated with what brands of diapers?

Various extensions
  Correlation, causality analysis: association does not necessarily imply correlation or causality
  Constraints enforced: e.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?

Page 8:

Problem Decomposition

1. Find all sets of items that have minimum support (frequent itemsets)

2. Use the frequent itemsets to generate the desired rules

Page 9:

Problem Decomposition – Example

Transaction ID   Items Bought
1                Shoes, Shirt, Jacket
2                Shoes, Jacket
3                Shoes, Jeans
4                Shirt, Sweatshirt

With minimum support = 50% (2 transactions) and minimum confidence = 50%:

Frequent Itemset    Support
{Shoes}             75%
{Shirt}             50%
{Jacket}            50%
{Shoes, Jacket}     50%

For the rule Shoes ⇒ Jacket:
  Support = sup({Shoes, Jacket}) = 50%
  Confidence = sup({Shoes, Jacket}) / sup({Shoes}) = 50% / 75% = 66.6%

Jacket ⇒ Shoes has 50% support and 100% confidence
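A brute-force sketch of step 1 on this toy database (the function name frequent_itemsets and the dict-of-supports output are our own choices) simply checks every candidate itemset against the minimum support and reproduces the table above:

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Enumerate every itemset over the observed items and keep those
    whose support (as a fraction of transactions) meets minsup."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            sup = sum(1 for t in transactions if set(cand) <= t) / n
            if sup >= minsup:
                result[cand] = sup
    return result

D = [{'Shoes', 'Shirt', 'Jacket'}, {'Shoes', 'Jacket'},
     {'Shoes', 'Jeans'}, {'Shirt', 'Sweatshirt'}]
print(frequent_itemsets(D, minsup=0.5))
# {('Jacket',): 0.5, ('Shirt',): 0.5, ('Shoes',): 0.75, ('Jacket', 'Shoes'): 0.5}
```

Brute force only works for tiny item sets; the Apriori algorithm later in the deck avoids enumerating the exponential number of candidate itemsets.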

Page 10:

Discovering Rules

Naïve algorithm:

for each frequent itemset l do
  for each subset c of l do
    if (support(l) / support(l − c) ≥ minconf) then
      output the rule (l − c) ⇒ c,
        with confidence = support(l) / support(l − c)
        and support = support(l)
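A runnable rendering of this naïve algorithm (our own sketch; it assumes the support of every frequent itemset and of all of its subsets is available in a dict keyed by frozenset):

```python
from itertools import combinations

def naive_rules(frequent, minconf):
    """frequent: dict {frozenset(itemset): support}. Returns a list of
    (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for l, sup_l in frequent.items():
        for r in range(1, len(l)):            # every non-empty proper subset c of l
            for c in combinations(l, r):
                c = frozenset(c)
                conf = sup_l / frequent[l - c]
                if conf >= minconf:
                    rules.append((l - c, c, sup_l, conf))
    return rules

# With the clothes example above (supports as fractions), this prints, in some order:
# {'Shoes'} => {'Jacket'} (support 50%, confidence 67%)
# {'Jacket'} => {'Shoes'} (support 50%, confidence 100%)
F = {frozenset({'Shoes'}): 0.75, frozenset({'Shirt'}): 0.5,
     frozenset({'Jacket'}): 0.5, frozenset({'Shoes', 'Jacket'}): 0.5}
for lhs, rhs, sup, conf in naive_rules(F, minconf=0.5):
    print(set(lhs), '=>', set(rhs), f'(support {sup:.0%}, confidence {conf:.0%})')
```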

Page 11:

Discovering Rules (2)

Lemma. If consequent c generates a valid rule, so do all subsets of c (e.g., if X ⇒ YZ holds, then XY ⇒ Z and XZ ⇒ Y also hold).

Example: consider the frequent itemset ABCDE.

If ACDE ⇒ B and ABCE ⇒ D are the only one-consequent rules with minimum confidence, then

ACE ⇒ BD is the only other rule that needs to be tested

Page 12:

Mining Frequent Itemsets: the Key Step

Find the frequent itemsets: the sets of items that have minimum support

A subset of a frequent itemset must also be a frequent itemset

i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets

Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)

Use the frequent itemsets to generate association rules.

Page 13:

The Apriori Algorithm

Lk: the set of frequent itemsets of size k (those with minimum support)

Ck: the set of candidate itemsets of size k (potentially frequent itemsets)

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
  Ck+1 = candidates generated from Lk;
  for each transaction t in the database do
    increment the count of all candidates in Ck+1 that are contained in t
  Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
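A compact Python sketch of this loop (our own rendering; candidate generation is folded in here as a union-and-prune step, and is spelled out closer to the SQL-like formulation on a later slide):

```python
from itertools import combinations
from collections import defaultdict

def apriori(transactions, minsup):
    """transactions: list of sets; minsup: absolute support count.
    Returns {frozenset(itemset): support count} for all frequent itemsets."""
    counts = defaultdict(int)
    for t in transactions:                    # first scan: count 1-itemsets
        for i in t:
            counts[frozenset([i])] += 1
    L = {s: c for s, c in counts.items() if c >= minsup}
    frequent, k = dict(L), 1
    while L:
        prev = set(L)
        # C(k+1): join Lk with itself, prune candidates with an infrequent k-subset
        candidates = {a | b for a in prev for b in prev
                      if len(a | b) == k + 1
                      and all(frozenset(s) in prev for s in combinations(a | b, k))}
        counts = defaultdict(int)
        for t in transactions:                # one database scan per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        L = {s: c for s, c in counts.items() if c >= minsup}
        frequent.update(L)
        k += 1
    return frequent
```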

Page 14:

The Apriori Algorithm — Example

Database D (minimum support = 50% = 2 transactions):

TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1:
itemset  sup.
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

L1:
itemset  sup.
{1}      2
{2}      3
{3}      3
{5}      3

C2 (generated from L1):
{1 2} {1 3} {1 5} {2 3} {2 5} {3 5}

Scan D → C2 with counts:
itemset  sup
{1 2}    1
{1 3}    2
{1 5}    1
{2 3}    2
{2 5}    3
{3 5}    2

L2:
itemset  sup
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3: {2 3 5}

Scan D → L3:
itemset  sup
{2 3 5}  2
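As a sanity check, the apriori sketch from the previous page reproduces L1, L2 and L3 on this database (absolute support threshold 2):

```python
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]      # TIDs 100..400
for itemset, count in sorted(apriori(D, minsup=2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
# [1] 2, [2] 3, [3] 3, [5] 3, [1, 3] 2, [2, 3] 2, [2, 5] 3, [3, 5] 2, [2, 3, 5] 2
```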

Page 15:

How to Generate Candidates?

Suppose the items in Lk-1 are listed in order

Step 1: self-join Lk-1

insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

forall itemsets c in Ck do
  forall (k-1)-subsets s of c do
    if (s is not in Lk-1) then delete c from Ck
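The same two steps can be written to mirror the SQL-like formulation literally; in this sketch (the name generate_candidates and the sorted-tuple representation are ours) each frequent (k−1)-itemset is a sorted tuple:

```python
from itertools import combinations

def generate_candidates(L_prev, k):
    """L_prev: set of sorted (k-1)-tuples (the frequent (k-1)-itemsets).
    Returns Ck as a set of sorted k-tuples."""
    # Step 1: self-join -- pairs that agree on their first k-2 items
    joined = {p + (q[k - 2],)
              for p in L_prev for q in L_prev
              if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]}
    # Step 2: prune candidates that have an infrequent (k-1)-subset
    return {c for c in joined
            if all(s in L_prev for s in combinations(c, k - 1))}
```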

Page 16:

Example of Generating Candidates

L3={abc, abd, acd, ace, bcd}

Self-joining: L3*L3

abcd from abc and abd

acde from acd and ace

Pruning:

acde is removed because ade is not in L3

C4={abcd}
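For reference, the generate_candidates sketch above reproduces this example:

```python
L3 = {('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'c', 'd'),
      ('a', 'c', 'e'), ('b', 'c', 'd')}
print(generate_candidates(L3, k=4))
# {('a', 'b', 'c', 'd')}  -- acde is pruned because ('a', 'd', 'e') is not in L3
```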

Page 17:

How to Count Supports of Candidates?

Why is counting the supports of candidates a problem?
  The total number of candidates can be very large
  One transaction may contain many candidates

Method:
  Candidate itemsets are stored in a hash-tree
  A leaf node of the hash-tree contains a list of itemsets and their counts
  An interior node contains a hash table
  Subset function: finds all the candidates contained in a transaction

Page 18:

Hash-Tree: Search

Given a transaction T and a candidate set Ck, find all members of Ck contained in T.

Assume an ordering on the items.
Start from the root and use every item in T to go to the next node.
If you are at an interior node and you just used item i, then use each item that comes after i in T.
If you are at a leaf node, check the itemsets.
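A much-simplified hash-tree sketch (the class and function names and the MAX_LEAF threshold are our own, not the slides' implementation): interior nodes hash on the item at the current depth, leaves keep candidates with their counts, and the search descends with every remaining item of the transaction exactly as described above.

```python
class Node:
    def __init__(self):
        self.is_leaf = True
        self.children = {}   # item -> Node, used when the node is interior
        self.itemsets = {}   # sorted tuple -> count, used when the node is a leaf

MAX_LEAF = 3                 # split a leaf once it stores more than this many itemsets

def insert(node, itemset, depth=0):
    """Insert a candidate itemset (sorted tuple); call before any counting."""
    if not node.is_leaf:
        insert(node.children.setdefault(itemset[depth], Node()), itemset, depth + 1)
        return
    node.itemsets[itemset] = 0
    if len(node.itemsets) > MAX_LEAF and depth < len(itemset):
        node.is_leaf = False
        old, node.itemsets = node.itemsets, {}
        for its in old:      # redistribute by the item at this depth
            insert(node.children.setdefault(its[depth], Node()), its, depth + 1)

def count(node, transaction, start=0):
    """Increment every stored candidate contained in the transaction (sorted tuple)."""
    if node.is_leaf:
        t = set(transaction)
        for its in node.itemsets:
            if set(its) <= t:
                node.itemsets[its] += 1
        return
    for i in range(start, len(transaction)):   # use every remaining item to descend
        child = node.children.get(transaction[i])
        if child is not None:
            count(child, transaction, i + 1)

# Counting C2 of the earlier example reproduces the supports 1, 2, 1, 2, 3, 2:
root = Node()
for c in [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]:
    insert(root, c)
for t in [(1, 3, 4), (2, 3, 5), (1, 2, 3, 5), (2, 5)]:
    count(root, t)
```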

Page 19:

Methods to Improve Apriori’s Efficiency

Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans and can be dropped (see the sketch after this list)

Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB

Sampling: mine a subset of the given data with a lowered support threshold, plus a method to determine completeness

Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
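As a sketch of the first idea, transaction reduction (the function name and representation are our own): once the frequent k-itemsets Lk are known, any transaction containing none of them can be dropped before the next scan.

```python
from itertools import combinations

def reduce_transactions(transactions, Lk, k):
    """transactions: list of sets; Lk: set of frozensets of size k.
    Keeps only transactions containing at least one frequent k-itemset,
    since the others cannot support any (k+1)-itemset in later scans."""
    return [t for t in transactions
            if any(frozenset(c) in Lk for c in combinations(sorted(t), k))]
```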

Page 20:

Is Apriori Fast Enough? — Performance Bottlenecks

The core of the Apriori algorithm:
  Use frequent (k − 1)-itemsets to generate candidate frequent k-itemsets
  Use database scans and pattern matching to collect counts for the candidate itemsets

The bottleneck of Apriori: candidate generation
  Huge candidate sets: 10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
  Multiple scans of the database: it needs n + 1 scans, where n is the length of the longest pattern

Page 21:

Max-Miner

Max-miner finds long patterns efficiently: the maximal frequent patterns

Instead of checking all subsets of a long pattern, it tries to detect long patterns early

Scales linearly with the size of the patterns

Page 22:

Max-Miner: the idea

Set-enumeration tree of the ordered set {1, 2, 3, 4}:

1    2    3    4
1,2  1,3  1,4  2,3  2,4  3,4
1,2,3  1,2,4  1,3,4  2,3,4
1,2,3,4

Pruning: (1) set infrequency; (2) superset frequency

Each node is a candidate group g
h(g) is the head: the itemset of the node
t(g) is the tail: an ordered set that contains all items that can appear in the subnodes

Example: h({1}) = {1} and t({1}) = {2, 3, 4}

Page 23:

Max-miner pruning

When we count the support of a candidate group g, we also compute the support of h(g), h(g) ∪ t(g), and h(g) ∪ {i} for each i in t(g)

If h(g) ∪ t(g) is frequent, then stop expanding the node g and report the union as a frequent itemset

If h(g) ∪ {i} is infrequent, then remove i from all subnodes (i.e., remove i from the tail of any group after g)

Expand the node g by one level and do the same

Page 24:

The algorithm

Max-Miner(T)
  Set of candidate groups C ← {}
  Set of itemsets F ← {Gen-Initial-Groups(T, C)}
  while C is not empty do
    scan T to count the support of all candidate groups in C
    for each g in C s.t. h(g) ∪ t(g) is frequent do
      F ← F ∪ {h(g) ∪ t(g)}
    Set of candidate groups Cnew ← {}
    for each g in C such that h(g) ∪ t(g) is infrequent do
      F ← F ∪ {Gen-Sub-Nodes(g, Cnew)}
    C ← Cnew
    remove from F any itemset with a proper superset in F
    remove from C any group g s.t. h(g) ∪ t(g) has a superset in F
  return F

Page 25:

The algorithm (2)

Gen-Initial-Groups(T, C)
  scan T to obtain F1, the set of frequent 1-itemsets
  impose an ordering on the items in F1
  for each item i in F1 other than the greatest item do
    let g be a new candidate with h(g) = {i}
      and t(g) = {j | j follows i in the ordering}
    C ← C ∪ {g}
  return the itemset F1 (and C, of course)

Gen-Sub-Nodes(g, C)    /* generation of new candidate groups at the next level */
  remove any item i from t(g) if h(g) ∪ {i} is infrequent
  reorder the items in t(g)
  for each i in t(g) other than the greatest do
    let g' be a new candidate with h(g') = h(g) ∪ {i}
      and t(g') = {j | j in t(g) and j is after i in t(g)}
    C ← C ∪ {g'}
  return h(g) ∪ {m}, where m is the greatest item in t(g), or h(g) if t(g) is empty

Page 26:

Item Ordering

By re-ordering the items we try to increase the effectiveness of frequency pruning

Very frequent items have a higher probability of being contained in long patterns

Put these items at the end of the ordering, so they appear in many tails
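Putting the two pruning rules and the item-ordering heuristic together, here is a small depth-first sketch in the spirit of Max-Miner (a simplification written for illustration: the real algorithm works breadth first over candidate groups and counts supports with hash trees; all names here are ours):

```python
def support(itemset, transactions):
    """Number of transactions containing every item of itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t))

def expand(head, tail, transactions, minsup, found):
    """Expand one candidate group g with h(g) = head and t(g) = tail."""
    if tail and support(head | set(tail), transactions) >= minsup:
        found.append(frozenset(head | set(tail)))    # superset-frequency pruning
        return
    # item-infrequency pruning: drop i if h(g) ∪ {i} is infrequent
    tail = [i for i in tail if support(head | {i}, transactions) >= minsup]
    if not tail:
        if head and support(head, transactions) >= minsup:
            found.append(frozenset(head))
        return
    for idx, i in enumerate(tail):
        expand(head | {i}, tail[idx + 1:], transactions, minsup, found)

def max_miner(transactions, minsup):
    """Return the maximal frequent itemsets (absolute support threshold)."""
    singles = {i for t in transactions for i in t}
    # least frequent items first, so the most frequent ones land in many tails
    order = sorted((i for i in singles if support({i}, transactions) >= minsup),
                   key=lambda i: support({i}, transactions))
    found = []
    expand(frozenset(), order, transactions, minsup, found)
    # keep only itemsets without a proper superset in the result
    return [s for s in found if not any(s < o for o in found)]

# On the earlier toy database the maximal frequent itemsets are {1,3} and {2,3,5}:
print(max_miner([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], minsup=2))
```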