Association Rules

Dr. Avi Rosenfeld

Motivation

• What are the items that go together?
  – Recommendation: which supermarket products are worth shelving together
• Recommendation Systems – next week
• Web search – in two weeks

But how do we define "together"?

• Is this Supervised or Unsupervised? – Unsupervised
• How many times must items appear together?
  – A Support and Confidence threshold, set before I compute them
  – We will define these shortly...

©GKGupta, December 2008

But first of all, an example:

TID  Items
 1   Biscuits, Bread, Cheese, Coffee, Yogurt
 2   Bread, Cereal, Cheese, Coffee
 3   Cheese, Chocolate, Donuts, Juice, Milk
 4   Bread, Cheese, Coffee, Cereal, Juice
 5   Bread, Cereal, Chocolate, Donuts, Juice
 6   Milk, Tea
 7   Biscuits, Bread, Cheese, Coffee, Milk
 8   Eggs, Milk, Tea
 9   Bread, Cereal, Cheese, Chocolate, Coffee
10   Bread, Cereal, Chocolate, Donuts, Juice
11   Bread, Cheese, Juice
12   Bread, Cheese, Coffee, Donuts, Juice
13   Biscuits, Bread, Cereal
14   Cereal, Cheese, Chocolate, Donuts, Juice
15   Chocolate, Coffee
16   Donuts
17   Donuts, Eggs, Juice
18   Biscuits, Bread, Cheese, Coffee
19   Bread, Cereal, Chocolate, Donuts, Juice
20   Cheese, Chocolate, Donuts, Juice
21   Milk, Tea, Yogurt
22   Bread, Cereal, Cheese, Coffee
23   Chocolate, Donuts, Juice, Milk, Newspaper
24   Newspaper, Pastry, Rolls
25   Rolls, Sugar, Tea


Frequency of Items

Item No  Item name   Frequency
 1       Biscuits     4
 2       Bread       13
 3       Cereal      10
 4       Cheese      11
 5       Chocolate    9
 6       Coffee       9
 7       Donuts      10
 8       Eggs         2
 9       Juice       11
10       Milk         6
11       Newspaper    2
12       Pastry       1
13       Rolls        2
14       Sugar        1
15       Tea          4
16       Yogurt       2


What am I looking for...

• If-then rules about the contents of baskets.
• {i1, i2, …, ik} → j means: "if a basket contains all of i1, …, ik then it is likely to contain j."
  – I want to learn whether there is a dependence of j on {i1, i2, …, ik}.


Support

• Simplest question: find sets of items that appear “frequently” in the baskets.

• Support for itemset I = the number of baskets containing all items in I.

– Usually, the probability of finding the items together.

• Given a support threshold s, sets of items that appear in > s baskets are called frequent itemsets.
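The support count can be sketched directly in Python; a minimal illustration, with made-up basket data:

```python
def support(baskets, itemset):
    """Number of baskets containing every item in `itemset`."""
    needed = set(itemset)
    return sum(1 for basket in baskets if needed <= set(basket))

def frequent_items(baskets, s):
    """Single items appearing in more than s baskets (the slide's '> s' convention)."""
    counts = {}
    for basket in baskets:
        for item in basket:
            counts[item] = counts.get(item, 0) + 1
    return {item for item, c in counts.items() if c > s}

baskets = [{"bread", "cheese"}, {"bread", "coffee"},
           {"bread", "cheese", "coffee"}, {"milk"}]
print(support(baskets, {"bread", "cheese"}))  # 2
print(frequent_items(baskets, 2))             # {'bread'}
```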

Confidence

• How many times did both I and j appear together?
• The Confidence of this association rule is the probability of j given i1, …, ik.
  – I = i1, …, ik

• Formally: Confidence(I → j) = support(I ∪ {j}) / support(I)

• Formally: Lift(I → j) = Confidence(I → j) / support({j})
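Using those definitions, confidence and lift follow from support fractions; a small sketch with invented baskets:

```python
def support_frac(baskets, itemset):
    """Support as a fraction of all baskets."""
    needed = set(itemset)
    return sum(1 for b in baskets if needed <= set(b)) / len(baskets)

def confidence(baskets, I, j):
    """P(j | I) = support(I u {j}) / support(I)."""
    return support_frac(baskets, set(I) | {j}) / support_frac(baskets, I)

def lift(baskets, I, j):
    """Confidence(I -> j) divided by support({j})."""
    return confidence(baskets, I, j) / support_frac(baskets, {j})

baskets = [{"bread", "cheese"}, {"bread", "coffee"},
           {"bread", "cheese", "coffee"}, {"cheese"}]
# confidence({bread} -> cheese) = (2/4) / (3/4) = 2/3
print(confidence(baskets, {"bread"}, "cheese"))
```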


Frequent Items

Assume 25% support. In 25 transactions, a frequent item must occur in at least 7 transactions (25 × 0.25 = 6.25, rounded up). The frequent 1-itemset or L1 is now given below.

Item       Frequency
Bread      13/25
Cereal     10/25
Cheese     11/25
Chocolate   9/25
Coffee      9/25
Donuts     10/25
Juice      11/25


L2

The following pairs are frequent.

Frequent 2-itemset     Frequency
{Bread, Cereal}        9/25
{Bread, Cheese}        8/25
{Bread, Coffee}        8/25
{Cheese, Coffee}       9/25
{Chocolate, Donuts}    7/25
{Chocolate, Juice}     7/25
{Donuts, Juice}        9/25


Example: Confidence

B1 = {m, c, b}      B2 = {m, p, j}
B3 = {m, b}         B4 = {c, j}
B5 = {m, p, b}      B6 = {m, c, b, j}
B7 = {c, b, j}      B8 = {b, c}

• An association rule: {m, b} → c.
  – Confidence = 2/4 = 50%.
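The slide's arithmetic can be checked directly:

```python
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

# Baskets containing both m and b: B1, B3, B5, B6 -> 4 of them.
with_mb = [b for b in baskets if {"m", "b"} <= b]
# Of those, the baskets that also contain c: B1, B6 -> 2.
with_mbc = [b for b in with_mb if "c" in b]
print(len(with_mbc) / len(with_mb))  # 0.5, i.e. confidence = 50%
```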



Rules

The full set of rules is given below. Could some rules be removed?

Cheese → Bread
Cheese → Coffee
Coffee → Bread
Coffee → Cheese
Cheese, Coffee → Bread
Bread, Coffee → Cheese
Bread, Cheese → Coffee
Chocolate → Donuts
Chocolate → Juice
Donuts → Chocolate
Donuts → Juice
Donuts, Juice → Chocolate
Chocolate, Juice → Donuts
Chocolate, Donuts → Juice
Bread → Cereal
Cereal → Bread

Comment: Study the above rules carefully.


Main-Memory Bottleneck

• For many frequent-itemset algorithms, main memory is the critical resource.
  – As we read baskets, we need to count something, e.g., occurrences of pairs.
  – The number of different things we can count is limited by main memory.
  – Swapping counts in/out is a disaster (why?).


Finding Frequent Pairs

• The hardest problem often turns out to be finding the frequent pairs.
  – Why? Often frequent pairs are common, frequent triples are rare.
• We'll concentrate on how to do that, then discuss extensions to finding frequent triples, etc.

David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com
These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

[Binary transaction matrix: 20 records (IDs 1-20) over items a, b, c, d, e, f, g, h, i; a 1 marks that the item appears in the record.]

E.g. the 3-itemset {a, b, h} has support 15%.

The 2-itemset {a, i} has support 0%.

The 4-itemset {b, c, d, h} has support 5%.

If minimum support is 10%, then {b} is a large itemset, but {b, c, d, h} is a small itemset!




Naïve Algorithm

• Read file once, counting in main memory the occurrences of each pair.
  – From each basket of n items, generate its n(n-1)/2 pairs by two nested loops.
• Fails if (#items)² exceeds main memory.
  – Remember: #items can be 100K (Wal-Mart) or 10B (Web pages).
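A sketch of that naïve pass in Python (the baskets here are illustrative):

```python
from itertools import combinations
from collections import Counter

def count_pairs(baskets):
    """One in-memory counter per pair: n(n-1)/2 pairs per basket of n items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):  # the two nested loops
            counts[pair] += 1
    return counts

baskets = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}]
print(count_pairs(baskets)[("a", "b")])  # 2
```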


Example: Counting Pairs

• Suppose 10^5 items.
• Suppose counts are 4-byte integers.
• Number of pairs of items: 10^5(10^5 - 1)/2 ≈ 5 × 10^9.
• Therefore, 2 × 10^10 bytes (20 gigabytes) of main memory needed.
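The arithmetic behind that estimate:

```python
n_items = 10**5                 # 100,000 distinct items
bytes_per_count = 4             # one 4-byte integer per pair
n_pairs = n_items * (n_items - 1) // 2      # ~5 * 10^9 pairs
memory_bytes = n_pairs * bytes_per_count    # ~2 * 10^10 bytes, i.e. ~20 GB
print(n_pairs, memory_bytes)
```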


The Apriori algorithm for finding large itemsets efficiently in big DBs

1: Find all large 1-itemsets
2: For (k = 2; while Lk-1 is non-empty; k++)
3: { Ck = apriori-gen(Lk-1)
4:   For each c in Ck, initialise c.count to zero
5:   For all records r in the DB
6:   { Cr = subset(Ck, r); For each c in Cr, c.count++ }
7:   Set Lk := all c in Ck whose count >= minsup
8: } /* end -- return all of the Lk sets */
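The pseudocode maps line by line onto a short Python sketch (names and data are illustrative; this version uses only the join step of apriori-gen and omits the prune, so it may count a few extra candidates):

```python
def apriori(baskets, minsup):
    """Return {k: set of frequent k-itemsets}, mirroring the pseudocode above."""
    baskets = [frozenset(b) for b in baskets]
    # Line 1: find all large 1-itemsets.
    counts = {}
    for b in baskets:
        for item in b:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    L = {1: {s for s, c in counts.items() if c >= minsup}}
    k = 2
    # Line 2: loop while L[k-1] is non-empty.
    while L[k - 1]:
        # Line 3: candidate generation (join step of apriori-gen).
        Ck = {a | b for a in L[k - 1] for b in L[k - 1] if len(a | b) == k}
        # Lines 4-6: count each candidate's occurrences in the DB.
        count = {c: 0 for c in Ck}
        for r in baskets:
            for c in Ck:
                if c <= r:
                    count[c] += 1
        # Line 7: keep the candidates meeting minsup.
        L[k] = {c for c, n in count.items() if n >= minsup}
        k += 1
    del L[k - 1]  # drop the final, empty level
    return L

baskets = [{"bread", "cheese", "coffee"}, {"bread", "cheese"},
           {"bread", "coffee"}, {"cheese", "coffee"},
           {"bread", "cheese", "coffee"}]
print(apriori(baskets, 3))
```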


Explaining the Apriori Algorithm …

1: Find all large 1-itemsets

To start off, we simply find all of the large 1-itemsets. This is done by a basic scan of the DB. We take each item in turn, and count the number of times that item appears in a basket. In our running example, suppose minimum support is 60%; then the only large 1-itemsets would be: {a}, {b}, {c}, {d} and {f}. So we get

L1 = { {a}, {b}, {c}, {d}, {f}}


Explaining the Apriori Algorithm …

1: Find all large 1-itemsets
2: For (k = 2; while Lk-1 is non-empty; k++)

We already have L1. This next bit just means that the remainder of the algorithm generates L2, L3, and so on until we get to an Lk that's empty.

How these are generated is like this:

How these are generated is like this:


Explaining the Apriori Algorithm …

1: Find all large 1-itemsets
2: For (k = 2; while Lk-1 is non-empty; k++)
3: { Ck = apriori-gen(Lk-1)

Given the large (k-1)-itemsets, this step generates some candidate k-itemsets that might be large. Because of how apriori-gen works, the set Ck is guaranteed to contain all the large k-itemsets, but it also contains some that will turn out not to be "large".
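One common way apriori-gen is implemented is a join step (union pairs of (k-1)-itemsets into k-itemsets) followed by a prune step (discard any candidate with an infrequent (k-1)-subset); the function name follows the slide, the body is a sketch:

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Join + prune: every large k-itemset survives, some candidates won't be large."""
    # Join step: unions of (k-1)-itemsets that overlap in all but one item.
    candidates = {a | b for a in L_prev for b in L_prev if len(a | b) == k}
    # Prune step: keep a candidate only if all its (k-1)-subsets are in L_prev.
    return {c for c in candidates
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

L2 = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]}
print(apriori_gen(L2, 3))  # only {a, b, c} survives the prune
```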


Explaining the Apriori Algorithm …

1: Find all large 1-itemsets
2: For (k = 2; while Lk-1 is non-empty; k++)
3: { Ck = apriori-gen(Lk-1)
4:   For each c in Ck, initialise c.count to zero

We are going to work out the support for each of the candidate k-itemsets in Ck, by working out how many times each of these itemsets appears in a record in the DB.
  – This step starts us off by initialising these counts to zero.


Explaining the Apriori Algorithm …

1: Find all large 1-itemsets
2: For (k = 2; while Lk-1 is non-empty; k++)
3: { Ck = apriori-gen(Lk-1)
4:   For each c in Ck, initialise c.count to zero
5:   For all records r in the DB
6:   { Cr = subset(Ck, r); For each c in Cr, c.count++ }

We now take each record r in the DB and do this: get all the candidate k-itemsets from Ck that are contained in r. For each of these, update its count.


Explaining the Apriori Algorithm …

1: Find all large 1-itemsets
2: For (k = 2; while Lk-1 is non-empty; k++)
3: { Ck = apriori-gen(Lk-1)
4:   For each c in Ck, initialise c.count to zero
5:   For all records r in the DB
6:   { Cr = subset(Ck, r); For each c in Cr, c.count++ }
7:   Set Lk := all c in Ck whose count >= minsup

Now we have the count for every candidate. Those whose count is big enough are valid large itemsets of the right size. We therefore now have Lk. We now go back into the for loop of line 2 and start working towards finding Lk+1.


Explaining the Apriori Algorithm …

1: Find all large 1-itemsets
2: For (k = 2; while Lk-1 is non-empty; k++)
3: { Ck = apriori-gen(Lk-1)
4:   For each c in Ck, initialise c.count to zero
5:   For all records r in the DB
6:   { Cr = subset(Ck, r); For each c in Cr, c.count++ }
7:   Set Lk := all c in Ck whose count >= minsup
8: } /* end -- return all of the Lk sets */

We finish at the point where we get an empty Lk. The algorithm returns all of the (non-empty) Lk sets, which gives us an excellent start in finding interesting rules (although the large itemsets themselves will usually be very interesting and useful).

Another solution: FP-Growth

• Builds a tree according to the frequency of the items, from most to least frequent
  – (I have a patented algorithm of this kind, but it is... intended for search, not for Association.)