Mining Quantitative Association Rules in Large Relational Databases

Ramakrishnan Srikant

Rakesh Agrawal

ACM SIGMOD Conference on Management of Data, 1996

March 21, 2013
(Slides modified from Sasi Sekhar Kunta’s version.)

Presented by: Sepehr Amir-Mohammadian

2

Outline
• Association Rules and Quantitative Association Rules
• Formal Study of Quantitative Association Analysis
• Partitioning Quantitative Attributes
• Identifying the Interesting Rules
• Candidate Generation
• Concluding Remarks
• Q&A

3




4

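Association Rules
• Itemsets $X$ and $Y$, $X \cap Y = \emptyset$
• Rule $X \Rightarrow Y$
• Support: $P(X \cup Y)$, the fraction of transactions that contain every item of $X$ and $Y$
• Confidence: $P(Y \mid X) = \mathrm{support}(X \cup Y) / \mathrm{support}(X)$
• Find all rules whose support is at least MinSup and whose confidence is at least MinConf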

5

Boolean Association Rules

TID  A  B  C  D
100  1  1  0  1
200  0  1  1  1
300  1  1  1  0
400  0  0  1  0

TID  Items
100  A B D
200  B C D
300  A B C
400  C
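As a minimal sketch of how support and confidence would be computed on the four transactions above (the function names `support` and `confidence` are illustrative, not taken from the slides):

```python
# Transactions from the TID/Items table above.
transactions = {
    100: {"A", "B", "D"},
    200: {"B", "C", "D"},
    300: {"A", "B", "C"},
    400: {"C"},
}

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

sup_ab = support({"A", "B"}, transactions)            # A and B occur together in TIDs 100 and 300
conf_a_b = confidence({"A"}, {"B"}, transactions)     # every transaction with A also has B
conf_b_a = confidence({"B"}, {"A"}, transactions)     # only two of the three transactions with B have A
```

With MinSup = 0.5 and MinConf = 0.9, for instance, A ⇒ B would be kept and B ⇒ A would be dropped.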

6

Quantitative Association Rules

RecordID  Age  Married  NumCars
100       23   No       1
200       25   Yes      1
300       29   No       0
400       34   Yes      2
500       38   Yes      2

7

Mapping to Boolean Association Rules

• Use <attribute, value> as a new boolean attribute in place of a categorical attribute

• Use <attribute, value> as a new boolean attribute in place of a quantitative attribute with a small domain

• Use <attribute, interval> as a new boolean attribute in place of a quantitative attribute with a large domain

RecordID  Age: 20..29  Age: 30..39  Married: Yes  Married: No  NumCars: 0  NumCars: 1
100       1            0            0             1            0           1
200       1            0            1             0            0           1
300       1            0            0             1            1           0
400       0            1            1             0            0           0
500       0            1            1             0            0           0
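The mapping can be sketched in a few lines of Python. This is only an illustration: the helper name `to_boolean` and the record layout are assumptions, while the intervals and column names follow the table above.

```python
# Original records: RecordID, Age, Married, NumCars.
records = [
    {"RecordID": 100, "Age": 23, "Married": "No", "NumCars": 1},
    {"RecordID": 200, "Age": 25, "Married": "Yes", "NumCars": 1},
    {"RecordID": 300, "Age": 29, "Married": "No", "NumCars": 0},
    {"RecordID": 400, "Age": 34, "Married": "Yes", "NumCars": 2},
    {"RecordID": 500, "Age": 38, "Married": "Yes", "NumCars": 2},
]

AGE_INTERVALS = [(20, 29), (30, 39)]  # large domain: one boolean column per interval

def to_boolean(record):
    """Turn one record into a row of 0/1 attributes."""
    row = {"RecordID": record["RecordID"]}
    for lo, hi in AGE_INTERVALS:
        row[f"Age: {lo}..{hi}"] = int(lo <= record["Age"] <= hi)
    for value in ("Yes", "No"):  # categorical: one column per value
        row[f"Married: {value}"] = int(record["Married"] == value)
    for value in (0, 1):  # small domain: one column per value (only 0 and 1, as in the table)
        row[f"NumCars: {value}"] = int(record["NumCars"] == value)
    return row

boolean_rows = [to_boolean(r) for r in records]
```

Any Boolean association rule miner can then be run on `boolean_rows` unchanged.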

8

Problems
• “MinSup”: If the number of partitions is large, the support of a single partition can be low
• “MinConf”: Information is lost when values are partitioned into intervals. Confidence can be lower when the number of intervals is smaller

RecordID  Age  Married  NumCars
100       23   No       1
200       25   Yes      1
300       29   No       0
400       34   Yes      2
500       38   Yes      2

9

Solution
• Consider all combinations of adjacent values/intervals in quantitative attributes
– Solves the “MinSup” problem
• Increase the number of values/intervals, without encountering the “MinSup” problem
– Reduces information loss
• New problems:
– Execution time: maximum support threshold, MaxSup
– Many rules: interestingness of rules

10

Steps of Proposed Approach
1. Determine the number of partitions for each quantitative attribute
2. Map values/ranges to consecutive integer values such that the order is preserved
3. Find the support of each value of the attributes, and combine adjacent values/intervals as long as their support is less than MaxSup. Then find the frequent itemsets, whose support is larger than MinSup
4. Use the frequent itemsets to generate association rules
5. Prune out uninteresting rules

11

Example
• Step 0: Initial set of records

RecordID  Age  Married  NumCars
100       23   No       1
200       25   Yes      1
300       29   No       0
400       34   Yes      2
500       38   Yes      2

12

Example – Cont.
• Step 1: Determine the partitions for each quantitative attribute

Intervals for Age
20 .. 24
25 .. 29
30 .. 34
35 .. 39

RecordID  Age       Married  NumCars
100       20 .. 24  No       1
200       25 .. 29  Yes      1
300       25 .. 29  No       0
400       30 .. 34  Yes      2
500       35 .. 39  Yes      2

13

Example – Cont.
• Step 2: Mapping intervals/values to consecutive integers

Intervals for Age  Integers
20 .. 24           1
25 .. 29           2
30 .. 34           3
35 .. 39           4

Values for Married  Integers
Yes                 1
No                  2

14

Example – Cont.
• Step 2: Mapping intervals/values to consecutive integers

RecordID  Age  Married  NumCars
100       1    2        1
200       2    1        1
300       2    2        0
400       3    1        2
500       4    1        2
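Steps 1 and 2 can be sketched as follows. The helper `age_code` and the tuple layout are illustrative assumptions; the intervals and integer codes are the ones shown in the tables above, and NumCars is left unchanged because its values are already small integers.

```python
AGE_INTERVALS = [(20, 24), (25, 29), (30, 34), (35, 39)]  # Step 1 partitions
MARRIED_CODES = {"Yes": 1, "No": 2}                        # Step 2 codes for the categorical attribute

def age_code(age):
    """Return the 1-based index of the interval containing `age` (order-preserving)."""
    for code, (lo, hi) in enumerate(AGE_INTERVALS, start=1):
        if lo <= age <= hi:
            return code
    raise ValueError(f"age {age} is outside every interval")

# (RecordID, Age, Married, NumCars)
records = [
    (100, 23, "No", 1),
    (200, 25, "Yes", 1),
    (300, 29, "No", 0),
    (400, 34, "Yes", 2),
    (500, 38, "Yes", 2),
]

mapped = [
    (rid, age_code(age), MARRIED_CODES[married], cars)
    for rid, age, married, cars in records
]
```

Because the codes preserve the order of the intervals, a range of consecutive codes such as 1..2 corresponds to the combined interval 20..29.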

15

Example – Cont.
• Step 3: Extracting large itemsets
– Some of the large itemsets, with MinSup = 0.4

Itemset  Support
         3
         2
         3
         2
         3
         2
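A brute-force sketch of Step 3 on the integer-coded records, assuming items are written as (attribute, low, high) triples and ignoring the MaxSup cutoff for brevity. The names `count`, `frequent_items`, and `pairs` are illustrative; this is not the candidate-generation algorithm itself.

```python
from itertools import combinations

# Integer-coded records from Step 2.
records = [
    {"Age": 1, "Married": 2, "NumCars": 1},
    {"Age": 2, "Married": 1, "NumCars": 1},
    {"Age": 2, "Married": 2, "NumCars": 0},
    {"Age": 3, "Married": 1, "NumCars": 2},
    {"Age": 4, "Married": 1, "NumCars": 2},
]
QUANTITATIVE = {"Age", "NumCars"}
MIN_COUNT = 2  # MinSup = 0.4 of 5 records

def count(itemset):
    """Number of records whose value lies in [lo, hi] for every item."""
    return sum(all(lo <= r[attr] <= hi for attr, lo, hi in itemset) for r in records)

def frequent_items():
    """All single items (values or ranges of adjacent values) with enough support."""
    items = []
    for attr in ("Age", "Married", "NumCars"):
        values = sorted({r[attr] for r in records})
        for i, lo in enumerate(values):
            # Categorical attributes keep single values; quantitative ones also get ranges.
            highs = values[i:] if attr in QUANTITATIVE else [lo]
            for hi in highs:
                if count([(attr, lo, hi)]) >= MIN_COUNT:
                    items.append((attr, lo, hi))
    return items

items = frequent_items()
# Frequent 2-itemsets: two items over different attributes with enough joint support.
pairs = [p for p in combinations(items, 2)
         if p[0][0] != p[1][0] and count(p) >= MIN_COUNT]
```

For example, `("Age", 1, 2)`, which stands for Age 20..29, is supported by three records, and the pair of `("Age", 3, 4)` with `("Married", 1, 1)` is supported by two.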

16

Example – Cont.
• Step 4: Rule generation
– Some of the generated rules, with MinConf = 0.5
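Rule generation from one frequent itemset can be sketched as below. The itemset chosen here (Age 30..39, Married: Yes, NumCars: 2 in integer codes) and the helper names are illustrative assumptions, not the rules shown on the original slide.

```python
from itertools import combinations

# Integer-coded records from Step 2.
records = [
    {"Age": 1, "Married": 2, "NumCars": 1},
    {"Age": 2, "Married": 1, "NumCars": 1},
    {"Age": 2, "Married": 2, "NumCars": 0},
    {"Age": 3, "Married": 1, "NumCars": 2},
    {"Age": 4, "Married": 1, "NumCars": 2},
]
MIN_CONF = 0.5

def count(itemset):
    """Number of records whose value lies in [lo, hi] for every item."""
    return sum(all(lo <= r[attr] <= hi for attr, lo, hi in itemset) for r in records)

def rules_from(itemset):
    """Yield (antecedent, consequent, support, confidence) for one frequent itemset."""
    itemset = tuple(itemset)
    total = count(itemset)
    for size in range(1, len(itemset)):
        for antecedent in combinations(itemset, size):
            consequent = tuple(i for i in itemset if i not in antecedent)
            conf = total / count(antecedent)  # antecedent count is at least `total`
            if conf >= MIN_CONF:
                yield antecedent, consequent, total / len(records), conf

itemset = (("Age", 3, 4), ("Married", 1, 1), ("NumCars", 2, 2))
rules = list(rules_from(itemset))
```

One of the resulting rules reads Age 30..39 and Married: Yes ⇒ NumCars: 2, with support 0.4 and confidence 1.0 on these five records.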

17




18

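Formal Study of Quantitative Association Analysis

• $\mathcal{I} = \{i_1, \ldots, i_m\}$: set of attributes
• $\mathcal{P}$: set of positive integers
• $\langle x, v \rangle \in \mathcal{I}_V = \mathcal{I} \times \mathcal{P}$ denotes that attribute $x$ has value $v$
• $\mathcal{I}_R = \{\langle x, l, u \rangle \in \mathcal{I} \times \mathcal{P} \times \mathcal{P} \mid l \le u \text{ if } x \text{ is quantitative},\ l = u \text{ if } x \text{ is categorical}\}$: set of items
• For any $X \subseteq \mathcal{I}_R$, $\mathrm{attributes}(X) = \{x \mid \langle x, l, u \rangle \in X\}$
• $\mathcal{D}$: set of records
• $R \in \mathcal{D}$, $R \subseteq \mathcal{I}_V$: a record such that attributes are distinct
• A record $R$ supports itemset $X \subseteq \mathcal{I}_R$ if for every $\langle x, l, u \rangle \in X$ there is $\langle x, q \rangle \in R$ with $l \le q \le u$

• $X \Rightarrow Y$: a quantitative association rule, where

– $X \subset \mathcal{I}_R$, $Y \subset \mathcal{I}_R$, $\mathrm{attributes}(X) \cap \mathrm{attributes}(Y) = \emptyset$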

19

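Formal Definition of Quantitative Association Analysis – Cont.

• $X \Rightarrow Y$ holds in $\mathcal{D}$ with support $s$, if $s\%$ of the records in $\mathcal{D}$ support $X \cup Y$.

• $X \Rightarrow Y$ holds in $\mathcal{D}$ with confidence $c$, if $c\%$ of the records in $\mathcal{D}$ that support $X$ also support $Y$.

• $\Pr(X)$: probability that all items in $X$ are supported by a given record

• $\hat{X}$ is a generalization of $X$ (and $X$ a specialization of $\hat{X}$) if $\mathrm{attributes}(X) = \mathrm{attributes}(\hat{X})$ and, for every attribute $x$, $\langle x, l, u \rangle \in X$ and $\langle x, l', u' \rangle \in \hat{X}$ imply $l' \le l \le u \le u'$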
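A minimal sketch of two of these definitions, assuming the usual reading: an item is an attribute with an integer range, a record supports an itemset when each item's attribute value lies inside its range, and a generalization keeps the same attributes with equal or wider ranges. The dictionary representation and function names are illustrative.

```python
def supports(record, itemset):
    """True if, for every item, the record's value for that attribute lies in [lo, hi]."""
    return all(
        attr in record and lo <= record[attr] <= hi
        for attr, (lo, hi) in itemset.items()
    )

def is_generalization(general, specific):
    """True if both itemsets cover the same attributes and each range in
    `specific` lies inside the matching range in `general`."""
    if general.keys() != specific.keys():
        return False
    return all(
        general[attr][0] <= lo and hi <= general[attr][1]
        for attr, (lo, hi) in specific.items()
    )

record = {"Age": 34, "Married": 1, "NumCars": 2}
itemset = {"Age": (30, 39), "NumCars": (2, 2)}
wider = {"Age": (20, 39), "NumCars": (0, 2)}

record_ok = supports(record, itemset)            # 34 is in 30..39 and NumCars is 2
wider_ok = is_generalization(wider, itemset)     # both ranges are widened
narrower_ok = is_generalization(itemset, wider)  # the reverse direction does not hold
```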

20




21

Partitioning Quantitative Attributes
• A measure of partial completeness: information lost in partitioning
– Compare the set of rules obtained before partitioning with the set of rules obtained after partitioning
– Partial completeness measures the distance between a rule in the former set and its closest generalization in the latter
– The distance is defined by the ratio of the supports
• Find the partitioning approach that needs the minimum number of partitions

22

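Partial Completeness
• $\mathcal{C}$: the set of frequent itemsets
• For any $K \ge 1$, $\mathcal{P}$ is $K$-complete w.r.t. $\mathcal{C}$ if
– $\mathcal{P} \subseteq \mathcal{C}$, and $X \in \mathcal{P}$, $X' \subseteq X$ imply $X' \in \mathcal{P}$
– for every $X \in \mathcal{C}$ there is a generalization $\hat{X} \in \mathcal{P}$ with $\mathrm{support}(\hat{X}) \le K \cdot \mathrm{support}(X)$, and for every $Y \subseteq X$ there is a generalization $\hat{Y} \subseteq \hat{X}$ of $Y$ with $\mathrm{support}(\hat{Y}) \le K \cdot \mathrm{support}(Y)$
• The smaller $K$ is, the less information is lost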

23

Example – K-Completeness
• Consider the following set of frequent itemsets:

Number  Itemset  Support
1                5%
2                6%
3                8%
4                5%
5                6%
6                4%
7                5%

• Then, itemsets 2, 3, 5, 7 form a 1.5-complete set.
• But, itemsets 3, 5, 7 do not form a 1.5-complete set.

24

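Confidence of Rules Generated from K-Complete Set

• If $\mathcal{P}$ is a $K$-complete set w.r.t. $\mathcal{C}$, then any rule $R$ obtained from $\mathcal{C}$ has a generalization $\hat{R}$ from $\mathcal{P}$, such that $\mathrm{confidence}(\hat{R})$ is bounded by
$$\frac{1}{K} \cdot \mathrm{confidence}(R) \;\le\; \mathrm{confidence}(\hat{R}) \;\le\; K \cdot \mathrm{confidence}(R)$$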

• In the previous example:

25

K-Completeness for a Single Attribute

• Consider $x$ as a quantitative attribute, partitioned into base intervals.

• Suppose that the support for each base interval is less than $\mathrm{MinSup} \times (K-1)/2$.

• Let $\mathcal{P}$ be the set of all combinations of base intervals that have minimum support.

• Then, $\mathcal{P}$ is $K$-complete w.r.t. the set of all ranges over $x$.

26

K-Completeness for a Group of Attributes

• Consider a set of $n$ quantitative attributes, partitioned into base intervals.

• Suppose that the support for each base interval is less than $\mathrm{MinSup} \times (K-1)/(2n)$.

• Let $\mathcal{P}$ be the set of all frequent itemsets over the partitioned attributes.

• Then, $\mathcal{P}$ is $K$-complete w.r.t. the set of all frequent itemsets without partitioning.

27

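Equi-Depth Partitioning
• Equi-depth partitioning: splitting the support equally across the intervals
• Suppose that the number of intervals is given.
• Then, equi-depth partitioning minimizes the maximum support of a base interval, and so minimizes $K$.
• Suppose that $K$ is given, for $n$ quantitative attributes and minimum support MinSup (as a fraction of the records).
• Then, equi-depth partitioning with support $\mathrm{MinSup} \times (K-1)/(2n)$ in each base interval results in the minimum number of intervals:
$$\text{Number of intervals} = \frac{2n}{\mathrm{MinSup} \times (K-1)}$$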
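A small sketch of both ideas, assuming the number of intervals is $2n / (\mathrm{MinSup} \times (K-1))$ with MinSup given as a fraction, and using a simple equal-count split. The function names are illustrative, and the split ignores ties between equal values.

```python
import math

def num_intervals(n_attributes, min_sup, k):
    """Intervals needed for partial completeness level k: 2n / (MinSup * (k - 1)), rounded up."""
    if k <= 1:
        raise ValueError("the partial completeness level must be greater than 1")
    return math.ceil(2 * n_attributes / (min_sup * (k - 1)))

def equi_depth(values, intervals):
    """Split the sorted values into `intervals` buckets of near-equal size.

    Returns the (lowest, highest) value of each non-empty bucket."""
    ordered = sorted(values)
    size = len(ordered)
    bounds = []
    for i in range(intervals):
        start = i * size // intervals
        end = (i + 1) * size // intervals
        if start < end:
            bounds.append((ordered[start], ordered[end - 1]))
    return bounds

needed = num_intervals(n_attributes=2, min_sup=0.25, k=2)  # 2*2 / (0.25 * 1)
age_buckets = equi_depth([23, 25, 29, 34, 38], 2)
```

A lower K (less information loss) or a lower MinSup both increase the number of intervals required.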

28




29

Identify Interesting Rules
• Combining intervals results in many rules
• For example, suppose a quarter of people in age group 20..30 are in the age group 20..25
– A rule for age group 20..30, with 8% sup, 70% conf
– The same rule restricted to age group 20..25, with 2% sup, 70% conf
– The second rule doesn’t give any additional information, and is less general than the first rule

30

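Expected Value of Support and Confidence

• Interest: compare a rule's support and confidence with the values expected from its generalization

• Let $Z = \{\langle z_1, l_1, u_1 \rangle, \ldots, \langle z_n, l_n, u_n \rangle\}$
• Let $\hat{Z} = \{\langle z_1, l'_1, u'_1 \rangle, \ldots, \langle z_n, l'_n, u'_n \rangle\}$, $l'_i \le l_i \le u_i \le u'_i$
• The expected value of $\Pr(Z)$ based on $\Pr(\hat{Z})$ would be
$$E_{\Pr(\hat{Z})}[\Pr(Z)] = \frac{\Pr(\langle z_1, l_1, u_1 \rangle)}{\Pr(\langle z_1, l'_1, u'_1 \rangle)} \times \cdots \times \frac{\Pr(\langle z_n, l_n, u_n \rangle)}{\Pr(\langle z_n, l'_n, u'_n \rangle)} \times \Pr(\hat{Z})$$
• Similarly, the expected value of the confidence for the rule $X \Rightarrow Y$ according to its generalization $\hat{X} \Rightarrow \hat{Y}$ would be
$$E_{\Pr(\hat{Y} \mid \hat{X})}[\Pr(Y \mid X)] = \frac{\Pr(\langle y_1, l_1, u_1 \rangle)}{\Pr(\langle y_1, l'_1, u'_1 \rangle)} \times \cdots \times \frac{\Pr(\langle y_n, l_n, u_n \rangle)}{\Pr(\langle y_n, l'_n, u'_n \rangle)} \times \Pr(\hat{Y} \mid \hat{X})$$
where $Y = \{\langle y_1, l_1, u_1 \rangle, \ldots, \langle y_n, l_n, u_n \rangle\}$, $\hat{Y} = \{\langle y_1, l'_1, u'_1 \rangle, \ldots, \langle y_n, l'_n, u'_n \rangle\}$.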
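A sketch of the expected-support calculation, assuming the expected support of a specialized itemset is the generalization's support scaled by the ratio of each item's probability to that of its generalized item. The individual item probabilities below (0.40 and 0.10 for the two age groups) are hypothetical values chosen so that a quarter of the people aged 20..30 are aged 20..25, and the interest level 1.1 is likewise an assumption.

```python
import math

def expected_support(item_probs, general_item_probs, general_support):
    """Scale the generalization's support by Pr(item) / Pr(generalized item) for each item."""
    expected = general_support
    for p, p_general in zip(item_probs, general_item_probs):
        expected *= p / p_general
    return expected

def is_interesting(actual, expected, r):
    """True if the actual value is at least r times the expected value."""
    return actual >= r * expected

# Generalized rule over age group 20..30: 8% support.
# Specialized rule over age group 20..25: 2% support.
# The consequent item is unchanged, so its ratio is 1.
expected = expected_support(
    item_probs=[0.10, 0.30],          # Pr(age 20..25), Pr(consequent item)
    general_item_probs=[0.40, 0.30],  # Pr(age 20..30), Pr(consequent item)
    general_support=0.08,
)
interesting = is_interesting(actual=0.02, expected=expected, r=1.1)
```

The expected support comes out at about 2%, which equals the actual support, so the specialized rule adds nothing and is not interesting at this level.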

31

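Interest Measure
• Itemset $X$ is $R$-interesting w.r.t. its generalization $\hat{X}$, if
– the support of $X$ is at least $R$ times its expected support based on $\hat{X}$, and
– for any specialization $X'$ of $X$ with minimum support, $X - X'$ is $R$-interesting w.r.t. $\hat{X}$
• Rule $X \Rightarrow Y$ is $R$-interesting w.r.t. its generalization $\hat{X} \Rightarrow \hat{Y}$ if
– its support is at least $R$ times the expected support based on $\hat{X} \Rightarrow \hat{Y}$, or
– its confidence is at least $R$ times the expected confidence based on $\hat{X} \Rightarrow \hat{Y}$
– Moreover, the itemset $X \cup Y$ is $R$-interesting w.r.t. $\hat{X} \cup \hat{Y}$.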

32

Example of Interest

33




34

Candidate Generation
• Given $L_{k-1}$, the set of all frequent $(k-1)$-itemsets, generate the set of candidate $k$-itemsets, $C_k$
• The process has three parts:
– Join Phase
– Subset Prune Phase
– Interest Prune Phase

35

Join Phase
• $L_{k-1}$ joined with itself

• Example, $L_2$:

• Result of self-join, $C_3$:

36

Subset Prune Phase
• Make sure any $(k-1)$-subset is in $L_{k-1}$.

• Example, $L_2$:

• Result of self-join, $C_3$:

• Delete the first itemset in $C_3$ since one of its 2-subsets is not in $L_2$.
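The join and subset prune phases can be sketched as follows. The itemsets in `L2` are hypothetical stand-ins chosen for illustration (Married: Yes is coded as 1), and the fixed attribute order in `ORDER` is an assumption used to keep each itemset in a canonical order.

```python
from itertools import combinations

# Canonical attribute order; every itemset lists its items in this order.
ORDER = {"Married": 0, "Age": 1, "NumCars": 2}

def join(frequent):
    """Join (k-1)-itemsets that share their first k-2 items and whose
    last items are over different attributes, in canonical order."""
    candidates = []
    for a in frequent:
        for b in frequent:
            if a[:-1] == b[:-1] and ORDER[a[-1][0]] < ORDER[b[-1][0]]:
                candidates.append(a + (b[-1],))
    return candidates

def subset_prune(candidates, frequent):
    """Keep only candidates whose (k-1)-subsets are all frequent."""
    known = set(frequent)
    return [
        c for c in candidates
        if all(subset in known for subset in combinations(c, len(c) - 1))
    ]

# Hypothetical frequent 2-itemsets; items are (attribute, low, high).
L2 = [
    (("Married", 1, 1), ("Age", 20, 24)),
    (("Married", 1, 1), ("Age", 20, 29)),
    (("Married", 1, 1), ("NumCars", 0, 1)),
    (("Age", 20, 29), ("NumCars", 0, 1)),
]

C3 = join(L2)
C3_pruned = subset_prune(C3, L2)
```

Here the join yields two candidates, and the one containing Age 20..24 is dropped because its subset of Age 20..24 with NumCars 0..1 is not in `L2`.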

37

Interest Prune Phase
• Given user-specified interest level $R$
• Delete any itemset that contains an item with support greater than $1/R$
• It is guaranteed that such itemsets cannot be $R$-interesting w.r.t. their generalizations
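A minimal sketch of this pruning rule, assuming the cutoff on item support is 1/R. The candidate itemsets and the item supports below are hypothetical values for illustration.

```python
def interest_prune(candidates, item_support, r):
    """Drop every candidate that contains an item whose support exceeds 1/r."""
    threshold = 1.0 / r
    return [
        c for c in candidates
        if all(item_support[item] <= threshold for item in c)
    ]

# Hypothetical supports of single items, written as (attribute, low, high).
item_support = {
    ("Age", 20, 39): 0.9,
    ("Age", 20, 24): 0.2,
    ("NumCars", 0, 1): 0.5,
}
candidates = [
    (("Age", 20, 39), ("NumCars", 0, 1)),
    (("Age", 20, 24), ("NumCars", 0, 1)),
]

kept = interest_prune(candidates, item_support, r=1.5)  # threshold is 1/1.5, about 0.67
```

The first candidate is removed because Age 20..39 covers 90% of the records, which is above the threshold.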

38




39

Concluding Remarks
• Introduced the problem of mining quantitative association rules
• Dealt with quantitative attributes by finely partitioning the values and combining adjacent partitions as necessary
• Introduced partial completeness to quantify the information lost, and to help decide the partitions
• Gave an interest measure to identify interesting rules
• Described candidate generation (join, subset prune, and interest prune phases)

40




41

Exam Questions
1. What are the two problems with mapping quantitative associations to boolean associations?
A. Slide No. 8
2. Give the general steps to be followed in order to mine quantitative association rules.
B. Slide No. 10
3. If P is a K-complete set w.r.t. the set of all frequent itemsets, the minimum confidence when generating rules from P should follow what constraint, in order to guarantee that a close rule will be generated?
C. It should be 1/K times the desired level of confidence. Slide No. 24.

42

Thank you. Questions?
