
Page 1: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

CS 345:Topics in Data Warehousing

Thursday, November 18, 2004

Page 2: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Review of Tuesday’s Class• Data Mining

– What is data mining?– Types of data mining– Data mining pitfalls

• Decision Tree Classifiers– What is a decision tree?– Learning decision trees– Entropy– Information Gain– Cross-Validation

Page 3: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Overview of Today’s Class• Assignment #3 clarifications• Association Rule Mining

– Market basket analysis– What is an association rule?– Frequent itemsets

• Association rule mining algorithms– A-priori algorithm– Speeding up A-priori using hashing– One- and two-pass algorithms

* Adapted from slides by Vipin Kumar (Minnesota) and Rajeev Motwani (Stanford)

Page 4: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Aggregate Tables

[Diagram: the FACT table, one of its Dimension tables, and the corresponding aggregate fact table FACT_AGG]

n dimension columns = 2^n possible aggregates

2 of these are special:
• All columns used for grouping = the original dimension table
• No grouping columns = only 1 row; no reason to join to it, so eliminate this foreign key from the fact aggregate table (FACT_AGG)

Page 5: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Candidate Column Sets

Including fact aggregates that use some base dimension tables is optional

Page 6: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Association Rule Mining
• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
• Also known as market basket analysis

Market-Basket transactions

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Example of Association Rules

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

Page 7: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Definition: Frequent Itemset
• Itemset
  – A collection of one or more items
    • Example: {Milk, Bread, Diaper}
  – k-itemset
    • An itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2
• Support
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Page 8: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Definition: Association Rule

• Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
  – Support (s)
    • Fraction of transactions that contain both X and Y
  – Confidence (c)
    • Measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}
  s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
  c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke
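To make the two metrics concrete, here is a minimal Python sketch (mine, not from the slides) that computes s and c for {Milk, Diaper} → {Beer} over the five example baskets above; it reproduces s = 0.4 and c ≈ 0.67.

# Support and confidence of {Milk, Diaper} -> {Beer} over the example baskets
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
X = {"Milk", "Diaper"}   # antecedent
Y = {"Beer"}             # consequent

count_xy = sum(1 for t in transactions if (X | Y) <= t)  # sigma(X u Y) = 2
count_x = sum(1 for t in transactions if X <= t)          # sigma(X) = 3

support = count_xy / len(transactions)   # 2/5 = 0.4
confidence = count_xy / count_x          # 2/3 ~= 0.67
print(support, confidence)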

Page 9: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Association Rule Mining Task

• Given a set of transactions T, the goal of association rule mining is to find all rules having – support ≥ minsup threshold– confidence ≥ minconf threshold

• High confidence = strong pattern • High support = occurs often

– Less likely to be random occurrence– Larger potential benefit from acting on the rule

Page 10: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Application 1 (Retail Stores)

• Real market baskets – chain stores keep TBs of customer purchase info
  – Value?
    • how typical customers navigate stores
    • positioning tempting items
    • suggests cross-sell opportunities – e.g., hamburger sale while raising ketchup price
    • …
• High support needed, or no $$’s

Page 11: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Application 2 (Information Retrieval)

• Scenario 1– baskets = documents– items = words in documents– frequent word-groups = linked concepts.

• Scenario 2 – items = sentences – baskets = documents containing sentences – frequent sentence-groups = possible plagiarism

Page 12: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Application 3 (Web Search)

• Scenario 1 – baskets = web pages – items = outgoing links – pages with similar references → about same topic
• Scenario 2 – baskets = web pages – items = incoming links – pages with similar in-links → mirrors, or same topic

Page 13: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Mining Association Rules

Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Observations:

• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}

• Rules originating from the same itemset have identical support but can have different confidence

• Thus, we may decouple the support and confidence requirements

Page 14: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Mining Association Rules

• Goal – find all association rules such that
  – support ≥ s
  – confidence ≥ c
• Reduction to Frequent Itemsets Problem
  – Find all frequent itemsets X
  – Given X = {A1, …, Ak}, generate all rules X − {Aj} → {Aj}
  – Confidence = sup(X) / sup(X − {Aj})
  – Support = sup(X)
  – Exclude rules whose confidence is too low
    • Observe: X − {Aj} is also frequent → its support is already known
• Finding all frequent itemsets is the hard part! (The rule-generation step is sketched below.)
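As a rough illustration of the rule-generation step above, here is a minimal Python sketch (mine, not from the slides). The `support` dictionary of itemset counts is a hypothetical input that a frequent-itemset pass would produce, and only single-item consequents are generated, as on the slide.

# Generate rules X - {Aj} -> {Aj} from one frequent itemset X
def rules_from_itemset(X, support, minconf):
    X = frozenset(X)
    rules = []
    for a in X:
        lhs = X - {a}
        conf = support[X] / support[lhs]   # confidence = sup(X) / sup(X - {Aj})
        if conf >= minconf:                # exclude rules whose confidence is too low
            rules.append((set(lhs), {a}, support[X], conf))
    return rules

# Hypothetical support counts taken from the running example
support = {
    frozenset({"Milk", "Diaper", "Beer"}): 2,
    frozenset({"Milk", "Diaper"}): 3,
    frozenset({"Milk", "Beer"}): 2,
    frozenset({"Diaper", "Beer"}): 3,
}
print(rules_from_itemset({"Milk", "Diaper", "Beer"}, support, minconf=0.6))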

Page 15: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Itemset Latticenull

AB AC AD AE BC BD BE CD CE DE

A B C D E

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

Given m items, there are 2^m − 1 possible candidate itemsets

Page 16: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Scale of Problem

• WalMart – sells m=100,000 items– tracks n=1,000,000,000 baskets

• Web– several billion pages– approximately one new “word” per page

• Exponential number of itemsets
  – m items → 2^m − 1 possible itemsets
  – Cannot possibly examine all itemsets for large m
  – Even itemsets of size 2 may be too many
  – m = 100,000 → roughly 5 billion item pairs

Page 17: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Frequent Itemsets in SQL• DBMSs are poorly suited to association rule mining• Star schema

– Sales Fact – Transaction ID degenerate dimension– Item dimension

• Finding frequent 3-itemsets (self-joining three copies of the fact table):

SELECT Fact1.ItemID, Fact2.ItemID, Fact3.ItemID, COUNT(*)
FROM SalesFact AS Fact1
  JOIN SalesFact AS Fact2
    ON Fact1.TID = Fact2.TID AND Fact1.ItemID < Fact2.ItemID
  JOIN SalesFact AS Fact3
    ON Fact2.TID = Fact3.TID AND Fact2.ItemID < Fact3.ItemID
GROUP BY Fact1.ItemID, Fact2.ItemID, Fact3.ItemID
HAVING COUNT(*) > 1000

• Finding frequent k-itemsets requires joining k copies of fact table– Joins are non-equijoins– Impossibly expensive!

Page 18: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Association Rules and Data Warehouses

• Typical procedure:– Use data warehouse to apply filters

• Mine association rules for certain regions, dates

– Export all fact rows matching filters to flat file• Sort by transaction ID• Items in same transaction are grouped together

– Perform association rule mining on flat file

• An alternative:
  – Database vendors are beginning to add specialized data mining capabilities
  – Efficient algorithms for common data mining tasks are built in to the database system
    • Decision trees, association rules, clustering, etc.

– Not standardized yet

Page 19: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Finding Frequent Pairs

• Frequent 2-Sets– hard case already– focus for now, later extend to k-sets

• Naïve Algorithm– Counters – all m(m–1)/2 item pairs (m = # of distinct items)– Single pass – scanning all baskets– Basket of size b – increments b(b–1)/2 counters

• Failure?
  – if memory < m(m–1)/2 counters
  – m = 100,000 → roughly 5 billion item pairs
  – Naïve algorithm is impractical for large m (a sketch of the naïve pass follows below)
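The sketch referred to above: a minimal Python version of the naïve single-pass algorithm (mine, not from the slides). It keeps one counter per pair that actually co-occurs, which for large m approaches the m(m–1)/2 counters described on the slide.

from itertools import combinations
from collections import Counter

def count_all_pairs(transactions):
    counts = Counter()
    for basket in transactions:                       # single pass over all baskets
        for pair in combinations(sorted(basket), 2):  # b(b-1)/2 increments per basket
            counts[pair] += 1
    return counts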

Page 20: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Pruning Candidate Itemsets

• Monotonicity principle:– If an itemset is frequent, then all of its subsets

must also be frequent

• Monotonicity principle holds due to the following property of the support measure:
  ∀ X, Y: X ⊆ Y ⇒ s(X) ≥ s(Y)
• Contrapositive:
  – If an itemset is infrequent, then all of its supersets must also be infrequent

Page 21: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Illustrating Monotonicity Principle

[Itemset lattice diagram: once an itemset is found to be infrequent, all of its supersets are pruned from the search.]

Page 22: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

A-Priori Algorithm

• A-Priori – 2-pass approach in limited memory• Pass 1

– m counters (candidate items in A)– Linear scan of baskets b– Increment counters for each item in b

• Mark as frequent, f items of count at least s• Pass 2

– f(f-1)/2 counters (candidate pairs of frequent items)– Linear scan of baskets b – Increment counters for each pair of frequent items in b

• Failure – if memory < f(f–1)/2 counters
  – Suppose that 10% of items are frequent
  – Memory is (m²/200) vs. (m²/2)
(A sketch of the two-pass algorithm follows below.)
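The sketch referred to above: a minimal Python version of two-pass A-Priori for frequent pairs (mine, not from the slides), with the support threshold s given as an absolute count.

from itertools import combinations
from collections import Counter

def apriori_pairs(transactions, s):
    # Pass 1: m counters, one per candidate item
    item_counts = Counter()
    for basket in transactions:
        item_counts.update(basket)
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: f(f-1)/2 counters, only for pairs of frequent items
    pair_counts = Counter()
    for basket in transactions:
        kept = sorted(b for b in basket if b in frequent_items)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}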

Page 23: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Finding Larger Itemsets

• Goal – extend A-Priori to frequent k-sets, k > 2• Monotonicity

itemset X is frequent only if X – {Xj} is frequent for all Xj

• Idea– Stage k – finds all frequent k-sets– Stage 1 – gets all frequent items– Stage k – maintain counters for all candidate k-sets– Candidates – k-sets whose (k–1)-subsets are all frequent– Total cost: number of passes = max size of frequent itemset
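A minimal Python sketch of the candidate-generation rule described above (mine, not from the slides): a k-set becomes a candidate for stage k only if every one of its (k–1)-subsets was found frequent in the previous stage. Brute-force enumeration is used here for clarity.

from itertools import combinations

def candidate_ksets(frequent_km1, k):
    # frequent_km1: frozensets of size k-1 found frequent in stage k-1
    items = sorted(set().union(*frequent_km1))
    candidates = set()
    for combo in combinations(items, k):
        cand = frozenset(combo)
        if all(frozenset(sub) in frequent_km1 for sub in combinations(cand, k - 1)):
            candidates.add(cand)          # all (k-1)-subsets are frequent
    return candidates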

Page 24: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

A-Priori Algorithm (worked example, Minimum Support = 3)

Items (1-itemsets):
Bread 4, Coke 2, Milk 4, Beer 3, Diaper 4, Eggs 1

Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:
{Bread, Milk} 3, {Bread, Beer} 2, {Bread, Diaper} 3, {Milk, Beer} 2, {Milk, Diaper} 3, {Beer, Diaper} 3

Triplets (3-itemsets):
{Bread, Milk, Diaper} 3

If every subset is considered, 6C1 + 6C2 + 6C3 = 41
With support-based pruning, 6 + 6 + 3 = 15

Page 25: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Memory Usage – A-Priori

[Diagram: Pass 1 memory holds counters for the candidate items; Pass 2 memory holds the frequent items plus counters for the candidate pairs.]

Page 26: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

PCY Idea

• Improvement upon A-Priori– Uses less memory– Proposed by Park, Chen, and Yu

• Observe – during Pass 1, memory mostly idle• Idea

– Use idle memory for hash-table H
– Pass 1 – hash pairs from b into H
– Increment counter at hash location
– At end – bitmap of high-frequency hash locations
– Pass 2 – bitmap gives an extra condition for candidate pairs

• Similar to bit-vector filtering in “Bloom join”

Page 27: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Memory Usage – PCY

[Diagram: Pass 1 memory holds counters for the candidate items plus the hash table; Pass 2 memory holds the frequent items, the bitmap summarizing the hash table, and counters for the candidate pairs.]

Page 28: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

PCY Algorithm

• Pass 1– m counters and hash-table T– Linear scan of baskets b– Increment counters for each item in b– Increment hash-table counter for each item-pair in b

• Mark as frequent, f items of count at least s
• Summarize T as bitmap (count > s → bit = 1)
• Pass 2

– Counter only for F qualified pairs (Xi,Xj):• both are frequent• pair hashes to frequent bucket (bit=1)

– Linear scan of baskets b – Increment counters for candidate qualified pairs of items in b
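A minimal Python sketch of PCY as described above (mine, not from the slides). The bucket count and the use of Python's built-in hash are assumptions for illustration, and a list of booleans stands in for the one-bit-per-bucket bitmap.

from itertools import combinations
from collections import Counter

def pcy_pairs(transactions, s, n_buckets=1_000_003):   # bucket count is an assumption
    # Pass 1: item counters plus a hash table of bucket counters
    item_counts = Counter()
    buckets = [0] * n_buckets
    for basket in transactions:
        item_counts.update(basket)
        for pair in combinations(sorted(basket), 2):
            buckets[hash(pair) % n_buckets] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= s}
    bitmap = [c >= s for c in buckets]                  # bit = 1 for frequent buckets

    # Pass 2: count a pair only if both items are frequent and its bucket bit is 1
    pair_counts = Counter()
    for basket in transactions:
        kept = sorted(b for b in basket if b in frequent_items)
        for pair in combinations(kept, 2):
            if bitmap[hash(pair) % n_buckets]:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}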

Page 29: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Multistage PCY Algorithm

• Problem – False positives from hashing • New Idea

– Multiple rounds of hashing
– After Pass 1, get list of qualified pairs
– In Pass 2, hash only qualified pairs
– Fewer pairs hash to buckets → fewer false positives

(buckets with count >s, yet no pair of count >s)– In Pass 3, less likely to qualify infrequent pairs

• Repetition – reduce memory, but more passes• Failure – memory < O(f+F)

Page 30: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Memory Usage – Multistage PCY

[Diagram: memory layout across the passes of Multistage PCY — earlier passes hold the candidate-item counters and hash tables (Hash Table 1, Hash Table 2); later passes hold the frequent items, the bitmaps from the earlier hash tables (Bitmap 1, Bitmap 2), and counters for the candidate pairs.]

Page 31: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Approximation Techniques

• Goal– find all frequent k-sets– reduce to 2 passes– must lose something accuracy

• Approaches– Sampling algorithm– SON (Savasere, Omiecinski, Navathe) Algorithm– Toivonen Algorithm

Page 32: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Sampling Algorithm

• Pass 1 – load random sample of baskets in memory• Run A-Priori (or enhancement)

– Scale-down support threshold (e.g., if 1% sample, use s/100 as support threshold)
– Compute all frequent k-sets in memory from sample
– Need to leave enough space for counters

• Pass 2– Keep counters only for frequent k-sets of random sample– Get exact counts for candidates to validate

• Error?– No false positives (Pass 2)– False negatives (X frequent, but not in sample)
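A minimal Python sketch of the sampling algorithm (mine, not from the slides), restricted to pairs for brevity; it reuses the apriori_pairs sketch from earlier and treats s as an absolute count, so the threshold on the sample is scaled by the sampling rate.

import random

def sample_then_validate(transactions, s, sample_rate=0.01):
    # Pass 1: load a random sample and mine it in memory with a scaled-down threshold
    sample = [t for t in transactions if random.random() < sample_rate]
    candidates = [frozenset(p) for p in apriori_pairs(sample, s * sample_rate)]

    # Pass 2: exact counts over the full data, but only for the sample's candidates
    counts = {c: 0 for c in candidates}
    for basket in transactions:
        for c in candidates:
            if c <= basket:
                counts[c] += 1
    return {c for c, n in counts.items() if n >= s}   # validation removes false positives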

Page 33: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

SON Algorithm

• Pass 1 – Batch Processing– Scan data on disk– Repeatedly fill memory with new batch of data– Run sampling algorithm on each batch– Generate candidate frequent itemsets

• Candidate Itemsets – if frequent in some batch• Pass 2 – Validate candidate itemsets• Monotonicity Property

Itemset X is frequent overall → frequent in at least one batch

Page 34: CS 345: Topics in Data Warehousing Thursday, November 18, 2004

Toivonen’s Algorithm

• Lower Threshold in Sampling Algorithm
  – Example – if support threshold is 1%, use 0.8% as support threshold when evaluating sample
  – Goal – overkill to avoid any false negatives

• Negative Border– Itemset X infrequent in sample, but all subsets are frequent– Example: AB, BC, AC frequent, but ABC infrequent

• Pass 2– Count candidates and negative border – Negative border itemsets all infrequent candidates are exactly

the frequent itemsets– Otherwise? – start over!

• Achievement? – reduced failure probability, while keeping candidate-count low enough for memory
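A minimal Python sketch of the negative-border computation (mine, not from the slides): an itemset is in the negative border if it was not found frequent in the sample but every one of its immediate subsets was. Brute-force enumeration up to a maximum size is used for clarity; in Pass 2 both the candidates and this border are counted, and if any border itemset turns out frequent overall the algorithm starts over.

from itertools import combinations

def negative_border(frequent, items, max_size):
    # frequent: frozensets found frequent in the sample (the empty set counts as frequent)
    border = set()
    for k in range(1, max_size + 1):
        for combo in combinations(sorted(items), k):
            cand = frozenset(combo)
            if cand in frequent:
                continue
            subsets = [frozenset(sub) for sub in combinations(cand, k - 1)]
            if all(sub in frequent or not sub for sub in subsets):
                border.add(cand)   # infrequent, but all immediate subsets frequent
    return border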