March 16, 2022 — Data Mining: Concepts and Techniques — Chapter 5

10.Pertemuan Kesepuluh.ppt (Indonesian: "Tenth Meeting")


  • Data Mining: Concepts and Techniques

    Chapter 5

  • Chapter 5: Mining Frequent Patterns, Associations and Correlations

    - Basic concepts and a road map
    - Efficient and scalable frequent itemset mining methods
    - Mining various kinds of association rules
    - From association mining to correlation analysis
    - Constraint-based association mining
    - Summary

  • What Is Frequent Pattern Analysis?

    - Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
    - First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
    - Motivation: finding inherent regularities in data
      - What products were often purchased together? Beer and diapers?!
      - What are the subsequent purchases after buying a PC?
      - What kinds of DNA are sensitive to this new drug?
      - Can we automatically classify web documents?
    - Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

  • Why Is Frequent Pattern Mining Important?

    - Discloses an intrinsic and important property of data sets
    - Forms the foundation for many essential data mining tasks:
      - Association, correlation, and causality analysis
      - Sequential and structural (e.g., sub-graph) patterns
      - Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
      - Classification: associative classification
      - Cluster analysis: frequent pattern-based clustering
      - Data warehousing: iceberg cube and cube-gradient
      - Semantic data compression: fascicles
    - Broad applications

  • Basic Concepts: Frequent Patterns and Association Rules

    - Itemset X = {x1, ..., xk}
    - Find all the rules X => Y with minimum support and confidence
      - support, s: probability that a transaction contains X ∪ Y
      - confidence, c: conditional probability that a transaction having X also contains Y
    - Let supmin = 50%, confmin = 50%
      - Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
      - Association rules: A => D (60%, 100%), D => A (60%, 75%)

    Transaction-id   Items bought
    10               A, B, D
    20               A, C, D
    30               A, D, E
    40               B, E, F
    50               B, C, D, E, F
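The support and confidence figures above can be checked directly. A minimal Python sketch of the two measures, using the slide's five-transaction database (the helper names are illustrative, not from the slides):

```python
# Transaction database from the slide (min_sup = min_conf = 50%).
transactions = [
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "D", "E"},
    {"B", "E", "F"},
    {"B", "C", "D", "E", "F"},
]

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    s = set(itemset)
    return sum(s <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """conf(X => Y) = sup(X u Y) / sup(X)."""
    return support(set(X) | set(Y)) / support(X)

print(support({"A", "D"}))                 # 0.6  -> A => D has 60% support
print(confidence({"A"}, {"D"}))            # 1.0  -> A => D: 100% confidence
print(round(confidence({"D"}, {"A"}), 2))  # 0.75 -> D => A: 75% confidence
```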

  • Closed Patterns and Max-Patterns

    - A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, ..., a100} contains C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30 sub-patterns!
    - Solution: mine closed patterns and max-patterns instead
    - An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
    - An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
    - Closed patterns are a lossless compression of frequent patterns
      - Reduces the number of patterns and rules

  • Closed Patterns and Max-Patterns

    - Exercise. DB = {⟨a1, ..., a100⟩, ⟨a1, ..., a50⟩}, min_sup = 1
      - What is the set of closed itemsets?
        ⟨a1, ..., a100⟩: 1
        ⟨a1, ..., a50⟩: 2
      - What is the set of max-patterns?
        ⟨a1, ..., a100⟩: 1
      - What is the set of all patterns? Far too many to enumerate!

  • Chapter 5: Mining Frequent Patterns, Associations and Correlations

    - Basic concepts and a road map
    - Efficient and scalable frequent itemset mining methods
    - Mining various kinds of association rules
    - From association mining to correlation analysis
    - Constraint-based association mining
    - Summary

  • Scalable Methods for Mining Frequent Patterns

    - The downward closure property of frequent patterns
      - Any subset of a frequent itemset must be frequent
      - If {beer, diaper, nuts} is frequent, so is {beer, diaper}, since every transaction having {beer, diaper, nuts} also contains {beer, diaper}
    - Scalable mining methods: three major approaches
      - Apriori (Agrawal & Srikant @ VLDB'94)
      - Frequent pattern growth (FPgrowth: Han, Pei & Yin @ SIGMOD'00)
      - Vertical data format approach (CHARM: Zaki & Hsiao @ SDM'02)

  • Apriori: A Candidate Generation-and-Test Approach

    - Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant @ VLDB'94; Mannila, et al. @ KDD'94)
    - Method:
      - Initially, scan the DB once to get the frequent 1-itemsets
      - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
      - Test the candidates against the DB
      - Terminate when no frequent or candidate set can be generated

  • The Apriori Algorithm: An Example (supmin = 2)

    Database TDB:
      Tid   Items
      10    A, C, D
      20    B, C, E
      30    A, B, C, E
      40    B, E

    1st scan -> C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
    L1 (prune {D}, sup < 2): {A}:2, {B}:3, {C}:3, {E}:3
    C2 (self-join L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
    2nd scan -> C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
    L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
    C3: {B,C,E}
    3rd scan -> L3: {B,C,E}:2

  • The Apriori Algorithm (pseudo-code)

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1
            that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
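The pseudo-code above translates almost line-for-line into a runnable sketch. This is an illustrative implementation, not the book's reference code; `apriori` and its internals are hypothetical names:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori over a list of sets; min_support is an
    absolute count.  Returns {frequent itemset (frozenset): count}."""
    # L1 = {frequent items}
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Candidate generation: extend each frequent k-itemset, keeping
        # only candidates whose every k-subset is frequent (Apriori principle).
        items = sorted({i for s in Lk for i in s})
        candidates = set()
        for s in Lk:
            for i in items:
                c = s | {i}
                if len(c) == k + 1 and all(
                    frozenset(sub) in Lk for sub in combinations(c, k)
                ):
                    candidates.add(frozenset(c))
        # Scan the database to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(Lk)
        k += 1
    return frequent

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(db, min_support=2)
print(result[frozenset({"B", "C", "E"})])  # 2, matching L3 in the example
```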

  • Important Details of Apriori

    - How to generate candidates?
      - Step 1: self-joining Lk
      - Step 2: pruning
    - How to count supports of candidates?
    - Example of candidate generation:
      - L3 = {abc, abd, acd, ace, bcd}
      - Self-joining L3*L3: abcd from abc and abd; acde from acd and ace
      - Pruning: acde is removed because ade is not in L3
      - C4 = {abcd}

  • How to Generate Candidates?

    - Suppose the items in Lk-1 are listed in an order
    - Step 1: self-joining Lk-1
        insert into Ck
        select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
        from Lk-1 p, Lk-1 q
        where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
    - Step 2: pruning
        forall itemsets c in Ck do
            forall (k-1)-subsets s of c do
                if (s is not in Lk-1) then delete c from Ck
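The join and prune steps can be sketched directly in code; the example reproduces the slide's C4 computation from L3 = {abc, abd, acd, ace, bcd}. The function name is illustrative:

```python
from itertools import combinations

def generate_candidates(Lk, k):
    """Self-join Lk (itemsets of size k, as sorted tuples) and prune any
    candidate that has an infrequent k-subset, per the two-step procedure."""
    Lk = {tuple(sorted(s)) for s in Lk}
    Ck1 = set()
    for p in Lk:
        for q in Lk:
            # Join step: first k-1 items equal, last item of p < last of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune step: every k-subset of c must be in Lk.
                if all(s in Lk for s in combinations(c, k)):
                    Ck1.add(c)
    return Ck1

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(generate_candidates(L3, 3))  # abcd survives; acde is pruned (ade not in L3)
```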

  • How to Count Supports of Candidates?

    - Why is counting supports of candidates a problem?
      - The total number of candidates can be huge
      - One transaction may contain many candidates
    - Method:
      - Candidate itemsets are stored in a hash-tree
      - A leaf node of the hash-tree contains a list of itemsets and counts
      - An interior node contains a hash table
      - Subset function: finds all the candidates contained in a transaction

  • Example: Counting Supports of Candidates

    - Transaction: 1 2 3 5 6
    - Subsets are enumerated by recursive splitting:
      1 + 2 3 5 6
      1 2 + 3 5 6
      1 3 + 5 6

  • Efficient Implementation of Apriori in SQL

    - Hard to get good performance out of pure SQL (SQL-92) based approaches alone
    - Make use of object-relational extensions like UDFs, BLOBs, table functions, etc. to get orders-of-magnitude improvement
    - S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD'98

  • Challenges of Frequent Pattern Mining

    - Challenges:
      - Multiple scans of the transaction database
      - Huge number of candidates
      - Tedious workload of support counting for candidates
    - Improving Apriori: general ideas
      - Reduce passes of transaction-database scans
      - Shrink the number of candidates
      - Facilitate support counting of candidates

  • Partition: Scan Database Only Twice

    - Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
    - Scan 1: partition the database and find local frequent patterns
    - Scan 2: consolidate global frequent patterns
    - A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95
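A minimal sketch of the two-scan Partition idea, using a brute-force local miner for clarity. The tiny database and all names are illustrative; the point is that the union of local frequent itemsets is a complete global candidate set:

```python
from itertools import combinations

def local_frequent(part, rel_sup):
    """Brute-force all itemsets frequent within one partition
    (at the same *relative* support threshold as the whole DB)."""
    out = set()
    items = sorted({i for t in part for i in t})
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            if sum(set(cand) <= t for t in part) / len(part) >= rel_sup:
                out.add(frozenset(cand))
    return out

db = [{"a","b"}, {"a","c"}, {"a","b"}, {"b","c"}, {"a","b","c"}, {"a","b"}]
rel_sup = 0.5
parts = [db[:3], db[3:]]

# Scan 1: union of local frequent itemsets = global candidate set.
candidates = set().union(*(local_frequent(p, rel_sup) for p in parts))
# Scan 2: verify each candidate against the whole database.
globally_frequent = {
    c for c in candidates
    if sum(c <= t for t in db) / len(db) >= rel_sup
}
print(sorted(map(sorted, globally_frequent)))
```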

  • DHP: Reduce the Number of Candidates

    - A k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent
      - Candidates: a, b, c, d, e
      - Hash entries: {ab, ad, ae}, {bd, be, de}, ...
      - Frequent 1-itemsets: a, b, d, e
      - ab is not a candidate 2-itemset if the total count of the bucket {ab, ad, ae} is below the support threshold
    - J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95
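A sketch of DHP's bucket filter; the hash function and bucket count here are arbitrary illustrative choices, and the transactions reuse the earlier Apriori example. A bucket total only upper-bounds a pair's support, so the filter can rule candidates out but never in:

```python
from itertools import combinations

transactions = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
min_sup = 2
NUM_BUCKETS = 7
buckets = [0] * NUM_BUCKETS

def bucket(pair):
    """Deterministic toy hash of a sorted item pair into a bucket index."""
    a, b = pair
    return (ord(a) * 31 + ord(b)) % NUM_BUCKETS

# While the 1-itemset counting scan runs, also hash every 2-itemset
# of each transaction into the table.
for t in transactions:
    for pair in combinations(sorted(t), 2):
        buckets[bucket(pair)] += 1

def may_be_frequent(pair):
    """False means the pair's bucket total, hence its support, is < min_sup."""
    return buckets[bucket(pair)] >= min_sup
```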

  • Sampling for Frequent Patterns

    - Select a sample of the original database; mine frequent patterns within the sample using Apriori
    - Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns are checked
      - Example: check abcd instead of ab, ac, ..., etc.
    - Scan the database again to find missed frequent patterns
    - H. Toivonen. Sampling large databases for association rules. In VLDB'96

  • DIC: Reduce the Number of Scans

    - Once both A and D are determined frequent, the counting of AD begins
    - Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
    [Figure: itemset lattice over {A, B, C, D}; DIC starts counting 2- and 3-itemsets mid-scan, while Apriori completes a full pass per level]
    - S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97

  • Bottleneck of Frequent-pattern Mining

    - Multiple database scans are costly
    - Mining long patterns needs many passes of scanning and generates lots of candidates
      - To find the frequent itemset i1 i2 ... i100:
        - Number of scans: 100
        - Number of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30!
    - Bottleneck: candidate generation and test
    - Can we avoid candidate generation?

  • Mining Frequent Patterns Without Candidate Generation

    - Grow long patterns from short ones using local frequent items
      - abc is a frequent pattern
      - Get all transactions having abc: DB|abc
      - If d is a local frequent item in DB|abc, then abcd is a frequent pattern

  • Construct FP-tree from a Transaction Database (min_support = 3)

    TID   Items bought                (ordered) frequent items
    100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
    200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
    300   {b, f, h, j, o, w}          {f, b}
    400   {b, c, k, s, p}             {c, b, p}
    500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

    - Scan the DB once, find frequent 1-itemsets (single-item patterns)
    - Sort frequent items in frequency-descending order: the f-list
    - Scan the DB again, construct the FP-tree
    - F-list = f-c-a-b-m-p
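The second scan's tree construction can be sketched compactly. This builds only the prefix tree for the slide's database and f-list; the node-links and header table that a full FP-tree also maintains are omitted, and the class/function names are illustrative:

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, flist):
    """Insert each transaction's frequent items, ordered by the f-list,
    into a shared prefix tree, incrementing counts along the path."""
    rank = {item: r for r, item in enumerate(flist)}
    root = Node(None, None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.setdefault(item, Node(item, node))
            child.count += 1
            node = child
    return root

# The slide's database (min_support = 3) and its f-list f-c-a-b-m-p.
db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"),
      set("afcelpmn")]
root = build_fp_tree(db, list("fcabmp"))
print(root.children["f"].count)                # 4
print(root.children["f"].children["c"].count)  # 3
```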

  • Benefits of the FP-tree Structure

    - Completeness
      - Preserves complete information for frequent pattern mining
      - Never breaks a long pattern of any transaction
    - Compactness
      - Reduces irrelevant info: infrequent items are gone
      - Items in frequency-descending order: the more frequently occurring, the more likely to be shared
      - Never larger than the original database (not counting node-links and the count fields)
      - For the Connect-4 DB, the compression ratio can be over 100

  • Partition Patterns and Databases

    - Frequent patterns can be partitioned into subsets according to the f-list (f-c-a-b-m-p):
      - Patterns containing p
      - Patterns having m but no p
      - Patterns having c but none of a, b, m, p
      - Pattern f
    - This partitioning is complete and non-redundant

  • Find Patterns Having p From p's Conditional Database

    - Start at the frequent-item header table of the FP-tree
    - Traverse the FP-tree by following the node-links of each frequent item p
    - Accumulate all transformed prefix paths of item p to form p's conditional pattern base

    Conditional pattern bases:
      item   cond. pattern base
      c      f:3
      a      fc:3
      b      fca:1, f:1, c:1
      m      fca:2, fcab:1
      p      fcam:2, cb:1

  • From Conditional Pattern-bases to Conditional FP-trees

    - For each pattern base:
      - Accumulate the count for each item in the base
      - Construct the FP-tree for the frequent items of the pattern base
    - m-conditional pattern base: fca:2, fcab:1
    - All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

    Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3
    [Figure: global FP-tree with paths f:4-c:3-a:3-m:2-p:2, f:4-c:3-a:3-b:1-m:1, f:4-b:1, and c:1-b:1-p:1]

  • Recursion: Mining Each Conditional FP-tree

    - Cond. pattern base of am: (fc:3) -> am-conditional FP-tree: {} - f:3 - c:3
    - Cond. pattern base of cm: (f:3) -> cm-conditional FP-tree: {} - f:3
    - Cond. pattern base of cam: (f:3) -> cam-conditional FP-tree: {} - f:3

  • A Special Case: Single Prefix Path in FP-tree

    - Suppose a (conditional) FP-tree T has a shared single prefix-path P
    - Mining can be decomposed into two parts:
      - Reduction of the single prefix path into one node
      - Concatenation of the mining results of the two parts

  • Mining Frequent Patterns With FP-trees

    - Idea: frequent pattern growth
      - Recursively grow frequent patterns by pattern and database partition
    - Method:
      - For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
      - Repeat the process on each newly created conditional FP-tree
      - Until the resulting FP-tree is empty, or contains only a single path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern

  • Scaling FP-growth by DB Projection

    - What if the FP-tree cannot fit in memory? DB projection:
      - First partition the database into a set of projected DBs
      - Then construct and mine an FP-tree for each projected DB
    - Parallel projection vs. partition projection techniques
      - Parallel projection is space-costly

  • Partition-based Projection

    - Parallel projection needs a lot of disk space
    - Partition projection saves it

  • FP-Growth vs. Apriori: Scalability With the Support Threshold

    - Data set T25I20D10K
    [Chart: run time (sec.) vs. support threshold (%) for D1 FP-growth and D1 Apriori; embedded spreadsheet data omitted]

  • FP-Growth vs. Tree-Projection: Scalability with the Support Threshold

    - Data set T25I20D100K
    [Chart: run time (sec.) vs. support threshold (%) for D2 FP-growth and D2 TreeProjection; embedded spreadsheet data omitted]

  • Why Is FP-Growth the Winner?

    - Divide-and-conquer:
      - Decomposes both the mining task and the DB according to the frequent patterns obtained so far
      - Leads to focused search of smaller databases
    - Other factors:
      - No candidate generation, no candidate test
      - Compressed database: the FP-tree structure
      - No repeated scan of the entire database
      - Basic ops are counting local frequent items and building sub-FP-trees; no pattern search and matching

  • Implications of the Methodology

    - Mining closed frequent itemsets and max-patterns: CLOSET (DMKD'00)
    - Mining sequential patterns: FreeSpan (KDD'00), PrefixSpan (ICDE'01)
    - Constraint-based mining of frequent patterns: convertible constraints (KDD'00, ICDE'01)
    - Computing iceberg data cubes with complex measures: H-tree and H-cubing algorithm (SIGMOD'01)

  • MaxMiner: Mining Max-patterns

    - 1st scan: find frequent items: A, B, C, D, E
    - 2nd scan: find support for the potential max-patterns:
      - AB, AC, AD, AE, ABCDE
      - BC, BD, BE, BCDE
      - CD, CE, CDE, DE
    - Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in later scans
    - R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98

    Tid   Items
    10    A, B, C, D, E
    20    B, C, D, E
    30    A, C, D, F

  • Mining Frequent Closed Patterns: CLOSET (min_sup = 2)

    - F-list: list of all frequent items in support-ascending order
      - F-list = d-a-f-e-c
    - Divide the search space:
      - Patterns having d
      - Patterns having d but no a, etc.
    - Find frequent closed patterns recursively
      - Every transaction having d also has cfa, so cfad is a frequent closed pattern
    - J. Pei, J. Han & R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00

    TID   Items
    10    a, c, d, e, f
    20    a, b, e
    30    c, e, f
    40    a, c, d, f
    50    c, e, f

  • CLOSET+: Mining Closed Itemsets by Pattern-Growth

    - Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
    - Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), then X and all of X's descendants in the set-enumeration tree can be pruned
    - Hybrid tree projection:
      - Bottom-up physical tree projection
      - Top-down pseudo tree projection
    - Item skipping: if a local frequent item has the same support in several header tables at different levels, it can be pruned from the header tables at higher levels
    - Efficient subset checking

  • CHARM: Mining by Exploring Vertical Data Format
    Vertical format: t(AB) = {T11, T25, ...}
      tid-list: list of transaction ids containing an itemset
    Deriving closed patterns based on vertical intersections:
      t(X) = t(Y): X and Y always happen together
      t(X) ⊂ t(Y): a transaction having X always has Y
    Using diffsets to accelerate mining:
      Only keep track of differences of tids
      t(X) = {T1, T2, T3}, t(XY) = {T1, T3} ⇒ Diffset(XY, X) = {T2}
    Eclat/MaxEclat (Zaki et al. @KDD'97), VIPER (P. Shenoy et al. @SIGMOD'00), CHARM (Zaki & Hsiao @SDM'02)
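The diffset bookkeeping above can be sketched with plain Python sets, using the slide's toy tid-lists; the point is that sup(XY) can be maintained as sup(X) minus the diffset size, so only the (usually small) differences need storing.

```python
# Diffset example from the CHARM slide: tids containing X but not XY.
t_X = {"T1", "T2", "T3"}    # tid-list of itemset X
t_XY = {"T1", "T3"}         # tid-list of itemset XY

diffset = t_X - t_XY                 # Diffset(XY, X)
sup_XY = len(t_X) - len(diffset)     # support recovered without storing t(XY)
```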

  • Further Improvements of Mining Methods
    AFOPT (Liu, et al. @KDD'03)
      A "push-right" method for mining condensed frequent pattern (CFP) trees
    Carpenter (Pan, et al. @KDD'03)
      Mines data sets with few rows but numerous columns
      Constructs a row-enumeration tree for efficient mining

  • Visualization of Association Rules: Plane Graph

  • Visualization of Association Rules: Rule Graph

  • Visualization of Association Rules (SGI/MineSet 3.0)

  • Chapter 5: Mining Frequent Patterns, Associations and Correlations
    Basic concepts and a road map
    Efficient and scalable frequent itemset mining methods
    Mining various kinds of association rules
    From association mining to correlation analysis
    Constraint-based association mining
    Summary

  • Mining Various Kinds of Association Rules
    Mining multilevel association
    Mining multidimensional association
    Mining quantitative association
    Mining interesting correlation patterns

  • Mining Multiple-Level Association Rules
    Items often form hierarchies
    Flexible support settings: items at the lower level are expected to have lower support
    Exploration of shared multi-level mining (Agrawal & Srikant @VLDB'95, Han & Fu @VLDB'95)

  • Multi-level Association: Redundancy Filtering
    Some rules may be redundant due to "ancestor" relationships between items.
    Example:
      milk ⇒ wheat bread [support = 8%, confidence = 70%]
      2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
    We say the first rule is an ancestor of the second rule.
    A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.
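The redundancy test above can be sketched numerically. Note the 25% market share of "2% milk" among all milk sales is a hypothetical figure chosen for illustration (the slide does not give it): under that assumption the descendant rule's support matches its expected value and the rule is filtered out.

```python
# Redundancy-filtering sketch for the slide's example.
ancestor_support = 0.08    # milk => wheat bread
rule_support = 0.02        # 2% milk => wheat bread
share_of_parent = 0.25     # ASSUMED fraction of milk sales that are 2% milk
tolerance = 0.005          # "close to expected" threshold (also assumed)

expected = ancestor_support * share_of_parent       # 0.02
is_redundant = abs(rule_support - expected) <= tolerance
```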

  • Mining Multi-Dimensional Association
    Single-dimensional rules:
      buys(X, "milk") ⇒ buys(X, "bread")
    Multi-dimensional rules: ≥ 2 dimensions or predicates
      Inter-dimension assoc. rules (no repeated predicates):
        age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
      Hybrid-dimension assoc. rules (repeated predicates):
        age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
    Categorical attributes: finite number of possible values, no ordering among values (data cube approach)
    Quantitative attributes: numeric, implicit ordering among values (discretization, clustering, and gradient approaches)

  • Mining Quantitative Associations
    Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
      Static discretization based on predefined concept hierarchies (data cube methods)
      Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant @SIGMOD'96)
      Clustering: distance-based association (e.g., Yang & Miller @SIGMOD'97): one-dimensional clustering, then association
      Deviation (e.g., Aumann and Lindell @KDD'99):
        Sex = female ⇒ Wage: mean = $7/hr (overall mean = $9)

  • Static Discretization of Quantitative Attributes
    Discretized prior to mining using a concept hierarchy.
    Numeric values are replaced by ranges.
    In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans.
    A data cube is well suited for mining:
      The cells of an n-dimensional cuboid correspond to the predicate sets.
      Mining from data cubes can be much faster.

  • Quantitative Association Rules
    age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")
    Proposed by Lent, Swami and Widom @ICDE'97
    Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized
    2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
    Cluster adjacent association rules to form general rules using a 2-D grid
    Example

  • Mining Other Interesting Patterns
    Flexible support constraints (Wang et al. @VLDB'02)
      Some items (e.g., diamonds) may occur rarely but are valuable
      Customized min_sup specification and application
    Top-k closed frequent patterns (Han, et al. @ICDM'02)
      Hard to specify min_sup, but top-k with min_length is more desirable
      Dynamically raise min_sup during FP-tree construction and mining, and select the most promising path to mine

  • Chapter 5: Mining Frequent Patterns, Associations and Correlations
    Basic concepts and a road map
    Efficient and scalable frequent itemset mining methods
    Mining various kinds of association rules
    From association mining to correlation analysis
    Constraint-based association mining
    Summary

  • Interestingness Measure: Correlations (Lift)
    play basketball ⇒ eat cereal [40%, 66.7%] is misleading:
      the overall % of students eating cereal is 75% > 66.7%
    play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
    Measure of dependent/correlated events: lift(A, B) = P(A ∧ B) / (P(A) · P(B))

               | Basketball | Not basketball | Sum (row)
    Cereal     | 2000       | 1750           | 3750
    Not cereal | 1000       | 250            | 1250
    Sum (col.) | 3000       | 2000           | 5000
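Plugging the contingency table above into the lift formula reproduces the slide's claim: basketball and cereal are negatively correlated (lift < 1) even though the rule's confidence looks high, while basketball and "not cereal" are positively correlated.

```python
# Lift computed from the slide's 2x2 contingency table (5000 students).
n = 5000
basketball, cereal = 3000, 3750
basketball_and_cereal = 2000
basketball_and_no_cereal = 1000

def lift(n_ab, n_a, n_b, n):
    """lift(A, B) = P(A and B) / (P(A) * P(B))."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

l_cereal = lift(basketball_and_cereal, basketball, cereal, n)
l_no_cereal = lift(basketball_and_no_cereal, basketball, n - cereal, n)
```

Here l_cereal = 0.4 / (0.6 · 0.75) ≈ 0.89 < 1, and l_no_cereal = 0.2 / (0.6 · 0.25) ≈ 1.33 > 1.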

  • Are lift and χ² Good Measures of Correlation?
    "Buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk
    Support and confidence are not good measures of correlation
    So many interestingness measures! (Tan, Kumar, Srivastava @KDD'02)

               | Milk  | No Milk | Sum (row)
    Coffee     | m, c  | ~m, c   | c
    No Coffee  | m, ~c | ~m, ~c  | ~c
    Sum (col.) | m     | ~m      |

    DB | m, c | ~m, c | m, ~c | ~m, ~c  | lift | all-conf | coh  | χ²
    A1 | 1000 | 100   | 100   | 10,000  | 9.26 | 0.91     | 0.83 | 9055
    A2 | 100  | 1000  | 1000  | 100,000 | 8.44 | 0.09     | 0.05 | 670
    A3 | 1000 | 100   | 10000 | 100,000 | 9.18 | 0.09     | 0.09 | 8172
    A4 | 1000 | 1000  | 1000  | 1000    | 1    | 0.5      | 0.33 | 0

  • Which Measures Should Be Used?
    lift and χ² are not good measures for correlations in large transactional DBs
    all-conf or coherence could be good measures (Omiecinski @TKDE'03)
    Both all-conf and coherence have the downward closure property
    Efficient algorithms can be derived for mining (Lee et al. @ICDM'03sub)
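The all-confidence values in the DB table above are easy to verify by hand; a sketch for database A1, using the standard definition all_conf(X) = sup(X) / max item support in X:

```python
# all-confidence for the slide's DB A1.
mc, not_m_c, m_not_c = 1000, 100, 100   # counts of (m,c), (~m,c), (m,~c)

sup_mc = mc
sup_m = mc + m_not_c     # 1100 transactions contain milk
sup_c = mc + not_m_c     # 1100 transactions contain coffee

all_conf = sup_mc / max(sup_m, sup_c)   # 1000 / 1100
```

This matches the table's 0.91 for A1; unlike lift, the measure is unaffected by the large ~m,~c count (null transactions).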

  • Chapter 5: Mining Frequent Patterns, Associations and Correlations
    Basic concepts and a road map
    Efficient and scalable frequent itemset mining methods
    Mining various kinds of association rules
    From association mining to correlation analysis
    Constraint-based association mining
    Summary

  • Constraint-Based (Query-Directed) Mining
    Finding all the patterns in a database autonomously? Unrealistic!
      The patterns could be too many, but not focused!
    Data mining should be an interactive process
      The user directs what is to be mined using a data mining query language (or a graphical user interface)
    Constraint-based mining
      User flexibility: provides constraints on what is to be mined
      System optimization: explores such constraints for efficient mining, i.e., constraint-based mining

  • Constraints in Data Mining
    Knowledge type constraint: classification, association, etc.
    Data constraint, using SQL-like queries:
      find product pairs sold together in stores in Chicago in Dec.'02
    Dimension/level constraint:
      in relevance to region, price, brand, customer category
    Rule (or pattern) constraint:
      small sales (price < $10) triggers big sales (sum > $200)
    Interestingness constraint:
      strong rules: min_support ≥ 3%, min_confidence ≥ 60%

  • Constrained Mining vs. Constraint-Based Search
    Constrained mining vs. constraint-based search/reasoning
      Both are aimed at reducing the search space
      Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI
      Constraint-pushing vs. heuristic search
      How to integrate them is an interesting research problem
    Constrained mining vs. query processing in DBMS
      Database query processing requires finding all answers
      Constrained pattern mining shares a similar philosophy with pushing selections deeply into query processing

  • Anti-Monotonicity in Constraint Pushing
    Anti-monotonicity: when an itemset S violates the constraint, so does any of its supersets
      sum(S.Price) ≤ v is anti-monotone
      sum(S.Price) ≥ v is not anti-monotone
    Example. C: range(S.profit) ≤ 15 is anti-monotone
      Itemset ab violates C
      So does every superset of ab
    TDB (min_sup = 2)

    TID | Transaction
    10  | a, b, c, d, f
    20  | b, c, d, f, g, h
    30  | a, c, d, e, f
    40  | c, e, f, g

    Item | Profit
    a    | 40
    b    | 0
    c    | -20
    d    | 10
    e    | -30
    f    | 30
    g    | 20
    h    | -10
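The example above can be checked directly: range(ab) = 40 − 0 = 40 > 15, and since adding items can only widen (never shrink) the profit range, every superset of ab violates C as well, which is exactly what licenses pruning.

```python
# Anti-monotonicity of C: range(S.profit) <= 15, on the slide's profits.
profit = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}

def rng(itemset):
    """range(S.profit) = max profit - min profit over the itemset."""
    vals = [profit[i] for i in itemset]
    return max(vals) - min(vals)

violates = rng({"a", "b"}) > 15   # ab violates C
# Every one-item extension of ab (hence every superset) also violates C.
supersets_violate = all(rng({"a", "b", x}) > 15 for x in profit if x not in "ab")
```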

  • Monotonicity for Constraint Pushing
    Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets
      sum(S.Price) ≥ v is monotone
      min(S.Price) ≤ v is monotone
    Example. C: range(S.profit) ≥ 15
      Itemset ab satisfies C
      So does every superset of ab
    TDB (min_sup = 2)

    TID | Transaction
    10  | a, b, c, d, f
    20  | b, c, d, f, g, h
    30  | a, c, d, e, f
    40  | c, e, f, g

    Item | Profit
    a    | 40
    b    | 0
    c    | -20
    d    | 10
    e    | -30
    f    | 30
    g    | 20
    h    | -10

  • Succinctness
    Succinctness: given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
    Idea: whether an itemset S satisfies constraint C can be determined based on the selection of items, without looking at the transaction database
      min(S.Price) ≤ v is succinct
      sum(S.Price) ≥ v is not succinct
    Optimization: if C is succinct, C is pre-counting pushable

  • The Apriori Algorithm: Example
    Flow: Database D → (scan D) → C1 → L1 → C2 → (scan D) → L2 → C3 → (scan D) → L3

    Database D (TID | Items):
    100 | 1, 3, 4
    200 | 2, 3, 5
    300 | 1, 2, 3, 5
    400 | 2, 5

    C1 (itemset | sup): {1} 2; {2} 3; {3} 3; {4} 1; {5} 3
    L1 (itemset | sup): {1} 2; {2} 3; {3} 3; {5} 3
    C2 (candidates): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
    C2 (itemset | sup): {1 2} 1; {1 3} 2; {1 5} 1; {2 3} 2; {2 5} 3; {3 5} 2
    L2 (itemset | sup): {1 3} 2; {2 3} 2; {2 5} 3; {3 5} 2
    C3 (candidate): {2 3 5}
    L3 (itemset | sup): {2 3 5} 2
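The level-wise run above can be sketched as a compact Apriori implementation over the same database D (min_sup = 2, absolute count), with the join step (union frequent (k−1)-itemsets into k-candidates) and the prune step (every (k−1)-subset must be frequent) made explicit:

```python
from itertools import combinations

# The slide's database D and absolute minimum support.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_sup = 2

def count(candidates, D):
    """Support count of each candidate itemset in D."""
    return {c: sum(1 for t in D if c <= t) for c in candidates}

def apriori(D, min_sup):
    items = sorted(set().union(*D))
    L1 = {c: s for c, s in count([frozenset([i]) for i in items], D).items()
          if s >= min_sup}
    levels = [L1]
    k = 2
    while levels[-1]:
        prev = list(levels[-1])
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        cands = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        cands = [c for c in cands
                 if all(frozenset(s) in levels[-1] for s in combinations(c, k - 1))]
        Lk = {c: s for c, s in count(cands, D).items() if s >= min_sup}
        if not Lk:
            break
        levels.append(Lk)
        k += 1
    return levels            # [L1, L2, L3, ...]

levels = apriori(D, min_sup)
```

Running it reproduces the slide's tables: L2 = {13, 23, 25, 35} and L3 = {235} with support 2; {1 2 3} and {1 3 5} are pruned because {1 2} and {1 5} are infrequent.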

  • Naïve Algorithm: Apriori + Constraint
    Constraint: Sum{S.price} < 5
    (Same Database D, C1, L1, C2, L2, C3, L3 tables as in the Apriori example above.)

  • The Constrained Apriori Algorithm: Push an Anti-monotone Constraint Deep
    Constraint: Sum{S.price} < 5
    (Same Database D, C1, L1, C2, L2, C3, L3 tables as in the Apriori example above.)

  • The Constrained Apriori Algorithm: Push a Succinct Constraint Deep
    (Same Apriori example as above.)
    Constraint: min{S.price}
  • Converting "Tough" Constraints
    Convert tough constraints into anti-monotone or monotone ones by properly ordering items
    Examine C: avg(S.profit) ≥ 25
      Order items in value-descending order
      If an itemset afb violates C, so does afbh, afb*: it becomes anti-monotone!
    TDB (min_sup = 2)

    TID | Transaction
    10  | a, b, c, d, f
    20  | b, c, d, f, g, h
    30  | a, c, d, e, f
    40  | c, e, f, g

    Item | Profit
    a    | 40
    b    | 0
    c    | -20
    d    | 10
    e    | -30
    f    | 30
    g    | 20
    h    | -10

  • Strongly Convertible Constraints
    avg(X) ≥ 25 is convertible anti-monotone w.r.t. item-value-descending order R: if an itemset af violates constraint C, so does every itemset with af as a prefix, such as afd
    avg(X) ≥ 25 is convertible monotone w.r.t. item-value-ascending order R⁻¹: if an itemset d satisfies constraint C, so do itemsets df and dfa, which have d as a prefix
    Thus, avg(X) ≥ 25 is strongly convertible

    Item | Profit
    a    | 40
    b    | 0
    c    | -20
    d    | 10
    e    | -30
    f    | 30
    g    | 20
    h    | -10

  • Can Apriori Handle Convertible Constraints?
    A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
      Within the level-wise framework, no direct pruning based on the constraint can be made
      Itemset df violates constraint C: avg(X) ≥ 25
      Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
    But it can be pushed into the frequent-pattern growth framework!

    Item | Value
    a    | 40
    b    | 0
    c    | -20
    d    | 10
    e    | -30
    f    | 30
    g    | 20
    h    | -10

  • Mining With Convertible Constraints
    C: avg(X) ≥ 25, min_sup = 2
    List items in every transaction in value-descending order R
      C is convertible anti-monotone w.r.t. R
    Scan TDB once
      Remove infrequent items: item h is dropped
      Itemsets a and f are good candidates
    Projection-based mining
      Impose an appropriate order on item projection
      Many tough constraints can be converted into (anti-)monotone ones
    TDB (min_sup = 2)

    TID | Transaction
    10  | a, f, d, b, c
    20  | f, g, d, b, c
    30  | a, f, d, c, e
    40  | f, g, h, c, e

    Item | Value
    a    | 40
    f    | 30
    g    | 20
    d    | 10
    b    | 0
    h    | -10
    c    | -20
    e    | -30
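Why the value-descending order R makes avg(X) ≥ 25 anti-monotone can be checked on the slide's item values: every item appended after a prefix (in R-order) has a value no larger than anything already in the prefix, so extending a prefix can never raise its average. A small sketch of that check:

```python
# Convertibility of C: avg(X) >= 25 under value-descending order R.
value = {"a": 40, "f": 30, "g": 20, "d": 10, "b": 0, "h": -10, "c": -20, "e": -30}
R = sorted(value, key=value.get, reverse=True)     # a, f, g, d, b, h, c, e

def avg(itemset):
    return sum(value[i] for i in itemset) / len(itemset)

prefix = ["a", "f", "b"]        # avg = 70/3 ~ 23.3, violates C
# Every item after b in R has value <= 0, so no extension can restore C.
extensions_stay_below = all(avg(prefix + [x]) < 25
                            for x in R[R.index("b") + 1:])
```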

  • Handling Multiple Constraints
    Different constraints may require different, or even conflicting, item orderings
    If there exists an order R such that both C1 and C2 are convertible w.r.t. R, then there is no conflict between the two convertible constraints
    If the item orders conflict:
      Try to satisfy one constraint first
      Then use the order for the other constraint to mine frequent itemsets in the corresponding projected database

  • What Constraints Are Convertible?

    Constraint                                      | Convertible anti-monotone | Convertible monotone | Strongly convertible
    avg(S) ≤, ≥ v                                   | Yes                       | Yes                  | Yes
    median(S) ≤, ≥ v                                | Yes                       | Yes                  | Yes
    sum(S) ≤ v (items could be of any value, v ≥ 0) | Yes                       | No                   | No
    sum(S) ≤ v (items could be of any value, v ≤ 0) | No                        | Yes                  | No
    sum(S) ≥ v (items could be of any value, v ≥ 0) | No                        | Yes                  | No
    sum(S) ≥ v (items could be of any value, v ≤ 0) | Yes                       | No                   | No

  • Constraint-Based Mining: A General Picture

    Constraint                  | Antimonotone | Monotone    | Succinct
    v ∈ S                       | no           | yes         | yes
    S ⊇ V                       | no           | yes         | yes
    S ⊆ V                       | yes          | no          | yes
    min(S) ≤ v                  | no           | yes         | yes
    min(S) ≥ v                  | yes          | no          | yes
    max(S) ≤ v                  | yes          | no          | yes
    max(S) ≥ v                  | no           | yes         | yes
    count(S) ≤ v                | yes          | no          | weakly
    count(S) ≥ v                | no           | yes         | weakly
    sum(S) ≤ v (∀a ∈ S, a ≥ 0)  | yes          | no          | no
    sum(S) ≥ v (∀a ∈ S, a ≥ 0)  | no           | yes         | no
    range(S) ≤ v                | yes          | no          | no
    range(S) ≥ v                | no           | yes         | no
    avg(S) θ v, θ ∈ {=, ≤, ≥}   | convertible  | convertible | no
    support(S) ≥ ξ              | yes          | no          | no
    support(S) ≤ ξ              | no           | yes         | no

  • A Classification of Constraints

  • Chapter 5: Mining Frequent Patterns, Associations and Correlations
    Basic concepts and a road map
    Efficient and scalable frequent itemset mining methods
    Mining various kinds of association rules
    From association mining to correlation analysis
    Constraint-based association mining
    Summary

  • Frequent-Pattern Mining: Summary
    Frequent pattern mining: an important task in data mining
    Scalable frequent pattern mining methods:
      Apriori (candidate generation & test)
      Projection-based (FP-growth, CLOSET+, ...)
      Vertical-format approach (CHARM, ...)
    Mining a variety of rules and interesting patterns
    Constraint-based mining
    Mining sequential and structured patterns
    Extensions and applications

  • Frequent-Pattern Mining: Research Problems
    Mining fault-tolerant frequent, sequential, and structured patterns
      Patterns allow limited faults (insertion, deletion, mutation)
    Mining truly interesting patterns
      Surprising, novel, concise, ...
    Application exploration
      E.g., DNA sequence analysis and bio-pattern classification
    "Invisible" data mining

  • Ref: Basic Concepts of Frequent Pattern Mining
    (Association rules) R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93.
    (Max-pattern) R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98.
    (Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99.
    (Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95.

  • Ref: Apriori and Its Improvements
    R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
    H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94.
    A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95.
    J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95.
    H. Toivonen. Sampling large databases for association rules. VLDB'96.
    S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97.
    S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98.

  • Ref: Depth-First, Projection-Based FP Mining
    R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. J. Parallel and Distributed Computing:02.
    J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00.
    J. Pei, J. Han, and R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00.
    J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic Projection. KDD'02.
    J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without Minimum Support. ICDM'02.
    J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets. KDD'03.
    G. Liu, H. Lu, W. Lou, and J. X. Yu. On Computing, Storing and Querying Frequent Patterns. KDD'03.

  • Ref: Vertical Format and Row Enumeration Methods
    M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules. DAMI:97.
    Zaki and Hsiao. CHARM: An Efficient Algorithm for Closed Itemset Mining. SDM'02.
    C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints. KDD'02.
    F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki. CARPENTER: Finding Closed Patterns in Long Biological Datasets. KDD'03.

  • Ref: Mining Multi-Level and Quantitative Rules
    R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95.
    J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95.
    R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96.
    T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96.
    K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97.
    R. J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97.
    Y. Aumann and Y. Lindell. A Statistical Theory for Quantitative Association Rules. KDD'99.

  • Ref: Mining Correlations and Interesting Rules
    M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94.
    S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97.
    C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98.
    P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns. KDD'02.
    E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE'03.
    Y. K. Lee, W. Y. Kim, Y. D. Cai, and J. Han. CoMine: Efficient Mining of Correlated Patterns. ICDM'03.

  • Ref: Mining Other Kinds of Rules
    R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96.
    B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97.
    A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98.
    D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98.
    F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB'98.
    K. Wang, S. Zhou, and J. Han. Profit Mining: From Patterns to Actions. EDBT'02.

  • Ref: Constraint-Based Pattern Mining
    R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97.
    R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained association rules. SIGMOD'98.
    M. N. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB'99.
    G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00.
    J. Pei, J. Han, and L. V. S. Lakshmanan. Mining Frequent Itemsets with Convertible Constraints. ICDE'01.
    J. Pei, J. Han, and W. Wang. Mining Sequential Patterns with Constraints in Large Databases. CIKM'02.

  • Ref: Mining Sequential and Structured Patterns
    R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
    H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI:97.
    M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning:01.
    J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01.
    M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. ICDM'01.
    X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.
    X. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns. KDD'03.

  • Ref: Mining Spatial, Multimedia, and Web Data
    K. Koperski and J. Han. Discovery of Spatial Association Rules in Geographic Information Databases. SSD'95.
    O. R. Zaiane, M. Xin, and J. Han. Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs. ADL'98.
    O. R. Zaiane, J. Han, and H. Zhu. Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. ICDE'00.
    D. Gunopulos and I. Tsoukatos. Efficient Mining of Spatiotemporal Patterns. SSTD'01.

  • Ref: Mining Frequent Patterns in Time-Series Data
    B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98.
    J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE'99.
    H. Lu, L. Feng, and J. Han. Beyond Intra-Transaction Association Analysis: Mining Multi-Dimensional Inter-Transaction Association Rules. TOIS:00.
    B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online Data Mining for Co-Evolving Time Sequences. ICDE'00.
    W. Wang, J. Yang, and R. Muntz. TAR: Temporal Association Rules on Evolving Numerical Attributes. ICDE'01.
    J. Yang, W. Wang, and P. S. Yu. Mining Asynchronous Periodic Patterns in Time Series Data. TKDE'03.

  • Ref: Iceberg Cube and Cube Computation
    S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB'96.
    Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD'97.
    J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. DAMI:97.
    M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98.
    S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT'98.
    K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD'99.

  • Ref: Iceberg Cube and Cube Exploration
    J. Han, J. Pei, G. Dong, and K. Wang. Computing Iceberg Data Cubes with Complex Measures. SIGMOD'01.
    W. Wang, H. Lu, J. Feng, and J. X. Yu. Condensed Cube: An Effective Approach to Reducing Data Cube Size. ICDE'02.
    G. Dong, J. Han, J. Lam, J. Pei, and K. Wang. Mining Multi-Dimensional Constrained Gradients in Data Cubes. VLDB'01.
    T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. DAMI:02.
    L. V. S. Lakshmanan, J. Pei, and J. Han. Quotient Cube: How to Summarize the Semantics of a Data Cube. VLDB'02.
    D. Xin, J. Han, X. Li, and B. W. Wah. Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration. VLDB'03.

  • Ref: FP for Classification and Clustering
    G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. KDD'99.
    B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. KDD'98.
    W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. ICDM'01.
    H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD'02.
    J. Yang and W. Wang. CLUSEQ: Efficient and effective sequence clustering. ICDE'03.
    B. Fung, K. Wang, and M. Ester. Large Hierarchical Document Clustering Using Frequent Itemset. SDM'03.
    X. Yin and J. Han. CPAR: Classification based on Predictive Association Rules. SDM'03.

  • Ref: Stream and Privacy-Preserving FP Mining
    A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy Preserving Mining of Association Rules. KDD'02.
    J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. KDD'02.
    G. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. VLDB'02.
    Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional Regression Analysis of Time-Series Data Streams. VLDB'02.
    C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining Frequent Patterns in Data Streams at Multiple Time Granularities. Next Generation Data Mining:03.
    A. Evfimievski, J. Gehrke, and R. Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. PODS'03.

  • Ref: Other Freq. Pattern Mining Applications
    Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. Efficient Discovery of Functional and Approximate Dependencies Using Partitions. ICDE'98.
    H. V. Jagadish, J. Madar, and R. Ng. Semantic Compression and Pattern Extraction with Fascicles. VLDB'99.
    T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining Database Structure; or How to Build a Data Quality Browser. SIGMOD'02.
