March 16, 2022 — Data Mining: Concepts and Techniques — Chapter 5

10.Pertemuan Kesepuluh.ppt (Indonesian: "Tenth Meeting")


  • Data Mining: Concepts and Techniques

    Chapter 5

  • Chapter 5: Mining Frequent Patterns, Associations and Correlations

    - Basic concepts and a road map
    - Efficient and scalable frequent itemset mining methods
    - Mining various kinds of association rules
    - From association mining to correlation analysis
    - Constraint-based association mining
    - Summary

  • What Is Frequent Pattern Analysis?

    - Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
    - First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
    - Motivation: finding inherent regularities in data
      - What products were often purchased together? Beer and diapers?!
      - What are the subsequent purchases after buying a PC?
      - What kinds of DNA are sensitive to this new drug?
      - Can we automatically classify web documents?
    - Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

  • Why Is Frequent Pattern Mining Important?

    - Discloses an intrinsic and important property of data sets
    - Forms the foundation for many essential data mining tasks:
      - Association, correlation, and causality analysis
      - Sequential and structural (e.g., sub-graph) patterns
      - Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
      - Classification: associative classification
      - Cluster analysis: frequent pattern-based clustering
      - Data warehousing: iceberg cube and cube-gradient
      - Semantic data compression: fascicles
    - Broad applications

  • Basic Concepts: Frequent Patterns and Association Rules

    - Itemset X = {x1, ..., xk}
    - Find all the rules X => Y with minimum support and confidence
      - support, s: probability that a transaction contains X ∪ Y
      - confidence, c: conditional probability that a transaction having X also contains Y
    - Let supmin = 50%, confmin = 50%
      - Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
      - Association rules: A => D (60%, 100%), D => A (60%, 75%)

    Transaction-id   Items bought
    10               A, B, D
    20               A, C, D
    30               A, D, E
    40               B, E, F
    50               B, C, D, E, F
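The support and confidence figures above can be checked directly. A minimal Python sketch of the two measures, using the slide's five-transaction database (the helper names are illustrative, not from the slides):

```python
# Transaction database from the slide (min_sup = min_conf = 50%).
transactions = [
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "D", "E"},
    {"B", "E", "F"},
    {"B", "C", "D", "E", "F"},
]

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    s = set(itemset)
    return sum(s <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """conf(X => Y) = sup(X u Y) / sup(X)."""
    return support(set(X) | set(Y)) / support(X)

print(support({"A", "D"}))                 # 0.6  -> A => D has 60% support
print(confidence({"A"}, {"D"}))            # 1.0  -> A => D: 100% confidence
print(round(confidence({"D"}, {"A"}), 2))  # 0.75 -> D => A: 75% confidence
```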

  • Closed Patterns and Max-Patterns

    - A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, ..., a100} contains C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30 sub-patterns!
    - Solution: mine closed patterns and max-patterns instead
    - An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
    - An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
    - Closed patterns are a lossless compression of frequent patterns
      - Reduces the number of patterns and rules

  • Closed Patterns and Max-Patterns

    - Exercise. DB = {⟨a1, ..., a100⟩, ⟨a1, ..., a50⟩}, min_sup = 1
      - What is the set of closed itemsets?
        ⟨a1, ..., a100⟩: 1
        ⟨a1, ..., a50⟩: 2
      - What is the set of max-patterns?
        ⟨a1, ..., a100⟩: 1
      - What is the set of all patterns? Far too many to enumerate!

  • Chapter 5: Mining Frequent Patterns, Associations and Correlations

    - Basic concepts and a road map
    - Efficient and scalable frequent itemset mining methods
    - Mining various kinds of association rules
    - From association mining to correlation analysis
    - Constraint-based association mining
    - Summary

  • Scalable Methods for Mining Frequent Patterns

    - The downward closure property of frequent patterns
      - Any subset of a frequent itemset must be frequent
      - If {beer, diaper, nuts} is frequent, so is {beer, diaper}, since every transaction having {beer, diaper, nuts} also contains {beer, diaper}
    - Scalable mining methods: three major approaches
      - Apriori (Agrawal & Srikant @ VLDB'94)
      - Frequent pattern growth (FPgrowth: Han, Pei & Yin @ SIGMOD'00)
      - Vertical data format approach (CHARM: Zaki & Hsiao @ SDM'02)

  • Apriori: A Candidate Generation-and-Test Approach

    - Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant @ VLDB'94; Mannila, et al. @ KDD'94)
    - Method:
      - Initially, scan the DB once to get the frequent 1-itemsets
      - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
      - Test the candidates against the DB
      - Terminate when no frequent or candidate set can be generated

  • The Apriori Algorithm: An Example (supmin = 2)

    Database TDB:
      Tid   Items
      10    A, C, D
      20    B, C, E
      30    A, B, C, E
      40    B, E

    1st scan -> C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
    L1 (prune {D}, sup < 2): {A}:2, {B}:3, {C}:3, {E}:3
    C2 (self-join L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
    2nd scan -> C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
    L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
    C3: {B,C,E}
    3rd scan -> L3: {B,C,E}:2

  • The Apriori Algorithm (pseudo-code)

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1
            that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
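The pseudo-code above translates almost line-for-line into a runnable sketch. This is an illustrative implementation, not the book's reference code; `apriori` and its internals are hypothetical names:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori over a list of sets; min_support is an
    absolute count.  Returns {frequent itemset (frozenset): count}."""
    # L1 = {frequent items}
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Candidate generation: extend each frequent k-itemset, keeping
        # only candidates whose every k-subset is frequent (Apriori principle).
        items = sorted({i for s in Lk for i in s})
        candidates = set()
        for s in Lk:
            for i in items:
                c = s | {i}
                if len(c) == k + 1 and all(
                    frozenset(sub) in Lk for sub in combinations(c, k)
                ):
                    candidates.add(frozenset(c))
        # Scan the database to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(Lk)
        k += 1
    return frequent

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(db, min_support=2)
print(result[frozenset({"B", "C", "E"})])  # 2, matching L3 in the example
```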

  • Important Details of Apriori

    - How to generate candidates?
      - Step 1: self-joining Lk
      - Step 2: pruning
    - How to count supports of candidates?
    - Example of candidate generation:
      - L3 = {abc, abd, acd, ace, bcd}
      - Self-joining L3*L3: abcd from abc and abd; acde from acd and ace
      - Pruning: acde is removed because ade is not in L3
      - C4 = {abcd}

  • How to Generate Candidates?

    - Suppose the items in Lk-1 are listed in an order
    - Step 1: self-joining Lk-1
        insert into Ck
        select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
        from Lk-1 p, Lk-1 q
        where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
    - Step 2: pruning
        forall itemsets c in Ck do
            forall (k-1)-subsets s of c do
                if (s is not in Lk-1) then delete c from Ck
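The join and prune steps can be sketched directly in code; the example reproduces the slide's C4 computation from L3 = {abc, abd, acd, ace, bcd}. The function name is illustrative:

```python
from itertools import combinations

def generate_candidates(Lk, k):
    """Self-join Lk (itemsets of size k, as sorted tuples) and prune any
    candidate that has an infrequent k-subset, per the two-step procedure."""
    Lk = {tuple(sorted(s)) for s in Lk}
    Ck1 = set()
    for p in Lk:
        for q in Lk:
            # Join step: first k-1 items equal, last item of p < last of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune step: every k-subset of c must be in Lk.
                if all(s in Lk for s in combinations(c, k)):
                    Ck1.add(c)
    return Ck1

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(generate_candidates(L3, 3))  # abcd survives; acde is pruned (ade not in L3)
```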

  • How to Count Supports of Candidates?

    - Why is counting supports of candidates a problem?
      - The total number of candidates can be huge
      - One transaction may contain many candidates
    - Method:
      - Candidate itemsets are stored in a hash-tree
      - A leaf node of the hash-tree contains a list of itemsets and counts
      - An interior node contains a hash table
      - Subset function: finds all the candidates contained in a transaction

  • Example: Counting Supports of Candidates

    - Transaction: 1 2 3 5 6
    - Subsets are enumerated by recursive splitting:
      1 + 2 3 5 6
      1 2 + 3 5 6
      1 3 + 5 6

  • Efficient Implementation of Apriori in SQL

    - Hard to get good performance out of pure SQL (SQL-92) based approaches alone
    - Make use of object-relational extensions like UDFs, BLOBs, table functions, etc. to get orders-of-magnitude improvement
    - S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD'98

  • Challenges of Frequent Pattern Mining

    - Challenges:
      - Multiple scans of the transaction database
      - Huge number of candidates
      - Tedious workload of support counting for candidates
    - Improving Apriori: general ideas
      - Reduce passes of transaction-database scans
      - Shrink the number of candidates
      - Facilitate support counting of candidates

  • Partition: Scan Database Only Twice

    - Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
    - Scan 1: partition the database and find local frequent patterns
    - Scan 2: consolidate global frequent patterns
    - A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95
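A minimal sketch of the two-scan Partition idea, using a brute-force local miner for clarity. The tiny database and all names are illustrative; the point is that the union of local frequent itemsets is a complete global candidate set:

```python
from itertools import combinations

def local_frequent(part, rel_sup):
    """Brute-force all itemsets frequent within one partition
    (at the same *relative* support threshold as the whole DB)."""
    out = set()
    items = sorted({i for t in part for i in t})
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            if sum(set(cand) <= t for t in part) / len(part) >= rel_sup:
                out.add(frozenset(cand))
    return out

db = [{"a","b"}, {"a","c"}, {"a","b"}, {"b","c"}, {"a","b","c"}, {"a","b"}]
rel_sup = 0.5
parts = [db[:3], db[3:]]

# Scan 1: union of local frequent itemsets = global candidate set.
candidates = set().union(*(local_frequent(p, rel_sup) for p in parts))
# Scan 2: verify each candidate against the whole database.
globally_frequent = {
    c for c in candidates
    if sum(c <= t for t in db) / len(db) >= rel_sup
}
print(sorted(map(sorted, globally_frequent)))
```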

  • DHP: Reduce the Number of Candidates

    - A k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent
      - Candidates: a, b, c, d, e
      - Hash entries: {ab, ad, ae}, {bd, be, de}, ...
      - Frequent 1-itemsets: a, b, d, e
      - ab is not a candidate 2-itemset if the total count of the bucket {ab, ad, ae} is below the support threshold
    - J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95
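A sketch of DHP's bucket filter; the hash function and bucket count here are arbitrary illustrative choices, and the transactions reuse the earlier Apriori example. A bucket total only upper-bounds a pair's support, so the filter can rule candidates out but never in:

```python
from itertools import combinations

transactions = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
min_sup = 2
NUM_BUCKETS = 7
buckets = [0] * NUM_BUCKETS

def bucket(pair):
    """Deterministic toy hash of a sorted item pair into a bucket index."""
    a, b = pair
    return (ord(a) * 31 + ord(b)) % NUM_BUCKETS

# While the 1-itemset counting scan runs, also hash every 2-itemset
# of each transaction into the table.
for t in transactions:
    for pair in combinations(sorted(t), 2):
        buckets[bucket(pair)] += 1

def may_be_frequent(pair):
    """False means the pair's bucket total, hence its support, is < min_sup."""
    return buckets[bucket(pair)] >= min_sup
```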

  • Sampling for Frequent Patterns

    - Select a sample of the original database; mine frequent patterns within the sample using Apriori
    - Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns are checked
      - Example: check abcd instead of ab, ac, ..., etc.
    - Scan the database again to find missed frequent patterns
    - H. Toivonen. Sampling large databases for association rules. In VLDB'96

  • DIC: Reduce the Number of Scans

    - Once both A and D are determined frequent, the counting of AD begins
    - Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
    [Figure: itemset lattice over {A, B, C, D}; DIC starts counting 2- and 3-itemsets mid-scan, while Apriori completes a full pass per level]
    - S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97

  • Bottleneck of Frequent-pattern Mining

    - Multiple database scans are costly
    - Mining long patterns needs many passes of scanning and generates lots of candidates
      - To find the frequent itemset i1 i2 ... i100:
        - Number of scans: 100
        - Number of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30!
    - Bottleneck: candidate generation and test
    - Can we avoid candidate generation?

  • Mining Frequent Patterns Without Candidate Generation

    - Grow long patterns from short ones using local frequent items
      - abc is a frequent pattern
      - Get all transactions having abc: DB|abc
      - If d is a local frequent item in DB|abc, then abcd is a frequent pattern

  • Construct FP-tree from a Transaction Database (min_support = 3)

    TID   Items bought                (ordered) frequent items
    100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
    200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
    300   {b, f, h, j, o, w}          {f, b}
    400   {b, c, k, s, p}             {c, b, p}
    500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

    - Scan the DB once, find frequent 1-itemsets (single-item patterns)
    - Sort frequent items in frequency-descending order: the f-list
    - Scan the DB again, construct the FP-tree
    - F-list = f-c-a-b-m-p
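The second scan's tree construction can be sketched compactly. This builds only the prefix tree for the slide's database and f-list; the node-links and header table that a full FP-tree also maintains are omitted, and the class/function names are illustrative:

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, flist):
    """Insert each transaction's frequent items, ordered by the f-list,
    into a shared prefix tree, incrementing counts along the path."""
    rank = {item: r for r, item in enumerate(flist)}
    root = Node(None, None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.setdefault(item, Node(item, node))
            child.count += 1
            node = child
    return root

# The slide's database (min_support = 3) and its f-list f-c-a-b-m-p.
db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"),
      set("afcelpmn")]
root = build_fp_tree(db, list("fcabmp"))
print(root.children["f"].count)                # 4
print(root.children["f"].children["c"].count)  # 3
```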

  • Benefits of the FP-tree Structure

    - Completeness
      - Preserves complete information for frequent pattern mining
      - Never breaks a long pattern of any transaction
    - Compactness
      - Reduces irrelevant info: infrequent items are gone
      - Items in frequency-descending order: the more frequently occurring, the more likely to be shared
      - Never larger than the original database (not counting node-links and the count fields)
      - For the Connect-4 DB, the compression ratio can be over 100

  • Partition Patterns and Databases

    - Frequent patterns can be partitioned into subsets according to the f-list (f-c-a-b-m-p):
      - Patterns containing p
      - Patterns having m but no p
      - Patterns having c but none of a, b, m, p
      - Pattern f
    - This partitioning is complete and non-redundant

  • Find Patterns Having p From p's Conditional Database

    - Start at the frequent-item header table of the FP-tree
    - Traverse the FP-tree by following the node-links of each frequent item p
    - Accumulate all transformed prefix paths of item p to form p's conditional pattern base

    Conditional pattern bases:
      item   cond. pattern base
      c      f:3
      a      fc:3
      b      fca:1, f:1, c:1
      m      fca:2, fcab:1
      p      fcam:2, cb:1

  • From Conditional Pattern-bases to Conditional FP-trees

    - For each pattern base:
      - Accumulate the count for each item in the base
      - Construct the FP-tree for the frequent items of the pattern base
    - m-conditional pattern base: fca:2, fcab:1
    - All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

    Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3
    [Figure: global FP-tree with paths f:4-c:3-a:3-m:2-p:2, f:4-c:3-a:3-b:1-m:1, f:4-b:1, and c:1-b:1-p:1]

  • Recursion: Mining Each Conditional FP-tree

    - Cond. pattern base of am: (fc:3) -> am-conditional FP-tree: {} - f:3 - c:3
    - Cond. pattern base of cm: (f:3) -> cm-conditional FP-tree: {} - f:3
    - Cond. pattern base of cam: (f:3) -> cam-conditional FP-tree: {} - f:3

  • A Special Case: Single Prefix Path in FP-tree

    - Suppose a (conditional) FP-tree T has a shared single prefix-path P
    - Mining can be decomposed into two parts:
      - Reduction of the single prefix path into one node
      - Concatenation of the mining results of the two parts

  • Mining Frequent Patterns With FP-trees

    - Idea: frequent pattern growth
      - Recursively grow frequent patterns by pattern and database partition
    - Method:
      - For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
      - Repeat the process on each newly created conditional FP-tree
      - Until the resulting FP-tree is empty, or contains only a single path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern

  • Scaling FP-growth by DB Projection

    - What if the FP-tree cannot fit in memory? DB projection:
      - First partition the database into a set of projected DBs
      - Then construct and mine an FP-tree for each projected DB
    - Parallel projection vs. partition projection techniques
      - Parallel projection is space-costly

  • Partition-based Projection

    - Parallel projection needs a lot of disk space
    - Partition projection saves it

  • FP-Growth vs. Apriori: Scalability With the Support Threshold

    - Data set T25I20D10K
    [Chart: run time (sec.) vs. support threshold (%) for D1 FP-growth and D1 Apriori; embedded spreadsheet data omitted]

  • FP-Growth vs. Tree-Projection: Scalability with the Support Threshold

    - Data set T25I20D100K
    [Chart: run time (sec.) vs. support threshold (%) for D2 FP-growth and D2 TreeProjection; embedded spreadsheet data omitted]

  • Why Is FP-Growth the Winner?

    - Divide-and-conquer:
      - Decomposes both the mining task and the DB according to the frequent patterns obtained so far
      - Leads to focused search of smaller databases
    - Other factors:
      - No candidate generation, no candidate test
      - Compressed database: the FP-tree structure
      - No repeated scan of the entire database
      - Basic ops are counting local frequent items and building sub-FP-trees; no pattern search and matching

  • Implications of the Methodology

    - Mining closed frequent itemsets and max-patterns: CLOSET (DMKD'00)
    - Mining sequential patterns: FreeSpan (KDD'00), PrefixSpan (ICDE'01)
    - Constraint-based mining of frequent patterns: convertible constraints (KDD'00, ICDE'01)
    - Computing iceberg data cubes with complex measures: H-tree and H-cubing algorithm (SIGMOD'01)

  • MaxMiner: Mining Max-patterns

    - 1st scan: find frequent items: A, B, C, D, E
    - 2nd scan: find support for the potential max-patterns:
      - AB, AC, AD, AE, ABCDE
      - BC, BD, BE, BCDE
      - CD, CE, CDE, DE
    - Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in later scans
    - R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98

    Tid   Items
    10    A, B, C, D, E
    20    B, C, D, E
    30    A, C, D, F

  • Mining Frequent Closed Patterns: CLOSET (min_sup = 2)

    - F-list: list of all frequent items in support-ascending order
      - F-list = d-a-f-e-c
    - Divide the search space:
      - Patterns having d
      - Patterns having d but no a, etc.
    - Find frequent closed patterns recursively
      - Every transaction having d also has cfa, so cfad is a frequent closed pattern
    - J. Pei, J. Han & R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00

    TID   Items
    10    a, c, d, e, f
    20    a, b, e
    30    c, e, f
    40    a, c, d, f
    50    c, e, f

  • CLOSET+: Mining Closed Itemsets by Pattern-Growth

    - Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
    - Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), then X and all of X's descendants in the set-enumeration tree can be pruned
    - Hybrid tree projection:
      - Bottom-up physical tree projection
      - Top-down pseudo tree projection
    - Item skipping: if a local frequent item has the same support in several header tables at different levels, it can be pruned from the header tables at higher levels
    - Efficient subset checking

  • CHARM: Mining by Exploring Vertical Data Format
    Vertical format: t(AB) = {T11, T25, ...}
      tid-list: list of transaction ids containing an itemset
    Deriving closed patterns based on vertical intersections:
      t(X) = t(Y): X and Y always happen together
      t(X) ⊂ t(Y): a transaction having X always has Y
    Using diffsets to accelerate mining:
      Only keep track of differences of tids
      t(X) = {T1, T2, T3}, t(XY) = {T1, T3} ⇒ Diffset(XY, X) = {T2}
    Eclat/MaxEclat (Zaki et al. @KDD'97), VIPER (P. Shenoy et al. @SIGMOD'00), CHARM (Zaki & Hsiao @SDM'02)
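The diffset bookkeeping above can be sketched with plain Python sets, using the slide's toy tid-lists; the point is that sup(XY) can be maintained as sup(X) minus the diffset size, so only the (usually small) differences need storing.

```python
# Diffset example from the CHARM slide: tids containing X but not XY.
t_X = {"T1", "T2", "T3"}    # tid-list of itemset X
t_XY = {"T1", "T3"}         # tid-list of itemset XY

diffset = t_X - t_XY                 # Diffset(XY, X)
sup_XY = len(t_X) - len(diffset)     # support recovered without storing t(XY)
```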

  • Further Improvements of Mining Methods
    AFOPT (Liu, et al. @KDD'03)
      A "push-right" method for mining condensed frequent pattern (CFP) trees
    Carpenter (Pan, et al. @KDD'03)
      Mines data sets with few rows but numerous columns
      Constructs a row-enumeration tree for efficient mining

  • Visualization of Association Rules: Plane Graph

  • Visualization of Association Rules: Rule Graph

  • Visualization of Association Rules (SGI/MineSet 3.0)

  • Chapter 5: Mining Frequent Patterns, Associations and Correlations
    Basic concepts and a road map
    Efficient and scalable frequent itemset mining methods
    Mining various kinds of association rules
    From association mining to correlation analysis
    Constraint-based association mining
    Summary

  • Mining Various Kinds of Association Rules
    Mining multilevel association
    Mining multidimensional association
    Mining quantitative association
    Mining interesting correlation patterns

  • Mining Multiple-Level Association Rules
    Items often form hierarchies
    Flexible support settings: items at the lower level are expected to have lower support
    Exploration of shared multi-level mining (Agrawal & Srikant @VLDB'95, Han & Fu @VLDB'95)

  • Multi-level Association: Redundancy Filtering
    Some rules may be redundant due to "ancestor" relationships between items.
    Example:
      milk ⇒ wheat bread [support = 8%, confidence = 70%]
      2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
    We say the first rule is an ancestor of the second rule.
    A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor.
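The redundancy test above can be sketched numerically. Note the 25% market share of "2% milk" among all milk sales is a hypothetical figure chosen for illustration (the slide does not give it): under that assumption the descendant rule's support matches its expected value and the rule is filtered out.

```python
# Redundancy-filtering sketch for the slide's example.
ancestor_support = 0.08    # milk => wheat bread
rule_support = 0.02        # 2% milk => wheat bread
share_of_parent = 0.25     # ASSUMED fraction of milk sales that are 2% milk
tolerance = 0.005          # "close to expected" threshold (also assumed)

expected = ancestor_support * share_of_parent       # 0.02
is_redundant = abs(rule_support - expected) <= tolerance
```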

  • Mining Multi-Dimensional Association
    Single-dimensional rules:
      buys(X, "milk") ⇒ buys(X, "bread")
    Multi-dimensional rules: ≥ 2 dimensions or predicates
      Inter-dimension assoc. rules (no repeated predicates):
        age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
      Hybrid-dimension assoc. rules (repeated predicates):
        age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
    Categorical attributes: finite number of possible values, no ordering among values (data cube approach)
    Quantitative attributes: numeric, implicit ordering among values (discretization, clustering, and gradient approaches)

  • Mining Quantitative Associations
    Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
      Static discretization based on predefined concept hierarchies (data cube methods)
      Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant @SIGMOD'96)
      Clustering: distance-based association (e.g., Yang & Miller @SIGMOD'97): one-dimensional clustering, then association
      Deviation (e.g., Aumann and Lindell @KDD'99):
        Sex = female ⇒ Wage: mean = $7/hr (overall mean = $9)

  • Static Discretization of Quantitative Attributes
    Discretized prior to mining using a concept hierarchy.
    Numeric values are replaced by ranges.
    In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans.
    A data cube is well suited for mining:
      The cells of an n-dimensional cuboid correspond to the predicate sets.
      Mining from data cubes can be much faster.

  • Quantitative Association Rules
    age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")
    Proposed by Lent, Swami and Widom @ICDE'97
    Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized
    2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
    Cluster adjacent association rules to form general rules using a 2-D grid
    Example

  • Mining Other Interesting Patterns
    Flexible support constraints (Wang et al. @VLDB'02)
      Some items (e.g., diamonds) may occur rarely but are valuable
      Customized min_sup specification and application
    Top-k closed frequent patterns (Han, et al. @ICDM'02)
      Hard to specify min_sup, but top-k with min_length is more desirable
      Dynamically raise min_sup during FP-tree construction and mining, and select the most promising path to mine

  • Chapter 5: Mining Frequent Patterns, Associations and Correlations
    Basic concepts and a road map
    Efficient and scalable frequent itemset mining methods
    Mining various kinds of association rules
    From association mining to correlation analysis
    Constraint-based association mining
    Summary

  • Interestingness Measure: Correlations (Lift)
    play basketball ⇒ eat cereal [40%, 66.7%] is misleading:
      the overall % of students eating cereal is 75% > 66.7%
    play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
    Measure of dependent/correlated events: lift(A, B) = P(A ∧ B) / (P(A) · P(B))

               | Basketball | Not basketball | Sum (row)
    Cereal     | 2000       | 1750           | 3750
    Not cereal | 1000       | 250            | 1250
    Sum (col.) | 3000       | 2000           | 5000
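Plugging the contingency table above into the lift formula reproduces the slide's claim: basketball and cereal are negatively correlated (lift < 1) even though the rule's confidence looks high, while basketball and "not cereal" are positively correlated.

```python
# Lift computed from the slide's 2x2 contingency table (5000 students).
n = 5000
basketball, cereal = 3000, 3750
basketball_and_cereal = 2000
basketball_and_no_cereal = 1000

def lift(n_ab, n_a, n_b, n):
    """lift(A, B) = P(A and B) / (P(A) * P(B))."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

l_cereal = lift(basketball_and_cereal, basketball, cereal, n)
l_no_cereal = lift(basketball_and_no_cereal, basketball, n - cereal, n)
```

Here l_cereal = 0.4 / (0.6 · 0.75) ≈ 0.89 < 1, and l_no_cereal = 0.2 / (0.6 · 0.25) ≈ 1.33 > 1.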

  • Are lift and χ² Good Measures of Correlation?
    "Buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk
    Support and confidence are not good measures of correlation
    So many interestingness measures! (Tan, Kumar, Srivastava @KDD'02)

               | Milk  | No Milk | Sum (row)
    Coffee     | m, c  | ~m, c   | c
    No Coffee  | m, ~c | ~m, ~c  | ~c
    Sum (col.) | m     | ~m      |

    DB | m, c | ~m, c | m, ~c | ~m, ~c  | lift | all-conf | coh  | χ²
    A1 | 1000 | 100   | 100   | 10,000  | 9.26 | 0.91     | 0.83 | 9055
    A2 | 100  | 1000  | 1000  | 100,000 | 8.44 | 0.09     | 0.05 | 670
    A3 | 1000 | 100   | 10000 | 100,000 | 9.18 | 0.09     | 0.09 | 8172
    A4 | 1000 | 1000  | 1000  | 1000    | 1    | 0.5      | 0.33 | 0

  • Which Measures Should Be Used?
    lift and χ² are not good measures for correlations in large transactional DBs
    all-conf or coherence could be good measures (Omiecinski @TKDE'03)
    Both all-conf and coherence have the downward closure property
    Efficient algorithms can be derived for mining (Lee et al. @ICDM'03sub)
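The all-confidence values in the DB table above are easy to verify by hand; a sketch for database A1, using the standard definition all_conf(X) = sup(X) / max item support in X:

```python
# all-confidence for the slide's DB A1.
mc, not_m_c, m_not_c = 1000, 100, 100   # counts of (m,c), (~m,c), (m,~c)

sup_mc = mc
sup_m = mc + m_not_c     # 1100 transactions contain milk
sup_c = mc + not_m_c     # 1100 transactions contain coffee

all_conf = sup_mc / max(sup_m, sup_c)   # 1000 / 1100
```

This matches the table's 0.91 for A1; unlike lift, the measure is unaffected by the large ~m,~c count (null transactions).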

  • Chapter 5: Mining Frequent Patterns, Associations and Correlations
    Basic concepts and a road map
    Efficient and scalable frequent itemset mining methods
    Mining various kinds of association rules
    From association mining to correlation analysis
    Constraint-based association mining
    Summary

  • Constraint-Based (Query-Directed) Mining
    Finding all the patterns in a database autonomously? Unrealistic!
      The patterns could be too many, but not focused!
    Data mining should be an interactive process
      The user directs what is to be mined using a data mining query language (or a graphical user interface)
    Constraint-based mining
      User flexibility: provides constraints on what is to be mined
      System optimization: explores such constraints for efficient mining, i.e., constraint-based mining

  • Constraints in Data Mining
    Knowledge type constraint: classification, association, etc.
    Data constraint, using SQL-like queries:
      find product pairs sold together in stores in Chicago in Dec.'02
    Dimension/level constraint:
      in relevance to region, price, brand, customer category
    Rule (or pattern) constraint:
      small sales (price < $10) triggers big sales (sum > $200)
    Interestingness constraint:
      strong rules: min_support ≥ 3%, min_confidence ≥ 60%

  • Constrained Mining vs. Constraint-Based Search
    Constrained mining vs. constraint-based search/reasoning
      Both are aimed at reducing the search space
      Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI
      Constraint-pushing vs. heuristic search
      How to integrate them is an interesting research problem
    Constrained mining vs. query processing in DBMS
      Database query processing requires finding all answers
      Constrained pattern mining shares a similar philosophy with pushing selections deeply into query processing

  • Anti-Monotonicity in Constraint Pushing
    Anti-monotonicity: when an itemset S violates the constraint, so does any of its supersets
      sum(S.Price) ≤ v is anti-monotone
      sum(S.Price) ≥ v is not anti-monotone
    Example. C: range(S.profit) ≤ 15 is anti-monotone
      Itemset ab violates C
      So does every superset of ab
    TDB (min_sup = 2)

    TID | Transaction
    10  | a, b, c, d, f
    20  | b, c, d, f, g, h
    30  | a, c, d, e, f
    40  | c, e, f, g

    Item | Profit
    a    | 40
    b    | 0
    c    | -20
    d    | 10
    e    | -30
    f    | 30
    g    | 20
    h    | -10
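The example above can be checked directly: range(ab) = 40 − 0 = 40 > 15, and since adding items can only widen (never shrink) the profit range, every superset of ab violates C as well, which is exactly what licenses pruning.

```python
# Anti-monotonicity of C: range(S.profit) <= 15, on the slide's profits.
profit = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}

def rng(itemset):
    """range(S.profit) = max profit - min profit over the itemset."""
    vals = [profit[i] for i in itemset]
    return max(vals) - min(vals)

violates = rng({"a", "b"}) > 15   # ab violates C
# Every one-item extension of ab (hence every superset) also violates C.
supersets_violate = all(rng({"a", "b", x}) > 15 for x in profit if x not in "ab")
```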

  • Monotonicity for Constraint Pushing
    Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets
      sum(S.Price) ≥ v is monotone
      min(S.Price) ≤ v is monotone
    Example. C: range(S.profit) ≥ 15
      Itemset ab satisfies C
      So does every superset of ab
    TDB (min_sup = 2)

    TID | Transaction
    10  | a, b, c, d, f
    20  | b, c, d, f, g, h
    30  | a, c, d, e, f
    40  | c, e, f, g

    Item | Profit
    a    | 40
    b    | 0
    c    | -20
    d    | 10
    e    | -30
    f    | 30
    g    | 20
    h    | -10

  • Succinctness
    Succinctness: given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
    Idea: whether an itemset S satisfies constraint C can be determined based on the selection of items, without looking at the transaction database
      min(S.Price) ≤ v is succinct
      sum(S.Price) ≥ v is not succinct
    Optimization: if C is succinct, C is pre-counting pushable

  • The Apriori Algorithm: Example
    Flow: Database D → (scan D) → C1 → L1 → C2 → (scan D) → L2 → C3 → (scan D) → L3

    Database D (TID | Items):
    100 | 1, 3, 4
    200 | 2, 3, 5
    300 | 1, 2, 3, 5
    400 | 2, 5

    C1 (itemset | sup): {1} 2; {2} 3; {3} 3; {4} 1; {5} 3
    L1 (itemset | sup): {1} 2; {2} 3; {3} 3; {5} 3
    C2 (candidates): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
    C2 (itemset | sup): {1 2} 1; {1 3} 2; {1 5} 1; {2 3} 2; {2 5} 3; {3 5} 2
    L2 (itemset | sup): {1 3} 2; {2 3} 2; {2 5} 3; {3 5} 2
    C3 (candidate): {2 3 5}
    L3 (itemset | sup): {2 3 5} 2
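The level-wise run above can be sketched as a compact Apriori implementation over the same database D (min_sup = 2, absolute count), with the join step (union frequent (k−1)-itemsets into k-candidates) and the prune step (every (k−1)-subset must be frequent) made explicit:

```python
from itertools import combinations

# The slide's database D and absolute minimum support.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_sup = 2

def count(candidates, D):
    """Support count of each candidate itemset in D."""
    return {c: sum(1 for t in D if c <= t) for c in candidates}

def apriori(D, min_sup):
    items = sorted(set().union(*D))
    L1 = {c: s for c, s in count([frozenset([i]) for i in items], D).items()
          if s >= min_sup}
    levels = [L1]
    k = 2
    while levels[-1]:
        prev = list(levels[-1])
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        cands = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        cands = [c for c in cands
                 if all(frozenset(s) in levels[-1] for s in combinations(c, k - 1))]
        Lk = {c: s for c, s in count(cands, D).items() if s >= min_sup}
        if not Lk:
            break
        levels.append(Lk)
        k += 1
    return levels            # [L1, L2, L3, ...]

levels = apriori(D, min_sup)
```

Running it reproduces the slide's tables: L2 = {13, 23, 25, 35} and L3 = {235} with support 2; {1 2 3} and {1 3 5} are pruned because {1 2} and {1 5} are infrequent.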

  • Naïve Algorithm: Apriori + Constraint
    Constraint: Sum{S.price} < 5
    (Same Database D, C1, L1, C2, L2, C3, L3 tables as in the Apriori example above.)

  • The Constrained Apriori Algorithm: Push an Anti-monotone Constraint Deep
    Constraint: Sum{S.price} < 5
    (Same Database D, C1, L1, C2, L2, C3, L3 tables as in the Apriori example above.)

  • The Constrained Apriori Algorithm: Push a Succinct Constraint Deep
    (Same Apriori example as above.)
    Constraint: min{S.price}
  • Converting "Tough" Constraints
    Convert tough constraints into anti-monotone or monotone ones by properly ordering items
    Examine C: avg(S.profit) ≥ 25
      Order items in value-descending order
      If an itemset afb violates C, so does afbh, afb*: it becomes anti-monotone!
    TDB (min_sup = 2)

    TID | Transaction
    10  | a, b, c, d, f
    20  | b, c, d, f, g, h
    30  | a, c, d, e, f
    40  | c, e, f, g

    Item | Profit
    a    | 40
    b    | 0
    c    | -20
    d    | 10
    e    | -30
    f    | 30
    g    | 20
    h    | -10

  • Strongly Convertible Constraints
    avg(X) ≥ 25 is convertible anti-monotone w.r.t. item-value-descending order R: if an itemset af violates constraint C, so does every itemset with af as a prefix, such as afd
    avg(X) ≥ 25 is convertible monotone w.r.t. item-value-ascending order R⁻¹: if an itemset d satisfies constraint C, so do itemsets df and dfa, which have d as a prefix
    Thus, avg(X) ≥ 25 is strongly convertible

    Item | Profit
    a    | 40
    b    | 0
    c    | -20
    d    | 10
    e    | -30
    f    | 30
    g    | 20
    h    | -10

  • Can Apriori Handle Convertible Constraints?
    A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
      Within the level-wise framework, no direct pruning based on the constraint can be made
      Itemset df violates constraint C: avg(X) ≥ 25
      Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
    But it can be pushed into the frequent-pattern growth framework!

    Item | Value
    a    | 40
    b    | 0
    c    | -20
    d    | 10
    e    | -30
    f    | 30
    g    | 20
    h    | -10

  • Mining With Convertible Constraints
    C: avg(X) ≥ 25, min_sup = 2
    List items in every transaction in value-descending order R
      C is convertible anti-monotone w.r.t. R
    Scan TDB once
      Remove infrequent items: item h is dropped
      Itemsets a and f are good candidates
    Projection-based mining
      Impose an appropriate order on item projection
      Many tough constraints can be converted into (anti-)monotone ones
    TDB (min_sup = 2)

    TID | Transaction
    10  | a, f, d, b, c
    20  | f, g, d, b, c
    30  | a, f, d, c, e
    40  | f, g, h, c, e

    Item | Value
    a    | 40
    f    | 30
    g    | 20
    d    | 10
    b    | 0
    h    | -10
    c    | -20
    e    | -30
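Why the value-descending order R makes avg(X) ≥ 25 anti-monotone can be checked on the slide's item values: every item appended after a prefix (in R-order) has a value no larger than anything already in the prefix, so extending a prefix can never raise its average. A small sketch of that check:

```python
# Convertibility of C: avg(X) >= 25 under value-descending order R.
value = {"a": 40, "f": 30, "g": 20, "d": 10, "b": 0, "h": -10, "c": -20, "e": -30}
R = sorted(value, key=value.get, reverse=True)     # a, f, g, d, b, h, c, e

def avg(itemset):
    return sum(value[i] for i in itemset) / len(itemset)

prefix = ["a", "f", "b"]        # avg = 70/3 ~ 23.3, violates C
# Every item after b in R has value <= 0, so no extension can restore C.
extensions_stay_below = all(avg(prefix + [x]) < 25
                            for x in R[R.index("b") + 1:])
```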

  • Handling Multiple Constraints
    Different constraints may require different, or even conflicting, item orderings
    If there exists an order R such that both C1 and C2 are convertible w.r.t. R, then there is no conflict between the two convertible constraints
    If the item orders conflict:
      Try to satisfy one constraint first
      Then use the order for the other constraint to mine frequent itemsets in the corresponding projected database

  • What Constraints Are Convertible?

    Constraint                                      | Convertible anti-monotone | Convertible monotone | Strongly convertible
    avg(S) ≤, ≥ v                                   | Yes                       | Yes                  | Yes
    median(S) ≤, ≥ v                                | Yes                       | Yes                  | Yes
    sum(S) ≤ v (items could be of any value, v ≥ 0) | Yes                       | No                   | No
    sum(S) ≤ v (items could be of any value, v ≤ 0) | No                        | Yes                  | No
    sum(S) ≥ v (items could be of any value, v ≥ 0) | No                        | Yes                  | No
    sum(S) ≥ v (items could be of any value, v ≤ 0) | Yes                       | No                   | No

  • Constraint-Based Mining: A General Picture

    Constraint                  | Antimonotone | Monotone    | Succinct
    v ∈ S                       | no           | yes         | yes
    S ⊇ V                       | no           | yes         | yes
    S ⊆ V                       | yes          | no          | yes
    min(S) ≤ v                  | no           | yes         | yes
    min(S) ≥ v                  | yes          | no          | yes
    max(S) ≤ v                  | yes          | no          | yes
    max(S) ≥ v                  | no           | yes         | yes
    count(S) ≤ v                | yes          | no          | weakly
    count(S) ≥ v                | no           | yes         | weakly
    sum(S) ≤ v (∀a ∈ S, a ≥ 0)  | yes          | no          | no
    sum(S) ≥ v (∀a ∈ S, a ≥ 0)  | no           | yes         | no
    range(S) ≤ v                | yes          | no          | no
    range(S) ≥ v                | no           | yes         | no
    avg(S) θ v, θ ∈ {=, ≤, ≥}   | convertible  | convertible | no
    support(S) ≥ ξ              | yes          | no          | no
    support(S) ≤ ξ              | no           | yes         | no

  • A Classification of Constraints

  • Chapter 5: Mining Frequent Patterns, Associations and Correlations
    Basic concepts and a road map
    Efficient and scalable frequent itemset mining methods
    Mining various kinds of association rules
    From association mining to correlation analysis
    Constraint-based association mining
    Summary

  • Frequent-Pattern Mining: Summary
    Frequent pattern mining: an important task in data mining
    Scalable frequent pattern mining methods:
      Apriori (candidate generation & test)
      Projection-based (FP-growth, CLOSET+, ...)
      Vertical-format approach (CHARM, ...)
    Mining a variety of rules and interesting patterns
    Constraint-based mining
    Mining sequential and structured patterns
    Extensions and applications

  • Frequent-Pattern Mining: Research Problems
    Mining fault-tolerant frequent, sequential, and structured patterns
      Patterns allow limited faults (insertion, deletion, mutation)
    Mining truly interesting patterns
      Surprising, novel, concise, ...
    Application exploration
      E.g., DNA sequence analysis and bio-pattern classification
    "Invisible" data mining

  • Ref: Basic Concepts of Frequent Pattern Mining
    (Association rules) R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93.
    (Max-pattern) R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98.
    (Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99.
    (Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95.

  • Ref: Apriori and Its Improvements
    R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
    H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94.
    A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95.
    J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95.
    H. Toivonen. Sampling large databases for association rules. VLDB'96.
    S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97.
    S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98.

  • Ref: Depth-First, Projection-Based FP Mining
    R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. J. Parallel and Distributed Computing:02.
    J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00.
    J. Pei, J. Han, and R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00.
    J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic Projection. KDD'02.
    J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without Minimum Support. ICDM'02.
    J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets. KDD'03.
    G. Liu, H. Lu, W. Lou, and J. X. Yu. On Computing, Storing and Querying Frequent Patterns. KDD'03.

  • Ref: Vertical Format and Row Enumeration Methods
    M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules. DAMI:97.
    Zaki and Hsiao. CHARM: An Efficient Algorithm for Closed Itemset Mining. SDM'02.
    C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints. KDD'02.
    F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki. CARPENTER: Finding Closed Patterns in Long Biological Datasets. KDD'03.

  • Ref: Mining Multi-Level and Quantitative Rules
    R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95.
    J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95.
    R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96.
    T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96.
    K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97.
    R. J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97.
    Y. Aumann and Y. Lindell. A Statistical Theory for Quantitative Association Rules. KDD'99.

  • Ref: Mining Correlations and Interesting Rules
    M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94.
    S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97.
    C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98.
    P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns. KDD'02.
    E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE'03.
    Y. K. Lee, W. Y. Kim, Y. D. Cai, and J. Han. CoMine: Efficient Mining of Correlated Patterns. ICDM'03.

  • Ref: Mining Other Kinds of Rules
    R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96.
    B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97.
    A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98.
    D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98.
    F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB'98.
    K. Wang, S. Zhou, and J. Han. Profit Mining: From Patterns to Actions. EDBT'02.

  • Ref: Constraint-Based Pattern Mining
    R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97.
    R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained association rules. SIGMOD'98.
    M. N. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB'99.
    G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00.
    J. Pei, J. Han, and L. V. S. Lakshmanan. Mining Frequent Itemsets with Convertible Constraints. ICDE'01.
    J. Pei, J. Han, and W. Wang. Mining Sequential Patterns with Constraints in Large Databases. CIKM'02.

  • Ref: Mining Sequential and Structured Patterns
    R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
    H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI:97.
    M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning:01.
    J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01.
    M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. ICDM'01.
    X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.
    X. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns. KDD'03.

  • Ref: Mining Spatial, Multimedia, and Web Data
    K. Koperski and J. Han. Discovery of Spatial Association Rules in Geographic Information Databases. SSD'95.
    O. R. Zaiane, M. Xin, and J. Han. Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs. ADL'98.
    O. R. Zaiane, J. Han, and H. Zhu. Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. ICDE'00.
    D. Gunopulos and I. Tsoukatos. Efficient Mining of Spatiotemporal Patterns. SSTD'01.

  • Ref: Mining Frequent Patterns in Time-Series Data
    B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98.
    J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE'99.
    H. Lu, L. Feng, and J. Han. Beyond Intra-Transaction Association Analysis: Mining Multi-Dimensional Inter-Transaction Association Rules. TOIS:00.
    B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online Data Mining for Co-Evolving Time Sequences. ICDE'00.
    W. Wang, J. Yang, and R. Muntz. TAR: Temporal Association Rules on Evolving Numerical Attributes. ICDE'01.
    J. Yang, W. Wang, and P. S. Yu. Mining Asynchronous Periodic Patterns in Time Series Data. TKDE'03.

  • Ref: Iceberg Cube and Cube Computation
    S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB'96.
    Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD'97.
    J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. DAMI:97.
    M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98.
    S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT'98.
    K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD'99.

  • Ref: Iceberg Cube and Cube Exploration
    J. Han, J. Pei, G. Dong, and K. Wang. Computing Iceberg Data Cubes with Complex Measures. SIGMOD'01.
    W. Wang, H. Lu, J. Feng, and J. X. Yu. Condensed Cube: An Effective Approach to Reducing Data Cube Size. ICDE'02.
    G. Dong, J. Han, J. Lam, J. Pei, and K. Wang. Mining Multi-Dimensional Constrained Gradients in Data Cubes. VLDB'01.
    T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. DAMI:02.
    L. V. S. Lakshmanan, J. Pei, and J. Han. Quotient Cube: How to Summarize the Semantics of a Data Cube. VLDB'02.
    D. Xin, J. Han, X. Li, and B. W. Wah. Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration. VLDB'03.

  • Ref: FP for Classification and Clustering
    G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. KDD'99.
    B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. KDD'98.
    W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. ICDM'01.
    H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD'02.
    J. Yang and W. Wang. CLUSEQ: Efficient and effective sequence clustering. ICDE'03.
    B. Fung, K. Wang, and M. Ester. Large Hierarchical Document Clustering Using Frequent Itemset. SDM'03.
    X. Yin and J. Han. CPAR: Classification based on Predictive Association Rules. SDM'03.

  • Ref: Stream and Privacy-Preserving FP Mining
    A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy Preserving Mining of Association Rules. KDD'02.
    J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. KDD'02.
    G. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. VLDB'02.
    Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional Regression Analysis of Time-Series Data Streams. VLDB'02.
    C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining Frequent Patterns in Data Streams at Multiple Time Granularities. Next Generation Data Mining:03.
    A. Evfimievski, J. Gehrke, and R. Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. PODS'03.

  • Ref: Other Freq. Pattern Mining Applications
    Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. Efficient Discovery of Functional and Approximate Dependencies Using Partitions. ICDE'98.
    H. V. Jagadish, J. Madar, and R. Ng. Semantic Compression and Pattern Extraction with Fascicles. VLDB'99.
    T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining Database Structure; or How to Build a Data Quality Browser. SIGMOD'02.
