A SEMINAR ON THE COMPARATIVE STUDY OF APRIORI AND FP-GROWTH ALGORITHM FOR ASSOCIATION RULE MINING
Under the Guidance of Mrs. Sankirti Shiravale
By Deepti Pawar
Contents
Introduction
Literature Survey
Apriori Algorithm
FP-Growth Algorithm
Comparative Result
Conclusion
References
Introduction
Data Mining: the process of discovering interesting patterns (or knowledge) from large amounts of data.
• Which items are frequently purchased with milk?
• Fraud detection: Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
• Customer relationship management: Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
Data mining helps extract such information.
Introduction (contd.): Why Data Mining?
Broadly, data mining can help answer queries on:
• Forecasting
• Classification
• Association
• Clustering
• Sequence discovery
Introduction (contd.): Data Mining Applications
• Aid to marketing or retailing
• Market basket analysis (MBA)
• Medicare and health care
• Criminal investigation and homeland security
• Intrusion detection
• The "beer and baby diapers" phenomenon, and many more…
Literature Survey: Association Rule Mining
• Proposed by R. Agrawal in 1993.
• An important data mining model, studied extensively by the database and data mining community.
• Initially used for market basket analysis, to find how items purchased by customers are related.
• Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.
Literature Survey (contd.): Frequent Itemsets
• Itemset: a collection of one or more items, e.g. {Milk, Bread, Diaper}. A k-itemset is an itemset that contains k items.
• Support count (σ): the frequency of occurrence of an itemset, e.g. σ({Milk, Bread, Diaper}) = 2.
• Support (s): the fraction of transactions that contain an itemset, e.g. s({Milk, Bread, Diaper}) = 2/5.
• Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
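These definitions can be checked directly against the sample transactions. A minimal Python sketch (the function names are illustrative, not from the slides):

```python
# Sample transactions from the table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(itemset): fraction of transactions that contain `itemset`."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4
```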
Literature Survey (contd.): Association Rules
• Association rule: an implication expression of the form X → Y, where X and Y are itemsets. Example: {Milk, Diaper} → {Beer}.
• Rule evaluation metrics:
  Support (s): the fraction of transactions that contain both X and Y.
  Confidence (c): measures how often items in Y appear in transactions that contain X.
Example: {Milk, Diaper} → {Beer}

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
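Both metrics reduce to ratios of support counts, which can be computed in a few lines of Python (a sketch on the sample transactions; `sigma` is an illustrative name):

```python
# Sample transactions from the table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Rule {Milk, Diaper} -> {Beer}
X, Y = {"Milk", "Diaper"}, {"Beer"}
s = sigma(X | Y) / len(transactions)  # support: 2/5 = 0.4
c = sigma(X | Y) / sigma(X)           # confidence: 2/3, roughly 0.67
print(s, c)
```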
Apriori Algorithm
• Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
• The Apriori principle holds due to the following property of the support measure: the support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.
Apriori Algorithm (contd.)
The basic steps to mine the frequent elements are as follows:
• Generate and test: first find the frequent 1-itemsets L1 by scanning the database and removing from the candidate set C1 all elements that do not satisfy the minimum support criterion.
• Join step: to obtain the next level's candidates Ck, self-join the previous frequent itemsets, i.e. Lk-1 ⋈ Lk-1, the Cartesian product of Lk-1 with itself. This step generates new candidate k-itemsets from the Lk-1 found in the previous iteration. Here Ck denotes the candidate k-itemsets and Lk the frequent k-itemsets.
• Prune step: eliminate some of the candidate k-itemsets using the Apriori property, then scan the database to determine the count of each remaining candidate in Ck; all candidates with a count no less than the minimum support count are frequent by definition and therefore belong to Lk. Steps 2 and 3 are repeated until no new candidate set is generated.
Worked example (minimum support count = 2):

Database:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

C^1 (candidate sets per transaction):
TID   Set of itemsets
100   {1} {3} {4}
200   {2} {3} {5}
300   {1} {2} {3} {5}
400   {2} {5}

L1 (frequent 1-itemsets):
Itemset   Support
{1}       2
{2}       3
{3}       3
{5}       3

C2 (candidate 2-itemsets):
{1 2} {1 3} {1 5} {2 3} {2 5} {3 5}

C^2 (candidates per transaction):
TID   Set of itemsets
100   {1 3}
200   {2 3} {2 5} {3 5}
300   {1 2} {1 3} {1 5} {2 3} {2 5} {3 5}
400   {2 5}

L2 (frequent 2-itemsets):
Itemset   Support
{1 3}     2
{2 3}     3
{2 5}     3
{3 5}     2

C3 (candidate 3-itemsets):
{2 3 5}

C^3 (candidates per transaction):
TID   Set of itemsets
200   {2 3 5}
300   {2 3 5}

L3 (frequent 3-itemsets):
Itemset   Support
{2 3 5}   2
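The generate/join/prune loop can be sketched compactly in Python. This is a plain Apriori sketch with illustrative names (it does not reproduce the per-transaction C^k bookkeeping of the tables above), and it recovers L1 through L3 on the sample database:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Plain Apriori: returns {frozenset: support count} for every frequent itemset."""
    items = {i for t in transactions for i in t}
    # Generate and test: frequent 1-itemsets L1
    counts = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    Lk = {s: n for s, n in counts.items() if n >= min_count}
    frequent = dict(Lk)
    k = 2
    while Lk:
        # Join step: Ck from the self-join of L(k-1)
        prev = list(Lk)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in Lk for sub in combinations(c, k - 1))}
        # One database scan to count the surviving candidates
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        Lk = {s: n for s, n in counts.items() if n >= min_count}
        frequent.update(Lk)
        k += 1
    return frequent

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
freq = apriori(db, min_count=2)
print(freq[frozenset({2, 3, 5})])  # 2, matching L3
```

Note that {1 2 3} and {1 3 5} are generated by the join but removed by the prune step, since {1 2} and {1 5} are not in L2.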
Apriori Algorithm (contd.): Bottlenecks of Apriori
The Apriori algorithm no doubt finds the frequent elements from the database successfully, but as the dimensionality of the database increases with the number of items:
• more search space is needed and the I/O cost increases;
• the number of database scans increases, so candidate generation grows, which increases the computational cost.
FP-Growth Algorithm
FP-Growth allows frequent itemset discovery without candidate itemset generation. It is a two-step approach:
Step 1: Build a compact data structure called the FP-tree, using 2 passes over the data set.
Step 2: Extract frequent itemsets directly from the FP-tree.
FP-Growth Algorithm (contd.): Step 1, FP-Tree Construction
The FP-tree is constructed using 2 passes over the data set.
Pass 1: Scan the data and find the support for each item. Discard infrequent items and sort the frequent items in decreasing order of their support.
• Minimum support count = 2
• Scan the database to find frequent 1-itemsets: s(A) = 8, s(B) = 7, s(C) = 5, s(D) = 5, s(E) = 3
• Item order (decreasing support): A, B, C, D, E
Use this order when building the FP-tree, so common prefixes can be shared.
FP-Growth Algorithm (contd.): Step 1, FP-Tree Construction, Pass 2
Nodes correspond to items and have a counter.
1. FP-Growth reads one transaction at a time and maps it to a path.
2. A fixed item order is used, so paths can overlap when transactions share items (i.e. when they have the same prefix); in this case, counters are incremented.
3. Pointers are maintained between nodes containing the same item, creating singly linked lists (dotted lines). The more paths overlap, the higher the compression; the FP-tree may fit in memory.
4. Frequent itemsets are then extracted from the FP-tree.
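The two passes can be sketched as follows. The slides' own transaction list appears only in a figure that is not reproduced here, so the data below is hypothetical; the tree-insertion and header-table (node-link) logic follows the numbered steps above:

```python
from collections import defaultdict

class FPNode:
    """One node of an FP-tree: an item, a counter, and child links."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fptree(transactions, min_count):
    """Pass 1: count item supports. Pass 2: insert transactions in decreasing-support order."""
    counts = defaultdict(int)
    for t in transactions:
        for i in t:
            counts[i] += 1
    frequent = {i for i, c in counts.items() if c >= min_count}
    root = FPNode(None)
    header = defaultdict(list)  # item -> node links (the dotted-line lists of the slides)
    for t in transactions:
        # Keep frequent items only, sorted by global support (ties broken by name).
        path = sorted((i for i in t if i in frequent),
                      key=lambda i: (-counts[i], i))
        node = root
        for item in path:
            if item in node.children:        # shared prefix: bump the counter
                node.children[item].count += 1
            else:                            # new branch
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
    return root, header

# Hypothetical transactions for illustration.
db = [{"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"},
      {"A", "D", "E"}, {"A", "B", "C"}]
root, header = build_fptree(db, min_count=2)
print(sum(n.count for n in header["D"]))  # 3: total support of D along its node links
```

Summing the counters along an item's node links recovers that item's global support count, which is exactly how Step 2 tests whether a suffix item is frequent.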
FP-Growth Algorithm (contd) Step 1 FP-Tree Construction (contd)
FP-Growth Algorithm (contd)Complete FP-Tree for Sample Transactions
FP-Growth Algorithm (contd.): Step 2, Frequent Itemset Generation
FP-Growth extracts frequent itemsets from the FP-tree:
• Bottom-up algorithm: it works from the leaves towards the root.
• Divide and conquer: first look for frequent itemsets ending in E, then DE, etc., then D, then CD, etc.
• First, extract the prefix-path sub-trees ending in an item or itemset (using the linked lists).
FP-Growth Algorithm (contd) Prefix path sub-trees (Example)
FP-Growth Algorithm (contd) Example
Let minsup = 2 and extract all frequent itemsets containing E.
• Obtain the prefix-path sub-tree for E.
• Check whether E is a frequent item by adding the counts along its linked list (dotted line); if so, extract it.
• Yes: count = 3, so E is extracted as a frequent itemset.
• As E is frequent, find frequent itemsets ending in E, i.e. DE, CE, BE and AE.
• The E nodes can now be removed.
FP-Growth Algorithm (contd.): Conditional FP-Tree
The conditional FP-tree on an itemset is the FP-tree that would be built if we considered only the transactions containing that itemset (and then removed the itemset from all transactions).
Example: the FP-tree conditional on E.
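A simplified sketch of building the conditional base for an item: take the prefix paths ending in that item (with the counts at its nodes), then keep only the items that are frequent within those paths. The paths below are hypothetical, not the slides' figure:

```python
from collections import defaultdict

# Hypothetical prefix paths ending in E: (path of items above the E node, count at E).
prefix_paths = [(("A", "C", "D"), 1), (("A", "D"), 1), (("B", "C"), 1)]
min_count = 2

# Count each item's support inside the conditional pattern base.
counts = defaultdict(int)
for path, n in prefix_paths:
    for item in path:
        counts[item] += n

# The conditional FP-tree keeps only items frequent *within these paths*.
frequent = {i for i, c in counts.items() if c >= min_count}
conditional_base = [([i for i in path if i in frequent], n)
                    for path, n in prefix_paths]
print(sorted(frequent))  # ['A', 'C', 'D']
```

Here B is dropped from the conditional tree because it occurs in only one prefix path, which mirrors why BE is not considered in the next step of the slides' walkthrough.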
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd.): Obtain T(DE) from T(E)
Use the conditional FP-tree for E to find frequent itemsets ending in DE, CE and AE. Note that BE is not considered, as B is not in the conditional FP-tree for E.
• Support count of DE = 2 (the sum of the counts of all D nodes).
• DE is frequent; we now need to solve CDE, BDE and ADE, if they exist.
FP-Growth Algorithm (contd.) Current Position in Processing
FP-Growth Algorithm (contd.): Solving CDE, BDE, ADE
• The sub-trees for both CDE and BDE are empty: there are no prefix paths ending with C or B.
• Working on ADE: ADE (support count = 2) is frequent. Next sub-problem: suffix CE.
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd.): Solving for Suffix CE
• CE is frequent (support count = 2).
• Work on the next sub-problems: BE (no support), then AE.
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd.): Solving for Suffix AE
• AE is frequent (support count = 2). Done with AE; work on the next sub-problem, suffix D.
FP-Growth Algorithm (contd.): Found Frequent Itemsets with Suffix E
• E, DE, ADE, CE, AE, discovered in this order.
FP-Growth Algorithm (contd.): Example (contd.)
Frequent itemsets found (ordered by suffix and by the order in which they are found).
Comparative Result
Conclusion
It is found that:
• The FP-tree is a novel data structure storing compressed, crucial information about frequent patterns; it is compact yet complete for frequent pattern mining.
• FP-growth is an efficient method for mining frequent patterns in large databases, using a highly compact FP-tree and a divide-and-conquer approach.
• Both Apriori and FP-Growth aim to find the complete set of patterns, but FP-Growth is more efficient than Apriori with respect to long patterns.
References
1. Liwu Zou, Guangwei Ren, "The data mining algorithm analysis for personalized service", Fourth International Conference on Multimedia Information Networking and Security, 2012.
2. Jun Tan, Yingyong Bu and Bo Yang, "An Efficient Frequent Pattern Mining Algorithm", Sixth International Conference on Fuzzy Systems and Knowledge Discovery, 2009.
3. Wei Zhang, Hongzhi Liao, Na Zhao, "Research on the FP Growth Algorithm about Association Rule Mining", International Seminar on Business and Information Management, 2008.
4. S.P. Latha, Dr. N. Ramaraj, "Algorithm for Efficient Data Mining", Proc. IEEE International Conference on Computational Intelligence and Multimedia Applications, 2007.
5. Dongme Sun, Shaohua Teng, Wei Zhang, Haibin Zhu, "An Algorithm to Improve the Effectiveness of Apriori", Proc. 6th IEEE International Conference on Cognitive Informatics (ICCI'07), 2007.
6. Daniel Hunyadi, "Performance comparison of Apriori and FP-Growth algorithms in generating association rules", Proceedings of the European Computing Conference, 2006.
7. Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2006.
8. Tan, P.-N., Steinbach, M. and Kumar, V., "Introduction to Data Mining", Addison Wesley Publishers, 2006.
9. Han, J., Pei, J. and Yin, Y., "Mining frequent patterns without candidate generation", Proc. ACM-SIGMOD International Conference on Management of Data (SIGMOD), 2000.
10. R. Agrawal, T. Imielinski, A. Swami, "Mining Association Rules between Sets of Items in Large Databases", Proc. ACM SIGMOD Conference, Washington DC, USA, 1993.
ContentsIntroduction
Literature Survey
Apriori Algorithm
FP-Growth Algorithm
Comparative Result
Conclusion
Reference
Introduction
Data Mining It is the process of discovering interesting patterns (or knowledge) from large amount of data
bull Which items are frequently purchased with milk
bull Fraud detection Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer
bull Customer relationship management Which of my customers are likely to be the most loyal and which are most likely to leave for a competitor
Data Mining helps extract such information
Introduction (contd) Why Data MiningBroadly the data mining could be useful to answer the queries on
bull Forecasting
bull Classification
bull Association
bull Clustering
bull Making the sequence
Introduction (contd) Data Mining Applicationsbull Aid to marketing or retailing
bull Market basket analysis (MBA)
bull Medicare and health care
bull Criminal investigation and homeland security
bull Intrusion detection
bull Phenomena of ldquobeer and baby diapersrdquo And many morehellip
Literature Survey Association Rule Miningbull Proposed by R Agrawal in 1993
bull It is an important data mining model studied extensively by the database and data mining community
bull Initially used for Market Basket Analysis to find how items purchased by customers are related
bull Given a set of transactions find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Literature Survey (contd)Frequent Itemset
bull Itemset A collection of one or more items
Example Milk Bread Diaper k-itemset
An itemset that contains k itemsbull Support count () Frequency of occurrence of an itemset Eg (Milk Bread Diaper) = 2
bull Support Fraction of transactions that contain an itemset Eg s( Milk Bread Diaper ) = 25
bull Frequent Itemset An itemset whose support is greater than or equal
to a minsup threshold
TID Items
1 Bread Milk
2 Bread Diaper Beer Eggs
3 Milk Diaper Beer Coke
4 Bread Milk Diaper Beer
5 Bread Milk Diaper Coke
Literature Survey (contd)Association Rulebull Association Rule An implication expression of
the form X Y where X and Y are itemsets
Example Milk Diaper Beer
bull Rule Evaluation Metrics Support (s)
Fraction of transactions that contain both X and Y
Confidence (c) Measures how often items in
Y appear in transactions thatcontain X
TID Items
1 Bread Milk
2 Bread Diaper Beer Eggs
3 Milk Diaper Beer Coke
4 Bread Milk Diaper Beer
5 Bread Milk Diaper Coke
ExampleBeerDiaperMilk
4052
|T|)BeerDiaperMilk(
s
67032
)DiaperMilk()BeerDiaperMilk(
c
Apriori Algorithm
bull Apriori principle If an itemset is frequent then all of its subsets must also be frequent
bull Apriori principle holds due to the following property of the support measure Support of an itemset never exceeds the support of its subsets This is known as the anti-monotone property of support
Apriori Algorithm (contd)
The basic steps to mine the frequent elements are as follows
bull Generate and test In this first find the 1-itemset frequent elements L1 by scanning the database and removing all those elements from C which cannot satisfy the minimum support criteria
bull Join step To attain the next level elements Ck join the previous frequent elements by self join ie Lk-1Lk-1 known as Cartesian product of Lk-1 ie This step generates new candidate k-itemsets based on joining Lk-1 with itself which is found in the previous iteration Let Ck denote candidate k-itemset and Lk be the frequent k-itemset
bull Prune step This step eliminates some of the candidate k-itemsets using the Apriori property A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk (ie all candidates having a count no less than the minimum support count are frequent by definition and therefore belong to Lk) Step 2 and 3 is repeated until no new candidate set is generated
TID Items
100 1 3 4
200 2 3 5
300 1 2 3 5
400 2 5
TID Set-of- itemsets
100 134
200 235
300 1235
400 25
Itemset Support
1 2
2 3
3 3
5 3
itemset
1 2
1 3
1 5
2 3
2 5
3 5
TID Set-of- itemsets
100 1 3
200 2 32 5 3 5
300 1 21 31 52 3 2 5 3 5
400 2 5
Itemset Support
1 3 2
2 3 3
2 5 3
3 5 2
itemset
2 3 5
TID Set-of- itemsets
200 2 3 5
300 2 3 5
Itemset Support
2 3 5 2
Database C^1
L2
C2 C^2
C^3
L1
L3C3
Apriori Algorithm (contd) Bottlenecks of Aprioribull It is no doubt that Apriori algorithm successfully finds the frequent
elements from the database But as the dimensionality of the database increase with the number of items then
bull More search space is needed and IO cost will increase
bull Number of database scan is increased thus candidate generation will increase results in increase in computational cost
FP-Growth Algorithm
FP-Growth allows frequent itemset discovery without candidate itemset generation Two step approach
Step 1 Build a compact data structure called the FP-tree Built using 2 passes over the data-set
Step 2 Extracts frequent itemsets directly from the FP-tree
FP-Growth Algorithm (contd)Step 1 FP-Tree Construction FP-Tree is constructed using 2 passes
over the data-setPass 1 Scan data and find support for each
item Discard infrequent items Sort frequent items in decreasing
order based on their supportbull Minimum support count = 2bull Scan database to find frequent 1-itemsetsbull s(A) = 8 s(B) = 7 s(C) = 5 s(D) = 5 s(E) = 3bull 1048698 Item order (decreasing support) A B C D E
Use this order when building the FP-Tree so common prefixes can be shared
FP-Growth Algorithm (contd) Step 1 FP-Tree ConstructionPass 2Nodes correspond to items and have a counter1 FP-Growth reads 1 transaction at a time and maps it to a path
2 Fixed order is used so paths can overlap when transactions share items (when they have the same prefix ) In this case counters are incremented
3 Pointers are maintained between nodes containing the same item creating singly linked lists (dotted lines) The more paths that overlap the higher the compression FP-tree
may fit in memory
4 Frequent itemsets extracted from the FP-Tree
FP-Growth Algorithm (contd) Step 1 FP-Tree Construction (contd)
FP-Growth Algorithm (contd)Complete FP-Tree for Sample Transactions
FP-Growth Algorithm (contd) Step 2 Frequent Itemset Generation FP-Growth extracts frequent itemsets from the FP-tree
Bottom-up algorithm - from the leaves towards the root
Divide and conquer first look for frequent itemsets ending in e then de etc then d then cd etc
First extract prefix path sub-trees ending in an item(set) (using the linked lists)
FP-Growth Algorithm (contd) Prefix path sub-trees (Example)
FP-Growth Algorithm (contd) Example
Let minSup = 2 and extract all frequent itemsets containing E Obtain the prefix path sub-tree for E
Check if E is a frequent item by adding the counts along the linked list (dotted line) If so extract it
Yes count =3 so E is extracted as a frequent itemset
As E is frequent find frequent itemsets ending in e ie DE CE BE and AE
E nodes can now be removed
FP-Growth Algorithm (contd) Conditional FP-Tree
The FP-Tree that would be built if we only consider transactions containing a particular itemset (and then removing that itemset from all transactions)
I Example FP-Tree conditional on e
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Obtain T(DE) from T(E) 4 Use the conditional FP-tree for e to find frequent itemsets ending in DE CE
and AE Note that BE is not considered as B is not in the conditional FP-tree for E
bull Support count of DE = 2 (sum of counts of all Drsquos)bull DE is frequent need to solve CDE BDE ADE if they exist
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd)Solving CDE BDE ADEbull Sub-trees for both CDE and BDE are emptybull no prefix paths ending with C or Bbull Working on ADE
ADE (support count = 2) is frequentsolving next sub problem CE
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix CE
CE is frequent (support count = 2)bull Work on next sub problems BE (no support) AE
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix AE
AE is frequent (support count = 2)Done with AEWork on next sub problem suffix D
FP-Growth Algorithm (contd) Found Frequent Itemsets with Suffix Ebull E DE ADE CE AE discovered in this order
FP-Growth Algorithm (contd) Example (contd)
Frequent itemsets found (ordered by suffix and order in which the are found)
Comparative Result
Conclusion
It is found that
bull FP-tree a novel data structure storing compressed crucial information about frequent patterns compact yet complete for frequent pattern mining
bull FP-growth an efficient mining method of frequent patterns in large Database using a highly compact FP-tree divide-and-conquer method in nature
bull Both Apriori and FP-Growth are aiming to find out complete set of patterns but FP-Growth is more efficient than Apriori in respect to long patterns
References
1 Liwu ZOU Guangwei REN ldquoThe data mining algorithm analysis for personalized servicerdquo Fourth International Conference on Multimedia Information Networking and Security 2012
2 Jun TAN Yingyong BU and Bo YANG ldquoAn Efficient Frequent Pattern Mining Algorithmrdquo Sixth International Conference on Fuzzy Systems and Knowledge Discovery 2009
3 Wei Zhang Hongzhi Liao Na Zhao ldquoResearch on the FP Growth Algorithm about Association Rule Miningrdquo International Seminar on Business and Information Management 2008
4 SP Latha DR NRamaraj ldquoAlgorithm for Efficient Data Miningrdquo In Proc
Intrsquo Conf on IEEE International Computational Intelligence and Multimedia Applications 2007
References (contd)
5 Dongme Sun Shaohua Teng Wei Zhang Haibin Zhu ldquoAn Algorithm to Improve the Effectiveness of Apriorirdquo In Proc Intrsquol Conf on 6th IEEE International Conf on Cognitive Informatics (ICCI07) 2007
6 Daniel Hunyadi ldquoPerformance comparison of Apriori and FP-Growth algorithms in generating association rulesrdquo Proceedings of the European Computing Conference 2006
7 By Jiawei Han Micheline Kamber ldquoData mining Concepts and Techniquesrdquo Morgan Kaufmann Publishers 2006
8 Tan P-N Steinbach M and Kumar V ldquoIntroduction to data miningrdquo Addison Wesley Publishers 2006
References (contd)
9 HanJ PeiJ and Yin Y ldquoMining frequent patterns without candidate generationrdquo In Proc ACM-SIGMOD International Conf Management of Data (SIGMOD) 2000
10 R Agrawal Imielinskit SwamiA ldquoMining Association Rules between Sets of Items in Large Databasesrdquo In Proc International Conf of the ACM SIGMOD Conference Washington DC USA 1993
Introduction
Data Mining It is the process of discovering interesting patterns (or knowledge) from large amount of data
bull Which items are frequently purchased with milk
bull Fraud detection Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer
bull Customer relationship management Which of my customers are likely to be the most loyal and which are most likely to leave for a competitor
Data Mining helps extract such information
Introduction (contd) Why Data MiningBroadly the data mining could be useful to answer the queries on
bull Forecasting
bull Classification
bull Association
bull Clustering
bull Making the sequence
Introduction (contd) Data Mining Applicationsbull Aid to marketing or retailing
bull Market basket analysis (MBA)
bull Medicare and health care
bull Criminal investigation and homeland security
bull Intrusion detection
bull Phenomena of ldquobeer and baby diapersrdquo And many morehellip
Literature Survey Association Rule Miningbull Proposed by R Agrawal in 1993
bull It is an important data mining model studied extensively by the database and data mining community
bull Initially used for Market Basket Analysis to find how items purchased by customers are related
bull Given a set of transactions find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Literature Survey (contd)Frequent Itemset
bull Itemset A collection of one or more items
Example Milk Bread Diaper k-itemset
An itemset that contains k itemsbull Support count () Frequency of occurrence of an itemset Eg (Milk Bread Diaper) = 2
bull Support Fraction of transactions that contain an itemset Eg s( Milk Bread Diaper ) = 25
bull Frequent Itemset An itemset whose support is greater than or equal
to a minsup threshold
TID Items
1 Bread Milk
2 Bread Diaper Beer Eggs
3 Milk Diaper Beer Coke
4 Bread Milk Diaper Beer
5 Bread Milk Diaper Coke
Literature Survey (contd)Association Rulebull Association Rule An implication expression of
the form X Y where X and Y are itemsets
Example Milk Diaper Beer
bull Rule Evaluation Metrics Support (s)
Fraction of transactions that contain both X and Y
Confidence (c) Measures how often items in
Y appear in transactions thatcontain X
TID Items
1 Bread Milk
2 Bread Diaper Beer Eggs
3 Milk Diaper Beer Coke
4 Bread Milk Diaper Beer
5 Bread Milk Diaper Coke
ExampleBeerDiaperMilk
4052
|T|)BeerDiaperMilk(
s
67032
)DiaperMilk()BeerDiaperMilk(
c
Apriori Algorithm
bull Apriori principle If an itemset is frequent then all of its subsets must also be frequent
bull Apriori principle holds due to the following property of the support measure Support of an itemset never exceeds the support of its subsets This is known as the anti-monotone property of support
Apriori Algorithm (contd)
The basic steps to mine the frequent elements are as follows
bull Generate and test In this first find the 1-itemset frequent elements L1 by scanning the database and removing all those elements from C which cannot satisfy the minimum support criteria
bull Join step To attain the next level elements Ck join the previous frequent elements by self join ie Lk-1Lk-1 known as Cartesian product of Lk-1 ie This step generates new candidate k-itemsets based on joining Lk-1 with itself which is found in the previous iteration Let Ck denote candidate k-itemset and Lk be the frequent k-itemset
bull Prune step This step eliminates some of the candidate k-itemsets using the Apriori property A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk (ie all candidates having a count no less than the minimum support count are frequent by definition and therefore belong to Lk) Step 2 and 3 is repeated until no new candidate set is generated
TID Items
100 1 3 4
200 2 3 5
300 1 2 3 5
400 2 5
TID Set-of- itemsets
100 134
200 235
300 1235
400 25
Itemset Support
1 2
2 3
3 3
5 3
itemset
1 2
1 3
1 5
2 3
2 5
3 5
TID Set-of- itemsets
100 1 3
200 2 32 5 3 5
300 1 21 31 52 3 2 5 3 5
400 2 5
Itemset Support
1 3 2
2 3 3
2 5 3
3 5 2
itemset
2 3 5
TID Set-of- itemsets
200 2 3 5
300 2 3 5
Itemset Support
2 3 5 2
Database C^1
L2
C2 C^2
C^3
L1
L3C3
Apriori Algorithm (contd) Bottlenecks of Aprioribull It is no doubt that Apriori algorithm successfully finds the frequent
elements from the database But as the dimensionality of the database increase with the number of items then
bull More search space is needed and IO cost will increase
bull Number of database scan is increased thus candidate generation will increase results in increase in computational cost
FP-Growth Algorithm
FP-Growth allows frequent itemset discovery without candidate itemset generation Two step approach
Step 1 Build a compact data structure called the FP-tree Built using 2 passes over the data-set
Step 2 Extracts frequent itemsets directly from the FP-tree
FP-Growth Algorithm (contd)Step 1 FP-Tree Construction FP-Tree is constructed using 2 passes
over the data-setPass 1 Scan data and find support for each
item Discard infrequent items Sort frequent items in decreasing
order based on their supportbull Minimum support count = 2bull Scan database to find frequent 1-itemsetsbull s(A) = 8 s(B) = 7 s(C) = 5 s(D) = 5 s(E) = 3bull 1048698 Item order (decreasing support) A B C D E
Use this order when building the FP-Tree so common prefixes can be shared
FP-Growth Algorithm (contd) Step 1 FP-Tree ConstructionPass 2Nodes correspond to items and have a counter1 FP-Growth reads 1 transaction at a time and maps it to a path
2 Fixed order is used so paths can overlap when transactions share items (when they have the same prefix ) In this case counters are incremented
3 Pointers are maintained between nodes containing the same item creating singly linked lists (dotted lines) The more paths that overlap the higher the compression FP-tree
may fit in memory
4 Frequent itemsets extracted from the FP-Tree
FP-Growth Algorithm (contd) Step 1 FP-Tree Construction (contd)
FP-Growth Algorithm (contd)Complete FP-Tree for Sample Transactions
FP-Growth Algorithm (contd) Step 2 Frequent Itemset Generation FP-Growth extracts frequent itemsets from the FP-tree
Bottom-up algorithm - from the leaves towards the root
Divide and conquer first look for frequent itemsets ending in e then de etc then d then cd etc
First extract prefix path sub-trees ending in an item(set) (using the linked lists)
FP-Growth Algorithm (contd) Prefix path sub-trees (Example)
FP-Growth Algorithm (contd) Example
Let minSup = 2 and extract all frequent itemsets containing E Obtain the prefix path sub-tree for E
Check if E is a frequent item by adding the counts along the linked list (dotted line) If so extract it
Yes count =3 so E is extracted as a frequent itemset
As E is frequent find frequent itemsets ending in e ie DE CE BE and AE
E nodes can now be removed
FP-Growth Algorithm (contd) Conditional FP-Tree
The FP-Tree that would be built if we only consider transactions containing a particular itemset (and then removing that itemset from all transactions)
I Example FP-Tree conditional on e
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Obtain T(DE) from T(E) 4 Use the conditional FP-tree for e to find frequent itemsets ending in DE CE
and AE Note that BE is not considered as B is not in the conditional FP-tree for E
bull Support count of DE = 2 (sum of counts of all Drsquos)bull DE is frequent need to solve CDE BDE ADE if they exist
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd)Solving CDE BDE ADEbull Sub-trees for both CDE and BDE are emptybull no prefix paths ending with C or Bbull Working on ADE
ADE (support count = 2) is frequentsolving next sub problem CE
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix CE
CE is frequent (support count = 2)bull Work on next sub problems BE (no support) AE
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix AE
AE is frequent (support count = 2)Done with AEWork on next sub problem suffix D
• Support: the fraction of transactions that contain an itemset. E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset: an itemset whose support is greater than or equal to a minsup threshold
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Literature Survey (contd): Association Rule
• Association Rule: an implication expression of the form X → Y, where X and Y are itemsets
  Example: {Milk, Diaper} → Beer
• Rule Evaluation Metrics:
  Support (s): the fraction of transactions that contain both X and Y
  Confidence (c): how often items in Y appear in transactions that contain X
Example: {Milk, Diaper} → Beer
  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
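Both metrics can be computed directly on the five sample transactions above; a minimal sketch (the helper names are illustrative, not from any library):

```python
# Sample transactions from the table above
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions containing X."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(X, Y, transactions):
    """c(X -> Y) = sigma(X union Y) / sigma(X)."""
    return support_count(X | Y, transactions) / support_count(X, transactions)

s = support({"Milk", "Diaper", "Beer"}, transactions)       # 2/5 = 0.4
c = confidence({"Milk", "Diaper"}, {"Beer"}, transactions)  # 2/3 ≈ 0.67
```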
Apriori Algorithm
• Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
• The Apriori principle holds due to the following property of the support measure: the support of an itemset never exceeds the support of any of its subsets. This is known as the anti-monotone property of support.
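The anti-monotone property can be checked mechanically on the sample transactions above; a small sketch (`sigma` is an illustrative helper name):

```python
from itertools import combinations

# Sample transactions, as in the table above
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Anti-monotone property: no subset is rarer than its superset
itemset = frozenset({"Milk", "Diaper", "Beer"})
for k in range(1, len(itemset)):
    for subset in map(frozenset, combinations(itemset, k)):
        assert sigma(subset) >= sigma(itemset)
```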
Apriori Algorithm (contd)
The basic steps to mine the frequent elements are as follows:
• Generate and test: first find the frequent 1-itemsets L1 by scanning the database and removing from the candidate set C1 all elements that do not satisfy the minimum support criterion.
• Join step: to obtain the next level of candidates Ck, self-join the previous frequent itemsets, Lk-1 ⋈ Lk-1. This step generates new candidate k-itemsets from the Lk-1 found in the previous iteration. (Ck denotes the candidate k-itemsets; Lk denotes the frequent k-itemsets.)
• Prune step: eliminate candidate k-itemsets that have an infrequent (k-1)-subset, using the Apriori property. A scan of the database then determines the count of each remaining candidate in Ck; all candidates whose count is no less than the minimum support count are frequent by definition and therefore belong to Lk. Steps 2 and 3 are repeated until no new candidate set is generated.
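The three steps can be sketched compactly; this is a simplified illustration on the integer-item trace that follows (minimum support count 2), not an optimized implementation from the cited papers:

```python
from itertools import combinations

def apriori(transactions, min_count=2):
    """Return all frequent itemsets (frozensets) with their support counts."""
    # Generate and test: frequent 1-itemsets L1
    items = {i for t in transactions for i in t}
    L = {frozenset([i]) for i in items
         if sum(i in t for t in transactions) >= min_count}
    frequent = {s: sum(s <= t for t in transactions) for s in L}
    k = 2
    while L:
        # Join step: Lk-1 joined with Lk-1 (unions that form a k-itemset)
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(sub) in L
                             for sub in combinations(c, k - 1))}
        # One database scan counts the surviving candidates
        L = set()
        for cand in candidates:
            count = sum(cand <= t for t in transactions)
            if count >= min_count:
                L.add(cand)
                frequent[cand] = count
        k += 1
    return frequent

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(db)  # {2, 3, 5} survives with support count 2
```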
Worked example (minimum support count = 2):

Database:
  TID 100: 1 3 4
  TID 200: 2 3 5
  TID 300: 1 2 3 5
  TID 400: 2 5

C^1 (candidate 1-itemsets contained in each transaction):
  100: {1} {3} {4}
  200: {2} {3} {5}
  300: {1} {2} {3} {5}
  400: {2} {5}

L1 (frequent 1-itemsets with support counts):
  {1}: 2   {2}: 3   {3}: 3   {5}: 3

C2 (candidate 2-itemsets):
  {1 2} {1 3} {1 5} {2 3} {2 5} {3 5}

C^2 (candidates contained in each transaction):
  100: {1 3}
  200: {2 3} {2 5} {3 5}
  300: {1 2} {1 3} {1 5} {2 3} {2 5} {3 5}
  400: {2 5}

L2 (frequent 2-itemsets):
  {1 3}: 2   {2 3}: 3   {2 5}: 3   {3 5}: 2

C3 (candidate 3-itemsets):
  {2 3 5}

C^3 (candidates contained in each transaction):
  200: {2 3 5}
  300: {2 3 5}

L3 (frequent 3-itemsets):
  {2 3 5}: 2
Apriori Algorithm (contd): Bottlenecks of Apriori
• The Apriori algorithm undoubtedly finds the frequent elements in the database, but as the dimensionality of the database increases with the number of items:
• more search space is needed and the I/O cost increases;
• the number of database scans grows, so candidate generation, and with it the computational cost, increases.
FP-Growth Algorithm
FP-Growth allows frequent itemset discovery without candidate itemset generation. It is a two-step approach:
Step 1: Build a compact data structure called the FP-tree, using 2 passes over the data set.
Step 2: Extract frequent itemsets directly from the FP-tree.
FP-Growth Algorithm (contd): Step 1, FP-Tree Construction
The FP-tree is constructed using 2 passes over the data set.
Pass 1: Scan the data and find the support of each item. Discard infrequent items and sort the frequent items in decreasing order of support.
• Minimum support count = 2
• Scan the database to find frequent 1-itemsets: s(A) = 8, s(B) = 7, s(C) = 5, s(D) = 5, s(E) = 3
• Item order (decreasing support): A, B, C, D, E
Use this order when building the FP-tree, so common prefixes can be shared.
FP-Growth Algorithm (contd): Step 1, FP-Tree Construction, Pass 2
Nodes correspond to items and carry a counter.
1. FP-Growth reads one transaction at a time and maps it to a path.
2. A fixed item order is used, so paths can overlap when transactions share items (i.e. have the same prefix); in that case the shared counters are incremented.
3. Pointers are maintained between nodes containing the same item, creating singly linked lists (dotted lines). The more paths overlap, the higher the compression; the FP-tree may then fit in memory.
4. Frequent itemsets are extracted from the FP-tree.
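The two passes above can be sketched as a small program; the class name, the `rank` map, and the header-table representation are illustrative assumptions, not the exact structures of Han et al.:

```python
from collections import defaultdict

class FPNode:
    """A node of the FP-tree: an item, a counter, and child links."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}  # item -> FPNode

def build_fptree(transactions, min_count=2):
    # Pass 1: support counts; keep frequent items in decreasing-support order
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    rank = {item: r for r, item in enumerate(
        sorted((i for i in counts if counts[i] >= min_count),
               key=lambda i: -counts[i]))}
    # Pass 2: map each transaction to a path; shared prefixes share nodes
    root = FPNode(None, None)
    header = defaultdict(list)  # item -> all nodes for it (the linked list)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

# Tiny usage example on three transactions; all share the prefix A
root, header = build_fptree([{"A", "B"}, {"A", "C"}, {"A", "B", "C"}])
```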
FP-Growth Algorithm (contd) Step 1 FP-Tree Construction (contd)
FP-Growth Algorithm (contd)Complete FP-Tree for Sample Transactions
FP-Growth Algorithm (contd): Step 2, Frequent Itemset Generation
FP-Growth extracts frequent itemsets from the FP-tree:
• a bottom-up algorithm, working from the leaves towards the root;
• divide and conquer: first look for frequent itemsets ending in E, then DE, etc., then D, then CD, etc.;
• first extract the prefix-path sub-trees ending in an item or itemset (using the linked lists).
FP-Growth Algorithm (contd) Prefix path sub-trees (Example)
FP-Growth Algorithm (contd) Example
Let minSup = 2 and extract all frequent itemsets containing E.
• Obtain the prefix-path sub-tree for E.
• Check whether E is a frequent item by adding the counts along the linked list (dotted line); if so, extract it.
• Yes, count = 3, so E is extracted as a frequent itemset.
• As E is frequent, find the frequent itemsets ending in E, i.e. DE, CE, BE and AE.
• The E nodes can now be removed.
FP-Growth Algorithm (contd): Conditional FP-Tree
The conditional FP-tree is the FP-tree that would be built if we only considered transactions containing a particular itemset (and then removed that itemset from all transactions).
Example: the FP-tree conditional on E.
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd): Obtain T(DE) from T(E)
Use the conditional FP-tree for E to find frequent itemsets ending in DE, CE and AE. Note that BE is not considered, as B is not in the conditional FP-tree for E.
• Support count of DE = 2 (the sum of the counts of all D nodes).
• DE is frequent; we now need to solve CDE, BDE and ADE, if they exist.
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd): Solving CDE, BDE, ADE
• The sub-trees for both CDE and BDE are empty: there are no prefix paths ending with C or B.
• Working on ADE: ADE (support count = 2) is frequent.
Next sub-problem: CE.
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd): Solving for Suffix CE
• CE is frequent (support count = 2).
• Next sub-problems: BE (no support) and AE.
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd): Solving for Suffix AE
• AE is frequent (support count = 2). Done with AE; the next sub-problem is suffix D.
FP-Growth Algorithm (contd): Found Frequent Itemsets with Suffix E
• E, DE, ADE, CE and AE, discovered in this order.
The frequent itemsets are found ordered by suffix and by the order in which they are discovered.
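The suffix-by-suffix recursion above (E, DE, ADE, CE, AE, then suffix D, ...) can be written compactly over conditional pattern bases instead of explicit tree nodes. This is a simplified pattern-growth sketch on an assumed dataset similar to the slides' example, not the pointer-based FP-tree implementation of Han et al.:

```python
from collections import defaultdict

def fp_growth(pattern_base, suffix, order, min_count, results):
    """pattern_base: (itemset, count) pairs for transactions holding `suffix`."""
    counts = defaultdict(int)
    for items, c in pattern_base:
        for i in items:
            counts[i] += c
    # Process least-frequent items first, mirroring suffixes E, D, C, ...
    for item in sorted(counts, key=order.get, reverse=True):
        if counts[item] < min_count:
            continue
        new_suffix = frozenset({item}) | suffix
        results[new_suffix] = counts[item]
        # Conditional pattern base for the grown suffix: transactions that
        # contain `item`, restricted to items ranked above it (its ancestors)
        conditional = [({i for i in items if order[i] < order[item]}, c)
                       for items, c in pattern_base if item in items]
        fp_growth(conditional, new_suffix, order, min_count, results)
    return results

# Illustrative dataset (assumed; close to, but not necessarily identical
# to, the transactions behind the slides' figures)
db = [{"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"}, {"A", "D", "E"},
      {"A", "B", "C"}, {"A", "B", "C", "D"}, {"A"}, {"A", "B", "C"},
      {"A", "B", "D"}, {"B", "C", "E"}]
global_counts = defaultdict(int)
for t in db:
    for i in t:
        global_counts[i] += 1
order = {i: r for r, i in enumerate(
    sorted(global_counts, key=lambda i: -global_counts[i]))}
found = fp_growth([(frozenset(t), 1) for t in db], frozenset(), order, 2, {})
```

On this data the recursion reproduces the slides' result for suffix E: it finds E, DE, ADE, CE and AE, while BE is never generated because B falls out of the conditional pattern base for E.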
Comparative Result
Conclusion
It is found that:
• The FP-tree is a novel data structure storing compressed, crucial information about frequent patterns; it is compact yet complete for frequent pattern mining.
• FP-growth is an efficient method for mining frequent patterns in large databases; it uses a highly compact FP-tree and is divide-and-conquer in nature.
• Both Apriori and FP-Growth aim to find the complete set of frequent patterns, but FP-Growth is more efficient than Apriori with respect to long patterns.
References
1. Liwu Zou, Guangwei Ren, "The data mining algorithm analysis for personalized service", Fourth International Conference on Multimedia Information Networking and Security, 2012.
2. Jun Tan, Yingyong Bu, Bo Yang, "An Efficient Frequent Pattern Mining Algorithm", Sixth International Conference on Fuzzy Systems and Knowledge Discovery, 2009.
3. Wei Zhang, Hongzhi Liao, Na Zhao, "Research on the FP Growth Algorithm about Association Rule Mining", International Seminar on Business and Information Management, 2008.
4. S.P. Latha, Dr. N. Ramaraj, "Algorithm for Efficient Data Mining", IEEE International Conference on Computational Intelligence and Multimedia Applications, 2007.
5. Dongme Sun, Shaohua Teng, Wei Zhang, Haibin Zhu, "An Algorithm to Improve the Effectiveness of Apriori", 6th IEEE International Conference on Cognitive Informatics (ICCI'07), 2007.
6. Daniel Hunyadi, "Performance comparison of Apriori and FP-Growth algorithms in generating association rules", Proceedings of the European Computing Conference, 2006.
7. Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2006.
8. P.-N. Tan, M. Steinbach, V. Kumar, "Introduction to Data Mining", Addison Wesley Publishers, 2006.
9. J. Han, J. Pei, Y. Yin, "Mining frequent patterns without candidate generation", Proc. ACM-SIGMOD International Conference on Management of Data (SIGMOD), 2000.
10. R. Agrawal, T. Imielinski, A. Swami, "Mining Association Rules between Sets of Items in Large Databases", Proc. ACM SIGMOD Conference, Washington DC, USA, 1993.
Introduction (contd) Data Mining Applicationsbull Aid to marketing or retailing
bull Market basket analysis (MBA)
bull Medicare and health care
bull Criminal investigation and homeland security
bull Intrusion detection
bull Phenomena of ldquobeer and baby diapersrdquo And many morehellip
Literature Survey Association Rule Miningbull Proposed by R Agrawal in 1993
bull It is an important data mining model studied extensively by the database and data mining community
bull Initially used for Market Basket Analysis to find how items purchased by customers are related
bull Given a set of transactions find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Literature Survey (contd)Frequent Itemset
bull Itemset A collection of one or more items
Example Milk Bread Diaper k-itemset
An itemset that contains k itemsbull Support count () Frequency of occurrence of an itemset Eg (Milk Bread Diaper) = 2
bull Support Fraction of transactions that contain an itemset Eg s( Milk Bread Diaper ) = 25
bull Frequent Itemset An itemset whose support is greater than or equal
to a minsup threshold
TID Items
1 Bread Milk
2 Bread Diaper Beer Eggs
3 Milk Diaper Beer Coke
4 Bread Milk Diaper Beer
5 Bread Milk Diaper Coke
Literature Survey (contd)Association Rulebull Association Rule An implication expression of
the form X Y where X and Y are itemsets
Example Milk Diaper Beer
bull Rule Evaluation Metrics Support (s)
Fraction of transactions that contain both X and Y
Confidence (c) Measures how often items in
Y appear in transactions thatcontain X
TID Items
1 Bread Milk
2 Bread Diaper Beer Eggs
3 Milk Diaper Beer Coke
4 Bread Milk Diaper Beer
5 Bread Milk Diaper Coke
ExampleBeerDiaperMilk
4052
|T|)BeerDiaperMilk(
s
67032
)DiaperMilk()BeerDiaperMilk(
c
Apriori Algorithm
bull Apriori principle If an itemset is frequent then all of its subsets must also be frequent
bull Apriori principle holds due to the following property of the support measure Support of an itemset never exceeds the support of its subsets This is known as the anti-monotone property of support
Apriori Algorithm (contd)
The basic steps to mine the frequent elements are as follows
bull Generate and test In this first find the 1-itemset frequent elements L1 by scanning the database and removing all those elements from C which cannot satisfy the minimum support criteria
bull Join step To attain the next level elements Ck join the previous frequent elements by self join ie Lk-1Lk-1 known as Cartesian product of Lk-1 ie This step generates new candidate k-itemsets based on joining Lk-1 with itself which is found in the previous iteration Let Ck denote candidate k-itemset and Lk be the frequent k-itemset
bull Prune step This step eliminates some of the candidate k-itemsets using the Apriori property A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk (ie all candidates having a count no less than the minimum support count are frequent by definition and therefore belong to Lk) Step 2 and 3 is repeated until no new candidate set is generated
TID Items
100 1 3 4
200 2 3 5
300 1 2 3 5
400 2 5
TID Set-of- itemsets
100 134
200 235
300 1235
400 25
Itemset Support
1 2
2 3
3 3
5 3
itemset
1 2
1 3
1 5
2 3
2 5
3 5
TID Set-of- itemsets
100 1 3
200 2 32 5 3 5
300 1 21 31 52 3 2 5 3 5
400 2 5
Itemset Support
1 3 2
2 3 3
2 5 3
3 5 2
itemset
2 3 5
TID Set-of- itemsets
200 2 3 5
300 2 3 5
Itemset Support
2 3 5 2
Database C^1
L2
C2 C^2
C^3
L1
L3C3
Apriori Algorithm (contd) Bottlenecks of Aprioribull It is no doubt that Apriori algorithm successfully finds the frequent
elements from the database But as the dimensionality of the database increase with the number of items then
bull More search space is needed and IO cost will increase
bull Number of database scan is increased thus candidate generation will increase results in increase in computational cost
FP-Growth Algorithm
FP-Growth allows frequent itemset discovery without candidate itemset generation Two step approach
Step 1 Build a compact data structure called the FP-tree Built using 2 passes over the data-set
Step 2 Extracts frequent itemsets directly from the FP-tree
FP-Growth Algorithm (contd)Step 1 FP-Tree Construction FP-Tree is constructed using 2 passes
over the data-setPass 1 Scan data and find support for each
item Discard infrequent items Sort frequent items in decreasing
order based on their supportbull Minimum support count = 2bull Scan database to find frequent 1-itemsetsbull s(A) = 8 s(B) = 7 s(C) = 5 s(D) = 5 s(E) = 3bull 1048698 Item order (decreasing support) A B C D E
Use this order when building the FP-Tree so common prefixes can be shared
FP-Growth Algorithm (contd) Step 1 FP-Tree ConstructionPass 2Nodes correspond to items and have a counter1 FP-Growth reads 1 transaction at a time and maps it to a path
2 Fixed order is used so paths can overlap when transactions share items (when they have the same prefix ) In this case counters are incremented
3 Pointers are maintained between nodes containing the same item creating singly linked lists (dotted lines) The more paths that overlap the higher the compression FP-tree
may fit in memory
4 Frequent itemsets extracted from the FP-Tree
FP-Growth Algorithm (contd) Step 1 FP-Tree Construction (contd)
FP-Growth Algorithm (contd)Complete FP-Tree for Sample Transactions
FP-Growth Algorithm (contd) Step 2 Frequent Itemset Generation FP-Growth extracts frequent itemsets from the FP-tree
Bottom-up algorithm - from the leaves towards the root
Divide and conquer first look for frequent itemsets ending in e then de etc then d then cd etc
First extract prefix path sub-trees ending in an item(set) (using the linked lists)
FP-Growth Algorithm (contd) Prefix path sub-trees (Example)
FP-Growth Algorithm (contd) Example
Let minSup = 2 and extract all frequent itemsets containing E Obtain the prefix path sub-tree for E
Check if E is a frequent item by adding the counts along the linked list (dotted line) If so extract it
Yes count =3 so E is extracted as a frequent itemset
As E is frequent find frequent itemsets ending in e ie DE CE BE and AE
E nodes can now be removed
FP-Growth Algorithm (contd) Conditional FP-Tree
The FP-Tree that would be built if we only consider transactions containing a particular itemset (and then removing that itemset from all transactions)
I Example FP-Tree conditional on e
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Obtain T(DE) from T(E) 4 Use the conditional FP-tree for e to find frequent itemsets ending in DE CE
and AE Note that BE is not considered as B is not in the conditional FP-tree for E
bull Support count of DE = 2 (sum of counts of all Drsquos)bull DE is frequent need to solve CDE BDE ADE if they exist
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd)Solving CDE BDE ADEbull Sub-trees for both CDE and BDE are emptybull no prefix paths ending with C or Bbull Working on ADE
ADE (support count = 2) is frequentsolving next sub problem CE
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix CE
CE is frequent (support count = 2)bull Work on next sub problems BE (no support) AE
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix AE
AE is frequent (support count = 2)Done with AEWork on next sub problem suffix D
FP-Growth Algorithm (contd) Found Frequent Itemsets with Suffix Ebull E DE ADE CE AE discovered in this order
FP-Growth Algorithm (contd) Example (contd)
Frequent itemsets found (ordered by suffix and order in which the are found)
Comparative Result
Conclusion
It is found that
bull FP-tree a novel data structure storing compressed crucial information about frequent patterns compact yet complete for frequent pattern mining
bull FP-growth an efficient mining method of frequent patterns in large Database using a highly compact FP-tree divide-and-conquer method in nature
bull Both Apriori and FP-Growth are aiming to find out complete set of patterns but FP-Growth is more efficient than Apriori in respect to long patterns
References
1 Liwu ZOU Guangwei REN ldquoThe data mining algorithm analysis for personalized servicerdquo Fourth International Conference on Multimedia Information Networking and Security 2012
2 Jun TAN Yingyong BU and Bo YANG ldquoAn Efficient Frequent Pattern Mining Algorithmrdquo Sixth International Conference on Fuzzy Systems and Knowledge Discovery 2009
3 Wei Zhang Hongzhi Liao Na Zhao ldquoResearch on the FP Growth Algorithm about Association Rule Miningrdquo International Seminar on Business and Information Management 2008
4 SP Latha DR NRamaraj ldquoAlgorithm for Efficient Data Miningrdquo In Proc
Intrsquo Conf on IEEE International Computational Intelligence and Multimedia Applications 2007
References (contd)
5 Dongme Sun Shaohua Teng Wei Zhang Haibin Zhu ldquoAn Algorithm to Improve the Effectiveness of Apriorirdquo In Proc Intrsquol Conf on 6th IEEE International Conf on Cognitive Informatics (ICCI07) 2007
6 Daniel Hunyadi ldquoPerformance comparison of Apriori and FP-Growth algorithms in generating association rulesrdquo Proceedings of the European Computing Conference 2006
7 By Jiawei Han Micheline Kamber ldquoData mining Concepts and Techniquesrdquo Morgan Kaufmann Publishers 2006
8 Tan P-N Steinbach M and Kumar V ldquoIntroduction to data miningrdquo Addison Wesley Publishers 2006
References (contd)
9 HanJ PeiJ and Yin Y ldquoMining frequent patterns without candidate generationrdquo In Proc ACM-SIGMOD International Conf Management of Data (SIGMOD) 2000
10 R Agrawal Imielinskit SwamiA ldquoMining Association Rules between Sets of Items in Large Databasesrdquo In Proc International Conf of the ACM SIGMOD Conference Washington DC USA 1993
Literature Survey Association Rule Miningbull Proposed by R Agrawal in 1993
bull It is an important data mining model studied extensively by the database and data mining community
bull Initially used for Market Basket Analysis to find how items purchased by customers are related
bull Given a set of transactions find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Literature Survey (contd)Frequent Itemset
bull Itemset A collection of one or more items
Example Milk Bread Diaper k-itemset
An itemset that contains k itemsbull Support count () Frequency of occurrence of an itemset Eg (Milk Bread Diaper) = 2
bull Support Fraction of transactions that contain an itemset Eg s( Milk Bread Diaper ) = 25
bull Frequent Itemset An itemset whose support is greater than or equal
to a minsup threshold
TID Items
1 Bread Milk
2 Bread Diaper Beer Eggs
3 Milk Diaper Beer Coke
4 Bread Milk Diaper Beer
5 Bread Milk Diaper Coke
Literature Survey (contd)Association Rulebull Association Rule An implication expression of
the form X Y where X and Y are itemsets
Example Milk Diaper Beer
bull Rule Evaluation Metrics Support (s)
Fraction of transactions that contain both X and Y
Confidence (c) Measures how often items in
Y appear in transactions thatcontain X
TID Items
1 Bread Milk
2 Bread Diaper Beer Eggs
3 Milk Diaper Beer Coke
4 Bread Milk Diaper Beer
5 Bread Milk Diaper Coke
ExampleBeerDiaperMilk
4052
|T|)BeerDiaperMilk(
s
67032
)DiaperMilk()BeerDiaperMilk(
c
Apriori Algorithm
bull Apriori principle If an itemset is frequent then all of its subsets must also be frequent
bull Apriori principle holds due to the following property of the support measure Support of an itemset never exceeds the support of its subsets This is known as the anti-monotone property of support
Apriori Algorithm (contd)
The basic steps to mine the frequent elements are as follows
bull Generate and test In this first find the 1-itemset frequent elements L1 by scanning the database and removing all those elements from C which cannot satisfy the minimum support criteria
bull Join step To attain the next level elements Ck join the previous frequent elements by self join ie Lk-1Lk-1 known as Cartesian product of Lk-1 ie This step generates new candidate k-itemsets based on joining Lk-1 with itself which is found in the previous iteration Let Ck denote candidate k-itemset and Lk be the frequent k-itemset
bull Prune step This step eliminates some of the candidate k-itemsets using the Apriori property A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk (ie all candidates having a count no less than the minimum support count are frequent by definition and therefore belong to Lk) Step 2 and 3 is repeated until no new candidate set is generated
TID Items
100 1 3 4
200 2 3 5
300 1 2 3 5
400 2 5
TID Set-of- itemsets
100 134
200 235
300 1235
400 25
Itemset Support
1 2
2 3
3 3
5 3
itemset
1 2
1 3
1 5
2 3
2 5
3 5
TID Set-of- itemsets
100 1 3
200 2 32 5 3 5
300 1 21 31 52 3 2 5 3 5
400 2 5
Itemset Support
1 3 2
2 3 3
2 5 3
3 5 2
itemset
2 3 5
TID Set-of- itemsets
200 2 3 5
300 2 3 5
Itemset Support
2 3 5 2
Database C^1
L2
C2 C^2
C^3
L1
L3C3
Apriori Algorithm (contd) Bottlenecks of Aprioribull It is no doubt that Apriori algorithm successfully finds the frequent
elements from the database But as the dimensionality of the database increase with the number of items then
bull More search space is needed and IO cost will increase
bull Number of database scan is increased thus candidate generation will increase results in increase in computational cost
FP-Growth Algorithm
FP-Growth allows frequent itemset discovery without candidate itemset generation Two step approach
Step 1 Build a compact data structure called the FP-tree Built using 2 passes over the data-set
Step 2 Extracts frequent itemsets directly from the FP-tree
FP-Growth Algorithm (contd)Step 1 FP-Tree Construction FP-Tree is constructed using 2 passes
over the data-setPass 1 Scan data and find support for each
item Discard infrequent items Sort frequent items in decreasing
order based on their supportbull Minimum support count = 2bull Scan database to find frequent 1-itemsetsbull s(A) = 8 s(B) = 7 s(C) = 5 s(D) = 5 s(E) = 3bull 1048698 Item order (decreasing support) A B C D E
Use this order when building the FP-Tree so common prefixes can be shared
FP-Growth Algorithm (contd) Step 1 FP-Tree ConstructionPass 2Nodes correspond to items and have a counter1 FP-Growth reads 1 transaction at a time and maps it to a path
2 Fixed order is used so paths can overlap when transactions share items (when they have the same prefix ) In this case counters are incremented
3 Pointers are maintained between nodes containing the same item creating singly linked lists (dotted lines) The more paths that overlap the higher the compression FP-tree
may fit in memory
4 Frequent itemsets extracted from the FP-Tree
FP-Growth Algorithm (contd) Step 1 FP-Tree Construction (contd)
FP-Growth Algorithm (contd)Complete FP-Tree for Sample Transactions
FP-Growth Algorithm (contd) Step 2 Frequent Itemset Generation FP-Growth extracts frequent itemsets from the FP-tree
Bottom-up algorithm - from the leaves towards the root
Divide and conquer first look for frequent itemsets ending in e then de etc then d then cd etc
First extract prefix path sub-trees ending in an item(set) (using the linked lists)
FP-Growth Algorithm (contd) Prefix path sub-trees (Example)
FP-Growth Algorithm (contd) Example
Let minSup = 2 and extract all frequent itemsets containing E Obtain the prefix path sub-tree for E
Check if E is a frequent item by adding the counts along the linked list (dotted line) If so extract it
Yes count =3 so E is extracted as a frequent itemset
As E is frequent find frequent itemsets ending in e ie DE CE BE and AE
E nodes can now be removed
FP-Growth Algorithm (contd) Conditional FP-Tree
The FP-Tree that would be built if we only consider transactions containing a particular itemset (and then removing that itemset from all transactions)
I Example FP-Tree conditional on e
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Obtain T(DE) from T(E) 4 Use the conditional FP-tree for e to find frequent itemsets ending in DE CE
and AE Note that BE is not considered as B is not in the conditional FP-tree for E
bull Support count of DE = 2 (sum of counts of all Drsquos)bull DE is frequent need to solve CDE BDE ADE if they exist
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd)Solving CDE BDE ADEbull Sub-trees for both CDE and BDE are emptybull no prefix paths ending with C or Bbull Working on ADE
ADE (support count = 2) is frequentsolving next sub problem CE
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix CE
CE is frequent (support count = 2)bull Work on next sub problems BE (no support) AE
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix AE
AE is frequent (support count = 2)Done with AEWork on next sub problem suffix D
FP-Growth Algorithm (contd) Found Frequent Itemsets with Suffix Ebull E DE ADE CE AE discovered in this order
FP-Growth Algorithm (contd) Example (contd)
Frequent itemsets found (ordered by suffix and order in which the are found)
Comparative Result
Conclusion
It is found that
bull FP-tree a novel data structure storing compressed crucial information about frequent patterns compact yet complete for frequent pattern mining
bull FP-growth an efficient mining method of frequent patterns in large Database using a highly compact FP-tree divide-and-conquer method in nature
bull Both Apriori and FP-Growth are aiming to find out complete set of patterns but FP-Growth is more efficient than Apriori in respect to long patterns
References
1 Liwu ZOU Guangwei REN ldquoThe data mining algorithm analysis for personalized servicerdquo Fourth International Conference on Multimedia Information Networking and Security 2012
2 Jun TAN Yingyong BU and Bo YANG ldquoAn Efficient Frequent Pattern Mining Algorithmrdquo Sixth International Conference on Fuzzy Systems and Knowledge Discovery 2009
3 Wei Zhang Hongzhi Liao Na Zhao ldquoResearch on the FP Growth Algorithm about Association Rule Miningrdquo International Seminar on Business and Information Management 2008
4 SP Latha DR NRamaraj ldquoAlgorithm for Efficient Data Miningrdquo In Proc
Intrsquo Conf on IEEE International Computational Intelligence and Multimedia Applications 2007
References (contd)
5 Dongme Sun Shaohua Teng Wei Zhang Haibin Zhu ldquoAn Algorithm to Improve the Effectiveness of Apriorirdquo In Proc Intrsquol Conf on 6th IEEE International Conf on Cognitive Informatics (ICCI07) 2007
6 Daniel Hunyadi ldquoPerformance comparison of Apriori and FP-Growth algorithms in generating association rulesrdquo Proceedings of the European Computing Conference 2006
7 By Jiawei Han Micheline Kamber ldquoData mining Concepts and Techniquesrdquo Morgan Kaufmann Publishers 2006
8 Tan P-N Steinbach M and Kumar V ldquoIntroduction to data miningrdquo Addison Wesley Publishers 2006
References (contd)
9 HanJ PeiJ and Yin Y ldquoMining frequent patterns without candidate generationrdquo In Proc ACM-SIGMOD International Conf Management of Data (SIGMOD) 2000
10 R Agrawal Imielinskit SwamiA ldquoMining Association Rules between Sets of Items in Large Databasesrdquo In Proc International Conf of the ACM SIGMOD Conference Washington DC USA 1993
Literature Survey (contd)Frequent Itemset
bull Itemset A collection of one or more items
Example Milk Bread Diaper k-itemset
An itemset that contains k itemsbull Support count () Frequency of occurrence of an itemset Eg (Milk Bread Diaper) = 2
bull Support Fraction of transactions that contain an itemset Eg s( Milk Bread Diaper ) = 25
bull Frequent Itemset An itemset whose support is greater than or equal
to a minsup threshold
TID Items
1 Bread Milk
2 Bread Diaper Beer Eggs
3 Milk Diaper Beer Coke
4 Bread Milk Diaper Beer
5 Bread Milk Diaper Coke
Literature Survey (contd)Association Rulebull Association Rule An implication expression of
the form X Y where X and Y are itemsets
Example Milk Diaper Beer
bull Rule Evaluation Metrics Support (s)
Fraction of transactions that contain both X and Y
Confidence (c) Measures how often items in
Y appear in transactions thatcontain X
TID Items
1 Bread Milk
2 Bread Diaper Beer Eggs
3 Milk Diaper Beer Coke
4 Bread Milk Diaper Beer
5 Bread Milk Diaper Coke
ExampleBeerDiaperMilk
4052
|T|)BeerDiaperMilk(
s
67032
)DiaperMilk()BeerDiaperMilk(
c
Apriori Algorithm
bull Apriori principle If an itemset is frequent then all of its subsets must also be frequent
bull Apriori principle holds due to the following property of the support measure Support of an itemset never exceeds the support of its subsets This is known as the anti-monotone property of support
Apriori Algorithm (contd)
The basic steps to mine the frequent elements are as follows
bull Generate and test In this first find the 1-itemset frequent elements L1 by scanning the database and removing all those elements from C which cannot satisfy the minimum support criteria
bull Join step To attain the next level elements Ck join the previous frequent elements by self join ie Lk-1Lk-1 known as Cartesian product of Lk-1 ie This step generates new candidate k-itemsets based on joining Lk-1 with itself which is found in the previous iteration Let Ck denote candidate k-itemset and Lk be the frequent k-itemset
bull Prune step This step eliminates some of the candidate k-itemsets using the Apriori property A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk (ie all candidates having a count no less than the minimum support count are frequent by definition and therefore belong to Lk) Step 2 and 3 is repeated until no new candidate set is generated
Database (minimum support count = 2):
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

C^1 (transactions as sets of candidate 1-itemsets):
TID   Set-of-itemsets
100   {1}, {3}, {4}
200   {2}, {3}, {5}
300   {1}, {2}, {3}, {5}
400   {2}, {5}

L1 (frequent 1-itemsets):
Itemset   Support
{1}       2
{2}       3
{3}       3
{5}       3

C2 (candidate 2-itemsets):
{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

C^2 (candidates contained in each transaction):
TID   Set-of-itemsets
100   {1 3}
200   {2 3}, {2 5}, {3 5}
300   {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
400   {2 5}

L2 (frequent 2-itemsets):
Itemset   Support
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3 (candidate 3-itemsets):
{2 3 5}

C^3 (candidates contained in each transaction):
TID   Set-of-itemsets
200   {2 3 5}
300   {2 3 5}

L3 (frequent 3-itemsets):
Itemset   Support
{2 3 5}   2
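The generate-and-test / join / prune loop above can be condensed into a short Python sketch. This is a simplified illustration of the idea, not the optimized algorithm from the literature (hash trees and other candidate-counting structures are omitted), run on the four-transaction example database.

```python
# Simplified Apriori: level-wise join + prune + support counting.
from itertools import combinations

def apriori(db, minsup):
    # L1: frequent 1-itemsets found by a first database scan
    items = {i for t in db for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in db) >= minsup}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Join step: Lk-1 self-joined gives candidate k-itemsets Ck
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Database scan to count each surviving candidate
        Lk = {c for c in Ck if sum(c <= t for t in db) >= minsup}
        frequent |= Lk
        k += 1
    return frequent

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(db, 2)
print(len(result))  # 9 frequent itemsets, including {2, 3, 5}
```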
Apriori Algorithm (contd.) Bottlenecks of Apriori
• The Apriori algorithm undoubtedly succeeds in finding the frequent itemsets in a database. But as the dimensionality of the database increases with the number of items:
• more search space is needed and the I/O cost increases;
• the number of database scans grows, so candidate generation increases, which raises the computational cost.
FP-Growth Algorithm
FP-Growth allows frequent itemset discovery without candidate itemset generation. It is a two-step approach:
Step 1: Build a compact data structure called the FP-tree, using 2 passes over the data set.
Step 2: Extract frequent itemsets directly from the FP-tree.
FP-Growth Algorithm (contd.) Step 1: FP-Tree Construction
The FP-tree is constructed using 2 passes over the data set.
Pass 1: Scan the data and find the support for each item. Discard infrequent items and sort the frequent items in decreasing order of support.
• Minimum support count = 2
• Scan the database to find the frequent 1-itemsets:
s(A) = 8, s(B) = 7, s(C) = 5, s(D) = 5, s(E) = 3
• Item order (decreasing support): A, B, C, D, E
Use this order when building the FP-tree, so common prefixes can be shared.
FP-Growth Algorithm (contd.) Step 1: FP-Tree Construction, Pass 2
Nodes correspond to items and carry a counter.
1. FP-Growth reads one transaction at a time and maps it to a path.
2. A fixed item order is used, so paths can overlap when transactions share items (i.e. when they have the same prefix); in this case the counters are incremented.
3. Pointers are maintained between nodes containing the same item, creating singly linked lists (dotted lines). The more paths overlap, the higher the compression; the FP-tree may then fit in memory.
4. Frequent itemsets are then extracted from the FP-tree.
FP-Growth Algorithm (contd) Step 1 FP-Tree Construction (contd)
FP-Growth Algorithm (contd)Complete FP-Tree for Sample Transactions
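The two passes above can be sketched in Python. This is a minimal illustration run on the earlier four-transaction database (the A–E transactions behind the slide's figure are not listed in the text); the `Node` class and header-table layout are design choices of this sketch, not the slides' exact structure.

```python
# Two-pass FP-tree construction: count supports, then insert each
# transaction as a path in frequency order, sharing common prefixes.
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fptree(db, minsup):
    # Pass 1: support counts; discard infrequent items; fix decreasing order
    counts = defaultdict(int)
    for t in db:
        for i in t:
            counts[i] += 1
    order = sorted((i for i in counts if counts[i] >= minsup),
                   key=lambda i: -counts[i])
    rank = {i: r for r, i in enumerate(order)}
    root, header = Node(None, None), defaultdict(list)
    # Pass 2: map each transaction to a path; overlapping prefixes share
    # nodes and only increment counters
    for t in db:
        node = root
        for i in sorted((i for i in t if i in rank), key=rank.get):
            if i in node.children:
                node.children[i].count += 1
            else:
                child = Node(i, node)
                node.children[i] = child
                header[i].append(child)  # node-link list (the dotted lines)
            node = node.children[i]
    return root, header

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
root, header = build_fptree(db, 2)
```

Summing the counts along an item's node-link list gives that item's support, which is exactly how the linked lists are used in Step 2.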
FP-Growth Algorithm (contd.) Step 2: Frequent Itemset Generation
FP-Growth extracts frequent itemsets from the FP-tree.
It is a bottom-up algorithm, working from the leaves towards the root.
Divide and conquer: first look for frequent itemsets ending in E, then DE, etc., then D, then CD, etc.
First, extract the prefix-path sub-trees ending in an item or itemset (using the linked lists).
FP-Growth Algorithm (contd) Prefix path sub-trees (Example)
FP-Growth Algorithm (contd) Example
Let minSup = 2 and extract all frequent itemsets containing E. Obtain the prefix-path sub-tree for E.
Check whether E is a frequent item by adding the counts along its linked list (dotted line); if so, extract it.
Yes, count = 3, so E is extracted as a frequent itemset.
As E is frequent, find the frequent itemsets ending in E, i.e. DE, CE, BE and AE.
The E nodes can now be removed.
FP-Growth Algorithm (contd.) Conditional FP-Tree
The conditional FP-tree is the FP-tree that would be built if we considered only transactions containing a particular itemset (and then removed that itemset from all transactions).
Example: the FP-tree conditional on E.
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd.) Obtain T(DE) from T(E)
Use the conditional FP-tree for E to find frequent itemsets ending in DE, CE and AE. Note that BE is not considered, as B is not in the conditional FP-tree for E.
• Support count of DE = 2 (the sum of the counts of all D nodes)
• DE is frequent; next solve CDE, BDE and ADE, if they exist.
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd.) Solving CDE, BDE, ADE
• The sub-trees for both CDE and BDE are empty: there are no prefix paths ending with C or B.
• Working on ADE: ADE (support count = 2) is frequent.
Next sub-problem: suffix CE.
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd.) Solving for Suffix CE
CE is frequent (support count = 2).
• Work on the next sub-problems: BE (no support), then AE.
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd.) Solving for Suffix AE
AE is frequent (support count = 2). Done with AE; work on the next sub-problem, suffix D.
FP-Growth Algorithm (contd.) Found Frequent Itemsets with Suffix E
• E, DE, ADE, CE and AE, discovered in this order.
FP-Growth Algorithm (contd.) Example (contd.)
Frequent itemsets found (ordered by suffix and by the order in which they are found).
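The whole divide-and-conquer recursion walked through above (prefix-path extraction via the node links, conditional pattern base, conditional FP-tree) can be condensed into one self-contained sketch. It is an illustrative simplification, not the authors' implementation, run on the earlier four-transaction database so its output can be compared with Apriori's.

```python
# FP-Growth sketch: for each item, collect its prefix paths (conditional
# pattern base), build the conditional FP-tree, and recurse on it.
from collections import defaultdict

class Node:
    def __init__(self, item, parent, count):
        self.item, self.parent, self.count = item, parent, count
        self.children = {}

def build(db, minsup):
    """db is a list of (itemset, count) pairs; returns the header table."""
    counts = defaultdict(int)
    for t, c in db:
        for i in t:
            counts[i] += c
    order = {i: -counts[i] for i in counts if counts[i] >= minsup}
    root, header = Node(None, None, 0), defaultdict(list)
    for t, c in db:
        node = root
        for i in sorted((i for i in t if i in order),
                        key=lambda i: (order[i], i)):
            if i not in node.children:
                node.children[i] = Node(i, node, 0)
                header[i].append(node.children[i])  # node-link list
            node = node.children[i]
            node.count += c
    return header

def fpgrowth(db, minsup, suffix=()):
    header = build(db, minsup)
    result = []
    for item, nodes in header.items():
        support = sum(n.count for n in nodes)  # sum along the linked list
        itemset = (item,) + suffix
        result.append((frozenset(itemset), support))
        # Conditional pattern base: prefix paths ending at `item`
        cond_db = []
        for n in nodes:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                cond_db.append((path, n.count))
        result += fpgrowth(cond_db, minsup, itemset)  # divide and conquer
    return result

db = [({1, 3, 4}, 1), ({2, 3, 5}, 1), ({1, 2, 3, 5}, 1), ({2, 5}, 1)]
found = fpgrowth(db, 2)
print(len(found))  # 9, the same frequent itemsets Apriori finds
```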
Comparative Result
Conclusion
It is found that:
• The FP-tree is a novel data structure storing compressed, crucial information about frequent patterns; it is compact yet complete for frequent pattern mining.
• FP-growth is an efficient method for mining frequent patterns in large databases; it uses a highly compact FP-tree and is divide-and-conquer in nature.
• Both Apriori and FP-Growth aim to find the complete set of patterns, but FP-Growth is more efficient than Apriori with respect to long patterns.
References
1. Liwu Zou, Guangwei Ren, "The data mining algorithm analysis for personalized service", Fourth International Conference on Multimedia Information Networking and Security, 2012.
2. Jun Tan, Yingyong Bu and Bo Yang, "An Efficient Frequent Pattern Mining Algorithm", Sixth International Conference on Fuzzy Systems and Knowledge Discovery, 2009.
3. Wei Zhang, Hongzhi Liao, Na Zhao, "Research on the FP Growth Algorithm about Association Rule Mining", International Seminar on Business and Information Management, 2008.
4. S. P. Latha, Dr. N. Ramaraj, "Algorithm for Efficient Data Mining", In Proc. IEEE International Conference on Computational Intelligence and Multimedia Applications, 2007.
References (contd.)
5. Dongme Sun, Shaohua Teng, Wei Zhang, Haibin Zhu, "An Algorithm to Improve the Effectiveness of Apriori", In Proc. 6th IEEE International Conference on Cognitive Informatics (ICCI'07), 2007.
6. Daniel Hunyadi, "Performance comparison of Apriori and FP-Growth algorithms in generating association rules", Proceedings of the European Computing Conference, 2006.
7. Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2006.
8. Tan, P.-N., Steinbach, M. and Kumar, V., "Introduction to Data Mining", Addison Wesley Publishers, 2006.
References (contd.)
9. Han, J., Pei, J. and Yin, Y., "Mining frequent patterns without candidate generation", In Proc. ACM-SIGMOD International Conference on Management of Data (SIGMOD), 2000.
10. R. Agrawal, T. Imielinski, A. Swami, "Mining Association Rules between Sets of Items in Large Databases", In Proc. ACM SIGMOD Conference, Washington DC, USA, 1993.
Literature Survey (contd)Association Rulebull Association Rule An implication expression of
the form X Y where X and Y are itemsets
Example Milk Diaper Beer
bull Rule Evaluation Metrics Support (s)
Fraction of transactions that contain both X and Y
Confidence (c) Measures how often items in
Y appear in transactions thatcontain X
TID Items
1 Bread Milk
2 Bread Diaper Beer Eggs
3 Milk Diaper Beer Coke
4 Bread Milk Diaper Beer
5 Bread Milk Diaper Coke
ExampleBeerDiaperMilk
4052
|T|)BeerDiaperMilk(
s
67032
)DiaperMilk()BeerDiaperMilk(
c
Apriori Algorithm
bull Apriori principle If an itemset is frequent then all of its subsets must also be frequent
bull Apriori principle holds due to the following property of the support measure Support of an itemset never exceeds the support of its subsets This is known as the anti-monotone property of support
Apriori Algorithm (contd)
The basic steps to mine the frequent elements are as follows
bull Generate and test In this first find the 1-itemset frequent elements L1 by scanning the database and removing all those elements from C which cannot satisfy the minimum support criteria
bull Join step To attain the next level elements Ck join the previous frequent elements by self join ie Lk-1Lk-1 known as Cartesian product of Lk-1 ie This step generates new candidate k-itemsets based on joining Lk-1 with itself which is found in the previous iteration Let Ck denote candidate k-itemset and Lk be the frequent k-itemset
bull Prune step This step eliminates some of the candidate k-itemsets using the Apriori property A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk (ie all candidates having a count no less than the minimum support count are frequent by definition and therefore belong to Lk) Step 2 and 3 is repeated until no new candidate set is generated
TID Items
100 1 3 4
200 2 3 5
300 1 2 3 5
400 2 5
TID Set-of- itemsets
100 134
200 235
300 1235
400 25
Itemset Support
1 2
2 3
3 3
5 3
itemset
1 2
1 3
1 5
2 3
2 5
3 5
TID Set-of- itemsets
100 1 3
200 2 32 5 3 5
300 1 21 31 52 3 2 5 3 5
400 2 5
Itemset Support
1 3 2
2 3 3
2 5 3
3 5 2
itemset
2 3 5
TID Set-of- itemsets
200 2 3 5
300 2 3 5
Itemset Support
2 3 5 2
Database C^1
L2
C2 C^2
C^3
L1
L3C3
Apriori Algorithm (contd) Bottlenecks of Aprioribull It is no doubt that Apriori algorithm successfully finds the frequent
elements from the database But as the dimensionality of the database increase with the number of items then
bull More search space is needed and IO cost will increase
bull Number of database scan is increased thus candidate generation will increase results in increase in computational cost
FP-Growth Algorithm
FP-Growth allows frequent itemset discovery without candidate itemset generation Two step approach
Step 1 Build a compact data structure called the FP-tree Built using 2 passes over the data-set
Step 2 Extracts frequent itemsets directly from the FP-tree
FP-Growth Algorithm (contd)Step 1 FP-Tree Construction FP-Tree is constructed using 2 passes
over the data-setPass 1 Scan data and find support for each
item Discard infrequent items Sort frequent items in decreasing
order based on their supportbull Minimum support count = 2bull Scan database to find frequent 1-itemsetsbull s(A) = 8 s(B) = 7 s(C) = 5 s(D) = 5 s(E) = 3bull 1048698 Item order (decreasing support) A B C D E
Use this order when building the FP-Tree so common prefixes can be shared
FP-Growth Algorithm (contd) Step 1 FP-Tree ConstructionPass 2Nodes correspond to items and have a counter1 FP-Growth reads 1 transaction at a time and maps it to a path
2 Fixed order is used so paths can overlap when transactions share items (when they have the same prefix ) In this case counters are incremented
3 Pointers are maintained between nodes containing the same item creating singly linked lists (dotted lines) The more paths that overlap the higher the compression FP-tree
may fit in memory
4 Frequent itemsets extracted from the FP-Tree
FP-Growth Algorithm (contd) Step 1 FP-Tree Construction (contd)
FP-Growth Algorithm (contd)Complete FP-Tree for Sample Transactions
FP-Growth Algorithm (contd) Step 2 Frequent Itemset Generation FP-Growth extracts frequent itemsets from the FP-tree
Bottom-up algorithm - from the leaves towards the root
Divide and conquer first look for frequent itemsets ending in e then de etc then d then cd etc
First extract prefix path sub-trees ending in an item(set) (using the linked lists)
FP-Growth Algorithm (contd) Prefix path sub-trees (Example)
FP-Growth Algorithm (contd) Example
Let minSup = 2 and extract all frequent itemsets containing E Obtain the prefix path sub-tree for E
Check if E is a frequent item by adding the counts along the linked list (dotted line) If so extract it
Yes count =3 so E is extracted as a frequent itemset
As E is frequent find frequent itemsets ending in e ie DE CE BE and AE
E nodes can now be removed
FP-Growth Algorithm (contd) Conditional FP-Tree
The FP-Tree that would be built if we only consider transactions containing a particular itemset (and then removing that itemset from all transactions)
I Example FP-Tree conditional on e
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Obtain T(DE) from T(E) 4 Use the conditional FP-tree for e to find frequent itemsets ending in DE CE
and AE Note that BE is not considered as B is not in the conditional FP-tree for E
bull Support count of DE = 2 (sum of counts of all Drsquos)bull DE is frequent need to solve CDE BDE ADE if they exist
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd)Solving CDE BDE ADEbull Sub-trees for both CDE and BDE are emptybull no prefix paths ending with C or Bbull Working on ADE
ADE (support count = 2) is frequentsolving next sub problem CE
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix CE
CE is frequent (support count = 2)bull Work on next sub problems BE (no support) AE
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix AE
AE is frequent (support count = 2)Done with AEWork on next sub problem suffix D
FP-Growth Algorithm (contd) Found Frequent Itemsets with Suffix Ebull E DE ADE CE AE discovered in this order
FP-Growth Algorithm (contd) Example (contd)
Frequent itemsets found (ordered by suffix and order in which the are found)
Comparative Result
Conclusion
It is found that
bull FP-tree a novel data structure storing compressed crucial information about frequent patterns compact yet complete for frequent pattern mining
bull FP-growth an efficient mining method of frequent patterns in large Database using a highly compact FP-tree divide-and-conquer method in nature
bull Both Apriori and FP-Growth are aiming to find out complete set of patterns but FP-Growth is more efficient than Apriori in respect to long patterns
References
1 Liwu ZOU Guangwei REN ldquoThe data mining algorithm analysis for personalized servicerdquo Fourth International Conference on Multimedia Information Networking and Security 2012
2 Jun TAN Yingyong BU and Bo YANG ldquoAn Efficient Frequent Pattern Mining Algorithmrdquo Sixth International Conference on Fuzzy Systems and Knowledge Discovery 2009
3 Wei Zhang Hongzhi Liao Na Zhao ldquoResearch on the FP Growth Algorithm about Association Rule Miningrdquo International Seminar on Business and Information Management 2008
4 SP Latha DR NRamaraj ldquoAlgorithm for Efficient Data Miningrdquo In Proc
Intrsquo Conf on IEEE International Computational Intelligence and Multimedia Applications 2007
References (contd)
5 Dongme Sun Shaohua Teng Wei Zhang Haibin Zhu ldquoAn Algorithm to Improve the Effectiveness of Apriorirdquo In Proc Intrsquol Conf on 6th IEEE International Conf on Cognitive Informatics (ICCI07) 2007
6 Daniel Hunyadi ldquoPerformance comparison of Apriori and FP-Growth algorithms in generating association rulesrdquo Proceedings of the European Computing Conference 2006
7 By Jiawei Han Micheline Kamber ldquoData mining Concepts and Techniquesrdquo Morgan Kaufmann Publishers 2006
8 Tan P-N Steinbach M and Kumar V ldquoIntroduction to data miningrdquo Addison Wesley Publishers 2006
References (contd)
9 HanJ PeiJ and Yin Y ldquoMining frequent patterns without candidate generationrdquo In Proc ACM-SIGMOD International Conf Management of Data (SIGMOD) 2000
10 R Agrawal Imielinskit SwamiA ldquoMining Association Rules between Sets of Items in Large Databasesrdquo In Proc International Conf of the ACM SIGMOD Conference Washington DC USA 1993
Apriori Algorithm
bull Apriori principle If an itemset is frequent then all of its subsets must also be frequent
bull Apriori principle holds due to the following property of the support measure Support of an itemset never exceeds the support of its subsets This is known as the anti-monotone property of support
Apriori Algorithm (contd)
The basic steps to mine the frequent elements are as follows
bull Generate and test In this first find the 1-itemset frequent elements L1 by scanning the database and removing all those elements from C which cannot satisfy the minimum support criteria
bull Join step To attain the next level elements Ck join the previous frequent elements by self join ie Lk-1Lk-1 known as Cartesian product of Lk-1 ie This step generates new candidate k-itemsets based on joining Lk-1 with itself which is found in the previous iteration Let Ck denote candidate k-itemset and Lk be the frequent k-itemset
bull Prune step This step eliminates some of the candidate k-itemsets using the Apriori property A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk (ie all candidates having a count no less than the minimum support count are frequent by definition and therefore belong to Lk) Step 2 and 3 is repeated until no new candidate set is generated
TID Items
100 1 3 4
200 2 3 5
300 1 2 3 5
400 2 5
TID Set-of- itemsets
100 134
200 235
300 1235
400 25
Itemset Support
1 2
2 3
3 3
5 3
itemset
1 2
1 3
1 5
2 3
2 5
3 5
TID Set-of- itemsets
100 1 3
200 2 32 5 3 5
300 1 21 31 52 3 2 5 3 5
400 2 5
Itemset Support
1 3 2
2 3 3
2 5 3
3 5 2
itemset
2 3 5
TID Set-of- itemsets
200 2 3 5
300 2 3 5
Itemset Support
2 3 5 2
Database C^1
L2
C2 C^2
C^3
L1
L3C3
Apriori Algorithm (contd) Bottlenecks of Aprioribull It is no doubt that Apriori algorithm successfully finds the frequent
elements from the database But as the dimensionality of the database increase with the number of items then
bull More search space is needed and IO cost will increase
bull Number of database scan is increased thus candidate generation will increase results in increase in computational cost
FP-Growth Algorithm
FP-Growth allows frequent itemset discovery without candidate itemset generation Two step approach
Step 1 Build a compact data structure called the FP-tree Built using 2 passes over the data-set
Step 2 Extracts frequent itemsets directly from the FP-tree
FP-Growth Algorithm (contd)Step 1 FP-Tree Construction FP-Tree is constructed using 2 passes
over the data-setPass 1 Scan data and find support for each
item Discard infrequent items Sort frequent items in decreasing
order based on their supportbull Minimum support count = 2bull Scan database to find frequent 1-itemsetsbull s(A) = 8 s(B) = 7 s(C) = 5 s(D) = 5 s(E) = 3bull 1048698 Item order (decreasing support) A B C D E
Use this order when building the FP-Tree so common prefixes can be shared
FP-Growth Algorithm (contd) Step 1 FP-Tree ConstructionPass 2Nodes correspond to items and have a counter1 FP-Growth reads 1 transaction at a time and maps it to a path
2 Fixed order is used so paths can overlap when transactions share items (when they have the same prefix ) In this case counters are incremented
3 Pointers are maintained between nodes containing the same item creating singly linked lists (dotted lines) The more paths that overlap the higher the compression FP-tree
may fit in memory
4 Frequent itemsets extracted from the FP-Tree
FP-Growth Algorithm (contd) Step 1 FP-Tree Construction (contd)
FP-Growth Algorithm (contd)Complete FP-Tree for Sample Transactions
FP-Growth Algorithm (contd) Step 2 Frequent Itemset Generation FP-Growth extracts frequent itemsets from the FP-tree
Bottom-up algorithm - from the leaves towards the root
Divide and conquer first look for frequent itemsets ending in e then de etc then d then cd etc
First extract prefix path sub-trees ending in an item(set) (using the linked lists)
FP-Growth Algorithm (contd) Prefix path sub-trees (Example)
FP-Growth Algorithm (contd) Example
Let minSup = 2 and extract all frequent itemsets containing E Obtain the prefix path sub-tree for E
Check if E is a frequent item by adding the counts along the linked list (dotted line) If so extract it
Yes count =3 so E is extracted as a frequent itemset
As E is frequent find frequent itemsets ending in e ie DE CE BE and AE
E nodes can now be removed
FP-Growth Algorithm (contd) Conditional FP-Tree
The FP-Tree that would be built if we only consider transactions containing a particular itemset (and then removing that itemset from all transactions)
I Example FP-Tree conditional on e
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Obtain T(DE) from T(E) 4 Use the conditional FP-tree for e to find frequent itemsets ending in DE CE
and AE Note that BE is not considered as B is not in the conditional FP-tree for E
bull Support count of DE = 2 (sum of counts of all Drsquos)bull DE is frequent need to solve CDE BDE ADE if they exist
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd)Solving CDE BDE ADEbull Sub-trees for both CDE and BDE are emptybull no prefix paths ending with C or Bbull Working on ADE
ADE (support count = 2) is frequentsolving next sub problem CE
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix CE
CE is frequent (support count = 2)bull Work on next sub problems BE (no support) AE
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix AE
AE is frequent (support count = 2)Done with AEWork on next sub problem suffix D
FP-Growth Algorithm (contd) Found Frequent Itemsets with Suffix Ebull E DE ADE CE AE discovered in this order
FP-Growth Algorithm (contd) Example (contd)
Frequent itemsets found (ordered by suffix and order in which the are found)
Comparative Result
Conclusion
It is found that
bull FP-tree a novel data structure storing compressed crucial information about frequent patterns compact yet complete for frequent pattern mining
bull FP-growth an efficient mining method of frequent patterns in large Database using a highly compact FP-tree divide-and-conquer method in nature
bull Both Apriori and FP-Growth are aiming to find out complete set of patterns but FP-Growth is more efficient than Apriori in respect to long patterns
References
1 Liwu ZOU Guangwei REN ldquoThe data mining algorithm analysis for personalized servicerdquo Fourth International Conference on Multimedia Information Networking and Security 2012
2 Jun TAN Yingyong BU and Bo YANG ldquoAn Efficient Frequent Pattern Mining Algorithmrdquo Sixth International Conference on Fuzzy Systems and Knowledge Discovery 2009
3 Wei Zhang Hongzhi Liao Na Zhao ldquoResearch on the FP Growth Algorithm about Association Rule Miningrdquo International Seminar on Business and Information Management 2008
4 SP Latha DR NRamaraj ldquoAlgorithm for Efficient Data Miningrdquo In Proc
Intrsquo Conf on IEEE International Computational Intelligence and Multimedia Applications 2007
References (contd)
5 Dongme Sun Shaohua Teng Wei Zhang Haibin Zhu ldquoAn Algorithm to Improve the Effectiveness of Apriorirdquo In Proc Intrsquol Conf on 6th IEEE International Conf on Cognitive Informatics (ICCI07) 2007
6 Daniel Hunyadi ldquoPerformance comparison of Apriori and FP-Growth algorithms in generating association rulesrdquo Proceedings of the European Computing Conference 2006
7 By Jiawei Han Micheline Kamber ldquoData mining Concepts and Techniquesrdquo Morgan Kaufmann Publishers 2006
8 Tan P-N Steinbach M and Kumar V ldquoIntroduction to data miningrdquo Addison Wesley Publishers 2006
References (contd)
9 HanJ PeiJ and Yin Y ldquoMining frequent patterns without candidate generationrdquo In Proc ACM-SIGMOD International Conf Management of Data (SIGMOD) 2000
10 R Agrawal Imielinskit SwamiA ldquoMining Association Rules between Sets of Items in Large Databasesrdquo In Proc International Conf of the ACM SIGMOD Conference Washington DC USA 1993
Apriori Algorithm (contd)
The basic steps to mine the frequent elements are as follows
bull Generate and test In this first find the 1-itemset frequent elements L1 by scanning the database and removing all those elements from C which cannot satisfy the minimum support criteria
bull Join step To attain the next level elements Ck join the previous frequent elements by self join ie Lk-1Lk-1 known as Cartesian product of Lk-1 ie This step generates new candidate k-itemsets based on joining Lk-1 with itself which is found in the previous iteration Let Ck denote candidate k-itemset and Lk be the frequent k-itemset
bull Prune step This step eliminates some of the candidate k-itemsets using the Apriori property A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk (ie all candidates having a count no less than the minimum support count are frequent by definition and therefore belong to Lk) Step 2 and 3 is repeated until no new candidate set is generated
TID Items
100 1 3 4
200 2 3 5
300 1 2 3 5
400 2 5
TID Set-of- itemsets
100 134
200 235
300 1235
400 25
Itemset Support
1 2
2 3
3 3
5 3
itemset
1 2
1 3
1 5
2 3
2 5
3 5
TID Set-of- itemsets
100 1 3
200 2 32 5 3 5
300 1 21 31 52 3 2 5 3 5
400 2 5
Itemset Support
1 3 2
2 3 3
2 5 3
3 5 2
itemset
2 3 5
TID Set-of- itemsets
200 2 3 5
300 2 3 5
Itemset Support
2 3 5 2
Database C^1
L2
C2 C^2
C^3
L1
L3C3
Apriori Algorithm (contd) Bottlenecks of Aprioribull It is no doubt that Apriori algorithm successfully finds the frequent
elements from the database But as the dimensionality of the database increase with the number of items then
bull More search space is needed and IO cost will increase
bull Number of database scan is increased thus candidate generation will increase results in increase in computational cost
FP-Growth Algorithm
FP-Growth allows frequent itemset discovery without candidate itemset generation Two step approach
Step 1 Build a compact data structure called the FP-tree Built using 2 passes over the data-set
Step 2 Extracts frequent itemsets directly from the FP-tree
FP-Growth Algorithm (contd)Step 1 FP-Tree Construction FP-Tree is constructed using 2 passes
over the data-setPass 1 Scan data and find support for each
item Discard infrequent items Sort frequent items in decreasing
order based on their supportbull Minimum support count = 2bull Scan database to find frequent 1-itemsetsbull s(A) = 8 s(B) = 7 s(C) = 5 s(D) = 5 s(E) = 3bull 1048698 Item order (decreasing support) A B C D E
Use this order when building the FP-Tree so common prefixes can be shared
FP-Growth Algorithm (contd) Step 1 FP-Tree ConstructionPass 2Nodes correspond to items and have a counter1 FP-Growth reads 1 transaction at a time and maps it to a path
2 Fixed order is used so paths can overlap when transactions share items (when they have the same prefix ) In this case counters are incremented
3 Pointers are maintained between nodes containing the same item creating singly linked lists (dotted lines) The more paths that overlap the higher the compression FP-tree
may fit in memory
4 Frequent itemsets extracted from the FP-Tree
FP-Growth Algorithm (contd) Step 1 FP-Tree Construction (contd)
FP-Growth Algorithm (contd)Complete FP-Tree for Sample Transactions
FP-Growth Algorithm (contd) Step 2 Frequent Itemset Generation FP-Growth extracts frequent itemsets from the FP-tree
Bottom-up algorithm - from the leaves towards the root
Divide and conquer first look for frequent itemsets ending in e then de etc then d then cd etc
First extract prefix path sub-trees ending in an item(set) (using the linked lists)
FP-Growth Algorithm (contd) Prefix path sub-trees (Example)
FP-Growth Algorithm (contd) Example
Let minSup = 2 and extract all frequent itemsets containing E Obtain the prefix path sub-tree for E
Check if E is a frequent item by adding the counts along the linked list (dotted line) If so extract it
Yes count =3 so E is extracted as a frequent itemset
As E is frequent find frequent itemsets ending in e ie DE CE BE and AE
E nodes can now be removed
FP-Growth Algorithm (contd) Conditional FP-Tree
The FP-Tree that would be built if we only consider transactions containing a particular itemset (and then removing that itemset from all transactions)
I Example FP-Tree conditional on e
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Obtain T(DE) from T(E) 4 Use the conditional FP-tree for e to find frequent itemsets ending in DE CE
and AE Note that BE is not considered as B is not in the conditional FP-tree for E
bull Support count of DE = 2 (sum of counts of all Drsquos)bull DE is frequent need to solve CDE BDE ADE if they exist
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd)Solving CDE BDE ADEbull Sub-trees for both CDE and BDE are emptybull no prefix paths ending with C or Bbull Working on ADE
ADE (support count = 2) is frequentsolving next sub problem CE
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix CE
CE is frequent (support count = 2)bull Work on next sub problems BE (no support) AE
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix AE
AE is frequent (support count = 2)Done with AEWork on next sub problem suffix D
FP-Growth Algorithm (contd) Found Frequent Itemsets with Suffix Ebull E DE ADE CE AE discovered in this order
FP-Growth Algorithm (contd) Example (contd)
Frequent itemsets found (ordered by suffix and order in which the are found)
Comparative Result
Conclusion
It is found that
bull FP-tree a novel data structure storing compressed crucial information about frequent patterns compact yet complete for frequent pattern mining
bull FP-growth an efficient mining method of frequent patterns in large Database using a highly compact FP-tree divide-and-conquer method in nature
bull Both Apriori and FP-Growth are aiming to find out complete set of patterns but FP-Growth is more efficient than Apriori in respect to long patterns
References
1 Liwu ZOU Guangwei REN ldquoThe data mining algorithm analysis for personalized servicerdquo Fourth International Conference on Multimedia Information Networking and Security 2012
2 Jun TAN Yingyong BU and Bo YANG ldquoAn Efficient Frequent Pattern Mining Algorithmrdquo Sixth International Conference on Fuzzy Systems and Knowledge Discovery 2009
3 Wei Zhang Hongzhi Liao Na Zhao ldquoResearch on the FP Growth Algorithm about Association Rule Miningrdquo International Seminar on Business and Information Management 2008
4 SP Latha DR NRamaraj ldquoAlgorithm for Efficient Data Miningrdquo In Proc
Intrsquo Conf on IEEE International Computational Intelligence and Multimedia Applications 2007
References (contd)
5 Dongme Sun Shaohua Teng Wei Zhang Haibin Zhu ldquoAn Algorithm to Improve the Effectiveness of Apriorirdquo In Proc Intrsquol Conf on 6th IEEE International Conf on Cognitive Informatics (ICCI07) 2007
6 Daniel Hunyadi ldquoPerformance comparison of Apriori and FP-Growth algorithms in generating association rulesrdquo Proceedings of the European Computing Conference 2006
7 By Jiawei Han Micheline Kamber ldquoData mining Concepts and Techniquesrdquo Morgan Kaufmann Publishers 2006
8 Tan P-N Steinbach M and Kumar V ldquoIntroduction to data miningrdquo Addison Wesley Publishers 2006
References (contd)
9 HanJ PeiJ and Yin Y ldquoMining frequent patterns without candidate generationrdquo In Proc ACM-SIGMOD International Conf Management of Data (SIGMOD) 2000
10 R Agrawal Imielinskit SwamiA ldquoMining Association Rules between Sets of Items in Large Databasesrdquo In Proc International Conf of the ACM SIGMOD Conference Washington DC USA 1993
TID Items
100 1 3 4
200 2 3 5
300 1 2 3 5
400 2 5
TID Set-of- itemsets
100 134
200 235
300 1235
400 25
Itemset Support
1 2
2 3
3 3
5 3
itemset
1 2
1 3
1 5
2 3
2 5
3 5
TID Set-of- itemsets
100 1 3
200 2 32 5 3 5
300 1 21 31 52 3 2 5 3 5
400 2 5
Itemset Support
1 3 2
2 3 3
2 5 3
3 5 2
itemset
2 3 5
TID Set-of- itemsets
200 2 3 5
300 2 3 5
Itemset Support
2 3 5 2
Database C^1
L2
C2 C^2
C^3
L1
L3C3
Apriori Algorithm (contd) Bottlenecks of Aprioribull It is no doubt that Apriori algorithm successfully finds the frequent
elements from the database But as the dimensionality of the database increase with the number of items then
bull More search space is needed and IO cost will increase
bull Number of database scan is increased thus candidate generation will increase results in increase in computational cost
FP-Growth Algorithm
FP-Growth allows frequent itemset discovery without candidate itemset generation. It is a two-step approach:
Step 1: Build a compact data structure called the FP-tree, using 2 passes over the data set.
Step 2: Extract frequent itemsets directly from the FP-tree.
FP-Growth Algorithm (contd) Step 1: FP-Tree Construction. The FP-tree is constructed using 2 passes over the data set.
Pass 1: Scan the data and find the support of each item. Discard infrequent items and sort the frequent items in decreasing order of support.
• Minimum support count = 2
• Scan the database to find the frequent 1-itemsets
• s(A) = 8, s(B) = 7, s(C) = 5, s(D) = 5, s(E) = 3
• Item order (decreasing support): A, B, C, D, E
Use this order when building the FP-tree, so common prefixes can be shared.
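Pass 1 can be sketched in a few lines. The slides give only the aggregated supports s(A)…s(E), not the underlying transactions, so this sketch reuses the small numeric database from the Apriori example; the supports differ, but the counting-and-ordering logic is the same:

```python
from collections import Counter

# Small transaction database reused from the Apriori example.
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
minsup = 2

# Pass 1: one scan to count the support of every item.
counts = Counter(item for t in transactions for item in t)

# Discard infrequent items and sort by decreasing support
# (ties broken by item id so the order is deterministic).
order = sorted((i for i, c in counts.items() if c >= minsup),
               key=lambda i: (-counts[i], i))

# Rewrite each transaction in this fixed order, so transactions that
# share a prefix will later share a path in the FP-tree.
ordered_db = [[i for i in order if i in t] for t in transactions]
# order == [2, 3, 5, 1]; item 4 (support 1) is discarded.
```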
FP-Growth Algorithm (contd) Step 1: FP-Tree Construction, Pass 2. Nodes correspond to items and carry a counter.
1. FP-Growth reads one transaction at a time and maps it to a path.
2. A fixed item order is used, so paths can overlap when transactions share items (i.e. have the same prefix); in that case the counters are incremented.
3. Pointers are maintained between nodes containing the same item, creating singly linked lists (the dotted lines). The more paths overlap, the higher the compression; the FP-tree may then fit in main memory.
4. Frequent itemsets are then extracted from the FP-tree.
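A minimal Pass-2 tree builder, written as a sketch: the `Node` class and `header` dictionary below are illustrative assumptions, not the original implementation, but they realize steps 1–3 (path insertion, counter increments, and the per-item linked lists):

```python
class Node:
    """One FP-tree node: an item, its counter, its children, and the link
    to the next node holding the same item (the dotted-line list)."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}   # item -> child Node
        self.next = None     # next node with the same item

def build_fp_tree(ordered_transactions):
    root = Node(None, None)
    header = {}              # item -> head of that item's linked list
    for t in ordered_transactions:
        node = root
        for item in t:       # items already in global frequency order
            child = node.children.get(item)
            if child is None:                  # path branches off here
                child = Node(item, node)
                node.children[item] = child
                child.next = header.get(item)  # prepend to item's list
                header[item] = child
            child.count += 1                   # shared prefix: bump counter
            node = child
    return root, header

# Transactions from the small example, pre-sorted in the order 2, 3, 5, 1:
root, header = build_fp_tree([[3, 1], [2, 3, 5], [2, 3, 5, 1], [2, 5]])
```

Summing the counters along `header[item]` gives that item's support, which is exactly the check FP-Growth performs before mining a suffix.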
FP-Growth Algorithm (contd) Step 1 FP-Tree Construction (contd)
FP-Growth Algorithm (contd)Complete FP-Tree for Sample Transactions
FP-Growth Algorithm (contd) Step 2: Frequent Itemset Generation. FP-Growth extracts frequent itemsets from the FP-tree.
• Bottom-up algorithm: it works from the leaves towards the root.
• Divide and conquer: first look for frequent itemsets ending in E, then DE, etc., then D, then CD, etc.
• First, extract the prefix-path sub-trees ending in an item(set), using the linked lists.
FP-Growth Algorithm (contd) Prefix path sub-trees (Example)
FP-Growth Algorithm (contd) Example
Let minSup = 2 and extract all frequent itemsets containing E. First obtain the prefix-path sub-tree for E.
Check whether E is a frequent item by adding the counts along its linked list (the dotted line); if so, extract it.
Yes: count = 3, so E is extracted as a frequent itemset.
As E is frequent, find the frequent itemsets ending in E, i.e. DE, CE, BE and AE.
The E nodes can now be removed.
FP-Growth Algorithm (contd) Conditional FP-Tree
The FP-tree that would be built if we considered only the transactions containing a particular itemset (and then removed that itemset from all transactions).
Example: the FP-tree conditional on E.
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Obtain T(DE) from T(E). Use the conditional FP-tree for E to find frequent itemsets ending in DE, CE and AE. Note that BE is not considered, as B is not in the conditional FP-tree for E.
• Support count of DE = 2 (the sum of the counts of all D nodes)
• DE is frequent; next solve CDE, BDE and ADE, if they exist.
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving CDE, BDE, ADE
• The sub-trees for both CDE and BDE are empty: there are no prefix paths ending with C or B.
• Working on ADE: ADE (support count = 2) is frequent. Next sub-problem: CE.
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix CE
CE is frequent (support count = 2).
• Work on the next sub-problems: BE (no support), then AE.
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix AE
AE is frequent (support count = 2). Done with AE; the next sub-problem is suffix D.
FP-Growth Algorithm (contd) Found Frequent Itemsets with Suffix E
• E, DE, ADE, CE and AE, discovered in this order.
FP-Growth Algorithm (contd) Example (contd)
Frequent itemsets found (ordered by suffix and by the order in which they are found).
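The whole divide-and-conquer recursion can be condensed into a compact sketch. For brevity it recurses over conditional pattern bases (prefix paths with counts) rather than explicit tree nodes; this is a simplification, not the paper's implementation. Run on the four-transaction database from the Apriori example, it finds the same nine frequent itemsets as the level-wise tables:

```python
from collections import Counter

def fp_growth(transactions, minsup):
    """Simplified FP-Growth: returns {frozenset(itemset): support}."""
    results = {}

    def mine(patterns, suffix):
        # patterns: the conditional pattern base, as (item-list, count) pairs.
        counts = Counter()
        for items, c in patterns:
            for i in items:
                counts[i] += c
        frequent = {i for i, c in counts.items() if c >= minsup}
        for item in frequent:
            new_suffix = suffix | {item}
            results[frozenset(new_suffix)] = counts[item]
            # Conditional pattern base for `item`: the prefix of every path
            # containing it, restricted to items still frequent here.
            cond = []
            for items, c in patterns:
                if item in items:
                    prefix = [i for i in items[:items.index(item)]
                              if i in frequent]
                    if prefix:
                        cond.append((prefix, c))
            mine(cond, new_suffix)

    # Initial pattern base: every transaction, in decreasing-support order.
    freq = Counter(i for t in transactions for i in t)
    base = [(sorted(t, key=lambda x: (-freq[x], x)), 1) for t in transactions]
    mine(base, frozenset())
    return results

found = fp_growth([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], minsup=2)
# found contains 9 itemsets, e.g. found[frozenset({2, 3, 5})] == 2.
```

Each recursive call touches only the (shrinking) pattern base for its suffix, which is why no candidate generation or repeated database scan is needed.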
Comparative Result
Conclusion
It is found that:
• FP-tree: a novel data structure that stores compressed, crucial information about frequent patterns; it is compact yet complete for frequent pattern mining.
• FP-Growth: an efficient method for mining frequent patterns in large databases, using a highly compact FP-tree and a divide-and-conquer approach.
• Both Apriori and FP-Growth aim to find the complete set of patterns, but FP-Growth is more efficient than Apriori with respect to long patterns.
References
1. Liwu Zou, Guangwei Ren, "The data mining algorithm analysis for personalized service", Fourth International Conference on Multimedia Information Networking and Security, 2012.
2. Jun Tan, Yingyong Bu, Bo Yang, "An Efficient Frequent Pattern Mining Algorithm", Sixth International Conference on Fuzzy Systems and Knowledge Discovery, 2009.
3. Wei Zhang, Hongzhi Liao, Na Zhao, "Research on the FP Growth Algorithm about Association Rule Mining", International Seminar on Business and Information Management, 2008.
4. S.P. Latha, N. Ramaraj, "Algorithm for Efficient Data Mining", IEEE International Conference on Computational Intelligence and Multimedia Applications, 2007.
5. Dongme Sun, Shaohua Teng, Wei Zhang, Haibin Zhu, "An Algorithm to Improve the Effectiveness of Apriori", 6th IEEE International Conference on Cognitive Informatics (ICCI'07), 2007.
6. Daniel Hunyadi, "Performance comparison of Apriori and FP-Growth algorithms in generating association rules", Proceedings of the European Computing Conference, 2006.
7. Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2006.
8. P.-N. Tan, M. Steinbach, V. Kumar, "Introduction to Data Mining", Addison Wesley, 2006.
9. J. Han, J. Pei, Y. Yin, "Mining frequent patterns without candidate generation", Proc. ACM SIGMOD International Conference on Management of Data, 2000.
10. R. Agrawal, T. Imielinski, A. Swami, "Mining Association Rules between Sets of Items in Large Databases", Proc. ACM SIGMOD Conference, Washington, DC, USA, 1993.
Apriori Algorithm (contd) Bottlenecks of Aprioribull It is no doubt that Apriori algorithm successfully finds the frequent
elements from the database But as the dimensionality of the database increase with the number of items then
bull More search space is needed and IO cost will increase
bull Number of database scan is increased thus candidate generation will increase results in increase in computational cost
FP-Growth Algorithm
FP-Growth allows frequent itemset discovery without candidate itemset generation Two step approach
Step 1 Build a compact data structure called the FP-tree Built using 2 passes over the data-set
Step 2 Extracts frequent itemsets directly from the FP-tree
FP-Growth Algorithm (contd)Step 1 FP-Tree Construction FP-Tree is constructed using 2 passes
over the data-setPass 1 Scan data and find support for each
item Discard infrequent items Sort frequent items in decreasing
order based on their supportbull Minimum support count = 2bull Scan database to find frequent 1-itemsetsbull s(A) = 8 s(B) = 7 s(C) = 5 s(D) = 5 s(E) = 3bull 1048698 Item order (decreasing support) A B C D E
Use this order when building the FP-Tree so common prefixes can be shared
FP-Growth Algorithm (contd) Step 1 FP-Tree ConstructionPass 2Nodes correspond to items and have a counter1 FP-Growth reads 1 transaction at a time and maps it to a path
2 Fixed order is used so paths can overlap when transactions share items (when they have the same prefix ) In this case counters are incremented
3 Pointers are maintained between nodes containing the same item creating singly linked lists (dotted lines) The more paths that overlap the higher the compression FP-tree
may fit in memory
4 Frequent itemsets extracted from the FP-Tree
FP-Growth Algorithm (contd) Step 1 FP-Tree Construction (contd)
FP-Growth Algorithm (contd)Complete FP-Tree for Sample Transactions
FP-Growth Algorithm (contd) Step 2 Frequent Itemset Generation FP-Growth extracts frequent itemsets from the FP-tree
Bottom-up algorithm - from the leaves towards the root
Divide and conquer first look for frequent itemsets ending in e then de etc then d then cd etc
First extract prefix path sub-trees ending in an item(set) (using the linked lists)
FP-Growth Algorithm (contd) Prefix path sub-trees (Example)
FP-Growth Algorithm (contd) Example
Let minSup = 2 and extract all frequent itemsets containing E Obtain the prefix path sub-tree for E
Check if E is a frequent item by adding the counts along the linked list (dotted line) If so extract it
Yes count =3 so E is extracted as a frequent itemset
As E is frequent find frequent itemsets ending in e ie DE CE BE and AE
E nodes can now be removed
FP-Growth Algorithm (contd) Conditional FP-Tree
The FP-Tree that would be built if we only consider transactions containing a particular itemset (and then removing that itemset from all transactions)
I Example FP-Tree conditional on e
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Obtain T(DE) from T(E) 4 Use the conditional FP-tree for e to find frequent itemsets ending in DE CE
and AE Note that BE is not considered as B is not in the conditional FP-tree for E
bull Support count of DE = 2 (sum of counts of all Drsquos)bull DE is frequent need to solve CDE BDE ADE if they exist
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd)Solving CDE BDE ADEbull Sub-trees for both CDE and BDE are emptybull no prefix paths ending with C or Bbull Working on ADE
ADE (support count = 2) is frequentsolving next sub problem CE
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix CE
CE is frequent (support count = 2)bull Work on next sub problems BE (no support) AE
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix AE
AE is frequent (support count = 2)Done with AEWork on next sub problem suffix D
FP-Growth Algorithm (contd) Found Frequent Itemsets with Suffix Ebull E DE ADE CE AE discovered in this order
FP-Growth Algorithm (contd) Example (contd)
Frequent itemsets found (ordered by suffix and order in which the are found)
Comparative Result
Conclusion
It is found that
bull FP-tree a novel data structure storing compressed crucial information about frequent patterns compact yet complete for frequent pattern mining
bull FP-growth an efficient mining method of frequent patterns in large Database using a highly compact FP-tree divide-and-conquer method in nature
bull Both Apriori and FP-Growth are aiming to find out complete set of patterns but FP-Growth is more efficient than Apriori in respect to long patterns
References
1 Liwu ZOU Guangwei REN ldquoThe data mining algorithm analysis for personalized servicerdquo Fourth International Conference on Multimedia Information Networking and Security 2012
2 Jun TAN Yingyong BU and Bo YANG ldquoAn Efficient Frequent Pattern Mining Algorithmrdquo Sixth International Conference on Fuzzy Systems and Knowledge Discovery 2009
3 Wei Zhang Hongzhi Liao Na Zhao ldquoResearch on the FP Growth Algorithm about Association Rule Miningrdquo International Seminar on Business and Information Management 2008
4 SP Latha DR NRamaraj ldquoAlgorithm for Efficient Data Miningrdquo In Proc
Intrsquo Conf on IEEE International Computational Intelligence and Multimedia Applications 2007
References (contd)
5 Dongme Sun Shaohua Teng Wei Zhang Haibin Zhu ldquoAn Algorithm to Improve the Effectiveness of Apriorirdquo In Proc Intrsquol Conf on 6th IEEE International Conf on Cognitive Informatics (ICCI07) 2007
6 Daniel Hunyadi ldquoPerformance comparison of Apriori and FP-Growth algorithms in generating association rulesrdquo Proceedings of the European Computing Conference 2006
7 By Jiawei Han Micheline Kamber ldquoData mining Concepts and Techniquesrdquo Morgan Kaufmann Publishers 2006
8 Tan P-N Steinbach M and Kumar V ldquoIntroduction to data miningrdquo Addison Wesley Publishers 2006
References (contd)
9 HanJ PeiJ and Yin Y ldquoMining frequent patterns without candidate generationrdquo In Proc ACM-SIGMOD International Conf Management of Data (SIGMOD) 2000
10 R Agrawal Imielinskit SwamiA ldquoMining Association Rules between Sets of Items in Large Databasesrdquo In Proc International Conf of the ACM SIGMOD Conference Washington DC USA 1993
FP-Growth Algorithm
FP-Growth allows frequent itemset discovery without candidate itemset generation Two step approach
Step 1 Build a compact data structure called the FP-tree Built using 2 passes over the data-set
Step 2 Extracts frequent itemsets directly from the FP-tree
FP-Growth Algorithm (contd)Step 1 FP-Tree Construction FP-Tree is constructed using 2 passes
over the data-setPass 1 Scan data and find support for each
item Discard infrequent items Sort frequent items in decreasing
order based on their supportbull Minimum support count = 2bull Scan database to find frequent 1-itemsetsbull s(A) = 8 s(B) = 7 s(C) = 5 s(D) = 5 s(E) = 3bull 1048698 Item order (decreasing support) A B C D E
Use this order when building the FP-Tree so common prefixes can be shared
FP-Growth Algorithm (contd) Step 1 FP-Tree ConstructionPass 2Nodes correspond to items and have a counter1 FP-Growth reads 1 transaction at a time and maps it to a path
2 Fixed order is used so paths can overlap when transactions share items (when they have the same prefix ) In this case counters are incremented
3 Pointers are maintained between nodes containing the same item creating singly linked lists (dotted lines) The more paths that overlap the higher the compression FP-tree
may fit in memory
4 Frequent itemsets extracted from the FP-Tree
FP-Growth Algorithm (contd) Step 1 FP-Tree Construction (contd)
FP-Growth Algorithm (contd)Complete FP-Tree for Sample Transactions
FP-Growth Algorithm (contd) Step 2 Frequent Itemset Generation FP-Growth extracts frequent itemsets from the FP-tree
Bottom-up algorithm - from the leaves towards the root
Divide and conquer first look for frequent itemsets ending in e then de etc then d then cd etc
First extract prefix path sub-trees ending in an item(set) (using the linked lists)
FP-Growth Algorithm (contd) Prefix path sub-trees (Example)
FP-Growth Algorithm (contd) Example
Let minSup = 2 and extract all frequent itemsets containing E Obtain the prefix path sub-tree for E
Check if E is a frequent item by adding the counts along the linked list (dotted line) If so extract it
Yes count =3 so E is extracted as a frequent itemset
As E is frequent find frequent itemsets ending in e ie DE CE BE and AE
E nodes can now be removed
FP-Growth Algorithm (contd) Conditional FP-Tree
The FP-Tree that would be built if we only consider transactions containing a particular itemset (and then removing that itemset from all transactions)
I Example FP-Tree conditional on e
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Obtain T(DE) from T(E) 4 Use the conditional FP-tree for e to find frequent itemsets ending in DE CE
and AE Note that BE is not considered as B is not in the conditional FP-tree for E
bull Support count of DE = 2 (sum of counts of all Drsquos)bull DE is frequent need to solve CDE BDE ADE if they exist
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd)Solving CDE BDE ADEbull Sub-trees for both CDE and BDE are emptybull no prefix paths ending with C or Bbull Working on ADE
ADE (support count = 2) is frequentsolving next sub problem CE
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix CE
CE is frequent (support count = 2)bull Work on next sub problems BE (no support) AE
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix AE
AE is frequent (support count = 2)Done with AEWork on next sub problem suffix D
FP-Growth Algorithm (contd) Found Frequent Itemsets with Suffix Ebull E DE ADE CE AE discovered in this order
FP-Growth Algorithm (contd) Example (contd)
Frequent itemsets found (ordered by suffix and order in which the are found)
Comparative Result
Conclusion
It is found that
bull FP-tree a novel data structure storing compressed crucial information about frequent patterns compact yet complete for frequent pattern mining
bull FP-growth an efficient mining method of frequent patterns in large Database using a highly compact FP-tree divide-and-conquer method in nature
bull Both Apriori and FP-Growth are aiming to find out complete set of patterns but FP-Growth is more efficient than Apriori in respect to long patterns
References
1 Liwu ZOU Guangwei REN ldquoThe data mining algorithm analysis for personalized servicerdquo Fourth International Conference on Multimedia Information Networking and Security 2012
2 Jun TAN Yingyong BU and Bo YANG ldquoAn Efficient Frequent Pattern Mining Algorithmrdquo Sixth International Conference on Fuzzy Systems and Knowledge Discovery 2009
3 Wei Zhang Hongzhi Liao Na Zhao ldquoResearch on the FP Growth Algorithm about Association Rule Miningrdquo International Seminar on Business and Information Management 2008
4 SP Latha DR NRamaraj ldquoAlgorithm for Efficient Data Miningrdquo In Proc
Intrsquo Conf on IEEE International Computational Intelligence and Multimedia Applications 2007
References (contd)
5 Dongme Sun Shaohua Teng Wei Zhang Haibin Zhu ldquoAn Algorithm to Improve the Effectiveness of Apriorirdquo In Proc Intrsquol Conf on 6th IEEE International Conf on Cognitive Informatics (ICCI07) 2007
6 Daniel Hunyadi ldquoPerformance comparison of Apriori and FP-Growth algorithms in generating association rulesrdquo Proceedings of the European Computing Conference 2006
7 By Jiawei Han Micheline Kamber ldquoData mining Concepts and Techniquesrdquo Morgan Kaufmann Publishers 2006
8 Tan P-N Steinbach M and Kumar V ldquoIntroduction to data miningrdquo Addison Wesley Publishers 2006
References (contd)
9 HanJ PeiJ and Yin Y ldquoMining frequent patterns without candidate generationrdquo In Proc ACM-SIGMOD International Conf Management of Data (SIGMOD) 2000
10 R Agrawal Imielinskit SwamiA ldquoMining Association Rules between Sets of Items in Large Databasesrdquo In Proc International Conf of the ACM SIGMOD Conference Washington DC USA 1993
FP-Growth Algorithm (contd)Step 1 FP-Tree Construction FP-Tree is constructed using 2 passes
over the data-setPass 1 Scan data and find support for each
item Discard infrequent items Sort frequent items in decreasing
order based on their supportbull Minimum support count = 2bull Scan database to find frequent 1-itemsetsbull s(A) = 8 s(B) = 7 s(C) = 5 s(D) = 5 s(E) = 3bull 1048698 Item order (decreasing support) A B C D E
Use this order when building the FP-Tree so common prefixes can be shared
FP-Growth Algorithm (contd) Step 1 FP-Tree ConstructionPass 2Nodes correspond to items and have a counter1 FP-Growth reads 1 transaction at a time and maps it to a path
2 Fixed order is used so paths can overlap when transactions share items (when they have the same prefix ) In this case counters are incremented
3 Pointers are maintained between nodes containing the same item creating singly linked lists (dotted lines) The more paths that overlap the higher the compression FP-tree
may fit in memory
4 Frequent itemsets extracted from the FP-Tree
FP-Growth Algorithm (contd) Step 1 FP-Tree Construction (contd)
FP-Growth Algorithm (contd)Complete FP-Tree for Sample Transactions
FP-Growth Algorithm (contd) Step 2 Frequent Itemset Generation FP-Growth extracts frequent itemsets from the FP-tree
Bottom-up algorithm - from the leaves towards the root
Divide and conquer first look for frequent itemsets ending in e then de etc then d then cd etc
First extract prefix path sub-trees ending in an item(set) (using the linked lists)
FP-Growth Algorithm (contd) Prefix path sub-trees (Example)
FP-Growth Algorithm (contd) Example
Let minSup = 2 and extract all frequent itemsets containing E Obtain the prefix path sub-tree for E
Check if E is a frequent item by adding the counts along the linked list (dotted line) If so extract it
Yes count =3 so E is extracted as a frequent itemset
As E is frequent find frequent itemsets ending in e ie DE CE BE and AE
E nodes can now be removed
FP-Growth Algorithm (contd) Conditional FP-Tree
The FP-Tree that would be built if we only consider transactions containing a particular itemset (and then removing that itemset from all transactions)
I Example FP-Tree conditional on e
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Obtain T(DE) from T(E) 4 Use the conditional FP-tree for e to find frequent itemsets ending in DE CE
and AE Note that BE is not considered as B is not in the conditional FP-tree for E
bull Support count of DE = 2 (sum of counts of all Drsquos)bull DE is frequent need to solve CDE BDE ADE if they exist
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd)Solving CDE BDE ADEbull Sub-trees for both CDE and BDE are emptybull no prefix paths ending with C or Bbull Working on ADE
ADE (support count = 2) is frequentsolving next sub problem CE
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix CE
CE is frequent (support count = 2)bull Work on next sub problems BE (no support) AE
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix AE
AE is frequent (support count = 2)Done with AEWork on next sub problem suffix D
FP-Growth Algorithm (contd) Found Frequent Itemsets with Suffix Ebull E DE ADE CE AE discovered in this order
FP-Growth Algorithm (contd) Example (contd)
Frequent itemsets found (ordered by suffix and order in which the are found)
Comparative Result
Conclusion
It is found that
bull FP-tree a novel data structure storing compressed crucial information about frequent patterns compact yet complete for frequent pattern mining
bull FP-growth an efficient mining method of frequent patterns in large Database using a highly compact FP-tree divide-and-conquer method in nature
bull Both Apriori and FP-Growth are aiming to find out complete set of patterns but FP-Growth is more efficient than Apriori in respect to long patterns
References
1 Liwu ZOU Guangwei REN ldquoThe data mining algorithm analysis for personalized servicerdquo Fourth International Conference on Multimedia Information Networking and Security 2012
2 Jun TAN Yingyong BU and Bo YANG ldquoAn Efficient Frequent Pattern Mining Algorithmrdquo Sixth International Conference on Fuzzy Systems and Knowledge Discovery 2009
3 Wei Zhang Hongzhi Liao Na Zhao ldquoResearch on the FP Growth Algorithm about Association Rule Miningrdquo International Seminar on Business and Information Management 2008
4 SP Latha DR NRamaraj ldquoAlgorithm for Efficient Data Miningrdquo In Proc
Intrsquo Conf on IEEE International Computational Intelligence and Multimedia Applications 2007
References (contd)
5 Dongme Sun Shaohua Teng Wei Zhang Haibin Zhu ldquoAn Algorithm to Improve the Effectiveness of Apriorirdquo In Proc Intrsquol Conf on 6th IEEE International Conf on Cognitive Informatics (ICCI07) 2007
6 Daniel Hunyadi ldquoPerformance comparison of Apriori and FP-Growth algorithms in generating association rulesrdquo Proceedings of the European Computing Conference 2006
7 By Jiawei Han Micheline Kamber ldquoData mining Concepts and Techniquesrdquo Morgan Kaufmann Publishers 2006
8 Tan P-N Steinbach M and Kumar V ldquoIntroduction to data miningrdquo Addison Wesley Publishers 2006
References (contd)
9 HanJ PeiJ and Yin Y ldquoMining frequent patterns without candidate generationrdquo In Proc ACM-SIGMOD International Conf Management of Data (SIGMOD) 2000
10 R Agrawal Imielinskit SwamiA ldquoMining Association Rules between Sets of Items in Large Databasesrdquo In Proc International Conf of the ACM SIGMOD Conference Washington DC USA 1993
FP-Growth Algorithm (contd) Step 1 FP-Tree ConstructionPass 2Nodes correspond to items and have a counter1 FP-Growth reads 1 transaction at a time and maps it to a path
2 Fixed order is used so paths can overlap when transactions share items (when they have the same prefix ) In this case counters are incremented
3 Pointers are maintained between nodes containing the same item creating singly linked lists (dotted lines) The more paths that overlap the higher the compression FP-tree
may fit in memory
4 Frequent itemsets extracted from the FP-Tree
FP-Growth Algorithm (contd) Step 1 FP-Tree Construction (contd)
FP-Growth Algorithm (contd)Complete FP-Tree for Sample Transactions
FP-Growth Algorithm (contd) Step 2 Frequent Itemset Generation FP-Growth extracts frequent itemsets from the FP-tree
Bottom-up algorithm - from the leaves towards the root
Divide and conquer first look for frequent itemsets ending in e then de etc then d then cd etc
First extract prefix path sub-trees ending in an item(set) (using the linked lists)
FP-Growth Algorithm (contd) Prefix path sub-trees (Example)
FP-Growth Algorithm (contd) Example
Let minSup = 2 and extract all frequent itemsets containing E Obtain the prefix path sub-tree for E
Check if E is a frequent item by adding the counts along the linked list (dotted line) If so extract it
Yes count =3 so E is extracted as a frequent itemset
As E is frequent find frequent itemsets ending in e ie DE CE BE and AE
E nodes can now be removed
FP-Growth Algorithm (contd) Conditional FP-Tree
The FP-Tree that would be built if we only consider transactions containing a particular itemset (and then removing that itemset from all transactions)
I Example FP-Tree conditional on e
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Obtain T(DE) from T(E) 4 Use the conditional FP-tree for e to find frequent itemsets ending in DE CE
and AE Note that BE is not considered as B is not in the conditional FP-tree for E
bull Support count of DE = 2 (sum of counts of all Drsquos)bull DE is frequent need to solve CDE BDE ADE if they exist
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd)Solving CDE BDE ADEbull Sub-trees for both CDE and BDE are emptybull no prefix paths ending with C or Bbull Working on ADE
ADE (support count = 2) is frequentsolving next sub problem CE
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix CE
CE is frequent (support count = 2)bull Work on next sub problems BE (no support) AE
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix AE
AE is frequent (support count = 2)Done with AEWork on next sub problem suffix D
FP-Growth Algorithm (contd) Found Frequent Itemsets with Suffix Ebull E DE ADE CE AE discovered in this order
FP-Growth Algorithm (contd) Example (contd)
Frequent itemsets found (ordered by suffix and order in which the are found)
Comparative Result
Conclusion
It is found that
bull FP-tree a novel data structure storing compressed crucial information about frequent patterns compact yet complete for frequent pattern mining
bull FP-growth an efficient mining method of frequent patterns in large Database using a highly compact FP-tree divide-and-conquer method in nature
bull Both Apriori and FP-Growth are aiming to find out complete set of patterns but FP-Growth is more efficient than Apriori in respect to long patterns
References
1 Liwu ZOU Guangwei REN ldquoThe data mining algorithm analysis for personalized servicerdquo Fourth International Conference on Multimedia Information Networking and Security 2012
2 Jun TAN Yingyong BU and Bo YANG ldquoAn Efficient Frequent Pattern Mining Algorithmrdquo Sixth International Conference on Fuzzy Systems and Knowledge Discovery 2009
3 Wei Zhang Hongzhi Liao Na Zhao ldquoResearch on the FP Growth Algorithm about Association Rule Miningrdquo International Seminar on Business and Information Management 2008
4 SP Latha DR NRamaraj ldquoAlgorithm for Efficient Data Miningrdquo In Proc
Intrsquo Conf on IEEE International Computational Intelligence and Multimedia Applications 2007
References (contd)
5 Dongme Sun Shaohua Teng Wei Zhang Haibin Zhu ldquoAn Algorithm to Improve the Effectiveness of Apriorirdquo In Proc Intrsquol Conf on 6th IEEE International Conf on Cognitive Informatics (ICCI07) 2007
6 Daniel Hunyadi ldquoPerformance comparison of Apriori and FP-Growth algorithms in generating association rulesrdquo Proceedings of the European Computing Conference 2006
7 By Jiawei Han Micheline Kamber ldquoData mining Concepts and Techniquesrdquo Morgan Kaufmann Publishers 2006
8 Tan P-N Steinbach M and Kumar V ldquoIntroduction to data miningrdquo Addison Wesley Publishers 2006
References (contd)
9 HanJ PeiJ and Yin Y ldquoMining frequent patterns without candidate generationrdquo In Proc ACM-SIGMOD International Conf Management of Data (SIGMOD) 2000
10 R Agrawal Imielinskit SwamiA ldquoMining Association Rules between Sets of Items in Large Databasesrdquo In Proc International Conf of the ACM SIGMOD Conference Washington DC USA 1993
FP-Growth Algorithm (contd) Step 1 FP-Tree Construction (contd)
FP-Growth Algorithm (contd)Complete FP-Tree for Sample Transactions
FP-Growth Algorithm (contd) Step 2 Frequent Itemset Generation FP-Growth extracts frequent itemsets from the FP-tree
Bottom-up algorithm - from the leaves towards the root
Divide and conquer first look for frequent itemsets ending in e then de etc then d then cd etc
First extract prefix path sub-trees ending in an item(set) (using the linked lists)
FP-Growth Algorithm (contd) Prefix path sub-trees (Example)
FP-Growth Algorithm (contd) Example
Let minSup = 2 and extract all frequent itemsets containing E Obtain the prefix path sub-tree for E
Check if E is a frequent item by adding the counts along the linked list (dotted line) If so extract it
Yes count =3 so E is extracted as a frequent itemset
As E is frequent find frequent itemsets ending in e ie DE CE BE and AE
E nodes can now be removed
FP-Growth Algorithm (cont'd): Complete FP-Tree for Sample Transactions
FP-Growth Algorithm (cont'd): Step 2 – Frequent Itemset Generation
FP-Growth extracts frequent itemsets directly from the FP-tree.
• Bottom-up algorithm: works from the leaves towards the root.
• Divide and conquer: first look for frequent itemsets ending in E, then DE, etc.; then D, then CD, etc.
• First, extract the prefix-path sub-trees ending in an item(set), using the linked lists.
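As a concrete illustration of the tree and the header-table "linked lists" described above, here is a minimal Python sketch of two-pass FP-tree construction. The names `FPNode` and `build_fp_tree` are illustrative, and the ten-transaction dataset is an assumption, chosen only to be consistent with the support counts quoted on these slides:

```python
from collections import defaultdict

class FPNode:
    """One FP-tree node: item label, count, parent link, children by item."""
    def __init__(self, item=None, parent=None):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_sup):
    """Two-pass construction. Returns (root, header), where header maps
    each frequent item to the list of tree nodes labelled with it --
    the same role as the linked lists drawn on the slides."""
    freq = defaultdict(int)
    for t in transactions:                 # pass 1: global item counts
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    root, header = FPNode(), defaultdict(list)
    for t in transactions:                 # pass 2: insert sorted transactions
        node = root
        for item in sorted((i for i in t if i in freq),
                           key=lambda i: (-freq[i], i)):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

# Hypothetical ten-transaction sample, consistent with the slides' counts:
transactions = [list(t) for t in
                ("ab", "bcd", "acde", "ade", "abc", "abcd", "a", "abc", "abd", "bce")]
root, header = build_fp_tree(transactions, min_sup=2)
```

Summing the counts along `header[item]` is exactly the linked-list traversal the slides use to check that E is frequent (count = 3).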
FP-Growth Algorithm (cont'd): Prefix-Path Sub-Trees (Example)
FP-Growth Algorithm (cont'd): Example
Let minSup = 2 and extract all frequent itemsets containing E.
• Obtain the prefix-path sub-tree for E.
• Check whether E is a frequent item by adding the counts along the linked list (dotted line); if so, extract it.
• Yes, count = 3, so E is extracted as a frequent itemset.
• As E is frequent, find frequent itemsets ending in E, i.e. DE, CE, BE and AE.
• The E nodes can now be removed.
FP-Growth Algorithm (cont'd): Conditional FP-Tree
The FP-tree that would be built if we considered only the transactions containing a particular itemset (and then removed that itemset from every transaction).
Example: the FP-tree conditional on E.
FP-Growth Algorithm (cont'd): Current Position in Processing
FP-Growth Algorithm (cont'd): Obtain T(DE) from T(E)
Use the conditional FP-tree for E to find frequent itemsets ending in DE, CE and AE. Note that BE is not considered, as B is not in the conditional FP-tree for E.
• Support count of DE = 2 (the sum of the counts of all D nodes).
• DE is frequent; next, solve CDE, BDE and ADE, if they exist.
FP-Growth Algorithm (cont'd): Current Position in Processing
FP-Growth Algorithm (cont'd): Solving CDE, BDE, ADE
• The sub-trees for both CDE and BDE are empty: there are no prefix paths ending with C or B.
• Working on ADE: ADE (support count = 2) is frequent.
• Next sub-problem: CE.
FP-Growth Algorithm (cont'd): Current Position in Processing
FP-Growth Algorithm (cont'd): Solving for Suffix CE
• CE is frequent (support count = 2).
• Work on the next sub-problems: BE (no support), then AE.
FP-Growth Algorithm (cont'd): Current Position in Processing
FP-Growth Algorithm (cont'd): Solving for Suffix AE
• AE is frequent (support count = 2); done with AE.
• Work on the next sub-problem: suffix D.
FP-Growth Algorithm (cont'd): Found Frequent Itemsets with Suffix E
• E, DE, ADE, CE, AE, discovered in this order.
FP-Growth Algorithm (cont'd): Example (cont'd)
Frequent itemsets found (ordered by suffix and by the order in which they are found).
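The divide-and-conquer recursion walked through above can be sketched end to end. This is an illustrative sketch, not the slides' code: all names are assumptions, the sample transactions are a hypothetical dataset consistent with the counts above, and the conditional FP-tree for each suffix is obtained by rebuilding a tree from that suffix's conditional pattern base (prefix paths weighted by node counts):

```python
from collections import defaultdict

class FPNode:
    """One FP-tree node: item label, count, parent link, children by item."""
    def __init__(self, item=None, parent=None):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_tree(weighted_itemsets, min_sup):
    """Build an FP-tree from (itemset, count) pairs; returns (root, header),
    where header maps each frequent item to its list of tree nodes."""
    freq = defaultdict(int)
    for items, count in weighted_itemsets:
        for item in items:
            freq[item] += count
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    root, header = FPNode(), defaultdict(list)
    for items, count in weighted_itemsets:
        node = root
        for item in sorted((i for i in items if i in freq),
                           key=lambda i: (-freq[i], i)):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += count
    return root, header

def fp_growth(weighted_itemsets, min_sup, suffix=()):
    """Divide and conquer: for every frequent item in the current tree,
    emit suffix + item, then recurse on its conditional pattern base."""
    _, header = build_tree(weighted_itemsets, min_sup)
    results = {}
    for item, nodes in header.items():
        itemset = tuple(sorted(suffix + (item,)))
        results[itemset] = sum(n.count for n in nodes)   # support count
        base = []                                        # conditional pattern base
        for n in nodes:
            path, p = [], n.parent
            while p.item is not None:                    # walk up to the root
                path.append(p.item)
                p = p.parent
            if path:
                base.append((path, n.count))
        results.update(fp_growth(base, min_sup, itemset))
    return results

# Hypothetical ten-transaction sample, consistent with the slides' counts:
transactions = [(list(t), 1) for t in
                ("ab", "bcd", "acde", "ade", "abc", "abcd", "a", "abc", "abd", "bce")]
patterns = fp_growth(transactions, min_sup=2)
```

With minSup = 2 this reproduces the walk-through: `patterns[('e',)]` is 3, and DE, ADE, CE and AE each get support 2, while BE and CDE are pruned because B and C do not survive into the corresponding conditional trees.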
Comparative Result
Conclusion
It is found that:
• FP-tree: a novel data structure storing compressed, crucial information about frequent patterns; compact yet complete for frequent-pattern mining.
• FP-growth: an efficient method for mining frequent patterns in large databases; it operates on a highly compact FP-tree and is divide-and-conquer in nature.
• Both Apriori and FP-Growth aim to find the complete set of patterns, but FP-Growth is more efficient than Apriori with respect to long patterns.
References
1. Liwu Zou, Guangwei Ren, "The data mining algorithm analysis for personalized service", Fourth International Conference on Multimedia Information Networking and Security, 2012.
2. Jun Tan, Yingyong Bu and Bo Yang, "An Efficient Frequent Pattern Mining Algorithm", Sixth International Conference on Fuzzy Systems and Knowledge Discovery, 2009.
3. Wei Zhang, Hongzhi Liao, Na Zhao, "Research on the FP Growth Algorithm about Association Rule Mining", International Seminar on Business and Information Management, 2008.
4. S.P. Latha, Dr. N. Ramaraj, "Algorithm for Efficient Data Mining", in Proc. IEEE International Conference on Computational Intelligence and Multimedia Applications, 2007.
References (cont'd)
5. Dongme Sun, Shaohua Teng, Wei Zhang, Haibin Zhu, "An Algorithm to Improve the Effectiveness of Apriori", in Proc. 6th IEEE International Conference on Cognitive Informatics (ICCI'07), 2007.
6. Daniel Hunyadi, "Performance comparison of Apriori and FP-Growth algorithms in generating association rules", Proceedings of the European Computing Conference, 2006.
7. Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2006.
8. Tan, P.-N., Steinbach, M. and Kumar, V., "Introduction to Data Mining", Addison Wesley Publishers, 2006.
References (cont'd)
9. Han, J., Pei, J. and Yin, Y., "Mining frequent patterns without candidate generation", in Proc. ACM-SIGMOD International Conference on Management of Data (SIGMOD), 2000.
10. R. Agrawal, T. Imielinski, A. Swami, "Mining Association Rules between Sets of Items in Large Databases", in Proc. ACM SIGMOD Conference, Washington, DC, USA, 1993.
FP-Growth Algorithm (contd) Step 2 Frequent Itemset Generation FP-Growth extracts frequent itemsets from the FP-tree
Bottom-up algorithm - from the leaves towards the root
Divide and conquer first look for frequent itemsets ending in e then de etc then d then cd etc
First extract prefix path sub-trees ending in an item(set) (using the linked lists)
FP-Growth Algorithm (contd) Prefix path sub-trees (Example)
FP-Growth Algorithm (contd) Example
Let minSup = 2 and extract all frequent itemsets containing E Obtain the prefix path sub-tree for E
Check if E is a frequent item by adding the counts along the linked list (dotted line) If so extract it
Yes count =3 so E is extracted as a frequent itemset
As E is frequent find frequent itemsets ending in e ie DE CE BE and AE
E nodes can now be removed
FP-Growth Algorithm (contd) Conditional FP-Tree
The FP-Tree that would be built if we only consider transactions containing a particular itemset (and then removing that itemset from all transactions)
I Example FP-Tree conditional on e
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Obtain T(DE) from T(E) 4 Use the conditional FP-tree for e to find frequent itemsets ending in DE CE
and AE Note that BE is not considered as B is not in the conditional FP-tree for E
bull Support count of DE = 2 (sum of counts of all Drsquos)bull DE is frequent need to solve CDE BDE ADE if they exist
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd)Solving CDE BDE ADEbull Sub-trees for both CDE and BDE are emptybull no prefix paths ending with C or Bbull Working on ADE
ADE (support count = 2) is frequentsolving next sub problem CE
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix CE
CE is frequent (support count = 2)bull Work on next sub problems BE (no support) AE
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix AE
AE is frequent (support count = 2)Done with AEWork on next sub problem suffix D
FP-Growth Algorithm (contd) Found Frequent Itemsets with Suffix Ebull E DE ADE CE AE discovered in this order
FP-Growth Algorithm (contd) Example (contd)
Frequent itemsets found (ordered by suffix and order in which the are found)
Comparative Result
Conclusion
It is found that
bull FP-tree a novel data structure storing compressed crucial information about frequent patterns compact yet complete for frequent pattern mining
bull FP-growth an efficient mining method of frequent patterns in large Database using a highly compact FP-tree divide-and-conquer method in nature
bull Both Apriori and FP-Growth are aiming to find out complete set of patterns but FP-Growth is more efficient than Apriori in respect to long patterns
References
1 Liwu ZOU Guangwei REN ldquoThe data mining algorithm analysis for personalized servicerdquo Fourth International Conference on Multimedia Information Networking and Security 2012
2 Jun TAN Yingyong BU and Bo YANG ldquoAn Efficient Frequent Pattern Mining Algorithmrdquo Sixth International Conference on Fuzzy Systems and Knowledge Discovery 2009
3 Wei Zhang Hongzhi Liao Na Zhao ldquoResearch on the FP Growth Algorithm about Association Rule Miningrdquo International Seminar on Business and Information Management 2008
4 SP Latha DR NRamaraj ldquoAlgorithm for Efficient Data Miningrdquo In Proc
Intrsquo Conf on IEEE International Computational Intelligence and Multimedia Applications 2007
References (contd)
5 Dongme Sun Shaohua Teng Wei Zhang Haibin Zhu ldquoAn Algorithm to Improve the Effectiveness of Apriorirdquo In Proc Intrsquol Conf on 6th IEEE International Conf on Cognitive Informatics (ICCI07) 2007
6 Daniel Hunyadi ldquoPerformance comparison of Apriori and FP-Growth algorithms in generating association rulesrdquo Proceedings of the European Computing Conference 2006
7 By Jiawei Han Micheline Kamber ldquoData mining Concepts and Techniquesrdquo Morgan Kaufmann Publishers 2006
8 Tan P-N Steinbach M and Kumar V ldquoIntroduction to data miningrdquo Addison Wesley Publishers 2006
References (contd)
9 HanJ PeiJ and Yin Y ldquoMining frequent patterns without candidate generationrdquo In Proc ACM-SIGMOD International Conf Management of Data (SIGMOD) 2000
10 R Agrawal Imielinskit SwamiA ldquoMining Association Rules between Sets of Items in Large Databasesrdquo In Proc International Conf of the ACM SIGMOD Conference Washington DC USA 1993
FP-Growth Algorithm (contd) Prefix path sub-trees (Example)
FP-Growth Algorithm (contd) Example
Let minSup = 2 and extract all frequent itemsets containing E Obtain the prefix path sub-tree for E
Check if E is a frequent item by adding the counts along the linked list (dotted line) If so extract it
Yes count =3 so E is extracted as a frequent itemset
As E is frequent find frequent itemsets ending in e ie DE CE BE and AE
E nodes can now be removed
FP-Growth Algorithm (contd) Conditional FP-Tree
The FP-Tree that would be built if we only consider transactions containing a particular itemset (and then removing that itemset from all transactions)
I Example FP-Tree conditional on e
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Obtain T(DE) from T(E) 4 Use the conditional FP-tree for e to find frequent itemsets ending in DE CE
and AE Note that BE is not considered as B is not in the conditional FP-tree for E
bull Support count of DE = 2 (sum of counts of all Drsquos)bull DE is frequent need to solve CDE BDE ADE if they exist
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd)Solving CDE BDE ADEbull Sub-trees for both CDE and BDE are emptybull no prefix paths ending with C or Bbull Working on ADE
ADE (support count = 2) is frequentsolving next sub problem CE
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix CE
CE is frequent (support count = 2)bull Work on next sub problems BE (no support) AE
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix AE
AE is frequent (support count = 2)Done with AEWork on next sub problem suffix D
FP-Growth Algorithm (contd) Found Frequent Itemsets with Suffix Ebull E DE ADE CE AE discovered in this order
FP-Growth Algorithm (contd) Example (contd)
Frequent itemsets found (ordered by suffix and order in which the are found)
Comparative Result
Conclusion
It is found that
bull FP-tree a novel data structure storing compressed crucial information about frequent patterns compact yet complete for frequent pattern mining
bull FP-growth an efficient mining method of frequent patterns in large Database using a highly compact FP-tree divide-and-conquer method in nature
bull Both Apriori and FP-Growth are aiming to find out complete set of patterns but FP-Growth is more efficient than Apriori in respect to long patterns
References
1 Liwu ZOU Guangwei REN ldquoThe data mining algorithm analysis for personalized servicerdquo Fourth International Conference on Multimedia Information Networking and Security 2012
2 Jun TAN Yingyong BU and Bo YANG ldquoAn Efficient Frequent Pattern Mining Algorithmrdquo Sixth International Conference on Fuzzy Systems and Knowledge Discovery 2009
3 Wei Zhang Hongzhi Liao Na Zhao ldquoResearch on the FP Growth Algorithm about Association Rule Miningrdquo International Seminar on Business and Information Management 2008
4 SP Latha DR NRamaraj ldquoAlgorithm for Efficient Data Miningrdquo In Proc
Intrsquo Conf on IEEE International Computational Intelligence and Multimedia Applications 2007
References (contd)
5 Dongme Sun Shaohua Teng Wei Zhang Haibin Zhu ldquoAn Algorithm to Improve the Effectiveness of Apriorirdquo In Proc Intrsquol Conf on 6th IEEE International Conf on Cognitive Informatics (ICCI07) 2007
6 Daniel Hunyadi ldquoPerformance comparison of Apriori and FP-Growth algorithms in generating association rulesrdquo Proceedings of the European Computing Conference 2006
7 By Jiawei Han Micheline Kamber ldquoData mining Concepts and Techniquesrdquo Morgan Kaufmann Publishers 2006
8 Tan P-N Steinbach M and Kumar V ldquoIntroduction to data miningrdquo Addison Wesley Publishers 2006
References (contd)
9 HanJ PeiJ and Yin Y ldquoMining frequent patterns without candidate generationrdquo In Proc ACM-SIGMOD International Conf Management of Data (SIGMOD) 2000
10 R Agrawal Imielinskit SwamiA ldquoMining Association Rules between Sets of Items in Large Databasesrdquo In Proc International Conf of the ACM SIGMOD Conference Washington DC USA 1993
FP-Growth Algorithm (contd) Example
Let minSup = 2 and extract all frequent itemsets containing E Obtain the prefix path sub-tree for E
Check if E is a frequent item by adding the counts along the linked list (dotted line) If so extract it
Yes count =3 so E is extracted as a frequent itemset
As E is frequent find frequent itemsets ending in e ie DE CE BE and AE
E nodes can now be removed
FP-Growth Algorithm (contd) Conditional FP-Tree
The FP-Tree that would be built if we only consider transactions containing a particular itemset (and then removing that itemset from all transactions)
I Example FP-Tree conditional on e
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Obtain T(DE) from T(E) 4 Use the conditional FP-tree for e to find frequent itemsets ending in DE CE
and AE Note that BE is not considered as B is not in the conditional FP-tree for E
bull Support count of DE = 2 (sum of counts of all Drsquos)bull DE is frequent need to solve CDE BDE ADE if they exist
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd)Solving CDE BDE ADEbull Sub-trees for both CDE and BDE are emptybull no prefix paths ending with C or Bbull Working on ADE
ADE (support count = 2) is frequentsolving next sub problem CE
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix CE
CE is frequent (support count = 2)bull Work on next sub problems BE (no support) AE
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix AE
AE is frequent (support count = 2)Done with AEWork on next sub problem suffix D
FP-Growth Algorithm (contd) Found Frequent Itemsets with Suffix Ebull E DE ADE CE AE discovered in this order
FP-Growth Algorithm (contd) Example (contd)
Frequent itemsets found (ordered by suffix and order in which the are found)
Comparative Result
Conclusion
It is found that
bull FP-tree a novel data structure storing compressed crucial information about frequent patterns compact yet complete for frequent pattern mining
bull FP-growth an efficient mining method of frequent patterns in large Database using a highly compact FP-tree divide-and-conquer method in nature
bull Both Apriori and FP-Growth are aiming to find out complete set of patterns but FP-Growth is more efficient than Apriori in respect to long patterns
References
1 Liwu ZOU Guangwei REN ldquoThe data mining algorithm analysis for personalized servicerdquo Fourth International Conference on Multimedia Information Networking and Security 2012
2 Jun TAN Yingyong BU and Bo YANG ldquoAn Efficient Frequent Pattern Mining Algorithmrdquo Sixth International Conference on Fuzzy Systems and Knowledge Discovery 2009
3 Wei Zhang Hongzhi Liao Na Zhao ldquoResearch on the FP Growth Algorithm about Association Rule Miningrdquo International Seminar on Business and Information Management 2008
4 SP Latha DR NRamaraj ldquoAlgorithm for Efficient Data Miningrdquo In Proc
Intrsquo Conf on IEEE International Computational Intelligence and Multimedia Applications 2007
References (contd)
5 Dongme Sun Shaohua Teng Wei Zhang Haibin Zhu ldquoAn Algorithm to Improve the Effectiveness of Apriorirdquo In Proc Intrsquol Conf on 6th IEEE International Conf on Cognitive Informatics (ICCI07) 2007
6 Daniel Hunyadi ldquoPerformance comparison of Apriori and FP-Growth algorithms in generating association rulesrdquo Proceedings of the European Computing Conference 2006
7 By Jiawei Han Micheline Kamber ldquoData mining Concepts and Techniquesrdquo Morgan Kaufmann Publishers 2006
8 Tan P-N Steinbach M and Kumar V ldquoIntroduction to data miningrdquo Addison Wesley Publishers 2006
References (contd)
9 HanJ PeiJ and Yin Y ldquoMining frequent patterns without candidate generationrdquo In Proc ACM-SIGMOD International Conf Management of Data (SIGMOD) 2000
10 R Agrawal Imielinskit SwamiA ldquoMining Association Rules between Sets of Items in Large Databasesrdquo In Proc International Conf of the ACM SIGMOD Conference Washington DC USA 1993
FP-Growth Algorithm (contd) Conditional FP-Tree
The FP-Tree that would be built if we only consider transactions containing a particular itemset (and then removing that itemset from all transactions)
I Example FP-Tree conditional on e
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Obtain T(DE) from T(E) 4 Use the conditional FP-tree for e to find frequent itemsets ending in DE CE
and AE Note that BE is not considered as B is not in the conditional FP-tree for E
bull Support count of DE = 2 (sum of counts of all Drsquos)bull DE is frequent need to solve CDE BDE ADE if they exist
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd)Solving CDE BDE ADEbull Sub-trees for both CDE and BDE are emptybull no prefix paths ending with C or Bbull Working on ADE
ADE (support count = 2) is frequentsolving next sub problem CE
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix CE
CE is frequent (support count = 2)bull Work on next sub problems BE (no support) AE
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix AE
AE is frequent (support count = 2)Done with AEWork on next sub problem suffix D
FP-Growth Algorithm (contd) Found Frequent Itemsets with Suffix Ebull E DE ADE CE AE discovered in this order
FP-Growth Algorithm (contd) Example (contd)
Frequent itemsets found (ordered by suffix and order in which the are found)
Comparative Result
Conclusion
It is found that
bull FP-tree a novel data structure storing compressed crucial information about frequent patterns compact yet complete for frequent pattern mining
bull FP-growth an efficient mining method of frequent patterns in large Database using a highly compact FP-tree divide-and-conquer method in nature
bull Both Apriori and FP-Growth are aiming to find out complete set of patterns but FP-Growth is more efficient than Apriori in respect to long patterns
References
1 Liwu ZOU Guangwei REN ldquoThe data mining algorithm analysis for personalized servicerdquo Fourth International Conference on Multimedia Information Networking and Security 2012
2 Jun TAN Yingyong BU and Bo YANG ldquoAn Efficient Frequent Pattern Mining Algorithmrdquo Sixth International Conference on Fuzzy Systems and Knowledge Discovery 2009
3 Wei Zhang Hongzhi Liao Na Zhao ldquoResearch on the FP Growth Algorithm about Association Rule Miningrdquo International Seminar on Business and Information Management 2008
4 SP Latha DR NRamaraj ldquoAlgorithm for Efficient Data Miningrdquo In Proc
Intrsquo Conf on IEEE International Computational Intelligence and Multimedia Applications 2007
References (contd)
5 Dongme Sun Shaohua Teng Wei Zhang Haibin Zhu ldquoAn Algorithm to Improve the Effectiveness of Apriorirdquo In Proc Intrsquol Conf on 6th IEEE International Conf on Cognitive Informatics (ICCI07) 2007
6 Daniel Hunyadi ldquoPerformance comparison of Apriori and FP-Growth algorithms in generating association rulesrdquo Proceedings of the European Computing Conference 2006
7 By Jiawei Han Micheline Kamber ldquoData mining Concepts and Techniquesrdquo Morgan Kaufmann Publishers 2006
8 Tan P-N Steinbach M and Kumar V ldquoIntroduction to data miningrdquo Addison Wesley Publishers 2006
References (contd)
9 HanJ PeiJ and Yin Y ldquoMining frequent patterns without candidate generationrdquo In Proc ACM-SIGMOD International Conf Management of Data (SIGMOD) 2000
10 R Agrawal Imielinskit SwamiA ldquoMining Association Rules between Sets of Items in Large Databasesrdquo In Proc International Conf of the ACM SIGMOD Conference Washington DC USA 1993
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Obtain T(DE) from T(E) 4 Use the conditional FP-tree for e to find frequent itemsets ending in DE CE
and AE Note that BE is not considered as B is not in the conditional FP-tree for E
bull Support count of DE = 2 (sum of counts of all Drsquos)bull DE is frequent need to solve CDE BDE ADE if they exist
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd)Solving CDE BDE ADEbull Sub-trees for both CDE and BDE are emptybull no prefix paths ending with C or Bbull Working on ADE
ADE (support count = 2) is frequentsolving next sub problem CE
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix CE
CE is frequent (support count = 2)bull Work on next sub problems BE (no support) AE
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix AE
AE is frequent (support count = 2)Done with AEWork on next sub problem suffix D
FP-Growth Algorithm (contd) Found Frequent Itemsets with Suffix Ebull E DE ADE CE AE discovered in this order
FP-Growth Algorithm (contd) Example (contd)
Frequent itemsets found (ordered by suffix and order in which the are found)
Comparative Result
Conclusion
It is found that
bull FP-tree a novel data structure storing compressed crucial information about frequent patterns compact yet complete for frequent pattern mining
bull FP-growth an efficient mining method of frequent patterns in large Database using a highly compact FP-tree divide-and-conquer method in nature
bull Both Apriori and FP-Growth are aiming to find out complete set of patterns but FP-Growth is more efficient than Apriori in respect to long patterns
References
1 Liwu ZOU Guangwei REN ldquoThe data mining algorithm analysis for personalized servicerdquo Fourth International Conference on Multimedia Information Networking and Security 2012
2 Jun TAN Yingyong BU and Bo YANG ldquoAn Efficient Frequent Pattern Mining Algorithmrdquo Sixth International Conference on Fuzzy Systems and Knowledge Discovery 2009
3 Wei Zhang Hongzhi Liao Na Zhao ldquoResearch on the FP Growth Algorithm about Association Rule Miningrdquo International Seminar on Business and Information Management 2008
4 SP Latha DR NRamaraj ldquoAlgorithm for Efficient Data Miningrdquo In Proc
Intrsquo Conf on IEEE International Computational Intelligence and Multimedia Applications 2007
References (contd)
5 Dongme Sun Shaohua Teng Wei Zhang Haibin Zhu ldquoAn Algorithm to Improve the Effectiveness of Apriorirdquo In Proc Intrsquol Conf on 6th IEEE International Conf on Cognitive Informatics (ICCI07) 2007
6 Daniel Hunyadi ldquoPerformance comparison of Apriori and FP-Growth algorithms in generating association rulesrdquo Proceedings of the European Computing Conference 2006
7 By Jiawei Han Micheline Kamber ldquoData mining Concepts and Techniquesrdquo Morgan Kaufmann Publishers 2006
8 Tan P-N Steinbach M and Kumar V ldquoIntroduction to data miningrdquo Addison Wesley Publishers 2006
References (contd)
9 HanJ PeiJ and Yin Y ldquoMining frequent patterns without candidate generationrdquo In Proc ACM-SIGMOD International Conf Management of Data (SIGMOD) 2000
10 R Agrawal Imielinskit SwamiA ldquoMining Association Rules between Sets of Items in Large Databasesrdquo In Proc International Conf of the ACM SIGMOD Conference Washington DC USA 1993
FP-Growth Algorithm (contd) Obtain T(DE) from T(E) 4 Use the conditional FP-tree for e to find frequent itemsets ending in DE CE
and AE Note that BE is not considered as B is not in the conditional FP-tree for E
bull Support count of DE = 2 (sum of counts of all Drsquos)bull DE is frequent need to solve CDE BDE ADE if they exist
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd)Solving CDE BDE ADEbull Sub-trees for both CDE and BDE are emptybull no prefix paths ending with C or Bbull Working on ADE
ADE (support count = 2) is frequentsolving next sub problem CE
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix CE
CE is frequent (support count = 2)bull Work on next sub problems BE (no support) AE
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix AE
AE is frequent (support count = 2)Done with AEWork on next sub problem suffix D
FP-Growth Algorithm (contd) Found Frequent Itemsets with Suffix Ebull E DE ADE CE AE discovered in this order
FP-Growth Algorithm (contd) Example (contd)
Frequent itemsets found (ordered by suffix and order in which the are found)
Comparative Result
Conclusion
It is found that
bull FP-tree a novel data structure storing compressed crucial information about frequent patterns compact yet complete for frequent pattern mining
bull FP-growth an efficient mining method of frequent patterns in large Database using a highly compact FP-tree divide-and-conquer method in nature
bull Both Apriori and FP-Growth are aiming to find out complete set of patterns but FP-Growth is more efficient than Apriori in respect to long patterns
References
1 Liwu ZOU Guangwei REN ldquoThe data mining algorithm analysis for personalized servicerdquo Fourth International Conference on Multimedia Information Networking and Security 2012
2 Jun TAN Yingyong BU and Bo YANG ldquoAn Efficient Frequent Pattern Mining Algorithmrdquo Sixth International Conference on Fuzzy Systems and Knowledge Discovery 2009
3 Wei Zhang Hongzhi Liao Na Zhao ldquoResearch on the FP Growth Algorithm about Association Rule Miningrdquo International Seminar on Business and Information Management 2008
4 SP Latha DR NRamaraj ldquoAlgorithm for Efficient Data Miningrdquo In Proc
Intrsquo Conf on IEEE International Computational Intelligence and Multimedia Applications 2007
References (contd)
5 Dongme Sun Shaohua Teng Wei Zhang Haibin Zhu ldquoAn Algorithm to Improve the Effectiveness of Apriorirdquo In Proc Intrsquol Conf on 6th IEEE International Conf on Cognitive Informatics (ICCI07) 2007
6 Daniel Hunyadi ldquoPerformance comparison of Apriori and FP-Growth algorithms in generating association rulesrdquo Proceedings of the European Computing Conference 2006
7 By Jiawei Han Micheline Kamber ldquoData mining Concepts and Techniquesrdquo Morgan Kaufmann Publishers 2006
8 Tan P-N Steinbach M and Kumar V ldquoIntroduction to data miningrdquo Addison Wesley Publishers 2006
References (contd)
9 HanJ PeiJ and Yin Y ldquoMining frequent patterns without candidate generationrdquo In Proc ACM-SIGMOD International Conf Management of Data (SIGMOD) 2000
10 R Agrawal Imielinskit SwamiA ldquoMining Association Rules between Sets of Items in Large Databasesrdquo In Proc International Conf of the ACM SIGMOD Conference Washington DC USA 1993
FP-Growth Algorithm (contd) Current Position of Processing
FP-Growth Algorithm (contd)Solving CDE BDE ADEbull Sub-trees for both CDE and BDE are emptybull no prefix paths ending with C or Bbull Working on ADE
ADE (support count = 2) is frequentsolving next sub problem CE
FP-Growth Algorithm (contd)Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix CE
CE is frequent (support count = 2)bull Work on next sub problems BE (no support) AE
FP-Growth Algorithm (contd) Current Position in Processing
FP-Growth Algorithm (contd) Solving for Suffix AE
AE is frequent (support count = 2)Done with AEWork on next sub problem suffix D
FP-Growth Algorithm (contd) Found Frequent Itemsets with Suffix Ebull E DE ADE CE AE discovered in this order
FP-Growth Algorithm (contd) Example (contd)
Frequent itemsets found (ordered by suffix and order in which the are found)
Comparative Result
Conclusion
It is found that
bull FP-tree a novel data structure storing compressed crucial information about frequent patterns compact yet complete for frequent pattern mining
bull FP-growth an efficient mining method of frequent patterns in large Database using a highly compact FP-tree divide-and-conquer method in nature
bull Both Apriori and FP-Growth are aiming to find out complete set of patterns but FP-Growth is more efficient than Apriori in respect to long patterns
References
1 Liwu ZOU Guangwei REN ldquoThe data mining algorithm analysis for personalized servicerdquo Fourth International Conference on Multimedia Information Networking and Security 2012
2 Jun TAN Yingyong BU and Bo YANG ldquoAn Efficient Frequent Pattern Mining Algorithmrdquo Sixth International Conference on Fuzzy Systems and Knowledge Discovery 2009
3 Wei Zhang Hongzhi Liao Na Zhao ldquoResearch on the FP Growth Algorithm about Association Rule Miningrdquo International Seminar on Business and Information Management 2008
4 SP Latha DR NRamaraj ldquoAlgorithm for Efficient Data Miningrdquo In Proc
Intrsquo Conf on IEEE International Computational Intelligence and Multimedia Applications 2007
References (contd)
5. Dongme Sun, Shaohua Teng, Wei Zhang, Haibin Zhu, "An Algorithm to Improve the Effectiveness of Apriori," In Proc. 6th IEEE International Conference on Cognitive Informatics (ICCI'07), 2007.
6. Daniel Hunyadi, "Performance comparison of Apriori and FP-Growth algorithms in generating association rules," Proceedings of the European Computing Conference, 2006.
7. Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, 2006.
8. P.-N. Tan, M. Steinbach, and V. Kumar, "Introduction to Data Mining," Addison Wesley Publishers, 2006.
References (contd)
9. J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," In Proc. ACM-SIGMOD International Conference on Management of Data (SIGMOD), 2000.
10. R. Agrawal, T. Imielinski, A. Swami, "Mining Association Rules between Sets of Items in Large Databases," In Proc. ACM SIGMOD Conference, Washington, DC, USA, 1993.