An Efficient way to Find Frequent Pattern with Dynamic Programming Approach

Dharmesh Bhalodiya, K. M. Patel, Chhaya Patel
Computer Engineering, R. K. University, Rajkot, Gujarat, India
[email protected], [email protected], [email protected]

2013 Nirma University International Conference on Engineering (NUiCONE), Ahmedabad, India

Abstract--Data mining plays a vital role in many applications such as market-basket analysis and fraud detection. Within data mining, association rule mining and frequent pattern mining are both key features of market-basket analysis. We are given a large transactional database in which each record consists of the items purchased by a customer at a store. One of the basic market-basket analysis algorithms is Apriori, which generates all candidate itemsets for frequent patterns. In this research paper we describe improved candidate 1-itemset and candidate 2-itemset generation compared with the traditional technique. The algorithm utilizes the dynamic programming approach to facilitate fast candidate itemset generation and searching. We have compared results with the previous approach; the proposed technique optimizes the number of database scans and eliminates duplicate candidate itemset generation.

Index Terms-- Association Rule Mining, Data Mining, Data Structure, Dynamic Programming, Frequent itemsets.

I. INTRODUCTION

Data mining is defined as the extraction of knowledge from very large databases. Data mining is also called Knowledge Discovery in Databases (KDD) [1]. Nowadays many organizations collect sales data. This data is saved in the form of transactions, where each transaction represents the items purchased from the store. In such a database, each record represents a transaction and each attribute represents an item purchased by a customer.

Take the example of a supermarket. Each transaction is collected, and after a large amount of data has accumulated, market-basket analysis is applied to find patterns. The discovered patterns are sets of items that are most frequent in the database, for example, “60 percent of people who buy bread also buy butter”. Decision makers use this information to identify customers' buying habits.

Association rule mining helps in finding relationships among the sets of items across all transactions. The Apriori algorithm is used for discovering association rules between items in market-basket data [2]. Association rule mining requires two predefined values: minimum confidence and minimum support. The mining process is divided into two sub-processes. The first finds those itemsets whose number of occurrences in the database exceeds the minimum threshold, the minimum support count; these are called frequent itemsets or large itemsets. The second generates rules from the frequent itemsets, with the condition that each rule satisfies the minimum confidence.

Dynamic programming is one of the techniques for designing efficient algorithms [3]. The technique stores previous solutions, so whenever a subproblem reappears its result can be accessed directly from the precomputed values without creating more overhead. Many problems are solved optimally with the dynamic programming approach, e.g., matrix chain multiplication, longest common subsequence, and optimal binary search trees. An important feature is memoization; the word memoization (it is not a misspelling of memorization) comes from “memo”, meaning a recording of solutions.
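As a minimal illustration of memoization (this example is ours, not from the paper; Java is used here since the authors' experiments were profiled in NetBeans), the classic Fibonacci recursion below records each subproblem's solution in a memo table and reuses it when the subproblem reappears:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal memoization demo: each subproblem's solution is recorded
// in a "memo" table and reused when it reappears.
public class MemoDemo {
    private static final Map<Integer, Long> memo = new HashMap<>();

    static long fib(int n) {
        if (n <= 1) return n;
        Long cached = memo.get(n);      // reuse a previously recorded solution
        if (cached != null) return cached;
        long result = fib(n - 1) + fib(n - 2);
        memo.put(n, result);            // record the solution for later reuse
        return result;
    }

    public static void main(String[] args) {
        // Without the memo table this call would take exponential time.
        System.out.println(fib(50));    // prints 12586269025
    }
}
```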

1) Apriori Algorithm
Let D be the market-basket database, where each row contains a transaction T tagged with a unique identifier Tid. Let I be the item set {I1, I2, I3, ..., In}. If an itemset contains k items it is called a k-itemset, and if all subsets of a k-itemset satisfy the minimum support count it is called Lk, a frequent itemset or large itemset. The algorithm performs two basic steps: (a) Join, which self-joins the previous frequent itemset Lk-1 to create the new candidate itemset Ck; and (b) Prune, which filters from the current candidate itemset those candidates having a subset that was not frequent in the previous step. The steps below explain the working of the Apriori algorithm.

1. Assume that the minimum support count and minimum confidence are given as min_sup and min_conf respectively.

2. Scan the entire database and find the candidate 1-itemset C1 along with its occurrence counts, that is, the number of times each item appears in the database.

3. From C1, eliminate those items whose count does not satisfy the min_sup threshold. The remaining 1-itemsets of C1 are called L1.

4. Self-join L1 ⋈ L1 to create the new candidate set C2; scan the database again and calculate the number of times each candidate 2-itemset appears in the database.

5. Apply pruning to C2 to get L2.

6. Iterate steps 2 to 5 in this way until Ck is empty (see the sketch below).
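A minimal sketch of these steps for the first two levels (L1 and L2, the levels this paper targets), using an illustrative six-transaction, five-item database chosen to match the properties stated later in the text ({I1, I2, I5} as the first transaction, I3 infrequent at min_sup = 2); names and data are ours, not the authors':

```java
import java.util.*;

// Sketch of the first two Apriori levels (L1 and L2) described above.
// The six-transaction database is illustrative; Fig. 1 is not reproduced here.
public class AprioriSketch {
    public static void main(String[] args) {
        List<Set<Integer>> db = List.of(
                Set.of(1, 2, 5), Set.of(2, 4), Set.of(1, 2, 4),
                Set.of(1, 5), Set.of(2, 3), Set.of(4, 5));
        int minSup = 2;

        // Scan 1: count candidate 1-itemsets C1, then keep the frequent ones as L1.
        Map<Integer, Integer> c1 = new HashMap<>();
        for (Set<Integer> t : db)
            for (int item : t) c1.merge(item, 1, Integer::sum);
        Set<Integer> l1 = new TreeSet<>();
        c1.forEach((item, count) -> { if (count >= minSup) l1.add(item); });

        // Join L1 with itself to form candidate pairs C2 (I3 is already pruned),
        // then Scan 2: count each candidate pair and keep the frequent ones as L2.
        Map<List<Integer>, Integer> c2 = new HashMap<>();
        for (Set<Integer> t : db)
            for (int a : l1)
                for (int b : l1)
                    if (a < b && t.contains(a) && t.contains(b))
                        c2.merge(List.of(a, b), 1, Integer::sum);
        c2.forEach((pair, count) -> {
            if (count >= minSup)
                System.out.println("L2 itemset " + pair + " count=" + count);
        });
    }
}
```

The two database scans and the L1 ⋈ L1 join visible here are exactly the costs that the approach proposed below removes.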


II. PREVIOUS WORK

Numerous algorithms for association rule mining have been discussed in [2], [4], [5], [6], [7], [8] and [9]. Association rule generation was first introduced in [2], with an algorithm called AIS. AIS generates all possible candidate itemsets and association rules; candidate itemsets are found at each database scan, and the large itemsets calculated in the previous scan are further extended in the next scan. The traditional AIS technique was improved in [4]; that technique generates candidate itemsets in a separate step and proceeds using only the large itemsets, by self-join: the calculated itemset is joined with itself and forwarded to the next step, which eliminates those itemsets whose subsets are not part of the previous frequent candidate itemsets.

The Dynamic Itemset Counting algorithm [5] dynamically generates candidate itemsets as the algorithm proceeds, counting itemsets of varying lengths. In this way it requires fewer database scans than Apriori.

The hash-based algorithm [8] uses a special data structure and minimizes the number of candidate itemsets by collecting approximate counts from previous results. The partitioning algorithm [9] requires only two database scans; the technique partitions the whole database into chunks of the same size, so that each chunk can easily reside in memory. The first scan calculates the locally large candidate itemsets from each local chunk, and the second pass calculates the globally large itemsets. It is possible that locally generated candidate itemsets are not potentially large globally, and one more scan is required if the locally generated itemsets do not fit into memory.

III. PROPOSED ALGORITHM

We have utilized the dynamic programming approach and a special data structure to improve the performance of finding the frequent candidate 1-itemsets and 2-itemsets in traditional Apriori. Following the working principle of dynamic programming, the solution eliminates redundant work and saves previous results in a proper and efficient storage structure. Within the traditional Apriori algorithm, we apply the dynamic programming approach to find the frequent, or large, candidate 1-itemsets and candidate 2-itemsets. This approach requires only one database scan for both the frequent candidate 1-itemsets and 2-itemsets, which is the key feature that improves performance; in contrast, Apriori needs a separate scan for each level of frequent itemsets. The next section shows the results and supports the claim that generating the large candidate 2-itemsets with this technique is more efficient.

For example, suppose we have a transactional database that contains certain transactions carried out in a supermarket store, where each transaction is assigned a unique identifier Tid and the items purchased by the customers are stored in the transaction. Fig. 1 demonstrates this transactional database. Consider the item set I = {I1, I2, I3, I4, I5}. Fig. 1 shows the market-basket transactional database with six transactions T and five unique items I.

1) Algorithm

Input:
Number of transactions Tn and number of unique items In in database D.
Minimum support count (min_sup).

Output:
Frequent candidate 1-itemsets and 2-itemsets.

Fig. 1 Transactional database

Item_array[] holds the list of items appearing in the current transaction, in ascending order. Count_Table[][] saves the occurrence counts of the 1-itemsets (on its diagonal) and the 2-itemsets.

OccurrenceCount()
Start
1. for i = 1 to Tn
2.   Item_array[] = items found in Ti;
3.   for j = 1 to Item_array.length, for k = j to Item_array.length
4.     Count_Table[Item_array[j]][Item_array[k]]++;
End

FrequentCount()
Candidate_vector holds the list of frequent candidate 2-itemsets.
Start
5. for i = 1 to In
6.   if Count_Table[i][i] ≥ min_sup then
7.     for j = i+1 to In
8.       if Count_Table[i][j] ≥ min_sup then
9.         candidate_vector ← (i, j);
End

The procedure contains two core parts. The first, OccurrenceCount(), calculates the occurrence counts of the 1-itemsets and 2-itemsets and stores them. The second, FrequentCount(), checks whether each count is frequent or not and finally gives the frequent 2-itemsets. The layout of Count_Table is shown in Fig. 2, which represents a two-dimensional array rotated 45 degrees to the right. In Count_Table, i and j vary from 1 to the total number of items present in the database (In).

The first transaction contains three pairs of two items: {I1, I2}, {I1, I5} and {I2, I5}. Each itemset contains two values; for example, in {I1, I5} the first value represents the i index and the second value represents the j index in Count_Table. In Fig. 2, at the apex of the triangle, Count_Table[1][5] holds the value 2, which indicates the number of occurrences of itemset {I1, I5} across all transactions. The last row of the table represents the occurrence counts of the candidate 1-itemsets, where both indices are the same (i = j).

Fig 2: Count_Table layout

Count_Table is filled in procedure OccurrenceCount() through steps 1 to 4. This procedure scans the database and lists the items of the current transaction at step 2. At step 4, the occurrence counts of item pairs are calculated from Item_array; each count is incremented by 1 in Count_Table at the appropriate position.

Once OccurrenceCount() has executed, we have the 2-itemset counts. FrequentCount() prunes the infrequent itemsets. Step 6 checks whether a 1-itemset is frequent. If it is, the procedure proceeds to its 2-itemsets; if not, it proceeds directly to the next item. By the given rule, if any subset of a candidate k-itemset is not frequent, then the candidate itself is not frequent [4]. candidate_vector is a one-dimensional array that holds the frequent candidate 2-itemsets at step 9. A runnable version of the two procedures is sketched below.
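The following is our runnable Java rendering of the two procedures, on the same illustrative database as the earlier Apriori sketch; items within each transaction are assumed sorted ascending so that only the upper triangle of the table is touched:

```java
import java.util.ArrayList;
import java.util.List;

// Runnable rendering of OccurrenceCount() and FrequentCount() above.
// Items are numbered 1..In; the diagonal countTable[i][i] holds the
// 1-itemset counts and countTable[i][j] (i < j) the 2-itemset counts.
public class AprioriDPSketch {
    public static void main(String[] args) {
        final int in = 5, minSup = 2;   // In unique items, support threshold
        int[][] db = { {1, 2, 5}, {2, 4}, {1, 2, 4}, {1, 5}, {2, 3}, {4, 5} };
        int[][] countTable = new int[in + 1][in + 1];

        // OccurrenceCount(): a single database scan fills both the 1-itemset
        // counts (j == k) and the 2-itemset counts (j < k).
        for (int[] itemArray : db)
            for (int j = 0; j < itemArray.length; j++)
                for (int k = j; k < itemArray.length; k++)
                    countTable[itemArray[j]][itemArray[k]]++;

        // FrequentCount(): skip every pair whose first item is infrequent,
        // then collect the frequent pairs into candidate_vector.
        List<int[]> candidateVector = new ArrayList<>();
        for (int i = 1; i <= in; i++) {
            if (countTable[i][i] < minSup) continue;   // I_i infrequent
            for (int j = i + 1; j <= in; j++)
                if (countTable[i][j] >= minSup)
                    candidateVector.add(new int[] { i, j });
        }
        for (int[] pair : candidateVector)
            System.out.printf("frequent 2-itemset {I%d, I%d} count=%d%n",
                    pair[0], pair[1], countTable[pair[0]][pair[1]]);
    }
}
```

Note that a single pass over the database fills both the diagonal (1-itemset counts) and the upper triangle (2-itemset counts); this is where the saved database scan comes from.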

Fig. 3 (a) candidate 1-itemset (b) frequent candidate 1-itemset.

In Apriori, the candidate 1-itemset counts are given in Fig. 3a. If the given minimum support count (min_sup) is 2, then item I3 is directly removed. The remaining frequent candidate 1-itemsets are listed in Fig. 3b with their occurrence counts. This itemset is now used in the generation of the candidate 2-itemsets.

The possible candidate 2-itemsets as per the Apriori algorithm are listed in Fig. 4a. Candidate 2-itemsets {I3, I4} and {I3, I5} are removed directly, before the search for frequent candidates, because their subset I3 is not frequent. After this technique filters the infrequent candidate 2-itemsets at the given min_sup, the frequent candidate 2-itemsets are given in Fig. 4b.

Traditional Apriori needs K database scans S, where K is the itemset length and S denotes a scan; in our case that is 2 scans, one for each itemset count. Candidate generation by self-joining the Lk-1 itemsets takes O(n²) steps, where n is the number of items. The candidate pruning step, which eliminates the infrequent itemsets, takes |Ck|·Tn operations, where Tn is the total number of transactions in the database.
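One way to summarize this accounting at the 2-itemset level, using only the quantities defined above (our compact restatement, not a formula from the paper):

$$\mathrm{cost}_{\text{Apriori}} \approx 2S + O(n^2) + |C_2| \cdot T_n, \qquad \mathrm{cost}_{\text{AprioriDP}} \approx 1S + O(I_n^2),$$

where the $O(I_n^2)$ term in AprioriDP is a traversal of Count_Table rather than a join over candidates.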

Fig. 4 (a) candidate 2-itemset (b) frequent candidate 2-itemset

The traditional Apriori algorithm performs two important functions: (a) join and (b) prune. The join function generates candidates from the previously known frequent itemsets by self-joining, and this function creates some computational overhead. That approach finds the frequent candidate sets in a top-down manner, i.e., it first generates all possible candidates without worrying about whether they will be frequent, and then searches to check whether each generated candidate itemset's count satisfies the support count. The proposed approach overcomes this computational overhead. It works in a bottom-up manner, i.e., it first calculates the occurrence counts from the database and then checks whether each itemset is frequent. This technique gives the frequent 2-itemsets without generating candidates; in other words, there is no need to perform the O(n²) steps, and one database scan S is saved as well. Effective utilization of the right data structure improves the overall performance.

Count_Table is highly optimized and efficient: we can easily traverse the entire table, and it is reusable. One can re-generate the frequent candidates at a different support count without scanning the database again, as the helper below illustrates.
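For instance, a hypothetical helper in the style of the earlier sketch (assuming its filled countTable, item count in, and java.util imports) re-filters at a new threshold with no further I/O:

```java
// Hypothetical helper (ours, not the authors'): re-filter an already-filled
// countTable at a new support threshold without rescanning the database.
static List<int[]> refilter(int[][] countTable, int in, int newMinSup) {
    List<int[]> frequentPairs = new ArrayList<>();
    for (int i = 1; i <= in; i++) {
        if (countTable[i][i] < newMinSup) continue;  // 1-itemset infrequent: skip
        for (int j = i + 1; j <= in; j++)
            if (countTable[i][j] >= newMinSup)
                frequentPairs.add(new int[] { i, j });
    }
    return frequentPairs;
}
```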

IV. EXPERIMENTS AND RESULTS

We used an Intel® Core™ 2 Duo system at 2.19 GHz with 2 GB of memory for the experimental work. Our database contains 1500 different transactions and 250 items. The experimental work on traditional Apriori and the proposed algorithm, which we refer to as AprioriDP, focuses on the frequent candidate 1-itemsets and 2-itemsets; only this part of the algorithms is covered in our analysis.

Fig. 5 presents a bar chart of the time requirements of the individual algorithms. The X-axis is drawn horizontally with an interval of 50 items, and the Y-axis is drawn vertically with time. In the following discussion we refer to the traditional algorithm as Apriori and the proposed algorithm as AprioriDP (Apriori with Dynamic Programming). First we take a constant minimum support count (min_sup) of 2% with varying numbers of items; second, a constant item size with varying min_sup. All experiments were performed ten times in our laboratory, and the mean values are shown in the figures below.

Fig 5 Apriori and AprioriDP Execution time with 2% min_sup

Take the first bar in Fig. 5: with 25 different items in 1500 transactions, the two values show that Apriori and AprioriDP require 0.122 s and 0.026 s respectively. These two values do not differ much, but looking further along the graph, at item sizes 50 and 75, Apriori requires 0.393 s and 0.758 s while AprioriDP requires only 0.030 s and 0.037 s respectively. At the last bar, the traditional approach requires 7.668 s but the proposed algorithm takes only 0.074 s. If we added more items, the difference would become even larger. At the starting point both execute in nearly the same time, but at 250 items AprioriDP is about 100 times faster than Apriori.

Fig 6 Apriori and AprioriDP Execution time with 250 items.

Now we analyze different support counts, varying from 0.25% to 1%. These small values give the maximum number of candidate itemsets. Fig. 6 shows that as the support count changes, Apriori's time changes substantially, while AprioriDP behaves almost uniformly; the small differences we observe in AprioriDP are in milliseconds. From a 0.50% to a 0.75% support count, Apriori's time decreases by 1.6 seconds; in AprioriDP the corresponding difference comes to 0.008 seconds. Taking the difference from 0.75% to 1% support count in the same way gives 1.5 s and 0.001 s for Apriori and AprioriDP respectively.

The improvement in AprioriDP comes from two major factors: the first is the reduction of one database scan S, and the other is the complete elimination of the O(n²) candidate generation steps.

The experimental work above focuses only on time complexity, but from an efficiency point of view space complexity must also be considered. The memory consumed by the two algorithms is given by their heap values: in our experiments, the memory required by Apriori was 6.207 MB and by the proposed approach 2.801 MB. These heap counts were calculated by profiling in the NetBeans tool.

The proposed AprioriDP is motivated by the dynamic programming approach and by the reduction of one database scan. We map 2-itemsets directly into Count_Table. From this approach it is clearly evident that it has an advantage over Apriori. These experimental results are for calculating the large candidate 2-itemsets; if one is interested only in 1-itemsets, the performance improvement is boosted again.

V. CONCLUSIONS

We have proposed changes to the Apriori algorithm using an efficient data structure and a dynamic programming methodology. The effective improvement in the proposed change comes from reducing one database scan and from utilizing memory space to store the candidate 1-itemset and 2-itemset occurrence counts in a special data structure that can be reused in further processing to save time. The proposed dynamic programming algorithm, AprioriDP, requires the least time to compute. We presented a comparative analysis of the traditional approach (Apriori) and the dynamic approach with reference to CPU overhead and the time required to complete the task, and finally observed that the proposed algorithm requires far less time than the Apriori algorithm.

VI. FUTURE WORK

So far we have focused on calculating the large candidate 1-itemsets and 2-itemsets. More frequent candidate itemsets, such as 3-itemsets and 4-itemsets up to k-itemsets, could be discovered with the dynamic programming approach. Mapping candidates of more than 2 items may be possible with transaction IDs. This possibility would help us find the remaining frequent itemsets.

VII. ACKNOWLEDGEMENT

We are very thankful to Amit M. Lathigra, HOD-IT, School of Engineering, RK University, Rajkot, Gujarat, for providing the base resources for the experimental work and for moral support as and when required during the research work. We would also like to express deep gratitude to the M.Tech staff members for their continuous encouragement and motivation to publish this research work.

VIII. REFERENCES

[1] G. Eason, B. Noble, and I. N. Sneddon, “On certain integrals of Lipschitz-Hankel type involving products of Bessel functions,” Phil. Trans. Roy. Soc. London, vol. A247, pp. 529-551, Apr. 1955.

[2] R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases,” in Proc. 1993 ACM SIGMOD Int. Conf. on Management of Data, Washington, DC, May 1993, pp. 207-216.


[3] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, Third Edition, MIT Press, Cambridge, London, 2009.

[4] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Inkeri Verkamo, “Fast Discovery of Association Rules,” in Advances in Knowledge Discovery and Data Mining, U. Fayyad et al., eds., pp. 307-328, Menlo Park, Calif.: AAAI Press, 1996.

[5] S. Brin, R. Motwani, J. Ullman, and S. Tsur, “Dynamic Itemset Counting and Implication Rules for Market Basket Data,” in Proc. ACM SIGMOD Conf. on Management of Data, May 1997.

[6] D.-I. Lin and Z. M. Kedem, “Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set,” in Proc. Sixth Int. Conf. on Extending Database Technology, Mar. 1998.

[7] J.-L. Lin and M. H. Dunham, “Mining Association Rules: Anti-Skew Algorithms,” in Proc. 14th Int. Conf. on Data Engineering, Feb. 1998.

[8] J. S. Park, M. Chen, and P. S. Yu, “An Effective Hash Based Algorithm for Mining Association Rules,” in Proc. ACM SIGMOD Int. Conf. on Management of Data, May 1995.

[9] A. Savasere, E. Omiecinski, and S. Navathe, “An Efficient Algorithm for Mining Association Rules in Large Databases,” in Proc. 21st Int. Conf. on Very Large Data Bases, 1995.