Online Mining of data to Generate Association Rule Mining in Large Databases

Archana Singh, Ph.D Scholar, Amity University, NOIDA (U.P), 91-9958255675, [email protected]
Megha Chaudhary, M.Tech (CS & Engg), Amity University, NOIDA (U.P), +91-981811756, [email protected]
Dr. (Prof.) Ajay Rana, Ph.D (Comp Science & Engg), Amity University, NOIDA (U.P), +919958759459, [email protected]
Gaurav Dubey, Ph.D Scholar, Amity University, NOIDA (U.P), [email protected]

ABSTRACT - Data mining is a technology to explore data, analyze the data and finally discover patterns from a large data repository. In this paper, the problem of online mining of association rules in large databases is discussed. Online association rule mining helps to remove redundant rules and yields a compact representation of the rules for the user. In this paper, a new and more optimized algorithm is proposed for online rule generation. The advantage of this algorithm is that the graph it generates has fewer edges than the lattice used in the existing algorithm. The proposed algorithm also generates all the essential rules, with no rule missing. The use of non-redundant association rules helps significantly in reducing irrelevant noise in the data mining process. A graph-theoretic structure, called the adjacency lattice, is crucial for online mining of data. The adjacency lattice can be stored either in main memory or in secondary memory. The idea of the adjacency lattice is to pre-store a number of large itemsets in a special format which reduces the disk I/O required to perform the query.

Index Keywords: Adjacency lattice, Association Rule Mining, Data Mining

I INTRODUCTION

Data mining is the process of analyzing data and summarizing it into useful information. In other words, data mining is the process of finding patterns among dozens of fields in large relational databases. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified.

A. Overview of the Work Done

Association rule mining, as suggested by R. Agrawal, describes relationships between items in data sets. It helps in finding the items that would be selected given that a certain set of items has already been selected. An improved algorithm for fast rule generation has been discussed by Agrawal et al. (1994). Two algorithms for generating association rules are given in 'Fast Algorithms for Mining Association Rules' by Rakesh Agrawal and Srikant (1994). Online mining of data is performed by pre-processing the data effectively in order to make it suitable for repeated online queries. An online association rule mining technique discussed by Charu C. Agrawal et al. (2001) suggests a graph-theoretic approach in which the pre-processed data is stored in such a way that online processing may be done by applying a graph-theoretic search algorithm. In that paper the concept of the adjacency lattice of itemsets was introduced; the adjacency lattice is crucial for performing effective online data mining and can be stored either in main memory or on secondary memory. The idea is to pre-store a number of itemsets at a level of support in a special format (called the adjacency lattice) which reduces the disk I/O required to perform the query. Online generation of rules deals with finding association rules online as the value of the minimum confidence changes. The problem with the existing algorithm is that the lattice has to be constructed again for all large itemsets in order to generate the rules, which is very time consuming for online rule generation. The number of edges in the generated lattice is also large, as there are edges from a frequent itemset to all its supersets in the subsequent levels. This paper aims to develop a new algorithm for online rule generation. A weighted directed graph is constructed and depth-first search is used for rule generation. In the proposed algorithm, online rules can be generated by building the adjacency matrix for some confidence value and then generating rules for any confidence value higher than the one used to build the matrix.

2011 International Conference on Recent Trends in Information Systems

978-1-4577-0792-6/11/$26.00 ©2011 IEEE


A new algorithm has been developed to overcome these difficulties. In this algorithm, the number of edges in the generated graph is smaller than in the adjacency lattice, and it is also capable of finding all the essential rules. This paper is further divided into sections as follows: Section 2 describes the work done by Charu C. Agrawal (2001). Section 3 describes the new proposed algorithm. Section 4 discusses an illustration of the existing and proposed algorithms. In the last section, the two algorithms are compared in terms of their complexity.

II EXISTING ALGORITHM FOR ONLINE RULE GENERATION

The aim of association rule mining (Rakesh et al., 1994) is to detect relationships or patterns between specific values of categorical variables in large data sets. A graph-theoretic approach is suggested: the main idea in the existing algorithm is to partition the attribute values into transaction patterns. This technique enables analysts and researchers to uncover hidden patterns in large data sets. The pre-processed data is stored in such a way that online rule generation can be done with a complexity proportional to the size of the output. In the existing algorithm, the concept of an adjacency lattice of itemsets was introduced. This adjacency lattice is crucial to performing effective online data mining. It can be stored either in main memory or on secondary memory. The idea of the adjacency lattice is to pre-store as many large itemsets as possible, at the lowest level of support the available memory allows. These itemsets are stored in a special format (called the adjacency lattice) which reduces the disk I/O required to perform the query. In fact, if enough main memory is available for the entire adjacency lattice, then no I/O may need to be performed at all.

A. Adjacency lattice

An itemset X is said to be adjacent to an itemset Y if one of them can be obtained from the other by adding a single item. Specifically, an itemset X is said to be a parent of the itemset Y if Y can be obtained from X by adding a single item to the set X. Clearly an itemset may have more than one parent and more than one child; in fact, the number of parents of an itemset X is exactly equal to the cardinality of X. This observation follows from the fact that for each element ir in an itemset X, X − ir is a parent of X. If a directed path exists from the vertex corresponding to Z to the vertex corresponding to X in the adjacency lattice, then Z ⊂ X; in such a case, X is said to be a descendant of Z and Z is said to be an ancestor of X.

B. The Existing Algorithm

There are three steps in the existing algorithm (Charu C. Agrawal et al., 2001).

STEP 1: Generation of the adjacency lattice. The adjacency lattice is created from the frequent itemsets generated by any standard algorithm for some minimum support. This support value is called the primary threshold. The itemsets obtained are referred to as prestored itemsets and can be kept in main memory or secondary memory. This is beneficial in that the dataset need not be consulted again for each value of minimum support and confidence given by the user. The adjacency lattice L is a directed acyclic graph, constructed as follows: create a vertex v(I) for each primary itemset I, labelled with its support S(I); for any pair of vertices corresponding to itemsets X and Y, a directed edge exists from v(X) to v(Y) if and only if X is a parent of Y. Note that it is not possible to perform online mining of association rules at support levels below the primary threshold.

STEP 2: Online generation of itemsets. Once the adjacency lattice is stored in main memory, the user can retrieve specific large itemsets as desired. Suppose the user wants all large itemsets that contain a set of items I and satisfy a minimum support s; then the following search must be solved in the adjacency lattice: for a given itemset I, find all itemsets J such that v(J) is reachable from v(I) by a directed path in the lattice L and satisfies S(J) ≥ s.

STEP 3: Rule generation. Rules are generated from these prestored itemsets for some user-defined minimum support and minimum confidence.
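STEP 2 is a pure reachability search over the prestored lattice. A minimal Python sketch (the itemsets, supports, and helper names below are our own illustration, not from the paper); it uses the fact that support can only shrink along a directed path, so the search may prune at the first itemset that falls below s:

```python
# Hypothetical prestored itemsets with their supports (primary threshold 0.4).
supports = {
    frozenset("A"): 0.8, frozenset("B"): 0.8, frozenset("D"): 0.8,
    frozenset("AB"): 0.6, frozenset("AD"): 0.6, frozenset("ABD"): 0.4,
}

def children(itemset):
    """v(X) -> v(Y) iff X is a parent of Y, i.e. Y = X plus one item."""
    return [y for y in supports if len(y) == len(itemset) + 1 and itemset < y]

def reachable_with_support(start, min_sup):
    """All itemsets J reachable from v(start) by a directed path, with S(J) >= min_sup."""
    result, stack = set(), [start]
    while stack:
        for y in children(stack.pop()):
            if y not in result and supports[y] >= min_sup:
                result.add(y)
                stack.append(y)
    return result

# Supersets of {A} with support >= 0.5: AB and AD, but not ABD (0.4 < 0.5).
print(reachable_with_support(frozenset("A"), 0.5))
```

Because support is anti-monotone, pruning a low-support vertex never cuts off a qualifying descendant.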

III PROPOSED ALGORITHM

The algorithm by Charu et al. (2001) was discussed in the previous section; the current section discusses the proposed algorithm in detail. A graph-theoretic approach is used: the graph generated is a directed graph with weights associated with the edges, and the number of edges is smaller than in the algorithm suggested by Charu et al.

A. Algorithm

The algorithm has two steps. The first step, explained in Section 3(A), is the construction of the graph; the second step, explained in Section 3(B), is rule generation.

Construction of adjacency lattice


The large itemsets obtained by applying a traditional algorithm for finding frequent itemsets (such as Apriori) are stored in one file, and the corresponding support values in another. Using these two files, the itemsets and their supports are stored in a structure, say S. An array of structures s(i, j) with two fields, itemsets and support, is then created to store large itemsets of different lengths in different rows: 1-itemsets in s(1, j), 2-itemsets in s(2, j), 3-itemsets in s(3, j), and so on. A function named Initialize() performs this grouping; its pseudocode is given below.

Algorithm Initialize(S)
Begin
  for each large itemset Є S do
    item1 = s(i).itemset; item2 = s(i+1).itemset;
    m1 = length(item1); m2 = length(item2);
    s(j, k).itemsets = item1;
    s(j, k).support = s(i).support;
    increment k;
    if (m1 ≠ m2)   // lengths of consecutive itemsets differ
      put subsequent itemsets in the next row of s;
  return s;
End;

To calculate the weight of the edge between itemset X and itemset Y, where X − Y is a 1-itemset, compute support(X)/support(Y). If this value is greater than or equal to the minimum confidence, an edge is placed between X and Y with weight support(X)/support(Y).

A function is now required to generate the adjacency matrix from the structures S and s. It takes one large itemset from s(i, j) and compares it with all itemsets in s(i+1, j); if a subset of the itemset from s(i, j) is present in s(i+1, j), the function determines whether a link should exist between them and, if so, the weight of that link. An itemset X taken from s(i, j) is searched in S to obtain its index, say index1, from which its support is read. All subsets of X present in s(i+1, j) are then searched; for each such itemset Y, its index in S, say index2, is obtained. The weight S(index1).support / S(index2).support is calculated; if it is greater than or equal to the minimum confidence, the adjacency matrix entry a[index1, index2] is assigned this weight. The pseudocode for gen_adj_lattice() is given below.

Algorithm gen_adj_lattice(S, s)
Begin
  for each row of s do
    item1 = s(i, j).itemsets;
    index1 = find_index(item1, S);
    // find all subsets of item1 in s(i+1, j)
    for each itemset in s(i+1) do
      item2 = s(i+1, k).itemsets;
      if (item1 is a superset of item2)
        index2 = find_index(item2, S);
        confidence = S(index1).support / S(index2).support;
        if (confidence >= minconf)
          adj_lat(index1, index2) = confidence;
  return adj_lat;
End;

In the above gen_adj_lattice() function, a sub-function is used to search for an itemset in the structure S; it returns the index of that itemset in the structure, from which the support of the corresponding large itemset can be read. To search for an itemset X in S, first find the length of X; then traverse S, comparing X only with itemsets of equal length. If all the items of both itemsets match, return the index. The pseudocode for find_index() is given below.

Algorithm find_index(item, S)
Begin
  n1 = length(item);
  for each itemset in S do
    item2 = S(r).itemsets;
    n2 = length(item2);
    if (n1 = n2)
      if (each item matches)
        index = r;
        return index;
End;

The graph generated is a directed graph in which the largest itemsets are at the first level and the 1-itemsets are at the lowest level. The edges are directed from the (n−1)th level to the nth level, and the weight of an edge equals the support of the itemset at the (n−1)th level divided by the support of the itemset at the nth level.

B. Generation of Rules

Each node in the directed graph is chosen in turn for rule generation. Call that node the starting node and perform a depth-first search in the directed graph. A rule is generated from a visited node and the starting node if and only if it satisfies all the conditions required to generate an essential rule. Conditions:

1. The product of the confidences along the path between the starting node and the visited node must be greater than or equal to the minimum confidence.

2. To reduce simple redundancy: the set of all children of the visited node is generated and compared with the nodes that have already been used by the same starting node for rule generation. If any of these child nodes is found there, no rule can be generated from the visited node, since such a rule would be redundant.

The pseudocode for find_allChild() is given below.

Algorithm find_allChild(adj_lat, i)
Begin
  C1 = C = NULL;
  C1 = C = Child(adj_lat, i);
  while C1 ≠ NULL do
    for each c Є C1 do
      C1 = Child(adj_lat, c);
      C = C ∪ C1;
  return C;
End;

We have a structure, say G, which stores the nodes that have already been used for generating rules. They are stored in such a way that the required nodes can be obtained directly from the corresponding index. The pseudocode is given below.

Algorithm node_gen_rule(nodeset S, G)
Begin
  generatedSet = NULL;
  for each node S(i) Є S do
    generatedSet = generatedSet ∪ G(S(i));
  return generatedSet;
End;

To reduce strict redundancy:

A) We generate the set of all parents of the starting node, and for these parent nodes we find all the nodes that have been used for rule generation by them. This set of nodes is compared with the visited node; if the visited node is found there, no rule can be generated from it, because such a rule would be strictly redundant. The pseudocode for find_allParents() is given below.

B) We generate the set of all children of the visited node and the set of all parents of the starting node, and for these parent nodes we find all the nodes that have been used for rule generation by them. This set of nodes is compared with the set of all children; if any child of the visited node is found there, no rule can be generated from this visited node, because such a rule would be strictly redundant.

Algorithm find_allParents(adj_lat, i)
Begin
  P1 = P = NULL;
  P1 = P = Parents(adj_lat, i);
  while P1 ≠ NULL do
    for each p Є P1 do
      P1 = Parents(adj_lat, p);
      P = P ∪ P1;
  return P;
End;
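Both find_allChild() and find_allParents() compute a transitive closure over the graph. A hedged Python sketch (the helper and map names are ours; unlike the pseudocode above, it tracks a frontier explicitly, so termination is evident):

```python
def closure(adjacent, node):
    """All nodes reachable from `node` via the `adjacent` map; pass successor
    sets for find_allChild, predecessor sets for find_allParents."""
    seen, frontier = set(), {node}
    while frontier:
        nxt = set()
        for n in frontier:
            for m in adjacent.get(n, ()):
                if m not in seen:
                    seen.add(m)
                    nxt.add(m)
        frontier = nxt
    return seen

# Toy successor map: ABD -> {AB, AD}, AB -> {A, B}, AD -> {A, D}
succ = {"ABD": {"AB", "AD"}, "AB": {"A", "B"}, "AD": {"A", "D"}}
print(sorted(closure(succ, "ABD")))  # ['A', 'AB', 'AD', 'B', 'D']
```

The same function serves both directions, since the only difference between the two routines is which adjacency map is supplied.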

Algorithm GenerateRule(starting node X, visited node Y, min conf c, G)
Begin
  RuleSet = NULL;
  c1 = weighted product of the path(X, Y);
  if (c1 >= c)
    if (¬compare(find_allChild(adj_lat, Y), node_gen_rule(X, G)))
      if (¬compare(node_gen_rule(find_allParents(adj_lat, X), G), Y))
        if (¬compare(find_allChild(adj_lat, Y), node_gen_rule(find_allParents(adj_lat, X), G)))
          RuleSet = RuleSet ∪ (Y → (X − Y));
  return RuleSet;
End;
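Putting Section III together, a hedged end-to-end sketch in Python (the helper names are ours; it implements the graph construction plus condition 1 and the simple-redundancy part of condition 2, leaving out the strict-redundancy checks, which follow the same pattern with find_allParents):

```python
def build_graph(supports, minconf):
    """Directed edge from superset X to subset Y (|X| = |Y| + 1), weighted by
    support(X)/support(Y); the edge is kept only if its weight >= minconf."""
    graph = {}
    for x in supports:
        for y in supports:
            if len(y) + 1 == len(x) and set(y) < set(x):
                w = supports[x] / supports[y]
                if w >= minconf:
                    graph.setdefault(x, {})[y] = w
    return graph

def generate_rules(graph, start, minconf, used):
    """DFS from `start`; emit rule Y => (start - Y) for each visited node Y
    whose weighted path product clears minconf (condition 1) and which is
    not simply redundant (condition 2, sketched with direct children)."""
    rules = []
    stack = [(start, 1.0)]
    while stack:
        node, prod = stack.pop()
        for child, w in graph.get(node, {}).items():
            p = prod * w
            if p < minconf:          # products only shrink deeper down
                continue
            stack.append((child, p))
            if set(graph.get(child, {})) & used.setdefault(start, set()):
                continue             # a child was already used: redundant
            used[start].add(child)
            rules.append((child, "".join(sorted(set(start) - set(child)))))
    return rules

# Supports from Tables 4.1-4.3 (restricted to the itemsets around ABD).
supports = {"A": 0.8, "B": 0.8, "D": 0.8, "AB": 0.6, "AD": 0.6, "BD": 0.6,
            "ABD": 0.4}
g = build_graph(supports, 2 / 3)     # the paper rounds 2/3 to 0.67
print(generate_rules(g, "ABD", 2 / 3, {}))
# -> [('AB', 'D'), ('AD', 'B'), ('BD', 'A')]
```

On the paper's example this reproduces exactly the three rules derived for ABD in Section IV (AB ⇒ D, AD ⇒ B, BD ⇒ A).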

IV. ILLUSTRATION OF EXISTING AND PROPOSED ALGORITHMS

Both algorithms are now illustrated with an example. The market basket data set taken has five transactions and five items. Let the minimum support be 0.4 and the minimum confidence 0.67. The large itemsets with support greater than or equal to 0.4, along with their support values, are shown in Tables 4.1 to 4.3.

Table 4.1: 1-large itemsets
ITEMS        SUPPORT
A = Bread    0.8
B = Milk     0.8
C = Beer     0.6
D = Diaper   0.8
F = Coke     0.4

Table 4.2: 2-large itemsets
ITEMS   SUPPORT
AB      0.6
AC      0.4
AD      0.6
BC      0.4
BD      0.6
BF      0.4
CD      0.6
DF      0.4


Table 4.3: 3-large itemsets
ITEMS   SUPPORT
ABD     0.4
ACD     0.4
BCD     0.4
BDF     0.4

A. Rule Generation from the Proposed Algorithm

The weights of the edges from frequent 1-itemsets to frequent 2-itemsets, and from frequent 2-itemsets to frequent 3-itemsets, are shown in Table 4.4. The weights are calculated in the following manner: let X be a k-itemset and Y a (k+1)-itemset; then the weight of the edge from X to Y equals the confidence of the rule X → (Y − X).

Table 4.4: Weights of the edges

Edges       Weights
A – AB      0.75
A – AC      0.5
A – AD      0.75
B – AB      0.75
B – BC      0.5
B – BD      0.75
B – BF      0.5
C – AC      0.67
C – BC      0.67
C – CD      1.0
D – DF      0.5
D – AD      0.75
D – BD      0.75
D – CD      0.75
AB – ABD    0.67
AC – ACD    1.0
AD – ABD    0.67
AD – ACD    0.67
BC – BCD    1.0
BD – BCD    0.67
BF – BDF    1.0
CD – ACD    0.67
CD – BCD    0.67
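Each weight in Table 4.4 is simply support(superset)/support(subset). A quick Python check (supports taken from Tables 4.1-4.3; the helper name is ours):

```python
supports = {"A": 0.8, "B": 0.8, "C": 0.6, "D": 0.8, "F": 0.4,
            "AB": 0.6, "AC": 0.4, "AD": 0.6, "BC": 0.4, "BD": 0.6,
            "BF": 0.4, "CD": 0.6, "DF": 0.4,
            "ABD": 0.4, "ACD": 0.4, "BCD": 0.4, "BDF": 0.4}

def weight(sub, sup):
    """Weight of the edge sub -> sup = confidence of the rule sub => (sup - sub)."""
    return round(supports[sup] / supports[sub], 2)

print(weight("A", "AB"))    # 0.6 / 0.8 = 0.75
print(weight("C", "CD"))    # 0.6 / 0.6 = 1.0
print(weight("AB", "ABD"))  # 0.4 / 0.6 = 0.67
```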

The lattice generated for the above example is shown in Figure 4.1, and the resultant graph in Figure 4.2.

Figure 4.1: Lattice structure

Figure 4.2: Graph generated for rule generation

We can see that there are more edges in the lattice generated for the same example; these extra edges are shown as dotted edges.

Figure 4.3: Generating the rules for the large itemset ABD

Applying depth-first search starting from the node ABD, node A is the first visited node, but the weighted product (0.67 × 0.75) of the path from A to ABD is less than the minimum confidence, so node A does not participate in rule generation. Node B, the second visited node, is excluded for the same reason. The next visited node is AB, and the weighted product of the path from AB to ABD is 0.67, which equals the minimum confidence. The child nodes of AB do not generate any rule, and AB has not been used by any of the parent nodes of ABD; thus all three conditions for rule generation are satisfied, and we generate the rule from AB: AB ⇒ D. The next visited node is D, but the weighted product of the path from D to ABD is below the minimum confidence, so no rule is generated there. The next visited node, AD, satisfies all three conditions, giving the rule AD ⇒ B. The next visited node, BD, also satisfies all three conditions, giving the rule BD ⇒ A. Generating rules similarly for the large itemsets ACD, BCD, BDF, AB, AD, BD, AC, BC, CD, BF and DF, we obtain the rules shown in Table 4.5 below.

Table 4.5: The rules generated

1.  AB => D
2.  AD => B
3.  BD => A
4.  C  => AD
5.  AD => C
6.  C  => BD
7.  BD => C
8.  F  => BD
9.  BD => F
10. A  => B
11. B  => A
12. A  => D
13. D  => A
14. B  => D
15. D  => B
16. D  => C
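Each rule in Table 4.5 can be checked directly against the supports, since confidence(X ⇒ Y) = support(X ∪ Y)/support(X). A quick Python check (supports from Tables 4.1-4.3; the helper name is ours, and we use 2/3 for the paper's rounded 0.67):

```python
supports = {"A": 0.8, "B": 0.8, "C": 0.6, "D": 0.8, "F": 0.4,
            "AB": 0.6, "AC": 0.4, "AD": 0.6, "BC": 0.4, "BD": 0.6,
            "BF": 0.4, "CD": 0.6, "DF": 0.4,
            "ABD": 0.4, "ACD": 0.4, "BCD": 0.4, "BDF": 0.4}

rules = [("AB", "D"), ("AD", "B"), ("BD", "A"), ("C", "AD"), ("AD", "C"),
         ("C", "BD"), ("BD", "C"), ("F", "BD"), ("BD", "F"), ("A", "B"),
         ("B", "A"), ("A", "D"), ("D", "A"), ("B", "D"), ("D", "B"),
         ("D", "C")]

def confidence(x, y):
    """confidence(X => Y) = support(X u Y) / support(X)."""
    union = "".join(sorted(set(x) | set(y)))
    return supports[union] / supports[x]

# Every generated rule clears the 2/3 (~0.67) minimum confidence.
print(all(confidence(x, y) >= 2 / 3 for x, y in rules))  # True
```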

B. Rules Generated from the Existing Algorithm

Generating the rules for the large itemset ABD: choose all ancestors of ABD that have support less than or equal to the value support(ABD)/c = 0.4/0.67 ≈ 0.6. AB, AD and BD are selected, giving the following lattice. It is easy to see that AB, AD and BD are the maximal ancestors of the directed graph shown in the figure. Hence we have three rules:

AB ⇒ D, AD ⇒ B, BD ⇒ A
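The existing algorithm's selection step can be sketched in a few lines of Python (the supports are those of the example; `ancestors` is our illustrative helper, and we use minconf = 2/3, which the paper rounds to 0.67): choose the ancestors of ABD whose support is at most support(ABD)/c, and the maximal ones among them yield the rules.

```python
supports = {"A": 0.8, "B": 0.8, "D": 0.8,
            "AB": 0.6, "AD": 0.6, "BD": 0.6, "ABD": 0.4}

def ancestors(itemset):
    """Prestored proper subsets of `itemset`."""
    return [x for x in supports if set(x) < set(itemset)]

target, minconf = "ABD", 2 / 3
bound = supports[target] / minconf          # 0.4 / (2/3) = 0.6
chosen = sorted(x for x in ancestors(target)
                if supports[x] <= bound + 1e-9)
print(chosen)  # ['AB', 'AD', 'BD'] -> rules AB => D, AD => B, BD => A
```

The support bound works because confidence(Y ⇒ ABD − Y) = support(ABD)/support(Y) ≥ c exactly when support(Y) ≤ support(ABD)/c.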

Figure 4.4: Directed graph in the adjacency lattice

A total of 16 rules is generated by both algorithms. It was found that no essential rules are missing in the proposed algorithm, and there is no redundancy in the rules generated.

C. Comparison of Algorithms

The complexity of the graph search algorithm is proportional to the size of the output.

Theorem: The number of edges in the adjacency lattice is equal to the sum of the number of parents of each primary itemset. Let N(I, s) be the number of primary itemsets in R(I, s). The size of the output in this case is N(I, s) · h(I, s), so the complexity of the existing algorithm is proportional to N(I, s) · h(I, s). In the proposed algorithm some edges are left that are not visited from their parents; let these nodes be denoted by L(I, s). The size of the output then becomes N(I, s) · h(I, s) − L(I, s), so the complexity of the proposed algorithm is proportional to N(I, s) · h(I, s) − L(I, s).

CONCLUSION AND FUTURE WORK

In this paper, data mining and one of its important techniques, association rule mining, are discussed; the issues related to association rule mining are described, and online mining of association rules is introduced to resolve them. Online association mining helps to remove redundant rules and gives a compact representation of the rules for the user. A new algorithm has been proposed for online rule generation. The advantage of this algorithm is that the graph it generates has fewer edges than the lattice used in the existing algorithm, and it also generates all the essential rules, with no rule missing. Future work will implement both the existing and proposed algorithms and test them on large datasets such as the Zoo dataset, the Mushroom dataset and synthetic datasets.

REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," Proc. ACM SIGMOD, 1993, pp. 207-214.

[2] Charu C. Agrawal and Philip S. Yu, "A New Approach to Online Generation of Association Rules," IEEE Trans. on Knowledge and Data Engineering, vol. 13, no. 4, pp. 327-340, 2001.

[3] Dao-I Lin and Zvi M. Kedem, "Pincer Search: An Efficient Algorithm to Find the Maximal Frequent Itemset," IEEE Trans. on Knowledge and Data Engineering, no. 3, pp. 333-344, May/June 2002.

[4] B. Liu, W. Hsu, and Y. Ma, "Mining association rules with multiple minimum supports," Proc. Fifth ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, pp. 337-341, New York, 1999, ACM Press.

[5] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," Proc. ACM SIGMOD Conf. on Management of Data, Washington, DC, May 1993.

[6] Ramakrishnan Srikant, Quoc Vu, and Rakesh Agrawal, "Mining association rules with item constraints," Proc. 3rd Int'l Conf. on Knowledge Discovery and Data Mining (KDD-97), Newport Beach, California, August 1997.

[7] Rakesh Agrawal and Ramakrishnan Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int'l Conf. on Very Large Data Bases (VLDB), 1994.

[8] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Academic Press, 2001.
