Maximal Frequent Subgraph Mining
Thesis submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in
Computer Science and Engineering
by
Lini Teresa Thomas
200599002
Center for Data Engineering
International Institute of Information Technology
Hyderabad - 500 032, INDIA
August 2010
Copyright © Lini Teresa Thomas, 2010
All Rights Reserved
International Institute of Information Technology
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Maximal Frequent Subgraph Mining” by Lini Teresa Thomas, has been carried out under my supervision and is not submitted elsewhere for a degree.
Date Advisor: Prof. Kamalakar Karlapalem
Acknowledgments
First and foremost, I thank my advisor Prof. Kamalakar Karlapalem. It has been an honor to be his Ph.D. student. His strong support has been a constant assurance to me throughout my academic life under him, making everything look solvable and possible. My colleague Satyanarayana Valluri, with whom I have spent many hours discussing and working, contributed greatly to making this thesis fun to work on. Many thanks to my family, who encouraged my joining the program and have strongly backed my completing the work. The International Institute of Information Technology, Hyderabad has brought me in touch with a new world of ideas, research and exceptional people who have contributed in a big way to my academic and personal growth. Finally, I thank my youngest and most silent supporter - my month-old daughter Meera.
Abstract
The area of graph mining deals with mining frequent subgraph patterns, graph classification, graph clustering, graph partitioning, graph indexing and so on. In this thesis, we focus only on the area of frequent subgraph mining and, more precisely, on maximal frequent subgraph mining.
The exponential number of possible subgraphs makes the problem of frequent subgraph mining a challenge. The set of maximal frequent subgraphs is much smaller than the set of frequent subgraphs, providing ample scope for pruning. MARGIN is a maximal subgraph mining algorithm that moves among promising nodes of the search space along the “border” between the infrequent and frequent subgraphs. This drastically reduces the number of candidate patterns in the search space. Experimental results validate the efficiency and the utility of the proposed technique. MARGIN-d is an extension of the MARGIN algorithm that finds disconnected maximal frequent subgraphs. A theoretical comparison with Apriori-like algorithms, together with an analysis, is also presented in this thesis.
Further, some frequent subgraph mining problems admit simpler solutions outside the area of frequent subgraph mining, for instance by applying itemset mining to graphs with unique edge labels. This can reduce the computational cost drastically. A solution to finding maximal frequent subgraphs for graphs with unique edge labels using itemset mining is presented in the thesis.
Contents
1 Introduction
    1.1 Organisation and Contributions
2 Related Work
3 MARGIN: Maximal Frequent Subgraph Mining
    3.1 Preliminary Concepts
    3.2 Lattice Representation Models
    3.3 Intuition
    3.4 The MARGIN Algorithm
    3.5 MARGIN-d: Finding disconnected maximal frequent subgraphs
    3.6 Proof of Correctness
        3.6.1 Running Example for the proof
    3.7 Optimizing MARGIN
        3.7.1 Discussion
        3.7.2 Optimizations
        3.7.3 The Replication and Embedded Model
4 Experimental Evaluation
    4.1 Analysis of MARGIN
    4.2 Comparing MARGIN with gSpan
    4.3 Comparing MARGIN with SPIN
    4.4 Comparing MARGIN with CloseGraph
    4.5 Analysis of MARGIN-d
5 Applications of Margin
    5.1 Applications of the ExpandCut technique
        5.1.1 Framework for applying ExpandCut
        5.1.2 Using ExpandCut to process OLAP queries
    5.2 RNA Margin
    5.3 Real-life dataset
        5.3.1 Page Views from msnbc.com
        5.3.2 Stock Market Data
        5.3.3 Chemical Compound Datasets
6 ISG: Mining maximal frequent graphs using itemsets
    6.1 Our Approach
    6.2 The ISG Algorithm
        6.2.1 Conversion of Graphs to Itemsets
        6.2.2 Conversion of Maximal Frequent Itemsets to Graphs
        6.2.3 Pruning Phase
        6.2.4 Illustration of an Example
        6.2.5 Proof of Correctness
    6.3 Results
        6.3.1 Results on Synthetic Datasets
        6.3.2 Results on Real-life Dataset
7 Conclusions
Bibliography
List of Figures
1.1 Search Space Explored
3.1 Graph Database D = {G1, G2}
3.2 Lattices L1, L2
3.3 Embedded Model
3.4 Upper-3-Property
3.5 The Steps in ExpandCut
3.6 Graph Lattice for Disconnected Graphs
3.7 Case 1: Lattice Lsub_i: P1 ∩ P2 ≠ ∅
3.8 Case 2: Lattice Lsub_i: P1 ∩ P2 = ∅
3.9 Path PathP = (Vp, Ep, Λ, λp)
3.10 The Reachability Sequence between cuts
3.11 Example Lattice
3.12 Constructed Lattice
3.13 Exploiting the commonality of the f(†)-nodes
4.1 Number of Expand Cuts
4.2 Runtime of MARGIN
4.3 Runtime of MARGIN for exact support count
4.4 Running time with 2% Support
4.5 Comparison with gSpan: Effect of size of frequent graphs
4.6 Comparison of MARGIN with SPIN
4.7 Ratio of SPIN to MARGIN
4.8 Time comparison with CloseGraph. I = 20, T = 25
4.9 Time comparison with CloseGraph. I = 25, T = 30
5.1 Data Cube Lattice
5.2 RNA Graph 1
5.3 RNA Graph 2
6.1 Overview of ISG Algorithm
6.2 Example Graph Database
6.3 Maximal Frequent Subgraphs for the Example (Support=2)
6.4 The Transaction Database
6.5 Cases in Edge Extension
6.6 Extending the converse edge triplet
6.7 Running Example of the ISG algorithm
List of Tables
1.1 Lattice Space Explored
3.1 Algorithm Margin
3.2 Algorithm ExpandCut(LF, C † P)
4.1 MARGIN: Lattice Space Explored
4.2 Effect of Optimizations
4.3 Running time with support 20
4.4 Comparison of gSpan and MARGIN for high edge-to-vertex ratio (EVR): D=100-300
4.5 Comparison of gSpan and MARGIN for high edge-to-vertex ratio (EVR): D=400,500
4.6 Comparison of gSpan and MARGIN for low edge-to-vertex ratio (EVR): D=100-300, EVR=0.9
4.7 Comparison of gSpan and MARGIN for low edge-to-vertex ratio (EVR): D=400-500, EVR=0.9
4.8 Comparison of gSpan and MARGIN for various edge-to-vertex ratios (EVR)
4.9 Lattice Space Explored
4.10 Comparison of MARGIN and SPIN
4.11 Memory Comparison of gSpan, SPIN and MARGIN
4.12 Generic Operations
4.13 Effect of varying I
4.14 Effect of varying E, V
4.15 Effect of Varying Database Size D
5.1 Finding Maximal Attribute sets for 15 feature attributes
5.2 Finding Maximal Attribute sets for 20 feature attributes
5.3 Running MARGIN on the RNA database
5.4 Running time on web data
5.5 Comparison of Margin and gSpan on stock data
5.6 Comparison using the Chem340 dataset
5.7 Comparison using the Chem422 dataset
6.1 Mapping of edges of graphs in Figure 6.2 to unique item id
6.2 Results with varying D
6.3 Results with varying I and T
6.4 Results on Stocks Data
Chapter 1
Introduction
It is common to model complex data with the help of graphs consisting of nodes and edges that are often labeled to store additional information. Representing such data in the form of a graph can help visualise relationships between entities that are convoluted when captured in relational tables. Graph mining spans indexing graphs, finding frequent patterns, finding inexact matches, graph partitioning and so on. Washio and Motoda [63] state the five theoretical bases of graph-based data mining approaches as subgraph categories, subgraph isomorphism, graph invariants, mining measures and solution methods. The subgraphs are categorized into various classes, and the approaches of graph-based data mining strongly depend on the targeted class. Subgraph isomorphism is the mathematical basis of substructure matching and/or counting in graph-based data mining. Graph invariants provide an important mathematical criterion for efficiently reducing the search space of the targeted graph structures in some approaches. Furthermore, the mining measures define the characteristics of the patterns to be mined, similar to conventional data mining.
For a graph of e edges, the number of possible frequent subgraphs can grow exponentially in e. Further, since the core operation of subgraph isomorphism is NP-complete, it is critical to minimize the number of subgraphs that need to be considered when finding frequency counts, or to use other strategic methods, such as information about the occurrences of a subgraph in the database, that avoid subgraph isomorphism. These factors make graph mining challenging. Given a set of graphs D = {G1, G2, . . . , Gn}, the support of a graph g is defined as the fraction of graphs in D in which g occurs. The graph g is frequent if its support is at least a user-specified threshold. Mining frequent structures in a set of graphs is an important problem with applications to chemical and biological data, XML documents, web link data, financial transaction data, social network data, protein folding data, etc. In molecular data sets, nodes correspond to atoms and edges represent chemical bonds. In the area of drug discovery, chemical compounds are modeled as graphs, and graph mining is used in classification [26] and to find frequent substructures [25]. Interesting patterns like web communities can be mined by using the link information of the web viewed as a graph [44].
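As a minimal sketch of the support computation (illustrative, not the thesis's implementation), suppose graphs have unique node labels, so that an occurrence of g in a database graph reduces to labeled-edge-set containment; without that assumption, a full subgraph isomorphism test would be needed in place of the containment check:

```python
# Hypothetical representation: each graph is a set of labeled edges, where an
# edge is a (node_label_u, node_label_v, edge_label) triple. The containment
# shortcut below is only valid under the unique-node-label assumption.

def support(g_edges, database):
    """Fraction of database graphs whose labeled edge set contains g_edges."""
    g = frozenset(g_edges)
    hits = sum(1 for d in database if g <= frozenset(d))
    return hits / len(database)

D = [
    {("A", "B", "x"), ("B", "C", "y")},
    {("A", "B", "x")},
    {("B", "C", "y"), ("C", "D", "z")},
]
g = {("A", "B", "x")}
sigma = support(g, D)          # g occurs in 2 of the 3 graphs
is_frequent = sigma >= 0.5     # frequent under a 50% support threshold
```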
One performance issue in mining graph databases is the large number of recurring patterns. The large number of frequent subgraphs reduces not only the efficiency but also the effectiveness of mining, since users have to process a large number of subgraphs to find useful ones. A subgraph X is a closed subgraph if there exists no subgraph X′ such that X′ is a proper supergraph of X and every transaction containing X also contains X′. A closed subgraph X is frequent if its support is greater than the user-given support threshold. All frequent subgraphs, along with their corresponding support values, can be derived from the set of closed subgraphs. However, the closed subgraphs are far fewer than the frequent ones. Also, association rules extracted from closed sets have been proved to be more concise and meaningful, because redundancies are discarded. However, the number of closed frequent subgraphs is still much larger than the number of maximal frequent subgraphs, and many applications require the maximal frequent subgraphs alone. This gave rise to the area of mining maximal frequent subgraphs.
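The distinction between closed and maximal patterns can be sketched on itemsets, where containment is cheap to test (the data and names below are illustrative, not from the thesis):

```python
# Sketch (itemset stand-in): a pattern is *closed* if no super-pattern has the
# same support, and *maximal* if no super-pattern is frequent at all, so the
# maximal patterns form a subset of the closed ones.

freq = {                       # frequent pattern -> support (hypothetical)
    frozenset("a"): 4,
    frozenset("b"): 3,
    frozenset("ab"): 3,
    frozenset("abc"): 2,
}

closed = {p for p, s in freq.items()
          if not any(p < q and s == t for q, t in freq.items())}
maximal = {p for p in freq
           if not any(p < q for q in freq)}

# {'b'} is not closed: its superset {'a','b'} has the same support (3).
# Only {'a','b','c'} is maximal: every other pattern has a frequent superset.
```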
A frequent subgraph is maximal if none of its supergraphs are frequent. Mining only maximal frequent subgraphs offers the following advantages in processing large graph databases [36].
• It significantly reduces the total number of mined subgraphs. The total number of frequent subgraphs can be up to one thousand times greater than the number of maximal frequent subgraphs. Thus, the time and space spent on mining are significantly reduced.
• All the frequent subgraphs can be generated from the set of maximal frequent subgraphs. The support of a frequent subgraph is certain to be at least as high as the support of the maximal subgraphs containing it. The actual support can be counted by scanning the graph database. Alternatively, the techniques used in [15] can easily be adapted to approximate the support of all frequent subgraphs within some error bound.
• For certain applications, the set of maximal frequent subgraphs itself is of most interest. Maximal frequent substructures may provide insight into the behaviour of a molecule. For example, researchers exploring immunodeficiency viruses may gain insight into recombinants via commonalities within initial replicants. In addition, they can be used to identify previously unknown groups, affording new information. Graph-based methods can be applied to proteins to predict folding sequences, and also to find maximal motifs in RNAs. Other applications that require the computation of maximal frequent subgraphs are mining contact maps [34], finding maximal frequent patterns in metabolic pathways [45], and finding sets of large cohesive web pages. Similarly, given a set of web communities (e.g., Orkut), the maximal frequent subgraphs would yield the largest groups of users, together with their relationships, who belong to many communities in common and hence are likely to have similar interests.
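The second advantage above, regenerating all frequent patterns from the maximal ones, can be sketched with an itemset stand-in (illustrative data; for graphs, the enumeration step would be subgraph enumeration rather than subset enumeration):

```python
from itertools import combinations

# Sketch: every frequent pattern is a sub-pattern of some maximal frequent
# pattern, so the full frequent set can be regenerated by enumerating the
# non-empty sub-patterns of each maximal.

def frequents_from_maximals(maximals):
    out = set()
    for m in maximals:
        items = sorted(m)
        for k in range(1, len(items) + 1):
            out.update(frozenset(c) for c in combinations(items, k))
    return out

maximals = [frozenset("abc"), frozenset("bd")]
all_frequent = frequents_from_maximals(maximals)
# {'a','b'} is recovered because it lies inside the maximal {'a','b','c'}
```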
A typical approach to the frequent subgraph mining problem has been to find frequent subgraphs incrementally in an Apriori manner. Canonical labeling has been adopted to reduce the number of times each candidate is generated [65, 35]. The Apriori-based approach has been further modified to suit closed subgraph mining [66] and maximal subgraph mining [36] with added pruning.
Apart from MARGIN, SPIN [36] is the only algorithm in the graph mining literature that explicitly mines maximal frequent subgraphs. It is of course possible to generate all the frequent or closed frequent subgraphs and then prune them to get the maximal frequent subgraphs. This can be expected to be highly inefficient compared to algorithms developed specifically for the problem of maximal frequent subgraph mining.
Figure 1.1 Search Space Explored (figure labels: Graph Lattice, Apriori Based Search Space, Margin Search Space, f-cut-nodes, Finding Representative)
The maximal frequent subgraphs lie in the middle of the lattice, such that all the subgraphs below the maximal frequent subgraphs are frequent and the ones above are infrequent. Finding maximal subgraphs using Apriori methods requires a bottom-up traversal of the lattice, wherein the frequent subgraphs that are not maximal, though of no interest, need to be explored and processed in order to reach the maximal frequent subgraphs. It was hence of interest to check whether it is possible to avoid exploring the space above or below this boundary and yet somehow attain the set of maximal frequent subgraphs, which form something like a border between the sets of frequent and infrequent subgraphs in the lattice. We developed a technique that would, in some manner, locate one maximal frequent subgraph, and then jump from one maximal frequent subgraph to another, exploring all the maximal frequent subgraphs of the lattice. Such a method should minimise the exploration of any node that is not a member of the set of maximal frequent subgraphs.
For a graph G, the graph lattice of G is a graph where the connected subgraphs of G form the nodes of the lattice. An edge occurs between two nodes of the lattice if the subgraph represented by one node is an immediate subgraph of the subgraph represented by the other node. We can visualise the graph lattice as having levels, where the bottom-most level is the empty subgraph, the next higher level has all single-node subgraphs of G, and so on, with the top-most node being G itself. If level l has subgraphs of n edges, then level l + 1 has subgraphs with n + 1 edges. An edge exists between two subgraphs in the lattice if one is the supergraph of the other formed by adding one edge. Hence, edges exist only between two consecutive levels of the lattice. The graph lattice is defined in detail in the next chapter.
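A toy sketch of the edge-based lattice levels for a small unlabeled graph (illustrative only: the thesis's lattice also includes single-node subgraphs and the empty subgraph, and practical miners never materialise the full lattice):

```python
from itertools import combinations

# Level k holds the connected k-edge subgraphs of G; adjacent levels are
# related by single-edge extension.

def is_connected(edges):
    """Breadth-first check that the edge set induces one connected component."""
    if not edges:
        return True
    nodes = {v for e in edges for v in e}
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        stack += [w for a, b in edges for w in (a, b)
                  if u in (a, b) and w != u]
    return seen == nodes

G = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]  # triangle plus a tail
lattice = {k: [set(c) for c in combinations(G, k) if is_connected(c)]
           for k in range(1, len(G) + 1)}
# Level 1 has 4 subgraphs (each single edge); level 4 is G itself.
```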
The subgraphs that are likely to be maximal frequent are the n-edge frequent subgraphs that have an (n + 1)-edge infrequent supergraph. We refer to such a set of nodes in the lattice as the set of f(†)-nodes (Figure 1.1), which forms our candidate set. The set of maximal frequent subgraphs is the subset of this candidate set whose members have only infrequent (n + 1)-edge supergraphs. We propose the MARGIN algorithm, which computes such a candidate set efficiently. In a post-processing step, our algorithm then finds all maximal frequent subgraphs by retaining only those subgraphs all of whose immediate supergraphs are infrequent. The ExpandCut step invoked within the MARGIN algorithm recursively finds the candidate subgraphs by jumping from one candidate node to another.
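Using itemsets as a stand-in for lattice nodes (so that "adding an edge" becomes "adding an item"), the candidate set and the post-processing filter can be sketched as follows; the frequent set here is hypothetical, and MARGIN itself computes the candidates without scanning all frequent patterns:

```python
# f(†)-nodes: frequent patterns with at least one infrequent one-step
# extension. Maximal patterns: those ALL of whose extensions are infrequent.

ITEMS = "abcd"
FREQUENT = {frozenset(s) for s in ["a", "b", "c", "d", "ab", "ac", "bc", "abc"]}

def extensions(p):
    """All one-item extensions of pattern p (the lattice's upward edges)."""
    return [p | {x} for x in ITEMS if x not in p]

f_nodes = {p for p in FREQUENT
           if any(e not in FREQUENT for e in extensions(p))}
maximal = {p for p in f_nodes
           if all(e not in FREQUENT for e in extensions(p))}
# {'d'} and {'a','b','c'} survive the post-processing step: every extension
# of each is infrequent.
```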
Apriori-based algorithms enumerate the frequent subgraphs or the maximal frequent subgraphs in a bottom-up manner and prune using the property that an infrequent subgraph cannot have a frequent supergraph. Hence, in order to find maximal frequent subgraphs, the whole frequent subgraph space needs to be explored. The SPIN [36] algorithm applies further pruning on the set of frequent subgraphs in order to find the maximal frequent subgraphs. The search space of Apriori-based algorithms corresponds to the region below the f(†)-nodes in the graph lattice, as shown in Figure 1.1. On the other hand, MARGIN explores a much smaller search space by visiting the lattice around the f(†)-nodes. It can be clearly seen in the figure that if the border separating the frequent and infrequent subgraphs, which is formed by the f(†)-nodes, lies in the lower sections of the lattice, then the space explored by MARGIN and by Apriori-based algorithms may be approximately the same. However, for lower supports, where the border between the frequent and infrequent subgraphs lies higher up the lattice, the search space is distinctly larger for Apriori-based algorithms. Table 1.1 shows the number of nodes visited by MARGIN and SPIN [36] for the two input parameters of the algorithm, namely, database size and support value. The variable D in column one refers to the number of graphs in the database, and “Support%” refers to the percentage of graphs in the database in which a subgraph should occur in order to qualify as frequent. Experimental results such as those in Table 1.1 show that our algorithm explores about one-fifth of the nodes explored by SPIN [36].
E = 10, V = 10, L = 10, I = 5, T = 6

D (Support%)    SPIN      MARGIN
100 (2)         43,861     9,311
200 (2)         54,026    10,916
300 (2)         57,697    14,954
400 (2)         58,929    42,201
100 (5)         42,584     9,930
200 (5)         49,767    12,318
300 (5)         52,118    24,660
400 (5)         54,726    44,686
500 (2)         32,556    12,619
500 (3)         21,669    10,078
500 (4)          9,187     9,912
500 (5)          4,162     8,264

Table 1.1 Lattice Space Explored
We prove that our algorithm finds all the nodes that belong to the candidate set. We show that any two candidate nodes are reachable from one another using the ExpandCut function used to explore the candidate nodes. Given that the graph lattice is dense and several paths connect any two candidate nodes, one of the main challenges of the proof is the abstraction of the sublattice that contains the path taken by the ExpandCut function. By the construction of such a sublattice, we show that it is guaranteed to exist. A reachability sequence is then exhibited which, starting at any candidate node, reaches another candidate node.
We study the behaviour of the MARGIN algorithm on synthetic datasets for combinations of dataset parameters and support values. We present results of our experiments on real-life datasets, namely, the records of page views by the users of msnbc.com and stock market data. We further compare the performance of the MARGIN algorithm with known standard techniques like gSpan, SPIN [36] and CloseGraph [66].
We further show in our work that the proposed algorithm can be generalised and applied to a wider range of lattice-based problems. We study the properties that make such a horizontal sweep of the lattice, moving along the “border” between the frequent and infrequent subgraphs, possible, and generalise the solution so that it can be adapted to various other applications.
The problem of maximal frequent subgraph mining can be more or less complex depending on the kind of input dataset. If the graph database is known to satisfy some constraint, such as constraints on the size of graphs, the node or edge labels of graphs, or the degree of nodes, it might be possible to solve the problem of subgraph mining without using any graph techniques, by exploiting the properties of the constraint. Though the class of such graphs might be small, this problem is worth investigating since it can lead to an immense reduction in run time.
The complexity of itemset mining algorithms is much lower than that of graph mining algorithms for the following reasons: (i) in itemset mining algorithms, an itemset can be generated exactly once by using a lexicographical order on the items, whereas in order to avoid multiple explorations of the same subgraph, graph mining algorithms use expensive canonical form computation; (ii) the frequency of an itemset can be determined by tid-list intersection of certain sub-itemsets, while even the presence of all subgraphs of a graph g1 in another graph g2 does not guarantee the presence of g1 in g2, so detecting the presence of a subgraph in a graph requires additional operations; and (iii) subgraph isomorphism being NP-complete makes it critical to minimize the number of subgraphs that need to be considered when finding frequency counts, or to use other strategic methods, such as information about the occurrences of the subgraph in the database, that avoid subgraph isomorphism. Finding whether an itemset is a sub-itemset of another is trivial compared to determining whether a graph is a subgraph of another.
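Point (ii), the cheap frequency step available to itemset miners, can be sketched as follows (the tid-lists are illustrative):

```python
# The tid-list of an itemset is the intersection of the tid-lists of its
# items; its support is simply the size of that intersection. No analogous
# shortcut exists for graphs, where occurrence needs subgraph isomorphism.

tidlists = {             # item -> set of transaction ids containing it
    "a": {1, 2, 3, 5},
    "b": {2, 3, 4},
    "c": {1, 2, 3},
}

def tids(itemset):
    """Transaction ids containing every item of the itemset."""
    it = iter(itemset)
    acc = set(tidlists[next(it)])
    for item in it:
        acc &= tidlists[item]
    return acc

support_abc = len(tids({"a", "b", "c"}))   # transactions 2 and 3
```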
Since itemset mining algorithms are known to be less complex than graph mining algorithms, we show how this fact can be used effectively to reduce the time and memory of graph mining algorithms. In Chapter 6 we present an itemset-based subgraph mining algorithm (ISG) for graphs with unique edge labels, which finds the maximal frequent subgraphs by invoking a maximal itemset mining algorithm.
1.1 Organisation and Contributions
• In Chapter 2, we give details of related work in the area of subgraph mining and the utility of maximal frequent subgraph mining.
• In Chapter 3, we first present the preliminary definitions and terminology that will be used throughout the MARGIN algorithm, followed by a brief discussion in Section 3.2 of the two models in which the graph database is viewed and processed in our work. Next, we give the intuition behind the algorithm in Section 3.3 with a running example. We present the naive MARGIN algorithm in Section 3.4, which gives connected maximal frequent subgraphs. The MARGIN algorithm can be modified to give disconnected maximal frequent subgraphs, which is presented in Section 3.5.
• In Section 3.6, we present a proof showing that all maximal frequent subgraphs are found by our technique. The proof requires the generalisation of a graph lattice to show that, given an f(†)-node, all the f(†)-nodes can be reached. The pruning step to find all the maximal frequent subgraphs from the set of f(†)-nodes is trivial.
• The MARGIN algorithm presented in Chapter 3 is the naive version of the algorithm, which can be further optimised to give a significant saving in running time. In Section 3.7, we give the optimisation techniques that can be applied. We further present performance results on applying the optimisation techniques to the naive algorithm.
• Given an RNA database, it is interesting to identify characteristics that are common across RNA molecules. RNA motifs are believed to provide the ultimate basis for understanding RNA structure and function. Motif identification is closely related to finding maximal frequent subgraphs of the RNA database. We apply the MARGIN algorithm to a ribonucleic acid (RNA) database in Section 5.2.
• In Section 5.1, we discuss further applications of the MARGIN algorithm. It can be seen that the framework of the ExpandCut function can be applied to problems that meet a certain set of criteria, which is explored in Section 5.1.1. This widens the scope of application of the MARGIN algorithm. The MARGIN algorithm itself can be applied to disconnected maximal frequent subgraph mining as well. The problem of finding disconnected maximal frequent subgraphs is more complex than when the reported maximals are required to be connected. The modification of MARGIN to compute disconnected maximal frequent subgraphs is discussed in Section 3.5.
• In Chapter 4, we present experimental results on both synthetic and real-life datasets. We first analyse the behaviour of MARGIN for varying support values, database sizes, graph sizes, number and size of frequent subgraphs, edge and vertex labels, etc. We then compare MARGIN with SPIN [36] and with other subgraph mining algorithms like gSpan and CloseGraph [66]. We also present results of the extension of MARGIN to finding maximal disconnected frequent subgraphs. We further apply MARGIN to real-life datasets like page views from msnbc.com, stock market data from the Yahoo Finance stock site [8], and widely used chemical datasets available at [2].
• In Chapter 6, we propose a new algorithm, ISG, that mines maximal frequent subgraphs of a constrained graph database using a maximal itemset mining algorithm. The algorithm opens up the possibility of avoiding expensive subgraph mining techniques for databases that satisfy constraints. The experimental results show that ISG performs significantly better than existing methods, demonstrating the utility of our approach.
Chapter 2
Related Work
Research on pattern discovery has advanced from itemset and sequence mining to mining trees, lattices and graphs. The initial work on applying Apriori techniques to structured data was done by Agrawal and Srikant [12] for sequence mining.
The earliest work on finding subgraph patterns took a greedy approach to avoid the high complexity of subgraph isomorphism, developed in SUBDUE [23] and by K. Yoshida et al. [69]. These did not report the complete set of frequent subgraphs. SUBDUE [23] compresses the original graph using the minimum description length principle. However, due to their heuristic, greedy-solution-based approaches, these methods miss significant patterns.
WARMR [24] reported the complete set of frequent substructures in graphs. WARMR used ILP-based methods along with the non-monotone Apriori property. WARMR, however, cannot be extended to disconnected frequent subgraph mining. Also, the approach has very high computational complexity due to the equivalence checking it performs. In 1999, A. Inokuchi et al. [38] came up with an approach to find frequent subgraphs in a graph database. The scope of this work was limited to graphs that have unique node labels. Apriori-based Graph Mining (AGM) [37] was a computationally efficient algorithm proposed to find frequent induced subgraphs. AGM uses an Apriori-like technique to extend subgraphs by adding one vertex at a time. The algorithm performs join-based candidate generation and finds the normalised form of the adjacency matrix of each generated subgraph. However, since the same graph can have multiple normalised forms depending on the subgraphs from which it was generated, the normalised form cannot be used to uniquely represent a subgraph. Hence, the algorithm gives a method to convert normalised graphs into a canonical form in order to efficiently find the frequency of a subgraph in the graph database. The Frequent Subgraph Mining Algorithm (FSG) [47] enumerates all the frequent subgraphs, not just the induced frequent subgraphs, by the join operation. It generates supergraphs by adding one edge at a time. FSG, however, has to address both graph isomorphism and subgraph isomorphism. Hence, when the number of edge and vertex labels is low, pruning becomes difficult and the performance deteriorates. One main reason for AGM's slow computation is that AGM produces all possible combinations of induced frequent subgraphs while FSG generates only frequent connected subgraphs [47].
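The Apriori (downward-closure) property that AGM and FSG exploit, namely that every sub-pattern of a frequent pattern must itself be frequent, can be illustrated with a minimal level-wise miner. The sketch below uses itemsets in place of subgraphs, so its join and prune steps mirror the join-based candidate generation described above without the cost of isomorphism checks (the function names are ours, not from the cited papers):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise mining: a (k+1)-candidate is kept only if every
    k-subset of it is frequent (the Apriori property)."""
    items = sorted({i for t in transactions for i in t})
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)
    frequent = [frozenset([i]) for i in items
                if support(frozenset([i])) >= min_sup]
    all_frequent = list(frequent)
    k = 1
    while frequent:
        # join step: merge two frequent k-sets into a (k+1)-candidate
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        # prune step: discard candidates with an infrequent k-subset
        fk = set(frequent)
        candidates = [c for c in candidates
                      if all(frozenset(s) in fk
                             for s in combinations(c, k))]
        frequent = [c for c in candidates if support(c) >= min_sup]
        all_frequent.extend(frequent)
        k += 1
    return all_frequent

db = [frozenset("abc"), frozenset("abd"), frozenset("ab"), frozenset("cd")]
result = apriori(db, min_sup=2)
```

With `min_sup = 2` the four singletons and the itemset {a, b} survive; {c, d} is pruned by the support check, exactly the kind of pruning the normalised/canonical forms enable for subgraphs.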
Enumerating all the subgraphs can be done either by join operations as in AGM [37] and FSG [47], by extension operations as in C. Borgelt and M. R. Berthold [13] and gSpan [65], or by both as in J. Huan et al. [35]. Join-based algorithms suffer from a very expensive candidate generation phase. Frequency computation is also computationally very expensive due to subgraph isomorphism checks. Itemset mining uses hash trees in order to find the presence of an itemset in a transaction; it is challenging to build such a hash tree for graphs [47]. Apriori-based algorithms map each pattern to a unique canonical label. By using these labels, a complete order relation is imposed over all possible patterns, which ensures that only one isomorphic form is extended [65]. Graph-Based Substructure Pattern Mining (gSpan) [65] follows a depth-first approach where a generated subgraph is checked against previously generated subgraphs, using the DFS code, to know whether it has been generated before. Each edge of a DFS code is a five-tuple (i, j, li, li,j, lj), with i and j being the identifier numbers of its incident nodes (assigned by their order of appearance in the traversal), li and lj being their labels and li,j being the label of the edge. The smallest DFS code representation of a graph is defined as its minimum DFS code, which is unique and is used as the canonical form for the mining process. gSpan traverses the graph database in a depth-first manner by extending every subgraph that was found to have a minimum DFS code and was found to be frequent. gSpan combines frequent subgraph extension and subgraph isomorphism into one procedure, which is the reason for its speed-up. Also, the expensive candidate generation done by join-based algorithms is completely avoided. The memory usage of breadth-first search (BFS) based algorithms is high, as the subgraphs of every level have to be saved in order to generate the next level of subgraphs.
Further, gSpan [65] mentions a clear speed-up due to the shrinking of the graphs in each iteration. This is achieved by not considering edges of lower lexicographic order during the depth-first extensions of a given single edge.
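The canonical-form check at the heart of gSpan can be sketched as follows: each traversal edge is the five-tuple (i, j, li, li,j, lj) described above, a DFS code is the sequence of such tuples, and the canonical form is the lexicographically smallest code. The sketch uses plain tuple ordering and ignores gSpan's special treatment of forward versus backward edges, so it illustrates the idea rather than the exact ordering of [65]:

```python
def min_dfs_code(codes):
    """Return the lexicographically smallest DFS code: the canonical
    form used to decide whether a subgraph was generated before.
    (Simplified: plain tuple order, not gSpan's full edge ordering.)"""
    return min(codes)

# Two DFS codes of the same labeled triangle a-b-a (all edge labels 'x'),
# enumerated from different start vertices.
code_from_a = [(0, 1, 'a', 'x', 'b'), (1, 2, 'b', 'x', 'a'), (2, 0, 'a', 'x', 'a')]
code_from_b = [(0, 1, 'b', 'x', 'a'), (1, 2, 'a', 'x', 'a'), (2, 0, 'a', 'x', 'b')]
canonical = min_dfs_code([code_from_a, code_from_b])
```

Only the subgraph whose enumeration yields the minimum code is extended, so each isomorphism class is grown exactly once.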
J. Huan et al. [35] use a canonical ordering on the sequence of lower triangular entries which prunes isomorphic graphs. M. Cohen and E. Gudes [22] formulate a frequent graph mining framework in terms of reverse depth search and a prefix-based lattice. GASTON [50] efficiently finds frequent subgraphs in order of increasing complexity by searching first for frequent paths, then frequent undirected trees and finally cyclic graphs. J. Wang et al. [61] use a partition-based approach to find frequent subgraphs for databases that do not fit into main memory. G. Buehrer et al. [14] show how allowing the state of the system to influence the behavior of the algorithm can greatly improve parallel graph mining performance in the context of CMP architectures.
M. Kuramochi and G. Karypis [49] attack the tougher problem of frequent subgraph mining where the frequency of a graph is the total number of its occurrences in the graph database (single-graph setting), unlike gSpan [65], J. Huan et al. [35], AGM [37] and J. Pei et al. [53], wherein frequency refers to the number of graph transactions that a subgraph occurs in (graph-transaction setting). Solutions for subgraph mining in the single-graph setting can be applied to the graph-transaction setting but not vice versa. Y. Ke et al. [43] and H. Tong et al. [59] propose the new problem of correlation mining for graph databases given a query graph. ORIGAMI [31] mines a representative set of graph patterns. T. Washio
and H. Motoda [63] provide a comprehensive survey of graph-based data mining approaches since the mid-1990s.
Considerable work has gone into closed itemset mining [54, 72]. The exponential number of frequent subgraphs has led to an interest in closed graphs [66, 62, 74], a much smaller set than the set of frequent graphs. An even smaller set is the set of maximal frequent subgraphs.
Finding maximal frequent itemsets has been a well-explored problem [15, 28, 29]. The maximal frequent subgraph mining problem has been addressed in SPIN [36]. The spanning tree based maximal graph mining (SPIN) algorithm [36] first mines all the frequent tree patterns from a graph database and then constructs the maximal frequent subgraphs from the frequent trees. This approach offers asymptotic advantages compared to using subgraphs as building blocks, since tree normalization is a simpler problem than graph normalization. Tree normalization is a sequence of transformations that transforms an expression from one form into an equivalent one, which helps in deciding whether two subgraphs/trees are the same and hence whether one has been visited earlier. There are two important components in the framework. The first is a graph partitioning method through which all frequent subgraphs are grouped into equivalence classes based on the spanning trees they contain. The second is a set of pruning techniques which aim to remove some partitions entirely or partially, for the purpose of finding the maximal frequent subgraphs only. Trees are then expanded to cyclic graphs by searching their search spaces, and the maximal frequent subgraphs are constructed from the frequent ones. SPIN [36] also introduces some optimization techniques to improve the search for the maximal frequent subgraphs. S. Srinivasa and L. BalaSundaraRaman [57] use a filtration-based approach that starts by assuming all graphs to be isomorphic. It removes edges from graphs that contradict this assertion until it converges to the maximal frequent subgraphs.
In our work (Chapter 3), we focus on undirected connected labeled graphs, but it is trivial to adapt our algorithm to directed graphs, disconnected graphs (Chapter 3.5), and partially labeled and unlabeled graphs. The MARGIN algorithm can be easily modified to find maximal cliques by changing the lattice definition and the definition of frequent and infrequent subgraphs in the lattice.
Closely related is the work on the frequent subtree mining problem [20, 19, 21, 71], frequent graphs with geometric constraints [48], discovering typical patterns of graph data [60] and graph indexing [67, 42, 32, 75, 39, 64]. The frequent subtree mining problem deals with the subtree isomorphism problem, which is in P and simpler than the subgraph isomorphism problem, which is NP-complete. An uncertain graph [77, 76] is a special edge-weighted graph, where the weight on each edge (u, v) is the probability of the edge existing between vertices u and v, called the existence probability of (u, v). The support of a subgraph pattern S in an uncertain graph database D is a probability distribution over the supports of S in all implicated graph databases of D. Given an uncertain graph database D and an expected support threshold, [77, 76] find all frequent subgraph patterns in D. Mining significant patterns is another variant of frequent subgraph mining, where significant graphs are the ones formed by transforming regions of the graphs into features and measuring the corresponding importance in terms of p-values. MbT [27] builds a model-based search tree to find significant graphs, where a divide-and-conquer method is used
to find the most significant patterns in a subspace of examples. Further, in order to provide a good understanding of the patterns of interest, it is important to reduce the redundancy in the pattern set. ORIGAMI [31] reports a frequent graph pattern only if its similarity to the already reported patterns is below a threshold, and for every non-reported pattern, at least one reported pattern exists whose similarity to it is at least that threshold. The frequent mining problem becomes more challenging with the huge size of data graphs and the large number of graphs in a database. C. Chen et al. [16] use randomized summarization in order to reduce the data set to a much smaller size. This summarization is then used to determine the frequent subgraph patterns from the data. Bounds are derived in C. Chen et al. [16] on the false positives and false negatives arising from such an approach. Another challenging variation is when the frequent patterns are overlaid on a very large graph, as a result of which patterns may themselves be very large subgraphs. An algorithm called TSMiner was proposed in R. Jin et al. [40] to determine frequent structures in very large scale graphs.
The graph database can, however, come with certain constraints, the simplest of them being a database of graphs with unique node labels. M. Koyuturk et al. [46] find maximal frequent subgraphs in a database of graphs with unique node labels. Their work mines metabolic pathways to discover common motifs of enzyme interactions that are related to each other. Each enzyme is represented by a unique node, independent of the number of times the enzyme appears in the underlying pathway. They hence get a graph database where each graph has unique node labels. This framework simplifies the problem considerably, as it completely bypasses subgraph isomorphism. A unique node pair automatically implies a unique edge, and a set of unique edges can be put together to form only one graph structure given that the node labels are unique. Hence, finding the frequency of a subgraph does not require subgraph isomorphism. The authors address the problem by applying the technique of gSpan with some additional optimizations. Since each node label is unique, the set of node labels in a graph is treated as an itemset. The paper follows a depth-first approach wherein each sub-itemset is extended if the set of labels on extension forms a connected subgraph of the original graph. Connectivity is maintained by only adding edges that are connected to the current subgraph, and redundancy is avoided by keeping track of already visited edges. A. Inokuchi et al. [38] address the problem in a slightly different manner where, again, frequent itemset mining is used for mining maximal frequent subgraphs. They do not maintain connectivity through edge extensions but instead treat each graph as an itemset, apply maximal frequent itemset mining and remove all disconnected graphs. We extend this work to develop an algorithm for maximal frequent subgraph mining for the class of graphs containing unique edge labels. This problem is more difficult.
A frequent itemset cannot be put back together uniquely to rebuild the subgraph even though the edge labels are unique. This is because of the possible multiple occurrences of a node label. Hence, the algorithm requires additional information and a slightly different approach. We explore the scope of using itemset mining techniques to solve maximal frequent subgraph mining problems. We show that doing so helps substantially in reducing the run-time and the complexity involved. Itemset mining has the advantage of being less complex than subgraph mining in terms of frequency computation and candidate generation.
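The unique-node-label case described above can be sketched directly: a graph is identified with the set of its (unordered) node-label pairs, so subgraph frequency becomes plain itemset containment with no isomorphism test. The encoding and helper names below are ours, chosen to mirror the enzyme-pathway setting of [46]:

```python
def graph_to_itemset(edges):
    """With unique node labels, an edge is identified by its unordered
    label pair, and the edge set determines the graph uniquely."""
    return frozenset(frozenset(e) for e in edges)

def support(pattern_edges, db):
    """Support of a pattern = number of database graphs whose edge
    itemset contains the pattern's itemset (no subgraph isomorphism)."""
    pat = graph_to_itemset(pattern_edges)
    return sum(1 for g in db if pat <= graph_to_itemset(g))

# toy database of two pathway-like graphs with unique node labels
db = [
    [('enzA', 'enzB'), ('enzB', 'enzC')],
    [('enzA', 'enzB'), ('enzC', 'enzD')],
]
```

For example, the single edge enzA-enzB has support 2 here, while the two-edge pattern enzA-enzB-enzC has support 1. With unique *edge* labels instead, the same encoding is ambiguous because a node label may occur several times, which is exactly the added difficulty our ISG algorithm has to handle.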
Frequent pattern mining was first proposed by R. Agrawal et al. [9] for market basket analysis in the form of association rule mining. The huge number of possible frequent itemsets in a large transaction database makes developing an algorithm for frequent itemset mining challenging. R. Agrawal and R. Srikant [11] observed the Apriori property for mining frequent itemsets: for an itemset to be frequent, all its subitemsets have to be frequent. They use this property to efficiently find the frequent itemsets. This led to a lot of interest in other improvements or extensions of Apriori, e.g., the hashing technique [51], the partitioning technique [56], incremental mining [18] and parallel and distributed mining [52, 10, 17, 73].
Mining of the complete set of frequent itemsets without candidate generation was proposed by J. Han et al. [30]. They devised the FP-growth algorithm, which mines long frequent patterns by finding the shorter ones recursively and then concatenating the suffix. This method reduces the search time substantially.
The Equivalence CLASS Transformation (Eclat) algorithm proposed by M. J. Zaki [70] mines the frequent itemsets with a vertical data format. After computing the TID-set of each single item by a single database scan, the Eclat algorithm generates (k+1)-itemsets from previously generated k-itemsets using a depth-first computation order similar to FP-growth [30]. M. Holsheimer et al. [33] explore the potential of solving data mining problems using general purpose database management systems (DBMS).
MaxMiner, proposed by R. J. B. Jr et al. [41], studies the problem of finding maximal frequent itemsets. MaxMiner performs a level-wise breadth-first search to find maximal frequent itemsets using the Apriori property. It adopts superset frequency pruning and subset infrequency pruning for search space reduction. D. Burdick et al. [15] proposed the MAFIA algorithm, which finds maximal frequent itemsets and in which frequency counting is done efficiently by using vertical bitmaps to compress the TID list. A theoretical analysis of the complexity of mining maximal frequent itemsets is given in G. Yang [68], where the problem is shown to be NP-hard. The length distribution of frequent and maximal frequent itemsets is studied by G. Ramesh et al. [55].
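MAFIA's vertical-bitmap counting can be mimicked with arbitrary-precision integers as bitmaps: bit t of an item's bitmap is set when the item occurs in transaction t, and the support of an itemset is the popcount of the bitwise AND of its members' bitmaps. This is a toy sketch of the idea, not MAFIA's compressed layout:

```python
def vertical_bitmaps(transactions):
    """Map each item to an integer bitmap over transaction ids."""
    bitmaps = {}
    for t, items in enumerate(transactions):
        for it in items:
            bitmaps[it] = bitmaps.get(it, 0) | (1 << t)
    return bitmaps

def bitmap_support(itemset, bitmaps):
    """Support of a non-empty itemset = popcount of the AND of its
    members' bitmaps."""
    masks = [bitmaps.get(it, 0) for it in itemset]
    acc = masks[0]
    for m in masks[1:]:
        acc &= m
    return bin(acc).count('1')

transactions = [{'a', 'b'}, {'a', 'c'}, {'a', 'b', 'c'}]
bm = vertical_bitmaps(transactions)
```

Intersecting bitmaps replaces rescanning the database, which is why vertical representations speed up frequency counting during the maximality search.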
Chapter 3
MARGIN: Maximal Frequent Subgraph Mining
In this chapter, we introduce the MARGIN algorithm. We first present the preliminary definitions and terminology that will be used throughout, followed by a brief discussion in Chapter 3.2 of the two models in which the graph database is viewed and processed in our work. Next, we give the intuition behind the running of the algorithm in Chapter 3.3 with a running example. We present the MARGIN algorithm in Chapter 3.4. The MARGIN algorithm presented in Chapter 3.4 gives connected maximal frequent subgraphs. The MARGIN algorithm can be modified to give disconnected maximal frequent subgraphs, which is presented in Chapter 3.5.
3.1 Preliminary Concepts
In this chapter, we first provide the necessary background and notation. We adopt the same definitions as in [65] for Labeled Graph, Isomorphism and Frequent Subgraphs.
Definition 1 Labeled Graph: A labeled graph can be represented as a tuple, G = (V,E,Λ, λ), where
V is a set of vertices,
E ⊆ V × V is a set of edges,
Λ is a set of labels,
λ : V ∪ E → Λ, is a function assigning labels to the vertices and edges.
Definition 2 Isomorphism, Subgraph Isomorphism [65]: An isomorphism is a bijective function f : V(G) → V(G′) such that

∀u ∈ V(G), λG(u) = λG′(f(u)), and

∀(u, v) ∈ E(G), (f(u), f(v)) ∈ E(G′) and λG(u, v) = λG′(f(u), f(v)).
A subgraph isomorphism from G to G′ is an isomorphism from G to a subgraph of G′.
For ease of presentation, in this work we assume undirected connected labeled graphs. We denote the relationship “subgraph of” using ⊆graph and “proper subgraph of” using ⊂graph. D is a database of n graphs, D = {G1, G2, .., Gn}.
Given a graph database D of n graphs and a minimum support minSup, let

σ(g, G) = 1 if g is isomorphic to a subgraph of G, and 0 if g is not isomorphic to any subgraph of G.

σ(g, D) = ΣGi∈D σ(g, Gi)
σ(g, D) denotes the occurrence frequency of g in D, i.e., the support of g in D. A frequent subgraph is a graph g such that σ(g, D) is greater than or equal to minSup. Let F = {g1, g2, . . .} be the set of all frequent subgraphs of D for a given support minSup. A graph gi ∈ F is said to be maximal frequent if there exists no gj ∈ F such that gi ⊂graph gj. Let MF ⊆ F be the set of all maximal frequent subgraphs of D. Given the graph database D and a minimum support minSup, the problem of maximal frequent subgraph mining is to compute all maximal frequent subgraphs in D. We conceptualize the search space for finding MF in the form of a graph lattice.
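The support σ(g, D) can be computed directly from the definition by brute force: try every injective, label-preserving vertex mapping of g into each Gi and check that every edge of g is matched. This exponential check is precisely what subgraph mining algorithms try to avoid, but a small sketch makes the definition concrete (the Graph class and helper names are ours):

```python
from itertools import permutations

class Graph:
    """A labeled undirected graph: vertex labels in vl, edge labels in el."""
    def __init__(self, vl, edges):
        self.vl = vl                                   # {vertex: label}
        self.el = {frozenset((u, v)): l for u, v, l in edges}

def occurs_in(g, G):
    """sigma(g, G): 1 if g is isomorphic to a subgraph of G, else 0."""
    gv, Gv = list(g.vl), list(G.vl)
    for image in permutations(Gv, len(gv)):
        f = dict(zip(gv, image))
        if any(g.vl[u] != G.vl[f[u]] for u in gv):
            continue                                   # vertex labels must match
        if all(G.el.get(frozenset((f[u], f[v]))) == l
               for pair, l in g.el.items()
               for u, v in [tuple(pair)]):
            return 1                                   # every edge of g matched
    return 0

def support(g, D):
    """sigma(g, D) = sum of sigma(g, Gi) over the database."""
    return sum(occurs_in(g, Gi) for Gi in D)

# The graphs of Figure 3.1: G1 = a-c-c, G2 = a-c-b (all edge labels '-')
G1 = Graph({0: 'a', 1: 'c', 2: 'c'}, [(0, 1, '-'), (1, 2, '-')])
G2 = Graph({0: 'a', 1: 'c', 2: 'b'}, [(0, 1, '-'), (1, 2, '-')])
D = [G1, G2]
```

On this database the pattern a − c has support 2 and c − b has support 1, matching the counts in Figure 3.2; with minSup = 2, a − c is frequent and c − b is not.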
Figure 3.1 Graph Database D = {G1, G2} (MinSup = 2)

Figure 3.2 Lattices L1 and L2 (subgraph embeddings with frequency counts, cuts, and levels 0–3)
The lattices L1 and L2 shown in Figure 3.2 are the graph lattices of the graphs G1, G2 ∈ D. To keep the example simple, we assume that all the edge labels are identical and hence are not shown in the figure. Every node in the lattice is one connected subgraph of Gi. Every subgraph in Gi is represented exactly once in the lattice. In Figure 3.2, the graph a − c occurs twice in the lattice L1, since there are two subgraphs of G1 that are isomorphic to a − c. The bottommost node corresponds to the empty subgraph φg and the topmost nodes correspond to Gi. A node C is a child of a node P ≠ φg in the lattice Li if P ⊆graph C and C and P differ by exactly one edge. The node P is a parent of such a node C in the lattice. For instance, for both isomorphic forms of the graph a − c in L1, the child would be the subgraph c − a − c (by adding the edge c − a). Similarly, the subgraphs a − c and c − b are the parents
of a − c − b in L2. All single node subgraphs are the children of the node φg. φg is thus the parent of all the single node subgraphs and is considered to be always frequent. An edge exists in the lattice between every pair of child and parent nodes.
For a given graph G, the size of the graph (denoted by |G|) refers to the number of edges present in G. All the subgraphs of equal size form a level in the lattice Li of Gi. The node corresponding to φg forms level 0, singleton vertex graphs form level 1, and the nodes of size i form level i + 1 for i > 0 (Figure 3.2).
Definition 3 Cut: A cut between two nodes in a lattice represented by (C † P ) is defined as an ordered
pair (C,P ) where P is the parent of C ∈ Li and C is not frequent while P is frequent. The frequent
subgraph P of a cut is represented by f(†) (frequent-†) and the infrequent subgraph C is represented
by I(†) (infrequent-†). The symbol † is read as ‘cut’.
Note that different isomorphic subgraphs of a graph g in the lattice Li will thus have the same support in D. However, the subgraphs corresponding to the children of each isomorphic form might be different. Also, while one isomorphic form of a subgraph might become an f(†)-node, the other might not.
Example: Consider Figure 3.2 with minSup = 2. The node in L2 that corresponds to the subgraph c is an f(†)-node, since it is frequent with count 2. Its child node, which corresponds to c − b, is infrequent with count 1 and thus is an I(†)-node. Hence, this pair is marked as a cut. Figure 3.2 shows the frequency count of each node in the example lattice along with all the existing cuts in the lattices L1 and L2
respectively.
3.2 Lattice Representation Models
Figure 3.3 Embedded Model (lattice L for the database D = {G1, G2}, with each node storing the embeddings of all its isomorphic forms)
The lattice structure used by the MARGIN algorithm can be represented using two models, namely the replication model and the embedded model.

In the replication model (Figure 3.2), each subgraph g ⊂graph Gi ∈ D is represented exactly once in the lattice Li of Gi, as described in Chapter 3.1.
The embedded model is shown in Figure 3.3 for the graphs in Figure 3.1. In the embedded model, the lattice L of the database D is common for all Gi ∈ D. All isomorphic forms of any subgraph in D are represented using a single node in the embedded model. Each node in the lattice L stores the information about the occurrences of all of its isomorphic forms in the database. The bottommost node corresponds to the empty graph φg and the topmost nodes correspond to the graphs in D. The graph a − c in Figure 3.1 has three isomorphic forms in the database but will be represented exactly once, along with the information about its occurrences in G1 at locations 1 − 0 and 1 − 2 and its occurrence in G2 at location 0 − 1.
In the rest of the work, we proceed using the replication model unless mentioned otherwise. With minor modifications, the MARGIN algorithm can be applied to the embedded model.
3.3 Intuition
We start by defining the Upper-3-Property, which holds in every lattice Li of Gi ∈ D and which we exploit in our algorithm.
Figure 3.4 Upper-3-Property
Property 1 Upper Diamond property (Upper-3-property): Any two children Ci, Cj of a node P ,
where Ci, Cj ∈ Lattice Lk of Gk for Gk ∈ D, will have a common child A.
Proof: Let e1 and e2 be the edges incident on the vertices n1 and n2 in P, respectively. Let P ∪ {e1} = Cj and P ∪ {e2} = Ci as in Figure 3.4. Hence, e1 would be incident on n1 in Ci and e2 would be incident on n2 in Cj. Let A = Cj ∪ {e2}. Hence, A = (P ∪ {e1}) ∪ {e2} = (P ∪ {e2}) ∪ {e1} = Ci ∪ {e1}. Hence, A = Ci ∪ {e1} = Cj ∪ {e2} is the common child of Ci and Cj, proving the Upper-3-property. □
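Representing subgraphs by their edge sets, the proof reduces to commutativity of union: (P ∪ {e1}) ∪ {e2} = (P ∪ {e2}) ∪ {e1}. A two-line check, with illustrative edges of our choosing:

```python
# Subgraphs as frozensets of edges; a child adds exactly one edge.
P = frozenset({('a', 'c')})
e1, e2 = ('c', 'b'), ('a', 'd')     # two distinct one-edge extensions of P
Cj = P | {e1}                       # child of P via e1
Ci = P | {e2}                       # child of P via e2
A = Cj | {e2}                       # equals Ci | {e1}: the common child
```

Whichever order the two edges are added, the same three-edge child A is reached.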
Consider any two children A1 and A2 of a node B in a lattice. In a lattice of the replication model, by definition, A1 and A2 are obtained by extending B. However, in the embedded model, any two children of a node need not be extensions of the same isomorphic form. Hence, in the embedded model, the Upper-3-property holds similarly, except that A1 and A2 should be extensions of the same isomorphic form of the subgraph B in order to have a common child C of A1 and A2.
Next, we provide the intuition behind the MARGIN algorithm. The set of candidate subgraphs that are likely to be maximal frequent are the f(†) nodes. This is because they are frequent subgraphs having an infrequent child. The remaining nodes of the lattice cannot be maximal frequent subgraphs. In this work, we present an approach that avoids traversing the lattice bottom up and instead traverses the cuts alone in each lattice Li for Gi ∈ D. We prune the set of f(†) nodes to give the set of maximal frequent subgraphs. The MARGIN algorithm, unlike Apriori-based algorithms, goes directly to any one of the f(†) nodes of the lattice Li and then finds all other f(†) nodes by cutting across the lattice Li. We give an insight below into the approach developed.
Figure 3.5 The Steps in ExpandCut
Finding the initial f(†) node is done by trivially dropping edges one by one from the graph Gi ∈ D, ensuring that the resulting subgraph is connected, until we find the first frequent subgraph Ri. We call the frequent subgraph found by such dropping of edges the Representative Ri of Gi. Our initial cut is thus (CRi † Ri), where CRi is the infrequent child of Ri. Figure 1.1 shows the infrequent nodes visited in the infrequent space of the lattice in order to determine the Representative that lies on the border. Thus, strictly speaking, in Figure 1.1, MARGIN explores the space marked as “Margin Search Space” and the set of infrequent nodes to find the Representative Ri of Gi. Similarly, the Representative can also be found by starting with the empty graph φg (assumed to be frequent) and extending it until an infrequent subgraph is found. For lower support values, as the candidate nodes are expected to lie higher up in the lattice (as discussed later), we find the Representative starting at the topmost node of the lattice. On the other hand, for high support values, as the candidate nodes lie in the lower levels of the lattice, starting at φg incurs less computational cost.
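The search for the Representative can be sketched as repeatedly dropping an edge whose removal keeps the subgraph connected, until the first frequent subgraph is reached. Below, `is_frequent` is a stand-in frequency oracle (in MARGIN it would query the database), and the helper names are ours:

```python
def connected(edges):
    """Is the undirected graph formed by `edges` connected? (DFS sketch)"""
    verts = {v for e in edges for v in e}
    if not verts:
        return True
    seen, stack = set(), [next(iter(verts))]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        for a, b in edges:
            if a == v:
                stack.append(b)
            elif b == v:
                stack.append(a)
    return seen == verts

def find_representative(edges, is_frequent):
    """Drop one edge at a time, keeping the subgraph connected, until
    the first frequent subgraph (the Representative R_i) is reached."""
    current = frozenset(edges)
    while not is_frequent(current):
        for e in current:
            rest = current - {e}
            if connected(rest):
                current = rest
                break
        else:
            raise ValueError("no connected frequent subgraph found")
    return current

# toy oracle: subgraphs with at most one edge count as "frequent"
rep = find_representative({('a', 'c'), ('c', 'd')}, lambda g: len(g) <= 1)
```

Starting from the two-edge path a − c − d, one connectivity-preserving edge drop already yields a frequent single-edge representative under this toy oracle.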
We devise an algorithm ExpandCut which, for each cut discovered in Gi ∈ D, recursively extends the cut to generate all cuts in Gi. We provide an intuition for the ExpandCut algorithm used to find the nearby cuts given any cut (C † P) as input in the lattice Li of Gi. Recursively invoking ExpandCut on each newly found cut with the three steps below finds all cuts in Gi. Figure 3.5 is used throughout the explanation of the ExpandCut algorithm. Consider the initial cut to be the cut (C † P) in Figure 3.5(a).
Step 1: The node C in lattice Li can have many parents that are frequent or infrequent, one of which is P. Consider the frequent parent P1 in Figure 3.5(b). Cut (C † P1) exists since P1 is frequent while C is infrequent. Thus, for an initial cut (C † P), all frequent parents of C are reported as f(†) nodes.
Step 2: Let P be the parent of C as in Figure 3.5(c). Let C1, C2 and C be the children of the frequent node P. Each of C1, C2 and C can be frequent or infrequent.
(a): Consider an infrequent child C2. Cut (C2 † P) exists since P is frequent while C2 is infrequent. Thus, for an initial cut (C † P), for each frequent parent Pf of C that has an infrequent child Ci, the cut (Ci † Pf) is reported.
(b): Consider a frequent child C1. By the Upper-3-Property, the nodes C and C1 have a common child M. M is infrequent as its parent C is infrequent. Hence, cut (M † C1) exists. Thus, for an initial cut (C † P), for each frequent parent Pf of C, consider each of its frequent children Ci. The cut (M † Ci) is reported, where M is the common child of Ci and C.
Step 3: Consider all parents S1, S2, S3 of an infrequent parent P2 of C as in Figure 3.5(d). Each such parent can be frequent or infrequent. Consider the frequent parents S1, S3 (Figure 3.5(d)) of the infrequent parent P2 of C. Hence, the cuts (P2 † S1) and (P2 † S3) exist. However, if Step 1 is called on the cut (P2 † S1), the cut (P2 † S3) is found. Thus, for an initial cut (C † P), for each infrequent parent Pi of C, consider any one frequent parent Sf of Pi. ExpandCut is invoked on the cut (Pi † Sf).
Example: Consider the lattice L2 of Figure 3.2. Let the cut (c − b † c) be the initial cut. The node representing the subgraph c − b has two parents, namely the subgraphs c and b, of which the subgraph c is frequent and b is infrequent.
Step 1: Among the parents of c − b, the only frequent parent is c, with which the initial cut is already formed; hence we go to Step 2.
Step 2: The subgraph c has one child a − c apart from the child c − b with which it formed the initial cut. Since a − c is frequent, ExpandCut goes to Step 2(b). In Step 2(b), the subgraphs a − c and c − b have a common child a − c − b by the Upper-3-Property. As a − c − b is infrequent and a − c is frequent, the cut (a − c − b † a − c) is found.
Step 3: The infrequent parent b of c − b is considered next and its frequent parents are found. As φ is a frequent subgraph, the cut (b † φ) is reported.
All the cuts found by the first iteration of the ExpandCut function are reported and recursively expanded. In our example, a single call of ExpandCut on the initial cut has found all the cuts of the lattice. □
The detailed algorithm ExpandCut is given in Chapter 3.4.
3.4 The MARGIN Algorithm
Table 3.1 shows the MARGIN algorithm to find the globally maximal frequent subgraphs MF. Initially, MF = ∅ (line 1) and the graphs in D are unexplored. LF is the set of locally maximal subgraphs in each Gi, which is initially φ (line 3). Given the graphs D = {G1, G2, . . . , Gn}, for each Gi ∈ D, we find the representative Ri of Gi (line 4). This is done by iteratively dropping an edge from Gi until a connected frequent subgraph is found. The ExpandCut algorithm is initially
Input: Graph Database D = {G1, G2, . . . , Gn}
Output: Set of Maximal Frequent Graphs MF
Algorithm:
1. MF = ∅
2. For each Gi ∈ D do
3.     LF = φ
4.     Find the representative Ri of Gi
5.     ExpandCut(LF, CRi † Ri) where CRi is the infrequent child of Ri
6.     Merge(MF, LF)

Table 3.1 Algorithm MARGIN
invoked on the cut (CRi † Ri) with LF = φ, where CRi is the infrequent child of Ri. ExpandCut finds the nearby cuts and recursively calls itself on each newly found cut. The algorithm functions in such a manner that finding one cut in Gi ∈ D leads to finding all cuts in Gi. In line 6, the globally maximal frequent subgraphs MF found so far are merged with the local maximal frequent subgraphs LF found in Gi. The Merge function computes the new globally maximal set by removing all subgraphs that are no longer maximal due to the subgraphs in LF and adding the subgraphs from LF that remain globally maximal frequent after exploring Gi.
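The Merge step can be sketched on any domain with a containment order; below, frozensets of edges stand in for subgraphs and proper subset stands in for ⊂graph, and the helper name is ours:

```python
def merge(MF, LF):
    """Merge the local maximal set LF into the global maximal set MF,
    keeping only patterns not properly contained in another pattern."""
    out = set(MF)
    for g in LF:
        if any(g < h for h in out):          # g is subsumed: skip it
            continue
        out = {h for h in out if not h < g}  # drop patterns g subsumes
        out.add(g)
    return out

MF = {frozenset({'ab', 'bc'})}               # a two-edge global maximal
LF = {frozenset({'ab', 'bc', 'cd'}), frozenset({'ef'})}
merged = merge(MF, LF)
```

Here the old global maximal {ab, bc} is displaced by its local superpattern, while the unrelated local maximal {ef} is admitted unchanged.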
Table 3.2 shows the ExpandCut algorithm, which expands a given cut such that its neighboring cuts are explored. The inputs to the algorithm are the set of maximal frequent subgraphs LF found so far (initially empty) and the cut (C † P).
For each parent Yi of C, if Yi is frequent, Yi is added to LF (lines 3–4).
• If child(Yi) is the set of children of Yi, then for each infrequent child K in child(Yi), ExpandCut is called on the cut (K † Yi) (lines 7–8).
• For each frequent child K in child(Yi), let M be the common child of C and K. ExpandCut is called on the cut (M † K) (lines 9–11).

On the other hand, if Yi is infrequent and there exists at least one frequent parent K among the parents of Yi, then ExpandCut is called on the cut (Yi † K) (lines 12–15).
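The recursion of Table 3.2 can be sketched over any lattice that offers parents, children, a common child and a frequency oracle. The toy lattice below is a subset lattice (itemsets standing in for subgraphs, so the Upper-3-Property holds trivially); a `seen` set of cuts prevents re-expansion, and the class and driver are our scaffolding, not part of the thesis pseudocode:

```python
class SubsetLattice:
    """Itemset lattice standing in for the graph lattice: children add
    one element, parents drop one, freq is a downward-closed family."""
    def __init__(self, universe, freq_sets):
        self.universe = frozenset(universe)
        self._freq = freq_sets
    def freq(self, s):
        return s in self._freq
    def parents(self, s):
        return [s - {e} for e in s]
    def children(self, s):
        return [s | {e} for e in self.universe - s]
    def common_child(self, a, b):
        return a | b    # two children of one parent differ in one element

def expand_cut(C, P, lat, LF, seen):
    """Schematic ExpandCut (Table 3.2): C is infrequent, P is frequent."""
    if (C, P) in seen:
        return
    seen.add((C, P))
    for Y in lat.parents(C):
        if lat.freq(Y):
            LF.add(Y)                            # step 1: frequent parent
            for K in lat.children(Y):
                if not lat.freq(K):              # step 2(a)
                    expand_cut(K, Y, lat, LF, seen)
                elif K != C:                     # step 2(b)
                    expand_cut(lat.common_child(C, K), K, lat, LF, seen)
        else:                                    # step 3: one frequent grandparent
            for S in lat.parents(Y):
                if lat.freq(S):
                    expand_cut(Y, S, lat, LF, seen)
                    break

fs = frozenset
freq = {fs(), fs({1}), fs({2}), fs({1, 2}), fs({3})}   # downward closed
lat = SubsetLattice({1, 2, 3}, freq)
LF = set()
expand_cut(fs({1, 3}), fs({1}), lat, LF, set())        # one initial cut
maximal = {g for g in LF if not any(g < h for h in LF)}
```

Starting from the single cut ({1, 3} † {1}), the recursion collects every f(†) node of the toy lattice; pruning LF to containment-maximal elements then yields the maximal frequent sets {1, 2} and {3}.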
3.5 MARGIN-d: Finding disconnected maximal frequent subgraphs
Finding disconnected maximal frequent subgraphs is more complex than the problem of finding connected maximal frequent subgraphs. The search space of Apriori-like algorithms would increase compared to their connected counterparts, since every combination of connected subgraphs forms a valid node in the graph lattice of a graph in the database. For academic interest, we discuss in this section the application of MARGIN to disconnected maximal frequent subgraph mining, followed by an evaluation of results. Given a subgraph g and a supergraph Gsup, a subgraph isomorphism
Input:
LF: The maximal frequent subgraphs seen so far in Gi.
Cut: C † P
Output: The updated set of maximal frequent subgraphs LF.
Algorithm:
1. Let Y1, Y2, . . . , Yc be the parents of C.
2. for each Yi, i = 1, . . . , c do
3.     if Yi is frequent
4.         LF = LF ∪ Yi
5.         Let child(Yi) be the set of children of Yi
6.         for each element K in child(Yi) do
7.             if K is infrequent do
8.                 ExpandCut(LF, K † Yi)
9.             if K is frequent do
10.                 Find common child M of C and K
11.                 ExpandCut(LF, M † K)
12.     if Yi is infrequent
13.         Let parent(Yi) be the set of parents of Yi
14.         if one frequent parent K in parent(Yi) exists
15.             ExpandCut(LF, Yi † K)

Table 3.2 Algorithm ExpandCut(LF, C † P)
searches for an occurrence of g in Gsup, which is determined by mapping each node in g to a node in Gsup such that the corresponding edges between the nodes in g map to the edges incident on the mapped nodes of Gsup. In the worst case, the nodes of g are attempted to be mapped to every subset of nodes in Gsup looking for a possible match. It is sufficient to find a single successful match to determine subgraph isomorphism.
This problem becomes more complex when the subgraph g is disconnected. Each component of g
needs to be mapped to a corresponding subgraph in Gsup such that the mapped subgraphs of Gsup arenon-overlapping. Each component of g can be mapped to multiple matchings in Gsup. Here it is notsufficient to find a successful match for each component of g but to ensure that this set of successfulmatches are non-overlapping. This might require exploring all possible combinations of the matchingacross the components.
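To illustrate the extra combinatorial step, suppose each component of g has already been matched individually, giving a list of candidate vertex sets in Gsup per component. A hedged sketch of the non-overlap check follows; the function name and the list-of-sets representation are assumptions, not the thesis's implementation.

```python
from itertools import product

def disjoint_embedding_exists(component_matches):
    """component_matches[i] holds the candidate vertex sets of Gsup that the
    i-th component of g can be mapped to; return True iff one candidate per
    component can be chosen so the chosen vertex sets are pairwise disjoint."""
    for choice in product(*component_matches):   # all combinations of matches
        used = set()
        for match in choice:
            if used & match:                     # overlaps an earlier component
                break
            used |= match
        else:
            return True                          # every component placed disjointly
    return False
```

For example, with candidate matches [[{1, 2}, {3, 4}], [{2, 3}]] every combination overlaps, while [[{1, 2}, {3, 4}], [{1, 5}]] admits the disjoint choice ({3, 4}, {1, 5}).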
Further, for a connected subgraph g, the number of children (immediate supergraphs) of g ⊂ Gsup is the total number of edges incident on nodes of g that are not already contained in g. Hence g is guaranteed at least one child. For disconnected subgraphs, the number of children of g is the total number of edges in the graph Gsup that are not contained in g. This equals the maximum possible number of children in connected maximal frequent subgraph mining (attained for cliques). Similarly, the number of parents of g in the connected version is the number of edges that can be dropped while keeping the resultant graph connected, which is a minimum of two for graphs with more than two edges. For the disconnected version, the number of parents is the number of edges in g, which is again the maximum for the connected version. This change in the number of parents and children generated is significant for an Apriori approach, where the children of each explored node in the bottom-up traversal need to be found.
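The connected parent count can be computed by testing, for each edge, whether its removal leaves the remaining edge-induced graph connected. A small sketch follows; the edge-set representation and function names are our assumptions.

```python
from collections import defaultdict, deque

def is_connected(edges):
    """BFS connectivity test on the graph induced by an edge set."""
    if not edges:
        return True
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    start = next(iter(adj))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u] - seen:
            seen.add(w)
            queue.append(w)
    return len(seen) == len(adj)

def connected_parent_count(edges):
    """Single-edge deletions that keep the subgraph connected,
    i.e. the parent count of `edges` in the connected lattice."""
    return sum(is_connected(edges - {e}) for e in edges)

def disconnected_parent_count(edges):
    """In the disconnected lattice every edge can be dropped."""
    return len(edges)
```

For the three-edge path 1−2−3−4, only the two end edges can be dropped without disconnecting it, so the connected parent count is 2 while the disconnected count is 3.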
[Figure 3.6 shows the disconnected-subgraph lattices L1 and L2 for a two-graph database {G1, G2}; each level i contains subgraphs whose sum of edges and nodes is i, with subgraph frequencies shown in brackets.]

Figure 3.6 Graph Lattice for Disconnected Graphs
The algorithm for mining maximal frequent disconnected subgraphs is the same as that for connected subgraphs except for the parent-child definitions. MARGIN dealt only with connected parent and child subgraphs, while the disconnected version of MARGIN, which we refer to as MARGIN-d, permits parents and children to be disconnected. It is trivial to adapt the algorithm to directed graphs and unlabeled graphs. The lattice is modified in the case of disconnected graphs as represented in Figure 3.6. The lattice is drawn for the graph database shown in the figure, which was seen earlier for the connected subgraph lattice in Figure 3.2. Each level i represents subgraphs whose sum of edges and nodes is i. The numbers given in brackets show the frequency of each subgraph in the database. The Upper-3 property holds as in the case of the connected graph lattice. For a support value of two, the ExpandCut function as described earlier will find the cuts as shown in Figure 3.6. In Section 4.5 we validate experimentally why it should do better than the disconnected subgraph extensions of Apriori-based algorithms.
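The modified parent-child definitions of MARGIN-d are straightforward to state over edge sets: a child adds any edge of the database graph not yet in g, and a parent drops any edge of g, connectedness no longer being required. A minimal sketch, with illustrative names:

```python
def children_d(g_edges, gi_edges):
    """MARGIN-d children: add any edge of the database graph not yet in g."""
    return [frozenset(g_edges | {e}) for e in gi_edges - g_edges]

def parents_d(g_edges):
    """MARGIN-d parents: drop any edge of g (the result may be disconnected)."""
    return [frozenset(g_edges - {e}) for e in g_edges]
```

Note that `children_d` may return disconnected graphs such as {(1, 2), (3, 4)}, which the connected version would never generate.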
3.6 Proof of Correctness
In this section, we prove the correctness of the MARGIN algorithm. We first prove the essential claim that finding one cut in Gi ∈ D leads to all cuts in Gi ∈ D. The cuts in D are finally pruned to compute the maximal frequent subgraphs. Note that φg is always considered frequent, as φg is a subgraph of every Gi ∈ D.
Claim 1 Given two cuts (C1 † P1) and (C2 † P2) in Gi ∈ D, invoking ExpandCut on one cut finds the other cut.
Proof: We shall use the term "node" for the nodes of the lattice and "lattice edge" for an edge in the lattice. Similarly, the term "vertex" is used to represent the nodes of the graph Gi and "graph edge" to represent an edge in the graph Gi. The information available to us for the proof is:
• The given two cuts for which reachability has to be proved are (C1 † P1) and (C2 † P2).
• Cj = (Vj , Ej ,Λ, λj) for j = 1, 2 such that Cj ⊂graph Gi.
• If the subgraphs C1 and C2 have no graph edge in common, then we find the minimal number of graph edges in Gi joining a vertex of the subgraph C1 with a vertex of the subgraph C2. Such a sequence of graph edges forms a connected subgraph of Gi referred to as PathP = (Vp, Ep, Λ, λp).
We consider two cases, namely:

1. Case 1: P1 ∩ P2 ≠ ∅, which implies that there exists a common subgraph of P1 and P2.

2. Case 2: P1 ∩ P2 = ∅, which implies that there does not exist a common subgraph of P1 and P2. Hence the common parent node in the lattice for P1 and P2 is the bottom-most node φ of the lattice.
Starting at one cut, the sequence of cuts that finds the other cut can be shown in a sub-lattice Lsub_i of Li. We next describe the sub-lattice Lsub_i, which is contained in the lattice Li of Gi, for each of the two cases above.

Case 1: If P1 ∩ P2 ≠ ∅, consider the largest common subgraph t of P1 and P2 in lattice Li (Figure 3.7). The node t is the lowermost node of the lattice Lsub_i and hence is at level 0 of Lsub_i. Note that level 0 of Lsub_i occurs at level |t| + 1 of lattice Li since, as mentioned earlier, in lattice Li the level of a node is one added to the size of the node. For two graphs Gi and Gj, the set of graph edges present in Gi and not present in Gj is represented by Gi \ Gj. Extend t to g1 by adding a graph edge e_g1 ∈ (P1 \ t) incident on t. Extend gi for i > 0 in lattice Lsub_i to gi+1 by adding a graph edge e_gi+1 ∈ (P1 \ gi) incident on gi, until gi+1 = P1. Extend P1 in the lattice Lsub_i to C1. Similarly, extend t to f1 by adding a graph edge e_f1 ∈ (P2 \ t) incident on t. Extend fi to fi+1 for i > 0 by adding a graph edge e_fi+1 ∈ (P2 \ fi) incident on fi, until fi+1 = P2. Extend P2 to C2. For every two nodes in level i > 0 of Lsub_i which have a common parent in Lsub_i, construct the common child in Lsub_i. Such a common child always exists by the Upper-3 property.

[Figure 3.7 depicts the sub-lattice for Case 1, with the sequences X and Y, the paths r0, . . . , rm and q0, . . . , qn, and the frequent, infrequent and unknown regions marked.]

Figure 3.7 Case 1: Lattice Lsub_i: P1 ∩ P2 ≠ ∅
Case 2: If P1 ∩ P2 = ∅, consider the shortest path PathP = (Vp, Ep, Λ, λp) that connects P1 and P2 in Gi. Let {m1, . . . , m|Ep|} be the sequence of graph edges that constitute the shortest path, where graph edge mi is adjacent to graph edge mi+1 for i = 1, . . . , |Ep| − 1; m1 is incident on a vertex in P1 and m|Ep| is incident on a vertex in P2 (Figure 3.9). Let subgraph g1 = m1 ∪ e_g1, where e_g1 is the graph edge of P1 adjacent to m1. Similarly, let f1 = m|Ep| ∪ e_f1, where e_f1 is the graph edge of P2 incident on m|Ep|. The lowermost node of the lattice Lsub_i is φg at level 0 (Figure 3.8). Level 1 of the lattice Lsub_i consists of the set of vertices Vp of the path PathP, which are the children of the node φg. The set of graph edges Ep ∪ e_g1 ∪ e_f1 forms the nodes of level 2 of the lattice Lsub_i. If C1 ⊄graph g1, extend gi to gi+1 by taking a graph edge e_gi+1 ∈ P1 \ gi incident on gi for i > 0 until gi+1 = P1. Then, extend P1 to C1. Similarly, if C2 ⊄graph f1, extend fi to fi+1 by taking a graph edge e_fi+1 ∈ P2 \ fi incident on fi to form a subgraph fi+1 in the lattice Lsub_i, and then extend P2 to C2. For every two nodes in level i > 1 which have a common parent in the lattice Lsub_i, construct the common child. If the common child is one of the gi's or fi's, then draw the required connecting lattice edges to make it the common child in the lattice.
Hence, we have shown the construction of Lsub_i for both cases, which exists as a sub-lattice of the lattice Li of Gi. We next mention the terms that we shall use to denote specific nodes of the sub-lattice Lsub_i, as depicted in Figure 3.7.
[Figure 3.8 depicts the sub-lattice for Case 2, built above the path vertices Vp at level 1, with the frequent, infrequent and unknown regions marked.]

Figure 3.8 Case 2: Lattice Lsub_i: P1 ∩ P2 = ∅
1. Let Y = Y0, . . . , Yn be the ordered sequence of supergraphs of C1, including C1, in lattice Lsub_i such that |Y0| > |Y1| > . . . > |Yn|.

2. Let X = X0, . . . , Xm be the ordered sequence of supergraphs of C2, including C2, in lattice Lsub_i such that |X0| > |X1| > . . . > |Xm|, as illustrated in Figure 3.7.

3. We define the set of paths Π1 = r0, . . . , rm (shown in Figure 3.7) where each ri ∈ Π1 is the union of the lattice edges of the form ((Xi, Yj), (Xi, Yj+1)) for j = 0, . . . , n − 1.

4. Similarly, Π2 = q0, . . . , qn where each qj ∈ Π2 is the union of the lattice edges of the form ((Xi, Yj), (Xi+1, Yj)) for i = 0, . . . , m − 1.
The sub-lattice Lsub_i has the following properties:

• All supergraphs of C1 and C2 in lattice Lsub_i are infrequent.

• All subgraphs of P1 and P2 are frequent.
[Figure 3.9 shows the path PathP = {m1, m2, . . . , m|Ep|} connecting P1 and P2 inside the graph Gi.]

Figure 3.9 Path PathP = (Vp, Ep, Λ, λp)
• Any node N ≠ φg in the lattice Lsub_i can now be referred to by a unique tuple (Xi, Yj) such that Xi is the smallest supergraph of N in X and Yj is the smallest supergraph of N in Y.

• (Xi, Y0) is infrequent while (Xi, Yn) is frequent for all i > 0. Hence, for each ri ∈ Π1, i > 0, there exists exactly one cut.

• Similarly, (X0, Yi) is infrequent while (Xm, Yi) is frequent for all i > 0. Hence, there lies one cut on each qj ∈ Π2, j > 0.

• Thus, there are m + n cuts in the lattice Lsub_i.

• Cut (C1 † P1) = cut ((X0, Yn) † (X1, Yn)) by the construction of the lattice Lsub_i, and it lies on the path qn.

• The final cut (C2 † P2) = cut ((Xm, Y0) † (Xm, Y1)) to be reached lies on the path rm.
Next, given the initial cut (C1 † P1), we show that the cut (C2 † P2) is reachable. We give the proof for Case 1, which uses Figure 3.7. The proof also holds for Case 2 and is illustrated through the running example in subsection 3.6.1. To conclude the proof, we state and prove the reachability sequence property below.

The Reachability Sequence: The order of paths k1, . . . , km+n, where ki ∈ Π1 or Π2, on which the cuts from path qn to rm are found, satisfies the following:
1. for kj = rs ∈ Π1, rs+1 = kl, l > j
2. for kj = qs ∈ Π2, qs−1 = kl, l > j
Proof: Since we traverse the paths from qn to rm (Figure 3.10), the initial path k1 = qn and the final path km+n = rm. We prove the above claim by induction.

Base Case: To prove that the claim holds for k1. The cut on the path k1 = qn corresponds to the initial cut (C1 † P1). The child (X0, Yn−1) of C1 is infrequent. Consider the node N = (X1, Yn−1). If node N is frequent, cut ((X0, Yn−1) † N) on the path qn−1 is found by lines 8-10 of the ExpandCut algorithm. If node N is infrequent, cut (N † P1) on the path r1 is found by lines 6-7 of the ExpandCut algorithm. Hence the next cut is found either on path qn−1 or on path r1, so conditions 1 and 2 of the claim are satisfied.

Induction Step: Assuming that the claim holds for k1, . . . , ks, s < m + n, we prove that the claim holds for ks+1. Let r1, r2, . . . , ra and qn, qn−1, . . . , qb be the set of paths whose cuts have been found. The shaded area in Figure 3.10 is thus eliminated while finding further cuts.
[Figure 3.10 shows the reachability sequence: the cuts already found on paths r1, . . . , ra and qn, . . . , qb shade out a region of the lattice, and the nodes N1, N2, N3 determine whether the next cut lies on qb−1 or ra+1.]

Figure 3.10 The Reachability Sequence between cuts.
Suppose ks = qb. For a cut on the path qi, if the cuts on the paths r0, . . . , ra have already been seen, then the cut on the path qi is of the form cut ((Xa, Yi) † (Xa+1, Yi)). Hence, the cut on the path qb is cut ((Xa, Yb) † (Xa+1, Yb)), so the node N1 = (Xa, Yb−1) is infrequent while the node N2 = (Xa+1, Yb) is frequent, as shown in Figure 3.10. Consider the node N3 = (Xa+1, Yb−1), which is the parent of N1 and the child of N2. If N3 is frequent, cut (N1 † N3) on the path qb−1 is found by lines 3-4 of the ExpandCut algorithm. On the other hand, if N3 is infrequent, cut (N3 † N2) on the path ra+1 is found by lines 6-7.

Suppose ks = ra. For a cut on the path ri, if the cuts on the paths qn, qn−1, . . . , qb have been seen, then the cut on ri is of the form cut ((Xi, Yb−1) † (Xi, Yb)). Hence, the cut seen on the path ra is cut ((Xa, Yb−1) † (Xa, Yb)). Arguing as in the case of ks = qb, where N1 = (Xa, Yb−1) is infrequent while the node N2 = (Xa+1, Yb) is frequent, it can be shown that if N3 is frequent, cut (N1 † N3) on the path qb−1 is found by lines 3-4 of the ExpandCut algorithm. On the other hand, if N3 is infrequent, cut (N3 † N2) on the path ra+1 is found by lines 11-13.

The next cut is found on either the path qb−1 or the path ra+1. Thus, conditions 1 and 2 of the claim hold for ks+1. Since the claim is true for k1 and we showed that the claim holds for ks+1 whenever it holds for ks, by induction the claim holds for all s = 1, . . . , m + n. □
By the reachability sequence, starting at the initial cut (C1 † P1) on path qn, the cut (C2 † P2) on path rm has been reached. Hence, starting at an initial cut, any cut in a graph Gi ∈ D can be reached; that is, all cuts in Gi can be reached from an initial cut. □

Corollary 1 The MARGIN algorithm finds all maximal frequent subgraphs in D.

The MARGIN algorithm initially finds one cut in each Gi ∈ D. By Claim 1, all cuts in Gi are found. Hence all cuts in lattice Li, ∀i = 1, . . . , n, have been found. The merge operation returns the maximal frequent subgraphs among the f(†) nodes reported.
[Figure 3.11 shows the graph lattice of c−a−c−d, containing the nodes c, d, a−c, c−d, c−a−c, a−c−d and c−a−c−d, with the frequent and infrequent nodes marked.]

Figure 3.11 Example Lattice.
[Figure 3.12 shows (a) the constructed sub-lattice with its nodes labeled by the tuples (Xi, Yj) and the paths r0, . . . , r3 and q0, q1, and (b) the same lattice with the subgraph at each node, from φg up to c−a−c−d.]

Figure 3.12 Constructed Lattice.
3.6.1 Running Example for the proof
Let the graph c−a−c−d be a Gi ∈ D. Consider the graph lattice of c−a−c−d given in Figure 3.11. Assume that the infrequent and frequent nodes in the lattice are as marked in Figure 3.11. The resulting cuts are indicated in the figure. Say we would like to reach the cut (d † φ) from the cut (c−a−c † c−a). Hence C1 = c−a−c, C2 = d, P1 = c−a and P2 = φ. As P1 ∩ P2 = ∅, we construct the lattice as in Case 2 above. In the graph Gi = c−a−c−d, the shortest path connecting C1 and C2 is c−d. Hence m1 = m|Ep| = c−d, which is incident on C1 and C2. The lowermost level of the lattice being constructed is φg at level 0. Level 1 of the lattice Lsub_i consists of the set of vertices Vp = {c, d}, as shown in Figure 3.12(b). The set of graph edges Ep ∪ e_g1 ∪ e_f1 = c−d ∪ a−c, where e_g1 is the graph edge of P1 adjacent to m1 in Gi, forms the nodes of level 2 of the lattice Lsub_i. We next construct g1 and f1 as per Case 2. Subgraph g1 = m1 ∪ a−c = a−c−d. As a−c−d ⊆graph a−c−d, we do not extend g1 further. Similarly, as d ⊆graph c−d, we do not extend f1 any further. For every two nodes in level i > 1 which have a common parent in the lattice, we construct the common child. If the common child already exists in the next level, then we draw the connecting lattice edges to the common child. This results in the lattice shown in Figure 3.12(b).
All supergraphs of C1 = c−a−c and C2 = d in lattice Lsub_i will be infrequent, while all subgraphs of P1 = a−c and P2 = φ will be frequent. Let Y = Y0, . . . , Yn be the ordered sequence of supergraphs of c−a−c, including c−a−c, in lattice Lsub_i such that |Y0| > |Y1| > . . . > |Yn|. Hence, in our example n = 1. Let X = X0, . . . , Xm be the ordered sequence of supergraphs of d, including d, in lattice Lsub_i such that |X0| > |X1| > . . . > |Xm|. Hence m = 3. Every node N ≠ φg in the lattice Lsub_i is now referred to by a unique tuple (Xi, Yj). The nodes of the example we picked are represented as in Figure 3.12(a), where vertex c of Figure 3.12(b) corresponds to X2Y1 of Figure 3.12(a), vertex d of Figure 3.12(b) corresponds to X3Y0 of Figure 3.12(a), and so on. C2 = (Xm, Y0). Further, the sets of paths Π1 and Π2 are represented in the figure.

Notice that for any path ri ∈ Π1, (Xi, Y0) is infrequent while (Xi, Yn) is frequent for all i > 0. Hence, there has to be exactly one lattice edge incident on one frequent and one infrequent node; that is, for each rj ∈ Π1, j > 0, there exists exactly one cut. Arguing similarly, (X0, Yi) is infrequent while (Xm, Yi) is frequent for all i > 0. Hence, there lies one cut on each qj ∈ Π2, j > 0. Thus, there are m + n = 3 + 1 cuts in the lattice Lsub_i. Given the initial cut, we are to reach the final cut. Cut (C1 † P1) = cut ((X0, Y1) † (X1, Y1)) by the construction of the lattice Lsub_i, and it lies on the path qn = q1. The final cut (C2 † P2) = cut ((X3, Y0) † (X3, Y1)) to be reached lies on the path r3. Hence we show the reachability sequence between paths q1 and r3. The Reachability Sequence in the proof above stated that the order of paths k1, . . . , km+n on which the cuts from path qn to rm are found, where ki ∈ Π1 or ki ∈ Π2, satisfies the following:
1. for kj = rs ∈ Π1, rs+1 = kl, l > j
2. for kj = qs ∈ Π2, qs−1 = kl, l > j
This can be illustrated in the example as follows. The paths in Π1 are r0, r1, r2, r3 while the paths in Π2 are q0, q1. Since we traverse the paths from q1 to r3, the initial path k1 = q1 and the final path km+n = r3. The cut on the path k1 = q1 corresponds to the initial cut (c−a−c † a−c), with C1 = X0Y1. The child (X0, Y0) of c−a−c is infrequent. Consider the node N = (X1, Y0). As the node N is infrequent, cut (N † P1) on the path r1 is found by lines 6-7 of the ExpandCut algorithm. Hence the next cut is found on path r1.

Hence k1 = q1 and k2 = r1. We next consider the case of ks = ra in the proof above. The cut seen on the path r1 is cut ((X1, Y0) † (X1, Y1)). Let N1 = X1Y0, N2 = X2Y1 and N3 = X2Y0; then, as N3 is infrequent, cut (N3 † N2) on the path r2 is found by lines 11-13.

Now k1 = q1, k2 = r1 and k3 = r2. Again considering the case of ks = ra, the cut seen on the path r2 is the cut ((X2, Y0) † (X2, Y1)). Again, let N1 = X2Y0, N2 = φ and N3 = X3Y0; then, as N3 is infrequent, cut (N3 † N2) on the path r3 is found by lines 11-13.

Hence, we have reached the cut on r3 from the cut on q1, as stated in the reachability sequence, in the order q1, r1, r2, r3.
3.7 Optimizing MARGIN
In this section we discuss the pruning that MARGIN achieves by the approach it takes, followed by further optimizations that can be applied to the naive MARGIN algorithm. The optimizations introduced are very effective in reducing the run-time of the MARGIN algorithm: results show that they achieve about a 50% reduction in running time.
3.7.1 Discussion
Consider a cut (C † P) in the lattice L1 of the graph G1. Since P ⊂graph G1 is frequent (see Figure 3.13), all the parents of P are frequent, due to which no I(†) node (hence no cut) can lie among the subgraphs of P in the lattice Li. Similarly, all the supergraphs of C being infrequent, no f(†) node (hence no cut) can lie among the supergraphs of C. Hence the subgraph space of P and the supergraph space of C can be pruned out of the candidate set for maximal frequent subgraphs, and each cut found prunes our search space substantially. MARGIN traverses all the nodes on the cuts in the lattice Li ∀i, the infrequent parents of the I(†) nodes, and their infrequent parents. The remaining nodes of the lattice need not be explored.

Frequent subgraphs, when the count value of each frequent subgraph need not be reported, can be obtained by reporting all subgraphs of MF or of the f(†) nodes. Furthermore, techniques in [53] can be adapted to provide an approximate frequency count value for the frequent subgraphs.
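With subgraphs modelled as frozensets of edges, the pruning that a single cut (C † P) affords can be sketched as follows. This is an illustrative simplification: set inclusion stands in for the subgraph relation, and the function name is our own.

```python
def prune_candidates(candidates, C, P):
    """Once the cut (C † P) is found, no maximal frequent subgraph can be a
    proper subgraph of the frequent node P, and no frequent subgraph can be
    a supergraph of the infrequent node C; drop both regions."""
    return {g for g in candidates if not (g < P or g >= C)}
```

P itself survives the pruning, since it is an f(†) node and therefore a candidate maximal.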
[Figure 3.13 shows lattices L1 and L2 with the subgraphs and supergraphs of P marked and a common cut region shared between the two lattices.]

Figure 3.13 Exploiting the commonality of the f(†)-nodes
3.7.2 Optimizations
Optimizations that reduce the number of revisited cuts are essential in order to reduce the computation time spent on verifying whether a cut has been visited previously. We discuss these optimizations separately in order to keep the main algorithm simple. A few of the optimizations that help speed up the results are given below.
• (A): Let (C † P) be the cut on which ExpandCut has been invoked. Line 1 of the ExpandCut algorithm computes the parents of C. Consider a frequent parent Yi of C. If there exists at least one frequent child CYi of Yi, then by the Upper-3 property there exists a node M which is the common child of CYi and C. C being infrequent, M is infrequent. It can be shown that lines 5-10 of the ExpandCut algorithm, which iterate over all the children of Yi, can be replaced by calling ExpandCut on just one cut (M † CYi). This reduces the number of revisited cuts. Also, a single frequent child CYi needs to be generated instead of all the children of Yi in line 5, and line 4 is executed only if there does not exist any frequent child CYi of Yi.
• (B): Consider an invocation of ExpandCut on cut (C † P). The parents Y1, . . . , Yc of C are computed in line 1. For some Yi, P = Yi, since P is a parent of C. Therefore, in line 5 all the children Cc1, . . . , Cck of P are computed and explored. If any child Cci ≠ C is infrequent, then ExpandCut is invoked on cut (Cci † P). In the invocation of ExpandCut on the cut (Cci † P), the children of P are recomputed and revisited. Since the children of P are already explored in the invocation of ExpandCut on the cut (C † P), their re-exploration can be avoided in the invocation of ExpandCut on cut (Cci † P) by passing the appropriate information.
• (C): Consider an invocation of ExpandCut on cut (C † P). Lines 11-13 of the algorithm check for infrequent parents Yi of C. If Yi is found among the infrequent subgraphs already visited, then the ExpandCut invoked on cut (C † P) skips executing lines 12-13 on Yi.
• (D): Line 13 of the ExpandCut algorithm finds the parents of the infrequent subgraph Yi. For the new ExpandCut invoked in line 15, this parent information can be passed as an argument so that the recomputation of parents is avoided in line 1 of the new ExpandCut.
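Optimizations C and D amount to caching: a set of infrequent subgraphs already visited (C) and memoized parent lists (D). A hedged Python sketch, with subgraphs again modelled as frozensets of edges and all names our own:

```python
from functools import lru_cache

seen_infrequent = set()          # optimization C: infrequent subgraphs seen so far

def visit_infrequent(g):
    """Return True only the first time the infrequent subgraph g is seen;
    on later visits the infrequent-parent handling can be skipped."""
    if g in seen_infrequent:
        return False
    seen_infrequent.add(g)
    return True

@lru_cache(maxsize=None)
def parents_of(g):
    """Optimization D: memoize parent computation so a recursive ExpandCut
    invocation need not recompute the parents of the same subgraph."""
    return tuple(frozenset(g - {e}) for e in g)
```

Since frozensets are hashable, `lru_cache` can serve as the pass-along of parent information between nested ExpandCut invocations.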
3.7.3 The Replication and Embedded Model
As mentioned earlier, either the embedded or the replication model can be used by the MARGIN algorithm. Let Cr, Pr ∈ Gi be a pair of child-parent nodes in the replication model. Then the graphs Ce and Pe in the embedded model that are isomorphic to the subgraphs Cr and Pr would form a child-parent pair in the embedded model. Hence, every child-parent edge that exists in the lattice of the replication model also exists in the embedded model. Thus, the sequence of cuts in the embedded model would follow the same sequence of child-parent cut edges as in the replication model.

Let Ei ∈ Gi and Ej ∈ Gj be two subgraphs isomorphic to g, and let both Ei and Ej be f(†)-nodes. In the replication model, the counts for Ei and Ej are accumulated in the ExpandCut calls of Gi and Gj respectively. Though the counts of Ei and Ej would be the same, the count is computed twice due to the individual lattices for Gi and Gj. Since in the embedded model there is only one node that corresponds to g, it will be visited only once in the ExpandCut function and hence the frequency will be computed only once. A subgraph in the embedded model is reported as maximal if all its children over the entire graph database are found to be infrequent, which is the same as being globally maximal (maximal with respect to the entire graph database). In the replication model, however, we gather locally maximal subgraphs and perform an additional step to identify the globally maximal set.

With respect to the two models, there is a trade-off between the amount of information that is to be stored and the number of times the frequency of the subgraphs that lie on the cut is computed. In our implementation, using the replication model, we traverse each graph Gi one at a time, and information such as the cuts already visited and the frequency counts of nodes on the cuts found in each iteration of Gi is flushed out. Hence, the amount of information stored reduces. However, if a subgraph belongs to cuts of multiple graphs, its frequency needs to be recomputed in the iteration of each Gi whose cut it belongs to. On the other hand, in the embedded model, the frequency counts of subgraphs that lie in the overlapping region of the cuts of multiple graphs are computed only once. The common cut region shown in Figure 3.13 belongs to the lattices L1 and L2 and will be explored only once in the embedded lattice. However, this comes at the cost of increased storage space. If the intersection of the f(†) nodes of the graphs in the graph database is expected to be low, the replication model is preferred. In the embedded model, all instances of a subgraph that does not occur as a candidate node in a given Gi but occurs as a candidate node in some Gj≠i have to be found, and all children of such subgraphs also have to be accounted for, and so on. In the replicated model, instead, not more than one instance in Gi would be visited for the frequency count. Thus, with minor modifications, the ExpandCut algorithm can be applied to the embedded model.
Chapter 4
Experimental Evaluation
We implemented the MARGIN algorithm and tested it on both synthetic and real-life datasets. We ran our experiments on a 1.8GHz Intel Pentium IV PC with 1 GB of RAM, running Fedora Core 4. The code is implemented in C++ using STL, the Graph Template Library [1] and the igraph library [3]. We conducted comparative experiments with the gSpan executable [65], our implementation of the SPIN algorithm, and the CloseGraph [66] executable. We compare with gSpan in order to show the saving MARGIN makes against the time of an algorithm that explores the lattice space below the "border". We compare with CloseGraph as it also prunes the search space to report a subset of the frequent subgraphs and a superset of the maximal frequent subgraphs. Since SPIN generates maximal frequent subgraphs, we compare with it as well. We give comparative timing results with gSpan, and both timing and generic operation comparisons with SPIN.

Our experimental results show that MARGIN runs more efficiently than SPIN and gSpan on synthetic datasets and denser real-life datasets. The number of lattice node computations at low support values shows that MARGIN visits about one-fifth of the search subspace of SPIN. Also, the costs of the operations involved in SPIN and MARGIN are comparable, while the difference in the number of operations is huge. We generated all maximal frequent subgraphs from the frequent subgraphs obtained by gSpan and cross-validated the results with those of MARGIN and SPIN.

We generated the synthetic datasets using the graph generator software provided in [47]. The graph generator generates the datasets based on six parameters: D (the number of graphs), E and V (the number of distinct edge and vertex labels, respectively), T (the average size of each graph), I (the average size of frequent graphs) and L (the number of frequent patterns as frequent graphs).
4.1 Analysis of MARGIN
For varying values of the database size and support, we show in Table 4.1 the total number of nodes in the lattice, the number of frequent nodes in the lattice, the frequent nodes visited by MARGIN (the set of candidate nodes in MARGIN), the extra set of infrequent nodes that MARGIN visits in order to find the representative, and the saving MARGIN makes in terms of the number of nodes visited as
L = 5, E = 50, V = 50, I = 6, T = 9

   D   Sup%   Total No. of   Frequent   Candidate    Nodes to find    Gain in Nodes
              Lattice Nodes   Nodes     Freq. Nodes  Representatives  Explored
  200     8        293359      5908        1946            20             3942
  300    12        342463      5908        1946            24             3938
  400    16        352371      5908        1946            27             3935
  500    20        462611      5908        1948            30             3930
  600    24        800855      5908        1948            34             3926
  700    28        800855      5908        1948            34             3926
  800    32        872536      5908        2034            36             3838
  900    36        894800      6414        4278            38             2098
 1000    40       1648819      3028        2806           107              115
 1100    44       1708471      2867        2668           115               84
 1200    48       2137059      3028        2806           120              102
 1300    52       2165766      2062        1888           123               51
 1400    56       2173221      2818        2627           124               67
 1500    60       2274580      2902        2753           131               18

Table 4.1 MARGIN: Lattice Space Explored
against the total number of frequent nodes in the lattice. We ignore the set of infrequent nodes visited by MARGIN while running ExpandCut, since the infrequent nodes visited are immediate supergraphs of the frequent nodes in the lattice. Such a set of infrequent nodes has to be visited by any Apriori-based algorithm in order to stop extending the subgraph further during its depth-first search. Table 4.1 shows that the number of infrequent nodes visited to find the representative is much smaller than the total number of infrequent nodes (the difference between the total number of nodes in the lattice and the frequent nodes in the lattice). The values of the parameters used to generate the dataset for this experiment are L = 5, E = 50, V = 50, I = 6 and T = 9. The number of frequent nodes is calculated by running the gSpan algorithm, while the total number of lattice nodes is calculated by running gSpan on the graph database with minSup = 1. The column "Gain in Nodes Explored" thus shows, for the various experimental values, the number of nodes that MARGIN avoided exploring in the lattice space by avoiding a bottom-up approach.

Table 4.1 clearly indicates that the extra set of infrequent nodes visited by MARGIN to find the representative (Figure 1.1) is very small compared to the infrequent or frequent lattice space.
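The "Gain in Nodes Explored" column is consistent with the difference between the frequent nodes and the nodes MARGIN actually visits (candidates plus representatives). A quick arithmetic check on the first and last rows of Table 4.1; the formula is our reading of the table, not stated explicitly in the text:

```python
def gain_in_nodes(frequent, candidate, representatives):
    # Nodes of the frequent lattice space that MARGIN never explores,
    # assuming gain = frequent - candidate - representatives.
    return frequent - candidate - representatives

# First (D = 200) and last (D = 1500) rows of Table 4.1
assert gain_in_nodes(5908, 1946, 20) == 3942
assert gain_in_nodes(2902, 2753, 131) == 18
```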
Table 4.2 shows the effect of optimizations on the MARGIN algorithm. The optimizations listed in this work are labeled A, B, C and D. Optimization B did not significantly affect the computation time. Table 4.2 shows the time taken by the algorithm when all the optimizations are used (column "All Opt"), when no optimizations are used (column "No Opt"), when only optimization A, C or D is used, and when any two of A, C and D are used. We can observe that the time taken by MARGIN when all the optimizations are used is almost 50% less
L = 5, E = 50, V = 50, I = 8, T = 10

   D   Sup%   All Opt   No Opt   Only A   Only C   Only D      CD      AD      AC
              (sec)     (sec)    (sec)    (sec)    (sec)     (sec)   (sec)   (sec)
  100    20     42.56    80.86    70.53    63.61    67.90    56.14   54.90   50.08
         15     44.59    84.13    74.29    65.96    71.39    58.16   57.02   53.02
         10     56.38   116.60   100.63    87.37   105.67    78.33   74.61   66.56
          5     82.66   186.87   158.62   134.60   167.68   123.86  116.02   97.75
  200    40    107.65   198.74   176.42   147.13   165.93   133.59  139.12  126.02
         30    128.62   252.23   227.06   188.58   210.95   166.83  170.97  149.70
         20    130.35   265.24   240.33   194.79   220.33   171.51  175.96  153.57
         10    135.62   303.74   262.48   212.68   262.67   188.38  188.35  160.80

Table 4.2 Effect of Optimizations
than the time it takes when no optimization is used. The values of the parameters used for this dataset are L = 5, E = 50, V = 50, I = 8 and T = 10. The dataset sizes are 100 and 200, and the support values are varied between 5% and 40%.
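The roughly 50% figure can be read off the table directly, e.g. for D = 100 at 20% support. This is a simple arithmetic check, not code from the thesis:

```python
def percent_reduction(no_opt, all_opt):
    # Relative running-time reduction achieved by the optimizations.
    return 100.0 * (no_opt - all_opt) / no_opt

r = percent_reduction(80.86, 42.56)   # D = 100, support 20% in Table 4.2
assert 47 < r < 48                    # close to the reported ~50%
```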
[Figure 4.1 plots the number of ExpandCut invocations against support (0-100%) for datasets of 1000, 2000 and 5000 graphs.]

Figure 4.1 Number of Expand Cuts
Figure 4.1 shows the number of times the ExpandCut algorithm is invoked against varying supportvalues. The experiments are conducted on three datasets of sizes 1000, 2000 and 5000. The values ofthe parameters used to generate the dataset are L=5, E=50, V =50, I=12 and T=15. The Figure showsthat the number of invocations of the ExpandCut algorithm does not vary linearly with the support.Such a non-linear variation is attributed to the non-dependence on level wise traversal but instead to thenumber of f(†)-nodes in the graph lattice. The number of ExpandCut invocations approaches zero forlarge support values since most subgraphs may not find the required support and hence the ExpandCut
algorithm might be moving between φg and single node graphs which terminates quickly. The latticestructure can create an anomaly behavior in MARGIN because of the structure of different graphs.
Figure 4.2 shows the time taken by the MARGIN algorithm for varying support values on the same datasets as those used in Figure 4.1. However, the exact support of each subgraph is not computed.
[Plot: time taken (seconds) vs. % support; curves for Margin on 1k, 2k and 5k graphs]
Figure 4.2 Runtime of MARGIN
[Plot: ratio of time taken (seconds) vs. support % (10-90); curves for D = 100, 200, 300, 400, 500]
Figure 4.3 Runtime of MARGIN for exact support count
The support count function stops when the support exceeds the minSup value or when the number of graphs in which the subgraph does not occur exceeds |D| − minSup. Figure 4.2 thus does not show the time variation that would be incurred if the exact support of a subgraph were computed. Figure 4.3 shows the variation of time with support when the exact support is computed. An increase in time followed by a decrease is seen. For low support values, the border between the infrequent and frequent nodes of the lattice lies in the higher levels of the lattice. As the support value increases, the border moves towards the central levels of the lattice and further down towards the lower levels. Thus, with increasing support values, the number of nodes visited first increases and then decreases, due to which the computational time increases and then decreases. The parameters used in Figure 4.3 for D=100 to 500 are L=5, E=50, V=50, I=5 and T=8. The support is varied from 10% to 90%. With an increase in the size of the database, the time MARGIN takes to find the maximals increases proportionally, as more graphs need to be checked for the support.
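This early-termination rule can be sketched as follows. The sketch below is illustrative Python (the thesis implementation is in C++); the database is represented as a list, and `occurs_in` is an assumed stand-in for the subgraph-isomorphism test:

```python
# Sketch (not the thesis implementation): early-terminating support count.
# Scanning stops once min_sup occurrences are found, or once the number of
# misses guarantees the pattern cannot reach min_sup (|D| - min_sup misses).
def is_frequent(pattern, database, min_sup, occurs_in):
    hits, misses = 0, 0
    max_misses = len(database) - min_sup
    for graph in database:
        if occurs_in(pattern, graph):   # subgraph-isomorphism test (assumed given)
            hits += 1
            if hits >= min_sup:
                return True             # support already reached; stop early
        else:
            misses += 1
            if misses > max_misses:
                return False            # can no longer reach min_sup; stop early
    return hits >= min_sup

# Toy usage with strings standing in for graphs and substring containment
# standing in for subgraph isomorphism:
db = ["abc", "abd", "xyz", "abz"]
print(is_frequent("ab", db, 3, lambda p, g: p in g))  # prints: True
```

Either exit condition alone already halves the scan on average; together they make the cost depend on min(minSup, |D| − minSup) rather than on |D|.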
As in Figure 4.1, the variation of the time taken shows behavior similar to that of the number of ExpandCut invocations, which form the core operation of the MARGIN algorithm.
4.2 Comparing MARGIN with gSpan
[Plot: time taken (in sec.) vs. size of dataset (200-1,000); curves for gSpan and Margin]
Figure 4.4 Running time with 2% Support
L = 5, E = 50, V = 50, I = 12, T = 15
D      MARGIN    gSpan      Ratio: gSpan/MARGIN
       (sec)     (sec)
200    32.73     616.56     18.84
300    44.51     932.43     20.95
400    45.97     1222.2     26.59
500    64.23     1409.55    21.94
600    90.43     1532.23    16.94
700    132.1     1654.32    12.52
800    156.07    1839.38    11.79
900    183.78    2432.43    13.24
1000   265.43    2842.42    10.71
Table 4.3 Running time with support 20
Figure 4.4 shows the result where D is varied between 200 and 1000 graphs. The other values of the parameters used for this experiment are: L=5, E=50, V=50, I=12 and T=15. The minimum support used in each case is 2% of D. Since the average size of each graph is 15 and the average size of each frequent subgraph is 12, the maximal frequent subgraphs tend to lie in the higher levels of the lattice. MARGIN avoids a level-wise traversal of the lattice, which is why MARGIN performs better.
Table 4.3 shows the running times of MARGIN and gSpan on datasets with the number of graphs varying from 200 to 1000. The parameters used are: L=5, E=50, V=50, I=12 and T=15. The last column shows the ratio of the running time of gSpan to that of MARGIN. It can be observed that MARGIN performs better than gSpan.
Figure 4.5 compares the effect of varying the average size of the frequent graphs. In this experiment, we vary the parameter I from 5 to 12 while generating the dataset. The number of graphs generated is 1000 and the minimum support is taken to be 2%. The other parameters are set as follows: E=50,
[Plot: time taken (seconds) vs. avg. size of frequent graph I (4-13); curves for Margin and gSpan with 2% support]
Figure 4.5 Comparison with gSpan: Effect of size of frequent graphs
V=50, L=5 and T=20. It was observed that gSpan performs very efficiently for lower values of I (5-7). With increasing values of I, MARGIN performs better. This is to be expected since, for higher values of I, the lattice space explored by Apriori-based algorithms increases, as larger graphs are expected to be frequent.
We next compare the running times of gSpan and MARGIN for varying ratios of the number of edges to the number of vertices. This measure gives an estimate of the density of the graphs. We denote the edge-to-vertex ratio by the symbol "EVR".
For a higher edge-to-vertex ratio of around EVR=2, the parameters used are D = 100 to 500, L = 5, E = 20, V = 20, I = 15, T = 18, with support values varying from 10 to 50. Tables 4.4 and 4.5 show the results; it can be observed that the running time of MARGIN is significantly less than that of gSpan.
L = 5, E = 20, V = 20, I = 15, T = 18
          D = 100, EVR = 1.98    D = 200, EVR = 2.12    D = 300, EVR = 2.16
Support   gSpan      MARGIN      gSpan      MARGIN      gSpan      MARGIN
          (sec)      (sec)       (sec)      (sec)       (sec)      (sec)
10        3.38       0.28        49.97      0.59        106.42     0.95
20        3.41       0.29        49.98      0.62        110.48     1.01
30        3.38       0.28        50.47      0.64        106.56     1.07
40        3.32       0.52        48.41      1.62        110        3.42
50        3.34       0.72        48.45      1.43        106.32     3.29

Table 4.4 Comparison of gSpan and MARGIN for high edge-to-vertex (EVR) ratio: D=100-300
We repeated the experiment for a low edge-to-vertex ratio where the ratio is less than 1. The parameters used for the generation of the dataset are: L = 5, E = 20, V = 20, I = 6 and T = 8. Tables 4.6 and 4.7 show the results; the running times of gSpan are small in this case.
We also present results for varying edge-to-vertex ratios from EVR=1.25 to EVR=2 in Table 4.8.
L = 5, E = 20, V = 20, I = 15, T = 18
          D = 400, EVR = 2.17    D = 500, EVR = 2.18
Support   gSpan      MARGIN      gSpan      MARGIN
          (sec)      (sec)       (sec)      (sec)
10        163.57     1.3         220.17     1.65
20        164.41     1.41        218.86     1.81
30        164.5      1.51        220.41     1.94
40        162.87     4.22        218.91     5.02
50        163.09     4.12        220.66     4.97

Table 4.5 Comparison of gSpan and MARGIN for high edge-to-vertex (EVR) ratio: D=400, 500
L = 5, E = 20, V = 20, I = 6, T = 8
          D = 100, EVR = 0.91    D = 200, EVR = 0.97    D = 300, EVR = 0.97
Support   gSpan      MARGIN      gSpan      MARGIN      gSpan      MARGIN
          (sec)      (sec)       (sec)      (sec)       (sec)      (sec)
10        0.03       2.42        0.02       9.03        0.03       13.35
20        0.02       2.68        0.01       12.16       0.03       22.17
30        0.02       2.56        0.01       14.61       0.02       25.04
40        0.01       2.16        0.01       7.34        0.02       25.74
50        0.01       1.24        0.01       7.61        0.02       29.43

Table 4.6 Comparison of gSpan and MARGIN for low edge-to-vertex (EVR) ratio: D=100-300, EVR=0.9
We can observe that gSpan takes less time for lower edge-to-vertex ratios while MARGIN performs better for higher values. In sparse datasets, the Apriori-based approaches have the advantage of having to extend to fewer children. As each extension gives rise to a subtree rooted at it, each avoided extension saves considerable computation time. Hence, a decrease in the number of children cascades into improved performance of the Apriori-based approaches, due to which they perform considerably better than MARGIN.
4.3 Comparing MARGIN with SPIN
Table 4.9 shows the number of nodes in the lattice that are explored by the MARGIN algorithm as compared to SPIN when executed on synthetic datasets of various sizes and supports. As can be seen, MARGIN explores about one-fifth of the nodes explored by SPIN.
Figure 4.6 shows a time comparison of SPIN and MARGIN. Time for varying support values is shown for D=500 and D=1000, with the other parameters set to E=10, V=10, L=10, I=5 and T=6
L = 5, E = 20, V = 20, I = 6, T = 8
          D = 400, EVR = 0.92    D = 500, EVR = 0.98
Support   gSpan      MARGIN      gSpan      MARGIN
          (sec)      (sec)       (sec)      (sec)
10        0.03       24.88       0.06       37.92
20        0.02       36.16       0.05       43.53
30        0.02       42.32       0.04       59.59
40        0.02       35.09       0.04       61.6
50        0.02       35.32       0.03       76.16

Table 4.7 Comparison of gSpan and MARGIN for low edge-to-vertex (EVR) ratio: D=400-500, EVR=0.9
L = 5, E = 20, V = 20, I = 15, T = 18, D = 100
          EVR = 1.25          EVR = 1.5           EVR = 1.75          EVR = 2
Support   MARGIN    gSpan     MARGIN    gSpan     MARGIN    gSpan     MARGIN    gSpan
          (sec)     (sec)     (sec)     (sec)     (sec)     (sec)     (sec)     (sec)
10        1062.3    12.17     2.88      21.2      0.6       274.87    0.28      3.38
20        657.53    4.52      1.9       20.93     0.56      261.72    0.29      3.41
30        1491.17   0.43      1.86      21.46     0.52      264.96    0.28      3.38
40        535.97    0.2       1.82      20.95     0.47      263.98    0.52      3.32
50        60.98     0.02      6.99      13.21     2.48      265.43    0.72      3.34

Table 4.8 Comparison of gSpan and MARGIN for various edge-to-vertex (EVR) ratios
and support varying from 1% to 5%. With an increase in the support value, the number of graphs that are frequent reduces and hence the lattice space below the "border" is smaller. It can be seen that with an increase in the support value, the times taken by MARGIN and SPIN reduce to comparable values. However, for smaller values of support, which cause the "border" to be much higher up the lattice, MARGIN performs better than SPIN, as expected.
We ran more experiments to compare MARGIN with SPIN. Figure 4.7 shows the ratio of the time taken by SPIN to that of MARGIN as the support is varied. The size of the dataset is varied from 400 to 700. The other values of the parameters used for the generation of the data are: L = 5, E = 50, V = 50, I = 10 and T = 11. The x-axis shows the support percentage, which is varied between 2.5% and 5% in increments of 0.5%. The y-axis shows the ratio of the time taken by SPIN to the time taken by MARGIN. It can be seen that the ratio ranges from 11 to 18 for the parameters listed above. We have conducted further experiments and found that the ratio depends on the parameters used.
Table 4.10 shows the time taken by MARGIN and SPIN, along with their ratio, which is plotted in Figure 4.7.
E = 10, V = 10, L = 10, I = 5, T = 6
DataSet        Lattice Nodes Visited
D (Support%)   SPIN      MARGIN
100 (2)        43,861    9,311
200 (2)        54,026    10,916
300 (2)        57,697    14,954
400 (2)        58,929    42,201
100 (5)        42,584    9,930
200 (5)        49,767    12,318
300 (5)        52,118    24,660
400 (5)        54,726    44,686
500 (2)        32,556    12,619
500 (3)        21,669    10,078
500 (4)        9,187     9,912
500 (5)        4,162     8,264
Table 4.9 Lattice Space Explored
[Plot: time taken (seconds) vs. support % (1-5); curves for Margin and SPIN at D = 500 and D = 1000]
Figure 4.6 Comparison of MARGIN with SPIN
We also ran experiments to compare the memory used by gSpan, SPIN and MARGIN. We measured the memory used by the executables of these algorithms as the maximum value of the "percentage memory" reported by the "top" command [6] (a standard tool available on all Linux distributions) during the entire course of execution of the binary. In order to determine this maximum percentage memory, we used the library Libgtop [4] in the implementation.
Table 4.11 shows the memory values for the two datasets used for the generation of the results reported in Figure 4.6, for which the dataset parameters are: D = 500 and 1000, E = 10, V = 10, L = 10, I = 5 and T = 6, with the support varied between 1% and 5%.
As Table 4.11 shows, the memory taken by MARGIN is more than that of gSpan and SPIN. This is because the number of subgraph patterns that must be stored in main memory at any instant of time is much smaller for gSpan and SPIN than for MARGIN. gSpan visits the subgraphs in a depth-first manner and extends them without having to store the intermediate nodes. Similarly,
[Plot: ratio of time taken vs. support % (2.5-5); curves for D = 400, 500, 600, 700]
Figure 4.7 Ratio of SPIN to MARGIN
SPIN generates the maximal tree patterns, which are then extended to maximal subgraphs with few intermediate nodes being stored. On the other hand, the memory incurred by MARGIN can be large because it needs to store the nodes on the border of each graph lattice in order to avoid running the ExpandCut function on cuts visited earlier. As we have seen, gSpan and SPIN take less memory compared to MARGIN.
An estimate of the number of nodes visited in the lattice has been provided in Table 4.9. The database parameters are similar to those of Table 4.12. Since we implement the replication model, we include the count of all subgraphs visited by MARGIN, while for SPIN we accumulate only the number of non-isomorphic subgraphs generated. The number of distinct graphs visited by MARGIN can only be less than the count that includes all the isomorphic forms. It can be seen that MARGIN visits less than one-fifth of the space of SPIN.

Comparison of Generic Operations

Since time comparison is not a good measure to compare algorithms, we include a comparison of the complexity of the most frequent operations of both algorithms. We consider the subgraph and graph isomorphism operations of the MARGIN algorithm and the subtree isomorphism and maximal-CAM-tree operations of the SPIN algorithm.
In SPIN, in order to calculate the frequency of a subgraph, the information about the occurrences of the parents, which is passed down the levels, is utilized. Instead, MARGIN requires a subgraph isomorphism check for the frequency computation of a graph g, where g is generated from its child. Although the problem of determining whether g is a subgraph of G is hard in theory, based on all the experiments conducted by us we find it to be of O(n^4), where n is the number of edges in g. In our experiments, for each subgraph isomorphism operation we calculate the count c of the total number of edges of G that are visited. We then determine k = log_n c, which expresses the complexity of the subgraph isomorphism operation in terms of n. Running the MARGIN algorithm on various datasets with varying parameters, we estimate the maximum value of k to be 3.4. Moreover, for almost all operations, k was found to be less than 2.1. This is much lower than the complexity of subtree isomorphism, which is O(t·n·√t) [19], where t is the size
L = 5, E = 50, V = 50, I = 10, T = 11
D     Sup%   MARGIN   SPIN     SPIN/MARGIN
             (sec)    (sec)
400   2.5    34.34    602.59   17.55
400   3      31.44    545.01   17.33
400   3.5    29.9     517.11   17.29
400   4      35.43    497.66   14.04
400   4.5    37.58    507.41   13.50
400   5      37.99    476.8    12.55
500   2.5    41.34    728.39   17.62
500   3      36.53    621.26   17.01
500   3.5    39.81    629.83   15.82
500   4      47.34    607.22   12.83
500   4.5    49.83    618.36   12.41
500   5      50.09    607.91   12.14
600   2.5    53.09    910.56   17.15
600   3      57.34    873.87   15.24
600   3.5    61.86    796.11   12.87
600   4      61.48    773.28   12.58
600   4.5    64.75    790.85   12.21
600   5      64.07    772.16   12.05
700   2.5    65.08    906.51   13.93
700   3      71.3     906.51   12.71
700   3.5    76.78    930.76   12.12
700   4      76.62    901.86   11.77
700   4.5    80.56    924.3    11.47
700   5      79.57    903.9    11.36
Table 4.10 Comparison of MARGIN and SPIN
of the larger tree in which the tree of size n is being searched. A similar analysis of SPIN gives tree isomorphism to be of O(n^2) and maximum-CAM-tree computation to be of O(n^3).
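The exponent estimate described above amounts to computing k = log_n c for each isomorphism call. A minimal sketch, assuming n > 1 and c ≥ 1:

```python
import math

# Sketch of the empirical complexity estimate: if a subgraph-isomorphism call
# visits c edges of G for a pattern with n edges, its cost is modeled as
# O(n^k) with k = log_n(c) = ln(c) / ln(n).
def exponent(n_edges, edges_visited):
    return math.log(edges_visited) / math.log(n_edges)

# e.g. a pattern with 10 edges whose check visited 1000 edges behaves like O(n^3)
print(round(exponent(10, 1000), 2))  # prints: 3.0
```

Aggregating this value over all isomorphism calls of a run gives the maximum and typical k values quoted in the text.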
Table 4.12 shows the number of generic operations in SPIN and MARGIN for various supports and database sizes. The first eight rows correspond to results on datasets with parameters D = 100-400, T = 15, L = 5, I = 7, V = 8, E = 8 and support values of 2% and 5%. In SPIN, the generic operations include tree isomorphism and finding the maximum CAM tree of a graph. The remaining operations, such as candidate generation and associative edge pruning, have been ignored. The generic operations of MARGIN include the subgraph isomorphism checks for count computation. The only operations of MARGIN that are ignored are the generation of parents and children. The operations of order O(n^2) and those of order greater than O(n^2) are listed separately, as shown in Table 4.12. We can observe that, with an increase in database size at constant support, MARGIN performs two to three times fewer operations than SPIN.
L = 10, E = 10, V = 10, I = 5, T = 6
D      Support %   gSpan Memory   SPIN Memory   Margin Memory
500    1           0.14           5.79          15.03
500    2           0.14           2.69          14.52
500    3           0.08           2.23          11.04
500    4           0.08           1.58          11.56
500    5           0.07           1.48          11.40
1000   1           0.17           13.58         24.56
1000   2           0.16           5.65          27.33
1000   3           0.16           4.47          23.12
1000   4           0.14           2.98          23.94
1000   5           0.14           2.82          10.35
Table 4.11 Memory Comparison of gSpan, SPIN and MARGIN
For datasets with small graphs, as the support increases, the lattice space below the "border" decreases. For the next four rows of Table 4.12 we use D = 500, T = 8, L = 5, I = 7, V = 8 and E = 8 with support varying from 2% to 5%. As can be observed, the performance of MARGIN relative to SPIN degrades with an increase in support for small graphs, leading to better performance of SPIN in some cases. A similar explanation holds for the last row of Table 4.9.
The last four rows of Table 4.12 correspond to results with parameters set to D = 100, L = 5, V = 8, E = 8 and a support of 2% for varying values of T and I. As T increases from 5 to 20 with I set to 75% of T, it is noticed that the ratio of the number of operations of SPIN to that of MARGIN goes up to 20. This is because, as T increases, the lattice space below the "border" increases and thus SPIN explores a bigger space as compared to MARGIN.
4.4 Comparing MARGIN with CloseGraph
[Plot: ratio of CloseGraph to Margin vs. support % (5-25); curves for D = 100, 200, 300, 400, 500]
Figure 4.8 Time comparison with CloseGraph. I = 20, T = 25
DataSet       O(n^2)                 greater than O(n^2)
D (Support)   SPIN        MARGIN     SPIN       MARGIN

L = 5, E = 8, V = 8, T = 15, I = 7
100 (2)       1,76,781    39,669     7,956      466
200 (2)       2,35,457    53,738     14,181     7,289
300 (2)       2,86,788    72,270     17,515     8,931
400 (2)       3,28,273    1,33,533   19,432     11,722
100 (5)       1,79,081    58,229     6,156      2,349
200 (5)       1,93,598    75,270     10,523     7,962
300 (5)       2,47,826    1,00,225   15,568     10,368
400 (5)       3,24,216    1,28,363   17,164     13,954

L = 5, E = 8, V = 8, T = 8, I = 7
500 (2)       2,57,626    1,84,301   33,669     15,937
500 (3)       1,85,442    1,43,021   16,764     11,898
500 (4)       87,883      96,703     8,677      10,145
500 (5)       24,770      77,637     5,747      9,454

D = 100, L = 5, E = 8, V = 8, Supp = 2%
T (I)         SPIN        MARGIN     SPIN       MARGIN
5 (4)         63,194      20,569     5,328      304
10 (8)        1,45,772    36,999     12,220     668
15 (11)       3,11,147    53,937     17,220     783
20 (15)       10,42,997   1,64,992   1,28,910   3,378
Table 4.12 Generic Operations
We compare MARGIN with CloseGraph [66] since CloseGraph finds the closed frequent subgraphs, which are a superset of the maximal frequent subgraphs. Figures 4.8 and 4.9 show the ratio of the time taken by CloseGraph to that of MARGIN. The dataset parameters used are: D = 100-500, E = 20, V = 20 and L = 10, with support varying from 5 to 25. Figure 4.8 uses parameters I = 20 and T = 25 while Figure 4.9 uses I = 25 and T = 30.
It can be seen that on increasing the values of I and T, the ratio of the time taken by CloseGraph to that of MARGIN increases. This is because the lattice space to be explored increases for Apriori-based algorithms.
4.5 Analysis of MARGIN-d
We provide an analysis of the behaviour of disconnected maximal frequent subgraph mining and show results against the connected maximal frequent subgraph version.
It was observed that with increasing values of I the time taken first increases and then reduces, as in Table 4.13. Note that the graph lattice has fewer nodes at the lower and higher layers of the lattice as against
[Plot: ratio of CloseGraph to Margin vs. support % (5-25); curves for D = 100, 200, 300, 400, 500]
Figure 4.9 Time comparison with CloseGraph. I = 25, T = 30
the middle layers of the lattice. Hence, for large values of I close to T, the maximal subgraphs lie higher up the lattice, which reduces the number of nodes visited. Similarly, for very low values of I, the nodes visited lie close to the empty graph; hence the number of nodes visited reduces, and with it the running time.
D=100, L=20, E=20, V=20, T=12, Support=80%

DataSet Size (I)   Margin Time (sec)   Margin-d Time (sec)
100 (5)            4.23                11.8
100 (7)            5.49                15.31
100 (9)            5.91                22.96
100 (11)           4.92                12.98
Table 4.13 Effect of varying I
With a decrease in E and V, the similarity among the generated graphs increases. This leads to larger frequent subgraphs. Also, since more labels tend to match due to the higher similarity, detecting the correct component mapping becomes complex and hence increases the running time. Hence, in MARGIN-d, for increasing values of E and V the time decreases, as seen in Table 4.14.
D=100, L=20, I=5, T=7, Support=80%

DataSet Size (E,V)   Margin Time (sec)   Margin-d Time (sec)
100 (20)             2.34                6.1
100 (15)             2.32                7.3
100 (12)             2.36                8.59
100 (8)              3.06                19.8
100 (5)              3.53                20.44
Table 4.14 Effect of varying E, V
The running time of MARGIN was found to be on average five times less than that of MARGIN-d, as shown in Table 4.15. The subgraph isomorphism operation for disconnected maximal frequent subgraphs is more expensive than that for connected maximal frequent subgraphs. Also, with an increase in the size of the database, the time increases, as the subgraph isomorphism operation needs to iterate through more graphs of the database to find the frequency of a subgraph. Every maximal connected frequent subgraph is a subgraph of some maximal disconnected frequent subgraph for the same support, which might lead to a reduction in the number of maximal frequent disconnected subgraphs. However, a set of non-maximal subgraphs together might form a maximal frequent disconnected subgraph, which might lead to an increase in the subgraphs explored. The increase in running time can be attributed to the cost of the subgraph isomorphism operation. The search space of Apriori techniques, where all connected as
D=200-700, L=20, I=5, T=7, E=20, V=20, Support=90%

DataSet Size   Margin Time (sec)   Margin-d Time (sec)
200            4.22                15.4
300            7.92                37.89
400            10.67               60.88
500            16.27               90.76
600            20.34               140.27
700            21.13               140.14
Table 4.15 Effect of Varying Database Size D
well as disconnected children of a frequent node have to be computed, is expected to be much larger than that of its connected counterpart.
In Apriori techniques, the property that a connected maximal frequent subgraph is a subgraph of some disconnected maximal subgraph does not help in reducing the number of nodes. In order to find a disconnected maximal frequent subgraph, an Apriori technique would have to explore all subgraphs of it, both connected and disconnected. Hence the number of nodes visited by Apriori techniques has to increase.
Earlier, we estimated MARGIN to do three to four times better than SPIN. Here, we have estimated MARGIN to do five times better than MARGIN-d. Hence, MARGIN-d takes approximately twice the time taken by SPIN for connected graphs. SPIN extended to disconnected subgraphs would hence do better than MARGIN-d only if the increase in the time taken by SPIN for disconnected graphs were less than double that of SPIN for connected graphs. Let us consider a generous approximation of an average of (e − k)/2 generated children for each subgraph in the connected scenario, as against a fixed e − k generated children in the disconnected scenario, where e is the total number of edges and k the number of edges in the subgraph. It can also be seen that the cost of canonical minimal tree finding for disconnected subgraphs is higher than for connected subgraphs. Taking these factors into consideration, the increase in the running time of Apriori-based techniques can be estimated to be much more than double the present running time.
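The rough arithmetic behind this estimate can be written out as follows; this is a sketch using the assumed averages from the text, not measured values:

```python
# Back-of-the-envelope sketch of the child-count estimate: a connected
# pattern containing k of the e edges is assumed to generate about (e - k)/2
# children, while a disconnected pattern may be extended by any of the
# remaining e - k edges.
def expected_children(e, k, connected=True):
    return (e - k) / 2 if connected else float(e - k)

# the disconnected search generates roughly twice as many children per node
assert expected_children(50, 10, connected=False) == 2 * expected_children(50, 10)
print(expected_children(50, 10), expected_children(50, 10, connected=False))  # prints: 20.0 40.0
```

Since each extra child roots an extra subtree of candidates, this factor of two compounds level by level, which is why the slowdown of an Apriori-based approach on disconnected patterns is expected to exceed a mere doubling.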
Chapter 5
Applications of Margin
In this chapter we present various applications of the Margin algorithm. Section 5.1 generalises the ExpandCut technique used in the Margin algorithm so that it can be applied to areas outside graph mining. Section 5.2 presents the application of Margin to an RNA database. Section 5.3 presents other real-life experiments conducted comparing Margin with existing algorithms.
5.1 Applications of the ExpandCut technique
The procedure adopted in the MARGIN algorithm is not limited to finding maximal frequent subgraphs alone. The ExpandCut technique can be used to solve a wider range of problems. In this chapter we discuss a generic platform to which the ExpandCut technique can be applied. We first state the properties that need to hold for applying this framework. We then give a few scenarios where the ExpandCut technique can be applied. In Section 5.1.2 we illustrate how a specific class of OLAP queries can be answered using the ExpandCut technique.
5.1.1 Framework for applying ExpandCut
We list below a set of properties that should be present in a problem statement in order to apply theExpandCut technique to solve it.
1. The search space is a subset of elements on a lattice.
2. The Upper-3-property (Section 3.3) holds.
3. A constraint C is anti-monotone if and only if for any pattern S not satisfying C, none of the super-patterns of S can satisfy C. In the case of maximal subgraph mining, the pattern S is a subgraph and the constraint C refers to the subgraph being frequent. Similarly, a constraint C is monotone if and only if for any pattern S satisfying C, every super-pattern of S satisfies C. The elements of the lattice need to satisfy either the monotone or the anti-monotone property in order to apply the ExpandCut technique.
47
4. A candidate set can be defined which is a "boundary" set such that, in an anti-monotone lattice, every element in the set satisfies a given user constraint and has an immediate child in the lattice that does not satisfy the constraint. Likewise, in a monotone lattice, every element in the set has an immediate parent that does not satisfy the constraint.
5. The solution set can be generated from the candidate set.
Note that the candidate set could be a set of sets such that the elements of each set act as a boundary set.

Examples: The ExpandCut function can be used in solving the following problem statements that
satisfy the properties listed above.
1. The closed frequent itemsets and subgraphs define layers of "boundaries", where each layer acts as a boundary for a particular value of support. The uppermost layer acts as the boundary region between elements of frequency above and below the user-defined threshold. Hence the candidate set here is a set of sets.
2. ExpandCut can be adopted for mining minimal itemsets such that the sum of the costs of the items in the itemset exceeds a given threshold (for positive costs). This is an example of a lattice in which the monotone property is satisfied.
3. Given a data cube lattice, ExpandCut can be used to find all data cubes that satisfy a given user constraint that follows a monotonic property. For example, we could report all the subsets of attributes for which the sum of sales in their data cube is greater than a given user threshold.
4. The problem of finding maximal frequent cliques satisfies the listed conditions.
5. ExpandCut can be used to find the frequent itemsets satisfying a given user support range x1 to x2, where x1 > x2. While the lattice elements satisfy the anti-monotone property, the resultant set lies between two boundary regions. The lower boundary separates the elements of the solution from the elements of support greater than x1. Similarly, the upper boundary separates the elements of the solution from the elements of support smaller than x2. Here the candidate set is a subset of the solution set, which can be generated from the candidate set.
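As a toy illustration of the anti-monotone property from item 3 above, applied to itemset support (an illustrative sketch, not code from the thesis):

```python
from itertools import combinations

# Illustrative sketch of the anti-monotone property of support: extending an
# itemset can never increase its support, so if an itemset is infrequent, the
# entire sub-lattice of its supersets can be pruned.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]

def support(itemset):
    # number of transactions containing the itemset
    return sum(1 for t in transactions if itemset <= t)

items = {"a", "b", "c"}
# verify: support never grows when an itemset is extended by one item
for k in range(1, len(items)):
    for base in map(frozenset, combinations(items, k)):
        for extra in items - base:
            assert support(base | {extra}) <= support(base)
print(support(frozenset({"a"})), support(frozenset({"a", "b"})))  # prints: 3 2
```

The monotone case from item 2 is the mirror image: once the cost sum of an itemset exceeds the threshold, every superset also does.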
5.1.2 Using ExpandCut to process OLAP queries
A data cube is a data abstraction that allows one to view aggregated data from a number of perspectives. Conceptually, the cube consists of a core or base cuboid, surrounded by a collection of sub-cubes/cuboids that represent the aggregation of the base cuboid along one or more dimensions. The dimension to be aggregated is called the measure attribute, while the remaining dimensions are known as the feature attributes.
In total, a d-dimensional base cube is associated with 2^d cuboids. Each cuboid represents a unique view of the data at a given level of granularity. Not all these cuboids need actually be present, since any
cuboid can be computed by aggregating across one or more dimensions in the base cuboid. Nevertheless, for anything but the smallest data warehouses, some or all of these cuboids may be precomputed so that users may have rapid query responses at run time.
The data cube lattice depicts the relationships between all 2^d cuboids in a given d-dimensional space (see Figure 5.1). Starting with the base cuboid, which contains the full set of dimensions, the lattice branches out by connecting every node with the set of child nodes/views that can be derived from its dimension list. In general, a parent containing d dimensions is connected to d views at the next level in the lattice.
Figure 5.1 Data Cube Lattice
The commonly used operations on a data cube for OLAP analysis are the roll-up, drill-down and slice-and-dice operations. These operations allow an analyst to view the data at different levels of granularity for deriving useful business logic. However, certain applications might require finding the subsets of dimensions for which the aggregated measure value is more than or less than a given threshold. For example, consider a company data warehouse with product, time, place and manufacturer as the feature attributes and sales as the measure attribute. One such OLAP query involves finding the subsets of the feature attributes such that their total sales is more than a given user threshold t. The number of rows that satisfy a query with a condition, say year=X, will be more than or equal to the number of rows that satisfy the conditions year=X and place=Y. Hence, more rows qualify as the number of features reduces, and the larger will be the sum over the rows that qualify. Thus, if for a subset of features s the total sales is greater than the required threshold t, then, for every subset of s, the total sales will also be greater than t. Since the number of feature attributes of a typical data warehouse is large, finding all such subsets can be prohibitively expensive. However, the subsets of interest can be computed from the set of maximal attribute subsets. The threshold t is different from the notion of support used in frequent mining algorithms. The term support refers to a user-desired frequency value; that is, the user is interested in those graphs or itemsets whose frequency is above the support value. On the other hand, in the data cube scenario, the user is interested in only those data cubes whose aggregated attribute value for any row is above the threshold value t.
The data cube lattice satisfies the properties required by the ExpandCut technique as given in Section 5.1.1. Hence, ExpandCut can be used to find the maximal attribute subsets whose aggregated measure value is more than or less than a given user threshold. Given two subsets A, B of attributes, the
cut(A†B) exists if the aggregated measure value of A exceeds the user threshold while that of B does not, and B is an immediate superset of A.
We modified the MARGIN algorithm to run on a data cube lattice. ExpandCut now generates subsets and supersets of a dimensional subset as parents and children respectively. A dimensional set qualifies as frequent if its aggregated minimum measure value is greater than t. We ran our experiments on a 1.8GHz Intel Pentium IV PC with 1 GB of RAM, running Fedora Core 4. The code is implemented in C++ using STL. We generated fact tables with varying numbers of feature attributes and varying numbers of rows. By changing the user threshold, we find the maximal attribute subsets whose aggregated minimum measure value exceeds the user threshold. We store the fact table as a relation in a MySQL database and issue the corresponding group-by queries in order to compute the aggregated measure value for each data cuboid. The number of feature attributes is varied from 5 to 20. The number of rows is varied from 5000 to 50,000.
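The qualification test and the extraction of maximal attribute subsets can be sketched naively as below. This illustrative Python sketch replaces the MySQL group-by queries with an in-memory grouping and finds the maximal qualifying subsets by brute-force enumeration rather than by ExpandCut; the toy fact table and attribute names are invented for illustration:

```python
from itertools import combinations

# Illustrative sketch only (the thesis implementation is C++ with MySQL
# group-by queries and ExpandCut). A dimension subset "qualifies" if the
# minimum group-by sum of the measure exceeds the threshold t; qualification
# is anti-monotone, since adding a dimension splits groups further.
def min_group_sum(rows, dims, measure):
    groups = {}
    for r in rows:
        key = tuple(r[d] for d in dims)
        groups[key] = groups.get(key, 0) + r[measure]
    return min(groups.values())

def maximal_qualifying(rows, features, measure, t):
    qualifying = [frozenset(c)
                  for k in range(1, len(features) + 1)
                  for c in combinations(features, k)
                  if min_group_sum(rows, sorted(c), measure) > t]
    # keep only subsets with no qualifying strict superset
    return [s for s in qualifying if not any(s < u for u in qualifying)]

# Hypothetical toy fact table:
rows = [{"year": 2009, "place": "HYD", "sales": 60},
        {"year": 2009, "place": "DEL", "sales": 55},
        {"year": 2010, "place": "HYD", "sales": 70}]
print(maximal_qualifying(rows, ["year", "place"], "sales", 50))
```

The brute-force enumeration touches all 2^d subsets, which is exactly the cost that running ExpandCut along the boundary avoids.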
Supp %   No. of Rows: 1000     No. of Rows: 25000    No. of Rows: 50000
         Time     No. of       Time     No. of       Time     No. of
         (sec)    Maximals     (sec)    Maximals     (sec)    Maximals
1        11.01    78           18.2     105          63.77    195
2        0.43     13           18.2     105          17.69    99
3        0.5      15           18.25    105          14.96    74
5        0.5      15           0.51     15           0.52     15
10       0.5      15           0.51     15           0.5      15
15       0.51     15           0.5      15           0.5      15
Table 5.1 Finding Maximal Attribute sets for 15 feature attributes
Supp %   No. of Rows: 1000     No. of Rows: 25000    No. of Rows: 50000
         Time     No. of       Time     No. of       Time     No. of
         (sec)    Maximals     (sec)    Maximals     (sec)    Maximals
1        5.94     34           155.3    190          861.67   441
2        1.79     13           154.29   190          146.67   180
3        1.86     17           154.55   190          121.67   117
5        2.16     20           2.16     20           2.16     20
10       2.15     20           2.18     20           2.14     20
15       2.16     20           2.19     20           2.16     20
Table 5.2 Finding Maximal Attribute sets for 20 feature attributes
Table 5.1 and Table 5.2 show the time taken for ExpandCut to run on the data cube lattice and the number of maximal attribute subsets when the user threshold is varied from 1% to 15% of the sum of the measure values. The number of attributes considered is 15 in Table 5.1 and 20 in Table 5.2. The attribute combinations of the qualifying data cuboids contain more attributes for lower support values. Hence, more sub-cuboids are found during the ExpandCut operation. This could
50
be responsible for the sudden rise in time taken from 3% to 5% seen. However, the number of expandcuts called is largely dependent on the dataset.
The SPIN and gSpan algorithms were developed specifically for pattern mining in graphs and cannot be generalized to other domains. We present MARGIN not only as a solution for finding maximal frequent subgraphs but as a technique that can be applied to multiple domains, provided the properties mentioned in Section 5.1.1 are satisfied.
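As an illustration of why the technique transfers to any lattice with an anti-monotone frequency predicate, the seed step of the border walk can be sketched generically. This is a simplified sketch of MARGIN's initial-cut idea, not the algorithm's full pseudocode; `find_initial_cut` is a name we introduce here.

```python
def find_initial_cut(universe, is_frequent):
    """Seed step of a MARGIN-style border walk, sketched for any subset
    lattice with an anti-monotone is_frequent predicate: walk down from
    the top element, dropping one item at a time, until a frequent set
    is reached.  Returns (frequent_set, infrequent_parent); the pair is
    a 'cut' lying on the frequent/infrequent border, from which
    ExpandCut would then explore neighboring cuts."""
    current = list(universe)
    parent = None                       # last infrequent superset seen
    while current and not is_frequent(frozenset(current)):
        parent = frozenset(current)
        current = current[:-1]          # descend to one child
    return frozenset(current), parent
```

If the top element is itself frequent, `parent` stays `None` and the top element is the only maximal set.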
5.2 RNA Margin
Ribonucleic acid (RNA) can both encode genetic information and catalyze chemical reactions. As the only biological macromolecule capable of such diverse activities, RNA has been proposed to have preceded DNA and protein in early evolution. RNA structure is important in understanding many biological processes, including translation regulation in messenger RNA, replication of single-stranded RNA viruses, and the function of structural RNAs and RNA/protein complexes. Therefore, developing methods for recognizing conformational states of RNA and for discovering statistical rules governing conformation may help answer basic biological and biochemical questions, including those related to the origins of life.
A detailed understanding of the functions and interactions of biological macromolecules requires knowledge of their molecular structure. Structural genomics, the systematic determination of all macromolecular structures represented in a genome, is at present focused exclusively on proteins. However, it is becoming increasingly clear that apart from the previously known RNA molecules such as mRNAs, introns, tRNAs and rRNAs, a large number of non-protein-coding RNA (ncRNA) molecules arising from non-coding regions of the genome play a variety of significant functional roles in cells. A comprehensive understanding of these roles, which include protein synthesis and targeting, several forms of RNA processing and splicing, RNA editing and modification, and chromosome end maintenance, is crucial to our understanding of life processes at the molecular level. With the introduction of high-throughput experimental methods for identification, the databases of ncRNA molecules have already started swelling at a rapid pace. After identification, the next task is to understand their structures in their functional context. The structural bioinformatics of RNA molecules is thus an important and emerging research area. The realization that conserved amino acid motifs in proteins can often be related to function has greatly aided the evaluation of unidentified open reading frames in sequence databases. As sequences have accumulated, so has the number of recognizable motifs, thereby guaranteeing an ever-increasing role for functional inference through in silico analysis of sequence motifs. Because RNA is composed of only four major nucleotides, the primary sequence of RNA is insufficient, by itself, for defining motifs. Secondary and tertiary structural aspects must therefore be made part of RNA motif definitions. In spite of these complications, evidence is accumulating that RNA motifs will provide the ultimate basis for an understanding of RNA structure and function.
Given an RNA database, it is interesting to identify characteristics that are common across RNA molecules, as well as local-environment and species-dependent variations thereof. We also try to identify motifs from the RNA structures. Figure 5.2 shows a small RNA graph as fed to the MARGIN algorithm. The nodes carry labels of the form Cx_y, where x is the node number and y the node label. The double and single bonds indicate covalent and hydrogen bonds respectively. Figure 5.3 shows another RNA converted into the MARGIN format, of a size typical for this data.
Figure 5.2 RNA Graph 1
The input RNA structures are taken from the Protein Data Bank (www.rcsb.org). The RNAs are preprocessed in order to find the nodal parameters, the hydrogen and covalent bonds, and the base pairs.
The vertex information contains the chain identifier, the vertex number, the nodal parameters, which are the torsion angles eta and theta, the conformational code, the bin ASCII code, and the base identity. The edge information contains the chain id, the vertex numbers between which the edge is formed, and the (i, j) value for a covalent bond or the hydrogen bond orientation for a non-covalent bond. The given input graph is transformed into the MARGIN format. In order to assign a vertex label, the eta and theta values are binned; the binned values, along with the other attributes, form one node label. Vertices that have the same binned eta and theta values along with the same values for the other attributes are hence assigned the same vertex label. In the case of edges, the hydrogen or covalent bonds along with their values determine the edge labels.
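The labeling scheme can be sketched as below. The 20-degree bin width is an assumed illustrative parameter, not a value taken from the actual preprocessing pipeline, and the function names are ours.

```python
def bin_angle(angle_deg, width=20):
    """Map a torsion angle (degrees) to a bin index; the angle is first
    normalized into [0, 360).  The bin width is an assumed parameter."""
    return int(angle_deg % 360) // width

def vertex_label(eta, theta, base):
    """Vertices with the same binned (eta, theta) pair and the same base
    identity receive the same label, as described above.  Other nodal
    attributes would be appended to the tuple in the same way."""
    return (bin_angle(eta), bin_angle(theta), base)
```

Two vertices whose torsion angles fall in the same bins thus compare equal, which is what lets frequent-subgraph mining find recurring structural motifs despite small numeric differences.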
Experiments are run on an RNA database of 50 graphs. The average number of nodes per graph in the database is 1500 and the average number of edges is 2750; the degree of each node is only around two to three. High support values varying from 70% to 95% are used, since maximal frequent subgraphs form meaningful motifs only when they occur in a large number of RNA structures.
Figure 5.3 RNA Graph 2
Table 5.3 shows the running time for various support values. The maximal frequent subgraphs found are converted back into the original format. The SCOR database [5] contains information about the specific instances of motifs in RNA structures. The maximal frequent subgraphs found are cross-checked against the motifs in the SCOR database to validate whether the listed RNA motifs have been found. A hairpin loop is an RNA motif consisting of an unpaired loop, created when an RNA strand folds and forms base pairs with another section of the same strand; the resulting structure looks like a loop or a U-shape. Most hairpin loops were identified, some of which are:
Residue numbers 380-383 in structure 1hnx [5]
Residue numbers 459-463 in structure 1hnx [5]
Residue numbers 523-526 in structure 1hnx [5]
Support %   Number of Maximals   Time taken (mins)
95          75                   156.2
90          191                  209.22
85          149                  206.47
80          157                  215.3
75          168                  222.00
70          173                  232.60
Table 5.3 Running MARGIN on the RNA database
5.3 Real-life dataset
In this section, we present experiments conducted on three real-life datasets.
5.3.1 Page Views from msnbc.com
We tested the performance of the MARGIN algorithm on a real-life dataset obtained from the UCI KDD archive [7]. The dataset corresponds to the records of page views by the users of msnbc.com on a particular day. Each record contains the sequence of categories of the pages viewed by a particular user. The dataset contains 989,818 records and the average number of page views per record is 5.7.
We processed the data by converting each record in the dataset into a graph as follows. The categories that belong to the record correspond to the vertices of the graph, and edges are created between every pair of categories of the same record. Thus each record is converted into a clique. We considered the graphs which had more than 70 edges. A total of 306 graphs with the following statistics were generated: a minimum of 13 and a maximum of 17 vertices, with 13.94 vertices on average; a minimum of 78 and a maximum of 136 edges, with 90.75 edges on average. Table 5.4 shows the results for different values of support. It can be observed that the MARGIN algorithm reports results efficiently.
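The record-to-clique conversion can be sketched as follows; this is a minimal sketch, with a function name of our choosing.

```python
from itertools import combinations

def record_to_clique(categories):
    """Convert one page-view record into a graph: each distinct category
    in the record becomes a vertex, and every pair of vertices is joined
    by an edge, yielding a clique.  Duplicate page views of the same
    category collapse into a single vertex."""
    vertices = sorted(set(categories))
    edges = list(combinations(vertices, 2))
    return vertices, edges
```

A record with n distinct categories thus produces a graph with n vertices and n(n-1)/2 edges, so the 70-edge filter above selects records touching at least 13 distinct categories.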
One of the maximal frequent subgraphs reported for a support value of 70 is the set of web page categories {"frontpage", "news", "weather", "sports"}. Thus, this set is one of the largest subsets of web page categories that are visited together most frequently. The subset of categories can also help characterize the web users visiting them. This information can be exploited to provide better navigation mechanisms and to personalize web pages according to a user's interests.
Minimum Support   MARGIN       gSpan
50                693.66 sec   4864.23 sec
60                617.13 sec   3758.4 sec
70                545.61 sec   722.68 sec
Table 5.4 Running time on web data
5.3.2 Stock Market Data
We further present our results on the stock market data of 20 companies collected from the source [8]. We use the correlation function below [62] to calculate the correlation $C_{AB}$ between any pair of companies $A$ and $B$:

\[
C_{AB} = \frac{\frac{1}{|T|}\sum_{i=1}^{|T|} A_i \times B_i \;-\; \bar{A} \times \bar{B}}{\sigma_A \times \sigma_B}
\]
Here, $|T|$ denotes the number of days in the period $T$, $A_i$ and $B_i$ denote the average price of the stocks of companies $A$ and $B$ on day $i$, and $\bar{A}$, $\bar{B}$, $\sigma_A$, and $\sigma_B$ are defined as follows:
\[
\bar{A} = \frac{1}{|T|}\sum_{i=1}^{|T|} A_i,
\qquad
\bar{B} = \frac{1}{|T|}\sum_{i=1}^{|T|} B_i
\]
\[
\sigma_A = \sqrt{\frac{1}{|T|}\sum_{i=1}^{|T|} (A_i)^2 - \bar{A}^2},
\qquad
\sigma_B = \sqrt{\frac{1}{|T|}\sum_{i=1}^{|T|} (B_i)^2 - \bar{B}^2}
\]
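The correlation formula above transcribes directly into code; the function name is ours.

```python
import math

def correlation(A, B):
    """Correlation C_AB per the formula above: A and B are the daily
    average price series of two companies over the same period T."""
    T = len(A)
    mean_a = sum(A) / T
    mean_b = sum(B) / T
    cov = sum(a * b for a, b in zip(A, B)) / T - mean_a * mean_b
    sd_a = math.sqrt(sum(a * a for a in A) / T - mean_a ** 2)
    sd_b = math.sqrt(sum(b * b for b in B) / T - mean_b ** 2)
    return cov / (sd_a * sd_b)
```

This is the usual Pearson correlation in its "mean of products minus product of means" form; it is undefined (division by zero) when either series is constant over the period.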
We construct a graph database where each graph corresponds to seven successive working days (referred to as a week). For every week, we find the correlation values between every pair of companies. We also find the ten companies with the highest average stock price for the week; the nodes in the graph correspond to these top ten companies. An edge exists between a pair of nodes if their correlation value is positive. We construct 133 such graphs corresponding to four years of stock data. Table 5.5 reports the time taken by MARGIN and gSpan. A maximal frequent subgraph refers to a set of cohesive companies that perform consistently through the timeline. We found the average size of the maximal subgraphs to be 32 edges, so they lie in the higher region of the lattice. Therefore the time taken by MARGIN is lower than that of gSpan.
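The weekly graph construction can be sketched as follows; `weekly_graph` and its inline correlation helper are names we introduce, and ties in average price are broken arbitrarily here.

```python
import math

def _corr(A, B):
    """Correlation of two equal-length price series (formula above)."""
    T = len(A)
    ma, mb = sum(A) / T, sum(B) / T
    cov = sum(a * b for a, b in zip(A, B)) / T - ma * mb
    sa = math.sqrt(sum(a * a for a in A) / T - ma * ma)
    sb = math.sqrt(sum(b * b for b in B) / T - mb * mb)
    return cov / (sa * sb)

def weekly_graph(prices, top_k=10):
    """prices: company -> list of daily average prices for one week.
    Nodes are the top_k companies by weekly average price; an edge joins
    two companies whose price correlation over the week is positive."""
    ranked = sorted(prices, key=lambda c: sum(prices[c]) / len(prices[c]),
                    reverse=True)
    nodes = ranked[:top_k]
    edges = [(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]
             if _corr(prices[u], prices[v]) > 0]
    return nodes, edges
```

Running this over every week of the four-year period would yield the 133-graph database mined above.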
Support %   MARGIN (sec)   gSpan (sec)
2           0.48           6.7
4           0.56           6.64
6           0.62           6.61
8           0.68           6.56
10          0.74           6.56
12          0.81           6.58
Table 5.5 Comparison of Margin and gSpan on stock data
Support %   MARGIN (sec)   gSpan (sec)   SPIN (sec)
95          171.93         0.05          0.11
85          134.29         0.05          0.12
75          133.73         0.05          0.12
65          114.58         0.05          0.15
55          581.03         0.07          1.94
45          659.34         0.11          2.38
Table 5.6 Comparison using the Chem340 dataset
Support %   MARGIN (sec)   gSpan (sec)   SPIN (sec)
95          464.93         0.07          0.16
85          433.82         0.08          4.23
75          554.98         0.09          5.06
65          833.63         0.09          7.65
55          889.02         0.12          17.97
45          908.68         0.13          24.28
Table 5.7 Comparison using the Chem422 dataset
5.3.3 Chemical Compound Datasets
We further test the performance of MARGIN on the two chemical compound datasets widely used with gSpan and SPIN, referred to in [65] (available at [2]). We refer to them as Chem340 and Chem422 in this work. The datasets contain 340 and 422 graphs respectively. We eliminated disconnected graphs from these datasets and work with the remaining 327 and 328 graphs respectively. The running times of gSpan and SPIN are considerably better than that of MARGIN, as seen in Tables 5.6 and 5.7, which list the time taken for supports ranging from 95% to 45%. We also examined the size of the maximal subgraphs generated for these supports. For the Chem340 (Chem422) dataset, the maximal subgraphs found were all single-node (single-edge) graphs for support values from 95% to 75%. With support decreasing from 65% to 15%, the size of the maximal subgraphs grows from three-edge (two-edge) graphs to six-edge (eight-edge) graphs. The average size of the graphs in Chem340 (Chem422) is 25 (36) edges. Hence, the maximals for most support values lie in the lower region of the lattice, which is disadvantageous to MARGIN.
Further, the chemical compound datasets are very sparse: the edge-to-vertex ratio was found to be around 1.05, and the discovered patterns are mostly tree-like structures. As per the discussion of the effect of the edge-to-vertex ratio in Section 4.2, Apriori-based algorithms are expected to perform better on such sparse datasets.
Though the average size of the graphs in Chem340 and Chem422 is 25 and 36 edges respectively, the largest graphs are of size 214 and 196 respectively. Since MARGIN explores and stores the candidate subgraphs of such large graphs during the ExpandCut operation, memory utilization shoots up. Our present implementation does not support huge graphs, and hence we provide results only down to a support value of 45%.
Chapter 6
ISG: Mining maximal frequent subgraphs using itemsets
It is common to model complex data with graphs consisting of nodes and edges that are often labeled to store additional information. The problem of maximal frequent subgraph mining can be complex or simple depending on the kind of input dataset. If the graph database is known to satisfy some constraint, such as a constraint on the size of the graphs, the node or edge labels, or the degree of the nodes, it might be possible to solve the subgraph mining problem without using any graph techniques, by exploiting the properties of the constraint. Though the class of such graphs might be small, this problem is worth investigating since it can lead to an immense reduction in run time.
Given a set of graphs D = {G1, G2, . . . , Gn}, the support of a graph g is defined as the fraction of graphs in D in which g occurs. As defined earlier, g is frequent if its support is at least a user-specified threshold. In this chapter, we find maximal frequent subgraphs for a database of graphs having unique edge labels using itemset mining techniques. Though itemset mining is itself exponential in the number of items, the complexity of graph mining algorithms rises steeply compared to their itemset counterparts.
In itemset mining algorithms, an itemset can be generated exactly once by using a lexicographical order on the items. In contrast, it is difficult to guarantee that a subgraph is generated only once during frequent subgraph mining: (i) due to the repetition of node and edge labels, a subgraph can be grown in several different ways by adding nodes or edges in different orders, so avoiding multiple explorations of the same graph adds the overhead of maintaining canonical forms; (ii) the frequency of an itemset can be determined by tid-list intersection of certain subitemsets, whereas even the presence of all subgraphs of a graph g1 in another graph g2 does not guarantee the presence of g1 in g2, so detecting the presence of a subgraph in a graph requires additional operations; and (iii) subgraph isomorphism is NP-hard, which makes it critical to minimize the number of subgraphs that need to be considered when computing frequency counts. Strategic methods, such as maintaining information about the occurrences of a subgraph in the database, have been developed to avoid subgraph isomorphism. Finding whether an itemset is a subitemset of another is trivial compared to determining whether a graph is a subgraph of another. Hence, our algorithm, ISG, finds the maximal frequent subgraphs by invoking a maximal itemset mining algorithm.
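Point (ii) can be illustrated concretely: for itemsets, support counting reduces to intersecting tid-lists, an operation with no equally cheap graph analogue. A minimal sketch (names ours):

```python
def tidlists(transactions):
    """Build item -> set of transaction ids (the tid-list of each item)."""
    lists = {}
    for tid, t in enumerate(transactions):
        for item in t:
            lists.setdefault(item, set()).add(tid)
    return lists

def support(itemset, lists):
    """Frequency of an itemset by intersecting the tid-lists of its
    items -- the cheap counting step that is unavailable for graphs,
    since containment of all subgraphs of g1 in g2 does not imply
    containment of g1 itself."""
    tids = set.intersection(*(lists[i] for i in itemset))
    return len(tids)
```

ISG exploits exactly this asymmetry: by encoding graphs as itemsets, the expensive subgraph-frequency computation is delegated to an itemset miner.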
In the current literature on maximal frequent subgraph mining, standard Apriori techniques and their extensions are used to find the maximal frequent subgraphs. Such algorithms adopt a bottom-up approach to arrive at the final set of maximal frequent subgraphs. While SPIN [36] extends subtrees to finally construct maximal frequent subgraphs, MARGIN [58] finds the n-size candidate subgraphs that have (n+1)-size infrequent supergraphs. All variations of these algorithms need to perform either subgraph isomorphism or subgraph extensions.
6.1 Our Approach
Figure 6.1 Overview of ISG Algorithm
Figure 6.1 shows an overview of the ISG algorithm, which can be broadly divided into five steps. Step 1 converts each graph in the database D into a transaction. This conversion step involves mapping parts of the graph to items and assigning a unique item id to each such item. If a graph g is converted into a transaction t, which is a set of items, then we should be able to reconstruct g from t. In order to ensure this, the edges and "converse edges" (pairs of adjacent edge labels) of the graph are mapped to items. In Section 6.2.1, we show that additional substructures of the graph need to be mapped to items in order to ensure unambiguous conversion of the itemsets into graphs. These additional substructures are called secondary structures and are explained in detail in Section 6.2.1. Thus, the transaction database D′ constructed contains the item ids of the edges, the converse edges and the secondary structures.
Example: Consider the database of three graphs in Figure 6.2, where G1 is a triangle, G2 is a spike and G3 is a linear chain. Step 1 converts these graphs into the transaction database D′ shown in Figure 6.4.
In step 2, the transaction database D′ constructed in step 1 is given as input to any maximal itemset mining algorithm, and the set of maximal itemsets of D′, denoted M, is generated. In Figure 6.4, transaction T1 corresponds to graph G1 of Figure 6.2, T2 to G2 and T3 to G3. For the transaction database of the example in Figure 6.4, the only maximal frequent itemset generated for a support value of two is the set {1, 2, 3, 51, 52, 53, 100}.
Figure 6.2 Example Graph Database (G1: Triangle, G2: Spike, G3: Linear Chain; all nodes labeled a, edges labeled e1, e2, e3)
Each itemset m ∈ M should be converted into a graph, which is done in steps 3 and 4. If m can be unambiguously converted into a graph, then step 3 is omitted and step 4 is invoked, which generates the graph corresponding to the itemset m. The mapping of the edges, converse edges and secondary structures to items is used in the process of converting itemsets to graphs (explained in Section 6.2.2). On the other hand, if the conversion of m into a graph is ambiguous, the additional processing of step 3 is needed to resolve the ambiguity (more details in Section 6.2.1). In our example, the maximal itemset m = {1, 2, 3, 51, 52, 53, 100} is ambiguous and is hence preprocessed in step 3 and converted into a set of sub-itemsets mset = {{1, 2, 51}, {2, 3, 52}, {1, 3, 53}}. mset is then passed to step 4, which converts each sub-itemset into a graph as shown in Figure 6.3.
Figure 6.3 Maximal Frequent Subgraphs for the Example (Support = 2)
T1: 1, 2, 3, 51, 52, 53, 100, 150
T2: 1, 2, 3, 51, 52, 53, 100, 200
T3: 1, 2, 3, 51, 52, 100, 250

Figure 6.4 The Transaction Database
The output of step 4 is a set of candidate maximal frequent subgraphs, which might require additional pruning in order to obtain the maximal frequent subgraphs of D. Step 5 compares the graphs generated by the end of step 4 and prunes any graph that is a subgraph of another candidate maximal frequent subgraph. Since no graph in our example in Figure 6.3 is a subgraph of another, the pruning in step 5 does not eliminate any graphs, and the three graphs are given as the output of the ISG algorithm.
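Under the unique-edge-label restriction of this chapter, each candidate can be represented by its set of triplet item ids, and the step-5 subgraph check then reduces to a subset check on those sets. The following is a sketch of the pruning under that simplifying assumption, with a function name of our choosing:

```python
def prune_non_maximal(candidates):
    """Step-5 pruning sketch: each candidate maximal subgraph is
    represented by the frozenset of its (edge and converse-edge) triplet
    item ids.  Under the unique-edge-label assumption, a candidate is
    dropped when its triplet set is a proper subset of another
    candidate's, i.e. when it is subsumed by a larger candidate."""
    return [c for c in candidates
            if not any(c < other for other in candidates)]
```

This is exactly the kind of operation that is trivial for itemsets but would require subgraph isomorphism tests in the general graph setting.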
6.2 The ISG Algorithm
Unique edge labeled graphs refer to the class of graphs in which each edge label occurs at most once per graph. Formally, we find maximal frequent subgraphs from a database of graphs D = {G1, G2, . . . , Gn} where each graph Gi contains no two edges with the same label. Due to multiple occurrences of a node label, checking for a common node label is not sufficient to determine whether two edges are neighbors. In this section, we discuss the steps given in Figure 6.1. The key steps in the algorithm are the conversion of graphs to itemsets in step 1 and the conversion of maximal frequent itemsets back to graphs in steps 4 and 5.
6.2.1 Conversion of Graphs to Itemsets
This subsection explains step 1 of the framework. Each edge is mapped to a unique triplet of the form (ni, e, nj), where e is the label of the edge incident on the nodes with labels ni and nj. Similarly, we define a converse edge triplet as a triplet of the form (ei, n, ej), where ei and ej are the labels of two edges incident on a node with label n. Converse edge triplets are essentially a triplet representation of every pair of adjacent edges. Each edge triplet and converse edge triplet, being unique, is mapped to a unique item id. A graph Gi in the graph database is converted into an itemset transaction whose items are the item ids of the edge triplets and converse edge triplets of Gi.
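Step 1 can be sketched as follows. Here the triplets themselves stand in for the unique item ids they would be mapped to, secondary structures are omitted, and the function name is ours; the sketch also reproduces the triangle/spike situation of Figure 6.2.

```python
from itertools import combinations

def graph_to_items(vertices, edges):
    """vertices: node id -> node label; edges: list of (u, v, edge_label),
    each edge label unique within the graph.  Returns the set of edge
    triplets and converse-edge triplets (the triplets act as items here,
    in place of the unique item ids they would be assigned)."""
    items = set()
    incident = {}                      # node id -> edge labels incident on it
    for u, v, e in edges:
        a, b = sorted((vertices[u], vertices[v]))
        items.add(("edge", a, e, b))           # edge triplet (ni, e, nj)
        incident.setdefault(u, []).append(e)
        incident.setdefault(v, []).append(e)
    for n, labels in incident.items():
        for e1, e2 in combinations(sorted(labels), 2):
            items.add(("conv", e1, vertices[n], e2))  # converse triplet
    return items
```

Running this on the triangle and the spike of Figure 6.2 yields identical item sets, which is precisely the ambiguity that motivates the secondary structures introduced below.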
The set of transactions so formed from the graphs in the database is termed the transaction database D′. The conversion of a graph to an itemset should be such that, after finding the maximal frequent itemsets, they can be mapped back to graphs unambiguously. Considering only edge triplets does not guarantee that each maximal frequent itemset can be uniquely converted back to a graph. Even the edge triplets and converse edge triplets together do not guarantee that a given maximal frequent itemset can be converted into a graph unambiguously. Consider Figure 6.2.
The edge triplets that correspond to the triangle and the spike structures are {(a, e1, a), (a, e2, a), (a, e3, a)}, while the converse edge triplets are {(e1, a, e2), (e1, a, e3), (e2, a, e3)}. The edge triplets that correspond to the linear chain are {(a, e1, a), (a, e2, a), (a, e3, a)}, while the converse edge triplets are {(e1, a, e2), (e2, a, e3)}.
Case 1: The sets of triplets corresponding to the triangle and the spike are the same. Hence, they cannot be uniquely converted back to a graph, causing ambiguity.
Case 2: The maximal frequent itemset found for the transaction database of Figure 6.4 for support value three would be the set {(a, e1, a), (a, e2, a), (a, e3, a), (e1, a, e2), (e2, a, e3)}. This set of triplets is the same as the set of triplets for the linear chain, as we saw above. However, we can observe that the linear chain is not frequent.
Thus, it can be concluded that the itemset database D′ using only edge and converse edge triplets leads to ambiguity. It can be trivially seen that only three structures with three edges exist, as in Figure 6.2, of which the triangle and spike have the same set of edge and converse edge triplets. It is also easy to see that any two non-isomorphic n-edge graphs that have the same edge and converse edge triplets must each have an (n−1)-edge substructure whose edge and converse edge triplets are the same. Let the edge with label ei be the edge removed from both n-edge graphs to create two (n−1)-edge graphs. The single edge triplet containing the edge label ei and all converse edge triplets containing the edge label ei will thus be removed. Since the edge and converse edge triplets of the n-edge graphs were the same, the triplets removed are also the same set. Hence, the edge and converse edge triplets of the remaining (n−1)-edge subgraphs are again the same. Repeating this argument, both non-isomorphic n-edge graphs with the same edge and converse edge triplets have three-edge substructures whose edge and converse edge triplets are the same. The attempt is hence to remove the ambiguity at the level of three-edge substructures, so that graphs of size greater than three can be represented and transformed unambiguously.
In order to handle the two cases of ambiguity mentioned earlier, we introduce three secondary structures called triangle, spike and linear chain. Figure 6.2, which we used as an example database, shows each secondary structure. Note that the node labels of all the nodes in each secondary structure are the same. Hence, for a given graph, all substructures which are a triangle, spike or linear chain, each having the same node labels, form its secondary structures. The edge triplets and converse edge triplets form the primary structures.
Each secondary structure is assigned a unique item id, in addition to the unique ids of the edge and converse edge triplets. The concept of a common item id is also introduced, which is a representation of a set of triplets irrespective of the structure it denotes. That is, if a spike and a triangle have the same set of edge and converse edge triplets, as in our example, then both are assigned the same common item id even though their triangle and spike ids are different. Two triangles with different edge labels or different node labels will have two different triangle ids. In the case of the linear chain we make an exception: for a linear chain whose three edge triplets are the same as those of a triangle or spike, the linear chain is assigned the same common item id as that of the triangle or spike. Hence, the three structures in Figure 6.2 get the same common item id even though their spike, triangle and linear chain ids are different. Since an edge label occurs at most once in a graph, each common item id present in an itemset can correspond to exactly one of the three: triangle, spike or linear chain.
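The secondary structure a graph actually contains can be recognized from the graph itself (rather than from the triplets, which is precisely what the ambiguity above prevents). A sketch, with illustrative names, classifying a three-edge substructure with equal node labels by its node degrees:

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

enum Secondary { TRIANGLE, SPIKE, LINEAR_CHAIN, NONE };

// Classify a three-edge substructure (edges given as pairs of node ids, all
// node labels assumed equal): a spike has one node of degree 3, a triangle
// has three nodes of degree 2, a linear chain has four nodes forming a path.
Secondary classify(const std::vector<std::pair<int, int>>& edges) {
    if (edges.size() != 3) return NONE;
    std::map<int, int> degree;
    for (const auto& e : edges) { ++degree[e.first]; ++degree[e.second]; }
    for (const auto& d : degree)
        if (d.second == 3) return SPIKE;      // one centre node of degree 3
    return degree.size() == 3 ? TRIANGLE      // 3 nodes, each of degree 2
                              : LINEAR_CHAIN; // 4 nodes forming a path
}
```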
Definition 4 (Conflicting Maximal Frequent Itemset): A maximal frequent itemset that contains a common item id without the presence of its secondary item id is called a conflicting maximal frequent itemset.
Table 6.1 shows the id assigned to each edge triplet, converse edge triplet, common id and secondary structure for the graphs in Figure 6.2. The three edge triplets are given item ids 1, 2 and 3, while the converse edge triplets are given item ids 51, 52 and 53. The item id 150 represents the triangle, 200 the spike and 250 the linear chain. The item id 100 is the common item id, which occurs for all three structures of Figure 6.2. Using these item ids, the transaction database in Figure 6.4 is generated.
The graph G1, representing a triangle, forms the itemset {1,2,3,51,52,53,150,100}, while the spike in G2 of Figure 6.2 forms the itemset {1,2,3,51,52,53,200,100} and G3 forms {1,2,3,51,52,250,100}
Type                    ID   Set of Triplets
Edge Triplet             1   (a, e1, a)
Edge Triplet             2   (a, e2, a)
Edge Triplet             3   (a, e3, a)
Converse Edge Triplet   51   (e1, a, e2)
Converse Edge Triplet   52   (e2, a, e3)
Converse Edge Triplet   53   (e1, a, e3)
Common Id              100   (a, e1, a), (a, e2, a), (a, e3, a), (e1, a, e2), (e1, a, e3), (e2, a, e3)
Triangle Id            150   (a, e1, a), (a, e2, a), (a, e3, a), (e1, a, e2), (e1, a, e3), (e2, a, e3)
Spike Id               200   (a, e1, a), (a, e2, a), (a, e3, a), (e1, a, e2), (e1, a, e3), (e2, a, e3)
Linear Chain Id        250   (a, e1, a), (a, e2, a), (a, e3, a), (e1, a, e2), (e2, a, e3)

Table 6.1 Mapping of edges of graphs in Figure 6.2 to unique item ids
as seen in Figure 6.4. Consider finding the maximal frequent subgraphs for support value two. The seven items {1,2,3,51,52,53,100}, which represent the edge and converse edge triplets of both the triangle and the spike, will show up as the maximal frequent itemset. However, neither the spike structure nor the triangle structure is maximally frequent. Such a maximal frequent itemset, whose secondary item id is not present but whose common id is present, was defined above to be a conflicting maximal frequent itemset. There should be a way of indicating that neither the triangle nor the spike should be constructed, as the three-edge structures are infrequent while their two-edge substructures are all frequent. The common item id is used for this purpose: a conflicting maximal frequent itemset indicates the need for preprocessing, so that neither the triangle nor the spike is constructed but only the two-edge substructures of both, which are maximal frequent. Hence, if the maximal frequent itemset is a conflicting maximal frequent itemset, then converting the itemset into graphs needs preprocessing in order to attach the three frequent two-edge substructures (if attachable) to the remaining constructed graph. This would form more than one possible graph out of the itemset. Step 3 of Figure 6.1, which preprocesses the conflicting itemset, forms a set of itemsets such that each itemset can unambiguously create a graph. The itemset thus creates multiple graphs, where each graph contains only one of the three possible two-edge structures.
6.2.2 Conversion of Maximal Frequent Itemsets to Graphs
Once the maximal frequent itemsets are found in the previous phase, the corresponding candidate maximal frequent subgraphs are to be built. As mentioned earlier, the conflicting maximal frequent itemsets need to undergo a preprocessing phase in order to resolve the conflict that none of the three-edge structures represented by the common item id is frequent while the common item id itself is frequent. This indicates that all three edges are frequent and occur as neighbours of each other, but they cannot co-occur in a graph to form any three-edge structure, as each possible three-edge structure is itself infrequent.
Preprocessing Phase: The preprocessing phase breaks a conflicting maximal frequent itemset into non-conflicting subsets. It makes sure that the three edges that caused the conflict are not present together in any subitemset created. We refer to a common item id whose secondary item ids are not present as the conflicting common item id of the conflicting maximal frequent itemset. The three edge labels present in the triplet set of each conflicting common item are marked as an Invalid Edge Combination, since all three cannot be present in a single graph. For example, as the common item with id 100 in the example database is conflicting, the edges {e1, e2, e3} form one set in the set of Invalid Edge Combinations. Since 100 is the only conflicting common item id for the conflicting maximal frequent itemset {1,2,3,51,52,53,100} being considered, the set of Invalid Edge Combinations contains only one entry.
The preprocessing of a conflicting maximal frequent itemset begins by forming an initial set of components and recursively extending these components until they cannot be extended. The pairs of edges that belong to the converse edge triplets of the maximal frequent itemset constitute the initial set of components. Thus, for the maximal frequent itemset {1,2,3,51,52,53,100} of our example database, 51, 52 and 53 form the three converse edge triplets, and hence the initial set of components contains three edge pairs: {C1 = (e1, e2), C2 = (e2, e3), C3 = (e1, e3)}.
These components are extended recursively. At each stage of the recursion, a pair of components is merged if the following two conditions hold: (i) they differ by a single element, and (ii) on merging, they do not contain any edge combination that belongs to the Invalid Edge Combinations. For the example maximal frequent itemset being considered, it can be seen that no two components can be merged, as merging would produce a new component C = (e1, e2, e3) which belongs to the Invalid Edge Combinations. All the edge label sets that can be merged are collected into NewComponent_i, where i denotes the i-th iteration of the merging step. Any component that could not be merged with any other component becomes part of the final output, as it cannot be extended further. The edge label sets in NewComponent_i that can be merged are merged to form NewComponent_{i+1}, and those that cannot be merged with any set in NewComponent_i are added to the final output. The merging process is repeated until no further merging is possible. In our running example, the set NewComponent_1 remains empty, since the edge label sets forming the components C1, C2 and C3 cannot merge. Hence C1, C2 and C3 form the final components, and this set of edge label sets is returned as the output of the preprocessing phase.
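The merging loop can be sketched as follows (illustrative names, not the thesis code); conditions (i) and (ii) are checked exactly as described above:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

using EdgeSet = std::set<std::string>;

// Recursively merge components that differ by a single element, rejecting
// any merge that would contain an Invalid Edge Combination; components that
// cannot be merged with anything go to the final output.
std::set<EdgeSet> preprocess(std::set<EdgeSet> components,
                             const std::set<EdgeSet>& invalid) {
    std::set<EdgeSet> finalOut;
    while (!components.empty()) {
        std::set<EdgeSet> next, merged;
        for (auto i = components.begin(); i != components.end(); ++i)
            for (auto j = std::next(i); j != components.end(); ++j) {
                EdgeSet u = *i;
                u.insert(j->begin(), j->end());
                // (i) the two components differ by exactly one element
                if (u.size() != i->size() + 1 || u.size() != j->size() + 1)
                    continue;
                // (ii) the union contains no Invalid Edge Combination
                bool bad = false;
                for (const EdgeSet& inv : invalid)
                    if (std::includes(u.begin(), u.end(),
                                      inv.begin(), inv.end())) {
                        bad = true;
                        break;
                    }
                if (bad) continue;
                next.insert(u);
                merged.insert(*i);
                merged.insert(*j);
            }
        for (const EdgeSet& c : components)  // unmergeable components are final
            if (!merged.count(c)) finalOut.insert(c);
        components = next;
    }
    return finalOut;
}
```

On the running example, the three pairs cannot merge because their union {e1, e2, e3} is invalid, so they are returned unchanged; without the invalid set, all three would merge into the single component {e1, e2, e3}.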
After each conflicting maximal frequent itemset undergoes the preprocessing stage, each edge label set in the set of edge label sets reported is subjected to the graph construction algorithm. Maximal frequent itemsets that are not conflicting can be directly subjected to the graph construction phase. In the running example, the preprocessing phase returns three components, thus forming three subitemsets. Each subitemset corresponds to the edge triplets and converse edge triplets that constitute its component.
Thus, C1 forms the subitemset I1 = {1, 2, 51}, which corresponds to the triplet set {(a, e1, a), (a, e2, a), (e1, a, e2)}. Similarly, I2 = {2, 3, 52} and I3 = {1, 3, 53} are formed.
Converting itemsets to graphs: We next describe how a maximal itemset can be converted into one or more unique graphs. If the maximal frequent itemset is a conflicting itemset, then the preprocessing phase first converts it into a set of subitemsets. Each subitemset thus formed contains the maximum number of edge and converse edge triplets that can be included without causing a conflict. Each such subitemset can be converted into a single unique graph. On the other hand, if the maximal itemset is non-conflicting, then it might correspond to one or more connected graphs, which can be seen as the components of a disconnected graph that can be constructed from it.
The basic idea of converting a maximal frequent itemset or subitemset into a graph is:
Step 1: Start with any edge triplet that is not visited so far and mark it as visited. Construct a graph with the single edge picked in this step. In Figure 6.7(1), the edge triplet (a,e1,b) is initially picked.
Step 2: Among the converse edge triplets that are not visited so far, find and mark as visited a converse edge triplet, if present, that contains the edge label of the edge picked in Step 1. In Figure 6.7, the converse edge triplet (e1,b,e2) is such a valid converse edge triplet. Extend the graph formed in Step 1 by adding the new edge label seen in the converse edge triplet picked in this step, as can be seen in Figure 6.7(2).
Step 3: Among the remaining unvisited edge triplets, find and mark as visited the edge triplet that contains the edge label inserted in Step 2. Hence (b,e2,c) is located among the edge triplets shown in the table in Figure 6.7. Label c is thus added to the unlabeled node in Figure 6.7(2), which forms Figure 6.7(3).
Step 4: Repeat Steps 2 and 3 recursively until no more extension is possible.
Step 5: The graph so generated might contain unlabeled nodes. All edges incident on a node that is not labeled are dropped. For example, if the maximal frequent itemset corresponds to only the two triplets (n,e,n) and (e,n,e1), then, of the two nodes incident on e1, we know that one node label is n while we have no information about the label of the other. The edge e1 is thus dropped and the graph formed is a single-edge graph with edge (n,e,n).
Step 6: Add the graph formed by the end of Step 5 to the set of candidate maximal frequent subgraphs.
Step 7: Repeat from Step 1 if there are any more unvisited edge triplets.
For a subitemset, or for a maximal frequent itemset that corresponds to a single connected graph, all the edge triplets will be visited in one iteration of the algorithm. Conversely, if a maximal frequent itemset corresponds to a disconnected graph, then in each iteration of Steps 1 to 6 one component of the disconnected graph is generated.
The key to the above procedure lies in two steps: the extension of an edge triplet using a converse edge triplet in Step 2, and the extension of a converse edge triplet using an edge triplet in Step 3. We explain Step 2 and Step 3 in detail next.
Extension of an edge triplet: Let (n1, e, n2) be an edge triplet in the graph constructed so far, where the label n1 is not the same as n2. If a converse edge triplet (e, n1, e1) is present in the itemset, then (n1, e, n2) can be extended by adding the converse edge triplet on the n1 node. At this point, we only have the information that an edge with label e1 can be incident on the node with label n1; the label of the opposite node of the edge with label e1 is not yet known. Extension of (n1, e, n2) using (e, n1, e1) thus introduces an edge whose opposite node details are not known, called a dangling edge. The unknown opposite node is called a dangling node. The node label of the dangling node is assigned in the procedure "Extension of a converse edge triplet" (Step 3 above). Step 3 is called on a pair, say (e, n1), for which it finds matching converse triplets of the form (e, n1, *) and does the necessary extension in the graph. However, extending an edge triplet of the form (n, e, n), where the labels of both nodes are the same, is more complex. If a converse edge triplet (e, n, e1) is present, then the edge triplet (n, e, n) can be extended on either of the two nodes incident on the edge, and each extension will lead to a different graph being constructed.
We now discuss the various cases that arise when the node labels of the edge to be extended are the same, that is, when the edge to be extended is of the form (n,e,n). The cases below cover all possible states the graph could be in when extending an edge of the form (n,e,n). Let the nodes on which the edge with label e is incident be A and B, and let the edge we want to construct adjacent to the edge with label e be an edge with label e1, due to the presence of the converse edge triplet (e, n, e1). We use Figure 6.5 to illustrate the cases explained below.
[Figure: an edge with label e between nodes A and B, with further edges labeled ex, ey and ez incident on A and B, leading to neighbouring nodes such as C and D.]
Figure 6.5 Cases in Edge Extension
Case 1: degree(A)=1 and degree(B)=1: This is the case when the graph contains only one edge so far, which is the edge (A,e,B) itself. In such a case, one can extend with the edge labeled e1 on either side.
Case 2: degree(A)=1 and degree(B)=2: Consider Figure 6.5; since here degree(A)=1 and degree(B)=2, assume that the edges with labels ey and ez do not exist in the figure. The algorithm first checks whether a converse edge triplet of the form (e1,n,ex) exists. If such a triplet does not exist in the maximal frequent itemset being considered, it is straightforward that we create a dangling edge on node A. However, if such a converse edge triplet exists, then we could make two constructions, both of which could be correct: we could add an edge with label e1 connecting the nodes A and C, forming a triangle if the node label of C is also n, or we could construct a dangling edge on node B. In such an ambiguity, where node C also has label n, we check the maximal frequent itemset for whether a triangle or a spike is present using the edge labels e, e1 and ex. Only one of the two can be present in any maximal frequent itemset, thus resolving the ambiguity. In cases where the node labels are different, it is straightforward to construct a dangling edge on node B and proceed to the step of "Extension of the Converse Edge Triplet" (Step 3).
Case 3: degree(A)=1 and degree(B)>2: The presence of a node with degree greater than two is trivial to resolve. Consider the figure such that the edge ey does not exist. The three edges incident on node B are the edges with labels e, ex and ez. If the edge with label e1 is to be incident on node B, then all three converse triplets (e1,n,e), (e1,n,ex) and (e1,n,ez) should be present in the set of converse edge triplets of the maximal frequent itemset being processed for construction. If all three converse triplets are not present, then the dangling edge with label e1 is constructed on node A.
Case 4: degree(A)=2 and degree(B)=1: This case is symmetric to Case 2 and thus can be handled like Case 2.
Case 5: degree(A)=2 and degree(B)=2: Consider the figure such that the edge with label ez does not exist, as both nodes A and B are to have degree two. Consider the case when the converse edge triplets (e1,n,ex) and (e1,n,ey) are both present in the list of converse edge triplets. If the node labels of the nodes C and D are also n, this would imply that one of the triangles (A,B,C) and (A,B,D) exists. This is resolved by referring to the triangle ids, checking for the presence of either of the edge label sets (e,e1,ex) or (e,e1,ey). The other cases with degree(A)=2 and degree(B)=2 are trivial to resolve.
Case 6: degree(A)=2 and degree(B)>2: As the degree of node B is greater than two, this case is trivial to resolve, as in Case 3.
Case 7: degree(A)>2: As the degree of node A is greater than two, this case is trivial to resolve, as in Case 3.
[Figure: a chain of nodes 1–4, all labeled a, joined by edges e1, e2, e3, with a dangling edge e4: (a) e4 with dangling node 5; (b) e4 connected to a new node 5 labeled a; (c), (d) e4 connected to an existing node, with the dangling node removed.]
Figure 6.6 Extending the converse edge triplet
Extension of a converse edge triplet: In this step, the dangling edges introduced in Step 2 ("extension of an edge triplet", described above) are converted into regular edge triplets. Consider Figure 6.6(a). Say the edge with label e4 was introduced last in Step 2. In order to extend the converse edge triplet, Step 3 looks for an unvisited edge triplet with edge label e4. Say the edge triplet found is of the form (a,e4,a). This indicates that both nodes on which the edge with label e4 is incident should have node label a, one of which is already marked as a. The converse edge triplet extension function then needs to determine whether the dangling node is a node already existing in the graph constructed so far or whether a new node needs to be created. As in Figure 6.6(b), the edge e4 could be connected to a new node (node 5, labeled a), or it could be connected to an existing node as in Figure 6.6(c) or 6.6(d). In Figures 6.6(c) and 6.6(d), the dangling node of the edge e4 is thus removed.
Generalising the above, let the converse edge triplet included last be (e,n,e1), where the edge label e was introduced before the converse edge triplet extension. The edge labeled e1 is thus constructed incident on the node with label n and on a dangling node. In order to complete the construction of the edge labeled e1 incident on the node with label n, one needs to know the opposite node as well. The function hence determines the node label of the dangling node by looking at the edge triplet for the edge e1; it is trivial to find the label of the opposite node from the set of edge triplets in the maximal itemset. Suppose the edge triplet (n,e1,lk) is present in the maximal itemset; we then know that the label of the opposite node is lk. However, due to the possible multiple occurrences of nodes with the same label, the function needs to determine whether a new node is to be created and assigned the label lk, or on which existing node with label lk the edge labeled e1 is to be incident. The function checks each existing node na with label lk by picking any one edge incident on na (say with label ek) and checking whether a converse edge triplet (e1, lk, ek) exists. If such a triplet exists, an edge is inserted between node na and the node with label n, and the dangling edge with label e1 built in the edge triplet extension phase is dropped. If an edge cannot be formed with any of the existing nodes, then the dangling node is marked with label lk. Every edge triplet and converse edge triplet, once visited, is marked as visited and is not used for further construction.
Consider a node n_p1 in the graph constructed so far with node label lk. Let the edge ek incident on n_p1 have its opposite node n_p2. If the node label of n_p2 is also lk, then there is ambiguity about whether the edge with label e1 is to be connected to n_p1 or n_p2. This is because, if the converse triplet (e1, lk, ek) exists, it could be due to the triplets (e1, label(n_p1), ek) or (e1, label(n_p2), ek). Since the graph so far is connected, one of the nodes n_p1 or n_p2 has to have degree greater than one; let it be n_p1. This means there exists another edge, say with label ep, incident on n_p1 apart from the edge with label ek. If the edge e1 is to be incident on n_p1 and not on n_p2, then the converse edge triplet (ep, label(n_p1), e1) should also exist apart from the converse edge triplet (ek, label(n_p1), e1). If such a triplet does not exist, then the edge e1 should be incident on the node n_p2. Also, a simple check to avoid self-loops is applied during EdgeExtension.
6.2.3 Pruning Phase
Given two maximal frequent itemsets Mi and Mj, each of them might yield a connected or a disconnected graph. One component of the graph generated by Mi might be a subgraph of the graph generated by Mj or of one of its components. Hence, not all subgraphs generated in the previous phase are maximal frequent subgraphs, and an additional simple pruning phase is needed. Using subgraph isomorphism is one way of determining whether a graph g is a subgraph of another graph g′. However, in ISG, we do away with the subgraph isomorphism operation, as it is possible to determine whether a graph is a subgraph of another using a subset operation. Consider two graphs g and g′. Let I be the set of item ids of the edge and converse edge triplets in g, and let I′ be the corresponding set for g′. If g is a subgraph of g′, then I has to be a subset of I′. However, I being a subset of I′ does not guarantee that g is a subgraph of g′: edge and converse edge triplets alone cannot disambiguate whether g has a spike using the triplets while g′ has a triangle using the same triplets. Hence, here again, we retain for each candidate maximal frequent subgraph a corresponding itemset of its edge and converse edge triplets along with the triangle, spike and linear chain ids present in it. It can be trivially seen that, for itemsets so constructed, I being a subset of I′ guarantees that g is a subgraph of g′. It is trivial to pass on the triangle, spike and linear chain ids already computed and present in each maximal frequent itemset that was used for graph construction.
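With those ids retained, the pruning phase reduces to plain subset tests. A sketch with illustrative names, where each candidate is represented by its extended itemset of item ids:

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Keep only candidates whose extended itemset (edge, converse edge, triangle,
// spike and linear chain ids) is not a proper subset of another candidate's.
std::vector<std::set<int>> pruneNonMaximal(
        const std::vector<std::set<int>>& cands) {
    std::vector<std::set<int>> maximal;
    for (size_t i = 0; i < cands.size(); ++i) {
        bool properSubset = false;
        for (size_t j = 0; j < cands.size() && !properSubset; ++j)
            properSubset = i != j && cands[i] != cands[j] &&
                std::includes(cands[j].begin(), cands[j].end(),
                              cands[i].begin(), cands[i].end());
        if (!properSubset) maximal.push_back(cands[i]);
    }
    return maximal;
}
```

`std::includes` requires sorted ranges, which `std::set` provides; each test costs linear time in the sizes of the two itemsets, avoiding subgraph isomorphism entirely.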
6.2.4 Illustration of an Example
Figure 6.7 illustrates a simple run of the algorithm. Listed are the edge and converse edge triplets of a maximal frequent itemset and the stages in which the graph is extended, for ease of understanding. The three edge triplets are (a, e1, b), (a, e3, c) and (b, e2, c), and the three converse edge triplets are (e1, b, e2), (e1, a, e3) and (e2, c, e3). In this example, we assume for simplicity that there is no common item id, spike id, triangle id or linear chain id.
[Figure: the triplets (a,e1,b) [Step 1], (e1,b,e2) [Step 2], (b,e2,c) [Step 3], (e2,c,e3) [Step 4] and (a,e3,c) [Step 5] are applied in stages (1)–(4); stage (5.1) connects the edge e3 to the existing node 0 labeled a, while stage (5.2) attaches e3 to a new node labeled a.]
Figure 6.7 Running Example of the ISG algorithm
Figure 6.7(1) shows the initial edge picked. The graph is created with two nodes, node 0 with label a and node 1 with label b, joined by an edge with label e1. The edge triplet (a, e1, b) is marked in the adjacent table as visited in Step 1. In Figure 6.7(2), ConverseExtension looks for extensions of the edge created in Figure 6.7(1). All converse edge triplets of the form (e1, label(1), *) = (e1, b, *) are searched for among the converse edge triplets. The triplet (e1, b, e2) matches the search. Hence, a new adjacent edge is created with edge label e2. While the node label of the incident node 1 is known, the label of the other node 2 is not known. We call node 2 a dangling node and the edge with label e2 a dangling edge. EdgeExtension now looks for an edge triplet of the form (label(1), e2, *) = (b, e2, *) in order to find the label of the dangling node. Label c, which matches the search, is updated in the graph as in Figure 6.7(3). Next, we look for converse edge triplet extensions of the form (e2, label(2), *) = (e2, c, *) and extend the graph to contain an edge with label e3, as seen in Figure 6.7(4). On looking for edge triplets of the form (label(2), e3, *) = (c, e3, *), the triplet (a, e3, c) qualifies. Since the node label a already exists in the graph, the edge with label e3 could create extensions of two forms: the first, shown in 6.7(5.1), connects to the existing node with label a, while the second, shown in 6.7(5.2), creates a new node with label a.
For Figure 6.7(5.1) to be valid, the converse triplet (e1, a, e3) should be present, because the edges with labels e1 and e3 now become neighbours. For Figure 6.7(5.2) to be valid, the converse triplet (e1, a, e3) should not be present. Since the converse edge triplet (e1, a, e3) exists, the dangling edge on node 2 is dropped and 6.7(5.1) qualifies as the valid graph. We next look for extensions among converse edge triplets of the form (e3, a, *); (e1, a, e3) is already introduced in the graph, hence no match exists.
We had initially started with converse edge triplet extensions of the form (e1, label(1), *). We next look for converse edge triplet extensions of the form (e1, label(0), *). The only match, (e1, a, e3), is already introduced in the graph. As we cannot extend further, we terminate by producing the graph in 6.7(5.1). Since all edge triplets have been visited, we also note that the maximal frequent subgraph generated is connected. Otherwise, we would need to restart the process with an unvisited edge triplet to construct the other components of the disconnected graph being generated by the itemset. Note that, alternating between searching for converse edge and edge triplets, we extend the graph until no more extension is possible or no unseen edge triplets exist.
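The reason 6.7(5.1) qualifies while 6.7(5.2) does not can be checked mechanically: a candidate construction is valid only if its edge and converse edge triplets are exactly those of the itemset. A sketch of this check (illustrative names, not the thesis code), with triplets stored canonically, smaller label first:

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <tuple>
#include <vector>

using Trip = std::tuple<std::string, std::string, std::string>;
struct Edge { int u, v; std::string e; };  // endpoints by node id

// True iff the candidate graph realizes exactly the given edge triplets and
// converse edge triplets; nl maps node ids to node labels.
bool matches(const std::vector<std::string>& nl, const std::vector<Edge>& g,
             const std::set<Trip>& edgeT, const std::set<Trip>& convT) {
    std::set<Trip> et, ct;
    for (const Edge& e : g)                        // realized edge triplets
        et.insert(Trip{std::min(nl[e.u], nl[e.v]), e.e,
                       std::max(nl[e.u], nl[e.v])});
    for (size_t i = 0; i < g.size(); ++i)          // realized converse triplets
        for (size_t j = i + 1; j < g.size(); ++j) {
            int s = (g[i].u == g[j].u || g[i].u == g[j].v) ? g[i].u
                  : (g[i].v == g[j].u || g[i].v == g[j].v) ? g[i].v : -1;
            if (s < 0) continue;                   // edges are not adjacent
            ct.insert(Trip{std::min(g[i].e, g[j].e), nl[s],
                           std::max(g[i].e, g[j].e)});
        }
    return et == edgeT && ct == convT;
}
```

The triangle of 6.7(5.1) realizes all three converse triplets, while the chain of 6.7(5.2) fails to realize (e1, a, e3).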
6.2.5 Proof of Correctness
We next show that the algorithm reports exactly the maximal frequent subgraphs: every subgraph reported is maximal frequent, and no maximal frequent subgraph goes unreported.
Claim 2 All maximal frequent subgraphs are reported.
Proof: Consider a maximal frequent subgraph mi. The set of edge and converse edge triplets that form mi is frequent and hence has to be a subset of some maximal frequent itemset MIi reported in the itemset mining phase. Let the (connected or disconnected) graph corresponding to the itemset MIi be GIi. Then mi has to correspond to GIi or to a subgraph of some component of GIi. Since mi is maximal, mi can only be an improper subgraph of GIi or of some component of it, which means that the connected GIi or a component of GIi itself corresponds to mi. After the accumulation of all possible candidate maximal frequent subgraphs, the pruning phase removes all graphs that are subgraphs of another graph in this set. A maximal frequent subgraph, by definition, will not get pruned. Hence, the reported set contains all maximal frequent subgraphs. □
Claim 3 There is no graph reported that is not a maximal frequent subgraph.
Proof: A graph that is not maximal frequent gets pruned out in the pruning phase, as there has to exist some graph in the set of maximal frequent subgraphs which is a supergraph of it. □
From the above two claims, it can be concluded that the set of graphs reported is the set of maximal frequent subgraphs of the original dataset D.
6.3 Results
L=20, E=20, V=20, I=20, T=22

D        Supp   ISG(sec)   GS(sec)   Spin(sec)   MR(sec)
D=100      30     0.04      43.55     121.23      2.37
           20     0.07      86.77     164.96      1.3
           10     0.07      88.09     176.24      0.38
            5     0.07      88        176.83      0.38
D=500      30     0.14      14.83     277.66      2.76
           20     0.28      17.05     581.41     10.06
           10     0.16      31.44     488.15      2.4
            5     0.15      31.26     487.71      2.22
D=1000     30     0.29      17.34     581.16     10.69
           20     0.16      31.33     487.75      2.64
           10     0.3       20.08     650.77      6.44
            5     0.3       20.2      649.17      5.4

Table 6.2 Results with varying D
D=500, L=20, E=20, V=20

I, T         Supp   ISG(sec)   GS(sec)   Spin(sec)   MR(sec)
I=10,T=15      30     0.08       0.15       3.22      9.83
               20     0.09       0.4        6.69     16.51
               10     0.1        0.47       8.5      13.75
                5     0.11       0.72      12.25      5.96
I=15,T=20      30     0.13       0.04       0.11     35.72
               20     0.14       5.18      60.13     39.67
               10     0.17      19.48     131.25     38.37
                5     0.2       29.39     247.59     12.89
I=20,T=25      30     0.18     114.68    1057.53     33.53
               20     0.2      138.69    1304.07     16.08
               10     0.23     352.03    2002.18     13.78
                5     0.24     350.52    2005.68     17.18

Table 6.3 Results with varying I and T
We implemented the ISG algorithm and tested it on both synthetic and real-life datasets. We ran our experiments on a 1.8 GHz Intel Pentium IV PC with 1 GB of RAM, running Fedora Core 4. The code is implemented in C++ using STL, the Graph Template Library¹ and the IGraph library².
¹ http://www.infosun.fmi.uni-passau.de/GTL
² http://igraph.sourceforge.net
Supp   ISG(sec)   GSpan(sec)
 3       0.19        0.33
 2.5     0.2         0.45
 2       0.19        0.75
 1.5     0.17        1.01
 1       0.13        2.29
 0.5     0.1         4.43

Table 6.4 Results on Stocks Data
6.3.1 Results on Synthetic Datasets
Synthetic Data Generation: We generated the synthetic datasets using the graph generator software provided by [47]. The graph generator generates the datasets based on six parameters: D (the number of graphs), E and V (the number of distinct edge and vertex labels respectively), T (the average size of each graph), I (the average size of frequent graphs) and L (the number of frequent patterns as frequent graphs). In a post-processing step, the edge labels of each graph are randomly modified so that the graph satisfies the unique edge label constraint.
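The post-processing step can be sketched as follows. The deterministic counter here is a stand-in for the random relabeling; the actual generator would draw replacement labels from the edge-label alphabet E.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Relabel edges so that each edge label occurs at most once per graph
// (each inner vector holds the edge labels of one graph).
void enforceUniqueEdgeLabels(std::vector<std::vector<int>>& edgeLabels) {
    for (auto& graph : edgeLabels) {
        std::set<int> seen;
        int fresh = 0;
        for (int& label : graph) {
            while (seen.count(label)) label = fresh++;  // replace duplicates
            seen.insert(label);
        }
    }
}
```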
We compared the time taken by the ISG algorithm with two other maximal frequent subgraph mining algorithms, SPIN [36] and MARGIN [58], and one frequent subgraph mining algorithm, gSpan [65]. We used our own implementations of SPIN and MARGIN, while for gSpan we used the executable provided by its authors. While SPIN and MARGIN generate the maximal subgraphs, gSpan outputs all frequent subgraphs; in a post-processing step, we compare the frequent subgraphs generated by gSpan and extract the maximal ones.
Table 6.2 shows the results when the number of graphs is taken as 100, 500 and 1000. The other parameters used for the generation of the dataset are L=20, E=20, V=20, I=20 and T=22. We can see that the ISG algorithm performs orders of magnitude better than the other approaches.
Table 6.3 compares the running times when D, L, E and V are kept constant at 500, 20, 20 and 20 respectively, while I and T are varied between 10 and 25. It can be observed that the relative performance of ISG improves drastically for higher values of I and T.
6.3.2 Results on Real-life Dataset
We further present our results on stock market data of 20 companies collected from Yahoo Finance3. We use the correlation function from [62] to calculate the correlation between any pair of companies A and B:

C_{A,B} = \frac{1}{|T|} \sum_{i=1}^{|T|} \frac{A_i \times B_i - \bar{A} \times \bar{B}}{\sigma_A \times \sigma_B}

where |T| denotes the number of days in the period T, A_i and B_i denote the average price of the stocks of companies A and B on day i respectively, \bar{A} = \frac{1}{|T|} \sum_{i=1}^{|T|} A_i, \bar{B} = \frac{1}{|T|} \sum_{i=1}^{|T|} B_i, \sigma_A = \sqrt{\frac{1}{|T|} \sum_{i=1}^{|T|} A_i^2 - \bar{A}^2} and \sigma_B = \sqrt{\frac{1}{|T|} \sum_{i=1}^{|T|} B_i^2 - \bar{B}^2}.

3 Yahoo Finance: http://finance.yahoo.com
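This correlation can be computed directly from the daily price series; a minimal sketch (the function name is ours):

```python
import math

def correlation(A, B):
    """Correlation per the thesis formula:
    C = (mean(A_i * B_i) - mean(A) * mean(B)) / (sigma_A * sigma_B)."""
    n = len(A)
    mean_A = sum(A) / n
    mean_B = sum(B) / n
    # Covariance term: mean of products minus product of means.
    cov = sum(a * b for a, b in zip(A, B)) / n - mean_A * mean_B
    # Population standard deviations, as in the sigma_A, sigma_B definitions.
    sigma_A = math.sqrt(sum(a * a for a in A) / n - mean_A ** 2)
    sigma_B = math.sqrt(sum(b * b for b in B) / n - mean_B ** 2)
    return cov / (sigma_A * sigma_B)

# Perfectly correlated series give C = 1.
assert abs(correlation([1, 2, 3], [2, 4, 6]) - 1.0) < 1e-9
```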
We construct a graph database where each graph in the database corresponds to seven successive working days (referred to as a week). For every week, we find the correlation values between every pair of companies. The companies are clustered into a specific number of groups based on their average stock value during that week. Each group is assigned a unique label which is used as the node label in the graph. The correlation values between every pair of companies are ordered, the top K% values are ranked, and the rank is used as the edge label. Thus, the graph corresponding to each week has unique edge labels.
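The weekly graph construction can be sketched roughly as follows; the median price split, the parameter names and the `week_graph` function are illustrative assumptions, not the exact procedure used in the experiments:

```python
def week_graph(avg_price, corr, top_percent=20):
    """Build one week's graph: node label = price group, edge label = rank
    of the correlation value among the top `top_percent`% pairs.

    `avg_price` maps company -> average price for the week; `corr` maps
    each unordered pair (a, b) -> correlation value."""
    # Split companies into two price groups around the median (illustrative).
    median = sorted(avg_price.values())[len(avg_price) // 2]
    nodes = {c: "high" if p >= median else "low" for c, p in avg_price.items()}
    # Rank the strongest correlations; the rank itself is the edge label,
    # so every edge of the week's graph gets a distinct label.
    k = max(1, len(corr) * top_percent // 100)
    ranked = sorted(corr.items(), key=lambda kv: kv[1], reverse=True)[:k]
    edges = [(a, b, rank) for rank, ((a, b), _) in enumerate(ranked, start=1)]
    return nodes, edges
```

For instance, with 20 companies there are 190 pairwise correlations, so keeping the top 20% yields roughly 38 ranked edges per weekly graph, each with a distinct label.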
We collected graphs corresponding to the top 20 companies for 295 weeks. The top 20 companies are selected for each week based on their average share price during that week. Further, the companies are classified into high and low categories, which are used as node labels. The correlation values are calculated for every pair of companies, and the edge labels correspond to the rank of the value. Table 6.4 shows the comparison of the ISG algorithm with gSpan on this dataset. For low support values, ISG performs better than gSpan. Timings for the other two algorithms, SPIN and MARGIN, could not be obtained as they did not terminate even after running for a very long time at these very low support values.
Chapter 7
Conclusions
Many applications require the computation of maximal frequent subgraphs, such as mining contact maps [34], finding maximal frequent patterns in metabolic pathways [45], and finding sets of large cohesive web pages. As an example, consider the web pages that are related. These can be determined based on information about the set of web pages visited by users in one session. Each sequence of pages visited can be modeled as a graph. Given a set of such graphs, the cumulative behavior of the web users can be mined by finding the maximal frequent subgraphs. Each maximal frequent subgraph might denote a cohesive set of topics that are of interest to most of the visitors of the website. Further, given a database of chemical compounds, maximal frequent substructures may provide insight into the behavior of the molecules. For example, researchers exploring immunodeficiency viruses may gain insight into recombinants via commonalities within initial replicants. Recent interest in functional RNAs, spurred by a surge in the number of published X-ray structures, has highlighted the importance of characterizing recurrent structural motifs and their roles. Maximal subgraph mining methods can be applied to proteins to predict folding sequences and to find maximal motifs in RNAs.
In this thesis, we propose the MARGIN algorithm for maximal frequent subgraph mining over a graph database. Further, if the graph database contains graphs with unique edge labels, itemset mining techniques can replace the expensive graph mining techniques. We propose the ISG algorithm for maximal frequent subgraph mining of graphs with unique edge labels using itemset mining techniques.
Frequent subgraph mining faces several challenges. For a graph of e edges, the number of possible frequent subgraphs can grow exponentially in e. Further, it is critical to reduce the number of subgraphs for which frequency computation is required, since computing the frequency of a subgraph requires the NP-complete subgraph isomorphism operation. A typical approach to the frequent subgraph mining problem has been to find the frequent subgraphs incrementally in an Apriori manner. The Apriori-based approach has been further modified to suit maximal subgraph mining [36] with added pruning, and hence faces challenges similar to those of frequent subgraph mining.
However, the size of the set of maximal frequent subgraphs is significantly smaller than that of the set of frequent subgraphs [36], thus providing ample scope for pruning the exponentially large search space. The set of candidate subgraphs that are likely to be maximal frequent is the set of n-edge
frequent subgraphs that have an (n+1)-edge infrequent supergraph. We refer to such a set of nodes in the lattice as the set of f(†)-nodes. The MARGIN algorithm computes such a candidate set efficiently. In a post-processing step, MARGIN finds all maximal frequent subgraphs. The ExpandCut step invoked within the MARGIN algorithm recursively finds the candidate subgraphs. The search space of Apriori-based algorithms corresponds to the region below the f(†)-nodes, all the way from the bottom of the lattice up to the maximal frequent subgraphs. In contrast, the search space of MARGIN is limited to the region of the lattice around the f(†)-nodes.
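The f(†)-node idea can be illustrated on a small itemset lattice, where subset containment stands in for the subgraph relation. The brute-force enumeration below is only for illustration; MARGIN's whole point is to avoid exactly this kind of full lattice scan:

```python
from itertools import combinations

def f_dagger_nodes(items, is_frequent):
    """Enumerate the 'border' candidates in the subset lattice: frequent
    sets that have an infrequent superset with one more element.

    Itemsets stand in for subgraphs here purely to illustrate the
    f(dagger)-node definition; MARGIN walks the graph lattice instead."""
    border = []
    for k in range(len(items) + 1):
        for sub in combinations(items, k):
            s = frozenset(sub)
            if not is_frequent(s):
                continue
            # One-element extensions of s; any infrequent one makes s a border node.
            supersets = (s | {x} for x in items if x not in s)
            if any(not is_frequent(t) for t in supersets):
                border.append(s)
    return border

def freq(s):
    # Toy frequency oracle: sets of size <= 2 that avoid item 'd' are frequent.
    return len(s) <= 2 and "d" not in s

border = f_dagger_nodes(list("abcd"), freq)
assert frozenset({"a", "b"}) in border   # frequent, but adding 'c' breaks it
```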
We prove that the MARGIN algorithm finds all the nodes that belong to the candidate set. We show that any two candidate nodes are reachable from one another using the ExpandCut function used to explore the candidate nodes. Given that the graph lattice is dense and several paths connect any two candidate nodes, one of the main challenges of the proof is the abstraction of the sublattice that contains the path taken by the ExpandCut function. By constructing such a sublattice, we show that it is guaranteed to exist. A reachability sequence then shows that, starting at any candidate node, another candidate node is reached.
If the graphs in the database satisfy constraints such as unique edge or node labels, itemset mining techniques can be used to efficiently find maximal frequent subgraphs. The set of edges of each graph can be uniquely mapped to an itemset, thus mapping the graph database into a transaction database. The maximal frequent itemsets of the transaction database are computed using any standard maximal frequent itemset mining algorithm. The conversion of maximal itemsets back into graphs is non-trivial, as one maximal itemset might lead to several candidate maximal graphs. The ISG algorithm efficiently preprocesses each maximal itemset to generate a set of itemsets, each of which can be unambiguously converted into a candidate maximal graph.
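The graph-to-transaction mapping can be sketched as follows; the concrete item encoding used here (the two endpoint labels plus the edge label) is an assumption for illustration, not the exact ISG encoding:

```python
def graphs_to_transactions(graphs):
    """Map a database of unique-edge-labeled graphs to a transaction
    database: each labeled edge becomes one item.

    Each graph is (node_labels, edges), where node_labels maps node id ->
    label and edges is a list of (u, v, edge_label). The item is the triple
    (smaller endpoint label, larger endpoint label, edge label), so the same
    labeled edge maps to the same item regardless of edge direction."""
    transactions = []
    for node_labels, edges in graphs:
        items = frozenset(
            (min(node_labels[u], node_labels[v]),
             max(node_labels[u], node_labels[v]),
             lbl)
            for u, v, lbl in edges
        )
        transactions.append(items)
    return transactions
```

After this step, any off-the-shelf maximal frequent itemset miner can be run on the resulting transactions; the non-trivial part, as discussed above, is converting the maximal itemsets back into graphs.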
The main contributions of the work are: (i) a novel algorithm, MARGIN, to find maximal frequent subgraphs; (ii) a generalisation of the MARGIN algorithm that can be applied to a wider range of subgraph mining problems; the ExpandCut technique used in MARGIN need not be limited to maximal frequent subgraph mining, and we present a set of properties, having which, a problem can be solved using the ExpandCut technique; (iii) the proof of the MARGIN algorithm; (iv) experimental results on both synthetic and real-life datasets, showing the viability of the MARGIN algorithm in efficiently finding maximal frequent subgraphs and comparing it with known standard techniques such as gSpan, SPIN and CloseGraph. Experimental analysis has shown that MARGIN performs well for databases where the frequent graph patterns are large and for lower support values; however, MARGIN has the disadvantage of using more memory than existing algorithms for maximal frequent subgraph mining; (v) the ISG algorithm, which efficiently mines maximal frequent subgraphs in a constrained database using itemset mining techniques; (vi) experimental results comparing graph mining techniques against the ISG algorithm for graphs with unique edge labels.
Future work consists of: (i) studying the effect of the initial cut selected at runtime by the MARGIN algorithm. The efficiency of the MARGIN algorithm depends on the initial cut selected.
By selecting an appropriate initial cut efficiently, it might be possible to reduce the number of revisited cuts. Further, a number of initial cuts can be introduced and processed in parallel to complete the mining faster. (ii) Developing a partition-based modification of the MARGIN algorithm. Due to memory constraints, it is of great interest to partition the graphs onto several machines and then apply a modified MARGIN approach to find maximal frequent subgraphs over a distributed graph database. (iii) Extending the MARGIN and ISG algorithms to incrementally compute the maximal frequent subgraphs in the presence of updates to the graph database. (iv) Finding graph databases with constraints other than unique edge labels, and their corresponding algorithms that benefit from methods other than those typically used in graph mining.
Bibliography
[1] The graph template library. http://www.infosun.fmi.uni-passau.de/GTL/.
[2] gspan software. http://www.xifengyan.net/software/gSpan.htm.
[3] The igraph library. http://igraph.sourceforge.net/.
[4] Libgtop. http://library.gnome.org/devel/libgtop/stable/libgtop-GlibTop.html.
[5] The scor database. http://scor.berkeley.edu/.
[6] Top. http://en.wikipedia.org/wiki/Top (Unix).
[7] The uci kdd archive: msnbc.com anonymous web data. http://kdd.ics.uci.edu/databases/msnbc/msnbc.html.
[8] Yahoo finance. http://finance.yahoo.com/.
[9] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large
databases. pages 207–216. SIGMOD, 1993.
[10] R. Agrawal and J. C. Shafer. Parallel mining of association rules. volume 8, pages 962–969. IEEE Trans.
Knowl. Data Eng, 1996.
[11] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. pages 487–499.
VLDB, 1994.
[12] R. Agrawal and R. Srikant. Mining sequential patterns. pages 3–14. ICDE, 1995.
[13] C. Borgelt and M. R. Berthold. Mining molecular fragments: Finding relevant substructures of molecules.
pages 51–58. ICDM, 2002.
[14] G. Buehrer, S. Parthasarathy, and Y.-K. Chen. Adaptive parallel graph mining for cmp architectures. pages
97–106. ICDM, 2006.
[15] D. Burdick, M. Calimlim, and J. Gehrke. Mafia: A maximal frequent itemset algorithm for transactional
databases. pages 443–452. ICDE, 2001.
[16] C. Chen, C. X. Lin, M. Fredrikson, M. Christodorescu, X. Yan, and J. Han. Mining graph patterns efficiently
via randomized summaries. volume 2, pages 742–753. PVLDB, 2009.
[17] D. W.-L. Cheung, J. Han, V. T. Y. Ng, A. W.-C. Fu, and Y. Fu. A fast distributed algorithm for mining
association rules. pages 31–42. PDIS, 1996.
[18] D. W.-L. Cheung, J. Han, V. T. Y. Ng, and C. Y. Wong. Maintenance of discovered association rules in large
databases: An incremental updating technique. pages 106–114. ICDE, 1996.
[19] Y. Chi, R. R. Muntz, S. Nijssen, and J. N. Kok. Frequent subtree mining - an overview. volume 66, pages 161–198. Fundam. Inform., 2005.
[20] Y. Chi, Y. Xia, Y. Yang, and R. R. Muntz. Mining closed and maximal frequent subtrees from databases of
labeled rooted trees. volume 17, pages 190–202. IEEE Trans. Knowl. Data Eng., 2005.
[21] Y. Chi, Y. Yang, and R. R. Muntz. Hybridtreeminer: An efficient algorithm for mining frequent rooted trees
and free trees using canonical form. pages 11–20. SSDBM, 2004.
[22] M. Cohen and E. Gudes. Diagonally subgraphs pattern mining. pages 51–58. DMKD, 2004.
[23] D. J. Cook and L. B. Holder. Substructure discovery using minimum description length and background
knowledge. volume 1, pages 231–255. J. Artif. Intell. Res., 1994.
[24] L. Dehaspe and H. Toivonen. Discovery of frequent datalog patterns. volume 3, pages 7–36. Data Min.
Knowl. Discov., 1999.
[25] L. Dehaspe, H. Toivonen, and R. D. King. Finding frequent substructures in chemical compounds. pages
30–36. KDD, 1998.
[26] M. Deshpande, M. Kuramochi, and G. Karypis. Frequent sub-structure-based approaches for classifying
chemical compounds. pages 35–42. ICDM, 2003.
[27] W. Fan, K. Zhang, H. Cheng, J. Gao, X. Yan, J. Han, P. S. Yu, and O. Verscheure. Direct mining of
discriminative and essential frequent patterns via model-based search tree. pages 230–238. KDD, 2008.
[28] K. Gouda and M. J. Zaki. Efficiently mining maximal frequent itemsets. pages 163–170. ICDM, 2001.
[29] D. Gunopulos, R. Khardon, H. Mannila, S. Saluja, H. Toivonen, and R. S. Sharm. Discovering all most specific sentences. volume 28, pages 140–174. ACM Trans. Database Syst., 2003.
[30] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. pages 1–12. SIGMOD,
2000.
[31] M. A. Hasan, V. Chaoji, S. Salem, J. Besson, and M. J. Zaki. ORIGAMI: Mining representative orthogonal
graph patterns. pages 153–162. ICDM, 2007.
[32] H. He and A. K. Singh. Closure-tree: An index structure for graph queries. page 38. ICDE, 2006.
[33] M. Holsheimer, M. L. Kersten, H. Mannila, and H. Toivonen. A perspective on databases and data mining.
pages 150–155. KDD, 1995.
[34] J. Hu, X. Shen, Y. Shao, C. Bystroff, and M. J. Zaki. Mining protein contact maps. pages 3–10. BIOKDD,
2002.
[35] J. Huan, W. Wang, and J. Prins. Efficient mining of frequent subgraphs in the presence of isomorphism.
pages 549–552. ICDM, 2003.
[36] J. Huan, W. Wang, J. Prins, and J. Yang. Spin: mining maximal frequent subgraphs from graph databases.
pages 581–586. KDD, 2004.
[37] A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from
graph data. pages 13–23. PKDD, 2000.
[38] A. Inokuchi, T. Washio, H. Motoda, K. Kumasawa, and N. Arai. Basket analysis for graph structured data.
pages 420–431. PAKDD, 1999.
[39] H. Jiang, H. Wang, P. S. Yu, and S. Zhou. Gstring: A novel approach for efficient search in graph databases.
pages 566–575. ICDE, 2007.
[40] R. Jin, C. Wang, D. Polshakov, S. Parthasarathy, and G. Agrawal. Discovering frequent topological struc-
tures from graph datasets. pages 606–611. KDD, 2005.
[41] R. J. B. Jr. Efficiently mining long patterns from databases. pages 85–93. SIGMOD, 1998.
[42] R. Kaushik, P. Shenoy, P. Bohannon, and E. Gudes. Exploiting local similarity for indexing paths in graph-
structured data. pages 129–140. ICDE, 2002.
[43] Y. Ke, J. Cheng, and W. Ng. Correlation search in graph databases. pages 390–399. KDD, 2007.
[44] J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. The web as a graph: Measure-
ments, models, and methods. pages 1–17. COCOON, 1999.
[45] M. Koyuturk, A. Grama, and W. Szpankowski. An efficient algorithm for detecting frequent subgraphs in
biological networks. pages 200–207. ISMB, 2004.
[46] M. Koyuturk, A. Grama, and W. Szpankowski. An efficient algorithm for detecting frequent subgraphs in biological networks. pages 200–207. ISMB/ECCB (Supplement of Bioinformatics), 2004.
[47] M. Kuramochi and G. Karypis. Frequent subgraph discovery. pages 313–320. ICDM, 2001.
[48] M. Kuramochi and G. Karypis. Discovering frequent geometric subgraphs. pages 258–265. ICDM, 2002.
[49] M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph. volume 11, pages 243–271.
Data Min. Knowl. Discov., 2005.
[50] S. Nijssen and J. Kok. The gaston tool for frequent subgraph mining. International Workshop on Graph-
Based Tools, October 2, 2004.
[51] J. S. Park, M.-S. Chen, and P. S. Yu. An effective hash based algorithm for mining association rules. pages
175–186. SIGMOD, 1995.
[52] J. S. Park, M.-S. Chen, and P. S. Yu. Efficient parallel and data mining for association rules. pages 31–36.
CIKM, 1995.
[53] J. Pei, G. Dong, W. Zou, and J. Han. On computing condensed frequent pattern bases. pages 378–385.
ICDM, 2002.
[54] J. Pei, J. Han, and R. Mao. Closet: An efficient algorithm for mining frequent closed itemsets. pages 21–30.
ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2000.
[55] G. Ramesh, W. Maniatty, and M. J. Zaki. Feasible itemset distributions in data mining: theory and applica-
tion. pages 284–295. PODS, 2003.
[56] A. Savasere, E. Omiecinski, and S. B. Navathe. An efficient algorithm for mining association rules in large
databases. pages 432–444. VLDB, 1995.
[57] S. Srinivasa and L. BalaSundaraRaman. A filtration based technique for mining maximal common sub-
graphs. Technical Report 0303, International Institute of Information Technology, Bangalore, 2003.
[58] L. Thomas, S. R. Valluri, and K. Karlapalem. Margin: Maximal frequent subgraph mining. pages 1097–
1101. ICDM, 2006.
[59] H. Tong, C. Faloutsos, B. Gallagher, and T. Eliassi-Rad. Fast best-effort pattern matching in large attributed
graphs. pages 737–746. KDD, 2007.
[60] N. Vanetik, E. Gudes, and S. E. Shimony. Computing frequent graph patterns from semistructured data.
pages 458–465. ICDM, 2002.
[61] J. Wang, W. Hsu, M.-L. Lee, and C. Sheng. A partition-based approach to graph mining. page 74. ICDE,
2006.
[62] J. Wang, Z. Zeng, and L. Zhou. Clan: An algorithm for mining closed cliques from large dense graph
databases. page 73. ICDE, 2006.
[63] T. Washio and H. Motoda. State of the art of graph-based data mining. volume 5, pages 59–68. SIGKDD
Explorations, 2003.
[64] D. W. Williams, J. Huan, and W. Wang. Graph database indexing using structured graph decomposition.
pages 976–985. ICDE, 2007.
[65] X. Yan and J. Han. gspan: Graph-based substructure pattern mining. pages 721–724. ICDM, 2002.
[66] X. Yan and J. Han. Closegraph: mining closed frequent graph patterns. pages 286–295. KDD, 2003.
[67] X. Yan, P. S. Yu, and J. Han. Graph indexing: A frequent structure-based approach. pages 335–346.
SIGMOD, 2004.
[68] G. Yang. The complexity of mining maximal frequent itemsets and maximal frequent patterns. pages
344–353. KDD, 2004.
[69] K. Yoshida, H. Motoda, and N. Indurkhya. Graph-based induction as a unified learning framework. vol-
ume 4, pages 297–328. J. of Applied Intel., 1994.
[70] M. J. Zaki. Scalable algorithms for association mining. volume 12, pages 372–390. IEEE Trans. Knowl.
Data Eng, 2000.
[71] M. J. Zaki. Efficiently mining frequent trees in a forest. pages 71–80. KDD, 2002.
[72] M. J. Zaki and C.-J. Hsiao. Charm: An efficient algorithm for closed itemset mining. SDM, 2002.
[73] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for discovery of association rules.
volume 1, pages 343–373. Data Min. Knowl. Discov, 1997.
[74] Z. Zeng, J. Wang, L. Zhou, and G. Karypis. Coherent closed quasi-clique discovery from large dense graph databases. pages 797–802. KDD, 2006.
[75] S. Zhang, M. Hu, and J. Yang. Treepi: A new graph indexing method. pages 966–975. ICDE, 2007.
[76] Z. Zou, H. Gao, and J. Li.
[77] Z. Zou, J. Li, H. Gao, and S. Zhang. Mining frequent subgraph patterns from uncertain graph data. vol-
ume 22, pages 1203–1218. IEEE Trans. Knowl. Data Eng, 2010.