
Parallel Frequent Tree Mining

CSE 721: Introduction to Parallel Computing

Project Report

Qian Zhu

Shirish Tatikonda

Introduction

Recent research in data mining has progressed from mining frequent itemsets to more general, structured patterns such as trees and graphs. Itemset mining can be thought of as a special case of graph mining. Data is often stored as graphs in applications such as the World Wide Web, telecommunications, social networks, and bioinformatics, and the graphs in these areas are generally massive. Mining for special structures such as graphs and trees is referred to as pattern mining. However, general graphs have undesirable theoretical properties with regard to algorithmic complexity. No efficient algorithm is currently known for determining whether one graph is isomorphic to a subgraph of another. In fact, the Graph Isomorphism problem is in NP, and the Subgraph Isomorphism problem is NP-complete. Furthermore, no efficient algorithm is known for systematically enumerating the subgraphs of a given graph, which is a common step in data mining algorithms. One can therefore expect general graphs to pose serious efficiency problems.

Fortunately, many practical databases do not consist of graphs that require exponential computation. Many applications deal with simpler structures in which the number of cycles is limited, or in which the graphs are acyclic. The latter case is especially interesting, e.g., when the graphs are trees, because many very efficient algorithms are known for this class of graphs. In this project, we consider one particular pattern, the tree structure. A tree is a minimally connected and maximally acyclic graph: removing any edge disconnects it, and adding any edge creates a cycle. Specifically, we address the problem of mining frequent induced subtrees.

Mining frequent subtrees from databases is a budding research area with practical applications in computer networks, Web mining, bioinformatics, XML document mining, and other fields. Because trees are more complex than traditional transactions consisting of items, mining frequent subtrees is a more challenging task than mining frequent itemsets. Most existing tree mining algorithms borrow techniques from the traditional and more mature area of itemset mining.

Motivation

In areas such as bioinformatics, Web mining, and chemical compound analysis, data is often represented in semi-structured form. In its most generic form, structured data can be represented using graphs. Applications such as XML deal with restricted structures such as trees. Mining frequent trees from a given dataset has several interesting applications. For example, it has been used in developing efficient algorithms for network multicast routing. A few researchers have applied frequent tree mining algorithms to Internet movie descriptions and obtained the common structures present in the movie documentation. Another interesting application is the classification of XML documents using frequent subtree structures.

Recently, there has been growing interest in mining databases of labeled trees, partly due to the increasing popularity of XML in databases. In [1], Zaki presented an algorithm, TreeMiner, to discover all frequent embedded subtrees, i.e., those subtrees that preserve ancestor-descendant relationships, in a forest or a database of rooted ordered trees. The algorithm was extended further in [2] to build a structural classifier for XML data. In [3], Asai et al. presented an algorithm, FREQT, to discover frequent rooted ordered subtrees. For mining rooted unordered subtrees, Asai et al. in [3] and Chi et al. in [4] both proposed algorithms based on enumeration tree growing. Because multiple ordered trees can correspond to the same unordered tree, similar canonical forms for rooted unordered trees are defined in both studies. In [5], Chi et al. studied the problem of indexing and mining free trees and developed an Apriori-like algorithm, FreeTreeMiner, to mine all frequent free subtrees. Another very efficient algorithm for mining free trees is Gaston, proposed by Nijssen et al. [6]. It is important to note that comparing these algorithms is not straightforward, because each mines a different kind of structure. For example, TreeMiner [1] mines rooted induced and embedded subtrees, whereas Gaston [6] mines free trees.

Objective

In this project, we focus on the problem of mining frequent induced subtrees in a large database of rooted, labeled trees. We first propose a serial version of the mining algorithm. We then exploit the scope for parallelism in the serial algorithm to derive a parallel version. To the best of our knowledge, no parallel tree mining algorithm exists. Furthermore, we compare the proposed serial algorithm against the TreeMiner algorithm. We make this comparison even though, as noted in the previous section, comparing these two algorithms is neither entirely fair nor straightforward. Finally, we show the performance differences between the parallel and serial versions of our mining algorithm.

Generic Approach

Let $D$ denote a database in which each transaction $s \in D$ is a labeled rooted unordered tree. For a given pattern $t$, which is also a rooted unordered tree, we say that $t$ occurs in a transaction $s$ if there exists at least one subtree of $s$ that is isomorphic to $t$. The occurrence $\delta_t(s)$ of $t$ in $s$ is the number of distinct subtrees of $s$ that are isomorphic to $t$. Let $\sigma_t(s) = 1$ if $\delta_t(s) > 0$, and $\sigma_t(s) = 0$ otherwise. We say that $s$ supports pattern $t$ if $\sigma_t(s) = 1$, and we define the support of a pattern $t$ as $\mathrm{supp}(t) = \sum_{s \in D} \sigma_t(s)$. A pattern $t$ is called frequent if its support is greater than or equal to a user-specified minimum support (minsup). The frequent subtree mining problem is to find all frequent subtrees in a given database. One useful property of frequent trees is the Apriori property, given below:

Apriori Property: Any subtree of a frequent tree is also frequent and any super tree of an infrequent tree is also infrequent.
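The definitions of occurrence and support can be made concrete with a small sketch. To stay short, it assumes that the pattern is a single edge (a parent label and a child label), so that counting occurrences needs no general subtree isomorphism test. The nested `(label, children)` tree encoding, the function names, and the three-transaction database are all hypothetical and are not taken from our implementation.

```python
# Illustrative sketch only: patterns are restricted to a single edge
# (parent_label -> child_label). A tree is a (label, [children]) pair.

def occurrences(tree, parent_label, child_label):
    """delta_t(s): number of edges in `tree` that match the one-edge pattern."""
    label, children = tree
    here = sum(1 for child in children
               if label == parent_label and child[0] == child_label)
    return here + sum(occurrences(child, parent_label, child_label)
                      for child in children)


def support(database, parent_label, child_label):
    """supp(t): number of transactions with at least one occurrence."""
    return sum(1 for tree in database
               if occurrences(tree, parent_label, child_label) > 0)


# Hypothetical database of three transactions.
database = [
    ("A", [("B", []), ("B", [])]),   # the edge A->B occurs twice
    ("A", [("C", [("B", [])])]),     # B is a grandchild of A: no A->B edge
    ("B", [("A", [("B", [])])]),     # the edge A->B occurs once
]

minsup = 2
deltas = [occurrences(tree, "A", "B") for tree in database]
is_frequent = support(database, "A", "B") >= minsup
```

The first transaction contains the pattern twice but still contributes only 1 to the support, which is the difference between occurrence and support.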

The Apriori property reduces the search space drastically, and most existing algorithms use it to prune the search space. In general, tree mining algorithms start with a seed pattern. At every step, the pattern is extended by one edge to create a set of candidate patterns. The dataset is then scanned to find the frequent patterns among the candidates. These frequent patterns are in turn extended by one edge to create a new set of candidate patterns. Algorithms differ in the way they extend and enumerate candidate patterns. For example, TreeMiner generates candidate patterns using a vertical layout of equivalence classes, whereas Gaston employs a pattern growth approach similar to gSpan [8].

Our Approach

In our method, we represent each tree as a Prufer sequence rather than with a traditional tree or graph representation. Prufer sequences provide a bijection between labeled trees with $n$ nodes and sequences of length $n-2$. They were first proposed by Heinz Prufer in 1918 for counting the number of labeled free trees with $n$ nodes. A simple algorithm to construct a Prufer sequence from a tree with vertices $\{1, 2, \ldots, n\}$ is as follows. We start with an empty sequence. At each step, we remove the leaf with the smallest label and append its parent to the sequence constructed so far. This process is repeated until only two vertices remain, i.e., for $n-2$ iterations. The resulting sequence has the form $(p_1, p_2, \ldots, p_{n-2})$, where $p_i$ is the parent of the leaf removed at step $i$. [7] extended this algorithm to obtain a sequence of length $n-1$ by continuing the process until only one vertex remains.

Since the length of a Prufer sequence is linear in the tree size, the storage complexity of our approach is linear in the database size. Because a label can occur multiple times in a database tree, the Prufer sequence described above is not sufficient to represent a labeled tree uniquely. To represent a labeled database tree uniquely, we store the following sequences: the Labeled Prufer Sequence (LPS), the Numbered Prufer Sequence (NPS), and the tree's Label Sequence (LS). Although the LPS can be constructed from the NPS and the LS, we distinguish it for the purpose of exposition; any labeled tree can be uniquely represented using the LS and the NPS alone. It is worth mentioning that this construction follows the post-order (PO) traversal: when nodes are numbered in post-order, the smallest-numbered remaining node is always a leaf, so the ith entry of the NPS is the parent of the node with post-order number i. An example database tree and its Prufer sequence representation are shown below.
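The construction described above can be sketched as follows. This is a minimal illustration, not our C++ implementation: the nested `(label, children)` encoding and the function names are assumptions made for the example, and the root's missing parent is simply left out of the NPS and LPS rather than written as "-". The tree literal encodes the example database tree.

```python
import heapq

def number_postorder(tree):
    """Return (labels, parent), both indexed by post-order number (index 0 unused).
    parent[root] is 0 because the root has no parent."""
    labels, parent = [None], [None]

    def visit(node):
        label, children = node
        child_ids = [visit(child) for child in children]
        labels.append(label)
        parent.append(0)
        me = len(labels) - 1
        for c in child_ids:
            parent[c] = me
        return me

    visit(tree)
    return labels, parent


def prufer_by_leaf_removal(parent):
    """Remove the smallest-numbered leaf and record its parent, until a single
    vertex remains (the length n-1 variant)."""
    n = len(parent) - 1
    child_count = [0] * (n + 1)
    for v in range(1, n + 1):
        if parent[v]:
            child_count[parent[v]] += 1
    leaves = [v for v in range(1, n + 1) if child_count[v] == 0]
    heapq.heapify(leaves)
    sequence = []
    while len(sequence) < n - 1:
        v = heapq.heappop(leaves)
        p = parent[v]
        sequence.append(p)
        child_count[p] -= 1
        if child_count[p] == 0:      # p has lost all children: it is now a leaf
            heapq.heappush(leaves, p)
    return sequence


# The example database tree, written as (label, [children]).
example = ("B", [
    ("C", [("A", [("B", []), ("D", [])])]),
    ("A", [("E", []), ("D", [("B", [("C", [])])])]),
])

labels, parent = number_postorder(example)
LS = labels[1:]                        # label of node i, for i = 1..n
NPS = prufer_by_leaf_removal(parent)   # parent number of node i, for i = 1..n-1
LPS = [labels[p] for p in NPS]         # parent label of node i, for i = 1..n-1
```

With post-order numbering, the leaf-removal loop removes the nodes in the order 1, 2, ..., n-1, so `NPS` coincides with the parent array and could also be read off directly from `parent`.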

We define the Left Most Path (LMP) as the path from the root to the leftmost leaf node. In the example tree, the LMP expressed in post-order numbers (PON) is 10-4-3-1. Since the Prufer sequence is based on the PO traversal, adding an edge on the LMP corresponds to extending the Prufer sequence on its left-hand side. For example, consider the tree without the edge B-A (1-3). In that tree, the LMP is 10-4-3-2. Adding the edge B-A on the LMP then corresponds to prefixing the Prufer sequence with the information for the new edge. The changes in the tree and in the Prufer sequence are shown in red.
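The relation between the LMP and prefixing can be sketched as below. The sketch makes two assumptions that differ from the presentation in the text: the root's NPS entry is stored as 0 instead of "-", and the tree is renumbered after each extension (the new leaf becomes node 1 and every other node shifts by one), whereas the example in the text keeps the original numbers. The function names are hypothetical.

```python
def leftmost_path(nps):
    """Root-to-leftmost-leaf path. nps[i-1] is the parent of node i (0 for the root);
    the leftmost leaf always has post-order number 1."""
    path = [1]
    while nps[path[-1] - 1] != 0:
        path.append(nps[path[-1] - 1])
    return path[::-1]


def extend_on_lmp(ls, nps, attach_to, new_label):
    """Attach a new leaf as the leftmost child of `attach_to`, a node on the LMP.
    The new leaf gets post-order number 1, so both sequences are prefixed."""
    if attach_to not in leftmost_path(nps):
        raise ValueError("attach_to must lie on the Left Most Path")
    new_ls = [new_label] + ls
    new_nps = [attach_to + 1] + [p + 1 if p else 0 for p in nps]
    return new_ls, new_nps


# The example tree without the leaf B (node 1), renumbered 1..9.
ls_before = list("DACECBDAB")
nps_before = [2, 3, 9, 8, 6, 7, 8, 9, 0]

lmp_before = leftmost_path(nps_before)    # 10-4-3-2 in the original numbering
ls_after, nps_after = extend_on_lmp(ls_before, nps_before, attach_to=2, new_label="B")
lmp_after = leftmost_path(nps_after)
```

After the extension, `ls_after` and `nps_after` are the LS and NPS of the full example tree, and `lmp_after` is 10-4-3-1.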

Figure: (a) Database tree with labels and post-order numbers. Each node is written as "label, post-order number", and indentation denotes a parent-child edge.

B, 10
    C, 4
        A, 3
            B, 1
            D, 2
    A, 9
        E, 5
        D, 8
            B, 7
                C, 6

(b) Prufer sequence representation of the database tree.

PON: 1 2 3 4 5 6 7 8 9 10
LS:  B D A C E C B D A B
NPS: 3 3 4 10 9 7 8 9 10 -
LPS: A A C B A B D A B -

Our pattern growth mechanism relies on growth from the Left Most Path. Like any pattern growth algorithm, we start with individual nodes as seed patterns. At each step, we extend the pattern by adding one edge on the LMP. It can be proved that this mechanism generates every possible subtree with the given labels. Such a traversal of the search space is shown below. For simplicity, we assume that the set of possible labels is {A, B, C}. The node or edge with which the pattern is extended is shown in red.
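To give a feel for the size of this search space, the following sketch enumerates all patterns obtained by repeated LMP extension over the label set {A, B, C}. It is an illustration under stated assumptions: patterns are stored as (LS, NPS) tuples with 0 as the root's parent, the tree is renumbered after each extension, and patterns are treated as ordered trees, so unordered duplicates are not merged by a canonical form.

```python
LABELS = ("A", "B", "C")

def lmp_nodes(nps):
    """Nodes on the Left Most Path, from the leftmost leaf (node 1) up to the root."""
    path = [1]
    while nps[path[-1] - 1] != 0:
        path.append(nps[path[-1] - 1])
    return path


def grow(pattern, labels=LABELS):
    """Yield every one-edge extension of `pattern` on its LMP."""
    ls, nps = pattern
    shifted = tuple(p + 1 if p else 0 for p in nps)   # old nodes move up by one
    for node in lmp_nodes(nps):
        for label in labels:
            yield ((label,) + ls, (node + 1,) + shifted)


def enumerate_levels(max_nodes, labels=LABELS):
    """Level k holds all generated patterns with k nodes."""
    levels = [[((label,), (0,)) for label in labels]]  # single-node seed patterns
    for _ in range(max_nodes - 1):
        levels.append([child for pat in levels[-1] for child in grow(pat, labels)])
    return levels


levels = enumerate_levels(4)
sizes = [len(level) for level in levels]
```

With three labels, the number of candidates per level grows as 3, 9, 54, 405 for patterns of one to four nodes, and each ordered pattern is generated exactly once.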

This search space is clearly exponential, so traversing it exhaustively can be computationally very expensive. We therefore adopt an embedding-based growth approach, i.e., we grow edges based on the embeddings of the pattern in the database trees. With this method, only edges that are actually present in the database are considered as candidates. Because of space constraints, we do not discuss this in detail. Once the candidate patterns are generated, the support of each candidate is calculated by scanning the dataset. If the calculated support is at least minsup, the pattern is flagged as frequent and output. The frequent subtrees at one level serve as seed patterns for the candidates at the next level.
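Since the embedding-based growth is not described in detail here, the following is only a rough sketch of the idea for the first level, where single-node seeds are extended to one-edge patterns. It assumes a nested `(label, children)` tree encoding and a hypothetical three-tree database; candidates are collected from edges that actually occur, together with the ids of the trees that contain them, so the support falls out of the same scan.

```python
from collections import defaultdict

def edge_candidates(database):
    """Map each (parent_label, child_label) edge present in the database
    to the set of ids of the trees that contain it."""
    containing = defaultdict(set)

    def walk(tree_id, node):
        label, children = node
        for child in children:
            containing[(label, child[0])].add(tree_id)
            walk(tree_id, child)

    for tree_id, tree in enumerate(database):
        walk(tree_id, tree)
    return containing


def frequent_edges(database, minsup):
    """Keep the candidates whose support is at least minsup."""
    return {edge: len(ids)
            for edge, ids in edge_candidates(database).items()
            if len(ids) >= minsup}


# Hypothetical database over the labels {A, B, C}.
database = [
    ("A", [("B", []), ("C", [])]),
    ("A", [("B", [("C", [])])]),
    ("B", [("C", [])]),
]

candidates = edge_candidates(database)
frequent = frequent_edges(database, minsup=2)
```

Blind enumeration over three labels would test nine one-edge patterns, whereas only the three edges that occur in this database become candidates.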

In general, a database consists of a large number of trees, so processing the candidate patterns serially can be quite expensive. Equivalence classes offer the most natural place for parallelism, because two different equivalence classes can be processed in parallel. Furthermore, counting the support of a pattern involves scanning all the database trees, so the support counting step can also be parallelized easily. In this report, we present the results of parallelizing across equivalence classes only. One can use OpenMP or any other thread-based method to parallelize across equivalence classes; we implemented our parallel algorithm using POSIX threads. We adopt a worker-based method of partitioning the work across threads. We first add all the tasks (one per equivalence class to be mined) to a job queue, from which each idle worker thread picks up a job and mines the corresponding equivalence class. We chose this approach because it offers better load balancing across threads: we do not know in advance which equivalence classes are more expensive to mine than others.
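The job-queue scheme can be sketched as follows. Our implementation uses C++ and POSIX threads; this Python version only illustrates the scheduling pattern (Python threads do not speed up CPU-bound work), and `mine_equivalence_class` is a hypothetical stand-in that merely counts the trees containing the seed label.

```python
import queue
import threading

def mine_equivalence_class(seed_label, database):
    """Hypothetical stand-in for mining one equivalence class: counts the
    transactions (given as label sequences) that contain the seed label."""
    return sum(1 for label_sequence in database if seed_label in label_sequence)


def run_workers(seeds, database, num_workers):
    """Put one job per equivalence class on a queue; idle workers pull jobs."""
    jobs = queue.Queue()
    for seed in seeds:
        jobs.put(seed)

    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                seed = jobs.get_nowait()   # take the next unclaimed job
            except queue.Empty:
                return                     # no work left: the worker exits
            outcome = mine_equivalence_class(seed, database)
            with results_lock:
                results[seed] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Hypothetical database given as label sequences (LS), one per tree.
database = ["BDACECBDAB", "ABC", "CCB", "AE"]
results = run_workers(seeds=["A", "B", "C", "D", "E"], database=database, num_workers=4)
```

Because a worker takes a new job only when it finishes the previous one, an expensive equivalence class occupies one worker while the others drain the rest of the queue.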

In the next section, we present the experimental evaluation of our serial and parallel algorithms.

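[Figure: Traversal of the search space for the label set {A, B, C}, starting from the single-node patterns A, B, and C and growing one edge at a time; one group of patterns is marked as an equivalence class.]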

Experiments and Results

We conducted all experiments on an Altix shared-memory system. Our algorithms are implemented in C++ with POSIX threads. We ran experiments on datasets of varying size (500K, 1M, 2M, 3M, and 4M trees) and at varying support levels (50, 100, 200, 300, 400, and 500). We also varied the number of processors from 1 to 8. Due to space constraints, we present only part of our results.

[Figure: Scalability with respect to dataset size (support = 200). x-axis: # of trees (500K, 1M, 2M, 3M, 4M); y-axis: execution time (sec); series: serial, 2 processors, 4 processors, 8 processors.]

[Figure: Scalability with respect to # of processors. x-axis: # of processors (1 to 8); y-axis: execution time (sec).]

The first graph shows that our approach scales with the dataset size (in terms of the number of trees): the execution time grows more slowly than the dataset size. The second graph shows that the execution time actually increases as we increase the number of processors. This might be because the heap is shared across threads, so that simultaneous calls to malloc() from different threads are serialized, which increases the execution time.

We also analyzed the memory footprint of the serial and parallel versions of the algorithm. The serial version took approximately 8KB (dataset size = 2M, support = 200), whereas the parallel version took around 1300MB. We observed similar memory usage (1300MB) even when the number of processors was 1. The pthreads library allocates a certain amount of memory for each thread at creation time, which might be the reason for the large difference in memory footprint. We are still in the process of analyzing this trend.

[Figure: Comparison of execution time of the serial and parallel algorithms (# of trees = 2M). x-axis: support (50, 100, 200, 300, 400, 500); y-axis: execution time (sec); series: serial, 2 processors, 4 processors, 8 processors.]

From the graph above, we observe the general trend that the execution time decreases as the minimum support increases. This is due to the Apriori-style pruning of the search space: when the minimum support is high, large sections of the search space are pruned in the initial stages.

[Figure: Comparison with TreeMiner. x-axis: # of trees (500K, 1M, 2M, 3M, 4M); y-axis: execution time (sec); series: TreeMiner, our approach (serial).]

The graph above compares the execution times of our approach and TreeMiner. Our approach outperforms TreeMiner by a large margin at every dataset size. It should be noted, however, that this comparison is not entirely fair, because TreeMiner mines both induced and embedded subtrees whereas our approach mines only induced subtrees. We are currently investigating ways to incorporate embedded subtree mining into our approach. Because the difference is large, we expect our approach to remain faster than TreeMiner even then.

Conclusions and Future Work

In this project, we designed and developed a novel algorithm to mine induced subtrees from a database of rooted, labeled trees, and we presented strategies to parallelize it. In our experiments, the serial version outperformed TreeMiner, a state-of-the-art algorithm.

We want to evaluate our approach more closely to determine the reasons for the unexpected trends shown by the parallel version. We also plan to parallelize the available serial version of TreeMiner and compare it against our parallel version. Furthermore, we want to compare the performance of Gaston with that of our algorithms.

References

[1] M. J. Zaki. Efficiently mining frequent trees in a forest. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2002.

[2] M. J. Zaki and C. C. Aggarwal. XRules: An effective structural classifier for XML data. In Proc. of the 2003 Int. Conf. on Knowledge Discovery and Data Mining, 2003.

[3] T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa. Efficient substructure discovery from large semi-structured data. In Proc. of the 2nd SIAM Int. Conf. on Data Mining, 2002.

[4] Y. Chi, Y. Yang, and R. R. Muntz. Mining frequent rooted trees and free trees using canonical forms. Technical Report CSD-TR No. 030043, UCLA, 2003. ftp://ftp.cs.ucla.edu/tech-report/2003-reports/030043.pdf

[5] Y. Chi, Y. Yang, and R. R. Muntz. Indexing and mining free trees. In Proc. of the 2003 IEEE Int. Conf. on Data Mining, 2003.

[6] S. Nijssen and J. N. Kok. A quickstart in frequent structure mining can make a difference. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2004.

[7] P. Rao and B. Moon. PRIX: Indexing and querying XML using Prufer sequences. In Proceedings of the International Conference on Data Engineering, 2004.

[8] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In Proc. 2002 Int. Conf. on Data Mining (ICDM'02), Maebashi, Japan, December 2002.
