Extracting frequent connected subgraphs from large graph sets

J. Comput. Sci. & Technol., Nov. 2004, Vol.19, No.6, pp.867-875

Extracting Frequent Connected Subgraphs from Large Graph Sets

Wei Wang, Qing-Qing Yuan, Hao-Feng Zhou, Ming-Sheng Hong, and Bai-Le Shi

Department of Computing and Information Technology, Fudan University, Shanghai 200433, P.R. China

E-mail: weiwang1@fudan.edu.cn

Received May 23, 2003; revised April 28, 2004.

Abstract Mining frequent patterns from datasets is one of the key successes of data mining research. Currently, most studies focus on data sets in which the elements are independent, such as the items in a market basket. However, objects in the real world often have close relationships with each other. How to extract frequent patterns from these relations is the objective of this paper. The authors use graphs to model the relations, and select a simple type for analysis. Combining graph theory with algorithms for generating frequent patterns, a new algorithm called Topology, which can mine these graphs efficiently, is proposed. The performance of the algorithm is evaluated by experiments with synthetic datasets and real data. The experimental results show that Topology can do the job well. At the end of this paper, potential improvements are mentioned.

Keywords data mining, frequent pattern, graph

1 Introduction

Mining association rules[1] is an important task in data mining. After nearly 10 years of development, it has made great progress and expanded into many other fields, such as Web mining, bioinformatics, market basket analysis and so on. It is playing an increasingly important role.

Frequent pattern extraction is an essential phase in association rule mining, and most studies have concentrated on this pattern generation phase. From Apriori[2], DHP[3] and DIC[4] to FP-Growth[5], the efficiency of the algorithms has been improved step by step. However, most of them only process single items, while there are many relations among objects in the real world, which form a net. This means that dealing with the items in isolation is not suitable. Meanwhile, the emergence of more and more complicated applications urges us to pay more attention to frequent pattern mining in graphs. A case in point is the analysis of the structural information of chemical molecules, where there are bonds between atoms. It is an interesting case and will help us to reveal the potential relations in the structural information.

Because these data of structural information can be expressed as graphs, we try to begin our work with the general labeled graph, which leads to the method of structural information analysis.

This work expands the analysis of single-item data in the previous algorithms to that of the item-pair scope.

We introduce some terms and concepts in Section 2. In Section 3, some improvements are discussed, and they are implemented in the algorithm Topology described in Section 4. In Section 5, the performance of Topology is shown via some experiments, and in Section 6, the new algorithm is applied to real-world cases. Finally, we review some related work in Section 7, and draw conclusions in Section 8.

2 Terms and Concepts

In this paper, we focus on general labeled graph (GLG in short).

Definition 1. Given a simple graph $G(V, E)$ with $|V| = m$, and the label set $L = \{l_i \mid 1 \le i \le n,\ \forall i \ne j,\ l_i \ne l_j\}$, where $l_i$ is a label. $G$ is a GLG iff there exists a map $f: V \to L$ such that $\forall v_i \in V$, $f(v_i) = l_k$, $l_k \in L$, and $l_i \ne l_j$, $f(v') = l_i$, $f(v'') = l_j \Rightarrow v' \ne v''$. We denote $f(v_i) = l_k$ as $l(v_i)$ and the GLG as $G(V, E, L(V))$.

All vertexes of GLG are labeled in the label set space L, and each vertex can only have one label. However each label in L can be assigned to multiple vertexes, i.e., there may be such cases that two or more vertexes share the same label in one graph.
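As a minimal sketch of Definition 1, a GLG can be held as a vertex-to-label map plus a set of undirected edges. The class name `GLG` and its methods are illustrative assumptions, not part of the paper.

```python
from dataclasses import dataclass, field


@dataclass
class GLG:
    """General labeled graph: a simple undirected graph with one label per vertex.

    Hypothetical container used only for illustration.
    """
    labels: dict                               # vertex id -> label
    edges: set = field(default_factory=set)    # set of frozenset({u, v})

    def add_edge(self, u, v):
        if u == v:
            raise ValueError("a simple graph has no self-loops")
        if u not in self.labels or v not in self.labels:
            raise KeyError("both endpoints must be labeled vertices")
        self.edges.add(frozenset((u, v)))

    def has_edge(self, u, v):
        return frozenset((u, v)) in self.edges


# Two vertices share the label "C", which a GLG allows.
g = GLG(labels={0: "C", 1: "C", 2: "O"})
g.add_edge(0, 1)
g.add_edge(1, 2)
```

Because labels are not unique, the pair of endpoint labels alone does not identify an edge: here both edges touch a vertex labeled "C".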

* Correspondence. This work was supported by the National Natural Science Foundation of China (Grant Nos.69933010 and 60303008) and the National High-Technology Development 863 Program of China (Grant No.2002AA4Z3430).


Also, because of this uncertainty, the other elements in the graph, such as edges and subgraphs, cannot be uniquely represented by an unordered label pair $\langle l_i, l_j \rangle$ or by other labeling methods.

In the context of GLG, the concepts of subgraph, support, and frequent subgraph can still be applied.

Definition 2. Given GLGs $G(V, E, L(V))$ and $H(V', E', L'(V'))$, $H$ is a subgraph of $G$ iff 1) $V' \subseteq V$, 2) $\forall v_i, v_j \in V'$, $(v_i, v_j) \in E \Leftrightarrow (v_i, v_j) \in E'$, and 3) $l(v_i) = l'(v_i)$, $l(v_j) = l'(v_j)$.

This definition requires that G must contain a subgraph H in both the structure and the labels.

Definition 3. Given a GLG set $GD = \{G_1, G_2, \ldots, G_n\}$, where $G_i$ is a GLG, the support of $H$ is

$$s(H) = \frac{|P(H)|}{|GD|}, \qquad P(H) = \{G' \mid G' \in GD,\ H \subseteq G'\}, \qquad (1)$$

where $|P(H)|$ is the frequency of $H$. If $s(H) \ge minsup$, where $minsup$ is a pre-defined threshold called minimum support, $H$ is a frequent GLG.
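Definitions 2 and 3 can be illustrated with a brute-force sketch: try every label-preserving injective mapping of the vertices of H into G and require that edges and non-edges agree. The function names and the graph encoding (a label list plus a set of `frozenset` edges over list indices) are assumptions made for this example; the exhaustive search is only meant for very small graphs.

```python
from itertools import permutations


def contains(g_labels, g_edges, h_labels, h_edges):
    """True if H occurs in G as a label-preserving induced subgraph (Definition 2)."""
    k = len(h_labels)
    for image in permutations(range(len(g_labels)), k):
        # Condition 3: labels must agree vertex by vertex.
        if any(g_labels[image[i]] != h_labels[i] for i in range(k)):
            continue
        # Condition 2: an edge exists in G exactly when it exists in H.
        if all(
            (frozenset((image[i], image[j])) in g_edges)
            == (frozenset((i, j)) in h_edges)
            for i in range(k) for j in range(i + 1, k)
        ):
            return True
    return False


def support(h, graph_set):
    """Eq. (1): fraction of graphs in the set that contain H."""
    h_labels, h_edges = h
    hits = sum(contains(gl, ge, h_labels, h_edges) for gl, ge in graph_set)
    return hits / len(graph_set)


def edges(*pairs):
    return {frozenset(p) for p in pairs}


GD = [
    (["C", "C", "O"], edges((0, 1), (1, 2))),
    (["C", "O"], edges((0, 1))),
    (["C", "C"], edges((0, 1))),
]
H = (["C", "O"], edges((0, 1)))        # a C-O edge
H_apart = (["C", "O"], edges())         # C and O present but not adjacent
```

With a minimum support of 0.5, `H` would be frequent in this toy set (it occurs in two of three graphs) while `H_apart` would not.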

In general, we are only concerned with connected graphs, in which a path can be found between any two vertices. So our goal in this paper is, given a GLG set GD and a minimum support minsup, to find all the connected frequent general labeled (sub)graphs in the original GLGs.

In the mining process of GLG, we have to face a very hard problem, i.e., isomorphism.

Definition 4. Given GLGs $G_1(V_1, E_1, L_1(V_1))$ and $G_2(V_2, E_2, L_2(V_2))$, $G_1$ and $G_2$ are isomorphic iff there exists a one-to-one map $f: V_1 \to V_2$ such that $(a, b) \in E_1$ iff $(f(a), f(b)) \in E_2$, and $L_1(a) = L_2(f(a))$, $L_1(b) = L_2(f(b))$. We denote it as $G_1 \cong G_2$.

Until now, we do not know whether the graph isomorphism problem is in P or is NP-complete[6]. We can use brute force methods to check whether two graphs are isomorphic or not, but this is not feasible in practice.

Another concept associated with isomorphism is the canonical form[7]. As we all know, a graph can be represented by an adjacency matrix. For example, suppose the $n \times n$ matrix for graph $G$ is $M = (a_{ij})$, $1 \le i, j \le n$; then the code of $G$ can be expressed as

$$code(G) = a_{12} a_{13} a_{23} a_{14} a_{24} a_{34} \cdots a_{n-1,n}. \qquad (2)$$

The canonical form is a special matrix, which is unique under certain rules; in this paper, it is the one whose code(G) is the greatest among all the possible codes of G. So any two graphs are isomorphic if and only if they have the same canonical form. This technique is equivalent to the isomorphism problem.
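The code of Eq. (2) and the "greatest code" rule can be sketched by exhaustive enumeration of vertex orderings. This is only an illustration: vertex labels are ignored for brevity, the function names are assumptions, and the factorial cost is exactly what the techniques discussed later try to avoid.

```python
from itertools import permutations


def code(adj, order):
    """Eq. (2): upper-triangle entries read column by column under a vertex order."""
    n = len(order)
    return "".join(
        str(adj[order[i]][order[j]]) for j in range(1, n) for i in range(j)
    )


def canonical_code(adj):
    """Greatest code over all vertex orderings (exhaustive, O(n!))."""
    return max(code(adj, p) for p in permutations(range(len(adj))))


# Two adjacency matrices of the same 3-vertex path, plus a triangle.
path_a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
path_b = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
```

The two path matrices have different codes under the identity ordering but share one canonical code, while the triangle gets a different one.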

There have been some studies on graph mining, and most of them are based on the Apriori framework. Currently, there are two types: one processes the vertices, such as AGM[8,9] and AcGM[10], while the other treats the edges as objects, e.g., FSG[11,12].

AcGM is based on AGM. The difference between them is that AcGM focuses on frequent connected subgraphs and applies some techniques to improve the performance. It uses the concept of semi-connected graph, which is defined as a connected graph or a graph consisting of a connected part and a single vertex. Its canonical form is defined as $CODE(X) = l(v_1) l(v_2) \cdots l(v_n)\, code(X)$.

The join step in AcGM is the most important one. Suppose $X_k$ and $Y_k$ are the adjacency matrices of two semi-connected k-graphs (graphs with k vertices); $X_k$ can be joined with $Y_k$ iff the following four conditions are satisfied.

1) $X_k$ and $Y_k$ are identical except the $k$-th row and $k$-th column, i.e.,

$$X_k = \begin{pmatrix} X_{k-1} & x_1 \\ x_1^T & 0 \end{pmatrix}, \qquad Y_k = \begin{pmatrix} X_{k-1} & y_1 \\ y_1^T & 0 \end{pmatrix},$$

and the labels from $v_1$ to $v_{k-1}$ are the same in the two graphs;

2) $X_k$ is a canonical form;

3) $X_k$ represents a semi-connected graph;

4) if the labels of the $k$-th vertex in $X_k$ and $Y_k$ are identical, $CODE(X_k) \ge CODE(Y_k)$. Otherwise, the label of $v_k$ in $X_k$ is greater than that in $Y_k$, or $Y_k$ represents an unconnected graph.

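If $X_k$ and $Y_k$ are joinable, a normal form can be obtained:

$$X_k \otimes Y_k = Z_{k+1} = \begin{pmatrix} X_{k-1} & x_1 & y_1 \\ x_1^T & 0 & z_{k,k+1} \\ y_1^T & z_{k+1,k} & 0 \end{pmatrix},$$

where $X_{k-1}$ is the block shared by $X_k$ and $Y_k$, $x_1$ and $y_1$ are the remaining parts of their $k$-th columns, the labels of $v_1, \ldots, v_k$ in $Z_{k+1}$ are the same as those in $X_k$, and the label of $v_{k+1}$ is the same as that of $v_k$ in $Y_k$.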
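The block structure of the join can be sketched with plain nested lists. The function below only checks condition 1 (shared leading block and labels) and assembles $Z_{k+1}$; conditions 2 to 4 are deliberately left out, and the name `join` and its signature are assumptions for illustration rather than the AcGM implementation.

```python
def join(x, y, x_labels, y_labels, z):
    """Join two k x k adjacency matrices sharing their leading (k-1) x (k-1) block.

    z is the undetermined entry z_{k,k+1} = z_{k+1,k} (0 or 1).
    """
    k = len(x)
    if len(y) != k:
        raise ValueError("both matrices must have the same size")
    same_block = all(
        x[i][j] == y[i][j] for i in range(k - 1) for j in range(k - 1)
    )
    if not same_block or x_labels[:k - 1] != y_labels[:k - 1]:
        raise ValueError("matrices differ outside the k-th row and column")
    # Rows 1..k-1: shared block, then the last column of X, then that of Y.
    rows = [x[i][:k - 1] + [x[i][k - 1], y[i][k - 1]] for i in range(k - 1)]
    rows.append(x[k - 1][:k - 1] + [0, z])   # row k:   x_1^T, 0, z
    rows.append(y[k - 1][:k - 1] + [z, 0])   # row k+1: y_1^T, z, 0
    return rows, x_labels + [y_labels[k - 1]]


X = [[0, 1], [1, 0]]   # edge v1 - v2, labels C, O
Y = [[0, 1], [1, 0]]   # edge v1 - v2, labels C, N
Z, labels = join(X, Y, ["C", "O"], ["C", "N"], z=0)
```

Both choices of `z` give a candidate 3-graph; which of them survive depends on the frequency and connectivity checks that follow the join.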

AcGM also applies other techniques, such as canonical form finding, K k tree, to improve the performance. Please refer to [10] for more details.

FSG is another Apriori-based algorithm. It uses the concept of "core" and TID list to solve the problem.


In [10], the comparison among AGM, AcGM and FSG shows that AcGM is the best one regarding performance.

We also noticed that gSpan[13] was proposed. It encodes the edge set of a graph and obtains an edge sequence similar to that in PrefixSpan, and addresses isomorphism directly in sequence analysis.

3 Improvements of Performance

3.1 Principle of First-Sequence-Later-Structure (FSLS)

In GLG, each vertex in a graph has a label, and each edge can be represented by the labels of the two vertexes it connects. Thus, each frequent subgraph corresponds to a labeled vertex set and a labeled edge set. However, this mapping is not one to one. A vertex or edge label set can map to multiple frequent subgraphs. We call the graphs with the same labeled vertex set or labeled edge set isomer-graphs (IG). The labeled set is called an isomer-expression (IE). The labeled vertex set and the labeled edge set are called VIE and EIE for short, respectively.

As in all the Apriori-based methods, the temporary results generated by a graph mining algorithm form lattices. By utilizing IE, we can merge elements in the lattice to form a new lattice. This kind of abstract IE lattice does not store the detailed structure of the graphs; it only keeps some tagged information, and can reduce the search space efficiently.

If we treat the original graph as a labeled vertex/edge set, the concepts of support and frequency threshold can also be applied to IEs as they are to subgraphs. They are defined as the number or percentage of graphs containing the IE. Similarly, the support of any subset of an IE is equal to or greater than that of the IE itself. Considering the relationship between an IE and a real subgraph, we can draw the following conclusion.

Theorem 1. If an IE is not frequent, then any graph it corresponds to cannot be frequent.

Meanwhile, we examine the EIE from the viewpoint of connectivity. In GLG, because the label of a vertex is not unique, if two edges are connected, they have at least one vertex labeled the same. We can further extend this conclusion to graphs, i.e., if a GLG is connected, there exist some vertex labels in its corresponding EIE which can make all labeled edges connected. The reverse is not true. So, we can use the EIE to facilitate the check of connectivity. If an EIE is not form-connected, then none of the corresponding isomers can be connected.

Both VIE and EIE can be used to reduce the search space, but the former emphasizes frequency and the latter connectivity. After that, we can identify the different structures expressed in IEs, which is more efficient than direct mining. We refer to this principle as First-Sequence-Later-Structure (FSLS). The term "sequence" used here will be explained later, where "sequence" and "expression" will mean the same thing.

3.2 Improving Canonical Form Generation

To improve the performance of canonical form generation, it is necessary to analyze the process of canonical form generation in AcGM.

In AcGM, the canonical form $X_{ck}$ of a k-graph $G(X_k)$ is derived by the row (column) transformation of G's adjacency matrix $X_k$. It assumes that all the (k - 1)-subgraphs have already been obtained, and so have their canonical forms.

From this analysis, it is clear that we do not have to rely on the multiplication of matrices in order to calculate the canonical form of a graph. If a candidate graph survives the candidate generation phase, all its frequent subgraphs and their corresponding canonical forms are already known in theory, and thus we can utilize these canonical forms directly to generate the target canonical form. We can find these canonical forms efficiently if we use the vertex sequence lattice (vertex lattice for short), which stores all the vertex sequences found in the previous sequence finding phase. Moreover, to facilitate searching in the lattice, we correlate each vertex with the elements in the lattice. The process above is in essence a process of sequence matching, and its efficiency is much higher than the multiplication of five matrices.

It should be noted that, as is mentioned in the previous context, this method may not be sufficient to generate the canonical form of Xk. In that case, we have to perform an exhaustive search of the canonical form and its transformation matrix.

3.3 Using TID List

Solving subgraph isomorphism is an NP-complete[7] problem. Therefore, AcGM and FSG both take measures to minimize the cost. While AcGM uses K k tree to reduce the cost of subtree matching, FSG uses TID list instead. In fact, these two methods can be used together, and they also help the pruning of the vertex sequence lattice and the edge sequence lattice, both of which are obtained from the previous sequence finding phase.

First, when the sequence lattice is generated, for each element in it we write down the TIDs that contain the element. Then, when the isomers of the lowest level in the lattice are generated, we can use these TIDs to limit the range of checking subgraph isomorphism. For each isomer, we record the TIDs corresponding to it. For those in the upper levels of the lattice, we can use join result of the TID sets of all its (k - 1)-subgraphs to limit the range.

If none of the isomers of a sequence is frequent, the sequence itself is not frequent. To cut down the size of the sequence lattice, we can then delete all the elements containing this sequence from the lattice, which further reduces the cost of subsequent candidate graph generation. On the other hand, when we calculate the intersection of TID sets, we can use the TID sets of the vertex sequence lattice and the edge sequence lattice. Thus, in order to facilitate the searching, we require that all elements in the vertex sequence lattice and the edge sequence lattice maintain their own TID sets.
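The role of TID lists in limiting subgraph isomorphism checks can be sketched as follows. The expensive containment test is passed in as a callback; the toy matcher used here (set inclusion over edge names) is a stand-in assumption, as are all function names.

```python
from functools import reduce


def candidate_tids(subgraph_tid_lists):
    """Intersect the TID sets of all (k-1)-subgraphs of a candidate."""
    return reduce(set.intersection, (set(t) for t in subgraph_tid_lists))


def count_support(candidate, tid_lists, graph_db, matches):
    """Run the (expensive) containment test only on graphs in the intersection."""
    tids = candidate_tids(tid_lists)
    hits = {tid for tid in tids if matches(candidate, graph_db[tid])}
    return hits, len(hits) / len(graph_db)


calls = []


def toy_matches(candidate, graph):
    """Hypothetical stand-in for a real subgraph isomorphism test."""
    calls.append(1)
    return candidate <= graph


graph_db = {1: {"ab", "bc"}, 2: {"ab"}, 3: {"ab", "bc", "cd"}, 4: {"bc"}}
tid_lists = [[1, 2, 3], [1, 3, 4]]   # TID lists of the two sub-patterns
hits, s = count_support({"ab", "bc"}, tid_lists, graph_db, toy_matches)
```

Only the two graphs in the intersection are examined, although the database holds four; the returned set of hits becomes the TID list of the new pattern.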

As for the K k tree used in AcGM, it is a good idea to use it as well. In particular, when we need to reduce the range of checking subgraph isomorphism, K k tree's advantage is fully exploited.

So, the revision of AcGM algorithm is based on FSLS technique. In the next section, we will include these improvements in the new algorithm Topology.

4 Efficient Subgraph Extraction: Topology

We call the new algorithm Topology. The core procedures described above are the generation of sequence and the process of sequence lattice. We divide Topology into 2 phases, calculating the sequences and distinguishing the structures.

To obtain the sequences, in particular the vertex sequences, we still use PrefixSpan[14]. The sequence of a graph is just used to mark the number of occurrences of the elements (vertexes or edges) in the graph. In this way, we obtain a multiset of vertexes (edges). If we sort the elements in the multiset in alphabetical order, we get a sequence. If we transform all graphs into this sequence format and regard them as the input of PrefixSpan, we can obtain those sequences $E = \langle l_{start} \cdots l_{end} \rangle$, where $l_i$ are the labels. In order to generate the final canonical form conveniently, the sequence should satisfy $\forall\, start \le i < j \le end,\ l_i \ge l_j$.
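The transformation of a graph into its vertex and edge sequences can be sketched as below. The sketch assumes descending label order (so that $l_i \ge l_j$ for $i < j$) and writes each edge as the pair of its endpoint labels; the function names and the graph encoding are illustrative assumptions.

```python
def vertex_sequence(labels):
    """Sorted multiset of vertex labels, in descending order."""
    return sorted(labels.values(), reverse=True)


def edge_sequence(labels, edges):
    """Sorted multiset of edges, each written as an unordered pair of endpoint labels."""
    pairs = [
        tuple(sorted((labels[u], labels[v]), reverse=True)) for u, v in edges
    ]
    return sorted(pairs, reverse=True)


labels = {0: "C", 1: "C", 2: "O", 3: "H"}
edges = [(0, 1), (1, 2), (0, 3)]
v_seq = vertex_sequence(labels)
e_seq = edge_sequence(labels, edges)
```

Graphs with different structures can yield the same two sequences, which is why the sequences are mined first and the structures distinguished afterwards.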


During the generation of an edge sequence, we need to check whether the sequence is form-connected, for example by applying the Warshall algorithm. During the generation of vertex/edge sequences, we use them to construct the vertex and edge sequence lattices.
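One possible reading of the form-connected check is sketched below: all vertexes carrying the same label are merged, and Warshall's transitive closure is computed over the labels that occur in the edge sequence. This interpretation, and the function name `is_form_connected`, are assumptions; the check is only a necessary condition for some isomer to be connected.

```python
def is_form_connected(edge_seq):
    """Warshall closure over labels; True if all labels are mutually reachable."""
    labels = sorted({l for pair in edge_seq for l in pair})
    idx = {l: i for i, l in enumerate(labels)}
    n = len(labels)
    reach = [[i == j for j in range(n)] for i in range(n)]
    for a, b in edge_seq:
        reach[idx[a]][idx[b]] = reach[idx[b]][idx[a]] = True
    for k in range(n):                       # Warshall's algorithm
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return all(all(row) for row in reach)


connected_seq = [("O", "C"), ("H", "C")]     # both edges share the label C
split_seq = [("O", "C"), ("N", "H")]         # no shared label between the edges
```

An edge sequence that fails this test can be discarded before any structure is generated, since none of its isomers can be connected.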

In order to use TID in the later part of the algorithm, when a sequence is generated, all its corresponding TIDs should be recorded. That is to say, every time when the projected database is constructed, the IDs of a projected sequence are unique in the global range.

The first phase of Topology can be described as in Fig.1.

Algorithm 1 (First Phase of Topology)
Input: GD, minsup
Output: vertex sequence lattice L_V, edge sequence lattice L_E
Method:
Transform all the graphs in GD into the form (list_lv, list_le);
// list_lv and list_le are lists of labeled vertexes and edges respectively;
Call GPrefixSpan(⟨⟩, 0, GD, 0);
Call GPrefixSpan(⟨⟩, 0, GD, 1);

GPrefixSpan(α, l, S|α, m) {
    Scan S|α once to find all frequent labeled elements b such that
        (a) b can be assembled to the last element of α to form a sequential pattern; or
        (b) α⟨b⟩ is a sequential pattern;
    record the TID of each transaction containing b to b.tid_list;
    for each frequent labeled element b do {
        α' = αb;
        α'.tid_list = b.tid_list;
        if (m == 1) then α'.connected = IsConnected(α');
        if (m == 0) then add α' to L_V; else add α' to L_E;
        construct the α'-projected database S|α' from S|α by keeping the TID in S|α';
        call GPrefixSpan(α', l + 1, S|α', m);
    }
}

Fig.1. First phase of Topology.

After the sequence lattices are generated, we can do the real mining job of frequent connected subgraphs.

In the join step, we use the sequence lattices as the guide. The basic conditions of the join are the same as those in AcGM. However, before two k-graphs are joined, we first check whether the vertex sequence of the resulting (k + 1)-graph exists in the vertex sequence lattice. The value of $z_{k,k+1}$ can be determined with the edge sequence lattice, by checking whether the edge sequence of the newly generated graph is in it or not. The following steps are the same as those in AcGM.


The calculation of canonical forms has been combined with the filtering conditions of frequent subgraphs in the process of candidate join. When the frequent subgraphs of a graph G are being searched, the canonical form of G can be found in the same way as the winner-select in an elimination game. In some cases, we will have to use exhaustive searching. Meanwhile, the searching process needs to work on the actual (structure) lattice based on vertex sequence lattice.

As for the frequency counting, it also utilizes the vertex structure lattice. First, it intersects the tid_lists of all subgraphs. Next, it performs subgraph matching by using K k tree. Once a failure occurs, we change the searching path in the K k tree by selecting different subtrees. If none of the matchings is successful, we use exhaustive searching to match the subgraph with this K k tree.

Algorithm 2 (Second Phase of Topology)
Input: GD, minsup, vertex sequence lattice L_V, edge sequence lattice L_E
Output: frequent connected graph set F
Method:
F_k is a set of adjacency matrices of k-frequent graphs.
C_k is a set of adjacency matrices of k-candidate graphs.

GAcGM(GD, minsup, L_V, L_E) {
    Scan L_V to generate F_1 and F'_2 with all adjacency matrices of one and two vertices respectively;
    scan L_E to generate F_2 with all adjacency matrices with one edge;
    F'_2 = F_2 ∪ F'_2; k = 2;
    C_{k+1} = Generate-Candidate(F'_k, L_E, L_V);
    while (C_{k+1} ≠ ∅) {
        k++;
        for all c_k ∈ C_k do count c_k by scanning {G | G.tid ∈ c_k.tid_list};
        F'_k = {c_k | c_k ∈ C_k and s(G(c_k)) ≥ minsup};
        F_k = {c_k | c_k ∈ F'_k and G(c_k) is connected};
        revise L_V and L_E according to F_k and F'_k;
        C_{k+1} = Generate-Candidate(F'_k, L_E, L_V);
    }
    return ∪_k {f_k | f_k ∈ F_k and f_k is canonical};
}

Generate-Candidate(F'_k, L_E, L_V) {
    for all pairs f_i, f_j ∈ F'_k that satisfy the join conditions, referring to L_V and L_E, do {
        c_{k+1} = f_i ⊗ f_j; flag = 1; CF = 0; // CF means Canonical Form
        for all k-subgraphs G_c of G(c_{k+1}) do {
            if (s(G_c) < minsup) then flag = 0 and break;
            CF = max(CF, combining CF and CF(G_c));
            c_{k+1}.tid_list = c_{k+1}.tid_list ∩ G_c.tid_list;
        }
        if (flag ≠ 0) then C_{k+1} = C_{k+1} ∪ {c_{k+1} with CF and tid_list};
    }
    return C_{k+1};
}

Fig.2. Second Phase of Topology.

In general, the lattice plays an important role in the implementation. With the progress of the algorithm, the original vertex sequence lattice and edge sequence lattice are replaced by vertex/edge structure lattices.

The second phase of Topology can be described as in Fig.2.

5 Performance Study of Topology

In this section, we evaluate the performance of Topology by doing experiments in comparison with AcGM and gSpan. All experiments are run on a Pentium III 733MHz PC with 384MB RAM. The platform is Windows XP and the algorithm is coded in C++.

5.1 Experimental Data

We wrote a data generator to produce all the data required in the experiments. The generator can be adjusted by 6 parameters: the size of label set S, the possibility of an edge existing between two vertexes p, the number of basic patterns (graphs) L, the average scale of basic patterns (defined as the number of vertexes) I, the size of graph database N, and the average scale of original graphs T.

Given the size of a graph, the label of a vertex can be selected from the label set with an equal possibility. Whether there exists an edge between two vertexes depends on a random variable. If this variable is larger than p, we add an edge there. Generally, we require that all graphs are connected.

The basic patterns (graphs) are generated first. Then, the generation of original graphs is divided into two phases. The first phase is the same as that of basic patterns. In the second phase, we select a pattern P from the basic patterns with an equal possibility, and overlap it on the original data graph randomly. When overlapping, if the vertex number of pattern P is N, we randomly select N vertexes in the original graph, and map them to those in P randomly. Then we replace those original vertexes with those in P, and rebuild the relationship between them according to their relationship in P.

5.2 Performance Evaluation

In [10] the performance of AcGM is compared with that of FSG and AGM, and the result is shown in Fig.3. It shows that the performance of AcGM is much better than that of FSG and AGM. So we evaluated the performance of Topology by comparing it with AcGM and gSpan only. We tested the execution time of the algorithms under different sizes and scales of graph sets, and different possibilities of edge existing, respectively. The default minimum support is minsup = 0.1.

First, we analyze the relationship between the number of labeled graphs and the execution time. We set S = 5, p = 0.7, L = 10, I = 5 and T = 10. N ranges from 50 to 1,000. The final experimental results are shown in Fig.4. In this figure, the execution time of both Topology and AcGM increases with the number of labeled graphs. However, it is quite evident that the slope of Topology is smaller than that of AcGM.

[Plot: x-axis, minimum support (%); curves for AcGM (induced subgraph), AcGM (subgraph), FSG and AGM.]

Fig.3. AcGM, FSG and AGM.

[Plot for Fig.4: x-axis, data size; legend entries recovered: Topology, gSpan.]

The relationship between the scale of labeled graph and the execution time is similar, as shown in Fig.5, where we set S = 5, p = 0.7, L = 10, I = 5, N = 100, and T varies from 8 to 12.

[Plot: x-axis, average graph size.]

Fig.5. T vs. execution time.

We also analyze the relationship between the number of labels and the execution time. With more labels, the number of candidate frequent subgraphs and final frequent patterns will increase. As shown in Fig.6, the execution time of the two algorithms increases with the number of labels. In this experiment we set p = 0.7, L = 10, I = 5, N = 10, and S varies from 4 to 7.

[Plot "Topology vs. AcGM": x-axis, number of labels; curves for AcGM and Topology.]

Fig.6. Number of labels vs. execution time.

Fig.4. N vs. execution time.

In AcGM, each frequent subgraph S generates $2^n$ candidate subgraphs (n is the number of nodes in S), so it costs a lot to test whether these candidate subgraphs are frequent. gSpan uses the DFS code to express the graph, which can be generated efficiently, but like AcGM, it also needs to generate a large number of candidate subgraphs. Therefore, in the case of large data size, the trend will hold. Topology, in contrast, needs to extract the frequent patterns in the vertex and edge sequences first, but as we all know, the mining of frequent patterns on sequences is much simpler than that on graphs.

[Plot: x-axis, graph density; curves for Topology and AcGM.]

Fig.7. p vs. execution time.

The trend is the same when we analyze the relationship between the possibility of edge existing in labeled graph and the execution time. We set S = 5, L = 10, I = 5, N = 100 and T = 10. p ranges from 0.3 to 0.7. The results are shown in Fig.7. It is the same case that Topology can be applied to a wider scale, while AcGM can run normally only when p = 0.7.

Table 1. Performance on Real Data

Data set name                           Disease-cause data      Non-disease-cause data
Algorithm                               AcGM       Topology     AcGM       Topology
ENLARGE data size                       5,620                   1,620
Running time for minsup = 10% (ms)      626,870    427,300      237,430    105,920
Real data size                          162                     136
Running time for minsup = 10% (ms)      48,640     37,540       10,828     9,457

We also tested the contribution of each improvement. Based on the original version of AcGM, we wrote a new version TV1, which applies the vertex lattice to help the isomorphism checking. Next, we used the principle of same-sequence-different-structures to form another version, TV2, based on TV1. Finally, we incorporated the techniques such as TID list and formed the final version, Topology.

We set S = 5, p = 0.7, L = 10, I = 5, N = 100 and T = 10. We examine the contribution of each part of the algorithm by varying minsup from 0.05 to 0.5, again with AcGM as the baseline. Fig.8 shows the results. Clearly, among the three improvements, using the vertex lattice to improve the efficiency of checking isomorphism contributes most. Using the principle of same-sequence-different-structures to further limit the searching space ranks second. Other techniques such as TID list contribute relatively little.

[Plot: x-axis, support (%); curves for AcGM, TV1, TV2 and Topology.]

Fig.8. Factor improvement.

6 Analyzing Molecules

We use the dataset PTE1 of the PTE project (Predictive Toxicology Evaluation) from Cambridge University, UK, to do the analysis. The dataset PTE1 collects 340 compounds in total, among which 24 are atoms. In total, 66 atom types are discovered. There exist many chemical bonds among these atoms. We simplify it and suppose that these bonds are just the connections between atoms without any other meaning. These molecules have been marked by some basic properties such as whether they cause disease, etc.

First we convert PTE1 into the format of graph, where an atom is mapped to a vertex, and a chemical bond to an edge. The 66 atom types correspond to the labels in the label set. We set minsup = 0.1, and start the analysis. Part of the result is shown in Fig.9. The molecule structures displayed on the left of Fig.9 occur frequently in non-disease-causing compounds, while those displayed on the right of Fig.9 occur frequently in disease-causing compounds. So, the next time we analyze a compound whose properties are unknown, if we find that it contains a molecule structure shown in Fig.9, we should pay particular attention to testing whether it is disease-causing or not. Since the size of the data set is small, AcGM and Topology both finish in less than 50 seconds. In Table 1, we show all the results for different real data sizes. It should be noted that the ENLARGE real data are generated from the original real data by applying some time operations.

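[Figure: molecular fragment structures built from C, H, N and O atoms; left, frequent in non-disease-causing compounds; right, frequent in disease-causing compounds.]

Fig.9. Analysis of chemical molecule structures.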

7 Related Work

Extracting frequent subgraphs from a graph dataset is a rather challenging job. Some heuristic based approaches have been introduced in recent years, such as SUBDUE[15], proposed by Cook in 1994, and GBI[16,17], proposed by Yoshida and Motoda in 1995 and revised in 1997. However, neither method can avoid losing some isomorphic patterns by adopting exhaustive searching. Hence, GBI was revised again[18,19].

Another heuristic method is De Raedt's method[20]. Yet another method, which adopts the same exhaustive searching as AGM, AcGM and FSG, is WARMR[21].

From the description above, we can see that the Apriori-based AcGM, FSG and AGM may be inferior to SUBDUE in efficiency, but considering all other factors such as the results, this method based on analyzing frequent patterns has obvious advantages.

Concerning the work done in analyzing molecule structures, besides the techniques mentioned above, there is the way proposed in [22]. It uses SMILES[23] from computational chemistry and the coding technique of SMARTS[24] to analyze HIV data. Nevertheless, the method can only be applied to some special fields, and is not suitable for GLG. Recently, Wang proposed a method that uses 3D graph mining techniques to analyze the patterns in proteins[25].

8 Conclusion

In this paper, we addressed the problem of how to efficiently extract frequent patterns from a graph dataset. We proposed a down-to-earth scheme to revise AcGM, and implemented it as the Topology algorithm. The experimental results show that our efforts paid off. Among the several improvements, the technique of using the vertex lattice to reduce the cost of checking isomorphism improves the performance most. The method of incorporating the principle of same-sequence-different-structures to further limit the searching space ranks second. Other techniques such as TID list contribute relatively little.

References

[1] Agrawal R et al. Mining association rules between sets of items in large databases. In Proe. ACM SIGMOD, Washington D C, USA, 1993, pp.207-216.

[2] AgrawM R et al. Fast algorithms for mining association rules in large databases. In Proc. VLDB, Santiago, Chile, 1994, pp.487-499.

[3] Park J S e t al. An effective hash based algorithm for mining association rules. In Proc. ACM SIGMOD, San Jose, California, USA, 1995, pp.175-186.

[4] Brin S e t al. Dynamic itemset counting and implication rules for market basket data. In Proc. A CM SIGMOD, Tucson, Arizona, USA, 1997, pp.255-264.

[5] Hun J et al. Mining frequent patterns without candidate generation. In Proc. ACM SIGMOD, Dallas, Texas, USA, 2000, pp.1-12.

[6] Read R C et at. The graph isomorphism disease. J. Graph Theory, 1977, 4: 339-303.

[7] Babai b et al. Canonical labeling of graphs. In Proc. ACM STOC. Boston, Massachusetts, USA, 1983, pp.171-183.

[8] Inokuchi A e t al. An apriori-based algorithm for mining frequent substructures from graph data. In Proc. PKDD, LNCS 1910, Springer, Lyon, France, 2000, pp.13-23.

[9] Inokuchi A et al. Applying algebraic mining method of graph substructures to mutagenesis data analysis. In KDD Challenge, PAKDD, Kyoto, Japan, 2000, pp.41-46.

[10] Inokuchi A et al. A fast algorithm for mining frequent connected subgraphs. Research Report RT0448, IBM Research, Tokyo Research Laboratory, 2002.

[11] Kuramochi M et al. Frequent subgraph discovery. In Proc. IEEE ICDM, San Jose, California, USA, 2001, pp.313-320.

[12] Kuramochi M et al. An efficient algorithm for discovering frequent subgraph. Technical Report 02-026, Dept. of Computer Science, University of Minnesota, 2002.

[13] Yan X et al. gSpan: Graph-based substructure pattern mining. In Proc. IEEE ICDM, Maebashi City, Japan, 2002.

[14] Pei J et al. PrefixSpan: Mining sequential patterns by prefix-projected growth. In Proc. ICDE, Dusseldorf, Germany, 2001, pp.215-224.

[15] Cook D J et al. Substructure discovery using minimum description length and background knowledge. J. Artificial Intelligence Research, 1994, 1: 231-255.

[16] Yoshida K et al. CLIP: Concept learning from inference patterns. Artificial Intelligence, 1995, 1: 63-92.

[17] Motoda H et al. Machine learning techniques to make computers easier to use. In Proc. IJCAI, Nagoya, Japan, 1997, 2: 1622-1631.

[18] Matsuda T et al. Extension of graph-based induction for general graph structured data. In Proc. PAKDD, Springer, Kyoto, Japan, 2000, LNCS 1805: 420-431.

[19] Matsuda T et al. Knowledge discovery from structured data by beam-wise graph-based induction. In Proc. PRICAI, Springer, Tokyo, Japan, 2002, LNCS 2417: 255-264.

[20] Raedt L De et al. The levelwise version space algorithm and its application to molecular fragment finding. In Proc. IJCAI, Seattle, Washington, USA, 2001, 2: 853-862.

[21] Dehaspe L et al. Finding frequent substructures in chemical compounds. In Proc. KDD, New York, USA, 1998, pp.30-36.

[22] Kramer S et al. Molecular feature mining in HIV data. In Proc. ACM SIGKDD, San Francisco, USA, 2001, pp.136-143.

[23] Weininger D. SMILES, a chemical language and information system. J. Chemical Information and Computer Sciences, 1988, 1: 31-36.

[24] James C A et al. Daylight Theory Manual--Daylight 4.71.

[25] Wang X et al. Finding patterns in three-dimensional graphs: Algorithms and applications to scientific data mining. IEEE TKDE, 2002, 4: 731-749.

Wei Wang received the B.S. degree in computer science in 1992 from Shandong University and the Ph.D. degree in computer science in 1998 from Fudan University. He is now a professor in the Department of Computing and Information Technology, Fudan University. His research interests include database, data warehouse, and data mining.

Qing-Qing Yuan received the B.S. and M.S. degrees in computer science from Fudan University in 2000 and 2003, respectively. She is now a Ph.D. candidate in the Department of Computer Science, University of California, Santa Barbara. Her research interests include database and data mining.

Hao-Feng Zhou received the B.S. degree in computer science in 1997 from Shanghai University, and the M.S. and Ph.D. degrees in computer science in 2000 and 2003, respectively, from Fudan University. His research interests include database and data mining.

Ming-Sheng Hong received the B.S. degree in computer science in 2002 from Fudan University. He is now a Ph.D. candidate in the Department of Computer Science, Cornell University. His research interests include database and data mining.

Bai-Le Shi received the B.S. degree in mathematics in 1957 from Peking University. He is a professor in the Department of Computing and Information Technology, Fudan University. He is also director of the Shanghai (International) Database Research Center. His research interests include database, data warehouse and digital library.
