

The VLDB Journal (2012) 21:753–777, DOI 10.1007/s00778-012-0268-8

REGULAR PAPER

Mining frequent subgraphs over uncertain graph databasesunder probabilistic semantics

Jianzhong Li · Zhaonian Zou · Hong Gao

Received: 27 May 2011 / Revised: 11 January 2012 / Accepted: 9 February 2012 / Published online: 28 February 2012© Springer-Verlag 2012

Abstract Frequent subgraph mining has been extensively studied on certain graph data. However, uncertainty is intrinsic in graph data in practice, but there is very little work on mining uncertain graph data. This paper focuses on mining frequent subgraphs over uncertain graph data under the probabilistic semantics. Specifically, a measure called ϕ-frequent probability is introduced to evaluate the degree of recurrence of subgraphs. Given a set of uncertain graphs and two real numbers 0 < ϕ, τ < 1, the goal is to quickly find all subgraphs with ϕ-frequent probability at least τ. Due to the NP-hardness of the problem and the #P-hardness of computing the ϕ-frequent probability of a subgraph, an approximate mining algorithm is proposed to produce an (ε, δ)-approximate set Π of "frequent subgraphs", where 0 < ε < τ is an error tolerance, and 0 < δ < 1 is a confidence bound. The algorithm guarantees that (1) any frequent subgraph S is contained in Π with probability at least ((1 − δ)/2)^s, where s is the number of edges in S; (2) any infrequent subgraph with ϕ-frequent probability less than τ − ε is contained in Π with probability at most δ/2. The theoretical analysis shows that to obtain any frequent subgraph with probability at least 1 − Δ, the input parameter δ of the algorithm must be set to at most 1 − 2(1 − Δ)^(1/ℓ_max), where 0 < Δ < 1, and ℓ_max is the maximum number of edges in frequent subgraphs. Extensive experiments on real uncertain graph data verify that the proposed algorithm is practically efficient and has very high approximation quality. Moreover, the difference between the probabilistic semantics and the expected semantics on mining frequent subgraphs over uncertain graph data is discussed in this paper for the first time.

An extended abstract [41] was presented at the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2010.

J. Li · Z. Zou (B) · H. Gao
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
e-mail: [email protected]

J. Li
e-mail: [email protected]

H. Gao
e-mail: [email protected]

Keywords Uncertain graph · Frequent subgraph mining ·Probabilistic semantics · ϕ-frequent probability · #P

1 Introduction

Graphs are general data structures for representing complicated relationships between entities, and they have seen wide applications in bioinformatics, social networks, road networks, and so on. In recent years, an increasingly large amount of data represented by graphs, also known as graph data, has been collected by modern data acquisition methods such as high-throughput biological experiments, online social network systems, and GPS. Massive graph data require efficient and intelligent tools to analyze and understand them. Frequent subgraph mining [20,34] is one of the powerful data mining techniques for exploring the structures of graph data, specifically, its recurring substructures. More formally, the frequent subgraph mining problem can be stated as follows: Given a set D of graphs and 0 < ϕ < 1, find all subgraphs that occur in at least ϕ|D| graphs in D. The proportion of graphs in D that contain a graph S as a subgraph is called the support of S [20]. Here, the occurrence of a graph in another is defined in terms of subgraph isomorphism [32].
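The support computation described above can be sketched in code. The following is a minimal illustration, not the paper's implementation: labeled graphs are held in a hypothetical `Graph` class, and subgraph isomorphism is tested by brute force over vertex injections, which is only feasible for tiny graphs.

```python
from itertools import permutations

class Graph:
    """A labeled undirected graph; a hypothetical minimal representation."""
    def __init__(self, vlabels, elabels):
        self.vlabels = vlabels  # {vertex: label}
        self.elabels = {frozenset(e): l for e, l in elabels.items()}  # {(u, v): label}

def is_subgraph_iso(s, g):
    """Brute-force test of whether s is subgraph isomorphic to g (tiny graphs only)."""
    sv, gv = list(s.vlabels), list(g.vlabels)
    if len(sv) > len(gv):
        return False
    for image in permutations(gv, len(sv)):
        lam = dict(zip(sv, image))  # candidate injection lambda: V(s) -> V(g)
        if any(s.vlabels[v] != g.vlabels[lam[v]] for v in sv):
            continue  # vertex labels must be preserved
        if all(g.elabels.get(frozenset(lam[v] for v in e)) == l
               for e, l in s.elabels.items()):
            return True  # every edge of s maps to an equally labeled edge of g
    return False

def support(s, db):
    """sup(S; D): the proportion of graphs in db that contain s."""
    return sum(is_subgraph_iso(s, g) for g in db) / len(db) if db else 0.0
```

For instance, a single A–B edge labeled x has support 0.5 in a two-graph database where only one graph carries that edge label.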

Recent studies [12,13,26,28,37,42–44] have shown that uncertainty is inherent in graph data, particularly in the structures of graphs, due to the limitations of data acquisition techniques, data incompleteness, data imprecision, noise, late updates of data, and so on. Graphs with uncertainties are called uncertain graphs. In the data uncertainty models used in [12,13,26,28,37,42,44], each edge of an uncertain graph is associated with an uncertainty value in (0, 1] indicating the probability of the edge existing in reality, and the existence of the edges is assumed to be mutually independent. Essentially, an uncertain graph represents a probability distribution over all of the certain graphs in the form of which the uncertain graph may actually exist. Terminologically, each of these certain graphs is called an implicated graph [44]. Uncertainties in graphs pose new challenges to frequent subgraph mining in both semantics and computing.

Zou et al. [44] investigated frequent subgraph mining on uncertain graphs in their recent work. A set of uncertain graphs D = {G1, G2, …, Gn} essentially represents a probability distribution over a family F of sets of certain graphs. Each set of certain graphs D′ = {G′1, G′2, …, G′n} ∈ F satisfies that G′i is an implicated graph of Gi for 1 ≤ i ≤ n. For a certain subgraph S, the degree of recurrence of S in D is measured by the expected value of its supports in all sets of certain graphs in F, called the expected support of S. Thus, the frequent subgraph mining problem on uncertain graph data was defined in [44] as follows: Given a set D of uncertain graphs and 0 < γ < 1, find all certain subgraphs with expected support at least γ in D. The semantics of this problem statement is called the expected semantics.

Motivated by the recent work [3,39] on frequent itemset mining on uncertain transactional data, this paper focuses on frequent subgraph mining on uncertain graph data under semantics different from those adopted in our previous work [44]. Again, a set D of uncertain graphs represents a probability distribution over a family F of sets of certain graphs. However, the degree of recurrence of a certain subgraph S in D is measured by the probability that S has support at least ϕ across all sets of certain graphs in F, where 0 < ϕ < 1. This probability is called the ϕ-frequent probability of S in D. Therefore, the problem to be solved in this paper can be defined as follows: Given a set D of uncertain graphs and 0 < ϕ, τ < 1, find all certain subgraphs with ϕ-frequent probability at least τ in D. The semantics of this problem statement is called the probabilistic semantics.

It is very important to distinguish between the expected semantics and the probabilistic semantics. Zhang et al. [39] initially discussed the difference in the context of mining frequent items over uncertain data streams. Although that discussion can be extended to frequent subgraph mining over uncertain graphs, as will be shown in Appendix A, [39] does not address an important issue: when is one semantics preferable to the other? In Appendix A, we first show that the ϕ-frequent probability intrinsically contains more information on the frequentness of a subgraph than the expected support. Then, we show that frequent subgraph mining under the expected semantics is more suitable for exploring motifs in a set of uncertain graphs, whereas frequent subgraph mining under the probabilistic semantics is more suitable for detecting features from a set of uncertain graphs, where motif exploration and feature detection are the two main scenarios in which frequent subgraph mining is applied. For a more detailed discussion of the difference between the expected semantics and the probabilistic semantics, please refer to Appendix A.
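The information gap between the two measures can be made concrete with a toy computation. Purely for illustration, and under the simplifying assumption that the events "S occurs in Gi" are independent with known probabilities pi (in general these probabilities must be derived from the uncertain graphs themselves), two subgraphs can have identical expected support yet very different ϕ-frequent probabilities; the latter is computed below as a Poisson-binomial tail:

```python
import math

def phi_frequent_probability(p, phi):
    """P(at least ceil(phi*n) of n independent occurrence events happen),
    i.e., the tail of a Poisson-binomial distribution, via dynamic programming."""
    dist = [1.0]  # dist[k] = P(exactly k occurrences among the events seen so far)
    for pi in p:
        new = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            new[k] += q * (1 - pi)   # event does not happen
            new[k + 1] += q * pi     # event happens
        dist = new
    return sum(dist[math.ceil(phi * len(p)):])

# Two hypothetical subgraphs over 4 uncertain graphs, both with expected support 0.5:
p_steady = [0.5, 0.5, 0.5, 0.5]
p_skewed = [1.0, 1.0, 0.0, 0.0]
print(phi_frequent_probability(p_steady, 0.5))  # 0.6875
print(phi_frequent_probability(p_skewed, 0.5))  # 1.0
```

The expected supports coincide, but the 0.5-frequent probabilities differ (11/16 versus 1), which is exactly the extra information the probabilistic semantics retains.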

Although the problem to be solved in this paper follows the same semantics as the frequent itemset mining problems [3,30,39], the algorithms proposed in [3,30,39] cannot be extended to address the problem in this paper. This is not only due to the difference in data types but also because it can be decided in polynomial time whether an itemset is frequent [3,30,39], whereas it is #P-hard [33] to compute the ϕ-frequent probability of a subgraph, as will be formally proved in this paper.

To address these challenges, instead of discovering the complete set of frequent subgraphs, we try to find a broader set of subgraphs including all frequent subgraphs and a fraction of infrequent subgraphs whose ϕ-frequent probability is at least τ − ε, where 0 < ε < τ is an error tolerance. In other words, the terms "frequent subgraph" and "infrequent subgraph" are redefined. In particular, a subgraph is frequent if its ϕ-frequent probability is at least τ; a subgraph is infrequent if its ϕ-frequent probability is less than τ − ε. All of the subgraphs with ϕ-frequent probability in [τ − ε, τ) are approximately frequent. Since ε is usually very small with respect to τ, the inclusion of a small subset of approximately frequent subgraphs in the output will not degrade the quality of the mining results significantly.

An approximation algorithm is proposed in this paper to produce an (ε, δ)-approximate set Π of frequent subgraphs, where ε is the error tolerance and 0 < δ < 1 is a confidence bound. The algorithm guarantees that:

1. Any frequent subgraph S is contained in Π with probability at least ((1 − δ)/2)^s, where s is the number of edges of S;

2. Any infrequent subgraph with ϕ-frequent probability less than τ − ε is contained in Π with probability at most δ/2.

This algorithm is developed based on the well-known gSpan algorithm [34] for frequent subgraph mining. First, all subgraphs are encoded into minimum DFS codes and are organized into a search tree according to the lexicographic order of minimum DFS codes [34]. Then, the search tree is traversed in depth-first order to yield an (ε, δ)-approximate set Π of frequent subgraphs. In particular, for each visited subgraph S, instead of computing its exact ϕ-frequent probability, a randomized algorithm determines whether the ϕ-frequent probability of S is certainly no less than τ − ε and probably greater than or equal to τ. The randomized algorithm is very simple and fails with probability at most δ. If the answer given by the randomized algorithm is "yes", then S is inserted into Π and the depth-first search continues; otherwise, all descendants of S are pruned from the search, since the ϕ-frequent probability of any descendant of S is definitely no greater than τ.
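The paper's randomized test is specified later (Sect. 5); as a rough sketch of the underlying idea only (a generic Monte Carlo variant of my own, not the authors' algorithm), one can sample implicated databases, count how often sup(S; D′) ≥ ϕ, and compare the empirical frequency against τ, with the sample size chosen via a Hoeffding bound so that the test errs with probability at most δ:

```python
import math

def sample_size(eps, delta):
    """Hoeffding bound: with this many samples the empirical frequency is within
    eps/2 of the true phi-frequent probability, except with probability delta."""
    return math.ceil(2 * math.log(2 / delta) / eps ** 2)

def probably_frequent(sample_world_support, phi, tau, eps, delta):
    """Monte Carlo test: answer 'yes' if Pr(S) is plausibly >= tau - eps.
    sample_world_support() draws one implicated database and returns sup(S; D')."""
    n = sample_size(eps, delta)
    hits = sum(sample_world_support() >= phi for _ in range(n))
    return hits / n >= tau - eps / 2
```

`sample_world_support` is a hypothetical callback standing in for the possible-world sampler; the point of the sketch is only that a bounded number of samples suffices to make the yes/no decision with confidence 1 − δ.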

The theoretical analysis shows that to obtain any frequent subgraph with probability at least 1 − Δ, the input δ of the approximation algorithm must be set to at most 1 − 2(1 − Δ)^(1/ℓ_max), where 0 < Δ < 1 and ℓ_max is the maximum number of edges of frequent subgraphs.
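This bound follows directly by rearranging the per-subgraph guarantee ((1 − δ)/2)^s ≥ 1 − Δ for every s ≤ ℓ_max. A small helper (hypothetical, merely transcribing the formula above) makes the dependence explicit:

```python
def delta_upper_bound(Delta, l_max):
    """Largest delta with ((1 - delta)/2)**l_max >= 1 - Delta, i.e.,
    delta <= 1 - 2*(1 - Delta)**(1/l_max). Note the bound is positive only
    when (1 - Delta)**(1/l_max) < 1/2."""
    return 1 - 2 * (1 - Delta) ** (1 / l_max)
```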

Extensive experiments were performed on real uncertain graph databases to compare the mining results obtained under the probabilistic semantics and the expected semantics, to evaluate the approximation quality of the mining results, and to test the efficiency and the scalability of the proposed algorithm. The experimental results verify that the algorithm is very efficient and has very high approximation quality.

The main contributions of this paper are as follows:

– The problem of mining frequent subgraphs over uncertain graph databases under the probabilistic semantics has been formally defined;

– It has been formally proved that this frequent subgraph mining problem is NP-hard and that it is #P-hard to compute the ϕ-frequent probability of a subgraph;

– An approximate mining algorithm has been proposed to produce an approximate set of frequent subgraphs;

– The theoretical guarantees of the approximate mining algorithm have been thoroughly analyzed;

– A method has been given for setting the parameter δ to guarantee the approximation quality of the mining results;

– Extensive experiments have been performed on real uncertain graph databases to compare the mining results obtained under the probabilistic semantics and the expected semantics, to evaluate the approximation quality of the mining results, and to test the efficiency and the scalability of the proposed algorithm;

– The difference between the probabilistic semantics and the expected semantics on mining frequent subgraphs over uncertain graph data has been discussed in this paper for the first time.

The rest of this paper is organized as follows: Sect. 2 reviews the related work. Section 3 introduces a model of uncertain graph data and defines the frequent subgraph mining problem on uncertain graph databases under the probabilistic semantics. Section 4 formally proves the computational complexity of this problem. Section 5 proposes an approximation algorithm and analyzes its theoretical guarantees.

Section 6 elaborates on how to set the parameter δ to ensure the overall approximation quality of the mining results. Section 7 discusses the time complexity of the algorithm when the input uncertain graphs are special. Section 8 shows the experimental results on real uncertain graph databases. Finally, Sect. 9 concludes this paper.

2 Related work

The work most related to this paper includes frequent subgraph mining on certain graph data, frequent itemset mining on uncertain data, and frequent subgraph mining on uncertain graph data under the expected semantics. We summarize and analyze the related work in the rest of this section.

2.1 Frequent subgraph mining on certain graph data

In the last decade, frequent subgraph mining has been extensively studied on certain graph data, and a large number of algorithms [5,10,14,16,20,24,34] have been proposed to solve this problem. Because the number of frequent subgraphs is generally exponential in the input size [15,35], and it is NP-hard to find all matchings of a subgraph in a data graph [9], all existing algorithms are designed to avoid enumerating redundant isomorphic subgraphs and infrequent subgraphs. Although uncertainty is inherent in graph data, none of these existing algorithms takes uncertainty into account, and hence they are not applicable to uncertain graph data [42,44].

Although all existing algorithms fail to consider uncertainty in graph data, their techniques for representing and enumerating subgraphs, such as minimum DFS codes [34] and right-most extension [34], can also be used in our work because the results to be computed in this paper are also certain subgraphs. In particular, minimum DFS codes are used to canonically represent subgraphs, and right-most extension is used to enumerate subgraphs in a systematic way that reduces redundant isomorphic subgraphs.

2.2 Frequent itemset mining on uncertain data

Recently, significant progress has been made on mining frequent itemsets over uncertain databases [3,6,7,22,30] and on mining frequent items over uncertain data streams [8,39]. From the perspective of data mining semantics, the work in [6–8,22] investigates these problems under the expected semantics, and the other work in [3,30,39] studies them under the probabilistic semantics.

Due to the distinction in data types and the notable difference between the expected semantics and the probabilistic semantics as shown in Appendix A, the algorithms proposed in [6–8,22] cannot be adapted to solve the problem in this paper. Since the probability of an itemset being frequent under the probabilistic semantics can be computed in polynomial time [3,30,39], but it is #P-hard to compute the ϕ-frequent probability of a subgraph, the algorithms proposed in [3,30,39] also cannot be adapted to solve the problem in this paper.

Furthermore, although frequent itemset mining has been extensively studied under both the expected semantics and the probabilistic semantics, there is no thorough discussion on the difference between the two semantics. To the best of our knowledge, this paper is the first one to discuss the difference between the two semantics in detail.

2.3 Frequent subgraph mining on uncertain graph data under the expected semantics

In the context of mining uncertain graphs, our previous work [42,44] first investigated frequent subgraph mining on uncertain graphs under the expected semantics. We proposed a new measure called expected support to evaluate the frequentness of a subgraph in a set of uncertain graphs. In particular, the expected support of a subgraph is the expected value of its supports across all implicated graph databases. Papapetrou et al. [26] followed our work and proposed a new indexing structure to increase the efficiency of the mining algorithm. However, because of the significant difference between the expected semantics and the probabilistic semantics, a frequent subgraph discovered under the expected semantics may no longer be frequent under the probabilistic semantics (Appendix A). Hence, the algorithms developed under the expected semantics [26,42,44] cannot be adapted to discover frequent subgraphs under the probabilistic semantics.

Another line of research was carried out by Kimmig and De Raedt [18]. They cast pattern mining problems in the context of logic programming, particularly in ProbLog [29], a probabilistic Prolog system. Due to its powerful expressive capability, ProbLog is able to represent uncertainties of itemsets, sequences, trees, and graphs. Their appealing method is an integration of multi-relational data mining and inductive logic programming. However, due to the data representation in ProbLog, operations on graphs such as subgraph isomorphism tests are implemented by clause reduction, which becomes inefficient on large graphs.

2.4 Summary

Table 1 summarizes the related work according to the data types they can deal with, their capability of handling uncertainties, and the mining semantics they support. As can be seen, none of the existing work can be applied to find frequent subgraphs over uncertain graph data under the probabilistic semantics. To the best of our knowledge, this paper is the first to investigate the problem.

3 Problem definition

3.1 Preliminaries

Frequent subgraph mining mainly considers labeled graphs [5,10,14,16,20,24,34,42,44], i.e., graphs whose vertices and edges are associated with labels. If not otherwise specified, we mean by "graph" a labeled graph in this paper.

Definition 1 A certain graph is a system G = (V, E, Σ, L_V, L_E), where (V, E) is an undirected graph, V is the set of vertices, E is the set of edges, Σ is a set of labels, L_V : V → Σ is a function assigning a label in Σ to each vertex in V, and L_E : E → Σ is a function assigning a label in Σ to each edge in E. A certain graph G = (V, E, Σ, L_V, L_E) is trivial if V = {}. We denote a trivial graph by ∅. A certain graph database is a set of nontrivial certain graphs.

We often say that a certain graph occurs or is contained in another certain graph. More precisely, this is defined in graph theory as follows:

Definition 2 A certain graph G = (V, E, Σ, L_V, L_E) is subgraph isomorphic to another certain graph G′ = (V′, E′, Σ′, L′_V, L′_E), denoted by G ⊑ G′, if there exists an injection λ : V → V′ such that

1. L_V(v) = L′_V(λ(v)) for any v ∈ V,
2. (λ(u), λ(v)) ∈ E′ for any (u, v) ∈ E, and
3. L_E((u, v)) = L′_E((λ(u), λ(v))) for any (u, v) ∈ E.

Table 1 Summary of related work

Related work | Data type | Uncertainty | Semantics
Frequent subgraph mining on certain graph data [5,10,14,16,20,24,34] | Graph data | No | No
Frequent itemset mining on uncertain data [3,6–8,22,30,39] | Transactional data, data streams | Yes | Expected, probabilistic
Frequent subgraph mining on uncertain graph data under the expected semantics [18,26,42,44] | Graph data | Yes | Expected
This paper | Graph data | Yes | Probabilistic


The injection λ is called a subgraph isomorphism from G to G′. The subgraph (V^λ, E^λ, Σ, L′_V|_{V^λ}, L′_E|_{E^λ}) of G′ is called an embedding of G in G′ under λ, denoted by G^λ, where V^λ = {λ(v) | v ∈ V}, E^λ = {λ(e) | e ∈ E}, and the notation f|_X denotes the function obtained by restricting a function f : U → W to a domain X ⊆ U.

In traditional frequent subgraph mining, the degree of recurrence of a certain subgraph in a certain graph database is measured by its support [5,10,14,16,20,24,34].

Definition 3 The support of a certain subgraph S in a certain graph database D = {G1, G2, …, Gn}, denoted by sup(S; D), is the proportion of certain graphs in D to which S is subgraph isomorphic. That is,

sup(S; D) = (1/n) |{Gi | 1 ≤ i ≤ n, S ⊑ Gi}|.   (1)

Specifically, we require that sup(S; D) = 0 if D = ∅.

3.2 Uncertain graphs

The model of uncertain graphs used in [12,26,28,37,42,44] only considers uncertain edges. However, vertices in uncertain graphs can also have uncertainties. For example, the topological structure of a wireless sensor network can be represented as a graph, where vertices are sensor nodes, and edges are the wireless links between sensor nodes. Due to limited battery power, mobility, sleeping policies, malicious attacks, and so on, a sensor node may not be working normally all the time but may instead be malfunctioning with some probability. Thus, a vertex in such a graph may not exist all the time.

We now extend the model of uncertain graphs proposed in our previous work [42,44] to the following one, which takes the uncertainties of both vertices and edges into account.

Definition 4 An uncertain graph is a system G = (V, E, Σ, L_V, L_E, P_V, P_E), where (V, E) is an undirected graph, V is the set of vertices, E is the set of edges, Σ is a set of labels, L_V : V → Σ is a function assigning a label in Σ to each vertex in V, L_E : E → Σ is a function assigning a label in Σ to each edge in E, P_V : V → [0, 1] is a function assigning existence probability values to the vertices in V, and P_E : E → [0, 1] is a function assigning conditional existence probability values to the edges in E, given their endpoints.

In the definition above, the existence probability, P_V(v), of a vertex v is the probability of v existing in practice, and the conditional existence probability, P_E(e|u, v), of an edge e = (u, v) is the probability of e existing between vertices u and v, given that both u and v exist in practice. Therefore, a certain graph is essentially equivalent to an uncertain graph with an existence probability of 1 on every vertex and a conditional existence probability of 1 on every edge.

An uncertain graph G essentially represents a set of certain graphs implicated by G, each of which represents a possible structure in the form of which G may exist in practice. More formally, we have the following definition:

Definition 5 An uncertain graph G = (V, E, Σ, L_V, L_E, P_V, P_E) implicates a certain graph G′ = (V′, E′, Σ′, L′_V, L′_E), denoted by G ⇒ G′, if V′ ⊆ V, E′ ⊆ E ∩ (V′ × V′), Σ′ ⊆ Σ, L′_V = L_V|_{V′}, and L′_E = L_E|_{E′}, where E ∩ (V′ × V′) is the set of edges with both endpoints in V′.

In this paper, we assume that the existence probabilities of all vertices in an uncertain graph are mutually independent, and so are the conditional existence probabilities of all edges. Based on this assumption, the probability of an uncertain graph G = (V, E, Σ, L_V, L_E, P_V, P_E) implicating a certain graph G′ = (V′, E′, Σ′, L′_V, L′_E) can be obtained by

Pr(G ⇒ G′) = ( ∏_{v ∈ V′} P_V(v) ) · ( ∏_{v ∈ V\V′} (1 − P_V(v)) ) · ( ∏_{e=(u,v) ∈ E′} P_E(e|u, v) ) · ( ∏_{e=(u,v) ∈ E∩(V′×V′)\E′} (1 − P_E(e|u, v)) ),   (2)

where E ∩ (V′ × V′) is the set of edges in G with both endpoints in V′.
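Equation (2) can be evaluated directly once P_V and P_E are given. The sketch below uses a hypothetical minimal encoding (a dict of vertex probabilities and a dict of edge probabilities keyed by endpoint pairs); labels are omitted because they do not enter the probability:

```python
def implication_probability(PV, PE, vprime, eprime):
    """Pr(G => G') per Eq. (2). PV: {v: existence prob}; PE: {frozenset({u, v}):
    conditional existence prob}; vprime: V' as a set; eprime: E' as a set of
    frozensets. Labels are ignored here since they cancel out of Eq. (2)."""
    p = 1.0
    for v, pv in PV.items():           # vertex factors: present or absent
        p *= pv if v in vprime else 1 - pv
    for e, pe in PE.items():           # edge factors: only edges whose
        if e <= vprime:                # endpoints both survive contribute
            p *= pe if e in eprime else 1 - pe
    return p
```

For example, with PV = {1: 1.0, 2: 0.5} and a single edge of conditional probability 0.8, the implicated graph keeping both vertices and the edge has probability 1.0 · 0.5 · 0.8 = 0.4, and the probabilities over all implicated graphs sum to 1.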

Let Imp(G) denote the set of all implicated graphs of an uncertain graph G. It is easy to see that

|Imp(G)| = O( ∑_{i=1}^{|V|} (|V| choose i) · 2^{i(i−1)/2} )

by counting the number of all implicated graphs of a fully connected uncertain graph in the worst case.
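The count arises by choosing the i surviving vertices and then, in the fully connected worst case, any subset of the i(i − 1)/2 potential edges among them. A direct transcription (summing from i = 1, which excludes the trivial graph):

```python
from math import comb

def worst_case_implicated_graphs(n):
    """Sum over i surviving vertices of C(n, i) * 2**(i*(i-1)//2) edge subsets,
    for a fully connected uncertain graph on n vertices."""
    return sum(comb(n, i) * 2 ** (i * (i - 1) // 2) for i in range(1, n + 1))
```

For instance, under this count a fully connected 3-vertex uncertain graph has 3 + 6 + 8 = 17 nontrivial implicated graphs.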

For an uncertain graph G, let P_G : 2^{Imp(G)} → [0, 1] be a function defined as P_G(X) = ∑_{G′ ∈ X} Pr(G ⇒ G′). It is clear that (Imp(G), 2^{Imp(G)}, P_G) is a probability space. Thus, the function Pr_G : Imp(G) → [0, 1] defined as Pr_G(G′) = Pr(G ⇒ G′) is a probability distribution function over Imp(G).

3.3 Uncertain graph databases

Now, we extend the model of uncertain graph databases proposed in [42,44] to the following one, which takes trivial implicated graphs into account.

An uncertain graph database is a finite set of uncertain graphs, which essentially represents a set of implicated graph databases.

Definition 6 An uncertain graph database D = {G1, G2, …, Gn} implicates a certain graph database D′ = {G′1, G′2, …, G′m}, denoted by D ⇒ D′, if m ≤ n, and there is an injection σ : {1, 2, …, m} → {1, 2, …, n} such that G_{σ(i)} ⇒ G′_i for 1 ≤ i ≤ m.

We have m ≤ n in the definition above for the following reason. In the new model of uncertain graphs, an uncertain graph generally implicates a trivial certain graph with a nonzero probability. Since trivial certain graphs contain no useful knowledge, they should be excluded from implicated graph databases. Thus, the size m of an implicated graph database can be smaller than the size n of the uncertain graph database. This is also the main difference from the model of uncertain graph databases proposed in [42].

In this paper, we assume that the uncertain graphs in an uncertain graph database are mutually independent. Based on this assumption, the probability of an uncertain graph database D = {G1, G2, …, Gn} implicating a certain graph database D′ = {G′1, G′2, …, G′m} can be obtained by

Pr(D ⇒ D′) = ( ∏_{i=1}^{m} Pr(G_{σ(i)} ⇒ G′_i) ) · ( ∏_{i ∈ {1,2,…,n}\{σ(x) | 1 ≤ x ≤ m}} Pr(G_i ⇒ ∅) ),   (3)

where σ is an injection from {1, 2, …, m} to {1, 2, …, n} such that G_{σ(i)} ⇒ G′_i for 1 ≤ i ≤ m, and ∅ denotes a trivial certain graph.

Let Imp(D) denote the set of all implicated graph databases of an uncertain graph database D. It is easy to see that

|Imp(D)| = ∏_{G ∈ D} |Imp(G)|.

For an uncertain graph database D, let P_D : 2^{Imp(D)} → [0, 1] be a function defined as P_D(X) = ∑_{D′ ∈ X} Pr(D ⇒ D′). It is clear that (Imp(D), 2^{Imp(D)}, P_D) is a probability space. Thus, the function Pr_D : Imp(D) → [0, 1] defined as Pr_D(D′) = Pr(D ⇒ D′) is a probability distribution function over Imp(D).

3.4 Frequent subgraph mining problem over uncertain graph databases under probabilistic semantics

As in traditional frequent subgraph mining [10,14,20,24,34], this paper also focuses on discovering connected certain subgraphs. If not otherwise specified in the rest of this paper, a "subgraph" refers to a connected certain subgraph.

Let D be an uncertain graph database, Imp(D) be the set of all implicated graph databases of D, and S be a subgraph in D. Because every implicated graph database D′ in Imp(D) is actually a certain graph database, the definition of support (Definition 3) is applicable to S and D′. Hence, the probability that the support of S is no less than ϕ (0 < ϕ < 1) across all implicated graph databases of D is

Pr(S; D, ϕ) = ∑_{D′ ∈ Imp(D), sup(S;D′) ≥ ϕ} Pr(D ⇒ D′),   (4)

where Pr(D ⇒ D′) is the probability of D implicating D′ as given in (3).

In the rest of the paper, the probability Pr(S; D, ϕ) given in (4) is called the ϕ-frequent probability of S in D. When D and ϕ are clear from the context, Pr(S; D, ϕ) can simply be written as Pr(S). A subgraph S is (ϕ, τ)-probabilistic frequent in an uncertain graph database D if Pr(S; D, ϕ) ≥ τ, where 0 < τ < 1 is a user-specified confidence threshold. When ϕ and τ are clear from the context, S can simply be called a frequent subgraph.

Based on the concepts and notations above, the problem of mining frequent subgraphs over an uncertain graph database under the probabilistic semantics can thus be stated as follows:

Input: an uncertain graph database D, a support threshold 0 < ϕ < 1 and a confidence threshold 0 < τ < 1;

Output: all (ϕ, τ )-probabilistic frequent subgraphs in D.
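For intuition, the ϕ-frequent probability of definition (4) can be computed by brute force over possible worlds. The sketch below is not the paper's algorithm: it groups the implicated databases of D by each uncertain graph's outcome (implicating the trivial graph, implicating a nontrivial graph that contains S, or implicating a nontrivial graph that does not contain S), and assumes those per-graph probabilities are already given.

```python
from itertools import product

def phi_frequent_prob(p, q, phi):
    """Brute-force Pr(S; D, phi) from per-graph summaries:
    p[i] = probability that G_i implicates a graph containing S,
    q[i] = probability that G_i implicates the trivial graph,
    so 1 - p[i] - q[i] = probability of a nontrivial implicated
    graph not containing S.  Enumerates all 3^n outcome
    combinations (exponential; for illustration only)."""
    total = 0.0
    # outcome 0: trivial, 1: contains S, 2: nontrivial without S
    for world in product(range(3), repeat=len(p)):
        prob = 1.0
        i = j = 0  # graphs containing / not containing S
        for k, o in enumerate(world):
            if o == 0:
                prob *= q[k]
            elif o == 1:
                prob *= p[k]
                i += 1
            else:
                prob *= 1.0 - p[k] - q[k]
                j += 1
        # sup(S; D') = i / (i + j); the empty database has support 0
        if i + j > 0 and i / (i + j) >= phi:
            total += prob
    return total
```

For example, with two uncertain graphs summarized as p = [0.9, 0.5] and q = [0.0, 0.1], the 0.5-frequent probability comes out to 0.95. The exponential blow-up here is exactly what Sect. 5 avoids.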

4 Computational complexity

This section formally proves the computational complexity of the problem defined in the previous section. The complexity results proved in this section use the complexity class #P for enumeration problems [33].

Theorem 1 It is #P-hard to compute the ϕ-frequent probability of a subgraph S in an uncertain graph database D.

Proof We prove the theorem by reducing any instance of the #P-complete problem of counting the number #F of assignments satisfying a monotone k-DNF formula F [33] to an instance of the current problem of computing the ϕ-frequent probability Pr(S; D, ϕ) of a subgraph S in an uncertain graph database D in polynomial time. Let F = C1 ∨ C2 ∨ · · · ∨ Cm contain m clauses over n variables x1, x2, . . . , xn. Each clause Ci is of the form l1 ∧ l2 ∧ · · · ∧ lk, where k > 0 is a constant, and each literal lj is a distinct variable in {x1, x2, . . . , xn}. An instance of the problem of computing Pr(S; D, ϕ) can be constructed from F as follows:

1. Construct an uncertain graph database D = {G}, where G is a bipartite uncertain graph. The vertex set of G is U ∪ V, where U = {c1, c2, . . . , cm} and V = {v1, v2, . . . , vn}. All vertices in U are labeled α and have existence probabilities of 1; all vertices in V are labeled β and have existence probabilities of 1/2. There is an edge between vertices ci and vj if and only if clause Ci contains variable xj. All edges of G are labeled γ and have conditional existence probabilities of 1.

2. Construct a subgraph S. The vertex set of S is {c, v1, v2, . . . , vn}, where c is labeled α, and v1, v2, . . . , vn are labeled β. There is an edge labeled γ connecting c with each of v1, v2, . . . , vn.

3. Let ϕ = 1.

It is easy to see that the construction can be accomplished in polynomial time. All that remains to be shown is the correspondence between the number #F of assignments satisfying F and the ϕ-frequent probability Pr(S; D, ϕ) of S in D. We have the following observations:

1. A truth assignment π to F one-to-one corresponds to an implicated graph database D′π of D such that variable xi is assigned true in π if and only if vertex vi is contained in the only certain graph in D′π, where vi is the vertex created for xi in the construction process above.

2. A truth assignment π satisfies F if and only if S has support at least ϕ in the implicated graph database D′π that π one-to-one corresponds to.

3. Any implicated graph database of D is implicated by D with probability 2^{−n}.

Subsequently,

#F = ∑_{D′ ∈ Imp(D), sup(S;D′) ≥ ϕ} 1
   = 2^n ∑_{D′ ∈ Imp(D), sup(S;D′) ≥ ϕ} Pr(D ⇒ D′)
   = 2^n Pr(S; D, ϕ).

Thus, the reduction is completed in polynomial time. ⊓⊔

Theorem 2 It is #P-hard to count the number of all (ϕ, τ)-probabilistic frequent subgraphs in an uncertain graph database D.

Proof We prove this theorem by reducing any instance of the #P-hard problem of counting the number of subgraphs with support at least ϕ′ in a certain graph database D′ [36] to an instance of the current problem of counting the number of (ϕ, τ)-probabilistic frequent subgraphs in an uncertain graph database D. An instance of the current problem can be trivially constructed from D′ and ϕ′ as follows: First, let D be the uncertain graph database obtained by assigning existence probability values of 1 to all vertices and conditional existence probability values of 1 to all edges of all graphs in D′. Then, let ϕ = ϕ′ and τ = 1. It is easy to see that the construction can be accomplished in polynomial time.

A subgraph has support at least ϕ′ in D′ if and only if its ϕ-frequent probability in D is 1. For this reason, the number of subgraphs with support at least ϕ′ in D′ is equal to the number of (ϕ, τ)-probabilistic frequent subgraphs in D. Thus, the reduction is completed in polynomial time. ⊓⊔

Due to the hardness results proved above, we cannot expect to discover all frequent subgraphs in an uncertain graph database exactly in polynomial time unless P = NP. In particular, for a subgraph S, there is generally no polynomial-time algorithm to determine whether S is frequent unless P = NP. To address this challenge, we can discover an approximate set of subgraphs, which is usually sufficient in practice when small errors are tolerable. Thus, we have the following problem of approximately mining frequent subgraphs over uncertain graph databases:

Input: an uncertain graph database D, a support threshold 0 < ϕ < 1, a confidence threshold 0 < τ < 1, and error tolerance 0 < ε < τ;

Output: a set of subgraphs including all frequent subgraphs and possibly a subset of subgraphs that are not frequent but have ϕ-frequent probability at least τ − ε.

In other words, the terms "frequent subgraph" and "infrequent subgraph" are redefined in this paper. In particular, a subgraph is frequent if its ϕ-frequent probability is at least τ and infrequent if its ϕ-frequent probability is less than τ − ε. All of the subgraphs with ϕ-frequent probability in [τ − ε, τ) are approximately frequent. Since ε is usually very small with respect to τ, the inclusion of a small subset of approximately frequent subgraphs in the output will not significantly degrade the quality of the mining results.

It is worth noting that the goal is not to discover all subgraphs with ϕ-frequent probability at least τ − ε; otherwise, the problem would become even harder. If a subgraph S has ϕ-frequent probability at least τ − ε but it is easy to determine that the ϕ-frequent probability of S is less than τ, then S need not be contained in the mining results.

5 Approximation algorithm for mining frequent subgraphs on uncertain graph databases

Given an uncertain graph database D, a support threshold 0 < ϕ < 1, a confidence threshold 0 < τ < 1, error tolerance 0 < ε < τ, and a confidence bound 0 < δ < 1 as input, an approximation algorithm is proposed to produce an (ε, δ)-approximate set Π of "frequent subgraphs" in D such that


1. Any frequent subgraph S is contained in Π with probability at least ((1 − δ)/2)^s, where s is the number of edges of S;

2. Any infrequent subgraph with ϕ-frequent probability less than τ − ε is contained in Π with probability at most δ/2.

5.1 High-level description

The procedure of the algorithm can be outlined as follows:

– First, organize all subgraphs in D into a search tree, where nodes represent subgraphs, and each node (subgraph) is subgraph isomorphic to all its children if it has any, and has one less edge than any of them.

– Then, examine subgraphs in the search tree in depth-first order. For each examined subgraph S, determine in polynomial time whether S has ϕ-frequent probability at least τ − ε and probably at least τ. If the answer is "yes", then output S and proceed to examine the descendants of S in depth-first order. Otherwise, all descendants of S are certainly infrequent and can be pruned due to the following property of ϕ-frequent probability.

Lemma 1 For any subgraphs S and S′, if S ⊑ S′, then Pr(S; D, ϕ) ≥ Pr(S′; D, ϕ).

Proof From (1), it is easy to see that for any certain graph database D′, sup(S; D′) ≥ sup(S′; D′). Thus,

Pr(S; D, ϕ) = ∑_{D′ ∈ Imp(D), sup(S;D′) ≥ ϕ} Pr(D ⇒ D′)
            ≥ ∑_{D′ ∈ Imp(D), sup(S′;D′) ≥ ϕ} Pr(D ⇒ D′)
            = Pr(S′; D, ϕ). ⊓⊔

The algorithm terminates when no subgraphs are left unexamined. Next, we clarify the details of each step.

5.1.1 Build search tree

Because subgraphs are actually certain graphs, all subgraphs in D can be encoded into minimum DFS codes [34], the state-of-the-art canonical graph coding scheme developed for frequent subgraph mining. Informally speaking, the minimum DFS code of a subgraph is the sequence that precedes all other sequential representations of the isomorphic subgraphs according to the lexicographic order [34] of DFS codes. For more details on minimum DFS codes and the lexicographic order of DFS codes, please refer to [34].

The search tree of subgraphs can be constructed as follows. The nodes of the search tree are all subgraphs in D. The root is a trivial subgraph ∅. The parent parent(S) of each subgraph S, where S ≠ ∅, satisfies that the minimum DFS code of parent(S) is the longest prefix of that of S.

5.1.2 Depth-first search on search tree

Note that we do not materialize the search tree in memory but enumerate subgraphs in the search tree in depth-first order. In particular, for each subgraph S, all its children are generated as follows. First, perform right-most extension [34] on S, obtaining a set of subgraphs, each of which contains S and has one more edge than S. Then, for each right-most extended subgraph S′, if the minimum DFS code of S is a prefix of that of S′, then S′ is a child of S; otherwise, S′ is not a child of S. Please refer to [34] for more details on right-most extension.

5.1.3 Verification of frequent subgraphs

For each subgraph S visited in the depth-first search, it must be determined whether to output S. To this end, we use the following method:

1. Compute an approximation of Pr(S; D, ϕ), denoted by P̂r(S; D, ϕ), such that |P̂r(S; D, ϕ) − Pr(S; D, ϕ)| ≤ ε/2;

2. Use the following rules for verification:

(a) If P̂r(S; D, ϕ) ≥ τ − ε/2, then Pr(S; D, ϕ) ≥ τ − ε and probably Pr(S; D, ϕ) ≥ τ, thus S can be outputted;

(b) If P̂r(S; D, ϕ) < τ − ε/2, then Pr(S; D, ϕ) < τ, thus S is not frequent and cannot be outputted.
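The two rules reduce to a single threshold comparison on the estimate; a minimal sketch (the function name is illustrative, not the paper's pseudocode):

```python
def verify(p_hat, tau, eps):
    """Verification rule of Sect. 5.1.3.  Since p_hat is within
    eps/2 of Pr(S; D, phi), p_hat >= tau - eps/2 implies the true
    probability is at least tau - eps (so S may be outputted),
    while p_hat < tau - eps/2 implies it is below tau (so S is
    certainly not frequent)."""
    return p_hat >= tau - eps / 2
```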

Section 5.2 will present an approximation algorithm for computing P̂r(S; D, ϕ) such that

Pr(|P̂r(S; D, ϕ) − Pr(S; D, ϕ)| ≤ ε/2) ≥ 1 − δ,  (5)

where 0 < δ < 1 is given as an additional input parameter of that algorithm.

5.2 Algorithm for approximating ϕ-frequent probabilities

First, a dynamic programming algorithm is developed in Sect. 5.2.1 to exactly compute Pr(S; D, ϕ), but its input cannot be derived in polynomial time. Then, a randomized algorithm is proposed in Sect. 5.2.2 to approximate the input of the dynamic programming algorithm in polynomial time. Finally, by integrating the two algorithms, an approximation algorithm is obtained in Sect. 5.2.3 to produce an approximation, P̂r(S; D, ϕ), of Pr(S; D, ϕ) that satisfies (5).


5.2.1 Dynamic programming algorithm

Let us start with a concept originally introduced in [44]. We say a subgraph S occurs in an uncertain graph G, denoted by S ⊑U G, if S is subgraph isomorphic to at least one implicated graph of G. More formally,

S ⊑U G ⇔ ∃G′ ∈ Imp(G), S ⊑ G′.

Then, the probability of S occurring in G is

Pr(S ⊑U G) = ∑_{G′ ∈ Imp(G), S ⊑ G′} Pr(G ⇒ G′),  (6)

where Imp(G) denotes the set of all implicated graphs of G, and Pr(G ⇒ G′) is the probability of G implicating G′ as given in (2).

With the concept given above, Pr(S; D, ϕ) can be exactly computed by dynamic programming. We first give the intuition behind the dynamic programming. Let Ii,l be the set of all implicated graph databases of D that consist of l certain graphs (l ≤ n), in which i graphs contain S, and the other l − i graphs do not contain S. Note that l ≤ n because an uncertain graph may implicate a trivial graph that must be removed from the implicated graph databases (Definition 6). If i ≥ ϕl, the support of S in any implicated graph database in Ii,l is at least ϕ. Let Pr(Ii,l) denote ∑_{D′ ∈ Ii,l} Pr(D ⇒ D′). By summing Pr(Ii,l) over all i and l such that i ≥ ϕl, we obtain Pr(S; D, ϕ).

The probability Pr(Ii,l) can be computed in the following way. First, the set Ii,l can be partitioned into three disjoint subsets I¹i,l, I²i,l, and I³i,l, that is, Ii,l = I¹i,l ∪ I²i,l ∪ I³i,l, and I¹i,l ∩ I²i,l = I¹i,l ∩ I³i,l = I²i,l ∩ I³i,l = ∅. Without loss of generality, we choose the uncertain graph Gn in D.

– Every implicated graph database in I¹i,l contains a nontrivial implicated graph G′ of Gn, and S is subgraph isomorphic to G′;

– Every implicated graph database in I²i,l contains a nontrivial implicated graph G′ of Gn, and S is not subgraph isomorphic to G′;

– Every implicated graph database in I³i,l contains no implicated graphs of Gn.

It is obvious that Pr(Ii,l) = Pr(I¹i,l) + Pr(I²i,l) + Pr(I³i,l).

To compute Pr(I¹i,l), Pr(I²i,l) and Pr(I³i,l), the following recursive computation can be carried out. Similar to Ii,l, we denote by I′i,l the set of all implicated graph databases of D \ {Gn} that consist of l certain graphs (l ≤ n − 1), in which i graphs contain S, and the other l − i graphs do not contain S. Thus, we have

– Pr(I¹i,l) = Pr(I′i−1,l−1) Pr(S ⊑U Gn) because any implicated graph database in I¹i,l can be obtained by adding a nontrivial implicated graph of Gn in which S is a subgraph to any implicated graph database in I′i−1,l−1.

– Pr(I²i,l) = Pr(I′i,l−1)(1 − Pr(S ⊑U Gn) − Pr(Gn ⇒ ∅)) because any implicated graph database in I²i,l can be obtained by adding a nontrivial implicated graph of Gn in which S is not a subgraph to any implicated graph database in I′i,l−1.

– Pr(I³i,l) = Pr(I′i,l) Pr(Gn ⇒ ∅) because any implicated graph database in I³i,l does not contain nontrivial implicated graphs of Gn, so that any implicated graph database in I′i,l is also in I³i,l.

Based on the fundamental idea shown above, the dynamic programming can be formally described as follows: Let the uncertain graphs in D be indexed from 1 to n, i.e., D = {G1, G2, . . . , Gn}. Let T[0..n, 0..n, 0..n] be a 3-dimensional array with the subscript in each dimension ranging from 0 to n. Element T[i, j, k] of T stores the probability that across all implicated graph databases D′ of {G1, G2, . . . , Gk},

1. |D′| = i + j, and

2. S is subgraph isomorphic to i certain graphs in D′, i.e., the support of S in D′ is i/(i + j).

That is,

T[i, j, k] = ∑_{D′ ∈ Imp({G1,...,Gk}), |D′| = i+j, sup(S;D′) = i/(i+j)} Pr({G1, . . . , Gk} ⇒ D′).  (7)

The elements of T can be computed using the following recursive equation: Basically, T[0, 0, 0] = 1, and T[i, j, k] = 0 if i + j > k. For other cases, T[i, j, k] can be computed by (8), where Pr(Gk ⇒ ∅) is the probability of Gk implicating a trivial certain graph ∅, and Pr(S ⊑U Gk) is the probability of S occurring in Gk as given in (6).

T[i, j, k] =
  Pr(Gk ⇒ ∅) T[i, j, k−1],  if i = 0, j = 0 and k > 0;
  Pr(Gk ⇒ ∅) T[i, j, k−1] + Pr(S ⊑U Gk) T[i−1, j, k−1],  if i > 0, j = 0 and k ≥ i;
  Pr(Gk ⇒ ∅) T[i, j, k−1] + (1 − Pr(Gk ⇒ ∅) − Pr(S ⊑U Gk)) T[i, j−1, k−1],  if i = 0, j > 0 and k ≥ j;
  Pr(Gk ⇒ ∅) T[i, j, k−1] + Pr(S ⊑U Gk) T[i−1, j, k−1] + (1 − Pr(Gk ⇒ ∅) − Pr(S ⊑U Gk)) T[i, j−1, k−1],  if i > 0, j > 0 and k ≥ i + j.  (8)

After computing all the elements of T, Pr(S; D, ϕ) can be derived from T by computing

∑_{i=1}^{n} ∑_{j=0}^{min{⌊(1−ϕ)i/ϕ⌋, n−i}} T[i, j, n].  (9)


Fig. 1 Dynamic programming algorithm for computing Pr(S; D, ϕ)

Let n, ϕ, the exact values of Pr(Gi ⇒ ∅) for 1 ≤ i ≤ n, and the exact values of Pr(S ⊑U Gi) for 1 ≤ i ≤ n be given as input. The dynamic programming algorithm is presented as Procedure DP in Fig. 1.
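The recursion (8) and the final sum (9) can be sketched as follows, assuming the exact inputs q[k] = Pr(Gk ⇒ ∅) and p[k] = Pr(S ⊑U Gk) are given (the array names are illustrative, not the paper's pseudocode):

```python
from math import floor

def dp(n, phi, q, p):
    """Dynamic program for Pr(S; D, phi), following recursion (8)
    and final aggregation (9).  q and p are 1-indexed (index 0
    unused): q[k] = Pr(G_k implicates the trivial graph),
    p[k] = Pr(S occurs in G_k)."""
    # T[i][j][k]: probability that among the implicated databases
    # of {G_1..G_k}, i graphs contain S and j graphs do not.
    T = [[[0.0] * (n + 1) for _ in range(n + 1)]
         for _ in range(n + 1)]
    T[0][0][0] = 1.0
    for k in range(1, n + 1):
        for i in range(k + 1):
            for j in range(k - i + 1):       # i + j <= k
                v = q[k] * T[i][j][k - 1]
                if i > 0:                    # G_k contributes a graph with S
                    v += p[k] * T[i - 1][j][k - 1]
                if j > 0:                    # G_k contributes a graph without S
                    v += (1 - q[k] - p[k]) * T[i][j - 1][k - 1]
                T[i][j][k] = v
    # final aggregation, Eq. (9)
    total = 0.0
    for i in range(1, n + 1):
        jmax = min(floor((1 - phi) * i / phi), n - i)
        total += sum(T[i][j][n] for j in range(jmax + 1))
    return total
```

On the same two-graph toy input as above (p = [_, 0.9, 0.5], q = [_, 0.0, 0.1], ϕ = 0.5), this Θ(n³) routine agrees with brute-force enumeration of all implicated databases.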

Theorem 3 For an uncertain graph database D = {G1, G2, . . . , Gn}, a subgraph S and 0 < ϕ < 1, on input n, ϕ, Pr(Gi ⇒ ∅) for 1 ≤ i ≤ n, and Pr(S ⊑U Gi) for 1 ≤ i ≤ n, Procedure DP outputs Pr(S; D, ϕ).

Proof By grouping the implicated graph databases of D by the number i of certain graphs containing S and the number j of certain graphs not containing S, we have

Pr(S; D, ϕ) = ∑_{i=0}^{n} ∑_{j=0}^{n−i} ∑_{D′ ∈ Imp(D), |D′| = i+j, sup(S;D′) = i/(i+j) ≥ ϕ} Pr(D ⇒ D′).

Since sup(S; D′) = 0 if D′ = ∅, and i/(i + j) ≥ ϕ if and only if j ≤ ⌊(1 − ϕ)i/ϕ⌋, we have

Pr(S; D, ϕ) = ∑_{i=1}^{n} ∑_{j=0}^{min{⌊(1−ϕ)i/ϕ⌋, n−i}} ∑_{D′ ∈ Imp(D), |D′| = i+j, sup(S;D′) = i/(i+j)} Pr(D ⇒ D′)
            = ∑_{i=1}^{n} ∑_{j=0}^{min{⌊(1−ϕ)i/ϕ⌋, n−i}} T[i, j, n]  (by (7)).

The theorem thus holds. ⊓⊔

It is easy to see that the time complexity of DP is Θ(n³) because lines 3–6 can be accomplished in Θ(n³) time, and line 7 can be done in O(n²) time.

Remark 1 A similar dynamic programming algorithm has been adopted by [39] to find frequent items over a probabilistic data stream, in which the counterpart of Pr(S ⊑U G) in the recursive equation is the probability that an item is contained in an x-tuple [1]. Actually, the probability of an item being contained in an x-tuple can be computed in polynomial time. Unfortunately, the computational complexity result is negative for computing Pr(S ⊑U G), as will be shown in the following theorem:

Theorem 4 It is #P-hard to compute the probability, Pr(S ⊑U G), of a subgraph S occurring in an uncertain graph G.

Proof The proof is very similar to the proof of Theorem 3 in our previous work [44], thus it is omitted here. ⊓⊔

Although the time complexity of DP is Θ(n³), the practical problem with DP is that its required input Pr(S ⊑U Gi) for 1 ≤ i ≤ n cannot be derived in polynomial time. To overcome this difficulty, an intuitive way is to substitute Pr(S ⊑U Gi) with an approximate value, denoted by P̂r(S ⊑U Gi), that is accurate enough and can be computed in polynomial time.

5.2.2 Algorithm for approximating Pr(S ⊑U G)

Given an uncertain graph G, a subgraph S and real numbers 0 < ε′, δ′ < 1 as input, a randomized algorithm is proposed to compute an approximate value P̂r(S ⊑U G) in polynomial time such that

Pr(|P̂r(S ⊑U G) − Pr(S ⊑U G)| ≤ ε′) ≥ 1 − δ′.

This algorithm is developed by extending the algorithm proposed in our previous work [44], which computes Pr(S ⊑U G) only on uncertain graphs G whose edges are uncertain but whose vertices are certain. Due to the difference in the uncertain graph data model, the algorithm in [44] cannot be used in this paper to deal with uncertain graphs whose vertices and edges are both uncertain. We give special consideration to the uncertain vertices in the algorithm.

The fundamental idea of the algorithm is to transform the problem of computing Pr(S ⊑U G) into the problem of computing the probability of a DNF formula F being satisfied by a randomly and independently chosen truth assignment to the variables of F, denoted by Pr(F). The procedure of the algorithm is presented as follows:

Fig. 2 Randomized algorithm for approximating Pr(S ⊑U G)

Step 1 Construct a DNF formula F from S and G in the following way: Consider the certain graph obtained by removing all uncertainties from G. Note that, as side products of enumerating the children of S's parent in the search tree, all subgraph isomorphisms from S to this certain graph have already been obtained before examining S in the depth-first search. Based on these subgraph isomorphisms, all embeddings, say M1, M2, . . . , Mm, of S in this certain graph can be easily derived. We stored M1, M2, . . . , Mm in an auxiliary array associated with S once they were found. From M1, M2, . . . , Mm and G, the DNF formula F can be constructed as follows:

1. Assign a distinct variable xv to each vertex v contained in M1, M2, . . . , Mm. The probability of xv being assigned true is equal to the existence probability of v in G.

2. Assign a distinct variable xe to each edge e contained in M1, M2, . . . , Mm. The probability of xe being assigned true is equal to the conditional existence probability of e in G.

3. For 1 ≤ i ≤ m, construct a clause Ci by conjoining all variables assigned to the vertices and the edges in Mi, i.e., Ci = (∧_{v ∈ V(Mi)} xv) ∧ (∧_{e ∈ E(Mi)} xe), where V(Mi) and E(Mi) denote the vertex set and the edge set of Mi, respectively.

4. Let F = C1 ∨ C2 ∨ · · · ∨ Cm.

Step 2 Use the Monte-Carlo algorithm proposed in [17] to compute a value P̂r(F) such that

Pr(|P̂r(F) − Pr(F)| ≤ ε′) ≥ 1 − δ′.

Then, P̂r(F) is returned as P̂r(S ⊑U G).

The randomized algorithm is presented as Procedure Estimate in Fig. 2.
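The coverage-style Monte-Carlo DNF estimator of [17] used in Step 2 can be sketched as follows. The function and variable names are illustrative, not the paper's pseudocode; clauses are assumed to be conjunctions of positive literals over independent variables, which matches the construction of Step 1.

```python
import math
import random

def estimate_dnf_prob(clauses, pr_true, eps, delta):
    """Monte-Carlo estimate of Pr(F) for F = C_1 v ... v C_m.
    clauses[i]: set of variables in clause C_i (all positive
    literals); pr_true[x]: probability that x is set true,
    independently of the other variables."""
    # Pr(C_i) is the product of its variables' probabilities.
    w = [math.prod(pr_true[x] for x in c) for c in clauses]
    W = sum(w)
    if W == 0.0:
        return 0.0
    m = len(clauses)
    # Sample size chosen in the spirit of Theorem 5's bound;
    # the exact constant is an assumption of this sketch.
    trials = math.ceil(4 * m * math.log(2 / delta) / eps ** 2)
    hits = 0
    for _ in range(trials):
        # Sample clause i with probability w[i]/W, then a random
        # assignment conditioned on C_i being satisfied.
        i = random.choices(range(m), weights=w)[0]
        assign = {x: True for x in clauses[i]}
        for x in pr_true:
            if x not in assign:
                assign[x] = random.random() < pr_true[x]
        # Count the sample only if i is the lowest-index clause
        # satisfied by the assignment (each satisfying assignment
        # is thus counted exactly once across the weighted space).
        if all(not all(assign[x] for x in clauses[j])
               for j in range(i)):
            hits += 1
    return W * hits / trials
```

The estimate W · hits/trials converges to Pr(F) because hits/trials estimates the fraction Pr(F)/W of the weighted clause-assignment space covered by lowest-index satisfying pairs.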

We now prove the correctness of Procedure Estimate.

Lemma 2 For an uncertain graph G and a subgraph S, let F be the DNF formula constructed from G and S by Procedure Estimate. We have Pr(S ⊑U G) = Pr(F).

Proof Let M1, M2, . . . , Mm be the embeddings of S in G, and G∗ be the uncertain graph obtained by removing the vertices and the edges not contained in M1, M2, . . . , Mm from G. Since any vertex or edge not contained in M1, M2, . . . , Mm does not affect Pr(S ⊑U G), we have Pr(S ⊑U G) = Pr(S ⊑U G∗). Thus, all that remains to be shown is that Pr(F) = Pr(S ⊑U G∗).

Any implicated graph G′ of G∗ corresponds to a set of truth assignments to F, denoted by QG′, such that a vertex v is contained in G′ if and only if every assignment in QG′ sets xv to true, where xv is the variable constructed for v; and an edge e = (u, v) is contained in G′ if and only if every assignment in QG′ sets xu, xv and xe to true, where xu, xv and xe are the variables constructed for u, v, and e, respectively. The following properties can be easily seen:

1. For any implicated graphs G′ and G′′ of G∗, where G′ ≠ G′′, we have QG′ ∩ QG′′ = ∅.

2. Let Q be the set of all assignments to F. We have Q = ∪_{G′ ∈ Imp(G∗)} QG′.

3. Pr(G∗ ⇒ G′) = ∑_{π ∈ QG′} Pr(π), where Pr(π) is the probability of π being randomly and independently chosen.

4. S is subgraph isomorphic to G′ if and only if all assignments in QG′ satisfy F.

Subsequently, Pr(F) = Pr(S ⊑U G∗) = Pr(S ⊑U G). ⊓⊔

Theorem 5 [17] Given a DNF formula F = C1 ∨ C2 ∨ · · · ∨ Cm and 0 < ε′, δ′ < 1, the Monte-Carlo algorithm proposed in [17] computes an approximation P̂r(F) of Pr(F) such that

Pr(|P̂r(F) − Pr(F)| ≤ ε′) ≥ 1 − δ′

in O((mℓ/ε′²) log(2/δ′)) time, where ℓ is the number of literals in F.

Due to Lemma 2 and Theorem 5, the following theorem holds:

Theorem 6 Given an uncertain graph G, a subgraph S and 0 < ε′, δ′ < 1, Procedure Estimate outputs an approximation P̂r(S ⊑U G) of Pr(S ⊑U G) such that

Pr(|Pr(S ⊑U G) − P̂r(S ⊑U G)| ≤ ε′) ≥ 1 − δ′.

Now, we analyze the time complexity of Procedure Estimate. Let s be the number of edges of S. Since |V(Mi)| ≤ |E(Mi)| = s for all 1 ≤ i ≤ m, lines 1–8 construct the DNF formula F in Θ(ms) time. Since F consists of m clauses and each clause contains O(s) variables, line 9 runs in O((m²s/ε′²) log(2/δ′)) time due to Theorem 5, where m is the number of embeddings of S in the certain graph obtained by removing uncertainties from G. Thus, the time complexity of Estimate is O((m²s/ε′²) log(2/δ′)).


Remark 2 There are two criteria for selecting a suitable algorithm for approximating Pr(F): the time complexity of the algorithm and the accuracy that the algorithm can achieve. So far, four categories of algorithms for approximating Pr(F) have been proposed: exact algorithms [4,19], heuristic algorithms [25,27,29], deterministic approximation algorithms [23,31], and the Monte-Carlo algorithm [17]. These algorithms can be summarized as follows:

– The running time of the exact algorithms [4,19] is exponential in the worst case. Thus, they are absolutely unsuitable choices for computing Pr(F).

– The heuristic algorithms based on binary decision diagrams (BDDs) [27,29] run in exponential time in the worst case and thus do not satisfy the first criterion. The incremental heuristic algorithm recently proposed in [25] is more efficient than the BDD-based heuristic algorithms. However, it cannot approximate Pr(F) within an arbitrary error, thus failing the second criterion.

– The deterministic approximation algorithms [23,31] run in polynomial time and have theoretical guarantees on approximation error. However, the constant coefficients of their time complexities are extremely high even for small problem instances.

– The Monte-Carlo algorithm proposed in [17] satisfies both of the two criteria. Thus, it is selected in this paper as the suitable algorithm for approximating Pr(F).

5.2.3 Algorithm for approximating Pr(S; D, ϕ)

By approximating the inputs Pr(S ⊑U Gi) for 1 ≤ i ≤ n of Procedure DP using Procedure Estimate, we obtain an algorithm for approximating Pr(S; D, ϕ). More precisely, given an uncertain graph database D = {G1, G2, . . . , Gn}, a subgraph S and 0 < ϕ, ε, δ < 1, the algorithm produces a value P̂r(S; D, ϕ) in polynomial time such that

Pr(|Pr(S; D, ϕ) − P̂r(S; D, ϕ)| ≤ ε/2) ≥ 1 − δ.

This algorithm works in a very simple way:

Step 1 Compute Pr(Gi ⇒ ∅) by (2) for 1 ≤ i ≤ n.

Step 2 For 1 ≤ i ≤ n, use Procedure Estimate to compute P̂r(S ⊑U Gi) such that

Pr(|P̂r(S ⊑U Gi) − Pr(S ⊑U Gi)| ≤ ε/(2n)) ≥ (1 − δ)^{1/n}.

Step 3 Call Procedure DP with input n, ϕ, Pr(Gi ⇒ ∅) for 1 ≤ i ≤ n, and P̂r(S ⊑U Gi) for 1 ≤ i ≤ n. Let P̂r(S; D, ϕ) be the output of DP and return it.

Fig. 3 Algorithm for approximating Pr(S; D, ϕ)

Since the value of Pr(Gi ⇒ ∅) does not depend on any specific subgraph S, the values of Pr(Gi ⇒ ∅) for 1 ≤ i ≤ n can be computed once at the beginning of the whole mining process and reused in all succeeding calls of this algorithm as additional input parameters. The algorithm is presented as Procedure Approximate in Fig. 3.

We now analyze the failure probability of Approximate. For ease of proof, let DP(D, S, ϕ, i) denote the output of DP on input n, ϕ, Pr(Gj ⇒ ∅) for 1 ≤ j ≤ n, P̂r(S ⊑U Gj) for 1 ≤ j ≤ i, and Pr(S ⊑U Gj) for i + 1 ≤ j ≤ n, where P̂r(S ⊑U Gj) is an approximation of Pr(S ⊑U Gj) within error ε/(2n) with probability at least (1 − δ)^{1/n}. The following important lemma can be proved:

Lemma 3 For an uncertain graph database D = {G1, G2, . . . , Gn}, a subgraph S and 0 < ϕ < 1,

|DP(D, S, ϕ, i − 1) − DP(D, S, ϕ, i)| ≤ |Pr(S ⊑U Gi) − P̂r(S ⊑U Gi)|

for 1 ≤ i ≤ n.

Proof Let DP′ be a dynamic programming procedure that is the same as DP except that the equation in line 7 for computing the final result is replaced by

∑_{i=1}^{n} ∑_{j=0}^{min{⌊(1−ϕ)(i+1)/ϕ⌋, n−i}} T[i, j, n].

Let DP′′ be a dynamic programming procedure that is the same as DP except that the equation in line 7 for computing the final result is substituted with

∑_{i=1}^{n} ∑_{j=0}^{min{⌊(1−ϕ)i/ϕ⌋ − 1, n−i}} T[i, j, n].

Furthermore, let D′ = D \ {Gi}, qi = Pr(Gi ⇒ ∅), pi = Pr(S ⊑U Gi) and p̂i = P̂r(S ⊑U Gi). We have

DP(D, S, ϕ, i − 1) = qi · DP(D′, S, ϕ, i − 1) + pi · DP′(D′, S, ϕ, i − 1) + (1 − qi − pi) · DP′′(D′, S, ϕ, i − 1)

and

DP(D, S, ϕ, i) = qi · DP(D′, S, ϕ, i − 1) + p̂i · DP′(D′, S, ϕ, i − 1) + (1 − qi − p̂i) · DP′′(D′, S, ϕ, i − 1).

By simple mathematics,

|DP(D, S, ϕ, i − 1) − DP(D, S, ϕ, i)| = |pi − p̂i| · |DP′(D′, S, ϕ, i − 1) − DP′′(D′, S, ϕ, i − 1)|.

Since 0 ≤ DP′(D′, S, ϕ, i − 1) ≤ 1 and 0 ≤ DP′′(D′, S, ϕ, i − 1) ≤ 1,

|DP(D, S, ϕ, i − 1) − DP(D, S, ϕ, i)| ≤ |pi − p̂i|.

Thus, the lemma holds. ⊓⊔

Theorem 7 Given an uncertain graph database D = {G1, G2, . . . , Gn}, a subgraph S, 0 < ϕ, ε, δ < 1 and Pr(Gi ⇒ ∅) for 1 ≤ i ≤ n as input, let P̂r(S; D, ϕ) be the output of Approximate. We have

Pr(|Pr(S; D, ϕ) − P̂r(S; D, ϕ)| ≤ ε/2) ≥ 1 − δ.

Proof Note that Pr(S; D, ϕ) = DP(D, S, ϕ, 0) and P̂r(S; D, ϕ) = DP(D, S, ϕ, n). We have

|Pr(S; D, ϕ) − P̂r(S; D, ϕ)| = |DP(D, S, ϕ, 0) − DP(D, S, ϕ, n)|
  = |∑_{i=1}^{n} (DP(D, S, ϕ, i − 1) − DP(D, S, ϕ, i))|
  ≤ ∑_{i=1}^{n} |DP(D, S, ϕ, i − 1) − DP(D, S, ϕ, i)|.

By Lemma 3, with probability at least (1 − δ)^{1/n},

|DP(D, S, ϕ, i − 1) − DP(D, S, ϕ, i)| ≤ |Pr(S ⊑U Gi) − P̂r(S ⊑U Gi)| ≤ ε/(2n).

Thus, with probability at least ((1 − δ)^{1/n})^n = 1 − δ,

|Pr(S; D, ϕ) − P̂r(S; D, ϕ)| ≤ ∑_{i=1}^{n} |DP(D, S, ϕ, i − 1) − DP(D, S, ϕ, i)| ≤ n · ε/(2n) = ε/2. ⊓⊔

We now analyze the time complexity of Approximate.

Lines 1–2 take O((nm²s/(ε/(2n))²) log(2/(1 − (1 − δ)^{1/n}))) time, where s is the number of edges of S, and m is the maximum number of embeddings of S in uncertain graphs in D. Line 3 runs in Θ(n³) time by executing DP. Thus, the time complexity of Approximate is

O((4m²n³s/ε²) log(2/(1 − (1 − δ)^{1/n}))).

The mathematical formulation of the time complexity can be simplified in the following way: Let f(x) = 1 − (1 − x)^{1/n}, where 0 ≤ x ≤ 1 and n ∈ ℕ. By Taylor's expansion at x = 0,

f(x) = f(0) + f′(0)x + (f′′(ξ)/2)x²
     = 0 + x/n + (1/(2n))(1 − 1/n)(1 − ξ)^{1/n − 2}x²,

where 0 ≤ ξ ≤ x. Since the Lagrange remainder is nonnegative, we have f(x) ≥ x/n. By replacing x with δ, we get 1 − (1 − δ)^{1/n} ≥ δ/n. Thus, the formulation of the time complexity of Approximate can be simplified as

O((4m²n³s/ε²) log(2n/δ)).  (10)

5.3 Complete mining algorithm

Given an uncertain graph database D = {G1, G2, . . . , Gn} and real numbers 0 < ϕ, τ, ε, δ < 1 as input, the approximation algorithm MUSE-P for mining an (ε, δ)-approximate set Π of "frequent subgraphs" in D is developed as shown in Fig. 4.

MUSE-P first initializes the output set Π (line 1) and creates an empty stack Ψ (line 2) to keep the subgraphs visited by MUSE-P. Then, for each uncertain graph Gi in D (1 ≤ i ≤ n), MUSE-P computes Pr(Gi ⇒ ∅) by Eq. (2) and keeps it in the element P[i] of array P (lines 3–4). Specifically, Pr(Gi ⇒ ∅) = ∏_{v ∈ Vi} (1 − PV(v)), where Vi is the vertex set of Gi. The values P[1], P[2], . . . , P[n] will be used many times when calling Procedure Approximate in line 10. Next, MUSE-P calls Subroutine DistinctEdges to retrieve all distinct single-edge subgraphs from D (line 5) and pushes all these single-edge subgraphs onto the stack Ψ (line 6). Subroutine DistinctEdges takes the uncertain graph database D as input and returns all distinct single-edge subgraphs in D. The pseudocode of DistinctEdges is self-explanatory. Note that lines 3–6 can be carried out in a single scan over D.

Lines 7–13 of MUSE-P are the key steps of MUSE-P,which discover frequent subgraphs from D. First, MUSE-Ppops the subgraph S from the top of the stack Ψ (line 8) andchecks if S is now visited for the first time (line 9). Line 9can be implemented by checking if S is now encoded in theminimum DFS code [34]. Recall that minimum DFS codeswere proposed to eliminate redundant isomorphic subgraphs.Particularly, a subgraph is visited for the first time if and onlyif the subgraph is currently encoded in the minimum DFScode. Thus, if the subgraph S popped from Ψ is encoded

123

Page 14: static.tongtianta.sitestatic.tongtianta.site/paper_pdf/780d4a40-4a10-11e... · Mining frequent subgraphs over uncertain graph databases 755 probability, it is determined whether the

766 J. Li et al.

Fig. 4 Complete mining algorithm MUSE-P

in the minimum DFS code, MUSE-P approximates theϕ-frequent probability p of S using Procedure Approximate(line 10). If the condition p ≥ τ − ε/2 is satisfied (line 11),S has a probability as high as 1−δ to have ϕ-frequent proba-bility at least τ − ε, and hence S is added to the output set Π

(line 12). After that, MUSE-P enumerates the children of S in the search tree by performing right-most extension [34] on S (line 13). For a detailed description of right-most extension, please refer to [34]. Finally, MUSE-P terminates when there are no subgraphs in Ψ, and the set Π is outputted (line 14).
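The control flow of lines 1–14 can be sketched as a stack-driven depth-first search. This is a schematic reconstruction, not the authors' C implementation: `is_min_dfs_code`, `approximate_freq_prob`, and `rightmost_extensions` are hypothetical stand-ins for the minimum-DFS-code test, Procedure Approximate, and right-most extension [34].

```python
def muse_p_skeleton(single_edge_subgraphs, is_min_dfs_code,
                    approximate_freq_prob, rightmost_extensions, tau, eps):
    """Skeleton of the MUSE-P search loop (lines 7-13 of Fig. 4):
    pop a subgraph, skip it unless it is in minimum DFS code, keep it
    if its estimated phi-frequent probability passes the relaxed
    threshold tau - eps/2, then push its right-most extensions."""
    output = []                          # the result set Pi (line 1)
    stack = list(single_edge_subgraphs)  # Psi, seeded with single edges (line 6)
    while stack:                         # line 7
        s = stack.pop()                  # line 8
        if not is_min_dfs_code(s):       # line 9: prune duplicate encodings
            continue
        p_hat = approximate_freq_prob(s) # line 10: Procedure Approximate
        if p_hat >= tau - eps / 2:       # line 11
            output.append(s)             # line 12
            stack.extend(rightmost_extensions(s))  # line 13
    return output                        # line 14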

5.3.1 Time complexity of MUSE-P

We now analyze the time complexity of MUSE-P. Since the number of frequent subgraphs is generally exponential in the size of the input, the worst-case time complexity of MUSE-P is inevitably exponential with respect to the input size. In the analysis, we take the output size explicitly into account, particularly the number of outputted subgraphs and the maximum number of edges in the outputted subgraphs. For convenience, let V be the total number of vertices in D, E be the total number of edges in D, and Δ be the maximum diameter of uncertain graphs in D.

Lines 1–2 of MUSE-P take O(1) time. For each uncertain graph Gi in D (line 3), MUSE-P computes Pr(Gi ⇒ ∅) = ∏_{v∈Vi}(1 − PV(v)) in O(|Vi|) time (line 4), where Vi is the vertex set of Gi. Hence, lines 3–4 can be done in O(∑_{i=1}^{n} |Vi|) = O(V) time. Suppose the Lookup and Insert operations on hash sets can be done in O(1) time. The running time of Subroutine DistinctEdges is O(∑_{i=1}^{n} |Ei|) = O(E). Thus, lines 5–6 can be finished in O(E) time. Because lines 7–13 are carried out until the stack Ψ becomes empty, the running time of lines 7–13 is the sum of the running time spent on every subgraph popped from Ψ.

The subgraphs visited by MUSE-P, i.e., the subgraphs that have been kept in Ψ, can be partitioned into two disjoint sets. The first set Q1 consists of all the subgraphs outputted by MUSE-P. The second set Q2 consists of all the subgraphs visited by MUSE-P but not outputted by MUSE-P.

For each subgraph S in the first set Q1, MUSE-P first pops it from the stack Ψ (line 8), which takes O(1) time. Then, MUSE-P checks if S is visited for the first time (line 9). Line 9 is implemented by testing if S is encoded in the minimum DFS code. Since S could have at most s! distinct DFS codes, i.e., s! permutations of the edges in S, and only one of them is the minimum DFS code, where s is the number of edges in S, the running time of line 9 is O(s!) in the worst case. If S is encoded in the minimum DFS code, MUSE-P approximates the ϕ-frequent probability of S using Approximate (line 10), which takes O((4m²n³s/ε²) log(2/(1 − (1 − δ)^{1/n}))) time, where m is the maximum number of embeddings of S in the uncertain graphs in D, and s is the number of edges in S. Since S is outputted by MUSE-P, lines 11–13 must be carried out. In particular, lines 11 and 12 are completed in O(1) time. Line 13 performs right-most extension on S to get all children of S in the search tree. The details of right-most extension can be found in [34]. The running time of line 13 is O(nmsΔ). Therefore, the running time spent on S is

O(s! + (4m²n³s/ε²) log(2/(1 − (1 − δ)^{1/n})) + nmsΔ).

For each subgraph S in the second set Q2, MUSE-P does not carry out lines 11–13, thus the running time spent on S is

O(s! + (4m²n³s/ε²) log(2/(1 − (1 − δ)^{1/n}))).

The first set Q1 consists of all the subgraphs outputted by MUSE-P, thus |Q1| = |Π|. For each subgraph S in Π, there are O(s|Σ|²) possible children in the search tree according to right-most extension, where Σ is the set of labels of the vertices and edges. Therefore, the number of subgraphs visited by MUSE-P but not outputted by MUSE-P is O(s|Σ|²|Π|) − |Π|. In general, s and |Σ| can be viewed


as constants with respect to |Π|, hence |Q2| = O(|Π|). Let L be the maximum number of edges in the outputted subgraphs and M be the maximum number of embeddings of the visited subgraphs in the uncertain graphs in D. We have that the time complexity of MUSE-P is

O(V + E + |Π|(L! + (4M²n³L/ε²) log(2/(1 − (1 − δ)^{1/n})) + nMLΔ)).

Furthermore, since |Π| is generally exponential with respect to V + E, and Δ can be regarded as a constant due to the small-world property [21] of large graphs, the time complexity of MUSE-P can be simplified as

O(|Π|(L! + (4M²n³L/ε²) log(2/(1 − (1 − δ)^{1/n})))).

5.3.2 Guarantees on approximation accuracy of MUSE-P

Since MUSE-P is a randomized approximate mining algorithm, we now analyze its approximation accuracy. Let Π be the set of subgraphs outputted by MUSE-P. We have the following theorems:

Theorem 8 For any frequent subgraph S in D,

Pr(S ∈ Π) ≥ ((1 − δ)/2)^s,

where s is the number of edges of S.

Proof Let S0, S1, . . . , Ss be the subgraphs on the path from the root to S in the search tree, where S0 = ∅, Ss = S, and Si−1 is the parent of Si for 1 ≤ i ≤ s. Due to Lemma 1, all of S0, S1, . . . , Ss−1 are also frequent. Since the subgraphs are visited in depth-first order, we have S ∈ Π if and only if S0, S1, . . . , Ss ∈ Π. Mathematically,

Pr(S ∈ Π) = Pr(S0 ∈ Π) ∏_{i=1}^{s} Pr(Si ∈ Π | S0, S1, . . . , Si−1 ∈ Π).

Particularly, Pr(S0 ∈ Π) = 1, and for 1 ≤ i ≤ s,

Pr(Si ∈ Π | S0, S1, . . . , Si−1 ∈ Π) = Pr(P̂r(Si) ≥ τ − ε/2) = Pr(Pr(Si) − P̂r(Si) ≤ Pr(Si) − τ + ε/2),

where P̂r(Si) is the approximation of Pr(Si) obtained in line 3 of Procedure Approximate. Due to the fact that Pr(Si) ≥ τ and to Theorem 7,

Pr(Si ∈ Π | S0, S1, . . . , Si−1 ∈ Π) ≥ Pr(Pr(Si) − P̂r(Si) ≤ ε/2) ≥ (1 − δ)/2.

Subsequently,

Pr(S ∈ Π) ≥ ((1 − δ)/2)^s.

The theorem thus follows. □

Theorem 9 For any infrequent subgraph S in D with Pr(S; D, ϕ) < τ − ε, we have

Pr(S ∈ Π) ≤ δ/2.

Proof Let S0, S1, . . . , Ss be the subgraphs on the path from the root to S in the search tree, where S0 = ∅, Ss = S, and Si−1 is the parent of Si for 1 ≤ i ≤ s. Since the subgraphs are visited in depth-first order, we have S ∈ Π if and only if S0, S1, . . . , Ss ∈ Π. Mathematically,

Pr(S ∈ Π) = Pr(S0, S1, . . . , Ss−1 ∈ Π) · Pr(S ∈ Π | S0, S1, . . . , Ss−1 ∈ Π) ≤ Pr(S ∈ Π | S0, S1, . . . , Ss−1 ∈ Π).

From MUSE-P, we have

Pr(S ∈ Π | S0, S1, . . . , Ss−1 ∈ Π) = Pr(P̂r(S) ≥ τ − ε/2) = Pr(P̂r(S) − Pr(S) ≥ τ − ε/2 − Pr(S)),

where P̂r(S) is the approximation of Pr(S) obtained in line 3 of Approximate. Due to the fact that Pr(S) < τ − ε and to Theorem 7,

Pr(P̂r(S) − Pr(S) ≥ τ − ε/2 − Pr(S)) ≤ Pr(P̂r(S) − Pr(S) ≥ ε/2) ≤ δ/2.

The theorem thus holds. □

6 Setting parameter δ

This section discusses how to set parameter δ of MUSE-P to guarantee the approximation quality of the mining results.

Let ℓmax denote the maximum number of edges of frequent subgraphs, which can either be estimated by sampling approaches [11] or be specified as a constraint on the size of discovered subgraphs [40]. From Theorem 8, we can easily prove the following corollary:

Corollary 1 To guarantee that any frequent subgraph is outputted with probability at least 1 − Δ, where 0 < Δ < 1, the parameter δ should be at most 1 − 2(1 − Δ)^{1/ℓmax}.
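For concreteness, the bound of Corollary 1 just rearranges ((1 − δ)/2)^ℓmax ≥ 1 − Δ for δ; a hypothetical helper (names ours):

```python
def max_delta(big_delta, l_max):
    """Largest delta for which Theorem 8 guarantees that every frequent
    subgraph with at most l_max edges is outputted with probability at
    least 1 - big_delta: delta <= 1 - 2 * (1 - big_delta)**(1/l_max)."""
    return 1.0 - 2.0 * (1.0 - big_delta) ** (1.0 / l_max)
```

For example, `max_delta(0.999, 5)` evaluates to roughly 0.498.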


7 Discussion

An interesting observation on MUSE-P is that when there is one vertex with existence probability of 1 in every uncertain graph in D, the time complexity of MUSE-P can be significantly decreased by a factor of n, where n = |D|.

As a matter of fact, when every uncertain graph Gi in D has a vertex with existence probability of 1, we have Pr(Gi ⇒ ∅) = 0 by Eq. (2), i.e., every implicated graph of Gi is nontrivial. Furthermore, by Definition 6, every implicated graph database of D has cardinality of n.

Recall that Procedure DP in Sect. 5.2.1 considers the general case that D has an implicated graph database of cardinality less than n, and thus has time complexity of Θ(n³). In the special case considered above, DP can be simplified to run in Θ(n²) time. Since every implicated graph is nontrivial, the array T used in DP can be simplified from 3D to 2D, and the element T[i, j] now keeps the probability that, across all implicated graph databases D′ of {G1, G2, . . . , Gi+j}, S is subgraph isomorphic to i certain graphs in D′ and not subgraph isomorphic to j certain graphs in D′. Moreover, the recursive equation used in the dynamic programming is also simplified as

T[i, j] =
  1                                                              if i = j = 0,
  0                                                              if i + j > n,
  T[i − 1, j] · Pr(S ⊑ Gi+j)                                     if i > j = 0,
  T[i, j − 1] · (1 − Pr(S ⊑ Gi+j))                               if j > i = 0,
  T[i − 1, j] · Pr(S ⊑ Gi+j) + T[i, j − 1] · (1 − Pr(S ⊑ Gi+j))  if i, j > 0.

Hence, the output can be computed by

∑_{i=⌈nϕ⌉}^{n} T[i, n − i].

Furthermore, the input Pr(Gi ⇒ ∅) of DP, for 1 ≤ i ≤ n, is no longer required. The simplified procedure is described as Procedure DP2 in Fig. 5. It is easy to show that the time complexity of DP2 is Θ(n²).
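Assuming independence across the uncertain graphs, the recurrence above amounts to computing the probability that S occurs in at least ⌈nϕ⌉ of n independent Bernoulli trials. A sketch of DP2 in Python (names ours; `p[k]` plays the role of Pr(S ⊑ Gk+1)):

```python
from math import ceil

def dp2(p, phi):
    """Theta(n^2) simplification of Procedure DP: p[k] is the probability
    that subgraph S occurs in the (k+1)-th uncertain graph; returns the
    probability that S occurs in at least ceil(n * phi) of the n graphs."""
    n = len(p)
    # T[i][j]: probability that among the first i+j graphs,
    # S occurs in exactly i of them and is absent from exactly j.
    T = [[0.0] * (n + 1) for _ in range(n + 1)]
    T[0][0] = 1.0
    for i in range(n + 1):
        for j in range(n + 1):
            if i + j == 0 or i + j > n:
                continue
            q = p[i + j - 1]          # graph number i+j (1-based)
            if i > 0:
                T[i][j] += T[i - 1][j] * q
            if j > 0:
                T[i][j] += T[i][j - 1] * (1.0 - q)
    return sum(T[i][n - i] for i in range(ceil(n * phi), n + 1))
```

For two graphs each containing S with probability 0.5 and ϕ = 0.5, the result is 1 − 0.25 = 0.75, matching the direct binomial computation.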

When there is one vertex with existence probability of 1 in every uncertain graph in D, we replace every call to DP with a call to DP2, omit the input parameters Pr(Gi ⇒ ∅) of Procedure Approximate, and skip lines 3–4 of MUSE-P. The running time of Approximate is thus decreased by a factor of n to O((4m²n²s/ε²) log(2/(1 − (1 − δ)^{1/n}))). By the analysis in

Sect. 5.3.1, we have that the time complexity of the newly obtained MUSE-P is decreased to

O(|Π|(L! + (4M²n²L/ε²) log(2/(1 − (1 − δ)^{1/n})))).

Fig. 5 Procedure DP2

8 Experiments

Extensive experiments have been performed to evaluate the performance of MUSE-P and the quality of the subgraphs discovered by MUSE-P. The experimental results are shown and analyzed in this section.

8.1 Experimental setting

MUSE-P was implemented in C based on the software package developed in our previous work [44]. For comparison purposes, the algorithm [44] that discovers frequent subgraphs on uncertain graph databases under the expected semantics was taken as the baseline. In the rest of this paper, this algorithm is called MUSE-E, standing for "Mining Uncertain Subgraphs undEr the Expected semantics". MUSE-E takes as input a minimum expected support threshold 0 < minesup < 1 and two real numbers 0 < ε, δ < 1. For more details about MUSE-E, please refer to [44].

The performance of MUSE-P and the quality of the subgraphs outputted by MUSE-P were experimentally evaluated. In particular, we examined:

– the comparison between the set of subgraphs discovered by MUSE-P and the set of subgraphs discovered by MUSE-E;

– the quality, particularly the precision and the recall, of the subgraphs discovered by MUSE-P;

– the execution time of MUSE-P with respect to the input parameters ϕ, τ, ε, and δ, respectively;

– the execution time of MUSE-P with respect to the total number of vertices in uncertain graphs, the total number of edges in uncertain graphs, and the number of uncertain graphs, respectively;


Table 2 Summary of uncertain graph database PPINet

Organism |V | |E | PE

S. pombe (fission yeast) 162 300 0.148

D. melanogaster (fruit fly) 3,751 7,384 0.456

M. musculus (house mouse) 199 286 0.413

R. norvegicus (rat) 130 178 0.374

A. thaliana (thale cress) 513 1,168 0.444

C. elegans (worm) 514 960 0.190

– the impact of the variations in uncertainty on the execution time of MUSE-P and on the number of subgraphs outputted by MUSE-P.

All of the experiments were carried out on two uncertain graph databases. The first one was obtained by integrating the BioGRID database1 with the STRING database.2 It contains the protein–protein interaction (PPI) networks of six organisms taken from the BioGRID database. Each PPI network is represented as an uncertain graph where vertices represent proteins, edges represent PPIs, the labels of vertices are proteins' COG functional annotations3 provided by the STRING database, the existence probabilities of all vertices are 1, and the conditional existence probabilities of edges are provided by the STRING database. The properties of this uncertain graph database are summarized in Table 2, where |V| is the number of vertices, |E| is the number of edges, and PE is the average conditional existence probability of edges. For convenience, we call this uncertain graph database PPINet.

The second uncertain graph database was provided by Yuan et al. [38] and was generated based on a certain graph database retrieved from the DTP Drug Information System.4 Each graph in the original exact graph database represents a molecular structure, where vertices are atoms (hydrogen atoms were omitted), edges are bonds, the label of a vertex is an atom type, and the label of an edge is a bond type. To make it an uncertain graph database, the existence probabilities of all vertices were set to 1, and the conditional existence probabilities of edges were synthesized following the normal distribution N(0.3, 0.1). The uncertain graphs in the database consist of 24.3 vertices and 26.5 edges on average. For ease of presentation, we call this uncertain graph database DTP.

All of the experiments were performed on an IBM X3950 Linux server with a 2.4 GHz Intel Xeon E7440 CPU and 8 GB of RAM.

1 http://thebiogrid.org/.
2 http://string-db.org/.
3 http://www.ncbi.nih.gov/COG/.
4 http://dtp.nci.nih.gov/.

8.2 Comparison between the set of subgraphs discovered by MUSE-P and the set of subgraphs discovered by MUSE-E

MUSE-P and MUSE-E were developed based on different data mining semantics. In particular, MUSE-P uses ϕ-frequent probability to measure the frequentness of subgraphs, whereas MUSE-E uses expected support instead. Therefore, it is crucial to investigate the difference between the set of subgraphs discovered by MUSE-P and the set of subgraphs discovered by MUSE-E. To the best of our knowledge, this experiment has not been performed in previous work.

The experiment was designed as follows: For specified ϕ, τ, and minesup, carry out the following steps:

1. Run MUSE-P on the uncertain graph database PPINet with input ϕ, τ, ε = 0.1, and δ = 0.1, and obtain a set of subgraphs ΠP;

2. Run MUSE-E on PPINet with input minesup, ε = 0.1, and δ = 0.1, and obtain a set of subgraphs ΠE;

3. Taking ΠP as reference, compute the precision of ΠE with respect to ΠP, or simply the precision of ΠE, as

precisionΠP(ΠE) = |ΠP ∩ ΠE| / |ΠE|,

and compute the recall of ΠE with respect to ΠP, or simply the recall of ΠE, as

recallΠP(ΠE) = |ΠP ∩ ΠE| / |ΠP|.
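Over sets of canonical subgraph codes, the two measures of step 3 are straightforward to compute; a sketch (names ours):

```python
def precision_recall(reference, result):
    """Precision and recall of `result` with respect to `reference`,
    as in step 3: |ref & res| / |res| and |ref & res| / |ref|."""
    hits = len(set(reference) & set(result))
    return hits / len(result), hits / len(reference)

# Pi_P = {a, b, c}, Pi_E = {b, c, d, e}: 2 of 4 results are correct
# (precision 0.5) and 2 of 3 reference subgraphs are found (recall ~0.667).
prec, rec = precision_recall({"a", "b", "c"}, {"b", "c", "d", "e"})
```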

Figure 6a and b plots the precision and the recall of ΠE for ϕ = 0.2, τ ∈ {0.9, 0.95, 0.99}, and minesup varying from 0.25 to 1. Note that Fig. 6a and b illustrates the same experimental results from the precision and the recall perspectives, respectively.

Figure 6a shows that for τ = 0.99, to make the recall of ΠE greater than 0.95, the precision of ΠE has to be less than 0.69. This is because, to obtain more than 95 percent of the subgraphs in ΠP, even more subgraphs must be discovered by MUSE-E by decreasing minesup, so more than 31 percent of the subgraphs returned by MUSE-E do not belong to ΠP. Furthermore, to make the recall of ΠE greater than 0.95, the precision of ΠE must be less than 0.79 for τ = 0.95, as highlighted by the bolder and larger circle in Fig. 6a, and less than 0.89 for τ = 0.9, as shown by the bolder and larger cross in Fig. 6a.

As shown in Fig. 6b, for τ = 0.99, to make the precision of ΠE greater than 0.95, the recall of ΠE must be less than 0.59. The reason is that, to make more than 95 percent of the subgraphs in ΠE occur in ΠP as well, fewer subgraphs need to be discovered by MUSE-E by increasing minesup, so more than 41 percent of the subgraphs in ΠP are missed by


Fig. 6 Comparison between the set of subgraphs discovered by MUSE-P and the set of subgraphs discovered by MUSE-E. a Illustration from precision perspective. b Illustration from recall perspective


Fig. 7 Effect of ε on precision and recall of subgraphs outputted by MUSE-P. a Precision versus ε. b Recall versus ε


MUSE-E. Furthermore, to make the precision of ΠE greater than 0.95, the recall of ΠE must be less than 0.60 for τ = 0.95, as highlighted by the bolder and larger circle in Fig. 6b, and less than 0.63 for τ = 0.9, as shown by the bolder and larger cross in Fig. 6b.

The experimental results in Fig. 6 verify that it is hard to make the precision and the recall of ΠE both very close to 1. In other words, MUSE-P discovers a set of subgraphs that MUSE-E can hardly approximate. Therefore, it is practically significant to investigate different semantics for mining frequent subgraphs on uncertain graph databases so that the requirements of various applications can be satisfied.

8.3 Quality of subgraphs discovered by MUSE-P

Since MUSE-P adopts an approximate mining approach and uses a randomized algorithm with failure probability δ to identify frequent subgraphs, the set of subgraphs outputted by MUSE-P may contain a fraction of infrequent subgraphs with ϕ-frequent probability less than τ. Thus, it is important to evaluate the approximation quality of the subgraphs returned by MUSE-P. We use precision and recall as quality measures. Specifically, let Π be the set of subgraphs discovered by MUSE-P and Π∗ be the set of true frequent subgraphs. The precision of Π is defined as

precision(Π) = |Π ∩ Π∗| / |Π|, (11)

and the recall of Π as

recall(Π) = |Π ∩ Π∗| / |Π∗|. (12)

However, it is generally computationally prohibitive to obtain Π∗. A practical and sound alternative is thus to take as Π∗ the set of subgraphs discovered by MUSE-P at very small ε and δ, say ε = 0.02 and δ = 0.001 in this experiment.

First, we evaluated the precision and the recall of subgraphs discovered by MUSE-P with respect to ε. The experiment was designed as follows: For ϕ = 0.2, τ = 0.9, and δ = 0.05, run MUSE-P on PPINet with ε varying from 0.025 to 0.2. For each tested ε, MUSE-P was executed 10 times. Let precε,i and recallε,i denote the precision and the recall of the subgraph set discovered in the i-th run of MUSE-P at ε, respectively. Figure 7 plots precε,i and recallε,i for ε varying from 0.025 to 0.2 and i = 1, 2, . . . , 10. As can be seen from Fig. 7a, the precision of the discovered subgraphs is at least 96% for ε = 0.025 and decreases as ε gets larger. As shown in Fig. 7b, the recall of the discovered subgraphs is at least 90% and almost independent of ε.


Table 3 t test statistics for 0.025 ≤ ε ≠ ε′ ≤ 0.2, where the numbers in brackets are the degrees of freedom

ε\ε′ 0.025 0.05 0.075 0.1 0.125 0.15 0.175 0.2

0.025 − 0.40 (13.8) 1.33 (16.2) 1.87 (16.2) 2.15 (17.5) 2.84 (13.6) 2.64 (12.6) 2.83 (15.0)

0.050 −0.40 (13.8) − 1.32 (16.8) 2.07 (16.8) 2.36 (15.2) 3.67 (18.0) 3.48 (17.6) 3.52 (17.7)

0.075 −1.33 (16.2) −1.32 (16.8) − 0.66 (18.0) 1.08 (17.4) 1.79 (16.6) 1.51 (15.4) 1.81 (17.7)

0.100 −1.87 (16.2) −2.07 (16.8) −0.66 (18.0) − 0.47 (17.5) 1.03 (16.5) 0.72 (15.4) 1.10 (17.7)

0.125 −2.15 (17.5) −2.36 (15.2) −1.08 (17.4) −0.47 (17.5) − 0.38 (15.0) 0.09 (13.8) 0.49 (16.4)

0.150 −2.84 (13.6) −3.67 (18.0) −1.79 (16.6) −1.03 (16.5) −0.38 (15.0) − −0.43 (17.7) 0.16 (17.5)

0.175 −2.64 (12.6) −3.48 (17.6) −1.51 (15.4) −0.72 (15.4) −0.09 (13.8) 0.43 (17.7) − 0.55 (16.6)

0.200 −2.83 (15.0) −3.52 (17.7) −1.81 (17.7) −1.10 (17.7) −0.49 (16.4) −0.16 (17.5) −0.55 (16.6) −

Fig. 8 Effect of δ on precision and recall of subgraphs outputted by MUSE-P. a Precision versus δ. b Recall versus δ


Moreover, the effect of ε on the recall of subgraphs discovered by MUSE-P was investigated using hypothesis testing. Let Xε denote the random variable representing the recall of subgraphs discovered by MUSE-P with respect to a certain ε. The set {recallε,i | 1 ≤ i ≤ 10} is thus a uniform sample from the domain of Xε. For any ε ≠ ε′, let the null hypothesis be

H0 : E[Xε] = E[Xε′]

and the alternative hypothesis be

H1 : E[Xε] ≠ E[Xε′].

For 0.025 ≤ ε ≠ ε′ ≤ 0.2, a two-sample t test with the assumption of unequal population variances was carried out, and the obtained test statistics are presented in Table 3, where the numbers in brackets are the degrees of freedom. At a significance level of 0.001, the acceptance interval for a two-tailed t test with 12 df is [−4.32, 4.32], and [−3.92, 3.92] for a two-tailed t test with 18 df. Therefore, all of the null hypotheses can be accepted at a significance level of 0.001. In other words, the recall of subgraphs discovered by MUSE-P has the same expected value for all ε.
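The statistics in Table 3 come from an unequal-variance (Welch) two-sample t test; assuming that test, the statistic and the Welch–Satterthwaite degrees of freedom can be reproduced from the two samples of 10 recall values with a stdlib-only sketch:

```python
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Two-sample t statistic and Welch-Satterthwaite degrees of
    freedom under the unequal-population-variance assumption."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

H0 is rejected at significance level α only if |t| falls outside the two-tailed acceptance interval for df degrees of freedom.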

Then, we examined the effect of δ on the precision and the recall of subgraphs discovered by MUSE-P. The experiment was designed very similarly to the previous one: For ϕ = 0.2, τ = 0.9, and ε = 0.05, run MUSE-P on PPINet with δ varying from 0.05 to 0.5. For each tested δ, MUSE-P was executed 10 times. Let precδ,i and recallδ,i denote the precision and the recall of the subgraph set discovered in the i-th run of MUSE-P at δ, respectively. Figure 8 plots precδ,i and recallδ,i for δ varying from 0.05 to 0.5 and i = 1, 2, . . . , 10. As shown in Fig. 8, both the precision and the recall of the discovered subgraphs are at least 90% for all tested values of δ and almost independent of δ.

In addition, two-sample t tests were carried out to examine the relationship between the precision of subgraphs discovered by MUSE-P and δ. Let Yδ denote the random variable representing the precision of subgraphs with respect to a certain δ. For any δ ≠ δ′, let the null hypothesis be

H0 : E[Yδ] = E[Yδ′]

and the alternative hypothesis be

H1 : E[Yδ] ≠ E[Yδ′].

For 0.05 ≤ δ ≠ δ′ ≤ 0.5, a two-sample t test with the assumption of unequal population variances was carried out, and the computed test statistics are shown in Table 4. At a significance level of 0.001, the acceptance interval for a two-tailed t test with 11 df is [−4.44, 4.44], and [−3.92, 3.92] for a two-tailed t test with 18 df. Thus, all of the null hypotheses can be accepted at a significance level of 0.001, that is,


Table 4 t test statistics for 0.05 ≤ δ ≠ δ′ ≤ 0.5, where the numbers in brackets are the degrees of freedom

δ\δ′ 0.05 0.1 0.15 0.2 0.25

0.05 − −0.72 (17.8) −1.16 (16.3) −0.85 (17.1) −0.98 (17.5)

0.10 0.72 (17.8) − −0.40 (17.2) −0.10 (17.7) −0.26 (17.9)

0.15 1.16 (16.3) 0.40 (17.2) − 0.32 (17.8) 0.13 (17.6)

0.20 0.85 (17.1) 0.10 (17.7) −0.32 (17.8) − −0.17 (17.9)

0.25 0.98 (17.5) 0.26 (17.9) −0.13 (17.6) 0.17 (17.9) −
0.30 1.19 (15.6) 0.41 (16.6) 0.00 (17.9) 0.33 (17.5) 0.13 (17.1)

0.35 0.35 (17.6) −0.29 (16.9) −0.65 (15.1) −0.39 (15.9) −0.52 (16.5)

0.40 0.91 (17.6) 0.18 (18.0) −0.20 (17.5) 0.10 (17.9) −0.07 (18.0)

0.45 1.55 (12.0) 0.72 (12.8) 0.31 (14.5) 0.67 (13.7) 0.42 (13.2)

0.50 0.86 (13.3) 0.00 (14.2) −0.52 (16.1) −0.12 (15.3) −0.32 (14.7)

δ\δ′ 0.3 0.35 0.4 0.45 0.5

0.05 −1.19 (15.6) −0.35 (17.6) −0.91 (17.6) −1.55 (12.0) −0.86 (13.3)

0.10 −0.41 (16.6) 0.29 (16.9) −0.18 (18.0) −0.72 (12.8) 0.00 (14.2)

0.15 0.00 (17.9) 0.65 (15.1) 0.20 (17.5) −0.31 (14.5) 0.52 (16.1)

0.20 −0.33 (17.5) 0.39 (15.9) −0.10 (17.9) −0.67 (13.7) 0.12 (15.3)

0.25 −0.13 (17.1) 0.52 (16.5) 0.07 (18.0) −0.42 (13.2) 0.32 (14.7)

0.30 − 0.66 (14.3) 0.21 (17.0) −0.33 (15.2) 0.55 (16.8)

0.35 −0.66 (14.3) − −0.46 (16.6) −0.93 (11.3) −0.34 (12.3)

0.40 −0.21 (17.0) 0.46 (16.6) − −0.51 (13.1) 0.23 (14.6)

0.45 0.33 (15.2) 0.93 (11.3) 0.51 (13.1) − 1.09 (17.4)

0.50 −0.55 (16.8) 0.34 (12.3) −0.23 (14.6) −1.09 (17.4) −

the precision of subgraphs discovered by MUSE-P has the same expected value for all δ.

Finally, we analyzed the recall of discovered subgraphs with respect to δ. Theorem 8 shows that a frequent subgraph is outputted with probability at least ((1 − δ)/2)^s, where s is the number of edges of the frequent subgraph. Thus, a smaller δ leads to more frequent subgraphs occurring in the output, thereby increasing the recall. However, Fig. 8b and the related t tests show that the recall of discovered subgraphs has the same expected value for all 0.05 ≤ δ ≤ 0.5. The reason is that the practical failure probability of the Approximate procedure is much lower than the theoretical failure probability proved in Theorem 7, so even a not very small δ leads to significantly high recall.

The following conclusions can be derived from the experimental results and the analysis presented above: (1) the approximation quality of subgraphs discovered by MUSE-P can be made very high by assigning small enough values to ε and δ; (2) the precision of discovered subgraphs is mainly affected by ε, and the recall of discovered subgraphs is determined by δ but is very insensitive to δ in practice.

8.4 Execution time of MUSE-P

The execution time of MUSE-P was tested with respect to the input parameters ϕ, τ, ε, and δ of MUSE-P, respectively. In this experiment, MUSE-P was run on PPINet. Unless otherwise stated, the input parameters of MUSE-P were set to ϕ = 0.2, τ = 0.9, ε = 0.05, and δ = 0.05.

First, we ran MUSE-P with ϕ varying from 0.2 to 1. The execution time of MUSE-P is shown in Fig. 9a. It can be seen that the execution time of MUSE-P significantly decreases as ϕ gets larger. This is because the number of frequent subgraphs in PPINet decreases very fast as ϕ becomes larger, thus leading to the decrease in the number of subgraphs visited by MUSE-P, as shown in Fig. 9b. Consequently, the execution time of MUSE-P decreases with the increase of ϕ.

Then, MUSE-P was executed with τ varying from 0.6 to 1. Figure 10a shows the execution time of MUSE-P. It can be observed that the execution time of MUSE-P decreases fast as τ increases. The reason is that as τ becomes larger, fewer subgraphs in PPINet have ϕ-frequent probability larger than or equal to τ. Thus, fewer subgraphs in PPINet have to be visited by MUSE-P, as shown in Fig. 10b. Therefore, the execution time of MUSE-P decreases as τ gets larger.

After that, we examined the effect of ε on the execution time of MUSE-P. Figure 11a shows the execution time of MUSE-P while ε varies from 0.025 to 0.2. We can see that the execution time of MUSE-P substantially decreases as ε increases. This is due to the following facts: The execution


Fig. 9 Effect of ϕ on execution time of MUSE-P. a Execution time of MUSE-P versus ϕ. b Number of subgraphs outputted by MUSE-P versus ϕ


Fig. 10 Effect of τ on execution time of MUSE-P. a Execution time of MUSE-P versus τ. b Number of subgraphs outputted by MUSE-P versus τ


Fig. 11 Effect of ε on execution time of MUSE-P. a Execution time of MUSE-P versus ε. b Number of subgraphs outputted by MUSE-P versus ε


time of MUSE-P is dominated by the time for computing the ϕ-frequent probabilities of all subgraphs visited by MUSE-P. Although the number of visited subgraphs increases as ε gets larger, as shown in Fig. 11b, the time for computing the ϕ-frequent probability of each visited subgraph is inversely proportional to ε², as analyzed in Sect. 5.2.3, thus decreasing the overall time for computing the ϕ-frequent probabilities of all subgraphs visited by MUSE-P. As a result, the execution time of MUSE-P decreases with the increase in ε.

Unlike the experimental results obtained in the previous experiments with respect to ϕ, τ, and ε, we could not figure out the relationship between the execution time of MUSE-P and δ after only a single run of MUSE-P for each tested value of δ. For this reason, we measured the execution time of MUSE-P 30 times for each tested value of δ. Figure 12a illustrates the box plot of the execution time of MUSE-P while δ varies from 0.05 to 0.5. We can see that the execution time of MUSE-P statistically decreases as δ becomes larger. This is because the number of subgraphs visited by MUSE-P does not vary significantly, as shown in Fig. 12b; however, the time for computing the ϕ-frequent probability of each visited subgraph is proportional to log(2n/δ), as analyzed in Sect. 5.2.3. Consequently, the execution time of MUSE-P decreases as δ becomes larger.


Fig. 12 Effect of δ on execution time of MUSE-P. a Execution time of MUSE-P versus δ. b Number of subgraphs outputted by MUSE-P versus δ


Fig. 13 a Execution time of MUSE-P versus total number of vertices in uncertain graphs. b Execution time of MUSE-P versus total number of edges in uncertain graphs


Moreover, the execution time of MUSE-P was evaluated with respect to the total number of vertices in uncertain graphs and to the total number of edges in uncertain graphs, respectively.

To examine the execution time of MUSE-P with respect to the total number of vertices in uncertain graphs, uncertain graph databases consisting of an increasing number of vertices were first created based on PPINet in the following way: For p = 10, 20, . . . , 100, an uncertain graph database is constructed that contains a subgraph of each original uncertain graph in PPINet induced by p percent of its vertices selected uniformly at random. Then, for each value of p, 50 uncertain graph databases were independently and randomly created, and MUSE-P was executed on them with ϕ = 0.2, τ = 0.9, ε = 0.1, and δ = 0.1. Figure 13a plots the maximum, the minimum, and the median of the execution time in the 50 runs while p varies from 10 to 100. It can be seen that the execution time of MUSE-P increases rapidly as the number of vertices of uncertain graphs increases. This is because as the total number of vertices increases, the number of frequent subgraphs and the number of embeddings of subgraphs both increase rapidly.
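The database construction used here amounts to drawing a uniform p-percent vertex sample of each graph and keeping only the edges with both endpoints in the sample; a sketch (the graph representation is ours, not the paper's):

```python
import random

def vertex_induced_sample(vertices, edges, p, rng=random):
    """Subgraph induced by a uniform sample of p percent of the vertices:
    keep the sampled vertices and every edge whose endpoints both survive."""
    k = round(len(list(vertices)) * p / 100)
    kept = set(rng.sample(list(vertices), k))
    return kept, [(u, v) for (u, v) in edges if u in kept and v in kept]
```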

The experiment on the execution time of MUSE-P with respect to the number of edges in uncertain graphs was designed in the same way as the previous one, except that the uncertain graphs in the constructed uncertain graph databases were induced by edges instead of vertices. Figure 13b plots the maximum, the minimum, and the median of the execution time over the 50 runs as p varies from 10 to 100. It can be observed that the execution time of MUSE-P increases significantly with the number of edges in the uncertain graphs.

8.5 Impact of variation in uncertainties on the performance of MUSE-P

Besides the size and the structural complexity of uncertain graph databases, the distribution of uncertainties in an uncertain graph database also has an important impact on the performance of MUSE-P, as verified in this experiment. To change the distribution of uncertainties, the following piecewise linear function

f(x) = min{1, (1 + c)x}   (13)

is imposed on the existence probabilities of the vertices and the conditional existence probabilities of the edges, where c ≥ −1. We executed MUSE-P with ϕ = 0.2, τ = 0.9, ε = 0.1, and δ = 0.1 on the uncertain graph databases obtained by varying the uncertainties in PPINet according to Eq. (13).
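As a concrete illustration of Eq. (13), the rescaling can be applied pointwise to each probability. This small Python sketch (the function name is ours) shows how c > 0 inflates probabilities up to the cap of 1 while c < 0 deflates them.

```python
def scale_prob(x, c):
    """Rescale a probability x by the piecewise linear map of Eq. (13):
    f(x) = min{1, (1 + c) * x}, with c >= -1 so the result stays in [0, 1]."""
    assert c >= -1.0 and 0.0 <= x <= 1.0
    return min(1.0, (1.0 + c) * x)

print(scale_prob(0.8, 0.3))             # 1.0  (0.8 * 1.3 = 1.04, capped at 1)
print(round(scale_prob(0.8, -0.3), 2))  # 0.56 (0.8 * 0.7)
```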

Figure 14a shows the number of subgraphs discovered by MUSE-P as the coefficient c in Eq. (13) varies from −0.3


Fig. 14 Impact of variation in uncertainties on performance of MUSE-P. a Number of subgraphs discovered by MUSE-P versus coefficient c in Eq. (13). b Execution time of MUSE-P versus coefficient c in Eq. (13) (maximum, median, and minimum shown)

Fig. 15 Execution time of MUSE-P versus number of uncertain graphs

to 0.3. As can be seen, the number of subgraphs discovered by MUSE-P grows as c increases. This is because, as c increases, both the existence probabilities of the vertices and the conditional existence probabilities of the edges become larger, so the ϕ-frequent probabilities of all subgraphs increase, resulting in more subgraphs being outputted by MUSE-P. Furthermore, the execution time of MUSE-P increases rapidly as c gets larger, as illustrated in Fig. 14b.

8.6 Scalability of MUSE-P

Finally, we investigated the impact of the number of uncertain graphs on the execution time of MUSE-P. In this experiment, we ran MUSE-P with input parameters ϕ = 0.3, τ = 0.9, ε = 0.1, and δ = 0.1 on the first n uncertain graphs in the DTP database (Sect. 8.1), where n varies from 100 to 5,000. Figure 15 illustrates the execution time of MUSE-P as n increases from 100 to 5,000. We can see that the execution time grows super-linearly in n. In particular, the plot in Fig. 15 can be fitted by the quadratic function f(x) = 2.2x² − 9.2x + 3.8, where x = n/100. This experimental result is therefore consistent with the time complexity of MUSE-P analyzed in Sect. 7, which states that the running time of MUSE-P is proportional to n² when every uncertain graph in the database contains a vertex with existence probability 1.
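As a quick sanity check on the reported fit (the coefficients below are taken from the text; the helper name is ours), evaluating the quadratic at n = 2,500 and n = 5,000 shows the roughly fourfold increase expected of quadratic growth:

```python
def fitted_time(n):
    """Quadratic fit to the Fig. 15 curve: f(x) = 2.2x^2 - 9.2x + 3.8, x = n/100."""
    x = n / 100
    return 2.2 * x ** 2 - 9.2 * x + 3.8

t_half, t_full = fitted_time(2500), fitted_time(5000)
print(round(t_full / t_half, 2))  # 4.39: doubling n roughly quadruples the time
```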

9 Summary

In this paper, an approximate mining algorithm has been developed for efficiently and accurately mining frequent subgraphs over an uncertain graph database under probabilistic semantics. The algorithm is guaranteed to find any frequent subgraph with a provably high probability by carefully setting the parameter δ using a systematic method. Extensive experiments on a real uncertain graph database verify that the algorithm is practically efficient and that the mining results have very high precision and recall.

Acknowledgments The work in this paper was partially supported by the National Basic Research (973) Program of China under Grant No. 2012CB316202, the National Natural Science Foundation of China under Grant Nos. 61173023, 61033015, and 61190115, the Fundamental Research Funds for the Central Universities under Grant No. HIT.NSRIF.201180, and Microsoft Research Asia.

Appendix A: Difference between expected semantics and probabilistic semantics

It is very important to understand the difference between the expected semantics and the probabilistic semantics. Zhang et al. [39] first discussed this difference in the context of mining frequent items over uncertain data streams. Although that discussion can be extended to frequent subgraph mining over uncertain graphs, as will be shown later, [39] does not address an important issue: when is one semantics preferable to the other?

Prior to the discussion of this issue, let us focus on the difference between expected supports and ϕ-frequent probabilities. In the rest of Appendix A, we use Drand to denote a randomly and independently chosen implicated graph database of D. We have the following facts:


Fact 1: Suppose the expected support of a subgraph S in an uncertain graph database D is 0.5. Then, it is expected that S occurs in 50% of the graphs in Drand. By the probabilistic method [2], S has a nonzero probability of occurring in at least 50% of the graphs in Drand. However, from the expected support of S we can infer neither the exact value of this probability nor any nontrivial bounds on it.

Fact 2: Suppose the 0.5-frequent probability of S in D is 0.9. This means that, with probability 0.9, S occurs in at least 50% of the graphs in Drand. Interestingly, we can infer that the expected value of the support of S in Drand, i.e., the expected support of S, is bounded from below by 0.5 × 0.9 + 0 × (1 − 0.9) = 0.45.

From Facts 1 and 2, we see that the ϕ-frequent probability intrinsically carries more information on the frequentness of a subgraph than the expected support does.

Next, let us determine when one semantics is preferable to the other. As summarized in the literature, frequent subgraph mining has been used in two main scenarios, namely detecting features for graph classifiers and graph indexes, and exploring motifs in graph data. In subgraph feature detection, one would often like to take a subgraph as a feature if the support of the subgraph is at least a given threshold τ with a significantly high probability. The ϕ-frequent probability is clearly the more suitable choice in this scenario. The reason why the expected support is not suitable is as follows: consider a subgraph S with expected support τ. S is expected to occur in a τ proportion of the graphs in Drand, but we cannot determine the probability that S occurs in a τ proportion of the graphs in Drand. For example, consider the following two cases:

Case 1: The support of S in Drand is τ with probability 1;
Case 2: The support of S in Drand is 1 with probability ε/(1 − τ + ε), and is τ − ε with probability (1 − τ)/(1 − τ + ε), where ε → 0.

Although the expected support of S is τ in both cases, S is a feature in Case 1 because S certainly occurs in a τ proportion of the graphs in Drand, but S is certainly not a feature in Case 2 because the probability of S occurring in at least a τ proportion of the graphs in Drand is ε/(1 − τ + ε), which tends to 0.
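A quick numeric check of Case 2 (plain Python; the values of τ and the small ε are chosen only for illustration) confirms both claims at once: the expected support stays at τ while the probability of reaching support τ vanishes.

```python
tau, eps = 0.6, 1e-6

# Case 2: support is 1 with prob. eps/(1-tau+eps),
# and tau - eps with prob. (1-tau)/(1-tau+eps).
p_hi = eps / (1 - tau + eps)
p_lo = (1 - tau) / (1 - tau + eps)

expected_support = 1 * p_hi + (tau - eps) * p_lo
assert abs(expected_support - tau) < 1e-5  # expected support is (close to) tau
assert p_hi < 1e-5                         # yet P[support >= tau] tends to 0
```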

In the scenario of exploring motifs, one would usually like to measure the average frequentness of a subgraph and to obtain subgraphs that are expected to occur frequently. Thus, the expected semantics is the more suitable choice. Although the ϕ-frequent probability carries more information on the frequentness of a subgraph than the expected support, and a lower bound on the expected support of a subgraph can be derived from its ϕ-frequent probability, the ϕ-frequent probability cannot play the same role as the expected support. Consider a subgraph S with ϕ-frequent probability p. The distribution of the support of S in Drand may fall into the following two typical cases:

Case 1': The support of S in Drand is ϕ with probability p and is 0 with probability 1 − p;
Case 2': The support of S in Drand is 1 with probability p and is ϕ − ε with probability 1 − p, where ε → 0.

Suppose a subgraph is taken as a motif if it is expected to occur in a ϕ proportion of the graphs in Drand. Although the ϕ-frequent probability of S is p in both cases, S is not a motif in Case 1' because the expected support of S is ϕp, but S is a motif in Case 2' because its expected support is p + (ϕ − ε)(1 − p), which tends to ϕ + (1 − ϕ)p.
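The two expected supports can likewise be checked numerically (plain Python; the values of ϕ, p, and ε are illustrative):

```python
phi, p, eps = 0.5, 0.9, 1e-6

# Case 1': support phi with prob. p, 0 otherwise.
expected_case1 = phi * p + 0 * (1 - p)
# Case 2': support 1 with prob. p, phi - eps otherwise.
expected_case2 = 1 * p + (phi - eps) * (1 - p)

assert expected_case1 == phi * p                           # 0.45 < phi: not a motif
assert abs(expected_case2 - (phi + (1 - phi) * p)) < 1e-6  # near 0.95 >= phi: a motif
```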

Consequently, we have the following conclusions:

– Frequent subgraph mining under the expected semantics is more suitable for exploring motifs in a set of uncertain graphs;

– Frequent subgraph mining under the probabilistic semantics is more suitable for extracting features from a set of uncertain graphs.

References

1. Agrawal, P., Benjelloun, O., Sarma, A.D., Hayworth, C., Nabar, S.U., Sugihara, T., Widom, J.: Trio: A system for data, uncertainty, and lineage. In: Proceedings of VLDB, pp. 1151–1154 (2006)

2. Alon, N., Spencer, J.H.: The Probabilistic Method. Wiley, New York (1992)

3. Bernecker, T., Kriegel, H.P., Renz, M., Verhein, F., Züfle, A.: Probabilistic frequent itemset mining in uncertain databases. In: Proceedings of KDD, pp. 119–128 (2009)

4. Birnbaum, E., Lozinskii, E.L.: The good old Davis–Putnam procedure helps counting models. J. Artif. Intell. Res. 10, 457–477 (1999)

5. Borgelt, C., Berthold, M.R.: Mining molecular fragments: Finding relevant substructures of molecules. In: Proceedings of ICDM, pp. 51–58 (2002)

6. Chui, C.K., Kao, B.: A decremental approach for mining frequent itemsets from uncertain data. In: Proceedings of PAKDD, pp. 64–75 (2008)

7. Chui, C.K., Kao, B., Hung, E.: Mining frequent itemsets from uncertain data. In: Proceedings of PAKDD, pp. 47–58 (2007)

8. Cormode, G., Garofalakis, M.N.: Sketching probabilistic data streams. In: Proceedings of SIGMOD Conference, pp. 281–292 (2007)

9. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco (1979)

10. Gudes, E., Shimony, S.E., Vanetik, N.: Discovering frequent graph patterns using disjoint paths. IEEE Trans. Knowl. Data Eng. 18(11), 1441–1456 (2006)

11. Hasan, M.A., Zaki, M.J.: Output space sampling for graph patterns. PVLDB 2(1), 730–741 (2009)


12. Hintsanen, P., Toivonen, H.: Finding reliable subgraphs from large probabilistic graphs. DMKD 17(1), 3–23 (2008)

13. Hua, M., Pei, J.: Probabilistic path queries in road networks: traffic uncertainty aware path selection. In: Proceedings of EDBT, pp. 347–358 (2010)

14. Huan, J., Wang, W., Prins, J.: Efficient mining of frequent subgraphs in the presence of isomorphism. In: Proceedings of ICDM, pp. 549–552 (2003)

15. Huan, J., Wang, W., Prins, J., Yang, J.: Spin: mining maximal frequent subgraphs from graph databases. In: Proceedings of KDD, pp. 581–586 (2004)

16. Inokuchi, A., Washio, T., Motoda, H.: An apriori-based algorithm for mining frequent substructures from graph data. In: Proceedings of PKDD, pp. 13–23 (2000)

17. Karp, R.M., Luby, M.: Monte-Carlo algorithms for enumeration and reliability problems. In: Proceedings of FOCS, pp. 56–64 (1983)

18. Kimmig, A., Raedt, L.D.: Local query mining in a probabilistic Prolog. In: Proceedings of IJCAI, pp. 1095–1100 (2009)

19. Koch, C., Olteanu, D.: Conditioning probabilistic databases. PVLDB 1(1), 313–325 (2008)

20. Kuramochi, M., Karypis, G.: An efficient algorithm for discovering frequent subgraphs. IEEE Trans. Knowl. Data Eng. 16(9), 1038–1051 (2004)

21. Leskovec, J., Kleinberg, J.M., Faloutsos, C.: Graphs over time: densification laws, shrinking diameters and possible explanations. In: Proceedings of KDD, pp. 177–187 (2005)

22. Leung, C.K.S., Hao, B.: Mining of frequent itemsets from streams of uncertain data. In: Proceedings of ICDE, pp. 1663–1670 (2009)

23. Luby, M., Velickovic, B.: On deterministic approximation of DNF. Algorithmica 16(4/5), 415–433 (1996)

24. Nijssen, S., Kok, J.N.: A quickstart in frequent structure mining can make a difference. In: Proceedings of KDD, pp. 647–652 (2004)

25. Olteanu, D., Huang, J., Koch, C.: Approximate confidence computation in probabilistic databases. In: Proceedings of ICDE, pp. 145–156 (2010)

26. Papapetrou, O., Ioannou, E., Skoutas, D.: Efficient discovery of frequent subgraph patterns in uncertain graph databases. In: Proceedings of EDBT, pp. 355–366 (2011)

27. Poole, D.: Logic programming, abduction and probability. In: Proceedings of FGCS, pp. 530–538 (1992)

28. Potamias, M., Bonchi, F., Gionis, A., Kollios, G.: k-nearest neighbors in uncertain graphs. In: Proceedings of VLDB (2010)

29. Raedt, L.D., Kimmig, A., Toivonen, H.: ProbLog: A probabilistic Prolog and its application in link discovery. In: Proceedings of IJCAI, pp. 2462–2467 (2007)

30. Sun, L., Cheng, R., Cheung, D., Cheng, J.: Mining uncertain data with probabilistic guarantees. In: Proceedings of KDD, pp. 273–282 (2010)

31. Trevisan, L.: A note on approximate counting for k-DNF. In: Proceedings of APPROX-RANDOM, pp. 417–426 (2004)

32. Ullmann, J.R.: An algorithm for subgraph isomorphism. J. ACM 23(1), 31–42 (1976)

33. Valiant, L.G.: The complexity of enumeration and reliability problems. SIAM J. Comput. 8(3), 410–421 (1979)

34. Yan, X., Han, J.: gSpan: Graph-based substructure pattern mining. In: Proceedings of ICDM, pp. 721–724 (2002)

35. Yan, X., Han, J.: CloseGraph: mining closed frequent graph patterns. In: Proceedings of KDD, pp. 286–295 (2003)

36. Yang, G.: The complexity of mining maximal frequent itemsets and maximal frequent patterns. In: Proceedings of KDD, pp. 344–353 (2004)

37. Yuan, Y., Chen, L., Wang, G.: Efficiently answering probability threshold-based shortest path queries over uncertain graphs. In: Proceedings of DASFAA, pp. 155–170 (2010)

38. Yuan, Y., Wang, G., Wang, H., Chen, L.: Efficient subgraph search over large uncertain graphs. PVLDB 4(11), 876–886 (2011)

39. Zhang, Q., Li, F., Yi, K.: Finding frequent items in probabilistic data. In: Proceedings of SIGMOD, pp. 819–832 (2008)

40. Zhu, F., Yan, X., Han, J., Yu, P.S.: gPrune: A constraint pushing framework for graph pattern mining. In: Proceedings of PAKDD, pp. 388–400 (2007)

41. Zou, Z., Gao, H., Li, J.: Discovering frequent subgraphs over uncertain graph databases under probabilistic semantics. In: Proceedings of KDD, pp. 633–642 (2010)

42. Zou, Z., Li, J., Gao, H., Zhang, S.: Frequent subgraph pattern mining on uncertain graph data. In: Proceedings of CIKM, pp. 583–592 (2009)

43. Zou, Z., Li, J., Gao, H., Zhang, S.: Finding top-k maximal cliques in an uncertain graph. In: Proceedings of ICDE, pp. 649–652 (2010)

44. Zou, Z., Li, J., Gao, H., Zhang, S.: Mining frequent subgraph patterns from uncertain graph data. IEEE Trans. Knowl. Data Eng. 22(9), 1203–1218 (2010)
