
Beyond Triangles: A Distributed Framework for Estimating 3-profiles of Large Graphs

Ethan R. Elenberg, The University of Texas, Austin, Texas 78712, USA, [email protected]
Karthikeyan Shanmugam, The University of Texas, Austin, Texas 78712, USA, [email protected]
Michael Borokhovich, The University of Texas, Austin, Texas 78712, USA, [email protected]
Alexandros G. Dimakis, The University of Texas, Austin, Texas 78712, USA, [email protected]

ABSTRACT
We study the problem of approximating the 3-profile of a large graph. 3-profiles are generalizations of triangle counts that specify the number of times a small graph appears as an induced subgraph of a large graph. Our algorithm uses the novel concept of 3-profile sparsifiers: sparse graphs that can be used to approximate the full 3-profile counts for a given large graph. Further, we study the problem of estimating local and ego 3-profiles, two graph quantities that characterize the local neighborhood of each vertex of a graph.

Our algorithm is distributed and operates as a vertex program over the GraphLab framework. We introduce the concept of edge pivoting, which allows us to collect 2-hop information without maintaining an explicit 2-hop neighborhood list at each vertex. This enables the computation of all the local 3-profiles in parallel with minimal communication.

We test our implementation in several experiments scaling up to 640 cores on Amazon EC2. We find that our algorithm can estimate the 3-profile of a graph in approximately the same time as triangle counting. For the harder problem of ego 3-profiles, we introduce an algorithm that can estimate profiles of hundreds of thousands of vertices in parallel, on the timescale of minutes.

1. INTRODUCTION
Given a small integer k (e.g. k = 3 or 4), the k-profile of a graph G(V,E) is a vector with one coordinate for each distinct k-node graph H_i (see Figure 1 for k = 3). Each coordinate counts the number of times that H_i appears as an induced subgraph of G. For example, the graph G = K_4 (the complete graph on 4 vertices) has the 3-profile [0, 0, 0, 4], since it contains 4 triangles and no other (induced) subgraphs. The graph C_5 (the cycle on 5 vertices, i.e. a pentagon) has the 3-profile [0, 5, 5, 0].


Figure 1: Subgraphs in the 3-profile of a graph, H_0 through H_3. We call them (empty, edge, wedge, triangle). The 3-profile of a graph counts how many times each H_i appears in G.

Note that the sum of the k-profile is always \binom{|V|}{k}, the total number of subgraphs.
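To make the definition concrete, a brute-force enumeration over all 3-node subsets reproduces the K_4 and C_5 examples above. The sketch below is an illustrative helper (not the distributed algorithm of this paper), assuming a dict-of-sets adjacency representation:

```python
import itertools

def three_profile(adj):
    """Brute-force 3-profile [empty, edge, wedge, triangle] of an undirected graph.

    adj: dict mapping each vertex to the set of its neighbors.
    """
    profile = [0, 0, 0, 0]
    for u, v, w in itertools.combinations(sorted(adj), 3):
        # The number of edges among {u, v, w} identifies the induced subgraph H_0..H_3.
        edges = (v in adj[u]) + (w in adj[u]) + (w in adj[v])
        profile[edges] += 1
    return profile

def make_graph(edge_list, vertices):
    adj = {x: set() for x in vertices}
    for a, b in edge_list:
        adj[a].add(b)
        adj[b].add(a)
    return adj

# The examples above: K_4 has 3-profile [0, 0, 0, 4]; C_5 has [0, 5, 5, 0].
k4 = make_graph(list(itertools.combinations(range(4), 2)), range(4))
c5 = make_graph([(i, (i + 1) % 5) for i in range(5)], range(5))
print(three_profile(k4))  # [0, 0, 0, 4]
print(three_profile(c5))  # [0, 5, 5, 0]
```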

One can see k-profiles as a generalization of triangle (as well as other motif) counting problems. They are increasingly popular for graph analytics, for both practical and theoretical reasons. They form a concise graph description that has found several applications for the web [4, 25], social networks [39], and biological networks [27], and seem to be empirically useful. Theoretically, they connect to the emerging theory of graph homomorphisms, graph limits and graphons [6, 39, 21].

In this paper we introduce a novel distributed algorithm for estimating the k = 3 profiles of massive graphs. In addition to estimating the (global) 3-profile, we address two more general problems. One is calculating the local 3-profile for each vertex v_j. This assigns a vector to each vertex that counts how many times v_j participates in each subgraph H_i. These local vectors contain a higher-resolution description of the graph and are used to obtain the global 3-profile (simply by rescaled addition, as we will discuss).

The second related problem is that of calculating the ego 3-profile for each vertex v_j. This is the 3-profile of the graph N(v_j), i.e. the subgraph induced by the neighbors of v_j, also called the ego graph of v_j. The 3-profile of the ego graph of v_j can be seen as a projection of the vertex into a coordinate system [39]. This is a very interesting idea of viewing a big graph as a collection of small dense graphs, in this case the ego graphs of the vertices. Note that calculating the ego 3-profiles for a set of vertices of a graph is different from (in fact, significantly harder than) calculating local 3-profiles.

Contributions: Our first contribution is a provable edge sub-sampling scheme: we establish sharp concentration results for estimating the entire 3-profile of a graph. This allows us to randomly discard most edges of the graph and still have 3-profile estimates that are provably within a bounded


error with high probability. Our analysis is based on modeling the transformation from original to sampled graph as a one-step Markov chain with transitions expressed as a function of the sampling probability. Our result is that a random sampling of edges forms a 3-profile sparsifier, i.e. a subgraph that preserves the elements of the 3-profile with sufficient probability concentration. Our result is a generalization of the triangle sparsifiers of Tsourakakis et al. [38]. Our proof relies on a result by Kim and Vu [15] on concentration of multivariate polynomials, similarly to [38]. Unfortunately, the Kim-Vu concentration holds only for a class of polynomials called totally positive, and some terms in the 3-profile do not satisfy this condition. For that reason, the proof of [38] does not directly extend beyond triangles. Our technical innovation involves showing that it is still possible to decompose our polynomials as combinations of totally positive polynomials using a sequence of variable changes.

Our second innovation deals with efficiently designing a distributed algorithm for estimating 3-profiles on the sub-sampled graph. We rely on the Gather-Apply-Scatter model used in GraphLab PowerGraph [9] but, more generally, our algorithm fits the architecture of most graph engines. We introduce the concept of edge pivoting, which allows us to collect 2-hop information without maintaining an explicit 2-hop neighborhood list at each vertex. This enables the computation of all the local 3-profiles in parallel. Each edge requires only information from its endpoints, and each vertex only computes quantities using data from incident edges. For the problem of ego 3-profiles, we show how to calculate them by combining edge pivot equations and local clique counts.

We implemented our algorithm in GraphLab and performed several experiments scaling up to 640 cores on Amazon EC2. We find that our algorithm can estimate the 3-profile of a graph in approximately the same time as triangle counting. Specifically, we compare against the PowerGraph triangle counting routine and find that it takes us only 1%-10% more time to compute the full 3-profile. For the significantly harder problem of ego 3-profiles, we were able to compute (in parallel) the 3-profiles of up to 100,000 ego graphs on the timescale of several minutes. We compare our parallel ego k-profile algorithm to a simple sequential algorithm that operates on each ego graph sequentially and show tremendous scalability benefits, as expected. Our datasets involve social network and web graphs with edges ranging in number from tens of millions to over one billion. We present results on both overall runtimes and network communication on multicore and distributed systems.

2. RELATED WORK
In this section, we describe several related topics and discuss differences in relation to our work.
Graph Sub-Sampling: Random edge sub-sampling is a natural way to quickly obtain estimates for graph parameters. For the case of triangle counting, such graphs are called triangle sparsifiers [38]. Related ideas were explored in the DOULION algorithm [36, 37, 38] with increasingly strong concentration bounds. The recent work by Ahmed et al. [1] develops subgraph estimators for clustering coefficient, triangle count, and wedge count in a streaming sub-sampled graph. Other recent work [32, 13, 33, 5, 14] uses random sampling to estimate parts of the 3- and 4-profile. These methods do not account for a distributed computation model

and require more complex sampling rules. As discussed, our theoretical results build on [38] to define the first 3-profile sparsifiers, sparse graphs that are a fortiori triangle sparsifiers.
Triangle Counting in Graph Engines: Graph engines (e.g. Pregel, GraphLab, Galois, GraphX; see [30] for a comparison) are frameworks for expressing distributed computation on graphs in the language of vertex programs. Triangle counting algorithms [31, 4] form one of the standard graph analytics tasks for such frameworks [9, 30]. In [7], the authors list triangles efficiently by partitioning the graph into components and processing each component in parallel. Typically it is much harder to perform graph analytics over the MapReduce framework, but some recent work [26, 35] has used clever partitioning and provided theoretical guarantees for triangle counting.
Matrix formulations: Fast matrix multiplication has been used for certain types of subgraph counting. Alon et al. proposed a cycle counting algorithm which uses the trace of a matrix power on high-degree vertices [2]. Some of our edge pivot equations have appeared in [16, 17, 40], all in a centralized setting. Related approximation schemes [36] and randomized algorithms [40] depend on centralized architectures and on computing matrix powers of very large matrices.
Frequent Subgraph Discovery: The general problem of finding frequent subgraphs, also known as motifs or subgraph isomorphisms, is to find the number of occurrences of a small query graph within a larger graph. Typically, frequent subgraph discovery algorithms offer pruning rules to eliminate false positives early in the search [41, 19, 10, 28, 29]. This is most applicable when subgraphs have labelled vertices or directed edges. For these problems, the number of unique isomorphisms grows much larger than in our application.

In [39], subgraphs were queried on the ego graphs of users. While enumerating all 3-sets and sampling 4-sets of neighbors can be done in parallel, forming the ego subgraphs requires checking for edges between neighbors. This suggests that a graph engine implementation would be highly preferable over an Apache Hive system. Our algorithms simultaneously compute the ego subgraphs and their profiles, reducing the amount of communication between nodes. Our algorithm is suitable for both NUMA multicore and distributed architectures, but our implementation focus in this paper is on GraphLab.
Graphlets: First described in [27], graphlets generalize the concept of vertex degree to include the connected subgraphs a particular vertex participates in with its neighbors. Unique graphlets are defined at a vertex based on its degree in the subgraph. Graphlet frequency distributions (GFDs) have proven extremely useful in the field of bioinformatics. Specifically, GFD analysis of protein interaction networks helps to design improved generative models [11], accurate similarity measures [27], and better features for classification [24, 34]. Systems that use our edge pivot equations (in a different form) appear in prior literature for calculating GFDs [12, 22], but not for enabling distributed computation.

3. UNBIASED 3-PROFILE ESTIMATION
In this section, we are interested in estimating the number of 3-node subgraphs of type H_0, H_1, H_2 and H_3, as depicted in Figure 1, in a given graph G. Let the estimated counts be denoted X_i, i ∈ {0, 1, 2, 3}, and the actual counts n_i, i ∈ {0, 1, 2, 3}.


Figure 2: Edge sampling process. (The figure shows, for each 3-subgraph of G, the probabilities of transitioning to each 3-subgraph of the sampled graph; these are the entries of the matrix in (2).)

This set of counts is called the 3-profile of the graph, denoted with the following vector:

n_3(G) = [n_0, n_1, n_2, n_3].   (1)

Because the vector is a scaled probability distribution, there are only 3 degrees of freedom. Therefore, we calculate

n_3(G) = [ \binom{|V|}{3} − n_1 − n_2 − n_3,  n_1,  n_2,  n_3 ].

In the case of a large graph, the computational difficulty of estimating the 3-profile depends on the total number of edges in the large graph. We would like to estimate each of the 3-profile counts within a multiplicative factor. So we first sub-sample the set of edges in the graph with probability p. We compute all 3-profile counts of the sub-sampled graph exactly. Let {Y_i}_{i=0}^{3} denote the exact 3-profile counts of the random sub-sampled graph. We relate the sub-sampled 3-profile counts to the original ones through a one-step Markov chain involving transition probabilities. The sub-sampling process is the random step in the chain. Any specific subgraph is preserved with some probability and otherwise transitions to one of the other subgraphs. For example, a 3-clique is preserved with probability p^3. Figure 2 illustrates the other transition probabilities.

In expectation, this yields the following system of equations:

[ E[Y_0] ]   [ 1   1−p   (1−p)^2    (1−p)^3  ] [ n_0 ]
[ E[Y_1] ] = [ 0    p    2p(1−p)   3p(1−p)^2 ] [ n_1 ]
[ E[Y_2] ]   [ 0    0      p^2     3p^2(1−p) ] [ n_2 ]
[ E[Y_3] ]   [ 0    0       0         p^3    ] [ n_3 ],   (2)

from which we obtain unbiased estimators for each entry in X(G) = [X_0, X_1, X_2, X_3]:

X_0 = Y_0 − ((1−p)/p) Y_1 + ((1−p)^2/p^2) Y_2 − ((1−p)^3/p^3) Y_3   (3)
X_1 = (1/p) Y_1 − (2(1−p)/p^2) Y_2 + (3(1−p)^2/p^3) Y_3   (4)
X_2 = (1/p^2) Y_2 − (3(1−p)/p^3) Y_3   (5)
X_3 = (1/p^3) Y_3.   (6)

Lemma 1. X(G) is an unbiased estimator of n(G).

Proof. Taking expectations of (3)-(6) and substituting the relations in (2), clearly E[X_i] = n_i for i = 0, 1, 2, 3.
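A minimal sketch of the resulting estimation pipeline, assuming the brute-force three_profile helper sketched in the introduction: sample each edge independently with probability p, count the 3-profile of the sampled graph, and apply the unbiased estimators (3)-(6). Names are illustrative; this is not the paper's GraphLab implementation.

```python
import random

def sample_edges(adj, p, seed=0):
    """Keep each undirected edge independently with probability p (the sparsifier step)."""
    rng = random.Random(seed)
    sampled = {v: set() for v in adj}
    for u in adj:
        for v in adj[u]:
            if u < v and rng.random() < p:
                sampled[u].add(v)
                sampled[v].add(u)
    return sampled

def unbiased_estimate(Y, p):
    """Invert the expected transitions of (2): the estimators X_0..X_3 of (3)-(6)."""
    q = 1.0 - p
    X3 = Y[3] / p**3
    X2 = Y[2] / p**2 - 3 * q * Y[3] / p**3
    X1 = Y[1] / p - 2 * q * Y[2] / p**2 + 3 * q**2 * Y[3] / p**3
    X0 = Y[0] - q * Y[1] / p + q**2 * Y[2] / p**2 - q**3 * Y[3] / p**3
    return [X0, X1, X2, X3]

# Usage: X = unbiased_estimate(three_profile(sample_edges(adj, p=0.5)), p=0.5)
```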

We now turn to proving concentration bounds for the above estimators, and introduce some notation for this purpose. Let X be a real polynomial function of m real random variables {t_i}_{i=1}^{m}. Let α = (α_1, α_2, ..., α_m) ∈ Z_+^m and define E_{≥1}[X] = max_{α: ‖α‖_1 ≥ 1} E(∂^α X), where

E(∂^α X) = E[ (∂/∂t_1)^{α_1} ··· (∂/∂t_m)^{α_m} X(t_1, ..., t_m) ].   (7)

Further, we call a polynomial totally positive if the coefficients of all the monomials involved are non-negative. We state the main technical tool we use to obtain our concentration results.

Theorem 1 (Kim-Vu concentration [15]). Let X be a random totally positive Boolean polynomial in m Boolean random variables with degree at most k. If E[X] ≥ E_{≥1}[X], then

P( |X − E[X]| > a_k √(E[X] E_{≥1}[X]) λ^k ) = O( exp(−λ + (k−1) log m) )   (8)

for any λ > 1, where a_k = 8^k k!^{1/2}.

The above theorem was used to analyze 3-profiles of Erdős-Rényi random ensembles (G_{n,p}) in [15]. Later, it was used to derive concentration bounds for triangle sparsifiers in [38]. Here, we extend §4.3 of [15] to obtain concentration bounds for the 3-profile estimation process, detailed above, on an arbitrary edge-sampled graph.

Theorem 2 (Generalization of triangle sparsifiers to 3-profile sparsifiers). Let n(G) = [n_0, n_1, n_2, n_3] be the 3-profile of a graph G(V,E). Let |V| = n and |E| = m. Let [Y_0, Y_1, Y_2, Y_3] be the 3-profile of the subgraph obtained by sampling each edge in G with probability p. Let α, β and ∆ be the largest collections of H_1's, wedges and triangles that share a common edge. Define X(G) according to (3)-(6), ε > 0, and γ > 0. If p, ε satisfy

n_0 / (3 max{α, β, ∆}) ≥ a_3^2 log^6(m^{2+γ}) / ε^2,
p / max{ 1/∛n_3, ∆/n_3 } ≥ a_3^2 log^6(m^{2+γ}) / ε^2,
p / (α/n_1) ≥ a_1^2 log^2(m^γ) / ε^2,
p / max{ β/n_2, 1/√n_2 } ≥ a_2^2 log^4(m^{1+γ}) / ε^2,   (9)

then ‖X(G) − n(G)‖_∞ ≤ 12 ε \binom{|V|}{3} with probability at least 1 − 1/m^γ.


Figure 3: Unique 3-subgraphs from the vertex perspective: H_0(v), H^e_1(v), H^d_1(v), H^c_2(v), H^e_2(v), H_3(v) (the white vertex corresponds to v).

Proof. The full proof can be found in the appendix. Note that, as we mentioned, this proof uses a new polynomial decomposition technique to apply the Kim-Vu concentration.

The sampling probability p in Theorem 2 depends poly-logarithmically on the number of edges and linearly on the fraction of each subgraph which occurs on a common edge. For example, if all of the wedges in G depend on a single edge, i.e. β = n_2, then the last condition suggests the presence of that particular edge in the sampled graph will dominate the overall sparsifier quality.

4. LOCAL 3-PROFILE CALCULATION
In this section, we describe how to obtain two types of 3-profiles for a given graph G in a deterministic manner. These algorithms are distributed and can be applied independently of the edge sampling described in Section 3.

The key to our approach is to identify subgraphs at a vertex based on the degree with which it participates in the subgraph. From the perspective of a given vertex v, there are actually six distinct 3-node subgraphs up to isomorphism, as given in Figure 3. Let n_{0,v}, n^e_{1,v}, n^d_{1,v}, n^c_{2,v}, n^e_{2,v}, and n_{3,v} denote the corresponding local subgraph counts at v. We will first outline an approach that calculates these counts and then add across different vertex perspectives to calculate the final 4 scalars (n_{2,v} = n^e_{2,v} + n^c_{2,v} and n_{1,v} = n^e_{1,v} + n^d_{1,v}). It is easy to see that the global counts can be obtained from these local counts by summing across vertices:

n_i = (1/3) Σ_{v∈V} n_{i,v},   ∀i.   (10)
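The classification behind these six counts, and the rescaling in (10), can be checked with a brute-force reference sketch (illustrative names, assuming the dict-of-sets representation used in the earlier sketches; this is not the distributed algorithm):

```python
import itertools

def local_counts(adj):
    """Classify every 3-set containing v into the six types of Figure 3, per vertex v."""
    counts = {v: {'n0': 0, 'n1e': 0, 'n1d': 0, 'n2c': 0, 'n2e': 0, 'n3': 0} for v in adj}
    for triple in itertools.combinations(sorted(adj), 3):
        edges = [(a, b) for a, b in itertools.combinations(triple, 2) if b in adj[a]]
        for v in triple:
            deg_v = sum(v in e for e in edges)                   # degree of v inside the 3-set
            if len(edges) == 0:
                counts[v]['n0'] += 1
            elif len(edges) == 1:
                counts[v]['n1e' if deg_v == 1 else 'n1d'] += 1   # edge at v vs. disjoint edge
            elif len(edges) == 2:
                counts[v]['n2c' if deg_v == 2 else 'n2e'] += 1   # v is wedge center vs. endpoint
            else:
                counts[v]['n3'] += 1
    return counts

def global_from_local(counts):
    """Equation (10): every 3-subgraph is counted once at each of its 3 vertices."""
    n = [0, 0, 0, 0]
    for c in counts.values():
        n[0] += c['n0']
        n[1] += c['n1e'] + c['n1d']
        n[2] += c['n2c'] + c['n2e']
        n[3] += c['n3']
    return [x // 3 for x in n]
```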

4.1 Distributed Local 3-profile
We will now give our approach for calculating the local 3-profile counts of G(V,E) using only local information combined with |V| and |E|.
Scatter: We assume that every edge (v, a) has access to the neighborhood sets of both v and a, i.e. Γ(v) and Γ(a). Therefore, intersection sizes are first calculated at every edge, i.e. |Γ(v) ∩ Γ(a)|. Each edge computes the following scalars and stores them:

n_{3,va} = |Γ(v) ∩ Γ(a)|,   n^c_{2,va} = |Γ(v)| − |Γ(v) ∩ Γ(a)| − 1,
n^e_{2,va} = |Γ(a)| − |Γ(v) ∩ Γ(a)| − 1,
n_{1,va} = |V| − (|Γ(v)| + |Γ(a)| − |Γ(v) ∩ Γ(a)|).   (11)

The computational effort at every edge is at most O(d_max), where d_max is the maximum degree of the graph, for the neighborhood intersection size.

Figure 4: 4-subgraphs F_0(v), F_1(v), F_2(v), F_3(v) used for ego 3-profiles (the white vertex corresponds to v).

Gather: In the next round, vertex v "gathers" the above scalars in the following way:

n^e_{2,v} = Σ_{a∈Γ(v)} n^e_{2,va},        n^e_{1,v} = Σ_{a∈Γ(v)} n_{1,va},
n^c_{2,v} = (1/2) Σ_{a∈Γ(v)} n^c_{2,va},   (a)
n_{3,v}  = (1/2) Σ_{a∈Γ(v)} n_{3,va},      (b)
n^d_{1,v} = |E| − |Γ(v)| − n_{3,v} − n^e_{2,v},   (c)
n_{0,v}  = \binom{|V|−1}{2} − n_{1,v} − n_{2,v} − n_{3,v}.   (12)

Here, relations (a) and (b) are because triangles and wedges from the center are double counted. (c) comes from noticing that each triangle and each wedge from an endpoint excludes an extra edge from forming an H^d_1(v). In this Gather stage, the communication complexity is O(M), where it is assumed that Γ(v) is stored over M different machines. The corresponding distributed algorithm is described in Algorithm 1.
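A single-machine sketch of the same computation using only the per-edge scalars of (11) and the per-vertex combinations of (12). The two loops stand in for the scatter and gather rounds; this is an illustrative serialization, not the GraphLab vertex program:

```python
from math import comb

def local_3_profile_edge_pivot(adj):
    """Local 3-profiles from neighborhood intersections, following (11)-(12).

    adj: dict {vertex: set(neighbors)}. Returns {v: [n0_v, n1_v, n2_v, n3_v]}.
    """
    V = len(adj)
    E = sum(len(nb) for nb in adj.values()) // 2
    tri, wedge_center, wedge_end, lone_edge = {}, {}, {}, {}
    # "Scatter": per-edge scalars of (11), computed for both orientations (v, a).
    for v in adj:
        for a in adj[v]:
            common = len(adj[v] & adj[a])
            tri[(v, a)] = common                               # n_{3,va}
            wedge_center[(v, a)] = len(adj[v]) - common - 1    # third vertex adjacent to v only
            wedge_end[(v, a)] = len(adj[a]) - common - 1       # third vertex adjacent to a only
            lone_edge[(v, a)] = V - (len(adj[v]) + len(adj[a]) - common)
    # "Gather": per-vertex combinations of (12).
    out = {}
    for v in adj:
        n3 = sum(tri[(v, a)] for a in adj[v]) // 2             # each triangle at v seen on 2 edges
        n2c = sum(wedge_center[(v, a)] for a in adj[v]) // 2   # each center wedge seen on 2 edges
        n2e = sum(wedge_end[(v, a)] for a in adj[v])
        n1e = sum(lone_edge[(v, a)] for a in adj[v])
        n1d = E - len(adj[v]) - n3 - n2e
        n1, n2 = n1e + n1d, n2c + n2e
        n0 = comb(V - 1, 2) - n1 - n2 - n3
        out[v] = [n0, n1, n2, n3]
    return out
```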

4.2 Distributed Ego 3-profile
In this section, we give an approach to compute ego 3-profiles for a set of vertices V' ⊆ V in G. For each vertex v ∈ V', the algorithm returns a 3-profile corresponding to that vertex's ego N(v), the subgraph induced by the neighborhood set Γ(v), including edges between neighbors and excluding v itself. Formally, our goal is to compute {n(N(v))}_{v∈V'}. Clearly, this can be accomplished in two steps repeated serially over all v ∈ V': first obtain the ego subgraph N(v) and then pass it as input to Algorithm 1, summing over the ego vertices Γ(v) to get a global count. The serial implementation is provided in Algorithm 2. We note that this was essentially done in [39], where ego subgraphs were extracted from a common graph separately from the 3-profile computations.

Instead, Algorithm 3 provides a parallel implementation which solves the problem by finding cliques in parallel for all v ∈ V'. The main idea behind this approach is to realize that calculating the 3-profile on the induced subgraph N(v) is exactly equivalent to computing specific 4-node subgraph frequencies among v and 3 of its neighbors, enumerated as F_i(v), 0 ≤ i ≤ 3, in Figure 4. The aim is now to calculate the F_i(v)'s, effectively part of a local 4-profile.
Scatter: We assume that every edge (v, a) has already computed the scalars from (11). Additionally, every edge (v, a) also computes the list N_va = Γ(v) ∩ Γ(a) instead of only its size. The computational complexity is still O(d_max).
Gather: First, the vertex "gathers" the following scalars, forming three edge pivot equations in the unknown variables F_i(v):


Σ_{a∈Γ(v)} \binom{n^c_{2,va}}{2} = 3F_0(v) + F_1(v)
Σ_{a∈Γ(v)} \binom{n_{3,va}}{2} = F_2(v) + 3F_3(v)
Σ_{a∈Γ(v)} n^c_{2,va} n_{3,va} = 2F_1(v) + 2F_2(v)   (13)

By choosing two subgraphs that the edge (v, a) participates in, and then summing over neighbors a, these equations gather implicit connectivity information 2 hops away from v. However, note that there are only three equations in four variables, and we must count one of them directly, namely the number of 4-cliques F_3(v). Therefore, at the same gather step, the vertex also creates the list CN_v = ⋃_{a∈Γ(v), p∈N_va} (a, p). Essentially, this is the list of edges in the subgraph induced by Γ(v). This requires worst-case communication proportional to the number of edges in N(v), independent of the number of machines M.
Scatter: Now, at the next scatter stage, each edge (v, a) receives the pair of lists CN_v, CN_a. This again incurs communication of O(|CN_v|) = O(M (d_max)^2). Now, each edge (v, a) computes the number of 4-cliques it is a part of, defined as follows:

n_{4,va} = Σ_{i,j ∈ Γ(v)∩Γ(a)} 1((i, j) ∈ CN_v).   (14)

This incurs computation time of |CN_v|.
Gather: In the final gather stage, every vertex v accumulates these scalars to get F_3(v) = (1/3) Σ_{a∈Γ(v)} n_{4,va}, requiring O(M) communication time. As in the previous section, the scaling accounts for extra counting. Finally, the vertex solves the equations (13) using F_3(v).
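A serial sketch of this ego computation (illustrative names; the loop over a ∈ Γ(v) stands in for the gather and scatter rounds): per-edge common-neighbor counts give the left-hand sides of (13), per-edge 4-clique counts give (14), and F_3(v) closes the system so the remaining F_i(v) follow by back-substitution.

```python
from itertools import combinations
from math import comb

def ego_3_profiles(adj, ego_vertices):
    """Ego 3-profile of N(v) for each requested v, via the edge pivots (13) and (14).

    adj: dict {vertex: set(neighbors)}. Returns {v: [F0, F1, F2, F3]}.
    """
    result = {}
    for v in ego_vertices:
        A = B = C = 0            # left-hand sides of the three pivot equations (13)
        four_cliques = 0         # running sum of n_{4,va} over edges (v, a), cf. (14)
        for a in adj[v]:
            common = adj[v] & adj[a]                 # the list N_va
            n3_va = len(common)
            n2c_va = len(adj[v]) - n3_va - 1         # neighbors of v not adjacent to a
            A += comb(n2c_va, 2)
            B += comb(n3_va, 2)
            C += n2c_va * n3_va
            # n_{4,va}: pairs of common neighbors that are themselves adjacent.
            four_cliques += sum(1 for i, j in combinations(common, 2) if j in adj[i])
        F3 = four_cliques // 3                       # each 4-clique at v is seen on 3 edges (v, a)
        F2 = B - 3 * F3
        F1 = C // 2 - F2
        F0 = (A - F1) // 3
        result[v] = [F0, F1, F2, F3]
    return result
```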

5. IMPLEMENTATION AND RESULTS
In this section, we describe the implementation and the experimental results of the 3-prof, Ego-par and Ego-ser algorithms. We implement our algorithms on GraphLab v2.2 (PowerGraph) [9]. The performance (running time and network usage) of our 3-prof algorithm is compared with the Undirected Triangles Count Per Vertex algorithm (hereinafter referred to as trian) shipped with GraphLab. We show that in time and network usage comparable to the built-in trian algorithm, our 3-prof can calculate all the local and global 3-profiles. Then, we compare our parallel implementation of the ego 3-profile algorithm, Ego-par, with the naive serial implementation, Ego-ser. It appears that our parallel approach is much more efficient and scales much better than the serial algorithm. The sampling approach, introduced for the 3-prof algorithm, yields promising results: reduced running time and network usage while still providing excellent accuracy. We support our findings with several experiments over various datasets and systems.
Vertex Programs: Our algorithms are implemented using a standard GAS (gather, apply, scatter) model [9]. We implement the three functions gather(), apply(), and scatter() to be executed by each vertex. Then we signal subsets of vertices to run in a specific order.
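To make this structure concrete, here is a toy single-process emulation of one GAS round. It mimics only the control flow (gather over incident edges, apply at each vertex, scatter back to edges); it is not GraphLab's API, and the class and function names are ours.

```python
class ToyGASEngine:
    """Minimal single-machine emulation of a Gather-Apply-Scatter round."""

    def __init__(self, adj):
        self.adj = adj                                   # {vertex: set(neighbors)}
        self.vertex_data = {v: None for v in adj}
        self.edge_data = {}                              # keyed by the directed pair (v, a)

    def run_round(self, gather, apply_fn, scatter):
        # Gather: collect a value from every edge incident to each vertex.
        gathered = {v: [gather(v, a, self.edge_data.get((v, a))) for a in self.adj[v]]
                    for v in self.adj}
        # Apply: update each vertex from its gathered values.
        for v in self.adj:
            self.vertex_data[v] = apply_fn(v, gathered[v], self.vertex_data[v])
        # Scatter: update each incident edge from the new vertex data.
        for v in self.adj:
            for a in self.adj[v]:
                self.edge_data[(v, a)] = scatter(v, a, self.vertex_data)

# One round that stores Γ(v) at each vertex and then leaves |Γ(v) ∩ Γ(a)| on every edge,
# i.e. the first scalar of (11):
# engine = ToyGASEngine(adj)
# engine.run_round(gather=lambda v, a, e: a,
#                  apply_fn=lambda v, vals, old: set(vals),
#                  scatter=lambda v, a, vd: len(vd[v] & vd[a]))
```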

Algorithm 1 3-prof
Input: Graph G(V,E) with |V| vertices, |E| edges
Gather: For each vertex v, union over edges of the 'other' vertex in the edge: ∪_{a∈Γ(v)} a = Γ(v)
Apply: Store the gather result as vertex data v.nb; its size is stored automatically
Scatter: For each edge e_va, compute and store the scalars in (11).
Gather: For each vertex v, sum the edge scalar data of its neighbors: g ← Σ_{(v,a)∈Γ(v)} e.data
Apply: For each vertex v, calculate and store the quantities described in (12).
return [v: v.n0 v.n1 v.n2 v.n3]

Algorithm 2 Ego-ser
Input: Graph G(V,E) with |V| vertices, |E| edges, set of ego vertices V'
for v ∈ V' do
    Signal v and its neighbors
    Include an edge if both its endpoints are signaled
    Run Algorithm 1 on the graph induced by the neighbors and the edges between them
end for
return [v: vego.n0 vego.n1 vego.n2 vego.n3]

The Systems: We perform the experiments on three systems. The first system is a single powerful server, further referred to as Asterix. The server is equipped with 256 GB of RAM and two Intel Xeon E5-2699 v3 CPUs, 18 cores each. Since each core has two hardware threads, up to 72 logical cores are available to the GraphLab engine.

The next two systems are EC2 clusters on AWS (Amazon Web Services) [3]. One is comprised of 12 m3.2xlarge machines, each having 30 GB RAM and 8 virtual CPUs. The other system is a cluster of 20 c3.8xlarge machines, each having 60 GB RAM and 32 virtual CPUs.
The Data: In our experiments we used five real graphs. These graphs represent different datasets: social networks (LiveJournal and Twitter), citations (DBLP), knowledge content (Wikipedia), and WWW structure (PLD, pay-level domains). Graph sizes are summarized in Table 1.

Table 1: Datasets
Name              Vertices     Edges (undirected)
Twitter [18]      41,652,230   1,202,513,046
PLD [23]          39,497,204   582,567,291
LiveJournal [20]  4,846,609    42,851,237
Wikipedia [8]     3,515,067    42,375,912
DBLP [20]         317,080      1,049,866

5.1 Results
Experimental results are averaged over 3-10 runs.
Local 3-profile vs. triangle count: The first result is that our 3-prof is able to compute all the local 3-profiles in almost the same time as GraphLab's built-in trian computes the local triangle counts (i.e., the number of triangles including each vertex). Let us start with the first AWS cluster, with less powerful machines (m3.2xlarge).


Algorithm 3 Ego-par
Input: Graph G(V,E) with |V| vertices, |E| edges, set of ego vertices V'
Gather: For each vertex v, union over edges of the 'other' vertex in the edge: ∪_{e_va} a = Γ(v)
Apply: Store the gather result as vertex data v.nb; its size is stored automatically
Scatter: For each edge e_va, compute and store as edge data:
    the scalars in (11);
    the list N_va.
Gather: For each vertex v, sum the edge data of its neighbors:
    accumulate the LHS of (13);
    g.CN ← g.CN ∪ N_va.
Apply: Obtain CN_v and the equations in (13) using the scalars and g.CN.
Scatter: Scatter CN_v, CN_a to all edges (v, a); compute n_{4,va} as in (14).
Gather: Sum the edge data n_{4,va} of neighbors at v.
Apply: Compute F_3(v).
return [v: vego.n0 vego.n1 vego.n2 vego.n3]

In Figure 5 (a) we can see that for the LiveJournal graph, for each sampling probability p and for each number of nodes (i.e., machines in the cluster), 3-prof achieves running times comparable to trian. Notice also the benefit in running time achieved by sampling. We can reduce running time almost by half without significantly sacrificing accuracy (which will be discussed shortly). While the running time decreases as the number of nodes grows (more computing resources become available), the network usage becomes higher (see Figure 5 (c)) due to the extensive inter-machine communication in GraphLab. We can also see that sampling can significantly reduce network usage. In Figures 5 (b) and (d), we can see similar behavior for the Wikipedia graph: the running time and network usage of 3-prof are comparable to trian.

Next, we conduct the experiments on the second AWS cluster with more powerful (c3.8xlarge) machines. For LiveJournal, we note modest improvements in running time for nearly the same network bandwidth observed in Figure 5. On this system we were able to run 3-prof and trian on the much larger PLD graph. In Figures 6 (b) and (d) we compare the running time and network usage of both algorithms. For the large PLD graph, the benefit of sampling can be seen clearly; by setting p = 0.1, the running time of 3-prof is reduced by a factor of 4 and the network usage is reduced by a factor of 2. Figure 7 shows the performance of 3-prof and trian on the LiveJournal and Wikipedia graphs. We can see that the behavior of the running times and the network usage of the 3-prof algorithm is consistently comparable to trian across the various graphs, sampling, and system parameters.

Let us now show the results of the experiments performed on a single powerful machine (Asterix). Figure 11 (a) shows the running times of 3-prof and trian for the Twitter and PLD graphs. We can see that on the largest graph in our dataset (Twitter), the running time of 3-prof is less than 5% larger than that of trian, and for the PLD graph the difference is less than 3% (for p = 1). Twitter takes roughly twice as long to compute as PLD, suggesting that these algorithms have running time proportional to the graph's number of edges.

Finally, we show that while the sampling approach can significantly reduce the running time and network usage, it has a negligible effect on the accuracy of the solution. Note that the sampling accuracy refers to the global 3-profile count (i.e., the sum of all the local 3-profiles over all vertices in a graph). In Figure 12 we show the accuracy of each of the four 3-profile elements. As the accuracy metric, we use the ratio between the exact count (obtained by running 3-prof with p = 1) and the estimated count (i.e., the output of our 3-prof when p < 1). It can be seen that for the three graphs, all the 3-profile elements are very close to 1. E.g., for the PLD graph, even when p = 0.01, the accuracy is within 0.004 of the ideal value of 1. Error bars mark one standard deviation from the mean, and across all graphs the largest standard deviation is 0.031. As p decreases, the triangle estimator suffers the greatest loss in both accuracy and consistency.
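For reference, this accuracy metric is just the coordinate-wise ratio of exact to estimated global counts; a hypothetical helper:

```python
def accuracy(exact, estimated):
    """Per-coordinate accuracy exact/estimated as plotted in Figure 12 (1.0 is perfect)."""
    return [e / x if x else float('nan') for e, x in zip(exact, estimated)]
```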

Figure 5: AWS m3.2xlarge cluster. 3-prof vs. trian algorithms for the LiveJournal and Wikipedia datasets (average of 3 runs). 3-prof achieves performance comparable to triangle counting. (a,b) – Running time for various numbers of nodes (machines) and various sampling probabilities p. (c,d) – Network bytes sent by the algorithms for various numbers of nodes and various sampling probabilities p.

Ego 3-profiles: The next set of experiments evaluates the performance of our Ego-par algorithm for counting ego 3-profiles. We show the performance of Ego-par for various graphs and systems and also compare it to the naive serial algorithm Ego-ser. Let us start with the AWS system with c3.8xlarge machines. In Figure 8 we see the running times of Ego-ser and Ego-par on the LiveJournal graph. The task was to find the ego 3-profiles of 100, 1K, and 10K randomly selected vertices. Since the running time depends on the size and structure of each induced subgraph, Ego-ser and Ego-par operated on the same list of ego vertices. While for 100 random vertices Ego-ser performed well (and even achieved the same running time as Ego-par for the PLD graph), its performance degraded drastically for a larger number of vertices. This is due to its


Figure 6: AWS c3.8xlarge cluster. 3-prof vs. trian algorithms for the LiveJournal and PLD datasets (average of 3 runs). 3-prof achieves performance comparable to triangle counting. (a,b) – Running time for various numbers of nodes (machines) and various sampling probabilities p. (c,d) – Network bytes sent by the algorithms for various numbers of nodes and various sampling probabilities p.

iterative nature: it finds the ego 3-profiles of the vertices one at a time and is not scalable. Note that the open bars, and the numbers above them, mean that this experiment was not finished; the number is an approximate extrapolation, which is reasonable given the serial design of Ego-ser.

On the contrary, the Ego-par algorithm scales extremely well and computes ego 3-profiles for 100, 1K, and 10K vertices in almost the same time. In Figure 9 (a), we can see that as the number of nodes (i.e., machines) increases, the running time of Ego-par decreases, since its parallel design allows it to use additional computational resources. However, Ego-ser cannot benefit from more resources, and its running time even increases when more machines are used. The increase in running time of Ego-ser is due to the increase in network usage when using more machines (see Figure 9 (b)). The network usage of Ego-par also increases, but this is compensated by the additional computational power that it can leverage. In Figure 10, we can see that Ego-par performed well even when finding the ego 3-profiles of all the LiveJournal vertices (4.8M vertices).

Finally, in Figures 11 (b) and (c), we can see the comparison of Ego-par and Ego-ser on the PLD and DBLP graphs on the Asterix machine. For both graphs, we see very good scaling of Ego-par, while the running time of Ego-ser scales linearly with the size of the ego vertex list.

6. CONCLUSIONS
In summary, we have reduced several 3-profile problems to triangle and 4-clique finding in a graph engine framework. Our concentration theorem and experimental results confirm that local 3-profile estimation via sub-sampling is comparable in runtime and accuracy to local triangle counting.

Figure 7: AWS c3.8xlarge cluster with 20 nodes. 3-prof vs. trian results for the LiveJournal and Wikipedia datasets (average of 3 runs). (a) – Running time for both graphs for various sampling probabilities p. (b) – Network bytes sent by the algorithms for both graphs for various sampling probabilities p.

Figure 8: AWS c3.8xlarge cluster. Ego-par vs. Ego-ser results for the LiveJournal and PLD datasets (average of 5 runs). The running time of Ego-par scales well with the number of ego centers, while Ego-ser scales linearly.

This paper offers several directions for future work. First, both the local 3-profile and the ego 3-profile can be used as features to classify vertices in social or bioinformatic networks. Additionally, we hope to extend our theory and algorithmic framework to larger subgraphs, as well as special classes of input graphs. Our edge sampling Markov chain and unbiased estimators should easily extend to k > 3. The equations in (13) are useful for counting local or global 4-profiles in a centralized setting, as shown recently in [17, 40]. Tractable distributed algorithms for k > 3 using similar edge pivot equations remain future work. Our observed dependence on the 4-clique count suggests that an improved graph-engine-based clique counting subroutine would improve the parallel algorithm's performance.

7. REFERENCES
[1] N. K. Ahmed, N. Duffield, J. Neville, and R. Kompella. Graph Sample and Hold: A framework for big-graph analytics. In KDD, 2014.
[2] N. Alon, R. Yuster, and U. Zwick. Finding and counting given length cycles. Algorithmica, 17(3):209-223, Mar. 1997.
[3] Amazon Web Services. http://aws.amazon.com, 2015.
[4] L. Becchetti, P. Boldi, C. Castillo, and A. Gionis. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In KDD, 2008.


Figure 9: AWS c3.8xlarge cluster. Ego-par vs. Ego-ser results for the LiveJournal and Wikipedia datasets (average of 5 runs). The running time of Ego-par decreases with the number of machines due to its parallel design. The running time of Ego-ser does not decrease with the number of machines due to its iterative nature. Network usage increases for both algorithms with the number of machines.

Figure 11: Asterix machine. Results for the Twitter, PLD, and DBLP datasets. (a) – Running time of 3-prof vs. trian for various sampling probabilities p. (b,c) – Running time of Ego-par vs. Ego-ser for various numbers of ego centers. Results are averaged over 3, 3, and 10 runs, respectively.

Figure 10: AWS c3.8xlarge cluster with 20 nodes. Ego-par results for the LiveJournal dataset (average of 5 runs). The algorithm scales well for various numbers of ego centers and even for the full ego center list (4,846,609 egos). (a) – Running time. (b) – Network bytes sent by the algorithm.

[5] M. A. Bhuiyan, M. Rahman, M. Rahman, and M. Al Hasan. GUISE: Uniform sampling of graphlets for large graph analysis. In IEEE 12th International Conference on Data Mining, pages 91-100, Dec. 2012.
[6] C. Borgs, J. Chayes, and K. Vesztergombi. Counting graph homomorphisms. Topics in Discrete Mathematics, pages 315-371, 2006.
[7] S. Chu and J. Cheng. Triangle listing in massive networks and its applications. In Proceedings of the 17th ACM SIGKDD, page 672, New York, NY, USA, 2011. ACM Press.
[8] T. A. Davis and Y. Hu. The University of Florida Sparse Matrix Collection. ACM Transactions on Mathematical Software, 38(1):1-25, 2011.
[9] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 17-30, 2012.
[10] W. Han and J. Lee. Turboiso: Towards ultrafast and robust subgraph isomorphism search in large graph databases. In SIGMOD, pages 337-348, 2013.
[11] F. Hormozdiari, P. Berenbrink, N. Przulj, and S. C. Sahinalp. Not all scale-free networks are born equal: The role of the seed graph in PPI network evolution. PLoS Computational Biology, 3(7):e118, July 2007.
[12] T. Hocevar and J. Demsar. A combinatorial approach to graphlet counting. Bioinformatics, 30(4):559-565, Feb. 2014.


Figure 12: Global 3-profile accuracy achieved by the 3-prof algorithm for various graphs and for each profile count. Results are averaged over 5, 5, and 10 iterations, respectively. Error bars indicate 1 standard deviation. The metric is the ratio between the exact profile count (when p = 1) and the given output for p < 1. All results are very close to the optimum value of 1.

[13] M. Jha, C. Seshadhri, and A. Pinar. A space efficient streaming algorithm for triangle counting using the birthday paradox. In KDD, pages 589-597, 2013.
[14] M. Jha, C. Seshadhri, and A. Pinar. Path sampling: A fast and provable method for estimating 4-vertex subgraph counts. 2014.
[15] J. H. Kim and V. H. Vu. Concentration of multivariate polynomials and its applications. Combinatorica, 20(3):417-434, 2000.
[16] T. Kloks, D. Kratsch, and H. Muller. Finding and counting small induced subgraphs efficiently. Information Processing Letters, 74(3-4):115-121, May 2000.
[17] M. Kowaluk, A. Lingas, and E.-M. Lundell. Counting and detecting small subgraphs via equations. SIAM Journal of Discrete Mathematics, 27(2):892-909, 2013.
[18] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In WWW '10: Proceedings of the 19th International Conference on World Wide Web, pages 591-600, New York, NY, USA, 2010. ACM.
[19] J. Lee, W.-S. Han, R. Kasperovics, and J.-H. Lee. An in-depth comparison of subgraph isomorphism algorithms in graph databases. Proceedings of the VLDB Endowment, 6(2):133-144, Dec. 2012.
[20] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
[21] L. Lovasz. Large Networks and Graph Limits, volume 60. American Mathematical Society, 2012.
[22] D. Marcus and Y. Shavitt. RAGE: A rapid graphlet enumerator for large networks. Computer Networks, 56(2):810-819, Feb. 2012.
[23] R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer. Graph structure in the web, revisited. In WWW '14: Proceedings of the 23rd International World Wide Web Conference, Web Science Track. ACM, 2014.
[24] T. Milenkovic and N. Przulj. Uncovering biological network function via graphlet degree signatures. Cancer Informatics, 6:257-273, 2008.
[25] D. O'Callaghan, M. Harrigan, J. Carthy, and P. Cunningham. Identifying discriminating network motifs in YouTube spam. Feb. 2012.
[26] R. Pagh and C. E. Tsourakakis. Colorful triangle counting and a MapReduce implementation. Information Processing Letters, 112(7):277-281, Mar. 2012.
[27] N. Przulj. Biological network comparison using graphlet degree distribution. Bioinformatics, 23(2):177-183, 2007.
[28] P. Ribeiro, F. Silva, and L. Lopes. Efficient parallel subgraph counting using G-tries. In 2010 IEEE International Conference on Cluster Computing, pages 217-226, Sept. 2010.
[29] M. Saltz, A. Jain, A. Kothari, A. Fard, J. A. Miller, and L. Ramaswamy. DualIso: An algorithm for subgraph pattern matching on very large labeled graphs. In 2014 IEEE International Congress on Big Data, pages 498-505, June 2014.
[30] N. Satish, N. Sundaram, M. A. Patwary, J. Seo, J. Park, M. A. Hassaan, S. Sengupta, Z. Yin, and P. Dubey. Navigating the maze of graph analytics frameworks using massive graph datasets. In SIGMOD, pages 979-990, 2014.
[31] T. Schank. Algorithmic Aspects of Triangle-Based Network Analysis. PhD thesis, 2007.
[32] C. Seshadhri, A. Pinar, and T. G. Kolda. Triadic measures on graphs: The power of wedge sampling. In SDM13: Proceedings of the 2013 SIAM International Conference on Data Mining, pages 10-18, 2013.
[33] C. Seshadhri, A. Pinar, and T. G. Kolda. Wedge sampling for computing clustering coefficients and triangle counts on large graphs. Statistical Analysis and Data Mining, 7(4):294-307, 2014.
[34] N. Shervashidze, K. Mehlhorn, and T. H. Petri. Efficient graphlet kernels for large graph comparison. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 488-495, 2009.
[35] S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In Proceedings of the 20th International Conference on World Wide Web, page 607. ACM Press, 2011.
[36] C. E. Tsourakakis. Fast counting of triangles in large real networks: Algorithms and laws. In IEEE International Conference on Data Mining, 2008.


[37] C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos. DOULION: Counting triangles in massive graphs with a coin. In SIGKDD, 2009.
[38] C. E. Tsourakakis, M. Kolountzakis, and G. L. Miller. Triangle sparsifiers. Journal of Graph Algorithms and Applications, 15(6):703-726, 2011.
[39] J. Ugander, L. Backstrom, M. Park, and J. Kleinberg. Subgraph frequencies: Mapping the empirical and extremal geography of large graph collections. In 22nd International World Wide Web Conference, 2013.
[40] V. V. Williams, J. Wang, R. Williams, and H. Yu. Finding four-node subgraphs in triangle time. In SODA, pages 1671-1680, 2014.
[41] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In International Conference on Data Mining, 2002.

APPENDIX
A. PROOF OF THEOREM 2
Let m be the total number of edges in the original graph G. If e is an edge in the original graph G, let t_e be the random indicator after sampling: t_e = 1 if e is sampled and 0 otherwise. Let H_0, H_1, H_2, H_3 denote the sets of distinct subgraphs of the kinds H_0, H_1, H_2 and H_3 (anti-clique, edge, wedge and triangle), respectively. Let A, (e), Λ(e, f) and ∆(e, f, g) denote an anti-clique with no edge, an H_1 with edge e, an H_2 with two edges e, f, and a triangle with edges e, f, g, respectively, in the original graph G. Our estimators (3)-(6) are functions of the Y_i's, and each Y_i can be written as a polynomial of degree at most 3 in the variables t_e.

Y_0 = n_0 + Σ_{(e)∈H_1} (1 − t_e) + Σ_{Λ(e,f)∈H_2} (1 − t_e)(1 − t_f) + Σ_{∆(e,f,g)∈H_3} (1 − t_e)(1 − t_f)(1 − t_g)   (15)

Y_1 = Σ_{(e)∈H_1} t_e + Σ_{Λ(e,f)∈H_2} ((1 − t_e) t_f + (1 − t_f) t_e)
    + Σ_{∆(e,f,g)∈H_3} ( t_e(1 − t_f)(1 − t_g) + t_f(1 − t_e)(1 − t_g) + t_g(1 − t_e)(1 − t_f) )   (16)

Y_2 = Σ_{Λ(e,f)∈H_2} t_e t_f + Σ_{∆(e,f,g)∈H_3} ( t_e t_f (1 − t_g) + t_f t_g (1 − t_e) + t_e (1 − t_f) t_g )   (17)

Y_3 = Σ_{∆(e,f,g)∈H_3} t_e t_f t_g   (18)

We also define the following auxiliary polynomials:

S_1 = Σ_{(e)∈H_1} t_e   (19)
D_1 = Σ_{Λ(e,f)∈H_2} (t_e + t_f)   (20)
D_2 = Σ_{Λ(e,f)∈H_2} t_e t_f   (21)
T_1 = Σ_{∆(e,f,g)∈H_3} (t_e + t_f + t_g)   (22)
T_2 = Σ_{∆(e,f,g)∈H_3} (t_e t_f + t_f t_g + t_g t_e)   (23)

Y_1 = S_1 + D_1 − 2D_2 + T_1 − 2T_2 + 3Y_3   (24)
Y_2 = D_2 + T_2 − 3Y_3   (25)

Note that the newly defined polynomials have the following expectations:

E[S_1] = p n_1,  E[D_1] = 2p n_2,  E[D_2] = p^2 n_2,  E[T_1] = 3p n_3,  E[T_2] = 3p^2 n_3.
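The identities (24)-(25) hold for any 0/1 assignment of the t_e, and the expectations above follow because the indicators are independent with E[t_e] = p. A quick numeric check of the decomposition, on an assumed small toy graph and with illustrative names:

```python
import itertools
import random

def subgraph_instances(adj):
    """Enumerate H1 (single edge), H2 (wedge) and H3 (triangle) instances as edge tuples."""
    key = lambda a, b: (min(a, b), max(a, b))
    H1, H2, H3 = [], [], []
    for u, v, w in itertools.combinations(sorted(adj), 3):
        es = tuple(key(a, b) for a, b in itertools.combinations((u, v, w), 2) if b in adj[a])
        if len(es) == 1:
            H1.append(es)
        elif len(es) == 2:
            H2.append(es)
        elif len(es) == 3:
            H3.append(es)
    return H1, H2, H3

def check_decomposition(adj, p=0.4, seed=1):
    """Verify the identities (24)-(25) on one random edge sample of the graph."""
    rng = random.Random(seed)
    H1, H2, H3 = subgraph_instances(adj)
    edges = {(min(a, b), max(a, b)) for a in adj for b in adj[a]}
    t = {e: int(rng.random() < p) for e in edges}        # sampling indicators t_e
    S1 = sum(t[e] for (e,) in H1)
    D1 = sum(t[e] + t[f] for (e, f) in H2)
    D2 = sum(t[e] * t[f] for (e, f) in H2)
    T1 = sum(t[e] + t[f] + t[g] for (e, f, g) in H3)
    T2 = sum(t[e] * t[f] + t[f] * t[g] + t[g] * t[e] for (e, f, g) in H3)
    Y3 = sum(t[e] * t[f] * t[g] for (e, f, g) in H3)
    # Y1 and Y2 straight from their definitions (16)-(17).
    Y1 = S1 \
        + sum((1 - t[e]) * t[f] + (1 - t[f]) * t[e] for (e, f) in H2) \
        + sum(t[e] * (1 - t[f]) * (1 - t[g]) + t[f] * (1 - t[e]) * (1 - t[g])
              + t[g] * (1 - t[e]) * (1 - t[f]) for (e, f, g) in H3)
    Y2 = sum(t[e] * t[f] for (e, f) in H2) \
        + sum(t[e] * t[f] * (1 - t[g]) + t[f] * t[g] * (1 - t[e])
              + t[e] * (1 - t[f]) * t[g] for (e, f, g) in H3)
    assert Y1 == S1 + D1 - 2 * D2 + T1 - 2 * T2 + 3 * Y3     # (24)
    assert Y2 == D2 + T2 - 3 * Y3                             # (25)
    return True

# Example: a 5-cycle with one chord (an assumed toy graph).
# adj = {0: {1, 2, 4}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {0, 3}}
# check_decomposition(adj)
```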

We observe that, even after the change of variables y_e = 1 − t_e, Y_1 and Y_2 are not totally positive polynomials. This means that Theorem 1 cannot be applied directly to the Y_i's or X_i's. The strategy we adopt is to split Y_1 and Y_2 into several polynomials, each of which is totally positive, and then apply Theorem 1 to each of them. P = {Y_0, Y_3, S_1, D_1, D_2, T_1, T_2} forms the set of totally positive polynomials (proved below). Substituting the above equations into (3)-(6), we have the following system of equations that connect the X_i's and the set of totally positive polynomials P:

X_0 = Y_0 − ((1−p)/p)(S_1 + D_1 + T_1 − 2D_2 − 2T_2 + 3Y_3) + ((1−p)^2/p^2)(D_2 + T_2 − 3Y_3) − ((1−p)^3/p^3) Y_3
    = Y_0 − ((1−p)/p)(S_1 + D_1 + T_1) + ((1−p^2)/p^2)(D_2 + T_2) − ((1−p^3)/p^3) Y_3.   (26)

X_1 = (1/p)(S_1 + D_1 + T_1 − 2D_2 − 2T_2 + 3Y_3) − (2(1−p)/p^2)(D_2 + T_2 − 3Y_3) + (3(1−p)^2/p^3) Y_3
    = (1/p)(S_1 + D_1 + T_1) − (2/p^2)(D_2 + T_2) + (3/p^3) Y_3.   (27)

X_2 = (1/p^2)(D_2 + T_2 − 3Y_3) − (3(1−p)/p^3) Y_3 = (1/p^2)(D_2 + T_2) − (3/p^3) Y_3.   (28)

X_3 = (1/p^3) Y_3.   (29)

Let α_e, β_e, and ∆_e be the number of H_1's, H_2's, and H_3's containing an edge e in the original graph G. Let α, β and ∆ be the maxima of α_e, β_e, and ∆_e over all edges e.

We now show concentration results for the totally positive polynomials alone.

Lemma 1. Define variables y_e = 1 − t_e. Then Y_0 is totally positive in the y_e. With respect to the variables y_e, E_{≥1}[Y_0] ≤ 3 max{α, β, ∆}.

Proof. We have the following expectations of the partial derivatives, up to the third order:

E[∂Y_0/∂y_e] = α_e + (1−p) β_e + (1−p)^2 ∆_e ≤ 3 max{α_e, β_e, ∆_e},
E[∂^2 Y_0/∂y_e ∂y_f] ≤ 1 + (1−p) ≤ 2,
E[∂^3 Y_0/∂y_e ∂y_f ∂y_g] ≤ 1.

From the above equations, we have E_{≥1}[Y_0] ≤ 3 max{α, β, ∆} for a nonempty graph.

To satisfy E_{≥1}[Y_0] ≤ E[Y_0], it is sufficient to have

n_0 ≥ 3 max{α, β, ∆}.   (30)

This is because Y_0 ≥ n_0 with probability 1.

Lemma 2. Y_3 is totally positive in the t_e. With respect to the variables t_e, E_{≥1}[Y_3] ≤ max{1, p^2 ∆}.

Proof. We have the following expectations of the partial derivatives, up to the third order:

E[∂Y_3/∂t_e] = p^2 ∆_e,  E[∂^2 Y_3/∂t_e ∂t_f] = p ≤ 1,  E[∂^3 Y_3/∂t_e ∂t_f ∂t_g] ≤ 1.

From the above equations, we have E_{≥1}[Y_3] ≤ max{1, p^2 ∆}.

E_{≥1}[Y_3] ≤ E[Y_3] implies

p ≥ max{ 1/∛n_3, ∆/n_3 }.   (31)

Lemma 3. S_1 is totally positive in the t_e. With respect to the variables t_e, E_{≥1}[S_1] ≤ α.

Proof. We have the following expectations of the partial derivatives, up to the second order:

E[∂S_1/∂t_e] = α_e,  E[∂^2 S_1/∂t_e ∂t_f] = 0.

From the above equations, we have E_{≥1}[S_1] ≤ α.

E_{≥1}[S_1] ≤ E[S_1] implies

p ≥ α/n_1.   (32)

Lemma 4. D_1 is totally positive in the t_e. With respect to the variables t_e, E_{≥1}[D_1] ≤ β.

Proof. We have the following expectations of the partial derivatives, up to the second order:

E[∂D_1/∂t_e] = β_e,  E[∂^2 D_1/∂t_e ∂t_f] = 0.

From the above equations, we have E_{≥1}[D_1] ≤ β.

E_{≥1}[D_1] ≤ E[D_1] implies

p ≥ β/(2n_2).   (33)

Lemma 5. T_1 is totally positive in the t_e. With respect to the variables t_e, E_{≥1}[T_1] ≤ ∆.

Proof. We have the following expectations of the partial derivatives, up to the second order:

E[∂T_1/∂t_e] = ∆_e,  E[∂^2 T_1/∂t_e ∂t_f] = 0.

From the above equations, we have E_{≥1}[T_1] ≤ ∆.

E_{≥1}[T_1] ≤ E[T_1] implies

p ≥ ∆/(3n_3).   (34)

Lemma 6. D_2 is totally positive in the t_e. With respect to the variables t_e, E_{≥1}[D_2] ≤ max{pβ, 1}.

Proof. We have the following expectations of the partial derivatives, up to the second order:

E[∂D_2/∂t_e] = p β_e,  E[∂^2 D_2/∂t_e ∂t_f] ≤ 1.

From the above equations, we have E_{≥1}[D_2] ≤ max{pβ, 1}.

E_{≥1}[D_2] ≤ E[D_2] implies

p ≥ max{ β/n_2, 1/√n_2 }.   (35)

Lemma 7. T_2 is totally positive in the t_e. With respect to the variables t_e, E_{≥1}[T_2] ≤ max{2p∆, 1}.

Proof. We have the following expectations of the partial derivatives, up to the second order:

E[∂T_2/∂t_e] = 2p ∆_e,  E[∂^2 T_2/∂t_e ∂t_f] ≤ 1.

From the above equations, we have E_{≥1}[T_2] ≤ max{2p∆, 1}.


E_{≥1}[T_2] ≤ E[T_2] implies

p ≥ max{ 2∆/(3n_3), 1/√(3n_3) }.   (36)

Now, merging all the conditions (30)-(36), we get

n_0 ≥ 3 max{α, β, ∆},   p ≥ max{ 1/∛n_3, 1/√n_2, ∆/n_3, β/n_2, α/n_1 }.   (37)

Applying Theorem 1 to all the totally positive polynomials, along with condition (37), we get

P( |Y_0 − E[Y_0]| > a_3 √(E[Y_0] E_{≥1}[Y_0]) λ_1^3 ) = O( exp(−λ_1 + 2 log m) )
P( |Y_3 − E[Y_3]| > a_3 √(E[Y_3] E_{≥1}[Y_3]) λ_2^3 ) = O( exp(−λ_2 + 2 log m) )
P( |S_1 − E[S_1]| > a_1 √(E[S_1] E_{≥1}[S_1]) λ_3 ) = O( exp(−λ_3) )
P( |D_1 − E[D_1]| > a_1 √(E[D_1] E_{≥1}[D_1]) λ_4 ) = O( exp(−λ_4) )
P( |T_1 − E[T_1]| > a_1 √(E[T_1] E_{≥1}[T_1]) λ_5 ) = O( exp(−λ_5) )
P( |D_2 − E[D_2]| > a_2 √(E[D_2] E_{≥1}[D_2]) λ_6^2 ) = O( exp(−λ_6 + log m) )
P( |T_2 − E[T_2]| > a_2 √(E[T_2] E_{≥1}[T_2]) λ_7^2 ) = O( exp(−λ_7 + log m) )   (38)

Choose an ε > 0. We force the following conditions:

a_3 √(E[Y_0] E_{≥1}[Y_0]) λ_1^3 = ε E[Y_0]
a_3 √(E[Y_3] E_{≥1}[Y_3]) λ_2^3 = ε E[Y_3]
a_1 √(E[S_1] E_{≥1}[S_1]) λ_3 = ε E[S_1]
a_1 √(E[D_1] E_{≥1}[D_1]) λ_4 = ε E[D_1]
a_1 √(E[T_1] E_{≥1}[T_1]) λ_5 = ε E[T_1]
a_2 √(E[D_2] E_{≥1}[D_2]) λ_6^2 = ε E[D_2]
a_2 √(E[T_2] E_{≥1}[T_2]) λ_7^2 = ε E[T_2]

Let γ > 0. For the right-hand side of every equation in (38) to be O(exp(−γ log m)), assuming all the bounds in Lemmas 1-7, it is sufficient to have

n_0 / (3 max{α, β, ∆}) ≥ a_3^2 log^6(m^{2+γ}) / ε^2,
p / max{ 1/∛n_3, ∆/n_3 } ≥ a_3^2 log^6(m^{2+γ}) / ε^2,
p / max{ α/n_1, β/(2n_2), ∆/(3n_3) } ≥ a_1^2 log^2(m^γ) / ε^2,
p / max{ β/n_2, 2∆/(3n_3), 1/√n_2, 1/√(3n_3) } ≥ a_2^2 log^4(m^{1+γ}) / ε^2.   (39)

We can see that the conditions in (39) imply the conditions in (37). Further, the conditions in (39) can be simplified to remove some redundancy, as follows:

n_0 / (3 max{α, β, ∆}) ≥ a_3^2 log^6(m^{2+γ}) / ε^2,
p / max{ 1/∛n_3, ∆/n_3 } ≥ a_3^2 log^6(m^{2+γ}) / ε^2,
p / (α/n_1) ≥ a_1^2 log^2(m^γ) / ε^2,
p / max{ β/n_2, 1/√n_2 } ≥ a_2^2 log^4(m^{1+γ}) / ε^2.   (40)

This is due to the fact that a_3 ≥ a_2 ≥ a_1 and m^3 ≥ m^2. Therefore, subject to the conditions in (40), all totally positive polynomials Y_0, Y_3, D_1, D_2, S_1, T_1, T_2 concentrate within a multiplicative factor of (1 ± ε) with probability at least 1 − O(1/m^γ).

Under the above concentration result, let the deviations of the X_i's be denoted by δX_i. We now calculate the deviation of X_0 using (26):

δX_0 ≤ ε E[Y_0] + ε ((1−p)/p) (|E[S_1]| + |E[D_1]| + |E[T_1]|) + ε ((1−p^2)/p^2) (|E[D_2]| + |E[T_2]|) − ε ((1−p^3)/p^3) |E[Y_3]|
     ≤ ε (n_0 + n_1 + 3n_2 + 7n_3)
     ≤ 7ε (n_0 + n_1 + n_2 + n_3).

Similarly, for the other X_i's we get

δX_1 ≤ 12ε (n_1 + n_2 + n_3),   δX_2 ≤ 6ε (n_2 + n_3),   δX_3 ≤ ε n_3.

Therefore, sampling every edge independently with probability p satisfying all the conditions in (40), all X_i's concentrate within an additive gap of 12ε \binom{|V|}{3} with probability at least 1 − 1/m^γ. The constants in this proof can be tightened by a more accurate analysis.