Beyond Triangles: A Distributed Framework for Estimating 3-profiles of Large Graphs

Ethan R. Elenberg, Karthikeyan Shanmugam, Michael Borokhovich, Alexandros G. Dimakis

University of Texas, Austin, USA

August 12, 2015

Slides: eelenberg.github.io/Elenberg3profileKDD15.pdf


Introduction

• Perform analytics on large graphs

- World Wide Web, social networks, bioinformatics

• More descriptive than triangle count, clustering coefficient

• Scalable, distributed algorithms


3-profile

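• Count the induced subgraphs formed by selecting all triples of vertices

[Figure: the four possible induced subgraphs on 3 vertices: $H_0$ (empty), $H_1$ (single edge), $H_2$ (wedge), $H_3$ (triangle)]

Definition

Let $n_i$ be the number of $H_i$'s in a graph G. The vector $n(G) = [n_0, n_1, n_2, n_3]$ is called the 3-profile of G.

- Always sums to $\binom{|V|}{3}$, the total number of 3-subgraphs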


Examples

• 4-clique: n(K4) = [0, 0, 0, 4]


• 5-cycle: n(C5) = [0, 5, 5, 0]
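The two examples above can be checked by brute force. The following is a minimal sketch, not the method used in the talk: it enumerates every triple of vertices and classifies it by the number of edges it induces, which is enough to tell the four 3-vertex subgraphs apart. The function name `three_profile` and the convention that vertices are numbered 0..n-1 are assumptions made for this illustration.

```python
from itertools import combinations


def three_profile(num_vertices, edges):
    """Return [n0, n1, n2, n3] for an undirected simple graph.

    Assumes vertices are 0..num_vertices-1 and `edges` is a list of (u, v)
    pairs; both conventions are choices made for this sketch.
    """
    edge_set = {frozenset(e) for e in edges}
    profile = [0, 0, 0, 0]
    for triple in combinations(range(num_vertices), 3):
        # The number of edges among the three vertices (0..3) identifies H_i.
        k = sum(frozenset(pair) in edge_set for pair in combinations(triple, 2))
        profile[k] += 1
    return profile


k4_edges = list(combinations(range(4), 2))
c5_edges = [(i, (i + 1) % 5) for i in range(5)]
print(three_profile(4, k4_edges))  # [0, 0, 0, 4]
print(three_profile(5, c5_edges))  # [0, 5, 5, 0]
```

This runs in time cubic in the number of vertices, so it is only useful for checking small cases by hand.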


Related Terms

For each v ∈ V:

Definition

The local 3-profile counts how many times v participates in each $H_i$ with 2 other vertices.

Definition

The ego 3-profile is the 3-profile of the ego graph N(v).

- Graph induced by the set of neighbors Γ(v)
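Both definitions can be illustrated with a small brute-force sketch. It is not the distributed algorithm from the talk; the helper names (`adjacency`, `profile_of`, `local_profile`, `ego_profile`) and the 0..n-1 vertex numbering are assumptions for this example. Following the definition above, the ego graph is taken to be the graph induced by Γ(v), so v itself is excluded.

```python
from itertools import combinations


def adjacency(num_vertices, edges):
    """Build the neighbor sets Gamma(v) for vertices 0..num_vertices-1."""
    gamma = {v: set() for v in range(num_vertices)}
    for u, v in edges:
        gamma[u].add(v)
        gamma[v].add(u)
    return gamma


def profile_of(vertices, gamma):
    """3-profile of the subgraph induced by `vertices`."""
    counts = [0, 0, 0, 0]
    for a, b, c in combinations(sorted(vertices), 3):
        k = (b in gamma[a]) + (c in gamma[a]) + (c in gamma[b])
        counts[k] += 1
    return counts


def local_profile(v, gamma):
    """Count the triples that contain v, grouped by induced subgraph H_i."""
    counts = [0, 0, 0, 0]
    others = [u for u in gamma if u != v]
    for a, b in combinations(others, 2):
        k = (a in gamma[v]) + (b in gamma[v]) + (b in gamma[a])
        counts[k] += 1
    return counts


def ego_profile(v, gamma):
    """3-profile of the graph induced by the neighbors of v."""
    return profile_of(gamma[v], gamma)


k4 = adjacency(4, list(combinations(range(4), 2)))
c5 = adjacency(5, [(i, (i + 1) % 5) for i in range(5)])
print(local_profile(0, k4), ego_profile(0, k4))  # [0, 0, 0, 3] [0, 0, 0, 1]
print(local_profile(0, c5), ego_profile(0, c5))  # [0, 3, 3, 0] [0, 0, 0, 0]
```

Summing the local 3-profiles over all vertices counts every triple three times, which gives a quick consistency check against the global 3-profile.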


Motivation

• Global 3-profile concisely describes local connectivity

- Molecule classification

• Local and ego 3-profiles are feature vectors for each vertex

- Spam detection

- Generative models


Introduction

• Problem: Compute (or approximate) 3-profile quantities for a large graph

• Approach: Edge sub-sampling and distributed implementation


Contributions

1 Derive a 3-profile sparsifier with provable guarantees

2 Design distributed, graph engine algorithms to calculate local and ego 3-profiles

3 Evaluate performance on real-world datasets


Related Work

Well studied across several communities:

• Graph sub-sampling

[Kim, Vu ’00] [Tsourakakis, et al. ’08 -’11] [Ahmed, et al. ’14]

• Large-scale triangle counting

[Satish, et al. ’14] [Shank ’07] [Suri, Vassilvitskii ’11]

• Subgraph counting

[Alon, et al. ’97] [Kloks, et al. ’00] [Kowaluk, et al. ’13]

• Graphlets

[Przulj ’07] [Shervashidze, et al. ’09]


Outline

1 Introduction

2 3-profile Sparsifier

- Edge Sub-sampling Process

- Concentration Bound

3 3-PROF Algorithm

4 Experiments

5 Conclusions


Edge Sub-sampling Process

• Sub-sample each edge in the graph independently with probability p

• Relate the original and sub-sampled graphs via a 1-step Markov chain
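The sub-sampling step itself is short. The sketch below is one possible serial rendering, assuming the graph is held as a plain list of edges; the function name `subsample_edges` and the optional `seed` parameter are illustrative choices, not part of the talk.

```python
import random


def subsample_edges(edges, p, seed=None):
    """Keep each edge independently with probability p.

    `seed` is an illustrative parameter that makes a run reproducible.
    """
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so p = 1 keeps every edge
    # and p = 0 keeps none.
    return [e for e in edges if rng.random() < p]


ring = [(i, (i + 1) % 1000) for i in range(1000)]
sampled = subsample_edges(ring, p=0.5, seed=0)
print(len(sampled))  # roughly half of the 1000 edges
```

Because each edge is kept independently, a triangle survives intact with probability p^3, and a wedge survives intact with probability p^2.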


Edge Sub-sampling Process

[Figure: "Original" and "Sub-sampled" panels, connected by transitions labeled 1, (1−p), (1−p)^2, (1−p)^3, p, 2p(1−p), 3p(1−p)^2, p^2, 3p^2(1−p), p^3]

$$
\text{Estimator} =
\begin{bmatrix}
1 & 1-p & (1-p)^2 & (1-p)^3 \\
0 & p & 2p(1-p) & 3p(1-p)^2 \\
0 & 0 & p^2 & 3p^2(1-p) \\
0 & 0 & 0 & p^3
\end{bmatrix}^{-1}
\text{Sub-sampled}
$$
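Since the matrix in the estimator is upper triangular, it can be inverted by back-substitution. The following sketch assumes that entry (i, j) is the probability that an original $H_j$ becomes $H_i$ after sub-sampling, which matches the entries shown on the slide; the function names are illustrative and p is assumed to be strictly positive.

```python
from math import comb


def transition_matrix(p):
    """M[i][j]: probability that an original H_j becomes H_i after sampling."""
    return [
        [comb(j, i) * p**i * (1 - p) ** (j - i) if i <= j else 0.0
         for j in range(4)]
        for i in range(4)
    ]


def estimate_profile(sampled_profile, p):
    """Solve M x = sampled_profile by back-substitution (requires p > 0)."""
    m = transition_matrix(p)
    x = [0.0] * 4
    for i in range(3, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, 4))
        x[i] = (sampled_profile[i] - tail) / m[i][i]
    return x


# Expected sub-sampled profile of the 5-cycle, n = [0, 5, 5, 0], at p = 0.5.
expected_sampled = [3.75, 5.0, 1.25, 0.0]
print(estimate_profile(expected_sampled, 0.5))
```

Fed the expected sub-sampled profile, the estimator returns the original profile. On a single random sample it returns a noisy estimate, whose deviation the main result bounds.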


Main Result

Theorem (3-profile sparsifiers)

For all (ε,p)-balanced graphs∗, the $\ell_\infty$-norm of the 3-profile sparsifier error is bounded by $\epsilon \binom{|V|}{3}$ with high probability.

Definition

A graph is (ε,p)-balanced if the majority of “triangles,” “wedges,” or “single-edges” do not depend on one common edge.

Proof Sketch:

- Apply multivariate polynomial concentration inequalities [Kim, Vu ’00] to each estimator

$f(G, p) = e_1 e_2 e_4 + e_4 e_5 e_6 + \dots$
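To make the polynomial view concrete, the sketch below treats the number of triangles in the sub-sampled graph as a sum of products of independent edge indicators, one product per triangle, and computes its expectation exactly by enumerating every outcome on a tiny graph. Reading f as the triangle polynomial is an assumption made for this illustration, and the helper names are hypothetical. The enumeration is exponential in the number of edges, so it only demonstrates that the expectation equals $p^3 n_3$; it says nothing about concentration.

```python
from itertools import combinations


def triangles(edges):
    """List the triangles of a graph as triples of edge indices."""
    index = {frozenset(e): i for i, e in enumerate(edges)}
    vertices = sorted({v for e in edges for v in e})
    found = []
    for a, b, c in combinations(vertices, 3):
        sides = [frozenset(s) for s in ((a, b), (a, c), (b, c))]
        if all(s in index for s in sides):
            found.append(tuple(index[s] for s in sides))
    return found


def expected_sampled_triangles(edges, p):
    """Exact E[f], where f sums the product of edge indicators per triangle.

    Enumerates all 2^m edge subsets, so it is only usable for tiny graphs.
    """
    m = len(edges)
    tris = triangles(edges)
    total = 0.0
    for mask in range(1 << m):
        kept = bin(mask).count("1")
        prob = p**kept * (1 - p) ** (m - kept)
        f = sum(all((mask >> e) & 1 for e in tri) for tri in tris)
        total += prob * f
    return total


k4_edges = list(combinations(range(4), 2))
print(expected_sampled_triangles(k4_edges, 0.5))  # 4 triangles * 0.5**3
```

Dividing the sub-sampled triangle count by p^3 therefore gives an unbiased estimate of the original count, which is the last row of the matrix estimator.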


3-PROF

Vertex program in the Gather-Apply-Scatter framework

1 For each vertex v: Gather and Apply vertex IDs to store Γ(v)

2 For each edge va: Scatter

$n_{3,va} = |\Gamma(v) \cap \Gamma(a)|$,

$n^c_{2,va} = |\Gamma(v)| - |\Gamma(v) \cap \Gamma(a)| - 1, \dots$

3 For each vertex v: Gather and Apply

$n_{3,v} = \frac{1}{2} \sum_{a \in \Gamma(v)} n_{3,va}$

$n^c_{2,v} = \frac{1}{2} \sum_{a \in \Gamma(v)} n^c_{2,va}, \dots$
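The three steps above can be sketched serially for the two quantities the slide spells out. This is not the GraphLab vertex program: it is a single-machine illustration with hypothetical names (`neighbor_sets`, `edge_pivot_counts`), and it deliberately leaves out the quantities that the slide elides. Here $n^c_{2,v}$ is read as the number of wedges centered at v, which is what the per-edge formula counts.

```python
def neighbor_sets(num_vertices, edges):
    """Step 1: store Gamma(v) for every vertex 0..num_vertices-1."""
    gamma = {v: set() for v in range(num_vertices)}
    for u, v in edges:
        gamma[u].add(v)
        gamma[v].add(u)
    return gamma


def edge_pivot_counts(gamma):
    """Steps 2 and 3: per-edge quantities, then per-vertex sums.

    Returns (n3, n2c): triangles containing v and wedges centered at v.
    """
    n3, n2c = {}, {}
    for v in gamma:
        tri_sum = 0
        wedge_sum = 0
        for a in gamma[v]:
            common = len(gamma[v] & gamma[a])
            tri_sum += common                        # n_{3,va}
            wedge_sum += len(gamma[v]) - common - 1  # n^c_{2,va}
        # Each triangle or centered wedge at v is seen from two of its edges.
        n3[v] = tri_sum // 2
        n2c[v] = wedge_sum // 2
    return n3, n2c


c5 = neighbor_sets(5, [(i, (i + 1) % 5) for i in range(5)])
print(edge_pivot_counts(c5))
```

In the 5-cycle every vertex lies in no triangle and is the center of exactly one wedge, so every entry of the first dictionary is 0 and every entry of the second is 1.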


Implementation

• GraphLab PowerGraph v2.2

• Multicore server

- 256 GB RAM, 72 logical cores

• EC2 cluster (Amazon Web Services)

- 20 c3.8xlarge, 60 GB RAM, 32 logical cores each

Datasets

Name          Vertices      Edges (undirected)
Twitter       41,652,230    1,202,513,046
PLD           39,497,204    582,567,291
LiveJournal   4,846,609     42,851,237
Wikipedia     3,515,067     42,375,912
DBLP          317,080       1,049,866


Results: 3-profile Sparsifier Accuracy, 5 runs

[Figure: PLD, Accuracy, 3-profiles. Accuracy (exact/approx) for p = 0.7, 0.4, 0.1, 0.01; y-axis from 0.985 to 1.015. Series: triangles, wedges, edge, empty.]


Results: Multicore, 3 runs

Compare 3-PROF to GraphLab’s default triangle count

[Figure: Twitter and PLD, Multicore (p=1). Running time [sec] for Twitter and PLD; y-axis from 0 to 600. Series: 3-prof, Trian.]


Results: AWS, 5 runs

Compare EGO-PAR to naive, serial algorithm (EGO-SER)

[Figure: LiveJournal, AWS c3.8xlarge. Running time [sec] on a log scale from 10^-1 to 10^5 for 100, 1K, and 10K egos. Series: Ego-ser 12 nodes, Ego-par 12 nodes. Two bars are annotated ">1000 sec" and ">10000 sec".]


Results: AWS, 5 runs

[Figure: LiveJournal, AWS c3.8xlarge. Running time [sec] for 10K egos; y-axis from 0 to 14. Series: Ego-par 12 nodes, Ego-par 16 nodes, Ego-par 20 nodes.]


Summary

1 Edge sub-sampling produces fast, accurate 3-profile estimates

2 3-profile counting consumes roughly the same resources as triangle counting

3 Distributed algorithms scale well over large data and large computing clusters

github.com/eelenberg/3-profiles


(Backup) Edge Pivot Equations

$$
\sum_{a \in \Gamma(v)} \binom{n_{3,va}}{2} = F_2(v) + 3F_3(v)
$$


(Backup) Results: 3-profile Sparsifier Accuracy, 5 runs

[Figure: Twitter, Accuracy, 3-profiles. Accuracy (exact/approx) for p = 0.7, 0.5, 0.3, 0.1; y-axis from 0.996 to 1.004. Series: triangles, wedges, edge, empty.]


(Backup) Results: 3-PROF vs. TRIAN, AWS, 3 runs

[Figure: LiveJournal Running Time. LiveJournal, AWS c3.8xlarge; running time [sec] on 12, 16, and 20 nodes; y-axis from 0 to 7. Series: 3-prof and Trian, each at p = 1, 0.5, 0.1.]

[Figure: PLD Running Time. PLD, AWS c3.8xlarge; running time [sec] on 12, 16, and 20 nodes; y-axis from 0 to 120. Series: 3-prof and Trian, each at p = 1, 0.5, 0.1.]


(Backup) Results: 3-PROF vs. TRIAN, AWS, 3 runs

[Figure: LiveJournal Network Usage. LiveJournal, AWS c3.8xlarge; network sent [bytes] on 12, 16, and 20 nodes; y-axis from 0 to 1.0 ×10^10. Series: 3-prof and Trian, each at p = 1, 0.5, 0.1.]

[Figure: PLD Network Usage. PLD, AWS c3.8xlarge; network sent [bytes] on 12, 16, and 20 nodes; y-axis from 0 to 1.2 ×10^11. Series: 3-prof and Trian, each at p = 1, 0.5, 0.1.]


(Backup) Results: AWS, 5 runs

[Figure: EGO-SER. LiveJournal, AWS c3.8xlarge; running time [sec] for 100 egos; y-axis from 0 to 120. Series: Ego-ser 12 nodes, Ego-ser 16 nodes, Ego-ser 20 nodes.]

[Figure: EGO-PAR. LiveJournal, AWS c3.8xlarge; running time [sec] for 100 egos; y-axis from 0 to 12. Series: Ego-par 12 nodes, Ego-par 16 nodes, Ego-par 20 nodes.]
