Beyond Triangles: A Distributed Framework for Estimating 3-profiles of Large Graphs

Ethan R. Elenberg, Karthikeyan Shanmugam, Michael Borokhovich, Alexandros G. Dimakis

University of Texas, Austin, USA

August 12, 2015

E. R. Elenberg Beyond Triangles 1/20

Introduction

• Perform analytics on large graphs

- World Wide Web, social networks, bioinformatics

• More descriptive than triangle count, clustering coefficient

• Scalable, distributed algorithms

3-profile

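• Count the induced subgraphs formed by selecting all triples of vertices

[Figure: the four possible induced subgraphs on three vertices, $H_0$ (empty), $H_1$ (single edge), $H_2$ (wedge), $H_3$ (triangle)]

Definition

Let $n_i$ be the number of $H_i$'s in a graph G. The vector $n(G) = [n_0, n_1, n_2, n_3]$ is called the 3-profile of G.

- Always sums to $\binom{|V|}{3}$, the total number of 3-subgraphs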

Examples

• 4-clique: $n(K_4) = [0, 0, 0, 4]$ (all four vertex triples induce a triangle, $H_3$)

• 5-cycle: $n(C_5) = [0, 5, 5, 0]$ (no empty triples, 5 single edges, 5 wedges, no triangles)
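The two examples can be checked by brute force. The following is a minimal sketch, not the distributed method of the talk: it enumerates every vertex triple and indexes the result by the number of induced edges, which on three vertices identifies the subgraph type. The function name `three_profile` and the edge-list input format are assumptions made for illustration.

```python
from itertools import combinations


def three_profile(vertices, edges):
    """Return [n0, n1, n2, n3] by checking every triple of vertices."""
    edge_set = {frozenset(e) for e in edges}
    profile = [0, 0, 0, 0]
    for triple in combinations(vertices, 3):
        # On three vertices the number of induced edges (0..3)
        # determines the subgraph type H0..H3.
        k = sum(frozenset(pair) in edge_set for pair in combinations(triple, 2))
        profile[k] += 1
    return profile


k4 = three_profile(range(4), combinations(range(4), 2))
c5 = three_profile(range(5), [(i, (i + 1) % 5) for i in range(5)])
print(k4)  # [0, 0, 0, 4]
print(c5)  # [0, 5, 5, 0]
```

This costs time cubic in the number of vertices, so it is only useful for validating faster methods on small graphs.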

Related Terms

For each v ∈ V:

Definition

The local 3-profile counts how many times v participates in each $H_i$ with 2 other vertices.

Definition

The ego 3-profile is the 3-profile of the ego graph N(v).

- Graph induced by the set of neighbors Γ(v)
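Both definitions can be illustrated with a small brute-force sketch. It assumes the graph is given as a dictionary `adj` mapping each vertex to its neighbor set; the helper names are hypothetical. The local profile here is indexed only by subgraph type and does not record which role v plays inside a wedge.

```python
from itertools import combinations


def count_edges(nodes, adj):
    """Number of edges among the given nodes."""
    return sum(b in adj[a] for a, b in combinations(nodes, 2))


def local_three_profile(v, adj):
    """Count the triples containing v, indexed by number of induced edges."""
    profile = [0, 0, 0, 0]
    others = [u for u in adj if u != v]
    for a, b in combinations(others, 2):
        profile[count_edges((v, a, b), adj)] += 1
    return profile


def ego_three_profile(v, adj):
    """3-profile of the graph induced by the neighbors of v (v itself excluded)."""
    profile = [0, 0, 0, 0]
    for triple in combinations(sorted(adj[v]), 3):
        profile[count_edges(triple, adj)] += 1
    return profile


# 5-cycle and 4-clique as adjacency dictionaries
c5 = {i: {(i + 1) % 5, (i - 1) % 5} for i in range(5)}
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}

print(local_three_profile(0, c5))  # [0, 3, 3, 0]
print(local_three_profile(0, k4))  # [0, 0, 0, 3]
print(ego_three_profile(0, k4))    # [0, 0, 0, 1]
```

Summing the local profiles over all vertices gives three times the global 3-profile, since every triple is counted once from each of its three vertices.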

Motivation

• Global 3-profile concisely describes local connectivity

- Molecule classification

• Local and ego 3-profiles are feature vectors for each vertex

- Spam detection

- Generative models

Introduction

• Problem: Compute (or approximate) 3-profile quantities for a large graph

• Approach: Edge sub-sampling and distributed implementation

Contributions

1. Derive a 3-profile sparsifier with provable guarantees

2. Design distributed, graph engine algorithms to calculate local and ego 3-profiles

3. Evaluate performance on real-world datasets

Related Work

Well studied across several communities:

• Graph sub-sampling

[Kim, Vu ’00] [Tsourakakis, et al. ’08 -’11] [Ahmed, et al. ’14]

• Large-scale triangle counting

[Satish, et al. ’14] [Shank ’07] [Suri, Vassilvitskii ’11]

• Subgraph counting

[Alon, et al. ’97] [Kloks, et al. ’00] [Kowaluk, et al. ’13]

• Graphlets

[Przulj ’07] [Shervashidze, et al. ’09]

Outline

1. Introduction

2. 3-profile Sparsifier
   - Edge Sub-sampling Process
   - Concentration Bound

3. 3-PROF Algorithm

4. Experiments

5. Conclusions

Edge Sub-sampling Process

• Sub-sample each edge in the graph independently with probability p

• Relate the original and sub-sampled graphs via a 1-step Markov chain

[Figure: Original and Sub-sampled graphs, with transitions between 3-subgraph types labeled by probability: 1, (1−p), (1−p)^2, (1−p)^3, p, 2p(1−p), 3p(1−p)^2, p^2, 3p^2(1−p), p^3]

$$
\text{Estimator} =
\begin{bmatrix}
1 & 1-p & (1-p)^2 & (1-p)^3 \\
0 & p & 2p(1-p) & 3p(1-p)^2 \\
0 & 0 & p^2 & 3p^2(1-p) \\
0 & 0 & 0 & p^3
\end{bmatrix}^{-1}
\text{Sub-sampled}
$$
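The sub-sampling and inversion steps can be sketched on a single machine as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the counting helper is a brute-force stand-in for a real counting routine, and the inverse is applied by back-substitution because the transition matrix is upper triangular. It requires p > 0.

```python
import random
from itertools import combinations


def three_profile(vertices, edges):
    """Brute-force 3-profile, used here only as a stand-in counter."""
    edge_set = {frozenset(e) for e in edges}
    profile = [0, 0, 0, 0]
    for triple in combinations(vertices, 3):
        k = sum(frozenset(pair) in edge_set for pair in combinations(triple, 2))
        profile[k] += 1
    return profile


def transition_matrix(p):
    """Entry [i][j]: probability that an H_j becomes an H_i after sub-sampling."""
    q = 1.0 - p
    return [
        [1.0, q, q * q, q ** 3],
        [0.0, p, 2 * p * q, 3 * p * q * q],
        [0.0, 0.0, p * p, 3 * p * p * q],
        [0.0, 0.0, 0.0, p ** 3],
    ]


def estimate_profile(sampled_profile, p):
    """Solve M x = sampled_profile by back-substitution (M is upper triangular)."""
    m = transition_matrix(p)
    x = [0.0] * 4
    for i in range(3, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, 4))
        x[i] = (sampled_profile[i] - tail) / m[i][i]
    return x


def subsample_edges(edges, p, rng):
    """Keep each edge independently with probability p."""
    return [e for e in edges if rng.random() < p]


rng = random.Random(0)  # fixed seed so the run is repeatable
vertices = range(30)
edges = [e for e in combinations(vertices, 2) if rng.random() < 0.3]

exact = three_profile(vertices, edges)
sampled = three_profile(vertices, subsample_edges(edges, 0.5, rng))
estimate = estimate_profile(sampled, 0.5)
print(exact)
print([round(x, 1) for x in estimate])
```

Each column of the matrix sums to 1, so the estimate always sums to the same total number of triples as the sub-sampled profile; individual entries can be non-integer or slightly negative on small graphs.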

Main Result

Theorem (3-profile sparsifiers)

For all (ε,p)-balanced graphs*, the $\ell_\infty$-norm of the 3-profile sparsifier error is bounded by $\varepsilon \binom{|V|}{3}$ with high probability.

Definition

A graph is (ε,p)-balanced if the majority of “triangles,” “wedges,” or “single-edges” do not depend on one common edge.

Proof Sketch:

- Apply multivariate polynomial concentration inequalities [Kim, Vu ’00] to each estimator

$f(G, p) = e_1 e_2 e_4 + e_4 e_5 e_6 + \dots$

3-PROF

Vertex program in the Gather-Apply-Scatter framework

1. For each vertex v: Gather and Apply vertex IDs to store Γ(v)

2. For each edge va: Scatter

$n_{3,va} = |\Gamma(v) \cap \Gamma(a)|$,

$n^c_{2,va} = |\Gamma(v)| - |\Gamma(v) \cap \Gamma(a)| - 1$, . . .

3. For each vertex v: Gather and Apply

$n_{3,v} = \frac{1}{2} \sum_{a \in \Gamma(v)} n_{3,va}$

$n^c_{2,v} = \frac{1}{2} \sum_{a \in \Gamma(v)} n^c_{2,va}$, . . .
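The three steps can be mimicked serially to see what each one computes. The sketch below is a single-machine Python illustration and does not use the GraphLab PowerGraph API; `local_counts` and the adjacency-dictionary input are assumptions. It covers only the two quantities written out on the slide, the triangle count $n_{3,v}$ and the wedge count $n^c_{2,v}$.

```python
def local_counts(adj):
    """Serial sketch of the three steps for n_{3,v} and n^c_{2,v}."""
    # Step 1: every vertex holds its neighbor set; adj[v] plays the role of Γ(v).
    # Step 2: per-edge quantities from the neighbor sets of both endpoints.
    n3_edge, n2c_edge = {}, {}
    for v in adj:
        for a in adj[v]:
            common = len(adj[v] & adj[a])
            n3_edge[(v, a)] = common
            n2c_edge[(v, a)] = len(adj[v]) - common - 1
    # Step 3: every vertex sums over its incident edges and halves the total,
    # because each triangle or wedge at v is seen from two of v's edges.
    n3 = {v: sum(n3_edge[(v, a)] for a in adj[v]) // 2 for v in adj}
    n2c = {v: sum(n2c_edge[(v, a)] for a in adj[v]) // 2 for v in adj}
    return n3, n2c


# 4-clique, 5-cycle, and a star with center 0 and three leaves
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
c5 = {i: {(i + 1) % 5, (i - 1) % 5} for i in range(5)}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}

print(local_counts(k4))    # every vertex: 3 triangles, 0 wedges
print(local_counts(c5))    # every vertex: 0 triangles, 1 wedge
print(local_counts(star))  # center: 3 wedges; leaves: none
```

In this sketch $n^c_{2,v}$ comes out as the number of neighbor pairs of v that are not themselves adjacent, which is consistent with the per-edge formula: each such pair is seen once from each of the two edges joining v to the pair.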

Implementation

• GraphLab PowerGraph v2.2

• Multicore server

- 256 GB RAM, 72 logical cores

• EC2 cluster (Amazon Web Services)

- 20 c3.8xlarge, 60 GB RAM, 32 logical cores each

Datasets

Name          Vertices      Edges (undirected)
Twitter       41,652,230    1,202,513,046
PLD           39,497,204    582,567,291
LiveJournal   4,846,609     42,851,237
Wikipedia     3,515,067     42,375,912
DBLP          317,080       1,049,866

Results: 3-profile Sparsifier Accuracy, 5 runs

[Figure: PLD, Accuracy, 3-profiles. Accuracy (exact/approx) for triangles, wedges, edge, and empty at p = 0.7, 0.4, 0.1, 0.01; y-axis from 0.985 to 1.015]

Results: Multicore, 3 runs

Compare 3-PROF to GraphLab’s default triangle count

[Figure: Twitter and PLD, Multicore (p=1). Running time (sec) of 3-prof and Trian on Twitter and PLD; y-axis from 0 to 600]

Results: AWS, 5 runs

Compare EGO-PAR to naive, serial algorithm (EGO-SER)

[Figure: LiveJournal, AWS c3.8xlarge. Running time (sec, log scale from 10^-1 to 10^5) of Ego-ser and Ego-par on 12 nodes for 100, 1K, and 10K egos; annotations: >1000 sec, >10000 sec]

Results: AWS, 5 runs

[Figure: LiveJournal, AWS c3.8xlarge. Running time (sec) of Ego-par on 12, 16, and 20 nodes for 10k egos; y-axis from 0 to 14]

Summary

1. Edge sub-sampling produces fast, accurate 3-profile estimates

2. 3-profile counting consumes roughly the same resources as triangle counting

3. Distributed algorithms scale well over large data and large computing clusters

github.com/eelenberg/3-profiles

(Backup) Edge Pivot Equations

$\sum_{a \in \Gamma(v)} \binom{n_{3,va}}{2} = F_2(v) + 3F_3(v)$

(Backup) Results: 3-profile Sparsifier Accuracy, 5 runs

[Figure: Twitter, Accuracy, 3-profiles. Accuracy (exact/approx) for triangles, wedges, edge, and empty at p = 0.7, 0.5, 0.3, 0.1; y-axis from 0.996 to 1.004]

(Backup) Results: 3-PROF vs. TRIAN, AWS, 3 runs

[Figure, panel "LiveJournal Running Time": LiveJournal, AWS c3.8xlarge. Running time (sec) of 3-prof and Trian at p = 1, 0.5, 0.1 on 12, 16, and 20 nodes; y-axis from 0 to 7]

[Figure, panel "PLD Running Time": PLD, AWS c3.8xlarge. Running time (sec) of 3-prof and Trian at p = 1, 0.5, 0.1 on 12, 16, and 20 nodes; y-axis from 0 to 120]

(Backup) Results: 3-PROF vs. TRIAN, AWS, 3 runs

[Figure, panel "LiveJournal Network Usage": LiveJournal, AWS c3.8xlarge. Network sent (bytes, ×10^10) by 3-prof and Trian at p = 1, 0.5, 0.1 on 12, 16, and 20 nodes; y-axis from 0.0 to 1.0]

[Figure, panel "PLD Network Usage": PLD, AWS c3.8xlarge. Network sent (bytes, ×10^11) by 3-prof and Trian at p = 1, 0.5, 0.1 on 12, 16, and 20 nodes; y-axis from 0.0 to 1.2]

(Backup) Results: AWS, 5 runs

[Figure, panel "EGO-SER": LiveJournal, AWS c3.8xlarge. Running time (sec) of Ego-ser on 12, 16, and 20 nodes for 100 egos; y-axis from 0 to 120]

[Figure, panel "EGO-PAR": LiveJournal, AWS c3.8xlarge. Running time (sec) of Ego-par on 12, 16, and 20 nodes for 100 egos; y-axis from 0 to 12]
