The shortest path is not always a straight line

THE SHORTEST PATH IS NOT ALWAYS A STRAIGHT LINE

leveraging semi-metricity in large-scale graph analysis

Vasiliki Kalavri (kalavri@kth.se), KTH Royal Institute of Technology
Tiago Simas (tiago.simas@telefonica.com), Telefonica Research
Dionysios Logothetis (dionysios@fb.com), Facebook

Weighted graphs capture relationship strength

• edge weights: distance, similarity, social proximity, rating, preference
• analysis targets: influential nodes, optimal propagation paths, communities, recommendations

[Figure: example graph of users (Alice, Bob, Max) with edges weighted by number of likes]

Sparsification techniques reduce the graph size and still give exact or good approximate results

G → G', with f(G) ≈ f(G')

THE METRIC BACKBONE

Reduces the graph size while maintaining relevant structure

The minimum subgraph of a weighted graph that preserves the shortest paths of the original graph

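[Figure: example weighted graph on vertices A-E with edge weights 1, 2, 2, 3, 4, 10, and its metric backbone, which keeps only the four edges of weight 1, 2, 2, 3]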

WHAT CAN WE USE IT FOR?

• Exact computations
  • any algorithm that depends on the shortest paths
  • reachability, connectivity
  • betweenness centrality, closeness centrality

• Approximation
  • PageRank, random walks
  • eigenvector centrality
  • community detection, clustering

Improves community detection modularity and recommender systems accuracy

IMPACT ON LARGE-SCALE SYSTEMS

• Graph Databases
  • fewer edges => smaller path search space

• Batch Graph Processing
  • CPU and memory requirements depend on #messages
  • #messages proportional to #edges
  • fewer edges => improved analysis performance

• Graph Compression
  • fewer edges => storage reduction

BACKGROUND

SEMI-METRICITY

In a weighted graph, an edge is semi-metric if there exists a shorter indirect path between its endpoints

[Figure: example weighted graph on vertices A-E with edge weights 1, 2, 2, 3, 4, 10]


• CE is 1st-order semi-metric: C-D-E is a shorter 2-hop path
• AD is 2nd-order semi-metric: A-B-C-D is a shorter 3-hop path
• AB, BC, CD, DE are metric
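The definition can be checked by hand on the example graph. The sketch below is only illustrative: the assignment of weights to edges (`EDGES`) is an assumption chosen to be consistent with the statements above, and the helper names `weight`, `path_weight` and `is_shorter_detour` are hypothetical.

```python
# Assumed weight assignment for the example graph on vertices A-E.
# Only the multiset of weights (1, 2, 2, 3, 4, 10) comes from the example;
# which edge carries which weight is an assumption made for illustration.
EDGES = {
    ("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 1,
    ("D", "E"): 2, ("C", "E"): 4, ("A", "D"): 10,
}

def weight(u, v):
    """Weight of the undirected edge between u and v."""
    return EDGES[(u, v)] if (u, v) in EDGES else EDGES[(v, u)]

def path_weight(path):
    """Sum of the edge weights along a path given as a list of vertices."""
    return sum(weight(a, b) for a, b in zip(path, path[1:]))

def is_shorter_detour(path):
    """True if the indirect path is shorter than the direct edge between its endpoints."""
    return path_weight(path) < weight(path[0], path[-1])

print(is_shorter_detour(["C", "D", "E"]))       # 1 + 2 = 3 < 4, so CE is semi-metric
print(is_shorter_detour(["A", "B", "C", "D"]))  # 3 + 2 + 1 = 6 < 10, so AD is semi-metric
```

Under this assumed assignment both checks print True, matching the 1st-order (2-hop) and 2nd-order (3-hop) cases listed above.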

BACKBONE ALGORITHM

BACKBONE CALCULATION

• Calculating the backbone:
  • find all semi-metric edges: 1 BFS per edge?
  • compute APSP and store O(N^2) paths

Can we calculate or approximate the backbone without solving APSP?
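For reference, the APSP baseline could look like the sketch below. It uses Floyd-Warshall, which is one possible APSP method and not necessarily the one the authors had in mind; the edge weights and the function names `floyd_warshall` and `backbone_via_apsp` are assumptions for illustration. The N x N distance table is what makes this approach impractical for large graphs.

```python
# Assumed weight assignment for the example graph (illustrative only).
EDGES = {
    ("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 1,
    ("D", "E"): 2, ("C", "E"): 4, ("A", "D"): 10,
}

def floyd_warshall(vertices, edges):
    """All-pairs shortest distances of an undirected graph; stores O(N^2) entries."""
    inf = float("inf")
    dist = {u: {v: (0 if u == v else inf) for v in vertices} for u in vertices}
    for (u, v), w in edges.items():
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    for k in vertices:
        for i in vertices:
            for j in vertices:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def backbone_via_apsp(edges):
    """Keep an edge only if its weight equals the shortest distance between its endpoints."""
    vertices = sorted({x for edge in edges for x in edge})
    dist = floyd_warshall(vertices, edges)
    return {e: w for e, w in edges.items() if dist[e[0]][e[1]] == w}

print(sorted(backbone_via_apsp(EDGES)))
```

With the assumed weights, the edges AB, BC, CD and DE survive, while CE and AD are dropped as semi-metric.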

ORDER OF SEMI-METRICITY

Most semi-metric edges are 1st-order semi-metric

A 3-PHASE BACKBONE ALGORITHM

1. Find 1st-order semi-metric edges: only look at triangles

Scalable & practical for large graphs

EXAMPLE

Phase 1

[Figure: the example graph before and after Phase 1; the edge of weight 4 (the 1st-order semi-metric edge CE) is removed]
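A minimal single-machine sketch of the Phase 1 idea: for every edge, look only at the triangles it closes, that is, at the common neighbors of its endpoints. The edge weights and the names `build_adjacency` and `first_order_semi_metric` are assumptions for illustration; this is not the distributed implementation.

```python
# Assumed weight assignment for the example graph (illustrative only).
EDGES = {
    ("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 1,
    ("D", "E"): 2, ("C", "E"): 4, ("A", "D"): 10,
}

def build_adjacency(edges):
    """Undirected adjacency map: vertex -> {neighbor: weight}."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def first_order_semi_metric(edges):
    """Edges (u, v) for which some common neighbor k gives a shorter 2-hop path u-k-v."""
    adj = build_adjacency(edges)
    found = set()
    for (u, v), w in edges.items():
        # Common neighbors of u and v are exactly the triangles containing (u, v).
        for k in adj[u].keys() & adj[v].keys():
            if adj[u][k] + adj[k][v] < w:
                found.add((u, v))
                break
    return found

print(first_order_semi_metric(EDGES))
```

With the assumed weights, only CE is reported: C-D-E has weight 3, which is below 4. AD is not detected here because A and D have no common neighbor, so it is left for the later phases.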

A 3-PHASE BACKBONE ALGORITHM

1. Find 1st-order semi-metric edges: only look at triangles

Scalable & practical for large graphs

Most semi-metric edges have been removed

2. Identify metric edges in 2-hop paths

EXAMPLE

Phase 2

[Figure: the example graph after Phase 1; four edges are labeled M (metric) and one edge is still unlabeled (?)]

The lowest-weight edge of every vertex is metric: any indirect path from u to v would have larger weight

[Figure: illustration on two vertices u and v, with edge weights 2, 4, 2, 1]
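The lowest-weight rule can be sketched as follows. This covers only the rule stated above, not necessarily everything Phase 2 does. It assumes strictly positive weights, and the edge weights and the name `lowest_weight_edges` are assumptions for illustration.

```python
# Assumed state of the example graph after Phase 1 (edge CE already removed).
AFTER_PHASE_1 = {
    ("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 1,
    ("D", "E"): 2, ("A", "D"): 10,
}

def build_adjacency(edges):
    """Undirected adjacency map: vertex -> {neighbor: weight}."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def lowest_weight_edges(edges):
    """Label as metric one lowest-weight incident edge of every vertex.

    Any indirect path leaving u starts with an edge at least as heavy as u's
    lightest edge and then adds more positive weight, so it cannot be shorter.
    """
    adj = build_adjacency(edges)
    metric = set()
    for u, neighbors in adj.items():
        v = min(neighbors, key=neighbors.get)
        metric.add(tuple(sorted((u, v))))
    return metric

metric = lowest_weight_edges(AFTER_PHASE_1)
unlabeled = set(AFTER_PHASE_1) - metric
print(sorted(metric), sorted(unlabeled))
```

With the assumed weights, AB, BC, CD and DE are labeled metric and only AD stays unlabeled, which mirrors the four M labels and the single open edge in the example.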

A 3-PHASE BACKBONE ALGORITHM

1. Find 1st-order semi-metric edges: only look at triangles!

Scalable & practical for large graphs!

Most semi-metric edges have been removed

2. Identify metric edges in 2-hop paths

3. Run a BFS for remaining unlabeled edges (1%-9% of edges)

EXAMPLE

Phase 3

[Figure: the example graph with four edges labeled M; a BFS is run for the remaining unlabeled edge]

• Explore paths with shorter distances only
• If the BFS arrives at the target, the edge is semi-metric
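One way to read the Phase 3 rule is as a pruned, queue-based search: start at one endpoint of the unlabeled edge, never follow the direct edge, and drop any partial path whose weight is not below the weight of that edge. The sketch below follows this reading. It assumes strictly positive weights, and the edge weights and the name `is_semi_metric` are assumptions for illustration, not the authors' code.

```python
from collections import deque

# Assumed state of the example graph after Phase 1 (edge CE already removed).
AFTER_PHASE_1 = {
    ("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 1,
    ("D", "E"): 2, ("A", "D"): 10,
}

def build_adjacency(edges):
    """Undirected adjacency map: vertex -> {neighbor: weight}."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def is_semi_metric(adj, u, v):
    """True if some indirect path from u to v is shorter than the edge (u, v)."""
    limit = adj[u][v]
    best = {u: 0}          # best known distance from u to each visited vertex
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y, w in adj[x].items():
            if x == u and y == v:
                continue   # skip the direct edge itself
            d = best[x] + w
            if d >= limit:
                continue   # prune: only explore paths with shorter distances
            if y == v:
                return True  # the search arrived at the target
            if d < best.get(y, limit):
                best[y] = d
                queue.append(y)
    return False

adj = build_adjacency(AFTER_PHASE_1)
print(is_semi_metric(adj, "A", "D"), is_semi_metric(adj, "A", "B"))
```

With the assumed weights, AD is found to be semi-metric through A-B-C-D (weight 6, below 10), while the search for AB is pruned immediately because the only alternative first hop already costs 10.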

EXAMPLE

Metric Backbone

[Figure: the resulting metric backbone on vertices A-E, with edge weights 1, 2, 2, 3]

DISTRIBUTED IMPLEMENTATION

Implementation in the vertex-centric model

code available: http://grafos.ml/okapi.html#analytics

EVALUATION

EVALUATION GOALS

• How does our algorithm compare to APSP?

• Are large, real-world graphs semi-metric?

• Can we improve graph analysis performance?

COMPARISON TO APSP

Computing APSP in Giraph
• multiple SSSPs: in the order of months for million-edge graphs
• multiple MSSPs, i.e. SSSPs from several sources in parallel: in the order of days for million-edge graphs

Our algorithm is 120-180x faster than SSSP and 11-14x faster than MSSP: order of hours for million-edge graphs

ALGORITHM PHASES

• Phase 1: very fast and scalable; removes up to 90% of semi-metric edges
• Phase 2: moderately fast; labels up to 60% of the unlabeled edges
• Phase 3: slow; labels 1-9% of the total edges

Phase 1 is the fastest and most useful phase

PHASE 1 SCALABILITY

• almost linear scalability
• <200s on a billion-edge graph

SEMI-METRICITY IN REAL GRAPHS

Graph         |V|     |E|     weight metric             semi-metricity
Facebook      190M    49.9B   custom                    26.5%
Twitter       40M     1.5B    jaccard                   39%
Tuenti        12M     685M    jaccard                   59%
Livejournal   4.8M    34M     jaccard                   40%
NotreDame     0.3M    1.5M    jaccard, adamic           45%-29%
DBLP          318K    1M      jaccard, adamic           23%-9%
Twitter-ego   81K     1.7M    jaccard, adamic           57%-39%
Movielens     1.6K    1.9M    jaccard                   88%
Facebook      1K      143K    #messages, message size   78%-77%
US-Airports   0.5K    6K      #passengers               72%
C-Elegans     0.3K    2.3K    #connections              17%

% 1st-order semi-metric edges => reduction in memory and communication

QUERY SPEEDUP ON NEO4J

6.7x speedup

APACHE GIRAPH SPEEDUP

Including the time to calculate the backbone

4x speedup

APACHE GIRAPH SPEEDUP

6x speedup

COMMUNICATION REDUCTION

Up to 70% for highly semi-metric graphs

BEST PRACTICES

When to use the backbone?

• semi-metric weighting schemes, e.g. neighborhood similarity
• we can amortize the overhead: e.g. many algorithms on the same graph, multiple distance queries
• lossy compression is ok

When not to use the backbone?

• for metric weighting schemes
• we need to run a one-off analysis
• we need lossless compression

RECAP: MAIN CONTRIBUTIONS

• An algorithm for computing the metric backbone without solving APSP
• An open-source distributed implementation
• Graph query and graph analytics speedup on Neo4j and Apache Giraph
