KTrussExplorer: Exploring the Design Space of K-truss … · 2021. 1. 26. · Methodology •Software: KtrussExplorer kernels are implemented in CUDA •System: Evaluation is on one

KTrussExplorer: Exploring the Design Space of K-truss Decomposition

Optimizations on GPUsSafaa Diab, Mhd Ghaith Olabi, Izzat El Hajj

American University of Beirut

HPEC Graph Challenge

September 23, 2020

Overview

KTrussExplorer is a highly parameterized framework for exploring different combinations of k-truss decomposition optimizations on GPUs

Supported features:

• Edge-centric parallelization

• Undirected or directed graphs• Directed by index or by degree

• Tiling the adjacency matrix

• Parallelizing intersections

• Removing or marking weak edges

• Recomputing for all or affected edges

Contributions:

• A survey of optimizations

• A framework for exploring the design space

• A view of the design space

• Unexplored combinations faster than prior champions

github.com/ielhajj/ktruss-explorer

https://github.com/ielhajj/ktruss-explorer

Methodology

• Software: KtrussExplorer kernels are implemented in CUDA

• System: Evaluation is on one Volta V100 GPU with 16GB of memory

• Datasets: We evaluate with all graphs in the graph challenge collection• Except: Friendster, graph500-scale24-ef16, and graph500-scale25-ef16 due to

limited device memory capacity.

• Search space: Design space is searched exhaustively• Except: very large graphs

Graph Directedness

0.25

0.5

1

2

4

8

16

32

64

0.0000010.000010.00010.001 0.01 0.1 1 10 100

Spe

ed

up

of

Dir

ect

ed

ove

r U

nd

ire

cte

dAverage Number of Triangles per Edge

Directed is faster

Undirected is faster

10-6 10-5 10-4 10-3 10-2 10-1 100 101 102

k = 3

Undirected Graph Directed Graph

0

1 2

0

1 2

support

{0,1} {0,2} {1,0} {1,2} {2,0} {2,1}

support

{0,1} {0,2} {1,2}

+1 +1 +1 +1 +1 +1 +1 +1 +1

Less redundancy Less synchronization (no atomics) Stop counting early

Directing Edges by Degree

0.5

1

2

4

8

1 10 100 1000 10000 1000001000000Sp

ee

du

p o

f D

ire

cte

d b

y D

egr

ee

ove

r D

ire

cte

d b

y In

dex

Maximum Vertex Degree

Directed by degree is fasterDirected by index is faster

100 101 102 103 104 105 106

k = 3

Directed by index

• Keep edges from vertex with lower index to vertex with higher index

Directed by degree

• Keep edges from vertex with lower degree to vertex with higher degree

Advantage: shrink large adjacency lists to reduce load imbalance

Tiling

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

01

3 5

2

7 6

4

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

srcPtr

0 4 7 10 12 14 18 29 24

dstIdx

1 5 6 7 0 3 5 4 5 7 1 5 2 7 0 1 2 3 0 7 0 2 4 6

srcPtr

0 1 3 3 4 7 8 11 12 13 17 18 20 21 21 22 24

dstIdx

1 0 3 1 5 6 7 5 4 5 7 5 2 0 1 2 3 0 0 2 7 7 4 6

Example Graph Logical Adjacency List without Tiling

Logical Adjacency List with Tiling

CSR Representation

Tiled CSR Representation

Better locality Partitioning intersections into smaller sub-intersections

Benefits of Tiling

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

Memory Access Pattern without Tiling Memory Access Pattern with Tiling

Good locality

- Bad locality

Good locality

Good locality

Benefits of Tiling

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

Intersection without Tiling Intersection with Tiling

Intersection

Sub-intersection 1(trivially empty)


Benefits of Tiling

0.8

0.9

1.0

1.1

1.2

1.3

1.4

1.5

1.6

1.7

1.8

1 2 4 8 16 32

Spe

ed

up

of

Tilin

g

Average Vertex Degree

Tiling is faster

No tiling is faster

k = 3

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

Intersection with Tiling



Parallelizing Intersections

0.6

0.7

0.8

0.9

1.0

1.1

1.2

1.3

10 100 1,00010,000100,0001,000,00010,000,000100,000,0001,000,000,000

Spe

ed

up

of

Par

alle

lizin

g In

ters

ect

ion

s

Number of Edges

Parallelization is faster

No parallelization is faster

101 102 103 104 105 106 107 108 109

k = 3

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

0 1 2 3 4 5 6 7

0

1

2

3

4

5

6

7

Sub-intersection 1

Sub-intersection 2

Removing Deleted Edges Intermediately

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.8

2.0

2.2

1 10 100 1,00010,000100,0001,000,00010,000,000100,000,0001,000,000,000Sp

ee

du

p o

f R

em

ovi

ng

De

lete

d

Edge

s In

term

ed

iate

lyNumber of Edges

Removing deletededges intermediately isfasterNot removing deletededges intermediately isfaster

101 102 103 104 105 106 107 108 109

k = kmax

srcPtr

dstIdx

srcPtr

dstIdx

x x x x x

Mark deleted edges

No overhead to remove edges

weak edges

srcPtr

dstIdx

Remove deleted edges (for select iterations)

Shorter intersections

Recomputing Support for All or Affected Edges

Directed Graph

0

2 3

4 5

1

Undirected Graph

0

2 3

4 5

1

Edges that are not affected and whose threads do not need to recount

Edges that are not affected but whose threads need to recount on behalf of affected edgesEdges that are affected and whose threads need to recount

Weak edges that were deletedGraphs performing better with only affected edges reprocessed:• graph500-scale20-ef16• graph500-scale21-ef16• graph500-scale23-ef16

For further investigation:• Recomputing for affected

edges on select iterations (later iterations)

Marking Affected Edges

0

2 3

4 5

1 01: parallel for e = {u, v} ∈ E do02: if e is deleted then03: mark u as affected, mark v as affected

Pseudocode for Marking Affected Edges



Weak edges that were deleted

4 5


0

2 3

4 5

1 01: parallel for e = {u, v} ∈ E do02: if e is deleted then03: mark u as affected, mark v as affected04: parallel for e = {u, v} ∈ E do05: if e is not deleted and (u is affected or v is affected) then06: mark e as affected07: if u is not affected then mark u as needs to recount08: else if v is not affected then mark v as needs to recount





2 3


0

2 3

4 5

1 01: parallel for e = {u, v} ∈ E do02: if e is deleted then03: mark u as affected, mark v as affected04: parallel for e = {u, v} ∈ E do05: if e is not deleted and (u is affected or v is affected) then06: mark e as affected07: if u is not affected then mark u as needs to recount08: else if v is not affected then mark v as needs to recount09: parallel for e = {u, v} ∈ E do10: if e is not deleted and e is not affected then11: if u needs to recount or v needs to recount then12: mark e as needs to recount





Comparison with Prior Champions

0.25

0.5

1

2

4

8

10 100 1,00010,000100,0001,000,00010,000,000100,000,0001,000,000,000

Spe

ed

up

ove

r 2

01

8 C

ham

pio

ns

(Bis

son

& F

atic

a)

Number of Edges

101 102 103 104 105 106 107 108 109

k = 3

KTrussExplorer: Exploring the Design Space of K-truss Decomposition

Optimizations on GPUsSafaa Diab, Mhd Ghaith Olabi, Izzat El Hajj

American University of Beirut

github.com/ielhajj/ktruss-explorer

https://github.com/ielhajj/ktruss-explorer

Documents

KTrussExplorer: Exploring the Design Space of K-truss … · 2021. 1. 26. · Methodology •Software: KtrussExplorer kernels are implemented in CUDA •System: Evaluation is on one