Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
KTrussExplorer: Exploring the Design Space of K-truss Decomposition
Optimizations on GPUsSafaa Diab, Mhd Ghaith Olabi, Izzat El Hajj
American University of Beirut
HPEC Graph Challenge
September 23, 2020
Overview
KTrussExplorer is a highly parameterized framework for exploring different combinations of k-truss decomposition optimizations on GPUs
Supported features:
• Edge-centric parallelization
• Undirected or directed graphs• Directed by index or by degree
• Tiling the adjacency matrix
• Parallelizing intersections
• Removing or marking weak edges
• Recomputing for all or affected edges
Contributions:
• A survey of optimizations
• A framework for exploring the design space
• A view of the design space
• Unexplored combinations faster than prior champions
github.com/ielhajj/ktruss-explorer
Methodology
• Software: KtrussExplorer kernels are implemented in CUDA
• System: Evaluation is on one Volta V100 GPU with 16GB of memory
• Datasets: We evaluate with all graphs in the graph challenge collection• Except: Friendster, graph500-scale24-ef16, and graph500-scale25-ef16 due to
limited device memory capacity.
• Search space: Design space is searched exhaustively• Except: very large graphs
Graph Directedness
0.25
0.5
1
2
4
8
16
32
64
0.0000010.000010.00010.001 0.01 0.1 1 10 100
Spe
ed
up
of
Dir
ect
ed
ove
r U
nd
ire
cte
dAverage Number of Triangles per Edge
Directed is faster
Undirected is faster
10-6 10-5 10-4 10-3 10-2 10-1 100 101 102
k = 3
Undirected Graph Directed Graph
0
1 2
0
1 2
support
{0,1} {0,2} {1,0} {1,2} {2,0} {2,1}
support
{0,1} {0,2} {1,2}
+1 +1 +1 +1 +1 +1 +1 +1 +1
Less redundancy Less synchronization (no atomics) Stop counting early
Directing Edges by Degree
0.5
1
2
4
8
1 10 100 1000 10000 1000001000000Sp
ee
du
p o
f D
ire
cte
d b
y D
egr
ee
ove
r D
ire
cte
d b
y In
dex
Maximum Vertex Degree
Directed by degree is fasterDirected by index is faster
100 101 102 103 104 105 106
k = 3
Directed by index
• Keep edges from vertex with lower index to vertex with higher index
Directed by degree
• Keep edges from vertex with lower degree to vertex with higher degree
Advantage: shrink large adjacency lists to reduce load imbalance
Tiling
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
01
3 5
2
7 6
4
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
srcPtr
0 4 7 10 12 14 18 29 24
dstIdx
1 5 6 7 0 3 5 4 5 7 1 5 2 7 0 1 2 3 0 7 0 2 4 6
srcPtr
0 1 3 3 4 7 8 11 12 13 17 18 20 21 21 22 24
dstIdx
1 0 3 1 5 6 7 5 4 5 7 5 2 0 1 2 3 0 0 2 7 7 4 6
Example Graph Logical Adjacency List without Tiling
Logical Adjacency List with Tiling
CSR Representation
Tiled CSR Representation
Better locality Partitioning intersections into smaller sub-intersections
Benefits of Tiling
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
Memory Access Pattern without Tiling Memory Access Pattern with Tiling
Good locality
- Bad locality
Good locality
Good locality
Benefits of Tiling
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
Intersection without Tiling Intersection with Tiling
Intersection
Sub-intersection 1(trivially empty)
Sub-intersection 2(trivially empty)
Benefits of Tiling
0.8
0.9
1.0
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1 2 4 8 16 32
Spe
ed
up
of
Tilin
g
Average Vertex Degree
Tiling is faster
No tiling is faster
k = 3
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
Intersection with Tiling
Sub-intersection 1(trivially empty)
Sub-intersection 2(trivially empty)
Parallelizing Intersections
0.6
0.7
0.8
0.9
1.0
1.1
1.2
1.3
10 100 1,00010,000100,0001,000,00010,000,000100,000,0001,000,000,000
Spe
ed
up
of
Par
alle
lizin
g In
ters
ect
ion
s
Number of Edges
Parallelization is faster
No parallelization is faster
101 102 103 104 105 106 107 108 109
k = 3
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
0 1 2 3 4 5 6 7
0
1
2
3
4
5
6
7
Sub-intersection 1
Sub-intersection 2
Removing Deleted Edges Intermediately
0.4
0.6
0.8
1.0
1.2
1.4
1.6
1.8
2.0
2.2
1 10 100 1,00010,000100,0001,000,00010,000,000100,000,0001,000,000,000Sp
ee
du
p o
f R
em
ovi
ng
De
lete
d
Edge
s In
term
ed
iate
lyNumber of Edges
Removing deletededges intermediately isfasterNot removing deletededges intermediately isfaster
101 102 103 104 105 106 107 108 109
k = kmax
srcPtr
dstIdx
srcPtr
dstIdx
x x x x x
Mark deleted edges
No overhead to remove edges
weak edges
srcPtr
dstIdx
Remove deleted edges (for select iterations)
Shorter intersections
Recomputing Support for All or Affected Edges
Directed Graph
0
2 3
4 5
1
Undirected Graph
0
2 3
4 5
1
Edges that are not affected and whose threads do not need to recount
Edges that are not affected but whose threads need to recount on behalf of affected edgesEdges that are affected and whose threads need to recount
Weak edges that were deletedGraphs performing better with only affected edges reprocessed:• graph500-scale20-ef16• graph500-scale21-ef16• graph500-scale23-ef16
For further investigation:• Recomputing for affected
edges on select iterations (later iterations)
Marking Affected Edges
0
2 3
4 5
1 01: parallel for e = {u, v} ∈ E do02: if e is deleted then03: mark u as affected, mark v as affected
Pseudocode for Marking Affected Edges
Edges that are not affected and whose threads do not need to recount
Edges that are not affected but whose threads need to recount on behalf of affected edgesEdges that are affected and whose threads need to recount
Weak edges that were deleted
4 5
Marking Affected Edges
0
2 3
4 5
1 01: parallel for e = {u, v} ∈ E do02: if e is deleted then03: mark u as affected, mark v as affected04: parallel for e = {u, v} ∈ E do05: if e is not deleted and (u is affected or v is affected) then06: mark e as affected07: if u is not affected then mark u as needs to recount08: else if v is not affected then mark v as needs to recount
Pseudocode for Marking Affected Edges
Edges that are not affected and whose threads do not need to recount
Edges that are not affected but whose threads need to recount on behalf of affected edgesEdges that are affected and whose threads need to recount
Weak edges that were deleted
2 3
Marking Affected Edges
0
2 3
4 5
1 01: parallel for e = {u, v} ∈ E do02: if e is deleted then03: mark u as affected, mark v as affected04: parallel for e = {u, v} ∈ E do05: if e is not deleted and (u is affected or v is affected) then06: mark e as affected07: if u is not affected then mark u as needs to recount08: else if v is not affected then mark v as needs to recount09: parallel for e = {u, v} ∈ E do10: if e is not deleted and e is not affected then11: if u needs to recount or v needs to recount then12: mark e as needs to recount
Pseudocode for Marking Affected Edges
Edges that are not affected and whose threads do not need to recount
Edges that are not affected but whose threads need to recount on behalf of affected edgesEdges that are affected and whose threads need to recount
Weak edges that were deleted
Comparison with Prior Champions
0.25
0.5
1
2
4
8
10 100 1,00010,000100,0001,000,00010,000,000100,000,0001,000,000,000
Spe
ed
up
ove
r 2
01
8 C
ham
pio
ns
(Bis
son
& F
atic
a)
Number of Edges
101 102 103 104 105 106 107 108 109
k = 3
KTrussExplorer: Exploring the Design Space of K-truss Decomposition
Optimizations on GPUsSafaa Diab, Mhd Ghaith Olabi, Izzat El Hajj
American University of Beirut
github.com/ielhajj/ktruss-explorer