Dependency-Driven Disk-based Graph Processing
Keval Vora
Simon Fraser University
https://github.com/pdclab/lumos
USENIX ATC'19 - Renton, WA, July 11, 2019
Graph Processing
Iterative Graph Processing
[Figure: PageRank computation on a six-vertex example graph (a-f), with vertex values updated from iteration t to iteration t+1.]
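The iterative computation on this slide can be written as a plain bulk synchronous PageRank loop, where every iteration-(t+1) value is computed only from iteration-t values. A minimal sketch: the toy edge list below is an illustrative stand-in (the slide's exact edges are not given), and the damping factor 0.85 is the conventional choice, not a value from the talk.

```python
# Minimal synchronous PageRank: values for iteration t+1 are computed
# only from values of iteration t (bulk synchronous parallel semantics).
def pagerank(edges, num_vertices, iterations=10, d=0.85):
    out_deg = [0] * num_vertices
    for u, v in edges:
        out_deg[u] += 1
    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iterations):
        contrib = [0.0] * num_vertices
        for u, v in edges:  # stream every edge once per iteration
            contrib[v] += rank[u] / out_deg[u]
        rank = [(1 - d) / num_vertices + d * c for c in contrib]
    return rank

# Toy 6-vertex graph (a..f mapped to 0..5); every vertex has out-degree >= 1.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 2)]
ranks = pagerank(edges, 6)
```

Because every vertex here has at least one out-edge, the rank mass is conserved and the values sum to 1 after each iteration.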
Out-of-Core Graph Processing
[Figure: the vertex set V = {a, b, c, d, e, f} is split into partitions P0 and P1 that reside on disk; each iteration loads, processes, and writes back one partition at a time, alternating I/O and compute phases.]
I/O accounts for 69-90% of execution time.
Lumos
• Out-of-Order processing
• Amortize I/O across multiple iterations
• Guarantee Bulk Synchronous Parallel semantics
• Dependency-Driven Cross-Iteration value propagation
• Safely compute future values
Computing Future Values
[Figure: while iteration t streams partitions P0 and P1, future vertex values VF for iteration t+1 are computed alongside; vertices whose future dependencies are not met must wait for the next iteration.]
[Figure: percentage of vertices computing future values and percentage of edges they contribute, under Chunk, Cyclic, and Random partitioning with 16, 64, and 256 partitions, on UK (1B edges, 39.5M vertices), TW (1.5B, 41.7M), TT (2B, 52.6M), and FT (2.5B, 68.3M).]
30-45% of vertices compute future values, yet they contribute less than 4% of edges.
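The dependency condition behind this slide can be sketched as a simple predicate: a vertex can have its iteration-(t+1) value computed during iteration t only if every in-neighbor's iteration-t value is final by then, i.e. no in-edge arrives from a partition processed later than its own. The partitioning and edge list below are hypothetical illustrations, not the slide's graph.

```python
# Sketch: a vertex's future (iteration t+1) value can safely be computed
# during iteration t only if all of its cross-iteration dependencies are met,
# i.e. every in-edge comes from a partition processed no later than its own.
def future_safe_vertices(edges, partition_of):
    unsafe = set()
    for u, v in edges:
        if partition_of[u] > partition_of[v]:
            unsafe.add(v)  # v depends on a later partition: future value unsafe
    return set(partition_of) - unsafe

# Hypothetical layout: P0 = {0, 1, 2}, P1 = {3, 4, 5}.
partition_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
edges = [(0, 3), (1, 2), (4, 0), (2, 5)]
safe = future_safe_vertices(edges, partition_of)  # vertex 0 is not safe: edge (4, 0)
```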
Graph Computation
[Figure: partial aggregation of edge contributions, finalized by the vertex function f.]
Cross-Iteration Value Propagation
[Figure: partial aggregates built from future values during iteration t are merged with contributions streamed in iteration t+1, eliminating the I/O for the corresponding edges.]
[Figure: percentage of edges over which future values are computed, under Chunk, Cyclic, and Random partitioning with 16, 64, and 256 partitions, on UK, TW, TT, and FT.]
Future values are computed across more than 45% of edges.
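The merge that makes cross-iteration propagation valid can be sketched directly: when the aggregation is associative and commutative (a sum here, for illustration), contributions pushed early as future values during iteration t combine with contributions streamed in iteration t+1 in any order. Vertex names and values below are illustrative, and the function name is mine, not a Lumos API.

```python
# Sketch: merge the partial aggregate built from future values (pushed in
# iteration t) with the partial aggregate streamed in iteration t+1.
# Any associative, commutative aggregation (here: sum) permits this merge.
def merge_partials(from_future, from_stream):
    merged = dict(from_stream)
    for v, contribution in from_future.items():
        merged[v] = merged.get(v, 0.0) + contribution
    return merged

from_future = {"d": 0.2}             # propagated early, without reloading edges
from_stream = {"d": 0.1, "f": 0.3}   # streamed normally in iteration t+1
merged = merge_partials(from_future, from_stream)
```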
Intra-Partition Propagation
[Figure: values are additionally propagated along edges within a partition during iteration t, further reducing the edges that must be loaded in iteration t+1.]
Lumos: Cross-Iteration Value Propagation
• Propagation across multiple future iterations
• Synergy with Dynamic Shards [ATC'16]
• Detailed comparison with out-of-order techniques
• Generalized graph layout & partitioning strategies
• Cross-iteration propagation across different partition sizes
• Interplay with selective scheduling
• Lumos for Asynchronous Algorithms
Lumos for Asynchronous Algorithms
[Figure: shortest paths computation on the example graph, iteration t to iteration t+1.]
• Monotonic selection (KickStarter [ASPLOS'17])
• Partial aggregations merged directly
• Partial values safe to be propagated
• Any change can be propagated immediately
• Intra-partition propagation based on degree of asynchrony (ASPIRE [OOPSLA'14])
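For a monotonic algorithm such as shortest paths, the selection function is min(), so a partial value is safe to propagate the moment it changes: any later, smaller value can only tighten the result, never invalidate it. A minimal sketch of this immediate-propagation style, on an illustrative weighted graph:

```python
# Sketch: shortest paths with monotonic min() selection. Because updates are
# monotonic, any change can be propagated immediately without waiting for an
# iteration barrier; re-streaming until a fixed point yields correct distances.
def sssp(weighted_edges, source, num_vertices):
    INF = float("inf")
    dist = [INF] * num_vertices
    dist[source] = 0.0
    changed = True
    while changed:                      # keep propagating until fixed point
        changed = False
        for u, v, w in weighted_edges:
            if dist[u] + w < dist[v]:   # monotonic: distances only decrease
                dist[v] = dist[u] + w
                changed = True
    return dist

# Illustrative graph: the longer direct edge (0, 2, 5.0) is beaten by 0->1->2.
weighted_edges = [(0, 1, 2.0), (1, 2, 1.0), (0, 2, 5.0)]
dist = sssp(weighted_edges, source=0, num_vertices=3)
```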
Lumos System
• Grid layout based on GridGraph [ATC'15]
[Figure: edges arranged in a source x target grid of partitions; future values are propagated for upper-triangle and diagonal edges, so only lower-triangle edges are loaded in iteration t+1.]
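The grid classification on this slide reduces to one comparison per edge: with partitions processed in order, an edge whose source partition is at most its target partition lies in the upper triangle or on the diagonal and can carry a future value forward; only the remaining lower-triangle edges must be reloaded in iteration t+1. The layout and edges below are hypothetical.

```python
# Sketch: split edges by grid position. Upper-triangle + diagonal edges
# (source partition <= target partition) can propagate future values;
# only lower-triangle edges need to be loaded again next iteration.
def split_grid_edges(edges, partition_of):
    upper, lower = [], []
    for u, v in edges:
        (upper if partition_of[u] <= partition_of[v] else lower).append((u, v))
    return upper, lower

partition_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}  # P0 = {0,1,2}, P1 = {3,4,5}
edges = [(0, 3), (1, 2), (4, 0), (2, 5), (5, 3)]
upper, lower = split_grid_edges(edges, partition_of)
```

Here only the single lower-triangle edge (4, 0) would need reloading; the other four ride along with future values.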
Light-weight Degree-Aware Partitioning
• HOF: Highest Out-Degree First
• HIL: Highest In-Degree Last
• HRF: Highest Out-Degree to In-Degree Ratio First
Goal: maximize edges in the upper triangle.
[Figure: percentage of edges eligible for cross-iteration propagation under HOF, HIL, and HRF with 16-256 partitions, on TT (2B edges, 52.6M vertices), FT (2.5B, 68.3M), and YH (6.6B, 1.4B).]
Cross-iteration propagation covers 72-93% of edges.
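The HOF heuristic can be sketched as a plain reordering: placing high out-degree vertices earliest pushes their many out-edges forward, into the upper triangle. The hub graph below and the function names are illustrative assumptions, not the Lumos implementation.

```python
# Sketch of HOF (Highest Out-Degree First): order vertices by descending
# out-degree so that edges tend to run from earlier to later positions,
# i.e. into the upper triangle eligible for cross-iteration propagation.
def hof_order(edges, num_vertices):
    out_deg = [0] * num_vertices
    for u, _ in edges:
        out_deg[u] += 1
    order = sorted(range(num_vertices), key=lambda v: -out_deg[v])
    return {v: i for i, v in enumerate(order)}  # new position of each vertex

def upper_triangle_fraction(edges, position):
    forward = sum(1 for u, v in edges if position[u] <= position[v])
    return forward / len(edges)

# Illustrative hub graph: vertex 5 has four out-edges but sits last, so the
# identity order leaves most edges in the lower triangle; HOF moves it first.
edges = [(5, 0), (5, 1), (5, 2), (5, 3), (0, 5)]
identity = {v: v for v in range(6)}
hof = hof_order(edges, 6)
```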
Experimental Setup
• Graph algorithms: PageRank (weighted and unweighted), Belief Propagation, Co-Training Expectation Maximization, Dispersion, Label Propagation
• Performance on a single disk: h1.2xlarge (278MB/sec HDD read bandwidth)
• Scaling I/O:
• d2.4xlarge: 195-768MB/sec read bandwidth over 1-4 HDDs
• i3.8xlarge: 1.2-4.1GB/sec read bandwidth over 1-4 SSDs
Performance
[Figure: execution time of LUMOS normalized to GridGraph for PR, CoEM, DP, BP, WPR, and LP on TT (2B edges, 52.6M vertices) and YH (6.6B, 1.4B), broken down into read and compute time; h1.2xlarge, 278MB/sec HDD read bandwidth.]
Scaling I/O
[Figure: execution time over 1-4 drives (HDD and SSD) on RMAT29 (8.6B edges, 537M vertices).]
Read bandwidth by device (single drive; RAID-0 with k = 2, 3, 4 drives):
• d2.4xlarge HDD: 195MB/s; 368MB/s; 590MB/s; 768MB/s
• i3.8xlarge SSD: 1.2GB/s; 3.8GB/s; 4.1GB/s; 3.9GB/s
Detailed Evaluation of Lumos
[Figure: disk read throughput (MB/s) and I/O wait time (ms) over time for 1-16 threads; execution time and space of degree-ordered layouts (LBL) normalized to GridGraph (GG) on TW, TT, FT, and YH; per-algorithm read/compute breakdown for PR, CoEM, DP, BP, WPR, and LP.]
Conclusion
• Dependency-driven cross-iteration value propagation
• Out-of-order processing to reduce I/O
• Guarantees Bulk Synchronous Parallel semantics
• A generic technique that fundamentally eliminates the I/O performance barrier
github.com/pdclab/lumos