Introduction
• Every new desktop/laptop is now equipped with a graphics processing unit (GPU).
• GPU = massively parallel architecture.
• For most of their life, such GPUs are idle.
• General-purpose GPU applications: numerical analysis (MathWorks MATLAB), bioinformatics, deep learning.

Introduction GPUs GPU-(D)BE Results Conclusions
F. Fioretto, T. Le, E. Pontelli, W. Yeoh, T. Son
(Distributed) Constraint Optimization
• A (D)COP is a tuple ⟨X, D, F, (A, α)⟩, where:
• X is a set of variables;
• D is a set of finite domains;
• F is a set of utility functions;
• A is a set of agents, controlling the variables in X;
• α maps variables to agents.
• GOAL: Find a utility-maximal assignment.
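To make the definition concrete, here is a minimal brute-force solver for a toy (D)COP in Python; the variable names, domain, and single utility table below are illustrative assumptions, not taken from the talk:

```python
from itertools import product

# Hypothetical toy COP: two binary variables and one binary utility function.
variables = ["x1", "x2"]
domains = {"x1": [0, 1], "x2": [0, 1]}
# F: list of (scope, table) pairs mapping value tuples to utilities.
functions = [
    (("x1", "x2"), {(0, 0): 5, (0, 1): 8, (1, 0): 20, (1, 1): 3}),
]

def utility(assignment):
    """Sum the utilities of all functions under a complete assignment."""
    return sum(table[tuple(assignment[v] for v in scope)]
               for scope, table in functions)

def solve(variables, domains):
    """Brute-force search for a utility-maximal assignment."""
    best = max(
        (dict(zip(variables, values))
         for values in product(*(domains[v] for v in variables))),
        key=utility,
    )
    return best, utility(best)

best, value = solve(variables, domains)
print(best, value)  # {'x1': 1, 'x2': 0} 20
```

Brute force enumerates all d^n assignments; the dynamic-programming procedures later in the deck reach the same optimum more cleverly.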
Distributed Constraint Optimization (DCOP)
A DCOP is defined by a tuple ⟨A, X, D, F, α⟩, where:
• A is a set of agents;
• X is a set of variables;
• D is a set of finite domains;
• F is a set of utility functions, f_i : ×_{x_j ∈ scope(f_i)} D_j → ℕ ∪ {−∞};
• α : X → A maps each variable to one agent.
[Figure: constraint graph with agents a1–a5 controlling variables x1–x5; edges carry hard constraints (=, <) and one soft constraint.]

Soft Constraint Table (x1, x5):
x1 x5 | Utility
0  0  | 20
0  1  | 8
0  2  | 10
0  3  | 3
...
3  3  | 2

Hard Constraint Table (x1, x3):
x1 x3 | Utility
0  0  | −∞
0  1  | 0
0  2  | 0
0  3  | 0
...
3  3  | −∞
B_i = { f_j ∈ F | x_i ∈ scope(f_j) ∧ i = max{ k | x_k ∈ scope(f_j) } }
X ← X \ {x_i}
F ← (F ∪ {f'_i}) \ B_i
x* = argmax_x F(x) = argmax_x Σ_{f ∈ F} f(x|_{scope(f)})
Graphics Processing Units (GPUs)
• A GPU is a massively parallel architecture:
• Thousands of multi-threaded computing cores.
• Very high memory bandwidth.
• ~80% of transistors devoted to data processing rather than caching.
• However:
• GPU cores are slower than CPU cores.
• GPU memories have different sizes and access times.
• GPU programming is more challenging and time-consuming.
Execution Model
[Figure: the host launches kernels on the device as grids of thread blocks; each block is partitioned into warps of threads.]
• A thread is the basic parallel unit.
• Threads are organized into blocks.
• Several warps are scheduled for the execution of a GPU function.
• Several Streaming Multiprocessors (SMs) are scheduled in parallel.
• Single Instruction Multiple Thread (SIMT) parallel model.
[Figure: GPU memory hierarchy. Per-thread registers; per-block shared memory; host-accessible global and constant memory. Each SM pairs a warp scheduler, an instruction cache, 32K registers, cores, SFUs, and 64KB of shared memory/L1 cache; SMs share an L2 cache and access DRAM via the host interface.]
Memory Hierarchy
• The GPU memory architecture is rather involved:
• Registers
• Shared memory
• Global memory
Memory Hierarchy
• Registers:
• Fastest;
• Only accessible by a single thread;
• Lifetime of a thread.
• Shared memory
• Global memory
Memory Hierarchy
• Shared memory:
• Extremely fast;
• Highly parallel;
• Restricted to a block.
• Global memory
Memory Hierarchy
• Global memory:
• Typically implemented in DRAM;
• High access latency (400–800 cycles);
• Potential for traffic congestion.
Memory Hierarchy
• Challenge: using memory effectively likely requires redesigning the algorithm.
CUDA: Compute Unified Device Architecture
• Allocate device memory and copy the input data from host to device:
    cudaMalloc(&deviceV, sizeV);
    cudaMemcpy(deviceV, hostV, sizeV, ...);
• Launch the kernel with its execution configuration:
    cudaKernel<<<nBlocks, nThreads>>>( ... );
• Copy the results back from device to host:
    cudaMemcpy(hostV, deviceV, sizeV, ...);
Bucket Elimination and DPOP
• Dynamic programming procedures to solve (D)COPs.
• Both procedures rely on two operators:
• Projection operator: π_{-xi}(fij)
• Aggregation operator: fij + fik

fij:
xi xj | U
0  0  | 5
0  1  | 8
1  0  | 20
1  1  | 3

π_{-xi}(fij):
xj | U
0  | 20
1  | 8
• Projection example, row xj = 0: π_{-xi}(fij)(0) = max(5, 20) = 20.
• Projection example, row xj = 1: π_{-xi}(fij)(1) = max(8, 3) = 8.
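The projection operator can be sketched in a few lines of Python; the helper name project_out and the dict-based table encoding are my own choices, not the paper's data structures:

```python
def project_out(table, scope, var):
    """pi_{-var}: maximize the table over `var`, keeping the other variables."""
    idx = scope.index(var)
    out_scope = scope[:idx] + scope[idx + 1:]
    out = {}
    for values, u in table.items():
        key = values[:idx] + values[idx + 1:]   # drop var's column
        out[key] = max(out.get(key, float("-inf")), u)
    return out_scope, out

# The f_ij table from the slides.
f_ij = {(0, 0): 5, (0, 1): 8, (1, 0): 20, (1, 1): 3}
scope, projected = project_out(f_ij, ("xi", "xj"), "xi")
print(scope, projected)  # ('xj',) {(0,): 20, (1,): 8}
```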
Bucket Elimination and DPOP
• Aggregation operator: fij + fik

fij:
xi xj | U
0  0  | 5
0  1  | 8
1  0  | 20
1  1  | 3

fik:
xi xk | U
0  0  | 2
0  1  | 6
1  0  | 11
1  1  | 4

fij + fik:
xi xj xk | U
0  0  0  | 7
0  0  1  | 11
0  1  0  | 10
0  1  1  | 14
...
• Aggregation example, row (xi, xj, xk) = (0, 0, 0): 5 + 2 = 7.
• Aggregation example, row (xi, xj, xk) = (0, 0, 1): 5 + 6 = 11.
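The aggregation operator joins two tables on their shared variables and sums the utilities. A small Python sketch on the slides' tables (the function name aggregate and the encoding are mine):

```python
def aggregate(scope_a, table_a, scope_b, table_b):
    """f_a + f_b: join the two tables on shared variables, summing utilities."""
    joint_scope = list(scope_a) + [v for v in scope_b if v not in scope_a]
    out = {}
    for va, ua in table_a.items():
        asg = dict(zip(scope_a, va))
        for vb, ub in table_b.items():
            # Keep only rows that agree on the shared variables.
            if all(asg.get(v, val) == val for v, val in zip(scope_b, vb)):
                joint = {**asg, **dict(zip(scope_b, vb))}
                out[tuple(joint[v] for v in joint_scope)] = ua + ub
    return joint_scope, out

# The slides' f_ij and f_ik tables, sharing variable xi.
f_ij = {(0, 0): 5, (0, 1): 8, (1, 0): 20, (1, 1): 3}   # scope (xi, xj)
f_ik = {(0, 0): 2, (0, 1): 6, (1, 0): 11, (1, 1): 4}   # scope (xi, xk)
scope, agg = aggregate(("xi", "xj"), f_ij, ("xi", "xk"), f_ik)
print(agg[(0, 0, 0)], agg[(0, 0, 1)])  # 7 11
```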
[Figure: the running BE example. (a) Constraint graph over x1, x2, x3, ordered x1 ≻ x2 ≻ x3. (b) Utility tables fij(xi, xj) for i < j: (0,0)=5, (0,1)=8, (1,0)=20, (1,1)=3. (c) f3'(x1, x2): (0,0)=max(5+5, 8+8)=16, (0,1)=max(5+20, 8+3)=25, (1,0)=max(20+5, 3+8)=25, (1,1)=max(20+20, 3+3)=40. (d) f2'(x1): 0 → max(5+16, 8+25)=33, 1 → max(20+25, 3+40)=45.]
Bucket Elimination and DPOP
1. Impose an ordering on the problem's variables.
X = {x1, x2, x3}; F = {f12, f13, f23}; ordering: x1 ≻ x2 ≻ x3.
Bucket Elimination and DPOP
2. Select the variable xi with the highest priority, and create its bucket. Here: B3 = {f13, f23}.
X = {x1, x2, x3}; F = {f12, f13, f23}.
B_i = { f_j ∈ F | x_i ∈ scope(f_j) ∧ i = max{ k | x_k ∈ scope(f_j) } }
Bucket Elimination and DPOP
3. Compute a new utility function f3' by aggregating the functions in B3 and projecting out x3: f3' = π_{-x3}(f13 + f23).
B3 = {f13, f23}; X = {x1, x2, x3}; F = {f12, f13, f23}.

f13:
x1 x3 | U
0  0  | 5
0  1  | 8
1  0  | 20
1  1  | 3

f23:
x2 x3 | U
0  0  | 5
0  1  | 8
1  0  | 20
1  1  | 3

f3' (with the maximizing value of x3):
x1 x2 | U                        | x3
0  0  | max(5 + 5, 8 + 8) = 16  | 1
• Next row of f3': x1 = 0, x2 = 1: max(5 + 20, 8 + 3) = 25 (x3 = 0).
• Next row of f3': x1 = 1, x2 = 0: max(20 + 5, 3 + 8) = 25 (x3 = 0).
• Last row of f3': x1 = 1, x2 = 1: max(20 + 20, 3 + 3) = 40 (x3 = 0).
Bucket Elimination and DPOP
4. Update the set of variables: X ← X \ {xi}. Here: X = {x1, x2}.
5. Update the set of functions: F ← (F ∪ {f'i}) \ Bi. Here: F = {f12, f3'}; the next bucket is B2 = {f12, f3'}.
Bucket Elimination and DPOP
Repeat until all variables are eliminated. X = {x1, x2}; F = {f12, f3'}.
• DPOP is a distributed version of BE.
• It operates on a pseudo-tree ordering of the constraint graph.
[Figure: pseudo-tree fragment where parent xi receives utility messages fj', fk' from its children xj, xk.]
GPU-(D)BE
• BE and DPOP complexity: O(d^w*), where d = max. domain size and w* = induced width of the constraint graph.
• Can the projection and aggregation operators be executed in parallel?
• Do they fit the SIMT parallel model?

f3' = π_{-x3}(f13 + f23):
x1 x2 | U
0  0  | max(5 + 5, 8 + 8) = 16
0  1  | max(5 + 20, 8 + 3) = 25
1  0  | max(20 + 5, 3 + 8) = 25
1  1  | max(20 + 20, 3 + 3) = 40
GPU-(D)BE
• Can the projection and aggregation operators be executed in parallel? Do they fit the SIMT parallel model?
• Key observation: the computation of each row of a utility table is independent of the computation of the other rows.
Algorithm design and data structures
• Limit the amount of host-device data transfers:
• Static entities: require a single data transfer (variables; domains; utility functions; constraint graph).
• Dynamic entities: might require multiple data transfers (utility tables).
• Minimize accesses to global memory: padding of utility-table rows; perfect hashing.
• Ensure data accesses are coalesced: mono-dimensional array organization.
[Figure: coalesced (good) vs. uncoalesced (bad) access patterns.]
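A small sketch of the mono-dimensional, padded layout idea (the helper names and the stride choice are illustrative assumptions, not the paper's layout): rows are stored back-to-back at a fixed stride, so row r always starts at offset r * stride, which keeps neighboring threads' accesses contiguous.

```python
def make_flat_table(rows, pad_to):
    """Flatten variable-length rows into one 1-D list with a fixed stride."""
    assert all(len(r) <= pad_to for r in rows)
    flat = []
    for r in rows:
        flat.extend(r + [0] * (pad_to - len(r)))   # zero padding to the stride
    return flat

def row(flat, r, stride):
    """Recover row r from the flat layout."""
    return flat[r * stride: (r + 1) * stride]

# Utility-table rows (xi, xj, U), padded to stride 4.
rows = [[0, 0, 5], [0, 1, 8], [1, 0, 20], [1, 1, 3]]
flat = make_flat_table(rows, pad_to=4)
print(row(flat, 2, 4))  # [1, 0, 20, 0]
```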
Parallel Projection and Aggregation
• Mapping between the fi' table rows and the CUDA blocks:
• Each thread in a block is associated with the computation of one permutation of values in scope(fi').
• 1 block = 64k threads (1 ≤ k ≤ 16); k depends on the architecture and is chosen so as to maximize the number of threads that can be scheduled concurrently.
• Obs.: the max number of fi' table rows computed in parallel is M = |SMs| · 64k. In our experiments, |SMs| = 14 and k = 3, thus M = 2688.
[Figure: kernel blocks B0 … B29 mapped onto SM0 … SM13; each block's threads Th0, Th1, … read from and write to the GPU global memory.]
Each thread t of a block computes one row R⁺_t of the projected table U'i:
R⁺_t = max_{d ∈ Di} Σ_{fj ∈ Bi} fj(σ^i_x = r_t ∧ xi = d)
for the block's rows r_0, r_1, …
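The row-to-thread mapping can be sketched in Python (illustrative, not the paper's CUDA code): a flat row index is decoded, mixed-radix style, into one assignment over scope(fi'), and the thread then takes the max over Di of the summed bucket utilities for that row.

```python
def decode(r, dom_sizes):
    """Map a flat row index to an assignment (last variable varies fastest)."""
    vals = []
    for size in reversed(dom_sizes):
        vals.append(r % size)
        r //= size
    return tuple(reversed(vals))

def project_row(r, dom_sizes, di, bucket):
    """R+_r = max over d in D_i of the summed bucket utilities for row r."""
    prefix = decode(r, dom_sizes)
    return max(sum(f(prefix, d) for f in bucket) for d in range(di))

# Toy bucket B3 over scope (x1, x2), with x3 projected out; domains of size 2.
f13 = lambda p, d: {(0, 0): 5, (0, 1): 8, (1, 0): 20, (1, 1): 3}[(p[0], d)]
f23 = lambda p, d: {(0, 0): 5, (0, 1): 8, (1, 0): 20, (1, 1): 3}[(p[1], d)]
print([project_row(r, [2, 2], 2, [f13, f23]) for r in range(4)])
# [16, 25, 25, 40]
```

Since no row depends on any other, each row is one independent thread's work, which is exactly the SIMT-friendly structure the slide describes.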
Experimental Results
Regular Grid Networks: |Di| = 5; p2 = 0.5; timeout = 300s.
• Speedup: avg. max. 125.1x; avg. min. 42.6x.
• Similar trends at increasing |Di|.
• Similar trends for DPOP vs. GPU-DBE.
• CPU: 2.3GHz, 128 GB RAM; GPU: 14 SMs, 837MHz.
[Figure: runtime (sec, log scale, 0.01 to 10) vs. number of variables (9 to 100) for BE, GPU-BE, DPOP, and GPU-DBE.]
Experimental Results
Random Networks: |Di| = 5; p2 = 0.5; p1 = 0.3; timeout = 300s.
• Speedup: avg. max. 69.3x; avg. min. 16.1x.
• Similar trends at increasing |Di|.
• Similar trends for DPOP vs. GPU-DBE.
• CPU: 2.3GHz, 128 GB RAM; GPU: 14 SMs, 837MHz.
[Figure: runtime (sec, log scale, 0.01 to 50) vs. number of variables (5 to 25) for BE, GPU-BE, DPOP, and GPU-DBE.]
Experimental Results
Scale-Free Networks: |Di| = 5; p2 = 0.5; timeout = 300s.
• Speedup: avg. max. 34.9x; avg. min. 9.5x.
• Similar trends at increasing |Di|.
• Similar trends for DPOP vs. GPU-DBE.
• CPU: 2.3GHz, 128 GB RAM; GPU: 14 SMs, 837MHz.
[Figure: runtime (sec, log scale, 0.1 to 50) vs. number of variables (10 to 50) for BE, GPU-BE, DPOP, and GPU-DBE.]
Lesson Learned #1
• The fi' table size increases exponentially with w*, while the GPU global memory is limited (2GB).
• The fi' table plus the Bi tables used in the aggregation might exceed the global memory capacity!
• Solution: partition the fi' computation into multiple chunks, and alternate GPU and CPU to compute fi':
• GPU: aggregates the functions in Bi, excluding those that do not fit in the global memory.
• CPU: aggregates the remaining functions in Bi, and projects out the variable xi.
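The chunking pattern can be sketched abstractly in Python (the helper names, chunk size, and the stand-in row functions are all hypothetical; in the real system the per-chunk step is a kernel launch and the finishing step runs on the host):

```python
def chunks(n_rows, chunk_size):
    """Split the row range [0, n_rows) into device-sized chunks."""
    for start in range(0, n_rows, chunk_size):
        yield range(start, min(start + chunk_size, n_rows))

def compute_table(n_rows, chunk_size, gpu_rowfn, cpu_finish):
    """Compute a table too large for device memory, one chunk at a time."""
    out = []
    for chunk in chunks(n_rows, chunk_size):
        partial = [gpu_rowfn(r) for r in chunk]     # stands in for a kernel launch
        out.extend(cpu_finish(p) for p in partial)  # host finishes each row
    return out

# Toy run: the "GPU" emits a pair per row, the "CPU" reduces it with max.
result = compute_table(4, 2, lambda r: (r, r + 1), max)
print(result)  # [1, 2, 3, 4]
```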
Experimental Results
Random Networks: |A| = 10; |Di| = 5; p2 = 0.5; timeout = 300s.
• Small p1 corresponds to smaller w*.
• Phase transition at p1 = 0.4.
• New results (work in progress).
• CPU: 2.3GHz, 128 GB RAM; GPU: 14 SMs, 837MHz.
[Figure: speedup (GPU vs. CPU, 40 to 80) vs. graph density p1 (0.2 to 0.9).]
Lesson Learned #2
• Host and device concurrency:
• Possible when the fi' tables are computed in chunks.
• It may hide host-device data transfers as a byproduct.
[Figure: pipelined schedule. The CPU (host) initializes, computes and compresses U1, U2, … and issues H→D and D→H copies, while the GPU (device) executes kernels K1, K2, … and updates global memory.]
Discussion
• Exploiting the integration of CPU and GPU is a key factor in obtaining competitive solver performance.
• How to determine good tradeoffs for such integration?
• GPU: repeated, non-memory-intensive operations; operations requiring regular memory access.
• CPU: memory-intensive operations.
Conclusions
• Exploit GPU-style parallelism in DP-based (D)COP resolution methods.
• GPU-(D)BE: exploits GPUs to parallelize the aggregation and projection operators.
• Observed speedups ranging from 34.9x to 125.1x across several network topologies.
• Discussed several possible optimization techniques.
• FUTURE WORK:
• Exploit GPUs in DP-based propagators.
• Investigate GPUs in higher forms of consistency.
Ferdinando Fioretto
New Mexico State University, University of Udine
Email: [email protected]
Homepage: www.cs.nmsu.edu/~ffiorett

Exploiting GPUs in Solving (Distributed) Constraint Optimization Problems with Dynamic Programming
F. Fioretto, T. Le, E. Pontelli, W. Yeoh, T. Son
Thank You!