Introduction
• Every new desktop/laptop is now equipped with a graphics processing unit (GPU).
• GPU = massively parallel architecture.
• For most of their life, such GPUs are idle.
• General-purpose GPU applications: numerical analysis (MathWorks MATLAB), bioinformatics, deep learning.

Introduction GPUs GPU-(D)BE Results Conclusions
F. Fioretto, T. Le, E. Pontelli, W. Yeoh, T. Son
(Distributed) Constraint Optimization
• A (D)COP is a tuple ⟨X, D, F, (A, α)⟩, where:
• X is a set of variables;
• D is a set of finite domains;
• F is a set of utility functions;
• A is a set of agents, controlling the variables in X;
• α maps variables to agents.
• GOAL: Find a utility-maximal assignment.
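To make the definition concrete, here is a minimal brute-force solver for a toy (D)COP in Python; the variable names, domain, and single utility table below are illustrative assumptions, not taken from the talk:

```python
from itertools import product

# Hypothetical toy COP: two binary variables and one binary utility function.
variables = ["x1", "x2"]
domains = {"x1": [0, 1], "x2": [0, 1]}
# F: list of (scope, table) pairs mapping value tuples to utilities.
functions = [
    (("x1", "x2"), {(0, 0): 5, (0, 1): 8, (1, 0): 20, (1, 1): 3}),
]

def utility(assignment):
    """Sum the utilities of all functions under a complete assignment."""
    return sum(table[tuple(assignment[v] for v in scope)]
               for scope, table in functions)

def solve(variables, domains):
    """Brute-force search for a utility-maximal assignment."""
    best = max(
        (dict(zip(variables, values))
         for values in product(*(domains[v] for v in variables))),
        key=utility,
    )
    return best, utility(best)

best, value = solve(variables, domains)
print(best, value)  # {'x1': 1, 'x2': 0} 20
```

Brute force enumerates all d^n assignments; the dynamic-programming procedures later in the deck reach the same optimum more cleverly.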
Distributed Constraint Optimization (DCOP)
A DCOP is defined by a tuple ⟨A, X, D, F, α⟩, where:
• A is a set of agents;
• X is a set of variables;
• D is a set of finite domains;
• F is a set of utility functions, f_i : ×_{x_j ∈ scope(f_i)} D_j → ℕ ∪ {−∞};
• α : X → A maps each variable to one agent.
[Figure: constraint graph with agents a1–a5 controlling variables x1–x5; edges carry hard constraints (=, <) and one soft constraint.]

Soft Constraint Table (x1, x5):
x1 x5 | Utility
0  0  | 20
0  1  | 8
0  2  | 10
0  3  | 3
...
3  3  | 2

Hard Constraint Table (x1, x3):
x1 x3 | Utility
0  0  | −∞
0  1  | 0
0  2  | 0
0  3  | 0
...
3  3  | −∞
B_i = { f_j ∈ F | x_i ∈ scope(f_j) ∧ i = max{ k | x_k ∈ scope(f_j) } }
X ← X \ {x_i}
F ← (F ∪ {f'_i}) \ B_i
x* = argmax_x F(x) = argmax_x Σ_{f ∈ F} f(x|_{scope(f)})
Graphics Processing Units (GPUs)
• A GPU is a massively parallel architecture:
• Thousands of multi-threaded computing cores.
• Very high memory bandwidth.
• ~80% of transistors devoted to data processing rather than caching.
• However:
• GPU cores are slower than CPU cores.
• GPU memories have different sizes and access times.
• GPU programming is more challenging and time-consuming.
Execution Model
[Figure: the host launches kernels on the device as grids of thread blocks; each block is partitioned into warps of threads.]
• A thread is the basic parallel unit.
• Threads are organized into blocks.
• Several warps are scheduled for the execution of a GPU function.
• Several Streaming Multiprocessors (SMs) are scheduled in parallel.
• Single Instruction Multiple Thread (SIMT) parallel model.
[Figure: GPU memory hierarchy. Per-thread registers; per-block shared memory; host-accessible global and constant memory. Each SM pairs a warp scheduler, an instruction cache, 32K registers, cores, SFUs, and 64KB of shared memory/L1 cache; SMs share an L2 cache and access DRAM via the host interface.]
Memory Hierarchy
• The GPU memory architecture is rather involved:
• Registers
• Shared memory
• Global memory
Memory Hierarchy
• Registers:
• Fastest;
• Only accessible by a single thread;
• Lifetime of a thread.
• Shared memory
• Global memory
Memory Hierarchy
• Shared memory:
• Extremely fast;
• Highly parallel;
• Restricted to a block.
• Global memory
Memory Hierarchy
• Global memory:
• Typically implemented in DRAM;
• High access latency (400–800 cycles);
• Potential for traffic congestion.
Memory Hierarchy
• Challenge: using memory effectively likely requires redesigning the algorithm.
CUDA: Compute Unified Device Architecture
• Allocate device memory and copy the input data from host to device:
    cudaMalloc(&deviceV, sizeV);
    cudaMemcpy(deviceV, hostV, sizeV, ...);
• Launch the kernel with its execution configuration:
    cudaKernel<<<nBlocks, nThreads>>>( ... );
• Copy the results back from device to host:
    cudaMemcpy(hostV, deviceV, sizeV, ...);
Bucket Elimination and DPOP
• Dynamic programming procedures to solve (D)COPs.
• Both procedures rely on two operators:
• Projection operator: π_{-xi}(fij)
• Aggregation operator: fij + fik

fij:
xi xj | U
0  0  | 5
0  1  | 8
1  0  | 20
1  1  | 3

π_{-xi}(fij):
xj | U
0  | 20
1  | 8
• Projection example, row xj = 0: π_{-xi}(fij)(0) = max(5, 20) = 20.
• Projection example, row xj = 1: π_{-xi}(fij)(1) = max(8, 3) = 8.
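The projection operator can be sketched in a few lines of Python; the helper name project_out and the dict-based table encoding are my own choices, not the paper's data structures:

```python
def project_out(table, scope, var):
    """pi_{-var}: maximize the table over `var`, keeping the other variables."""
    idx = scope.index(var)
    out_scope = scope[:idx] + scope[idx + 1:]
    out = {}
    for values, u in table.items():
        key = values[:idx] + values[idx + 1:]   # drop var's column
        out[key] = max(out.get(key, float("-inf")), u)
    return out_scope, out

# The f_ij table from the slides.
f_ij = {(0, 0): 5, (0, 1): 8, (1, 0): 20, (1, 1): 3}
scope, projected = project_out(f_ij, ("xi", "xj"), "xi")
print(scope, projected)  # ('xj',) {(0,): 20, (1,): 8}
```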
Bucket Elimination and DPOP
• Aggregation operator: fij + fik

fij:
xi xj | U
0  0  | 5
0  1  | 8
1  0  | 20
1  1  | 3

fik:
xi xk | U
0  0  | 2
0  1  | 6
1  0  | 11
1  1  | 4

fij + fik:
xi xj xk | U
0  0  0  | 7
0  0  1  | 11
0  1  0  | 10
0  1  1  | 14
...
• Aggregation example, row (xi, xj, xk) = (0, 0, 0): 5 + 2 = 7.
• Aggregation example, row (xi, xj, xk) = (0, 0, 1): 5 + 6 = 11.
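The aggregation operator joins two tables on their shared variables and sums the utilities. A small Python sketch on the slides' tables (the function name aggregate and the encoding are mine):

```python
def aggregate(scope_a, table_a, scope_b, table_b):
    """f_a + f_b: join the two tables on shared variables, summing utilities."""
    joint_scope = list(scope_a) + [v for v in scope_b if v not in scope_a]
    out = {}
    for va, ua in table_a.items():
        asg = dict(zip(scope_a, va))
        for vb, ub in table_b.items():
            # Keep only rows that agree on the shared variables.
            if all(asg.get(v, val) == val for v, val in zip(scope_b, vb)):
                joint = {**asg, **dict(zip(scope_b, vb))}
                out[tuple(joint[v] for v in joint_scope)] = ua + ub
    return joint_scope, out

# The slides' f_ij and f_ik tables, sharing variable xi.
f_ij = {(0, 0): 5, (0, 1): 8, (1, 0): 20, (1, 1): 3}   # scope (xi, xj)
f_ik = {(0, 0): 2, (0, 1): 6, (1, 0): 11, (1, 1): 4}   # scope (xi, xk)
scope, agg = aggregate(("xi", "xj"), f_ij, ("xi", "xk"), f_ik)
print(agg[(0, 0, 0)], agg[(0, 0, 1)])  # 7 11
```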
[Figure: the running BE example. (a) Constraint graph over x1, x2, x3, ordered x1 ≻ x2 ≻ x3. (b) Utility tables fij(xi, xj) for i < j: (0,0)=5, (0,1)=8, (1,0)=20, (1,1)=3. (c) f3'(x1, x2): (0,0)=max(5+5, 8+8)=16, (0,1)=max(5+20, 8+3)=25, (1,0)=max(20+5, 3+8)=25, (1,1)=max(20+20, 3+3)=40. (d) f2'(x1): 0 → max(5+16, 8+25)=33, 1 → max(20+25, 3+40)=45.]
Bucket Elimination and DPOP
1. Impose an ordering on the problem's variables.
X = {x1, x2, x3}; F = {f12, f13, f23}; ordering: x1 ≻ x2 ≻ x3.
Bucket Elimination and DPOP
2. Select the variable xi with the highest priority, and create its bucket. Here: B3 = {f13, f23}.
X = {x1, x2, x3}; F = {f12, f13, f23}.
B_i = { f_j ∈ F | x_i ∈ scope(f_j) ∧ i = max{ k | x_k ∈ scope(f_j) } }
Bucket Elimination and DPOP
3. Compute a new utility function f3' by aggregating the functions in B3 and projecting out x3: f3' = π_{-x3}(f13 + f23).
B3 = {f13, f23}; X = {x1, x2, x3}; F = {f12, f13, f23}.

f13:
x1 x3 | U
0  0  | 5
0  1  | 8
1  0  | 20
1  1  | 3

f23:
x2 x3 | U
0  0  | 5
0  1  | 8
1  0  | 20
1  1  | 3

f3' (with the maximizing value of x3):
x1 x2 | U                        | x3
0  0  | max(5 + 5, 8 + 8) = 16  | 1
• Next row of f3': x1 = 0, x2 = 1: max(5 + 20, 8 + 3) = 25 (x3 = 0).
• Next row of f3': x1 = 1, x2 = 0: max(20 + 5, 3 + 8) = 25 (x3 = 0).
• Last row of f3': x1 = 1, x2 = 1: max(20 + 20, 3 + 3) = 40 (x3 = 0).
Bucket Elimination and DPOP
4. Update the set of variables: X ← X \ {xi}. Here: X = {x1, x2}.
5. Update the set of functions: F ← (F ∪ {f'i}) \ Bi. Here: F = {f12, f3'}; the next bucket is B2 = {f12, f3'}.
Bucket Elimination and DPOP
Repeat until all variables are eliminated. X = {x1, x2}; F = {f12, f3'}.
• DPOP is a distributed version of BE.
• It operates on a pseudo-tree ordering of the constraint graph.
[Figure: pseudo-tree fragment where parent xi receives utility messages fj', fk' from its children xj, xk.]
GPU-(D)BE
• BE and DPOP complexity: O(d^w*), where d = max. domain size and w* = induced width of the constraint graph.
• Can the projection and aggregation operators be executed in parallel?
• Do they fit the SIMT parallel model?

f3' = π_{-x3}(f13 + f23):
x1 x2 | U
0  0  | max(5 + 5, 8 + 8) = 16
0  1  | max(5 + 20, 8 + 3) = 25
1  0  | max(20 + 5, 3 + 8) = 25
1  1  | max(20 + 20, 3 + 3) = 40
GPU-(D)BE
• Can the projection and aggregation operators be executed in parallel? Do they fit the SIMT parallel model?
• Key observation: the computation of each row of a utility table is independent of the computation of the other rows.
Algorithm design and data structures
• Limit the amount of host-device data transfers:
• Static entities: require a single data transfer (variables; domains; utility functions; constraint graph).
• Dynamic entities: might require multiple data transfers (utility tables).
• Minimize accesses to global memory: padding of utility-table rows; perfect hashing.
• Ensure data accesses are coalesced: mono-dimensional array organization.
[Figure: coalesced (good) vs. uncoalesced (bad) access patterns.]
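A small sketch of the mono-dimensional, padded layout idea (the helper names and the stride choice are illustrative assumptions, not the paper's layout): rows are stored back-to-back at a fixed stride, so row r always starts at offset r * stride, which keeps neighboring threads' accesses contiguous.

```python
def make_flat_table(rows, pad_to):
    """Flatten variable-length rows into one 1-D list with a fixed stride."""
    assert all(len(r) <= pad_to for r in rows)
    flat = []
    for r in rows:
        flat.extend(r + [0] * (pad_to - len(r)))   # zero padding to the stride
    return flat

def row(flat, r, stride):
    """Recover row r from the flat layout."""
    return flat[r * stride: (r + 1) * stride]

# Utility-table rows (xi, xj, U), padded to stride 4.
rows = [[0, 0, 5], [0, 1, 8], [1, 0, 20], [1, 1, 3]]
flat = make_flat_table(rows, pad_to=4)
print(row(flat, 2, 4))  # [1, 0, 20, 0]
```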
Parallel Projection and Aggregation
• Mapping between the fi' table rows and the CUDA blocks:
• Each thread in a block is associated with the computation of one permutation of values in scope(fi').
• 1 block = 64k threads (1 ≤ k ≤ 16); k depends on the architecture and is chosen so as to maximize the number of threads that can be scheduled concurrently.
• Obs.: the max number of fi' table rows computed in parallel is M = |SMs| · 64k. In our experiments, |SMs| = 14 and k = 3, thus M = 2688.
[Figure: kernel blocks B0 … B29 mapped onto SM0 … SM13; each block's threads Th0, Th1, … read from and write to the GPU global memory.]
Each thread t of a block computes one row R⁺_t of the projected table U'i:
R⁺_t = max_{d ∈ Di} Σ_{fj ∈ Bi} fj(σ^i_x = r_t ∧ xi = d)
for the block's rows r_0, r_1, …
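The row-to-thread mapping can be sketched in Python (illustrative, not the paper's CUDA code): a flat row index is decoded, mixed-radix style, into one assignment over scope(fi'), and the thread then takes the max over Di of the summed bucket utilities for that row.

```python
def decode(r, dom_sizes):
    """Map a flat row index to an assignment (last variable varies fastest)."""
    vals = []
    for size in reversed(dom_sizes):
        vals.append(r % size)
        r //= size
    return tuple(reversed(vals))

def project_row(r, dom_sizes, di, bucket):
    """R+_r = max over d in D_i of the summed bucket utilities for row r."""
    prefix = decode(r, dom_sizes)
    return max(sum(f(prefix, d) for f in bucket) for d in range(di))

# Toy bucket B3 over scope (x1, x2), with x3 projected out; domains of size 2.
f13 = lambda p, d: {(0, 0): 5, (0, 1): 8, (1, 0): 20, (1, 1): 3}[(p[0], d)]
f23 = lambda p, d: {(0, 0): 5, (0, 1): 8, (1, 0): 20, (1, 1): 3}[(p[1], d)]
print([project_row(r, [2, 2], 2, [f13, f23]) for r in range(4)])
# [16, 25, 25, 40]
```

Since no row depends on any other, each row is one independent thread's work, which is exactly the SIMT-friendly structure the slide describes.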
Experimental Results
Regular Grid Networks: |Di| = 5; p2 = 0.5; timeout = 300s.
• Speedup: avg. max. 125.1x; avg. min. 42.6x.
• Similar trends at increasing |Di|.
• Similar trends for DPOP vs. GPU-DBE.
• CPU: 2.3GHz, 128 GB RAM; GPU: 14 SMs, 837MHz.
[Figure: runtime (sec, log scale, 0.01 to 10) vs. number of variables (9 to 100) for BE, GPU-BE, DPOP, and GPU-DBE.]
Experimental Results
Random Networks: |Di| = 5; p2 = 0.5; p1 = 0.3; timeout = 300s.
• Speedup: avg. max. 69.3x; avg. min. 16.1x.
• Similar trends at increasing |Di|.
• Similar trends for DPOP vs. GPU-DBE.
• CPU: 2.3GHz, 128 GB RAM; GPU: 14 SMs, 837MHz.
[Figure: runtime (sec, log scale, 0.01 to 50) vs. number of variables (5 to 25) for BE, GPU-BE, DPOP, and GPU-DBE.]
Experimental Results
Scale-Free Networks: |Di| = 5; p2 = 0.5; timeout = 300s.
• Speedup: avg. max. 34.9x; avg. min. 9.5x.
• Similar trends at increasing |Di|.
• Similar trends for DPOP vs. GPU-DBE.
• CPU: 2.3GHz, 128 GB RAM; GPU: 14 SMs, 837MHz.
[Figure: runtime (sec, log scale, 0.1 to 50) vs. number of variables (10 to 50) for BE, GPU-BE, DPOP, and GPU-DBE.]
Lesson Learned #1
• The fi' table size increases exponentially with w*, while the GPU global memory is limited (2GB).
• The fi' table plus the Bi tables used in the aggregation might exceed the global memory capacity!
• Solution: partition the fi' computation into multiple chunks, and alternate GPU and CPU to compute fi':
• GPU: aggregates the functions in Bi, excluding those that do not fit in the global memory.
• CPU: aggregates the remaining functions in Bi, and projects out the variable xi.
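The chunking pattern can be sketched abstractly in Python (the helper names, chunk size, and the stand-in row functions are all hypothetical; in the real system the per-chunk step is a kernel launch and the finishing step runs on the host):

```python
def chunks(n_rows, chunk_size):
    """Split the row range [0, n_rows) into device-sized chunks."""
    for start in range(0, n_rows, chunk_size):
        yield range(start, min(start + chunk_size, n_rows))

def compute_table(n_rows, chunk_size, gpu_rowfn, cpu_finish):
    """Compute a table too large for device memory, one chunk at a time."""
    out = []
    for chunk in chunks(n_rows, chunk_size):
        partial = [gpu_rowfn(r) for r in chunk]     # stands in for a kernel launch
        out.extend(cpu_finish(p) for p in partial)  # host finishes each row
    return out

# Toy run: the "GPU" emits a pair per row, the "CPU" reduces it with max.
result = compute_table(4, 2, lambda r: (r, r + 1), max)
print(result)  # [1, 2, 3, 4]
```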
Experimental Results
Random Networks: |A| = 10; |Di| = 5; p2 = 0.5; timeout = 300s.
• Small p1 corresponds to smaller w*.
• Phase transition at p1 = 0.4.
• New results (work in progress).
• CPU: 2.3GHz, 128 GB RAM; GPU: 14 SMs, 837MHz.
[Figure: speedup (GPU vs. CPU, 40 to 80) vs. graph density p1 (0.2 to 0.9).]
Lesson Learned #2
• Host and device concurrency:
• Possible when the fi' tables are computed in chunks.
• It may hide host-device data transfers as a byproduct.
[Figure: pipelined schedule. The CPU (host) initializes, computes and compresses U1, U2, … and issues H→D and D→H copies, while the GPU (device) executes kernels K1, K2, … and updates global memory.]
Discussion
• Exploiting the integration of CPU and GPU is a key factor in obtaining competitive solver performance.
• How to determine good tradeoffs for such integration?
• GPU: repeated, non-memory-intensive operations; operations requiring regular memory access.
• CPU: memory-intensive operations.
Conclusions
• Exploit GPU-style parallelism in DP-based (D)COP resolution methods.
• GPU-(D)BE: exploits GPUs to parallelize the aggregation and projection operators.
• Observed speedups ranging from 34.9x to 125.1x across several network topologies.
• Discussed several possible optimization techniques.
• FUTURE WORK:
• Exploit GPUs in DP-based propagators.
• Investigate GPUs in higher forms of consistency.
Ferdinando Fioretto
New Mexico State University, University of Udine
Email: [email protected]
Homepage: www.cs.nmsu.edu/~ffiorett

Exploiting GPUs in Solving (Distributed) Constraint Optimization Problems with Dynamic Programming
F. Fioretto, T. Le, E. Pontelli, W. Yeoh, T. Son
Thank You!