Exploiting GPUs in Solving (Distributed) Constraint Optimization Problems with Dynamic Programming
F. Fioretto, T. Le, E. Pontelli, W. Yeoh, T. Son

Page 1:

Introduction
• Every new desktop/laptop is now equipped with a graphics processing unit (GPU).
• A GPU is a massively parallel architecture.
• For most of their life, such GPUs are idle.

• General-purpose GPU applications: numerical analysis (e.g., MathWorks MATLAB), bioinformatics, deep learning.

Page 2:

(Distributed) Constraint Optimization
• A (D)COP is a tuple ⟨X, D, F, (A, α)⟩, where:
  • X is a set of variables.
  • D is a set of finite domains.
  • F is a set of utility functions.
  • A is a set of agents, controlling the variables in X.
  • α maps variables to agents.

• GOAL: Find a utility-maximal assignment.


Formally, a DCOP is defined by a tuple ⟨A, X, D, F, α⟩, where:
• A is a set of agents;
• X is a set of variables;
• D is a set of finite domains;
• F is a set of utility functions, fi : ×_{xj ∈ scope(fi)} Dj → N ∪ {0, −∞};
• α : X → A maps each variable to one agent.

[Figure: constraint graph with agents a1-a5 controlling variables x1-x5 and constraints (=, <, >, =soft); a soft constraint table over (x1, x5) and a hard constraint table over (x1, x3), where forbidden combinations have utility −∞]

GOAL, formally: x* = argmax_x F(x) = argmax_x Σ_{f ∈ F} f(x|scope(f))

Page 3:

Graphical Processing Units (GPUs)
• A GPU is a massively parallel architecture:
  • Thousands of multi-threaded computing cores.
  • Very high memory bandwidth.
  • ~80% of transistors devoted to data processing rather than caching.

• However:
  • GPU cores are slower than CPU cores.
  • GPU memories have different sizes and access times.
  • GPU programming is more challenging and time-consuming.


Page 4:

Execution Model

[Figure: CUDA execution model. The host launches Kernel 1 and Kernel 2 on the device; each kernel runs as a grid of blocks (Block (0,0) … Block (2,1)); each block contains an array of threads, partitioned into warps]

• A Thread is the basic parallel unit.
• Threads are organized into a Block; several warps are scheduled for the execution of a GPU function.
• Several Streaming Multiprocessors (SMs) are scheduled in parallel.
• Single Instruction Multiple Thread (SIMT) parallel model.
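To make the hierarchy concrete, here is a minimal CUDA sketch (hypothetical kernel and names, not part of GPU-(D)BE) in which each thread derives its global index from its block and thread coordinates:

__global__ void hierarchyDemo(int *out, int n) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index across the grid
    int warp = threadIdx.x / warpSize;                 // which warp of the block this thread is in
    if (tid < n)                                       // guard: the grid may overshoot n
        out[tid] = warp;
}
// launched from the host as: hierarchyDemo<<<nBlocks, nThreads>>>(dOut, n);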

Page 5:

[Figure: GPU memory hierarchy. The host faces the device's global and constant memories; each block has its own shared memory and per-thread registers. Each SM contains an instruction cache, warp schedulers, a 32K register file, 32 cores, 4 SFUs, and 64KB of shared memory / L1 cache; the SMs share an L2 cache in front of the DRAM banks, connected to the host interface]

Memory Hierarchy
• The GPU memory architecture is rather involved:
  • Registers
  • Shared memory
  • Global memory


Page 6:


Memory Hierarchy
• Registers:
  • Fastest;
  • Only accessible by one thread;
  • Lifetime of a thread.


Page 7:


Memory Hierarchy
• Shared memory:
  • Extremely fast;
  • Highly parallel;
  • Restricted to one block.


Page 8:


Memory Hierarchy
• Global memory:
  • Typically implemented in DRAM;
  • High access latency (400-800 cycles);
  • Potential for traffic congestion.


Page 9:


Memory Hierarchy
• Challenge: using memory effectively likely requires redesigning the algorithm.
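For instance, one common pattern (a generic sketch with illustrative names, not the GPU-(D)BE code) is to stage a small, frequently reused table in fast shared memory once per block:

__global__ void useSharedTable(const int *table, int tableLen, int *out, int n) {
    extern __shared__ int sTable[];            // shared buffer, sized at launch time
    for (int i = threadIdx.x; i < tableLen; i += blockDim.x)
        sTable[i] = table[i];                  // cooperative, coalesced copy from global memory
    __syncthreads();                           // wait until the whole table is staged

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = sTable[tid % tableLen];     // repeated reads now hit shared memory
}
// launched as: useSharedTable<<<nBlocks, nThreads, tableLen * sizeof(int)>>>(...)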


Page 10:

CUDA: Compute Unified Device Architecture

[Figure: host (CPU) and device (GPU)]

Page 11:

CUDA: Compute Unified Device Architecture


cudaMalloc(&deviceV, sizeV);                               // allocate sizeV bytes in device global memory
cudaMemcpy(deviceV, hostV, sizeV, cudaMemcpyHostToDevice); // copy data: host -> device global memory

Page 12:

CUDA: Compute Unified Device Architecture


cudaKernel<<<nBlocks, nThreads>>>( );   // kernel invocation: launch nBlocks blocks of nThreads threads
// cudaKernel( ) then executes on the device, reading and writing global memory

Page 13:

CUDA: Compute Unified Device Architecture


cudaMemcpy(hostV, deviceV, sizeV, cudaMemcpyDeviceToHost); // copy results: device global memory -> host
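Putting the three steps together, a minimal self-contained sketch of this host/device workflow (hypothetical kernel and data, not the authors' solver code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void cudaKernel(int *v, int n) {        // trivial stand-in for the real computation
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) v[tid] *= 2;
}

int main() {
    const int n = 1 << 20;
    size_t sizeV = n * sizeof(int);
    int *hostV = (int *)malloc(sizeV);
    for (int i = 0; i < n; ++i) hostV[i] = i;

    int *deviceV;
    cudaMalloc(&deviceV, sizeV);                                // allocate device global memory
    cudaMemcpy(deviceV, hostV, sizeV, cudaMemcpyHostToDevice);  // copy input to the device

    int nThreads = 256, nBlocks = (n + nThreads - 1) / nThreads;
    cudaKernel<<<nBlocks, nThreads>>>(deviceV, n);              // launch the kernel

    cudaMemcpy(hostV, deviceV, sizeV, cudaMemcpyDeviceToHost);  // copy results back
    printf("hostV[42] = %d\n", hostV[42]);                      // prints 84

    cudaFree(deviceV);
    free(hostV);
    return 0;
}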

Page 14:

Bucket Elimination and DPOP
• Dynamic programming procedures to solve (D)COPs.
• Both procedures rely on two operators:
  • Projection operator: π-xi(fij)
  • Aggregation operator: fij + fik

fij:
xi xj | U
 0  0 |  5
 0  1 |  8
 1  0 | 20
 1  1 |  3

π-xi(fij):
xj | U
 0 | 20
 1 |  8

Page 15:

Projection example: for xj = 0, the projected entry is max(5, 20) = 20.

Page 16:

For xj = 1, the projected entry is max(8, 3) = 8.

Page 17:

Aggregation example: fij + fik.

fij:
xi xj | U
 0  0 |  5
 0  1 |  8
 1  0 | 20
 1  1 |  3

fik:
xi xk | U
 0  0 |  2
 0  1 |  6
 1  0 | 11
 1  1 |  4

fij + fik:
xi xj xk | U
 0  0  0 |  7
 0  0  1 | 11
 0  1  0 | 10
 0  1  1 | 14
 . . .

Page 18:

Aggregation example: entry (xi, xj, xk) = (0, 0, 0) is 5 + 2 = 7.

Page 19:

Entry (0, 0, 1) is 5 + 6 = 11, and so on.
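The two operators are straightforward to express over flattened tables. A host-side sketch (illustrative names and a fixed binary-table layout; the real solver handles arbitrary scopes):

#include <algorithm>

// Tables over domains of size d are stored row-major as flat arrays:
// fij[xi * d + xj] holds fij(xi, xj).

// Aggregation: out(xi, xj, xk) = fij(xi, xj) + fik(xi, xk).
void aggregate(const int *fij, const int *fik, int *out, int d) {
    for (int xi = 0; xi < d; ++xi)
        for (int xj = 0; xj < d; ++xj)
            for (int xk = 0; xk < d; ++xk)
                out[(xi * d + xj) * d + xk] = fij[xi * d + xj] + fik[xi * d + xk];
}

// Projection: out(xj) = max over xi of f(xi, xj), i.e., pi-xi(f).
void projectOutXi(const int *f, int *out, int d) {
    for (int xj = 0; xj < d; ++xj) {
        int best = f[0 * d + xj];
        for (int xi = 1; xi < d; ++xi)
            best = std::max(best, f[xi * d + xj]);
        out[xj] = best;
    }
}

On the deck's example (d = 2), projectOutXi returns {20, 8} and aggregate produces the 7, 11, 10, 14, … entries shown above.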

Page 20:

[Figure (a)-(d): (a) the constraint graph over x1, x2, x3 with functions f12, f13, f23; (b) the shared utility table for each fij (i < j); (c) the variable ordering / pseudotree; (d) the tables produced by the elimination, shown below]

Each fij, for i < j:
xi xj | Utility
 0  0 |  5
 0  1 |  8
 1  0 | 20
 1  1 |  3

After eliminating x3:
x1 x2 | Utility
 0  0 | max(5+5, 8+8) = 16
 0  1 | max(5+20, 8+3) = 25
 1  0 | max(20+5, 3+8) = 25
 1  1 | max(20+20, 3+3) = 40

After eliminating x2:
x1 | Utility
 0 | max(5+16, 8+25) = 33
 1 | max(20+25, 3+40) = 45

Optimal assignment: x1 = 1, x2 = 0, x3 = 0 (utility 45).

Bucket Elimination and DPOP
1. Impose an ordering on the problem's variables.


X = {x1, x2, x3};  F = {f12, f13, f23};  ordering: x1, x2, x3 (x3 has the highest priority).

Page 21:


Bucket Elimination and DPOP
2. Select the variable xi with the highest priority and create its bucket:
Bi = { fj ∈ F | xi ∈ scope(fj) ∧ i = max{ k | xk ∈ scope(fj) } }


Here: B3 = {f13, f23};  X = {x1, x2, x3};  F = {f12, f13, f23}


Page 22:


Bucket Elimination and DPOP
3. Compute a new utility function fi' by aggregating the functions in Bi and projecting out xi.


B3 = {f13, f23};  f3' = π-x3 (f13 + f23)

f13:
x1 x3 | U
 0  0 |  5
 0  1 |  8
 1  0 | 20
 1  1 |  3

f23:
x2 x3 | U
 0  0 |  5
 0  1 |  8
 1  0 | 20
 1  1 |  3

f3', computed row by row:
x1 x2 | U                      (maximizing x3)
 0  0 | max(5+5, 8+8) = 16      1

Page 23:


Bucket Elimination and DPOP

Next row of f3':
x1 x2 | U                      (maximizing x3)
 0  1 | max(5+20, 8+3) = 25     0

Page 24:


Bucket Elimination and DPOP

Next row of f3':
x1 x2 | U                      (maximizing x3)
 1  0 | max(20+5, 3+8) = 25     0

Page 25:


Bucket Elimination and DPOP

The complete f3' = π-x3 (f13 + f23):
x1 x2 | U                        (maximizing x3)
 0  0 | 16                        1
 0  1 | 25                        0
 1  0 | 25                        0
 1  1 | max(20+20, 3+3) = 40      0

Page 26:


Bucket Elimination and DPOP
4. Update the set of variables: X ← X \ {xi}
5. Update the set of functions: F ← (F ∪ {fi'}) \ Bi


Here: X = {x1, x2};  F = {f12, f3'};  next bucket: B2 = {f12, f3'}


Page 27:


Bucket Elimination and DPOP
Repeat...


X = {x1, x2};  F = {f12, f3'}

• DPOP is a distributed version of BE.
• It operates on a pseudotree ordering of the constraint graph: each agent xi aggregates the tables fj', fk' received from its children xj, xk and sends its own projected table up to its parent.

Page 28:


GPU-(D)BE

Recall: f3' = π-x3 (f13 + f23); each row of f3' is an independent max over sums.

• BE and DPOP complexity: O(d^w*), where d is the maximum domain size and w* is the induced width of the constraint graph.

• Can the projection and aggregation operators be executed in parallel?
• Do they fit the SIMT parallel model?

Page 29:


GPU-(D)BE


• Obs.: The computation of each row of the utility tables is independent of the computation of the other rows.

Page 30:

Algorithm Design and Data Structures
• Limit the amount of host-device data transfers.
  • Static entities require a single transfer: variables, domains, utility functions, constraint graph.
  • Dynamic entities might require multiple transfers: utility tables.
• Minimize accesses to global memory: pad the utility tables' rows; use perfect hashing.
• Ensure data accesses are coalesced: mono-dimensional array organization (see the sketch below).


[Figure: coalesced (good) vs. scattered (bad) global memory access patterns]
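A sketch of the layout idea (assumed helper names; the padding width is illustrative): tables are flattened into one-dimensional arrays and each row is padded to a warp-size multiple, so consecutive threads touch consecutive, aligned addresses:

constexpr int WARP = 32;

__host__ __device__ inline int paddedWidth(int cols) {
    return (cols + WARP - 1) / WARP * WARP;   // round the row length up to the warp size
}

__host__ __device__ inline int cell(int row, int col, int cols) {
    return row * paddedWidth(cols) + col;     // index into the flat, padded table
}
// table[cell(r, c, cols)] replaces a two-dimensional table2D[r][c]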

Page 31:

Parallel Projection and Aggregation
• Mapping between the fi' table rows and the CUDA blocks: each thread in a block is associated with the computation of one permutation of values in scope(fi').
• 1 block = 64k threads (1 ≤ k ≤ 16); k depends on the architecture and is chosen to maximize the number of threads that can be scheduled concurrently.
• Obs.: the maximum number of fi' table rows computed in parallel is M = |SM| · 64k. In our experiments, |SM| = 14 and k = 3, thus M = 14 · 192 = 2688.


[Figure: GPU kernel with blocks B0…B13, B14…B29 scheduled onto SM0…SM13; within a block, threads Th0, Th1, …, Th192 each compute one row, reading from and writing to GPU global memory]

Each thread t computes one row Rt of the output table Ui':
Rt = max_{d ∈ Di} Σ_{fj ∈ Bi} fj(σ = rt ∧ xi = d),   for rows r0, r1, …, r192,
where σ = rt fixes the variables in scope(fi') to the values of row rt.
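A minimal sketch of the kernel this mapping suggests (assumed names; for brevity, a single separator variable of domain size dSep, with each fj in Bi a flattened binary table fj(sep, xi); the real solver handles arbitrary scopes). Each thread aggregates Bi for its row and projects out xi with a running max:

#include <cfloat>

__global__ void aggregateAndProject(const float *const *tables,  // device pointers, one per fj in Bi
                                    int m, int dSep, int dXi,
                                    float *uOut) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;   // the output row this thread owns
    if (r >= dSep) return;

    float best = -FLT_MAX;
    for (int v = 0; v < dXi; ++v) {                  // projection: max over the values of xi ...
        float sum = 0.0f;
        for (int j = 0; j < m; ++j)                  // ... of the aggregation over Bi
            sum += tables[j][r * dXi + v];
        best = fmaxf(best, sum);
    }
    uOut[r] = best;                                  // Rt = max_d sum_j fj(rt, d)
}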

Page 32:

Experimental Results

Regular Grid Networks
• |Di| = 5; p2 = 0.5; timeout = 300s.
• CPU: 2.3GHz, 128 GB RAM; GPU: 14 SMs, 837MHz.
• Speedup: avg. max. 125.1x; avg. min. 42.6x.
• Similar trends at increasing |Di|, and for DPOP vs. GPU-DBE.

[Plot: runtime (sec, log scale, 0.01-10) vs. number of variables (9-100) for BE, GPU-BE, DPOP, GPU-DBE]

Page 33:

Experimental Results

Random Networks
• |Di| = 5; p2 = 0.5; p1 = 0.3; timeout = 300s.
• CPU: 2.3GHz, 128 GB RAM; GPU: 14 SMs, 837MHz.
• Speedup: avg. max. 69.3x; avg. min. 16.1x.
• Similar trends at increasing |Di|, and for DPOP vs. GPU-DBE.

[Plot: runtime (sec, log scale, 0.01-50) vs. number of variables (5-25) for BE, GPU-BE, DPOP, GPU-DBE]

Page 34:

Experimental Results

Scale-Free Networks
• |Di| = 5; p2 = 0.5; timeout = 300s.
• CPU: 2.3GHz, 128 GB RAM; GPU: 14 SMs, 837MHz.
• Speedup: avg. max. 34.9x; avg. min. 9.5x.
• Similar trends at increasing |Di|, and for DPOP vs. GPU-DBE.

[Plot: runtime (sec, log scale, 0.1-50) vs. number of variables (10-50) for BE, GPU-BE, DPOP, GPU-DBE]

Page 35:

Lesson Learned #1
• The fi' table size increases exponentially with w*.
• GPU global memory is limited (2GB): the fi' table plus the Bi tables used in the aggregation might exceed its capacity!
• Partition the fi' computation into multiple chunks (see the sketch below).
• Alternate GPU and CPU to compute fi':
  • GPU: aggregates the functions in Bi, excluding those that do not fit in global memory.
  • CPU: aggregates the remaining functions in Bi and projects out the variable xi.
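A host-side sketch of the chunking loop (illustrative names; computeChunk stands in for the aggregation/projection kernel): rows of fi' are produced in batches sized to fit global memory, each copied back before the next is launched:

__global__ void computeChunk(float *out, int firstRow, int rows) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows) out[r] = (float)(firstRow + r);    // stand-in for the real per-row work
}

void computeInChunks(float *hTable, float *dChunk, int totalRows, int chunkRows) {
    for (int first = 0; first < totalRows; first += chunkRows) {
        int rows = (first + chunkRows <= totalRows) ? chunkRows : totalRows - first;
        int nThreads = 256, nBlocks = (rows + nThreads - 1) / nThreads;
        computeChunk<<<nBlocks, nThreads>>>(dChunk, first, rows);
        cudaMemcpy(hTable + first, dChunk, rows * sizeof(float),
                   cudaMemcpyDeviceToHost);          // bring this chunk back to the host
    }
}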


Page 36:

Experimental Results

Random Networks (new results, work in progress)
• |A| = 10; |Di| = 5; p2 = 0.5; timeout = 300s.
• CPU: 2.3GHz, 128 GB RAM; GPU: 14 SMs, 837MHz.
• Phase transition at p1 = 0.4; small p1 corresponds to a smaller w*.

[Plot: speedup (GPU vs. CPU, 40-80) vs. graph density p1 (0.2-0.9)]

Page 37:

Lesson Learned #2
• Host and device concurrency:
  • Possible when the fi' tables are computed in chunks.
  • It may hide host-device data transfers as a byproduct.


[Figure: host/device timeline. After an initial H→D copy, the device executes kernel K1 to compute U1 and updates global memory; U1 is copied back (D→H) and compressed on the CPU while the device already executes K2 to compute U2, and so on]
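A sketch of how CUDA streams realize this overlap (a generic double-buffering pattern with assumed names, reusing computeChunk from the earlier sketch, not the authors' exact schedule): work issued in different streams may run concurrently, so the copy of chunk t can overlap the kernel for chunk t+1:

void pipelinedCompute(float *hOut, float *dBuf[2], int nChunks, int chunkRows) {
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    int nThreads = 256, nBlocks = (chunkRows + nThreads - 1) / nThreads;
    for (int t = 0; t < nChunks; ++t) {
        int k = t % 2;                               // alternate between two device buffers
        computeChunk<<<nBlocks, nThreads, 0, s[k]>>>(dBuf[k], t * chunkRows, chunkRows);
        cudaMemcpyAsync(hOut + (size_t)t * chunkRows, dBuf[k],
                        chunkRows * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
        // the CPU is free here to compress chunk t-1 while the GPU works
    }
    for (int i = 0; i < 2; ++i) cudaStreamSynchronize(s[i]);
}
// hOut must be page-locked (cudaMallocHost) for the copies to be truly asynchronous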

Page 38:

Discussion
• Exploiting the integration of CPU and GPU is a key factor in obtaining competitive solver performance.

• How to determine good tradeoffs for such integration?
  • GPU: repeated, non-memory-intensive operations; operations requiring regular memory access.
  • CPU: memory-intensive operations.


Page 39:

Conclusions

• Exploited GPU-style parallelism in DP-based (D)COP resolution methods.
• GPU-(D)BE exploits GPUs to parallelize the aggregation and projection operators.
• Observed speedups ranging from 34.9x to 125.1x across several network topologies.
• Discussed several possible optimization techniques.

• FUTURE WORK:
  • Exploit GPUs in DP-based propagators.
  • Investigate GPUs in higher forms of consistency.


Page 40:

Exploiting GPUs in Solving (Distributed) Constraint Optimization Problems with Dynamic Programming
F. Fioretto, T. Le, E. Pontelli, W. Yeoh, T. Son

Ferdinando Fioretto
New Mexico State University, University of Udine
Email: [email protected]
Web: www.cs.nmsu.edu/~ffiorett

Thank You!