EECS 583 – Class 20
Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling
University of Michigan
November 30, 2011
Guest Speaker Today: Daya Khudia
- 2 -
Announcements & Reading Material

This class
» “Orchestrating the Execution of Stream Programs on Multicore Platforms,” M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Jun. 2008.

Next class – GPU compilation
» “Program optimization space pruning for a multithreaded GPU,” S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S. Ueng, J. Stratton, and W. Hwu, Proc. Intl. Sym. on Code Generation and Optimization, Mar. 2008.
- 3 -
Stream Graph Modulo Scheduling (SGMS)

Coarse grain software pipelining
» Equal work distribution
» Communication/computation overlap
» Synchronization costs

Target: Cell processor
» Cores with disjoint address spaces
» Explicit copy to access remote data (DMA engine independent of PEs)

Filters = operations, cores = function units

[Figure: Cell architecture – eight SPEs (SPE0–SPE7), each an SPU with a 256 KB local store and an MFC (DMA) unit, connected over the EIB to the PPE (PowerPC) and DRAM.]
- 4 -
Preliminaries

Synchronous Data Flow (SDF) [Lee ’87], StreamIt [Thies ’02]

Filters push and pop items from input/output FIFOs. The FIR filter below is stateless; a filter that carries state across firings (e.g. a field int wgts[N] updated by wgts = adapt(wgts)) is stateful.

    int->int filter FIR(int N, int wgts[N]) {
      work pop 1 push 1 {
        int i, sum = 0;
        for (i = 0; i < N; i++)
          sum += peek(i) * wgts[i];
        push(sum);
        pop();
      }
    }
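As a sketch of the peek/push/pop semantics, here is a small Python stand-in for one firing of the FIR filter. The deque-based channel and the fir_work harness are illustrative glue, not StreamIt's runtime:

```python
from collections import deque

def fir_work(fifo_in, fifo_out, wgts):
    """One firing of the FIR filter: peek N items, push one sum, pop one."""
    n = len(wgts)
    # peek(i) reads the channel without consuming; the work function
    # declares "pop 1 push 1", so exactly one item moves on each side
    total = sum(fifo_in[i] * wgts[i] for i in range(n))
    fifo_out.append(total)   # push(sum)
    fifo_in.popleft()        # pop()

fifo_in = deque([1, 2, 3, 4])    # input channel, already filled
fifo_out = deque()
fir_work(fifo_in, fifo_out, [1, 1, 1])   # 3-tap FIR, unit weights
print(fifo_out[0])       # 6, i.e. 1+2+3
print(list(fifo_in))     # [2, 3, 4]: one input item consumed
```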
- 5 -
SGMS Overview

[Figure: schedule on four PEs (PE0–PE3). A single-PE schedule of length T1 is split into per-PE chunks of length T4, with T1 ≈ 4 × T4; DMA transfers are interleaved between the chunks, and a prologue fills the pipeline while an epilogue drains it.]
- 6 -
SGMS Phases

1. Fission + processor assignment (load balance)
2. Stage assignment (causality, DMA overlap)
3. Code generation
- 7 -
Processor Assignment: Maximizing Throughput

Assigns each filter to a processor, here four processing elements (PE0–PE3). With binary variables a_ij (a_ij = 1 if filter i runs on PE j) and per-filter workloads w_i (W: workload), the ILP is:

    Minimize II, subject to
      Σ_j a_ij = 1           for all filters i = 1, …, N   (each filter on exactly one PE)
      Σ_i w_i · a_ij ≤ II    for all PEs j = 1, …, P       (no PE's load exceeds II)

Example: six filters A–F with workloads 20, 20, 20, 30, 50 and 30, so a single PE needs T1 = 170. A balanced four-PE partition reaches the minimum II of 50 (T2 = 50): balanced workload, maximum throughput, speedup T1/T2 = 3.4.
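On an instance this small, the ILP result can be checked by brute force. The sketch below assumes the slide's workloads map to filters A–F in the order 20, 20, 20, 30, 50, 30 (the per-filter labels are an assumption; only the multiset of weights and the resulting II = 50 come from the slide):

```python
from itertools import product

weights = [20, 20, 20, 30, 50, 30]   # assumed per-filter work for A..F
P = 4                                # processing elements

def max_load(assign):
    """II achieved by one filter->PE assignment: the busiest PE's total work."""
    loads = [0] * P
    for w, pe in zip(weights, assign):
        loads[pe] += w
    return max(loads)

# Tiny instance, so exhaustive search (4^6 = 4096 cases) stands in for the solver.
best = min(product(range(P), repeat=len(weights)), key=max_load)
II = max_load(best)
print(II)                   # 50: matches the slide's minimum II
print(sum(weights) / II)    # 3.4: the T1/T2 speedup (T1 = 170)
```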
- 8 -
Need More Than Just Processor Assignment

Assign filters to processors
» Goal: equal work distribution
» Graph partitioning? Bin packing?

Example: the original stream program has four filters A, B, C, D with workloads 5, 40, 10 and 5 (total 60). On two PEs the best assignment isolates the heavy filter: PE0 = {B} with load 40, PE1 = {A, C, D} with load 20, so speedup = 60/40 = 1.5. In the modified stream program, B is fissed into B1 and B2 with a splitter S and joiner J; mapping {A, S, B1, D} and {B2, C, J} onto the two PEs balances the loads at about 32 each, so speedup = 60/32 ≈ 2.
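The arithmetic behind the two speedups can be checked directly. The split/join costs are not given on the slide, so the value of 2 work units each is an assumption chosen to reproduce the reported 60/32:

```python
# Workloads from the slide: A=5, B=40, C=10, D=5 (total T1 = 60), two PEs.
# Without fission the best bipartition isolates B:
pe0 = 40           # {B}
pe1 = 5 + 10 + 5   # {A, C, D}
print(60 / max(pe0, pe1))   # 1.5

# Fission splits B into B1 and B2 (20 each) plus split/join filters S and J.
# Assumption: S and J each cost 2 units; with that overhead the mapping
# {A, S, B1, D} / {B2, C, J} yields the slide's 60/32.
pe0 = 5 + 2 + 20 + 5   # {A, S, B1, D} = 32
pe1 = 20 + 10 + 2      # {B2, C, J} = 32
print(60 / max(pe0, pe1))   # 1.875, i.e. ~2
```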
- 9 -
Filter Fission Choices

[Figure: alternative ways of fissing and mapping a filter across four PEs (PE0–PE3).]

Speedup ~ 4?
- 10 -
Integrated Fission + PE Assignment

Exact solution based on Integer Linear Programming (ILP)
» Split/join overhead factored in
» Objective function: minimize the maximal load on any PE

Result
» Number of times to “split” each filter
» Filter → processor mapping
- 11 -
Step 2: Forming the Software Pipeline

To achieve speedup
» All chunks should execute concurrently
» Communication should be overlapped

Processor assignment alone is insufficient information.

[Figure: with A on PE0 and B on PE1, naively running A_i then B_i serializes each iteration on the A→B transfer. Software pipelining instead overlaps A_{i+2} with B_i, keeping the A→B DMA for the intervening iteration in flight.]
- 12 -
Stage Assignment

Preserve causality (producer-consumer dependence): for an edge i → j with both filters on the same PE, require S_j ≥ S_i.

Communication-computation overlap: for an edge i → j whose filters sit on different PEs (PE1, PE2), the transfer gets its own DMA stage, with S_DMA > S_i and S_j = S_DMA + 1.

Data flow traversal of the stream graph
» Assign stages using the above two rules
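The two rules can be sketched as one pass over the graph in topological order. The graph and PE mapping below follow the fission example (A, S, B1, D on one PE; B2, C, J on the other); the dictionary representation is illustrative glue, while the max / S_i + 2 updates are the slide's rules (a cross-PE edge consumes two extra stages: one for the DMA, one for the consumer):

```python
# Edges of the fissed example graph and its filter->PE mapping.
edges = [("A", "S"), ("A", "C"), ("S", "B1"), ("S", "B2"),
         ("B1", "J"), ("B2", "J"), ("J", "D"), ("C", "D")]
pe = {"A": 0, "S": 0, "B1": 0, "D": 0, "B2": 1, "C": 1, "J": 1}

stage = {}
for node in ["A", "S", "B1", "B2", "C", "J", "D"]:   # topological order
    s = 0
    for src, dst in edges:
        if dst != node:
            continue
        if pe[src] == pe[node]:
            s = max(s, stage[src])       # same PE: S_j >= S_i
        else:
            s = max(s, stage[src] + 2)   # cross PE: DMA at S_i+1, S_j = S_DMA+1
    stage[node] = s
print(stage)   # {'A': 0, 'S': 0, 'B1': 0, 'B2': 2, 'C': 2, 'J': 2, 'D': 4}
```

The result matches the stage-assignment example on the next slide: A, S, B1 in stage 0, B2, C, J in stage 2, D in stage 4, with DMA stages 1 and 3 in between.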
- 13 -
Stage Assignment Example

[Figure: the fissed example graph on two PEs. One PE runs A, S and B1 in stage 0 and D in stage 4; the other runs B2, C and J in stage 2; stage 1 holds the DMA transfers into the second PE, and stage 3 the DMA transfers back to D.]
- 14 -
Step 3: Code Generation for Cell

Target the Synergistic Processing Elements (SPEs)
» PS3 – up to 6 SPEs
» QS20 – up to 16 SPEs

One thread / SPE

Challenge
» Making a collection of independent threads implement a software pipeline
» Adapt kernel-only code schema of a modulo schedule
- 15 -
Complete Example

The SPE that owns A, S, B1 and D runs a kernel-only loop in which each stage is guarded by a predicate flag (the code that shifts the stage[] flags as the pipeline fills is elided on the slide):

    void spe1_work() {
      char stage[5] = {0};
      int i;
      stage[0] = 1;
      for (i = 0; i < MAX; i++) {
        if (stage[0]) { A(); S(); B1(); }
        if (stage[1]) { }
        if (stage[2]) { JtoD(); CtoD(); }
        if (stage[3]) { }
        if (stage[4]) { D(); }
        barrier();
      }
    }

[Figure: timeline over SPE1, DMA1, SPE2 and DMA2. Each iteration SPE1 runs A, S, B1 while DMA1 carries A→C, S→B2 and B1→J; SPE2 runs B2, C, J while DMA2 carries C→D and J→D; once the pipeline has filled, SPE1 also runs D.]
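One way the elided stage predicates could evolve: stage s runs filter instance i − s, so it turns on during the prologue and off during the epilogue. The closed-form rule below is an assumption for illustration, not the paper's implementation:

```python
NUM_STAGES = 5

def active_stages(iteration, total_iters):
    """Stage predicates for one global iteration of the software pipeline.

    Assumption: stage s executes filter instance (iteration - s), so it is
    active once iteration >= s (prologue) and while that instance still
    exists, i.e. iteration < total_iters + s (epilogue).
    """
    return [1 if s <= iteration < total_iters + s else 0
            for s in range(NUM_STAGES)]

# 6 steady-state iterations of work through a 5-stage pipeline:
for it in range(10):
    print(it, active_stages(it, 6))
# iteration 0 -> [1, 0, 0, 0, 0] (prologue starts filling)
# iteration 4 -> [1, 1, 1, 1, 1] (steady state, all stages active)
# iteration 9 -> [0, 0, 0, 0, 1] (epilogue drains)
```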
- 16 -
SGMS (ILP) vs. Greedy

[Chart: relative speedup (0–9 scale) on the benchmarks bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder and radar, comparing ILP partitioning against greedy partitioning (MIT method, ASPLOS ’06) and showing the exposed-DMA component.]

• Solver time < 30 seconds for 16 processors
- 17 -
SGMS Conclusions

Streamroller
» Efficient mapping of stream programs to multicore
» Coarse grain software pipelining

Performance summary
» 14.7x speedup on 16 cores
» Up to 35% better than greedy solution (11% on average)

Scheduling framework
» Tradeoff memory space vs. load balance
» Memory constrained (embedded) systems vs. cache based systems
- 18 -
Discussion Points

Is it possible to convert stateful filters into stateless?

What if the application does not behave as you expect?
» Filters change execution time?
» Memory faster/slower than expected?

Could this be adapted for a more conventional multiprocessor with caches?

Can C code be automatically streamized?

Now you have seen 3 forms of software pipelining:
» 1) Instruction level modulo scheduling, 2) Decoupled software pipelining, 3) Stream graph modulo scheduling
» Where else can it be used?
“Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures,”
- 20 -
Static Versus Dynamic Scheduling

[Figure: a stream graph (A feeding a splitter into B1–B4; C; a second splitter into D1 and D2 with its joiner; E; F; final joiner) statically mapped onto four cores, each with its own local memory.]

Performing stream graph modulo scheduling statically raises the question: what happens in case of dynamic resource changes?
- 21 -
Overview of Flextream

Goal: to perform adaptive stream graph modulo scheduling. Input: a streaming application; output: MSL commands.

Static phases (find an optimal modulo schedule for a virtualized member of a family of processors):
» Prepass replication – adjusts the amount of parallelism for the target system by replicating actors
» Work partitioning – finds an optimal modulo schedule for the virtualized target

Dynamic phases (light-weight adaptation of the schedule for the current configuration of the target hardware):
» Partition refinement – tunes the actor-processor mapping to the real configuration of the underlying hardware (load balance)
» Stage assignment – specifies how actors execute in time in the new actor-processor mapping
» Buffer allocation – tries to efficiently allocate the storage requirements of the new schedule into available memory units
- 22 -
Overall Execution Flow

For every application we may see multiple iterations of: a resource change request, adaptation of the schedule, and the resource change being granted.
- 23 -
Prepass Replication [static]

[Figure: example graph A→B→C→D→E→F with workloads A=10, B=86, C=246, D=326, E=566, F=10. The heavy actors are replicated (C into C0–C3, D into D0–D1, E into E0–E3, with splitters S0–S2 and joiners J0–J2), so the per-processor loads on the 8-PE virtualized target go from {10, 86, 246, 326, 566, 10, 0, 0} to a balanced ≈141.5–184.5 per PE.]
- 24 -
Partition Refinement [dynamic 1]
Available resources at runtime can be more limited than the resources in the static target architecture.

Partition refinement tunes the actor-to-processor mapping for the active configuration.

A greedy iterative algorithm is used to achieve this goal.
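A minimal sketch of the greedy idea, assuming the runtime drops whole PEs and orphaned actors are repacked heaviest-first onto the least-loaded survivors. The function name and the exact heuristic are hypothetical simplifications; Flextream's actual refinement also migrates actors between surviving PEs to rebalance:

```python
import heapq

def refine(mapping, work, available):
    """Remap actors from dropped PEs onto the least-loaded remaining PEs.

    mapping: actor -> PE from the static schedule; work: actor -> workload;
    available: the PEs that still exist at runtime.
    """
    loads = {p: 0 for p in available}
    for actor, p in mapping.items():
        if p in loads:
            loads[p] += work[actor]
    # Actors stranded on dropped PEs, heaviest first
    orphans = sorted((a for a, p in mapping.items() if p not in loads),
                     key=lambda a: -work[a])
    heap = [(load, p) for p, load in loads.items()]
    heapq.heapify(heap)
    new_map = dict(mapping)
    for actor in orphans:
        load, p = heapq.heappop(heap)    # least-loaded surviving PE
        new_map[actor] = p
        heapq.heappush(heap, (load + work[actor], p))
    return new_map

work = {"A": 30, "B": 30, "C": 20, "D": 20}
static_map = {"A": 0, "B": 1, "C": 2, "D": 3}        # 4-PE static schedule
print(refine(static_map, work, available={0, 1}))    # PEs 2 and 3 lost at runtime
# {'A': 0, 'B': 1, 'C': 0, 'D': 1}
```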
- 25 -
Partition Refinement Example

Greedy steps:
» Pick the processors with the most actors.
» Sort the actors.
» Find the processor with the maximum work.
» Assign minimum-work actors until the threshold is reached.

[Figure: an 8-PE mapping with per-PE loads of roughly 140–184.5 is refined for a smaller active configuration; actors from the dropped PEs (B, C0–C3, E2, the splitters S0–S2 and joiners J0–J2) are repacked onto the surviving PEs, whose loads rise to roughly 270–289.]
- 26 -
Stage Assignment [dynamic 2]

Processor assignment only specifies how actors are overlapped across processors.

Stage assignment finds how actors are overlapped in time.

Relative start time of the actors is based on stage numbers.

DMA operations will have a separate stage.
- 27 -
Stage Assignment Example

[Figure: the replicated graph (A; B; C0–C3 between S0 and J0; D0–D1 between S1 and J1; E0–E3 between S2 and J2; F) laid out against a time axis with stage numbers 0 through 18; actors on different processors are separated by intervening DMA stages.]
- 28 -
Performance Comparison
- 29 -
Overhead Comparison
- 30 -
Flextream Conclusions
Static scheduling approaches are promising but not enough.
Dynamic adaptation is necessary for future systems.
Flextream provides a hybrid static/dynamic approach to improve efficiency.