EECS 583 – Class 20
Research Topic 2: Stream Compilation, Stream Graph Modulo Scheduling
University of Michigan
November 30, 2011
Guest Speaker Today: Daya Khudia
- 2 -
Announcements & Reading Material

This class
» “Orchestrating the Execution of Stream Programs on Multicore Platforms,” M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Jun. 2008.

Next class – GPU compilation
» “Program optimization space pruning for a multithreaded GPU,” S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S. Ueng, J. Stratton, and W. Hwu, Proc. Intl. Sym. on Code Generation and Optimization, Mar. 2008.
- 3 -
Stream Graph Modulo Scheduling (SGMS)

Coarse grain software pipelining
» Equal work distribution
» Communication/computation overlap
» Synchronization costs

Target: Cell processor
» Cores with disjoint address spaces
» Explicit copy to access remote data (DMA engine independent of PEs)

Filters = operations, cores = function units

[Figure: Cell architecture – eight SPEs (SPE0–SPE7), each an SPU with a 256 KB local store and an MFC (DMA) unit, connected over the EIB to the PPE (PowerPC) and DRAM.]
- 4 -
Preliminaries

Synchronous Data Flow (SDF) [Lee ’87], StreamIt [Thies ’02]

Filters push and pop items from input/output FIFOs. The FIR filter below is stateless; a filter that carries state across firings (e.g. a field int wgts[N] updated by wgts = adapt(wgts)) is stateful.

    int->int filter FIR(int N, int wgts[N]) {
      work pop 1 push 1 {
        int i, sum = 0;
        for (i = 0; i < N; i++)
          sum += peek(i) * wgts[i];
        push(sum);
        pop();
      }
    }
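As a sketch of the peek/push/pop semantics, here is a small Python stand-in for one firing of the FIR filter. The deque-based channel and the fir_work harness are illustrative glue, not StreamIt's runtime:

```python
from collections import deque

def fir_work(fifo_in, fifo_out, wgts):
    """One firing of the FIR filter: peek N items, push one sum, pop one."""
    n = len(wgts)
    # peek(i) reads the channel without consuming; the work function
    # declares "pop 1 push 1", so exactly one item moves on each side
    total = sum(fifo_in[i] * wgts[i] for i in range(n))
    fifo_out.append(total)   # push(sum)
    fifo_in.popleft()        # pop()

fifo_in = deque([1, 2, 3, 4])    # input channel, already filled
fifo_out = deque()
fir_work(fifo_in, fifo_out, [1, 1, 1])   # 3-tap FIR, unit weights
print(fifo_out[0])       # 6, i.e. 1+2+3
print(list(fifo_in))     # [2, 3, 4]: one input item consumed
```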
- 5 -
SGMS Overview

[Figure: schedule on four PEs (PE0–PE3). A single-PE schedule of length T1 is split into per-PE chunks of length T4, with T1 ≈ 4 × T4; DMA transfers are interleaved between the chunks, and a prologue fills the pipeline while an epilogue drains it.]
- 6 -
SGMS Phases

1. Fission + processor assignment (load balance)
2. Stage assignment (causality, DMA overlap)
3. Code generation
- 7 -
Processor Assignment: Maximizing Throughput

Assigns each filter to a processor, here four processing elements (PE0–PE3). With binary variables a_ij (a_ij = 1 if filter i runs on PE j) and per-filter workloads w_i (W: workload), the ILP is:

    Minimize II, subject to
      Σ_j a_ij = 1           for all filters i = 1, …, N   (each filter on exactly one PE)
      Σ_i w_i · a_ij ≤ II    for all PEs j = 1, …, P       (no PE's load exceeds II)

Example: six filters A–F with workloads 20, 20, 20, 30, 50 and 30, so a single PE needs T1 = 170. A balanced four-PE partition reaches the minimum II of 50 (T2 = 50): balanced workload, maximum throughput, speedup T1/T2 = 3.4.
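On an instance this small, the ILP result can be checked by brute force. The sketch below assumes the slide's workloads map to filters A–F in the order 20, 20, 20, 30, 50, 30 (the per-filter labels are an assumption; only the multiset of weights and the resulting II = 50 come from the slide):

```python
from itertools import product

weights = [20, 20, 20, 30, 50, 30]   # assumed per-filter work for A..F
P = 4                                # processing elements

def max_load(assign):
    """II achieved by one filter->PE assignment: the busiest PE's total work."""
    loads = [0] * P
    for w, pe in zip(weights, assign):
        loads[pe] += w
    return max(loads)

# Tiny instance, so exhaustive search (4^6 = 4096 cases) stands in for the solver.
best = min(product(range(P), repeat=len(weights)), key=max_load)
II = max_load(best)
print(II)                   # 50: matches the slide's minimum II
print(sum(weights) / II)    # 3.4: the T1/T2 speedup (T1 = 170)
```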
- 8 -
Need More Than Just Processor Assignment

Assign filters to processors
» Goal: equal work distribution
» Graph partitioning? Bin packing?

Example: the original stream program has four filters A, B, C, D with workloads 5, 40, 10 and 5 (total 60). On two PEs the best assignment isolates the heavy filter: PE0 = {B} with load 40, PE1 = {A, C, D} with load 20, so speedup = 60/40 = 1.5. In the modified stream program, B is fissed into B1 and B2 with a splitter S and joiner J; mapping {A, S, B1, D} and {B2, C, J} onto the two PEs balances the loads at about 32 each, so speedup = 60/32 ≈ 2.
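The arithmetic behind the two speedups can be checked directly. The split/join costs are not given on the slide, so the value of 2 work units each is an assumption chosen to reproduce the reported 60/32:

```python
# Workloads from the slide: A=5, B=40, C=10, D=5 (total T1 = 60), two PEs.
# Without fission the best bipartition isolates B:
pe0 = 40           # {B}
pe1 = 5 + 10 + 5   # {A, C, D}
print(60 / max(pe0, pe1))   # 1.5

# Fission splits B into B1 and B2 (20 each) plus split/join filters S and J.
# Assumption: S and J each cost 2 units; with that overhead the mapping
# {A, S, B1, D} / {B2, C, J} yields the slide's 60/32.
pe0 = 5 + 2 + 20 + 5   # {A, S, B1, D} = 32
pe1 = 20 + 10 + 2      # {B2, C, J} = 32
print(60 / max(pe0, pe1))   # 1.875, i.e. ~2
```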
- 9 -
Filter Fission Choices

[Figure: alternative ways of fissing and mapping a filter across four PEs (PE0–PE3).]

Speedup ~ 4?
- 10 -
Integrated Fission + PE Assignment

Exact solution based on Integer Linear Programming (ILP)
» Split/join overhead factored in
» Objective function: minimize the maximal load on any PE

Result
» Number of times to “split” each filter
» Filter → processor mapping
- 11 -
Step 2: Forming the Software Pipeline

To achieve speedup
» All chunks should execute concurrently
» Communication should be overlapped

Processor assignment alone is insufficient information.

[Figure: with A on PE0 and B on PE1, naively running A_i then B_i serializes each iteration on the A→B transfer. Software pipelining instead overlaps A_{i+2} with B_i, keeping the A→B DMA for the intervening iteration in flight.]
- 12 -
Stage Assignment

Preserve causality (producer-consumer dependence): for an edge i → j with both filters on the same PE, require S_j ≥ S_i.

Communication-computation overlap: for an edge i → j whose filters sit on different PEs (PE1, PE2), the transfer gets its own DMA stage, with S_DMA > S_i and S_j = S_DMA + 1.

Data flow traversal of the stream graph
» Assign stages using the above two rules
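The two rules can be sketched as one pass over the graph in topological order. The graph and PE mapping below follow the fission example (A, S, B1, D on one PE; B2, C, J on the other); the dictionary representation is illustrative glue, while the max / S_i + 2 updates are the slide's rules (a cross-PE edge consumes two extra stages: one for the DMA, one for the consumer):

```python
# Edges of the fissed example graph and its filter->PE mapping.
edges = [("A", "S"), ("A", "C"), ("S", "B1"), ("S", "B2"),
         ("B1", "J"), ("B2", "J"), ("J", "D"), ("C", "D")]
pe = {"A": 0, "S": 0, "B1": 0, "D": 0, "B2": 1, "C": 1, "J": 1}

stage = {}
for node in ["A", "S", "B1", "B2", "C", "J", "D"]:   # topological order
    s = 0
    for src, dst in edges:
        if dst != node:
            continue
        if pe[src] == pe[node]:
            s = max(s, stage[src])       # same PE: S_j >= S_i
        else:
            s = max(s, stage[src] + 2)   # cross PE: DMA at S_i+1, S_j = S_DMA+1
    stage[node] = s
print(stage)   # {'A': 0, 'S': 0, 'B1': 0, 'B2': 2, 'C': 2, 'J': 2, 'D': 4}
```

The result matches the stage-assignment example on the next slide: A, S, B1 in stage 0, B2, C, J in stage 2, D in stage 4, with DMA stages 1 and 3 in between.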
- 13 -
Stage Assignment Example

[Figure: the fissed example graph on two PEs. One PE runs A, S and B1 in stage 0 and D in stage 4; the other runs B2, C and J in stage 2; stage 1 holds the DMA transfers into the second PE, and stage 3 the DMA transfers back to D.]
- 14 -
Step 3: Code Generation for Cell

Target the Synergistic Processing Elements (SPEs)
» PS3 – up to 6 SPEs
» QS20 – up to 16 SPEs

One thread / SPE

Challenge
» Making a collection of independent threads implement a software pipeline
» Adapt kernel-only code schema of a modulo schedule
- 15 -
Complete Example

The SPE that owns A, S, B1 and D runs a kernel-only loop in which each stage is guarded by a predicate flag (the code that shifts the stage[] flags as the pipeline fills is elided on the slide):

    void spe1_work() {
      char stage[5] = {0};
      int i;
      stage[0] = 1;
      for (i = 0; i < MAX; i++) {
        if (stage[0]) { A(); S(); B1(); }
        if (stage[1]) { }
        if (stage[2]) { JtoD(); CtoD(); }
        if (stage[3]) { }
        if (stage[4]) { D(); }
        barrier();
      }
    }

[Figure: timeline over SPE1, DMA1, SPE2 and DMA2. Each iteration SPE1 runs A, S, B1 while DMA1 carries A→C, S→B2 and B1→J; SPE2 runs B2, C, J while DMA2 carries C→D and J→D; once the pipeline has filled, SPE1 also runs D.]
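One way the elided stage predicates could evolve: stage s runs filter instance i − s, so it turns on during the prologue and off during the epilogue. The closed-form rule below is an assumption for illustration, not the paper's implementation:

```python
NUM_STAGES = 5

def active_stages(iteration, total_iters):
    """Stage predicates for one global iteration of the software pipeline.

    Assumption: stage s executes filter instance (iteration - s), so it is
    active once iteration >= s (prologue) and while that instance still
    exists, i.e. iteration < total_iters + s (epilogue).
    """
    return [1 if s <= iteration < total_iters + s else 0
            for s in range(NUM_STAGES)]

# 6 steady-state iterations of work through a 5-stage pipeline:
for it in range(10):
    print(it, active_stages(it, 6))
# iteration 0 -> [1, 0, 0, 0, 0] (prologue starts filling)
# iteration 4 -> [1, 1, 1, 1, 1] (steady state, all stages active)
# iteration 9 -> [0, 0, 0, 0, 1] (epilogue drains)
```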
- 16 -
SGMS (ILP) vs. Greedy

[Chart: relative speedup (0–9 scale) on the benchmarks bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder and radar, comparing ILP partitioning against greedy partitioning (MIT method, ASPLOS ’06) and showing the exposed-DMA component.]

• Solver time < 30 seconds for 16 processors
- 17 -
SGMS Conclusions

Streamroller
» Efficient mapping of stream programs to multicore
» Coarse grain software pipelining

Performance summary
» 14.7x speedup on 16 cores
» Up to 35% better than greedy solution (11% on average)

Scheduling framework
» Tradeoff memory space vs. load balance
» Memory constrained (embedded) systems vs. cache based systems
- 18 -
Discussion Points

Is it possible to convert stateful filters into stateless?

What if the application does not behave as you expect?
» Filters change execution time?
» Memory faster/slower than expected?

Could this be adapted for a more conventional multiprocessor with caches?

Can C code be automatically streamized?

Now you have seen 3 forms of software pipelining:
» 1) Instruction level modulo scheduling, 2) Decoupled software pipelining, 3) Stream graph modulo scheduling
» Where else can it be used?
“Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures,”
- 20 -
Static Versus Dynamic Scheduling

[Figure: a stream graph (A feeding a splitter into B1–B4; C; a second splitter into D1 and D2 with its joiner; E; F; final joiner) statically mapped onto four cores, each with its own local memory.]

Performing stream graph modulo scheduling statically raises the question: what happens in case of dynamic resource changes?
- 21 -
Overview of Flextream

Goal: to perform adaptive stream graph modulo scheduling. Input: a streaming application; output: MSL commands.

Static phases (find an optimal modulo schedule for a virtualized member of a family of processors):
» Prepass replication – adjusts the amount of parallelism for the target system by replicating actors
» Work partitioning – finds an optimal modulo schedule for the virtualized target

Dynamic phases (light-weight adaptation of the schedule for the current configuration of the target hardware):
» Partition refinement – tunes the actor-processor mapping to the real configuration of the underlying hardware (load balance)
» Stage assignment – specifies how actors execute in time in the new actor-processor mapping
» Buffer allocation – tries to efficiently allocate the storage requirements of the new schedule into available memory units
- 22 -
Overall Execution Flow

For every application we may see multiple iterations of: a resource change request, adaptation of the schedule, and the resource change being granted.
- 23 -
Prepass Replication [static]

[Figure: example graph A→B→C→D→E→F with workloads A=10, B=86, C=246, D=326, E=566, F=10. The heavy actors are replicated (C into C0–C3, D into D0–D1, E into E0–E3, with splitters S0–S2 and joiners J0–J2), so the per-processor loads on the 8-PE virtualized target go from {10, 86, 246, 326, 566, 10, 0, 0} to a balanced ≈141.5–184.5 per PE.]
- 24 -
Partition Refinement [dynamic 1]
Available resources at runtime can be more limited than the resources in the static target architecture.

Partition refinement tunes the actor-to-processor mapping for the active configuration.

A greedy iterative algorithm is used to achieve this goal.
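A minimal sketch of the greedy idea, assuming the runtime drops whole PEs and orphaned actors are repacked heaviest-first onto the least-loaded survivors. The function name and the exact heuristic are hypothetical simplifications; Flextream's actual refinement also migrates actors between surviving PEs to rebalance:

```python
import heapq

def refine(mapping, work, available):
    """Remap actors from dropped PEs onto the least-loaded remaining PEs.

    mapping: actor -> PE from the static schedule; work: actor -> workload;
    available: the PEs that still exist at runtime.
    """
    loads = {p: 0 for p in available}
    for actor, p in mapping.items():
        if p in loads:
            loads[p] += work[actor]
    # Actors stranded on dropped PEs, heaviest first
    orphans = sorted((a for a, p in mapping.items() if p not in loads),
                     key=lambda a: -work[a])
    heap = [(load, p) for p, load in loads.items()]
    heapq.heapify(heap)
    new_map = dict(mapping)
    for actor in orphans:
        load, p = heapq.heappop(heap)    # least-loaded surviving PE
        new_map[actor] = p
        heapq.heappush(heap, (load + work[actor], p))
    return new_map

work = {"A": 30, "B": 30, "C": 20, "D": 20}
static_map = {"A": 0, "B": 1, "C": 2, "D": 3}        # 4-PE static schedule
print(refine(static_map, work, available={0, 1}))    # PEs 2 and 3 lost at runtime
# {'A': 0, 'B': 1, 'C': 0, 'D': 1}
```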
- 25 -
Partition Refinement Example

Greedy steps:
» Pick the processors with the most actors.
» Sort the actors.
» Find the processor with the maximum work.
» Assign minimum-work actors until the threshold is reached.

[Figure: an 8-PE mapping with per-PE loads of roughly 140–184.5 is refined for a smaller active configuration; actors from the dropped PEs (B, C0–C3, E2, the splitters S0–S2 and joiners J0–J2) are repacked onto the surviving PEs, whose loads rise to roughly 270–289.]
- 26 -
Stage Assignment [dynamic 2]

Processor assignment only specifies how actors are overlapped across processors.

Stage assignment finds how actors are overlapped in time.

Relative start time of the actors is based on stage numbers.

DMA operations will have a separate stage.
- 27 -
Stage Assignment Example

[Figure: the replicated graph (A; B; C0–C3 between S0 and J0; D0–D1 between S1 and J1; E0–E3 between S2 and J2; F) laid out against a time axis with stage numbers 0 through 18; actors on different processors are separated by intervening DMA stages.]
- 28 -
Performance Comparison
- 29 -
Overhead Comparison
- 30 -
Flextream Conclusions
Static scheduling approaches are promising but not enough.
Dynamic adaptation is necessary for future systems.
Flextream provides a hybrid static/dynamic approach to improve efficiency.