Optimus: Efficient Realization of Streaming Applications on FPGAs
University of Michigan: Amir Hormati, Manjunath Kudlur, Scott Mahlke
IBM Research: David Bacon, Rodric Rabbah
Introduction
End of free ride from clock scaling
Applications more demanding
More applications on embedded platforms
Evolution of new architectures
[Figure labels: Crypto, XML parser, Physics, GPU]
Why FPGAs?
• Customizable and reconfigurable
– On-the-fly and in-the-field
– Customizability brings performance and low power
• Many orders of magnitude more parallelism than existing multicores
– Task-level parallelism
– Pipeline parallelism
– Bit-level parallelism
Liquid Metal Vision
• One unified language (Lime) for programming hardware (e.g., FPGAs) and heterogeneous architectures
• Liquid Metal VM: JIT the hardware!
[Figure: programs written in Lime target the CPU, GPU, Cell (multicore), FPGA, and other (???) architectures through the Liquid Metal VM.]
Liquid Metal Tool Chain
[Figure: tool chain. Streaming languages pass through the front-end compiler to the Spatial IR. The Optimus back-end compiler, guided by an FPGA model, emits HDL that the Xilinx VHDL compiler turns into a Xilinx bitfile for the Virtex5 FPGA. The Crucible back-end compiler emits C that the Cell SDK turns into a Cell binary for the Cell BE. Each target is paired with a Streaming VM.]
Overview
• Spatial IR (SIR)
• Compilation Flow
• Scheduling
• Optimizations
• Results
Spatial Intermediate Representation
• Main constructs:
– Filter: encapsulates computation.
– Pipeline: expresses pipeline parallelism.
– Splitjoin: expresses task-level parallelism.
– Other constructs are not relevant here.
• Exposes different types of parallelism
– Composable, hierarchical
• Some streaming languages can be easily lowered to SIR:
– Lime, StreamIt
[Figure: example stream graph with a pipeline, a filter, and a splitjoin labeled.]
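The three constructs compose hierarchically, which a small tree data structure can make concrete. The following is a minimal Python sketch, not the Optimus implementation: the class names mirror the SIR constructs, while the `pop`/`push` rate fields, the `count_filters` helper, and the example graph are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Filter:
    """Leaf node: encapsulates computation (rates are illustrative fields)."""
    name: str
    pop: int
    push: int


@dataclass
class Pipeline:
    """Children connected in sequence: pipeline parallelism."""
    children: List["Node"] = field(default_factory=list)


@dataclass
class SplitJoin:
    """Branches placed between a splitter and a joiner: task-level parallelism."""
    branches: List["Node"] = field(default_factory=list)


Node = Union[Filter, Pipeline, SplitJoin]


def count_filters(node: Node) -> int:
    """Walk the hierarchy and count the leaf filters."""
    if isinstance(node, Filter):
        return 1
    kids = node.children if isinstance(node, Pipeline) else node.branches
    return sum(count_filters(k) for k in kids)


# Hypothetical graph: a source, four parallel adders, and a sink.
graph = Pipeline([
    Filter("Source", pop=0, push=1),
    SplitJoin([Filter(f"Adder{i}", pop=8, push=1) for i in range(1, 5)]),
    Filter("Sink", pop=1, push=0),
])
print(count_filters(graph))
```

Because a Pipeline or SplitJoin can hold any node, containers nest to arbitrary depth, which is what makes the representation composable.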
Top Level Compilation
[Figure: a filter module with a controller, inputs i0 … ix, outputs O0 … Om, and blocks Init, Work, and M0 … Mn. Beside it, a stream graph (Source, Round-Robin Splitter(8,8,8,8), four Filters, Round-Robin Joiner(1,1,1,1), Sink) with nodes labeled A to J, shown mapped to Work blocks that each have their own controller.]
Filter Compilation
bb1: sum = 0; i = 0
bb2: temp = pop( )
bb3: sum = sum + temp; i = i + 1; Branch bb2 if i < 8
bb4: push(sum)
[Figure: each basic block becomes a hardware block with a control-in token, control-out tokens, an Ack signal, memory/queue ports, and registers for live data ins and outs. Blocks bb1 to bb4 are connected through registers and muxes, with a FIFO read and a FIFO write for the channel accesses.]
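One way to read the basic-block structure is as a state machine in which exactly one block holds the control token at a time. The sketch below is a software analogy in Python, assuming the block order bb1 (initialize), bb2 (pop), bb3 (accumulate and branch), bb4 (push); the `run_filter` function and the register dictionary are hypothetical names, and no hardware timing is modeled.

```python
from collections import deque


def run_filter(in_fifo, out_fifo):
    """Execute one firing of the summing filter, one basic block at a time."""
    regs = {"sum": 0, "i": 0, "temp": 0}   # live values held in registers
    block = "bb1"                          # block currently holding the control token
    trace = []
    while block is not None:
        trace.append(block)
        if block == "bb1":
            regs["sum"], regs["i"] = 0, 0
            block = "bb2"
        elif block == "bb2":
            regs["temp"] = in_fifo.popleft()   # FIFO read
            block = "bb3"
        elif block == "bb3":
            regs["sum"] += regs["temp"]
            regs["i"] += 1
            # Branch back to bb2 while i < 8, otherwise fall through to bb4.
            block = "bb2" if regs["i"] < 8 else "bb4"
        elif block == "bb4":
            out_fifo.append(regs["sum"])       # FIFO write
            block = None                       # firing complete
    return trace


inp, out = deque(range(1, 9)), deque()
trace = run_filter(inp, out)
print(out[0], len(trace))
```

The returned trace lists the blocks in the order they received the control token, which makes the loop between bb2 and bb3 visible.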
Operation Compilation
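[Figure: each operation becomes a functional unit (FU) with inputs i0 … im, outputs o0 … on, and a predicate. For the block "sum = sum + temp; i = i + 1; Branch bb2 if i < 8", two ADD units and one CMP unit are wired to the registers i, temp, and sum and the constants 1 and 8, with Control in, Control out 3, and Control out 4 signals.]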
Stream Scheduling
• Filters fire eagerly.
– Blocking channel access.
– Allows for potentially smaller channels.
• Results are produced with lower latency.
[Figure: Filter 1 (push 2) feeding Filter 2 (pop 3).]
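Eager firing with blocking accesses can be illustrated with a cycle-by-cycle simulation of the Filter 1 (push 2) and Filter 2 (pop 3) pair. This Python sketch rests on several assumptions that are not in the slides: one channel access per cycle, one extra compute step in Filter 2 per firing, and the consumer acting before the producer within a cycle so that a freed slot can be refilled immediately. The `simulate` function is a hypothetical helper.

```python
def simulate(capacity, cycles):
    """Return (Filter 1 firings, Filter 2 firings) after the given number of cycles."""
    producer_ops = ["push", "push"]                # Filter 1: push 2 per firing
    consumer_ops = ["pop", "pop", "pop", "work"]   # Filter 2: pop 3, then one assumed compute step
    occupancy = 0        # items currently in the channel
    p = c = 0            # operations completed by each filter
    for _ in range(cycles):
        # Consumer: a pop on an empty channel blocks and is retried next cycle.
        op = consumer_ops[c % len(consumer_ops)]
        if op == "work" or occupancy > 0:
            if op == "pop":
                occupancy -= 1
            c += 1
        # Producer: a push on a full channel blocks and is retried next cycle.
        if occupancy < capacity:
            occupancy += 1
            p += 1
    return p // len(producer_ops), c // len(consumer_ops)


print(simulate(capacity=1, cycles=12))
print(simulate(capacity=4, cycles=12))
```

In this model both filters keep making progress even with a one-slot channel; a larger channel only removes some producer stalls, which is the trade-off that channel sizing has to weigh.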
Optimizations
• Streaming optimizations (macro functional)
– Channel allocation, channel access fusion, filter fission and fusion, etc.
– These optimizations need global information about the stream graph
– Typically performed manually using existing tools
• Classic optimizations (micro functional)
– Common subexpression elimination, constant folding, loop unrolling, etc.
– Typically included in existing compilers and tools
Channel Allocation
• Larger channels:
– More SRAM
– More control logic
– Fewer stalls
• Interlocking makes sure that each filter gets the right data or blocks.
• What is the right channel size?
Channel Allocation Algorithm
• Set the size of the channels to infinity.
• Warm up the queues.
• Record the steady-state instruction schedules for each producer/consumer pair.
• Unroll the schedules to have the same number of pushes and pops.
• Find the maximum number of overlapping lifetimes.
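The last step of the algorithm, finding the maximum number of overlapping lifetimes, can be sketched as a sweep over push and pop times. The Python below is an illustration under stated assumptions rather than the Optimus code: schedules are given as per-cycle lists already recorded with unbounded channels and already unrolled, the k-th pushed item lives until the k-th pop, and a slot freed by a pop can be reused by a push in the same cycle. The `channel_size` name and the sample schedules are hypothetical.

```python
def channel_size(producer, consumer):
    """Maximum number of items alive in the channel at the same time."""
    pushes = [t for t, op in enumerate(producer) if op == "push"]
    pops = [t for t, op in enumerate(consumer) if op == "pop"]
    if len(pushes) != len(pops):
        raise ValueError("unroll the schedules until pushes and pops match")
    events = []
    for born, dies in zip(pushes, pops):
        if dies <= born:
            raise ValueError("item popped before it was pushed")
        events.append((born, +1))   # lifetime starts at the push
        events.append((dies, -1))   # lifetime ends at the matching pop
    # At equal times the pop (-1) sorts first, so the freed slot is reusable.
    events.sort()
    live = best = 0
    for _, delta in events:
        live += delta
        best = max(best, live)
    return best


# Made-up schedules for illustration; "----" is any non-channel operation.
producer = ["push", "push", "----", "push", "----", "push"]
consumer = ["----", "----", "pop", "----", "pop", "----", "pop", "pop"]
print(channel_size(producer, consumer))
```

The returned value is the smallest channel depth that never stalls the recorded schedule under these assumptions.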
Channel Allocation Example
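[Figure: producer and consumer schedules shown side by side. The producer column interleaves six pushes with other operations (----) and the consumer column interleaves six pops with other operations. Max overlap = 3. A second diagram shows the stream graph Source, Filter 1, Filter 2, Sink.]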
Channel Allocation
Channel Access Fusion
• Each channel access (push or pop) takes one cycle.
• High communication-to-computation ratio
• Longer critical path latency
• Limits task-level parallelism
Channel Access Fusion Algorithm
• Clustering channel access operations
– Loop unrolling
– Code motion
– Balancing the groups
• Similar to vectorization
• Wide channels
[Figure: read (r) and write (w) accesses grouped into wide accesses. Channel labels: Write Mult. = 1 / Read Mult. = 8; Write Mult. = 8 / Read Mult. = 8; Write Mult. = 4 / Read Mult. = 1.]
Access Fusion Example
Original:
    int sum = 0;
    for (int i = 0; i < 32; i++)
        sum += pop();
    push(sum);
After fusion:
    int sum = 0;
    int t1, t2, t3, t4;
    for (int i = 0; i < 8; i++) {
        (t1, t2, t3, t4) = pop4();
        sum += t1 + t2 + t3 + t4;
    }
    push(sum);
• Some caveats
Original:
    int sum = 0;
    for (int i = 0; i < 32; i++)
        sum += pop();
    pop();
    pop();
    push(sum);
After unrolling:
    int sum = 0;
    for (int i = 0; i < 8; i++) {
        sum += pop();
        sum += pop();
        sum += pop();
        sum += pop();
    }
    pop();
    pop();
    push(sum);
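The effect of fusing four pops into one wide access can be checked with a small software model that counts channel accesses. This Python sketch is an analogy only: the `Channel` class, its access counter, and the one-cycle-per-access cost are assumptions, and `pop4` stands in for the wide read shown in the fused code.

```python
from collections import deque


class Channel:
    """FIFO that counts accesses; each access is assumed to cost one cycle."""

    def __init__(self, items):
        self.q = deque(items)
        self.accesses = 0

    def pop(self):
        self.accesses += 1
        return self.q.popleft()

    def pop4(self):
        # One wide access returns four consecutive items.
        self.accesses += 1
        return tuple(self.q.popleft() for _ in range(4))


def sum_scalar(ch):
    """Unfused version: 32 single-item pops."""
    total = 0
    for _ in range(32):
        total += ch.pop()
    return total


def sum_fused(ch):
    """Fused version: 8 four-item pops."""
    total = 0
    for _ in range(8):
        t1, t2, t3, t4 = ch.pop4()
        total += t1 + t2 + t3 + t4
    return total


a, b = Channel(range(32)), Channel(range(32))
print(sum_scalar(a), a.accesses)
print(sum_fused(b), b.accesses)
```

Both versions compute the same sum; the fused one touches the channel a quarter as often, which is where the saving comes from when each access costs a cycle.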
Access Fusion
Speedup (baseline = PowerPC)
Energy Consumption
Handel-C Comparison
• Compared DES and DCT with hand-optimized Handel-C implementations
• Performance
– 5% faster before optimizations
– 12x faster after optimizations
• Area
– 66% larger before optimizations
– 90% larger after optimizations
Conclusion
• Streaming language to program heterogeneous systems
• Hierarchical synthesis using Spatial IR
• Macro and micro functional optimizations
– Channel Access Fusion: 2.4x speedup
– Channel Allocation: 50% area saving
Thank you!
• Questions?
Static Stream Scheduling
• Resources have to be ready before a filter starts (pushes and pops are non-blocking).
• Double buffering for parallelism.
• Deadlock can be detected at compile time.
• Can be inefficient in the case of data-dependent behavior.
System Setup
[Figure: the tool chain diagram from the Liquid Metal Tool Chain slide, with the Spatial IR abbreviated as SIR.]
Stream Scheduling
• Activate all the filters at time 0.
• Blocking channel access.
• No restriction on the channel size.
• Results in the lowest latency.
StreamIt Example
[Figure: stream graph Source, Round-Robin Splitter(8,8,8,8), Adder 1 to Adder 4, Round-Robin Joiner(1,1,1,1), Printer, with nodes labeled A to J and shown mapped to Init and Work blocks that each have their own controller.]
void->void pipeline Minimal {
    add Source();
    add AddSplitter(8, 4);
    add Printer();
}

int->int splitjoin AddSplitter(int addSize, int pFactor) {
    split roundrobin(pFactor);
    for (int i = 0; i < pFactor; i++)
        add AdderFilter(addSize);
    join roundrobin(1);
}

int->void filter Printer() {
    work pop 1 {
        println(pop());
    }
}
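The data movement in this example can be traced with a plain list-based model of the round-robin splitter and joiner. The Python sketch below makes several assumptions: the split weights follow the Round-Robin Splitter(8,8,8,8) label in the figure (the listing itself passes pFactor to roundrobin), AdderFilter is taken to pop addSize items and push their sum, and the source is a made-up list of integers. All function names are hypothetical.

```python
def roundrobin_split(items, weights):
    """Deal items to branches, weights[k] consecutive items per turn."""
    branches = [[] for _ in weights]
    pos = 0
    while pos < len(items):
        for k, w in enumerate(weights):
            branches[k].extend(items[pos:pos + w])
            pos += w
    return branches


def roundrobin_join(branches, weights):
    """Collect weights[k] items from each branch in turn."""
    out = []
    cursors = [0] * len(branches)
    while any(c < len(b) for c, b in zip(cursors, branches)):
        for k, w in enumerate(weights):
            out.extend(branches[k][cursors[k]:cursors[k] + w])
            cursors[k] += w
    return out


def adder(stream, add_size):
    """Assumed AdderFilter behavior: pop add_size items, push their sum."""
    last_start = len(stream) - add_size
    return [sum(stream[i:i + add_size]) for i in range(0, last_start + 1, add_size)]


def minimal(source, add_size=8, p_factor=4):
    """Source -> splitter -> p_factor adders -> joiner, returning what Printer would see."""
    branches = roundrobin_split(source, [add_size] * p_factor)
    summed = [adder(b, add_size) for b in branches]
    return roundrobin_join(summed, [1] * p_factor)


print(minimal(list(range(64))))
```

With these weights each adder receives whole groups of eight consecutive items, so the joined output is the sequence of group sums in source order.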