1
Using GPCE Principles for Hardware Systems and Accelerators
(bridging the gap to HW design)
Rishiyur S. Nikhil
CTO, Bluespec, Inc. (www.bluespec.com)
GPCE'09, October 4, 2009
2
Generative and component approaches are revolutionizing software development ... GPCE provides a venue for researchers and practitioners interested in foundational techniques for enhancing the productivity, quality, and time-to-market in software development ... In addition to exploring cutting-edge techniques for developing generative and component-based software, our goal is to foster further cross-fertilization between the software engineering research community and the programming languages community.
This seems to be a conference about improving software development ...
... so why am I here talking about hardware design?
Two reasons ....
3
... Generative Programming (developing programs that synthesize other programs), Component Engineering (raising the level of modularization and analysis in application design), and Domain-Specific Languages (elevating program specifications to compact domain-specific notations that are easier to write, maintain, and analyze) are key technologies for automating program development.... enhancing the productivity, quality, and time-to-market in software development that stems from deploying standard components and automating program generation. ...
Reason (1): you may be interested in seeing how the principles highlighted below ...
... are used with equal capability and effectiveness in HW design
4
Reason (2): I would like to tempt you to upgrade from being only a software engineer (v 1.0) ...
... to “The Compleat Computation-ware Engineere (v 2.0)” ...
... where you think of hardware computation (HW) as an important (and easy to use) component in your toolbox, alongside software (SW), when you solve your next problem.
5
The traditional HW creation “flow” (early 1990s to present)
Source code (Verilog/VHDL)
  → RTL simulation (run/debug/edit: “instant”)
  → Traditional ASIC synthesis → gate-level Verilog/VHDL
      → Place & Route, ..., tape out, ..., manufacture ...  (10s of months; $10M-50M)
  → Traditional FPGA synthesis* → gate-level Verilog/VHDL
      → Place & Route, ..., FPGA download  (minutes/hours; $100-10K)
* “synthesis” is just jargon for a certain kind of compilation
6
New flows (not yet mainstream)
Source code (High Level Language)
  → “High Level” synthesis → Source code (Verilog/VHDL)
  → (then the traditional RTL simulation / ASIC / FPGA flows, as before)
By raising the level of abstraction:
• improve design time by 10x (or more); expressive power; simulation speed
• with no loss of silicon quality (area, speed, power)
• in fact, sometimes with better silicon quality (because improved flexibility can result in better architectures)
Simulation by compiled execution
7
Some candidate high level languages
Source code (BSV)   /   Source code (C/C++/SystemC)
  → “High Level” synthesis → (into the traditional Verilog/VHDL flows shown earlier)
Classic limitations of automatic parallelization from sequential codes,cf. “dusty deck Fortran” ca. 1970s
Bluespec’s fresh approach, inspired by
• Term Rewriting Systems (parallel atomic transactions) to describe complex concurrent behavior. Related to: UNITY, TLA+, Event-B, ...
• Haskell (types, overloading, parameterization, generativity)
8
HW languages have always been “generative”
module mkM1 (...);
  mkM3 m3b (...);   // instantiates mkM3
  mkM2 m2  (...);   // instantiates mkM2
endmodule

module mkM2 (...);
  mkM3 m3a (...);   // instantiates mkM3
endmodule

module mkM3 (...);
  ...
endmodule
Example (Verilog). Two visualizations of the resulting module instance hierarchy:
[diagram: m1 (instance of mkM1) contains m2 and m3b; m2 contains m3a]
9
HW languages have long been “generative” (contd.)
Source code (BSV / C/C++/SystemC) → “High Level” synthesis → ...

Static Elaboration (jargon for “generation”):
• execute the structural aspects of the program to produce the module hierarchy (structure)

Execution within the fixed structure (behavior):
• essentially just the execution of a giant FSM
Verilog/VHDL have poor generative capabilities (a weak afterthought!): not orthogonal, not reflective, not Turing-complete.
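The two-layer picture can be mimicked in ordinary software. Here is a minimal Python sketch (my own illustration, not from the talk): a recursive generator runs over a static parameter at "elaboration time" and unfolds into a fixed structure, which is then "executed" within that structure.

```python
# Software analogue of static elaboration vs. execution (illustrative only).
# mk_adder_tree runs at "elaboration time": recursion on the static
# parameter n unfolds into a fixed tree of adders.  The returned closure
# is the "behavior": execution within that fixed structure.
def mk_adder_tree(n):
    if n == 1:
        return lambda xs: xs[0]            # base case: just a wire, no adder
    left = mk_adder_tree(n // 2)           # elaborate left subtree
    right = mk_adder_tree(n - n // 2)      # elaborate right subtree
    # the '+' below is one "adder" instantiated per internal tree node
    return lambda xs: left(xs[:n // 2]) + right(xs[n // 2:])

sum8 = mk_adder_tree(8)   # elaboration: builds the 8-input tree once
```

Note that '+' appears once in the text but the elaborated structure contains 7 adders; the same recursion pattern reappears in the mkXBar example later in the talk.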
10
I’m now going to show you some code examples for some non-trivial HW designs. I hope, at the end of this, you’ll say:
“Hey! I could do that!”
even if you’ve never designed HW before!
11
Verilog/VHDL module interfaces: wire oriented
[diagram: two modules joined by data / RDY / ENA wires]

Example: transferring a datum from one module to another
• on each module: declare input and output wires
• declaration of wires; connections to module interface
• wires; logic for RDY/ENA
• protocol (proper behavior) specified separately, using waveforms and English text
Very verbose, very error-prone
12
interface Get #(type t);   // polymorphic
  method ActionValue #(t) get();
endinterface

interface Put #(type t);
  method Action put (t x);
endinterface

module mkConnection #(Get#(t) g, Put#(t) p) (Empty);
  rule connect;
    let x <- g.get();
    p.put (x);
  endrule
endmodule
BSV module interfaces: “transactional” (object-oriented)
These interface definitions are sufficiently useful and reusable that they’re in standard BSV libraries
Get#(Packet) g1 <- mkM1 (...);
Put#(Packet) p1 <- mkM2 (...);
Empty e <- mkConnection (g1, p1);
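As a software analogue (Python; illustrative only, not BSV semantics): a FIFO offers both a Put face and a Get face, and mkConnection amounts to a rule that repeatedly moves one datum from a Get to a Put.

```python
from collections import deque

class FIFO:
    """Software stand-in for a BSV FIFO: offers both faces."""
    def __init__(self):
        self._q = deque()
    def put(self, x):      # the Put#(t) face: method Action put(t x)
        self._q.append(x)
    def get(self):         # the Get#(t) face: method ActionValue#(t) get()
        return self._q.popleft()

def mk_connection(getter, putter, max_firings):
    # The 'connect' rule: each firing atomically moves one datum.
    # (In hardware the rule fires whenever both sides are ready;
    # here we simply fire it a bounded number of times.)
    for _ in range(max_firings):
        putter.put(getter.get())

g1, p1 = FIFO(), FIFO()
for x in [1, 2, 3]:
    g1.put(x)
mk_connection(g1, p1, 3)   # p1 now holds 1, 2, 3 in order
```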
13
interface Client #(req_t, resp_t);
  interface Get#(req_t)  request;
  interface Put#(resp_t) response;
endinterface

interface Server #(req_t, resp_t);
  interface Put#(req_t)  request;
  interface Get#(resp_t) response;
endinterface

module mkConnection #(Client#(t1,t2) c, Server#(t1,t2) s) (Empty);
  mkConnection (c.request,  s.request);
  mkConnection (s.response, c.response);
endmodule
[diagram: a client and a server connected back-to-back; each side pairs a Get (data/RDY/ENA) with a Put (data/RDY/ENA), carrying req_t one way and resp_t the other]
Note the overloaded mkConnection (BSV uses Haskell’s typeclass mechanism for user-extensible, recursive, statically typed overloading)
Interfaces can be composed
Get/Put pairs are very common, and duals of each other, so the BSV library defines Client/Server interfaces for this purpose
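A Python sketch of the Client/Server idea (illustrative only; the class and method names are mine, not BSV): a Server bundles a Put for requests with a Get for responses, and a round trip exercises both faces.

```python
from collections import deque

# Illustrative software model: a Server pairs a Put (requests) with a
# Get (responses); a Client is the dual, so connecting them is just two
# Get/Put connections running in opposite directions.
class Server:
    def __init__(self, compute):
        self._reqs, self._resps = deque(), deque()
        self._compute = compute
    def put_request(self, x):        # Put#(req_t) request
        self._reqs.append(x)
    def get_response(self):          # Get#(resp_t) response
        # internal "rule": drain pending requests into responses
        while self._reqs:
            self._resps.append(self._compute(self._reqs.popleft()))
        return self._resps.popleft()

# A client-side transaction against the server's dual interface:
doubler = Server(lambda x: 2 * x)
doubler.put_request(21)
```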
14
Example: a Butterfly cross-bar switch
Basic building blocks:
Recursive structure: 1x1 2x2 4x4 … NxN
buffer (FIFO)
2x1 merge
routing logic
interface XBar #(type t);
  interface List#(Put#(t)) input_ports;
  interface List#(Get#(t)) output_ports;
endinterface
The entire interface can be defined in a few lines (polymorphic in the data type of packets flowing through the switch):
15
Butterfly switch: module implementation
module mkXBar #(Integer n,
                function UInt #(32) destinationOf (t x),
                Module #(Merge2x1 #(t)) mkMerge2x1)
              (XBar #(t));
  ...
endmodule: mkXBar
(annotations: n = size of switch (# of ports); destinationOf = used by routing logic; mkMerge2x1 = 2x1 merge module; XBar #(t) = interface)
Parameters are static arguments, and so can be of any type, including (unbounded) Integers, functions, modules, etc.
Interfaces represent dynamic communications and can only carry hardware-representable types.
16
Butterfly switch: module implementation
module mkXBar #(...) (XBar #(t));
  List #(Put#(t)) iports;
  List #(Get#(t)) oports;

  if (n == 1) begin      // ---- BASE CASE (n = 1)
    FIFO #(t) f <- mkFIFO;
    iports = cons (toPut (f), nil);
    oports = cons (toGet (f), nil);
  end
  else begin             // ---- RECURSIVE CASE (n > 1)
    ...
  end

  interface input_ports  = iports;
  interface output_ports = oports;
endmodule: mkXBar
buffer (FIFO)
17
Butterfly switch: module implementation
module mkXBar #(...) (XBar #(t));
  if (n == 1) begin      // ---- BASE CASE (n = 1)
    ...
  end
  else begin             // ---- RECURSIVE CASE (n > 1)
    XBar#(t) upper <- mkXBar (n/2, destinationOf, mkMerge2x1);
    XBar#(t) lower <- mkXBar (n/2, destinationOf, mkMerge2x1);
    List#(Merge2x1#(t)) merges <- replicateM (n, mkMerge2x1);

    iports = append (upper.input_ports, lower.input_ports);

    function Get#(t) oport_of (Merge2x1#(t) m) = m.oport;
    oports = map (oport_of, merges);

    ... routing behavior ...
  end
endmodule: mkXBar
18
Butterfly switch: module implementation
module mkXBar #(...) (XBar #(t));
  if (n == 1) begin      // ---- BASE CASE (n = 1)
    ...
  end
  else begin             // ---- RECURSIVE CASE (n > 1)
    ...
    let ps = append (upper.output_ports, lower.output_ports);
    for (Integer j = 0; j < n; j = j + 1)
      rule route;
        let x <- ps[j].get ();
        case (flip (destinationOf (x), j, n)) matches
          tagged Invalid         : merges [j]        .iport0.put (x);
          tagged Valid .jFlipped : merges [jFlipped] .iport1.put (x);
        endcase
      endrule
  end
endmodule: mkXBar
19
Butterfly switch: atomicity of rules
for (Integer j = 0; j < n; j = j + 1)
  rule route;
    let x <- ps[j].get ();
    case (flip (destinationOf (x), j, n)) matches
      tagged Invalid         : merges [j]        .iport0.put (x);
      tagged Valid .jFlipped : merges [jFlipped] .iport1.put (x);
    endcase
  endrule
May not be a packet to get
The hardware control logic to manage these complex, dynamic (data-dependent), reactive control conditions is the most tedious and error-prone aspect of designing with RTL (Verilog, VHDL), and even with SystemC.
Creation of this logic is automated (synthesized), based on the atomicity semantics of rules.
May not be able to put a packet:• flow control• contention
20
Butterfly switch: summary observations
• The core mkXBar module is expressed in ~40-50 lines of code
• Parameterized by packet type, size, routing function, 2x1 merge module
• It’s fully synthesizable (550 MHz using Magma synthesis, TSMC 0.18 micron libraries)
• Static elaboration (“generativity”) has the full power of Haskell evaluation: higher-order functions, lists/vectors, recursion, ...
• There is no syntactic distinction between the “static elaboration” part and the “dynamic” part of the source code: an expression “a+b” may be used both for static elaboration and as a dynamic computation (i.e., an adder in the hardware)
• 2 layers: static elaboration produces a module hierarchy with rules; the rules are then synthesized, according to atomicity semantics, into the correct data paths and control logic
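The recursive structure and the per-port route rules can be modeled in ordinary software. Here is a Python sketch (my illustration; the slides do not define the flip function, so the routing decision below, which resolves the low-order destination bits in the sub-switches and the top bit at the final merge stage, is an assumption about one plausible butterfly wiring):

```python
# Executable sketch of mkXBar's recursive structure (not BSV).
# Assumption: sub-switches route on low-order destination bits; the
# final merge stage resolves the top bit.  Merges are modeled as lists,
# which also absorbs contention (two packets arriving at one merge).
def xbar(n, in_ports, dest_of):
    """in_ports: n lists of packets (one per input port).
       Returns n lists of packets (one per output port)."""
    if n == 1:                                   # BASE CASE: one FIFO
        return [list(in_ports[0])]
    half = n // 2                                # RECURSIVE CASE
    upper = xbar(half, in_ports[:half], dest_of)
    lower = xbar(half, in_ports[half:], dest_of)
    ps = upper + lower                           # append(upper.outputs, lower.outputs)
    outs = [[] for _ in range(n)]                # the n 2x1 merges
    for j in range(n):                           # the per-port 'route' rules
        for x in ps[j]:
            d = dest_of(x) % n
            if (d < half) == (j < half):
                outs[j].append(x)                # "Invalid": stay (iport0)
            else:
                outs[(j + half) % n].append(x)   # "Valid jFlipped": cross (iport1)
    return outs
```

With this wiring, a packet entering any input port of a power-of-two switch ends up in the output list indexed by its destination.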
21
Example: IFFT in an 802.11a wireless transmitter

[block diagram: headers and data (24 uncoded bits) → Controller → Scrambler → Encoder → Interleaver → Mapper → IFFT → Cyclic Extend]

IFFT transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers; it accounts for 85% of the area.
22
The IFFT computation (specification)

[diagram: 64 inputs (in0 .. in63) pass through three stages, each of 16 Bfly4 blocks, with Permute_1, Permute_2, Permute_3 between the stages, to 64 outputs (out0 .. out63)]

[diagram: a Bfly4 block: four multipliers (*) feeding a +/- network with a single *j factor, producing t0 .. t3]

All numbers are complex and represented as two sixteen-bit quantities. Fixed-point arithmetic is used to reduce area, power, ...
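The Bfly4 network above is a radix-4 butterfly. Here is a Python sketch of one common formulation (my illustration, not taken from the slides: it uses built-in complex arithmetic instead of the hardware's 16-bit fixed point, and treats the four input multipliers as already applied):

```python
# Software model of a radix-4 inverse-DFT butterfly (Python complex
# arithmetic for clarity; the hardware uses two 16-bit fixed-point
# parts per number).  Shown is the +/- network with its single
# non-trivial *j factor; the four input twiddle multiplications ('*'
# in the figure) are assumed already applied.
def bfly4(x0, x1, x2, x3):
    t0 = x0 + x2
    t1 = x0 - x2
    t2 = x1 + x3
    t3 = (x1 - x3) * 1j      # the '*j' factor from the figure
    return (t0 + t2, t1 + t3, t0 - t2, t1 - t3)
```

This equals the 4-point inverse DFT y[n] = Σ_k x[k]·j^(nk): ten multiplier-free adds/subtracts plus one trivial *j, which is why the Bfly4 is such a cheap building block.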
23
IFFT: the HW implementation space(varying in area, power, clock speed, latency, throughput)
• Direct combinational circuit
• Varying degrees of pipelining
• Iterate 1 stage thrice
• In any stage, use fewer than 16 Bfly4s (serialization / unserialization)
24
Higher-order functions for building linear pipelines (“linear combinator”)

module mkLinearPipe #(Integer n_stages,
                      Bool with_registers,
                      function Module #(Pipe #(a,a)) mkStage (Integer stage_j))
                    (Pipe #(a,a));
  ...
endmodule

[diagram: stages mkStage(0) .. mkStage(n_stages-1), each a Pipe with Put and Get sides, chained into a line]
25
Higher-order functions for building looped pipelines (“loop combinator”)

module mkLoopPipelined #(Integer n,
                         function Module #(PipeF #(Tuple2 #(a, UInt #(logn)), a)) mkLoopBody ())
                       (PipeF #(a,a));
  ...
endmodule

[diagram: input x enters the loop body as (x, j); it circulates n times before exiting]
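Both combinators have direct analogues in ordinary functional programming. A Python sketch (illustrative only; stages are modeled as plain functions rather than hardware modules, and pipeline registers are elided):

```python
# Software analogues of the two pipeline combinators.  In hardware each
# stage is a module, and 'with_registers' would insert pipeline
# registers between stages; here stages are plain functions.
def mk_linear_pipe(n_stages, mk_stage):
    stages = [mk_stage(j) for j in range(n_stages)]  # elaborate each stage_j
    def pipe(x):
        for s in stages:                             # data flows 0 .. n_stages-1
            x = s(x)
        return x
    return pipe

def mk_loop_pipe(n, loop_body):
    # loop_body sees (x, j), echoing the Tuple2#(a, UInt#(logn)) input
    def pipe(x):
        for j in range(n):
            x = loop_body((x, j))
        return x
    return pipe

# e.g. three stages that each add their stage index: 10 -> 10+0+1+2 = 13
add_idx = mk_linear_pipe(3, lambda j: (lambda x, j=j: x + j))
```

With stage = Bfly4 column plus permute, mk_linear_pipe(3, ...) gives the fully pipelined IFFT, and mk_loop_pipe(3, ...) the "iterate 1 stage thrice" variant.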
26
Generating all versions of IFFT
Which architecture is “best” depends on the requirements• Desired latency, throughput• Area, power, clock speed• Target silicon technology (FPGA, ASIC 90nm, ASIC 65nm, ...)
“PAClib” (Pipeline Architecture Constructor Library) is a library of such higher-order pipeline combinators. Using PAClib, IFFT can be succinctly expressed in a single source code which, depending on the parameters supplied, will elaborate (unfold) into any one of the possible architectures in the space of architectures illustrated.
PAClib enables a “pipeline DSL”
27
Another important reason for generativity—enables rapid experimentation to determine optimal architecture
Architectural effects can be quite unpredictable. E.g.:
• Hypothesis: a linear pipe will take more silicon area than a looped pipe
• But the looped pipe has other silicon costs:
  • needs multiplexers, control logic → area cost
  • needs higher clock speed for the same throughput → area cost, power cost
  • a kicker: disables some constant propagations → area cost, power cost
(for ASICs, silicon area directly affects price of chip)
Bottom line:
• need to be able to experiment with different architectures
• generativity allows scripting the exploration of the space
28
I hope that by now you’re saying:
“Hey! Writing HW programs doesn’t look too hard!” (It has all the creature comforts of a modern high-level programming language.)
But, so what?
• Why would I want to compute something directly in HW?
• Even if I want to, aren’t the costs and logistics of actually putting something in HW just too high a barrier?
29
Why implement things in HW?
Reason (1):
[diagram: a fixed machine (e.g., x86, GPGPU, Cell) runs by interpreting instructions (a program) for application X; an X-machine (fine-grain parallel) runs application X directly]

Direct implementation in HW typically
• removes a layer of interpretation (interpretation generally costs an order of magnitude in speed)
• can exploit more parallelism

Caveat: lots of devils in the details
• interpretation at GHz may still be faster than direct execution at MHz
• interpretation with monster memory bandwidth may still be faster than direct execution with anemic memory bandwidth
30
Why implement things in HW?
Reason (2): Power consumption
• Interpretation on fixed computing architectures costs power
[diagram: a fixed machine (e.g., x86, GPGPU, Cell) interprets instructions (a program) for application X; an X-machine executes X directly]

Interpretation: pay the energy cost of X-execution; also pay for fetch, decode, register management, cache management, extra data movement, branch misprediction, ...
Portable devices: battery life Server farms/ clouds: cost of power supply, air conditioning
31
Opportunity with today’s FPGA technology(Field Programmable Gate Arrays)
FPGA capacity: millions of gates
FPGA speeds: 100s of MHz
FPGA board costs:
• as low as $100s
• $1K-$10K typical
• $10K-$100K for multi-FPGA boards

Example of what is possible: a single FPGA can easily run H.264 decoding at VGA resolution (640x480) and, with a good design, at HDTV (1920x1080) resolution.

... new and exciting:
• FPGA-in-processor-socket: AMD Hypertransport bus, Intel Front-Side Bus
• FPGA-on-processor-chip: coming soon
[diagram: your application software on a host communicates (console / co-emulation link) with your computation in an FPGA subsystem on an emulation board; the FPGA device contains an FA626 processor, L2 cache, interrupt controller, clock/reset, ICE, an AXI interconnect fabric with AXI-AHB bridge, DDR2 gasket/controller with DDR2 memory, SRAM controller with SRAM boot memory, RS232 UART, Ethernet GMAC and a security engine (each with traffic generator and transactor), and a debugger; console sessions shown: Linux, FA626 ICE, Bluespec Emulation]
FPGA host communication links:
• USB
• 1Gb/10Gb Ethernet
• PCI Express
32
Making FPGA acceleration easy and routine

[diagram: a SW app and a HW app (BSV/RTL), each atop a stack of services / SCE-MI / link layer, connected over sockets / PCIe / USB / Ethernet / FSB / Hypertransport; the HW side is HW-agnostic: FPGA (or Bluesim/Verilog simulation)]

A “Communications Protocol Stack”. Analogy: RPC / socket / TCP/IP / Ethernet.
Atop today’s FPGA technology, we provide the communication infrastructure:
• make it easy for SW to invoke a HW service, or vice versa; concurrent, pipelined, ...
• model: concurrent RPCs (Remote Procedure Calls)
• auto-generate SW and HW (BSV) stubs from service specs (like using an IDL to specify distributed client/server communication)
33
Putting it all together:
[diagram: your application split into a SW part (e.g., C++) and a HW part (BSV), coupled through Get/Put/Client/Server interfaces and mkConnection connections; generated services / SCE-MI / link-layer stubs sit on both sides; the SW side is compiled with gcc and linked/loaded, while the HW side goes through BSV synthesis and FPGA synthesis onto the FPGA]
BSV applies GPCE concepts to HW design—generation, parameterization, changeability; reusability; easy exploration of architecture space, ...
FPGAs are compelling due to speed, lower power, low cost, fast communication with host
34
[diagram: a BSV UltraSPARC model on a Virtex-5 FPGA, connected over Ethernet to Virtutech Simics]
Example: CMU ProtoFlex (http://www.ece.cmu.edu/~protoflex)
Virtutech Simics: a commercial SW simulator for whole systems (OS/devices/apps) (a “Virtual Platform” for early SW development, before the ASIC is available).
Problem: very clever tricks for fast simulation, but steady slowdown
– for each added thread and core
– for each added bit of instrumentation

CMU ProtoFlex:
– Fully operational model of a 16-cpu UltraSPARC III SunFire 3800 Server, running unmodified Solaris 8, on an FPGA at 90 MHz
– Hybrid simulation: continues to use Simics for modeling the rest of the system (I/O devices, ...)
– Benchmarks: TPC-C OLTP on Oracle 10g Enterprise Database Server; also SPECINT (bzip2, crafty, gcc, gzip, parser, vortex)
– Performance: 10-60 MIPS; 39x faster than Virtutech Simics alone on the same system/benchmark
– Written in BSV by 1 graduate student (Eric Chung) in 1 year!
35
Example: Univ. of Glasgow document retrieval experiment
“FPGA-Accelerated Information Retrieval: High-Efficiency Document Filtering”,W. Vanderbauwhede, L. Azzopardi , and M. Moadeli,in Proc. 19th IEEE Intl. Conf. on Field Programmable Logic and Applications (FPL'09), Prague, Czech Republic, Aug 31-Sep 2, 2009
[diagram: a document stream flows through the FPGA (match algorithm), which consults SRAM (search terms) and emits a score stream]

E.g.:
• find spam in emails
• find similar patents
• find relevant news stories
Experiments on 3 collections, from ~1M to 1.5M documents each. Ran the same algorithm on:
• a 1.6 GHz Itanium-2
• a Virtex-4 FPGA

Power consumption: 130 Watts (Itanium), 1.25 Watts (FPGA)

Speedup: ~10x-20x
• Itanium slows down as profile (search database) size increases
• FPGA does not (parallelism)
36
Example: MEMOCODE’08 Design Contest
Goal: speed up a software reference application, running on the PowerPC on a Xilinx XUP reference board, using SW/HW codesign

The application: decrypt, sort, and re-encrypt a large db of records in DRAM

Time allotted: 4 weeks
http://rijndael.ece.vt.edu/memocontest08/
37
Example: MEMOCODE’08 Design Contest Results
[results chart: the top-performing entry was written in BSV]
Reference: http://rijndael.ece.vt.edu/memocontest08/everybodywins/
Records had to be repeatedly streamed through a “merge-sort” block.
Advantage to those who could rapidly generate a variety of merge-sort architectures and find the best one to “fit” into the FPGA
38
With languages that use GPCE principles, HW design is now ready for incorporation into your programming toolbox!
In summary
Thank you for your kind attention!
39
Acknowledgements
James Hoe (MIT/CMU) and Arvind (MIT) for original technology for high-level synthesis from rules to RTL used in BSV today, 1997-2000
Lennart Augustsson (Chalmers/Sandburst) for Haskell-based generative technology used in BSV today, 2000-2003
My colleagues in the engineering teams at Sandburst and Bluespec for continuous and substantial improvements, 2000-2009
Prof. Arvind’s group at MIT for their research and ideas, 2000-2009