PACT ’04, Antibes, France
Polymorphic Processors: How to Expose Arbitrary Hardware Functionality to Programmers
Stamatis Vassiliadis
Computer Engineering, EEMCS, TU Delft
http://ce.et.tudelft.nl
Member of HiPEAC
PZE and Amdahl’s law
Potential Zero Execution (PZE) was introduced in 1987-88 and published in the IBM Journal of R&D in 1994.
Example: timewise we execute two instructions (50% code elimination).
Max speedup = 2.0. Excluding start-up, 5 cycles are reduced to 3: speedup 1.6, 83% efficiency.
The limitation (Amdahl’s curve): eliminating a fraction 0.5 of the program gives at most 2X; a fraction 0.9 gives at most 10X.
Techniques:
• ILP
• pipeline
• technology
Why polymorphic? We can ride Amdahl’s curve easier and faster.
[Figure: program with 50% and 20% portions, … ASIC; Amdahl’s curve with 2X at 0.5 and 10X at 0.9]
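The two points on the curve (2X at 0.5, 10X at 0.9) and the 83% efficiency figure can be checked with a few lines of C. This is a minimal sketch, assuming the usual form of Amdahl's bound where a fraction a of the execution time is eliminated entirely; the function names are mine, not from the slides.

```c
#include <assert.h>

/* Upper bound on speedup when a fraction `a` of the execution time
 * is eliminated entirely (the remaining 1 - a is untouched). */
double amdahl_max_speedup(double a)
{
    assert(a >= 0.0 && a < 1.0);
    return 1.0 / (1.0 - a);
}

/* Efficiency: achieved speedup as a fraction of the bound. */
double speedup_efficiency(double achieved, double max_speedup)
{
    assert(max_speedup > 0.0);
    return achieved / max_speedup;
}
```

With a = 0.5 the bound is 2; going from 5 cycles to 3 is a speedup of 5/3, which is about 83% of that bound.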
Motivating example
Original image → filtered image → predictive coding compression (ZIP) → bitstream
Transmission: bitstream
Bitstream → UNZIP → filtered image → decoding → original image
Goal: get image with more 0’s
Is it possible?: spatial redundancy (adjacent pixels often have same values => many differences between them =0 )
Paeth coding
Research questions:
• What does Paeth mean in terms of computations?
• Can I put it in hardware?
• What is my gain?
Motivating example
Neighbouring pixels (d is the pixel being predicted):
c b
a d
Paeth(d) = the one of a, b, c that is closest to the initial prediction p = a+b-c
Original:
0 0 0 0
0 3 3 3
0 3 4 4
0 3 4 5
Paeth:
0 0 0 0
0 0 3 3
0 3 3 4
0 3 4 4
Example: c=3, b=3, a=4, d=4; p = 4+3-3 = 4; Paeth(d) = a = 4
Filtered = Original - Paeth (for the example: 4 - 4 = 0):
0 0 0 0
0 3 0 0
0 0 1 0
0 0 0 0
Paeth in hardware:
• p = a+b-c
• pa = |p-a|, pb = |p-b|, pc = |p-c|
• the comparisons pa<=pb?, pa<=pc?, pb<=pc? drive the multiplexers that select a, b or c as Paeth
• area: 6 8-bit adders
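The predictor above can be written as a small scalar C routine. This is a sketch under stated assumptions: a is the left neighbour, b the one above, c the upper-left one (as in the pixel layout of the example), ties are broken in the order a, b, c as in PNG, and the first pixel of a row is passed through unfiltered because it has no left neighbour. The function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>   /* abs, size_t */

/* Returns whichever of a, b, c is closest to p = a + b - c. */
unsigned char paeth_predict(unsigned char a, unsigned char b, unsigned char c)
{
    int p  = (int)a + (int)b - (int)c;
    int pa = abs(p - (int)a);
    int pb = abs(p - (int)b);
    int pc = abs(p - (int)c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

/* Filters one row: out[i] = curr[i] - Paeth(left, above, upper-left). */
void paeth_filter_row(const unsigned char *prev, const unsigned char *curr,
                      unsigned char *out, size_t length)
{
    assert(length > 0);
    out[0] = curr[0];   /* assumption: no left neighbour, keep as is */
    for (size_t i = 1; i < length; i++) {
        unsigned char pred = paeth_predict(curr[i - 1], prev[i], prev[i - 1]);
        out[i] = (unsigned char)(curr[i] - pred);
    }
}
```

Applied to the third row of the example (previous row 0 3 3 3, current row 0 3 4 4) it yields 0 0 1 0, the same as the filtered image on the slide.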
Example: Paeth Prediction (PNG)
What the Altivec code does: initialize, then loop over load, unpack, process, pack, store
• Altivec iteration: 95 instructions per 16 pixels
• CSI code: 1 instruction for all iterations (+20 setup instructions)
• CSI instruction design ( EUROMICRO 99 ): latency 5 cycles; throughput 16 pixels/1 cycle; area 24 32-bit adders
• Cycle = 1 ALU operation
C code:
    bptr = prev_row+1; dptr = curr_row+1; predptr = predict_row+1;
    for(i=1; i < length; i++){
        c = *(bptr-1); b = *bptr; ... .... ...
        if(...) *predptr = a;
        else if (..)
        else *predptr = c;
        ......
        bptr++;
    }
Altivec code:
    li r5, 0                        # … 6 instructions in total
loop:
    lvx vr03, r1                    # load c's
    lvx vr04, r2                    # load a's
    vsldoi vr05, vr01, vr03, 1      # load b's
    vmrghb vr07, vr03, vr00         # unpack
    vmrglb vr08, vr03, vr00         # unpack … 6 instructions in total
    # Compute
    vadduhs vr15, vr09, vr11        # a+b
    vadduhs vr16, vr10, vr12        #
    vsubshs vr15, vr15, vr07        #
    vsubshs vr16, vr16, vr08        # … 76 instructions in total
    # Pack
    vpkshus vr28, vr28, 29          # pack
    # Store
    stvx vr28, r3, 0                # store
    # Loop control
    addi r1, r1, 16
    ……..
    bneq r7, r0, loop               # loop
CSI code:
    li r5, 1
    csi_mt_scr r1, SCR1, 0
    csi_mt_scr r5, SCR1, 1          # … 20 instructions in total
    csi_paeth predptr, bptr, dptr   # ONE INSTRUCTION for all loop iterations
Results: Instruction count and execution time reduction
Bench: Paeth kernel, 132-element vectors (132 pixels in a row)
Dynamic instruction counts: normalised to non-CSI counts
Execution time: on a 4-issue CPU with a 32-byte-wide CSI unit, normalised to non-CSI execution
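A back-of-the-envelope count for the kernel alone can be derived from the per-iteration figures quoted earlier (95 Altivec instructions per 16 pixels after 6 setup instructions, versus 20 setup instructions plus 1 CSI instruction). This is only a rough sketch, not the measured data behind the charts: it assumes the row is processed in whole 16-pixel iterations and ignores everything outside the kernel. The function names are hypothetical.

```c
#include <assert.h>

/* Estimated dynamic instruction count of the Altivec Paeth loop. */
unsigned long altivec_instructions(unsigned long pixels)
{
    const unsigned long setup = 6;                 /* before the loop */
    const unsigned long per_iteration = 95;        /* per 16 pixels */
    const unsigned long pixels_per_iteration = 16;
    assert(pixels > 0);
    /* round up: a partial vector still costs a full iteration */
    unsigned long iterations =
        (pixels + pixels_per_iteration - 1) / pixels_per_iteration;
    return setup + iterations * per_iteration;
}

/* Estimated count with CSI: setup plus one instruction for all iterations. */
unsigned long csi_instructions(unsigned long pixels)
{
    assert(pixels > 0);
    (void)pixels;   /* the count does not depend on the row length */
    return 20 + 1;
}
```

For a 132-pixel row this estimate gives 9 iterations, so 861 instructions for Altivec against 21 for CSI.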
Research Questions
Motivating example: obvious observations
• NO way I can do this on fixed hardware
• I can do this if the hardware changes functionality at my wishes
EASIER SAID THAN DONE! I have to answer the following:
• New kind of tools: How can I identify the code for hardware implementation?
• Microarchitecture: How can I implement “arbitrary” code?
• Processor architecture (behavior + logical structure): Is the hardwired code substituted by new instructions?
• Programming paradigm (HW and SW descriptions coexisting in a program): How can I substitute this code with SW/HW descriptions, say at the source level?
• Compilation: How can I automatically generate the “transformed” program?
Outline
What to do:
– Identify the code
– Show its hardware feasibility in FPGA
– Map it into reconfigurable hardware (RH)
– Eliminate the identified code
– Add code to have “equivalent” behavior
– Compile new program
– Execute
MOLEN:
• Tools, Microarchitecture, Architecture, Programming Paradigm, Compiler
• Introduce reconfigurable microcode (ρμ-code); specific code in hardware is left to the programmer/hardware designer
• One time 8 new instructions for any ISA
• Sequential consistency, split-join parallelism, function-like code
• Co-processor paradigm (e.g. vector); new register file for parameter passing
[Figure: Program P and Program P’; labels: GPP, MEM, RH, FPGA, DATA, RESULTS, A]
Tool Chain
New program where hardware/software descriptions co-exist
Code:
    int fact(int n)
    {
        if (n < 1) return 1;
        else return n * fact(n - 1);
    }
[Figure: tool chain. Labels: Code; Critical? (YES/NO); f(.); Human Directives; C2C; HDL (hand coded, LIB, AUTO); Retargeted Compiler; Binary Code with call f(.); Architecture: IBM PowerPC and XILINX VIRTEX-II PRO FPGA]
The MOLEN ISA ( SAMOS ’03 )
• Divide RC into two logical phases: SET <address> and EXECUTE <address>
• Reconfigurable design: two instructions
• Execute on reconfigurable: one instruction
• Speeding up reconfiguration and execution: two instructions for prefetching
• Parameter passing: two new instructions + register file; an arbitrary number of parameters can be passed
• Parallel execution: split via a Molen instruction and join via a GPP instruction or one special instruction
• Total: 8 new instructions
• Implementation and ISA independent, “function” independent; no new op-codes
• Modularity: by implementing at least the minimal MOLEN instruction set and by reconfiguring to it
Instruction Set Partitioning
8 instructions grouped in 6 instruction categories:
• SET < address >: partial SET (P-SET), complete SET (C-SET)
• EXECUTE < address >
• SET PREFETCH < address >
• EXECUTE PREFETCH < address >
• BREAK
• MOVTX and MOVFX
Minimal: SET < address >, EXECUTE < address >, MOVTX and MOVFX
Preferred: adds SET PREFETCH < address > and EXECUTE PREFETCH < address >
Complete: adds BREAK
Sequence Control Example
C code:
    #pragma call_fpga op1
    int f( int x, int y){ … }
    #pragma call_fpga op2
    int g(int x){ … }
    int h(int a, int b, int c){
        int m,n, ...;
        m=f(a, b);
        n=g(c);
        ……
    }
Generated code:
h:
    mov a -> r1
    movtx r1 -> XR2
    mov b -> r2
    movtx r2 -> XR3
    mov c -> r3
    movtx r3 -> XR4
    set address_set_op1
    set address_set_op2
    ldc 2 -> r4
    movtx r4 -> XR0
    ldc 4 -> r5
    movtx r5 -> XR1
    execute address_ex_op1
    execute address_ex_op2
    movfx XR2 -> r6
    mov r6 -> m
    movfx XR4 -> r7
    mov r7 -> n
No data dependency between f and g: op1 and op2 execute in parallel.
MicroProgram On-chip storage
Reconfigurable Microcode Storage
• Fixed on-chip storage for frequently used microcode
• Pageable on-chip storage for less frequently used microcode
[Figure: microcode is loaded from memory into on-chip storage; the FIXED part holds permanently stored, frequently used microcode, the PAGEABLE part holds less frequently used microcode]
( IEEE MICRO ’03 )
The ρμ-code unit
[Figure: the ρμ-code unit. Labels: SEQUENCER (determine next microinstruction); R/P bit with CS-α / α; Residence Table (returns CS-α, if present); CSAR; ρ-CONTROL STORE with FIXED and PAGEABLE sections for set and execute microcode; MIR (microinstruction); to and from the execution hardware = reconfigurable unit (CCU)]
More on Architectural support
Instruction format: the instruction word consists of OPC, an R/P bit and an address
• R/P: Resident (0); Pageable (1)
• address: Control Store address (CS-α); Memory address (α)
An example microprogram:
• located in memory starting at address α
• address α points to the first microinstruction
• terminated by an end_op
    00: load values into adder
    01: shift_ins
    02: add_ins
    03: shift2_ins
    04: SKIP
    05: BACK
    06: store
    07: end_op
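The instruction format (OPC, R/P bit, address) can be illustrated by packing and unpacking the three fields. The slides do not give field widths, so the layout below is purely hypothetical: a 32-bit word with a 6-bit OPC, one R/P bit and a 25-bit address. All names are mine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: [31:26] OPC, [25] R/P, [24:0] address. */
#define OPC_BITS  6u
#define ADDR_BITS 25u
#define ADDR_MASK ((1u << ADDR_BITS) - 1u)

enum residency { RESIDENT = 0, PAGEABLE = 1 };   /* R/P bit values from the slide */

uint32_t encode_word(uint32_t opc, enum residency rp, uint32_t address)
{
    assert(opc < (1u << OPC_BITS));
    assert(address <= ADDR_MASK);
    return (opc << (ADDR_BITS + 1u)) | ((uint32_t)rp << ADDR_BITS) | address;
}

uint32_t word_opc(uint32_t word)     { return word >> (ADDR_BITS + 1u); }
uint32_t word_address(uint32_t word) { return word & ADDR_MASK; }

enum residency word_residency(uint32_t word)
{
    return ((word >> ADDR_BITS) & 1u) ? PAGEABLE : RESIDENT;
}
```

When the R/P bit is 0 the address field would be read as a control store address, and when it is 1 as a memory address of a pageable microprogram.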
The MOLEN ρμ-coded processor (FPL’01)
• The arbiter arbitrates (redirects) instructions between GPP and RP; it also controls the loading of microcode
• X registers exchange parameters between GPP and RU
• The ρμ-unit controls the CCU by microinstructions
• The CCU has direct access to the data memory
[Figure: Molen organization. Labels: Main Memory; Instruction Fetch; Data Load/Store; ARBITER; DATA MEMORY MUX/DEMUX; Core Processor with Register File; Exchange Registers; Reconfigurable Processor with reconfigurable microcode unit and CCU]
The Molen Prototype
[Figure: Molen machine organization]
[Figure: Molen prototype implemented on Virtex II Pro]
The Prototype Features ( FCCM 04 )
A VHDL model has been synthesized for Virtex II Pro technology:
• 64 KBytes data and 64 KBytes instruction (on-chip) memories;
• 64-bit data memory bus;
• 64-bit instruction memory bus;
• 64-bit microcode word length;
• 32 MBytes memory segment for microprograms;
• 8Kx64-bit ρ-control store using Dual Port Block RAMs (BRAM);
• 512x32-bit XREGs implemented in BRAMs.
Three clock domains:
• PPC clock – 250 MHz;
• MEM clock – 83 MHz;
• User clock – external.
Utilization of FPGA resources (no CCU), device xc2vp20-5:
Resource           Reconf. Processor   Arbiter   Total incl. XREGs   Available resources   %
# slices           71                  84        156                 10304                 1
# flip-flops       84                  69        147                 20608                 1
# LUT4             171                 150       322                 20608                 1
# BRAM             4                   N.A.      5                   112                   3
Max. Freq. [MHz]   130                 143       130                 N.A.                  N.A.
Trivial HW costs
Compiling for the Molen
[Figure: C application (MAIN.c, File_n.c) → Compiler: SUIF frontend → Machine SUIF backend framework with alpha backend, x86 backend and MOLEN extension]
FCCM
The Molen Compiler ( FPL 03-04 )
• SUIF + MachineSUIF
• Molen extension: PowerPC backend, ISA extension (SET/EXEC), register extension (XRs)
Target:
• IBM PowerPC 405 GPP in Virtex II Pro
• Register file extension (XRs)
• ISA extension
Code for a “function”
• Example:
C code: res = alpha(param1, param2);
    movtx XR1 ← param1              Send parameters
    movtx XR2 ← param2
    set <address_alpha_set>         HW reconfiguration
    exec <address_alpha_exec>       HW execution
    movfx res ← XR3                 Return result
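The five-instruction call sequence can be mimicked in plain C to show the order of events. This is a toy software model, not the Molen hardware or its tool chain: the exchange registers are an array (512 entries, as in the prototype), "reconfiguration" is modelled as storing a function pointer, and alpha is arbitrarily chosen to be an addition. Every name below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_XREGS 512   /* the prototype lists 512 x 32-bit XREGs */

typedef void (*ccu_op)(int32_t *xr);   /* toy stand-in for a configured CCU */

typedef struct {
    int32_t xr[NUM_XREGS];   /* exchange registers */
    ccu_op  configured;      /* NULL until a set has been issued */
} molen_model;

void movtx(molen_model *m, size_t idx, int32_t value)
{
    assert(idx < NUM_XREGS);
    m->xr[idx] = value;
}

int32_t movfx(const molen_model *m, size_t idx)
{
    assert(idx < NUM_XREGS);
    return m->xr[idx];
}

void set_op(molen_model *m, ccu_op op) { m->configured = op; }

/* Returns 0 on success, -1 if nothing has been configured. */
int exec_op(molen_model *m)
{
    if (m->configured == NULL) return -1;
    m->configured(m->xr);
    return 0;
}

/* Assumed behaviour of alpha for this sketch: XR3 = XR1 + XR2. */
void alpha_hw(int32_t *xr) { xr[3] = xr[1] + xr[2]; }

int32_t call_alpha(molen_model *m, int32_t param1, int32_t param2)
{
    movtx(m, 1, param1);       /* send parameters */
    movtx(m, 2, param2);
    set_op(m, alpha_hw);       /* HW reconfiguration */
    int status = exec_op(m);   /* HW execution */
    assert(status == 0);
    (void)status;
    return movfx(m, 3);        /* return result */
}
```

The point of the model is the ordering: parameters go through the exchange registers before set and exec, and the result is read back from an exchange register afterwards.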
Sequence Control Example
Code generation:
C code:
    #pragma call_fpga op1
    int f(int a, int b){
        int c,i;
        c=0;
        for(i=0; i<b; i++)
            c = c + a<<i + i;
        c = c>>b;
        return c;
    }
    void main(){
        int x,z;
        z=5;
        x= f(z, 7);
    }
Original code:
main:
    mrk 2,13
    ldc $vr0.s32 <- 5
    mov main.z <- $vr0.s32
    mrk 2, 14
    ldc $vr2.s32 <- 7
    cal $vr1.s32 <- f(main.z, $vr2.s32)
    mov main.x <- $vr1.s32
    mrk 2, 15
    ldc $vr3.s32 <- 0
    ret $vr3.s32
    .text_end main
Modified code:
    mrk 2, 14
    mov $vr2.s32 <- main.z
    movtx $vr1.s32(XR) <- $vr2.s32
    ldc $vr4.s32 <- 7
    movtx $vr3.s32(XR) <- $vr4.s32
    set address_op1_SET
    ldc $vr6.s32(XR) <- 0
    movtx $vr7.s32(XR) <- vr6.s32
    exec address_op1_EXEC
    movfx $vr8.s32 <- $vr5.s32(XR)
    mov main.x <- $vr8.s32
The Experiment (hand tuned HW)
Step 1. Obtain MPEG-2 profiling data on a PowerPC system
                                  MPEG-2 encoder                                MPEG-2 decoder
sequence    #frames@Resolution    SAD (16x16)   DCT (8x8)   IDCT (8x8)   Total     IDCT (8x8)
carphone    96@176x144            51.1 %        12.5 %      1.3 %        64.9 %    50.4 %
claire      168@360x288           53.8 %        11.8 %      1.0 %        66.6 %    37.6 %
container   300@352x288           56.2 %        10.7 %      1.0 %        67.9 %    40.4 %
tennis      112@352x240           60.0 %        9.5 %       0.8 %        70.3 %    40.5 %
Step 2. Measure the kernels speedups on the prototype:
sequence    SAD16   SAD128   SAD256   DCT     IDCT
carphone    6.5     18.9     22.2     302.3   24.4
claire      8.3     23.9     28.2     302.2   24.4
container   12.2    35.2     41.5     302.1   24.4
tennis      12.1    35.0     41.2     302.1   32.3
Step 3. Overall speedup per kernel
            MPEG-2 encoder                            MPEG-2 decoder
sequence    SAD16   SAD128   SAD256   DCT    IDCT     IDCT
carphone    1.76    1.94     1.95     1.14   1.01     1.94
claire      1.90    2.06     2.08     1.13   1.01     1.56
container   2.07    2.20     2.21     1.12   1.01     1.63
tennis      2.22    2.40     2.41     1.10   1.01     1.65
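The Step 3 figures appear consistent with applying Amdahl's law to one kernel at a time: the profiled time fraction from Step 1 is accelerated by the kernel speedup from Step 2 and the rest of the application is left unchanged. The sketch below assumes that reading; the function name is mine.

```c
#include <assert.h>

/* Overall application speedup when a fraction `a` of the execution
 * time is accelerated by a factor `s` and the rest is unchanged. */
double overall_speedup(double a, double s)
{
    assert(a >= 0.0 && a <= 1.0);
    assert(s > 0.0);
    return 1.0 / ((1.0 - a) + a / s);
}
```

For carphone, overall_speedup(0.511, 22.2) is about 1.95 (SAD256) and overall_speedup(0.125, 302.3) is about 1.14 (DCT), matching the corresponding Step 3 entries.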
Real vs. Theoretical Speedups
Step 4. Application speedup
            MPEG-2 Encoder                  MPEG-2 Decoder
Speedup     Prototype   theory   %Smax      Prototype   theory   %Smax
carphone    2.64        2.85     93         1.94        2.02     96
claire      2.80        2.99     94         1.56        1.60     98
container   2.96        3.12     95         1.63        1.68     97
tennis      3.18        3.37     94         1.65        1.68     98
Recall Smax: the theoretically attainable MAX, reached when the time T_SEi spent in the kernels implemented in reconfigurable hardware is 0.
[Figure: MPEG-2 encoder, speedup (2.6 to 3.2) versus a (0.65, 0.67, 0.68, 0.71); curves: theoretically attainable MAX and measured experimentally; performance gain from SAD and DCT implemented in reconfigurable hardware]
The MOLEN prototype achieves between 93% and 98% of the theoretically attainable maximum speedup for the MPEG-2 codec.
mpeg2enc Instruction Counts
Normalised to the default (= 100):
             instructions   branches   loads   stores
default      100            100        100     100
DCT          86.4           91.4       92.2    91.5
SAD          48             43.6       61.7    100
DCT & SAD    33.7           35.1       54      91.5
Instructions: 137 million (default) versus 46 million (DCT & SAD)
M-JPEG (HW Automatically Generated) ( FPL 04 )
• M-JPEG multimedia benchmark
• DCT * hardware implementation
• Molen prototype
Performance (MJPEG)
[Figure: execution cycles in millions for the Tennis, Barbara and Artemis images; SW execution 34, HW execution 14; speedup 2.5]
Execution cycles:
SW DCT (%)             66 %
SW DCT                 1,242,017
HW DCT                 4,125
HW DCT conv            102,589
Prototype speedup      2.5 x
Theoretical speedup    2.96 x
Efficiency             84 %
Conclusions
• We have shown a new:
  • microarchitecture
  • processor architecture
  • programming paradigm
  • compilation
• We have shown that it is easier and faster to ride Amdahl’s curve with polymorphic processors!
Contact information
Computer Engineering Laboratory: http://ce.et.tudelft.nl
MOLEN homepage: http://ce.et.tudelft.nl/MOLEN
Personal homepage: http://ce.et.tudelft.nl/~stamatis
Overview paper: The Molen Polymorphic Processor, IEEE Transactions on Computers, Nov. 2004