PACT ’04, Antibes, France Polymorphic Processors: How to Expose Arbitrary Hardware Functionality to Programmers Stamatis Vassiliadis Computer Engineering,

PACT ’04, Antibes, France

Polymorphic Processors: How to Expose Arbitrary Hardware Functionality to Programmers

Stamatis Vassiliadis

Computer Engineering, EEMCS, TU Delft

http://ce.et.tudelft.nl

Member of HiPEAC








PZE and Amdahl's law

Potential Zero Execution (PZE) was introduced in 1987-88 and published in the IBM Journal of R&D in 1994.

• Timewise we execute two instructions (50% code elimination)
• Max speedup = 2.0
• Excluding start-up: 5 cycles reduced to 3, speedup 1.6, 83% efficiency

Techniques:
• ILP
• pipeline
• technology

The limitation (Amdahl's curve): accelerating a fraction 0.5 of the program yields at most 2X; a fraction 0.9 yields at most 10X.

[Figure: a program divided into fractions (50%, 20%, …) with an ASIC block; other labels: "Very Large", "B"]

Why polymorphic? We can ride Amdahl's curve more easily and faster.
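The limitation quoted on this slide is Amdahl's law. As a minimal sketch (the function names below are illustrative and do not come from the talk), the bound and the resulting efficiency can be computed as follows:

```c
#include <assert.h>

/* Overall speedup when a fraction `a` (0 <= a < 1) of the original
 * execution time is accelerated by a factor `s` (s > 0). */
double amdahl_speedup(double a, double s)
{
    assert(a >= 0.0 && a < 1.0 && s > 0.0);
    return 1.0 / ((1.0 - a) + a / s);
}

/* Upper bound on the speedup: the accelerated fraction takes zero time. */
double amdahl_max(double a)
{
    assert(a >= 0.0 && a < 1.0);
    return 1.0 / (1.0 - a);
}

/* Efficiency: achieved speedup relative to the bound for fraction `a`. */
double amdahl_efficiency(double achieved, double a)
{
    return achieved / amdahl_max(a);
}
```

With these definitions, amdahl_max(0.5) is 2 and amdahl_max(0.9) is 10, the two points marked on the curve. Reducing 5 cycles to 3 with a = 0.5 corresponds to about 83% of the maximum of 2.0.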


Motivating example

Encoding: Original image -> Paeth coding -> Filtered image -> predictive coding compression (ZIP) -> bitstream

Transmission: bitstream

Decoding: bitstream -> UNZIP -> Filtered image -> decoding -> Original image

Goal: get an image with more 0's.

Is it possible? Spatial redundancy: adjacent pixels often have the same values => many differences between them = 0.

Research questions:
• What does Paeth mean in terms of computations?
• Can I put it in hardware?
• What is my gain?


Motivating example

Pixel neighbourhood:

    c b
    a d

Paeth(d) = the one of a, b, c that is closest to the initial prediction p = a+b-c

Original:

    0 0 0 0
    0 3 3 3
    0 3 4 4
    0 3 4 5

Paeth:

    0 0 0 0
    0 0 3 3
    0 3 3 4
    0 3 4 4

Filtered = Original - Paeth:

    0 0 0 0
    0 3 0 0
    0 0 1 0
    0 0 0 1

Example: c=3, b=3, a=4, d=4; p = 4+3-3 = 4; Paeth(d) = a = 4; Filtered = Original - Paeth = 4 - 4 = 0

Hardware datapath:
• p = a+b-c
• pa = |p-a|, pb = |p-b|, pc = |p-c|
• The comparisons pa<=pb?, pa<=pc?, pb<=pc? drive the multiplexers that select a, b or c as the Paeth output
• area: 6 8-bit adders
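The datapath above maps directly onto a few lines of C. The sketch below follows the tie-breaking order of the comparisons on the slide (a first, then b, then c), which is also the order used by the PNG Paeth filter. Two simplifications are assumptions of this sketch: pixels are plain `int` values rather than bytes with modulo-256 arithmetic, and neighbours outside the image are treated as 0. The function names are illustrative.

```c
#include <assert.h>
#include <stdlib.h>

/* Paeth predictor for pixel d with neighbours a (left), b (above)
 * and c (upper left). */
int paeth_predict(int a, int b, int c)
{
    int p  = a + b - c;      /* initial prediction */
    int pa = abs(p - a);     /* distance of each candidate to p */
    int pb = abs(p - b);
    int pc = abs(p - c);

    if (pa <= pb && pa <= pc)
        return a;
    if (pb <= pc)
        return b;
    return c;
}

/* Filtered = Original - Paeth for a row-major rows x cols image.
 * Neighbours outside the image are taken as 0 (assumption). */
void paeth_filter(const int *img, int *out, int rows, int cols)
{
    assert(img != NULL && out != NULL && rows > 0 && cols > 0);
    for (int r = 0; r < rows; r++) {
        for (int col = 0; col < cols; col++) {
            int a = (col > 0) ? img[r * cols + col - 1] : 0;
            int b = (r > 0) ? img[(r - 1) * cols + col] : 0;
            int c = (r > 0 && col > 0) ? img[(r - 1) * cols + col - 1] : 0;
            out[r * cols + col] = img[r * cols + col] - paeth_predict(a, b, c);
        }
    }
}
```

For the worked example on the slide, paeth_predict(4, 3, 3) returns 4, so the filtered value of d = 4 is 0.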


Example: Paeth Prediction (PNG)

• Altivec iteration: 95 instructions per 16 pixels.
• CSI code: 1 instruction for all iterations (+20 setup instructions).

CSI instruction design (EUROMICRO 99):
• latency: 5 cycles
• throughput: 16 pixels / 1 cycle
• area: 24 32-bit adders
• Cycle = 1 ALU operation

What the Altivec code does: initialize, then loop over load, unpack, process, pack, store.

C code:

    bptr = prev_row+1;
    dptr = curr_row+1;
    predptr = predict_row+1;
    for(i=1; i < length; i++){
        c = *(bptr-1);
        b = *bptr;
        ... .... ...
        if(...) *predptr = a;
        else if (..)
        else *predptr = c;
        ......
        bptr++;
    }

Altivec code:

    li r5, 0                      …. 6 instructions in total
    loop:
    lvx vr03, r1                  # load c's
    lvx vr04, r2                  # load a's
    vsldoi vr05, vr01, vr03, 1    # load b's
    vmrghb vr07, vr03, vr00       # unpack
    vmrglb vr08, vr03, vr00       # unpack
                                  … 6 instructions in total
    # Compute
    vadduhs vr15, vr09, vr11      # a+b
    vadduhs vr16, vr10, vr12      #
    vsubshs vr15, vr15, vr07      #
    vsubshs vr16, vr16, vr08      #
                                  .. 76 instructions in total
    # Pack:
    vpkshus vr28, vr28, 29        # pack
    # Store:
    stvx vr28, r3, 0              # store
    # Loop control
    addi r1, r1, 16
    ……..
    bneq r7, r0, loop             # Loop

CSI code:

    li r5, 1
    csi_mt_scr r1, SCR1, 0
    csi_mt_scr r5, SCR1, 1
    .. 20 instructions in total
    csi_paeth predptr, bptr, dptr

ONE INSTRUCTION for all loop iterations.


Results: Instruction count and execution time reduction

Bench: Paeth kernel, 132-element vectors (132 pixels in a row)

• Dynamic instruction counts, normalised to non-CSI counts
• Execution time: on a 4-issue CPU with a 32-byte-wide CSI unit, normalised to non-CSI execution


Research Questions

Motivating example: obvious observations
• NO way I can do this on fixed hardware.
• I can do this if the hardware changes functionality at my wish.

EASIER SAID THAN DONE! I have to answer the following:
• How can I identify the code for hardware implementation?
• How can I implement "arbitrary" code?
• How can I substitute this code with SW/HW descriptions, say at the source level?
• Is the hardwired code substituted by new instructions?
• How can I automatically generate the "transformed" program?

Topics to address:
• Programming paradigm (HW and SW descriptions coexisting in a program)
• Processor architecture (behavior + logical structure)
• Microarchitecture
• Compilation
• New kind of tools


Outline

What to do:
– Identify the code to be implemented in hardware
– Show hardware feasibility of that code in FPGA
– Map that code into reconfigurable hardware (RH)
– Eliminate the identified code
– Add code to have "equivalent" behavior
– Compile new program
– Execute

[Figure: Program P runs on a GPP with memory (MEM); the transformed Program P' (with added code A) runs on the GPP plus RH (FPGA), which exchange DATA and RESULTS]

MOLEN: Tools, Microarchitecture, Architecture, Programming Paradigm, Compiler
• Sequential consistency, split-join parallelism, function-like code
• Introduce reconfigurable microcode (ρμ-code)
• Specific code in hardware left to the programmer/hardware designer
• One-time 8 new instructions for any ISA
• Co-processor paradigm (e.g. vector)
• New register file for parameter passing


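Tool Chain

New program where hardware/software descriptions co-exist

Code:

    int fact(int n) {
        if (n < 1) return 1;
        else return (n * fact(n - 1));
    }

[Figure: tool chain. Code and human directives enter C2C, which asks per function whether it is critical (YES/NO). A critical function f(.) is described in HDL (hand coded, LIB, or AUTO), and the retargeted compiler emits binary code containing "call f(.)". Target architecture: IBM PowerPC and XILINX VIRTEX-II PRO FPGA.]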


The MOLEN ISA (SAMOS '03)

• Divide RC into two logical phases: SET <address>, EXECUTE <address>
• Implementation and ISA independent, "function" independent; no new op-codes
• Parameter passing: two new instructions + register file; an arbitrary number of parameters can be passed
• Parallel execution: split via a Molen instruction and join via a GPP instruction or one special instruction
• Modularity: by implementing at least the minimal MOLEN instruction set and by reconfiguring to it
• Reconfigurable design: two instructions
• Execute on reconfigurable: one instruction
• Speeding up reconfiguration and execution: two instructions for prefetching

Total: 8 new instructions


Instruction Set Partitioning

8 instructions grouped in 6 instruction categories:

• Minimal: SET <address>, EXECUTE <address>, MOVTX and MOVFX
• Preferred: SET PREFETCH <address>, EXECUTE PREFETCH <address>
• Complete: BREAK; partial SET (P-SET) and complete SET (C-SET)
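One way to read the partitioning is as three nested instruction sets. The sketch below encodes that reading in C. The grouping is an assumption of this sketch (in particular, that the SET of the minimal set is the complete SET and that P-SET and BREAK appear only in the complete set), and all identifiers are illustrative rather than taken from the Molen tools.

```c
#include <assert.h>

/* The eight Molen instructions, ordered by the smallest partition
 * that is assumed to contain them. */
typedef enum {
    MOLEN_C_SET, MOLEN_EXECUTE, MOLEN_MOVTX, MOLEN_MOVFX, /* minimal   */
    MOLEN_SET_PREFETCH, MOLEN_EXECUTE_PREFETCH,           /* preferred */
    MOLEN_P_SET, MOLEN_BREAK,                             /* complete  */
    MOLEN_OP_COUNT
} molen_op;

typedef enum {
    MOLEN_ISA_MINIMAL,
    MOLEN_ISA_PREFERRED,
    MOLEN_ISA_COMPLETE
} molen_isa;

/* Smallest partition that contains `op` under the assumed grouping. */
molen_isa molen_required_isa(molen_op op)
{
    assert((int)op >= 0 && op < MOLEN_OP_COUNT);
    switch (op) {
    case MOLEN_C_SET:
    case MOLEN_EXECUTE:
    case MOLEN_MOVTX:
    case MOLEN_MOVFX:
        return MOLEN_ISA_MINIMAL;
    case MOLEN_SET_PREFETCH:
    case MOLEN_EXECUTE_PREFETCH:
        return MOLEN_ISA_PREFERRED;
    default:
        return MOLEN_ISA_COMPLETE;
    }
}

/* Nonzero if a processor implementing `implemented` can run `op`. */
int molen_supports(molen_isa implemented, molen_op op)
{
    return molen_required_isa(op) <= implemented;
}
```

A compiler back end could use such a check to avoid emitting prefetch instructions for a processor that implements only the minimal set.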


Sequence Control Example

C code:

    #pragma call_fpga op1
    int f(int x, int y) { … }
    #pragma call_fpga op2
    int g(int x) { … }
    int h(int a, int b, int c) {
        int m, n, ...;
        m = f(a, b);
        n = g(c);
        ……
    }

Generated code:

    h:
        mov a -> r1
        movtx r1 -> XR2
        mov b -> r2
        movtx r2 -> XR3
        mov c -> r3
        movtx r3 -> XR4
        set address_set_op1
        set address_set_op2
        ldc 2 -> r4
        movtx r4 -> XR0
        ldc 4 -> r5
        movtx r5 -> XR1
        execute address_ex_op1
        execute address_ex_op2
        movfx XR2 -> r6
        mov r6 -> m
        movfx XR4 -> r7
        mov r7 -> n

There is no data dependency between f and g, so op1 and op2 execute in parallel.


Reconfigurable Microcode Storage (IEEE MICRO '03)

MicroProgram on-chip storage:

• Fixed on-chip storage for frequently used microcode
• Pageable on-chip storage for less frequently used microcode

FIXED: permanently stored, frequently used
PAGEABLE: loaded from memory, less frequently used


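The ρμ-code unit

[Figure: the ρμ-code unit. The SEQUENCER determines the next microinstruction from the R/P bit and the address field (α or ρCS-α) of the instruction and from status coming from the execution hardware. The Residence Table (H) supplies ρCS-α, if present. The CSAR addresses the CONTROL STORE, which holds a set part and an execute part, each with a FIXED and a PAGEABLE section. The selected microinstruction is latched in the MIR and sent to the execution hardware, i.e. the reconfigurable unit (CCU).]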


More on Architectural support

Instruction format (instruction word): OPC | R/P | address

• R/P: Resident (0); Pageable (1)
• address: Control Store address (ρCS-α); Memory address (α)

An example microprogram:

• located in memory starting at address α
• address α points to the first microinstruction
• terminated by an end_op

    00: load values into adder
    01: shift_ins
    02: add_ins
    03: shift2_ins
    04: SKIP
    05: BACK
    06: store
    07: end_op
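The role of the R/P bit can be illustrated with a small software model. This is only a sketch under stated assumptions: the slides give no field widths or table organisation, so the struct layout, the linear search of the residence table and every identifier below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NOT_PRESENT (-1L)

/* Decoded instruction word: OPC | R/P | address (widths are assumptions). */
typedef struct {
    unsigned opc;       /* operation code */
    unsigned pageable;  /* R/P bit: 0 = resident, 1 = pageable */
    uint32_t address;   /* control store address if resident,
                           memory address if pageable */
} molen_insn;

/* One residence table entry: a paged-in microprogram and its location. */
typedef struct {
    int      valid;
    uint32_t mem_address;  /* where the microprogram starts in memory */
    uint32_t cs_address;   /* where it currently sits in the control store */
} residence_entry;

/* Returns the control store address of the first microinstruction, or
 * NOT_PRESENT if a pageable microprogram still has to be loaded. */
long resolve_cs_address(const molen_insn *insn,
                        const residence_entry *table, size_t slots)
{
    assert(insn != NULL && (table != NULL || slots == 0));
    if (!insn->pageable)
        return (long)insn->address;        /* resident: use address as is */
    for (size_t i = 0; i < slots; i++) {   /* pageable: consult the table */
        if (table[i].valid && table[i].mem_address == insn->address)
            return (long)table[i].cs_address;
    }
    return NOT_PRESENT;
}
```

On a NOT_PRESENT result, a model of the sequencer would first copy the microprogram from memory into the pageable part of the control store, up to and including its end_op, and then update the table.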


The MOLEN ρμ-coded processor (FPL '01)

[Figure: Main Memory feeds Instruction Fetch and Data Load/Store. The ARBITER and the DATA MEMORY MUX/DEMUX connect them to the Core Processor (with its Register File) and to the Reconfigurable Processor (reconfigurable microcode unit and CCU). The Exchange Registers link the two processors.]

• The arbiter arbitrates (redirects) instructions between GPP and RP; it also controls the loading of microcode.
• X registers exchange parameters between GPP and RU.
• The ρμ-unit controls the CCU by microinstructions.
• The CCU has direct access to the data memory.


The Molen Prototype

[Figures: Molen machine organization; Molen prototype implemented on Virtex II Pro]


The Prototype Features (FCCM 04)

A VHDL model has been synthesized for Virtex II Pro technology:

• 64 KBytes data and 64 KBytes instruction (on-chip) memories;
• 64-bit data memory bus;
• 64-bit instruction memory bus;
• 64-bit microcode word length;
• 32 MBytes memory segment for microprograms;
• 8Kx64-bit ρ-control store using Dual Port Block RAMs (BRAM);
• 512x32-bit XREGs implemented in BRAMs.

Three clock domains:
• PPC clock – 250 MHz;
• MEM clock – 83 MHz;
• User clock – external.

Utilization of FPGA resources (no CCU), device xc2vp20-5:

    Resource           Reconf. Processor   Arbiter   Total incl. XREGs   Available resources   %
    # slices           71                  84        156                 10304                 1
    # flip-flops       84                  69        147                 20608                 1
    # LUT4             171                 150       322                 20608                 1
    # BRAM             4                   N.A.      5                   112                   3
    Max. Freq. [MHz]   130                 143       130                 N.A.                  N.A.

Trivial HW costs


Compiling for the Molen (FCCM)

[Figure: a C application (MAIN.c … File_n.c) enters the compiler, which consists of the SUIF frontend and the Machine SUIF backend framework with an alpha backend, an x86 backend and the MOLEN extension]


The Molen Compiler (FPL 03-04)

• SUIF + MachineSUIF
• Molen extension:
  • PowerPC backend
  • ISA extension (SET/EXEC)
  • Register extension (XRs)

Target:
• IBM PowerPC 405 GPP in Virtex II Pro
• Register file extension (XRs)
• ISA extension


Code for a “function”

• Example:

C code: res = alpha(param1, param2);

    movtx XR1 ← param1            Send parameters
    movtx XR2 ← param2
    set <address_alpha_set>       HW reconfiguration
    exec <address_alpha_exec>     HW execution
    movfx res ← XR3               Return result
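The five-step sequence above can be mimicked in plain C to make the calling convention concrete. This is a software model only, not Molen code: a function pointer stands in for the configured hardware, the body of alpha (an addition) is invented for illustration, and all identifiers are hypothetical. The register count follows the 512 XREGs of the prototype.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XREG_COUNT 512  /* the prototype lists 512 x 32-bit XREGs */

/* A "configured" operation reads and writes the exchange registers. */
typedef void (*ccu_function)(int32_t *xr);

static int32_t      xregs[XREG_COUNT];
static ccu_function configured = NULL;

/* movtx: move a GPP value into an exchange register. */
void molen_movtx(size_t xr, int32_t value)
{
    assert(xr < XREG_COUNT);
    xregs[xr] = value;
}

/* movfx: move an exchange register back to the GPP. */
int32_t molen_movfx(size_t xr)
{
    assert(xr < XREG_COUNT);
    return xregs[xr];
}

/* set: "reconfigure" the hardware; here it only records the function. */
void molen_set(ccu_function f)
{
    configured = f;
}

/* exec: run the configured operation on the exchange registers. */
void molen_exec(void)
{
    assert(configured != NULL);
    configured(xregs);
}

/* Invented stand-in for alpha in hardware: XR3 = XR1 + XR2. */
void alpha_ccu(int32_t *xr)
{
    xr[3] = xr[1] + xr[2];
}

/* res = alpha(param1, param2), following the sequence on the slide. */
int32_t call_alpha(int32_t param1, int32_t param2)
{
    molen_movtx(1, param1);   /* send parameters */
    molen_movtx(2, param2);
    molen_set(alpha_ccu);     /* HW reconfiguration */
    molen_exec();             /* HW execution */
    return molen_movfx(3);    /* return result */
}
```

The point of the model is the separation of concerns: parameters and results travel only through the exchange registers, so the same five-instruction pattern works for any function placed in hardware.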


Sequence Control Example

C code:

    #pragma call_fpga op1
    int f(int a, int b) {
        int c, i;
        c = 0;
        for (i = 0; i < b; i++)
            c = c + a<<i + i;
        c = c>>b;
        return c;
    }
    void main() {
        int x, z;
        z = 5;
        x = f(z, 7);
    }

Code generation:

Original code:

    main:
        mrk 2, 13
        ldc $vr0.s32 <- 5
        mov main.z <- $vr0.s32
        mrk 2, 14
        ldc $vr2.s32 <- 7
        cal $vr1.s32 <- f(main.z, $vr2.s32)
        mov main.x <- $vr1.s32
        mrk 2, 15
        ldc $vr3.s32 <- 0
        ret $vr3.s32
        .text_end main

Modified code:

        mrk 2, 14
        mov $vr2.s32 <- main.z
        movtx $vr1.s32(XR) <- $vr2.s32
        ldc $vr4.s32 <- 7
        movtx $vr3.s32(XR) <- $vr4.s32
        set address_op1_SET
        ldc $vr6.s32(XR) <- 0
        movtx $vr7.s32(XR) <- $vr6.s32
        exec address_op1_EXEC
        movfx $vr8.s32 <- $vr5.s32(XR)
        mov main.x <- $vr8.s32


The Experiment (hand tuned HW)

Step 1. Obtain MPEG-2 profiling data on a PowerPC system

                                        MPEG-2 encoder                                   MPEG-2 decoder
    sequence    #frames@Resolution     SAD (16x16)   DCT (8x8)   IDCT (8x8)   Total      IDCT (8x8)
    carphone    96@176x144             51.1 %        12.5 %      1.3 %        64.9 %     50.4 %
    claire      168@360x288            53.8 %        11.8 %      1.0 %        66.6 %     37.6 %
    container   300@352x288            56.2 %        10.7 %      1.0 %        67.9 %     40.4 %
    tennis      112@352x240            60.0 %        9.5 %       0.8 %        70.3 %     40.5 %

Step 2. Measure the kernels speedups on the prototype:

    sequence    SAD16   SAD128   SAD256   DCT     IDCT
    carphone    6.5     18.9     22.2     302.3   24.4
    claire      8.3     23.9     28.2     302.2   24.4
    container   12.2    35.2     41.5     302.1   24.4
    tennis      12.1    35.0     41.2     302.1   32.3

Step 3. Overall speedup per kernel

                MPEG-2 encoder                            MPEG-2 decoder
    sequence    SAD16   SAD128   SAD256   DCT    IDCT     IDCT
    carphone    1.76    1.94     1.95     1.14   1.01     1.94
    claire      1.90    2.06     2.08     1.13   1.01     1.56
    container   2.07    2.20     2.21     1.12   1.01     1.63
    tennis      2.22    2.40     2.41     1.10   1.01     1.65


Real vs. Theoretical Speedups

Step 4. Application speedup

Recall Smax: $S_{max} = \frac{T_{SE}}{T_{SE} - \sum_i T_{SE_i}} = \frac{1}{1 - a}$, where $T_{SE}$ is the software execution time of the whole program, $T_{SE_i}$ is the software execution time of kernel $i$ (SAD, DCT, ...) implemented in reconfigurable hardware, and $a = \sum_i T_{SE_i} / T_{SE}$.

                MPEG-2 Encoder                  MPEG-2 Decoder
    Speedup     Prototype   theory   %Smax      Prototype   theory   %Smax
    carphone    2.64        2.85     93         1.94        2.02     96
    claire      2.80        2.99     94         1.56        1.60     98
    container   2.96        3.12     95         1.63        1.68     97
    tennis      3.18        3.37     94         1.65        1.68     98

[Figure: MPEG-2 Encoder, speedup (2.6 to 3.2) versus a (0.65, 0.67, 0.68, 0.71). Two curves: the theoretically attainable MAX (kernels implemented in reconfigurable hardware take zero time) and the speedup measured experimentally; the gap between them is the remaining performance gain.]

The MOLEN prototype achieves between 93% and 98% of the theoretically attainable maximum speedup for the MPEG-2 codec.
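Steps 1 to 4 combine into a single formula: with kernel fractions a_i from profiling and kernel speedups s_i from the prototype, the application speedup is 1 / (1 - Σ a_i + Σ a_i / s_i), and the theoretical maximum is its limit for s_i → ∞. The C sketch below implements both; the type and function names are illustrative and not taken from the talk.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double fraction;  /* a_i: share of software execution time in kernel i */
    double speedup;   /* s_i: measured speedup of kernel i in hardware */
} kernel;

/* Application speedup when every kernel i runs s_i times faster. */
double application_speedup(const kernel *k, size_t n)
{
    double remaining   = 1.0;  /* fraction of time left in software */
    double accelerated = 0.0;  /* fraction of time spent in hardware */
    for (size_t i = 0; i < n; i++) {
        assert(k[i].fraction >= 0.0 && k[i].speedup > 0.0);
        remaining   -= k[i].fraction;
        accelerated += k[i].fraction / k[i].speedup;
    }
    assert(remaining + accelerated > 0.0);
    return 1.0 / (remaining + accelerated);
}

/* Theoretical maximum: the kernels take zero time. */
double max_speedup(const kernel *k, size_t n)
{
    double remaining = 1.0;
    for (size_t i = 0; i < n; i++)
        remaining -= k[i].fraction;
    assert(remaining > 0.0);
    return 1.0 / remaining;
}
```

As a check against the tables, take the carphone encoder with fractions 0.511 (SAD), 0.125 (DCT), 0.013 (IDCT) and kernel speedups 18.9 (SAD128), 302.3 and 24.4. The result is roughly 2.64 for the application and 2.85 for the maximum, in line with the reported figures. That the SAD128 variant is the one behind the prototype column is an assumption made here because it reproduces the number.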


mpeg2enc Instruction Counts

Counts normalised to the default (software-only) run, in %:

                 instructions   branches   loads   stores
    default      100            100        100     100
    DCT          86.4           91.4       92.2    91.5
    SAD          48             43.6       61.7    100
    DCT & SAD    33.7           35.1       54      91.5

Instructions: 137 million (default) reduced to 46 million (DCT & SAD).


M-JPEG (HW Automatically Generated) (FPL 04)

• M-JPEG multimedia benchmark
• DCT * hardware implementation
• Molen prototype


Performance

[Figure: MJPEG execution cycles in millions (axis 0 to 40), SW execution (34) versus HW execution (14), speedup 2.5; input images: Tennis, Barbara, Artemis]

MJPEG execution cycles:

    SW DCT (%)            66 %
    SW DCT                1,242,017
    HW DCT                4,125
    HW DCT conv           102,589
    Prototype speedup     2.5 x
    Theoretical speedup   2.96 x
    Efficiency            84 %


Conclusions

• We have shown a new:
  • microarchitecture
  • processor architecture
  • programming paradigm
  • compilation

• We have shown that it is easier and faster to ride Amdahl's curve with polymorphic processors!


Contact information

Computer Engineering Laboratory: http://ce.et.tudelft.nl

MOLEN homepage: http://ce.et.tudelft.nl/MOLEN

Personal homepage: http://ce.et.tudelft.nl/~stamatis

OVERVIEW Paper: The Molen Polymorphic Processor, IEEE Transactions on Computers, Nov. 2004

