JR.S00 1 Lecture 14: (Re)configurable Computing Case Studies Prof. Jan Rabaey Computer Science 252, Spring 2000 The contributions of Andre Dehon (Caltech)

JR.S00 1

Lecture 14: (Re)configurable Computing

Case Studies

Prof. Jan Rabaey

Computer Science 252, Spring 2000

The contributions of Andre Dehon (Caltech) and Ravi Subramanian (Morphics) to this slide set are gratefully acknowledged

JR.S00 2

Summary of Previous Class

• Configurable Computing using “programming in space” versus “programming in time” for traditional instruction-set computers

• Key design choices– Computational units and their granularity

– Interconnect Network

– (Re)configuration time and frequency

• Next class: Some practical examples of reconfigurable computers

JR.S00 3

Applicability of Configurable Processing

• Stand-alone computational engines– E.g. PADDI, UCLA Mojave

• Adding programmable I/O to embedded processors

– E.g. Napa 2000

• Augmenting the instruction set of processors– E.g. GARP, RAW

• Providing programmable accelerator co-processors to embedded micro’s and DSP

– Chameleon, Pleiades, Morphics

JR.S00 4

Stand-Alone ComputationalEnginesUCLA Mojave System

Template Matching for AutomaticTarget Recognition

I960 Board

JR.S00 5

As Programmable Interface and I/O Processor

• Logic used in place of – ASIC environment

customization

– external FPGA/PLD devices

• Example– bus protocols

– peripherals

– sensors, actuators

• Case for:– Always have some system

adaptation to do – varying glue logic requirements

– Modern chips have capacity to hold processor + glue logic

– Reduces part count

– Valued added must now be accommodated on chip (formerly board level)

JR.S00 6

Example: Interface/Peripherals

• Triscend E5

JR.S00 7

Model: IO Processor

• Array dedicated to servicing IO channel

– sensor, lan, wan, peripheral

• Provides– protocol handling

– stream computation

» compression, encrypt

• Looks like IO peripheral to processor

• Case for:– many protocols, services

– only need few at a time

– dedicate attention, offload processor

JR.S00 8

IO Processing

• Single threaded processor– cannot continuously monitor multiple data pipes (src,

sink)

– need some minimal, local control to handle events

– for performance or real-time guarantees , may need to service event rapidly

– E.g. checksum (decode) and acknowledge packet

JR.S00 9

Source: National Semiconductor

NAPA 1000 Block Diagram

RPCReconfigurablePipeline Cntr

ALPAdaptive Logic

Processor

SystemPort

TBTToggleBusTM

Transceiver

PMAPipeline

Memory Array

CR32CompactRISCTM

32 Bit Processor

BIUBus Interface

Unit

CR32PeripheralDevices

ExternalMemoryInterface SMA

ScratchpadMemory Array

CIOConfigurable

I/O

JR.S00 10

Source: National Semiconductor

NAPA 1000 as IO Processor

SYSTEMHOST

NAPA1000

ROM &DRAM

ApplicationSpecific

Sensors, Actuators, orother circuits

System Port

CIO

Memory Interface

JR.S00 11

I/O Stream Processor

Philips Nexperia NX-2700A programmable HDTVmedia processor

Combines Trimedia VLIW withConfigurable media co-processors

JR.S00 12

Model: Instruction Augmentation• Observation: Instruction Bandwidth

– Processor can only describe a small number of basic computations in a cycle

» I bits 2I operations

– This is a small fraction of the operations one could do even in terms of www Ops

» w22(2w) operations

– Processor could have to issue w2(2 (2w) -I) operations just to describe some computations

– An a priori selected base set of functions could be very bad for some applications

JR.S00 13

Instruction Augmentation

• Idea:– provide a way to augment the processor’s instruction set

– with operations needed by a particular application

– close semantic gap / avoid mismatch

• What’s required:– some way to fit augmented instructions into stream

– execution engine for augmented instructions

» if programmable, has own instructions

– interconnect to augmented instructions

JR.S00 14

“First” Instruction Augmentation

• PRISM– Processor Reconfiguration through Instruction Set

Metamorphosis

• PRISM-I– 68010 (10MHz) + XC3090

– can reconfigure FPGA in one second!

– 50-75 clocks for operations

[Athanas+Silverman: Brown]

JR.S00 15

PRISM-1 Results

Raw kernel speedups

JR.S00 16

PRISM

• FPGA on bus

• access as memory mapped peripheral

• explicit context management

• some software discipline for use

• …not much of an “architecture” presented to user

JR.S00 17

PRISC

• Takes next step– what look like if we put it on chip?

– how integrate into processor ISA?

• Architecture:– couple into register file as “superscalar” functional unit

– flow-through array (no state)

[Razdan+Smith: Harvard]

JR.S00 18

PRISC

• ISA Integration– add expfu instruction

– 11 bit address space for user-defined expfu instructions

– fault on pfu instruction mismatch

» trap code to service instruction miss

– all operations occur in clock cycle

– easily works with processor context switch

» no state + fault on mismatch pfu instr

JR.S00 19

PRISC Results

• All compiled

• working from MIPS binary

• <200 4LUTs ?– 64x3

• 200MHz MIPS base

Razdan/Micro27

JR.S00 20

Chimaera

• Start from PRISC idea– integrate as functional unit

– no state

– RFUOPs (like expfu)

– stall processor on instruction miss, reload

• Add– manage multiple instructions loaded

– more than 2 inputs possible

[Hauck: Northwestern]

JR.S00 21

Chimaera Architecture

• “Live” copy of register file values feed into array

• Each row of array may compute from register values or intermediates (other rows)

• Tag on array to indicate RFUOP

Results

• Compress 1.11

• Eqntott 1.8

• Life 2.06 (160 hand parallelization)

[Hauck/FCCM97]

JR.S00 22

Instruction Augmentation

• Small arrays with limited state– so far, for automatic compilation

» reported speedups have been small

– open

» discover less-local recodings which extract greater benefit

JR.S00 23

GARP

Identified Problems:

• Single-cycle flow-through – not most promising usage style

• Moving data through Register File to/from array

– can present a limitation

» bottleneck to achieving high computation rate

[Hauser+Wawrzynek: UCB]

JR.S00 24

GARP

• Integrate as coprocessor– similar bandwidth to processor as FU

– own access to memory

• Support multi-cycle operation– allow state

– cycle counter to track operation

• Fast operation selection– cache for configurations

– dense encodings, wide path to memory

JR.S00 25

GARP• ISA -- coprocessor operations

– issue gaconfig to make a particular configuration resident (may be active or cached)

– explicitly move data to/from array

» 2 writes, 1 read

– processor suspend during coprocessor operation

» cycle count tracks operation

– array may directly access memory

» processor and array share memory space• cache/mmu keeps consistency

» can exploit streaming data operations

JR.S00 26

GARP

• Processor Instructions

JR.S00 27

GARP Array

• Row oriented logic– denser for datapath

operations

• Dedicated path for – processor/memory data

• Processor not have to be involved in arraymemory path

JR.S00 28

GARP Results

• General results– 10-20x on stream, feed-

forward operation

– 2-3x when data-dependencies limit pipelining

[Hauser+Wawrzynek/FCCM97]

JR.S00 29

PRISC/Chimera … GARP

• PRISC/Chimaera– basic op is single cycle: expfu (rfuop)

– no state

– could conceivably have multiple PFUs?

– Discover parallelism => run in parallel?

– Can’t run deep pipelines

• GARP– basic op is multicycle

» gaconfig» mtga» mfga

– can have state/deep pipelining

– ? Multiple arrays viable?

– Identify mtga/mfga w/ corr gaconfig?

JR.S00 30

Common Theme

• To get around instruction expression limits– define new instruction in array

» many bits of config … broad expressability

» many parallel operators

– give array configuration short “name” which processor can callout

» …effectively the address of the operation

But – Impact of using reconfiguration at Instruction Level seems limited Explore opportunities at larger granularity levels (basic block, task, process)

JR.S00 31

Applicability of Configurable Processing

• Stand-alone computational engines– E.g. PADDI, UCLA Mojave

• Adding programmable I/O to embedded processors

– E.g. Napa 2000

• Augmenting the instruction set of processors– E.g. GARP, RAW

• Providing programmable accelerator co-processors to embedded micro’s and DSP

– Chameleon, Pleiades, Morphics

JR.S00 32

Example: Chameleon Reconfigurable Co-Processor (network, communication applications)

Reconfigurable LogicArray of 32-bit Data Path

Operators & Control Logic

Data MemoryPCI

InterfaceMemory

ControllerJTAG

Debugging Port

ARC CPU Instruction Cache

Bus Manager& DMA

Controllers

Local StoreMemory(LSM)

Background Configuration Plane

Configuration bit stream

Wide Internal communications bus

Multiple banks of I/O

JR.S00 33

Reconfigurable Processor Tools Flow

Customer Application / IP

(C code)

ARC Object

Code

C Compiler

RTLHDL

Linker

Chameleon Executable

C Model Simulator

Configuration Bits

Synthesis & Layout

C DebuggerDevelopment

Board

JR.S00 34

Heterogeneous Reconfiguration

ReconfigurableReconfigurableLogicLogic

ReconfigurableReconfigurableDatapathsDatapaths

adder

buffer

reg0

reg1

muxCLB CLB

CLBCLB

DataMemory

InstructionDecoder

&Controller

DataMemory

ProgramMemory

Datapath

MAC

In

AddrGen

Memory

AddrGen

Memory

ReconfigurableReconfigurableArithmeticArithmetic

ReconfigurableReconfigurableControlControl

Bit-Level Operationse.g. encoding

Dedicated data pathse.g. Filters, AGU

Arithmetic kernelse.g. Convolution

RTOSProcess management

JR.S00 35

Multi-granularity Reconfigurable Architecture: The Berkeley Pleiades Architecture

Communication Network

ControlProcessor

ArithmeticProcessor

ArithmeticProcessor

ArithmeticProcessor

ConfigurableDatapath

ConfigurableLogic

Configuration Bus

Network Interface

DedicatedArithmetic

Configuration

Satellite ProcessorSatellite Processor

• Computational kernels are “spawned” to satellite processors• Control processor supports RTOS and reconfiguration• Order(s) of magnitude energy-reduction over traditional programmable architectures

JR.S00 36

Matching Computation and Architecture

AddressGen AddressGen

Memory Memory

MAC MAC

ControlProcessor

L CG

Convolution

Two models of computation:communicating processes + data-flow

Two architectural models:sequential control+ data-driven

JR.S00 37

Execution Model of a Data-Flow Kernel

for(i=1;i<=L;i++)

for(k=i;k<=L;k++)

phi[i][k]= phi[i-1][k-1]

+in[NP-i]*in[NP-k]

-in[NA-1-i]*in[NA-1-k];

endstart

Embedded processor

AddrGen

MEM: in

ALU

ALU

AddrGen

MEM: phiMPY MPY

•Distributed control and memory

Code seg

Code seg

JR.S00 38

Reconfigurable Kernels for W-CDMA

• Dominant kernel M(MTX) requires array of MACs and segmented memories

• Additional operations such as sqrt(x), 1/x, and Trellis decoding may be implemented using FPGA or cordic satellite

JR.S00 39

Inter-Satellite Communication• Data-driven execution

– A satellite processor is enabled only when input data is ready

• Data sources generate data of different types: scalars, vectors, matrices

• Data computing processors handle data inputs of different types

end-of-vector token

MPY

MPY1

nn

MACn

n1

AddrGen Memory

Embeddedprocessor

1

11

Data sources Data computing processors

JR.S00 40

Impact of Architectural Choice

1870

Stron

gAR

M

131

Nor

mal

ized

Ene

rgy

/ st

age

[nJ]

TM

S32

0C2x

x

Energy/stage

49

TM

S32

0LC

54x

1000

100

10000

10

21u

Stron

gAR

M

10uN

orm

aliz

ed D

elay

/sta

ge [s

]

TM

S320

C2x

x

Delay/stage

3.8u

TM

S32

0LC

54x

10u

1u

100u

100n

18.5

TM

S320

LC

54x

Nor

mal

ized

Ene

rgy*

Del

ay /

stag

e [J

s*e-

14]

10

1

100

1000 Energy*Delay/stage

137

TM

S32

0C2x

x0.1

3970

Stron

gAR

M

10000Example: 16 point Complex Radix-2 FFT (Final Stage)

13

570n 0.75

Ple

iade

s

Ple

iade

s

Ple

iade

s

JR.S00 41

Adaptive Multi-User Detector for W-CDMAPilot Correlator Unit Using LMS

AG

MULSUB

ADDMEM

MEM

MEM

MEMAG

MUL

MUL

MUL

Filter

Coefficient Update

MEM

MEMAG

ACC

ACC

MAC

MAC

MUL

MUL

SUB

SUB

MULSUB

ADD

MUL

MUL

MUL

SUB

SUB

alt

alt

alt

alt

alt

alt

alt

s_r

s_i

y_r

y_iADD

ADD

Zmf_r

Zmf_i

s_r

s_iZmf_r

Zmf_i

y_r

y_i

JR.S00 42

Architecture Comparison

LMS Correlator at 1.67 MSymbols Data RateLMS Correlator at 1.67 MSymbols Data RateComplexity: 300 Mmult/sec and 357 Macc/secComplexity: 300 Mmult/sec and 357 Macc/sec

Note: TMS implementation requires 36 parallel processors to meet data rate - validity questionable

16 Mmacs/mW!

JR.S00 43

Maia: Reconfigurable Baseband Processor for Wireless

• 0.25um tech: 4.5mm x 6mm

• 1.2 Million transistors

• 40 MHz at 1V

• 1 mW VCELP voice coder

• Hardware

• 1 ARM-8

• 8 SRAMs & 8 AGPs

• 2 MACs

• 2 ALUs

• 2 In-Ports and 2 Out-Ports

• 14x8 FPGA

JR.S00 44

Reconfigurable Interconnect Exploration

N Inputs

B Buses

M Outputs

Multi-Bus

cluster

cluster

cluster

Hierarchical MeshMesh

Module

JR.S00 45

Software Methodology Flow

Algorithms

Kernel Detection

Estimation/Exploration

Partitioning

Software CompilationReconfig. Hardware Mapping

Interface Code Generation

Power & Timing Estimation

of Various Kernel Implementations

PDA Models

PremappedKernels

Acceleratorproc &

BehavioralDemonstration

C++ Module Libraries

C++

SUIF+ C-IF

JR.S00 46

Hardware-Software Exploration

Macromodel call

JR.S00 47

Implementation Fabrics for Protocols

BU

FMemory

Slot_Set_Tbl2x16

addr

BU

F

slot_set<31:0>

Slot_no<5:0>

Slotstart

Pktend

RACHreq

RACHakn

W_ENA

R_ENAupdate

idle

writereadslotset

RACH

idle

A protocol = Extended FSM

Intercom TDMA MAC

JR.S00 48

Intercom TDMA MACImplementation alternatives

• ASIC: 1V, 0.25 m CMOS process

• FPGA: 1.5 V 0.25 m CMOS low-energy FPGA

• ARM8: 1 V 25 MHz processor; n = 13,000

• Ratio: 1 - 8 - >> 400

ASIC FPGA ARM8

Power 0.26mW 2.1mW 114mWEnergy 10.2pJ/op 81.4pJ/op n*457pJ/op

JR.S00 49

The Software-Defined Radio

ReconfigurableDataPath

ReconfigurableDataPath

FPGAFPGA Embedded uPEmbedded uP

Dedicated FSMDedicated FSM

DedicatedDSP

DedicatedDSP

JR.S00 50

An Industrial Example:Basestation for Cellular Wireless

800 MHz 1900 MHz

A B A B A D B CFE

RF/IFTuner

Block-SpectrumA/D

RF/IFTuner

Block-SpectrumA/D

RF/IFTuner

Block-SpectrumA/D

AntennaSystem

multiple sectorsmulti-band

RF/IFTuner

Block-SpectrumA/D

RF/IFTuner

Block-SpectrumA/D

RF/IFTuner

Block-SpectrumA/D

AntennaSystem

multiple sectorsmulti-band

HIGH-SPEED DIGITAL BUS

Modular/Parameterizable•per carrier•per TDMA time-slot•per CDMA code

T1 / E1

...

JR.S00 51

BTS Signal Processing Platforms

Hig

h-S

peed D

igit

al B

us

A/D

Con

vers

ion

CommAgent

CommAgentD

/A C

on

vers

ion Standard A, F1

N DSP/CPUs

CPU

Standard A, F2

Standard B, F1...

...

Hardwired ASICs

JR.S00 52

Basestation of the Next Generation

WidebandRF

Data Networks

10/100or

Gbit

ATM

JR.S00 53

Coexistence of Multiple Standards In Product Supply Chain

2.5G– GPRS

– HCSD

– IS-95 MDR

– IS-95 HDR

– IS-136 HS

3G– ETSI UTRA

– ARIB W-CDMA

– TIA cdma2000

– W-TDMA (UWC)

2G– GSM

– DCS1800

– PCS1900

– IS-95

– IS-54B

– IS-136

– PDC

CIRCUIT PACKET VOICE DATANARROWBAND WIDEBAND

JR.S00 54

Wideband CDMA: MOPS? No, GOPS!

Function MIPSDigital RRC Channel 3600Searcher 2100RAKE 1050Maximal Ratio Combiner 24Channel Estimator 12AGC, AFC 10Deinterleaver 15Turbo Coder 90TOTAL 6901

Single 384 kbps ARIB W-CDMA Channel

Source: J. Kohnen et al. “Baseband Solution for WCDMA,” Proc. IEEE

Communication Theory Workshop, May 1999, Aptos, USA.

JR.S00 55

The common approach to hardware design involves:

multiple ASIC’s to support each standard.

• Hardwired implementation is not scalable or upgradeable to new standards.• This approach costs time in a time-to-market dominated world.• Creating new chipsets for every technology combination critically challenges available design resources!

HW Multistandard Solutions

DigitalHardwired

ASICIF RF

DigitalHardwired

ASICIF RF

DigitalHardwired

ASICIF RF

AnalogProgrammable Unique

Combinations

DSP

Control Processor

JR.S00 56

SW Multistandard SolutionApplying instruction-set processor architectures to

all baseband processing would be desireable...

…but is simply not an good implementation for base stations:-Unacceptably high cost per channel

-Unacceptably large power per channel

This is definitely not a viable implementation for terminals

AnalogProgrammable

IF RF

IF RF

IF RF

DSP

Control Processor

JR.S00 57

The Law of Diminishing Returns• More transistors are being thrown at improving

general-purpose CPU and DSP performance

• Fundamental bounds are being pushed– limits on instruction-level parallelism

– limits on memory system performance

• Returns per transistor are diminishing– new architectures realizing only 2-3 instructions/clock

– increasingly large caches to hide DRAM latency

JR.S00 58

Embedded Logic

Evolution– Increasing fixed-function hardwired content of systems

– Core+Logic becomes de-facto design architecture

– Move to deep sub-micron technology

» rapidly increasing product integration cycles

» increasingly constrained design resources

» sharp increases in cost of “trying out” an idea- NRE– Design methodologies optimized for random logic, homogeneous architectures,

and lower speed signal processing (I.e. control-flow dominated systems)

– Verification issues dominate design cycle time

Growing Design Cycle Times At Odds With Shrinking Product Cycle Times

JR.S00 59

FPGA the Solution?Cellular Handset Using Current FPGA

JR.S00 60

Some Interesting Observations

Don’t use more transistors to stretch general-purpose performance, whether for CPUs, DSPs, or

reconfigurable logic.

Don’t use more time to designdedicated hardwired solutions in cases where

mass customizationis what the market demands.

JR.S00 61

View The “Reconfigurability” Problem From The System Level

What Are the Application-Specific Performance Needs?

– What are the applications targeted?

– What algorithms are essential to achieving the performance goals?

– What are the functions at the heart of these algorithms?

– Which functions yield poor price-performance with general-purpose MOPS and system memory models?

– What is the embedded systems programmer’s model?

– Best performance at what cost:

» Area - instruction-level parallelism, memory hierarchy

» Power- energy requirements on a function basis

» Time- quality and ease of programming for app development

» Pain- forward opportunity vs backward compatibility

JR.S00 62

Define optimal architecture for efficient implementation

Successfully Using Reconfigurability

Application-Specific Leverage

Focus on first on applications and constituent algorithms, not the silicon architecture !Wireless Communications Transceiver Signal Processing

Minimize the hardware reconfigurability to constrained set

Maximize the software parameterizability and ease of use of the programmer’s model for flexibility

JR.S00 63

Application-Specific MOPS in Digital Communications

RF/IF

TDMA Wideband Signal

Processing Engine

CDMAWideband Signal

Processing Engine

Wideband ChannelDecoder Engine

Programmable DSP

Microprocessor

DigitalDownconversion

and Channelization

JR.S00 64

Morphics’ DRL Architecture

Heterogeneous Multiprocessing Engine Using Application-Specific Reconfigurable Logic

m

R

m

O

Clk

Enableoutput

input

input

m

R

m

O

Clk

Enableoutput

input

input

m

R

m

O

Clk

Enableoutput

input

input

Small Granularity Kernel

Large Granularity Kernel

DA

TA

FL

OW

JR.S00 65

DRL Kernels

DATA SEQUENCER

DATA MEMORY

PARAMETERIZABLECONFIGURABLE

ALU

JR.S00 66

Mapping Software to Target Architecture

DATA SEQUENCER

DATA MEMORY


ALU

DATA SEQUENCER

DATA MEMORY


ALU

JR.S00 67

Programmer’s Guide

Document provided with each processor to enable application development, configuration, and system control via host processor.

Includes complete • description of each API function

call

• system control functions

• variables, parameters

• coding examples, and performance realized (i.e. ROC curves)

/* morphics soft API usage examples: *//* search a pilot set in search set *//* maintainance mode. */

...set_ptr = get_next_set();if (no_need_to_throttle()){ search_set(set_ptr);}...void search_set(PILOT_TYPE *pilot){ /* search threshold is assumed to be set in other places */ morphics_searcher_set_win_size(pilot->win_size); morphics_searcher_set_pn(pilot->pn); morphics_searcher_set_int_len(pilot->int_len);}

/* finger re-assignments */...fing[i].pos = morphics_demod_get_fin_pos(i);...distance = calc_fing_movement(&fing[i], new_pn);...fing[i].slew = distance;fing[i].pn = new_pn;morphics_demod_set_fing_iq(i, fing[i].pn);while ( slew_not_done(morphics_demod_get_status()) ){ morphics_demod_set_fing_slew(i, fing[i].slew);}...

JR.S00 68

Key Pieces of Design Methodology

System-level Profiling• Analyze sequences of operations (arithmetic, memory access, etc)

• Analyze communication bottlenecks• Key flexible parameters (algorithm v architecture parameters)

Architecture-level Profiling• ALU/kernel definition (sequences of operators)• Memory profile• Type of configurability required for flexibility• Macro-sequencer development

Implementation• SW- programmer’s model developed at architecture specification stage• SW- API proven out via behavioral models & demonstrator hardware• VLSI-focus on regular predictable timing and routability• VLSI- embedded reconfigurability in an ASIC flow

JR.S00 69

CDMA Modem Analysis

JR.S00 70

MOPS Breakdown Analysis

JR.S00 71

Prototype Demonstrator

RISC Microprocessor

Configurable Kernel(s)

Data Router

JR.S00 72

Summary

• Configurable computing is finding its way into the embedded processor space

• Best suited (so far) for – Flexible I/O and Interface functionality

– providing task-level acceleration of “parametizable” functions

• Improvement of IP seems limited

• Software flow still subject to improvement

• Might become more interesting with the emergence of low-current devices (TFT, organic transistors, molecular computing)

DO NOT FORGET CONFIGURATION OVERHEAD

Documents

JR.S00 1 Lecture 14: (Re)configurable Computing Case Studies Prof. Jan Rabaey Computer Science 252, Spring 2000 The contributions of Andre Dehon (Caltech)