62
Rapid: A Configurable Architecture for Compute-Intensive Applications Carl Ebeling Dept. of Computer Science and Engineering University of Washington

Rapid: A Configurable Architecture for Compute-Intensive Applications

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Rapid: A Configurable Architecture for Compute-Intensive Applications

Rapid: A Configurable Architecturefor Compute-Intensive Applications

Carl Ebeling

Dept. of Computer Science and Engineering

University of Washington

Page 2: Rapid: A Configurable Architecture for Compute-Intensive Applications

2Reconfigurable Pipelined Datapaths

Alternatives for High-Performance Systems

✦ ASIC➭ Use application-specific architecture that matches the computation➭ Large speedup from fine-grained parallel processing➭ Smaller chip because hardware is tuned to one problem➭ Lower power since no extra work is done➭ Little or no flexibility: problem changes slightly, design a new chip

➧ No economy of scale➧ Long design cycle

✦ Digital Signal Processors➭ Optimized to signal processing operations

➧ Simple, streamlined processor architecture➧ Cheaper, lower power than GP processors

➭ Very flexible: just change the program➭ Lower performance: small scale parallelism

Page 3: Rapid: A Configurable Architecture for Compute-Intensive Applications

3Reconfigurable Pipelined Datapaths

Motivation for Rapid

✦ Many applications require programmability➭ Old standards evolve➭ Multiple standards, protocols, technology

➧ Similar but different computation➧ Reprogram for different context

➭ New algorithms give competitive advantage

✦ We need a “configurable ASIC”➭ Application-specific architecture➭ Reprogrammable

Page 4: Rapid: A Configurable Architecture for Compute-Intensive Applications

4Reconfigurable Pipelined Datapaths

What is a Configurable ASIC?

✦ Like an ASIC: Architecture tuned to application➭ High performance/low cost

✦ But configurable:➭ Datapath structure can be rewired via static configuration➭ Datapath control can be reprogrammed

✦ Rapid approach➭ Domain-specific architecture model

➧ Reconfigurable Pipelined Datapaths➭ Well-suited to many compute-intensive applications

Page 5: Rapid: A Configurable Architecture for Compute-Intensive Applications

5Reconfigurable Pipelined Datapaths

Example: Programmable Downconverter

Analog LPFA/DBandpass Filter

(FIR)

Low Pass Filter(FIR +

decimation)

Symmetric LPF(FIR +

decimation)Matched Filter Resampler

PostProcessing

Timing Recovery

Low Pass Filter(FIR +

decimation)

Symmetric LPF(FIR +

decimation)Matched Filter Resampler

NCO

F

Analog input

Configurable ASICCustom Analog

EmbeddedController

CustomMacroblock

Page 6: Rapid: A Configurable Architecture for Compute-Intensive Applications

6Reconfigurable Pipelined Datapaths

Example: Programmable Downconverter

Analog LPFA/D128-tap FIR

16-tap FIR +2X downsampling

32-tap SymmetricFIR

Matched Filter Resampler

PostProcessing

Timing Recovery

16-tap FIR +2X downsampling

32-tap SymmetricFIR

Matched Filter Resampler

NCO

F

Analog input

Configurable ASICCustom Analog

EmbeddedController

CustomMacroblock

Page 7: Rapid: A Configurable Architecture for Compute-Intensive Applications

7Reconfigurable Pipelined Datapaths

Example: Programmable Downconverter

Analog LPFA/D2 stage IIR

32-tap FIR +4x downsampling

16-tap FIR +2x downsampling

16-tap FIRAdaptive 8-

tap FIR

PostProcessing

Timing Recovery

32-tap FIR +4x downsampling

16-tap FIR +2x downsampling

16-tap FIRAdaptive 8-

tap FIR

NCO

F

Analog input

Configurable ASICCustom Analog

EmbeddedController

CustomMacroblock

Page 8: Rapid: A Configurable Architecture for Compute-Intensive Applications

8Reconfigurable Pipelined Datapaths

Generating a Rapid Array

Netlist

Rapid ArchitectureModel

Rapid-CProgram

Rapid Compiler

Netlist

Rapid-CProgram

Rapid Compiler

Netlist

Rapid-CProgram

Rapid Compiler

Netlist

Rapid-CProgram

Rapid Compiler

Rapid ArraySynthesis

Rapid ArrayNetlist

HardwareSynthesis

Rapid Array

Page 9: Rapid: A Configurable Architecture for Compute-Intensive Applications

9Reconfigurable Pipelined Datapaths

Programmable Downconverter ASIC

Rapid Array

Analog

EmbeddedDSP

NCO

ConfigurationDatapath

ControlProgram

Program

Memory forprogrammable

control

Page 10: Rapid: A Configurable Architecture for Compute-Intensive Applications

10Reconfigurable Pipelined Datapaths

Reconfiguring the Rapid Array

Rapid Array

MapPlace/Route

Configuration

Rapid Array

Configurable ASICDescription

Rapid-CProgram

Rapid Compiler

Netlist

Control Program

Page 11: Rapid: A Configurable Architecture for Compute-Intensive Applications

11Reconfigurable Pipelined Datapaths

Overview of Using Rapid

Generating a Rapid Array

HardwareSynthesis

Rapid Array

Rapid-CPrograms

Rapid Compiler

RapidArchitecture

Model

Rapid ArrayNetlist

Rapid-CProgramsRapid-C

Programs

MapPlace/Route

Configuration

New Rapid-CProgram

Rapid Compiler

Rapid ProgramNetlist

Control Program

Rapid Array

Programming a Rapid Array

Page 12: Rapid: A Configurable Architecture for Compute-Intensive Applications

12Reconfigurable Pipelined Datapaths

How Configurable is Rapid?

✦ Experimental Rapid Array:➭ 16-bit data, configurable long addition➭ 16 Multipliers, 48 ALUs, 48 Memories, Registers➭ Extensive routing resources➭ Configurable control logic➭ 100 Mhz➭ ~100 mm2 in .6u technology

➧ Layout done for major components

Page 13: Rapid: A Configurable Architecture for Compute-Intensive Applications

13Reconfigurable Pipelined Datapaths

How Configurable is Rapid?

✦ Applications mapped to Experimental Array➭ FIR filters

➧ 16 tap, 100MHz sample rate➧ 1024 tap, 1.5 MHz sample rate➧ 16-bit multiply, 32-bit accumulate➧ Decimating filters

➭ IIR filters➭ Matrix multiply

➧ Unlimited size matrices➭ 2-D DCT➭ Motion Estimation➭ 2-D Convolution➭ FFT➭ 3-D spline generation

✦ Performance:➭ > 3 BOPS (data multiplies and adds)

Page 14: Rapid: A Configurable Architecture for Compute-Intensive Applications

14Reconfigurable Pipelined Datapaths

Questions to be Answered

✦ How do you add configurability to an application-specificarchitecture?➭ Use a domain-specific meta-architecture: Rapid

➧ Fine-grained parallelism➧ Deep computational pipelines➧ High performance

✦ How do you program a Rapid array?➭ Use a programming model tuned to the meta-architecture➭ Concise descriptions of pipelined computations

✦ How do you compile a Rapid array?➭ Generating a Rapid array from multiple source programs➭ Reconfiguring Rapid from a source program

Page 15: Rapid: A Configurable Architecture for Compute-Intensive Applications

15Reconfigurable Pipelined Datapaths

RaPiD: Reconfigurable Pipelined Datapath

➭ Linear array of function units➧ Function type determined by application

➭ Function units are connected together as needed using segmented buses➭ Data enters the pipeline via input streams and exits via output streams

Memory

ALU

Multiply

ALU

Memory

OutputStreams

InputStreams

ALU

Multiply

ALU

CustomFunction

Page 16: Rapid: A Configurable Architecture for Compute-Intensive Applications

16Reconfigurable Pipelined Datapaths

Section of a Sample Rapid Datapath

Programmableregisters

Word-baseddata busses

Inputmultiplexers Tri-state

output drivers

Bus connector: Open,connected, or up tothree register delays

ALU

Multiply

ALU

Memory

Page 17: Rapid: A Configurable Architecture for Compute-Intensive Applications

17Reconfigurable Pipelined Datapaths

An Example Application: FIR Filter

✦ FIR Filter➭ Given a fixed set of coefficient weights and an input vector➭ Compute the dot product of the coefficient vector and a window of the

input vector➭ Easily mapped to a linear pipeline

……x9……x8……x7……x6……x5……x4……x3……x2……x1……x0* ** *

w0 w1 w2 w3

y6 = ΣΣΣΣ

Page 18: Rapid: A Configurable Architecture for Compute-Intensive Applications

18Reconfigurable Pipelined Datapaths

FIR Filter

7

0 X +

6

6

5

1 X +

4

5

3

2 X +

2

4

0123456789...XXXXYYYY

WWWW

XXX= ∑∑∑∑

Each number refersto the index of thestream: e.g. w0, w1, w2

7 210

Page 19: Rapid: A Configurable Architecture for Compute-Intensive Applications

19Reconfigurable Pipelined Datapaths

FIR Filter

8

0 X +

7

7

6

1 X +

5

6

3

2 X +

5

0123456789...XXXXYYYY

WWWW

XXX= ∑∑∑∑

Each number refersto the index of thestream: e.g. w0, w1, w2

7 210

4

Page 20: Rapid: A Configurable Architecture for Compute-Intensive Applications

20Reconfigurable Pipelined Datapaths

FIR Filter

9

0 X +

8

8

7

1 X +

6

7

4

2 X +

6

0123456789...XXXXYYYY

WWWW

XXX= ∑∑∑∑

Each number refersto the index of thestream: e.g. w0, w1, w2

7 210

5

Page 21: Rapid: A Configurable Architecture for Compute-Intensive Applications

21Reconfigurable Pipelined Datapaths

FIR Filter

10

0 X +

9

9

8

1 X +

7

8

5

2 X +

7

0123456789...XXXXYYYY

WWWW

XXX= ∑∑∑∑

Each number refersto the index of thestream: e.g. w0, w1, w2

7 210

6

Page 22: Rapid: A Configurable Architecture for Compute-Intensive Applications

22Reconfigurable Pipelined Datapaths

Configuring the FIR Filter✦ Systolic pipeline implements overlapped vector products

✦ The array is configured to implement this computational pipeline➭ Stages of the FIR pipeline are configured onto the datapath

MULT

ALU

ALU

ALU

ALU

MULT

MULT

MULT

InputStream

OutputStream

Multiply

ALU

Multiply

ALU

Multiply

ALU

Multiply

ALU

Page 23: Rapid: A Configurable Architecture for Compute-Intensive Applications

23Reconfigurable Pipelined Datapaths

Configuring Different Filters✦ Time-multiplexing: trade number of taps vs. sampling rate

➭ M taps assigned per multiplier➭ Requires memory in datapath

✦ Symmetric filter coefficients➭ Doubles the number of taps that can be implemented➭ Requires increased control

✦ Followed by downsampling by M➭ Number of taps increased by factor of M

OutputStreams

InputStreams

Memory

ALU

Multiply

ALU

Multiply

ALU

Multiply

ALU

MultiplyMemory Memory Memory

Page 24: Rapid: A Configurable Architecture for Compute-Intensive Applications

24Reconfigurable Pipelined Datapaths

Dynamic Control

✦ A completely static pipeline is very restricted➭ Virtually all applications need to change computation dynamically➭ Dynamic changes are relatively small

➧ Initialize registers, e.g. filter coefficients➧ Memory read/write➧ Stream read/write➧ Data dependent operations, e.g. MIN/MAX➧ Time-multiplexing stage computation

✦ Solution: Make some of the configuration signals dynamic

✦ Problem: Where do these signals come from?

Page 25: Rapid: A Configurable Architecture for Compute-Intensive Applications

25Reconfigurable Pipelined Datapaths

✦ Per-stage programmed control

➭ Similar to programmable systolic array/iWARP➭ Very expensive, requires synchronization

Controller Controller Controller Controller Controller Controller

✦ VLIW/Microprogrammed control

➭ Very expensive, high instruction bandwidth

Controller

Alternatives for Dynamic Control

Page 26: Rapid: A Configurable Architecture for Compute-Intensive Applications

26Reconfigurable Pipelined Datapaths

The RaPiD Approach

✦ Factor out the computation that does not change➭ Statically configure the underlying datapath➭ Datapath is temporarily hardwired

✦ Remaining control is dynamic➭ Configure an instruction set for the application➭ A programmed controller generates instructions➭ Instruction is pipelined alongside the datapath➭ Each pipeline stage “decodes” the instruction➭ Instruction size is small: typically <16 bits

Controller Decode Decode Decode Decode Decode Decode

Page 27: Rapid: A Configurable Architecture for Compute-Intensive Applications

27Reconfigurable Pipelined Datapaths

0

FIR Filter Control

0

X +

1 2

X + X +

0123456789...

001000...

210

DATA

CONTROL

1 0

Control stream containdynamic control bits

Page 28: Rapid: A Configurable Architecture for Compute-Intensive Applications

28Reconfigurable Pipelined Datapaths

0

FIR Filter Control

0 X +

0 1

X + X +

0123456789...

001000...

210

DATA

CONTROL

10

20

Page 29: Rapid: A Configurable Architecture for Compute-Intensive Applications

29Reconfigurable Pipelined Datapaths

0

FIR Filter Control

0 X +

0 1

X + X +

0123456789...

001000...

210

DATA

CONTROL

10

2

1

01

0

Page 30: Rapid: A Configurable Architecture for Compute-Intensive Applications

30Reconfigurable Pipelined Datapaths

0

FIR Filter Control

0 X +

0 0

X + X +

0123456789...

001000...

210

DATA

CONTROL

0

1

1

12

0

2

2

1 0

Page 31: Rapid: A Configurable Architecture for Compute-Intensive Applications

31Reconfigurable Pipelined Datapaths

Summary of Control

✦ Hard control:➭ Configures underlying pipelined datapath➭ Changed only when application changes➭ Like FPGA configuration

✦ Soft control:➭ Signals that can change every clock cycle

➧ ALU function, multiplexer inputs, etc.➭ Configurably connected to instruction decoder

✦ Only part of soft control may used by an application

� Static controlSoft control that is constant

� Dynamic controlSoft control generated by instruction

Page 32: Rapid: A Configurable Architecture for Compute-Intensive Applications

32Reconfigurable Pipelined Datapaths

Soft Control Signals

✦ Static control:➭ Connect control signal to 0/1

✦ Dynamic control:➭ Connect control signal to instruction

0

MUXDatabusses

SoftControlSignals

Instruction

Page 33: Rapid: A Configurable Architecture for Compute-Intensive Applications

33Reconfigurable Pipelined Datapaths

Rapid Programming Model

✦ Pipelined computations are complicated➭ Use a Broadcast model

➭ Assumes data can propagate entire length of datapath in one cycle

➭ A “datapath instruction”:

➧ specifies computation executed by entire datapath in one cycle

➧ execution proceeds sequentially in one direction

MULT

ALU

ALU

ALU

ALU

MULT

MULT

MULT

Page 34: Rapid: A Configurable Architecture for Compute-Intensive Applications

34Reconfigurable Pipelined Datapaths

for (s=0; s < 3; s++) { if (s==0) in[0] = streamX; if (s==0) out = in[0] * W[0]; else out = out + in[s] * W[s]; if (s==3) streamY = out; }

Rapid-C Datapath Instruction

✦ Described using a loop

MULT

ALU

ALU

ALU

MULT

MULT

MULT

In[0] In[1] In[2] In[3]

out out out out

streamX

streamY

S: 0 S: 1 S: 2 S: 3

W[0] W[1] W[2] W[3]

Page 35: Rapid: A Configurable Architecture for Compute-Intensive Applications

35Reconfigurable Pipelined Datapaths

Pipelining RaPiD

✦ The Rapid datapath is pipelined/retimed by the compiler➭ One “datapath instruction” is really executed over multiple cycles

➭ Programmer always thinks in broadcast time

MULT

ALU

ALU

ALU

MULT

MULT

MULT

In[0] In[1] In[2] In[3]

streamX

streamY

Pipelineregisters

Page 36: Rapid: A Configurable Architecture for Compute-Intensive Applications

36Reconfigurable Pipelined Datapaths

Instructions are Broadcast

0

X +

1 2

X + X +

1

✦ Datapath instructions are executed in one cycle➭ Control is broadcast➭ Compiler pipelines control along with data

Page 37: Rapid: A Configurable Architecture for Compute-Intensive Applications

37Reconfigurable Pipelined Datapaths

X + X + X +

0

0 21

Instructions are Broadcast

✦ Datapath instructions are executed in one cycle➭ Control is broadcast➭ Compiler pipelines control along with data

Page 38: Rapid: A Configurable Architecture for Compute-Intensive Applications

38Reconfigurable Pipelined Datapaths

Data Communication in Rapid

✦ Predominately nearest-neighbor communication➭ To the next stage (broadcast within an instruction)

➭ To the next instruction (within a stage)

➭ Across N instructions (through local memory)

Datapath stagesTime

Page 39: Rapid: A Configurable Architecture for Compute-Intensive Applications

39Reconfigurable Pipelined Datapaths

Rapid-C Process

✦ Each nested loop is a loop process➭ The inner loop is a datapath instruction

for (k=0; k < NX-NW; k++) for (s=0; s < NW-1; s++) { if (s==0) in[0] = streamX; if (s==0) out = in[0] * W[0]; else out = out + in[s] * W[s]; if (s==NW-1) streamY = out; }

Process

Datapathinstruction

Page 40: Rapid: A Configurable Architecture for Compute-Intensive Applications

40Reconfigurable Pipelined Datapaths

Rapid-C Programs

✦ Programs are comprised of several loop processes

✦ Processes can run sequentially➭ One process starts when the previous process completes

✦ Processes can run concurrently➭ Computations overlap➭ Processes run in lock-step

➧ One inner loop execution per cycle➭ Concurrent processes are synchronized via signal/wait

➧ One process waits for the other to send a signal

Page 41: Rapid: A Configurable Architecture for Compute-Intensive Applications

41Reconfigurable Pipelined Datapaths

FIR Filter: Three Sequential Loop Processes

for (i=0; i < NW; i++) for (s=0; s < NW; s++) { if (s==0) in[0] = streamW; if (i==NW-1) W[s] = in[s]; }

for (j=0; j < NW-1; j++) for (s=0; s < NW; s++) { if (s==0) in[0] = streamX; }

for (k=0; k < NX-NW; k++) for (s=0; s < NW-1; s++) { if (s==0) in[0] = streamX; if (s==0) out = in[0] * W[0]; else out = out + in[s] * W[s]; if (s==NW-1) streamY = out; }

Load Weights

Compute Results

Fill Pipeline

Pipe in[NW];Reg W[NW];

Page 42: Rapid: A Configurable Architecture for Compute-Intensive Applications

42Reconfigurable Pipelined Datapaths

FIR Filter Using Concurrent Loop Processesfor (i=0; i < NW; i++) for (s=0; s < NW; s++) { if (s==0) in[0] = streamW; if (i==NW-1) W[s] = in[s]; }

for (j=0; j < NX; j++) if (j==NW) signal(goAhead); for (s=0; s < NW; s++) { if (s==0) in[0] = streamX; }

wait(goAhead); // Wait for enough inputsfor (k=0; k < NX-NW; k++) for (s=0; s < NW-1; s++) { if (s==0) out = in[0] * W[0]; else out = out + in[s] * W[s]; if (s==NW-1) streamY = out; }

Load Weights

ComputeResults

Read inputvalues

PAR {

}

Page 43: Rapid: A Configurable Architecture for Compute-Intensive Applications

43Reconfigurable Pipelined Datapaths

Rapid-C Programs

✦ Processes are composed to make other processes➭ Sequential composition➭ Parallel composition

✦ The process structure is represented by a Control Tree➭ Each node is a SEQ or PAR

SEQ

PAR

for (i=0; i < NW; i++) for (s=0; s < NW; s++) { if (s==0) in[0] = streamW; if (i==NW-1) W[s] = in[s]; }

for (j=0; j < NX; j++) if (j==NW)

signal(goAhead); for (s=0; s < NW; s++) { if (s==0) in[0] =streamX; }

wait(goAhead);for (k=0; k < NX-NW; k++) for (s=0; s < NW-1; s++) { if (s==0) out = in[0] * W[0]; else out = out + in[s] * W[s]; if (s==NW-1) streamY = out; }

Page 44: Rapid: A Configurable Architecture for Compute-Intensive Applications

44Reconfigurable Pipelined Datapaths

Executing a Rapid Control Tree

Instruction GeneratorMicroprogram

Register File

PC

DEC

IG

IG

IG

IG

Synchronizer

InstructionFIFO

InstructionFIFOs

InstructionMerge

ToDatapath

Page 45: Rapid: A Configurable Architecture for Compute-Intensive Applications

45Reconfigurable Pipelined Datapaths

Matrix Multiply

00102030

01112131

02122232

03132333

.. .. ....

00102030

01112131

02122232

001020..

011121..

021222..

X =

A B C

00102030

X +

00

01112131

X +

02122232

X +

Page 46: Rapid: A Configurable Architecture for Compute-Intensive Applications

46Reconfigurable Pipelined Datapaths

Matrix Multiply

00102030

01112131

02122232

03132333

.. .. ....

00102030

01112131

02122232

001020..

011121..

021222..

X =

A B C

00102030

X + 00

01

01112131

X + 01

02122232

X + 02

Page 47: Rapid: A Configurable Architecture for Compute-Intensive Applications

47Reconfigurable Pipelined Datapaths

Matrix Multiply

00102030

01112131

02122232

03132333

.. .. ....

00102030

01112131

02122232

001020..

011121..

021222..

X =

A B C

00102030

X + 00

02

01112131

X + 01

02122232

X + 02

Page 48: Rapid: A Configurable Architecture for Compute-Intensive Applications

48Reconfigurable Pipelined Datapaths

Matrix Multiply

00102030

01112131

02122232

03132333

.. .. ....

00102030

01112131

02122232

001020..

011121..

021222..

X =

A B C

00102030

X + 00

03

01112131

X + 01

02122232

X + 02

Page 49: Rapid: A Configurable Architecture for Compute-Intensive Applications

49Reconfigurable Pipelined Datapaths

Matrix Multiply

00102030

01112131

02122232

03132333

.. .. ....

00102030

01112131

02122232

001020..

011121..

021222..

X =

A B C

00102030

X +

00

10

01112131

X +

02122232

X +

01 02

Page 50: Rapid: A Configurable Architecture for Compute-Intensive Applications

50Reconfigurable Pipelined Datapaths

Matrix Multiply

00102030

01112131

02122232

03132333

.. .. ....

00102030

01112131

02122232

001020..

011121..

021222..

X =

A B C

00102030

X +

11

01112131

X +

02122232

X +

00 01

10 11 12

Page 51: Rapid: A Configurable Architecture for Compute-Intensive Applications

51Reconfigurable Pipelined Datapaths

Matrix Multiply Program

✦ Three process➭ Load B matrix

➭ Compute C values

➭ Output C values

for e=0 to M-1

for f=0 to N-1

for s=0 to N-1

...loading B matrix...

PAR {

for i=0 to L-1

for k=0 to M-1

if (k==M-1) signal(go);

for s=0 to N-1

...calculation…

for g=0 to L-1

wait (go);

for h=0 to N-1

...retire results…

}

Page 52: Rapid: A Configurable Architecture for Compute-Intensive Applications

52Reconfigurable Pipelined Datapaths

Blocked Matrix Multiply

+ +

X

+

X

...

. . .

...

. . .

...

. . .

X =

A B C

X

Page 53: Rapid: A Configurable Architecture for Compute-Intensive Applications

53Reconfigurable Pipelined Datapaths

Program for Blocked Matrix Multiply

✦ Load initial B submatrix

✦ Concurrently➭ Load next B submatrix➭ Compute/Accumulate C submatrix➭ Output completed C submatrix

✦ Concurrent processes synchronize➭ Swap double buffered B memories➭ Output C submatrix when completed

Page 54: Rapid: A Configurable Architecture for Compute-Intensive Applications

54Reconfigurable Pipelined Datapaths

Compiling Rapid-C

✦ Balance between programmer and compiler➭ Programmer

➧ Specifies basic computation➧ Specifies parallelism using RaPiD model of computation➧ Partitions/schedules sub-computations➧ Optimizes data movement

➭ Compiler➧ Translates inner loops into a datapath circuit➧ Pipelines/retimes computation to meet performance goal➧ Extracts dynamic control information

conditionals that use run-time information➧ Constructs instruction format and decoding logic➧ Builds datapath control program

Page 55: Rapid: A Configurable Architecture for Compute-Intensive Applications

55Reconfigurable Pipelined Datapaths

IG

IG

IG

IG

SynchronIzer

InstructionFIFO

InstructionFIFOs

ToDatapath

Compiling Rapid-C

0

X +

1 2

X + X +

1

for (i=0; i < NW; i++) for (s=0; s < NW; s++) { if (s==0) in[0] = streamW; if (i==NW-1) W[s] = in[s]; }

for (j=0; j < NX; j++) if (j==NW) signal(goAhead); for (s=0; s < NW; s++) { if (s==0) in[0] = streamX; }

wait(goAhead); // Wait for enough inputsfor (k=0; k < NX-NW; k++) for (s=0; s < NW-1; s++) { if (s==0) out = in[0] * W[0]; else out = out + in[s] * W[s]; if (s==NW-1) streamY = out; }

PAR {

}

InstructionMerge

Page 56: Rapid: A Configurable Architecture for Compute-Intensive Applications

56Reconfigurable Pipelined Datapaths

Synthesizing a Rapid Array

✦ Generate Rapid datapath that “covers” all spec netlists➭ Union of all function units➭ Sufficient routing

➧ Busses with different bit widths➭ Configurable soft control signals➭ Wide enough instruction➭ Enough instruction generators

➧ Max parallel processes

✦ Provide some “elbow-room”

Page 57: Rapid: A Configurable Architecture for Compute-Intensive Applications

57Reconfigurable Pipelined Datapaths

Current Status

✦ Rapid meta-architecture well-understood

✦ Rapid-C programming language➭ Programs for many applications

✦ Rapid-C compiler➭ Verilog structural netlist➭ Program for datapath controller

✦ Rapid Simulator➭ Executes Verilog netlist and control program➭ Visualization of datapath execution

✦ Place & Route➭ Uses an instance of a Rapid datapath➭ Places and routes datapath and control➭ Pipelines/Retimes to target clock cycle

Page 58: Rapid: A Configurable Architecture for Compute-Intensive Applications

58Reconfigurable Pipelined Datapaths

Future Work

✦ Synthesis of Rapid array netlist➭ What is the best way to cover a set of netlists?➭ How to provide additional elbow-room?

✦ Synthesis of Rapid layout➭ How much can industry tools do?➭ Datapath generator

➧ Use standard blocks for functional units Library cells Synthesized cells

➧ Generate segmented bus structure from template➧ Generate control structure from template

➭ Use standard block for datapath controller➧ Parameterized

Page 59: Rapid: A Configurable Architecture for Compute-Intensive Applications

59Reconfigurable Pipelined Datapaths

Future Work (cont)

✦ Improvements➭ Language features

➧ Custom functions Allow arbitrary operations, synthesize the hardware

➧ Escapes➧ Pragmas➧ Spatially sequential processes

➭ Compiler features➧ Automatic time-multiplexing➧ Optimized control

✦ Multiple configuration contexts➭ e.g. switch between data compression/decompression➭ Use program scope to determine context

✦ System interface issues

Page 60: Rapid: A Configurable Architecture for Compute-Intensive Applications

60Reconfigurable Pipelined Datapaths

Using Multiple Contexts

✦ Sometimes computation phases are quite different➭ Produces lots of dynamic control➭ e.g. switch between motion estimation and 2-D DCT

✦ Solution: provide fast “context switch”➭ Multiple configurations (hard and soft control)➭ Datapath control selects the context➭ Rapid context switch➭ Context switch may be pipelined

✦ Programming multiple contexts➭ Scope in program determines context➭ Compiler compiles different scopes independently➭ Instruction now contains context pointer

Page 61: Rapid: A Configurable Architecture for Compute-Intensive Applications

61Reconfigurable Pipelined Datapaths

The Rapid Research Team

✦ Students➭ Darren Cronquist - architecture and compiler➭ Paul Franklin - architecture and simulator➭ Stefan Berg - simulation, memory interface➭ Miguel Figueroa - applications

✦ Staff➭ Larry McMurchie - place/route➭ Chris Fisher - circuit design and layout

✦ Funding: DARPA and NSF

Page 62: Rapid: A Configurable Architecture for Compute-Intensive Applications

62Reconfigurable Pipelined Datapaths

Overview of Using Rapid

Generating a Rapid Array

HardwareSynthesis

Rapid Array

Rapid-CPrograms

Rapid Compiler

RapidArchitecture

Model

Rapid ArrayNetlist

Rapid-CProgramsRapid-C

Programs

MapPlace/Route

Configuration

New Rapid-CProgram

Rapid Compiler

Rapid ProgramNetlist

Control Program

Rapid Array

Programming a Rapid Array