Amalgam: a Reconfigurable Processor for Future Fabrication Processes

Nicholas P. Carter

University of Illinois at Urbana-Champaign

Performance = f(architecture, implementation)

[Figure: eight 1-D IDCT computations laid out along a time axis, with streams of LD, ADD, MUL, and ST instructions]

Efficient Implementation

• Everything you give up in clock rate you have to make back in architectural efficiency
• Wire delay is the big limiting factor in system architectures today
  – Wires get slower relative to transistors as fab. process improves
• Programmable processors moving to deeper pipelines
  – Not good enough to just prevent wires from making reconf. logic slower

Amalgam

[Figure: Amalgam block diagram showing DRAM, a multi-banked cache, the network, four PClusters, and four RClusters]

Reconfigurable Cluster Design

• 4 Register banks
  – 8 registers/bank
• 4 Reconfigurable logic segments
  – 8 Rows x 32 LBs per segment
• Array control unit
• Network interface
• Counter-clockwise flow of computation through cluster

[Figure: reconfigurable cluster layout with the network interface, the ACU, and four segment/bank pairs]
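
The resource counts on the Reconfigurable Cluster Design slide multiply out as sketched below. This is a minimal illustration: the class and field names are made up for the example, and it assumes "8 Rows x 32 LBs per segment" means 8 rows of 32 logic blocks each.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RClusterConfig:
    """Resource counts for one reconfigurable cluster (hypothetical model)."""
    register_banks: int = 4
    registers_per_bank: int = 8
    segments: int = 4
    rows_per_segment: int = 8
    logic_blocks_per_row: int = 32  # assumed reading of "8 Rows x 32 LBs"

    @property
    def total_registers(self) -> int:
        return self.register_banks * self.registers_per_bank

    @property
    def logic_blocks_per_segment(self) -> int:
        return self.rows_per_segment * self.logic_blocks_per_row

    @property
    def total_logic_blocks(self) -> int:
        return self.segments * self.logic_blocks_per_segment


cfg = RClusterConfig()
print(cfg.total_registers)           # 32
print(cfg.logic_blocks_per_segment)  # 256
print(cfg.total_logic_blocks)        # 1024
```

Under that reading, one cluster holds 32 registers and 1024 logic blocks.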

Reconfigurable Clock Rates

[Figure: plot against Fabrication Process (nm) at 180, 130, 90, 65, 45, 32, and 22 nm; y-axis ticks from 0 to 14000 (axis label not recoverable); series: Programmable Cluster, IDCT, DNA, Rijndael]

Unpipelined Critical Path

• Latches in logic blocks only resource for pipelining
• Vertical and horizontal wires carry data between logic blocks
  – Wires have heavy loads, making them slower than their length would indicate
• Effect on clock rate varies significantly with fabrication process

[Figure: unpipelined critical path: LB/FF, HWIRE, VWIRE, Bank, VWIRE, HWIRE, LB/FF]

Supporting Pipelining

• Goal: make logic block delay the limiting factor on clock rate
• Add configurable latches at each wire intersection
  – Problem: different paths may have different latencies
• Add retiming buffers at logic block inputs/outputs
• Add network queues to reduce synchronization overhead

Pipelined Critical Path

• Delay of individual wires < logic block delay in all processes studied
• Add configurable pipeline latches at junctions between wires
• Pipeline latches also added on carry chains within rows

[Figure: pipelined critical path: LB/FF, HWIRE, VWIRE, Bank, VWIRE, HWIRE, LB/FF, with additional FF latches inserted along the wires]
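
The effect of latching wire junctions on clock period can be sketched with a toy timing model. All delays below are hypothetical values chosen for illustration, not figures from the study. The model ignores the register bank on the path as well as latch setup and clock-to-output overhead.

```python
def unpipelined_period(lb_delay, wire_delays):
    """Without wire latches, a signal must cross the logic block and every
    wire segment on the path within a single clock period."""
    return lb_delay + sum(wire_delays)


def pipelined_period(lb_delay, wire_delays):
    """With a latch at every wire junction, each segment gets its own cycle,
    so the slowest single stage sets the clock period."""
    return max([lb_delay, *wire_delays])


# Hypothetical delays in picoseconds, for illustration only.
LB_DELAY_PS = 400.0
WIRE_DELAYS_PS = [250.0, 300.0, 300.0, 250.0]  # HWIRE, VWIRE, VWIRE, HWIRE

slow = unpipelined_period(LB_DELAY_PS, WIRE_DELAYS_PS)
fast = pipelined_period(LB_DELAY_PS, WIRE_DELAYS_PS)
print(slow, fast, slow / fast)  # 1500.0 400.0 3.75
```

Because every wire delay in this example is below the logic block delay, the pipelined period equals the logic block delay, which is the stated goal of the pipelining support.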

Retiming Buffers

• 5-deep chain of latches added to each logic block input
  – Similar structure added to LB output
• Can “borrow” up to two cycles of additional delay from adjacent input
• Total pipeline register overhead = 17%

[Figure: chains of FF latches forming the retiming buffers]
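
One way to picture what the retiming buffers are for is the sketch below. When operands reach a logic block over paths with different latencies, the earlier ones are delayed until the latest arrives. The function name and interface are hypothetical, and the "borrow" mechanism from the slide is not modeled.

```python
RETIMING_DEPTH = 5  # latches in each input chain, per the slide


def balance_inputs(arrival_cycles, depth=RETIMING_DEPTH):
    """Return the extra delay (in cycles) to configure on each logic block
    input so that all operands arrive in the same cycle.

    arrival_cycles: cycle at which each operand reaches the logic block.
    Raises ValueError if any input would need more than `depth` cycles.
    """
    latest = max(arrival_cycles)
    delays = [latest - t for t in arrival_cycles]
    for i, d in enumerate(delays):
        if d > depth:
            raise ValueError(f"input {i} needs {d} cycles, buffer holds {depth}")
    return delays


print(balance_inputs([3, 6]))  # [3, 0]
```

In this simplified model, a mismatch of more than five cycles between two inputs cannot be absorbed and raises an error.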

Register Queues

[Figure: two panels, both labeled "Original Architecture" in the extracted text. First panel: WRITE R8, Val1 and WRITE R8, Val2 sent over the network between register files, with a sync. message. Second panel: WRITE R8, Val1 and WRITE R8, Val2 sent together, with EMPTY R8, over the network between register files that have register queues.]
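
The decoupling that register queues provide can be sketched as a small FIFO in front of a register. This is a toy model under stated assumptions: the class, its capacity, and the reading of EMPTY as "discard the oldest value" are all hypothetical and not taken from the Amalgam design.

```python
from collections import deque


class RegisterQueue:
    """Toy FIFO in front of one register (hypothetical model)."""

    def __init__(self, capacity=4):  # capacity is an assumed value
        self.capacity = capacity
        self.entries = deque()

    def write(self, value):
        """Handle an incoming WRITE; returns False if the sender must wait."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(value)
        return True

    def read(self):
        """Value currently visible in the register (oldest queued write)."""
        if not self.entries:
            raise LookupError("register queue is empty")
        return self.entries[0]

    def empty(self):
        """Assumed meaning of EMPTY: discard the oldest value so the next
        queued write becomes visible."""
        self.entries.popleft()


r8 = RegisterQueue()
r8.write("Val1")
r8.write("Val2")    # second write is queued; no sync. message in between
first = r8.read()   # "Val1"
r8.empty()
second = r8.read()  # "Val2"
print(first, second)
```

In this model the sender issues both writes back to back, and the receiver consumes them in order at its own pace.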

Implementing Pipelined Apps.

• Logical vs. Physical pipelining
  – Logical: Program-visible, uses array and registers
  – Physical: Only visible to ACU, uses pipeline registers on wires, retiming buffers
• Take advantage of decoupling provided by queues
• Applications use same reconfigurable logic configurations in different fab. processes
  – Only FSM in ACU changes
  – Applications to portability, managing intra-die variation

Experimental Methodology

• Programs simulated using Amalsim
  – Set each cluster’s clock rate independently
• Benchmarks: IDCT, Rijndael, DNA comparison
  – Fine-grained version of each benchmark does one computation
  – Medium-grained version performs four independent computations
• Programmable cluster clock rates based on ITRS
  – Limit stages to 7 FO4 delay, slightly more aggressive than ITRS
• Logic block latencies, wire lengths taken from circuit-level design of reconf. cluster in 180nm CMOS
  – Convert logic block delay to FO4, scale by FO4 delay of each fabrication process
  – Scale wire length based on fabrication process, simulate wire delay in SPICE
  – Pipeline such that reconf. cluster cycle time is determined by logic block delay
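
The FO4-based scaling described in the methodology can be sketched as follows. The function names are hypothetical, and the FO4 and logic block delays are placeholder numbers for illustration, not values from the study or from ITRS.

```python
STAGE_FO4 = 7  # FO4 delays per programmable pipeline stage, per the slide


def programmable_clock_ghz(fo4_ps):
    """Clock rate of a pipeline whose stages are limited to 7 FO4 delays."""
    period_ps = STAGE_FO4 * fo4_ps
    return 1000.0 / period_ps  # a 1000 ps period corresponds to 1 GHz


def scale_logic_block_delay(delay_ps_ref, fo4_ps_ref, fo4_ps_target):
    """Express a delay measured in the reference process in FO4 units, then
    rescale it by the FO4 delay of the target process."""
    delay_in_fo4 = delay_ps_ref / fo4_ps_ref
    return delay_in_fo4 * fo4_ps_target


# Hypothetical numbers, for illustration only.
FO4_REF_PS = 80.0        # assumed FO4 delay of the reference process
FO4_TARGET_PS = 20.0     # assumed FO4 delay of a later process
LB_DELAY_REF_PS = 800.0  # assumed logic block delay in the reference process

print(programmable_clock_ghz(FO4_TARGET_PS))
print(scale_logic_block_delay(LB_DELAY_REF_PS, FO4_REF_PS, FO4_TARGET_PS))  # 200.0
```

Wire delay is deliberately left out of the sketch. The slides state that wire delays were obtained by scaling wire lengths and simulating in SPICE, which a closed-form expression would not capture.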

Pipelined Clock Rates

Fine-Grained Benchmark Perf.

• Reconfigurable version maintains about 20% perf. improvement over programmable in all fab. processes
• Pipelining only small benefit
• Majority of speedup comes from reduction in memory references

Medium-Grain Benchmark Perf.

• Pipelined architecture sees 2.6x perf. improvement over programmable
• Unpipelined architecture only minor improvement over programmable
  – Greater parallelism means more ability to tolerate memory delays

Limit Studies

• Believe that memory operations are much of the benefit for small tasks
  – Study limit where memory latency = 1
  – Also test theory that streaming benchmarks have enough parallelism to cover latency
• Understand how much clock rate of reconfigurable unit affects performance
  – Model reconfigurable unit at same clock rate as programmable clusters
  – Completely unreasonable for unpipelined
  – Might be indicator of what industry could do with pipelined

Unpipelined Fine-Grained

• Removing memory latencies makes programmable performance similar to reconfigurable

• Latency of reconfig. clusters has large impact on performance -- no parallelism to cover latency

Pipelined Fine-Grained

• Results similar to unpipelined
  – Benefit still mostly from memory reduction

Unpipelined Medium-Grain

• Eliminating memory latencies really helps programmable
• Latency of reconf. logic an even bigger problem
  – Programmable clusters can exploit parallelism through pipelines

Pipelined Medium-Grain

• Impact of memory system on reconfigurable performance very small
• Less benefit from increasing reconfigurable cluster clock rate
  – With even small amounts of parallelism, throughput becomes more important than latency.
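
The point about throughput versus latency can be illustrated with a toy timing model. It is not a reproduction of the Amalsim results. The pipeline depth and stage delay are assumed values, and the two task counts mirror the fine-grained (one computation) and medium-grained (four independent computations) benchmark versions.

```python
def unpipelined_time(tasks, stages, stage_delay):
    """Each task occupies the whole path for one long cycle of
    stages * stage_delay; tasks run back to back."""
    return tasks * stages * stage_delay


def pipelined_time(tasks, stages, stage_delay):
    """First result appears after `stages` short cycles; each later
    independent task finishes one cycle after the previous one."""
    return (stages + tasks - 1) * stage_delay


STAGES = 5         # assumed pipeline depth
STAGE_DELAY = 1.0  # arbitrary time unit

for tasks in (1, 4):
    u = unpipelined_time(tasks, STAGES, STAGE_DELAY)
    p = pipelined_time(tasks, STAGES, STAGE_DELAY)
    print(tasks, u, p, u / p)
```

With a single task the two designs take the same time in this model, while four independent tasks finish 2.5x sooner when pipelined. That ratio follows from the assumed depth of 5 and is unrelated to the measured speedups reported in the slides.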

Future Directions

• ASIC-like performance with programmable systems
  – ASICs typically get 100x better performance per unit area than microprocessors
• Application-specific memory systems in a programmable chip
  – Transform memory references into communication
  – Create natural division of programs into regular and irregular blocks

Conclusion

• Reconfigurable computing must provide both speedup from custom logic and high clock rates to succeed
• Amalgam does this by limiting and tolerating wire delay at multiple levels
  – Clustered architecture
  – Segmented reconfigurable unit
  – Pipeline wire delays
• Result: 2.6x speedup over 8-way CMP in current and future fabrication processes