Floorplan Assisted Data Rate Enhancement through Wire Pipelining: A Real Assessment

ISPD 2005, San Francisco, CA
May 5th, 2005

Mario R. Casu - Politecnico di Torino
and Luca Macchiarulo - University of Hawaii at Manoa
Outline

- Communication concerns at the physical layer
- Great Expectations of “Wire Pipelining”
  - No block delay
  - Block delay limitation
- Computation locality
- Adaptive Communications
- Floorplanning strategy for adaptive systems
- Experimental results
Wire pipelining - concept

- Wire delay: substantial share of overall delay
- Global wires difficult to deal with
- Global wire scaling does not follow
  - Transistors
  - Local wiring

(figure: driver-to-receiver wire with delay Del)
Wire pipelining - concept

- Introducing a latch/FF reduces the timing constraints
- Similar to classical pipelining

(figure: the wire split by a flip-flop into segments with delays Del' and Del'')
Critical Length

- Maximal length for which the wire can be driven at a given frequency
  - Optimum number of buffers
  - Optimum buffer dimensions
  - Optimum wire sizing

(figure: buffered wire at the critical length, Del = 1/f)
Wire Pipelining

- Above the critical length, clocked elements are needed (pipeline stages)

(figure: wire with Del > 1/f, split into pipeline stages)
“Wire Pipelining” techniques

Problem: maintaining functionality with a minimum loss in performance.

Solutions:
- Globally Asynchronous Locally Synchronous - GALS
- Retiming
- Regular Distributed Register (J. Cong)
- c-slowing (S. Sapatnekar)
- Latency Insensitive Protocols (L. Carloni)
LIPs: Concept

(figure: a pearl (the core logic) wrapped by a shell, connected to other blocks through relay stations)
Shell – Relay Station Interaction

(figure: shell and relay station exchanging data under the valid/stop handshake signals)
Feedback Topology

(animation, six frames: void tokens τ and data tokens 0, 1, 2 are injected into a feedback loop and circulate through its shells and relay stations; the voids never leave the loop, so valid data items stay interleaved with τ tokens)
Feedback Topology: Performance

- Void data circulate in the loops: initially as many as relay stations (r)
- “Period” of the void-stop pattern equals the number of shells (s) plus relay stations (r) in the loop
- Worst loop fixes the throughput: T = s/(s+r)
- T_a = 2/4, T_b = 2/5, hence T = 2/5
(figure: two coupled loops a and b carrying data and τ tokens; loop a has 2 shells and 2 relay stations, loop b has 2 shells and 3 relay stations)
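The loop-throughput rule above is easy to check numerically. This is an illustrative sketch (the function is mine; the loop parameters for a and b are taken from the slide):

```python
from fractions import Fraction

def loop_throughput(shells: int, relay_stations: int) -> Fraction:
    """Sustainable throughput of a feedback loop under a latency
    insensitive protocol: T = s / (s + r)."""
    return Fraction(shells, shells + relay_stations)

# Loops from the slide: a has s=2, r=2; b has s=2, r=3.
t_a = loop_throughput(2, 2)   # 1/2
t_b = loop_throughput(2, 3)   # 2/5

# The worst (slowest) loop fixes the system throughput.
system_t = min(t_a, t_b)
print(system_t)               # 2/5
```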
Classical Floorplanning

- Problem: find a placement of (soft or hard) blocks that optimally fits a floorplan
- Optimality is whitespace, overall wirelength, critical path, or a combination
Floorplanning for Throughput [ISPD2004]

- The optimal floorplan in our case is that which guarantees the maximum throughput compatible with given blocks’ dimensions
- Maximum throughput is equivalent to the worst cost-to-time ratio loop
New Heuristic Throughput Computation

Heuristic:
- Statically compute the shortest loop l(e) in which every edge appears
- For every optimization iteration:
  Cost(e) = 1/l(e) * floor(length/C_length)
  TotCost = Σ_e Cost(e)
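The per-edge cost can be sketched in a few lines (the edge list and the critical-length value below are hypothetical, just to exercise the formula):

```python
import math

# Critical length for the target frequency (hypothetical value, in mm).
C_LENGTH = 8.95

def edge_cost(length: float, shortest_loop: int) -> float:
    """Heuristic cost of one edge: the number of relay stations it
    needs, floor(length / C_length), weighted by 1/l(e), where l(e)
    is the length of the shortest loop the edge appears in."""
    return (1.0 / shortest_loop) * math.floor(length / C_LENGTH)

# Hypothetical netlist: (wire length in mm, shortest-loop length l(e)).
edges = [(17.0, 4), (6.2, 4), (21.5, 5)]

tot_cost = sum(edge_cost(length, loop) for length, loop in edges)
print(tot_cost)   # 0.65
```

In the adaptive version later in the deck, each term is additionally weighted by the channel activation ratio.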
Throughput-frequency trade-off

f = 1/L
T = 1
DR0 = 1 × 1/L = 1/L
Throughput-frequency trade-off

f = 2/L
T = 2/(2+2) = 1/2
DR = (1/2) × (2/L) = 1/L

No advantage!
Throughput-frequency trade-off

(figure: loop with branches of length L, L, and L/2)
f = 1/L
T = 1
DR0 = 1 × 1/L = 1/L
Throughput-frequency trade-off

(figure: the same loop pipelined into L/2 segments)
f = 2/L
T = 3/(3+2) = 3/5
DR = (2/L) × (3/5) = 6/(5L)
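The two configurations above can be compared by evaluating DR = T × f directly; since L is symbolic, frequencies below are expressed in units of 1/L (the helper function is mine):

```python
from fractions import Fraction

def data_rate(throughput: Fraction, frequency: Fraction) -> Fraction:
    """Real performance of a wire-pipelined system: DR = T * f."""
    return throughput * frequency

# Non-pipelined loop: f = 1/L, every cycle carries valid data (T = 1).
dr0 = data_rate(Fraction(1), Fraction(1))      # 1   (i.e. 1/L)

# Pipelined loop with 3 shells and 2 relay stations: f = 2/L, T = 3/5.
dr = data_rate(Fraction(3, 5), Fraction(2))    # 6/5 (i.e. 6/(5L))

speed_up = dr / dr0
print(speed_up)   # 6/5, a 20% data-rate gain
```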
Data Rate as the basic performance metric – Speed-up

- Wire pipelining allows increased frequency
- But it decreases the throughput according to the previous considerations
- Real performance is given by DATA RATE = Thr × f
- Advantage w.r.t. non-pipelined systems to be assessed through DR measures
- Speed-Up: SU = DR/DR0, with L/(l_m + l_max) < SU < L/l_m
- Floorplanning can be extremely beneficial if it can reduce the average branch length l_m
Block delay effect

- Blocks put a cap on the max frequency: f_max < 1/max_i(d_i)
- We can measure delay in “length”, by using a proportionality factor
- Block delay can enter the picture if signals are latched at the input or output side only

(figure: wire of length L driving a block whose delay corresponds to the equivalent length l_d)
Block delay models

We used two different models:
- Delay proportional to block edge
  - Rationale: complexity of logic is related to block size
  - Minimum constant of proportionality = 1: delay is the same needed for the fastest signal to traverse the entire block
  - Optimistic assumption
- Delay constant, related to technology and equal to 13 FO4
  - Derived from assumptions in the roadmap
  - More realistic for high-performance design
  - More pessimistic (see below)

Probably the reality is somewhere between the two cases
Speed-up with block delay

- Taking the block delay into account modifies the previous considerations:
  max_i(L_i + d_i)/(l_m + d_m + d_max) < SU < max_i(L_i + d_i)/(l_m + d_m)
- In general, much worse than the previous case
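A quick numeric check of the two bounds; the branch lengths and block delays below are made-up values just to exercise the formulas (delays are already expressed as equivalent lengths):

```python
# Hypothetical branches: (L_i, d_i) = (wire length, block delay), both in mm.
branches = [(20.0, 4.0), (12.0, 4.0), (8.0, 4.0)]

# Average branch length, average and maximum block delay.
l_m = sum(L for L, _ in branches) / len(branches)
d_m = sum(d for _, d in branches) / len(branches)
d_max = max(d for _, d in branches)
longest = max(L + d for L, d in branches)

# Speed-up bounds from the slide:
# max(L_i + d_i)/(l_m + d_m + d_max) < SU < max(L_i + d_i)/(l_m + d_m)
lower = longest / (l_m + d_m + d_max)
upper = longest / (l_m + d_m)
print(lower, upper)   # 1.125 and about 1.385
```

The block-delay terms in the denominator shrink the achievable speed-up compared to the delay-free bounds.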
Throughput driven floorplan experiments

- We used the floorplanner described in ISPD’04 to evaluate the optimal frequency (maximum DR)
- On GSRC and MCNC benchmarks with input-output information
- No block delay:
  - SU varies between 0.8% and 36%
  - Better on benchmarks with greater complexity
- Block delay:
  - Proportional to blocks’ edges: -7% to 44%
  - Equal to 13 FO4: -11% to 12%
  - The MCNC suite shows the worst behavior
- High-speed systems with highly optimized blocks lead to negligible or irrelevant SU, for a high increase of clock frequency
Space for better performance?

- Not all point-to-point connections are actually used at every clock cycle.
- Ex. CPU to Cache communication.

(figure: CPU-cache channels Addr, Data-in, Data-out during a read cycle)
(figure: the same CPU-cache channels during a write cycle)
Space for better performance?

- Unused communication channels effectively break throughput-limiting loops
- Pipelining without limitation can become possible

(animation: a stream of write cycles: Addr 1 / Data-out 1 enter the pipelined channel while a τ occupies the unused return path)
(animation continues: successive write cycles Addr 2 / Data-out 2 and Addr 3 / Data-out 3 stream through without stalling)
Adaptive Latency Insensitive Protocol

- Need a mechanism that allows blocks to discard useless “packets”: adaptive communication
- Details are out of the scope of the paper, but:
  - It is possible through a simple modification of the original protocol
  - Requires the introduction of “oracles” predicting unused inputs for each block
  - We designed a functional implementation in synthesizable VHDL
  - We proved the correctness of the implementation (absence of deadlocks and correct signal sequencing)
ALIP performance evaluation

- The adaptiveness of the approach prevents a static prediction of performance
- However, a few conclusions can be reached:
  - The performance is bounded above by the static LIP
  - Performance in long sequences of input independence is equivalent to the simplified network with the channel removed
- If the system experiences infrequent “context switching” on its channels, such that at any given time the performance is a static Th_i, the average performance can be approximated as:
  Th = Σ_i α_i · Th_i
  α_i: fraction of time with performance Th_i
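The weighted-average approximation can be evaluated directly, assuming the time fractions are known; the numbers below reproduce the read/write example of the deck (the helper function and the symbol alpha for the extraction-lost Greek letter are mine):

```python
from fractions import Fraction

def average_throughput(phases):
    """Approximate ALIP performance as Th = sum_i alpha_i * Th_i,
    where alpha_i is the fraction of time spent in a phase with
    static throughput Th_i."""
    assert sum(alpha for alpha, _ in phases) == 1
    return sum(alpha * th for alpha, th in phases)

# Half the time in streamed writes (Th = 1), half in reads (Th = 1/2),
# as in the CPU-cache example.
th = average_throughput([(Fraction(1, 2), Fraction(1)),
                         (Fraction(1, 2), Fraction(1, 2))])
print(th)   # 3/4
```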
ALIP performance evaluation - Example

(animation, eight frames of the CPU-cache example: three streamed write cycles followed by read cycles)

Ck:          1  2  3  4  5  6  7  8
Valid Data:  1  2  3  4  5  5  5  6

During the write stream a valid datum is delivered every cycle; during the reads the loop throughput drops to 1/2.

Throughput = 3/4 (Th1 = 1 for writes, Th2 = 1/2 for reads, α1 = α2 = 1/2, so Th = 1/2·1 + 1/2·1/2 = 3/4)
Adaptive communication performance evaluation - assumptions

- Assumption 1: No time lost in “context switching”
  - Unrealistic, but acceptable for burst communication, and consistent with experiments
- Assumption 2: Channels behave in a statistically independent fashion
  - Only single-clock-cycle independence is important for our purposes
- Under 1 and 2, we can compute channel activities and use them to weight the connections
Floorplanning for Throughput – adaptive case

- The optimal floorplan in our case is that which guarantees the maximum throughput compatible with given blocks’ dimensions
- Maximum throughput is equivalent to the worst cost-to-time ratio loop, weighted by the loop activation ratio
- It can be approximated by taking into account the channel activation ratio
New Heuristic Throughput Computation

Heuristic:
- Statically compute the shortest loop l(e) in which every edge appears
- For every optimization iteration:
  Cost(e) = 1/l(e) * floor(length/C_length) * α(e)
  TotCost = Σ_e Cost(e)
- The only change consists in the inclusion of the activation-ratio term α(e)
Experiments

- GSRC/MCNC benchmarks
  - Burst mode
  - Uniformly distributed phases and activation times
  - Comparison between the non-pipelined solution and the adaptively pipelined one (13 FO4 case)
  - After optimization, a VHDL netlist is automatically generated and simulated to measure the real performance of the system (as opposed to the approximation from the floorplanner)
- Results:
  - SU between 16% and 44%
  - Monotonic behavior in the legal interval
  - Limitations due mainly to FO4 delays
Experiments

- MPEG decoder
  - Strict data dependency
  - Optimization as in the other cases
  - Simulation as before and with real channel utilization profiles
- Results:
  - SU of 42% with block delay, 76% without
  - Real SU of 31% (effect of non-random correlation)
Conclusions and future work

- Pure “blind” pipelining fails to achieve the available optimization, due to neglect of common information
- Adaptive protocols can take advantage of the information available to the blocks
- We will concentrate on:
  - Automated extraction of information from the blocks
  - Power optimization (power/timing trade-offs)
  - Routing constraints effects
Thank you
Shell – Relay Station Interaction

(backup animation, four frames: data items a, b, c, d advance from the shell into the relay station under the valid/stop handshake)
Feedforward equalization

- Maximum performance can be recovered by equalizing the various paths
- Longest path computation to obtain the appropriate number of added relay stations
Critical Length and Pipelining Stages (ITRS projections)

Year   Node     Clock Frequency   Critical Length   Stages (10 mm)   Stages (34 mm)
2001   130 nm   1.684 GHz         17.11 mm          0                1
2002   115 nm   2.317 GHz         12.17 mm          0                2
2003   100 nm   3.088 GHz         8.95 mm           1                3
2004   90 nm    3.990 GHz         7.37 mm           1                4
2005   80 nm    5.173 GHz         5.28 mm           1                6
2006   70 nm    5.631 GHz         4.63 mm           2                7
2007   65 nm    6.739 GHz         4.16 mm           2                8
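The stage counts in the table follow directly from the critical length: a global wire needs floor(length / critical_length) clocked elements. A short sketch reproducing a few rows (the table data is from the slide; the function is mine):

```python
import math

def pipeline_stages(wire_length_mm: float, critical_length_mm: float) -> int:
    """Number of clocked pipeline elements a global wire needs:
    one per critical-length segment beyond the first."""
    return math.floor(wire_length_mm / critical_length_mm)

# 2003 node, critical length 8.95 mm:
print(pipeline_stages(10.0, 8.95))   # 1
print(pipeline_stages(34.0, 8.95))   # 3

# 2007 node, critical length 4.16 mm:
print(pipeline_stages(34.0, 4.16))   # 8
```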
General Performance Evaluation

- Generic netlists of blocks are feedforward connections of loops
- If feedforward connections are equalized, the “worst” loop dominates throughput
- Problem formulation: maximum cost-to-time ratio (polynomial time)
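For small graphs the worst loop can even be found by brute-force cycle enumeration; this is only an illustrative sketch on a hypothetical two-loop netlist (a real floorplanner would use a polynomial-time maximum cost-to-time ratio algorithm, as the slide notes):

```python
from fractions import Fraction

def worst_loop_throughput(succ, relay_stations):
    """Enumerate simple cycles by DFS and return the minimum s/(s+r)
    over all loops, where s counts shells (nodes) on the cycle and
    r counts relay stations on its edges."""
    worst = Fraction(1)

    def dfs(start, node, visited, r):
        nonlocal worst
        for nxt in succ.get(node, []):
            stations = relay_stations[(node, nxt)]
            if nxt == start:                          # closed a cycle
                s = len(visited)
                worst = min(worst, Fraction(s, s + r + stations))
            elif nxt not in visited and nxt > start:  # count each cycle once
                dfs(start, nxt, visited | {nxt}, r + stations)

    for v in succ:
        dfs(v, v, {v}, 0)
    return worst

# Hypothetical netlist: loop 0-1 has 2 relay stations (T = 2/4),
# loop 1-2 has 3 relay stations (T = 2/5).
succ = {0: [1], 1: [0, 2], 2: [1]}
relays = {(0, 1): 1, (1, 0): 1, (1, 2): 1, (2, 1): 2}
print(worst_loop_throughput(succ, relays))   # 2/5
```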