Floorplan Assisted Data Rate Enhancement through Wire Pipelining: A Real Assessment

ISPD 2005, San Francisco, CA
May 5th, 2005

Mario R. Casu - Politecnico di Torino
and Luca Macchiarulo - University of Hawaii at Manoa
Outline

- Communication concerns at the physical layer
- Great Expectations of “Wire Pipelining”
  - No block delay
  - Block delay limitation
- Computation locality
- Adaptive Communications
- Floorplanning strategy for adaptive systems
- Experimental results
Wire pipelining - concept

- Wire delay: substantial share of overall delay
- Global wires difficult to deal with
- Global wire scaling does not follow
  - Transistors
  - Local wiring

(figure: driver-to-receiver wire with delay Del)
Wire pipelining - concept

- Introducing a latch/FF reduces the timing constraints
- Similar to classical pipelining

(figure: the wire split by a flip-flop into segments with delays Del' and Del'')
Critical Length

- Maximal length for which the wire can be driven at a given frequency
  - Optimum number of buffers
  - Optimum buffer dimensions
  - Optimum wire sizing

(figure: buffered wire at the critical length, Del = 1/f)
Wire Pipelining

- Above the critical length, clocked elements are needed (pipeline stages)

(figure: wire with Del > 1/f, split into pipeline stages)
“Wire Pipelining” techniques

Problem: maintaining functionality with a minimum loss in performance.

Solutions:
- Globally Asynchronous Locally Synchronous - GALS
- Retiming
- Regular Distributed Register (J. Cong)
- c-slowing (S. Sapatnekar)
- Latency Insensitive Protocols (L. Carloni)
LIPs: Concept

(figure: a pearl (the core logic) wrapped by a shell, connected to other blocks through relay stations)
Shell – Relay Station Interaction

(figure: shell and relay station exchanging data under the valid/stop handshake signals)
Feedback Topology

(animation, six frames: void tokens τ and data tokens 0, 1, 2 are injected into a feedback loop and circulate through its shells and relay stations; the voids never leave the loop, so valid data items stay interleaved with τ tokens)
Feedback Topology: Performance

- Void data circulate in the loops: initially as many as relay stations (r)
- “Period” of the void-stop pattern equals the number of shells (s) plus relay stations (r) in the loop
- Worst loop fixes the throughput: T = s/(s+r)
- T_a = 2/4, T_b = 2/5, hence T = 2/5
(figure: two coupled loops a and b carrying data and τ tokens; loop a has 2 shells and 2 relay stations, loop b has 2 shells and 3 relay stations)
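The loop-throughput rule above is easy to check numerically. This is an illustrative sketch (the function is mine; the loop parameters for a and b are taken from the slide):

```python
from fractions import Fraction

def loop_throughput(shells: int, relay_stations: int) -> Fraction:
    """Sustainable throughput of a feedback loop under a latency
    insensitive protocol: T = s / (s + r)."""
    return Fraction(shells, shells + relay_stations)

# Loops from the slide: a has s=2, r=2; b has s=2, r=3.
t_a = loop_throughput(2, 2)   # 1/2
t_b = loop_throughput(2, 3)   # 2/5

# The worst (slowest) loop fixes the system throughput.
system_t = min(t_a, t_b)
print(system_t)               # 2/5
```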
Classical Floorplanning

- Problem: find a placement of (soft or hard) blocks that optimally fits a floorplan
- Optimality is whitespace, overall wirelength, critical path, or a combination
Floorplanning for Throughput [ISPD2004]

- The optimal floorplan in our case is that which guarantees the maximum throughput compatible with given blocks’ dimensions
- Maximum throughput is equivalent to the worst cost-to-time ratio loop
New Heuristic Throughput Computation

Heuristic:
- Statically compute the shortest loop l(e) in which every edge appears
- For every optimization iteration:
  Cost(e) = 1/l(e) * floor(length/C_length)
  TotCost = Σ_e Cost(e)
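The per-edge cost can be sketched in a few lines (the edge list and the critical-length value below are hypothetical, just to exercise the formula):

```python
import math

# Critical length for the target frequency (hypothetical value, in mm).
C_LENGTH = 8.95

def edge_cost(length: float, shortest_loop: int) -> float:
    """Heuristic cost of one edge: the number of relay stations it
    needs, floor(length / C_length), weighted by 1/l(e), where l(e)
    is the length of the shortest loop the edge appears in."""
    return (1.0 / shortest_loop) * math.floor(length / C_LENGTH)

# Hypothetical netlist: (wire length in mm, shortest-loop length l(e)).
edges = [(17.0, 4), (6.2, 4), (21.5, 5)]

tot_cost = sum(edge_cost(length, loop) for length, loop in edges)
print(tot_cost)   # 0.65
```

In the adaptive version later in the deck, each term is additionally weighted by the channel activation ratio.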
Throughput-frequency trade-off

f = 1/L
T = 1
DR0 = 1 × 1/L = 1/L
Throughput-frequency trade-off

f = 2/L
T = 2/(2+2) = 1/2
DR = (1/2) × (2/L) = 1/L

No advantage!
Throughput-frequency trade-off

(figure: loop with branches of length L, L, and L/2)
f = 1/L
T = 1
DR0 = 1 × 1/L = 1/L
Throughput-frequency trade-off

(figure: the same loop pipelined into L/2 segments)
f = 2/L
T = 3/(3+2) = 3/5
DR = (2/L) × (3/5) = 6/(5L)
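The two configurations above can be compared by evaluating DR = T × f directly; since L is symbolic, frequencies below are expressed in units of 1/L (the helper function is mine):

```python
from fractions import Fraction

def data_rate(throughput: Fraction, frequency: Fraction) -> Fraction:
    """Real performance of a wire-pipelined system: DR = T * f."""
    return throughput * frequency

# Non-pipelined loop: f = 1/L, every cycle carries valid data (T = 1).
dr0 = data_rate(Fraction(1), Fraction(1))      # 1   (i.e. 1/L)

# Pipelined loop with 3 shells and 2 relay stations: f = 2/L, T = 3/5.
dr = data_rate(Fraction(3, 5), Fraction(2))    # 6/5 (i.e. 6/(5L))

speed_up = dr / dr0
print(speed_up)   # 6/5, a 20% data-rate gain
```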
Data Rate as the basic performance metric – Speed-up

- Wire pipelining allows increased frequency
- But it decreases the throughput according to the previous considerations
- Real performance is given by DATA RATE = Thr × f
- Advantage w.r.t. non-pipelined systems to be assessed through DR measures
- Speed-Up: SU = DR/DR0, with L/(l_m + l_max) < SU < L/l_m
- Floorplanning can be extremely beneficial if it can reduce the average branch length l_m
Block delay effect

- Blocks put a cap on the max frequency: f_max < 1/max_i(d_i)
- We can measure delay in “length”, by using a proportionality factor
- Block delay can enter the picture if signals are latched at the input or output side only

(figure: wire of length L driving a block whose delay corresponds to the equivalent length l_d)
Block delay models

We used two different models:
- Delay proportional to block edge
  - Rationale: complexity of logic is related to block size
  - Minimum constant of proportionality = 1: delay is the same needed for the fastest signal to traverse the entire block
  - Optimistic assumption
- Delay constant, related to technology and equal to 13 FO4
  - Derived from assumptions in the roadmap
  - More realistic for high-performance design
  - More pessimistic (see below)

Probably the reality is somewhere between the two cases
Speed-up with block delay

- Taking the block delay into account modifies the previous considerations:
  max_i(L_i + d_i)/(l_m + d_m + d_max) < SU < max_i(L_i + d_i)/(l_m + d_m)
- In general, much worse than the previous case
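A quick numeric check of the two bounds; the branch lengths and block delays below are made-up values just to exercise the formulas (delays are already expressed as equivalent lengths):

```python
# Hypothetical branches: (L_i, d_i) = (wire length, block delay), both in mm.
branches = [(20.0, 4.0), (12.0, 4.0), (8.0, 4.0)]

# Average branch length, average and maximum block delay.
l_m = sum(L for L, _ in branches) / len(branches)
d_m = sum(d for _, d in branches) / len(branches)
d_max = max(d for _, d in branches)
longest = max(L + d for L, d in branches)

# Speed-up bounds from the slide:
# max(L_i + d_i)/(l_m + d_m + d_max) < SU < max(L_i + d_i)/(l_m + d_m)
lower = longest / (l_m + d_m + d_max)
upper = longest / (l_m + d_m)
print(lower, upper)   # 1.125 and about 1.385
```

The block-delay terms in the denominator shrink the achievable speed-up compared to the delay-free bounds.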
Throughput driven floorplan experiments

- We used the floorplanner described in ISPD’04 to evaluate the optimal frequency (maximum DR)
- On GSRC and MCNC benchmarks with input-output information
- No block delay:
  - SU varies between 0.8% and 36%
  - Better on benchmarks with greater complexity
- Block delay:
  - Proportional to blocks’ edges: -7% to 44%
  - Equal to 13 FO4: -11% to 12%
  - The MCNC suite shows the worst behavior
- High-speed systems with highly optimized blocks lead to negligible or irrelevant SU, for a high increase of clock frequency
Space for better performance?

- Not all point-to-point connections are actually used at every clock cycle.
- Ex. CPU to Cache communication.

(figure: CPU-cache channels Addr, Data-in, Data-out during a read cycle)
(figure: the same CPU-cache channels during a write cycle)
Space for better performance?

- Unused communication channels effectively break throughput-limiting loops
- Pipelining without limitation can become possible

(animation: a stream of write cycles: Addr 1 / Data-out 1 enter the pipelined channel while a τ occupies the unused return path)
(animation continues: successive write cycles Addr 2 / Data-out 2 and Addr 3 / Data-out 3 stream through without stalling)
Adaptive Latency Insensitive Protocol

- Need a mechanism that allows blocks to discard useless “packets”: adaptive communication
- Details are out of the scope of the paper, but:
  - It is possible through a simple modification of the original protocol
  - Requires the introduction of “oracles” predicting unused inputs for each block
  - We designed a functional implementation in synthesizable VHDL
  - We proved the correctness of the implementation (absence of deadlocks and correct signal sequencing)
ALIP performance evaluation

- The adaptiveness of the approach prevents a static prediction of performance
- However, a few conclusions can be reached:
  - The performance is bounded above by the static LIP
  - Performance in long sequences of input independence is equivalent to the simplified network with the channel removed
- If the system experiences infrequent “context switching” on its channels, such that at any given time the performance is a static Th_i, the average performance can be approximated as:
  Th = Σ_i α_i · Th_i
  α_i: fraction of time with performance Th_i
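The weighted-average approximation can be evaluated directly, assuming the time fractions are known; the numbers below reproduce the read/write example of the deck (the helper function and the symbol alpha for the extraction-lost Greek letter are mine):

```python
from fractions import Fraction

def average_throughput(phases):
    """Approximate ALIP performance as Th = sum_i alpha_i * Th_i,
    where alpha_i is the fraction of time spent in a phase with
    static throughput Th_i."""
    assert sum(alpha for alpha, _ in phases) == 1
    return sum(alpha * th for alpha, th in phases)

# Half the time in streamed writes (Th = 1), half in reads (Th = 1/2),
# as in the CPU-cache example.
th = average_throughput([(Fraction(1, 2), Fraction(1)),
                         (Fraction(1, 2), Fraction(1, 2))])
print(th)   # 3/4
```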
ALIP performance evaluation - Example

(animation, eight frames of the CPU-cache example: three streamed write cycles followed by read cycles)

Ck:          1  2  3  4  5  6  7  8
Valid Data:  1  2  3  4  5  5  5  6

During the write stream a valid datum is delivered every cycle; during the reads the loop throughput drops to 1/2.

Throughput = 3/4 (Th1 = 1 for writes, Th2 = 1/2 for reads, α1 = α2 = 1/2, so Th = 1/2·1 + 1/2·1/2 = 3/4)
Adaptive communication performance evaluation - assumptions

- Assumption 1: No time lost in “context switching”
  - Unrealistic, but acceptable for burst communication, and consistent with experiments
- Assumption 2: Channels behave in a statistically independent fashion
  - Only single-clock-cycle independence is important for our purposes
- Under 1 and 2, we can compute channel activities and use them to weight the connections
Floorplanning for Throughput – adaptive case

- The optimal floorplan in our case is that which guarantees the maximum throughput compatible with given blocks’ dimensions
- Maximum throughput is equivalent to the worst cost-to-time ratio loop, weighted by the loop activation ratio
- It can be approximated by taking into account the channel activation ratio
New Heuristic Throughput Computation

Heuristic:
- Statically compute the shortest loop l(e) in which every edge appears
- For every optimization iteration:
  Cost(e) = 1/l(e) * floor(length/C_length) * α(e)
  TotCost = Σ_e Cost(e)
- The only change consists in the inclusion of the activation-ratio term α(e)
Experiments

- GSRC/MCNC benchmarks
  - Burst mode
  - Uniformly distributed phases and activation times
  - Comparison between the non-pipelined solution and the adaptively pipelined one (13 FO4 case)
  - After optimization, a VHDL netlist is automatically generated and simulated to measure the real performance of the system (as opposed to the approximation from the floorplanner)
- Results:
  - SU between 16% and 44%
  - Monotonic behavior in the legal interval
  - Limitations due mainly to FO4 delays
Experiments

- MPEG decoder
  - Strict data dependency
  - Optimization as in the other cases
  - Simulation as before and with real channel utilization profiles
- Results:
  - SU of 42% with block delay, 76% without
  - Real SU of 31% (effect of non-random correlation)
Conclusions and future work

- Pure “blind” pipelining fails to achieve the available optimization, due to neglect of common information
- Adaptive protocols can take advantage of the information available to the blocks
- We will concentrate on:
  - Automated extraction of information from the blocks
  - Power optimization (power/timing trade-offs)
  - Routing constraints effects
Thank you
Shell – Relay Station Interaction

(backup animation, four frames: data items a, b, c, d advance from the shell into the relay station under the valid/stop handshake)
Feedforward equalization

- Maximum performance can be recovered by equalizing the various paths
- Longest path computation to obtain the appropriate number of added relay stations
Critical Length and Pipelining Stages (ITRS projections)

Year   Node     Clock Frequency   Critical Length   Stages (10 mm)   Stages (34 mm)
2001   130 nm   1.684 GHz         17.11 mm          0                1
2002   115 nm   2.317 GHz         12.17 mm          0                2
2003   100 nm   3.088 GHz         8.95 mm           1                3
2004   90 nm    3.990 GHz         7.37 mm           1                4
2005   80 nm    5.173 GHz         5.28 mm           1                6
2006   70 nm    5.631 GHz         4.63 mm           2                7
2007   65 nm    6.739 GHz         4.16 mm           2                8
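The stage counts in the table follow directly from the critical length: a global wire needs floor(length / critical_length) clocked elements. A short sketch reproducing a few rows (the table data is from the slide; the function is mine):

```python
import math

def pipeline_stages(wire_length_mm: float, critical_length_mm: float) -> int:
    """Number of clocked pipeline elements a global wire needs:
    one per critical-length segment beyond the first."""
    return math.floor(wire_length_mm / critical_length_mm)

# 2003 node, critical length 8.95 mm:
print(pipeline_stages(10.0, 8.95))   # 1
print(pipeline_stages(34.0, 8.95))   # 3

# 2007 node, critical length 4.16 mm:
print(pipeline_stages(34.0, 4.16))   # 8
```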
General Performance Evaluation

- Generic netlists of blocks are feedforward connections of loops
- If feedforward connections are equalized, the “worst” loop dominates throughput
- Problem formulation: maximum cost-to-time ratio (polynomial time)
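For small graphs the worst loop can even be found by brute-force cycle enumeration; this is only an illustrative sketch on a hypothetical two-loop netlist (a real floorplanner would use a polynomial-time maximum cost-to-time ratio algorithm, as the slide notes):

```python
from fractions import Fraction

def worst_loop_throughput(succ, relay_stations):
    """Enumerate simple cycles by DFS and return the minimum s/(s+r)
    over all loops, where s counts shells (nodes) on the cycle and
    r counts relay stations on its edges."""
    worst = Fraction(1)

    def dfs(start, node, visited, r):
        nonlocal worst
        for nxt in succ.get(node, []):
            stations = relay_stations[(node, nxt)]
            if nxt == start:                          # closed a cycle
                s = len(visited)
                worst = min(worst, Fraction(s, s + r + stations))
            elif nxt not in visited and nxt > start:  # count each cycle once
                dfs(start, nxt, visited | {nxt}, r + stations)

    for v in succ:
        dfs(v, v, {v}, 0)
    return worst

# Hypothetical netlist: loop 0-1 has 2 relay stations (T = 2/4),
# loop 1-2 has 3 relay stations (T = 2/5).
succ = {0: [1], 1: [0, 2], 2: [1]}
relays = {(0, 1): 1, (1, 0): 1, (1, 2): 1, (2, 1): 2}
print(worst_loop_throughput(succ, relays))   # 2/5
```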