High-Throughput Asynchronous Pipelines for Fine-Grain Dynamic Datapaths
Montek Singh and Steven Nowick
Columbia University, New York, USA
{montek,nowick}@cs.columbia.edu
http://www.cs.columbia.edu/~montek
Intl. Symp. Adv. Res. Asynchronous Circ. Syst. (ASYNC), April 2-6, 2000, Eilat, Israel.
2
Outline
- Introduction
- Background: Williams' PS0 pipelines
- New Pipeline Designs
  - Dual-Rail: LP3/1, LP2/2 and LP2/1
  - Single-Rail: LPSR2/1
- Practical Issue: Handling slow environments
- Results and Conclusions
3
Why Dynamic Logic?
Potentially:
- Higher speed
- Smaller area
"Latch-free" pipelines: the logic gate itself provides an implicit latch
- lower latency
- shorter cycle time
- smaller area (very important in gate-level pipelining!)
Our Focus: Dynamic logic pipelines
4
How Do We Achieve High Throughput?
Introduce novel pipeline protocols:
- specifically target dynamic logic
- reduce impact of handshaking delays
- result: shorter cycle times
Pipeline at very fine granularity:
- "gate-level": each stage is a single gate deep
- result: highest throughputs possible
- latch-free datapaths especially desirable
- dynamic logic is a natural match
5
Prior Work: Asynchronous Pipelines
Sutherland (1989), Yun/Beerel/Arceo (1996)
- very elegant 2-phase control, but expensive transition latches
Day/Woods (1995), Furber/Liu (1996)
- 4-phase control: simpler latches, but complex controllers
Kol/Ginosar (1997)
- double latches: greater concurrency, but area-expensive
Molnar et al. (1997-99)
- Two designs, asp* and micropipeline, both very fast, but:
  - asp*: complex timing, cannot handle latch-free dynamic datapaths
  - micropipeline: area-expensive, cannot do logic processing at all!
Williams (1991), Martin (1997)
- dynamic stages: no explicit latches! low latency, but throughput still limited
6
Background
7
PS0 Pipelines (Williams 1986-91)
Basic Architecture:
[Figure: a pipeline stage made of a Function Block and a Completion Detector; labels: Data in, Data out, PC]
8
PS0 Function Block
Each output is produced using a dynamic gate:
[Figure: dynamic gate; labels: pull-down stack, "keeper", evaluation control, precharge control (PC), data inputs, data outputs, to completion detector]
9
Dual-Rail Completion Detector
- OR together the two rails of each bit
- Combine the results using a C-element
[Figure: OR gates on bit0, bit1, ..., bitn feed a C-element (C) whose output is Done]
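The detector described above can be sketched behaviourally in a few lines of Python. This is only an illustrative model, not the circuit from the talk: the names `CElement`, `bit_valid` and `completion_detect` are hypothetical, and each dual-rail bit is assumed to be a pair of rails that is `(0, 0)` when precharged and has exactly one rail high when it carries data.

```python
from typing import List, Tuple


class CElement:
    """Muller C-element: output rises when all inputs are 1, falls when all
    inputs are 0, and otherwise holds its previous value."""

    def __init__(self) -> None:
        self.out = 0

    def update(self, inputs: List[int]) -> int:
        if all(inputs):
            self.out = 1
        elif not any(inputs):
            self.out = 0
        return self.out


def bit_valid(rails: Tuple[int, int]) -> int:
    """OR the two rails of one dual-rail bit: 1 means the bit holds data."""
    return rails[0] | rails[1]


def completion_detect(c: CElement, bits: List[Tuple[int, int]]) -> int:
    """Combine the per-bit OR results with a C-element to produce Done."""
    return c.update([bit_valid(b) for b in bits])


c = CElement()
precharged = [(0, 0), (0, 0), (0, 0)]
partial = [(1, 0), (0, 0), (0, 1)]   # one bit has not evaluated yet
valid = [(1, 0), (0, 1), (0, 1)]
trace = [completion_detect(c, word)
         for word in (precharged, partial, valid, partial, precharged)]
print(trace)  # [0, 0, 1, 1, 0]
```

Done rises only once every bit is valid and falls only once every bit has been precharged; the C-element's hold behaviour covers the partially evaluated and partially precharged states in between.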
10
PS0 Protocol
PRECHARGE N: when N+1 completes evaluation
EVALUATE N: when N+1 completes precharging
[Timing diagram for stages N, N+1, N+2; events in one cycle:]
1. N evaluates
2. N+1 evaluates
3. N+2 evaluates; N+1 indicates "done"
4. N+2 indicates "done"
5. N+1 precharges
6. N+1 indicates "done"
Evaluate → Precharge: 3 events
Precharge → Evaluate: another 3 events
Complete cycle: 6 events
11
PS0 Performance
$T_{EVAL}$: Evaluation Time
$T_{PRECH}$: Precharge Time
$T_{DETECT}$: Completion Detection Time
Cycle Time = $3\,T_{EVAL} + T_{PRECH} + 2\,T_{DETECT}$
12
New Pipeline Designs
13
Overview of Approach
Our Goal: Shorter cycle time, without degrading latency
Our Approach: Use "Lookahead Protocols" (LP):
- main idea: anticipate critical events based on richer observation
Two new protocol optimizations:
- "Early evaluation": give a stage a head start on evaluation by observing events further down the pipeline
  (a similar idea was actually proposed by Williams in PA0, but our designs exploit it much better)
- "Early done": a stage signals "done" when it is about to precharge/evaluate
14
Dual-Rail Design #1: LP3/1
Uses "early evaluation":
- each stage now has two control inputs
- the new input comes from two stages ahead
- evaluate N as soon as N+1 starts precharging
[Figure: stages N, N+1, N+2 with Data in and Data out; each stage has control inputs PC and Eval, with Eval coming from stage N+2]
15
LP3/1 Protocol
PRECHARGE N: when N+1 completes evaluation
EVALUATE N: when N+2 completes evaluation (New! Enables "early evaluation"!)
[Timing diagram for stages N, N+1, N+2; events in one cycle:]
1. N evaluates
2. N+1 evaluates
3. N+2 evaluates; N+1 indicates "done"
4. N+2 indicates "done"
16
LP3/1: Comparison with PS0
[Figure: side-by-side timing diagrams for stages N, N+1, N+2]
- PS0: 6 events in cycle
- LP3/1: only 4 events in cycle!
17
LP3/1 Performance
Cycle Time = $3\,T_{EVAL} + T_{DETECT}$
[Figure: timing diagram with events 1-4 and the "saved path" marked]
Savings over PS0: 1 Precharge + 1 Completion Detection
18
Inside a Stage: Merging Two Controls
A NAND gate combines the two control inputs:
- Precharge when PC=1 (and Eval=0)
- Evaluate "early" when Eval=1 (or PC=0)
[Figure: dynamic gate with pull-down stack and "keeper"; a NAND gate merges PC (from stage N+1) and Eval (from stage N+2)]
Problem: "early" Eval=1 is non-persistent!
- it may get de-asserted before the stage has completed evaluation!
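The merged control can be tabulated with a small Python sketch. The slide only states that a NAND gate combines PC and Eval; the sketch assumes that Eval enters the NAND inverted, which is one way to obtain the stated behaviour, and the function names `nand` and `stage_control` are hypothetical.

```python
def nand(a: int, b: int) -> int:
    """Two-input NAND on 0/1 values."""
    return 0 if (a and b) else 1


def stage_control(pc: int, eval_: int) -> int:
    """Return 1 when the stage should evaluate, 0 when it should precharge.

    Assumption (not stated on the slide): Eval is inverted at the NAND input,
    so the output is 0 only for PC=1 and Eval=0.
    """
    return nand(pc, 1 - eval_)


# Truth table over both control inputs.
table = {(pc, ev): stage_control(pc, ev) for pc in (0, 1) for ev in (0, 1)}
for (pc, ev), ctl in table.items():
    print(f"PC={pc} Eval={ev} -> {'evaluate' if ctl else 'precharge'}")
```

The only precharge row is PC=1 with Eval=0. The row PC=1 with Eval=1 is the "early" evaluation case, and it lasts only as long as Eval stays asserted, which is the non-persistence problem noted above.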
19
LP3/1 Timing Constraints: Example
Problem: "early" Eval=1 is non-persistent!
Observation: PC=0 arrives soon after Eval=1, and is persistent
- use PC as a safe "takeover" for Eval!
Solution: no change!
Timing Constraint: PC=0 arrives before Eval=1 is de-asserted
- simple one-sided timing requirement
- other constraints as well… all easily satisfied in practice
[Figure: NAND gate merging PC (from stage N+1) and Eval (from stage N+2)]
20
Dual-Rail Design #2: LP2/2
Uses "early done":
- completion detector is now placed before the functional block
- stage indicates "done" when it is about to precharge/evaluate
[Figure: stage with an "early" Completion Detector ahead of the Function Block; labels: Data in, Data out]
21
LP2/2 Completion Detector
Modified completion detectors needed:
- Done=1 when stage starts evaluating, and inputs are valid
- Done=0 when stage starts precharging
[Figure: asymmetric C-element (C) producing Done; the OR outputs for bit0, bit1, ..., bitn are its "+" inputs, together with PC]
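A behavioural sketch of such an "early done" detector is given below. It is an assumption-laden model rather than the actual circuit: the class name `EarlyDoneDetector` is hypothetical, PC=1 is taken to mean "precharge" (as on the LP3/1 control slide), and the per-bit OR outputs are treated as "+" inputs that take part only in the rising transition of Done.

```python
from typing import List, Tuple


class EarlyDoneDetector:
    """Asymmetric C-element model for the LP2/2 "early done" signal."""

    def __init__(self) -> None:
        self.done = 0

    def update(self, pc: int, bits: List[Tuple[int, int]]) -> int:
        # A dual-rail input bit is valid when either of its rails is high.
        inputs_valid = all(r0 | r1 for r0, r1 in bits)
        if pc == 1:
            # Stage starts precharging: Done falls without waiting for inputs.
            self.done = 0
        elif inputs_valid:
            # Stage is evaluating and all inputs are valid: Done rises.
            self.done = 1
        # Otherwise (evaluating, inputs not all valid): hold the old value.
        return self.done


empty = [(0, 0), (0, 0)]
valid = [(1, 0), (0, 1)]
det = EarlyDoneDetector()
steps = [(1, empty), (0, empty), (0, valid), (0, empty), (1, valid)]
trace = [det.update(pc, bits) for pc, bits in steps]
print(trace)  # [0, 0, 1, 1, 0]
```

The asymmetry shows in the last two steps: once set, Done survives the inputs being withdrawn, and it is cleared by PC alone even while the inputs are still valid.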
22
LP2/2 Protocol
Completion detection occurs in parallel with evaluation/precharge.
[Timing diagram for stages N, N+1, N+2 with events 1-4; labels: N evaluates, N+1 evaluates, N+1 "early done", N+2 "early done"]
23
LP2/2 Performance
Cycle Time = $2\,T_{EVAL} + 2\,T_{DETECT}$
[Figure: timing diagram with events 1-4]
LP2/2 savings over PS0: 1 Evaluation + 1 Precharge
24
Dual-Rail Design #3: LP2/1
Hybrid of LP3/1 and LP2/2. Combines:
- early evaluation of LP3/1
- early done of LP2/2
Cycle Time = $2\,T_{EVAL} + T_{DETECT}$
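To compare the four dual-rail cycle-time formulas side by side, the sketch below evaluates them for given component delays. The formulas are the ones read off the performance slides (PS0, LP3/1, LP2/2, LP2/1); the function name `cycle_times` and the unit delays used in the example are illustrative assumptions, not measured values.

```python
from typing import Dict


def cycle_times(t_eval: float, t_prech: float, t_detect: float) -> Dict[str, float]:
    """Analytical cycle time of each dual-rail pipeline style."""
    return {
        "PS0":   3 * t_eval + t_prech + 2 * t_detect,
        "LP3/1": 3 * t_eval + t_detect,
        "LP2/2": 2 * t_eval + 2 * t_detect,
        "LP2/1": 2 * t_eval + t_detect,
    }


# Hypothetical unit delays: every event costs one time unit.
times = cycle_times(t_eval=1, t_prech=1, t_detect=1)
for name, t in times.items():
    print(f"{name}: {t} units")
```

With unit delays the results are simply event counts on the critical cycle: 6 for PS0, 4 for LP3/1 and LP2/2, and 3 for LP2/1.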
25
New Pipeline Designs
26
Single-Rail Design: LPSR2/1
Derivative of LP2/1, adapted to single-rail:
- bundled-data: matched delays instead of completion detectors
[Figure: pipeline stages, each with a matched "delay" element]
"Ack" to previous stages is "tapped off early":
- once in evaluate (precharge), dynamic logic is insensitive to input changes
27
Inside an LPSR2/1 Stage
PC and Eval are combined exactly as in LP3/1
"done" is generated by an asymmetric C-element:
- done=1 when stage evaluates, and data inputs are valid
- done=0 when stage precharges
[Figure: stage with NAND gate merging PC (from stage N+1) and Eval (from stage N+2), asymmetric C-element (aC, "+" input), matched delay; signals: "req" in, "req" out, "ack", data in, data out, done]
28
LPSR2/1 Protocol
Cycle Time = $2\,T_{EVAL} + T_{aC}$
$T_{aC}$: delay through the asymmetric C-element
[Timing diagram for stages N, N+1, N+2 with events 1-3; labels: N evaluates, N+1 evaluates, N+2 evaluates, N+1 indicates "done", N+2 indicates "done"]
29
Practical Issue: Handling Slow Environments
We inherit a timing assumption from Williams' PS0:
- the input (left) environment must precharge reasonably fast
Problem: if the environment is stuck in precharge, all pipelines (incl. PS0) will malfunction!
Our Solution: add a special robust controller for the 1st stage
- simply synchronizes input environment and pipeline
- delays critical events until the environment has finished precharge
Modular solution overcomes a shortcoming of Williams' PS0
No serious throughput overhead
- the real bottleneck is the slow environment!
30
Results and Conclusions
31
Results
Designed/simulated FIFOs for each pipeline style
Experimental Setup:
- design: 4-bit wide, 10-stage FIFO
- technology: 0.6 µm HP CMOS
- operating conditions: 3.3 V and 300 K
32
Comparison with Williams' PS0

Design    Style        Throughput (Mega items/sec)   Improvement over PS0
PS0       dual-rail    420                           -
LP3/1     dual-rail    590                           40%
LP2/2     dual-rail    760                           79%
LP2/1     dual-rail    860                           102%
LPSR2/1   single-rail  1208                          188%

- LP2/1: >2X faster than Williams' PS0
- LPSR2/1: 1.2 Giga items/sec
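For readers who think in cycle times rather than throughput, the reported figures can be converted with a short sketch. It assumes one data item per pipeline cycle in steady state, so cycle time is simply the reciprocal of throughput; the dictionary and function names are hypothetical, and the throughput values are the ones reported on the slide.

```python
# Reported throughputs in Mega items/sec.
throughput_mitems = {
    "PS0": 420,
    "LP3/1": 590,
    "LP2/2": 760,
    "LP2/1": 860,
    "LPSR2/1": 1208,
}


def cycle_time_ns(mega_items_per_sec: float) -> float:
    """Cycle time in nanoseconds, assuming one item per cycle."""
    return 1e3 / mega_items_per_sec


cycle_ns = {name: round(cycle_time_ns(tp), 2)
            for name, tp in throughput_mitems.items()}
for name, t in cycle_ns.items():
    print(f"{name}: {t} ns per item")
```

Under that assumption PS0 runs at roughly 2.4 ns per item and LPSR2/1 at under 1 ns per item.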
33
Comparison: LPSR2/1 vs. Molnar FIFOs
LPSR2/1 FIFO: 1.2 Giga items/sec
Adding logic processing to FIFO:
- simply fold logic into the dynamic gate: little overhead
Comparison with Molnar FIFOs:
asp* FIFO: 1.1 Giga items/sec
- more complex timing assumptions, not easily formalized
- requires explicit latches, separate from logic!
- adding logic processing between stages: significant overhead
micropipeline: 1.7 Giga items/sec
- two parallel FIFOs, each only 0.85 Giga items/sec
- very expensive transition latches
- cannot add logic processing to FIFO!
34
Practicality of Gate-Level Pipelining
When the datapath is wide:
- Can often split it into narrow "streams"
- Use a "localized" completion detector for each stream:
  - need to examine only a few bits: small fan-in
  - send "done" to only a few gates: small fan-out
- comp. det. fairly low cost!
[Figure: datapath width = 32 dual-rail bits; each completion detector has fan-in = 2, and each "done" signal has fan-out = 2]
35
Conclusions
Introduced several new dynamic pipelines:
- Use two novel protocols:
  - "early evaluation"
  - "early done"
- Especially suitable for fine-grain (gate-level) pipelining
- Very high throughputs obtained:
  - dual-rail: >2X improvement over Williams' PS0
  - single-rail: 1.2 Giga items/second in 0.6 µm CMOS
- Use easy-to-satisfy, one-sided timing constraints
- Robustly handle arbitrary-speed environments
  - overcome a major shortcoming of Williams' PS0 pipelines
Recent Improvement: Even faster single-rail pipeline (WVLSI'00)