Christer Svensson, ASYNC 2004 1
Synchronous Latency Insensitive Design
Christer Svensson and Anders Edman
Linköping University
Outline
• Introduction
• Overview of wire properties
• Architectural view of future systems
• Synchronous Latency Insensitive Design
• Multiple clocks
• Conclusion
Introduction
The wire delay problem was recognized very early (Anceau 1982).
Despite the 1982 “alarm”, we still manage multigigahertz synchronous designs, but today with considerable problems.
ASIC-style designs are normally limited to 300-500 MHz clocks, with severe “timing closure” problems.
Multigigahertz designs require a very demanding full-custom design style.
Wire delay $\sim L^2/s^2$, gate delay $\sim s^{\alpha}$, where $L$ = wire length, $s$ = feature size, $\alpha = 1 \ldots 2$
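The scaling relations on this slide can be made concrete with a small sketch. It assumes the wire length L is held fixed while the feature size shrinks (a global wire across a chip of unchanged size); the function name and the example feature sizes are illustrative and not taken from the talk.

```python
import math


def scaled_delays(s_old_um, s_new_um, alpha):
    """Return (wire_factor, gate_factor) when the feature size shrinks.

    Assumptions (illustrative): wire delay ~ L^2 / s^2 with L held fixed,
    gate delay ~ s^alpha with alpha between 1 and 2.
    """
    shrink = s_new_um / s_old_um
    wire_factor = 1.0 / shrink ** 2   # wire delay grows as s shrinks
    gate_factor = shrink ** alpha     # gate delay falls as s shrinks
    return wire_factor, gate_factor


wire, gate = scaled_delays(0.18, 0.09, alpha=1.0)
print(f"wire delay x{wire:.2f}, gate delay x{gate:.2f}, ratio x{wire / gate:.2f}")
```

Halving the feature size with alpha = 1 makes a fixed-length wire four times slower while gates become twice as fast, so the wire-to-gate delay ratio grows eightfold.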
Introduction
The synchronous design paradigm is VERY well established, so we need to keep it.
(Easy to keep track of the exact timing of all events; predictable performance.)
Vast experience is used to manage ever-increasing complexity.
Critical: timing relations between clock and data.
Present solution: “flat” clock distribution (skew-free clock).
This does not solve the problem of data delays.
[Figure: balanced clk net. No skew, but wire delay still affects data.]
Overview of wire properties
Cables: twisted pair, coaxial cable
Circuit boards and chips: microstrip, coplanar waveguide
[Figure: wire cross-sections, with ground planes indicated]
We will concentrate on microstrip in the following.
Overview of wire properties
Skin effect loss
Higher frequencies: skin effect. Fields penetrate the metal to the skin depth $\delta$, and current flows within depth $\delta$.
Resistance per unit length, $r$:
$$r = r_s \sqrt{\omega}$$
Including current phase and low-frequency resistance:
$$r = r_{DC} + r_s (1 + j) \sqrt{\omega}$$
Frequency dependence (dispersion) gives rise to signal distortion.
Overview of wire properties
We discuss two wire properties in the following:
• Delay (latency)
• Capacity (maximum data rate)
Overview of wire properties
High-loss case (RC case), $r_{DC} L / Z_0 > 2 \ln 2$. The Elmore delay is a good approximation:
$$t_d = \left[ R_S (C_S + C_w + C_L) + R_w \left( \frac{C_w}{2} + C_L \right) \right] \ln 2$$
Low-loss case (LC case), $r_{DC} L / Z_0 < 2 \ln 2$
Overview of wire properties
Capacity or maximum data rate
[Figure: single-pulse response and eye diagram, with the eye opening and $S(T)$ at symbol time $T$ marked]
Eye opening $= 2S(T) - 1$, where $S(t)$ is the step response and $T$ is the symbol time.
We need a minimum opening for safe data detection, say 64%.
For long wires we may afford a simple equalizer, allowing 0%.
Overview of wire properties
Capacity or maximum data rate
RC wire, step response:
$$S(T) = 1 - e^{-2T/(R_w C_w)}$$
An eye opening of 64% yields $S(T) = 0.82$, or $T = 0.85 R_w C_w$.
Max data rate:
$$B = \frac{1}{T} = b_{RC} \frac{A}{L^2}$$
LC wire, step response (skin effect):
$$S(T) = 1 - \mathrm{erf}\left( \frac{L}{2 w Z_0} \sqrt{\frac{\rho \mu_0}{T}} \right)$$
Max data rate:
$$B = b_{LC} \frac{A}{L^2}$$
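The RC-wire numbers on this slide can be checked with a short sketch. It assumes the RC step response has the form S(T) = 1 - exp(-2T / (Rw Cw)), which is a reading of the slide's equation that reproduces its quoted T = 0.85 Rw Cw; the function names are illustrative.

```python
import math


def rc_symbol_time(eye_opening, rw_cw=1.0):
    """Symbol time T that gives the requested eye opening on an RC wire.

    Assumes S(T) = 1 - exp(-2*T / (Rw*Cw)) and eye opening = 2*S(T) - 1.
    rw_cw is the product Rw*Cw; with the default 1.0 the result is T in
    units of Rw*Cw.
    """
    if not 0.0 <= eye_opening < 1.0:
        raise ValueError("eye opening must be in [0, 1)")
    s_required = (1.0 + eye_opening) / 2.0          # invert 2*S - 1
    return -0.5 * rw_cw * math.log(1.0 - s_required)


def rc_eye_opening(symbol_time, rw_cw=1.0):
    """Eye opening 2*S(T) - 1 for the same assumed step response."""
    return 2.0 * (1.0 - math.exp(-2.0 * symbol_time / rw_cw)) - 1.0


print(rc_symbol_time(0.64))   # close to the slide's 0.85 (in units of Rw*Cw)
print(rc_symbol_time(0.0))    # shortest T if an equalizer allows 0% opening
```

Under the same assumption, allowing a 0% opening shortens the symbol time to about 0.35 Rw Cw, which is why an equalizer pays off on long wires.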
Overview of wire properties
Note the difference between latency and data rate
[Figure: signal on a wire, annotated with delay $t_d$ and symbol time $T_s$, including the case $T_s > t_d$]
Overview of wire properties
Estimated data rates (typical)
• Board wire: 10 Gb/s @ 0.5 m
• Top-metal chip wire: 10 Gb/s @ 15 mm
• Low-level metal wire: 10 Gb/s @ 1 mm
[Figure: plot of these estimates, with a “low delay region” marked]
Overview of wire properties
Low-level on-chip wires
• Wire delay limits the diameter of a synchronous block
• System partition: “Globally Asynchronous, Locally Synchronous”
Upper on-chip wires
• Low-delay, high-data-rate global communication
• Inter-block communication
Circuit board wires
• Can be used up to at least 10 Gb/s per wire
• Facilitates very high on-board bandwidths
Overview of wire properties
On-chip local
Future processes, feature size $f$ = 0.1 to 0.035 µm
Wire cross section $\sim 3f^2$; for 0.1 µm: $3 \cdot 10^{-14}$ m²
10 Gb/s up to 1.25 mm length
A 1 mm wire will have a delay of 26 ps (26% of a 10 GHz clock cycle)
We may use a 10 GHz clock frequency in a fully synchronous block of diameter 1 mm. Such a block can contain 250,000 gates.
(Compare to Sylvester and Keutzer: 50-100 kgates)
Note that block area (diameter squared) scales as $f^2$ and gate density as $f^{-2}$, so 250 kgates is maintained down to 0.035 µm (or further) at 10 GHz.
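The scaling argument on this slide can be followed numerically with a rough sketch. It assumes that the block diameter tracks the longest usable local wire, that this length scales linearly with the feature size f at a fixed data rate (from a data rate proportional to A / L^2 with A about 3f^2), and that gate density scales as f^-2. The reference values are the slide's figures for f = 0.1 µm; the function names are illustrative.

```python
import math

REF_FEATURE_UM = 0.1       # reference process from the slide
REF_MAX_LENGTH_MM = 1.25   # 10 Gb/s reach of a local wire at the reference
REF_DIAMETER_MM = 1.0      # synchronous block diameter at the reference
REF_GATES = 250_000        # gates in that block at the reference


def max_wire_length_mm(feature_um):
    """Reach at fixed data rate, assuming L_max scales linearly with f."""
    return REF_MAX_LENGTH_MM * feature_um / REF_FEATURE_UM


def gates_in_block(feature_um):
    """Gate count of a block whose diameter scales with f.

    Block area scales as f^2 and gate density as f^-2 (both assumptions
    stated above), so the product stays constant.
    """
    diameter_mm = REF_DIAMETER_MM * feature_um / REF_FEATURE_UM
    area_ratio = (diameter_mm / REF_DIAMETER_MM) ** 2
    density_ratio = (REF_FEATURE_UM / feature_um) ** 2
    return REF_GATES * area_ratio * density_ratio


for f_um in (0.1, 0.07, 0.035):
    print(f_um, max_wire_length_mm(f_um), gates_in_block(f_um))
```

The block shrinks with the process, but the number of gates it holds stays at roughly 250,000 under these assumptions.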
Overview of wire properties
On-chip global
Traditional alternative
• Automatic insertion of repeaters along long wires
• With wave pipelining, allows >10 Gb/s per wire
• Delays may exceed one clock cycle
Utilizing the upper thick metal layer
• Data rate >10 Gb/s
• Delays close to velocity-of-light, still on the order of one clock cycle
Overview of wire properties
Upper wire/driver example
• Wire: 2 µm x 4 µm copper, low loss; 12 µm spacing (crosstalk); wire length 2 cm
• Driver: inverter in 0.18 µm CMOS, $W_n$ = 88 µm, $W_p$ = 194 µm, $R_S$ = 20 Ω
[Figure: wire cross-section with dimensions 2 µm, 3.5 µm, 12 µm and 4 µm]
[Figure: step responses: actual step response, step response without overdrive, and terminated step response]
Overview of wire properties
Upper wire/driver example
Estimated performance (length 2 cm)
• Simulated velocity: $10^8$ m/s ($c_0/3$)
• Simulated maximum data rate: 10 Gb/s
• Each link is 16 bits wide; 2 links carry 320 Gb/s (bidirectionally)
• Each pair of links needs 544 µm width
Architectural view of future systems
[Figure: two chips joined by high-speed board links; each chip contains synchronous blocks joined by on-chip global links; a clock source is shown]
Christer Svensson, ASYNC 2004 19
Architectural view of future systems
Clock
Chip Chip
High speed board links
Synchronousblocks
On-chip global links
Challenges
Allow scaling of clock rates and bandwidths
Mitigate synchronization and clock skew problems
Keep an unchanged synchronous design paradigm
Architectural view of future systems
Wire delays are inevitable: we must accept latency.
The latency/delay problem should be managed at two levels
• System level (predictability)
• Implementation level (error-free)
Architectural view of future systems
System level
Partition the system into blocks of limited size.
(Preferably a natural partition: processors, memories, IP blocks, etc.)
We may define a system where only the order of events is important.
(“Classical” asynchronous; patient systems (Carloni et al. 1999))
We may then accept any latency between blocks.
We may define a system with fixed latency between blocks.
(If the fixed latency is $n$ clock cycles, the system is synchronous.)
We may then accept any latency $< nT_c$ between blocks.
Architectural view of future systems
Implementation level (we must avoid synchronization errors)
• Use synchronizers with long decision time (extra latency, nonzero error probability)
• Use stoppable clocks to synchronize communication (classical GALS, Chapiro 1984)
• Adapt clock phase to data (mesochronous clocks) (Mu 2001)
• Use FIFOs to isolate clock regions
  (FIFOs initialized with synchronizers, Chakraborty 2001)
  (FIFOs initialized via system reset, Edman 2004)
Architectural view of future systems
Implementation level, examples
[Figure: choice of clock phase (Mu 2001). Data in, data out, a metastability detector, and Rx clk.]
[Figure: FIFO solution (Chakraborty 2001, Edman 2004). “Circular” FIFO between data in and data out, with a write pointer on Tx clk and a read pointer on Rx clk.]
Synchronous Latency Insensitive Design
Problem formulation
Find a method to mitigate wire-induced latencies within a synchronous paradigm
Christer Svensson, ASYNC 2004 25
Synchronous Latency Insensitive Design
Concept
[Figure: clock-true model. Synchronous blocks on a common clk, joined by communication links modelled as fixed delays (n clk cycles).]
During synthesis we replace the fixed delays with synchronizing ports (elastic FIFOs) that absorb all link latencies and clock skews.
The final design agrees exactly with the clock-true model, independently of link delays and clock skews.
Synchronous Latency Insensitive Design
Design flow
• System partition: “natural” partition (processors, memories, IP blocks, ...) into isochronous regions
• Clock-true model & verification (NEW): insertion of dummy delays between isochronous regions, then clock-true verification
• Synthesis & back-end: replace dummy delays with elastic FIFOs
• Timing verification: considerably easier; feedback can be avoided
Synchronous Latency Insensitive Design
Implementation
Example with three blocks and two links.
Synchronizing port: fixed nominal delay preset in counters.
[Figure: three blocks on a common clk joined by two links carrying data and strobe. Each synchronizing port contains registers (reg), an input counter, an output counter with select, and the local clock.]
Synchronous Latency Insensitive Design
Implementation
System reset is used as the initialization mechanism (example n=2).
[Figure: timing diagram for Tx1, Tx2 and Rx showing reset; clk at root; data at Tx1; data at Rx, written into FIFO(2) by strobe; clk at Rx; FIFO(2), read by Rx clk after 2 counts; data in Rx.]
Note that the relation of data to clk period number is predictable.
Synchronous Latency Insensitive Design
Simulation
[Figure: simulated waveforms from 0 to 60 ns for the Tx1, Tx2 and Rx blocks on a common clk. Signals: Clk, Tx1 out, Rx1 in, Tx2 out, Rx2 in, Rx1 out, Rx2 out, and the 2-bit counters Rx1 in count, Rx1 out count, Rx2 in count and Rx2 out count. The in counts cycle 00 01 10 11 ...; the out counts cycle 10 11 00 01 ...]
Synchronous Latency Insensitive Design
Implementation example, receiver in 0.18 µm CMOS
• $f_c$ = 2.75 GHz
• Area ≈ 3500 µm²
• Data sent over 2 mm wire
• Latency 2 cycles
• Rx clk delay 1 cycle
(SPICE circuit level @ 110 °C)
[Figure: simulated waveforms: Rx input, Tx clk, Rx clk, read data, reference data]
Synchronous Latency Insensitive Design
New method to ease timing closure in large DSM chips
• Correct clock-true verification before synthesis
• Synchronous design paradigm and design tools kept
• Implementation induced data delays and clock skews mitigated
• Implementation in standard libraries
• Full clock alignment between blocks
• No synchronizers, no risk for metastability
Multiple clocks
Can a multiple-clock system be synchronous?
Example: rationally related clocks
[Figure: clock waveforms $f_{c1}$ and $f_{c2} = (2/3) f_{c1}$; a third label reads “f= Synchronous to fc1”]
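For rationally related clocks, the edges of the two clocks coincide periodically. The sketch below computes how many cycles of each clock fit into one such common period; it assumes both clocks start edge-aligned, and the function name is illustrative.

```python
from fractions import Fraction


def cycles_per_common_period(ratio):
    """For f_c2 = ratio * f_c1, return (cycles of c1, cycles of c2) between
    successive coinciding edges, assuming the clocks start edge-aligned.

    With ratio = p/q in lowest terms, q periods of c1 last exactly as long
    as p periods of c2.
    """
    ratio = Fraction(ratio)
    if ratio <= 0:
        raise ValueError("frequency ratio must be positive")
    return ratio.denominator, ratio.numerator


c1_cycles, c2_cycles = cycles_per_common_period(Fraction(2, 3))
print(c1_cycles, c2_cycles)
```

For the slide's example ratio of 2/3, the pattern repeats every 3 cycles of the faster clock and every 2 cycles of the slower one, so the relative phase of the two clocks is fully predictable.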
Multiple clocks
FIFO synchronization can be extended to rationally related clocks
(FIFO used for mitigation of delays and introduced clock jitter)
Chakraborty 2003 (our proposal 2004)
Chakraborty extended his scheme to any clock frequency relation
[Figure: FIFO write pointer and read pointer, with the accepted jitter marked]
Conclusions
• Wire delays are inevitable
• Wire delays may be limited to velocity-of-light delays
• Synchronous blocks may include 250 kgates @ 10 GHz clock
• Delays must be managed at system level and implementation level
• Our proposed scheme facilitates:
  - synchronous flow from system to implementation
  - clock-true verification before synthesis
  - mitigation of clock skews and data latencies
• “Synchronous” schemes can be extended to multiple clocks
References
F. Anceau, "A Synchronous Approach for Clocking VLSI Systems", IEEE J. Solid-State Circuits, vol. 17, pp. 51-56, 1982.
D. M. Chapiro, "Globally-Asynchronous Locally-Synchronous Systems", PhD Thesis, Stanford University, Oct. 1984.
M. Afghahi and C. Svensson, "Performance of Synchronous and Asynchronous Schemes for VLSI Systems", IEEE Trans. on Computers, vol. 41, pp. 858-872, 1992.
D. Sylvester and K. Keutzer, "Getting to the bottom of deep submicron", IEEE/ACM Int. Conference on Computer Aided Design 1998, Digest of Technical Papers, pp. 203-211, 1998.
L. P. Carloni, K. L. McMillan, A. Saldanha and A. L. Sangiovanni-Vincentelli, "A Methodology for Correct-by-Construction Latency Insensitive Design", 1999 IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, pp. 309-315, Nov. 1999.
F. Mu and C. Svensson, "Self-tested self-synchronization circuit for mesochronous clocking", IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, vol. 48, pp. 129-140, Feb. 2001.
A. Chakraborty and M. R. Greenstreet, "A Minimal Source-Synchronous Interface", 15th Annual IEEE International ASIC/SOC Conference, pp. 443-447, Sept. 2002.
C. Svensson, "Electrical Interconnects Revitalized", IEEE Trans. on Very Large Scale Integration, vol. 10, pp. 777-788, Dec. 2002.
J. Xu and W. Wolf, "A Wave-Pipelined On-chip Interconnect Structure for Network-on-Chips", Proc. of the 11th Symp. on High Performance Interconnect, pp. 10-14, 2003.
A. Chakraborty and M. R. Greenstreet, "Efficient Self-Timed Interfaces for Crossing Clock Domains", Proceedings of Ninth International Symposium on Asynchronous Circuits and Systems, pp. 78-88, May 2003.
A. Edman and C. Svensson, "Timing Closure through a Globally Synchronous, Timing Partitioned Design Methodology", accepted for presentation at the 41st Design Automation Conference, 2004.