Prediction of High-Performance On-Chip Global Interconnection
Yulei Zhang (1), Xiang Hu (1), Alina Deutsch (2), A. Ege Engin (3), James F. Buckwalter (1), and Chung-Kuan Cheng (1)
(1) Dept. of ECE, UC San Diego, La Jolla, CA
(2) IBM T. J. Watson Research Center, Yorktown Heights, NY
(3) Dept. of ECE, San Diego State Univ., San Diego, CA
Outline
Introduction
  Technology trend
  Current approaches
On-Chip Global Interconnection
  Overview: structures, tradeoffs
  Interconnect schemes
  Global wire modeling
  Performance analysis
Design Methodologies for T-line schemes
Prediction of Performance Metrics
  Experimental settings
  Performance metrics comparison and scaling trend
    Latency
    Energy per bit
    Throughput
  Signal Integrity
Conclusion
Introduction – Performance Impact
Interconnect delay determines the system performance [ITRS08]
  542ps for a 1mm minimum-pitch Cu global wire w/o repeaters @ 45nm
  ~150ps for 10 levels of FO4 delay @ 45nm
[Ho2001] “Future of Wire”
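The two 45nm figures above are easier to compare on a common scale. The following is a minimal sketch, assuming the ~150ps figure means ten equal FO4 delays; the variable names are illustrative and not from the slides.

```python
# Numbers quoted on the slide for the 45nm node.
wire_delay_ps = 542.0   # 1mm minimum-pitch Cu global wire, no repeaters
ten_fo4_ps = 150.0      # ~10 levels of FO4 delay

# Assumption: one FO4 delay is a tenth of the quoted ten-level figure.
fo4_ps = ten_fo4_ps / 10
wire_delay_in_fo4 = wire_delay_ps / fo4_ps

print(f"1mm unrepeated wire is about {wire_delay_in_fo4:.1f} FO4 delays")
```

Under that assumption, a single millimeter of unrepeated global wire costs several times more than a ten-gate logic path, which is the motivation for the schemes compared in the rest of the deck.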
Introduction – Power Dissipation
Interconnects consume a significant portion of power
  1-2 orders of magnitude larger than gates
  Half of the dynamic power is dissipated in the repeaters inserted to minimize latency [Zhang07]
  Wires consume 50% of total dynamic power for a 0.13um microprocessor [Magen04]
  About 1/3 is burned on the global wires.
Introduction – Different Approaches and Our Contributions
Different approaches
  Repeater insertion approach
    Pros: High throughput density.
    Cons: Overhead in terms of power consumption and wiring complexity.
  T-line approach [Zhang09]
    Pros: Low latency.
    Cons: Low throughput density due to low bandwidth and large wire dimensions.
  Equalized T-line approach [Zhang08]
    Pros: Low power, low noise, higher throughput than single-ended.
    Cons: The area overhead brought by passive components.
We explore different global interconnection structures and compare their performance metrics across multiple technology nodes.
Contributions
  A simple linear model
  A general design framework
  A complete prediction and comparison
Organization of On-Chip Global Interconnections
Multi-Dimensional Design Consideration
Preliminary analysis results assuming 65nm CMOS process.
Application-oriented choice
  Low latency: T-TL or UT-TL -> single-ended T-lines
  High throughput: R-RC
  Low power: PE-TL or UE-TL -> differential T-lines
  Low noise: PE-TL or UE-TL -> differential T-lines
  Low area/cost: R-RC
For each architecture, the larger the area its pentagon covers, the better its overall performance.
On-Chip Global Interconnect Schemes (1)
[Figures: Repeated RC wires (R-RC); Un-Terminated and Terminated T-Line (UT-TL and T-TL)]
R-RC structure
  Repeater size/Length of segments
  Adopt previous design methodology [Zhang07]
UT-TL structure
  Full swing at wire-end
  Tapered inverter chain as TX
T-TL structure
  Optimize eye-height at wire-end
  Non-tapered inverter chain as TX
On-Chip Global Interconnect Schemes (2)
[Figure: Un-Equalized and Passive-Equalized T-Line (UE-TL and PE-TL)]
Driver side: Tapered differential driver
Receiver side: Termination resistance, Sense-Amplifier (SA) + inverter chain
Passive equalizer: parallel RC network
Design constraint: enough eye-opening (50mV) needed at the wire-end
Global Wire Modeling – Single-Ended & Differential On-Chip T-lines
Determine the bit rate
  Smallest wire dimensions that satisfy the eye constraint
  Notice PE-TL needs a narrower wire -> equalization helps to increase density.
Orthogonal layers replaced by ground planes -> 2D cap extraction, accurate when loading density is high.
Top-layer thick wires used -> wire dimensions stay the same as technology scales.
  LC-mode behavior dominant
Global Wire Modeling – RC wires and T-lines
RC wire modeling
  Distributed Π model composed of wire resistance and capacitance
  Closed-form equations [Sim03] to calculate 2D wire capacitance
T-line 2D-R(f)L(f)C parameter extraction
  [Figures: 2D-C Extraction Template; 2D-R(f)L(f) Extraction Template]
T-line modeling
  R(f)L(f)C tabular model -> transient simulation to estimate eye-height.
  Synthesized compact circuit model [Kopcsay02] -> study signal integrity issues.
Performance Analysis – Definitions
Normalized delay (unit: ps/mm)
  Propagation delay includes wire delay and gate delay.
Normalized energy per bit (unit: pJ/m)
  Bit rate is assumed to be the inverse of propagation delay for RC wires
Normalized throughput (unit: Gbps/um)
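The three metrics can be sketched as simple normalizations. The slide gives only the names and units, so the exact forms below (delay over length, energy per bit over length, bit rate over wire pitch) are assumptions, as are all function names and the example inputs.

```python
def rc_bit_rate_gbps(delay_ps):
    """Bit rate of an RC wire, taken as the inverse of its propagation delay."""
    return 1e3 / delay_ps  # 1/ps equals 1000 Gbps

def normalized_delay_ps_per_mm(delay_ps, length_mm):
    """Propagation delay (wire plus gate delay) per unit wire length."""
    return delay_ps / length_mm

def normalized_energy_pj_per_m(power_mw, bit_rate_gbps, length_mm):
    """Energy per bit per unit length; mW divided by Gbps gives pJ per bit."""
    energy_per_bit_pj = power_mw / bit_rate_gbps
    return energy_per_bit_pj / (length_mm * 1e-3)

def normalized_throughput_gbps_per_um(bit_rate_gbps, pitch_um):
    """Bit rate per unit of wire pitch (assumed normalization)."""
    return bit_rate_gbps / pitch_um

# Hypothetical example inputs, not taken from the slides.
delay_ps, length_mm, power_mw, pitch_um = 250.0, 5.0, 2.0, 0.5
bit_rate = rc_bit_rate_gbps(delay_ps)
print(normalized_delay_ps_per_mm(delay_ps, length_mm), "ps/mm")
print(normalized_energy_pj_per_m(power_mw, bit_rate, length_mm), "pJ/m")
print(normalized_throughput_gbps_per_um(bit_rate, pitch_um), "Gbps/um")
```

Normalizing by length and pitch is what lets structures with different wire dimensions be compared on the same axes.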
Performance Analysis – Latency
Variables: technology-defined parameters
  Supply voltage: $V_{dd}$ (unit: V)
  Dielectric constant: $\varepsilon_r$
  Min-sized inverter FO4 delay (unit: ps)
R-RC structure (min-d)
  Scaling relations: $c_{nmos} \propto 1/S$, $r_w \propto S^2$, $c_w \propto \varepsilon_r$
  [...] is roughly constant
  FO4 delay scales w/ scaling factor $S$ as $1/S$
  Increasing w/ technology scaling!
T-line structures
  Sum of wire delay and TX delay
    Wire delay
    TX delay improved w/ FO4 delay
  Decreasing w/ technology scaling!
Performance Analysis – Energy per Bit
Same variables defined before
R-RC structure (min-d)
  Energy per bit scales with $\varepsilon_r V_{dd}^2$
  $V_{dd}$ reduces as technology scales; $\varepsilon_r$ reduces as technology scales
  Energy decreases w/ technology scaling!
T-line structures
  Sum of power consumed on wire and TX.
    Power of T-line
    Power of TX circuit: $f C V_{dd}^2$
  Constant!
  FO4 delay reduces exponentially
  Energy decreases w/ larger slope!!
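The supply-voltage dependence above follows the usual dynamic-switching relations: power of f·C·Vdd² and energy per transition of C·Vdd². The sketch below is a back-of-the-envelope illustration only. It uses the 45nm global-wire capacitance quoted in the deck's scaling table, assumes a 1.0V supply, and ignores repeaters and switching activity, so it is not meant to reproduce the reported results. The function names are illustrative.

```python
def dynamic_power_w(freq_hz, cap_f, vdd_v):
    """Dynamic switching power, f * C * Vdd^2."""
    return freq_hz * cap_f * vdd_v ** 2

def switching_energy_j(cap_f, vdd_v):
    """Energy drawn per full-swing transition, C * Vdd^2."""
    return cap_f * vdd_v ** 2

cap_per_mm_f = 0.179e-12  # global-wire Cw at 45nm (pF/mm) from the scaling table
vdd_v = 1.0               # assumed supply voltage, not stated for this node

energy_j_per_mm = switching_energy_j(cap_per_mm_f, vdd_v)
energy_pj_per_m = energy_j_per_mm * 1e12 * 1e3  # J/mm -> pJ/m

print(f"bare-wire switching energy: {energy_pj_per_m:.0f} pJ/m")
```

Because the energy depends on the square of the supply voltage, a modest drop in Vdd per generation has a larger effect than the accompanying drop in dielectric constant.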
Performance Analysis – Throughput
Same variables defined before
R-RC structure (min-d)
  Assuming wire pitch $\propto 1/S$
  FO4 delay reduces exponentially
  Throughput increases by 20% per generation!
T-line structures
  TX bandwidth
  Neglect the minor change of wire pitch
  K1 = 0, for UT-TL
  FO4 delay reduces exponentially
  Throughput increases by 43% per generation!!
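The two per-generation growth figures can be reproduced with a short calculation. This is a sketch under stated assumptions, not the authors' model: a scaling factor S = 1/0.7 per generation, the textbook relation that the delay of an optimally repeated wire grows as sqrt(r_w · c_w · FO4), wire resistance growing as S², wire capacitance held constant, wire pitch shrinking as 1/S for R-RC, and a T-line bit rate limited by a TX bandwidth that tracks 1/FO4 at fixed pitch.

```python
import math

S = 1 / 0.7  # assumed per-generation scaling factor

# Per-generation growth factors of the underlying quantities (assumptions).
fo4_growth = 1 / S      # FO4 delay shrinks with scaling
pitch_growth = 1 / S    # R-RC wire pitch shrinks with scaling
r_w_growth = S ** 2     # wire resistance per unit length
c_w_growth = 1.0        # wire capacitance per unit length held constant

# R-RC: delay ~ sqrt(r_w * c_w * FO4); bit rate is the inverse of delay.
rrc_delay_growth = math.sqrt(r_w_growth * c_w_growth * fo4_growth)
rrc_throughput_growth = (1 / rrc_delay_growth) / pitch_growth

# T-lines: bit rate tracks 1/FO4 and the wire pitch is treated as fixed.
tline_throughput_growth = 1 / fo4_growth

print(f"R-RC:   +{(rrc_throughput_growth - 1) * 100:.0f}% per generation")
print(f"T-line: +{(tline_throughput_growth - 1) * 100:.0f}% per generation")
```

With these assumptions the R-RC throughput grows as sqrt(S) and the T-line throughput as S, which gives roughly 20% and 43% per generation.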
Design Framework for On-Chip T-line Schemes
Proposed framework can be applied to design UT-TL/T-TL/UE-TL/PE-TL by changing wire configuration and circuit structure.
Different optimization routines (LP/ILP/SQP, etc) can be adopted according to the problem formulation.
Experimental Settings
Design objective: min-d
Technology nodes: 90nm-22nm
Five different global interconnection structures
Wire length: 5mm
Parameter extraction
  2D field solver CZ2D from EIP tool suite of IBM
  Tabular model or synthesized model
Transistor models
  Predictive transistor model from [Uemura06]
  Synopsys level 3 MOSFET model tuned according to ITRS roadmap
Simulation
  HSPICE 2005
Modeling and optimization
  Linear or non-linear regression/SQP routine
  MATLAB 2007
Performance Metric: Normalized Delay – Results and Comparison
Technology trends
  R-RC ↑
  T-line schemes ↓
T-line structures
  Outperform R-RC beyond 90nm
  Single-ended: lowest delay
At 22nm node
  R-RC: 55ps/mm
  T-lines: 8ps/mm (85% reduction)
  Speed of light: 5ps/mm
Linear model
  < 6% average percent error
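The speed-of-light bound quoted above can be related to the dielectric constant through the time-of-flight relation sqrt(ε_r)/c. The sketch below assumes ε_r = 2.25, a value chosen only because it lands near the quoted 5ps/mm; the slides do not state which ε_r was used.

```python
import math

C0_M_PER_S = 299_792_458.0  # speed of light in vacuum

def time_of_flight_ps_per_mm(eps_r):
    """Delay per unit length of a wave travelling in a medium with relative permittivity eps_r."""
    seconds_per_meter = math.sqrt(eps_r) / C0_M_PER_S
    return seconds_per_meter * 1e12 * 1e-3  # s/m -> ps/mm

eps_r = 2.25  # assumed low-k dielectric constant
print(f"time of flight: {time_of_flight_ps_per_mm(eps_r):.2f} ps/mm")
```

Read this way, the 8ps/mm T-line figure sits within a factor of two of the physical limit, while the repeated RC wire is about an order of magnitude away from it.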
Performance Metric: Normalized Energy per Bit – Results and Comparison
Technology trends
  R-RC and T-lines ↓
  T-lines reduce more quickly
T-line structures
  Outperform R-RC beyond 45nm
  Differential: lowest energy. Single-ended similar to R-RC.
  T-TL > UT-TL
At 22nm node
  R-RC: 100pJ/m
  Single-ended: 60% reduction
  Differential: 96% reduction
Linear model
  < 12% average percent error
  Error for T-TL and PE-TL comes from RL and passive equalizers.
Performance Metric: Normalized Throughput – Results and Comparison
Technology trends
  R-RC and T-lines ↑
  T-lines increase more quickly
T-line structures
  Outperform R-RC beyond 32nm
  Differential better than single-ended
At 22nm node
  R-RC: 12Gbps/um
  T-TL: 30% improvement
  UE-TL: 75% improvement
  PE-TL: ~2X of R-RC
Linear model
  < 7% average percent error
Signal Integrity – single-ended T-lines
[Figure: Worst-case switching pattern for peak noise simulation]
UT-TL structure
  380mV peak noise at 1V supply voltage w/ 7ps rise time
  SI could be a big issue as supply voltage drops
T-TL less sensitive to noise
  At the same rise time, ~50% reduction of peak noise
  Peak noise ↓ as technology scales
[Figure panels: Using w.c. pattern; Using single or multiple PRBS patterns]
Signal Integrity – differential T-lines
[Figure: Worst-case switching pattern for peak noise simulation]
More reliable
  Termination resistance
  Common-mode noise reduction
Peak noise
  Within ~10mV range
Eye-heights
  UE-TL
    Eye reduces as bit rate ↑
    Harder to meet constraint.
  PE-TL
    > 70mV eye even at 22nm node
    Equalization does help!
Conclusion
Compare five different global interconnections in terms of latency, energy per bit, throughput and signal integrity from 90nm to 22nm.
A simple linear model provided to link
  Architecture-level performance metrics
  Technology-defined parameters
Some observations from experimental results
  T-line structures have potential to replace R-RC at future nodes
  Differential T-lines are better than single-ended
    Low-power/High-throughput/Low-noise
  Equalization could be utilized for on-chip global interconnection
    Higher throughput density, improved signal integrity
    Even w/ lower energy dissipation (passive equalizations)
Thank you!
Q & A
Back Up Slides
Introduction – Technology Trend
On-chip interconnect scaling
  Dimension shrinks
  Wire resistance increases -> RC delay
  Increasing capacitive coupling -> delay, power, noise, etc.
Performance of global wires decreases w/ technology scaling.
[Figure: Copper resistivity versus wire width]
Scaling trend of PUL wire resistance and capacitance
Wire category   Parameter      90nm     45nm     22nm
M1 wire         Rw (kohm/mm)   1.914    8.860    34.827
M1 wire         Cw (pF/mm)     0.183    0.157    0.129
Global wire     Rw (kohm/mm)   0.532    2.970    11.000
Global wire     Cw (pF/mm)     0.205    0.179    0.151
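The global-wire figures in the scaling table make the trend concrete: resistance rises much faster than capacitance falls, so the per-unit-length RC product grows steeply. A minimal sketch using only the tabulated values (variable names are illustrative):

```python
nodes = ["90nm", "45nm", "22nm"]
rw_kohm_per_mm = [0.532, 2.970, 11.000]  # global wire resistance
cw_pf_per_mm = [0.205, 0.179, 0.151]     # global wire capacitance

# kohm * pF = ns, so the product is in ns per mm^2 of wire length squared.
rc_ns_per_mm2 = [r * c for r, c in zip(rw_kohm_per_mm, cw_pf_per_mm)]
rc_relative = [rc / rc_ns_per_mm2[0] for rc in rc_ns_per_mm2]

for node, rc, rel in zip(nodes, rc_ns_per_mm2, rc_relative):
    print(f"{node}: r*c = {rc:.3f} ns/mm^2 ({rel:.1f}x the 90nm value)")
```

The product grows by roughly a factor of fifteen from 90nm to 22nm, which is why unrepeated RC delay degrades with scaling even though the capacitance improves slightly.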
Design methodology: single-ended T-lines
[Design flow chart]
  Single-ended; inverter chains
  2D frequency-dependent tabular model
  SPICE simulation
  Inverter size, number of stages, Rload (if any)
  SPICE simulation to evaluate. Optimization routine:
    1. Optimal cycle time
    2. Sweep for optimal inverter chain
  SPICE simulation to check in-plane crosstalk, etc.
Design methodology: differential T-lines
[Design flow chart]
  Differential lines; SA-based TX
  2D frequency-dependent tabular model
  Closed-form equation-based model
  Wire width; driver impedance; RC equalizer (if any); termination resistance.
  Evaluation based on models. Optimization routine:
    1. Binary search for wire width
    2. SQP for other var. optimization
  SPICE simulation to check in-plane crosstalk, etc.
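The binary search over wire width can be sketched as follows. This is an illustration under assumptions, not the authors' routine: it assumes the eye height does not decrease as the wire gets wider, uses the 50mV eye constraint stated earlier in the deck as the target, and substitutes a hypothetical linear `toy_eye_height_mv` for the closed-form model evaluation.

```python
def min_wire_width_um(eye_height_mv, lo_um, hi_um, target_mv=50.0, tol_um=0.01):
    """Smallest width in [lo_um, hi_um] meeting the eye target, to within tol_um.

    Assumes eye_height_mv(width) is non-decreasing in width.
    Returns None when even the widest wire misses the target.
    """
    if eye_height_mv(hi_um) < target_mv:
        return None
    if eye_height_mv(lo_um) >= target_mv:
        return lo_um
    # Invariant: lo_um fails the constraint, hi_um satisfies it.
    while hi_um - lo_um > tol_um:
        mid_um = (lo_um + hi_um) / 2
        if eye_height_mv(mid_um) >= target_mv:
            hi_um = mid_um
        else:
            lo_um = mid_um
    return hi_um

def toy_eye_height_mv(width_um):
    """Hypothetical stand-in for the model-based eye evaluation."""
    return 40.0 * width_um

width = min_wire_width_um(toy_eye_height_mv, lo_um=0.5, hi_um=4.0)
print(f"smallest feasible width: {width:.2f} um")
```

Searching for the narrowest feasible wire first fixes the throughput density, after which the remaining variables (driver impedance, equalizer, termination) can be handed to a continuous optimizer such as SQP.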
Effects of driver impedance and termination resistance
Lowering driver impedance improves eye
Eye reduces as frequency goes up
Optimal termination resistance.
Effects of driver impedance and termination resistance on step response
Larger driver impedance leads to slower rise edge and lower saturation voltage
Larger termination resistance causes sharper rise edge but with larger reflection
Optimal Rload
Crosstalk effects
[Figure panels: @Wire output, @Inverter chain output; annotations: 750mV, 6.9ps]
[Figure panels: @Wire output, @Inverter chain output; annotations: 820mV, 3.6ps]
Three different PRBS input patterns, min-ddp solutions
  T-line Scheme A: Delay increased by 9.6%, Power increased by 37%
  T-line Scheme B: Delay increased by 2%, Power increased by 25.7%
Transceiver Design
[Figure: Double-tail latch-type voltage sense amp.]
Sense amplifier (SA)
  Double-tail latch-type [Schinkel 07]
  Optimize sizing to minimize SA delay
Inverter chain
  Number of stages
    Fixed to 6
  Sizing of each inverter
    RS: output resistance of inverter chain
    Sweep the 1st inverter size to minimize the total transceiver delay for given [Veye, RS]
Transistor sizes @ 45nm tech node:
  M1/M3: 45nm/45nm
  M2/M4: 250nm/45nm
  M5/M6: 180nm/45nm
  M7/M8: 280nm/45nm
  M9: 495nm/45nm
  M10/M11: 200nm/45nm
  M12: 1.58um/45nm
Transceiver Modeling
Driver side
  Voltage source Vs with output resistance Rs
  Vs: full-swing pulse signal with rise time Tr=0.1Tc
  Rs: output resistance of the last inverter in the chain.
Receiver side
  Extract look-up table for TX delay and power
  Fit the table using non-linear closed-form formula
  The relative error is within 2% for fitting models
[Figures: Transceiver delay map at 45nm node; Transceiver power map at 45nm node; Histogram of fitting errors at 45nm node]
[Figure annotation: Bit-rate: 50Gbps; Rs=11.06ohm, Rd=350ohm, Cd=0.38pF, RL=107.69ohm]
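The quoted design point can be turned into a few derived quantities. This sketch rests on two assumptions that the slides do not confirm: that the Tr=0.1Tc rule applies to the 50Gbps case with Tc taken as one bit period, and that Rd and Cd are the parallel RC network of the passive equalizer, whose time constant Rd·Cd sets a corner frequency of 1/(2π·Rd·Cd).

```python
import math

bit_rate_bps = 50e9             # quoted bit rate
cycle_time_s = 1 / bit_rate_bps # assumed: Tc is one bit period
rise_time_s = 0.1 * cycle_time_s

rd_ohm = 350.0   # assumed to be the equalizer resistance
cd_f = 0.38e-12  # assumed to be the equalizer capacitance

tau_s = rd_ohm * cd_f
corner_freq_hz = 1 / (2 * math.pi * tau_s)

print(f"Tc = {cycle_time_s * 1e12:.0f} ps, Tr = {rise_time_s * 1e12:.0f} ps")
print(f"Rd*Cd = {tau_s * 1e12:.0f} ps, corner near {corner_freq_hz / 1e9:.2f} GHz")
```

A corner frequency well below the bit rate is what one would expect from an equalizer meant to attenuate low-frequency content relative to the high-frequency components that the wire loses.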
Conclusion (cont’)
Item in the table: score/value. Score: the higher, the better in terms of given metric, max. score is 5. The best structure in each column marked using red color.

Low-Latency Application (ps/mm)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     3/35    1/42    1/46    1/55    1/55
UT-TL    5/15    5/13    5/10    5/9     5/8
T-TL     5/15    5/13    5/10    5/9     5/8
UE-TL    1/37    3/25    3/16    3/12    5/8
PE-TL    1/37    3/25    3/16    3/12    5/8

High-Throughput Application (Gbps/um)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     5/5     5/6     3/8     3/10    2/12
UT-TL    2/3.3   1/3.3   1/3.3   1/3.3   1/3.3
T-TL     1/3     2/3.4   2/6     2/9     3/16
UE-TL    3/3     3/5     4/9     4/13    4/21
PE-TL    4/4     4/5.3   5/9     5/15    5/24

Low-Energy Application (pJ/m)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     2/150   2/140   1/130   1/100   1/100
UT-TL    3/140   3/110   3/70    3/50    2/40
T-TL     1/260   1/200   2/100   2/60    3/40
UE-TL    4/60    4/36    4/20    4/10    5/4
PE-TL    5/26    5/16    5/8     5/5     5/2

Low-Noise Application (score only)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     1       1       1       1       1
UT-TL    1       1       1       1       1
T-TL     3       3       3       3       3
UE-TL    5       5       4       4       4
PE-TL    4       4       5       5       5
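The 22nm values in the score tables can be cross-checked against the percentages quoted on the results slides. The sketch below copies the 22nm column values and computes relative changes against R-RC; the dictionary and function names are illustrative.

```python
# 22nm values taken from the score/value tables.
delay_ps_per_mm = {"R-RC": 55, "UT-TL": 8, "T-TL": 8, "UE-TL": 8, "PE-TL": 8}
energy_pj_per_m = {"R-RC": 100, "UT-TL": 40, "T-TL": 40, "UE-TL": 4, "PE-TL": 2}
throughput_gbps_per_um = {"R-RC": 12, "UT-TL": 3.3, "T-TL": 16, "UE-TL": 21, "PE-TL": 24}

def reduction_pct(baseline, value):
    """Percent reduction relative to the baseline (positive means lower)."""
    return (baseline - value) / baseline * 100

def gain_pct(baseline, value):
    """Percent improvement relative to the baseline (positive means higher)."""
    return (value - baseline) / baseline * 100

for scheme in ("UT-TL", "T-TL", "UE-TL", "PE-TL"):
    d = reduction_pct(delay_ps_per_mm["R-RC"], delay_ps_per_mm[scheme])
    e = reduction_pct(energy_pj_per_m["R-RC"], energy_pj_per_m[scheme])
    t = gain_pct(throughput_gbps_per_um["R-RC"], throughput_gbps_per_um[scheme])
    print(f"{scheme}: delay -{d:.0f}%, energy -{e:.0f}%, throughput {t:+.0f}%")
```

The computed figures line up with the slide claims: about 85% lower delay for T-lines, 60% lower energy for single-ended and 96% for UE-TL, and throughput gains of about a third for T-TL, 75% for UE-TL, and 2X for PE-TL.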
Future Works
Explore novel global signaling schemes for high throughput and low energy dissipation.
Design, optimize > 50Gbps on-chip interconnection schemes
Architecture-level study to identify trade-offs
  Wire configuration
    Dimension optimization, ground plane, etc.
  Un-interrupted architectures
    Equalization implementation, TX/RX choice
  Distributed architectures
    Active or passive compensation (RC equalizers, other networks, etc.)
Novel high-speed transceiver circuitry design
Develop analysis and optimization capability to aid co-design and co-optimization of wire and transceiver circuit
Fabrication to verify analysis and demonstrate feasibility
Related Publications
[Repeated RC Wire]
1. L. Zhang, H. Chen, B. Yao, K. Hamilton, and C. K. Cheng, “Repeated on-chip interconnect analysis and evaluation of delay, power and bandwidth metrics under different design goals,” IEEE International Symposium on Quality Electronic Design, 2007, pp. 251-256.
[Un-Terminated/Terminated T-Line]
2. Y. Zhang, L. Zhang, A. Deutsch, G. A. Katopis, D. M. Dreps, J. F. Buckwalter, E. S. Kuh, and C. K. Cheng, “Design Methodology of High Performance On-Chip Global Interconnect Using Terminated Transmission-Line,” IEEE International Symposium on Quality Electronic Design, 2009, pp. 451-458.
[Passive-Equalized T-Line]
3. Y. Zhang, L. Zhang, A. Tsuchiya, M. Hashimoto, and C. K. Cheng, “On-chip high performance signaling using passive compensation,” IEEE International Conference on Computer Design, 2008, pp. 182-187.
4. Y. Zhang, L. Zhang, A. Deutsch, G. A. Katopis, D. M. Dreps, J. F. Buckwalter, E. S. Kuh, and C. K. Cheng, “On-chip bus signaling using passive compensation,” IEEE Electrical Performance of Electronic Packaging, 2008, pp. 33-36.
5. L. Zhang, Y. Zhang, A. Tsuchiya, M. Hashimoto, E. Kuh, and C. K. Cheng, “High performance on-chip differential signaling using passive compensation for global communication,” Asia and South Pacific Design Automation Conference, 2009, pp. 385-390.
[Overview and Comparison]
6. Y. Zhang, X. Hu, A. Deutsch, A. E. Engin, J. F. Buckwalter, and C. K. Cheng, “Prediction of High-Performance On-Chip Global Interconnection,” ACM workshop on System Level Interconnection Prediction, 2009.