soc 2.1
Chapter 2
Chip Basics: Time, Area, Power, Reliability, Configurability
Computer System Design: System-on-Chip
by M. Flynn & W. Luk
Pub. Wiley 2011 (copyright 2011)
soc 2.2
Basic design issue: Time
• clocking
• pipelining
  – optimal pipelining
  – pipeline partitioning
  – wave pipelining and low-overhead clocking
soc 2.3
SIA roadmap
soc 2.4
Tradeoffs in IP selection and design: performance, area, power
soc 2.5
Clock parameters
• parameters
  – Pmax : maximum delay through the logic
  – Pmin : minimum delay through the logic
  – t : cycle time (in seconds per cycle)
  – tw : clock pulse width
  – tg : data setup time
  – td : register output delay
  – C : total clocking overhead

[timing diagram: the cycle time t spans Pmax together with the overhead terms tw, tg and td]

t = Pmax + C
soc 2.6
Skew
• skew: uncertainty in the clock arrival time
• two types of skew
  – a part that depends on t: skew = k Pmax, a fraction k of Pmax,
    where Pmax is the segment delay that determines t
    • large segments may have longer delay and skew
    • this part of the skew varies with Leff, like segment delay
  – a part independent of t: a fixed term
    • can relate to clock routing, jitter from environmental conditions,
      and other effects unrelated to segment delay
• effect of skew = k Pmax plus the fixed term
  – the skew range adds directly to the clock overhead
soc 2.7
Optimal pipelining
• let the total instruction execution time, without pipelining and its
  associated clock overhead, be T
• in a pipelined processor, let S be the number of segments;
  S - 1 is the number of cycles lost due to a pipeline break
• let b = probability of a break, C = clock overhead incl. fixed skew
soc 2.8
Optimum pipelining
[diagram: the total time T partitioned into segments P1, P2, P3, P4]

suppose T = Σi Pmax,i without clock overhead

S = number of pipeline segments
C = clock overhead
T/S ≥ max(Pmax,i) [quantization]
Pmax,i = delay of the i-th functional unit
soc 2.9
t = T/S + C [cycle time: avg. time per segment plus clock overhead]

performance = 1/(1 + (S - 1)b) [IPC]

throughput = G = performance / t [IPS]

G = 1 / [(1 + (S - 1)b)(T/S + C)]

Find S for optimum performance by solving dG/dS = 0 for S; we get

Sopt = sqrt[(1 - b)T / (bC)]
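The result of this derivation can be checked numerically. A minimal sketch, not from the book, that evaluates the throughput model G(S) above and compares the closed-form Sopt against a brute-force search over integer segment counts:

```python
import math

def throughput(S, b, T, C):
    """G = 1 / [(1 + (S - 1) b) (T/S + C)], in instructions per unit time."""
    return 1.0 / ((1.0 + (S - 1.0) * b) * (T / S + C))

def s_opt(b, T, C):
    """Analytic optimum from dG/dS = 0: Sopt = sqrt((1 - b) T / (b C))."""
    return math.sqrt((1.0 - b) * T / (b * C))

# example: b = 0.1 breaks/instruction, T = 15 ns, C = 0.5 ns
b, T, C = 0.1, 15.0, 0.5
S = s_opt(b, T, C)                  # about 16.4 segments
G = throughput(S, b, T, C)          # about 0.28 instructions/ns = 280 MIPS

# brute-force check: G(S) over integer S peaks next to the analytic Sopt
best_S = max(range(1, 100), key=lambda s: throughput(s, b, T, C))
```

The example table on a later slide reports slightly larger Sopt values for the same b, T, C (e.g. 16.8 rather than 16.4), presumably because its skew fraction k also enters the overhead term.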
soc 2.10
Find Sopt
• estimate b
  – use instruction traces
• find T and C from design details
  – feasibility studies
• example:
 b    k     T (ns)  C (ns)  Sopt   G (MIPS)  f (MHz)  CPI   Clock overhead %
 0.1  0.05  15      0.5     16.8   270       697      2.58  34.8%
 0.1  0.05  15      1       11.9   206       431      2.09  43.1%
 0.2  0.05  15      0.5     11.2   173       525      3.04  26.3%
 0.2  0.05  15      1       7.9    140       335      2.39  33.5%
soc 2.11
Quantization + other considerations
• quantization effects
  – T cannot be arbitrarily divided into segments
  – segments are defined by functional unit delays
  – some segments cannot be divided; others can be
    divided only at particular boundaries
• some functional operations are atomic
  – a cycle usually does not cross a function unit boundary
• Sopt
  – ignores cost/area of extra pipeline stages
  – ignores quantization loss
  – is the largest S that should be used
soc 2.12
Microprocessor design practice
• tradeoff around design target
• optimal in-order integer RISC: 5-10 stages
  – performance: relatively flat across this range
  – deeper for out-of-order or complex ISA
    (e.g. Intel Architectures)
• use a longer pipeline (higher frequency) if
  – FP/multimedia vector performance is important
  – clock overhead is low
• else use a shorter pipeline
  – especially if area/power/effort are critical
soc 2.13
Advanced circuit techniques
• asynchronous or self-timed clocking
  – avoids clock distribution problems,
    but has its own overhead
• multi-phase domino clocking
  – skew tolerant and low clock overhead;
    needs lots of power and extra area
• wave pipelining
  – the ultimate limit on t:

    t = Pmax - Pmin + C
soc 2.14
Basic Design Issues: Silicon Area, Power, Reliability, Reconfiguration
• die floorplanning methodology
• area-cost model
• power analysis and model
• reliability
• reconfigurable design
• soft processors
soc 2.15
AMD Barcelona multicore
http://www.techwarelabs.com/reviews/processors/barcelona/
soc 2.16
Die floorplanning methodology
• pick target cost based on market requirements
• determine total area available within cost budget
  – defect and yield model
• compute net available area for processors, caches and memory
  – account for I/O, buses, test hooks, I/O pads, etc.
• select core processors and assess area and performance
• re-allocate area to optimize performance
  – cache, signal processors, multimedia processors, etc.
soc 2.17
Wafers and chips
suppose the wafer has diameter d and each die is square with area A
soc 2.18
Wafers and chips: example

If N is the number of dice on the wafer,

N = πd² / (4A) [gross yield]

Let NG be the number of good dice and ND the number of defects on a wafer.

Given N dice of which NG are good, suppose we randomly add
1 new defect to the wafer. What is the probability that it strikes a
good die and changes NG?
soc 2.19
Probability of the defect hitting a good die = NG / N

The change in NG is dNG/dND = -NG/N

Rewriting this we get dNG/NG = -(1/N) dND

Integrating and solving: ln(NG) = -ND/N + C

Since NG = N when ND = 0, C must be ln(N), so

NG/N = Yield = e^(-ND/N)

Let the defect density (defects/cm²) = D; then

ND = D x wafer area = D x A x N

Yield = NG/N = e^(-DA)

typically D = 0.3 - 1.0 defects/cm²
soc 2.20
Using yield to size a die
To find the cost per die:
1. find N, the number of dice on a wafer
2. find Yield
3. find NG = Yield x N
4. cost/die = wafer cost / NG
 Wafer diameter  Defect density  Wafer     Die size  Gross  Yield  Good  Cost per
 (cm)            (per cm²)       cost ($)  (cm)      yield         dice  good die ($)
 21              1               5000      1         314    0.37   116   $43
 21              1               5000      1.5       133    0.11   14    $357
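The yield and cost recipe above can be scripted. A sketch, not from the book, that reproduces both table rows; the gross dice counts are taken directly from the table, since the simple πd²/(4A) estimate does not include the edge losses the table evidently accounts for:

```python
import math

def poisson_yield(D, A):
    """Fraction of good dice: Yield = e^(-D*A), D in defects/cm², A in cm²."""
    return math.exp(-D * A)

def cost_per_good_die(wafer_cost, gross_dice, D, A):
    """Steps 2-4 of the recipe: yield, good dice, then wafer cost / good dice."""
    good_dice = poisson_yield(D, A) * gross_dice
    return wafer_cost / good_dice

# first table row: D = 1/cm², $5000 wafer, 1 cm x 1 cm die, 314 gross dice
cost1 = cost_per_good_die(5000.0, 314, 1.0, 1.0)     # about $43
# second row: 1.5 cm x 1.5 cm die (A = 2.25 cm²), 133 gross dice
cost2 = cost_per_good_die(5000.0, 133, 1.0, 2.25)    # about $357
```

Note how a 2.25x larger die costs roughly 8x more per good die: the exponential yield term dominates.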
soc 2.21
Effect of defect density
soc 2.22
What can be put on the die?
• depends on the lithography and die area
• lithography is determined by f, the minimum feature size
• feature size is related to the mask registration variation λ:
  f = 2λ
soc 2.23
Smallest device: 5λ x 5λ

[diagram: smallest-device layout, with dimensions of 2, 4 and 5 labeled in units of λ]
soc 2.24
Area units: rbe and A
• rbe: a small area unit for sizing the functional units of the processor
• suppose we define another, larger unit, A, as 1 A = f² x 10⁶;
  then 1 A = 10⁶/675 = 1481 rbe
• since 1481 is close to 1444, we can also refer to the simple
  register file as occupying 1 A

 Unit  Meaning                  Relative size
 λ     mask registration
 f     minimum feature size     f = 2λ
 rbe   register bit equivalent  rbe = 2700 λ² = 675 f²
 A     functional unit area     A = 10⁶ f² = 1481 rbe
soc 2.25
Area of other cells
• 1 register bit = 1 rbe
• 1 CAM bit = 2 rbe
• 1 cache bit (6-transistor cell) = 0.6 rbe
• 1 SRAM bit = 0.6 rbe
• 1 DRAM bit = 0.1 rbe = 67.5 f2
These are the parameters for basic cells in most design tradeoffs
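These cell sizes are simple scalings of the rbe and A units. A small sketch, not from the book, that converts a memory capacity into A units using the parameters above:

```python
RBE_PER_A = 1e6 / 675.0      # 1 A = 10^6 f² and 1 rbe = 675 f², so 1 A ≈ 1481 rbe

def bits_to_A(nbits, rbe_per_bit):
    """Area in A units of a memory of nbits, given the cell size in rbe."""
    return nbits * rbe_per_bit / RBE_PER_A

# 128 kB of SRAM at 0.6 rbe/bit...
sram_A = bits_to_A(128 * 1024 * 8, 0.6)      # about 425 A
# ...and the same capacity in DRAM at 0.1 rbe/bit is 6x smaller
dram_A = bits_to_A(128 * 1024 * 8, 0.1)
```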
soc 2.26
Floorplan and area allocation
[floorplan diagram: core processors, signal processor, cache, bus, memory, clock, test]
soc 2.27
The baseline: I
• suppose D is 0.2 defects/cm² and we target 80% yield
• then A = 110 mm² gross, or (allowing 20% for guard) 88 mm² net
• if f = 0.13 μm we have 5200 A area units for our design
• we want to realize
  – a 32-bit core processor (with 8 kB I-cache and 16 kB D-cache)
  – two 32-bit vector processors with 16 x 1k x 32-bit vector memory
    plus I- and D-caches
  – 128 kB ROM
  – anything else is SRAM
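Both numbers in this budget can be reproduced from the models above. A sketch, not from the book, assuming the net area converts directly into A units at 10⁶ f² per unit:

```python
import math

# gross die area for an 80% yield target at D = 0.2 defects/cm²:
# solve e^(-D*A) = 0.8 for A, then convert cm² to mm²
gross_mm2 = 100.0 * math.log(1.0 / 0.8) / 0.2       # about 112 mm²

def area_units(net_area_mm2, f_um):
    """Number of A units on the die: 1 A = 10^6 f², with f in microns."""
    net_area_um2 = net_area_mm2 * 1e6                # 1 mm² = 10^6 µm²
    return net_area_um2 / (1e6 * f_um ** 2)

budget = area_units(88.0, 0.13)                      # about 5200 A
```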
soc 2.28
The baseline: II
This leaves 5200 - 2462 = 2738 A available for data SRAM.
This implies about 512 kB of SRAM.
soc 2.29
Example SOC floorplan
soc 2.30
Die area summary
• cost: an exponential function of area
• successful business model
  – targets initial production at relatively low yield (~0.3)
  – rides the learning curve and leverages technology to
    reduce cost and improve performance
• technical innovation and analysis
  – intersect with business decisions to make a product
  – use design feasibility studies and empirical targets
  – methodology for cost and performance evaluation
  – marketing targets: determine weighting of performance metrics
soc 2.31
Power consumption
• power consumption: becoming a key design issue
• increased power: largely due to higher-frequency operation
soc 2.32
Bipolar and CMOS clock frequency
[plot: clock frequency vs. year for bipolar and CMOS, showing the bipolar power limit]
soc 2.33
Bipolar cooling technology (ca ’91)
Hitachi M880: 500 MHz; one processor per module, 40 dice sealed in helium, then cooled by a water jacket. Power consumed: about 800 watts per module.
F. Kobayashi, et al . “Hardware technology for Hitachi M-880.” Proceedings Electronic Components and Tech Conf., 1991.
soc 2.34
Power: the real price of performance

As feature size and C (capacitance) decrease, the electric fields force a
reduction in V. To maintain performance we also reduce Vth.

But as Vth decreases, Ileakage and hence static power increase.
Static power is now a big problem in high-performance designs.

Static power can be controlled by maintaining Vth and using lower
frequencies; lowering V also reduces dynamic power.

[figure: total power split into dynamic power and static power]
soc 2.35
Power and frequency
• I = C dV/dt ... a smaller C enables a higher dV/dt (frequency)
• but I ∝ (V - Vth)^1.25, and since f ∝ I/(CV), the current also
  directly determines the maximum frequency
• for Vth = 0.6 V, halving V also roughly halves the frequency
  (e.g. if V goes from 3 V to 1.5 V then f is about ½)
• so halving the voltage (VDD or the signal swing V) halves the
  frequency BUT reduces the dynamic power (CV²f/2) by a factor of 8
• so lowering the voltage trades a modest loss in frequency for a
  much larger saving in power
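The voltage-scaling argument above can be made concrete. A sketch, not from the book, using the alpha-power frequency model with exponent 1.25; the exact "factor of 8" takes the slide's idealized assumption that f is exactly halved:

```python
def freq_ratio(V_new, V_old, Vth=0.6, alpha=1.25):
    """Alpha-power law with C fixed: f proportional to (V - Vth)^alpha / V."""
    f = lambda V: (V - Vth) ** alpha / V
    return f(V_new) / f(V_old)

def dynamic_power_ratio(v_ratio, f_ratio):
    """P = C V² f / 2, so the power ratio is v_ratio² * f_ratio."""
    return v_ratio ** 2 * f_ratio

fr = freq_ratio(1.5, 3.0)               # about 0.59: halving V roughly halves f
pr = dynamic_power_ratio(0.5, 0.5)      # exactly 1/8 if f is taken as halved
```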
soc 2.36
Power: a new frontier
• cooled high power: > 70 W/die
• high power: 10-50 W/die ... plug-in supply
• low power: 0.1-2 W/die ... rechargeable battery
• very low power: 1-100 mW/die ... AA-size batteries
• extremely low power: 1-100 μW/die and below (nanowatts) ... button batteries
• no power: extract from the local EM field, O(1 μW/die)
soc 2.37
Battery energy and usage
 Type          Energy capacity  Time                    Power
 rechargeable  10,000 mAh       50 hours (10-20% duty)  400 mW - 4 W
 2 x AA        4000 mAh         ½ year (10-20% duty)    1-10 mW
 button        40 mAh           5 years (always on)     1 μW
soc 2.38
Power is important!
• by scaling alone, a 1000x slower implementation may need only 10⁻⁹ as much power
• gating power to functional units and other techniques should enable
  100 MHz processors to operate at O(10⁻³) watts
• goal: O(10⁻⁶) watts ... implies about 10 MHz
soc 2.39
Reliability + computational integrity
• design for reliability using
  – redundancy
  – error detect and correct
  – process recoverability
  – fail-safe computation
• failure: a deviation from a design specification
• error: a failure that results in an incorrect signal value
• fault: an error that manifests as an incorrect logical result
• faults
  – do not necessarily produce incorrect program execution
  – can be masked by detection/correction logic, e.g. ECC codes
• types of faults:
  – physical faults
  – design faults
soc 2.40
Redundancy: carefully applied
• P(t) = e^(-t/MTTF)
  – derived in the same way as the yield equation
• TMR (triple modular redundancy) system
  – gives additional reliability over a time much less than
    the expected failure time for a single module
• additional hardware
  – makes the occurrence of multiple module failures more probable
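Both TMR claims above can be checked numerically. A sketch, not from the book, using the exponential survival model P(t) = e^(-t/MTTF) and an ideal (perfectly reliable) majority voter:

```python
import math

def module_reliability(t, mttf=1.0):
    """P(t) = e^(-t/MTTF): probability a single module survives to time t."""
    return math.exp(-t / mttf)

def tmr_reliability(t, mttf=1.0):
    """TMR survives while at least 2 of 3 modules work: 3R² - 2R³."""
    R = module_reliability(t, mttf)
    return 3 * R**2 - 2 * R**3

# well before the expected failure time, TMR is more reliable...
early_win = tmr_reliability(0.1) > module_reliability(0.1)
# ...but past t = MTTF * ln 2 the extra hardware makes multiple module
# failures probable enough that TMR is worse than a single module
late_loss = tmr_reliability(1.0) < module_reliability(1.0)
```

The crossover falls exactly at t = MTTF ln 2, where R = 0.5 and 3R² - 2R³ = R.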
soc 2.41
Highly reliable designs
• typical usage
  – error detection: parity, residue, block codes;
    sanity and bounds checks
  – action (instruction) retry
  – error correction: code or alternate-path compute
  – reconfiguration
soc 2.42
Why reconfigurable design?
• manage design complexity based on high-performance IP blocks
  – avoid the risk and delay of fabrication
• time: supports highly-pipelined designs
• area: the regularity of FPGAs readily advances to better process technology
• reliability: FPGAs enable redundant cells and interconnections,
  avoiding run-time faults
soc 2.43
Area estimate of FPGAs
• use the rbe model as the basic measure
  – one slice ≈ 7000 transistors = 700 rbe
  – one logic element (LE) ≈ 12,000 transistors = 1200 rbe
  – Xilinx Virtex XC2V6000 = 33,792 slices
    • 23.65 million rbe, or about 16,400 A
• 8 x 8 multiplier: around 35 slices
  – equivalent to 24,500 rbe or 17 A
  – an 8 x 8 multiplier in VLSI, with a full-adder and an AND gate
    per 1-bit cell, needs about 3840 transistors = 384 rbe,
    around 60 times smaller than the reconfigurable version
• block multipliers in FPGAs: more efficient
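The slice arithmetic above is a direct unit conversion. A small sketch, not from the book, using the 700 rbe/slice figure:

```python
RBE_PER_SLICE = 700            # one slice ≈ 7000 transistors ≈ 700 rbe
RBE_PER_A = 1e6 / 675.0        # 1 A ≈ 1481 rbe

def slices_to_rbe(n_slices):
    return n_slices * RBE_PER_SLICE

fpga_mult_rbe = slices_to_rbe(35)          # 8x8 multiplier: 24,500 rbe
fpga_mult_A = fpga_mult_rbe / RBE_PER_A    # about 17 A
vlsi_mult_rbe = 384                        # same multiplier in custom VLSI
overhead = fpga_mult_rbe / vlsi_mult_rbe   # roughly 60x area overhead

device_rbe = slices_to_rbe(33_792)         # XC2V6000: about 23.65 million rbe
```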
soc 2.44
Soft processors: using FPGAs
• soft processors: how soft are they?
  – an instruction processor design in bit-stream format,
    used to program an FPGA device
  – cost reduction, design reuse, ...
• major soft processors include:
  – Altera: Nios
  – Xilinx: MicroBlaze
  – open-source: OpenRISC, Leon
  – all 32-bit RISC architectures with 5-stage pipelines,
    connecting to different bus standards
soc 2.45
Features: soft processors
soc 2.46
Summary
• optimize the balance among time, area and power
• cycle time: optimized pipelining
• area: die floorplanning, rbe model
• power: cooling + battery implications
• reliability: computational integrity, redundancy
• reconfiguration: reduce risks and delays
  – area overhead alleviated by coarse-grained blocks
  – soft processors: instruction processors in FPGAs