soc 2.1
Chapter 2
Chip Basics: Time, Area, Power, Reliability, Configurability
Computer System Design: System-on-Chip
by M. Flynn & W. Luk
Pub. Wiley 2011 (copyright 2011)
soc 2.2
Basic design issue: Time
• clocking
• pipelining
  – optimal pipelining
  – pipeline partitioning
  – wave pipelining and low-overhead clocking
soc 2.3
SIA roadmap
soc 2.4
Tradeoffs in IP selection and design: performance, area, power
soc 2.5
Clock parameters
• parameters
  – Pmax : maximum delay through the logic
  – Pmin : minimum delay through the logic
  – t : cycle time (in seconds per cycle)
  – tw : clock pulse width
  – tg : data setup time
  – td : register output delay
  – C : total clocking overhead

[timing diagram: the cycle time t spans Pmax together with the overhead terms tw, tg and td]

t = Pmax + C
soc 2.6
Skew
• skew: uncertainty in the clock arrival time
• two types of skew
  – a part that depends on t: skew = k Pmax, a fraction k of Pmax,
    where Pmax is the segment delay that determines t
    • large segments may have longer delay and skew
    • this part of the skew varies with Leff, like segment delay
  – a part independent of t: a fixed term
    • can relate to clock routing, jitter from environmental conditions,
      and other effects unrelated to segment delay
• effect of skew = k Pmax plus the fixed term
  – the skew range adds directly to the clock overhead
soc 2.7
Optimal pipelining
• let the total instruction execution time, without pipelining and its
  associated clock overhead, be T
• in a pipelined processor, let S be the number of segments;
  S - 1 is the number of cycles lost due to a pipeline break
• let b = probability of a break, C = clock overhead incl. fixed skew
soc 2.8
Optimum pipelining
[diagram: the total time T partitioned into segments P1, P2, P3, P4]

suppose T = Σi Pmax,i without clock overhead

S = number of pipeline segments
C = clock overhead
T/S ≥ max(Pmax,i) [quantization]
Pmax,i = delay of the i-th functional unit
soc 2.9
t = T/S + C [cycle time: avg. time per segment plus clock overhead]

performance = 1/(1 + (S - 1)b) [IPC]

throughput = G = performance / t [IPS]

G = 1 / [(1 + (S - 1)b)(T/S + C)]

Find S for optimum performance by solving dG/dS = 0 for S; we get

Sopt = sqrt[(1 - b)T / (bC)]
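The result of this derivation can be checked numerically. A minimal sketch, not from the book, that evaluates the throughput model G(S) above and compares the closed-form Sopt against a brute-force search over integer segment counts:

```python
import math

def throughput(S, b, T, C):
    """G = 1 / [(1 + (S - 1) b) (T/S + C)], in instructions per unit time."""
    return 1.0 / ((1.0 + (S - 1.0) * b) * (T / S + C))

def s_opt(b, T, C):
    """Analytic optimum from dG/dS = 0: Sopt = sqrt((1 - b) T / (b C))."""
    return math.sqrt((1.0 - b) * T / (b * C))

# example: b = 0.1 breaks/instruction, T = 15 ns, C = 0.5 ns
b, T, C = 0.1, 15.0, 0.5
S = s_opt(b, T, C)                  # about 16.4 segments
G = throughput(S, b, T, C)          # about 0.28 instructions/ns = 280 MIPS

# brute-force check: G(S) over integer S peaks next to the analytic Sopt
best_S = max(range(1, 100), key=lambda s: throughput(s, b, T, C))
```

The example table on a later slide reports slightly larger Sopt values for the same b, T, C (e.g. 16.8 rather than 16.4), presumably because its skew fraction k also enters the overhead term.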
soc 2.10
Find Sopt
• estimate b
  – use instruction traces
• find T and C from design details
  – feasibility studies
• example:
 b    k     T (ns)  C (ns)  Sopt   G (MIPS)  f (MHz)  CPI   Clock overhead %
 0.1  0.05  15      0.5     16.8   270       697      2.58  34.8%
 0.1  0.05  15      1       11.9   206       431      2.09  43.1%
 0.2  0.05  15      0.5     11.2   173       525      3.04  26.3%
 0.2  0.05  15      1       7.9    140       335      2.39  33.5%
soc 2.11
Quantization + other considerations
• quantization effects
  – T cannot be arbitrarily divided into segments
  – segments are defined by functional unit delays
  – some segments cannot be divided; others can be
    divided only at particular boundaries
• some functional operations are atomic
  – a cycle usually does not cross a function unit boundary
• Sopt
  – ignores cost/area of extra pipeline stages
  – ignores quantization loss
  – is the largest S that should be used
soc 2.12
Microprocessor design practice
• tradeoff around design target
• optimal in-order integer RISC: 5-10 stages
  – performance: relatively flat across this range
  – deeper for out-of-order or complex ISA
    (e.g. Intel Architectures)
• use a longer pipeline (higher frequency) if
  – FP/multimedia vector performance is important
  – clock overhead is low
• else use a shorter pipeline
  – especially if area/power/effort are critical
soc 2.13
Advanced circuit techniques
• asynchronous or self-timed clocking
  – avoids clock distribution problems,
    but has its own overhead
• multi-phase domino clocking
  – skew tolerant and low clock overhead;
    needs lots of power and extra area
• wave pipelining
  – the ultimate limit on t:

    t = Pmax - Pmin + C
soc 2.14
Basic Design Issues: Silicon Area, Power, Reliability, Reconfiguration
• die floorplanning methodology
• area-cost model
• power analysis and model
• reliability
• reconfigurable design
• soft processors
soc 2.15
AMD Barcelona multicore
http://www.techwarelabs.com/reviews/processors/barcelona/
soc 2.16
Die floorplanning methodology
• pick target cost based on market requirements
• determine total area available within cost budget
  – defect and yield model
• compute net available area for processors, caches and memory
  – account for I/O, buses, test hooks, I/O pads, etc.
• select core processors and assess area and performance
• re-allocate area to optimize performance
  – cache, signal processors, multimedia processors, etc.
soc 2.17
Wafers and chips
suppose the wafer has diameter d and each die is square with area A
soc 2.18
Wafers and chips: example

If N is the number of dice on the wafer,

N = πd² / (4A) [gross yield]

Let NG be the number of good dice and ND the number of defects on a wafer.

Given N dice of which NG are good, suppose we randomly add
1 new defect to the wafer. What is the probability that it strikes a
good die and changes NG?
soc 2.19
Probability of the defect hitting a good die = NG / N

The change in NG is dNG/dND = -NG/N

Rewriting this we get dNG/NG = -(1/N) dND

Integrating and solving: ln(NG) = -ND/N + C

Since NG = N when ND = 0, C must be ln(N), so

NG/N = Yield = e^(-ND/N)

Let the defect density (defects/cm²) = D; then

ND = D x wafer area = D x A x N

Yield = NG/N = e^(-DA)

typically D = 0.3 - 1.0 defects/cm²
soc 2.20
Using yield to size a die
To find the cost per die:
1. find N, the number of dice on a wafer
2. find Yield
3. find NG = Yield x N
4. cost/die = wafer cost / NG
 Wafer diameter  Defect density  Wafer     Die size  Gross  Yield  Good  Cost per
 (cm)            (per cm²)       cost ($)  (cm)      yield         dice  good die ($)
 21              1               5000      1         314    0.37   116   $43
 21              1               5000      1.5       133    0.11   14    $357
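The yield and cost recipe above can be scripted. A sketch, not from the book, that reproduces both table rows; the gross dice counts are taken directly from the table, since the simple πd²/(4A) estimate does not include the edge losses the table evidently accounts for:

```python
import math

def poisson_yield(D, A):
    """Fraction of good dice: Yield = e^(-D*A), D in defects/cm², A in cm²."""
    return math.exp(-D * A)

def cost_per_good_die(wafer_cost, gross_dice, D, A):
    """Steps 2-4 of the recipe: yield, good dice, then wafer cost / good dice."""
    good_dice = poisson_yield(D, A) * gross_dice
    return wafer_cost / good_dice

# first table row: D = 1/cm², $5000 wafer, 1 cm x 1 cm die, 314 gross dice
cost1 = cost_per_good_die(5000.0, 314, 1.0, 1.0)     # about $43
# second row: 1.5 cm x 1.5 cm die (A = 2.25 cm²), 133 gross dice
cost2 = cost_per_good_die(5000.0, 133, 1.0, 2.25)    # about $357
```

Note how a 2.25x larger die costs roughly 8x more per good die: the exponential yield term dominates.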
soc 2.21
Effect of defect density
soc 2.22
What can be put on the die?
• depends on the lithography and die area
• lithography is determined by f, the minimum feature size
• feature size is related to the mask registration variation λ:
  f = 2λ
soc 2.23
Smallest device: 5λ x 5λ

[diagram: smallest-device layout, with dimensions of 2, 4 and 5 labeled in units of λ]
soc 2.24
Area units: rbe and A
• rbe: a small area unit for sizing the functional units of the processor
• suppose we define another, larger unit, A, as 1 A = f² x 10⁶;
  then 1 A = 10⁶/675 = 1481 rbe
• since 1481 is close to 1444, we can also refer to the simple
  register file as occupying 1 A

 Unit  Meaning                  Relative size
 λ     mask registration
 f     minimum feature size     f = 2λ
 rbe   register bit equivalent  rbe = 2700 λ² = 675 f²
 A     functional unit area     A = 10⁶ f² = 1481 rbe
soc 2.25
Area of other cells
• 1 register bit = 1 rbe
• 1 CAM bit = 2 rbe
• 1 cache bit (6-transistor cell) = 0.6 rbe
• 1 SRAM bit = 0.6 rbe
• 1 DRAM bit = 0.1 rbe = 67.5 f2
These are the parameters for basic cells in most design tradeoffs
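These cell sizes are simple scalings of the rbe and A units. A small sketch, not from the book, that converts a memory capacity into A units using the parameters above:

```python
RBE_PER_A = 1e6 / 675.0      # 1 A = 10^6 f² and 1 rbe = 675 f², so 1 A ≈ 1481 rbe

def bits_to_A(nbits, rbe_per_bit):
    """Area in A units of a memory of nbits, given the cell size in rbe."""
    return nbits * rbe_per_bit / RBE_PER_A

# 128 kB of SRAM at 0.6 rbe/bit...
sram_A = bits_to_A(128 * 1024 * 8, 0.6)      # about 425 A
# ...and the same capacity in DRAM at 0.1 rbe/bit is 6x smaller
dram_A = bits_to_A(128 * 1024 * 8, 0.1)
```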
soc 2.26
Floorplan and area allocation
[floorplan diagram: core processors, signal processor, cache, bus, memory, clock, test]
soc 2.27
The baseline: I
• suppose D is 0.2 defects/cm² and we target 80% yield
• then A = 110 mm² gross, or (allowing 20% for guard) 88 mm² net
• if f = 0.13 μm we have 5200 A area units for our design
• we want to realize
  – a 32-bit core processor (with 8 kB I-cache and 16 kB D-cache)
  – two 32-bit vector processors with 16 x 1k x 32-bit vector memory
    plus I- and D-caches
  – 128 kB ROM
  – anything else is SRAM
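Both numbers in this budget can be reproduced from the models above. A sketch, not from the book, assuming the net area converts directly into A units at 10⁶ f² per unit:

```python
import math

# gross die area for an 80% yield target at D = 0.2 defects/cm²:
# solve e^(-D*A) = 0.8 for A, then convert cm² to mm²
gross_mm2 = 100.0 * math.log(1.0 / 0.8) / 0.2       # about 112 mm²

def area_units(net_area_mm2, f_um):
    """Number of A units on the die: 1 A = 10^6 f², with f in microns."""
    net_area_um2 = net_area_mm2 * 1e6                # 1 mm² = 10^6 µm²
    return net_area_um2 / (1e6 * f_um ** 2)

budget = area_units(88.0, 0.13)                      # about 5200 A
```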
soc 2.28
The baseline: II
This leaves 5200 - 2462 = 2738 A available for data SRAM.
This implies about 512 kB of SRAM.
soc 2.29
Example SOC floorplan
soc 2.30
Die area summary
• cost: an exponential function of area
• successful business model
  – targets initial production at relatively low yield (~0.3)
  – rides the learning curve and leverages technology to
    reduce cost and improve performance
• technical innovation and analysis
  – intersect with business decisions to make a product
  – use design feasibility studies and empirical targets
  – methodology for cost and performance evaluation
  – marketing targets: determine weighting of performance metrics
soc 2.31
Power consumption
• power consumption: becoming a key design issue
• increased power: largely due to higher-frequency operation
soc 2.32
Bipolar and CMOS clock frequency
[plot: clock frequency vs. year for bipolar and CMOS, showing the bipolar power limit]
soc 2.33
Bipolar cooling technology (ca ’91)
Hitachi M880: 500 MHz; one processor per module, 40 dice sealed in helium, then cooled by a water jacket. Power consumed: about 800 watts per module.
F. Kobayashi, et al . “Hardware technology for Hitachi M-880.” Proceedings Electronic Components and Tech Conf., 1991.
soc 2.34
Power: the real price of performance

As feature size and C (capacitance) decrease, the electric fields force a
reduction in V. To maintain performance we also reduce Vth.

But as Vth decreases, Ileakage and hence static power increase.
Static power is now a big problem in high-performance designs.

Static power can be controlled by maintaining Vth and using lower
frequencies; lowering V also reduces dynamic power.

[figure: total power split into dynamic power and static power]
soc 2.35
Power and frequency
• I = C dV/dt ... a smaller C enables a higher dV/dt (frequency)
• but I ∝ (V - Vth)^1.25, and since f ∝ I/(CV), the current also
  directly determines the maximum frequency
• for Vth = 0.6 V, halving V also roughly halves the frequency
  (e.g. if V goes from 3 V to 1.5 V then f is about ½)
• so halving the voltage (VDD or the signal swing V) halves the
  frequency BUT reduces the dynamic power (CV²f/2) by a factor of 8
• so lowering the voltage trades a modest loss in frequency for a
  much larger saving in power
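The voltage-scaling argument above can be made concrete. A sketch, not from the book, using the alpha-power frequency model with exponent 1.25; the exact "factor of 8" takes the slide's idealized assumption that f is exactly halved:

```python
def freq_ratio(V_new, V_old, Vth=0.6, alpha=1.25):
    """Alpha-power law with C fixed: f proportional to (V - Vth)^alpha / V."""
    f = lambda V: (V - Vth) ** alpha / V
    return f(V_new) / f(V_old)

def dynamic_power_ratio(v_ratio, f_ratio):
    """P = C V² f / 2, so the power ratio is v_ratio² * f_ratio."""
    return v_ratio ** 2 * f_ratio

fr = freq_ratio(1.5, 3.0)               # about 0.59: halving V roughly halves f
pr = dynamic_power_ratio(0.5, 0.5)      # exactly 1/8 if f is taken as halved
```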
soc 2.36
Power: a new frontier
• cooled high power: > 70 W/die
• high power: 10-50 W/die ... plug-in supply
• low power: 0.1-2 W/die ... rechargeable battery
• very low power: 1-100 mW/die ... AA-size batteries
• extremely low power: 1-100 μW/die and below (nanowatts) ... button batteries
• no power: extract from the local EM field, O(1 μW/die)
soc 2.37
Battery energy and usage
 Type          Energy capacity  Time                    Power
 rechargeable  10,000 mAh       50 hours (10-20% duty)  400 mW - 4 W
 2 x AA        4000 mAh         ½ year (10-20% duty)    1-10 mW
 button        40 mAh           5 years (always on)     1 μW
soc 2.38
Power is important!
• by scaling alone, a 1000x slower implementation may need only 10⁻⁹ as much power
• gating power to functional units and other techniques should enable
  100 MHz processors to operate at O(10⁻³) watts
• goal: O(10⁻⁶) watts ... implies about 10 MHz
soc 2.39
Reliability + computational integrity
• design for reliability using
  – redundancy
  – error detect and correct
  – process recoverability
  – fail-safe computation
• failure: a deviation from a design specification
• error: a failure that results in an incorrect signal value
• fault: an error that manifests as an incorrect logical result
• faults
  – do not necessarily produce incorrect program execution
  – can be masked by detection/correction logic, e.g. ECC codes
• types of faults:
  – physical faults
  – design faults
soc 2.40
Redundancy: carefully applied
• P(t) = e^(-t/MTTF)
  – derived in the same way as the yield equation
• TMR (triple modular redundancy) system
  – gives additional reliability over a time much less than
    the expected failure time for a single module
• additional hardware
  – makes the occurrence of multiple module failures more probable
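Both TMR claims above can be checked numerically. A sketch, not from the book, using the exponential survival model P(t) = e^(-t/MTTF) and an ideal (perfectly reliable) majority voter:

```python
import math

def module_reliability(t, mttf=1.0):
    """P(t) = e^(-t/MTTF): probability a single module survives to time t."""
    return math.exp(-t / mttf)

def tmr_reliability(t, mttf=1.0):
    """TMR survives while at least 2 of 3 modules work: 3R² - 2R³."""
    R = module_reliability(t, mttf)
    return 3 * R**2 - 2 * R**3

# well before the expected failure time, TMR is more reliable...
early_win = tmr_reliability(0.1) > module_reliability(0.1)
# ...but past t = MTTF * ln 2 the extra hardware makes multiple module
# failures probable enough that TMR is worse than a single module
late_loss = tmr_reliability(1.0) < module_reliability(1.0)
```

The crossover falls exactly at t = MTTF ln 2, where R = 0.5 and 3R² - 2R³ = R.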
soc 2.41
Highly reliable designs
• typical usage
  – error detection: parity, residue, block codes;
    sanity and bounds checks
  – action (instruction) retry
  – error correction: code or alternate-path compute
  – reconfiguration
soc 2.42
Why reconfigurable design?
• manage design complexity based on high-performance IP blocks
  – avoid the risk and delay of fabrication
• time: supports highly-pipelined designs
• area: the regularity of FPGAs readily advances to better process technology
• reliability: FPGAs enable redundant cells and interconnections,
  avoiding run-time faults
soc 2.43
Area estimate of FPGAs
• use the rbe model as the basic measure
  – one slice ≈ 7000 transistors = 700 rbe
  – one logic element (LE) ≈ 12,000 transistors = 1200 rbe
  – Xilinx Virtex XC2V6000 = 33,792 slices
    • 23.65 million rbe, or about 16,400 A
• 8 x 8 multiplier: around 35 slices
  – equivalent to 24,500 rbe or 17 A
  – an 8 x 8 multiplier in VLSI, with a full-adder and an AND gate
    per 1-bit cell, needs about 3840 transistors = 384 rbe,
    around 60 times smaller than the reconfigurable version
• block multipliers in FPGAs: more efficient
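The slice arithmetic above is a direct unit conversion. A small sketch, not from the book, using the 700 rbe/slice figure:

```python
RBE_PER_SLICE = 700            # one slice ≈ 7000 transistors ≈ 700 rbe
RBE_PER_A = 1e6 / 675.0        # 1 A ≈ 1481 rbe

def slices_to_rbe(n_slices):
    return n_slices * RBE_PER_SLICE

fpga_mult_rbe = slices_to_rbe(35)          # 8x8 multiplier: 24,500 rbe
fpga_mult_A = fpga_mult_rbe / RBE_PER_A    # about 17 A
vlsi_mult_rbe = 384                        # same multiplier in custom VLSI
overhead = fpga_mult_rbe / vlsi_mult_rbe   # roughly 60x area overhead

device_rbe = slices_to_rbe(33_792)         # XC2V6000: about 23.65 million rbe
```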
soc 2.44
Soft processors: using FPGAs
• soft processors: how soft are they?
  – an instruction processor design in bit-stream format,
    used to program an FPGA device
  – cost reduction, design reuse, ...
• major soft processors include:
  – Altera: Nios
  – Xilinx: MicroBlaze
  – open-source: OpenRISC, Leon
  – all 32-bit RISC architectures with 5-stage pipelines,
    connecting to different bus standards
soc 2.45
Features: soft processors
soc 2.46
Summary
• optimize the balance among time, area and power
• cycle time: optimized pipelining
• area: die floorplanning, rbe model
• power: cooling + battery implications
• reliability: computational integrity, redundancy
• reconfiguration: reduce risks and delays
  – area overhead alleviated by coarse-grained blocks
  – soft processors: instruction processors in FPGAs