EECS 151/251A Spring 2020: Digital Design and Integrated Circuits
Instructor: John Wawrzynek
Lecture 10: Timing
1
What do ASIC/FPGA Designers need to know about physics?
‣ Physics effects:
• Area ⇒ cost
• Delay ⇒ performance
• Energy ⇒ performance & cost
• Ideally, we would have zero delay, area, and energy. However, physical devices occupy area, take time, and consume energy.
• The CMOS process lets us build transistors, wires, and connections - and we get capacitors, inductors, and resistors whether or not we want them.
2
Spring 2020 EECS151 Page
Performance, Cost, Power
• How do we measure performance? Operations/sec? Cycles/sec?
• Performance is directly proportional to clock frequency, although frequency may not be the entire story:
Ex: CPU time = # instructions × CPI × clock period
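The "iron law" above is easy to evaluate numerically. A minimal sketch in Python; all numbers are illustrative, not measurements of any real machine:

```python
# Iron law: CPU time = instruction count x CPI x clock period.
# Performance is the inverse of execution time.
n_instructions = 1_000_000
cpi = 1.5            # average cycles per instruction (illustrative)
t_clk = 1e-9         # 1 ns clock period, i.e. a 1 GHz clock

cpu_time = n_instructions * cpi * t_clk   # seconds to run the program
performance = 1 / cpu_time                # runs per second

print(cpu_time, performance)
```

Note that halving the clock period only helps if CPI does not rise to compensate, which is exactly why clock frequency "may not be the entire story."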
3
Limitations on Clock Rate
Two contributors:
1. Logic gate delay
2. Delays in flip-flops
• What must happen in one clock cycle for correct operation?
– All signals connected to FF (or memory) inputs must be ready and "setup" before the rising edge of the clock.
– For now we assume perfect clock distribution (all flip-flops see the clock at the same time).
What are typical delay values? Both times contribute to limiting the clock period.
4
Example
Parallel to serial converter circuit
T ≥ time(clk→Q) + time(mux) + time(setup)
T ≥ τclk→Q + τmux + τsetup
[Figure: flip-flops a and b feeding a mux, all clocked by clk.]
5
In General ...
For correct operation:
T ≥ τclk→Q + τCL + τsetup for all paths.
• How do we enumerate all paths?
– Any circuit input or register output to any register input or circuit output.
• Note:
– "setup time" for outputs is a function of what it connects to.
– "clk-to-q" for circuit inputs depends on from where it comes.
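The constraint above can be checked mechanically once the paths are enumerated. A minimal sketch in Python; the path names and delay values are made-up examples, not from any real design or tool:

```python
# Check T >= t_clk_to_q + t_CL + t_setup for every register-to-register path.
# All delays in nanoseconds; the path list is a hypothetical example.
T_CLK = 2.0          # candidate clock period (ns)
T_CLK_TO_Q = 0.20    # flip-flop clock-to-Q delay (ns)
T_SETUP = 0.15       # flip-flop setup time (ns)

# Combinational delay of each enumerated path (ns).
paths = {
    "regA -> adder -> regB": 1.40,
    "regA -> mux -> regC": 0.60,
    "regB -> shifter -> regC": 1.10,
}

def path_ok(t_cl, T=T_CLK):
    """A path meets timing iff T >= clk-to-Q + combinational delay + setup."""
    return T >= T_CLK_TO_Q + t_cl + T_SETUP

# The critical path is the one with the largest combinational delay;
# it sets the minimum workable clock period.
critical = max(paths, key=paths.get)
t_min = T_CLK_TO_Q + paths[critical] + T_SETUP

print(critical, t_min, all(path_ok(t) for t in paths.values()))
```

Real static timing analyzers do exactly this enumeration, plus the input/output boundary cases noted on the slide.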
6
“Gate Delay”
‣ Modern CMOS gate delays are on the order of a few picoseconds (however, highly dependent on gate design and context).
‣ Often expressed as FO4 (fan-out-of-4) delays, a process-dependent delay metric:
‣ the delay of an inverter, driven by an inverter 4x smaller than itself, and driving an inverter 4x larger than itself.
‣ Less than 10 ps for a 32 nm process; for a 7 nm process, FO4 is around 2.5 ps.
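FO4 gives a process-independent way to state a clock period. A small sketch using the approximate FO4 figures quoted above (the resulting frequencies are ideal-case arithmetic, ignoring wires and margins):

```python
# Express a clock period in process-independent FO4 delays.
# FO4 values are the approximate figures quoted on the slide.
fo4_32nm = 10e-12    # ~10 ps per FO4 in a 32 nm process
fo4_7nm = 2.5e-12    # ~2.5 ps per FO4 in a 7 nm process

def period_for(n_fo4, fo4):
    """Clock period (s) for a critical path that is n_fo4 FO4 delays long."""
    return n_fo4 * fo4

# A 20-FO4 critical path (typical of high-speed processors):
f_32nm = 1 / period_for(20, fo4_32nm)
f_7nm = 1 / period_for(20, fo4_7nm)

print(f_32nm, f_7nm)
```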
7
Process Dependent FO4 Delay
8
“Path Delay”
‣ For correct operation: Total Delay ≤ clock_period − FF setup_time − FF clk_to_q on all paths.
‣ In high-speed processors, critical paths (worst-case paths) have around 20 FO4 delays.
9
FO4 Delays per Clock Period
[Chart: FO4 delays per cycle vs. year, 1985-2005, for Intel (386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, Itanium), Alpha (21064, 21164, 21264), Sparc/SuperSparc/Sparc64, MIPS, HP PA, PowerPC, and AMD (K6, K7, x86-64) processors. Historical limit: about 12 FO4 delays per cycle. Thanks to Francois Labonte, Stanford.]
CPU Clock Periods, 1985-2005
MIPS R2000: 5 pipeline stages
Pentium 4: 20 pipeline stages
Pentium Pro: 10 pipeline stages
10
CPU DB: Recording Microprocessor History
With this open database, you can mine microprocessor trends over the past 40 years.
Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, Mark Horowitz, Stanford University
In November 1971, Intel introduced the world’s first single-chip microprocessor, the Intel 4004. It had 2,300 transistors, ran at a clock speed of up to 740 KHz, and delivered 60,000 instructions per second while dissipating 0.5 watts. The following four decades witnessed exponential growth in compute power, a trend that has enabled applications as diverse as climate modeling, protein folding, and computing real-time ballistic trajectories of angry birds. Today’s microprocessor chips employ billions of transistors, include multiple processor cores on a single silicon die, run at clock speeds measured in gigahertz, and deliver more than 4 million times the performance of the original 4004.
Where did these incredible gains come from? This article sheds some light on this question by introducing CPU DB (cpudb.stanford.edu), an open and extensible database collected by Stanford’s VLSI (very large-scale integration) Research Group over several generations of processors (and students). We gathered information on commercial processors from 17 manufacturers and placed it in CPU DB, which now contains data on 790 processors spanning the past 40 years.
In addition, we provide a methodology to separate the effect of technology scaling from improvements on other frontiers (e.g., architecture and software), allowing the comparison of machines built in different technologies. To demonstrate the utility of this data and analysis, we use it to decompose processor improvements into contributions from the physical scaling of devices, and from improvements in microarchitecture, compiler, and software technologies.
AN OPEN REPOSITORY OF PROCESSOR SPECS
While information about current processors is easy to find, it is rarely arranged in a manner that is useful to the research community. For example, the data sheet may contain the processor's power, voltage, frequency, and cache size, but not the pipeline depth or the technology minimum feature size. Even then, these specifications often fail to tell the full story: a laptop processor operates over a range of frequencies and voltages, not just the 2 GHz shown on the box label.
Not surprisingly, specification data gets harder to find the older the processor becomes, especially for those that are no longer made, or worse, whose manufacturers no longer exist. We have been collecting this type of data for three decades and are now releasing it in the form of an open repository of processor specifications. The goal of CPU DB is to aggregate detailed processor specifications into a convenient form and to encourage community participation, both to leverage this information and to keep it accurate and current. CPU DB (cpudb.stanford.edu) is populated with desktop, laptop, and server processors, for which we use SPEC [13] as our performance-measuring tool. In addition, the database contains limited data on embedded cores, for which we are using the CoreMark benchmark for performance [5]. With time and help from the community, we hope to extend the coverage of embedded processors in the database.
[Chart: FO4 Delays Per Cycle for Processor Designs, 1985-2015 (0-140 FO4/cycle). FO4 delay per cycle is roughly proportional to the amount of computation completed per cycle.]
[Chart: Libquantum Score Versus SPEC Score, 2008-2012 (log scale, 16-1024; libquantum vs. overall SPEC). This figure shows how compiler optimizations have led to performance boosts in Libquantum.]
11
"Gate Delay"
‣ What determines the actual delay of a logic gate?
‣ Transistors are not perfect switches - they cannot change terminal voltages instantaneously.
‣ Consider the NAND gate:
‣ Current (I) value depends on: process parameters, transistor size
‣ CL models gate output, wire, inputs to next stage (capacitance of load)
‣ C "integrates" I, creating a voltage change at the output
∆t ∝ CL / I
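The capacitor-charging picture gives a first-order delay estimate directly: the drive current I charges the load CL through a voltage swing ∆V in time ∆t ≈ CL·∆V / I. A quick sketch with round, illustrative numbers (not from any real process):

```python
# Gate delay from the capacitor-charging picture: dt ~= CL * dV / I.
# All values are round illustrative numbers, not real process data.
C_L = 2e-15      # 2 fF load capacitance
I = 40e-6        # 40 uA drive current
dV = 0.5         # volts the output must swing to cross the switching threshold

dt = C_L * dV / I    # seconds: larger load or weaker drive -> slower gate

print(dt)
```

This is why adding fan-out (more CL) or shrinking the driving transistor (less I) both slow a gate down.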
12
More on Transistor Current
‣ Transistors actually act like a cross between a resistor and a "current source"
‣ ISAT depends on process parameters (higher for nFETs than for pFETs) and transistor size (layout): ISAT ∝ W/L
13
FinFETs use multiple "fins" to get wider
Physical layout determines FET strength
‣ The "switch-level" abstraction gives a good way to understand the function of a circuit.
‣ nFET (g=1 ? short circuit : open)
‣ pFET (g=0 ? short circuit : open)
‣ Understanding delay means going below the switch-level abstraction to transistor physics and layout details.
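The switch-level abstraction is simple enough to execute. A sketch that evaluates a 2-input CMOS NAND by treating each FET as the conditional switch described above - it captures only logical function, not delay, which is exactly the abstraction's limit:

```python
# Switch-level model of a 2-input CMOS NAND.
# nFET: conducts when its gate is 1; pFET: conducts when its gate is 0.

def nand_switch_level(a: int, b: int) -> int:
    # Pull-down network: two series nFETs (both must conduct to pull low).
    pulls_down = (a == 1) and (b == 1)
    # Pull-up network: two parallel pFETs (either conducting pulls high).
    pulls_up = (a == 0) or (b == 0)
    # In static CMOS exactly one network conducts for any input.
    assert pulls_up != pulls_down
    return 0 if pulls_down else 1

print([nand_switch_level(a, b) for a in (0, 1) for b in (0, 1)])
```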
14
Transistors as water valves: if electrons are water molecules, transistor strengths (W/L) are pipe diameters, and capacitors are buckets ...
An "on" p-FET fills up the capacitor with charge.
1/28/04 ©UCB Spring 2004 CS152 / Kubiatowicz Lec 3.29
Delay Model:
CMOS
Review: General C/L Cell Delay Model
° Combinational Cell (symbol) is fully specified by:
• functional (input -> output) behavior
- truth-table, logic equation, VHDL
• load factor of each input
• critical propagation delay from each input to each output for each transition
- THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load
° Linear model composes
[Figure: combinational logic cell with inputs A, B, ..., X; plot of delay Va -> Vout vs. output load Cout; slope is delay per unit load, intercept is internal delay, Ccritical marked.]
Basic Technology: CMOS
° CMOS: Complementary Metal Oxide Semiconductor
• NMOS (N-Type Metal Oxide Semiconductor) transistors
• PMOS (P-Type Metal Oxide Semiconductor) transistors
° NMOS Transistor (Vdd = 5V, GND = 0V)
• Apply a HIGH (Vdd) to its gate: turns the transistor into a "conductor"
• Apply a LOW (GND) to its gate: shuts off the conduction path
° PMOS Transistor
• Apply a HIGH (Vdd) to its gate: shuts off the conduction path
• Apply a LOW (GND) to its gate: turns the transistor into a "conductor"
Basic Components: CMOS Inverter
° Inverter operation (circuit and symbol):
[Figure: inverter built from a PMOS pull-up and an NMOS pull-down between Vdd and GND. When In is low, the NMOS is open and the PMOS charges Out to Vdd; when In is high, the PMOS is open and the NMOS discharges Out.]
A “on” n-FET empties the bucket.
[Figure: water-level cartoons of charging and discharging. Charging: the level rises from "0" toward "1" over time; discharging: the level falls from "1" toward "0". Axes: water level vs. time.]
(Cartoon physics)
15
The Switch Inverter: Transient Response
tpHL = f(Ron·CL) = 0.69·Rn·CL
(a) Low-to-high (b) High-to-low
[Figure: inverter driving load CL, with input Vin and input capacitance Cin.]
V(t) = V0·e^(−t/RC), so t1/2 = ln(2) × RC
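The 0.69·RC figure falls straight out of the exponential above: the output crosses 50% of its swing at t = ln(2)·RC. A short numeric check with illustrative R and C values:

```python
# First-order RC model of the inverter output: V(t) = V0 * exp(-t / (R*C)),
# so the 50% crossing is at t_half = ln(2) * R * C ~= 0.69 * R * C.
import math

R_on = 10e3      # 10 kOhm effective on-resistance (illustrative)
C_L = 5e-15      # 5 fF load capacitance (illustrative)

t_phl = math.log(2) * R_on * C_L     # high-to-low propagation delay

def v_out(t, v0=1.0):
    """Discharge waveform through the nFET's on-resistance."""
    return v0 * math.exp(-t / (R_on * C_L))

print(t_phl, v_out(t_phl))
```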
16
Turning Rise/Fall Delay into Gate Delay
• Cascaded gates:
[Figure: "transfer curve" for an inverter; a chain of inverters regenerating logic levels 1 → 0 → 1 → 0.]
17
More on Gate Delay
‣ Everything that connects to the output of a logic gate (or transistor) contributes capacitance:
‣ transistor drains
‣ interconnection (wires/contacts/vias)
‣ transistor gates
18
‣ Additionally, wires contribute intrinsic delay of their own
Spring 2020 EECS151 Page
Wire Delay
• Ideally, wires behave as "transmission lines":
– the signal wave-front moves close to the speed of light (~1 ft/ns)
– Time from source to destination is called the "transit time".
– In ICs most wires are short, and the transit times are relatively short compared to the rise/fall times and can be ignored.
– Not so on PC boards.
!19
Wires
‣ As parallel-plate capacitors: C ∝ Area = width ∗ length
‣ Wires have some finite resistance, so they have distributed R and C:
with r = resistance/length and c = capacitance/length, ∆t ∝ r·c·L² (Elmore sum: rc + 2rc + 3rc + ...)
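The Elmore sum above can be checked numerically: split a wire of length L into N segments of resistance rL/N and capacitance cL/N each; capacitor k sees the k upstream segment resistances. The sum converges to r·c·L²/2 as N grows, confirming the L² dependence. The per-unit r and c below are illustrative constants:

```python
# Distributed-RC (Elmore) delay of a wire split into N segments.
# r and c are illustrative per-unit-length values.
r = 0.1     # resistance per unit length
c = 0.2     # capacitance per unit length
L = 100.0   # wire length

def elmore_delay(n_seg):
    r_seg = r * L / n_seg
    c_seg = c * L / n_seg
    # Capacitor k charges through the k upstream segment resistances.
    return sum(k * r_seg * c_seg for k in range(1, n_seg + 1))

limit = r * c * L**2 / 2   # distributed-RC limit as n_seg -> infinity

print(elmore_delay(1), elmore_delay(1000), limit)
```

With one segment the model is a single lumped RC (delay rcL²); with many segments it approaches rcL²/2 - either way, doubling the length quadruples the delay.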
20
Spring 2020 EECS151 Page
Wire Delay
• Even in those cases where the transmission-line effect is negligible:
– Wires possess distributed resistance and capacitance
– The time constant associated with distributed RC is proportional to the square of the length
• For short wires on ICs, resistance is insignificant (relative to the effective R of transistors), but C is important.
– Typically around half of the C of a gate load is in the wires.
• For long wires on ICs:
– busses, clock lines, global control signals, etc.
– Resistance is significant, therefore the distributed RC effect dominates.
– Signals are typically "rebuffered" to reduce delay
• For long wires on ICs with high currents:
– inductance is also important
[Figure: long wire rebuffered into sections v1 → v2 → v3 → v4; waveforms show the accumulating delay at each stage.]
Spring 2020 EECS151 Page
Wire Rebuffering
• For long wires on ICs:
– busses, clock lines, global control signals, etc.
– Resistance is significant, therefore the rcL² effect dominates.
– Signals are typically "rebuffered" to reduce delay:
Unbuffered wire: ∆t ∝ L²
Wire buffered into N sections: ∆t ∝ N·(L/N)² + (N−1)·tbuffer
Assuming tbuffer is small, ∆t ∝ L²/N
Speedup: ∝ N
22
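The ∝ N speedup is easy to verify numerically. A sketch comparing the two expressions above; k and t_buffer are illustrative constants, not real process values:

```python
# Rebuffering a long wire: unbuffered delay scales as L^2, while a wire
# split into N buffered sections costs N * (L/N)^2 + (N-1) * t_buffer.
k = 1.0          # rc constant of proportionality (illustrative)
t_buffer = 2.0   # delay of one inserted buffer (illustrative)
L = 100.0        # wire length

def delay_unbuffered(length):
    return k * length**2

def delay_buffered(length, n):
    return n * k * (length / n) ** 2 + (n - 1) * t_buffer

speedup_4 = delay_unbuffered(L) / delay_buffered(L, 4)

print(speedup_4)
```

With small t_buffer the speedup tracks N closely (just under 4x for N = 4 here); as N grows, the (N−1)·t_buffer term eventually dominates, which is why buffers are not inserted without limit.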
Flip-Flop delays eat into "time budget"
1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001
Fig. 1. Process SEM cross section.

The process was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the versus dependence and source-to-body bias is used to electrically limit transistor in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in thumb mode an extra pipe stage is inserted after the last fetch stage to convert thumb instructions into ARM instructions. Since thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

Fig. 2. Microprocessor pipeline organization.

The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled instruction fetch. A two-instruction deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re-
Spring 2003 EECS150 – Lec10-Timing Page 7
Example
• Parallel to serial converter:
T ≥ time(clk→Q) + time(mux) + time(setup)
T ≥ τclk→Q + τmux + τsetup
ALU "time budget"
Spring 2003 EECS150 – Lec10-Timing Page 8
General Model of Synchronous Circuit
• In general, for correct operation:
T ≥ time(clk→Q) + time(CL) + time(setup)
T ≥ τclk→Q + τCL + τsetup
for all paths.
• How do we enumerate all paths?
– Any circuit input or register output to any register input or circuit output.
– "setup time" for circuit outputs depends on what it connects to
– "clk-Q time" for circuit inputs depends on from where it comes.
[Figure: input → reg → CL → reg → CL → output, with a common clock input and an optional feedback path.]
Storage Element's Timing Model
° Setup Time: Input must be stable BEFORE the trigger clock edge
° Hold Time: Input must REMAIN stable after the trigger clock edge
° Clock-to-Q time:
• Output cannot change instantaneously at the trigger clock edge
• Similar to delay in logic gates, two components:
- Internal Clock-to-Q
- Load-dependent Clock-to-Q
[Waveform: D may be "don't care" outside the setup/hold window around the Clk edge; Q is unknown for the clock-to-Q delay after the edge.]
Clocking Methodology
[Figure: flip-flops and combinational logic blocks, all clocked by Clk.]
° All storage elements are clocked by the same clock edge
° The combinational logic blocks:
• Inputs are updated at each clock tick
• All outputs MUST be stable before the next clock tick
Critical Path & Cycle Time
[Figure: pipeline of flip-flops separated by combinational logic, all clocked by Clk.]
° Critical path: the slowest path between any two storage devices
° Cycle time is a function of the critical path
° It must be greater than:
Clock-to-Q + Longest Path through Combinational Logic + Setup
23
Recall: Positive edge-triggered flip-flop
D Q
Spring 2003 EECS150 – Lec10-Timing Page 14
Delay in Flip-flops
• Setup time results from delay through the first latch.
• Clock-to-Q delay results from delay through the second latch.
[Figure: flip-flop built from two latches clocked on clk and clk'; the first is the sampling circuit, the second holds the value. Waveform of D, clk, and Q showing setup time and clock-to-Q delay.]
24
Sensing: When clock is low
D Q — a flip-flop "samples" right before the edge, and then "holds" the value.
[Figure: with clk = 0 and clk' = 1, the first (sampling) latch is transparent and will capture the new value on the posedge; the second (holding) latch outputs the last value captured.]
25
Capture: When clock goes high
D Q — a flip-flop "samples" right before the edge, and then "holds" the value.
[Figure: with clk = 1 and clk' = 0, the first latch remembers the value just captured; the second latch outputs the value just captured.]
26
Flip Flop delays:
D Q
[Waveform: while CLK == 0 the flip-flop senses D, but Q outputs the old value; on CLK 0→1 it captures D and passes the value to Q after the clk-to-Q delay. D must be stable for the setup time before the edge.]
27
Note: with too much fanout, second stage could fail to capture data properly. Often output is rebuffered.
EECS151, UC Berkeley Sp18
Hold-time Violations
‣ Some state elements have positive hold-time requirements.
‣ How can this be?
‣ Fast paths from one state element to the next can create a violation. (Think about shift registers!)
‣ CAD tools do their best to fix violations by inserting delay (buffers).
‣ Of course, if the path is delayed too much, then cycle time suffers.
‣ Difficult because buffer insertion changes layout, which changes path delay.
[Figure: flip-flop FF with clock clk, input d, output q.]
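The hold check is the mirror image of the setup check: it constrains the *fastest* path, not the slowest. A minimal sketch; the delay values are illustrative, and `buffer_delay_needed` is a hypothetical helper standing in for what CAD tools compute when they insert delay buffers:

```python
# Hold-time check: the fastest path must satisfy
#   t_clk_to_q + t_cl_min >= t_hold
# or the next flip-flop's input changes before the edge has safely passed.
T_CLK_TO_Q = 0.10   # ns (illustrative)
T_HOLD = 0.25       # ns, a positive hold requirement (illustrative)

def hold_ok(t_cl_min):
    """True if the minimum-delay path meets the hold constraint."""
    return T_CLK_TO_Q + t_cl_min >= T_HOLD

def buffer_delay_needed(t_cl_min):
    """Extra delay that must be inserted to fix a hold violation."""
    return max(0.0, T_HOLD - (T_CLK_TO_Q + t_cl_min))

# A shift register has ~zero combinational delay between stages,
# which is exactly the fast-path case the slide warns about:
needed = buffer_delay_needed(0.0)

print(hold_ok(0.0), needed)
```

Note the clock period never appears in the check: hold violations cannot be fixed by slowing the clock, only by adding delay to the fast path.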
28
Timing Analysis and Logic Delay
Some path somewhere in the design has the longest delay and is therefore the “critical path”.
1/28/04 ©UCB Spring 2004, CS152 / Kubiatowicz, Lec3.9
General C/L Cell Delay Model
° Combinational Cell (symbol) is fully specified by:
• functional (input -> output) behavior
- truth-table, logic equation, VHDL
• Input load factor of each input
• Propagation delay from each input to each output for each transition
- THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load
° Linear model composes
[Figure: a combinational logic cell with inputs A, B, ..., X driving an output load Cout; the delay from Va to Vout plotted against Cout is a line whose intercept is the internal delay and whose slope is the delay per unit load, with Ccritical marked.]
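The linear delay model on this slide (fixed internal delay plus load-dependent delay times load) can be sketched as follows; the numbers are made-up illustration values, not data from any real cell library:

```python
# Linear cell-delay model from the slide:
#   delay = fixed internal delay + (delay per unit load) x output load
def cell_delay(internal_ps: float, ps_per_ff: float, load_ff: float) -> float:
    """Propagation delay of one input->output arc for a given output load (fF)."""
    return internal_ps + ps_per_ff * load_ff

# A hypothetical inverter arc: 20 ps internal delay, 5 ps per fF of load.
print(cell_delay(20.0, 5.0, 4.0))   # 20 + 5*4 = 40.0 ps
```

The linearity is what makes the model compose: the load seen by one cell is the sum of the input capacitances of its fanout cells plus wire capacitance.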
Storage Element’s Timing Model
° Setup Time: Input must be stable BEFORE trigger clock edge
° Hold Time: Input must REMAIN stable after trigger clock edge
° Clock-to-Q time:
• Output cannot change instantaneously at the trigger clock edge
• Similar to delay in logic gates, two components:
- Internal Clock-to-Q
- Load dependent Clock-to-Q
[Timing diagram: D may change (don't care) except from the setup time before to the hold time after the clock edge; Q is unknown until the clock-to-Q delay after the edge.]
Clocking Methodology
[Diagram: alternating storage elements and combinational logic blocks, all clocked by the same Clk.]
° All storage elements are clocked by the same clock edge
° The combinational logic blocks:
• Inputs are updated at each clock tick
• All outputs MUST be stable before the next clock tick
Critical Path & Cycle Time
[Diagram: register -> combinational logic -> register, all clocked by Clk.]
° Critical path: the slowest path between any two storage devices
° Cycle time is a function of the critical path
° It must be greater than:
Clock-to-Q + Longest Path through Combinational Logic + Setup
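The cycle-time inequality above can be turned into a tiny helper; the clk-to-Q, setup, and combinational path delays below are hypothetical numbers:

```python
# Cycle-time bound from the slide: T >= clk-to-Q + longest CL path + setup.
# All delay values are made-up illustration numbers (in ns).
T_CLK_TO_Q = 0.06
T_SETUP    = 0.04

def min_cycle_time(cl_path_delays):
    """Smallest legal clock period given combinational path delays (ns)."""
    return T_CLK_TO_Q + max(cl_path_delays) + T_SETUP

period = min_cycle_time([0.35, 0.80, 0.55])
print(period, 1.0 / period)   # minimum period (ns) and max frequency (GHz)
```

Only the longest combinational path matters for the period; the shorter paths have slack.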
Components of Combinational Path Delay
1. # of levels of logic
2. Internal cell delay
3. Wire delay
4. Cell input capacitance
5. Cell fanout
6. Cell output drive strength
Who controls the delay in an ASIC?
Players: foundry engineer (e.g., TSMC), library developer (e.g., Artisan), CAD tools (e.g., DC, IC Compiler), designer (you!)

1. # of levels: synthesis (CAD tools); HDL design (designer)
2. Internal cell delay: physical parameters (foundry); cell topology, transistor sizing (library); cell selection (CAD tools)
3. Wire delay: physical parameters (foundry); place & route (CAD tools); layout (designer)
4. Cell input capacitance: physical parameters (foundry); cell topology, transistor sizing (library); cell selection (CAD tools)
5. Cell fanout: synthesis (CAD tools); HDL design (designer)
6. Cell drive strength: physical parameters (foundry); transistor sizing (library); cell selection (CAD tools)
Timing Closure: Searching for and beating down the critical path

1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001
Fig. 1. Process SEM cross section.
The process was raised from [1] to limit standby power.
Circuit design and architectural pipelining ensure low voltage
performance and functionality. To further limit standby current
in handheld ASSPs, a longer poly target takes advantage of the
versus dependence and source-to-body bias is used
to electrically limit transistor in standby mode. All core
nMOS and pMOS transistors utilize separate source and bulk
connections to support this. The process includes cobalt disili-
cide gates and diffusions. Low source and drain capacitance, as
well as 3-nm gate-oxide thickness, allow high performance and
low-voltage operation.
III. ARCHITECTURE
The microprocessor contains 32-kB instruction and data
caches as well as an eight-entry coalescing writeback buffer.
The instruction and data cache fill buffers have two and four
entries, respectively. The data cache supports hit-under-miss
operation and lines may be locked to allow SRAM-like oper-
ation. Thirty-two-entry fully associative translation lookaside
buffers (TLBs) that support multiple page sizes are provided
for both caches. TLB entries may also be locked. A 128-entry
branch target buffer improves branch performance in a pipeline
deeper than earlier high-performance ARM designs [2], [3].
A. Pipeline Organization
To obtain high performance, the microprocessor core utilizes
a simple scalar pipeline and a high-frequency clock. In addition
to avoiding the potential power waste of a superscalar approach,
functional design and validation complexity is decreased at the
expense of circuit design effort. To avoid circuit design issues,
the pipeline partitioning balances the workload and ensures that
no one pipeline stage is tight. The main integer pipeline is seven
stages, memory operations follow an eight-stage pipeline, and
when operating in thumb mode an extra pipe stage is inserted
after the last fetch stage to convert thumb instructions into ARM
instructions. Since thumb mode instructions [11] are 16 b, two
instructions are fetched in parallel while executing thumb instructions.
[Fig. 2. Microprocessor pipeline organization.]
A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by
gray. Features that allow the microarchitecture to achieve high
speed are as follows.
The shifter and ALU reside in separate stages. The ARM in-
struction set allows a shift followed by an ALU operation in a
single instruction. Previous implementations limited frequency
by having the shift and ALU in a single stage. Splitting this op-
eration reduces the critical ALU bypass path by approximately
1/3. The extra pipeline hazard introduced when an instruction is
immediately followed by one requiring that the result be shifted
is infrequent.
Decoupled Instruction Fetch.A two-instruction deep queue is
implemented between the second fetch and instruction decode
pipe stages. This allows stalls generated later in the pipe to be
deferred by one or more cycles in the earlier pipe stages, thereby
allowing instruction fetches to proceed when the pipe is stalled,
and also relieves stall speed paths in the instruction fetch and
branch prediction units.
Deferred register dependency stalls. While register depen-
dencies are checked in the RF stage, stalls due to these hazards
are deferred until the X1 stage. All the necessary operands are
then captured from result-forwarding busses as the results are
returned to the register file.
One of the major goals of the design was to minimize the en-
ergy consumed to complete a given task. Conventional wisdom
has been that shorter pipelines are more efficient due to re-
Must consider all paths between connected register pairs, plus input-to-register and register-to-output paths.
• Design tools help in the search.
• Synthesis tools work to meet the clock constraint and report delays on paths.
– Specialized static timing analyzers accept a design netlist and report path delays.
– And, of course, simulators can be used to determine timing performance.
Tools that are expected to do something about the timing behavior (such as synthesizers) also include provisions for specifying input arrival times (relative to the clock) and output requirements (set-up times of the next stage).
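Rather than enumerating every path explicitly, static timing analyzers propagate worst-case arrival times through the netlist in topological order. A minimal sketch, with a hypothetical four-gate netlist and made-up delays:

```python
from collections import defaultdict

# Toy static timing analysis: propagate worst-case arrival times through
# a combinational netlist in topological order instead of listing paths.
# Gate names and delays are hypothetical.
gate_delay = {"g1": 0.2, "g2": 0.3, "g3": 0.1, "out": 0.0}
fanin = {                      # gate -> list of driving nodes
    "g1": ["in_a", "in_b"],
    "g2": ["g1", "in_c"],
    "g3": ["g1"],
    "out": ["g2", "g3"],
}

arrival = defaultdict(float)   # primary inputs arrive at t = 0
for gate in ["g1", "g2", "g3", "out"]:   # a valid topological order
    arrival[gate] = max(arrival[d] for d in fanin[gate]) + gate_delay[gate]

print(arrival["out"])   # worst-case delay to the output
```

This is why STA runtime scales with netlist size rather than with the (exponential) number of paths.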
Timing Analysis, real example
From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
netlist. Of these, 121713 were top-level chip global nets, and 21711 were processor-core-level global nets. Against this model 3.5 million setup checks were performed in late mode at points where clock signals met data signals in latches or dynamic circuits. The total number of timing checks of all types performed in each chip run was 9.8 million. Depending on the configuration of the timing run and the mix of actual versus estimated design data, the amount of real memory required was in the range of 12 GB to 14 GB, with run times of about 5 to 6 hours to the start of timing-report generation on an RS/6000 Model S80 configured with 64 GB of real memory. Approximately half of this time was taken up by reading in the netlist, timing rules, and extracted RC networks, as well as building and initializing the internal data structures for the timing model. The actual static timing analysis typically took 2.5–3 hours. Generation of the entire complement of reports and analysis required an additional 5 to 6 hours to complete. A total of 1.9 GB of timing reports and analysis were generated from each chip timing run. This data was broken down, analyzed, and organized by processor core and GPS, individual unit, and, in the case of timing contracts, by unit and macro. This was one component of the 24-hour-turnaround time achieved for the chip-integration design cycle. Figure 26 shows the results of iterating this process: a histogram of the final nominal path delays obtained from static timing for the POWER4 processor.

The POWER4 design includes LBIST and ABIST (Logic/Array Built-In Self-Test) capability to enable full-frequency ac testing of the logic and arrays. Such testing on pre-final POWER4 chips revealed that several circuit macros ran slower than predicted from static timing. The speed of the critical paths in these macros was increased in the final design. Typical fast ac LBIST laboratory test results measured on POWER4 after these paths were improved are shown in Figure 27.

Summary: The 174-million-transistor, 1.3-GHz POWER4 chip, containing two microprocessor cores and an on-chip memory subsystem, is a large, complex, high-frequency chip designed by a multi-site design team. The performance and schedule goals set at the beginning of the project were met successfully. This paper describes the circuit and physical design of POWER4, emphasizing aspects that were important to the project's success in the areas of design methodology, clock distribution, circuits, power, integration, and timing.
Figure 25: POWER4 timing flow. This process was iterated daily during the physical design phase to close timing.
[Flow diagram: extraction of core or chip wiring feeds Chipbench/EinsTimer timing runs (non-uplift timing, uplift analysis, capacitance adjust, noise impact on timing), producing timer files, asserts, and reports via VIM, Spice, and GL/1; step turnaround times range from < 12 hr to < 48 hr.]
Notes:
• Executed 2–3 months prior to tape-out
• Fully extracted data from routed designs
• Hierarchical extraction (extracted units flat or hierarchical; incrementally extracted RLMs; custom NDRs; VIMs)
• Custom logic handled separately (Dracula, Harmony)
• Extraction done for early and late modes
Figure 26: Histogram of the POWER4 processor path delays: late-mode timing checks (thousands) versus timing slack (ps).

IBM J. RES. & DEV. VOL. 46 NO. 1 JANUARY 2002, J. D. WARNOCK ET AL.
Most paths have hundreds of picoseconds to spare.
The critical path
Timing Optimization

As an ASIC/FPGA designer you get to choose:
‣ The algorithm
‣ The microarchitecture (block diagram)
‣ The HDL description of the CL blocks (number of levels of logic)
‣ Where to place registers and memory (the pipelining)
‣ Overall floorplan and relative placement of blocks
Post-Placement C-slow Retiming for the Xilinx Virtex FPGA
Nicholas Weaver*, UC Berkeley, Berkeley, CA
Yury Markovskiy, UC Berkeley, Berkeley, CA
Yatish Patel, UC Berkeley, Berkeley, CA
John Wawrzynek, UC Berkeley, Berkeley, CA
ABSTRACT
C-slow retiming is a process of automatically increasing the throughput of a design by enabling fine grained pipelining of problems with feedback loops. This transformation is especially appropriate when applied to FPGA designs because of the large number of available registers. To demonstrate and evaluate the benefits of C-slow retiming, we constructed an automatic tool which modifies designs targeting the Xilinx Virtex family of FPGAs. Applying our tool to three benchmarks: AES encryption, Smith/Waterman sequence matching, and the LEON 1 synthesized microprocessor core, we were able to substantially increase the total throughput. For some parameters, throughput is effectively doubled.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids—Automatic synthesis
General Terms
Performance
Keywords
FPGA CAD, FPGA Optimization, Retiming, C-slowRetiming
*Please address any correspondence to [email protected]
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. FPGA'03, February 23–25, 2003, Monterey, California, USA. Copyright 2003 ACM 1-58113-651-X/03/0002 ...$5.00.
1. Introduction
Leiserson's retiming algorithm [7] offers a polynomial time algorithm to optimize the clock period on arbitrary synchronous circuits without changing circuit semantics. Although a powerful and efficient transformation that has been employed in experimental tools [10][2] and commercial synthesis tools [13][14], it offers only a minor clock period improvement for a well constructed design, as many designs have their critical path on a single cycle feedback loop and can't benefit from retiming.

Also proposed by Leiserson et al. to meet the constraints of systolic computation is C-slow retiming.¹ In C-slow retiming, each design register is first replaced with C registers before retiming. This transformation modifies the design semantics so that C separate streams of computation are distributed through the pipeline, greatly increasing the aggregate throughput at the cost of additional latency and flip-flops. This can automatically accelerate computations containing feedback loops by adding more flip-flops that retiming can then move around the critical path.

The effect of C-slow retiming is to enable pipelining of the critical path, even in the presence of feedback loops. To take advantage of this increased throughput, however, there needs to be sufficient task-level parallelism. This process will slow any single task but the aggregate throughput will be increased by interleaving the resulting computation.

This process works very well on many FPGA architectures as these architectures tend to have a balanced ratio of logic elements to registers, while most user designs contain a considerably higher percentage of logic. Additionally, many architectures allow the registers to be used independently of the logic in a logic block.

We have constructed a prototype C-slow retiming tool that modifies designs targeting the Xilinx Virtex family of FPGAs. The tool operates after placement: converting every design register to C separate registers before applying Leiserson's retiming algorithm to minimize the clock period. New registers are allocated by scavenging unused array resources. The resulting design is then returned to Xilinx tools for routing, timing analysis, and bitfile generation.

We have selected three benchmarks: AES encryption, Smith/Waterman sequence matching, and the LEON 1
¹This was originally defined to meet systolic slowdown requirements.
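The C-slow idea from the abstract (replace each register with C registers, yielding C interleaved independent streams) can be illustrated with a toy behavioral simulation of an accumulator. This is only a sketch of the semantics, not the paper's tool:

```python
# C-slow illustration: an accumulator's single feedback register is replaced
# by C registers, turning one stream into C interleaved independent streams.
def c_slow_accumulate(samples, C):
    regs = [0] * C                  # the C registers in the feedback loop
    outputs = []
    for t, x in enumerate(samples):
        stream = t % C              # inputs must be interleaved C ways
        regs[stream] += x
        outputs.append(regs[stream])
    return outputs

# Two interleaved streams: even cycles accumulate 1s, odd cycles accumulate 10s.
print(c_slow_accumulate([1, 10, 1, 10, 1, 10], C=2))
```

Each stream sees its own accumulator running at 1/C of the clock rate, but the C registers in the loop give retiming the slack to pipeline the adder, raising the clock rate and the aggregate throughput.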
Circuit retiming
[Figure 1: A small graph before retiming. The nodes represent logic delays, with the inputs and outputs passing through mandatory, fixed registers. The critical path is 5. Diagram: IN and OUT registers around nodes with delays 1, 1, 1, 1, 2, 2.]
microprocessor core, for which we can envision scenarios where ample task-level parallelism exists. The AES and Smith/Waterman benchmarks were also C-slowed by hand, enabling us to evaluate how well our automated techniques compare with careful, hand-designed implementations that accomplish the same goals.

The LEON 1 processor is a significantly larger synthesized design. Although it seems unusual, there is sufficient task-level parallelism to C-slow a microprocessor, as each stream of execution can be viewed as a separate task. The resulting C-slowed design behaves like a multithreaded system, with each virtual processor running slower but offering a higher total throughput.

This prototype demonstrates significant speedups on all 3 benchmarks, nearly doubling the throughput for the proper parameters. On the AES and Smith/Waterman benchmarks, these automated results compare favorably with careful hand-constructed implementations that were the result of manual C-slowing and pipelining.

In the remainder of the paper, we first discuss the semantic restrictions and changes that retiming and C-slow retiming impose on a design, the details of the retiming algorithm, and the use of the target architecture. Following the discussion of C-slow retiming, we describe our implementation of an automatic retiming tool. Then we describe the structure of all three benchmarks and present the results of applying our tool.
2. Conventional Retiming
Leiserson's retiming treats a synchronous circuit as a directed graph, with delays on the nodes representing combinational delays and weights on the edges representing registers in the design. An additional node represents the external world, with appropriate edges added to account for all the I/Os. Two matrices are calculated, W and D, that represent the number of registers and critical path between every pair of nodes in the graph. Each node also has a lag value r that is calculated by the algorithm and used to change the number of registers on any given edge. Conventional retiming does not change the design semantics: all input and output timings remain unchanged, while imposing minor design constraints on the use of FPGA features. More details and formal proofs of correctness can be found in Leiserson's original paper [7].
In order to determine whether a critical path P can be achieved, the retiming algorithm creates a series of constraints to calculate the lag on each node. All these constraints are of the form x - y <= k and can be solved in O(n²) time by using the Bellman/Ford shortest path algorithm. The primary constraints insure correctness: no edge will have a negative number of registers, while every cycle will always contain the original number of registers. All I/O passes through an intermediate node, insuring that input and output timings do not change. These constraints can be modified to insure that a particular line will contain no registers, or a mandatory minimum number of registers to meet architectural constraints.

[Figure 2: The example in Figure 1 after retiming. The critical path is reduced from 5 to 4. Diagram: IN and OUT registers around the same nodes with delays 1, 1, 1, 1, 2, 2.]
A second set of constraints attempts to insure that every path longer than the critical path will contain at least one register, by creating an additional constraint for every path longer than the critical path. The actual constraints are summarized in Table 1.
This process is iterated to find the minimum critical path that meets all the constraints. The lag calculated by these constraints can then be used to change the design to meet this critical path. For each edge, a new register weight w' is calculated, with w'(e) = w(e) - r(u) + r(v).
An example of how retiming affects a simple design can be seen in Figures 1 and 2. The initial design has a critical path of 5, while after retiming the critical path is reduced to 4. During this process, the number of registers is increased, yet the number of registers on every cycle and the path from input to output remain unchanged. Since the feedback loop has only a single register and a delay of 4, it is impossible to further improve the performance by retiming.
Retiming in this form imposes only minimal design limitations: there can be no asynchronous resets or similar elements, as the retiming technique only applies to synchronous circuits. A synchronous global reset imposes too many constraints to allow retiming unless initial conditions are calculated and the global reset itself is excluded from retiming purposes. Local synchronous resets and enables just produce small self-loops that have no effect on the correct operation of the algorithm.

Most other design features can be accommodated by simply adding appropriate constraints. As an example, all tristated lines can't have registers applied to them, while mandatory elements such as those seen in synchronous memories can be easily accommodated by mandating registers on the appropriate nets.

Memories themselves can be retimed like any other element in the design, with dual-ported memories treated as a single node for retiming purposes. Memories that are synthesized with a negative clock edge (to create the design illusion of asynchronous memories) can either be
Circles are combinational logic, labelled with delays.
Critical path is 5 (ignore FF delay for now). We want to improve it without changing circuit semantics.
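The retiming weight update quoted in the excerpt, w'(e) = w(e) - r(u) + r(v), can be checked on a tiny made-up graph. The graph and the lag values below are illustrative assumptions (not the paper's Figure 1, and the lags are chosen by hand rather than computed by Bellman/Ford):

```python
# Retiming edge-weight update: w'(u,v) = w(u,v) - r(u) + r(v).
# A made-up 3-node cycle with one register and hand-picked lag values.
edges = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 0}   # register counts w(e)
lag = {"a": 1, "b": 0, "c": 0}                          # lag r(v) per node

retimed = {(u, v): w - lag[u] + lag[v] for (u, v), w in edges.items()}
assert all(w >= 0 for w in retimed.values())    # legality: no negative edge weights
# Lags cancel around any cycle, so the cycle's register count is preserved.
assert sum(retimed.values()) == sum(edges.values())
print(retimed)   # the register moves from edge (a, b) to edge (c, a)
```

The two assertions mirror the paper's primary constraints: no edge may go negative, and every cycle keeps its original number of registers.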
IN OUT
1 1
1 1 22
Figure 1: A small graph before retiming. The
nodes represent logic delays, with the inputs and
outputs passing through mandatory, fixed regis-
ters. The critical path is 5.
microprocessor core, for which we can envision scenar-ios where ample task-level parallelism exists. The AESand Smith/Watherman benchmarks were also C-slowed byhand, enabling us to evaluate how well our automated tech-niques compare with careful, hand designed implementa-tions that accomplishes the same goals.
The LEON 1 processor is a significantly larger synthe-sized design. Although it seems unusual, there is su�cienttask level parallelism to C-slow a microprocessor, as eachstream of execution can be viewed as a separate task. Theresulting C-slowed design behaves like a multithreaded sys-tem, with each virtual processor running slower but o↵er-ing a higher total throughput.
This prototype demonstrates significant speedups onall 3 benchmarks, nearly doubling the throughput for theproper parameters. On the AES and Smith/Watermanbenchmarks, these automated results compare favorablywith careful hand-constructed implementations that werethe result of manual C-slowing and pipelining.
In the remainder of the paper, we first discuss the se-mantic restrictions and changes that retiming and C-slowretiming impose on a design, the details of the retimingalgorithm, and the use of the target architecture. Fol-lowing the discussion of C-slow retiming, we describe ourimplementation of an automatic retiming tool. Then wedescribe the structure of all three benchmarks and presentthe results of applying our tool.
2. Conventional Retiming
Leiserson’s retiming treats a synchronous circuit as adirected graph, with delays on the nodes representing com-bination delays and weights on the edges representing reg-isters in the design. An additional node represents theexternal world, with appropriate edges added to accountfor all the I/Os. Two matrixes are calculated, W and D,that represent the number of registers and critical pathbetween every pair of nodes in the graph. Each node alsohas a lag value r that is calculated by the algorithm andused to change the number of registers on any given edge.Conventional retiming does not change the design seman-tics: all input and output timings remain unchanged, whileimposing minor design constraints on the use of FPGA fea-tures. More details and formal proofs of correctness canbe found in Leiserson’s original paper[7].
In order to determine whether a critical path P can beachieved, the retiming algorithm creates a series of con-
IN OUT
1 1
1 1 22
Figure 2: The example in Figure 2 after retiming.
The critical path is reduced from 5 to 4.
straints to calculate the lag on each node. All these con-strains are of the form x � y k that can be solved inO(n2) time by using the Bellman/Ford shortest path al-gorithm. The primary constraints insure correctness: noedge will have a negative number of registers while everycycle will always contain the original number of registers.All IO passes through an intermediate node insuring thatinput and output timings do not change. These constraintscan be modified to insure that a particular line will containno registers or a mandatory minimum number of registersto meet architectural constraints.
A second set of constraints ensures that every path longer than the critical path will contain at least one register, by creating an additional constraint for every such path. The actual constraints are summarized in Table 1.
This process is iterated to find the minimum critical path that meets all the constraints. The lag calculated from these constraints is then used to transform the design to meet this critical path: for each edge, a new register weight w′ is calculated, with w′(e) = w(e) − r(u) + r(v).

An example of how retiming affects a simple design can be seen in Figures 1 and 2. The initial design has a critical path of 5, while after retiming the critical path is reduced to 4. During this process, the number of registers is increased, yet the number of registers on every cycle and on the path from input to output remains unchanged. Since the feedback loop has only a single register and a delay of 4, it is impossible to further improve the performance by retiming.
Retiming in this form imposes only minimal design limitations: there can be no asynchronous resets or similar elements, as the retiming technique applies only to synchronous circuits. A synchronous global reset imposes too many constraints to allow retiming unless initial conditions are calculated and the global reset itself is excluded from retiming. Local synchronous resets and enables simply produce small self-loops that have no effect on the correct operation of the algorithm.
Most other design features can be accommodated by simply adding appropriate constraints. For example, tristated lines cannot have registers placed on them, while mandatory elements such as those in synchronous memories are easily accommodated by mandating registers on the appropriate nets.
Memories themselves can be retimed like any other element in the design, with dual-ported memories treated as a single node for retiming purposes. Memories that are synthesized with a negative clock edge (to create the design illusion of asynchronous memories) can either be
Add a register, move one circle. Performance improves by 20%.
Logic Synthesis tools can do this in simple cases.
35
Retiming Example

36

[Diagram: a register pipeline with a feedback loop; combinational delays of 30, 30, 30, 80, and 20 between registers, with Clk→Q = 10 and Setup = 20.]

Critical Path Delay = 190 (Clk→Q = 10, Setup = 20)

We want to retime to here; however, delay cannot be added to the loop without changing the semantics of the logic. Because of this, many retiming tools stop at loops. Therefore, move the registers to after the loop.
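The period bound from the earlier slides, T ≥ τclk→Q + τCL + τsetup, explains each number in this example. A small sketch follows; the per-segment partition of the combinational delays is illustrative, chosen to match the slide's totals (Clk→Q = 10, Setup = 20).

```python
# Sketch: minimum clock period of a register pipeline is set by the
# worst register-to-register path: T >= t_clk2q + t_comb + t_setup.
# Segment delays below are an assumed partition matching the slides.

CLK_TO_Q, SETUP = 10, 20

def min_period(segment_delays):
    """segment_delays: total combinational delay between each register pair."""
    return CLK_TO_Q + max(segment_delays) + SETUP

print(min_period([30, 160]))         # before retiming: 10 + 160 + 20 = 190
print(min_period([30, 30, 100]))     # registers moved after loop: 130
print(min_period([30, 30, 80, 20]))  # register moved into the loop: 110
```

The three calls reproduce the 190, 130, and 110 critical-path delays of slides 36-38: retiming never changes any individual gate delay, it only re-partitions the combinational logic between registers to reduce the worst segment.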
Retiming Example

37

[Diagram: the same pipeline after moving the registers past the loop; Clk→Q = 10, Setup = 20.]

Critical Path Delay = 130 (Clk→Q = 10, Setup = 20)

This is the retimed solution that many retiming-aware tools will stop at. It is also the optimal solution when the initial values of the registers are not given.
Retiming Example

38

If the registers have the same initial condition, they can be combined and moved into the loop.

[Diagram: the combined register moved into the loop, through the NAND gate; Clk→Q = 10, Setup = 20.]

Critical Path Delay = 110 (Clk→Q = 10, Setup = 20)

The register can be moved through the NAND gate, producing a register at each input. The total delay in the loop is unchanged. Care must be taken to properly set the initial conditions of the registers.
1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001

Fig. 1. Process SEM cross section.

The process VT was raised from [1] to limit standby power. Circuit design and architectural pipelining ensure low-voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the VT-versus-poly-length dependence, and source-to-body bias is used to electrically limit transistor leakage in standby mode. All core nMOS and pMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.
III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation, and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance in a pipeline deeper than earlier high-performance ARM designs [2], [3].
A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in Thumb mode an extra pipe stage is inserted after the last fetch stage to convert Thumb instructions into ARM instructions. Since Thumb-mode instructions [11] are 16 b, two instructions are fetched in parallel while executing Thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the stage boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

Fig. 2. Microprocessor pipeline organization.
The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled instruction fetch. A two-instruction-deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.

Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.
One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to re-

Floorplanning: essential to meet timing.

(Intel XScale 80200)

39
EECS151, UC Berkeley Sp18
Timing Analysis Tools

‣ Static Timing Analysis: tools use delay models for gates and interconnect and trace through circuit paths.
‣ Cell delay models capture, for each input/output pair:
‣ an internal delay (output-load independent)
‣ an output-load-dependent delay
‣ Standalone tools (e.g., PrimeTime) and part of logic synthesis.
‣ Back-annotation takes information from the results of place and route to improve the accuracy of timing analysis.
‣ DC in “topographical mode” uses preliminary layout information to model interconnect parasitics.
‣ Prior versions used a simple fan-out model of gate loading.

[Plot: cell delay as a function of output load]
40
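The path-tracing step of static timing analysis is just a longest-path computation over the gate netlist in topological order. A toy sketch is below; the gate names and delays are invented for illustration and stand in for the per-cell delay models a real STA tool would evaluate.

```python
# Sketch of STA path tracing: arrival time at each node is the max over
# fanins of (fanin arrival + edge delay). Netlist and delays are made up.
from graphlib import TopologicalSorter

delays = {("in", "g1"): 3, ("in", "g2"): 5, ("g1", "g2"): 2, ("g2", "out"): 4}

fanins = {}
for (u, v), d in delays.items():
    fanins.setdefault(v, []).append((u, d))

arrival = {"in": 0}  # primary input arrives at t = 0
order = TopologicalSorter({v: {u for u, _ in ins} for v, ins in fanins.items()})
for node in order.static_order():
    if node in fanins:
        arrival[node] = max(arrival[u] + d for u, d in fanins[node])

print(arrival["out"])  # longest path in -> g1 -> g2 -> out = 3 + 2 + 4 = 9
```

A real tool does the same traversal but with load- and slew-dependent cell delays (the plot above) and wire delays back-annotated from place and route.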
Standard cell characterization
41
Modeling Gate Delay
42
Inverter Transient Response

[Circuit diagrams: (a) low-to-high and (b) high-to-low transitions of an inverter driving a load CL.]

tpHL = f(Ron·CL) = 0.69·RN·CL

V(t) = V0·e^(−t/RC),  t1/2 = ln(2)·RC
43
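The 0.69 factor is just ln(2), the time for a first-order RC response to reach its 50% point. A quick numerical check, with an assumed on-resistance and load:

```python
import math

# Sketch: tpHL = ln(2)*R*C ~= 0.69*R*C for a first-order RC discharge.
# R and C values below are assumed, purely for illustration.
R = 10e3    # 10 kOhm effective on-resistance
C = 10e-15  # 10 fF load capacitance

t_phl = math.log(2) * R * C          # time for V(t) = V0*exp(-t/RC) to halve
print(round(t_phl * 1e12, 2))        # delay in picoseconds -> 69.31
```

With RC = 100 ps, the 50% crossing lands at 69.31 ps, i.e. 0.69·RC as on the slide.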
The Switch – Dynamic Model (Simplified)

[Model: the switch is open when |VGS| < |VT| and conducts through Ron when |VGS| ≥ |VT|. Parasitics: gate capacitance CG, source and drain capacitances CS and CD.]
44
Switch Sizing

What happens if we make a MOSFET W times larger (wider)?

[Model: widening by W scales the on-resistance to Ron/W and scales the capacitances to W·CG, W·CS, and W·CD.]
45
Switch Parasitic Model

The pull-down switch (NMOS):

[Minimum-size switch: resistance RN, gate capacitance CG, drain capacitance CD. Sized by a factor W: resistance RN/W, capacitances W·CG and W·CD.]

We assume transistors of minimal (or at least constant) length. RN is the resistance of a unit-width device; CG and CD are capacitances per unit width.
46
Switch Parasitic Model

The pull-up switch (PMOS):

[Minimum-size switch: resistance RP = 2RN, capacitances CG and CD. Sized for symmetry (width 2): resistance RN, capacitances 2CG and 2CD. General sizing (width 2W): resistance RN/W, capacitances 2W·CG and 2W·CD.]

47

For traditional CMOS processes, pFETs are about twice as resistive as nFETs (the mobility of holes is roughly half that of electrons).
Inverter Parasitic Model

[Inverter of size W: input capacitance Cin = 3W·CG, intrinsic output capacitance Cint = 3W·CD, pull-up and pull-down resistance RN/W.]

The drain and gate capacitances of a transistor are directly related by the process (γ ≈ 1): CD = γ·CG, so Cint = 3W·CD = 3Wγ·CG.

tp = 0.69·(RN/W)·(3WγCG) = 0.69·(3γ)·RN·CG

The intrinsic delay of the inverter is independent of its size.
48
Inverter with Load Capacitance

49

[Inverter with Cin = 3W·CG, Cint = 3Wγ·CG, and resistance RN/W, driving an external load CL.]

f = fanout = ratio of load capacitance (CL) to input capacitance (Cin)

tp = 0.69·(RN/W)·(Cint + CL)
   = 0.69·(RN/W)·(3WγCG + CL)
   = 0.69·(3γRN·CG)·(1 + CL/(γCin))
   = tp0·(1 + CL/(γCin)) = tp0·(1 + f/γ)
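The final line, tp = tp0·(1 + f/γ), can be evaluated directly. The tp0 and γ values in this sketch are assumed, normalized numbers chosen only to show the linear dependence on fanout:

```python
# Sketch: linear delay model tp = tp0 * (1 + f/gamma).
# tp0 (intrinsic delay, here in ps) and gamma are assumed values.
tp0, gamma = 5.0, 1.0

def tp(f):
    """Propagation delay at fanout f = CL/Cin."""
    return tp0 * (1 + f / gamma)

print([tp(f) for f in (0, 1, 4)])  # [5.0, 10.0, 25.0]
```

At f = 0 only the intrinsic (self-loading) delay tp0 remains; each added unit of fanout adds tp0/γ of delay.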
Inverter Delay Model

tp = tp0·(1 + f/γ)

[Plot: delay vs. fanout f; a line with y-intercept tp0.]

Question: how does transistor sizing (W) impact delay?

50

Delay is linearly proportional to the fanout f. For f = 0, the delay is the intrinsic delay tp0.
2-input NAND Gate

❑ Derive the formula for the gate propagation delay as a function of fanout, f (as with the inverter).
❑ We derive the equations based on the input connected to the transistor closest to the output (A), assuming the B input has been set to 1 (for a long time).
❑ So we can fairly compare to the inverter, size the transistors so that the capacitance of each input is equivalent to the input capacitance of the inverter.
❑ Assume that the resistance of the pFET is twice that of the nFET (RP = 2RN) if the pFET and nFET have the same width.
❑ Size the transistors so that the rise and fall times are equivalent.
❑ For the 2 transistors in series, ignore the capacitance at their shared node.

51
2-input NAND Gate

❑ Solution setup:
❑ To keep the pull-up and pull-down delays the same, WP = WN, because RP = 2RN and the pull-down has two nFETs in series.
❑ Remember, the inverter had an input capacitance of 3CG, with WP = 2WN.
❑ Therefore, here we increase the widths by 3/2 relative to the inverter, so each transistor's resistance changes by a factor of 2/3 (shown in the figure).

[Circuit: inputs A and B, each with input capacitance 3CG; two parallel pFETs of resistance (4/3)RN; two series nFETs of resistance (2/3)RN each; internal capacitance Cint and load CL.]

52
2-input NAND Gate

53

[Same circuit as before: inputs A and B with 3CG each, (4/3)RN pull-up devices, (2/3)RN series pull-down devices, Cint, CL.]

❑ Solve: Cint = (6/2)CD + (3/2)CD = (9/2)CD (two pFET drains of (3/2)CD each, plus the top nFET drain).
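The payoff of this sizing exercise is the fanout-delay slope of each gate, which follows directly from the equivalent resistances derived above. A small sketch in normalized units (RN = CG = 1, assumed):

```python
# Sketch: the delay added per unit fanout is 0.69 * R_eq * 3*CG, since
# each unit of fanout adds one 3*CG input load. R_eq values come from
# the sizing above: inverter RN, NAND2 (4/3)RN, NOR2 (5/3)RN.
RN, CG = 1.0, 1.0  # normalized units (assumed)

inv_slope = 0.69 * RN * 3 * CG
nand_slope = 0.69 * (4 / 3) * RN * 3 * CG
nor_slope = 0.69 * (5 / 3) * RN * 3 * CG

print(round(nand_slope / inv_slope, 2), round(nor_slope / inv_slope, 2))
# 1.33 1.67
```

These 4/3 and 5/3 ratios are exactly the slopes quoted on the Gate Delay Summary slide: NAND2 is steeper than the inverter, and NOR2 steeper still (which is why static CMOS favors NANDs over NORs).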
2-input NOR Gate

54

[Circuit: inputs A and B, each with input capacitance 3CG; two series pFETs of resistance (5/6)RN each; two parallel nFETs of resistance (5/3)RN. Sizing: WP = 4WN.]

Cint = 2·(3/5)CD + (12/5)CD = (18/5)CD (two nFET drains of (3/5)CD each, plus the bottom pFET drain of (12/5)CD).
Gate Delay Summary

55

[Plot: tp vs. fanout f for the inverter, 2-NAND, and 2-NOR.]

The y-intercepts for NAND and NOR are both twice that of the inverter. The NAND line has a gradient 4/3 that of the inverter (steeper); for NOR it is 5/3 (steepest).

What about gates with more than 2 inputs? Look at the 4-input NAND.
Adding Wires to Gate Delay

‣ Wires have finite resistance and capacitance, so they have distributed R and C: with r = resistance/length and c = capacitance/length, modeling the wire as n segments gives ∆t ∝ (rc + 2rc + 3rc + ⋯)·(L/n)², which grows as rcL².
Spring 2003 EECS150 – Lec10-Timing Page 16

Wire Delay

• Even in those cases where the transmission-line effect is negligible:
– Wires possess distributed resistance and capacitance.
– The time constant associated with distributed RC is proportional to the square of the length.
• For short wires on ICs, resistance is insignificant (relative to the effective R of transistors), but C is important.
– Typically around half of the C of a gate load is in the wires.
• For long wires on ICs (busses, clock lines, global control signals, etc.):
– Resistance is significant, so the distributed RC effect dominates.
– Signals are typically “rebuffered” to reduce delay.

[Plot: voltage vs. time at points v1–v4 along a rebuffered wire.]
56
‣ Wire propagation delay is around half of what it would be if R and C were “lumped”: tp = 0.38·(rL)·(cL) = 0.38·rcL²
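The L² dependence can be seen numerically by chopping the wire into n RC segments and summing the Elmore delay. Note the Elmore sum converges to 0.5·rcL²; the 0.38 factor quoted above is the 50% threshold crossing of the true distributed line. The r, c, and L values here are assumed, purely for illustration.

```python
# Sketch: Elmore delay of an n-segment RC ladder model of a wire.
# Capacitor k sees the resistance of the first k segments, so the
# sum is dR*dC*(1 + 2 + ... + n) = r*c*L^2 * (n+1)/(2n) -> 0.5*r*c*L^2.
r, c, L = 0.1, 0.2, 1000.0  # per-unit-length R and C, and length (assumed)

def elmore(n):
    dR, dC = r * L / n, c * L / n
    return sum(k * dR * dC for k in range(1, n + 1))

print(round(elmore(1)), round(elmore(1000), 1))  # 20000 10010.0
```

One lump (n = 1) gives the full rcL² = 20000; a fine ladder approaches 0.5·rcL² = 10000, i.e. the "around half" claim above. Doubling L quadruples the delay, which is why long wires are rebuffered.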
Gate Driving a Long Wire and Other Gates

57

With Rw = rw·L and Cw = cw·L:

tp = 0.69·Rdr·Cint + 0.69·Rdr·Cw + 0.38·Rw·Cw + 0.69·Rdr·Cfan + 0.69·Rw·Cfan
   = 0.69·Rdr·(Cint + Cfan) + 0.69·(Rdr·cw + rw·Cfan)·L + 0.38·rw·cw·L²
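Evaluating this expression shows which term dominates for a given wire length. All of the driver, wire, and load values in this sketch are assumed, chosen only to be plausible orders of magnitude:

```python
# Sketch: total delay of a gate driving its own parasitics, a wire of
# length L, and fanout loads. All parameter values are assumed.
Rdr, Cint, Cfan = 1e3, 5e-15, 20e-15   # driver R (ohm), self/fanout C (F)
rw, cw, L = 0.08, 0.2e-15, 500.0       # ohm/um, F/um, wire length in um

tp = (0.69 * Rdr * (Cint + Cfan)          # driver charging lumped caps
      + 0.69 * (Rdr * cw + rw * Cfan) * L  # terms linear in wire length
      + 0.38 * rw * cw * L**2)             # distributed-wire term
print(round(tp * 1e12, 2))  # total delay in ps
```

For these numbers the linear term (driver resistance charging the wire capacitance) dominates and the quadratic distributed-RC term is small; as L grows, the L² term eventually takes over, which is the motivation for the staged buffering discussed next.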
Driving Large Loads

‣ Large-fanout nets: clocks, resets, memory bit lines, off-chip.
‣ A relatively small driver results in a long rise time (and thus a large gate delay).
‣ Strategy: staged buffers.
‣ How to optimally scale the drivers? What is the optimal trade-off between delay per stage and total number of stages?

58
Driving Large Loads

[Inverter chain from In to Out driving a large load CL.]

❑ For some given CL:
▪ How many stages are needed to minimize delay?
▪ How should the inverters be sized?

❑ You get the fastest final-stage delay if you build one very big inverter, so big that its delay is set only by self-loading.
❑ That is likely not the solution you're interested in: someone has to drive this inverter, and a big inverter has a large input capacitance!

59
Delay Optimization

❑ First assume given:
▪ a fixed number of inverters,
▪ the size* of the first inverter,
▪ the size of the load that needs to be driven.
❑ What is the minimal delay of the inverter chain?

60

* Note: when we talk about inverter (or gate) “size”, we refer to the width of the transistors making up the circuit.
61

[Inverter chain: stages 1, 2, …, N with input capacitances Cg,1 … Cg,N, driving CL from In to Out.]

❑ Delay of the j-th inverter stage:

tp,j = tp0·(1 + Cg,j+1/(γ·Cg,j)) = tp0·(1 + fj/γ)

❑ Total delay of the chain:

tp = Σj=1..N tp,j = tp0·Σj=1..N (1 + Cg,j+1/(γ·Cg,j)),  with Cg,N+1 = CL
Optimum Delay and Number of Stages

❑ Each inverter should be sized up by the same factor f with respect to the preceding inverter, so each stage has the same delay.
❑ Given Cg,1 and CL:

f = (CL/Cg,1)^(1/N) = F^(1/N)

❑ where F represents the overall fan-out of the circuit.
❑ The minimal delay through the chain is

tp = N·tp0·(1 + F^(1/N)/γ)

62
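Tabulating N·tp0·(1 + F^(1/N)/γ) for a fixed overall fanout shows the trade-off between stage count and per-stage fanout. The F, tp0, and γ values here are assumed, normalized numbers:

```python
# Sketch: total chain delay N*tp0*(1 + F**(1/N)/gamma) for an assumed
# overall fanout F = 64, with tp0 = 1 and gamma = 1 (normalized units).
F, tp0, gamma = 64, 1.0, 1.0

def chain_delay(N):
    return N * tp0 * (1 + F ** (1 / N) / gamma)

print({N: round(chain_delay(N), 1) for N in (1, 2, 3, 4, 6)})
# N = 3 (stage fanout f = 4) minimizes delay for F = 64
```

Too few stages means each stage drives a huge fanout; too many means the fixed intrinsic delay tp0 is paid too often. For F = 64 the minimum lands at N = 3, i.e. a per-stage fanout of 4.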
Example

[Inverter chain: stage sizes C1, f·C1, f²·C1 driving CL = 8·C1.]

CL/C1 has to be evenly distributed across N = 3 stages: f = 8^(1/3) = 2, so the stage sizes are 1, 2, and 4.
63
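The example's sizing can be checked in a couple of lines, using the f = F^(1/N) rule from the previous slide:

```python
# Sketch: the slide's example, CL = 8*C1 driven through N = 3 stages.
# Optimal per-stage fanout f = F**(1/N); stage sizes grow as f**j.
F, N = 8, 3
f = F ** (1 / N)
sizes = [round(f ** j) for j in range(N)]  # rounded to whole multiples of C1
print(round(f), sizes)  # 2 [1, 2, 4]
```

Each stage sees the same fanout of 2: stage 1 (size 1) drives size 2, stage 2 drives size 4, and stage 3 drives CL = 8.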
Delay Optimization

❑ Now assume given:
▪ the size of the first inverter,
▪ the size of the load that needs to be driven.
❑ Minimize delay by finding the optimal number and sizes of gates.
❑ So, we need to find the N that minimizes:

64

tp = N·tp0·(1 + F^(1/N)/γ),  F = CL/Cg,1
Finding the Optimal Fanout per Stage

65

tp = N·tp0·(1 + F^(1/N)/γ),  F = CL/Cg,1

❑ Differentiate w.r.t. N and set to 0:

γ + F^(1/N) − F^(1/N)·ln(F)/N = 0  ⇒  f = e^(1+γ/f)

❑ A closed form exists only if γ = 0: then N = ln(F) and f = e.
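The implicit equation f = e^(1+γ/f) has no closed form for γ > 0, but it converges quickly under simple fixed-point iteration:

```python
import math

# Sketch: solving f = exp(1 + gamma/f) by fixed-point iteration.
def f_opt(gamma, iters=100):
    f = 4.0  # initial guess; the map is a contraction near the root
    for _ in range(iters):
        f = math.exp(1 + gamma / f)
    return f

print(round(f_opt(0.0), 2), round(f_opt(1.0), 2))  # 2.72 3.59
```

For γ = 0 this recovers f = e exactly, and for γ = 1 it gives f ≈ 3.59, matching the fopt = 3.6 quoted on the next slide.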
Optimum Effective Fanout f

❑ The optimum f for a given process is defined by γ:

f = e^(1+γ/f)

[Plot: fopt vs. γ, rising from fopt = e at γ = 0; fopt = 3.6 for γ = 1.]

66
In Practice: Plot of Total Delay

❑ Why the shape?
❑ The curves are very flat for f > 2.
▪ Simplest/most common choice: f = 4.

[Hodges, p. 281]

67
Normalized Delay as a Function of F

tp = N·tp0·(1 + F^(1/N)/γ),  F = CL/Cg,1

(γ = 1)

[Rabaey, p. 210]

68