CSL718 : Pipelined Processors

Anshul Kumar, CSE IITD

CSL718 : Pipelined ProcessorsCSL718 : Pipelined ProcessorsCSL718 : Pipelined ProcessorsCSL718 : Pipelined Processors

PipelineTimings

12th Jan, 2006

Anshul Kumar, CSE IITD slide 2

Pipelined ProcessorsPipelined ProcessorsPipelined ProcessorsPipelined Processors

Function-parallel

Instr level (ILP) Thread level Process level

Pipelined processors

VLIWs Superscalar processors

Parallel architectures

Data-parallel

Intel’s terminology:

• intra ILP

• inter ILP


Processor PerformanceProcessor PerformanceProcessor PerformanceProcessor Performance

• MIPS and MFLOPS

may not truly represent performance

• Execution time of a program

true measure of performance

• SPEC rating

acceptable


Execution Time and Clock PeriodExecution Time and Clock PeriodExecution Time and Clock PeriodExecution Time and Clock Period

Program exec time = Tprog = N * Tinst

= N * CPI * tN : Number of instructionsCPI : Cycles per instruction(Av)t : Clock cycle time

IF D RF EX/AG M WB

Instruction execution time = Tinst = CPI* tt


What influences clock period?What influences clock period?What influences clock period?What influences clock period?

Tprog = N * CPI * t

Technology - t

Software - N

Architecture - N * CPI * t Instruction set architecture (ISA)

trade-off N vs CPI * t

Micro architecture (A)

trade-off CPI vs t


Determining Clock PeriodDetermining Clock PeriodDetermining Clock PeriodDetermining Clock Period

Clock Period = t = Pmax

Pmax = max propagation delay

Clock

Pmax

CombReg Reg


Ideal PipeliningIdeal PipeliningIdeal PipeliningIdeal Pipelining

t = Tinst / SCPI = 1Effective time per inst Teff = 1 * Tinst / S

TinstS stages


Pipelining with hazardsPipelining with hazardsPipelining with hazardsPipelining with hazards

t = Tinst / SCPI = 1 + (S - 1) * bTeff = (1 + (S - 1) * b) * Tinst / S

TinstS stages

Frequency of interruptions - b


Teff vs. S (Tinst=10, b=.2)

024681012

1 2 3 4 5 6 7 8 9 10S

Tef

f


A more realistic viewA more realistic viewA more realistic viewA more realistic view

t = Pmax + CPmax = max propagation delay

C = clocking overhead

Clock

Pmax C

CombReg Reg


Clocking OverheadClocking OverheadClocking OverheadClocking Overhead

• Fixed overhead c– Setup time – Output delay

• Variable overhead (stretching factor) k

– Clock skew

t = Tinst / S + k * Tinst / S + c

= (1 + k) * Tinst / S + c


Pipelining with Clocking OverheadPipelining with Clocking OverheadPipelining with Clocking OverheadPipelining with Clocking Overhead

Teff = [1 + (S - 1) * b] *

[(1 + k) * Tinst / S + c]

Sopt = [(1 - b) * (1 + k) * Tinst / (b * c)]


Teff vs. S (Tinst=10, b=.2, k=.1, c=1)

0

5

10

15

1 3 5 7 9 11 13 15S

Tef

f


Partitioning instruction into cycles Partitioning instruction into cycles with non-uniform stage timeswith non-uniform stage times

Partitioning instruction into cycles Partitioning instruction into cycles with non-uniform stage timeswith non-uniform stage times

IF D RF AG T DF EX PA

One action - one pipeline stage => large quantization overhead

Multiple actions per stage?Multiple stages per action?


ExampleExampleExampleExample Put Away 2 ns

Data - ALU 3 ns

Addr - MAR 3 ns

Data - IR 3 ns

PC - MAR 4 ns

Cache Dir 6 ns

Cache Dir 6 ns

Cache Data 10 ns

Decode 6+6 ns

Gen Addr 9ns

Cache Data 10 ns

Execute 7+7+8 ns


Optimal PipeliningOptimal PipeliningOptimal PipeliningOptimal Pipelining

Tinst = 4+6+10+3+12+9+3+6+10+3+22+2

= 90 ns

b = 0.2 c = 4 ns k = 5%

Sopt = [(1 - b) * (1 + k) * Tinst / (b * c)]

= 9.7 9

Tseg = 10 ns



Data - ALU 3 ns

Addr - MAR 3 ns

Data - IR 3 ns

PC - MAR 4 ns

Cache Dir 6 ns

Cache Dir 6 ns

Cache Data 10 ns

Decode 6+6 ns

Gen Addr 9ns

Cache Data 10 ns

Execute 7+7+8 ns

Tseg = 10 ns

S = 10t = 14.5 nsS * t = 145 ns



Data - ALU 3 ns

Addr - MAR 3 ns

Data - IR 3 ns

PC - MAR 4 ns

Cache Dir 6 ns

Cache Dir 6 ns

Cache Data 10 ns

Decode 6+6 ns

Gen Addr 9ns

Cache Data 10 ns

Execute 7+7+8 ns

S = 9

Tseg = 13 nst = 17.65 nsS * t = 159 ns



Data - ALU 3 ns

Addr - MAR 3 ns

Data - IR 3 ns

PC - MAR 4 ns

Cache Dir 6 ns

Cache Dir 6 ns

Cache Data 10 ns

Decode 6+6 ns

Gen Addr 9ns

Cache Data 10 ns

Execute 7+7+8 ns

Tseg = 20 ns

S = 5t = 25 nsS * t = 125 ns


ComparisonComparisonComparisonComparison

S Tseg t S * t Teff

9 13 17.65 159 45.89

10 10 14.50 145 40.60

5 20 25.00 125 45.00


Cycle QuantizationCycle QuantizationCycle QuantizationCycle Quantization

Delays are not integral multiple of clock period

Total overhead = clocking overhead

+ quantization overhead

S * t Tinst + S * C (ignoring k)

quantization overhead = S * (t - C) -Tinst

reduces as clock period becomes small


Other Timing ApproachesOther Timing ApproachesOther Timing ApproachesOther Timing Approaches

• Self Timed Circuits– No centralized free running clock– An operation begins as soon as its inputs are

available, that is, all its predecessors have completed

– Higher speed, lower power consumption

• Wave Pipelining– Omit inter-stage registers– Reduced clocking overhead


Conventional vs Wave PipeliningConventional vs Wave PipeliningConventional vs Wave PipeliningConventional vs Wave Pipelining

Conventional Pipeline• Registers separate

adjoining stages

• Clock period > max prop delay

• Inter-stage data stored in registers

Wave Pipeline• No registers between

adjoining stages

• Clock period less than max prop delay

• Waves of data propagate through combinational network (effectively, data is stored in the combinational circuit delay!)


No pipeliningNo pipeliningNo pipeliningNo pipelining

X

Clock

Reg Reg

X

X’ Y

X’

Y

Conventional pipeliningConventional pipeliningConventional pipeliningConventional pipeliningX

Clock

Reg Reg

X

X’ Y Y’ Z Z’ W

X’

Y

Y’

Z

Z’

W


Wave pipeliningWave pipeliningWave pipeliningWave pipeliningX

Clock

Reg Reg

X

Z’ W

Z’

W


TimingTimingTimingTiming

Comb cktX Y

Clock

Reg Reg

X

Y

ppropagation delay

sset-up time

T p + sTclock period


Timing with clock skewTiming with clock skewTiming with clock skewTiming with clock skew

Comb cktX Y

Clock

Reg Reg

X

Y

p s

T

T p + s + 2

Clock skew =


Variation in propagation delayVariation in propagation delayVariation in propagation delayVariation in propagation delay

• Different delays in different paths

• Delay variation due to process / temperature/ power variations

• Data-dependent delay variations


Timing for wave pipeliningTiming for wave pipeliningTiming for wave pipeliningTiming for wave pipelining

Comb cktX Y

Clock

Reg Reg

X

Y

T p + s + 4

pmin

pmax

p

T


Timing for wave pipeliningTiming for wave pipelining(expanded view)(expanded view)

Timing for wave pipeliningTiming for wave pipelining(expanded view)(expanded view)

pmin (n-1) T + 2 nT pmax + s + 2 T p + s + 4

p

T

X

Y

(n-1) T nTpmin pmax


ComparisonComparisonComparisonComparison

Conventional PipelineT pmax/n + s + 2 (plus cycle quantization

overhead)

nT pmax + ns + 2n

Wave PipelineT p + s + 4

nT pmax + s + 2


Problems with wave pipeliningProblems with wave pipeliningProblems with wave pipeliningProblems with wave pipelining

• Need to balance delays

• Narrow range of clock frequencies

• Control difficult

• Not very suitable for non-linear pipelines


Additional ReadingAdditional ReadingAdditional ReadingAdditional Reading

Wayne P. Burleson, Maciej Ciesielski, Fabian Klass, and Wentai Liu, “Wave-Pipelining: A Tutorial and Research Survey”, IEEE Trans. on VLSI Systems, vol. 6, no. 3, September 1998, pp. 464 – 474.

Documents

CSL718 : Pipelined Processors