Pipelined Design

Pipelined Design

9תרגול

Latency and Throughput

Latency - Total time to perform a single operation from start to end.

Throughput – operations / latency ratio or GOPS, giga-operations per second.

The clock cycle is always bounded by the slowest operation.

Picoseconds - 10^-12 Nanoseconds - 10^-9

Latency and Throughput

A B C D E FR

E

G

80ps 30ps 60ps 50ps 70ps 10ps 20ps

How to maximize throughput using an additional register? Put it between C and D

How to maximize throughput using 2 additional registers? AB, CD, EF

How to maximize throughput using 3 additional registers? A, BC, D, EF

What is the cycle time, latency and throughput in that case? 110ps, 440ps, (1/110) * 1000 = 9.09

How to get max throughput with min stages? 5 stages, since we cannot go lower than 100ps for a clock cycle

Pipeline registers

Select A,Since only jxx and call need valP at further stages (E and M resp.)

Predicted PC

5

Control Hazards

• PC prediction (Fetch stage) jxx , ret → ? call, jmp → ValC Other → ValP

• Branch prediction strategies• Always-taken• Never-taken• Backward taken, forward not taken

Why are forward branches less common ? How can we deal with stack prediction (ret) ?

6

Data Hazards

Data dependencies in Processors Result from one instruction is being used as an

operand for another (nop example) Stalling / forwarding / load-use hazard

Data Dependencies: 2 Nop’s

0x000: irmovl $10,%edx

1 2 3 4 5 6 7 8 9

F D E M WF D E M W

0x006: irmovl $3,%eax F D E M WF D E M W

0x00c: nop F D E M WF D E M W

0x00d: nop F D E M WF D E M W

0x00e: addl %edx,%eax F D E M WF D E M W

0x010: halt F D E M WF D E M W

10# demo-h2.ys

W

R[ %eax ] 3

D

valA R[ %edx = ]10

valB R[ %eax = ]0

•••

W

R[ %eax ] 3

W

R[ %eax ]← 3

D

valA R[ %edx = ]10

valB R[ %eax = ]0

D

valA R[ %edx [ = 10

valB R[ %eax [ = 0

•••

Cycle 6

Error

←

←

Stalling solution

If instruction follows too closely after one that writes register, slow it down

Hold instruction in decode Dynamically inject nop into execute stage


1 2 3 4 5 6 7 8 9

F D E M W

0x006: irmovl $3,%eax F D E M W

0x00c: nop F D E M W

bubble

F

E M W

0x00e: addl %edx,%eax D D E M W

0x010: halt F D E M W

10

#demo-h2.ys

F

F D E M W0x00d: nop

Data Forwarding Solution

irmovl in write-back stage

Destination value in W pipeline register

Forward as valB for decode stage


1 2 3 4 5 6 7 8 9

F D E M WF D E M W

0x006: irmovl $3,%eax F D E M WF D E M W

0x00c: nop F D E M WF D E M W

0x00d: nop F D E M WF D E M W

0x00e: addl%edx,%eax F D E M WF D E M W

0x010: halt F D E M WF D E M W

#demo-h2.ys

Cycle 6

W

R[ %eax ] 3

D

valA R[ %edx [ = 10

valB W_valE= 3

•••

W_dstE= %eaxW_valE = 3

srcA= %edxsrcB= %eax

←

←←

Limitation of Forwarding

Load-use dependency Value needed by end of

decode stage in cycle 7 Value read from memory in

memory stage of cycle 8


1 2 3 4 5 6 7 8 9

F D E M

W0x006: irmovl $3,%ecx F D E M

W

0x00c: rmmovl %ecx, 0(%edx) F D E M W

0x012: irmovl $10,%ebx F D E M W

0x018: mrmovl 0(%edx),%eax # Load %eax F D E M W

#demo-luh.ys

0x01e: addl%ebx,%eax # Use %eax

0x020: halt

F D E M W

F D E M W

10

F D E M W

11

Error

MM_dstM= %eaxm_valM M]128[ = 3

Cycle 7 Cycle 8

D

valA M_valE= 10valB R[ %eax = ]0

D

valA M_valE= 10valB R[ %eax[ = 0

MM_dstE= %ebxM_valE= 10

•••

←

←←

Avoiding Load/Use Hazard

Stall using instruction for one cycle

Can then pick up loaded value by forwarding from

memory stage


1 2 3 4 5 6 7 8 9

F D E M

W

F D E M

W0x006: irmovl $3,%ecx F D E M

W

F D E M

W

0x00c: rmmovl %ecx, 0(%edx) F D E M WF D E M W

0x012: irmovl $10,%ebx F D E M WF D E M W

0x018: mrmovl 0(%edx),%eax# Load %eax F D E M WF D E M W

#demo-luh.ys

0x01e:addl%ebx,%eax# Use %eax

0x020: halt

F D E M W

E M W

10

D D E M W

11

bubble

F D E M W

F

F

MM_dstM = %eax

m_valM0M]128[ = 3

MM_dstM= %eaxm_valM M]128[ = 3

Cycle 8

D

valA W_valE= 10valB m_valM= 3

D

valA W_valE= 10valB m_valM= 3

WW_dstE = %ebx

W_valE= 10

WW_dstE= %ebxW_valE= 10

••

←

←←

12

Control Logic

Processing “ret”Must stall until instruction reaches write back

Load/use hazardMust stall between read memory and use

Mis-predicted branch removing instructions from the pipe

• Implementing the forwarding logic

• Note: 5 forwarding sources

Documents

Pipelined Design