-8...IF1 IF2 ID EX MEM1 MEM2 WB Zero Zero BRANCH BR1 7-stage pipelined version of the late-branch design of the 1st edition HDU control Reg Instr. TLB Instr. cache HDU Data TLB Data


© Copyright by Michel Dubois March 1997

[Figure: Pipelined CPU (Late Branch from 1st Ed.) for the EE457 class Lab #6. Original drawing provided by Prof. Dubois; dated 3/26/2000. The datapath shows the IF, ID, EX, MEM, and WB stages separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB stage registers, together with the PC, instruction memory, register file, sign extender, shift-left-2 unit, ALU and ALU control, data memory, forwarding unit, hazard detection unit, and control unit. Labeled control signals include RegDst, ALUSrc, ALUOp, MemRead, MemWrite, MemtoReg, RegWrite, Branch, IF.Flush, ID.Flush, and EX.Flush.]


382 Chapter 4 The Processor

Delayed branching was a simple and effective solution for a five-stage pipeline issuing one instruction each clock cycle. As processors go to both longer pipelines and issuing multiple instructions per clock cycle (see Section 4.10), the branch delay becomes longer, and a single delay slot is insufficient. Hence, delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches. Simultaneously, the growth in available transistors per chip has made dynamic prediction relatively cheaper.

Elaboration: A branch predictor tells us whether or not a branch is taken, but still requires the calculation of the branch target. In the five-stage pipeline, this calculation takes one cycle, meaning that taken branches will have a 1-cycle penalty. Delayed branches are

FIGURE 4.64 Scheduling the branch delay slot. The top box in each pair shows the code before scheduling; the bottom box shows the scheduled code. In (a), the delay slot is scheduled with an independent instruction from before the branch. This is the best choice. Strategies (b) and (c) are used when (a) is not possible. In the code sequences for (b) and (c), the use of $s1 in the branch condition prevents the add instruction (whose destination is $s1) from being moved into the branch delay slot. In (b) the branch delay slot is scheduled from the target of the branch; usually the target instruction will need to be copied because it can be reached by another path. Strategy (b) is preferred when the branch is taken with high probability, such as a loop branch. Finally, the branch may be scheduled from the not-taken fall-through as in (c). To make this optimization legal for (b) or (c), it must be OK to execute the sub instruction when the branch goes in the unexpected direction. By “OK” we mean that the work is wasted, but the program will still execute correctly. This is the case, for example, if $t4 were an unused temporary register when the branch goes in the unexpected direction.

a. From before

    Before scheduling:
        add $s1, $s2, $s3
        if $s2 = 0 then
            [Delay slot]

    Becomes:
        if $s2 = 0 then
            add $s1, $s2, $s3

b. From target

    Before scheduling:
        sub $t4, $t5, $t6
        . . .
        add $s1, $s2, $s3
        if $s1 = 0 then
            [Delay slot]

    Becomes:
        add $s1, $s2, $s3
        if $s1 = 0 then
            sub $t4, $t5, $t6

c. From fall-through

    Before scheduling:
        add $s1, $s2, $s3
        if $s1 = 0 then
            [Delay slot]
        sub $t4, $t5, $t6

    Becomes:
        add $s1, $s2, $s3
        if $s1 = 0 then
            sub $t4, $t5, $t6
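The legality test behind strategy (a) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `Instr` record and the `can_fill_from_before` helper are hypothetical names, and the check covers only the dependence named in the caption (the branch condition must not read the candidate's destination register).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical minimal instruction record: opcode, destination, sources."""
    op: str
    dest: str
    srcs: tuple


def can_fill_from_before(candidate: Instr, branch_srcs: tuple) -> bool:
    """Return True if the instruction just before the branch may be moved
    into the delay slot, i.e. the branch condition does not read the
    register that the candidate writes."""
    return candidate.dest not in branch_srcs


add = Instr("add", "$s1", ("$s2", "$s3"))

# Case (a): the branch tests $s2, so the add is independent of it.
print(can_fill_from_before(add, ("$s2",)))

# Cases (b) and (c): the branch tests $s1, which the add produces.
print(can_fill_from_before(add, ("$s1",)))
```

When the check fails, the scheduler has to fall back to strategy (b) or (c), which in turn requires that executing the borrowed instruction is harmless on the unexpected path.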


ee457_Lab6_Part4_r3_for_lecture.fm

10/29/06 19 / 32 C Copyright 2006 Gandhi Puvvada

[Figure: All 4 pipelines. Each panel shows the PC, the pipeline stages, the control unit, the HDU (hazard detection unit), the FU (forwarding unit), the Zero comparison, and the BRANCH and BR1 signals.

1. 5-stage pipeline of the late-branch design of the 1st edition: IF ID EX MEM WB, with instruction memory, register file, and data memory.
2. 5-stage pipeline of the early-branch design of the 3rd ed. and our lab 6: IF ID EX MEM WB, with the additional FU_Br and HDU_Br units.
3. 7-stage pipelined version of the late-branch design of the 1st edition: IF1 IF2 ID EX MEM1 MEM2 WB, with Instr. TLB, Instr. cache, Data TLB, and Data cache.
4. 7-stage pipelined version of the early-branch design of the 3rd ed. and our lab 6: IF1 IF2 ID EX MEM1 MEM2 WB, with Instr. TLB, Instr. cache, Data TLB, Data cache, FU_Br, and HDU_Br.]
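To compare the four pipelines, the following Python sketch counts how many younger instructions sit behind a branch when it resolves, which is the number that must be flushed (or covered by delay slots) on a successful branch. The resolution stages are assumptions rather than facts read off the drawings: MEM for the 5-stage late-branch design, MEM1 for its 7-stage version, and ID for both early-branch designs. The function name is hypothetical.

```python
FIVE_STAGE = ["IF", "ID", "EX", "MEM", "WB"]
SEVEN_STAGE = ["IF1", "IF2", "ID", "EX", "MEM1", "MEM2", "WB"]


def flushed_on_taken_branch(stages, resolve_stage, delay_slots=0):
    """Number of younger instructions to flush when a branch is found
    successful in `resolve_stage`. Every stage ahead of the resolving
    stage holds one younger instruction; architectural delay slots are
    executed rather than flushed."""
    younger = stages.index(resolve_stage)
    return max(younger - delay_slots, 0)


# Assumed resolution stages (see the lead-in above).
print(flushed_on_taken_branch(FIVE_STAGE, "MEM"))    # late branch, 5 stages
print(flushed_on_taken_branch(FIVE_STAGE, "ID"))     # early branch, 5 stages
print(flushed_on_taken_branch(SEVEN_STAGE, "MEM1"))  # late branch, 7 stages
print(flushed_on_taken_branch(SEVEN_STAGE, "ID"))    # early branch, 7 stages
```

Under these assumptions the early-branch 7-stage design flushes two stages, which matches the "flushing two stages of the 7-stage pipeline" wording used in the Lab 6 Part 4 question.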


March 22, 2018 10:13 am EE457 MT - Spring 2018 8 / 10 C Copyright 2018 Gandhi Puvvada

3 ( 6*3 = 18 + 7 bonus = 25 points) 15 min. Flushing by a successful branch:

3.1 Our Lab 6 Verilog code may have chosen to set or clear the "wrist-band" Flip-Flop (flush bit) in the stage register IF/ID on system reset using the RESET signal. Accordingly, we discussed in class a solution for the Lab 6 Part 4 question on flushing two stages of the 7-stage pipeline.

For this question, let us assume that each designer is allowed to choose to set or clear each "wrist-band" Flip-Flop. A designer can choose to set one FF and clear another. Also, some of the designers below assumed one delay slot, whereas others assumed zero delay slots. All 6 designs are correct based on their assumptions. Fill in the table, telling us what assumptions make the designs correct.

All control units are identical and are as per the textbook design (a 1-input means it is an instruction destined to be flushed). Fill in the table above.

4 ( 14 *1 + 1 bonus = 15 points) 5 min. Virtual Memory

MMU stands for _______________________; TLB stands for ________________________

PTBR stands for ____________________________

PT (Page Table) (essentially a LUT (Look-Up Table)) _______ has both the LHS (Left-Hand Side) and RHS (Right-Hand Side) of the LUT.

A Fully Associative TLB has both sides of the LUT. T / F

Given _______ (VPN/PPFN) the TLB provides __________ (VPN/PPFN).

Given _______ (VPN/PPFN) the page table provides __________ (VPN/PPFN).

We ___________ (use / don’t use) parallel search to search the page table.

We ___________ (use / don’t use) binary search (also called dictionary search) to search the page table.

We ___________ (use / don’t use) indexing to ________ to locate the page table entry in ____________ (one/multiple) accesses to a full-length single-level page table.

[Figure: six front-end designs, #1 to #6. Each shows the PC, the Instr. TLB and Instr. cache, the IF1, IF2, and ID stages, the control unit, the BR1 signal, and two wristband flip-flops, FF1 and FF2, each with a RESET input.]
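The look-up-table view of translation used in the virtual memory question can be sketched in Python. This is a toy model under stated assumptions, not a description of any particular MMU: 4 KB pages (12-bit offset), a TLB modeled as a dictionary that stores the VPN alongside the PPFN, and a single-level page table modeled as a list indexed by VPN. The mapping values and all names are hypothetical.

```python
PAGE_OFFSET_BITS = 12  # assumed 4 KB pages

# Toy TLB: each entry stores the VPN (searched on) and the PPFN.
tlb = {}

# Toy single-level page table: the VPN is the index, so only the PPFN
# is stored. Sized to 16 entries here purely for illustration.
page_table = [None] * 16
page_table[0x2] = 0x7  # hypothetical mapping: VPN 2 -> PPFN 7


def translate(va):
    """Translate a virtual address, consulting the TLB first and
    indexing the page table on a TLB miss."""
    vpn = va >> PAGE_OFFSET_BITS
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
    ppfn = tlb.get(vpn)
    if ppfn is None:                 # TLB miss
        ppfn = page_table[vpn]       # one indexed access, no search
        if ppfn is None:
            raise LookupError("page fault")
        tlb[vpn] = ppfn              # refill the TLB
    return (ppfn << PAGE_OFFSET_BITS) | offset


print(hex(translate(0x2ABC)))  # first access misses in the TLB
print(hex(translate(0x2DEF)))  # second access to the same page hits
```

The dictionary stands in for a hardware structure that compares the VPN against all stored tags at once; the list stands in for a table in main memory reached by adding an index to a base address.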


October 25, 2018 10:14 am EE457 MT - Spring 2018 4 / 12 C Copyright 2018 Gandhi Puvvada

2 ( 12+10+12+9+12 = 55 points) 25 min. Branch delay slots and power-on reset:

2.1 Regarding the DUAL purpose of the WBFF (Wristband Flip-Flop)

(i) On power-on under reset, the WBFF on the side is ____________ (set/reset) ___________________ (synchronously / asynchronously) to ensure that the ________ (IF/ID) stage is filled with a _________ ________________________ (bubble/Random Instruction). Similar needs of other stages ____________________ (list stages from IF, ID, EX, MEM, WB) are achieved in a ____________ (similar / different) fashion. On power-on, the PC is ____________________________________________________________________ so that ____________________________________________________________

(ii) When the Branch in the __________ (IF/ID) stage is successful, the BR1 signal goes ________ (active/inactive) by becoming a _____ (0/1). This in turn causes deposition of a ______ (0/1) into the WBFF ___________________ (synchronously / asynchronously). At this time, the instruction in the _________ (IF/ID) stage ___________ (is / isn’t) in transit to the _________ (IF/ID) stage. This causes the instruction in the _________ (IF/ID) stage to be "wristbanded" (= destined to be flushed by the control unit when the control unit sees it).

2.2 Complete the two tables below. The three designs referred to in the right table are the first three designs below.

Our Lab 6 (early branch) Verilog code may have chosen to set or clear the "wrist-band" Flip-Flop (flush bit) in the stage register IF/ID on system reset using the RESET signal. Accordingly, we discussed in class a solution for the Lab 6 Part 4 question on flushing two stages of the 7-stage pipeline.

For this question, let us assume that each designer is allowed to choose to set or clear each "wrist-band" Flip-Flop. A designer can choose to set one FF and clear another. Also, some of the designers below may have assumed zero or more delay slots, whereas others may have assumed fewer or zero delay slots. All 3 designs are correct based on their assumptions. Fill in the table, telling us what assumptions make the designs correct. All control units are identical and are as per the textbook design (a 1-input to the control unit means it is an instruction destined to be flushed). Fill in the table above.

[Figure: a single-fetch-stage design showing the PC, Instr. Memory, the IF and ID stages, the control unit, the BR1 signal, and one WBFF with a RESET input; two tables to complete, in which "e = executing"; and three EARLY BRANCH designs, #1 to #3, each showing the PC, the Instr. TLB and Instr. cache, the IF1, IF2, and ID stages, the control unit, the BR1 signal, and two flip-flops, FF1 and FF2, each with a RESET input.]


2.2.1 Among the three designs above, select the designs in which the designer did not implement the maximum possible number of delay slots, and alter those designs so that they take the maximum number of delay slots possible. Alter them on the previous page by crossing off selected portions and adding the needed additional circuitry.
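The dual role of the wristband flip-flop can be illustrated with a small behavioral model in Python. This is one possible design among those the question allows, written under explicit assumptions: the flip-flop is set to 1 by reset, a 1 seen by the control unit means "flush" (as stated for the textbook control unit above), BR1 is captured on each clock edge, and there are zero delay slots. The class and function names are hypothetical and do not come from the Lab 6 code.

```python
class WristbandFF:
    """Behavioral model of a flush bit kept beside the IF/ID register."""

    def __init__(self, reset_value=1):
        # Assumption: the designer chose to SET the flip-flop on reset.
        self.reset_value = reset_value
        self.q = reset_value

    def reset(self):
        """Power-on reset: whatever sits in IF/ID is marked as a bubble."""
        self.q = self.reset_value

    def clock(self, br1):
        """Clock edge: capture BR1, so the instruction latched into IF/ID
        on the same edge is wristbanded when the branch is successful."""
        self.q = 1 if br1 else 0


def control_sees(ff, instr):
    """The control unit turns a wristbanded instruction into a bubble."""
    return "bubble" if ff.q == 1 else instr


ff = WristbandFF()
ff.reset()
print(control_sees(ff, "random bits"))  # after reset, IF/ID is ignored

ff.clock(br1=0)
print(control_sees(ff, "add"))          # normal instruction passes through

ff.clock(br1=1)
print(control_sees(ff, "sub"))          # fetched behind a successful branch
```

A design with one delay slot would simply not capture BR1 for the instruction immediately behind the branch, which is the kind of assumption the tables ask the reader to identify.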

3 ( 40 points) 15 min. Virtual Memory

MMU stands for _______________________; TLB stands for _________________________

PTBR stands for ____________________________; PPFN stands for ______________________

Does a PT (Page Table) (essentially a LUT (Look-Up Table)) have both the LHS (Left-Hand Side) and RHS (Right-Hand Side) of the LUT? ____________________________________________________

Does a Fully Associative TLB (essentially a LUT (Look-Up Table)) have both the LHS (Left-Hand Side) and RHS (Right-Hand Side) of the LUT? _____________________________________________

Given _______ (VPN/PPFN) the TLB provides __________ (VPN/PPFN).

Given _______ (VPN/PPFN) the page table provides __________ (VPN/PPFN).

We ___________ (use / don’t use) parallel search to search the page table.

We ___________ (use / don’t use) binary search (also called dictionary search) to search the page table.

We ___________ (use / don’t use) indexing to ________ to locate the page table entry in ____________ (one/multiple) access(es) to a full-length single-level page table.

A TLB that has the PCID (Process Context ID) besides the VPN in the TAG portion of its entries __________ (needs/does not need) to be flushed on a context switch (i.e., when a new process needs to be made to run). Please explain. ____________________________________________________

Assume that there is only one level of TLB and one level of cache on the CPU chip (though multiple levels of on-chip TLB and multiple levels of on-chip cache are common in current processors). If there is a TLB miss, the ________ (MMU/CCU) will send out a ____________ (PA physical address / VA virtual address) to "walk" the page table outside the CPU chip in the MM (main memory). If there is a cache miss, the ________ (MMU/CCU) will send out a ______________ (PA physical address / VA virtual address) to fetch the words of the missing block from the MM (main memory).

In the 32-bit address systems, Intel thoughtfully divided the 20-bit VPN into two 10-bit fields so that ____________________________________________________

A 4KB physical page contains ___________ (state a number in the decimal system) 4-word (16-byte) blocks.

In a 2-level page table (like in the Intel 32-bit address systems), the Page Directory (the first-level or top-level table) is built ___________ (OD/IP) (OD = on demand, IP = at the initiation of the process) and the 2nd-level page tables are built ___________ (OD/IP) by the ________ (MMU/OS operating system).

During the "cold" start, we expect to have a series of TLB misses, a series of page faults, and a series of cache misses. Consider the case where the program being run on the CPU called out the word with address 0000_1048 hex, which resulted in fetching the 4 words from the MM at 0002_8048 hex, 0002_804C hex, 0002_8040 hex, and 0002_8044 hex. Narrate a sequence of events (what misses occurred in what order) and how access to 0000_1048 hex resulted in access to 0002_8048 hex. ____________________________________________________
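The address arithmetic in the last question can be checked with a short Python sketch. It assumes the 10/10/12 field split described above for 32-bit addresses, takes the PPFN as the upper 20 bits of the physical address given in the question, and models the fetch order as a wraparound starting from the requested word; that last point is an interpretation of the listed order, not something the question states. All function names are hypothetical.

```python
def split_va(va):
    """Split a 32-bit virtual address into a 10-bit page directory index,
    a 10-bit page table index, and a 12-bit page offset."""
    return (va >> 22) & 0x3FF, (va >> 12) & 0x3FF, va & 0xFFF


def physical_address(ppfn, va):
    """Concatenate the physical page frame number with the page offset."""
    return (ppfn << 12) | (va & 0xFFF)


def wraparound_block(pa, words=4, word_bytes=4):
    """Word addresses of the block containing `pa`, starting at the
    requested word and wrapping around within the block."""
    block_bytes = words * word_bytes
    base = pa & ~(block_bytes - 1)
    first = (pa - base) // word_bytes
    return [base + ((first + i) % words) * word_bytes for i in range(words)]


va = 0x0000_1048
print(split_va(va))                       # directory index, table index, offset

pa = physical_address(0x00028, va)        # PPFN taken from 0002_8048 hex
print(f"{pa:08X}")

print([f"{a:08X}" for a in wraparound_block(pa)])
```

The offset 048 hex is unchanged by translation; only the upper 20 bits (VPN 00001 hex) are replaced by the PPFN 00028 hex.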
