© Copyright by Michel Dubois March 1997
[Figure: Pipelined CPU (Late Branch from 1st Ed.) for the EE457 class Lab #6, dated 3/26/2000. Original drawing provided by Prof. Dubois. The datapath shows the IF, ID, EX, MEM, and WB stages separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB stage registers, with a forwarding unit, a hazard detection unit, and the IF.Flush, ID.Flush, and EX.Flush signals.]
382 Chapter 4 The Processor
Delayed branching was a simple and effective solution for a five-stage pipeline issuing one instruction each clock cycle. As processors go to both longer pipelines and issuing multiple instructions per clock cycle (see Section 4.10), the branch delay becomes longer, and a single delay slot is insufficient. Hence, delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches. Simultaneously, the growth in available transistors per chip has made dynamic prediction relatively cheaper.
Elaboration: A branch predictor tells us whether or not a branch is taken, but still requires the calculation of the branch target. In the five-stage pipeline, this calculation takes one cycle, meaning that taken branches will have a 1-cycle penalty. Delayed branches are
FIGURE 4.64 Scheduling the branch delay slot. The top box in each pair shows the code before scheduling; the bottom box shows the scheduled code. In (a), the delay slot is scheduled with an independent instruction from before the branch. This is the best choice. Strategies (b) and (c) are used when (a) is not possible. In the code sequences for (b) and (c), the use of $s1 in the branch condition prevents the add instruction (whose destination is $s1) from being moved into the branch delay slot. In (b) the branch delay slot is scheduled from the target of the branch; usually the target instruction will need to be copied because it can be reached by another path. Strategy (b) is preferred when the branch is taken with high probability, such as a loop branch. Finally, the branch may be scheduled from the not-taken fall-through as in (c). To make this optimization legal for (b) or (c), it must be OK to execute the sub instruction when the branch goes in the unexpected direction. By "OK" we mean that the work is wasted, but the program will still execute correctly. This is the case, for example, if $t4 were an unused temporary register when the branch goes in the unexpected direction.
a. From before
    Before scheduling:
        add $s1, $s2, $s3
        if $s2 = 0 then
            [Delay slot]
    Becomes:
        if $s2 = 0 then
            add $s1, $s2, $s3

b. From target
    Before scheduling:
        sub $t4, $t5, $t6
        . . .
        add $s1, $s2, $s3
        if $s1 = 0 then
            [Delay slot]
    Becomes:
        add $s1, $s2, $s3
        if $s1 = 0 then
            sub $t4, $t5, $t6

c. From fall-through
    Before scheduling:
        add $s1, $s2, $s3
        if $s1 = 0 then
            [Delay slot]
        sub $t4, $t5, $t6
    Becomes:
        add $s1, $s2, $s3
        if $s1 = 0 then
            sub $t4, $t5, $t6
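The legality rules stated in the Figure 4.64 caption can be sketched as two small checks. This is only an illustrative sketch in Python: the `Instr` tuple and both function names are hypothetical, the candidate is assumed to sit immediately before the branch in case (a), and the set of registers that are live on the unexpected path is assumed to be supplied by some separate liveness analysis.

```python
from typing import NamedTuple, Set


class Instr(NamedTuple):
    """Hypothetical three-register instruction: op dest, src1, src2."""
    op: str
    dest: str
    src1: str
    src2: str


def can_fill_from_before(candidate: Instr, branch_reads: Set[str]) -> bool:
    """Case (a): an instruction from before the branch may move into the
    delay slot only if the branch condition does not read its destination."""
    return candidate.dest not in branch_reads


def can_fill_speculatively(candidate: Instr, live_on_other_path: Set[str]) -> bool:
    """Cases (b) and (c): the slot instruction also runs when the branch goes
    the unexpected way, so its destination must be unused on that path."""
    return candidate.dest not in live_on_other_path


add = Instr("add", "$s1", "$s2", "$s3")
sub = Instr("sub", "$t4", "$t5", "$t6")

print(can_fill_from_before(add, {"$s2"}))    # branch tests $s2, as in (a)
print(can_fill_from_before(add, {"$s1"}))    # branch tests $s1, as in (b) and (c)
print(can_fill_speculatively(sub, set()))    # $t4 is an unused temporary
print(can_fill_speculatively(sub, {"$t4"}))  # $t4 is still needed on the other path
```

A real scheduler would also have to check memory dependences and exceptions; the sketch covers only the register conditions the caption names.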
ee457_Lab6_Part4_r3_for_lecture.fm
10/29/06 19 / 32 © Copyright 2006 Gandhi Puvvada
[Figure: All 4 pipelines. Each panel shows the PC, control, Reg, FU, HDU, and the BRANCH/BR logic with its Zero signal.
1. 5-stage pipeline (IF ID EX MEM WB) of the late-branch design of the 1st edition, with Instr. and Data blocks.
2. 5-stage pipeline (IF ID EX MEM WB) of the early-branch design of the 3rd ed. and our lab 6, with FU_Br and HDU_Br added.
3. 7-stage pipelined version (IF1 IF2 ID EX MEM1 MEM2 WB) of the late-branch design of the 1st edition, with Instr. TLB, Instr. cache, Data TLB, and Data cache.
4. 7-stage pipelined version (IF1 IF2 ID EX MEM1 MEM2 WB) of the early-branch design of the 3rd ed. and our lab 6, with FU_Br and HDU_Br added.]
March 22, 2018 10:13 am EE457 MT - Spring 2018 8 / 10 © Copyright 2018 Gandhi Puvvada
3 ( 6*3 = 18 + 7 bonus = 25 points) 15 min. Flushing by a successful branch:
3.1 Our Lab 6 Verilog code may have chosen to set or clear the "wrist-band" Flip-Flop (flush bit) in the stage register IF/ID on system reset using the RESET signal. Accordingly we discussed in class a solution for the Lab 6 Part 4 question on flushing two stages of the 7-stage pipeline.
For this question, let us assume that each designer is allowed to choose to set or clear each "wrist-band" Flip-Flop. He can choose to set one FF and clear another. Also, some designers below assumed one delay slot whereas some assumed zero delay slots. All 6 designs are correct based on their assumptions. Fill in the table telling us what assumptions make the designs correct.
All control units are identical and are as per the textbook design (a 1-input means it is an instruction destined to be flushed). Fill in the table above.
4 ( 14 *1 + 1 bonus = 15 points) 5 min. Virtual Memory
MMU stands for _______________________; TLB stands for ________________________
PTBR stands for ____________________________
PT (Page Table) (essentially a LUT (Look-Up Table)) _______ has both the LHS (Left-Hand Side) and RHS (Right-Hand Side) of the LUT.
A Fully Associative TLB has both sides of LUT. T / F
Given _______ (VPN/PPFN) the TLB provides __________ (VPN/PPFN).
Given _______ (VPN/PPFN) the page table provides __________ (VPN/PPFN).
We ___________ (use / don't use) parallel search to search the page table.
We ___________ (use / don't use) binary search (also called dictionary search) to search the page table.
We ___________ (use / don't use) indexing to ________ to locate the page table entry in ____________ (one/multiple) accesses to a full-length single-level page table.
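The two lookup structures named in this question can be contrasted with a small software model. This is a hedged sketch, not the course's reference answer: the function names, the list-based page table, and the sample VPN/PPFN values are all made up for illustration, and the sequential loop only stands in for what a fully associative TLB does with parallel tag comparators in hardware.

```python
from typing import List, Optional, Tuple


def page_table_lookup(page_table: List[Optional[int]], vpn: int) -> Optional[int]:
    """Model of a full-length single-level page table: the VPN is used
    directly as an index, so only the PPFN side needs to be stored."""
    return page_table[vpn]


def tlb_lookup(tlb: List[Tuple[int, int]], vpn: int) -> Optional[int]:
    """Model of a fully associative TLB: every entry stores a VPN tag and a
    PPFN. Hardware compares all tags at once; this loop is a sequential
    stand-in for that parallel search."""
    for tag, ppfn in tlb:
        if tag == vpn:
            return ppfn
    return None  # TLB miss


# Illustrative values only (not taken from the exam).
page_table = [None, 0x28, None, 0x07]   # entry i holds the PPFN for VPN i
tlb = [(3, 0x07), (1, 0x28)]            # (VPN tag, PPFN) pairs in any order

print(hex(page_table_lookup(page_table, 1)))
print(hex(tlb_lookup(tlb, 1)))
print(tlb_lookup(tlb, 2))
```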
[Figure: six candidate designs, #1 through #6, for question 3 (25 pts). Each design shows the PC, Instr. TLB, Instr. cache, the IF1, IF2, and ID stages, the control unit, the BR1 signal, and two wrist-band flip-flops FF1 and FF2 with RESET inputs. The table to be filled in did not survive extraction.]
October 25, 2018 10:14 am EE457 MT - Spring 2018 4 / 12 © Copyright 2018 Gandhi Puvvada
2 ( 12+10+12+9+12 = 55 points) 25 min. Branch delay Slots and power-on reset:
2.1 Regarding the DUAL purpose of the WBFF (Wristband Flip-Flop)
(i) On power-on under reset, the WBFF on the side is ____________ (set/reset) ___________________ (synchronously / asynchronously) to ensure that the ________ (IF/ID) stage is filled with a _________ ________________________ (bubble/Random Instruction). Similar needs of other stages ____________________ (list stages from IF, ID, EX, MEM, WB) are achieved in a ____________ (similar / different) fashion. On power-on, the PC is ____________________________________________________________________ so that ____________________________________________________________
(ii) When the Branch in __________ (IF/ID) stage is successful, the BR1 signal goes ________ (active/inactive) by becoming a _____ (0/1). This in turn causes deposition of a ______ (0/1) into the WBFF ___________________ (synchronously / asynchronously). At this time, the instruction in the _________ (IF/ID) stage ___________ (is / isn't) in transit to the _________ (IF/ID) stage. This causes the instruction in the _________ (IF/ID) stage to be "wristbanded" (= destined to be flushed by the control unit when it sees that instruction).
2.2 Complete the two tables below. The three designs referred to in the right table are the first three designs below.
Our Lab 6 (early branch) Verilog code may have chosen to set or clear the "wrist-band" Flip-Flop (flush bit) in the stage register IF/ID on system reset using the RESET signal. Accordingly we discussed in class a solution for the Lab 6 Part 4 question on flushing two stages of the 7-stage pipeline.
For this question, let us assume that each designer is allowed to choose to set or clear each "wrist-band" Flip-Flop. He can choose to set one FF and clear another. Also, some designers below may have assumed zero or more delay slots whereas some may have assumed fewer or zero delay slots. All 3 designs are correct based on their assumptions. Fill in the table telling us what assumptions make the designs correct. All control units are identical and are as per the textbook design (a 1-input to the control unit means it is an instruction destined to be flushed). Fill in the table above.
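The wrist-band mechanism described in this question can be mimicked with a cycle-by-cycle model of one possible design. The sketch below is only an illustration under loudly stated assumptions, since the question itself allows several valid designs: it assumes a 5-stage early-branch pipeline with zero delay slots, a flush bit that RESET sets to 1, and a control unit that turns any instruction arriving with flush bit 1 into a bubble. The class and function names are hypothetical and are not taken from the Lab 6 Verilog code.

```python
from dataclasses import dataclass
from typing import List, Set

NOP = "nop"


@dataclass
class IFIDRegister:
    """IF/ID stage register plus its wrist-band (flush) bit."""
    instr: str = "<random>"  # contents are unknown right after power-on
    flush: bool = True       # assumption: RESET sets the flush bit


def decode(reg: IFIDRegister) -> str:
    """Control unit: a 1 on the flush input turns the instruction into a bubble."""
    return NOP if reg.flush else reg.instr


def clock_edge(fetched: str, branch_taken: bool) -> IFIDRegister:
    """At the clock edge the fetched instruction enters IF/ID. If the branch
    in ID was successful this cycle, the instruction in transit is tagged."""
    return IFIDRegister(instr=fetched, flush=branch_taken)


def simulate(fetch_stream: List[str], taken_cycles: Set[int]) -> List[str]:
    """Return what the control unit decodes in each cycle.
    taken_cycles holds the cycles in which the ID stage has a successful branch."""
    reg = IFIDRegister()  # state just after reset
    decoded = []
    for cycle, fetched in enumerate(fetch_stream):
        decoded.append(decode(reg))
        reg = clock_edge(fetched, cycle in taken_cycles)
    decoded.append(decode(reg))
    return decoded


# The branch is decoded in cycle 1, so the wrong-path fetch is flushed.
print(simulate(["beq", "wrong_path", "target"], {1}))
```

Under these assumptions the control unit sees a bubble right after reset, then the branch, then a bubble in place of the wrong-path instruction, then the branch target. A design with one delay slot would leave the instruction following the branch untagged instead.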
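[Figure for question 2.1 (12 pts): a 5-stage EARLY BRANCH front end showing the PC, Instr. Memory, the IF and ID stages, the control unit, the BR1 signal, and the WBFF with its RESET input.]
[Figure for question 2.2 (10 pts, 12 pts, 9 pts, 12 pts): three candidate designs, #1 through #3. Each design shows the PC, Instr. TLB, Instr. cache, the IF1, IF2, and ID stages, the control unit, the BR1 signal, and two wrist-band flip-flops FF1 and FF2 with RESET inputs. Legend: e = executing. The tables to be filled in did not survive extraction.]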
2.2.1 Among the three designs above, select the designs where the designer did not implement the maximum possible number of delay slots, and alter those designs so as to make them take the maximum number of delay slots possible. Alter them on the previous page by crossing out selected portions and adding the needed additional circuitry.
3 ( 40 points) 15 min. Virtual Memory
MMU stands for _______________________; TLB stands for _________________________
PTBR stands for ____________________________; PPFN stands for ______________________
Does a PT (Page Table) (essentially a LUT (Look-Up Table)) have both the LHS (Left-Hand Side) and RHS (Right-Hand Side) of the LUT? ____________________________________________________
Does a Fully Associative TLB (essentially a LUT (Look-Up Table)) have both the LHS (Left-Hand Side) and RHS (Right-Hand Side) of the LUT? _____________________________________________
Given _______ (VPN/PPFN) the TLB provides __________ (VPN/PPFN).
Given _______ (VPN/PPFN) the page table provides __________ (VPN/PPFN).
We ___________ (use / don't use) parallel search to search the page table.
We ___________ (use / don't use) binary search (also called dictionary search) to search the page table.
We ___________ (use / don't use) indexing to ________ to locate the page table entry in ____________ (one/multiple) access(es) to a full-length single-level page table.
A TLB which has a PCID (Process Context ID) besides the VPN in the TAG portion of its entries __________ (needs/does not need) to be flushed on a context switch (i.e. when a new process needs to be made to run). Please explain. ___________________________________________________________________________________________________________________________________________________________________________________________________________
Assume that there is only one level of TLB and one level of cache on the CPU chip (though multiple levels of on-chip TLB and multiple levels of on-chip cache are common in current processors). If there is a TLB miss, the ________ (MMU/CCU) will send out a ____________ (PA physical address / VA virtual address) to "walk" the page table outside the CPU chip in the MM main memory. If there is a cache miss, the ________ (MMU/CCU) will send out a ______________ (PA physical address / VA virtual address) to fetch the words of the missing block in the MM main memory.
In the 32-bit address systems, Intel thoughtfully divided the 20-bit VPN into two 10-bit fields so that _________________________________________________________________________________________________________________________________________________________________________________________________________________________________
A 4KB physical page contains ___________ (state a number in decimal system) 4-word (16-byte) blocks.
In a 2-level Page table (like in the Intel 32-bit address systems), the Page Directory (the first-level or the top-level table) is built ___________ (OD/IP) (OD = on demand, IP = at the initiation of the process) and the 2nd-level page tables are built ___________ (OD/IP) by the ________ (MMU/OS operating system).
During the "cold" start, we expect to have a series of TLB misses, a series of page faults, and a series of cache misses. Consider the case where the program being run on the CPU called out the word with address 0000_1048 hex, which resulted in fetching the 4 words from the MM 0002_8048 hex, 0002_804C hex, 0002_8040 hex, and 0002_8044 hex. Narrate a sequence of events (what misses occurred in what order) and how access to 0000_1048 hex resulted in access to 0002_8048 hex. ____________________________________________________________________________________________________________________________________________________________________________________________________
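The address arithmetic behind the last scenario can be checked with a few lines of Python. This is a sketch under stated assumptions: 4KB pages (12 offset bits), the 10/10 split of the 20-bit VPN mentioned above, and 16-byte blocks. The function names are hypothetical, and the PPFN value 0x00028 is simply read off the physical address 0002_8048 hex given in the question. The sketch shows only the bit manipulation, not the sequence of misses the question asks you to narrate.

```python
PAGE_OFFSET_BITS = 12   # 4KB pages
INDEX_BITS = 10         # each half of the 20-bit VPN
BLOCK_BYTES = 16        # 4-word blocks
WORD_BYTES = 4


def split_va(va: int):
    """Split a 32-bit virtual address into
    (page directory index, page table index, page offset)."""
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
    table_index = (va >> PAGE_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    dir_index = (va >> (PAGE_OFFSET_BITS + INDEX_BITS)) & ((1 << INDEX_BITS) - 1)
    return dir_index, table_index, offset


def translate(va: int, ppfn: int) -> int:
    """Form the physical address: the PPFN replaces the VPN, the offset is kept."""
    return (ppfn << PAGE_OFFSET_BITS) | (va & ((1 << PAGE_OFFSET_BITS) - 1))


def block_word_addresses(pa: int):
    """Word addresses of the 16-byte block containing pa, in ascending order."""
    base = pa & ~(BLOCK_BYTES - 1)
    return [base + WORD_BYTES * i for i in range(BLOCK_BYTES // WORD_BYTES)]


va = 0x0000_1048
dir_index, table_index, offset = split_va(va)
pa = translate(va, ppfn=0x00028)  # PPFN read off the question's physical address

print(dir_index, table_index, hex(offset))
print(hex(pa))
print([hex(a) for a in block_word_addresses(pa)])
```

The block helper lists the four word addresses in ascending order; the question lists the same four words starting with the requested one.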