Practice Assignment 1 - GUCeee.guc.edu.eg/Courses/Electronics/ELCT707... · 1 - Practice Assignment 1 1 PROBLEM 1: ... Determine the overall CPI of the program on the embedded processor

CSEN601 Spring2011 - Practice Assignment 1


PROBLEM 1:

An application running on a 1GHz pipelined processor has the following instruction mix:

Instruction   Frequency   CPI
Load-store    55%         5
Arithmetic    30%         4
Branch        15%         4

a) Determine the overall CPI of the program.

b) An embedded version of the processor that operates at 600 MHz is used to run the same application. In this version, the CPI of branch instructions becomes 6, while the CPIs of the other instruction types remain unchanged. A new compiler is used, which eliminates 25% of the load-store instructions as well as 5% of the arithmetic instructions for this application.

i. Determine the overall CPI of the program on the embedded processor with the new compiler.

ii. Determine the factor by which the application on the embedded processor runs faster/slower.

Solution:

a) Overall CPI = 0.55 × 5 + 0.30 × 4 + 0.15 × 4 = 2.75 + 1.20 + 0.60 = 4.55 cycles/instruction

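b) First we calculate the new percentages for each type of instruction:

i. Percentage of eliminated load-store instructions out of the total instructions
= 0.25 × 55% = 13.75%

Percentage of eliminated arithmetic instructions out of the total instructions
= 0.05 × 30% = 1.5%

Percentage of remaining instructions out of the total instructions
= 100% − (13.75% + 1.5%) = 84.75%

New percentage of load-store instructions
= (55% − 13.75%) / 84.75% = 48.67%

New percentage of arithmetic instructions
= (30% − 1.5%) / 84.75% = 33.63%

New percentage of branch instructions
= 15% / 84.75% = 17.70%

Overall CPI = 0.4867 × 5 + 0.3363 × 4 + 0.1770 × 6 = 4.84 cycles/instr.

ii. Execution time = instruction count (IC) × CPI × clock cycle time, so

Time(embedded) / Time(original) = (0.8475 × IC × 4.84 × (1 / 600 MHz)) / (IC × 4.55 × (1 / 1 GHz)) ≈ 1.50

(i.e. the program is now about 1.5 times slower)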
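The arithmetic in Problem 1 can be checked with a short script. This is only a sketch: the function name `overall_cpi` and the dictionary layout are illustrative choices, and the slowdown assumes execution time = instruction count × CPI / clock rate, with the original instruction count normalized to 1.

```python
def overall_cpi(mix):
    """Weighted-average CPI. mix maps type -> (share of instructions, CPI).

    Shares are renormalized, so they do not need to sum to 1.
    """
    total = sum(share for share, _ in mix.values())
    return sum(share * cpi for share, cpi in mix.values()) / total


# Instruction mix from the problem statement (1 GHz processor).
original = {
    "load-store": (0.55, 5),
    "arithmetic": (0.30, 4),
    "branch": (0.15, 4),
}

# Embedded version: 25% of load-stores and 5% of arithmetic instructions
# are eliminated, and the branch CPI rises to 6 (600 MHz processor).
embedded = {
    "load-store": (0.55 * (1 - 0.25), 5),
    "arithmetic": (0.30 * (1 - 0.05), 4),
    "branch": (0.15, 6),
}

cpi_old = overall_cpi(original)
cpi_new = overall_cpi(embedded)

# Fraction of the original instruction count that remains.
remaining = sum(share for share, _ in embedded.values())

time_old = 1.0 * cpi_old / 1e9          # seconds per original instruction
time_new = remaining * cpi_new / 600e6  # same program on the embedded CPU
slowdown = time_new / time_old

print(f"CPI (original) = {cpi_old:.2f}")
print(f"CPI (embedded) = {cpi_new:.2f}")
print(f"slowdown       = {slowdown:.2f}x")
```

A slowdown above 1 means the embedded processor takes longer, even though it executes fewer instructions.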

PROBLEM 2:

1- Suppose a MIPS processor uses the simple 5-stage pipeline described in the text. Further suppose that:

- There is a single memory for both instructions and data, which can do one read or write each cycle.
- No forwarding is used.
- An instruction cannot be fed into the pipeline until the hardware knows for certain that the instruction is to be executed (no earlier than the end of the execution stage if the current instruction is a branch).
- In the absence of hazards, a new instruction can be fed into the pipeline each cycle.

For the following MIPS code:

lw R1, 0(R2)

lw R3, 12(R4)

add R5, R1, R3

beq R5, R5, L1

sw R5, 0(R3)

L1: sw R5, 12(R4)



a) Show, using a diagram, how many cycles this code takes to complete.

b) Show, using a diagram, how different hazard-resolution techniques can be used to decrease the total number of cycles for this program.

Solution:

a) As shown below, the code will take 15 cycles.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

lw R1,0(R2) IF ID EX M WB

lw R3,12(R4) IF ID EX M WB

add R5,R1,R3 IF ID EX M WB

beq R5,R5,L1 IF ID EX M WB

L1: sw R5,12(R4) IF ID EX M WB

b) Using the following hazard solving techniques:

Forwarding (to resolve some data hazards)

Separate instruction and data memories (to resolve some structural hazards)

Branch prediction

Assuming branch prediction turns out to be correct, the code will take 11 cycles.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

lw R1,0(R2) IF ID EX M WB

lw R3,12(R4) IF ID EX M WB

Add R5,R1,R3 IF ID EX M WB

beq R5,R5,L1 IF ID EX M WB

L1:sw R5,12(R4) IF ID EX M WB

PROBLEM 3:

2- A five-stage pipelined processor supports the following instruction types:

Instruction      Frequency
Load             25%
Store            15%
Integer          30%
Floating point   20%
Branch           10%

Assume the base CPI of the processor is equal to 1. Data hazards for floating-point operations cause an average penalty of 0.9 stall cycles, branch instructions have a misprediction penalty of 1 stall cycle, and all other instructions run at the maximum possible throughput. For branch instructions, the processor uses the predict-not-taken scheme. If the branch prediction turns out to be correct 80% of the time, calculate the average CPI for this program.

Solution:

The average CPI = the base CPI + the average number of stall cycles per instruction

= 1 + (0.20 × 0.9) + (0.10 × (1 − 0.80) × 1) = 1 + 0.18 + 0.02 = 1.20 cycles/instr.
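The stall accounting in Problem 3 can be written out as a short calculation. This is a sketch using the figures from the problem statement; the variable names are illustrative.

```python
base_cpi = 1.0

# Floating-point instructions: 20% of the mix, 0.9 stall cycles each on average.
fp_stalls = 0.20 * 0.9

# Branches: 10% of the mix, mispredicted 20% of the time, 1 stall cycle each.
branch_stalls = 0.10 * (1 - 0.80) * 1

# Loads, stores and integer instructions add no stalls in this problem.
average_cpi = base_cpi + fp_stalls + branch_stalls

print(f"average CPI = {average_cpi:.2f}")
```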



PROBLEM 4:

a) Identify all WAR, WAW and RAW dependencies in the following instruction sequence:

LD F2, 16(R6)

ADDD F2, F2, F4

DIVD F6, F2, F0

SUBD F0, F2, F10

SD F6, 32(R3)
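One way to enumerate the dependences in this sequence mechanically is sketched below. The encoding of each instruction as (text, destination, sources) and the helper name `find_dependences` are assumptions made for illustration. As a simplification, a dependence on a register is reported only between the nearest pair of instructions, so a pair is skipped when an instruction in between rewrites that register.

```python
# Each entry: (text, destination register or None, list of source registers).
program = [
    ("LD F2,16(R6)", "F2", ["R6"]),
    ("ADDD F2,F2,F4", "F2", ["F2", "F4"]),
    ("DIVD F6,F2,F0", "F6", ["F2", "F0"]),
    ("SUBD F0,F2,F10", "F0", ["F2", "F10"]),
    ("SD F6,32(R3)", None, ["F6", "R3"]),  # a store writes no register
]


def find_dependences(instrs):
    """Return (kind, earlier, later, register) tuples, 1-based instruction numbers."""
    deps = []
    for j, (_, dst_j, srcs_j) in enumerate(instrs):
        for i in range(j):
            _, dst_i, srcs_i = instrs[i]
            # Registers written by instructions strictly between i and j.
            rewritten = {d for _, d, _ in instrs[i + 1:j] if d is not None}
            if dst_i is not None and dst_i in srcs_j and dst_i not in rewritten:
                deps.append(("RAW", i + 1, j + 1, dst_i))
            if dst_j is not None and dst_j in srcs_i and dst_j not in rewritten:
                deps.append(("WAR", i + 1, j + 1, dst_j))
            if dst_i is not None and dst_i == dst_j and dst_i not in rewritten:
                deps.append(("WAW", i + 1, j + 1, dst_i))
    return deps


deps = find_dependences(program)
for kind, i, j, reg in deps:
    print(f"{kind}: instruction {i} -> instruction {j} on {reg}")
```

Under these assumptions the sketch reports RAW dependences on F2 (LD to ADDD, ADDD to DIVD, ADDD to SUBD) and on F6 (DIVD to SD), a WAW dependence on F2 (LD to ADDD), and a WAR dependence on F0 (DIVD to SUBD).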

b) Fill in the blank templates for executing this code with and without Tomasulo’s Algorithm for this instruction sequence.

Assume the following execution times:

LW: 2 cycles
ADD/SUB: 2 cycles
BNEZ: 3 cycles
MULT/DIV: 4 cycles

For the original FP unit, assume one integer unit, one floating-point multiply unit, one FP add unit, and one FP divide unit.

For Tomasulo’s, assume:

Three FP ADD units, 2 FP MULT units, 6 load buffers and three store buffers. (Same units as in book example)

Assume there is a cache miss causing a stall of 8 cycles on the execution of the 1st LD.

Assume FP adds/subs take 2 cycles, Mults take 10 cycles and Divides take 20 cycles.

Assume the store is a cache hit and executes in one cycle.

Assume many instructions can read from the register file simultaneously.

For the Tomasulo example, recall that only one instruction can drive the CDB at a time.

Solution:

Without Tomasulo’s Algorithm, with the processor using forwarding:

Cycle           1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17
LD F2,16(R6)    IF    ID    EX    MEM1  MEM2  WB
ADDD F2,F2,F4         IF    ID    stall stall EX1   EX2   MEM   WB
DIVD F6,F2,F0               IF    stall stall ID    stall EX1   EX2   EX3   EX4   MEM   WB
SUBD F0,F2,F10                    stall stall IF    stall ID    stall stall stall EX1   EX2   MEM   WB
SD F6,32(R3)                            stall stall stall IF    stall stall stall ID    stall EX    MEM1  MEM2  WB

Notes:

We assumed one execution unit and one memory unit, and we had to respect in-order execution and in-order completion to resolve the stalls, exactly as shown in slide 5 of the ILP chapter.

With Tomasulo’s Algorithm:

We will use the same architecture shown in the lecture

Instruction status: Exec Write

Instruction j k Issue Comp Result Busy Address

LD F2 16 R2 Load1 No

ADDD F2 F2 F4 Load2 No

DIVD F6 F2 F0 Load3 No

SUBD F0 F2 F10

SD F6 32 R3

Reservation Stations: S1 S2 RS RS

Time Name Busy Op Vj Vk Qj Qk

Add1 No

Add2 NO

Add3 No

Mult1 NO

Mult2 NO

Register result status:

Clock        F0     F2     F4     F6     F8     F10
0      FU



LD F2 16 R2 1 2 Load1 Yes 16(R2)

ADDD F2 F2 F4 Load2 No

DIVD F6 F2 F0 Load3 No

SUBD F0 F2 F10

SD F6 32 R3



Add1 No

Add2 NO

Add3 No

Mult1 NO

Mult2 NO


Clock        F0     F2     F4     F6     F8     F10
1      FU           Load1





LD F2 16 R2 1 1 Load1 Yes 16(R2)

ADDD F2 F2 F4 2 Load2 No

DIVD F6 F2 F0 Load3 No

SUBD F0 F2 F10

SD F6 32 R3



Add1 YES ADD F4 Load1

Add2 NO

Add3 No

Mult1 NO

Mult2 NO


Clock        F0     F2     F4     F6     F8     F10
2      FU           ADD1



LD F2 16 R2 1 3 0 Load1 Yes 16(R2)


DIVD F6 F2 F0 3 Load3 No

SUBD F0 F2 F10

SD F6 32 R3



Add1 YES ADD F4 Load1

Add2 NO

Add3 No

Mult1 YES DIVD F0 ADD1

Mult2 NO


Clock        F0     F2     F4     F6     F8     F10
3      FU           ADD1          MULT1



LD F2 16 R2 1 3 4 0 Load1 NO



SUBD F0 F2 F10 4

SD F6 32 R3



2 Add1 YES ADD MEM(1) F4

Add2 YES SUBD F10 ADD1

Add3 No


Mult2 NO


Clock        F0     F2     F4     F6     F8     F10
4      FU    ADD2   ADD1          MULT1

LD F2 16 R2 1 3 4 0 Load1 NO



SUBD F0 F2 F10 4

SD F6 32 R3 5

Store Yes 32(R3) MULT1




Add2 NO SUBD F10 ADD1

Add3 No


Mult2 NO





LD F2 16 R2 1 3 4 0 Load1 NO

ADDD F2 F2 F4 2 6 Load2 No


SUBD F0 F2 F10 4

SD F6 32 R3 5





Add2 NO SUBD F10 ADD1

Add3 No


Mult2 NO



LD F2 16 R2 1 3 4 0 Load1 NO

ADDD F2 F2 F4 2 6 7 Load2 No


SUBD F0 F2 F10 4

SD F6 32 R3 5




0 Add1 No

2 Add2 YES SUBD res1 F10

Add3 No

4 Mult1 YES DIVD res1 F0

Mult2 NO


Clock        F0     F2     F4     F6     F8     F10
7      FU    ADD2   res1          MULT1

LD F2 16 R2 1 3 4 0 Load1 NO



SUBD F0 F2 F10 4

SD F6 32 R3 5




0 Add1 No


Add3 No


Mult2 NO


Clock        F0     F2     F4     F6     F8     F10
8      FU    ADD2   (RES)         MULT1

LD F2 16 R2 1 3 4 0 Load1 NO



SUBD F0 F2 F10 4

SD F6 32 R3 5




0 Add1 No

1 Add2 NO SUBD res1 F10

Add3 No


Mult2 NO


Clock        F0     F2     F4     F6     F8     F10
9      FU    ADD2   (RES)         MULT1



LD F2 16 R2 1 3 4 0 Load1 NO



SUBD F0 F2 F10 4 10

SD F6 32 R3 5




0 Add1 NO


Add3 No


Mult2 NO


Clock        F0     F2     F4     F6     F8     F10
10     FU    ADD2   res1          MULT1



SUBD F0 F2 F10 4 10 11

SD F6 32 R3 5




0 Add1 No

0 Add2 No

Add3 No


Mult2 NO


Clock        F0     F2     F4     F6     F8     F10
11     FU    res2   res1          MULT1

LD F2 16 R2 1 3 4 0 Load1 NO


DIVD F6 F2 F0 3 12 Load3 No

SUBD F0 F2 F10 4 10 11

SD F6 32 R3 5




0 Add1 No

0 Add2 No

Add3 No


Mult2 NO


Clock        F0     F2     F4     F6     F8     F10
12     FU    res2   res1          MULT1

LD F2 16 R2 1 3 4 0 Load1 NO


DIVD F6 F2 F0 3 12 13 Load3 No

SUBD F0 F2 F10 4 10 11

SD F6 32 R3 5 Time: 2

Store Yes 32(R3) Res3



0 Add1 No

0 Add2 No

Add3 No

0 Mult1 No

Mult2 NO


Clock        F0     F2     F4     F6     F8     F10
13     FU    res2   res1          Res3



LD F2 16 R2 1 3 4 0 Load1 NO



SUBD F0 F2 F10 4 10 11

SD F6 32 R3 5 Time: 1




0 Add1 No

0 Add2 No

Add3 No

0 Mult1 No

Mult2 NO



LD F2 16 R2 1 3 4 0 Load1 NO



SUBD F0 F2 F10 4 10 11

SD F6 32 R3 5 15 Time: 0




0 Add1 No

0 Add2 No

Add3 No

0 Mult1 No

Mult2 NO





PROBLEM 5:

Consider the following code. (The ....... marks indicate instructions that are ignored in this example.)

LOOP1: ADDI R4, R0, #4

.......

LOOP2: SUBI R4, R4, #1

.......

BNEZ R4, LOOP2

.......

BEQZ R8, LOOP1

.......

a) Focusing on the inner loop (LOOP2) only, analyze the branch behavior. Assume no other instruction changes the value of register R4. What

percentage of the time is the BNEZ branch instruction taken and not taken?

Suppose the BNEZ branch is taken N times per pass through LOOP2. Each pass then executes the branch N+1 times, so the branch is taken N/(N+1) of the time and not taken 1/(N+1) of the time. Here R4 is initialized to 4, so N = 3: the branch is taken 75% of the time and not taken 25% of the time.

b) Choose the best static branch prediction scheme for the BNEZ instruction. What percentage of the time will this static branch prediction be

correct for LOOP2?

Using predict-taken, the prediction is correct N times out of every N+1 executions of the branch, i.e. N/(N+1) of the time (75% for N = 3).

c) Now consider dynamic branch prediction. Draw the state machine for a one-bit branch predictor. Be sure to clearly identify or define the

meaning of each state. For the inner loop (LOOP2), what will be the misprediction rate of the one-bit branch predictor?



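A 1-bit branch predictor has two states, Predict Taken and Predict Not Taken, and always moves to the state that matches the most recent outcome of the branch. Studying LOOP2 only:

Iteration            1          2      3      ...  N      N+1
Prediction Decision  Not taken  Taken  Taken  ...  Taken  Taken
Final Decision       Taken      Taken  Taken  ...  Taken  Not taken

So the predictor makes the wrong decision 2 times out of every N+1 executions of the branch (on the first and the last iteration), a misprediction rate of 2/(N+1).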
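The 1-bit predictor's behavior on LOOP2 can be reproduced with a few lines of simulation. This is a sketch: the function name is illustrative, and the initial state is assumed to be "not taken", matching the first column of the table.

```python
def one_bit_mispredictions(outcomes, predict_taken=False):
    """Count mispredictions of a 1-bit predictor.

    outcomes: list of booleans, True means the branch was taken.
    predict_taken: initial predictor state (assumed: not taken).
    """
    misses = 0
    for taken in outcomes:
        if predict_taken != taken:
            misses += 1
        predict_taken = taken  # remember only the most recent outcome
    return misses


# One pass through LOOP2 with R4 = 4: BNEZ is taken 3 times, then falls through.
one_pass = [True, True, True, False]

misses_first_pass = one_bit_mispredictions(one_pass)
misses_ten_passes = one_bit_mispredictions(one_pass * 10)

print(misses_first_pass, "mispredictions in one pass")
print(misses_ten_passes, "mispredictions in ten passes")
```

Because the loop exit leaves the predictor in the not-taken state, every later pass repeats the same two mispredictions.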

d) Now draw the state diagram for a 2-bit dynamic branch predictor. Again, clearly label all states. What will be the misprediction rate of the 2-bit branch predictor for LOOP2?

Iteration            1          2          3      ...  N      N+1
Prediction Decision  Not taken  Not taken  Taken  ...  Taken  Taken
Final Decision       Taken      Taken      Taken  ...  Taken  Not taken

[State diagram of the 2-bit predictor, with transitions labeled Taken and Not Taken]
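The 2-bit case can be simulated in the same way. The sketch below assumes the common saturating-counter form of the 2-bit predictor (states 0 and 1 predict not taken, states 2 and 3 predict taken) and an initial state of 0; other 2-bit state machines exist and may give slightly different counts. The function name is illustrative.

```python
def two_bit_mispredictions(outcomes, counter=0):
    """Simulate a 2-bit saturating counter predictor.

    outcomes: list of booleans, True means the branch was taken.
    counter: initial state, 0..3 (assumed 0 = strongly not taken).
    Returns (number of mispredictions, final counter state).
    """
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses, counter


# One pass through LOOP2 with R4 = 4: taken 3 times, then not taken.
one_pass = [True, True, True, False]

first_misses, state = two_bit_mispredictions(one_pass)
second_misses, state = two_bit_mispredictions(one_pass, counter=state)

print(first_misses, "mispredictions in the first pass")
print(second_misses, "misprediction(s) in each later pass")
```

Under these assumptions the first pass mispredicts twice while the counter warms up and once on loop exit; later passes mispredict only on loop exit.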



e) Taking both loops into consideration, draw the state diagram for a (2,2) correlating dynamic branch predictor.

We will not use the (2,2) predictor, as it is not described in the lecture, so we will only consider the relation between LOOP1 and LOOP2. Suppose LOOP2 is executed N times in every LOOP1 iteration. With the 2-bit predictor, the first LOOP1 iteration has 3 mispredictions, and each subsequent LOOP1 iteration has only 1 misprediction until the end of LOOP1.
