
ETH, Design of Digital Circuits, SS17

Practice Exercises III

Instructors: Prof. Onur Mutlu, Prof. Srdjan Capkun

TAs: Jeremie Kim, Minesh Patel, Hasan Hassan, Arash Tavakkol, Der-Yeuan Yu, Francois Serre, Victoria Caparros Cabezas, David Sommer, Mridula Singh, Sinisa Matetic, Aritra Dhar, Marco Guarnieri

Note: These exercises are not graded and are optional.

1 Potpourri

1.1 Pipelining

a) Circle one of A, B, C, D. As pipeline depth increases, the latency to process a single instruction:

A. decreases

B. increases

C. stays the same

D. could increase, decrease, or stay the same, depending on ...

b) Explain your reasoning (in no more than 20 words):

Each additional pipeline latch adds processing overhead (additional latency).
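
The reasoning above can be sketched numerically. This is a minimal model, assuming the combinational logic is split evenly across stages and every stage ends in a latch with a fixed overhead; the 10 ns logic delay and 0.5 ns latch overhead are hypothetical values, not part of the exercise.

```python
def instruction_latency(depth, logic_ns=10.0, latch_ns=0.5):
    """Latency of one instruction through a pipeline with `depth` stages.

    Assumes the logic delay is split evenly across stages and each stage
    adds `latch_ns` of latch overhead (both numbers are hypothetical).
    """
    cycle_time = logic_ns / depth + latch_ns  # per-stage logic plus latch
    return depth * cycle_time                 # one instruction visits every stage


for depth in (1, 5, 10, 20):
    print(depth, instruction_latency(depth))
```

Under these assumptions the single-instruction latency is logic_ns + depth * latch_ns, so it grows with every added stage even though the cycle time shrinks.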

1.2 Data Dependencies

In a superscalar processor, the hardware detects the data dependencies between concurrently fetched instructions. In contrast, in a VLIW processor, the compiler does.

1.3 TLBs

What does a TLB (Translation Lookaside Buffer) cache?

Page table entries.

1.4 SIMD Processing

Suppose we want to design a SIMD engine that can support a vector length of 16. We have two options: a traditional vector processor and a traditional array processor.

Which one is more costly in terms of chip area (circle one)?

The traditional vector processor The traditional array processor Neither

Explain:

An array processor requires 16 functional units for an operation whereas a vector processor requires only 1.

Assuming the latency of an addition operation is five cycles in both processors, how long will a VADD (vector add) instruction take in each of the processors (assume that the adder can be fully pipelined and is the same for both processors)?


For a vector length of 1:

The traditional vector processor: 5 cycles

The traditional array processor: 5 cycles


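For a vector length of 4:

The traditional vector processor: 8 cycles (5 for the first element to complete, 3 for the remaining 3)

The traditional array processor: 5 cycles

For a vector length of 16:

The traditional vector processor: 20 cycles (5 for the first element to complete, 15 for the remaining 15)

The traditional array processor: 5 cycles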
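
The vector-processor cycle counts follow a simple pattern, sketched below. The function names are illustrative, and the array-processor model assumes one adder per element (16 lanes, as in the area explanation above), so all elements finish together.

```python
def vector_processor_cycles(vlen, latency=5):
    # One fully pipelined adder: the first element takes `latency` cycles,
    # then one more element completes in every following cycle.
    return latency + (vlen - 1)


def array_processor_cycles(vlen, latency=5, lanes=16):
    # Assumption: one adder per element, so every element of a vector that
    # fits in the lanes finishes after a single adder latency.
    assert vlen <= lanes, "sketch only covers vectors that fit in the lanes"
    return latency


for vlen in (1, 4, 16):
    print(vlen, vector_processor_cycles(vlen), array_processor_cycles(vlen))
```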



2 GPUs and SIMD

We define the SIMD utilization of a program run on a GPU as the fraction of SIMD lanes that are kept busy with active threads during the run of a program.

The following code segment is run on a GPU. Each thread executes a single iteration of the shown loop. Assume that the data values of the arrays A, B, and C are already in vector registers so there are no loads and stores in this program. (Hint: Notice that there are 4 instructions in each thread.) A warp in the GPU consists of 64 threads, and there are 64 SIMD lanes in the GPU.

for (i = 0; i < 1048576; i++) {
    if (B[i] < 4444) {
        A[i] = A[i] * C[i];
        B[i] = A[i] + B[i];
        C[i] = B[i] + 1;
    }
}

(a) How many warps does it take to execute this program?

Warps = (Number of threads) / (Number of threads per warp)

Number of threads = 2^20 (i.e., one thread per loop iteration).

Number of threads per warp = 64 = 2^6 (given).

Warps = 2^20 / 2^6 = 2^14

(b) When we measure the SIMD utilization for this program with one input set, we find that it is 67/256. What can you say about arrays A, B, and C? Be precise (Hint: Look at the “if” branch, what can you say about A, B and C?).

A: Nothing

B: 1 in every 64 of B’s elements is less than 4444

C: Nothing

(c) Is it possible for this program to yield a SIMD utilization of 100% (circle one)?

YES NO

If YES, what should be true about arrays A, B, C for the SIMD utilization to be 100%? Be precise. If NO, explain why not.

Yes. Either:

(1) All of B’s elements are greater than or equal to 4444, or

(2) All of B’s elements are less than 4444.

(d) Is it possible for this program to yield a SIMD utilization of 25% (circle one)?

YES NO


If YES, what should be true about arrays A, B, and C for the SIMD utilization to be 25%? Be precise. If NO, explain why not.

No. The smallest SIMD utilization possible is the same as in part (b), 67/256, but this is greater than 25%.
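
The answers to parts (a) through (d) can be checked with a small per-warp model. This is a sketch under stated assumptions: the branch counts as one instruction executed by all 64 lanes, the three body instructions keep only the threads that took the branch busy, and a warp in which no thread takes the branch skips the body entirely. The names `warp_utilization` and `taken` are illustrative.

```python
from fractions import Fraction

LANES = 64        # SIMD lanes = threads per warp (given)
BODY_INSTRS = 3   # instructions inside the if statement


def warp_utilization(taken):
    """SIMD utilization of one warp in which `taken` threads pass the branch."""
    if taken == 0:
        return Fraction(1)  # only the branch executes, with all lanes busy
    busy = LANES + BODY_INSTRS * taken     # busy lane-slots
    total = LANES * (1 + BODY_INSTRS)      # available lane-slots
    return Fraction(busy, total)


num_warps = 2**20 // LANES
lowest = min(warp_utilization(k) for k in range(LANES + 1))
print(num_warps, warp_utilization(1), lowest)
```

In this model the lowest value is reached when exactly one thread per warp takes the branch, which matches the 67/256 of part (b).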


3 Tomasulo’s Algorithm

Remember that Tomasulo’s algorithm requires tag broadcast and comparison to enable wake-up of dependent instructions. In this question, we will calculate the number of tag comparators and size of tag storage required to implement Tomasulo’s algorithm in a machine that has the following properties:

• 8 functional units where each functional unit has a dedicated separate tag and data broadcast bus

• 32 64-bit architectural registers

• 16 reservation station entries per functional unit

• Each reservation station entry can have two source registers

Answer the following questions. Show your work for credit.

(a) What is the number of tag comparators per reservation station entry?

8 ∗ 2

(b) What is the total number of tag comparators in the entire machine?

16 ∗ 8 ∗ 2 ∗ 8 + 8 ∗ 32

(c) What is the (minimum possible) size of the tag?

log2(16 ∗ 8) = 7 bits

(d) What is the (minimum possible) size of the register alias table (or, frontend register file) in bits?

72 ∗ 32 (64 bits for data, 7 bits for the tag, 1 valid bit)

(e) What is the total (minimum possible) size of the tag storage in the entire machine in bits?

7 ∗ 32 + 7 ∗ 16 ∗ 8 ∗ 2
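
The expressions in (a) through (e) can be evaluated directly. The sketch below is one reading of the terms: every reservation station source compares against every broadcast bus, every register alias table entry compares against every bus, and a tag names one of the 16 ∗ 8 reservation station entries. The constant names are illustrative.

```python
import math

FUNCTIONAL_UNITS = 8   # each with its own tag and data broadcast bus
ARCH_REGISTERS = 32
RS_PER_UNIT = 16
SOURCES_PER_RS = 2
DATA_BITS = 64

rs_entries = RS_PER_UNIT * FUNCTIONAL_UNITS                      # 128 entries
comparators_per_rs = FUNCTIONAL_UNITS * SOURCES_PER_RS           # (a)
total_comparators = (rs_entries * comparators_per_rs
                     + FUNCTIONAL_UNITS * ARCH_REGISTERS)        # (b)
tag_bits = math.ceil(math.log2(rs_entries))                      # (c)
rat_bits = (DATA_BITS + tag_bits + 1) * ARCH_REGISTERS           # (d), +1 valid bit
tag_storage_bits = (tag_bits * ARCH_REGISTERS
                    + tag_bits * rs_entries * SOURCES_PER_RS)    # (e)

print(comparators_per_rs, total_comparators, tag_bits, rat_bits, tag_storage_bits)
```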


4 Out-of-Order Execution

In this problem, we will give you the state of the Register Alias Table (RAT) and Reservation Stations (RS) for an out-of-order execution engine that employs Tomasulo’s algorithm. Your job is to determine the original sequence of five instructions in program order.

The out-of-order machine in this problem behaves as follows:

• The frontend of the machine has a one-cycle fetch stage and a one-cycle decode stage. The machine can fetch one instruction per cycle, and can decode one instruction per cycle.

• The machine dispatches one instruction per cycle into the reservation stations, in program order. Dispatch occurs during the decode stage.

• An instruction always allocates the first reservation station that is available (in top-to-bottom order) at the required functional unit.

• When a value is captured (at a reservation station) or written back (to a register) in this machine, the old tag that was previously at that location is not cleared; only the valid bit is set.

• When an instruction in a reservation station finishes executing, the reservation station is cleared.

• Both the adder and multiplier are fully pipelined. An add instruction takes 2 cycles. A multiply instruction takes 4 cycles.

• When an instruction completes execution, it broadcasts its result. A dependent instruction can begin execution in the next cycle if it has all its operands available.

• When multiple instructions are ready to execute at a functional unit, the oldest ready instruction is chosen.

Initially, the machine is empty. Five instructions are then fetched, decoded, and dispatched into reservation stations. When the final instruction has been fetched and decoded, one instruction has already been written back. Pictured below is the state of the machine at this point, after the fifth instruction has been fetched and decoded:

RAT

Reg  V  Tag  Value
R0   1  -    13
R1   0  A    8
R2   1  -    3
R3   1  -    5
R4   0  X    255
R5   0  Y    12
R6   0  Z    74
R7   1  -    7

ADD reservation stations

      Src 1            Src 2
ID    Tag  V  Value    Tag  V  Value
A     -    1  5        Z    0  -
B
C

MUL reservation stations

      Src 1            Src 2
ID    Tag  V  Value    Tag  V  Value
X     A    1  8        -    1  7
Y     X    0  -        -    1  13
Z     -    1  3        -    1  8


(a) Give the five instructions that have been dispatched into the machine, in program order. The source registers for the first instruction can be specified in either order. Give instructions in the following format: “opcode destination ⇐ source1, source2.”

ADD R1 ⇐ R2, R3

MUL R4 ⇐ R1, R7

MUL R5 ⇐ R4, R0

MUL R6 ⇐ R2, R1

ADD R1 ⇐ R3, R6

(b) Now assume that the machine flushes all instructions out of the pipeline and restarts fetch from the first instruction in the sequence above. Show the full pipeline timing diagram below for the sequence of five instructions that you determined above, from the fetch of the first instruction to the writeback of the last instruction. Assume that the machine stops fetching instructions after the fifth instruction.

As we saw in class, use “F” for fetch, “D” for decode, “En” to signify the nth cycle of execution for an instruction, and “W” to signify writeback. You may or may not need all columns shown.

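Cycle:             1   2   3   4   5   6   7   8   9   10  11  12  13  14

ADD R1 ⇐ R2, R3:   F   D   E1  E2  W
MUL R4 ⇐ R1, R7:       F   D   -   E1  E2  E3  E4  W
MUL R5 ⇐ R4, R0:           F   D   -   -   -   -   E1  E2  E3  E4  W
MUL R6 ⇐ R2, R1:               F   D   E1  E2  E3  E4  W
ADD R1 ⇐ R3, R6:                   F   D   -   -   -   E1  E2  W

(“-” marks a cycle in which the instruction waits in its reservation station for an operand.)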
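
The timing for this sequence can be cross-checked with a small scheduler. This is a sketch of the machine rules listed above, not a full Tomasulo model: it assumes instruction i is fetched in cycle i + 1 and decoded in cycle i + 2, that a reservation station is always free, and that writeback happens in the cycle after the last execute cycle. The `schedule` function and the tuple encoding of instructions are illustrative.

```python
LATENCY = {"ADD": 2, "MUL": 4}  # execution cycles per functional unit (given)


def schedule(program):
    """Return (first execute cycle, writeback cycle) for each instruction.

    `program` is a list of (opcode, destination, sources) in program order.
    """
    n = len(program)
    decode = [i + 2 for i in range(n)]
    # For each source, find the most recent earlier instruction that writes it.
    producers = []
    for i, (_, _, sources) in enumerate(program):
        deps = []
        for src in sources:
            for j in range(i - 1, -1, -1):
                if program[j][1] == src:
                    deps.append(j)
                    break
        producers.append(deps)

    start = [None] * n
    cycle = 0
    while None in start:
        cycle += 1
        for unit in LATENCY:
            for i in range(n):  # scan in program order: oldest ready first
                if program[i][0] != unit or start[i] is not None or decode[i] >= cycle:
                    continue
                # A producer's last execute cycle is start + latency - 1, so a
                # dependent may begin once start + latency <= current cycle.
                ready = all(
                    start[p] is not None and start[p] + LATENCY[program[p][0]] <= cycle
                    for p in producers[i]
                )
                if ready:
                    start[i] = cycle
                    break  # a pipelined unit starts one instruction per cycle
    return [(s, s + LATENCY[program[i][0]]) for i, s in enumerate(start)]


program = [
    ("ADD", "R1", ("R2", "R3")),
    ("MUL", "R4", ("R1", "R7")),
    ("MUL", "R5", ("R4", "R0")),
    ("MUL", "R6", ("R2", "R1")),
    ("ADD", "R1", ("R3", "R6")),
]
print(schedule(program))
```

Under these assumptions the writeback cycles come out as 5, 9, 13, 10, and 12, which agrees with the state after 10 cycles shown below (R4 and R6 already written back, R5 and R1 still pending).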

Finally, show the state of the RAT and reservation stations after 10 cycles in the blank figures below.

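RAT

Reg  V  Tag  Value
R0   1  -    13
R1   0  A    8
R2   1  -    3
R3   1  -    5
R4   1  X    56
R5   0  Y    12
R6   1  Z    24
R7   1  -    7

ADD reservation stations

      Src 1            Src 2
ID    Tag  V  Value    Tag  V  Value
A     -    1  5        Z    1  24
B
C

MUL reservation stations

      Src 1            Src 2
ID    Tag  V  Value    Tag  V  Value
X
Y     X    1  56       -    1  13
Z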


5 Dataflow

Consider the dataflow graph below and answer the following questions. Please restrict your answer to 10 words or less.

What does the whole dataflow graph do (10 words or less)? (Hint: Identify what Module 1 and Module 2 perform.)

The dataflow graph deadlocks unless the greatest common divisor (GCD) of A and B is the same as the least common multiple (LCM) of A and B.

If (GCD(A, B) == LCM(A, B)) then ANSWER = 1

If you assume that A and B are fed as inputs continuously, the dataflow graph finds the least common multiple (LCM) of A and B.


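[Figure: dataflow graph with inputs A and B and output ANSWER, divided into Module 1 and Module 2. Node labels visible in the figure: C (copy), *, <, =0? (twice), +, the constant 1 (twice), and switch nodes with T/F outputs. Legend: a “C” node is a copy node (initially C=A, then C=B); the two-input node with inputs A and B and output C computes C=A-B.]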


6 Systolic Arrays

The following diagram is a systolic array that performs the multiplication of two 4-bit binary numbers. Assume that each adder takes one cycle.

(a) How many cycles does it take to perform one multiplication?

9 cycles

(b) How many cycles does it take to perform three multiplications?

13 cycles.

Note that some input ports have to hold their values for an extra cycle because their nodes need to wait for inputs from the previous nodes. You cannot completely overlap the latencies of different nodes. However, because no one was aware of this, we accepted 11 cycles as another possible answer.


(c) Fill in the following table, which has a list of inputs (each a 1-bit binary number) such that the systolic array produces the following outputs, in order: 5 ∗ 5, 12 ∗ 9, 11 ∗ 15. Please refer to the following diagram as a reference for all the input ports.

5 x 5 = 101 x 101

12 x 9 = 1100 x 1001

11 x 15 = 1011 x 1111

        Row Inputs                                   Column Inputs
Cycle   a0 a1 a2 a3 a4 a5 a6   b0 b1 b2 b3 b4 b5 b6   a7 a8 a9   b7 b8 b9
1       0  0  0  0  0  0  0    0  1  0  0  0  0  0    1  0  0    0  0  0
2       0  0  1  0  0  0  0    0  1  1  0  0  0  0    1  0  0    0  0  0
3       0  0  0  0  0  0  0    0  1  0  1  0  0  0    0  1  0    0  1  0
4       0  0  1  0  0  0  0    0  1  1  1  0  0  0    0  1  0    0  1  0
5       0  1  0  1  0  0  0    0  1  0  1  0  0  0    1  1  1    1  0  0
6       0  1  0  1  0  0  0    0  1  1  1  0  0  0    1  1  1    1  0  0
7       0  0  0  1  1  0  0    0  0  0  1  0  1  0    0  1  0    0  1  1
8       0  0  0  1  1  0  0    0  0  0  1  0  1  0    0  1  0    0  1  1
9       1  0  0  0  1  1  0    1  0  0  0  1  0  0    0  0  0    0  0  0
10      0  0  0  0  1  1  0    0  0  0  0  1  0  0    0  0  1    0  0  1
11      0  0  0  0  0  1  1    1  0  0  0  0  1  1    0  0  1    0  0  1
12      0  0  0  0  0  1  0    0  0  0  0  0  1  0    0  0  0    0  0  0
13      1  0  0  0  0  0  1    1  0  0  0  0  0  1    0  0  0    0  0  0
14      0  0  0  0  0  0  0    0  0  0  0  0  0  0    0  0  0    0  0  0


7 Mystery Instruction

A pesky engineer implemented a mystery instruction on the LC-3b. It is your job to determine what the instruction does. The mystery instruction is encoded as:

Bits:   15 14 13 12 | 11 10 9 | 8 7 6 | 5 4 3 2 1 0
Field:  1  0  1  0  | DR      | SR1   | 0 0 0 0 0 0

The modifications we make to the LC-3b datapath and the microsequencer are highlighted in the attached figures (see the next two pages). We also provide the original LC-3b state diagram, in case you need it. (As a reminder, the selection logic for SR2MUX is determined internally based on the instruction.)

The additional control signals are

GateTEMP1/1: NO, YES

GateTEMP2/1: NO, YES

LD.TEMP1/1: NO, LOAD

LD.TEMP2/1: NO, LOAD

ALUK/3: OR1 (A|0x1), LSHF1 (A<<1), PASSA, PASS0 (Pass value 0), PASS16 (Pass value 16)

COND/4:
COND0000 ; Unconditional
COND0001 ; Memory Ready
COND0010 ; Branch
COND0011 ; Addressing mode
COND0100 ; Mystery 1
COND1000 ; Mystery 2

The microcode for the instruction is given in the table below.

State        Cond      J       Asserted Signals
001010 (10)  COND0000  001011  ALUK = PASS0, GateALU, LD.REG, DRMUX = DR (IR[11:9])
001011 (11)  COND0000  101000  ALUK = PASSA, GateALU, LD.TEMP1, SR1MUX = SR1 (IR[8:6])
101000 (40)  COND0000  110010  ALUK = PASS16, GateALU, LD.TEMP2
110010 (50)  COND1000  101101  ALUK = LSHF1, GateALU, LD.REG, SR1MUX = DR, DRMUX = DR (IR[11:9])
111101 (61)  COND0000  101101  ALUK = OR1, GateALU, LD.REG, SR1MUX = DR, DRMUX = DR (IR[11:9])
101101 (45)  COND0000  111111  GateTEMP1, LD.TEMP1
111111 (63)  COND0100  010010  GateTEMP2, LD.TEMP2

Describe what this instruction does.

Bit-reverses the value in SR1 and puts it in DR.


Code:

State 10: DR ← 0
State 11: TEMP1 ← value(SR1)
State 40: TEMP2 ← 16
State 50: DR ← DR << 1
          if (TEMP1[0] == 0)
              goto State 45
          else
              goto State 61
State 61: DR ← DR | 0x1
State 45: TEMP1 ← TEMP1 >> 1
State 63: DEC TEMP2
          if (TEMP2 == 0)
              goto State 18
          else
              goto State 50
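
The state sequence above can be traced in software to confirm the bit reversal. This is a sketch that models DR, TEMP1, and TEMP2 as 16-bit values and the states as one loop; the function name `mystery` is illustrative.

```python
def mystery(sr1):
    """Software model of the microcode states 10, 11, 40, 50, 61, 45, 63."""
    dr = 0                         # State 10: DR <- 0
    temp1 = sr1 & 0xFFFF           # State 11: TEMP1 <- value(SR1)
    temp2 = 16                     # State 40: TEMP2 <- 16
    while True:
        bit = temp1 & 1            # TEMP1[0], tested when leaving State 50
        dr = (dr << 1) & 0xFFFF    # State 50: DR <- DR << 1
        if bit:
            dr |= 0x1              # State 61: DR <- DR | 0x1
        temp1 >>= 1                # State 45: TEMP1 <- TEMP1 >> 1
        temp2 -= 1                 # State 63: DEC TEMP2
        if temp2 == 0:
            return dr              # goto State 18 (next instruction fetch)


print(hex(mystery(0x1234)))
```

Each iteration moves the current lowest bit of TEMP1 into the lowest bit of DR while DR shifts left, so after 16 iterations the bits of SR1 appear in DR in reverse order.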


[Attached figure page: Section C.4, “The Control Structure”. Figure C.6: Additional logic required to provide control signals. Panel (a): DRMUX, selecting between IR[11:9] and 111 to produce DR. Panel (b): SR1MUX, selecting between IR[11:9] and IR[8:6] to produce SR1. Panel (c): logic producing BEN from IR[11:9] and N, Z, P.]

Text visible on the same page: [...] LC-3b to operate correctly with a memory that takes multiple clock cycles to read or store a value. Suppose it takes memory five cycles to read a value. That is, once MAR contains the address to be read and the microinstruction asserts READ, it will take five cycles before the contents of the specified location in memory are available to be loaded into MDR. (Note that the microinstruction asserts READ by means of three control signals: MIO.EN/YES, R.W/RD, and DATA.SIZE/WORD; see Figure C.3.) Recall our discussion in Section C.2 of the function of state 33, which accesses an instruction from memory during the fetch phase of each instruction cycle. For the LC-3b to operate correctly, state 33 must execute five times before moving on to state 35. That is, until MDR contains valid data from the memory location specified by the contents of MAR, we want state 33 to continue to re-execute. After five clock cycles, [...]

[Attached figure page: Section C.2, “The State Machine”. Figure C.2: A state machine for the LC-3b.]

8 Finite State Machines

We want to design a Finite State Machine (FSM) that has a one bit input (in) and will detect the sequence 0-1-0. If this sequence is detected, the one bit output (detected) will be set to 1, otherwise this output will remain at 0.

Two of your colleagues have designed different state transition diagrams given below.

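[Figure: two state transition diagrams.

FSM A: states INIT (detected=0), ZERO (detected=0), ZEROONE (detected=0), and OUT (detected=1); transitions are labeled with the value of in (in=0 or in=1), and there is a reset arc.

FSM B: states INIT, ZERO, and ZEROONE; transitions are labeled “in / detected” (in=0 / detected=0, in=1 / detected=0, in=0 / detected=1), and there is a reset arc.]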

(a) Which one of the state diagrams depicts a Moore and which one a Mealy type of FSM?

FSM A is a Moore type FSM: the output depends only on the state. FSM B is a Mealy type FSM: the output depends on both the state and the inputs.

(b) For both state transition diagrams state whether or not they are correct.

FSM A has a small mistake: for state ZERO it is not clear what will happen when in=0. FSM B is correct.


(c) Complete the following Verilog module that would implement the state machine as described in the question. You can implement one state transition diagram of your colleagues if that one is correct.

module fsm (input in, input clk, input reset, output reg detected);

  reg [1:0] next_state, present_state;

  // State encoding
  parameter INIT    = 2'b11;
  parameter ZERO    = 2'b00;
  parameter ZEROONE = 2'b01;

  // Next-state and output logic (combinational, Mealy-type output as in FSM B)
  always @ (*)
  begin
    next_state = present_state; // default
    detected   = 1'b0;
    case (present_state)
      INIT:    next_state = in ? INIT : ZERO;
      ZERO:    next_state = in ? ZEROONE : ZERO;
      ZEROONE: if (in)
                 next_state = INIT;
               else
               begin
                 next_state = ZERO;
                 detected   = 1'b1;
               end
      default: next_state = present_state;
    endcase
  end

  // State register with asynchronous reset
  always @ (posedge clk, posedge reset)
    if (reset) present_state <= INIT;
    else       present_state <= next_state;

endmodule
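
As a quick behavioral cross-check of the module above, the same transitions can be modeled in software. This is a sketch, not a replacement for simulating the Verilog: it processes one input bit per clock cycle and returns the value of detected in each cycle; the function name `detect_010` is illustrative.

```python
def detect_010(bits):
    """Mealy-style model: returns the `detected` output for each input bit."""
    state = "INIT"  # state after reset
    outputs = []
    for bit in bits:
        detected = 0
        if state == "INIT":
            next_state = "INIT" if bit else "ZERO"
        elif state == "ZERO":
            next_state = "ZEROONE" if bit else "ZERO"
        else:  # ZEROONE
            if bit:
                next_state = "INIT"
            else:
                next_state = "ZERO"  # the final 0 can start a new 0-1-0
                detected = 1
        outputs.append(detected)
        state = next_state
    return outputs


print(detect_010([0, 1, 0, 1, 0]))
```

Because the final 0 of a match returns the machine to ZERO, overlapping occurrences such as 0-1-0-1-0 are detected twice.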


9 Verilog

There are four Verilog code snippets in this section. For each snippet, first state whether or not there is a mistake. If there is a mistake, explain how to correct it.

Note: Assume that the behavior, as described, is correct.

(a)

module mux2 (input [1:0] i, output s, input z);
  assign s = (z) ? i[1] : i[0];
endmodule

module one (input [3:0] data, input sel1, input sel2, output z);
  wire [1:0] temp;

  mux2 i0 (data[1:0], sel1, temp[0]);
  mux2 i1 (data[3:2], sel1, temp[1]);
  mux2 final (temp, sel2, z);
endmodule

This code is not correct. It uses ordered (positional) port connections, where the order of the connected signals must match the order in which the ports are declared. This by itself is not wrong, but the select signals in the second position (sel1, sel2) end up connected to the output s of mux2. Either the port order of mux2 or the order of the signals in the instantiations needs to be changed. The better option would be to connect the ports by name.

(b)

module two (input [3:0] a, input [0:3] b, output reg [3:0] z);
  always @ (*)
    if (a == 0)
      z = {b[0], b[1], b[2], b[3]};
    else
      z = a + b;
endmodule

The code is syntactically correct. There is a lot of unnecessary code: declaring one input as [0:3] and the other as [3:0] is not the smartest choice, and the entire module could be written as assign z=a+b, but syntactically the code is correct.


(c)

module three (clk, rst, a, b, c, z);
  input a, b, c, clk, rst;
  output reg z;
  reg q;

  always @ (*)
  begin
    q <= a ^ b;
    if (c) q <= ~(a ^ b);
  end

  always @ (negedge clk)
    if (rst) z = 1'b0;
    else     z = q;
endmodule

The code is syntactically correct. Once again, it is a bit unconventional: one always block uses blocking assignments and the other uses non-blocking assignments. Both would work, although it would probably be more efficient to write them the other way around.

(d)

module four (input [2:0] sel, output reg [5:0] z);
  case (sel)
    0: z = 6'b00_0000;
    1: z = 6'b00_0001;
    2: z = 6'b00_0011;
    3: z = 6'b00_0111;
    4: z = 6'b00_1111;
    5: z = 6'b01_1111;
    6: z = 6'b11_1111;
    default: z = 6'b00_0000;
  endcase
endmodule

This code is not correct. The case statement is a sequential statement that needs to be within an always statement.
