Latency Solution-pipeline reservation table

TEST-3 SOLUTIONS

Subject: Advanced Computer Architecture

1) Consider the following pipeline reservation table.

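        1   2   3   4   5   6   7   8
S1
S2
S3

(Rows S1, S2 and S3 carry three, two and three X marks respectively; the column positions of the marks are not legible in this copy.)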

(a) What are the forbidden latencies?
(b) Draw the state transition diagram.
(c) List all the simple cycles and greedy cycles.
(d) Determine the optimal constant latency cycle and the minimal average latency.
(e) Let the pipeline clock period be τ = 20 ns. Determine the throughput of the pipeline. (10 Marks)

Sol:

Forbidden latencies: 2, 4, 5 and 7
Permissible latencies: 1, 3, 6 and 8+ (any latency of 8 or more)
Collision vector: C7 C6 C5 C4 C3 C2 C1 = 1011010 (Ci = 1 when latency i is forbidden)
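The step from the forbidden latencies to the collision vector can be sketched in a few lines of Python. The function name `collision_vector` is only an illustrative choice, and the vector length is assumed to equal the largest forbidden latency, as in the solution above.

```python
def collision_vector(forbidden):
    """Return the collision vector C_m ... C_1 as a bit string.

    Bit C_i is 1 when latency i is forbidden; m is taken to be the
    largest forbidden latency (an assumption matching the solution).
    """
    m = max(forbidden)
    return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))


print(collision_vector({2, 4, 5, 7}))  # 1011010
```

The same helper can be applied to the forbidden sets of the later questions.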

In each case the present state (PS) is shifted right by the latency and the result is ORed (shown as +) with the collision vector to give the next state.

CASE 1: latency 3

Present state        1011010
Collision vector     1011010
PS with 3 shifts   + 0001011
Next state           1011011

Present state        1011011
Collision vector     1011010
PS with 3 shifts   + 0001011
Next state           1011011

Present state        1011011
Collision vector     1011010
PS with 8 shifts   + 0000000
Next state           1011010

CASE 2: latency 6

Present state        1011010
Collision vector     1011010
PS with 6 shifts   + 0000001
Next state           1011011

Present state        1011011
Collision vector     1011010
PS with 6 shifts   + 0000001
Next state           1011011

Present state        1011011
Collision vector     1011010
PS with 8 shifts   + 0000000
Next state           1011010

CASE 3: latency 1

Present state        1011010
Collision vector     1011010
PS with 1 shift    + 0101101
Next state           1111111

CASE 4: latency 8

Present state        1111111
Collision vector     1011010
PS with 8 shifts   + 0000000
Next state           1011010

Present state        1011010
Collision vector     1011010
PS with 8 shifts   + 0000000
Next state           1011010

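State transition diagram (edge labels are latencies; 8+ means any latency of 8 or more):

1011010 --1*--> 1111111
1011010 --3, 6--> 1011011
1011010 --8+--> 1011010
1011011 --3*, 6--> 1011011
1011011 --8+--> 1011010
1111111 --8+--> 1011010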
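The shift-and-OR rule used in the cases above can be turned into a short Python sketch that builds the whole state diagram from the collision vector. The name `state_diagram` and the use of latency m + 1 to stand for every latency greater than m are illustrative assumptions, not part of the original solution.

```python
def state_diagram(cv_bits):
    """Map each reachable state to {latency: next_state}, all as bit strings."""
    m = len(cv_bits)
    cv = int(cv_bits, 2)
    diagram, todo = {}, [cv]
    while todo:
        state = todo.pop()
        if state in diagram:
            continue
        edges = {}
        for k in range(1, m + 2):          # k = m + 1 stands for every latency > m
            if (state >> (k - 1)) & 1:     # bit C_k is 1: latency k is forbidden here
                continue
            edges[k] = (state >> k) | cv   # shift right k places, then OR with the CV
        diagram[state] = edges
        todo.extend(edges.values())
    return {format(s, f"0{m}b"): {k: format(n, f"0{m}b") for k, n in e.items()}
            for s, e in diagram.items()}


for state, edges in sorted(state_diagram("1011010").items()):
    print(state, edges)
```

For the collision vector 1011010 this yields the three states 1011010, 1011011 and 1111111, with latency 8 standing for 8+.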

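Latency cycles: (1, 8) (1, 8, 8) (1, 8, 3, 8) (1, 8, 8, 3, 8) (1, 8, 6, 8) (1, 8, 8, 6, 8) (8) (3) (6) (1, 8, 8, 6, 6, 8)
Simple cycles: (3) (6) (8) (1, 8) (3, 8) (6, 8)
Greedy cycles: (3) (1, 8)
Optimal constant latency cycle: (3)
MAL: Lower bound = 3 (maximum number of marks in any row of the reservation table)
Upper bound = 4 + 1 = 5 (number of 1s in the collision vector, plus one)
Average latency of greedy cycle (1, 8) = (1+8) / 2 = 4.5, so MAL ≤ 4.5
MAL = 3, achieved by the greedy cycle (3)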
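The greedy cycles and the MAL can be cross-checked with a small Python sketch. It follows the minimum permissible latency out of every reachable state until a state repeats, and takes the smallest average over the cycles found; this relies on the textbook result that a greedy cycle attains the MAL. The names `greedy_cycles` and `mal` are hypothetical, and latency m + 1 again stands for every latency above m.

```python
from fractions import Fraction


def greedy_cycles(cv_bits):
    """Return the greedy cycles of the state diagram as sorted latency tuples."""
    m, cv = len(cv_bits), int(cv_bits, 2)

    def successors(state):
        # latency k is permissible when bit C_k of the state is 0
        return {k: (state >> k) | cv
                for k in range(1, m + 2) if not (state >> (k - 1)) & 1}

    # collect every state reachable from the initial state (the collision vector)
    states, todo = set(), [cv]
    while todo:
        s = todo.pop()
        if s not in states:
            states.add(s)
            todo.extend(successors(s).values())

    cycles = set()
    for start in states:
        path, first_seen, s = [], {}, start
        while s not in first_seen:
            first_seen[s] = len(path)
            k = min(successors(s))         # greedy choice: smallest latency
            path.append(k)
            s = successors(s)[k]
        loop = tuple(path[first_seen[s]:])
        # store the smallest rotation so the same cycle is counted once
        cycles.add(min(loop[i:] + loop[:i] for i in range(len(loop))))
    return sorted(cycles)


def mal(cv_bits):
    """Smallest average latency over the greedy cycles."""
    return min(Fraction(sum(c), len(c)) for c in greedy_cycles(cv_bits))


print(greedy_cycles("1011010"), mal("1011010"))
```

For the collision vector 1011010 the sketch reports the greedy cycles (1, 8) and (3) and a MAL of 3, in agreement with the solution above.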

Given: τ = 20 ns

With MAL = 3, one task is initiated every 3 clock cycles on average.

Throughput of the pipeline = 1 / (MAL × τ) = 1 / (3 × 20 × 10^-9 s) ≈ 16.67 million initiations per second (16.67 MIPS).

2) Describe the mechanisms for instruction pipelining in terms of prefetch buffers and multiple functional units. (10 Marks)

Sol: Prefetch buffers:

Three types of prefetch buffers are used to match the instruction fetch rate to the pipeline consumption rate:

1. Sequential buffers

2. Target buffers

3. Loop buffers

Sequential buffers:

Sequential instructions are loaded into a pair of sequential buffers for in-sequence pipelining.

Target buffers:

Instructions from a branch target are loaded into a pair of target buffers for out-of-sequence pipelining. Both buffers operate in FIFO fashion. These buffers become part of the pipeline as additional stages.

A conditional branch instruction causes both the sequential buffers and the target buffers to fill with instructions.

After the branch condition is checked, the appropriate instructions are taken from one of the two buffers, and the instructions in the other buffer are discarded.

Within each pair, the two buffers alternate to prevent a collision between instructions flowing into and out of the pipeline.

[Figure: prefetch buffers. Labels: Memory, Fetch cache, Seq buffer 1, Seq buffer 2, Target buffer 1, Target buffer 2, Instruction pipeline, Instructions from branched locations.]

Loop buffers:

These buffers hold the sequential instructions contained in a small loop. The loop buffers are maintained by the fetch stage of the pipeline. Prefetched instructions in the loop body are executed repeatedly until all iterations complete execution.

The loop buffer operates in two steps.

a. It contains instructions sequentially ahead of the current instruction. This saves the instruction fetch time from memory.

b. It recognizes when the target of a branch falls within the loop boundary.

Multiple functional units:

The architecture considered here is a pipelined scalar architecture with multiple functional units. To resolve data dependences and resource dependences among successive instructions entering the pipeline, reservation stations (RS) are used with each functional unit.

Operations wait in the reservation stations until their data dependences have been resolved.

Each reservation station is uniquely identified by a tag, which is monitored by a tag unit.

The tag unit keeps checking the tags from all currently used registers or reservation stations.

This register tagging technique allows the hardware to resolve conflicts between source and destination registers assigned for multiple instructions.

Besides resolving conflicts, the reservation stations also serve as buffers to interface the pipelined functional units with the decode and issue unit.

The multiple functional units can operate in parallel once the dependences are resolved.

[Figure: pipelined scalar processor with multiple functional units. Labels: Instructions from memory, Instruction fetch unit, Decode and issue unit, Tag unit, Register file (B T, A S), Load registers, Memory, Reservation stations (RS), Functional units (FU).]

PART-2

Answer any Two full questions.

3) Consider the five-stage pipelined processor specified by the following reservation table

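        1   2   3   4   5   6
S1
S2
S3
S4
S5

(Rows S1 to S5 carry two, two, one, one and two X marks respectively; the column positions of the marks are not legible in this copy.)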

(a) What are the forbidden latencies?
(b) Draw the state transition diagram.
(c) List all the simple cycles and greedy cycles.
(d) Determine the optimal constant latency cycle and the minimal average latency (MAL). (10 Marks)

Sol: Forbidden latencies: 3, 4 and 5
Permissible latencies: 1, 2 and 6+ (any latency of 6 or more)
Collision vector: C5 C4 C3 C2 C1 = 11100

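CASE 1: latency 1

Present state        11100
Collision vector     11100
PS with 1 shift    + 01110
Next state           11110

Present state        11110
Collision vector     11100
PS with 1 shift    + 01111
Next state           11111

CASE 2: latency 2

Present state        11100
Collision vector     11100
PS with 2 shifts   + 00111
Next state           11111

CASE 3: latency 6

From any state (11100, 11110 or 11111), 6 shifts clear the present state, so the next state is the collision vector 11100.

State transition diagram (edge labels are latencies; 6+ means any latency of 6 or more):

11100 --1*--> 11110
11100 --2*--> 11111
11100 --6+--> 11100
11110 --1--> 11111
11110 --6+--> 11100
11111 --6+--> 11100

Latency 2 is permissible only in the initial state 11100, and it leads to state 11111, from which only 6+ is permissible. A constant cycle (2) therefore does not exist: initiations two cycles apart would place the first and third tasks four cycles apart, and 4 is a forbidden latency.

Latency cycles include: (6), (2, 6), (1, 6), (1, 1, 6)
Simple cycles: (6), (2, 6), (1, 6), (1, 1, 6)
Greedy cycle: (1, 1, 6)
Optimal constant latency cycle: (6)
MAL: Lower bound = 2 (maximum number of marks in any row of the reservation table)
Upper bound = 3 + 1 = 4 (number of 1s in the collision vector, plus one)
Average latency of greedy cycle (1, 1, 6) = (1+1+6) / 3 = 8/3 ≈ 2.67
MAL = 8/3 ≈ 2.67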
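Whether a constant latency cycle (L) is usable can be checked directly from the forbidden latencies: tasks initiated L cycles apart are also 2L, 3L, ... cycles apart, so none of those multiples may be forbidden. The following Python sketch applies that test; the function name `valid_constant_latencies` and the `limit` parameter are illustrative assumptions.

```python
def valid_constant_latencies(forbidden, limit):
    """Latencies L <= limit for which the constant cycle (L) never collides.

    Every multiple of L up to the largest forbidden latency must itself
    be a permissible latency.
    """
    top = max(forbidden)
    return [L for L in range(1, limit + 1)
            if not any(k in forbidden for k in range(L, top + 1, L))]


# Forbidden latencies of this question; candidate latencies 1 to 6 are tested.
print(valid_constant_latencies({3, 4, 5}, 6))
```

The smallest latency in the returned list is the optimal constant latency cycle for the given forbidden set.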

4) Consider the following pipelined processor with four stages. This pipeline has a total evaluation time of six clock cycles. All successor stages must be used after each clock cycle.

[Figure: four-stage pipeline with stages S1, S2, S3 and S4 between Input and Output.]

(a) Specify the reservation table for this pipeline with six columns and four rows.
(b) List the set of forbidden latencies between task initiations.
(c) Draw the state diagram which shows all possible latency cycles.
(d) List all greedy cycles from the state diagram.
(e) What is the value of minimal average latency (MAL)? (10 Marks)

Sol:

Reservation table:

        1   2   3   4   5   6
S1
S2
S3
S4

(The surviving marks, in the order they appear, are X X, X X X, X X and X; their column positions are not legible in this copy.)

Forbidden latencies: 2 and 4
Permissible latencies: 1, 3 and 5+ (any latency of 5 or more)
Collision vector: C4 C3 C2 C1 = 1010

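CASE 1: latency 1

Present state        1010
Collision vector     1010
PS with 1 shift    + 0101
Next state           1111

CASE 2: latency 3

Present state        1010
Collision vector     1010
PS with 3 shifts   + 0001
Next state           1011

Present state        1011
Collision vector     1010
PS with 3 shifts   + 0001
Next state           1011

CASE 3: latency 5

From any state (1010, 1011 or 1111), 5 shifts clear the present state, so the next state is the collision vector 1010.

State transition diagram (edge labels are latencies; 5+ means any latency of 5 or more):

1010 --1*--> 1111
1010 --3--> 1011
1010 --5+--> 1010
1011 --3*--> 1011
1011 --5+--> 1010
1111 --5+--> 1010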

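Simple cycles: (3), (5), (3, 5), (1, 5)
Greedy cycles: (3), (1, 5)

Average latency of greedy cycle (1, 5) = (1+5) / 2 = 3
MAL: Lower bound = 3 (maximum number of marks in any row of the reservation table)
Upper bound = 2 + 1 = 3 (number of 1s in the collision vector, plus one)
MAL = 3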
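The two bounds on the MAL used in these solutions can be written as a one-line helper. This is a minimal sketch: `mal_bounds` is a hypothetical name, and the per-row mark counts are passed in by hand because the table itself is not reproduced here.

```python
def mal_bounds(marks_per_row, cv_bits):
    """Return (lower, upper) bounds on the minimal average latency.

    lower: the maximum number of marks in any row of the reservation table
    upper: the number of 1s in the collision vector, plus one
    """
    return max(marks_per_row), cv_bits.count("1") + 1


# Mark counts and collision vector as listed for this question.
print(mal_bounds([2, 3, 2, 1], "1010"))  # (3, 3)
```

When the two bounds coincide, as they do here, the MAL is fixed without enumerating any cycles.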

5) Design an arithmetic pipeline unit for fixed-point multiplication of 8-bit integers using CSAs and a CPA. (10 Marks)

Sol: An arithmetic pipeline unit for fixed-point multiplication of 8-bit integers using CSAs and a CPA:

[Figure: schematic of the multiply pipeline.]
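The arithmetic performed by such a unit can be modelled in Python: one stage generates the eight shifted partial products, a tree of carry-save adders (CSA) reduces them three rows at a time to two rows, and a final carry-propagate adder (CPA) adds the last two. This is a behavioural sketch only; the grouping of rows into CSAs is an assumption and need not match the stage layout of the intended schematic, and the names `csa` and `multiply` are hypothetical.

```python
def csa(a, b, c):
    """Carry-save adder: three operands in, sum word and shifted carry word out."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def multiply(x, y, bits=8):
    # Stage 1: generate the shifted partial products (multiplicand AND multiplier bit).
    rows = [(x << i) if (y >> i) & 1 else 0 for i in range(bits)]

    # CSA stages: every group of three rows becomes two; leftover rows pass through.
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3
        reduced = []
        for i in range(0, full, 3):
            reduced.extend(csa(*rows[i:i + 3]))
        rows = reduced + rows[full:]

    # Final stage: the CPA adds the remaining sum and carry words.
    return sum(rows)


print(multiply(0b11111111, 0b11111111))  # 65025
```

Each CSA preserves the total of its three inputs, so the final CPA output equals the product regardless of how the rows are grouped.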

PART-3

Answer any Two full questions.

6) How is the dot product operation S = ∑_{i=1}^{n} a_i × b_i implemented without data forwarding? What are the advantages that accrue with internal data forwarding? (5+5 = 10 Marks)

Sol: The dot product operation is S = ∑_{i=1}^{n} a_i × b_i.

For example: A = (1, 2, 3, 4)

B = (4, 5, 6, 7)

A ● B = (1×4 + 2×5 + 3×6 + 4×7) = 60
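The example can be reproduced with a short multiply-accumulate loop in Python. The split into a multiply step and an add step mirrors the two functional units, but the function name `dot_product` and its variable names are purely illustrative.

```python
def dot_product(a, b):
    """Accumulate S = a_1*b_1 + ... + a_n*b_n one term at a time."""
    acc = 0
    for ai, bi in zip(a, b):
        product = ai * bi      # work done by the multiply unit
        acc = acc + product    # work done by the add unit
    return acc


print(dot_product((1, 2, 3, 4), (4, 5, 6, 7)))  # 60
```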

[Figure: implementing the dot-product operation with internal data forwarding between a multiply unit and an add unit.]

Advantages:

Without internal data forwarding between the two functional units, the three instructions must be executed sequentially in a looping structure.

With data forwarding, the output of the multiplier is fed directly into the input register R4 of the adder, and at the same time the output of the multiplier is also routed to register R3, as shown in the figure.

Therefore, internal data forwarding between the two functional units reduces the total execution time through the pipelined processor.

7) Design a binary multiply pipeline unit for two 4-bit operands. Use the minimum number of CSAs and CPAs. Show all interconnections and bus widths in the schematic diagram. Calculate the output of each CSA and CPA. (5+5 = 10 Marks)

Sol: A binary multiply unit for two 4-bit operands:

For example, two 4-bit operands:

        1111
      x 1111
    --------
        1111
       11110
      111100
     1111000
    --------
    11100001

CSA1:   001111
        011110
        111100
    S = 101101
    C = 111100

CSA2:   0101101
        0111100
        1111000
    S = 1101001
    C = 1111000

CPA:    1101001 + 1111000
    S = 11100001
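The CSA and CPA outputs above can be checked with a few lines of Python that follow the same two-CSA, one-CPA arrangement. The names `csa` and `multiply4` are hypothetical, and the carry word is returned already shifted one place to the left, as in the worked figures.

```python
def csa(a, b, c):
    """Carry-save adder: bitwise sum word and carry word shifted left by one."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def multiply4(x, y):
    """Return the intermediate words of the 4-bit multiply as binary strings."""
    p = [(x << i) if (y >> i) & 1 else 0 for i in range(4)]  # partial products
    s1, c1 = csa(p[0], p[1], p[2])   # CSA1 takes the first three partial products
    s2, c2 = csa(s1, c1, p[3])       # CSA2 takes S, C and the fourth partial product
    product = s2 + c2                # CPA adds the final sum and carry words
    return {name: format(value, "b") for name, value in
            [("S1", s1), ("C1", c1), ("S2", s2), ("C2", c2), ("P", product)]}


print(multiply4(0b1111, 0b1111))
```

For 1111 x 1111 the sketch gives S1 = 101101, C1 = 111100, S2 = 1101001, C2 = 1111000 and P = 11100001.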

8) Describe dynamic instruction scheduling achieved in Tomasulo's register-tagging scheme built in the IBM 360/91 processor. (10 Marks)

Sol: Dynamic instruction scheduling achieved in Tomasulo's register-tagging scheme built in the IBM 360/91 processor:

This hardware dependence resolution scheme was implemented with the multiple floating-point units of the IBM 360/91 processor. In the Model 91, three RSs are used in the floating-point adder and two pairs in the floating-point multiplier.

The scheme resolves resource conflicts as well as data dependences, using register tagging to allocate or deallocate the source and destination registers.

An issued instruction whose operands are not available is forwarded to an RS associated with the functional unit it will use.

It waits until its data dependences have been resolved and its operands become available.

The dependence is resolved by monitoring the result bus.

When all operands for an instruction are available, it is dispatched to the functional unit for execution.

All working registers are tagged.

If a source register is busy when an instruction reaches the issue stage, the tag for the source register is forwarded to an RS.

When the register becomes available, the tag can signal the availability.

Total execution time is 13 cycles, from cycle 4 to cycle 16.
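The tag mechanism described above can be illustrated with a toy Python model. It is a loose sketch and not the Model 91 hardware: the class `ReservationStation`, the function `run`, the tag names and the one-result-per-step policy are all assumptions made for the example.

```python
import operator


class ReservationStation:
    """One RS entry: an operation whose operands may still be pending."""

    def __init__(self, tag, op, operands):
        self.tag = tag                  # unique tag identifying this station
        self.op = op                    # function applied once operands are present
        self.operands = list(operands)  # each item: ("value", v) or ("tag", t)

    def ready(self):
        return all(kind == "value" for kind, _ in self.operands)

    def snoop(self, tag, value):
        """Monitor the result bus: capture value for operands waiting on tag."""
        self.operands = [("value", value) if item == ("tag", tag) else item
                         for item in self.operands]

    def execute(self):
        return self.op(*(v for _, v in self.operands))


def run(stations):
    """Dispatch one ready station per step and broadcast its result by tag.

    Raises StopIteration if no station is ready (an unresolved dependence).
    """
    results, waiting = {}, list(stations)
    while waiting:
        station = next(s for s in waiting if s.ready())
        waiting.remove(station)
        results[station.tag] = station.execute()
        for other in waiting:
            other.snoop(station.tag, results[station.tag])
    return results


stations = [
    ReservationStation("ADD1", operator.add, [("tag", "MUL1"), ("value", 4)]),
    ReservationStation("MUL1", operator.mul, [("value", 2), ("value", 3)]),
]
print(run(stations))  # {'MUL1': 6, 'ADD1': 10}
```

The add is issued first but waits in its station, holding the tag MUL1 in place of the missing operand; it is dispatched only after the multiply result appears on the modelled result bus.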
