Important Answers

8/6/2019 Important Answers


Powered By www.technoscriptz.com

Advanced computer Architecture

1. Define Instruction level parallelism.

All processors use pipelining to overlap the execution of instructions and improve performance. This potential overlap among instructions is called instruction-level parallelism (ILP), since the instructions can be evaluated in parallel.

2. Give the equation for pipeline CPI? How is it related to ideal pipeline CPI? Explain the terms

involved in it.

The value of the CPI (cycles per instruction) for a pipelined processor is the sum of the base CPI and all contributions from stalls:

Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

Terms involved in the above expression:

Ideal pipeline CPI:
The ideal pipeline CPI is a measure of the maximum performance attainable by the implementation.

Structural stalls:
Structural stalls are delays that occur because of issue restrictions arising from conflicts for functional units. A structural hazard occurs when a part of the processor's hardware is needed by two or more instructions at the same time. A canonical example is a single memory unit that is accessed both in the fetch stage, where an instruction is retrieved from memory, and in the memory stage, where data is written to and/or read from memory.

Data hazard stalls:
Data hazard stalls are delays in issuing instructions in the pipeline due to data dependences such as true data dependence, antidependence and output dependence.

Control stalls:
Branching hazards (also known as control hazards) occur with branches. On many instruction pipeline microarchitectures, the processor will not know the outcome of the branch when it needs to insert a new instruction into the pipeline (normally at the fetch stage).
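The CPI equation above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name and the stall values are hypothetical and chosen only for illustration, and the IPC line uses the standard relation IPC = 1 / CPI.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Pipeline CPI = ideal CPI + average stall cycles per instruction."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls


# Hypothetical stall contributions, in cycles per instruction.
cpi = pipeline_cpi(
    ideal_cpi=1.0,
    structural_stalls=0.125,
    data_hazard_stalls=0.25,
    control_stalls=0.125,
)
ipc = 1 / cpi  # instructions per clock is the reciprocal of CPI

print(f"Pipeline CPI = {cpi}, IPC = {ipc:.3f}")
```

Reducing any one stall term lowers the CPI by the same amount, which is why techniques are often grouped by the term they attack.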

3. Define Branch target buffer.

A branch-prediction cache that stores the predicted address for the next instruction after a branch is called a branch-target buffer or branch-target cache.

To reduce the branch penalty for a simple five-stage pipeline, as well as for deeper pipelines, the processor must know whether the as-yet-undecoded instruction is a branch and, if so, what the next PC should be. If the instruction is a branch and the next PC is known, a branch penalty of zero can be obtained. The branch-target buffer serves this purpose by storing the predicted target PC alongside the PC of the branch instruction.




4. Define data dependences. Consider the following MIPS assembly code.

LD R1, 45(R2)
DADD R7, R1, R5
DSUB R8, R1, R6
OR R9, R5, R1
BNEZ R7, target
DADD R10, R8, R5
XOR R2, R3, R4

(a) Identify each dependence by type and list the two instructions involved. Identify which instruction is dependent and, if there is one, name the storage location involved.

Number the given instructions so that the dependences are easy to write down:

1. LD R1, 45(R2)
2. DADD R7, R1, R5
3. DSUB R8, R1, R6
4. OR R9, R5, R1
5. BNEZ R7, target
6. DADD R10, R8, R5
7. XOR R2, R3, R4

Note: 1,2 (R1) denotes a dependence between instructions 1 and 2 through register R1; the second instruction listed is the dependent one.

True dependence (RAW): 1,2 (R1); 1,3 (R1); 1,4 (R1); 2,5 (R7); 3,6 (R8)

Antidependence (WAR): 1,7 (R2)

Output dependence (WAW): none
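The dependence list above can be cross-checked mechanically. The sketch below is only an illustration: the tuple encoding of each instruction as (destination, sources) and the function name are assumptions made here, and a pair is reported only when no instruction in between rewrites the register involved.

```python
def find_dependences(instrs):
    """Return (raw, war, waw) lists of (i, j, register), numbered from 1."""
    raw, war, waw = [], [], []
    for i, (dest_i, srcs_i) in enumerate(instrs):
        for j in range(i + 1, len(instrs)):
            dest_j, srcs_j = instrs[j]
            # Destinations written strictly between instructions i and j.
            between = [instrs[k][0] for k in range(i + 1, j)]
            if dest_i is not None and dest_i in srcs_j and dest_i not in between:
                raw.append((i + 1, j + 1, dest_i))   # j reads what i wrote
            if dest_j is not None and dest_j in srcs_i and dest_j not in between:
                war.append((i + 1, j + 1, dest_j))   # j overwrites what i read
            if dest_i is not None and dest_i == dest_j and dest_i not in between:
                waw.append((i + 1, j + 1, dest_i))   # both write the same name
    return raw, war, waw


# (destination, sources) for the seven instructions; BNEZ writes no register.
code = [
    ("R1", ["R2"]),         # 1. LD   R1, 45(R2)
    ("R7", ["R1", "R5"]),   # 2. DADD R7, R1, R5
    ("R8", ["R1", "R6"]),   # 3. DSUB R8, R1, R6
    ("R9", ["R5", "R1"]),   # 4. OR   R9, R5, R1
    (None, ["R7"]),         # 5. BNEZ R7, target
    ("R10", ["R8", "R5"]),  # 6. DADD R10, R8, R5
    ("R2", ["R3", "R4"]),   # 7. XOR  R2, R3, R4
]
raw, war, waw = find_dependences(code)
print("RAW:", raw)
print("WAR:", war)
print("WAW:", waw)
```

The sketch looks only at register names; memory dependences and control dependences on the branch are outside its scope.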

5. Give the basic structure (hardware diagram) of a MIPS floating point unit using Tomasulo's

algorithm.




Advanced computer Architecture Paper B

1. Define register renaming?

Register renaming is the method of changing the name of a storage location (a register number or memory location) used by an instruction so that the instructions do not conflict. It is mainly used to eliminate name dependences, which are not true data dependences and can therefore be removed, since no actual value is transmitted between the instructions. After register renaming, instructions involved in a name dependence can be executed simultaneously or be reordered.
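A minimal sketch of the idea, assuming a simple scheme in which every destination receives a fresh name and every source reads the most recent mapping. The names P0, P1, ... and the three-instruction program are hypothetical; real hardware draws names from a finite pool of physical registers or reservation stations.

```python
def rename_registers(instrs):
    """Give every destination a fresh name; sources read the latest mapping."""
    mapping = {}   # architectural register -> current renamed register
    renamed = []
    for n, (op, dest, srcs) in enumerate(instrs):
        new_srcs = [mapping.get(s, s) for s in srcs]  # true dependences survive
        new_dest = f"P{n}"                            # hypothetical fresh name
        mapping[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


program = [
    ("DSUBU", "R4", ["R5", "R6"]),
    ("DADDU", "R5", ["R4", "R9"]),  # WAR with instruction 1 on R5
    ("DADDU", "R4", ["R6", "R9"]),  # WAW with instruction 1 on R4
]
renamed = rename_registers(program)
for op, dest, srcs in renamed:
    print(op, dest, ", ".join(srcs))
```

After renaming, no two instructions write the same name, so only the true dependence (the second instruction reading the result of the first) still constrains the order.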

2. Give the state diagram for a 2-bit branch predictor.

3. What is a tournament predictor? Give an example with a diagram.

Tournament predictors use multiple predictors, usually one based on global information and one based on local information, and combine them with a selector. Tournament predictors can achieve better accuracy at medium sizes (8K-32K bits) and can also make effective use of very large numbers of prediction bits.

Tournament predictors use several levels of branch-prediction tables together with an algorithm for choosing among the multiple predictors.

Existing tournament predictors use a 2-bit saturating counter per branch to choose between two different predictors based on which predictor (local, global, or even some mix) was most effective in recent predictions. As in a simple 2-bit predictor, the saturating counter requires two mispredictions before changing the identity of the preferred predictor.

The advantage of a tournament predictor is its ability to select the right predictor for a particular branch, which is particularly crucial for the integer benchmarks.

1. Define ILP. Write about the challenges faced while implementing ILP. Show an example for each type of dependence and hazard (write in points, with examples).

ILP stands for instruction-level parallelism. All processors since about 1985 use pipelining to overlap the execution of instructions and improve performance. This potential overlap among instructions is called instruction-level parallelism (ILP), since the instructions can be evaluated in parallel.

The techniques used to increase the amount of parallelism aim (1) to reduce the impact of data and control hazards and (2) to increase the processor's ability to exploit parallelism. There are two largely separable approaches to exploiting ILP: an approach that relies on hardware to help discover and exploit the parallelism dynamically, and an approach that relies on software technology to find parallelism statically at compile time.

The value of the CPI (cycles per instruction) for a pipelined processor is the sum of the base CPI and all contributions from stalls:

Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

The ideal pipeline CPI is a measure of the maximum performance attainable by the implementation. By reducing each of the terms on the right-hand side, we minimize the overall pipeline CPI or, alternatively, increase the IPC (instructions per clock). The equation above allows us to characterize various techniques by what component of the overall CPI a technique reduces.

Data Dependence and Hazards:

To exploit instruction-level parallelism, we must determine which instructions can be executed in parallel. If two instructions are parallel, they can execute simultaneously in a pipeline without causing any stalls. If two instructions are dependent, they are not parallel and must be executed in order.

There are three different types of dependences: data dependences (also called true data dependences), name dependences, and control dependences.

Data Dependences:

An instruction j is data dependent on instruction i if either of the following holds:

Instruction i produces a result that may be used by instruction j, or

Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.




The second condition simply states that one instruction is dependent on another if there exists a chain of dependences of the first type between the two instructions. This dependence chain can be as long as the entire program.

For example, consider the following code sequence that increments a vector of values in memory (starting at 0(R1) and with the last element at 8(R2)) by a scalar in register F2:

Loop: L.D F0,0(R1) ; F0 = array element
      ADD.D F4,F0,F2 ; add scalar in F2
      S.D F4,0(R1) ; store result
      DADDUI R1,R1,#-8 ; decrement pointer by 8 bytes
      BNE R1,R2,Loop ; branch if R1 != R2

The data dependences in this code sequence involve both floating-point data:

L.D F0,0(R1) -> ADD.D F4,F0,F2 -> S.D F4,0(R1)

and integer data:

DADDUI R1,R1,#-8 -> BNE R1,R2,Loop

In both of the above dependent sequences, as shown by the arrows, each instruction depends on the previous one. The arrows here and in the following examples show the order that must be preserved for correct execution. Each arrow points from an instruction that must precede the instruction that the arrowhead points to.

If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped. The dependence implies that there would be a chain of one or more data hazards between the two instructions. Executing the instructions simultaneously will cause a processor with pipeline interlocks to detect a hazard and stall, thereby reducing or eliminating the overlap.

Dependences are a property of programs. Whether a given dependence results in an actual hazard being detected and whether that hazard actually causes a stall are properties of the pipeline organization. This difference is critical to understanding how instruction-level parallelism can be exploited.

The presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline. A data dependence is important because it (1) indicates the possibility of a hazard, (2) determines the order in which results must be calculated, and (3) sets an upper bound on how much parallelism can possibly be exploited.
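To illustrate that the stall length is a property of the pipeline rather than of the program, the sketch below computes the stall for one RAW dependence under two organizations. It assumes the classic five-stage pipeline (IF, ID, EX, MEM, WB) with a register file that is written in the first half of a cycle and read in the second half; the function name and parameters are hypothetical.

```python
def raw_stall_cycles(producer_is_load, distance, forwarding):
    """Stall cycles seen by a consumer `distance` instructions after its producer.

    distance = 1 means the consumer immediately follows the producer.
    """
    if forwarding:
        # With forwarding, an ALU result reaches the next instruction in time;
        # a loaded value is only available after the MEM stage.
        needed_distance = 2 if producer_is_load else 1
    else:
        # Without forwarding, the consumer's ID must coincide with the
        # producer's WB, which requires a distance of at least 3.
        needed_distance = 3
    return max(0, needed_distance - distance)


# The same dependence (consumer right after producer) on different pipelines:
print(raw_stall_cycles(producer_is_load=False, distance=1, forwarding=True))   # 0
print(raw_stall_cycles(producer_is_load=True, distance=1, forwarding=True))    # 1
print(raw_stall_cycles(producer_is_load=False, distance=1, forwarding=False))  # 2
```

The dependence is identical in all three calls; only the assumed pipeline organization changes the number of stall cycles.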

Name Dependences:

A name dependence occurs when two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name.

There are two types of name dependences between an instruction i that precedes instruction j in program order:

An antidependence between instruction i and instruction j occurs when instruction j writes a register or memory location that instruction i reads. The original ordering must be preserved to ensure that i reads the correct value.

An output dependence occurs when instruction i and instruction j write the same register or memory location. The ordering between the instructions must be preserved to ensure that the value finally written corresponds to instruction j.

Example of an antidependence (WAR on R5):

DSUBU R4,R5,R6
DADDU R5,R4,R9

Example of an output dependence (WAW on R4):

DSUBU R4,R5,R6
DADDU R4,R6,R9

Both antidependences and output dependences are name dependences, as opposed to true data dependences, since there is no value being transmitted between the instructions. Since a name dependence is not a true dependence, instructions involved in a name dependence can execute simultaneously or be reordered if the name (register number or memory location) used in the instructions is changed so the instructions do not conflict.

This renaming can be done more easily for register operands, where it is called register renaming. Register renaming can be done either statically by a compiler or dynamically by the hardware.

Control Dependences:

A control dependence determines the ordering of an instruction, i, with respect to a branch instruction so that the instruction i is executed in correct program order. Every instruction, except for those in the first basic block of the program, is control dependent on some set of branches, and, in general, these control dependences must be preserved to preserve program order. One of the simplest examples of a control dependence is the dependence of the statements in the then part of an if statement on the branch. For example, in the code segment:

if p1 {
    S1;
};
if p2 {
    S2;
}

S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1. In general, there are two constraints imposed by control dependences:

1. An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch. For example, we cannot take an instruction from the then-portion of an if-statement and move it before the if-statement.

2. An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch. For example, we cannot take a statement before the if-statement and move it into the then-portion.

Control dependence is preserved by two properties in a simple pipeline. First, instructions execute in program order. This ordering ensures that an instruction that occurs before a branch is executed before the branch. Second, the detection of control or branch hazards ensures that an instruction that is control dependent on a branch is not executed until the branch direction is known.

Example:

Branching to loop L1 from instruction 2 is dependent on the computation of R2 value in

DADDU.

2. For the given example of code,

    DSUBUI R3,R1,#2
    BNEZ R3,L1 ; branch b1
L1: DSUBUI R3,R2,#2
    BNEZ R3,L2 ; branch b2
    DADD R2,R0,R0
L2: DSUBU R3,R1,R2
    BEQZ R3,L3

take a 2-bit predictor (a 1-bit predictor with 1 bit of correlation) for an input sequence of 2, 0, 2, 0, with initial predictions NT/NT for both branches b1 and b2.




Two prediction bits are used. Table 1 shows the relation between the prediction bits and the prediction made if the last branch was not taken and if the last branch was taken.

Table 1.

These are the sequences of predictions and actions for the input sequence 2, 0, 2, 0 with a 2-bit predictor and NT/NT initial settings.
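The mechanism of a 1-bit predictor with 1 bit of correlation can be sketched as below. Each branch keeps two 1-bit predictions, one used when the last executed branch was not taken and one used when it was taken. The branch outcome trace is a hypothetical one chosen for illustration (both branches taken for input 2 and not taken for input 0); it is not derived from the code listing above, and the initial "last branch" state of not taken is likewise an assumption.

```python
def simulate_correlating(trace, branches=("b1", "b2")):
    """Simulate a (1,1) predictor; return a log of (branch, predicted, actual)."""
    # pred[b][h] is the 1-bit prediction for branch b when the last branch
    # outcome was h. False = not taken, so this is the NT/NT initial state.
    pred = {b: {False: False, True: False} for b in branches}
    last_taken = False  # assumed initial global history
    log = []
    for name, taken in trace:
        guess = pred[name][last_taken]
        log.append((name, guess, taken))
        pred[name][last_taken] = taken  # 1-bit entry remembers the latest outcome
        last_taken = taken
    return log


# Hypothetical outcomes for inputs 2, 0, 2, 0 (True = taken).
trace = []
for value in (2, 0, 2, 0):
    taken = value != 0
    trace += [("b1", taken), ("b2", taken)]

log = simulate_correlating(trace)
mispredictions = sum(1 for _, guess, actual in log if guess != actual)
for name, guess, actual in log:
    print(name, "predicted", "T" if guess else "NT", "actual", "T" if actual else "NT")
print("mispredictions:", mispredictions)
```

Under these assumed outcomes only the first prediction for each branch is wrong; afterwards the entry selected by the previous branch outcome tracks the alternating pattern.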

Advanced computer Architecture Paper C

1. Define the term register renaming.

Register renaming is the method of changing the name of a storage location (a register number or memory location) used by an instruction so that the instructions do not conflict. It is mainly used to eliminate name dependences, which are not true data dependences and can therefore be removed, since no actual value is transmitted between the instructions. After register renaming, instructions involved in a name dependence can be executed simultaneously or be reordered.

2. What are the properties critical for program correctness?

The two properties critical to program correctness, and normally preserved by maintaining both data and control dependences, are the exception behavior and the data flow.

Preserving the exception behavior means that any changes in the ordering of instruction execution must not change how exceptions are raised in the program. Often this is relaxed to mean that the reordering of instruction execution must not cause any new exceptions in the program. A simple example shows how maintaining the control and data dependences can prevent such situations. Consider this code sequence:

    DADDU R2,R3,R4
    BEQZ R2,L1
    LW R1,0(R2)
L1:

In this case, it is easy to see that if we do not maintain the data dependence involving R2, we can change the result of the program. Less obvious is the fact that if we ignore the control dependence and move the load instruction before the branch, the load instruction may cause a memory protection exception.

The second property preserved by maintenance of data dependences and control dependences is the data flow. The data flow is the actual flow of data values among instructions that produce results and those that consume them. Branches make the data flow dynamic, since they allow the source of data for a given instruction to come from many points. Put another way, it is insufficient to just maintain data dependences because an instruction may be data dependent on more than one predecessor. Program order is what determines which predecessor will actually deliver a data value to an instruction. Program order is ensured by maintaining the control dependences.

(If this is asked as a 16-mark question, it has to be elaborated with an example.)

3. Explain the fields / components in the reservation station.

Each reservation station has seven fields:

Op: The operation to perform on source operands S1 and S2.

Qj, Qk: The reservation stations that will produce the corresponding source operand; a value of zero indicates that the source operand is already available in Vj or Vk, or is unnecessary. (The IBM 360/91 calls these SINKunit and SOURCEunit.)

Vj, Vk: The values of the source operands. Note that only one of the V field or the Q field is valid for each operand. For loads, the Vk field is used to hold the offset field. (These fields are called SINK and SOURCE on the IBM 360/91.)

A: Used to hold information for the memory address calculation for a load or store. Initially, the immediate field of the instruction is stored here; after the address calculation, the effective address is stored here.

Busy: Indicates that this reservation station and its accompanying functional unit are occupied.

The register file has a field, Qi:

Qi: The number of the reservation station that contains the operation whose result should be stored into this register. If the value of Qi is blank (or 0), no currently active instruction is computing a result destined for this register, meaning that the value is simply the register contents.
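The fields listed above map naturally onto a small record type. The following is a sketch under stated assumptions: station numbers start at 1 so that 0 can mean "operand available", and the `ready` and `broadcast` helpers are hypothetical names used here to show how the Q and V fields interact when a result is delivered to waiting stations and registers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of first source operand
    vk: Optional[float] = None   # value of second source operand
    qj: int = 0                  # station producing vj; 0 = already available
    qk: int = 0                  # station producing vk; 0 = already available
    a: Optional[int] = None      # immediate, later the effective address

    def ready(self) -> bool:
        """An occupied station can execute once both operands are available."""
        return self.busy and self.qj == 0 and self.qk == 0


def broadcast(stations, reg_qi, reg_values, tag, value):
    """Deliver the result of station `tag` to every waiting station and register."""
    for rs in stations:
        if rs.busy and rs.qj == tag:
            rs.vj, rs.qj = value, 0
        if rs.busy and rs.qk == tag:
            rs.vk, rs.qk = value, 0
    for reg, qi in reg_qi.items():
        if qi == tag:
            reg_values[reg] = value
            reg_qi[reg] = 0  # register now simply holds its contents


# Station 1 is a load producing F0; station 2 is an add waiting on it.
load = ReservationStation(busy=True, op="L.D", a=0)
add = ReservationStation(busy=True, op="ADD.D", qj=1, vk=2.5)
stations = [load, add]
reg_qi = {"F0": 1, "F4": 2}        # Qi field per register
reg_values = {"F0": 0.0, "F4": 0.0}

waiting_before = add.ready()
broadcast(stations, reg_qi, reg_values, tag=1, value=10.0)
print(waiting_before, add.ready(), add.vj, reg_values["F0"])
```

Only one of the V and Q fields is meaningful for each operand at any time, which the sketch reflects by clearing Q when V is filled.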




4. Show by an example how loop unrolling will provide fewer clock cycles per instruction.

The loop is unrolled so that there are four copies of the loop body, assuming R1 - R2 (that is, the size of the array) is initially a multiple of 32, which means that the number of loop iterations is a multiple of 4.




During unrolling, the compiler will rearrange expressions so as to allow constants to be collapsed, allowing an expression such as "((i + 1) + 1)" to be rewritten as "(i + (1 + 1))" and then simplified to "(i + 2)."

Without scheduling, every operation in the unrolled loop is followed by a dependent operation and thus will cause a stall. This loop will run in 27 clock cycles (each LD has 1 stall, each ADDD 2, the DADDUI 1, plus 14 instruction issue cycles), or 6.75 clock cycles for each of the four elements, but it can be scheduled to improve performance significantly. Loop unrolling is normally done early in the compilation process, so that redundant computations can be exposed and eliminated by the optimizer.




Comparison and conclusion:

For the given example,

The basic (unscheduled) loop takes 9 clock cycles per iteration.

Scheduling reduces this to 7 clock cycles per iteration.

Unrolling the loop four times without scheduling takes 27 clock cycles for four elements, or 6.75 clock cycles per element.

The scheduled unrolled loop finally takes 3.5 clock cycles per element.
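The unrolled-loop figures quoted above can be reproduced with a short calculation. This sketch assumes that each of the four copies of the loop body contains one LD and one ADDD, and that the scheduled unrolled loop has no remaining stalls, so that it takes only the 14 issue cycles; the second assumption is inferred from the 3.5-cycle figure rather than stated in the text.

```python
ISSUE_CYCLES = 14   # instructions issued per pass of the unrolled loop
ELEMENTS = 4        # array elements handled per pass (four loop-body copies)

# (number of instructions, stall cycles after each) in the unscheduled loop
stalls = {
    "LD": (4, 1),
    "ADDD": (4, 2),
    "DADDUI": (1, 1),
}
stall_cycles = sum(count * each for count, each in stalls.values())

unrolled_cycles = ISSUE_CYCLES + stall_cycles
unrolled_per_element = unrolled_cycles / ELEMENTS

# Assumption: scheduling removes every stall from the unrolled loop.
scheduled_per_element = ISSUE_CYCLES / ELEMENTS

print(unrolled_cycles, unrolled_per_element, scheduled_per_element)
```

Compared with 9 and 7 cycles per iteration for the original loop, the gain comes from amortizing the loop overhead (DADDUI and BNE) over four elements and from giving the scheduler more independent instructions to fill stall slots.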

5. Draw the hardware architecture for a branch-target buffer. Write the steps involved in handling an instruction with a BTB using a flow chart.




To reduce the branch penalty for our simple five-stage pipeline, as well as for deeper pipelines, we must know whether the as-yet-undecoded instruction is a branch and, if so, what the next PC should be. If the instruction is a branch and we know what the next PC should be, we can have a branch penalty of zero. A branch-prediction cache that stores the predicted address for the next instruction after a branch is called a branch-target buffer or branch-target cache. The figure above shows a branch-target buffer.

Because a branch-target buffer predicts the next instruction address and will send it out before decoding the instruction, we must know whether the fetched instruction is predicted as a taken branch. If the PC of the fetched instruction matches a PC in the prediction buffer, then the corresponding predicted PC is used as the next PC. The hardware for this branch-target buffer is essentially identical to the hardware for a cache. If a matching entry is found in the branch-target buffer, fetching begins immediately at the predicted PC.

Note that unlike a branch-prediction buffer, the predictive entry must be matched to this instruction because the predicted PC will be sent out before it is known whether this instruction is even a branch. If the processor did not check whether the entry matched this PC, then the wrong PC would be sent out for instructions that were not branches, resulting in a slower processor. We only need to store the predicted-taken branches in the branch-target buffer, since an untaken branch should simply fetch the next sequential instruction, as if it were not a branch.
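The lookup described above can be sketched with a dictionary keyed by the fetch PC. This is a simplified model: the class and method names are hypothetical, the buffer is unbounded (a real one has a fixed number of entries, like a cache), instructions are assumed to be 4 bytes, and removing the entry when a stored branch turns out not to be taken is an assumption consistent with storing only predicted-taken branches.

```python
class BranchTargetBuffer:
    """Maps the PC of a predicted-taken branch to its predicted target PC."""

    def __init__(self, instr_size=4):
        self.entries = {}             # branch PC -> predicted target PC
        self.instr_size = instr_size  # assumed fixed instruction size in bytes

    def predict_next_pc(self, pc):
        # Hit: fetch from the stored target. Miss: treat the instruction as a
        # non-branch (or untaken branch) and fetch the next sequential one.
        return self.entries.get(pc, pc + self.instr_size)

    def update(self, pc, taken, target):
        # Called once the branch outcome is known.
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)


btb = BranchTargetBuffer()
first = btb.predict_next_pc(0x100)    # miss: falls through to 0x104
btb.update(0x100, taken=True, target=0x200)
second = btb.predict_next_pc(0x100)   # hit: predicted target 0x200
other = btb.predict_next_pc(0x104)    # a non-branch PC never matches
btb.update(0x100, taken=False, target=0x200)
third = btb.predict_next_pc(0x100)    # entry removed: sequential again
print(hex(first), hex(second), hex(other), hex(third))
```

Matching on the full PC is what prevents a target from being sent out for an instruction that is not a branch.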




The figure shows the detailed steps when using a branch-target buffer for a simple five-stage pipeline. From this we can see that there will be no branch delay if a branch-prediction entry is found in the buffer and the prediction is correct. Otherwise, there will be a penalty of at least 2 clock cycles. Dealing with the mispredictions and misses is a significant challenge, since we typically will have to halt instruction fetch while we rewrite the buffer entry. Thus, we would like to make this process fast to minimize the penalty.
