8/6/2019 Important Answers
1/14
Powered By www.technoscriptz.com
Advanced computer Architecture
1. Define Instruction level parallelism.
All processors use pipelining to overlap the execution of instructions and improve performance. This potential overlap among instructions is called instruction-level
parallelism (ILP), since the instructions can be evaluated in parallel.
2. Give the equation for pipeline CPI? How is it related to ideal pipeline CPI? Explain the terms
involved in it.
The value of the CPI (cycles per instruction) for a pipelined processor is the sum of the
base CPI and all contributions from stalls:
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
Terms involved in the above expression:
The ideal pipeline CPI is a measure of the maximum performance attainable by the implementation.
Structural stalls:
Structural stalls are delays that occur because of issue restrictions arising from conflicts for functional units. A structural hazard occurs when a part of the processor's hardware is needed by two or more instructions at the same time. A canonical example is a single memory unit that is accessed both in the fetch stage, where an instruction is retrieved from memory, and in the memory stage, where data is written to and/or read from memory.
Data hazard stalls:
Data hazard stalls are delays in issuing instructions in the pipeline due to data dependences such as true data dependence, antidependence and output dependence.
Control stalls:
Branching hazards (also known as control hazards) occur with branches. On many instruction pipeline microarchitectures, the processor will not know the outcome of the branch when it needs to insert a new instruction into the pipeline (normally at the fetch stage).
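As a quick numerical check of the CPI equation, the sketch below adds up the four terms. The stall values are made-up numbers chosen only for illustration, and the function name pipeline_cpi is an assumption, not something from the notes.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Pipeline CPI = ideal CPI + average stall cycles per instruction from each source."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls

# Hypothetical stall contributions, in cycles per instruction.
cpi = pipeline_cpi(ideal_cpi=1.0, structural_stalls=0.25,
                   data_hazard_stalls=0.5, control_stalls=0.25)
ipc = 1.0 / cpi  # instructions per clock is the reciprocal of CPI
print(cpi, ipc)
```

Reducing any one of the stall terms lowers the CPI and raises the IPC by the same reasoning.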
3. Define Branch target buffer.
A branch-prediction cache that stores the predicted address for the next instruction after a branch is called a branch-target buffer or branch-target cache.
To reduce the branch penalty for a simple five-stage pipeline, as well as for deeper pipelines, the processor has to know whether the as-yet-undecoded instruction is a branch and, if so, what the next PC should be. If the instruction is a branch and the next PC is known, a branch penalty of zero is obtained. The branch-target buffer serves this purpose: it stores the predicted target PC for each branch instruction PC.
4. Define data dependences. Consider the following MIPS assembly code.
LD R1, 45 (R2)
DADD R7, R1,R5
DSUB R8,R1, R6
OR R9,R5,R1
BNEZ R7, target
DADD R10, R8,R5
XOR R2,R3,R4
(a) identify each dependence by type; list the two instructions involved. Identify which
instruction is dependent; and if there is one, name the storage location involved?
Number the given instructions so that the dependences are easier to write:
1. LD R1, 45 (R2)
2. DADD R7, R1,R5
3. DSUB R8,R1, R6
4. OR R9,R5,R1
5. BNEZ R7, target
6. DADD R10, R8,R5
7. XOR R2,R3,R4
Note: "1,2 (R1)" represents a dependence between instructions 1 and 2 through register R1; the second instruction of each pair is the dependent one.
True dependence (RAW): 1,2 (R1); 1,3 (R1); 1,4 (R1); 2,5 (R7); 3,6 (R8)
Antidependence (WAR): 1,7 (R2)
Output dependence (WAW): none
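The classification above can be cross-checked mechanically. The following is a minimal sketch that compares every ordered pair of instructions; the tuple encoding of the program and the function name find_dependences are assumptions made for this example. The simple pairwise test is adequate here only because no register is written more than once in the sequence.

```python
# Each instruction: (destination register or None, list of source registers).
program = [
    ("R1",  ["R2"]),        # 1. LD   R1, 45(R2)
    ("R7",  ["R1", "R5"]),  # 2. DADD R7, R1, R5
    ("R8",  ["R1", "R6"]),  # 3. DSUB R8, R1, R6
    ("R9",  ["R5", "R1"]),  # 4. OR   R9, R5, R1
    (None,  ["R7"]),        # 5. BNEZ R7, target
    ("R10", ["R8", "R5"]),  # 6. DADD R10, R8, R5
    ("R2",  ["R3", "R4"]),  # 7. XOR  R2, R3, R4
]

def find_dependences(program):
    """Return (RAW, WAR, WAW) lists of (earlier, later, register) triples."""
    raw, war, waw = [], [], []
    for i, (dest_i, srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            dest_j, srcs_j = program[j]
            if dest_i is not None and dest_i in srcs_j:
                raw.append((i + 1, j + 1, dest_i))  # j reads what i wrote
            if dest_j is not None and dest_j in srcs_i:
                war.append((i + 1, j + 1, dest_j))  # j overwrites what i read
            if dest_i is not None and dest_i == dest_j:
                waw.append((i + 1, j + 1, dest_i))  # both write the same register
    return raw, war, waw

raw, war, waw = find_dependences(program)
print("RAW:", raw)
print("WAR:", war)
print("WAW:", waw)
```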
5. Give the basic structure (hardware diagram) of a MIPS floating point unit using Tomasulo's
algorithm.
Advanced computer Architecture Paper B
1. Define register renaming?
Register renaming is the method of changing the name (register number or memory location)
used by instructions so that the instructions do not conflict. It is mainly used
to eliminate name dependences, which are not true data dependences and can therefore be
eliminated, since no actual value is transmitted between the instructions.
After register renaming, instructions involved in a name dependence can be executed
simultaneously or be reordered.
2. Give the state diagram for a 2-bit branch predictor.
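The state diagram itself is not reproduced in these notes. As a stand-in, here is a minimal sketch of the saturating-counter form of a 2-bit predictor. The class name, the state numbering (0 to 3) and the helper function are assumptions for illustration, and some textbook diagrams use different transitions out of the two weak states.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward "strongly taken" (3) or "strongly not taken" (0).
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

def count_mispredictions(outcomes, state=3):
    """Run the predictor over a list of actual outcomes and count wrong predictions."""
    predictor = TwoBitPredictor(state)
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses

# A loop branch taken nine times and then not taken, executed twice.
loop_trace = ([True] * 9 + [False]) * 2
print(count_mispredictions(loop_trace))
```

In this trace the predictor mispredicts only the loop exit each time; the single not-taken outcome does not flip the prediction for the next entry into the loop, because two consecutive mispredictions are needed to change it.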
3. What is a tournament predictor? Give an example with a diagram.
Tournament predictors use multiple predictors, usually one based on global information and one based on local information, and combine them with a selector. Tournament
predictors can achieve better accuracy at medium sizes (8K-32K bits) and also make use of very large numbers of prediction bits effectively.
Tournament predictors use several levels of branch-prediction tables together with an
algorithm for choosing among the multiple predictors.
Existing tournament predictors use a 2-bit saturating counter per branch to choose between two different predictors based on which predictor (local, global, or even some mix) was
most effective in recent predictions. As in a simple 2-bit predictor, the saturating counter requires two mispredictions before changing the identity of the preferred predictor.
The advantage of a tournament predictor is its ability to select the right predictor for a
particular branch, which is particularly crucial for the integer benchmarks.
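The selection mechanism can be sketched as follows. This is only an illustration of the 2-bit selector idea: the class name, the encoding (low counter values prefer the local predictor) and the way component predictions are passed in as booleans are all assumptions. A real design would keep a table of such selectors indexed by branch address.

```python
class TournamentSelector:
    """2-bit saturating selector: 0 and 1 prefer the local predictor, 2 and 3 the global one."""

    def __init__(self):
        self.counter = 0

    def choose(self, local_pred, global_pred):
        return global_pred if self.counter >= 2 else local_pred

    def update(self, local_pred, global_pred, taken):
        local_ok = (local_pred == taken)
        global_ok = (global_pred == taken)
        # The counter moves only when exactly one of the two predictors was right.
        if global_ok and not local_ok:
            self.counter = min(3, self.counter + 1)
        elif local_ok and not global_ok:
            self.counter = max(0, self.counter - 1)

# Hypothetical branch: always taken, local predictor always wrong, global always right.
selector = TournamentSelector()
picks = []
for _ in range(3):
    local_pred, global_pred, taken = False, True, True
    picks.append(selector.choose(local_pred, global_pred))
    selector.update(local_pred, global_pred, taken)
print(picks)
```

Starting from a strong preference for the local predictor, the selector follows it through two mispredictions and switches to the global predictor on the third prediction.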
1. Define ILP. Write about the challenges faced while implementing ILP. Show an example
for each type of dependence and hazard (write in points / examples).
ILP stands for instruction-level parallelism. All processors since about 1985 use pipelining to overlap the execution of instructions and improve performance. This
potential overlap among instructions is called instruction-level parallelism (ILP), since the instructions can be evaluated in parallel.
The techniques used to increase the amount of parallelism (1) reduce the impact of data and control hazards and (2) increase the processor's ability to exploit
parallelism. There are two largely separable approaches to exploiting ILP: an approach that relies on hardware to help discover and exploit the parallelism dynamically, and an
approach that relies on software technology to find parallelism statically at compile time.
The value of the CPI (cycles per instruction) for a pipelined processor is the sum of the
base CPI and all contributions from stalls:
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
The ideal pipeline CPI is a measure of the maximum performance attainable by the
implementation. By reducing each of the terms on the right-hand side, we minimize the overall pipeline CPI or, alternatively, increase the IPC (instructions per clock). The
equation above allows us to characterize various techniques by what component of the overall CPI a technique reduces.
Data Dependence and Hazards:
To exploit instruction-level parallelism, we must determine which instructions can be executed in
parallel. If two instructions are parallel, they can execute simultaneously in a pipeline without causing any stalls. If two instructions are dependent, they are not parallel and
must be executed in order.
There are three different types of dependences: data dependences (also called true data
dependences), name dependences, and control dependences.
Data Dependences:
An instruction j is data dependent on instruction i if either of the following holds:
Instruction i produces a result that may be used by instruction j, or
Instruction j is data dependent on instruction k, and instruction k is data dependent on
instruction i.
The second condition simply states that one instruction is dependent on another if there
exists a chain of dependences of the first type between the two instructions. This dependence chain can be as long as the entire program.
For example, consider the following code sequence that increments a vector of values in memory (starting at 0(R1) and with the last element at 8(R2)) by a scalar in register F2:
Loop: L.D F0,0(R1) ; F0=array element
ADD.D F4,F0,F2 ; add scalar in F2
S.D F4,0(R1) ; store result
DADDUI R1,R1,#-8 ; decrement pointer 8 bytes
BNE R1,R2,Loop ; branch R1!=R2
The data dependences in this code sequence involve both floating-point data (L.D -> ADD.D through F0, ADD.D -> S.D through F4)
and integer data (DADDUI -> BNE through R1).
In both of the above dependent sequences, as shown by the arrows, each
instruction depends on the previous one. The arrows here and in the following examples show the order that must be preserved for correct execution. Each arrow points from an
instruction that must precede the instruction that the arrowhead points to.
If two instructions are data dependent, they cannot execute simultaneously or be
overlapped completely. The dependence implies that there would be a chain of one or more data hazards between the two instructions. Executing the instructions
simultaneously will cause a processor with pipeline interlocks to detect a hazard and stall, thereby reducing or eliminating the overlap. Dependences are a property of programs.
Whether a given dependence results in an actual hazard being detected and whether that hazard actually causes a stall are properties of the pipeline organization. This difference
is critical to understanding how instruction-level parallelism can be exploited.
The presence of the dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline. The importance of the data
dependences is that a dependence (1) indicates the possibility of a hazard, (2) determines the order in which results must be calculated, and (3) sets an upper bound on how much
parallelism can possibly be exploited.
Name Dependences:
A name dependence occurs when two instructions use the same register or memory
location, called a name, but there is no flow of data between the instructions associated
with that name.
There are two types of name dependences between an instruction i that precedes instruction j in program order:
An antidependence between instruction i and instruction j occurs when instruction j writes a register or memory location that instruction i reads. The original ordering
must be preserved to ensure that i reads the correct value.
An output dependence occurs when instruction i and instruction j write the same
register or memory location. The ordering between the instructions must be preserved to ensure that the value finally written corresponds to instruction j.
Example for antidependence:
DSUBU R4,R5,R6
DADDU R5,R4,R9
(WAR on R5)
Example for output dependence:
DSUBU R4,R5,R6
DADDU R4,R6,R9
(WAW on R4)
Both antidependences and output dependences are name dependences, as opposed to
true data dependences, since there is no value being transmitted between the instructions. Since a name dependence is not a true dependence, instructions involved in a name
dependence can execute simultaneously or be reordered if the name (register number or memory location) used in the instructions is changed so the instructions do not conflict.
This renaming can be more easily done for register operands, where it is called register renaming. Register renaming can be done either statically by a compiler or
dynamically by the hardware.
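To make the effect of renaming concrete, the sketch below applies a very simple renaming scheme to the two examples above. It is an illustration under stated assumptions, not a description of any particular compiler or processor: the tuple encoding, the function name rename and the fresh names T0, T1, ... are invented for this example.

```python
def rename(instructions):
    """Give every destination a fresh name (T0, T1, ...) and rewrite later reads of it."""
    mapping = {}   # architectural register -> most recent renamed register
    renamed = []
    next_id = 0
    for op, dest, src1, src2 in instructions:
        # Sources read the most recent name of each register (true dependences are kept).
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        # The destination gets a fresh name, which removes WAR and WAW conflicts.
        new_dest = f"T{next_id}"
        next_id += 1
        mapping[dest] = new_dest
        renamed.append((op, new_dest, s1, s2))
    return renamed

war_example = [("DSUBU", "R4", "R5", "R6"), ("DADDU", "R5", "R4", "R9")]
waw_example = [("DSUBU", "R4", "R5", "R6"), ("DADDU", "R4", "R6", "R9")]

print(rename(war_example))
print(rename(waw_example))
```

After renaming, the second instruction of the first example no longer writes R5, so the WAR conflict is gone, while its read of the first instruction's result (now T0) is preserved. In the second example the two instructions write different names, so the WAW conflict disappears.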
Control Dependences:
A control dependence determines the ordering of an instruction, i, with respect to
a branch instruction so that the instruction i is executed in correct program order. Every instruction, except for those in the first basic block of the program, is control dependent
on some set of branches, and, in general, these control dependences must be preserved to preserve program order. One of the simplest examples of a control dependence is the
dependence of the statements in the then part of an if statement on the branch. For example, in the code segment:
if p1 {
S1;
};
if p2 {
S2;
}
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1. In
general, there are two constraints imposed by control dependences:
1. An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch. For example, we cannot take an instruction from the then-portion of an if-statement and move it before the if-statement.
2. An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch. For example, we cannot take a statement
before the if-statement and move it into the then-portion.
Control dependence is preserved by two properties in a simple pipeline. First,
instructions execute in program order. This ordering ensures that an instruction that occurs before a branch is executed before the branch. Second, the detection of control or
branch hazards ensures that an instruction that is control dependent on a branch is not executed until the branch direction is known.
Example:
The branch to L1 in instruction 2 depends on the value of R2 computed by the
DADDU instruction.
2. For the given example of code,
DSUBUI R3, R1, #2
BNEZ R3,L1 ; branch b1
L1:DSUBUI R3,R2,#2
BNEZ R3,L2 ; branch b2
DADD R2,R0,R0
L2: DSUBU R3,R1,R2
BEQZ R3,L3
Take a 2-bit predictor (1-bit predictor with 1-bit correlator) for an input sequence of 2,0,2,0 with initial predictions NT/NT for both branches b1 and b2.
Two prediction bits are used. Table 1 shows the prediction that is used when the last branch was
not taken and when the last branch was taken.
Table 1.
These are the sequences of prediction and action if the sequence is 2,0,2,0 for a 2 bit predictor
with NT/NT initial settings.
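The tables are not reproduced here, so the following is only a minimal sketch of how a (1,1) correlating predictor behaves. The class name, the assumption that the last-branch history starts as not taken, and above all the outcome trace are assumptions: the trace simply makes both branches taken on every other iteration and is not derived from the code in the question.

```python
class CorrelatingPredictor11:
    """(1,1) predictor: per branch, two 1-bit predictions selected by the last branch outcome."""

    def __init__(self):
        self.table = {}          # branch -> [prediction if last NT, prediction if last T]
        self.last_taken = False  # outcome of the most recent branch (assumed NT at start)

    def predict(self, branch):
        bits = self.table.setdefault(branch, [False, False])  # NT/NT initial setting
        return bits[int(self.last_taken)]

    def update(self, branch, taken):
        bits = self.table.setdefault(branch, [False, False])
        bits[int(self.last_taken)] = taken  # a 1-bit entry just remembers the last outcome
        self.last_taken = taken

# Hypothetical trace: b1 and b2 are both taken in iterations 1 and 3, both not taken in 2 and 4.
trace = []
for both_taken in [True, False, True, False]:
    trace += [("b1", both_taken), ("b2", both_taken)]

predictor = CorrelatingPredictor11()
misses = 0
for branch, taken in trace:
    if predictor.predict(branch) != taken:
        misses += 1
    predictor.update(branch, taken)
print(misses)
```

For this assumed trace the only mispredictions occur in the first iteration, because afterwards the outcome of the preceding branch selects the entry that matches the alternating pattern.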
Advanced computer Architecture Paper C
1. Define the term register renaming.
Register renaming is the method of changing the name (register number or memory location)
used by instructions so that the instructions do not conflict. It is mainly used
to eliminate name dependences, which are not true data dependences and can therefore be
eliminated, since no actual value is transmitted between the instructions.
After register renaming, instructions involved in a name dependence can be executed
simultaneously or be reordered.
2. What are the properties critical for program correctness?
The two properties critical to program correctness, and normally preserved by maintaining both data and control dependence, are the exception behavior and the data
flow.
Preserving the exception behavior means that any changes in the ordering of instruction
execution must not change how exceptions are raised in the program. Often this is relaxed to mean that the reordering of instruction execution must not cause any new exceptions in
the program. A simple example shows how maintaining the control and data dependences can prevent such situations. Consider this code sequence:
DADDU R2,R3,R4
BEQZ R2, L1
LW R1,0(R2)
L1:
In this case, it is easy to see that if we do not maintain the data dependence involving R2, we can change the result of the program. Less obvious is the fact that if we ignore the
control dependence and move the load instruction before the branch, the load instruction may cause a memory protection exception.
The second property preserved by maintenance of data dependences and control dependences is the data flow. The data flow is the actual flow of data values among
instructions that produce results and those that consume them. Branches make the data flow dynamic, since they allow the source of data for a given instruction to come from
many points. Put another way, it is insufficient to just maintain data dependences because an instruction may be data dependent on more than one predecessor. Program order is
what determines which predecessor will actually deliver a data value to an instruction. Program order is ensured by maintaining the control dependences.
(If asked as a 16-mark question, this has to be elaborated with an example.)
3. Explain the fields / components in the reservation station.
Each reservation station has seven fields:
Op: The operation to perform on source operands S1 and S2.
Qj, Qk: The reservation stations that will produce the corresponding source operand;
a value of zero indicates that the source operand is already available in Vj or Vk, or is unnecessary. (The IBM 360/91 calls these SINKunit and SOURCEunit.)
Vj, Vk: The value of the source operands. Note that only one of the V field or the Q
field is valid for each operand. For loads, the Vk field is used to hold the offset field. (These fields are called SINK and SOURCE on the IBM 360/91.)
A: Used to hold information for the memory address calculation for a load or store.
Initially, the immediate field of the instruction is stored here; after the address calculation, the effective address is stored here.
Busy: Indicates that this reservation station and its accompanying functional unit are occupied.
The register file has a field, Qi:
Qi: The number of the reservation station that contains the operation whose
result should be stored into this register. If the value of Qi is blank (or 0), no currently active instruction is computing a result destined for this register,
meaning that the value is simply the register contents.
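The seven fields map naturally onto a small record type. The sketch below is an illustrative data structure only: the Python types, the default values and the two helper methods (ready and capture) are assumptions added to show how the Q and V fields interact when a result is broadcast, and station number 0 is reserved to mean "operand available" as in the text.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    busy: bool = False          # Busy: station and its functional unit are occupied
    op: Optional[str] = None    # Op: operation to perform
    vj: Optional[float] = None  # Vj, Vk: source operand values
    vk: Optional[float] = None
    qj: int = 0                 # Qj, Qk: producing station numbers (0 = value available)
    qk: int = 0
    a: Optional[int] = None     # A: immediate, later the effective address

    def ready(self):
        """The operation can start once neither operand is still pending."""
        return self.busy and self.qj == 0 and self.qk == 0

    def capture(self, station_id, value):
        """Take a result broadcast by station `station_id` (station numbers start at 1)."""
        if station_id == 0:
            return
        if self.qj == station_id:
            self.vj, self.qj = value, 0
        if self.qk == station_id:
            self.vk, self.qk = value, 0

# Hypothetical ADD.D waiting for its second operand from station 3.
rs = ReservationStation(busy=True, op="ADD.D", vj=1.5, qk=3)
waiting = rs.ready()
rs.capture(3, 2.5)
print(waiting, rs.ready(), rs.vk)
```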
4. Show, by an example, how loop unrolling will provide fewer clock cycles per instruction.
Consider our loop unrolled so that there are four copies of the loop body, assuming R1 - R2 (that is, the size of the array) is initially a multiple of 32, which means that the number of
loop iterations is a multiple of 4.
During unrolling, the compiler will rearrange expressions so as to allow constants to be collapsed, allowing an
expression such as "((i + 1) + 1)" to be rewritten as "(i + (1 + 1))" and then simplified to "(i + 2)."
Without scheduling, every operation in the unrolled loop is followed by a dependent operation and thus will cause a stall. This loop will run in 27 clock cycles (each LD
has 1 stall, each ADDD 2, the DADDUI 1, plus 14 instruction issue cycles), or 6.75 clock cycles for each of the four elements, but it can be scheduled to improve
performance significantly. Loop unrolling is normally done early in the compilation process, so that redundant computations can be exposed and eliminated by the
optimizer.
Comparison and conclusion:
For the given example:
The basic loop takes 9 clock cycles per iteration.
With scheduling, this is reduced to 7 clock cycles per iteration.
After loop unrolling (4 times), it takes 6.75 clock cycles per element (27 cycles for four elements).
The scheduled unrolled loop finally takes 3.5 clock cycles per element.
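The figures in this comparison follow from simple arithmetic on the stall and issue counts quoted earlier. The sketch below redoes that arithmetic; the variable names are assumptions, and the 9, 7 and 3.5 cycle figures are taken from the comparison as given rather than derived.

```python
# Stall and issue counts quoted in the text for the unscheduled loop unrolled four times.
loads, adds = 4, 4
stall_cycles = loads * 1 + adds * 2 + 1  # each LD: 1 stall, each ADDD: 2, the DADDUI: 1
issue_cycles = 14
unrolled_cycles = stall_cycles + issue_cycles
elements = 4

cycles_per_element = {
    "original loop": 9,
    "scheduled loop": 7,
    "unrolled x4, unscheduled": unrolled_cycles / elements,
    "unrolled x4, scheduled": 3.5,
}

# Overall improvement of the scheduled unrolled loop over the original loop.
speedup = cycles_per_element["original loop"] / cycles_per_element["unrolled x4, scheduled"]
for name, cycles in cycles_per_element.items():
    print(name, cycles)
print(round(speedup, 2))
```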
5. Draw the hardware architecture for a branch-target buffer. Write the steps involved in handling an
instruction with a BTB using a flow chart.
To reduce the branch penalty for our simple five-stage pipeline, as well as for deeper pipelines, we must know whether the as-yet-undecoded instruction is a branch and, if so, what the next PC
should be. If the instruction is a branch and we know what the next PC should be, we can have a
branch penalty of zero. A branch-prediction cache that stores the predicted address for the next instruction after a branch is called a branch-target buffer or branch-target cache. The figure above shows a branch-target buffer.
Because a branch-target buffer predicts the next instruction address and will send it out before decoding the instruction, we must know whether the fetched instruction is predicted as a taken
branch. If the PC of the fetched instruction matches a PC in the prediction buffer, then the corresponding predicted PC is used as the next PC. The hardware for this branch-target buffer is
essentially identical to the hardware for a cache. If a matching entry is found in the branch-target buffer, fetching begins immediately at the predicted PC. Note that unlike a branch-prediction
buffer, the predictive entry must be matched to this instruction because the predicted PC will be sent out before it is known whether this instruction is even a branch. If the processor did not
check whether the entry matched this PC, then the wrong PC would be sent out for instructions that were not branches, resulting in a slower processor. We only need to store the predicted-taken
branches in the branch-target buffer, since an untaken branch should simply fetch the next sequential instruction, as if it were not a branch.
The figure shows the detailed steps when using a branch-target buffer for a simple five-stage pipeline. From this we can see that there will be no branch delay if a branch-prediction entry is
found in the buffer and the prediction is correct. Otherwise, there will be a penalty of at least 2 clock cycles. Dealing with the mispredictions and misses is a significant challenge, since we
typically will have to halt instruction fetch while we rewrite the buffer entry. Thus, we would like to make this process fast to minimize the penalty.
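The lookup and update steps described above can be sketched in a few lines. This is a simplified software model, not the hardware: a Python dictionary stands in for the associative tag match, and the class name, method names and the 4-byte instruction size are assumptions for illustration.

```python
class BranchTargetBuffer:
    """Maps the PC of a predicted-taken branch to its predicted target PC."""

    def __init__(self):
        self.entries = {}

    def next_pc(self, pc, instr_size=4):
        # A hit means "predicted taken": fetch from the stored target.
        # A miss means "not a branch, or predicted not taken": fetch sequentially.
        return self.entries.get(pc, pc + instr_size)

    def resolve(self, pc, is_branch, taken, target=None, instr_size=4):
        """Update once the outcome is known; return True if the fetched PC was correct."""
        predicted = self.next_pc(pc, instr_size)
        actual = target if (is_branch and taken) else pc + instr_size
        if is_branch and taken:
            self.entries[pc] = target   # enter or refresh a taken branch
        else:
            self.entries.pop(pc, None)  # untaken branches are not kept
        return predicted == actual

# Hypothetical branch at PC 100 with target 200.
btb = BranchTargetBuffer()
first = btb.resolve(100, is_branch=True, taken=True, target=200)   # miss, branch taken
second = btb.resolve(100, is_branch=True, taken=True, target=200)  # hit, prediction correct
third = btb.resolve(100, is_branch=True, taken=False)              # hit, but branch not taken
print(first, second, third, btb.next_pc(100))
```

Only the second case fetches the right instruction without delay; the first and third correspond to the miss and misprediction cases that incur the penalty and require the buffer entry to be rewritten.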