120
UNIT 5 Processor Organization

UNIT 5 Processor Organization

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: UNIT 5 Processor Organization

UNIT 5Processor Organization

Page 2: UNIT 5 Processor Organization

Agenda

1. Processor Organization2. Register Organization : Case Study:: 80863. Instruction Cycle4. Instruction Pipelining5. Pipeline Hazards6. Dealing with Branches7. Superscalar Systems8. Instruction Level Pipelining9. Superpipelined Systems10.Design Issues

2R. V. Bidwe, PICT, Pune.

Page 3: UNIT 5 Processor Organization

Processor Organization

• Basic Functions of Processor

1. Fetch Instruction: The processor reads an instructionfrom memory (register, cache, main memory).

2. Interpret Instruction: The instruction is decoded todetermine what action is required.

3. Fetch Data: The execution of an instruction may requirereading data from memory or an I/O module.

4. Process Data: The execution of an instruction mayrequire performing some arithmetic or logical operationon data.

5. Write Data: The results of an execution may requirewriting data to memory or an I/O module.

3R. V. Bidwe, PICT, Pune.

Page 4: UNIT 5 Processor Organization

• To do these things, the processor needs to storesome data temporarily.

• It must remember the location of the last or nextinstruction so that it can know where to get thenext instruction.

• It needs to store instructions and datatemporarily while an instruction is beingexecuted.

• So, the processor needs a small internal memoryto do all these operations.

R. V. Bidwe, PICT, Pune. 4

Page 5: UNIT 5 Processor Organization

The Basic Processor Organization

R. V. Bidwe, PICT, Pune. 5

Page 6: UNIT 5 Processor Organization

• The major components of the processor are anArithmetic Logic Unit (ALU), a Control Unit (CU)and Registers.

• The ALU does the actual computation orprocessing of data.

• The CU controls the movement of data andinstructions into and out of the processor andcontrols the operation of the ALU.

• Registers consisting of a set of small storagelocations.

R. V. Bidwe, PICT, Pune. 6

Page 7: UNIT 5 Processor Organization

Internal Structure of the Processor

R. V. Bidwe, PICT, Pune. 7

Page 8: UNIT 5 Processor Organization

Memory Hierarchy

R. V. Bidwe, PICT, Pune. 8

Page 9: UNIT 5 Processor Organization

Register Organization

• A computer system employs a MemoryHierarchy.

• At higher levels of the hierarchy, memory isfaster, smaller, and more expensive (per bit).

• Within the processor, there is a set of registersthat function as a level of memory above mainmemory and cache in the hierarchy.

R. V. Bidwe, PICT, Pune. 9

Page 10: UNIT 5 Processor Organization

• The registers in the processor perform tworoles:

1. User-visible Registers: Enable the machine orassembly language programmer to minimize mainmemory references by optimizing use of registers.

2. Control and Status Registers: Used by the controlunit to control the operation of the processor andgives status of results after execution ofoperations or programs.

R. V. Bidwe, PICT, Pune. 10

Page 11: UNIT 5 Processor Organization

1. User-Visible Registers

• A user-visible register is one that is directlyused in machine language program, which isdirectly accessed by processor.

• We can characterize these in the followingcategories:– General Purpose Registers

• Data Registers

• Address Registers

– Condition codes

R. V. Bidwe, PICT, Pune. 11

Page 12: UNIT 5 Processor Organization

• General-purpose registers can be assigned to avariety of functions by the programmer.

• Any general-purpose register can contain theoperand for any opcode.

• Also they are used as per different cases inprogram.– There may be dedicated registers for Floating-point

(Used by FPU) and Stack operations (SP, BP).

– In some cases, general-purpose registers can be usedfor Addressing Functions (e.g. Register IndirectAddressing, Index or Displacement).

R. V. Bidwe, PICT, Pune. 12

Page 13: UNIT 5 Processor Organization

• In some processors, there is separate set of DataRegisters and Address Registers.

• Data registers may be used only to hold data andcannot be employed in the calculation of anoperand address. (eg. CX is default Count Registerin 8086)

• Address Registers may themselves be somewhatgeneral purpose, or they may be devoted to aparticular addressing mode. (eg. BX can hold BaseAddress in 8086 addressing modes)

R. V. Bidwe, PICT, Pune. 13

Page 14: UNIT 5 Processor Organization

• Other examples of Address Registers include thefollowing:– Segment Pointers: In a machine with segmented

addressing, a Segment Register holds the address ofthe base of the segment. (CS, DS, SS, ES are segmentregisters for 8086)

– Index Registers: These are used for IndexedAddressing and may be auto indexed. (SI, DI in8086)

– Stack Pointer: If there is user-visible Stack Addressing,then typically there is a dedicated register that pointsto the top of the stack. (SP, BP are used to point tostack in 8086)

R. V. Bidwe, PICT, Pune. 14

Page 15: UNIT 5 Processor Organization

• A final category of registers, which is at leastpartially visible to the user, holds ConditionCodes (also referred to as Flags).

• Condition codes are bits set by the processorhardware as the result of operations.

• Condition code bits are collected into one ormore registers. Usually, they form part of aControl Register.

R. V. Bidwe, PICT, Pune. 15

Page 16: UNIT 5 Processor Organization

Condition Codes

R. V. Bidwe, PICT, Pune. 16

Page 17: UNIT 5 Processor Organization

2. Control and Status Registers

• Program Counter (PC): Contains the address of an nextinstruction to be fetched.

• Instruction Register (IR): Contains the instruction mostrecently fetched.

• Memory Address Register (MAR): Contains theaddress of a location in memory.

• Memory Buffer Register (MBR): Contains a word ofdata to be written to memory or the word mostrecently read.

R. V. Bidwe, PICT, Pune. 17

Page 18: UNIT 5 Processor Organization

• Many processor designs include a register or setof registers, often known as the Program StatusWord (PSW), that contain status information.

• The PSW typically contains condition codes plusother status information. Common fields or flagsinclude the following:

– Sign: Contains the sign bit of the result of the lastarithmetic operation.

– Zero: Set when the result is 0.

– Carry: Set if an operation resulted in a carry (addition)into or borrow (subtraction) out of a high-order bit.Used for multiword arithmetic operations.

R. V. Bidwe, PICT, Pune. 18

Page 19: UNIT 5 Processor Organization

– Equal: Set if a logical compare result is equality.

– Overflow: Used to indicate arithmetic overflow.

– Interrupt Enable/Disable: Used to enable ordisable interrupts.

– Supervisor: Indicates whether the processor isexecuting in supervisor or user mode. Certainprivileged instructions can be executed only insupervisor mode, and certain areas of memorycan be accessed only in supervisor mode.

R. V. Bidwe, PICT, Pune. 19

Page 20: UNIT 5 Processor Organization

Ex. Register Organization

R. V. Bidwe, PICT, Pune. 20

Page 21: UNIT 5 Processor Organization

Instruction Cycle

• An instruction cycle includes the followingstages:– Fetch: Read the next instruction from memory

into the processor.

– Execute: Interpret the opcode and perform theindicated operation.

– Interrupt: If interrupts are enabled and aninterrupt has occurred, save the current processstate and service the interrupt.

R. V. Bidwe, PICT, Pune. 21

Page 22: UNIT 5 Processor Organization

Instruction Cycle

R. V. Bidwe, PICT, Pune. 22

Page 23: UNIT 5 Processor Organization

• The Indirect Cycle

– The execution of an instruction may involve one ormore operands in memory, each of which requiresa memory access.

– Usually each instruction is fetched first thenexecuted.

– After an instruction is fetched, it is examined todetermine if any indirect addressing is involved. Ifso, the required operands are fetched usingindirect addressing.

– Further, if indirect addressing is used, thenadditional memory accesses are required.

R. V. Bidwe, PICT, Pune. 23

Page 24: UNIT 5 Processor Organization

R. V. Bidwe, PICT, Pune. 24

Page 25: UNIT 5 Processor Organization

• Data Flow

1. Fetch Cycle

2. Indirect Cycle

3. Interrupt Cycle

R. V. Bidwe, PICT, Pune. 25

Page 26: UNIT 5 Processor Organization

R. V. Bidwe, PICT, Pune. 26

Page 27: UNIT 5 Processor Organization

R. V. Bidwe, PICT, Pune. 27

Page 28: UNIT 5 Processor Organization

R. V. Bidwe, PICT, Pune. 28

Page 29: UNIT 5 Processor Organization

Instruction Pipelining: Pipelining Strategy

R. V. Bidwe, PICT, Pune. 29

Page 30: UNIT 5 Processor Organization

• The pipeline has two independent stages.

• The first stage fetches an instruction and buffersit (eg. Instruction stream byte queue in 8086).Where on same time the second stage is free, thefirst stage passes it the buffered instruction.

• The second stage is executing the instruction.

• During execution time the first stage takesadvantage of any unused memory cycles to fetchand buffer the next instruction. This is calledInstruction Prefetch or Fetch Overlap.

R. V. Bidwe, PICT, Pune. 30

Page 31: UNIT 5 Processor Organization

• This process will speed up instruction execution,If the fetch and execute stages were of equalduration, the instruction cycle time would behalved. But this do not happen. Because:

1. The execution time will generally be longer than thefetch time. Execution will involve reading and storingoperands and the performance of some operation.Thus, the fetch stage may have to wait for sometime before execute cycle can empty its buffer.

2. In case of conditional branch instruction, processordon’t knows the address of the next instruction tobe fetched. Thus, the fetch stage must wait until itreceives the next instruction address from theexecute stage. The execute stage may then have towait while the next instruction is fetched.

R. V. Bidwe, PICT, Pune. 31

Page 32: UNIT 5 Processor Organization

• So new decomposition of the instruction processing is

– Fetch instruction (FI): Read the next expected instructioninto a buffer.

– Decode instruction (DI): Determine the opcode and theoperand specifiers.

– Calculate operands (CO): Calculate the effective address ofeach source operand. This may involve displacement,register indirect, indirect, or other forms of addresscalculation.

– Fetch operands (FO): Fetch each operand from memory.Operands in registers need not be fetched.

– Execute instruction (EI): Perform the indicated operationand store the result, if any, in the specified destinationoperand location.

– Write operand (WO): Store the result in memory.

R. V. Bidwe, PICT, Pune. 32

Page 33: UNIT 5 Processor Organization

Timing Diagram for Instruction Pipeline Operation: Ideal(Assume each step requires equal time)

R. V. Bidwe, PICT, Pune. 33

Page 34: UNIT 5 Processor Organization

• Several other factors still limit theperformance enhancement.

– If the six stages are not of equal duration, therewill be some waiting involved at various pipelinestages.

– Conditional Branch Instruction can invalidateseveral instruction fetches.

– If Interrupt occurs in a execution process.

R. V. Bidwe, PICT, Pune. 34

Page 35: UNIT 5 Processor Organization

The Effect of a Conditional Branch on Instruction Pipeline Operation

R. V. Bidwe, PICT, Pune. 35

Page 36: UNIT 5 Processor Organization

• Assume that instruction 3 is a conditional branch toinstruction 15. Until the instruction is executed, thereis no way of knowing which instruction will come next.The pipeline simply loads the next instruction insequence (instruction 4) and proceeds.

• The branch is taken at the end of time unit 7. At thispoint, the pipeline must be cleared of instructions thatare not useful.

• During time unit 8, instruction 15 enters the pipeline.No instructions complete during time units 9 through12; this is the performance penalty incurred becausewe could not anticipate the branch.

• So the logic is needed for pipelining to account forbranches and interrupts, which is given next.

R. V. Bidwe, PICT, Pune. 36

Page 37: UNIT 5 Processor Organization

Six-Stage CPU Instruction Pipeline Ideal

R. V. Bidwe, PICT, Pune. 37

Page 38: UNIT 5 Processor Organization

An Alternative Pipeline Depiction

R. V. Bidwe, PICT, Pune. 38

Page 39: UNIT 5 Processor Organization

• From the preceding discussion, it might appearthat the greater the number of stages in thepipeline, the faster the execution rate.

• But, designer must still consider:– At each stage of the pipeline, there is some overhead

involved (1) Moving data from buffer to buffer (2) Inperforming various preparation (3) Delivery functions.This overhead can lengthen the total execution timeof a single instruction.

– The amount of control logic required to handlememory and register dependencies and to optimizethe use of the pipeline increases enormously with thenumber of stages. This can lead to a situation wherethe logic controlling the gating between stages ismore complex than the stages being controlled.

R. V. Bidwe, PICT, Pune. 39

Page 40: UNIT 5 Processor Organization

Pipeline Hazard

• A Pipeline Hazard occurs when the pipeline, orsome portion of the pipeline, must stall becauseconditions do not permit continued execution.

• Such a pipeline stall is also referred to as aPipeline Bubble. There are three types ofhazards:

–Resource Hazard–Data Hazard–Control Hazard

R. V. Bidwe, PICT, Pune. 40

Page 41: UNIT 5 Processor Organization

Resource Hazards

• A resource hazard occurs when two (or more)instructions that are already in the pipelineneed the same resource.

• The result is that the instructions must beexecuted in serial rather than parallel for aportion of the pipeline.

• A resource hazard is sometime referred to as aStructural Hazard.

R. V. Bidwe, PICT, Pune. 41

Page 42: UNIT 5 Processor Organization

R. V. Bidwe, PICT, Pune. 42

Page 43: UNIT 5 Processor Organization

• Another example of a resource conflict is asituation in which multiple instructions areready to enter the execute instruction phaseand there is a single ALU.

• One solutions to such resource hazards is toincrease available resources, such as havingmultiple ports into main memory andmultiple ALU units.

R. V. Bidwe, PICT, Pune. 43

Page 44: UNIT 5 Processor Organization

Data Hazards

• A data hazard occurs when there is a conflict inthe access of an operand location.

• In general terms, we can state the hazard in thisform: Two instructions in a program are to beexecuted in sequence and both access aparticular memory or register operand.

• If the two instructions are executed in strictsequence, no problem occurs.

R. V. Bidwe, PICT, Pune. 44

Page 45: UNIT 5 Processor Organization

• However, if the instructions are executed in apipeline, then it is possible for the operandvalue to be updated in such a way as toproduce a different result than would occurwith strict sequential execution.

• In other words, the program produces anincorrect result because of the use ofpipelining.

R. V. Bidwe, PICT, Pune. 45

Page 46: UNIT 5 Processor Organization

• As an example, consider the following x86 machine instruction sequence:

ADD EAX, EBX / EAX = EAX + EBX

SUB ECX, EAX / ECX = ECX - EAX

R. V. Bidwe, PICT, Pune. 46

Page 47: UNIT 5 Processor Organization

R. V. Bidwe, PICT, Pune. 47

Page 48: UNIT 5 Processor Organization

• There are three types of Data Hazards;

– Read after Write (RAW), or True Dependency: Aninstruction modifies a register or memory location and asucceeding instruction reads the data in that memory orregister location. A hazard occurs if the read takes placebefore the write operation is complete.

– Write after Read (WAR), or Antidependency: Aninstruction reads a register or memory location and asucceeding instruction writes to the location. A hazardoccurs if the write operation completes before the readoperation takes place.

– Write after Write (WAW), or Output Dependency: Twoinstructions both write to the same location. A hazardoccurs if the write operations take place in the reverseorder of the intended sequence.

R. V. Bidwe, PICT, Pune. 48

Page 49: UNIT 5 Processor Organization

Control Hazards

• A control hazard, also known as a BranchHazard.

• Occurs when the pipeline makes the wrongdecision on a branch prediction and thereforebrings instructions into the pipeline that mustsubsequently be discarded.

R. V. Bidwe, PICT, Pune. 49

Page 50: UNIT 5 Processor Organization

Dealing with Branches

• One of the major problems in designing aninstruction pipeline is assuring a steady andcorrect flow of instructions to the initialstages of the pipeline.

• The primary hurdle is the conditional branchinstruction. Since, Until the instruction isactually executed, it is impossible todetermine whether the branch will be takenor not.

R. V. Bidwe, PICT, Pune. 50

Page 51: UNIT 5 Processor Organization

• Following approaches have been taken fordealing with conditional branches:

1. Multiple Streams

2. Prefetch Branch Target

3. Loop Buffer

4. Branch Prediction

5. Delayed Branch

R. V. Bidwe, PICT, Pune. 51

Page 52: UNIT 5 Processor Organization

1. Multiple Streams

• A simple pipeline suffers a penalty for abranch instruction because it must chooseone of two instructions to fetch next and maymake the wrong choice.

• A brute-force approach is to replicate theinitial portions of the pipeline and allow thepipeline to fetch both instructions, makinguse of two streams.

R. V. Bidwe, PICT, Pune. 52

Page 53: UNIT 5 Processor Organization

• There are two problems with this approach:– With multiple pipelines there are contention

delays for access to the registers and to memory.

– Additional branch instructions may enter thepipeline (either stream) before the original branchdecision is resolved. Each such instruction needsan additional stream.

• Despite these drawbacks, this strategy canimprove performance. Examples of machineswith two or more pipeline streams are theIBM 370/168 and the IBM 3033.

R. V. Bidwe, PICT, Pune. 53

Page 54: UNIT 5 Processor Organization

2. Prefetch Branch Target

• If a conditional branch is recognized, then thetarget of the branch is found and prefetched, inaddition to the instruction following the branch.

• This target is then saved until the branchinstruction is executed.

• When branch is taken, the target has alreadybeen prefetched.

• The IBM 360/91 uses this approach.R. V. Bidwe, PICT, Pune. 54

Page 55: UNIT 5 Processor Organization

3. Loop Buffer

• A loop buffer is a small, very-high-speed memorymaintained by the instruction fetch stage of thepipeline

• It contains the “n” most recently fetchedinstructions, in sequence.

• The external hardware first checks whether thebranch target is within the buffer.

• If so, the next instruction is fetched from thebuffer.

R. V. Bidwe, PICT, Pune. 55

Page 56: UNIT 5 Processor Organization

• The loop buffer has three benefits:

1. With the use of prefetching, the loop buffer willcontain some instruction sequentially ahead of thecurrent instruction fetch address. Thus, instructionsfetched in sequence will be available without theusual memory access time.

2. If target of a branch is near to current executinginstruction, then the target will already be in thebuffer. This is useful for the rather commonoccurrence of IF–THEN and IF–THEN–ELSE sequences.

R. V. Bidwe, PICT, Pune. 56

Page 57: UNIT 5 Processor Organization

3. This strategy is particularly well suited to dealingwith loops, or iterations; hence the name LoopBuffer.

4. If the loop buffer is large enough to contain all theinstructions in a loop, then those instructions need tobe fetched from memory only once, for the firstiteration. For subsequent iterations, all the neededinstructions are already in the buffer.

• The loop buffer is similar in principle to a cachededicated to instructions.

• The differences are that the loop buffer onlyretains instructions in sequence and is muchsmaller in size and hence lower in cost.

R. V. Bidwe, PICT, Pune. 57

Page 58: UNIT 5 Processor Organization

• Eg. If the buffer contains 256 bytes, and byteaddressing is used, then the least significant 8bits are used to index the buffer.

– The remaining most significant bits are checked todetermine if the branch target lies within theenvironment captured by the buffer.

R. V. Bidwe, PICT, Pune. 58

Page 59: UNIT 5 Processor Organization

4. Branch Prediction

• Various techniques can be used to predictwhether a branch will be taken. Among themore common are the following:

• Predict never taken

• Predict always taken

• Predict by Opcode

• Taken/not taken Switch

• Branch History Table

R. V. Bidwe, PICT, Pune. 59

Page 60: UNIT 5 Processor Organization

• The first three approaches are static: They donot depend on the execution history up to thetime of the conditional branch instruction.

• The latter two approaches are dynamic: Theydepend on the execution history.

• Predict never taken approach: Processor assumethat the branch will not be taken and continue tofetch instructions in sequence.– It is the most popular of all the branch prediction

methods

R. V. Bidwe, PICT, Pune. 60

Page 61: UNIT 5 Processor Organization

• Predict always taken approach: Processorassume that the branch will be taken andalways fetch for the branch target.

• Predict by Opcode approach: The processorassumes that the branch will be taken forcertain branch opcodes and not for others.

R. V. Bidwe, PICT, Pune. 61

Page 62: UNIT 5 Processor Organization

• Dynamic branch strategies attempt to improvethe accuracy of prediction by recording thehistory of conditional branch instructions in aprogram.

• One or more bits can be associated with eachconditional branch instruction.

• Those bits are stored in the recent history of theinstructions.

• These bits are referred by a taken/ not takenswitch that directs the processor to make aparticular decision when the next timeinstruction is encountered.

R. V. Bidwe, PICT, Pune. 62

Page 63: UNIT 5 Processor Organization

• These history bits are not stored in mainmemory. Rather, they are kept in temporaryhighspeed storage.

• Highspeed storages may be

– Cache Memory: But when the instruction isreplaced in the cache, its history is lost.

– History table: One or more bits of conditionbranch instruction are stored in table and Theprocessor could access the table associatively, likea cache.

R. V. Bidwe, PICT, Pune. 63

Page 64: UNIT 5 Processor Organization

• When Single bit is used:

– All that can be recorded is whether the lastexecution of this instruction resulted in a branchor not.

• When two bits are used:

– They can be used to record the result of the lasttwo instances of the execution of the associatedinstruction.

R. V. Bidwe, PICT, Pune. 64

Page 65: UNIT 5 Processor Organization

R. V. Bidwe, PICT, Pune. 65

Page 66: UNIT 5 Processor Organization

R. V. Bidwe, PICT, Pune. 66

Page 67: UNIT 5 Processor Organization

R. V. Bidwe, PICT, Pune. 67

Page 68: UNIT 5 Processor Organization

R. V. Bidwe, PICT, Pune. 68

Page 69: UNIT 5 Processor Organization

5. Delayed Branch

• It is possible to improve pipeline performanceby automatically rearranging instructionswithin a program.

• So that branch instructions occur later thanactually desired.

R. V. Bidwe, PICT, Pune. 69

Page 70: UNIT 5 Processor Organization

Intel 80486 Pipelining

• 80486 implements a five-stage pipeline:

– Fetch: Instruction is fetched and data is filled inBuffer.

– Decode stage 1: All opcode and addressing-modeinformation is decoded in the D1 stage.

– Decode stage 2: The D2 stage expands eachopcode into control signals for the ALU andcalculates oprands.

– Execute:

– Write back:R. V. Bidwe, PICT, Pune. 70

Page 71: UNIT 5 Processor Organization

80486 Instruction Pipeline Examples

R. V. Bidwe, PICT, Pune. 71

Page 72: UNIT 5 Processor Organization

RISC [Reduced Instruction Set Computing]

• Small set of instructions.

• Simplified and reduced instruction set.

• The reduced instruction set means that theprocessor can execute the instructions morequickly, potentially allowing for greater speeds.

• However, only allowing such simple instructionsmeans a greater burden is placed upon thesoftware itself (Compiler).

R. V. Bidwe, PICT, Pune. 72

Page 73: UNIT 5 Processor Organization

• As less instructions are available in the instructionset, it is difficult for programmer to write asoftware with the instructions that are available.

• Because of less instructions, Addressing modesare simplified and less, and the length of thecodes is fixed.

• Instruction pipelining can be implementedeasily.

• Only LOAD/STORE instructions can accessmemory.

R. V. Bidwe, PICT, Pune. 73

Page 74: UNIT 5 Processor Organization

• Each instruction requires single cycle time toexecute the program.

• So, RISC systems shorten execution time as theyhave reduced the clock cycles per instruction.

• Mainly used for real time applications.

• Examples of RISC Processors: Atmel AVR, PIC,ARM.

R. V. Bidwe, PICT, Pune. 74

Page 75: UNIT 5 Processor Organization

CISC [Complex Instruction Set Computing]

• Very large instruction sets.

• As the range of more advanced instructionsavailable, it becomes easy for compilers toexecute programs and helps to improveperformance.

• More specialized addressing modes andregisters also being implemented.

R. V. Bidwe, PICT, Pune. 75

Page 76: UNIT 5 Processor Organization

• Instruction pipelining can not be implementedeasily.

• Mainly used in normal PC’s, Workstations andservers.

• CISC systems shorten execution time by reducingthe number of instructions per program, asdirect instructions are available for manyoperations.

• Examples of CISC Processors: Intel 8086 andothers.

R. V. Bidwe, PICT, Pune. 76

Page 77: UNIT 5 Processor Organization

R. V. Bidwe, PICT, Pune. 77

Page 78: UNIT 5 Processor Organization

Scalar Vs. Superscalar

R. V. Bidwe, PICT, Pune. 78

Page 79: UNIT 5 Processor Organization

Superscalar Systems

• The term superscalar first coined in 1987.

• Superscalar refers to a machine that is designedto improve the performance of the execution ofscalar instructions.

• The main characteristic of the superscalarapproach is the ability to execute instructionsindependently and concurrently in differentpipelines.

R. V. Bidwe, PICT, Pune. 79

Page 80: UNIT 5 Processor Organization

• This allows instructions to be executed in anorder different from the program order.

• In the superscalar organization, main systemis divided into multiple functional units, eachof which is implemented as a pipeline.

• Each individual functional unit provides adegree of parallelism by considering of itspipelined structure.

R. V. Bidwe, PICT, Pune. 80

Page 81: UNIT 5 Processor Organization

• The use of multiple functional units enablesthe processor to execute streams ofinstructions in parallel, one stream for eachpipeline.

• It is the responsibility of the hardware, inconjunction with the compiler, to assure thatthe parallel execution does not violate theexecution of the program.

R. V. Bidwe, PICT, Pune. 81

Page 82: UNIT 5 Processor Organization

Instruction-level Parallelism

• Instruction-level parallelism (ILP) is a measureof how many of the instructions ina computer program can be executedsimultaneously.

• Can be achieved using

– Hardware

– Software (Compiler)

R. V. Bidwe, PICT, Pune. 82

Page 83: UNIT 5 Processor Organization

Superpipelined Systems

• Term Superpipelining first coined in 1988.

• It is capable of performing two pipeline stagesper clock cycle.

• An alternative way of looking at this is that thefunctions performed in each stage can be splitinto two nonoverlapping parts and each canexecute in half a clock cycle.

• A superpipeline implementation that behaves inthis fashion is said to be of degree 2.

R. V. Bidwe, PICT, Pune. 83

Page 84: UNIT 5 Processor Organization

Superscalar Vs. Superpipelined

R. V. Bidwe, PICT, Pune. 84

Page 85: UNIT 5 Processor Organization

Constraints

• If user want to implement instruction-levelparallelism in superscalar machine, or want toincrease its degree, then following are thelimitations:

• True data dependency (Read after Write)

• Procedural dependency

• Resource conflicts

• Output dependency (Write after Write)

• Antidependency (Write after Read)

R. V. Bidwe, PICT, Pune. 85

Page 86: UNIT 5 Processor Organization

True Data Dependency

ADD EAX, ECX

MOV EBX, EAX

• The second instruction can be fetched anddecoded but cannot execute until the firstinstruction executes.

• The reason is that the second instructionneeds data produced by the first instruction.

R. V. Bidwe, PICT, Pune. 86

Page 87: UNIT 5 Processor Organization

Procedural Dependencies

• The presence of branches in an instructionsequence complicates the pipeline operation.

• The instructions following a branch (taken ornot taken) have a Procedural Dependency onthe branch and cannot execute nextinstruction until the branch is executed.

• Since, you can not execute next instructionuntil you execute branch instruction.

R. V. Bidwe, PICT, Pune. 87

Page 88: UNIT 5 Processor Organization

Resource Conflict

• A resource conflict is a competition of two ormore instructions for the same resource atthe same time.

• Examples of resources include memories,caches, buses, register-file ports, andfunctional units (e.g., ALU adder).

R. V. Bidwe, PICT, Pune. 88

Page 89: UNIT 5 Processor Organization

Effect of Dependencies

R. V. Bidwe, PICT, Pune. 89

Page 90: UNIT 5 Processor Organization

Design Issues

1. Instruction Level and Machine Parallelism

2. Instruction Issue Policy

3. Register Renaming

4. Machine Parallelism

5. Branch Prediction

6. Superscalar Execution and Implementation.

R. V. Bidwe, PICT, Pune. 90

Page 91: UNIT 5 Processor Organization

1. Instruction Level and Machine Parallelism

• Instruction-level Parallelism exists wheninstructions in a sequence are independentand thus can be executed in parallel byoverlapping.

• As an example of the concept of instruction-level parallelism, consider the following twocode fragments

R. V. Bidwe, PICT, Pune. 91

Page 92: UNIT 5 Processor Organization

R. V. Bidwe, PICT, Pune. 92

Page 93: UNIT 5 Processor Organization

• The three instructions on the left areindependent, and in theory all three could beexecuted in parallel.

• In contrast, the three instructions on the rightcannot be executed in parallel because thesecond instruction uses the result of the first,and the third instruction uses the result of thesecond.

R. V. Bidwe, PICT, Pune. 93

Page 94: UNIT 5 Processor Organization

• So we can conclude,

– The degree of Instruction-level Parallelism isdetermined by

1. True Data Dependencies2. Procedural Dependencies in the code.3. Operational Latency: The time until the result of an

instruction is available for use as an operand in asubsequent instruction.

– These factors are actually dependent on nextinstructions in program and the instruction setarchitecture in the application.

R. V. Bidwe, PICT, Pune. 94

Page 95: UNIT 5 Processor Organization

• What is Machine Parallelism:

– Machine parallelism is a measure of the ability of theprocessor to take advantage of instruction-levelparallelism.

• Machine parallelism is determined by1. The number of instructions that can be fetched and

executed at the same time (the number of parallelpipelines).

2. The speed by which processor finds independentinstructions.

R. V. Bidwe, PICT, Pune. 95

Page 96: UNIT 5 Processor Organization

• So while designing the Superscalararchitectures:

– Both Instruction-level and Machine Parallelism areimportant factors in enhancing performance.

– A program may not have enough instruction-levelparallelism to take full advantage of machineparallelism.

– The degree of instruction level parallelism is highin RISC systems compared to CISC systems.

R. V. Bidwe, PICT, Pune. 96

Page 97: UNIT 5 Processor Organization

2. Instruction Issue Policy

• In order to achieve Instruction level parallelism,the processor must be able to carefully organizethe stages (fetching, decoding, and execution) ofinstructions in parallel, in order to get resultcorrectly.

• The term Instruction Issue to refer to the processof initiating instruction execution in theprocessor’s functional units.

• The term Instruction Issue Policy to refer to theprotocol used to issue instructions.

R. V. Bidwe, PICT, Pune. 97

Page 98: UNIT 5 Processor Organization

• We can group superscalar instruction issuepolicies into the following categories:

• In-order issue with In-order completion.

• In-order issue with Out-of-order completion.

• Out-of-order issue with Out-of-order completion.

R. V. Bidwe, PICT, Pune. 98

Page 99: UNIT 5 Processor Organization

In-order issue with in-order completion

• The simplest instruction issue policy is to issueinstructions in the exact order that would beachieved by sequential execution (In-orderissue) and to write results in that same order(in-order completion).

R. V. Bidwe, PICT, Pune. 99

Page 100: UNIT 5 Processor Organization

• Ex. Assume a superscalar pipeline system, which is

– 4 stage pipelining

– Fetch and Decode: 2 functional units

– Execute: 3 functional units

– Write Back: 2 functional units

R. V. Bidwe, PICT, Pune. 100

Page 101: UNIT 5 Processor Organization

• Also

– I1 requires two cycles to execute.

– I3 and I4 conflict for the same functional unit.

– I5 depends on the value produced by I4.

– I5 and I6 conflict for a same functional unit.

R. V. Bidwe, PICT, Pune. 101

Page 102: UNIT 5 Processor Organization

R. V. Bidwe, PICT, Pune. 102

Page 103: UNIT 5 Processor Organization

In-order issue with out-of-order completion

R. V. Bidwe, PICT, Pune. 103

• Out-of-order completion is used in scalar RISCprocessors to improve the performance ofinstructions that require multiple cycles.

• With out-of-order completion, any number ofinstructions may be in the execution stage at anyone time, up to the maximum degree of machineparallelism across all functional units.

• Instruction issuing is troubled by a ResourceConflict, a Data Dependency, or a ProceduralDependency.

Page 104: UNIT 5 Processor Organization

R. V. Bidwe, PICT, Pune. 104

Page 105: UNIT 5 Processor Organization

• One more dependency is Output Dependency(write after write [WAW] dependency),arises.

• Out-of-order completion requires morecomplex instruction issue logic than in-ordercompletion. In addition, it is more difficult todeal with instruction interrupts andexceptions.

R. V. Bidwe, PICT, Pune. 105

Page 106: UNIT 5 Processor Organization

Out-of-order issue with out-of-order completion

R. V. Bidwe, PICT, Pune. 106

• With in-order issue, the processor will onlydecode instructions up to the point of adependency or conflict.

• No additional instructions are decoded until theconflict is resolved.

• As a result, the processor cannot look ahead ofthe point of conflict to subsequent instructionsthat may be independent of those already in thepipeline and that may be usefully introduced intothe pipeline.

Page 107: UNIT 5 Processor Organization

• To allow out-of-order issue, it is necessary to decouplethe decode and execute stages of the pipeline. This isdone with a buffer referred to as an InstructionWindow.

• With this organization, after a processor has finisheddecoding an instruction, it is placed in the instructionwindow. As long as this buffer is not full, the processorcan continue to fetch and decode new instructions.

• When a functional unit becomes available in theexecute stage, an instruction from the instructionwindow may be issued to the execute stage.

• Any instruction may be issued, provided that (1) itneeds the particular functional unit that is available,and (2) no conflicts or dependencies block thisinstruction. R. V. Bidwe, PICT, Pune. 107

Page 108: UNIT 5 Processor Organization

Organization for Out-of-Order Issue with Out-of-Order Completion

R. V. Bidwe, PICT, Pune. 108

Page 109: UNIT 5 Processor Organization

R. V. Bidwe, PICT, Pune. 109

Page 110: UNIT 5 Processor Organization

• The problem in this approach isAntidependency (Write After Read [WAR]dependency), arises.

R. V. Bidwe, PICT, Pune. 110

Page 111: UNIT 5 Processor Organization

• Disadvantage of Instruction Issue Policy:

– May occur WAW, WAR or RAW dependencies.

– Which may lead to Storage Conflicts.

– [Storage Conflict: Multiple instructions are competing for the use of the same register locations. Which reduces performance of pipeline.]

R. V. Bidwe, PICT, Pune. 111

Page 112: UNIT 5 Processor Organization

3. Register Renaming

• One of the best technique to resolve Storage Conflicts.

• Registers are allocated dynamically by the processorhardware, and they are associated with the valuesneeded by instructions at various points in time.

• When an instruction executes, for a destinationoperand, a new register is allocated for that value.

• Subsequent instructions that access that value as asource operand in that register must go through arenaming process.

R. V. Bidwe, PICT, Pune. 112

Page 113: UNIT 5 Processor Organization

R. V. Bidwe, PICT, Pune. 113

After Register Renaming

Example

Page 114: UNIT 5 Processor Organization

4. Machine Parallelism

• To improve the machine parallelism, it isprobably not worthwhile to add functionalunits without Register Renaming.

R. V. Bidwe, PICT, Pune. 114

Page 115: UNIT 5 Processor Organization

5. Branch Prediction

• Any high-performance pipelined machinemust address the issue of “Dealing withBranches”.

R. V. Bidwe, PICT, Pune. 115

Page 116: UNIT 5 Processor Organization

6. Superscalar Implementation

• We can make some general comments aboutthe processor hardware required for thesuperscalar approach.

1. Instruction fetch and Decode function requirethe use of multiple pipeline fetch and decodestages, and branch prediction logic.

2. Logic is needed to determine TrueDependencies.

R. V. Bidwe, PICT, Pune. 116

Page 117: UNIT 5 Processor Organization

3. Mechanisms for issuing multiple instructions inparallel.

4. Resources for parallel execution of multipleinstructions are needed. Which includesmultiple pipelined functional units and memoryhierarchies capable of simultaneously servicingmultiple memory references.

5. Mechanisms for committing the process state incorrect order.

R. V. Bidwe, PICT, Pune. 117

Page 118: UNIT 5 Processor Organization

Case Study: Pentium 4

R. V. Bidwe, PICT, Pune. 118

Page 119: UNIT 5 Processor Organization

• Although the concept of superscalar design isgenerally associated with the RISC architecture,the same superscalar principles can be appliedto a CISC machine. Eg. Pentium.

• The 80486 introduced the first pipelined x86processor but with no superscalar elements.

• The original Pentium had a modest superscalarcomponent, consisting of the use of twoseparate integer execution units.

• The Pentium Pro introduced a full-blownsuperscalar design with out-of-order execution.Subsequent x86 models have refined andenhanced the superscalar design.R. V. Bidwe, PICT, Pune. 119

Page 120: UNIT 5 Processor Organization

• The operation of the Pentium 4 can besummarized as follows:

1. The processor fetches instructions from memory inthe order of the static program.

2. Each instruction is translated into one or more fixed-length RISC instructions, known as micro-operations,or micro-ops.

3. The processor executes the micro-ops on asuperscalar pipeline organization, so that the micro-ops may be executed out of order.

4. The processor commits the results of each micro-opexecution to the processor’s register set in the orderof the original program flow.

R. V. Bidwe, PICT, Pune. 120