
CHAPTER 1

Introduction:

What is Pipelining?

Definition:

•  In computing, a pipeline is a set of data processing elements connected in series, so that
the output of one element is the input of the next one.

•  An instruction pipeline is a technique used in the design of computers and other digital
electronic devices to increase their instruction throughput (the number of instructions that
can be executed in a unit of time).

The fundamental idea is to split the processing of a computer instruction into a series of 

independent steps, with storage at the end of each step. This allows the computer's control

circuitry to issue instructions at the processing rate of the slowest step, which is much faster than

the time needed to perform all steps at once. The term pipeline refers to the fact that each step is
carrying data at once (like water), and each step is connected to the next (like the links of a pipe).

Most modern CPUs are driven by a clock. The CPU consists internally of logic and registers
(flip-flops). When the clock signal arrives, the flip-flops take their new values and the logic then
requires a period of time to decode the new values. Then the next clock pulse arrives and the
flip-flops again take their new values, and so on. By breaking the logic into smaller pieces and
inserting flip-flops between the pieces of logic, the delay before the logic gives valid outputs is
reduced. In this way the clock period can be reduced. For example, the classic RISC pipeline is
broken into four stages with a set of flip-flops between each stage.

1.  Instruction fetch

2.  Instruction decode and register fetch

3.  Execute

4.  Memory access & Register write back 
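The effect of these inter-stage flip-flops can be sketched in a few lines of Verilog. The fragment below is a toy illustration with placeholder logic, not the processor designed later in this thesis: because a register sits between the two blocks of combinational logic, the clock period only has to cover the slower block instead of both in series.

module two_stage_pipe (
  input             clk,
  input      [15:0] in,
  output reg [15:0] out
);
  // placeholder stage-1 combinational logic
  wire [15:0] stage1_result = in + 16'd1;
  // flip-flops inserted between the two pieces of logic
  reg  [15:0] pipe_reg;

  always @(posedge clk) begin
    pipe_reg <= stage1_result;   // stage 1 hands its output to the latch
    out      <= pipe_reg << 1;   // placeholder stage-2 logic uses last cycle's value
  end
endmodule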

When a programmer (or compiler) writes assembly code, they make the assumption that each

instruction is executed before execution of the subsequent instruction is begun. This assumption

is invalidated by pipelining. When this causes a program to behave incorrectly, the situation is

known as a hazard. Various techniques for resolving hazards such as forwarding and stalling

exist.

A non-pipeline architecture is inefficient because some CPU components (modules) are idle

while another module is active during the instruction cycle. Pipelining does not completely


cancel out idle time in a CPU but making those modules work in parallel improves program

execution significantly.

Processors with pipelining are organized internally into stages which can work semi-independently
on separate jobs. Each stage is organized and linked into a 'chain' so each stage's output is fed to
another stage until the job is done. This organization of the processor allows overall processing
time to be significantly reduced.

A deeper pipeline means that there are more stages in the pipeline, and therefore, fewer logic

gates in each stage. This generally means that the processor's frequency can be increased as the

cycle time is lowered. This happens because there are fewer components in each stage of the
pipeline, so the propagation delay is decreased for the overall stage.

Unfortunately, not all instructions are independent. In a simple pipeline, completing an

instruction may require 4 stages. To operate at full performance, this pipeline will need to run 3

subsequent independent instructions while the first is completing. If 3 instructions that do not

depend on the output of the first instruction are not available, the pipeline control logic must

insert a stall or wasted clock cycle into the pipeline until the dependency is resolved. Fortunately,

techniques such as forwarding can significantly reduce the cases where stalling is required.
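As a minimal Verilog sketch of the forwarding idea (all signal names and widths here are illustrative assumptions, not taken from the design in this thesis), the newer value is bypassed straight from a later pipeline stage to the ALU input instead of stalling:

module forward_mux (
  input  [3:0]  id_ex_rs,        // source register of the instruction now in EX
  input  [3:0]  ex_mem_rd,       // destination register of the instruction in MEM
  input         ex_mem_regwrite, // the MEM-stage instruction will write a register
  input  [15:0] regfile_value,   // (possibly stale) value read from the register file
  input  [15:0] ex_mem_result,   // ALU result already computed one stage ahead
  output [15:0] alu_operand
);
  // On a RAW hazard between adjacent instructions, forward the newer value.
  assign alu_operand = (ex_mem_regwrite && (ex_mem_rd == id_ex_rs))
                       ? ex_mem_result : regfile_value;
endmodule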

While pipelining can in theory increase performance over an unpipelined core by a factor of the

number of stages (assuming the clock frequency also scales with the number of stages), in

reality, most code does not allow for ideal execution.

Pipelining incorporates the concept of Time Overlapped Handling of Several Identical Tasks by
Several Non-Identical Stages. Each Stage is made to handle a Distinct Section / Sub-Task for
each of the Tasks. At any point of time, ideally, each of the stages is busy with processing its
own Sub-Part belonging to a different task. Each stage, at any given point of time, is processing
a Sub-Task belonging to a Different Task; hence if there are N stages then ideally N tasks are
being processed concurrently, i.e. the Nth Task has started without any of the earlier N-1 tasks
being complete. Each of the Stages can go on independent of all the other stages provided it has
got some job to do / some Input to handle. Each of the stages, except the very first stage, gets
its input from the previous stage and feeds the next stage [except the very last one].

Project Objective:

The main objective of this project is to design a new pipelined RISC processor and to develop
Verilog code for it, so that the design can be verified through simulation.

We designed a pipelined RISC architecture Processor from the ground up to implement some simple
functions like AND, OR, MOVE, STORE, ADD, SUBTRACT, simulated it using the code
we developed, and found it satisfactory.


Usage of Pipelining

1. Used in many everyday applications without our notice:

a) Concrete Casting involving a Number of People passing on the Concrete Mix among different
Levels.

b) Fire Fighting.

2. Has proved to be a very popular and successful way to exploit Instruction Level
Parallelism [will be explained in the next section].

Instruction pipes are being used in almost all modern processors.

Consider a sufficiently large number of Identical Tasks [Dumping Concrete Mix, Throwing a
Bucket of Water, Executing Instructions in a Computer].

Break up each Task into several smaller Sub-Tasks. Design & employ one Sub-Unit for carrying
out each of these Sub-Tasks. Each Sub-Unit takes Input from its previous stage / Unit and
delivers Output to its next stage / Unit. Keep each of these Sub-Units busy ALL the time, i.e.
operate them in a Time Overlapped Fashion. If there are N Sub-Units and the slowest among
them takes K units of time, then once the line is full our Assembly Line will complete at least
one task every K units of Time.

Classic Examples:

Consider the way in which any Typical Undergraduate Engineering College works:

1. It offers a 4 Year Curriculum.

2. It has got facilities [Sub-Units] to train / teach students of a Particular Year.

3. Starting from a Particular Year onwards, it admits M number of students every year. After
the first 4 Years, the number of students graduating per year = M in each of the subsequent
years, assuming NO failures / an Ideal Scenario: Pipelining Student Admissions.

Salient Features of This Pipeline:

1. Fixed Number of Stages: 4.

2. Identical Stages: Each Training stage is of One Year duration, and each handles the
same set of students as had been admitted in the First Year.

3. No stage is starved of Inputs: each gets an adequate Number of Students.

4. Synchronized Stages: Through the Common Exam Schedule.


Time Overlapped Processing:

[ Temporal Parallelism / Another Real Life Example]

A. Task: WASH, DRY & IRON Ten (10) Dirty Clothes.

B. Units available [Capacity] and Time Taken:

WASHER [can wash 5 clothes in one go] takes 40 minutes. [Hence Total Time Required to WASH
10 Dirty Clothes = 80 minutes.]

DRIER [can dry 8 clothes at a time] takes 20 minutes. [Therefore Total Time Required to DRY 10
Clothes = 40 minutes.]

IRON [manual ironing of 1 cloth at a time] takes 4 minutes. [Total Time Required to IRON 10
Clothes = 40 minutes.]

TOTAL Time Needed if operated in a Strict, Time Non-Overlapped Sequence = 80 + 40 + 40 = 160 Minutes.

Time Overlapped Processing

[ Temporal Parallelism / A Real Life Example - 2]

C. Time Overlapped Operation Sequence:

1. Put 5 Clothes in the WASHER [DRIER, IRON Idle].

2a. After 40 minutes [WASHER finishes washing the 1st Lot], put the washed clothes [5] to DRY in
the DRIER.

2b. Load the WASHER with the left-over 5 Clothes, so for the subsequent period both WASHER &
DRIER get to work in a Time Overlapped Fashion. The IRON is still Idle.

Time Overlapped Processing

[ Temporal Parallelism / A Real Life Example - 3]

3a. After 20 Minutes [Total 60 Minutes] the DRIER will finish; one can take the clothes for
IRONING (provided there is space to keep those clothes). Meanwhile the WASHER is still
washing. The DRIER is IDLE.

3b. After 20 more Minutes [Total 80 minutes] IRONING of the first 5 clothes is finished, while
the WASHER has also finished washing ALL 10 clothes. The DRIER remains IDLE.

Time Overlapped Processing

[ Temporal Parallelism / A Real Life Example - 4]

4. Engaging the DRIER to DRY the remaining 5 clothes takes 20 more minutes [Total Time Taken = 100
minutes]. The IRONING activity is idle due to lack of availability of clothes. The WASHER could be
kept BUSY if more clothes were there.

5. IRONING these clothes will take 20 more minutes [Total Time Taken = 120 minutes].

Time Overlapped Processing

[ Real Life Example – Key Observations ]

1. Net Time saved due to Time Overlapped Processing = 160 - 120 = 40 minutes.

2. Slowest Stage in the Pipeline = IRONING.

3. After all the 3 stages (WASHER, DRIER, IRON) have been made busy (after Step 3a), one
will get one cloth ready after every 4 minutes.

Time Overlapped Usage of Different Processing Stations in an Assembly Line

( General Observations) – 1

Motivation: To Decrease the Processing Time of a Number of Identical Jobs.

The trick is to sub-divide the entire processing of a single job into a number of sub-tasks.


Each sub-task is to be handled by a separate processing station / stage.

Time Overlapped Usage of Different Processing Stations in an Assembly Line
( General Observations) – 2

4. Each of the Processing Stages should have some Input to work on in order to keep that unit
busy as often as possible.

5. Each of the Processing Stages except the very last one generates some Output to be
consumed by the next Processing Stage only.

6. Each of these Processing Stages may not take the same time and also need not be synchronized.
Hence there will have to be some intermediate store / buffer to hold temporarily the Inputs to any
particular processing station.

Time Overlapped Usage of Different Processing Stations in an Assembly Line

( General Observations) – 3

7. Since each processing stage is dependent on its predecessor processing stage only, and feeds
its next processing stage only, one cannot reduce the processing time for any
particular task/job below the slowest processing stage's processing time.

8. Each task normally passes through each of the processing stages regardless of its requirements.

9. Hence the time taken to process a single task may increase as compared to the case where the
given task is processed based on its specific requirements, since a task may have to go through
some unnecessary stages.

Time Overlapped Usage of Different Processing Stations in an Assembly Line

( General Observations) – 4

10. System Throughput, i.e. the number of tasks completed over a specific period of time, will
increase because of the Time Overlapped operation of the various processing stages.

11. However, if during the course of Processing any of the Processing Stages Fails / Stalls,
then the entire Assembly Line will either crash OR get stalled.

Using Pipeline Inside a Computer

( Salient Queries - 1)

1. How is this Assembly Line Concept applicable to Instruction Processing in a typical
Computer?

Ans. a) The CPU of any Computer essentially fetches, decodes and then executes Instructions
belonging to a Program.

b) Each Instruction's Processing is composed of an almost identical set of stages / Machine Cycles.
Hence one can view the CPU as representing an Assembly Line for Instruction Processing.

Using Pipeline Inside a Computer

( Salient Queries - 2)

2. Is the improved Throughput, i.e. the number of tasks completed over a period of time, dependent
on / proportional to the number of processing / PIPELINE stages?

To be answered later in the context of Instruction Processing in a Computer.


CHAPTER 2

The Typical Instruction Handling Sequence in a CPU

Typical Instruction Processing Stages Inside a CPU- 1

1. Fetch the Instruction Op-Code [CISC] / the Entire Instruction [RISC] from the Instruction-Cache /
Memory into the Instruction Register, using the Instruction Pointer / PC appended by the Code Segment
Register, and update the Instruction Pointer / PC to point to the next Instruction. [IF]

2. Decode the Instruction Op-Code inside the CPU and select some Register Operands [RISC] (in
this case the Instruction Pointer / PC can be used to fetch the next Instruction), or decide on future
Operand Address Reads as well as the next Instruction Location, as in CISC. Update the PC
accordingly. [ID]

Typical Instruction Processing Stages Inside a CPU- 2

3. Read Operand Addresses into the Instruction Register from the I-Cache using the Instruction
Memory Address Register [CISC only] [ROA]. May have to be carried out a number of times,
once for each of the Operand Addresses. (Optional) Not required for RISC.

4. Execute the Instruction Processing Op-Code / calculate the Linear Operand Address Offset using
the ALU [EX]. In the former case (processing) the operation may vary in time depending on the
type of Operation being carried out.

Typical Instruction Processing Stages Inside a CPU – 3

5. Read operand Values from the Data-Cache / Memory using the computed Linear Offset as
obtained in the previous step, appended by the appropriate Segment Registers
(DATA / STACK / EXTRA). [MEM]

N.B.: For CISC the above two steps 4 & 5 may need to be executed a number of times, once each
for reading each of the Operand Addresses and at least once for performing the computation. This
Computation Time need not be fixed.

Typical Instruction Processing Stages Inside a CPU - 4

6. Write Back the Result [into the Designated Destination]
[WB]. In case of Memory being the destination, the processor needs to compute the Linear
Address Offset using step 4.


7. Interrupt Handling: Here the main issues are two-fold, namely preserving the Current Context in
the System Stack, followed by computing / locating the Target and loading it into the Instruction
Pointer.

One can Time Overlap these operations provided:

A. There is no Resource Conflict among the various stages [No Structural Hazards].

B. Each Instruction once in the Pipeline in no way affects the Execution pattern of any of its
Successor Instructions in the Pipeline [there exists no Inter-Instruction Dependency in the form
of either DATA Hazards or Control Hazards].

A representative RISC Processor [MIPS / DLX], Salient Features:

1. 32-bit Processor, i.e. can handle 32-bit Operands in one go.

2. Fixed Instruction length (32 bits). Hence an instruction can be fetched in one machine cycle.

3. Load-Store Architecture, i.e. all the source operands need to be brought into some CPU
Register before processing; all Results are to be computed in some CPU register before being
stored in some Memory location.

4. Restricted Addressing modes [Register Direct, Indexed, Relative, Implied].

5. Large GPR file set.

MIPS, a RISC Processor, uses the following 5-stage Pipeline:

1. IF: Instruction fetch from Instruction Memory.

2. ID: Decode the instruction and select CPU Register operands.

3. EX: ALU operation or Memory Data operand Linear Address generation.

4. MEM: Data Memory reference to Read Operand Values.

5. WB: Write back into the CPU Register file.

MIPS Pipeline Stages

5 stages of MIPS Pipeline:

1. IF Stage: Needs access to the Program Memory to fetch the whole instruction.
Needs a dedicated adder to update the PC.

2. ID Stage: Needs access to the Register File.

3. EX Stage: Needs an ALU and Associated Registers.

4. MEM Stage: Needs access to the Data Memory.

5. WB Stage: Needs access to the Register File for writing the Result.

Pipeline Registers:

Pipeline registers are an essential part of pipelines, serving as Inter-Stage Buffers / Latches.

There are N-1 groups of pipeline registers in an N-stage pipeline, one group lying between two
successive Pipeline stages.

Each stage, after completion of its processing part, saves ALL the relevant outputs generated by
it to the Intermediate Register lying at its output. In the MIPS Pipeline these Registers happen to
be:


1. IF/ID (the Instruction Register writes the Fetched Instruction into this; Condition Code Flags
are also written into it).

2. ID/EX (ALU Operand Registers are written from it, hence this stores the content of ALL the
Input Register Operands).

3. EX/MEM (the ALU Result Register + Flags write into it).

4. MEM/WB (the Memory Data / Buffer Register writes into it).

This way, each time "something is computed"...
Effective address, Immediate value, Register content, etc. are saved & can be made available in
the context of the instruction that needs it.

[Figure: Pipeline Register Depiction]

Historically, there are two different types of pipelines:

1. Instruction pipelines

2. Data / Arithmetic pipelines [the SIMD case]

Arithmetic pipelines (e.g. Floating Point Processing) are mostly found within Special Purpose
Processors / Co-Processors, since these are to be employed only occasionally, and also such Data
Pipelines need a continuous stream of arithmetic operations, e.g. Vector processors operating
on an array.

On the other hand, Instruction Pipelines are used in almost every modern processor to increase
Instruction Execution Throughput. Assumed as default.


Instruction Level Parallelism [ILP] :

It is a measure of how many of the Instructions in a Computer Program can be executed
simultaneously [in a Time Overlapped Fashion] without violating the various Inter-Instruction
Dependencies that may exist.

Consider the following program:

I#1. e = a + b
I#2. f = c + d
I#3. g = e * f

Instruction I#3 depends on the results of Instruction I#1 as well as on Instruction I#2 [True
(Data) [RAW] Dependency].

However, instructions I#1 and I#2 do not depend on any other Instruction, so they can be
executed simultaneously.

If we assume that each Instruction can be completed in one unit of time, then these three
instructions can be completed in a total of two units of time, giving an ILP of 3/2.

Goal & Motivation to achieve Speed Up:

Ordinary programs are typically written under a sequential execution model where instructions
execute one after the other and in the order specified by the programmer.

ILP allows the compiler and the processor to overlap the execution of multiple instructions or
even to change the order in which instructions are executed.

A goal of compiler and processor designers is to identify and take advantage of as much ILP as
possible in a Specified Sequential Code.

How much ILP exists in programs is very application specific. In certain fields, such as graphics
[Manipulation of Individual Pixels in a Group] and scientific computing [Matrix Multiplication],
the amount can be very large. However, workloads such as cryptography exhibit much less
parallelism because of the inherent RAW Data Dependency among the constituent Operations.


Micro-Architectural Techniques used to Exploit ILP - 1

Instruction pipelining, where the execution of multiple instructions can be partially overlapped.

Superscalar execution, in which multiple execution units are used to execute multiple instructions
in parallel. In typical superscalar processors, the instructions executing simultaneously are
adjacent in the original program order.


Out-of-order execution, where instructions execute in any order that does not violate data
dependencies. Note that this technique is independent of both pipelining and superscalar execution.

Register renaming, which refers to a technique used to avoid unnecessary serialization of
program operations imposed by the reuse of registers by those operations, used to enable out-of-
order execution.

Speculative execution, which allows the execution of complete instructions or parts of instructions
before it is certain whether this execution should take place.

A commonly used form of speculative execution is control flow speculation, where instructions
past a control flow instruction (e.g., a branch) are executed before the target of the control flow
instruction is determined [Branch Prediction (used to avoid stalling for control dependencies to
be resolved)].

Several other forms of speculative execution have been proposed and are in use, including
speculative execution driven by value prediction, memory dependence prediction and cache
latency prediction.

Factors Affecting ILP Implementation

Inter-Instruction Dependencies: Data Dependency & Control Dependency.

Various types of Data Dependencies:

A data dependency in computer science is a situation in which a program statement (instruction)
refers to the Data / Operand of a preceding statement / Instruction in some way or the other.

In compiler theory, the technique used to discover data dependencies among statements (or
instructions) is called Dependence Analysis.

Data Dependency

Definition: Consider that in any Computer Program there are two Statements S1 & S2, where
statement S1 precedes statement S2 in the Program.

Statement S2 is said to be Data dependent on Statement S1 if any one of the following 3
cases exists.

Data Dependency Conditions 

Bernstein Conditions: Assuming statements S1 and S2, S2 depends on S1
if: [I(S1) ∩ O(S2)] ∪ [O(S1) ∩ I(S2)] ∪ [O(S1) ∩ O(S2)] ≠ Φ,

where I(Si) is the set of memory locations read by Si, O(Sj) is the set of memory locations
written by Sj, and there is a feasible run-time execution path from S1 to S2. This condition is
called the Bernstein Condition, named after A. J. Bernstein.
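As a quick check of this condition against the three-instruction program from the ILP section (S1: e = a + b, S2: f = c + d, S3: g = e * f), where I(S1) = {a,b}, O(S1) = {e}, I(S2) = {c,d}, O(S2) = {f}, I(S3) = {e,f}:

\[
[I(S_1)\cap O(S_2)]\cup[O(S_1)\cap I(S_2)]\cup[O(S_1)\cap O(S_2)] = \Phi
\;\Rightarrow\; S_1 \text{ and } S_2 \text{ are independent;}
\]
\[
O(S_1)\cap I(S_3) = \{e\} \neq \Phi
\;\Rightarrow\; S_3 \text{ is (RAW) data dependent on } S_1.
\]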

Cases of Data Dependency

True (Data) Dependence: O(S1) ∩ I(S2) ≠ Φ.
Statement S1 precedes Statement S2, and S1 writes into some place (Memory / Register) that
will be READ by the Successor Statement S2. [Read After Write (RAW)]

Anti (Name) Dependence: I(S1) ∩ O(S2) ≠ Φ, the mirror relationship of true dependence. Here the
predecessor Instruction S1 reads from some Memory Location or Register which is later
modified / written to by the Successor Instruction S2. [Write After Read (WAR)]

Output Dependence: O(S1) ∩ O(S2) ≠ Φ, S1 -> S2, and both the Instructions S1 & S2 write to the
same Memory Location or Register. [Write After Write (WAW)]


True Data [RAW] Dependency – 1

Statement S1 precedes Statement S2, and S1 writes into some place (Memory / Register) that
will be READ by the Successor Statement S2. [Read After Write (RAW)]

Example:

A true dependency, also known as a data dependency, occurs when an instruction depends on the
result of a previous instruction:

I#1. A = 3
I#2. B = A
I#3. C = B

True Data [RAW] Dependency - 2

Here Instruction I#3 is truly dependent on instruction I#2, as the final value of C depends on the
instruction updating B. Instruction I#2 is truly dependent on instruction I#1, as the final value of
B depends on the instruction updating A.

Since instruction I#3 is truly dependent upon instruction I#2 and instruction I#2 is truly
dependent on instruction I#1, instruction I#3 is also truly dependent on instruction I#1.
Instruction level parallelism is therefore not an option in this example.

Anti (Name) [WAR] Dependency

An anti-dependency occurs when an instruction requires a value that is later updated. In the

following example, instruction 3 anti-depends on instruction 2 — the ordering of these

instructions cannot be changed, nor can they be executed in parallel (possibly changing the

instruction ordering), as this would affect the final value of A.

I#1. B = 3

I#2. A = B + 1

I#3. B = 7

An anti-dependency is an example of a name dependency. That is, renaming of variables could

remove the dependency, as depicted in the next Slide:

Removing Anti Dependency through Renaming of Variables

I#1 . B = 3

I#N. B2 = B

I#2. A = B2 + 1

I#3. B = 7

Here a new variable, B2, has been declared as a copy of B in a new instruction, instruction I#N.
The anti-dependency between instruction I#2 and instruction I#3 has been removed,
meaning that these instructions may now be executed in parallel. However, the modification has
introduced a new set of RAW dependencies: instruction I#2 is now truly dependent on
instruction I#N, which is in turn truly dependent upon instruction I#1.

As true dependencies, these new dependencies are impossible to safely remove.

Output [WAW] Dependency

An output dependency occurs when the ordering of instructions will affect the final output value
of a variable. In the example below, there is an output dependency between instructions I#3 and
I#1: both write to B, so swapping their order would change the final value of B. [Write After
Write (WAW)]

I#1. B = 3
I#2. A = B + 1
I#3. B = 7


However, dependencies among statements or instructions may hinder parallelism, i.e. the parallel
execution of multiple instructions, either by a parallelizing compiler or by a processor exploiting
instruction level parallelism [ILP].

Recklessly executing multiple instructions without considering the related dependences risks
producing wrong results; such situations are called hazards.

[Figure: Non-Pipelined Floating Point Processing]

[Figure: Pipelined Floating Point Processing]


Pipeline cycle:

The time required to move an instruction one step further in the pipeline.
Not to be confused with the clock cycle. Determined by the time required by the slowest stage.

Basic Pipelining Terminologies

Pipeline designers try to balance the length (i.e. the processing time) of each pipeline stage.

For a perfectly balanced N-stage pipeline, the execution time per instruction is t/N,
where t is the execution time per instruction on the non-pipelined machine and N is the number of
pipeline stages.

However, it is very difficult to make the different pipeline stages perfectly balanced, so different
pipeline stages may possess different processing times.

Besides, pipelining itself involves some overhead arising due to the Registers / Latches used
between two successive pipeline stages.

Some Important Pipeline Issues

Timing Factors in a Typical Pipeline

Pipeline cycle τ:

If the delay of stage m is τ_m and the inter-stage Latch / Register delay is d, then

τ = max{τ_m} + d

Pipeline frequency f:

f = 1 / τ

Ideal Pipeline Speedup

A k-stage pipeline processes n tasks in k + (n-1) clock cycles:
k cycles for the first task and n-1 cycles for the remaining n-1 tasks.

Total time to process n tasks:

Tk = [k + (n-1)] τ

For the non-pipelined processor:

T1 = n k τ   [n tasks pass through k stages, each having delay τ]

Pipeline Speedup Expression:

Speedup(Sk) = T1 / Tk = n k τ / ([k + (n-1)] τ) = n k / [k + (n-1)]

Observe that the memory bandwidth must increase by a factor of Sk; otherwise, the processor
would stall waiting for data to arrive from memory.
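Plugging illustrative numbers (ours, not from the thesis) into the expression above, a 4-stage pipeline processing 100 tasks gives

\[
S_4 = \frac{n k}{k+(n-1)} = \frac{100\times 4}{4+99} = \frac{400}{103} \approx 3.9,
\]

and as n grows, S_k approaches k: the ideal speedup equals the number of stages.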


Exercise – 1

Consider an unpipelined processor that:

takes 4 cycles for ALU and other operations, and

takes 5 cycles for memory operations.

Assume the relative frequencies: ALU and other = 60%, memory operations = 40%.

Cycle time = 1 ns. Compute the speedup due to pipelining.
Ignore the effects of branching. Assume pipeline overhead = 0.2 ns.

Solution

Average instruction execution time, for a large number of instructions,
unpipelined = 1 ns * (60% * 4 + 40% * 5) = 4.4 ns.

Pipelined: one instruction completes every cycle, so the average time per instruction is the
cycle time plus the pipeline overhead = 1 ns + 0.2 ns = 1.2 ns.

Speedup = 4.4 / 1.2 = 3.7 times.

Pipeline Types: 

Synchronous pipeline: Either the Pipeline cycle is constant, OR
the Pipeline Cycle through any Pipeline stage is an Integer Multiple of the Clock Period, known a
priori to each of the Pipeline stages, so each stage knows when its input will be available.

N.B.: Assumed as default.

Asynchronous pipeline:

The time for moving from stage to stage varies.

Individual stages need not be aware of the Timing of any other Stage.

Handshaking communication between stages.

A stage may have to WAIT for Input availability, thereby requiring Interlocking of Stages.

Synchronous Pipeline

Transfers between stages are simultaneous.

One task or operation enters the pipeline per cycle.


No. of Pipeline Stages vs Performance – 1

Various Pipelined Processing Stages 1 – 8086:

The Bus Interface Unit and the Execution Unit work independently (to enable two-stage pipelined
processing); in the 8086 there is Fetch and Execution overlap.

It is only 2-stage pipelining:

F E
  F E
    F E

F = Fetch the instruction and decode the instruction; E = Execute the instruction and write into memory.


Various Pipelined Processing Stages – 2:

[Figure: Pipelined CPU – Memory Interface]


[Figure: Pipelined CPU – GPR Interface]

[Figure: Speedup Factors with Instruction Pipelining]


Performance Evaluation Method

Amdahl's Law

Quantifies the overall performance gain due to improvement in a part of a computation.

The performance improvement gained from using some faster mode of execution is limited by the
amount of time the enhancement is actually used.

Amdahl's Law:

Speedup = Execution time for the task without the enhancement / Execution time for the task using
the enhancement.

Amdahl's Law and Speedup

Speedup tells us how much faster a machine will run due to an enhancement.

For using Amdahl's law two things should be considered:

1. The fraction of the computation time in the original machine that can use the enhancement.
If a program executes in 30 seconds and 15 seconds of execution uses the enhancement,
Fraction = 1/2. This value, termed Fraction(enhanced), is always less than or equal to 1.

2. The improvement gained by the enhanced execution mode; that is, how much faster the task would
run if the enhanced mode were used for the entire program. If the enhanced task takes 3.5 seconds
and the original task took 7 seconds, we say the Speedup(enhanced) is 2.
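Combining the two quantities gives the usual closed form of Amdahl's law; the numeric line simply plugs in the document's own example values (Fraction(enhanced) = 1/2, Speedup(enhanced) = 2):

\[
\text{Speedup}_{\text{overall}}
 = \frac{1}{\bigl(1-\text{Fraction}_{\text{enh}}\bigr) + \dfrac{\text{Fraction}_{\text{enh}}}{\text{Speedup}_{\text{enh}}}}
 = \frac{1}{(1-0.5) + 0.5/2}
 = \frac{1}{0.75} \approx 1.33
\]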

CISC processors are not suitable for pipelining because of:

Variable instruction format.

Variable execution time.

Complex addressing modes.

RISC processors are suitable for pipelining because of:

Fixed instruction format.

Fixed execution time.

Limited addressing modes.

Advantages and disadvantages:

Pipelining does not help in all cases, and there are several possible disadvantages. An instruction
pipeline is said to be fully pipelined if it can accept a new instruction every clock cycle. A
pipeline that is not fully pipelined has wait cycles that delay the progress of the pipeline.

Advantages of Pipelining:

1. An n-stage pipeline can improve performance up to n times.

2. Not much investment in hardware: no replication of hardware resources is necessary. The
principle deployed is to keep the units as busy as possible.

3. Transparent to the programmers: easy to use.

4. The cycle time of the processor is reduced, thus increasing the instruction issue-rate in most cases.

5. Some combinational circuits such as adders or multipliers can be made faster by adding more
circuitry. If pipelining is used instead, it can save circuitry vs. a more complex combinational
circuit.


Pipelines: Few Key Observations - 1

A pipeline increases instruction throughput, but does not decrease the execution time of the
individual instructions. In fact, it slightly increases the execution time of each instruction due to
pipeline overheads, since each Instruction passes through identical Pipeline stages.

Disadvantages of Pipelining:

1. A non-pipelined processor executes only a single instruction at a time. This prevents branch
delays (in effect, every branch is delayed) and problems with serial instructions being executed
concurrently. Consequently the design is simpler and cheaper to manufacture.

2. The instruction latency in a non-pipelined processor is slightly lower than in a pipelined
equivalent. This is because extra flip-flops must be added to the data path of a pipelined
processor.

3. A non-pipelined processor will have a stable instruction bandwidth. The performance of a
pipelined processor is much harder to predict and may vary more widely between different
programs.

Pipeline Overheads

Pipeline register delay: caused due to set-up time.

Clock skew: the maximum delay between clock arrival at any two registers.

Once the clock cycle is as small as the pipeline overhead, no further pipelining would be useful. Very
deep pipelines may therefore not be useful.

EXAMPLES:

Four Stages of an Instruction:

Instruction Fetch (F): Fetch the instruction from the Instruction Memory.

Operand Fetch and Instruction Decode (D): Fetch the operand Data from the Memory or Register
& Decode the instruction.

Execute (E): Calculate the memory address and/or execute the function.

Memory & Write-back (M): Read the data from the Data Memory & Write Back to the Register.

INSTRUCTIONS WAITING:

D
C D
B C D
A B C D

Cycle:    1 2 3 4 5 6 7 8
FETCH:    A B C D X X X X
DECODE:   X A B C D X X X
EXECUTE:  X X A B C D X X
MEMORY:   X X X A B C D X


INSTRUCTIONS COMPLETED:

4-stage pipeline; the boxes represent instructions independent of each other.

The top box is the list of instructions waiting to be executed; the bottom gray box is the list of
instructions that have been completed; and the middle white box is the pipeline.

Execution is as follows:

Time 0: Four instructions are waiting to be executed.

Time 1: The A instruction is fetched from memory.

Time 2: The A instruction is decoded; the B instruction is fetched from memory.

Time 3: The A instruction is executed (the actual operation is performed); the B instruction is
decoded; the C instruction is fetched.

Time 4: The A instruction's results are written back to the register file or memory; the B
instruction is executed; the C instruction is decoded; the D instruction is fetched.

Time 5: The A instruction is completed; the B instruction is written back; the C instruction is
executed; the D instruction is decoded.

Time 6: The B instruction is completed; the C instruction is written back; the D instruction is
executed.

Time 7: The C instruction is completed; the D instruction is written back.

Time 8: The D instruction is completed.

Time 9: All instructions have been executed.

Bubble:

D
C D
B C D
A B C D

Cycle:    0 1 2 3  4  5  6 7 8 9
FETCH:    X A B B  C  D  X X X X
DECODE:   X X A OO B  C  D X X X
EXECUTE:  X X X A  OO B  C D X X
MEMORY:   X X X X  A  OO B C D X

COMPLETED INSTRUCTIONS:

The bubble in cycle 3 delays execution.

Bubble (computing):

When a "hiccup" in execution occurs, a "bubble" is created in the pipeline in which nothing
useful happens. In cycle 2, the fetching of the 'B' instruction is delayed, and the decoding stage
in cycle 3 now contains a bubble. Everything "behind" the 'B' instruction is delayed as well, but
everything "ahead" of the 'B' instruction continues with execution.

Clearly, when compared to the execution above, the bubble yields a total execution time of 8
clock ticks instead of 7.

Bubbles are like stalls, in which nothing useful happens in fetch, decode, execute and
write-back. A bubble can be implemented with a NOP (no-operation) code.
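A bubble of this kind can be sketched in Verilog as a stage register that is loaded with a NOP whenever the hazard logic asserts a stall. The signal names and the NOP encoding below are assumptions for illustration, not taken from this thesis's design:

module bubble_insert (
  input             clk,
  input             stall,      // asserted by hazard-detection logic
  input      [15:0] if_id_ir,   // instruction leaving the fetch stage
  output reg [15:0] id_ex_ir    // instruction entering the next stage
);
  localparam [15:0] NOP = 16'h0000;  // assumed no-operation encoding

  always @(posedge clk) begin
    if (stall)
      id_ex_ir <= NOP;       // inject a bubble: the stage does nothing useful
    else
      id_ex_ir <= if_id_ir;  // normal advance of the fetched instruction
  end
endmodule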


Example 2:

Pipelined Execution of Six Instructions:

[Figure: six instructions flowing through the F, D, E and M stages, with stage times
F = 2, D = 3, E = 4 and M = 3 units. The shaded region is the time an instruction
spends waiting for a processing unit (F, D, E or M).]

Total Time Taken for Six Instructions = 2 + 3 + 4 + 3 + (6-1)*4 = 32 .......(b)
(can be derived easily from the above figure)

If there are N instructions, then the total time required
= total time required for a single instruction + (N-1) * slowest stage time.

Similarly, in the 8086, with F+E (2+3 = 5 units) & D+M (4+3 = 7 units) as the two stages:

Total time = 12 + (6-1)*7 = 47 .......(c)
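Stated generally (the symbols t_i for the individual stage times are introduced here for clarity):

\[
T_{\text{total}} = \sum_{i=1}^{k} t_i + (N-1)\cdot\max_i t_i,
\qquad\text{e.g. } (2+3+4+3) + (6-1)\cdot 4 = 12 + 20 = 32.
\]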

Throughput:

From (1):

Sequential Processing = 72/6 = 12 units .......(from (a) & (1))

8086 (2-stage) = 47/6 ≈ 8 units .......(from (c) & (1))

4-Stage Pipelined Processing = 32/6 ≈ 6 units .......(from (b) & (1))

From the above we can conclude that with the use of the 4-stage pipelined architecture we
reduce the average time per instruction, so that the number of instructions processed by the
processor in a given time will increase.

Example 3:

A typical instruction to add two numbers might be ADD A, B, C, which adds the values
found in memory locations A and B, and then puts the result in memory location C. In a
pipelined processor the pipeline controller would break this into a series of tasks similar to:

LOAD R1, A
LOAD R2, B
ADD R3, R1, R2
STORE C, R3
LOAD next instruction


The locations 'R1', 'R2' and 'R3' are registers in the CPU. The values stored in memory

locations labeled 'A' and 'B' are loaded (copied) into the R1 and R2 registers, then added,

and the result (which is in register R3) is stored in a memory location labeled 'C'.

In this example the pipeline is three stages long: load, execute, and store. Each of the steps
is called a pipeline stage.

On a non-pipelined processor, only one stage can be working at a time so the entire

instruction has to complete before the next instruction can begin. On a pipelined processor,

all of the stages can be working at once on different instructions. So when this instruction is

at the execute stage, a second instruction will be at the decode stage and a 3rd instruction

will be at the fetch stage.

Pipelining doesn't reduce the time it takes to complete an instruction; it increases the

number of instructions that can be processed at once and reduces the delay between

completed instructions. The more pipeline stages a processor has, the more instructions it

can be working on at once and the less of a delay there is between completed instructions.

Every microprocessor manufactured today uses at least 2 stages of pipeline. (The Atmel

AVR and the PIC microcontroller each have a 2 stage pipeline.) Intel Pentium 4 processors

have 20 stage pipelines.

Example 4

To better visualize the concept, we can look at a theoretical 3-stage pipeline:

Stage Description

Load Read instruction from memory

Execute Execute instruction

Store Store result in memory and/or registers

and a pseudo-code assembly listing to be executed:

LOAD A, #40 ; load 40 in A 

MOVE B, A ; copy A in B 

ADD B, #20 ; add 20 to B 

STORE 0x300, B ; store B into memory cell 0x300 


This is how it would be executed:

Clock 1:
  Load: LOAD | Execute: - | Store: -

The LOAD instruction is fetched from memory.

Clock 2:
  Load: MOVE | Execute: LOAD | Store: -

The LOAD instruction is executed, while the MOVE instruction is fetched from memory.

Clock 3:
  Load: ADD | Execute: MOVE | Store: LOAD

The LOAD instruction is in the Store stage, where its result (the number 40) will be stored
in the register A. In the meantime, the MOVE instruction is being executed. Since it must
move the contents of A into B, it must wait for the ending of the LOAD instruction.

Clock 4:
  Load: STORE | Execute: ADD | Store: MOVE

The STORE instruction is loaded, while the MOVE instruction is finishing off and the ADD
is calculating. And so on. Note that, sometimes, an instruction will depend on the result of
another one (like our MOVE example). When more than one instruction references a
particular location for an operand, either reading it (as an input) or writing it (as an output),
executing those instructions in an order different from the original program order can lead
to hazards (mentioned above). There are several established techniques for either preventing
hazards from occurring, or working around them if they do.

Complications

Many designs include pipelines as long as 7, 10 and even 20 stages (as in

the Intel Pentium 4). The later "Prescott" and "Cedar Mill" Pentium 4 cores (and

their Pentium D derivatives) had a 31-stage pipeline, the longest in mainstream consumer

computing. The Xelerator X10q has a pipeline more than a thousand stages long. The

downside of a long pipeline is that when a program branches, the processor cannot know

where to fetch the next instruction from and must wait until the branch instruction finishes,

leaving the pipeline behind it empty. In the extreme case, the performance of a pipelined

processor could theoretically approach that of an un-pipelined processor, or even slightly

worse if all but one pipeline stages are idle and a small overhead is present between

stages. Branch prediction attempts to alleviate this problem by guessing whether the branch

will be taken or not and speculatively executing the code path that it predicts will be taken.

When its predictions are correct, branch prediction avoids the penalty associated with

branching. However, branch prediction itself can end up exacerbating the problem if 

branches are predicted poorly, as the incorrect code path which has begun execution must

be flushed from the pipeline before resuming execution at the correct location.

In certain applications, such as supercomputing, programs are specially written to branch

rarely and so very long pipelines can speed up computation by reducing cycle time. If 

branching happens constantly, re-ordering branches such that the more likely to be needed

instructions are placed into the pipeline can significantly reduce the speed losses associatedwith having to flush failed branches.

Self-Modifying Programs: Because of the instruction pipeline, code that the processor

loads will not immediately execute. Due to this, updates in the code very near the current

location of execution may not take effect because they are already loaded into the Prefetch


Input Queue. Instruction caches make this phenomenon even worse. This is only relevant

to self-modifying programs. 

Mathematical pipelines: Mathematical or arithmetic pipelines are different from

instructional pipelines, in that when mathematically processing large arrays or vectors, a

particular mathematical process, such as a multiply is repeated many thousands of times. Inthis environment, an instruction need only kick off an event whereby the arithmetic logic

unit (which is pipelined) takes over, and begins its series of calculations. Most of these

circuits can be found today in math processors and math processing sections of CPUs like

the Intel Pentium line.

History

Math processing (super-computing) began in earnest in the late 1970s with Vector Processors
and Array Processors: usually very large, bulky super-computing machines that needed
special environments and super-cooling of the cores. One of the early supercomputers was
the Cyber series built by Control Data Corporation. Its main architect was Seymour Cray,
who later resigned from CDC to head up Cray Research. Cray developed the XMP line of

super computers, using pipelining for both multiply and add/subtract functions. Later, Star

Technologies took pipelining to another level by adding parallelism (several pipelined

functions working in parallel), developed by their engineer, Roger Chen. In 1984, Star

Technologies made another breakthrough with the pipelined divide circuit, developed by

James Bradley. By the mid 1980s, super-computing had taken off with offerings from many

different companies around the world.

Today, most of these circuits can be found embedded inside most micro-processors.


CHAPTER 3

ARCHITECTURE:

[Figure: Block diagram of the 4-stage pipelined processor. FETCH(1): PC with a +1
incrementer, PROGRAM MEMORY, and INST REG(2). DECODE(2): DECODER driving the MICRO
PROGRAM MEMORY, and the REGISTER ARRAY addressed by the instruction fields
(IR) 0-3 = R1, (IR) 4-7 = R2, (IR) 8-11 = R3, feeding the registers OA, OB, RSE(3) and
MPAE(3). EXECUTE(3): ALU with the ACCUMULATOR (AR), STR(4), RSM(4) and MPAM(4).
MEMORY(4): DATA MEMORY, LDR, and a MUX selecting between AR and LDR. CONTROL
signals: SAF, S, RD, WR, LRG.]


INSTRUCTION FORMAT

OPCODE R3 R2 R1

AND SOURCE SOURCE DESTINATION

OR SOURCE SOURCE DESTINATION

ADD SOURCE SOURCE DESTINATION

SUB SOURCE SOURCE DESTINATION

MOVE XXXXXXX SOURCE DESTINATION

LOAD XXXXXXX SOURCE DESTINATION

STORE DESTINATION SOURCE XXXXXXX

NOT XXXXXXX SOURCE & DESTINATION XXXXXXX

EXAMPLE PROGRAM

NUMBER  INSTRUCTION     OPERATION            BINARY CODE

I1      ADD R5 R4 R1    [R1]<-[R5]+[R4]      16'H 0541
I2      SUB R6 R4 R7    [R7]<-[R4]-[R6]      16'H 1647
I3      MOVE R4 R3      [R3]<-[R4]           16'H 4043
I4      OR R3 R7 R0     [R0]<-[R3]||[R7]     16'H 3370
I5      LOAD R0 R3      [R3]<-[[R0]]         16'H 5503
I6      AND R7 R0 R2    [R2]<-[R7]&&[R0]     16'H 2702
I7      STORE R1 R6     [[R6]]<-[R1]         16'H 6160

MICRO PROGRAM MEMORY CONTENT

MNEMONIC  SAF(4-bit)  S(1-bit)  RGW(1-bit)  MW(1-bit)  MR(1-bit)  CODE
ADD       4'H 0       1         1           0          0          8'H 0C
SUB       4'H 1       1         1           0          0          8'H 1C
AND       4'H 2       1         1           0          0          8'H 2C
OR        4'H 3       1         1           0          0          8'H 3C
MOVE      4'H 4       1         1           0          0          8'H 4C
LOAD      4'H 5       0         1           0          1          8'H 55
STORE     4'H 6       1         0           1          0          8'H 6A
NOT       4'H 7       1         1           0          0          8'H 7C
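The CODE column is consistent with simply concatenating the five fields in order, e.g. ADD: {4'h0, 1, 1, 0, 0} = 8'h0C and LOAD: {4'h5, 0, 1, 0, 1} = 8'h55. A small Verilog sketch of unpacking such a word (the module wrapper is ours, for illustration; the field order is inferred from the table):

module ucode_fields (
  input  [7:0] code,  // word read from the micro program memory
  output [3:0] SAF,   // select ALU function
  output       S,     // mux select line
  output       RGW,   // register write
  output       MW,    // memory write
  output       MR     // memory read
);
  assign {SAF, S, RGW, MW, MR} = code;
endmodule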


SIGNAL  FULL FORM
SAF     SELECT ALU FUNCTION
S       MUX SELECT LINE
RGW     REGISTER WRITE
MW      MEMORY WRITE
MR      MEMORY READ

Stage operations and the pipeline-register outputs they produce:

FETCH:   [IR] <- [[PC]]; [PC] <- [PC]+1.
         Outputs: [IR], [PC].

DECODE:  DECODER = [IR]15-12; [MPAE] <- DECODER; [OB] <- [[IR]11-8];
         [OA] <- [[IR]7-4]; [RSE] <- [IR]3-0.
         Outputs: [MPAE], [OB], [OA], [RSE].

EXECUTE: [STR] <- [OA]; [AR] <- ALU_OUT; [RSM] <- [RSE]; [MPAM] <- [MPAE].
         Outputs: [STR], [AR], [RSM], [MPAM].

I1: ADD R5 R4 R1 (NOTE: initially [R5]=4, [R4]=7)

clk1, I1-FETCH:   [IR] <- [[0000]]; [PC] <- 0000+1  =>  [IR]=16'H 0541, [PC]=0001
clk2, I1-DECODE:  [MPAE] <- DECODER; [OB] <- [R5]; [OA] <- [R4]; [RSE] <- 1
                  =>  [OB]=4, [OA]=7, [RSE]=1, [MPAE]=8'H 0C
clk3, I1-EXECUTE: [STR] <- 7; [AR] <- 11; [RSM] <- 1; [MPAM] <- 8'H 0C
                  =>  [STR]=7, [AR]=11, [RSM]=1, [MPAM]=8'H 0C
clk4, I1-MEMORY:  [R1] = 11

I2: SUB R6 R4 R7 (NOTE: initially [R6]=4)

clk2, I2-FETCH:   [IR] <- [[0001]]; [PC] <- 0001+1  =>  [IR]=16'H 1647, [PC]=0002
clk3, I2-DECODE:  [MPAE] <- DECODER; [OB] <- [R6]; [OA] <- [R4]; [RSE] <- 7
                  =>  [OB]=4, [OA]=7, [RSE]=7, [MPAE]=8'H 1C
clk4, I2-EXECUTE: [STR] <- 7; [AR] <- 3; [RSM] <- 7; [MPAM] <- 8'H 1C
                  =>  [STR]=7, [AR]=3, [RSM]=7, [MPAM]=8'H 1C
clk5, I2-MEMORY:  [R7] = 3

I3: MOVE R4 R3

clk3, I3-FETCH:   [IR] <- [[0002]]; [PC] <- 0002+1  =>  [IR]=16'H 4043, [PC]=0003
clk4, I3-DECODE:  [MPAE] <- DECODER; [OB] <- [R0]; [OA] <- [R4]; [RSE] <- 3
                  =>  [OB]=X, [OA]=7, [RSE]=3, [MPAE]=8'H 4C
clk5, I3-EXECUTE: [STR] <- 7; [AR] <- 7; [RSM] <- 3; [MPAM] <- 8'H 4C
                  =>  [STR]=7, [AR]=7, [RSM]=3, [MPAM]=8'H 4C
clk6, I3-MEMORY:  [R3] = 7


(2). DECODE:

R2 = [IR](7-4);  [OA] <-- [R2];
R3 = [IR](11-8); [OB] <-- [R3];
[RSE] = [IR](3-0);

CONTROL SIGNAL: SAF;

I4: OR R3 R7 R0

clk4, I4-FETCH:   [IR] <- [[0003]]; [PC] <- 0003+1  =>  [IR]=16'H 3370, [PC]=0004
clk5, I4-DECODE:  [MPAE] <- DECODER; [OB] <- [R3]; [OA] <- [R7]; [RSE] <- 0
                  =>  [OB]=7, [OA]=3, [RSE]=0, [MPAE]=8'H 3C
clk6, I4-EXECUTE: [STR] <- 3; [AR] <- 7; [RSM] <- 0; [MPAM] <- 8'H 3C
                  =>  [STR]=3, [AR]=7, [RSM]=0, [MPAM]=8'H 3C
clk7, I4-MEMORY:  [R0] = 7

I5: LOAD R0 R3

clk5, I5-FETCH:   [IR] <- [[0004]]; [PC] <- 0004+1  =>  [IR]=16'H 5503, [PC]=0005
clk6, I5-DECODE:  [MPAE] <- DECODER; [OB] <- [R5]; [OA] <- [R0]; [RSE] <- 3
                  =>  [OB]=4, [OA]=7, [RSE]=3, [MPAE]=8'H 55
clk7, I5-EXECUTE: [STR] <- 7; [AR] <- 7; [RSM] <- 3; [MPAM] <- 8'H 55
                  =>  [STR]=7, [AR]=7, [RSM]=3, [MPAM]=8'H 55
clk8, I5-MEMORY:  [R3] = MEM[7] (contents at memory location 7)

I6: AND R7 R0 R2

clk6, I6-FETCH:   [IR] <- [[0005]]; [PC] <- 0005+1  =>  [IR]=16'H 2702, [PC]=0006
clk7, I6-DECODE:  [MPAE] <- DECODER; [OB] <- [R7]; [OA] <- [R0]; [RSE] <- 2
                  =>  [OB]=3, [OA]=7, [RSE]=2, [MPAE]=8'H 2C
clk8, I6-EXECUTE: [STR] <- 7; [AR] <- MEM[7] && R3; [RSM] <- 2; [MPAM] <- 8'H 2C
                  =>  [STR]=7, [AR]=MEM[7] && 3, [RSM]=2, [MPAM]=8'H 2C
clk9, I6-MEMORY:  [R2] = MEM[7] && R3

I7: STORE R1 R6

clk7, I7-FETCH:   [IR] <- [[0006]]; [PC] <- 0006+1  =>  [IR]=16'H 6160, [PC]=0007
clk8, I7-DECODE:  [MPAE] <- DECODER; [OB] <- [R1]; [OA] <- [R6]; [RSE] <- 0
                  =>  [OB]=3, [OA]=4, [RSE]=0, [MPAE]=8'H 6A
clk9, I7-EXECUTE: [STR] <- 4; [AR] <- 3; [RSM] <- 0; [MPAM] <- 8'H 6A
                  =>  [STR]=4, [AR]=3, [RSM]=0, [MPAM]=8'H 6A
clk10, I7-MEMORY: MEM[4] = 4


(3). EXECUTE:

[STR] <-- [OA];
[AR] <-- ALU_OUT;
[RSM] <-- [RSE];

CONTROL SIGNALS: S4, RD, WR, LRG;

(4). MEMORY:

R1 = [RSM];
DATA.MEMORY ADRS <-- [AR];
[DATA.MEMORY ADRS] <-- [STR];   // FOR STORE INSTRUCTION ONLY
[R1] <-- [DATA.MEMORY ADRS];    // FOR LOAD INSTRUCTION ONLY
[R1] <-- [AR];                  // FOR ARITHMETIC AND LOGIC INSTRUCTIONS ONLY

SAF --- SELECT ALU FUNCTION
S4 ---- MUX SELECT [1 - AR; 0 - LDR]
RGW --- REGISTER WRITE
MW ---- MEMORY WRITE
MR ---- MEMORY READ
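The S4 mux selection described above can be sketched in one line of Verilog (the module wrapper is illustrative; only the select convention 1 = AR, 0 = LDR comes from the signal table):

module wb_mux (
  input         S4,
  input  [15:0] AR,              // ALU result register
  input  [15:0] LDR,             // data word loaded from memory
  output [15:0] write_back_data  // value written into register [RSM]
);
  assign write_back_data = S4 ? AR : LDR;  // 1 - AR; 0 - LDR
endmodule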


CHAPTER 4

VERILOG CODE:

module data_memory();
//parameter dataaddress=16;
//parameter datasize=256;
parameter data_address=65536;
parameter word_size=16;
//integer i;
//parameter data_address=16;
reg [word_size-1:0] datamemory[0:data_address-1]; // memory with 16-bit word size and 65536 memory locations
initial
begin
$readmemb("init.data",datamemory);
/*for(i=0;i<12;i=i+1)
$display("datamemory [%d]=%b",i,datamemory[i]);*/
end
endmodule
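For reference, $readmemb loads one binary word per line into datamemory; a tiny illustrative fragment of what init.data could look like (these values are assumptions, the actual file is not reproduced in the thesis):

0000000000000101
0000000000000111
0000000000000000
0000000000000001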

module program_memory(memory_out,address,data_in_memory,write_memory,clk,rst);
parameter wordsize=16;
parameter memorysize=256;
parameter addrsize=8;
output[wordsize -1 :0] memory_out;
input [addrsize -1 :0] address;
input [wordsize -1 :0] data_in_memory;
//input read_memory;
input write_memory; //not necessary
input clk;
input rst;
reg [wordsize-1:0] memory[0:memorysize-1]; // program store, one instruction per word
// body reconstructed to complete the truncated module:
// asynchronous read, synchronous write (mirrors register_array below)
assign memory_out = memory[address];
always @(posedge clk) begin
if(write_memory) begin memory[address]<=data_in_memory; end
end
endmodule

module ir(ir_out,data_in_ir,clk,rst);
parameter wordsize=16;
//parameter memorysize=256;
//parameter addrsize=8;
//input load_ir;
input [wordsize-1:0] data_in_ir; // INSTRUCTION FROM PROGRAM MEMORY
input clk;
input rst;
output [wordsize-1:0] ir_out;
reg [wordsize-1:0] ir_out; // width added: a bare "reg ir_out" would truncate to 1 bit
always @(posedge clk) begin
if(rst) begin ir_out<=0; end
else begin ir_out<=data_in_ir; end
end
endmodule

module micro_memory(memory_out,address,data_in_memory,write_memory,clk,rst);
parameter uwordsize=8;
parameter umemorysize=16;
parameter uaddrsize=4;
output[uwordsize -1 :0] memory_out;
input [uaddrsize -1 :0] address;
input [uwordsize -1 :0] data_in_memory;
//input read_memory;
input write_memory;
input clk;
input rst;
reg [uwordsize-1:0] memory[0:umemorysize-1]; // 16 micro-code words of 8 bits each
// body reconstructed to complete the truncated module:
assign memory_out = memory[address];
always @(posedge clk) begin
if(write_memory) begin memory[address]<=data_in_memory; end
end
endmodule

module register8(register_out,register_in,clk,rst);

parameter r8wordsize=8;

output [r8wordsize -1 :0] register_out;

input [r8wordsize -1 : 0] register_in;

input clk;

input rst;

reg [r8wordsize -1:0] register_out;

initial begin register_out=0; end

always @(posedge clk)

begin

if(rst) begin register_out<=0;end

else if(clk) begin register_out<=register_in; end

end

endmodule


module register_array(register_out1,register_out2,address1,address2,address3,data_in_register,write_register,clk,rst);

parameter regwordsize   = 16;
parameter regmemorysize = 16;
parameter regaddrsize   = 4;

output [regwordsize-1:0] register_out1;
output [regwordsize-1:0] register_out2;
input  [regaddrsize-1:0] address1;
input  [regaddrsize-1:0] address2;
input  [regaddrsize-1:0] address3;
input  [regwordsize-1:0] data_in_register;
input  write_register;
input  clk;
input  rst;

reg [regwordsize-1:0] memory [regmemorysize-1:0];

initial begin
  memory[4'h0] = 16'h0001;
  memory[4'h1] = 16'h0002;
  memory[4'h2] = 16'h0003;
  memory[4'h3] = 16'h0013;
  memory[4'h4] = 16'h0023;
  memory[4'h5] = 16'h0001;
  memory[4'h6] = 16'h0002;
  memory[4'h7] = 16'h0003;
  memory[4'h8] = 16'h0013;
  memory[4'h9] = 16'h0023;
  memory[4'ha] = 16'h0001;
  memory[4'hb] = 16'h0002;
  memory[4'hc] = 16'h0003;
  memory[4'hd] = 16'h0013;
  memory[4'he] = 16'h0023;
  memory[4'hf] = 16'h0001;
end

// asynchronous read operation for data output
assign register_out1 = memory[address1];
assign register_out2 = memory[address2];

// data write operation
always @(posedge clk) begin
  if (write_register) begin memory[address3] <= data_in_register; end
end

endmodule
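One behavioral detail worth noting: reads are combinational while writes are clocked, so reading the register being written in the same cycle returns the old value. This is exactly the situation the write-back path creates, and it is one reason pipelines need forwarding or stalling (Chapter 1). A small testbench sketch that demonstrates it (all testbench names are illustrative):

module register_array_tb();
reg clk = 0;
reg rst = 0;
reg write_register = 1;
reg  [3:0]  address1 = 4'h0, address2 = 4'h1, address3 = 4'h0;
reg  [15:0] data_in_register = 16'hbeef;
wire [15:0] register_out1, register_out2;
register_array dut(register_out1, register_out2, address1, address2,
                   address3, data_in_register, write_register, clk, rst);
always #5 clk = ~clk;
initial begin
  // just before the clock edge the old value (16'h0001) is still read;
  // just after it, the written value appears on the combinational port
  #4 $display("before edge: R0 = %h", register_out1); // 0001
  #2 $display("after  edge: R0 = %h", register_out1); // beef
  $finish;
end
endmodule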

//oa,ob

module register16(register_out,register_in,clk,rst);

parameter r16wordsize = 16;

output [r16wordsize-1:0] register_out;
input  [r16wordsize-1:0] register_in;
input clk;
input rst;

reg [r16wordsize-1:0] register_out;

initial begin register_out = 0; end

always @(posedge clk)
begin
  if (rst) begin register_out <= 0; end
  else     begin register_out <= register_in; end
end

endmodule

 //rse

module register4(register_out,register_in,clk,rst);

parameter r4wordsize = 4;

output [r4wordsize-1:0] register_out;
input  [r4wordsize-1:0] register_in;
input clk;
input rst;

reg [r4wordsize-1:0] register_out;

initial begin register_out = 0; end

always @(posedge clk)
begin
  if (rst) begin register_out <= 0; end
  else     begin register_out <= register_in; end
end

endmodule
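Since register4, register8 and register16 are identical except for their width, a single parameterized module would cover all three. A sketch of that design choice (the module name register_n is an assumption, not part of the original design):

module register_n(register_out,register_in,clk,rst);
parameter WIDTH = 16; // override per instance, e.g. register_n #(.WIDTH(4)) rsm(...);
output [WIDTH-1:0] register_out;
input  [WIDTH-1:0] register_in;
input clk;
input rst;
reg [WIDTH-1:0] register_out;
initial begin register_out = 0; end
always @(posedge clk)
begin
  if (rst) begin register_out <= 0; end
  else     begin register_out <= register_in; end
end
endmodule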

module alu(alu_out,OB,OA,SAF);

parameter wordsize = 16;
parameter N = 4;

output [wordsize-1:0] alu_out;
input  [wordsize-1:0] OB;
input  [wordsize-1:0] OA;
input  [N-1:0] SAF;

reg [wordsize-1:0] alu_out;

always @(SAF or OA or OB) begin
  case (SAF)
    4'd0 : alu_out = OA + OB; // ADDITION
    4'd1 : alu_out = OA - OB; // SUBTRACTION
    4'd2 : alu_out = OA & OB; // AND OF OA AND OB
    4'd3 : alu_out = ~OA;     // NOT OF OA
    4'd4 : alu_out = OA;      // MOVE INSTRUCTION
    4'd5 : alu_out = OA;      // LOAD INSTRUCTION
    4'd6 : alu_out = OB;      // STORE INSTRUCTION
    default : alu_out = {wordsize{1'b0}}; // invalid ALU control signal;
                                          // keeps the block combinational
  endcase
end

endmodule
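Since the simulation results later in this chapter include an ALU waveform, a stimulus sketch along these lines could reproduce it (the operand values are arbitrary):

module alu_tb();
reg  [15:0] OA, OB;
reg  [3:0]  SAF;
wire [15:0] alu_out;
integer i;
alu dut(.alu_out(alu_out), .OB(OB), .OA(OA), .SAF(SAF));
initial begin
  OA = 16'h0005; OB = 16'h0003;
  // step through the seven defined function codes
  for (i = 0; i <= 6; i = i + 1) begin
    SAF = i;
    #10 $display("SAF=%0d  OA=%h OB=%h  alu_out=%h", SAF, OA, OB, alu_out);
  end
end
endmodule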

module program_memory1(memory_out,address,data_in_memory,write_memory,clk,rst);

parameter dwordsize=16;
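The rest of this module falls on pages missing from the source. Judging from how program_memory1 is instantiated as the data memory below (asynchronous read into ldr, write gated by mpam_out[1]), the lost body might look like this sketch; the dmemorysize and daddrsize names and values are assumptions:

parameter dmemorysize = 65536;
parameter daddrsize   = 16;

output [dwordsize-1:0] memory_out;
input  [daddrsize-1:0] address;
input  [dwordsize-1:0] data_in_memory;
input  write_memory;
input  clk;
input  rst;

reg [dwordsize-1:0] memory [0:dmemorysize-1];

// asynchronous read for the load path
assign memory_out = memory[address];

// synchronous write for the store path
always @(posedge clk) begin
  if (write_memory) memory[address] <= data_in_memory;
end

endmodule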


// (a variant of the top module built around a decode1 submodule; its module
// header, the wordsize/addrsize parameters, and the decode1 module itself
// fall on pages missing from the source)
parameter uwordsize   = 8;
parameter r8wordsize  = 8;
parameter r16wordsize = 16;
parameter regwordsize = 16;
parameter r4wordsize  = 4;

input clk;
input rst;

output [wordsize-1:0] mem_out;
output [addrsize-1:0] pc_out;
output [wordsize-1:0] ir_out;

//decode-phase outputs
output [uwordsize-1:0]   umemory_out;
output [r8wordsize-1:0]  mpae_out;
output [regwordsize-1:0] register_out1;
output [regwordsize-1:0] register_out2;
output [r16wordsize-1:0] oa_out;
output [r16wordsize-1:0] ob_out;
output [r4wordsize-1:0]  rse;

//execute-phase outputs
output [r16wordsize-1:0] alu_out;
output [r16wordsize-1:0] ar_out;
output [r16wordsize-1:0] str_out;
output [r4wordsize-1:0]  rsm_out;
output [r8wordsize-1:0]  mpam_out;

//module decode1(rse,oa_out,ob_out,register_out1,register_out2,mpae_out,umemory_out,ir_out,mem_out,pc_out,clk,rst);
decode1 decode(rse,oa_out,ob_out,register_out1,register_out2,mpae_out,umemory_out,ir_out,mem_out,pc_out,clk,rst);

//module alu(alu_out,OB,OA,SAF);
alu alu1(.alu_out(alu_out),.OB(ob_out),.OA(oa_out),.SAF(mpae_out[7:4]));

//module register16(register_out,register_in,clk,rst);
register16 ar(.register_out(ar_out),.register_in(alu_out),.clk(clk),.rst(rst));
register16 str(.register_out(str_out),.register_in(oa_out),.clk(clk),.rst(rst));
register4  rsm(.register_out(rsm_out),.register_in(rse),.clk(clk),.rst(rst));

// NOTE: the header of this top-level module and its opening declarations are
// missing in the source; the lines down to ir_out are reconstructed, and the
// module name pipeline_top and the port order are assumptions
module pipeline_top(mem_out,pc_out,ir_out,umemory_out,mpae_out,register_out1,register_out2,oa_out,ob_out,rse,alu_out,ar_out,str_out,rsm_out,mpam_out,ldr,register_wire,clk,rst);

parameter wordsize    = 16;
parameter addrsize    = 8;
parameter dwordsize   = 16;
parameter uwordsize   = 8;
parameter r8wordsize  = 8;
parameter r16wordsize = 16;
parameter regwordsize = 16;
parameter r4wordsize  = 4;

input clk;
input rst;

output [wordsize-1:0] mem_out;
output [addrsize-1:0] pc_out;
output [wordsize-1:0] ir_out;

//decode-phase outputs
output [uwordsize-1:0]   umemory_out;
output [r8wordsize-1:0]  mpae_out;
output [regwordsize-1:0] register_out1;
output [regwordsize-1:0] register_out2;
output [r16wordsize-1:0] oa_out;
output [r16wordsize-1:0] ob_out;
output [r4wordsize-1:0]  rse;

//execute-phase outputs
output [r16wordsize-1:0] alu_out;
output [r16wordsize-1:0] ar_out;
output [r16wordsize-1:0] str_out;
output [r4wordsize-1:0]  rsm_out;
output [r8wordsize-1:0]  mpam_out;

//memory-phase outputs
output [dwordsize-1:0] ldr;
output [dwordsize-1:0] register_wire;

//FETCH PHASE

// [pc]<----[pc]+1
pc pc1(.pc_out(pc_out),.clk(clk),.rst(rst)); //pc is incremented at the positive edge of clock

// [mem_out]<----[[pc]]
program_memory pm(.memory_out(mem_out),.address(pc_out),.clk(clk),.rst(rst)); //asynchronous memory read

// [ir_out]<----[[pc]]
ir ir1(.ir_out(ir_out),.data_in_ir(mem_out),.clk(clk),.rst(rst));

//DECODE PHASE

// [umemory_out]<----[[ir_out[15:12]]]
micro_memory umemory(.memory_out(umemory_out),.address(ir_out[15:12]),.clk(clk),.rst(rst)); //decode of opcode

// [mpae_out]<----[umemory_out]
register8 mpae(.register_out(mpae_out),.register_in(umemory_out),.clk(clk),.rst(rst));

// register_out1<----[ir_out[7:4]]
// register_out2<----[ir_out[11:8]]
// [[rsm_out]]<----register_wire

register_array register(.register_out1(register_out1),.register_out2(register_out2),.address3(rsm_out),.data_in_register(register_wire),.write_register(mpam_out[2:2]),.address1(ir_out[7:4]),.address2(ir_out[11:8]),.clk(clk),.rst(rst));

// [oa]<----register_out1
register16 oa(.register_out(oa_out),.register_in(register_out1),.clk(clk),.rst(rst));

// [ob]<----register_out2
register16 ob(.register_out(ob_out),.register_in(register_out2),.clk(clk),.rst(rst));

// [rse]<----[ir_out[3:0]]
register4 rse1(.register_out(rse),.register_in(ir_out[3:0]),.clk(clk),.rst(rst));

//EXECUTE PHASE

// [alu_out]<----[oa] SAF [ob]
alu alu1(.alu_out(alu_out),.OB(ob_out),.OA(oa_out),.SAF(mpae_out[7:4]));

// [ar_out]<----[alu_out]
register16 ar(.register_out(ar_out),.register_in(alu_out),.clk(clk),.rst(rst));

// [str_out]<----[oa]
register16 str(.register_out(str_out),.register_in(oa_out),.clk(clk),.rst(rst));

// [rsm_out]<----[rse]
register4 rsm(.register_out(rsm_out),.register_in(rse),.clk(clk),.rst(rst));

// [mpam_out]<----[mpae_out]
register8 mpam(.register_out(mpam_out),.register_in(mpae_out),.clk(clk),.rst(rst));

//MEMORY PHASE

// DATA MEMORY
// [[ar_out]]<----[str_out] .....IF STORE INSTRUCTION
// ldr<----[[ar_out]] ...........IF LOAD INSTRUCTION
program_memory1 data_memory(.memory_out(ldr),.address(ar_out),.data_in_memory(str_out),.write_memory(mpam_out[1:1]),.clk(clk),.rst(rst));

// MUX
// register_wire<----[ar_out]......IF 1 IS SELECTED
// register_wire<----ldr ..........IF 0 IS SELECTED
assign register_wire = mpam_out[3:3] ? ar_out : ldr;

endmodule
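Putting the decode-phase slices together (ir_out[15:12] opcode, ir_out[11:8] second source register, ir_out[7:4] first source register, ir_out[3:0] destination), a worked encoding example, assuming opcode 4'd0 selects ADD in the micro memory:

// ADD R3, R1, R2  ->  { opcode, address2, address1, rse }
wire [15:0] add_r3_r1_r2 = {4'd0, 4'd2, 4'd1, 4'd3}; // = 16'h0213

Two cycles after this word enters the instruction register, the sum R1 + R2 flows back through the S4 mux and is written to R3 via address3.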

SIMULATION RESULTS FOR SOME ENTITIES:

ALU

FETCH

EXECUTE

INST REGISTER

CHAPTER 5


Bibliography:

1. Robert Heath and Sreenivas Durbha, "Pipelined architecture processors from behavioural-level," IEEE, 2001. Dept. of Electrical Engineering, 453 Anderson Hall, University of Kentucky, Lexington, KY 40506.

2. Nguyen Minh Huu, Bruno Robisson and Michel Agoyan, "Low-cost fault tolerance on the ALU in simple pipelined processors," IEEE, 2010. CEA-Leti, Centre Microélectronique de Provence, 880 route de Mimet, France.

3. Reyes, J.A.P.; Alarcon, L.P.; Alarilla, L., "A study of floating-point architectures for pipelined RISC processors," 2006 IEEE International Symposium on Circuits and Systems (ISCAS 2006), 2006, 4 pp.

4. Arandilla, C.C.; Constantino, J.B.A.; Glova, A.O.M.; Ballesil-Alvarez, A.P.; Reyes, J.A.P., "High-level implementation of the 5-stage pipelined ARM9TDMI core," TENCON 2010 - 2010 IEEE Region 10 Conference, 2010.

5. Shofiqul Islam; Debanjan Chattopadhyay; Manoja Kumar Das; V. Neelima; Rahul Sarkar, "Design of High-Speed-Pipelined Execution Unit of 32-bit RISC Processor," 2006 Annual IEEE India Conference, 2006.

6. T.R. Padmanabhan and B. Bala Tripura Sundari, Design through Verilog HDL, WSE, 2009.

7. Morris Mano, Computer System Architecture, 3rd Edition, Pearson Education.

8. A.K. Ray, Advanced Microprocessors and Peripherals, Tata McGraw-Hill, 2006.

9. www.isi.edu/~youngcho/csem
