7/28/2019 Chapter 08 - Pipeline and Vector Processing
http://slidepdf.com/reader/full/chapter-08-pipeline-and-vector-processing 1/14
Chapter 8 Pipeline and Vector Processing
Parallel Processing
Parallel processing is a term used to denote a large class of techniques that provide simultaneous data-processing tasks for the purpose of increasing the computational speed of a computer system. Instead of processing each instruction sequentially as in a conventional computer, a parallel processing system can perform concurrent data processing to achieve a faster execution time.
For example, while an instruction is being executed in the ALU, the next instruction can be read from
memory. The system may have two or more ALUs and be able to execute two or more instructions at the
same time. Furthermore, the system may have two or more processors operating concurrently. The
purpose of parallel processing is to speed up the computer processing capability and increase its
throughput, that is, the amount of processing that can be accomplished during a given interval of time.
The amount of hardware increases with parallel processing, and with it the cost of the system. However, technological developments have reduced hardware costs to the point where parallel processing techniques are economically feasible.
Parallel processing can be viewed from various levels of complexity. At the lowest level, we distinguish between parallel and serial operations by the type of registers used. Shift registers operate in serial fashion, one bit at a time, while registers with parallel load operate on all the bits of the word simultaneously. Parallel processing at a higher level of complexity can be achieved by having a multiplicity of functional units that perform identical or different operations simultaneously. Parallel processing is established by distributing the data among the multiple functional units. For example, the arithmetic, logic, and shift operations can be separated into three units and the operands diverted to each unit under the supervision of a control unit.
In such an organization all units are independent of each other. A multifunctional organization is usually associated with a complex control unit that coordinates all the activities among the various components.
There are a variety of ways that parallel processing can be classified. It can be considered from the
internal organization of the processors, from the interconnection structure between processors, or from
the flow of information through the system. One classification, introduced by M. J. Flynn, considers the organization of a computer system by the number of instructions and data items that are manipulated
simultaneously. The normal operation of a computer is to fetch instructions from memory and execute
them in the processor. The sequence of instructions read from memory constitutes an instruction
stream. The operations performed on the data in the processor constitute a data stream. Parallel
processing may occur in the instruction stream, in the data stream, or in both.
Flynn’s classification divides computers into four major groups as follows:
1. Single instruction stream, single data stream (SISD)
2. Single instruction stream, multiple data stream (SIMD)
3. Multiple instruction streams, single data stream (MISD)
4. Multiple instruction streams, multiple data stream (MIMD)
Single Instruction Stream, Single Data Stream (SISD)
In computing, SISD (single instruction, single data) is a term referring to a computer architecture in which
a single processor, a uniprocessor, executes a single instruction stream, to operate on data stored in a
single memory.
SISD represents the organization of a single computer containing a control unit, a processor unit, and a memory unit. Instructions are executed sequentially, and the system may or may not have internal parallel processing capabilities. Parallel processing in this case may be achieved by means of multiple functional units or by pipeline processing. Instruction fetching and pipelined execution of instructions are common examples found in most modern SISD computers.
Single Instruction Stream, Multiple Data Stream (SIMD)
Single instruction, multiple data (SIMD), is a class of parallel computers in Flynn's taxonomy. It describes
computers with multiple processing elements that perform the same operation on multiple data points
simultaneously. Thus, such machines exploit data level parallelism. SIMD is particularly applicable to
common tasks like adjusting the contrast in a digital image or adjusting the volume of digital audio. Most
modern CPU designs include SIMD instructions in order to improve the performance of multimedia use.
SIMD represents an organization that includes many processing units under the supervision of a common
control unit. All processors receive the same instruction from the control unit but operate on different
items of data. The shared memory unit must contain multiple modules so that it can communicate with
all the processors simultaneously.
Multiple Instruction Streams, Single Data Stream (MISD)
In computing, MISD (multiple instruction, single data) is a type of parallel computing architecture where
many functional units perform different operations on the same data. Pipeline architectures belong to
this type.
The MISD structure is mainly of theoretical interest, since few practical systems have been constructed using this organization. The Space Shuttle flight control computers are one often-cited example of MISD in computing, and a systolic array is sometimes classified as a MISD structure.
Multiple Instruction Streams, Multiple Data Stream (MIMD)
In computing, MIMD (multiple instruction, multiple data) is a technique employed to achieve parallelism.
Machines using MIMD have a number of processors that function asynchronously and independently. At
any time, different processors may be executing different instructions on different pieces of data.
MIMD organization refers to a computer system capable of processing several programs at the same time. Most multiprocessor and multicomputer systems can be classified in this category. MIMD architectures may be used in a number of application areas such as computer-aided design/computer-aided manufacturing, simulation, modeling, and communication switches. MIMD machines fall into either shared-memory or distributed-memory categories, based on how the processors access memory. Shared-memory machines may be of the bus-based, extended, or hierarchical type; distributed-memory machines may have hypercube or mesh interconnection schemes.
Flynn’s classification depends on the distinction between the performance of the control unit and the
data processing unit. It emphasizes the behavioral characteristics of the computer system rather than its
operational and structural interconnections. One type of parallel processing that does not fit Flynn’s
classification is pipelining.
Here we consider parallel processing under the following main topics:
1. Pipeline processing
2. Vector processing
3. Array processors
Pipeline Processing
Pipelining is a technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments. A pipeline can be visualized as a collection of processing segments through which binary information flows. Each segment performs partial processing dictated by the way the task is partitioned. The result obtained from the computation in each segment is transferred to the next segment in the pipeline, and the final result is obtained after the data have passed through all segments. The name "pipeline" implies a flow of information analogous to an industrial assembly line. It is characteristic of pipelines that several computations can be in progress in distinct segments at the same time. The overlapping of computations is made possible by associating a register with each segment in the pipeline. The registers provide isolation between segments so that each can operate on distinct data simultaneously.
The pipeline organization will be demonstrated by means of a simple example. Suppose that we want to
perform the combined multiply and add operation with a stream of numbers.
Ai * Bi + Ci    for i = 1, 2, 3, ..., 7
Each suboperation is to be implemented in a segment within a pipeline. Each segment has one or two
registers and a combinational circuit as shown in figure below. R1 through R5 are registers that receive
new data with every clock pulse. The multiplier and adder are combinational circuits.
The suboperations performed in each segment of the pipeline are as follows:
R1 ← Ai, R2 ← Bi Input Ai and Bi
R3 ← R1 * R2, R4 ← Ci Multiply and input Ci
R5 ← R3 + R4 Add Ci to product
The five registers are loaded with new data every clock pulse. The effect of each clock is shown in Table
below.
The first clock pulse transfers A1 and B1 into R1 and R2. The second clock pulse transfers the product of R1 and R2 into R3 and C1 into R4; the same clock pulse also transfers A2 and B2 into R1 and R2. The third clock pulse operates on all three segments simultaneously: it places A3 and B3 into R1 and R2, transfers the product of R1 and R2 into R3, transfers C2 into R4, and places the sum of R3 and R4 into R5. It takes three clock pulses to fill up the pipe and retrieve the first output from R5. From then on, each clock pulse produces a new output and moves the data one step down the pipeline. This continues as long as new input data flow into the system. When no more input data are available, the clock must continue until the last output emerges from the pipeline.
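The clock-by-clock behavior described above can be sketched as a small simulation (an illustrative Python sketch, not part of the original notes; the register names follow the text, and R5 is modeled as the output captured each clock):

```python
def pipeline_multiply_add(A, B, C):
    """Simulate the 3-segment pipeline computing Ai * Bi + Ci."""
    n = len(A)
    R1 = R2 = R3 = R4 = None   # inter-segment registers
    c_index = None             # which Ci segment 2 should fetch
    out = []                   # values appearing in R5, in order
    # n inputs need n + 2 clock pulses to fill and drain the pipe.
    for clock in range(n + 2):
        # Segment 3 reads the R3/R4 values latched on the previous clock.
        if R3 is not None:
            out.append(R3 + R4)            # R5 <- R3 + R4
        # Segment 2 reads the R1/R2 values latched on the previous clock.
        if R1 is not None:
            R3, R4 = R1 * R2, C[c_index]   # R3 <- R1 * R2, R4 <- Ci
        else:
            R3 = R4 = None
        # Segment 1 latches the next pair of operands, if any remain.
        if clock < n:
            R1, R2, c_index = A[clock], B[clock], clock
        else:
            R1 = R2 = None
    return out
```

Running it on three operand triples reproduces the table's behavior: the first result appears after three clock pulses, and one new result per clock thereafter.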
Arithmetic Pipeline
Arithmetic pipelines are usually found in very high-speed computers. They are used to implement floating-point operations, multiplication of fixed-point numbers, and similar computations encountered in scientific problems. A pipeline multiplier is essentially an array multiplier, as described in the figure below, with special adders designed to minimize the carry propagation time through the partial products. We will now show an example of a pipeline unit for floating-point addition and subtraction.
The inputs to the floating-point adder pipeline are two normalized floating-point binary numbers:

X = A × 2^a
Y = B × 2^b

A and B are two fractions that represent the mantissas, and a and b are the exponents. The floating-point addition and subtraction can be performed in four segments. The registers labeled R are placed between the segments to store intermediate results.
The suboperations performed in the four segments are:
1. Compare the exponents.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result.
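The four segments can be sketched as plain Python (an illustrative sketch, not from the notes; operands are modeled as (mantissa, exponent) pairs representing m × 2^e, with a normalized mantissa satisfying 0.5 ≤ |m| < 1):

```python
def fp_add(a_mant, a_exp, b_mant, b_exp):
    """Four-segment floating-point addition sketch."""
    # Segment 1: compare the exponents; keep the operand with the
    # larger exponent in (a_mant, a_exp).
    if a_exp < b_exp:
        a_mant, a_exp, b_mant, b_exp = b_mant, b_exp, a_mant, a_exp
    # Segment 2: align the mantissas by shifting the smaller one right.
    b_mant = b_mant / 2 ** (a_exp - b_exp)
    # Segment 3: add the mantissas.
    mant = a_mant + b_mant
    exp = a_exp
    # Segment 4: normalize the result so that 0.5 <= |mantissa| < 1.
    while abs(mant) >= 1:
        mant /= 2
        exp += 1
    while mant != 0 and abs(mant) < 0.5:
        mant *= 2
        exp -= 1
    return mant, exp
```

For example, adding 0.5 × 2^3 and 0.5 × 2^2 aligns the second mantissa to 0.25 and yields 0.75 × 2^3; in a pipelined unit each of the four steps would run in its own segment on a different operand pair each clock.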
Instruction Pipeline
Pipeline processing can occur not only in the data stream but in the instruction stream as well. An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments. This causes the instruction fetch and execute phases to overlap and perform simultaneous operations. One complication associated with such a scheme is that an instruction may cause a branch out of sequence. In that case the pipeline must be emptied, and all the instructions that have been read from memory after the branch instruction must be discarded.
Consider a computer with an instruction fetch unit and an instruction execution unit designed to provide
a two-segment pipeline. The instruction fetch segment can be implemented by means of a first-in, first-
out (FIFO) buffer. This is a type of unit that forms a queue rather than a stack. Whenever the execution
unit is not using memory, the control increments the program counter and uses its address value to read
consecutive instructions from memory. The instructions are inserted into the FIFO buffer so that they
can be executed on a first-in, first-out basis. Thus an instruction stream can be placed in a queue, waiting
for decoding and processing by the execution segment. The instruction stream queuing mechanism
provides an efficient way for reducing the average access time to memory for reading instructions.
Whenever there is space in the FIFO buffer, the control unit initiates the next instruction fetch phase. The buffer acts as a queue from which control then extracts the instructions for the execution unit.
Figure: Instruction Pipeline
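The two-segment scheme can be sketched as follows (an illustrative Python model, not part of the original notes; the buffer size is an arbitrary assumption):

```python
from collections import deque

def fetch_execute(program, buffer_size=4):
    """Two-segment sketch: each clock, the fetch unit pushes the next
    instruction into the FIFO buffer (if there is room) while the
    execute unit pops and executes the oldest buffered instruction."""
    fifo = deque()
    pc = 0            # program counter driving the fetch segment
    executed = []
    while pc < len(program) or fifo:
        # Fetch segment: fill the queue whenever the buffer has room.
        if pc < len(program) and len(fifo) < buffer_size:
            fifo.append(program[pc])
            pc += 1
        # Execute segment: drain the queue first-in, first-out.
        if fifo:
            executed.append(fifo.popleft())
    return executed
```

Because the queue preserves arrival order, instructions are executed exactly in program order even though fetching runs ahead of execution.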
Computers with complex instructions require other phases in addition to fetch and execute. In that case, an instruction is processed with the following sequence of steps:
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.
RISC Pipeline
Instruction pipelining is often used to enhance performance. Let us consider this in the context of a RISC architecture, in which most of the operations are register-to-register.
Therefore, the instructions can be executed in two phases:
F: Instruction Fetch to get the instruction.
E: Instruction Execute on register operands and store the results in register.
In general, memory access in RISC is performed through LOAD and STORE operations. For such instructions the following steps may be needed:
F: Instruction Fetch to get the instruction
E: Effective address calculation for the desired memory operand
D: Memory-to-register or register-to-memory data transfer through the bus.
Let us explain pipelining in RISC with an example program execution sample. Take the following program
(R indicates register).
LOAD RA (Load from memory location A)
LOAD RB (Load from memory location B)
ADD RC, RA, RB (RC = RA + RB)
SUB RD, RA, RB (RD = RA - RB)
MUL RE, RC, RD (RE = RC × RD)
STOR RE (Store in memory location C)
Return to main.
Figure: Sequential Execution of Instructions
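The benefit of overlapping the F and E phases for the program above can be counted with a small sketch (illustrative Python, not from the notes; it assumes an ideal pipeline that issues one instruction per clock and deliberately ignores data-hazard stalls such as the ADD waiting on the LOADs):

```python
def pipeline_chart(instrs, stages=("F", "E")):
    """Ideal pipelined schedule: instruction i enters stage j at
    clock i + j + 1 (1-based), one new instruction per clock."""
    return {ins: {s: i + j + 1 for j, s in enumerate(stages)}
            for i, ins in enumerate(instrs)}

prog = ["LOAD RA", "LOAD RB", "ADD RC,RA,RB",
        "SUB RD,RA,RB", "MUL RE,RC,RD", "STOR RE"]
chart = pipeline_chart(prog)

# With k stages and n instructions, the pipeline finishes in
# n + k - 1 clocks instead of n * k clocks sequentially.
pipelined = max(max(row.values()) for row in chart.values())
sequential = len(prog) * 2
```

For the six-instruction program this gives 7 clocks pipelined versus 12 sequential; a real RISC pipeline would add stalls or delay slots around the loaded registers.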
Vector Processing
Vector processing is arithmetic or logical computation applied to whole vectors, whereas in scalar processing only one datum, or one pair of data, is processed at a time. Vector processing is therefore faster than scalar processing. When scalar code is converted to vector form, the conversion is called vectorization. A vector processor is a special coprocessor designed to handle vector computations. A vector is an ordered set of scalar data items of the same type; the scalar item can be a floating-point number, an integer, or a logical value.
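The contrast between the two styles can be made concrete (an illustrative Python sketch, not from the notes; the vector form here only models, with a comprehension, what a vector processor would do in a single instruction):

```python
def scalar_add(a, b):
    """Scalar processing: one pair of operands per operation."""
    c = []
    for i in range(len(a)):
        c.append(a[i] + b[i])     # one add per loop iteration
    return c

def vector_add(a, b):
    """Vectorized form: conceptually a single instruction C = A + B
    applied to the whole operand vectors at once."""
    return [x + y for x, y in zip(a, b)]
```

Vectorization replaces the instruction fetch/decode overhead of every loop iteration with one vector instruction whose operands stream through the arithmetic pipeline.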
Vector instructions can be classified as below:
(1) Vector-Vector Instructions
In this type, vector operands are fetched from vector registers and the result is stored in another vector register.
These instructions are denoted with the following function mappings:
F1 : V → V
F2 : V × V → V
(2) Vector-Scalar Instructions
In this type, a combination of a scalar and a vector is fetched, and the result is stored in a vector register. These instructions are denoted with the following function mapping:
F3 : S × V → V, where S is a scalar item
(3) Vector-Reduction Instructions
When an operation on a vector is reduced to a scalar result, it is a vector reduction instruction. These instructions are denoted with the following function mappings:
F4 : V → S
F5 : V × V → S
(4) Vector-Memory Instructions
When vector operations involve the memory M, these are vector-memory instructions.
These instructions are denoted with the following function mappings:
F6 : M → V (vector load)
F7 : V → M (vector store)
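The four classes can be illustrated with toy Python functions (our own sketch, not standard vector mnemonics; Python lists stand in for vector registers, a dict for memory M, and the store mapping is written V → M since a store transfers a vector register to memory):

```python
memory = {}   # toy "main memory" M, addressed by name

def f2_vector_add(v1, v2):      # F2 : V x V -> V  (e.g. vector add)
    return [x + y for x, y in zip(v1, v2)]

def f3_scale(s, v):             # F3 : S x V -> V  (scalar-vector multiply)
    return [s * x for x in v]

def f4_sum(v):                  # F4 : V -> S      (sum reduction)
    total = 0
    for x in v:
        total += x
    return total

def f6_load(addr):              # F6 : M -> V      (vector load)
    return list(memory[addr])

def f7_store(addr, v):          # F7 : V -> M      (vector store)
    memory[addr] = list(v)
```

A vector dot product, for instance, is the composition F4(F2-style elementwise multiply), i.e. an F5 : V × V → S mapping.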
Array Processing
There is another method for vector operations. If we have an array of n processing elements (PEs), i.e., multiple ALUs holding the operands of the vector, then an instruction, for example vector addition, is broadcast to all PEs so that they operate on all elements of the vector at the same time. That means all PEs perform the computation in parallel, synchronized under one control unit. This organization of a synchronous array of PEs for vector operations is called an array processor. The organization is the same as in SIMD: an array processor handles one instruction stream and multiple data streams, as we have seen in the SIMD organization. Therefore, array processors are also called SIMD array computers.
The organization of an array processor is shown in the figure below. The following components are organized in an array processor:
Figure: Organization of SIMD Array Processor
Control Unit (CU): All PEs are under the control of one control unit. The CU controls the intercommunication between the PEs. There is also a local memory of the CU, called the CU memory, into which the user programs are loaded. The vector instructions in the program are decoded by the CU and broadcast to the array of PEs. Instruction fetch and decoding are done by the CU only.
Processing Elements (PEs): Each processing element consists of an ALU, its registers, and a local memory for storage of distributed data. The PEs are interconnected via an interconnection network. All PEs receive the instructions from the control unit, and the different component operands are fetched from their local memories. Thus, all PEs perform the same function synchronously in a lock-step fashion under the control of the CU.
Not all PEs need to participate in the execution of every vector instruction, so a masking scheme is required to control the status of each PE. A masking vector controls the status of all PEs such that only enabled PEs participate in the execution while the others are disabled.
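Lock-step execution with a masking vector can be sketched as follows (an illustrative Python model, not from the notes; each list element stands for one PE's local datum):

```python
def simd_step(op, pe_data, operand, mask):
    """One lock-step SIMD instruction: every enabled PE applies the
    same operation to its local data; masked-off PEs stay idle and
    keep their old values."""
    return [op(x, operand) if enabled else x
            for x, enabled in zip(pe_data, mask)]

# Broadcast "add 10" to four PEs; PE 2 is disabled by the mask bit 0.
result = simd_step(lambda x, k: x + k, [1, 2, 3, 4], 10, [1, 1, 0, 1])
```

Here `result` is `[11, 12, 3, 14]`: the third PE, disabled by its mask bit, retains its original value while the others execute the broadcast instruction simultaneously.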
Interconnection Network (IN): The IN performs data exchange among the PEs, data routing, and manipulation functions. It operates under the control of the CU.
Host Computer: An array processor may be attached to a host computer through the control unit. The
purpose of the host computer is to broadcast a sequence of vector instructions through CU to the PEs.
Thus, the host computer is a general-purpose machine that acts as a manager of the entire system.
Array processors are special-purpose computers that have been adopted for the following:
• various scientific applications,
• matrix algebra,
• matrix eigen value calculations,
• real-time scene analysis
A large-scale SIMD array processor was developed by NASA for Earth-resources satellite image processing. This computer was named the Massively Parallel Processor (MPP) because it contains 16,384 processors that work in parallel. The MPP provides real-time, time-varying scene analysis.
However, array processors are not commercially popular and are not commonly used. The reasons are that array processors are difficult to program compared to pipelined machines, and vectorization itself is a problem.
Array processing is used in radar, sonar, seismic exploration, anti-jamming, and wireless communications.
References:
J. P. Hayes, "Computer Architecture and Organization", McGraw-Hill, 3rd Ed., 1998.
M. Morris Mano, "Computer System Architecture", Pearson, 3rd Ed., 2004.
en.wikipedia.org
Assignments:
(1) Differentiate between an instruction pipeline and an arithmetic pipeline.
(2) Identify the factors that limit the speed of pipelining.
(3) What is the difference between scalar and vector processing?
A Gentle Advice:
Please go through your textbooks and reference books for detailed study!!! Thank you all.
Notes Compiled By: Bijay Mishra
biizay.blogspot.com
9813911076 or 9841695609