8/3/2019 b56a2Presentation2
1/49
Topics Covered
An Overview of Parallel Processing
Parallelism in Uniprocessor Systems
Organization of Multiprocessor Systems
Flynn's Classification
System Topologies
MIMD System Architectures
An Overview of Parallel Processing
What is parallel processing?
Parallel processing is a method to improve computer system performance by executing two or more instructions simultaneously.
The goals of parallel processing:
One goal is to reduce the wall-clock time, the amount of real time that you need to wait for a problem to be solved.
Another goal is to solve bigger problems that might not fit in the limited memory of a single CPU.
An Analogy of Parallelism
The task of ordering a shuffled deck of cards by suit and then by rank can be done faster if the task is carried out by two or more people. By splitting up the deck and performing the instructions simultaneously, then at the end combining the partial solutions, you have performed parallel processing.
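The analogy can be put in code. This is an illustrative sketch, not from the slides: the deck is split among workers, each worker sorts its share, and the sorted partial solutions are merged at the end. Threads keep the sketch simple; true CPU parallelism would use separate processes or machines.

```python
# Illustrative sketch of the card-sorting analogy (all names our own).
from concurrent.futures import ThreadPoolExecutor
from heapq import merge
import random

def sort_share(cards):
    # One "person" orders their share (tuples sort by suit, then rank).
    return sorted(cards)

def parallel_sort(deck, workers=2):
    shares = [deck[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sort_share, shares))
    # Combine the sorted partial solutions into one ordered deck.
    return list(merge(*partials))

deck = [(suit, rank) for suit in range(4) for rank in range(13)]
random.shuffle(deck)
assert parallel_sort(deck, workers=3) == sorted(deck)
```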
Another Analogy of Parallelism
Another analogy is having several students grade quizzes simultaneously. Quizzes are distributed to a few students, and different problems are graded by each student at the same time. After they are completed, the graded quizzes are then gathered and the scores are recorded.
Parallelism in Uniprocessor Systems
It is possible to achieve parallelism with a uniprocessor system.
Some examples are the instruction pipeline, arithmetic pipeline, and I/O processor.
Note that a system that performs different operations on the same instruction is not considered parallel.
Only if the system processes two different instructions simultaneously can it be considered parallel.
Parallelism in a Uniprocessor System
A reconfigurable arithmetic pipeline is an example of parallelism in a uniprocessor system. Each stage of a reconfigurable arithmetic pipeline has a multiplexer at its input. The multiplexer may pass input data, or the data output from other stages, to the stage inputs. The control unit of the CPU sets the select signals of the multiplexer to control the flow of data, thus configuring the pipeline.
A Reconfigurable Pipeline With Data Flow for the Computation A[i] = B[i] * C[i] + D[i]
[Figure: four pipeline stages, each fed by a 4-input multiplexer with select signals S1 S0. The stages perform multiply (*), add (+), and pass-through, each followed by a latch; the final latch output goes to memory and registers. The select settings shown configure the data inputs into the multiply stage and route its product into the add stage.]
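The configured pipeline can be sketched in software. This is a hypothetical simulation (not hardware-accurate) of the two active stages (* and +) and the latch between them, showing how successive iterations overlap across clock ticks:

```python
# Illustrative software model of the pipeline computing
# A[i] = B[i] * C[i] + D[i]; the latch holds the multiply stage's
# result between ticks, so two iterations are in flight at once.
def pipeline(B, C, D):
    n = len(B)
    A = [None] * n
    latch_mul = None      # output latch of the multiply stage
    mul_idx = None        # which iteration the latch holds
    for tick in range(n + 1):          # one extra tick to drain
        # Stage 2 (+): consume the latched product and add D.
        if latch_mul is not None:
            A[mul_idx] = latch_mul + D[mul_idx]
        # Stage 1 (*): latch the next product, if input remains.
        if tick < n:
            latch_mul, mul_idx = B[tick] * C[tick], tick
        else:
            latch_mul = None
    return A

B, C, D = [1, 2, 3], [4, 5, 6], [7, 8, 9]
print(pipeline(B, C, D))   # [11, 18, 27]
```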
Although arithmetic pipelines can perform many iterations of the same operation in parallel, they cannot perform different operations simultaneously. To perform different arithmetic operations in parallel, a CPU may include a vectored arithmetic unit.
Vector Arithmetic Unit
A vector arithmetic unit contains multiple functional units that perform addition, subtraction, and other functions. The control unit routes input values to the different functional units to allow the CPU to execute multiple instructions simultaneously.
For the operations A = B + C and D = E - F, the CPU would route B and C to an adder and route E and F to a subtractor for simultaneous execution.
A Vectored Arithmetic Unit
[Figure: data input connections feed four functional units (*, +, -, %); the adder computes A = B + C while the subtractor computes D = E - F from the same set of data inputs.]
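The routing described above can be sketched with threads standing in for the functional units. This is illustrative only; in hardware the adder and subtractor would compute in the same clock cycle, and the names here are our own:

```python
# Illustrative sketch: the "control unit" submits B and C to an adder
# and E and F to a subtractor, producing both results concurrently.
from concurrent.futures import ThreadPoolExecutor
from operator import add, sub

def vector_unit(B, C, E, F):
    with ThreadPoolExecutor(max_workers=2) as units:
        fut_a = units.submit(add, B, C)   # adder:      A = B + C
        fut_d = units.submit(sub, E, F)   # subtractor: D = E - F
        return fut_a.result(), fut_d.result()

A, D = vector_unit(10, 5, 8, 3)
print(A, D)   # 15 5
```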
Organization of Multiprocessor Systems
Flynn's Classification
Proposed by researcher Michael J. Flynn in 1966.
It is the most commonly accepted taxonomy of computer organization.
In this classification, computers are classified by whether they process a single instruction at a time or multiple instructions simultaneously, and whether they operate on one or multiple data sets.
Taxonomy of Computer Architectures
The 4 categories of Flynn's classification of multiprocessor systems, by their instruction and data streams:
SISD, SIMD, MISD, MIMD
Single Instruction, Single Data (SISD)
SISD machines execute a single instruction on individual data values using a single processor.
Based on the traditional von Neumann uniprocessor architecture, instructions are executed sequentially or serially, one step after the next.
Until recently, most computers were of the SISD type.
SISD
Simple Diagrammatic Representation
[Figure: control unit C issues an instruction stream (IS) to processor P, which exchanges a data stream (DS) with memory M.]
Single Instruction, Multiple Data (SIMD)
An SIMD machine executes a single instruction on multiple data values simultaneously using many processors.
Since there is only one instruction, each processor does not have to fetch and decode each instruction. Instead, a single control unit does the fetching and decoding for all processors.
SIMD architectures include array processors.
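The single-fetch, single-decode idea can be sketched as follows; the two-entry instruction table and the lane lists are invented for illustration:

```python
# Conceptual sketch: one control unit decodes a single instruction and
# applies it across every processing element's own data value.
def simd_execute(instruction, lanes_a, lanes_b):
    # Fetch and decode happen once, for all processors...
    op = {"add": lambda x, y: x + y,
          "mul": lambda x, y: x * y}[instruction]
    # ...then every lane executes the same operation on its own data.
    return [op(a, b) for a, b in zip(lanes_a, lanes_b)]

print(simd_execute("add", [1, 2, 3, 4], [10, 20, 30, 40]))
# [11, 22, 33, 44]
```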
Multiple Instruction, Multiple Data (MIMD)
MIMD machines are usually referred to as multiprocessors or multicomputers.
They may execute multiple instructions simultaneously, in contrast to SIMD machines.
Each processor includes its own control unit, and the processors can be assigned parts of a task or separate tasks.
MIMD has two subclasses: shared memory and distributed memory.
MIMD
[Figure: two control units C each send an instruction stream (IS) to their own processor P; both processors exchange data streams (DS) with a shared memory M.]
Multiple Instruction, Single Data (MISD)
This category does not actually exist; it was included in the taxonomy for the sake of completeness.
MISD
[Figure: multiple control units C, each issuing its own instruction stream (IS) to a processor P; the processors all operate on the same data stream (DS) from memory M.]
Analogy of Flynn's Classifications
An analogy of Flynn's classification is the check-in desk at an airport:
SISD: a single desk
SIMD: many desks and a supervisor with a megaphone giving instructions that every desk obeys
MIMD: many desks working at their own pace, synchronized through a central database
System Topologies
A system may also be classified by its topology.
A topology is the pattern of connections between processors.
The cost-performance tradeoff determines which topology to use for a multiprocessor system.
Topology Classification
A topology is characterized by its diameter, total bandwidth, and bisection bandwidth.
Diameter: the maximum distance between two processors in the computer system.
Total bandwidth: the capacity of a communications link multiplied by the number of such links in the system.
Bisection bandwidth: the maximum data transfer that could occur at the bottleneck in the topology.
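As a concrete example (not from the slides), the three metrics can be computed for a six-processor ring, assuming every link has the same illustrative capacity:

```python
# Illustrative sketch: diameter, total bandwidth, and bisection
# bandwidth for a ring of n processors (numbers assumed).
from collections import deque

def ring_links(n):
    # Each processor i connects to its neighbor (i + 1) mod n.
    return [(i, (i + 1) % n) for i in range(n)]

def diameter(n, links):
    # Maximum over all pairs of the shortest hop count (BFS per node).
    adj = {i: [] for i in range(n)}
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    worst = 0
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst

n, link_capacity = 6, 100              # assume 100 MB/s per link
links = ring_links(n)
print(diameter(n, links))              # 3 hops: halfway around the ring
print(link_capacity * len(links))      # total bandwidth: 600
print(2 * link_capacity)               # bisection: a cut crosses 2 links
```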
System Topologies
Shared Bus Topology
Processors communicate with each other via a single bus that can handle only one data transmission at a time.
In most shared buses, processors directly communicate with their own local memory.
[Figure: three processors P, each with a local memory M, attached to a shared bus along with a global memory.]
System Topologies
Ring Topology
Uses direct connections between processors instead of a shared bus.
Allows multiple communication links to be active simultaneously, but data may have to travel through several processors to reach its destination.
[Figure: six processors P connected in a ring.]
System Topologies
Tree Topology
Uses direct connections between processors, each having up to three connections.
There is only one unique path between any pair of processors.
[Figure: seven processors P arranged as a binary tree.]
System Topologies
Mesh Topology
In the mesh topology, every processor connects to the processors above and below it, and to its right and left.
[Figure: nine processors P in a 3 x 3 mesh.]
System Topologies
Hypercube Topology
Is a multiple mesh topology.
Each processor connects to all other processors whose binary addresses differ by one bit. For example, processor 0 (0000) connects to processors 1 (0001), 2 (0010), 4 (0100), and 8 (1000).
[Figure: sixteen processors P arranged as a four-dimensional hypercube.]
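The one-bit-difference rule can be sketched directly; the function name is our own:

```python
# Illustrative sketch: in a d-dimensional hypercube, a processor's
# neighbors are found by flipping each of its d address bits,
# i.e. XOR with a one-bit mask.
def hypercube_neighbors(node, d):
    return [node ^ (1 << bit) for bit in range(d)]

# Processor 0 (0000) connects to 1 (0001), 2 (0010), 4 (0100), 8 (1000).
print(hypercube_neighbors(0, 4))   # [1, 2, 4, 8]
```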
System Topologies
Completely Connected Topology
Every processor has n-1 connections, one to each of the other processors.
There is an increase in complexity as the system grows, but this offers maximum communication capabilities.
[Figure: eight processors P, each directly connected to every other processor.]
MIMD System Architectures
Finally, the architecture of a MIMD system, in contrast to its topology, refers to its connections to its system memory.
Systems may also be classified by their architectures. Two of these are:
Uniform memory access (UMA)
Nonuniform memory access (NUMA)
Vector computers have multiple vector pipelines.
There are two families of pipelined vector processors:
memory-to-memory
register-to-register
Development layers
Applications
Programming environment
Languages supported
Communication model
Addressing space
Hardware architecture
System attributes to performance
Clock rate and CPI
The clock has cycle time τ
Clock rate f = 1/τ
The size of a program is its instruction count (Ic)
Cycles per instruction (CPI): the number of cycles needed to execute each instruction
Performance factors
CPU time needed to execute the program:
T = Ic x CPI x τ
Or, T = Ic x (p + m x k) x τ
where
p = number of processor cycles per instruction
m = number of memory references per instruction
k = ratio between the memory cycle and the processor cycle
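A worked sketch of the formula with illustrative numbers, assuming the common decomposition CPI = p + m x k:

```python
# Illustrative sketch of the CPU-time formula T = Ic * CPI * tau,
# with CPI decomposed as p + m * k. All numbers below are made up.
def cpu_time(Ic, p, m, k, tau):
    cpi = p + m * k          # effective cycles per instruction
    return Ic * cpi * tau    # total execution time

Ic  = 1_000_000   # instructions in the program
p   = 2           # processor cycles per instruction
m   = 0.5         # memory references per instruction
k   = 4           # memory-cycle / processor-cycle ratio
tau = 1e-9        # 1 ns cycle time (f = 1 GHz)
print(cpu_time(Ic, p, m, k, tau))   # ~0.004 seconds
```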
System attributes
The performance factors (Ic, p, m, k, τ) are influenced by:
Instruction set architecture
Compiler technology
CPU implementation and control
Cache and memory hierarchy
MIPS rate
MIPS rate = Ic / (T x 10^6) = f / (CPI x 10^6) = (f x Ic) / (C x 10^6)
where C is the total number of clock cycles needed to execute a given program
Throughput rate (Ws): how many programs a system can execute per unit time
Ws = f / (Ic x CPI)
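A sketch with illustrative numbers, checking that the MIPS-rate expressions agree and computing Ws:

```python
# Illustrative sketch (numbers assumed): the MIPS-rate expressions
# agree with each other, and throughput is Ws = f / (Ic * CPI).
import math

def mips_rate(f, cpi):
    return f / (cpi * 1e6)

def throughput(f, Ic, cpi):
    # Programs executed per unit time.
    return f / (Ic * cpi)

f, Ic, cpi = 1e9, 1_000_000, 4   # 1 GHz clock, 1 M instructions, CPI = 4
T = Ic * cpi / f                 # CPU time, from T = Ic x CPI x tau
C = Ic * cpi                     # total clock cycles

assert math.isclose(mips_rate(f, cpi), Ic / (T * 1e6))
assert math.isclose(mips_rate(f, cpi), (f * Ic) / (C * 1e6))
print(mips_rate(f, cpi))       # 250.0 (MIPS)
print(throughput(f, Ic, cpi))  # 250.0 programs per second
```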
Uniform memory access (UMA)
UMA is a type of symmetric multiprocessor, or SMP, that has two or more processors that perform symmetric functions.
UMA gives all CPUs equal (uniform) access to all memory locations in shared memory. The CPUs interact with shared memory through some communications mechanism, such as a simple bus or a complex multistage interconnection network.
Uniform memory access (UMA) Architecture
[Figure: processors 1 through n connected by a communications mechanism to a single shared memory.]
Nonuniform memory access (NUMA)
NUMA architectures, unlike UMA architectures, do not provide uniform access to all shared memory locations. The architecture still allows all processors to access all shared memory locations, but in a nonuniform way: each processor can access its local shared memory more quickly than the memory modules not next to it.
Nonuniform memory access (NUMA) Architecture
[Figure: each processor 1 through n is paired with its own local memory; all processor-memory pairs are connected by a communications mechanism.]
The COMA model
Cache only memory architecture (COMA) is a computer memory organization for use in multiprocessors in which the local memories (typically DRAM) at each node are used as cache. This is in contrast to using the local memories as actual main memory, as in NUMA organizations.
The COMA model is a special case of a NUMA machine in which the distributed memories are converted into caches.
There is no memory hierarchy at each processor node.
All caches form a global address space.
Remote cache access is assisted by distributed cache directories.
Depending on the interconnection network used, hierarchical directories may be used to help locate copies of cache blocks.
Initial data placement is not critical because data will eventually migrate to where it will be used.
The COMA model
[Figure: each node consists of a processor P, a directory D, and a cache C; the nodes are linked by an interconnection network.]
Vector Supercomputers
Epitomized by the Cray-1, 1976:
Scalar Unit + Vector Extensions
Load/Store Architecture
Vector Registers
Vector Instructions
Hardwired Control
Highly Pipelined Functional Units
Interleaved Memory System
No Data Caches
No Virtual Memory
THE END