Architecture of Parallel and Distributed Systems
Paulo Marques
Dep. Eng. Informática – Universidade de Coimbra
[email protected]
Aug/2007
2. Machine Architectures
von Neumann Architecture
- Based on the fetch-decode-execute cycle
- The computer executes a single sequence of instructions that act on data
- Both program and data are stored in memory

[Diagram: instructions flow from memory through the CPU, acting on data items (A, B, C) stored in the same memory]
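To make the cycle concrete, here is a minimal C sketch (not from the slides) of a toy von Neumann machine: program and data share a single memory array, and the CPU loops over fetch, decode, execute. The opcodes and memory layout are invented for illustration.

    #include <stdio.h>

    /* Toy von Neumann machine: program and data live in the same memory. */
    enum { HALT = 0, LOAD = 1, ADD = 2, STORE = 3 };

    int main(void) {
        int mem[16] = { LOAD, 14, ADD, 15, STORE, 15, HALT, 0,
                        0, 0, 0, 0, 0, 0, 5, 10 };   /* program, then data */
        int pc = 0, acc = 0, running = 1;

        while (running) {
            int op = mem[pc++];                         /* fetch */
            switch (op) {                               /* decode */
            case LOAD:  acc  = mem[mem[pc++]]; break;   /* execute */
            case ADD:   acc += mem[mem[pc++]]; break;
            case STORE: mem[mem[pc++]] = acc;  break;
            default:    running = 0;           break;
            }
        }
        printf("mem[15] = %d\n", mem[15]);              /* prints 15 (5 + 10) */
        return 0;
    }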
Flynn's Taxonomy
Classifies computers according to:
- The number of execution flows
- The number of data flows

                            Single data flow   Multiple data flows
  Single execution flow     SISD               SIMD
  Multiple execution flows  MISD               MIMD

SISD: Single-Instruction, Single-Data
SIMD: Single-Instruction, Multiple-Data
MISD: Multiple-Instruction, Single-Data
MIMD: Multiple-Instruction, Multiple-Data
Single Instruction, Single Data (SISD)
- A serial (non-parallel) computer
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
- Single data: only one data stream is being used as input during any one clock cycle
- Most PCs, single-CPU workstations, …
Single Instruction, Multiple Data (SIMD)
- A type of parallel computer
- Single instruction: all processing units execute the same instruction at any given clock cycle
- Multiple data: each processing unit can operate on a different data element
- Best suited for specialized problems characterized by a high degree of regularity, such as image processing
- Examples: Connection Machine CM-2, Cray J90, Pentium MMX instructions
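The following C sketch illustrates the SIMD idea on a modern descendant of the Pentium MMX instructions listed above: with x86 SSE intrinsics, a single _mm_add_ps instruction adds four float lanes at once (an illustrative example, assuming an x86 machine; compile with e.g. gcc -msse).

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void) {
        float a[4] = {1, 2, 3, 4};
        float b[4] = {10, 20, 30, 40};
        float c[4];

        __m128 va = _mm_loadu_ps(a);     /* load 4 floats into one register */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);  /* ONE instruction adds all 4 lanes */
        _mm_storeu_ps(c, vc);

        printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);  /* 11 22 33 44 */
        return 0;
    }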
The Connection Machine 2 (SIMD)
The massively parallel Connection Machine 2 was a supercomputer produced by Thinking Machines Corporation, containing 32,768 (or more) 1-bit processors that work in parallel.
Multiple Instruction, Single Data (MISD)
- Few actual examples of this class of parallel computer have ever existed
- Some conceivable examples might be:
  - multiple frequency filters operating on a single signal stream
  - multiple cryptography algorithms attempting to crack a single coded message
  - the Data Flow Architecture
Multiple Instruction, Multiple Data (MIMD)
- Currently, the most common type of parallel computer
- Multiple instruction: every processor may be executing a different instruction stream
- Multiple data: every processor may be working with a different data stream
- Execution can be synchronous or asynchronous, deterministic or non-deterministic
- Examples: most current supercomputers, computer clusters, multi-processor SMP machines (including some types of PCs)
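A minimal C sketch of the MIMD idea on a shared-memory machine (assuming POSIX threads; compile with -pthread): each thread is an independent instruction stream and may take a different control path on its own data.

    #include <pthread.h>
    #include <stdio.h>

    void *worker(void *arg) {
        int id = *(int *)arg;
        if (id % 2 == 0)                 /* different threads, different paths */
            printf("thread %d: even branch\n", id);
        else
            printf("thread %d: odd branch\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        int ids[4] = {0, 1, 2, 3};
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, &ids[i]);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }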
Earth Simulator Center – Yokohama, NEC SX (MIMD)
The Earth Simulator is a project to develop a 40 TFLOPS system for climate modeling. It performs at 35.86 TFLOPS.
The ES is based on:
- 5,120 (640 8-way nodes) 500 MHz NEC CPUs
- 8 GFLOPS per CPU (41 TFLOPS total)
- 2 GB RAM per CPU (10 TB total)
- Shared memory inside each node
- 640 × 640 crossbar switch between the nodes
- 16 GB/s inter-node bandwidth
What about Memory?
- The interface between CPUs and memory in parallel machines is of crucial importance
- The bottleneck on the bus between memory and CPU is known as the von Neumann bottleneck
- It limits how fast a machine can operate: what matters is the ratio between computation and communication
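To make the bottleneck concrete, here is a back-of-the-envelope calculation with illustrative numbers (assumptions, not taken from the slides): a CPU with a peak of 8 GFLOP/s fed by a 2 GB/s memory bus can stream at most 2/8 = 0.25 billion doubles per second. A vector addition c[i] = a[i] + b[i] performs 1 flop per 3 memory accesses (two loads, one store), so it is limited to roughly 0.25/3 ≈ 0.08 GFLOP/s, about 1% of peak. The achievable speed is set by the ratio of computation to communication, not by the CPU alone.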
Communication in Parallel Machines
- Programs act on data
- Quite important: how do processors access each other's data?

[Diagram: two models. Shared Memory Model: several CPUs connected to a single shared memory. Message Passing Model: CPU/memory pairs connected by a network]
Shared Memory
- Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as a global address space
- Multiple processors can operate independently but share the same memory resources
- Changes in a memory location made by one processor are visible to all other processors
- Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA
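A small OpenMP sketch of the shared memory model in C (illustrative; compile with e.g. gcc -fopenmp): every thread sees the same array through the global address space, and the reduction clause takes care of the concurrent updates to the shared sum.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        double data[1000], sum = 0.0;
        for (int i = 0; i < 1000; i++)
            data[i] = 1.0;

        /* all threads read the same shared 'data'; OpenMP combines the
           per-thread partial sums into the shared 'sum' */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 1000; i++)
            sum += data[i];

        printf("sum = %.0f\n", sum);     /* prints 1000 */
        return 0;
    }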
Shared Memory (2)
[Diagram: UMA (Uniform Memory Access): a single 4-processor machine whose CPUs all reach memory through a fast memory interconnect. NUMA (Non-Uniform Memory Access): a 3-processor NUMA machine in which each CPU is paired with its own memory module]
Uniform Memory Access (UMA)
- Most commonly represented today by Symmetric Multiprocessor (SMP) machines
- Identical processors
- Equal access and access times to memory
- Sometimes called CC-UMA (Cache-Coherent UMA). Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.
- Very hard to scale
Non-Uniform Memory Access (NUMA)
- Often made by physically linking two or more SMPs. One SMP can directly access the memory of another SMP.
- Not all processors have equal access time to all memories
- Sometimes called DSM – Distributed Shared Memory
- Advantages:
  - User-friendly programming perspective to memory
  - Data sharing between tasks is both fast and uniform due to the proximity of memory and CPUs
  - More scalable than SMPs
- Disadvantages:
  - Lack of scalability between memory and CPUs
  - The programmer is responsible for the synchronization constructs that ensure "correct" access to global memory (see the sketch below)
  - Expensive: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever-increasing numbers of processors
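A hedged C sketch of that synchronization responsibility (POSIX threads; compile with -pthread): without the mutex, the two threads' increments to the shared counter could interleave and be lost.

    #include <pthread.h>
    #include <stdio.h>

    long counter = 0;                       /* shared, globally addressable */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *inc(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);      /* programmer-supplied sync */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, inc, NULL);
        pthread_create(&t2, NULL, inc, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter); /* 200000 only with the lock */
        return 0;
    }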
UMA and NUMA
The new Mac Pro features 2 Intel Core 2 Duo processors that share a common central memory (up to 16 GB)
SGI Origin 3900:
- 16 R14000A processors per brick, each brick with 32 GB of RAM
- 12.8 GB/s aggregated memory bandwidth
- Scales up to 512 processors and 1 TB of memory
Distributed Memory (DM)
- Processors have their own local memory. Memory addresses in one processor do not map to another processor (there is no global address space)
- Because each processor has its own local memory, cache coherency does not apply
- Requires a communication network to connect inter-processor memory
- When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated (see the MPI sketch below)
- Synchronization between tasks is likewise the programmer's responsibility
- Very scalable
- Cost effective: uses off-the-shelf processors and networking
- Slower than UMA and NUMA machines
Distributed Memory
[Diagram: several computers, each with its own CPU and memory, connected by a network interconnect. Pictured: TITAN@DEI, a PC cluster interconnected by Fast Ethernet]
Hybrid Architectures
- Today, most systems are hybrids featuring distributed shared memory:
  - Each node has several processors that share a central memory
  - A fast switch interconnects the several nodes
  - In some cases the interconnect allows for the mapping of memory among nodes; in most cases it provides a message-passing interface

[Diagram: several nodes, each with four CPUs sharing a local memory, connected by a fast network interconnect]
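A hedged C sketch of how such a hybrid is typically programmed, combining MPI between nodes with OpenMP inside each node (compile with an MPI compiler wrapper plus -fopenmp; the mapping of one MPI rank per node is an assumption for illustration).

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);              /* message passing across nodes */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel                 /* shared memory inside a node */
        printf("rank %d, thread %d\n", rank, omp_get_thread_num());

        MPI_Finalize();
        return 0;
    }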
ASCI White at the Lawrence Livermore National Laboratory
- Each node is an IBM POWER3 375 MHz NH-2 16-way SMP (i.e., 16 processors/node)
- Each node has 16 GB of memory
- A total of 512 nodes, interconnected by a 2 GB/s node-to-node network
- The 512 nodes feature a total of 8,192 processors and a total of 8,192 GB of memory
- It currently operates at 13.8 TFLOPS
Summary
Architecture    | CC-UMA              | CC-NUMA           | Distributed/Hybrid
----------------+---------------------+-------------------+------------------------------
Examples        | SMPs, Sun Vexx,     | SGI Origin,       | Cray T3E, IBM SP2
                | SGI Challenge,      | HP Exemplar,      |
                | IBM Power3          | IBM Power4        |
Programming     | MPI, Threads,       | MPI, Threads,     | MPI
                | OpenMP, Shmem       | OpenMP, Shmem     |
Scalability     | <10 processors      | <1000 processors  | ~1000 processors
Drawbacks       | Limited memory      | New architecture  | Point-to-point communication;
                | bandwidth;          |                   | costly system administration;
                | hard to scale       |                   | programming is hard to
                |                     |                   | develop and maintain
Software        | Great               | Great             | Limited
availability    |                     |                   |