
Arquitectura de Sistemas Paralelos e Distribuídos (Parallel and Distributed Systems Architecture)

Paulo Marques
Dep. Eng. Informática – Universidade de Coimbra
[email protected]

Aug/2007

2. Machine Architectures


von Neumann Architecture

- Based on the fetch-decode-execute cycle
- The computer executes a single sequence of instructions that act on data
- Both program and data are stored in memory

[Diagram: a single flow of instructions (A, B, C) acting on data stored in memory]
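A minimal sketch of the fetch-decode-execute cycle for a toy accumulator machine in C. The instruction encoding (LOAD/ADD/STORE/HALT) is invented purely for illustration, but the key point of the von Neumann model is there: program and data share the same memory.

/* Toy von Neumann machine: program and data live in one memory array. */
#include <stdio.h>

enum { HALT = 0, LOAD = 1, ADD = 2, STORE = 3 };

int main(void) {
    /* memory[0..7] holds the program, memory[8..] holds the data */
    int memory[16] = {
        LOAD, 8,    /* acc = memory[8]      */
        ADD,  9,    /* acc += memory[9]     */
        STORE, 10,  /* memory[10] = acc     */
        HALT, 0,
        20, 22, 0   /* data: A, B, C        */
    };
    int pc = 0, acc = 0, running = 1;

    while (running) {
        int opcode  = memory[pc];       /* fetch */
        int operand = memory[pc + 1];
        pc += 2;
        switch (opcode) {               /* decode and execute */
            case LOAD:  acc = memory[operand];  break;
            case ADD:   acc += memory[operand]; break;
            case STORE: memory[operand] = acc;  break;
            case HALT:  running = 0;            break;
        }
    }
    printf("C = A + B = %d\n", memory[10]);  /* prints 42 */
    return 0;
}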


Flynn's Taxonomy

Classifies computers according to…
- The number of instruction streams
- The number of data streams

                               Single data stream    Multiple data streams
Single instruction stream      SISD                  SIMD
Multiple instruction streams   MISD                  MIMD

SISD: Single-Instruction, Single-Data
SIMD: Single-Instruction, Multiple-Data
MISD: Multiple-Instruction, Single-Data
MIMD: Multiple-Instruction, Multiple-Data


Single Instruction, Single Data (SISD)

- A serial (non-parallel) computer
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
- Single data: only one data stream is being used as input during any one clock cycle
- Most PCs, single-CPU workstations, …


Single Instruction, Multiple Data (SIMD)

- A type of parallel computer
- Single instruction: all processing units execute the same instruction at any given clock cycle
- Multiple data: each processing unit can operate on a different data element
- Best suited for specialized problems characterized by a high degree of regularity, such as image processing
- Examples: Connection Machine CM-2, Cray J90, Pentium MMX instructions
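A small illustration of the SIMD idea in C, assuming an x86 CPU with SSE2 support (the same concept as the MMX instructions above, here applied to floats): a single add instruction operates on four data elements at once.

/* One SIMD instruction, four additions. Assumes an x86 CPU with SSE2. */
#include <emmintrin.h>
#include <stdio.h>

int main(void) {
    float a[4] = {1, 2, 3, 4};
    float b[4] = {10, 20, 30, 40};
    float c[4];

    __m128 va = _mm_loadu_ps(a);       /* load 4 floats                  */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);    /* single instruction, 4 adds     */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%.0f ", c[i]);         /* prints 11 22 33 44 */
    printf("\n");
    return 0;
}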


The Connection Machine 2 (SIMD)

The massively parallel Connection Machine 2 was a supercomputer produced by Thinking Machines Corporation, containing 32,768 (or more) 1-bit processors that work in parallel.


Multiple Instruction, Single Data (MISD)

- Few actual examples of this class of parallel computer have ever existed
- Some conceivable examples might be:
  - multiple frequency filters operating on a single signal stream
  - multiple cryptography algorithms attempting to crack a single coded message
  - the Data Flow Architecture


Multiple Instruction, Multiple Data (MIMD)

- Currently, the most common type of parallel computer
- Multiple instruction: every processor may be executing a different instruction stream
- Multiple data: every processor may be working with a different data stream
- Execution can be synchronous or asynchronous, deterministic or non-deterministic
- Examples: most current supercomputers, computer clusters, multi-processor SMP machines (including some types of PCs)
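A minimal MIMD-style sketch using POSIX threads, assuming a pthreads-capable system: two threads run different instruction streams on different data, asynchronously.

/* Two threads, two instruction streams, two data sets (MIMD style).
 * Compile with: gcc mimd.c -pthread */
#include <pthread.h>
#include <stdio.h>

void *sum_task(void *arg) {            /* instruction stream 1 */
    int *v = arg, s = 0;
    for (int i = 0; i < 4; i++) s += v[i];
    printf("sum thread:     %d\n", s);
    return NULL;
}

void *product_task(void *arg) {        /* instruction stream 2 */
    int *v = arg, p = 1;
    for (int i = 0; i < 4; i++) p *= v[i];
    printf("product thread: %d\n", p);
    return NULL;
}

int main(void) {
    int data_a[4] = {1, 2, 3, 4};
    int data_b[4] = {5, 6, 7, 8};
    pthread_t t1, t2;

    pthread_create(&t1, NULL, sum_task, data_a);
    pthread_create(&t2, NULL, product_task, data_b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}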


Earth Simulator Center – Yokohama, NEC SX (MIMD)

The Earth Simulator is a project to develop a 40 TFLOPS system for climate modeling. It performs at 35.86 TFLOPS.

The ES is based on:
- 5,120 (640 8-way nodes) 500 MHz NEC CPUs
- 8 GFLOPS per CPU (41 TFLOPS total)
- 2 GB RAM per CPU (10 TB total)
- Shared memory inside the node
- 640 × 640 crossbar switch between the nodes
- 16 GB/s inter-node bandwidth


What about Memory?

The interface between CPUs and memory in parallel machines is of crucial importance.

The bottleneck on the bus between memory and CPU is known as the von Neumann bottleneck.

It limits how fast a machine can operate: what matters is the relationship between computation and communication.
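A rough back-of-the-envelope example of this computation/communication relationship: a daxpy-style loop, y[i] = a*x[i] + y[i], performs 2 floating-point operations per iteration while moving 24 bytes (two 8-byte loads and one 8-byte store). Keeping even a 1 GFLOPS core busy would therefore require roughly 12 GB/s of sustained memory bandwidth, comparable to or more than a typical front-side bus of that era delivered, so the processor spends much of its time waiting on memory.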


Communication in Parallel Machines

Programs act on data. A key question: how do processors access each other's data?

[Diagram: Shared Memory Model (several CPUs connected to a single memory) vs. Message Passing Model (CPU/memory pairs connected by a network)]


Shared Memory

Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as a global address space

Multiple processors can operate independently but share the same memory resources

Changes in a memory location made by one processor are visible to all other processors

Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA
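A short OpenMP sketch of the shared-memory model, assuming a compiler with OpenMP support (e.g. gcc -fopenmp): all threads read and write the same array in a single global address space, so an update made by one thread is visible to the others.

/* Shared-memory model with OpenMP: one global address space for all threads.
 * Compile with: gcc shared.c -fopenmp */
#include <omp.h>
#include <stdio.h>

int main(void) {
    double a[1000];

    #pragma omp parallel for           /* threads split the iterations      */
    for (int i = 0; i < 1000; i++)
        a[i] = 2.0 * i;                /* every thread writes the same array */

    /* the master thread reads values written by other threads */
    printf("a[999] = %.1f\n", a[999]);
    return 0;
}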


Shared Memory (2)

[Diagram: UMA (Uniform Memory Access) shown as a single 4-processor machine whose CPUs reach one memory through a fast memory interconnect; NUMA (Non-Uniform Memory Access) shown as a 3-processor machine in which each CPU has its own local memory]


Uniform Memory Access (UMA)

- Most commonly represented today by Symmetric Multiprocessor (SMP) machines
- Identical processors
- Equal access and access times to memory
- Sometimes called CC-UMA (Cache Coherent UMA). Cache coherent means if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.
- Very hard to scale


Non-Uniform Memory Access (NUMA)

- Often made by physically linking two or more SMPs. One SMP can directly access the memory of another SMP.
- Not all processors have equal access time to all memories
- Sometimes called DSM – Distributed Shared Memory
- Advantages:
  - User-friendly programming perspective to memory
  - Data sharing between tasks is both fast and uniform due to the proximity of memory and CPUs
  - More scalable than SMPs
- Disadvantages:
  - Lack of scalability between memory and CPUs
  - Programmer responsibility for synchronization constructs that ensure "correct" access of global memory
  - Expensive: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors


UMA and NUMA

UMA example: the new Mac Pro features 2 Intel Core 2 Duo processors that share a common central memory (up to 16 GB).

NUMA example: SGI Origin 3900
- 16 R14000A processors per brick, each brick with 32 GB of RAM
- 12.8 GB/s aggregated memory bandwidth
- Scales up to 512 processors and 1 TB of memory


Distributed Memory (DM)

- Processors have their own local memory. Memory addresses in one processor do not map to another processor (no global address space)
- Because each processor has its own local memory, cache coherency does not apply
- Requires a communication network to connect inter-processor memory
- When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated
- Synchronization between tasks is the programmer's responsibility
- Very scalable
- Cost effective: use of off-the-shelf processors and networking
- Slower than UMA and NUMA machines
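A minimal message-passing sketch with MPI, assuming an MPI installation such as Open MPI or MPICH: each process has its own private memory, so data must be sent explicitly from one process to another.

/* Message-passing model with MPI: data moves by explicit send/receive.
 * Compile with: mpicc dm.c  ;  run with: mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                               /* lives in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);    /* a copy in rank 1's memory */
    }
    MPI_Finalize();
    return 0;
}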


Distributed Memory

[Diagram: several computers, each with its own CPU and memory, connected by a network interconnect]

TITAN@DEI, a PC cluster interconnected by Fast Ethernet


Hybrid Architectures

- Today, most systems are hybrids featuring both shared and distributed memory
- Each node has several processors that share a central memory
- A fast switch interconnects the nodes
- In some cases the interconnect allows memory to be mapped among nodes; in most cases it provides a message-passing interface

[Diagram: four nodes, each with four CPUs sharing a memory, connected by a fast network interconnect]
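A sketch of how such hybrid machines are typically programmed, assuming MPI plus an OpenMP-capable compiler: OpenMP threads share memory inside each node, while MPI passes messages between nodes over the interconnect.

/* Hybrid shared/distributed memory: OpenMP inside a node, MPI between nodes.
 * Compile with something like: mpicc hybrid.c -fopenmp */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0, global_sum = 0.0;

    /* inside the node: threads share memory and split the loop */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = rank * 1000; i < (rank + 1) * 1000; i++)
        local_sum += (double)i;

    /* between nodes: explicit communication over the interconnect */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %.0f\n", global_sum);

    MPI_Finalize();
    return 0;
}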


ASCI White at the Lawrence Livermore National Laboratory

- Each node is an IBM POWER3 375 MHz NH-2 16-way SMP (i.e. 16 processors per node)
- Each node has 16 GB of memory
- A total of 512 nodes, interconnected by a 2 GB/s node-to-node network
- The 512 nodes feature a total of 8192 processors, with a total of 8192 GB of memory
- It currently operates at 13.8 TFLOPS


Summary

Architecture: CC-UMA
- Examples: SMPs, Sun Vexx, SGI Challenge, IBM Power3
- Programming: MPI, Threads, OpenMP, Shmem
- Scalability: <10 processors
- Drawbacks: limited memory bandwidth; hard to scale
- Software availability: great

Architecture: CC-NUMA
- Examples: SGI Origin, HP Exemplar, IBM Power4
- Programming: MPI, Threads, OpenMP, Shmem
- Scalability: <1000 processors
- Drawbacks: new architecture; point-to-point communication
- Software availability: great

Architecture: Distributed/Hybrid
- Examples: Cray T3E, IBM SP2
- Programming: MPI
- Scalability: ~1000 processors
- Drawbacks: costly system administration; programs are hard to develop and maintain
- Software availability: limited