
ECE-451/ECE-566 - Introduction to Parallel and Distributed Programming

Lecture 2: Parallel Architectures and Programming Models

Department of Electrical & Computer Engineering

Rutgers University

Machine Architectures and Interconnection Networks


Architecture Spectrum

Shared-Everything
– Symmetric Multiprocessors
Shared Memory
– NUMA, CC-NUMA
Distributed Memory
– DSM, Message Passing
Shared-Nothing
– Clusters, NOWs
Client/Server

Pros and Cons

Shared Memory
– Pros: flexible, easier to program
– Cons: not scalable, synchronization/coherency issues
Distributed Memory
– Pros: scalable
– Cons: difficult to program, requires explicit message passing


Conventional Computer

Consists of a processor executing a program stored in a (main) memory:

[Figure: a processor connected to main memory; instructions flow to the processor, data flows to or from the processor]

Each main memory location is identified by its address. Addresses start at 0 and extend to 2^b - 1 when there are b bits (binary digits) in the address.

Shared Memory Multiprocessor System

Natural way to extend the single-processor model - have multiple processors connected to multiple memory modules, such that each processor can access any memory module.

[Figure: processors connected through an interconnection network to memory modules forming one address space]


Simplistic view of a small shared memory multiprocessor

[Figure: processors connected by a bus to a shared memory]

Examples: dual Pentiums, quad Pentiums

Quad Pentium Shared Memory Multiprocessor

[Figure: four processors, each with an L1 cache, L2 cache, and bus interface, attached to a shared processor/memory bus; a memory controller and I/O interface connect the bus to the shared memory and the I/O bus]


Programming Shared Memory Multiprocessors

Use:
Threads - programmer decomposes the program into individual parallel sequences (threads), each able to access variables declared outside the threads. Example: Pthreads.
Sequential programming language with preprocessor compiler directives to declare shared variables and specify parallelism. Example: OpenMP - industry standard - needs an OpenMP compiler (see the sketch after this list).
Sequential programming language with added syntax to declare shared variables and specify parallelism. Example: UPC (Unified Parallel C) - needs a UPC compiler.
Parallel programming language with syntax to express parallelism - compiler creates executable code for each processor (not now common).
Sequential programming language and a parallelizing compiler to convert it into parallel executable code - also not now common.
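A minimal sketch of the directive-based style in C with OpenMP, summing f(A[i]) over an array. The array A, its length N, and the function f are illustrative placeholders, and an OpenMP-capable compiler (e.g. gcc with -fopenmp) is assumed.

#include <stdio.h>
#include <omp.h>

#define N 1000

/* placeholder for the per-element work */
static double f(double x) { return x * x; }

int main(void) {
    double A[N], s = 0.0;
    for (int i = 0; i < N; i++) A[i] = (double)i;   /* A is shared by default */

    /* the directive declares the parallelism; i is private, s is a reduction variable */
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < N; i++)
        s += f(A[i]);

    printf("sum = %f\n", s);
    return 0;
}

The reduction clause is what keeps the updates to the shared variable s from racing; without it the program would exhibit the coherency problems discussed later.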

Distributed Shared Memory

Making the main memory of a group of interconnected computers look as though it is a single memory with a single address space. Shared memory programming techniques can then be used.

[Figure: computers, each with a processor and a memory presented as shared, exchanging messages over an interconnection network]


Message-Passing Multicomputer

Complete computers connected through an interconnection network.

[Figure: computers, each with a processor and local memory, exchanging messages over an interconnection network]

Interconnection Networks

Limited and exhaustive interconnections
2- and 3-dimensional meshes
Hypercube (not now common)
Using switches:
– Crossbar
– Trees
– Multistage interconnection networks


Two-dimensional array (mesh)

[Figure: computers/processors connected by links in a two-dimensional grid]

Also three-dimensional - used in some large high-performance systems.

Three-dimensional hypercube

[Figure: eight nodes labeled in binary from 000 to 111 connected as a 3-D hypercube]


Four-dimensional hypercube

[Figure: sixteen nodes labeled in binary from 0000 to 1111 connected as a 4-D hypercube]

Hypercubes were popular in the 1980s - not now.

Crossbar switch

[Figure: processors and memories connected through a grid of switches (a crossbar)]


Tree

[Figure: processors at the leaves of a tree of switch elements; links lead up to the root]

Multistage Interconnection Network

Example: Omega network, built from 2 × 2 switch elements (straight-through or crossover connections).

[Figure: eight inputs (000-111) routed to eight outputs (000-111) through stages of 2 × 2 switch elements]


Taxonomy of HPC Architectures

Taxonomy of Architectures

Flynn (1966) created a simple classification for computers based upon the number of instruction streams and data streams:
– SISD - conventional
– SIMD - data parallel, vector computing
– MISD - systolic arrays
– MIMD - very general, multiple approaches

Current focus is on the MIMD model, using general-purpose processors or multicomputers.


HPC Architecture Examples

SISD - mainframes, workstations, PCs.
SIMD Shared Memory - vector machines, Cray...
MIMD Shared Memory - Sequent, KSR, Tera, SGI, Sun.
SIMD Distributed Memory - DAP, TMC CM-2...
MIMD Distributed Memory - Cray T3D, Intel, Transputers, TMC CM-5, plus recent workstation clusters (IBM SP2, DEC, Sun, HP).

Note: Modern sequential machines are not purely SISD - advanced RISC processors use many concepts from vector and parallel architectures (pipelining, parallel execution of instructions, prefetching of data, etc.) in order to achieve one or more arithmetic operations per clock cycle.

SISD: A Conventional Computer

[Figure: a processor receiving a single stream of instructions, taking data input and producing data output]

Single-processor computer - a single stream of instructions is generated from the program. Instructions operate upon a single stream of data items. Speed is limited by the rate at which the computer can transfer information internally.

e.g. PC, Macintosh, workstations


The MISD Architecture

[Figure: processors A, B, and C, each receiving its own instruction stream (A, B, C) while sharing a single data input stream and data output stream]

More of an intellectual exercise than a practical configuration. Few were built, and none are commercially available.

Single Instruction Stream-Multiple Data Stream (SIMD) Computer

A specially designed computer - a single instruction stream from a single program, but multiple data streams exist. A single source program is written and each processor executes its personal copy of this program, although independently and not in synchronism. Developed because a number of important applications mostly operate upon arrays of data. The source program can be constructed so that parts of the program are executed by certain computers and not others, depending upon the identity of the computer.


SIMD Architecture

[Figure: a single instruction stream driving processors A, B, and C, each with its own data input and data output stream, e.g. each computing Ci <= Ai * Bi]

e.g. Cray vector processing machines, Thinking Machines CM*

Multiple Instruction Stream-Multiple Data Stream (MIMD) Computer

General-purpose multiprocessor system. Each processor has a separate program, and one instruction stream is generated from each program for each processor. Each instruction operates upon different data.

[Figure: two programs, each feeding instructions to its own processor, each processor operating on its own data]


MIMD Architecture

[Figure: processors A, B, and C, each with its own instruction stream, data input stream, and data output stream]

Unlike SISD and MISD, a MIMD computer works asynchronously.
» Shared memory (tightly coupled) MIMD
» Distributed memory (loosely coupled) MIMD

Shared Memory MIMD machine

Communication: source PE writes data to global memory and the destination retrieves it.
Easy to build; conventional OSes for SISD machines can easily be ported.
Limitation: reliability and expandability. A memory component or any processor failure affects the whole system. Increasing the number of processors leads to memory contention.
E.g.: SGI machines.

[Figure: processors A, B, and C connected via memory buses to a global memory system]


Distributed Memory MIMD

Communication: IPC over a high-speed network. The network can be configured as a tree, mesh, cube, etc.
Unlike shared memory MIMD:
– easily/readily expandable
– highly reliable (any CPU failure does not affect the whole system)

[Figure: processors A, B, and C, each with its own memory bus and memory system, connected by IPC channels]

Towards Cluster and Distributed Computing


Parallel Processing Paradox

Time required to develop a parallel application for solving a GCA is equal to:
– the half-life of parallel supercomputers.

An Alternative Supercomputing Resource

Vast numbers of under-utilised workstations available to use.
Huge numbers of unused processor cycles and resources that could be put to good use in a wide variety of application areas.
Reluctance to buy a supercomputer due to its cost and short life span.
Distributed compute resources "fit" better into today's funding model.


Networked Computers as a Computing Platform

A network of computers became a very attractive alternative to expensive supercomputers and parallel computer systems for high-performance computing in the early 1990s.

Several early projects. Notable:
– Berkeley NOW (network of workstations) project.
– NASA Beowulf project. (Will look at this one later.)

Key advantages

Very high performance workstations and PCs readily available at low cost.
The latest processors can easily be incorporated into the system as they become available.
Existing software can be used or modified.


Software Tools for Clusters

Based upon message-passing parallel programming:
Parallel Virtual Machine (PVM) - developed in the late 1980s. Became very popular.
Message-Passing Interface (MPI) - standard defined in the 1990s.
Both provide a set of user-level libraries for message passing.
Use with regular programming languages (C, C++, ...).
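As a small illustration of these library calls, here is a minimal MPI program in C (a sketch, assuming an MPI installation with mpicc/mpirun available): each process holds a local value and the root combines them with MPI_Reduce.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);                  /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */

    int local = rank + 1;                    /* some per-process value */
    int total = 0;
    /* combine the local values on rank 0 */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes = %d\n", size, total);
    MPI_Finalize();
    return 0;
}

Typically compiled with mpicc and launched with mpirun -np 4 ./a.out (or the site's equivalent launcher).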

Beowulf Clusters*

A group of interconnected "commodity" computers achieving high performance with low cost.
Typically uses commodity interconnects - high-speed Ethernet - and the Linux OS.

* Beowulf comes from the name given to the NASA Goddard Space Flight Center cluster project.


Cluster Interconnects

Originally fast Ethernet on low-cost clusters.
Gigabit Ethernet - easy upgrade path.

More specialized/higher performance:
– Myrinet - 2.4 Gbits/sec - disadvantage: single vendor
– cLan
– SCI (Scalable Coherent Interface)
– QNet
– Infiniband - may be important as Infiniband interfaces may be integrated on next-generation PCs

Dedicated cluster with a master node

[Figure: dedicated cluster users reach a master node over an external network; the master node has a second Ethernet interface and connects through a switch (with an uplink) to the compute nodes]


Scalable Parallel Computers

[Figure not reproduced]

Design Space of Competing Computer Architecture

[Figure not reproduced]


Machine and Programming Models applied to Parallel Systems

Generic Parallel Architecture

[Figure: processors (P) and memories (M) attached to an interconnection network]

° Where is the memory physically located?


Parallel Programming Models

Control
– How is parallelism created?
– What orderings exist between operations?
– How do different threads of control synchronize?
Data
– What data is private vs. shared?
– How is logically shared data accessed or communicated?
Operations
– What are the atomic operations?
Cost
– How do we account for the cost of each of the above?

Trivial Example: s = Σ_{i=0}^{n-1} f(A[i])

Parallel decomposition:
– Each evaluation and each partial sum is a task.
Assign n/p numbers to each of p processors:
– Each computes independent "private" results and a partial sum.
– One (or all) collects the p partial sums and computes the global sum.
Two classes of data:
– Logically shared
» The original n numbers, the global sum.
– Logically private
» The individual function evaluations.
» What about the individual partial sums?


Programming Model 1: Shared Address Space

Program consists of a collection of threads of control.
Each has a set of private variables, e.g. local variables on the stack.
Collectively with a set of shared variables, e.g. static variables, shared common blocks, global heap.
Threads communicate implicitly by writing and reading shared variables.
Threads coordinate explicitly by synchronization operations on shared variables -- writing and reading flags, locks, or semaphores.
Like concurrent programming on a uniprocessor.

[Figure: several threads, each with a private address space, reading and writing variables (x = ...; y = ..x ...) in a shared address space]

Machine Model 1: Shared Memory Multiprocessor

Processors all connected to a large shared memory.
"Local" memory is not (usually) part of the hardware.
– Sun, DEC, Intel SMPs in Millennium, SGI Origin
Cost: much cheaper to access data in cache than in main memory.

[Figure: processors P1 ... Pn, each with a cache ($), connected by a network to a shared memory]


Shared Memory Code for Computing a Sum

Thread 1:
[s = 0 initially]
local_s1 = 0
for i = 0, n/2-1
    local_s1 = local_s1 + f(A[i])
s = s + local_s1

Thread 2:
[s = 0 initially]
local_s2 = 0
for i = n/2, n-1
    local_s2 = local_s2 + f(A[i])
s = s + local_s2

What could go wrong?

Pitfall and Solution via Synchronization

° Pitfall in computing a global sum s = local_s1 + local_s2:

Thread 1 (initially s=0):
load s              [from mem to reg]
s = s + local_s1    [=local_s1, in reg]
store s             [from reg to mem]

Thread 2 (initially s=0):
load s              [from mem to reg; initially 0]
s = s + local_s2    [=local_s2, in reg]
store s             [from reg to mem]

(time runs downward; the two sequences may interleave)

° Instructions from different threads can be interleaved arbitrarily.
° What can the final result s stored in memory be?
° Problem: race condition.
° Possible solution: mutual exclusion with locks (a Pthreads sketch follows).

Thread 1:             Thread 2:
lock                  lock
load s                load s
s = s + local_s1      s = s + local_s2
store s               store s
unlock                unlock

° Locks must be atomic (execute completely without interruption).
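A runnable Pthreads version of the lock-protected sum above, as a sketch; the array A, its size N, and f are illustrative placeholders.

#include <stdio.h>
#include <pthread.h>

#define N 1000
#define NTHREADS 2

static double A[N];
static double s = 0.0;                          /* shared global sum */
static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

static double f(double x) { return x * x; }     /* placeholder work */

static void *worker(void *arg) {
    long id = (long)arg;
    double local = 0.0;
    /* each thread sums its own block into a private variable */
    for (int i = id * N / NTHREADS; i < (id + 1) * N / NTHREADS; i++)
        local += f(A[i]);
    pthread_mutex_lock(&s_lock);                /* lock */
    s += local;                                 /* load s, add, store s */
    pthread_mutex_unlock(&s_lock);              /* unlock */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < N; i++) A[i] = (double)i;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("s = %f\n", s);
    return 0;
}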


Programming Model 2: Message Passing

Program consists of a collection of named processes.
Thread of control plus local address space -- NO shared data.
Local variables, static variables, common blocks, heap.
Processes communicate by explicit data transfers -- matching send and receive pair by source and destination processors.
Coordination is implicit in every communication event.
Logically shared data is partitioned over local processes.
Like distributed programming -- program with MPI, PVM.

[Figure: processes 0 ... n, each with a private address space (variables X, Y, array A); data moves only via matching operations such as send P0,X and recv Pn,Y]

Machine Model 2: Distributed Memory

Cray XT, IBM SP2, BlueGene, Roadrunner, NOW, etc.
Each processor is connected to its own memory and cache but cannot directly access another processor's memory.
Each "node" has a network interface (NI) for all communication and synchronization.

[Figure: nodes P1 ... Pn, each with its own memory and a network interface (NI), connected by an interconnect]


Computing s = x(1)+x(2) on each processor

° First possible solution:

Processor 1:
    send xlocal, proc2      [xlocal = x(1)]
    receive xremote, proc2
    s = xlocal + xremote

Processor 2:
    receive xremote, proc1
    send xlocal, proc1      [xlocal = x(2)]
    s = xlocal + xremote

° Second possible solution -- what could go wrong?

Processor 1:
    send xlocal, proc2      [xlocal = x(1)]
    receive xremote, proc2
    s = xlocal + xremote

Processor 2:
    send xlocal, proc1      [xlocal = x(2)]
    receive xremote, proc1
    s = xlocal + xremote

° What if send/receive acts like the telephone system? The post office?
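In MPI terms, the second solution is safe only if the sends are buffered ("post office"); a hedged way to sidestep the question is a combined send/receive, sketched here for two processes exchanging their local values (the values 1.0 and 2.0 stand in for x(1) and x(2)).

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* assume exactly 2 processes */

    double xlocal = (rank == 0) ? 1.0 : 2.0;    /* stands in for x(1), x(2) */
    double xremote;
    int other = 1 - rank;

    /* a combined send+receive cannot deadlock the way two blocking sends can */
    MPI_Sendrecv(&xlocal, 1, MPI_DOUBLE, other, 0,
                 &xremote, 1, MPI_DOUBLE, other, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: s = %f\n", rank, xlocal + xremote);
    MPI_Finalize();
    return 0;
}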

Programming Model 3: Data Parallel

Single sequential thread of control consisting of parallel operations.
Parallel operations applied to all (or a defined subset) of a data structure.
Communication is implicit in parallel operators and "shifted" data structures.
Elegant and easy to understand and reason about.
Like marching in a regiment.
Used by Matlab, accelerators, GPUs, etc.
Drawback: not all problems fit this model.

Example:
A = array of all data
fA = f(A)
s = sum(fA)
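A rough C analogue of that three-line example (a sketch only; f, the array size, and the values are placeholders). In a true data-parallel language each loop below would be a single implicit whole-array operation.

#include <stdio.h>

#define N 8

static double f(double x) { return 2.0 * x; }   /* placeholder elementwise operation */

int main(void) {
    double A[N], fA[N], s = 0.0;
    for (int i = 0; i < N; i++) A[i] = (double)i;  /* A = array of all data */

    /* fA = f(A): one logical parallel operation over the whole array */
    for (int i = 0; i < N; i++) fA[i] = f(A[i]);

    /* s = sum(fA): a reduction over the whole array */
    for (int i = 0; i < N; i++) s += fA[i];

    printf("s = %f\n", s);
    return 0;
}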


Machine Model 3: SIMD System

A large number of (usually) small processors.
A single "control processor" issues each instruction.
Each processor executes the same instruction.
Some processors may be turned off on some instructions.
Machines are not popular (CM2), but the programming model is.
Applicable to emerging accelerators (GPGPUs, Cell BE, etc.).

[Figure: a control processor driving nodes P1 ... Pn, each with its own memory and network interface (NI), over an interconnect]

Machine Model 4: Clusters of SMPs

Since small shared memory machines (SMPs) are the fastest commodity machines, why not build a larger machine by connecting many of them with a network?
CLUMP = Cluster of SMPs.
Shared memory within one SMP, but message passing outside of an SMP.
Two programming models:
– Treat the machine as "flat", always using message passing, even within an SMP (simple, but ignores an important part of the memory hierarchy).
– Expose two layers: shared memory and message passing (usually higher performance, but ugly to program); see the sketch below.
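A minimal sketch of the two-layer option in C: MPI between processes (nodes), OpenMP threads within each process. Requesting thread support via MPI_Init_thread with MPI_THREAD_FUNNELED is a common convention; the per-thread "work" here is a placeholder.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    int provided, rank;
    /* ask for an MPI library that tolerates threads (master thread makes all MPI calls) */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    /* shared-memory layer: OpenMP threads inside one SMP node */
    #pragma omp parallel reduction(+:local)
    {
        local += omp_get_thread_num() + 1;      /* stand-in for real per-thread work */
    }

    double total = 0.0;
    /* message-passing layer: combine per-process results across nodes */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("total = %f\n", total);
    MPI_Finalize();
    return 0;
}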


Programming Model 4: Bulk Synchronous

Used within the message passing or shared memory models as a programming convention.
Phases are separated by global barriers:
– Compute phases: all operate on local data (in distributed memory) or have read access to global data (in shared memory).
– Communication phases: all participate in rearrangement or reduction of global data.
Generally all are doing the "same thing" in a phase:
– all do f, but may all do different things within f.
Features the simplicity of data parallelism, but without the restrictions of a strict data parallel model.
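A hedged MPI sketch of this convention: each iteration is a local compute phase followed by a global communication phase (here a reduction); the collective itself acts as the barrier between phases, with an explicit MPI_Barrier added only to make the phase boundary visible.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = rank + 1.0, global = 0.0;
    for (int step = 0; step < 3; step++) {
        /* compute phase: operate on local data only */
        local = local * 0.5 + 1.0;              /* stand-in for real local work */

        /* communication phase: all participate in a reduction of global data */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        /* redundant after a collective, but marks the end of the phase explicitly */
        MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 0) printf("step %d: global = %f\n", step, global);
    }
    MPI_Finalize();
    return 0;
}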

Summary So Far

Historically, each parallel machine was unique, along with its programming model and programming language.
It was necessary to throw away software and start over with each new kind of machine -- ugh!!!
Now we distinguish the programming model from the underlying machine, so we can write portably correct codes that run on many machines.
– MPI is now the most portable option, but can be tedious.
Writing portably fast code requires tuning for the architecture.
– The algorithm design challenge is to make this process easy.
– Example: picking a block size, not rewriting the whole algorithm.


Steps in Writing Parallel Programs

Creating a Parallel Program

Identify work that can be done in parallel.
Partition work, and perhaps data, among logical processes (threads).
Manage the data access, communication, synchronization.

Goal: maximize speedup due to parallelism.

Speedup_prob(P procs) = (time to solve prob with "best" sequential solution) / (time to solve prob in parallel on P processors) <= P (?)

Efficiency(P) = Speedup(P) / P <= 1

° Key question is when you can solve each piece:
• statically, if information is known in advance.
• dynamically, otherwise.


Steps in the Process

[Figure: Overall Computation --Decomposition--> Grains of Work --Assignment--> Processes/Threads --Orchestration--> Processes/Threads --Mapping--> Processors]

Task: arbitrarily defined piece of work that forms the basic unit of concurrency.
Process/Thread: abstract entity that performs tasks.
– Tasks are assigned to threads via an assignment mechanism.
– Threads must coordinate to accomplish their collective tasks.
Processor: physical entity that executes a thread.

Decomposition

Break the overall computation into individual grains of work (tasks).
– Identify concurrency and decide at what level to exploit it.
– Concurrency may be statically identifiable or may vary dynamically.
– It may depend only on problem size, or it may depend on the particular input data.
Goal: identify enough tasks to keep the target range of processors busy, but not too many.
– Establishes an upper limit on the number of useful processors (i.e., scaling).
Tradeoff: sufficient concurrency vs. task control overhead.


Assignment

Determine the mechanism to divide work among threads:
– Functional partitioning:
» Assign logically distinct aspects of work to different threads, e.g. pipelining.
– Structural mechanisms:
» Assign iterations of a "parallel loop" according to a simple rule, e.g. proc j gets iterates j*n/p through (j+1)*n/p-1 (sketched after this list).
» Throw tasks in a bowl (task queue) and let threads feed.
– Data/domain decomposition:
» Data describing the problem has a natural decomposition.
» Break up the data and assign work associated with regions, e.g. parts of the physical system being simulated.
Goals:
– Balance the workload to keep everyone busy (all the time).
– Allow efficient orchestration.
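A small C sketch of the structural block-assignment rule above; the names n, p, and work() are hypothetical, and the loop over j runs serially here where each j would really be a separate thread or process.

#include <stdio.h>

/* hypothetical per-iteration task */
static void work(int i) { printf("iteration %d\n", i); }

/* block assignment: thread/process j gets iterations j*n/p through (j+1)*n/p - 1 */
static void block_assign(int j, int p, int n) {
    for (int i = j * n / p; i < (j + 1) * n / p; i++)
        work(i);
}

int main(void) {
    int n = 10, p = 3;
    for (int j = 0; j < p; j++)     /* serial stand-in for p parallel threads */
        block_assign(j, p, n);
    return 0;
}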

Orchestration

Provide a means of:
– Naming and accessing shared data.
– Communication and coordination among threads of control.
Goals:
– Correctness of the parallel solution -- respect the inherent dependencies within the algorithm.
– Avoid serialization.
– Reduce cost of communication, synchronization, and management.
– Preserve locality of data reference.


Mapping

Binding processes to physical processors.
Time to reach a processor across the network does not depend on which processor (roughly).
– Lots of old literature on "network topology", no longer so important.
Basic issue is how many remote accesses.

[Figure: two processors, each with a cache and memory, connected by a network; the local cache is fast, local memory is slow, and memory across the network is really slow]

Example: s = f(A[1]) + … + f(A[n])

Decomposition
– computing each f(A[j])
– n-fold parallelism, where n may be >> p
– computing sum s
Assignment
– thread k sums sk = f(A[k*n/p]) + … + f(A[(k+1)*n/p-1])
– thread 1 sums s = s1 + … + sp (for simplicity of this example)
– thread 1 communicates s to other threads
Orchestration
– starting up threads
– communicating, synchronizing with thread 1
Mapping
– processor j runs thread j