Parallel Computers Alexandre David 1.2.05 [email protected]

Alexandre David 1.2.05 [email protected]/~adavid/teaching/MVP-10/02-chap02-lect02.pdf · 08-02-2010 MVP'10 - Aalborg University 21 PRAM model Several execution units accessing


Parallel Computers

Alexandre David

1.2.05

[email protected]

08-02-2010 MVP'10 - Aalborg University 2

How much do we need to know?

Important to know the architecture of parallel hardware.
Not all details are important to programmers:
  • keep portability
  • keep up with technological changes

The point: Get a meaningful model.

Intel Core Duo

Cache coherence protocol: Modified, Exclusive, Shared, Invalid (MESI).
  • more shared L2
  • low latency
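
A minimal sketch of the four MESI states as a transition function for one cache line, seen from one cache. The event names and the `others_have_copy` flag are illustrative assumptions. Bus transactions and write-backs are not modeled, and real hardware is more detailed.

```python
# Simplified MESI transitions for ONE cache line in ONE cache.
# States: "M" (Modified), "E" (Exclusive), "S" (Shared), "I" (Invalid).
# Event names are hypothetical labels chosen for this sketch.

def mesi_next(state, event, others_have_copy=False):
    if state not in ("M", "E", "S", "I"):
        raise ValueError("unknown state: %r" % state)
    if event == "local_read":
        if state == "I":
            # Read miss: Shared if another cache holds the line, else Exclusive.
            return "S" if others_have_copy else "E"
        return state  # read hit, no state change
    if event == "local_write":
        # Writing makes this copy the only valid, dirty one.
        return "M"
    if event == "remote_read":
        # Another cache reads the line: a valid copy is downgraded to Shared.
        return "I" if state == "I" else "S"
    if event == "remote_write":
        # Another cache writes the line: our copy becomes stale.
        return "I"
    raise ValueError("unknown event: %r" % event)


if __name__ == "__main__":
    s = "I"
    for ev in ("local_read", "local_write", "remote_read", "remote_write"):
        s = mesi_next(s, ev)
        print(ev, "->", s)
```

Each remote event that hits a valid line costs bus traffic, which is why coherence is cheap only while accesses stay local.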

Example

The point: Relatively expensive.

AMD Dual Core Opteron

Cache coherence protocol: Modified, Owned, Exclusive, Shared, Invalid (MOESI).

easier for SMP

Core i7

SMP

Caches “snoop” on the bus.
The bus is a bottleneck.

Larger SMP – Sun Fire

18 boards connected by a crossbar switch.
Snooping buses.
Directory-based cache coherence protocol.
Scalable/higher latency.

Note: Expensive hardware.

Crossbar

N x N connections.
Expensive, limited.

Heterogeneous chips

GPUs
  • 800 ALUs on ATI’s latest 4800 series.
  • less logic, more computational units
FPGAs
  • PCI boards available
  • reconfigurable
Cell
  • Dual-threaded PPC – PPU, 64 bits
  • 8x SPU

Cell architecture

18.2 GB/s, 128 bits

No cache coherence protocol.

Different philosophy: the PPU is a coordinator, the SPUs do the job.

Difficult to program.

Clusters

“Cheap” PCs connected together.
  • Gigabit Ethernet
  • InfiniBand
  • …
Memory private to each machine, use message-based communication.
Scalable but high latency.
Sold by racks.

BlueGene

65536 x @ 700 MHz

interesting part

Interconnect

3-D torus for standard data transfers.

Collective network for fast reductions.
Very powerful.

Broadcast/Reduction

Broadcast Reduce
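
A sketch of why tree-shaped collectives are fast: both broadcast and reduce finish in about log2(n) rounds when every round doubles the number of nodes involved. This is a sequential simulation. The function names are hypothetical, and it does not model the actual BlueGene collective network.

```python
import operator

def tree_reduce(values, op):
    """Combine a non-empty list pairwise; returns (result, rounds).

    In round r, node i absorbs the partial result held by node i + 2**r.
    All pairs of one round could run in parallel on real hardware.
    """
    vals = list(values)
    n = len(vals)
    rounds = 0
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            vals[i] = op(vals[i], vals[i + stride])
        stride *= 2
        rounds += 1
    return vals[0], rounds

def tree_broadcast(value, n):
    """Copy value from node 0 to n nodes; returns (copies, rounds).

    In each round every node that already has the value sends it to one
    node that does not, so the number of holders doubles.
    """
    have = [None] * n
    have[0] = value
    rounds = 0
    holders = 1
    while holders < n:
        for i in range(holders):
            if i + holders < n:
                have[i + holders] = have[i]
        holders *= 2
        rounds += 1
    return have, rounds


if __name__ == "__main__":
    print(tree_reduce(range(1, 9), operator.add))
    print(tree_broadcast("x", 8))
```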

Cut-through routing

Simplified packet routing:
  • Packets take the same path (1x routing information).
  • In-sequence packet delivery (no sequencing).
  • Error detection at message level, cheap detection (for good networks).
  • Fixed-size unit for packets = flow control digits (flits).
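
A sketch comparing cut-through with store-and-forward routing under a common textbook cost model. The model and its parameters are assumptions that do not appear on the slide: `t_s` is the startup time, `t_h` the per-hop time, `t_w` the per-word transfer time, `m` the message size in words, and `l` the number of links traversed.

```python
def store_and_forward_time(m, l, t_s, t_h, t_w):
    # Every intermediate node receives the whole message before
    # forwarding it, so the full transfer cost is paid on each link.
    return t_s + l * (t_h + m * t_w)

def cut_through_time(m, l, t_s, t_h, t_w):
    # Flits are pipelined along one fixed path: the hop cost is paid
    # once per link, the transfer cost only once overall.
    return t_s + l * t_h + m * t_w


if __name__ == "__main__":
    m, l = 100, 4                # illustrative values only
    t_s, t_h, t_w = 10, 1, 2     # illustrative values only
    print("store-and-forward:", store_and_forward_time(m, l, t_s, t_h, t_w))
    print("cut-through:      ", cut_through_time(m, l, t_s, t_h, t_w))
```

Under this model the distance `l` and the message size `m` add up for cut-through instead of multiplying.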

Lessons

Very different architectures:
  • SMP
  • Distributed
But we want one meaningful model.
Hints:
  • local accesses - cheap
  • non-local accesses - expensive

RAM model

Sequential execution unit with unbounded memory.
  • every operation takes 1 unit of time
Limited:
  • ok for algorithms – reason on complexity
  • unrealistic

Application of the RAM model

Expected: O(n), O(log n) (array must be sorted)

update of location missing
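
The code shown on the original slide is not preserved here. The sketch below assumes that the two bounds refer to linear search and binary search, and counts one RAM step per loop iteration. The binary search includes the update of the search interval on every iteration.

```python
def linear_search(a, x):
    """O(n): inspect elements one by one. Returns (index, steps)."""
    steps = 0
    for i, v in enumerate(a):
        steps += 1
        if v == x:
            return i, steps
    return -1, steps

def binary_search(a, x):
    """O(log n): requires a sorted list. Returns (index, steps)."""
    lo, hi = 0, len(a) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if a[mid] == x:
            return mid, steps
        if a[mid] < x:
            lo = mid + 1   # update of the lower bound
        else:
            hi = mid - 1   # update of the upper bound
    return -1, steps


if __name__ == "__main__":
    data = list(range(1024))
    print(linear_search(data, 1023))
    print(binary_search(data, 1023))
```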

PRAM model

Several execution units accessing one shared unbounded memory.
  • global access
  • synchronous access – one global clock
  • contention resolved by pre-defined rules
EREW, CREW, CRCW, ERCW (E = exclusive, C = concurrent, R = read, W = write)
  • least powerful, least convenient: EREW
  • most powerful, most convenient: CRCW
  • lesson: reason on CRCW but apply on EREW because it is possible to simulate one with the other (in polynomial time)
Like RAM: good for algorithms, complexity…
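
A sketch of what the four PRAM variants allow within one synchronous step. The representation of a step as a list of `(op, address)` pairs, one per processor, and the function name are assumptions made for illustration. A read and a write to the same address in the same step are not treated as a conflict here, since PRAM steps are usually split into a read phase and a write phase.

```python
from collections import Counter

def models_allowing(step):
    """Return the PRAM variants that permit this step.

    step: list of (op, address), op is "R" or "W", one entry per processor.
    """
    reads = Counter(addr for op, addr in step if op == "R")
    writes = Counter(addr for op, addr in step if op == "W")
    concurrent_read = any(count > 1 for count in reads.values())
    concurrent_write = any(count > 1 for count in writes.values())

    variants = [
        # (name, concurrent reads allowed, concurrent writes allowed)
        ("EREW", False, False),
        ("CREW", True, False),
        ("ERCW", False, True),
        ("CRCW", True, True),
    ]
    allowed = []
    for name, cr_ok, cw_ok in variants:
        if (cr_ok or not concurrent_read) and (cw_ok or not concurrent_write):
            allowed.append(name)
    return allowed


if __name__ == "__main__":
    print(models_allowing([("R", 0), ("R", 0)]))   # two readers, one cell
    print(models_allowing([("W", 3), ("W", 3)]))   # two writers, one cell
    print(models_allowing([("R", 0), ("W", 1)]))   # no contention
```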

CTA (Candidate Type Architecture)

Account for communication costs.
Applies to clusters & SMPs.
Local/non-local accesses.
Goal: Achieve in practice the predicted running time. PRAM is misleading in that respect.
The catch: Not easy to estimate communication costs.

Model:
  • interconnected processors with RAM
  • topology not specified but this impacts communication costs.

CTA

SMP, Cluster, Cell, …
Memory latency specified as a function of the real architecture.
Non-local: λ.

Typical λ

Lesson

Use locality
  • temporal & spatial
  • sometimes redundant computation is better than sending data around
Exact number of processors supplied at runtime.
  • scale/not tied to one setup
Note: λ increases with P.
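
A back-of-the-envelope sketch of the CTA cost idea: a local access costs 1 time unit and a non-local access costs λ. The function names and the numbers are illustrative assumptions, not measured latencies.

```python
def cta_time(local_ops, nonlocal_refs, lam):
    """Estimated time: 1 unit per local operation, lam per non-local reference."""
    return local_ops + lam * nonlocal_refs

def cheaper_to_recompute(recompute_ops, values_to_fetch, lam):
    """True if recomputing locally beats fetching the values remotely."""
    return cta_time(recompute_ops, 0, lam) < cta_time(0, values_to_fetch, lam)


if __name__ == "__main__":
    # Same work, different lam: the non-local part quickly dominates.
    for lam in (10, 100, 1000):
        print("lam =", lam, "time =", cta_time(1000, 10, lam))
    # Recomputing a value in 200 local steps vs. fetching it once.
    print(cheaper_to_recompute(200, 1, lam=500))
    print(cheaper_to_recompute(200, 1, lam=100))
```

Because λ grows with P, a trade-off that favors communication on a small machine can flip toward redundant computation on a larger one.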

Memory reference mechanisms

Shared memory
  • avoid race conditions, needs synchronization
One-sided
  • not common
  • private (local) & shared non-coherent memory
Message passing – 2-sided
  • MPI
Complex communication protocols.
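
A minimal sketch of the shared-memory point above: several threads update one shared counter, and a lock serializes the read-modify-write so that no update is lost. It uses Python's standard `threading` module. The function name is hypothetical.

```python
import threading

def parallel_count(n_threads, increments):
    """Increment one shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # Without the lock, two threads could read the same old value
            # and one of the two increments would be lost (a race condition).
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(parallel_count(4, 1000))
```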

Memory consistency models

Sequential consistency – expensive.
  • serialize the operations of all processors
  • operations obey specified order
Relaxed consistency – weaker.
  • variations
Keep in mind: There are hardware tricks to get sequential consistency (CAS/TAS, i.e., compare-and-swap/test-and-set).
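
A sketch of how compare-and-swap is typically used: read the old value, compute the new one, and retry if another thread changed the cell in between. Python exposes no hardware CAS, so the `AtomicCell` class below is a hypothetical stand-in that emulates the atomic step with a lock. Only the retry loop mirrors real CAS-based code.

```python
import threading

class AtomicCell:
    """Hypothetical cell with an emulated compare-and-swap."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def load(self):
        with self._guard:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store new only if the cell still holds expected; report success."""
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

def atomic_increment(cell):
    """Retry loop: repeat until no other thread interfered."""
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + 1):
            return old + 1


if __name__ == "__main__":
    cell = AtomicCell()
    workers = [
        threading.Thread(target=lambda: [atomic_increment(cell) for _ in range(500)])
        for _ in range(4)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(cell.load())
```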

Interconnects

Bus Based Networks

No local cache

Local cache

Serialize accesses – cheap.

Crossbar Networks

Parallel access – expensive.

Omega networks

Multi-stage network – compromise cost/performance.
n nodes – log2 n stages.
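
A sketch of destination-tag routing through an omega network with n = 2^k inputs and k stages, assuming the usual construction: each stage applies a perfect shuffle (rotate the k-bit position left by one) followed by a 2x2 switch that selects its upper or lower output from one bit of the destination, most significant bit first. The function name is hypothetical, and contention between messages is not modeled.

```python
def omega_route(src, dst, k):
    """Positions visited from input src to output dst in a 2**k-input omega network."""
    n = 1 << k
    if not (0 <= src < n and 0 <= dst < n):
        raise ValueError("src and dst must be in range(2**k)")
    pos = src
    path = [pos]
    for stage in range(k):
        # Perfect shuffle: rotate the k-bit position left by one bit.
        pos = ((pos << 1) | (pos >> (k - 1))) & (n - 1)
        # 2x2 switch: the outgoing port is chosen by the next destination bit.
        bit = (dst >> (k - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path


if __name__ == "__main__":
    k = 3
    for src, dst in [(5, 2), (0, 7), (6, 6)]:
        print(src, "->", dst, ":", omega_route(src, dst, k))
```

After k stages every source bit has been shifted out and replaced by a destination bit, so the message arrives at `dst` whatever the source was.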

Linear Arrays and Meshes

Hypercubes

2^d nodes, d = dimension
good routing
relatively expensive
low congestion
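
A sketch of why routing on a hypercube is simple: two nodes are neighbors exactly when their labels differ in one bit, so a message can correct the differing bits one dimension at a time. The fixed lowest-to-highest bit order used here is one common choice (often called e-cube routing). The function name is hypothetical.

```python
def hypercube_route(src, dst, d):
    """Nodes visited from src to dst in a d-dimensional hypercube."""
    if not (0 <= src < (1 << d) and 0 <= dst < (1 << d)):
        raise ValueError("labels must be in range(2**d)")
    path = [src]
    cur = src
    for bit in range(d):
        # If the labels differ in this dimension, cross that link.
        if ((cur ^ dst) >> bit) & 1:
            cur ^= 1 << bit
            path.append(cur)
    return path


if __name__ == "__main__":
    print(hypercube_route(0b000, 0b111, 3))
    print(hypercube_route(0b101, 0b100, 3))
```

The number of hops equals the Hamming distance between the two labels, so no route is longer than d.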

Fat trees

More bandwidth where it is needed.

Evaluating The Networks

All the previous topologies have advantages and disadvantages.
Important factors: cost and performance.
Define criteria to characterize cost and performance.

Criteria

Diameter: maximum shortest-path distance between any two nodes pa ↔ pb.
Connectivity: measure of multiplicity of paths.
Bisection width: minimum number of links to cut in order to partition the network into 2 equal halves.
Bisection bandwidth: minimum volume of communication allowed between the 2 halves.
Cost: number of communication links, i.e., wires.
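
A sketch that evaluates two of these criteria, diameter and cost, for small networks given as adjacency sets. The three topology builders and all function names are illustrative assumptions. Bisection width is left out because computing it for an arbitrary graph is hard in general.

```python
from collections import deque

def ring(p):
    return {i: {(i - 1) % p, (i + 1) % p} for i in range(p)}

def hypercube(d):
    return {i: {i ^ (1 << b) for b in range(d)} for i in range(1 << d)}

def completely_connected(p):
    return {i: set(range(p)) - {i} for i in range(p)}

def diameter(graph):
    """Largest shortest-path distance over all node pairs (BFS from each node)."""
    best = 0
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

def cost(graph):
    """Number of links; each undirected link appears in two adjacency sets."""
    return sum(len(neighbors) for neighbors in graph.values()) // 2


if __name__ == "__main__":
    for name, g in [("ring(8)", ring(8)),
                    ("hypercube(3)", hypercube(3)),
                    ("completely_connected(8)", completely_connected(8))]:
        print(name, "diameter =", diameter(g), "cost =", cost(g))
```

For 8 nodes the ring is cheapest but has the largest diameter, the completely connected network is the opposite extreme, and the hypercube sits in between.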