Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
08-02-2010 MVP'10 - Aalborg University 2
How much do we need to know?
Important to know the architecture of parallel hardware.Not all details are important to programmers
keep portabilitykeep up with technological changes
The point: Get a meaningful model.
08-02-2010 MVP'10 - Aalborg University 3
Intel Core Duo
cache coherence protocol ModifiedExclusiveSharedInvalid
more shared L2low latency
08-02-2010 MVP'10 - Aalborg University 4
ExampleThe point: Relatively expensive.
08-02-2010 MVP'10 - Aalborg University 5
AMD Dual Core Opteron
cache coherence protocol
ModifiedOwnedExclusiveSharedInvalid
easier for SMP
08-02-2010 MVP'10 - Aalborg University 6
Core i7
08-02-2010 MVP'10 - Aalborg University 7
SMP
caches “snoop” on the bus bottleneck
08-02-2010 MVP'10 - Aalborg University 8
Larger SMP – Sun Fire18 boards connectedby a crossbarswitch.Snooping buses.Directory based cachecoherence protocol.Scalable/higherlatency.
Note: Expensivehardware.
08-02-2010 MVP'10 - Aalborg University 9
CrossbarN x N connections.Expensive, limited.
08-02-2010 MVP'10 - Aalborg University 10
Heterogeneous chipsGPUs
800 ALU on ATI’s latest 4800 series.--logic, ++computational units
FPGAsPCI boards availablereconfigurable
CellDual-threaded PPC – PPU, 64 bits8x SPU
08-02-2010 MVP'10 - Aalborg University 11
Cell architecture
18.2GB/s128 bits
No cache coherenceprotocol.
Different philosophy:the PPU is a coordinator,the SPUs do the job.
Difficult to program.
08-02-2010 MVP'10 - Aalborg University 12
Clusters“Cheap” PCs connected together.
GB ethernetInfiniband…Memory private to each machine,use message based communication.Scalable but high latency.Sold by racks.
08-02-2010 MVP'10 - Aalborg University 13
BlueGene
65536 x@ 700MHz
interesting part
08-02-2010 MVP'10 - Aalborg University 14
Interconnect
3-D torus for standarddata transfers.
Collective network forfast reductions.Very powerful.
08-02-2010 MVP'10 - Aalborg University 15
Broadcast/Reduction
Broadcast Reduce
08-02-2010 MVP'10 - Aalborg University 16
Cut-through routingSimplified packet routing:
Packets take the same path(1x routing information).In sequence packet delivery (no sequencing).Error detection at message level, cheap detection (for good networks).Fixed size unit for packets = flow control digits (flits).
08-02-2010 MVP'10 - Aalborg University 17
08-02-2010 MVP'10 - Aalborg University 18
LessonsVery different architectures.
SMPDistributed
But we want one meaningful model.Hints:
local accesses - cheapnon-local accesses - expensive
08-02-2010 MVP'10 - Aalborg University 19
RAM modelSequential execution unit with unbounded memory.
every operation takes 1 unit of time
Limitedok for algorithms – reason on complexityunrealistic
08-02-2010 MVP'10 - Aalborg University 20
Application of the RAM model
Expected: O(n), O(log n)(array must be sorted)
update of location missing
08-02-2010 MVP'10 - Aalborg University 21
PRAM modelSeveral execution units accessing one shared unbounded memory
global accesssynchronous access – one global clockcontention resolved by pre-defined rules
EREW, CREW, CRCW, ERCWleast powerful, least convenient: EREWmost powerful, most convenient: CRCWlesson: reason on CRCW but apply on EREW because it is possible to simulate one with the other (in polynomial time)
like RAM: good for algorithms, complexity…
08-02-2010 MVP'10 - Aalborg University 22
CTA (Candidate Type Architecture)
Account for communication costs.Applies to clusters & SMPs.Local/non-local accesses.Goal: Achieve in practice the predicted running time. PRAM is misleading in that respect.The catch: Not easy to estimate communication costs.
Model:interconnected processors with RAMtopology not specified but this impacts communication costs.
08-02-2010 MVP'10 - Aalborg University 23
CTA
SMPClusterCell…Memory latencyspecified infunction of thereal architecture.Non-local: λ.
08-02-2010 MVP'10 - Aalborg University 24
Typical λ
08-02-2010 MVP'10 - Aalborg University 25
LessonUse locality
temporal & spatialsometimes redundant computation is better than sending data around
Exact number of processors supplied at runtime.
scale/not tied to one setupNote: λ increases with P.
08-02-2010 MVP'10 - Aalborg University 26
Memory reference mechanismsShared memory
avoid race conditions, needs synchronization
One-sidednot commonprivate (local) & shared non-coherent memory
Message passing – 2-sidedMPIComplex communication protocols.
08-02-2010 MVP'10 - Aalborg University 27
Memory consistency modelsSequential consistency – expensive.
serialize the operations of all processorsoperations obey specified order
Relaxed consistency – weaker.variations
Keep in mind: There are hardware tricks to get sequential consistency (CAS/TAS).
Interconnects
08-02-2010 MVP'10 - Aalborg University 29
Bus Based Networks
No local cache
Local cache
Serialize accesses – cheap.
08-02-2010 MVP'10 - Aalborg University 30
Crossbar Networks
Parallel access – expensive.
08-02-2010 MVP'10 - Aalborg University 31
Omega networksMulti-stage network – compromise cost/performance.N nodes – log n stages.
08-02-2010 MVP'10 - Aalborg University 32
Linear Arrays and Meshes
08-02-2010 MVP'10 - Aalborg University 33
Hypercubes
2^d nodes,d=dimension,good routing,relatively expensive,low congestion
08-02-2010 MVP'10 - Aalborg University 34
Fat trees
More bandwidth where it is needed.
08-02-2010 MVP'10 - Aalborg University 35
Evaluating The NetworksAll the previous topologies have advantages and disadvantages.Important factors: cost and performance.Define criteria to characterize cost and performance.
08-02-2010 MVP'10 - Aalborg University 36
CriteriaDiameter: maximum distance pa ↔ pb.Connectivity: measure of multiplicity of paths.Bisection width: minimum number of links to cut in order to partition the network in 2 equal halves.Bisection bandwidth: minimum volume of communication allowed between 2 halves.Cost: number of communication links, i.e., wires.