
Tema 7: Sistemes Multiprocessadors (Memòria Compartida)

Eduard Ayguadé i Josep Llosa

These slides have been prepared using some material which is part of the teaching material of other professors at the Computer Architecture Department (Jesús Labarta, Miguel Valero, …). Other material available through the internet has also been used to prepare this chapter's slides.

Throughput vs. parallel programming

Throughput: multiple, unrelated instruction streams (programs) that execute concurrently on multiple processors. Multiprogramming of n tasks on p processors: each task receives p/n processors.

Parallel programming: multiple, related, interacting instruction streams (a single program) that execute concurrently to increase the speed of a single program. 1 task on m processors, each processor receives 1/m of the task: reduce response time.


Example: heat distribution problem

[Figure: a 1D rod discretized into points xi, with fixed boundary temperatures Temp=20 (left) and Temp=100 (right) and interior points initially at Temp=0; the temperature profile is shown at Time=0, 1, 2, 3, …, 99, 100.]

Update rule: xi(t) = (xi-1(t-1) + xi+1(t-1)) / 2

Heat distribution: sequential program

int t;                  /* time step */
int i;                  /* array index */
float x[n+2], y[n+2];   /* temperatures */

x[0] = y[0] = 20; x[n+1] = y[n+1] = 100;
for (i = 1; i < n+1; i++)
    x[i] = 0;

for (t = 1; t <= 100; t++) {
    for (i = 1; i < n+1; i++)
        y[i] = 0.5 * (x[i-1] + x[i+1]);
    swap(x, y);          /* exchange the roles of x and y */
}


Heat distribution: shared memory

int t;                  /* time step */
int i;                  /* my processor id */
float temp;
shared float x[n+2];    /* temperatures */

i = GetID() + 1;        /* GetID returns 0 .. P-1 */
x[i] = 0;
if (i == 1) { x[0] = 20; x[n+1] = 100; }

for (t = 1; t <= 100; t++) {
    temp = 0.5 * (x[i-1] + x[i+1]);
    x[i] = temp;
}

[Figure: one point assigned to each processor.]

Heat distribution: shared memory

int t;                  /* time step */
int i;                  /* my processor id */
float temp;
shared float x[n+2];    /* temperatures */

i = GetID() + 1;        /* GetID returns 0 .. P-1 */
x[i] = 0;
if (i == 1) { x[0] = 20; x[n+1] = 100; }

for (t = 1; t <= 100; t++) {
    barrier(n);
    temp = 0.5 * (x[i-1] + x[i+1]);
    barrier(n);
    x[i] = temp;
}
barrier(n);

[Figure: time diagram for P1, P2, P3, …, Pn alternating computing phases and waiting at each barrier.]


Heat distribution: shared memory

int t;                            /* time step */
int i;
int id;                           /* my processor id */
shared float x[1002];             /* temperatures */
int leftmost, rightmost;          /* processor boundary points */
float x_leftmost, x_rightmost;    /* their temperatures */

id = GetID();
leftmost = id*100 + 1; rightmost = (id+1)*100;
if (id == 0) { x[0] = 20; x[1001] = 100; }
for (i = leftmost; i <= rightmost; i++) x[i] = 0;

for (t = 1; t <= 100; t++) {
    barrier(n);
    x_leftmost  = 0.5 * (x[leftmost-1]  + x[leftmost+1]);
    x_rightmost = 0.5 * (x[rightmost-1] + x[rightmost+1]);
    barrier(n);
    for (i = leftmost+1; i < rightmost; i++)
        x[i] = 0.5 * (x[i-1] + x[i+1]);
    x[leftmost]  = x_leftmost;
    x[rightmost] = x_rightmost;
}
barrier(n);

[Figure: 1002 points distributed in blocks of 100 over 10 processors.]

Don't forget Amdahl's law

Example: p processors; program: 5% sequential, 95% suitable for speedup.

T(1) = Ts + Tp = 0.05 T + 0.95 T

T(p) = Ts + (Tp / p) + overhead(p)

S(p) = T(1) / T(p) = (Ts + Tp) / (Ts + Tp/p + overhead(p))

S(100) = T(1) / T(100) ≤ 1 / (0.05 + 0.95/100) ≈ 16.8
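As a tiny illustration (not part of the original slides), the bound can be evaluated directly; the sketch below assumes overhead(p) = 0:

double speedup(double fs, int p)          /* fs: sequential fraction of T(1) */
{
    return 1.0 / (fs + (1.0 - fs) / p);   /* Amdahl bound, overhead ignored */
}
/* speedup(0.05, 100) ≈ 16.8 */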


Shared-memory parallel programming

Any processor can directly reference any memory location. Communication occurs implicitly as a result of loads and stores.

[Figure: virtual address spaces for a collection of processes P0 … Pn communicating via shared addresses; a store by one process and a load by another to the shared portion of the address space map to common physical addresses in the machine physical address space, while each process also keeps a private portion (P0 private, P1 private, …, Pn private).]

Models of shared-memory multiprocessors

The Uniform Memory Access (UMA) model: the physical memory is shared by all processors, and all processors have equal access to all memory addresses. Also called SMP (Symmetric Multi-Processor).

[Figure: three UMA organizations: (a) no cache: processors P1 … Pn connected to memory M through an interconnection network; (b) shared cache: a single cache between the interconnection network and M; (c) private caches: each processor has its own cache and connects to M through the interconnection network.]


On a board: Quad Xeon MP server

All coherence and multiprocessing glue is in the processor module. Up to 4 processors.

On a chip: dual-core PowerPC 970MP


On a chip: POWER5

Dual-core SMT processor: 8-way superscalar SMT cores; 276M transistors, 389 mm2 die; operating in the lab at 1.8 GHz and 1.3 V; 1.9 MB L2 cache (point of coherency); on-chip L3 directory and memory controller.

Technology: 130 nm lithography, SOI, Cu wiring, 8 layers of metal.

High-speed elastic bus interface. I/Os: 2313 signal, 3057 power.

On a chip: Intel dual-core chips

Intel Core Duo processor with two cores, a unified 2 MB L2 cache, and 152 million transistors.

Intel Pentium D (dual core) and eXtreme Edition (dual core, multithreaded), 2 x 1 MB L2 cache, up to 3.2 GHz.


Chip multiprocessor vs. multithreading

[Figure: a chip multiprocessor (CPU1 … CPUn, each with its own PC, W-issue logic, fetch (IF), register file (RF), execution units (EU) and private instruction/data caches (IC, DC), sharing an SRAM secondary cache and the DRAM) compared with a multithreaded processor (a single k-way issue core with PC1 … PCm and register files 1 … m sharing the instruction fetch unit, the execution units and queues, the I and D caches, the SRAM secondary cache and the DRAM).]

SMT Issue Slot

SMT: ILP (instruction-level parallelism) like a superscalar and TLP (thread-level parallelism) like a multiprocessor. It hides long latencies like a multithreaded processor.

[Figure: issue-slot occupancy over time (processor cycles) for Thread1 … Thread4 on a superscalar (4 issue slots), a multithreaded processor (4 issue slots), an SMT (4 issue slots) and an MP2 (2 x 2 issue slots).]


SMT Implementation

Straightforward extension to a conventional superscalar design: the fetch unit is shared among the different threads, with fetch policies designed to improve fetch effectiveness; a single pool of physical registers is used for renaming.

Hyperthreading: Intel's form of SMT, available on Xeon servers; 30% improvement with 2 threads.

[Figure: SMT pipeline: multiple PCs feed a shared Fetch stage, followed by Decode, Rename, Instruction Window, Wakeup+Select, Register File, Bypass and Data Cache.]

Models of shared-memory multiprocessors

Distributed shared-memory or Non-Uniform Memory Access (NUMA) model:
Shared memory is physically distributed among processors. References to memory on other nodes must be sent across the interconnection network (transparently to the programmer). Access to local memory is much faster than access to remote memory.

[Figure: NUMA organization: nodes (P1, cache, M1) … (Pn, cache, Mn) connected by an interconnection network.]


POWER5 Multi-chip Module

95 mm x 95 mm. Four POWER5 chips, 2 processors per chip, 2-way simultaneous multithreaded. Four L3 cache chips. 4,491 signal I/Os. 89 layers of metal. Memory, I/O and JTAG interfaces.

[Figure: the MCM with four POWER5 chips and four L3 cache chips; each POWER5 connects to its L3 chip, its memory and its I/O, with on-book and off-book links.]

16-way Building Block

[Figure: a Book (16-way building block): an MCM with four POWER5 chips, each with its L3 cache and its own memory and I/O connections.]


64-way SMP Interconnection

Synchronization

Why do we need synchronization? We need to know when it is safe for different processors to access a shared-memory location or to signal a certain event.

Issues for synchronization:
Uninterruptible instructions to fetch and update memory (atomic operations).
User-level synchronization operations using these primitives (e.g. locks, barriers, …).
For large-scale MPs, synchronization can be a bottleneck; techniques to reduce contention and latency of synchronization.


Synchronization

Components:
Acquire method: acquire the right to the synchronization.
Waiting algorithm: busy wait (resource consumption during the wait), blocking (needs a mechanism to awake), or two-phase (wait for a while, then block).
Release method: allow others to proceed.

Synchronization

What's wrong with the following synchronization code? (assume that initially flag = 0)

Pi and Pj both execute:

lock:   ld  r1, flag    ; read flag
        st  flag, #1    ; set flag to 1
        bnz r1, lock    ; retry if flag was already set
        ...
unlock: st  flag, #0    ; release


Uninterruptible Fetch and Update

Test-and-set instruction: read the value and set the location to 1; the read and write sequence is indivisible.

Atomic exchange: interchange a value in a register with a value in memory; the swap operation is indivisible.

lock: t&s r1, flag
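As a minimal sketch (not in the original slides), a test-and-set spin lock can be written with C11 atomics, where atomic_flag_test_and_set plays the role of the t&s instruction:

#include <stdatomic.h>

static atomic_flag lockflag = ATOMIC_FLAG_INIT;   /* clear = free, set = locked */

void lock(void)
{
    /* t&s: atomically read the old value and set the flag;
       keep retrying while the old value was already "set" */
    while (atomic_flag_test_and_set(&lockflag))
        ;                                         /* busy wait */
}

void unlock(void)
{
    atomic_flag_clear(&lockflag);                 /* release: back to free */
}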

Uninterruptible Fetch and Update

Fetch-and-op (addr, register): read the value at addr into the register, and replace the value at addr with op(value). Common variants: fetch-and-increment, fetch-and-decrement, fetch-and-add (requires an additional addend register argument).

Compare-and-swap (addr, reg1, reg2): compare the value at addr with the contents of reg1; if equal, swap the value at addr with reg2.
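A hedged illustration (not from the slides): fetch-and-add can be built on top of compare-and-swap, here sketched with C11 atomics:

#include <stdatomic.h>

/* fetch-and-add built from compare-and-swap: retry until no other
   processor modified *addr between our read and our write */
int fetch_and_add(atomic_int *addr, int inc)
{
    int old = atomic_load(addr);
    while (!atomic_compare_exchange_weak(addr, &old, old + inc))
        ;          /* on failure, old is reloaded with the current value */
    return old;    /* value seen before our update */
}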


User-level synchronization

Spin locks: the processor continuously tries to acquire the lock, spinning around a loop trying to get it.

Assume: 0 → synchronization variable is free; 1 → synchronization variable is locked and unavailable.

Set a register to 1 and swap. The new value in the register determines success in getting the lock: 0 if you succeeded in setting the lock, 1 if another processor had already claimed access.

        mov  R2, #1
lockit: exch R2, 0(R1)   ; atomic exchange
        bnez R2, lockit  ; already locked?

User-level synchronization

Barriers: processors block until all have reached it. Can be implemented with a counter initially set to zero: every time a processor reaches the barrier, it atomically increments the counter and compares it with the total number of processors that need to reach the barrier.
If not equal → go and wait on a flag variable (set to 0 by the first processor reaching the barrier).
If equal → set the flag variable to 1.
After the barrier, all processors continue.


User-level synchronization

Centralized barrier:

Barrier(barr) {
    int mycount;
    lock(barr.lock);
    if (barr.counter == 0)
        barr.flag = 0;              // first to arrive resets the release flag
    mycount = ++barr.counter;       // pre-increment so the last to arrive sees P
    unlock(barr.lock);
    if (mycount == P) {             // last to arrive?
        barr.counter = 0;           // reset for next barrier
        barr.flag = 1;              // release waiting processors
    } else
        while (barr.flag == 0);     // busy wait for release
}

Uninterruptible Fetch and Update

Sometimes it is hard to have the read and the write in one instruction → use two instructions instead.

Load linked (or load locked) + store conditional: load linked returns the initial value; store conditional returns 1 if it succeeds (same processor that did the preceding load) and 0 otherwise.

The memory system is in charge of keeping track of the last processor that loaded a memory location.


Uninterruptible Fetch and Update

Example: doing an atomic swap of R4 and 0(R1)

try:  mov  R3, R4       ; move exchange value
      ll   R2, 0(R1)    ; load linked
      sc   R3, 0(R1)    ; store conditional
      beqz R3, try      ; branch if store fails (R3 = 0)
      mov  R4, R2       ; put loaded value in R4

Example: doing a fetch-and-increment, f&i R2, 0(R1)

try:  ll   R2, 0(R1)    ; load linked
      addi R3, R2, #1   ; increment (OK if reg-reg)
      sc   R3, 0(R1)    ; store conditional
      beqz R3, try      ; branch if store fails (R3 = 0)

Caches are critical for performance

Shared cache:
Hit and miss latency increased due to the intervening switch and the cache size.
Interference: positive (prefetching across processors, sharing of working sets) and negative (conflicts when replacing data in the cache).
High bandwidth needs.
No coherence problem.
Used in the first SMPs in the mid 80s to connect a couple of processors on a board; today, for multiprocessors on a chip (for small-scale systems or nodes).

[Figure: P1 … Pn share a single cache through an interconnect (IN), in front of memory M.]


Caches are critical for performance

Private caches:
Reduce average data access time (latency).
Reduce bandwidth demands placed on the shared interconnect.
Many processors can share data efficiently: automatic replication closer to the processor.
But private processor caches create a problem: copies of a variable can be present in multiple caches … coherence problem when one processor writes.

[Figure: P1 … Pn, each with a private cache ($), connected to memory M through an interconnect (IN).]

Cache coherence problem

[Figure: three CPUs with private caches, shared memory holding x=0, and I/O, all on an interconnection network:
1. CPU1 reads x: a copy x=0 is loaded into CPU1's cache.
2. CPU2 reads x: a copy x=0 is loaded into CPU2's cache.
3. CPU2 writes x=1: only CPU2's cached copy is updated; memory still holds x=0.
4. CPU1 or CPU3 read an incorrect (stale) x=0.]


Cache coherence using a bus

Snooping-based coherence. The bus is a broadcast medium: transactions on the bus are visible to all processors, so processors (or their representatives) can snoop (monitor) the bus and take action on relevant events (e.g. change state).

[Figure: bus-based multiprocessor: each CPU has a cache with a snoopy cache controller (SCC); the CPUs, the shared memory and the I/O are connected by the interconnection network (bus).]

Cache coherence using a bus

The cache controller now receives inputs from both sides: requests from the processor (processor side) and bus requests/responses from the snooper (bus side). Dual tags (not data) or a dual-ported tag RAM.
In either case, it takes zero or more actions: updates state, responds with data, generates new bus transactions.
The protocol is a distributed algorithm: cooperating state machines (set of states, state transition diagram, actions).
The granularity of coherence is typically the cache block, like that of allocation in the cache and transfer to/from the cache.


Caches with write buffers

Need to snoop the write buffer

Two protocols to handle write operations

Write-update: the writing processor broadcasts the new value and forces all others to update their copies.
Similar to a write-through cache policy: new data appears sooner in caches and main memory, but higher bus traffic.
High bandwidth requirements: every write from every processor goes to the shared bus and memory.

Processor: 2 GHz, 2 instructions/cycle, and 10% of instructions are 8-byte stores → each processor writes 2 x 10^9 x 2 x 0.10 x 8 bytes = 3.2 GB of data per second.
Motherboard: FSB @ 400/800 MHz, 128-bit data bus → peak bandwidth = 6.4/12.8 GB/s.
So: up to 2 or 4 processors per board.


Bus bandwidth

Example: number of read and write references, shared and non-shared (all units in millions):

            Instructions   FLOPS    References   Total Reads   Total Writes   Shared Reads   Shared Writes
Barnes-Hut      2002.74    239.24       720.13        406.84         313.29         225.04           93.23
LU               489.52     92.2        151.07        103.09          47.99          92.79           44.74
Raytrace         833.35      -          290.35        210.03          80.31         161.1            22.35
Ocean            376.51    101.54        99.7          81.16          18.54          76.95           16.97
Radix             14.02      -            5.27          2.9            2.37           1.34            0.81

Two protocols to handle write operations

Write-invalidate: the writing processor forces all others to invalidate their copies.
Similar to a write-back cache policy. The dirty in-cache state now indicates exclusive ownership:
Exclusive: the only cache with a valid copy; subsequent writes to the same block do not need to broadcast an invalidate → less bus traffic.
Owner: responsible for supplying the block upon a request for it; the request is generated when accessing an invalidated cache line (which causes a miss).
This is the most popular policy.
Bus arbitration resolves races when two processors attempt to write the same block.


Example: write-invalidate protocol

States:
Invalid (I) or not present (miss).
Shared (S): one or more copies.
Dirty or Modified (M): only one copy.

CPU events: PrRd (read), PrWr (write).

Bus transactions:
BusRd: asks for a copy with no intent to modify.
BusRdX: asks for a copy with intent to modify (invalidation).
BusWB (or Flush): updates memory.

Actions taken: update state, perform a bus transaction, flush the value onto the bus.

[Figure: CPU with its cache and snoopy cache controller (SCC) attached to the bus; each cache block i has an associated protocol state.]

Example: write-invalidate protocol (MSI)

[Figure: MSI state transition diagram; edges are labelled with CPU event / bus transaction.]

BusUpgr instead of BusRdX: upgrade from S to M (no data transfer needed) in order to reduce traffic.
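The transition diagram itself is a figure in the original slides; the following rough C sketch (my own, under the stated MSI-with-BusUpgr assumptions) summarises the processor-side and bus-side transitions:

typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;
typedef enum { NO_TRANS, BUS_RD, BUS_RDX, BUS_UPGR } bus_trans_t;

/* Processor-side transition for one cache block: update the block state
   and return the bus transaction that has to be issued (if any). */
bus_trans_t pr_event(msi_state_t *st, int is_write)
{
    switch (*st) {
    case INVALID:                         /* miss: fetch the block */
        *st = is_write ? MODIFIED : SHARED;
        return is_write ? BUS_RDX : BUS_RD;
    case SHARED:
        if (is_write) { *st = MODIFIED; return BUS_UPGR; }   /* S -> M, no data */
        return NO_TRANS;                  /* read hit */
    case MODIFIED:
        return NO_TRANS;                  /* read or write hit */
    }
    return NO_TRANS;
}

/* Bus-side (snooped) transition for the same block. */
void bus_event(msi_state_t *st, bus_trans_t t)
{
    if (*st == MODIFIED && t == BUS_RD)
        *st = SHARED;                     /* flush the dirty copy, keep it shared */
    else if (t == BUS_RDX || t == BUS_UPGR)
        *st = INVALID;                    /* another processor wants to write */
}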


Example: write-invalidate protocol (MESI)

[Figure: MESI state transition diagram; edges are labelled with CPU event / bus transaction.]

States:
– invalid
– exclusive (only this cache has a copy, but not modified)
– shared (two or more caches may have copies)
– modified (dirty)

Transactions:
– BusRd(S) means the shared line is asserted on the BusRd transaction
– Flush': if cache-to-cache sharing, only one cache flushes the data

Example: write-invalidate protocol (MESI)

In the MESI protocol, we need to know whether the block is shared, i.e. whether to transition to the E or the S state on a read miss.

MESI also requires a priority scheme for cache-to-cache transfers: which cache should supply the data when it is in the shared state? Commercial implementations allow memory to provide the data.


Bandwidth requirements on shared bus

            Instructions (M)   FLOPS (M)   References (M)
Radix                  14.02        -               5.27
Raytrace              833.35        -             290.35
Ocean                 376.51     101.54             99.7

Cache: 1 MByte, 4-way set associative, LRU, 64-byte lines. State transitions per 1000 references (rows: from state, columns: to state):

Raytrace     NP         I         E         S          M
  NP          0          0         0.00068   0.2581     0.0354
  I           0.0262     0         0         0.5766     0.0324
  E           0          0.0003    0.0241    0.0001     0.006
  S           0.0092     0.7264    0         162.569    0.2768
  M           0.0219     0.0305    0         0.3125     839.507

Radix        NP         I         E         S          M
  NP          0          0         0.003     1.3153     5.4051
  I           0.0485     0         0         0.4119     1.705
  E           0.0006     0.0008    0.0284    0.0001     0
  S           0.0109     0.4156    0         84.6671    0.3051
  M           0.0173     4.2886    0         1.4982     906.945

Ocean        NP         I         E         S          M
  NP          0          0         1.2484    0.9565     1.6787
  I           0.6362     0         0         1.8676     0.0015
  E           0.204      0         4.004     0.0241     0.9955
  S           0.41715    2.4994    0         134.716    2.2392
  M           2.6259     0.0015    0         2.2996     843.565

Bandwidth requirements on shared bus

Bus transaction generated on each state transition (rows: from state, columns: to state):

       NP       I        E             S        M
NP     -        -        BusRd         BusRd    BusRdX
I      -        -        BusRd         BusRd    BusRdX
E      -        -        -             -        -
S      -        -        not possible  -        BusUpgr
M      BusWB    BusWB    not possible  BusWB    -

Transfers:
• Command and address on the address bus lines: 6 bytes
• Data on the data bus lines: 64 bytes

Bytes transferred on each transition (70 = 6 + 64 for BusRd, BusRdX and BusWB; 6 for BusUpgr):

       NP    I     E     S     M
NP     0     0     70    70    70
I      0     0     70    70    70
E      0     0     0     0     0
S      0     0     0     0     6
M      70    70    0     70    0


Bandwidth requirements on shared bus

Total number of bytes on the bus per 1000 references (rows: from state, columns: to state):

Raytrace     NP        I        E        S        M
  NP          0         0        0.0476   18.067   2.478
  I           0         0        0        40.362   2.268
  E           0         0        0        0        0
  S           0         0        0        0        1.6608
  M           1.533     2.135    0        21.875   0
Total: 90.4264

Radix        NP        I        E        S        M
  NP          0         0        0.21     92.071   378.357
  I           0         0        0        28.833   119.35
  E           0         0        0        0        0
  S           0         0        0        0        1.8306
  M           1.211     300.202  0        104.874  0
Total: 1026.939

Ocean        NP        I        E        S         M
  NP          0         0        87.388   66.955    117.509
  I           0         0        0        130.732   0.105
  E           0         0        0        0         0
  S           0         0        0        0         13.4352
  M           183.813   0.105    0        160.972   0
Total: 761.0142

Bandwidth requirements on shared bus

Total used bandwidth, assuming a processor running at 2 GHz, 2 instructions per cycle:

            Instructions (M)   References (M)   Bytes per 1000 refs   BW (MB/s)
Raytrace              833.35           290.35                90.4264     126.022
Radix                  14.02             5.27              1026.939     1544.070
Ocean                 376.51             99.7               761.014      806.067

(For Ocean, for example: 4 x 10^9 instructions/s x (99.7/376.51) references/instruction x 0.761 bytes/reference ≈ 806 MB/s.)

So, for example, an FSB @ 400/800 MHz with a 128-bit data bus (6.4/12.8 GB/s peak bandwidth) provides enough capacity for 8/16 processors running Ocean (but only 4/8 running Radix).
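As a small check (not in the slides), the bandwidth column follows from the other three; the helper below assumes the 2 GHz, 2 instructions/cycle processor stated above:

/* Bus bandwidth implied by the table above. */
double bus_bw_MBps(double instr_M, double refs_M, double bytes_per_1000refs)
{
    double instr_per_s    = 2.0e9 * 2.0;             /* 2 GHz x 2 instructions/cycle */
    double refs_per_instr = refs_M / instr_M;
    double bytes_per_ref  = bytes_per_1000refs / 1000.0;
    return instr_per_s * refs_per_instr * bytes_per_ref / 1.0e6;
}
/* bus_bw_MBps(376.51, 99.7, 761.014) ≈ 806 MB/s (Ocean) */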


Bandwidth requirements on shared bus

How much bandwidth does MESI save compared with an MSI protocol with no BusUpgr? In that case every transition into M from E or S also needs a full BusRdX:

       NP       I        E             S        M
NP     -        -        BusRd         BusRd    BusRdX
I      -        -        BusRd         BusRd    BusRdX
E      -        -        -             -        BusRdX
S      -        -        not possible  -        BusRdX
M      BusWB    BusWB    not possible  BusWB    -

            Instructions (M)   References (M)   Bytes per 1000 refs   BW (MB/s)
Raytrace              833.35           290.35               108.5616      151.29
Radix                  14.02             5.27              1046.465      1573.42
Ocean                 376.51             99.7               974.008      1031.67

(Compared with the MESI numbers above, e.g. Ocean now needs 974.0 instead of 761.0 bytes per 1000 references: MESI with BusUpgr saves about 22% of the bus bandwidth for Ocean.)

User-level synchronization (revisited)

What about synchronization with cache coherency? We want to spin on the cached copy to avoid the full memory latency, and we are likely to get cache hits for such variables.
Problem: the exchange includes a write, which invalidates all other copies; this generates considerable bus traffic.

        mov  R2, #1
lockit: exch R2, 0(R1)   ; atomic exchange
        bnez R2, lockit  ; already locked?


User-level synchronization (revisited)

Use a standard load (not a test-and-set): each processor reads its own local copy until the value changes, so the spin doesn't generate traffic until there is a reason to, e.g. the lock got released. There is no guarantee that when you then try you won't be too late, hence a starvation potential unless the algorithm has some fairness model built in; but the win is clear.

try:    mov  R2, #1
lockit: ld   R3, 0(R1)   ; load the lock variable
        bnez R3, lockit  ; not free → spin
        exch R2, 0(R1)   ; atomic exchange
        bnez R2, try     ; already locked?
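A minimal C11-atomics sketch of this test-and-test-and-set idea (not from the slides; lockvar and the 0/1 convention are assumptions):

#include <stdatomic.h>

static atomic_int lockvar = 0;          /* 0 = free, 1 = taken */

void lock(void)
{
    for (;;) {
        while (atomic_load(&lockvar) != 0)
            ;                           /* spin on the cached copy: no bus traffic */
        if (atomic_exchange(&lockvar, 1) == 0)
            return;                     /* exchange succeeded: lock acquired */
        /* somebody else got it first: go back to the read-only spin */
    }
}

void unlock(void)
{
    atomic_store(&lockvar, 0);
}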

Directory-based cache coherency

Scaling to shared-memory systems with a large number of processors:

[Figure: nodes, each with a CPU, cache, memory and directory, connected by an interconnection network.]


Directory-based cache coherency

Directory: it tracks the state of the data stored in its memory. If a block is shared, it must track which processors have it; if a block is valid only in one node, it should know which one.

[Figure: node memory with its directory; each directory entry holds the list of sharers and a state.]

Nodes currently having the line: a bit string, 1 bit per node, where the position indicates the node.

State:
– I: invalid
– E: exclusive (only this node has a copy, but not modified)
– S: shared (two or more nodes may have copies)
– M: modified (dirty)
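As an illustrative sketch (not from the slides), a directory entry with a full bit vector of sharers might look like this in C, assuming at most 64 nodes:

#include <stdint.h>

#define NODES 64                        /* assumed node count (<= 64 for one word) */

typedef enum { DIR_I, DIR_E, DIR_S, DIR_M } dir_state_t;

/* One directory entry per memory block: protocol state plus a full
   bit vector of sharers (1 bit per node, bit position = node id). */
typedef struct {
    dir_state_t state;
    uint64_t    sharers;
} dir_entry_t;

static inline void add_sharer(dir_entry_t *e, int node)  { e->sharers |=  (1ULL << node); }
static inline void del_sharer(dir_entry_t *e, int node)  { e->sharers &= ~(1ULL << node); }
static inline int  has_sharer(const dir_entry_t *e, int node) { return (e->sharers >> node) & 1; }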

Directory-based cache coherency

Typically 3 processors are involved in each transaction:
Local node: where a request originates.
Home node: where the memory location of an address resides.
Remote node: has a copy of the cache block, whether exclusive or shared.

Example: get a block for read access.

[Figure: 1: RdReq from the Local node to the Home node; 2: RdResp from Home back to Local.]


Directory-based cache coherency

Access to a clean copy, with two sharers, for write:
[Figure: 1: RdXReq from Local to Home; 2: Home replies with the ReadersList; 3: Local sends Invalidate to each Reader; 4: each Reader replies done.]

Access to a dirty block for read:
[Figure: 1: RdReq from Local to Home; 2: Home replies with the Owner; 3: Intervention sent to the Owner; 4: the Owner sends the Data to Local and a WB_D (write-back) to Home.]

Distributed-memory multiprocessors

Distributed memory:
Each processor can only reference its own local memory.
References to memory on other nodes must be done by sending messages across the interconnection network (not transparent to the programmer).
Access to local memory is much faster than messaging.

[Figure: nodes (P1, cache, M1) … (Pn, cache, Mn) connected by an interconnection network.]


Message passing programming model

Send specifies the buffer to be transmitted and the receiving process. Receive specifies the sending process and the application storage to receive into. Optional tag on the send and matching rule on the receive. Implicit synchronization (e.g. non-blocking send and blocking receive). Many overheads: copying, buffer management, protection.

[Figure: process P executes Send Q, X, tag and process Q executes Receive P, Y, tag; the matching pair copies data from address X in P's local process address space to address Y in Q's local process address space.]

Example: heat distribution problem

[Figure: the same 1D heat distribution problem as before: boundary temperatures Temp=20 and Temp=100, interior points starting at Temp=0, and update rule xi(t) = (xi-1(t-1) + xi+1(t-1)) / 2, iterated for Time=0 … 100.]


Heat distribution: message passing

int t;                     /* time step */
int i;
int id;                    /* my processor id */
int num_points;            /* number of points per processor */
float buf[1000/P + 2];
float *x = buf + 1;        /* x[-1] .. x[num_points]: temperatures, including shadow points */

num_points = 1000/P;
id = GetID();
if (id == 0)     x[-1] = 20;
if (id == (P-1)) x[num_points] = 100;
for (i = 0; i < num_points; i++) x[i] = 0;

for (t = 1; t <= 100; t++) {
    /* exchange boundary values with the neighbours (sends assumed non-blocking) */
    if (id > 0)    send(id-1, x[0]);
    if (id < P-1)  send(id+1, x[num_points-1]);
    if (id < P-1)  receive(id+1, &x[num_points]);
    if (id > 0)    receive(id-1, &x[-1]);
    for (i = 0; i < num_points; i++)
        x[i] = 0.5 * (x[i-1] + x[i+1]);
}

[Figure: 1002 points distributed in blocks over 10 processors; each processor id keeps shadow points holding the time t-1 boundary values of its neighbours id-1 and id+1.]
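For comparison (not in the original slides), the same shadow-point exchange written with the standard MPI calls MPI_Sendrecv and MPI_PROC_NULL might look like this; variable names follow the code above:

#include <mpi.h>

/* Halo exchange for one time step; x has shadow cells x[-1] and x[num_points],
   id is this process rank and P the number of processes. */
void exchange_shadows(float *x, int num_points, int id, int P)
{
    int left  = (id > 0)     ? id - 1 : MPI_PROC_NULL;   /* MPI_PROC_NULL: no-op at the ends */
    int right = (id < P - 1) ? id + 1 : MPI_PROC_NULL;

    /* send my rightmost point, receive the left neighbour's point into x[-1] */
    MPI_Sendrecv(&x[num_points-1], 1, MPI_FLOAT, right, 0,
                 &x[-1],           1, MPI_FLOAT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send my leftmost point, receive the right neighbour's point into x[num_points] */
    MPI_Sendrecv(&x[0],            1, MPI_FLOAT, left,  0,
                 &x[num_points],   1, MPI_FLOAT, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}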

Networks for distributed memory

Uniform access time networks


Networks for distributed memory

Non-uniform access time networks: the time to reach another node in the network depends on its position.

[Figure: example topologies: line, ring, 2D torus, 3D torus, hypercube.]

Networks for distributed memory

Routing algorithm. Deterministic routing: there exists a single way to reach a node; the path is determined by the identifiers of the source and target nodes.

E.g. Manhattan routing in a 2D mesh: to go from (x1, y1) to (x2, y2), first traverse Δx = x2 - x1, then Δy = y2 - y1.

[Figure: a 4x4 2D mesh with nodes labelled with their (x, y) coordinates in binary, and a 3-dimensional hypercube with nodes labelled 000 … 111.]

Hypercube routing: R = X xor Y; traverse the dimensions whose address bits differ, in order.
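As a small sketch (not from the slides) of this dimension-order idea, one routing step in a hypercube can be computed from R = X xor Y:

/* One routing step in a hypercube: flip the lowest differing address bit.
   X = current node, Y = destination node. */
unsigned next_hop(unsigned X, unsigned Y)
{
    unsigned R = X ^ Y;        /* R = X xor Y: dimensions still to traverse */
    if (R == 0)
        return X;              /* already at the destination */
    return X ^ (R & -R);       /* cross the first (lowest) differing dimension */
}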


www.top500.org (November 2006)

[Figure: Top500 performance list (GFlops).]

MareNostrum: 10240 CPUs, 94.2 TFlops
20 Terabytes of memory, 280+90 Terabytes of disk storage, area: 120 m2.


JS21 Processor Blade:
2 PPC970MP chips, each one with two cores, 2.3 GHz; symmetric multiprocessor.
4 GB memory, shared memory.
40 GBytes local IDE disk.
Myrinet network adapter and 2 Gigabit ports.

MareNostrum: 10240 CPUs, 94.2 TFlops

[Figure: PowerPC 970MP.]

MareNostrum: 10240 CPUs, 94.2 TFlops


Blade Center:
• 14 blades per chassis (7U)
• 56 processors
• 56 GB memory
• Gigabit ethernet switch

6 chassis in a rack (42U):
• 336 processors (3.1 TFlops)
• 336 GB memory

MareNostrum: 10240 CPUs, 94.2 TFlops

Myrinet Clos 256+256 switch

MareNostrum: 10240 CPUs, 94.2 TFlops


Myrinet Spine 1280 switch

Spine switches are used to connect 3, 4 and 5 Clos 256+256 units. Cabling spines to build larger systems.

[Figure: five Clos 256x256 switches connected through two Spine 1280 switches; each Clos 256x256 has 256 links (1 to each node), 250 MB/s in each direction, and 128 links up to the spines.]

MareNostrum: 10240 CPUs, 94.2 TFlops


IBM Blue Gene Light

BG/L compute card

IBM Blue Gene Light

Nodes are interconnected as a 64x32x32 3D torus. It is easy to build large systems, as each node connects only to its six nearest neighbors; full routing in hardware.
A global reduction tree supports fast global operations, such as a global max/sum in a few microseconds over 65,536 nodes.
Auxiliary networks for I/O and global operations.


NASA Columbia Supercomputer

Global shared memory across 4 CPUs, 8 Gigabytes. 4 Itanium2 processors per C-Brick.

NASA Columbia Supercomputer

Global shared memory across 64 CPUs, 128 Gigabytes.


NASA Columbia Supercomputer

Global shared memory across 512 CPUs, 1 Terabyte.

NASA Columbia Supercomputer

20 SGI® Altix™ 3700 superclusters, each with 512 Itanium2 processors (1.5 GHz, 6 MB cache). InfiniBand network to connect the superclusters.