
Tema 7: Sistemes Multiprocessadors (Memòria Compartida)

Eduard Ayguadé i Josep Llosa

These slides have been prepared using some material which is part of the teaching material of other professors at the Computer Architecture Department (Jesús Labarta, Miguel Valero, …). Other material available through the internet has also been used to prepare this chapter's slides.

Throughput vs. parallel programming

Throughput: multiple, unrelated instruction streams (programs) that execute concurrently on multiple processors. Multiprogramming of n tasks on p processors: each task receives p/n processors.

Parallel programming: multiple, related, interacting instruction streams (a single program) that execute concurrently to increase the speed of a single program. 1 task on m processors, each processor receives 1/m of the task: reduce response time.


Example: heat distribution problem

[Figure: a 1D rod discretized into points xi, with fixed boundary temperatures Temp=20 (left) and Temp=100 (right) and interior points initially at Temp=0; the temperature profile is shown at Time=0, 1, 2, 3, …, 99, 100.]

Update rule: xi(t) = (xi-1(t-1) + xi+1(t-1)) / 2

Heat distribution: sequential program

int t;                  /* time step */
int i;                  /* array index */
float x[n+2], y[n+2];   /* temperatures */

x[0] = y[0] = 20; x[n+1] = y[n+1] = 100;
for (i = 1; i < n+1; i++)
    x[i] = 0;

for (t = 1; t <= 100; t++) {
    for (i = 1; i < n+1; i++)
        y[i] = 0.5 * (x[i-1] + x[i+1]);
    swap(x, y);          /* exchange the roles of x and y */
}


Heat distribution: shared memory

int t;                  /* time step */
int i;                  /* my processor id */
float temp;
shared float x[n+2];    /* temperatures */

i = GetID() + 1;        /* GetID returns 0 .. P-1 */
x[i] = 0;
if (i == 1) { x[0] = 20; x[n+1] = 100; }

for (t = 1; t <= 100; t++) {
    temp = 0.5 * (x[i-1] + x[i+1]);
    x[i] = temp;
}

[Figure: one point assigned to each processor.]

Heat distribution: shared memory

int t;                  /* time step */
int i;                  /* my processor id */
float temp;
shared float x[n+2];    /* temperatures */

i = GetID() + 1;        /* GetID returns 0 .. P-1 */
x[i] = 0;
if (i == 1) { x[0] = 20; x[n+1] = 100; }

for (t = 1; t <= 100; t++) {
    barrier(n);
    temp = 0.5 * (x[i-1] + x[i+1]);
    barrier(n);
    x[i] = temp;
}
barrier(n);

[Figure: time diagram for P1, P2, P3, …, Pn alternating computing phases and waiting at each barrier.]


Heat distribution: shared memory

int t;                            /* time step */
int i;
int id;                           /* my processor id */
shared float x[1002];             /* temperatures */
int leftmost, rightmost;          /* processor boundary points */
float x_leftmost, x_rightmost;    /* their temperatures */

id = GetID();
leftmost = id*100 + 1; rightmost = (id+1)*100;
if (id == 0) { x[0] = 20; x[1001] = 100; }
for (i = leftmost; i <= rightmost; i++) x[i] = 0;

for (t = 1; t <= 100; t++) {
    barrier(n);
    x_leftmost  = 0.5 * (x[leftmost-1]  + x[leftmost+1]);
    x_rightmost = 0.5 * (x[rightmost-1] + x[rightmost+1]);
    barrier(n);
    for (i = leftmost+1; i < rightmost; i++)
        x[i] = 0.5 * (x[i-1] + x[i+1]);
    x[leftmost]  = x_leftmost;
    x[rightmost] = x_rightmost;
}
barrier(n);

[Figure: 1002 points distributed in blocks of 100 over 10 processors.]

Don't forget Amdahl's law

Example: p processors; program: 5% sequential, 95% suitable for speedup.

T(1) = Ts + Tp = 0.05 T + 0.95 T

T(p) = Ts + (Tp / p) + overhead(p)

S(p) = T(1) / T(p) = (Ts + Tp) / (Ts + Tp/p + overhead(p))

S(100) = T(1) / T(100) ≤ 1 / (0.05 + 0.95/100) ≈ 16.8
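As a tiny illustration (not part of the original slides), the bound can be evaluated directly; the sketch below assumes overhead(p) = 0:

double speedup(double fs, int p)          /* fs: sequential fraction of T(1) */
{
    return 1.0 / (fs + (1.0 - fs) / p);   /* Amdahl bound, overhead ignored */
}
/* speedup(0.05, 100) ≈ 16.8 */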


Shared-memory parallel programming

Any processor can directly reference any memory location. Communication occurs implicitly as a result of loads and stores.

[Figure: virtual address spaces for a collection of processes P0 … Pn communicating via shared addresses; a store by one process and a load by another to the shared portion of the address space map to common physical addresses in the machine physical address space, while each process also keeps a private portion (P0 private, P1 private, …, Pn private).]

Models of shared-memory multiprocessors

The Uniform Memory Access (UMA) model: the physical memory is shared by all processors, and all processors have equal access to all memory addresses. Also called SMP (Symmetric Multi-Processor).

[Figure: three UMA organizations: (a) no cache: processors P1 … Pn connected to memory M through an interconnection network; (b) shared cache: a single cache between the interconnection network and M; (c) private caches: each processor has its own cache and connects to M through the interconnection network.]


On a board: Quad Xeon MP server

All coherence and multiprocessing glue is in the processor module. Up to 4 processors.

On a chip: dual-core PowerPC 970MP


On a chip: POWER5

Dual-core SMT processor: 8-way superscalar SMT cores; 276M transistors, 389 mm2 die; operating in the lab at 1.8 GHz and 1.3 V; 1.9 MB L2 cache (point of coherency); on-chip L3 directory and memory controller.

Technology: 130 nm lithography, SOI, Cu wiring, 8 layers of metal.

High-speed elastic bus interface. I/Os: 2313 signal, 3057 power.

On a chip: Intel dual-core chips

Intel Core Duo processor with two cores, a unified 2 MB L2 cache, and 152 million transistors.

Intel Pentium D (dual core) and eXtreme Edition (dual core, multithreaded), 2 x 1 MB L2 cache, up to 3.2 GHz.


Chip multiprocessor vs. multithreading

[Figure: a chip multiprocessor (CPU1 … CPUn, each with its own PC, W-issue logic, fetch (IF), register file (RF), execution units (EU) and private instruction/data caches (IC, DC), sharing an SRAM secondary cache and the DRAM) compared with a multithreaded processor (a single k-way issue core with PC1 … PCm and register files 1 … m sharing the instruction fetch unit, the execution units and queues, the I and D caches, the SRAM secondary cache and the DRAM).]

SMT Issue Slot

SMT: ILP (instruction-level parallelism) like a superscalar and TLP (thread-level parallelism) like a multiprocessor. It hides long latencies like a multithreaded processor.

[Figure: issue-slot occupancy over time (processor cycles) for Thread1 … Thread4 on a superscalar (4 issue slots), a multithreaded processor (4 issue slots), an SMT (4 issue slots) and an MP2 (2 x 2 issue slots).]


SMT Implementation

Straightforward extension to a conventional superscalar design: the fetch unit is shared among the different threads, with fetch policies designed to improve fetch effectiveness; a single pool of physical registers is used for renaming.

Hyperthreading: Intel's form of SMT, available on Xeon servers; 30% improvement with 2 threads.

[Figure: SMT pipeline: multiple PCs feed a shared Fetch stage, followed by Decode, Rename, Instruction Window, Wakeup+Select, Register File, Bypass and Data Cache.]

Models of shared-memory multiprocessors

Distributed shared-memory or Non-Uniform Memory Access (NUMA) model:
Shared memory is physically distributed among processors. References to memory on other nodes must be sent across the interconnection network (transparently to the programmer). Access to local memory is much faster than access to remote memory.

[Figure: NUMA organization: nodes (P1, cache, M1) … (Pn, cache, Mn) connected by an interconnection network.]


POWER5 Multi-chip Module

95 mm x 95 mm. Four POWER5 chips, 2 processors per chip, 2-way simultaneous multithreaded. Four L3 cache chips. 4,491 signal I/Os. 89 layers of metal. Memory, I/O and JTAG interfaces.

[Figure: the MCM with four POWER5 chips and four L3 cache chips; each POWER5 connects to its L3 chip, its memory and its I/O, with on-book and off-book links.]

16-way Building Block

[Figure: a Book (16-way building block): an MCM with four POWER5 chips, each with its L3 cache and its own memory and I/O connections.]


64-way SMP Interconnection

Synchronization

Why do we need synchronization? We need to know when it is safe for different processors to access a shared-memory location or to signal a certain event.

Issues for synchronization:
Uninterruptible instructions to fetch and update memory (atomic operations).
User-level synchronization operations using these primitives (e.g. locks, barriers, …).
For large-scale MPs, synchronization can be a bottleneck; techniques to reduce contention and latency of synchronization.


Synchronization

Components:
Acquire method: acquire the right to the synchronization.
Waiting algorithm: busy wait (resource consumption during the wait), blocking (needs a mechanism to awake), or two-phase (wait for a while, then block).
Release method: allow others to proceed.

Synchronization

What's wrong with the following synchronization code? (assume that initially flag = 0)

Pi and Pj both execute:

lock:   ld  r1, flag    ; read flag
        st  flag, #1    ; set flag to 1
        bnz r1, lock    ; retry if flag was already set
        ...
unlock: st  flag, #0    ; release


Uninterruptible Fetch and Update

Test-and-set instruction: read the value and set the location to 1; the read and write sequence is indivisible.

Atomic exchange: interchange a value in a register with a value in memory; the swap operation is indivisible.

lock: t&s r1, flag
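As a minimal sketch (not in the original slides), a test-and-set spin lock can be written with C11 atomics, where atomic_flag_test_and_set plays the role of the t&s instruction:

#include <stdatomic.h>

static atomic_flag lockflag = ATOMIC_FLAG_INIT;   /* clear = free, set = locked */

void lock(void)
{
    /* t&s: atomically read the old value and set the flag;
       keep retrying while the old value was already "set" */
    while (atomic_flag_test_and_set(&lockflag))
        ;                                         /* busy wait */
}

void unlock(void)
{
    atomic_flag_clear(&lockflag);                 /* release: back to free */
}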

Uninterruptible Fetch and Update

Fetch-and-op (addr, register): read the value at addr into the register, and replace the value at addr with op(value). Common variants: fetch-and-increment, fetch-and-decrement, fetch-and-add (requires an additional addend register argument).

Compare-and-swap (addr, reg1, reg2): compare the value at addr with the contents of reg1; if equal, swap the value at addr with reg2.
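A hedged illustration (not from the slides): fetch-and-add can be built on top of compare-and-swap, here sketched with C11 atomics:

#include <stdatomic.h>

/* fetch-and-add built from compare-and-swap: retry until no other
   processor modified *addr between our read and our write */
int fetch_and_add(atomic_int *addr, int inc)
{
    int old = atomic_load(addr);
    while (!atomic_compare_exchange_weak(addr, &old, old + inc))
        ;          /* on failure, old is reloaded with the current value */
    return old;    /* value seen before our update */
}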


User-level synchronization

Spin locks: the processor continuously tries to acquire the lock, spinning around a loop trying to get it.

Assume: 0 → synchronization variable is free; 1 → synchronization variable is locked and unavailable.

Set a register to 1 and swap. The new value in the register determines success in getting the lock: 0 if you succeeded in setting the lock, 1 if another processor had already claimed access.

        mov  R2, #1
lockit: exch R2, 0(R1)   ; atomic exchange
        bnez R2, lockit  ; already locked?

User-level synchronization

Barriers: processors block until all have reached it. Can be implemented with a counter initially set to zero: every time a processor reaches the barrier, it atomically increments the counter and compares it with the total number of processors that need to reach the barrier.
If not equal → go and wait on a flag variable (set to 0 by the first processor reaching the barrier).
If equal → set the flag variable to 1.
After the barrier, all processors continue.


User-level synchronization

Centralized barrier:

Barrier(barr) {
    int mycount;
    lock(barr.lock);
    if (barr.counter == 0)
        barr.flag = 0;              // first to arrive resets the release flag
    mycount = ++barr.counter;       // pre-increment so the last to arrive sees P
    unlock(barr.lock);
    if (mycount == P) {             // last to arrive?
        barr.counter = 0;           // reset for next barrier
        barr.flag = 1;              // release waiting processors
    } else
        while (barr.flag == 0);     // busy wait for release
}

Uninterruptible Fetch and Update

Sometimes it is hard to have the read and the write in one instruction → use two instructions instead.

Load linked (or load locked) + store conditional: load linked returns the initial value; store conditional returns 1 if it succeeds (same processor that did the preceding load) and 0 otherwise.

The memory system is in charge of keeping track of the last processor that loaded a memory location.


Uninterruptible Fetch and Update

Example: doing an atomic swap of R4 and 0(R1)

try:  mov  R3, R4       ; move exchange value
      ll   R2, 0(R1)    ; load linked
      sc   R3, 0(R1)    ; store conditional
      beqz R3, try      ; branch if store fails (R3 = 0)
      mov  R4, R2       ; put loaded value in R4

Example: doing a fetch-and-increment, f&i R2, 0(R1)

try:  ll   R2, 0(R1)    ; load linked
      addi R3, R2, #1   ; increment (OK if reg-reg)
      sc   R3, 0(R1)    ; store conditional
      beqz R3, try      ; branch if store fails (R3 = 0)

Caches are critical for performance

Shared cache:
Hit and miss latency increased due to the intervening switch and the cache size.
Interference: positive (prefetching across processors, sharing of working sets) and negative (conflicts when replacing data in the cache).
High bandwidth needs.
No coherence problem.
Used in the first SMPs in the mid 80s to connect a couple of processors on a board; today, for multiprocessors on a chip (for small-scale systems or nodes).

[Figure: P1 … Pn share a single cache through an interconnect (IN), in front of memory M.]


Caches are critical for performance

Private caches:
Reduce average data access time (latency).
Reduce bandwidth demands placed on the shared interconnect.
Many processors can share data efficiently: automatic replication closer to the processor.
But private processor caches create a problem: copies of a variable can be present in multiple caches … coherence problem when one processor writes.

[Figure: P1 … Pn, each with a private cache ($), connected to memory M through an interconnect (IN).]

Cache coherence problem

[Figure: three CPUs with private caches, shared memory holding x=0, and I/O, all on an interconnection network:
1. CPU1 reads x: a copy x=0 is loaded into CPU1's cache.
2. CPU2 reads x: a copy x=0 is loaded into CPU2's cache.
3. CPU2 writes x=1: only CPU2's cached copy is updated; memory still holds x=0.
4. CPU1 or CPU3 read an incorrect (stale) x=0.]


Cache coherence using a bus

Snooping-based coherence. The bus is a broadcast medium: transactions on the bus are visible to all processors, so processors (or their representatives) can snoop (monitor) the bus and take action on relevant events (e.g. change state).

[Figure: bus-based multiprocessor: each CPU has a cache with a snoopy cache controller (SCC); the CPUs, the shared memory and the I/O are connected by the interconnection network (bus).]

Cache coherence using a bus

The cache controller now receives inputs from both sides: requests from the processor (processor side) and bus requests/responses from the snooper (bus side). Dual tags (not data) or a dual-ported tag RAM.
In either case, it takes zero or more actions: updates state, responds with data, generates new bus transactions.
The protocol is a distributed algorithm: cooperating state machines (set of states, state transition diagram, actions).
The granularity of coherence is typically the cache block, like that of allocation in the cache and transfer to/from the cache.


Caches with write buffers

Need to snoop the write buffer

Two protocols to handle write operations

Write-update: the writing processor broadcasts the new value and forces all others to update their copies.
Similar to a write-through cache policy: new data appears sooner in caches and main memory, but higher bus traffic.
High bandwidth requirements: every write from every processor goes to the shared bus and memory.

Processor: 2 GHz, 2 instructions/cycle, and 10% of instructions are 8-byte stores → each processor writes 2 x 10^9 x 2 x 0.10 x 8 bytes = 3.2 GB of data per second.
Motherboard: FSB @ 400/800 MHz, 128-bit data bus → peak bandwidth = 6.4/12.8 GB/s.
So: up to 2 or 4 processors per board.


Bus bandwidth

Example: number of read and write references, shared and non-shared (all units in millions):

            Instructions   FLOPS    References   Total Reads   Total Writes   Shared Reads   Shared Writes
Barnes-Hut      2002.74    239.24       720.13        406.84         313.29         225.04           93.23
LU               489.52     92.2        151.07        103.09          47.99          92.79           44.74
Raytrace         833.35      -          290.35        210.03          80.31         161.1            22.35
Ocean            376.51    101.54        99.7          81.16          18.54          76.95           16.97
Radix             14.02      -            5.27          2.9            2.37           1.34            0.81

Two protocols to handle write operations

Write-invalidate: the writing processor forces all others to invalidate their copies.
Similar to a write-back cache policy. The dirty in-cache state now indicates exclusive ownership:
Exclusive: the only cache with a valid copy; subsequent writes to the same block do not need to broadcast an invalidate → less bus traffic.
Owner: responsible for supplying the block upon a request for it; the request is generated when accessing an invalidated cache line (which causes a miss).
This is the most popular policy.
Bus arbitration resolves races when two processors attempt to write the same block.


Example: write-invalidate protocol

States:
Invalid (I) or not present (miss).
Shared (S): one or more copies.
Dirty or Modified (M): only one copy.

CPU events: PrRd (read), PrWr (write).

Bus transactions:
BusRd: asks for a copy with no intent to modify.
BusRdX: asks for a copy with intent to modify (invalidation).
BusWB (or Flush): updates memory.

Actions taken: update state, perform a bus transaction, flush the value onto the bus.

[Figure: CPU with its cache and snoopy cache controller (SCC) attached to the bus; each cache block i has an associated protocol state.]

Example: write-invalidate protocol (MSI)

[Figure: MSI state transition diagram; edges are labelled with CPU event / bus transaction.]

BusUpgr instead of BusRdX: upgrade from S to M (no data transfer needed) in order to reduce traffic.
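The transition diagram itself is a figure in the original slides; the following rough C sketch (my own, under the stated MSI-with-BusUpgr assumptions) summarises the processor-side and bus-side transitions:

typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;
typedef enum { NO_TRANS, BUS_RD, BUS_RDX, BUS_UPGR } bus_trans_t;

/* Processor-side transition for one cache block: update the block state
   and return the bus transaction that has to be issued (if any). */
bus_trans_t pr_event(msi_state_t *st, int is_write)
{
    switch (*st) {
    case INVALID:                         /* miss: fetch the block */
        *st = is_write ? MODIFIED : SHARED;
        return is_write ? BUS_RDX : BUS_RD;
    case SHARED:
        if (is_write) { *st = MODIFIED; return BUS_UPGR; }   /* S -> M, no data */
        return NO_TRANS;                  /* read hit */
    case MODIFIED:
        return NO_TRANS;                  /* read or write hit */
    }
    return NO_TRANS;
}

/* Bus-side (snooped) transition for the same block. */
void bus_event(msi_state_t *st, bus_trans_t t)
{
    if (*st == MODIFIED && t == BUS_RD)
        *st = SHARED;                     /* flush the dirty copy, keep it shared */
    else if (t == BUS_RDX || t == BUS_UPGR)
        *st = INVALID;                    /* another processor wants to write */
}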


Example: write-invalidate protocol (MESI)

[Figure: MESI state transition diagram; edges are labelled with CPU event / bus transaction.]

States:
– invalid
– exclusive (only this cache has a copy, but not modified)
– shared (two or more caches may have copies)
– modified (dirty)

Transactions:
– BusRd(S) means the shared line is asserted on the BusRd transaction
– Flush': if cache-to-cache sharing, only one cache flushes the data

Example: write-invalidate protocol (MESI)

In the MESI protocol, we need to know whether the block is shared, i.e. whether to transition to the E or the S state on a read miss.

MESI also requires a priority scheme for cache-to-cache transfers: which cache should supply the data when it is in the shared state? Commercial implementations allow memory to provide the data.


Bandwidth requirements on shared bus

            Instructions (M)   FLOPS (M)   References (M)
Radix                  14.02        -               5.27
Raytrace              833.35        -             290.35
Ocean                 376.51     101.54             99.7

Cache: 1 MByte, 4-way set associative, LRU, 64-byte lines. State transitions per 1000 references (rows: from state, columns: to state):

Raytrace     NP         I         E         S          M
  NP          0          0         0.00068   0.2581     0.0354
  I           0.0262     0         0         0.5766     0.0324
  E           0          0.0003    0.0241    0.0001     0.006
  S           0.0092     0.7264    0         162.569    0.2768
  M           0.0219     0.0305    0         0.3125     839.507

Radix        NP         I         E         S          M
  NP          0          0         0.003     1.3153     5.4051
  I           0.0485     0         0         0.4119     1.705
  E           0.0006     0.0008    0.0284    0.0001     0
  S           0.0109     0.4156    0         84.6671    0.3051
  M           0.0173     4.2886    0         1.4982     906.945

Ocean        NP         I         E         S          M
  NP          0          0         1.2484    0.9565     1.6787
  I           0.6362     0         0         1.8676     0.0015
  E           0.204      0         4.004     0.0241     0.9955
  S           0.41715    2.4994    0         134.716    2.2392
  M           2.6259     0.0015    0         2.2996     843.565

Bandwidth requirements on shared bus

Bus transaction generated on each state transition (rows: from state, columns: to state):

       NP       I        E             S        M
NP     -        -        BusRd         BusRd    BusRdX
I      -        -        BusRd         BusRd    BusRdX
E      -        -        -             -        -
S      -        -        not possible  -        BusUpgr
M      BusWB    BusWB    not possible  BusWB    -

Transfers:
• Command and address on the address bus lines: 6 bytes
• Data on the data bus lines: 64 bytes

Bytes transferred on each transition (70 = 6 + 64 for BusRd, BusRdX and BusWB; 6 for BusUpgr):

       NP    I     E     S     M
NP     0     0     70    70    70
I      0     0     70    70    70
E      0     0     0     0     0
S      0     0     0     0     6
M      70    70    0     70    0


Bandwidth requirements on shared bus

Total number of bytes on the bus per 1000 references (rows: from state, columns: to state):

Raytrace     NP        I        E        S        M
  NP          0         0        0.0476   18.067   2.478
  I           0         0        0        40.362   2.268
  E           0         0        0        0        0
  S           0         0        0        0        1.6608
  M           1.533     2.135    0        21.875   0
Total: 90.4264

Radix        NP        I        E        S        M
  NP          0         0        0.21     92.071   378.357
  I           0         0        0        28.833   119.35
  E           0         0        0        0        0
  S           0         0        0        0        1.8306
  M           1.211     300.202  0        104.874  0
Total: 1026.939

Ocean        NP        I        E        S         M
  NP          0         0        87.388   66.955    117.509
  I           0         0        0        130.732   0.105
  E           0         0        0        0         0
  S           0         0        0        0         13.4352
  M           183.813   0.105    0        160.972   0
Total: 761.0142

Bandwidth requirements on shared bus

Total used bandwidth, assuming a processor running at 2 GHz, 2 instructions per cycle:

            Instructions (M)   References (M)   Bytes per 1000 refs   BW (MB/s)
Raytrace              833.35           290.35                90.4264     126.022
Radix                  14.02             5.27              1026.939     1544.070
Ocean                 376.51             99.7               761.014      806.067

(For Ocean, for example: 4 x 10^9 instructions/s x (99.7/376.51) references/instruction x 0.761 bytes/reference ≈ 806 MB/s.)

So, for example, an FSB @ 400/800 MHz with a 128-bit data bus (6.4/12.8 GB/s peak bandwidth) provides enough capacity for 8/16 processors running Ocean (but only 4/8 running Radix).
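As a small check (not in the slides), the bandwidth column follows from the other three; the helper below assumes the 2 GHz, 2 instructions/cycle processor stated above:

/* Bus bandwidth implied by the table above. */
double bus_bw_MBps(double instr_M, double refs_M, double bytes_per_1000refs)
{
    double instr_per_s    = 2.0e9 * 2.0;             /* 2 GHz x 2 instructions/cycle */
    double refs_per_instr = refs_M / instr_M;
    double bytes_per_ref  = bytes_per_1000refs / 1000.0;
    return instr_per_s * refs_per_instr * bytes_per_ref / 1.0e6;
}
/* bus_bw_MBps(376.51, 99.7, 761.014) ≈ 806 MB/s (Ocean) */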


Bandwidth requirements on shared bus

How much bandwidth does MESI save compared with an MSI protocol with no BusUpgr? In that case every transition into M from E or S also needs a full BusRdX:

       NP       I        E             S        M
NP     -        -        BusRd         BusRd    BusRdX
I      -        -        BusRd         BusRd    BusRdX
E      -        -        -             -        BusRdX
S      -        -        not possible  -        BusRdX
M      BusWB    BusWB    not possible  BusWB    -

            Instructions (M)   References (M)   Bytes per 1000 refs   BW (MB/s)
Raytrace              833.35           290.35               108.5616      151.29
Radix                  14.02             5.27              1046.465      1573.42
Ocean                 376.51             99.7               974.008      1031.67

(Compared with the MESI numbers above, e.g. Ocean now needs 974.0 instead of 761.0 bytes per 1000 references: MESI with BusUpgr saves about 22% of the bus bandwidth for Ocean.)

User-level synchronization (revisited)

What about synchronization with cache coherency? We want to spin on the cached copy to avoid the full memory latency, and we are likely to get cache hits for such variables.
Problem: the exchange includes a write, which invalidates all other copies; this generates considerable bus traffic.

        mov  R2, #1
lockit: exch R2, 0(R1)   ; atomic exchange
        bnez R2, lockit  ; already locked?


User-level synchronization (revisited)

Use a standard load (not a test-and-set): each processor reads its own local copy until the value changes, so the spin doesn't generate traffic until there is a reason to, e.g. the lock got released. There is no guarantee that when you then try you won't be too late, hence a starvation potential unless the algorithm has some fairness model built in; but the win is clear.

try:    mov  R2, #1
lockit: ld   R3, 0(R1)   ; load the lock variable
        bnez R3, lockit  ; not free → spin
        exch R2, 0(R1)   ; atomic exchange
        bnez R2, try     ; already locked?
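A minimal C11-atomics sketch of this test-and-test-and-set idea (not from the slides; lockvar and the 0/1 convention are assumptions):

#include <stdatomic.h>

static atomic_int lockvar = 0;          /* 0 = free, 1 = taken */

void lock(void)
{
    for (;;) {
        while (atomic_load(&lockvar) != 0)
            ;                           /* spin on the cached copy: no bus traffic */
        if (atomic_exchange(&lockvar, 1) == 0)
            return;                     /* exchange succeeded: lock acquired */
        /* somebody else got it first: go back to the read-only spin */
    }
}

void unlock(void)
{
    atomic_store(&lockvar, 0);
}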

Directory-based cache coherency

Scaling to shared-memory systems with a large number of processors:

[Figure: nodes, each with a CPU, cache, memory and directory, connected by an interconnection network.]


Directory-based cache coherency

Directory: it tracks the state of the data stored in its memory. If a block is shared, it must track which processors have it; if a block is valid only in one node, it should know which one.

[Figure: node memory with its directory; each directory entry holds the list of sharers and a state.]

Nodes currently having the line: a bit string, 1 bit per node, where the position indicates the node.

State:
– I: invalid
– E: exclusive (only this node has a copy, but not modified)
– S: shared (two or more nodes may have copies)
– M: modified (dirty)
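As an illustrative sketch (not from the slides), a directory entry with a full bit vector of sharers might look like this in C, assuming at most 64 nodes:

#include <stdint.h>

#define NODES 64                        /* assumed node count (<= 64 for one word) */

typedef enum { DIR_I, DIR_E, DIR_S, DIR_M } dir_state_t;

/* One directory entry per memory block: protocol state plus a full
   bit vector of sharers (1 bit per node, bit position = node id). */
typedef struct {
    dir_state_t state;
    uint64_t    sharers;
} dir_entry_t;

static inline void add_sharer(dir_entry_t *e, int node)  { e->sharers |=  (1ULL << node); }
static inline void del_sharer(dir_entry_t *e, int node)  { e->sharers &= ~(1ULL << node); }
static inline int  has_sharer(const dir_entry_t *e, int node) { return (e->sharers >> node) & 1; }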

Directory-based cache coherency

Typically 3 processors are involved in each transaction:
Local node: where a request originates.
Home node: where the memory location of an address resides.
Remote node: has a copy of the cache block, whether exclusive or shared.

Example: get a block for read access.

[Figure: 1: RdReq from the Local node to the Home node; 2: RdResp from Home back to Local.]


Directory-based cache coherency

Access to a clean copy, with two sharers, for write:
[Figure: 1: RdXReq from Local to Home; 2: Home replies with the ReadersList; 3: Local sends Invalidate to each Reader; 4: each Reader replies done.]

Access to a dirty block for read:
[Figure: 1: RdReq from Local to Home; 2: Home replies with the Owner; 3: Intervention sent to the Owner; 4: the Owner sends the Data to Local and a WB_D (write-back) to Home.]

Distributed-memory multiprocessors

Distributed memory:
Each processor can only reference its own local memory.
References to memory on other nodes must be done by sending messages across the interconnection network (not transparent to the programmer).
Access to local memory is much faster than messaging.

[Figure: nodes (P1, cache, M1) … (Pn, cache, Mn) connected by an interconnection network.]


Message passing programming model

Send specifies the buffer to be transmitted and the receiving process. Receive specifies the sending process and the application storage to receive into. Optional tag on the send and matching rule on the receive. Implicit synchronization (e.g. non-blocking send and blocking receive). Many overheads: copying, buffer management, protection.

[Figure: process P executes Send Q, X, tag and process Q executes Receive P, Y, tag; the matching pair copies data from address X in P's local process address space to address Y in Q's local process address space.]

Example: heat distribution problem

[Figure: the same 1D heat distribution problem as before: boundary temperatures Temp=20 and Temp=100, interior points starting at Temp=0, and update rule xi(t) = (xi-1(t-1) + xi+1(t-1)) / 2, iterated for Time=0 … 100.]


Heat distribution: message passing

int t;                     /* time step */
int i;
int id;                    /* my processor id */
int num_points;            /* number of points per processor */
float buf[1000/P + 2];
float *x = buf + 1;        /* x[-1] .. x[num_points]: temperatures, including shadow points */

num_points = 1000/P;
id = GetID();
if (id == 0)     x[-1] = 20;
if (id == (P-1)) x[num_points] = 100;
for (i = 0; i < num_points; i++) x[i] = 0;

for (t = 1; t <= 100; t++) {
    /* exchange boundary values with the neighbours (sends assumed non-blocking) */
    if (id > 0)    send(id-1, x[0]);
    if (id < P-1)  send(id+1, x[num_points-1]);
    if (id < P-1)  receive(id+1, &x[num_points]);
    if (id > 0)    receive(id-1, &x[-1]);
    for (i = 0; i < num_points; i++)
        x[i] = 0.5 * (x[i-1] + x[i+1]);
}

[Figure: 1002 points distributed in blocks over 10 processors; each processor id keeps shadow points holding the time t-1 boundary values of its neighbours id-1 and id+1.]
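For comparison (not in the original slides), the same shadow-point exchange written with the standard MPI calls MPI_Sendrecv and MPI_PROC_NULL might look like this; variable names follow the code above:

#include <mpi.h>

/* Halo exchange for one time step; x has shadow cells x[-1] and x[num_points],
   id is this process rank and P the number of processes. */
void exchange_shadows(float *x, int num_points, int id, int P)
{
    int left  = (id > 0)     ? id - 1 : MPI_PROC_NULL;   /* MPI_PROC_NULL: no-op at the ends */
    int right = (id < P - 1) ? id + 1 : MPI_PROC_NULL;

    /* send my rightmost point, receive the left neighbour's point into x[-1] */
    MPI_Sendrecv(&x[num_points-1], 1, MPI_FLOAT, right, 0,
                 &x[-1],           1, MPI_FLOAT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send my leftmost point, receive the right neighbour's point into x[num_points] */
    MPI_Sendrecv(&x[0],            1, MPI_FLOAT, left,  0,
                 &x[num_points],   1, MPI_FLOAT, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}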

Networks for distributed memory

Uniform access time networks


Networks for distributed memory

Non-uniform access time networks: the time to reach another node in the network depends on its position.

[Figure: example topologies: line, ring, 2D torus, 3D torus, hypercube.]

Networks for distributed memory

Routing algorithm. Deterministic routing: there exists a single way to reach a node; the path is determined by the identifiers of the source and target nodes.

E.g. Manhattan routing in a 2D mesh: to go from (x1, y1) to (x2, y2), first traverse Δx = x2 - x1, then Δy = y2 - y1.

[Figure: a 4x4 2D mesh with nodes labelled with their (x, y) coordinates in binary, and a 3-dimensional hypercube with nodes labelled 000 … 111.]

Hypercube routing: R = X xor Y; traverse the dimensions whose address bits differ, in order.
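As a small sketch (not from the slides) of this dimension-order idea, one routing step in a hypercube can be computed from R = X xor Y:

/* One routing step in a hypercube: flip the lowest differing address bit.
   X = current node, Y = destination node. */
unsigned next_hop(unsigned X, unsigned Y)
{
    unsigned R = X ^ Y;        /* R = X xor Y: dimensions still to traverse */
    if (R == 0)
        return X;              /* already at the destination */
    return X ^ (R & -R);       /* cross the first (lowest) differing dimension */
}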


www.top500.org (November 2006)

[Figure: Top500 performance list (GFlops).]

MareNostrum: 10240 CPUs, 94.2 TFlops
20 Terabytes of memory, 280+90 Terabytes of disk storage, area: 120 m2.


JS21 Processor Blade:
2 PPC970MP chips, each one with two cores, 2.3 GHz; symmetric multiprocessor.
4 GB memory, shared memory.
40 GBytes local IDE disk.
Myrinet network adapter and 2 Gigabit ports.

MareNostrum: 10240 CPUs, 94.2 TFlops

[Figure: PowerPC 970MP.]

MareNostrum: 10240 CPUs, 94.2 TFlops


Blade Center:
• 14 blades per chassis (7U)
• 56 processors
• 56 GB memory
• Gigabit ethernet switch

6 chassis in a rack (42U):
• 336 processors (3.1 TFlops)
• 336 GB memory

MareNostrum: 10240 CPUs, 94.2 TFlops

Myrinet Clos 256+256 switch

MareNostrum: 10240 CPUs, 94.2 TFlops


Myrinet Spine 1280 switch

Spine switches are used to connect 3, 4 and 5 Clos 256+256 units. Cabling spines to build larger systems.

[Figure: five Clos 256x256 switches connected through two Spine 1280 switches; each Clos 256x256 has 256 links (1 to each node), 250 MB/s in each direction, and 128 links up to the spines.]

MareNostrum: 10240 CPUs, 94.2 TFlops


IBM Blue Gene Light

BG/L compute card

IBM Blue Gene Light

Nodes are interconnected as a 64x32x32 3D torus. It is easy to build large systems, as each node connects only to its six nearest neighbors; full routing in hardware.
A global reduction tree supports fast global operations, such as a global max/sum in a few microseconds over 65,536 nodes.
Auxiliary networks for I/O and global operations.


NASA Columbia Supercomputer

Global shared memory across 4 CPUs, 8 Gigabytes. 4 Itanium2 processors per C-Brick.

NASA Columbia Supercomputer

Global shared memory across 64 CPUs, 128 Gigabytes.


NASA Columbia Supercomputer

Global shared memory across 512 CPUs, 1 Terabyte.

NASA Columbia Supercomputer

20 SGI® Altix™ 3700 superclusters, each with 512 Itanium2 processors (1.5 GHz, 6 MB cache). InfiniBand network to connect the superclusters.