
  • Inf3 Computer Architecture - 2012-2013 1

    Multiprocessors

    • Why multiprocessors?
      – Power Wall
      – ILP Wall
        · Limitation of ILP in programs
        · Complexity of superscalar design
      – Cost-effectiveness: easier to connect several ready-made processors than to design a new, more powerful processor
      – Better integration and performance than networked individual computers

  • But…

    • Software should expose parallelism
      – Programmers should write parallel programs
      – Legacy code should be parallelized

    Inf3 Computer Architecture - 2012-2013 2

    …as hard as any (problem) that computer science has faced.

  • Inf3 Computer Architecture - 2012-2013 3

    Amdahl's Law and Efficiency

    • Let:
      – F     → fraction of the problem that can be parallelized
      – S_par → speedup obtained on the parallelized fraction
      – P     → number of processors

      S_overall = 1 / ((1 - F) + F / S_par)

      E = S_overall / P

    • e.g.: 16 processors (S_par = 16), F = 0.9 (90%):

      S_overall = 1 / ((1 - 0.9) + 0.9 / 16) = 6.4

      E = 6.4 / 16 = 0.4 (40%)

    (This calculation is written out as a small C sketch below.)
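    The same calculation can be expressed as a short C helper. This is only an illustration of the formulas above; the function names (amdahl_speedup, efficiency) are made up for this sketch.

        #include <stdio.h>

        /* Amdahl's Law: overall speedup when a fraction f of the work
         * is sped up by a factor s_par. */
        static double amdahl_speedup(double f, double s_par) {
            return 1.0 / ((1.0 - f) + f / s_par);
        }

        /* Efficiency: achieved speedup divided by the processor count. */
        static double efficiency(double s_overall, int p) {
            return s_overall / p;
        }

        int main(void) {
            double f = 0.9;      /* 90% of the problem can be parallelized */
            double s_par = 16.0; /* parallel fraction runs 16x faster      */
            int p = 16;          /* 16 processors                          */

            double s = amdahl_speedup(f, s_par);
            printf("S_overall = %.1f, E = %.0f%%\n", s, 100.0 * efficiency(s, p));
            /* prints: S_overall = 6.4, E = 40% */
            return 0;
        }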

  • Inf3 Computer Architecture - 2012-2013 4

    Inter-processor Communication Models

    • Shared memory:

        Producer (p1)          Consumer (p2)
        flag = 0;              flag = 0;
        …                      …
        a = 10;                while (!flag) {}
        flag = 1;              x = a * y;

    • Message passing (an MPI sketch follows below):

        Producer (p1)          Consumer (p2)
        …                      …
        a = 10;                receive(p1, b, label);
        send(p2, a, label);    x = b * y;
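    As a concrete (and hedged) illustration, the message-passing pseudocode above maps naturally onto MPI. The tag value LABEL, the rank assignment and the value of y are placeholders chosen for this sketch, not part of the slides.

        #include <mpi.h>
        #include <stdio.h>

        #define LABEL 42   /* message tag standing in for "label" above */

        int main(int argc, char **argv) {
            int rank;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            if (rank == 0) {            /* producer (p1) */
                int a = 10;
                MPI_Send(&a, 1, MPI_INT, 1, LABEL, MPI_COMM_WORLD);
            } else if (rank == 1) {     /* consumer (p2) */
                int b, y = 3;
                MPI_Recv(&b, 1, MPI_INT, 0, LABEL, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("x = %d\n", b * y);
            }

            MPI_Finalize();
            return 0;
        }

    Run with, e.g., mpirun -np 2 ./a.out; rank 0 plays the producer and rank 1 the consumer.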

  • Shared Memory vs Message Passing

    • Shared memory pros:
      – Easier to program: correctness first, performance later
      – For the OS, only (relatively) minor extensions are required

    • Shared memory cons:
      – Synchronization is complex
      – Communication is implicit, so it is harder to optimize
      – Coherence must be ensured

    Inf3 Computer Architecture - 2012-2013 5

  • HW Support for Shared Memory

    • Cache Coherence
      – Caches + multiprocessors → stale values
      – The system must behave correctly in the presence of caches
        · Write propagation
        · Write serialization

    • Memory Consistency
      – When should writes propagate?
      – How are memory operations ordered?
      – What value should a read return?

    • Primitive synchronization
      – Memory fences: memory ordering on demand
      – Read-modify-writes: support for locks (critical sections) (see the lock sketch below)

    Inf3 Computer Architecture - 2012-2013 6
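    A minimal sketch of how a read-modify-write primitive supports locks, using C11's atomic_flag (test-and-set); the type and function names are invented for this example.

        #include <stdatomic.h>

        typedef struct { atomic_flag held; } spinlock_t;
        /* initialize with: spinlock_t l = { ATOMIC_FLAG_INIT }; */

        static void spin_lock(spinlock_t *l) {
            /* Atomic test-and-set: sets the flag and returns its old value.
             * Spin while some other processor already holds the lock. */
            while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
                ;  /* spin */
        }

        static void spin_unlock(spinlock_t *l) {
            atomic_flag_clear_explicit(&l->held, memory_order_release);
        }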

  • Cache Coherence

    Inf3 Computer Architecture - 2012-2013 7

    Producer (p1)          Consumer (p2)
    flag = 0;              flag = 0;
    …                      …
    data = 10;             while (!flag) {}
    flag = 1;              x = data * y;

    The update to flag (and data) should be (eventually) visible to p2.

  • Memory Consistency

    Inf3 Computer Architecture - 2012-2013 8

    Producer (p1)          Consumer (p2)
    flag = 0;              flag = 0;
    …                      …
    data = 10;             while (!flag) {}
    flag = 1;              x = data * y;

    If p2 sees the update to flag, will p2 see the update to data?

  • Primitive Synchronization

    Inf3 Computer Architecture - 2012-2013 9

    Producer (p1)          Consumer (p2)
    flag = 0;              flag = 0;
    …                      …
    data = 10;             while (!flag) {}
    fence                  fence
    flag = 1;              x = data * y;

    If p2 sees the update to flag, will it see the update to data?
    With the fences placed as above, yes (a C11 sketch follows below).
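    One way to express the producer/consumer fences above is with C11 atomics and explicit thread fences; a minimal sketch, assuming a toolchain that provides <threads.h>. The thread setup and the value of y are made up for the example.

        #include <stdatomic.h>
        #include <stdio.h>
        #include <threads.h>

        int data = 0;
        atomic_int flag = 0;

        int producer(void *arg) {                         /* p1 */
            (void)arg;
            data = 10;
            atomic_thread_fence(memory_order_release);    /* fence: data before flag */
            atomic_store_explicit(&flag, 1, memory_order_relaxed);
            return 0;
        }

        int consumer(void *arg) {                         /* p2 */
            (void)arg;
            while (!atomic_load_explicit(&flag, memory_order_relaxed))
                ;                                         /* spin until flag is set */
            atomic_thread_fence(memory_order_acquire);    /* fence: flag before data */
            int y = 3;
            printf("x = %d\n", data * y);                 /* sees data == 10 */
            return 0;
        }

        int main(void) {
            thrd_t p1, p2;
            thrd_create(&p1, producer, NULL);
            thrd_create(&p2, consumer, NULL);
            thrd_join(p1, NULL);
            thrd_join(p2, NULL);
            return 0;
        }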

  • Inf3 Computer Architecture - 2012-2013 10

    The Cache Coherence Problem

    Three CPUs, each with a private (write-back) cache, share main memory.

      T0: memory A = 1; no cache holds A
      T1: CPU1 loads A → CPU1 cache: A = 1
      T2: CPU2 loads A → CPU2 cache: A = 1
      T3: CPU2 stores A = 2 → CPU2 cache: A = 2; CPU1's copy (A = 1) and memory (A = 1) are now stale
      T4: CPU3 loads A → reads A = 1 from memory: uses the old value
      T5: CPU1 loads A → hits its own cache, A = 1: uses a stale value!

  • Inf3 Computer Architecture - 2012-2013 11

    Maintaining Cache Coherence

    • Option 1: do not allow shared data to be cached if it can be modified (e.g., Cray T3D and T3E)
      – Difficult to program
      – Poor performance

    • Option 2: the user program makes sure that data is always coherent
      – Very difficult to program
      – Poor performance

    • Option 3: hardware enforces cache coherence

  • Inf3 Computer Architecture - 2012-2013 12

    Cache Coherence Protocols

    • Idea:
      – Keep track of which processors have copies of which data
      – Enforce that at any given time a single value of every datum exists:
        · by getting rid of copies of the data with old values → invalidate protocols
        · by updating everyone's copy of the data → update protocols (both sketched below)

    • In practice:
      – Guarantee that old values are eventually invalidated/updated (write propagation)
        · (recall that without synchronization there is no guarantee that a load will return the new value anyway)
      – Guarantee that only a single processor is allowed to modify a certain datum at any given time (write serialization)
      – Must appear as if no caches were present
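    A rough sketch of the two alternatives on a write, modelling each cache's copy of a block as a (valid, value) pair; everything here (names, NUM_CACHES) is invented for illustration.

        #include <stdbool.h>

        #define NUM_CACHES 4

        typedef struct { bool valid; int value; } copy_t;   /* one cache's copy */

        /* Write-invalidate: propagate the write by destroying all other
         * copies, so only the writer's copy remains readable.           */
        static void write_invalidate(copy_t copies[], int writer, int new_value) {
            for (int i = 0; i < NUM_CACHES; i++)
                if (i != writer)
                    copies[i].valid = false;   /* old values can no longer be read */
            copies[writer].valid = true;
            copies[writer].value = new_value;
        }

        /* Write-update: propagate the write by pushing the new value
         * into every existing copy.                                     */
        static void write_update(copy_t copies[], int writer, int new_value) {
            copies[writer].valid = true;
            for (int i = 0; i < NUM_CACHES; i++)
                if (copies[i].valid)
                    copies[i].value = new_value;
        }

    Either way, once the write completes there is a single current value for the block, as the next two slides illustrate.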

  • Inf3 Computer Architecture - 2012-2013 13

    Write-invalidate Example

    Three CPUs, each with a private (write-back) cache, share main memory.

      T1: CPU1 loads A → CPU1 cache: A = 1; memory A = 1
      T2: CPU2 loads A → CPU2 cache: A = 1
      T3: CPU2 stores A = 2 → invalidate: CPU1's copy is invalidated; CPU2 cache: A = 2; memory A = 1 (stale)
      T4: CPU3 loads A → gets the new value A = 2 (supplied by CPU2's cache); memory still A = 1
      T5: CPU1 loads A → misses (its copy was invalidated) and gets the new value A = 2

  • Inf3 Computer Architecture - 2012-2013 14

    Write-update Example

    Three CPUs, each with a private cache, share main memory.

      T1: CPU1 loads A → CPU1 cache: A = 1; memory A = 1
      T2: CPU2 loads A → CPU2 cache: A = 1
      T3: CPU2 stores A = 2 → update: CPU1's copy and main memory are updated to A = 2
      T4: CPU3 loads A → gets the new value A = 2; memory A = 2
      T5: CPU1 loads A → hits its cache: the new value A = 2

  • Inf3 Computer Architecture - 2012-2013 15

    Cache Coherence Protocols

    • Add state bits to cache lines to track the state of the line
      – Most common states: Modified, Owned, Exclusive, Shared, Invalid
      – Protocols are usually named after the states they support

    • Cache lines transition between states upon load/store operations from the local processor and upon operations by remote processors

    • These state transitions must guarantee the invariant: no two cache copies can be simultaneously modified
      – SWMR: single writer, multiple readers

  • Inf3 Computer Architecture - 2012-2013 16

    Example: MSI Protocol

    • States:
      – Modified (M): the block is cached only in this cache and has been modified
      – Shared (S): the block is cached in this cache and possibly in other caches (no cache can modify the block)
      – Invalid (I): the block is not cached

  • Inf3 Computer Architecture - 2012-2013 17

    Example: MSI Protocol

    • Transactions originated at this CPU (state transition diagram):
      – Invalid → Shared: CPU read miss
      – Invalid → Modified: CPU write miss
      – Shared → Modified: CPU write
      – Shared → Shared: CPU read hit
      – Modified → Modified: CPU read hit, CPU write hit

  • Inf3 Computer Architecture - 2012-2013 18

    Example: MSI Protocol

    • Transactions originated at other CPUs, added to the same state transition diagram (a combined transition function is sketched below):
      – Shared → Invalid: remote write miss
      – Modified → Invalid: remote write miss
      – Modified → Shared: remote read miss
      – Shared → Shared: remote read miss
      (the local-CPU transitions from the previous slide remain unchanged)
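    The two diagrams can be collapsed into one next-state function. This is a sketch using the slides' MSI states, with event names taken from the transition labels; the enum and function names are invented here.

        /* Per-line MSI state, as in the slides. */
        typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;

        /* Events seen by one cache line: local CPU accesses and
         * remote (snooped) misses, named after the diagram labels. */
        typedef enum {
            CPU_READ_HIT, CPU_READ_MISS,
            CPU_WRITE, CPU_WRITE_HIT, CPU_WRITE_MISS,
            REMOTE_READ_MISS, REMOTE_WRITE_MISS
        } msi_event_t;

        /* Next-state function for one cache line; it preserves the SWMR
         * invariant: a line is writable (Modified) in at most one cache. */
        static msi_state_t msi_next(msi_state_t s, msi_event_t e) {
            switch (s) {
            case INVALID:
                if (e == CPU_READ_MISS)  return SHARED;
                if (e == CPU_WRITE_MISS) return MODIFIED;
                return INVALID;
            case SHARED:
                if (e == CPU_WRITE || e == CPU_WRITE_HIT)
                    return MODIFIED;                         /* upgrade: other copies invalidated */
                if (e == REMOTE_WRITE_MISS) return INVALID;
                return SHARED;                               /* read hits, remote read misses */
            case MODIFIED:
                if (e == REMOTE_READ_MISS)  return SHARED;   /* supply data, downgrade */
                if (e == REMOTE_WRITE_MISS) return INVALID;
                return MODIFIED;                             /* local read/write hits */
            }
            return s;
        }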

  • Inf3 Computer Architecture - 2012-2013 19

    Possible Implementations

    • Two ways of implementing coherence protocols in hardware:
      – Snooping: all cache controllers monitor all other caches' activities and maintain the state of their own lines
        · Commonly used with buses and in many CMPs today
      – Directory: a central control device directly handles all cache activities and tells the caches what transitions to make (a directory-entry sketch follows below)
        · Can be of two types: centralized and distributed
        · Commonly used with scalable interconnects and in many CMPs today
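    For the directory approach, one common organization (a sketch, not any specific machine's format) keeps, per memory block, a state plus a bit-vector of sharers, so invalidations can be sent point-to-point instead of being broadcast on a snooping bus. All names and states below are assumptions made for this example.

        #include <stdint.h>

        enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED };

        typedef struct {
            enum dir_state state;
            uint64_t       sharers;   /* bit i set => cache i holds a copy */
        } dir_entry_t;

        /* On a write miss from cache `req`, the directory knows exactly
         * which caches hold copies, and therefore which to invalidate
         * before granting exclusive access. */
        static uint64_t dir_write_miss(dir_entry_t *e, int req) {
            uint64_t to_invalidate = e->sharers & ~(1ull << req);
            e->sharers = 1ull << req;
            e->state   = DIR_MODIFIED;
            return to_invalidate;     /* send invalidations only to these caches */
        }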