Software Techniques for Distributed Shared Memory

Zoran Radovic
Zoran.Radovic@it.uu.se

Dissertation Seminar, Nov 18, 2005
Auditorium Minus, Museum Gustavianum


Outline

NUCA Locks

DSZOOM – Software-based Shared Memory

TMA – Trap-based Memory Architecture


Vasaloppet
“Contention Problem in Sweden”

Traditional cross-country ski race
90 km …

85.6533 km to go… CS


Spin Locks under Contention

[Figure: Critical Section (CS) Cost vs. Amount of Contention; curves: Spin locks, Spin locks with backoff]

IF (more contention) THEN less efficient CS …

“The more important the slower it runs…”


Queue-based Locks

[Figure: CS Cost vs. Amount of Contention; curves: Spin locks, Spin locks with backoff, Queue-based locks]

IF (more contention) THEN constant CS cost …


This Dissertation

[Figure: CS Cost vs. Amount of Contention; curves: Queue-based locks, Spin locks, Spin locks with backoff, NUCA locks]

IF (more contention) THEN more efficient CS …

“The more important the faster it runs…”


NUCA Locks (Basic Idea)

[Diagram: three groups of processors (P) with caches ($), three Memory blocks, and one Switch; labels: Test, Lock/Unlock]

1) Reduce traffic
   - one CPU per node is testing…
2) Improve lock handover
3) More efficient CS
   - local traffic is cheaper


The HBO Lock (the simplest HBO)

What do we need?
  node_id
  Compare&swap (CAS) atomic operation

    CAS(Lock_address, FREE, node_id)

lock-acquire:
  If the lock-value is in the state FREE:
    • The node_id is CAS-ed into the lock location
  Else: 2 cases
    • The lock is “local”: spin with small backoff
    • The lock is “remote”: spin with large backoff

Simple but fairly effective…

Creates Communication Affinity
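The lock-acquire rules on this slide can be sketched in C11 as follows. This is a minimal illustration, not the dissertation's implementation: the names `hbo_lock_t`, `hbo_try_acquire`, `hbo_backoff_for`, and `spin_delay`, the encoding of `FREE` as -1, and the two backoff constants are all assumptions made for the example, and a production lock would likely add exponential growth and caps to the backoff.

```c
#include <assert.h>
#include <stdatomic.h>

#define FREE           (-1)  /* assumed encoding of the unlocked state */
#define BACKOFF_LOCAL    16  /* assumed small backoff, in spin iterations */
#define BACKOFF_REMOTE  256  /* assumed large backoff, in spin iterations */

/* The lock word holds FREE or the node_id of the current holder. */
typedef struct { atomic_int owner_node; } hbo_lock_t;

static void hbo_init(hbo_lock_t *l) { atomic_init(&l->owner_node, FREE); }

/* One CAS(Lock_address, FREE, node_id) attempt.
   Returns 1 on success; on failure returns 0 and stores the
   node_id found in the lock word into *holder. */
static int hbo_try_acquire(hbo_lock_t *l, int node_id, int *holder) {
    int expected = FREE;
    assert(node_id >= 0);
    if (atomic_compare_exchange_strong(&l->owner_node, &expected, node_id))
        return 1;
    *holder = expected;
    return 0;
}

/* Small backoff if the holder is in our node, large backoff otherwise. */
static int hbo_backoff_for(int holder, int node_id) {
    return holder == node_id ? BACKOFF_LOCAL : BACKOFF_REMOTE;
}

/* Busy-wait for the given number of iterations. */
static void spin_delay(int iterations) {
    for (volatile int i = 0; i < iterations; i++) { }
}

static void hbo_acquire(hbo_lock_t *l, int node_id) {
    int holder;
    while (!hbo_try_acquire(l, node_id, &holder))
        spin_delay(hbo_backoff_for(holder, node_id));
}

static void hbo_release(hbo_lock_t *l) {
    atomic_store(&l->owner_node, FREE);
}
```

Because waiters in the holder's node retry sooner than remote waiters, the lock tends to be handed over within a node, which is one way to read the "communication affinity" noted on the slide.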


Performance Results
Realistic microbenchmark, 2-node WildFire, 28 CPUs

[Figure: Iteration Time [seconds] vs. critical_work (0 to 2000); curves: Spin, MCS, HBO; diagram label: WF, 14 14]

[Figure: Node Handoffs [%] vs. critical_work (0 to 2000)]

Fairness?


Fairness Study
Realistic microbenchmark, 2-node WildFire, 28 CPUs

[Figure: Number of Finished Processors (0 to 28) vs. Time [seconds] (0 to 15); curves: Spin, MCS, HBO]


Application Performance
28-processor runs

[Figure: Normalized Speedup (0 to 2.5) for Barnes, Cholesky, FMM, Radiosity, Raytrace, Volrend, Water-Nsq, Average; bars: Spin, Spin EXP, MCS, HBO; annotation: ≈ 4x]


Total Traffic: Raytrace

[Figure: total traffic (axis 0 to 1.4) for Spin, Spin EXP, MCS, HBO; each bar split into Local Transactions and Global Transactions]


HBO Locks inside Linux Kernel

Patch provided by Silicon Graphics, Inc.
  Linux-IA64 kernel implementation, May 2005

Page-fault handler runs 3x faster
  60 processors


Outline

NUCA Locks

DSZOOM – Software-based Shared Memory

TMA – Trap-based Memory Architecture


The DSZOOM Proposal

Run entire protocol in requesting-processor
  No protocol agent communication!

Assumes user-level remote memory access
  put, get, and atomics [InfiniBand]

Fine-grain memory protocols (e.g., 64 bytes)

Hardware-like memory models [Shasta, Blizzard, Sirocco]


“Squeezing” Protocols into Binaries…

Original Program:

    ...
    cmp  %g0, %l5
    bne  0x24431
    nop
    ld   [%o1 + 64], %o0
    ldd  [%o0 + 16], %f4
    clr  %l5
    ...

DSZOOM Program:

    ...
    cmp  %g0, %l5
    bne  0x24431
    nop
    ld   [%o1 + 64], %o0
    mov  255, %g6
    and  %g6, %o0, %g6
    cmp  %g6, 170
    bne  0x24450
    nop
    ldd  [%o0 + 16], %f4
    clr  %l5
    ...

Labels: Fast-path Protocol Code; Slow-path Protocol Code (C-code)

Binary/Assembler level instrumentation
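The instrumented load on this slide can be read as a fast-path check followed by an occasional call into slow-path C code. The sketch below mirrors that structure under stated assumptions: it assumes the constant 170 (0xAA) in the low byte marks data that may lack read permission, and the names `checked_load`, `slow_path_load`, and the `home` pointer standing in for a remote copy are hypothetical, not DSZOOM's actual interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAGIC_BYTE 0xAAu   /* 170, the constant compared in the slide's snippet */

static unsigned slow_path_calls;   /* hypothetical counter, for illustration */

/* Hypothetical slow path: a real protocol would fetch a valid copy with
   remote get/atomic operations; here we simply copy from `home`. */
static uint32_t slow_path_load(uint32_t *local, const uint32_t *home) {
    assert(local != NULL && home != NULL);
    slow_path_calls++;
    *local = *home;
    return *local;
}

/* Load with an inlined fast-path check, in the spirit of the
   ld / and / cmp / bne sequence shown on the slide. */
static uint32_t checked_load(uint32_t *local, const uint32_t *home) {
    uint32_t value = *local;                /* the original load */
    if ((value & 0xFFu) != MAGIC_BYTE)      /* mask low byte, compare with 170 */
        return value;                       /* common case: no protocol action */
    return slow_path_load(local, home);     /* rare case: enter protocol code */
}
```

A value whose low byte happens to equal the magic byte also takes the slow path; in this sketch that is harmless, since the slow path simply returns the up-to-date copy.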


Write Permission Caching

Problem: store instrumentation relies on locking
  More complex instrumentation

Solution: write permission cache (WPC)
  Small and fast software-managed cache
  Keeps write permissions

The WPC idea:
  Exploit store locality
  Dynamically reduce the number of memory references in store checking code
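A single-entry write permission cache can be sketched as below. This is an illustrative model only: the type `wpc_t`, the functions `wpc_miss` and `checked_store`, the one-entry size, and the 64-byte coherence unit are assumptions for the example, and the slow path is reduced to a counter where a real protocol would release the old permission and acquire the new one.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_SHIFT 6            /* assumed 64-byte coherence unit */
#define WPC_EMPTY  UINT64_MAX   /* marks a WPC entry that holds nothing */

typedef struct {
    uint64_t line;        /* coherence unit currently held with write permission */
    unsigned slow_paths;  /* how often the slow path ran (illustration only) */
} wpc_t;

static void wpc_init(wpc_t *w) { w->line = WPC_EMPTY; w->slow_paths = 0; }

/* Hypothetical slow path: a real protocol would give up permission for
   w->line and obtain write permission for `line` here. */
static void wpc_miss(wpc_t *w, uint64_t line) {
    w->line = line;
    w->slow_paths++;
}

/* Store check: one compare against the cached line in the common case. */
static void checked_store(wpc_t *w, uint8_t *mem, uint64_t addr, uint8_t v) {
    uint64_t line = addr >> LINE_SHIFT;
    assert(line != WPC_EMPTY);
    if (line != w->line)        /* WPC miss: run the protocol slow path */
        wpc_miss(w, line);
    mem[addr] = v;              /* WPC hit or refreshed entry: just store */
}
```

Consecutive stores to the same 64-byte unit hit the cached entry, so only the first one pays for the protocol work; this is the store locality the slide refers to.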


Other “Features”

Two kinds of protocols
  Invalidate
  Update

Many optimizations
  Instrumentation scheduling (update and invalidate)
  Instrumentation batching (invalidate)
  WPC-based write batching (update)
  WPC-based dirty-data filtering (update)
  Private-data filtering (update)
  # of WPC entries (update and invalidate)
  Coherence unit size (update and invalidate)


Coherence Flags and Profiling

Coherence flags
  Similar to optimization flags of compilers
  Possible scenario:

    gcc -dszoom-cl 128 -dszoom-inv -O3 my_app.c

Execution profiling
  Similar to profile feedback of compilers
  Helps find appropriate coherence flag settings
  Low overhead implementation in DSZOOM
    • Less than 30 percent overhead
  Works for both small and large input sets


DSZOOM Results
2-node WildFire, 16 CPUs

[Figure: Normalized Execution Time (0.0 to 3.0) for fft, lu-c, lu-nc, radix, barnes, fmm, ocean-c, ocean-nc, radiosity, raytrace, water-nsq, water-sp, average; bars: HW-DSM, inv-64, inv-dwpc-64, PROFILED BEST; annotations: 1.45x, 1.11x]


Outline

NUCA Locks

DSZOOM – Software-based Shared Memory

TMA – Trap-based Memory Architecture


Instrumentation Drawbacks

Original Program:

    ...
    cmp  %g0, %l5
    bne  0x24431
    nop
    ld   [%o1 + 64], %o0
    ldd  [%o0 + 16], %f4
    clr  %l5
    ...

DSZOOM Program:

    ...
    cmp  %g0, %l5
    bne  0x24431
    nop
    ld   [%o1 + 64], %o0
    mov  255, %g6
    and  %g6, %o0, %g6
    cmp  %g6, 170
    bne  0x24450
    nop
    ldd  [%o0 + 16], %f4
    clr  %l5
    ...

Labels: Fast-path Protocol Code; Slow-path Protocol Code (C-code)

• Binary transparency?
• Run-time execution overhead


Trap-Based Memory Architectures

Basic idea
  Detect fine-grained coherence violations in hardware
  Trigger a coherence trap when one occurs
  Maintain coherence by software protocols

No memory system modifications
Minimal processor modifications

Binary Transparency
  No need to instrument binaries/applications


TMA Lite
Proof-of-concept Implementation

Load permission check
  Hardware implementation of software check
    • Predefined “magic-value” convention

Store permission check
  Hardware WPC
    Can be seen as a very small cache
    Operates on virtual addresses
    Accessed in parallel with the data TLB


TMA Lite Performance
[TMA: simulation study, 4 nodes | DSZOOM: 2-node WildFire]

[Figure: Normalized Execution Time (0 to 2.5); bars: HW-DSM, DSZOOM, DWPC, PROFILED BEST, TMA; annotations: 1.75x, 1.01x]


Topics not Presented

RH lock algorithm
  Controlled (un)fairness

HBO_GT and HBO_GT_SD algorithms
  Global throttling and starvation detection

DSZOOM implementation details
  Instrumentation challenges; scheduling, batching, etc.
  Bandwidth filtering techniques; dirty- & private-data

Innovative TMA simulation tricks
  Low-level “good days” hacks
  Reusing Simics checkpoints
