HBO Locks
Uppsala University, Department of Information Technology
Uppsala Architecture Research Team [UART]

Hierarchical Back-Off (HBO) Locks for Non-Uniform Communication Architectures

Zoran Radovic and Erik Hagersten
{zoran.radovic, erik.hagersten}@it.uu.se

HPCA-9, Ninth International Symposium on High Performance Computer Architecture
Anaheim, California, February 8-12, 2003



[email protected] Uppsala Architecture Research Team (UART) HBO Locks

Synchronization Basics

Locks are used to protect the shared critical section data

Common software-based solutions:

  • Simple spin-locks
      • TATAS (‘84)
      • TATAS_EXP (‘90)
  • Queue-based locks
      • MCS (‘91)
      • CLH (‘93)

A:=0
BARRIER

LOCK(L)
A:=A+1
UNLOCK(L)

LOCK(L)
B:=A+5
UNLOCK(L)
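
The simple spin-locks listed on this slide can be sketched in C11 roughly as follows. This is an illustration under stated assumptions, not code from the talk: the type `tatas_lock`, all helper names, the lock-word encoding, and the backoff bounds are hypothetical. The slide's two critical sections are run back to back in one thread here, so the result is deterministic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed encoding of the lock word (not from the slides). */
enum { TATAS_FREE = 0, TATAS_HELD = 1 };

/* Assumed backoff bounds, in trips of an empty delay loop. */
enum { BACKOFF_MIN = 16, BACKOFF_MAX = 4096 };

typedef struct { atomic_int state; } tatas_lock;

static void tatas_init(tatas_lock *l) { atomic_init(&l->state, TATAS_FREE); }

/* Busy-wait for roughly `iterations` loop trips. */
static void spin_delay(unsigned iterations) {
    for (volatile unsigned i = 0; i < iterations; i++) { }
}

/* One test-and-test&set attempt; returns true if the lock was taken. */
static bool tatas_try_acquire(tatas_lock *l) {
    /* "Test": a plain read, so waiters spin on a cached copy. */
    if (atomic_load_explicit(&l->state, memory_order_relaxed) != TATAS_FREE)
        return false;
    /* "Test&set": the atomic swap wins only if the word was still free. */
    return atomic_exchange_explicit(&l->state, TATAS_HELD,
                                    memory_order_acquire) == TATAS_FREE;
}

/* TATAS_EXP: retry with exponentially growing delays between attempts. */
static void tatas_exp_acquire(tatas_lock *l) {
    unsigned backoff = BACKOFF_MIN;
    while (!tatas_try_acquire(l)) {
        spin_delay(backoff);
        if (backoff < BACKOFF_MAX)
            backoff *= 2;
    }
}

static void tatas_release(tatas_lock *l) {
    atomic_store_explicit(&l->state, TATAS_FREE, memory_order_release);
}

/* The slide's example, executed sequentially; returns B. */
static int slide_example(void) {
    tatas_lock L;
    int A = 0, B;
    tatas_init(&L);
    tatas_exp_acquire(&L); A = A + 1; tatas_release(&L);
    tatas_exp_acquire(&L); B = A + 5; tatas_release(&L);
    assert(A == 1);
    return B;
}
```

With real threads, the two critical sections may run in either order, so B depends on which thread wins the lock first; the lock only guarantees that the two updates do not interleave.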


Raytrace Speedup

[Figure: speedup (0 to 9) vs. number of processors (0 to 28) for TATAS and MCS on Sun WildFire (WF), 2 nodes of 14 processors.]


Vasaloppet: “Contention Problem in Sweden”

Traditional cross-country ski race, 55 miles …

51.6533 miles to go… CS


Spin Locks Under Contention

[Figure: critical section (CS) cost vs. amount of contention; curves: spin locks, spin locks w/ backoff.]

IF (more contention) THEN less efficient CS …

“The more important the slower it runs…”


Queue-based Locks

[Figure: CS cost vs. amount of contention; curves: spin locks, spin locks w/ backoff, queue-based locks.]

IF (more contention) THEN constant CS cost …


This Talk

[Figure: CS cost vs. amount of contention; curves: queue-based locks, spin locks, spin locks w/ backoff, HBO locks.]

IF (more contention) THEN more efficient CS …

“The more important the faster it runs…”


Raytrace Speedup

[Figure: speedup (0 to 9) vs. number of processors (0 to 28) for TATAS, MCS, and HBO locks on Sun WildFire (WF), 2 nodes of 14 processors.]


Outline

  • Background & Motivation
  • NUMA vs. NUCA Architectures
  • Hierarchical Back-Off (HBO) Locks
      • HBO
      • HBO_GT
      • HBO_GT with starvation detection/avoidance
  • Performance Results
  • Conclusions


Non-Uniform Memory Architecture (NUMA)

  • Many NUMA optimizations are proposed
      • Page migration: speeds up accesses to “private” data
      • Page replication: speeds up reads to “shared” data
  • Does not help communication… e.g., synchronization

[Figure: two nodes connected by a switch, each with processors P1 … Pn, per-processor caches ($), and a memory; access time ratio 1 within a node, 2 – 10 across the switch.]


Non-Uniform Communication Architecture (NUCA)

A “new” property of NUMAs… NUCA

NUCA examples (NUCA ratios):

  • 1992: Stanford DASH (~ 4.5)
  • 1996: Sequent NUMA-Q (~ 10)
  • 1999: Sun WildFire (~ 6)
  • 2000: Compaq DS-320 (~ 3.5)
  • Future: CMP, SMT (~ 10)

[Figure: two nodes connected by a switch, each with processors P1 … Pn, per-processor caches ($), and a memory; NUCA ratio 1 within a node, 2 – 10 across the switch.]

NUCA optimizations are getting important for future architectures!


Our Goals

Design scalable spin locks that exploit NUCAs

  • Create communication affinity
      • Keep the lock in the neighborhood [Mr. Rogers, 1968]
      • Speeds up lock handover
      • Lowers the access cost to critical section (CS) data
  • Reduce remote “probing” traffic
  • Portable and scalable to many NUCA nodes


The HBO Lock (the simplest HBO)

What do we need?

  • node_id
  • Compare&swap (CAS) atomic operation: CAS(Lock_address, FREE, node_id)

lock-acquire:

  • If the lock-value is in the state FREE: the node_id is CAS-ed into the lock location
  • Else: 2 cases (for 2 levels of non-uniformity):
      • The lock is “local”: TATAS_EXP with small backoff
      • The lock is “remote”: TATAS_EXP with large backoff

Simple but fairly effective…

Creates communication affinity
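
The lock-acquire rule above can be sketched in C11 as follows. This is a minimal illustration, not the authors' implementation: the names `hbo_lock`, `hbo_cas`, `hbo_acquire`, and `hbo_release`, the encoding of FREE as -1, and all backoff constants are assumptions. Each retry uses the CAS itself as the probe, and the value it returns tells the caller whether the holder sits in the caller's own node.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumed encoding: node ids are >= 0, so -1 can stand for FREE. */
#define HBO_FREE (-1)

/* Assumed backoff bounds, in trips of an empty delay loop. */
enum {
    LOCAL_BACKOFF_MIN  = 8,   LOCAL_BACKOFF_MAX  = 256,
    REMOTE_BACKOFF_MIN = 128, REMOTE_BACKOFF_MAX = 8192
};

typedef struct { atomic_int holder_node; } hbo_lock;

static void hbo_init(hbo_lock *l) { atomic_init(&l->holder_node, HBO_FREE); }

static void spin_delay(unsigned iterations) {
    for (volatile unsigned i = 0; i < iterations; i++) { }
}

/* CAS(Lock_address, FREE, node_id): returns the value found in the lock
 * word, so HBO_FREE means the caller has just taken the lock. */
static int hbo_cas(hbo_lock *l, int node_id) {
    int found = HBO_FREE;
    atomic_compare_exchange_strong_explicit(&l->holder_node, &found, node_id,
                                            memory_order_acquire,
                                            memory_order_relaxed);
    return found; /* on failure the CAS stored the current holder here */
}

static void hbo_acquire(hbo_lock *l, int node_id) {
    unsigned local_b = LOCAL_BACKOFF_MIN, remote_b = REMOTE_BACKOFF_MIN;
    for (;;) {
        int holder = hbo_cas(l, node_id);
        if (holder == HBO_FREE)
            return;                       /* lock was FREE: we hold it */
        if (holder == node_id) {          /* "local": small backoff */
            spin_delay(local_b);
            if (local_b < LOCAL_BACKOFF_MAX)
                local_b *= 2;
        } else {                          /* "remote": large backoff */
            spin_delay(remote_b);
            if (remote_b < REMOTE_BACKOFF_MAX)
                remote_b *= 2;
        }
    }
}

static void hbo_release(hbo_lock *l) {
    /* Releasing a lock that nobody holds would be a caller bug. */
    assert(atomic_load_explicit(&l->holder_node, memory_order_relaxed) != HBO_FREE);
    atomic_store_explicit(&l->holder_node, HBO_FREE, memory_order_release);
}
```

Because waiters in the holder's node retry more often than remote waiters, a released lock is more likely to be picked up by a neighbor, which is one way to read the "communication affinity" claim.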


The HBO_GT Lock (GT = Global Throttling)

[Figure (animation frame): Node 2 and Node 5, each with four processors (P), caches ($), and a memory. Labels: Lock1, Lock2, Lock3 with values FREE or 2; CS; remote_node_id; my_is_spinning: 0x00000000 in each node; addr(Lock1); local spinning; remote spinning (w/ exp. backoff); probing (with CAS); read a node-local flag.]


A couple of nanoseconds later …

[Figure (animation frame): the same two-node layout (Node 2 and Node 5); the lock value shown is now 5 instead of 2. Labels: Lock1, Lock2, Lock3; CS; remote_node_id; my_is_spinning: 0x00000000; addr(Lock1); local spinning; remote spinning (w/ exp. backoff); probing (with CAS); read a node-local flag.]
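
One possible reading of the global-throttling labels in the figure (a node-local my_is_spinning word, addr(Lock1), "read a node-local flag", "probing with CAS") is sketched below in C11. This is an interpretation, not the authors' code: the array `is_spinning`, the names `gt_lock`, `gt_throttled`, `gt_acquire`, and `gt_release`, the node limit, and the backoff constants are all assumptions. The idea shown is that only a thread that has announced itself in its node's flag keeps probing a remotely held lock; its neighbors read the flag and stay quiet.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define GT_FREE (-1)                       /* assumed: node ids are >= 0 */
enum { GT_MAX_NODES = 8 };                 /* assumed node limit */
enum { GT_LOCAL_BACKOFF = 16,              /* assumed delay-loop trips */
       GT_REMOTE_BACKOFF_MIN = 128, GT_REMOTE_BACKOFF_MAX = 8192 };

typedef struct { atomic_int holder_node; } gt_lock;

/* is_spinning[n]: address of the lock that a thread of node n is spinning
 * on remotely, or 0. A real system would keep each word in node n's memory. */
static _Atomic uintptr_t is_spinning[GT_MAX_NODES];

static void gt_init(gt_lock *l) { atomic_init(&l->holder_node, GT_FREE); }

static void spin_delay(unsigned iterations) {
    for (volatile unsigned i = 0; i < iterations; i++) { }
}

/* Read the node-local flag: is a neighbor already probing this lock? */
static bool gt_throttled(const gt_lock *l, int node_id) {
    return atomic_load_explicit(&is_spinning[node_id],
                                memory_order_relaxed) == (uintptr_t)l;
}

static void gt_acquire(gt_lock *l, int node_id) {
    unsigned remote_b = GT_REMOTE_BACKOFF_MIN;
    bool announced = false;   /* true while this thread owns the flag */
    assert(node_id >= 0 && node_id < GT_MAX_NODES);
    for (;;) {
        /* Throttle: wait locally instead of sending remote probes. */
        while (!announced && gt_throttled(l, node_id))
            spin_delay(GT_LOCAL_BACKOFF);

        int found = GT_FREE;
        bool won = atomic_compare_exchange_strong_explicit(
            &l->holder_node, &found, node_id,
            memory_order_acquire, memory_order_relaxed);

        if (won || found == node_id) {
            /* Lock taken or held in our node: stop throttling neighbors. */
            if (announced) {
                atomic_store_explicit(&is_spinning[node_id], (uintptr_t)0,
                                      memory_order_relaxed);
                announced = false;
            }
            if (won)
                return;
            spin_delay(GT_LOCAL_BACKOFF);
        } else {
            /* Held by a remote node: announce, then back off exponentially. */
            if (!announced) {
                atomic_store_explicit(&is_spinning[node_id], (uintptr_t)l,
                                      memory_order_relaxed);
                announced = true;
            }
            spin_delay(remote_b);
            if (remote_b < GT_REMOTE_BACKOFF_MAX)
                remote_b *= 2;
        }
    }
}

static void gt_release(gt_lock *l) {
    atomic_store_explicit(&l->holder_node, GT_FREE, memory_order_release);
}
```

Mutual exclusion in this sketch rests on the CAS alone; the flag is only a hint that reduces remote probing traffic, so a race on the flag costs some extra probes but not correctness.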


Our NUCA: Sun WildFire

[Figure: Sun WildFire (WF): two nodes connected by a switch, each with 14 processors (P1 … P14), per-processor caches ($), and a memory; NUCA ratio 1 within a node, 6 across the switch.]


Traditional Microbenchmark

For each thread:

for (i = 0; i < iterations; i++) {
    LOCK(L);
    /* null/small Critical Section */
    UNLOCK(L);
}


NUCA-performance: Traditional microbenchmark

[Figure: time [microseconds] (0 to 60) vs. number of processors (0 to 28) for TATAS, MCS, and HBO_GT on Sun WildFire (WF).]

[Figure: node handoffs [%] (0 to 100) vs. number of processors (0 to 28) for TATAS, MCS, and HBO_GT.]


New Microbenchmark

for (i = 0; i < iterations; i++) {
    LOCK(L);
    delay(critical_work); // CS
    UNLOCK(L);
    static_delay();
    random_delay();
}

  • More realistic node handoffs for queue-locks
  • Constant number of processors
  • Control the “amount of contention”
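
The slides plot "node handoffs [%]" without defining the metric. A plausible reading, assumed here, is the share of lock handovers in which the new holder sits in a different node than the previous holder. The sketch below computes that share from a recorded trace of holder node ids; the function name `node_handoff_percent` and the trace format are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* holder_nodes[i] is the node id of the i-th thread to acquire the lock.
 * Returns the percentage of handovers that crossed a node boundary. */
static double node_handoff_percent(const int *holder_nodes, size_t n) {
    size_t crossings = 0;
    assert(holder_nodes != NULL || n == 0);
    if (n < 2)
        return 0.0;                 /* fewer than two holders: no handover */
    for (size_t i = 1; i < n; i++) {
        if (holder_nodes[i] != holder_nodes[i - 1])
            crossings++;            /* lock moved to another node */
    }
    return 100.0 * (double)crossings / (double)(n - 1);
}
```

Under this definition a lock that strictly alternates between two nodes scores 100%, while a lock that stays inside one node scores 0%, which matches the intuition that fewer node handoffs mean more communication affinity.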


Performance Results: New microbenchmark, 2-node Sun WildFire, 28 CPUs

[Figure: time [seconds] (3 to 12) vs. critical_work (0 to 2000) for TATAS, MCS, and HBO_GT on Sun WildFire (WF), 2 nodes of 14 processors.]

[Figure: node handoffs [%] (0 to 60) vs. critical_work (0 to 2000).]

Fairness?


Fairness Study: New microbenchmark, 2-node Sun WildFire, 28 CPUs

[Figure: number of finished processors (0 to 28) vs. time [seconds] (0 to 15) for TATAS, MCS, and HBO_GT.]


Application Performance: Raytrace Speedup

[Figure: speedup (0 to 8) vs. number of processors (0 to 28) on Sun WildFire (WF); first build shows TATAS and MCS, second build adds HBO and HBO_GT.]


HBO Locks Under Contention

[Figure: CS cost vs. amount of contention; curves: queue-based locks, spin locks, spin locks w/ backoff, HBO locks.]


Total Traffic: Raytrace

[Figure: bar chart of total traffic (y-axis 0.0 to 1.4), split into local transactions and global transactions, for TATAS, TATAS_EXP, MCS, and HBO_GT; bars annotated 1.11x and 1.45x.]


Application Performance: 28-processor runs

[Figure: bar chart of normalized speedup (0.0 to 2.5) for Barnes, Cholesky, FMM, Radiosity, Raytrace, Volrend, Water-Nsq, and Average; bars: TATAS, TATAS_EXP, MCS, HBO_GT.]


Conclusions

  • First-come, first-served not desirable for NUCAs
  • The HBO lock exploits NUCAs by
      • creating locality through CS affinity (stable lock)
      • reducing traffic compared with the test&set locks
  • HBO performs better under contention
  • Traffic is significantly reduced
  • Applications with contended locks scale better with HBO locks on NUCAs

Starvation detection/avoidance in the paper…


UART’s Home Page

http://www.it.uu.se/research/group/uart

Supported by Sun Microsystems, Inc., and the Parallel and Scientific Computing Institute (PSCI)