HBO Locks
Uppsala University, Department of Information Technology
Uppsala Architecture Research Team [UART]
Hierarchical Back-Off (HBO) Locks for Non-Uniform Communication Architectures
Zoran Radovic and Erik Hagersten, {zoran.radovic, erik.hagersten}@it.uu.se
HPCA-9, Ninth International Symposium on High Performance Computer Architecture, Anaheim, California, February 8-12, 2003
Uppsala Architecture Research Team (UART), HBO Locks
Synchronization Basics
Locks are used to protect the shared critical section data
Common software-based solutions:
Simple spin-locks
• TATAS (‘84)
• TATAS_EXP (‘90)
Queue-based locks
• MCS (‘91)
• CLH (‘93)
Example:
A:=0
BARRIER
LOCK(L) A:=A+1 UNLOCK(L)
LOCK(L) B:=A+5 UNLOCK(L)
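As a point of reference for the spin locks listed above, here is a minimal sketch of a test-and-test&set lock with exponential backoff (TATAS_EXP) using C11 atomics. The type and function names (`tatas_lock`, `tatas_exp_acquire`, `tatas_release`, `spin_delay`) and the backoff constants are assumptions made for illustration; they do not come from the talk.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative backoff bounds (assumed values, not tuned). */
enum { BACKOFF_BASE = 16, BACKOFF_CAP = 1024 };

typedef struct { atomic_int held; } tatas_lock;   /* 0 = free, 1 = held */

/* Busy-wait for roughly n loop iterations. */
static void spin_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { }
}

void tatas_exp_acquire(tatas_lock *l) {
    unsigned backoff = BACKOFF_BASE;
    for (;;) {
        /* "Test": spin on plain reads while the lock looks held. */
        while (atomic_load_explicit(&l->held, memory_order_relaxed) != 0) { }
        /* "Test&set": try to take the lock atomically. */
        if (atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0)
            return;
        /* Lost the race: wait, then double the delay up to the cap. */
        spin_delay(backoff);
        if (backoff < BACKOFF_CAP) backoff *= 2;
    }
}

void tatas_release(tatas_lock *l) {
    /* Releasing a lock that is not held is a caller bug. */
    assert(atomic_load_explicit(&l->held, memory_order_relaxed) == 1);
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

Plain TATAS is the same loop without the `spin_delay` step.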
Raytrace Speedup
[Figure: Raytrace speedup (0 to 9) versus number of processors (0 to 28) for TATAS and MCS on a Sun WildFire (WF) with 14 + 14 processors.]
Vasaloppet: “Contention Problem in Sweden”
Traditional cross-country ski race, 55 miles …
51.6533 miles to go… CS
Spin Locks Under Contention
[Figure: critical section (CS) cost versus amount of contention, with curves for spin locks and spin locks w/ backoff.]
IF (more contention) THEN less efficient CS …
“The more important the slower it runs…”
Queue-based Locks
[Figure: CS cost versus amount of contention, with curves for spin locks, spin locks w/ backoff, and queue-based locks.]
IF (more contention) THEN constant CS cost …
This Talk
[Figure: CS cost versus amount of contention, with curves for queue-based locks, spin locks, spin locks w/ backoff, and HBO locks.]
IF (more contention) THEN more efficient CS …
“The more important the faster it runs…”
Raytrace Speedup
[Figure: Raytrace speedup (0 to 9) versus number of processors (0 to 28) for TATAS, MCS, and HBO Locks on a Sun WildFire (WF) with 14 + 14 processors.]
Outline
Background & Motivation
NUMA vs. NUCA Architectures
Hierarchical Back-Off (HBO) Locks
• HBO
• HBO_GT
• HBO_GT with starvation detection/avoidance
Performance Results
Conclusions
Non-Uniform Memory Architecture (NUMA)
Many NUMA optimizations are proposed
• Page migration: speeds up accesses to “private” data
• Page replication: speeds up reads to “shared” data
Does not help communication… E.g., synchronization
[Figure: two nodes, each with processors P1 … Pn, per-processor caches ($), and a memory, connected by a switch. Access time ratio 1 : 2 – 10.]
Non-Uniform Communication Architecture (NUCA)
A “new” property of NUMAs: NUCA
NUCA examples (NUCA ratios):
• 1992: Stanford DASH (~ 4.5)
• 1996: Sequent NUMA-Q (~ 10)
• 1999: Sun WildFire (~ 6)
• 2000: Compaq DS-320 (~ 3.5)
• Future: CMP, SMT (~ 10)
[Figure: two nodes, each with processors P1 … Pn, per-processor caches ($), and a memory, connected by a switch. NUCA ratio 1 : 2 – 10.]
NUCA optimizations are getting important for future architectures!
Our Goals
Design scalable spin locks that exploit NUCAs
• Create communication affinity: keep the lock in the neighborhood [Mr. Rogers, 1968]
• Speeds up lock handover
• Lowers the access cost to critical section (CS) data
• Reduce remote “probing” traffic
• Portable and scalable to many NUCA nodes
The HBO Lock (the simplest HBO)
What do we need?
• node_id
• Compare&swap (CAS) atomic operation: CAS(Lock_address, FREE, node_id)
lock-acquire:
If the lock-value is in the state FREE:
• The node_id is CAS-ed into the lock location
Else: 2 cases (for 2 levels of non-uniformity):
• The lock is “local”: TATAS_EXP with small backoff
• The lock is “remote”: TATAS_EXP with large backoff
Simple but fairly effective…
Creates communication affinity
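The lock-acquire steps above can be sketched in C11 roughly as follows. This is a simplified illustration, not the authors' code: the `FREE` encoding, the backoff constants, and the names `hbo_lock`, `cas_free_to`, `hbo_acquire`, and `hbo_release` are all assumptions. The lock word holds either `FREE` or the node id of the current holder, so a failed CAS tells the caller whether the holder is local or remote.

```c
#include <assert.h>
#include <stdatomic.h>

#define FREE (-1)   /* assumed encoding; node ids are 0, 1, 2, ... */

/* Illustrative constants (assumed): small local, large remote backoff. */
enum { LOCAL_BASE = 8,   LOCAL_CAP = 256,
       REMOTE_BASE = 64, REMOTE_CAP = 4096 };

typedef atomic_int hbo_lock;   /* holds FREE or the holder's node_id */

static void spin_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { }
}

/* CAS(Lock_address, FREE, node_id): returns the value seen in the lock. */
int cas_free_to(hbo_lock *l, int node_id) {
    int expected = FREE;
    atomic_compare_exchange_strong_explicit(l, &expected, node_id,
        memory_order_acquire, memory_order_relaxed);
    return expected;   /* FREE on success, otherwise the holder's node id */
}

void hbo_acquire(hbo_lock *l, int my_node_id) {
    unsigned local_b = LOCAL_BASE, remote_b = REMOTE_BASE;
    for (;;) {
        int seen = cas_free_to(l, my_node_id);
        if (seen == FREE)
            return;                          /* lock was FREE: we own it */
        if (seen == my_node_id) {            /* held in our node: "local" */
            spin_delay(local_b);
            if (local_b < LOCAL_CAP) local_b *= 2;
        } else {                             /* held elsewhere: "remote" */
            spin_delay(remote_b);
            if (remote_b < REMOTE_CAP) remote_b *= 2;
        }
    }
}

void hbo_release(hbo_lock *l) {
    assert(atomic_load_explicit(l, memory_order_relaxed) != FREE);
    atomic_store_explicit(l, FREE, memory_order_release);
}
```

Because local waiters retry sooner than remote ones, a released lock tends to stay within the node that last held it, which is the communication affinity the slide refers to.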
The HBO_GT Lock (GT = Global Throttling)
[Figure: two nodes, Node 2 and Node 5, each with four processors (P), caches ($), and a memory. Locks Lock1, Lock2, and Lock3 are shown as FREE or as holding a node id (here 2, a remote_node_id). Each node has a my_is_spinning word, shown as 0x00000000 or addr(Lock1). Labels: Local spinning; Remote spinning (w/ exp. backoff); Probing... (with CAS); Read a node-local flag...]
The HBO_GT Lock (GT = Global Throttling)
A couple of nanoseconds later …
The HBO_GT Lock (GT = Global Throttling)
[Figure: the same two nodes, Node 2 and Node 5, a moment later. The lock value shown is now 5 (remote_node_id). Each node has a my_is_spinning word, shown as 0x00000000 or addr(Lock1). Labels: Local spinning; Remote spinning (w/ exp. backoff); Probing... (with CAS); Read a node-local flag...]
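One way to read the labels in the HBO_GT figures is sketched below in C11. It is an interpretation under stated assumptions, not the authors' implementation: each node has one `is_spinning` word holding the address of the lock that some processor in the node is currently probing remotely, and the other processors in that node read this node-local word instead of sending their own remote probes. The names, the `FREE` encoding, `MAX_NODES`, and the backoff constants are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define FREE (-1)   /* assumed encoding of the FREE state */
enum { MAX_NODES = 8,
       LOCAL_BASE = 8,   LOCAL_CAP = 256,
       REMOTE_BASE = 64, REMOTE_CAP = 4096 };

typedef atomic_int hbo_lock;   /* holds FREE or the holder's node id */

/* One word per node: address of the lock being probed remotely, or 0. */
static _Atomic uintptr_t is_spinning[MAX_NODES];

static void spin_delay(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++) { }
}

void hbo_gt_acquire(hbo_lock *l, int node) {
    unsigned local_b = LOCAL_BASE, remote_b = REMOTE_BASE;
    int am_remote_spinner = 0;   /* 1 while we are this node's prober */
    for (;;) {
        /* Read a node-local flag: wait while a neighbour in our node is
           already probing this same lock remotely. */
        while (!am_remote_spinner &&
               atomic_load(&is_spinning[node]) == (uintptr_t)l) { }
        int seen = FREE;
        int won = atomic_compare_exchange_strong(l, &seen, node);
        if (won || seen == node) {
            /* Lock is ours, or held inside our node: stop throttling. */
            if (am_remote_spinner) {
                atomic_store(&is_spinning[node], 0);
                am_remote_spinner = 0;
            }
            if (won) return;
            spin_delay(local_b);               /* local: small backoff */
            if (local_b < LOCAL_CAP) local_b *= 2;
        } else {
            /* Held by another node: publish the lock address so that
               neighbours wait locally, then back off for longer. */
            atomic_store(&is_spinning[node], (uintptr_t)l);
            am_remote_spinner = 1;
            spin_delay(remote_b);
            if (remote_b < REMOTE_CAP) remote_b *= 2;
        }
    }
}

void hbo_gt_release(hbo_lock *l) {
    assert(atomic_load(l) != FREE);
    atomic_store(l, FREE);
}
```

With a single word per node, a prober for a different lock can overwrite the published address; in this sketch that only releases throttled waiters early, it does not block them.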
Our NUCA: Sun WildFire
[Figure: two nodes, each with processors P1 … P14, per-processor caches ($), and a memory, connected by a switch. NUCA ratio 1 : 6. WF: 14 + 14 processors.]
Traditional Microbenchmark
For each thread:
    for (i = 0; i < iterations; i++) {
        LOCK(L);
        /* null/small Critical Section */
        UNLOCK(L);
    }
NUCA-performance: Traditional microbenchmark
[Figure, left panel: time in microseconds (0 to 60) versus number of processors (0 to 28) for TATAS, MCS, and HBO_GT on the Sun WildFire (WF).]
[Figure, right panel: node handoffs in percent (0 to 100) versus number of processors (0 to 28) for TATAS, MCS, and HBO_GT.]
New Microbenchmark
    for (i = 0; i < iterations; i++) {
        LOCK(L);
        delay(critical_work); // CS
        UNLOCK(L);
        static_delay();
        random_delay();
    }
• More realistic node handoffs for queue-locks
• Constant number of processors
• Control the “amount of contention”
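The microbenchmark results are reported partly as node handoffs in percent. Assuming this means the share of consecutive lock acquisitions in which the lock moves from one node to another (the slides do not spell out the definition), it could be computed from a trace of holder node ids as in this sketch. The function name `node_handoff_percent` and the trace format are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* holder_node[i] is the node id of the i-th thread to hold the lock.
   Returns the percentage of handovers that cross a node boundary. */
double node_handoff_percent(const int *holder_node, size_t n) {
    if (n < 2)
        return 0.0;                     /* no handover happened */
    size_t crossings = 0;
    for (size_t i = 1; i < n; i++) {
        if (holder_node[i] != holder_node[i - 1])
            crossings++;                /* lock moved to another node */
    }
    return 100.0 * (double)crossings / (double)(n - 1);
}
```

For example, the trace 0, 0, 0, 1, 1 has four handovers, one of which crosses nodes, giving 25 percent.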
Performance Results
New microbenchmark, 2-node Sun WildFire, 28 CPUs
[Figure, left panel: time in seconds (3 to 12) versus critical_work (0 to 2000) for TATAS, MCS, and HBO_GT on the WF with 14 + 14 processors.]
[Figure, right panel: node handoffs in percent (0 to 60) versus critical_work (0 to 2000).]
Fairness?
Fairness Study
New microbenchmark, 2-node Sun WildFire, 28 CPUs
[Figure: number of finished processors (0 to 28) versus time in seconds (0 to 15) for TATAS, MCS, and HBO_GT.]
Application Performance: Raytrace Speedup
[Figure: Raytrace speedup (0 to 8) versus number of processors (0 to 28) for TATAS and MCS on the WF.]
Application Performance: Raytrace Speedup
[Figure: Raytrace speedup (0 to 8) versus number of processors (0 to 28) for TATAS, MCS, HBO, and HBO_GT on the WF.]
HBO Locks Under Contention
[Figure: CS cost versus amount of contention, with curves for queue-based locks, spin locks, spin locks w/ backoff, and HBO locks.]
Total Traffic: Raytrace
[Figure: bar chart (scale 0.0 to 1.4) of local and global transactions for TATAS, TATAS_EXP, MCS, and HBO_GT, with bars annotated 1.11x and 1.45x.]
Application Performance: 28-processor runs
[Figure: normalized speedup (0.0 to 2.5) for Barnes, Cholesky, FMM, Radiosity, Raytrace, Volrend, Water-Nsq, and the average, with bars for TATAS, TATAS_EXP, MCS, and HBO_GT.]
Conclusions
First-come, first-served is not desirable for NUCAs
The HBO lock exploits NUCAs by
• creating locality through CS affinity (stable lock)
• reducing traffic compared with the test&set locks
HBO performs better under contention
• Traffic is significantly reduced
Applications with contended locks scale better with HBO locks on NUCAs
Starvation detection/avoidance in the paper…
UART’s Home Page
http://www.it.uu.se/research/group/uart
Supported by Sun Microsystems, Inc., and the Parallel and Scientific Computing Institute (PSCI)