Scalable NUMA-aware Blocking Synchronization Primitives
Sanidhya Kashyap, Changwoo Min, Taesoo Kim

The rise of big NUMA machines

Importance of NUMA awareness

[Figure: writers W1–W6 spread across NUMA node 1 and NUMA node 2 contend for a lock L protecting a file.]

● NUMA oblivious hand-off order:          W1 W2 W3 W4 W6 W5
● NUMA aware/hierarchical hand-off order: W1 W6 W2 W3 W4 W5 (same-node waiters served back to back)

Idea: Make synchronization primitives NUMA aware!

Lock's research efforts and their use

● Lock research efforts
  – Dekker's algorithm (1962), Semaphore (1965), Lamport's bakery algorithm (1974),
    Backoff lock (1989), Ticket lock (1991), MCS lock (1991)
  – NUMA-aware locks: HBO lock (2003), Hierarchical lock – HCLH (2006),
    Flat combining NUMA lock (2011), Remote Core locking (2012), Cohort lock (2012),
    RW cohort lock (2013), Malthusian lock (2014), HMCS lock (2015), AHMCS lock (2016)

● Linux kernel lock adoption / modification
  – 1990s: Spinlock → TTAS;      Semaphore → TTAS + block;    Rwsem → TTAS + block
  – 2011:  Spinlock → ticket;    Mutex → TTAS + spin + block; Rwsem → TTAS + spin + block
  – 2014:  Spinlock → ticket;    Mutex → TTAS + block;        Rwsem → TTAS + block
  – 2016:  Spinlock → qspinlock; Mutex → TTAS + spin + block; Rwsem → TTAS + spin + block

Adopting NUMA-aware locks is not easy

Issues with NUMA-aware primitives

● Memory footprint overhead
  – Cohort lock single instance: 1,600 bytes
  – Example: 1–4 GB of lock space vs. 38 MB for Linux's locks for 10 M inodes
● No support for blocking/parking behavior

Blocking/parking approach

● Under subscription
  – #threads <= #cores
● Over subscription
  – #threads > #cores
● Spin-then-park strategy (see the sketch after the figure)
  1) Spin for a certain duration
  2) Add to a parking list
  3) Schedule out (park/block)

[Figure: lock throughput vs. #threads for pure spinning, pure parking, and spin-then-park, covering both the under-subscribed and over-subscribed regimes.]

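As a rough illustration of the spin-then-park strategy above, here is a minimal user-space sketch (not the kernel implementation): C11 atomics carry the grant flag, and a mutex/condition variable stands in for the kernel's parking list and wake-up path. Names such as spin_then_park, grant_and_wake, and SPIN_LIMIT are illustrative.

```c
/*
 * Minimal user-space sketch of spin-then-park (not the kernel code).
 * A mutex/condition variable stands in for the kernel's parking list
 * and wake-up path; SPIN_LIMIT is an arbitrary tuning knob.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SPIN_LIMIT 4096

struct parking_lot {
    pthread_mutex_t mu;   /* protects the (implicit) parking list */
    pthread_cond_t  cv;   /* wakes parked waiters                 */
};

/* 1) spin for a while; 2)+3) park until *granted becomes true. */
static void spin_then_park(atomic_bool *granted, struct parking_lot *lot)
{
    for (int i = 0; i < SPIN_LIMIT; i++)
        if (atomic_load_explicit(granted, memory_order_acquire))
            return;                              /* got it while spinning */

    pthread_mutex_lock(&lot->mu);
    while (!atomic_load_explicit(granted, memory_order_acquire))
        pthread_cond_wait(&lot->cv, &lot->mu);   /* schedule out (park)   */
    pthread_mutex_unlock(&lot->mu);
}

/* Release side: publish the grant, then wake any parked waiter. */
static void grant_and_wake(atomic_bool *granted, struct parking_lot *lot)
{
    atomic_store_explicit(granted, true, memory_order_release);
    pthread_mutex_lock(&lot->mu);
    pthread_cond_broadcast(&lot->cv);
    pthread_mutex_unlock(&lot->mu);
}
```
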
Issues with blocking synchronization primitives

● High memory footprint for NUMA-aware locks
● Inefficient blocking strategy
  – Scheduling overhead in the critical path
  – Cache-line contention while scheduling out

CST lock

● NUMA-aware lock
● Low memory footprint
  – Allocates socket-specific data structures only when they are used
  – 1.5–10X less memory consumption
● Efficient parking/wake-up strategy
  – Limits spinning to a waiter's time quantum
  – Passes the lock to an active waiter
  – Improves scalability by 1.2–4.7X

CST lock design

● NUMA-aware lock
  ➢ Follows the cohort lock principle
  + Mitigates cache-line contention and bouncing
● Memory-efficient data structure
  ➢ Allocate a socket structure (snode) only when it is used
  ➢ Snodes stay allocated for the lifetime of the lock
  + Does not stress the memory allocator

CST lock design

● NUMA-aware parking list
  ➢ Maintain separate per-socket parking lists for readers and writers
  + Mitigates cache-line contention in the over-subscribed scenario
  + Allows distributed wake-up of parked readers

CST lock design

● Remove scheduler intervention
  ➢ Pass the lock to a spinning waiter
  ➢ Waiters park themselves if more than one task is running on their CPU (system load); see the sketch below
  + Scheduler is not involved in the critical path
  + Guarantees forward progress of the system

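A minimal sketch of this scheduling-aware parking decision, under the assumption that the caller obtains the per-CPU runnable-task count from the scheduler (how it does so is not shown, and the helper name is illustrative, not the actual CST-lock code):

```c
#include <stdbool.h>

/*
 * Sketch only: runnable_on_this_cpu is assumed to be supplied by the
 * scheduler (e.g. the per-CPU run-queue length).
 */
static inline bool should_park(unsigned int runnable_on_this_cpu)
{
    /* Keep spinning while this task is alone on its CPU, so the lock
     * can be handed over without waking the scheduler; park only when
     * the CPU is over-subscribed and another task is ready to run. */
    return runnable_on_this_cpu > 1;
}
```
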
Lock instantiation

[Figure: the lock starts with an empty socket_list and global_tail; once T1, T2, T3 arrive on socket 1 an snode for S1 is allocated (socket_list [S1]), and when T4 arrives on socket 2 an snode for S2 is added (socket_list [S1, S2]).]

● Initially no snodes are allocated
● The first thread arriving on a socket initiates the allocation for that socket (see the sketch below)

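The following is a small sketch of this lazy per-socket allocation, assuming a fixed maximum socket count and an array-shaped socket_list for simplicity (the real lock links snodes into a list); field and function names are illustrative.

```c
#include <stdatomic.h>
#include <stdlib.h>

#define MAX_SOCKETS 8                       /* assumption for the sketch */

struct snode {                              /* per-socket queue head     */
    _Atomic(void *) waiting_tail;           /* tail of spinning waiters  */
    _Atomic(void *) parking_tail;           /* tail of parked waiters    */
};

struct cst_lock {
    _Atomic(struct snode *) socket_list[MAX_SOCKETS];
    _Atomic(void *)         global_tail;    /* cohort-style global lock  */
};

/* Return this socket's snode, allocating it on first use. */
static struct snode *get_snode(struct cst_lock *lock, int socket_id)
{
    struct snode *sn = atomic_load(&lock->socket_list[socket_id]);
    if (sn)
        return sn;                          /* fast path: already there  */

    struct snode *fresh = calloc(1, sizeof(*fresh));
    if (!fresh)
        return NULL;

    struct snode *expected = NULL;
    if (atomic_compare_exchange_strong(&lock->socket_list[socket_id],
                                       &expected, fresh))
        return fresh;                       /* we published the snode    */

    free(fresh);                            /* lost the race to another  */
    return expected;                        /* thread; use its snode     */
}
```
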
CST lock phase

[Figure: a CST lock instance holds the socket_list and global_tail; each snode has a waiting_tail, a parking_tail, and a snode_next link. T1, T2, T3 queue on socket 1's snode and T4 on socket 2's; T1, the first waiter of its socket, acquires the global lock.]

● Allocate a thread-specific queue node on the stack
● Three states for each node (see the sketch below)
  – L  → locked
  – UW → unparked/spinning waiter
  – PW → parked/blocked/scheduled-out waiter

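A sketch of the per-waiter node and its three states, plus the loop a spinning waiter runs until its predecessor hands the lock over; enqueueing on the snode's waiting queue and the cohort-style global acquisition are elided, and the names mirror the slides rather than the actual code.

```c
#include <stdatomic.h>

enum qstate {
    STATE_L,                          /* locked: this waiter owns the lock */
    STATE_UW,                         /* unparked waiter: spinning         */
    STATE_PW,                         /* parked waiter: scheduled out      */
};

struct qnode {                        /* lives on the waiter's stack       */
    _Atomic enum qstate      state;
    _Atomic(struct qnode *)  next;    /* next waiter in the waiting queue  */
    _Atomic(struct qnode *)  p_next;  /* next node in the parking list     */
};

/* Spin until the previous owner flips our state from UW to L
 * (the parking decision from the earlier sketch is elided). */
static void wait_for_handoff(struct qnode *me)
{
    while (atomic_load_explicit(&me->state, memory_order_acquire)
           != STATE_L)
        ;                             /* busy-wait on our own cache line   */
}
```
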
CST lock phase: blocking/parking

[Figure: T2 decides to schedule out; its state changes from UW to PW and its node is linked into socket 1's parking list.]

● Before scheduling out, waiters atomically (see the sketch below)
  – Update their status from UW to PW
  – Add themselves to the parking list

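Reusing the qnode definition from the lock-phase sketch, a rough sketch of that transition: a CAS moves the node from UW to PW, and only if it succeeds is the node pushed onto the parking list; if the CAS fails, the lock has just been handed to this waiter and it must not park. Ordering subtleties of the real algorithm are elided.

```c
#include <stdbool.h>
/* Builds on the qnode/enum qstate sketch above. */

/* Returns true if the caller should now schedule out (park). */
static bool try_park(struct qnode *me,
                     _Atomic(struct qnode *) *parking_tail)
{
    enum qstate expected = STATE_UW;
    if (!atomic_compare_exchange_strong(&me->state, &expected, STATE_PW))
        return false;                  /* state became L: keep running   */

    /* Push ourselves onto the per-socket parking list. */
    struct qnode *old = atomic_load(parking_tail);
    do {
        atomic_store(&me->p_next, old);
    } while (!atomic_compare_exchange_weak(parking_tail, &old, me));

    return true;                       /* caller parks until woken up    */
}
```
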
CST unlock phase

[Figure: T1 releases the lock while T2 has parked (PW) and T3 is still spinning (UW); T1 skips T2 and passes the lock directly to T3.]

● Pass the lock to a spinning waiter, skipping waiters that have parked in the meantime (see the sketch below)

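Again building on the qnode sketch, a rough outline of that hand-off: walk the waiting queue and flip the first still-spinning (UW) waiter to L; a waiter whose state has already become PW is skipped, and waking parked waiters is left out of this sketch.

```c
/* Hand the lock to the first successor that is still spinning. */
static void pass_lock(struct qnode *owner)
{
    struct qnode *succ = atomic_load(&owner->next);

    while (succ) {
        enum qstate expected = STATE_UW;
        if (atomic_compare_exchange_strong(&succ->state, &expected,
                                           STATE_L))
            return;                    /* succ now owns the lock          */
        /* succ parked (PW) in the meantime: skip it. */
        succ = atomic_load(&succ->next);
    }
    /* No spinning successor left: wake a parked waiter (not shown). */
}
```
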
Implementation

● Implemented in the Linux kernel
● Structures modified
  – File system: inode
  – Memory management: mmap_sem
● Please see our paper for
  – The read-write lock
  – Pseudo code

https://github.com/sslab-gatech/cst-locks

Evaluation

● How do the locks perform in terms of scalability and memory footprint?
● How effective is the blocking/parking strategy?
● Setup: 8-socket, 120-core NUMA machine

Case study: Psearchy

● Overcomes the memory footprint and scheduling overhead
● Uses 1.5–9.1X less memory than the Cohort lock
● Improves throughput by 1.4–1.6X

[Figure: memory utilization in MB and throughput in jobs/hour vs. #threads (up to 120) for Vanilla, Cohort, and CST.]

Effective parking strategy

● Better performance in both under- and over-subscribed scenarios
● Improves scalability by 1.3–3.7X

[Figure: throughput (M ops/sec) vs. #threads (1–256) for directory enumeration (rwsem) and file creation (mutex), comparing Vanilla, Cohort, and CST.]

Conclusion

● Two blocking synchronization primitives
  – NUMA-aware mutex and read-write semaphore
● Dynamically allocated data structures
  – Resolve the memory footprint issue of NUMA-aware locks
● Efficient spin-then-park strategy
  – Scheduling-aware parking/wake-up strategy
  – Mitigates scheduler interaction

Thank you!