Scalable NUMA-aware Blocking Synchronization Primitives
Sanidhya Kashyap, Changwoo Min, Taesoo Kim

The rise of big NUMA machines

Importance of NUMA awareness

[Figure: writers W1–W6 spread across NUMA node 1 and NUMA node 2 contend for a lock L protecting a file.]

● NUMA oblivious hand-off order:          W1 W2 W3 W4 W6 W5
● NUMA aware/hierarchical hand-off order: W1 W6 W2 W3 W4 W5 (same-node waiters served back to back)

Idea: Make synchronization primitives NUMA aware!

Lock's research efforts and their use

● Lock research efforts
  – Dekker's algorithm (1962), Semaphore (1965), Lamport's bakery algorithm (1974),
    Backoff lock (1989), Ticket lock (1991), MCS lock (1991)
  – NUMA-aware locks: HBO lock (2003), Hierarchical lock – HCLH (2006),
    Flat combining NUMA lock (2011), Remote Core locking (2012), Cohort lock (2012),
    RW cohort lock (2013), Malthusian lock (2014), HMCS lock (2015), AHMCS lock (2016)

● Linux kernel lock adoption / modification
  – 1990s: Spinlock → TTAS;      Semaphore → TTAS + block;    Rwsem → TTAS + block
  – 2011:  Spinlock → ticket;    Mutex → TTAS + spin + block; Rwsem → TTAS + spin + block
  – 2014:  Spinlock → ticket;    Mutex → TTAS + block;        Rwsem → TTAS + block
  – 2016:  Spinlock → qspinlock; Mutex → TTAS + spin + block; Rwsem → TTAS + spin + block

Adopting NUMA-aware locks is not easy

Issues with NUMA-aware primitives

● Memory footprint overhead
  – Cohort lock single instance: 1,600 bytes
  – Example: 1–4 GB of lock space vs. 38 MB for Linux's locks for 10 M inodes
● No support for blocking/parking behavior

Blocking/parking approach

● Under subscription
  – #threads <= #cores
● Over subscription
  – #threads > #cores
● Spin-then-park strategy (see the sketch after the figure)
  1) Spin for a certain duration
  2) Add to a parking list
  3) Schedule out (park/block)

[Figure: lock throughput vs. #threads for pure spinning, pure parking, and spin-then-park, covering both the under-subscribed and over-subscribed regimes.]

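As a rough illustration of the spin-then-park strategy above, here is a minimal user-space sketch (not the kernel implementation): C11 atomics carry the grant flag, and a mutex/condition variable stands in for the kernel's parking list and wake-up path. Names such as spin_then_park, grant_and_wake, and SPIN_LIMIT are illustrative.

```c
/*
 * Minimal user-space sketch of spin-then-park (not the kernel code).
 * A mutex/condition variable stands in for the kernel's parking list
 * and wake-up path; SPIN_LIMIT is an arbitrary tuning knob.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SPIN_LIMIT 4096

struct parking_lot {
    pthread_mutex_t mu;   /* protects the (implicit) parking list */
    pthread_cond_t  cv;   /* wakes parked waiters                 */
};

/* 1) spin for a while; 2)+3) park until *granted becomes true. */
static void spin_then_park(atomic_bool *granted, struct parking_lot *lot)
{
    for (int i = 0; i < SPIN_LIMIT; i++)
        if (atomic_load_explicit(granted, memory_order_acquire))
            return;                              /* got it while spinning */

    pthread_mutex_lock(&lot->mu);
    while (!atomic_load_explicit(granted, memory_order_acquire))
        pthread_cond_wait(&lot->cv, &lot->mu);   /* schedule out (park)   */
    pthread_mutex_unlock(&lot->mu);
}

/* Release side: publish the grant, then wake any parked waiter. */
static void grant_and_wake(atomic_bool *granted, struct parking_lot *lot)
{
    atomic_store_explicit(granted, true, memory_order_release);
    pthread_mutex_lock(&lot->mu);
    pthread_cond_broadcast(&lot->cv);
    pthread_mutex_unlock(&lot->mu);
}
```
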
Issues with blocking synchronization primitives

● High memory footprint for NUMA-aware locks
● Inefficient blocking strategy
  – Scheduling overhead in the critical path
  – Cache-line contention while scheduling out

CST lock

● NUMA-aware lock
● Low memory footprint
  – Allocates socket-specific data structures only when they are used
  – 1.5–10X less memory consumption
● Efficient parking/wake-up strategy
  – Limits spinning to a waiter's time quantum
  – Passes the lock to an active waiter
  – Improves scalability by 1.2–4.7X

CST lock design

● NUMA-aware lock
  ➢ Follows the cohort lock principle
  + Mitigates cache-line contention and bouncing
● Memory-efficient data structure
  ➢ Allocate a socket structure (snode) only when it is used
  ➢ Snodes stay allocated for the lifetime of the lock
  + Does not stress the memory allocator

CST lock design

● NUMA-aware parking list
  ➢ Maintain separate per-socket parking lists for readers and writers
  + Mitigates cache-line contention in the over-subscribed scenario
  + Allows distributed wake-up of parked readers

CST lock design

● Remove scheduler intervention
  ➢ Pass the lock to a spinning waiter
  ➢ Waiters park themselves if more than one task is running on their CPU (system load); see the sketch below
  + Scheduler is not involved in the critical path
  + Guarantees forward progress of the system

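A minimal sketch of this scheduling-aware parking decision, under the assumption that the caller obtains the per-CPU runnable-task count from the scheduler (how it does so is not shown, and the helper name is illustrative, not the actual CST-lock code):

```c
#include <stdbool.h>

/*
 * Sketch only: runnable_on_this_cpu is assumed to be supplied by the
 * scheduler (e.g. the per-CPU run-queue length).
 */
static inline bool should_park(unsigned int runnable_on_this_cpu)
{
    /* Keep spinning while this task is alone on its CPU, so the lock
     * can be handed over without waking the scheduler; park only when
     * the CPU is over-subscribed and another task is ready to run. */
    return runnable_on_this_cpu > 1;
}
```
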
Lock instantiation

[Figure: the lock starts with an empty socket_list and global_tail; once T1, T2, T3 arrive on socket 1 an snode for S1 is allocated (socket_list [S1]), and when T4 arrives on socket 2 an snode for S2 is added (socket_list [S1, S2]).]

● Initially no snodes are allocated
● The first thread arriving on a socket initiates the allocation for that socket (see the sketch below)

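The following is a small sketch of this lazy per-socket allocation, assuming a fixed maximum socket count and an array-shaped socket_list for simplicity (the real lock links snodes into a list); field and function names are illustrative.

```c
#include <stdatomic.h>
#include <stdlib.h>

#define MAX_SOCKETS 8                       /* assumption for the sketch */

struct snode {                              /* per-socket queue head     */
    _Atomic(void *) waiting_tail;           /* tail of spinning waiters  */
    _Atomic(void *) parking_tail;           /* tail of parked waiters    */
};

struct cst_lock {
    _Atomic(struct snode *) socket_list[MAX_SOCKETS];
    _Atomic(void *)         global_tail;    /* cohort-style global lock  */
};

/* Return this socket's snode, allocating it on first use. */
static struct snode *get_snode(struct cst_lock *lock, int socket_id)
{
    struct snode *sn = atomic_load(&lock->socket_list[socket_id]);
    if (sn)
        return sn;                          /* fast path: already there  */

    struct snode *fresh = calloc(1, sizeof(*fresh));
    if (!fresh)
        return NULL;

    struct snode *expected = NULL;
    if (atomic_compare_exchange_strong(&lock->socket_list[socket_id],
                                       &expected, fresh))
        return fresh;                       /* we published the snode    */

    free(fresh);                            /* lost the race to another  */
    return expected;                        /* thread; use its snode     */
}
```
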
CST lock phase

[Figure: a CST lock instance holds the socket_list and global_tail; each snode has a waiting_tail, a parking_tail, and a snode_next link. T1, T2, T3 queue on socket 1's snode and T4 on socket 2's; T1, the first waiter of its socket, acquires the global lock.]

● Allocate a thread-specific queue node on the stack
● Three states for each node (see the sketch below)
  – L  → locked
  – UW → unparked/spinning waiter
  – PW → parked/blocked/scheduled-out waiter

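A sketch of the per-waiter node and its three states, plus the loop a spinning waiter runs until its predecessor hands the lock over; enqueueing on the snode's waiting queue and the cohort-style global acquisition are elided, and the names mirror the slides rather than the actual code.

```c
#include <stdatomic.h>

enum qstate {
    STATE_L,                          /* locked: this waiter owns the lock */
    STATE_UW,                         /* unparked waiter: spinning         */
    STATE_PW,                         /* parked waiter: scheduled out      */
};

struct qnode {                        /* lives on the waiter's stack       */
    _Atomic enum qstate      state;
    _Atomic(struct qnode *)  next;    /* next waiter in the waiting queue  */
    _Atomic(struct qnode *)  p_next;  /* next node in the parking list     */
};

/* Spin until the previous owner flips our state from UW to L
 * (the parking decision from the earlier sketch is elided). */
static void wait_for_handoff(struct qnode *me)
{
    while (atomic_load_explicit(&me->state, memory_order_acquire)
           != STATE_L)
        ;                             /* busy-wait on our own cache line   */
}
```
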
CST lock phase: blocking/parking

[Figure: T2 decides to schedule out; its state changes from UW to PW and its node is linked into socket 1's parking list.]

● Before scheduling out, waiters atomically (see the sketch below)
  – Update their status from UW to PW
  – Add themselves to the parking list

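Reusing the qnode definition from the lock-phase sketch, a rough sketch of that transition: a CAS moves the node from UW to PW, and only if it succeeds is the node pushed onto the parking list; if the CAS fails, the lock has just been handed to this waiter and it must not park. Ordering subtleties of the real algorithm are elided.

```c
#include <stdbool.h>
/* Builds on the qnode/enum qstate sketch above. */

/* Returns true if the caller should now schedule out (park). */
static bool try_park(struct qnode *me,
                     _Atomic(struct qnode *) *parking_tail)
{
    enum qstate expected = STATE_UW;
    if (!atomic_compare_exchange_strong(&me->state, &expected, STATE_PW))
        return false;                  /* state became L: keep running   */

    /* Push ourselves onto the per-socket parking list. */
    struct qnode *old = atomic_load(parking_tail);
    do {
        atomic_store(&me->p_next, old);
    } while (!atomic_compare_exchange_weak(parking_tail, &old, me));

    return true;                       /* caller parks until woken up    */
}
```
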
CST unlock phase

[Figure: T1 releases the lock while T2 has parked (PW) and T3 is still spinning (UW); T1 skips T2 and passes the lock directly to T3.]

● Pass the lock to a spinning waiter, skipping waiters that have parked in the meantime (see the sketch below)

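Again building on the qnode sketch, a rough outline of that hand-off: walk the waiting queue and flip the first still-spinning (UW) waiter to L; a waiter whose state has already become PW is skipped, and waking parked waiters is left out of this sketch.

```c
/* Hand the lock to the first successor that is still spinning. */
static void pass_lock(struct qnode *owner)
{
    struct qnode *succ = atomic_load(&owner->next);

    while (succ) {
        enum qstate expected = STATE_UW;
        if (atomic_compare_exchange_strong(&succ->state, &expected,
                                           STATE_L))
            return;                    /* succ now owns the lock          */
        /* succ parked (PW) in the meantime: skip it. */
        succ = atomic_load(&succ->next);
    }
    /* No spinning successor left: wake a parked waiter (not shown). */
}
```
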
Implementation

● Implemented in the Linux kernel
● Structures modified
  – File system: inode
  – Memory management: mmap_sem
● Please see our paper for
  – The read-write lock
  – Pseudo code

https://github.com/sslab-gatech/cst-locks

Evaluation

● How do the locks perform in terms of scalability and memory footprint?
● How effective is the blocking/parking strategy?
● Setup: 8-socket, 120-core NUMA machine

Case study: Psearchy

● Overcomes the memory footprint and scheduling overhead
● Uses 1.5–9.1X less memory than the Cohort lock
● Improves throughput by 1.4–1.6X

[Figure: memory utilization in MB and throughput in jobs/hour vs. #threads (up to 120) for Vanilla, Cohort, and CST.]

Effective parking strategy

● Better performance in both under- and over-subscribed scenarios
● Improves scalability by 1.3–3.7X

[Figure: throughput (M ops/sec) vs. #threads (1–256) for directory enumeration (rwsem) and file creation (mutex), comparing Vanilla, Cohort, and CST.]

Conclusion

● Two blocking synchronization primitives
  – NUMA-aware mutex and read-write semaphore
● Dynamically allocated data structures
  – Resolve the memory footprint issue of NUMA-aware locks
● Efficient spin-then-park strategy
  – Scheduling-aware parking/wake-up strategy
  – Mitigates scheduler interaction

Thank you!