42
Scalable NUMA-aware Blocking Synchronization Primitives Sanidhya Kashyap, Changwoo Min, Taesoo Kim

Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Scalable NUMA-awareBlocking Synchronization Primitives

Sanidhya Kashyap, Changwoo Min, Taesoo Kim

Page 2: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

'i

The rise of big NUMA machines

Page 3: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

'i

The rise of big NUMA machines

Page 4: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

'i

The rise of big NUMA machines

Page 5: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Importance of NUMA awarenessNUMA node 1 NUMA node 2

W1

W2

W3

W4

W5W6

L

File

W1 W2 W3 W4 W6 W5

NUMA oblivious

Page 6: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Importance of NUMA awarenessNUMA node 1 NUMA node 2

W1

W2

W3

W4

W5W6

L

File

W1 W2 W3 W4 W6 W5

NUMA oblivious

W1 W6 W2 W3 W4 W5

NUMA aware/hierarchical

Page 7: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Importance of NUMA awarenessNUMA node 1 NUMA node 2

W1

W2

W3

W4

W5W6

L

File

Idea:Make synchronization primitives NUMA aware!

W1 W2 W3 W4 W6 W5

NUMA oblivious

W1 W6 W2 W3 W4 W5

NUMA aware/hierarchical

Page 8: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Lock's research efforts and their useLinux kernel lock

adoption / modificationDekker's algorithm (1962)Semaphore (1965)

Lamport's bakery algorithm (1974)Backoff lock (1989)Ticket lock (1991)MCS lock (1991)

Hierarchical lock – HCLH (2006)Flat combining NUMA lock (2011)

Remote Core locking (2012)Cohort lock (2012)

RW cohort lock (2013)Malthusian lock (2014)

HMCS lock (2015)AHMCS lock(2016)

HBO lock (2003)

Lock's research efforts

Page 9: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Lock's research efforts and their useLinux kernel lock

adoption / modificationDekker's algorithm (1962)Semaphore (1965)

Lamport's bakery algorithm (1974)Backoff lock (1989)Ticket lock (1991)MCS lock (1991)

Hierarchical lock – HCLH (2006)Flat combining NUMA lock (2011)

Remote Core locking (2012)Cohort lock (2012)

RW cohort lock (2013)Malthusian lock (2014)

HMCS lock (2015)AHMCS lock(2016)

HBO lock (2003)

NUMA-awarelocks

Lock's research efforts

Page 10: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Lock's research efforts and their useLinux kernel lock

adoption / modificationDekker's algorithm (1962)Semaphore (1965)

Lamport's bakery algorithm (1974)Backoff lock (1989)Ticket lock (1991)MCS lock (1991)

Hierarchical lock – HCLH (2006)Flat combining NUMA lock (2011)

Remote Core locking (2012)Cohort lock (2012)

RW cohort lock (2013)Malthusian lock (2014)

HMCS lock (2015)AHMCS lock(2016)

HBO lock (2003)

NUMA-awarelocks

Spinlock TTAS→

Semaphore TTAS + block→

Rwsem TTAS + block→

Spinlock ticket→

Mutex TTAS + spin + block→

Rwsem TTAS + spin + block→

Spinlock ticket→

Mutex TTAS + block→

Rwsem TTAS + block→

Spinlock qspinlock→

Mutex TTAS + spin + block→

Rwsem TTAS + spin + block→

Lock's research efforts

1990s

2011

2014

2016

Page 11: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Lock's research efforts and their useLinux kernel lock

adoption / modificationDekker's algorithm (1962)Semaphore (1965)

Lamport's bakery algorithm (1974)Backoff lock (1989)Ticket lock (1991)MCS lock (1991)

Hierarchical lock – HCLH (2006)Flat combining NUMA lock (2011)

Remote Core locking (2012)Cohort lock (2012)

RW cohort lock (2013)Malthusian lock (2014)

HMCS lock (2015)AHMCS lock(2016)

HBO lock (2003)

NUMA-awarelocks

Spinlock TTAS→

Semaphore TTAS + block→

Rwsem TTAS + block→

Spinlock ticket→

Mutex TTAS + spin + block→

Rwsem TTAS + spin + block→

Spinlock ticket→

Mutex TTAS + block→

Rwsem TTAS + block→

Spinlock qspinlock→

Mutex TTAS + spin + block→

Rwsem TTAS + spin + block→

Lock's research efforts

1990s

2011

2014

2016

Adopting NUMA aware locks is not easy

Page 12: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Issues with NUMA-aware primitives

● Memory footprint overhead– Cohort lock single instance: 1600 bytes– Example: 1–4 GB of lock space vs 38 MB of Linux’s

lock for 10 M inodes

● Does not support blocking/parking behavior

Page 13: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Blocking/parking approach● Under subscription

– #threads <= #cores

● Over subscription– #threads > #cores

● Spin-then-park strategy1) Spin for a certain duration

2) Add to a parking list

3) Schedule out (park/block)

under-subscription

Lock

thro

ughp

ut

#thread →

Page 14: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Blocking/parking approach● Under subscription

– #threads <= #cores

● Over subscription– #threads > #cores

● Spin-then-park strategy1) Spin for a certain duration

2) Add to a parking list

3) Schedule out (park/block)

under-subscription

Lock

thro

ughp

ut

#thread →

over-subscription

Page 15: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Blocking/parking approach● Under subscription

– #threads <= #cores

● Over subscription– #threads > #cores

● Spin-then-park strategy1) Spin for a certain duration

2) Add to a parking list

3) Schedule out (park/block)

under-subscription

Lock

thro

ughp

ut

#thread →

over-subscription

Spinning

Page 16: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Blocking/parking approach● Under subscription

– #threads <= #cores

● Over subscription– #threads > #cores

● Spin-then-park strategy1) Spin for a certain duration

2) Add to a parking list

3) Schedule out (park/block)

under-subscription

Lock

thro

ughp

ut

#thread →

over-subscription

Spinning

Parking

Page 17: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Blocking/parking approach● Under subscription

– #threads <= #cores

● Over subscription– #threads > #cores

● Spin-then-park strategy1) Spin for a certain duration

2) Add to a parking list

3) Schedule out (park/block)

under-subscription

Lock

thro

ughp

ut

#thread →

over-subscription

Spinning

Parking

Spin + park

Page 18: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Issues with blocking synchronization primitives

● High memory footprint for NUMA-aware locks● Inefficient blocking strategy

– Scheduling overhead in the critical path– Cache-line contention while scheduling out

Page 19: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

CST lock

● NUMA-aware lock● Low memory footprint

– Allocate socket specific data structure when used– 1.5–10X memory less memory consumption

● Efficient parking/wake-up strategy– Limit the spinning up to a waiter’s time quantum– Pass the lock to an active waiter– Improves scalability by 1.2–4.7X

Page 20: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

CST lock design

● NUMA-aware lock➢ Cohort lock principle

+ Mitigates cache-line contention and bouncing

● Memory efficient data structure➢ Allocate socket structure (snode) when used➢ Snodes are active until the life-cycle of the lock

+ Does not stress the memory allocator

Page 21: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

CST lock design

● NUMA-aware parking list➢ Maintain separate per-socket parking lists for

readers and writers

+ Mitigates cache-line contention in over-subscribed scenario

+ Allows distributed wake-up of parked readers

Page 22: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

CST lock design

● Remove scheduler intervention➢ Pass the lock to a spinning waiter➢ Waiters park themselves if more than one tasks are

running on a CPU (system load)

+ Scheduler not involved in the critical path

+ Guarantees forward progress of the system

Page 23: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Lock instantiation

socket_listsocket_list

global_tailglobal_tail

Threads: ● Initially no snodes are allocated

● Thread in a particular socket initiates an allocation

Page 24: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Lock instantiation

socket_listsocket_list

global_tailglobal_tail

T1/S1T1/S1Threads:

Socket 1

socket_list [S1]socket_list [S1]

● Initially no snodes are allocated

● Thread in a particular socket initiates an allocation

Page 25: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Lock instantiation

socket_listsocket_list

global_tailglobal_tail

T1/S1T1/S1Threads:

Socket 1

T2/S1T2/S1 T3/S1T3/S1

socket_list [S1]socket_list [S1]

● Initially no snodes are allocated

● Thread in a particular socket initiates an allocation

Page 26: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Lock instantiation

socket_listsocket_list

global_tailglobal_tail

T1/S1T1/S1Threads:

Socket 1

Socket 2

T2/S1T2/S1 T3/S1T3/S1 T4/S2T4/S2

socket_list [S1]socket_list [S1]socket_list [S1, S2]socket_list [S1, S2]

● Initially no snodes are allocated

● Thread in a particular socket initiates an allocation

Page 27: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

CST lock phaseCST lock instance

socket_listsocket_list

Threads:

global_tailglobal_tail

● Allocate thread specific structure on the stack● Three states for each node

– L locked→– UW unparked/spinning waiter→– PW parked / blocked / scheduled out waiter→

Page 28: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

CST lock phaseCST lock instance T1/S1T1/S1

socket_listsocket_list

waiting_tailwaiting_tail

parking_tailparking_tail

UWUW

snode_nextsnode_next

LL

Threads:

Socket 1

socket_list [S1]socket_list [S1]

global_tailglobal_tail

● Allocate thread specific structure on the stack● Three states for each node

– L locked→– UW unparked/spinning waiter→– PW parked / blocked / scheduled out waiter→

Page 29: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

CST lock phaseCST lock instance T1/S1T1/S1

socket_listsocket_list

waiting_tailwaiting_tail

parking_tailparking_tail

UWUW

snode_nextsnode_next

nextnext

p_nextp_next

LLLL

Threads:

Socket 1

socket_list [S1]socket_list [S1]

global_tailglobal_tail

T1

● Allocate thread specific structure on the stack● Three states for each node

– L locked→

– UW unparked/spinning waiter→

– PW parked / blocked / scheduled out waiter→

Page 30: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

CST lock phaseCST lock instance T1/S1T1/S1

socket_listsocket_list

waiting_tailwaiting_tail

parking_tailparking_tail

UWUW

snode_nextsnode_next

nextnext

p_nextp_next

LL

Acquire global lock

LL

Threads:

Socket 1

socket_list [S1]socket_list [S1]

global_tailglobal_tail

T1

● Allocate thread specific structure on the stack● Three states for each node

– L locked→

– UW unparked/spinning waiter→

– PW parked / blocked / scheduled out waiter→

Page 31: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

CST lock phaseCST lock instance T1/S1T1/S1

socket_listsocket_list

waiting_tailwaiting_tail

parking_tailparking_tail

UWUW

snode_nextsnode_next

nextnext

p_nextp_next

LLLL

Threads:

nextnext

p_nextp_next

UWUW

T2/S1T2/S1 T3/S1T3/S1

Socket 1

socket_list [S1]socket_list [S1]

global_tailglobal_tail nextnext

p_nextp_next

UWUWT1 T2 T3

● Allocate thread specific structure on the stack● Three states for each node

– L locked→

– UW unparked/spinning waiter→

– PW parked / blocked / scheduled out waiter→

Page 32: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

CST lock phase

nextnext

p_nextp_next

LL

CST lock instance T1/S1T1/S1

waiting_tailwaiting_tail

parking_tailparking_tail

UWUW

snode_nextsnode_next

socket_listsocket_list

waiting_tailwaiting_tail

parking_tailparking_tail

UWUW

snode_nextsnode_next

nextnext

p_nextp_next

LLLL

Threads:

nextnext

p_nextp_next

UWUW

T2/S1T2/S1 T3/S1T3/S1

Socket 1

Socket 2

socket_list [S1]socket_list [S1]socket_list [S1, S2]socket_list [S1, S2]

global_tailglobal_tail

T4/S2T4/S2

nextnext

p_nextp_next

UWUWT1 T2 T3

T4

● Allocate thread specific structure on the stack● Three states for each node

– L locked→

– UW unparked/spinning waiter→

– PW parked / blocked / scheduled out waiter→

Page 33: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

CST lock phase: blocking/parking

nextnext

p_nextp_next

LL

CST lock instance T1/S1T1/S1

waiting_tailwaiting_tail

parking_tailparking_tail

UWUW

snode_nextsnode_next

socket_listCV

socket_listCV

waiting_tailwaiting_tail

parking_tailparking_tail

UWUW

snode_nextsnode_next

nextnext

p_nextp_next

LLLL

Threads:

nextnext

p_nextp_next

UWUW

T2/S1T2/S1 T3/S1T3/S1

Socket 1

Socket 2

PWPWsocket_list [S1]CV

socket_list [S1]CV

socket_list [S1, S2]CV

socket_list [S1, S2]CV

global_tailglobal_tail

T4/S2T4/S2

nextnext

p_nextp_next

UWUW

T2/S1T2/S1

● Before scheduling out, waiters atomically– Update the status from UW to PW– Add themselves to the parking list

T1 T2 T3

T4

Page 34: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

CST unlock phaseCST lock instance

T1/S1T1/S1

socket_list [S1]socket_list [S1]

global_tailglobal_tailwaiting_tailwaiting_tail

parking_tailparking_tail

UWUW

snode_nextsnode_next

nextnext

p_nextp_next

LLLL

Threads:

nextnext

p_nextp_next

UWUW

T2/S1T2/S1

Socket 1

T1 T2

T3/S1T3/S1

nextnext

p_nextp_next

UWUWT3

Pass the lock to a spinning waiter

Page 35: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

CST unlock phaseCST lock instance

T1/S1T1/S1

socket_list [S1]socket_list [S1]

global_tailglobal_tailwaiting_tailwaiting_tail

parking_tailparking_tail

UWUW

snode_nextsnode_next

nextnext

p_nextp_next

LLLL

Threads:

nextnext

p_nextp_next

UWUW

T2/S1T2/S1

Socket 1

T1 T2

T3/S1T3/S1

nextnext

p_nextp_next

UWUWT3

Pass the lock to a spinning waiter

T2/S1T2/S1

Page 36: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

CST unlock phaseCST lock instance

T1/S1T1/S1

socket_list [S1]socket_list [S1]

global_tailglobal_tailwaiting_tailwaiting_tail

parking_tailparking_tail

UWUW

snode_nextsnode_next

nextnext

p_nextp_next

LLLL

Threads:

nextnext

p_nextp_next

UWUW

T2/S1T2/S1

Socket 1

T1 T2

T3/S1T3/S1

nextnext

p_nextp_next

UWUWT3

socket_list [S1]socket_list [S1]

global_tailglobal_tailwaiting_tailwaiting_tail

parking_tailparking_tail

UWUW

snode_nextsnode_next

nextnext

p_nextp_next

LLLLnextnext

p_nextp_next

UWUW

Socket 1

PWPWT1 T2nextnext

p_nextp_next

UWUWT3

Pass the lock to a spinning waiter

T2/S1T2/S1

Page 37: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

CST unlock phaseCST lock instance

T1/S1T1/S1

socket_list [S1]socket_list [S1]

global_tailglobal_tailwaiting_tailwaiting_tail

parking_tailparking_tail

UWUW

snode_nextsnode_next

nextnext

p_nextp_next

LLLL

Threads:

nextnext

p_nextp_next

UWUW

T2/S1T2/S1

Socket 1

T1 T2

T3/S1T3/S1

nextnext

p_nextp_next

UWUWT3

socket_list [S1]socket_list [S1]

global_tailglobal_tailwaiting_tailwaiting_tail

parking_tailparking_tail

UWUW

snode_nextsnode_next

nextnext

p_nextp_next

LLLLnextnext

p_nextp_next

UWUW

Socket 1

PWPWT1 T2nextnext

p_nextp_next

UWUWT3

socket_list [S1]socket_list [S1]

global_tailglobal_tailwaiting_tailwaiting_tail

parking_tailparking_tail

UWUW

snode_nextsnode_next

LLnextnext

p_nextp_next

UWUW

Socket 1

PWPWT2nextnext

p_nextp_next

LLT3

Pass the lock to a spinning waiter

T2/S1T2/S1

Page 38: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Implementation

● Implemented in the Linux kernel● Structures modified

– File system: inode– Memory management: mmap_sem

● Please see our paper– Read-write lock– Pseudo code

https://github.com/sslab-gatech/cst-locks

Page 39: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Evaluation

● Performance of locks in terms of scalability and memory footprint?

● Blocking/parking strategy effectiveness?

● Setup: 8-socket, 120-core NUMA machine

Page 40: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Case study: PsearchyJo

bs/h

our

● Overcomes memory footprint and scheduling overhead● Uses 1.5–9.1X less memory than the Cohort lock● Improves throughput by 1.4–1.6X

Mem

ory

foot

prin

t (M

B)

0

40

80

120

160

0 20 40 60 80 100 120#thread

Memory utilization

0

40

80

120

160

200

240

0 20 40 60 80 100 120#thread

Vanilla

Cohort

CST

Throughput

Page 41: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Effective parking strategy

● Better performance for both under- and over-subscribed scenario

● Improves scalability by 1.3–3.7X

0

40

80

120

160

200

240

1 2 4 8 16 32 64 128 256#thread

Enumerate a directory (rwsem)

0

0.1

0.2

0.3

0.4

0.5

1 2 4 8 16 32 64 128 256#thread

File creation (mutex)

Vanilla

Cohort

CST

M o

ps /

sec

Page 42: Scalable NUMA-aware Blocking Synchronization …...– Limit the spinning up to a waiter’s time quantum – Pass the lock to an active waiter – Improves scalability by 1.2–4.7X

Conclusion

● Two blocking synchronization primitives– NUMA-aware mutex and read-write semaphore

● Dynamically allocated data structure– Resolve NUMA-aware lock’s footprint issue

● Efficient spin-then-park strategy– Scheduling-aware parking/wake-up strategy– Mitigate scheduler interaction

Thank you!