
Page 1: Practical Concerns for Scalable Synchronization

Practical Concerns for Scalable Synchronization

Jonathan Walpole (PSU)
Paul McKenney (IBM)

Tom Hart (University of Toronto)

Page 2: Practical Concerns for Scalable Synchronization


The problem

“i++” is dangerous if “i” is global

[Diagram: CPU 0 increments the shared variable i by executing load r1,i; inc r1; store r1,i while CPU 1 runs concurrently]

Page 3: Practical Concerns for Scalable Synchronization


The problem

“i++” is dangerous if “i” is global

[Diagram, step 1: CPU 0 and CPU 1 both execute load r1,i, so each CPU's register now holds the same value i]

Page 4: Practical Concerns for Scalable Synchronization


The problem

“i++” is dangerous if “i” is global

[Diagram, step 2: both CPUs execute inc r1; each CPU's register now holds i+1]

Page 5: Practical Concerns for Scalable Synchronization


The problem

“i++” is dangerous if “i” is global

[Diagram, step 3: both CPUs execute store r1,i; i ends up at i+1 rather than i+2, so one increment is lost]
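To make the race concrete, here is a minimal user-space C sketch (not from the slides; the thread and iteration counts are arbitrary) that usually loses updates exactly as the diagrams show:

#include <pthread.h>
#include <stdio.h>

static long i = 0;                     /* shared global, no protection */

static void *worker(void *arg)
{
    for (int n = 0; n < 1000000; n++)
        i++;                           /* load; inc; store -- not atomic */
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, NULL);
    pthread_create(&t1, NULL, worker, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("i = %ld (expected 2000000)\n", i);   /* typically prints less */
    return 0;
}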

Page 6: Practical Concerns for Scalable Synchronization


The solution – critical sections

Classic multiprocessor solution: spinlocks
– CPU 1 waits for CPU 0 to release the lock

Counts are accurate, but locks have overhead!

spin_lock(&mylock);

i++;

spin_unlock(&mylock);
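The slide shows kernel-style pseudocode; a self-contained user-space analogue using POSIX spinlocks (the names mylock, counter_init, and counter_inc are ours) might look like:

#include <pthread.h>

static long i = 0;
static pthread_spinlock_t mylock;

void counter_init(void)
{
    pthread_spin_init(&mylock, PTHREAD_PROCESS_PRIVATE);
}

void counter_inc(void)
{
    pthread_spin_lock(&mylock);    /* CPU 1 spins here while CPU 0 holds the lock */
    i++;                           /* the count is now accurate ...              */
    pthread_spin_unlock(&mylock);  /* ... but every call pays lock overhead      */
}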

Page 7: Practical Concerns for Scalable Synchronization


Critical-section efficiency

[Timeline: Lock Acquisition (Ta) → Critical Section (Tc) → Lock Release (Tr)]

Critical-section efficiency = Tc / (Ta + Tc + Tr)

Ignoring lock contention and cache conflicts in the critical section
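As an illustration (these cycle counts are hypothetical, chosen only to make the arithmetic easy): if lock acquisition costs Ta = 500 cycles, the critical section itself takes Tc = 100 cycles, and lock release costs Tr = 400 cycles, then efficiency = 100 / (500 + 100 + 400) = 10%. Nine-tenths of every pass through the lock is synchronization overhead.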

Page 8: Practical Concerns for Scalable Synchronization


Critical-section efficiency

[Graph: critical-section efficiency vs. critical-section size]

Page 9: Practical Concerns for Scalable Synchronization


Performance of normal instructions

Page 10: Practical Concerns for Scalable Synchronization


Questions

Have synchronization instructions got faster?
– Relative to normal instructions?
– In absolute terms?

What are the implications of this for the performance of operating systems?

Can we fix this problem by adding more CPUs?

Page 11: Practical Concerns for Scalable Synchronization


What’s going on?

Taller memory hierarchies
– Memory speeds have not kept up with CPU speeds
– 1984: no caches needed, since instructions were slower than memory accesses
– 2005: 3-4 level cache hierarchies, since instructions are orders of magnitude faster than memory accesses

Page 12: Practical Concerns for Scalable Synchronization


Why does this matter?

Page 13: Practical Concerns for Scalable Synchronization


Why does this matter?

Synchronization implies sharing data across CPUs
– normal instructions tend to hit in top-level cache
– synchronization operations tend to miss

Synchronization requires a consistent view of data
– between cache and memory
– across multiple CPUs
– requires CPU-CPU communication

Synchronization instructions see memory latency!

Page 14: Practical Concerns for Scalable Synchronization


… but that’s not all!

Longer pipelines
– 1984: Many clock cycles per instruction
– 2005: Many instructions per clock cycle
● 20-stage pipelines

Out-of-order execution
– Keeps the pipelines full
– Must not reorder the critical section before its lock!

Synchronization instructions stall the pipeline!

Page 15: Practical Concerns for Scalable Synchronization


Reordering means weak memory consistency

Memory barriers
– Additional synchronization instructions are needed to manage reordering
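For instance, the classic publish/consume pairing of a write barrier with a read barrier. This sketch (struct and function names are ours) uses C11 release/acquire atomics as a portable stand-in for the kernel's smp_wmb()/smp_rmb():

#include <stdatomic.h>
#include <stddef.h>

struct item { int a, b; };
static _Atomic(struct item *) global_p;

void publish(struct item *p)           /* writer */
{
    p->a = 1;
    p->b = 2;
    /* release = write barrier: the initializing stores above cannot
       be reordered after the pointer store below */
    atomic_store_explicit(&global_p, p, memory_order_release);
}

int consume(void)                      /* reader */
{
    /* acquire = read barrier: the field loads below cannot be
       reordered before the pointer load */
    struct item *p = atomic_load_explicit(&global_p, memory_order_acquire);
    return p ? p->a + p->b : -1;
}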

Page 16: Practical Concerns for Scalable Synchronization


What is the cost of all this?

Instruction                        Cost (1.45 GHz IBM POWER4)   Cost (3.06 GHz Intel Xeon)
Normal instruction                 1.0                          1.0

(Costs are normalized to the cost of a normal instruction.)

Page 17: Practical Concerns for Scalable Synchronization


Atomic increment

Instruction                        Cost (1.45 GHz IBM POWER4)   Cost (3.06 GHz Intel Xeon)
Normal instruction                 1.0                          1.0
Atomic increment                   183.1                        402.3
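The operation being measured is a single atomic read-modify-write, such as the one below; the C11 spelling is a portable stand-in for the kernel's atomic increment primitive (the counter name is ours):

#include <stdatomic.h>

static atomic_long counter;

void count_event(void)
{
    /* one atomic load-inc-store: hundreds of times the cost of a
       normal instruction, per the table above */
    atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}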

Page 18: Practical Concerns for Scalable Synchronization


Memory barriers

Instruction                        Cost (1.45 GHz IBM POWER4)   Cost (3.06 GHz Intel Xeon)
Normal instruction                 1.0                          1.0
Atomic increment                   183.1                        402.3
SMP write memory barrier           328.6                        0.0
Read memory barrier                328.9                        402.3
Write memory barrier               400.9                        0.0

Page 19: Practical Concerns for Scalable Synchronization


Lock acquisition/release with LL/SC

Instruction                        Cost (1.45 GHz IBM POWER4)   Cost (3.06 GHz Intel Xeon)
Normal instruction                 1.0                          1.0
Atomic increment                   183.1                        402.3
SMP write memory barrier           328.6                        0.0
Read memory barrier                328.9                        402.3
Write memory barrier               400.9                        0.0
Local lock round trip              1057.5                       1138.8

Page 20: Practical Concerns for Scalable Synchronization


Compare & swap unknown values (NBS)

Instruction                        Cost (1.45 GHz IBM POWER4)   Cost (3.06 GHz Intel Xeon)
Normal instruction                 1.0                          1.0
Atomic increment                   183.1                        402.3
SMP write memory barrier           328.6                        0.0
Read memory barrier                328.9                        402.3
Write memory barrier               400.9                        0.0
Local lock round trip              1057.5                       1138.8
CAS cache transfer & invalidate    247.1                        847.1

Page 21: Practical Concerns for Scalable Synchronization


Compare & swap known values (spinlocks)

Instruction                        Cost (1.45 GHz IBM POWER4)   Cost (3.06 GHz Intel Xeon)
Normal instruction                 1.0                          1.0
Atomic increment                   183.1                        402.3
SMP write memory barrier           328.6                        0.0
Read memory barrier                328.9                        402.3
Write memory barrier               400.9                        0.0
Local lock round trip              1057.5                       1138.8
CAS cache transfer & invalidate    247.1                        847.1
CAS blind cache transfer           257.1                        993.9

Page 22: Practical Concerns for Scalable Synchronization


The net result?

1984: Lock contention was the main issue

2005: Critical section efficiency is a key issue

Even if the lock is always free when you try to acquire it, performance can still suck!

Page 23: Practical Concerns for Scalable Synchronization


How has this affected OS design?

Multiprocessor OS designers search for “scalable” synchronization strategies
– reader-writer locking instead of global locking
– data locking and partitioning
– per-CPU reader-writer locking
– non-blocking synchronization

The “common case” is read-mostly access to linked lists and hash tables
– asymmetric strategies favouring readers are good

Page 24: Practical Concerns for Scalable Synchronization


Review - Global locking

A symmetric approach (also called “code locking”)
– A critical section of code is guarded by a lock
– Only one thread at a time can hold the lock

Examples include
– Monitors
– Java “synchronized” on a global object
– Linux spin_lock() on a global spinlock_t

What is the problem with global locking?

Page 25: Practical Concerns for Scalable Synchronization


Review - Global locking

A symmetric approach (also called “code locking”)
– A critical section of code is guarded by a lock
– Only one thread at a time can hold the lock

Examples include
– Monitors
– Java “synchronized” on a global object
– Linux spin_lock() on a global spinlock_t

Global locking doesn’t scale due to lock contention!

Page 26: Practical Concerns for Scalable Synchronization


Review - Reader-writer locking

Many readers can concurrently hold the lock

Writers exclude readers and other writers

The result?
– No lock contention in read-mostly scenarios
– So it should scale well, right?

Page 27: Practical Concerns for Scalable Synchronization


Review - Reader-writer locking

Many readers can concurrently hold the lock

Writers exclude readers and other writers

The result?
– No lock contention in read-mostly scenarios
– So it should scale well, right?
– … wrong!
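In user space the same pattern is pthread_rwlock_t; a minimal sketch (table_value and the function names are stand-ins of ours):

#include <pthread.h>

static pthread_rwlock_t tbl_lock = PTHREAD_RWLOCK_INITIALIZER;
static int table_value;

int read_value(void)
{
    pthread_rwlock_rdlock(&tbl_lock);   /* many readers may hold this concurrently */
    int v = table_value;
    pthread_rwlock_unlock(&tbl_lock);
    return v;
}

void write_value(int v)
{
    pthread_rwlock_wrlock(&tbl_lock);   /* excludes readers and other writers */
    table_value = v;
    pthread_rwlock_unlock(&tbl_lock);
}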

Page 28: Practical Concerns for Scalable Synchronization


Scalability of reader/writer locking

[Diagram: CPUs 0 and 1 alternately read-acquire the lock; each acquisition involves a memory barrier plus a transfer of the lock's cache line between CPUs, and these overheads dwarf the critical sections themselves]

Reader/writer locking does not scale due to critical-section efficiency!

Page 29: Practical Concerns for Scalable Synchronization


Review - Data locking

A lock per data item instead of one per collection
– Per-hash-bucket locking for hash tables
– CPUs acquire locks for different hash chains in parallel
– CPUs incur memory-latency and pipeline-flush overheads in parallel

Data locking improves scalability by executing critical section “overhead” in parallel
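A minimal sketch of per-bucket locking for a hash table (all names are ours; error handling omitted):

#include <pthread.h>

#define NBUCKETS 256

struct node { unsigned long key; struct node *next; };

static struct {
    pthread_mutex_t lock;              /* one lock per hash chain, not per table */
    struct node *head;
} table[NBUCKETS];

void table_init(void)
{
    for (int i = 0; i < NBUCKETS; i++)
        pthread_mutex_init(&table[i].lock, NULL);
}

int contains(unsigned long key)
{
    int found = 0;
    pthread_mutex_lock(&table[key % NBUCKETS].lock);
    for (struct node *n = table[key % NBUCKETS].head; n; n = n->next)
        if (n->key == key) { found = 1; break; }
    pthread_mutex_unlock(&table[key % NBUCKETS].lock);
    return found;
}

/* CPUs searching different buckets take different locks, so they pay
   their memory-latency and pipeline-flush costs in parallel */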

Page 30: Practical Concerns for Scalable Synchronization


Review - Per-CPU reader-writer locking

One lock per CPU (called brlock in Linux)
– Readers acquire their own CPU’s lock
– Writers acquire all CPUs’ locks

In read-only workloads CPUs never exchange locks
– no memory latency is incurred

Per-CPU R/W locking improves scalability by removing memory latency from read-lock acquisition for read-mostly scenarios
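A user-space sketch of the brlock idea (the fixed CPU count and all names are ours):

#include <pthread.h>

#define NCPUS 8

static pthread_rwlock_t percpu_lock[NCPUS];   /* one reader-writer lock per CPU */

void brlock_init(void)
{
    for (int i = 0; i < NCPUS; i++)
        pthread_rwlock_init(&percpu_lock[i], NULL);
}

void read_lock(int cpu)       /* readers touch only their own CPU's lock, */
{                             /* so its cache line never moves between CPUs */
    pthread_rwlock_rdlock(&percpu_lock[cpu]);
}

void read_unlock(int cpu)
{
    pthread_rwlock_unlock(&percpu_lock[cpu]);
}

void write_lock_all(void)     /* writers pay the full cost: every CPU's lock */
{
    for (int i = 0; i < NCPUS; i++)
        pthread_rwlock_wrlock(&percpu_lock[i]);
}

void write_unlock_all(void)
{
    for (int i = NCPUS - 1; i >= 0; i--)
        pthread_rwlock_unlock(&percpu_lock[i]);
}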

Page 31: Practical Concerns for Scalable Synchronization


Scalability comparison

Expected scalability on read-mostly workloads
– Global locking – poor due to lock contention
– R/W locking – poor due to critical-section efficiency
– Data locking – better?
– R/W data locking – better still?
– Per-CPU R/W locking – the best we can do?

Page 32: Practical Concerns for Scalable Synchronization


Actual scalability

Scalability of locking strategies using read-only workloads in a hash-table benchmark

Measurements taken on a 4-CPU 700 MHz P-III system

Similar results are obtained on more recent CPUs

Page 33: Practical Concerns for Scalable Synchronization


Scalability on 1.45 GHz POWER4 CPUs

Page 34: Practical Concerns for Scalable Synchronization


Performance at different update fractions on 8 1.45 GHz POWER4 CPUs

Page 35: Practical Concerns for Scalable Synchronization


What are the lessons so far?

Page 36: Practical Concerns for Scalable Synchronization


What are the lessons so far?

Avoid lock contention!

Avoid synchronization instructions!
– … especially in the read path!

Page 37: Practical Concerns for Scalable Synchronization


How about non-blocking synchronization?

Basic idea – copy & flip pointer (no locks!)
– Read a pointer to a data item
– Create a private copy of the item to update in place
– Swap the old item for the new one using an atomic compare & swap (CAS) instruction on its pointer
– CAS fails if current pointer not equal to initial value
– Retry on failure

NBS should enable fast reads … in theory!
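A sketch of copy-and-flip using C11 atomics (the struct, field, and function names are ours; it assumes current was initialized to a valid object). Note the deliberately unsolved question at the end:

#include <stdatomic.h>
#include <stdlib.h>

struct config { int a, b; };
static _Atomic(struct config *) current;

int update_b(int newb)
{
    struct config *oldp, *newp = malloc(sizeof(*newp));
    if (!newp)
        return -1;
    do {
        oldp = atomic_load(&current);
        *newp = *oldp;          /* private copy of the item */
        newp->b = newb;         /* update the copy in place */
        /* CAS swings the pointer only if it still equals oldp;
           otherwise another updater won the race, so retry */
    } while (!atomic_compare_exchange_weak(&current, &oldp, newp));
    /* oldp must NOT be freed here: a reader may still hold it.
       When it may be reclaimed is exactly the problem discussed next. */
    return 0;
}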

Page 38: Practical Concerns for Scalable Synchronization


Problems with NBS in practice

Reusing memory causes problems
– Readers holding references can be hijacked during data structure traversals when memory is reclaimed
– Readers see inconsistent data structures when memory is reused

How and when should memory be reclaimed?

Page 39: Practical Concerns for Scalable Synchronization


Immediate reclamation?

Page 40: Practical Concerns for Scalable Synchronization


Immediate reclamation?

In practice, readers must either
– Use LL/SC to test if pointers have changed, or
– Verify that version numbers associated with data structures have not changed (2 memory barriers)

Synchronization instructions slow NBS readers!

Page 41: Practical Concerns for Scalable Synchronization


Reader-friendly solutions

Never reclaim memory?

Type-stable memory?
– Needs a free pool per data structure type
– Readers can still be hijacked to the free pool
– Exposes OS to denial-of-service attacks

Ideally, defer reclaiming memory until it’s safe!
– Defer reclamation of a data item until references to it are no longer held by any thread

Page 42: Practical Concerns for Scalable Synchronization


How should we defer reclamation?

Wait for a while then delete?
– … but how long should you wait?

Maintain reference counts or per-CPU “hazard pointers” on data that is in use?

Page 43: Practical Concerns for Scalable Synchronization


How should we defer reclamation?

Wait for a while then delete?
– … but how long should you wait?

Maintain reference counts or per-CPU “hazard pointers” on data that is in use?
– Requires synchronization in read path!

Challenge – deferring destruction without using synchronization instructions in the read path

Page 44: Practical Concerns for Scalable Synchronization


Quiescent-state-based reclamation

Coding convention:
– Don’t allow a quiescent state to occur in a read-side critical section

Reclamation strategy:
– Only reclaim data after all CPUs in the system have passed through a quiescent state

Example quiescent states:
– Context switch in non-preemptive kernel
– Yield in preemptive kernel
– Return from system call …
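A toy user-space rendering of this strategy (the thread count and all names are ours; real implementations batch call-backs instead of spinning):

#include <stdatomic.h>

#define NTHREADS 4

static atomic_ulong qs_count[NTHREADS];   /* bumped only at quiescent states */

void quiescent_state(int tid)   /* called outside read-side critical sections,
                                   e.g. at the analogue of a context switch */
{
    atomic_fetch_add_explicit(&qs_count[tid], 1, memory_order_release);
}

void wait_for_grace_period(void)
{
    unsigned long snap[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        snap[i] = atomic_load_explicit(&qs_count[i], memory_order_acquire);
    /* every thread must pass through a quiescent state after the
       snapshot; only then is it safe to reclaim removed data */
    for (int i = 0; i < NTHREADS; i++)
        while (atomic_load_explicit(&qs_count[i], memory_order_acquire) == snap[i])
            ;   /* spin; a real kernel would do useful work instead */
}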

Page 45: Practical Concerns for Scalable Synchronization


Coding conventions for readers

Delineate read-side critical section
– rcu_read_lock() and rcu_read_unlock() primitives
– may compile to nothing on most architectures

Don’t hold references outside critical sections
– Re-traverse data structure to pick up reference

Don’t yield the CPU during critical sections
– Don’t voluntarily yield
– Don’t block, don’t leave the kernel …
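Putting these conventions together, a kernel-style read-side sketch using the Linux RCU list primitives (struct foo and its fields are our invention):

#include <linux/list.h>
#include <linux/rcupdate.h>

struct foo {
    int key, data;
    struct list_head list;
    struct rcu_head rcu;
};
static LIST_HEAD(head);

int lookup(int key, int *result)
{
    struct foo *p;
    int found = 0;

    rcu_read_lock();                         /* delineate the critical section */
    list_for_each_entry_rcu(p, &head, list) {
        if (p->key == key) {
            *result = p->data;               /* copy out; hold no reference */
            found = 1;
            break;
        }
    }
    rcu_read_unlock();                       /* no reference survives this */
    return found;
}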

Page 46: Practical Concerns for Scalable Synchronization


Overview of the basic idea

Writers create (and publish) new versions
– Using locking or NBS to synchronize with each other
– Register call-backs to destroy old versions when safe
● call_rcu() primitive registers a call-back with a reclaimer
– Call-backs are deferred and memory reclaimed in batches

Readers do not use synchronization
– While they hold a reference to a version it will not be destroyed
– Completion of read-side critical sections is “inferred” by the reclaimer from observation of quiescent states
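The matching kernel-style write side, reusing struct foo from the reader sketch above (serialization of concurrent writers is left to the caller):

#include <linux/rcupdate.h>
#include <linux/slab.h>

static void free_foo(struct rcu_head *h)
{
    kfree(container_of(h, struct foo, rcu));
}

void update(struct foo *oldp, int newdata)   /* caller holds the update-side lock */
{
    struct foo *newp = kmalloc(sizeof(*newp), GFP_KERNEL);

    if (!newp)
        return;
    *newp = *oldp;                               /* create a new version */
    newp->data = newdata;
    list_replace_rcu(&oldp->list, &newp->list);  /* publish it atomically */
    call_rcu(&oldp->rcu, free_foo);              /* destroy the old version after a
                                                    grace period, in a batch */
}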

Page 47: Practical Concerns for Scalable Synchronization

47www.cs.pdx.edu/~walpoleJonathan WalpoleCS533 Winter 2006

Overview of RCU API

Writer: rcu_assign_pointer(), synchronize_rcu(), call_rcu()
Reader: rcu_read_lock(), rcu_dereference()
Reclaimer: runs the call-backs registered via call_rcu()

[The API separates memory consistency of mutable pointers from the collection of versions of immutable objects]

Page 48: Practical Concerns for Scalable Synchronization


Context switch as a quiescent state

[Timeline: CPUs 0 and 1 run RCU read-side critical sections separated by context switches while an element is removed. A reader running when the element is removed may still hold a reference to it. Once a CPU passes through a context switch it can't hold a reference to the old version, though in some intervals RCU can't yet tell; after each CPU's next context switch it is certain that no reference to the old version is held]

Page 49: Practical Concerns for Scalable Synchronization


Grace periods

[Timeline: CPUs 0 and 1 run RCU read-side critical sections separated by context switches while an element is deleted. The grace period following the deletion ends only after every CPU has passed through a context switch (its quiescent state); the diagram shows several possible grace periods, each ending once both CPUs have context-switched]

Page 50: Practical Concerns for Scalable Synchronization


Quiescent states and grace periods

Example quiescent states
– Context switch (non-preemptive kernels)
– Voluntary context switch (preemptive kernels)
– Kernel entry/exit
– Blocking call

Grace periods
– A period during which every CPU has gone through a quiescent state

Page 51: Practical Concerns for Scalable Synchronization


Efficient implementation

Choosing good quiescent states
– They should occur anyway
– They should be easy to count
– Not too frequent or infrequent

Recording and dispatching call-backs
– Minimize inter-CPU communication
– Maintain per-CPU queues of call-backs
– Two queues – waiting for grace period start and end

Page 52: Practical Concerns for Scalable Synchronization


RCU's data structures

[Diagram: call_rcu() appends call-backs to a per-CPU 'next' queue; a per-CPU 'current' queue holds the call-backs waiting on the current grace period, tagged with its grace-period number. Globally there are a grace-period number and a CPU bitmask; each CPU keeps a quiescent-state counter and a counter snapshot. Call-backs advance from 'next' to 'current' at the end of the previous grace period (if any) and run at the end of the current grace period]

Page 53: Practical Concerns for Scalable Synchronization


RCU implementations

DYNIX/ptx RCU (data center)

Linux
– Multiple implementations (in 2.5 and 2.6 kernels)
– Preemptible and non-preemptible

Tornado/K42 “generations”
– Preemptive kernel
– Helped generalize usage

Page 54: Practical Concerns for Scalable Synchronization


Experimental results

How do different combinations of RCU, SMR (safe memory reclamation, i.e. hazard pointers), NBS, and locking compare?

Hash table mini-benchmark running on 1.45 GHz POWER4 system with 8 CPUs

Various workloads
– Read/update fraction
– Hash table size
– Memory constraints
– Number of CPUs

Page 55: Practical Concerns for Scalable Synchronization


Scalability with working set in cache

Page 56: Practical Concerns for Scalable Synchronization


Scalability with large working set

Page 57: Practical Concerns for Scalable Synchronization


Performance at different update fractions (8 CPUs)

Page 58: Practical Concerns for Scalable Synchronization


Performance at different update fractions (2 CPUs)

Page 59: Practical Concerns for Scalable Synchronization


Performance in read-mostly scenarios

Page 60: Practical Concerns for Scalable Synchronization


Impact of memory constraints

Page 61: Practical Concerns for Scalable Synchronization


Performance and complexity

When should RCU be used?
– Instead of simple spinlock?
– Instead of per-CPU reader-writer lock?

Under what environmental conditions?
– Memory-latency ratio
– Number of CPUs

Under what workloads?
– Fraction of accesses that are updates
– Number of updates per grace period

Page 62: Practical Concerns for Scalable Synchronization


Analytic results

Compute breakeven update-fraction contours for RCU vs. locking performance, against:
– Number of CPUs (n)
– Updates per grace period (λ)
– Memory-latency ratio (r)

Look at computed memory-latency ratio at extreme values of λ for n = 4 CPUs

Page 63: Practical Concerns for Scalable Synchronization


Breakevens for RCU worst case (f vs. r for small λ)

Page 64: Practical Concerns for Scalable Synchronization


Breakeven for RCU best case (f vs. r, large λ)

Page 65: Practical Concerns for Scalable Synchronization


Real-world performance and complexity

SysV IPC
– >10x on microbenchmark (8 CPUs)
– 5% for database benchmark (2 CPUs)
– 151 net lines added to the kernel

Directory-Entry Cache
– +20% in multiuser benchmark (16 CPUs)
– +12% on SPECweb99 (8 CPUs)
– -10% time required to build kernel (16 CPUs)
– 126 net lines added to the kernel

Page 66: Practical Concerns for Scalable Synchronization


Real-world performance and complexity

Task List
– +10% in multiuser benchmark (16 CPUs)
– 6 net lines added to the kernel
● 13 added
● 7 deleted

Page 67: Practical Concerns for Scalable Synchronization


Summary and Conclusions (1)

RCU can provide order-of-magnitude speedups for read-mostly data structures
– RCU optimal when less than 10% of accesses are updates over wide range of CPUs
– RCU projected to remain useful in future CPU architectures

In Linux 2.6 kernel, RCU provided excellent performance with little added complexity
– Currently over 1000 uses of RCU API in Linux kernel

Page 68: Practical Concerns for Scalable Synchronization


Summary and Conclusions (2)

RCU introduces a new model and API for synchronization
– There is additional complexity
– Visual inspection of kernel code has uncovered some subtle bugs in use of RCU API primitives
– Tools to ensure correct use of API primitives are needed

Page 69: Practical Concerns for Scalable Synchronization


A thought

“Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it!”

– Brian Kernighan

Page 70: Practical Concerns for Scalable Synchronization


Use the right tool for the job!!!