Concurrency Testing: Challenges, Algorithms, and Tools
Madan Musuvathi, Microsoft Research
A concurrent program should
Function correctly
Maximize throughput: finish as many tasks as possible
Minimize latency: respond to requests as soon as possible
While handling nondeterminism in the environment
Concurrency is HARD
Concurrency is Pervasive
Concurrency is an age-old problem of computer science
Most programs are concurrent
At least the ones that you expect to get paid for, anyway
Solving the Concurrency Problem
We need:
Better programming abstractions
Better analysis and verification techniques
Better testing methodologies
Weakest Link
Testing is more important than you think
My first-ever computer program:
Wrote it in Basic
Not the world’s best programming language
With no idea about program correctness
I didn’t know first-order logic, loop invariants, … I hadn’t heard about Hoare, Dijkstra, …
But still managed to write correct programs, using the write, test, [debug, write, test]+ cycle
How many of you have …written a program > 10,000 lines?
written a program, compiled it, called it done without testing the program on a single input?
written a program, compiled it, called it done without testing the program on few interesting inputs?
Imagine a world where you can’t pick the inputs during testing …
You write the program
Check its correctness by staring at it
Give the program to the computer
The computer tests on inputs of its choice:
factorial(5) = 120
factorial(5) = 120 the next 100 times
factorial(7) = 5040
The computer runs this program again and again on these inputs for a week
The program didn’t crash and therefore it is correct
int factorial(int x) {
  int ret = 1;
  while (x > 1) {
    ret *= x;
    x--;
  }
  return ret;
}
Parent_thread() {
  if (p != null) {
    p = new P();
    Set(initEvent);
  }
}

Child_thread() {
  if (p != null) {
    Wait(initEvent);
  }
}
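The snippet above is the kind of bug that input testing cannot catch: whether the child sees an initialized p depends entirely on the schedule. A minimal sketch (in Python, not the slide's environment, with the slide's semantics simplified) that simulates the two coarse interleavings by hand:

```python
# A minimal sketch (not the CHESS harness): simulate the two coarse
# interleavings of the parent/child snippet by hand. The slide's exact
# semantics are simplified; the point is that correctness depends on the
# schedule, which input-based testing never controls.

def run(parent_first):
    state = {"p": None, "initEvent": False}

    def parent():
        state["p"] = object()          # p = new P();
        state["initEvent"] = True      # Set(initEvent);

    def child():
        if state["p"] is not None:     # if (p != null)
            assert state["initEvent"]  # Wait(initEvent) returns: event is set
            return True                # child saw an initialized p
        return False                   # child raced ahead: p not yet initialized

    if parent_first:
        parent()
        return child()
    ok = child()
    parent()
    return ok

print(run(parent_first=True))   # True: parent initialized p first
print(run(parent_first=False))  # False: child skipped the wait
```

No choice of test inputs distinguishes the two outcomes; only controlling the schedule does.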
This is the world of concurrency testing
You write the program
Check its correctness by staring at it
Give the program to the computer
The computer generates some interleavings
The computer runs this program again and again on these interleavings
The program didn’t crash and therefore it is correct
How do we test concurrent software today?
Demo
CHESS Proposition
Capture and expose nondeterminism to a scheduler
Threads can run at different speeds
Asynchronous tasks can start at an arbitrary time in the future
Hardware/compiler can reorder instructions
Explore the nondeterminism using several algorithms
Tackle the astronomically large number of interleavings
Remember: any algorithm is better than no control at all
CHESS in a nutshell
CHESS is a user-mode scheduler
Controls all scheduling nondeterminism
Replaces the OS scheduler
Guarantees:
Every program run takes a different thread interleaving
Reproduce the interleaving for every run
Download CHESS source from http://chesstool.codeplex.com/
CHESS architecture
[Architecture diagram: the CHESS exploration engine drives the CHESS scheduler, which sits between the program and the platform: Win32 wrappers for an unmanaged program on Windows, and .NET wrappers for a managed program on the CLR.]
• Every run takes a different interleaving
• Reproduce the interleaving for every run
Running Example
Thread 1:
Lock (l);
bal += x;
Unlock (l);

Thread 2:
Lock (l);
t = bal;
Unlock (l);

Lock (l);
bal = t - y;
Unlock (l);
Introduce Schedule() points
Thread 1:
Schedule(); Lock (l);
bal += x;
Schedule(); Unlock (l);

Thread 2:
Schedule(); Lock (l);
t = bal;
Schedule(); Unlock (l);

Schedule(); Lock (l);
bal = t - y;
Schedule(); Unlock (l);
Instrument calls to the CHESS scheduler
Each call is a potential preemption point
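The effect of these Schedule() points can be sketched with coroutines: if every thread yields control at its schedule points, a test driver can pick the next thread deterministically and replay exactly the same interleaving later. This is an illustration of the idea, not the CHESS implementation; note that withdraw deliberately spans two critical sections, as in the slides.

```python
# Sketch of a user-mode scheduler: threads are Python generators, and
# `yield` plays the role of Schedule(). A fixed schedule (list of thread
# ids) fully determines the interleaving, so every run is reproducible.

def deposit(state, x):
    yield                      # Schedule(); Lock(l);
    state["bal"] += x          # bal += x;
    yield                      # Schedule(); Unlock(l);

def withdraw(state, y):
    yield                      # Schedule(); Lock(l);
    t = state["bal"]           # t = bal;
    yield                      # Schedule(); Unlock(l); ... Lock(l);
    state["bal"] = t - y       # bal = t - y;  (second critical section)
    yield

def run(schedule, x=10, y=5):
    """Run the two threads under a fixed schedule of thread ids."""
    state = {"bal": 0}
    threads = {1: deposit(state, x), 2: withdraw(state, y)}
    for tid in schedule:
        try:
            next(threads[tid])     # resume tid until its next Schedule()
        except StopIteration:
            pass
    for g in threads.values():     # drain any remaining steps
        for _ in g:
            pass
    return state["bal"]

print(run([1, 1, 2, 2, 2]))  # deposit completes before withdraw: 5
print(run([2, 2, 1, 1, 2]))  # withdraw reads bal=0 before the deposit: -5
```

The second schedule exposes the atomicity bug across withdraw's two critical sections, and rerunning it reproduces the same result every time.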
First-cut solution: Random sleeps
Introduce random sleep at schedule points
Does not introduce new behaviors
Sleep models a possible preemption at each location
Sleeping for a finite amount guarantees starvation-freedom
Thread 1:
Sleep(rand()); Lock (l);
bal += x;
Sleep(rand()); Unlock (l);

Thread 2:
Sleep(rand()); Lock (l);
t = bal;
Sleep(rand()); Unlock (l);

Sleep(rand()); Lock (l);
bal = t - y;
Sleep(rand()); Unlock (l);
Improvement 1: Capture the “happens-before” graph
Thread 1:
Schedule(); Lock (l);
bal += x;
Schedule(); Unlock (l);

Thread 2:
Schedule(); Lock (l);
t = bal;
Schedule(); Unlock (l);

Schedule(); Lock (l);
bal = t - y;
Schedule(); Unlock (l);

Delays that result in the same “happens-before” graph are equivalent
Avoid exploring equivalent interleavings
[Diagram: delaying Thread 2 with Sleep(5) before each of its critical sections yields the same happens-before graph, and hence an equivalent interleaving.]
Improvement 2: Understand synchronization semantics
Avoid exploring delays that are impossible
Identify when threads can make progress
CHESS maintains a run queue and a wait queue
Mimics OS scheduler state
[Diagram: while Thread 1 is inside Lock (l) … Unlock (l), Thread 2’s Lock (l) cannot proceed, so CHESS does not schedule Thread 2 until the lock is released.]
CHESS modes: speed vs coverage
Fast mode:
Introduce schedule points before synchronizations, volatile accesses, and interlocked operations
Finds many bugs in practice
Data-race mode:
Introduce schedule points before memory accesses
Finds race conditions due to data races
Captures all sequentially consistent (SC) executions
CHESS Design Choices
Soundness:
Any bug found by CHESS should be possible in the field
Should not introduce false errors (both safety and liveness)
Completeness:
Any bug found in the field should be found by CHESS
In theory, we need to capture all sources of nondeterminism
In practice, we need to effectively explore the astronomically large state space
Capture all sources of nondeterminism? No.
Scheduling nondeterminism? Yes
Timing nondeterminism? Yes
Controls when and in what order the timers fire
Nondeterministic system calls? Mostly
CHESS uses precise abstractions for many system calls
Input nondeterminism? No
Rely on users to provide inputs
Program inputs, return values of system calls, files read, packets received, …
Good tradeoff in the short term
But can’t find race conditions in error-handling code
Capture all sources of nondeterminism? No.
Hardware relaxations? Yes
Hardware can reorder instructions
Non-SC executions are possible in programs with data races
Sober [CAV ’08] can detect and explore such non-SC executions
Compiler relaxations? No
Very few people understand what compilers can do to programs with data races
Far fewer than those who understand the general theory of relativity
Schedule Exploration Algorithms
Two kinds
Two kinds:
Reduction algorithms
Explore one out of a large number of equivalent interleavings
Prioritization algorithms
Pick “interesting” interleavings before you run out of resources
Remember: anything is better than nothing
Reduction Algorithms
Schedule Exploration Algorithms
Enumerating Thread Interleavings Using Depth-First Search
Thread 1:
x = 1;
y = 1;

Thread 2:
x = 2;
y = 2;

[Diagram: the depth-first search tree over the six interleavings, starting from state (x, y) = (0, 0) and reaching the final states (2, 2), (2, 1), (1, 1), and (1, 2).]

Explore (State s) {
  T = set of threads in s;
  foreach t in T {
    s’ = schedule t in s;
    Explore(s’);
  }
}
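The Explore() pseudocode above, sketched in Python for this running example: a naive depth-first search enumerates every interleaving of the threads' remaining steps.

```python
# The Explore() pseudocode, in Python, for the running example
# (Thread 1: x = 1; y = 1 and Thread 2: x = 2; y = 2). A naive DFS
# enumerates every interleaving of the threads' remaining steps.

def explore(state, threads, finals, runs):
    live = [t for t in threads if threads[t]]
    if not live:                        # no thread has steps left
        finals.add((state["x"], state["y"]))
        runs[0] += 1
        return
    for t in live:                      # foreach enabled thread t
        var, val = threads[t][0]
        saved = state[var]
        state[var] = val                # s' = schedule t in s
        rest = {k: (v[1:] if k == t else v) for k, v in threads.items()}
        explore(state, rest, finals, runs)
        state[var] = saved              # backtrack

threads = {1: [("x", 1), ("y", 1)], 2: [("x", 2), ("y", 2)]}
finals, runs = set(), [0]
explore({"x": 0, "y": 0}, threads, finals, runs)
print(runs[0])         # 6 interleavings (4 choose 2)
print(sorted(finals))  # [(1, 1), (1, 2), (2, 1), (2, 2)]
```

Six interleavings already collapse to four distinct final states, which is exactly the redundancy the reduction algorithms below exploit.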
Behaviorally equivalent interleavings
Reach the same final state (x = 1, y = 3)
[Diagram: two interleavings of Thread 1 (x = 1; y = 2;) and Thread 2 (if (x == 1) { y = 3; }), both reaching (x = 1, y = 3) and marked equivalent.]
Behaviorally inequivalent interleavings
Reach different final states (1, 3) vs (1,2)
[Diagram: two interleavings of the same threads reaching different final states: (1, 3) when Thread 2’s test of x runs after Thread 1’s writes, and (1, 2) when it runs before them. The interleavings are inequivalent.]
Behaviorally inequivalent interleavings
Don’t necessarily have to reach different states
[Diagram: two interleavings that happen to reach the same final state but follow different code paths; behaviorally they are still inequivalent.]
Execution Equivalence
Two executions are equivalent if they can be obtained by commuting independent operations
[Diagram: the execution x = 1; r1 = y; r2 = y; r3 = x transformed step by step into equivalent executions by commuting adjacent independent operations.]
Formalism
An execution is a sequence of transitions
Each transition is of the form <tid, var, op>
tid: the thread performing the transition
var: the memory location accessed in the transition
op: READ | WRITE | READWRITE
Two steps are independent if
they are executed by different threads, and
either they access different variables or both READ the same variable
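The independence relation can be written directly as a predicate over <tid, var, op> transitions; two adjacent independent transitions can be commuted without changing the execution's behavior.

```python
# The independence relation from the formalism above, as a predicate over
# transitions <tid, var, op>.

def independent(t1, t2):
    (tid1, var1, op1), (tid2, var2, op2) = t1, t2
    if tid1 == tid2:
        return False            # same thread: program order matters
    if var1 != var2:
        return True             # different locations never conflict
    return op1 == "READ" and op2 == "READ"   # two reads of one var commute

print(independent((1, "x", "WRITE"), (2, "y", "READ")))  # True
print(independent((1, "x", "WRITE"), (2, "x", "READ")))  # False
print(independent((1, "x", "READ"), (2, "x", "READ")))   # True
```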
Equivalence makes the schedule space a Directed Acyclic Graph
Thread 1:
x = 1;
y = 1;

Thread 2:
x = 2;
y = 2;

[Diagram: the same interleaving tree, with equivalent interleavings merged so that the schedule space forms a DAG over the states (0, 0) through (2, 2).]
HashTable visited;
Explore (Sequence s) {
  T = set of threads enabled in s;
  foreach t in T {
    s’ = s . <t,v,o>;
    if (s’ in visited) continue;
    visited.Add(s’);
    Explore(s’);
  }
}
DFS in a DAG (CS 101)
Explore (Sequence s) {
  T = set of threads enabled in s;
  foreach t in T {
    s’ = s . <t,v,o>;
    Explore(s’);
  }
}
HashTable visited;
Explore (Sequence s) {
  T = set of threads enabled in s;
  foreach t in T {
    s’ = s . <t,v,o>;
    s” = canon(s’);
    if (s” in visited) continue;
    visited.Add(s”);
    Explore(s’);
  }
}
The sleep sets algorithm explores a DAG without maintaining the table
Sleep Set Algorithm
Thread 1:
x = 1;
y = 1;

Thread 2:
x = 2;
y = 2;

[Diagram: the interleaving tree with sleep sets pruning the transitions that lead to already-visited states.]
Identify transitions that take you to visited states
Sleep Set Algorithm
Explore (Sequence s, sleep C) {
  T = set of transitions enabled in s;
  T’ = T – C;
  foreach t in T’ {
    C = C + t;
    s’ = s . t;
    C’ = C – {transitions dependent on t};
    Explore(s’, C’);
  }
}
Summary
Sleep sets ensure that a stateless execution does not explode a DAG into a tree
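A sketch of both searches on a two-thread example, using the independence relation from the formalism. With sleep sets the DFS completes one execution per equivalence class instead of one per interleaving; the particular thread bodies (each thread writes x then y) are an assumption chosen to keep the counts small.

```python
# Plain DFS versus DFS with sleep sets, with dependence judged on each
# thread's next transition. Sleep sets complete one execution per
# equivalence class (Mazurkiewicz trace) instead of one per interleaving.

def independent(s1, s2):
    (t1, v1, _), (t2, v2, _) = s1, s2
    return t1 != t2 and v1 != v2     # writes only: same var = dependent

def count_executions(threads, use_sleep):
    complete = [0]
    def explore(pos, sleep):
        enabled = [t for t in threads if pos[t] < len(threads[t])]
        if not enabled:
            complete[0] += 1         # reached a full interleaving
            return
        explored = []
        for t in enabled:
            if t in sleep:
                continue             # pruned: leads to an explored trace
            step = threads[t][pos[t]]
            # child's sleep set: earlier siblings and inherited entries
            # that are independent of the step we are about to take
            child = ({u for u in sleep | set(explored)
                      if independent(threads[u][pos[u]], step)}
                     if use_sleep else set())
            nxt = dict(pos); nxt[t] += 1
            explore(nxt, child)
            explored.append(t)
    explore({t: 0 for t in threads}, set())
    return complete[0]

threads = {1: [(1, "x", "W"), (1, "y", "W")],
           2: [(2, "x", "W"), (2, "y", "W")]}
print(count_executions(threads, use_sleep=False))  # 6 interleavings
print(count_executions(threads, use_sleep=True))   # 4 equivalence classes
```

No visited table is needed: the sleep set alone prevents re-exploring the merged parts of the DAG.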
Persistent Set Reduction
Thread 1:
x = 1;
x = 2;

Thread 2:
y = 1;
y = 2;
With Sleep Sets
Thread 1:
x = 1;
x = 2;

Thread 2:
y = 1;
y = 2;
With Persistent Sets
Assumption: we are only interested in the reachability of final states (for instance, no global assertions)
Thread 1:
x = 1;
x = 2;

Thread 2:
y = 1;
y = 2;
Persistent Sets
A set of transitions P is persistent in a state s if:
in the state space X reachable from s by exploring only transitions not in P,
every transition in X is independent with P
(P “persists” in X)
It is sound to explore only P from s
[Diagram: state s with its persistent set P, and the reachable space X whose transitions all commute with P.]
With Persistent Sets
Thread 1:
x = 1;
x = 2;

Thread 2:
y = 1;
y = 2;
Dynamic Partial-Order Reduction Algorithm [Flanagan & Godefroid]
Identifies persistent sets dynamically
After executing a transition, insert a schedule point before the most recent conflict

Thread 1:
y = 1;
x = 1;

Thread 2:
x = 2;
z = 3;

[Diagram: after executing y = 1; x = 1; x = 2; z = 3, the conflict between x = 1 and x = 2 causes DPOR to insert a schedule point before x = 1, yielding the alternative interleaving in which x = 2 runs before x = 1.]
Prioritization Algorithms
Schedule Exploration Algorithms
Schedule Prioritization
Preemption bounding:
Few preemptions are sufficient for finding lots of bugs
Preemption sealing:
Insert preemptions where you think bugs are
Random:
If you don’t have additional information about the state space, random is the best
Still do partial-order reduction
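Preemption bounding can be sketched as a filter on the schedule enumeration: a context switch counts as a preemption only if the running thread could have continued. Counting schedules for two 2-step threads shows how quickly the bound shrinks the space (the counting harness itself is an assumption of this sketch).

```python
# Preemption-bounded schedule enumeration, sketched. A switch away from a
# thread that still has steps left is a preemption; switching when the
# current thread has finished is free.

def schedules(k, bound):
    """Count interleavings of two k-step threads with <= bound preemptions."""
    results = []
    def go(done1, done2, current, preemptions, trace):
        if done1 == k and done2 == k:
            results.append(tuple(trace))
            return
        for nxt in (1, 2):
            left = k - (done1 if nxt == 1 else done2)
            if left == 0:
                continue                     # nxt has no steps remaining
            cost = 0
            if current is not None and nxt != current:
                still_runnable = (k - (done1 if current == 1 else done2)) > 0
                cost = 1 if still_runnable else 0   # forced switch = preemption
            if preemptions + cost > bound:
                continue                     # over budget: prune
            go(done1 + (nxt == 1), done2 + (nxt == 2),
               nxt, preemptions + cost, trace + [nxt])
    go(0, 0, None, 0, [])
    return len(results)

print(schedules(2, bound=10))  # all C(4,2) = 6 interleavings
print(schedules(2, bound=0))   # 2: non-preemptive, one thread then the other
print(schedules(2, bound=1))   # 4 schedules with at most one preemption
```

The gap widens dramatically with more steps and threads, which is why a small preemption bound makes deep state spaces tractable.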
Concurrency Correctness Criterion
CHESS checks for various correctness criteria:
Assertion failures
Deadlocks
Livelocks
Data races
Atomicity violations
(Deterministic) linearizability violations
Linearizability Checking in CHESS
Concurrency Correctness Criterion
Motivation
Writing good test oracles is hard

Thread 1: Bank.Add($20)
Thread 2: Bank.Withdraw($20)
Assert(Bank.Balance() == ?)
Motivation
Writing good test oracles is hard
Is this a correct assertion to check for?
Now what if there are 5 threads, each performing 5 queue operations?

Thread 1:
q.AddFirst(10)
q.AddLast(20)

Thread 2:
q.RemoveLast()
q.RemoveFirst()

Assert(q.IsEmpty())
We want to magically
Check if a Bank behaves like a Bank should
Check if a queue behaves like a queue
Answer: check for linearizability
Linearizability
The correctness notion closest to “thread safety”
A concurrent component behaves as if it is protected by a single global lock
Each operation appears to take effect instantaneously at some point between the call and return
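This definition suggests a brute-force check, sketched below: enumerate the permutations of completed operations that respect real-time order (A before B if A returned before B was invoked) and replay each against a sequential FIFO queue. The history encoding and the timestamps are assumptions made for illustration.

```python
# Brute-force linearizability check against a sequential FIFO queue spec.
# A history is linearizable iff some permutation of its operations that
# respects real-time precedence reproduces every recorded return value.

from itertools import permutations

def replay(ops):
    """Replay (op, arg) pairs against a sequential FIFO queue."""
    q, out = [], []
    for op, arg in ops:
        if op == "add":
            q.append(arg); out.append(None)
        else:  # "trytake"
            out.append(q.pop(0) if q else "empty")
    return out

def linearizable(history):
    """history: list of (start, end, (op, arg), result) tuples."""
    n = len(history)
    for perm in permutations(range(n)):
        # a must precede b whenever a returned before b was invoked
        if any(history[a][1] < history[b][0] and perm.index(a) > perm.index(b)
               for a in range(n) for b in range(n)):
            continue
        if replay([history[i][2] for i in perm]) == [history[i][3] for i in perm]:
            return True
    return False

# Add(10) completes before Add(20), and a concurrent TryTake returns 10,
# so Add(10) is ordered first; the later TryTake must then see 20:
bad = [(0, 1, ("add", 10), None),
       (2, 3, ("add", 20), None),
       (2, 3, ("trytake", None), 10),
       (4, 5, ("trytake", None), "empty")]
good = [(0, 1, ("add", 10), None),
        (2, 3, ("add", 20), None),
        (2, 3, ("trytake", None), 10),
        (4, 5, ("trytake", None), 20)]
print(linearizable(bad))   # False
print(linearizable(good))  # True
```

Enumerating all permutations is exponential; it is fine for the tiny histories a test produces, which is the setting Line-Up works in.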
The Problem with Linearizability Checking
Need a sequential specification
Imagine writing a sequential specification for your operating system
Instead, check if a component is linearizable with respect to some deterministic specification
This can be done automatically
Generate the sequential specification by “inserting a global lock”
Line-Up: Two-Phase Method
For a given test:
First, generate the sequential specification
Enumerate serial executions of the test
Record all observed histories
Assume the generated histories are the intended behaviors of the component
Second, check linearizability with respect to the generated specification
Enumerate fully concurrent executions
Test each history for compatibility with the serial executions
Line-Up on the Bank Example
Serial executions imply that the final balance can be 20 or 0
Concurrent executions should satisfy the assertion
Thread 1: Bank.Add($20)
Thread 2: Bank.Withdraw($20)
Assert(Bank.Balance() == 20 || Bank.Balance() == 0)
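The two phases can be sketched on a toy bank whose Withdraw is deliberately non-atomic (separate read and write steps) and unconditional, which differs from the slide's Bank; under these assumptions every serial order ends at balance 0, and phase 2 catches the interleaving that no serial execution can produce.

```python
# Line-Up's two phases, sketched. Phase 1 runs the operations serially to
# learn the intended outcomes; phase 2 enumerates interleavings of the
# step-level implementation and flags any outcome outside the serial set.

from itertools import permutations

def add_steps(n):
    return [lambda s: s.__setitem__("bal", s["bal"] + n)]

def withdraw_steps(n):     # non-atomic: read, then write (no balance check)
    return [lambda s: s.__setitem__("t", s["bal"]),
            lambda s: s.__setitem__("bal", s["t"] - n)]

def run(ops, order):
    s = {"bal": 0, "t": 0}
    pos = [0] * len(ops)
    for tid in order:
        ops[tid][pos[tid]](s)
        pos[tid] += 1
    return s["bal"]

def interleavings(lengths, pos=None):
    pos = pos or [0] * len(lengths)
    if pos == lengths:
        yield []
        return
    for i, (p, ln) in enumerate(zip(pos, lengths)):
        if p < ln:
            nxt = list(pos); nxt[i] += 1
            for rest in interleavings(lengths, nxt):
                yield [i] + rest

ops = [add_steps(20), withdraw_steps(20)]

# Phase 1: serial executions define the specification
serial = set()
for perm in permutations(range(len(ops))):
    serial.add(run(ops, [i for i in perm for _ in ops[i]]))

# Phase 2: concurrent executions checked against the serial outcomes
concurrent = {run(ops, order)
              for order in interleavings([len(o) for o in ops])}
print(sorted(serial))               # [0]
print(sorted(concurrent - serial))  # [-20]: the lost-update violation
```

The violation is reported without any hand-written oracle: the serial executions of the same test are the oracle.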
Line-Up Guarantees
Full completeness:
If Line-Up reports a violation, the implementation is not linearizable with respect to any deterministic specification.
Restricted soundness:
If the implementation is not linearizable with respect to any deterministic specification, there exists a test on which Line-Up will report a violation.
Linearizability Violations
Non-linearizable histories can reveal implementation errors (e.g., incorrect synchronization)
The non-linearizable behavior below was caused by a bug in .NET 4.0 (accidental lock timeout).

Thread 1: Add 200 → return; TryTake → return 200
Thread 2: Add 200 → return; TryTake → return empty
Generalizing Linearizability
Some operations may block, e.g. semaphore.acquire()
Blocking can be “good” (expected behavior) or “bad” (a bug)
The original definition of linearizability does not make this distinction: blocking is always OK
We generalized the definition to be able to catch “bad blocking”
A buggy counter implementation:

class Counter {
  int count = 0;
  bool b = false;
  Lock lock = new Lock();

  void inc() {
    b = true;
    lock.acquire();
    count = count + 1;
    lock.release();
    b = false;
  }

  int get() {
    lock.acquire();
    t = count;
    if (!b) lock.release();
    return t;
  }
}
Stuck history:
inc (call)
inc (return)
get (call)
get → 1
inc (call) … never returns
(If the second inc sets b = true before get tests it, get returns without releasing the lock, and the inc blocks forever.)
Results
Each letter is a separate root cause
Questions
(A) Incorrect use of CAS causes state corruption.
(B) RemoveLast() uses an incorrect lock-free optimization.
(C) Call to SemaphoreSlim includes a timeout parameter by mistake.
(D) ToArray() can livelock when crossing segment boundaries. Note that the harness for this class performs a particular pre-test sequence (add 31 elements, remove 31 elements).
(E) Insufficient locking: a thread can get preempted while trying to set an exception.
(F) Barrier is not a linearizable data type. Barriers block each thread until all threads have entered the barrier, a behavior that is not equivalent to any serial execution.
(G) Cancel is not a linearizable method: the effect of the cancellation can be delayed past the operation’s return, and in fact even past subsequent operations on the same thread.
(H) Count() may release a lock it does not own if interleaved with Add().
(I) Bag is nondeterministic by design to improve performance: the returned value can depend on the specific interleaving.
(J) Count may return 0 even if the collection is not empty. The specification of the Count method was weakened after Line-Up detected this behavior.
(K) TryTake may fail even if the collection is not empty. The specification of the TryTake method was weakened after Line-Up detected this behavior.
(L) SetResult() throws the wrong exception if the task is already reserved for completion by somebody else, but not completed yet.
Results: Phase 1 / Phase 2
Outline
Preemption bounding
Makes CHESS effective on deep state spaces
Fair stateless model checking
Sober
FeatherLite
Concurrency Explorer
Outline
Preemption bounding
Makes CHESS effective on deep state spaces
Fair stateless model checking
Makes CHESS effective on cyclic state spaces
Enables CHESS to find liveness violations (livelocks)
Sober
FeatherLite
Concurrency Explorer
Concurrent programs have cyclic state spaces
Spinlocks
Non-blocking algorithms
Implementations of synchronization primitives
Periodic timers
…

Thread 1:
L1: while (!done) {
L2:   Sleep();
}

Thread 2:
M1: done = 1;

[Diagram: the four states (!done, L1), (!done, L2), (done, L1), (done, L2); the two !done states form a cycle until Thread 2 runs.]
A demonic scheduler unrolls any cycle ad infinitum

Thread 1:
while (!done) {
  Sleep();
}

Thread 2:
done = 1;

[Diagram: the !done cycle unrolled without bound, with the done transition available at every step.]
Depth bounding
Prune executions beyond a bounded number of steps
[Diagram: the unrolled !done cycle truncated at the depth bound.]
Problem 1: Ineffective state coverage
The bound has to be large enough to reach the deepest bug
Typically, greater than 100 synchronization operations
Every unrolling of a cycle redundantly explores the reachable state space
[Diagram: each unrolling of the !done cycle re-explores the same states up to the depth bound.]
Problem 2: Cannot find livelocks
Livelocks: lack of progress in a program

Thread 1:
temp = done;
while (!temp) {
  Sleep();
}

Thread 2:
done = 1;
Key idea
This test terminates only when the scheduler is fair
Fairness is assumed by programmers
All cycles in correct programs are unfair
A fair cycle is a livelock

Thread 1:
while (!done) {
  Sleep();
}

Thread 2:
done = 1;

[Diagram: the cycle through the !done states is fair only if Thread 2 is eventually scheduled, taking the program to the done states.]
We need a fair demonic scheduler
Avoid unrolling unfair cycles: effective state coverage
Detect fair cycles: find livelocks
[Diagram: the test harness and concurrent program run against the Win32 API, with the demonic scheduler replaced by a fair demonic scheduler.]
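The difference between a fair scheduler and an unconstrained demonic one can be sketched on the done-flag loop above; the step budget standing in for livelock detection is an assumption of this sketch.

```python
# Fairness sketch: the done-flag loop terminates under a fair (round-robin)
# scheduler but not under a demonic scheduler that keeps picking Thread 1.
# Hitting the step budget under a *fair* schedule would indicate a livelock.

def run(pick_next, budget=100):
    state = {"done": False}
    def t1():                       # while (!done) Sleep();
        return not state["done"]    # True = still looping
    def t2():                       # done = 1;
        state["done"] = True
        return False                # finished
    alive = {1: t1, 2: t2}
    for step in range(budget):
        if not alive:
            return step             # all threads finished: terminated
        tid = pick_next(step, set(alive))
        if not alive[tid]():
            del alive[tid]
    return None                     # budget exhausted under this scheduler

round_robin = lambda step, alive: (sorted(alive)[step % 2]
                                   if len(alive) > 1 else next(iter(alive)))
demonic = lambda step, alive: min(alive)   # always prefer Thread 1

print(run(round_robin))  # 3: terminates after a few steps
print(run(demonic))      # None: Thread 1 spins forever
```

A fair demonic scheduler keeps the adversarial choice of interleavings while ruling out exactly these unfair infinite unrollings.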
What notion of “fairness” do we use?
Weak fairness:
Forall t :: GF (enabled(t) ⇒ scheduled(t))
A thread that remains enabled should eventually be scheduled
A weakly fair scheduler will eventually schedule Thread 2
Example: round-robin

Thread 1:
while (!done) {
  Sleep();
}

Thread 2:
done = 1;
Weak fairness does not suffice

Thread 1:
Lock (l);
while (!done) {
  Unlock (l);
  Sleep();
  Lock (l);
}
Unlock (l);

Thread 2:
Lock (l);
done = 1;
Unlock (l);

[Trace: en = {T1, T2}: T1: Sleep(), T2: Lock (l) → en = {T1, T2}: T1: Lock (l), T2: Lock (l) → en = {T1}: T1: Unlock (l), T2: Lock (l) → en = {T1, T2}: T1: Sleep(), T2: Lock (l) → … Thread 2 never acquires l, yet because it does not remain continuously enabled, weak fairness is satisfied.]
Strong Fairness
Forall t :: GF enabled(t) ⇒ GF scheduled(t)
A thread that is enabled infinitely often is scheduled infinitely often
Thread 2 is enabled and competes for the lock infinitely often

Thread 1:
Lock (l);
while (!done) {
  Unlock (l);
  Sleep();
  Lock (l);
}
Unlock (l);

Thread 2:
Lock (l);
done = 1;
Unlock (l);
Good Samaritan violation
A thread should yield the processor when it is not making progress
Forall threads t :: GF scheduled(t) ⇒ GF yield(t)
Found many such violations, including one in the Singularity boot process
Results in “sluggish I/O” behavior during bootup

Thread 1:
while (!done) {
  ;
}

Thread 2:
done = 1;
Results: Achieves more coverage faster
Work stealing queue with one stealer

                     With fairness | Without fairness, with depth bound
                                   |    20     30     40     50     60
States explored           1726     |   871   1505   1726   1307    683
Percentage coverage       100%     |   50%    87%   100%    76%    40%
Time (secs)                143     |    97    763   2531  >5000  >5000
Finding livelocks and finding (not missing) safety violations

Program          Lines of code   Safety bugs   Livelocks
Work Stealing Q  4K              4
CDS              6K              1
CCR              9K              1             2
ConcRT           16K             2             2
Dryad            18K             7
APE              19K             4
STM              20K             2
TPL              24K             4             5
PLINQ            24K             1
Singularity      175K                          2
Total                            26            11

Acknowledgement: testers from the PCP team
Outline
Preemption bounding
Makes CHESS effective on deep state spaces
Fair stateless model checking
Makes CHESS effective on cyclic state spaces
Enables CHESS to find liveness violations (livelocks)
Sober
Detect relaxed-memory-model errors
Do not miss behaviors only possible in a relaxed memory model
FeatherLite
Concurrency Explorer
Single slide on Sober
The relaxed-memory verification problem:
Is P correct on a relaxed memory model?
Sober: split the problem into two parts
Is P correct on a sequentially consistent (SC) machine?
Is P sequentially consistent on the relaxed memory model?
Check this while only exploring SC executions
CAV ’08 solves the problem for a memory model with store buffers (TSO)
EC2 ’08 extends this approach to a general class of memory models
Outline
Preemption bounding
Makes CHESS effective on deep state spaces
Fair stateless model checking
Makes CHESS effective on cyclic state spaces
Enables CHESS to find liveness violations (livelocks)
Sober
Detect relaxed-memory-model errors
Do not miss behaviors only possible in a relaxed memory model
FeatherLite
A lightweight data-race detection engine (<20% overhead)
Concurrency Explorer
Single slide on FeatherLite
Current data-race detection tools are slow
They process every memory access done by the program
One in five instructions accesses memory: ~1 billion accesses/sec
Key idea: do smart adaptive sampling of memory accesses
Naïve sampling does not work; we need to sample both racing instructions
Cold-path hypothesis: at least one of the racing instructions occurs on a cold path
Races between fast paths are most probably benign
FeatherLite adaptively samples cold paths at a 100% rate and hot paths at a 0.1% rate
Finds 70% of the data races with <20% runtime overhead
Existing data-race detection tools have >10X overhead
Outline
Preemption bounding
Makes CHESS effective on deep state spaces
Fair stateless model checking
Makes CHESS effective on cyclic state spaces
Enables CHESS to find liveness violations (livelocks)
Sober
Detect relaxed-memory-model errors
Do not miss behaviors only possible in a relaxed memory model
FeatherLite
A lightweight data-race detection engine (<20% overhead)
Concurrency Explorer
First-class concurrency debugging
Concurrency Explorer
Single-step over a thread interleaving
Inspect program states at each step
Program state = stacks of all threads + globals
Limited bi-directional debugging
Interleaving slices for better understanding
Working on:
Closer integration with the Visual Studio debugger
Exploring neighborhood interleavings
Conclusion
Don’t stress, use CHESS
CHESS binary and papers available at http://research.microsoft.com/CHESS
Points to get across
Capturing nondeterminism
Sync orders, data races, hardware interleavings
Adding elastic delay
Soundness & completeness
Scoping preemptions
Questions
Did you find new bugs?
How is this different from your previous papers?
How is this different from previous model-checking efforts?
How is this different from
Are these behaviors “expected”?
[Two history diagrams over three threads, drawn with different overlaps in time: Add 10 → return (Thread 1), Add 20 → return (Thread 3), TryTake → return 10, and TryTake → return “empty”.]
Linearizability
A component is linearizable if all operations
Appear to take effect at a single temporal point
And that point is between the call and the return
“As if the component was protected by a single global lock”
[The same history diagrams again: Add 10 → return, Add 20 → return, TryTake → return 10, TryTake → return “empty”, over three threads.]
This behavior is not linearizable
Thread 2 getting a 10 means that Thread 1’s Add got to the queue before Thread 3’s Add
So, when Thread 3 does a TryTake, 20 should still be in the queue
[History diagram: Add 10 → return (Thread 1), Add 20 → return (Thread 3), TryTake → return 20, TryTake → return “empty”.]
Linearizable?
How is Linearizability different from Serializability?
Serializability:
All operations happen atomically in some serial order
Linearizability:
All operations happen at a single instant
That instant is between the call and return
Serializable behavior that is not Linearizable
Linearizability assumes a global observer that can see that Thread 1 finished before Thread 2 started
This is what makes linearizability composable

Thread 1: Add 10 → return
Thread 2: TryTake → return “empty”
Serializability does not compose
The behaviors of the blue queue and the green queue are individually serializable
But together, the behavior is not serializable

Thread 1: Add 10 → return; TryTake → return “empty”
Thread 2: Add 10 → return; TryTake → return “empty”
[In the slide, each thread Adds to one queue and TryTakes from the other; the colors distinguish the two queues.]
To make this all the more confusing:
Database concurrency control ensures that transactions are linearizable
Even though the literature only talks about serializability
Quote from Jim Gray: “When a transaction finishes, the state of the database immediately reflects the updates of the transaction”
The commit point of a transaction is guaranteed to be between the transaction begin and end
When using a two-phase locking protocol, for instance
The “standard” definition of Linearizability
Is a little more general than my interpretation (“as if protected by a single global lock”)
Sometimes, a concurrent implementation can have more behaviors than a sequential implementation
Example: a set implemented as a queue
A sequential version will be FIFO even though order does not matter for a set
For performance, a concurrent version can break the FIFO ordering but still maintain the set abstraction
Hence, define a “sequential specification”
A Sequential Specification
(A fancy word for something you already know but don’t usually think about)
Each object has a state
e.g., the sequence of elements in the queue
Each operation has a precondition and a postcondition
Precondition: the queue is not empty
Postcondition: Remove will remove the first element in the queue
Another example:
Precondition: True
Postcondition: TryTake will
Return false if the queue is empty and leave the state unchanged
Otherwise, return true and remove the first element in the queue
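Such a specification is directly executable; a sketch of the queue spec above in Python, with the preconditions written as assertions:

```python
# The sequential queue specification above, as executable Python: a state
# plus a pre/postcondition contract per operation.

class SeqQueueSpec:
    def __init__(self):
        self.items = []            # the object's state: sequence of elements

    def remove(self):
        assert self.items, "precondition: queue is not empty"
        return self.items.pop(0)   # postcondition: removes the first element

    def try_take(self):
        # precondition: True (always callable)
        if not self.items:
            return (False, None)   # empty: report failure, state unchanged
        return (True, self.items.pop(0))  # otherwise remove the first element

    def add(self, x):
        self.items.append(x)

q = SeqQueueSpec()
print(q.try_take())    # (False, None)
q.add(10); q.add(20)
print(q.try_take())    # (True, 10): first element removed
print(q.items)         # [20]
```

Replaying histories against exactly this kind of object is what the linearizability and Line-Up checks earlier in the deck do.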