Concurrency Testing: Challenges, Algorithms, and Tools
Madan Musuvathi, Microsoft Research
A concurrent program should
Function correctly
Maximize throughput: finish as many tasks as possible
Minimize latency: respond to requests as soon as possible
While handling nondeterminism in the environment
Concurrency is HARD
Concurrency is Pervasive
Concurrency is an age-old problem of computer science
Most programs are concurrent
At least the ones that you expect to get paid for, anyway
Solving the Concurrency Problem
We need:
Better programming abstractions
Better analysis and verification techniques
Better testing methodologies
Weakest Link
Testing is more important than you think
My first-ever computer program:
Wrote it in Basic
Not the world’s best programming language
With no idea about program correctness
I didn’t know first-order logic, loop invariants, … I hadn’t heard about Hoare, Dijkstra, …
But still managed to write correct programs, using the write, test, [debug, write, test]+ cycle
How many of you have …written a program > 10,000 lines?
written a program, compiled it, called it done without testing the program on a single input?
written a program, compiled it, called it done without testing the program on few interesting inputs?
Imagine a world where you can’t pick the inputs during testing …
You write the program
Check its correctness by staring at it
Give the program to the computer
The computer tests on inputs of its choice:
factorial(5) = 120
factorial(5) = 120 the next 100 times
factorial(7) = 5040
The computer runs this program again and again on these inputs for a week
The program didn’t crash and therefore it is correct
int factorial(int x) {
  int ret = 1;
  while (x > 1) {
    ret *= x;
    x--;
  }
  return ret;
}
Parent_thread() {
  if (p != null) {
    p = new P();
    Set(initEvent);
  }
}

Child_thread() {
  if (p != null) {
    Wait(initEvent);
  }
}
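The snippet above is the kind of bug that input testing cannot catch: whether the child sees an initialized p depends entirely on the schedule. A minimal sketch (in Python, not the slide's environment, with the slide's semantics simplified) that simulates the two coarse interleavings by hand:

```python
# A minimal sketch (not the CHESS harness): simulate the two coarse
# interleavings of the parent/child snippet by hand. The slide's exact
# semantics are simplified; the point is that correctness depends on the
# schedule, which input-based testing never controls.

def run(parent_first):
    state = {"p": None, "initEvent": False}

    def parent():
        state["p"] = object()          # p = new P();
        state["initEvent"] = True      # Set(initEvent);

    def child():
        if state["p"] is not None:     # if (p != null)
            assert state["initEvent"]  # Wait(initEvent) returns: event is set
            return True                # child saw an initialized p
        return False                   # child raced ahead: p not yet initialized

    if parent_first:
        parent()
        return child()
    ok = child()
    parent()
    return ok

print(run(parent_first=True))   # True: parent initialized p first
print(run(parent_first=False))  # False: child skipped the wait
```

No choice of test inputs distinguishes the two outcomes; only controlling the schedule does.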
This is the world of concurrency testing
You write the program
Check its correctness by staring at it
Give the program to the computer
The computer generates some interleavings
The computer runs this program again and again on these interleavings
The program didn’t crash and therefore it is correct
How do we test concurrent software today?
Demo
CHESS Proposition
Capture and expose nondeterminism to a scheduler
Threads can run at different speeds
Asynchronous tasks can start at an arbitrary time in the future
Hardware/compiler can reorder instructions
Explore the nondeterminism using several algorithms
Tackle the astronomically large number of interleavings
Remember: any algorithm is better than no control at all
CHESS in a nutshell
CHESS is a user-mode scheduler
Controls all scheduling nondeterminism
Replaces the OS scheduler
Guarantees:
Every program run takes a different thread interleaving
Reproduce the interleaving for every run
Download CHESS source from http://chesstool.codeplex.com/
CHESS architecture
[Architecture diagram: the CHESS exploration engine drives the CHESS scheduler, which sits between the program and the platform: Win32 wrappers for an unmanaged program on Windows, and .NET wrappers for a managed program on the CLR.]
• Every run takes a different interleaving
• Reproduce the interleaving for every run
Running Example
Thread 1:
Lock (l);
bal += x;
Unlock (l);

Thread 2:
Lock (l);
t = bal;
Unlock (l);

Lock (l);
bal = t - y;
Unlock (l);
Introduce Schedule() points
Thread 1:
Schedule(); Lock (l);
bal += x;
Schedule(); Unlock (l);

Thread 2:
Schedule(); Lock (l);
t = bal;
Schedule(); Unlock (l);

Schedule(); Lock (l);
bal = t - y;
Schedule(); Unlock (l);
Instrument calls to the CHESS scheduler
Each call is a potential preemption point
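The effect of these Schedule() points can be sketched with coroutines: if every thread yields control at its schedule points, a test driver can pick the next thread deterministically and replay exactly the same interleaving later. This is an illustration of the idea, not the CHESS implementation; note that withdraw deliberately spans two critical sections, as in the slides.

```python
# Sketch of a user-mode scheduler: threads are Python generators, and
# `yield` plays the role of Schedule(). A fixed schedule (list of thread
# ids) fully determines the interleaving, so every run is reproducible.

def deposit(state, x):
    yield                      # Schedule(); Lock(l);
    state["bal"] += x          # bal += x;
    yield                      # Schedule(); Unlock(l);

def withdraw(state, y):
    yield                      # Schedule(); Lock(l);
    t = state["bal"]           # t = bal;
    yield                      # Schedule(); Unlock(l); ... Lock(l);
    state["bal"] = t - y       # bal = t - y;  (second critical section)
    yield

def run(schedule, x=10, y=5):
    """Run the two threads under a fixed schedule of thread ids."""
    state = {"bal": 0}
    threads = {1: deposit(state, x), 2: withdraw(state, y)}
    for tid in schedule:
        try:
            next(threads[tid])     # resume tid until its next Schedule()
        except StopIteration:
            pass
    for g in threads.values():     # drain any remaining steps
        for _ in g:
            pass
    return state["bal"]

print(run([1, 1, 2, 2, 2]))  # deposit completes before withdraw: 5
print(run([2, 2, 1, 1, 2]))  # withdraw reads bal=0 before the deposit: -5
```

The second schedule exposes the atomicity bug across withdraw's two critical sections, and rerunning it reproduces the same result every time.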
First-cut solution: Random sleeps
Introduce random sleep at schedule points
Does not introduce new behaviors
Sleep models a possible preemption at each location
Sleeping for a finite amount guarantees starvation-freedom
Thread 1:
Sleep(rand()); Lock (l);
bal += x;
Sleep(rand()); Unlock (l);

Thread 2:
Sleep(rand()); Lock (l);
t = bal;
Sleep(rand()); Unlock (l);

Sleep(rand()); Lock (l);
bal = t - y;
Sleep(rand()); Unlock (l);
Improvement 1: Capture the “happens-before” graph
Thread 1:
Schedule(); Lock (l);
bal += x;
Schedule(); Unlock (l);

Thread 2:
Schedule(); Lock (l);
t = bal;
Schedule(); Unlock (l);

Schedule(); Lock (l);
bal = t - y;
Schedule(); Unlock (l);

Delays that result in the same “happens-before” graph are equivalent
Avoid exploring equivalent interleavings
[Diagram: delaying Thread 2 with Sleep(5) before each of its critical sections yields the same happens-before graph, and hence an equivalent interleaving.]
Improvement 2: Understand synchronization semantics
Avoid exploring delays that are impossible
Identify when threads can make progress
CHESS maintains a run queue and a wait queue
Mimics OS scheduler state
[Diagram: while Thread 1 is inside Lock (l) … Unlock (l), Thread 2’s Lock (l) cannot proceed, so CHESS does not schedule Thread 2 until the lock is released.]
CHESS modes: speed vs coverage
Fast mode:
Introduce schedule points before synchronizations, volatile accesses, and interlocked operations
Finds many bugs in practice
Data-race mode:
Introduce schedule points before memory accesses
Finds race conditions due to data races
Captures all sequentially consistent (SC) executions
CHESS Design Choices
Soundness:
Any bug found by CHESS should be possible in the field
Should not introduce false errors (both safety and liveness)
Completeness:
Any bug found in the field should be found by CHESS
In theory, we need to capture all sources of nondeterminism
In practice, we need to effectively explore the astronomically large state space
Capture all sources of nondeterminism? No.
Scheduling nondeterminism? Yes
Timing nondeterminism? Yes
Controls when and in what order the timers fire
Nondeterministic system calls? Mostly
CHESS uses precise abstractions for many system calls
Input nondeterminism? No
Rely on users to provide inputs
Program inputs, return values of system calls, files read, packets received, …
Good tradeoff in the short term
But can’t find race conditions in error-handling code
Capture all sources of nondeterminism? No.
Hardware relaxations? Yes
Hardware can reorder instructions
Non-SC executions are possible in programs with data races
Sober [CAV ’08] can detect and explore such non-SC executions
Compiler relaxations? No
Very few people understand what compilers can do to programs with data races
Far fewer than those who understand the general theory of relativity
Schedule Exploration Algorithms
Two kinds
Two kinds:
Reduction algorithms
Explore one out of a large number of equivalent interleavings
Prioritization algorithms
Pick “interesting” interleavings before you run out of resources
Remember: anything is better than nothing
Reduction Algorithms
Schedule Exploration Algorithms
Enumerating Thread Interleavings Using Depth-First Search
Thread 1:
x = 1;
y = 1;

Thread 2:
x = 2;
y = 2;

[Diagram: the depth-first search tree over the six interleavings, starting from state (x, y) = (0, 0) and reaching the final states (2, 2), (2, 1), (1, 1), and (1, 2).]

Explore (State s) {
  T = set of threads in s;
  foreach t in T {
    s’ = schedule t in s;
    Explore(s’);
  }
}
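The Explore() pseudocode above, sketched in Python for this running example: a naive depth-first search enumerates every interleaving of the threads' remaining steps.

```python
# The Explore() pseudocode, in Python, for the running example
# (Thread 1: x = 1; y = 1 and Thread 2: x = 2; y = 2). A naive DFS
# enumerates every interleaving of the threads' remaining steps.

def explore(state, threads, finals, runs):
    live = [t for t in threads if threads[t]]
    if not live:                        # no thread has steps left
        finals.add((state["x"], state["y"]))
        runs[0] += 1
        return
    for t in live:                      # foreach enabled thread t
        var, val = threads[t][0]
        saved = state[var]
        state[var] = val                # s' = schedule t in s
        rest = {k: (v[1:] if k == t else v) for k, v in threads.items()}
        explore(state, rest, finals, runs)
        state[var] = saved              # backtrack

threads = {1: [("x", 1), ("y", 1)], 2: [("x", 2), ("y", 2)]}
finals, runs = set(), [0]
explore({"x": 0, "y": 0}, threads, finals, runs)
print(runs[0])         # 6 interleavings (4 choose 2)
print(sorted(finals))  # [(1, 1), (1, 2), (2, 1), (2, 2)]
```

Six interleavings already collapse to four distinct final states, which is exactly the redundancy the reduction algorithms below exploit.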
Behaviorally equivalent interleavings
Reach the same final state (x = 1, y = 3)
[Diagram: two interleavings of Thread 1 (x = 1; y = 2;) and Thread 2 (if (x == 1) { y = 3; }), both reaching (x = 1, y = 3) and marked equivalent.]
Behaviorally inequivalent interleavings
Reach different final states (1, 3) vs (1,2)
[Diagram: two interleavings of the same threads reaching different final states: (1, 3) when Thread 2’s test of x runs after Thread 1’s writes, and (1, 2) when it runs before them. The interleavings are inequivalent.]
Behaviorally inequivalent interleavings
Don’t necessarily have to reach different states
[Diagram: two interleavings that happen to reach the same final state but follow different code paths; behaviorally they are still inequivalent.]
Execution Equivalence
Two executions are equivalent if they can be obtained by commuting independent operations
[Diagram: the execution x = 1; r1 = y; r2 = y; r3 = x transformed step by step into equivalent executions by commuting adjacent independent operations.]
Formalism
An execution is a sequence of transitions
Each transition is of the form <tid, var, op>
tid: the thread performing the transition
var: the memory location accessed in the transition
op: READ | WRITE | READWRITE
Two steps are independent if
they are executed by different threads, and
either they access different variables or both READ the same variable
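The independence relation can be written directly as a predicate over <tid, var, op> transitions; two adjacent independent transitions can be commuted without changing the execution's behavior.

```python
# The independence relation from the formalism above, as a predicate over
# transitions <tid, var, op>.

def independent(t1, t2):
    (tid1, var1, op1), (tid2, var2, op2) = t1, t2
    if tid1 == tid2:
        return False            # same thread: program order matters
    if var1 != var2:
        return True             # different locations never conflict
    return op1 == "READ" and op2 == "READ"   # two reads of one var commute

print(independent((1, "x", "WRITE"), (2, "y", "READ")))  # True
print(independent((1, "x", "WRITE"), (2, "x", "READ")))  # False
print(independent((1, "x", "READ"), (2, "x", "READ")))   # True
```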
Equivalence makes the schedule space a Directed Acyclic Graph
Thread 1:
x = 1;
y = 1;

Thread 2:
x = 2;
y = 2;

[Diagram: the same interleaving tree, with equivalent interleavings merged so that the schedule space forms a DAG over the states (0, 0) through (2, 2).]
HashTable visited;
Explore (Sequence s) {
  T = set of threads enabled in s;
  foreach t in T {
    s’ = s . <t,v,o>;
    if (s’ in visited) continue;
    visited.Add(s’);
    Explore(s’);
  }
}
DFS in a DAG (CS 101)
Explore (Sequence s) {
  T = set of threads enabled in s;
  foreach t in T {
    s’ = s . <t,v,o>;
    Explore(s’);
  }
}
HashTable visited;
Explore (Sequence s) {
  T = set of threads enabled in s;
  foreach t in T {
    s’ = s . <t,v,o>;
    s” = canon(s’);
    if (s” in visited) continue;
    visited.Add(s”);
    Explore(s’);
  }
}
The sleep sets algorithm explores a DAG without maintaining the table
Sleep Set Algorithm
Thread 1:
x = 1;
y = 1;

Thread 2:
x = 2;
y = 2;

[Diagram: the interleaving tree with sleep sets pruning the transitions that lead to already-visited states.]
Identify transitions that take you to visited states
Sleep Set Algorithm
Explore (Sequence s, sleep C) {
  T = set of transitions enabled in s;
  T’ = T – C;
  foreach t in T’ {
    C = C + t;
    s’ = s . t;
    C’ = C – {transitions dependent on t};
    Explore(s’, C’);
  }
}
Summary
Sleep sets ensure that a stateless execution does not explode a DAG into a tree
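A sketch of both searches on a two-thread example, using the independence relation from the formalism. With sleep sets the DFS completes one execution per equivalence class instead of one per interleaving; the particular thread bodies (each thread writes x then y) are an assumption chosen to keep the counts small.

```python
# Plain DFS versus DFS with sleep sets, with dependence judged on each
# thread's next transition. Sleep sets complete one execution per
# equivalence class (Mazurkiewicz trace) instead of one per interleaving.

def independent(s1, s2):
    (t1, v1, _), (t2, v2, _) = s1, s2
    return t1 != t2 and v1 != v2     # writes only: same var = dependent

def count_executions(threads, use_sleep):
    complete = [0]
    def explore(pos, sleep):
        enabled = [t for t in threads if pos[t] < len(threads[t])]
        if not enabled:
            complete[0] += 1         # reached a full interleaving
            return
        explored = []
        for t in enabled:
            if t in sleep:
                continue             # pruned: leads to an explored trace
            step = threads[t][pos[t]]
            # child's sleep set: earlier siblings and inherited entries
            # that are independent of the step we are about to take
            child = ({u for u in sleep | set(explored)
                      if independent(threads[u][pos[u]], step)}
                     if use_sleep else set())
            nxt = dict(pos); nxt[t] += 1
            explore(nxt, child)
            explored.append(t)
    explore({t: 0 for t in threads}, set())
    return complete[0]

threads = {1: [(1, "x", "W"), (1, "y", "W")],
           2: [(2, "x", "W"), (2, "y", "W")]}
print(count_executions(threads, use_sleep=False))  # 6 interleavings
print(count_executions(threads, use_sleep=True))   # 4 equivalence classes
```

No visited table is needed: the sleep set alone prevents re-exploring the merged parts of the DAG.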
Persistent Set Reduction
Thread 1:
x = 1;
x = 2;

Thread 2:
y = 1;
y = 2;
With Sleep Sets
Thread 1:
x = 1;
x = 2;

Thread 2:
y = 1;
y = 2;
With Persistent Sets
Assumption: we are only interested in the reachability of final states (for instance, no global assertions)
Thread 1:
x = 1;
x = 2;

Thread 2:
y = 1;
y = 2;
Persistent Sets
A set of transitions P is persistent in a state s if:
in the state space X reachable from s by exploring only transitions not in P,
every transition in X is independent with P
(P “persists” in X)
It is sound to explore only P from s
[Diagram: state s with its persistent set P, and the reachable space X whose transitions all commute with P.]
With Persistent Sets
Thread 1:
x = 1;
x = 2;

Thread 2:
y = 1;
y = 2;
Dynamic Partial-Order Reduction Algorithm [Flanagan & Godefroid]
Identifies persistent sets dynamically
After executing a transition, insert a schedule point before the most recent conflict

Thread 1:
y = 1;
x = 1;

Thread 2:
x = 2;
z = 3;

[Diagram: after executing y = 1; x = 1; x = 2; z = 3, the conflict between x = 1 and x = 2 causes DPOR to insert a schedule point before x = 1, yielding the alternative interleaving in which x = 2 runs before x = 1.]
Prioritization Algorithms
Schedule Exploration Algorithms
Schedule Prioritization
Preemption bounding:
Few preemptions are sufficient for finding lots of bugs
Preemption sealing:
Insert preemptions where you think bugs are
Random:
If you don’t have additional information about the state space, random is the best
Still do partial-order reduction
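Preemption bounding can be sketched as a filter on the schedule enumeration: a context switch counts as a preemption only if the running thread could have continued. Counting schedules for two 2-step threads shows how quickly the bound shrinks the space (the counting harness itself is an assumption of this sketch).

```python
# Preemption-bounded schedule enumeration, sketched. A switch away from a
# thread that still has steps left is a preemption; switching when the
# current thread has finished is free.

def schedules(k, bound):
    """Count interleavings of two k-step threads with <= bound preemptions."""
    results = []
    def go(done1, done2, current, preemptions, trace):
        if done1 == k and done2 == k:
            results.append(tuple(trace))
            return
        for nxt in (1, 2):
            left = k - (done1 if nxt == 1 else done2)
            if left == 0:
                continue                     # nxt has no steps remaining
            cost = 0
            if current is not None and nxt != current:
                still_runnable = (k - (done1 if current == 1 else done2)) > 0
                cost = 1 if still_runnable else 0   # forced switch = preemption
            if preemptions + cost > bound:
                continue                     # over budget: prune
            go(done1 + (nxt == 1), done2 + (nxt == 2),
               nxt, preemptions + cost, trace + [nxt])
    go(0, 0, None, 0, [])
    return len(results)

print(schedules(2, bound=10))  # all C(4,2) = 6 interleavings
print(schedules(2, bound=0))   # 2: non-preemptive, one thread then the other
print(schedules(2, bound=1))   # 4 schedules with at most one preemption
```

The gap widens dramatically with more steps and threads, which is why a small preemption bound makes deep state spaces tractable.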
Concurrency Correctness Criterion
CHESS checks for various correctness criteria:
Assertion failures
Deadlocks
Livelocks
Data races
Atomicity violations
(Deterministic) linearizability violations
Linearizability Checking in CHESS
Concurrency Correctness Criterion
Motivation
Writing good test oracles is hard

Thread 1: Bank.Add($20)
Thread 2: Bank.Withdraw($20)
Assert(Bank.Balance() == ?)
Motivation
Writing good test oracles is hard
Is this a correct assertion to check for?
Now what if there are 5 threads, each performing 5 queue operations?

Thread 1:
q.AddFirst(10)
q.AddLast(20)

Thread 2:
q.RemoveLast()
q.RemoveFirst()

Assert(q.IsEmpty())
We want to magically
Check if a Bank behaves like a Bank should
Check if a queue behaves like a queue
Answer: check for linearizability
Linearizability
The correctness notion closest to “thread safety”
A concurrent component behaves as if it is protected by a single global lock
Each operation appears to take effect instantaneously at some point between the call and return
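This definition suggests a brute-force check, sketched below: enumerate the permutations of completed operations that respect real-time order (A before B if A returned before B was invoked) and replay each against a sequential FIFO queue. The history encoding and the timestamps are assumptions made for illustration.

```python
# Brute-force linearizability check against a sequential FIFO queue spec.
# A history is linearizable iff some permutation of its operations that
# respects real-time precedence reproduces every recorded return value.

from itertools import permutations

def replay(ops):
    """Replay (op, arg) pairs against a sequential FIFO queue."""
    q, out = [], []
    for op, arg in ops:
        if op == "add":
            q.append(arg); out.append(None)
        else:  # "trytake"
            out.append(q.pop(0) if q else "empty")
    return out

def linearizable(history):
    """history: list of (start, end, (op, arg), result) tuples."""
    n = len(history)
    for perm in permutations(range(n)):
        # a must precede b whenever a returned before b was invoked
        if any(history[a][1] < history[b][0] and perm.index(a) > perm.index(b)
               for a in range(n) for b in range(n)):
            continue
        if replay([history[i][2] for i in perm]) == [history[i][3] for i in perm]:
            return True
    return False

# Add(10) completes before Add(20), and a concurrent TryTake returns 10,
# so Add(10) is ordered first; the later TryTake must then see 20:
bad = [(0, 1, ("add", 10), None),
       (2, 3, ("add", 20), None),
       (2, 3, ("trytake", None), 10),
       (4, 5, ("trytake", None), "empty")]
good = [(0, 1, ("add", 10), None),
        (2, 3, ("add", 20), None),
        (2, 3, ("trytake", None), 10),
        (4, 5, ("trytake", None), 20)]
print(linearizable(bad))   # False
print(linearizable(good))  # True
```

Enumerating all permutations is exponential; it is fine for the tiny histories a test produces, which is the setting Line-Up works in.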
The Problem with Linearizability Checking
Need a sequential specification
Imagine writing a sequential specification for your operating system
Instead, check if a component is linearizable with respect to some deterministic specification
This can be done automatically
Generate the sequential specification by “inserting a global lock”
Line-Up: Two-Phase Method
For a given test:
First, generate the sequential specification
Enumerate serial executions of the test
Record all observed histories
Assume the generated histories are the intended behaviors of the component
Second, check linearizability with respect to the generated specification
Enumerate fully concurrent executions
Test each history for compatibility with the serial executions
Line-Up on the Bank Example
Serial executions imply that the final balance can be 20 or 0
Concurrent executions should satisfy the assertion
Thread 1: Bank.Add($20)
Thread 2: Bank.Withdraw($20)
Assert(Bank.Balance() == 20 || Bank.Balance() == 0)
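The two phases can be sketched on a toy bank whose Withdraw is deliberately non-atomic (separate read and write steps) and unconditional, which differs from the slide's Bank; under these assumptions every serial order ends at balance 0, and phase 2 catches the interleaving that no serial execution can produce.

```python
# Line-Up's two phases, sketched. Phase 1 runs the operations serially to
# learn the intended outcomes; phase 2 enumerates interleavings of the
# step-level implementation and flags any outcome outside the serial set.

from itertools import permutations

def add_steps(n):
    return [lambda s: s.__setitem__("bal", s["bal"] + n)]

def withdraw_steps(n):     # non-atomic: read, then write (no balance check)
    return [lambda s: s.__setitem__("t", s["bal"]),
            lambda s: s.__setitem__("bal", s["t"] - n)]

def run(ops, order):
    s = {"bal": 0, "t": 0}
    pos = [0] * len(ops)
    for tid in order:
        ops[tid][pos[tid]](s)
        pos[tid] += 1
    return s["bal"]

def interleavings(lengths, pos=None):
    pos = pos or [0] * len(lengths)
    if pos == lengths:
        yield []
        return
    for i, (p, ln) in enumerate(zip(pos, lengths)):
        if p < ln:
            nxt = list(pos); nxt[i] += 1
            for rest in interleavings(lengths, nxt):
                yield [i] + rest

ops = [add_steps(20), withdraw_steps(20)]

# Phase 1: serial executions define the specification
serial = set()
for perm in permutations(range(len(ops))):
    serial.add(run(ops, [i for i in perm for _ in ops[i]]))

# Phase 2: concurrent executions checked against the serial outcomes
concurrent = {run(ops, order)
              for order in interleavings([len(o) for o in ops])}
print(sorted(serial))               # [0]
print(sorted(concurrent - serial))  # [-20]: the lost-update violation
```

The violation is reported without any hand-written oracle: the serial executions of the same test are the oracle.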
Line-Up Guarantees
Full completeness:
If Line-Up reports a violation, the implementation is not linearizable with respect to any deterministic specification.
Restricted soundness:
If the implementation is not linearizable with respect to any deterministic specification, there exists a test on which Line-Up will report a violation.
Linearizability Violations
Non-linearizable histories can reveal implementation errors (e.g., incorrect synchronization)
The non-linearizable behavior below was caused by a bug in .NET 4.0 (accidental lock timeout).

Thread 1: Add 200 → return; TryTake → return 200
Thread 2: Add 200 → return; TryTake → return empty
Generalizing Linearizability
Some operations may block, e.g. semaphore.acquire()
Blocking can be “good” (expected behavior) or “bad” (a bug)
The original definition of linearizability does not make this distinction: blocking is always OK
We generalized the definition to be able to catch “bad blocking”
A buggy counter implementation:

class Counter {
  int count = 0;
  bool b = false;
  Lock lock = new Lock();

  void inc() {
    b = true;
    lock.acquire();
    count = count + 1;
    lock.release();
    b = false;
  }

  int get() {
    lock.acquire();
    t = count;
    if (!b) lock.release();
    return t;
  }
}
Stuck history:
inc (call)
inc (return)
get (call)
get → 1
inc (call) … never returns
(If the second inc sets b = true before get tests it, get returns without releasing the lock, and the inc blocks forever.)
Results
Each letter is a separate root cause
Questions
(A) Incorrect use of CAS causes state corruption.
(B) RemoveLast() uses an incorrect lock-free optimization.
(C) Call to SemaphoreSlim includes a timeout parameter by mistake.
(D) ToArray() can livelock when crossing segment boundaries. Note that the harness for this class performs a particular pre-test sequence (add 31 elements, remove 31 elements).
(E) Insufficient locking: a thread can get preempted while trying to set an exception.
(F) Barrier is not a linearizable data type. Barriers block each thread until all threads have entered the barrier, a behavior that is not equivalent to any serial execution.
(G) Cancel is not a linearizable method: the effect of the cancellation can be delayed past the operation’s return, and in fact even past subsequent operations on the same thread.
(H) Count() may release a lock it does not own if interleaved with Add().
(I) Bag is nondeterministic by design to improve performance: the returned value can depend on the specific interleaving.
(J) Count may return 0 even if the collection is not empty. The specification of the Count method was weakened after Line-Up detected this behavior.
(K) TryTake may fail even if the collection is not empty. The specification of the TryTake method was weakened after Line-Up detected this behavior.
(L) SetResult() throws the wrong exception if the task is already reserved for completion by somebody else, but not completed yet.
Results: Phase 1 / Phase 2
Outline
Preemption bounding
Makes CHESS effective on deep state spaces
Fair stateless model checking
Sober
FeatherLite
Concurrency Explorer
Outline
Preemption bounding
Makes CHESS effective on deep state spaces
Fair stateless model checking
Makes CHESS effective on cyclic state spaces
Enables CHESS to find liveness violations (livelocks)
Sober
FeatherLite
Concurrency Explorer
Concurrent programs have cyclic state spaces
Spinlocks
Non-blocking algorithms
Implementations of synchronization primitives
Periodic timers
…

Thread 1:
L1: while (!done) {
L2:   Sleep();
}

Thread 2:
M1: done = 1;

[Diagram: the four states (!done, L1), (!done, L2), (done, L1), (done, L2); the two !done states form a cycle until Thread 2 runs.]
A demonic scheduler unrolls any cycle ad infinitum

Thread 1:
while (!done) {
  Sleep();
}

Thread 2:
done = 1;

[Diagram: the !done cycle unrolled without bound, with the done transition available at every step.]
Depth bounding
Prune executions beyond a bounded number of steps
[Diagram: the unrolled !done cycle truncated at the depth bound.]
Problem 1: Ineffective state coverage
The bound has to be large enough to reach the deepest bug
Typically, greater than 100 synchronization operations
Every unrolling of a cycle redundantly explores the reachable state space
[Diagram: each unrolling of the !done cycle re-explores the same states up to the depth bound.]
Problem 2: Cannot find livelocks
Livelocks: lack of progress in a program

Thread 1:
temp = done;
while (!temp) {
  Sleep();
}

Thread 2:
done = 1;
Key idea
This test terminates only when the scheduler is fair
Fairness is assumed by programmers
All cycles in correct programs are unfair
A fair cycle is a livelock

Thread 1:
while (!done) {
  Sleep();
}

Thread 2:
done = 1;

[Diagram: the cycle through the !done states is fair only if Thread 2 is eventually scheduled, taking the program to the done states.]
We need a fair demonic scheduler
Avoid unrolling unfair cycles: effective state coverage
Detect fair cycles: find livelocks
[Diagram: the test harness and concurrent program run against the Win32 API, with the demonic scheduler replaced by a fair demonic scheduler.]
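The difference between a fair scheduler and an unconstrained demonic one can be sketched on the done-flag loop above; the step budget standing in for livelock detection is an assumption of this sketch.

```python
# Fairness sketch: the done-flag loop terminates under a fair (round-robin)
# scheduler but not under a demonic scheduler that keeps picking Thread 1.
# Hitting the step budget under a *fair* schedule would indicate a livelock.

def run(pick_next, budget=100):
    state = {"done": False}
    def t1():                       # while (!done) Sleep();
        return not state["done"]    # True = still looping
    def t2():                       # done = 1;
        state["done"] = True
        return False                # finished
    alive = {1: t1, 2: t2}
    for step in range(budget):
        if not alive:
            return step             # all threads finished: terminated
        tid = pick_next(step, set(alive))
        if not alive[tid]():
            del alive[tid]
    return None                     # budget exhausted under this scheduler

round_robin = lambda step, alive: (sorted(alive)[step % 2]
                                   if len(alive) > 1 else next(iter(alive)))
demonic = lambda step, alive: min(alive)   # always prefer Thread 1

print(run(round_robin))  # 3: terminates after a few steps
print(run(demonic))      # None: Thread 1 spins forever
```

A fair demonic scheduler keeps the adversarial choice of interleavings while ruling out exactly these unfair infinite unrollings.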
What notion of “fairness” do we use?
Weak fairness:
Forall t :: GF (enabled(t) ⇒ scheduled(t))
A thread that remains enabled should eventually be scheduled
A weakly fair scheduler will eventually schedule Thread 2
Example: round-robin

Thread 1:
while (!done) {
  Sleep();
}

Thread 2:
done = 1;
Weak fairness does not suffice

Thread 1:
Lock (l);
while (!done) {
  Unlock (l);
  Sleep();
  Lock (l);
}
Unlock (l);

Thread 2:
Lock (l);
done = 1;
Unlock (l);

[Trace: en = {T1, T2}: T1: Sleep(), T2: Lock (l) → en = {T1, T2}: T1: Lock (l), T2: Lock (l) → en = {T1}: T1: Unlock (l), T2: Lock (l) → en = {T1, T2}: T1: Sleep(), T2: Lock (l) → … Thread 2 never acquires l, yet because it does not remain continuously enabled, weak fairness is satisfied.]
Strong Fairness
Forall t :: GF enabled(t) ⇒ GF scheduled(t)
A thread that is enabled infinitely often is scheduled infinitely often
Thread 2 is enabled and competes for the lock infinitely often

Thread 1:
Lock (l);
while (!done) {
  Unlock (l);
  Sleep();
  Lock (l);
}
Unlock (l);

Thread 2:
Lock (l);
done = 1;
Unlock (l);
Good Samaritan violation
A thread should yield the processor when it is not making progress
Forall threads t :: GF scheduled(t) ⇒ GF yield(t)
Found many such violations, including one in the Singularity boot process
Results in “sluggish I/O” behavior during bootup

Thread 1:
while (!done) {
  ;
}

Thread 2:
done = 1;
Results: Achieves more coverage faster
Work stealing queue with one stealer

                     With fairness | Without fairness, with depth bound
                                   |    20     30     40     50     60
States explored           1726     |   871   1505   1726   1307    683
Percentage coverage       100%     |   50%    87%   100%    76%    40%
Time (secs)                143     |    97    763   2531  >5000  >5000
Finding livelocks and finding (not missing) safety violations

Program          Lines of code   Safety bugs   Livelocks
Work Stealing Q  4K              4
CDS              6K              1
CCR              9K              1             2
ConcRT           16K             2             2
Dryad            18K             7
APE              19K             4
STM              20K             2
TPL              24K             4             5
PLINQ            24K             1
Singularity      175K                          2
Total                            26            11

Acknowledgement: testers from the PCP team
Outline
Preemption bounding
Makes CHESS effective on deep state spaces
Fair stateless model checking
Makes CHESS effective on cyclic state spaces
Enables CHESS to find liveness violations (livelocks)
Sober
Detect relaxed-memory-model errors
Do not miss behaviors only possible in a relaxed memory model
FeatherLite
Concurrency Explorer
Single slide on Sober
The relaxed-memory verification problem:
Is P correct on a relaxed memory model?
Sober: split the problem into two parts
Is P correct on a sequentially consistent (SC) machine?
Is P sequentially consistent on the relaxed memory model?
Check this while only exploring SC executions
CAV ’08 solves the problem for a memory model with store buffers (TSO)
EC2 ’08 extends this approach to a general class of memory models
Outline
Preemption bounding
Makes CHESS effective on deep state spaces
Fair stateless model checking
Makes CHESS effective on cyclic state spaces
Enables CHESS to find liveness violations (livelocks)
Sober
Detect relaxed-memory-model errors
Do not miss behaviors only possible in a relaxed memory model
FeatherLite
A lightweight data-race detection engine (<20% overhead)
Concurrency Explorer
Single slide on FeatherLite
Current data-race detection tools are slow
They process every memory access done by the program
One in five instructions accesses memory: ~1 billion accesses/sec
Key idea: do smart adaptive sampling of memory accesses
Naïve sampling does not work; we need to sample both racing instructions
Cold-path hypothesis: at least one of the racing instructions occurs on a cold path
Races between fast paths are most probably benign
FeatherLite adaptively samples cold paths at a 100% rate and hot paths at a 0.1% rate
Finds 70% of the data races with <20% runtime overhead
Existing data-race detection tools have >10X overhead
Outline
Preemption bounding
Makes CHESS effective on deep state spaces
Fair stateless model checking
Makes CHESS effective on cyclic state spaces
Enables CHESS to find liveness violations (livelocks)
Sober
Detect relaxed-memory-model errors
Do not miss behaviors only possible in a relaxed memory model
FeatherLite
A lightweight data-race detection engine (<20% overhead)
Concurrency Explorer
First-class concurrency debugging
Concurrency Explorer
Single-step over a thread interleaving
Inspect program states at each step
Program state = stacks of all threads + globals
Limited bi-directional debugging
Interleaving slices for better understanding
Working on:
Closer integration with the Visual Studio debugger
Exploring neighborhood interleavings
Conclusion
Don’t stress, use CHESS
CHESS binary and papers available at http://research.microsoft.com/CHESS
Points to get across
Capturing nondeterminism
Sync orders, data races, hardware interleavings
Adding elastic delay
Soundness & completeness
Scoping preemptions
Questions
Did you find new bugs?
How is this different from your previous papers?
How is this different from previous model-checking efforts?
How is this different from
Are these behaviors “expected”?
[Two history diagrams over three threads, drawn with different overlaps in time: Add 10 → return (Thread 1), Add 20 → return (Thread 3), TryTake → return 10, and TryTake → return “empty”.]
Linearizability
A component is linearizable if all operations
Appear to take effect at a single temporal point
And that point is between the call and the return
“As if the component was protected by a single global lock”
[The same history diagrams again: Add 10 → return, Add 20 → return, TryTake → return 10, TryTake → return “empty”, over three threads.]
This behavior is not linearizable
Thread 2 getting a 10 means that Thread 1’s Add got to the queue before Thread 3’s Add
So, when Thread 3 does a TryTake, 20 should still be in the queue
[History diagram: Add 10 → return (Thread 1), Add 20 → return (Thread 3), TryTake → return 20, TryTake → return “empty”.]
Linearizable?
How is Linearizability different from Serializability?
Serializability:
All operations happen atomically in some serial order
Linearizability:
All operations happen at a single instant
That instant is between the call and return
Serializable behavior that is not Linearizable
Linearizability assumes a global observer that can see that Thread 1 finished before Thread 2 started
This is what makes linearizability composable

Thread 1: Add 10 → return
Thread 2: TryTake → return “empty”
Serializability does not compose
The behaviors of the blue queue and the green queue are individually serializable
But together, the behavior is not serializable

Thread 1: Add 10 → return; TryTake → return “empty”
Thread 2: Add 10 → return; TryTake → return “empty”
[In the slide, each thread Adds to one queue and TryTakes from the other; the colors distinguish the two queues.]
To make this all the more confusing:
Database concurrency control ensures that transactions are linearizable
Even though the literature only talks about serializability
Quote from Jim Gray: “When a transaction finishes, the state of the database immediately reflects the updates of the transaction”
The commit point of a transaction is guaranteed to be between the transaction begin and end
When using a two-phase locking protocol, for instance
The “standard” definition of Linearizability
Is a little more general than my interpretation (“as if protected by a single global lock”)
Sometimes, a concurrent implementation can have more behaviors than a sequential implementation
Example: a set implemented as a queue
A sequential version will be FIFO even though order does not matter for a set
For performance, a concurrent version can break the FIFO ordering but still maintain the set abstraction
Hence, define a “sequential specification”
A Sequential Specification
(A fancy word for something you already know but don’t usually think about)
Each object has a state
e.g., the sequence of elements in the queue
Each operation has a precondition and a postcondition
Precondition: the queue is not empty
Postcondition: Remove will remove the first element in the queue
Another example:
Precondition: True
Postcondition: TryTake will
Return false if the queue is empty and leave the state unchanged
Otherwise, return true and remove the first element in the queue
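Such a specification is directly executable; a sketch of the queue spec above in Python, with the preconditions written as assertions:

```python
# The sequential queue specification above, as executable Python: a state
# plus a pre/postcondition contract per operation.

class SeqQueueSpec:
    def __init__(self):
        self.items = []            # the object's state: sequence of elements

    def remove(self):
        assert self.items, "precondition: queue is not empty"
        return self.items.pop(0)   # postcondition: removes the first element

    def try_take(self):
        # precondition: True (always callable)
        if not self.items:
            return (False, None)   # empty: report failure, state unchanged
        return (True, self.items.pop(0))  # otherwise remove the first element

    def add(self, x):
        self.items.append(x)

q = SeqQueueSpec()
print(q.try_take())    # (False, None)
q.add(10); q.add(20)
print(q.try_take())    # (True, 10): first element removed
print(q.items)         # [20]
```

Replaying histories against exactly this kind of object is what the linearizability and Line-Up checks earlier in the deck do.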