
1

Revisiting Difficult Constraints

if (hash(x) == hash(y)) {

...

}

How do we cover this code?

Suppose we’re running (DART, SAGE, SMART, CUTE, SPLAT, etc.) – we get here, but hash(x) != hash(y). Can we solve for hash(x) == hash(y)?

Concrete values won’t help us much – we still have to solve for hash(x) == C1 or for hash(y) == C2...

Any ideas?

2

Today

A brief “digression” on causality and philosophy (of science)

Fault localization & error explanation
• Renieris & Reiss: Nearest Neighbors
• Jones & Harrold: Tarantula
• How to evaluate a fault localization
  • PDGs (+ BFS or ranking)
• Solving for a nearest run (not really testing)

3

Causality

When a test case fails we start debugging

We assume that the fault (what we’re really after) causes the failure
• Remember RIP (Reachability, Infection, Propagation)?

What do we mean when we say that “A causes B”?

4

Causality

We don’t know

Though it is central to everyday life – and to the aims of science
• A real understanding of causality eludes us to this day
• Still no non-controversial way to answer the question “does A cause B?”

5

Causality

Philosophy of causality is a fairly active area, going back to Aristotle and (in more modern approaches) Hume
• General agreement that a cause is something that “makes a difference” – if the cause had not been, then the effect wouldn’t have been
• One theory that is rather popular with computer scientists is David Lewis’ counterfactual approach
• Probably because it (like probabilistic and statistical approaches) is amenable to mathematical treatment and automation

6

Causality (According to Lewis)

For Lewis (roughly – I’m conflating his counterfactual dependency and causal dependency):
• A causes B (in world w) iff
• In all possible worlds that are maximally similar to w, and in which A does not take place, B also does not take place

7

Causality (According to Lewis)

Causality does not depend on
• B being impossible without A
• Seems reasonable: we don’t, when asking “Was Larry slipping on the banana peel causally dependent on Curly dropping it?”, consider worlds in which new circumstances (Moe dropping a banana peel) are introduced

8

Causality (According to Lewis)

Many objections to Lewis in the literature
• e.g. the requirement that a cause precede its effect in time seems not to be enforced by his approach

One objection is not a problem for our purposes
• Distance metrics (how similar is world w to world w’?) are problematic for “worlds”
• Counterfactuals are tricky
• Not a problem for program executions
  • There may be details to handle, but no one has in-principle objections to asking how similar two program executions are
  • Or philosophical problems with multiple executions (no run is “privileged by actuality”)

9

Causality (According to Lewis)

[Diagram: “Did A cause B in this program execution?” The failing run containing A and B is compared with other runs. If the nearest run without A (at distance d) is also a run without B, and every run without A but with B is farther away (d < d’), the answer is “Yes!”; if d > d’, the answer is “No.”]

10

Formally

A predicate e is causally dependent on a predicate c in an execution a iff:

1. c(a) ∧ e(a)

2. ∃b . (¬c(b) ∧ ¬e(b) ∧ (∀b’ . (¬c(b’) ∧ e(b’)) ⇒ (d(a, b) < d(a, b’))))
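A minimal sketch of this definition in C, assuming we only have a finite pool of alternative executions to search over (so it approximates the quantifiers over all executions); the Run struct and its fields are illustrative assumptions, not part of any real tool:

#include <stdbool.h>
#include <limits.h>

typedef struct {
    bool c;      /* does predicate c hold in this run? */
    bool e;      /* does predicate e hold in this run? */
    int  dist;   /* distance d(a, run) from the execution a under study */
} Run;

/* e is causally dependent on c in a iff c and e hold in a, and the nearest
 * run with neither c nor e is strictly closer to a than every run with e
 * but without c. */
bool causally_dependent(bool c_a, bool e_a, const Run *runs, int n) {
    if (!(c_a && e_a))
        return false;
    int nearest_not_c_not_e = INT_MAX;   /* min d(a,b)  over runs b  with ¬c(b) ∧ ¬e(b)  */
    int nearest_not_c_e     = INT_MAX;   /* min d(a,b') over runs b' with ¬c(b') ∧ e(b') */
    for (int i = 0; i < n; i++) {
        if (!runs[i].c && !runs[i].e && runs[i].dist < nearest_not_c_not_e)
            nearest_not_c_not_e = runs[i].dist;
        if (!runs[i].c && runs[i].e && runs[i].dist < nearest_not_c_e)
            nearest_not_c_e = runs[i].dist;
    }
    return nearest_not_c_not_e < nearest_not_c_e;
}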

11

What does this have to do with automated debugging??

A fault is an incorrect part of a program

In a failing test case, some fault is reached and executes
• Causing the state of the program to be corrupted (error)
• This incorrect state is propagated through the program (propagation is a series of “A causes B”s)
• Finally, bad state is observable as a failure – caused by the fault

12

Fault Localization

Fault localization, then, is:
• An effort to automatically find (one of the) causes of an observable failure
• It is inherently difficult because there are many causes of the failure that are not the fault
  • We don’t mind seeing the chain of cause and effect reaching back to the fault
  • But the fact that we reached the fault at all is also a cause!

13

Enough!

Ok, let’s get back to testing and some methods for localizing faults from test cases
• But – keep in mind that when we localize a fault, we’re really trying to automate finding causal relationships
• The fault is a cause of the failure

14

Lewis and Fault Localization

Causality:
• Generally agreed that explanation is about causality. [Ball, Naik, Rajamani], [Zeller], [Groce, Visser], [Sosa, Tooley], [Lewis], etc.

Similarity:
• Also often assumed that successful executions that are similar to a failing run can help explain an error. [Zeller], [Renieris, Reiss], [Groce, Visser], etc.
• This work was not based on Lewis’ approach – it seems that this point about similarity is just an intuitive understanding most people (or at least computer scientists) share

15

Distance and Similarity

We already saw this idea at play in one version of Zeller’s delta-debugging
• Trying to find the one change needed to take a successful run and make it fail
• Most similar thread schedule that doesn’t cause a failure, etc.

Renieris and Reiss based a general fault localization technique on this idea – measuring distances between executions
• To localize a fault, compare the failing trace with its nearest neighbor according to some distance metric

16

Renieris and Reiss’ Localization

Basic idea (over-simplified) of the “nearest neighbor” approach:
• We have lots of test cases
  • Some fail
  • A much larger number pass
• Pick a failure
• Find the most similar successful test case
• Report the differences as our fault localization

17

Renieris and Reiss’ Localization

Collect spectra of executions, rather than the full executions
• For example, just count the number of times each source statement executed
• Previous work on using spectra for localization basically amounted to set difference/union – for example, find features unique to (or lacking in) the failing run(s)
• Problem: many failing runs have no such features – many successful test cases have R (and maybe I) but not P!
  • Otherwise, localization would be very easy
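As a concrete illustration, here is a minimal sketch of nearest-neighbor localization over binary statement spectra; binary coverage vectors and Hamming distance are assumptions made for illustration (the actual technique offers several spectra and distance metrics):

#include <stdio.h>
#include <stdbool.h>

#define N_STMTS 13   /* number of coverage entities (e.g., the mid() lines) */

/* Hamming distance between two binary coverage vectors. */
static int hamming(const bool *a, const bool *b) {
    int d = 0;
    for (int i = 0; i < N_STMTS; i++)
        d += (a[i] != b[i]);
    return d;
}

/* Report statements covered by the failing run but not by its nearest
 * passing neighbor. */
void nearest_neighbor_report(const bool failing[N_STMTS],
                             bool passing[][N_STMTS], int n_passing) {
    int best = -1, best_d = N_STMTS + 1;
    for (int i = 0; i < n_passing; i++) {
        int d = hamming(failing, passing[i]);
        if (d < best_d) { best_d = d; best = i; }
    }
    if (best < 0) return;   /* no passing runs: the technique does not apply */
    printf("Localization report (in failing run, not in nearest passing run):\n");
    for (int s = 0; s < N_STMTS; s++)
        if (failing[s] && !passing[best][s])
            printf("  statement %d\n", s + 1);
}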

18

Renieris and Reiss’ Localization

Some obvious and not so obvious points to think about
• Technique makes intuitive sense
• But what if there are no successful runs that are very similar?
  • Random testing might produce runs that all differ in various accidental ways
• Is this approach over-dependent on test suite quality?

19

Renieris and Reiss’ Localization

Some obvious and not so obvious points to think about
• What if we minimize the failing run using delta-debugging?
  • Now there are lots of differences with the original successful runs just due to length!
• We could produce a very similar run by using delta-debugging to get a 1-change run that succeeds (there will actually be many of these)
• Can still use Renieris and Reiss’ approach – because delta-debugging works over the inputs, not the program behavior, spectra for these runs will be more or less similar to the failing test case

20

Renieris and Reiss’ Localization

Many details (see the paper):
• Choice of spectra
• Choice of distance metric
• How to handle equal spectra for failing/passing tests?

Basic idea is nonetheless straightforward

21

The Tarantula Approach

Jones, Harrold (and Stasko): Tarantula

Not based on distance metrics or a Lewis-like assumption

A “statistical” approach to fault localization

Originally conceived of as a visualization approach: produces a picture of all source in the program, colored according to how “suspicious” it is
• Green: not likely to be faulty
• Yellow: hrm, a little suspicious
• Red: very suspicious, likely fault

22

The Tarantula Approach

23

The Tarantula Approach

How do we score a statement in this approach? (where do all those colors come from?)

Again, assume we have a large set of tests, some passing, some failing

“Coverage entity” e (e.g., statement)
• failed(e) = # tests covering e that fail
• passed(e) = # tests covering e that pass
• totalfailed, totalpassed = what you’d expect

24

The Tarantula Approach

How do we score a statement in this approach? (where do all those colors come from?)

suspiciousness(e) = (failed(e) / totalfailed) / (failed(e) / totalfailed + passed(e) / totalpassed)
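A minimal sketch of this score as a C function; the handling of zero denominators is an assumption made for illustration:

/* Tarantula suspiciousness for one coverage entity, using the counts above.
 * Returns a value in [0, 1]; higher is more suspicious. */
double suspiciousness(int failed_e, int passed_e,
                      int totalfailed, int totalpassed) {
    double fail_ratio = (totalfailed > 0) ? (double)failed_e / totalfailed : 0.0;
    double pass_ratio = (totalpassed > 0) ? (double)passed_e / totalpassed : 0.0;
    if (fail_ratio + pass_ratio == 0.0)
        return 0.0;   /* entity covered by no test: treat as not suspicious */
    return fail_ratio / (fail_ratio + pass_ratio);
}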

25

The Tarantula Approach

Not very suspicious: appears in almost every passing test and almost every failing test

Highly suspicious: appears much more frequently in failing than passing tests

suspiciousness(e) = (failed(e) / totalfailed) / (failed(e) / totalfailed + passed(e) / totalpassed)

26

The Tarantula Approach

suspiciousness(e) = (failed(e) / totalfailed) / (failed(e) / totalfailed + passed(e) / totalpassed)

Simple program to compute the middle of three inputs, with a fault.

mid()
   int x, y, z, m;
1  read (x, y, z);
2  m = z;
3  if (y < z)
4    if (x < y)
5      m = y;
6    else if (x < z)
7      m = y;
8  else
9    if (x > y)
10     m = y;
11   else if (x > z)
12     m = x;
13 print (m);

27

The Tarantula Approach

suspiciousness(e) = (failed(e) / totalfailed) / (failed(e) / totalfailed + passed(e) / totalpassed)

mid()
   int x, y, z, m;
1  read (x, y, z);
2  m = z;
3  if (y < z)
4    if (x < y)
5      m = y;
6    else if (x < z)
7      m = y;
8  else
9    if (x > y)
10     m = y;
11   else if (x > z)
12     m = x;
13 print (m);

Run some tests...

(3,3,5) (1,2,3) (3,2,1) (5,5,5) (5,3,4) (2,1,3)

Look at whether they pass or fail
Look at coverage of entities

Compute suspiciousness using the formula

Suspiciousness per line: 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.63, 5: 0.0, 6: 0.71, 7: 0.83, 8: 0.0, 9: 0.0, 10: 0.0, 11: 0.0, 12: 0.0, 13: 0.5

Fault is indeed most suspicious!
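As a quick check with the suspiciousness function sketched earlier (this harness is illustrative only): line 7 is covered by the one failing test, (2,1,3), and by one of the five passing tests, which reproduces the 0.83 above:

#include <stdio.h>

/* Assumes the suspiciousness() sketch from above is in scope. */
int main(void) {
    /* Line 7: failed(e) = 1, passed(e) = 1, totalfailed = 1, totalpassed = 5 */
    printf("%.2f\n", suspiciousness(1, 1, 1, 5));   /* prints 0.83 */
    return 0;
}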

28

The Tarantula Approach

Obvious benefits:
• No problem if the fault is reached in some successful test cases
• Doesn’t depend on having any successful tests that are similar to the failing test(s)
• Provides a ranking of every statement, instead of just a set of nodes – directions on where to look next
  • Numerical, even – how much more suspicious is X than Y?
• The pretty visualization may be quite helpful in seeing relationships between suspicious statements
• Is it less sensitive to accidental features of random tests, and to test suite quality in general?
• What about minimized failing tests here?

29

Tarantula vs. Nearest Neighbor

Which approach is better?
• Once upon a time:
  • Fault localization papers gave a few anecdotes of their technique working well, showed it working better than another approach on some example, and called it a day
• We’d like something more quantitative (how much better is this technique than that one?) and much less subjective!

30

Evaluating Fault Localization Approaches

Fault localization tools produce reports

We can reduce a report to a set (or ranking) of program locations

Let’s say we have three localization tools which produce
• A big report that includes the fault
• A much smaller report, but the actual fault is not part of it
• Another small report, also not containing the fault

Which of these is the “best” fault localization?

31

Evaluating a Fault Localization Report

Idea (credit to Renieris and Reiss):
• Imagine an “ideal” debugger, the perfect programmer
• Starts reading the report
• Expands outwards from nodes (program locations) in the report to associated nodes, adding those at each step
  • If a variable use is in the report, looks at the places it might be assigned
  • If code is in the report, looks at the condition of any ifs guarding that code
  • In general, follows program (causal) dependencies
• As soon as a fault is reached, recognizes it!

32

Evaluating a Fault Localization Report

Score the reports according to how much code the ideal debugger would read, starting from the report
• Empty report: score = 0
• Every line in the program: score = 0
• Big report, containing the bug? mediocre score
• Small report, far from the bug? bad score
• Small report, “near” the bug? good score
• Report is the fault: great score (0.9)

[Slide graphic: the example reports are annotated with scores of 0.4, 0.8, 0.2, and 0.9.]

33

Evaluating a Fault Localization Report

Breadth-first search of the Program Dependency Graph (PDG), starting from the fault localization report:
• Terminate the search when a real fault is found
• Score is the proportion of the PDG that is not explored during the breadth-first search
• Score near 1.00 = report includes only faults
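A minimal sketch of this scoring in C, assuming the PDG is handed in as fixed-size adjacency lists; the array bounds and data layout are illustrative assumptions:

#include <stdbool.h>
#include <string.h>

#define MAX_NODES 1024

/* Score a localization report against a PDG: adj[v][0..degree[v]-1] are the
 * neighbors of node v. Returns the fraction of PDG nodes never visited by
 * the "ideal debugger" before a fault is found. */
double score_report(int n_nodes,
                    int adj[][MAX_NODES], const int degree[],
                    const int report[], int report_len,
                    const bool is_fault[]) {
    bool visited[MAX_NODES] = { false };
    int frontier[MAX_NODES], next[MAX_NODES];
    int f_len = 0;
    bool found = false;

    if (n_nodes <= 0)
        return 0.0;
    /* Layer 0: the reported nodes themselves. */
    for (int i = 0; i < report_len; i++) {
        int v = report[i];
        if (!visited[v]) { visited[v] = true; frontier[f_len++] = v; }
        if (is_fault[v]) found = true;
    }
    /* Expand one full BFS layer at a time until a layer contains a real fault. */
    while (!found && f_len > 0) {
        int n_len = 0;
        for (int i = 0; i < f_len; i++) {
            int v = frontier[i];
            for (int j = 0; j < degree[v]; j++) {
                int w = adj[v][j];
                if (!visited[w]) {
                    visited[w] = true;
                    next[n_len++] = w;
                    if (is_fault[w]) found = true;
                }
            }
        }
        memcpy(frontier, next, (size_t)n_len * sizeof(int));
        f_len = n_len;
    }
    int unvisited = 0;
    for (int v = 0; v < n_nodes; v++)
        if (!visited[v]) unvisited++;
    return (double)unvisited / n_nodes;   /* e.g. 11/12 ~= 0.92 when report = fault */
}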

34

Details of Evaluation Method: PDG

12 total nodes in PDG

35

Details of Evaluation Method: PDG

12 total nodes in PDG

Fault

Report

36

Details of Evaluation Method: PDG

12 total nodes in PDG

Fault

Report + 1 Layer BFS

37

Details of Evaluation Method: PDG

12 total nodes in PDG

Fault

Report + 1 Layer BFS
STOP: Real fault discovered

38

Details of Evaluation Method: PDG

12 total nodes in PDG

8 of 12 nodes not covered by BFS: score = 8/12 ~= 0.67

Fault

Report + 1 Layer BFS
STOP: Real fault discovered

39

Details of Evaluation Method: PDG

12 total nodes in PDG

Fault

Report

41

Details of Evaluation Method: PDG

12 total nodes in PDG

Fault

Report + 2 layers BFS

42

Details of Evaluation Method: PDG

12 total nodes in PDG

Fault

Report + 3 layers BFS

43

Details of Evaluation Method: PDG

12 total nodes in PDG

Fault

Report + 4 layers BFS
STOP: Real fault discovered

44

Details of Evaluation Method: PDG

12 total nodes in PDG

0 of 12 nodes not covered by BFS: score = 0/12 ~= 0.00

Fault

Report + 4 layers BFS

45

Details of Evaluation Method: PDG

Fault = Report

12 total nodes in PDG

11 of 12 nodes not covered by BFS: score = 11/12 ~= 0.92

46

Evaluating a Fault Localization Report

Caveats:
• Isn’t a misleading report (a small number of nodes, far from the bug) actually much worse than an empty report?
  • “I don’t know” vs.
  • “Oh, yeah man, you left your keys in the living room somewhere” (when in fact your keys are in a field in Nebraska)
• Nobody really searches a PDG like that!
• Not backed up by user studies to show high scores correlate to users finding the fault quickly from the report

47

Evaluating a Fault Localization Report

Still, the Renieris/Reiss scoring has been widely adopted by the testing community and some model checking folks
• Best thing we’ve got, for now

48

Evaluating Fault Localization Approaches

So, how do the techniques stack up?

Tarantula seems to be the best of the test suite based techniques
• Next best is the Cause Transitions approach of Cleve and Zeller (see their paper), but it sometimes uses programmer knowledge
• Two different Nearest-Neighbor approaches are next best
• Set-intersection and set-union are worst

For details, see the Tarantula paper

49

Evaluating Fault Localization Approaches

Tarantula got scores at the 0.99-or-higher level three times more often than the next best technique

Trend continued at every ranking – Tarantula was always the best approach

Also appeared to be efficient:
• Much faster than the Cause-Transitions approach of Cleve and Zeller
• Probably about the same as the Nearest Neighbor and set-union/intersection methods

50

Evaluating Fault Localization Approaches

Caveats:
• Evaluation is over the Siemens suite (again!)
  • But Tarantula has done well on larger programs
• Tarantula and Nearest Neighbor might both benefit from larger test suites produced by random testing
  • Siemens is not that many tests, done by hand

51

Another Way to Do It

Question:
• How good would the Nearest Neighbors method be if our test suite contained all possible executions (the universe of tests)?
• We suspect it would do much better, right?
• But of course, that’s ridiculous – we can’t check for distance to every possible successful test case!
  • Unless our program can be model checked
• Leads us into next week’s topic, in a roundabout way: testing via model checking

52

Explanation with Distance Metrics

Algorithm (very high level):
1. Find a counterexample trace (model checking term for “failing test case”)
2. Encode the search for a maximally similar successful execution under a distance metric d as an optimization problem
3. Report the differences (Δs) as an explanation (and a localization) of the error

53

Implementation #1

CBMC Bounded Model Checker for ANSI-C programs:
• Input: C program + loop bounds
• Checks for various properties:
  • assert statements
  • Array bounds and pointer safety
  • Arithmetic overflow
• Verifies within given loop bounds
• Provides a counterexample if the property does not hold
• Now provides error explanation and fault localization

54

Given a counterexample:

Counterexample:
4: a = 5;  5: b = 4;  6: c = -4;  7: a = 2;  8: a = 1;  9: a = 6;  10: a = 4;  11: c = 9;  12: c = 10;  13: a = 10;  14: assert (a < 4);

55

Counterexample:
4: a = 5;  5: b = 4;  6: c = -4;  7: a = 2;  8: a = 1;  9: a = 6;  10: a = 4;  11: c = 9;  12: c = 10;  13: a = 10;  14: assert (a < 4);

produce a successful execution that is as similar as possible (under a distance metric)

56

Counterexample:
4: a = 5;  5: b = 4;  6: c = -4;  7: a = 2;  8: a = 1;  9: a = 6;  10: a = 4;  11: c = 9;  12: c = 10;  13: a = 10;  14: assert (a < 4);

Most similar successful execution:
4: a = 5;  5: b = -3;  6: c = -4;  7: a = 2;  8: a = 1;  9: a = 6;  10: a = 4;  11: c = 9;  12: c = 3;  13: a = 3;  14: assert (a < 4);

produce a successful execution that is as similar as possible (under a distance metric)

57

Counterexample:
4: a = 5;  5: b = 4;  6: c = -4;  7: a = 2;  8: a = 1;  9: a = 6;  10: a = 4;  11: c = 9;  12: c = 10;  13: a = 10;  14: assert (a < 4);

Most similar successful execution:
4: a = 5;  5: b = -3;  6: c = -4;  7: a = 2;  8: a = 1;  9: a = 6;  10: a = 4;  11: c = 9;  12: c = 3;  13: a = 3;  14: assert (a < 4);

and examine the necessary differences:

58

Counterexample:
4: a = 5;  5: b = 4;  6: c = -4;  7: a = 2;  8: a = 1;  9: a = 6;  10: a = 4;  11: c = 9;  12: c = 10;  13: a = 10;  14: assert (a < 4);

Most similar successful execution:
4: a = 5;  5: b = -3;  6: c = -4;  7: a = 2;  8: a = 1;  9: a = 6;  10: a = 4;  11: c = 9;  12: c = 3;  13: a = 3;  14: assert (a < 4);

and examine the necessary differences:

Δs

59

Counterexample:
4: a = 5;  5: b = 4;  6: c = -4;  7: a = 2;  8: a = 1;  9: a = 6;  10: a = 4;  11: c = 9;  12: c = 10;  13: a = 10;  14: assert (a < 4);

Most similar successful execution:
4: a = 5;  5: b = -3;  6: c = -4;  7: a = 2;  8: a = 1;  9: a = 6;  10: a = 4;  11: c = 9;  12: c = 3;  13: a = 3;  14: assert (a < 4);

and examine the necessary differences: these are the causes

60

Counterexample:
4: a = 5;  5: b = 4;  6: c = -4;  7: a = 2;  8: a = 1;  9: a = 6;  10: a = 4;  11: c = 9;  12: c = 10;  13: a = 10;  14: assert (a < 4);

Most similar successful execution:
4: a = 5;  5: b = -3;  6: c = -4;  7: a = 2;  8: a = 1;  9: a = 6;  10: a = 4;  11: c = 9;  12: c = 3;  13: a = 3;  14: assert (a < 4);

and the localization – lines 5, 12, and 13 are likely bug locations.

61

Explanation with Distance Metrics

How it’s done:

Model checker

P+spec

First, the program (P) and specification (spec) are sent to the model checker.

62

Explanation with Distance Metrics

How it’s done:

Model checker

P+spec C

The model checker finds a counterexample, C.

63

Explanation with Distance Metrics

How it’s done:

Model checker

BMC/constraint generator

P+spec C

The explanation tool uses P, spec, and C to generate (via Bounded Model Checking) a formula with solutions that are executions of P that are not counterexamples.

64

Explanation with Distance Metrics

How it’s done:

Model checker

BMC/constraint generator

P+spec C

S

Constraints are added to this formula for an optimization problem: find a solution that is as similar to C as possible, by the distance metric d. The formula + optimization problem is S.

65

Explanation with Distance Metrics

How it’s done:

Model checker

BMC/constraint generator

P+spec C

Optimization tool

S -C

An optimization tool (PBS, the Pseudo-Boolean Solver) finds a solution to S: an execution of P that is not a counterexample, and is as similar as possible to C; call this execution -C.

66

Explanation with Distance Metrics

How it’s done:

Model checker

BMC/constraint generator

P+spec C

Optimization tool

S -C

C

-C

Δs

Report the differences (Δs) between C and -C to the user: explanation and fault localization.

67

Explanation with Distance Metrics

The metric d is based on Static Single Assignment (SSA) form (plus loop unrolling)
• A variation on SSA, to be precise

CBMC model checker (bounded model checker for C programs) translates an ANSI C program into a set of equations

An execution of the program is just a solution to this set of equations

68

“SSA” Transformation

int main () {

int x, y;

int z = y;

if (x > 0)

y--;

else

y++;

z++;

assert (y == z);

}

int main () {

int x0, y0;

int z0 = y0;

y1 = y0 - 1;

y2 = y0 + 1;

guard1 = x0 > 0;

y3 = guard1?y1:y2;

z1 = z0 + 1;

assert (y3 == z1);

}

69

Transformation to Equations

int main () {

int x0, y0;

int z0 = y0;

y1 = y0 - 1;

y2 = y0 + 1;

guard1 = x0 > 0;

y3 = guard1?y1:y2;

z1 = z0 + 1;

assert (y3 == z1);

}

(z0 == y0 ∧
 y1 == y0 - 1 ∧
 y2 == y0 + 1 ∧
 guard1 == (x0 > 0) ∧
 y3 == (guard1 ? y1 : y2) ∧
 z1 == z0 + 1 ∧
 y3 == z1)

70

Transformation to Equations

int main () {

int x0, y0;

int z0 = y0;

y1 = y0 - 1;

y2 = y0 + 1;

guard1 = x0 > 0;

y3 = guard1?y1:y2;

z1 = z0 + 1;

assert (y3 == z1);

}

(z0 == y0 ∧
 y1 == y0 - 1 ∧
 y2 == y0 + 1 ∧
 guard1 == (x0 > 0) ∧
 y3 == (guard1 ? y1 : y2) ∧
 z1 == z0 + 1 ∧
 y3 == z1)

Uninitialized variables in CBMC are unconstrained inputs.

71

Transformation to Equations

int main () {

int x0, y0;

int z0 = y0;

y1 = y0 - 1;

y2 = y0 + 1;

guard1 = x0 > 0;

y3 = guard1?y1:y2;

z1 = z0 + 1;

assert (y3 == z1);

}

(z0 == y0 ∧
 y1 == y0 - 1 ∧
 y2 == y0 + 1 ∧
 guard1 == (x0 > 0) ∧
 y3 == (guard1 ? y1 : y2) ∧
 z1 == z0 + 1 ∧
 y3 == z1)

CBMC (1) negates the assertion

72

Transformation to Equations

int main () {

int x0, y0;

int z0 = y0;

y1 = y0 - 1;

y2 = y0 + 1;

guard1 = x0 > 0;

y3 = guard1?y1:y2;

z1 = z0 + 1;

assert (y3 == z1);

}

(z0 == y0 ∧
 y1 == y0 - 1 ∧
 y2 == y0 + 1 ∧
 guard1 == (x0 > 0) ∧
 y3 == (guard1 ? y1 : y2) ∧
 z1 == z0 + 1 ∧
 y3 != z1)

(assertion is now negated)

73

Transformation to Equations

int main () {

int x0, y0;

int z0 = y0;

y1 = y0 - 1;

y2 = y0 + 1;

guard1 = x0 > 0;

y3 = guard1?y1:y2;

z1 = z0 + 1;

assert (y3 == z1);

}

(z0 == y0 ∧
 y1 == y0 - 1 ∧
 y2 == y0 + 1 ∧
 guard1 == (x0 > 0) ∧
 y3 == (guard1 ? y1 : y2) ∧
 z1 == z0 + 1 ∧
 y3 != z1)

then (2) translates to SAT and uses a fast solver to find a counterexample

74

Execution Representation

(z0 == y0 ∧
 y1 == y0 - 1 ∧
 y2 == y0 + 1 ∧
 guard1 == (x0 > 0) ∧
 y3 == (guard1 ? y1 : y2) ∧
 z1 == z0 + 1 ∧
 y3 != z1)

Remove the assertion to get an equation for any execution of the program

(take care of loops by unrolling)

75

Execution Representation

(z0 == y0 ∧
 y1 == y0 - 1 ∧
 y2 == y0 + 1 ∧
 guard1 == (x0 > 0) ∧
 y3 == (guard1 ? y1 : y2) ∧
 z1 == z0 + 1 ∧
 y3 != z1)

Execution represented by assignments to all variables in the equations

x0 == 1

y0 == 5

z0 == 5

y1 == 4

y2 == 6

guard1 == true

y3 == 4

z1 == 6

Counterexample

76

Execution Representation

x0 == 1

y0 == 5

z0 == 5

y1 == 4

y2 == 6

guard1 == true

y3 == 4

z1 == 6

Counterexample

Execution represented by assignments to all variables in the equations

x0 == 0

y0 == 5

z0 == 5

y1 == 4

y2 == 6

guard1 == false

y3 == 6

z1 == 6

Successful execution

77

The Distance Metric d

x0 == 1

y0 == 5

z0 == 5

y1 == 4

y2 == 6

guard1 == true

y3 == 4

z1 == 6

Counterexample

d = number of changes (Δs) between two executions

x0 == 0

y0 == 5

z0 == 5

y1 == 4

y2 == 6

guard1 == false

y3 == 6

z1 == 6

Successful execution
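A minimal sketch of this metric in C, assuming both executions are handed over as parallel arrays of SSA variable values (that representation is an illustrative assumption):

/* d(a, b) = number of SSA variables whose values differ between the two
 * executions. */
int distance(const int *exec_a, const int *exec_b, int n_vars) {
    int d = 0;
    for (int i = 0; i < n_vars; i++)
        if (exec_a[i] != exec_b[i])
            d++;        /* corresponds to one Δ variable set to 1 in the PBS encoding */
    return d;
}

For the two executions shown here it returns 3 (x0, guard1, and y3 differ), matching the d = 3 on the slides that follow.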

78

The Distance Metric d

x0 == 1

y0 == 5

z0 == 5

y1 == 4

y2 == 6

guard1 == true

y3 == 4

z1 == 6

Counterexample

d = number of changes (Δs) between two executions

x0 == 0

y0 == 5

z0 == 5

y1 == 4

y2 == 6

guard1 == false

y3 == 6

z1 == 6

Successful execution

79

The Distance Metric d

x0 == 1

y0 == 5

z0 == 5

y1 == 4

y2 == 6

guard1 == true

y3 == 4

z1 == 6

Counterexample

d = number of changes (Δs) between two executions

x0 == 0

y0 == 5

z0 == 5

y1 == 4

y2 == 6

guard1 == false

y3 == 6

z1 == 6

Successful execution

80

The Distance Metric d

x0 == 1

y0 == 5

z0 == 5

y1 == 4

y2 == 6

guard1 == true

y3 == 4

z1 == 6

Counterexample

d = number of changes (Δs) between two executions

x0 == 0

y0 == 5

z0 == 5

y1 == 4

y2 == 6

guard1 == false

y3 == 6

z1 == 6

Successful execution

d = 3

81

The Distance Metric d

x0 == 1

y0 == 5

z0 == 5

y1 == 4

y2 == 6

guard1 == true

y3 == 4

z1 == 6

Counterexample

3 is the minimum possible distance between the counterexample and a successful execution

x0 == 0

y0 == 5

z0 == 5

y1 == 4

y2 == 6

guard1 == false

y3 == 6

z1 == 6

Successful execution

d = 3

82

The Distance Metric d

x0 == 1

y0 == 5

z0 == 5

y1 == 4

y2 == 6

guard1 == true

y3 == 4

z1 == 6

Counterexample

To compute the metric, add a new SAT variable for each potential Δ:

Δx0 == (x0 != 1)
Δy0 == (y0 != 5)
Δz0 == (z0 != 5)
Δy1 == (y1 != 4)
Δy2 == (y2 != 6)
Δguard1 == !guard1
Δy3 == (y3 != 4)
Δz1 == (z1 != 6)

New SAT variables

83

The Distance Metric d

x0 == 1

y0 == 5

z0 == 5

y1 == 4

y2 == 6

guard1 == true

y3 == 4

z1 == 6

Counterexample

And minimize the sum of the Δ variables (treated as 0/1 values): a pseudo-Boolean problem

Δx0 == (x0 != 1)
Δy0 == (y0 != 5)
Δz0 == (z0 != 5)
Δy1 == (y1 != 4)
Δy2 == (y2 != 6)
Δguard1 == !guard1
Δy3 == (y3 != 4)
Δz1 == (z1 != 6)

New SAT variables
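Schematically, the resulting optimization problem looks roughly like the following (a sketch in notation of my own, not the tool’s exact encoding; PBS is the solver named on these slides):

\min \sum_{v} \Delta_v
\quad\text{subject to}\quad
\bigwedge_{v} \bigl(\Delta_v \leftrightarrow (v \neq \mathrm{val}_{CE}(v))\bigr)
\;\wedge\; \text{(the program's SSA equations)}
\;\wedge\; (y_3 = z_1)

Here val_CE(v) is v’s value in the counterexample, and the final conjunct is the un-negated assertion, so any solution is a successful execution.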

84

The Distance Metric d

An SSA-form oddity:
• The distance metric can compare values from code that doesn’t run in either execution being compared
• This can be the determining factor in which of two traces is most similar to a counterexample
• Counterintuitive but not necessarily incorrect: it simply extends comparison to all hypothetical control flow paths

85

Explanation with Distance Metrics

Algorithm (lower level):
1. Find a counterexample using Bounded Model Checking (SAT)
2. Create a new problem: SAT for a successful execution + constraints for minimizing distance to the counterexample (fewest changes)
3. Solve this optimization problem using a pseudo-Boolean solver (PBS) (= 0-1 ILP)
4. Report the differences (Δs) to the user as an explanation (and a localization) of the error

86

Explanation with Distance Metrics

Model checker

BMC/constraint generator

P+spec C

Optimization tool

S -C

C

-Cs

CBMC

explain

PBS

87

Explanation with Distance Metrics

The details are hidden behind a Graphical User Interface (GUI) that keeps SAT and distance metrics away from users

GUI automatically highlights likely bug locations, presents changed values

Next slides: GUI in action + a teaser for experimental results

88

89

90

Explaining Abstract Counterexamples

91

Explaining Abstract Counterexamples

First implementation presents differences as changes in concrete values, e.g.:
• “In the counterexample, x is 14. In the successful execution, x is 18.”

Which can miss the point:
• What really matters is whether x is less than y
• But y isn’t mentioned at all!

92

Explaining Abstract Counterexamples

If the counterexample and successful execution were abstract traces, we’d get variable relationships and generalization for “free”

Abstraction should also make the model checking more scalable
• This is why abstraction is traditionally used in model checking, in fact

93

Model Checking + Abstraction

In abstract model checking, the model checker explores an abstract state space

In predicate abstraction, states consist of predicates that are true in a state, rather than concrete values:
• Concrete: x = 12, y = 15, z = 0
• Abstract: x < y, z != 1
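A minimal sketch of abstracting one concrete state over this two-predicate set (the struct and field names are illustrative assumptions):

#include <stdbool.h>

typedef struct { int x, y, z; } Concrete;
typedef struct { bool x_lt_y, z_ne_1; } Abstract;   /* the tracked predicates */

/* Evaluate the predicates over the concrete values. */
Abstract abstract_state(Concrete s) {
    Abstract a = { s.x < s.y, s.z != 1 };
    return a;
}

/* The concrete state x = 12, y = 15, z = 0 maps to { x < y, z != 1 },
 * as do many other concrete states. */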

94

Model Checking + Abstraction

In abstract model checking, the model checker explores an abstract state space.

In predicate abstraction, states consist of predicates that are true in a state, rather than concrete values:
• Concrete: x = 12, y = 15, z = 0
• Abstract: x < y, z != 1

Potentially represents many concrete states

95

Model Checking + Abstraction

Conservative predicate abstraction preserves all erroneous behaviors in the original system

Abstract “executions” now potentially represent a set of concrete executions

Must check the abstract execution to see if it matches some real behavior of the program: abstraction adds behavior

96

Implementation #2

MAGIC Predicate Abstraction Based Model Checker for C programs:
• Input: C program
• Checks for various properties:
  • assert statements
  • Simulation of a specification machine
• Provides a counterexample if the property does not hold
• Counterexamples are abstract executions – that describe real behavior of the actual program
• Now provides error explanation and fault localization

97

Model Checking + Abstraction

Predicates & counterexample are produced by the usual Counterexample Guided Abstraction Refinement framework.

Explanation will work as in the first case presented, except:
• The explanation will be in terms of control flow differences and
• Changes in predicate values.

98

MAGIC Overview

[Flowchart (counterexample-guided abstraction refinement): P and Spec feed Abstraction, which builds an abstract Model; Verification either reports “Spec Holds” or yields an Abstract Counterexample; a check asks “Counterexample Real?” – if Yes, the Real counterexample is reported; if No (Spurious Counterexample), Abstraction Refinement produces New Predicates and the loop repeats.]

99

MAGIC Overview

[The same flowchart, repeated, with the New Predicates and Abstract Counterexample edges shown.]

100

Model Checking + Abstraction

Explain an abstract counterexample that represents (at least one) real execution of the program

Explain with another abstract execution that:
• Is not a counterexample
• Is as similar as possible to the abstract counterexample
• Also represents real behavior

101

Counterexample:
4: a = 5;  5: b = 4;  6: c = -4;  7: a = 2;  8: a = 1;  9: a = 6;  10: a = 4;  11: c = 9;  12: c = 10;  13: a = 10;  14: assert (a < 4);

Most similar successful execution:
4: a = 5;  5: b = -3;  6: c = -4;  7: a = 2;  8: a = 1;  9: a = 6;  10: a = 4;  11: c = 9;  12: c = 3;  13: a = 3;  14: assert (a < 4);

Abstract rather than concrete traces: represent more than one execution

Automatic generalization

102

Abstract counterexample:
4: a >= 4;  5: b > 2;  6: c < 7;  7: a >= 4;  8: a <= 4;  9: a >= 4;  10: a <= 4;  11: c >= 7;  12: c >= 7;  13: a >= 4;  14: assert (a < 4);

Most similar abstract successful execution:
4: a >= 4;  5: b <= 2;  6: c < 7;  7: a > 4;  8: a <= 4;  9: a > 4;  10: a <= 4;  11: c >= 9;  12: c < 7;  13: a < 3;  14: assert (a < 4);

Abstract rather than concrete traces: represent more than one execution

Automatic generalization

103

Abstract counterexample:
4: a >= 4;  5: b > 2;  6: c < 7;  7: a >= 4;  8: a <= 4;  9: a >= 4;  10: a <= 4;  11: c >= 7;  12: c >= 7;  13: a >= 4;  14: assert (a < 4);

Most similar abstract successful execution:
4: a >= 4;  5: b <= 2;  6: c < 7;  7: a > 4;  8: a <= 4;  9: a > 4;  10: a <= 4;  11: c >= 9;  12: c < 7;  13: a < 3;  14: assert (a < 4);

Automatic generalization:
c >= 7:  c = 7, c = 8, c = 9, c = 10, …
c < 7:   c = 6, c = 5, c = 4, c = 3, …

104

Abstract counterexample:
4: a >= 4;  5: b > 2;  6: c < 7;  7: a >= 4;  8: a <= 4;  9: a >= 4;  10: a <= 4;  11: c >= 7;  12: c >= a;  13: a >= 4;  14: assert (a < 4);

Most similar abstract successful execution:
4: a >= 4;  5: b <= 2;  6: c < 7;  7: a > 4;  8: a <= 4;  9: a > 4;  10: a <= 4;  11: c >= 9;  12: c < a;  13: a < 3;  14: assert (a < 4);

Relationships between variables:
c >= a:  e.g. c = 7 ∧ a = 7, c = 9 ∧ a = 6, …
c < a:   e.g. c = 7 ∧ a = 10, c = 3 ∧ a = 4, …

105

An Example

1  int main () {
2    int input1, input2, input3;
3    int least = input1;
4    int most = input1;
5    if (most < input2)
6      most = input2;
7    if (most < input3)
8      most = input3;
9    if (least > input2)
10     most = input2;
11   if (least > input3)
12     least = input3;
13   assert (least <= most);
14 }

106

An Example

(The same program as on the previous slide.)

107

An Example

(The same program as on the previous slide.)

108

An Example

Value changed (line 2): input3#0 from 2147483615 to 0
Value changed (line 12): least#2 from 2147483615 to 0
Value changed (line 13): least#3 from 2147483615 to 0

109

An Example

Not very obvious what this means…

Value changed (line 2): input3#0 from 2147483615 to 0
Value changed (line 12): least#2 from 2147483615 to 0
Value changed (line 13): least#3 from 2147483615 to 0

110

An Example

Control location deleted (step #5): 10: most = input2
Predicate changed (step #5): was: most < least  now: least <= most
Predicate changed (step #5): was: most < input3  now: input3 <= most
------------------------
Predicate changed (step #6): was: most < least  now: least <= most
Action changed (step #6): was: assertion_failure

111

An Example

Control location deleted (step #5): 10: most = input2
Predicate changed (step #5): was: most < least  now: least <= most
Predicate changed (step #5): was: most < input3  now: input3 <= most
------------------------
Predicate changed (step #6): was: most < least  now: least <= most
Action changed (step #6): was: assertion_failure

Here, on the other hand:

112

An Example

Control location deleted (step #5): 10: most = input2
Predicate changed (step #5): was: most < least  now: least <= most
Predicate changed (step #5): was: most < input3  now: input3 <= most
------------------------
Predicate changed (step #6): was: most < least  now: least <= most
Action changed (step #6): was: assertion_failure

Here, on the other hand:

Line with error indicated

Avoid error by not executing line 10

113

An Example

Control location deleted (step #5): 10: most = input2
Predicate changed (step #5): was: most < least  now: least <= most
Predicate changed (step #5): was: most < input3  now: input3 <= most
------------------------
Predicate changed (step #6): was: most < least  now: least <= most
Action changed (step #6): was: assertion_failure

The predicates show how the change in control flow affects the relationship of the variables.

114

Explaining Abstract Counterexamples

Implemented in the MAGIC predicate abstraction-based model checker

MAGIC represents executions as paths of states, not in SSA form

The new distance metric resembles traditional metrics from string or sequence comparison:
• Insert, delete, replace operations
• State = PC + predicate values

115

Explaining Abstract Counterexamples

Same underlying method as for concrete explanation

Revise the distance metric to account for the new representation of program executions

Model checker

BMC/constraint generator

P+spec C

Optimization tool

S -C

C, -C

Δs

MAGIC

MAGIC/explain

still PBS

116

CBMC vs. MAGIC Representations

input1#0 == 0

input2#0 == -1

input3#0 == 0

least#0 == 0

most#0 == 0

guard0 == true

guard1 == false

least#1 == 0

CBMC: SSA Assignments

MAGIC: States & actions

s0 --0--> s1 --1--> s2 --2--> s3

117

CBMC vs. MAGIC Representations

input1#0 == 0

input2#0 == -1

input3#0 == 0

least#0 == 0

most#0 == 0

guard0 == true

guard1 == false

least#1 == 0

CBMC: SSA Assignments

MAGIC: States & actions

s0 --0--> s1 --1--> s2 --2--> s3

Each state is a control location plus predicate values, e.g.:
Control location: Line 5
Predicates: input1 > input2, least == input1, ...

118

A New Distance Metric

[Alignment diagram: counterexample s0 --0--> s1 --1--> s2 --2--> s3 vs. successful run s’0 --0--> s’1 --1--> s’2 --2--> s’3 --3--> s’4]

Must determine which states to compare: there may be a different number of states in the two executions

Make use of the literature on string/sequence comparison & metrics

119

Alignment

[Alignment diagram: counterexample s0–s3 (control locations 1, 5, 7, 9) vs. successful run s’0–s’4 (control locations 1, 3, 7, 8, 11)]

1. Only compare states with matching control locations

120

Alignment

[The alignment diagram, repeated.]

121

Alignment

[The alignment diagram, repeated.]

122

Alignment

[Alignment diagram]

2. Must be unique

123

Alignment

[The alignment diagram, repeated.]

124

Alignment

[Alignment diagram]

3. Don’t cross over other alignments

125

Alignment

[The alignment diagram, repeated.]

126

A New Distance Metric

[Alignment diagram]

In sum: much like the traditional metrics used to compare strings, except the alphabet is over control locations, predicates, and actions
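A minimal sketch of such a metric as a Levenshtein-style dynamic program over abstract states; the cost model (1 per insert/delete, per-predicate cost for a replace) and the State layout are illustrative assumptions, not MAGIC’s exact metric:

#include <stdbool.h>

#define MAX_PREDS 8
#define MAX_LEN   64    /* assumes traces of at most MAX_LEN states */

typedef struct {
    int  loc;                  /* control location (program line) */
    bool preds[MAX_PREDS];     /* truth values of the tracked predicates */
} State;

static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Cost of aligning state a with state b. */
static int replace_cost(const State *a, const State *b, int n_preds) {
    if (a->loc != b->loc)
        return 1 + n_preds;    /* different control locations: full mismatch */
    int cost = 0;
    for (int i = 0; i < n_preds; i++)
        cost += (a->preds[i] != b->preds[i]);
    return cost;
}

/* Edit distance between two sequences of abstract states. */
int trace_distance(const State *x, int nx, const State *y, int ny, int n_preds) {
    static int dp[MAX_LEN + 1][MAX_LEN + 1];
    for (int i = 0; i <= nx; i++) dp[i][0] = i;
    for (int j = 0; j <= ny; j++) dp[0][j] = j;
    for (int i = 1; i <= nx; i++)
        for (int j = 1; j <= ny; j++)
            dp[i][j] = min3(dp[i-1][j] + 1,                                         /* delete  */
                            dp[i][j-1] + 1,                                         /* insert  */
                            dp[i-1][j-1] + replace_cost(&x[i-1], &y[j-1], n_preds)); /* replace */
    return dp[nx][ny];
}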

127

A New Distance Metric

[Alignment diagram]

Encoded using BMC and pseudo-Boolean optimization as in the first case, with variables for alignment and for control, predicate, and action differences

128

Explaining Abstract Counterexamples

One execution                        (Potentially) many executions
Changes in values                    Changes in predicates
Always a real execution              May be spurious
                                      - may need to iterate/refine
Execution as SSA values              Execution as path & states
 - counterintuitive metric            - intuitive metric
 - no alignment problem               - must consider alignments:
                                        which states to compare?
BMC to produce PBS problem           BMC to produce PBS problem
(CBMC)                               (MAGIC)

129

Results

130

Results: Overview

Produces good explanations for numerous interesting case studies:
• µC/OS-II RTOS Microkernel (3K lines)
• OpenSSL code (3K lines)
• Fragments of the Linux kernel
• TCAS Resolution Advisory component
• Some smaller, “toy” linear temporal logic property examples

µC/OS-II, SSL, and some TCAS bugs were precisely isolated: report = fault

131

Results: Quantitative Evaluation

Very good scores by the Renieris & Reiss method for evaluating fault localization:
• Measures how much source code the user can avoid reading thanks to the localization; 1 is a perfect score

For the SSL and µC/OS-II case studies, scores of 0.999

Other examples (almost) all in the range 0.720-0.993

132

Results: Comparison

Scores were generally much better than Nearest Neighbor – when it could be applied at all
• Much more consistent
• The testing-based methods of Renieris and Reiss occasionally worked better
  • But also gave useless (score 0) explanations much of the time

Scores were a great improvement over the counterexample traces alone

133

Results: Comparison

Scores and times for various localization methods

Best score for each program highlighted

* alternative scoring method for large programs

Program    Explain         JPF             n-c     n-s     CBMC
           score   time    score   time    score   score   score
TCAS 1     0.91    4       0.87    1521    0.00    0.58    0.41
TCAS 11    0.93    7       0.93    5673    0.13    0.13    0.51
TCAS 31    0.93    7       -       -       0.00    0.00    0.46
TCAS 40    0.88    6       0.87    30482   0.83    0.77    0.35
TCAS 41    0.88    5       0.30    34      0.58    0.92    0.38
uCOS-ii    0.99    62      -       -       -       -       0.97
uCOS-ii*   0.81    62      -       -       -       -       0.00

134

Results: MAGIC

No program required iteration to find a non-spurious explanation: a good abstraction was already discovered

Program                  score   time   CE length
mutex-n-01.c (lock)      0.79    0.04    6
mutex-n-01.c (unlock)    0.99    0.04    6
pci-n-01.c               0.78    0.07    9
pci-rec-n-01.c           0.72    0.09    8
SSL-1                    0.99    8.07   29
SSL-2                    0.99    3.45   52
uCOS-ii                  0.00    0.76   19

135

Results: Time

Time to explain comparable to model checking time
• No more than 10 seconds for abstract explanation (except when it didn’t find one at all…)
• No more than 3 minutes for concrete explanations

136

Results: Room for Improvement

Concrete explanation worked better than abstract in some cases
• When the SSA-based metric produced smaller optimization constraints

For the TCAS examples, user assistance was needed in some cases
• Assertion of the form (A implies B)
• First explanation “explains” by showing how A can fail to hold
• Easy to get a good explanation: force the model checker to assume A

137

Conclusions: Good News

Counterexample explanation and fault localization can provide good assistance in locating errors

The model checking approach, when it can be applied (usually not to large programs or to those with complex data structures), may be the most effective

But Tarantula is the real winner, unless model checking starts scaling better

138

Future Work?

The ultimate goal: testing tool or model checker fixes our programs for us – automatic program repair!

That’s not going to happen, I think

But we can try (and people are doing just that, right now)

139

Model Checking and Scaling

Next week we’ll look at a kind of “model checking” that doesn’t involve building SAT equations or producing an abstraction
• We’ll run the program and backtrack execution
• Really just an oddball form of testing
• Can’t do “stupid SAT-solver tricks” like using PBS to produce great fault localizations, but has some other benefits
