TMPA-2017: 5W+1H Static Analysis Report Quality Measure


5W+1H static analysis report quality measure

Maxim Menshchikov, Timur Lepikhin

March 3, 2017

Saint Petersburg State University, OKTET Labs

Authors

Maxim Menshchikov

Student, Saint Petersburg State University.

Software Engineer at OKTET Labs.

Timur Lepikhin

Candidate of Sciences, Associate Professor,

Saint Petersburg State University.


Static analysis quality evaluation

How is the quality usually evaluated?

1. Precision.
   PPV = TP / (TP + FP)

2. Recall.
   TPR = TP / (TP + FN)

3. F1 (f-measure).
   F1 = 2TP / (2TP + FP + FN)


Static analysis quality evaluation

How the quality is usually evaluated?

4. False-Positive Rate.
   FPR = FP / (FP + TN)

5. Accuracy.
   ACC = (TP + TN) / (P + N)

6. ...

What’s missing in these measures?


Missing pieces

• Informational quality of messages
  How good and informative is the message?

• Generalization of reports
  Reports can be either positive or negative when talking about errors:
  “Error in line x.”
  “No error in line x.”

• Error class identification¹
  Reports can relate to the same problem or point of interest in the code. Reports should be combined accordingly.

• Utility support
  Not all tested utilities may support some kind of report.

¹ Not always missing :)


The input

Consider the following code sample:

#include <stdio.h>

int main()
{
    int input;
    if (scanf("%d", &input) == 1)
    {
        if (input == 2)
        {
            int *a;
            int *n = a;
            a = n;
            *n = 5;
        }
        else
        {
            printf("OK\n");
        }
    }
    return 0;
}


The output

Clang 3.9

main.cpp:10:13: warning: Assigned value is garbage or undefined: int *n = a;
main.cpp:5:5: note: Taking true branch: if (scanf("%d", &input) == 1)
main.cpp:7:13: note: Assuming ’input’ is equal to 2: if (input == 2)
main.cpp:7:9: note: Taking true branch: if (input == 2)
main.cpp:9:13: note: ’a’ declared without an initial value: int *a;
main.cpp:10:13: note: Assigned value is garbage or undefined: int *n = a;
main.cpp:11:13: warning: Value stored to ’a’ is never read: a = n;
main.cpp:11:13: note: Value stored to ’a’ is never read: a = n;


The output

cppcheck 1.76

[main.cpp:12]: (style) Variable ’a’ is assigned a value that is never used.
[main.cpp:10]: (error) Uninitialized variable: a


The difference

1. Clang shows which conditions must be met to encounter the bug.

2. Clang shows the source code line text, while cppcheck only shows the file and line number.

Both reports would be “correct” in the sense of all the previous measures; they would be considered equal with respect to their contribution to the result.


5W+1H

The “5Ws” are actively used in journalism and natural language processing. Sometimes they are referred to as “5W+1H”, where “H” denotes “How?”.

• What?

• When?

• Where?

• Who?

• Why?

• How?


5W+1H

We suggest rephrasing the 6th question as “How to fix?”

• What? Consequences.

The error. What will happen if the error occurs.

• When?

Conditions when it happens.

• Where?

Source code line number, module name.

• Who?

Who wrote this line?

• Why?

More or less formal reason why the error was treated as such.

• How to fix?

The ways to fix the problem.


How it applies to previous code sample

Question   Clang                                Cppcheck
What?      Assigned value is garbage            Uninitialized variable: a
Who?       —                                    —
Where?     lines 5-10                           line 10
When?      scanf(...) == 1, input == 2          —
Why?       ’a’ declared without initial value   —
How?       —                                    —


5W+1H

• It is hard to prove its completeness. (Do you have any counter-example?)

• Some way to evaluate reports is still needed.

• You can always choose the most suitable question to associate report information with.

Generalization of reports

Factual error   Report
Presence    Correctness      Result kind     Usefulness
No          Indeterminate²   Indeterminate   Yes
No          Correct          Positive        No³
No          Correct          Negative        Yes
No          Incorrect        Positive        No
No          Incorrect        Negative        No
Yes         Indeterminate    Indeterminate   No
Yes         Correct          Positive        Yes
Yes         Correct          Negative        Yes
Yes         Incorrect        Positive        No
Yes         Incorrect        Negative        No

² Or rather missing.
³ Something strange.

Report classes

A report class is an infinite set of reports that are equal from the end user’s point of view. Let’s group reports by their answers to the following questions:

• Why?

• What?

• Where?


Maths: propagate report classes

Consider the surjective function combining reports from the set R into the set of unique classes R′:

f : R → R′,  r ∈ R

We’ll use R as an alias for R′ later on.

Maths: introduce weights

Consider the set of questions:

{What, When, Where, Who, Why, HowToFix}

Let W be the set of answer weights for questions 1-6, respectively:

W = {w_1, w_2, ..., w_6}

Then the following mapping can be applied⁴:

W = {0.2, 0.15, 0.1, 0.05, 0.2, 0.3}

⁴ Make your own mapping satisfying the needs of your test.

Maths: introduce weights, pt.2

Let I be the informational quality of the message and A = {a_1, a_2, ..., a_6} be the set of answer qualities, where a_i ∈ [0, 1], i = 1..6.

I = ∑_{i=1}^{6} w_i · a_i    (1)

Let Imax be the measure of maximal informational quality among m utilities:

Imax = ∑_{i=1}^{6} w_i · max_j a_ij,  j ∈ 1..m    (2)

Maths: introduce weights, pt.3

Having that, by taking Imax into account, we can easily find the sum over all reports:

S_{R′} = ∑_{i=1}^{n} Imax_i    (3)

Maths: introduce weights, pt.4

Let m ∈ N be the number of tested static analyzers. Utility support for the i-th report can be abstractly represented as:

u_ij ∈ U_i,  j = 1..m,  i = 1..n
u_ij ∈ {0, 1}    (4)

where u_ij is a boolean value indicating the j-th utility’s support of the i-th report’s underlying error type.

With that, we can find the sum of all reports for the j-th utility, taking utility support into account:

S_j = ∑_{i=1}^{n} I_ij · ∏_{k=1}^{m} u_ik    (5)

Maths: “IQ” measure

We can calculate the informational quality measure for the j-th utility:

S_j^norm = S_j / S_{R′}    (6)

We call this measure IQ (Informational Quality).

TPI only includes true positives. FPI includes false positives with their informational value taken into account.

What? Should I measure it manually?

No.

• You can make your own parsers, as we did.

• Many reports look similar. You can evaluate one once and apply the score to all of them.

• (It could have been easier if there were some kind of standardized output...)

Real world testing

We tested the measure on the Toyota ITC benchmarks⁵.

• Clang 3.9, cppcheck 1.76, Frama-C Silicon, PVS-Studio (Linux) and ReSharper were tested.

• The original benchmark was forked, errors were patched, and limited Win32 support was added.

• We created a lot of 5-minute-work parsers capable of reading the output we got. They cannot be applied to all outputs.

• pthread tests were excluded from the comparison, as not all utilities support them.

• We checked generic report informativeness.

• All measures were calculated and analyzed.

• The hypothesis: the measure is different from the Precision, Recall and F1 scores.

⁵ https://github.com/mmenshchikov/itc-benchmarks

Test methodology

• Prepared the Toyota ITC benchmarks⁶.

• Coded parsers for all tested utilities⁷.

• Prepared scripts to do the comparison⁸ and verify results, except the parts that cannot be automated.

• Scripts only check lines having special comments from Toyota.

• Reports were semi-automatically checked for correctness.

• Report quality was evaluated manually, yet applying the same score to similar reports (which takes really little time).

• The hypothesis was evaluated using a t-test.

⁶ https://github.com/mmenshchikov/itc-benchmarks
⁷ https://github.com/mmenshchikov/sa_parsers
⁸ https://github.com/mmenshchikov/sa_comparison_003

Results: Informativeness

Question     Clang    cppcheck  Frama-C  PVS     RS⁹
What?        100%     100%      100%     100%    100%
When?        97.41%   0%        100%     0%      0%
Where?       100%     100%      100%     100%    100%
Who?         0%       0%        0%       0%      0%
Why?         35.78%   0%        99.77%   48.46%  0%
How to fix?  0%       0%        0%       17.15%  38.27%

⁹ ReSharper C++

Results: IQ

Utility    IQ     TPI     TP   FPI    FP  PPV¹⁰  TPR¹¹  F1
Clang      0.52   57.75   111  1.55   3   0.974  0.183  0.308
Cppcheck   0.3    30      100  0.6    2   0.98   0.165  0.282
Frama-C    0.649  196.1   302  57.2   88  0.774  0.498  0.606
PVS        0.459  53.67   117  4.32   12  0.907  0.193  0.318
RS¹²       –      –       –    –      –   –      –      –

¹⁰ Precision
¹¹ Recall
¹² ReSharper was excluded as it found “other” defects, although we considered it generic-purpose from the beginning.

Results: dependency

In this test we found a dependency between Precision (PPV) and IQ.

• Utilities provide similar reports (measures for reports are similar): test more utilities.

• Emitted messages are only error-related, with no messages on error absence: include tools that inform about bug absence as well¹³.

This result is not generally representative.

We evaluated the informational values ourselves, and that decreases the reliability of the results.

¹³ Many developers ignored our requests for academic versions.

What’s next

You can use this information to improve your utilities:

• Add answers to some of the questions (“Who?”, “When?”).

• Explain decisions more formally (“Why?”).

• Suggest fixes, if possible (“How to fix?”).

How to improve the measure:

• Prepare better-explained weights.

How to improve the test:

• Better rules, less automation.

• A richer selection of tools.

Questions?


Verbosity

• Good verbosity
  More information on the analyzer’s decision.
  You can still filter out unneeded information.

• Bad verbosity
  Many messages about the same error.
  A lot of “rubbish” messages scattering the user’s attention.

Who?

This question asks who wrote a bad line or made the most significant change to it.

• svn blame?
  Too basic: e.g., if a constant in a function invocation is wrong, you will not know for sure who is to blame.

• Ethical aspects of blaming are out of scope here.
  You can use static analysis results to automatically create tasks in a bugtracker and assign them to the right person.

5Ws

The term comes from journalism, natural language processing, problem-solving, etc.

Similar schemes were mentioned by various philosophers and rhetoricians.

It was taught in high-school journalism classes by 1917.

Recommended