5W+1H static analysis report quality measure
Maxim Menshchikov, Timur Lepikhin
March 3, 2017
Saint Petersburg State University, OKTET Labs
Authors
Maxim Menshchikov
Student, Saint Petersburg State University.
Software Engineer at OKTET Labs.
Timur Lepikhin
Candidate of Sciences, Associate Professor,
Saint Petersburg State University.
1
Static analysis quality evaluation
How is the quality usually evaluated?
1. Precision.
   PPV = TP / (TP + FP)
2. Recall.
   TPR = TP / (TP + FN)
3. F1 (F-measure).
   F1 = 2TP / (2TP + FP + FN)
2
Static analysis quality evaluation
How is the quality usually evaluated?
4. False-positive rate.
   FPR = FP / (FP + TN)
5. Accuracy.
   ACC = (TP + TN) / (P + N)
6. ...
What’s missing in these measures?
3
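The measures above can be computed directly from confusion-matrix counts. A minimal sketch (the function name and the sample counts are our own, for illustration only):

```python
# Sketch: the standard quality measures listed above, computed from
# confusion-matrix counts. TP/FP/TN/FN follow the slides' notation.

def measures(tp, fp, tn, fn):
    ppv = tp / (tp + fp)                   # Precision
    tpr = tp / (tp + fn)                   # Recall
    f1 = 2 * tp / (2 * tp + fp + fn)       # F-measure
    fpr = fp / (fp + tn)                   # False-positive rate
    acc = (tp + tn) / (tp + fp + tn + fn)  # Accuracy; P + N = all cases
    return ppv, tpr, f1, fpr, acc

# Made-up counts, purely illustrative.
ppv, tpr, f1, fpr, acc = measures(tp=8, fp=2, tn=85, fn=5)
```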
Missing pieces
• Informational quality of messages
  How good and informative is the message?
• Generalization of reports
  Reports about errors can be either positive or negative:
  “Error in line x.”
  “No error in line x.”
• Error class identification¹
  Reports can relate to the same problem or point of interest in the code; reports should be combined according to that.
• Utility support
  Not all tested utilities may support some kind of report.

¹ Not always missing :)
4
The input
Consider the following code sample:
#include <stdio.h>
int main()
{
int input;
if (scanf("%d", &input) == 1)
{
if (input == 2)
{
int *a;
int *n = a;
a = n;
*n = 5;
}
else
{
printf("OK\n");
}
}
return 0;
}
5
The output
Clang 3.9
main.cpp:10:13: warning: Assigned value is garbage or undefined:
int *n = a;
main.cpp:5:5: note: Taking true branch: if (scanf("%d", &input) == 1)
main.cpp:7:13: note: Assuming ’input’ is equal to 2: if (input == 2)
main.cpp:7:9: note: Taking true branch: if (input == 2)
main.cpp:9:13: note: ’a’ declared without an initial value: int *a;
main.cpp:10:13: note: Assigned value is garbage or undefined:
int *n = a;
main.cpp:11:13: warning: Value stored to ’a’ is never read: a = n;
main.cpp:11:13: note: Value stored to ’a’ is never read: a = n;
6
The output
cppcheck 1.76
[main.cpp:12]: (style) Variable ’a’ is assigned a value that is never
used.
[main.cpp:10]: (error) Uninitialized variable: a
7
The difference
1. Clang shows which conditions should be met to encounter the
bug.
2. Clang shows the source code line text, while cppcheck only shows the file name and line number.
Both reports would be “correct” in the sense of all the previous measures, and they would be considered equal with respect to their contribution to the result.
8
5W+1H
The “5Ws” are actively used in journalism and natural language processing.
Sometimes they are referred to as “5W+1H”, where “H” denotes “How?”.
• What?
• When?
• Where?
• Who?
• Why?
• How?
9
5W+1H
We suggest rephrasing the sixth question as “How to fix?”
• What? Consequences.
  The error, and what will happen if it occurs.
• When?
Conditions when it happens.
• Where?
Source code line number, module name.
• Who?
Who wrote this line?
• Why?
More or less formal reason why the error was treated as such.
• How to fix?
The ways to fix the problem.
10
How it applies to the previous code sample

Question | Clang                              | Cppcheck
What?    | Assigned value is garbage          | Uninitialized variable: a
Who?     | —                                  | —
Where?   | lines 5–10                         | line 10
When?    | scanf(...) == 1, input == 2        | —
Why?     | ’a’ declared without initial value | —
How?     | —                                  | —
11
5W+1H
• It is hard to prove its completeness. (Do you have any
counter-example?)
• Some way to evaluate reports is still needed.
• You can always choose the most suitable question to associate
report information with.
13
Generalization of reports
Factual error | Report correctness | Result kind   | Usefulness
No            | Indeterminate²     | Indeterminate | Yes
No            | Correct            | Positive      | No³
No            | Correct            | Negative      | Yes
No            | Incorrect          | Positive      | No
No            | Incorrect          | Negative      | No
Yes           | Indeterminate      | Indeterminate | No
Yes           | Correct            | Positive      | Yes
Yes           | Correct            | Negative      | Yes
Yes           | Incorrect          | Positive      | No
Yes           | Incorrect          | Negative      | No

² Or rather missing
³ Something strange

14
Report classes
A report class is an infinite set of reports that are equal from the end user’s point of view. Let’s group reports by the answers to the following questions:
• Why?
• What?
• Where?
15
Maths: propagate report classes
Consider the surjective function combining reports from the set R into the set of unique classes R′:
f(r) : R → R′,  r ∈ R
We’ll use R as an alias for R′ later on.
16
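The class-combining function can be sketched by grouping reports on their answers to What?/Where?/Why?, as the previous slide suggests. The dictionary keys and sample reports below are our own illustration, not the authors’ implementation:

```python
# Sketch: collapse raw reports R into unique report classes R' by
# grouping on the (What?, Where?, Why?) answers. Hypothetical data.

def to_classes(reports):
    classes = {}
    for r in reports:
        key = (r["what"], r["where"], r["why"])  # class identity
        classes.setdefault(key, []).append(r)
    return classes

reports = [
    {"what": "uninit", "where": "main.cpp:10", "why": "no initial value", "tool": "clang"},
    {"what": "uninit", "where": "main.cpp:10", "why": "no initial value", "tool": "cppcheck"},
    {"what": "dead-store", "where": "main.cpp:11", "why": "value never read", "tool": "clang"},
]
classes = to_classes(reports)
# Two distinct classes remain; the first combines both tools' reports.
```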
Maths: introduce weights
Consider the set of questions:
{What, When, Where,Who,Why,HowToFix}
Let W be the set of answer weights for questions 1–6, respectively:
W = {w1, w2, ..., w6}
Then the following mapping can be applied⁴:
W = {0.2, 0.15, 0.1, 0.05, 0.2, 0.3}
⁴ Make your own mapping satisfying the needs of your test.
17
Maths: introduce weights, pt.2
Let I be the informational quality of the message and A = {a1, a2, ..., a6} be the set of answer qualities, where ai ∈ [0, 1], i = 1..6.

I = ∑_{i=1}^{6} wi · ai    (1)

Let Imax be the measure of maximal informational quality across m utilities.

Imax = ∑_{i=1}^{6} wi · max_j aij,  j ∈ 1..m    (2)
18
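Equations (1) and (2) can be sketched as weighted sums; the answer-quality numbers below are made up for illustration, while the weights follow the example mapping from the slides:

```python
# Sketch of eq. (1) and (2): weighted informational quality I for one
# report, and Imax taking the best answer per question across utilities.

W = [0.2, 0.15, 0.1, 0.05, 0.2, 0.3]  # What, When, Where, Who, Why, How to fix

def info_quality(a, w=W):
    """I = sum_i w_i * a_i, with a_i in [0, 1]."""
    return sum(wi * ai for wi, ai in zip(w, a))

def info_quality_max(answers_per_utility, w=W):
    """Imax = sum_i w_i * max_j a_ij over m utilities."""
    return sum(wi * max(col) for wi, col in zip(w, zip(*answers_per_utility)))

# Hypothetical answer qualities for one report class.
clang    = [1.0, 0.97, 1.0, 0.0, 0.36, 0.0]
cppcheck = [1.0, 0.0,  1.0, 0.0, 0.0,  0.0]
i_clang = info_quality(clang)
i_max = info_quality_max([clang, cppcheck])
```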
Maths: introduce weights, pt.3
Having that, by taking Imax into account, we can easily find the sum over all reports:

S_R′ = ∑_{i=1}^{n} Imax_i    (3)
19
Maths: introduce weights, pt.4
Let m ∈ N be the number of tested static analyzers. Utility support for the i-th report can be abstractly represented as:

u_ij ∈ U_i,  j = 1..m,  i = 1..n
u_ij ∈ {0, 1}    (4)

where u_ij is a boolean value indicating whether the j-th utility supports the i-th report’s underlying error type.
With that, we can find the sum of all reports for the j-th utility, taking utility support into account:

S_j = ∑_{i=1}^{n} I_ij · ∏_{j=1}^{m} u_ij    (5)
20
Maths: “IQ” measure
We can calculate the informational quality measure for the j-th utility:

Snorm_j = S_j / S_R′    (6)

We call this measure IQ (Informational Quality).
TPI only includes true positives; FPI includes false positives, with the informational value taken into account.
21
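Equations (3)–(6) can be sketched together; all numbers below are made up for illustration:

```python
# Sketch of eq. (3)-(6): per-utility sum S_j gated by utility support
# u_ij, the total S_R' from Imax values, and the normalized IQ.

def s_total(i_max_values):
    """S_R' = sum over report classes of Imax_i (eq. 3)."""
    return sum(i_max_values)

def s_utility(i_values, support):
    """S_j = sum_i I_ij, counting class i only if every utility
    supports its error type (the product of u_ij in eq. 5)."""
    return sum(i_ij for i_ij, u_i in zip(i_values, support) if all(u_i))

def iq(i_values, support, i_max_values):
    """IQ = Snorm_j = S_j / S_R' (eq. 6)."""
    return s_utility(i_values, support) / s_total(i_max_values)

i_max_values = [0.52, 0.8]   # hypothetical Imax per report class
i_one_tool = [0.52, 0.4]     # hypothetical I_ij for one utility
support = [(1, 1), (1, 1)]   # u_ij: both utilities support both classes
score = iq(i_one_tool, support, i_max_values)
```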
What? Should I measure it manually?
No.
• You can make your own parsers, as we did.
• Many reports look similar. You can evaluate one once and apply the score to all of them.
• (It could have been easier if there were some kind of standardized output...)
22
Real world testing
We tested the measure on Toyota ITC benchmarks⁵.
• Clang 3.9, cppcheck 1.76, Frama-C Silicon, PVS-Studio
(Linux) and ReSharper were tested.
• The original benchmark was forked, errors were patched, and limited Win32 support was added.
• We created many quick (5-minute) parsers capable of reading the output we got; they cannot be applied to all outputs.
• pthread tests were excluded from the comparison, as not all utilities support pthreads.
• We checked generic report informativeness.
• All measures were calculated and analyzed.
• The hypothesis: the measure is different from the Precision, Recall, and F1 scores.

⁵ https://github.com/mmenshchikov/itc-benchmarks
23
Test methodology
• Prepared the Toyota ITC benchmarks⁶.
• Coded parsers for all tested utilities⁷.
• Prepared scripts to do the comparison⁸ and verify the results, except for the parts that cannot be automated.
• The scripts only check lines that have special comments from Toyota.
• Reports were semi-automatically checked for correctness.
• Report quality was evaluated manually, while applying the same score to similar reports (this takes very little time).
• The hypothesis was evaluated using a t-test.
⁶ https://github.com/mmenshchikov/itc-benchmarks
⁷ https://github.com/mmenshchikov/sa_parsers
⁸ https://github.com/mmenshchikov/sa_comparison_003
24
Results: Informativeness
Question    | Clang  | cppcheck | Frama-C | PVS    | RS⁹
What?       | 100%   | 100%     | 100%    | 100%   | 100%
When?       | 97.41% | 0%       | 100%    | 0%     | 0%
Where?      | 100%   | 100%     | 100%    | 100%   | 100%
Who?        | 0%     | 0%       | 0%      | 0%     | 0%
Why?        | 35.78% | 0%       | 99.77%  | 48.46% | 0%
How to fix? | 0%     | 0%       | 0%      | 17.15% | 38.27%

⁹ ReSharper C++
25
Results: IQ

Utility  | IQ    | TPI   | TP  | FPI  | FP | PPV¹⁰ | TPR¹¹ | F1
Clang    | 0.52  | 57.75 | 111 | 1.55 | 3  | 0.974 | 0.183 | 0.308
Cppcheck | 0.3   | 30    | 100 | 0.6  | 2  | 0.98  | 0.165 | 0.282
Frama-C  | 0.649 | 196.1 | 302 | 57.2 | 88 | 0.774 | 0.498 | 0.606
PVS      | 0.459 | 53.67 | 117 | 4.32 | 12 | 0.907 | 0.193 | 0.318
RS¹²     | –     | –     | –   | –    | –  | –     | –     | –

¹⁰ Precision
¹¹ Recall
¹² ReSharper was excluded as it found “other” defects, although we considered it general-purpose from the beginning.
26
Results: dependency
In this test we found a dependency between Precision (PPV) and IQ.
• Utilities provide similar reports (the measures for the reports are similar): test more utilities.
• The emitted messages are only error-related; there are no messages about error absence: include tools that inform about bug absence as well¹³.
This is not generally representative.
We evaluated the informational values ourselves, which decreases the reliability of the results.
¹³ Many developers ignored our requests for academic versions.
27
What’s next
You can use this information to improve your utilities:
• Add answers to some of the questions (“Who?”, “When?”).
• Explain decisions more formally (“Why?”).
• Suggest fixes, if possible (“How to fix?”).
How to improve the measure:
• Prepare better explained weights.
How to improve test:
• Better rules, less automation.
• Richer selection of tools.
28
Questions?
29
Verbosity
• Good verbosity
More information on the analyzer’s decision.
You can still filter out unneeded information.
• Bad verbosity
Many messages about the same error.
A lot of “rubbish” messages scattering the user’s attention.
30
Who?
This question asks who wrote a bad line or made the most significant change to it.
• svn blame?
Too basic: e.g., if a constant in a function invocation is wrong, you will not know for sure who is to blame.
• Ethical aspects of blaming are out of scope here.
You can use static analysis results to automatically create tasks in a bug tracker and assign them to the right person.
31
5Ws
The term comes from journalism, natural language processing, problem-solving, etc.
Something similar was mentioned by various philosophers and rhetoricians.
It was taught in high-school journalism classes by 1917.
32