New Timo Berthold - Zuse Institute Berlin · 2020. 9. 7. · © 2019 Fair Isaac Corporation. Confidential. This presentation is provided for the recipient only and cannot be reproduced

© 2019 Fair Isaac Corporation. Confidential. This presentation is provided for the recipient only and cannot be reproduced or shared without Fair Isaac Corporation’s express consent. 1

© 2019 Fair Isaac Corporation. Confidential.

This presentation is provided for the recipient only and cannot be reproduced or shared without Fair Isaac Corporation’s express consent.

Timo Berthold



• Performance variability

• Statistical tests

• Best practices


0,25

0,5

1

2

4

8

16

32

64ra

il50

7n

s12

08

40

0g

mu

-35

-40

sa

telli

tes

1-2

5e

ilB1

01

ran

16

x16

cs

ch

ed

01

0s

p9

8ic

roll3

00

0m

ik-2

50

-1-1

00

-1n

eo

s1

3a

flo

w4

0b

no

sw

ot

ne

wd

an

om

ine

-90

-10

ac

c-t

igh

t5n

eo

s-1

10

98

24

ne

os

-84

97

02

da

no

int

ne

os

-16

01

93

6u

nit

ca

l_7

tim

tab

1b

na

tt3

50

pg

5_3

4iis

-bu

pa

-co

vn

eo

s1

8n

s18

30

65

3b

ley_

xl1

ms

pp

16

n3

seq

24

ma

cro

ph

ag

erm

ine

6m

sc

98

-ip

iis-1

00

-0-c

ov

op

m2

-z7

-s2

bie

ns

t2b

inka

r10

_1c

ov1

07

5d

fn-g

win

-UU

Mle

cts

ch

ed

-4-o

bj

ns1

68

83

47

ns1

76

60

74

pig

eo

n-1

0q

iurm

atr

10

0-p

5n

3d

iv3

6n

eo

s-4

76

28

3m

ap

18

pw

-myc

iel4

roc

II-4

-11

ma

p2

0n

et1

2n

eo

s-9

34

27

8rm

atr

10

0-p

10

ns1

75

89

13

min

e-1

66

-5iis

-pim

a-c

ov

ap

p1

-2n

etd

ive

rsio

nm

cs

ch

ed

ne

os

-13

96

12

5a

sh6

08

gp

ia-3

co

lm

zzv1

1zi

b5

4-U

UE

be

as

leyC

3re

blo

ck6

7b

iella

1e

il33

-2n

eo

s-9

16

79

2ro

co

co

C1

0-0

01

00

0n

4-3

sp

98

irtr

ipti

m1

tan

gle

gra

m2

gla

ss4

co

re2

53

6-6

91

ex9

ne

os

-68

61

90

air

04

ba

b5

ne

os

-13

37

30

7vp

ph

ard

en

ligh

t13

en

ligh

t14

m1

00

n5

00

k4r1

tan

gle

gra

m1

30

n2

0b

8

Single solver version-to-version improvement: appx. 30%. HOWEVER:

• ~40% of the instances got slower (~50% got faster)

• Performance drop by up to a factor of 4 (versus improvements of up to factor 45)

Picture courtesy of Thorsten Koch.

Test set: MIPLIB2010 Benchmark (87 instances)


© 2018 Fair Isaac Corporation. Confidential. 5

• Performance variability is a change in performance by seemingly performance-neutral changes, like changing the

• Input format

• Essentially, this leads to a permutation of the problem

• Operating system (Does NOT occur with Xpress)

• Different number of threads (can be overcome by control setting in Xpress)

• This occurs even though the underlying software is deterministic

• Reasons are imperfect tie-breaking, numerical round-off differences on different platforms…

• Related: Adding redundant information (E.g., a constraint σ𝑖=1𝑛 𝑥𝑖 ≤ 𝑛, 𝑥𝑖 ∈ {0,1})


• Performance variability

• Can indicate improvements where there are none

• Can shadow improvements

• Can lead to wrong conclusions (“on some instances our method was bad, but on some, it was good”)

• Running Xpress with ten different random permutations:

• 118/240 instances have a variability of more than a factor of 2

• For 30, it is more than a factor of 10

• Most extreme case: 0.02 seconds versus timeout

• For large test sets, the effect averages out

• Still 4% difference between „best“ and „worst“ permutation on MIPLIB2017


• Performance variability can be utilized to

• Distinguish between structural improvements and random noise

• The more permutations/seeds you run, the clearer the picture for each single instance

• Mimic performance on unknown instances from same application

• Very, VERY often, customers give you one single example

• Performance variability gives you a poor man’s parallelization:

• Fire off X permutations of the problem, stop when the first one solves

• Doesn’t scale very well ;-)


• Performance variability for benchmarking is typically induced by

• Permuting the matrix: High variability 😊 but structural performance “loss” 😒

• Also: more cache misses

• Random seeds: Lower variability 😒 but preserves average performance 😊

• New method: Cyclic shifts

• Light-weight permutation that shifts the rows and columns

• Only two „breakpoints“

• High variability 😊 and preserves average performance😊

• Can be combined with random seeds

• As a solver developer, after each release:

• We change the benchmarking permutation seeds

• The offset to the random seed



• To test a hypothesis, we compare it to the null-hypothesis: “There is no such relationship”

• When the null-hypothesis can be rejected with high probability, the result is deemed significant

• Essentially, this expresses the confidence in the result. How safe is it to draw conclusions from the test results?

• “How likely is it that a random draw would have delivered the same (or an even more extreme) result?”

• Typically, p-values less than 5% are considered significant.

• For a normal distribution, 95% are within mean +/- two standard deviations

• There is a huge variety of statistical tests, require different assumptions, e.g., concerning the distribution. In our case typically: distribution-free (non-parametric)

• Distinguish between nominal data (yes/no) and rational data


• Applied to 2x2 contigency table of paired nominal data

• E.g., does the solver find a solution with/without a certain setting (yes/no, yes/no)

• Considers the cases where both differ (the counter diagonal of the table)

• Compute 𝜒2 =𝑏−𝑐 2

𝑏+𝑐

• For random drawed 𝑏, 𝑐, this would follow a chi-squared-distribution.

• Lookup p-value

• Well-suited for categorical data (found a solution yes/no, solved to optimality yes/no)


• Non-parametric difference test for paired rational data

• Sorts observations by absolute value of the logarithm of their ratio

• Assigns ranks from 1 to n

• Split into positive and negative group, compute rank sums

• The more different those sums, the more likely that the setting with the larger sum outperforms the one with the lower sum

• Wilcoxon statistic: 𝑧 =min 𝑊−,𝑊+ −

𝑁(𝑁+1)

4

𝑁(𝑁+1)(2𝑁+1)

24

• Would follow a normal distribution for randomly drawn data

• Well-suited for numerical data (runtime, number of nodes, PDI)



• Choose a suitable test set

• Are there standard test sets in your field?

• Use large test sets

• Use diverse test sets (within your target domain)

• Use „real“ instances if possible, not randomly generated

• Be careful, what to exclude and how to split test sets

• NEVER EVER do „easy“ vs. „hard“ only w.r.t. one „baseline“ setting

• Some measures might only make sense on „all optimal“ or „all timeout“

• Report number of solved instances

• Explain why certain instances had to be excluded and how many were affected


• Choose a suitable test set

• Are there standard test sets in your field?

• Use large test sets

• Use diverse test sets (within your target domain)

• Use „real“ instances if possible, not randomly generated

• Be careful, what to exclude and how to split test sets

• NEVER EVER do „easy“ vs. „hard“ only w.r.t. one „baseline“ setting

• Some measures might only make sense on „all optimal“ or „all timeout“

• Report number of solved instances

• Explain why certain instances had to be excluded and how many were affected


• Report machine specification, software versions and working limits used

• Make fair comparisons

• What is the state-of-the-art?

• Not only software, but also models

• Does the benchmark solver have the same purpose (e.g., heuristics vs exact solvers)


• State the question(s) that you want to answer and the means of doing so

• @Reviewers: This might not be the same question and the same methodology that you would have chosen....

• Run on otherwise idle machines

• Run at most one job per CPU (and bind to CPU)

• Never trust your own results

• Investigate outliers, negative and positive ones

• Use methods like permutations, statistical tests etc to double-check your results

• Use diverse measures

• Also report negative results



• Performance variability describes

a) Large runtime differences caused by seemingly neutral changes

b) A method to tell whether a result is significant

c) A performance measure

• What is NOT a best practice for computational experiments?

a) Use diverse machine setups

b) Use diverse performance measures

c) Run on otherwise idle machines

• When comparing runtimes, the significance should be checked by

a) A McNemar test

b) A Cyclic shift test

c) A Wilcoxon signed rank test


• Performance variability describes

a) Large runtime differences caused by seemingly neutral changes

b) A method to tell whether a result is significant

c) A performance measure

• What is NOT a best practice for computational experiments?

a) Use diverse machine setups

b) Use diverse performance measures

c) Run on otherwise idle machines

• When comparing runtimes, the significance should be checked by

a) A McNemar test

b) A Cyclic shift test

c) A Wilcoxon signed rank test


© 2019 Fair Isaac Corporation. Confidential.

This presentation is provided for the recipient only and cannot be reproduced or shared without Fair Isaac Corporation’s express consent.

Timo Berthold

Documents

New Timo Berthold - Zuse Institute Berlin · 2020. 9. 7. · © 2019 Fair Isaac Corporation. Confidential. This presentation is provided for the recipient only and cannot be reproduced