
Analysis of TimSort Algorithm

Nicolas Auger, Vincent Jugé, Cyril Nicaud, Carine Pivoteau

MOA – LIGM (UMR 8049)

A Brief History of TimSort

[Timeline figure, 2001 to 2021. Events marked on the timeline:]

TimSort invented by Tim Peters as the sorting algorithm in Python

TimSort used in Android, Java and Octave

bug uncovered! [1] (Python → certified fix, Java → custom fix)

O(n log n) running time proved [2]

new bug in Java! [3]

refined analysis: O(n + nH)

[Version labels along the timeline: old TimSort, Java's fix, certified fix.]

Our contributions are written in blue.

TimSort Principle

Natural decomposition of a sequence into maximal monotonic runs:

a b x | h f e d c b a | b b c d | a b

(run lengths: 3, 7, 4, 2)
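The decomposition above can be sketched in a few lines of Python. This is a minimal illustration, not the poster's code: the function name `run_lengths` is hypothetical, and the sketch assumes that a run is either non-decreasing or strictly decreasing and is extended greedily, a choice the poster does not spell out. Production implementations also enforce a minimum run length, which is omitted here.

```python
def run_lengths(seq):
    """Return the lengths of the maximal monotonic runs of seq, left to right.

    Assumption: a run is either non-decreasing or strictly decreasing,
    and each run is extended as far as possible before the next one starts.
    """
    lengths = []
    i, n = 0, len(seq)
    while i < n:
        j = i + 1
        if j < n and seq[j] < seq[i]:
            # Strictly decreasing run.
            while j < n and seq[j] < seq[j - 1]:
                j += 1
        else:
            # Non-decreasing run (also covers a final single element).
            while j < n and seq[j] >= seq[j - 1]:
                j += 1
        lengths.append(j - i)
        i = j
    return lengths


# The sequence shown on the poster.
print(run_lengths(list("abxhfedcbabbcdab")))  # [3, 7, 4, 2]
```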

Algorithm

get a run decomposition of the initial sequence
start with an empty stack
while the run decomposition is not empty do
    take the next run and push it onto the stack    (#1)
    while this is possible do
        apply one of the merge rules    (#2, #3, #4, #5)
while the stack has more than 1 element do
    merge the two topmost runs

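1, 2, 3, 4, ... are the topmost runs on the stack. The length of the i-th run is denoted by r_i.

Merge rules, tried in this order:

#2: if r_1 > r_3 then merge runs 2 & 3 (run 1 is kept; the merged run becomes the new run 2)

#3, #4, #5: else if r_1 > r_2 (#3) or r_1 + r_2 > r_3 (#4) or r_2 + r_3 > r_4 (#5) then merge runs 1 & 2 (the merged run becomes the new run 1; the former runs 3, 4, ... become the new runs 2, 3, ...)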
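The rules above can be simulated on run lengths alone. The sketch below is an illustration under stated assumptions: the names `merge_rule` and `timsort_stack_trace` are hypothetical, the comparisons are the strict ones written on the poster, and the stack-height guards (`h >= 3` and so on) are added on the assumption that a rule is only tried when the runs it mentions exist. The final loop that collapses the stack is left out.

```python
def merge_rule(stack):
    """Return the label of the first applicable merge rule, or None.

    stack holds run lengths with the topmost run last, so run i is stack[-i].
    The height guards are an assumption: a rule needs its runs to exist.
    """
    h = len(stack)
    if h >= 3 and stack[-1] > stack[-3]:
        return "#2"
    if h >= 2 and stack[-1] > stack[-2]:
        return "#3"
    if h >= 3 and stack[-1] + stack[-2] > stack[-3]:
        return "#4"
    if h >= 4 and stack[-2] + stack[-3] > stack[-4]:
        return "#5"
    return None


def timsort_stack_trace(run_lengths):
    """Push each run (#1), apply merge rules while possible, record every step.

    Returns a list of (rule, stack) pairs, each stack listed topmost run first.
    """
    stack, trace = [], []
    for length in run_lengths:
        stack.append(length)
        trace.append(("#1", stack[::-1]))
        while (rule := merge_rule(stack)) is not None:
            if rule == "#2":
                # Merge runs 2 and 3, keep run 1 on top.
                r1 = stack.pop()
                r2 = stack.pop()
                stack[-1] += r2
                stack.append(r1)
            else:
                # Rules #3, #4, #5: merge runs 1 and 2.
                r1 = stack.pop()
                stack[-1] += r1
            trace.append((rule, stack[::-1]))
    return trace


for rule, stack in timsort_stack_trace([24, 18, 50, 28, 20, 6, 4, 8, 1]):
    print(rule, *stack)
```

Only run lengths are tracked, which is enough to study the merge strategy; an actual sort would merge the underlying array slices at each step.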

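Example of stack evolution for runs of length 24, 18, 50, 28, 20, 6, 4, 8, 1 (stack values represent run lengths, topmost run first; each line shows the stack after the rule on the left has been applied):

#1    24
#1    18 24
#1    50 18 24
#2    50 42
#3    92
#1    28 92
#1    20 28 92
#1    6 20 28 92
#1    4 6 20 28 92
#1    8 4 6 20 28 92
#2    8 10 20 28 92
#5    18 20 28 92
#4    38 28 92
#3    66 92
#1    1 66 92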

Results

Thirteen years after Tim Peters announced TimSort, we obtained the first proof of its worst-case running time:

Theorem [2, 3]: TimSort runs in O(n log n) time.

By design, TimSort is well suited to partially sorted data: it belongs to the family of adaptive sorting algorithms. Taking the number of runs ρ as a natural parameter for a refined analysis, we obtained:

Theorem [3]: TimSort runs in O(n + n log ρ) time.

Going further, we introduce the entropy H of the run lengths, which takes the variations in run lengths into account. It is defined by:

$$H = -\sum_{i=1}^{\rho} \frac{r_i}{n} \log_2\left(\frac{r_i}{n}\right)$$

Theorem [3]: TimSort runs in O(n + nH) time.
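As a small illustration of the entropy defined above, the following sketch computes H from a list of run lengths. The function name `run_entropy` is hypothetical. The final check relies on the standard fact that H is at most log2 ρ, with equality when all runs have the same length, which is why a bound in terms of H refines the bound in terms of ρ.

```python
import math


def run_entropy(run_lengths):
    """Entropy H = -sum((r_i / n) * log2(r_i / n)) of a list of run lengths."""
    n = sum(run_lengths)
    return -sum((r / n) * math.log2(r / n) for r in run_lengths)


runs = [24, 18, 50, 28, 20, 6, 4, 8, 1]  # run lengths from the poster's example
h = run_entropy(runs)
print(f"H = {h:.3f}, log2(rho) = {math.log2(len(runs)):.3f}")

# Four runs of equal length reach the maximum log2(4) = 2.
assert run_entropy([4, 4, 4, 4]) == 2.0
# In general H never exceeds log2 of the number of runs.
assert h <= math.log2(len(runs))
```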

Our in-depth analysis of the algorithm enabled us to uncover a bug in Java's sorting implementation! It was fixed within a few weeks of our report.

References

[1] S. de Gouw, J. Rot, F. de Boer, R. Bubel, and R. Hähnle. OpenJDK’s Java.utils.Collection.sort() is broken: The good, the bad and the worst case. In CAV, 2015.

[2] N. Auger, C. Nicaud, and C. Pivoteau. Merge strategies: From Merge Sort to TimSort. Research Report hal-01212839, hal, 2015.

[3] N. Auger, V. Jugé, C. Nicaud, and C. Pivoteau. On the worst-case complexity of TimSort. In ESA, 2018.