"Quantum" Performance Effects

v 3.0; February 2015

Sergey Kuksenko

@kuksenk0

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

Copyright © 2014, Oracle and/or its affiliates. All rights reserved.

Intro

Intro: performance engineering

1. Computer Science → Software Engineering
– Build software to meet functional requirements
– Mostly don’t care about HW and data specifics
– Abstract and composable, “formal science”

2. Performance Engineering
– “Real world strikes back!”
– Exploring complex interactions between hardware, software, and data
– Based on empirical evidence, i.e. “natural science”

Intro: what’s the difference?

architecture vs microarchitecture

∙ Architectures: x86, AMD64 (x86-64/Intel64), ARMv7, ...
∙ Microarchitectures: Nehalem, Sandy Bridge, Bulldozer, Bobcat (x86/AMD64); Cortex-A9 (ARMv7)

Intro: SUTs (Systems Under Test)

∙ Intel® Core™ i5-4300M [2.6 GHz] 1x2x2
– 𝜇arch: Haswell
– launched: Q4’2013
– OS: Xubuntu 14.04 (64-bit)

∙ Samsung Exynos 4412, ARMv7 [1.6 GHz] 1x4x1
– 𝜇arch: Cortex-A9
– launched: 2011
– OS: Linaro 12.11

Intro: SUTs (cont.)

∙ AMD Opteron™ 4274HE [2.5 GHz] 2x8x1
– 𝜇arch: Bulldozer/Valencia
– launched: Q4’2011
– OS: Oracle Linux Server release 6.0 (64-bit)

∙ Intel® Xeon® CPU E5-2680 [2.70 GHz] 2x8x2
– 𝜇arch: Sandy Bridge
– launched: Q1’2012
– OS: Oracle Linux Server release 6.3 (64-bit)

Intro: JVM

∙ Java HotSpot™ “1.8.0_25” 32-bit

∙ Java HotSpot™ “1.8.0_25” 64-bit

∙ Java HotSpot™ Embedded “1.8.0-ea-b79”

Intro: Demo code

https://github.com/kuksenko/quantum

∙ Required: JMH (Java Microbenchmark Harness)
– http://openjdk.java.net/projects/code-tools/jmh/

Core

Demo1: double sum

    private double[] A = new double[2048];

    // Straightforward sum: one accumulator, one element per iteration.
    @Benchmark
    public double test1() {
        double sum = 0.0;
        for (int i = 0; i < A.length; i++) {
            sum += A[i];
        }
        return sum;
    }

    // Manually unrolled by 4: four elements are added per iteration.
    @Benchmark
    public double manualUnroll() {
        double sum = 0.0;
        for (int i = 0; i < A.length; i += 4) {
            sum += A[i] + A[i + 1] + A[i + 2] + A[i + 3];
        }
        return sum;
    }

test1: 426 ops/ms
manualUnroll: 1120 ops/ms

Demo1: looking into asm, test1

    loop: vaddsd 0x10(%edi,%eax,8),%xmm0,%xmm0
          vaddsd 0x18(%edi,%eax,8),%xmm0,%xmm0
          vaddsd 0x20(%edi,%eax,8),%xmm0,%xmm0
          vaddsd 0x28(%edi,%eax,8),%xmm0,%xmm0
          vaddsd 0x30(%edi,%eax,8),%xmm0,%xmm0
          vaddsd 0x38(%edi,%eax,8),%xmm0,%xmm0
          vaddsd 0x40(%edi,%eax,8),%xmm0,%xmm0
          vaddsd 0x48(%edi,%eax,8),%xmm0,%xmm0
          vaddsd 0x50(%edi,%eax,8),%xmm0,%xmm0
          vaddsd 0x58(%edi,%eax,8),%xmm0,%xmm0
          vaddsd 0x60(%edi,%eax,8),%xmm0,%xmm0
          vaddsd 0x68(%edi,%eax,8),%xmm0,%xmm0
          vaddsd 0x70(%edi,%eax,8),%xmm0,%xmm0
          vaddsd 0x78(%edi,%eax,8),%xmm0,%xmm0
          vaddsd 0x80(%edi,%eax,8),%xmm0,%xmm0
          vaddsd 0x88(%edi,%eax,8),%xmm0,%xmm0
          add    $0x10,%eax
          cmp    %ebx,%eax
          jl     loop

Demo1: looking into asm, manualUnroll

    loop: vmovsd 0x48(%eax,%edx,8),%xmm0
          vmovsd %xmm0,(%esp)
          vmovsd 0x40(%eax,%edx,8),%xmm0
          vmovsd %xmm0,0x8(%esp)
          vmovsd 0x78(%eax,%edx,8),%xmm0
          vaddsd 0x70(%eax,%edx,8),%xmm0,%xmm1
          vmovsd 0x80(%eax,%edx,8),%xmm2
          vmovsd 0x88(%eax,%edx,8),%xmm0
          vmovsd %xmm0,0x10(%esp)
          vmovsd 0x38(%eax,%edx,8),%xmm0
          vaddsd 0x30(%eax,%edx,8),%xmm0,%xmm0
          vmovsd %xmm0,0x18(%esp)
          vmovsd 0x58(%eax,%edx,8),%xmm0
          vaddsd 0x50(%eax,%edx,8),%xmm0,%xmm3
          vmovsd 0x28(%eax,%edx,8),%xmm4
          vmovsd 0x60(%eax,%edx,8),%xmm5
          vmovsd 0x68(%eax,%edx,8),%xmm6
          vmovsd 0x20(%eax,%edx,8),%xmm7
          vmovsd 0x18(%eax,%edx,8),%xmm0
          vaddsd 0x10(%eax,%edx,8),%xmm0,%xmm0
          vaddsd %xmm2,%xmm1,%xmm1
          vaddsd %xmm7,%xmm0,%xmm0
          vaddsd 0x10(%esp),%xmm1,%xmm1
          vaddsd %xmm4,%xmm0,%xmm0
          vaddsd %xmm5,%xmm3,%xmm2
          vaddsd 0x20(%esp),%xmm0,%xmm3
          vaddsd %xmm6,%xmm2,%xmm2
          vmovsd 0x18(%esp),%xmm0
          vaddsd 0x8(%esp),%xmm0,%xmm0
          vaddsd (%esp),%xmm0,%xmm0
          vaddsd %xmm0,%xmm3,%xmm0
          vaddsd %xmm0,%xmm2,%xmm0
          vaddsd %xmm0,%xmm1,%xmm0
          vmovsd %xmm0,0x20(%esp)
          add    $0x10,%edx
          cmp    %ebx,%edx
          jl     loop

Demo1: measure time

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OperationsPerInvocation(2048)   // report time per array element
    public double test1() {
        double sum = 0.0;
        for (int i = 0; i < A.length; i++) {
            sum += A[i];
        }
        return sum;
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OperationsPerInvocation(2048)   // report time per array element
    public double manualUnroll() {
        double sum = 0.0;
        for (int i = 0; i < A.length; i += 4) {
            sum += A[i] + A[i + 1] + A[i + 2] + A[i + 3];
        }
        return sum;
    }

test1: time = 1.15 ns/op, CPI ≈ 2.5
manualUnroll: time = 0.44 ns/op, CPI ≈ 0.5

(CPI: Cycles Per Instruction)
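
The throughput and average-time numbers for Demo1 describe the same measurement in different units. A minimal sketch of the conversion, assuming one throughput operation is one full pass over the 2048-element array (the class and method names are illustrative and not part of the demo repository):

```java
// Converts Demo1 throughput (array passes per millisecond) into time per
// array element. Class and method names are illustrative only.
final class ThroughputToTime {
    static final int ELEMENTS = 2048;            // array length used in Demo1
    static final double NS_PER_MS = 1_000_000.0; // nanoseconds in a millisecond

    // opsPerMs: benchmark invocations (full array passes) per millisecond.
    static double nsPerElement(double opsPerMs) {
        double nsPerPass = NS_PER_MS / opsPerMs;
        return nsPerPass / ELEMENTS;
    }

    public static void main(String[] args) {
        System.out.println("test1:        " + nsPerElement(426) + " ns/op");
        System.out.println("manualUnroll: " + nsPerElement(1120) + " ns/op");
    }
}
```

With 426 and 1120 ops/ms this gives roughly 1.15 and 0.44 ns per element, which agrees with the average-time figures on the slide.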

𝜇arch: x86

CISC and RISC

A modern x86 CPU is not what it seems.

All instructions (CISC) are dynamically translated into RISC-like micro-operations (𝜇ops).

𝜇arch: Intel’s internals

http://commons.wikimedia.org/wiki/File:Intel_Nehalem_arch.svg

(c) Appaloosa, CC BY-SA 3.0

𝜇arch: simplified scheme

𝜇arch: looking into instruction tables (Haswell 𝜇arch)

    Operation                         Latency   1/Throughput
    addition (floating-point)         3         1
    multiplication (floating-point)   5         0.5
    addition (integer)                1         0.25
    multiplication (integer)          3         1

latency and reciprocal throughput, cycles

Demo1: test1, looking into asm again

The loop is the same as in “Demo1: looking into asm, test1”: 16 vaddsd instructions that all accumulate into %xmm0, followed by add, cmp, and jl.

∙ time = 1.15 ns/op
∙ CPI ≈ 2.5
∙ ≈ 3 cycles/op
∙ unrolled by 16
∙ 19 instructions
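
The cycle and CPI figures follow from the measured time by simple arithmetic. A sketch of that calculation, assuming the Haswell SUT runs at its nominal 2.6 GHz with the frequency fixed (the class name and helper methods are hypothetical):

```java
// Derives cycles per element and CPI from measured time per element.
// Assumption: the core runs at a fixed 2.6 GHz (nominal Haswell SUT clock).
final class CpiEstimate {
    static final double GHZ = 2.6; // cycles per nanosecond

    static double cyclesPerOp(double nsPerOp) {
        return nsPerOp * GHZ;
    }

    // opsPerIteration: array elements handled by one pass of the compiled loop.
    // instructionsPerIteration: instructions in the compiled loop body.
    static double cpi(double nsPerOp, int opsPerIteration, int instructionsPerIteration) {
        return cyclesPerOp(nsPerOp) * opsPerIteration / instructionsPerIteration;
    }

    public static void main(String[] args) {
        System.out.println("test1 cycles/op:  " + cyclesPerOp(1.15));
        System.out.println("test1 CPI:        " + cpi(1.15, 16, 19));
        System.out.println("manualUnroll CPI: " + cpi(0.44, 16, 37));
    }
}
```

For test1 (16 elements and 19 instructions per loop pass) this yields about 3 cycles per element and a CPI of about 2.5; for manualUnroll (16 elements, 37 instructions) the CPI comes out near 0.5.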

Demo1: test1, structural view

[Figure: structural view of test1; annotations: time = 1.15 ns/op, CPI ≈ 2.5, ≈ 3 cycles/op, unrolled by 16, 19 instructions]

Demo1: manualUnroll

Demo1: manualUnroll, structural view

[Figure: structural view of manualUnroll; annotations: time = 0.44 ns/op, CPI ≈ 0.5, ≈ 1.14 cycles/op, unrolled by 4 * 4, 37 instructions]

𝜇arch: Dependences

The performance (ILP, Instruction Level Parallelism) of many programs is limited by natural data dependencies.

What to do?

Break Dependency Chains!
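
Why breaking chains helps can be seen with a first-order model. The sketch below assumes that each independent chain is bound by the instruction latency, that the execution unit as a whole is bound by its reciprocal throughput, and that loads and loop overhead are free; the class and method names are hypothetical:

```java
// First-order model of a reduction loop split into independent chains.
// Assumptions: only the arithmetic instruction matters; loads and loop
// overhead are ignored. Not a cycle-accurate simulation.
final class ChainModel {
    // latency and reciprocalThroughput are in cycles (see the instruction table).
    static double cyclesPerElement(double latency, double reciprocalThroughput, int chains) {
        double latencyBound = latency / chains;          // chains overlap in time
        return Math.max(latencyBound, reciprocalThroughput);
    }

    public static void main(String[] args) {
        // Floating-point addition on Haswell: latency 3, reciprocal throughput 1.
        for (int chains : new int[] {1, 2, 4, 8}) {
            System.out.println(chains + " chain(s): "
                    + cyclesPerElement(3.0, 1.0, chains) + " cycles/element");
        }
    }
}
```

For floating-point addition the model predicts 3, 1.5, 1, and 1 cycles per element for 1, 2, 4, and 8 chains. At 2.6 GHz that is about 1.15, 0.58, and 0.38 ns, close to the Haswell times reported for test1, test2, and test4 in the final results table.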

Demo1(cont.): breaking chains in a “right” way

    // test1: a single dependency chain
    ...
    for (int i = 0; i < A.length; i++) {
        sum += A[i];
    }
    return sum;

    // test2: two independent chains
    ...
    for (int i = 0; i < A.length; i += 2) {
        sum0 += A[i];
        sum1 += A[i + 1];
    }
    return sum0 + sum1;

    // test4: four independent chains
    ...
    for (int i = 0; i < A.length; i += 4) {
        sum0 += A[i];
        sum1 += A[i + 1];
        sum2 += A[i + 2];
        sum3 += A[i + 3];
    }
    return (sum0 + sum1) + (sum2 + sum3);
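
The fragments above can be completed into a small runnable class. This is a sketch outside JMH, with hypothetical method names, and it assumes the array length is a multiple of 4 as in the demo (2048):

```java
// Runnable versions of the one-, two- and four-accumulator sums.
// Assumption: a.length is a multiple of 4, as with the 2048-element demo array.
final class DoubleSumVariants {
    static double sum1(double[] a) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    static double sum2(double[] a) {
        double sum0 = 0.0, sum1 = 0.0;
        for (int i = 0; i < a.length; i += 2) {
            sum0 += a[i];
            sum1 += a[i + 1];
        }
        return sum0 + sum1;
    }

    static double sum4(double[] a) {
        double sum0 = 0.0, sum1 = 0.0, sum2 = 0.0, sum3 = 0.0;
        for (int i = 0; i < a.length; i += 4) {
            sum0 += a[i];
            sum1 += a[i + 1];
            sum2 += a[i + 2];
            sum3 += a[i + 3];
        }
        return (sum0 + sum1) + (sum2 + sum3);
    }

    public static void main(String[] args) {
        double[] a = new double[2048];
        for (int i = 0; i < a.length; i++) {
            a[i] = i % 7; // small integers, so every partial sum is exact
        }
        System.out.println(sum1(a) + " " + sum2(a) + " " + sum4(a));
    }
}
```

Floating-point addition is not associative, so for general data the variants may differ in the last bits. The example uses small integer values, for which all three sums are exact and equal.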

Demo1(cont.): double sum final results

                    Haswell   AMD     ARM
    manualUnroll    0.44      0.45    3.30
    test1           1.15      1.50    6.60
    test2           0.58      0.80    4.25
    test4           0.39      0.43    4.25
    test8           0.39      0.25    2.55

time, ns/op

Demo2: results

                       Haswell   AMD     ARM
    DoubleMul.test1    2.84      2.52    8.17
    DoubleMul.test2    2.50      2.37    4.25
    DoubleMul.test4    0.48      0.49    3.15
    DoubleMul.test8    0.25      0.30    2.53

    IntMul.test1       1.14      1.16    10.04
    IntMul.test2       0.58      0.75    7.38
    IntMul.test4       0.38      0.67    4.64

    IntSum.test1       0.39      0.32    8.92
    IntSum.test2       0.24      0.48    6.12

time, ns/op

Branches: to jump or not to jump

    public int absSumBranch(int a[]) {
        int sum = 0;
        for (int x : a) {
            if (x < 0) {
                sum -= x;
            } else {
                sum += x;
            }
        }
        return sum;
    }

    loop: mov  0xc(%ecx,%ebp,4),%ebx
          test %ebx,%ebx
          jl   L1
          add  %ebx,%eax
          jmp  L2
    L1:   sub  %ebx,%eax
    L2:   inc  %ebp
          cmp  %edx,%ebp
          jl   loop

Branches: to jump or not to jump

    public int absSumPredicated(int a[]) {
        int sum = 0;
        for (int x : a) {
            sum += Math.abs(x);
        }
        return sum;
    }

    loop: mov   0xc(%ecx,%ebp,4),%ebx
          mov   %ebx,%esi
          neg   %esi
          test  %ebx,%ebx
          cmovl %esi,%ebx
          add   %ebx,%eax
          inc   %ebp
          cmp   %edx,%ebp
          jl    loop
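
For comparison, the same branch-free idea can be written at the source level with a sign mask. This is only an illustration: the slide's benchmark calls Math.abs and relies on the JIT emitting cmovl, and the class and method names below are hypothetical.

```java
// Branch-free absolute value using a sign mask, as a source-level
// counterpart of the neg/cmovl sequence. Illustrative names only.
final class BranchFreeAbs {
    static int absMask(int x) {
        int mask = x >> 31;       // 0 for x >= 0, -1 (all ones) for x < 0
        return (x ^ mask) - mask; // x for x >= 0; ~x + 1 == -x for x < 0
    }

    static int absSum(int[] a) {
        int sum = 0;
        for (int x : a) {
            sum += absMask(x);
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {-3, 4, -5, 0, 7};
        System.out.println(absSum(data));
    }
}
```

Like Math.abs, absMask returns Integer.MIN_VALUE for Integer.MIN_VALUE, because the negation overflows.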

Demo3: results

Regular Pattern = (+, –)*

                           Nehalem   Haswell   AMD    ARM
    branch_sorted          0.9       0.5       1.0    5.0
    branch_regular         0.9       0.5       0.8    5.0
    branch_shuffled        6.4       1.0       2.8    9.4
    predicated_sorted      1.3       0.8       0.9    5.6
    predicated_regular     1.3       0.8       0.9    5.3
    predicated_shuffled    1.3       0.8       0.9    9.3

time, ns/op

Demo3: results

Regular Pattern = (+, +, –, +, –, –, +, –, –, +)*

                           Nehalem   Haswell   AMD    ARM
    branch_sorted          0.9       0.5       1.0    5.0
    branch_regular         1.6       0.9       1.0    5.0
    branch_shuffled        6.4       1.0       2.3    9.5
    predicated_sorted      1.3       0.8       0.9    5.6
    predicated_regular     1.3       0.8       0.9    5.3
    predicated_shuffled    1.3       0.8       0.9    9.3

time, ns/op
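
The slides do not show how the sorted, regular, and shuffled inputs are built. One plausible way to generate them is sketched below; the class, the method names, the magnitudes, and the fixed seed are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical generators for the three input orderings used in Demo3.
final class SignPatterns {
    // Repeats a sign pattern (true = positive) over n elements with magnitudes 1..n.
    static int[] regular(int n, boolean[] pattern) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) {
            int magnitude = i + 1;
            a[i] = pattern[i % pattern.length] ? magnitude : -magnitude;
        }
        return a;
    }

    // Sorted copy: all negative values first, then all positive values.
    static int[] sorted(int[] src) {
        int[] a = Arrays.copyOf(src, src.length);
        Arrays.sort(a);
        return a;
    }

    // Fisher-Yates shuffle of a copy; the seed makes the order reproducible.
    static int[] shuffled(int[] src, long seed) {
        int[] a = Arrays.copyOf(src, src.length);
        Random rnd = new Random(seed);
        for (int i = a.length - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
        }
        return a;
    }

    public static void main(String[] args) {
        boolean[] tenStep = {true, true, false, true, false, false, true, false, false, true};
        int[] regular = regular(20, tenStep);
        System.out.println(Arrays.toString(regular));
        System.out.println(Arrays.toString(sorted(regular)));
        System.out.println(Arrays.toString(shuffled(regular, 42L)));
    }
}
```

All three arrays hold the same values, so both benchmark methods return the same sum for each of them; only the predictability of the sign sequence changes.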

Demo4: && vs &

    public int countConditional(boolean[] f0, boolean[] f1) {
        int cnt = 0;
        for (int j = 0; j < SIZE; j++) {
            for (int i = 0; i < SIZE; i++) {
                if (f0[i] && f1[j]) {
                    cnt++;
                }
            }
        }
        return cnt;
    }

          shuffled   sorted
    &&    1.8        0.6

time, ns/op

Demo4: && vs &

    public int countLogical(boolean[] f0, boolean[] f1) {
        int cnt = 0;
        for (int j = 0; j < SIZE; j++) {
            for (int i = 0; i < SIZE; i++) {
                if (f0[i] & f1[j]) {
                    cnt++;
                }
            }
        }
        return cnt;
    }

          shuffled   sorted
    &&    1.8        0.6
    &     1.2        1.2

time, ns/op
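
The two operators differ in evaluation, not in result: && skips its right operand when the left one is false, while & always evaluates both. A small sketch that makes this visible by counting operand evaluations (the class and its helpers are hypothetical):

```java
// Counts how many operands are evaluated by && and by & on booleans.
final class ShortCircuitDemo {
    static int evaluations = 0;

    // Returns its argument and records that it was evaluated.
    static boolean probe(boolean value) {
        evaluations++;
        return value;
    }

    static int evaluationsConditional(boolean left, boolean right) {
        evaluations = 0;
        boolean result = probe(left) && probe(right); // right skipped if left is false
        return evaluations;
    }

    static int evaluationsLogical(boolean left, boolean right) {
        evaluations = 0;
        boolean result = probe(left) & probe(right);  // both operands always evaluated
        return evaluations;
    }

    public static void main(String[] args) {
        System.out.println("&& with false left operand: " + evaluationsConditional(false, true));
        System.out.println("&  with false left operand: " + evaluationsLogical(false, true));
    }
}
```

In the benchmark the operands are plain array loads, so both methods return the same count. The short-circuit form may compile to an extra data-dependent branch, which would be consistent with && being sensitive to input order while & is not.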

Demo5: interface invocation cost

    public interface I { public int amount(); }
    ...
    public class C0 implements I { public int amount() { return 0; } }
    public class C1 implements I { public int amount() { return 1; } }
    public class C2 implements I { public int amount() { return 2; } }
    public class C3 implements I { public int amount() { return 3; } }
    ...
    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OperationsPerInvocation(SIZE)
    public int sum(I[] a) {
        int s = 0;
        for (I i : a) {
            s += i.amount();   // interface call; the target depends on the element's class
        }
        return s;
    }

Demo5: results

                1 target   2 targets   3 targets   4 targets
    sorted      0.8        0.8         4.9         5.0
    regular     –          0.8         4.9         5.0
    shuffled    –          1.0         17.5        19.1

time, ns/op

Not a Real Core

Not a Real Core: HW Multithreading

∙ Simultaneous multithreading, SMT
– e.g. Intel® Hyper-Threading Technology

∙ Fine-grained temporal multithreading
– e.g. CMT, Sun/Oracle UltraSPARC T1, T2, T3, T4, T5 ...

Back to Demo1: Execution Units Saturation

                              1 thread   2 threads   2 threads   4 threads
                                         -cpu 1,3    -cpu 2,3
    DoubleSum.test1           426        850         840         1660
    DoubleSum.test2           845        1690        1260        2500
    DoubleSum.test4           1260       2513        1260        2520
    DoubleSum.manualUnroll    1120       2240        1260        2504

overall throughput, ops/ms

Max single core throughput: ≈ 1260 ops/ms
Max system throughput: ≈ 2520 ops/ms
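
How much a second hardware thread adds can be read off the table as a ratio. The sketch below assumes that -cpu 2,3 are the two hardware threads of one core, which the saturation at 1260 ops/ms suggests but the slides do not state; the class and method names are hypothetical.

```java
// Ratio of two-thread (same core) throughput to single-thread throughput.
// Assumption: "-cpu 2,3" pins both threads to one physical core.
final class SmtScaling {
    static double scaling(double oneThread, double twoThreadsSameCore) {
        return twoThreadsSameCore / oneThread;
    }

    public static void main(String[] args) {
        // Values in ops/ms, taken from the table.
        System.out.println("test1:        " + scaling(426, 840));
        System.out.println("test2:        " + scaling(845, 1260));
        System.out.println("test4:        " + scaling(1260, 1260));
        System.out.println("manualUnroll: " + scaling(1120, 1260));
    }
}
```

test1 nearly doubles (about 1.97x) because a single dependency chain leaves the adder mostly idle, while test4 already saturates it and gains nothing (1.0x) from a second thread on the same core.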

Demo6: Map.get()

    private Map<Integer, Integer> jdk_map;
    private int[] keys;

    @Benchmark
    public int testJdkPrimitive() {
        int s = 0;
        for (int key : keys) {
            s += jdk_map.get(key);   // the int key is autoboxed on every call
        }
        return s;
    }

Demo6: Map.get()

    private Map<Integer, Integer> jdk_map;
    private Integer[] boxedKeys;

    @Benchmark
    public int testJdkBoxed() {
        int s = 0;
        for (Integer key : boxedKeys) {
            s += jdk_map.get(key);   // keys are already boxed
        }
        return s;
    }

Demo6: Map.get()

    private TIntIntMap third_party_map;
    private int[] keys;

    @Benchmark
    public int test3dParty() {
        int s = 0;
        for (int key : keys) {
            s += third_party_map.get(key);   // primitive int-to-int lookup, no boxing
        }
        return s;
    }
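
The difference between the primitive and boxed JDK variants is easier to see when the conversions that the compiler inserts are written out. A sketch with hypothetical class and method names, using a plain HashMap in place of the benchmark state:

```java
import java.util.HashMap;
import java.util.Map;

// Shows the boxing that an int-keyed lookup on Map<Integer, Integer> implies.
final class BoxingMadeExplicit {
    // What the source looks like: int key in, int value out.
    static int sumImplicit(Map<Integer, Integer> map, int[] keys) {
        int s = 0;
        for (int key : keys) {
            s += map.get(key);
        }
        return s;
    }

    // The same loop with the boxing and unboxing steps spelled out.
    static int sumExplicit(Map<Integer, Integer> map, int[] keys) {
        int s = 0;
        for (int key : keys) {
            Integer boxedKey = Integer.valueOf(key);  // boxing on every lookup
            Integer boxedValue = map.get(boxedKey);
            s += boxedValue.intValue();               // unboxing the result
        }
        return s;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> map = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            map.put(i, i * 2);
        }
        int[] keys = {1, 500, 999};
        System.out.println(sumImplicit(map, keys) + " " + sumExplicit(map, keys));
    }
}
```

Both methods do the same work. With pre-boxed keys, as in testJdkBoxed, the Integer.valueOf step disappears from the loop.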

Demo6: Map.get() results

                    -cpu 1   -cpu 2,3    -cpu 2, (?) on -cpu 3
    JdkPrimitive    47       (25, 25)    (30, ?)
    JdkBoxed        71       (40, 40)    (50, ?)
    3dParty         74       (43, 43)    (16, ?)

throughput, ops/ms

Demo6: Hyper.troll()

    public static double d0;
    public static double d1;
    public static double d2;

    @Benchmark
    @OperationsPerInvocation(5)   // five divisions per call
    public double troll() {
        return (d0 / d2) / ((d1 / d2) / (d0 / d1));
    }

Demo6: division results on Haswell

              1 thread
    int       250
    double    180

throughput, ops/𝜇s

                        -cpu 1,3      -cpu 2,3      -cpu 3
    (int, int)          (250, 250)    (125, 125)    (125, 125)
    (double, double)    (180, 180)    (90, 90)      (90, 90)
    (double, int)       (180, 250)    (150, 57)     (90, 125)

throughput, ops/𝜇s

Demo6: division results on AMD

              1 thread
    int       128
    double    300

throughput, ops/𝜇s

                        -cpu 0,1      -cpu 0,2      -cpu 0,8      -cpu 0
    (int, int)          (92, 92)      (128, 128)    (128, 128)    (64, 64)
    (double, double)    (150, 150)    (300, 300)    (300, 300)    (150, 150)
    (double, int)       (280, 120)    (290, 128)    (300, 128)    (120, 64)

throughput, ops/𝜇s

Conclusion

Enlarge your knowledge with these simple tricks!

Reading list:

∙ “Computer Architecture: A Quantitative Approach”, John L. Hennessy, David A. Patterson

∙ CPU vendors’ documentation

∙ http://www.agner.org/optimize/

∙ etc.

Thanks!

Q & A ?

Appendix

Appendix: Frequency Variance

∙ Dynamic CPU Frequency
– TurboBoost and similar

"The processor must be working in the power, temperature, and specification limits of the thermal design power (TDP)." © Intel

Appendix: TurboBoost in action

[Figure: measured frequency compared with the max normal frequency]

Appendix: Set Fixed Frequency!

e.g.

cpufreq-set -u 2600000 -g performance
