Simone Bordet [email protected]
Who Am I
Simone Bordet [email protected] @simonebordet
Open source contributor: Jetty, CometD, MX4J, Foxtrot, LiveTribe, JBoss, Larex
Lead Architect at Intalio/Webtide; maintainer of Jetty's SPDY and HTTP client
CometD project leader: a web messaging framework
JVM tuning expert
Mechanical Sympathy
Term coined by Jackie Stewart (Formula 1 driver): he believed that to be a great driver one must have mechanical sympathy for how the car works, to get the best out of it.
Term “revived” by Martin Thompson (of Disruptor fame), now applied to software (the driver) and hardware (the car).
Servers and libraries should be mechanically sympathetic, to exploit the full power of modern CPUs.
CPU Execution
Modern CPUs are super-scalar: each core has multiple execution units, and independent operations can be executed in parallel in the same clock cycle by different execution units:
a = b + c
d = e + f
Modern CPUs also perform out-of-order execution; the semantic ordering is retained via instruction retirement:
a = b * c  => multiplication takes a few clock cycles
d = a + 1  => depends on a, must wait
e = b + 1  => can be executed before d
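The effect of independent execution units can be made visible in plain Java with the two-accumulator trick: splitting a reduction into independent chains gives the execution units independent work. A minimal sketch (class and method names are illustrative, not from the slides):

```java
public class IlpDemo {
    // One long dependency chain: each addition depends on the previous sum.
    static long dependentSum(long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++)
            sum += data[i];
        return sum;
    }

    // Two independent chains: sum0 and sum1 can be computed in the same
    // clock cycle by different execution units, then combined at the end.
    static long independentSum(long[] data) {
        long sum0 = 0, sum1 = 0;
        for (int i = 0; i + 1 < data.length; i += 2) {
            sum0 += data[i];
            sum1 += data[i + 1];
        }
        return sum0 + sum1;
    }

    public static void main(String[] args) {
        long[] data = new long[1 << 20];
        for (int i = 0; i < data.length; i++)
            data[i] = i;
        // Same result either way; the independent version gives the CPU
        // parallel work per iteration.
        if (dependentSum(data) != independentSum(data))
            throw new AssertionError("sums differ");
    }
}
```

Whether the JIT already applies this transformation depends on the version and the loop shape, which is exactly why such claims should be measured with JMH rather than assumed.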
CPU Execution Quiz
int[] a = new int[2];
// Loop 1
for (int i=0; i<bazillion; ++i) { a[0]++; a[0]++; }
// Loop 2
for (int i=0; i<bazillion; ++i) { a[0]++; a[1]++; }
Loop 2 could be twice as fast: a[0]++ and a[1]++ are independent and can run in parallel, while the two a[0]++ operations form a dependency chain (provided the JIT does not optimize the first loop body to a[0] += 2;).
CPU Instruction Rate
CPUs are fast: 3 GHz => 3 clock cycles / ns
CPUs are parallel: 4x (or more) instructions per clock cycle
Potential instruction rate: 3 x 4 = 12 instructions / ns
A cache miss to main memory costs ~65 ns, i.e. 12 x 65 = 780 missed instructions!
Compilers and JITs help with reordering, but developers must pay attention to memory access patterns.
CPU Access Patterns
CPUs are good at guessing what will be accessed next: instructions and data are prefetched into caches.
CPUs access memory in “cache lines”: 64 bytes of contiguous data.
Different cores can access the same cache line; the MESI protocol keeps the data consistent.
MESI: Modified / Exclusive / Shared / Invalid.
CPU Access Pattern Quiz
int[] arr = new int[bazillion];
// Loop 1
for (int i = 0; i < arr.length; ++i) arr[i] *= 3;
// Loop 2
for (int i = 0; i < arr.length; i += 16) arr[i] *= 3;
Loop 2 does only 1/16 (6.25%) of the work, yet both loops have the same performance: with 4-byte ints, a stride of 16 is exactly one 64-byte cache line, so both loops touch every cache line of the array. The loop is dominated by memory access, not computation.
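The quiz can be sanity-checked with a little arithmetic: both loops touch exactly the same set of 64-byte cache lines, which is why memory traffic, not computation, dominates. A small sketch (class and constant names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

public class CacheLineCount {
    static final int CACHE_LINE = 64; // bytes per cache line
    static final int INT_SIZE = 4;    // bytes per int

    // Counts the distinct cache lines touched when iterating an int[]
    // of the given length with the given stride (in elements).
    static long linesTouched(int length, int stride) {
        Set<Long> lines = new HashSet<>();
        for (int i = 0; i < length; i += stride)
            lines.add((long) i * INT_SIZE / CACHE_LINE);
        return lines.size();
    }

    public static void main(String[] args) {
        int n = 1 << 20;
        long loop1 = linesTouched(n, 1);  // arr[i] *= 3 for every i
        long loop2 = linesTouched(n, 16); // every 16th element only
        // A stride of 16 ints is exactly one cache line, so both loops
        // pull every cache line of the array from memory.
        if (loop1 != loop2)
            throw new AssertionError(loop1 + " != " + loop2);
    }
}
```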
CPU Branch Prediction
The CPU branch predictor improves instruction flow by keeping track of branch history.
The outcome of an if/else branch is guessed, and the chosen branch is speculatively executed.
Guess right => the instructions are retired.
Guess wrong => a 3-6 ns penalty, plus re-execution of the correct branch.
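The classic way to observe branch prediction is to run the same data-dependent branch over sorted and shuffled data: identical work and identical result, but the sorted run is predictable and typically much faster. A sketch (class name, sizes, and the fixed seed are illustrative):

```java
import java.util.Arrays;
import java.util.Random;

public class BranchDemo {
    // Counts elements >= 128: the branch is taken about half of the time.
    static long countBig(int[] data) {
        long count = 0;
        for (int v : data)
            if (v >= 128) // predictable when data is sorted, random otherwise
                count++;
        return count;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int[] shuffled = new int[1 << 20];
        for (int i = 0; i < shuffled.length; i++)
            shuffled[i] = rnd.nextInt(256);
        int[] sorted = shuffled.clone();
        Arrays.sort(sorted); // after sorting, the branch history is trivial to predict

        // Same data, same count; only the branch-miss rate differs.
        if (countBig(sorted) != countBig(shuffled))
            throw new AssertionError();
    }
}
```

Timing the two calls under JMH (or watching branch-misses with perf stat) makes the prediction penalty visible.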
Tools
Java Microbenchmark Harness (JMH)
http://openjdk.java.net/projects/code-tools/jmh/
Vastly superior to any other framework; probably the only non-flawed microbenchmark solution for HotSpot.
Linux perf tools:
perf list shows the available hardware events
perf stat attaches to a running process to gather events
HEX Conversion
Branching code
String hex = "A6";
int value = (x2i(hex.charAt(0)) << 4) | x2i(hex.charAt(1));

public int x2i(char c)
{
    if (c >= 'A' && c <= 'F')
        return 10 + c - 'A';
    else
        return c - '0';
}
HEX Conversion
Branchless code
String hex = "A6";
int value = (x2i(hex.charAt(0)) << 4) | x2i(hex.charAt(1));

public int x2i(char c)
{
    return (c & 0x1F) + ((c >> 6) * 0x19) - 0x10;
}
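A quick sanity check that the branchless formula agrees with the branching version on all uppercase hex digits (class and method names are illustrative):

```java
public class X2iCheck {
    static int branching(char c) {
        if (c >= 'A' && c <= 'F')
            return 10 + c - 'A';
        else
            return c - '0';
    }

    // (c & 0x1F) maps '0'..'9' to 16..25 and 'A'..'F' to 1..6;
    // (c >> 6) is 0 for digits and 1 for letters, so letters get +0x19 (25);
    // subtracting 0x10 (16) lands both ranges on the right value.
    static int branchless(char c) {
        return (c & 0x1F) + ((c >> 6) * 0x19) - 0x10;
    }

    public static void main(String[] args) {
        for (char c : "0123456789ABCDEF".toCharArray())
            if (branching(c) != branchless(c))
                throw new AssertionError("mismatch at " + c);
    }
}
```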
HEX Conversion Benchmark
JMH benchmark runs (bigger is better):
Branching: 270.686 ops/ms
Branchless: 701.206 ops/ms

perf stat -e branches -e branch-misses:
Branching: 1.27 insns/cycle, 13,353,788,164 branches, 11.92% branch-misses
Branchless: 2.31 insns/cycle, 2,553,447,200 branches, 0.02% branch-misses
False Sharing
Jetty uses a customized queue for its thread pool
public class BlockingArrayQueue
{
private int _head;
private int _tail;
}
Memory layout:
header (8 bytes) | _head (4 bytes) | _tail (4 bytes)
_head and _tail sit on the same cache line, so two cores updating them independently keep invalidating each other's copy of the line: false sharing.
False Sharing
Avoid false sharing by padding
public class BlockingArrayQueue
{
private int _head;
private int h01, h02, h03, h04, h05, h06, h07, h08;
private int h11, h12, h13, h14, h15, h16, h17, h18;
private int _tail;
private int t01, t02, t03, t04, t05, t06, t07, t08;
private int t11, t12, t13, t14, t15, t16, t17, t18;
}
False Sharing Benchmark
JMH benchmark runs (bigger is better):
Dense: 624,644.870 ops/ms
Padded: 790,501.896 ops/ms
Improvement: 26%
See also j.u.c.LinkedBlockingQueue, and JDK 8's sun.misc.Contended annotation.
Quiz
SLF4J Logger.debug(String, Object...):

int read = socket.read(buffer);
logger.debug("Bytes read {}", read);

Hidden costs: the vararg array creation, and the boxing of the read variable (such values are likely outside the Integer cache, so each call allocates).
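The boxing cost is easy to demonstrate: passing an int where an Object is expected goes through Integer.valueOf, which (with default JVM settings) only caches values from -128 to 127. A sketch (class name and the 8192 example value are illustrative):

```java
public class BoxingDemo {
    public static void main(String[] args) {
        // Small values hit the Integer cache: same instance, no allocation.
        Object a = Integer.valueOf(100);
        Object b = Integer.valueOf(100);
        if (a != b) throw new AssertionError("values in -128..127 should be cached");

        // A typical byte count (e.g. 8192) is outside the cache:
        // every debug() call allocates a fresh Integer, even when DEBUG is off.
        Object c = Integer.valueOf(8192);
        Object d = Integer.valueOf(8192);
        if (c == d) throw new AssertionError("large values are not cached by default");
    }
}
```

Guarding the call with logger.isDebugEnabled() avoids both the vararg array and the boxing when DEBUG is disabled.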
HTTP Processing
Jetty 8: select + dispatch + process + invoke
[Diagram: Selector Thread: I/O select, then dispatch; Worker Thread: read bytes, parse bytes, invoke servlet]
HTTP Processing
Jetty 9.0.0.M3: select + process + dispatch + invoke
[Diagram: Selector Thread: I/O select, read bytes, parse bytes, then dispatch; Worker Thread: invoke servlet]
HTTP Processing Benchmark
Pipeline benchmark, 1 M requests:
Jetty 8: 37.696 s
Jetty 9.0.0.M3: 47.212 s
Even though Jetty 9.0.0.M3 was faster in each individual step, the overall performance was horrible: a parallel slowdown.
From Jetty 9.0.0.M4 onward we switched back to the select + dispatch + process + invoke model.
HTTP Processing Benchmark
Pipeline benchmark, 1 M requests:
Jetty 8: 37.696 s
Jetty 9.M4+: 29.172 s
Jetty 9 is way faster than Jetty 8, and produces less garbage too.
JDK 7 NIO2 AsyncSocketChannel
With JDK 7's NIO2 AsynchronousSocketChannel, I/O operations and CompletionHandler notifications only happen in the “channel group” thread.
Two problems:
An unnecessary context switch for every I/O operation.
Stack overflow when I/O operations are performed from within completion handlers, worked around by an additional context switch after N recursions.
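The first problem (the context switch) can be observed directly: the CompletionHandler never runs on the thread that initiated the operation, but on a channel-group thread. A self-contained sketch using the default channel group (class name is illustrative):

```java
import java.net.InetSocketAddress;
import java.nio.channels.AsynchronousServerSocketChannel;
import java.nio.channels.AsynchronousSocketChannel;
import java.nio.channels.CompletionHandler;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class GroupThreadDemo {
    public static void main(String[] args) throws Exception {
        AsynchronousServerSocketChannel server = AsynchronousServerSocketChannel.open()
            .bind(new InetSocketAddress("localhost", 0)); // ephemeral port
        CompletableFuture<Thread> handlerThread = new CompletableFuture<>();

        // accept() completes on a channel-group thread, not on the caller's thread.
        server.accept(null, new CompletionHandler<AsynchronousSocketChannel, Void>() {
            public void completed(AsynchronousSocketChannel ch, Void att) {
                handlerThread.complete(Thread.currentThread());
            }
            public void failed(Throwable t, Void att) {
                handlerThread.completeExceptionally(t);
            }
        });

        AsynchronousSocketChannel client = AsynchronousSocketChannel.open();
        client.connect(server.getLocalAddress()).get();

        Thread t = handlerThread.get(5, TimeUnit.SECONDS);
        if (t == Thread.currentThread())
            throw new AssertionError("handler ran on the caller's thread");
        client.close();
        server.close();
    }
}
```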
Atomic State Machines
Jetty 9: no synchronized blocks in the core; they were all replaced by atomic state machines.
This avoids the StackOverflowError typical of asynchronous callbacks, without a context switch to a new thread.
public abstract class IteratingCallback implements Callback
{
    protected enum State
    {
        IDLE, PROCESSING, PENDING, ...
    };

    private final AtomicReference<State> _state =
        new AtomicReference<>(State.IDLE);
}
Atomic State Machines
public void succeeded()
{
    loop: while (true)
    {
        State current = _state.get();
        switch (current)
        {
            case PENDING:
                if (_state.compareAndSet(current, State.PROCESSING))
                    break loop;
                continue;
            ...
        }
    }
}
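The same CAS-loop pattern, reduced to a runnable sketch (simplified and illustrative, not Jetty's actual IteratingCallback):

```java
import java.util.concurrent.atomic.AtomicReference;

public class StateMachineDemo {
    enum State { IDLE, PROCESSING, PENDING }

    private final AtomicReference<State> state = new AtomicReference<>(State.IDLE);

    // Move IDLE -> PROCESSING without locks; a failed CAS just retries the loop.
    boolean tryProcess() {
        while (true) {
            State current = state.get();
            switch (current) {
                case IDLE:
                    if (state.compareAndSet(current, State.PROCESSING))
                        return true;  // we own the processing phase
                    continue;         // lost the race: re-read the state and retry
                default:
                    return false;     // someone else is already processing
            }
        }
    }

    State current() { return state.get(); }

    public static void main(String[] args) {
        StateMachineDemo m = new StateMachineDemo();
        if (!m.tryProcess()) throw new AssertionError("first caller must win");
        if (m.tryProcess()) throw new AssertionError("second caller must lose");
        if (m.current() != State.PROCESSING) throw new AssertionError();
    }
}
```

The CAS either succeeds, making the transition visible to all threads, or fails because another thread changed the state first, in which case the loop re-reads and decides again: no thread ever blocks.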
Conclusions
Know your hardware: you will become a better programmer.
Do not try this at home: these optimizations only matter in the right context. Do ask library writers to optimize, though!
Use Jetty 9.x: we do care about these optimizations.