Performance and predictability


Description

These days, fast code needs to operate in harmony with its environment. At the deepest level this means working well with hardware: RAM, disks and SSDs. A unifying theme is treating memory access patterns in a way that is predictable and sympathetic to the underlying hardware. For example, reads and writes to both RAM and hard disks are significantly faster when the device is accessed sequentially rather than randomly. In this talk we'll cover why access patterns matter, what kind of speed gains you can expect, and how to write simple, high-level code that works well with these kinds of patterns.


Performance and Predictability
Richard Warburton
@richardwarburto
insightfullogic.com

Why care about low level rubbish?

Branch Prediction

Memory Access

Storage

Conclusions

Performance Discussion

Product Solutions
“Just use our library/tool/framework, and everything is web-scale!”

Architecture Advocacy
“Always design your software like this.”

Methodology & Fundamentals
“Here are some principles and knowledge; use your brain.”

Do you care?

DON'T look at access patterns first

Many problems are not pattern-related:
Networking
Database or External Service
Minimising I/O
Garbage Collection
Insufficient Parallelism

Harvest the low-hanging fruit first

What counts as low-hanging is a cost-benefit analysis

So when does it matter?

Informed Design & Architecture *

* this is not a call for premature optimisation

That 10% that actually matters

Performance of Predictable Accesses

Why care about low level rubbish?

Branch Prediction

Memory Access

Storage

Conclusions

What 4 things do CPUs actually do?

Fetch, Decode, Execute, Writeback

Pipelining speeds this up

[Diagram: the four stages running in a pipeline, overlapping successive instructions]

What about branches?

public static int simple(int x, int y, int z) {
    int ret;
    if (x > 5) {
        ret = y + z;
    } else {
        ret = y;
    }
    return ret;
}

Branches cause stalls, stalls kill performance

Super-pipelined & Superscalar

Can we eliminate branches?

Naive Compilation

if (x > 5) {
    ret = z;
} else {
    ret = y;
}

cmp x, 5
jle L1      ; jump taken when x <= 5
mov z, ret
jmp L2
L1: mov y, ret
L2: ...

Conditional Branches

if (x > 5) {
    ret = z;
} else {
    ret = y;
}

cmp x, 5
mov z, ret
cmovle y, ret

Strategy: predict branches and speculatively execute

Static Prediction

A forward branch defaults to not taken

A backward branch defaults to taken

Static Hints (Pentium 4 or later)

__emit 0x3E defaults to taken

__emit 0x2E defaults to not taken

Don't use them; flip the branch instead

Dynamic prediction: record history and predict future

Branch Target Buffer (BTB)

a log of the history of each branch

also stores the address

it's finite!

Local

record per conditional branch histories

Global

record shared history of conditional jumps

Loop

specialised predictor when there’s a loop (jumping in a cycle n times)

Function

specialised buffer for predicted nearby function returns

Two level Adaptive Predictor

accounts for patterns of up to 3 if statements
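
These predictors reward branches whose outcomes follow a pattern. Here is a minimal, self-contained sketch (my example, not a slide from the deck) that usually makes this visible: the same value >= 128 branch is cheap over sorted data and expensive over shuffled data. A harness like JMH gives more trustworthy numbers, and a JIT that emits a conditional move can flatten the difference.

import java.util.Arrays;
import java.util.Random;

public class BranchPredictionDemo {

    // Sum the elements >= 128; the branch is the interesting part.
    static long sumAboveThreshold(int[] data) {
        long sum = 0;
        for (int value : data) {
            if (value >= 128) { // predictable on sorted input, ~random otherwise
                sum += value;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] shuffled = new Random(42).ints(10_000_000, 0, 256).toArray();
        int[] sorted = shuffled.clone();
        Arrays.sort(sorted);

        // Warm up the JIT before timing; keep results in a sink so the
        // calls cannot be dead-code eliminated.
        long sink = 0;
        for (int i = 0; i < 5; i++) {
            sink += sumAboveThreshold(shuffled) + sumAboveThreshold(sorted);
        }

        long t0 = System.nanoTime();
        sink += sumAboveThreshold(shuffled);
        long t1 = System.nanoTime();
        sink += sumAboveThreshold(sorted);
        long t2 = System.nanoTime();

        System.out.printf("shuffled: %d ms, sorted: %d ms (sink=%d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sink);
    }
}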

Measurement and Monitoring

Use Performance Event Counters (Model Specific Registers)

Can be configured to store branch prediction information

Profilers & Tooling: perf (linux), VTune, AMD Code Analyst, Visual Studio, Oracle Performance Studio

Approach: Loop Unrolling

for (int x = 0; x < 100; x += 5) {
    foo(x);
    foo(x + 1);
    foo(x + 2);
    foo(x + 3);
    foo(x + 4);
}

Approach: minimise branches in loops
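
One concrete way to do that (a sketch of mine, not a slide from the deck) is to hoist a loop-invariant condition out of the loop, sometimes called loop unswitching, so the hot loop body contains no branch at all:

// Branchy: re-tests the same invariant condition every iteration.
static long sumBranchy(int[] data, boolean squared) {
    long sum = 0;
    for (int v : data) {
        if (squared) {
            sum += (long) v * v;
        } else {
            sum += v;
        }
    }
    return sum;
}

// Unswitched: one branch outside, two branch-free loops inside.
static long sumUnswitched(int[] data, boolean squared) {
    long sum = 0;
    if (squared) {
        for (int v : data) sum += (long) v * v;
    } else {
        for (int v : data) sum += v;
    }
    return sum;
}

JIT compilers can sometimes perform this transformation themselves, but writing it explicitly keeps the latency-critical loop simple.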

Approach: Separation of Concerns

Little branching on the latency-critical path or thread

Heavily branchy administrative code on a different path

Helps because in-flight branch prediction can only track a limited number of targets

Only applicable with low context switching rates

Also easier to understand the code

One more thing ...

Division/Modulus

index = (index + 1) % SIZE;

vs

index = index + 1;
if (index == SIZE) {
    index = 0;
}

/ and % can cost up to 92 cycles on Core i* CPUs
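
For context, a hedged sketch of both variants as ring-buffer index updates (SIZE and the method names are illustrative, not from the deck); when SIZE is a power of two, a bitmask is a third option that avoids both the division and the branch:

static final int SIZE = 1024; // illustrative; a power of two for the mask trick

// Division-based: concise but pays for the % on every call.
static int advanceModulo(int index) {
    return (index + 1) % SIZE;
}

// Branch-based: a well-predicted branch is far cheaper than a division.
static int advanceBranch(int index) {
    index = index + 1;
    if (index == SIZE) {
        index = 0;
    }
    return index;
}

// Mask-based: works only when SIZE is a power of two.
static int advanceMask(int index) {
    return (index + 1) & (SIZE - 1);
}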

Summary

CPUs are Super-pipelined and Superscalar

Branches cause stalls

Branches can be removed, minimised or simplified

Frequently get easier to read code as well

Why care about low level rubbish?

Branch Prediction

Memory Access

Storage

Conclusions

The Problem

CPU: Very Fast
Memory: Relatively Slow

The Solution: CPU Cache

Core demands data, looks at its cache

If present (a "hit"), the data is returned to a register
If absent (a "miss"), the data is looked up from memory and stored in the cache

Fast memory is expensive, a small amount is affordable

Multilevel Cache: Intel Sandybridge

[Diagram: each physical core (HT: 2 logical cores) has its own Level 1 Instruction Cache, Level 1 Data Cache, and Level 2 Cache; a shared Level 3 Cache serves all cores]

How bad is a miss?

Location       Latency in clock cycles
Register       0
L1 Cache       3
L2 Cache       9
L3 Cache       21
Main Memory    150-400

Cache Lines

Data transferred in cache lines

Fixed-size block of memory

Usually 64 bytes in current x86 CPUs
Between 32 and 256 bytes in general

Purely hardware consideration

Prefetch = Eagerly load data

Adjacent Cache Line Prefetch

Data Cache Unit (Streaming) Prefetch

Problem: CPU Prediction isn't perfect

Solution: arrange data so accesses are predictable, enabling prefetching

Sequential Locality

Referring to data that is arranged linearly in memory

Spatial Locality

Referring to data that is close together in memory

Temporal Locality

Repeatedly referring to same data in a short time span

General Principles

Use smaller data types (-XX:+UseCompressedOops)

Avoid 'big holes' in your data

Make accesses as linear as possible

Primitive Arrays

// Sequential Access = Predictable
for (int i = 0; i < someArray.length; i++)
    someArray[i]++;

Primitive Arrays - Skipping Elements

// Holes Hurt
for (int i = 0; i < someArray.length; i += SKIP)
    someArray[i]++;


Multidimensional Arrays

Multidimensional Arrays are really Arrays of Arrays in Java. (Unlike C)

Some people realign their accesses:

for (int col = 0; col < COLS; col++) {
    for (int row = 0; row < ROWS; row++) {
        array[ROWS * col + row]++;
    }
}

Bad Access Alignment

Strides the wrong way, bad locality.

array[COLS * row + col]++;

Strides the right way, good locality.

array[ROWS * col + row]++;
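
A self-contained sketch (mine, not the deck's) contrasting the two stride patterns above, using the same array[ROWS * col + row] indexing with the loop order swapped. Timings are rough; expect the strided loop to be several times slower on typical hardware:

public class StrideDemo {
    static final int ROWS = 4096;
    static final int COLS = 4096;

    public static void main(String[] args) {
        int[] array = new int[ROWS * COLS];

        long t0 = System.nanoTime();
        // Good: row is innermost, so array[ROWS * col + row] moves with stride 1.
        for (int col = 0; col < COLS; col++) {
            for (int row = 0; row < ROWS; row++) {
                array[ROWS * col + row]++;
            }
        }
        long t1 = System.nanoTime();

        // Bad: loops swapped, so each access jumps ROWS elements through memory.
        for (int row = 0; row < ROWS; row++) {
            for (int col = 0; col < COLS; col++) {
                array[ROWS * col + row]++;
            }
        }
        long t2 = System.nanoTime();

        System.out.printf("sequential: %d ms, strided: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}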

Access Pattern    L1D        L2         Memory
Full Random       5 clocks   37 clocks  280 clocks
Sequential        5 clocks   14 clocks  28 clocks

Primitive Collections (GNU Trove, GS-Coll)
Arrays > Linked Lists
Hashtable > Search Tree
Avoid code bloat (Loop Unrolling)
Custom Data Structures

Judy Arrays: an associative array/map

kD-Trees: generalised Binary Space Partitioning

Z-Order Curve: multidimensional data in one dimension

Data Layout Principles

Data Locality vs Java Heap Layout

class Foo {
    Integer count;
    Bar bar;
    Baz baz;
}

// No alignment guarantees
for (Foo foo : foos) {
    foo.count = 5;
    foo.bar.visit();
}

[Diagram: foos elements 0, 1, 2, 3, ... point to Foo objects whose count, bar and baz fields can each live elsewhere on the heap]

Data Locality vs Java Heap Layout

Serious Java Weakness

Location of objects in memory hard to guarantee.

GC also interferes:
Copying
Compaction
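
A common mitigation, in the spirit of the earlier "Custom Data Structures" point (my sketch, with illustrative names): flatten the hot field of an array of objects into a primitive array, a structure-of-arrays layout, so iteration is sequential no matter where the GC has placed the objects.

// Array-of-objects: each Foo can live anywhere on the heap,
// so iterating over foos chases pointers in unpredictable order.
class Foo {
    int count;
    // ... other fields
}

// Structure-of-arrays: all counts sit contiguously in one primitive
// array, so a scan over them is a linear, prefetcher-friendly walk.
class Foos {
    final int[] counts;

    Foos(int size) {
        counts = new int[size];
    }

    void setAllCounts(int value) {
        for (int i = 0; i < counts.length; i++) {
            counts[i] = value; // sequential access
        }
    }
}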

Summary

Cache misses cause stalls, which kill performance

Measurable via Performance Event Counters

Common Techniques for optimizing code

Why care about low level rubbish?

Branch Prediction

Memory Access

Storage

Conclusions

Hard Disks

Commonly used persistent storage

Spinning Rust, with a head to read/write

Constant Angular Velocity - rotations per minute stays constant

A simple model

Zone Constant Angular Velocity (ZCAV) / Zoned Bit Recording (ZBR)

Operation Time =
  Time to process the command
  + Time to seek
  + Rotational latency
  + Sequential transfer time
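
As a rough worked example (typical figures of mine, not the deck's): a 7200 RPM drive completes one revolution in 60 / 7200 ≈ 8.3 ms, so average rotational latency is half a revolution, ≈ 4.2 ms. Add an average seek of ≈ 9 ms and, at 150 MB/s, ≈ 0.03 ms to transfer 4 KB: a random 4 KB operation costs ≈ 13 ms, almost all of it seek plus rotation.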

ZBR implies faster transfer at the outer tracks than at the centre

Seeking vs Sequential reads

Seek and rotation times dominate for small amounts of data

Random writes of 4 KB can be 300 times slower than the theoretical maximum transfer rate (see the sketch after this list)

Consider the impact of context switching between applications or threads

Fragmentation causes unnecessary seeks

Sector alignment is irrelevant at the disk level
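
To make the sequential-versus-random gap concrete, a hedged sketch using the standard java.io APIs (file names and sizes are illustrative; on a machine with plenty of RAM the OS page cache can absorb the writes and shrink the measured gap):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

public class DiskAccessDemo {
    static final int BLOCK = 4 * 1024;
    static final int BLOCKS = 25_000; // ~100 MB total

    public static void main(String[] args) throws IOException {
        byte[] block = new byte[BLOCK];

        // Sequential: append block after block; the head barely moves.
        long t0 = System.nanoTime();
        try (BufferedOutputStream out =
                 new BufferedOutputStream(new FileOutputStream("seq.dat"))) {
            for (int i = 0; i < BLOCKS; i++) {
                out.write(block);
            }
        }
        long t1 = System.nanoTime();

        // Random: seek to a random 4 KB offset before every write.
        Random random = new Random(42);
        try (RandomAccessFile file = new RandomAccessFile("rnd.dat", "rw")) {
            file.setLength((long) BLOCK * BLOCKS);
            for (int i = 0; i < BLOCKS; i++) {
                file.seek((long) random.nextInt(BLOCKS) * BLOCK);
                file.write(block);
            }
        }
        long t2 = System.nanoTime();

        System.out.printf("sequential: %d ms, random: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}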

Summary

Simple, sequential access patterns win

Fragmentation is your enemy

Alignment can be relevant in RAID scenarios

SSDs are a different game

Why care about low level rubbish?

Branch Prediction

Memory Access

Storage

Conclusions

Software runs on Hardware, play nice

Predictability is a running theme

Huge theoretical speedups

YMMV: measured incremental improvements >>> going nuts

More information

Articles:
http://www.akkadia.org/drepper/cpumemory.pdf
https://gmplib.org/~tege/x86-timing.pdf
http://psy-lob-saw.blogspot.co.uk/
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
http://mechanical-sympathy.blogspot.co.uk
http://www.agner.org/optimize/microarchitecture.pdf

Mailing Lists:
https://groups.google.com/forum/#!forum/mechanical-sympathy
https://groups.google.com/a/jclarity.com/forum/#!forum/friends
http://gee.cs.oswego.edu/dl/concurrency-interest/

Q & A@richardwarburtoinsightfullogic.comtinyurl.com/java8lambdas
