Writing and Testing
Higher Frequency Trading Engine
Peter Lawrey, Higher Frequency Trading Ltd
Who am I?
Australian living in the UK. Father of three (15, 9 and 6).
The Vanilla Java blog gets 120K page views per month. 3rd for Java on StackOverflow.
Six years designing, developing and supporting HFT systems in Java for hedge funds, trading houses and investment banks.
Principal Consultant for Higher Frequency Trading Ltd.
Event driven determinism
Critical operations are modelled as a series of asynchronous events
Producer is not slowed by the consumer
Can be recorded for deterministic testing and monitoring
You can know the state of the critical system without having to ask it.
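A minimal, stdlib-only sketch of this idea (hypothetical names, not the OpenHFT code): if the engine's state is driven only by its event journal, replaying the same journal always reproduces the same state, which is what makes deterministic testing possible.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: state is driven only by the recorded event journal,
// so replaying the journal reproduces the state exactly.
public class EventJournalDemo {
    public static final class Engine {
        public long lastPrice;
        public long tickCount;

        // The critical operation is modelled as an asynchronous event handler.
        public void onTick(long price) {
            lastPrice = price;
            tickCount++;
        }
    }

    // Replay a recorded journal into a fresh engine.
    public static Engine replay(List<Long> journal) {
        Engine engine = new Engine();
        for (long price : journal)
            engine.onTick(price);
        return engine;
    }

    public static void main(String[] args) {
        List<Long> journal = new ArrayList<>();
        // The producer records each event as it publishes it...
        journal.add(101L);
        journal.add(102L);
        journal.add(100L);
        // ...and any replay of the journal reaches the same state.
        Engine a = replay(journal);
        Engine b = replay(journal);
        System.out.println(a.lastPrice == b.lastPrice && a.tickCount == b.tickCount);
    }
}
```

The same replay mechanism doubles as monitoring: a second consumer of the journal can rebuild the engine's state without ever querying the live engine.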
Transparency and Understanding
Horizontal scalability is valuable for high throughput.
For low latency, you need simplicity. The less the system has to do the less time it takes.
Productivity
For many systems, a key driver is how easy it is to add new features.
For low latency, a key driver is how easy it is to take out redundant operations from the critical path.
Layering
Traditional design encourages layering to deal with one concept at a time. A driver is to hide from the developer what the lower layers are really doing.
In low latency, you need to understand what critical code is doing, and often combine layers to minimise the work done. This is more challenging for developers to deal with.
Taming your system
Ultra low GC, ideally not while trading.
Busy waiting isolated critical threads. Giving up the CPU slows your program by 2-5x.
Lock free coding. While locks are typically cheap, they make very bad outliers.
Direct access to memory for critical structures. You can control the layout and minimise garbage.
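As an illustration of the busy-waiting point above, here is a minimal sketch (my own names, stdlib only): the consumer spins on a volatile field instead of blocking, trading a dedicated core for latency. `Thread.onSpinWait()` is a Java 9+ hint.

```java
// Sketch of busy waiting: the consumer spins on a volatile field rather than
// parking, so it never gives up the CPU. (Thread.onSpinWait() is Java 9+.)
public class BusySpinDemo {
    private volatile long value;      // 0 means "not yet published"

    public void publish(long v) {
        value = v;                    // volatile write: visible to the spinner
    }

    public long spinUntilPublished() {
        long v;
        while ((v = value) == 0)      // busy wait; never yields the CPU
            Thread.onSpinWait();      // hint to the CPU that this is a spin loop
        return v;
    }

    public static void main(String[] args) throws InterruptedException {
        BusySpinDemo demo = new BusySpinDemo();
        Thread producer = new Thread(() -> demo.publish(42));
        producer.start();
        System.out.println(demo.spinUntilPublished()); // prints 42
        producer.join();
    }
}
```

In production this spinning thread would be pinned to an isolated core, so nothing else competes for its cache or its time slice.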
Latency profile
In a complex system, the latency increases sharply as you approach the worst latencies.
Latency
In a typical system, the worst 0.1% latency can be ten times the typical latency, but is often much more. This means your application needs to be able to track these outliers and profile them.
This is something most existing tools won't do for you. You need to build these into your system so you can monitor production.
What does a low GC system look like?
Typical tick to trade latency of 60 micros external to the box
Logged Eden space usage every 5 minutes.
Full GC every morning at 5 AM.
Low level Java
Java the language is suitable for low latency.
You can use natural Java for non-critical code. This should be the majority of your code.
For critical sections you need a subset of Java and the libraries which are suitable for low latency.
Low level Java and natural Java integrate very easily, unlike other low level languages.
Latency reporting
Look at the percentiles: typical (50%), 90%, 99%, 99.9% and the worst in the sample.
You should try to minimise the 99% or 99.9%. You should look at the worst latencies for acceptability.
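A simple way to produce such a report is a nearest-rank percentile over recorded samples. This is a stdlib-only sketch of the reporting you would build into your system (not the author's code):

```java
import java.util.Arrays;

// Sketch of percentile reporting over recorded latencies (microseconds).
public class LatencyPercentiles {
    // Nearest-rank percentile: p in (0, 1]; samples need not be pre-sorted.
    public static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    public static void main(String[] args) {
        // Note the shape: the 99.9%+ tail is far above the typical value.
        long[] latencies = {1, 2, 2, 3, 3, 3, 5, 8, 20, 150};
        System.out.printf("typical=%d 90%%=%d 99%%=%d worst=%d%n",
                percentile(latencies, 0.50),
                percentile(latencies, 0.90),
                percentile(latencies, 0.99),
                percentile(latencies, 1.0));
    }
}
```

A real system would record into a fixed-size, pre-allocated histogram rather than sorting arrays, so the measurement itself creates no garbage on the critical path.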
Latency and throughput
There are periodic disturbances in your system, and run for long enough at low throughput you will see all of them.
In high throughput systems, the delays not only impact one event, but many events, possibly thousands.
Test realistic throughputs for your systems, as well as stress tests.
Why ultra low garbage
Accessing L1 cache is about 3x faster than L2. L2 is 4 to 7 times faster than L3, and L3 is shared between cores. One thread running in L1 cache can be faster than all your CPUs at once working out of L3 cache.
Your L1 cache is 32 KB, so if you are creating 32 MB/s of garbage you are filling your L1 cache with garbage every millisecond.
Recycling is good
Recycling mutable objects works best if:
They replace short or medium lived immutable objects.
The lifecycle is easy to reason about.
The data structure is simple and doesn't change significantly.
These can help eliminate, not just reduce, GCs.
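A minimal sketch of the recycling pattern (illustrative names, stdlib only): a mutable event is borrowed from a pool, filled in, used, and returned, instead of allocating a short-lived immutable object per message.

```java
import java.util.ArrayDeque;

// Sketch of recycling: reuse a mutable event instead of allocating a new
// short-lived object per message, so the steady state creates no garbage.
public class RecyclingDemo {
    public static final class Event {   // simple, stable data structure
        public long timestamp;
        public double price;
    }

    private final ArrayDeque<Event> pool = new ArrayDeque<>();

    public Event acquire() {
        Event e = pool.poll();
        return e != null ? e : new Event(); // allocate only when the pool is empty
    }

    public void release(Event e) {          // lifecycle is easy to reason about:
        pool.push(e);                       // one owner, returned when done
    }

    public static void main(String[] args) {
        RecyclingDemo demo = new RecyclingDemo();
        Event e1 = demo.acquire();
        e1.timestamp = 123;
        e1.price = 99.5;
        demo.release(e1);
        Event e2 = demo.acquire();          // same instance reused: no garbage
        System.out.println(e1 == e2);       // prints true
    }
}
```

Because the pool is only touched by one thread and the event's shape never changes, the lifecycle stays easy to reason about, which is the precondition for this pattern paying off.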
Avoid the kernel
The kernel can be the biggest source of delays in your system. It can be avoided by
Kernel bypass network adapters
Isolating busy waiting CPUs
Memory mapped files for storage.
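The memory-mapped-file point can be sketched with plain `java.nio` (not the OpenHFT implementation): once the region is mapped, reads and writes are ordinary memory operations through the page cache, with no system call per operation.

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch of storage via a memory-mapped file: after map(), putLong/getLong
// are plain memory accesses; the kernel writes the page back asynchronously.
public class MappedFileDemo {
    public static long writeAndReadBack(File file, long value) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
             FileChannel channel = raf.getChannel()) {
            // Map 4 KB of the file into the process's address space.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buffer.putLong(0, value);     // a memory write, no kernel call
            return buffer.getLong(0);     // read back through the same mapping
        }
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("mapped", ".dat");
        file.deleteOnExit();
        System.out.println(writeAndReadBack(file, 123456789L)); // prints 123456789
    }
}
```

This is the mechanism Java Chronicle (below) builds on: persistence at memory speed, with durability handled by the OS in the background.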
Avoid the kernel
Binding critical, busy waiting threads to isolated CPUs can make a big difference to jitter.
(Chart: count of interrupts per hour, by length.)
Lock free coding
Minimising the use of locks allows threads to perform more consistently.
More complex to test.
Useful in an ultra low latency context.
Will scale better.
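A classic lock-free building block, as a stdlib sketch (illustrative, not from the talk): a compare-and-swap retry loop. No thread ever holds a lock, so a descheduled thread cannot stall the others, which is what removes the outliers.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of lock-free coding: a CAS retry loop to record a running maximum.
public class LockFreeMax {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    // Retry until our CAS wins, or until our sample is no longer the maximum.
    public void update(long sample) {
        long current;
        while (sample > (current = max.get())) {
            if (max.compareAndSet(current, sample))
                break;                 // we won the race
            // else another thread raised the max; re-read and retry
        }
    }

    public long max() {
        return max.get();
    }

    public static void main(String[] args) throws InterruptedException {
        LockFreeMax tracker = new LockFreeMax();
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            final int base = t * 1000;
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++)
                    tracker.update(base + i);
            });
            threads[t].start();
        }
        for (Thread thread : threads)
            thread.join();
        System.out.println(tracker.max()); // prints 3999
    }
}
```

Note the testing cost mentioned above: the correctness argument is the retry loop's invariant, not a lock, and races like this need dedicated multi-threaded tests.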
Faster math
Use double with rounding, or long, instead of BigDecimal: ~100x faster and no garbage.
Use long instead of Date or Calendar
Use sentinel values such as 0, NaN, MIN_VALUE or MAX_VALUE instead of nullable references.
Use Trove for collections with primitives.
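The first three points above can be sketched in a few lines (illustrative helper names, stdlib only):

```java
// Sketch of the "faster math" ideas: double with explicit rounding instead of
// BigDecimal, a long timestamp instead of Date, and NaN as a sentinel.
public class FasterMathDemo {
    // Round a price to 2 decimal places without allocating a BigDecimal.
    public static double round2(double value) {
        return Math.round(value * 100.0) / 100.0;
    }

    // A long of millis instead of a Date/Calendar object: no garbage.
    public static long nowMillis() {
        return System.currentTimeMillis();
    }

    // NaN as a sentinel for "no price" instead of a nullable Double.
    public static boolean hasPrice(double price) {
        return !Double.isNaN(price);
    }

    public static void main(String[] args) {
        System.out.println(round2(123.456));       // prints 123.46
        System.out.println(hasPrice(Double.NaN));  // prints false
        System.out.println(nowMillis() > 0);       // prints true
    }
}
```

The common thread: every one of these keeps values as primitives on the critical path, so there is nothing for the GC to collect and nothing evicting your L1 cache.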
Low latency libraries
As lightweight as possible.
The essence of what you need and no more
Designed to make full use of your hardware
Performance characteristics are a key requirement.
OpenHFT project
Thread Affinity binding
OpenHFT/Java-Thread-Affinity
Low latency persistence and IPC
OpenHFT/Java-Chronicle
Data structures in off heap memory
OpenHFT/Java-Lang
Runtime Compiler and loader
OpenHFT/Java-Runtime-Compiler
Apache 2.0 open source.
Java Chronicle
Designed to allow you to log everything, especially tracing timestamps for profiling.
Typical IPC latency is less than one microsecond for small messages, and less than 10 microseconds for large messages.
Supports reading/writing text and binary.
Java Chronicle performance
Sustained throughput limited by bandwidth of disk subsystem.
Burst throughput can be 1 to 3 GB per second depending on your hardware
Latencies for loads up to 100K events per second stable for good hardware (ok on a laptop)
Latencies for loads over one million per second, magnify any jitter in your system or application.
Java Chronicle Example
Writing text:

int count = 10 * 1000 * 1000;
for (ExcerptAppender e = chronicle.createAppender(); e.index() < count; ) {
    e.startExcerpt(100);
    e.appendDateTimeMillis(System.currentTimeMillis());
    e.append(", id=").append(e.index());
    e.append(", name=lyj").append(e.index());
    e.finish();
}
Writes 10 million messages in 1.7 seconds on this laptop
Java Chronicle Example
Writing binary:

ExcerptAppender excerpt = ic.createAppender();
long next = System.nanoTime();
for (int i = 1; i < …; i++) { … }

Sample output (on this laptop):

[GC … 5755K(120320K), 0.0521970 secs]
Started
processed 0
processed 1000000
processed 2000000
…
processed 9000000
processed 10000000
Received 10000000
Processed 10,000,000 events in and out in 20.2 seconds
The latency distribution was 0.6, 0.7/2.7, 5/26 (611) us for the 50, 90/99, 99.9/99.99 %tile (worst)

On an i7 desktop:

Processed 10,000,000 events in and out in 20.0 seconds
The latency distribution was 0.3, 0.3/1.6, 2/12 (77) us for the 50, 90/99, 99.9/99.99 %tile (worst)
Q & A
Blog: Vanilla Java
Libraries: OpenHFT