Coding for High Frequency Trading and other Financial ... · Community available OTP.NET, Py-interface (Python), Perl, erlectricity (Ruby), PHP, Haskell/Erlang-FFI, Erlang/Gambit

Coding for High Frequency Trading and other Financial Services applications

Richard Croucher

March 2017

About meRichard is currently Vice President of High Frequency Engineering for Barclays As well as Barclays, Richard has consulted on IT to HSBC, RBS, DeutscheBank, CreditSuisse, Flowtraders, JP. Morgan, Merril Lynch and Bank of America.

Richard was also Chief Architect at Sun Microsystems where he worked on HPC Grid and helped create their Cloud capability. He was also a Principle Architect in Microsoft Live which evolved in Azure.

Functionally he has help positions as a Physicist, Electronic Design Engineer, Programmer, IT Product Manager, IT Marketing Manager, IT Consultant and IT Architect

Describes himself as a Platform Architect, specialising in HFT, DevOps, Linux and large scale Cloud solutions

Fellow of STAC Research, Fellow British Computer Society, Chartered IT Practicioner. Awarded degrees from Brunel University, University of East London and University of Berkshire

Intro

Investment Banking IT

High Frequency Trading

Common Technologies

Coding Styles

Investment Banking IT

Front Office• Traders sit on Trading rooms,

each with several screens and a fast dialler.

• Complimented now by cololocated algo trading computers

Middle Office• Accountants, Economists,

Mathematicians create strategies, watch exposure and risk on client portfolios.

• Trades are matched, cleared and settled

Back Office• The day’s cash movements

are aggregated and payment instructions sent out

• The Banks overnight positions are re-calculated and updated

Trade Flow

Trading - It’s all buying and selling

The venue is the electronic meeting place where buyers and sellers get connectedThe venue matches buyers to sellersThe spread is the current difference between buy and sell offers

Liquidity increases as buyers and sellers use your venue

Options are contracts to buy/sell in the future at an agreed price

Trading book

SellersBuyers

Spread

Dep

th

3.99

4.00

4.01

3.95

3.96

3.98

3.97

3.94

3.93

3.92

3.91

3.90

3.89

3.88

4.03

4.02

4.04

4.05

4.07

4.06

4.10

4.09

4.08

• Orders can be filled when they match buyers to sellers at the same price• Orders typically stay until cancelled or the market closes

Price

Full Book

Algorithmic Trading Strategies

Arbitration Exploiting difference in price for the same stock in different venuesMomentum Assumes that a current trend will continueAlpha (pairs) Matches stocks and assumes the value of one should be the same

as another Based on fundamental macroeconomic statisticsComposites Arb a ETF by beating it’s update, e.g. A 10% change in BP value

on the FTSE100 may translate into a 0.2% change in the overall index

Market Making Accepting an Exchange fee to provide liquidity Typically large spread just outside current price Object is to make pennies on volume and minimize holding

Buying strategies

Smart Order Routing Discover which venue is offering the best price A regulatory requirement in EU and USAIceberg Slice a large order up into lots of small orders to disguise intent and reduce

market impact Led to a big increase in order volumes and decrease in typical order sizeSweep Order Buy only a percentage of desired quantity and pause for more liquidity to

be placed, hopefully at the same priceCrossfire Liquidity capturing algorithm that targets liquidity within dark pools

Common Technologies DeployedTime Series Database – tick data - mostly kdbIn memory DatabaseReal time analytics - Hadoop, SharkExcel analytics and Grid compute pluginsComplex maths librariesPacket Decoders for each venue and trading protocolFIX EnginesSmart Order RoutersRDMA (Remote Direct Memory Access)FPGA (Fuse Programmable Gate Array)DevOps - Chef, Puppet

Sell Side System - Venue

Market Data

Market Data

Order RoutingOrder

Routing

MatchingEngine

MatchingEngine

SurveillanceSurveillance

MatchingEngine

MatchingEngineFix EngineFix Engine

SettlementSettlement

Buy Side System

Market DataDistributionMarket DataDistribution

Smart Order Router

Smart Order Router

Last ValueCache

Last ValueCache

Order Management

Order Management

Trading EngineTrading Engine

Risk and Compliance

Risk and Compliance

Trade FloorSupport

Trade FloorSupport



Pub/Sub Market Data bus

Store + Forward Reliable Message Bus

Traders AnalyticsAnalytics

PricingEnginePricingEngine

Trade FloorSupport

Trade FloorSupport

Web Trading Support

Web Trading Support Direct Feed

Trading Engine

Direct Feed Trading Engine

Rates Distribution

Rates Distribution

Trade BookingTrade

Booking

Fix EngineFix EngineFix EngineFix Engine



QUANT programming

Now generally used to refer to the development of algorithmic trading systemsHeavy maths bias – Ph.D expected Expect familiarity and understanding of Black-Scholes model used for derivatives

e.g. The value of a call option for a non-dividend paying underlying stock is

HFT – programming skillsC++ and Java dominate, with a small number of FPGA specialists

Packet processing – TCP, UDP, Multicast

Destructor threads – spin waiting for packets to arrive to avoid interrupt wakeup delay

Actor model – minimise lock contention

Nanosecond timestamping

Direct buffer management

Warm up - allocate and preload everything before trading starts

Pinning memory, cache line alignment

CPU isolation and affinity

Achieving durability via replication across network rather than disk write (often to ‘luke warm’ backup server)

Memory mapping files when writing to disk cannot be avoided

C++ expertise area’s Boost – particularly Math C and even assembler inline QuantLib - Greeks library Lockfree++, actor patterns Performance optimization with pragma’s Intel TBB (Thread Building Blocks) Code and data locality to reduce cache misses Compiler optimization Network processing – TCP, UDP, Multicast TCP bypass - RDMA - libibverb Memory optimization and tuning – cache alignment, huge

pages Tuning – Intel Vtune

http://www.boost.org/

http://quantcorner.wordpress.com/2011/02/06/quantlib-the-greeks-and-other-useful-option-related-values/

http://tradexoft.wordpress.com/tag/lockfree/

http://tradexoft.wordpress.com/tag/lockfree/

Java expertise area

Low latency tuning and jitter avoidance

Extensive use of NIO, particularly with buffers

Lock detection, avoidance and tuning

Network processing - TCP, UDP, Multicast

Packet processing - raw, pcaps

RDMA – IBM JSOR, NIO wrappings for libibverbGC tuning and avoidance - explicit buffer management and re-use, tuning -

Numa aware on multi-socket servers -XX:+UseNumaReduce the cache misses -XX:+UseCompressedOopsReduce the amount of TLB misses -XX:+UseLargePagesIncrease object persistence - -XX:MaxTenuringThreshold=4

Linux

Virtually all trading carried out on Linux

Goal is to achieve Low latency and low jitter Programmers are expected to know about the techniques used since there

are implications in the code Constant battle with power saving features added to each new

processor and Linux release - C and P states Isolating cores from scheduler and then explicit core selection for

critical threads - sched_setaffinity(2) Turbo boost, overclocking Kernel bypass preloads - Solarflare Onload, libvma, speedus TCP bypass - RDMA - RoCE, InfiniBand, OmniPath Linux bypass - Data Plane Development Kit (DPDK) Performance monitoring and profiling tools – sar, perf, iostats,

mpstat, memprof, strace, ltrace, blktrace, valgrind, latencytop, systemtap ….

FPGA Programming

Mix of hardware and software skills1. Hardware selection – device, board2. Design and Code - VHDL Verilog3. Simulation4. Synthesis5. Test bed instantiation6. Pin assignment7. FPGA bitstream generation and program load8. Test

See Sven Anderssons “How to design an FPGA from scratch” published in EE Times, good place to start although dated

http://www.eetimes.com/design/programmable-logic/4015129/How-to-design-an-FPGA-from-scratch

Multicore

Servers supporting 1000 hardware threads are already available. Trend will increase with Moore's law doubling this every 18 months

Scalability, particularly concurrency is too hard with imperative programming languages

Functional Languages are a better match to create scalable code to run on multiple cores

Functional languages implicitly better for event based, lambda programming styles

Functional Languages - ErlangBenefits of Erlang Erlang OTP provides comprehensive runtime support including - Debugger, Event

Managers, Watchdogs, FSM, in-memory DB, Distributed DB, HA, Unit test, Docs, Live Update

Big Int – no need to deal with overflows Built in support for HTML and SNMP Powerful bit level processing Code is more powerful – achieves more in fewer lines Automated restart using Supervisors – fail early strategy significantly reduces explicit

try/catch coding Vast ecosystem of Erlang code, particularly for messaging and comms

Challenges with Erlang Immutable variables take some getting used to Forces you to use recursion since no ‘while’, ‘for’ operations Native String handling inefficient - text intensive systems use bit strings Overhead of typed data reduces efficiency e.g. Int's Virtual Machine, GC, Interpretation overheads although HiPE provides optimized

support for Unix on x86.

Practical experience is that inherent concurrency more than compensates for VM overhead for most real work. Exceptions are intensive numeric calculation and ultra low latency trading

Erlang distributed computing – ping/pongping(0, Pong_PID) ->

Pong_PID ! finished,io:format(“ping finished~n”, []);

ping(N, Pong_PID) ->Pong_PID ! {ping, self()}receive

pong ->io:format(“Ping received pong~n”, [])

end,ping(N -1, Pong_PID).

pong() ->receive

finished ->io:format(“Pong finished~n”, []);

{ping, Ping_PID} ->io:format (“Pong received ping~n”, []),Ping_PID ! pong,pong()

end.

start() -> Pong_PID = spawn(ping_pong, pong, []),spawn(Server2, ping_pong, ping, [3, Pong_PID]).

Erlang Foreign Language integration

OS cmds:Function os:cmd executes the command and returns the result. Exposes dependencies on runtime operating system.

Ports:Emulates Erlang node, separate failure and scheduling domain, safest but highest overhead. Built in support for ‘C’ (erl_interface lib) Java (jinterface). Community available OTP.NET, Py-interface (Python), Perl, erlectricity (Ruby), PHP, Haskell/Erlang-FFI, Erlang/Gambit (Scheme), Distel (Emacs Lisp), Rustler (Rust)

Linked-in drivers:Runs inside a Erlang node. Dynamically linked in at runtime. Logically behaves like a port but with less overhead. Shares failure domain. A crash of one will crash both. Need to use driver_alloc(), driver_free() for malloc.

NIFs:Replaces Erlang function, shares failure domain and thread scheduling. Can impact node scheduling if execution > 1-2mS, needs to explicitly pre-empt by yielding. Setup enif_schedule_nif so that node restart the still to complete NIF

Erlang HFT example

Erlang Node

‘C’NIF

OrchestrationControl Plane(mult-threaded)

Erlang Node

Erlang Node

Erlang Msg and Pub/Sub

OS:CMDLinux setup/config

High performance tasksingle-threaded)

Advanced NetworkingTrading Venues are at extremely high volumes and speeds e.g OPRA 30 million msg/sec - 40G Ethernet

Extensive use of Multicast to ensure all participants receive data at same time

Linux treats UDP as disposable so constant battle with ‘drops’Needs explicit support in network - disabled in AWS.

TCP/IP Single thread performance limited to around 15Gbps

User space TCP/IP increase this to about 20Gbps but still too slow

RDMA is required to achieve single threaded line speed for interface > 10G

Read and write to remote memory at line speed and without consuming CPU cycles on remote node

Most 25/40/100G NICs all support RDMAVariants of latest Intel and AMD server CPU’s have RDMA on die

InfiniBand, Ethernet (RoCE) and OmniPath transports all unified by libibverbs

Programming with RDMA VERBSFour phases to a RDMA program Connection management and establishment

– rdma_cm allows TCP/IP address space to be used to establish ‘queue pairs’ which are used as end points.

Memory registration– locks the memory region which will send or receive data into memory to

prevent page faults. HCA loads TLB’s to translate between virtual to physical address

– exchange memory keys to permit remote access to this memory by the key holder

Data transfer– queue a work request , e.g. Read from or write to a remote server

Single block or scatter gather– actual data transfer carried out by hardware at wire speed

Up to 2GB in a single transfer, HW error detection and correction prevents errors

measured at 6Gbytes per second, < 2µS latency across network for 2KB transfers

Completion– Receive or check for notification of completion

Questions?

[email protected]

Documents

Coding for High Frequency Trading and other Financial ... · Community available OTP.NET, Py-interface (Python), Perl, erlectricity (Ruby), PHP, Haskell/Erlang-FFI, Erlang/Gambit