Writing High-Performance Software
Performance ⟺ Longer Battery Life (not only for when things need to run fast)
BitTorrent, Inc. | Writing High-Performance Software For Internal Presentation Purposes Only, Not For External Distribution .
Memory Cache
Typical memory cache hierarchy (Core i5 Sandy Bridge):
• Core 1, Core 2, Core 3, Core 4: each with its own L1 (32 kiB) and L2 (256 kiB)
• L3 (6 MiB), shared by all cores
• Main memory (16 GiB)
Memory Latency
[Figure: bar chart "Memory latencies Core i5 Sandy Bridge", built up one level at a time: register, L1 cache, L2 cache, L3 cache, DRAM. The time axis grows from 0 to 0.5 ns (register only), to 1.3 ns (L1), 4 ns (L2), 15 ns (L3) and 90 ns (DRAM). The final chart is annotated "61.8 x".]
Source: http://www.7-cpu.com/cpu/IvyBridge.html
Memory Latency
• While a CPU is waiting for memory, it is busy (i.e. you will see 100% CPU usage even if your bottleneck is waiting for memory)
• Memory access patterns are a significant factor in performance
• Constant cache misses can make your program run up to 2 orders of magnitude slower than constant cache hits
Memory Cache
[Figure: the memory you requested is a small part of the cache line, which is the memory actually pulled into the cache]
Memory Latency
• CPUs prefetch memory automatically if they can recognize your access pattern (sequential is easy)
• CPUs predict branches in order to prefetch instruction memory
• The memory access pattern is determined not only by data access but also by control flow (indirect jumps stall execution on a memory lookup)
Memory Cache
[Figure: eight consecutive 64-byte cache lines]
For linear memory reads, the CPU will pre-fetch memory
Memory Cache
[Figure: eight consecutive 64-byte cache lines]
For random memory reads, there is no pre-fetch and most memory accesses will cause a cache miss
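The difference between the two access patterns can be sketched as follows. This is a minimal illustration, not a benchmark: the function names and the shuffled-index helper are hypothetical, and the actual slowdown depends on the data size relative to the caches.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Sum the elements in index order: consecutive addresses, easy to prefetch.
std::uint64_t sum_sequential(std::vector<std::uint32_t> const& data) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < data.size(); ++i) total += data[i];
    return total;
}

// Sum the same elements, visiting them in the order given by `order`.
// With a shuffled order, consecutive accesses usually hit different cache lines.
std::uint64_t sum_in_order(std::vector<std::uint32_t> const& data,
                           std::vector<std::size_t> const& order) {
    std::uint64_t total = 0;
    for (std::size_t idx : order) {
        assert(idx < data.size());
        total += data[idx];
    }
    return total;
}

// Hypothetical helper: a random permutation of the indices 0..n-1.
std::vector<std::size_t> shuffled_indices(std::size_t n, unsigned seed) {
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}
```

Both functions compute the same sum; timing them over an array much larger than the L3 cache is one way to observe the cost of the random pattern on a given machine.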
Data Structures
• Array of pointers to objects, and linked lists: more cache pressure / cache misses
• Array of objects: less cache pressure / cache hits
Data Structures
• One optimization is to refactor your single list of heterogeneous objects into one list per type
• Objects would be laid out sequentially in memory
• Virtual function dispatch could become static
Data Structures
Pointers need dereferencing + vtable lookup:
std::vector<std::unique_ptr<shape>> shapes;
for (auto& s : shapes) s->draw();
Objects packed back-to-back, sequential memory access, no vtable lookup:
std::vector<rectangle> rectangles;
std::vector<circle> circles;
for (auto& s : rectangles) s.draw();
for (auto& s : circles) s.draw();
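A compilable version of the two layouts might look like the sketch below. The slides only show draw(); here a hypothetical area() method stands in for it so the result can be checked, and shape_store is an assumed name for the "one vector per type" container.

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct shape {
    virtual ~shape() = default;
    virtual double area() const = 0;
};

struct rectangle final : shape {
    double w, h;
    rectangle(double w_, double h_) : w(w_), h(h_) { assert(w_ >= 0 && h_ >= 0); }
    double area() const override { return w * h; }
};

struct circle final : shape {
    double r;
    explicit circle(double r_) : r(r_) { assert(r_ >= 0); }
    double area() const override { return 3.141592653589793 * r * r; }
};

// One heterogeneous list: each element is a separate heap allocation,
// reached through a pointer and dispatched through the vtable.
double total_area(std::vector<std::unique_ptr<shape>> const& shapes) {
    double sum = 0;
    for (auto const& s : shapes) sum += s->area();
    return sum;
}

// One list per type: objects are stored contiguously, and the static type is
// known in each loop, so the compiler can call area() directly.
struct shape_store {
    std::vector<rectangle> rectangles;
    std::vector<circle> circles;

    double total_area() const {
        double sum = 0;
        for (auto const& s : rectangles) sum += s.area();
        for (auto const& s : circles) sum += s.area();
        return sum;
    }
};
```

The trade-off is that the per-type layout loses the original ordering between rectangles and circles, which only matters if the order of processing is significant.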
Data Structures
• For heap-allocated objects, put the most commonly used (“hot”) fields in the first cache line
• Avoid unnecessary padding
Data Structures
Padding
struct A {
    int a;
    void* b;
    int c;
};
struct A [24 Bytes]
     0: [int : 4] a
     --- 4 Bytes padding ---
     8: [void* : 8] b
    16: [int : 4] c
     --- 4 Bytes padding ---
struct B {
    void* b;
    int a;
    int c;
};
struct B [16 Bytes]
     0: [void* : 8] b
     8: [int : 4] a
    12: [int : 4] c
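The layout above can be checked at compile time. The sketch below assumes a typical ABI where pointers are at least as large and as strictly aligned as int; the 24 and 16 byte figures are specific to common 64-bit platforms, so the checks are written without hard-coding them. The names payload, padding_in_A, padding_in_B and check_layout are introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>

struct A { int a; void* b; int c; };  // pointer sandwiched between two ints
struct B { void* b; int a; int c; };  // widest member first, ints packed together

// Bytes of actual data in either struct.
constexpr std::size_t payload = sizeof(int) + sizeof(void*) + sizeof(int);

// Padding is whatever the compiler added on top of the payload.
constexpr std::size_t padding_in_A = sizeof(A) - payload;
constexpr std::size_t padding_in_B = sizeof(B) - payload;

static_assert(offsetof(B, b) == 0, "the pointer is the first member of B");
static_assert(sizeof(B) <= sizeof(A), "reordering should not make the struct larger");

// Runtime variant of the same check, e.g. for a unit test.
inline void check_layout() {
    assert(padding_in_B <= padding_in_A);
}
```

On a platform matching the slide, padding_in_A is 8 and padding_in_B is 0.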
Context Switching
• One significant source of cache misses is switching context, and switching the data set being worked on
• Context switch
    • Thread / process switching
    • User space -> kernel space
Context Switching
• Lower the cost of context switching by amortizing it over as much work as possible
• Reduce the number of system calls by passing as much work as possible per call
• Reduce thread wake-ups/sleeps by batching work
Context Switching
When a thread wakes up, do as much work as possible before going to sleep:
• Drain the socket of received bytes
• Drain the job queue
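Draining a job queue can be sketched as below: the worker takes the whole backlog with one lock acquisition per wake-up instead of popping one job at a time. The job_queue class and run_batch function are hypothetical names, and the sketch assumes a single consumer thread.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

class job_queue {
public:
    using job = std::function<void()>;

    // Called by producers: append one job and wake the worker.
    void post(job j) {
        assert(j);  // empty std::function objects are not valid jobs
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_jobs.push_back(std::move(j));
        }
        m_cond.notify_one();
    }

    // Called by the worker: sleep until there is work, then take every
    // queued job at once, leaving the queue empty.
    std::deque<job> wait_and_drain() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cond.wait(lock, [this] { return !m_jobs.empty(); });
        std::deque<job> batch;
        batch.swap(m_jobs);
        return batch;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cond;
    std::deque<job> m_jobs;
};

// Run every job in the batch outside the lock; returns how many ran.
std::size_t run_batch(std::deque<job_queue::job>& batch) {
    std::size_t count = 0;
    for (auto& j : batch) {
        j();
        ++count;
    }
    return count;
}
```

A worker loop would alternate wait_and_drain() and run_batch(); under load each wake-up then handles a larger batch, so the cost of the wake-up is spread over more work.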
Context Switching (traffic analogy)
[Animation: letting one car at a time through, compared with letting the whole queue through at a time]
Context Switching
• Every time the socket becomes readable, read and handle one request
buf = socket.read_one_request()
req = parse_request(buf)
handle_req(socket, req)
Context Switching
• Drain the socket each time it becomes readable
• Parse and handle each request that was received
buf.append(socket.read_all())
req, buf = parse_request(buf)
while req is not None:
    handle_req(socket, req)
    req, buf = parse_request(buf)
Context Switching
• Write all responses in a single call at the end
• Don’t flush the buffer to the socket until all messages are handled
buf.append(socket.read_all())
socket.cork()
req, buf = parse_request(buf)
while req is not None:
    handle_req(socket, req)
    req, buf = parse_request(buf)
socket.uncork()
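The same idea can be sketched in C++ without a real socket. Everything here is an assumption made for illustration: fake_socket just counts write calls, the protocol is newline-terminated requests, and handle_request replies with "ok:" plus the request. Instead of corking the socket, this variant collects the responses in a string and writes them once.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a socket; each write() models one system call.
struct fake_socket {
    std::string sent;
    int write_calls = 0;

    void write(std::string const& bytes) {
        sent += bytes;
        ++write_calls;
    }
};

// Hypothetical request handler: produces the response for one request.
std::string handle_request(std::string const& req) {
    return "ok:" + req + "\n";
}

// Append the newly received bytes to `buf`, handle every complete request,
// keep any trailing partial request in `buf`, and send all responses with a
// single write. Returns the number of requests handled.
std::size_t handle_readable(fake_socket& sock, std::string& buf,
                            std::string const& received) {
    buf += received;
    std::string responses;
    std::size_t handled = 0;
    std::size_t start = 0;
    for (;;) {
        std::size_t end = buf.find('\n', start);
        if (end == std::string::npos) break;  // no more complete requests
        responses += handle_request(buf.substr(start, end - start));
        start = end + 1;
        ++handled;
    }
    assert(start <= buf.size());
    buf.erase(0, start);  // drop the consumed requests, keep the remainder
    if (!responses.empty()) sock.write(responses);
    return handled;
}
```

With three complete requests in one read, this issues one write rather than three, and a request split across two reads is completed on the second call.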
Socket Programming
• There are two ways to read from sockets
    • Wait for a readable event, then read (POSIX)
    • Start an asynchronous read, then wait for the completion event (Win32)
Socket Programming
// Wait for the socket to become readable
struct kevent ev[100];  // "struct" is needed: the kevent() function hides the type name in C++
int events = kevent(queue, nullptr, 0, ev, 100, nullptr);
for (int i = 0; i < events; ++i) {
    // Copy data from kernel space to user space
    int size = read(ev[i].ident, buffer, buffer_size);
    /* ... */
}
Socket Programming
// Initiate async. read into buffer
WSABUF b = { buffer_size, buffer };
DWORD transferred = 0, flags = 0;
WSAOVERLAPPED ov; // [ initialization ]
int ret = WSARecv(s, &b, 1, &transferred, &flags, &ov, nullptr);

// Wait for operations to complete
WSAOVERLAPPED* ol;
ULONG_PTR key;  // the completion key is returned by value through a pointer
BOOL r = GetQueuedCompletionStatus(port, &transferred, &key, &ol, INFINITE);

// Query status
ret = WSAGetOverlappedResult(s, &ov, &transferred, FALSE, &flags);
Socket Programming
• Passing in a buffer up-front is preferable because:
    • The NIC driver can, in theory, receive data directly into your buffer and save a copy
    • If there is a memory copy, it can be done asynchronously, without blocking your thread
Socket Programming
• Problem: What buffer size should be used?
• Too large will waste memory
• Too small will waste system calls (since we need multiple calls to drain the socket)
Socket Programming
• Problem: What buffer size should be used?
• Start with some reasonable buffer size
• If an async read fills the whole buffer, increase size
• If an async read returns significantly less than the buffer size, decrease size
Size adjustments should be proportional to the buffer size
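One way to express this rule is sketched below. The doubling and halving factors, the "less than a quarter" threshold and the size bounds in buffer_policy are all assumptions chosen for illustration; the slides only say that adjustments should be proportional to the buffer size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Assumed limits, so the buffer neither vanishes nor grows without bound.
struct buffer_policy {
    std::size_t min_size = 1024;
    std::size_t max_size = 1024 * 1024;
};

// Given the buffer size used for the last read and the number of bytes that
// read returned, pick the buffer size for the next read.
std::size_t next_buffer_size(std::size_t current, std::size_t bytes_read,
                             buffer_policy const& policy = {}) {
    assert(bytes_read <= current);
    std::size_t next = current;
    if (bytes_read == current) {
        next = current * 2;        // buffer was filled: there was probably more to read
    } else if (bytes_read < current / 4) {
        next = current / 2;        // buffer was mostly empty: shrink proportionally
    }
    return std::max(policy.min_size, std::min(next, policy.max_size));
}
```

Because the shrink threshold (a quarter) is below the shrink factor (a half), a read that triggers a shrink will not fill the smaller buffer, which avoids oscillating between two sizes.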
Context Switching
Adapt batch size to the computer’s natural granularity
Higher load should lead to larger batches, fewer context switches and higher efficiency.
Use of magic numbers is a red flag
Message Queues
• Events on message queues may come in batches
• Example: we receive one message per 16 kiB block read from disk.
void conn::on_disk_read(buffer const& buf) {
    m_socket.write(&buf[0], buf.size());
}
Message Queues
• Problem: We want to flush our sockets right before we go to sleep, i.e. when we have drained the message queue, without starvation
Message Queues
void conn::on_disk_read(buffer const& buf) {
    // Instead of writing to the socket, accumulate the buffers
    // (assumes buffer exposes begin()/end())
    m_buf.insert(m_buf.end(), buf.begin(), buf.end());
    // If there is no outstanding flush message, post one
    if (m_has_flush_msg) return;
    m_has_flush_msg = true;
    m_queue.post(std::bind(&conn::flush, this));
}

// Flush all buffers when all messages have been handled
void conn::flush() {
    m_socket.write(&m_buf[0], m_buf.size());
    m_buf.clear();            // start accumulating afresh
    m_has_flush_msg = false;  // allow the next flush message to be posted
}
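The behaviour of this pattern can be simulated without sockets or disks. In the sketch below, message_queue is a hypothetical single-threaded FIFO, and conn records what it would have written instead of calling a real socket; both are assumptions made so the example is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical single-threaded FIFO message queue.
class message_queue {
public:
    void post(std::function<void()> msg) { m_messages.push_back(std::move(msg)); }

    // Handle messages in FIFO order until the queue is empty, including
    // messages posted by handlers while running. Returns the number handled.
    std::size_t run() {
        std::size_t handled = 0;
        while (!m_messages.empty()) {
            std::function<void()> msg = std::move(m_messages.front());
            m_messages.pop_front();
            msg();
            ++handled;
        }
        return handled;
    }

private:
    std::deque<std::function<void()>> m_messages;
};

struct conn {
    explicit conn(message_queue& q) : m_queue(q) {}

    // Accumulate the block; post a flush message only if none is pending.
    void on_disk_read(std::string const& block) {
        m_buf += block;
        if (m_has_flush_msg) return;
        m_has_flush_msg = true;
        m_queue.post([this] { flush(); });
    }

    // Runs after the messages that were queued ahead of it.
    void flush() {
        assert(m_has_flush_msg);
        sent += m_buf;   // stands in for a single socket write
        ++write_calls;
        m_buf.clear();
        m_has_flush_msg = false;
    }

    std::string sent;     // everything "written" so far
    int write_calls = 0;  // number of simulated socket writes

private:
    message_queue& m_queue;
    std::string m_buf;
    bool m_has_flush_msg = false;
};
```

If three disk-read messages are already queued, the flush message posted by the first one lands behind the other two, so all three blocks go out in one write. Because the flush is an ordinary message in the same FIFO, other connections' messages are not starved.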
Message Queues
[Figure labels: FIFO message queue; message handler; flush message]