Transcript
Page 2: Writing High-Performance Software by Arvid Norberg

Performance ⟺ Longer Battery Life (not only for when things need to run fast)

Page 3: Writing High-Performance Software by Arvid Norberg

BitTorrent, Inc. | Writing High-Performance Software For Internal Presentation Purposes Only, Not For External Distribution .

Memory Cache

[Figure: Typical memory cache hierarchy (Core i5 Sandy Bridge). Cores 1 to 4 each have an L1 (32 kiB) and an L2 (256 kiB) cache; a single L3 (6 MiB) sits between them and main memory (16 GiB).]

Page 4: Writing High-Performance Software by Arvid Norberg

Memory Latency

[Figure: bar chart of memory latencies, Core i5 Sandy Bridge. Bars: register. Axis: 0 ns to 0.5 ns. Source: http://www.7-cpu.com/cpu/IvyBridge.html]

Page 5: Writing High-Performance Software by Arvid Norberg

Memory Latency

[Figure: bar chart of memory latencies, Core i5 Sandy Bridge. Bars: register, L1 cache. Axis: 0 ns to 1.3 ns. Source: http://www.7-cpu.com/cpu/IvyBridge.html]

Page 6: Writing High-Performance Software by Arvid Norberg

Memory Latency

[Figure: bar chart of memory latencies, Core i5 Sandy Bridge. Bars: register, L1 cache, L2 cache. Axis: 0 ns to 4 ns. Source: http://www.7-cpu.com/cpu/IvyBridge.html]

Page 7: Writing High-Performance Software by Arvid Norberg

Memory Latency

[Figure: bar chart of memory latencies, Core i5 Sandy Bridge. Bars: register, L1 cache, L2 cache, L3 cache. Axis: 0 ns to 15 ns. Source: http://www.7-cpu.com/cpu/IvyBridge.html]

Page 8: Writing High-Performance Software by Arvid Norberg

Memory Latency

[Figure: bar chart of memory latencies, Core i5 Sandy Bridge. Bars: register, L1 cache, L2 cache, L3 cache, DRAM. Axis: 0 ns to 90 ns. Annotation: 61.8 x. Source: http://www.7-cpu.com/cpu/IvyBridge.html]

Page 9: Writing High-Performance Software by Arvid Norberg

Memory Latency

• When a CPU is waiting for memory, it is busy (i.e. you will see 100% CPU usage, even if your bottleneck is waiting for memory)

• Memory access patterns are a significant factor in performance

• Constant cache misses make your program run up to 2 orders of magnitude slower than constant cache hits

Page 10: Writing High-Performance Software by Arvid Norberg

Memory Cache

[Figure: a cache line, with labels "The memory you requested" and "The memory pulled into the cache"]

Page 11: Writing High-Performance Software by Arvid Norberg

Memory Latency

• CPUs prefetch memory automatically if they can recognize your access pattern (sequential is easy)

• CPUs predict branches in order to prefetch instruction memory

• Memory access pattern is not only determined by data access but also control flow (indirect jumps stall execution on a memory lookup)

Page 12: Writing High-Performance Software by Arvid Norberg

Memory Cache

[Figure: eight consecutive 64-byte cache lines]

For linear memory reads, the CPU will pre-fetch memory

Page 14: Writing High-Performance Software by Arvid Norberg

Memory Cache

[Figure: eight consecutive 64-byte cache lines]

For random memory reads, there is no pre-fetch and most memory accesses will cause a cache miss
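The difference between linear and random reads can be measured with a small experiment. The sketch below is not from the slides: the helper names (sum_in_order, sequential_order, random_order, timed_sum) are made up for illustration. It sums the same array twice, once in index order and once in shuffled order. The index array is itself read linearly in both runs, so any timing gap comes from how the data array is accessed.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Sum `data`, visiting its elements in the order given by `order`.
std::uint64_t sum_in_order(std::vector<std::uint32_t> const& data,
                           std::vector<std::uint32_t> const& order)
{
    std::uint64_t sum = 0;
    for (std::uint32_t idx : order) sum += data[idx];
    return sum;
}

// Indices 0, 1, ..., n-1: a linear walk the prefetcher can follow.
std::vector<std::uint32_t> sequential_order(std::size_t n)
{
    std::vector<std::uint32_t> order(n);
    std::iota(order.begin(), order.end(), 0u);
    return order;
}

// The same indices, shuffled: no pattern for the prefetcher to recognize.
std::vector<std::uint32_t> random_order(std::size_t n, std::uint32_t seed)
{
    std::vector<std::uint32_t> order = sequential_order(n);
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}

struct timing {
    std::uint64_t sum;   // result, so the work cannot be optimized away
    double milliseconds; // wall-clock time of the traversal
};

timing timed_sum(std::vector<std::uint32_t> const& data,
                 std::vector<std::uint32_t> const& order)
{
    auto const start = std::chrono::steady_clock::now();
    std::uint64_t const sum = sum_in_order(data, order);
    auto const stop = std::chrono::steady_clock::now();
    return {sum, std::chrono::duration<double, std::milli>(stop - start).count()};
}
```

To see an effect, the data array should be much larger than the last-level cache (for example 1 << 24 elements, which is 64 MiB of uint32_t). Both runs return the same sum; only the elapsed time should differ, and by how much depends on the machine.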

Page 16: Writing High-Performance Software by Arvid Norberg

Data Structures

Page 17: Writing High-Performance Software by Arvid Norberg

Data Structures

• Array of pointers to objects and linked lists: more cache pressure / cache misses

• Array of objects: less cache pressure / cache hits

Page 18: Writing High-Performance Software by Arvid Norberg

Data Structures

• One optimization is to refactor your single list of heterogeneous objects into one list per type.

• Objects would be laid out sequentially in memory

• Virtual function dispatch could become static

Page 19: Writing High-Performance Software by Arvid Norberg

Data Structures

std::vector<std::unique_ptr<shape>> shapes;

for (auto& s : shapes) s->draw();

Page 20: Writing High-Performance Software by Arvid Norberg

Data Structures

std::vector<std::unique_ptr<shape>> shapes;

for (auto& s : shapes) s->draw();

std::vector<rectangle> rectangles;
std::vector<circle> circles;

for (auto& s : rectangles) s.draw();
for (auto& s : circles) s.draw();

Page 21: Writing High-Performance Software by Arvid Norberg

Data Structures

Pointers need dereferencing + vtable lookup:

std::vector<std::unique_ptr<shape>> shapes;

for (auto& s : shapes) s->draw();

std::vector<rectangle> rectangles;
std::vector<circle> circles;

for (auto& s : rectangles) s.draw();
for (auto& s : circles) s.draw();

Page 22: Writing High-Performance Software by Arvid Norberg

Data Structures

Pointers need dereferencing + vtable lookup:

std::vector<std::unique_ptr<shape>> shapes;

for (auto& s : shapes) s->draw();

Objects packed back-to-back, sequential memory access, no vtable lookup:

std::vector<rectangle> rectangles;
std::vector<circle> circles;

for (auto& s : rectangles) s.draw();
for (auto& s : circles) s.draw();
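The two layouts can be put side by side in a compilable form. This is a minimal sketch, not the author's code: the canvas counter, the poly_ prefixed classes, the scene struct and the field names are all invented for the example, and draw() only counts calls so the result can be checked.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical draw target: just counts what was drawn.
struct canvas {
    int rectangles_drawn = 0;
    int circles_drawn = 0;
};

// Layout 1: one heterogeneous list. Every element is a separate heap
// allocation reached through a pointer, and draw() is a virtual call.
struct shape {
    virtual ~shape() = default;
    virtual void draw(canvas& c) const = 0;
};
struct poly_rectangle final : shape {
    void draw(canvas& c) const override { ++c.rectangles_drawn; }
};
struct poly_circle final : shape {
    void draw(canvas& c) const override { ++c.circles_drawn; }
};

void draw_all(std::vector<std::unique_ptr<shape>> const& shapes, canvas& c)
{
    for (auto const& s : shapes) s->draw(c);
}

// Layout 2: one list per type. Objects sit back-to-back inside each
// vector and draw() is resolved at compile time.
struct rectangle {
    float x, y, w, h;
    void draw(canvas& c) const { ++c.rectangles_drawn; }
};
struct circle {
    float x, y, r;
    void draw(canvas& c) const { ++c.circles_drawn; }
};

struct scene {
    std::vector<rectangle> rectangles;
    std::vector<circle> circles;

    void draw(canvas& c) const
    {
        for (auto const& s : rectangles) s.draw(c);
        for (auto const& s : circles) s.draw(c);
    }
};
```

One consequence of the per-type layout is that the draw order changes: all rectangles are drawn before all circles. That is fine when order does not matter, and needs extra bookkeeping when it does.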

Page 23: Writing High-Performance Software by Arvid Norberg

Data Structures

• For heap allocated objects, put the most commonly used (“hot”) fields in the first cache line

• Avoid unnecessary padding
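As an illustration of the first bullet, a struct can be ordered so that its frequently used fields come first, with a compile-time check that they fit in one cache line. The struct below is hypothetical (peer_connection and all its fields are invented), and the 64-byte line size is an assumption that matches the cache lines shown earlier in the deck.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; not every CPU uses 64 bytes.
constexpr std::size_t cache_line_size = 64;

// Aligning the struct to the line size makes "the first cache line" of an
// object well defined. Since C++17, plain `new` honors this alignment.
struct alignas(cache_line_size) peer_connection {
    // hot: touched on every send or receive
    std::uint64_t bytes_sent = 0;
    std::uint64_t bytes_received = 0;
    std::int32_t socket_fd = -1;
    std::uint32_t state_flags = 0;

    // cold: touched when connecting or logging
    std::uint64_t connected_at = 0;
    char client_name[128] = {};
};

// The last hot field must end within the first cache line.
static_assert(offsetof(peer_connection, state_flags) + sizeof(std::uint32_t)
                  <= cache_line_size,
              "hot fields spill into a second cache line");
```

The static_assert turns the layout rule into something the compiler enforces, so a later reordering of fields that pushes hot data out of the first line fails the build.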

Page 24: Writing High-Performance Software by Arvid Norberg

Data Structures

Padding

struct A {
    int a;
    void* b;
    int c;
};

struct A [24 Bytes]
  0: [int : 4] a
     --- 4 Bytes padding ---
  8: [void* : 8] b
 16: [int : 4] c
     --- 4 Bytes padding ---

struct B {
    void* b;
    int a;
    int c;
};

struct B [16 Bytes]
  0: [void* : 8] b
  8: [int : 4] a
 12: [int : 4] c
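The layouts above can be checked by the compiler. A small sketch, assuming the same platform as the slide (4-byte int, 8-byte pointers with 8-byte alignment); the names payload, padding_a and padding_b are introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>

struct A {
    int a;
    void* b;
    int c;
};

struct B {
    void* b;
    int a;
    int c;
};

// Bytes of actual data: identical for both structs.
constexpr std::size_t payload = sizeof(int) + sizeof(void*) + sizeof(int);

// Whatever sizeof reports beyond the payload is padding.
constexpr std::size_t padding_a = sizeof(A) - payload;
constexpr std::size_t padding_b = sizeof(B) - payload;

// Reordering the fields never makes the struct larger here.
static_assert(sizeof(B) <= sizeof(A), "B should not be larger than A");

// On the platform assumed by the slide, the sizes are exactly 24 and 16.
static_assert(sizeof(void*) != 8 || alignof(void*) != 8 || sizeof(int) != 4
                  || (sizeof(A) == 24 && sizeof(B) == 16),
              "unexpected layout for 8-byte pointers and 4-byte ints");
```

On a platform with 4-byte pointers both structs would typically be 12 bytes with no padding, which is why the exact sizes are only asserted under the stated assumption.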

Page 25: Writing High-Performance Software by Arvid Norberg

Context Switching

Page 26: Writing High-Performance Software by Arvid Norberg

Context Switching

• One significant source of cache misses is switching context, and switching the data set being worked on

• Context switch

• Thread / process switching

• User space -> kernel space

Page 28: Writing High-Performance Software by Arvid Norberg

Context Switching

• Lower the cost of context switching by amortizing it over as much work as possible

• Reduce the number of system calls by passing as much work as possible per call

• Reduce thread wake-ups/sleeps by batching work

Page 29: Writing High-Performance Software by Arvid Norberg

Context Switching

When a thread wakes up, do as much work as possible before going to sleep

Drain the socket of received bytes

Drain the job queue
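Draining the job queue on each wake-up might look like the sketch below. It is one possible shape, not taken from the slides: job_queue, wait_and_drain and run_one_batch are hypothetical names, and the design assumes a single consumer thread.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

class job_queue {
public:
    using job = std::function<void()>;

    // Queue a job. The consumer is only signalled when the queue goes from
    // empty to non-empty; jobs posted after that ride along in the same batch.
    void post(job j)
    {
        bool was_empty = false;
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            was_empty = m_jobs.empty();
            m_jobs.push_back(std::move(j));
        }
        if (was_empty) m_cond.notify_one();
    }

    // Sleep until there is work, then take everything that is queued.
    std::vector<job> wait_and_drain()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cond.wait(lock, [this] { return !m_jobs.empty(); });
        std::vector<job> batch;
        batch.swap(m_jobs); // one lock acquisition for the whole batch
        return batch;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cond;
    std::vector<job> m_jobs;
};

// One wake-up of the worker: run every job that accumulated while it slept.
// Returns the batch size.
std::size_t run_one_batch(job_queue& q)
{
    std::vector<job_queue::job> batch = q.wait_and_drain();
    for (auto& j : batch) j();
    return batch.size();
}
```

A worker thread would call run_one_batch in a loop. Under light load each batch holds one job; under heavy load the batches grow on their own, so the cost of each wake-up is spread over more work.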

Page 30: Writing High-Performance Software by Arvid Norberg

Context Switching (traffic analogy)

One car at a time

Page 36: Writing High-Performance Software by Arvid Norberg

Context Switching (traffic analogy)

The whole queue at a time

Page 39: Writing High-Performance Software by Arvid Norberg

Context Switching

• Every time the socket becomes readable, read and handle one request

buf = socket.read_one_request()
req = parse_request(buf)
handle_req(socket, req)

Page 40: Writing High-Performance Software by Arvid Norberg

Context Switching

• Drain the socket each time it becomes readable

• Parse and handle each request that was received

buf.append(socket.read_all())

req, buf = parse_request(buf)
while req is not None:
    handle_req(socket, req)
    req, buf = parse_request(buf)

Page 41: Writing High-Performance Software by Arvid Norberg

Context Switching

• Write all responses in a single call at the end

Don’t flush buffer to socket until all messages are handled

buf.append(socket.read_all())

socket.cork()
req, buf = parse_request(buf)
while req is not None:
    handle_req(socket, req)
    req, buf = parse_request(buf)
socket.uncork()
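The same loop can be sketched in C++. Note the substitutions: instead of corking the socket, this version collects the responses in a user-space string and issues one write at the end, which has a similar effect on the number of system calls. The fake_socket type, the newline-delimited framing and the "ok" responses are assumptions made so the example is self-contained; the slides do not specify a protocol.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical stand-in for a socket: records what was written and how
// many write calls it took.
struct fake_socket {
    std::string wire;
    int write_calls = 0;

    void write(std::string const& bytes)
    {
        wire += bytes;
        ++write_calls;
    }
};

// Assumed framing: one request per '\n'-terminated line. Removes and
// returns the first complete request, or nothing if only a partial
// request is buffered.
std::optional<std::string> parse_request(std::string& buf)
{
    std::size_t const end = buf.find('\n');
    if (end == std::string::npos) return std::nullopt;
    std::string req = buf.substr(0, end);
    buf.erase(0, end + 1);
    return req;
}

// Placeholder request handler: produces the response bytes.
std::string handle_req(std::string const& req)
{
    return "ok " + req + "\n";
}

// Called once per readable event with everything drained from the socket.
// Handles every complete request, then writes all responses in one call.
// Returns the number of requests handled.
int on_readable(fake_socket& s, std::string& buf, std::string const& received)
{
    buf += received;

    std::string out;
    int handled = 0;
    while (auto req = parse_request(buf)) {
        out += handle_req(*req);
        ++handled;
    }

    if (!out.empty()) s.write(out);
    return handled;
}
```

Three requests arriving together cost one write call rather than three, and a trailing partial request stays in the buffer until the rest of it arrives.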

Page 42: Writing High-Performance Software by Arvid Norberg

Socket Programming

• There are two ways to read from sockets

• Wait for readable event then read (POSIX)

• Read async. then wait for completion event (Win32)

Page 43: Writing High-Performance Software by Arvid Norberg

Socket Programming

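// kqueue: <sys/types.h>, <sys/event.h>, <sys/time.h>; read: <unistd.h>
struct kevent ev[100]; // "struct" is required: kevent is also a function name

int events = kevent(queue, nullptr, 0, ev, 100, nullptr);

for (int i = 0; i < events; ++i) {
    int size = read(ev[i].ident, buffer, buffer_size);
    /* ... */
}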

Page 44: Writing High-Performance Software by Arvid Norberg

Socket Programming

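// kqueue: <sys/types.h>, <sys/event.h>, <sys/time.h>; read: <unistd.h>
struct kevent ev[100]; // "struct" is required: kevent is also a function name

// Wait for the socket to become readable
int events = kevent(queue, nullptr, 0, ev, 100, nullptr);

for (int i = 0; i < events; ++i) {
    // Copy data from kernel space to user space
    int size = read(ev[i].ident, buffer, buffer_size);
    /* ... */
}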

Page 45: Writing High-Performance Software by Arvid Norberg

Socket Programming

WSABUF b = { buffer_size, buffer };
DWORD transferred = 0, flags = 0;
WSAOVERLAPPED ov; // [ initialization ]
int ret = WSARecv(s, &b, 1, &transferred, &flags, &ov, nullptr);

WSAOVERLAPPED* ol;
ULONG_PTR key; // passed by address; the API expects a ULONG_PTR*
BOOL r = GetQueuedCompletionStatus(port, &transferred, &key, &ol, INFINITE);

ret = WSAGetOverlappedResult(s, &ov, &transferred, false, &flags);

Page 46: Writing High-Performance Software by Arvid Norberg

Socket Programming

// Initiate async. read into buffer
WSABUF b = { buffer_size, buffer };
DWORD transferred = 0, flags = 0;
WSAOVERLAPPED ov; // [ initialization ]
int ret = WSARecv(s, &b, 1, &transferred, &flags, &ov, nullptr);

// Wait for operations to complete
WSAOVERLAPPED* ol;
ULONG_PTR key; // passed by address; the API expects a ULONG_PTR*
BOOL r = GetQueuedCompletionStatus(port, &transferred, &key, &ol, INFINITE);

// Query status
ret = WSAGetOverlappedResult(s, &ov, &transferred, false, &flags);

Page 47: Writing High-Performance Software by Arvid Norberg

Socket Programming

• Passing in a buffer up-front is preferable because:

• NIC driver can in theory receive data directly into your buffer and save a copy

• If there is a memory copy, it can be done asynchronously, not blocking your thread

Page 48: Writing High-Performance Software by Arvid Norberg

Socket Programming

• Problem: What buffer size should be used?

• Too large will waste memory

• Too small will waste system calls (since we need multiple calls to drain the socket)

Page 49: Writing High-Performance Software by Arvid Norberg

Socket Programming

• Problem: What buffer size should be used?

• Start with some reasonable buffer size

• If an async read fills the whole buffer, increase size

• If an async read returns significantly less than the buffer size, decrease size

Size adjustments should be proportional to the buffer size
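The adjustment rule can be sketched as a small state object. The slides give no concrete factors, so the ones below are assumptions chosen only for illustration: double when a read fills the buffer, halve when a read returns less than a quarter of it. The struct name and the bounds are likewise made up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Tracks the receive buffer size to use for the next async read.
struct adaptive_buffer_size {
    std::size_t size;     // current buffer size
    std::size_t min_size; // lower bound, to avoid shrinking to nothing
    std::size_t max_size; // upper bound, to cap memory use

    // Call with the byte count of each completed read.
    void on_read(std::size_t bytes_read)
    {
        if (bytes_read >= size) {
            // The read filled the buffer: there was probably more to read.
            size = std::min(size * 2, max_size);
        } else if (bytes_read < size / 4) {
            // The read used a small fraction of the buffer: shrink it.
            size = std::max(size / 2, min_size);
        }
        // Otherwise the size is about right; leave it alone.
    }
};
```

Both adjustments scale with the current size rather than adding a fixed amount, so the size converges in a logarithmic number of reads. The gap between the grow and shrink thresholds keeps the size from oscillating when reads hover around one value.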

Page 50: Writing High-Performance Software by Arvid Norberg

Context Switching

Adapt batch size to the computer’s natural granularity

Higher load should lead to larger batches, fewer context switches and higher efficiency.

Use of magic numbers is a red flag

Page 51: Writing High-Performance Software by Arvid Norberg

Message Queues

Page 52: Writing High-Performance Software by Arvid Norberg

Message Queues

• Events on message queues may come in batches

• Example: we receive one message per 16 kiB block read from disk.

void conn::on_disk_read(buffer const& buf) {
    m_socket.write(&buf[0], buf.size());
}

Page 53: Writing High-Performance Software by Arvid Norberg

Message Queues

• Problem: We want to flush our sockets right before we go to sleep, i.e. when we have drained the message queue, without starvation

Page 54: Writing High-Performance Software by Arvid Norberg

Message Queues

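void conn::on_disk_read(buffer const& buf) {
    // assumes buffer and m_buf are contiguous byte containers
    m_buf.insert(m_buf.end(), buf.begin(), buf.end());
    if (m_has_flush_msg) return;
    m_has_flush_msg = true;
    m_queue.post(std::bind(&conn::flush, this));
}

void conn::flush() {
    m_socket.write(&m_buf[0], m_buf.size());
    // reset, so the next batch starts empty and posts its own flush
    m_buf.clear();
    m_has_flush_msg = false;
}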

Page 55: Writing High-Performance Software by Arvid Norberg

Message Queues

void conn::on_disk_read(buffer const& buf) {
    // Instead of writing to the socket, accumulate the buffers
    // (assumes buffer and m_buf are contiguous byte containers)
    m_buf.insert(m_buf.end(), buf.begin(), buf.end());
    // If there is no outstanding flush message, post one
    if (m_has_flush_msg) return;
    m_has_flush_msg = true;
    m_queue.post(std::bind(&conn::flush, this));
}

// Flush all buffers when all messages have been handled
void conn::flush() {
    m_socket.write(&m_buf[0], m_buf.size());
    // reset, so the next batch starts empty and posts its own flush
    m_buf.clear();
    m_has_flush_msg = false;
}
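To see the coalescing work end to end, the pattern can be run against a toy queue. Everything around conn in this sketch is an assumption: message_queue is a minimal single-threaded FIFO, counting_socket only counts calls and bytes, and buffer is taken to be std::vector<char>. It also resets the flag and clears the buffer in flush() so the connection can be reused for the next batch.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

using buffer = std::vector<char>; // assumed representation

// Minimal single-threaded FIFO message queue (hypothetical).
struct message_queue {
    std::deque<std::function<void()>> messages;

    void post(std::function<void()> f) { messages.push_back(std::move(f)); }

    // Run messages in FIFO order, including ones posted while running.
    void run()
    {
        while (!messages.empty()) {
            std::function<void()> f = std::move(messages.front());
            messages.pop_front();
            f();
        }
    }
};

// Stand-in for a socket: counts write calls and bytes.
struct counting_socket {
    std::size_t write_calls = 0;
    std::size_t bytes_written = 0;

    void write(char const*, std::size_t len)
    {
        ++write_calls;
        bytes_written += len;
    }
};

struct conn {
    explicit conn(message_queue& q) : m_queue(q) {}

    void on_disk_read(buffer const& buf)
    {
        // Accumulate instead of writing right away.
        m_buf.insert(m_buf.end(), buf.begin(), buf.end());
        if (m_has_flush_msg) return;
        m_has_flush_msg = true;
        // The flush lands behind every message already in the queue.
        m_queue.post([this] { flush(); });
    }

    void flush()
    {
        m_has_flush_msg = false;
        if (m_buf.empty()) return;
        m_socket.write(m_buf.data(), m_buf.size());
        m_buf.clear();
    }

    counting_socket m_socket;

private:
    message_queue& m_queue;
    buffer m_buf;
    bool m_has_flush_msg = false;
};
```

Because the queue is FIFO, the flush posted by the first disk-read message runs only after the disk-read messages that were already queued behind it, so a burst of block reads results in a single socket write.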

Page 56: Writing High-Performance Software by Arvid Norberg

Message Queues

[Figure: FIFO message queue, with labels "Message handler" and "Flush message"]