
Writing High-Performance Software by Arvid Norberg

DESCRIPTION

BitTorrent Chief Architect Arvid Norberg on Writing high-performance software.

Performance

  • Longer battery life (not only for when things need to run fast)

Memory Cache

Typical memory cache hierarchy (Core i5 Sandy Bridge):

  • Main memory (16 GiB)
  • L3 (6 MiB)
  • L2 (256 kiB), one per core
  • L1 (32 kiB), one per core
  • Core 1, Core 2, Core 3, Core 4

BitTorrent, Inc. | Writing High-Performance Software
For Internal Presentation Purposes Only, Not For External Distribution.

Memory Latency

[Chart, built up over five slides: memory latencies on a Core i5 Sandy Bridge. Bars are added one per slide while the time axis grows: register (0 to 0.5 ns), L1 cache (0 to 1.3 ns), L2 cache (0 to 4 ns), L3 cache (0 to 15 ns), DRAM (0 to 90 ns). The final slide carries the annotation "61.8 x". Source: http://www.7-cpu.com/cpu/IvyBridge.html]

Memory Latency

  • When a CPU is waiting for memory, it is busy (i.e. you will see 100% CPU usage, even if your bottleneck is waiting for memory)
  • Memory access patterns are a significant factor in performance
  • Constant cache misses make your program run up to 2 orders of magnitude slower than constant cache hits

Memory Cache

[Figure: the memory you requested is a small part of a cache line; the whole cache line is the memory pulled into the cache]

Memory Latency

  • CPUs prefetch memory automatically if they can recognize your access pattern (sequential is easy)
  • CPUs predict branches in order to prefetch instruction memory
  • The memory access pattern is determined not only by data access but also by control flow (indirect jumps stall execution on a memory lookup)

Memory Cache

  • For linear memory reads, the CPU will prefetch memory
  • For random memory reads, there is no prefetch and most memory accesses will cause a cache miss

[Figure for both cases: a row of 64-byte cache lines]
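The difference between linear and random reads can be tried out with a small experiment. The sketch below is not from the slides: the function names and the use of an index array to fix the visiting order are assumptions made for illustration. It sums the same data twice, once in index order and once in shuffled order, so the only thing that changes is the access pattern.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Visiting order 0, 1, 2, ...: neighbouring elements share cache lines and
// the access pattern is easy for a hardware prefetcher to recognize.
std::vector<std::size_t> sequential_order(std::size_t n) {
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    return order;
}

// The same indices in shuffled order (hypothetical helper). Once the data is
// larger than the caches, most of these accesses are expected to miss.
std::vector<std::size_t> shuffled_order(std::size_t n, std::uint64_t seed) {
    std::vector<std::size_t> order = sequential_order(n);
    std::mt19937_64 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}

// Sum `data`, visiting its elements in the given order.
std::uint64_t sum_in_order(const std::vector<std::uint32_t>& data,
                           const std::vector<std::size_t>& order) {
    std::uint64_t sum = 0;
    for (std::size_t i : order) {
        assert(i < data.size());
        sum += data[i];
    }
    return sum;
}

// Wall-clock time of one traversal in milliseconds; the sum is returned
// through `sum_out` so the compiler cannot discard the loop.
double traversal_ms(const std::vector<std::uint32_t>& data,
                    const std::vector<std::size_t>& order,
                    std::uint64_t& sum_out) {
    const auto start = std::chrono::steady_clock::now();
    sum_out = sum_in_order(data, order);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

To see an effect, `data` should be much larger than the last-level cache (tens of MiB, say), and `traversal_ms` should be called with both orders in an optimized build. Both traversals compute the same sum; how far apart the timings are depends on the machine, so no numbers are claimed here.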
Data Structures

  • Array of pointers to objects, and linked lists: more cache pressure / cache misses
  • Array of objects: less cache pressure / cache hits

Data Structures

One optimization is to refactor your single list of heterogeneous objects into one list per type.

  • Objects would be laid out sequentially in memory
  • Virtual function dispatch could become static

Data Structures

    std::vector<std::unique_ptr<shape>> shapes;
    for (auto& s : shapes) s->draw();

Pointers need dereferencing + vtable lookup.

    std::vector<rectangle> rectangles;
    std::vector<circle> circles;
    for (auto& s : rectangles) s.draw();
    for (auto& s : circles) s.draw();

Objects packed back-to-back, sequential memory access, no vtable lookup.

Data Structures

  • For heap-allocated objects, put the most commonly used ("hot") fields in the first cache line
  • Avoid unnecessary padding

Data Structures: Padding

    struct A {
        int a;
        void* b;
        int c;
    };

    struct A [24 Bytes]
     0: [int : 4] a
        --- 4 Bytes padding ---
     8: [void* : 8] b
    16: [int : 4] c
        --- 4 Bytes padding ---

    struct B {
        void* b;
        int a;
        int c;
    };

    struct B [16 Bytes]
     0: [void* : 8] b
     8: [int : 4] a
    12: [int : 4] c

Context Switching

One significant source of cache misses is switching context, and switching the data set being worked on.

Context switch:

  • Thread / process switching
  • User space -> kernel space

Context Switching

  • Lower the cost of context switching by amortizing it over as much work as possible
  • Reduce the number of system calls by passing as much work as possible per call
  • Reduce thread wake-ups/sleeps by batching work
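One way to batch work across thread wake-ups is to let the consumer take the whole queue each time it wakes, rather than one job per wake-up. The following is a minimal sketch under that assumption; `JobQueue`, `wait_and_drain` and `run_batch` are hypothetical names, not something the slides define.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical job queue, for illustration only.
class JobQueue {
public:
    using Job = std::function<void()>;

    // Producer side: queue one job and wake the consumer if it is asleep.
    void push(Job job) {
        assert(job != nullptr);
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push_back(std::move(job));
        }
        ready_.notify_one();
    }

    // Consumer side: sleep until at least one job is queued, then take
    // every queued job in one go. The cost of the wake-up and of the lock
    // acquisition is paid once per batch instead of once per job.
    std::deque<Job> wait_and_drain() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !jobs_.empty(); });
        std::deque<Job> batch;
        batch.swap(jobs_);
        return batch;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<Job> jobs_;
};

// Run all jobs of one batch outside the lock; returns how many were run.
std::size_t run_batch(JobQueue& queue) {
    std::deque<JobQueue::Job> batch = queue.wait_and_drain();
    for (auto& job : batch) job();
    return batch.size();
}
```

A worker thread would call `run_batch` in a loop. Swapping the queue out under the lock also keeps the critical section short, since the jobs themselves run after the lock has been released.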
Context Switching

When a thread wakes up, do as much work as possible before going to sleep:

  • Drain the socket of received bytes
  • Drain the job queue

Context Switching (traffic analogy)

  • One car at a time
  • The whole queue at a time

Context Switching

Every time the socket becomes readable, read and handle one request:

    buf = socket.read_one_request()
    req = parse_request(buf)
    handle_req(socket, req)

Context Switching

Drain the s
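For contrast with handling one request per readable event, the earlier advice to drain the socket of received bytes could look roughly like the sketch below. It is an illustration under stated assumptions, not code from the slides: it uses POSIX `read` and `fcntl` on a non-blocking descriptor, the helper names are hypothetical, and requests are assumed to be newline-terminated purely to keep the example short.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>

#include <fcntl.h>
#include <unistd.h>

// Put a descriptor into non-blocking mode so read() returns instead of
// sleeping once no more bytes are available.
bool set_nonblocking(int fd) {
    const int flags = ::fcntl(fd, F_GETFL, 0);
    return flags != -1 && ::fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}

// Read everything that is currently available on a non-blocking descriptor.
// A real server would probably cap the amount read per wake-up.
std::string drain_fd(int fd) {
    assert(fd >= 0);
    std::string received;
    char buf[4096];
    for (;;) {
        const ssize_t n = ::read(fd, buf, sizeof buf);
        if (n > 0) {
            received.append(buf, static_cast<std::size_t>(n));
        } else if (n < 0 && errno == EINTR) {
            continue;  // interrupted by a signal, retry
        } else {
            break;     // 0: end of stream; <0: drained (EAGAIN) or an error
        }
    }
    return received;
}

// Handle every complete request in `pending` in one pass and leave any
// incomplete tail in place. Newline framing is an assumption of this sketch.
template <class Handler>
std::size_t handle_complete_requests(std::string& pending, Handler&& handle) {
    std::size_t handled = 0;
    std::size_t start = 0;
    std::size_t end = pending.find('\n', start);
    while (end != std::string::npos) {
        handle(pending.substr(start, end - start));
        ++handled;
        start = end + 1;
        end = pending.find('\n', start);
    }
    pending.erase(0, start);
    return handled;
}
```

On each readable event the caller would append `drain_fd(fd)` to a per-connection buffer and then call `handle_complete_requests` on it, so one wake-up covers however many requests have arrived.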
