Make the NICs Move!
Adventures in Network Performance Tuning

Jeremy Eder, Principal Performance Engineer, Red Hat
Martin Porter, Vice President, Software Development, Solarflare

June 14, 2013
Introducing Solarflare

● High-performance, low-latency 10GbE server adapters
● Creator of OpenOnload, a kernel-bypass solution for ultra-low latency
● 700+ customers worldwide
● Clear leader in financial services
  ● 7 of 10 worldwide exchanges
  ● Top commercial banks, trading firms, hedge funds
● Accelerates mission-critical applications
  ● Trading (e.g. HFT, matching, order routing)
  ● High Performance Computing
  ● Storage
  ● Media & Entertainment
  ● Cloud / Virtualization
  ● Big Data
● OEM with IBM and HP
Agenda
● Performance Tuning Theory/Basics
● Know Your Hardware/NUMA
● Tuned/Power Management
● Low Latency Network Tuning
● PTP/SR-IOV
● STAC™ Benchmarks
Performance Tuning Food Groups
[Diagram: the performance tuning "food groups": CPU, Memory, and I/O (Storage and Network)]
Cover The Basics
● Disable unnecessary services, runlevel 3
● Follow vendor guidelines for BIOS tuning
  ● Logical cores? Power management? Turbo?
● In the OS, consider (sketch below):
  ● Disabling the filesystem journal
  ● Ensuring filesystems are mounted with relatime
  ● SSD/memory storage
  ● Running swapless
  ● Reducing writeback thresholds if your app does disk I/O
NUMA Topology
[Diagram: four-socket NUMA topology. Each of the four 8-core sockets (Socket #0 to #3) with its local RAM (RAM 0 to 3) forms one NUMA node (Nodes A to D). Per-node cgroups #A.1 through #D.1 confine Instances #1 through #4 to their local nodes.]
Know Your Hardware (hwloc)
[hwloc topology diagram showing the PCI locality of the Solarflare SFN6322 adapter]
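hwloc's lstopo tool draws maps like this one (RHEL 6.4's hwloc adds PCI bus reporting); both commands below are standard hwloc utilities:

# lstopo-no-graphics          (ASCII rendering of the topology in the terminal)
# lstopo topology.txt         (export to a file; format inferred from the extension)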
Know Your Hardware (lscpu)

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1        <-- Hyperthreading off
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Stepping:              7
CPU MHz:               2900.000
BogoMIPS:              5791.34
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7      <-- Core --> NUMA node map
NUMA node1 CPU(s):     8-15
NUMA Affinity CLI Reference
# numactl -N1 -m1 ./command
● Sets CPU affinity for 'command' to CPU node 1
● Allocates memory out of Memory node 1
● Chose node 1 because of PCI-bus wiring
● Upstream kernel community working on automatic NUMA balancing.
● Experiment with “numad” in RHEL 6.4
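Before pinning, check the node layout and distances. numactl --hardware is a standard flag, and numad can be run as a service in RHEL 6.4:

# numactl --hardware        (nodes, their CPUs, free memory, and the distance matrix)
# service numad start       (automatic NUMA placement daemon)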
Enhanced numastat: Global stats

# numastat -mzcs

Per-node system memory usage (in MBs):
                 Node 0   Node 1    Total
                 ------   ------   ------
MemTotal         147421   147456   294877
MemFree          143226   142781   286007
MemUsed            4196     4675     8870
Active              612      837     1449
Active(anon)        496      725     1221
AnonPages           301      525      826

Ground-up re-write of numastat in RHEL 6.4
Enhanced numastat: Per-process stats

# numastat -czs -p java

Per-node process memory usage (in MBs) for PID 12160 (java)
             Node 0   Node 1    Total
             ------   ------   ------
Private         674      606     1280
Stack            10       10       20
Heap              0        0        0
             ------   ------   ------
Total           685      616     1301
System asymmetry: Intel Data-Direct I/O
● New with the Intel Xeon E5 family (Sandy Bridge and later)
● Allows PCI adapters to talk directly with the CPU cache
● Reduced latency when adapter and application are configured to use the DDIO-enabled NUMA node.
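The node an adapter is wired to is visible in sysfs; p1p1 follows the interface naming used elsewhere in this deck, and the output value and application name are illustrative:

# cat /sys/class/net/p1p1/device/numa_node
1
# numactl -N1 -m1 ./app     (co-locate the app with the DDIO-enabled node)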
(Re)-introducing the “tuned” package
# tuned-adm list
Available profiles:
- latency-performance
- default
- enterprise-storage
- virtual-guest
- throughput-performance
- virtual-host
Current active profile: latency-performance
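Selecting and verifying a profile; both are standard tuned-adm subcommands:

# tuned-adm profile latency-performance
# tuned-adm active
Current active profile: latency-performance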
“tuned” Profile Summary

Tunable                             default   enterprise-  virtual-  virtual-  latency-     throughput-
                                              storage      host      guest     performance  performance
kernel.sched_min_granularity_ns     4ms       10ms         10ms      10ms      -            10ms
kernel.sched_wakeup_granularity_ns  4ms       15ms         15ms      15ms      -            15ms
vm.dirty_ratio                      20% RAM   40%          10%       40%       -            40%
vm.dirty_background_ratio           10% RAM   -            5%        -         -            -
vm.swappiness                       60        -            10        30        -            -
I/O Scheduler (Elevator)            CFQ       deadline     deadline  deadline  deadline     deadline
Filesystem Barriers                 On        Off          Off       Off       -            -
CPU Governor                        ondemand  performance  -         -         performance  performance
Disk Read-ahead                     -         4x           -         -         -            -
Disable THP                         -         -            -         -         Yes          -
CPU C-States                        -         -            -         -         Locked @ 1   -
Locality of Packets

[Diagram: four customer streams steered to dedicated cores on one socket; Stream from Customer 1 → Socket1/Core1, Customer 2 → Socket1/Core2, Customer 3 → Socket1/Core3, Customer 4 → Socket1/Core4]
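One way to get this per-stream locality is an ntuple flow-steering rule per customer; a hedged sketch, with the ports and queue numbers purely illustrative:

# ethtool -K p1p1 ntuple on                                (enable flow steering)
# ethtool -U p1p1 flow-type tcp4 dst-port 5001 action 1    (customer 1 → RX queue 1)
# ethtool -U p1p1 flow-type tcp4 dst-port 5002 action 2    (customer 2 → RX queue 2)

Then pin each queue's IRQ to the matching core (see the tuna examples later in this deck).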
SR-IOV: RHEL 6.4
Round-trip Latencies Into Guest (Lower is Better)

[Bar chart: Min / Mean / 99.9% / StdDev round-trip latencies in microseconds across RHEL 6.4 configurations (tuned, untuned, SR-IOV tuned, bridge tuned), comparing bare metal, KVM + SR-IOV, and KVM + bridge; SR-IOV tracks bare metal far more closely than bridged networking]
CPU Tuning: P-states (frequency)

● Variable frequencies for each core
P-state Impact on Latency (Lower is better)

[Bar chart: latency in microseconds by CPU frequency policy

                   StdDev   Average   Max
Powersave             3       17       37
Ondemand              3       13       33
Performance           3       13       36
Performance + C0      0.4     12       18]
Default
pk cor CPU %c0 GHz TSC %c1 %c3 %c6 %c7
0 0 0 0.24 2.93 2.88 5.72 1.32 0.00 92.72
0 1 1 2.54 3.03 2.88 3.13 0.15 0.00 94.18
0 2 2 2.29 3.08 2.88 1.47 0.00 0.00 96.25
0 3 3 1.75 1.75 2.88 1.21 0.47 0.12 96.44
latency-performance
pk cor CPU %c0 GHz TSC %c1 %c3 %c6 %c7
0 0 0 0.00 3.30 2.90 100.00 0.00 0.00 0.00
0 1 1 0.00 3.30 2.90 100.00 0.00 0.00 0.00
0 2 2 0.00 3.30 2.90 100.00 0.00 0.00 0.00
0 3 3 0.00 3.30 2.90 100.00 0.00 0.00 0.00
Turbostat shows P-states and C-states on Intel CPUs. turbostat begins shipping in RHEL 6.4, in the cpupowerutils package.
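Both tables come from simply running turbostat, optionally wrapping a command so it samples for that command's duration:

# turbostat sleep 10        (report P/C-state residency over a 10-second window)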
Network Tuning: Low Latency TCP
● Set TCP_NODELAY (disables Nagle)
● Experiment with ethtool offloads (see the sketch below)
● tcp_low_latency: no substantive benefit found
● Ensure kernel buffers are “right-sized”
  ● Use ss (Recv-Q, Send-Q)
● Don't setsockopt unless you've really tested
  ● Review old code to see if you're using setsockopt; it might be hurting performance
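Offload experiments are quick from the shell; which offloads help is workload-dependent, and GRO here is just one example:

# ethtool -k p1p1               (list current offload settings)
# ethtool -K p1p1 gro off       (toggle one offload at a time, then re-measure)
# ss -tn                        (watch Recv-Q / Send-Q while the benchmark runs)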
Network Tuning: Low Latency UDP
● Mainly about managing bursts, avoiding drops
  ● rmem_max / wmem_max
● TX
  ● txqueuelen
● RX
  ● netdev_max_backlog
  ● ethtool -g
  ● ethtool -c
  ● netdev_budget
● Dropwatch tool in RHEL (examples of these knobs below)
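Hedged examples of the knobs above; the sizes are illustrative starting points, not recommendations:

# sysctl -w net.core.rmem_max=16777216          (permit larger UDP receive buffers)
# sysctl -w net.core.netdev_max_backlog=5000    (deeper per-CPU RX backlog)
# ip link set p1p1 txqueuelen 5000              (deeper TX queue)
# ethtool -g p1p1                               (show NIC RX/TX ring sizes)
# ethtool -G p1p1 rx 4096                       (grow the RX ring, if the NIC allows)
# dropwatch -l kas                              (then type 'start' to see where drops happen)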
Network Tuning: Buffer Bloat
● Kernel buffers: watch Recv-Q / Send-Q with ss (output below)
● NIC ring buffers: # ethtool -g p1p1
● 10G line-rate
● ~4MB queue depth
● Matching servers
[Chart: Kernel Buffer Queue Depth, 10Gbit TCP_STREAM; Send-Q depth in MB sampled at 1-second intervals, climbing to roughly 4 MB]
# ss | grep -v ssh
State   Recv-Q   Send-Q   Local Address:Port    Peer Address:Port
ESTAB   0        0        172.17.1.36:38462     172.17.1.34:12865
ESTAB   0        3723128  172.17.1.36:58856     172.17.1.34:53491
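The Send-Q curve in the chart above can be reproduced by sampling ss once per second; a minimal sketch:

# while :; do ss -tn | awk 'NR>1 {print $3}'; sleep 1; done     (per-socket Send-Q depth, once per second)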
Tuna (new in RHEL6.4)
● CPU affinity for IRQs
● CPU affinity for PIDs
● Scheduler policy
● Scheduler priority
Tuna IRQ/CPU affinity context menus
Tuna – for processes
# tuna -t netserver -P
thread ctxt_switches
pid SCHED_ rtpri affinity voluntary nonvoluntary cmd
13488 OTHER 0 0xfff 1 0 netserver
# tuna -c2 -t netserver -m
# tuna -t netserver -P
thread ctxt_switches
pid SCHED_ rtpri affinity voluntary nonvoluntary cmd
13488 OTHER 0 2 1 0 netserver
● Try raising vm.stat_interval sysctl
Tuna – for IRQs
● Move 'p1p1*' IRQs to Socket 0:
# tuna -q p1p1* -S0 -m -x
# tuna -Q | grep p1p1
78 p1p1-0 0 sfc
79 p1p1-1 1 sfc
80 p1p1-2 2 sfc
81 p1p1-3 3 sfc
82 p1p1-4 4 sfc
...
Network Tuning: IRQ affinity
● Use irqbalance for the common case
● New irqbalance automates NUMA affinity for IRQs
● Flow-Steering Technologies
● Move 'p1p1*' IRQs to Socket 1:
# tuna -q p1p1* -S1 -m -x
# tuna -Q | grep p1p1
● Manual IRQ pinning for the last X percent / determinism (sketch below)
● Guide on Red Hat Customer Portal
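Manual pinning writes a CPU bitmask into /proc; IRQ 78 comes from the tuna listing above, and the target CPU is illustrative:

# echo 4 > /proc/irq/78/smp_affinity     (bitmask 0x4 = CPU 2)
# cat /proc/irq/78/smp_affinity          (verify; prints the hex mask back)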
Tuna – for core/socket isolation
# grep Cpus_allowed_list /proc/`pgrep rsyslogd`/status
Cpus_allowed_list: 0-15
# tuna -S1 -i   (isolate socket 1; tuna sets affinity of the 'init' task as well)
# grep Cpus_allowed_list /proc/`pgrep rsyslogd`/status
Cpus_allowed_list: 0,1,2,3,4,5,6,7
Precision Time Protocol (IEEE-1588v2)
● Tech Preview in RHEL 6.4
  ● Limited driver enablement in 6.4
  ● Goal to enable Solarflare + PTP on RHEL 6
● Improved synchronization accuracy over NTP
  ● PTP hardware timestamping is most accurate
● Query your NIC's PTP capabilities: ethtool -T p1p1
● Improve time sync by disabling the tickless kernel: nohz=off (sketch below)
  ● Increased power consumption
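nohz=off is a kernel boot parameter; on RHEL 6 it is appended to the kernel line in GRUB legacy's /boot/grub/grub.conf. The kernel version and root device below are illustrative:

kernel /vmlinuz-2.6.32-358.el6.x86_64 ro root=/dev/mapper/vg_root nohz=off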
Precision Time Protocol w/Solarflare NICs
● Single adapter / single network
  ● Lowest latency in the industry
● Hardware timestamps of RX and TX packets
● High-precision Stratum 3 clock
  ● Clock frequency/offset disciplined by driver
● 1PPS input for calibration
● Maintains server synchronization within +/- 200 ns of master clock
[Pictured: Solarflare SFN6322F adapter]
Precision Time Protocol (IEEE-1588v2)
[Charts: clock offset over time with nohz=off vs. nohz=on; nohz=off holds noticeably tighter synchronization]
STAC™ Benchmarks

● Network benchmarks for the financial trading industry that are neutral with respect to vendor, network API, and network transport
● http://www.stacresearch.com/nio
● Assisting w/STAC® Network I/O Benchmark efforts
  ● Submitted first set in Jan 2013
  ● Second set due approx. Q2 CY2013
Data Path for STAC-N

[Diagram: data path through host and NIC for the STAC-N benchmark harness]
Network Tuning: Example Pinning Map

Producer (Socket 0)                       Consumer (Socket 0)
Core #   Task                             Core #   Task
1        producer_harness_wrapper         1        consumer_harness_wrapper
2        producer sender                  2        consumer sender
3        producer receiver                3        consumer receiver
4        controller                       4        reflector
5        free                             5        free
6        p1p1 interrupts                  6        p1p1 interrupts

Producer (Socket 1)                       Consumer (Socket 1)
8        em1 interrupts                   8        em1 interrupts
9        everything else userspace        9        everything else userspace
10-13    ...                              10-13    ...
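Rows of this map translate directly into tuna invocations; the thread names below are hypothetical handles for the harness threads named in the table:

# tuna -c2 -t producer_sender -m       (sender thread to core 2)
# tuna -c3 -t producer_receiver -m     (receiver thread to core 3)
# tuna -q p1p1* -c6 -m -x              (p1p1 interrupts to core 6)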
Transparent Hugepages

[Charts: latency with Transparent Hugepages enabled vs. with Transparent Hugepages disabled]
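The latency-performance profile disables THP for you; to flip it by hand on RHEL 6 (note the redhat_ prefix in RHEL 6's sysfs path; upstream kernels drop it):

# echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled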
Solarflare Hybrid SR-IOV

Traditional Virtualization (no SR-IOV)
● Traditional VM networking architecture

Typical SR-IOV Implementation
● Improved performance
● Loss of ability to perform migration

Hybrid SR-IOV Model
● Improved performance
● Migration still available
● Not yet available in RHEL6 – working with community on upstream solution
● 2048 virtual NICs (vNIC) for unmatched performance and application scalability
● Improved network/latency over bridging in KVM
Red Hat and Solarflare are #1 in SPECvirt
[Bar chart: best published SPECvirt_sc2010 results by socket count (2, 4, and 8 sockets), VMware vs. RHEL (KVM) or RHEV; score labels from the chart: 1,878 @120 VMs; 2,442 @150 VMs; 3,824 @234 VMs; 4,682 @288 VMs; 8,956 @552 VMs]

Comparison based on best performing Red Hat and VMware solutions by CPU socket count published at www.spec.org as of May 17, 2013. SPEC® and the benchmark name SPECvirt_sc® are registered trademarks of the Standard Performance Evaluation Corporation. For more information about SPECvirt_sc2010, see www.spec.org/virt_sc2010/.
Solarflare OpenOnload Application Acceleration
● Accelerated performance
  ● Kernel bypass: streamlines and reduces interrupts, context switches, and data copies
  ● TCP/IP, UDP and multicast acceleration
  ● Reduces latency by 50%, increases message rates 3x or more
● Seamless integration
  ● Binary compatible with industry-standard APIs
  ● No software modifications are needed
  ● Standards-based solution uses TCP/IP and UDP
  ● Compatible with existing Ethernet infrastructure
● Open source GPLv2
● Red Hat Support of 3rd Party Drivers: https://access.redhat.com/site/articles/1067
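Because OpenOnload is binary-compatible, acceleration is a wrapper away; the onload launcher is the standard entry point, and the application name here is illustrative:

# onload ./trading_app          (run an unmodified binary with kernel bypass)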
Interesting new Network/Perf things in RHEL6.4
● tuna included
● latency-performance “tuned” profile beefed up
  ● Lock C-states
  ● Disable Transparent Hugepages
● turbostat included in cpupowerutils package
● hwloc now reports PCI bus topology
● PTP Tech Preview
● omping (multicast test)
Helpful Links
● Red Hat Low Latency Performance Tuning Guide
● Optimizing RHEL Performance by Tuning IRQ Affinity
● Red Hat Performance Tuning Guide
● Red Hat Virtualization Tuning Guide
● STAC Network I/O SIG
● Finteligent Low Latency Tuning w/KVM
● Blog: http://www.breakage.org/ or @jeremyeder
● Research Report: Virtualisation for trading - Red Hat Enterprise Linux 6 Kernel-based Virtual Machine (KVM) Hypervisor
● Uses PCI-passthrough of Solarflare NICs
Red Hat Summit 2013 App Session Survey

Sessions and Labs → Application and platform infrastructure II → Fri, Jun 14 → Make the NICs Move → Polls and Surveys