Make the NICs Move!
Adventures in Network Performance Tuning

Jeremy Eder, Principal Performance Engineer, Red Hat
Martin Porter, Vice President, Software Development, Solarflare

June 14, 2013
Introducing Solarflare

● High-performance, low-latency 10GbE server adapters
● Creator of OpenOnload, a kernel-bypass solution for ultra-low latency
● 700+ customers worldwide
● Clear leader in financial services
  ● 7 of 10 worldwide exchanges
  ● Top commercial banks, trading firms, hedge funds
● Accelerates mission-critical applications
  ● Trading (e.g. HFT, matching, order routing)
  ● High Performance Computing
  ● Storage
  ● Media & Entertainment
  ● Cloud / Virtualization
  ● Big Data
● OEM with IBM and HP
Agenda
● Performance Tuning Theory/Basics
● Know Your Hardware/NUMA
● Tuned/Power Management
● Low Latency Network Tuning
● PTP/SR-IOV
● STAC™ Benchmarks
Performance Tuning Food Groups
[Diagram: the performance tuning "food groups": CPU, Memory, and I/O (Storage and Network)]
Cover The Basics
● Disable unnecessary services, runlevel 3
● Follow vendor guidelines for BIOS tuning
  ● Logical cores? Power management? Turbo?
● In the OS, consider (sketch below):
  ● Disabling the filesystem journal
  ● Ensuring filesystems are mounted with relatime
  ● SSD/memory storage
  ● Running swapless
  ● Reducing writeback thresholds if your app does disk I/O
NUMA Topology
[Diagram: four-socket NUMA topology. Each of the four 8-core sockets (Socket #0 to #3) with its local RAM (RAM 0 to 3) forms one NUMA node (Nodes A to D). Per-node cgroups #A.1 through #D.1 confine Instances #1 through #4 to their local nodes.]
Know Your Hardware (hwloc)
[hwloc topology diagram showing the PCI locality of the Solarflare SFN6322 adapter]
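hwloc's lstopo tool draws maps like this one (RHEL 6.4's hwloc adds PCI bus reporting); both commands below are standard hwloc utilities:

# lstopo-no-graphics          (ASCII rendering of the topology in the terminal)
# lstopo topology.txt         (export to a file; format inferred from the extension)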
Know Your Hardware (lscpu)

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1        <-- Hyperthreading off
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Stepping:              7
CPU MHz:               2900.000
BogoMIPS:              5791.34
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7      <-- Core --> NUMA node map
NUMA node1 CPU(s):     8-15
NUMA Affinity CLI Reference
# numactl -N1 -m1 ./command
● Sets CPU affinity for 'command' to CPU node 1
● Allocates memory out of Memory node 1
● Chose node 1 because of PCI-bus wiring
● Upstream kernel community working on automatic NUMA balancing.
● Experiment with “numad” in RHEL 6.4
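Before pinning, check the node layout and distances. numactl --hardware is a standard flag, and numad can be run as a service in RHEL 6.4:

# numactl --hardware        (nodes, their CPUs, free memory, and the distance matrix)
# service numad start       (automatic NUMA placement daemon)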
Enhanced numastat: Global stats

# numastat -mzcs

Per-node system memory usage (in MBs):
                 Node 0   Node 1    Total
                 ------   ------   ------
MemTotal         147421   147456   294877
MemFree          143226   142781   286007
MemUsed            4196     4675     8870
Active              612      837     1449
Active(anon)        496      725     1221
AnonPages           301      525      826

Ground-up re-write of numastat in RHEL 6.4
Enhanced numastat: Per-process stats

# numastat -czs -p java

Per-node process memory usage (in MBs) for PID 12160 (java)
             Node 0   Node 1    Total
             ------   ------   ------
Private         674      606     1280
Stack            10       10       20
Heap              0        0        0
             ------   ------   ------
Total           685      616     1301
System asymmetry: Intel Data-Direct I/O
● New with the Intel Xeon E5 family (Sandy Bridge and later)
● Allows PCI adapters to talk directly with the CPU cache
● Reduced latency when adapter and application are configured to use the DDIO-enabled NUMA node.
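The node an adapter is wired to is visible in sysfs; p1p1 follows the interface naming used elsewhere in this deck, and the output value and application name are illustrative:

# cat /sys/class/net/p1p1/device/numa_node
1
# numactl -N1 -m1 ./app     (co-locate the app with the DDIO-enabled node)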
(Re)-introducing the “tuned” package
# tuned-adm list
Available profiles:
- latency-performance
- default
- enterprise-storage
- virtual-guest
- throughput-performance
- virtual-host
Current active profile: latency-performance
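Selecting and verifying a profile; both are standard tuned-adm subcommands:

# tuned-adm profile latency-performance
# tuned-adm active
Current active profile: latency-performance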
“tuned” Profile Summary

Tunable                             default   enterprise-  virtual-  virtual-  latency-     throughput-
                                              storage      host      guest     performance  performance
kernel.sched_min_granularity_ns     4ms       10ms         10ms      10ms      -            10ms
kernel.sched_wakeup_granularity_ns  4ms       15ms         15ms      15ms      -            15ms
vm.dirty_ratio                      20% RAM   40%          10%       40%       -            40%
vm.dirty_background_ratio           10% RAM   -            5%        -         -            -
vm.swappiness                       60        -            10        30        -            -
I/O Scheduler (Elevator)            CFQ       deadline     deadline  deadline  deadline     deadline
Filesystem Barriers                 On        Off          Off       Off       -            -
CPU Governor                        ondemand  performance  -         -         performance  performance
Disk Read-ahead                     -         4x           -         -         -            -
Disable THP                         -         -            -         -         Yes          -
CPU C-States                        -         -            -         -         Locked @ 1   -
Locality of Packets

[Diagram: four customer streams steered to dedicated cores on one socket; Stream from Customer 1 → Socket1/Core1, Customer 2 → Socket1/Core2, Customer 3 → Socket1/Core3, Customer 4 → Socket1/Core4]
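One way to get this per-stream locality is an ntuple flow-steering rule per customer; a hedged sketch, with the ports and queue numbers purely illustrative:

# ethtool -K p1p1 ntuple on                                (enable flow steering)
# ethtool -U p1p1 flow-type tcp4 dst-port 5001 action 1    (customer 1 → RX queue 1)
# ethtool -U p1p1 flow-type tcp4 dst-port 5002 action 2    (customer 2 → RX queue 2)

Then pin each queue's IRQ to the matching core (see the tuna examples later in this deck).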
SR-IOV: RHEL 6.4
Round-trip Latencies Into Guest (Lower is Better)

[Bar chart: Min / Mean / 99.9% / StdDev round-trip latencies in microseconds across RHEL 6.4 configurations (tuned, untuned, SR-IOV tuned, bridge tuned), comparing bare metal, KVM + SR-IOV, and KVM + bridge; SR-IOV tracks bare metal far more closely than bridged networking]
CPU Tuning: P-states (frequency)

● Variable frequencies for each core
P-state Impact on Latency (Lower is better)

[Bar chart: latency in microseconds by CPU frequency policy

                   StdDev   Average   Max
Powersave             3       17       37
Ondemand              3       13       33
Performance           3       13       36
Performance + C0      0.4     12       18]
Default
pk cor CPU %c0 GHz TSC %c1 %c3 %c6 %c7
0 0 0 0.24 2.93 2.88 5.72 1.32 0.00 92.72
0 1 1 2.54 3.03 2.88 3.13 0.15 0.00 94.18
0 2 2 2.29 3.08 2.88 1.47 0.00 0.00 96.25
0 3 3 1.75 1.75 2.88 1.21 0.47 0.12 96.44
latency-performance
pk cor CPU %c0 GHz TSC %c1 %c3 %c6 %c7
0 0 0 0.00 3.30 2.90 100.00 0.00 0.00 0.00
0 1 1 0.00 3.30 2.90 100.00 0.00 0.00 0.00
0 2 2 0.00 3.30 2.90 100.00 0.00 0.00 0.00
0 3 3 0.00 3.30 2.90 100.00 0.00 0.00 0.00
Turbostat shows P-states and C-states on Intel CPUs. turbostat begins shipping in RHEL 6.4, in the cpupowerutils package.
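Both tables come from simply running turbostat, optionally wrapping a command so it samples for that command's duration:

# turbostat sleep 10        (report P/C-state residency over a 10-second window)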
Network Tuning: Low Latency TCP
● Set TCP_NODELAY (disables Nagle)
● Experiment with ethtool offloads (see the sketch below)
● tcp_low_latency: no substantive benefit found
● Ensure kernel buffers are “right-sized”
  ● Use ss (Recv-Q, Send-Q)
● Don't setsockopt unless you've really tested
  ● Review old code to see if you're using setsockopt; it might be hurting performance
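Offload experiments are quick from the shell; which offloads help is workload-dependent, and GRO here is just one example:

# ethtool -k p1p1               (list current offload settings)
# ethtool -K p1p1 gro off       (toggle one offload at a time, then re-measure)
# ss -tn                        (watch Recv-Q / Send-Q while the benchmark runs)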
Network Tuning: Low Latency UDP
● Mainly about managing bursts, avoiding drops
  ● rmem_max / wmem_max
● TX
  ● txqueuelen
● RX
  ● netdev_max_backlog
  ● ethtool -g
  ● ethtool -c
  ● netdev_budget
● Dropwatch tool in RHEL (examples of these knobs below)
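Hedged examples of the knobs above; the sizes are illustrative starting points, not recommendations:

# sysctl -w net.core.rmem_max=16777216          (permit larger UDP receive buffers)
# sysctl -w net.core.netdev_max_backlog=5000    (deeper per-CPU RX backlog)
# ip link set p1p1 txqueuelen 5000              (deeper TX queue)
# ethtool -g p1p1                               (show NIC RX/TX ring sizes)
# ethtool -G p1p1 rx 4096                       (grow the RX ring, if the NIC allows)
# dropwatch -l kas                              (then type 'start' to see where drops happen)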
Network Tuning: Buffer Bloat
● Kernel buffers: watch Recv-Q / Send-Q with ss (output below)
● NIC ring buffers: # ethtool -g p1p1
● 10G line-rate
● ~4MB queue depth
● Matching servers
[Chart: Kernel Buffer Queue Depth, 10Gbit TCP_STREAM; Send-Q depth in MB sampled at 1-second intervals, climbing to roughly 4 MB]
# ss | grep -v ssh
State   Recv-Q   Send-Q   Local Address:Port    Peer Address:Port
ESTAB   0        0        172.17.1.36:38462     172.17.1.34:12865
ESTAB   0        3723128  172.17.1.36:58856     172.17.1.34:53491
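The Send-Q curve in the chart above can be reproduced by sampling ss once per second; a minimal sketch:

# while :; do ss -tn | awk 'NR>1 {print $3}'; sleep 1; done     (per-socket Send-Q depth, once per second)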
Tuna (new in RHEL6.4)
● CPU affinity for IRQs
● CPU affinity for PIDs
● Scheduler policy
● Scheduler priority
Tuna IRQ/CPU affinity context menus
Tuna – for processes
# tuna -t netserver -P
thread ctxt_switches
pid SCHED_ rtpri affinity voluntary nonvoluntary cmd
13488 OTHER 0 0xfff 1 0 netserver
# tuna -c2 -t netserver -m
# tuna -t netserver -P
thread ctxt_switches
pid SCHED_ rtpri affinity voluntary nonvoluntary cmd
13488 OTHER 0 2 1 0 netserver
● Try raising vm.stat_interval sysctl
Tuna – for IRQs
● Move 'p1p1*' IRQs to Socket 0:
# tuna -q p1p1* -S0 -m -x
# tuna -Q | grep p1p1
78 p1p1-0 0 sfc
79 p1p1-1 1 sfc
80 p1p1-2 2 sfc
81 p1p1-3 3 sfc
82 p1p1-4 4 sfc
...
Network Tuning: IRQ affinity
● Use irqbalance for the common case
● New irqbalance automates NUMA affinity for IRQs
● Flow-Steering Technologies
● Move 'p1p1*' IRQs to Socket 1:
# tuna -q p1p1* -S1 -m -x
# tuna -Q | grep p1p1
● Manual IRQ pinning for the last X percent / determinism (sketch below)
● Guide on Red Hat Customer Portal
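Manual pinning writes a CPU bitmask into /proc; IRQ 78 comes from the tuna listing above, and the target CPU is illustrative:

# echo 4 > /proc/irq/78/smp_affinity     (bitmask 0x4 = CPU 2)
# cat /proc/irq/78/smp_affinity          (verify; prints the hex mask back)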
Tuna – for core/socket isolation
# grep Cpus_allowed_list /proc/`pgrep rsyslogd`/status
Cpus_allowed_list: 0-15
# tuna -S1 -i   (isolate socket 1; tuna sets affinity of the 'init' task as well)
# grep Cpus_allowed_list /proc/`pgrep rsyslogd`/status
Cpus_allowed_list: 0,1,2,3,4,5,6,7
Precision Time Protocol (IEEE-1588v2)
● Tech Preview in RHEL 6.4
  ● Limited driver enablement in 6.4
  ● Goal to enable Solarflare + PTP on RHEL 6
● Improved synchronization accuracy over NTP
  ● PTP hardware timestamping is most accurate
● Query your NIC's PTP capabilities: ethtool -T p1p1
● Improve time sync by disabling the tickless kernel: nohz=off (sketch below)
  ● Increased power consumption
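nohz=off is a kernel boot parameter; on RHEL 6 it is appended to the kernel line in GRUB legacy's /boot/grub/grub.conf. The kernel version and root device below are illustrative:

kernel /vmlinuz-2.6.32-358.el6.x86_64 ro root=/dev/mapper/vg_root nohz=off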
Precision Time Protocol w/Solarflare NICs
● Single adapter / single network
  ● Lowest latency in the industry
● Hardware timestamps of RX and TX packets
● High-precision Stratum 3 clock
  ● Clock frequency/offset disciplined by driver
● 1PPS input for calibration
● Maintains server synchronization within +/- 200 ns of master clock
[Pictured: Solarflare SFN6322F adapter]
Precision Time Protocol (IEEE-1588v2)
[Charts: clock offset over time with nohz=off vs. nohz=on; nohz=off holds noticeably tighter synchronization]
STAC™ Benchmarks

● Network benchmarks for the financial trading industry that are neutral with respect to vendor, network API, and network transport
● http://www.stacresearch.com/nio
● Assisting w/STAC® Network I/O Benchmark efforts
  ● Submitted first set in Jan 2013
  ● Second set due approx. Q2 CY2013
Data Path for STAC-N

[Diagram: data path through host and NIC for the STAC-N benchmark harness]
Network Tuning: Example Pinning Map

Producer (Socket 0)                       Consumer (Socket 0)
Core #   Task                             Core #   Task
1        producer_harness_wrapper         1        consumer_harness_wrapper
2        producer sender                  2        consumer sender
3        producer receiver                3        consumer receiver
4        controller                       4        reflector
5        free                             5        free
6        p1p1 interrupts                  6        p1p1 interrupts

Producer (Socket 1)                       Consumer (Socket 1)
8        em1 interrupts                   8        em1 interrupts
9        everything else userspace        9        everything else userspace
10-13    ...                              10-13    ...
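Rows of this map translate directly into tuna invocations; the thread names below are hypothetical handles for the harness threads named in the table:

# tuna -c2 -t producer_sender -m       (sender thread to core 2)
# tuna -c3 -t producer_receiver -m     (receiver thread to core 3)
# tuna -q p1p1* -c6 -m -x              (p1p1 interrupts to core 6)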
Transparent Hugepages

[Charts: latency with Transparent Hugepages enabled vs. with Transparent Hugepages disabled]
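The latency-performance profile disables THP for you; to flip it by hand on RHEL 6 (note the redhat_ prefix in RHEL 6's sysfs path; upstream kernels drop it):

# echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled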
Solarflare Hybrid SR-IOV

Traditional Virtualization (no SR-IOV)
● Traditional VM networking architecture

Typical SR-IOV Implementation
● Improved performance
● Loss of ability to perform migration

Hybrid SR-IOV Model
● Improved performance
● Migration still available
● Not yet available in RHEL6 – working with community on upstream solution
● 2048 virtual NICs (vNIC) for unmatched performance and application scalability
● Improved network/latency over bridging in KVM
Red Hat and Solarflare are #1 in SPECvirt
[Bar chart: best published SPECvirt_sc2010 results by socket count (2, 4, and 8 sockets), VMware vs. RHEL (KVM) or RHEV; score labels from the chart: 1,878 @120 VMs; 2,442 @150 VMs; 3,824 @234 VMs; 4,682 @288 VMs; 8,956 @552 VMs]

Comparison based on best performing Red Hat and VMware solutions by CPU socket count published at www.spec.org as of May 17, 2013. SPEC® and the benchmark name SPECvirt_sc® are registered trademarks of the Standard Performance Evaluation Corporation. For more information about SPECvirt_sc2010, see www.spec.org/virt_sc2010/.
Solarflare OpenOnload Application Acceleration
● Accelerated performance
  ● Kernel bypass: streamlines and reduces interrupts, context switches, and data copies
  ● TCP/IP, UDP and multicast acceleration
  ● Reduces latency by 50%, increases message rates 3x or more
● Seamless integration
  ● Binary compatible with industry-standard APIs
  ● No software modifications are needed
  ● Standards-based solution uses TCP/IP and UDP
  ● Compatible with existing Ethernet infrastructure
● Open source GPLv2
● Red Hat Support of 3rd Party Drivers: https://access.redhat.com/site/articles/1067
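Because OpenOnload is binary-compatible, acceleration is a wrapper away; the onload launcher is the standard entry point, and the application name here is illustrative:

# onload ./trading_app          (run an unmodified binary with kernel bypass)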
Interesting new Network/Perf things in RHEL6.4
● tuna included
● latency-performance “tuned” profile beefed up
  ● Lock C-states
  ● Disable Transparent Hugepages
● turbostat included in cpupowerutils package
● hwloc now reports PCI bus topology
● PTP Tech Preview
● omping (multicast test)
Helpful Links
● Red Hat Low Latency Performance Tuning Guide
● Optimizing RHEL Performance by Tuning IRQ Affinity
● Red Hat Performance Tuning Guide
● Red Hat Virtualization Tuning Guide
● STAC Network I/O SIG
● Finteligent Low Latency Tuning w/KVM
● Blog: http://www.breakage.org/ or @jeremyeder
● Research Report: Virtualisation for trading - Red Hat Enterprise Linux 6 Kernel-based Virtual Machine (KVM) Hypervisor
● Uses PCI-passthrough of Solarflare NICs
Red Hat Summit 2013 App Session Survey

Sessions and Labs → Application and platform infrastructure II → Fri, Jun 14 → Make the NICs Move → Polls and Surveys