KVM Performance Optimization
Paul Sim, Cloud Consultant, paul.sim@canonical.com
Index
● CPU & Memory
  ○ vCPU pinning
  ○ NUMA affinity
  ○ THP (Transparent Huge Page)
  ○ KSM (Kernel SamePage Merging) & virtio_balloon
● Networking
  ○ vhost_net
  ○ Interrupt handling
  ○ Large Segment Offload
● Block Device
  ○ I/O Scheduler
  ○ VM Cache mode
  ○ Asynchronous I/O
● CPU & Memory
[Figure: modern CPU cache architecture and latency (Intel Sandy Bridge)]
- L1i / L1d (per core) : 32K, 4 cycles, 0.5 ns
- L2 or MLC (unified) : 256K, 11 cycles, 7 ns
- L3 or LLC (shared) : 14 ~ 38 cycles, 45 ns
- Local memory : 75 ~ 100 ns
- Remote memory (across QPI) : 120 ~ 160 ns
● CPU & Memory - vCPU pinning
[Figure: KVM Guest0 ~ Guest4 pinned to cores on Node0 / Node1, each node with its own cores, LLC and node memory]
* Pin a specific vCPU to a specific physical CPU
* vCPU pinning increases the CPU cache hit ratio
1. Discover the CPU topology
- virsh capabilities
2. Pin a vCPU to a specific core
- virsh vcpupin <domain> <vcpu-num> <cpu-num>
3. Print vCPU information
- virsh vcpuinfo <domain>
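* Example of the steps above (a minimal sketch; the domain name "guest0" and CPU numbers are illustrative):
- virsh vcpupin guest0 0 2            # pin vCPU 0 of the running guest to physical CPU 2
- virsh vcpupin guest0 0 2 --config   # make the pinning persistent across guest restarts
- virsh vcpuinfo guest0               # verify the CPU affinity of each vCPU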
* Multi-node virtual machine?
● CPU & Memory - vCPU pinning
* Two times faster memory access than without pinning
- Shorter is better on the scheduling and mutex tests.
- Longer is better on the memory test.
● CPU & Memory - NUMA affinity
[Figure: 4-socket CPU architecture (Intel Sandy Bridge) - each CPU package has its own cores, LLC, memory controller, I/O controller and PCI-E lanes; the sockets are interconnected via QPI]
● CPU & Memory - NUMA affinity
janghoon@machine-04:~$ sudo numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 40 41 42 43 44 45 46 47 48 49
node 0 size: 24567 MB
node 0 free: 6010 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 50 51 52 53 54 55 56 57 58 59
node 1 size: 24576 MB
node 1 free: 8557 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29 60 61 62 63 64 65 66 67 68 69
node 2 size: 24567 MB
node 2 free: 6320 MB
node 3 cpus: 30 31 32 33 34 35 36 37 38 39 70 71 72 73 74 75 76 77 78 79
node 3 size: 24567 MB
node 3 free: 8204 MB
node distances:
node   0   1   2   3
  0:  10  16  16  22
  1:  16  10  22  16
  2:  16  22  10  16
  3:  22  16  16  10
* Remote memory is expensive!
● CPU & Memory - NUMA affinity
1. Determine where the pages of a VM are allocated.
- cat /proc/<PID>/numa_maps
- cat /sys/fs/cgroup/memory/sysdefault/libvirt/qemu/<KVM name>/memory.numa_stat
total=244973 N0=118375 N1=126598
file=81 N0=24 N1=57
anon=244892 N0=118351 N1=126541
unevictable=0 N0=0 N1=0
2. Change memory policy mode
- cgset -r cpuset.mems=<Node> sysdefault/libvirt/qemu/<KVM name>/emulator/
3. Migrate pages into a specific node.
- migratepages <PID> from-node to-node
- cat /proc/<PID>/status
* Memory policy modes
1) interleave : memory will be allocated using round robin on nodes. When memory cannot be allocated on the current interleave target, fall back to other nodes.
2) bind : only allocate memory from the given nodes. Allocation will fail when there is not enough memory available on these nodes.
3) preferred : preferably allocate memory on the given node, but if memory cannot be allocated there, fall back to other nodes.
* “preferred” memory policy mode is not currently supported on cgroup
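* Example of binding a guest's memory to a node (a minimal sketch; the domain name "guest0" and node numbers are illustrative):
- virsh numatune guest0 --mode strict --nodeset 0 --config   # bind the guest's memory to node 0 via libvirt
- numactl --cpunodebind=0 --membind=0 <command>              # run an arbitrary process with CPUs and memory bound to node 0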
● CPU & Memory - NUMA affinity
- NUMA reclaim
[Figure: Node0 / Node1 memory (unmapped page cache, mapped page cache, free memory, anonymous pages) before and after local reclaim - with local reclaim, unmapped page cache on the local node is freed instead of allocating from the remote node]
1. Check if zone reclaim is enabled
- cat /proc/sys/vm/zone_reclaim_mode
  0 (default) : the Linux kernel allocates the memory on a remote NUMA node where free memory is available.
  1 : the Linux kernel reclaims unmapped page cache pages on the local NUMA node rather than immediately allocating the memory on a remote NUMA node.

* It is known that a virtual machine causes zone reclaim to occur when KSM (Kernel SamePage Merging) is enabled or HugePages are enabled on the virtual machine side.
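* Example of checking and changing the setting (a minimal sketch; whether 0 or 1 is preferable depends on the workload):
- cat /proc/sys/vm/zone_reclaim_mode     # current mode
- sysctl -w vm.zone_reclaim_mode=0       # allocate from remote nodes instead of reclaiming locally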
● CPU & Memory - THP (Transparent Huge Page)
- Paging level in some 64 bit architectures
Platform   Page size       Address bits used   Paging levels   Splitting
Alpha      8KB             43                  3               10 + 10 + 10 + 13
IA64       4KB             39                  3               9 + 9 + 9 + 12
PPC64      4KB             41                  3               10 + 10 + 9 + 12
x86_64     4KB (default)   48                  4               9 + 9 + 9 + 9 + 12
x86_64     2MB (THP)       48                  3               9 + 9 + 9 + 21
● CPU & Memory - THP (Transparent Huge Page)
- Memory address translation in 64-bit Linux
[Figure: a 48-bit linear (virtual) address is split into Global Dir (9 bits), Upper Dir (9 bits), Middle Dir (9 bits), Table (9 bits with 4KB pages / 0 bits with 2MB pages) and Offset (12 bits with 4KB pages / 21 bits with 2MB pages). cr3 -> Page Global Directory -> Page Upper Directory -> Page Middle Directory -> Page Table -> 4KB physical page; with a 2MB page the Page Table step is skipped, removing one step from the walk]
● CPU & Memory - THP (Transparent Huge Page)
Paging hardware with TLB (Translation lookaside buffer)
* Translation lookaside buffer (TLB) : a cache that memory management hardware uses to improve virtual address translation speed.
* The TLB is also a kind of cache memory in the CPU.
* Then, how can we increase the TLB hit ratio?
- A TLB can hold only 8 ~ 1024 entries.
- Decrease the number of pages. e.g. on a 32GB memory system:
  8,388,608 pages with a 4KB page size vs. 16,384 pages with a 2MB page size
● CPU & Memory - THP (Transparent Huge Page)
THP performance benchmark - MySQL 5.5 OLTP testing
* The guest machine also has to use HugePages for the best effect
● CPU & Memory - THP (Transparent Huge Page)
1. Check the current THP configuration
- cat /sys/kernel/mm/transparent_hugepage/enabled
2. Configure the THP mode
- echo <mode> > /sys/kernel/mm/transparent_hugepage/enabled
3. Monitor huge page usage
- cat /proc/meminfo | grep Huge

janghoon@machine-04:~$ grep Huge /proc/meminfo
AnonHugePages:    462848 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

- grep Huge /proc/<PID>/smaps
4. Adjust the parameters under /sys/kernel/mm/transparent_hugepage/khugepaged
- grep thp /proc/vmstat

* THP modes
1) always : always use HugePages
2) madvise : use HugePages only in regions marked with madvise(MADV_HUGEPAGE). Default on Ubuntu precise.
3) never : do not use HugePages
* Currently it only works for anonymous memory mappings but in the future it can expand over the pagecache layer starting with tmpfs. - Linux Kernel Documentation/vm/transhuge.txt
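* Example of the steps above (a minimal sketch; the chosen mode is illustrative):
- cat /sys/kernel/mm/transparent_hugepage/enabled     # e.g. "[always] madvise never"
- echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
- grep AnonHugePages /proc/meminfo                    # anonymous memory currently backed by THP
- grep thp /proc/vmstat                               # fault/collapse/split counters from khugepaged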
● CPU & Memory - KSM & virtio_balloon
- KSM (Kernel SamePage Merging)
[Figure: identical pages marked MADV_MERGEABLE in Guest 0, Guest 1 and Guest 2 being merged by KSM]
1. A kernel feature used by KVM that shares memory pages between various processes, over-committing the memory.
2. Only merges anonymous (private) pages.
3. Enable KSM
- echo "1" > /sys/kernel/mm/ksm/run
4. Monitor KSM
- files under /sys/kernel/mm/ksm/
5. For NUMA (in Linux 3.9)
- /sys/kernel/mm/ksm/merge_across_nodes
  0 : merge only pages in the memory area of the same NUMA node
  1 : merge pages across nodes
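* Example (a minimal sketch; the scan-rate value is illustrative):
- echo 1 > /sys/kernel/mm/ksm/run                 # start the KSM daemon
- echo 100 > /sys/kernel/mm/ksm/pages_to_scan     # pages scanned per wake-up
- cat /sys/kernel/mm/ksm/pages_sharing            # how many pages are currently being shared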
● CPU & Memory - KSM & virtio_balloon
- virtio_balloon
[Figure: a host with VM 0 and VM 1, each with a virtio balloon inside its memory (Memory / currentMemory), shown before and after the balloon is inflated]
1. The hypervisor sends a request to the guest operating system to return some amount of memory back to the hypervisor.
2. The virtio_balloon driver in the guest operating system receives the request from the hypervisor.
3. The virtio_balloon driver inflates a balloon of memory inside the guest operating system.
4. The guest operating system returns the balloon of memory back to the hypervisor.
5. The hypervisor allocates the memory from the balloon elsewhere as needed.
6. If the memory from the balloon later becomes available, the hypervisor can return the memory to the guest operating system.
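* Example of ballooning a guest (a minimal sketch; the domain name "guest0" is illustrative, sizes are in KiB):
- virsh setmem guest0 2097152 --live    # inflate the balloon: shrink the running guest to 2 GiB
- virsh dommemstat guest0               # check the balloon size and guest memory statistics
- virsh setmem guest0 4194304 --live    # deflate the balloon: give the memory back to the guest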
● Networking - vhost-net
[Figure: virtio vs. vhost-net data path - with plain virtio the guest's virtio front-end talks to a virtio back-end inside QEMU (user space), so packets cross into user space and back (context switching) on their way through the TAP device and bridge to the pNIC; with vhost-net the back-end moves into the kernel (vhost_net), which moves data between the virtqueue and the TAP device directly (DMA), bypassing QEMU for the data path]
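* Example of a rough check that vhost-net is actually in use on the host (names are illustrative):
- lsmod | grep vhost_net      # vhost_net kernel module loaded
- ps -ef | grep vhost-        # one vhost kernel thread per virtio queue of a running guest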
● Networking - vhost-net
[Figure: PCI pass-through vs. SR-IOV - with PCI pass-through the pNIC is assigned directly to the guest's vNIC through the IOMMU, bypassing the host kernel bridge; with SR-IOV the pNIC exposes a physical function (PF) and multiple virtual functions (VFs), and each VF can be assigned directly to a guest's vNIC while the PF stays attached to the host bridge]
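* Example of creating SR-IOV virtual functions (a minimal sketch; the driver, device name and VF count are illustrative - how VFs are created depends on the NIC driver and kernel version):
- modprobe ixgbe max_vfs=4                              # older style: driver module parameter
- echo 4 > /sys/class/net/eth0/device/sriov_numvfs      # newer style: sysfs attribute
- lspci | grep -i "virtual function"                    # list the created VFs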
● Networking - vhost-net
● Networking - Interrupt handling
1. Multi-queue NIC
- RPS (Receive Packet Steering) / XPS (Transmit Packet Steering)
: a mechanism for intelligently selecting which receive/transmit queue to use when receiving/transmitting a packet on a multi-queue device. CPU affinity can be configured per device queue; disabled by default.
: configure the CPU mask
- echo 1 > /sys/class/net/<device>/queues/rx-<queue num>/rps_cpus
- echo 16 > /sys/class/net/<device>/queues/tx-<queue num>/xps_cpus
root@machine-04:~# cat /proc/interrupts | grep eth0
 71:      0      0      0      0      0      0      0      0  IR-PCI-MSI-edge  eth0
 72:      0      0    213 679058      0      0      0      0  IR-PCI-MSI-edge  eth0-TxRx-0
 73:      0      0      0      0      0      0      0      0  IR-PCI-MSI-edge  eth0-TxRx-1
 74: 562872      0      0      0      0      0      0      0  IR-PCI-MSI-edge  eth0-TxRx-2
 75: 561303      0      0      0      0      0      0      0  IR-PCI-MSI-edge  eth0-TxRx-3
 76: 583422      0      0      0      0      0      0      0  IR-PCI-MSI-edge  eth0-TxRx-4
 77:     21      5 563963      0      0      0      0      0  IR-PCI-MSI-edge  eth0-TxRx-5
 78:     23      0 548548      0      0      0      0      0  IR-PCI-MSI-edge  eth0-TxRx-6
 79: 567242      0      0      0      0      0      0      0  IR-PCI-MSI-edge  eth0-TxRx-7
around 20 ~ 100 % improvement
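* Example of steering a queue (a minimal sketch; the device name, queue number and 8-CPU bitmasks are illustrative):
- echo 0f > /sys/class/net/eth0/queues/rx-0/rps_cpus    # steer rx queue 0 processing to CPUs 0-3
- echo 10 > /sys/class/net/eth0/queues/tx-0/xps_cpus    # steer tx queue 0 completions to CPU 4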
● Networking - Interrupt handling
2. MSI (Message Signaled Interrupts), MSI-X
- Make sure MSI-X is enabled
3. NAPI (New API)
- A feature of the Linux kernel aiming to improve the performance of high-speed networking and avoid interrupt storms.
- Ask your NIC vendor whether it is enabled by default.
Most modern NICs support multi-queue, MSI-X and NAPI. However, you may need to make sure these features are configured correctly and working properly.
janghoon@machine-04:~$ sudo lspci -vs 01:00.1 | grep MSI
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
Capabilities: [a0] Express Endpoint, MSI 00
http://en.wikipedia.org/wiki/Message_Signaled_Interrupts
http://en.wikipedia.org/wiki/New_API
● Networking - Interrupt handling
4. IRQ Affinity (Pin Interrupts to the local node)
[Figure: two-socket layout - each CPU has its own cores, memory controller, I/O controller and PCI-E lanes, with the sockets connected via QPI]
Each node has PCI-E devices connected directly (on Intel Sandy Bridge).
Where is my 10G NIC connected to?
● Networking - Interrupt handling
4. IRQ Affinity (Pin Interrupts to the local node)
1) Stop the irqbalance service
2) Determine which node the NIC is connected to
- lspci -tv
- lspci -vs <PCI device bus address>
- dmidecode -t slot
- cat /sys/devices/pci*/<bus address>/numa_node (-1 : not detected)
- cat /proc/irq/<Interrupt number>/node
3) Pin interrupts from the NIC to a specific node
- echo f > /proc/irq/<Interrupt number>/smp_affinity
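* Example of the steps above (a minimal sketch; the device name, IRQ number and CPU mask are illustrative - the mask should cover the CPUs of the NIC's local node):
- service irqbalance stop
- cat /sys/class/net/eth0/device/numa_node      # e.g. 0
- echo 3ff > /proc/irq/72/smp_affinity          # 0x3ff = CPUs 0-9 (node 0 in the numactl output above)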
● Networking - Large Segment Offload
[Figure: without LSO the kernel segments the application data (e.g. 4K) and builds a header for every segment before handing them to the NIC; with LSO the kernel hands the NIC the data plus header metadata in one buffer and the NIC performs the segmentation]
janghoon@ubuntu-precise:~$ sudo ethtool -k eth0
Offload parameters for eth0:
tcp-segmentation-offload: on
udp-fragmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
● Networking - Large Segment Offload
1. 100% throughput performance improvement
2. GSO/GRO, TSO and UFO are enabled by default on Ubuntu precise 12.04 LTS
3. Jumbo frames are not very effective
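* Example of toggling offloads (a minimal sketch; the device name is illustrative):
- ethtool -k eth0                       # show the current offload settings
- ethtool -K eth0 tso on gso on gro on  # enable segmentation/receive offloads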
● Block device - I/O Scheduler
1. Noop : a basic FIFO queue.
2. Deadline : I/O requests are placed in a priority queue and are guaranteed to be served within a certain time; low latency.
3. CFQ (Completely Fair Queueing) : I/O requests are distributed to a number of per-process queues; the default I/O scheduler on Ubuntu.
janghoon@ubuntu-precise:~$ sudo cat /sys/block/sdb/queue/scheduler
noop deadline [cfq]
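* Example of switching the scheduler at runtime (a minimal sketch; the device name and scheduler are illustrative, and the change is not persistent across reboots):
- echo deadline > /sys/block/sdb/queue/scheduler
- cat /sys/block/sdb/queue/scheduler              # noop [deadline] cfq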
● Block device - VM Cache mode
Disable host page cache
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none'/>
  <source file='/mnt/VM_IMAGES/VM-test.img'/>
  <target dev='vda' bus='virtio'/>
  <alias name='virtio-disk0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>
[Figure: read/write paths from the guest through the host page cache and the physical disk cache for the writeback, writethrough, none and directsync cache modes]
● Block device - Asynchronous I/O
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none' io='native'/>
  <source file='/dev/sda7'/>
  <target dev='vda' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>
Recommended