Presentation slides at CUTE 2011 (Korea-Japan e-Science and Cloud Symposium)
Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
CUTE2011@Seoul, Dec. 15, 2011
Ryousei Takano, Tsutomu Ikegami, Takahiro Hirofuchi, Yoshio Tanaka
Information Technology Research Institute,
National Institute of Advanced Industrial Science and Technology (AIST), Japan
Background
• Cloud computing is getting increased attention from the High Performance Computing community.
  – e.g., Amazon EC2 Cluster Compute Instances
• Virtualization is a key technology.
  – Providers rely on virtualization to consolidate computing resources.
• Virtualization provides not only opportunities, but also challenges for HPC systems and applications.
  – Concern: performance degradation due to the overhead of virtualization
Contribution
• Goal:
  – To realize a practical HPC Cloud whose performance is close to that of bare metal (i.e., non-virtualized) machines
• Contributions:
  – A feasibility study evaluating the HPC Challenge benchmark on a 16-node InfiniBand cluster
  – The effect of three performance tuning techniques:
    • PCI passthrough
    • NUMA affinity
    • VMM noise reduction
Outline
• Background
• Performance tuning techniques for HPC Cloud
  – PCI passthrough
  – NUMA affinity
  – VMM noise reduction
• Performance evaluation
  – HPC Challenge benchmark suite
  – Results
• Summary
Toward a practical HPC Cloud
• Current HPC Cloud: its performance is poor and unstable.
• "True" HPC Cloud: its performance is close to that of bare metal machines.
• Three tuning techniques to get there:
  – Use PCI passthrough
  – Set NUMA affinity
  – Reduce VMM noise
[Figure: annotated VM architecture for the three techniques. PCI passthrough: the guest OS's physical driver accesses the NIC directly, bypassing the VMM. NUMA affinity: guest OS threads run on the VCPU threads of the VM (a QEMU process), which the Linux kernel/KVM schedules onto the physical CPU sockets. VMM noise reduction: reduce the overhead of interrupt virtualization and disable unnecessary services on the host OS (e.g., ksmd).]
IO architectures of VMs
[Figure: IO emulation vs. PCI passthrough. With IO emulation, each VM's guest driver reaches the NIC through a virtual switch (vSwitch) and the physical driver inside the VMM. With PCI passthrough, the physical driver runs in the guest OS and accesses the NIC directly (VMM-bypass access).]
• IO emulation degrades performance due to the overhead of VMM processing.
• PCI passthrough achieves performance comparable to bare metal machines.
VMM: Virtual Machine Monitor
NUMA affinity
[Figure: bare metal Linux. numactl binds application threads to a CPU socket (physical CPUs P0–P3) and its local memory, instead of leaving placement to the process scheduler.]
On NUMA systems, memory affinity is an important performance factor: local memory accesses are faster than remote ones. To avoid inter-socket memory transfers, binding threads to a CPU socket can be effective.
NUMA: Non-Uniform Memory Access
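For illustration (not part of the original slides), the following minimal C sketch shows the kind of binding that numactl performs for a whole process, expressed with the libnuma API; it assumes libnuma and its headers are available and the program is linked with -lnuma.

/* Minimal sketch of NUMA binding with libnuma, roughly what
 * "numactl --cpunodebind=0 --membind=0 ./app" does for a process.
 * Link with -lnuma. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this machine\n");
        return 1;
    }
    /* Restrict this thread to the CPUs of node (socket) 0 ... */
    numa_run_on_node(0);
    /* ... and allocate the working set from node 0's local memory,
     * avoiding slower inter-socket (remote) accesses. */
    size_t len = 64UL << 20;  /* 64 MB */
    char *buf = numa_alloc_onnode(len, 0);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    memset(buf, 0, len);      /* touch pages so they are actually placed on node 0 */
    numa_free(buf, len);
    return 0;
}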
NUMA affinity: KVM
[Figure: KVM vs. bare metal Linux. Inside the guest, numactl binds guest OS threads to a virtual socket ("bind threads to vSocket"); on the host, taskset pins each VCPU thread (V0–V3) of the VM (a QEMU process) to the corresponding physical CPU (Vn = Pn).]
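For illustration (not from the slides), a minimal C sketch of the host-side pinning step, equivalent to running "taskset -pc <cpu> <tid>" on each QEMU VCPU thread; the thread IDs are assumed to be looked up beforehand, e.g. under /proc/<qemu-pid>/task/.

/* Pin one thread (e.g., a QEMU VCPU thread) to one physical CPU. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <tid> <cpu>\n", argv[0]);
        return 1;
    }
    pid_t tid = (pid_t)atoi(argv[1]);
    int cpu = atoi(argv[2]);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* Given a thread ID, sched_setaffinity() pins only that thread,
     * giving the fixed Vn = Pn mapping described above. */
    if (sched_setaffinity(tid, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned thread %d to physical CPU %d\n", (int)tid, cpu);
    return 0;
}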
NUMA affinity: Xen
[Figure: Xen vs. bare metal Linux. Each VCPU (V0–V3) of the guest domain (DomU) is pinned from Dom0 to the corresponding physical CPU (Vn = Pn) via the Xen hypervisor's domain scheduler.]
numactl cannot be used inside the guest OS, because Xen does not disclose the physical NUMA topology to it.
VMM noise
• OS noise is a well-known problem for large-scale system scalability.
  – OS activities and daemon programs take up CPU time, consume cache and TLB entries, and delay the synchronization of parallel processes.
• VMM-level noise, here called VMM noise, can cause the same problem for a guest OS.
  – The overhead of interrupt virtualization, which results in VM exits (i.e., VM-to-VMM switching)
  – Unnecessary services on the host OS (e.g., ksmd)
• At present, we do not address VMM noise.
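As a concrete illustration of the second noise source (not applied in the experiments reported here), the host-side KSM daemon can be stopped through its standard sysfs knob; a minimal C sketch, assuming /sys/kernel/mm/ksm/run exists and the program runs as root:

/* Stop the ksmd scanning thread on the host, i.e. the equivalent of
 * "echo 0 > /sys/kernel/mm/ksm/run" (requires root). */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/kernel/mm/ksm/run", "w");
    if (f == NULL) {
        perror("/sys/kernel/mm/ksm/run");
        return 1;
    }
    fputs("0\n", f);   /* 0 = stop scanning; 1 would re-enable it */
    fclose(f);
    return 0;
}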
Experimental setting
Evaluation of the HPC Challenge benchmark on a 16-node InfiniBand cluster.

Blade server (Dell PowerEdge M610)
– CPU: Intel quad-core Xeon E5540/2.53 GHz x2
– Chipset: Intel 5520
– Memory: 48 GB DDR3
– InfiniBand: Mellanox ConnectX (MT26428)

Blade switch
– InfiniBand: Mellanox M3601Q (QDR, 16 ports)

Host machine environment
– OS: Debian 6.0.1
– Linux kernel: 2.6.32-5-amd64
– KVM: 0.12.50
– Xen: 4.0.1
– Compiler: gcc/gfortran 4.4.5
– MPI: Open MPI 1.4.2

VM environment
– VCPU: 8
– Memory: 45 GB
– Only 1 VM runs on 1 host.
HPC Challenge Benchmark Suite
[Figure: motivation of the HPCC design. The benchmarks DGEMM, HPL, PTRANS, STREAM, FFT, and RandomAccess are placed according to their spatial and temporal locality, relative to mission partner applications, and are grouped as compute intensive, memory intensive, and communication intensive.]
From: Piotr Luszczek, et al., "The HPC Challenge (HPCC) Benchmark Suite," SC2006 Tutorial.
We measure the spatial and temporal locality boundaries by evaluating the HPC Challenge benchmark suite.
HPC Challenge: Result
[Figure: radar chart of HPC Challenge results for HPL(G), PTRANS(G), STREAM(EP), RandomAccess(G), FFT(G), Random Ring Bandwidth, and Random Ring Latency (relative scale 0–1.4), comparing BMM, BMM+pin, KVM, KVM+pin+bind, Xen, and Xen+pin; the benchmarks are annotated as compute, memory, or communication intensive.]
G: Global, EP: Embarrassingly parallel. Higher is better, except for Random Ring Latency.
Comparing Xen and KVM, the performance is almost the same.
HPC Challenge: Result
[Figure: two radar charts over HPL(G), PTRANS(G), STREAM(EP), RandomAccess(G), FFT(G), Random Ring Bandwidth, and Random Ring Latency (relative scale 0–1.2): one comparing BMM, KVM, and KVM+pin+bind, the other comparing BMM, Xen, and Xen+pin.]
G: Global, EP: Embarrassingly parallel. Higher is better, except for Random Ring Latency.
NUMA affinity is important even on a VM, but the effect of VCPU pinning is uncertain.
HPL: High Performance LINPACK
Configuration       1 node         16 nodes
BMM                 50.24 (1.00)   706.21 (1.00)
BMM + bind          51.07 (1.02)   747.88 (1.06)
Xen                 49.44 (0.98)   700.23 (0.99)
Xen + pin           49.37 (0.98)   698.93 (0.99)
KVM                 48.03 (0.96)   671.97 (0.95)
KVM + pin + bind    49.33 (0.98)   684.96 (0.97)
(HPL performance in GFLOPS; values in parentheses are relative to BMM.)

• BMM: the LINPACK efficiency is 57.7% on 16 nodes (63.1% on a single node).
• BMM, KVM: setting NUMA affinity is effective.
• Virtualization overhead is 6 to 8%.
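For reference (a check that is not on the slide), these efficiency figures are consistent with a per-node theoretical peak of 2.53 GHz × 8 cores × 4 flops/cycle ≈ 80.96 GFLOPS and the BMM + bind results: 51.07 / 80.96 ≈ 0.631 on a single node, and 747.88 / (16 × 80.96) ≈ 0.577 on 16 nodes.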
Discussion
• The performance of the global benchmarks, except for FFT(G), is almost comparable to that of bare metal machines.
  – FFT(G) performance decreased by 11% to 20%, due to virtualization overhead related to inter-node communication and/or VMM noise.
  – PCI passthrough brings MPI communication throughput close to that of bare metal machines, but interrupt injection, which results in VM exits, can disturb application execution.
Discussion (cont.)
• The performance of Xen is marginally better than that of KVM, except for Random Ring Bandwidth.
  – The bandwidth decreases by 4% with KVM and by 20% with Xen.
• KVM: the performance of STREAM(EP) decreases by 27%.
  – Heavy memory contention among processes (TLB misses) may occur. This is the worst case for EPT (Extended Page Tables), because an EPT page walk takes longer than a shadow page table walk; a virtual machine is therefore more sensitive to memory contention than a bare metal machine.
Summary
HPC Cloud is promising!
• The performance of coarse-grained parallel applications is comparable to that of bare metal machines.
• We plan to adopt these performance tuning techniques in our private cloud service, "AIST Cloud."
• Open issues:
  – VMM noise reduction
  – Live migration with VMM-bypass devices
HPC Cloud
HPC Cloud utilizes cloud resources for High Performance Computing (HPC) applications.
[Figure: virtualized clusters running on a physical cluster. Users request resources according to their needs; the provider allocates each user a dedicated virtual cluster on demand.]
Amazon EC2 CCI in TOP500
[Figure: LINPACK efficiency (%) vs. TOP500 rank for the TOP500 Nov. 2011 list, by interconnect: InfiniBand 76%, Gigabit Ethernet 52%, 10 Gigabit Ethernet 72%. The Amazon EC2 cluster compute instances system is ranked #42; GPGPU machines are also annotated.]
※ Efficiency = (Maximum LINPACK performance: Rmax) / (Theoretical peak performance: Rpeak)
LINPACK Efficiency
[Figure: LINPACK efficiency (%) vs. TOP500 rank for the TOP500 June 2011 list, by interconnect: InfiniBand 79%, Gigabit Ethernet 54%, 10 Gigabit Ethernet 74%. The Amazon EC2 cluster compute instances system is ranked #451; GPGPU machines are also annotated.]
※ Efficiency = (Maximum LINPACK performance: Rmax) / (Theoretical peak performance: Rpeak)
Virtualization causes performance degradation!