Presentation slides at CUTE 2011 (Korea-Japan e-Science and Cloud Symposium)
Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster
CUTE2011@Seoul, Dec. 15, 2011
Ryousei Takano, Tsutomu Ikegami, Takahiro Hirofuchi, Yoshio Tanaka
Information Technology Research Institute,
National Institute of Advanced Industrial Science and Technology (AIST), Japan
Background
• Cloud computing is getting increased attention from the High Performance Computing community.
  – e.g., Amazon EC2 Cluster Compute Instances
• Virtualization is a key technology.
  – Providers rely on virtualization to consolidate computing resources.
• Virtualization provides not only opportunities, but also challenges for HPC systems and applications.
  – Concern: performance degradation due to the overhead of virtualization
Contribution
• Goal:
  – To realize a practical HPC Cloud whose performance is close to that of bare metal (i.e., non-virtualized) machines
• Contributions:
  – A feasibility study evaluating the HPC Challenge benchmark on a 16-node InfiniBand cluster
  – The effect of three performance tuning techniques:
    • PCI passthrough
    • NUMA affinity
    • VMM noise reduction
Outline
• Background
• Performance tuning techniques for HPC Cloud
  – PCI passthrough
  – NUMA affinity
  – VMM noise reduction
• Performance evaluation
  – HPC Challenge benchmark suite
  – Results
• Summary
Toward a practical HPC Cloud
• Current HPC Cloud: its performance is poor and unstable.
• "True" HPC Cloud: its performance is close to that of bare metal machines.
• Three tuning techniques to get there:
  – Use PCI passthrough
  – Set NUMA affinity
  – Reduce VMM noise
[Figure: annotated VM architecture for the three techniques. PCI passthrough: the guest OS's physical driver accesses the NIC directly, bypassing the VMM. NUMA affinity: guest OS threads run on the VCPU threads of the VM (a QEMU process), which the Linux kernel/KVM schedules onto the physical CPU sockets. VMM noise reduction: reduce the overhead of interrupt virtualization and disable unnecessary services on the host OS (e.g., ksmd).]
IO architectures of VMs
[Figure: IO emulation vs. PCI passthrough. With IO emulation, each VM's guest driver reaches the NIC through a virtual switch (vSwitch) and the physical driver inside the VMM. With PCI passthrough, the physical driver runs in the guest OS and accesses the NIC directly (VMM-bypass access).]
• IO emulation degrades performance due to the overhead of VMM processing.
• PCI passthrough achieves performance comparable to bare metal machines.
VMM: Virtual Machine Monitor
NUMA affinity
[Figure: bare metal Linux. numactl binds application threads to a CPU socket (physical CPUs P0–P3) and its local memory, instead of leaving placement to the process scheduler.]
On NUMA systems, memory affinity is an important performance factor: local memory accesses are faster than remote ones. To avoid inter-socket memory transfers, binding threads to a CPU socket can be effective.
NUMA: Non-Uniform Memory Access
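For illustration (not part of the original slides), the following minimal C sketch shows the kind of binding that numactl performs for a whole process, expressed with the libnuma API; it assumes libnuma and its headers are available and the program is linked with -lnuma.

/* Minimal sketch of NUMA binding with libnuma, roughly what
 * "numactl --cpunodebind=0 --membind=0 ./app" does for a process.
 * Link with -lnuma. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this machine\n");
        return 1;
    }
    /* Restrict this thread to the CPUs of node (socket) 0 ... */
    numa_run_on_node(0);
    /* ... and allocate the working set from node 0's local memory,
     * avoiding slower inter-socket (remote) accesses. */
    size_t len = 64UL << 20;  /* 64 MB */
    char *buf = numa_alloc_onnode(len, 0);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    memset(buf, 0, len);      /* touch pages so they are actually placed on node 0 */
    numa_free(buf, len);
    return 0;
}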
NUMA affinity: KVM
[Figure: KVM vs. bare metal Linux. Inside the guest, numactl binds guest OS threads to a virtual socket ("bind threads to vSocket"); on the host, taskset pins each VCPU thread (V0–V3) of the VM (a QEMU process) to the corresponding physical CPU (Vn = Pn).]
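For illustration (not from the slides), a minimal C sketch of the host-side pinning step, equivalent to running "taskset -pc <cpu> <tid>" on each QEMU VCPU thread; the thread IDs are assumed to be looked up beforehand, e.g. under /proc/<qemu-pid>/task/.

/* Pin one thread (e.g., a QEMU VCPU thread) to one physical CPU. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <tid> <cpu>\n", argv[0]);
        return 1;
    }
    pid_t tid = (pid_t)atoi(argv[1]);
    int cpu = atoi(argv[2]);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* Given a thread ID, sched_setaffinity() pins only that thread,
     * giving the fixed Vn = Pn mapping described above. */
    if (sched_setaffinity(tid, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned thread %d to physical CPU %d\n", (int)tid, cpu);
    return 0;
}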
NUMA affinity: Xen
[Figure: Xen vs. bare metal Linux. Each VCPU (V0–V3) of the guest domain (DomU) is pinned from Dom0 to the corresponding physical CPU (Vn = Pn) via the Xen hypervisor's domain scheduler.]
numactl cannot be used inside the guest OS, because Xen does not disclose the physical NUMA topology to it.
VMM noise
• OS noise is a well-known problem for large-scale system scalability.
  – OS activities and daemon programs take up CPU time, consume cache and TLB entries, and delay the synchronization of parallel processes.
• VMM-level noise, here called VMM noise, can cause the same problem for a guest OS.
  – The overhead of interrupt virtualization, which results in VM exits (i.e., VM-to-VMM switching)
  – Unnecessary services on the host OS (e.g., ksmd)
• At present, we do not address VMM noise.
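As a concrete illustration of the second noise source (not applied in the experiments reported here), the host-side KSM daemon can be stopped through its standard sysfs knob; a minimal C sketch, assuming /sys/kernel/mm/ksm/run exists and the program runs as root:

/* Stop the ksmd scanning thread on the host, i.e. the equivalent of
 * "echo 0 > /sys/kernel/mm/ksm/run" (requires root). */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/kernel/mm/ksm/run", "w");
    if (f == NULL) {
        perror("/sys/kernel/mm/ksm/run");
        return 1;
    }
    fputs("0\n", f);   /* 0 = stop scanning; 1 would re-enable it */
    fclose(f);
    return 0;
}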
Experimental setting
Evaluation of the HPC Challenge benchmark on a 16-node InfiniBand cluster.

Blade server (Dell PowerEdge M610)
– CPU: Intel quad-core Xeon E5540/2.53 GHz x2
– Chipset: Intel 5520
– Memory: 48 GB DDR3
– InfiniBand: Mellanox ConnectX (MT26428)

Blade switch
– InfiniBand: Mellanox M3601Q (QDR, 16 ports)

Host machine environment
– OS: Debian 6.0.1
– Linux kernel: 2.6.32-5-amd64
– KVM: 0.12.50
– Xen: 4.0.1
– Compiler: gcc/gfortran 4.4.5
– MPI: Open MPI 1.4.2

VM environment
– VCPU: 8
– Memory: 45 GB
– Only 1 VM runs on 1 host.
HPC Challenge Benchmark Suite
[Figure: motivation of the HPCC design. The benchmarks DGEMM, HPL, PTRANS, STREAM, FFT, and RandomAccess are placed according to their spatial and temporal locality, relative to mission partner applications, and are grouped as compute intensive, memory intensive, and communication intensive.]
From: Piotr Luszczek, et al., "The HPC Challenge (HPCC) Benchmark Suite," SC2006 Tutorial.
We measure the spatial and temporal locality boundaries by evaluating the HPC Challenge benchmark suite.
HPC Challenge: Result
[Figure: radar chart of HPC Challenge results for HPL(G), PTRANS(G), STREAM(EP), RandomAccess(G), FFT(G), Random Ring Bandwidth, and Random Ring Latency (relative scale 0–1.4), comparing BMM, BMM+pin, KVM, KVM+pin+bind, Xen, and Xen+pin; the benchmarks are annotated as compute, memory, or communication intensive.]
G: Global, EP: Embarrassingly parallel. Higher is better, except for Random Ring Latency.
Comparing Xen and KVM, the performance is almost the same.
HPC Challenge: Result
[Figure: two radar charts over HPL(G), PTRANS(G), STREAM(EP), RandomAccess(G), FFT(G), Random Ring Bandwidth, and Random Ring Latency (relative scale 0–1.2): one comparing BMM, KVM, and KVM+pin+bind, the other comparing BMM, Xen, and Xen+pin.]
G: Global, EP: Embarrassingly parallel. Higher is better, except for Random Ring Latency.
NUMA affinity is important even on a VM, but the effect of VCPU pinning is uncertain.
HPL: High Performance LINPACK
Configuration       1 node         16 nodes
BMM                 50.24 (1.00)   706.21 (1.00)
BMM + bind          51.07 (1.02)   747.88 (1.06)
Xen                 49.44 (0.98)   700.23 (0.99)
Xen + pin           49.37 (0.98)   698.93 (0.99)
KVM                 48.03 (0.96)   671.97 (0.95)
KVM + pin + bind    49.33 (0.98)   684.96 (0.97)
(HPL performance in GFLOPS; values in parentheses are relative to BMM.)

• BMM: the LINPACK efficiency is 57.7% on 16 nodes (63.1% on a single node).
• BMM, KVM: setting NUMA affinity is effective.
• Virtualization overhead is 6 to 8%.
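For reference (a check that is not on the slide), these efficiency figures are consistent with a per-node theoretical peak of 2.53 GHz × 8 cores × 4 flops/cycle ≈ 80.96 GFLOPS and the BMM + bind results: 51.07 / 80.96 ≈ 0.631 on a single node, and 747.88 / (16 × 80.96) ≈ 0.577 on 16 nodes.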
Discussion
• The performance of the global benchmarks, except for FFT(G), is almost comparable to that of bare metal machines.
  – FFT(G) performance decreased by 11% to 20%, due to virtualization overhead related to inter-node communication and/or VMM noise.
  – PCI passthrough brings MPI communication throughput close to that of bare metal machines, but interrupt injection, which results in VM exits, can disturb application execution.
Discussion (cont.)
• The performance of Xen is marginally better than that of KVM, except for Random Ring Bandwidth.
  – The bandwidth decreases by 4% with KVM and by 20% with Xen.
• KVM: the performance of STREAM(EP) decreases by 27%.
  – Heavy memory contention among processes (TLB misses) may occur. This is the worst case for EPT (Extended Page Tables), because an EPT page walk takes longer than a shadow page table walk; a virtual machine is therefore more sensitive to memory contention than a bare metal machine.
Summary
HPC Cloud is promising!
• The performance of coarse-grained parallel applications is comparable to that of bare metal machines.
• We plan to adopt these performance tuning techniques in our private cloud service, "AIST Cloud."
• Open issues:
  – VMM noise reduction
  – Live migration with VMM-bypass devices
HPC Cloud
HPC Cloud utilizes cloud resources for High Performance Computing (HPC) applications.
[Figure: virtualized clusters running on a physical cluster. Users request resources according to their needs; the provider allocates each user a dedicated virtual cluster on demand.]
Amazon EC2 CCI in TOP500
[Figure: LINPACK efficiency (%) vs. TOP500 rank for the TOP500 Nov. 2011 list, by interconnect: InfiniBand 76%, Gigabit Ethernet 52%, 10 Gigabit Ethernet 72%. The Amazon EC2 cluster compute instances system is ranked #42; GPGPU machines are also annotated.]
※ Efficiency = (Maximum LINPACK performance: Rmax) / (Theoretical peak performance: Rpeak)
LINPACK Efficiency
[Figure: LINPACK efficiency (%) vs. TOP500 rank for the TOP500 June 2011 list, by interconnect: InfiniBand 79%, Gigabit Ethernet 54%, 10 Gigabit Ethernet 74%. The Amazon EC2 cluster compute instances system is ranked #451; GPGPU machines are also annotated.]
※ Efficiency = (Maximum LINPACK performance: Rmax) / (Theoretical peak performance: Rpeak)
Virtualization causes performance degradation!