
NUMA and Virtualization, the case of Xen


Page 1: NUMA and Virtualization, the case of Xen

August 27-28, 2012, San Diego, CA, USA

Dario Faggioli,[email protected]

NUMA and Virtualization, the case of Xen

Dario Faggioli, [email protected]

Page 2: NUMA and Virtualization, the case of Xen


NUMA and Virt. and Xen

Talk outline:

● What is NUMA
  – NUMA and Virtualization
  – NUMA and Xen

● What I have done in Xen about NUMA
● What remains to be done in Xen about NUMA


Page 4: NUMA and Virtualization, the case of Xen


What is NUMA

● Non-Uniform Memory Access: it will take longer to access some regions of memory than others

● Designed to improve scalability on large SMPs
● Bottleneck: contention on the shared memory bus

Page 5: NUMA and Virtualization, the case of Xen


What is NUMA

● Groups of processors (each group is a NUMA node) have their own local memory

● Any processor can access any memory, including memory not "owned" by its group (remote memory)

● Non-uniform: accessing local memory is faster than accessing remote memory

Page 6: NUMA and Virtualization, the case of Xen


What is NUMA

Since:
● (most of the) memory is allocated at task startup,
● tasks are (usually) free to run on any processor,
both local and remote accesses can happen during a task's life.

[Diagram: two NUMA nodes, each with its own CPUs and MEM; task A's memory ("mem A") lives on one node, but task A may run on either node, so its accesses can be local or remote]

Page 7: NUMA and Virtualization, the case of Xen


NUMA and Virtualization

Short-lived tasks are just fine... but what about VMs?

[Diagram: VM1's and VM2's memory sits on particular nodes, but their vCPUs may run anywhere, so remote accesses can keep happening... FOREVER!!]

Page 8: NUMA and Virtualization, the case of Xen


NUMA and Xen

Xen allocates memory from all the nodes where the VM is allowed to run (when it is created)

[Diagram: two nodes; both VM1's and VM2's memory ends up spread across both of them]

Page 9: NUMA and Virtualization, the case of Xen


NUMA and Xen

Xen allocates memory from all the nodes where the VM is allowed to run (when it is created)

[Diagram: with four nodes, VM1's and VM2's memory can be scattered across all of them]

Page 10: NUMA and Virtualization, the case of Xen


NUMA and Xen

Xen allocates memory from all the nodes where the VM is allowed to run (when it is created)

How to specify that?

vCPU pinning during VM creation (e.g., in the VM config file)
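For instance, a minimal sketch of what that could look like in an xl domain config file (the VM name and the node-to-CPU mapping are assumptions: node 0 is taken to be CPUs 0-7; check the real topology with "xl info -n"):

    # illustrative xl config fragment
    name   = "vm1"
    memory = 2048
    vcpus  = 2
    cpus   = "0-7"   # hard pinning: the vCPUs may only run here,
                     # so memory gets allocated on node 0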

Page 11: NUMA and Virtualization, the case of Xen


NUMA and Virt. and Xen

Talk outline:

● What is NUMA
  – NUMA and Virtualization
  – NUMA and Xen

● What I have done in Xen about NUMA
● What remains to be done in Xen about NUMA

Page 12: NUMA and Virtualization, the case of Xen


Automatic Placement

1. What happens on the left is better than what happens on the right

[Diagram: left, VM1's vCPUs and memory are all on one node; right, VM1's memory is split across two nodes]

Page 13: NUMA and Virtualization, the case of Xen


Automatic Placement

2. What happens on the left can be avoided: look at what happens on the right

[Diagram: left, VM1 and VM2 are both packed onto the same node while the other node is unused; right, VM1 and VM2 are placed on separate nodes, each with its memory local]

Page 14: NUMA and Virtualization, the case of Xen


Automatic Placement

[Diagram: VM1 placed on the first node and VM2 on the second, each with its memory local]

1. VM1 creation time: pin VM1 to the first node.

2. VM2 creation time: pin VM2 to the second node, as the first one already has another VM pinned to it.

This now happens automatically within libxl.
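A hedged way to check the outcome from dom0 (the domain names vm1/vm2 are illustrative):

    xl info -n        # host NUMA topology: nodes, their CPUs, free memory per node
    xl vcpu-list vm1  # which physical CPUs each of VM1's vCPUs is allowed to run on
    xl vcpu-list vm2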

Page 15: NUMA and Virtualization, the case of Xen


NUMA Aware Scheduling

Nice, but using pinning has a major drawback:

[Diagram: VM1, VM2 and VM3 are all pinned to the first node; the second node's CPUs sit idle]

When VM1 is back, it has to wait, even if there are idle CPUs!

Page 16: NUMA and Virtualization, the case of Xen


NUMA Aware Scheduling

Don't use pinning, tell the scheduler where a VM prefers to run (node affinity)

[Diagram: VM1, VM2 and VM3 have node affinity with the first node, but VM1 is running on the second node's idle CPUs]

Now VM1 can run immediately: remote accesses are better than not running at all!

Patches almost ready (targeting Xen 4.3)
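As a sketch of where this is heading: in later Xen releases node affinity becomes configurable as "soft" affinity. The fragment below assumes an xl version whose config file supports the cpus_soft option, and again assumes node 0 = CPUs 0-7:

    vcpus     = 2
    cpus_soft = "0-7"   # the scheduler prefers node 0's CPUs, but the vCPUs
                        # may still run anywhere if those CPUs are busy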

Page 17: NUMA and Virtualization, the case of Xen


Experimental Results

Trying to verify the performance of:

● Automatic placement (pinning)
● NUMA aware scheduling (automatic placement + node affinity)

Page 18: NUMA and Virtualization, the case of Xen


Testing Placement (thanks to Andre Przywara from AMD)

● Host: AMD Opteron(TM) 6276, 64 cores, 128 GB RAM, 8 NUMA nodes

● VMs: 1 to 49, 2 vCPUs, 2 GB RAM each.

Resulting vCPU distribution across nodes (all 49 VMs placed):

nr. vCPUs | Node
       14 |  0
       12 |  1
       12 |  2
       12 |  3
       12 |  4
       12 |  5
       12 |  6
       12 |  7

Page 19: NUMA and Virtualization, the case of Xen


Some Traces about NUMA Aware Scheduling

Page 20: NUMA and Virtualization, the case of Xen


Benchmarking Pinning and NUMA Aware Scheduling

● Host: Intel Xeon(R) E5620, 16 cores, 12 GB RAM, 2 NUMA nodes

● VMs: 2, 4, 6, 8 and 10 of them, 960 MB RAM each. Different experiments with 1, 2 and 4 vCPUs per VM

➔ SPECjbb2005 executed concurrently in all VMs

➔ 3 configurations: all-cpus, auto-pinning, auto-affinity

➔ Experiments repeated 3 times for each configuration

Page 21: NUMA and Virtualization, the case of Xen


Benchmarking Pinning and NUMA Aware Scheduling

1 vCPU per VM: 20% to 16% improvement!

Page 22: NUMA and Virtualization, the case of Xen


Benchmarking Pinning and NUMA Aware Scheduling

2 vCPUs per VM: 17% to 13% improvement!

Page 23: NUMA and Virtualization, the case of Xen


Benchmarking Pinning and NUMA Aware Scheduling

4 vCPUs per VM:

16% to 4% improvement from all-cpus to auto-affinity

0.2% to 8% gap from auto-affinity to auto-pinning

Page 24: NUMA and Virtualization, the case of Xen


Some Traces about NUMA Aware Scheduling

Page 25: NUMA and Virtualization, the case of Xen


Some Traces about NUMA Aware Scheduling

Page 26: NUMA and Virtualization, the case of Xen


NUMA and Virt. and Xen

Talk outline:

● What is NUMA
  – NUMA and Virtualization
  – NUMA and Xen

● What I have done in Xen about NUMA
● What remains to be done in Xen about NUMA

Page 27: NUMA and Virtualization, the case of Xen


Lots of Room for Further Improvement!

Tracking ideas, people and progress on http://wiki.xen.org/wiki/Xen_NUMA_Roadmap

● Dynamic memory migration

● IO NUMA

● Guest (or Virtual) NUMA

● Ballooning and memory sharing

● Inter-VM dependencies

● Benchmarking and performance evaluation

Page 28: NUMA and Virtualization, the case of Xen


Dynamic Memory Migration

If VM2 goes away, it would be nice if we could change VM1's (or VM3's) node affinity on-line, and move its memory accordingly!

[Diagram: once VM2 leaves, VM1's affinity and memory could be moved over to the freed node]

Also targeting Xen 4.3
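Changing where a VM is allowed to run is already possible on-line; what is missing is making the memory follow. A hedged example (node 1 assumed to be CPUs 8-15, domain name illustrative):

    # let all of VM1's vCPUs use the freed node's CPUs at runtime;
    # note that the memory does NOT migrate along with them
    xl vcpu-pin vm1 all 8-15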

Page 29: NUMA and Virtualization, the case of Xen


IO NUMA

Different devices can be attached to different nodes: this needs to be considered during placement / scheduling
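For example, on the Linux side the node a PCI device hangs off can be read from sysfs (the device address below is hypothetical; -1 means no NUMA information is available):

    cat /sys/bus/pci/devices/0000:01:00.0/numa_node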

Page 30: NUMA and Virtualization, the case of Xen


Guest NUMA

If a VM is bigger than 1 node, should it know?

[Diagram: a single VM whose vCPUs and memory span two NUMA nodes]

Pros: performance (especially HPC)

Cons: what if things change?

● live migration

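If the guest were given a virtual NUMA topology, a NUMA-aware OS inside it could see and exploit it with the usual tools; the output below is purely illustrative of a 2-node, 8-vCPU guest, not from a real run:

    numactl --hardware
    # available: 2 nodes (0-1)
    # node 0 cpus: 0 1 2 3
    # node 0 size: 2048 MB
    # node 1 cpus: 4 5 6 7
    # node 1 size: 2048 MB
    # node distances:
    # node   0   1
    #   0:  10  20
    #   1:  20  10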

Page 31: NUMA and Virtualization, the case of Xen


Ballooning and Sharing

Ballooning should be NUMA aware

Sharing: should we allow that cross-node?

[Diagram: with ballooning, a VM's new pages can end up on the wrong node; with cross-node page sharing, the shared page ("shmem") is always a remote access for one of the VMs]

Page 32: NUMA and Virtualization, the case of Xen


Inter-VM Dependencies

[Diagram: two VMs either packed on the same node or spread across two nodes]

Are we sure the situation on the right is always better? Might it be workload dependent (VM cooperation vs. competition)?


Page 33: NUMA and Virtualization, the case of Xen


Benchmarking and Performance Evaluation

How to verify we are actually improving:

● What kind of workload(s)?

● What VM configuration(s)?

Page 34: NUMA and Virtualization, the case of Xen


NUMA and Virt. and Xen

Talk outline:

● What is NUMA
  – NUMA and Virtualization
  – NUMA and Xen

● What I have done in Xen about NUMA
● What remains to be done in Xen about NUMA

Page 35: NUMA and Virtualization, the case of Xen


Thanks!

Any Questions?