Xen Guest NUMA: General Enabling Part
29 April 2010
Jun Nakajima, Dexuan Cui, and Nitin Kamble


Page 1: Nakajima numa-final

Xen Guest NUMA: General Enabling Part

29 April 2010
Jun Nakajima, Dexuan Cui, and Nitin Kamble

Page 2: Nakajima numa-final

Xen Summit NA 2010

2

Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE,

EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.

Intel may make changes to specifications and product descriptions at any time, without notice.

All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2010 Intel Corporation.

Page 3: Nakajima numa-final


Xen Guest NUMA Project

• Working with the Xen community:
− Andre Przywara [email protected]
− Dulloor Rao [email protected]
− You are welcome to join us

• Generic guest NUMA support for both PV and HVM
− The major difference is basically the ACPI tables
− NUMA-specific enlightenments are applicable to both

Page 4: Nakajima numa-final


Agenda

• NUMA machines

• Importance of NUMA Awareness

• Motivation of NUMA Guests

• What is required to support an effective NUMA guest?

• Getting host info and resource allocation

• Guest configuration

• Current Status and Next Steps

Page 5: Nakajima numa-final


NUMA Machines

[Diagram: four Xeon® 7500 sockets interconnected, each with local memory behind memory buffers, two I/O hubs, and cores grouped into nodes*]

*: A socket/package can contain multiple nodes

Page 6: Nakajima numa-final

6* Other names and brands may be claimed as the property of others. Copyright © 2010, Intel Corporation. Intel Confidential

[Diagram: NUMA topologies built from CPU sockets, memory, I/O hubs, and interconnects: 2-socket, 2+2 (4S), 2+2+2+2 (8S), 4+4 (8S), 4S (32 DIMMs), and 4S (64 DIMMs)]


Page 7: Nakajima numa-final


Importance of NUMA Awareness

lmbench's rd benchmark (normalized to native Linux (=100)):

guests   numa=off              numa=on               avg increase
         min    avg    max     min    avg    max
  1             78.0                  102.3
  7      37.4   45.6   62.0    90.6   102.3  110.9   124.4%
 15      21.0   25.8   31.7    41.7   48.7   54.1     88.2%
 23      13.4   17.5   23.2    25.0   28.0   30.1     60.2%

kernel compile in tmpfs, 1 VCPU, 2 GB RAM, average of elapsed time:

guests   numa=off   numa=on    increase
  1      480.610    464.320     3.4%
  7      482.109    461.721     4.2%
 15      515.297    477.669     7.3%
 23      548.427    495.180     9.7%

again with 2 VCPUs and make -j2:

guests   numa=off   numa=on    increase
  1      264.580    261.690     1.1%
  7      279.763    258.907     7.7%
 15      330.385    272.762    17.4%
 23      463.510    390.547    15.7%   (46 VCPUs on 32 pCPUs)

*: 4-socket AMD Magny-Cours machine with 8 nodes, 48 cores, and 96 GB RAM.

http://lists.xensource.com/archives/html/xen-devel/2009-12/msg00000.html

Andre Przywara <[email protected]>
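The "increase" column in the kernel-compile tables is the relative reduction in elapsed time from enabling guest NUMA. A quick sanity check (a sketch, not part of the original deck):

```python
def improvement(t_off, t_on):
    """Relative time reduction (percent) from enabling guest NUMA."""
    return (t_off - t_on) / t_off * 100

# 1-guest kernel-compile row from the table above
print(round(improvement(480.610, 464.320), 1))  # -> 3.4
```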

Page 8: Nakajima numa-final


Motivation

• More NUMA machines in the market

• Run very large guests efficiently on NUMA machines
− More memory, VCPUs, and I/O spanning multiple nodes
− More performance and throughput

• Allow existing OSes and apps to run under virtualization with NUMA enabled (or disabled)
− Populate guest ACPI SRAT (Static Resource Affinity Table) and SLIT (System Locality Information Table)
− NUMA libraries

• NUMA-specific optimizations/enlightenments

Page 9: Nakajima numa-final


Achieving NUMA Performance

• Which processors (i.e., cores) are connected directly to which blocks of memory?
− SRAT (Static Resource Affinity Table) or PV

• How far apart are the processors from their associated memory banks?
− SLIT (System Locality Information Table) or PV

• Virtualization-specific requirements
− Bind VCPUs to nodes
− Construct guest SRAT and SLIT
• Need to reflect hardware attributes

• Predictable and repeatable
− Use a fixed guest configuration
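The SLIT encodes relative memory-access distances between nodes; by ACPI convention, local access is normalized to 10. A minimal sketch of interpreting such a matrix (the distance values here are illustrative, not from the deck):

```python
# Illustrative 2-node SLIT: slit[i][j] is the relative cost for a CPU
# on node i to reach memory on node j (ACPI normalizes local access to 10).
slit = [
    [10, 21],
    [21, 10],
]

def remote_penalty(slit, i, j):
    """Access cost of node j's memory from node i, relative to local (1.0 == local)."""
    return slit[i][j] / slit[i][i]

print(remote_penalty(slit, 0, 1))  # -> 2.1
```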


Page 10: Nakajima numa-final


Constructing SRAT and SLIT for Guests

• Get platform info from the host using the host NUMA API (in upstream)
− XEN_SYSCTL_topologyinfo
• # of cores per node/socket
− XEN_SYSCTL_numainfo
• Equivalent to SRAT and SLIT

• Allocate memory from nodes based on the memory allocation strategy in the config file
− CONFINE, SPLIT, STRIPE (next page)
− # of nodes
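A toy sketch of the SPLIT case: divide guest memory equally across N guest nodes and emit SRAT-like (node, base, size) memory-affinity entries. Function and field names are invented for illustration, not Xen's actual tool-stack code:

```python
def split_memory(total_mb, nodes):
    """Divide guest memory equally across nodes, giving any
    remainder to the first node (illustrative policy)."""
    base_mb, rem = divmod(total_mb, nodes)
    entries, start = [], 0
    for node in range(nodes):
        size = base_mb + (rem if node == 0 else 0)
        # One SRAT-like memory-affinity entry per guest node
        entries.append({"node": node, "base_mb": start, "size_mb": size})
        start += size
    return entries

print(split_memory(2048, 2))
# -> [{'node': 0, 'base_mb': 0, 'size_mb': 1024},
#     {'node': 1, 'base_mb': 1024, 'size_mb': 1024}]
```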

Page 11: Nakajima numa-final


Guest NUMA Config Options

• Number of nodes means "# of nodes from which memory is allocated"
− Not necessarily visible to the guest

• max_guest_nodes=<N>
− Specify the desired number of nodes. Defaults to the number of system nodes.

• min_guest_nodes=<N>
− Specify the minimum number of nodes. Memory is allocated from at least min_guest_nodes nodes; guest creation fails if the allocation cannot meet this. Defaults to 1.

• The number of nodes matters for SPLIT and STRIPE (next page)

• Create the guest in a deterministic way by setting min_guest_nodes = max_guest_nodes
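Put together, a guest config fragment might look like this. The node-count option names come from the slides; the strategy keyword and overall syntax are assumptions, since the deck does not show an exact config file:

```
# Hypothetical guest config fragment (syntax illustrative)
memory = 4096
vcpus  = 4
max_guest_nodes = 2        # ask for at most 2 guest NUMA nodes
min_guest_nodes = 2        # ...and fail creation if fewer are possible
numa_strategy   = "SPLIT"  # CONFINE | SPLIT | STRIPE | AUTOMATIC (assumed keyword)
```

Setting min_guest_nodes equal to max_guest_nodes, as the slide suggests, makes guest creation deterministic across boots.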

Page 12: Nakajima numa-final


Guest NUMA Config Options (cont.)

Memory Allocation Strategy:

• CONFINE: Allocate the entire domain memory from a single node. Fail if this does not work.
− No need to tell the guest about NUMA at all.

• SPLIT: Allocate domain memory from nodes by splitting it equally across them. Fail if this does not work.
− Populate the NUMA topology and propagate it to the guest (including PV querying via hypercall). If the guest is paravirtualized and does not know about NUMA (missing ELF hint), fail.

• STRIPE: Interleave domain memory across nodes.
− No need to tell the guest about NUMA at all.

• AUTOMATIC: Try the three strategies one after another (order: CONFINE, SPLIT, STRIPE)
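A toy model of the AUTOMATIC fallback order. The per-node free-memory figures and function names are invented for illustration; real STRIPE interleaving is reduced to an equal split here:

```python
def confine(free_mb, need_mb):
    """CONFINE: all memory from one node, or nothing."""
    for node, free in enumerate(free_mb):
        if free >= need_mb:
            return {node: need_mb}
    return None

def split(free_mb, need_mb):
    """SPLIT: an equal share from every node, or nothing."""
    n = len(free_mb)
    share = need_mb // n
    if need_mb % n == 0 and all(f >= share for f in free_mb):
        return {node: share for node in range(n)}
    return None

def stripe(free_mb, need_mb):
    """STRIPE: interleave; this toy version just reuses the equal split."""
    return split(free_mb, need_mb)

def automatic(free_mb, need_mb):
    """Try CONFINE, then SPLIT, then STRIPE, as on the slide."""
    for strategy in (confine, split, stripe):
        alloc = strategy(free_mb, need_mb)
        if alloc is not None:
            return strategy.__name__.upper(), alloc
    raise MemoryError("no strategy fits")

# 2 nodes with 3 GB free each: 4 GB cannot be confined, but splits.
print(automatic([3072, 3072], 4096))  # -> ('SPLIT', {0: 2048, 1: 2048})
```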

Page 13: Nakajima numa-final


Considerations on Live Migration

• The number of nodes needs to be the same

• The memory allocation strategy needs to be inherited for live migration
− CONFINE and STRIPE guests are not really NUMA guests
− SPLIT: SPLIT will be used at live-migration time.

• If the target machine has similar NUMA characteristics, live migration can retain NUMA performance.
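A minimal pre-migration compatibility check implied by these points might look like this (a sketch; the field names are invented):

```python
def can_migrate(src, dst):
    """Check the two constraints above: same node count, and the
    source's allocation strategy available on the target."""
    return (src["nodes"] == dst["nodes"]
            and src["strategy"] in dst["supported_strategies"])

src = {"nodes": 2, "strategy": "SPLIT"}
dst = {"nodes": 2, "supported_strategies": {"CONFINE", "SPLIT", "STRIPE"}}
print(can_migrate(src, dst))  # -> True
```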

Page 14: Nakajima numa-final


Current Status and Next Steps

• Current Status
− Host NUMA API is in upstream
− Rebasing the patches to submit
− Re-measuring performance
− Merging patches from Dulloor and Andre

• Next Steps
− Performance analysis with different workloads
• Scheduling
− I/O NUMA
• DMA across nodes with direct device assignment
− Live Migration

• Anyone?