PowerVM Performance Updates
Steve Nasypany, [email protected]
© 2014 IBM Corporation
HMC v8 Performance Capacity Monitor | Dynamic Platform Optimizer | PowerVP


Slide 1
PowerVM Performance Updates
Steve Nasypany, [email protected]
HMC v8 Performance Capacity Monitor | Dynamic Platform Optimizer | PowerVP 1.1.2 | VIOS Performance Advisor 2.2.3

Slide 2
First, a HIPER APAR: XMGC NOT TRAVERSING ALL KERNEL HEAPS
Affected: systems running the 6100-09 Technology Level with bos.mp64 below the 6.1.9.2 level, and systems running the 7100-03 Technology Level with bos.mp64 below the 7.1.3.2 level.
PROBLEM DESCRIPTION: The xmalloc garbage collector is not traversing all kernel heaps, causing pinned and virtual memory growth. This can lead to low memory or low paging space conditions, resulting in performance degradation and, in some cases, a system hang or crash.
You can't easily diagnose this with vmstat or svmon. Systems simply run out of memory: pinned or computational memory keeps climbing and cannot be attributed to any process.
Fixed in AIX 6.1 TL9 SP1 and AIX 7.1 TL3 SP1 (XMGC APARs IV53582 / IV53587)

Slide 3
http://www.redbooks.ibm.com/redpieces/abstracts/sg248171.html Draft available now!
POWER7 & POWER8, PowerVM Hypervisor, AIX, i & Linux, Java, WAS, DB2, compilers & optimization, performance tools & tuning (Optimization Redbook)

Slide 4
HMC Version 8 Performance Capacity Monitor

Slide 5
Power Systems Performance Monitoring
HMC 780 or earlier to HMC 810: evolution from a disjoint set of OS tools to an integrated monitoring solution
System resource monitoring via a single touch point (HMC)
Data collection and aggregation of performance metrics via the Hypervisor
REST APIs (web APIs) for integration with IBM and third-party products
Trending of the utilization data
Assists in first-level performance analysis & capacity planning

Slide 6
Performance Metrics (complete set, firmware dependent)
Physical System Level Processor & Memory Resource Usage Statistics
System Processor Usage Statistics (with LPAR, VIOS & Power Hypervisor usage breakdown)
System Dedicated Memory Allocation and Shared Memory Usage Statistics (with LPAR, VIOS & Power Hypervisor usage breakdown)
Advanced Virtualization Statistics
Per-LPAR Dispatch Wait Time Statistics
Per-LPAR Placement Indicator (for understanding whether the LPAR placement is good or bad, based on a score)
Virtual IO Statistics
Virtual IO Server CPU / Memory Usage (aggregated, breakdown)
SEA Traffic & Bandwidth Usage Statistics (aggregated & per client, intra/inter-LPAR breakdown)
NPIV Traffic & Bandwidth Usage Statistics (HBA & per-client breakdown)
vSCSI Statistics (aggregated & per-client usage)
VLAN Traffic & Bandwidth Usage Statistics (adapter & LPAR breakdown)
SR-IOV Traffic & Bandwidth Usage Statistics (physical & virtual function statistics with LPAR breakdown)

Slide 7
Performance Metrics (cont.)
Raw Metrics: cumulative counters (since IPL) or quantities (size, config, etc.)
Fixed sampling intervals: general-purpose monitoring at 30 seconds (30-minute cache); short-term problem diagnosis at 5 seconds (15-minute cache)
Processed Metrics: utilization (CPU, I/O, etc.), fixed interval of 30 seconds, preserved for 4 hours
Aggregated Metrics: rolled-up Processed Metrics at 15-minute, 2-hour & daily intervals (min, average & max), preserved for a maximum of 365 days (configurable per HMC & limited by storage space)

Slide 8
New control for storage and enablement

Slide 9
Aggregate Server: Current Usage (CPU, Memory, IO)

Slide 10
Partition: Entitlement vs Usage Spread, Detail

Slide 11
Partition: Processor Utilization

Slide 12
Partition: Network, including SR-IOV support

Slide 13
Storage by VIOS, vSCSI or NPIV

Slide 14
Minimum features with all POWER6 & above models:
Managed System CPU Utilization (point in time & historical)
Managed System Memory Assignment (point in time & historical)
Server Overview section of historical data with LPAR & VIOS view
Processor trend views with LPAR, VIOS & Processor Pool (no System Firmware Utilization or Dispatch Metrics; these will be shown as zero)
Memory trend views with LPAR & VIOS view
These metrics were available via legacy HMC performance data collection mechanisms and are picked up by the monitor.
HMC v8 Monitor Support (June 2014 GA)

Slide 15
HMC v8 Monitor Support (new firmware-based function)
FW 780 & VIOS 2.2.3: all function, except on 770/780-MxB models there is no support for LPAR Dispatch Wait Time or Power Hypervisor Utilization.
FW 780 or above with a VIOS level below 2.2.3: the following functions are not available (basically, no IO utilization): Network Bridge / Virtual Storage trend data; VIOS Network / Storage utilization.
FW 770 or less with VIOS 2.2.3 or later: these are not provided: Network Bridge trend data; LPAR Dispatch Wait Time; Power Hypervisor Utilization.
FW 770 or less with a VIOS level below 2.2.3: the tool will not provide Network Bridge / Virtual Storage trend data; VIOS Network / Storage utilization; LPAR Dispatch Wait Time; Power Hypervisor Utilization.

Slide 16
Dynamic Platform Optimizer Update

Slide 17
What is Dynamic Platform Optimizer (DPO)?
DPO is a PowerVM virtualization feature that enables users to improve partition memory and processor placement (affinity) on Power servers after they are up and running. DPO performs a sequence of memory and processor relocations to transform the existing server layout into the optimal layout based on the server topology.
Client benefits:
Ability to run without a platform IPL (entire system)
Improved performance in cloud or highly virtualized environments
Dynamically adjust topology after mobility

Slide 18
What is Affinity?
Affinity is a locality measurement of an entity with respect to physical resources.
An entity could be a thread within AIX/i/Linux, or the OS instance itself.
Physical resources could be a core, chip, node, socket, cache (L1/L2/L3), memory controller, memory DIMMs, or I/O buses.
Affinity is optimal when the number of cycles required to access resources is minimized.
POWER7+ 760 planar: note the x & z buses between chips, and the A & B buses between Dual Chip Modules (DCMs). In this model, each DCM is a node.

Slide 19
Partition Affinity: Why is it not always optimal?
Partition placement can become sub-optimal because of:
Poor choices in Virtual Processor, Entitlement or Memory sizing. The Hypervisor uses Entitlement & Memory settings to place a partition; wide use of 10:1 Virtual Processor to Entitlement settings does not lend any information for optimal placement. Before you ask, there is no single golden rule, magic formula, or IBM-wide Best Practice for Virtual Processor & Entitlement sizing. If you want education in sizing, ask for it.
Dynamic creation/deletion, processor and memory ops (DLPAR)
Hibernation (Suspend or Resume)
Live Partition Mobility (LPM)
CEC Hot add, Repair, & Maintenance (CHARM)
Older firmware levels are less sophisticated in placement and dynamic operations.
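One practical way to spot poorly placed partitions is to read the per-LPAR affinity scores that newer firmware exposes through the HMC's lsmemopt command (sample output appears on slide 22). A minimal sketch, assuming the key=value,key=value output format shown later in this deck; the 90-point threshold is an arbitrary illustration, not an IBM recommendation:

```python
# Hedged sketch: parse per-LPAR affinity scores as printed by
# "lsmemopt -m <system> -o currscore -r lpar" (format assumed from the
# sample output on slide 22) and flag partitions that may be poorly placed.
def parse_lpar_scores(lines):
    """Return {lpar_name: score or None} from lsmemopt -r lpar output."""
    scores = {}
    for line in lines:
        fields = dict(kv.split("=", 1) for kv in line.strip().split(","))
        raw = fields.get("curr_lpar_score", "none")
        # "none" means no score is available for that partition
        scores[fields["lpar_name"]] = None if raw == "none" else int(raw)
    return scores

def poorly_placed(scores, threshold=90):
    """Names of LPARs scoring below threshold (100 = ideal placement)."""
    return sorted(n for n, s in scores.items() if s is not None and s < threshold)
```

Against the slide 22 sample, this would flag calvinp53 (score 74) while skipping partitions whose score is reported as "none".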
Slide 20
Partition Affinity: Hypothetical 4-Node Frame
Diagram: partitions X, Y and Z scattered across nodes with free LMBs; after the DPO operation, each partition's resources are consolidated.

Slide 21
Current and Predicted Affinity enhancement with V7R780 firmware
Scores at the partition level along with the system-wide scores:
lsmemopt -m managed_system -o currscore -r [sys | lpar]
lsmemopt -m managed_system -o calcscore -r [sys | lpar] [--id requested_partition_list] [--xid protected_partition_list]
sys = system-wide score (default if the -r option is not specified)
lpar = partition scores

Slide 22
Example: V7R780 firmware current affinity score
lsmemopt -m calvin -o currscore -r sys
curr_sys_score=97
lsmemopt -m calvin -o currscore -r lpar
lpar_name=calvinp1,lpar_id=1,curr_lpar_score=100
lpar_name=calvinp2,lpar_id=2,curr_lpar_score=100
lpar_name=calvinp50,lpar_id=50,curr_lpar_score=100
lpar_name=calvinp51,lpar_id=51,curr_lpar_score=none
lpar_name=calvinp52,lpar_id=52,curr_lpar_score=100
lpar_name=calvinp53,lpar_id=53,curr_lpar_score=74
lpar_name=calvinp54,lpar_id=54,curr_lpar_score=none
Get the predicted score:
lsmemopt -m calvin -o calcscore -r sys
curr_sys_score=97,predicted_sys_score=100,requested_lpar_ids=none,protected_lpar_ids=none

Slide 23
HMC CLI: Starting/Stopping a DPO Operation
optmem -m managed_system -t affinity -o start [--id requested_partition_list] [--xid protected_partition_list]
Partition lists are comma-separated and can include ranges, e.g. --id 1,3,5-8
Requested partitions: partitions that should be prioritized (default = all LPARs)
Protected partitions: partitions that should not be touched (default = no LPARs)
Exclude by name (-x CAB,ZIN) or by LPAR id number (--xid 5,10,16-20)
optmem -m managed_system -t affinity -o stop
Use these switches to exclude partitions by name or number, for example partitions that are not DPO aware.

Slide 24
HMC CLI: DPO Status
lsmemopt -m managed_system
in_progress=0,status=Finished,type=affinity,opt_id=1,progress=100,requested_lpar_ids=none,protected_lpar_ids=none,impacted_lpar_ids=106,110
opt_id: unique optimization identifier
progress: estimated progress %
impacted_lpar_ids: LPARs that were impacted by the optimization (i.e. had CPUs, memory, or their hardware page table moved)

Slide 25
What's New (V7R7.8.0): DPO Schedule, Thresholds, Notifications (these apply to the system affinity score, not the LPAR affinity score)

Slide 26
DPO Supported Hardware and Firmware Levels
Introduced in fall 2012 (with feature code EB33): 770-MMD and 780-MHD with firmware level 760.00; 795-FHB with firmware level 760.10 (760 with fix pack 1). Recommend 760_069, which has the enhancements below.
Additional systems added spring 2013 with firmware level 770: 710, 720, 730, 740 D-models with firmware level 770.00; 750, 760 D-models with firmware level 770.10 (770 with fix pack 1); 770-MMC and 780-MHC with firmware level 770.20 (770 with fix pack 2)
Performance enhancements: DPO memory movement time reduced; scoring algorithm improvements. Recommend firmware at 770_021.
Affinity scoring at the LPAR level with firmware level 780, delivered Dec 2013: 770-MMB, 780-MHB added with 780.00; 795-FHB updated with 780.00; 770-MMD, 780-MHD (AM780_056_040 level released 4/30/2014)
* Some Power models and firmware releases listed above are currently planned for the future and have not yet been announced. All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
http://www-304.ibm.com/support/customercare/sas/f/power5cm/power7.html

Slide 27
Running DPO
DPO-aware Operating Systems:
AIX: 6.1 TL8 or later, AIX 7.1 TL2 or later
IBM i: 7.1 TR6 or later
Linux: some reaffinitization in RHEL7/SLES12
(fully implemented in follow-on releases)
VIOS 2.2.2.0 or later
HMC V7R7.6.1
Partitions that are DPO aware are notified after DPO completes.
Re-affinitization required: performance team measurements show reaffinitization is critical. For older OS levels, users can exclude those partitions from optimization, or reboot them after running the DPO optimizer.
Affinity (at a high level) is as good as a CEC IPL (assuming unconstrained DPO).

Slide 28
More Information
IBM PowerVM Virtualization Managing and Monitoring (June 2013), SG24-7590-04: http://www.redbooks.ibm.com/abstracts/sg247590.html?Open
IBM PowerVM Virtualization Introduction and Configuration (June 2013), SG24-7940-05: http://www.redbooks.ibm.com/abstracts/sg247940.html?Open
POWER7 Information Center, under the logical partitioning topic: http://pic.dhe.ibm.com/infocenter/powersys/v3r1m5/index.jsp?topic=%2Fp7hat%2Fiphblmanagedlparp6.htm
IBM developerWorks: https://www.ibm.com/developerworks/community/blogs/PowerFW/entry/dynamic_platform_optimizer5?lang=en
POWER7 Logical Partitions: Under the Hood: http://www-03.ibm.com/systems/resources/power_software_i_perfmgmt_processor_lpar.pdf

Slide 29
PowerVP

Slide 30
http://www.redbooks.ibm.com/redpieces/pdfs/redp5112.pdf Draft available now!
PowerVP Redbook

Slide 31
Review: POWER7+ 750/760 Four-Socket Planar Layout (Power 750/760 D Technical Overview)
Note the x & z buses between chips, and the A & B buses between Dual Chip Modules (nodes).

Slide 32
Review: POWER7+ 770/780 Four-Socket Planar Layout (Power 770/780 D Technical Overview)
Loc Code / Conn Ref. Not as pretty as the 750+ diagram: note that we have x, w & z buses between chips with this model, and the buses to other nodes (not pictured) and IO are a little more cryptic.

Slide 33
PowerVP - Virtual/Physical Topology Utilization

Slide 34
During an IPL of the entire Power System, the Hypervisor determines an optimal resource placement strategy for the server based on the partition configuration and the hardware topology of the system. There was a desire to have a visual understanding of how the hardware resources were assigned and being consumed by the various partitions running on the platform. It was also desired to have a visual indication showing a resource's consumption, when it was going past a warning threshold (yellow), and when it was entering an overcommitted threshold (red).
Why PowerVP - Power Virtualization Performance

Slide 35
PowerVP Overview
Graphically displays data from existing and new performance tools
Converges performance data from across the system
Shows CEC, node & partition level performance data
Illustrates topology utilization with colored heat threshold settings
Enables drill-down for both physical and logical approaches
Allows real-time monitoring and a recording function
Simplifies physical/virtual environment monitoring and analysis
Not intended to replace any current monitoring or management product

Slide 36
PowerVP Environment
System-wide Collector (one required per system): P7 topology information; P7 chip/core utilizations; P7 Power bus utilizations; memory and I/O utilization; LPAR entitlements and utilization
Partition Collectors (required for the logical view): LPAR CPU utilization; disk activity; network activity; CPI analysis; cache analysis
Data sources: core HPMCs, chip PMUlets and Hypervisor interfaces (system collector) in the Power hardware and FW/Hypervisor; thread PMUs (partition collector) in the operating system (IBM i, AIX, VIOS, Linux)
You only need to install a single system-wide collector to see global metrics.

Slide 37
PowerVP System, Node and Partition Views: system topology, node drill-down, partition drill-down

Slide 38
PowerVP System Topology
The initial view shows the hardware topology of the system you are logged into. In this view, we see a Power 795 with all eight books/nodes installed, each with four sockets. Values within boxes show CPU usage; lines between nodes show SMP fabric activity.

Slide 39
Active buses are shown with solid colored lines. These can be between nodes, chips, memory controllers and IO buses.
PowerVP Node Drill-Down
This view appears when you click on a node and allows you to see resource assignments and consumption. In this view, we see a POWER7 780 node with four chips, each with four cores.

Slide 40
PowerVP 1.1.2: Node View (POWER7 780)

Slide 41
PowerVP 1.1.2: Chip (POWER7 780 with 4 cores)
View elements: CHIP, Memory Controller, DIMM, IO, SMP Bus, LPARs

Slide 42
PowerVP 1.1.2: CPU Affinity
LPAR 7 has 8 VPs. As we select cores, 2 VPs are homed to each core. The fourth core has 4 VPs from four LPARs homed to it. This does not prevent VPs from being dispatched elsewhere in the pool as utilization requirements demand.

Slide 43
PowerVP 1.1.2: Memory Affinity
LPAR 7 Online Memory is 32768 MB, 50% of the 64 GB in DIMMs. Note: LPARs will be listed in color order in the shipping version.

Slide 44
PowerVP - Partition Drill-Down
This view allows us to drill down on the resources being used by the selected partition. In this view, we see CPU, memory, disk IOPS, and Ethernet being consumed. We can also get an idea of cache and memory affinity. We can drill down on several of these resources; for example, we can drill down on disk transfer or network activity by selecting the resource.

Slide 45
PowerVP - Partition Drill-Down (CPU, CPI)

Slide 46
PowerVP - Partition Drill-Down (Disk)

Slide 47
PowerVP: How do I use this?
PowerVP is not intended to replace traditional performance management products. It does not let you manage CPU, memory or IO resources. It does provide an overview of hardware resource activity that allows you to get a high-level view:
Node/socket activity
Cores assigned to dedicated and shared-pool VMs
Virtual Processors assigned to cores
VMs' memory assigned to DIMMs
Memory bus activity
IO bus activity
It also provides partition activity related to storage & network, CPU, and software cycles-per-instruction.

Slide 48
PowerVP: How do I use this? High Level
The high-level view can allow visual identification of node and bus stress. Thresholding is largely arbitrary, but if one memory controller is obviously saturated and others are inactive, you have an indication that a more detailed review is required. There are no rules of thumb or best practices for thresholds. You can review system Redbooks and determine where you are with respect to bus performance (not always available, but newer Redbooks are more informative). This tool provides high-level diagnosis with some detailed view (if partition-level collectors are installed).

Slide 49
PowerVP: How do I use this? Low Level
Cycles-Per-Instruction (CPI) is a complicated subject; it will be beyond the capacity of most customers to assess in detail. In general, a lower CPI is better: the fewer CPU cycles per instruction, the more instructions can get done. PowerVP gives you various CPI values; these, in conjunction with OS tools, can tell you whether you have good affinity.
Affinity is a measurement of a thread's locality to physical resources. Resources can be many things: L1/L2/L3 cache, core(s), chip, memory controller, socket, node, drawer, etc.
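The CPI metric itself is nothing more than cycles divided by instructions retired. A minimal illustration with hypothetical counter values (not actual PowerVP output): a thread with poor affinity burns extra stall cycles waiting on remote cache or memory, so for the same cycle budget it retires fewer instructions and its CPI rises.

```python
def cpi(cycles, instructions):
    """Cycles-per-instruction: lower generally means better efficiency."""
    if instructions == 0:
        raise ValueError("no instructions retired in the sample")
    return cycles / instructions

# Two hypothetical samples over the same cycle budget: better affinity
# usually shows up as fewer stall cycles, hence a lower CPI.
good_affinity = cpi(cycles=2_000_000, instructions=1_600_000)  # 1.25
poor_affinity = cpi(cycles=2_000_000, instructions=800_000)    # 2.5
```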
Slide 50
AIX Enhanced Affinity
AIX on POWER7 and above uses Enhanced Affinity instrumentation to localize threads by Scheduler Resource Allocation Domain (SRAD).
AIX Enhanced Affinity measures:
Local: usually a chip
Near: local node/DCM
Far: other node/drawer/CEC
These are logical mappings, which may or may not be exactly 1:1 with physical resources.
(Examples: POWER8 S824 DCM; POWER7 770/780/795. Affinity: Local = chip, Near = intra-node, Far = inter-node.)

Slide 51
AIX topas Logical Affinity (topas -M)

Topas Monitor for host: claret4    Interval: 2
===================================================================
REF1 SRAD  TOTALMEM  INUSE  FREE   FILECACHE  HOMETHRDS  CPUS
-------------------------------------------------------------------
0    2     4.48G     515M   3.98G  52.9M      134.0      12-15
     0     12.1G     1.20G  10.9G  141M       236.0      0-7
1    1     4.98G     537M   4.46G  59.0M      129.0      8-11
     3     3.40G     402M   3.01G  39.7M      116.0      16-19
===================================================================
CPU  SRAD  TOTALDISP  LOCALDISP%  NEARDISP%  FARDISP%
----------------------------------------------------------
0    0     303.0      43.6        15.5       40.9
2    0     1.00       100.0       0.0        0.0
3    0     1.00       100.0       0.0        0.0
4    0     1.00       100.0       0.0        0.0
5    0     1.00       100.0       0.0        0.0
6    0     1.00       100.0       0.0        0.0

What's a bad FARDISP% rate? There is no rule of thumb, but thousands of far dispatches per second will likely indicate lower performance.
How do we fix this?
Entitlement & Memory sizing best practices + current firmware + Dynamic Platform Optimizer. (Diagram labels: Node, Chip, Dispatches; local is optimal.)

Slide 52
PowerVP Physical Affinity: VM View
PowerVP can show us physical affinity (local, remote & distant); AIX topas can show us logical affinity (local, near & far). The more local, the more ideal. (Diagram labels: Cache Affinity, DIMM Affinity; local is optimal.) Computed CPI is an inverse measure: lower is typically better.

Slide 53
PowerVP Supported Power Models and ITEs
Power System models and ITEs with 770 firmware support:
710-E1D, 720-E4D, 730-E2D, 740-E6D (also includes Linux D models)
750-E8D, 760-RMD
770-MMC, 780-MHC, ESE 9109-RMD
p260-22X, p260-23X, p460-42X, p460-43X, p270-24X, p470-44X, p24L-7FL
71R-L1S, 71R-L1C, 71R-L1D, 71R-L1T, 7R2-L2C, 7R2-L2S, 7R2-L2D, 7R2-L2T
Power System models added with 780 firmware support:
770-MMB and 780-MHB (eConfig support 1/28/2014); 795-FHB (Dec 2013)
Power System models with 780 firmware support: 770-MMD, 780-MHD (4/30/2014)
Pre-770 firmware models do not have the instrumentation to support PowerVP.
* Some Power models and firmware releases listed above are currently planned for the future and have not yet been announced. All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
http://www-304.ibm.com/support/customercare/sas/f/power5cm/power7.html

Slide 54
PowerVP OS Support
Announced and GA in 4Q 2013; PowerVP 1.1.2 ships 6/2014
Available as a standalone product or with PowerVM Enterprise Edition
Agents run on IBM i, AIX, Linux on Power and VIOS: IBM i V7R1, AIX 6.1 & 7.1, any VIOS level that supports POWER7, RHEL 6.4, SUSE 11 SP3
Client supported on Windows, Linux, and AIX; the client requires Java 1.6 or greater
Installer provided for Windows, Linux, and AIX. A Java installer is also included, which has worked under VMware and OS X (limited testing) where the others don't.

Slide 55
VIOS Performance Advisor 2.2.3

Slide 56
VIOS Performance Advisor: What is it?
Not another performance monitoring tool, but an integrated report that leverages other tools and the lab's knowledge base:
Summarizes the overall performance of a VIOS
Identifies potential bottlenecks and performance inhibitors
Proposes actions to be taken to address the bottlenecks
The beta VIOS Performance Advisor was productized and shipped with the Virtual I/O Server (shipped with VIOS 2.2.2).

Slide 57
Performance Advisor: How does it work?
Polls key performance metrics over a period of time, analyzes the data, and produces an XML-formatted report for viewing in a browser.
The part command is available in the VIOS restricted shell (pronounced "p-Art": Performance Analysis & Reporting Tool).
The part command can be executed in two different modes: monitoring mode (which actually uses nmon recording now) and post-processing nmon recording mode.
The final report, along with the supporting files, is bundled together in a tar-formatted file. Users can download & extract it to a PC or any machine with a browser installed to view the report.
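Since the advisor's output is just a tar bundle containing an XML report plus supporting files, unpacking it for viewing can be scripted. A minimal sketch, assuming only what the deck states (a tar file with one or more *.xml reports inside); file and directory names here are hypothetical:

```python
# Hedged sketch: unpack the tar bundle produced by the VIOS "part" command
# and locate the XML report(s) inside it for loading into a browser.
import tarfile
from pathlib import Path

def extract_report(tar_path, dest="advisor_report"):
    """Extract the advisor bundle and return paths of any XML reports found."""
    with tarfile.open(tar_path) as tar:
        tar.extractall(dest)
    return sorted(Path(dest).rglob("*.xml"))
```

Usage would be along the lines of `extract_report("vio1_advisor.tar")`, then opening the returned XML path in a browser.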
Slide 58
VIOS Performance Advisor: Process
Collect data in one of two ways:
Monitoring mode (5 to 60 minutes): log in as padmin, then $ part -i 30
Post-processing an nmon recording: log in as padmin, then $ part -f vio1_130915_1205.nmon
Transfer & view the report: transfer the generated tar file to a machine with browser support, extract the tar file, and load the *.xml file in a browser.

Slide 59
VIOS Performance Advisor: Browser View

Slide 60
VIOS Performance Advisor: Legend, Risk & Impact
Risk: level of risk, on a range of 1 to 5, of making the suggested value change
Impact: potential performance impact, on a range of 1 to 5, of making the suggested value change
Advisor legend: Informative, Investigate, Optimal, Warning, Critical, Help/Info

Slide 61
VIOS Performance Advisor: Config

Slide 62
VIOS Performance Advisor: Tunable Information
When you select the help icon, a pop-up with guidance appears.

Slide 63
VIOS Performance Advisor: CPU Guidance

Slide 64
VIOS Performance Advisor: Shared Pool Guidance
If shared pool monitoring is enabled, the Advisor will report status, settings, and whether there is a constraint. Enablement is via the partition properties panel: "Allow performance information collection".

Slide 65
VIOS Performance Advisor: Memory Guidance

Slide 66
VIOS Performance Advisor: IO Total & Disks

Slide 67
VIOS Performance Advisor: Disk Adapters

Slide 68
VIOS Performance Advisor: FC Details
FC utilization is based on peak IOPS rates.

Slide 69
VIOS Performance Advisor: NPIV Breakdowns

Slide 70
VIOS Performance Advisor: Storage Pool

Slide 71
VIOS Performance Advisor: Shared Ethernet
The accounting feature must be enabled on the VIOS: chdev -dev ent* -attr accounting=enabled
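The FC utilization figure above is stated to be "based on peak IOPS rates". One plausible reading is a percentage of observed IOPS against a peak IOPS rate; this is an illustrative assumption, not the advisor's published algorithm, and the numbers are made up:

```python
# Hedged sketch: express an FC adapter's load as a percentage of a peak
# IOPS rate. Both the formula and the example values are assumptions used
# only to illustrate the idea of IOPS-based utilization.
def fc_utilization_pct(observed_iops, peak_iops):
    """Utilization as a percentage of the peak IOPS rate."""
    if peak_iops <= 0:
        raise ValueError("peak IOPS must be positive")
    return 100.0 * observed_iops / peak_iops
```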
Slide 72
VIOS Performance Advisor: Shared Tunings

Slide 73
Performance Advisor: Overhead
The CPU overhead of running this tool on the VIOS is the same as that of running an nmon recording: very low. The memory footprint of the command is also kept to a minimum. However, in post-processing mode, if the recording contains a high number of samples, the part command will consume noticeable CPU when executed. Example: an nmon recording with 4000 samples, 100 MB in size, collected on a VIOS with 255 disks configured, will take about 2 minutes to analyze on a VIOS with an entitlement of 0.2. A typical (default) nmon recording contains 1440 samples, so the above example is on the high end of the scale.

Slide 74
Affinity Backup

Slide 75
What is Affinity?
Affinity is a locality measurement of an entity with respect to physical resources. An entity could be a thread within AIX/i/Linux, or the OS instance itself. Physical resources could be a core, chip, node, socket, cache (L1/L2/L3), memory controller, memory DIMMs, or I/O buses. Affinity is optimal when the number of cycles required to access resources is minimized.
POWER7+ 760 planar: note the x & z buses between chips, and the A & B buses between Dual Chip Modules (DCMs). In this model, each DCM is a node.

Slide 76
Thread Affinity
Performance is closer to optimal when threads stay close to physical resources.
Thread affinity is a measurement of proximity to a resource. Examples of resources: L2/L3 cache, memory, core, chip and node.
Cache affinity: threads in different domains need to communicate with each other, or cache needs to move with threads migrating across domains.
Memory affinity: threads need to access data held in a different memory bank, not associated with the same chip or node.
Modern highly multi-threaded workloads are architected to have lightweight threads and distributed application memory, and can span domains with limited impact. Unix scheduler/dispatcher/memory manager mechanisms spread workloads.

Slide 77
Partition Affinity: Why is it not always optimal?
Partition placement can become sub-optimal because of:
Poor choices in Virtual Processor, Entitlement or Memory sizing. The Hypervisor uses Entitlement & Memory settings to place a partition; wide use of 10:1 Virtual Processor to Entitlement settings does not lend much information for best placement. Before you ask, there is no single golden rule, magic formula, or IBM-wide Best Practice for Virtual Processor & Entitlement sizing. If you want education in sizing, ask for it.
Dynamic creation/deletion, processor and memory ops (DLPAR)
Hibernation (Suspend or Resume)
Live Partition Mobility (LPM)
CEC Hot add, Repair, & Maintenance (CHARM)
Older firmware levels are less sophisticated in placement and dynamic operations.

Slide 78
How does partition placement work?
PowerVM knows the chip types and memory configuration, and attempts to pack partitions onto the smallest number of chips/nodes/drawers. Optimizing placement results in higher exploitation of local CPU and memory resources. Dispatches across node boundaries incur longer latencies, and both AIX and PowerVM actively try to minimize that via Enhanced Affinity mechanisms.
It considers the partition profiles and calculates optimal placements. Placement is a function of Desired Entitlement and the Desired & Maximum Memory settings; Virtual Processor counts are not considered.
Maximum memory defines the size of the Hardware Page Table (HPT) maintained for each partition: 1/64th of Maximum on POWER7, and 1/128th on POWER7+ and POWER8. Ideally, Desired + (Maximum / HPT ratio) < node memory size, if possible.

Slide 79
What tools exist for optimizing affinity?
Within AIX, two technologies are used to maximize thread affinity:
The AIX dispatcher uses Enhanced Affinity services to keep a thread within the same POWER7 multiple-core chip, to optimize chip and memory controller use.
Dynamic System Optimizer (DSO) proactively monitors, measures and moves threads, their associated memory pages and memory pre-fetch algorithms to maximize core, cache and DIMM efficiency. (We do not cover this feature in this presentation.)
Within a PowerVM frame, three technologies assist in maximizing partition affinity:
The PowerVM Hypervisor determines an optimal resource placement strategy for the server based on the partition configuration and the hardware topology of the system.
Dynamic Platform Optimizer relocates OS instances within a frame for optimal physical placement.
PowerVP allows us to monitor placement, node, memory bus, IO bus and Symmetric Multi-Processor (SMP) bus activity.

Slide 80
AIX Enhanced Affinity
AIX on POWER7 and above uses Enhanced Affinity instrumentation to localize threads by Scheduler Resource Allocation Domain (SRAD).
AIX Enhanced Affinity measures:
Local: usually a chip
Near: local node/DCM
Far: other node/drawer/CEC
These are logical mappings, which may or may not be exactly 1:1 with physical resources. (Examples: POWER8 S824 DCM; POWER7 770/780/795. Affinity: Local = chip, Near = intra-node, Far = inter-node.)

Slide 81
AIX Affinity: the lssrad tool shows logical placement
View of a 24-way, two-socket POWER7+ 760 with Dual Chip Modules (DCMs): 6 cores per chip, 12 in each DCM; 5 Virtual Processors x 4-way SMT = 20 logical CPUs.
Terms: REF1 = node (drawer or DCM/MCM socket); SRAD = Scheduler Resource Allocation Domain

# lssrad -av
REF1  SRAD  MEM       CPU
0
      0     12363.94  0-7
      2     4589.00   12-15
1
      1     5104.50   8-11
      3     3486.00   16-19

If a thread's home node were SRAD 0, SRAD 2 would be near, and SRADs 1 & 3 would be far. (Node 0 holds SRADs 0 & 2; Node 1 holds SRADs 1 & 3.)

Slide 82
Affinity: Cycles-Per-Instruction
Another way to look at affinity is by watching how many cycles a thread uses. This can be done via Cycles-Per-Instruction (CPI) measurements. POWER architectures are instrumented with a variety of CPI values for chip resources. These measurements are usually a complicated subject, the domain of hardware and software developers. In general, a lower CPI is better: the fewer CPU cycles per instruction, the more efficient it is. We return to this concept in the PowerVP section.

Slide 83
Affinity: Diagnosis
When may I have a problem?
- An SRAD has CPUs but no memory, or vice versa
- When CPU or memory are very unbalanced
But how do I really know?
- Tools tell you: lssrad/topas/mpstat/svmon (AIX), numactl (Linux) & PowerVP
- A high percentage of threads with far dispatches
- Disparity in performance between equivalent systems
PowerVM & POWER8 provide a variety of improvements:
- PowerVM has come a long way in the last three years: firmware, AIX, Dynamic Platform Optimizer and PowerVP give you a lot of options
- Cache (sizes, pre-fetch, L4, Non-Uniform Cache Access logic), controller, and massive DIMM bandwidth improvements
- Inter-socket latencies and efficiency have progressively improved from POWER7 to POWER7+ and now POWER8

Slide 84
How do I Optimize Affinity?
This is a separate topic, but an overview of the options:
Follow POWER7 Best Practices for sizing (in general, size partition entitlement, desired & maximum memory settings to match real usage and chip/node sizes)
Update to newer firmware levels; they are much smarter about physical placement of virtualized OS instances
Use Dynamic Platform Optimizer (DPO) to optimally place partitions within a frame
Monitor Enhanced Affinity metrics in AIX (topas -M)
Use Dynamic System Optimizer (DSO) to optimally place threads within AIX. DSO does this by monitoring core, cache and DIMM memory use by individual threads.
Use software products that are affinity-aware (newer levels of some WebSphere products are capable of this)
Manually create Resource Sets (rsets) of CPU & memory resources and assign workloads to them (expert level)
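The slide 83 diagnosis rule ("an SRAD has CPUs but no memory, or vice versa") can be applied mechanically to lssrad output. A minimal sketch, with the parsing format assumed from the slide 81 example; a real script should be validated against your own lssrad output:

```python
# Hedged sketch: parse "lssrad -av"-style output (format assumed from the
# slide 81 example) and flag any SRAD that has CPUs but no memory, or
# memory but no CPU range.
def parse_lssrad(text):
    """Return {srad_id: (mem_mb, cpu_range_or_None)} from lssrad -av output."""
    srads = {}
    for line in text.splitlines():
        tokens = line.split()
        # Skip blank lines, the command echo, the header, and bare REF1 rows
        if len(tokens) < 3 or tokens[0].startswith("#") or tokens[0] == "REF1":
            continue
        if len(tokens) == 4:          # row carries a leading REF1 column
            tokens = tokens[1:]
        srad, mem, cpu = tokens
        srads[int(srad)] = (float(mem), None if cpu == "-" else cpu)
    return srads

def imbalanced(srads):
    """SRAD ids violating the rule: CPUs without memory, or memory without CPUs."""
    return sorted(s for s, (mem, cpu) in srads.items() if mem == 0.0 or not cpu)
```

On the slide 81 sample, every SRAD has both memory and a CPU range, so nothing is flagged; a hypothetical SRAD reported with 0.0 MB of memory and CPUs would be.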