
IBM eServer p5 AIX 5L Support for Micro-Partitioning

and Simultaneous Multi-threading White Paper

Luke Browning

July 13, 2004


Abstract

IBM eServer® p5 servers with POWER5™ processors contain new technologies called simultaneous multi-threading and, optionally, Micro-Partitioning™. AIX 5L™ Version 5.3 contains support for these new technologies. This white paper describes these new technologies and the AIX 5L support for them.


Overview

Micro-Partitioning

Micro-Partitioning is a mainframe-inspired technology that is based on two major advances in the area of server virtualization. Physical processors and I/O devices have been virtualized, enabling these resources to be shared by multiple partitions. There are several advantages associated with this technology, including finer-grained resource allocation, more partitions, and higher resource utilization.

The virtualization of processors requires a new partitioning model, since it is fundamentally different from the partitioning model used on POWER4™ processor-based servers, where whole processors are assigned to partitions. These processors are owned by the partition and are not easily shared with other partitions. These partitions are called dedicated partitions. In the new Micro-Partitioning model, physical processors are abstracted into virtual processors which are assigned to partitions. These virtual processor objects can’t be shared, but the underlying physical processors are shared, since they are used to actualize virtual processors at the platform level. This sharing is the primary feature of this new partitioning model, and it happens automatically. These partitions are called shared processor partitions.

Note that the virtual processor abstraction is implemented in the hardware and the POWER Hypervisor™, a component of firmware. From an operating system perspective, a virtual processor is indistinguishable from a physical processor, unless the operating system has been enhanced to be aware of the difference. The key benefit of implementing partitioning in the hardware and firmware is that any operating system can run on POWER5 technology with little or no change. Optionally, for optimal performance, the operating system can be enhanced to exploit Micro-Partitioning more deeply, for instance by voluntarily relinquishing CPU cycles to the POWER Hypervisor when they are not needed. AIX 5L V5.3 is the first version of AIX 5L that includes such enhancements.

The system administrator defines the number of virtual processors that may be utilized by a partition, as well as the physical processor capacity that should be applied to actualize those virtual processors. The system administrator may specify that a fraction of a physical processor be applied to a partition, enabling partitions with fractional processor capacity to be created.

With fractional processor allocations, more partitions can be created on a given platform enabling clients to maximize the number of workloads that can be supported simultaneously. While it is possible to mix workloads on a single operating system image, most clients would prefer to minimize workload interactions for performance and stability reasons. Their preferred method is to place them in separate partitions. Shared processor technology makes this easier by supporting more partitions than traditional logical or physical partitioning systems.

Another important aspect of this technology is that it results in increased physical processor utilization levels. There are two reasons for this. First, capacity can be allocated more precisely at the partition level, since fractional allocations are supported. Second, operating systems are provided with the ability to give back capacity when there is no real work to be performed, which enables the processor to be applied elsewhere by the hypervisor. This has the effect of minimizing idle time at the platform level, which by definition raises physical processor utilization, since utilization is a measurement of the amount of productive work performed.


From a cost perspective, this is very significant, because it means that less processor capacity has to be purchased, when consolidating servers. It is not uncommon to see 10% utilization levels in client shops, so the cost of extra processor capacity should be an important factor in the purchasing decision, particularly since processors are relatively expensive. The Micro-Partitioning capability of the POWER5 processor-based servers can provide a cost effective alternative to server farms.

Simultaneous Multi-threading

Simultaneous multi-threading is the ability of a single physical processor to simultaneously dispatch instructions from more than one hardware thread context. It is a feature of the POWER5 processor and will be available at the same time as shared processors. There are two hardware threads per processor.

Simultaneous multi-threading is designed to take advantage of the superscalar nature of the POWER5 processor, so that more instructions can be executed at the same time. The basic concept is that no single application can fully saturate the processor, so it is better to have multiple threads of execution providing instructions at the same time.

Simultaneous multi-threading is expected to be used primarily in commercial environments, where the speed of an individual transaction is not as important as the total number of transactions that can be performed. It is expected to increase the throughput of workloads with large or frequently changing working sets such as database servers and Web servers.

IBM has documented the performance benefit at 30%. For more information, see the following URL:

http://www.ibm.com/servers/eserver/pseries/hardware/system_perf.html

I/O Virtualization

I/O virtualization comprises four distinct functions: 1) Virtual Ethernet, 2) Shared Ethernet adapters, 3) shared Fibre Channel adapters, and 4) virtualized disks. By sharing adapters and disks, clients are not required to dedicate adapters to their LPARs, making the I/O model more economical.

Virtual Ethernet enables users to set up network interfaces for inter LPAR communication. The POWER Hypervisor™ implements an IEEE-compatible Ethernet switch, and operating systems implement Virtual Ethernet adapters. The POWER Hypervisor Ethernet switch enables IEEE VLAN mechanisms as well. Using Virtual Ethernet, clients are able to set up inter LPAR communication without requiring physical Ethernet adapters.

Shared Ethernet and Fibre Channel adapters and virtualized disks are implemented using hosting LPARs. Hosting LPARs, although based on the AIX 5L operating system, are encapsulated for the purpose of simplifying system administration. They own the physical resources and allow sharing of these resources among multiple client LPARs. Communication between the hosting LPAR and client LPARs is performed using a set of hypervisor interfaces. Sharing of physical Ethernet adapters is accomplished using a Layer 2 packet forwarder, which forwards packets between client LPARs and physical network adapters. Sharing of physical Fibre Channel adapters is accomplished by allocating and mapping SAN LUNs to client LPARs. Multi-path I/O software in the hosting LPARs can protect against Fibre Channel path failures between hosting LPARs and SAN storage controllers. Virtualized disks are implemented using the AIX 5L LVM in the hosting partition; logical volumes in the hosting partition become virtualized disks in the client LPARs. The hosting LPAR in effect implements SCSI target mode.

Clients can run multiple I/O hosting LPARs for availability reasons. Client LPARs can run multi-path I/O software to protect against hosting partition failures.

Target Markets

Here are a few scenarios where this maximization of resource utilization is beneficial.

Server consolidation. This is where a number of smaller existing server systems are consolidated onto a single LPAR-capable system. Micro-Partitioning is particularly attractive in this environment when a fraction of a single POWER5 processor provides processing capacity equivalent to that of an existing server. Micro-Partitioning technology enables hundreds of these smaller existing servers to be replaced by a single POWER5 server.

Virtual blade servers. Micro-Partitioning enables hundreds of "low cost" individual virtual blade partitions to be defined that mimic the Intel® blade server environment. In a blade environment, individual blades must have enough capacity to handle bursts of activity (Web hits), but most blades in general are grossly underutilized. Micro-Partitioning and VLAN are natural fits in this environment, because the idle time of a virtual blade can be utilized by another virtual blade partition and VLANs provide low-cost high-speed communication vehicles for virtual blade servers on the same POWER5 server.

Production and batch/test systems. Micro-Partitioning provides the optimal environment for the co-existence of production and test systems. Production partitions can be defined with fixed performance requirements such that they receive the processor capacity they require on demand. Batch/test partitions can be defined with minimal resource commitment but with the ability to soak up spare cycles.

Overlapping production systems. This is an environment where system performance is critical, but the workloads of different servers are such that the peaks in demand from one server overlap the valleys in demand from another. To some degree this environment can be serviced with dynamic LPAR; however, Micro-Partitioning provides a finer-grained capability that responds much more quickly.


Architectural Overview of Micro-Partitioning

There are three major components to Micro-Partitioning.

1. POWER Hypervisor support

2. User interface for shared processor partitions and Micro-Partitioning

3. Operating system support

Micro-Partitioning is delivered through shared processor partitions and is an optional feature.

POWER Hypervisor Support

Architecturally, the POWER Hypervisor, a component of global firmware, owns the partitioning model and the resource abstractions that are required to support that model. Each partition is presented with the resource abstraction for its partition and other required information through the Open Firmware Device Tree, which is created by firmware and copied into the partition before the operating system is started. In this way, operating systems receive resource abstractions. They also participate in the partitioning model by making hypervisor calls at key points in their execution as defined by the model.

The introduction of shared processors didn’t fundamentally change this model. New virtual processor objects and hypervisor calls have been added to support shared processor partitions. In fact, the existing physical processor objects have simply been refined so that they no longer include physical characteristics of the processor, since there is no fixed relationship between a virtual processor and the physical processor that actualizes it. These new hypervisor calls are intended to support the scheduling heuristic of minimizing idle time.

For example, an operating system can indicate when a virtual processor is idle. In this case, the operating system cedes the virtual processor to the hypervisor, which enables the hypervisor to schedule the remainder of the dispatch cycle for another purpose. Another optimization that an operating system can make is to confer the remainder of a virtual processor’s dispatch cycle to one or more other virtual processors in the partition. This primitive is designed to be used when one virtual processor can’t make forward progress because it is waiting for an event to occur on another virtual processor, such as during a lock miss.

When a virtual processor cedes, it is put into the sleep state by the hypervisor. Once asleep, it can only be awakened by a prod from another virtual processor or by an external interrupt (including a timer interrupt). The prod primitive is intended to be used to awaken ceded virtual processors, so that they may be activated to handle new work.

While not required, the use of these primitives is highly desirable for performance reasons, because they improve locking and minimize idle time. Response time and throughput should be improved, if these primitives are used. Their use is not required, because the hypervisor time slices virtual processors, which enables it to sequence through each virtual processor in a continuous fashion. Forward progress is thus assured without the use of the primitives.
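To illustrate the pattern, the following minimal sketch shows how an operating system idle loop might use these primitives. It is not actual AIX source; run_queue_empty(), dispatch_next_thread(), h_cede(), and h_prod() are hypothetical placeholders for the real run queue tests and hypervisor call wrappers, which this paper does not name.

#include <stdbool.h>

/* Hypothetical stubs: the real run queue tests and hypervisor call
   wrappers are internal to the operating system. */
extern bool run_queue_empty(void);
extern void dispatch_next_thread(void);
extern void h_cede(void);        /* cede the rest of this dispatch cycle */
extern void h_prod(int vcpu);    /* awaken a ceded virtual processor */

void idle_loop(void)
{
    for (;;) {
        while (run_queue_empty()) {
            /* No runnable work: give the remainder of the dispatch
               cycle back to the hypervisor. Execution resumes after
               an external or timer interrupt, or after another
               virtual processor prods this one with new work. */
            h_cede();
        }
        dispatch_next_thread();
    }
}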


The amount of time that a virtual processor runs before it is time sliced is based on the partition entitlement, which is specified by the system administrator. The partition entitlement is evenly distributed amongst the online virtual processors, so the number of online virtual processors impacts the length of each virtual processor’s dispatch cycle. The hypervisor uses the architectural metaphor of a “dispatch wheel” with a fixed rotation period of 10 milliseconds to guarantee that each virtual processor receives its share of the entitlement in a timely fashion. Virtual processors are time sliced through the use of the hardware decrementer, much like the operating system time slices threads.

In general, the hypervisor uses a very simple scheduling model. The basic idea is that processor entitlement is distributed with each turn of the hypervisor’s dispatch wheel, so each partition is guaranteed a relatively constant stream of service. There is no concept of credit for ceded or conferred cycles. Entitlement has to be consumed by each partition in a single rotation of the wheel, or it is lost. Capacity may be consumed unevenly by the virtual processors in a partition, if some of them cede or confer.

The hypervisor schedules shared processor partitions from a set of physical processors that is called the shared processor pool. By definition, these processors are not associated with dedicated partitions. When a dedicated partition is shut down, its processors are returned to the shared processor pool, so that these resources may be used by uncapped shared processor partitions. This behavior can be disabled by the system administrator by setting a partition attribute in the partition profile of the dedicated partition at the HMC.

In shared partitions, there is not a fixed relationship between virtual processors and the physical processors that actualize them. When the hypervisor schedules a virtual processor, it may use any physical processor in the shared processor pool to which the virtual processor is assigned. By default, it attempts to use the same physical processor, but this cannot always be guaranteed. The hypervisor employs the notion of a home node for virtual processors, enabling it to select the best available physical processor from a memory affinity perspective for the virtual processor that is to be scheduled.

Affinity scheduling is designed to preserve the content of memory caches, so that the working data set of a job can be read or written in the shortest time period possible. Affinity is actively managed by the hypervisor, since each partition has a completely different context. Currently, there is one shared processor pool, so all virtual processors are implicitly associated with the same pool.

User Interface For Configuring Shared Processor Partitions

The Hardware Management Console (HMC) provides the user interface for logical partitions. For shared processors, it has been enhanced to enable the system administrator to specify the following partition attributes that are used to define the dimensions and performance characteristics of shared partitions:

• Minimum, desired, and maximum processor capacity
• Minimum, desired, and maximum number of virtual processors
• Capped or uncapped
• Variable capacity weight

Processor capacity attributes are specified in terms of processing units. 1.0 processing unit represents one physical processor. 1.5 processing units is equivalent to one and a half physical processors. For example, a shared processor partition with 2.2 processing units has the equivalent power of 2.2 physical processors.


Shared processor partitions may be defined with a processor capacity as small as 1/10th of a physical processor. A maximum of 10 partitions may be started for each physical processor in the platform. A 16-way server can thus support a maximum of 160 partitions at the same time. The architectural maximum is 254 partitions for POWER5 processor-based systems.

When a partition is started, the system chooses the partition’s entitled processor capacity from the specified capacity range. The value that is chosen represents a commitment of capacity that is reserved for the partition. This capacity cannot be used to start another shared partition, otherwise capacity could be overcommitted. Preference is given to the desired value, but these values cannot always be used, because there may not be enough unassigned capacity in the system. In that event, a different value is chosen, which must be greater than or equal to the minimum capacity attribute. Otherwise, the partition cannot be started.

The same basic process applies for selecting the number of online virtual processors with the extra restriction that each virtual processor must be granted at least 1/10th of a processing unit of entitlement. In this way, the entitled processor capacity may affect the number of virtual processors that are automatically brought online by the system during boot. The maximum number of virtual processors per partition is 64.

There is also the concept of capped and uncapped partitions. A capped partition is not allowed to exceed its entitlement, while an uncapped partition is. In fact, an uncapped partition may even exceed its maximum processor capacity attribute. An uncapped partition is only limited in its ability to consume cycles by the lack of online virtual processors and by its variable capacity weight attribute.

The variable capacity weight attribute is a number between 0 and 255 that represents the relative share of extra capacity that the partition is eligible to receive. This parameter applies only to uncapped partitions. A partition’s share is computed by dividing its variable capacity weight by the sum of the variable capacity weights of all uncapped partitions. Therefore, a value of 0 may be used to prevent a partition from receiving extra capacity. This is sometimes referred to as a “soft cap”.
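As a small illustration of the share computation described above, written as a sketch in C:

/* Relative share of spare capacity for one uncapped partition:
   its weight divided by the sum of all uncapped partitions' weights.
   A weight of 0 yields no extra capacity (a "soft cap"). */
double variable_share(unsigned int weight, const unsigned int *weights, int n)
{
    unsigned int sum = 0;
    for (int i = 0; i < n; i++)
        sum += weights[i];
    return sum ? (double)weight / sum : 0.0;
}

For example, with three uncapped partitions weighted 128, 128, and 256, the first partition is eligible for 128/512, or 25%, of the spare capacity.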

There is overhead associated with the maintenance of online virtual processors, so clients should carefully consider their capacity requirements before choosing values for these attributes. In general, the values of the minimum, desired, and maximum virtual processor attributes should parallel those of the minimum, desired, and maximum capacity attributes in some fashion. A special allowance should be made for uncapped partitions, since they are allowed to consume more than their entitlement. If the partition is uncapped, then the administrator may want to define the desired and maximum virtual processor attributes some percentage above the corresponding entitlement attributes. The exact percentage is installation specific, but 25% to 50% seems reasonable.

The following table shows several reasonable settings:

Min VPs   Desired VPs   Max VPs   Min Ent   Desired Ent   Max Ent   Capped

1         2             4         0.1       2.0           4.0       Y
1         3 or 4        6 or 8    0.1       2.0           4.0       N
2         2             6         2.0       2.0           6.0       Y
2         3 or 4        8 or 10   2.0       2.0           6.0       N

VP = virtual processors; Ent = entitlement (in processing units)


Operating System Support

In general, operating systems and applications running in shared partitions need not be aware that they are sharing processors. However, overall system performance can be significantly improved by minor operating system changes. AIX 5L Version 5.3 provides support for optimizing overall system performance of shared processor partitions.

Shared processors also have an impact on the reporting of CPU utilization, on performance monitors and capacity planning tools, and on license managers. These issues and the AIX 5L V5.3 strategy for dealing with them are discussed in the remainder of this document.

Dispatching and Interrupt Latencies

Virtual processors have dispatch latency, since they are scheduled. When a virtual processor is made runnable, it is placed on a run queue by the hypervisor, where it sits until it is dispatched. The time between these two events is referred to as dispatch latency.

The dispatch latency of a virtual processor is a function of the partition entitlement and the number of virtual processors that are online in the partition. Entitlement is equally divided among these online virtual processors, so the number of online virtual processors impacts the length of each virtual processor’s dispatch cycle. The smaller the dispatch cycle, the greater the dispatch latency.

Timers have latency issues as well. The hardware decrementer is virtualized by the hypervisor at the virtual processor level, so that timers interrupt the initiating virtual processor at the designated time. If a virtual processor is not running, then the timer interrupt has to be queued with the virtual processor, since it is delivered in the context of the running virtual processor.

External interrupts have latency issues also. External interrupts are routed directly to a partition. When the operating system makes the accept-pending-interrupt hypervisor call, the hypervisor, if necessary, dispatches a virtual processor of the target partition to process the interrupt. The hypervisor provides a mechanism for queuing up external interrupts that is also associated with virtual processors. Whenever this queuing mechanism is used, latencies are introduced.

These latency issues are not expected to cause functional problems, but they may present performance problems for real time applications. To quantify matters, the worst case virtual processor dispatch latency is 18 milliseconds, since the minimum dispatch cycle that is supported at the virtual processor level is one millisecond. This figure is based on the minimum partition entitlement of 1/10th of a physical processor and the 10 millisecond rotation period of the hypervisor’s dispatch wheel. It can be easily visualized by imagining that a virtual processor is scheduled in the first and last portions of two 10 millisecond intervals. In general, if these latencies are too great, then clients may increase entitlement, minimize the number of online virtual processors without reducing entitlement, or use dedicated processor partitions.


AIX 5L V5.3 Overview of Simultaneous Multi-threading

The simultaneous multi-threading policy is controlled by the operating system and is thus partition specific. AIX 5L V5.3 provides the smtctl command to control the simultaneous multi-threading mode of the partition. With this command, you can turn it on or off system-wide, either immediately or at the next boot. The simultaneous multi-threading mode persists across system boots and is enabled by default in AIX 5L V5.3. The syntax for the smtctl command is:

smtctl [ -m { off | on } [ { -boot | -now } ] ]

The smtctl command does not rebuild the boot image. If the user wants to change the default SMT mode of AIX 5L V5.3, the bosboot command must be used to rebuild the boot image. The boot image has been extended to include an SMT indicator that controls the default SMT mode.
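For example, based on the syntax above:

smtctl -m off -now     (disable simultaneous multi-threading immediately)
smtctl -m on -boot     (enable simultaneous multi-threading at the next system boot)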

Configuration of Hardware Threads

Each hardware thread is supported as a separate logical CPU by AIX 5L V5.3. So, a dedicated partition that is created with one physical processor is configured by AIX 5L V5.3 as a logical 2-way by default. This is independent of the partition type, so a shared partition with two virtual processors is configured by AIX 5L V5.3 as a logical 4-way by default. AIX 5L V5.3 pre-allocates a sufficient quantity of logical processors, so that it can enable simultaneous multi-threading without rebooting the system. The number of logical processors that needs to be allocated is derived by doubling the number of processors specified by the user when creating the partition. When simultaneous multi-threading is disabled, at least half of the logical processors will be offline.

Simultaneous Multi-threading Integration with Shared Processor Partitions

The hypervisor saves and restores all necessary processor state when preempting or dispatching virtual processors, which for simultaneous multi-threading-enabled processors means two active thread contexts. The result for shared processors is that two of the logical CPUs are always scheduled together in a physical sense. These sibling threads are always scheduled in the same partition. Shared processor capacity is always delivered in terms of whole physical processors. So, a 4-way virtual processor partition with 2.0 processing units of entitlement, without simultaneous multi-threading, is configured by AIX 5L V5.3 as a 4-way logical CPU partition, where each logical CPU has the power of 50% of a physical processor. With simultaneous multi-threading, it becomes an 8-way, where each logical CPU has the power of 25% of a physical processor. However, latency concerns normally associated with a virtual CPU’s fractional capacity don't apply linearly to the simultaneous multi-threading threads. Since both threads are dispatched together, they are active for the duration of a 50% dispatch window, sharing the physical CPU underneath to achieve the logical 25%. This means that each of the logical CPUs is able to field interrupts for twice as long as its individual capacity alone would allow.


AIX 5L V5.3 Exploitation of Simultaneous Multi-threading

The POWER5 processor allows priorities to be assigned to hardware threads. The difference in priority between sibling threads determines the ratio of physical processor decode slots allotted to each thread. More slots provide better thread performance. Normally, AIX 5L V5.3 maintains sibling threads at the same priority, but it will boost or lower thread priorities in a few key places to optimize performance. For example, AIX 5L V5.3 lowers thread priorities when the thread is doing non-productive work, such as spinning in the idle loop or on a kernel lock. Thread priorities are boosted when a thread is holding a critical kernel lock. These priority adjustments do not persist into user mode. AIX 5L V5.3 does not consider a software thread’s dispatching priority when choosing its hardware thread priority.

Several scheduling enhancements have also been made to exploit simultaneous multi-threading. For example, work will be distributed across all primary threads before work is dispatched to secondary threads. The reason for this enhancement is that the performance of a thread is best when its sibling thread is idle. Thread affinity is also considered in idle stealing and periodic run queue load balancing.

ISV Exploitation of Simultaneous Multi-threading

Thread priorities cannot be used by user programs.

The POWER5 processor contains performance monitor registers that may be used to analyze the performance of the processor. These registers are implemented at the thread level. In a shared processor environment, the contents of these registers are saved and restored as the virtual processor is dispatched, so that they reflect only the operations of the thread inside the partition. The use of the hardware thread by other partitions is not reflected in the counters. The Performance Monitor APIs control this feature of the processor.

Affinity scheduling is supported for simultaneous multi-threading through the Resource Set APIs. A new level is introduced in the system topology database that reflects affinity domains for hardware threads. You can use these APIs to bind your application to a pair of threads on the same physical processor, if you wish to achieve a higher degree of cache sharing.
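The following sketch shows the general shape of such a binding with the Resource Set APIs. The choice of system detail level (SDL) is an assumption: the sketch takes the level just above the deepest one to stand for the hardware-thread sibling domain, which may differ on a real system and should be verified against the topology database.

#include <stdio.h>
#include <unistd.h>
#include <sys/rset.h>

int main(void)
{
    rsethandle_t sys = rs_alloc(RS_SYSTEM);  /* resource set for the whole system */
    rsethandle_t rad = rs_alloc(RS_EMPTY);
    rsid_t id;

    /* Assumption: the deepest system detail level represents logical
       CPUs, so the level above it groups the sibling hardware threads
       of one physical processor. */
    int sdl = rs_getinfo(sys, R_MAXSDL, 0) - 1;

    if (rs_getrad(sys, rad, sdl, 0, 0) != 0) {  /* first domain at that level */
        perror("rs_getrad");
        return 1;
    }

    id.at_pid = getpid();
    if (ra_attachrset(R_PROCESS, id, rad, 0) != 0) {  /* bind this process */
        perror("ra_attachrset");
        return 1;
    }

    printf("bound to %d logical CPUs\n", rs_getinfo(rad, R_NUMPROCS, 0));
    return 0;
}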

Like shared processor technology, simultaneous multi-threading has an impact on the reporting of CPU utilization, performance monitoring and capacity planning tools, and license managers. These issues are discussed later in this document.

Which Workloads are Likely to Benefit From Simultaneous Multi-threading?

This is a very difficult question to answer, because the performance benefit of simultaneous multi-threading is workload dependent. Most measurements of commercial workloads have shown a 25% to 40% boost, and a few have shown even more. These measurements were taken in a dedicated partition. Simultaneous multi-threading is also expected to help shared processor partitions. The extra threads give the partition a boost after it is dispatched, because they enable the partition to recover its working set more quickly. Subsequently, they perform as they would in a dedicated partition. It may be somewhat non-intuitive, but simultaneous multi-threading is at its best when the performance of the cache is at its worst.


The question may also be answered with the following generalities. Any workload in which the majority of individual software threads highly utilize some resource in the processor or memory will benefit little from simultaneous multi-threading. For example, workloads that are heavily floating-point intensive are likely to gain little and are the ones most likely to lose performance, because they tend to saturate either the floating-point units or the memory bandwidth. In contrast, workloads that have a very high Cycles Per Instruction (CPI) count tend to utilize processor and memory resources poorly and usually see the greatest simultaneous multi-threading benefit. Large CPIs are usually caused by high cache miss rates from a very large working set. Large commercial workloads typically have this characteristic, although it depends somewhat on whether the two hardware threads share instructions or data or are completely distinct. Workloads that share instructions or data, which include those that spend a lot of time in the operating system or within a single application, tend to see a better simultaneous multi-threading benefit. Workloads with low CPI and low cache miss rates tend to see a benefit, but a smaller one.


New AIX 5L V5.3 Commands and APIs for Shared Processor Partitions and Simultaneous Multi-threading

This section discusses the AIX 5L V5.3 support that is provided to help solve administrative issues related to capacity pricing and licensing, capacity planning, and performance monitoring that arise from this new technology.

Capacity Pricing/Licensing

Software that licenses the execution of a software product to some number of CPUs is faced with a perplexing issue with shared processors. The issue is how to treat the license with respect to fractional CPU allocation. Should a one CPU license allow execution on four Virtual CPUs running at 25% capacity?

AIX 5L V5.3 provides the basic primitives that are needed to implement the following licensing schemes. APIs are provided that identify partition and system level configuration parameters. A mechanism including both scripts and APIs is provided for the notification of changes to partition parameters.

AIX 5L V5.3 supports the following capacity based licensing models:

License to system. This is licensing to all physical processors available within a given system. The machine could run as one or N LPARs, either of which would be acceptable to the license. To support this model, AIX 5L V5.3 reports the number of physical processors in the system.

License to shared pool. This is essentially licensing to all physical processor resources in the shared processor pool that this LPAR is defined to run in. The license covers the ability of a single shared LPAR to consume all of the resources in the shared pool. To support this model, AIX 5L V5.3 will report the number of physical processors in the shared pool. It should be noted that the size of the shared processor pool can change over time as dedicated partitions are started and stopped, so the size of the pool has to be monitored by vendors. This model provides a simple abstraction that is suitable for uncapped partitions.

License to the partition’s virtual processors. This is licensing to the potential capacity of a given partition based on the number of virtual processors defined. The partition could run as capped or uncapped, and could have capacity increases up to whole CPUs times the number of virtual processors, but could not have dynamic LPAR additions of virtual processors beyond the licensed amount. A partition cannot consume more physical processor capacity than it has virtual processors, so this mechanism provides a more cost-effective solution for uncapped partitions, where the number of virtual processors is smaller than the shared processor pool. To support this model, AIX 5L V5.3 will report the number of online virtual CPUs and the maximum number of virtual CPUs that could be brought online.

License to the entitled capacity. This would be licensing to a certain assigned CPU capacity. Typically, in these environments, CPU capacity is normalized, so that there is one metric that allows for an equitable charge relative to the power of a particular processor. For example, a license for 2.5 processors is worth a lot more on 2 GHz CPUs than on 1 GHz CPUs. By licensing to normalized processing units, the same license might entitle five processing units of 1 GHz CPUs or 2.5 processing units of 2 GHz CPUs. In this model, dynamic LPAR additions in capacity would need to be monitored to stay within the licensed capacity. The notion of uncapped partitions, which can consume more than their entitled capacity, could be dealt with by using a rolling average and soft-capping. Specifically, every four hours or so, the actual utilization is compared to the entitled capacity. If the utilization is more than the entitlement, then the partition is capped at the entitled capacity for the next four-hour window. If the next average shows that utilization is below the entitlement, the soft cap could be removed. So essentially, in this model, the entitled capacity is not limited by the license, allowing for peaks in utilization above the entitlement; rather, the average LPAR performance is limited by the license. To enable this model, AIX 5L V5.3 will report all the various shared processor configuration attributes (entitled capacity, variable capacity weight, and utilized processor resource) and support the dynamic change of the variable capacity weight to implement soft-capping.

License to physical processors. For capped partitions, the maximum entitled processor capacity of a partition could be rounded up to the nearest whole number. For uncapped partitions, one of the schemes outlined above could be used. It is also possible to combine them: the number of physical processors could be calculated as the lesser of 1) the number of virtual processors and 2) the size of the shared pool.

Processor Utilization

Processor utilization is a critical component of metering, performance monitoring, and capacity planning. With respect to POWER5 technologies, two new advances that will be commonly used, shared processor partitioning and simultaneous multi-threading, combine to make the concept of utilization much more complex. Individually, each adds complexity to this concept; together, they multiply it.

First a little background information. AIX 5L, like most UNIX® systems, measures processor utilization using a sample-based approach to approximate the percentage of processor time spent executing user programs, system code, waiting for disk I/O and idle. AIX 5L produces 100 interrupts per second to take samples. At each interrupt, a utilization category (user, system, iowait, and idle) is charged with 1/100th of a second of processor time. The utilization category is chosen based on the state of the interrupted thread.

This statistic is typically viewed from second to second through the use of performance monitoring tools to determine the current system load. This is accomplished by displaying the change in each utilization category over the interval as a percent of the total processor capacity. This statistic is most commonly used to determine idle capacity.

The problem with this algorithm is that it assumes that each logical CPU runs at the same speed, which is not necessarily true with shared processor partitions. In fact, there is a 10X variable speed rate that must be considered when calculating physical processor utilization in shared processor partitions. This is not a problem in dedicated partitions.

The variable speed rate is a result of the scheduling algorithm used by the hypervisor. When the operating system cedes idle capacity, it does so with the hope of receiving credit for the remaining portion of the virtual processor’s dispatch cycle. These cycles must be consumed in the current rotation of the dispatch wheel, or they are lost. In this way, another virtual processor in the same partition may receive extra cycles.

This opportunistic redirection of cycles is one of the strengths of this architecture, but it is inconsistent with a fixed-increment sampling algorithm. Consider a shared 4-way that has one job running. Using the algorithm above, this would be reported as 25% busy and 75% idle, but there might not be any available capacity!

With simultaneous multi-threading, the issue is even more subtle. The POWER5 processor has a hardware feature that enables threads to run at different priorities under software control. The priorities control the number of cycles in which each thread gets to start decoding instructions. There are seven levels of priority.


Level four is normal and is the level at which most code runs. Level one is very low and is designed to save power; the idle process runs at this priority. Spinning locks run at low priority, level two. The three-level difference in priority between normal and very low means that the idle process gets one out of every eight cycles to start decoding instructions. Because fewer instructions are started in the decode stage of the pipeline, the idle process executes relatively fewer instructions. In short, thread priorities compound the variable speed rate mentioned above, but the effect is not limited to shared processor partitions. It also impacts dedicated partitions.

Another implication of simultaneous multi-threading is that there are twice as many logical CPUs and therefore twice as many increments made every 10 milliseconds. This has the effect of overestimating idle time, since simultaneous multi-threading does not double performance. For example, using the algorithm above, a 2-way with one CPU-intensive job would be reported as 50% idle, which would ordinarily lead one to believe that a second instance of the job could be started without impacting the performance of the first one. This is clearly false. It is worth noting that the significance of the error diminishes as the system becomes busier.

The POWER5 processor architecture attempts to deal with these complex issues by introducing a new processor register that is intended for measuring utilization. This new register, the Processor Utilization Resource Register (PURR), is used to approximate the time that a virtual processor is actually running on a physical processor. The register advances automatically, so the operating system can always get the current, up-to-date value. The hypervisor saves and restores the register across virtual processor context switches to simulate a monotonically increasing clock at the virtual processor level.

Each hardware thread has a PURR. The hardware increments the PURRs based on how each thread is using the resources of the processor, including the dispatch cycles that are allocated to each thread. For a cycle in which no instructions are dispatched, the PURR of the thread that last dispatched an instruction is incremented. Because there are many resources in the hardware, any one of which can be a bottleneck that limits the simultaneous multi-threading gain, the use of the PURR is an approximation of the time spent running. The execution time for a virtual processor can be calculated by adding the sibling threads’ PURRs.

The PURR provides a solution to the problems mentioned above. At each interrupt, AIX 5L V5.3 calculates the elapsed PURR for the current sample period. This value is added to the appropriate utilization category, instead of the fixed-size increment (1/100th of a second) that was previously added. The PURR provides an accurate indication of the time spent running, but it doesn’t accurately describe the idle capacity of the partition, because it does not count ceded cycles. When an idle processor cedes, its PURR does not advance. This lost capacity has to be added back into the statistic. This is accomplished by post-processing the statistic in the command, since the total number of ceded cycles depends on the length of the interval, which is known only by the command. To determine the amount of lost capacity, the command needs to know the partition entitlement, which is available through the lpar_get_info() and perfstat_partition_total() system calls.
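A simplified sketch of the accounting step described above, assuming a purr_read() placeholder for reading the current thread's PURR (the actual AIX kernel code is more involved):

#include <stdint.h>

enum { CPU_USER, CPU_SYS, CPU_IOWAIT, CPU_IDLE, CPU_NCATEGORIES };

struct cpu_acct {
    uint64_t last_purr;                /* PURR value at the previous tick */
    uint64_t ticks[CPU_NCATEGORIES];   /* accumulated PURR time per category */
};

extern uint64_t purr_read(void);       /* placeholder: read this thread's PURR */

/* Called from the 100 Hz clock interrupt: charge the elapsed PURR time,
   rather than a fixed 1/100 second, to the interrupted thread's category. */
void clock_tick(struct cpu_acct *acct, int category)
{
    uint64_t now = purr_read();
    acct->ticks[category] += now - acct->last_purr;
    acct->last_purr = now;
}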

Performance Monitoring

A new command lparstat is provided for shared processor partitions.

The following information is provided and is expected to be viewed over intervals.

• Physical utilization in the traditional user, system, iowait, and idle categories
• Physical processors consumed
• Percent of partition entitlement consumed
• Available shared processor capacity in the shared pool
• Time spent in voluntary operating system calls to the hypervisor
• Number of virtual processor context switches

The important thing to note about physical utilization is that it is calculated relative to the partition entitlement. For example, a shared partition with an entitlement of 2.2 processing units that is 20% idle has 44% of the power of one physical processor available for new programs. The number of logical or virtual processors is not relevant.

Physical processors consumed and percent of partition entitlement consumed are handy, because they show consumption above and below entitlement in a way that is easily related to physical processors. The physical utilization statistic doesn’t provide this information. In an uncapped partition, these statistics can be used to determine whether a partition is getting more than its entitlement. In a capped partition, they can be used to determine if there is spare capacity.

The available shared processor capacity statistic may be used to determine how much capacity is currently available in the shared pool. The hypervisor reports the cumulative time that it could not dispatch a virtual processor, which is influenced by several factors, including the lack of demand for processor cycles, the inability of partitions to take advantage of available cycles due to an insufficient number of online virtual processors, and the use of capped partitions. Stated another way, this statistic provides a measure of the spare cycles in the platform that could be obtained and is intended for uncapped partitions. The ability of an uncapped partition to exploit these spare cycles is a function of the number of online logical CPUs.

The number of virtual processor context switches is also important in that it is one measure of hypervisor overhead. Clients may find that it is best to minimize the number of virtual processors in each partition, if there are lots of partitions. On the other hand, if a lot of virtual processors are needed to satisfy peak load conditions and the capacity requirements vary greatly over time, then it may be best to vary virtual processors offline when they are not needed. The Partition Load Manager (PLM) may be used to automate this process as a function of load.
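For example, a typical lparstat invocation follows the usual interval and count convention of the AIX monitoring commands:

lparstat 5 12     (report the statistics above every 5 seconds, 12 times)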

The following utilization-related commands have been updated.

• iostat
• mpstat
• sadc
• sar
• topas
• vmstat

These commands display physical utilization, not logical utilization.

When invoked in a shared processor partition, the commands listed above produce new statistics such as the number of physical processors consumed in the user specified interval and the percent of the partition entitlement that was consumed. These statistics are produced automatically, so capacity planning and performance analysis tools that process the output of these commands may be impacted. This compatibility risk is tolerated, because it is a necessary consequence of the new shared processor architecture.


The following trace-based tools have been updated.

• curt
• filemon
• netpmon
• pprof
• splat

On shared processor partitions, a new trace hook is generated by the trace facility for each virtual processor preemption window, so that the tools can accurately compute physical CPU time. In most cases, the changes involve the subtraction of time lost to hypervisor preemption, so that the resulting utilization corresponds to the use of physical processors. The intent is to minimize the shared processor variance, so that the same thought process can be followed for performance analysis regardless of the partition type.

For example, the curt command is used to profile the operation of the system. It shows the duration of operations like system calls and first- and second-level interrupt handlers, so the time spent not running must be subtracted from the current operation. Similarly, the filemon command shows file activity in an interval that is measured using wall time, but it also shows the CPU utilization of the interval, which must be adjusted to provide an accurate indicator of the processors' role in driving file activity. The splat command provides lock analysis, so lock hold times and spin times need to be modified to show actual CPU times, so that developers can determine the effects of their locking strategies under real-world conditions without hypervisor noise.

A new command, schedo, is being provided for tuning simultaneous multi-threading. Numerous tuning knobs are being added, but at this point there is only one that should be considered for change by system administrators: the snooze delay parameter. It specifies the amount of time to spin in the idle process before ceding to the hypervisor. This is significant, because the hypervisor silently transitions the processor into single-threaded (ST) mode when a thread cedes, eliminating the simultaneous multi-threading overhead in dedicated partitions so that the other thread may run faster. To use this parameter meaningfully, the user needs to balance the increase in processing power against the added latency of restarting the thread. If a value of -1 is specified, the wait process does not cede. If a value of 0 is specified, the wait process cedes immediately. Otherwise, the wait process spins at very low priority for the specified number of time base units before ceding. To maximize throughput, specify -1. To maximize speed, specify 0. At this time, there are no client guidelines for specifying a specific time value.
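For example, assuming the snooze delay is exposed as a schedo tunable named smt_snooze_delay (the paper does not name the tunable, so verify with schedo -a):

schedo -o smt_snooze_delay=-1     (never cede from the idle loop; maximize throughput)
schedo -o smt_snooze_delay=0      (cede immediately; maximize single-thread speed)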

Performance Metrics

The performance library (libperfstat) provides several APIs that may be used to determine physical utilization of logical processors, shared pool idle capacity, partition processor entitlements, etc.

See the routines perfstat_cpu(), perfstat_cpu_total(), and perfstat_partition_total().
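A minimal sketch of calling perfstat_partition_total(); the field names and the 1/100-of-a-processor encoding of entitled capacity are assumptions to verify against libperfstat.h:

#include <stdio.h>
#include <libperfstat.h>

int main(void)
{
    perfstat_partition_total_t p;

    if (perfstat_partition_total(NULL, &p, sizeof(p), 1) < 1) {
        perror("perfstat_partition_total");
        return 1;
    }

    /* Assumed encoding: entitled capacity in 1/100ths of a physical
       processor, matching the percentage encoding used by the dynamic
       LPAR interfaces later in this paper. */
    printf("entitled capacity: %.2f processors\n",
           p.entitled_proc_capacity / 100.0);
    printf("online logical CPUs: %d\n", (int)p.online_cpus);
    return 0;
}

Link the program with -lperfstat.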


Accounting, Workload Management and Resource Limits

Accounting, Workload Management (WLM), resource limits and thread scheduling have been updated to use physical processor utilization as opposed to the logical processor utilization. For example, the system call getrusage() reports the use of the physical CPU of the process, and the signal SIGXCPU is sent when the physical processor usage exceeds the physical processor limit set by setrlimit().

Shared Processor Partitions and Simultaneous Multi-threading Configuration Information

The following information is provided for applications and middleware that wish to become shared processor partition-aware or simultaneous multi-threading-aware.

• Minimum, desired, and maximum processor capacity
• Entitled capacity
• Variable capacity weight
• Maximum dispatch latency based on entitlement and number of online virtual CPUs
• Minimum, desired, and maximum number of virtual processors
• Number of online virtual processors
• Minimum and maximum number of logical CPUs
• Number of online logical CPUs
• Maximum number of potential physical processors in the machine, including unlicensed and hot-pluggable processors
• Number of licensed physical processors in the machine
• Number of physical processors in the shared pool
• Unallocated capacity in the pool
• Total LPAR dispatch time
• Capped or uncapped state
• Shared processor partition capable and enabled
• Simultaneous multi-threading capable and enabled
• SMP mode
• Number of threads per physical processor
• LPAR name and number
• Capacity increment

The following APIs may be used to gather this information.

#include <sys/dr.h>

int klpar_get_info(int command, void * lparinfo, size_t bufsize);

int lpar_get_info(int command, void * lparinfo, size_t bufsize);

The interfaces listed are supported on all AIX 5L V5.3 platforms. When they are invoked on SMP- or LPAR-based systems, only a subset of the data is returned. The lpar_get_info() system call is provided for applications. The klpar_get_info() kernel service is provided for kernel extensions. In addition, a few of the items, such as the SMT and shared processor capabilities and enablement, have been added to the _system_configuration structure, which is defined in the file /usr/include/sys/systemcfg.h. The intent is to enable performance-critical code to evaluate the type of partition quickly, without having to pay the cost of an expensive function or system call.
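As a hedged sketch, a shared processor-aware application might call lpar_get_info() as follows; LPAR_INFO_FORMAT2, lpar_info_format2_t, and the online_vcpus field are assumptions to check against sys/dr.h:

#include <stdio.h>
#include <string.h>
#include <sys/dr.h>

int main(void)
{
    lpar_info_format2_t info;   /* assumed buffer type for LPAR_INFO_FORMAT2 */

    memset(&info, 0, sizeof(info));
    if (lpar_get_info(LPAR_INFO_FORMAT2, &info, sizeof(info)) != 0) {
        perror("lpar_get_info");
        return 1;
    }

    printf("online virtual CPUs: %d\n", (int)info.online_vcpus);  /* assumed field */
    return 0;
}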

The following changes have been made to the ODM:

• New shared processor attributes. The system object (sys0) has been updated to include the following attributes: the partition mode (dedicated or shared), capped vs. uncapped state, minimum and maximum capacity, partition entitlement, and variable capacity weight.

• New simultaneous multi-threading attributes. Each processor object has been updated to indicate whether the processor is running in simultaneous multi-threading or single-threaded (ST) mode.

• Expanded processor definition. The representation of processors has been expanded to include virtual processors. In shared partitions, virtual processors are represented. In dedicated partitions, physical processors are represented. Virtual processors do not have location codes.

Dynamic Logical Partitioning

The following partition attributes may be changed by dynamic LPAR procedures at the HMC.

• Entitled processor capacity
• The number of online virtual processors
• Variable capacity weight

One of the advantages of the shared processor architecture is that processor capacity can be changed without impacting applications or middleware. This is accomplished by modifying the entitled capacity or the variable capacity weight of the partition. However, the ability of the partition to utilize this extra capacity is restricted by the number of online logical processors, so the user may have to increase this number in some cases to take advantage of the extra capacity.

AIX 5L V5.3 automatically translates virtual processor requests into the appropriate number of logical CPU requests, a mechanism already supported in a previous release, so there is no new impact to applications and middleware. When simultaneous multi-threading is enabled, each virtual processor request is translated into two logical CPU requests.

The variable capacity weight parameter applies to uncapped partitions. It controls the ability of the partition to receive cycles beyond its entitlement, which depends on there being unutilized capacity at the platform level. The client may want to modify this parameter if a partition is getting too much processing capacity or not enough.

Dynamic memory addition and removal is also supported. The only change in this area is that the size of the logical memory block (LMB) has been reduced from 256 MB to 16 MB to allow for thinner partitions. There is no impact associated with this change. The new LMB size applies to dedicated partitions as well. The size of the LMB can be set at the service console.

Notification of changes to these parameters will be provided so that applications such as license managers, performance analysis tools, and high-level schedulers can monitor and control the allocation and use of system resources in shared processor partitions. This may be accomplished through scripts, APIs, or kernel services.

Dynamic LPAR scripts

A dynamic LPAR script is composed of the following commands:

• Scriptinfo
• Register
• Usage <resource=value>
• Checkrelease <resource=value>
• Prerelease <resource=value>
• Postrelease <resource=value>
• Undoprerelease <resource=value>
• Checkacquire <resource=value>
• Preacquire <resource=value>
• Undopreacquire <resource=value>
• Postacquire <resource=value>

When a dynamic LPAR script is being installed, the drmgr issues the register command of the dynamic LPAR script to retrieve the list of resources that the script is designed to support. This helps the drmgr to execute only the relevant scripts in a dynamic LPAR request.

The following resource types have been added to support shared partitions:

resource=var_weight: changes to the variable capacity weight

resource=capacity: changes to the entitled processor capacity

Note that virtual processor changes are supported under the pre-existing “cpu” type.

Architecturally, a dynamic LPAR script may register for more than one resource type, so the resource parameter is also supplied as a command argument enabling the script to identify the type of resource that is being reconfigured and to apply the proper set of environment variables to identify the particulars associated with that resource type.

The following environment variables are provided for entitled capacity changes.

DR_CPU_CAPACITY=<decimal value>

DR_CPU_CAPACITY_DELTA=<decimal value>

Capacity is not expressed as a fraction in the above parameters; it is expressed as a percentage, where 100 represents one physical processor and 180 represents the power of 1.8 processors.

The following environment variables are provided for variable capacity weight changes.


DR_VAR_WEIGHT=<decimal value>

DR_VAR_WEIGHT_DELTA=<decimal value>

The environment variables DR_CPU_CAPACITY and DR_VAR_WEIGHT represent the value of the partition attribute before the request was made, so the script will have to internally add or subtract the delta to determine the result of the request.

The interface presented to dynamic LPAR-aware processor scripts (and applications) is logical CPU based, so a dynamic LPAR request to add or remove a virtual processor must be translated into two “cpu” requests when simultaneous multi-threading is enabled. This translation is driven by the drmgr and is transparent to applications and middleware.

Dynamic LPAR-aware Applications

The following interface is used by dynamic LPAR-aware applications to determine the nature of dynamic LPAR requests and, in selected cases, to fail those requests. Applications are notified through the SIGRECONFIG signal, which is generated three times in the course of a dynamic LPAR event to trigger check, pre and post phase processing.

#include <sys/dr.h>

int dr_reconfig(int flags, dr_info_t *info);

The following fields have been added to the dr_info_t structure:

unsigned int ent_cap : 1;          // entitled capacity change request
unsigned int var_wgt : 1;          // variable weight change request
unsigned int splpar_capable : 1;   // partition is Shared Processor Partition capable
unsigned int splpar_shared : 1;    // shared partition (1), dedicated (0)
unsigned int splpar_capped : 1;    // shared partition is capped
unsigned int cap_constrained : 1;  // capacity is constrained by PHYP
uint64_t capacity;                 // the current entitled capacity or variable
                                   // capacity weight value, depending on the
                                   // bit fields ent_cap and var_wgt
int delta_cap;                     // delta entitled capacity or variable
                                   // capacity weight that is to be added
                                   // or removed


In the structure above, capacity is expressed as a percentage, so 100 represents one physical processor and 180 represents the capacity of 1.8 processors.
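
Putting these pieces together, a dynamic LPAR-aware application might install a SIGRECONFIG handler along the following lines. This is a sketch; the DR_QUERY flag is taken from the AIX documentation for dr_reconfig and should be verified against <sys/dr.h>.

#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/dr.h>

static void reconfig_handler(int sig)
{
    dr_info_t info;

    /* Query the details of the dynamic LPAR event that raised SIGRECONFIG.
     * printf is used for brevity although it is not async-signal-safe. */
    if (dr_reconfig(DR_QUERY, &info) != 0)
        return;

    if (info.ent_cap) {
        /* capacity is a percentage: 100 = one physical processor */
        printf("entitled capacity: %llu%%, delta: %d%%\n",
               (unsigned long long)info.capacity, info.delta_cap);
    } else if (info.var_wgt) {
        printf("variable capacity weight: %llu, delta: %d\n",
               (unsigned long long)info.capacity, info.delta_cap);
    }
}

int main(void)
{
    /* The handler runs once for each of the check, pre and post phases. */
    signal(SIGRECONFIG, reconfig_handler);
    for (;;)
        pause();  /* application work would go here */
    return 0;
}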

Dynamic LPAR-aware Kernel Extensions

The following kernel services are used to register and unregister reconfiguration handlers, which are invoked by the kernel before and after dynamic LPAR operations depending on the set of events specified by the kernel extension when registering.

#include <sys/dr.h>

int reconfig_register(handler, actions, arg, token, name);

int reconfig_unregister(token);

The following events have been added to support shared processors.

• Capacity addition and removal
• Virtual processor addition and removal (supported via the pre-existing CPU add and remove events)

Note that kernel extensions are not notified of variable capacity weight changes, since they cannot take advantage of this information. Variable capacity weight is provided solely for administrative purposes, so notification is provided only in the application environment.
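
A kernel extension might use these services roughly as follows. This is a sketch under stated assumptions: the handler prototype, the token type and the action mask names are illustrative placeholders, not definitions from <sys/dr.h>.

#include <sys/types.h>
#include <sys/dr.h>

/* Hypothetical action mask values for the new capacity events; the real
 * definitions live in <sys/dr.h> and may differ. */
#define EX_DR_CAP_ADD     0x1
#define EX_DR_CAP_REMOVE  0x2

/* Illustrative handler; the actual signature expected by
 * reconfig_register() should be taken from <sys/dr.h>. */
static int ex_dr_handler(void *event, void *arg)
{
    /* Quiesce or resume activity as appropriate for the event phase;
     * return 0 to allow the operation to proceed. */
    return 0;
}

static long ex_dr_token;

int ex_init(void)
{
    /* Register for capacity addition and removal notifications. */
    return reconfig_register(ex_dr_handler,
                             EX_DR_CAP_ADD | EX_DR_CAP_REMOVE,
                             NULL, &ex_dr_token, "ex_extension");
}

int ex_term(void)
{
    return reconfig_unregister(ex_dr_token);
}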


Enhanced Dynamic Processor Deallocation and Dynamic Processor Sparing

Dynamic Processor Deallocation enables defective processors to be taken offline automatically, before they fail. This is visible to applications, since the number of online logical processors is decremented. An application that is attached to the defective processor can prevent the operation from being performed, so Dynamic Processor Deallocation may fail to remove the defective processor in some cases.

Dynamic Processor Sparing transparently replaces defective processors with spare processors. It is transparent to applications because spare processors are not in use by the system. The spare processor assumes the identity of the defective processor. Dynamic Processor Sparing depends on the presence of spare processors. A system has spare processors if it was shipped with extra processors that the customer has not purchased; these processors may be activated using Capacity on Demand procedures.

Both of these Reliability, Availability and Serviceability (RAS) features are enhanced by shared processor technology. Enhanced processor virtualization enables the hypervisor to implement Dynamic Processor Sparing in a manner that is completely transparent to the operating system. In effect, processor sparing becomes purely a hardware/firmware technology, which can be applied to any partition, including Linux partitions, for the first time. On the other hand, Dynamic Processor Deallocation is still implemented jointly between the operating system and firmware, although shared processor technology represents a significant advance in that it enables capacity, rather than logical CPUs, to be removed. This means it is more transparent to applications and middleware and can be applied to partitions with one logical CPU. Previously, it could only be applied if there were two or more logical processors.


High Performance Computing and Dedicated Partitions

In a shared partition, there is not a fixed relationship between a virtual processor and the physical processor that actualizes it. Each virtual processor has the concept of a home physical processor, and the hypervisor will try to dispatch the virtual processor on a physical processor with the same memory affinity, but this is not guaranteed. If the hypervisor cannot find a physical processor with the same memory affinity, it gradually broadens its search to include processors with weaker memory affinity until it finds one that it can use. As a consequence, memory affinity is expected to be weaker in shared processor partitions.

Workload variability is also expected to increase in shared partitions, because there are latencies associated with the scheduling of virtual processors and interrupts. Simultaneous multi-threading may also increase variability, since it adds another level of resource sharing, which can lead to situations where one hardware thread interferes with the forward progress of its sibling.

Therefore, if an application is cache-sensitive or cannot tolerate variability, it should be deployed in a dedicated partition with simultaneous multi-threading disabled. In dedicated partitions, whole processors are assigned to the partition; they are not shared with other partitions and are not scheduled by the hypervisor. Dedicated partitions must be explicitly created by the system administrator using the Hardware Management Console.

Processor and memory affinity data is provided only in dedicated partitions. In a shared processor partition, all processors are considered to have the same affinity. Affinity information is provided through the RSET APIs, which contain discovery and bind services.
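
As a brief illustration, a dedicated-partition application can discover topology with the RSET discovery services before binding. The following sketch uses rs_alloc() and rs_getinfo(), which are existing AIX services, though the specific query constants shown should be verified against <sys/rset.h>:

#include <stdio.h>
#include <sys/rset.h>

int main(void)
{
    /* Obtain a resource set handle covering all system resources. */
    rsethandle_t rset = rs_alloc(RS_ALL);

    /* Query the number of available processors and the deepest system
     * detail (affinity) level. */
    int nprocs = rs_getinfo(rset, R_NUMPROCS, 0);
    int maxsdl = rs_getinfo(rset, R_MAXSDL, 0);

    printf("processors: %d, max system detail level: %d\n", nprocs, maxsdl);

    rs_free(rset);
    return 0;
}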


© IBM Corporation 2004
IBM Corporation
Integrated Marketing Communications
Systems and Technology Group
Route 100
Somers, New York 10589

Produced in the United States of America
July 2004
All Rights Reserved

This document was developed for products and/or services offered in the United States. IBM may not offer the products, features, or services discussed in this document in other countries. The information may be subject to change without notice. Consult your local IBM business contact for information on the products, features and services available in your area.

All statements regarding future directions and intent of IBM are subject to change or withdrawal without notice and represent goals and objectives only.

IBM, the IBM logo, the e-business logo, eServer, AIX 5L, Micro-Partitioning, POWER4, POWER5, POWER Hypervisor and pSeries are trademarks or registered trademarks of International Business Machines Corporation in the United States or other countries or both. A full list of U.S. trademarks owned by IBM may be found at http://www.ibm.com/legal/copytrade.shtml.

UNIX is a registered trademark of The Open Group in the United States, other countries or both.

Linux is a registered trademark of Linus Torvalds in the United States, other countries or both.

Intel is a registered trademark of Intel Corporation in the United States and/or other countries.

Other company, product, and service names may be trademarks or service marks of others.

IBM hardware products are manufactured from new parts, or new and used parts. Regardless, our warranty terms apply.

This equipment is subject to FCC rules. It will comply with the appropriate FCC rules before final delivery to the buyer.

Information concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of the non-IBM products should be addressed with the suppliers.

The IBM home page on the Internet can be found at http://www.ibm.com.

The pSeries home page on the Internet can be found at http://www.ibm.com/servers/eserver/pseries.