200601031 Solaris Physical Memory Management


    Solaris 10 Physical Memory Management

Physical memory is managed globally in Solaris via a central free pool, with a system daemon managing the use of physical memory.

    Physical Memory Allocation

Solaris uses the system's RAM as a central pool of physical memory for the different consumers within the system. Physical memory is distributed through the central pool at allocation time and returned to the pool when it is no longer needed. A system daemon (the page scanner) proactively manages memory allocation when there is a system-wide shortage of memory.


The Allocation Cycle of Physical Memory

The most significant part of the central pool of physical memory is the freelist. Physical memory is placed on the freelist in page-size chunks when the system is first booted, as shown in the figure above.

    Anonymous/process allocations

Anonymous memory, the most common form of allocation from the freelist, is used for most of a process's memory allocation, including heap and stack. Anonymous memory also fulfills shared memory mapping allocations. A small amount of anonymous memory is also used in the kernel for items such as thread stacks. Anonymous memory is pageable and is returned to the freelist when it is unmapped or when it is stolen by the page scanner daemon.

File System Page Cache

The page cache is used for caching file data for file systems other than ZFS. The file system page cache grows on demand to consume available physical memory as a file cache and caches file data in page-size chunks. The pages then reside in one of three places: the segmap cache, a process's address space to which they are mapped, or the cachelist.

The cachelist is the heart of the page cache. All unmapped file pages reside on the cachelist. Segmap is a cache that holds file data read and written through the read and write system calls. Memory is allocated from the freelist to satisfy a read of a new file page, which then resides in the segmap file cache. File pages are eventually moved from the segmap cache to the cachelist to make room for more pages in the segmap cache. Segmap can be thought of as the fast first-level file system read/write cache.

The cachelist operates as part of the freelist. When the freelist is depleted, allocations are made from the oldest pages on the cachelist. This allows the file system cache to grow to consume all available memory and to shrink dynamically as memory is required for other purposes.
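The freelist/cachelist interplay described above can be sketched as a small simulation; all structures and names here are illustrative, not the kernel's actual data structures:

```python
from collections import deque

class PagePool:
    def __init__(self, npages):
        self.freelist = deque(range(npages))   # truly free pages
        self.cachelist = deque()               # unmapped but still-valid file pages, oldest first

    def free_file_page(self, pfn):
        # An unmapped file page keeps its contents and goes on the cachelist.
        self.cachelist.append(pfn)

    def allocate(self):
        # Prefer the freelist; when it is depleted, take the oldest
        # cachelist page and discard its cached contents.
        if self.freelist:
            return self.freelist.popleft()
        if self.cachelist:
            return self.cachelist.popleft()
        raise MemoryError("no pages available")

pool = PagePool(4)
pages = [pool.allocate() for _ in range(4)]   # drains the freelist
pool.free_file_page(pages[0])                 # page keeps its file identity on the cachelist
print(pool.allocate())                        # 0: reclaimed from the cachelist
```

The key point the sketch captures is that the cachelist is simply the tail of the freelist: a cached file page is "free" from the allocator's point of view, but its contents survive until the freelist runs dry.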

    Kernel allocations

The kernel uses memory to manage information about internal system state; for example, memory used to hold the list of processes in the system. The kernel allocates memory from the freelist for these purposes with its own allocators, vmem and slab, and the memory allocated is mostly nonpageable. However, unlike process and file allocations, the kernel seldom returns memory to the freelist; memory is instead allocated and freed between kernel subsystems and the kernel allocators. Memory is consumed from the freelist only when the total kernel allocation grows, and the kernel's allocators return memory to the system freelist proactively when a global memory shortage occurs.


Pages: The Basic Unit of Solaris Memory

Pages are the fundamental unit of physical memory in the Solaris memory management subsystem. Physical memory is divided into pages. Every active (not free) page in the Solaris kernel is a mapping between a file (vnode) and memory; the page can be identified with a vnode pointer and the page-size offset within that vnode. A page's identity is its vnode/offset pair. The page structure and associated lists are shown below:

The hardware address translation (HAT) and address space layers manage the mapping between a physical page and its virtual address space. The key property of the vnode/offset pair is reusability; that is, we can reuse each physical page for another task by simply synchronizing its contents in RAM with its backing store (the vnode and the offset) before the page is used.

    The Page Hash List

The VM system hashes pages with identity (a valid vnode/offset pair) onto a global hash list so that they can be located by vnode and offset. Three page functions search the global hash list: page_find(), page_lookup(), and page_lookup_nowait(). The global hash list is an array of pointers to linked lists of pages. The functions use a hash to index into the page_hash array to locate the list of pages that contains the page with the matching vnode/offset pair. The following figure shows how the page_find() function indexes into the page_hash array to locate a page matching a given vnode/offset.


It calculates the slot in the page_hash array containing a list of potential pages by using the PAGE_HASH_FUNC macro, shown below:

It uses the PAGE_HASH_SEARCH macro, shown below, to search the list referenced by the slot for a page matching vnode/offset. The macro traverses the linked list of pages until it finds such a page.
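A simplified model of this lookup: the bucket index is computed from the vnode/offset pair and the bucket's list is then walked, in the spirit of PAGE_HASH_FUNC and PAGE_HASH_SEARCH. The array size, page shift, and hash mixing step here are assumptions for illustration, not the kernel's actual constants:

```python
PAGE_HASH_SIZE = 64               # assumed bucket count (the real size is tunable)
PAGESHIFT = 13                    # 8-Kbyte pages

page_hash = [[] for _ in range(PAGE_HASH_SIZE)]

def page_hash_func(vnode, offset):
    # Mix the vnode identity with the page-aligned offset, then mask
    # into the bucket array (a stand-in for PAGE_HASH_FUNC).
    return (hash(vnode) + (offset >> PAGESHIFT)) % PAGE_HASH_SIZE

def page_add(vnode, offset, page):
    page_hash[page_hash_func(vnode, offset)].append((vnode, offset, page))

def page_find(vnode, offset):
    # Walk the bucket's list until a page with matching identity is
    # found (a stand-in for PAGE_HASH_SEARCH).
    for v, o, page in page_hash[page_hash_func(vnode, offset)]:
        if v == vnode and o == offset:
            return page
    return None

page_add("vn1", 0, "page-A")
page_add("vn1", 8192, "page-B")
print(page_find("vn1", 8192))     # page-B
print(page_find("vn2", 0))        # None
```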

Free List and Cache List

The free list and cache list hold pages that are not mapped into any address space and that have been freed by page_free(). The sum of these pages is reported in the free column in vmstat. Even though vmstat reports these pages as free, they can still contain a valid page from a vnode/offset pair and hence are still part of the global page cache. Memory on the cache list is not really free; it is a valid cache of a page from a file. However, pages will be moved from the cache list to the free list and their contents discarded if the free list becomes exhausted.

The free list contains pages that no longer have a vnode and offset associated with them, which can only occur if the page has been destroyed and removed from a vnode's hash list. The cache list is a hashed list of pages that still have mappings to a valid vnode and offset. Pages can be obtained from the cache list by the page_lookup() routine. This function accepts a vnode and offset as arguments and returns a page structure. If the page is found on the cache list, then the page is removed from the cache list and returned to the caller. When we find and remove pages from the cache list, we are reclaiming a page. Page reclaims are reported by vmstat in the re column.


    Physical Page memseg Lists

    The Solaris kernel uses a segmented global physical page list, consisting of segments of

    contiguous physical memory. (Many hardware platforms now present memory in noncontiguous

    groups.) Contiguous physical memory segments are added during system boot. They are also

added and deleted dynamically when physical memory is added and removed while the system is running. The following figure shows the arrangement of the physical page lists into contiguous

    segments.

    The Page-Level Interfaces

    The Solaris 10 virtual memory system implementation has grouped page management and

    manipulation into a central group of functions. These functions are used by the segment drivers

    and file systems to create, delete and modify pages. The following are some of the page-level

    interfaces:


    The Page Throttle

Solaris implements a page creation throttle so that a small core of memory remains available for consumption by critical parts of the kernel. The page throttle, implemented in the page_create() and page_create_va() functions, causes page creates to block when the PG_WAIT flag is specified and available memory is less than the system global parameter throttlefree. By default, throttlefree is set to the same value as the system global parameter minfree. By default, memory allocated through the kernel memory allocator specifies PG_WAIT and is subject to the page creation throttle.
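A hedged sketch of the throttle check: the threshold relationship follows the text, while the blocking behavior is modeled by a boolean return (True = allocation proceeds, False = caller would block). The parameter values are illustrative, not system defaults:

```python
minfree = 128
throttlefree = minfree            # default: throttlefree == minfree

def page_create_throttled(freemem, npages, pg_wait):
    # With PG_WAIT set, a request that would push available memory below
    # throttlefree blocks until memory becomes available.
    if pg_wait and freemem - npages < throttlefree:
        return False              # would block
    return True

print(page_create_throttled(freemem=200, npages=16, pg_wait=True))   # True
print(page_create_throttled(freemem=130, npages=16, pg_wait=True))   # False
```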


    Page Coloring

Some interesting effects result from the organization of pages within the processor caches, and as a result, the page placement policy within these caches can dramatically affect processor performance. When pages overlay other pages in the cache, they can displace cache data that we might not want overlaid, resulting in lower cache utilization and hot spots.

    The optimal placement of pages in the cache often depends on the memory access patterns

    of the application; that is, is the application accessing memory in a random order, or is it doing

    some sort of strided ordered access? Several different algorithms can be selected in the Solaris

    kernel to implement page placement; the default attempts to provide the best overall

    performance.

To understand how page placement can affect performance, let's look at the cache configuration and see when page overlaying and displacement can occur. The UltraSPARC-I and -II implementations use virtually addressed L1 caches and physically addressed L2 caches. The L2 cache is arranged in lines of 64 bytes, and transfers are done to and from physical memory in 64-byte units. The L1 cache is 16 Kbytes, and the L2 (external) cache can vary between 512 Kbytes and 8 Mbytes. We can query the operating system with adb to see the size of the caches reported to the operating system. The L1 cache size is recorded in the vac_size parameter, and the L2 cache size is recorded in the ecache_size parameter.

We'll start by using the L2 cache as an example of how page placement can affect performance. The physical addressing of the L2 cache means that the cache is organized in page-sized multiples of the physical address space, which means that the cache effectively has only a limited number of page-aligned slots. The number of effective page slots in the cache is the cache size divided by the page size. To simplify our examples, let's assume we have a 32-Kbyte L2 cache (much smaller than reality), which means that if we have a page size of 8 Kbytes, there are four page-sized slots in the L2 cache. The cache does not necessarily read and write 8-Kbyte units from memory; it does that in 64-byte chunks, so in reality our 32-Kbyte cache has 512 addressable slots. The following figure shows how our cache would look if we laid it out linearly:
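The slot arithmetic in this example, under the same assumptions (a 32-Kbyte direct-mapped cache, 8-Kbyte pages, 64-byte lines):

```python
cache_size = 32 * 1024   # assumed 32-Kbyte L2 cache
page_size = 8 * 1024     # 8-Kbyte pages
line_size = 64           # 64-byte cache lines

page_slots = cache_size // page_size   # page-aligned slots in the cache
line_slots = cache_size // line_size   # individually addressable 64-byte lines

print(page_slots, line_slots)          # 4 512
```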


The L2 cache is direct-mapped from physical memory. If we were to access physical addresses on a 32-Kbyte boundary, for example, offsets 0 and 32768, then both memory locations would map to the same cache line. If we were now to alternate between these two addresses, we would cause the cache line for the offset 0 address to be read in, then flushed (cleared), the cache line for the offset 32768 address to be read in and then flushed, then the first reloaded, and so on. This ping-pong effect in the cache is known as cache flushing (or cache ping-ponging), and it effectively reduces our performance to that of real-memory speed, rather than cache speed. By accessing memory on our 32-Kbyte cache-size boundary, we have effectively used only 64 bytes of the cache (one cache line), rather than the full cache size. Memory is often 10 to 20 times slower than cache, so this can have a dramatic effect on performance.
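The ping-pong effect follows directly from the direct-mapped index calculation, which can be sketched as:

```python
cache_size = 32 * 1024   # assumed 32-Kbyte direct-mapped cache
line_size = 64           # 64-byte lines

def cache_line_index(paddr):
    # In a direct-mapped cache, the line is chosen by the physical
    # address modulo the cache size, divided by the line size.
    return (paddr % cache_size) // line_size

print(cache_line_index(0))        # 0
print(cache_line_index(32768))    # 0 -> same line: alternating accesses ping-pong
print(cache_line_index(64))       # 1
```

Two addresses exactly one cache-size apart always collide on the same line, so alternating between them evicts the other's data on every access.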

Our simple example was based on the assumption that we were accessing physical memory in a regular pattern, but we don't program to physical memory; rather, we program to virtual memory. Therefore, the operating system must provide a sensible mapping between virtual memory and physical memory; otherwise, effects such as those in our example can occur.

By default, physical pages are assigned to an address space in the order in which they appear in the free list. In general, the first time a machine boots, the free list may have physical memory in a linear order, and we may end up with the behavior described in our ping-pong example. Once a machine has been running, the physical page free list becomes randomly ordered, and subsequent reruns of an identical application could get very different physical page placement and, as a result, very different performance. On early Solaris implementations, this is exactly what customers saw: differing performance for identical runs, by as much as 30 percent.

To provide better and more consistent performance, the Solaris kernel uses a page coloring algorithm when pages are allocated to a virtual address space. Rather than being randomly allocated, the pages are allocated with a specific predetermined relationship between the virtual address to which they are being mapped and their underlying physical address. The virtual-to-physical relationship is predetermined as follows: the free list of physical pages is organized into specifically colored bins, one color bin for each slot in the physical cache; the number of color bins is determined by the ecache size divided by the page size. (In our example, there would be exactly four colored bins.)

When a page is put on the free list, the page_free() algorithms assign it to a color bin. When a page is consumed from the free list, the virtual-to-physical algorithm takes the page from a physical color bin chosen as a function of the virtual address to which the page will be mapped. The algorithm requires that when allocating pages from the free list, the page create function must know the virtual address to which a page will be mapped.

New pages are allocated by calling the page_create_va() function. The page_create_va() function accepts as an argument the virtual address of the location to which the page is going to be mapped; the virtual-to-physical color bin algorithm can then decide which color bin to take physical pages from.
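The coloring scheme can be sketched as follows. The bin count matches the example above; the bin-selection function shown (virtual page number modulo the number of bins) is the simplest possible choice for illustration, not the kernel's default hashing algorithm:

```python
ecache_size = 32 * 1024
page_size = 8 * 1024
nbins = ecache_size // page_size          # 4 color bins in our example

# Physical pages sorted into color bins by physical address.
bins = [[] for _ in range(nbins)]
for pfn in range(16):
    bins[pfn % nbins].append(pfn)

def page_create_va(vaddr):
    # Choose the color bin as a function of the virtual address, so the
    # physical page's cache slot matches the virtual page's slot.
    color = (vaddr // page_size) % nbins
    return bins[color].pop(0)

pfn = page_create_va(0x10000)             # virtual page 8 -> color 0
print(pfn % nbins)                        # 0: physical color matches virtual color
```

The invariant the sketch demonstrates is the point of page coloring: pages that are contiguous in virtual memory land in distinct cache slots, regardless of the order in which physical pages were freed.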

No one algorithm suits all applications because different applications have different memory access patterns. Over time, the page coloring algorithms used in the Solaris kernel have been refined as a result of extensive simulation, benchmarks, and customer feedback. The kernel supports a default algorithm and two optional algorithms. The default algorithm was chosen according to the following criteria:

- Fairly consistent, repeatable results
- Good overall performance for the majority of applications
- Acceptable performance across a wide range of applications

The default algorithm uses a hashing algorithm to distribute pages as evenly as possible throughout the cache. The default and the two other available page coloring algorithms are shown here:


You can change the default algorithm by setting the system parameter consistent_coloring, either on the fly with adb or permanently in /etc/system.

So, which algorithm is best? Well, your mileage will vary, depending on your application. Page coloring usually makes a difference only for memory-intensive scientific applications, and the defaults are usually fine for commercial or database systems. If you have a time-critical scientific application, then we recommend that you experiment with the different algorithms and see which is best. Remember that some algorithms will produce different results for each run, so aggregate as many runs as possible.

    The Page Scanner

The page scanner is the memory management daemon that manages system-wide physical memory. The page scanner and the virtual memory page fault mechanism are the core of the demand-paged memory allocation system used to manage Solaris memory. When there is a memory shortage, the page scanner runs to steal memory from address spaces by taking pages that haven't been used recently, syncing them up with their backing store (swap space if they are anonymous pages), and freeing them. If paged-out virtual memory is required again by an address space, then a memory page fault occurs when the virtual address is referenced, and the pages are re-created and copied back in from their backing store.

The balancing of page stealing and page faults determines which parts of virtual memory will be backed by real physical memory and which will be moved out to swap. The page scanner does not understand the memory usage patterns or working sets of processes; it only knows reference information on a physical page-by-page basis. This policy is often referred to as global page replacement; the alternative, process-based page management, is known as local page replacement. The subtleties of which pages are stolen govern the memory allocation policies and can affect different workloads in different ways. During the life of the Solaris kernel, only two significant changes in memory replacement policies have occurred:

- Enhancements to minimize page stealing from extensively shared libraries and executables
- Priority paging to prevent application, shared library, and executable paging on systems with ample memory


    Page Scanner Implementation

The page scanner is implemented as two kernel threads, both of which run in the pageout process. One thread scans pages, and the other thread pushes the dirty pages queued for I/O to the swap device. In addition, the kernel callout mechanism wakes the page scanner thread when memory is insufficient.

The scanner's schedpaging() function is called four times per second by a callout placed in the callout table. The schedpaging() function checks whether free memory is below the threshold (lotsfree or cachefree) and, if required, prepares to trigger the scanner thread. The page scanner is not only awakened by the callout thread; it is also triggered by the clock() thread if memory falls below minfree, or by the page allocator if memory falls below throttlefree.
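The wakeup conditions above can be summarized in a small sketch; the threshold values are illustrative, not system defaults:

```python
# Assumed ordering: lotsfree > minfree >= throttlefree.
lotsfree, minfree, throttlefree = 512, 128, 128

def schedpaging_should_scan(freemem):
    # Called four times per second from the callout table.
    return freemem < lotsfree

def clock_should_wake_scanner(freemem):
    # The clock() thread also wakes the scanner below minfree.
    return freemem < minfree

print(schedpaging_should_scan(600))   # False: plenty of memory
print(schedpaging_should_scan(400))   # True: below lotsfree, start scanning
```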

    This illustrates how the page scanner works:

    Page Scanner Architecture

When called, the schedpaging() routine calculates two setup parameters for the page scanner thread: the number of pages to scan and the number of CPU ticks that the scanner thread can consume while doing so. The number of pages and CPU ticks are calculated according to the equations shown in Scan Rate Parameters (assuming no priority paging). Once the scanning parameters have been calculated, schedpaging() triggers the page scanner through a condition variable wakeup.
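The scan-rate setup can be modeled as a linear interpolation between slowscan (when free memory equals lotsfree) and fastscan (when free memory reaches zero), which is the shape of the Scan Rate Parameters equations referenced above; the parameter values here are made up:

```python
lotsfree = 512    # illustrative thresholds, in pages
slowscan = 100    # pages/second at the lotsfree boundary
fastscan = 8192   # pages/second when free memory hits zero

def scan_rate(freemem):
    if freemem >= lotsfree:
        return 0                                   # no scanning needed
    # Linear interpolation between slowscan and fastscan.
    return int((fastscan * (lotsfree - freemem) +
                slowscan * freemem) / lotsfree)

print(scan_rate(512))   # 0
print(scan_rate(0))     # 8192
print(scan_rate(256))   # 4146: midway between slowscan and fastscan
```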

The page scanner thread cycles through the physical page list, progressing by the number of pages requested each time it is woken up. The front hand and the back hand each have a page pointer. The front hand is incremented first so that it can clear the referenced and modified bits for the page it currently points to. The back hand is then incremented, and the status of the page pointed to by the back hand is checked by the check_page() function. At this point, if the page has been modified, it is placed on the dirty page queue for processing by the page-out thread. If the page was not referenced (it's clean!), then it is simply freed.
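A minimal model of the two-handed clock described above. It simplifies freely (freed pages are not removed from the list, and check_page() is inlined), and all structures are invented:

```python
class ClockPage:
    def __init__(self):
        self.referenced = True
        self.modified = False

def scan(pages, handspread, nticks):
    freed, dirty_queue = [], []
    n = len(pages)
    for tick in range(nticks):
        front = tick % n
        back = (tick - handspread) % n
        pages[front].referenced = False           # front hand clears the bit
        if tick >= handspread:                    # back hand follows behind
            p = pages[back]
            if not p.referenced:                  # not re-referenced in between
                if p.modified:
                    dirty_queue.append(back)      # queue for the page-out thread
                else:
                    freed.append(back)            # clean and unreferenced: free it
    return freed, dirty_queue

pages = [ClockPage() for _ in range(8)]
pages[2].modified = True
freed, dirty = scan(pages, handspread=4, nticks=12)
print(freed, dirty)                               # [0, 1, 3, 4, 5, 6, 7] [2]
```

The distance between the hands (handspread) sets how long a page has to prove it is still in use: a page re-referenced after the front hand clears its bit but before the back hand arrives survives the pass.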

Dirty pages are placed onto a queue so that a separate thread, the page-out thread, can write them out to their backing store. A separate thread is used so that a deadlock can't occur while the system is waiting to swap a page out. The page-out thread uses a preinitialized list of async buffer headers as the queue for I/O requests. The list is initialized with 256 entries, which means the queue can contain at most 256 entries. The number of entries preconfigured on the list is controlled by the async_request_size system parameter. Requests to queue more I/Os will block if the entire queue is full (256 entries) or if the rate of pages queued has exceeded the system maximum set by the maxpgio parameter.

The page-out thread simply removes I/O entries from the queue and initiates I/O on them by calling the vnode putpage() function for the page in question. In the Solaris kernel, this function calls the swapfs_putpage() function to initiate the swap page-out via the swapfs layer.

The swapfs layer delays and gathers pages together (16 pages on sun4u), then writes them out together. The klustsize parameter controls the number of pages that swapfs will cluster; the defaults are shown in the table below.

    The Memory Scheduler

In addition to the page-out process, the CPU scheduler/dispatcher can swap out entire processes to conserve memory. This operation is separate from page-out. Swapping out a process involves removing all of a process's thread structures and private pages from memory, and setting flags in the process table to indicate that this process has been swapped out. This is an inexpensive way to conserve memory, but it dramatically affects a process's performance and hence is used only when paging fails to consistently free enough memory.

The memory scheduler is launched at boot time and does nothing unless memory is consistently less than desfree (on a 30-second average). At this point, the memory scheduler starts looking for processes that it can completely swap out. The memory scheduler will soft-swap out processes if the shortage is minimal, or hard-swap out processes in the case of a larger memory shortage.

    Soft Swapping

Soft swapping takes place when the 30-second average for free memory is below desfree. The memory scheduler then looks for processes that have been inactive for at least maxslp seconds. When the memory scheduler finds a process that has been sleeping for maxslp seconds, it swaps out the thread structures for each thread, then pages out all of the private pages of memory for that process.

    Hard Swapping

Hard swapping takes place when all of the following are true:

- At least two processes are on the run queue, waiting for CPU.
- The average free memory over 30 seconds is consistently less than desfree.
- Excessive paging is going on (determined to be true if page-out + page-in > maxpgio).

When hard swapping is invoked, a much more aggressive approach is used to find memory. First, the kernel is requested to unload all modules and cache memory that are not currently active; then, processes are sequentially swapped out until the desired amount of free memory is returned.
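The soft/hard decision can be sketched as a predicate over the conditions listed above; the threshold values are illustrative, not system defaults:

```python
desfree, maxpgio = 256, 40   # assumed thresholds

def swap_mode(avg_freemem, runq_len, pageout_rate, pagein_rate):
    # avg_freemem is the 30-second average of free memory.
    if avg_freemem >= desfree:
        return "none"
    if runq_len >= 2 and pageout_rate + pagein_rate > maxpgio:
        return "hard"        # unload inactive modules, then swap processes out
    return "soft"            # swap out processes idle for at least maxslp seconds

print(swap_mode(300, 0, 0, 0))      # none
print(swap_mode(200, 1, 10, 10))    # soft
print(swap_mode(200, 3, 30, 20))    # hard
```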

References:

Richard McDougall and Jim Mauro, Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture, 2nd Edition, Pearson Education, ISBN 81-317-1620-1.

Robert A. Gingell, Joseph P. Moran, and William A. Shannon, "Virtual Memory Architecture in SunOS," Proceedings of the Summer 1987 Usenix Technical Conference, Usenix Association, Phoenix, Arizona, USA, June 1987.

Richard McDougall, "Supporting Multiple Page Sizes in the Solaris Operating System," Sun BluePrints OnLine, March 2004, Sun Microsystems Inc.

Steven R. Kleiman, "Vnodes: An Architecture for Multiple File System Types in Sun UNIX," Proceedings of the Summer 1986 Usenix Technical Conference, Usenix Association, Phoenix, Arizona, USA, June 1986.

Marshall Kirk McKusick, Michael J. Karels, and Keith Bostic, "A Pageable Memory Based Filesystem," Proceedings of the Summer 1990 Usenix Technical Conference, Usenix Association, Anaheim, California, USA, June 1990.

The Solaris Memory System: Sizing Tools and Architecture, Sun Microsystems, Inc., 2550 Garcia Avenue, Mountain View, California 94043-1100, USA, 1997.

Peter Snyder, "tmpfs: A Virtual Memory File System," Sun Microsystems Inc.

http://www.opensolaris.org

http://www.princeton.edu/~unix/Solaris/troubleshoot/ram.html

http://developers.sun.com/solaris/articles/free_phys_ram.html

http://www.dbapool.com/faqs/Q_116.html