200601031 Solaris Physical Memory Management


    Solaris 10 Physical Memory Management

Physical memory is managed globally in Solaris via a central free pool, with a system daemon managing the use of physical memory.

    Physical Memory Allocation

Solaris uses the system's RAM as a central pool of physical memory for the different consumers within the system. Physical memory is distributed through the central pool at allocation time and returned to the pool when it is no longer needed. A system daemon (the page scanner) proactively manages memory allocation when there is a system-wide shortage of memory.


The Allocation Cycle of Physical Memory

The most significant part of the central pool of physical memory is the freelist. Physical memory is placed on the freelist in page-size chunks when the system is first booted, as shown in the figure above.

    Anonymous/process allocations

Anonymous memory, the most common form of allocation from the freelist, is used for most of a process's memory allocation, including heap and stack. Anonymous memory also fulfills shared memory mapping allocations. A small amount of anonymous memory is also used in the kernel for items such as thread stacks. Anonymous memory is pageable and is returned to the freelist when it is unmapped or when it is stolen by the page scanner daemon.

File System Page Cache

The page cache is used for caching file data for file systems other than ZFS. The file system page cache grows on demand to consume available physical memory as a file cache and caches file data in page-size chunks. The pages then reside in one of three places: the segmap cache, a process's address space to which they are mapped, or the cachelist.

The cachelist is the heart of the page cache. All unmapped file pages reside on the cachelist. Segmap is a cache that holds file data read and written through the read and write system calls. Memory is allocated from the freelist to satisfy a read of a new file page, which then resides in the segmap file cache. File pages are eventually moved from the segmap cache to the cachelist to make room for more pages in the segmap cache. Segmap can be thought of as the fast first-level file system read/write cache.

The cachelist operates as part of the freelist. When the freelist is depleted, allocations are made from the oldest pages on the cachelist. This allows the file system cache to grow to consume all available memory and to shrink dynamically as memory is required for other purposes.
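The freelist/cachelist interplay described above can be sketched as a small simulation; all structures and names here are illustrative, not the kernel's actual data structures:

```python
from collections import deque

class PagePool:
    def __init__(self, npages):
        self.freelist = deque(range(npages))   # truly free pages
        self.cachelist = deque()               # unmapped but still-valid file pages, oldest first

    def free_file_page(self, pfn):
        # An unmapped file page keeps its contents and goes on the cachelist.
        self.cachelist.append(pfn)

    def allocate(self):
        # Prefer the freelist; when it is depleted, take the oldest
        # cachelist page and discard its cached contents.
        if self.freelist:
            return self.freelist.popleft()
        if self.cachelist:
            return self.cachelist.popleft()
        raise MemoryError("no pages available")

pool = PagePool(4)
pages = [pool.allocate() for _ in range(4)]   # drains the freelist
pool.free_file_page(pages[0])                 # page keeps its file identity on the cachelist
print(pool.allocate())                        # 0: reclaimed from the cachelist
```

The key point the sketch captures is that the cachelist is simply the tail of the freelist: a cached file page is "free" from the allocator's point of view, but its contents survive until the freelist runs dry.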

    Kernel allocations

The kernel uses memory to manage information about internal system state; for example, memory used to hold the list of processes in the system. The kernel allocates memory from the freelist for these purposes with its own allocators, vmem and slab, and the memory allocated is mostly nonpageable. However, unlike process and file allocations, the kernel seldom returns memory to the freelist; memory is instead allocated and freed between kernel subsystems and the kernel allocators. Memory is consumed from the freelist only when the total kernel allocation grows, and the kernel's allocators return memory to the system freelist proactively when a global memory shortage occurs.


Pages: The Basic Unit of Solaris Memory

Pages are the fundamental unit of physical memory in the Solaris memory management subsystem. Physical memory is divided into pages. Every active (not free) page in the Solaris kernel is a mapping between a file (vnode) and memory; the page can be identified with a vnode pointer and the page-size offset within that vnode. A page's identity is its vnode/offset pair. The page structure and associated lists are shown below:

The hardware address translation (HAT) and address space layers manage the mapping between a physical page and its virtual address space. The key property of the vnode/offset pair is reusability; that is, we can reuse each physical page for another task by simply synchronizing its contents in RAM with its backing store (the vnode and the offset) before the page is used.

    The Page Hash List

The VM system hashes pages with identity (a valid vnode/offset pair) onto a global hash list so that they can be located by vnode and offset. Three page functions search the global hash list: page_find(), page_lookup(), and page_lookup_nowait(). The global hash list is an array of pointers to linked lists of pages. The functions use a hash to index into the page_hash array to locate the list of pages that contains the page with the matching vnode/offset pair. The following figure shows how the page_find() function indexes into the page_hash array to locate a page matching a given vnode/offset.


It calculates the slot in the page_hash array containing a list of potential pages by using the PAGE_HASH_FUNC macro, shown below:

It uses the PAGE_HASH_SEARCH macro, shown below, to search the list referenced by the slot for a page matching vnode/offset. The macro traverses the linked list of pages until it finds such a page.
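A simplified model of this lookup: the bucket index is computed from the vnode/offset pair and the bucket's list is then walked, in the spirit of PAGE_HASH_FUNC and PAGE_HASH_SEARCH. The array size, page shift, and hash mixing step here are assumptions for illustration, not the kernel's actual constants:

```python
PAGE_HASH_SIZE = 64               # assumed bucket count (the real size is tunable)
PAGESHIFT = 13                    # 8-Kbyte pages

page_hash = [[] for _ in range(PAGE_HASH_SIZE)]

def page_hash_func(vnode, offset):
    # Mix the vnode identity with the page-aligned offset, then mask
    # into the bucket array (a stand-in for PAGE_HASH_FUNC).
    return (hash(vnode) + (offset >> PAGESHIFT)) % PAGE_HASH_SIZE

def page_add(vnode, offset, page):
    page_hash[page_hash_func(vnode, offset)].append((vnode, offset, page))

def page_find(vnode, offset):
    # Walk the bucket's list until a page with matching identity is
    # found (a stand-in for PAGE_HASH_SEARCH).
    for v, o, page in page_hash[page_hash_func(vnode, offset)]:
        if v == vnode and o == offset:
            return page
    return None

page_add("vn1", 0, "page-A")
page_add("vn1", 8192, "page-B")
print(page_find("vn1", 8192))     # page-B
print(page_find("vn2", 0))        # None
```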

Free List and Cache List

The free list and cache list hold pages that are not mapped into any address space and that have been freed by page_free(). The sum of these pages is reported in the free column in vmstat. Even though vmstat reports these pages as free, they can still contain a valid page from a vnode/offset pair and hence are still part of the global page cache. Memory on the cache list is not really free; it is a valid cache of a page from a file. However, pages will be moved from the cache list to the free list and their contents discarded if the free list becomes exhausted.

The free list contains pages that no longer have a vnode and offset associated with them, which can only occur if the page has been destroyed and removed from a vnode's hash list. The cache list is a hashed list of pages that still have mappings to a valid vnode and offset. Pages can be obtained from the cache list by the page_lookup() routine. This function accepts a vnode and offset as arguments and returns a page structure. If the page is found on the cache list, then the page is removed from the cache list and returned to the caller. When we find and remove pages from the cache list, we are reclaiming a page. Page reclaims are reported by vmstat in the re column.


    Physical Page memseg Lists

    The Solaris kernel uses a segmented global physical page list, consisting of segments of

    contiguous physical memory. (Many hardware platforms now present memory in noncontiguous

    groups.) Contiguous physical memory segments are added during system boot. They are also

added and deleted dynamically when physical memory is added and removed while the system is running. The following figure shows the arrangement of the physical page lists into contiguous

    segments.

    The Page-Level Interfaces

    The Solaris 10 virtual memory system implementation has grouped page management and

    manipulation into a central group of functions. These functions are used by the segment drivers

    and file systems to create, delete and modify pages. The following are some of the page-level

    interfaces:


    The Page Throttle

Solaris implements a page creation throttle so that a small core of memory remains available for consumption by critical parts of the kernel. The page throttle, implemented in the page_create() and page_create_va() functions, causes page creates to block when the PG_WAIT flag is specified and available memory is less than the system global parameter throttlefree. By default, throttlefree is set to the same value as the system global parameter minfree. By default, memory allocated through the kernel memory allocator specifies PG_WAIT and is subject to the page creation throttle.
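A hedged sketch of the throttle check: the threshold relationship follows the text, while the blocking behavior is modeled by a boolean return (True = allocation proceeds, False = caller would block). The parameter values are illustrative, not system defaults:

```python
minfree = 128
throttlefree = minfree            # default: throttlefree == minfree

def page_create_throttled(freemem, npages, pg_wait):
    # With PG_WAIT set, a request that would push available memory below
    # throttlefree blocks until memory becomes available.
    if pg_wait and freemem - npages < throttlefree:
        return False              # would block
    return True

print(page_create_throttled(freemem=200, npages=16, pg_wait=True))   # True
print(page_create_throttled(freemem=130, npages=16, pg_wait=True))   # False
```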


    Page Coloring

Some interesting effects result from the organization of pages within the processor caches, and as a result, the page placement policy within these caches can dramatically affect processor performance. When pages overlay other pages in the cache, they can displace cache data that we might not want overlaid, resulting in lower cache utilization and hot spots.

    The optimal placement of pages in the cache often depends on the memory access patterns

    of the application; that is, is the application accessing memory in a random order, or is it doing

    some sort of strided ordered access? Several different algorithms can be selected in the Solaris

    kernel to implement page placement; the default attempts to provide the best overall

    performance.

To understand how page placement can affect performance, let's look at the cache configuration and see when page overlaying and displacement can occur. The UltraSPARC-I and -II implementations use virtually addressed L1 caches and physically addressed L2 caches. The L2 cache is arranged in lines of 64 bytes, and transfers are done to and from physical memory in 64-byte units. The L1 cache is 16 Kbytes, and the L2 (external) cache can vary between 512 Kbytes and 8 Mbytes. We can query the operating system with adb to see the size of the caches reported to the operating system. The L1 cache size is recorded in the vac_size parameter, and the L2 cache size is recorded in the ecache_size parameter.

We'll start by using the L2 cache as an example of how page placement can affect performance. The physical addressing of the L2 cache means that the cache is organized in page-sized multiples of the physical address space, which means that the cache effectively has only a limited number of page-aligned slots. The number of effective page slots in the cache is the cache size divided by the page size. To simplify our examples, let's assume we have a 32-Kbyte L2 cache (much smaller than reality), which means that if we have a page size of 8 Kbytes, there are four page-sized slots in the L2 cache. The cache does not necessarily read and write 8-Kbyte units from memory; it does that in 64-byte chunks, so in reality our 32-Kbyte cache has 512 addressable slots. The following figure shows how our cache would look if we laid it out linearly:
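The slot arithmetic in this example, under the same assumptions (a 32-Kbyte direct-mapped cache, 8-Kbyte pages, 64-byte lines):

```python
cache_size = 32 * 1024   # assumed 32-Kbyte L2 cache
page_size = 8 * 1024     # 8-Kbyte pages
line_size = 64           # 64-byte cache lines

page_slots = cache_size // page_size   # page-aligned slots in the cache
line_slots = cache_size // line_size   # individually addressable 64-byte lines

print(page_slots, line_slots)          # 4 512
```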


The L2 cache is direct-mapped from physical memory. If we were to access physical addresses on a 32-Kbyte boundary, for example, offsets 0 and 32768, then both memory locations would map to the same cache line. If we were now to alternate between these two addresses, we would cause the cache line for the offset 0 address to be read in, then flushed (cleared), the cache line for the offset 32768 address to be read in and then flushed, then the first reloaded, and so on. This ping-pong effect in the cache is known as cache flushing (or cache ping-ponging), and it effectively reduces our performance to that of real-memory speed, rather than cache speed. By accessing memory on our 32-Kbyte cache-size boundary, we have effectively used only 64 bytes of the cache (one cache line), rather than the full cache size. Memory is often 10 to 20 times slower than cache, so this can have a dramatic effect on performance.
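The ping-pong effect follows directly from the direct-mapped index calculation, which can be sketched as:

```python
cache_size = 32 * 1024   # assumed 32-Kbyte direct-mapped cache
line_size = 64           # 64-byte lines

def cache_line_index(paddr):
    # In a direct-mapped cache, the line is chosen by the physical
    # address modulo the cache size, divided by the line size.
    return (paddr % cache_size) // line_size

print(cache_line_index(0))        # 0
print(cache_line_index(32768))    # 0 -> same line: alternating accesses ping-pong
print(cache_line_index(64))       # 1
```

Two addresses exactly one cache-size apart always collide on the same line, so alternating between them evicts the other's data on every access.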

Our simple example was based on the assumption that we were accessing physical memory in a regular pattern, but we don't program to physical memory; rather, we program to virtual memory. Therefore, the operating system must provide a sensible mapping between virtual memory and physical memory; otherwise, effects such as those in our example can occur.

By default, physical pages are assigned to an address space in the order in which they appear in the free list. In general, the first time a machine boots, the free list may have physical memory in a linear order, and we may end up with the behavior described in our ping-pong example. Once a machine has been running, the physical page free list becomes randomly ordered, and subsequent reruns of an identical application could get very different physical page placement and, as a result, very different performance. On early Solaris implementations, this is exactly what customers saw: differing performance for identical runs, by as much as 30 percent.

To provide better and more consistent performance, the Solaris kernel uses a page coloring algorithm when pages are allocated to a virtual address space. Rather than being randomly allocated, the pages are allocated with a specific predetermined relationship between the virtual address to which they are being mapped and their underlying physical address. The virtual-to-physical relationship is predetermined as follows: the free list of physical pages is organized into specifically colored bins, one color bin for each slot in the physical cache; the number of color bins is determined by the ecache size divided by the page size. (In our example, there would be exactly four colored bins.)

When a page is put on the free list, the page_free() algorithms assign it to a color bin. When a page is consumed from the free list, the virtual-to-physical algorithm takes the page from a physical color bin chosen as a function of the virtual address to which the page will be mapped. The algorithm requires that when allocating pages from the free list, the page create function must know the virtual address to which a page will be mapped.

New pages are allocated by calling the page_create_va() function. The page_create_va() function accepts as an argument the virtual address of the location to which the page is going to be mapped; the virtual-to-physical color bin algorithm can then decide which color bin to take physical pages from.
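The coloring scheme can be sketched as follows. The bin count matches the example above; the bin-selection function shown (virtual page number modulo the number of bins) is the simplest possible choice for illustration, not the kernel's default hashing algorithm:

```python
ecache_size = 32 * 1024
page_size = 8 * 1024
nbins = ecache_size // page_size          # 4 color bins in our example

# Physical pages sorted into color bins by physical address.
bins = [[] for _ in range(nbins)]
for pfn in range(16):
    bins[pfn % nbins].append(pfn)

def page_create_va(vaddr):
    # Choose the color bin as a function of the virtual address, so the
    # physical page's cache slot matches the virtual page's slot.
    color = (vaddr // page_size) % nbins
    return bins[color].pop(0)

pfn = page_create_va(0x10000)             # virtual page 8 -> color 0
print(pfn % nbins)                        # 0: physical color matches virtual color
```

The invariant the sketch demonstrates is the point of page coloring: pages that are contiguous in virtual memory land in distinct cache slots, regardless of the order in which physical pages were freed.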

No one algorithm suits all applications because different applications have different memory access patterns. Over time, the page coloring algorithms used in the Solaris kernel have been refined as a result of extensive simulation, benchmarks, and customer feedback. The kernel supports a default algorithm and two optional algorithms. The default algorithm was chosen according to the following criteria:

- Fairly consistent, repeatable results
- Good overall performance for the majority of applications
- Acceptable performance across a wide range of applications

The default algorithm uses a hashing algorithm to distribute pages as evenly as possible throughout the cache. The default and the two other available page coloring algorithms are shown here:


You can change the default algorithm by setting the system parameter consistent_coloring, either on the fly with adb or permanently in /etc/system.

So, which algorithm is best? Well, your mileage will vary, depending on your application. Page coloring usually makes a difference only for memory-intensive scientific applications, and the defaults are usually fine for commercial or database systems. If you have a time-critical scientific application, then we recommend that you experiment with the different algorithms and see which is best. Remember that some algorithms will produce different results for each run, so aggregate as many runs as possible.

    The Page Scanner

The page scanner is the memory management daemon that manages system-wide physical memory. The page scanner and the virtual memory page fault mechanism are the core of the demand-paged memory allocation system used to manage Solaris memory. When there is a memory shortage, the page scanner runs to steal memory from address spaces by taking pages that haven't been used recently, syncing them up with their backing store (swap space if they are anonymous pages), and freeing them. If paged-out virtual memory is required again by an address space, then a memory page fault occurs when the virtual address is referenced, and the pages are re-created and copied back in from their backing store.

The balancing of page stealing and page faults determines which parts of virtual memory will be backed by real physical memory and which will be moved out to swap. The page scanner does not understand the memory usage patterns or working sets of processes; it only knows reference information on a physical page-by-page basis. This policy is often referred to as global page replacement; the alternative, process-based page management, is known as local page replacement. The subtleties of which pages are stolen govern the memory allocation policies and can affect different workloads in different ways. During the life of the Solaris kernel, only two significant changes in memory replacement policies have occurred:

- Enhancements to minimize page stealing from extensively shared libraries and executables
- Priority paging to prevent application, shared library, and executable paging on systems with ample memory


    Page Scanner Implementation

The page scanner is implemented as two kernel threads, both of which run in the pageout process. One thread scans pages, and the other thread pushes the dirty pages queued for I/O to the swap device. In addition, the kernel callout mechanism wakes the page scanner thread when memory is insufficient.

The scanner's schedpaging() function is called four times per second by a callout placed in the callout table. The schedpaging() function checks whether free memory is below the threshold (lotsfree or cachefree) and, if required, prepares to trigger the scanner thread. The page scanner is not only awakened by the callout thread; it is also triggered by the clock() thread if memory falls below minfree, or by the page allocator if memory falls below throttlefree.
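The wakeup conditions above can be summarized in a small sketch; the threshold values are illustrative, not system defaults:

```python
# Assumed ordering: lotsfree > minfree >= throttlefree.
lotsfree, minfree, throttlefree = 512, 128, 128

def schedpaging_should_scan(freemem):
    # Called four times per second from the callout table.
    return freemem < lotsfree

def clock_should_wake_scanner(freemem):
    # The clock() thread also wakes the scanner below minfree.
    return freemem < minfree

print(schedpaging_should_scan(600))   # False: plenty of memory
print(schedpaging_should_scan(400))   # True: below lotsfree, start scanning
```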

    This illustrates how the page scanner works:

    Page Scanner Architecture

When called, the schedpaging() routine calculates two setup parameters for the page scanner thread: the number of pages to scan and the number of CPU ticks that the scanner thread can consume while doing so. The number of pages and CPU ticks are calculated according to the equations shown in Scan Rate Parameters (assuming no priority paging). Once the scanning parameters have been calculated, schedpaging() triggers the page scanner through a condition variable wakeup.
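The scan-rate setup can be modeled as a linear interpolation between slowscan (when free memory equals lotsfree) and fastscan (when free memory reaches zero), which is the shape of the Scan Rate Parameters equations referenced above; the parameter values here are made up:

```python
lotsfree = 512    # illustrative thresholds, in pages
slowscan = 100    # pages/second at the lotsfree boundary
fastscan = 8192   # pages/second when free memory hits zero

def scan_rate(freemem):
    if freemem >= lotsfree:
        return 0                                   # no scanning needed
    # Linear interpolation between slowscan and fastscan.
    return int((fastscan * (lotsfree - freemem) +
                slowscan * freemem) / lotsfree)

print(scan_rate(512))   # 0
print(scan_rate(0))     # 8192
print(scan_rate(256))   # 4146: midway between slowscan and fastscan
```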

The page scanner thread cycles through the physical page list, progressing by the number of pages requested each time it is woken up. The front hand and the back hand each have a page pointer. The front hand is incremented first so that it can clear the referenced and modified bits for the page it currently points to. The back hand is then incremented, and the status of the page pointed to by the back hand is checked by the check_page() function. At this point, if the page has been modified, it is placed on the dirty page queue for processing by the page-out thread. If the page was not referenced (it's clean!), then it is simply freed.
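A minimal model of the two-handed clock described above. It simplifies freely (freed pages are not removed from the list, and check_page() is inlined), and all structures are invented:

```python
class ClockPage:
    def __init__(self):
        self.referenced = True
        self.modified = False

def scan(pages, handspread, nticks):
    freed, dirty_queue = [], []
    n = len(pages)
    for tick in range(nticks):
        front = tick % n
        back = (tick - handspread) % n
        pages[front].referenced = False           # front hand clears the bit
        if tick >= handspread:                    # back hand follows behind
            p = pages[back]
            if not p.referenced:                  # not re-referenced in between
                if p.modified:
                    dirty_queue.append(back)      # queue for the page-out thread
                else:
                    freed.append(back)            # clean and unreferenced: free it
    return freed, dirty_queue

pages = [ClockPage() for _ in range(8)]
pages[2].modified = True
freed, dirty = scan(pages, handspread=4, nticks=12)
print(freed, dirty)                               # [0, 1, 3, 4, 5, 6, 7] [2]
```

The distance between the hands (handspread) sets how long a page has to prove it is still in use: a page re-referenced after the front hand clears its bit but before the back hand arrives survives the pass.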

Dirty pages are placed onto a queue so that a separate thread, the page-out thread, can write them out to their backing store. A separate thread is used so that a deadlock can't occur while the system is waiting to swap a page out. The page-out thread uses a preinitialized list of async buffer headers as the queue for I/O requests. The list is initialized with 256 entries, which means the queue can contain at most 256 entries. The number of entries preconfigured on the list is controlled by the async_request_size system parameter. Requests to queue more I/Os will block if the entire queue is full (256 entries) or if the rate of pages queued has exceeded the system maximum set by the maxpgio parameter.

The page-out thread simply removes I/O entries from the queue and initiates I/O on them by calling the vnode putpage() function for the page in question. In the Solaris kernel, this function calls the swapfs_putpage() function to initiate the swap page-out via the swapfs layer.

The swapfs layer delays and gathers pages together (16 pages on sun4u), then writes them out together. The klustsize parameter controls the number of pages that swapfs will cluster; the defaults are shown in the table below.

    The Memory Scheduler

In addition to the page-out process, the CPU scheduler/dispatcher can swap out entire processes to conserve memory. This operation is separate from page-out. Swapping out a process involves removing all of a process's thread structures and private pages from memory, and setting flags in the process table to indicate that this process has been swapped out. This is an inexpensive way to conserve memory, but it dramatically affects a process's performance and hence is used only when paging fails to consistently free enough memory.

The memory scheduler is launched at boot time and does nothing unless memory is consistently less than desfree (on a 30-second average). At this point, the memory scheduler starts looking for processes that it can completely swap out. The memory scheduler will soft-swap out processes if the shortage is minimal, or hard-swap out processes in the case of a larger memory shortage.

    Soft Swapping

Soft swapping takes place when the 30-second average for free memory is below desfree. The memory scheduler then looks for processes that have been inactive for at least maxslp seconds. When the memory scheduler finds a process that has been sleeping for maxslp seconds, it swaps out the thread structures for each thread, then pages out all of the private pages of memory for that process.

    Hard Swapping

Hard swapping takes place when all of the following are true:

- At least two processes are on the run queue, waiting for CPU.
- The average free memory over 30 seconds is consistently less than desfree.
- Excessive paging is going on (determined to be true if page-out + page-in > maxpgio).

When hard swapping is invoked, a much more aggressive approach is used to find memory. First, the kernel is requested to unload all modules and cache memory that are not currently active; then, processes are sequentially swapped out until the desired amount of free memory is returned.
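The soft/hard decision can be sketched as a predicate over the conditions listed above; the threshold values are illustrative, not system defaults:

```python
desfree, maxpgio = 256, 40   # assumed thresholds

def swap_mode(avg_freemem, runq_len, pageout_rate, pagein_rate):
    # avg_freemem is the 30-second average of free memory.
    if avg_freemem >= desfree:
        return "none"
    if runq_len >= 2 and pageout_rate + pagein_rate > maxpgio:
        return "hard"        # unload inactive modules, then swap processes out
    return "soft"            # swap out processes idle for at least maxslp seconds

print(swap_mode(300, 0, 0, 0))      # none
print(swap_mode(200, 1, 10, 10))    # soft
print(swap_mode(200, 3, 30, 20))    # hard
```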

References:

Richard McDougall and Jim Mauro, Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture, 2nd Edition, Pearson Education, ISBN 81-317-1620-1.

Robert A. Gingell, Joseph P. Moran, and William A. Shannon, "Virtual Memory Architecture in SunOS," Proceedings of the Summer 1987 Usenix Technical Conference, Usenix Association, Phoenix, Arizona, USA, June 1987.

Richard McDougall, "Supporting Multiple Page Sizes in the Solaris Operating System," Sun BluePrints OnLine, March 2004, Sun Microsystems Inc.

Steven R. Kleiman, "Vnodes: An Architecture for Multiple File System Types in Sun UNIX," Proceedings of the Summer 1986 Usenix Technical Conference, Usenix Association, Phoenix, Arizona, USA, June 1986.

Marshall Kirk McKusick, Michael J. Karels, and Keith Bostic, "A Pageable Memory Based Filesystem," Proceedings of the Summer 1990 Usenix Technical Conference, Usenix Association, Anaheim, California, USA, June 1990.

The Solaris Memory System: Sizing Tools and Architecture, Sun Microsystems, Inc., 2550 Garcia Avenue, Mountain View, California 94043-1100, USA, 1997.

Peter Snyder, "tmpfs: A Virtual Memory File System," Sun Microsystems Inc.

http://www.opensolaris.org

http://www.princeton.edu/~unix/Solaris/troubleshoot/ram.html

http://developers.sun.com/solaris/articles/free_phys_ram.html

http://www.dbapool.com/faqs/Q_116.html