8/2/2019 200601031 Solaris Physical Memory Management
Solaris 10 Physical Memory Management
Physical memory is managed globally in Solaris via a central free pool, with a system daemon
that manages the use of physical memory.
Physical Memory Allocation
Solaris uses the system's RAM as a central pool of physical memory for the different consumers
within the system. Physical memory is handed out from the central pool at allocation time and
returned to the pool when it is no longer needed. A system daemon (the page scanner) proactively
manages memory allocation when there is a systemwide shortage of memory.
The Allocation cycle of Physical memory
The most significant central pool of physical memory is the freelist. Physical memory is placed
on the freelist in page-size chunks when the system is first booted, and pages then cycle through
the freelist as shown in the figure above.
Anonymous/process allocations
Anonymous memory, the most common form of allocation from the freelist, is used for most of
a process's memory allocation, including heap and stack. Anonymous memory also fulfills
shared memory mapping allocations. A small amount of anonymous memory is also used in
the kernel for items such as thread stacks. Anonymous memory is pageable and is returned to
the freelist when it is unmapped or when it is stolen by the page scanner daemon.
File System page cache
The page cache is used for caching file data for file systems other than the ZFS file system.
The file system page cache grows on demand to consume available physical memory as a file
cache and caches file data in page-size chunks. The pages then reside in one of three
places: the segmap cache, a process's address space to which they are mapped, or the
cachelist.
The cachelist is the heart of the page cache. All unmapped file pages reside on the cachelist.
Segmap is a cache that holds file data read and written through the read and write system
calls. Memory is allocated from the freelist to satisfy a read of a new file page, which then
resides in the segmap file cache. File pages are eventually moved from the segmap cache to
the cachelist to make room for more pages in the segmap cache. Segmap can thus be thought
of as the fast, first-level file system read/write cache.
The cachelist operates as part of the freelist. When the freelist is depleted, allocations are made
from the oldest pages in the cachelist. This allows the file system cache to grow to consume all
available memory and to shrink dynamically as memory is required for other purposes.
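The freelist/cachelist interplay described above can be sketched as a toy model. All names here (`PagePool`, `allocate`, `release_file_page`) are illustrative, not the kernel's actual data structures or interfaces:

```python
from collections import deque

class PagePool:
    """Toy model of the freelist/cachelist allocation cycle."""

    def __init__(self, nfree):
        self.freelist = deque(range(nfree))  # truly free pages (no identity)
        self.cachelist = deque()             # unmapped pages still caching file data

    def allocate(self):
        # Prefer the freelist; when it is depleted, reclaim the oldest
        # cachelist page and discard its cached contents.
        if self.freelist:
            return self.freelist.popleft()
        if self.cachelist:
            pfn, _identity = self.cachelist.popleft()
            return pfn
        raise MemoryError("no pages available")

    def free(self, pfn):
        # A destroyed page goes straight back to the freelist.
        self.freelist.append(pfn)

    def release_file_page(self, pfn, vnode, offset):
        # An unmapped file page keeps its vnode/offset identity on the cachelist.
        self.cachelist.append((pfn, (vnode, offset)))
```

The point of the model is the fallback path in `allocate()`: the file cache grows until ordinary allocations start consuming its oldest pages.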
Kernel allocations
The kernel uses memory to manage information about internal system state; for example,
memory is used to hold the list of processes in the system. The kernel allocates memory from the
freelist for these purposes with its own allocators, vmem and slab, and the memory allocated is
mostly nonpageable. However, unlike process and file allocations, the kernel seldom returns
memory to the freelist; memory is allocated and freed between kernel subsystems and the
kernel allocators. Memory is consumed from the freelist only when the total kernel allocation
grows, and memory is returned to the system freelist proactively by the kernel's allocators when
a global memory shortage occurs.
Pages: The Basic Unit of Solaris Memory
Pages are the fundamental unit of physical memory in the Solaris memory management
subsystem. Physical memory is divided into pages. Every active (not free) page in the Solaris
kernel is a mapping between a file (vnode) and memory; the page can be identified with a
vnode pointer and the page size offset within that vnode.A pages identity is its vnode/offsetpair. The age structure and associated lists are shown below:
The hardware address translation (HAT) and address space layers manage the mapping
between a physical page and its virtual address space. The key property of vnode/offset pair is
reusability; that is, we can reuse each physical page for another task by simply synchronizing its
contents in RAM with its backing store(the vnode and the offset) before the page is used.
The Page Hash List
The VM system hashes pages that have an identity (a valid vnode/offset pair) onto a global hash list
so that they can be located by vnode and offset. Three page functions search the global hash
list: page_find(), page_lookup(), and page_lookup_nowait(). The global hash list is an array of
pointers to linked lists of pages. The functions use a hash to index into the page_hash array to
locate the list of pages that contains the page with the matching vnode/offset pair. The following
figure shows how the page_find() function indexes into the page_hash array to locate a page
matching a given vnode/offset.
It calculates the slot in the page_hash array containing a list of potential pages by using the
PAGE_HASH_FUNC macro, shown below:
It then uses the PAGE_HASH_SEARCH macro, shown below, to search the list referenced by the
slot for a page matching the vnode/offset pair. The macro traverses the linked list of pages until it finds
such a page.
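A minimal sketch of this hash-and-search scheme follows. The bucket count and the exact bit-mixing are illustrative only; the real PAGE_HASH_FUNC macro shifts the vnode pointer and offset differently:

```python
PAGE_SHIFT = 13            # 8-Kbyte pages
PAGE_HASH_SIZE = 64        # illustrative bucket count; a power of two

def page_hash_func(vnode_id, offset):
    # Mix the vnode identity with the page-aligned offset to pick a bucket,
    # in the spirit of PAGE_HASH_FUNC.
    return ((offset >> PAGE_SHIFT) + vnode_id) & (PAGE_HASH_SIZE - 1)

page_hash = [[] for _ in range(PAGE_HASH_SIZE)]  # array of per-bucket page lists

def page_insert(page):
    page_hash[page_hash_func(page["vnode"], page["offset"])].append(page)

def page_find(vnode_id, offset):
    # Walk only the one bucket that can contain this vnode/offset pair,
    # just as PAGE_HASH_SEARCH walks the list referenced by the slot.
    for page in page_hash[page_hash_func(vnode_id, offset)]:
        if page["vnode"] == vnode_id and page["offset"] == offset:
            return page
    return None
```

Because the hash narrows the search to one bucket, lookup cost stays roughly constant regardless of how many pages the system holds.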
Free List and Cache List
The free list and cache list hold pages that are not mapped into any address space and that
have been freed by page_free(). The sum of these pages is reported in the free column in
vmstat. Even though vmstat reports these pages as free, they can still contain a valid page for a
vnode/offset and hence still be part of the global page cache. Memory on the cache list is not
really free; it is a valid cache of a page from a file. However, pages will be moved from the
cache list to the free list and their contents discarded if the free list becomes exhausted.
The free list contains pages that no longer have a vnode and offset associated with them,
which can occur only if the page has been destroyed and removed from a vnode's hash list. The
cache list is a hashed list of pages that still have mappings to a valid vnode and offset. Pages
can be obtained from the cache list by the page_lookup() routine. This function accepts a vnode
and offset as arguments and returns a page structure. If the page is found on the cache list, then
the page is removed from the cache list and returned to the caller. When we find and remove
pages from the cache list, we are reclaiming a page. Page reclaims are reported by vmstat in
the re column.
Physical Page memseg Lists
The Solaris kernel uses a segmented global physical page list, consisting of segments of
contiguous physical memory. (Many hardware platforms now present memory in noncontiguous
groups.) Contiguous physical memory segments are added during system boot. They are also
added and deleted dynamically when physical memory is added or removed while the system
is running. The following figure shows the arrangement of the physical page lists into contiguous
segments.
The Page-Level Interfaces
The Solaris 10 virtual memory system implementation groups page management and
manipulation into a central set of functions. These functions are used by the segment drivers
and file systems to create, delete, and modify pages. The following are some of the page-level
interfaces:
The Page Throttle
Solaris implements a page-creation throttle so that a small core of memory remains available for
consumption by critical parts of the kernel. The page throttle, implemented in the page_create()
and page_create_va() functions, causes page creates to block when the PG_WAIT flag is
specified and available memory is less than the system global parameter throttlefree. By default,
throttlefree is set to the same value as the system global parameter minfree. By default,
memory allocated through the kernel memory allocator specifies PG_WAIT and is therefore
subject to the page-create throttle.
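The throttle decision reduces to a simple comparison. This sketch uses an illustrative flag value and parameter names taken from the text; it is not the kernel's actual encoding:

```python
PG_WAIT = 0x1  # illustrative flag value, not the kernel's actual encoding

def page_create_throttled(freemem, throttlefree, flags):
    """Return True if this page-create request should block.

    Callers that specify PG_WAIT (as kernel memory allocations do by
    default) block whenever available memory is below throttlefree,
    preserving a small core of memory for critical kernel consumers.
    """
    return bool(flags & PG_WAIT) and freemem < throttlefree
```

A caller that omits PG_WAIT is never throttled, which is why only well-behaved waiting allocations are held back during a shortage.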
Page Coloring
Some interesting effects result from the organization of pages within the processor caches,
and as a result, the page placement policy within these caches can dramatically affect
processor performance. When pages overlay other pages in the cache, they can displace cached
data that we might not want overlaid, resulting in lower cache utilization and hot spots.
The optimal placement of pages in the cache often depends on the memory access patterns
of the application; that is, is the application accessing memory in a random order, or is it doing
some sort of strided ordered access? Several different algorithms can be selected in the Solaris
kernel to implement page placement; the default attempts to provide the best overall
performance.
To understand how page placement can affect performance, let's look at the cache
configuration and see when page overlaying and displacement can occur. The UltraSPARC-I
and -II implementations use virtually addressed L1 caches and physically addressed L2 caches.
The L2 cache is arranged in lines of 64 bytes, and transfers are done to and from physical
memory in 64-byte units. The L1 cache is 16 Kbytes, and the L2 (external) cache can vary
between 512 Kbytes and 8 Mbytes. We can query the operating system with adb to see the size
of the caches reported to the operating system. The L1 cache size is recorded in the
vac_size parameter, and the L2 cache size is recorded in the ecache_size parameter.
We'll start by using the L2 cache as an example of how page placement can affect
performance. The physical addressing of the L2 cache means that the cache is organized in
page-sized multiples of the physical address space, which means that the cache effectively has
only a limited number of page-aligned slots. The number of effective page slots in the cache is
the cache size divided by the page size. To simplify our examples, let's assume we have a 32-
Kbyte L2 cache (much smaller than reality), which means that with a page size of 8
Kbytes, there are four page-sized slots in the L2 cache. The cache does not necessarily read
and write 8-Kbyte units from memory; it does so in 64-byte chunks, so in reality our 32-Kbyte
cache has 512 addressable lines. The following figure shows how our cache would look if we
laid it out linearly:
The L2 cache is direct-mapped from physical memory. If we were to access physical
addresses on a 32-Kbyte boundary, for example, offsets 0 and 32768, then both memory
locations would map to the same cache line. If we were to alternately access these two
addresses, we would cause the cache line for the offset 0 address to be read in, then flushed
(cleared), the cache line for the offset 32768 address to be read in and then flushed, then the
first reloaded, and so on. This ping-pong effect in the cache is known as cache flushing (or
cache ping-ponging), and it effectively reduces our performance to that of real-memory speed,
rather than cache speed. By accessing memory on our 32-Kbyte cache-size boundary, we have
effectively used only 64 bytes of the cache (one cache line), rather than the full cache size.
Memory is often 10 to 20 times slower than cache, so this can have a dramatic effect on
performance.
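The collision at 32-Kbyte strides can be checked with simple index arithmetic, using the simplified cache sizes from the example above:

```python
CACHE_SIZE = 32 * 1024  # the simplified 32-Kbyte direct-mapped L2 cache
LINE_SIZE = 64          # 64-byte cache lines

def cache_line_index(paddr):
    # In a direct-mapped cache, each physical address maps to exactly one
    # line: the address modulo the cache size selects that line.
    return (paddr % CACHE_SIZE) // LINE_SIZE
```

Addresses 0 and 32768 yield the same line index and therefore evict each other, while addresses 0 and 64 land in adjacent lines and coexist.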
Our simple example was based on the assumption that we were accessing physical memory
in a regular pattern, but we don't program to physical memory; rather, we program to virtual
memory. Therefore, the operating system must provide a sensible mapping between virtual
memory and physical memory; otherwise, effects such as those in our example can occur.
By default, physical pages are assigned to an address space in the order in which they
appear in the free list. In general, the first time a machine boots, the free list may have physical
memory in a linear order, and we may end up with the behavior described in our ping-pong
example. Once a machine has been running, the physical page free list becomes randomly
ordered, and subsequent reruns of an identical application could get very different physical page
placement and, as a result, very different performance. On early Solaris implementations, this is
exactly what customers saw: differing performance for identical runs, by as much as 30 percent.
To provide better and consistent performance, the Solaris kernel uses a page coloring
algorithm when pages are allocated to a virtual address space. Rather than being randomly
allocated, the pages are allocated with a specific predetermined relationship between the virtual
address to which they are being mapped and their underlying physical address. The virtual-to-
physical relationship is predetermined as follows: the free list of physical pages is organized into
specifically colored bins, one color bin for each slot in the physical cache; the number of color
bins is determined by the ecache size divided by the page size. (In our example, there would be
exactly four colored bins.)
When a page is put on the free list, the page_free() algorithm assigns it to a color bin. When a
page is consumed from the free list, the virtual-to-physical algorithm takes the page from a
physical color bin chosen as a function of the virtual address to which the page will be
mapped. The algorithm requires that, when pages are allocated from the free list, the page
create function know the virtual address to which a page will be mapped.
New pages are allocated by calling the page_create_va() function. The page_create_va()
function accepts the virtual address of the location to which the page is going to be mapped as
an argument; then, the virtual-to-physical color bin algorithm can decide which color bin to take
physical pages from.
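Under a virtual-address coloring scheme of this kind, bin selection reduces to arithmetic like the following. This is a sketch using the simplified sizes from the earlier example; the kernel's actual bin choice also depends on which coloring algorithm is selected:

```python
ECACHE_SIZE = 32 * 1024  # simplified L2 cache size, as in the earlier example
PAGE_SIZE = 8 * 1024     # 8-Kbyte pages, giving four color bins

# One color bin per page-sized slot in the physical cache.
NUM_COLOR_BINS = ECACHE_SIZE // PAGE_SIZE

def color_bin_for_va(vaddr):
    # Consecutive virtual pages draw physical pages from consecutive bins,
    # so they land in different page-sized slots of the physical cache.
    return (vaddr // PAGE_SIZE) % NUM_COLOR_BINS
```

The effect is that two virtual pages less than a cache size apart can never collide in the cache, regardless of which physical pages happen to be free.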
No one algorithm suits all applications because different applications have different memory
access patterns. Over time, the page coloring algorithms used in the Solaris kernel have been
refined as a result of extensive simulation, benchmarks, and customer feedback. The kernel
supports a default algorithm and several optional algorithms. The default algorithm was chosen
according to the following criteria:
Fairly consistent, repeatable results
Good overall performance for the majority of applications
Acceptable performance across a wide range of applications
The default algorithm uses a hashing algorithm to distribute pages as evenly as possible
throughout the cache. The default and the other available page coloring algorithms are shown
here:
You can change the default algorithm by setting the system parameter consistent_coloring,
either on the fly with adb or permanently in /etc/system.
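For example, to select a non-default algorithm permanently, an /etc/system entry of the following form is used (the value 2 here is purely illustrative; use a value corresponding to one of the algorithms in the table above):

```
* /etc/system: select a page coloring algorithm at boot
set consistent_coloring=2
```

A reboot is required for an /etc/system change to take effect.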
So, which algorithm is best? Well, your mileage will vary, depending on your application. Page
coloring usually makes a difference only on memory-intensive scientific applications, and the
defaults are usually fine for commercial or database systems. If you have a time-critical
scientific application, then we recommend that you experiment with the different algorithms and
see which is best. Remember that some algorithms will produce different results for each run,
so aggregate as many runs as possible.
The Page Scanner
The page scanner is the memory management daemon that manages systemwide physical
memory. The page scanner and the virtual memory page fault mechanism are the core of the
demand-paged memory allocation system used to manage Solaris memory. When there is a
memory shortage, the page scanner runs to steal memory from address spaces by taking
pages that haven't been used recently, syncing them up with their backing store (swap space if
they are anonymous pages), and freeing them. If paged-out virtual memory is required again by
an address space, then a memory page fault occurs when the virtual address is referenced, and
the pages are re-created and copied back from their backing store.
The balancing of page stealing and page faults determines which parts of virtual memory will
be backed by real physical memory and which will be moved out to swap. The page scanner
does not understand the memory usage patterns or working sets of processes; it knows only
reference information on a physical page-by-page basis. This policy is often referred to as
global page replacement; the alternative, process-based page management, is known as local
page replacement. The subtleties of which pages are stolen govern the memory allocation
policies and can affect different workloads in different ways. During the life of the Solaris kernel,
only two significant changes in memory replacement policies have occurred:
Enhancements to minimize page stealing from extensively shared libraries and
executables
Priority paging to prevent application, shared library, and executable paging on systems
with ample memory
Page Scanner Implementation
The page scanner is implemented as two kernel threads, both part of the pageout process. One
thread scans pages, and the other thread pushes the dirty pages queued for I/O to the swap
device. In addition, the kernel callout mechanism wakes the page scanner thread when memory
is insufficient.
The scanner's schedpaging() function is called four times per second by a callout placed in the
callout table. The schedpaging() function checks whether free memory is below the threshold
(lotsfree or cachefree) and, if required, prepares to trigger the scanner thread. The page scanner
is not only awakened by the callout thread; it is also triggered by the clock() thread if memory
falls below minfree, or by the page allocator if memory falls below throttlefree.
This illustrates how the page scanner works:
Page Scanner Architecture
When called, the schedpaging routine calculates two setup parameters for the page scanner
thread: the number of pages to scan and the number of CPU ticks that the scanner thread can
consume while doing so. The number of pages and CPU ticks are calculated according to the
equations shown in Scan Rate Parameters (Assuming No Priority Paging). Once the scanning
parameters have been calculated, schedpaging triggers the page scanner through a condition
variable wakeup.
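The classic form of that calculation interpolates the scan rate linearly between the slowscan and fastscan tunables as free memory falls from lotsfree toward zero. A sketch of the interpolation, assuming no priority paging, with parameter names taken from the text:

```python
def scan_rate(freemem, lotsfree, slowscan, fastscan):
    """Pages per second the scanner is asked to scan.

    Just below lotsfree the scanner ambles along near slowscan; as free
    memory approaches zero, the rate climbs linearly toward fastscan.
    """
    if freemem >= lotsfree:
        return 0  # no shortage, so the scanner stays idle
    frac = freemem / lotsfree
    return int((1 - frac) * fastscan + frac * slowscan)
```

The CPU-tick budget caps how much of this work the scanner thread may actually do per wakeup, so a busy system degrades gracefully rather than spending all its time scanning.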
The page scanner thread cycles through the physical page list, progressing by the number of
pages requested each time it is woken up. The front hand and the back hand each have a page
pointer. The front hand is incremented first so that it can clear the referenced and modified bits
for the page currently pointed to by the front hand. The back hand is then incremented, and the
status of the page pointed to by the back hand is checked by the check_page() function. At this
point, if the page has been modified, it is placed in the dirty page queue for processing by the
page-out thread. If the page was not referenced (it's clean!), then it is simply freed.
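The two-handed pass described above can be sketched as follows. The hand spacing and page fields are simplified, and the real check_page() does considerably more work than this:

```python
from collections import deque

def scan_pass(pages, handspread, npages, dirty_queue):
    """One pass of the two-handed clock over a circular page list.

    The front hand clears the reference bit; when the back hand reaches
    the same page later, a still-unreferenced dirty page is queued for
    page-out, and a still-unreferenced clean page is freed outright.
    """
    freed = []
    n = len(pages)
    for i in range(npages):
        pages[i % n]["referenced"] = False     # front hand clears the ref bit
        if i < handspread:
            continue                           # back hand has not started yet
        back = pages[(i - handspread) % n]     # back hand trails the front hand
        if not back["referenced"]:
            if back["modified"]:
                dirty_queue.append(back)       # hand off to the page-out thread
            else:
                freed.append(back)             # clean and unused: free it
    return freed
```

The gap between the hands is what gives a busy page a chance to set its reference bit again and escape being stolen.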
Dirty pages are placed onto a queue so that a separate thread, the page-out thread, can write
them out to their backing store. We use another thread so that a deadlock can't occur while the
system is waiting to swap a page out. The page-out thread uses a preinitialized list of async
buffer headers as the queue for I/O requests. The list is initialized with 256 entries, which means
the queue can contain at most 256 entries. The number of entries preconfigured on the list is
controlled by the async_request_size system parameter. Requests to queue more I/Os will be
blocked if the entire queue is full (256 entries) or if the rate of pages queued has exceeded the
system maximum set by the maxpgio parameter.
The page-out thread simply removes I/O entries from the queue and initiates I/O on them by
calling the vnode putpage() function for the page in question. In the Solaris kernel, this function
calls the swapfs_putpage() function to initiate the swap page-out via the swapfs layer.
The swapfs layer delays and gathers pages together (16 pages on sun4u), then writes them out
together. The klustsize parameter controls the number of pages that swapfs will cluster; the
defaults are shown in the table below.
The Memory Scheduler
In addition to the page-out process, the CPU scheduler/dispatcher can swap out entire
processes to conserve memory. This operation is separate from page-out. Swapping out a
process involves removing all of a process's thread structures and private pages from memory,
and setting flags in the process table to indicate that the process has been swapped out. This is
an inexpensive way to conserve memory, but it dramatically affects a process's performance
and hence is used only when paging fails to consistently free enough memory.
The memory scheduler is launched at boot time and does nothing unless memory is
consistently less than desfree (on a 30-second average). At that point, the memory scheduler
starts looking for processes that it can completely swap out. The memory scheduler
will soft-swap out processes if the shortage is minimal, or hard-swap out processes in the case
of a larger memory shortage.
Soft Swapping
Soft swapping takes place when the 30-second average for free memory is below desfree.
The memory scheduler then looks for processes that have been inactive for at least maxslp
seconds. When the memory scheduler finds a process that has been sleeping for maxslp
seconds, it swaps out the thread structures for each thread and then pages out all of the private
pages of memory for that process.
Hard Swapping
Hard swapping takes place when all of the following are true:
At least two processes are on the run queue, waiting for CPU.
The average free memory over 30 seconds is consistently less than desfree.
Excessive paging (determined to be true if page-out + page-in > maxpgio) is going on.
When hard swapping is invoked, a much more aggressive approach is used to find memory.
First, the kernel is requested to unload all modules and cache memory that are not currently in
use; then, processes are sequentially swapped out until the desired amount of free memory is
returned.
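The three hard-swap conditions combine into a single predicate; a sketch with illustrative parameter names drawn from the text:

```python
def should_hard_swap(runq_len, avg_freemem_30s, desfree, pageouts, pageins, maxpgio):
    """True when the memory scheduler escalates from soft to hard swapping.

    All three conditions from the text must hold at once: CPU contention,
    a sustained free-memory shortage, and excessive paging traffic.
    """
    return (runq_len >= 2
            and avg_freemem_30s < desfree
            and pageouts + pageins > maxpgio)
```

Requiring all three conditions keeps hard swapping a last resort: a transient paging burst or a momentary dip in free memory alone never triggers it.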