Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000
Emery Berger, Kathryn McKinley*, Robert Blumofe, Paul Wilson
Hoard: A Scalable Memory Allocator for Multithreaded Applications
Department of Computer Sciences
*Department of Computer Science
Motivation
Parallel multithreaded programs are becoming prevalent
  - web servers, search engines, database managers, etc.
  - run on SMPs for high performance
  - often embarrassingly parallel
Memory allocation is a bottleneck
  - prevents scaling with the number of processors
Assessment Criteria for Multiprocessor Allocators
Speed
  - competitive with uniprocessor allocators on one processor
Scalability
  - performance linear with the number of processors
Fragmentation (= max allocated / max in use)
  - competitive with uniprocessor allocators
  - worst-case and average-case
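The fragmentation ratio defined above can be computed from a trace of allocator snapshots. The following is a minimal sketch: the `Sample` record and the `fragmentation` function are hypothetical names introduced for illustration, not part of Hoard or of any benchmark harness.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical trace record: one snapshot of the allocator's state.
struct Sample {
    std::size_t heldFromSystem;  // bytes the allocator has obtained from the OS
    std::size_t inUse;           // bytes the program currently has allocated
};

// Fragmentation = max allocated / max in use, each maximum taken over the run.
double fragmentation(const std::vector<Sample>& trace) {
    std::size_t maxHeld = 0;
    std::size_t maxInUse = 0;
    for (const Sample& s : trace) {
        maxHeld = std::max(maxHeld, s.heldFromSystem);
        maxInUse = std::max(maxInUse, s.inUse);
    }
    if (maxInUse == 0) return 0.0;  // empty trace: avoid dividing by zero
    return static_cast<double>(maxHeld) / static_cast<double>(maxInUse);
}
```

Note that the two maxima need not occur at the same point in the run; a ratio of 1.0 means the allocator never held more memory than the program's peak demand.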
Uniprocessor Allocators on Multiprocessors
Fragmentation: Excellent
  - Very low for most programs [Wilson & Johnstone]
Speed & Scalability: Poor
  - Heap contention: a single lock protects the heap
  - Can exacerbate false sharing: different processors can share cache lines
Allocator-Induced False Sharing
Allocators cause false sharing!
  - Cache lines can end up spread across a number of processors
  - Practically all allocators do this
[Figure: processor 1 calls x1 = malloc(s) and processor 2 calls x2 = malloc(s); x1 and x2 fall in the same cache line, which thrashes between the two processors.]
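The effect can be illustrated with a toy address model. The sketch below is not a real allocator: `BumpHeap` is a hypothetical serial heap that hands out consecutive addresses, and the 64-byte cache line is an assumption (actual line sizes vary by machine).

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::uintptr_t kCacheLine = 64;  // assumed cache-line size in bytes

// True when two addresses fall in the same cache line.
bool sameCacheLine(std::uintptr_t a, std::uintptr_t b) {
    return a / kCacheLine == b / kCacheLine;
}

// Hypothetical serial heap: each request is carved from the next free address.
struct BumpHeap {
    std::uintptr_t next;
    std::uintptr_t alloc(std::size_t s) {
        std::uintptr_t p = next;
        next += s;
        return p;
    }
};

// Two processors allocate s bytes each, back to back, from one shared heap.
bool consecutiveAllocationsShareLine(std::size_t s) {
    BumpHeap heap{0x1000};            // cache-line-aligned starting address
    std::uintptr_t x1 = heap.alloc(s);  // requested by processor 1
    std::uintptr_t x2 = heap.alloc(s);  // requested by processor 2
    return sameCacheLine(x1, x2);
}
```

In this model two 8-byte objects end up in one line, so writes by either processor invalidate the other's cached copy even though the objects are logically unrelated.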
Existing Multiprocessor Allocators
Speed:
  - One concurrent heap (e.g., concurrent B-tree): too expensive
    - too many locks/atomic updates
    - O(log n) cost per memory operation
  - Fast allocators use multiple heaps
Scalability:
  - Allocator-induced false sharing and other bottlenecks
Fragmentation:
  - P-fold increase or even unbounded
Multiprocessor Allocator I: Pure Private Heaps
Pure private heaps: one heap per processor.
  - malloc gets memory from the processor's heap or the system
  - free puts memory on the processor's heap
Avoids heap contention
Examples: STL, ad hoc (e.g., Cilk 4.1)
[Figure: processor 1 and processor 2 issue x1 = malloc(s), x2 = malloc(s), x3 = malloc(s), x4 = malloc(s), free(x1), free(x2). Legend: allocated by heap 1; free, on heap 2.]
How to Break Pure Private Heaps: Fragmentation
Pure private heaps: memory consumption can grow without bound!
Producer-consumer: processor 1 allocates, processor 2 frees
[Figure: processor 1 calls x1 = malloc(s), x2 = malloc(s), x3 = malloc(s); processor 2 calls free(x1), free(x2), free(x3).]
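The unbounded growth can be reproduced with a toy model. In the sketch below each heap is reduced to a count of free blocks of one size; `PurePrivateHeaps` and `producerConsumerFootprint` are hypothetical names for illustration, and processors 1 and 2 from the slide are indexed 0 and 1.

```cpp
#include <cstddef>
#include <vector>

// Toy model of pure private heaps: each heap is just a count of free blocks.
struct PurePrivateHeaps {
    std::vector<std::size_t> freeBlocks;  // free blocks on each processor's heap
    std::size_t fromSystem = 0;           // blocks ever requested from the system

    explicit PurePrivateHeaps(std::size_t processors) : freeBlocks(processors, 0) {}

    // malloc: reuse a block from the caller's own heap, else go to the system.
    void mallocOn(std::size_t p) {
        if (freeBlocks[p] > 0) --freeBlocks[p];
        else ++fromSystem;
    }

    // free: the block lands on the heap of the processor that calls free.
    void freeOn(std::size_t p) { ++freeBlocks[p]; }
};

// Producer (processor 0) allocates one block; consumer (processor 1) frees it.
std::size_t producerConsumerFootprint(std::size_t iterations) {
    PurePrivateHeaps heaps(2);
    for (std::size_t i = 0; i < iterations; ++i) {
        heaps.mallocOn(0);
        heaps.freeOn(1);
    }
    return heaps.fromSystem;
}
```

At most one block is live at any time, yet the footprint equals the iteration count: every freed block accumulates on the consumer's heap, where the producer can never reuse it.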
Multiprocessor Allocator II: Private Heaps with Ownership
Private heaps with ownership: free puts memory back on the originating processor's heap.
Avoids unbounded memory consumption
Examples: ptmalloc [Gloger], LKmalloc [Larson & Krishnan]
[Figure: processor 1 and processor 2 issue x1 = malloc(s), x2 = malloc(s), free(x1), free(x2).]
How to Break Private Heaps with Ownership: Fragmentation
Private heaps with ownership: memory consumption can blow up by a factor of P.
Round-robin producer-consumer: processor i allocates, processor i+1 frees
This really happens (NDS).
[Figure: processors 1, 2, and 3 issue x1 = malloc(s), x2 = malloc(s), x3 = malloc(s) and free(x1), free(x2), free(x3).]
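The P-fold blowup can be seen in a similar toy model. The sketch assumes each round has one processor allocate K same-sized blocks, all of which are freed by its neighbor before the next round starts; `roundRobinFootprint` is a hypothetical name, and K and the round structure are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// Blocks requested from the system under private heaps with ownership.
std::size_t roundRobinFootprint(std::size_t P, std::size_t K, std::size_t rounds) {
    std::vector<std::size_t> freeBlocks(P, 0);  // free blocks on each heap
    std::size_t fromSystem = 0;
    for (std::size_t r = 0; r < rounds; ++r) {
        std::size_t producer = r % P;
        // The producer allocates K blocks from its own heap, else from the system.
        for (std::size_t k = 0; k < K; ++k) {
            if (freeBlocks[producer] > 0) --freeBlocks[producer];
            else ++fromSystem;
        }
        // Processor (producer + 1) % P frees them; ownership sends every block
        // back to the producer's heap, so the freeing processor gains nothing.
        freeBlocks[producer] += K;
    }
    return fromSystem;
}
```

Only K blocks are ever live at once, but each of the P heaps ends up holding its own K blocks, so the footprint settles at P * K. Unlike pure private heaps it stops growing after the first P rounds.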
So What Do We Do Now?
The Hoard Multiprocessor Memory Allocator
Manages memory in page-sized superblocks of same-sized objects
  - Avoids false sharing by not carving up cache lines
  - Avoids heap contention: local heaps allocate & free small blocks from their set of superblocks
Adds a global heap that is a repository of superblocks
When the fraction of free memory exceeds the empty fraction, moves superblocks to the global heap
  - Avoids blowup in memory consumption
Hoard Example
Hoard: one heap per processor + a global heap
  - malloc gets memory from a superblock on its heap.
  - free returns memory to its superblock. If the heap is “too empty”, it moves a superblock to the global heap.
[Figure: processor 1's heap and the global heap, with empty fraction = 1/3: x1 = malloc(s), …some mallocs, …some frees, free(x7).]
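The malloc and free rules above can be sketched as a small bookkeeping model. This is a simplified illustration under stated assumptions, not Hoard's implementation: it tracks only object counts for a single size class, uses a hypothetical 4 objects per superblock, applies only the empty-fraction test as worded on the slide, and all names (`HoardModel`, `mallocOn`, `freeTo`, `hoardRoundRobinSuperblocks`) are invented for the example.

```cpp
#include <cstddef>
#include <vector>

constexpr int kGlobalHeap = -1;  // owner id of the global heap (hypothetical encoding)
constexpr int kSlots = 4;        // objects per superblock (hypothetical)

struct Superblock {
    int owner;  // processor heap id, or kGlobalHeap
    int used;   // objects currently allocated from this superblock
};

struct HoardModel {
    double emptyFraction;
    std::vector<Superblock> superblocks;  // every superblock obtained from the system

    // Allocate one object on `heap`; returns the id of the superblock used.
    int mallocOn(int heap) {
        int n = static_cast<int>(superblocks.size());
        int best = -1;
        // Prefer the fullest superblock on this heap that still has room.
        for (int i = 0; i < n; ++i) {
            const Superblock& sb = superblocks[i];
            if (sb.owner == heap && sb.used < kSlots &&
                (best < 0 || sb.used > superblocks[best].used)) best = i;
        }
        // Otherwise take a superblock with room from the global heap.
        for (int i = 0; best < 0 && i < n; ++i) {
            if (superblocks[i].owner == kGlobalHeap && superblocks[i].used < kSlots) best = i;
        }
        // Otherwise obtain a fresh superblock from the system.
        if (best < 0) {
            superblocks.push_back({heap, 0});
            best = n;
        }
        superblocks[best].owner = heap;
        ++superblocks[best].used;
        return best;
    }

    // Free one object back to superblock `id`, whichever heap owns it now.
    void freeTo(int id) {
        --superblocks[id].used;
        int heap = superblocks[id].owner;
        if (heap == kGlobalHeap) return;
        int n = static_cast<int>(superblocks.size());
        int used = 0, held = 0, emptiest = -1;
        for (int i = 0; i < n; ++i) {
            if (superblocks[i].owner != heap) continue;
            used += superblocks[i].used;
            held += kSlots;
            if (emptiest < 0 || superblocks[i].used < superblocks[emptiest].used) emptiest = i;
        }
        // Heap is "too empty": hand its emptiest superblock to the global heap.
        if (held - used > emptyFraction * held) superblocks[emptiest].owner = kGlobalHeap;
    }
};

// Round-robin workload: each round one heap allocates K objects, then all are freed.
std::size_t hoardRoundRobinSuperblocks(int P, int K, int rounds) {
    HoardModel model{1.0 / 3.0, {}};
    for (int r = 0; r < rounds; ++r) {
        std::vector<int> ids;
        for (int k = 0; k < K; ++k) ids.push_back(model.mallocOn(r % P));
        for (int id : ids) model.freeTo(id);
    }
    return model.superblocks.size();
}
```

In this model the round-robin workload that defeats private heaps with ownership needs only enough superblocks for K objects, independent of P, because emptied superblocks return to the global heap and are picked up by the next processor.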
Summary of Analytical Results
Worst-case memory consumption: O(n log(M/m) + P) [instead of O(P n log(M/m))]
  - n = memory required
  - M = biggest object size
  - m = smallest object size
  - P = number of processors
Best possible: O(n log(M/m)) [Robson]
Provably low synchronization in most cases
Experiments
Run on a dedicated 14-processor Sun Enterprise
  - 300 MHz UltraSparc, 1 GB of RAM
  - Solaris 2.7
  - All programs compiled with g++ version 2.95.1
Allocators:
  - Hoard version 2.0.2
  - Solaris (system allocator)
  - ptmalloc (GNU libc – private heaps with ownership)
  - mtmalloc (Sun’s “MT-hot” allocator)
Performance: threadtest
speedup(x,P) = runtime(Solaris allocator, one processor) / runtime(x on P processors)
Performance: Larson
Server-style benchmark with sharing
Performance: false sharing
Each thread reads & writes heap data
Fragmentation Results
On most standard uniprocessor benchmarks, Hoard’s fragmentation was low:
  - p2c (Pascal-to-C): 1.20
  - espresso: 1.47
  - LRUsim: 1.05
  - Ghostscript: 1.15
  - Within 20% of Lea’s allocator
On the multiprocessor benchmarks and other codes:
  - Fragmentation was between 1.02 and 1.24 for all but one anomalous benchmark (shbench: 3.17).
Hoard Conclusions
Speed: Excellent
  - As fast as a uniprocessor allocator on one processor
  - amortized O(1) cost
  - 1 lock for malloc, 2 for free
Scalability: Excellent
  - Scales linearly with the number of processors
  - Avoids false sharing
Fragmentation: Very good
  - Worst-case is provably close to ideal
  - Actual observed fragmentation is low
Hoard Heap Details
“Segregated size class” allocator
  - Size classes are logarithmically-spaced
  - Superblocks hold objects of one size class
  - empty superblocks are “recycled”
Approximately radix-sorted:
  - Allocate from mostly-full superblocks
  - Fast removal of mostly-empty superblocks
[Figure: size class bins (8, 16, 24, 32, 40, 48), each pointing to radix-sorted superblock lists (emptiest to fullest) of superblocks.]
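The two lookups this layout implies, picking a size class for a request and picking a fullness list for a superblock, might look like the sketch below. Both functions are hypothetical: the power-of-two spacing starting at 8 bytes is one possible logarithmic spacing chosen for illustration (the slide does not give the base), and the number of fullness groups is an assumed constant.

```cpp
#include <cstddef>

constexpr std::size_t kSmallest = 8;  // assumed smallest size class, in bytes
constexpr std::size_t kGroups = 4;    // assumed number of partial-fullness groups

// Index of the smallest size class that fits `bytes`, assuming classes of
// 8, 16, 32, 64, ... bytes (power-of-two spacing is an assumption).
std::size_t sizeClass(std::size_t bytes) {
    std::size_t cls = 0;
    std::size_t classSize = kSmallest;
    while (classSize < bytes) {
        classSize *= 2;
        ++cls;
    }
    return cls;
}

// Fullness group of a superblock: 0 is emptiest, kGroups means completely full.
// Moving a superblock between groups after a malloc or free keeps the lists
// approximately sorted without ever running a full sort.
std::size_t fullnessGroup(std::size_t used, std::size_t capacity) {
    if (used >= capacity) return kGroups;
    return used * kGroups / capacity;
}
```

With this grouping, allocation can scan from the fullest non-full group downward, and the superblock to release is found at the front of group 0.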