A Mostly Non-Copying Real-Time Collector with Low Overhead and Consistent Utilization
David Bacon, Perry Cheng (presenting), V.T. Rajan
IBM T.J. Watson Research
Roadmap
- What is Real-time Garbage Collection? Pause time, CPU utilization (MMU), and space usage
- Heap Architecture: types of fragmentation, incremental compaction, read barriers, barrier performance
- Scheduling: Time-Based vs. Work-Based
- Empirical Results: pause time distribution, minimum mutator utilization (MMU), pause times
- Summary and Conclusion
Problem Domain
- Real-time embedded systems: memory usage is important
- Uniprocessor
3 Styles of Uniprocessor Garbage Collection: Stop-the-World vs. Incremental vs. Real-Time
[Diagram: timeline of mutator and collector activity for STW, Inc, and RT]
Pause Times (Average and Maximum)
[Chart: pause-time timelines. STW: pauses around 1.5-1.7 s (avg ~1.6 s); Inc: pauses of 0.3-0.9 s (avg ~0.5 s); RT: pauses of 0.15-0.19 s (avg ~0.18 s)]
Coarse-Grained Utilization vs. Time
[Chart: utilization (%) vs. time (s) over 0-8 s for STW, Inc, and RT, measured with a 2.0 s window]
Fine-Grained Utilization vs. Time
[Chart: utilization vs. time (s) over 0-8 s for STW, Inc, and RT, measured with a 0.4 s window]
Minimum Mutator Utilization (MMU)
[Chart: MMU (%) vs. window size (s, logarithmic scale) for STW, Inc, and RT]
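MMU can be computed from a trace of collector pauses. Here is a minimal sketch (the class and trace format are illustrative, not from the talk): for a window size w, MMU(w) is the smallest fraction of any length-w window during which the mutator, rather than the collector, was running. Since total pause time in a sliding window is piecewise linear in the window's start, the maximum occurs where a window edge aligns with a pause edge, so checking those candidate starts suffices.

```java
// Sketch: computing minimum mutator utilization from a pause trace.
// pauses[i] = {start, end} of a collector pause, in seconds.
public class Mmu {
    // Collector time that pause [s, e) contributes to window [t, t + w).
    static double overlap(double s, double e, double t, double w) {
        return Math.max(0.0, Math.min(e, t + w) - Math.max(s, t));
    }

    public static double mmu(double[][] pauses, double w, double traceEnd) {
        double maxPause = 0.0;
        for (double[] p : pauses) {
            // Candidate window starts: each pause edge, and each pause edge
            // minus w (so the window's far edge aligns with the pause edge).
            double[] candidates = {p[0], p[1], p[0] - w, p[1] - w};
            for (double t : candidates) {
                t = Math.max(0.0, Math.min(t, traceEnd - w));
                double inWindow = 0.0;
                for (double[] q : pauses) {
                    inWindow += overlap(q[0], q[1], t, w);
                }
                maxPause = Math.max(maxPause, inWindow);
            }
        }
        return 1.0 - maxPause / w;
    }
}
```

For example, a single 0.2 s pause with a 0.4 s window gives MMU = 0.5: the worst window spends half its time in the collector.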
Space Usage over Time
[Chart: used space (MB) vs. time (s) over 0-8 s for STW, Inc, and RT, with reference lines at max live, the collection trigger, and 2 X max live]
Problems with Existing RT Collectors

Non-moving Collector
[Chart: space (MB) vs. time (s), with reference lines at max live, 2 X, 3 X, and 4 X max live]
Replicating Collector
[Charts: MMU (%) vs. time (s), and space (MB) vs. time (s) with reference lines at max live, 2 X, 3 X, and 4 X max live]
- Not fully incremental
- Tight coupling
- Work-based scheduling
Our Collector: Goals and Results
- Real-Time: ~10 ms
- Low Space Overhead: ~2X
- Good Utilization during GC: ~40%

Solution
- Incremental mark-sweep collector
- Write barrier: snapshot-at-the-beginning [Yuasa]
- Segregated free-list heap architecture
- Read barrier to support defragmentation [Brooks]
- Incremental defragmentation
- Segmented arrays to bound fragmentation
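The snapshot-at-the-beginning write barrier cited above can be sketched in a few lines (names here are hypothetical, and a real collector would operate on raw heap slots rather than a single `next` field): before a reference field is overwritten while a collection is in progress, the old value is recorded, so every object reachable at the start of the collection still gets traced.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a Yuasa-style snapshot-at-the-beginning write barrier.
public class YuasaBarrier {
    public static class Node { public Node next; }

    public static boolean collecting = false;
    public static final Deque<Node> markStack = new ArrayDeque<>();

    // Barrier applied to every reference store, e.g. "x.next = y".
    public static void writeNext(Node x, Node y) {
        if (collecting && x.next != null) {
            markStack.push(x.next); // remember the overwritten snapshot value
        }
        x.next = y;
    }
}
```

The cost is paid only on reference stores during a collection, which is why the talk pairs this cheap write barrier with a read barrier only where defragmentation demands it.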
Fragmentation and Compaction
- Intuitively: available but unusable memory
- Avoidance and coalescing: no guarantees
- Compaction
[Diagram: used, needed, and free memory]
Heap Architecture: Segregated Free Lists
- Heap divided into pages
- Each page has equally-sized blocks (1 object per block)
- Large arrays are segmented
[Diagram: pages of used and free blocks (e.g. sizes 24 and 32), illustrating external, internal, and page-internal fragmentation]
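Segmented ("arraylet") access can be sketched as follows; the class name and the segment size of 256 elements are assumptions for illustration, not the talk's parameters. A large array is split into fixed-size segments reached through a spine of segment pointers, so no large contiguous region is ever needed and array space cannot become externally fragmented.

```java
// Sketch of a segmented array: index the spine, then the segment.
public class SegmentedArray {
    static final int SEG = 256;        // elements per segment (assumed)
    public final int[][] spine;        // pointers to fixed-size segments
    public final int length;

    public SegmentedArray(int length) {
        this.length = length;
        spine = new int[(length + SEG - 1) / SEG][];
        for (int i = 0; i < spine.length; i++) {
            // Last segment may be shorter than SEG.
            spine[i] = new int[Math.min(SEG, length - i * SEG)];
        }
    }

    public int get(int i)         { return spine[i / SEG][i % SEG]; }
    public void set(int i, int v) { spine[i / SEG][i % SEG] = v; }
}
```

The extra indirection on every access is the "segmented array cost" the talk's future-work slide aims to reduce below ~2%.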
Controlling Internal and Page-Internal Fragmentation
- Choose the page size Π and the block sizes s_k
- If s_k = s_{k-1}(1 + ρ), internal fragmentation is bounded by ρ
- Page-internal fragmentation is bounded by s_max / Π
- E.g. if Π = 16 KB, ρ = 1/8, and s_max = 2 KB, maximum non-external fragmentation is bounded to 12.5%
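A minimal sketch of the geometric size-class construction (illustrative names, not the talk's exact code): growing each block size by a factor (1 + ρ) means that rounding an object up to the next class can never waste more than a fraction ρ of its block, which is exactly the internal-fragmentation bound above.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: size classes s_k = ceil(s_{k-1} * (1 + rho)).
public class SizeClasses {
    public static int[] classes(int sMin, int sMax, double rho) {
        List<Integer> sizes = new ArrayList<>();
        int s = sMin;
        while (s < sMax) {
            sizes.add(s);
            // Round up, and always advance by at least one byte.
            s = Math.max(s + 1, (int) Math.ceil(s * (1.0 + rho)));
        }
        sizes.add(sMax);
        int[] out = new int[sizes.size()];
        for (int i = 0; i < out.length; i++) out[i] = sizes.get(i);
        return out;
    }

    // Worst case: an object one byte larger than a class gets the next
    // class; the wasted fraction of that block never exceeds rho.
    public static double worstInternalFrag(int[] sizes) {
        double worst = 0.0;
        for (int i = 1; i < sizes.length; i++) {
            worst = Math.max(worst, 1.0 - (sizes[i - 1] + 1.0) / sizes[i]);
        }
        return worst;
    }
}
```

With ρ = 1/8 and a 2 KB maximum block size, the computed worst case stays under the 12.5% bound quoted on the slide.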
Fragmentation: small heap (ρ = 1/8 vs. ρ = 1/2)
[Chart: per-benchmark (compress, db, jack, javac, jess, mpegaudio, mtrt) breakdown of heap space into Live, Recently Dead, Internal, Page-Internal, and External fragmentation, for ρ = 1/8 and ρ = 1/2]
Incremental Compaction
- Compact only a part of the heap
- Requires knowing what to compact ahead of time
- Key problems: popular objects; determining references to moved objects
Incremental Compaction: Redirection
- Access all objects via per-object redirection pointers
- Redirection is initially self-referential
- Move an object by updating ONE redirection pointer
[Diagram: original and replica objects linked by the redirection pointer]
Consistency via Read Barrier [Brooks]
- Correctness requires always using the replica
- E.g. field selection must be modified:
  - normal access: x[offset]
  - read barrier access: x[redirect][offset]
Some Important Details
- Our read barrier is decoupled from collection
- Complication: in Java, any reference might be null, so the actual read barrier for GetField(x, offset) must be augmented:
  tmp = x[offset]; return (tmp == null) ? null : tmp[redirect]
- Optimizations: CSE, code motion (LICM and sinking), null-check combining
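The redirection scheme and the null-checked barrier can be sketched together (class and field names here are hypothetical, and a real VM would emit the barrier inline rather than as method calls): every object carries a `redirect` reference that initially points to the object itself; the collector moves an object by copying it and flipping that ONE pointer, and every access follows it first.

```java
// Sketch of a Brooks-style forwarding pointer with an eager read barrier.
public class Brooks {
    public static class Obj {
        public Obj redirect = this; // self-referential until moved
        public int value;
        public Obj ref;             // a single reference field, for brevity
    }

    // Barriers: always follow the redirection pointer first.
    public static int readValue(Obj x)          { return x.redirect.value; }
    public static void writeValue(Obj x, int v) { x.redirect.value = v; }

    // Eager variant of the reference-load barrier: the loaded value may be
    // null, so it must be checked before it is redirected (cf. the GetField
    // barrier on the slide).
    public static Obj readRef(Obj x) {
        Obj tmp = x.redirect.ref;
        return (tmp == null) ? null : tmp.redirect;
    }

    // The collector evacuates with a copy plus one pointer flip.
    public static Obj evacuate(Obj original) {
        Obj replica = new Obj();
        replica.value = original.value;
        replica.ref = original.ref;
        original.redirect = replica; // all future accesses reach the replica
        return replica;
    }
}
```

This is the eager variant from the next slide: loads return already-redirected references, which gives the optimizer more freedom; the lazy variant would defer following `redirect` until use.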
Barrier Variants: when to redirect
- Lazy: easier for the collector
- Eager: better for optimization
Barrier Overhead to Mutator
- Conventional wisdom says read barriers are too expensive
- Studies found overheads of 20-40% (Zorn, Nielsen)
- Our barrier has 4-6% overhead with optimizations
[Chart: read-barrier overhead to the mutator (%) for each benchmark (compress, jess, db, javac, mpegaudio, mtrt, jack) and the geometric mean, comparing the Lazy and Eager variants]
[Diagram sequence: heap (one block size only) and stack through the collector's phases]
1. Program start: heap entirely free
2. Program is allocating: free and allocated blocks
3. GC starts: allocated blocks become unmarked
4. Program allocating and GC marking: free, unmarked, and marked-or-allocated blocks
5. Sweeping away blocks: unmarked blocks are reclaimed
6. GC moving objects and installing redirection: free, allocated, and evacuated blocks
7. 2nd GC starts tracing and redirection fixup: free, unmarked, evacuated, and marked-or-allocated blocks
8. 2nd GC complete: only free and allocated blocks remain
Scheduling the Collector
Scheduling issues:
- Bad scheduling leads to poor CPU utilization and space usage
- Loose coupling between program and collector
Two policies:
- Time-Based: trigger the collector to run for C_T seconds whenever the program runs for Q_T seconds
- Work-Based: trigger the collector to do C_W work whenever the program allocates Q_W bytes
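The two trigger policies can be stated as formulas (the class and method names below are illustrative). Time-based scheduling hands out collector time per unit of mutator time, so mutator utilization over each quantum is Q_T / (Q_T + C_T) regardless of the allocation rate; work-based scheduling hands out collector work per byte allocated, so a fast-allocating phase gets heavily throttled.

```java
// Sketch: how much collector time/work each policy has accumulated.
public class GcSchedule {
    // Time-based: C_T seconds of collection per Q_T seconds of mutator time.
    public static double collectorTimeOwed(double mutatorSeconds, double qT, double cT) {
        return Math.floor(mutatorSeconds / qT) * cT;
    }

    // Guaranteed mutator utilization of the time-based policy per quantum.
    public static double timeBasedUtilization(double qT, double cT) {
        return qT / (qT + cT);
    }

    // Work-based: C_W units of collector work per Q_W bytes allocated.
    public static double collectorWorkOwed(long bytesAllocated, long qW, double cW) {
        return (bytesAllocated / qW) * cW;
    }
}
```

The next two slides show the consequence: time-based scheduling makes utilization predictable but lets space vary with allocation behavior, while work-based scheduling does the reverse.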
Time-Based Scheduling
Trigger the collector to run for C_T seconds whenever the program runs for Q_T seconds
[Charts: space (MB) vs. time (s) varies with the allocation pattern (Smooth, Uneven, High); MMU (CPU utilization) vs. window size is the same for any allocation pattern]
Work-Based Scheduling
Trigger the collector to collect C_W bytes whenever the program allocates Q_W bytes
[Charts: MMU (CPU utilization) vs. window size varies with the allocation pattern (Smooth, Uneven, High); space (MB) vs. time (s) is the same for any allocation pattern]
Pause Time Distribution for javac (Time-Based vs. Work-Based)
[Charts: pause-time histograms for both policies, each annotated at 12 ms]
Utilization vs. Time for javac (Time-Based vs. Work-Based)
[Charts: utilization (%) vs. time (s) for both policies; the time-based policy stays above 0.45]
Minimum Mutator Utilization for javac (Time-Based vs. Work-Based)

Space Usage for javac (Time-Based vs. Work-Based)
Intrinsic Tradeoff
3 inter-related factors:
- Space bound (tradeoff)
- Utilization (tradeoff)
- Allocation rate (lower is better)
Other factors:
- Collection rate (higher is better)
- Pointer density (lower is better)
Summary: Mostly Non-moving RT GC
- Read barriers permit incremental defragmentation; overhead is 4-6% with compiler optimizations
- Low space overhead: space usage is only about 2 X max live data, and fragmentation is still bounded
- Consistent utilization: always at least 45% at 12 ms resolution
Conclusions
- Real-time GC is real
- There are tradeoffs, just like in traditional GC
- Scheduling should be primarily time-based, falling back to work-based when the user's parameter estimates are wrong
- Incremental defragmentation is possible
- Compiler support is important!
Future Work
- Lowering the real-time resolution: sub-millisecond worst-case pauses; the main issue is breaking up the stack scan
- Segmented array optimizations: reduce the segmented-array cost to below ~2%; opportunistic contiguous layout; type-based specialization with invalidation; strip-mining