Upload
siran
View
33
Download
0
Embed Size (px)
DESCRIPTION
Eliminating Read Barriers through Procrastination and Cleanliness. KC Sivaramakrishnan Lukasz Ziarek Suresh Jagannathan. Big Picture. Lightweight user-level threads. Lots of concurrency!. Scheduler 1. t1. t2. tn. Core 1. Core 2. Core n. Heap. Big Picture. Big Picture. - PowerPoint PPT Presentation
Citation preview
Eliminating Read Barriers through Procrastination and Cleanliness
KC SivaramakrishnanLukasz Ziarek
Suresh Jagannathan
2
Big PictureLightweight user-level threads
Scheduler 1t1 t2 tn Lots of
concurrency!
Core 1 Core nCore 2
Heap
3
Big Picture
Expendable resource?
Big Picture
3
Scheduler 1t1 t2 tn Lots of
concurrency!
Heap
4
Big Picture
Expendable resource?
Big Picture
4
Scheduler 1
t1
t2 tn Lots of concurrency!
Heap
Exploit program concurrency to
eliminate read barriers from thread-local collectors
GC Operation
Alleviate MM cost?
5
MultiMLton• Goals– Safety, Scalability, ready for future manycore processors
• Parallel extension of MLton SML compiler and runtime
• Parallel extension of Concurrent ML– Lots of Concurrency!– Interact by sending messages over first-class channels
C
send (c, v)
v recv (c)
6
MultiMLton GC: Considerations• Standard ML – functional PL with side-effects– Most objects are small and ephemeral
• Independent generational GC– # Mutations << # Reads
• Keep cost of reads to be low
• Minimize NUMA effects• Run on non-cache coherent HW
7
MultiMLton GC: Design
Core
Local Heap
Core
Local Heap
Core
Local Heap
Core
Local Heap
Shared Heap
Thread-local GC
• NUMA Awareness• Circumvent cache-coherence issues
8
Invariant Preservation• Read and write barriers for preserving
invariants
Shared Heap
r
Local Heap
x
Target
Source
Exporting writes
r := x
Shared Heap
r
Local Heap
x
FWD
Transitive closure of x
Mutator needs read
barriers!
9
Challenge• Object reads are pervasive– RB overhead cost (RB) * frequency (RB)∝
• Read barrier optimization– Stacks and Registers never point to forwarded objects
20.1 %15.3 %21.3 %
Mean Overhead----------------------
Read
bar
rier o
verh
ead
(%)
10
Mutator and Forwarded Objects
# RB invocations
# Encountered forwarded objects
< 0.00001
Eliminate read barriers altogether
11
RB Elimination• Visibility Invariant– Mutator does not encounter forwarded objects
• Observation– No forwarded objects created visibility invariant ⇒
No read barriers⇒• Exploit concurrency Procrastination!
12
ProcrastinationShared Heap
r1
Local Heap
x1
T1 T2
r2
x2
r1 := x1 r2 := x2
T T is running
T T is suspended
T T is blocked
13
ProcrastinationShared Heap
r1
Local Heap
x1
r1 := x1
T1 T2
r2 := x2 r2
x2
Delayed write list
Control switches
to T2
T T is running
T T is suspended
T T is blocked
14
ProcrastinationShared Heap
r1
Local Heap
x1
T1 T2
r2
x2
Delayed write list
r1 := x1 r2 := x2
T T is running
T T is suspended
T T is blocked
15
ProcrastinationShared Heap
r1
Local Heap
T1 T2
r2 x2
Delayed write list
r1 := x1 r2 := x2x1
T T is running
T T is suspended
T T is blocked
FWD FWD
16
ProcrastinationShared Heap
r1
Local Heap
T1 T2
r2 x2
Delayed write list
x1
Force local GC
T T is running
T T is suspended
T T is blocked
r1 := x1 r2 := x2
17
Correctness• Does Procrastination introduce deadlocks?– Threads can be procrastinated while holding a lock!
T1 T2T2T T is running
T T is suspended
T T is blocked
18
Correctness
T1
• Is Procrastination safe?– Yes. Forcing a local GC unblocks the threads.– No deadlocks or livelocks!
T2T T is running
T T is suspended
T T is blocked
• Does Procrastination introduce deadlocks?– Threads can be procrastinated while holding a lock!
19
Correctness
T1 T2
• Does Procrastination introduce deadlocks?– Threads can be procrastinated while holding a lock!
T T is running
T T is suspended
T T is blocked• Is Procrastination safe?– Yes. Forcing a local GC unblocks the threads.– No deadlocks or livelocks!
20
• Efficacy (Procrastination) ∝ # Available runnable threads
Is Procrastination alone enough?
M
W1W1 W1
F
J
Serial (low thread availability)
Concurrent (high thread availability)
• With Procrastination, half of local major GCs were forced
Eager exporting writes while preserving visibility invariant
21
Cleanliness• A clean object closure can be lifted to the
shared heap without breaking the visibility invariant
r := xinSharedHeap (r)
inLocalHeap (x)&&
isClean (x)
Eager write (no Procrastination)
22
Cleanliness: Intuition
Shared Heap
Local Heap
x
lift (x) to shared heap
23
Shared Heap
Local Heap
Cleanliness: Intuition
x
FWD
find all references to FWD
24
Shared Heap
Local Heap
Cleanliness: Intuition
x Need to scan the entire local heap
25
Local Heap
h
Shared Heap
Cleanliness: Simpler question
x
FWD
Do all references originate from heap region h?
sizeof (h) << sizeof (local heap)
26
Local Heap
h
Shared Heap
Cleanliness: Simpler question
x Only scan theheap region h.
Heap session!
sizeof (h) << sizeof (local heap)
27
• Current session closed & new session opened– After an exporting write, a user-level context switch, a local
GC
Heap Sessions• Source of an exporting write is often– Young– rarely referenced from outside the closure
Previous Session Current Session FreeLocal Heap
SessionStart Frontier
Young Objects
Old Objects Start
28
• Current session closed & new session opened– After an exporting write, a user-level context switch, a local
GC– SessionStart is moved to Frontier
Heap Sessions• Source of an exporting write is often– Young– rarely referenced from outside the closure
• Average session size < 4KB
Previous Session FreeLocal Heap
Frontier & SessionStartStart
29
Cleanliness: Eager exporting writes• A clean object closure
– is fully contained within the current session– has no references from previous session
Previous SessionCurrent Session
FreeLocal Heap
X
Y Z
r := x
rShared Heap
30
Cleanliness: Eager exporting writes• A clean object closure
– is fully contained within the current session– has no references from previous session
Previous SessionCurrent Session
FreeLocal Heap
X
Y Z
r := x
rShared Heap
Walk and fix
FWD
31
Avoid tracing current session?• Many SML objects are tree-structured (List, Tree, etc,.)
– Specialize for no pointers from outside the closure• ∀x’ transitive closure (x), ∊
ref_count (x) = 0 && ref_count (x’) = 1
Local Heap
x(0)
y(1)
z(1)
• Eager exporting write– No current session tracing needed!
No refs from
outside
– ref_count does not consider pointers from stack or registers
32
Cleanliness: Reference Count
Current Session
X(0)
Current Session
X(1)
Current Session
X(LM)
Current Session
X(G)
PrevSess
Zero One LocalMany Global
• Purpose– Track pointers from previous session to current session– Identify tree-structured object
• Does not track pointers from stack and registers– Reference count only triggered during object initialization and
mutation
33
Bringing it all together• ∀x’ transitive closure (x), ∊
if max (ref_count (x’))– One & ref_count (x) = 0 Clean tree-structured ⇒ ⇒
Session tracing not needed– LocalMany Clean ⇒ ⇒ Trace current session– Global 1+ pointer from previous session ⇒ ⇒
Procrastinate
34
Cleanliness: Tree-structured Closure
Previous Session Current Session
x(0)
y(1)
z(1)T1Local Heap
Shared heap
r := x
r
current stack
35
Shared heap
Cleanliness: Tree-structured Closure
Previous Session Current Session
x
y
z
T1current stackLocal Heap
r := x
FWD
r
Walk current
stack
36
Shared heap
Cleanliness: Tree-structured Closure
Previous Session Current Session
x
y
z
T1current stackLocal Heap
r := x
r
No need to walk current
session!
37
Shared heap
Cleanliness: Tree-structured Closure
Previous Session Current Session
x
y
z
T1Local Heap
r := x
r
T2 Nextstack
FWDcurrent stack
38
Shared heap
Cleanliness: Tree-structured Closure
Previous Session Current Session
x
y
z
T1previousstackLocal Heap
r := x
r
T2 currentstack
Context Switch
Walk target stack
39
Cleanliness: Object graph
Previous Session Current Session
x(0)
y(LM)
z(1)current
stackLocal Heap
Shared heap
r := x
r
a
40
Shared heap
Cleanliness: Object graph
Previous Session Current Session
x
y
z
current stackLocal Heap
r := x
r
a
FWD
FWD
Walk current
stack
Walk current session
41
Shared heap
Cleanliness: Object graph
Previous Session Current Session
x
y
z
current stackLocal Heap
r := x
r
a
Walk current
stack
Walk current session
42
Cleanliness: Global Reference
Previous Session Current Session
x(0)
y(1)
z(G)T1 current
stackLocal Heap
Shared heap
r := x
r
a
43
Cleanliness: Global Reference
Previous Session Current Session
x(0)
y(1)
z(G)T1 current
stackLocal Heap
Shared heap
r := x
r
a
Procrastinate
44
Immutable Objects• Specialize exporting writes• If immutable object in previous session– Copy to shared heap• Immutable objects in SML do not have identity
– Original object unmodified• Avoid space leaks– Treat large immutable objects as mutable
45
Cleanliness: Summary• Cleanliness allows eager exporting writes
while preserving visibility invariant• With Procrastination + Cleanliness, <1% of
local GCs were forced• Additional niceties– Completely dynamic Portable– Does not impose any restriction on the GC
strategy
46
Evaluation• Variants– RB- : TLC with Procrastination and Cleanliness – RB+ : TLC with read barriers
• Sansom’s dual-mode GC– Cheney’s 2-space copying collection Jonker’s sliding
mark-compacting– Generational, 2 generations, No aging
• Target Architectures: – 16-core AMD Opteron server (NUMA)– 48-core Intel SCC (non-cache coherent)– 864-core Azul Vega3
47
Results• Speedup: At 3X min heap size, RB- faster than
RB+– AMD 32% (2X faster than STW collector)– SCC 20%– AZUL 30%
• Concurrency– During exporting write, 8 runnable user-level
threads/core!
Cleanliness Impact• RB- MU- : RB- GC ignoring mutability for Cleanliness• RB- CL- : RB- GC ignoring Cleanliness (Only Procrastination)
48
Avg. slowdown--------------------
11.4%28.2%31.7%
49
Conclusion• Eliminate the need for read barriers by
preserving the visibility invariant– Procrastination: Exploit concurrency for delaying
exporting writes– Cleanliness: Exploit generational property for
eagerly perform exporting writes
50
Questions?
http://multimlton.cs.purdue.edu