Umbra: Efficient and Scalable Memory Shadowing
Qin Zhao (MIT), Derek Bruening (VMware), Saman Amarasinghe (MIT)
CGO 2010, Toronto, Canada, April 26, 2010
Shadow Memory
• Meta-data
  – Track properties of application memory
• Synchronized Update
  – Application data and meta-data
CGO, Toronto, Canada, 4/26/2010 2
[Figure: application memory regions (a.out, libc, heap, stack), each paired with a matching shadow memory region]
Examples
• Memory Error Detection
  – MemCheck [VEE’07]
  – Purify [USENIX’92]
  – Dr. Memory
  – MemTracker [HPCA’07]
• Dynamic Information Flow Tracking
  – LIFT [MICRO’39]
  – TaintTrace [ISCC’06]
• Multi-threaded Debugging
  – Eraser [TCS’97]
  – Helgrind
• Others
  – Redux [TCS’03]
  – Software Watchpoint [CC’08]
Issues
• Performance
  – Runtime overhead
    • Example: MemCheck 25x [VEE’07]
• Scalability
  – 64-bit architecture
• Dependence
  – OS
  – Hardware
• Development
  – Implemented with specific analysis
  – Lack of a general framework
Memory Shadowing System
• Dynamic Instrumentation
  – Context switch (application ↔ shadow)
  – Address calculation
  – Updating meta-data
• Memory Management
  – Memory allocation / free
    • Monitor application memory management
    • Manage shadow memory
  – Mapping translation scheme (addrA → addrS)
    • DMS: Direct Mapping Scheme
    • SMS: Segmented Mapping Scheme
Direct Mapping Scheme (DMS)
• Single memory region for entire address space.
• Translation: $addr_S = addr_A + disp$
• Issue: address conflict between memA and memS

lea [addr] → %r1
add %r1, disp → %r1
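The DMS translation is a single addition, which is also why conflicts are hard to avoid. The following is a minimal C sketch of that idea, not Umbra code: the constant `DMS_DISP` and the helper names `dms_translate` and `dms_conflicts` are assumptions made for illustration, since the slides give no concrete displacement.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical displacement between application and shadow memory.
 * The value is an assumption chosen only for this illustration. */
#define DMS_DISP ((uintptr_t)0x20000000u)

/* DMS translation: addrS = addrA + disp, one addition per reference. */
static inline uintptr_t dms_translate(uintptr_t addr_a)
{
    return addr_a + DMS_DISP;
}

/* Returns 1 when the shadow address of addr_a lands inside the
 * application region [app_base, app_end), i.e. memA and memS collide. */
static inline int dms_conflicts(uintptr_t addr_a,
                                uintptr_t app_base, uintptr_t app_end)
{
    uintptr_t addr_s = dms_translate(addr_a);
    return addr_s >= app_base && addr_s < app_end;
}
```

With one global displacement, any application region that happens to sit `DMS_DISP` bytes above another one produces a conflict, which is the issue the slide names.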
[Figure: DMS memory layout with regions labeled Application and Shadow]
[Figure: slowdown relative to native execution. DMS-32: 1.80, SMS-32: 2.40, SMS-64: 4.67; no value is shown for DMS-64]
Segmented Mapping Scheme (SMS)
• Shadow segment per application segment
• Translation: $addr_S = addr_A + disp_{seg}$
  – Segment lookup (address indexing)
  – Address translation

lea [addr] → %r1
mov %r1 → %r2
shr %r2, 16 → %r2
add %r1, disp[%r2] → %r1

[Figure: segment table mapping application segments App 1 and App 2 to shadow segments Shd 1 and Shd 2, translating addrA to addrS]
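The four-instruction SMS sequence above can be read as an array lookup followed by an add. The C sketch below mirrors it for a 32-bit address space, under stated assumptions: the 64 KB segment size is inferred from the `shr ..., 16` step, and the names `seg_disp`, `sms_map_segment`, and `sms_translate` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SEG_SHIFT 16                        /* 64 KB segments, from "shr %r2, 16" */
#define NUM_SEGS  (1u << (32 - SEG_SHIFT))  /* 65536 entries cover 32 bits */

/* Segment table: one displacement per application segment. */
static int64_t seg_disp[NUM_SEGS];

/* Record that the application segment starting at app_seg_base is
 * shadowed by the segment starting at shadow_seg_base. */
static void sms_map_segment(uint32_t app_seg_base, uint32_t shadow_seg_base)
{
    seg_disp[app_seg_base >> SEG_SHIFT] =
        (int64_t)shadow_seg_base - (int64_t)app_seg_base;
}

/* addrS = addrA + disp[seg], where seg = addrA >> 16. */
static uint32_t sms_translate(uint32_t addr_a)
{
    uint32_t seg = addr_a >> SEG_SHIFT;             /* segment lookup */
    return (uint32_t)((int64_t)addr_a + seg_disp[seg]); /* address translation */
}
```

Because every segment carries its own displacement, two application segments can be shadowed at unrelated locations, which removes the single-displacement conflict of DMS at the cost of an extra table load.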
Umbra
• Mapping Scheme
  – Segmented mapping
  – Scale with actual memory usage
• Implementation
  – DynamoRIO
• Optimization
  – Translation optimization
  – Instrumentation optimization
• Client API
• Experimental Results
  – Performance evaluation
  – Statistics collection
Shadow Memory Mapping
• Scaling to 64-bit Architecture
  – DMS
    • Infeasible due to memory layout
[Figure: 64-bit address space layout with user space (a.out at the bottom, stack at the top) below 2^47, unusable space above it, and kernel space with vsyscall at the top, ending at 2^64]
Shadow Memory Mapping
• Scaling to 64-bit Architecture
  – DMS
    • Infeasible due to memory layout
  – Single-Level SMS
    • Too big (~4 billion entries)
Shadow Memory Mapping
• Scaling to 64-bit Architecture
  – DMS
    • Infeasible due to memory layout
  – Single-Level SMS
    • Too big (~4 billion entries)
  – Multi-Level SMS
    • Even more expensive
    • Fast path on lower 32G (MemCheck)
[Figure: slowdown relative to native execution. DMS-32: 1.80, SMS-32: 2.40, SMS-64: 4.67; no value is shown for DMS-64]
Shadow Memory Mapping
• Scaling to 64-bit Architecture
  – DMS is infeasible
  – Single-Level SMS is too sparse
  – Multi-Level SMS is too expensive
• Umbra Solution
  – Eliminate empty entries
  – Compact table
  – Walk the table to find the entry
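A compact table that holds only the regions in use can be sketched as an array of (range, displacement) entries that is walked linearly. The C code below is an illustrative sketch under that assumption, not Umbra's actual data structure; `map_entry_t`, `map_table_t`, `table_add`, `table_translate`, and the capacity `MAX_ENTRIES` are hypothetical names and values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per application memory unit that actually exists. */
typedef struct {
    uint64_t app_base;  /* first application address in the unit */
    uint64_t app_end;   /* one past the last application address */
    uint64_t disp;      /* shadow = application + disp (modulo 2^64) */
} map_entry_t;

#define MAX_ENTRIES 64  /* assumed capacity, for illustration only */

typedef struct {
    map_entry_t entries[MAX_ENTRIES];
    size_t      count;
} map_table_t;

/* Append an entry; returns 0 on success, -1 if the table is full. */
static int table_add(map_table_t *t, uint64_t app_base, uint64_t size,
                     uint64_t shadow_base)
{
    if (t->count == MAX_ENTRIES)
        return -1;
    t->entries[t->count++] =
        (map_entry_t){ app_base, app_base + size, shadow_base - app_base };
    return 0;
}

/* Walk the table; returns 1 and writes *addr_s on a hit, 0 on a miss. */
static int table_translate(const map_table_t *t, uint64_t addr_a,
                           uint64_t *addr_s)
{
    for (size_t i = 0; i < t->count; i++) {
        const map_entry_t *e = &t->entries[i];
        if (addr_a >= e->app_base && addr_a < e->app_end) {
            *addr_s = addr_a + e->disp;
            return 1;
        }
    }
    return 0;
}
```

The table size now follows the number of mapped regions rather than the size of the address space, so a 64-bit layout costs no more entries than a 32-bit one; the price is the linear walk, which the later optimizations aim to avoid.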
Umbra
• Mapping Scheme √
  – Segmented mapping
  – Scale with actual memory usage
• Implementation
  – DynamoRIO
• Optimization
  – Translation optimization
  – Instrumentation optimization
• Client API
• Experimental Results
  – Performance evaluation
  – Statistics collection
Implementation
• Memory Manager
  – Monitor and control application memory allocation
    • brk, mmap, munmap, mremap
  – Allocate shadow memory
  – Maintain translation table
• Instrumenter
  – Instrument every memory reference
    • Context save
    • Address calculation
    • Address translation
    • Shadow memory update
    • Context restore
[Figure: application regions App 1 and App 2 with their shadow regions Shd 1 and Shd 2]
Umbra
• Mapping Scheme √
  – Segmented mapping
  – Scale with actual memory usage
• Implementation √
  – DynamoRIO
• Optimization
  – Translation optimization
  – Instrumentation optimization
• Client API
• Experimental Results
  – Performance evaluation
  – Statistics collection
Unoptimized System
• Small overhead from DynamoRIO
• Slower than SMS-64
  – Need to walk the global translation table
• Why so slow?
  – 41.79% of instructions are memory references
  – For each of these instructions
    • Full context switch
    • Table lookup
    • Call-out instrumentation
[Figure: slowdown relative to native execution per configuration; the y-axis stops at 20, so the Unoptimized bar is clipped]
SMS-64: 4.7
DynamoRIO: 1.1
Unoptimized: 100.0
Local translation cache: 15.8
Hashtable: 15.2
Memoization mini-cache: 12.0
Reference uni-cache: 8.3
Context switch reduction: 3.1
Reference grouping: 2.5
Optimization
• Translation Optimization
  – Thread-local translation cache
  – Hashtable lookup
  – Memoization mini-cache
  – Reference uni-cache
• Instrumentation Optimization
  – Context switch reduction
  – Reference grouping
  – 3-stage code layout
1. Thread-Local Translation Cache
• Local translation table per thread
  – Synchronize with global translation table when necessary
  – Avoid lock contention
  – Walk table to find matching entry
    • Walk global table if not found in thread-local cache
• Inlined instrumentation
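The benefit of the thread-local cache is that the common case never touches the shared lock. The C11 sketch below illustrates that fast path and the fallback to the global table. It is an assumption-laden illustration rather than Umbra's implementation: the spinlock built on `atomic_flag`, the counter `g_walks`, and the names `global_add` and `translate` are all hypothetical, and invalidation after an unmap (the "synchronize when necessary" step) is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t base, end, disp; } entry_t;

#define MAX_ENTRIES 32

/* Global table: shared by all threads, guarded by a spinlock. */
static entry_t     g_table[MAX_ENTRIES];
static size_t      g_count;
static atomic_flag g_lock = ATOMIC_FLAG_INIT;
static unsigned    g_walks;  /* how often the global table was walked */

/* Thread-local cache: private copies of entries this thread has used. */
static _Thread_local entry_t l_table[MAX_ENTRIES];
static _Thread_local size_t  l_count;

static const entry_t *walk(const entry_t *t, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= t[i].base && addr < t[i].end)
            return &t[i];
    return NULL;
}

static void global_add(uint64_t base, uint64_t size, uint64_t shadow_base)
{
    while (atomic_flag_test_and_set(&g_lock)) { /* spin */ }
    if (g_count < MAX_ENTRIES)
        g_table[g_count++] = (entry_t){ base, base + size, shadow_base - base };
    atomic_flag_clear(&g_lock);
}

/* Returns 1 and stores the shadow address on success, 0 if unmapped. */
static int translate(uint64_t addr_a, uint64_t *addr_s)
{
    entry_t found = { 0, 0, 0 };
    const entry_t *e = walk(l_table, l_count, addr_a);  /* lock-free path */
    if (e != NULL) {
        found = *e;
    } else {
        while (atomic_flag_test_and_set(&g_lock)) { /* spin */ }
        g_walks++;
        e = walk(g_table, g_count, addr_a);
        if (e != NULL)
            found = *e;
        atomic_flag_clear(&g_lock);
        if (e == NULL)
            return 0;
        if (l_count < MAX_ENTRIES)
            l_table[l_count++] = found;  /* copy into the local cache */
    }
    *addr_s = addr_a + found.disp;
    return 1;
}
```

After the first miss, repeated references to the same region by the same thread are served from `l_table` without acquiring `g_lock`.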
[Figure: Thread 1 and Thread 2 each keep a thread-local translation cache in front of the shared global translation table; in the slowdown chart the local translation cache bar reads 15.8]
2. Hashtable Lookup
• Hashtable per thread
• Fixed number of slots
• Hash(addrA) → entry in thread-local cache
  – If match, found
  – If no match, walk the local cache
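The hashtable step can be sketched as a fixed-size array of pointers into the thread-local cache, indexed by a hash of the application address. In the C sketch below the hash function, the slot count `HASH_SLOTS`, the 64 KB hashing granularity, and the `thread_ctx_t` layout are assumptions for illustration; the slides specify only that the slot count is fixed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t base, end, disp; } entry_t;

#define UNIT_SHIFT 16   /* assumed 64 KB hashing granularity */
#define HASH_SLOTS 256  /* fixed number of slots (assumed value) */

typedef struct {
    entry_t        cache[16];          /* thread-local translation cache */
    size_t         count;
    const entry_t *slots[HASH_SLOTS];  /* hashtable pointing into cache[] */
    unsigned       walks;              /* number of slow-path walks */
} thread_ctx_t;

static size_t hash_addr(uint64_t addr)
{
    return (size_t)((addr >> UNIT_SHIFT) % HASH_SLOTS);
}

static const entry_t *lookup(thread_ctx_t *tc, uint64_t addr)
{
    size_t h = hash_addr(addr);
    const entry_t *e = tc->slots[h];
    if (e != NULL && addr >= e->base && addr < e->end)
        return e;                             /* match: found */
    tc->walks++;
    for (size_t i = 0; i < tc->count; i++) {  /* no match: walk local cache */
        e = &tc->cache[i];
        if (addr >= e->base && addr < e->end) {
            tc->slots[h] = e;                 /* remember for next time */
            return e;
        }
    }
    return NULL;
}
```

A slot stores a pointer rather than a copy, so a colliding address simply overwrites the slot and the walk remains the source of truth.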
[Figure: each thread's hashtable sits in front of its thread-local translation cache, which in turn backs onto the global translation table; in the slowdown chart the hashtable bar reads 15.2]
3. Memoization Mini-Cache
• Four-entry table per thread
  – Stack
  – Heap
  – Application (a.out)
  – Units found in last table lookup
• If no match, hashtable lookup
  – 68.93% hit ratio
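The mini-cache can be pictured as four range checks tried before any hashing. The C sketch below follows the slot roles listed above; the slow path is abstracted behind a hypothetical callback (`slow_lookup_fn`), and `demo_slow_lookup` with its address values exists only to make the example self-contained.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t base, end, disp; } entry_t;

/* Four slots: stack, heap, application image, last unit looked up. */
enum { MC_STACK, MC_HEAP, MC_APP, MC_LAST, MC_SIZE };

typedef struct {
    entry_t  slot[MC_SIZE];
    unsigned hits, misses;
} mini_cache_t;

/* Stand-in for the hashtable lookup (hypothetical callback type). */
typedef int (*slow_lookup_fn)(uint64_t addr, entry_t *out);

/* Demo slow path: knows a single extra unit (illustrative values). */
static int demo_slow_lookup(uint64_t addr, entry_t *out)
{
    static const entry_t lib = { 0x700000, 0x710000, 0x1000000 };
    if (addr < lib.base || addr >= lib.end)
        return 0;
    *out = lib;
    return 1;
}

static int mini_translate(mini_cache_t *mc, slow_lookup_fn slow,
                          uint64_t addr_a, uint64_t *addr_s)
{
    for (int i = 0; i < MC_SIZE; i++) {
        const entry_t *e = &mc->slot[i];
        if (addr_a >= e->base && addr_a < e->end) {  /* empty slots never match */
            mc->hits++;
            *addr_s = addr_a + e->disp;
            return 1;
        }
    }
    mc->misses++;
    entry_t found;
    if (!slow(addr_a, &found))
        return 0;
    mc->slot[MC_LAST] = found;  /* memoize the unit from the last lookup */
    *addr_s = addr_a + found.disp;
    return 1;
}
```

Keeping the stack, heap, and image slots fixed means a burst of accesses to some other region can evict only the fourth slot.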
[Figure: each thread's memoization mini-cache sits in front of its hashtable, thread-local translation cache, and the global translation table; in the slowdown chart the memoization mini-cache bar reads 12.0]
4. Reference Uni-Cache
• Software uni-cache per instruction per thread
  – Last reference unit tag
  – Last translation displacement
• If no match, memoization mini-cache check
  – 99.93% hit ratio
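A uni-cache holds exactly one (tag, displacement) pair, so the check is a shift and a compare. The C sketch below assumes a 64 KB unit (`UNIT_SHIFT`), which the slide does not state, and assumes the displacement is uniform within a unit; `miss_fn` and `demo_on_miss` are hypothetical stand-ins for the mini-cache check and the slower stages behind it.

```c
#include <assert.h>
#include <stdint.h>

#define UNIT_SHIFT 16  /* assumed unit size of 64 KB */

/* One uni-cache per instrumented instruction, per thread. */
typedef struct {
    uint64_t tag;    /* unit tag of the last reference */
    uint64_t disp;   /* displacement used for the last translation */
    int      valid;
} uni_cache_t;

/* Hypothetical slow path: returns 1 and the displacement for addr_a. */
typedef int (*miss_fn)(uint64_t addr_a, uint64_t *disp);

static unsigned demo_misses;

/* Demo slow path: only unit 0x7f is mapped (illustrative values). */
static int demo_on_miss(uint64_t addr_a, uint64_t *disp)
{
    demo_misses++;
    if ((addr_a >> UNIT_SHIFT) != 0x7f)
        return 0;
    *disp = 0x10000000;
    return 1;
}

static int uni_translate(uni_cache_t *uc, miss_fn on_miss,
                         uint64_t addr_a, uint64_t *addr_s)
{
    uint64_t tag = addr_a >> UNIT_SHIFT;
    if (!uc->valid || uc->tag != tag) {  /* miss: fall back and refill */
        uint64_t disp;
        if (!on_miss(addr_a, &disp))
            return 0;
        uc->tag = tag;
        uc->disp = disp;
        uc->valid = 1;
    }
    *addr_s = addr_a + uc->disp;         /* hit path: compare, then add */
    return 1;
}
```

The high hit ratio reported on the slide is plausible for this design because a given instruction tends to keep touching the same unit from one execution to the next.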
[Figure: each application instruction gets its own reference uni-cache, in front of the per-thread memoization mini-cache, hashtable, thread-local translation cache, and the global translation table; in the slowdown chart the reference uni-cache bar reads 8.3]
Example instructions, one uni-cache each:
ADD $1, (%RAX)
MOV %RBX, 48(%RAX)
PUSH %RAX
ADD 40(%RAX), %RBX
5. Context Switch Reduction
• Register liveness analysis
  – Use dead register
  – Avoid flags save/restore
[Figure: translation hierarchy (reference uni-cache, memoization mini-cache, hashtable, thread-local translation cache, global translation table) for the example instructions ADD $1, (%RAX); MOV %RBX, 48(%RAX); PUSH %RAX; ADD 40(%RAX), %RBX; in the slowdown chart the context switch reduction bar reads 3.1]
Fraction of executed instructions (SPEC2006):
Memory Reference   41.79%
Eflag Steal        2.55%
Register Steal     8.20%
6. Reference Grouping
• One reference cache for multiple references
  – Stack local variables
  – Different members of the same object
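One way to picture reference grouping is to bucket a basic block's memory references by base register, so that nearby offsets from the same base share a single uni-cache check. The C sketch below is a simplified illustration of that idea and not the analysis Umbra performs: `ref_t`, `group_refs`, the register ids, and the `GROUP_SPAN` bound are assumptions, and it presumes the base register is not modified between the grouped references.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int     base_reg;  /* register holding the base address */
    int32_t offset;    /* constant displacement from the base */
} ref_t;

enum { REG_RAX, REG_RBX, REG_RSP };  /* illustrative register ids */

#define GROUP_SPAN 4096  /* assumed: offsets this close share a unit */

static int same_group(const ref_t *a, const ref_t *b)
{
    int64_t d = (int64_t)a->offset - (int64_t)b->offset;
    if (d < 0)
        d = -d;
    return a->base_reg == b->base_reg && d < GROUP_SPAN;
}

/* Writes a group id for every reference and returns the number of
 * groups, i.e. the number of uni-cache checks that remain. */
static size_t group_refs(const ref_t *refs, size_t n, size_t *group_of)
{
    size_t groups = 0;
    for (size_t i = 0; i < n; i++) {
        size_t j;
        for (j = 0; j < i; j++)
            if (same_group(&refs[j], &refs[i]))
                break;
        group_of[i] = (j < i) ? group_of[j] : groups++;
    }
    return groups;
}
```

Applied to references such as (%RAX), 48(%RAX), and 40(%RAX) plus one stack reference through %RSP, the sketch yields two groups instead of four separate checks.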
[Figure: translation hierarchy (reference uni-cache, memoization mini-cache, hashtable, thread-local translation cache, global translation table) for the example instructions ADD $1, (%RAX); MOV %RBX, 48(%RAX); PUSH %RAX; ADD 40(%RAX), %RBX; in the slowdown chart the reference grouping bar reads 2.5]
Fraction of executed instructions (SPEC2006):
Memory Reference       41.79%
Ref Uni-Cache Checks   22.76%
3-stage Code Layout
• Inline stub (<10 instructions)
  – Quick inline check code with minimal context switch
• Lean procedure (~50 instructions)
  – Simple assembly procedure with partial context switch
• Callout (C function)
  – C function with complete context switch
[Figure: three stages, left to right: inline stub, lean procedure, callout. Checks run in order before the app instruction: uni-cache check, memoization check, hashtable lookup, local cache lookup. The callout is wrapped in full context switches: <full context switch> c_function() { /* global table lookup */ } <full context switch>]
Umbra
• Mapping Scheme √
  – Segmented mapping
  – Scale with actual memory usage
• Implementation √
  – DynamoRIO
• Optimization √
  – Translation optimization
  – Instrumentation optimization
• Client API
• Experimental Results
  – Performance evaluation
  – Statistics collection
Client API
Event Hook             Description
client_init            Process initialization
client_exit            Process exit
client_thread_init     Thread initialization
client_thread_exit     Thread exit
shadow_memory_create   Shadow memory creation
shadow_memory_delete   Shadow memory deletion
instrument_update      Insert meta-data update code
Umbra Client: Shared Memory Detection
• Meta-data maintains a bit map to store which threads access the associated memory

static void
instrument_update(void *drcontext, umbra_info_t *umbra_info, mem_ref_t *ref,
                  instrlist_t *ilist, instr_t *where)
{
    …
    /* lock or [%r1], tid_map → [%r1] */
    /* Destination: the 4-byte shadow word addressed by the translated register. */
    opnd1 = OPND_CREATE_MEM32(umbra_info->reg, 0, OPSZ_4);
    /* Source: this thread's bit in the bit map. */
    opnd2 = OPND_CREATE_INT32(client_tls_data->tid_map);
    instr = INSTR_CREATE_or(drcontext, opnd1, opnd2);
    /* Add the lock prefix so concurrent updates from other threads are atomic. */
    LOCK(instr);
    instrlist_meta_preinsert(ilist, label, instr);
}
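The effect of the inserted `lock or` can be illustrated in plain C11 with an atomic fetch-or on a shadow word. This sketch stands outside DynamoRIO and is not the client's code: `shadow_word_t`, `record_access`, and `is_shared` are hypothetical names, and folding thread ids into 32 bits with a modulo is an assumption made to keep the example small.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* One 32-bit shadow word per tracked application unit: bit i is set
 * once thread i has touched the unit. */
typedef _Atomic uint32_t shadow_word_t;

/* Same effect as "lock or [shadow], tid_map": set this thread's bit
 * atomically. Thread ids are folded into 32 bits (an assumption). */
static void record_access(shadow_word_t *shadow, unsigned tid)
{
    atomic_fetch_or(shadow, (uint32_t)1u << (tid % 32));
}

/* A unit is shared when more than one bit is set. */
static int is_shared(shadow_word_t *shadow)
{
    uint32_t bits = atomic_load(shadow);
    return (bits & (bits - 1)) != 0;
}
```

Because the update is a single atomic OR, the order in which threads touch the unit does not matter, and repeated accesses by the same thread leave the word unchanged.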
Umbra
• Mapping Scheme √
  – Segmented mapping
  – Scale with actual memory usage
• Implementation √
  – DynamoRIO
• Optimization √
  – Translation optimization
  – Instrumentation optimization
• Client API √
• Experimental Results
  – Performance evaluation
  – Statistics collection
Performance Evaluation
[Figure: slowdown relative to native execution]
DMS-32: 1.80
SMS-32: 2.40
SMS-64: 4.67
Umbra-64: 2.49
EMS64: Efficient Memory Shadowing for 64-bit
• Translation
  – $addr_S = addr_A + disp_{rc}$
  – Reference uni-cache hit rate: 99.93%
  – Still need a costly check to catch the 0.07%
    • Reg steal; save flags; compare & jump; restore
• EMS64 (ISMM’10)
  – Speculatively use a disp without check
  – Notified by memory access violation fault for incorrect disp
EMS64 Preliminary Result
[Figure: slowdown relative to native execution]
DMS-32: 1.80
SMS-32: 2.40
SMS-64: 4.67
Umbra-64: 2.49
EMS-64: 1.81
Thanks
• Download
  – http://people.csail.mit.edu/qin_zhao/umbra/
• Q & A