32
1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14 2007 San Jose, CA

Ubiquitous Memory Introspection (UMI)

  • Upload
    kennan

  • View
    24

  • Download
    0

Embed Size (px)

DESCRIPTION

Ubiquitous Memory Introspection (UMI). Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS. CGO 2007, March 14 2007 San Jose, CA. The complexity crunch. hardware complexity ↑ + software complexity ↑ ⇓ too many things happening concurrently, - PowerPoint PPT Presentation

Citation preview

Page 1: Ubiquitous Memory Introspection (UMI)

1

Ubiquitous Memory Introspection (UMI)

Qin Zhao, NUSRodric Rabbah, IBM

Saman Amarasinghe, MITLarry Rudolph, MIT

Weng-Fai Wong, NUS

CGO 2007, March 14 2007San Jose, CA

Page 2: Ubiquitous Memory Introspection (UMI)

2

The complexity crunch

hardware complexity ↑+

software complexity ↑

⇓too many things happening concurrently,

complex interactions and side-effects

⇓less understanding of program

execution behavior ↓

Page 3: Ubiquitous Memory Introspection (UMI)

3

• S/W developer– Computation vs.

communication ratio– Function/path frequency– Test coverage

• H/W designer– Cache behavior– Branch prediction

• System administrator– Interaction between

processes

The importance of program behavior characterization

Better understanding of program characteristics can lead to more robust software, and more efficient hardware and systems

Page 4: Ubiquitous Memory Introspection (UMI)

4

Common approaches toprogram understanding

Overhead

Level of detail

Versatility

Space and time overhead

Coarse grained summaries vs. instruction level and contextual information

Portability and ability to adapt to different uses

Page 5: Ubiquitous Memory Introspection (UMI)

5

Common approaches toprogram understanding

ProfilingSimulatio

n

Overhead high very high

Level of detail

very high very high

Versatility

high very high

Page 6: Ubiquitous Memory Introspection (UMI)

6

Common approaches toprogram understanding

ProfilingSimulatio

nHW

Counters

Overhead high very high very low

Level of detail

very high very high very low

Versatility

high very high very low

Page 7: Ubiquitous Memory Introspection (UMI)

7

Slowdown due to HW counters (counting L1 misses for 181.mcf on P4)

0100

200300400

500600700

800

native 10 100 1K 10K 100K 1M

HW counter sample size

Tim

e (

secon

ds)

(>2000%)

(1%)(10%) (1%)(35%)

(325%)

(% slowdown)

Page 8: Ubiquitous Memory Introspection (UMI)

8

Common approaches toprogram understanding

ProfilingSimulatio

nHW

CountersDesirable Approach

Overhead high very high very low low

Level of detail

very high very high very low high

Versatility

high very high very low high

Page 9: Ubiquitous Memory Introspection (UMI)

9

Common approaches toprogram understanding

ProfilingSimulatio

nHW

CountersDesirable Approach

Overhead high very high very low low

Level of detail

very high very high very low high

Versatility

high very high very low high

UMI

Page 10: Ubiquitous Memory Introspection (UMI)

10

Key components

• Dynamic Binary Instrumentation – Complete coverage, transparent,

language independent, versatile, …

• Bursty Simulation– Sampling and fast forwarding

techniques– Detailed context information– Reasonable extrapolation and prediction

Page 11: Ubiquitous Memory Introspection (UMI)

11

Ubiquitous Memory Introspection

online mini-simulations analyze short memory access profiles recorded from frequently executed code regions

• Key concepts– Focus on hot code regions– Selectively instrument instructions– Fast online mini-simulations– Actionable profiling results for online memory

optimizations

Page 12: Ubiquitous Memory Introspection (UMI)

12

Working prototype

• Implemented in DynamoRIO– Runs on Linux– Used on Intel P4 and AMD K7

• Benchmarks– SPEC 2000, SPEC 2006, Olden– Server apps: MySQL, Apache– Desktop apps: Acrobat-reader,

MEncoder

Page 13: Ubiquitous Memory Introspection (UMI)

13

UMI is cheap and non-intrusive(SPEC2K reference workloads on

P4)

Avera

ge

0%

20%

40%

60%

80%

168.

wupwise

171.

swim

172.

mgr

id

173.

applu

177.

mes

a

178.

galge

l

179.

art

183.

equa

ke

187.

face

rec

188.

amm

p

189.

lucas

191.

fma3

d

200.

sixtra

ck

301.

apsi

164.

gzip

175.

vpr

176.

gcc

181.

mcf

186.

craf

ty

197.

pars

er

252.

eon

253.

perlb

mk

254.

gap

255.

vorte

x

56.b

zip2

300.

twolf

em3d

healt

hm

st

treea

dd tspft

DynamoRIO

UMI

•Average overhead is 14%•1% more than DynamoRIO

% s

low

dow

n c

om

pare

d t

o n

ati

ve

execu

tion

Page 14: Ubiquitous Memory Introspection (UMI)

14

What can UMI do for you?

• Inexpensive introspection everywhere

• Coarse grained memory analysis– Quick and dirty

• Fine grained memory analysis– Expose opportunities for optimizations

• Runtime memory-specific optimizations– Pluggable prefetching, learning, adaptation

Page 15: Ubiquitous Memory Introspection (UMI)

15

Coarse grained memory analysis

• Experiment: measure cache misses in three ways – HW counters– Full cache simulator (Cachegrind)– UMI

• Report correlation between measurements– Linear relationship of two sets of data

1-1

strong positive

correlation

0strong negative

correlation

no correlation

Page 16: Ubiquitous Memory Introspection (UMI)

16

Cache miss correlation results

• HW counter vs. Cachegrind– 0.99– 20x to 100x slowdown

• HW counter vs. UMI– 0.88– Less than 2x slowdown

in worst case0

0.2

0.4

0.6

0.8

1

(HW

, CG

)(H

W, U

MI)

Page 17: Ubiquitous Memory Introspection (UMI)

17

What can UMI do for you?

• Inexpensive introspection everywhere

• Coarse grained memory analysis– Quick and dirty

• Fine grained memory analysis– Expose opportunities for optimizations

• Runtime memory-specific optimizations– Pluggable prefetching, learning, adaptation

Page 18: Ubiquitous Memory Introspection (UMI)

18

Fine grained memory analysis

• Experiment: predict delinquent loads using UMI– Individual loads with cache miss rate greater

than threshold

• Delinquent load set determined according to full cache simulator (Cachegrind)– Loads that contribute 90% of total cache misses

• Measure and report two metricsDelinquent loadsidentified byCachegrind

Delinquent loads predicted by

UMIrecallfalse

positive

Page 19: Ubiquitous Memory Introspection (UMI)

19

UMI delinquent load prediction accuracy

Recall

(higher is better)

False positive

(lower is better)

benchmarks with ≥ 1% miss rate

88% 55%

benchmarks with < 1% miss rate

26% 59%

Page 20: Ubiquitous Memory Introspection (UMI)

20

What can UMI do for you?

• Inexpensive introspection everywhere

• Coarse grained memory analysis– Quick and dirty

• Fine grained memory analysis– Expose opportunities for optimizations

• Runtime memory-specific optimizations– Pluggable prefetcher, learning, adaptation

Page 21: Ubiquitous Memory Introspection (UMI)

21

Experiment: online stride prefetching

• Use results of delinquent load prediction

• Discover stride patterns for delinquent loads

• Insert instructions to prefetch data to L2

• Compare runtime for UMI and P4 with HW stride prefetcher

Page 22: Ubiquitous Memory Introspection (UMI)

22

Data prefetching results summary

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

run

nin

g t

ime

no

rmal

ized

to

nat

ive

exec

uti

on

(lo

wer

is

bet

ter)

SW prefetching + DynamoRIO + UMI

HW prefetching + Native Binary

Combined prefetching + DynamoRIO + UMI

ft

Avera

ge

0.20.30.40.50.60.70.80.91.01.1

181.

mcf

171.

swim

172.

mgr

id

179.

art

183.

equa

ke

188.

amm

p

191.

fma3

d

301.

apsi

em3d m

st ft

Avera

ge

More than offsets slowdowns from binary instrumentation!

Page 23: Ubiquitous Memory Introspection (UMI)

23

The Gory Details

Page 24: Ubiquitous Memory Introspection (UMI)

24

UMI components

• Region selector

• Instrumentor

• Profile analyzer

Page 25: Ubiquitous Memory Introspection (UMI)

25

Region selector

• Identify representative code regions– Focus on traces, loops– Frequently executed code– Piggy back on binary instrumentor tricks

• Reinforce with sampling – Time based, or leverage HW counters– Naturally adapt to program phases

Page 26: Ubiquitous Memory Introspection (UMI)

26

Instrumentor

• Record address references– Insert instructions to record address

referenced by memory operation

• Manage profiling overhead– Clone code trace (akin to Arnold-Ryder

scheme)– Selective instrumentation of memory

operations• E.g., ignore stack and static data

Page 27: Ubiquitous Memory Introspection (UMI)

27

Recording profilesCode Trace Profile

T1

T1

T2

T1

counter

0x1040x0280x013

0x012

0x1000x0240x011

op3op2op1

Code Trace T1

Code Trace T2

0x0310x032

op2op1counter

page protection to detect profile overflow

Address Profiles

early trace exist

early trace exist

Page 28: Ubiquitous Memory Introspection (UMI)

28

Mini-simulator

• Triggered when code or address profile is full• Simple cache simulator

– Currently simulate L2 cache of host– LRU replacement– Improve approximations with techniques similar

to offline fast forwarding simulators• Warm up and periodic flushing

• Other possible analyzer– Reference affinity model– Data reuse and locality analysis

Page 29: Ubiquitous Memory Introspection (UMI)

29

Mini-simulations and parameter sensitivity

181.mcf

0.000.20

0.400.60

0.801.00

1.20

1 2 4 8 16 32 64 128 256 512 1024sampling frequency threshold

recall falsepositive

1.40

normalized performance recall ratio false positive ratio

• Regular data structures• If sampling threshold too high, starts to exceed

loop bounds: miss out on profiling important loops

• Adaptive threshold is best

Page 30: Ubiquitous Memory Introspection (UMI)

30

Mini-simulations and parameter sensitivity

0.00

0.20

0.40

0.60

0.80

1.00

1.20

1.40

64 128 256 512 1K 2K 4K 8K 32K

address profile length

197.parserrecall false

positive

16K

normalized performance recall ratio false positive ratio

• Irregular data structures• Need longer profiles to reach useful conclusions

Page 31: Ubiquitous Memory Introspection (UMI)

31

Summary• UMI is lightweight and has a low overhead

– 1% more than DynamoRIO– Can be done with Pin, Valgrind, etc.– No added hardware necessary– No synchronization or syscall headaches– Other cores can do real work!

• Practical for extracting detailed information– Online and workload specific– Instruction level memory reference profiles– Versatile and user-programmable analysis

• Facilitate migration of offline memory optimizations to online setting

Page 32: Ubiquitous Memory Introspection (UMI)

32

Future work• More types of online analysis

– Include global information– Incremental (leverage previous execution info)– Combine analysis across multiple threads

• More runtime optimizations – E.g., advanced prefetch optimization

• Hot data stream prefetch• Markov prefetcher

– Locality enhancing data reorganization• Pool allocation• Cooperative allocation between different threads

• Your ideas here…