Using Sampled and Incomplete Profiles

Using Sampled and Incomplete Profiles

David KaeliDepartment of Electrical and Computer

EngineeringNortheastern University

Boston, [email protected]

Overview

• Trace-based simulation is expensive (caches are getting larger, CPUs and networks are getting faster)

• Approximate results are many times sufficient

• How can we utilize a reduced or sampled profile and still obtain accurate modeling/simulation results?

How does sampling affect our Metrics for Evaluating Trace Collection Methodologies

• Speed – sampled profiles reduce speed requirements

• Memory – sampled profiles may take less space• Accuracy – sampled profiles are less accurate• Intrusiveness – sampling is less intrusive• Completeness – no change• Granularity – may affect our ability to capture fast

events (nyquist)• Flexibility – no change• Portability – clock speed may affect sampling

accuracy• Capacity – sampling should reduce this• Cost – potentially less cost and less time

Application Areas:

• Memory system performance – cache simulation, working set models, temporal and spatial locality

• CPU pipeline modeling – instruction frequencies, instruction sequences (2-at-a-time, 3-at-a time, pipeline snapshots)

• Network simulation – input traffic distribution, queue lengths, throughput, burstiness

Memory Systems

• Temporal locality – addresses will be referenced close in time

• Spatial locality – addresses close by will be referenced next

• Working set models– Belady (1966) – Virtual memory page replacement

algorithms (Optimal replacement defined)– Denning (1980) – The pattern of page access over

the execution of the program– Thiebaut and Stone (1986) – The number of

misses incurred due to task switches can be modeled as a binomial distribution

Memory Systems

• Cold start – How do we model behavior when a program starts execution for the first time?

• Warm start – How do we model behavior when a program resumes execution?

• How does memory organization (e.g., set associativity)and typical program behavior affect our ability to utilize sampled profiles effectively?

• Do we sample in time or can we also sample in space (e.g., a set)?

Memory Systems

• If we only capture a subset of the important addresses, can we reproduce the full trace?– Abstract Execution (Larus 1990) – basic block

traces– Trace Reduction (Smith 1977, Puzak 1985) –

generate a reduced trace that contains the exact same number of misses and writebacks as the original trace (a type of filtering is performed, similar to Agarwal’s Block Filter)

– Trace compaction (Samples 1989) – perform a diff on sequential addresses and only capture important diffs

Can we do something simple?Laha 1988 Fu 1994

ignore

sample n sample n+1

ignore ignore

sampling ratio = sample size/sample interval

sampling interval sample size

Two types of errors1. sampling errors – is the ratio optimal?2. accurately predicting effects of ignored portions

But this only suggests when to sample, not what to sample…

Sampling Rate:What is the Nyquist Frequency for an Program?

• We must sample a program at a rate of twice the frequency of the event of interest

• Half the sampling frequency is termed the Nyquist frequency

• Sampling at lower rates than twice the Nyquist frequency can cause aliasing and distortion

• The biggest problem is that not all events of interest in a program exhibit a nice periodic pattern

Example: gprof( )

• Produces 3 things– A listing of the total execution times and call counts

for each of the functions in the program, sorted by decreasing time

– The functions sorted according to the time they represent, including the time of their call graph descendents

– Total execution in a cycle and the members in that cycle (a cycle is a back edge)

• gprof samples a program’s execution– Obtains exact call statistics– Does not obtain exact time measurements– Accuracy is obtained through statistical sampling– Sampling reduces the associated overhead

When to sample: Cold start vs. Warm start

0

50

100

150

200

250

300

350

400

450

Time ->

Cum

ula

tive m

isse

s

cold start warm start

Sampling dimensions• We can sample in both time and space

– Time (periodic sampling)– Filtered sampling (e.g., using addresses ranges)

• Time– Periodic– Random– #misses, #instructions, #loads/stores

• Space– Address ranges – may not be representative of all

ranges– Cache sets – may limit the utility of the sample– Statically tagged events – focuses in on particular

instructions and data of interest

How do we account for unknown references?

• Assume that this behavior does not affect past/future behavior

• Assume that some percentage of past behavior is overwritten by unknown reference behavior– Decay model– Footprints in the cache– MRU model

• Assume that all of the past behavior is overwritten by the unknown reference behavior

How do we model the effects of multiprogramming??

• Flush all tables (caches, branch predictors, TLBs, load buffers, etc.)

• Estimate interference using a model– Invalidate some % of all entries based on

the relative time since last execution– Utilize working set models and utilize these

to estimate the effect of the interference

• Allow aliasing to occur where appropriate (branch predictors, but not caches)

Sampled Instruction Execution Profiles• Instruction frequencies

– For SPEC92int programs on a IA32 CPU• 43% ALU, 22% loads, 12% stores, 23% control

flow (H&P AQA)

• Instruction sequences– Top pairs– Top triples– Continue up until average BB size

• Branches in the pipeline– Sampled versus modeled

Analytical Models of Workload Behavior

Squillante and Kaeli, 1997• We an capture distributions of the distance between events

• We can then compute the probability of different events occurring in a pipeline of length n (think of n as a window of execution)

• We can also compute the conditional probabilities of multiple events occurring in the pipeline of length n

• We can then assign weights to each of these multiple events to compute the throughput (IPC) of a pipeline

• Our model uses random marked point process (time between events) and produces very accurate estimates of pipeline throughput


Squillante and Kaeli, 1997• Some constraints are assumed in the sake of simplicity:– Inter-branch times are independent– Inter-branch times are identically

distributed– Delay due to taken branches is

constant


Squillante and Kaeli, 1997


Squillante and Kaeli, 1997• A analytical formula is used to compute

approximate CPI and speedup measures for an n-stage pipelined processor

• Traces are captured from benchmark execution

• The distribution of the number of instructions between successive taken branches is computed using a “window of execution” filter


Squillante and Kaeli, 1997


Squillante and Kaeli, 1997• For a pipeline length of 8, the obtained results are quite precise for three out of the five benchmarks

• For bubblesort and prime, the IID assumption is violated, thus introducing some inaccuracies in our model

• Future work looks at handling inaccuracies that provide multi-level conditional probabilities into our model

Capturing n-length Instruction Sequences

• Sampling over n sequential instructions• Capturing the most frequently executed

sequences• Utilizing these sequences to drive

pipeline design• Capturing longer profiles may allow us

to predict design hardware trace caches• Gonzalez, Tubella and Molina describe a

mechanism for both profiled instructions and operand values

Acrobat Document

UU PP CC

Trace-Level ReuseTrace-Level Reuse

A. González, J. Tubella and C. Molina

Dpt. d´Arquitectura de Computadors

Universitat Politècnica de Catalunya

1999 International Conference on Parallel Processing ICPP´99

Documents

Using Sampled and Incomplete Profiles