July 9, 2004 ICPADS ’04 Session 7C

A Framework for Profiling Multiprocessor Memory Performance

Diana Villa, Jaime Acosta, Patricia J. Teller
The University of Texas at El Paso
Department of Computer Science

Bret Olszewski
IBM Corporation – Austin, TX

ICPADS ’04

Outline

Motivation
Data Collection Environment
  Workload & Platform
  Monitored Events
  Sampled Event Traces
Performance Evaluation Framework
Data Analysis & Results
Conclusions and Future Work

Motivation

Modern systems
  Performance governed by memory subsystem
  SMPs
  Deeper and larger memory hierarchies

Performance analysis considerations
  Time to results and size of data set

Goal
  Develop a new performance analysis methodology

Data Collection Environment

Workload
  TPC-C benchmark
  Commercial OLTP

Platform
  IBM eServer pSeries 690 architecture (p690)
  8- and 32-processor configurations

Platform

[Figure: 8-processor p690 configuration. The diagram shows processors (P), positions marked X, L2 caches, and L3 caches across MCM 0 and MCM 1.]

Platform

[Figure: 32-processor p690 configuration. The diagram shows processors (P), L2 caches, and one L3 cache per module across MCM 0, MCM 1, MCM 2, and MCM 3.]

Monitored Events

L2-cache data-load misses
  L2.5
  L2.75
  L3
  L3.5
  MEM

L1-cache data-load miss
  L2

L2

[Figure: 8-processor p690 diagram (MCM 0, MCM 1) for the L2 data source]

Penalty: 12 cycles

L2.5

[Figure: 8-processor p690 diagram (MCM 0, MCM 1) for the L2.5 data source]

Penalty: 73 cycles

L2.75

[Figure: 8-processor p690 diagram (MCM 0, MCM 1) for the L2.75 data source]

Penalty: 96 cycles

L3

[Figure: 8-processor p690 diagram (MCM 0, MCM 1) for the L3 data source]

Penalty: 112 cycles

L3.5

[Figure: 8-processor p690 diagram (MCM 0, MCM 1) for the L3.5 data source]

Penalty: 143 cycles
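
The per-source penalties listed on the slides above (L2 through L3.5) can be combined with per-source load counts to estimate how many cycles data loads cost in total. The following is a minimal Python sketch under stated assumptions: the function name `weighted_penalty` and the example counts are hypothetical, chosen only for illustration, and are not measurements from the study. No penalty is given for MEM on the slides, so MEM is left out rather than guessed.

```python
# Penalties (in cycles) exactly as listed on the slides. MEM is omitted
# because the slides do not state a penalty for it.
PENALTY_CYCLES = {"L2": 12, "L2.5": 73, "L2.75": 96, "L3": 112, "L3.5": 143}


def weighted_penalty(counts):
    """Return (total_cycles, average_cycles_per_load).

    `counts` maps a data-source name to the number of data loads that
    were satisfied from that source. A source without a known penalty
    (for example "MEM") raises KeyError instead of being estimated.
    """
    total_loads = sum(counts.values())
    if total_loads == 0:
        return 0, 0.0
    total_cycles = sum(PENALTY_CYCLES[src] * n for src, n in counts.items())
    return total_cycles, total_cycles / total_loads


# Hypothetical counts, for illustration only.
example_counts = {"L2": 900, "L3": 100}
total, average = weighted_penalty(example_counts)
print(total, average)
```

Because the traces are sampled, such counts would describe the sampled loads, not every load issued by the workload.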

Data Collection

10-minute observation interval

Performance Monitoring Unit (PMU)
  Special-purpose registers
  Programming interface
    Kernel extension

eprof
  PMU configuration
  Event-based sampling
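
Event-based sampling, as named on this slide, records one occurrence of an event out of every fixed number of occurrences. The sketch below illustrates only that general idea in Python; it is not eprof's implementation and does not model the PMU registers. The function name `sample_events` and the `period` parameter are assumptions introduced for illustration.

```python
def sample_events(event_stream, period):
    """Yield every `period`-th event from `event_stream`.

    This mimics a counter that is advanced on each event occurrence and
    triggers a sample each time it has counted `period` occurrences.
    """
    if period < 1:
        raise ValueError("period must be a positive integer")
    for count, event in enumerate(event_stream, start=1):
        if count % period == 0:
            yield event


# Ten hypothetical event occurrences, sampled with a period of 3.
print(list(sample_events(range(1, 11), 3)))
```

A shorter period gives a more detailed trace at the cost of more data and more monitoring overhead.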

Sampled Event Traces

Sampling
  Record periodic occurrences of an event
  100 events/sec/CPU

Event record

  372872 184469 0.328104637 000000000000A8C4 0000000000218880

  Fields: PID, TID, Timestamp, Effective Instruction Address, Effective Data Address

Average number of samples collected/event
  238,448 for 8-processor data
  212,396 for 32-processor data
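
The event record shown on this slide can be split into its five fields with a few lines of code. The Python sketch below assumes that PID and TID are decimal, that the timestamp is in seconds, and that both addresses are hexadecimal; the slide does not state these encodings. The 128-byte cache-line size and 4 KB page size are also assumptions, and `EventRecord` and `parse_record` are hypothetical names.

```python
from dataclasses import dataclass

CACHE_LINE_BYTES = 128  # assumption: cache-line size is not given on the slide
PAGE_BYTES = 4096       # assumption: page size is not given on the slide


@dataclass(frozen=True)
class EventRecord:
    pid: int
    tid: int
    timestamp: float
    instr_addr: int  # effective instruction address
    data_addr: int   # effective data address

    @property
    def cache_line(self):
        """Index of the cache line that contains the data address."""
        return self.data_addr // CACHE_LINE_BYTES

    @property
    def page(self):
        """Index of the page that contains the data address."""
        return self.data_addr // PAGE_BYTES


def parse_record(line):
    """Parse one whitespace-separated event record into an EventRecord."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    pid, tid, timestamp, instr_addr, data_addr = fields
    return EventRecord(
        int(pid), int(tid), float(timestamp),
        int(instr_addr, 16), int(data_addr, 16),
    )


record = parse_record(
    "372872 184469 0.328104637 000000000000A8C4 0000000000218880"
)
print(record.pid, hex(record.instr_addr), hex(record.data_addr))
```

Deriving cache-line and page indices at parse time makes later grouping by line or page a simple lookup.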

Performance Framework

[Diagram: the Data Collection Environment (TPC-C on p690) produces Sampled Event Traces (PID, TID, Timestamp, Instr. Addr., Data Addr.). A Load DB Java Tool loads the traces into the Database, and a Report Generation Java Tool produces Reports and Graphs.]

Sample report fragment: 5 BufferPool 56893 293846 Data,BSS,Heap 8799 48551 Kernel 23485 9840

[Graph: Distribution of L3 Data Load Hits. Address regions: Kernel; Text; Data,BSS,Heap; BufferPool; SharedData; Stack; U-Block and Kernel Stack; KERN_HEAP. Axis: fraction of data loads (0 to 0.5). Series: Unique cache line, Hit %.]

[Graph: Distribution of L3 Data Load Hits Across Pages of a Buffer Pool Segment. X-axis: Page [0-65536]. Y-axis: Hit/Cache line count (0 to 400). Series: Total loads, Unique cache line.]
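
The reports on this slide break sampled data loads down by address region and count unique cache lines. The Python sketch below shows one way such a summary could be computed in memory; it is not the framework itself, which per the slides is built on a database and Java tools. The address ranges in `REGIONS` are invented for illustration, the 128-byte line size is an assumption, and `region_of` and `summarize` are hypothetical names.

```python
from collections import defaultdict

CACHE_LINE_BYTES = 128  # assumption: cache-line size is not given on the slide

# Hypothetical address ranges (start inclusive, end exclusive). The real
# region boundaries are not given on the slides.
REGIONS = [
    ("Kernel", 0x0000_0000, 0x1000_0000),
    ("Data,BSS,Heap", 0x2000_0000, 0x3000_0000),
    ("BufferPool", 0x7000_0000, 0x8000_0000),
]


def region_of(addr):
    """Return the name of the region containing `addr`, or "Other"."""
    for name, start, end in REGIONS:
        if start <= addr < end:
            return name
    return "Other"


def summarize(data_addrs):
    """Map each region to (sampled data loads, unique cache lines)."""
    loads = defaultdict(int)
    lines = defaultdict(set)
    for addr in data_addrs:
        region = region_of(addr)
        loads[region] += 1
        lines[region].add(addr // CACHE_LINE_BYTES)
    return {region: (loads[region], len(lines[region])) for region in loads}


# Hypothetical sampled data addresses.
samples = [0x2000_0000, 0x2000_0010, 0x2000_0080, 0x7000_0000]
for region, (n_loads, n_lines) in summarize(samples).items():
    print(region, n_loads, n_lines)
```

In a database-backed version, the same grouping would be expressed as a query over the loaded trace table.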

Data Analysis - 1

Overall goal
  Study effectiveness of p690 memory hierarchy
  Characterize differences between private and shared data loads
  Track missing L2-cache lines across levels of the p690 memory hierarchy

Studied address regions
  Referenced by 90% of L2-cache data-load misses
  Private: Data,BSS,Heap
  Shared: Buffer Pool

Data Analysis - 2

Private data loads
  Accessible only to owner process
  Examples: process’s return stack, local variables
  Ideal: Remain close to executing processor

Shared data loads
  Accessible by every TPC-C process
  Examples: application code, global variables
  Ideal: Remain in higher levels of memory hierarchy

Results

[Figure: two bar charts of 32-processor data. Shared: "Distribution of Data Load Hits: BUFFER_POOL". Private: "Distribution of Data Load Hits: DATABSSHEAP". X-axis: event name (L2, L2.5 MOD, L2.75 MOD, L3, L3.5, MEM). Y-axis: 0 to 120,000. Series: DataLoadHits, UniqueCacheLines.]

Results

[Figure: private and shared bar charts of 32-processor data. Y-axis: 0 to 120,000. Series: DataLoadHits, UniqueCacheLines.]

Good Application/Architecture Match

Results

[Figure: private and shared bar charts of 32-processor data. Y-axis: 0 to 120,000. Series: DataLoadHits, UniqueCacheLines.]

Possible Performance Impediment

Results

[Figure: private and shared bar charts of 32-processor data. Y-axis: 0 to 120,000. Series: DataLoadHits, UniqueCacheLines.]

Shared Data References More Localized than Private Data References
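
One way to put a number on how localized a set of references is, given the two series plotted in these charts, is the ratio of data load hits to unique cache lines: a higher ratio means the same lines are referenced more often. The Python sketch below is an illustration under that assumption. The slides do not say that this ratio was the metric used, `hits_per_line` is a hypothetical name, and the counts are invented.

```python
def hits_per_line(data_load_hits, unique_cache_lines):
    """Average number of sampled data load hits per unique cache line."""
    if unique_cache_lines <= 0:
        raise ValueError("unique_cache_lines must be positive")
    return data_load_hits / unique_cache_lines


# Invented counts, for illustration only (not values read from the charts).
shared = hits_per_line(100_000, 20_000)
private = hits_per_line(30_000, 25_000)
print(shared, private)
```

A ratio near 1 means almost every sampled load touched a different cache line.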

Results

[Figure: private and shared bar charts of 32-processor data. Y-axis: 0 to 120,000. Series: DataLoadHits, UniqueCacheLines.]

MEM Data Load Hits Primarily Due To Compulsory Misses

Conclusions - 1

Developed new performance evaluation framework
  Applicable to large SMP systems
  Sampled performance monitor event traces
    Manageable, collected in real time
  Core
    Database management system (MySQL), Java tools

Applied methodology to study memory-subsystem behavior
  TPC-C executing on p690
  Evaluated differences between private and shared data loads

Conclusions - 2

References for private data
  Satisfied within the MCM
  Good application/architecture match

References for shared data
  Referenced outside the MCM
  Increased locality of reference
  Target for performance improvement

Main memory accesses primarily associated with compulsory misses

Future Work

Quantify representativeness of sampled event traces

Enhance performance evaluation framework

Expand study of application data load behavior
  e.g., process characterization

Suggest ways to improve performance of TPC-C executing on p690
  Improved memory management of Buffer Pool resulting in performance improvements
  Track performance impediments to actual code and/or data structures

Thank You.

Questions?
