July 9, 2004 ICPADS ’04 Session 7C
A Framework for Profiling Multiprocessor Memory Performance
Diana Villa, Jaime Acosta, Patricia J. Teller
The University of Texas at El Paso
Department of Computer Science
Bret Olszewski
IBM Corporation – Austin, TX
Outline
Motivation
Data Collection Environment
  Workload & Platform
  Monitored Events
  Sampled Event Traces
Performance Evaluation Framework
Data Analysis & Results
Conclusions and Future Work
Motivation
Modern Systems
  Performance governed by memory subsystem
  SMPs
  Deeper and larger memory hierarchies
Performance analysis considerations
  Time to results and size of data set
Goal
  Develop a new performance analysis methodology
Data Collection Environment
Workload
  TPC-C benchmark
  Commercial OLTP
Platform
  IBM eServer pSeries 690 architecture (p690)
  8- and 32-processor configurations
Platform
[Figure: 8-processor p690 configuration, showing MCM 0 and MCM 1 with processors (P), L2 caches, and L3 caches]
Platform
[Figure: 32-processor p690 configuration, showing MCM 0 through MCM 3, each with processors (P), L2 caches, and an L3 cache]
Monitored Events
L2-cache data-load misses
  L2.5
  L2.75
  L3
  L3.5
  MEM
L1-cache data-load miss
  L2
Data-load penalty by event (each shown on the 8-processor p690 diagram, MCM 0 and MCM 1)
  Event   Penalty (cycles)
  L2      12
  L2.5    73
  L2.75   96
  L3      112
  L3.5    143
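The penalties above can be combined with per-event sample counts to get a rough estimate of the cycles spent on data loads. The following is a minimal sketch, not the authors' tooling: the function names and the sample counts are hypothetical, and events without a quoted penalty (such as MEM) are simply skipped.

```python
# Load penalties in cycles, as quoted on the slides (no MEM penalty is given).
PENALTY_CYCLES = {"L2": 12, "L2.5": 73, "L2.75": 96, "L3": 112, "L3.5": 143}


def estimated_stall_cycles(hits_by_event):
    """Sum hits * penalty over the events that have a known penalty."""
    return sum(
        PENALTY_CYCLES[event] * hits
        for event, hits in hits_by_event.items()
        if event in PENALTY_CYCLES
    )


def average_penalty(hits_by_event):
    """Hit-weighted mean penalty over the events that have a known penalty."""
    known_hits = sum(
        hits for event, hits in hits_by_event.items() if event in PENALTY_CYCLES
    )
    if known_hits == 0:
        raise ValueError("no hits for events with a known penalty")
    return estimated_stall_cycles(hits_by_event) / known_hits


# Hypothetical sample counts, for illustration only (not from the study).
example_hits = {"L2": 1000, "L2.5": 100, "L3": 50}
total_cycles = estimated_stall_cycles(example_hits)
mean_cycles = average_penalty(example_hits)
```

Because the traces are sampled, such totals would describe the sample rather than the full run unless scaled by the sampling rate.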
Data Collection
10-minute observation interval
Performance Monitoring Unit (PMU)
  Special-purpose registers
  Programming interface
    Kernel extension
eprof
  PMU configuration
  Event-based sampling
Sampled Event Traces
Sampling
  Record periodic occurrences of an event
  100 events/sec/CPU
Event record
  PID     TID     Timestamp    Effective Instruction Address  Effective Data Address
  372872  184469  0.328104637  000000000000A8C4               0000000000218880
Average number of samples collected/event
  238,448 for 8-processor data
  212,396 for 32-processor data
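An event record like the one above can be parsed into typed fields before analysis. This is a minimal sketch that assumes whitespace-separated fields, decimal PID and TID, and hexadecimal addresses; the `EventRecord` type and `parse_record` function are hypothetical names, not part of eprof.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRecord:
    pid: int
    tid: int
    timestamp: float
    instr_addr: int  # effective instruction address
    data_addr: int   # effective data address


def parse_record(line: str) -> EventRecord:
    """Parse one 'PID TID Timestamp InstrAddr DataAddr' trace line."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    pid, tid, timestamp, instr_addr, data_addr = fields
    # Assumption: addresses are printed in hexadecimal without a 0x prefix.
    return EventRecord(
        int(pid), int(tid), float(timestamp), int(instr_addr, 16), int(data_addr, 16)
    )


record = parse_record(
    "372872 184469 0.328104637 000000000000A8C4 0000000000218880"
)
```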
Performance Framework
[Figure: framework data flow]
  Data Collection Environment (p690, TPC-C)
  -> Sampled Event Traces (PID, TID, Timestamp, Instr. Addr., Data Addr.)
  -> Load DB Java Tool -> Database
  -> Report Generation Java Tool -> Reports, Graphs
Reports (sample rows)
  5 BufferPool 56893 293846
  Data,BSS,Heap 8799 48551
  Kernel 23485 9840
Graphs (samples)
  Distribution of L3 Data Load Hits
    Axes: Address region (Kernel, Text, Data,BSS,Heap, BufferPool, SharedData, Stack, U-BlockandKernelStack, KERN_HEAP) vs. Fraction of data loads (0 to 0.5)
    Series: Unique cache line, Hit %
  Distribution of L3 Data Load Hits Across Pages of a Buffer Pool Segment
    Axes: Page [0-65536] vs. Hit/Cache line count (0 to 400)
    Series: Total loads, Unique cache line
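The load-then-report flow of the framework can be illustrated with a small stand-in. The slides describe Java tools and a MySQL database; the sketch below deliberately swaps in Python's built-in `sqlite3` module so it runs without a server, and the table layout, column names, and function names are all assumptions made for illustration.

```python
import sqlite3


def load_samples(conn, event, rows):
    """Insert (pid, tid, ts, instr_addr, data_addr) tuples for one event."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS samples (
               event TEXT, pid INTEGER, tid INTEGER, ts REAL,
               instr_addr INTEGER, data_addr INTEGER)"""
    )
    # Note: SQLite integers are signed 64-bit, so addresses above 2**63 - 1
    # would need another representation (for example, hex text).
    conn.executemany(
        "INSERT INTO samples VALUES (?, ?, ?, ?, ?, ?)",
        [(event, *row) for row in rows],
    )
    conn.commit()


def hits_per_event(conn):
    """Report step: number of sampled data loads per monitored event."""
    cursor = conn.execute(
        "SELECT event, COUNT(*) FROM samples GROUP BY event ORDER BY event"
    )
    return dict(cursor.fetchall())


# Hypothetical sample rows, for illustration only.
connection = sqlite3.connect(":memory:")
load_samples(connection, "L3", [(1, 1, 0.1, 0xA8C4, 0x218880),
                                (1, 2, 0.2, 0xA8C8, 0x218900)])
load_samples(connection, "MEM", [(2, 3, 0.3, 0xB000, 0x300000)])
report = hits_per_event(connection)
```

Keeping the traces in a database means each new report is one more query rather than another pass over the raw trace files.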
Data Analysis - 1
Overall goal
  Study effectiveness of p690 memory hierarchy
  Characterize differences between private and shared data loads
  Track missing L2-cache lines across levels of the p690 memory hierarchy
Studied address regions
  Referenced by 90% of L2-cache data-load misses
  Private: Data,BSS,Heap
  Shared: Buffer Pool
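One way such a region selection could be made is to rank address regions by miss count and keep the top regions until they cover the target fraction of misses. This is only a sketch under that assumption; the slides do not say how the regions were chosen, and the function name and the counts below are hypothetical.

```python
def regions_covering(miss_counts, fraction=0.9):
    """Return the highest-count regions that together cover `fraction` of misses."""
    total = sum(miss_counts.values())
    if total == 0:
        return []
    chosen, covered = [], 0
    ranked = sorted(miss_counts.items(), key=lambda item: item[1], reverse=True)
    for region, count in ranked:
        if covered >= fraction * total:
            break
        chosen.append(region)
        covered += count
    return chosen


# Hypothetical miss counts per address region, for illustration only.
example_counts = {"BufferPool": 600, "Data,BSS,Heap": 320, "Kernel": 50, "Stack": 30}
selected = regions_covering(example_counts)
```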
Data Analysis - 2
Private data loads
  Accessible only to owner process
  Examples: process’ return stack, local variables
  Ideal: Remain close to executing processor
Shared data loads
  Accessible by every TPC-C process
  Examples: application code, global variables
  Ideal: Remain in higher levels of memory hierarchy
Results
[Figure: 32-Processor Data, two bar charts repeated across five slides]
  Distribution of Data Load Hits: BUFFER_POOL (shared)
  Distribution of Data Load Hits: DATABSSHEAP (private)
  x-axis: Event Name (L2, L2.5 MOD, L2.75 MOD, L3, L3.5, MEM)
  y-axis: 0 to 120000
  Series: DataLoadHits, UniqueCacheLines
Annotations highlighted on successive slides
  Good Application/Architecture Match
  Possible Performance Impediment
  Shared Data References More Localized than Private Data References
  MEM Data Load Hits Primarily Due To Compulsory Misses
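The two series in these charts, DataLoadHits and UniqueCacheLines, can be derived from the sampled data addresses of one event and one address region. The sketch below assumes a 128-byte cache line (this value is an assumption and should be checked against the platform documentation) and uses a hypothetical function name. A higher hits-per-line ratio suggests more localized references, while a ratio close to 1 means most sampled lines were loaded only once.

```python
CACHE_LINE_BYTES = 128  # assumption: cache line size, not stated on the slides


def locality_summary(data_addrs, line_bytes=CACHE_LINE_BYTES):
    """Return (hits, unique cache lines, hits per unique line) for sampled loads."""
    unique_lines = {addr // line_bytes for addr in data_addrs}
    hits = len(data_addrs)
    unique = len(unique_lines)
    ratio = hits / unique if unique else 0.0
    return hits, unique, ratio


# Hypothetical sampled data addresses, for illustration only.
sampled = [0x218880, 0x218884, 0x2188FF, 0x218900]
hits, unique, ratio = locality_summary(sampled)
```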
Conclusions - 1
Developed new performance evaluation framework
  Applicable to large SMP systems
  Sampled performance monitor event traces
    Manageable, Collected in real-time
  Core
    Database management system (MySQL), Java tools
Applied methodology to study memory-subsystem behavior
  TPC-C executing on p690
  Evaluated differences between private and shared data loads
Conclusions - 2
References for private data
  Satisfied within the MCM
  Good application/architecture match
References for shared data
  Referenced outside the MCM
  Increased locality of reference
  Target for performance improvement
Main memory accesses primarily associated with compulsory misses
Future Work
Quantify representativeness of sampled event traces
Enhance performance evaluation framework
Expand study of application data load behavior
  e.g., process characterization
Suggest ways to improve performance of TPC-C executing on p690
  Improved memory management of Buffer Pool resulting in performance improvements
  Track performance impediments to actual code and/or data structures
Thank You.
Questions?