Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications

Jeff Diamond¹, Martin Burtscher², John D. McCalpin³, Byoung-Do Kim³, Stephen W. Keckler¹˒⁴, James C. Browne¹

¹University of Texas, ²Texas State, ³Texas Advanced Computing Center, ⁴NVIDIA

Trends In Supercomputers

Is multicore an issue?

The Problem: Multicore Scalability

The Problem: Multicore Scalability

Optimizations Differ in Multicore

Base code vs Multicore Optimized code

Paper Contributions

• Studies multicore-related bottlenecks
• Identifies performance measurement challenges unique to multicore systems
• Presents a systematic approach to multicore performance analysis
• Demonstrates principles of optimization

Talk Outline

Introduction
Approach: An HPC Case Study
Multicore Measurement Issues
Optimization Example
Conclusion

Approach: An HPC Case Study

Examine a real HPC application
• Major functions add variety

What is a typical HPC application?
• Many exhibit low arithmetic intensity
• Typical of explicit / iterative solvers, stencils
• Finite volume / elements / differences
• Molecular dynamics, particle simulations, graph search, sparse MM, etc.
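As a rough illustration of what "low arithmetic intensity" means, the sketch below counts floating-point operations per byte of memory traffic for a simple 5-point stencil update. The stencil, the 8-byte element size and the traffic model (each point read once and written once per sweep, with ideal cache reuse of neighbours) are assumptions chosen for illustration; none of them is taken from HOMME.

```python
def arithmetic_intensity(flops_per_point, bytes_per_point):
    """Floating-point operations performed per byte moved to or from memory."""
    return flops_per_point / bytes_per_point


# Hypothetical 5-point stencil: out[i][j] = c * (centre + 4 neighbours)
# -> 4 additions + 1 multiplication per grid point.
FLOPS_PER_POINT = 5

# Assumed traffic with ideal neighbour reuse in cache:
# one 8-byte read of the input point and one 8-byte write of the output point.
BYTES_PER_POINT = 2 * 8

stencil_ai = arithmetic_intensity(FLOPS_PER_POINT, BYTES_PER_POINT)
print(f"{stencil_ai:.4f} flops per byte")
```

Under these assumptions the kernel performs well under one flop per byte, so its speed is governed by how fast data can be delivered rather than by how fast the cores can compute.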

Approach: An HPC Case Study

Application: HOMME
• High Order Method Modeling Environment
• 3-D atmospheric simulation from NCAR
• Required for NSF acceptance testing
• Excellent scaling, highly optimized
• Arithmetic intensity typical of stencil codes

Supercomputers:
Ranger – 62,976 cores, 579 Teraflops
• 2.3 GHz quad core AMD Barcelona chips
Longhorn – 2,048 cores + 512 GPUs
• 2.5 GHz quad core Intel Nehalem-EP chips

Talk Outline

Introduction
Approach: An HPC Case Study
Multicore Measurement Issues
Optimization Example
Conclusion

Multicore Performance Bottlenecks

[Diagram: memory hierarchy of one node, with an L1 and L2 per core and one L3. Labels: private L1/L2 cache, shared L3 cache, shared off-chip BW, shared DRAM page caches, single chip, single DIMM, node, local DRAM.]

Disturbances Persist Longer

Measurement Implications

Measurements Must Be Lightweight

Action                       Cycles
Read Counter                      9
Read Four Counters               30
Call Function                    40
PAPI READ                       400
System Call                   5,000
TLB Page Initialization      25,000

Duration of major HOMME functions

Function Duration          Calls Per Second    % Exec Time
2,000 cycles or less                100,000            20%
2,000 to 10,000 cycles               20,000            10%
10K to 200K cycles                    1,600            15%
200K to 1M cycles                       200            15%
1M to 10M cycles                          -             0%
10M or more cycles                        4            35%
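To see why these costs matter, the sketch below expresses each measurement action as a fraction of a 2,000-cycle function, the shortest duration class listed for HOMME. It assumes every call is bracketed by two measurements, one on entry and one on exit; that bracketing scheme and the names `COST_CYCLES` and `bracket_overhead` are illustrative assumptions, not part of the talk.

```python
# Cost of each measurement action in cycles, copied from the slide.
COST_CYCLES = {
    "Read Counter": 9,
    "Read Four Counters": 30,
    "Call Function": 40,
    "PAPI READ": 400,
    "System Call": 5_000,
    "TLB Page Initialization": 25_000,
}


def bracket_overhead(action, function_cycles, reads_per_call=2):
    """Extra time added by measuring, as a fraction of the function's duration.

    reads_per_call=2 is an assumption: one measurement on entry, one on exit.
    """
    return reads_per_call * COST_CYCLES[action] / function_cycles


for action in COST_CYCLES:
    pct = 100 * bracket_overhead(action, 2_000)
    print(f"{action:<24} {pct:8.1f}% of a 2,000-cycle function")
```

Under these assumptions a pair of PAPI reads adds 40% to a 2,000-cycle function, while a pair of four-counter reads adds 3%.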

Multicore Measurement Issues

Performance issues in shared memory systems
• Context sensitive
• Nondeterministic
• Highly non-local

Measurement disturbance is significant
• Accessing memory or delaying core
• Hard to “bracket” measurement effects
• Disturbances can last billions of cycles
• Bottlenecks can be “bursty”

Conclusion – need multiple tools

Talk Outline

Introduction
Approach: An HPC Case Study
Multicore Measurement Issues
Optimization Example
Conclusion

Multicore Performance Bottlenecks

[Diagram: memory hierarchy of one node, with an L1 and L2 per core and one L3. Labels: shared L3 cache, shared off-chip BW, shared DRAM page caches, single chip, single DIMM, node, local DRAM.]

Measurement Approach

Find important functions
Compare performance counters at min/max core density
Identify key multicore bottleneck:
• L3 capacity – L3 miss rates increase with density
• Off-chip BW – BW usage at min density is greater than the core's fair share
• DRAM contention – DRAM page miss rates increase with density

For small and medium functions, follow up with lightweight / temporal measurements
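The three tests above can be sketched as a small classifier over counter-derived metrics. Everything here is a hypothetical illustration: the `CounterSample` fields, the 10% `tolerance` used to decide that a rate "increases", and the reading of "share" as total chip bandwidth divided by cores per chip are assumptions, not the authors' tooling.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Counter-derived metrics for one function at one core density."""
    l3_miss_rate: float          # L3 misses / L3 accesses
    dram_page_miss_rate: float   # DRAM page misses / DRAM accesses
    offchip_bw_per_core: float   # GB/s consumed by one core


def classify_bottlenecks(min_density, max_density, chip_bw, cores_per_chip,
                         tolerance=0.10):
    """Return the multicore bottlenecks suggested by a min/max density comparison."""
    findings = []
    # L3 capacity: miss rate grows when more cores share the cache.
    if max_density.l3_miss_rate > min_density.l3_miss_rate * (1 + tolerance):
        findings.append("L3 capacity")
    # Off-chip BW: a lone core already uses more than an equal split would allow.
    fair_share = chip_bw / cores_per_chip
    if min_density.offchip_bw_per_core > fair_share:
        findings.append("Off-chip BW")
    # DRAM contention: page miss rate grows when more cores share the DIMMs.
    if max_density.dram_page_miss_rate > min_density.dram_page_miss_rate * (1 + tolerance):
        findings.append("DRAM contention")
    return findings


# Made-up numbers, for illustration only.
alone = CounterSample(l3_miss_rate=0.10, dram_page_miss_rate=0.20,
                      offchip_bw_per_core=3.0)
packed = CounterSample(l3_miss_rate=0.25, dram_page_miss_rate=0.21,
                       offchip_bw_per_core=2.0)
result = classify_bottlenecks(alone, packed, chip_bw=8.0, cores_per_chip=4)
print(result)
```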

Typical HOMME Loop

Apply “Microfission” (First Line)

“Loop Microfission”

Local, context-free optimization
• Each array processed independently
• Add high-level blocking to fit cache

Reduces total DRAM banks
• Statistically reduces DRAM page miss rate

Reduces instantaneous working set size
• Helps with L3 capacity and off-chip BW
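One way to read these bullets is sketched below: a loop that streams several arrays at once is split into passes that each touch a single large array, with a small block-sized temporary carrying intermediate values between passes. This is an illustrative reconstruction, not the HOMME code; the function names and the `block` parameter are hypothetical, and Python lists show only the loop structure, not the cache or DRAM behaviour.

```python
def fused(a, b, c):
    """Baseline: one loop streams all four arrays (a, b, c, out) at once."""
    n = len(a)
    out = [0.0] * n
    for i in range(n):
        out[i] = a[i] * b[i] + c[i]
    return out


def microfissioned(a, b, c, block=1024):
    """Same result, but each pass streams only one large array.

    `block` is a hypothetical tuning parameter: the temporary `tmp` is
    meant to be small enough to stay resident in cache.
    """
    n = len(a)
    out = [0.0] * n
    for start in range(0, n, block):      # high-level blocking
        stop = min(start + block, n)
        tmp = a[start:stop]               # pass 1: read a only
        for k in range(stop - start):     # pass 2: read b only
            tmp[k] *= b[start + k]
        for k in range(stop - start):     # pass 3: read c only
            tmp[k] += c[start + k]
        out[start:stop] = tmp             # pass 4: write out only
    return out
```

Both versions perform the same multiply and add in the same order for each element, so they return identical results; the difference is that at any instant the split version works on one large array plus the small temporary.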

Microfission Results

Talk Outline

Introduction
Approach: An HPC Case Study
Multicore Measurement Issues
Optimization Example
Conclusion

Summary and Conclusions

HPC scalability must include multicore
• Not well understood
• Requires new analysis and measurement techniques
• Optimizations differ from single-core

Microfission is just one example
• Multicore locality optimization for shared caches
• Improves performance by 35%

Future Work

Expect multicore observations to apply to other HPC applications with low arithmetic intensity
• Irregular parallel applications: adaptive meshes, heterogeneous workloads
• Irregular blocking applications: graph traversal

Wider range of multicore (memory-focused) optimizations
• Recomputation
• Relocating data
• Temporary storage reduction
• Structural changes

Thank You

Any Questions?

BACKUP SLIDES…

Less DRAM Contention

Multicore Optimized, Low Density

Most important functions

L1 & L2 Miss Rates Less Relevant

HPC Applications Have Low Intensity

Loads Per Cycle vs Intrachip Scaling

Oscillations Affect L2 Miss Rate

Oscillations Affect L2 Miss Rate