
Page 1:

EXPLORING SOFTWARE SCALABILITY AND TRADE-OFFS IN THE MULTI-CORE ERA WITH

FAST AND ACCURATE MICRO-ARCHITECTURAL SIMULATION

TREVOR E. CARLSON, WIM HEIRMAN, SOURADIP SARKAR, ZHE MA, PIETER GHYSELS,

WIM VANROOSE, LIEVEN EECKHOUT

TREVOR.CARLSON@ELIS.UGENT.BE HTTP://WWW.ELIS.UGENT.BE/~TCARLSON

WEDNESDAY, FEBRUARY 15TH, 2012 PP12, SAVANNAH, GA

Page 2:

HPC PERFORMANCE CHALLENGES

Source: Yelick, EXADAPT 2011

Single-core performance is not keeping pace

Page 3:

HPC POWER CHALLENGES

• #1 on TOP500 (K computer, Japan) consumes 12.7 MW @ 10.5 petaflops

• Exascale goal: 2018, 20 MW @ 1,000 petaflops – ~2x power, 100x performance (beyond Moore’s law)

Source: Yelick, EXADAPT 2011

Page 4:

HPC SOFTWARE, HARDWARE CHALLENGES

• New programming models – Pthreads / OpenMP / Cilk++ / MPI / PGAS

• Hardware is becoming increasingly heterogeneous and diverse

– NUMA

– CPU Turbo Mode / DVFS

– Out-of-order (Xeon) vs. in-order (Atom/MIC)

– NUCA (future)

• Energy consumption – For current machines, for future systems

• Reliability at large thread and core counts is a big concern

– Large clusters of CPUs or GPUs will see regular system-wide failure rates (on the order of months for current systems, down to hours or days for very large systems)

4

Page 5:

OVERVIEW

• Why use a Simulator?

• About the Sniper Multi-core Simulator

– Interval core model

– Parallel, fast and accurate

• Application Feedback

– CPI Stacks and software scaling

– Software Optimization Case Study

5

Page 6:

OVERVIEW

• Why use a Simulator?

• About the Sniper Multi-core Simulator

– Interval core model

– Parallel, fast and accurate

• Application Feedback

– CPI Stacks and software scaling

– Software Optimization Case Study

6

Page 7:

TYPICAL SYSTEM CACHE HIERARCHY

[Figure: a 16-core system; each core has private L1I and L1D caches, each pair of cores shares an L2, each group of four cores shares an L3, and all L3s connect to DRAM.]

Page 8:

WHY IS MY CODE SLOWER THAN EXPECTED?

• Traditional on-line analysis tools do not tell the whole story – cache miss counts alone do not explain performance

• Performance counters/cache miss rates do not give an accurate picture of performance

• Tools like Valgrind can also report cache hits and misses, but they do not quantify the impact on runtime

– There is no easy way to see where the software’s lost cycles are going

– VTune can help with specific problems, but does not provide a per-component breakdown

8

Page 9:

PERFORMANCE QUESTIONS NEED ANSWERS

• Scalability – More cores vs. more nodes

– Strong vs. weak scaling analysis

• Performance – How will it perform and scale on next-generation hardware?

• Hardware options – Is it better to have fewer fast cores, or more slower cores?

– Will an in-order core be sufficient and power efficient?

9

Page 10:

OVERVIEW

• Why use a Simulator?

• About the Sniper Multi-core Simulator

– Interval core model

– Parallel, fast and accurate

• Application Feedback

– CPI Stacks and software scaling

– Software Optimization Case Study

10

Page 11:

• Out-of-order core performance model with in-order simulation speed

INTERVAL SIMULATION

11

[Figure: effective dispatch rate over time; miss events (I-cache miss, branch misprediction, long-latency load miss) divide execution into intervals 1–3.]

D. Genbrugge et al., HPCA’10; S. Eyerman et al., ACM TOCS, May 2009

T. Karkhanis and J. E. Smith, ISCA’04, ISCA’07

Page 12:

KEY BENEFITS OF THE INTERVAL MODEL

• Models superscalar OOO execution

• Models impact of ILP

• Models second-order effects: MLP

• Allows for constructing CPI stacks

12

Page 13:

LONG-LATENCY MISS EVENTS: ISOLATED LONG-LATENCY LOAD

S. Eyerman et al., ACM TOCS, May 2009

13

Page 14:

LONG-LATENCY MISS EVENTS: OVERLAPPING LONG-LATENCY LOADS

S. Eyerman et al., ACM TOCS, May 2009

14

Page 15:

SNIPER SIMULATION ENVIRONMENT

• User-level, x86-64, parallel (multi-threaded)

• Based on the MIT Graphite Simulator

• Many features

– Interval core model, CPI stacks

– Shared cache models, DVFS

– OpenMP and TBB support, etc.

• Hardware-validated against a 16-core Intel Xeon X7460 Dunnington machine

15

Page 16:

INTERVAL PROVIDES NEEDED ACCURACY

16

The interval core model provides consistent accuracy, with an average absolute error of 25% and minimal slowdown

T. E. Carlson et al., SC11

Page 17:

INTERVAL: GOOD OVERALL ACCURACY

17

Good accuracy for the entire benchmark suite

T. E. Carlson et al., SC11

Page 18:

SIMULATION PERFORMANCE

18

Sniper currently scales to 2 MIPS

Typical simulators run at tens to hundreds of KIPS, and do not scale

T. E. Carlson et al., SC11

Page 19:

OVERVIEW

• Why use a Simulator?

• About the Sniper Multi-core Simulator

– Interval core model

– Parallel, fast and accurate

• Application Feedback

– CPI Stacks and software scaling

– Software Optimization Case Study

19

Page 20:

IPC TRACE – TYPICAL SIMULATOR OUTPUT

[Figure: per-thread IPC trace over time for ferret-large.]

IPC traces do not provide insight into the application’s behavior

20

Page 21:

CYCLE STACKS

• Where did my cycles go?

• CPI stack: cycles per instruction, broken up into components

• Normalize by either

– Number of instructions (CPI stack)

– Execution time (time stack)

• Different from miss rates: cycle stacks directly quantify the effect on performance

[Figure: example CPI stack with Base, Branch, I-cache, and L2 cache components.]

21

Heirman et al., IISWC, Nov 2011

Page 22:

CYCLE STACKS FOR PARALLEL APPLICATIONS

• Homogeneous application with heterogeneous performance

22

Heirman et al., IISWC, Nov 2011

Page 23:

USING CYCLE STACKS TO EXPLAIN SCALING BEHAVIOR

23

Page 24:

USING CYCLE STACKS TO EXPLAIN SCALING BEHAVIOR

• Scale input: application becomes DRAM bound

24

Page 25:

USING CYCLE STACKS TO EXPLAIN SCALING BEHAVIOR

• Scale input: application becomes DRAM bound

• Scale core count: synchronization losses increase to 20%

25

Page 26:

Carlson et al., SC11, Nov 2011

SUGGEST APPLICATION IMPROVEMENTS

Replacing a pthread mutex with a LOCK INC instruction

26

Page 27:

ANALYZE SYNCHRONIZATION BEHAVIOR

[Figure: thread-state timeline for barnes, cores 0–7 over time, showing synchronization, critical sections, and load imbalance. Legend: Critical section / Blocked / Working.]

27

Page 28:

CASE STUDY: TILED HEAT TRANSFER

P. Ghysels, 2011

• 5-point stencil, applied to consecutive time steps

• Optimization: tiled to improve locality, with multiple time steps per tile – but this requires redundant computation at tile edges

28

Page 29:

TILE SIZE AND STEPS VS. CACHE BEHAVIOR

[Figure: tile_heat – cache behavior vs. tile size and time steps per tile.]

29

Page 30:

NOT JUST TIME, ENERGY AS WELL

Integration with McPAT provides application-specific estimates for power and energy (and EDP, ED2P, …)

Li et al., MICRO’09

30

Page 31:

WHERE IS THE ENERGY GOING?

EDP: Energy Delay Product

31

Page 32:

ARCHITECTURAL EXPLORATION

Experiment: double the size of L2 and L3 caches

32

Page 33:

SNIPER SIMULATION ENVIRONMENT

• Source code is publicly available

• Discussion board for Q&A

• Open source (MIT license; interval model under an academic license), available at

http://snipersim.org

33

Page 34:

CONCLUSIONS

• Detailed application understanding is needed for complex trade-off analysis

– Raw application performance

– Software algorithm optimization

– Energy/power analysis

• More accurate than instrumentation (non-intrusive), with higher visibility

• Sniper is a fast and accurate simulator for multi-core processors

• Faster than most simulators, so Sniper can be used to model the effects of large caches, large input sets, and multiple runs

• Allows for architectural exploration (unlike performance counters)

34

Page 35:

SOFTWARE ANALYSIS AND EXPLORATION USING FAST AND ACCURATE MICRO-ARCHITECTURAL SIMULATION

TREVOR E. CARLSON, WIM HEIRMAN, SOURADIP SARKAR, ZHE MA, PIETER GHYSELS,

WIM VANROOSE, LIEVEN EECKHOUT

TREVOR.CARLSON@ELIS.UGENT.BE HTTP://WWW.ELIS.UGENT.BE/~TCARLSON

WEDNESDAY, FEBRUARY 15TH, 2012 PP12, SAVANNAH, GA