26
spcl.inf.ethz.ch @spcl_eth SABELA RAMOS, T ORSTEN HOEFLER Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL

ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

SABELA RAMOS, TORSTEN HOEFLER

Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL

Page 2: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

Microarchitectures are becoming more and more complex

CPU

L1

DDR

CPU

L1

DDR

CPU

L1

CPU

L1

DDR

CPU

L1

L2

CPU

L1

DDR

CPU

L1

L2

… CPU

L1

DDR

CPU

L1

L2

CPUL1L2

CPUL1L2

CPUL1L2

CPUL1L2

GDDR5 GDDR5

2

Page 3: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

▪ Performance engineering: “encompasses the set of roles, skills, activities, practices, tools, and deliverables applied at every phase of the systems development life cycle which ensures that a solution will be designed, implemented, and operationally supported to meet the non-functional requirements for performance (such as throughput, latency, or memory usage).”

▪ Manually profile codes and tune them to the given architecture

▪ Requires highly-skilled performance engineers

▪ Need familiarity with

NUMA (topology, bandwidths etc.)

Caches (associativity, sizes etc.)

Microarchitecture (number of outstanding loads etc.)

How to optimize codes for these complex architectures?

3

Trust me, I’m an engineer!

Page 4: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

4

An engineering example – Tacoma Narrows Bridge

Page 5: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

Scientific Performance Engineering

1) Observe2) Model

3) Understand4) Build

5

Page 6: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

EDC EDC EDC EDC

EDC EDC EDC EDC

Tile Tile Tile TileTile Tile

Tile Tile Tile Tile

Tile Tile Tile TileTile Tile

IMC Tile Tile IMCTile Tile

Tile Tile Tile TileTile Tile

Tile Tile Tile TileTile Tile

Tile Tile Tile TileTile Tile

Misc

IIO

MCDRAM MCDRAM MCDRAM MCDRAM

MCDRAM MCDRAM MCDRAM MCDRAM

PCIe

DDR DDR

Modeling by example: KNL Architecture (mesh)

Core

CHA

1 MB

L2Core

2 VPU2 VPU

6

Page 7: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

KNL Architecture (memory: Flat & Cache)

EDC EDC EDC EDC

EDC EDC EDC EDC

Tile Tile Tile TileTile Tile

Tile Tile Tile Tile

Tile Tile Tile TileTile Tile

IMC Tile Tile IMCTile Tile

Tile Tile Tile TileTile Tile

Tile Tile Tile TileTile Tile

Tile Tile Tile TileTile Tile

Misc

IIO

MCDRAM MCDRAM MCDRAM MCDRAM

MCDRAM MCDRAM MCDRAM MCDRAM

PCIe

DDR DDR

Core

CHA

1 MB

L2Core

2 VPU2 VPU

7

Page 8: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

KNL Architecture (all to all mode)

EDC EDC EDC EDC

EDC EDC EDC EDC

Tile Tile Tile TileTile Tile

Tile Tile Tile Tile

Tile Tile Tile TileTile Tile

IMC Tile Tile IMCTile Tile

Tile Tile Tile TileTile Tile

Tile Tile Tile TileTile Tile

Tile Tile Tile TileTile Tile

Misc

IIO

MCDRAM MCDRAM MCDRAM MCDRAM

MCDRAM MCDRAM MCDRAM MCDRAM

PCIe

DDR DDR

Core

CHA

1 MB

L2Core

2 VPU2 VPU

8

Page 9: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

KNL Architecture (Quadrant or Hemisphere)

EDC EDC EDC EDC

EDC EDC EDC EDC

Tile Tile Tile TileTile Tile

Tile Tile Tile Tile

Tile Tile Tile TileTile Tile

IMC Tile Tile IMCTile Tile

Tile Tile Tile TileTile Tile

Tile Tile Tile TileTile Tile

Tile Tile Tile TileTile Tile

Misc

IIO

MCDRAM MCDRAM MCDRAM MCDRAM

MCDRAM MCDRAM MCDRAM MCDRAM

PCIe

DDR DDR

Core

CHA

1 MB

L2Core

2 VPU2 VPU

9

Page 10: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

KNL Architecture (SNC-4 or SNC-2)

EDC EDC EDC EDC

EDC EDC EDC EDC

Tile Tile Tile TileTile Tile

Tile Tile Tile Tile

Tile Tile Tile TileTile Tile

IMC Tile Tile IMCTile Tile

Tile Tile Tile TileTile Tile

Tile Tile Tile TileTile Tile

Tile Tile Tile TileTile Tile

Misc

IIO

MCDRAM MCDRAM MCDRAM MCDRAM

MCDRAM MCDRAM MCDRAM MCDRAM

PCIe

DDR DDR

Core

CHA

1 MB

L2Core

2 VPU2 VPU

10

Page 11: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

EDC EDC EDC EDC

EDC EDC EDC EDC

Tile Tile Tile TileTile Tile

Tile Tile Tile Tile

Tile Tile Tile TileTile Tile

IMC Tile Tile IMCTile Tile

Tile Tile Tile TileTile Tile

Tile Tile Tile TileTile Tile

Tile Tile Tile TileTile Tile

Misc

IIO

MCDRAM MCDRAM MCDRAM MCDRAM

MCDRAM MCDRAM MCDRAM MCDRAM

PCIe

DDR DDR

KNL Architecture

How much does this all matter?

What is the real cost of accessing cache?

What is the cost of accessing memory?

11

Page 12: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

Step 1: Understand core-to-core transfers – MESIF cache coherence

12

Write back overhead

Location: only 5-15% difference

That is curious!

Contention effects?

All values are medians within 10% of the 95% nonparametric CI, cf. TH, RB: “Scientific Benchmarking of Parallel Computing Systems”, SC16

Page 13: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

Step 2: Understand core-to-memory transfers – DRAM and MCDRAM

13All values are medians within 10% of the 95% nonparametric CI, cf. TH, RB: “Scientific Benchmarking of Parallel Computing Systems”, SC16

MCDRAM 20% slower!

MCDRAM 4-6x faster!

Need to read and write for full bandwidth

Cache mode >20% slower

Bandwidth suffers a bit

Page 14: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

Performance engineers optimize your code!

14

Are you kidding me?

Page 15: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

15

A principled approach to designing cache-to-cache broadcast algorithms

Tree cost

Tree depth

Reached threads

S. Ramos, TH: “Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi “, ACM HPDC’13S. Ramos, TH: “Cache line aware optimizations for ccNUMA systems (IEEE TPDS’17).

0

2

4 5 6 7

Multi-ary tree example

3 8

1

depth d = 2

k1 = 2

k2 = 3

Level size

Page 16: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

Model-driven performance engineering for broadcast

16

Binary or binomial trees?

Oh! But sure I can do that.

Page 17: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

Model-driven performance engineering for broadcast

17

Binary or binomial trees?

Oh! But sure I can do that.

13x faster, lower

variance

Page 18: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

18

Easy to generalize to similar algorithms

Barrier (7x faster than OpenMP) Reduce (5x faster then OpenMP)

Page 19: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

19

Sorting in complex memories

Check! Let’s do something “Big Data”!

TriadCache Mode

SNC4

TriadFlat Mode

SNC4

We’ll need a parallel sort on all cores, right?

Page 20: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

Memory model: Bitonic Mergesort

CPU0

CPU0

CPU1

• Slices of 16 elements go through a bitonic network.• Communication: CPU0 accesses data from local and

remote caches.• Synchronization: CPU0 waits for CPU1.• Memory accesses: latency vs. bandwidth.

20

Page 21: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

21

Modeling Bitonic Sorting memory access cost

L1 access cost

elements in L1

L2 access cost

elements in L2

Page 22: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

22

Bitonic Sort of 1 kiB measurement

model including synchronization cost

model (just) memory costs

Synchronization outweights memory costs for small data!

So don’t parallelize too much!

Page 23: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

23

Bitonic Sort of 4 MiB

model including synchronization cost

model (just) memory costs“parallelization

boundary”

Page 24: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

24

Bitonic Sort of 1 GiB

synchronization negligible

Always use all cores!

Page 25: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

25

The most surprising result last …

Hey, but which memory? DRAM

or MCDRAM?

The model (and practical measurements) indicate that it does not matter.

Thesis: the higher bandwidth of MCDRAM did not help due to the higher latency (log2 n depth).

Disclaimer: this is NOT the best sorting algorithm for Xeon Phi KNL. It is the best we found with limited effort. We suspect that

a combination of algorithms will perform best.

Page 26: ABELA AMOS, TORSTEN OEFLER Capability Models for Manycore …htor.inf.ethz.ch/publications/img/hoefler-memory-model... · 2018. 2. 24. · SABELA RAMOS, TORSTEN HOEFLER Capability

spcl.inf.ethz.ch

@spcl_eth

26

Scientific performance engineering for complex memory systems

Questions/Discussions?