Memory Bound Wave Propagation at Hardware Limit

Igor Podladtchikov, Spectraseis Inc

March 19, 2013 | GTC 2013

© Spectraseis Inc. 2013

Microseismic Monitoring
Time Reversed Imaging (TRI)

Geophysical method to locate subsurface events:
• Propagate and image time-reversed data acquired at the surface

Use the full wave equation:
• Acoustic or elastic
• Heterogeneous materials

Need very fast solvers:
• Thousands of events
• Big models

Roadmap

• Performance Expectations
• Results
• Acoustic Solver Implementation
• Elastic Solver Implementation
• Summary

Performance Limiters

The two principal performance limiters:

Processors
• Computation
• 1000 Gflop/s

Memory
• Transfer
• 100 GB/s

Acoustic Equations

• 1 variable read & write
• 2 variables read only

Elastic Equations

• 9 variables read & write
• 3 variables read only

Arithmetic Intensity

flops / bytes ratio:
• Compute or memory bound?

Count BYTES, not numbers:
• Single precision: 4 bytes
• e.g. 1st derivative: 2 reads, 1 write, 12 bytes transferred

Machine Balance

M2070 machine balance:
• Peak flops / bytes : 1030 Gflop/s / 117 GB/s ~ 9

K10 machine balance:
• Peak flops / bytes : 4577 Gflop/s / 228 GB/s ~ 20

Arithmetic intensity:
• << machine balance : memory bound
• >> machine balance : compute bound

Arithmetic Intensity

Machine Balance:
− Fermi: ~ 9
− Kepler: ~ 20

         ∂/∂x    ∂²/∂x²   acoustic   elastic
flops    2       5        24         99
bytes    12      16       68         444
ratio    0.17    0.31     0.35       0.22

All ratios << 9: we’re memory bound
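The comparison between arithmetic intensity and machine balance can be reproduced with a few lines of C. This is a minimal sketch: the function names are illustrative and not from the talk, and the flop and byte counts are simply the ones quoted on the slides.

```c
#include <stdio.h>

/* Arithmetic intensity: floating point operations per byte moved. */
double arithmetic_intensity(double flops, double bytes) {
    return flops / bytes;
}

/* Machine balance: peak Gflop/s divided by peak GB/s. */
double machine_balance(double peak_gflops, double peak_gbs) {
    return peak_gflops / peak_gbs;
}

/* A kernel is memory bound when its intensity is below the machine balance. */
int is_memory_bound(double intensity, double balance) {
    return intensity < balance;
}

/* Print the four cases from the slide against the M2070 balance. */
void print_intensity_table(void) {
    const char  *name[]  = {"1st derivative", "2nd derivative", "acoustic", "elastic"};
    const double flops[] = {2, 5, 24, 99};
    const double bytes[] = {12, 16, 68, 444};
    const double m2070   = machine_balance(1030.0, 117.0);
    for (int i = 0; i < 4; i++) {
        double ai = arithmetic_intensity(flops[i], bytes[i]);
        printf("%-15s %.2f flops/byte, %s\n", name[i], ai,
               is_memory_bound(ai, m2070) ? "memory bound" : "compute bound");
    }
}
```

Calling `print_intensity_table()` lists every case as memory bound, since all four ratios are well below the balance of either GPU.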

Memory Bound – What To Do?

Option A:
− Give up (don’t even try)
− Blame memory bound for slow code

Option B:
− Celebrate: FLOPS are for FREE
− Optimize memory access efficiency
− Count bytes, not flops
− Try to approach memcpy throughput

Our claim: real world applications can run close to memcpy!

How to optimize memory access?

Try to avoid redundant reads / writes
− Aim for minimum reads / writes
• Touch everything once (un-improvable)
• Don’t read neighbors twice

[Figure: grid points annotated “Read me once… …don’t read me again!”]

Don’t cheat (yourself)

− Track optimization progress
− Don’t count neighbors in your performance metric!

“Remember traffic is the volume of data to a particular memory. It is not the number of loads and stores”
Source: Performance Tuning of Scientific Applications

Don’t count neighbor reads!

Ideal Memory Throughput

No neighbors!

MTP = N_IO * Grid Size * Word Size / Time Elapsed

N_IO = 2 * DOF + Constants

DOF : degrees of freedom (read and write)
Constants : read only
Grid Size : nx * ny * nz * nt
Word Size : 4 bytes (single precision)

Ideal Memory Throughput

Acoustic: MTP = 4 * Grid Size * 4 bytes / (Time Elapsed * 2^30) [GB/s]
Elastic: MTP = 21 * Grid Size * 4 bytes / (Time Elapsed * 2^30) [GB/s]

N_IO Acoustic : 2 * 1 + 2 = 4
N_IO Elastic : 2 * 9 + 3 = 21
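The metric is simple enough to write down as code. The sketch below assumes 1 GB = 2^30 bytes, as the slide formula suggests; the function names `n_io` and `ideal_mtp_gbs` are illustrative.

```c
/* Number of ideal IO operations per grid point: every degree of freedom is
   read and written once, every constant is read once. */
int n_io(int dof, int constants) {
    return 2 * dof + constants;
}

/* Ideal memory throughput in GB/s (1 GB = 2^30 bytes).
   grid_size is nx * ny * nz * nt, word_bytes is 4 for single precision. */
double ideal_mtp_gbs(int nio, double grid_size, double word_bytes, double seconds) {
    const double gb = 1073741824.0; /* 2^30 */
    return nio * grid_size * word_bytes / (seconds * gb);
}
```

For example, `ideal_mtp_gbs(n_io(1, 2), nx * ny * nz * nt, 4.0, elapsed)` gives the acoustic figure. Neighbor reads and temporaries never enter the formula, so the number can only go up by making the run faster.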


Results

− All solvers include
• free surface
• absorbing layer
• domain decomposition along all 3 axes
• IPC if GPUs are mappable, MPI otherwise
− All solvers verified against single CPU code
− All data from NVIDIA PSG Cluster – Thank You!

Without further ado...

Memory Throughput on M2070

[Figure: MTP (GB/s, 0 to 120) vs. cube size (64 to 704) on M2070; series: Memcpy, Pressure, Density, Elastic]

Real physics at 85% and 52% of hardware limit

Neighbors Don’t Count!

Acoustic pressure update:

po[CENTER] = 2*pcc - po[CENTER]*abs + vp2[CENTER] *
(
    pc[LEFT ] + pc[RIGHT ]      // neighbor reads
  + pc[LEFT2] + pc[RIGHT2]      // neighbor reads
  + pcm + pcp                   // neighbor reads
  - 6*pcc
);

If we count neighbor reads as IO operations: 6 additional
− 10 IO operations
− 100 GB/s / 4 * 10 = 250 GB/s MTP peak on M2070
− Theoretical hardware limit is 150 GB/s

DON’T COUNT NEIGHBOR READS

Memory Throughput on K10

[Figure: MTP (GB/s, 0 to 250) vs. cube size (64 to 704) on K10; series: Memcpy, Pressure, Density, Elastic]

Strong scaling on both K10 GPUs
Same size and power consumption as M2070!

Other GPUs

[Figure: four bar charts (Memcpy, Pressure, Density, Elastic), axis ticks 50 to 250, one bar each for K10, K20X, K20, M2090, M2070, GK104]

The green cards win

Multi-GPU Weak Scaling on GK104

PCIe 2: 6 GB/s
PCIe 3: 12 GB/s

[Figure: two panels, Density (cube size 64 to 576) and Elastic (cube size 64 to 384); y-axis: per GPU MTP, % of single (0 to 100); series: 2 nodes (IPC), 4 nodes (MPI PCIe3), 8 nodes (IB FDR)]

Results Summary

− Defined ideal, un-improvable memory throughput
• MTP = N_IO * Grid Size * Word Size / time elapsed
• N_IO = 2*DOF + Const
• No neighbors or temporary variables
− Came close to memcpy with real world applications
• acoustic: 85 %
• elastic: 52 %
• performance proportional to memcpy on various architectures
− Solvers scale on multiple GPUs


General Considerations

Respect the number 32
− 32 x 8 thread-blocks
− Fast axis sizes multiples of 32 (can be padded)
− Hit global memory segments and L1 cache lines (32 x 4B = 128B)

Rely on cache
− Shared memory requires extra operations
− Shared memory needs __syncthreads()
− Registers are faster than shared memory
− If working set fits in cache, cache is faster

First Try – Acoustic Pressure

// Compute the (x, y) thread index and skip the boundary points.
#define EXIT_BND(xx,yy,nx,ny) \
    int xx = blockIdx.x*blockDim.x + threadIdx.x; if(xx < 1 || xx >= nx - 1) return; \
    int yy = blockIdx.y*blockDim.y + threadIdx.y; if(yy < 1 || yy >= ny - 1) return;

// Linear indices of the center point and its six neighbors (i1 is the fast axis).
#define CENTER (i1   +  i2   *n1 +  i3   *n1*n2)
#define RIGHT  (i1+1 +  i2   *n1 +  i3   *n1*n2)
#define LEFT   (i1-1 +  i2   *n1 +  i3   *n1*n2)
#define RIGHT2 (i1   + (i2+1)*n1 +  i3   *n1*n2)
#define LEFT2  (i1   + (i2-1)*n1 +  i3   *n1*n2)
#define TOP    (i1   +  i2   *n1 + (i3+1)*n1*n2)
#define BOT    (i1   +  i2   *n1 + (i3-1)*n1*n2)

__global__ void pressure_gpu_het_vp2(const float* pc, float* po, const float* vp2,
                                     const int n1, const int n2, const int n3){
    EXIT_BND(i1,i2,n1,n2)
    int i3;
    // Each thread marches along the slowest axis.
    for(i3 = 1; i3 < n3-1; i3++){
        po[CENTER] = 2*pc[CENTER] - po[CENTER] + vp2[CENTER] * (
            pc[LEFT] + pc[RIGHT] + pc[LEFT2] + pc[RIGHT2] + pc[BOT] + pc[TOP] -
            6*pc[CENTER] );
    }
}

Yes, that’s it

First Try – Acoustic Pressure

[Figure: MTP (GB/s, 35 to 95) vs. cube size (64 to 704), annotated “Yay!”, “Boo…” and “Boo Hoo Hoo…”]

Pretty good, but not good enough

First Try

Suspect TLB misses:
• Translation Lookaside Buffers
• Accelerate translation from virtual to physical memory
• Act like caches on the page table

“If the kernel’s working set … exceeds TLB capacity (or associativity) then one generates TLB capacity (or conflict) misses.”
Source: Performance Tuning of Scientific Applications

Batched Execution

If the kernel’s working set is too big, we’ll reduce it:
launch kernel batches for slowest axis

__global__ void pressure_gpu_het_vp2(const float* pc, float* po, const float* vp2,
                                     const int n1, const int n2, const int n3,
                                     const int offset){
    EXIT_BND(i1,i2,n1,n2)
    int i3;
    // Only the planes of the current batch, starting at offset, are updated.
    for(i3 = offset+1; i3 < offset+n3-1; i3++){
        po[CENTER] = 2*pc[CENTER] - po[CENTER] + vp2[CENTER] * (
            pc[LEFT] + pc[RIGHT] + pc[LEFT2] + pc[RIGHT2] + pc[BOT] + pc[TOP] -
            6*pc[CENTER] );
    }
}
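The slide shows only the kernel, not how the batches are launched. The following is a hedged CPU sketch of one way the host side could walk the slowest axis. It assumes each launch receives `batch + 2` planes, so that the kernel loop `offset+1 .. offset+n3-2` updates exactly `batch` planes and consecutive launches leave no gap. The names `pressure_cpu_batch`, `pressure_cpu_batched` and `n3b` are hypothetical.

```c
/* CPU stand-in for the batched kernel: updates the planes
   offset+1 .. offset+n3b-2, where n3b is the number of planes handed to
   this launch (the slide reuses the name n3 for it). */
void pressure_cpu_batch(const float *pc, float *po, const float *vp2,
                        int n1, int n2, int n3b, int offset) {
    for (int i3 = offset + 1; i3 < offset + n3b - 1; i3++)
        for (int i2 = 1; i2 < n2 - 1; i2++)
            for (int i1 = 1; i1 < n1 - 1; i1++) {
                int c = i1 + i2 * n1 + i3 * n1 * n2;
                po[c] = 2 * pc[c] - po[c] + vp2[c] * (
                    pc[c - 1] + pc[c + 1] + pc[c - n1] + pc[c + n1] +
                    pc[c - n1 * n2] + pc[c + n1 * n2] - 6 * pc[c]);
            }
}

/* Assumed host loop: walk the slowest axis in chunks of `batch` interior
   planes. Each launch sees batch + 2 planes, so the top and bottom
   neighbors of its first and last updated plane are in range. */
void pressure_cpu_batched(const float *pc, float *po, const float *vp2,
                          int n1, int n2, int n3, int batch) {
    for (int offset = 0; offset < n3 - 2; offset += batch) {
        int interior = n3 - 2 - offset;          /* planes still to update */
        if (interior > batch) interior = batch;
        pressure_cpu_batch(pc, po, vp2, n1, n2, interior + 2, offset);
    }
}
```

Because `pc` is read-only and each point of `po` is touched only by its own update, the batches are independent and the batched result matches the single-launch result.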

Batched Execution

[Figure: two panels of MTP (GB/s, 35 to 95) vs. cube size (64 to 704). “First Try vs. Batched”: batch 32, no batch. “Batched Execution”: batch sizes 32, 100, 300, no batch]

Done

The Density Problem

Density equation has Vp inside the difference, which means twice the number of neighbors to fetch:

[Figure: “Pressure vs. Density”, MTP (GB/s, 35 to 95) vs. cube size (64 to 704); series: Pressure, Density Naïve]

What Problem? Add Variable!

Just replace vp*current inside the derivative by a variable! At every time-step:
− launch ucvp = uc*vp kernel
− launch solver, take ucvp derivative

Why is it slower??
− we introduced an additional read+write
− the additional read+write don’t count! Same problem, same result, same performance metric formula!

[Figure: “Pressure vs. Density”, MTP (GB/s, 35 to 105) vs. cube size (64 to 704); series: Pressure, Density Naïve, Add Var]

THAT problem

The Density Trick

The code looks a little “repetitive”, we’re multiplying by vp a whole lot of times:

unew = 2*ucc - uo[CENTER]*abs
     + uc[RIGHT] *v2[RIGHT]  + uc[LEFT] *v2[LEFT]
     + uc[RIGHT2]*v2[RIGHT2] + uc[LEFT2]*v2[LEFT2]
     + ucp *v2p + ucm *v2m - 6*ucc*v2c;

What if we do this:

unew = (2*ucc - uo[CENTER]*abs) / vp[CENTER] // divide by vp for time-step!
     + uc[RIGHT]  + uc[LEFT]
     + uc[RIGHT2] + uc[LEFT2]
     + ucp + ucm - 6*ucc;
uo[CENTER] = unew*abs*vp[CENTER]; // store wave-fields pre-multiplied with vp!

Same memory usage, same N_IO, but fewer neighbor reads!
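To see why storing the pre-multiplied field gives the same answer, here is a 1D sketch in C. It is an illustration under simplifying assumptions, not the talk's kernel: the absorbing factor `abs` is left out, `v` stands in for the velocity term inside the difference, and the function names are made up for this example.

```c
/* Naive 1D update: the velocity term sits inside the difference, so every
   neighbor needs two reads (u and v). */
double density_naive(const double *u, const double *uo, const double *v, int i) {
    return 2 * u[i] - uo[i]
         + u[i + 1] * v[i + 1] + u[i - 1] * v[i - 1] - 2 * u[i] * v[i];
}

/* Pre-multiplied update: w = u*v and wo = uo*v are what is stored, so each
   neighbor needs one read. Returns the new field, again pre-multiplied. */
double density_premul(const double *w, const double *wo, const double *v, int i) {
    double unew = (2 * w[i] - wo[i]) / v[i]    /* undo the scaling of the time terms */
                + w[i + 1] + w[i - 1] - 2 * w[i];
    return unew * v[i];                        /* store pre-multiplied again */
}
```

Dividing the result of `density_premul` by `v[i]` recovers the value of `density_naive` up to rounding, while the stencil itself now reads one array instead of two.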

The Density Trick

Memory access pattern the same as pressure – same performance as pressure!

[Figure: “Pressure vs. Density”, MTP (GB/s, 35 to 95) vs. cube size (64 to 704); series: Pressure, Density Naïve, Add Var, Pre-Mul]

And BOOM goes the dynamite


General Considerations

Very similar situation to acoustic solver
• Fewer neighbors because 1st derivative, but more variables
• Use batching, 32x8 thread-blocks, fast axis sizes multiples of 32

Use staggered grid
• Average materials on the fly

All variables the same size
• All coalesced, same stride for everyone

Staggered Grid – Elementary Cell

Every grid point contains 12 elements:
‒ 3 particle velocity components Vx, Vy, Vz
‒ 3 normal stress components Sxx, Syy, Szz
‒ 3 shear stress components Sxy, Sxz, Syz
‒ 3 material properties ρ, λ, μ

[Figure: elementary cell in the x-y plane; labels: Sxx, Syy, Szz, ρ, λ, μ; Vx; Vy; Sxy; and, out of screen, Vz, Sxz, Syz]

Staggered Grid

+ Everyone surrounded by correct spatial difference neighbors
‒ Materials over Velocity and Shear need to be averaged

Staggered Grid

[Figure: staggered grid layout; labels: Vx, Vy, Sxy, “Sxz over Vx”, “Sxy over Vy”, and Sxx, Syy, Szz, ρ, λ, μ; legend: area updated; ghost stress, ignored; ghost stress, updated; boundary velocity, from neighbor or boundary condition]

Separate Stress and Velocity Update

[Figure: grid regions handled by the Velocity Kernel and the Stress Kernel]

Figure callouts:
‒ needs to be at time-step t
‒ needs to be at time-step t+1/2
‒ handled by thread-block, possibly on different SM
‒ thread-block scheduling unknown
‒ read redundancy

Separate Stress and Shear Update

[Figure: grid regions handled by the Velocity Kernel, the Stress Kernel and the Shear Kernel]

Figure callouts:
‒ for i 0…n-1
‒ for i 1…n
‒ Divergence experimentally established to be slightly worse than read redundancy

Individual Kernel Performance

− Normal stress has no material averaging
− Velocity needs to average density from 2 values, for 3 different positions
− Shear stress needs to average Lamé coefficient from 4 values, for 3 different positions

[Figure: “Elastic Kernels”, MTP (GB/s, 35 to 95) vs. cube size (64 to 448); series: Stress, Velocity, Shear, Elastic]

In sequence they suffer read redundancy

Read Redundancy

Individual kernels are close to limit, but introduce read redundancy:
− shear stress: read 3 V, read 3 SX, read 1 M, write 3 SX (10)
− normal stress: read 3 V (AGAIN), read 3 S, read L, read M (AGAIN), write 3 S (11, 4 redundant)
− velocity: read 3 V (AGAIN), read 3 SX (AGAIN), read 3 S (AGAIN), read R, write 3 V (13, 9 redundant)
• Total 34, 13 redundant

We could totally cheat and say we’re doing 34 IO, and therefore our peak performance is 60 / 21 * 34 = 97 GB/s -> 83 % of memcpy speed!

It’s important to know whether an algorithm has room for improvement or not…
This one definitely has!
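The bookkeeping behind these numbers can be written as a short C sketch. The struct and function names are illustrative; the per-kernel counts are taken from the list of reads and writes for the three elastic kernels.

```c
/* IO operations (field reads + writes) of one kernel, and how many of the
   reads repeat a field already read by an earlier kernel in the time-step. */
struct kernel_io {
    int reads;
    int writes;
    int redundant_reads;
};

/* All IO operations issued by the kernel sequence. */
int total_io(const struct kernel_io *k, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) sum += k[i].reads + k[i].writes;
    return sum;
}

/* Reads that an ideal, fused update would not have to repeat. */
int total_redundant(const struct kernel_io *k, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) sum += k[i].redundant_reads;
    return sum;
}

/* Throughput one would report by counting every IO instead of the ideal N_IO. */
double inflated_mtp(double honest_mtp, int ideal_nio, int counted_io) {
    return honest_mtp / ideal_nio * counted_io;
}
```

With shear = {7, 3, 0}, normal = {8, 3, 4} and velocity = {10, 3, 9}, the totals are 34 IO operations of which 13 are redundant, leaving the ideal 21. The same `inflated_mtp` arithmetic turns the acoustic 100 GB/s into 250 GB/s when 10 IO operations are counted instead of 4.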

Implementation Summary

Respected the number 32
− Memory segments, warps and L1 cache lines

Relied on cache
− Only works if working unit small enough
− So, reduce your working units

Give hardware maximum possibility to parallelize
− No __syncthreads()
− Minimum divergence


Summary

Ideal Memory Throughput
− MTP = N_IO * Grid Size * Word Size / time elapsed
− N_IO = 2*DOF + Const
− GFlops misleading in memory bound situation
− Counting neighbors is a crime

Real world applications can approach memcpy throughput
− acoustic: 85 % (100 GB/s on M2070, 180 GB/s on K10)
− elastic: 52 % (60 GB/s on M2070, 100 GB/s on K10)

Physics at memcpy throughput: physics for free!

Every Algorithm’s Dream…

For fixed problem size and hardware capabilities… which is faster?

Read | Compute | Write     3D FFT: 40 GB/s, 180 Gflop/s
Read | C | Write           Acoustic: 100 GB/s, 70 Gflop/s
Read | Write               Memcpy: 117 GB/s, 0 Gflop/s

References

Performance Tuning of Scientific Applications, David H. Bailey and Robert F. Lucas
GPU Performance Analysis and Optimization, Paulius Micikevicius, GTC 2012
3D Finite Difference Computation on GPUs using CUDA, Paulius Micikevicius, 2009
Numerical Modeling in Fortran, Day 9, Paul Tackley, 2012

Questions?

Performance Peak Analysis

Metric                Unit   Meaning
App Throughput (TP)   GB/s   Application speed
Hardware (HW) TP      GB/s   Hardware’s transfer throughput
HW TP Limit           GB/s   Practical throughput limit (memcpy)
App / Limit           %      100 % is ideal
App / HW              %      Less than 100 % means not all bytes transferred are used
HW / Limit            %      Less than 100 % means memory bus underutilized

Profile how many bytes are transferred.
Use the practical instead of the theoretical throughput limit.
Less than 100% is only critical if it’s substantially less.

Performance Peak Analysis: Density

Metric                Unit   M2070   GK104
App Throughput (TP)   GB/s   100.0   88.5
Hardware (HW) TP      GB/s   103.6   92.1
HW TP Limit           GB/s   117.6   114.3
App / Limit           %      85.0    77.4
App / HW              %      96.5    96.1
HW / Limit            %      88.1    80.6

Data from 448 cubed, 10 time steps run.
Access pattern OK.
Could have more concurrent memory access, especially on GK104, to increase HW utilization.
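The three percentages in the analysis are plain ratios of the three measured throughputs. A small C sketch, with illustrative names, makes the relationship explicit:

```c
/* The three ratios from the Performance Peak Analysis table, in percent. */
struct peak_analysis {
    double app_over_limit;  /* 100 % is ideal */
    double app_over_hw;     /* below 100 %: not all transferred bytes are used */
    double hw_over_limit;   /* below 100 %: memory bus underutilized */
};

/* app_tp: ideal-metric application throughput, hw_tp: profiled hardware
   throughput, hw_limit: measured memcpy throughput, all in GB/s. */
struct peak_analysis analyze_peak(double app_tp, double hw_tp, double hw_limit) {
    struct peak_analysis r;
    r.app_over_limit = 100.0 * app_tp / hw_limit;
    r.app_over_hw    = 100.0 * app_tp / hw_tp;
    r.hw_over_limit  = 100.0 * hw_tp / hw_limit;
    return r;
}
```

Note that `app_over_limit` is the product of the other two ratios (divided by 100), which is why the analysis separates wasted bytes from an underused bus.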

Register Queue GK104 Profiled

Metric                Queue    No Q     Comments
APP Time [sec]        0.148    0.180
APP MTP [GB/s]        89.379   73.375
Instructions [10^9]   52.404   60.156   replays
Writes [GB]           3.320    3.320
Reads [GB]            10.780   11.335   Cache miss
Reads/cube            3.004    3.158    Cache miss
HW MTP [GB/s]         95.266   81.416   Stalls
APP / HW MTP [%]      93.800   90.123   Cache miss

Cache miss causes memory replays and stalls

Density Trick Profiled on M2070

Metric                Trick     Naive    Comments
APP Time [sec]        0.133     0.190
APP MTP [GB/s]        99.672    69.557
Instructions [10^9]   44.370    53.706   replays
Writes [GB]           3.320     3.320
Reads [GB]            10.455    12.836   Cache miss
Reads/cube            2.913     3.577    Cache miss
HW MTP [GB/s]         103.566   85.030   Stalls
APP / HW MTP [%]      96.196    81.802   Cache miss

Cache miss causes memory replays and stalls

Profiling Notes

• nvprof from CUDA toolkit 5.0 : nvprof --events <event name>
• inst_issued (Fermi), inst_issued1 + 2*inst_issued2 (K10)
  per warp, 32 instructions per count
• fb_subp0_write_sectors + fb_subp1_write_sectors
  32 bytes per count
• fb_subp0_read_sectors + fb_subp1_read_sectors
  32 bytes per count
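Turning the raw counters into the profiled quantities is a matter of scaling. The C sketch below is an assumption-laden illustration: it takes 1 GB = 2^30 bytes, reads "32 instructions per count" as one warp-level instruction standing for 32 thread-level instructions, and uses made-up function names.

```c
/* Bytes moved through the frame buffer: both sub-partition counters are
   summed, and each count stands for one 32-byte sector. */
double sectors_to_gb(unsigned long long subp0, unsigned long long subp1) {
    const double gb = 1073741824.0; /* 2^30 bytes, an assumed convention */
    return (double)(subp0 + subp1) * 32.0 / gb;
}

/* Hardware throughput: everything actually read and written, per second. */
double hw_mtp_gbs(double read_gb, double write_gb, double seconds) {
    return (read_gb + write_gb) / seconds;
}

/* Instructions issued on Kepler (K10): dual-issue events count twice.
   Each count is per warp, i.e. 32 thread-level instructions. */
double kepler_instructions(unsigned long long inst_issued1,
                           unsigned long long inst_issued2) {
    return 32.0 * ((double)inst_issued1 + 2.0 * (double)inst_issued2);
}
```

As a consistency check against the profiled density run on M2070, `hw_mtp_gbs(10.455, 3.320, 0.133)` comes out near the 103.6 GB/s listed as HW MTP.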

Higher Order Approximations?

• Reported compute bound for large stencils, so not memory bound anymore
  Would you prefer to pay for the bus or ride it for free?
• Reported higher accuracy
  Assuming function well behaved and infinitely differentiable, which is not the case for heterogeneous media

Ironically, the free mall ride in Denver is cleaner and newer than the normal buses you actually pay for

Smooth vs. Real World

Let’s approximate some derivatives
– Waves are smooth, for sure
– Sine and cosine are infinitely differentiable
– Taylor approximation seems like a good idea

• All differences are multiplied by material properties
• If property has step, difference x property will have step
• We chose factor of 0.9 here -> not a very rough step
• Function looks smooth…

[Figure: two panels, “smooth”: sin(x), and “real”: sin(x) | sin(x) * 0.9]

Smooth vs. Heterogeneous: 1st Derivative

Order   Error
2nd     h^2 f^(iii)(a)
4th     h^4 f^(v)(a)
6th     h^6 f^(vii)(a)
8th     h^8 f^(ix)(a)

[Figure: two panels, “smooth 1st deriv” and “real 1st deriv”; error (small to big) vs. h (big to small); series: 2nd, 4th, 6th, 8th]

My oh my, what do we have here?

Smooth vs. Heterogeneous: 2nd Derivative

Order   Error
2nd     h^2 f^(iv)(a)
4th     h^4 f^(vi)(a)
6th     h^6 f^(viii)(a)
8th     h^8 f^(x)(a)

[Figure: two panels, “smooth 2nd deriv” and “real 2nd deriv”; error (small to big) vs. h (big to small); series: 2nd, 4th, 6th, 8th]

All orders fail, but the higher ones seem worse

1D solvers: Pressure

[Figure: two panels, “Pressure Hom.” and “Pressure Het.”; sum abs err (0 to 30, in units of 1E-6) vs. points per wavelength; series: 2nd, 4th, 6th, 8th]

Higher order not substantially better below 6 ppw

1D solvers: Stress – Velocity

[Figure: two panels, “SV Hom.” and “SV Het.”; sum abs err (0.1 to 8.1, in units of 1E-9) vs. points per wavelength; series: 2nd, 4th, 6th, 8th]

Higher order WORSE below 6 ppw

Higher Order Approximations?

• Reported larger time-step possible
  Smaller time-step required for the same resolution
  Lower resolution problematic in heterogeneous media
• In conclusion:
  • More expensive to develop
  • No accuracy benefits in heterogeneous media
  • Building a Ferrari with shopping cart wheels is silly:
    » also need higher order boundary conditions
    » also need higher order time-stepping
    » etc.

Higher order complications.

Kepler GK104

Same memcpy bandwidth – expect same performance

[Figure: “GK104 vs. M2070”, MTP (GB/s, 35 to 105) vs. cube size (64 to 704); series: M2070, GK104]

Page 64: Memory Bound Wave Propagation at Hardware Limit | GTC 2013€¦ · − Normal stress has no material averaging − Velocity needs to average density from 2 values, for 3 different

© Spectraseis Inc. 2013

What’s “wrong” with GK104?

GK104:

• max 2048 threads, 256 threads / TB

• occupancy 1 -> 8 TB / SM

• 8 SM x 8 TB / SM -> 64 TB concurrently

• 512 KB L2 -> 8 KB L2 per TB

M2070:

• max 1536 threads, 256 threads / TB

• occupancy 2/3 -> 4 TB / SM

• 14 SM x 4 TB -> 56 TB concurrently

• 768 KB L2 -> about 14 KB L2 per TB

• AND: 48 KB L1 -> 12 KB L1 per TB

64

3D Finite Difference Computation on GPUs using CUDA

Paulius Micikevicius, NVIDIA, 2009

No need to fetch center and top

use ancient register queue technique
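The idea behind the register queue can be illustrated without a GPU. The sketch below is a plain-Python emulation, not the CUDA kernel from the talk or from the cited paper: it assumes a 3-point stencil along the marching axis, assumes the march runs from top to bottom, and counts loads from "global memory" (the list `column`). All function names are hypothetical.

```python
def stencil_naive(column):
    """Every output point loads top, center and bottom afresh."""
    out, loads = [], 0
    for z in range(1, len(column) - 1):
        top, center, bottom = column[z - 1], column[z], column[z + 1]
        loads += 3
        out.append(top - 2 * center + bottom)
    return out, loads

def stencil_register_queue(column):
    """Keep top and center in local variables; load only the new bottom."""
    out = []
    top, center = column[0], column[1]   # prime the queue once
    loads = 2
    for z in range(1, len(column) - 1):
        bottom = column[z + 1]           # the only load in this step
        loads += 1
        out.append(top - 2 * center + bottom)
        top, center = center, bottom     # shift the queue by one point
    return out, loads

column = [z * z for z in range(8)]
naive_out, naive_loads = stencil_naive(column)
queue_out, queue_loads = stencil_register_queue(column)
print(naive_loads, queue_loads)
```

Both versions produce the same result, but the queued version issues one load per output point instead of three, which is why center and top no longer need to be fetched.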

Page 65

Further improvement more likely through concurrent access increase (more bytes in flight)

Looking at compiler numbers, occupancy reduction to increase cache per TB seems like a bad idea (HW utilization limited)

Fermi doesn’t care, as expected

Register Queue

That’s better.

[Plot: GK104 vs. M2070. MTP (GB/s) against cube size; series: M2070 - no reg Q, GK104 - no reg Q, GK104 - reg Q, M2070 - reg Q]

Page 66

• Why are the volume and pressure performance curves so jagged, and why is there a massive kink down at 384 (12*32)?

• Suspect: accidental locality

The Kink

Read CENTER might prefetch someone’s LEFT or RIGHT, or hit in cache

Read LEFT or RIGHT might prefetch someone’s CENTER, or hit in cache

[Diagram: SM 0 runs TB 0,0, TB 0,1 and TB 0,2; SM 1 runs TB 0,3, TB 0,4 and TB 0,5. Accesses are labeled "read or hit cache" or "read".]

Page 67

• If there is no accidental locality, there should be more IO operations than necessary, and a lower performance ceiling:

unnecessary right OR left: 5 instead of 4 IO

4/5 = 80% throughput

unnecessary right AND left: 6 instead of 4 IO

4/6 = 67% throughput

• How to test?

Create 80% situation with chess pattern
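The ceilings quoted above are the ratio of necessary to actual IO operations, scaled by the peak throughput. A small Python sketch of that arithmetic, assuming a peak of 100 GB/s (the value implied by "80% of peak = 80 GB/s" later in the talk); the function name is hypothetical:

```python
def throughput_ceiling(necessary_io, extra_io, peak_gbs):
    """Useful throughput when extra_io unnecessary transfers are added."""
    return peak_gbs * necessary_io / (necessary_io + extra_io)

PEAK_GBS = 100.0  # assumed peak memory throughput

one_overfetch = throughput_ceiling(4, 1, PEAK_GBS)   # right OR left
two_overfetch = throughput_ceiling(4, 2, PEAK_GBS)   # right AND left
print(one_overfetch, round(two_overfetch, 1))
```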

The Kink

Page 68

The Kink – Chess Experiment

• prevent possibility of accidental locality by removing all neighbors (chess board pattern)

• specifically: fast axis index = (2*blockIdx.x+blockIdx.y%2) * blockDim.x + threadIdx.x

• no direct neighbors that can help each other, and either left or right overfetch is an unnecessary additional read

• expect 80% of peak performance: 80 GB/s, as benchmark shows!
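To see why this index leaves no direct neighbors, the expression can be evaluated on the host. The Python sketch below reuses the slide's formula verbatim; the helper `chess_board` and the 4x4 grid size are assumptions made only for the illustration.

```python
def fast_axis_index(block_idx_x, block_idx_y, block_dim_x, thread_idx_x):
    """Fast-axis index from the slide's chess-board formula."""
    return (2 * block_idx_x + block_idx_y % 2) * block_dim_x + thread_idx_x

def chess_board(grid_dim_x, grid_dim_y):
    """Mark which block-sized slots along the fast axis are processed."""
    rows = []
    for by in range(grid_dim_y):
        # With blockDim.x = 1 and threadIdx.x = 0 the index is the slot number.
        occupied = {fast_axis_index(bx, by, 1, 0) for bx in range(grid_dim_x)}
        rows.append("".join("X" if slot in occupied else "."
                            for slot in range(2 * grid_dim_x)))
    return rows

board = chess_board(4, 4)
print("\n".join(board))
# X.X.X.X.
# .X.X.X.X
# X.X.X.X.
# .X.X.X.X
```

Every processed block has empty slots to its left and right, so a left or right halo read can never be served by a neighboring block's center read.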

[Plot: Pressure Chess Pattern. MTP (GB/s) against cube size; series: Pressure Normal, Pressure Chess]

Page 69

The Kink – Locality Effect?

• second experiment: comment out left and right neighbor access

• results are relatively flat, not jagged

• 14 SM on M2070, peak at 448 = 14*32?

• 384 = 12*32 some especially bad locality situation?

[Plot: Pressure Chess Pattern. MTP (GB/s) against cube size; series: Pressure Normal, Pressure Chess, Pressure No Left+Right]

Page 70

Averaged Materials

• Our weakest link is obviously the shear stress kernel

• Most probably because of material averaging

• What if we pre-average and store Mue, Mue_x, Mue_y and Mue_z?

• Less pressure on cache and faster solver?
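As a rough illustration of what pre-averaging means, the Python sketch below builds one averaged array ahead of time, so a kernel would read one stored value instead of four neighbors. The slides do not give the averaging formula; the arithmetic mean of four neighboring cells in a 2D slice is an assumption, and the names `average_four`, `pre_average` and `mue_avg` are hypothetical.

```python
def average_four(mue, i, j):
    """ASSUMED scheme: arithmetic mean of the four cells around a corner."""
    return 0.25 * (mue[i][j] + mue[i + 1][j] + mue[i][j + 1] + mue[i + 1][j + 1])

def pre_average(mue):
    """Compute the averaged field once, before time stepping starts."""
    ni, nj = len(mue), len(mue[0])
    return [[average_four(mue, i, j) for j in range(nj - 1)]
            for i in range(ni - 1)]

mue = [[1.0, 1.0, 4.0],
       [1.0, 1.0, 4.0],
       [2.0, 2.0, 8.0]]
mue_avg = pre_average(mue)   # one extra array that must be stored and read
print(mue_avg)
```

The trade-off is visible even here: the averaging work disappears from the kernel, but each averaged field is an additional array to hold in memory and stream through the memory system.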


Interesting.

Page 71

• Current shear stress kernel peak at 85 GB/s -> if it goes up to 100 GB/s, overall performance won’t improve much

• Shear stress kernel currently has 7 reads and 3 writes, total 10. Adding 3 extra Mue to read would increase to total 13.

What memory throughput would MATCH existing version?

Averaged Materials

Alas.

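$$t_1 = t_2 \;\Rightarrow\; \frac{s_1}{\mathrm{mtp}_1} = \frac{s_2}{\mathrm{mtp}_2} \;\Rightarrow\; \mathrm{mtp}_2 = \mathrm{mtp}_1 \, \frac{s_2}{s_1}$$

$$\mathrm{mtp}_2 = 1.3 \times 85\ \text{GB/s} = 110.5\ \text{GB/s}$$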

Maybe possible, but even so, it would still use much more memory.
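The break-even figure can be checked numerically. The Python sketch below assumes, as the slide does, that kernel time is proportional to the number of IO operations divided by memory throughput; the function names are hypothetical.

```python
def break_even_throughput(current_gbs, current_io, new_io):
    """Throughput the new kernel needs to match the current kernel's run time."""
    return current_gbs * new_io / current_io

def equivalent_throughput(new_gbs, current_io, new_io):
    """What a new-kernel throughput is worth in terms of the current kernel."""
    return new_gbs * current_io / new_io

needed = break_even_throughput(85.0, current_io=10, new_io=13)
at_100 = equivalent_throughput(100.0, current_io=10, new_io=13)
print(needed, round(at_100, 1))
```

Under this model, even a pre-averaged kernel running at 100 GB/s would be slower than the current one at 85 GB/s, since 13 IO operations at 100 GB/s are equivalent to about 76.9 GB/s with 10.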