Characterization and modeling of finite difference stencil

3/5/2012


Software & Services Group, Energy Engineering Team


Characterization and modeling of finite difference stencil performance

Philippe Thierry, Leo Borges and Jamel Tayeb

Intel Corp



Perf projection from future kernels

Hypothesis:

Numerical schemes will not change much; only the bytes/flop ratio and memory needs will differ

[Figure: two timelines (Present, Mid Term, Long Term, Future), one for the Application and one for the Computing System, linked by tuning work]




How far is this goal …

A global model should include information from simulator to cluster levels

Levels: Simulator, Core, Socket, Node, Cluster

Upscaling between successive levels: large, small, small, large

Sources of information and what they capture:

• Application traces: impact of micro-architecture (IPC for ex.)

• Hardware counters: impact of BW, misses, SIMD, affinity

• Communication & topology: impact of interconnect BW and latency

T_total = f (all this information)



Multi level performance analysis

(in the case of time domain seismic imaging)

Application components: 3D FD kernel, halo exchange, boundary conditions (BCs), I/O scheme

• 3DFD kernel only

• The 3DFD kernel + halo exchange

• The 3DFD kernel + halo + BCs

• The 3DFD kernel + halo + BCs + the I/O = the whole RTM application

Core level optimization

Affinity, NUMA effect

Load balancing, domain decomposition

Cluster and application architecture and topology

FWI ~ several RTM iterations




What results to look for

Gflops/sec: are we far from the peak of the algorithm and of the machine?

(Mpts/s) -> (Flops/pt) -> (Flops/s)

GBytes/sec: are we bandwidth limited or latency limited?

Flops/joule: can we see the impact of any given implementation?

Total nb Ops per point per iteration 292

Total nb of point 2229969960.00000

Total (sec) = 9.598

Total/ite (sec) = 0.479

Speed 1: 232.313 M points / sec

Speed 2: 6.968 E-002 Points / cycle

Speed 3: 67.835 Gflops / sec

=> 42.390 % of peak, 1 threads, 6 MPIs
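The chain (Mpts/s) -> (Flops/pt) -> (Flops/s) can be replayed from the numbers in the log above. The following is a minimal Python sketch, not the benchmark's own code. It assumes that the logged point count covers all iterations, that the peak is the single-precision figure of 8 Flops/cycle per core, and that the 6 MPI processes each occupy one core at 3.33 GHz. The function and argument names are illustrative.

```python
def stencil_metrics(total_points, flops_per_point, elapsed_s,
                    freq_hz, n_cores, peak_flops_per_cycle=8):
    """Derive throughput metrics from one timed stencil run."""
    pts_per_s = total_points / elapsed_s
    flops_per_s = pts_per_s * flops_per_point
    # Peak of the cores actually used (assumption: one MPI process per core).
    peak_flops_per_s = peak_flops_per_cycle * n_cores * freq_hz
    return {
        "mpts_per_s": pts_per_s / 1e6,
        "pts_per_cycle": pts_per_s / freq_hz,
        "gflops_per_s": flops_per_s / 1e9,
        "pct_of_peak": 100.0 * flops_per_s / peak_flops_per_s,
    }


# Point count, flops per point and elapsed time are copied from the log;
# 3.33 GHz and 6 cores are assumptions.
m = stencil_metrics(2229969960, 292, 9.598, 3.33e9, 6)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these inputs the results land close to the logged 232 Mpts/s, 68 Gflops/s and roughly 42 % of peak. The small differences come from the rounded elapsed time and the assumed clock frequency.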

By hand, to get the max theoretical perf

Get approximate values using hardware counters

Computational kernel

• Flops / Int

• Read/write DRAM



Reminder about the peak Flops

In terms of « Flops »: two 128-bit SIMD operations per cycle

4 MUL (32b) and 4 ADD (32b): 8 Single Precision Flops / cycle

2 MUL (64b) and 2 ADD (64b): 4 Double Precision Flops / cycle

Theoretical peak for a 2-sockets X5680 (6 cores @ 3.33GHz)

8 [Flops/cycle ]*12*3.33 [Gcycles/sec] = 319.68 [Gflops/sec] SP

4 [Flops/cycle ]*12*3.33 [Gcycles/sec] = 159.84 [Gflops/sec] DP
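The two peak figures above follow from one product: flops per cycle times core count times clock frequency. A small Python sketch of that arithmetic (the helper name is illustrative):

```python
def peak_gflops(flops_per_cycle, n_cores, freq_ghz):
    """Theoretical peak in Gflops/s for a given per-core issue rate."""
    return flops_per_cycle * n_cores * freq_ghz


N_CORES = 2 * 6   # two sockets, six cores each
FREQ_GHZ = 3.33   # X5680 clock frequency quoted on the slide

sp_peak = peak_gflops(8, N_CORES, FREQ_GHZ)  # 4 MUL + 4 ADD per cycle
dp_peak = peak_gflops(4, N_CORES, FREQ_GHZ)  # 2 MUL + 2 ADD per cycle
print(f"SP peak: {sp_peak:.2f} Gflops/s")    # SP peak: 319.68 Gflops/s
print(f"DP peak: {dp_peak:.2f} Gflops/s")    # DP peak: 159.84 Gflops/s
```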

Nehalem micro-architecture (OOO execution engine), 6 operations / cycle:

• 3 memory ops

• 3 computational operations




Max theoretical Flops/s by hand (isotropic kernel)

Total number of flops per point: nF = 7K + 5,

made of ADD: nA = 6K + 2 and MUL: nM = K + 3

-> max(nA, nM) is the limiting factor

Achievable peak (with infinite bandwidth) is the ratio of the time for nF flops done at 8 Flops/cycle (2 SSE ports) to the time for max(nA, nM) operations done at 4 Flops/cycle (1 SSE port):

%Peak(K) = 100 * nF / (2 * max(nA, nM))

K: half stencil length
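The reasoning behind the %Peak formula: at peak, nF flops would take nF / 8 cycles, but the busier SSE port needs max(nA, nM) / 4 cycles, so the attainable fraction is (nF / 8) / (max(nA, nM) / 4) = nF / (2 * max(nA, nM)). A short Python sketch for the isotropic kernel, using illustrative function names:

```python
def iso_flop_counts(k):
    """ADD, MUL and total flop counts per grid point, isotropic kernel."""
    n_add = 6 * k + 2
    n_mul = k + 3
    return n_add, n_mul, n_add + n_mul


def pct_of_peak(n_add, n_mul):
    """Upper bound on % of peak when ADD and MUL issue on separate ports."""
    n_flops = n_add + n_mul
    return 100.0 * n_flops / (2 * max(n_add, n_mul))


for k in (2, 4, 8, 12):
    n_add, n_mul, n_flops = iso_flop_counts(k)
    print(f"K={k:2d}  nF={n_flops:3d}  "
          f"max % of peak={pct_of_peak(n_add, n_mul):.2f}")
```

For K = 4 this gives nF = 33 and about 63.5 % of peak. As K grows the bound tends to 7/12, about 58.3 %, because the ADD count dominates.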



Flops/s wall. Kernel specifications: FP operations

| Kernel | Nb of tables | Implementation | ADD | MUL | Total flops |
|---|---|---|---|---|---|
| Iso | 3 | | 6*K + 2 | K + 3 | 7*K + 5 |
| Iso, rho | 6 | | 36*K - 95 | (39/2)*K - 46 | (111/2)*K - 141 |
| VTI | 7 | Duveneck et al., 2011 | 6*K + 7 | 3*K + 8 | 9*K + 15 |
| VTI, rho | 10 | Duveneck et al., 2011 | 36*K - 90 | (39/2)*K - 42 | (111/2)*K - 132 |
| TTI | 24 | Fletcher et al., 2009 | 30*K + 38 | 18*K + 62 | 48*K + 100 |
| Visco | n/a | Komatitsch et al., 2009 | n/a | n/a | n/a |
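The table can be evaluated for a given half stencil length and fed into the same %Peak bound, 100 * nF / (2 * max(nA, nM)). The Python sketch below copies the ADD and MUL coefficients from the table; the dictionary layout and function names are illustrative. The "rho" formulas give negative counts for very small K, so they are presumably meant for larger stencils only.

```python
# (ADD slope, ADD constant, MUL slope, MUL constant), copied from the table.
KERNELS = {
    "iso":      (6, 2, 1, 3),
    "iso, rho": (36, -95, 19.5, -46),
    "vti":      (6, 7, 3, 8),
    "vti, rho": (36, -90, 19.5, -42),
    "tti":      (30, 38, 18, 62),
}


def flop_counts(kernel, k):
    """ADD, MUL and total flops per grid point for half stencil length k."""
    a1, a0, m1, m0 = KERNELS[kernel]
    n_add = a1 * k + a0
    n_mul = m1 * k + m0
    return n_add, n_mul, n_add + n_mul


def pct_of_peak(kernel, k):
    """Bound on % of SP peak when ADD and MUL use separate SSE ports."""
    n_add, n_mul, n_flops = flop_counts(kernel, k)
    return 100.0 * n_flops / (2 * max(n_add, n_mul))


K = 4  # 8th-order stencil
for name in KERNELS:
    n_add, n_mul, n_flops = flop_counts(name, K)
    print(f"{name:9s} ADD={n_add:g} MUL={n_mul:g} total={n_flops:g} "
          f"-> {pct_of_peak(name, K):.1f} % of SP peak")
```

For K = 4 the TTI row gives 292 flops per point, the same value as the "Total nb Ops per point per iteration" line in the sample log.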

[Figure: Max % of SP peak as a function of kernel length. x-axis: half stencil length (K), 2 to 32; y-axis: % of peak Flops/s, 50 to 100; curves: iso, iso+rho, vti, vti+rho, tti]




Memory demand and BW wall

The « ideal, bandwidth-limited » number of grid points per second is S_BW = B / (4 * 4) [Gpts/sec], with B the memory bandwidth in GB/s and 4 accesses of 4 bytes each per grid point.

Ideal case: 1 store « v », 3 loads « u, v, w »

Worst case: 6*K + 3 loads, 1 store

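| Kernel | LOAD | STORE | Total 4-byte accesses |
|---|---|---|---|
| Iso | 3 | 1 | 4 |
| Iso, rho | 6 | 4 | 10 |
| VTI | 7 | 2 | 9 |
| VTI, rho | 7 | 5 | 12 |
| TTI | 14 | 12 | 26 |
| Visco.E | n/a | n/a | n/a |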
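The bandwidth wall generalizes from the isotropic case to the other kernels by replacing the 4 accesses with each kernel's load and store count from the table above. A minimal Python sketch follows. The bandwidth of 40 GB/s is a hypothetical value chosen for illustration, not a measurement from this study. The worst case is assumed to apply to the isotropic kernel, and the names are illustrative.

```python
BYTES_PER_VALUE = 4  # single-precision values

# (loads, stores) per grid point, copied from the table.
ACCESSES = {
    "iso": (3, 1),
    "iso, rho": (6, 4),
    "vti": (7, 2),
    "vti, rho": (7, 5),
    "tti": (14, 12),
}


def bw_limited_gpts(bandwidth_gb_s, kernel):
    """Ideal case: every array is streamed from DRAM exactly once."""
    loads, stores = ACCESSES[kernel]
    return bandwidth_gb_s / (BYTES_PER_VALUE * (loads + stores))


def worst_case_iso_gpts(bandwidth_gb_s, k):
    """Worst case (assumed isotropic): 6K+3 loads and 1 store per point."""
    return bandwidth_gb_s / (BYTES_PER_VALUE * (6 * k + 3 + 1))


B = 40.0  # GB/s, hypothetical sustained bandwidth
for name in ACCESSES:
    print(f"{name:9s} S_BW = {bw_limited_gpts(B, name):.3f} Gpts/s")
print(f"iso worst case, K=4: {worst_case_iso_gpts(B, 4):.3f} Gpts/s")
```

The gap between the ideal and worst-case figures shows how much of the attainable rate depends on cache reuse of the stencil neighbours.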



Real numbers when we can’t count anymore

Flops/s: collect FP-related counters

Gflops/s = (FP_COMP_OPS_EXE.x87 + FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION + FP_COMP_OPS_EXE.SSE_SINGLE_PRECISION + FP_COMP_OPS_EXE.SSE_FP_SCALAR) * 1E-09 / Elapsed_time

GB/s: collect uncore read and write events

GB/s = (UNC_IMC_WRITES for each socket + UNC_IMC_READS for each socket) * 1E-09 * Cache line size / Elapsed_time

(Elapsed_time = CPU_CLK_UNHALTED / Processor frequency / Nb of cores)

+ Watts and energy as a function of time
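The counter formulas above translate directly into a few lines of post-processing. The Python sketch below assumes the raw event counts have already been collected by some profiling tool and uses 64-byte cache lines. All counter values are invented for illustration, and the function names are not part of any Intel tool.

```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte cache lines

FP_EVENTS = (
    "FP_COMP_OPS_EXE.x87",
    "FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION",
    "FP_COMP_OPS_EXE.SSE_SINGLE_PRECISION",
    "FP_COMP_OPS_EXE.SSE_FP_SCALAR",
)


def elapsed_time_s(cpu_clk_unhalted, freq_hz, n_cores):
    """Elapsed time from unhalted cycles summed over all cores."""
    return cpu_clk_unhalted / freq_hz / n_cores


def gflops_per_s(counters, elapsed_s):
    """Approximate Gflops/s from the sum of the FP event counts."""
    return sum(counters[name] for name in FP_EVENTS) * 1e-9 / elapsed_s


def dram_gb_per_s(imc_reads, imc_writes, elapsed_s):
    """DRAM traffic in GB/s from per-socket uncore read/write counts."""
    lines = sum(imc_reads) + sum(imc_writes)
    return lines * 1e-9 * CACHE_LINE_BYTES / elapsed_s


# Invented sample values for a 12-core, 3.33 GHz, two-socket node.
counters = {
    "FP_COMP_OPS_EXE.x87": 0,
    "FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION": 0,
    "FP_COMP_OPS_EXE.SSE_SINGLE_PRECISION": 4.0e11,
    "FP_COMP_OPS_EXE.SSE_FP_SCALAR": 1.0e11,
}
t = elapsed_time_s(cpu_clk_unhalted=3.996e11, freq_hz=3.33e9, n_cores=12)
gflops = gflops_per_s(counters, t)
gbs = dram_gb_per_s(imc_reads=[2.0e9, 2.0e9], imc_writes=[0.5e9, 0.5e9],
                    elapsed_s=t)
print(f"elapsed {t:.1f} s, {gflops:.1f} Gflops/s, {gbs:.1f} GB/s")
```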




Intel® Energy Checker SDK

Measure energy consumed for a workload

[Diagram: Application -> Energy Checker API -> Energy Checker; ESRV: consumed energy counter(s); TSRV: temperature, RH, etc. counter(s)]

Instrument code

• Counters exposed through simple EC API

• API delivered as source code

Collect data and compute Energy Efficiency (EE)

• EE = “Work” done / Energy consumed

• Run time ability to report

Analyze

• System productivity

• Application’s energy profile
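The ratio EE = "Work" done / Energy consumed can be illustrated without the SDK itself. The Python sketch below does not use the Energy Checker API. It assumes power has been sampled in watts at known times, integrates it with the trapezoidal rule to get joules, and takes the number of grid points updated as the unit of work. All sample values are invented.

```python
def energy_joules(times_s, power_w):
    """Trapezoidal integration of sampled power (W) over time (s)."""
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def energy_efficiency(work, energy_j):
    """EE = work done / energy consumed, in work units per joule."""
    return work / energy_j


# Invented power trace: one sample per second over a 4 s run.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
power = [250.0, 300.0, 300.0, 300.0, 250.0]

joules = energy_joules(times, power)
ee = energy_efficiency(work=1.0e9, energy_j=joules)  # work = grid points
print(f"{joules:.0f} J consumed, {ee:.0f} points per joule")
```

Choosing grid points or flops as the unit of work yields points per joule or flops per joule respectively, which is the form used to compare implementations.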



3D TTI FD: observed versus calculated BW

[Figure: bandwidth (GB/s, 0 to 45) versus nb of MPI processes (0 to 14); series: tti_no_cb 24th, tti_cb 24th, tti 24th Calc, Stream triad]

LLC misses(K) ~= (Observed memory demand - Calculated memory demand) / 4

(weak scalability => same amount of work per process)




Mean % of total LLC misses vs total DRAM reads

[Figure: % of LLC misses (0 to 100) versus total DRAM reads per 100 instructions retired (0 to 6), 12 MPI processes; regions: CPU bound, mem bound; series: Iso CB, Iso no-CB, TTI CB, TTI no-CB, Visco; stencil orders: 8th, 16th, 24th]



DRAM BW vs total DRAM reads + writes

[Figure: memory BW (GB/s, 0 to 35) versus total DRAM reads + writes per 100 instructions retired (0 to 6), 12 MPI processes; regions: CPU bound, mem bound; series: Iso CB, Iso no-CB, TTI CB, TTI no-CB, Visco; stencil orders: 8th, 16th, 24th]




B/F graph

[Figure: GBytes (0 to 100) versus GFlops (100 to 600); iso-lines at 0.5 Gpts/sec and 1 Gpts/sec; regions: mem. bound, CPU bound; points: 8th TTI CB, 8th TTI no-CB, Stream / Linpack]



Impact on power consumption = f(elapsed time)

[Figure: power consumption (watts, 50 to 350) versus wall clock time (sec, 0 to 500); series: 8th TTI CB, 8th TTI no-CB, 16th TTI CB, 16th TTI no-CB, 4th Visco]




Perf per Watt

[Figure: bar chart, scale 0 to 0.9, higher is better; metrics: Mpts/sec, Gflops / Joule]



Visco Elastic Kernel: DRAM demand and power consumption

[Figure: time series versus wall clock time (sec, 0 to 250); axes: power (watts, 0 to 300), bandwidth (GB/s, 0 to 35), giga instructions / sec (1 to 6); series: Read DRAM BW, Write DRAM BW, Power, G.Inst / sec]




Conclusions

• Use of the right metrics for performance measurements

• Know the max theoretical performance of your implementation

• Don’t forget system configuration and its impact on measurements

• Simple projections are useful

• Final goal of perf modelling must be clearly defined

- Short term optimization with current kernel and hardware

- Short / mid term extrapolation for future hardware

- Long term extrapolation with future kernels and future hardware



Acknowledgements

Thanks to Intel for permission to publish the results.

Thanks to Hugh Caffey for fruitful discussions.


“Prediction is very difficult, especially about the future”

by Niels Bohr, Physicist, 1885-1962




Optimization Notice


Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel® Compiler User and Reference Guides” under “Compiler Options.” Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.

Notice revision #20101101



Legal Disclaimer


INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2010. Intel Corporation.
