Optimizing Commercial Software for Intel Xeon Phi Coprocessors: Lessons Learned

Optimizing Commercial Software for Intel® Xeon Phi™ Coprocessors: Lessons Learned

©2013 Acceleware Ltd. All rights reserved.

Dan Cyca, Chief Technical Officer, Acceleware

Supercomputing Conference, Denver, Colorado, USA, November 17-22, 2013

In My Parallel Universe…

Small to medium-sized seismic companies aren’t limited by computational resources when processing seismic data

Seismic Computing Requirements

[Figure: compute required by seismic imaging algorithms over time. Horizontal axis: year, 1990 to 2015. Vertical axis: compute, 100 GF to 1 EF. Algorithms shown: Paraxial WE approximation, Kirchhoff Migration, Post SDM, PreSTM, Full WE Approximation, WEM, RTM, FWI, Elastic Imaging.]

Source: Total 2012

RTM Overview

Source: propagate forwards in time
Receiver data: propagate backwards in time
Correlate source and receiver wavefields
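
The correlation step can be sketched as follows. The slides do not say which imaging condition is used; this sketch assumes the common zero-lag cross-correlation, and the names `Snapshot` and `correlate` are hypothetical. Both wavefields are assumed to be indexed by the same physical time step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Snapshot = std::vector<float>;  // one wavefield at one time step

// Zero-lag cross-correlation: image[p] = sum over t of source[t][p] * receiver[t][p].
std::vector<float> correlate(const std::vector<Snapshot>& source,
                             const std::vector<Snapshot>& receiver) {
    assert(source.size() == receiver.size() && !source.empty());
    std::vector<float> image(source[0].size(), 0.0f);
    for (std::size_t t = 0; t < source.size(); ++t) {
        assert(source[t].size() == image.size());
        assert(receiver[t].size() == image.size());
        for (std::size_t p = 0; p < image.size(); ++p) {
            image[p] += source[t][p] * receiver[t][p];
        }
    }
    return image;
}
```

Because the source field is computed forwards and the receiver field backwards, one of the two has to be kept (or recomputed) so that matching time steps can be multiplied together.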

RTM Introduction

Finite-difference code
Compute intensive: 10s of hours per seismic shot
Large memory footprint: 100 GB per shot
Large local storage requirement: 500 GB per shot
10,000s of shots

RTM: Computational Requirements

An RTM image is made by migrating and then stacking a large number of shots (typically between 10,000 and 100,000)
Migrating each shot requires two or three 3D wave propagations
Each shot migration requires large RAM (~100 GB) and temporary disk space (~500 GB)
Runtime per shot varies from a few minutes (low-frequency isotropic) to several hours (high-frequency anisotropic)
A typical compute cluster used for RTM has 100s of nodes
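
To get a feel for these numbers, here is a rough back-of-the-envelope sketch. It assumes shots are independent and that each node migrates one shot at a time; the function names and the inputs in the note below are illustrative picks from the ranges above, not measurements.

```cpp
#include <cassert>

// Wall-clock hours for a whole survey when every node works on one shot at a time.
double survey_hours(double shots, double hoursPerShot, double nodes) {
    assert(shots >= 0.0 && hoursPerShot >= 0.0 && nodes > 0.0);
    return shots * hoursPerShot / nodes;
}

// Scratch disk in terabytes when every node holds one shot's temporary data.
double scratch_terabytes(double nodes, double gigabytesPerShot) {
    assert(nodes >= 0.0 && gigabytesPerShot >= 0.0);
    return nodes * gigabytesPerShot / 1000.0;
}
```

With 10,000 shots at one hour each on 100 nodes, `survey_hours` gives 100 hours, and `scratch_terabytes(100, 500)` gives 50 TB of temporary storage across the cluster.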

In My Parallel Universe…

Small to medium-sized seismic companies aren’t limited by computational resources when processing seismic data
  – We want to make RTM (1 PFlop) available to these companies
We’re delivering parallel software to run RTM on Xeon Phi systems

RTM: Wave Propagation

Finite-difference time domain technique
  – 3D stencils
3D grid with millions of points
  – Update the entire grid every time step
  – 1000s of time steps
Memory footprint of 10-100 GB
Wavefield data from forward pass stored to disk to facilitate imaging
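
The time-stepping pattern can be sketched in a much smaller setting. This is a 1D, second-order leapfrog scheme for the scalar wave equation, not the 3D high-order production code; the names `step` and `propagate` are hypothetical. What carries over is the shape of the loop: update every grid point, rotate the buffers, repeat for each time step.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One leapfrog step of u_tt = c^2 u_xx. courant = c*dt/dx must not exceed 1 in 1D.
void step(const std::vector<float>& prev, const std::vector<float>& curr,
          std::vector<float>& next, float courant) {
    assert(prev.size() == curr.size() && curr.size() == next.size());
    assert(courant > 0.0f && courant <= 1.0f);
    const float c2 = courant * courant;
    // End points are never written, so they act as fixed boundaries.
    for (std::size_t i = 1; i + 1 < curr.size(); ++i) {
        next[i] = 2.0f * curr[i] - prev[i]
                + c2 * (curr[i + 1] - 2.0f * curr[i] + curr[i - 1]);
    }
}

// Advance nSteps time steps by rotating three buffers.
std::vector<float> propagate(std::vector<float> prev, std::vector<float> curr,
                             std::size_t nSteps, float courant) {
    std::vector<float> next(curr.size(), 0.0f);
    for (std::size_t t = 0; t < nSteps; ++t) {
        step(prev, curr, next, courant);
        std::swap(prev, curr);  // prev now holds the old current field
        std::swap(curr, next);  // curr now holds the new field, next is scratch
    }
    return curr;
}
```

With a Courant number of exactly 1, a unit pulse travelling to the right moves one cell per step, which makes the sketch easy to check by hand.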

Parallelizing Single Shots

Finite-difference grid contains over 200 million cells per volume (2 GB)
Numerous volumes per shot (Earth model, wavefields and image)
One shot easily fits in a CPU compute node, but may be too large for a single Xeon Phi
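
A rough sketch of the sizing arithmetic follows. The 2 GB per volume comes from the slide; the volume count and card capacity used in the note below are illustrative assumptions, and the estimate ignores boundary cells and memory used by the card's operating system. The name `cards_needed` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Smallest number of cards whose combined memory holds all volumes of one shot.
std::size_t cards_needed(std::size_t volumes, double gbPerVolume, double gbPerCard) {
    assert(gbPerVolume > 0.0 && gbPerCard > 0.0);
    return static_cast<std::size_t>(std::ceil(volumes * gbPerVolume / gbPerCard));
}
```

For example, 10 volumes of 2 GB on cards with 8 GB each need `cards_needed(10, 2.0, 8.0)`, which is 3 cards.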

Parallelizing Each Shot: Multiple Cards

The volume is partitioned into pieces that fit on a single Xeon Phi

[Figure: the volume divided into three partitions, labelled Phi 0, Phi 1 and Phi 2]
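
One simple way to partition is a slab decomposition along a single axis. The slides do not say which axis is cut or how uneven sizes are handled, so the sketch below is an assumption: it splits `n` planes into `parts` contiguous slabs whose sizes differ by at most one. `Slab` and `partition` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Slab {
    std::size_t begin;  // first plane owned by this card
    std::size_t end;    // one past the last plane owned (half-open range)
};

// Split n planes into `parts` contiguous slabs; the first n % parts slabs get one extra plane.
std::vector<Slab> partition(std::size_t n, std::size_t parts) {
    assert(parts > 0);
    std::vector<Slab> slabs;
    slabs.reserve(parts);
    const std::size_t base = n / parts;
    const std::size_t extra = n % parts;
    std::size_t begin = 0;
    for (std::size_t p = 0; p < parts; ++p) {
        const std::size_t len = base + (p < extra ? 1 : 0);
        slabs.push_back({begin, begin + len});
        begin += len;
    }
    return slabs;
}
```

Splitting 10 planes over 3 cards gives the ranges [0, 4), [4, 7) and [7, 10).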

Parallelizing Each Shot: Multiple Cards

Boundaries must be transferred between partitions
Transfers can become a bottleneck unless they are done asynchronously with stencil calculations

[Figure: neighboring partitions Phi 0 and Phi 1, with a boundary transfer in each direction]
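
The data movement behind a boundary transfer can be sketched with two slabs in one process. This shows only which cells are copied, and it runs synchronously; in the real code the copy would go between cards and overlap with the stencil work, for example through non-blocking MPI calls, which is an assumption here. The layout and the names `kHalo` and `exchange` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kHalo = 4;  // an 8th-order stencil reaches 4 cells each way

// Each slab is stored as [left halo | owned cells | right halo].
// `left` sits directly before `right` in the global grid.
void exchange(std::vector<float>& left, std::vector<float>& right) {
    assert(left.size() >= 3 * kHalo && right.size() >= 3 * kHalo);
    // The last owned cells of `left` fill the left halo of `right`.
    std::copy(left.end() - 2 * kHalo, left.end() - kHalo, right.begin());
    // The first owned cells of `right` fill the right halo of `left`.
    std::copy(right.begin() + kHalo, right.begin() + 2 * kHalo, left.end() - kHalo);
}
```

Only `kHalo` cells per side move each time step, so the transfer is small compared with the slab, but it sits on the critical path unless the interior update proceeds while it is in flight.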

Parallelizing Each Shot: Within Card

[Figure: grid with axes z and x/y; slices along x/y are labelled Core 0, Thread 0 through Core 1, Thread 1]

Data in x and y are split over cores
Operations in z dimension are vectorized

Levels of Parallelism

Each shot is split over multiple Xeon Phi coprocessors (or Xeon nodes) using MPI
The partition on each Phi is split over cores using OpenMP
Operations on each thread are vectorized using the compiler’s autovectorizer

Kernel: 8th Order Spatial Derivative

    #pragma omp parallel for
    for(size_t x = xMin; x < xMax; x++)
    {
      for(size_t y = yMin; y < yMax; y++)
      {
        size_t const idx = x*strideX + y*strideY;
    #pragma vector …
        for(size_t z = zMin; z < zMax; z++)
        {
          size_t const i = idx + z;

          pVy[i] = yCoeffs[0]*(pV[i-4*strideY]-pV[i+4*strideY])
                 + yCoeffs[1]*(pV[i-3*strideY]-pV[i+3*strideY])
                 + yCoeffs[2]*(pV[i-2*strideY]-pV[i+2*strideY])
                 + yCoeffs[3]*(pV[i-1*strideY]-pV[i+1*strideY])
                 + yCoeffs[4]*pV[i];
        }
      }
    }

Triple loop over dimensions
One-dimensional derivative: simple calculation with a large memory-bandwidth requirement
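
A self-contained version of the same loop structure is sketched below so it can be compiled and checked. The slide does not give the values in `yCoeffs`; this sketch assumes the standard 8th-order central-difference weights for a first derivative and divides by the grid spacing explicitly. `derivY` and `kW` are hypothetical names, and the vectorization pragma is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Standard 8th-order central-difference weights for offsets 1..4 (the center weight is zero).
constexpr float kW[4] = {4.0f / 5.0f, -1.0f / 5.0f, 4.0f / 105.0f, -1.0f / 280.0f};

// d/dy of v on an nx*ny*nz grid with z contiguous. Only rows y in [4, ny-4) are written.
void derivY(const std::vector<float>& v, std::vector<float>& vy,
            std::size_t nx, std::size_t ny, std::size_t nz, float dy) {
    assert(v.size() == nx * ny * nz && vy.size() == v.size());
    assert(ny >= 8 && dy > 0.0f);
    const std::size_t strideY = nz;
    const std::size_t strideX = ny * nz;
#pragma omp parallel for
    for (std::size_t x = 0; x < nx; ++x) {
        for (std::size_t y = 4; y < ny - 4; ++y) {
            const std::size_t idx = x * strideX + y * strideY;
            for (std::size_t z = 0; z < nz; ++z) {
                const std::size_t i = idx + z;
                float sum = 0.0f;
                for (std::size_t k = 1; k <= 4; ++k) {
                    sum += kW[k - 1] * (v[i + k * strideY] - v[i - k * strideY]);
                }
                vy[i] = sum / dy;
            }
        }
    }
}
```

Each output value needs eight reads and only a handful of multiply-adds, which is why the kernel tends to be limited by memory bandwidth rather than arithmetic. On a field that grows linearly in y the result equals the slope.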

Tuning OpenMP

    #pragma omp parallel for collapse(2) schedule(static)
    for(size_t x = xMin; x < xMax; x++)
    {
      for(size_t y = yMin; y < yMax; y++)
      {
        size_t const idx = x*strideX + y*strideY;
    #pragma vector …
        for(size_t z = zMin; z < zMax; z++)
        {
          size_t const i = idx + z;

          // Derivative Calculations
        }
      }
    }

Many options available for OpenMP
  – Tuning especially important on Phi (mostly because of high thread count)
Here we use static loop scheduling, because it has the lowest overhead
  – It is also the most prone to load-balance issues

Tuning OpenMP

collapse(2) combines two adjacent for loops
Here, the X and Y dimensions are combined, e.g. X = 250, Y = 150
Work is divided more evenly among cores when there are more iterations
  – 250 iterations on 240 threads (60*4) means 10 threads do double work while the other threads wait (1/2 of the time wasted)
  – 250 x 150 iterations divide much better onto 240 threads (1/157 of the time wasted)
Improved Phi performance by 1.5x!

[Figure: separate X and Y iteration ranges combined into a single X * Y range]
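
The two fractions can be reproduced with a few lines of arithmetic. The sketch assumes every iteration costs the same and that the static schedule hands each thread either the rounded-down or the rounded-up share of iterations; `idle_fraction` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>

// Fraction of the loop's duration that the least-loaded thread spends waiting.
double idle_fraction(std::size_t iterations, std::size_t threads) {
    assert(iterations > 0 && threads > 0);
    const std::size_t most = (iterations + threads - 1) / threads;  // rounded-up share
    const std::size_t least = iterations / threads;                 // rounded-down share
    return static_cast<double>(most - least) / static_cast<double>(most);
}
```

`idle_fraction(250, 240)` is 1/2, because the busiest threads run 2 iterations while the rest run 1. `idle_fraction(250 * 150, 240)` is 1/157, because the shares are 157 and 156 iterations.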

Tuning Thread Affinity

We programmatically set affinity with run-dependent logic
Isolating various tasks prevents over-subscription of cores

[Figure: cores Core 0, Core 1, Core 2, … Core 60, Core 61, each assigned to one of: Transfer Threads, Disk IO Threads, Propagation Threads, OS Threads]
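
The idea of giving each kind of task its own cores can be sketched as a planning step that runs before any thread is pinned. The actual run-dependent logic is not described in the slides, so the layout here is purely hypothetical: the first cores go to transfer and disk IO threads, the last core is left to the operating system, and everything in between runs propagation threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Role { Os, Transfer, DiskIo, Propagation };

// Assign one role per core so that no core is shared between kinds of task.
std::vector<Role> plan_cores(std::size_t cores, std::size_t transferCores,
                             std::size_t diskCores) {
    // At least one core must remain for propagation after reserving the others.
    assert(cores > transferCores + diskCores + 1);
    std::vector<Role> plan(cores, Role::Propagation);
    for (std::size_t c = 0; c < transferCores; ++c) {
        plan[c] = Role::Transfer;
    }
    for (std::size_t c = 0; c < diskCores; ++c) {
        plan[transferCores + c] = Role::DiskIo;
    }
    plan[cores - 1] = Role::Os;
    return plan;
}
```

On Linux, a plan like this could then be applied per thread with a call such as `pthread_setaffinity_np`; whether the original code does it that way is not stated.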

Tuning Thread Affinity

Thread affinity settings improved scaling on multiple Phis and multiple CPU sockets

                           Dual Xeon Phi vs. single Phi   Dual Xeon sockets vs. single socket
Without affinity changes   1.3x                           1.9x
With affinity changes      1.9x                           1.7x

Different settings for Xeon Phi and Xeon

Tuning Memory Access

    #pragma omp parallel for collapse(2) schedule(static)
    for(size_t x = xMin; x < xMax; x++)
    {
      for(size_t y = yMin; y < yMax; y++)
      {
        size_t const idx = x*strideX + y*strideY;
        __assume(strideX%16==0);
        __assume(strideY%16==0);
        __assume(idx%16==0);
        __assume_aligned(pV ,64);
        __assume_aligned(pVy ,64);
    #pragma vector always assert vecremainder
    #pragma ivdep
    #pragma vector nontemporal (pVy)
        for(size_t z = zMin; z < zMax; z++)
        {
          size_t const i = idx + z;
          pVy[i] = ( yCoeffs*(pV[i-4*strideY]-pV[i+4*strideY])...
        }
      }
    }

pVy[i] is written once and should not be cached
Give compiler hints about indexing so it knows when to use aligned reads/writes
Improved performance by 1.1x on both Xeon and Xeon Phi!
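
The alignment hints are only safe if the data layout really satisfies them. A minimal sketch of one way to arrange that is shown below: pad each row to a multiple of 16 floats (64 bytes), and check pointers before promising alignment to the compiler. How the original code lays out its arrays is not stated; `padded_stride` and `is_aligned` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kAlignBytes = 64;                              // 512 bits
constexpr std::size_t kFloatsPerLine = kAlignBytes / sizeof(float);  // 16 floats

// Round a row length up so that every row starts on a 64-byte boundary.
std::size_t padded_stride(std::size_t nz) {
    assert(nz > 0);
    return (nz + kFloatsPerLine - 1) / kFloatsPerLine * kFloatsPerLine;
}

// True when p meets the promise made by __assume_aligned(p, 64).
bool is_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAlignBytes == 0;
}
```

A row of 100 floats is padded to 112, so `strideY % 16 == 0` holds, and `strideX` inherits the property as a multiple of `strideY`. The padding costs some memory but lets every row use aligned vector loads and stores.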

Current Performance Results

For anisotropic wave propagation, a Xeon Phi coprocessor delivers ~2.3x the performance of a single Xeon E5-2670 CPU
The same code base and optimizations are applied to Xeon and Xeon Phi

About Acceleware

Professional training
  – Xeon Phi Coprocessor Optimization
  – OpenCL
  – OpenMP
  – MPI
High performance consulting
  – Feasibility Studies
  – Porting and Optimization
  – Algorithm parallelization
Accelerated software
  – Oil and Gas
  – Electromagnetics

Questions?

Come visit us in booth #1825!

Head Office
Tel: +1 403.249.9099
Email: [email protected]

Viktoria Kaczur
Senior Account Manager
Tel: +1 403.249.9099 ext. 356
Cell: +1 403.671.4455
Email: [email protected]
