Guy Gueritz Oil & Gas Business Development
Mathieu Dubois Senior HPC Consultant
2 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
1. Hybrid Architectures for Seismic Imaging
- Bull profile in HPC
- Hybrid architectures
- Example: Reverse Time Migration
2. Parallel Programming for Hybrid Architectures
- GPU activities at Bull: building expertise
- Tools and programming environments
- Numerical methods
- Scalability
3 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
1. Hybrid Architectures for Seismic Imaging
Guy GUERITZ
Oil & Gas Business Development
Grenoble Advanced Competency & Services Center
4 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
[Chart: revenue bridge across Bull's four business lines (BCS, BIS, BSS, BIP): revenue of €1.2 billion rising toward €1.35 – 1.45 billion, with direct margin, indirect costs, and an EBIT of €50-60 million.]

Shareholders: Crescendo Industries 20%, France Télécom 8%, FSI 5%, NEC 2%, floating 65% (total 100%)
2011 figures: revenue €1,301 M (+4.6%); gross margin +4.2%; EBIT +23%; 9,000 employees
[Chart: revenue and profitability mix, 2010 vs. 2013, by segment: maintenance & PRS, services, hardware & systems, fulfillment, critical systems.]
5 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
HPC income, excluding maintenance (M€): 37 in 2007, 70 in 2008, 98 in 2009, 152 in 2010, and 181 in 2011 (over €180 M in 2011)
Three petaflop-scale systems
- 2010: Tera 100, the first petaflop-scale system ever designed and developed in Europe, one of the most efficient in its category (84% Linpack efficiency)
- 2010-2011: Genci / Curie (France) - 2 Pflops
- 2011-2012: IFERC – 1.5 Pflops
Other recent key projects:
- KNMI (Netherlands): meteorology
- Barcelona Supercomputing Center (Spain): 186 Tflops (hybrid)
- Société Générale (France): 350 Tflops
- Dassault Aviation (France): 100 Tflops
- AWE (UK): 250 Tflops
Launch of Extreme Factory (HPC pay-per-use) and Mobull (HPC mobile data center):
- Extreme Factory: Renault, Exa, LL Products, classified customers
- Mobull: U_Perpignan, Cenaero
6 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Services: design, architecture, project management, optimisation
supercomputer suite
StoreWay storage
Hardware platforms, software environments, interconnect, storage systems
Built from standard components, optimized by Bull's innovation
7 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Structural mechanics (implicit and explicit), computational fluid dynamics, electromagnetics, computational chemistry, quantum mechanics, reservoir simulation, rendering / ray tracing, climate / weather, ocean simulation, data analytics, molecular dynamics, computational biology, seismic processing
8 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
TERA 100 GPU-based extension
- 198 bullx B505 accelerator blades
- 396 NVIDIA® Tesla™ M2090 GPU processors
- 202,752 GPU cores

CURIE GPU-based extension
- 144 bullx B505 accelerator blades
- 288 NVIDIA® Tesla™ M2090 GPU processors
- 147,456 GPU cores

Barcelona Supercomputing Centre GPU-based system
- 126 bullx B505 accelerator blades
- 252 NVIDIA® Tesla™ M2090 GPU processors
- 129,024 GPU cores
9 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Need: a supercomputing system
- to be installed at Petrobras' new data center, on the university campus of Rio de Janeiro
- equipped with GPU accelerator technology
- dedicated to the development of new subsurface imaging techniques to support oil exploration and production

Solution: a hybrid architecture coupling 66 general-purpose servers to 66 GPU systems
- 66 bullx R422 E2 servers, i.e. 132 compute nodes or 1,056 Intel® Xeon® 5500 cores, providing a peak performance of 12.4 Tflops
- 66 NVIDIA® Tesla™ S1070 GPU systems, i.e. 63,360 cores, providing an additional theoretical performance of 246 Tflops
- 1 bullx R423 E2 service node
- Ultra-fast InfiniBand QDR interconnect
- bullx cluster suite and Red Hat Enterprise Linux

Petrobras: leader in the Brazilian petrochemical sector, and one of the largest integrated energy companies in the world
10 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
11 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Source: exascale.org
12 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
(Animation courtesy of the Institute of Geophysics in Hamburg)
13 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Forward pass
- First recursion, forward in time: model the downgoing wavefield and store snapshots of the wavefield at set time intervals

Backward pass
- Second recursion, in reverse time: compute the backward extrapolation of the wavefield, starting from the receiver data

Correlate forward + backward snapshots
- Apply the imaging condition: correlate the forward and backward samples together
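The imaging condition amounts to a zero-lag cross-correlation of the two wavefields. A minimal sketch in C, assuming the snapshots are stored as flat arrays (names and layout are illustrative, not the production code):

#include <stddef.h>

/* Zero-lag cross-correlation imaging condition: for every stored time
   level, multiply the forward (source-side) snapshot by the backward
   (receiver-side) snapshot point by point and accumulate into the image. */
void imaging_condition(float *image,      /* n output points           */
                       const float *fwd,  /* nsnap snapshots of size n */
                       const float *bwd,  /* nsnap snapshots of size n */
                       size_t n, size_t nsnap)
{
    for (size_t t = 0; t < nsnap; ++t) {
        const float *f = fwd + t * n;
        const float *b = bwd + t * n;
        for (size_t i = 0; i < n; ++i)
            image[i] += f[i] * b[i];
    }
}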
14 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Turning waves
Prismatic waves
Diving waves
Strong reflections
Multiples
15 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
3D gridded model
- Wave equation discretized into derivatives at set timesteps
- 3D grid size & resolution correspond to the wavelength (max. frequency) & aperture size

Time approximation by finite differences
- Differential equations transformed into finite-difference equations at set timesteps
- Explicit scheme: each element is calculated recursively from several previously calculated points & timesteps (see the discretization below)

Fourier methods
- Transforms between time & frequency domains
- Eliminates some cumulative errors found in FD approximations
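For the explicit scheme, assuming the standard second-order leapfrog discretization in time (the deck does not spell out the exact scheme used), each timestep computes

P^{n+1}_{ijk} = 2\,P^{n}_{ijk} - P^{n-1}_{ijk} + v_{ijk}^2\,\Delta t^2\,\nabla_h^2 P^{n}_{ijk}

where \nabla_h^2 is the discrete spatial Laplacian and n indexes timesteps: each new value depends only on already-computed time levels, which is what makes the recursion explicit.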
16 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Grid size
- Frequency content
- Choice of FD scheme

Aperture
- Too big = too costly computationally
- Too small = depending on the geology, may miss reflections

Storing the downgoing wavefield
- Snapshots
- 'Virtual receivers' at the model boundaries
- Random boundaries

Code parallelization
- 3D loops in OpenMP, CUDA (see the sketch after this list)

Domain decomposition
- MPI implemented to fit local memory
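A minimal sketch of the OpenMP variant of the 3D update loop, assuming the leapfrog update shown earlier and, for brevity, a 2nd-order 7-point Laplacian with unit grid spacing (the codes discussed here use higher-order stencils; array and parameter names are illustrative):

#include <stddef.h>

/* One explicit time step over the interior of the 3D grid; the two outer
   loops are shared across CPU cores. v2dt2 holds v^2 * dt^2 per point. */
void step_omp(float *p_next, const float *p_cur, const float *p_prev,
              const float *v2dt2, int nx, int ny, int nz)
{
    const size_t sx = 1, sy = (size_t)nx, sz = (size_t)nx * ny;

    #pragma omp parallel for collapse(2) schedule(static)
    for (int iz = 1; iz < nz - 1; ++iz)
        for (int iy = 1; iy < ny - 1; ++iy)
            for (int ix = 1; ix < nx - 1; ++ix) {
                size_t i = (size_t)iz * sz + (size_t)iy * sy + (size_t)ix * sx;
                float lap = p_cur[i + sx] + p_cur[i - sx]
                          + p_cur[i + sy] + p_cur[i - sy]
                          + p_cur[i + sz] + p_cur[i - sz]
                          - 6.0f * p_cur[i];
                p_next[i] = 2.0f * p_cur[i] - p_prev[i] + v2dt2[i] * lap;
            }
}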
17 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Storing all wavefield snapshots
- Simple method, but generates enormous amounts of data
- Requires large-capacity, fast-access on-node storage
- Node I/O impacts performance

Checkpointing
- Store pairs of consecutive snapshots at specified time intervals

Storing boundary history only
- Record the wavefield at the edges & bottom of the model ('virtual' receivers)
- The calculation is recursive, so the downgoing wavefield can be regenerated (see the sketch below)

Random boundaries
- Make the boundaries random reflectors
- Extrapolate twice: once forward (no storage), once backward (regenerates the downgoing wavefield)
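A minimal sketch of the boundary-history strategy in C; every function name below is a hypothetical placeholder for the per-step operations described above:

/* Hypothetical per-step operations, named for illustration only. */
void forward_step(int it);          /* propagate the source wavefield        */
void save_boundaries(int it);       /* record edge & bottom strips           */
void restore_boundaries(int it);    /* re-inject the saved strips            */
void reverse_forward_step(int it);  /* rerun the forward recursion backwards */
void backward_step(int it);         /* extrapolate the receiver data         */
void correlate(int it);             /* apply the imaging condition           */

/* The forward pass stores only thin boundary strips; because the recursion
   is reversible, the backward pass reruns the forward scheme in reverse to
   regenerate the downgoing wavefield, so full snapshots never hit storage. */
void rtm_shot(int nt)
{
    for (int it = 0; it < nt; ++it) {        /* forward pass  */
        forward_step(it);
        save_boundaries(it);
    }
    for (int it = nt - 1; it >= 0; --it) {   /* backward pass */
        restore_boundaries(it);
        reverse_forward_step(it);
        backward_step(it);
        correlate(it);
    }
}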
18 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Multi-core CPU sockets
19 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
CPUs connected to RAM via independent memory channels
20 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
I/O Hub: 2 – 4 GPUs per node
21 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
I/O Hub: local mass storage (spinning or solid-state drives)
22 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
I/O Hub: node-to-node interconnect
23 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
[Diagram: the bullx supercomputer suite range: water cooling, bullx S supernodes, bullx blades (B500 series and DLC B700 series), bullx R series, storage, architecture, accelerators.]
24 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
- 2 x Intel Xeon 5600
- 2 x NVIDIA M2090
- 2 x IB QDR
7U, 2.1 TFLOPS
An embedded accelerator for high performance with high energy efficiency
25 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Front view: double-width blade with 2 x CPUs and 2 x GPUs
- 2 NVIDIA Tesla M2090 GPUs
- 2 Intel® Xeon® 5600 quad/hexa-core CPUs
- 1 dedicated PCIe 16x connection for each GPU
- Double InfiniBand QDR connections between blades
Exploded view
26 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
[Block diagrams: a generic multi-GPU system, in which four GPUs plus the InfiniBand (IB) and GbE adapters share one I/O controller over PCIe 8x (4 GB/s) links, versus the bullx B505 accelerator blade: two Westmere-EP CPUs linked by QPI (12.8 GB/s each direction) with 31.2 GB/s of memory bandwidth per socket, two Tylersburg I/O controllers, a dedicated PCIe 16x (8 GB/s) link per GPU, PCIe 8x (4 GB/s) per IB adapter, and GbE.]
27 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
RTM Example: Salt Diapir
Object of study
- Demonstrate the imaging quality of RTM
- Show GPU speedup
Paradigm ECHOS 1.1
- Uses AXE RTM libraries
Multi-client data imaged with PSDM
Data courtesy of J. Schlegtenhorst
28 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
- 2 cables, 8 streamers each: (2*8) * 408 traces; 16 * 3.3 MB = 52.8 MB
- Streamer interval: 100 m
- Far offset: 5300 m
- Shot pattern: 5000 m x 700 m
- Sub-volume: 10 km x 6 km x 12 km
- Grid: 12.5 m x 12.5 m
- fmax = 25 Hz to fmax = 40 Hz
Data courtesy of J. Schlegtenhorst
29 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
CDP Grid Inline 2701
Data courtesy of J. Schlegtenhorst
30 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Optimum Grid Inline 2701
Data courtesy of J. Schlegtenhorst
31 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
CDP Grid Inline 2891
Data courtesy of J. Schlegtenhorst
32 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Optimum Grid Inline 2891
Data courtesy of J. Schlegtenhorst
33 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Single-shot runtime (30 Hz)
- New B510 Sandy Bridge blades (16 cores, 4 channels to memory): RTM image in 2 h 41 m
- B505 Westmere GPU blades (2 x M2090 GPUs): RTM image in 15 m
34 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Run times, 16 cores (1 node, 2 Sandy Bridge sockets):

             | 25 Hz      | 30 Hz      | 35 Hz      | 40 Hz
Optimum grid | 43 min     | 1 h 17 min | 2 h 07 min | 3 h 23 min
CDP grid     | 2 h 19 min | 2 h 41 min | 2 h 58 min | 3 h 28 min
35 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
The choice of hybrid architecture depends on several factors:
- Algorithm & numerical method employed
- Correlation strategy used (local storage requirements)
- Grid & aperture sizes
- Frequencies involved
- Size of the survey

As RTM becomes more widely used, system scalability will be of critical importance:
- Processor & co-processor technologies are evolving rapidly
- The software environment is maturing
- The economics of the hybrid approach are taking hold
36 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
2. Parallel Programming For Hybrid Architectures
Mathieu DUBOIS
Senior Application Engineer - Hardware Accelerators Expert
Applications & Performance Team
Grenoble Advanced Competency & Services Center
37 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
3 sites: Grenoble (A), Angers (B), Les Clayes-sous-Bois (C)
15 full-time dedicated engineers: 14 performance engineers and 1 system administrator, coming from different scientific domains; software & hardware expertise
2 benchmarking systems:
- Benchmarking system in Angers: TOP500-ranked (#110), 107 Tflops
- HPC lab in Grenoble
38 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Presales: common operations; technical answers to calls for tender; consulting (architecture)
Services: the "Extreme Computing Competence Center", with a specific mission: porting, integration and optimization of user applications in their bullx environment
Training
Support: high-level support (L3)
Technology watch
Development of internal tools
39 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Activities: training; benchmarking; proof of concept / code migration; code optimisation; technology watch & performance evaluation

Areas: physics, chemistry, biology; oil & gas; life science; security & finance
40 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Barcelona Supercomputing Center
- 252 M2090
- 103 Tflops Linpack score
- Ranked #114 on the TOP500
- Ranked #7 on the GREEN500 (#1 in Europe)
GENCI
- 288 M2090
- 110 Tflops Linpack score
- Ranked #102 on the TOP500
- Ranked #8 on the GREEN500
CEA - Tera 100
- 390 M2090
- 154 Tflops Linpack score
- Ranked #75 on the TOP500
41 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
42 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
BULL's expertise in GPU environments is well recognized
43 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
2010 - First Prize: Dimitri Komatitsch, SPECFEM3D (geodynamics); GPU version in development
2009 - First Prize: Luigi Genovese, BigDFT (nanosciences); CUDA & OpenCL versions available
Awards and active development of major scientific applications
44 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Performance, from lowest to highest: PGI Accelerator, HMPP, OpenCL, Fortran CUDA, CUDA C
Simplicity, from highest to lowest: PGI Accelerator, HMPP, Fortran CUDA, CUDA C, OpenCL
45 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Isotropic wave equation:

\frac{1}{v^2}\frac{\partial^2 P}{\partial t^2} = \frac{\partial^2 P}{\partial x^2} + \frac{\partial^2 P}{\partial y^2} + \frac{\partial^2 P}{\partial z^2}
Order-k stencil in space (here k is 4); the code is memory-bandwidth bound.
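For k = 4, the second derivative along each axis uses the standard fourth-order central coefficients (written here for x with grid spacing h; y and z are analogous):

\frac{\partial^2 P}{\partial x^2}\Big|_i \approx \frac{1}{h^2}\left(-\frac{1}{12}P_{i-2} + \frac{4}{3}P_{i-1} - \frac{5}{2}P_i + \frac{4}{3}P_{i+1} - \frac{1}{12}P_{i+2}\right)

Each output point therefore touches 3k + 1 = 13 input points, which is why the kernel is limited by memory bandwidth rather than arithmetic.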
46 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
[Diagram: CUDA memory hierarchy: per-thread local memory, per-block shared memory, per-GPU global memory; a kernel launches a grid of thread blocks, and kernels (kernel 1, kernel 2, ...) execute sequentially.]
47 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
http://developer.download.nvidia.com/CUDA/CUDA_Zone/papers/gpu_3dfd_rev.pdf
First approach: 3k + 1 elements are needed for 1 output value.

Better approach: data are reused for several output values. Perform the calculation from shared memory, whose latency is two orders of magnitude lower than global memory's.

Result: one order of magnitude higher performance for the GPU compared to one CPU core.

Computation can also be overlapped with the data transfers that save the output wavefield. A sketch of the shared-memory scheme follows.
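A minimal sketch of that shared-memory scheme in CUDA C, along the lines of the tiling approach in the paper linked above: each block stages an x-y tile of the current wavefield (plus halos) in shared memory and reads its z-neighbours from global memory. Tile sizes, launch constraints and array names are illustrative assumptions, not the benchmarked code.

/* Stencil radius: an order-4 spatial stencil needs k/2 = 2 points per side. */
#define R       2
#define TILE_X 32
#define TILE_Y 16

/* One explicit time step on interior points. Assumed launch:
   grid((nx-2*R)/TILE_X, (ny-2*R)/TILE_Y, nz-2*R), block(TILE_X, TILE_Y),
   with nx-2*R and ny-2*R multiples of the tile sizes. v2dt2 = v^2 * dt^2. */
__global__ void fd_step(const float *p_cur, const float *p_prev,
                        float *p_next, const float *v2dt2,
                        int nx, int ny, int nz)
{
    __shared__ float s[TILE_Y + 2*R][TILE_X + 2*R];

    /* 4th-order central coefficients for a second derivative, unit spacing */
    const float c[R + 1] = { -2.5f, 4.0f/3.0f, -1.0f/12.0f };

    const int ix = blockIdx.x * TILE_X + threadIdx.x + R;
    const int iy = blockIdx.y * TILE_Y + threadIdx.y + R;
    const int iz = blockIdx.z + R;
    const int tx = threadIdx.x + R, ty = threadIdx.y + R;
    const size_t plane = (size_t)nx * ny;
    const size_t i = (size_t)iz * plane + (size_t)iy * nx + ix;

    /* Stage the x-y tile plus halos: each value is fetched from global
       memory once but reused by up to 2*R neighbouring threads per axis. */
    s[ty][tx] = p_cur[i];
    if (threadIdx.x < R) {
        s[ty][tx - R]      = p_cur[i - R];
        s[ty][tx + TILE_X] = p_cur[i + TILE_X];
    }
    if (threadIdx.y < R) {
        s[ty - R][tx]      = p_cur[i - (size_t)R * nx];
        s[ty + TILE_Y][tx] = p_cur[i + (size_t)TILE_Y * nx];
    }
    __syncthreads();

    /* x and y derivatives from shared memory, z neighbours from global */
    float lap = 3.0f * c[0] * s[ty][tx];
    for (int r = 1; r <= R; ++r)
        lap += c[r] * (s[ty][tx - r] + s[ty][tx + r]
                     + s[ty - r][tx] + s[ty + r][tx]
                     + p_cur[i - (size_t)r * plane]
                     + p_cur[i + (size_t)r * plane]);

    /* explicit second-order update in time */
    p_next[i] = 2.0f * p_cur[i] - p_prev[i] + v2dt2[i] * lap;
}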
48 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
The code is based on an extension of the pseudo-spectral method called the pseudo-analytic model (see the formula below):
- Modifies the Fourier transform of the Laplacian operator, correcting the propagation errors of the finite-difference scheme
- Obtains nearly non-dispersive wave propagation
Original source code in Fortran 90, using OpenMP and MKL FFTs; one shot per node.
Obvious hot spots: FFTs, Laplacian.
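For reference, the plain pseudo-spectral method evaluates the Laplacian through forward and inverse Fourier transforms,

\nabla^2 P = \mathcal{F}^{-1}\!\left[-\left(k_x^2 + k_y^2 + k_z^2\right)\mathcal{F}[P]\right]

and the pseudo-analytic model replaces the -|k|^2 multiplier with a compensated operator; the exact correction is specific to the method and is not reproduced here.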
49 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
[Profile chart: time (sec) per kernel on the CPU baseline.]

85% of the time is spent in one subroutine, in which 6 kernels are identified:
kernel 1: 31%; kernel 2: 13%; kernel 3: 2%; kernel 4: 1%; kernel 5: 18%; FFTs: 35%
50 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Test platform: bullx B505 server with 2 Intel Westmere 4-core processors @ 2.67 GHz, 24 GB DDR3 @ 1333 MHz, and 2 NVIDIA M2090 GPUs.

Software and tools: NVIDIA CUDA 4.1; Intel compilers version 12 and Intel MPI 4; PGI compilers 11.

Porting plan:
- Use the CUFFT library and write call wrappers (see the sketch below)
- Write a CUDA kernel for each of the 5 subroutine kernels (to avoid transfers)
- Compare CUDA C, Fortran CUDA and HMPP
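A minimal sketch of such CUFFT call wrappers in C (single-precision, complex-to-complex, in-place 3D transforms; the wrapper names and the single cached plan are illustrative assumptions):

#include <cufft.h>

/* Thin wrappers replacing the MKL FFT calls: create one 3D plan, then run
   forward/inverse single-precision complex transforms in place on device
   memory. cuFFT orders dimensions slowest-varying first, and its inverse
   transform is unnormalized (scale by 1/(nx*ny*nz) afterwards). */
static cufftHandle plan3d;

int fft3d_init(int nz, int ny, int nx)
{
    return cufftPlan3d(&plan3d, nz, ny, nx, CUFFT_C2C) == CUFFT_SUCCESS ? 0 : -1;
}

int fft3d_exec(cufftComplex *d_data, int forward)
{
    return cufftExecC2C(plan3d, d_data, d_data,
                        forward ? CUFFT_FORWARD : CUFFT_INVERSE)
           == CUFFT_SUCCESS ? 0 : -1;
}

void fft3d_finalize(void)
{
    cufftDestroy(plan3d);
}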
51 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Simplified Fortran porting:
- No need for Fortran-to-C CUDA interfaces
- No problem with the unit-stride memory accesses in multidimensional arrays
- API simplified and identical to Fortran 90:

!Define variables on the CPU
real, pinned, allocatable, dimension(:,:,:) :: A_host
!Define variables on the GPU
real, device, allocatable, dimension(:,:,:) :: A_device
!Allocate both in a single call
allocate( A_host(nx,ny,nz), A_device(nx,ny,nz) )
!Transfer data between CPU and GPU
A_device = A_host

Same performance between CUDA C and Fortran CUDA.
52 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Before HMPP 3 there was no way to call external CUDA libraries. Now a native call:

#pragma hmpp <cublas> group, target=cuda
#pragma hmpp <cublas> acquire
#pragma hmppalt cublas call, name="cublasSgemm"
sgemm(trans,trans,n,n,n,alpha,A,n,B,n,beta,C,n)

is replaced at compilation with a declared proxy:

#pragma hmpp <cublas> group, target=cuda
#pragma hmpp <cublas> acquire
#pragma hmppalt cublas declare, name="cublasSgemm", extend(error,...), fallback=true
void MycublasSgemm(int* proxyError, char transa, char transb, int m, int n, int k,
                   float alpha, const float *A, int lda, const float *B, int ldb,
                   float beta, float *C, int ldc)
{
    deviceDataA = hmpprt_data_get_device_address(A);
    (...)
    cublasSgemm(transa, transb, m, n, k, alpha, deviceDataA, lda, deviceDataB, ldb, beta, deviceDataC, ldc);
}
53 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
4x speedup between 1 M2090 GPU and 8 Xeon cores; data transfers are reduced to 1 second.

[Chart: time (sec) per kernel (kernels 1-5 and FFTs), 8 Xeon cores vs. 1 M2090 GPU.]
54 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
RTM is an embarrassingly parallel application (over shots).

On standard CPU servers: compute 1 shot per node; take advantage of all the CPU cores for full MKL FFT performance; overall performance will increase with new processor generations.

On GPU servers: 4x speedup for one shot using one GPU; compute 1 shot per GPU available on the server (see the device-binding sketch below); either halve the number of servers for the same speed, or keep the same number of servers and double the speed.

Caveat: the benchmark used a small data set; real problem sizes may be too big to fit in today's GPU memory.
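One common way to realize one shot per GPU is to bind each shot-processing MPI rank to a device before the first CUDA call. A minimal sketch, assuming consecutive ranks are placed on the same node (this mapping is an assumption, not necessarily the scheme used here):

#include <mpi.h>
#include <cuda_runtime.h>

/* Map each MPI rank (one rank = one shot) to one of the node's GPUs. */
int bind_rank_to_gpu(void)
{
    int rank, ndev;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);
    /* assumes ranks 0..ndev-1 share a node, then the pattern repeats */
    return cudaSetDevice(rank % ndev) == cudaSuccess ? 0 : -1;
}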
55 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
BULL has built its expertise on real customer requests: training; POCs in oil & gas, finance, life science and material science; advice for cluster architecture definition; pro-activity.

BULL's expertise is recognized: successful POCs with significant speedups and cost reductions; acknowledgments in scientific publications; help with code migration and optimization.
56 ©Bull, 2012 GPU Tech Conference 2012 – San Jose