Guy Gueritz Oil & Gas Business Development
Mathieu Dubois Senior HPC Consultant
2 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
1. Hybrid Architectures for Seismic Imaging
- Bull profile in HPC
- Hybrid architectures
- Example: Reverse Time Migration
2. Parallel Programming for Hybrid Architectures
- GPU activities at Bull: building expertise
- Tools and programming environments
- Numerical methods
- Scalability
3 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
1. Hybrid Architectures for Seismic Imaging
Guy GUERITZ
Oil & Gas Business Development
Grenoble Advanced Competency & Services Center
4 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
[Chart: revenue bridge across Bull's four business lines (BCS, BIS, BSS, BIP): revenue of €1.2 billion rising toward €1.35 – 1.45 billion, with direct margin, indirect costs, and an EBIT of €50-60 million.]

Shareholders: Crescendo Industries 20%, France Télécom 8%, FSI 5%, NEC 2%, floating 65% (total 100%)
2011 figures: revenue €1,301 M (+4.6%); gross margin +4.2%; EBIT +23%; 9,000 employees
[Chart: revenue and profitability mix, 2010 vs. 2013, by segment: maintenance & PRS, services, hardware & systems, fulfillment, critical systems.]
5 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
HPC income, excluding maintenance (M€): 37 in 2007, 70 in 2008, 98 in 2009, 152 in 2010, and 181 in 2011 (over €180 M in 2011)
Three petaflop-scale systems
- 2010: Tera 100, the first petaflop-scale system ever designed and developed in Europe, one of the most efficient in its category (84% Linpack efficiency)
- 2010-2011: Genci / Curie (France) - 2 Pflops
- 2011-2012: IFERC – 1.5 Pflops
Other recent key projects:
- KNMI (Netherlands): meteorology
- Barcelona Supercomputing Center (Spain): 186 Tflops (hybrid)
- Société Générale (France): 350 Tflops
- Dassault Aviation (France): 100 Tflops
- AWE (UK): 250 Tflops
Launch of Extreme Factory (HPC pay-per-use) and Mobull (HPC mobile data center):
- Extreme Factory: Renault, Exa, LL Products, classified customers
- Mobull: U_Perpignan, Cenaero
6 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Services: design, architecture, project management, optimisation
supercomputer suite
StoreWay storage
Hardware platforms, software environments, interconnect, storage systems
Built from standard components, optimized by Bull's innovation
7 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Structural mechanics (implicit and explicit), computational fluid dynamics, electromagnetics, computational chemistry, quantum mechanics, reservoir simulation, rendering / ray tracing, climate / weather, ocean simulation, data analytics, molecular dynamics, computational biology, seismic processing
8 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
TERA 100 GPU-based extension
- 198 bullx B505 accelerator blades
- 396 NVIDIA® Tesla™ M2090 GPU processors
- 202,752 GPU cores

CURIE GPU-based extension
- 144 bullx B505 accelerator blades
- 288 NVIDIA® Tesla™ M2090 GPU processors
- 147,456 GPU cores

Barcelona Supercomputing Centre GPU-based system
- 126 bullx B505 accelerator blades
- 252 NVIDIA® Tesla™ M2090 GPU processors
- 129,024 GPU cores
9 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Need: a supercomputing system
- to be installed at Petrobras' new data center, on the university campus of Rio de Janeiro
- equipped with GPU accelerator technology
- dedicated to the development of new subsurface imaging techniques to support oil exploration and production

Solution: a hybrid architecture coupling 66 general-purpose servers to 66 GPU systems
- 66 bullx R422 E2 servers, i.e. 132 compute nodes or 1,056 Intel® Xeon® 5500 cores, providing a peak performance of 12.4 Tflops
- 66 NVIDIA® Tesla™ S1070 GPU systems, i.e. 63,360 cores, providing an additional theoretical performance of 246 Tflops
- 1 bullx R423 E2 service node
- Ultra-fast InfiniBand QDR interconnect
- bullx cluster suite and Red Hat Enterprise Linux

Petrobras: leader in the Brazilian petrochemical sector, and one of the largest integrated energy companies in the world
10 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
11 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Source: exascale.org
12 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
(Animation courtesy of the Institute of Geophysics in Hamburg)
13 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Forward pass
- First recursion, forward in time: model the downgoing wavefield and store snapshots of the wavefield at set time intervals

Backward pass
- Second recursion, in reverse time: compute the backward extrapolation of the wavefield, starting from the receiver data

Correlate forward + backward snapshots
- Apply the imaging condition: correlate the forward and backward samples together
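The imaging condition amounts to a zero-lag cross-correlation of the two wavefields. A minimal sketch in C, assuming the snapshots are stored as flat arrays (names and layout are illustrative, not the production code):

#include <stddef.h>

/* Zero-lag cross-correlation imaging condition: for every stored time
   level, multiply the forward (source-side) snapshot by the backward
   (receiver-side) snapshot point by point and accumulate into the image. */
void imaging_condition(float *image,      /* n output points           */
                       const float *fwd,  /* nsnap snapshots of size n */
                       const float *bwd,  /* nsnap snapshots of size n */
                       size_t n, size_t nsnap)
{
    for (size_t t = 0; t < nsnap; ++t) {
        const float *f = fwd + t * n;
        const float *b = bwd + t * n;
        for (size_t i = 0; i < n; ++i)
            image[i] += f[i] * b[i];
    }
}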
14 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Turning waves
Prismatic waves
Diving waves
Strong reflections
Multiples
15 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
3D gridded model
- Wave equation discretized into derivatives at set timesteps
- 3D grid size & resolution correspond to the wavelength (max. frequency) & aperture size

Time approximation by finite differences
- Differential equations transformed into finite-difference equations at set timesteps
- Explicit scheme: each element is calculated recursively from several previously calculated points & timesteps (see the discretization below)

Fourier methods
- Transforms between time & frequency domains
- Eliminates some cumulative errors found in FD approximations
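For the explicit scheme, assuming the standard second-order leapfrog discretization in time (the deck does not spell out the exact scheme used), each timestep computes

P^{n+1}_{ijk} = 2\,P^{n}_{ijk} - P^{n-1}_{ijk} + v_{ijk}^2\,\Delta t^2\,\nabla_h^2 P^{n}_{ijk}

where \nabla_h^2 is the discrete spatial Laplacian and n indexes timesteps: each new value depends only on already-computed time levels, which is what makes the recursion explicit.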
16 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Grid size
- Frequency content
- Choice of FD scheme

Aperture
- Too big = too costly computationally
- Too small = depending on the geology, may miss reflections

Storing the downgoing wavefield
- Snapshots
- 'Virtual receivers' at the model boundaries
- Random boundaries

Code parallelization
- 3D loops in OpenMP, CUDA (see the sketch after this list)

Domain decomposition
- MPI implemented to fit local memory
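A minimal sketch of the OpenMP variant of the 3D update loop, assuming the leapfrog update shown earlier and, for brevity, a 2nd-order 7-point Laplacian with unit grid spacing (the codes discussed here use higher-order stencils; array and parameter names are illustrative):

#include <stddef.h>

/* One explicit time step over the interior of the 3D grid; the two outer
   loops are shared across CPU cores. v2dt2 holds v^2 * dt^2 per point. */
void step_omp(float *p_next, const float *p_cur, const float *p_prev,
              const float *v2dt2, int nx, int ny, int nz)
{
    const size_t sx = 1, sy = (size_t)nx, sz = (size_t)nx * ny;

    #pragma omp parallel for collapse(2) schedule(static)
    for (int iz = 1; iz < nz - 1; ++iz)
        for (int iy = 1; iy < ny - 1; ++iy)
            for (int ix = 1; ix < nx - 1; ++ix) {
                size_t i = (size_t)iz * sz + (size_t)iy * sy + (size_t)ix * sx;
                float lap = p_cur[i + sx] + p_cur[i - sx]
                          + p_cur[i + sy] + p_cur[i - sy]
                          + p_cur[i + sz] + p_cur[i - sz]
                          - 6.0f * p_cur[i];
                p_next[i] = 2.0f * p_cur[i] - p_prev[i] + v2dt2[i] * lap;
            }
}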
17 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Storing all wavefield snapshots
- Simple method, but generates enormous amounts of data
- Requires large-capacity, fast-access on-node storage
- Node I/O impacts performance

Checkpointing
- Store pairs of consecutive snapshots at specified time intervals

Storing boundary history only
- Record the wavefield at the edges & bottom of the model ('virtual' receivers)
- The calculation is recursive, so the downgoing wavefield can be regenerated (see the sketch below)

Random boundaries
- Make the boundaries random reflectors
- Extrapolate twice: once forward (no storage), once backward (regenerates the downgoing wavefield)
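A minimal sketch of the boundary-history strategy in C; every function name below is a hypothetical placeholder for the per-step operations described above:

/* Hypothetical per-step operations, named for illustration only. */
void forward_step(int it);          /* propagate the source wavefield        */
void save_boundaries(int it);       /* record edge & bottom strips           */
void restore_boundaries(int it);    /* re-inject the saved strips            */
void reverse_forward_step(int it);  /* rerun the forward recursion backwards */
void backward_step(int it);         /* extrapolate the receiver data         */
void correlate(int it);             /* apply the imaging condition           */

/* The forward pass stores only thin boundary strips; because the recursion
   is reversible, the backward pass reruns the forward scheme in reverse to
   regenerate the downgoing wavefield, so full snapshots never hit storage. */
void rtm_shot(int nt)
{
    for (int it = 0; it < nt; ++it) {        /* forward pass  */
        forward_step(it);
        save_boundaries(it);
    }
    for (int it = nt - 1; it >= 0; --it) {   /* backward pass */
        restore_boundaries(it);
        reverse_forward_step(it);
        backward_step(it);
        correlate(it);
    }
}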
18 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Multi-core CPU sockets
19 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
CPUs connected to RAM via independent memory channels
20 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
I/O Hub: 2 – 4 GPUs per node
21 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
I/O Hub: local mass storage (spinning or solid-state drives)
22 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
I/O Hub: node-to-node interconnect
23 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
[Diagram: the bullx supercomputer suite range: water cooling, bullx S supernodes, bullx blades (B500 series and DLC B700 series), bullx R series, storage, architecture, accelerators.]
24 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
- 2 x Intel Xeon 5600
- 2 x NVIDIA M2090
- 2 x IB QDR
7U, 2.1 TFLOPS
An embedded accelerator for high performance with high energy efficiency
25 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Front view: double-width blade with 2 x CPUs and 2 x GPUs
- 2 NVIDIA Tesla M2090 GPUs
- 2 Intel® Xeon® 5600 quad/hexa-core CPUs
- 1 dedicated PCIe 16x connection for each GPU
- Double InfiniBand QDR connections between blades
Exploded view
26 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
[Block diagrams: a generic multi-GPU system, in which four GPUs plus the InfiniBand (IB) and GbE adapters share one I/O controller over PCIe 8x (4 GB/s) links, versus the bullx B505 accelerator blade: two Westmere-EP CPUs linked by QPI (12.8 GB/s each direction) with 31.2 GB/s of memory bandwidth per socket, two Tylersburg I/O controllers, a dedicated PCIe 16x (8 GB/s) link per GPU, PCIe 8x (4 GB/s) per IB adapter, and GbE.]
27 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
RTM Example: Salt Diapir
Object of study
- Demonstrate the imaging quality of RTM
- Show GPU speedup
Paradigm ECHOS 1.1
- Uses AXE RTM libraries
Multi-client data imaged with PSDM
Data courtesy of J. Schlegtenhorst
28 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
- 2 cables, 8 streamers each: (2*8) * 408 traces; 16 * 3.3 MB = 52.8 MB
- Streamer interval: 100 m
- Far offset: 5300 m
- Shot pattern: 5000 m x 700 m
- Sub-volume: 10 km x 6 km x 12 km
- Grid: 12.5 m x 12.5 m
- fmax = 25 Hz to fmax = 40 Hz
Data courtesy of J. Schlegtenhorst
29 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
CDP Grid Inline 2701
Data courtesy of J. Schlegtenhorst
30 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Optimum Grid Inline 2701
Data courtesy of J. Schlegtenhorst
31 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
CDP Grid Inline 2891
Data courtesy of J. Schlegtenhorst
32 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Optimum Grid Inline 2891
Data courtesy of J. Schlegtenhorst
33 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Single-shot runtime (30 Hz)
- New B510 Sandy Bridge blades (16 cores, 4 channels to memory): RTM image in 2 h 41 m
- B505 Westmere GPU blades (2 x M2090 GPUs): RTM image in 15 m
34 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Run times, 16 cores (1 node, 2 Sandy Bridge sockets):

             | 25 Hz      | 30 Hz      | 35 Hz      | 40 Hz
Optimum grid | 43 min     | 1 h 17 min | 2 h 07 min | 3 h 23 min
CDP grid     | 2 h 19 min | 2 h 41 min | 2 h 58 min | 3 h 28 min
35 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
The choice of hybrid architecture depends on several factors:
- Algorithm & numerical method employed
- Correlation strategy used (local storage requirements)
- Grid & aperture sizes
- Frequencies involved
- Size of the survey

As RTM becomes more widely used, system scalability will be of critical importance:
- Processor & co-processor technologies are evolving rapidly
- The software environment is maturing
- The economics of the hybrid approach are taking hold
36 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
2. Parallel Programming For Hybrid Architectures
Mathieu DUBOIS
Senior Application Engineer - Hardware Accelerators Expert
Applications & Performance Team
Grenoble Advanced Competency & Services Center
37 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
3 sites: Grenoble (A), Angers (B), Les Clayes-sous-Bois (C)
15 full-time dedicated engineers: 14 performance engineers and 1 system administrator, coming from different scientific domains; software & hardware expertise
2 benchmarking systems:
- Benchmarking system in Angers: TOP500-ranked (#110), 107 Tflops
- HPC lab in Grenoble
38 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Presales: common operations; technical answers to calls for tender; consulting (architecture)
Services: the "Extreme Computing Competence Center", with a specific mission: porting, integration and optimization of user applications in their bullx environment
Training
Support: high-level support (L3)
Technology watch
Development of internal tools
39 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Activities: training; benchmarking; proof of concept / code migration; code optimisation; technology watch & performance evaluation

Areas: physics, chemistry, biology; oil & gas; life science; security & finance
40 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Barcelona Supercomputing Center
- 252 M2090
- 103 Tflops Linpack score
- Ranked #114 on the TOP500
- Ranked #7 on the GREEN500 (#1 in Europe)
GENCI
- 288 M2090
- 110 Tflops Linpack score
- Ranked #102 on the TOP500
- Ranked #8 on the GREEN500
CEA - Tera 100
- 390 M2090
- 154 Tflops Linpack score
- Ranked #75 on the TOP500
41 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
42 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
BULL's expertise in GPU environments is well recognized
43 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
2010 - First Prize: Dimitri Komatitsch, SPECFEM3D (geodynamics); GPU version in development
2009 - First Prize: Luigi Genovese, BigDFT (nanosciences); CUDA & OpenCL versions available
Awards and active development of major scientific applications
44 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Performance, from lowest to highest: PGI Accelerator, HMPP, OpenCL, Fortran CUDA, CUDA C
Simplicity, from highest to lowest: PGI Accelerator, HMPP, Fortran CUDA, CUDA C, OpenCL
45 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Isotropic wave equation:

\frac{1}{v^2}\frac{\partial^2 P}{\partial t^2} = \frac{\partial^2 P}{\partial x^2} + \frac{\partial^2 P}{\partial y^2} + \frac{\partial^2 P}{\partial z^2}
Order-k stencil in space (here k is 4); the code is memory-bandwidth bound.
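For k = 4, the second derivative along each axis uses the standard fourth-order central coefficients (written here for x with grid spacing h; y and z are analogous):

\frac{\partial^2 P}{\partial x^2}\Big|_i \approx \frac{1}{h^2}\left(-\frac{1}{12}P_{i-2} + \frac{4}{3}P_{i-1} - \frac{5}{2}P_i + \frac{4}{3}P_{i+1} - \frac{1}{12}P_{i+2}\right)

Each output point therefore touches 3k + 1 = 13 input points, which is why the kernel is limited by memory bandwidth rather than arithmetic.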
46 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
[Diagram: CUDA memory hierarchy: per-thread local memory, per-block shared memory, per-GPU global memory; a kernel launches a grid of thread blocks, and kernels (kernel 1, kernel 2, ...) execute sequentially.]
47 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
http://developer.download.nvidia.com/CUDA/CUDA_Zone/papers/gpu_3dfd_rev.pdf
First approach: 3k + 1 elements are needed for 1 output value.

Better approach: data are reused for several output values. Perform the calculation from shared memory, whose latency is two orders of magnitude lower than global memory's.

Result: one order of magnitude higher performance for the GPU compared to one CPU core.

Computation can also be overlapped with the data transfers that save the output wavefield. A sketch of the shared-memory scheme follows.
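A minimal sketch of that shared-memory scheme in CUDA C, along the lines of the tiling approach in the paper linked above: each block stages an x-y tile of the current wavefield (plus halos) in shared memory and reads its z-neighbours from global memory. Tile sizes, launch constraints and array names are illustrative assumptions, not the benchmarked code.

/* Stencil radius: an order-4 spatial stencil needs k/2 = 2 points per side. */
#define R       2
#define TILE_X 32
#define TILE_Y 16

/* One explicit time step on interior points. Assumed launch:
   grid((nx-2*R)/TILE_X, (ny-2*R)/TILE_Y, nz-2*R), block(TILE_X, TILE_Y),
   with nx-2*R and ny-2*R multiples of the tile sizes. v2dt2 = v^2 * dt^2. */
__global__ void fd_step(const float *p_cur, const float *p_prev,
                        float *p_next, const float *v2dt2,
                        int nx, int ny, int nz)
{
    __shared__ float s[TILE_Y + 2*R][TILE_X + 2*R];

    /* 4th-order central coefficients for a second derivative, unit spacing */
    const float c[R + 1] = { -2.5f, 4.0f/3.0f, -1.0f/12.0f };

    const int ix = blockIdx.x * TILE_X + threadIdx.x + R;
    const int iy = blockIdx.y * TILE_Y + threadIdx.y + R;
    const int iz = blockIdx.z + R;
    const int tx = threadIdx.x + R, ty = threadIdx.y + R;
    const size_t plane = (size_t)nx * ny;
    const size_t i = (size_t)iz * plane + (size_t)iy * nx + ix;

    /* Stage the x-y tile plus halos: each value is fetched from global
       memory once but reused by up to 2*R neighbouring threads per axis. */
    s[ty][tx] = p_cur[i];
    if (threadIdx.x < R) {
        s[ty][tx - R]      = p_cur[i - R];
        s[ty][tx + TILE_X] = p_cur[i + TILE_X];
    }
    if (threadIdx.y < R) {
        s[ty - R][tx]      = p_cur[i - (size_t)R * nx];
        s[ty + TILE_Y][tx] = p_cur[i + (size_t)TILE_Y * nx];
    }
    __syncthreads();

    /* x and y derivatives from shared memory, z neighbours from global */
    float lap = 3.0f * c[0] * s[ty][tx];
    for (int r = 1; r <= R; ++r)
        lap += c[r] * (s[ty][tx - r] + s[ty][tx + r]
                     + s[ty - r][tx] + s[ty + r][tx]
                     + p_cur[i - (size_t)r * plane]
                     + p_cur[i + (size_t)r * plane]);

    /* explicit second-order update in time */
    p_next[i] = 2.0f * p_cur[i] - p_prev[i] + v2dt2[i] * lap;
}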
48 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
The code is based on an extension of the pseudo-spectral method called the pseudo-analytic model (see the formula below):
- Modifies the Fourier transform of the Laplacian operator, correcting the propagation errors of the finite-difference scheme
- Obtains nearly non-dispersive wave propagation
Original source code in Fortran 90, using OpenMP and MKL FFTs; one shot per node.
Obvious hot spots: FFTs, Laplacian.
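For reference, the plain pseudo-spectral method evaluates the Laplacian through forward and inverse Fourier transforms,

\nabla^2 P = \mathcal{F}^{-1}\!\left[-\left(k_x^2 + k_y^2 + k_z^2\right)\mathcal{F}[P]\right]

and the pseudo-analytic model replaces the -|k|^2 multiplier with a compensated operator; the exact correction is specific to the method and is not reproduced here.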
49 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
[Profile chart: time (sec) per kernel on the CPU baseline.]

85% of the time is spent in one subroutine, in which 6 kernels are identified:
kernel 1: 31%; kernel 2: 13%; kernel 3: 2%; kernel 4: 1%; kernel 5: 18%; FFTs: 35%
50 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Test platform: bullx B505 server with 2 Intel Westmere 4-core processors @ 2.67 GHz, 24 GB DDR3 @ 1333 MHz, and 2 NVIDIA M2090 GPUs.

Software and tools: NVIDIA CUDA 4.1; Intel compilers version 12 and Intel MPI 4; PGI compilers 11.

Porting plan:
- Use the CUFFT library and write call wrappers (see the sketch below)
- Write a CUDA kernel for each of the 5 subroutine kernels (to avoid transfers)
- Compare CUDA C, Fortran CUDA and HMPP
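A minimal sketch of such CUFFT call wrappers in C (single-precision, complex-to-complex, in-place 3D transforms; the wrapper names and the single cached plan are illustrative assumptions):

#include <cufft.h>

/* Thin wrappers replacing the MKL FFT calls: create one 3D plan, then run
   forward/inverse single-precision complex transforms in place on device
   memory. cuFFT orders dimensions slowest-varying first, and its inverse
   transform is unnormalized (scale by 1/(nx*ny*nz) afterwards). */
static cufftHandle plan3d;

int fft3d_init(int nz, int ny, int nx)
{
    return cufftPlan3d(&plan3d, nz, ny, nx, CUFFT_C2C) == CUFFT_SUCCESS ? 0 : -1;
}

int fft3d_exec(cufftComplex *d_data, int forward)
{
    return cufftExecC2C(plan3d, d_data, d_data,
                        forward ? CUFFT_FORWARD : CUFFT_INVERSE)
           == CUFFT_SUCCESS ? 0 : -1;
}

void fft3d_finalize(void)
{
    cufftDestroy(plan3d);
}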
51 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Simplified Fortran porting:
- No need for Fortran-to-C CUDA interfaces
- No problem with the unit-stride memory accesses in multidimensional arrays
- API simplified and identical to Fortran 90:

!Define variables on the CPU
real, pinned, allocatable, dimension(:,:,:) :: A_host
!Define variables on the GPU
real, device, allocatable, dimension(:,:,:) :: A_device
!Allocate both in a single call
allocate( A_host(nx,ny,nz), A_device(nx,ny,nz) )
!Transfer data between CPU and GPU
A_device = A_host

Same performance between CUDA C and Fortran CUDA.
52 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
Before HMPP 3 there was no way to call external CUDA libraries. Now a native call:

#pragma hmpp <cublas> group, target=cuda
#pragma hmpp <cublas> acquire
#pragma hmppalt cublas call, name="cublasSgemm"
sgemm(trans,trans,n,n,n,alpha,A,n,B,n,beta,C,n)

is replaced at compilation with a declared proxy:

#pragma hmpp <cublas> group, target=cuda
#pragma hmpp <cublas> acquire
#pragma hmppalt cublas declare, name="cublasSgemm", extend(error,...), fallback=true
void MycublasSgemm(int* proxyError, char transa, char transb, int m, int n, int k,
                   float alpha, const float *A, int lda, const float *B, int ldb,
                   float beta, float *C, int ldc)
{
    deviceDataA = hmpprt_data_get_device_address(A);
    (...)
    cublasSgemm(transa, transb, m, n, k, alpha, deviceDataA, lda, deviceDataB, ldb, beta, deviceDataC, ldc);
}
53 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
4x speedup between 1 M2090 GPU and 8 Xeon cores; data transfers are reduced to 1 second.

[Chart: time (sec) per kernel (kernels 1-5 and FFTs), 8 Xeon cores vs. 1 M2090 GPU.]
54 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
RTM is an embarrassingly parallel application (over shots).

On standard CPU servers: compute 1 shot per node; take advantage of all the CPU cores for full MKL FFT performance; overall performance will increase with new processor generations.

On GPU servers: 4x speedup for one shot using one GPU; compute 1 shot per GPU available on the server (see the device-binding sketch below); either halve the number of servers for the same speed, or keep the same number of servers and double the speed.

Caveat: the benchmark used a small data set; real problem sizes may be too big to fit in today's GPU memory.
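One common way to realize one shot per GPU is to bind each shot-processing MPI rank to a device before the first CUDA call. A minimal sketch, assuming consecutive ranks are placed on the same node (this mapping is an assumption, not necessarily the scheme used here):

#include <mpi.h>
#include <cuda_runtime.h>

/* Map each MPI rank (one rank = one shot) to one of the node's GPUs. */
int bind_rank_to_gpu(void)
{
    int rank, ndev;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);
    /* assumes ranks 0..ndev-1 share a node, then the pattern repeats */
    return cudaSetDevice(rank % ndev) == cudaSuccess ? 0 : -1;
}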
55 ©Bull, 2012 GPU Tech Conference 2012 – San Jose
BULL has built its expertise on real customer requests: training; POCs in oil & gas, finance, life science and material science; advice for cluster architecture definition; pro-activity.

BULL's expertise is recognized: successful POCs with significant speedups and cost reductions; acknowledgments in scientific publications; help with code migration and optimization.
56 ©Bull, 2012 GPU Tech Conference 2012 – San Jose