Seismic Imaging on NVIDIA GPUs€¦ · Seismic Imaging Summary • Seismic imaging CUDA codes – All 3 main codes written & verified •Two in production •One in production testing/optimization

Seismic ImagingSeismic Imagingon NVIDIA on NVIDIA GPUsGPUs

Scott MortonScott MortonHess CorporationHess Corporation

2

Seismic ImagingSeismic ImagingOutlineOutline

• Seismic data & imaging

• NVIDIA GPUs + CUDA– Why?– How?

• Three imaging methods– Algorithm– Challenges– Performance

3

Seismic ImagingSeismic ImagingDataData

oilgas

H2O

4


Receiver

Time (m

s)

5


6

Seismic ImagingSeismic ImagingAn iterative processAn iterative process

Construct initialearth model

Perform imaging

Image & Modelconsistent?

Done

No

Yes

Update earth model

Computation scales as size of data and

image/model

7

Seismic ImagingSeismic ImagingWhy GPUs?Why GPUs?

• Price-to-performance ratio improvement– Want 10X to change platforms

• Payback must more than cover effort & risk• Got 10X ten years ago in switching from

supercomputers to PC clusters

8

Seismic ImagingSeismic ImagingWhy GPUs?Why GPUs?

• Price-to-performance ratio improvement– Want 10X to change platforms

• Payback must more than cover effort & risk• Got 10X ten years ago in switching from

supercomputers to PC clusters

– Several years ago there were indicators we can get 10X or more on GPUs• Peak performance• Benchmarks• Simple prototype kernels

9

Seismic ImagingSeismic ImagingWhy CUDA & NVIDIA GPUs?Why CUDA & NVIDIA GPUs?

• Ease of programming– Must be able to port, maintain & modify production

codes (relatively) easily• These costs must be included

– Have tried Cg, Brook and Peakstream• All lacking in some aspect

– CUDA programming model straightforward• SIMD-like thread-based parallelism• In 1.5 days

– Took “intro to CUDA” class– Wrote a working 2-D seismic modeling code

• Programming memory hierarchy for optimization is the biggest challenge

10

Seismic ImagingSeismic ImagingHow to port a code?How to port a code?

• Design GPU algorithm– Optimize for memory hierarchy– Keep main data structures in GPU memory

• Create prototype GPU kernel– Include main computational characteristics– Test performance against CPU kernel– Iteratively refine prototype

• Port full kernel & compare with CPU kernel– Verify numerical results– Compare performance results

• Incorporate into production code & system

11

Seismic ImagingSeismic ImagingImaging methodsImaging methods

• Kirchhoff imaging– High-frequency propagation– Ray or eikonal travel-times

• “Wave-equation” imaging– One-way propagation: z ~ t– Frequency-domain method– ADI (alternating direction

implicit) finite difference

• “Reverse-time” imaging– Two-way propagation– Time-domain– Explicit finite-difference

Increasing computational cost

12

Kirchhoff ImagingKirchhoff ImagingPhysical algorithmPhysical algorithm

• Based on the Kirchhoff integral– Pre-compute coarse travel-times for propagation

from surface locations to image points:

– 4-D surface integral through a 5-D data set

– Computational complexity:• NI ~ 109 is the number of output image points• ND~ 108 is the number of input data traces• f ~ 10 is the number of cycles/point/trace• f NI ND ~ 1018 cycles ~ 10 CPU-years

( )),(T),(T,,D)(I 22 rxxsrsrsx rrrrrrr+== ∫∫ tdd

),(T xs rr

13

Kirchhoff ImagingKirchhoff ImagingComputational kernelComputational kernel

z

x

y

t = TS + TR Add toImage

Image traceData trace

Image point x

Source s

Receiver r

TS

TR

( )∑ +==rs

rxxsrsxrr

rrrrrrr

,),(T),(T,,D)(I t

1 migration contribution

14

Kirchhoff ImagingKirchhoff ImagingCUDA kernelCUDA kernel

GPU

Data traces

In texture

Get cached

15

0 – Initial Kernel1 – Used Texture Memory2 – Used Shared Memory3 – Global Memory Coalescing4 – Decreased Data Trace Shared

Memory Use5 – Optimized Use of Shared

Memory6 – Consolidated “if” Statements,

Eliminated or Substituted Some Math Operations

7 – Removed an “if” and “for”8 – Used Texture Memory for Data-

Trace Fetch



Memory6 – Consolidated “if” Statements,

Eliminated or Substituted Some Math Operations



Memory

0 – Initial Kernel1 – Used Texture Memory2 – Used Shared Memory3 – Global Memory Coalescing

Performance

0.00

1.00

2.00

3.00

4.00

5.00

6.00

7.00

8.00

9.00

10.00

0 1 2 3 4 5 6 7 8 9

Code Version

Bill

ions

of M

igra

tion

Con

trib

utio

ns p

er S

econ

d

GPUCPU

Kirchhoff ImagingKirchhoff ImagingKernel optimizationKernel optimization

16

GPU-to-CPU Performance Ratio

0

20

40

60

80

100

2 4 6 8 10 12 14 16 18

image points per travel-time cell in x or y

GPU

Spe

ed-u

p

CUDA 2D Tex (nix=4)CUDA Linear Tex (nix=4)CUDA 2D Tex (niy=4)CUDA Linear Tex (niy=4)CUDA 2D Tex (nix=4) G2CUDA 2D Tex (nix=4) G2 PINCUDA Linear Tex (nix=4) G2CUDA Linear Tex (nix=4) G2 PINCUDA 2D Tex (niy=4) G2CUDA 2D Tex (niy=4) G2 PINCUDA Linear Tex (niy=4) G2CUDA Linear Tex (niy=4) G2 PIN

Kirchhoff ImagingKirchhoff ImagingKernel performanceKernel performance

G80

GT200

Typical parameter range

17

Kirchhoff ImagingKirchhoff ImagingProduction statusProduction status

• GPU kernel incorporated into production code– Large kernel speed-ups results in “CPU overhead” for task

setup dominating GPU production runs

• Further optimizations– create GPU kernels for most “overhead” components– optimized left-over CPU code (which helps CPU version also)

Time (hr) Set-up Kernel Total Speed-up

Original CPU code 5 20 25

Main GPU kernel 5 0.5 5.5 5

Further optimizations

0.5 0.5 1 25

18

PtP

V2

2

2

2)(1

∇=∂∂

xr

• Based on scalar wave equation

• Frequency-domain• Preferred direction of

propagation: z ~ t

– Evolution in depth

xy

z

““WaveWave--equationequation”” ImagingImagingOneOne--way propagationway propagation

Pyx

VV

izP

⎟⎟⎠

⎞⎜⎜⎝

⎛∂∂

+∂∂

+±

=∂∂

2

2

2

2

2

2 )(1)( ωω xx

r

r

19

xy

z


PtP

V2

2

2

2)(1

∇=∂∂

xr

Pyx

VV

izP

⎟⎟⎠

⎞⎜⎜⎝

⎛∂∂

+∂∂

+±

=∂∂

2

2

2

2

2

2 )(1)( ωω xx

r

r



propagation: z ~ t


20

xy

z


PtP

V2

2

2

2)(1

∇=∂∂

xr

Pyx

VV

izP

⎟⎟⎠

⎞⎜⎜⎝

⎛∂∂

+∂∂

+±

=∂∂

2

2

2

2

2

2 )(1)( ωω xx

r

r



propagation: z ~ t


21

xy

z


PtP

V2

2

2

2)(1

∇=∂∂

xr

Pyx

VV

izP

⎟⎟⎠

⎞⎜⎜⎝

⎛∂∂

+∂∂

+±

=∂∂

2

2

2

2

2

2 )(1)( ωω xx

r

r



propagation: z ~ t


22

xy

z


PtP

V2

2

2

2)(1

∇=∂∂

xr

Pyx

VV

izP

⎟⎟⎠

⎞⎜⎜⎝

⎛∂∂

+∂∂

+±

=∂∂

2

2

2

2

2

2 )(1)( ωω xx

r

r



propagation: z ~ t


23

xy

z


PtP

V2

2

2

2)(1

∇=∂∂

xr

Pyx

VV

izP

⎟⎟⎠

⎞⎜⎜⎝

⎛∂∂

+∂∂

+±

=∂∂

2

2

2

2

2

2 )(1)( ωω xx

r

r



propagation: z ~ t


24


• Evolution eqn uses– Continued fractions– Operator splitting– ADI finite difference

• Each depth step requires applying four operators– Along x– Along y– Along x + y– Along x – y

x

y

25




x

y

26




x

y

27




x

y

28




x

y

29

P z+Δz P z=

x

z

z

z + Δz

““WaveWave--equationequation”” ImagingImagingImplicit complex triImplicit complex tri--diagonal linear systemsdiagonal linear systems

( )( ) z

yxxzyx

zyxx

zzyxx

zzyx

zzyxx

PAPAPA

APPAAP

,*

,*

,*

,,,

21

21

Δ+Δ−

Δ+Δ+

Δ+Δ+Δ−

+−+

=+−+

30

P z+Δz P z=

x

z

z

z + Δz

““WaveWave--equationequation”” ImagingImagingImplicit complex triImplicit complex tri--diagonal linear systemsdiagonal linear systems

( )( ) z

yxxzyx

zyxx

zzyxx

zzyx

zzyxx

PAPAPA

APPAAP

,*

,*

,*

,,,

21

21

Δ+Δ−

Δ+Δ+

Δ+Δ+Δ−

+−+

=+−+

The evaluation and solution of these complex tri-diagonal systemsdominates the computational cost of our wave-equation imaging code.

31

““WaveWave--equationequation”” ImagingImagingLow level parallelismLow level parallelism

• Common work between shot-records– Calculating the coefficients of the matrices

• Dependent on frequency & local velocity

– Part of the solving of the tri-diagonal system

• Parallelize over shot-records in the kernel

zz+Δz

= P1 PnP2P1 PnP2

32

““WaveWave--equationequation”” ImagingImagingCUDA kernelsCUDA kernels

• Separate kernels for each operator– x, y, x+y and x-y

33zz+Δz

= P1 PnP2P1 PnP2


• Separate kernels for each operator– x, y, x+y and x-y

34


x

y

35


x

y

36

““WaveWave--equationequation”” ImagingImagingPerformancePerformance

• Production CPU kernel– Performance: 15 – 50 Mpoints/sec

• Prototype CUDA kernel– Single tri-diagonal system– Constant coefficients– Performance: 700 Mpoints/sec

• Production CUDA kernels– Single kernel handles x, y, x+y & x-y operators– Several kernels calculate coefficients– Performance: 300-500 Mpoints/sec

37

• Based on the scalar wave equation

• Explicit finite-difference scheme– 2nd order in time– Variable order in space: 6th - 16th

• Most of the computation

– Bandwidth• Read P(x,t), P(x,t-dt) & V(x) (plus halo!)• Write P(x,t+dt)• Max performance is 4+ Gpt/s

““ReverseReverse--timetime”” ImagingImagingTwoTwo--way propagationway propagation

PtP

V2

2

2

2)(1

∇=∂∂

x

8th

38

““ReverseReverse--timetime”” ImagingImagingCUDA algorithmCUDA algorithm

• Paulius’s algorithm

– Each thread specifies an (x,y) point, marching in z.

– Each thread block handles a 2-D rectangle.

– Each 2-D slice + halo is read into shared memory.

– Threads in a block re-use these values. Required Halo

Computed Laplacian

threadIdx.x

thre

adId

x.y

39

““ReverseReverse--timetime”” ImagingImagingCUDA algorithmCUDA algorithm

• Paulius’s algorithm

– Threads in a block march in z, storing values “in-front” & “behind” in registers.

– Number of registers limits the block size and the core-to-halo ratio.

– GPU performance is predictable: 2.5 – 3 Gpt/s. Required Halo

Computed Laplacian

Thread marching direction

40

““ReverseReverse--timetime”” ImagingImagingKernel performanceKernel performance

8th order, RTM

0

5

10

15

20

25

30

128x

128x20

025

6x96x

20096

x256x

200

256x

256x20

0

480x

480x20

0

512x

512x51

2

640x

640x20

064

0x640

x8

960x

704x20

0

704x

960x20

0

Grid size

Spee

dup

const, 16x16shared, 16x16const, 16x32shared, 16x32

41

““ReverseReverse--timetime”” ImagingImagingInterInter--GPU communicationGPU communication

• High frequency requires– Dense sampling– Large memory– Multiple GPUs– Halo exchange– Inter-GPU communication

• Device Host– Use pinned memory– PCIe bus predictably yields ~ 5 GB/s– ~ 10 % of kernel time– Easily hidden

42

““ReverseReverse--timetime”” ImagingImagingInterInter--GPU communicationGPU communication

• CPU process process– Currently using MPI

• From legacy code

– Performance variable– Comparable to kernel time– Solutions

• OpenMP?• single controlling process?

• Node node– Currently Gigabit Ethernet– Solution? Infiniband? 10-GigE?

43

Seismic ImagingSeismic ImagingSummarySummary

• Seismic imaging CUDA codes– All 3 main codes written & verified

• Two in production• One in production testing/optimization

– All done with about two-man years of effort– Kernel speed-ups vary from 10 – 80 X on GT200– 456-GPU cluster out-performs 3000-CPU cluster

• GPU cluster– Jan 2008: bought 32-nodes (128 G80 GPUs)– Dec 2008: upgraded & expanded to 456 GT200s– Nov 2009: expanding to 1200 GT200s

44

Seismic ImagingSeismic ImagingSummarySummary

45

Seismic ImagingSeismic ImagingAcknowledgementsAcknowledgements

• Code co-authors/collaborators– Thomas Cullison (Hess & Colorado School of Mines)– Paulius Micikevicius (NVIDIA)– Igor Terentyev (Hess & Rice University)

• Hess GPU systems– Jeff Davis, Mac McCalla

• NVIDIA support & management– Ty Mckercher, Paul Holzhauer, Jeff Saunders, Philip

Nenon

• Hess management– Jacques Leveille, Vic Forsyth, Jim Sherman

Documents

Seismic Imaging on NVIDIA GPUs€¦ · Seismic Imaging Summary • Seismic imaging CUDA codes – All 3 main codes written & verified •Two in production •One in production testing/optimization