Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Seismic ImagingSeismic Imagingon NVIDIA on NVIDIA GPUsGPUs
Scott MortonScott MortonHess CorporationHess Corporation
2
Seismic ImagingSeismic ImagingOutlineOutline
• Seismic data & imaging
• NVIDIA GPUs + CUDA– Why?– How?
• Three imaging methods– Algorithm– Challenges– Performance
3
Seismic ImagingSeismic ImagingDataData
oilgas
H2O
4
Seismic ImagingSeismic ImagingDataData
Receiver
Time (m
s)
5
Seismic ImagingSeismic ImagingDataData
6
Seismic ImagingSeismic ImagingAn iterative processAn iterative process
Construct initialearth model
Perform imaging
Image & Modelconsistent?
Done
No
Yes
Update earth model
Computation scales as size of data and
image/model
7
Seismic ImagingSeismic ImagingWhy GPUs?Why GPUs?
• Price-to-performance ratio improvement– Want 10X to change platforms
• Payback must more than cover effort & risk• Got 10X ten years ago in switching from
supercomputers to PC clusters
8
Seismic ImagingSeismic ImagingWhy GPUs?Why GPUs?
• Price-to-performance ratio improvement– Want 10X to change platforms
• Payback must more than cover effort & risk• Got 10X ten years ago in switching from
supercomputers to PC clusters
– Several years ago there were indicators we can get 10X or more on GPUs• Peak performance• Benchmarks• Simple prototype kernels
9
Seismic ImagingSeismic ImagingWhy CUDA & NVIDIA GPUs?Why CUDA & NVIDIA GPUs?
• Ease of programming– Must be able to port, maintain & modify production
codes (relatively) easily• These costs must be included
– Have tried Cg, Brook and Peakstream• All lacking in some aspect
– CUDA programming model straightforward• SIMD-like thread-based parallelism• In 1.5 days
– Took “intro to CUDA” class– Wrote a working 2-D seismic modeling code
• Programming memory hierarchy for optimization is the biggest challenge
10
Seismic ImagingSeismic ImagingHow to port a code?How to port a code?
• Design GPU algorithm– Optimize for memory hierarchy– Keep main data structures in GPU memory
• Create prototype GPU kernel– Include main computational characteristics– Test performance against CPU kernel– Iteratively refine prototype
• Port full kernel & compare with CPU kernel– Verify numerical results– Compare performance results
• Incorporate into production code & system
11
Seismic ImagingSeismic ImagingImaging methodsImaging methods
• Kirchhoff imaging– High-frequency propagation– Ray or eikonal travel-times
• “Wave-equation” imaging– One-way propagation: z ~ t– Frequency-domain method– ADI (alternating direction
implicit) finite difference
• “Reverse-time” imaging– Two-way propagation– Time-domain– Explicit finite-difference
Increasing computational cost
12
Kirchhoff ImagingKirchhoff ImagingPhysical algorithmPhysical algorithm
• Based on the Kirchhoff integral– Pre-compute coarse travel-times for propagation
from surface locations to image points:
– 4-D surface integral through a 5-D data set
– Computational complexity:• NI ~ 109 is the number of output image points• ND~ 108 is the number of input data traces• f ~ 10 is the number of cycles/point/trace• f NI ND ~ 1018 cycles ~ 10 CPU-years
( )),(T),(T,,D)(I 22 rxxsrsrsx rrrrrrr+== ∫∫ tdd
),(T xs rr
13
Kirchhoff ImagingKirchhoff ImagingComputational kernelComputational kernel
z
x
y
t = TS + TR Add toImage
Image traceData trace
Image point x
Source s
Receiver r
TS
TR
( )∑ +==rs
rxxsrsxrr
rrrrrrr
,),(T),(T,,D)(I t
1 migration contribution
14
Kirchhoff ImagingKirchhoff ImagingCUDA kernelCUDA kernel
GPU
Data traces
In texture
Get cached
15
0 – Initial Kernel1 – Used Texture Memory2 – Used Shared Memory3 – Global Memory Coalescing4 – Decreased Data Trace Shared
Memory Use5 – Optimized Use of Shared
Memory6 – Consolidated “if” Statements,
Eliminated or Substituted Some Math Operations
7 – Removed an “if” and “for”8 – Used Texture Memory for Data-
Trace Fetch
0 – Initial Kernel1 – Used Texture Memory2 – Used Shared Memory3 – Global Memory Coalescing4 – Decreased Data Trace Shared
Memory Use5 – Optimized Use of Shared
Memory6 – Consolidated “if” Statements,
Eliminated or Substituted Some Math Operations
0 – Initial Kernel1 – Used Texture Memory2 – Used Shared Memory3 – Global Memory Coalescing4 – Decreased Data Trace Shared
Memory Use5 – Optimized Use of Shared
Memory
0 – Initial Kernel1 – Used Texture Memory2 – Used Shared Memory3 – Global Memory Coalescing
Performance
0.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00
8.00
9.00
10.00
0 1 2 3 4 5 6 7 8 9
Code Version
Bill
ions
of M
igra
tion
Con
trib
utio
ns p
er S
econ
d
GPUCPU
Kirchhoff ImagingKirchhoff ImagingKernel optimizationKernel optimization
16
GPU-to-CPU Performance Ratio
0
20
40
60
80
100
2 4 6 8 10 12 14 16 18
image points per travel-time cell in x or y
GPU
Spe
ed-u
p
CUDA 2D Tex (nix=4)CUDA Linear Tex (nix=4)CUDA 2D Tex (niy=4)CUDA Linear Tex (niy=4)CUDA 2D Tex (nix=4) G2CUDA 2D Tex (nix=4) G2 PINCUDA Linear Tex (nix=4) G2CUDA Linear Tex (nix=4) G2 PINCUDA 2D Tex (niy=4) G2CUDA 2D Tex (niy=4) G2 PINCUDA Linear Tex (niy=4) G2CUDA Linear Tex (niy=4) G2 PIN
Kirchhoff ImagingKirchhoff ImagingKernel performanceKernel performance
G80
GT200
Typical parameter range
17
Kirchhoff ImagingKirchhoff ImagingProduction statusProduction status
• GPU kernel incorporated into production code– Large kernel speed-ups results in “CPU overhead” for task
setup dominating GPU production runs
• Further optimizations– create GPU kernels for most “overhead” components– optimized left-over CPU code (which helps CPU version also)
Time (hr) Set-up Kernel Total Speed-up
Original CPU code 5 20 25
Main GPU kernel 5 0.5 5.5 5
Further optimizations
0.5 0.5 1 25
18
PtP
V2
2
2
2)(1
∇=∂∂
xr
• Based on scalar wave equation
• Frequency-domain• Preferred direction of
propagation: z ~ t
– Evolution in depth
xy
z
““WaveWave--equationequation”” ImagingImagingOneOne--way propagationway propagation
Pyx
VV
izP
⎟⎟⎠
⎞⎜⎜⎝
⎛∂∂
+∂∂
+±
=∂∂
2
2
2
2
2
2 )(1)( ωω xx
r
r
19
xy
z
““WaveWave--equationequation”” ImagingImagingOneOne--way propagationway propagation
PtP
V2
2
2
2)(1
∇=∂∂
xr
Pyx
VV
izP
⎟⎟⎠
⎞⎜⎜⎝
⎛∂∂
+∂∂
+±
=∂∂
2
2
2
2
2
2 )(1)( ωω xx
r
r
• Based on scalar wave equation
• Frequency-domain• Preferred direction of
propagation: z ~ t
– Evolution in depth
20
xy
z
““WaveWave--equationequation”” ImagingImagingOneOne--way propagationway propagation
PtP
V2
2
2
2)(1
∇=∂∂
xr
Pyx
VV
izP
⎟⎟⎠
⎞⎜⎜⎝
⎛∂∂
+∂∂
+±
=∂∂
2
2
2
2
2
2 )(1)( ωω xx
r
r
• Based on scalar wave equation
• Frequency-domain• Preferred direction of
propagation: z ~ t
– Evolution in depth
21
xy
z
““WaveWave--equationequation”” ImagingImagingOneOne--way propagationway propagation
PtP
V2
2
2
2)(1
∇=∂∂
xr
Pyx
VV
izP
⎟⎟⎠
⎞⎜⎜⎝
⎛∂∂
+∂∂
+±
=∂∂
2
2
2
2
2
2 )(1)( ωω xx
r
r
• Based on scalar wave equation
• Frequency-domain• Preferred direction of
propagation: z ~ t
– Evolution in depth
22
xy
z
““WaveWave--equationequation”” ImagingImagingOneOne--way propagationway propagation
PtP
V2
2
2
2)(1
∇=∂∂
xr
Pyx
VV
izP
⎟⎟⎠
⎞⎜⎜⎝
⎛∂∂
+∂∂
+±
=∂∂
2
2
2
2
2
2 )(1)( ωω xx
r
r
• Based on scalar wave equation
• Frequency-domain• Preferred direction of
propagation: z ~ t
– Evolution in depth
23
xy
z
““WaveWave--equationequation”” ImagingImagingOneOne--way propagationway propagation
PtP
V2
2
2
2)(1
∇=∂∂
xr
Pyx
VV
izP
⎟⎟⎠
⎞⎜⎜⎝
⎛∂∂
+∂∂
+±
=∂∂
2
2
2
2
2
2 )(1)( ωω xx
r
r
• Based on scalar wave equation
• Frequency-domain• Preferred direction of
propagation: z ~ t
– Evolution in depth
24
““WaveWave--equationequation”” ImagingImagingOneOne--way propagationway propagation
• Evolution eqn uses– Continued fractions– Operator splitting– ADI finite difference
• Each depth step requires applying four operators– Along x– Along y– Along x + y– Along x – y
x
y
25
““WaveWave--equationequation”” ImagingImagingOneOne--way propagationway propagation
• Evolution eqn uses– Continued fractions– Operator splitting– ADI finite difference
• Each depth step requires applying four operators– Along x– Along y– Along x + y– Along x – y
x
y
26
““WaveWave--equationequation”” ImagingImagingOneOne--way propagationway propagation
• Evolution eqn uses– Continued fractions– Operator splitting– ADI finite difference
• Each depth step requires applying four operators– Along x– Along y– Along x + y– Along x – y
x
y
27
““WaveWave--equationequation”” ImagingImagingOneOne--way propagationway propagation
• Evolution eqn uses– Continued fractions– Operator splitting– ADI finite difference
• Each depth step requires applying four operators– Along x– Along y– Along x + y– Along x – y
x
y
28
““WaveWave--equationequation”” ImagingImagingOneOne--way propagationway propagation
• Evolution eqn uses– Continued fractions– Operator splitting– ADI finite difference
• Each depth step requires applying four operators– Along x– Along y– Along x + y– Along x – y
x
y
29
P z+Δz P z=
x
z
z
z + Δz
““WaveWave--equationequation”” ImagingImagingImplicit complex triImplicit complex tri--diagonal linear systemsdiagonal linear systems
( )( ) z
yxxzyx
zyxx
zzyxx
zzyx
zzyxx
PAPAPA
APPAAP
,*
,*
,*
,,,
21
21
Δ+Δ−
Δ+Δ+
Δ+Δ+Δ−
+−+
=+−+
30
P z+Δz P z=
x
z
z
z + Δz
““WaveWave--equationequation”” ImagingImagingImplicit complex triImplicit complex tri--diagonal linear systemsdiagonal linear systems
( )( ) z
yxxzyx
zyxx
zzyxx
zzyx
zzyxx
PAPAPA
APPAAP
,*
,*
,*
,,,
21
21
Δ+Δ−
Δ+Δ+
Δ+Δ+Δ−
+−+
=+−+
The evaluation and solution of these complex tri-diagonal systemsdominates the computational cost of our wave-equation imaging code.
31
““WaveWave--equationequation”” ImagingImagingLow level parallelismLow level parallelism
• Common work between shot-records– Calculating the coefficients of the matrices
• Dependent on frequency & local velocity
– Part of the solving of the tri-diagonal system
• Parallelize over shot-records in the kernel
zz+Δz
= P1 PnP2P1 PnP2
32
““WaveWave--equationequation”” ImagingImagingCUDA kernelsCUDA kernels
• Separate kernels for each operator– x, y, x+y and x-y
33zz+Δz
= P1 PnP2P1 PnP2
““WaveWave--equationequation”” ImagingImagingCUDA kernelsCUDA kernels
• Separate kernels for each operator– x, y, x+y and x-y
34
““WaveWave--equationequation”” ImagingImagingCUDA kernelsCUDA kernels
x
y
35
““WaveWave--equationequation”” ImagingImagingCUDA kernelsCUDA kernels
x
y
36
““WaveWave--equationequation”” ImagingImagingPerformancePerformance
• Production CPU kernel– Performance: 15 – 50 Mpoints/sec
• Prototype CUDA kernel– Single tri-diagonal system– Constant coefficients– Performance: 700 Mpoints/sec
• Production CUDA kernels– Single kernel handles x, y, x+y & x-y operators– Several kernels calculate coefficients– Performance: 300-500 Mpoints/sec
37
• Based on the scalar wave equation
• Explicit finite-difference scheme– 2nd order in time– Variable order in space: 6th - 16th
• Most of the computation
– Bandwidth• Read P(x,t), P(x,t-dt) & V(x) (plus halo!)• Write P(x,t+dt)• Max performance is 4+ Gpt/s
““ReverseReverse--timetime”” ImagingImagingTwoTwo--way propagationway propagation
PtP
V2
2
2
2)(1
∇=∂∂
x
8th
38
““ReverseReverse--timetime”” ImagingImagingCUDA algorithmCUDA algorithm
• Paulius’s algorithm
– Each thread specifies an (x,y) point, marching in z.
– Each thread block handles a 2-D rectangle.
– Each 2-D slice + halo is read into shared memory.
– Threads in a block re-use these values. Required Halo
Computed Laplacian
threadIdx.x
thre
adId
x.y
39
““ReverseReverse--timetime”” ImagingImagingCUDA algorithmCUDA algorithm
• Paulius’s algorithm
– Threads in a block march in z, storing values “in-front” & “behind” in registers.
– Number of registers limits the block size and the core-to-halo ratio.
– GPU performance is predictable: 2.5 – 3 Gpt/s. Required Halo
Computed Laplacian
Thread marching direction
40
““ReverseReverse--timetime”” ImagingImagingKernel performanceKernel performance
8th order, RTM
0
5
10
15
20
25
30
128x
128x20
025
6x96x
20096
x256x
200
256x
256x20
0
480x
480x20
0
512x
512x51
2
640x
640x20
064
0x640
x8
960x
704x20
0
704x
960x20
0
Grid size
Spee
dup
const, 16x16shared, 16x16const, 16x32shared, 16x32
41
““ReverseReverse--timetime”” ImagingImagingInterInter--GPU communicationGPU communication
• High frequency requires– Dense sampling– Large memory– Multiple GPUs– Halo exchange– Inter-GPU communication
• Device Host– Use pinned memory– PCIe bus predictably yields ~ 5 GB/s– ~ 10 % of kernel time– Easily hidden
42
““ReverseReverse--timetime”” ImagingImagingInterInter--GPU communicationGPU communication
• CPU process process– Currently using MPI
• From legacy code
– Performance variable– Comparable to kernel time– Solutions
• OpenMP?• single controlling process?
• Node node– Currently Gigabit Ethernet– Solution? Infiniband? 10-GigE?
43
Seismic ImagingSeismic ImagingSummarySummary
• Seismic imaging CUDA codes– All 3 main codes written & verified
• Two in production• One in production testing/optimization
– All done with about two-man years of effort– Kernel speed-ups vary from 10 – 80 X on GT200– 456-GPU cluster out-performs 3000-CPU cluster
• GPU cluster– Jan 2008: bought 32-nodes (128 G80 GPUs)– Dec 2008: upgraded & expanded to 456 GT200s– Nov 2009: expanding to 1200 GT200s
44
Seismic ImagingSeismic ImagingSummarySummary
45
Seismic ImagingSeismic ImagingAcknowledgementsAcknowledgements
• Code co-authors/collaborators– Thomas Cullison (Hess & Colorado School of Mines)– Paulius Micikevicius (NVIDIA)– Igor Terentyev (Hess & Rice University)
• Hess GPU systems– Jeff Davis, Mac McCalla
• NVIDIA support & management– Ty Mckercher, Paul Holzhauer, Jeff Saunders, Philip
Nenon
• Hess management– Jacques Leveille, Vic Forsyth, Jim Sherman