Evacuate Now?
Faster-than-Real-Time Shallow Water Simulations on GPUs
NVIDIA GPU Technology Conference
San Jose, California, 2010
André R. Brodtkorb
Talk Outline
Introduction
Why Shallow Water Simulations?
The Shallow Water Equations
Numerical scheme
Our contribution
Simulator Implementation
Results including screen capture video
Live Demo on a standard Laptop
Summary
Learn how to simulate a half-hour dam break in 27 seconds
The Shallow Water Equations
First described by de Saint-Venant (1797-1886)
Gravity-induced fluid motion
2D free surface
Negligible vertical acceleration
Wave length much larger than depth
Conservation of mass and momentum
Not only for water:
Atmospheric flow
Avalanches
...
Water image from http://freephoto.com / Ian Britton
Target application areas
Floods: 2010 Pakistan (2,000+ casualties)
Tsunamis: 2004 Indian Ocean (230,000)
Storm surges: 2005 Hurricane Katrina (1,836)
Dam breaks: 1959 Malpasset (423)
Images from wikipedia.org
Mathematical Formulation
$$ \frac{\partial Q}{\partial t} + \frac{\partial F(Q)}{\partial x} + \frac{\partial G(Q)}{\partial y} = H_B + H_f $$

Q: vector of conserved variables
F, G: flux functions
H_B: bed slope source term
H_f: bed friction source term
The Shallow Water Equations
$$ Q = \begin{bmatrix} h \\ hu \\ hv \end{bmatrix}, \qquad F(Q) = \begin{bmatrix} hu \\ hu^2 + \tfrac{1}{2}gh^2 \\ huv \end{bmatrix}, \qquad G(Q) = \begin{bmatrix} hv \\ huv \\ hv^2 + \tfrac{1}{2}gh^2 \end{bmatrix} $$

Conserved variables: water depth (h), discharge along x (hu), and discharge along y (hv)
Explicit Numerical Schemes
Hyperbolic partial differential equation
Enables explicit schemes
Accurate modeling of discontinuities / shocks
High accuracy in smooth parts, without oscillations near discontinuities
Capable of representing dry states
Negative water depths ruin simulations
Images from wikipedia.org, James Kilfiger
Explicit Numerical Schemes
Additional wanted properties:
Second order accurate fluxes
Total variation diminishing
Well-balancedness
Scheme of choice: A. Kurganov and G. Petrova, "A Second-Order Well-Balanced Positivity Preserving Central-Upwind Scheme for the Saint-Venant System", Communications in Mathematical Sciences, 5 (2007), 133–160.
Kurganov-Petrova – Spatial discretization
Write in vector form
Impose a finite-volume grid
Rewrite in terms of w = h + B
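Following the Kurganov-Petrova formulation, the change of variables replaces water depth h with surface elevation w:

$$ w = h + B, \qquad Q = [\,w,\; hu,\; hv\,]^\top $$

This makes the lake-at-rest steady state (w constant, hu = hv = 0) exactly representable, which is the well-balancedness property listed earlier.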
Kurganov-Petrova – Finite Volume Grid
Q defined as cell averages
B defined as piecewise bilinear
F and G calculated across cell interfaces
Source terms, H, calculated as cell averages
Kurganov-Petrova – Flux calculations
[Figure: from continuous to discrete variables: slope reconstruction, integration points, flux calculation, and the dry-states fix]
Kurganov-Petrova – Temporal discretization
Gather all explicit terms
One ordinary differential equation in time per cell
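In symbols, with R(Q) denoting the gathered flux and source terms (notation assumed for presentation):

$$ \frac{dQ_{ij}}{dt} = R(Q)_{ij} $$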
Kurganov-Petrova – Temporal discretization
Discretize in time using second-order Runge-Kutta
Total variation diminishing
Semi-implicit friction source term
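Written out, the standard TVD second-order Runge-Kutta (Heun) step, with R(Q) as above, reads:

$$ Q^{*} = Q^n + \Delta t\, R(Q^n), \qquad Q^{n+1} = \tfrac{1}{2} Q^n + \tfrac{1}{2}\left( Q^{*} + \Delta t\, R(Q^{*}) \right) $$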
Kurganov-Petrova – CFL condition
Explicit scheme, time step restriction:
Time step size restricted by a Courant-Friedrichs-Lewy (CFL) condition
The numerical domain of dependence must include the domain of dependence of the equation
Each wave is allowed to travel at most one quarter grid cell per time step (see the formula below)
[Figure: space-time diagram of stable vs. unstable time steps relative to the mathematical propagation speed]
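For this scheme, the quarter-cell restriction can be written as (with gravity wave speed $\sqrt{gh}$):

$$ \Delta t \le \frac{1}{4} \min\!\left( \frac{\Delta x}{\max_{ij} |u \pm \sqrt{gh}|},\; \frac{\Delta y}{\max_{ij} |v \pm \sqrt{gh}|} \right) $$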
Kurganov-Petrova – Simulation cycle
1. Calculate fluxes
2. Calculate Δt
3. Half step
4. Calculate fluxes
5. Evolve in time
6. Apply boundary conditions
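A minimal host-side sketch of one cycle (function names are illustrative, not the simulator's actual API):

    // One second-order simulation step, mirroring the six stages above.
    void simulationStep(State& Q, State& Qstar) {
        computeFluxes(Q);          // 1. fluxes (also records per-block wave speeds)
        float dt = computeDt();    // 2. global CFL reduction gives the timestep
        halfStep(Q, Qstar, dt);    // 3. predictor: Q* from Q
        computeFluxes(Qstar);      // 4. fluxes from the predicted state
        evolve(Q, Qstar, dt);      // 5. corrector: full step back into Q
        applyBoundaries(Q);        // 6. refresh the global ghost cells
    }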
Implementation – GPU code
Four CUDA kernels (share of runtime):
Flux: 87%
Timestep size (CFL condition): <1%
Forward Euler step: 12%
Set boundary conditions: <1%
Flux kernel – Domain decomposition
A nine-point nonlinear stencil
Composed of simpler stencils
Heavy use of shared memory
Computationally demanding
Traditional block decomposition
Overlapping ghost cells (a.k.a. apron)
Global ghost cells for boundary conditions
Domain padding
Flux kernel – Block size
Block size is 16×14, chosen based on:
Warp size: multiple of 32
Shared memory use: 16 shared-memory buffers use ~16 KB
Occupancy
Global memory access
Fermi cache configuration: 48 KB shared memory, 16 KB L1 cache
Three resident blocks per SM
Trades cache for occupancy
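On Fermi, this preference is set per kernel; a one-line sketch (fluxKernel is an assumed kernel name):

    // Prefer 48 KB shared memory / 16 KB L1 so that three ~16 KB blocks
    // can be resident per SM: trades cache for occupancy.
    cudaFuncSetCacheConfig(fluxKernel, cudaFuncCachePreferShared);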
Flux kernel – computations
Calculations per cell:
Flux across north and east interfaces
Bed slope source term for the cell
Collective stencil operations:
n threads and n+1 interfaces: one warp performs the extra calculations (see the sketch below)
Alternative is one thread per stencil operation
Many idle threads, and extra register pressure
[Figure: stencil pipeline: input, slopes, integration points, flux]
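A minimal sketch of the collective pattern (buffer and helper names are assumptions, not the talk's code):

    // blockDim.x threads cooperatively fill blockDim.x + 1 interface values:
    // a few threads simply loop twice instead of keeping an extra, mostly
    // idle thread column around (which would also raise register pressure).
    for (int i = threadIdx.x; i < blockDim.x + 1; i += blockDim.x) {
        F_shared[i] = interfaceFlux(i);   // hypothetical per-interface flux
    }
    __syncthreads();                      // all interface values are ready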
Flux kernel – flux limiter
Limits the fluxes to obtain a non-oscillatory solution
Generalized minmod limiter:
Least steep slope, or
Zero if signs differ
Creates divergent code paths
Use branchless implementation (Hagen et al., 2007)
Requires a special sign function (see below)
Much faster than the naïve approach
(2007) T. Hagen, M. Henriksen, J. Hjelmervik, and K.-A. Lie. How to solve systems of conservation laws numerically using the graphics processor as a high-performance computational engine. In Geometrical Modeling, Numerical Simulation, and Optimization: Industrial Mathematics at SINTEF, pp. 211–264. Springer Verlag, 2007.
    float minmod(float a, float b, float c) {
        // Returns the least steep slope, or zero if the signs differ,
        // using only sign manipulation, min, and abs: no divergent branches.
        return 0.25f
             * sign(a)
             * (sign(a) + sign(b))
             * (sign(b) + sign(c))
             * min(min(abs(a), abs(b)), abs(c));
    }
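One minimal way to write that sign function branchlessly (an assumption; the talk does not show its implementation):

    // Branchless sign: each comparison yields 0 or 1, so the difference is
    // exactly -1, 0, or +1, as the minmod product above requires.
    __device__ float sign(float a) {
        return (float)((a > 0.0f) - (a < 0.0f));
    }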
Timestep size kernel
Flux kernel calculates the wave speed per cell
Find the global maximum
Calculate the timestep using the CFL condition
Parallel reduction:
Modeled on the CUDA SDK sample
Template code
Fully coalesced reads
Without bank conflicts
Optimization: perform a partial reduction in the flux kernel (see the sketch below)
Reduces memory and bandwidth by a factor of 192
Image from "Optimizing Parallel Reduction in CUDA", Mark Harris
[Figure: a 16×14 block of wave speeds reduced to a single value]
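A minimal sketch of the partial reduction at the end of the flux kernel (buffer names and the power-of-two padding are assumptions):

    // Reduce the block's per-cell wave speeds to one value, so the global
    // CFL reduction reads one float per block rather than one per cell.
    // tid is the flattened thread id; 16x14 = 224 threads, padded to 256.
    __shared__ float speeds[256];
    speeds[tid] = my_wave_speed;                     // one entry per cell
    if (tid < 256 - 224) speeds[224 + tid] = 0.0f;   // zero-pad the tail
    __syncthreads();
    for (int s = 128; s > 0; s >>= 1) {              // tree reduction (max)
        if (tid < s) speeds[tid] = fmaxf(speeds[tid], speeds[tid + s]);
        __syncthreads();
    }
    if (tid == 0)                                    // one result per block
        max_speed[blockIdx.y * gridDim.x + blockIdx.x] = speeds[0];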
Time integration kernel
Computes Q* or Q^(n+1)
Solves the per-cell ODE in time
"Trivial" to implement
Fully coalesced memory access
Memory bound (see the sketch below)
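A minimal sketch of such a kernel (names are illustrative): one fused multiply-add per value, so performance is limited by bandwidth, not arithmetic:

    // Forward Euler: Q^{n+1} = Q^n + dt * R(Q^n), one value per thread.
    // Consecutive threads touch consecutive addresses: fully coalesced.
    __global__ void eulerStep(float* Q, const float* R, float dt, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            Q[i] += dt * R[i];
    }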
Boundary conditions kernel
Global boundary uses ghost cells
Fixed inlet / outlet discharge
Fixed depth
Reflecting
Outflow/Absorbing
Currently no mixed boundaries
Can also supply hydrograph
Tsunamis
Storm surges
Tidal waves
[Figure: global boundary with local ghost cells; example hydrographs: 3.5 m tsunami over 1 h, 10 m storm surge over 4 d]
Boundary conditions kernel
Similar to the CUDA SDK reduction sample, using templates:
One block sets all four boundaries
Boundary length (>64, >128, >256, >512)
Boundary type (”none”, reflecting, fixed depth, fixed discharge, absorbing outlet)
In total: 4*5*5*5*5 = 2500 realizations
    switch (block.x) {
        case 512: BCKernelLauncher<512, N, S, E, W>(grid, block, stream); break;
        case 256: BCKernelLauncher<256, N, S, E, W>(grid, block, stream); break;
        case 128: BCKernelLauncher<128, N, S, E, W>(grid, block, stream); break;
        case  64: BCKernelLauncher< 64, N, S, E, W>(grid, block, stream); break;
    }
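The launcher behind this switch might look as follows (a hedged sketch; the actual signature is not shown in the talk). Boundary types and block size become compile-time constants, so each of the 2500 specializations is branch-free inside the kernel:

    // Hypothetical launcher: N, S, E, W select the boundary type per side.
    template<int SIZE, int N, int S, int E, int W>
    void BCKernelLauncher(dim3 grid, dim3 block, cudaStream_t stream) {
        BCKernel<SIZE, N, S, E, W><<<grid, block, 0, stream>>>(/* fields */);
    }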
Optimization: Early exit
Observation: many dry areas do not require computation
Use a small auxiliary buffer to store which blocks are wet
Exit the flux kernel early if the block and its nearest neighbours are dry (see the sketch below)
Up to 6× speedup
Blocks still have to be scheduled
Blocks read the auxiliary buffer
One wet cell marks the whole block as wet
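A minimal sketch of the early-exit test at the top of the flux kernel (wet_blocks, nbx, and nby are assumed names; the real bookkeeping may differ):

    // One flag per block; a block exits immediately if it and its four
    // nearest neighbours are all dry. Blocks are still scheduled, but do
    // no flux work beyond this read.
    __device__ bool nearWater(const char* wet_blocks, int nbx, int nby) {
        int bx = blockIdx.x, by = blockIdx.y;
        return wet_blocks[by * nbx + bx]
            || wet_blocks[by * nbx + max(bx - 1, 0)]
            || wet_blocks[by * nbx + min(bx + 1, nbx - 1)]
            || wet_blocks[max(by - 1, 0) * nbx + bx]
            || wet_blocks[min(by + 1, nby - 1) * nbx + bx];
    }
    // In the flux kernel: if (!nearWater(wet_blocks, nbx, nby)) return;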
Results – Performance
Circular dam break
1st-order Euler:
30% wet cells: 1200 megacells/s
50% wet cells: 900 megacells/s
100% wet cells: 300 megacells/s
2nd-order Runge-Kutta:
30% wet cells: 600 megacells/s
50% wet cells: 450 megacells/s
100% wet cells: 150 megacells/s
Results – Multiple GPUs
Single-node multi-GPU
Four Tesla GPUs
Threading
Near-perfect weak scaling
Near-perfect strong scaling
Up to 380 million cells (16 GB)
19 000 x 19 000 cells
Verification
2D Parabolic basin
Planar water surface oscillates
100 x 100 cells
Horizontal scale: 8 km
Vertical scale: 3.3 m
Simulated and analytical solutions match well
But, as with most schemes, errors grow along the wet-dry interface
Validation – Barrage de Malpasset
South-east France, near Fréjus
Dam burst at 21:13 on December 2nd, 1959
40-meter-high wall of water moving at 70 km/h (43 mph)
Reached the Mediterranean in 30 minutes
423 casualties, $68 million in damages
Double-curvature dam:
66.5 m high
220 m crest length
55 million cubic metres of water
Images from Google maps, TeraMetrics
Validation
Experimental data from a 1:400 scale model
482,000 cells
1100 × 440 bathymetry values
15-meter resolution
Accurately predicts maximum elevation and front arrival time
Largest discrepancies at gauges 14 (arrival time) and 9 (elevation)
Compares well with published results
Implementation – CPU framework
Simulation loop executed by CPU
Output to netCDF
Direct visualization via OpenGL
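A minimal sketch of that CPU-side loop (class and function names are illustrative, not the framework's actual API):

    // The CPU only orchestrates; all number crunching stays on the GPU.
    while (sim.currentTime() < t_end) {
        sim.step();                       // one full simulation cycle (six stages)
        if (sim.currentTime() >= next_dump) {
            writeNetCDF(outfile, sim);    // hypothetical netCDF writer
            next_dump += dump_interval;
        }
        renderOpenGL(sim);                // direct visualization of GPU data
    }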
Video:
http://www.youtube.com/watch?v=FbZBR-FjRwY
Live Demo
Dell XPS m1330, Flamingo Pink
Purchased 09-2008, price ~$1850
Intel Core 2 Duo T9300 @ 2.5 GHz
4.0 GB RAM
NVIDIA GeForce 8400M GS
128 MB graphics RAM
Only 16 CUDA cores (a GTX 480 has 480)
Windows Vista Ultimate SP2 32-bit
CUDA toolkit/SDK 3.1 32-bit
CUDA Driver 257.21
Microsoft Visual Studio 2008
Images from dell.com
Summary
Faster-than-real-time performance
150-1200 megacells per second
Verified and validated results
Can accurately predict real-world events using single precision
Direct visualization
Interactive exploration of simulation results
Learn how to simulate a half-hour dam break in seconds
References
A. R. Brodtkorb, T. R. Hagen, K.-A. Lie, and J. R. Natvig. Simulation and Visualization of the Saint-Venant System using GPUs. Computing and Visualization in Science, special issue on hot topics in computational engineering, 2010 [forthcoming].

A. R. Brodtkorb, M. L. Sætra, and M. Altinakar. Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation. In review, 2010.

A. R. Brodtkorb. Scientific Computing on Heterogeneous Architectures. Ph.D. thesis, University of Oslo, submitted 2010.
Thank you for your attention.
Questions?
http://babrodtk.at.ifi.uio.no http://hetcomp.com