Evacuate Now?
Faster-than-Real-Time Shallow Water Simulations on GPUs
NVIDIA GPU Technology Conference
San Jose, California, 2010
André R. Brodtkorb
Talk Outline
Introduction
Why Shallow Water Simulations?
The Shallow Water Equations
Numerical scheme
Our contribution
Simulator Implementation
Results including screen capture video
Live Demo on a standard Laptop
Summary
Learn how to simulate a half-hour dam break in 27 seconds
The Shallow Water Equations
First described by de Saint-Venant (1797-1886)
Gravity-induced fluid motion
2D free surface
Negligible vertical acceleration
Wave length much larger than depth
Conservation of mass and momentum
Not only for water:
Atmospheric flow
Avalanches
...
Water image from http://freephoto.com / Ian Britton
Target application areas
Floods: 2010 Pakistan (2,000+ casualties)
Tsunamis: 2004 Indian Ocean (230,000)
Storm surges: 2005 Hurricane Katrina (1,836)
Dam breaks: 1959 Malpasset (423)
Images from wikipedia.org
Mathematical Formulation
$$ \frac{\partial Q}{\partial t} + \frac{\partial F(Q)}{\partial x} + \frac{\partial G(Q)}{\partial y} = H_B + H_f $$

Q: vector of conserved variables
F, G: flux functions
H_B: bed slope source term
H_f: bed friction source term
The Shallow Water Equations
$$ Q = \begin{bmatrix} h \\ hu \\ hv \end{bmatrix}, \qquad F(Q) = \begin{bmatrix} hu \\ hu^2 + \tfrac{1}{2}gh^2 \\ huv \end{bmatrix}, \qquad G(Q) = \begin{bmatrix} hv \\ huv \\ hv^2 + \tfrac{1}{2}gh^2 \end{bmatrix} $$

Conserved variables: water depth (h), discharge along x (hu), and discharge along y (hv)
Explicit Numerical Schemes
Hyperbolic partial differential equation
Enables explicit schemes
Accurate modeling of discontinuities / shocks
High accuracy in smooth parts, without oscillations near discontinuities
Capable of representing dry states
Negative water depths ruin simulations
Images from wikipedia.org, James Kilfiger
Explicit Numerical Schemes
Additional wanted properties:
Second order accurate fluxes
Total variation diminishing
Well-balancedness
Scheme of choice: A. Kurganov and G. Petrova, "A Second-Order Well-Balanced Positivity Preserving Central-Upwind Scheme for the Saint-Venant System", Communications in Mathematical Sciences, 5 (2007), 133–160.
Kurganov-Petrova – Spatial discretization
Write in vector form
Impose a finite-volume grid
Rewrite in terms of w = h + B
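Following the Kurganov-Petrova formulation, the change of variables replaces water depth h with surface elevation w:

$$ w = h + B, \qquad Q = [\,w,\; hu,\; hv\,]^\top $$

This makes the lake-at-rest steady state (w constant, hu = hv = 0) exactly representable, which is the well-balancedness property listed earlier.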
Kurganov-Petrova – Finite Volume Grid
Q defined as cell averages
B defined as piecewise bilinear
F and G calculated across cell interfaces
Source terms, H, calculated as cell averages
Kurganov-Petrova – Flux calculations
[Figure: from continuous to discrete variables: slope reconstruction, integration points, flux calculation, and the dry-states fix]
Kurganov-Petrova – Temporal discretization
Gather all explicit terms
One ordinary differential equation in time per cell
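In symbols, with R(Q) denoting the gathered flux and source terms (notation assumed for presentation):

$$ \frac{dQ_{ij}}{dt} = R(Q)_{ij} $$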
Kurganov-Petrova – Temporal discretization
Discretize in time using second-order Runge-Kutta
Total variation diminishing
Semi-implicit friction source term
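Written out, the standard TVD second-order Runge-Kutta (Heun) step, with R(Q) as above, reads:

$$ Q^{*} = Q^n + \Delta t\, R(Q^n), \qquad Q^{n+1} = \tfrac{1}{2} Q^n + \tfrac{1}{2}\left( Q^{*} + \Delta t\, R(Q^{*}) \right) $$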
Kurganov-Petrova – CFL condition
Explicit scheme, time step restriction:
Time step size restricted by a Courant-Friedrichs-Lewy (CFL) condition
The numerical domain of dependence must include the domain of dependence of the equation
Each wave is allowed to travel at most one quarter grid cell per time step (see the formula below)
[Figure: space-time diagram of stable vs. unstable time steps relative to the mathematical propagation speed]
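For this scheme, the quarter-cell restriction can be written as (with gravity wave speed $\sqrt{gh}$):

$$ \Delta t \le \frac{1}{4} \min\!\left( \frac{\Delta x}{\max_{ij} |u \pm \sqrt{gh}|},\; \frac{\Delta y}{\max_{ij} |v \pm \sqrt{gh}|} \right) $$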
Kurganov-Petrova – Simulation cycle
1. Calculate fluxes
2. Calculate Δt
3. Half step
4. Calculate fluxes
5. Evolve in time
6. Apply boundary conditions
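A minimal host-side sketch of one cycle (function names are illustrative, not the simulator's actual API):

    // One second-order simulation step, mirroring the six stages above.
    void simulationStep(State& Q, State& Qstar) {
        computeFluxes(Q);          // 1. fluxes (also records per-block wave speeds)
        float dt = computeDt();    // 2. global CFL reduction gives the timestep
        halfStep(Q, Qstar, dt);    // 3. predictor: Q* from Q
        computeFluxes(Qstar);      // 4. fluxes from the predicted state
        evolve(Q, Qstar, dt);      // 5. corrector: full step back into Q
        applyBoundaries(Q);        // 6. refresh the global ghost cells
    }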
Implementation – GPU code
Four CUDA kernels (share of runtime):
Flux: 87%
Timestep size (CFL condition): <1%
Forward Euler step: 12%
Set boundary conditions: <1%
Flux kernel – Domain decomposition
A nine-point nonlinear stencil
Composed of simpler stencils
Heavy use of shared memory
Computationally demanding
Traditional block decomposition
Overlapping ghost cells (a.k.a. apron)
Global ghost cells for boundary conditions
Domain padding
Flux kernel – Block size
Block size is 16×14, chosen based on:
Warp size: multiple of 32
Shared memory use: 16 shared-memory buffers use ~16 KB
Occupancy
Global memory access
Fermi cache configuration: 48 KB shared memory, 16 KB L1 cache
Three resident blocks per SM
Trades cache for occupancy
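On Fermi, this preference is set per kernel; a one-line sketch (fluxKernel is an assumed kernel name):

    // Prefer 48 KB shared memory / 16 KB L1 so that three ~16 KB blocks
    // can be resident per SM: trades cache for occupancy.
    cudaFuncSetCacheConfig(fluxKernel, cudaFuncCachePreferShared);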
Flux kernel – computations
Calculations per cell:
Flux across north and east interfaces
Bed slope source term for the cell
Collective stencil operations:
n threads and n+1 interfaces: one warp performs the extra calculations (see the sketch below)
Alternative is one thread per stencil operation
Many idle threads, and extra register pressure
[Figure: stencil pipeline: input, slopes, integration points, flux]
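A minimal sketch of the collective pattern (buffer and helper names are assumptions, not the talk's code):

    // blockDim.x threads cooperatively fill blockDim.x + 1 interface values:
    // a few threads simply loop twice instead of keeping an extra, mostly
    // idle thread column around (which would also raise register pressure).
    for (int i = threadIdx.x; i < blockDim.x + 1; i += blockDim.x) {
        F_shared[i] = interfaceFlux(i);   // hypothetical per-interface flux
    }
    __syncthreads();                      // all interface values are ready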
Flux kernel – flux limiter
Limits the fluxes to obtain a non-oscillatory solution
Generalized minmod limiter:
Least steep slope, or
Zero if signs differ
Creates divergent code paths
Use branchless implementation (Hagen et al., 2007)
Requires a special sign function (see below)
Much faster than the naïve approach
(2007) T. Hagen, M. Henriksen, J. Hjelmervik, and K.-A. Lie. How to solve systems of conservation laws numerically using the graphics processor as a high-performance computational engine. In Geometrical Modeling, Numerical Simulation, and Optimization: Industrial Mathematics at SINTEF, pp. 211–264. Springer Verlag, 2007.
    float minmod(float a, float b, float c) {
        // Returns the least steep slope, or zero if the signs differ,
        // using only sign manipulation, min, and abs: no divergent branches.
        return 0.25f
             * sign(a)
             * (sign(a) + sign(b))
             * (sign(b) + sign(c))
             * min(min(abs(a), abs(b)), abs(c));
    }
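One minimal way to write that sign function branchlessly (an assumption; the talk does not show its implementation):

    // Branchless sign: each comparison yields 0 or 1, so the difference is
    // exactly -1, 0, or +1, as the minmod product above requires.
    __device__ float sign(float a) {
        return (float)((a > 0.0f) - (a < 0.0f));
    }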
Timestep size kernel
Flux kernel calculates the wave speed per cell
Find the global maximum
Calculate the timestep using the CFL condition
Parallel reduction:
Modeled on the CUDA SDK sample
Template code
Fully coalesced reads
Without bank conflicts
Optimization: perform a partial reduction in the flux kernel (see the sketch below)
Reduces memory and bandwidth by a factor of 192
Image from "Optimizing Parallel Reduction in CUDA", Mark Harris
[Figure: a 16×14 block of wave speeds reduced to a single value]
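A minimal sketch of the partial reduction at the end of the flux kernel (buffer names and the power-of-two padding are assumptions):

    // Reduce the block's per-cell wave speeds to one value, so the global
    // CFL reduction reads one float per block rather than one per cell.
    // tid is the flattened thread id; 16x14 = 224 threads, padded to 256.
    __shared__ float speeds[256];
    speeds[tid] = my_wave_speed;                     // one entry per cell
    if (tid < 256 - 224) speeds[224 + tid] = 0.0f;   // zero-pad the tail
    __syncthreads();
    for (int s = 128; s > 0; s >>= 1) {              // tree reduction (max)
        if (tid < s) speeds[tid] = fmaxf(speeds[tid], speeds[tid + s]);
        __syncthreads();
    }
    if (tid == 0)                                    // one result per block
        max_speed[blockIdx.y * gridDim.x + blockIdx.x] = speeds[0];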
Time integration kernel
Computes Q* or Q^(n+1)
Solves the per-cell ODE in time
"Trivial" to implement
Fully coalesced memory access
Memory bound (see the sketch below)
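A minimal sketch of such a kernel (names are illustrative): one fused multiply-add per value, so performance is limited by bandwidth, not arithmetic:

    // Forward Euler: Q^{n+1} = Q^n + dt * R(Q^n), one value per thread.
    // Consecutive threads touch consecutive addresses: fully coalesced.
    __global__ void eulerStep(float* Q, const float* R, float dt, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            Q[i] += dt * R[i];
    }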
Boundary conditions kernel
Global boundary uses ghost cells
Fixed inlet / outlet discharge
Fixed depth
Reflecting
Outflow/Absorbing
Currently no mixed boundaries
Can also supply hydrograph
Tsunamis
Storm surges
Tidal waves
[Figure: global boundary with local ghost cells; example hydrographs: 3.5 m tsunami over 1 h, 10 m storm surge over 4 d]
Boundary conditions kernel
Similar to the CUDA SDK reduction sample, using templates:
One block sets all four boundaries
Boundary length (>64, >128, >256, >512)
Boundary type (”none”, reflecting, fixed depth, fixed discharge, absorbing outlet)
In total: 4*5*5*5*5 = 2500 realizations
    switch (block.x) {
        case 512: BCKernelLauncher<512, N, S, E, W>(grid, block, stream); break;
        case 256: BCKernelLauncher<256, N, S, E, W>(grid, block, stream); break;
        case 128: BCKernelLauncher<128, N, S, E, W>(grid, block, stream); break;
        case  64: BCKernelLauncher< 64, N, S, E, W>(grid, block, stream); break;
    }
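The launcher behind this switch might look as follows (a hedged sketch; the actual signature is not shown in the talk). Boundary types and block size become compile-time constants, so each of the 2500 specializations is branch-free inside the kernel:

    // Hypothetical launcher: N, S, E, W select the boundary type per side.
    template<int SIZE, int N, int S, int E, int W>
    void BCKernelLauncher(dim3 grid, dim3 block, cudaStream_t stream) {
        BCKernel<SIZE, N, S, E, W><<<grid, block, 0, stream>>>(/* fields */);
    }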
Optimization: Early exit
Observation: many dry areas do not require computation
Use a small auxiliary buffer to store which blocks are wet
Exit the flux kernel early if the block and its nearest neighbours are dry (see the sketch below)
Up to 6× speedup
Blocks still have to be scheduled
Blocks read the auxiliary buffer
One wet cell marks the whole block as wet
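A minimal sketch of the early-exit test at the top of the flux kernel (wet_blocks, nbx, and nby are assumed names; the real bookkeeping may differ):

    // One flag per block; a block exits immediately if it and its four
    // nearest neighbours are all dry. Blocks are still scheduled, but do
    // no flux work beyond this read.
    __device__ bool nearWater(const char* wet_blocks, int nbx, int nby) {
        int bx = blockIdx.x, by = blockIdx.y;
        return wet_blocks[by * nbx + bx]
            || wet_blocks[by * nbx + max(bx - 1, 0)]
            || wet_blocks[by * nbx + min(bx + 1, nbx - 1)]
            || wet_blocks[max(by - 1, 0) * nbx + bx]
            || wet_blocks[min(by + 1, nby - 1) * nbx + bx];
    }
    // In the flux kernel: if (!nearWater(wet_blocks, nbx, nby)) return;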
Results – Performance
Circular dam break
1st-order Euler:
30% wet cells: 1200 megacells/s
50% wet cells: 900 megacells/s
100% wet cells: 300 megacells/s
2nd-order Runge-Kutta:
30% wet cells: 600 megacells/s
50% wet cells: 450 megacells/s
100% wet cells: 150 megacells/s
Results – Multiple GPUs
Single-node multi-GPU
Four Tesla GPUs
Threading
Near-perfect weak scaling
Near-perfect strong scaling
Up to 380 million cells (16 GB)
19 000 x 19 000 cells
Verification
2D Parabolic basin
Planar water surface oscillates
100 x 100 cells
Horizontal scale: 8 km
Vertical scale: 3.3 m
Simulated and analytical solutions match well
But, as with most schemes, errors grow along the wet-dry interface
Validation – Barrage de Malpasset
South-east France, near Fréjus
Dam burst at 21:13 on December 2nd, 1959
40-meter-high wall of water moving at 70 km/h (43 mph)
Reached the Mediterranean in 30 minutes
423 casualties, $68 million in damages
Double-curvature dam:
66.5 m high
220 m crest length
55 million cubic metres of water
Images from Google maps, TeraMetrics
Validation
Experimental data from a 1:400 scale model
482,000 cells
1100 × 440 bathymetry values
15-meter resolution
Accurately predicts maximum elevation and front arrival time
Largest discrepancies at gauges 14 (arrival time) and 9 (elevation)
Compares well with published results
Implementation – CPU framework
Simulation loop executed by CPU
Output to netCDF
Direct visualization via OpenGL
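A minimal sketch of that CPU-side loop (class and function names are illustrative, not the framework's actual API):

    // The CPU only orchestrates; all number crunching stays on the GPU.
    while (sim.currentTime() < t_end) {
        sim.step();                       // one full simulation cycle (six stages)
        if (sim.currentTime() >= next_dump) {
            writeNetCDF(outfile, sim);    // hypothetical netCDF writer
            next_dump += dump_interval;
        }
        renderOpenGL(sim);                // direct visualization of GPU data
    }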
Video:
http://www.youtube.com/watch?v=FbZBR-FjRwY
Live Demo
Dell XPS m1330, Flamingo Pink
Purchased 09-2008, price ~$1850
Intel Core 2 Duo T9300 @ 2.5 GHz
4.0 GB RAM
NVIDIA GeForce 8400M GS
128 MB graphics RAM
Only 16 CUDA cores (a GTX 480 has 480)
Windows Vista Ultimate SP2 32-bit
CUDA toolkit/SDK 3.1 32-bit
CUDA Driver 257.21
Microsoft Visual Studio 2008
Images from dell.com
Summary
Faster-than-real-time performance
150-1200 megacells per second
Verified and validated results
Can accurately predict real-world events using single precision
Direct visualization
Interactive exploration of simulation results
Learn how to simulate a half-hour dam break in seconds
References
A. R. Brodtkorb, T. R. Hagen, K.-A. Lie, and J. R. Natvig. Simulation and Visualization of the Saint-Venant System using GPUs. Computing and Visualization in Science, special issue on hot topics in computational engineering, 2010 [forthcoming].

A. R. Brodtkorb, M. L. Sætra, and M. Altinakar. Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation. In review, 2010.

A. R. Brodtkorb. Scientific Computing on Heterogeneous Architectures. Ph.D. thesis, University of Oslo, submitted 2010.
Thank you for your attention.
Questions?
http://babrodtk.at.ifi.uio.no http://hetcomp.com