Seismic Duck – “Reflection Seismology Was Never So Much Fun!”
Presented by: Kobie Shmuel, Nadav Elyahu
Outline
Preface
Threading Building Blocks (Intel® TBB)
Background
Overall parallelization strategy for the game
Threading
Conclusion and further optimizations
Preface
Seismic Duck is a freeware game about reflection seismology: imaging underground structures by sending sound waves into the ground and interpreting the echoes.
Seismic Duck was parallelized with Intel® TBB. It was rewritten by Arch D. Robison (the architect of Intel® TBB). The original game was written for Macs in the mid 1990s.
The goals of Seismic Duck are to demonstrate some physics and to be an interesting game.
Preface
Seismic Duck tries to be qualitatively accurate, but quantitatively interesting. Thus time and space scales are distorted:
o The sound waves are too slow.
o The horizontal scale covers a very large lateral distance in proportion to the vertical scale.
Despite the complexity and realism of the model behind it, the game is very intuitive and is also aimed at (bright) children.
Threading Building Blocks (Intel® TBB)
A C++ template library developed by Intel for writing software programs that take advantage of multi-core processors.
The library consists of data structures and algorithms that allow a programmer to avoid some complications arising from the use of native threading packages.
Instead, the library abstracts access to the multiple processors by allowing the operations to be treated as "tasks", which are allocated to individual cores dynamically by the library's run-time engine, and by automating efficient use of the CPU cache.
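A minimal sketch of that task-based style (illustrative code only, not taken from Seismic Duck; the vector and its size are made up):

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>

int main() {
    std::vector<float> data(1 << 20, 1.0f);
    // The range is split into chunks ("tasks"); TBB's scheduler decides
    // which worker thread runs each chunk and how finely to subdivide.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, data.size()),
                      [&](const tbb::blocked_range<size_t>& r) {
                          for (size_t i = r.begin(); i != r.end(); ++i)
                              data[i] *= 2.0f;
                      });
    return 0;
}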
Arch D. Robison, the architect of TBB and the lead developer of KAI C++, developed the game Seismic Duck as a hobby.
Background
Seismic Duck runs three independent core computations:
o Seismic wave propagation and rendering.
o Gas/oil/water flow through a reservoir and rendering.
o Seismogram rendering.
All three run in parallel using parallel_invoke (a template function that evaluates several functions in parallel).
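A minimal sketch of that top-level structure (the three function names are placeholders, not the game's actual routines):

#include <tbb/parallel_invoke.h>

void updateAndRenderWaves()     { /* seismic wave propagation and rendering */ }
void updateAndRenderReservoir() { /* gas/oil/water flow and rendering */ }
void renderSeismogram()         { /* seismogram rendering */ }

void frameUpdate() {
    // parallel_invoke runs all three callables in parallel and returns
    // only after all of them have finished.
    tbb::parallel_invoke(updateAndRenderWaves,
                         updateAndRenderReservoir,
                         renderSeismogram);
}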
At a high level, it follows the "Three Layer Cake" pattern (a method of parallel programming for shared-memory hardware to maximize performance, readability, and flexibility).
Still, that level of parallelism was not enough to get a good animation speed; the wave propagation simulation was the bottleneck.
Background – Numerical simulation of Waves
Waves are simulated using the finite-difference time-domain (FDTD) method.
Five 2D arrays represent a scalar field over a 2D grid.
Three of the arrays represent variables that step through time.
The other two represent rock properties.
Background – Numerical simulation of Waves
A “leapfrog” algorithm is used.
The advantage of staggering and leapfrogging is that it delivers results accurate to 2nd order for the cost of a 1st-order approach.
The update operations are beautifully simple:
forall i, j {
    Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
    Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
}

forall i, j {
    U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
}
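Read as plain C++, the two sweeps might look like this (a sketch only; the grid size and array declarations are illustrative, not taken from the game's source):

const int M = 512, N = 512;                 // grid size (illustrative)
static float U[M][N], Vx[M][N], Vy[M][N];   // the three time-stepped fields
static float A[M][N], B[M][N];              // the two rock-property fields

// First sweep: update the velocity fields Vx and Vy from U.
void sweepVelocities() {
    for (int i = 1; i < M - 1; ++i)
        for (int j = 1; j < N - 1; ++j) {
            Vx[i][j] += (A[i][j+1] + A[i][j]) * (U[i][j+1] - U[i][j]);
            Vy[i][j] += (A[i+1][j] + A[i][j]) * (U[i+1][j] - U[i][j]);
        }
}

// Second sweep: update U from the freshly updated Vx and Vy.
void sweepPressure() {
    for (int i = 1; i < M - 1; ++i)
        for (int j = 1; j < N - 1; ++j)
            U[i][j] += B[i][j] * ((Vx[i][j] - Vx[i][j-1]) + (Vy[i][j] - Vy[i-1][j]));
}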
Overall parallelization strategy
The TBB demo "seismic" and Seismic Duck use two very different approaches to parallelizing these updates.
o The TBB demo is supposed to show how to use a TBB feature ("parallel_for") and a basic parallel pattern, and to keep things as simple as possible.
o Seismic Duck is written for high performance, at the expense of increased complexity. It uses several parallel patterns. It also has more ambitious numerical modeling.
The pattern behind the TBB demo is "Odd-Even Communication Group" in time and a geometric decomposition pattern in space.
The big drawback of the Odd-Even pattern is memory bandwidth.
Overall parallelization strategy
Using the Odd-Even pattern, each grid point is loaded once from main memory per time step.
First sweep (update Vx and Vy):
o 6 memory references (read A, U, Vx, Vy; write Vx, Vy)
o 6 floating-point adds
o 2 floating-point multiplications
Second sweep (update U):
o 5 memory references (read B, Vx, Vy, U; write U)
o 4 floating-point adds
o 1 floating-point multiplication
Overall parallelization strategy
The key consideration is C, the ratio of floating-point additions to memory references: (6+4) additions per (6+5) memory references.
C = (6+4)/(6+5) = 10/11 ≈ 0.91, which is a serious bottleneck.
Typical hardware (e.g. a Core 2 Duo) can sustain roughly C = 12, but the Odd-Even code only reaches C ≈ 0.91.
Thus the Odd-Even version delivers only a small fraction (on the order of 0.91/12 ≈ 8%) of the machine's theoretical peak floating-point performance.
Overall parallelization strategy
Improving ‘C’
Single sweep (update Vx, Vy, and U):
o 8 memory references (read A, B, U, Vx, Vy; write Vx, Vy, U)
o 10 floating-point adds
o 3 floating-point multiplications
Now C = 10/8 = 1.25, about a 37% improvement over the C = 0.91 value in the original code.
Overall parallelization strategy
However, the improvement in C complicates parallelism.
for( i=1; i<m; ++i )
    for( j=1; j<n; ++j ) {
        Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
        Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
        U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
    }
The fused code has a sequential loop nest, because it must preserve the following constraints:
o The update of Vx[i][j] must use the value of U[i][j+1] from the previous sweep.
o The update of Vy[i][j] must use the value of U[i+1][j] from the previous sweep.
o The update of U[i][j] must use the values of Vx[i][j-1] and Vy[i-1][j] from the current sweep.
Threading
Treating the grids as having map coordinates, a grid point must be updated after the points north and west of it are updated, but before the points south and east of it are updated.
One way to parallelize the loop nest is the Wavefront Pattern. In that pattern, execution sweeps diagonally from the northwest to the southeast corner.
But that pattern has relatively high synchronization costs. Furthermore, in this context, it has poor cache locality because it would tend to schedule adjacent grid points on different processors.
An alternative is geometric decomposition combined with the Ghost Cell Pattern.
Threading – Wavefront Pattern
Problem
o Data elements are laid out as multidimensional grids representing a logical plane or space.
o The dependencies between the elements result in computations that resemble a diagonal sweep.
o This creates a nontrivial problem for distributing work among the parallel processing units.
Driving forces
o The workload must be balanced as the diagonal wavefront computation sweeps across the elements.
o Processing units must minimize their idle time while others are executing.
o The performance of the overall system must be efficient.
Threading – Wavefront Pattern
Solution
o Data distribution is a critical factor; processor idle time must be minimized.
o The computational load at a given instant differs throughout the sweeping process, so simple partitions across rows or columns are not encouraged.
o The most widely used data distribution scheme in practice is block cyclic distribution (see the sketch below).
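A small illustration of block cyclic distribution (block size and thread count are made-up values):

#include <cstdio>

int main() {
    const int numCols    = 32;   // columns of the grid (illustrative)
    const int blockSize  = 4;    // columns per block (illustrative)
    const int numThreads = 3;    // worker threads (illustrative)

    // Block b covers columns [b*blockSize, (b+1)*blockSize) and is owned by
    // thread b % numThreads, so consecutive blocks land on different threads
    // and every thread keeps receiving work as the wavefront sweeps across.
    for (int b = 0; b < numCols / blockSize; ++b)
        std::printf("block %d (cols %2d..%2d) -> thread %d\n",
                    b, b * blockSize, (b + 1) * blockSize - 1, b % numThreads);
    return 0;
}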
Threading – Wavefront Pattern
Threading strategies
Fixed block based strategy
Cyclical fixed block based strategy
Variable cyclical block based strategy
Threading – Wavefront Pattern
Example
o A dynamic programming matrix (DPM) generated by a biological sequence alignment algorithm.
o The measured speedups are not necessarily the best that can be achieved, as the blocking factor greatly affects the results.
o They are meant to show that, for a little effort, it is possible to quickly build a parallel program with reasonable speedups.
Threading – Wavefront Pattern
Delegation of work to threads
o The delegation of work to threads in the Wavefront pattern is handled by a work queue.
Synchronization schemes (the diagonal variant is sketched below):
o Diagonal synchronization
o Prerequisite synchronization
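A minimal sketch of the diagonal-synchronization variant in a TBB-style shared-memory setting (the per-block update is a placeholder and the block counts are illustrative):

#include <tbb/parallel_for.h>
#include <algorithm>
#include <cstdio>

// Placeholder for the per-block work; in Seismic Duck this would run the
// grid update over one block.
static void updateBlock(int bi, int bj) {
    std::printf("updating block (%d,%d)\n", bi, bj);
}

// Blocks on the same anti-diagonal (bi + bj == d) have no mutual
// dependencies, so each diagonal is processed with a parallel_for; looping
// over the diagonals enforces the north/west-before-south/east ordering.
void wavefrontSweep(int blockRows, int blockCols) {
    for (int d = 0; d < blockRows + blockCols - 1; ++d) {
        int lo = std::max(0, d - (blockCols - 1));
        int hi = std::min(d, blockRows - 1);
        tbb::parallel_for(lo, hi + 1, [=](int bi) {
            updateBlock(bi, d - bi);
        });
    }
}

int main() {
    wavefrontSweep(4, 4);
    return 0;
}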
Threading – Ghost Cell Pattern
Problem
o Computing problems divided geometrically into chunks computed on different processors are not embarrassingly parallel.
o Specifically, the points at the borders of a chunk require the values of points from the neighboring chunks.
o How can values be transferred between processes in an efficient and structured manner?
o The set of points that influence the calculation of a point is often called a stencil.
o Retrieving points introduces communication operations, which lead to high latency.
Threading – Ghost Cell Pattern
Solution
o Allocate additional space for a layer of ghost cells around the edges of each chunk.
o For every iteration, have each pair of neighbors exchange their borders.
o Wide halo.
o Corner cells.
o Multi-dimensional border exchange.
NOTE: unnecessary synchronization may slow an implementation. For example, MPI_Send (and MPI_Sendrecv) may block until the receiver starts to receive the message.
Threading
The Wide Halo method has been used, and C has been increased from 0.91 to 1.25. Each chunk can be updated independently by a different thread, except around its border.
To see the exception, consider the interaction of chunk 0 and chunk 1. Let i0 be the index of the last row of chunk 0 and let i1 be the index of the first row of chunk 1.
o The update of Vy[i0][j] must use the value of U[i1][j] from the previous sweep.
o The update of U[i1][j] must use the value of Vy[i0][j] from the current sweep.
The Ghost Cell Pattern enables the chunks to be updated in parallel: each chunk becomes a separate grid with an extra row of grid points added above and below it.
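A minimal sketch of this chunk-plus-ghost-rows arrangement (a simple averaging stencil stands in for the real Vx/Vy/U update; all names and sizes are illustrative, not from the game):

#include <tbb/parallel_for.h>
#include <vector>

const int W = 64;        // grid width (illustrative)
const int ROWS = 16;     // interior rows per chunk (illustrative)
const int NCHUNK = 4;    // number of chunks (illustrative)

struct Chunk {
    // Row 0 and row ROWS+1 are ghost rows; rows 1..ROWS are owned by the chunk.
    std::vector<float> u = std::vector<float>((ROWS + 2) * W, 0.0f);
    float& at(int i, int j) { return u[i * W + j]; }
};

void step(std::vector<Chunk>& chunks) {
    // 1) Border exchange: each ghost row receives a copy of the neighboring
    //    chunk's first or last owned row (done serially here for clarity).
    for (int c = 0; c < NCHUNK; ++c) {
        if (c > 0)
            for (int j = 0; j < W; ++j)
                chunks[c].at(0, j) = chunks[c - 1].at(ROWS, j);
        if (c + 1 < NCHUNK)
            for (int j = 0; j < W; ++j)
                chunks[c].at(ROWS + 1, j) = chunks[c + 1].at(1, j);
    }
    // 2) Each chunk is now updated independently; it reads only its own rows
    //    and its ghost rows, so the chunks can run on different threads.
    tbb::parallel_for(0, NCHUNK, [&](int c) {
        for (int i = 1; i <= ROWS; ++i)
            for (int j = 1; j < W - 1; ++j)
                chunks[c].at(i, j) = 0.25f * (chunks[c].at(i - 1, j) + chunks[c].at(i + 1, j)
                                            + chunks[c].at(i, j - 1) + chunks[c].at(i, j + 1));
    });
}

int main() {
    std::vector<Chunk> chunks(NCHUNK);
    step(chunks);
    return 0;
}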
Conclusion and further optimizations
Several parallelization techniques have been discussed.
Further improvements have been made through vectorization and cache optimizations.
The game is fun; you should try it out (it is a free download).
THE END