Seismic Duck – “Reflection Seismology Was Never So Much Fun!”
Presented by: Kobie Shmuel, Nadav Elyahu
Outline
Preface
Threading Building Blocks (Intel® TBB)
Background
Overall parallelization strategy for the game
Threading
Conclusion and further optimizations
Preface
Seismic Duck is a freeware game about reflection seismology: imaging underground structures by sending sound waves into the ground and interpreting the echoes.
Seismic Duck was parallelized with Intel® TBB. It was rewritten by Arch D. Robison (the architect of Intel® TBB). The original game was written for Macs in the mid 1990s.
The goals of Seismic Duck are to demonstrate some physics and to be an interesting game.
Preface
Seismic Duck tries to be qualitatively accurate, but quantitatively interesting. Thus time and space scales are distorted:
o The sound waves are too slow.
o The horizontal scale covers a very large lateral distance in proportion to the vertical scale.
Despite the complexity and realism of the model behind it, the game is very intuitive and is also aimed at (bright) children.
Threading Building Blocks (Intel® TBB)
A C++ template library developed by Intel for writing software programs that take advantage of multi-core processors.
The library consists of data structures and algorithms that allow a programmer to avoid some complications arising from the use of native threading packages.
Instead, the library abstracts access to the multiple processors by allowing the operations to be treated as "tasks", which are allocated to individual cores dynamically by the library's run-time engine, and by automating efficient use of the CPU cache.
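A minimal sketch of that task-based style (illustrative code only, not taken from Seismic Duck; the vector and its size are made up):

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>

int main() {
    std::vector<float> data(1 << 20, 1.0f);
    // The range is split into chunks ("tasks"); TBB's scheduler decides
    // which worker thread runs each chunk and how finely to subdivide.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, data.size()),
                      [&](const tbb::blocked_range<size_t>& r) {
                          for (size_t i = r.begin(); i != r.end(); ++i)
                              data[i] *= 2.0f;
                      });
    return 0;
}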
Arch D. Robison, the architect of TBB and the lead developer of KAI C++, developed the game Seismic Duck as a hobby.
Background
Seismic Duck runs three independent core computations:
o Seismic wave propagation and rendering.
o Gas/oil/water flow through a reservoir and rendering.
o Seismogram rendering.
All three run in parallel using parallel_invoke (a template function that evaluates several functions in parallel).
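A minimal sketch of that top-level structure (the three function names are placeholders, not the game's actual routines):

#include <tbb/parallel_invoke.h>

void updateAndRenderWaves()     { /* seismic wave propagation and rendering */ }
void updateAndRenderReservoir() { /* gas/oil/water flow and rendering */ }
void renderSeismogram()         { /* seismogram rendering */ }

void frameUpdate() {
    // parallel_invoke runs all three callables in parallel and returns
    // only after all of them have finished.
    tbb::parallel_invoke(updateAndRenderWaves,
                         updateAndRenderReservoir,
                         renderSeismogram);
}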
At a high level, it follows the "Three Layer Cake" pattern (a method of parallel programming for shared-memory hardware to maximize performance, readability, and flexibility).
Still, that level of parallelism was not enough to get a good animation speed; the wave propagation simulation was the bottleneck.
Background – Numerical simulation of Waves
Waves are simulated using the finite-difference time-domain (FDTD) method.
Five 2D arrays represent a scalar field over a 2D grid.
Three of the arrays represent variables that step through time.
The other two represent rock properties.
Background – Numerical simulation of Waves
A “leapfrog” algorithm is used.
The advantage of staggering and leapfrogging is that it delivers results accurate to 2nd order for the cost of a 1st-order approach.
The update operations are beautifully simple:
forall i, j {
    Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
    Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
}

forall i, j {
    U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
}
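Read as plain C++, the two sweeps might look like this (a sketch only; the grid size and array declarations are illustrative, not taken from the game's source):

const int M = 512, N = 512;                 // grid size (illustrative)
static float U[M][N], Vx[M][N], Vy[M][N];   // the three time-stepped fields
static float A[M][N], B[M][N];              // the two rock-property fields

// First sweep: update the velocity fields Vx and Vy from U.
void sweepVelocities() {
    for (int i = 1; i < M - 1; ++i)
        for (int j = 1; j < N - 1; ++j) {
            Vx[i][j] += (A[i][j+1] + A[i][j]) * (U[i][j+1] - U[i][j]);
            Vy[i][j] += (A[i+1][j] + A[i][j]) * (U[i+1][j] - U[i][j]);
        }
}

// Second sweep: update U from the freshly updated Vx and Vy.
void sweepPressure() {
    for (int i = 1; i < M - 1; ++i)
        for (int j = 1; j < N - 1; ++j)
            U[i][j] += B[i][j] * ((Vx[i][j] - Vx[i][j-1]) + (Vy[i][j] - Vy[i-1][j]));
}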
Overall parallelization strategy
The TBB demo "seismic" and Seismic Duck use two very different approaches to parallelizing these updates.
o The TBB demo is supposed to show how to use a TBB feature ("parallel_for") and a basic parallel pattern, and to keep things as simple as possible.
o Seismic Duck is written for high performance, at the expense of increased complexity. It uses several parallel patterns. It also has more ambitious numerical modeling.
The pattern behind the TBB demo is "Odd-Even Communication Group" in time and a geometric decomposition pattern in space.
The big drawback of the Odd-Even pattern is memory bandwidth.
Overall parallelization strategy
Using the Odd-Even pattern, each grid point is loaded once from main memory per time step.
First sweep (update Vx and Vy):
o 6 memory references (read A, U, Vx, Vy; write Vx, Vy)
o 6 floating-point adds
o 2 floating-point multiplications
Second sweep (update U):
o 5 memory references (read B, Vx, Vy, U; write U)
o 4 floating-point adds
o 1 floating-point multiplication
Overall parallelization strategy
The key consideration is C, the ratio of floating-point additions to memory references: (6+4) additions per (6+5) memory references.
C = (6+4)/(6+5) = 10/11 ≈ 0.91, which is a serious bottleneck.
Typical hardware (e.g. a Core 2 Duo) can sustain roughly C = 12, but the Odd-Even code only reaches C ≈ 0.91.
Thus the Odd-Even version delivers only a small fraction (on the order of 0.91/12 ≈ 8%) of the machine's theoretical peak floating-point performance.
Overall parallelization strategy
Improving ‘C’
Single sweep (update Vx, Vy, and U):
o 8 memory references (read A, B, U, Vx, Vy; write Vx, Vy, U)
o 10 floating-point adds
o 3 floating-point multiplications
Now C = 10/8 = 1.25, about a 37% improvement over the C = 0.91 value in the original code.
Overall parallelization strategy
However, the improvement in C complicates parallelism.
for( i=1; i<m; ++i )
    for( j=1; j<n; ++j ) {
        Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
        Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
        U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
    }
The fused code has a sequential loop nest, because it must preserve the following constraints:
o The update of Vx[i][j] must use the value of U[i][j+1] from the previous sweep.
o The update of Vy[i][j] must use the value of U[i+1][j] from the previous sweep.
o The update of U[i][j] must use the values of Vx[i][j-1] and Vy[i-1][j] from the current sweep.
Threading
Treating the grids as having map coordinates, a grid point must be updated after the points north and west of it are updated, but before the points south and east of it are updated.
One way to parallelize the loop nest is the Wavefront Pattern. In that pattern, execution sweeps diagonally from the northwest to the southeast corner.
But that pattern has relatively high synchronization costs. Furthermore, in this context, it has poor cache locality because it would tend to schedule adjacent grid points on different processors.
An alternative is geometric decomposition combined with the Ghost Cell Pattern.
Threading – Wavefront Pattern
Problem
o Data elements are laid out as multidimensional grids representing a logical plane or space.
o The dependencies between the elements result in computations that resemble a diagonal sweep.
o This creates a nontrivial problem for distributing work among the parallel processing units.
Driving forces
o The workload must be balanced as the diagonal wavefront computation sweeps across the elements.
o Processing units must minimize their idle time while others are executing.
o The performance of the overall system must be efficient.
Threading – Wavefront Pattern
Solution
o Data distribution is a critical factor; processor idle time must be minimized.
o The computational load at a given instant differs throughout the sweeping process, so simple partitions across rows or columns are not encouraged.
o The most widely used data distribution scheme in practice is block cyclic distribution (see the sketch below).
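A small illustration of block cyclic distribution (block size and thread count are made-up values):

#include <cstdio>

int main() {
    const int numCols    = 32;   // columns of the grid (illustrative)
    const int blockSize  = 4;    // columns per block (illustrative)
    const int numThreads = 3;    // worker threads (illustrative)

    // Block b covers columns [b*blockSize, (b+1)*blockSize) and is owned by
    // thread b % numThreads, so consecutive blocks land on different threads
    // and every thread keeps receiving work as the wavefront sweeps across.
    for (int b = 0; b < numCols / blockSize; ++b)
        std::printf("block %d (cols %2d..%2d) -> thread %d\n",
                    b, b * blockSize, (b + 1) * blockSize - 1, b % numThreads);
    return 0;
}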
Threading – Wavefront Pattern
Threading strategies
Fixed block based strategy
Cyclical fixed block based strategy
Variable cyclical block based strategy
Threading – Wavefront Pattern
Example
o A dynamic programming matrix (DPM) generated by a biological sequence alignment algorithm.
o The measured speedups are not necessarily the best that can be achieved, as the blocking factor greatly affects the results.
o They are meant to show that, for a little effort, it is possible to quickly build a parallel program with reasonable speedups.
Threading – Wavefront Pattern
Delegation of work to threads
o The delegation of work to threads in the Wavefront pattern is handled by a work queue.
Synchronization schemes (the diagonal variant is sketched below):
o Diagonal synchronization
o Prerequisite synchronization
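A minimal sketch of the diagonal-synchronization variant in a TBB-style shared-memory setting (the per-block update is a placeholder and the block counts are illustrative):

#include <tbb/parallel_for.h>
#include <algorithm>
#include <cstdio>

// Placeholder for the per-block work; in Seismic Duck this would run the
// grid update over one block.
static void updateBlock(int bi, int bj) {
    std::printf("updating block (%d,%d)\n", bi, bj);
}

// Blocks on the same anti-diagonal (bi + bj == d) have no mutual
// dependencies, so each diagonal is processed with a parallel_for; looping
// over the diagonals enforces the north/west-before-south/east ordering.
void wavefrontSweep(int blockRows, int blockCols) {
    for (int d = 0; d < blockRows + blockCols - 1; ++d) {
        int lo = std::max(0, d - (blockCols - 1));
        int hi = std::min(d, blockRows - 1);
        tbb::parallel_for(lo, hi + 1, [=](int bi) {
            updateBlock(bi, d - bi);
        });
    }
}

int main() {
    wavefrontSweep(4, 4);
    return 0;
}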
Threading – Ghost Cell Pattern
Problem
o Computing problems divided geometrically into chunks computed on different processors are not embarrassingly parallel.
o Specifically, the points at the borders of a chunk require the values of points from the neighboring chunks.
o How can values be transferred between processes in an efficient and structured manner?
o The set of points that influence the calculation of a point is often called a stencil.
o Retrieving points introduces communication operations, which lead to high latency.
Threading – Ghost Cell Pattern
Solution
o Allocate additional space for a layer of ghost cells around the edges of each chunk.
o For every iteration, have each pair of neighbors exchange their borders.
o Wide halo.
o Corner cells.
o Multi-dimensional border exchange.
NOTE: unnecessary synchronization may slow an implementation. For example, MPI_Send (and MPI_Sendrecv) may block until the receiver starts to receive the message.
Threading
The Wide Halo method has been used, and C has been increased from 0.91 to 1.25. Each chunk can be updated independently by a different thread, except around its border.
To see the exception, consider the interaction of chunk 0 and chunk 1. Let i0 be the index of the last row of chunk 0 and let i1 be the index of the first row of chunk 1.
o The update of Vy[i0][j] must use the value of U[i1][j] from the previous sweep.
o The update of U[i1][j] must use the value of Vy[i0][j] from the current sweep.
The Ghost Cell Pattern enables the chunks to be updated in parallel: each chunk becomes a separate grid with an extra row of grid points added above and below it.
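A minimal sketch of this chunk-plus-ghost-rows arrangement (a simple averaging stencil stands in for the real Vx/Vy/U update; all names and sizes are illustrative, not from the game):

#include <tbb/parallel_for.h>
#include <vector>

const int W = 64;        // grid width (illustrative)
const int ROWS = 16;     // interior rows per chunk (illustrative)
const int NCHUNK = 4;    // number of chunks (illustrative)

struct Chunk {
    // Row 0 and row ROWS+1 are ghost rows; rows 1..ROWS are owned by the chunk.
    std::vector<float> u = std::vector<float>((ROWS + 2) * W, 0.0f);
    float& at(int i, int j) { return u[i * W + j]; }
};

void step(std::vector<Chunk>& chunks) {
    // 1) Border exchange: each ghost row receives a copy of the neighboring
    //    chunk's first or last owned row (done serially here for clarity).
    for (int c = 0; c < NCHUNK; ++c) {
        if (c > 0)
            for (int j = 0; j < W; ++j)
                chunks[c].at(0, j) = chunks[c - 1].at(ROWS, j);
        if (c + 1 < NCHUNK)
            for (int j = 0; j < W; ++j)
                chunks[c].at(ROWS + 1, j) = chunks[c + 1].at(1, j);
    }
    // 2) Each chunk is now updated independently; it reads only its own rows
    //    and its ghost rows, so the chunks can run on different threads.
    tbb::parallel_for(0, NCHUNK, [&](int c) {
        for (int i = 1; i <= ROWS; ++i)
            for (int j = 1; j < W - 1; ++j)
                chunks[c].at(i, j) = 0.25f * (chunks[c].at(i - 1, j) + chunks[c].at(i + 1, j)
                                            + chunks[c].at(i, j - 1) + chunks[c].at(i, j + 1));
    });
}

int main() {
    std::vector<Chunk> chunks(NCHUNK);
    step(chunks);
    return 0;
}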
Conclusion and further optimizations
Several parallelization techniques have been discussed.
Further improvements have been made through vectorization and cache optimizations.
The game is fun; you should try it out (it is a free download).
THE END