Physics in Parallel: Simulation on 7th Generation Hardware
David Wu, Pseudo Interactive
Why are we here?
The 7th generation is approaching.
We are no longer next gen
We are all scrambling to adapt to the new stuff, so that we can stay on the bleeding edge,
push the envelope,
and take things to the next level.
What's Next Gen?
Multiple processors: not entirely new, but more than before.
Parallelism: not entirely new, but more than before.
Physics: not entirely new, but more than before.
Take-Away
So much to cover:
General principles
Useful concepts and techniques
Tips
Bad jokes
Goal is to save you time during the transition to... Next Gen.
Format for presentation
Every year we discover new ways to communicate information.
Patterns
A description of a recurrent problem and of the core of possible solutions
Difficult to write
Too pretentious
Inviting criticism
Gems
Valuable bits of information
Too 6th Gen
Blog
Free Form
Continuity not required
Subjective/opinionated is okay
Arbitrary Tangents are okay
Catchy Title need not match article
No quality bar
This sounds 7th Gen to me.
Disclaimer
My information sources range from:
press releases
patents
other blogs on the net
random probabilistic guesses
Much of the information is probably wrong.
Multi-threaded programming
I participated in some in-depth discussions on this topic. After weeks of debate, the conclusion was: multi-threaded programming is hard.
What is 7th Gen Hardware?
Fast
Many parallel processors
Very high peak FLOPS
In-order execution
What is 7th Gen Hardware?
High memory latency
Not enough bandwidth
Moderate clock speed improvements
Not enough memory
CPU-GPU convergence
Hardware usually sucks
Is Multi-Processor Revolutionary?
It is kind of here already:
Hyper-Threading
dual processor
Sega Saturn
Not entirely new, but more than before.
Hardware usually sucks
Hardware advances require years of preparatory hype:
3D accelerators
Online
SIMD
Not with a bang but with a whimper
Hardware usually sucks
The big problem with hardware advances is software.
We don't like to do things that are hard.
If there is a big enough payoff, we do it.
This time there is a big enough payoff.
Types of Parallelism
Task parallelism: render + physics
Data parallelism: collision detection on two objects at a time
Instruction parallelism: multiple elements in a vector
Use all three.
Techniques
Pipeline
Work Crew
Forking
Pipeline: Task Parallelism
Subdivide the problem into discrete tasks.
Solve tasks in parallel, spreading them across multiple processors.
Pipeline: Task Parallelism
Thread 0: collision detection (frame 3)
Thread 1: logic/AI (frame 2)
Thread 2: integration (frame 1)
Next step:
Thread 0: collision detection (frame 4)
Thread 1: logic/AI (frame 3)
Thread 2: integration (frame 2)
Pipeline
Similar to CPU/GPU parallelism
CPU: frame 3, GPU: frame 2
CPU: frame 4, GPU: frame 3
Pipeline: notes
Dependencies are explicit
Communication is explicit, e.g. through a FIFO
Avoids deadlock issues
Avoids most race conditions
Load balancing is not great
Does not reduce latency vs. the single-threaded case
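As a concrete illustration of explicit communication, here is a minimal single-producer/single-consumer ring buffer of the kind one pipeline stage might use to hand work to the next. It is a sketch, not from the talk: the class and names are mine, and it uses C++11 atomics for brevity where 2005-era code would use interlocked primitives and explicit fences (as in the lock example later).

    #include <atomic>
    #include <cstddef>

    // Minimal single-producer/single-consumer FIFO. One pipeline stage
    // pushes, the next stage pops; no locks are needed because each
    // index is written by exactly one thread.
    template <typename T, size_t N>
    class SpscFifo {
        T items[N];
        std::atomic<size_t> head{0};   // advanced only by the consumer
        std::atomic<size_t> tail{0};   // advanced only by the producer
    public:
        bool push(const T& item) {     // called by the upstream stage
            size_t t = tail.load(std::memory_order_relaxed);
            size_t next = (t + 1) % N;
            if (next == head.load(std::memory_order_acquire))
                return false;          // full: producer must wait
            items[t] = item;
            tail.store(next, std::memory_order_release);
            return true;
        }
        bool pop(T& out) {             // called by the downstream stage
            size_t h = head.load(std::memory_order_relaxed);
            if (h == tail.load(std::memory_order_acquire))
                return false;          // empty: consumer must wait
            out = items[h];
            head.store((h + 1) % N, std::memory_order_release);
            return true;
        }
    };

Because all cross-stage traffic funnels through the FIFO, the dependency structure is visible in one place, which is what rules out deadlock and most races.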
Pipeline: notes
Feedback between tasks is difficult
Best for open-loop tasks:
secondary dynamics, e.g. a pony tail
effects
Suitable for specialized hardware, because task requirements are cleanly divided.
Pipeline: notes
Suitable for restricted memory architectures, as seen in a certain proposed 7th gen console design.
Adds bandwidth and memory-use overhead to SMP systems that would otherwise communicate via the cache.
Work Crew
Component-wise division of the system:
Collision detection
Integration
Rendering
AI/Logic
Audio
IO
Particle system
Fluid simulation
Work Crew: Task Parallelism
Similar to pipeline, but without explicit ordering.
Dependencies are handled on a case-by-case basis:
e.g. particles that do not affect game play might not need to be deterministic, so they can run without explicit synchronization.
Components without interdependencies can run asynchronously, e.g. kinematics and AI.
Work Crew
Suitable for some external processes such as IO, gamepad, sound, sockets.
Suitable for decoupled systems:
particle simulations that do not affect game play
fluid dynamics
visual damage simulation
cloth simulation
Work Crew
Scalability is limited by the number of discrete tasks.
Load balancing is limited by the asymmetric nature of the components and their requirements.
Higher risk of deadlocks
Higher risk of race conditions
Work Crew
May require double buffering of some data to avoid race conditions.
Poor data coherency
Good code coherency
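A minimal sketch of the double-buffering idea; the types and names are illustrative assumptions, not from the talk. The owning component writes one copy while other components read the other, and the copies swap at a synchronization point such as the end of the frame.

    #include <cstddef>

    struct Transform { float pos[3]; float rot[4]; };   // illustrative type
    constexpr size_t kMaxObjects = 1024;

    // Readers in other components use the front buffer while the owning
    // thread writes the back buffer. The swap happens at a frame boundary,
    // when no reader is mid-access, so no per-access locking is needed.
    struct TransformBuffers {
        Transform state[2][kMaxObjects];
        int front = 0;

        const Transform* readable() const { return state[front]; }
        Transform*       writable()       { return state[1 - front]; }
        void swapAtFrameSync()            { front = 1 - front; }
    };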
Forking: Data Parallelism
Perform the same task on multiple objects in parallel.
The thread forks into multiple threads across multiple processors.
All threads repeatedly grab pending objects indiscriminately and execute the task on them.
When finished, the threads combine back into the original thread.
Forking
Fork:
Thread 2: object A
Thread 0: object B
Thread 1: object C
Combine.
Forking
Task assignment can often be done using simple interlocked primitives:
e.g. int i = InterlockedIncrement(&nextTodo);
OpenMP adds compiler support for this via pragmas.
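A sketch of the forked worker loop this implies: each thread atomically claims the next object index until the todo list is exhausted. Only InterlockedIncrement is from the slide; Object, ProcessObject, and WorkerLoop are hypothetical names.

    #include <windows.h>   // InterlockedIncrement

    struct Object { /* per-object simulation state */ };
    void ProcessObject(Object* obj);   // hypothetical per-object task

    // Shared between forked threads; starts at -1 so the first
    // InterlockedIncrement returns index 0.
    volatile LONG nextTodo = -1;

    // Every forked thread runs this same loop, atomically claiming
    // the next pending object until none remain.
    void WorkerLoop(Object* objects, LONG count) {
        for (;;) {
            LONG i = InterlockedIncrement(&nextTodo);
            if (i >= count) break;          // all objects claimed
            ProcessObject(&objects[i]);
        }
    }

    // The OpenMP equivalent, where the compiler does the forking:
    //   #pragma omp parallel for
    //   for (LONG i = 0; i < count; ++i) ProcessObject(&objects[i]);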
Forking
Externally synchronous:
external callers don't have to worry about being thread safe
thread safety requirements are limited to the scope of the code within the forked section.
This is a big deal: good for isolated engine components and middleware.
Forking Example
AI runs in thread 0.
AI calls RayQuery() for a line-of-sight check.
RayQuery forks into 6 threads, computes the ray query, and then returns the results through thread 0.
AI, running in thread 0, uses the result.
Forking
Minimizes latency for a given task
Good data and code coherency
Potentially high synchronization overhead, depending on the coupling.
Highly scalable if you have many tasks with few dependencies.
Ideal for collision detection.
Forking - Batches
Reduces inter-thread communication
Reduces the potential for load balancing.
Improves instruction-level parallelism
Fork:
Thread 0: objects 0..10
Thread 1: objects 11..20
Thread 2: objects 21..30
Combine.
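The batched variant changes the claim granularity: each atomic operation claims a contiguous range of objects rather than one, trading load-balancing flexibility for less contention and unroll-friendly inner loops. A sketch, reusing the hypothetical names from the earlier worker loop:

    // Batched task claiming: one interlocked operation per kBatch objects.
    const LONG kBatch = 10;
    volatile LONG nextBatch = -1;

    void BatchedWorkerLoop(Object* objects, LONG count) {
        for (;;) {
            LONG b = InterlockedIncrement(&nextBatch);
            LONG begin = b * kBatch;
            if (begin >= count) break;      // all batches claimed
            LONG end = (begin + kBatch < count) ? begin + kBatch : count;
            for (LONG i = begin; i < end; ++i)
                ProcessObject(&objects[i]); // straight-line, unroll-friendly
        }
    }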
Our Approach
1) Collision detection: forked
2) AI/Logic: single threaded
2a) Engine calls: forked
2b) Damage effects: contractor queue (all extra threads)
3) Integration: forked
4) Rendering: forked/pipeline
Audio: whatever
Multithreaded Programming is Hard
Solutions that directly expose multiple threads to leaf code are a bad idea.
Sequential, single-threaded, synchronous code is the fastest to write and debug.
In order to meet schedules, most leaf code will stay this way.
Notes on Collision detection
All collision prims are stored in a global search tree:
a bounding k-DOP tree with 8 children per node.
The most common case is when 0 or 1 children need to be traversed.
8 children results in fewer branches.
8 children allows better prefetching.
Collision detection
Each moving object is a task.
Each object is independently queried vs. all other objects in the tree.
Results are output to a global list of contacts and collisions.
To avoid duplicates, moving object vs. moving object collisions are only processed when the active moving object's memory address is the lower of the pair, so each pair is handled exactly once.
Collision detection
Threads pop objects off of the todo list one by one using interlocked access until they are all processed.
Each query takes O(lg N) time.
Very little data contention:
output operations are rare and quick
task allocation uses InterlockedIncrement
On 2 CPUs with many objects I see an 80% performance increase.
Hopefully scalable to many CPUs.
Collision detection
We try to keep collision code and data in the cache as much as possible.
We try to finish collision detection as soon as possible because there are dependencies on it.
All threads attack the problem at once.
Notes on Integration
Integration is the process that steps objects forward in time, in a manner consistent with all contacts and constraints.
Integration
Each batch of coupled objects is a task.
Each batch is solved independently.
Threads pop batches with no dependencies off of the todo list one by one using interlocked access until they are all processed.
Integration
When a dynamic object does not interact with other dynamic objects, its batch contains only that object.
When dynamic objects interact, they are coupled: their solutions are dependent on each other and they must be solved together.
Integration
In some cases, objects can be artificially decoupled.
E.g. assume object A weighs 2000 kg and object B weighs 1 kg. In some cases we can assume that the dynamics of B do not affect the dynamics of A.
In this case, A can first be solved independently, and the resulting dynamics can be fed into the solution for B.
This creates an ordering dependency: A must be solved before B.
Integration
When objects are moved they must be updated in the global collision tree.
Transactions need to be atomic; this is accomplished with locks/critical sections.
Ditto for the VSD tree.
Task allocation is slightly more complex due to dependencies.
Despite all this, we see a 75% performance increase on 2 CPUs with many objects.
Integration
We use a discrete Newton solver, which works okay with our task discretization (i.e. one thread per batch).
If there were hundreds of processors and not as many batches, we would fork the solver itself and use Jacobi iterations.
Transactions
With fine-grained data parallelism, we require many lightweight atomic transactions. For this we use either:
interlocked primitives
critical sections
spin locks
Transactions
Whenever possible, interlocked primitives are used.
Interlocked primitives are simple atomic transactions on single words.
If the transaction is short, a spin lock is used.
Otherwise a critical section is used.
A spin lock is like a critical section, except that it spins rather than sleeps when blocking.
CPUs are difficult
There are some processor-specific nuances to consider when writing your own locks:
Due to out-of-order reads, data access following the acquisition of a lock should be preceded by a load fence or isync. Otherwise the processor might preload old data that changes right before the lock is released.
CPUs are difficult
Due to out-of-order writes, a store fence or lwsync needs to happen before releasing the lock. Otherwise the unlock might be visible to other threads before the data update has taken place, and another thread might claim the lock and then fetch stale data from its cache, all before the real data arrives.
Lock Example
Acquire() looks like:

    while (_InterlockedCompareExchange(&isLocked, 1, 0) != 0) {
        PauseWhileLocked();
    }
    __isync();    // load fence: don't let reads inside the lock start early

Release() looks like:

    __lwsync();   // store fence: flush writes before the unlock is visible
    isLocked = 0;
CPUs are difficult
On hyperthreaded systems it is important that PauseWhileLocked() puts the thread to sleep so that the other thread(s) can use the complete core.
It is also important that you don't constantly bang on memory while trying to take the lock.
If you are going to hold locks for a fair bit of time, a critical section is usually a better choice, as it switches to another thread rather than spinning.
Instruction Parallelism is Good
Most relevant processors are pipelined.
Multiple execution units run in parallel.
No out-of-order execution
High execution latency
Most have SIMD
Intrinsics
Code Scheduling is Good
Instruction-level parallelism requires appropriate code scheduling.
Compiler hand-holding is often necessary to give the compiler more freedom to schedule (see the sketch after this list):
loop unrolling
using temporaries rather than member vars or globals
inline functions
__restrict
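A sketch of what those hints look like together; the function is illustrative, not from the talk, and it assumes n is a multiple of 4:

    // __restrict promises the arrays don't alias, so loads can be hoisted
    // and scheduled early. Unrolling by 4 with independent temporaries
    // (rather than one accumulator, or a member variable) gives the
    // compiler four separate dependency chains to keep the pipeline full.
    float Dot(const float* __restrict a, const float* __restrict b, int n)
    {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;   // independent temporaries
        for (int i = 0; i < n; i += 4) {
            s0 += a[i + 0] * b[i + 0];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }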
Branches are Bad
Data-dependent branches are frequent in collision detection.
Branches on floating point results are often very slow, partly due to the long floating point pipelines.
Whenever possible, use instructions like fsel, vsel, min, max, etc. to eliminate branches.
Branches
E.g. rather than:

    if ((a > b) || (c > d)) { ... }

use:

    if (max(a - b, c - d) > 0) { ... }
Branches and GPUs
On earlier GPU hardware, HLSL will emulate all conditionals using predicated instructions.
Similar techniques are often beneficial on CPUs:

    if (a >= b) c = d;

could be written as:

    c = fsel(a - b, d, c);
Hyper Threading?
What: 1 core, >1 simultaneous threads on >1 simultaneous contexts; same execution units and cache.
Why: execution units are often idle; extra threads can make better utilization of them.
Hyper Threading?
Why are execution units idle?
Pipeline latency:
e.g. if a multiply-add pipe has a throughput of 1 per cycle and a latency of 7 cycles, a 4-element dot product takes 1+6+6+6 = 19 cycles.
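The dependency chain behind that arithmetic, as a sketch: each multiply-add needs the previous accumulator value, so it cannot issue until 7 cycles after its predecessor.

    // Four multiply-adds, each dependent on the previous accumulator:
    float acc = a[0] * b[0];      // issues at cycle 1
    acc += a[1] * b[1];           // waits for acc: issues ~cycle 7
    acc += a[2] * b[2];           // issues ~cycle 13
    acc += a[3] * b[3];           // issues ~cycle 19
    // Only one stage of the pipe does useful work at any given time.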
Pipeline latency is Bad
Most of the time, only one stage of the madd pipeline is active and the others are idle.
If 4 threads are all doing dot products at once, the time taken is:
1+1+1+1+2+1+1+1+1+2+1+1+1+1+2+1+1+1+1 = 22 cycles, which is about 3.5x the throughput.
Somewhat redundant with out-of-order execution and loop unrolling.
Memory is Slow
Cache misses:
when one thread blocks on a cache miss, the other threads can continue running while the cache line is being filled.
Branches are Bad (still)
Data-dependent branches do not mix well with deep pipelines.
If the result at the end of the pipeline is needed to determine what to fetch next at the beginning of the pipeline, you get a big bubble.
This can be filled by the other threads.
Data Locality is Good
It is worth mentioning that cores with multiple threads share L1 caches.
So it is usually best to have all threads of a core working on the same code and data set.
Hyper Threading And Cell
Cell's design sidesteps the motivation for Hyper-Threading in a variety of ways:
no cache; stream data in and out ahead of time
lots of registers for loop unrolling
memory architecture encourages stream processing, which is conducive to loop unrolling
complex programming model makes scheduling optimizations seem relatively easy.
Cache Is important on SMP
Shared L2 cache, some L1 cache sharing.
Big caches are one of the distinct advantages of CPUs vs. GPUs and some consoles.
Physics algorithms for collision detection, integration, and constraint resolution require repeated accesses to individual structures, which is ideal for caches.
Cache Is important on SMP
Shared caches also make inter-thread and inter-core communication less expensive.
These points motivate forked models and data parallelism.
Memory Latency is a big deal
Latency is a real problem. To design a high performance system we need to consider this as a first-order concern.
In this generation we are looking at ~500 cycle penalties on an L2 cache miss, ~50 cycles on an L1 cache miss.
L2 cache is shared between cores. L1 cache is shared between threads.
Memory Latency is a big deal
This also motivates the use of data parallelism.
An extreme form of data parallelism is seen in stream processing.
GPUs are Fast
GPUs are an effective demonstration of parallelism.
The overall system is a pipeline:
vertex shader -> triangle setup -> rasterization -> pixel shader -> frame buffer
Within each stage, the model is forked, with many simultaneous threads.
Physics on GPUs
There has been a fair bit of research on the topic of using GPUs for physical simulations.
A review article is here:
http://download.nvidia.com/developer/GPU_Gems_2/GPU_Gems2_ch29.pdf
Eulerian vs. Lagrangian in a Nutshell
Eulerian approaches discretize DOF in space.
Lagrangian approaches don't.
Lagrangian Example
Each particle has DOF.
Eulerian Example
Each cell has DOF.
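A minimal sketch of the distinction (the types and sizes are illustrative assumptions): Lagrangian degrees of freedom travel with each particle, while Eulerian degrees of freedom sit at fixed grid cells and material flows through them.

    // Lagrangian: DOF move with the material.
    struct Particle { float pos[3]; float vel[3]; };
    Particle particles[10000];       // one entry per particle

    // Eulerian: DOF are fixed in space.
    struct Cell { float vel[3]; float density; };
    Cell grid[64][64][64];           // one entry per grid cell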
Eulerian vs. Lagrangian
Eulerian algorithms are effective for dense, highly coupled interactive systems:
solving Navier-Stokes
computational fluid dynamics
water, smoke, fire
Eulerian vs. Lagrangian
Eulerian interactions are not qualitatively data-dependent, so they can run in parallel without feedback.
Lagrangian interactions often are data-dependent, e.g. collisions.
Eulerian vs. Lagrangian
Physically-Based Simulation on Graphics Hardware:
http://developer.nvidia.com/docs/IO/8230/GDC2003_PhysSimOnGPUs.pdf
Particles
A good example of a Lagrangian technique implemented on stream processors is UberFlow.
UberFlow implements a fully featured particle system on a GPU.
Data-dependent control is difficult; UberFlow uses a data-independent sorting method for collision detection.
Particles
http://www.ati.com/developer/Eurographics/Kipfer04_UberFlow_eghw.pdf
Solvers and Parallelism
Conjugate gradient
Jacobi iterations
Gauss-Seidel
Red-black Gauss-Seidel
Multigrid
The sketch below contrasts why Jacobi forks trivially while plain Gauss-Seidel does not.
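A sketch (not from the talk) on a 1D Laplace problem, x[i] = 0.5 * (x[i-1] + x[i+1] - b[i]): a Jacobi sweep reads only the previous iteration's values, so every cell can be updated in parallel, while Gauss-Seidel updates in place and reads values from the same sweep, which serializes it. Red-black ordering restores parallelism by splitting the cells into two independent sets.

    // Jacobi: each new value depends only on the old array, so all
    // iterations are independent; the loop can be forked across threads
    // (or across the cells of a GPU grid).
    void JacobiSweep(const float* xOld, float* xNew, const float* b, int n) {
        for (int i = 1; i < n - 1; ++i)          // trivially parallel
            xNew[i] = 0.5f * (xOld[i - 1] + xOld[i + 1] - b[i]);
    }

    // Gauss-Seidel: x[i] uses x[i-1] from *this* sweep, creating a serial
    // dependency chain. It converges faster per sweep, but cannot be
    // forked without a reordering such as red-black.
    void GaussSeidelSweep(float* x, const float* b, int n) {
        for (int i = 1; i < n - 1; ++i)
            x[i] = 0.5f * (x[i - 1] + x[i + 1] - b[i]);
    }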
Solvers and Parallelism
Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid
http://www.multires.caltech.edu/pubs/GPUSim.pdf
Programming Languages are not Good
C++, Java, C# are not ideal for fine-grained parallelism.
What's next?
HLSL?
Functional languages? Haskell?
OpenMP?