Physics in Parallel: Simulation on 7th Generation Hardware
David Wu, Pseudo Interactive
Why are we here?
The 7th generation is approaching.
We are no longer next gen
We are all scrambling to adapt to the new stuff, so that we can stay on the bleeding edge,
push the envelope,
and take things to the next level.
What's Next Gen?
Multiple processors: not entirely new, but more than before.
Parallelism: not entirely new, but more than before.
Physics: not entirely new, but more than before.
Take-Away
So much to cover:
General principles
Useful concepts and techniques
Tips
Bad jokes
Goal is to save you time during the transition to... Next Gen.
Format for presentation
Every year we discover new ways to communicate information.
Patterns
A description of a recurrent problem and of the core of possible solutions
Difficult to write
Too pretentious
Inviting criticism
Gems
Valuable bits of information
Too 6th Gen
Blog
Free Form
Continuity not required
Subjective/opinionated is okay
Arbitrary Tangents are okay
Catchy Title need not match article
No quality bar
This sounds 7th Gen to me.
Disclaimer
My information sources range from:
press releases
patents
other blogs on the net
random probabilistic guesses
Much of the information is probably wrong.
Multi-threaded programming
I participated in some in-depth discussions on this topic. After weeks of debate, the conclusion was: multi-threaded programming is hard.
What is 7th Gen Hardware?
Fast
Many parallel processors
Very high peak FLOPS
In-order execution
What is 7th Gen Hardware?
High memory latency
Not enough bandwidth
Moderate clock speed improvements
Not enough memory
CPU-GPU convergence
Hardware usually sucks
Is Multi-Processor Revolutionary?
It is kind of here already:
Hyper-Threading
dual processor
Sega Saturn
Not entirely new, but more than before.
Hardware usually sucks
Hardware advances require years of preparatory hype:
3D accelerators
Online
SIMD
Not with a bang but with a whimper
Hardware usually sucks
The big problem with hardware advances is software.
We don't like to do things that are hard.
If there is a big enough payoff, we do it.
This time there is a big enough payoff.
Types of Parallelism
Task parallelism: render + physics
Data parallelism: collision detection on two objects at a time
Instruction parallelism: multiple elements in a vector
Use all three.
Techniques
Pipeline
Work Crew
Forking
Pipeline: Task Parallelism
Subdivide the problem into discrete tasks.
Solve tasks in parallel, spreading them across multiple processors.
Pipeline: Task Parallelism
Thread 0: collision detection (frame 3)
Thread 1: logic/AI (frame 2)
Thread 2: integration (frame 1)
Next step:
Thread 0: collision detection (frame 4)
Thread 1: logic/AI (frame 3)
Thread 2: integration (frame 2)
Pipeline
Similar to CPU/GPU parallelism
CPU: frame 3, GPU: frame 2
CPU: frame 4, GPU: frame 3
Pipeline: notes
Dependencies are explicit
Communication is explicit, e.g. through a FIFO
Avoids deadlock issues
Avoids most race conditions
Load balancing is not great
Does not reduce latency vs. the single-threaded case
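As a concrete illustration of explicit communication, here is a minimal single-producer/single-consumer ring buffer of the kind one pipeline stage might use to hand work to the next. It is a sketch, not from the talk: the class and names are mine, and it uses C++11 atomics for brevity where 2005-era code would use interlocked primitives and explicit fences (as in the lock example later).

    #include <atomic>
    #include <cstddef>

    // Minimal single-producer/single-consumer FIFO. One pipeline stage
    // pushes, the next stage pops; no locks are needed because each
    // index is written by exactly one thread.
    template <typename T, size_t N>
    class SpscFifo {
        T items[N];
        std::atomic<size_t> head{0};   // advanced only by the consumer
        std::atomic<size_t> tail{0};   // advanced only by the producer
    public:
        bool push(const T& item) {     // called by the upstream stage
            size_t t = tail.load(std::memory_order_relaxed);
            size_t next = (t + 1) % N;
            if (next == head.load(std::memory_order_acquire))
                return false;          // full: producer must wait
            items[t] = item;
            tail.store(next, std::memory_order_release);
            return true;
        }
        bool pop(T& out) {             // called by the downstream stage
            size_t h = head.load(std::memory_order_relaxed);
            if (h == tail.load(std::memory_order_acquire))
                return false;          // empty: consumer must wait
            out = items[h];
            head.store((h + 1) % N, std::memory_order_release);
            return true;
        }
    };

Because all cross-stage traffic funnels through the FIFO, the dependency structure is visible in one place, which is what rules out deadlock and most races.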
Pipeline: notes
Feedback between tasks is difficult
Best for open-loop tasks:
secondary dynamics, e.g. a pony tail
effects
Suitable for specialized hardware, because task requirements are cleanly divided.
Pipeline: notes
Suitable for restricted memory architectures, as seen in a certain proposed 7th gen console design.
Adds bandwidth and memory-use overhead to SMP systems that would otherwise communicate via the cache.
Work Crew
Component-wise division of the system:
Collision detection
Integration
Rendering
AI/Logic
Audio
IO
Particle system
Fluid simulation
Work Crew: Task Parallelism
Similar to pipeline, but without explicit ordering.
Dependencies are handled on a case-by-case basis:
e.g. particles that do not affect game play might not need to be deterministic, so they can run without explicit synchronization.
Components without interdependencies can run asynchronously, e.g. kinematics and AI.
Work Crew
Suitable for some external processes such as IO, gamepad, sound, sockets.
Suitable for decoupled systems:
particle simulations that do not affect game play
fluid dynamics
visual damage simulation
cloth simulation
Work Crew
Scalability is limited by the number of discrete tasks.
Load balancing is limited by the asymmetric nature of the components and their requirements.
Higher risk of deadlocks
Higher risk of race conditions
Work Crew
May require double buffering of some data to avoid race conditions.
Poor data coherency
Good code coherency
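A minimal sketch of the double-buffering idea; the types and names are illustrative assumptions, not from the talk. The owning component writes one copy while other components read the other, and the copies swap at a synchronization point such as the end of the frame.

    #include <cstddef>

    struct Transform { float pos[3]; float rot[4]; };   // illustrative type
    constexpr size_t kMaxObjects = 1024;

    // Readers in other components use the front buffer while the owning
    // thread writes the back buffer. The swap happens at a frame boundary,
    // when no reader is mid-access, so no per-access locking is needed.
    struct TransformBuffers {
        Transform state[2][kMaxObjects];
        int front = 0;

        const Transform* readable() const { return state[front]; }
        Transform*       writable()       { return state[1 - front]; }
        void swapAtFrameSync()            { front = 1 - front; }
    };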
Forking: Data Parallelism
Perform the same task on multiple objects in parallel.
The thread forks into multiple threads across multiple processors.
All threads repeatedly grab pending objects indiscriminately and execute the task on them.
When finished, the threads combine back into the original thread.
Forking
Fork:
Thread 2: object A
Thread 0: object B
Thread 1: object C
Combine.
Forking
Task assignment can often be done using simple interlocked primitives:
e.g. int i = InterlockedIncrement(&nextTodo);
OpenMP adds compiler support for this via pragmas.
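A sketch of the forked worker loop this implies: each thread atomically claims the next object index until the todo list is exhausted. Only InterlockedIncrement is from the slide; Object, ProcessObject, and WorkerLoop are hypothetical names.

    #include <windows.h>   // InterlockedIncrement

    struct Object { /* per-object simulation state */ };
    void ProcessObject(Object* obj);   // hypothetical per-object task

    // Shared between forked threads; starts at -1 so the first
    // InterlockedIncrement returns index 0.
    volatile LONG nextTodo = -1;

    // Every forked thread runs this same loop, atomically claiming
    // the next pending object until none remain.
    void WorkerLoop(Object* objects, LONG count) {
        for (;;) {
            LONG i = InterlockedIncrement(&nextTodo);
            if (i >= count) break;          // all objects claimed
            ProcessObject(&objects[i]);
        }
    }

    // The OpenMP equivalent, where the compiler does the forking:
    //   #pragma omp parallel for
    //   for (LONG i = 0; i < count; ++i) ProcessObject(&objects[i]);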
Forking
Externally synchronous:
external callers don't have to worry about being thread safe
thread safety requirements are limited to the scope of the code within the forked section.
This is a big deal: good for isolated engine components and middleware.
Forking Example
AI runs in thread 0.
AI calls RayQuery() for a line-of-sight check.
RayQuery forks into 6 threads, computes the ray query, and then returns the results through thread 0.
AI, running in thread 0, uses the result.
Forking
Minimizes latency for a given task
Good data and code coherency
Potentially high synchronization overhead, depending on the coupling.
Highly scalable if you have many tasks with few dependencies.
Ideal for collision detection.
Forking - Batches
Reduces inter-thread communication
Reduces the potential for load balancing.
Improves instruction-level parallelism
Fork:
Thread 0: objects 0..10
Thread 1: objects 11..20
Thread 2: objects 21..30
Combine.
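The batched variant changes the claim granularity: each atomic operation claims a contiguous range of objects rather than one, trading load-balancing flexibility for less contention and unroll-friendly inner loops. A sketch, reusing the hypothetical names from the earlier worker loop:

    // Batched task claiming: one interlocked operation per kBatch objects.
    const LONG kBatch = 10;
    volatile LONG nextBatch = -1;

    void BatchedWorkerLoop(Object* objects, LONG count) {
        for (;;) {
            LONG b = InterlockedIncrement(&nextBatch);
            LONG begin = b * kBatch;
            if (begin >= count) break;      // all batches claimed
            LONG end = (begin + kBatch < count) ? begin + kBatch : count;
            for (LONG i = begin; i < end; ++i)
                ProcessObject(&objects[i]); // straight-line, unroll-friendly
        }
    }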
Our Approach
1) Collision detection: forked
2) AI/Logic: single threaded
2a) Engine calls: forked
2b) Damage effects: contractor queue (all extra threads)
3) Integration: forked
4) Rendering: forked/pipeline
Audio: whatever
Multithreaded Programming is Hard
Solutions that directly expose multiple threads to leaf code are a bad idea.
Sequential, single-threaded, synchronous code is the fastest to write and debug.
In order to meet schedules, most leaf code will stay this way.
Notes on Collision detection
All collision prims are stored in a global search tree:
a bounding k-DOP tree with 8 children per node.
The most common case is when 0 or 1 children need to be traversed.
8 children results in fewer branches.
8 children allows better prefetching.
Collision detection
Each moving object is a task.
Each object is independently queried vs. all other objects in the tree.
Results are output to a global list of contacts and collisions.
To avoid duplicates, moving object vs. moving object collisions are only processed when the active moving object's memory address is the lower of the pair, so each pair is handled exactly once.
Collision detection
Threads pop objects off of the todo list one by one using interlocked access until they are all processed.
Each query takes O(lg N) time.
Very little data contention:
output operations are rare and quick
task allocation uses InterlockedIncrement
On 2 CPUs with many objects I see an 80% performance increase.
Hopefully scalable to many CPUs.
Collision detection
We try to keep collision code and data in the cache as much as possible.
We try to finish collision detection as soon as possible because there are dependencies on it.
All threads attack the problem at once.
Notes on Integration
Integration is the process that steps objects forward in time, in a manner consistent with all contacts and constraints.
Integration
Each batch of coupled objects is a task.
Each batch is solved independently.
Threads pop batches with no dependencies off of the todo list one by one using interlocked access until they are all processed.
Integration
When a dynamic object does not interact with other dynamic objects, its batch contains only that object.
When dynamic objects interact, they are coupled: their solutions are dependent on each other and they must be solved together.
Integration
In some cases, objects can be artificially decoupled.
E.g. assume object A weighs 2000 kg and object B weighs 1 kg. In some cases we can assume that the dynamics of B do not affect the dynamics of A.
In this case, A can first be solved independently, and the resulting dynamics can be fed into the solution for B.
This creates an ordering dependency: A must be solved before B.
Integration
When objects are moved they must be updated in the global collision tree.
Transactions need to be atomic; this is accomplished with locks/critical sections.
Ditto for the VSD tree.
Task allocation is slightly more complex due to dependencies.
Despite all this, we see a 75% performance increase on 2 CPUs with many objects.
Integration
We use a discrete Newton solver, which works okay with our task discretization (i.e. one thread per batch).
If there were hundreds of processors and not as many batches, we would fork the solver itself and use Jacobi iterations.
Transactions
With fine-grained data parallelism, we require many lightweight atomic transactions. For this we use either:
interlocked primitives
critical sections
spin locks
Transactions
Whenever possible, interlocked primitives are used.
Interlocked primitives are simple atomic transactions on single words.
If the transaction is short, a spin lock is used.
Otherwise a critical section is used.
A spin lock is like a critical section, except that it spins rather than sleeps when blocking.
CPUs are difficult
There are some processor-specific nuances to consider when writing your own locks:
Due to out-of-order reads, data access following the acquisition of a lock should be preceded by a load fence or isync. Otherwise the processor might preload old data that changes right before the lock is released.
CPUs are difficult
Due to out-of-order writes, a store fence or lwsync needs to happen before releasing the lock. Otherwise the unlock might be visible to other threads before the data update has taken place, and another thread might claim the lock and then fetch stale data from its cache, all before the real data arrives.
Lock Example
Acquire() looks like:

    while (_InterlockedCompareExchange(&isLocked, 1, 0) != 0) {
        PauseWhileLocked();
    }
    __isync();    // load fence: don't let reads inside the lock start early

Release() looks like:

    __lwsync();   // store fence: flush writes before the unlock is visible
    isLocked = 0;
CPUs are difficult
On hyperthreaded systems it is important that PauseWhileLocked() puts the thread to sleep so that the other thread(s) can use the complete core.
It is also important that you don't constantly bang on memory while trying to take the lock.
If you are going to hold locks for a fair bit of time, a critical section is usually a better choice, as it switches to another thread rather than spinning.
Instruction Parallelism is Good
Most relevant processors are pipelined.
Multiple execution units run in parallel.
No out-of-order execution
High execution latency
Most have SIMD
Intrinsics
Code Scheduling is Good
Instruction-level parallelism requires appropriate code scheduling.
Compiler hand-holding is often necessary to give the compiler more freedom to schedule (see the sketch after this list):
loop unrolling
using temporaries rather than member vars or globals
inline functions
__restrict
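A sketch of what those hints look like together; the function is illustrative, not from the talk, and it assumes n is a multiple of 4:

    // __restrict promises the arrays don't alias, so loads can be hoisted
    // and scheduled early. Unrolling by 4 with independent temporaries
    // (rather than one accumulator, or a member variable) gives the
    // compiler four separate dependency chains to keep the pipeline full.
    float Dot(const float* __restrict a, const float* __restrict b, int n)
    {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;   // independent temporaries
        for (int i = 0; i < n; i += 4) {
            s0 += a[i + 0] * b[i + 0];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }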
Branches are Bad
Data-dependent branches are frequent in collision detection.
Branches on floating point results are often very slow, partly due to the long floating point pipelines.
Whenever possible, use instructions like fsel, vsel, min, max, etc. to eliminate branches.
Branches
E.g. rather than:

    if ((a > b) || (c > d)) { ... }

use:

    if (max(a - b, c - d) > 0) { ... }
Branches and GPUs
On earlier GPU hardware, HLSL will emulate all conditionals using predicated instructions.
Similar techniques are often beneficial on CPUs:

    if (a >= b) c = d;

could be written as:

    c = fsel(a - b, d, c);
Hyper Threading?
What: 1 core, >1 simultaneous threads on >1 simultaneous contexts; same execution units and cache.
Why: execution units are often idle; extra threads can make better utilization of them.
Hyper Threading?
Why are execution units idle?
Pipeline latency:
e.g. if a multiply-add pipe has a throughput of 1 per cycle and a latency of 7 cycles, a 4-element dot product takes 1+6+6+6 = 19 cycles.
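The dependency chain behind that arithmetic, as a sketch: each multiply-add needs the previous accumulator value, so it cannot issue until 7 cycles after its predecessor.

    // Four multiply-adds, each dependent on the previous accumulator:
    float acc = a[0] * b[0];      // issues at cycle 1
    acc += a[1] * b[1];           // waits for acc: issues ~cycle 7
    acc += a[2] * b[2];           // issues ~cycle 13
    acc += a[3] * b[3];           // issues ~cycle 19
    // Only one stage of the pipe does useful work at any given time.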
Pipeline latency is Bad
Most of the time, only one stage of the madd pipeline is active and the others are idle.
If 4 threads are all doing dot products at once, the time taken is:
1+1+1+1+2+1+1+1+1+2+1+1+1+1+2+1+1+1+1 = 22 cycles, which is about 3.5x the throughput.
Somewhat redundant with out-of-order execution and loop unrolling.
Memory is Slow
Cache misses:
when one thread blocks on a cache miss, the other threads can continue running while the cache line is being filled.
Branches are Bad (still)
Data-dependent branches do not mix well with deep pipelines.
If the result at the end of the pipeline is needed to determine what to fetch next at the beginning of the pipeline, you get a big bubble.
This can be filled by the other threads.
Data Locality is Good
It is worth mentioning that cores with multiple threads share L1 caches.
So it is usually best to have all threads of a core working on the same code and data set.
Hyper Threading And Cell
Cell's design sidesteps the motivation for Hyper-Threading in a variety of ways:
no cache; stream data in and out ahead of time
lots of registers for loop unrolling
memory architecture encourages stream processing, which is conducive to loop unrolling
complex programming model makes scheduling optimizations seem relatively easy.
Cache Is important on SMP
Shared L2 cache, some L1 cache sharing.
Big caches are one of the distinct advantages of CPUs vs. GPUs and some consoles.
Physics algorithms for collision detection, integration, and constraint resolution require repeated accesses to individual structures, which is ideal for caches.
Cache Is important on SMP
Shared caches also make inter-thread and inter-core communication less expensive.
These points motivate forked models and data parallelism.
Memory Latency is a big deal
Latency is a real problem. To design a high performance system we need to consider this as a first-order concern.
In this generation we are looking at ~500 cycle penalties on an L2 cache miss, ~50 cycles on an L1 cache miss.
L2 cache is shared between cores. L1 cache is shared between threads.
Memory Latency is a big deal
This also motivates the use of data parallelism.
An extreme form of data parallelism is seen in stream processing.
GPUs are Fast
GPUs are an effective demonstration of parallelism.
The overall system is a pipeline:
vertex shader -> triangle setup -> rasterization -> pixel shader -> frame buffer
Within each stage, the model is forked, with many simultaneous threads.
Physics on GPUs
There has been a fair bit of research on the topic of using GPUs for physical simulations.
A review article is here:
http://download.nvidia.com/developer/GPU_Gems_2/GPU_Gems2_ch29.pdf
Eulerian vs. Lagrangian in a Nutshell
Eulerian approaches discretize DOF in space.
Lagrangian approaches don't.
Lagrangian Example
Each particle has DOF.
Eulerian Example
Each cell has DOF.
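A minimal sketch of the distinction (the types and sizes are illustrative assumptions): Lagrangian degrees of freedom travel with each particle, while Eulerian degrees of freedom sit at fixed grid cells and material flows through them.

    // Lagrangian: DOF move with the material.
    struct Particle { float pos[3]; float vel[3]; };
    Particle particles[10000];       // one entry per particle

    // Eulerian: DOF are fixed in space.
    struct Cell { float vel[3]; float density; };
    Cell grid[64][64][64];           // one entry per grid cell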
Eulerian vs. Lagrangian
Eulerian algorithms are effective for dense, highly coupled interactive systems:
solving Navier-Stokes
computational fluid dynamics
water, smoke, fire
Eulerian vs. Lagrangian
Eulerian interactions are not qualitatively data-dependent, so they can run in parallel without feedback.
Lagrangian interactions often are data-dependent, e.g. collisions.
Eulerian vs. Lagrangian
Physically-Based Simulation on Graphics Hardware:
http://developer.nvidia.com/docs/IO/8230/GDC2003_PhysSimOnGPUs.pdf
Particles
A good example of a Lagrangian technique implemented on stream processors is UberFlow.
UberFlow implements a fully featured particle system on a GPU.
Data-dependent control is difficult; UberFlow uses a data-independent sorting method for collision detection.
Particles
http://www.ati.com/developer/Eurographics/Kipfer04_UberFlow_eghw.pdf
Solvers and Parallelism
Conjugate gradient
Jacobi iterations
Gauss-Seidel
Red-black Gauss-Seidel
Multigrid
The sketch below contrasts why Jacobi forks trivially while plain Gauss-Seidel does not.
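A sketch (not from the talk) on a 1D Laplace problem, x[i] = 0.5 * (x[i-1] + x[i+1] - b[i]): a Jacobi sweep reads only the previous iteration's values, so every cell can be updated in parallel, while Gauss-Seidel updates in place and reads values from the same sweep, which serializes it. Red-black ordering restores parallelism by splitting the cells into two independent sets.

    // Jacobi: each new value depends only on the old array, so all
    // iterations are independent; the loop can be forked across threads
    // (or across the cells of a GPU grid).
    void JacobiSweep(const float* xOld, float* xNew, const float* b, int n) {
        for (int i = 1; i < n - 1; ++i)          // trivially parallel
            xNew[i] = 0.5f * (xOld[i - 1] + xOld[i + 1] - b[i]);
    }

    // Gauss-Seidel: x[i] uses x[i-1] from *this* sweep, creating a serial
    // dependency chain. It converges faster per sweep, but cannot be
    // forked without a reordering such as red-black.
    void GaussSeidelSweep(float* x, const float* b, int n) {
        for (int i = 1; i < n - 1; ++i)
            x[i] = 0.5f * (x[i - 1] + x[i + 1] - b[i]);
    }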
Solvers and Parallelism
Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid
http://www.multires.caltech.edu/pubs/GPUSim.pdf
Programming Languages are not Good
C++, Java, C# are not ideal for fine-grained parallelism.
What's next?
HLSL?
Functional languages? Haskell?
OpenMP?