Physics in Parallel GDC 2005


    Physics in Parallel: Simulation

    on 7th Generation Hardware

    David Wu, Pseudo Interactive


    Why are we here?

    The 7th generation is approaching.

    We are no longer next gen

    We are all scrambling to adapt to the new stuff, so that we can stay on the bleeding edge, push the envelope, and take things to the next level.


    What's Next Gen?

    Multiple Processors

    not entirely new, but more than before.

    Parallelism not entirely new, but more than before.

    Physics

    not entirely new, but more than before.


    Take-Away

    So much to cover

    General Principles

    Useful Concepts and Techniques

    Tips

    Bad Jokes

    Goal is to save you time during the transition to…

    Next Gen


    Format for presentation

    Every year we discover new ways to communicate information.


    Patterns

    A description of a recurrent problem and of the core of possible solutions

    Difficult to write

    Too pretentious

    Inviting criticism


    Gems

    Valuable bits of information

    Too 6th Gen


    Blog

    Free Form

    Continuity not required

    Subjective/opinionated is okay

    Arbitrary Tangents are okay

    Catchy Title need not match article

    No quality bar

    This sounds 7th Gen to me.


    Disclaimer

    My information sources range from:

    press releases

    Patents

    other Blogs on the net

    random probabilistic guesses.

    Much of the information is probably wrong.


    Multi-threaded programming

    I participated in some in-depth discussions on this topic; after weeks of debate, the conclusion was: multi-threaded programming is hard.


    What is 7th Gen Hardware?

    Fast

    Many parallel processors

    Very high peak FLOPS

    In-order execution


    What is 7th Gen Hardware?

    High memory latency

    Not enough Bandwidth

    Moderate clock speed improvements

    Not enough Memory

    CPU-GPU convergence


    Hardware usually sucks

    Is Multi-Processor Revolutionary?

    It is kind of here already

    Hyper Threading

    Dual Processor

    Sega Saturn

    not entirely new, but more than before.


    Hardware usually sucks

    Hardware advances require years of preparatory hype:

    3D Accelerators

    Online

    SIMD

    Not with a bang but with a whimper


    Hardware usually sucks

    The big problem with hardware advances is software.

    We don't like to do things that are hard.

    If there is a big enough payoff, we do it. This time there is a big enough payoff.


    Types of Parallelism

    Task Parallelism: Render + physics

    Data Parallelism: collision detection on two objects at a time

    Instruction Parallelism: multiple elements in a vector

    Use all three


    Techniques

    Pipeline

    Work Crew

    Forking


    Pipeline Task Parallelism

    Subdivide problem into discrete tasks

    Solve tasks in parallel, spreading them across multiple processors.


    Pipeline Task Parallelism

    Diagram: frames are staggered across threads.

    Step 1: Thread 0 runs collision detection on Frame 3, Thread 1 runs Logic/AI on Frame 2, Thread 2 runs Integration on Frame 1.

    Step 2: Thread 0 runs collision detection on Frame 4, Thread 1 runs Logic/AI on Frame 3, Thread 2 runs Integration on Frame 2.


    Pipeline

    Similar to CPU/GPU parallelism

    Step 1: CPU works on Frame 3 while the GPU renders Frame 2.

    Step 2: CPU works on Frame 4 while the GPU renders Frame 3.


    Pipeline: notes

    Dependencies explicit

    Communication explicit

    i.e. through a FIFO

    Avoids deadlock issues

    Avoids most race conditions

    Load balancing is not great

    Does not reduce latency vs. the single-threaded case
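
    For illustration only, here is a minimal sketch of two pipeline stages that communicate solely through a blocking FIFO, so the dependency between them is explicit. It uses C++11 threads and a simple mutex/condition-variable queue; FrameData and the stage bodies are hypothetical stand-ins, not the talk's engine code.

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>

    struct FrameData { int frame; };

    template <typename T>
    class Fifo {                                   // explicit communication channel
    public:
        void push(T item) {
            { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(item)); }
            cv_.notify_one();
        }
        T pop() {                                  // blocks until the upstream stage delivers
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [this] { return !q_.empty(); });
            T item = std::move(q_.front());
            q_.pop();
            return item;
        }
    private:
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<T> q_;
    };

    int main() {
        Fifo<FrameData> collisionToIntegration;
        std::thread collision([&] {                // stage 1: collision detection
            for (int f = 0; f < 3; ++f)
                collisionToIntegration.push(FrameData{f});
        });
        std::thread integration([&] {              // stage 2: integration, one frame behind
            for (int f = 0; f < 3; ++f) {
                FrameData fd = collisionToIntegration.pop();   // explicit dependency on stage 1
                (void)fd;                          // Integrate(fd) would go here
            }
        });
        collision.join();
        integration.join();
    }

    Because each stage only ever touches the queue, deadlock and race concerns stay inside the FIFO; the price, as noted above, is an extra stage of latency and limited load balancing.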


    Pipeline: notes

    Feedback between tasks is difficult

    Best for open-loop tasks

    Secondary dynamics, e.g. a pony tail

    Effects

    Suitable for specialized hardware, because task requirements are cleanly divided.


    Pipeline: notes

    Suitable for restricted memory architectures, as seen in a certain proposed 7th gen console design.

    Adds bandwidth overhead and memory use overhead to SMP systems that would otherwise communicate via the cache.


    Work Crew

    Component-wise division of the system

    Collision Detection

    Integration

    Rendering

    AI/Logic

    Audio

    IO

    Particle System

    Fluid Simulation


    Work Crew Task Parallelism

    Similar to pipeline but without explicit ordering.

    Dependencies are handled on a case-by-case basis.

    i.e. particles that do not affect game play might not need to be deterministic, so they can run without explicit synchronization.

    Components without interdependencies can run asynchronously, e.g. kinematics and AI.


    Work Crew

    Suitable for some external processes such as IO, gamepad, sound, sockets.

    Suitable for decoupled systems:

    particle simulations that do not affect game play

    Fluid dynamics

    Visual damage simulation

    Cloth simulation


    Work Crew

    Scalability is limited by the number of discrete tasks

    Load balancing is limited by the asymmetric nature of the components and their requirements.

    Higher risk of deadlocks

    Higher risk of race conditions


    Work Crew

    May require double buffering of some data to avoid race conditions (see the sketch below).

    Poor data coherency

    Good code coherency
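
    A minimal sketch of the double buffering mentioned above, assuming a simple split where one writer fills a back buffer while readers see a stable front buffer; the State type and the swap point are illustrative, not from the talk.

    struct State { float positions[256]; };

    class DoubleBuffered {
    public:
        const State& read() const { return buffers_[front_]; }      // readers: stable snapshot
        State&       write()      { return buffers_[1 - front_]; }  // writer: back buffer
        void         swap()       { front_ = 1 - front_; }          // call once all threads are done for the frame
    private:
        State buffers_[2];
        int   front_ = 0;
    };

    The swap itself still needs a synchronization point (typically end of frame), which is where the race and deadlock risks of the work crew model concentrate.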


    Forking Data Parallelism

    Perform the same task on multiple objects in parallel.

    Thread forks into multiple threads across multiple processors

    All threads repeatedly grab pending objects indiscriminately and execute the task on them

    When finished, threads combine back into the original thread.


    Forking

    Diagram: fork, then Object A on Thread 2, Object B on Thread 0, Object C on Thread 1, then combine.


    Forking

    Task assignment can often be done using simple interlocked primitives:

    i.e. int i = InterlockedIncrement(&nextTodo);

    OpenMP adds compiler support for this via pragmas
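
    To make the idea concrete, here is a sketch of the forked worker loop. The talk's code bumps a shared todo index with InterlockedIncrement; the sketch below uses std::atomic<int>::fetch_add so it is self-contained, and ProcessObject/kNumObjects are made-up names.

    #include <atomic>
    #include <functional>
    #include <thread>
    #include <vector>

    constexpr int kNumObjects = 1000;
    void ProcessObject(int /*index*/) { /* e.g. a collision query */ }

    void WorkerLoop(std::atomic<int>& nextTodo) {
        for (;;) {
            int i = nextTodo.fetch_add(1);         // analogous to InterlockedIncrement(&nextTodo)
            if (i >= kNumObjects) break;           // nothing left to grab
            ProcessObject(i);
        }
    }

    void ForkAndCombine(int numThreads) {
        std::atomic<int> nextTodo{0};
        std::vector<std::thread> threads;
        for (int t = 0; t < numThreads; ++t)
            threads.emplace_back(WorkerLoop, std::ref(nextTodo));
        for (auto& th : threads) th.join();        // combine back into the original thread
    }

    With OpenMP the same fork/combine can be expressed as a #pragma omp parallel for over the object list, and the compiler generates the worker threads.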


    Forking

    Externally Synchronous

    external callers don't have to worry about being thread safe

    thread safety requirements are limited to the scope of the code within the forked section.

    This is a big deal.

    good for isolated engine components and middleware


    Forking Example

    AI running in thread 0

    AI calls RayQuery() for a line-of-sight check

    RayQuery forks into 6 threads, computes the ray query, and then returns the results through thread 0

    AI, running in thread 0, uses the result.
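
    Here is a hypothetical sketch of what such an externally synchronous query could look like: the caller on thread 0 just calls RayQuery() and gets a result back; the fork/join is hidden inside the call. Ray, Hit, Intersect and the six-way split are illustrative, not the talk's actual interfaces.

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    struct Ray { float org[3]; float dir[3]; };
    struct Hit { float t = 1e30f; };
    Hit Intersect(const Ray&, int /*objectId*/) { return Hit{}; }   // stand-in primitive test

    Hit RayQuery(const Ray& ray, const std::vector<int>& objects) {
        const int kThreads = 6;                        // forks into 6 threads, as in the example
        std::vector<Hit> partial(kThreads);
        std::vector<std::thread> workers;
        for (int t = 0; t < kThreads; ++t)
            workers.emplace_back([&, t] {
                for (std::size_t i = t; i < objects.size(); i += kThreads)
                    partial[t].t = std::min(partial[t].t, Intersect(ray, objects[i]).t);
            });
        for (auto& w : workers) w.join();              // combine back into the caller's thread
        Hit best;
        for (const Hit& h : partial) best.t = std::min(best.t, h.t);
        return best;                                   // the caller never sees the worker threads
    }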


    Forking

    Minimizes Latency for a given task

    Good data and code coherency

    Potentially high synchronization overhead, depending on the coupling.

    Highly scalable if you have many tasks with few dependencies

    Ideal for collision detection.


    Forking - Batches

    Reduces inter-thread communication

    Reduces potential for load balancing.

    Improves instruction-level parallelism

    Diagram: fork, then Objects 0..10 on Thread 0, Objects 11..20 on Thread 1, Objects 21..30 on Thread 2, then combine.


    Our Approach

    1) Collision Detection: Forked

    2) AI/Logic: Single threaded

    2a) engine calls: Forked

    2b) Damage Effects: Contractor Queue, all extra threads

    3) Integration: Forked

    4) Rendering: Forked/Pipeline

    Audio: Whatever


    Multithreaded programming is Hard

    Solutions that directly expose multiple threads to leaf code are a bad idea.

    Sequential, single-threaded, synchronous code is the fastest to write and debug

    In order to meet schedules most leaf code will stay this way.


    Notes on Collision detection

    All collision prims are stored in a global search tree.

    Bounding k-DOP tree with 8 children per node.

    The most common case is when 0 or 1 children need to be traversed

    8 children results in fewer branches

    8 children allows better prefetching
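
    The talk does not give the node layout, but an 8-wide bounding-k-DOP node might look roughly like the sketch below (the axis count and field names are assumptions). Keeping all eight child bounds in one node lets a traversal test, and prefetch, them together, while usually descending into at most one of them.

    struct KdopNode {
        float childMin[8][4];   // per-child lower extents along 4 fixed axis directions
        float childMax[8][4];   // per-child upper extents along the same axes
        int   child[8];         // index of each child node, or -1 for an empty slot / leaf
    };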


    Collision detection

    Each moving object is a task

    Each object is independently queried vs. all other objects in the tree.

    Results are output to a global list of contacts and collisions

    To avoid duplicates, moving object vs. moving object collisions are only processed from one side of the pair, chosen by comparing the two objects' memory addresses.


    Collision detection

    Threads pop objects off of the todo list one by one using interlocked access until they are all processed.

    Each query takes O(lg N) time.

    Very little data contention

    output operations are rare and quick

    task allocation uses InterlockedIncrement

    On 2 CPUs with many objects I see an 80% performance increase.

    Hopefully scalable to many CPUs


    Collision detection

    We try to keep collision code and data in the cache as much as possible

    We try to finish collision detection as soon as possible because there are dependencies on it

    All threads attack the problem at once


    Notes on Integration

    The process that steps objects forward in time, in a manner consistent with all contacts and constraints.


    Integration

    Each batch of coupled objects is a task.

    Each batch is solved independently

    Threads pop batches with no dependencies off of the todo list one by one using interlocked access until they are all processed.


    Integration

    When a dynamic object does not interact with other dynamic objects, its batch contains only that object.

    When dynamic objects interact, they are coupled: their solutions are dependent on each other and they must be solved together.


    Integration

    In some cases, objects can be artificially decoupled.

    i.e. assume object A weighs 2000 kg and object B weighs 1 kg. In some cases we can assume that the dynamics of B do not affect the dynamics of A.

    In this case, A can first be solved independently, and the resulting dynamics can be fed into the solution for B.

    This creates an ordering dependency: A must be solved before B.


    Integration

    When objects are moved they must be updated in the global collision tree.

    Transactions need to be atomic; this is accomplished with locks/critical sections

    Ditto for the VSD tree

    Task allocation is slightly more complex due to dependencies

    Despite all this we see a 75% performance increase on 2 CPUs with many objects.


    Integration

    We use a discrete Newton solver, which works okay with our task discretization, i.e. one thread per batch.

    If there were hundreds of processors and not as many batches, we would fork the solver itself and use Jacobi iterations


    Transactions

    With fine-grained data parallelism, we require many lightweight atomic transactions. For this we use either:

    Interlocked primitives

    Critical Sections

    Spin Locks


    Transactions

    Whenever possible, interlocked primitives are used.

    Interlocked primitives are simple atomic transactions on single words

    If the transaction is short, a spin lock is used.

    Otherwise a critical section is used.

    A spin lock is like a critical section, except that it spins rather than sleeps when blocking


    CPUs are difficult

    There are some processor-specific nuances to consider when writing your own locks:

    Due to out-of-order reads, data access following the acquisition of a lock should be preceded by a load fence or isync. Otherwise the processor might preload old data that changes right before the lock is released.


    CPUs are difficult

    Due to out-of-order writes, a store fence or lwsync needs to happen before releasing the lock; otherwise the unlock might be visible to other threads before the data update has taken place, and another thread might claim the lock and then fetch stale data from its cache, all before the real data arrives.


    Lock Example

    Acquire() looks like:

        while (_InterlockedCompareExchange(&isLocked, 1, 0) != 0) {
            PauseWhileLocked();
        }
        __isync();   // load fence: keep reads from being hoisted above the lock acquire

    Release() looks like:

        __lwsync();  // store fence: make protected writes visible before the unlock
        isLocked = 0;


    CPUs are difficult

    On hyperthreaded systems it is important that PauseWhileLocked() puts the thread to sleep so that the other thread(s) can use the complete core.

    It is also important that you don't constantly bang on memory while trying to take the lock.

    If you are going to hold locks for a fair bit of time, a critical section is usually a better choice, as it switches to another thread rather than spinning.
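
    The talk does not show PauseWhileLocked(); a minimal sketch for a hyper-threaded x86/Windows target might look like the code below, where _mm_pause() hands the core's execution units to the sibling hardware thread and SwitchToThread() backs off to the OS. Both calls are assumptions about the target platform, not the talk's code.

    #include <immintrin.h>   // _mm_pause
    #include <windows.h>     // SwitchToThread

    void PauseWhileLocked()
    {
        // Spin briefly without touching the lock word (so we don't bang on memory),
        // using _mm_pause() so the sibling hardware thread gets the execution units.
        for (int i = 0; i < 64; ++i)
            _mm_pause();
        SwitchToThread();   // still locked after the spin: let the OS run another thread
    }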


    Instruction Parallelism is Good

    Most relevant processors are pipelined

    Multiple execution units run in parallel

    No out-of-order execution

    High execution latency

    Most have SIMD

    Intrinsics


    Code Scheduling is Good

    Instruction-level parallelism requires appropriate code scheduling

    Compiler hand-holding is often necessary to give the compiler more freedom to schedule:

    loop unrolling

    Using temporaries rather than member vars or globals

    inline functions

    __restrict
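
    As an illustration (not from the talk), the sketch below combines several of the items above: __restrict promises the two arrays do not alias, local accumulators keep the running sums out of memory, and 4x manual unrolling gives the scheduler independent operations to interleave. The __restrict spelling varies by compiler (e.g. __restrict__ on GCC/Clang).

    // Assumes count is a multiple of 4; purely illustrative.
    float SumProducts(const float* __restrict a,
                      const float* __restrict b,
                      int count)
    {
        float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;   // independent partial sums
        for (int i = 0; i < count; i += 4) {
            s0 += a[i + 0] * b[i + 0];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }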


    Branches are Bad

    Data-dependent branches are frequent in collision detection.

    Branches on floating point results are often very slow, partly due to the long floating point pipelines.

    Whenever possible use instructions like fsel, vsel, min, max, etc. to eliminate branches


    Branches

    i.e. rather than:

        if ((a > b) || (c > d)) { ... }

    use:

        if (max(a - b, c - d) > 0) { ... }


    Branches and GPUS

    On earlier GPU hardware HLSL will emulate all conditionals using predicated instructions.

    Similar techniques are often beneficial on CPUs.

        if (a >= b) c = d;

    could be written as

        c = fsel(a - b, d, c);
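
    fsel is a PowerPC floating-point select; on other targets a branch-free stand-in can be written as below (illustrative, and a good compiler may turn the ternary into a conditional move or select anyway).

    inline float fsel(float test, float a, float b)
    {
        return (test >= 0.0f) ? a : b;   // select a when test >= 0, otherwise b
    }

    // so "if (a >= b) c = d;" becomes: c = fsel(a - b, d, c);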


    Hyper Threading?

    What: 1 core, >1 simultaneous threads on >1 simultaneous contexts.

    Same execution units and cache

    Why: execution units are often idle; extra threads can make better utilization of them.


    Hyper Threading?

    Why are execution units idle?

    Pipeline Latency

    i.e. if a multiply-add pipe has a throughput of 1 per cycle and a latency of 7 cycles, a 4-element dot product takes 1+6+6+6 = 19 cycles.


    Pipeline latency is Bad

    Most of the time, only one stage of the madd pipeline is active and the others are idle.

    If 4 threads are all doing dot products at once the time taken is:

    1+1+1+1+3+1+1+1+3+1+1+1+3+1+1+1 = 23 cycles, which is about 3.3x the single-threaded throughput.

    Somewhat redundant with out-of-order execution and loop unrolling.


    Memory is Slow

    Cache Misses

    When one thread blocks on a cache miss, the other threads can continue running while the cache line is being filled.


    Branches are Bad (still)

    Branches

    Data-dependent branches do not mix well with deep pipelines.

    If the result at the end of the pipeline is needed to determine what to fetch next at the beginning of the pipeline, you get a big bubble.

    This can be filled by the other threads.


    Data Locality is Good

    It is worth mentioning that cores with multiple threads share L1 caches.

    So it is usually best to have all threads of a core working on the same code and data set.


    Hyper Threading And Cell

    Cell's design sidesteps the motivation for HyperThreading in a variety of ways:

    No cache; stream data in and out ahead of time.

    Lots of registers for loop unrolling.

    Memory architecture encourages stream processing, which is conducive to loop unrolling.

    Complex programming model makes scheduling optimizations seem relatively easy.


    Cache Is important on SMP

    Shared L2 cache, some L1 cache sharing (on some consoles)

    Big caches are one of the distinct advantages of CPUs vs. GPUs

    Physics algorithms for collision detection, integration, and constraint resolution require repeated accesses to individual structures, which is ideal for caches.


    Cache Is important on SMP

    Shared caches also make inter-thread and inter-core communication less expensive.

    These points motivate forked models and data parallelism


    Memory Latency is a big deal

    Latency is a real problem. To design a high-performance system we need to consider this as a first-order concern.

    In this generation we are looking at ~500 cycle penalties on an L2 cache miss, ~50 cycles on an L1 cache miss.

    L2 cache is shared between cores. L1 cache is shared between threads.


    Memory Latency is a big deal

    This also motivates the use of data parallelism.

    An extreme form of data parallelism isseen in stream processing.


    GPUs are Fast

    GPUs are an effective demonstration of parallelism.

    The overall system is a pipeline:

    Vertex shader -> triangle setup -> rasterization -> pixel shader -> frame buffer

    Within each stage, the model is forked, with many simultaneous threads


    Physics on GPUS

    There has been a fair bit of research on the topic of using GPUs for physical simulations. A review article is here:

    http://download.nvidia.com/developer/GPU_Gems_2/GPU_Gems2_ch29.pdf


    Eulerian vs. Lagrangian in a Nutshell

    Eulerian approaches discretize DOF in space

    Lagrangian approaches don't.


    Lagrangian Example

    Each Particle has DOF


    Eulerian Example

    Each Cell Has DOF


    Eulerian vs. Lagrangian

    Eulerian algorithms are effective for dense, highly coupled interactive systems:

    Solving Navier-Stokes

    Computational fluid dynamics

    Water, smoke, fire


    Eulerian vs. Lagrangian

    Interactions are not qualitatively data-dependent

    So they can run in parallel without feedback.

    Lagrangian interactions often are data-dependent, i.e. collisions


    Eulerian vs. Lagrangian

    Physically-Based Simulation on Graphics Hardware

    http://developer.nvidia.com/docs/IO/8230/GDC2003_PhysSimOnGPUs.pdf


    Particles

    A good example of a Lagrangian technique implemented on stream processors is UberFlow

    UberFlow implements a fully featured particle system on a GPU

    Data-dependent control is difficult; UberFlow uses a data-independent sorting method for collision detection


    Particles

    http://www.ati.com/developer/Eurographics/Kipfer04_UberFlow_eghw.pdf


    Solvers and Parallelism

    Conjugate Gradient

    Jacobi iterations

    Gauss-Seidel

    Red-Black Gauss-Seidel

    MultiGrid
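
    As a concrete example of why Jacobi iterations parallelize well (the option mentioned earlier for forking the solver itself): every unknown in the new iterate depends only on the previous iterate, so all rows can be updated independently. The dense-matrix sketch below is illustrative only.

    #include <cstddef>
    #include <vector>

    void JacobiIterate(const std::vector<std::vector<float>>& A,   // n x n matrix
                       const std::vector<float>& b,
                       std::vector<float>& x,
                       int iterations)
    {
        const std::size_t n = b.size();
        std::vector<float> xNew(n);
        for (int it = 0; it < iterations; ++it) {
            // Every i below is independent of every other i: this loop can be forked.
            for (std::size_t i = 0; i < n; ++i) {
                float sum = 0.f;
                for (std::size_t j = 0; j < n; ++j)
                    if (j != i) sum += A[i][j] * x[j];             // reads only the old iterate
                xNew[i] = (b[i] - sum) / A[i][i];
            }
            x = xNew;                                              // publish the new iterate
        }
    }

    Gauss-Seidel, by contrast, consumes values updated within the same sweep, which is why red-black ordering is used to recover parallelism.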


    Solvers and Parallelism

    Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid

    http://www.multires.caltech.edu/pubs/GPUSim.pdf


    Programming Languages are not Good

    C++, Java, C# are not ideal for fine-grained parallelism

    What's next: HLSL?

    Functional Languages? Haskell?

    OpenMP?