
The Rocky Road To Tasking

March 21, 2019, Ivo Kabadshow and Laura Morgenstern, Jülich Supercomputing Centre

Member of the Helmholtz Association

HPC ≠ HPC

[Figure: critical walltime of different domains on a time scale from ns to h: CPU cycle, network latency, high-frequency trading, MD, game development, deep learning, astrophysics]

Requirements for MD:
Strong scalability
Performance portability


Our Motivation: Solving the Coulomb Problem for Molecular Dynamics

Task: compute all pairwise interactions of N particles (sketched below)
N-body problem: O(N²) → O(N) with the Fast Multipole Method (FMM)

Why is that an issue?
MD targets < 1 ms runtime per time step
MD runs millions or billions of time steps
Not compute-bound, but synchronization-bound
No libraries (like BLAS) to do the heavy lifting

We might have to look under the hood ... and get our hands dirty.
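To make the starting point concrete, here is a minimal sketch (not taken from the talk; data layout and names are illustrative assumptions) of the direct O(N²) Coulomb summation that the FMM replaces:

    // Direct O(N^2) pairwise Coulomb interaction: every particle interacts
    // with every other particle exactly once. Illustrative sketch only.
    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Particle { double x, y, z, q; };

    // Accumulates the Coulomb potential energy; O(N^2) work per time step,
    // which is why MD codes switch to O(N) methods such as the FMM.
    double direct_coulomb(const std::vector<Particle>& p)
    {
        double energy = 0.0;
        for (std::size_t i = 0; i < p.size(); ++i) {
            for (std::size_t j = i + 1; j < p.size(); ++j) {
                const double dx = p[i].x - p[j].x;
                const double dy = p[i].y - p[j].y;
                const double dz = p[i].z - p[j].z;
                energy += p[i].q * p[j].q / std::sqrt(dx * dx + dy * dy + dz * dz);
            }
        }
        return energy;
    }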

Parallelization Potential

[Figure: approaches placed along two axes, algorithmic complexity (low to high) and parallelization (easy to hard); the classical approach sits at O(N²)]

Classical Approach:
Lots of independent parallelism

Parallelization Potential

[Figure: same two axes; the Fast Multipole Method at O(N) is added next to the classical O(N²) approach]

Fast Multipole Method (FMM):
Many dependent phases
Varying amount of parallelism

Coarse-Grained Parallelization

Phase pipeline with synchronization points between the phases:
Input → P2M → M2M → M2L → L2L → L2P → P2P → Output

Different amount of available loop-level parallelism within each phase
Some phases contain sub-dependencies
Synchronizations might be problematic
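In code, this coarse-grained scheme typically looks like one parallel loop per phase, with a barrier after each loop acting as the synchronization point. The following OpenMP-style sketch is illustrative only; the Box type and the per-box operators are stand-ins, not the authors' code.

    // Coarse-grained FMM parallelization: one parallel loop per phase,
    // a barrier after each phase. Types and operators are placeholders.
    #include <vector>

    struct Box { /* multipole and local expansions would live here */ };

    void p2m(Box&) {}   // stand-ins for the real FMM operators
    void m2m(Box&) {}

    void fmm_step(std::vector<Box>& boxes)
    {
        #pragma omp parallel for             // P2M: particles -> multipole expansions
        for (long b = 0; b < static_cast<long>(boxes.size()); ++b)
            p2m(boxes[b]);
        // implicit barrier: no thread may start M2M before P2M has finished

        #pragma omp parallel for             // M2M: shift multipoles upwards
        for (long b = 0; b < static_cast<long>(boxes.size()); ++b)
            m2m(boxes[b]);
        // ... M2L, L2L, L2P and P2P follow the same pattern, each ending in a barrier
    }

These barriers are exactly what the fine-grained tasking on the following slides tries to avoid.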

FMM Algorithmic Flow: Multipole to Multipole (M2M), shifting multipoles upwards

[Figure: tree of depth d = 0 to 4; the multipole expansions 𝜔 of the child boxes are summed and shifted upwards level by level]

Dataflow with fine-grained dependencies: p2m → m2m → m2l → l2l → l2p, with M2M consuming and producing multipole moments 𝜔
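One common way to express such fine-grained dependencies in a tasking runtime is a per-task dependency counter: an M2M task becomes ready only once all of its children have delivered their expansions. The sketch below uses assumed names and is not the authors' implementation; the same mechanism carries over to M2L and L2L.

    // Fine-grained dependency resolution for the M2M upward pass: a parent's
    // task becomes ready only after all of its children have finished.
    #include <atomic>

    struct M2MTask {
        std::atomic<int> missing_children{0};   // dependency counter
        M2MTask*         parent = nullptr;      // task that depends on this one

        void run()                              // called by a worker thread
        {
            // ... shift this box's multipole expansion omega into the parent box ...
            if (parent &&
                parent->missing_children.fetch_sub(1, std::memory_order_acq_rel) == 1)
            {
                // last child finished: the parent task is now ready to execute
                // and can be pushed into a worker's ready queue
            }
        }
    };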

FMM Algorithmic Flow: Multipole to Local (M2L), translating remote multipoles into local Taylor moments

[Figure: tree of depth d = 0 to 4; multipole expansions of remote boxes are translated into local Taylor moments 𝜇 of the target boxes]

Dataflow with fine-grained dependencies: p2m → m2m → m2l → l2l → l2p, with M2L consuming multipole moments 𝜔 and producing local moments 𝜇

FMM Algorithmic Flow: Local to Local (L2L), shifting Taylor moments downwards

[Figure: tree of depth d = 0 to 4; the local Taylor moments 𝜇 are shifted downwards to the child boxes level by level]

Dataflow with fine-grained dependencies: p2m → m2m → m2l → l2l → l2p, with L2L consuming and producing local moments 𝜇

CPU Tasking Framework

[Figure: framework components and their relations: Core, ThreadingWrapper, Thread, Scheduler, Queue, Dispatcher, TaskFactory, LoadBalancer]
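A rough C++ skeleton of how these components could fit together; the class names are taken from the slide, but every interface below is an assumption made for illustration, not the authors' API.

    // Skeleton of the tasking components named on the slide.
    #include <deque>
    #include <memory>

    struct Task { virtual void execute() = 0; virtual ~Task() = default; };

    class Queue {                        // per-thread queue of ready-to-execute tasks
    public:
        void push(std::unique_ptr<Task> t) { tasks_.push_back(std::move(t)); }
        std::unique_ptr<Task> pop();     // returns nullptr when empty
    private:
        std::deque<std::unique_ptr<Task>> tasks_;
    };

    class TaskFactory {                  // creates successor tasks once dependencies are met
    public:
        void create_successors(const Task& finished, Queue& ready);
    };

    class LoadBalancer {                 // steals work from other threads' queues
    public:
        std::unique_ptr<Task> steal();
    };

    class Dispatcher {                   // drives the per-thread task life-cycle
    public:
        void run(Queue& q, TaskFactory& factory, LoadBalancer& balancer);
    };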

CPU Tasking Framework: Task Life-Cycle per Thread

[Figure: per-thread cycle between Dispatcher, TaskFactory, LoadBalancer and the queues: the Dispatcher executes a task, newly created tasks pass through the TaskFactory, and ready tasks are placed back into the queues]

Tasks can be prioritized by task type
Only ready-to-execute tasks are stored in the queue
Work stealing from other threads is possible
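Building on the skeleton sketched above, the per-thread life-cycle could look roughly like the loop below (assumed semantics, simplified): pop a ready task, execute it, hand newly enabled tasks back to a queue, and fall back to stealing when the local queue runs dry.

    // Simplified per-thread dispatch loop matching the life-cycle on the slide.
    // Uses the Queue, Task, TaskFactory and LoadBalancer skeletons from above.
    #include <atomic>

    void worker_loop(Queue& local, TaskFactory& factory, LoadBalancer& balancer,
                     const std::atomic<bool>& stop)
    {
        while (!stop.load(std::memory_order_acquire)) {
            std::unique_ptr<Task> task = local.pop();   // highest-priority ready task
            if (!task)
                task = balancer.steal();                // work stealing from other threads
            if (!task)
                continue;                               // nothing to do right now
            task->execute();                            // task execution
            factory.create_successors(*task, local);    // newly ready tasks re-enter a queue
        }
    }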

Tasking Without Work Stealing: 103 680 particles on 2× Intel Xeon E5-2680 v3 (2× 12 cores)

[Plot: number of active threads (0 to 24) over runtime (0 to 0.8 s), broken down by FMM phase: P2P, P2M, M2M, M2L, L2L, L2P]

Tasking With Work Stealing: 103 680 particles on 2× Intel Xeon E5-2680 v3 (2× 12 cores)

[Plot: number of active threads (0 to 24) over runtime (0 to 0.8 s), broken down by FMM phase: P2P, P2M, M2M, M2L, L2L, L2P]


GPU Tasking: Goal

Provide the same features as CPU tasking:
Static and dynamic load balancing
Priority queues
Ready-to-execute tasks

GPU Tasking: Uniform Programming Model for CPUs and GPUs

[Figure: step-by-step mapping of the CPU tasking concepts onto their GPU counterparts]

Pitfalls: Performance Portability

For performance portability we consider diverse GPU programming approaches:
OpenCL
CUDA
SYCL

Our requirements:
Strong subset of C++11
Portability between GPU vendors
Tasking features
Maturity

Unsatisfying (Intermediate) Solution
Use CUDA for reasons of performance, specific tasking features and maturity. Take the loss of not being portable out of the box.

Pitfalls: Architectural Differences

Pitfalls for load balancing:
No thread pinning
No cache coherency

Pitfalls for mutual exclusion:
Weak memory consistency
Missing forward progress guarantees

Pitfalls: Load Balancing

No possibility to pin threads to streaming multiprocessors
No direct access to the shared memory of other streaming multiprocessors
Work stealing requires multi-producer multi-consumer queues → which mechanism for mutual exclusion? (see the queue sketch after the mutex listing below)

Pitfalls: Mutual Exclusion

Weak memory consistency
Warp-synchronous deadlocks due to lock-step execution
How to prove thread safety?
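To make the second pitfall concrete: on pre-Volta GPUs all lanes of a warp advance in lock step, so a per-thread spin lock can deadlock when several lanes of one warp contend. The sketch below shows the problematic pattern; it is illustrative and not code from the talk.

    // Naive per-thread spin lock: divergent branches are serialized within a
    // warp, so the losing lanes may keep spinning while the winning lane is
    // never scheduled to run its critical section and release the lock.
    __global__ void naive_locked_increment(int* counter, int* lock)
    {
        while (atomicCAS(lock, 0, 1) != 0) { }  // losing lanes spin here ...
        *counter += 1;                          // ... the winner may never get here,
        atomicExch(lock, 0);                    //     so the lock is never released
    }

Common workarounds are to let only one thread per warp or block take the lock, or to rely on the independent thread scheduling introduced with Volta.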

Pitfalls: Mutex Implementation

    class Mutex
    {
    public:
        __inline__ __device__ void lock()
        {
            // spin until the flag flips from 0 (unlocked) to 1 (locked)
            while (atomicCAS(&mutex, 0, 1) != 0)
                __threadfence();
        }

        __inline__ __device__ void unlock()
        {
            __threadfence();          // make writes from the critical section visible
            atomicExch(&mutex, 0);    // release the lock
        }

    private:
        int mutex = 0;
    };
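As a usage sketch, this is roughly how such a mutex could guard the multi-producer multi-consumer queue that work stealing requires (see the load-balancing pitfalls above). The queue layout, capacity handling and names are assumptions, not the authors' data structure; only one thread per block enters the lock to keep warp divergence out of the critical section.

    // Illustrative multi-producer push into a fixed-size ring buffer,
    // guarded by the Mutex above.
    struct TaskQueue {
        Mutex lock;
        int   tasks[1024];
        int   head = 0;
        int   tail = 0;
    };

    __device__ bool push_task(TaskQueue* q, int task_id)
    {
        bool pushed = false;
        if (threadIdx.x == 0) {              // one thread per block takes the lock
            q->lock.lock();
            const int next = (q->tail + 1) % 1024;
            if (next != q->head) {           // queue not full
                q->tasks[q->tail] = task_id;
                q->tail = next;
                pushed = true;
            }
            q->lock.unlock();
        }
        return pushed;
    }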

Very First Evaluation: Conditions

Tasking with a global queue only
Measurements without workload, to determine the enqueue and dequeue overhead (host-side timing sketched below)
Measurements on a P100 with 56 thread blocks of 1024 threads each
Measurements on a V100 with 80 thread blocks of 1024 threads each
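For orientation, overhead measurements of this kind are commonly timed with CUDA events on the host; the kernel name and queue handle below are hypothetical placeholders, only the event API calls are real.

    // Host-side timing of an empty-task enqueue/dequeue run with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    process_empty_tasks<<<80, 1024>>>(device_queue, num_tasks);  // V100 configuration
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float milliseconds = 0.0f;
    cudaEventElapsedTime(&milliseconds, start, stop);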

First Evaluation: Tasking Overhead on P100 and V100

[Plot: runtime in ms (log scale, 10⁻¹ to 10⁵) over the number of tasks (log scale, 10⁰ to 10⁶), one curve each for P100 and V100]

GPU Tasking: Conclusion

Fine-grained task parallelism pays off on CPUs
Developed a mapping between CPU and GPU concepts
(Partly) overcome the pitfalls:
Lock-based mutual exclusion
Reusability of CPU tasking code
Architectural differences between CPU and GPU
Successfully transferred parts of the CPU tasking to GPUs

Next Steps

Analyze and solve performance issues in dependency resolution
Use a memory pool for dynamic allocations
Implement hierarchical queues
Transfer the priority queue to the GPU
Exploit data parallelism through warps
Consider the use of lock-free data structures
Implement the FMM based on GPU tasking

Thank You to Our Sponsor!

The NVIDIA Tesla V100 and NVIDIA Tesla P100 were provided by NVIDIA.
