
University of Wisconsin

Petascale Tools Workshop, Madison, WI

August 4-7th, 2014

The Hybrid Model: Experiences at Extreme Scale

Benjamin Welton

The Hybrid Model

o TBON + X
o Leveraging TBONs, GPUs, and CPUs in large-scale computation
o The combination creates a new computational model with new challenges
  o Management of multiple devices, local node load balancing, and node-level data management
o Traditional distributed systems problems get worse
  o Cluster-wide load balancing, I/O management, and debugging

MRNet and GPUs

o To get more experience with GPUs at scale, we built a leadership-class application called Mr. Scan
o Mr. Scan is a density-based clustering algorithm utilizing GPUs
  o The first application able to cluster multi-billion-point datasets
  o Uses MRNet as its distribution framework
o However, we ran into some challenges
  o Load balancing, debugging, and I/O inhibited performance and increased development time

Density-based clustering

o Discovers the number of clusters
o Finds oddly-shaped clusters

Goal: Find regions that meet minimum density and spatial distance characteristics.

The two parameters that determine whether a point is in a cluster are Epsilon (Eps) and MinPts. If the number of points within Eps of a point is at least MinPts, the point is a core point. For every discovered point, this same calculation is performed until the cluster is fully expanded (a minimal code sketch follows).
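To make the rule concrete, here is a minimal host-side sketch of the core-point test and cluster expansion (hypothetical names, brute-force neighbor search; not Mr. Scan's actual implementation):

// Minimal DBSCAN-style expansion sketch, for illustration only.
// Compiles as ordinary C++/CUDA host code.
#include <cmath>
#include <queue>
#include <vector>

struct Point { float x, y; };

// All indices within distance eps of pts[i] (brute force, O(n)).
static std::vector<int> neighbors(const std::vector<Point>& pts, int i, float eps) {
    std::vector<int> out;
    for (int j = 0; j < (int)pts.size(); ++j) {
        float dx = pts[j].x - pts[i].x, dy = pts[j].y - pts[i].y;
        if (std::sqrt(dx * dx + dy * dy) <= eps) out.push_back(j);
    }
    return out;
}

// Expand one cluster from a seed point, exactly as described above:
// a point with at least minPts neighbors within eps is a core point,
// and every core point's neighbors join the cluster and are re-examined.
static void expandCluster(const std::vector<Point>& pts, int seed, float eps,
                          int minPts, int clusterId, std::vector<int>& label) {
    std::queue<int> frontier;
    frontier.push(seed);
    while (!frontier.empty()) {
        int p = frontier.front(); frontier.pop();
        std::vector<int> nbrs = neighbors(pts, p, eps);   // includes p itself
        if ((int)nbrs.size() < minPts) continue;          // not a core point
        for (int q : nbrs)
            if (label[q] == 0) { label[q] = clusterId; frontier.push(q); }
    }
}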

Clustering Example (DBSCAN[1])

[Figure: example neighborhood with radius Eps and MinPts = 3]

[1] M. Ester et al., "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," 1996.


MRNet – Multicast / Reduction Network

o General-purpose TBON API
o Network: user-defined topology
o Stream: logical data channel
  o connects to a set of back-ends
  o multicast, gather, and custom reduction
o Packet: collection of data
o Filter: stream data operator
  o synchronization
  o transformation
o Widely adopted by HPC tools
  o CEPBA toolkit
  o Cray ATP & CCDB
  o Open|SpeedShop & CBTF
  o STAT
  o TAU

[Figure: MRNet topology — an FE at the root, internal CP processes, and BE leaves attached to application processes; filters evaluate F(x1,…,xn) as packets flow up the tree]
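As a concrete illustration of what a reduction filter computes at each tree level, here is a minimal stand-alone sketch (hypothetical types; this is not MRNet's actual filter API, which operates on packet vectors and filter state):

// Conceptual reduction "filter": combines the packets arriving from a
// node's children into a single upward packet.
#include <vector>

struct Packet { std::vector<float> values; };

// Reduce N child packets to one parent packet (here: element-wise sum).
// Assumes all children carry equal-length vectors. Because the output is
// one packet regardless of fan-in, output size is non-increasing at each
// level — the property that makes a TBON scale.
Packet reduceFilter(const std::vector<Packet>& children) {
    Packet out;
    if (children.empty()) return out;
    out.values.assign(children[0].values.size(), 0.0f);
    for (const Packet& c : children)
        for (size_t i = 0; i < out.values.size(); ++i)
            out.values[i] += c.values[i];
    return out;
}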

Computation in a Tree-Based Overlay Network

o Adjustable for load balance
o Output sizes MUST be constant or decreasing at each level for scalability
o MRNet provides this process structure

[Figure: tree with 10MB of data per BE; the total size of packets at each level above stays ≤10MB]

MRNet Hybrid Computation

o A hybrid computation includes GPU processing elements alongside traditional CPU elements
o In MRNet, GPUs were included as filters
o A combination of CPU and GPU filters was used in MRNet (a GPU-offload sketch follows the figure below)

[Figure: MRNet tree with filters F(x1,…,xn) at multiple levels; some filters execute on CPUs and others on GPUs]
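A minimal sketch of what a GPU-backed filter might do (hypothetical, self-contained CUDA; not Mr. Scan's code): the host side copies the combined child data to the device, launches a kernel, and returns the reduced result.

// Hypothetical GPU-backed filter body: element-wise sum reduction of the
// children's vectors, offloaded to the device.
#include <cuda_runtime.h>
#include <vector>

__global__ void sumColumns(const float* in, float* out, int nChildren, int width) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= width) return;
    float acc = 0.0f;
    for (int c = 0; c < nChildren; ++c)   // sum the i-th element of every child
        acc += in[c * width + i];
    out[i] = acc;
}

// Host wrapper: flat 'in' holds nChildren vectors of length 'width'.
std::vector<float> gpuReduceFilter(const std::vector<float>& in,
                                   int nChildren, int width) {
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, in.size() * sizeof(float));
    cudaMalloc(&dOut, width * sizeof(float));
    cudaMemcpy(dIn, in.data(), in.size() * sizeof(float), cudaMemcpyHostToDevice);
    sumColumns<<<(width + 255) / 256, 256>>>(dIn, dOut, nChildren, width);
    std::vector<float> out(width);
    cudaMemcpy(out.data(), dOut, width * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dOut);
    return out;
}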

Intro to Mr. Scan


Mr. Scan Phases:

o Partition: Distributed
o DBSCAN: GPU (on the BEs)
o Merge: CPU (x #levels)
o Sweep: CPU (x #levels)

[Figure: the phases mapped onto the MRNet tree — BEs read from the file system (FS) and run DBSCAN, Merge runs at each internal level up to the FE, and Sweep propagates results back down the tree]

Mr. Scan SC 2013 Performance

Clustering 6.5 billion points, total time: 18.2 minutes

o Partitioner — FS read: 224 secs; FS write: 489 secs
o MRNet startup: 130 secs
o DBSCAN phase — FS read: 24 secs; DBSCAN: 168 secs
o Merge: 6 secs
o Sweep: 4 secs
o Write output: 19 secs

Load Balancing Issue

o In initial testing, imbalances in load between nodes were a significant limiting factor in performance
  o Increased the run time of Mr. Scan's computation phase by a factor of 10
  o Input point counts did not correlate with run times for a specific node
  o Added an additional 25 minutes to the computation
o Resolving the load balance problem required numerous Mr. Scan-specific optimizations
  o Algorithmic tricks like Dense Box and heuristics in data partitioning

Partition Phase

o Goal: produce partitions that are computationally equivalent under DBSCAN
o Algorithm:
  o Form initial partitions
  o Add shadow regions (sketched below)
  o Rebalance
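Shadow regions replicate points near a partition boundary into the neighboring partition, so each partition can be clustered independently and stitched back together in the merge phase. A minimal sketch for a 1-D split (hypothetical; Mr. Scan's actual partitioner is spatial and distributed):

// Hypothetical shadow-region construction for a split at x == cut:
// points within eps of the boundary are copied into both partitions.
#include <vector>

struct Pt { float x, y; };

void splitWithShadow(const std::vector<Pt>& pts, float cut, float eps,
                     std::vector<Pt>& left, std::vector<Pt>& right) {
    for (const Pt& p : pts) {
        if (p.x <= cut + eps) left.push_back(p);    // includes right-side shadow
        if (p.x >  cut - eps) right.push_back(p);   // includes left-side shadow
    }
}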

Distributed Partitioner


GPU DBSCAN Computation


DBSCAN computation is performed in two distinct steps on the leaf nodes of the tree

Step 1: Detect core points
Step 2: Expand core points and color

[Figure: both steps launched across ~900 thread blocks of 512 threads (T1…T512) on the GPU]
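A minimal CUDA sketch of Step 1 (hypothetical names, brute-force neighbor counting; the real implementation presumably uses a spatial index rather than an O(n) scan per point):

// One thread per point: count neighbors within eps and mark core points.
#include <cuda_runtime.h>

__global__ void detectCorePoints(const float2* pts, int n, float eps2,
                                 int minPts, unsigned char* isCore) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float2 p = pts[i];
    int count = 0;
    for (int j = 0; j < n; ++j) {                 // brute-force O(n) scan
        float dx = pts[j].x - p.x, dy = pts[j].y - p.y;
        if (dx * dx + dy * dy <= eps2) ++count;   // eps2 == eps * eps
    }
    isCore[i] = (count >= minPts);                // count includes pts[i] itself
}

// Launch, matching the figure's shape, e.g. blocks of 512 threads:
// detectCorePoints<<<(n + 511) / 512, 512>>>(dPts, n, eps * eps, minPts, dCore);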

The DBSCAN Density Problem

o Imbalances in point density can cause huge differences in run times between thread groups inside a GPU (10-15x variance in time)
  o The issue is caused by the lookup operation for a point's neighbors in the DBSCAN point expansion phase

[Figure: two Eps-neighborhoods of differing density] Higher density results in a higher neighbor count, which increases the number of comparison operations.

Dense Box

o Dense Box eliminates the need to perform neighbor lookups on points in dense regions by labeling points as members of a cluster before DBSCAN is run (a binning sketch follows the numbered steps below).

1. Start with a region.
2. Divide the region into areas of size ε/(2√2) for dense-area detection*.
3. For each area whose point count is >= MinPts, mark its points as members of a cluster. Do not expand these points.

* ε/(2√2) is chosen because it guarantees all points inside an area are within distance ε of each other.
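A minimal CUDA sketch of the dense-area detection step (hypothetical; assumes points have already been normalized into a square region):

// Count points per grid cell of side eps / (2 * sqrt(2)); any cell reaching
// minPts is "dense" and its points can be pre-labeled without expansion.
#include <cuda_runtime.h>

__global__ void countCells(const float2* pts, int n, float side,
                           int cellsPerRow, int* cellCount) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int cx = (int)(pts[i].x / side);              // cell coordinates
    int cy = (int)(pts[i].y / side);
    atomicAdd(&cellCount[cy * cellsPerRow + cx], 1);
}

// Host side (sketch): side = eps / (2.0f * sqrtf(2.0f));
// after the kernel, every cell with cellCount[c] >= minPts is a dense box,
// and a second pass assigns those points a cluster label directly.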

Challenges of the Hybrid Model

o Debugging
  o Difficult to detect incorrect output without writing application-specific verification tools
o Load Balancing
  o GPUs increased the difficulty of balancing load, both cluster-wide and on a local node (due to large variance in run times with identically sized input)
  o Application-specific solution required for load balancing
o Existing distributed framework components stressed
  o The increased computational performance of GPUs stresses other, non-accelerated components of the system (such as I/O)

Debugging Mr. Scan

o Result verification complicated due to:
  o CUDA warp scheduling not being deterministic
  o Packet reception order not being deterministic in MRNet
o Both issues altered output slightly
  o DBSCAN non-core-point cluster selection is order dependent
  o Output cluster IDs would vary based on packet processing order in MRNet
o Easy verification of output, such as a bitwise comparison against a known correct output, is not possible

Debugging Mr. Scan

o We had to write verification tools to run after each run to ensure output was still correct
  o Very costly in terms of both programmer time (to write the tools) and wall-clock runtime
o Worst of all, the tools used for verification are DBSCAN-specific
  o Generic solutions are badly needed for increased productivity (one possible permutation-invariant check is sketched below)
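For example, cluster output can be compared up to a renaming of cluster IDs, sidestepping the nondeterministic ID assignment. A minimal host-side sketch (hypothetical; Mr. Scan's actual verification tools are more involved):

// Check that two cluster labelings are identical up to a renaming of IDs:
// build a bijection as we scan, and fail on any inconsistency.
#include <unordered_map>
#include <vector>

bool equivalentUpToRelabeling(const std::vector<int>& a, const std::vector<int>& b) {
    if (a.size() != b.size()) return false;
    std::unordered_map<int, int> fwd, rev;        // a-id -> b-id and back
    for (size_t i = 0; i < a.size(); ++i) {
        auto f = fwd.find(a[i]);
        if (f == fwd.end()) {
            if (rev.count(b[i])) return false;    // b-id already claimed
            fwd[a[i]] = b[i];
            rev[b[i]] = a[i];
        } else if (f->second != b[i]) {
            return false;                         // mapping inconsistent
        }
    }
    return true;
}

Note that a real tool must also tolerate the legitimate order-dependent differences in non-core-point assignment mentioned above, not just relabeled IDs.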

Load Balancing

o Load balancing between nodes proved to be a significant and serious issue
  o Identical input sizes would result in vastly differing run times (by up to 10x)
  o Without the load balancing work, Mr. Scan would not have scaled
o Application-specific GPU load balancing system implemented
  o No existing frameworks could help with balancing GPU applications

Other components

o GPU use revealed flaws that were hidden in the original non-GPU implementation of Mr. Scan
o I/O, start-up, and other components of the system impacted performance greatly
  o Accounting for a majority of the run time of Mr. Scan
o Solutions to these issues that scaled for a CPU-based application might not scale for a GPU-based application

Work in progress

o We are currently looking at ways to perform load balancing/sharing in GPU applications in a generic way
o We are looking at methods that do not change the distributed models used by applications and require no direct vendor support
  o Getting users or hardware vendors to make massive changes to their applications/hardware is hard

Questions?


Characteristics of an ideal load balancing framework

o Require as few changes to existing applications as possible
  o We cannot expect application developers to give up MPI, MapReduce, TBON, or other computational frameworks to solve load imbalance
o Take advantage of the fine-grained computation decomposition we see with GPUs/accelerators
  o Coarse-grained solutions (such as moving entire kernel invocations/processes) limit options for balancing load
o Needs to play by the hardware vendors' "rules"
  o We cannot rely on support from hardware vendors for a distributed framework

An Idea: Automating Load Balancing

o Have a layer above the GPU but below the user application framework to manage and load balance GPU computations across nodes
o The GPU Manager would execute user application code on the device while attempting to share load with idle GPUs

[Figure: software stack — User Application (MPI/MRNet/MapReduce/etc.) above the GPU Manager, above the GPU device]

An Idea: A Load Balancing Service

1. The application supplies CUDA functions (PTX, CUBIN); the manager sends the program to the device and saves a pointer to each function.
2. Argument data for the functions is passed to the manager and forwarded to the device.
3. The application asks to run a function binary, supplying a data stride and a number of compute blocks.
4. Compute blocks (function pointer + data offset) are created and added to a queue.
5. A persistent kernel on the GPU pulls blocks off this queue and executes the user's function on the SIMD units.
6. At the completion of all queued blocks, the results are returned.
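A minimal sketch of the persistent-kernel queue idea (hypothetical and greatly simplified; a fixed device function stands in for the dynamically loaded user function):

// Persistent kernel: each thread block repeatedly claims a work item
// (a data offset + count) from a global queue until the queue drains.
#include <cuda_runtime.h>

struct WorkItem { int dataOffset; int count; };

__device__ float userTask(float x) { return x * x; }  // stand-in for user code

__global__ void persistentKernel(const WorkItem* queue, int tail, int* head,
                                 const float* in, float* out) {
    __shared__ int idx;
    while (true) {
        if (threadIdx.x == 0) idx = atomicAdd(head, 1);  // block claims next item
        __syncthreads();
        if (idx >= tail) return;                         // queue is drained
        WorkItem w = queue[idx];
        for (int i = threadIdx.x; i < w.count; i += blockDim.x)
            out[w.dataOffset + i] = userTask(in[w.dataOffset + i]);
        __syncthreads();   // all threads done with idx before it is reused
    }
}

// Launch a modest resident grid; blocks stay alive and keep pulling work:
// persistentKernel<<<numSMs, 256>>>(dQueue, numItems, dHead, dIn, dOut);

A real manager would keep the kernel resident and let the host append new items, including items migrated from other nodes, to the queue; that is what makes the idle-GPU work sharing on the next slide possible.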

An Idea: A Load Balancing Service

o On detection of an idle GPU, load is shared between nodes.

[Figure: work sharing between two GPU queues]

1. The user binary is transferred to the new host and sent to its GPU.
2. Data for the compute blocks is copied to the GPU.
3. Compute blocks (function pointer + data offset) are moved to the new queue, updating their data offsets.
4. Each moved block is executed and its result is returned to the originating node.