A Multidimensional Distributed Array Abstraction for PGAS (HPCC'16)

A MultidimensionalDistributed Array Abstractionfor PGAS

www.dash-project.org

Tobias [email protected]ät München, MNM-Team

| 2DASH

DASH - Overview

DASH is a C++ template library that offers– distributed data structures and parallel algorithms

– a complete PGAS (part. global address space) programming system without a custom (pre-)compiler

PGAS Terminology – SHMEM Analogy

Unit: The individual participants in a DASH program, usually full OS processes.

Private

Shared

Unit 0 Unit 1 Unit N-1

int b;int c;

dash::Array a(1000);

int a;…

dash::Shared s;

10..190..9 ..999

Shared data: managed by DASHin a virtual global address space

Private data: managed by regular C/C++ mechanisms

| 3DASH

DASH Project Structure

Phase I (2013-2015) Phase II (2016-2018)

LMU MunichProject lead,

C++ template library

Project lead, C++ template library,

data dock

TU DresdenLibraries and

interfaces, toolsSmart data

structures, resilience

HLRS Stuttgart DART runtime DART runtime

KIT Karlsruhe Application studies

IHR StuttgartSmart deployment, Application studies

| 4DASH

DASH - Partitioned Global Address Space

Data affinity– data has well-defined owner but can be accessed by any unit

– data locality important for performance

– support for the owner computes execution model

DASH:– unified access to

local and remotedata in globalmemory space

| 5DASH

DASH - Partitioned Global Address Space

Data affinity– data has well-defined owner but can be accessed by any unit

– data locality important for performance

– support for the owner computes execution model

DASH:– unified access to

local and remotedata in globalmemory space

– and explicit viewson local memoryspace

| 6DASH

DASH Distributed Data Structures Overview

Container Description Data distribution

Array<T> 1D Array static, configurable

NArray<T, N> N-dim. Array static, configurable

Shared<T> Shared scalar fixed (at 0)

Directory(*)<T> Variable-size,locally indexedarray

manual,load-balanced

List<T> Variable-sizelinked list

dynamic,load-balanced

Map<T> Variable-sizeassociative map

dynamic, balancedby hash function

(*) Under construction

| 7DASH

DASH Distributed Data Structures Overview

Container Description Data distribution

Array<T> 1D Array static, configurable

NArray<T, N> N-dim. Array static, configurable

Shared<T> Shared scalar fixed (at 0)

Directory(*)<T> Variable-size,locally indexedarray

manual,load-balanced

List<T> Variable-sizelinked list

dynamic,load-balanced

Map<T> Variable-sizeassociative map

dynamic, balancedby hash function

(*) Under construction

| 8DASH

Multidimensional Data Distribution (1)

dash::Pattern<N> specifies N-dim data distribution– Blocked, cyclic, and block-cyclic in multiple dimensions

Pattern<2>(20, 15)

(BLOCKED,NONE)

(NONE,BLOCKCYCLIC(2))

(BLOCKED,BLOCKCYCLIC(3))

Extent in first and second dimension

Distribution in first and second dimension

| 9DASH

Multidimensional Data Distribution (2)

Example: tiled and tile-shifted data distribution

(TILE(4), TILE(3))

ShiftTilePattern<2>(32, 24)TilePattern<2, COL_MAJOR>(20, 15)

(TILE(5), TILE(5))

| 10DASH

Multidimensional Views

Lightweight Multidimensional Views

// 8x8 2D arraydash::NArray<int, 2> mat(8,8);

// linear access using iteratorsdash::distance(mat.begin(), mat.end()) == 64

// create 2x5 region viewauto reg = matrix.cols(2,5).rows(3,2);

// region can be used just like 2D arraycout << reg[1][2] << endl; // ‘7’

dash::distance(reg.begin(), reg.end()) == 10

| 11DASH


Lightweight Multidimensional Views– Local and block views

dash::NArray<int, 2> mat(80000,17000);

// view to block element rangeauto block = matrix.block(3,2);// use view as Cartesian space:auto elem = block[30][20];

dash::NArray<int, 2> mat(80000,17000);

if (dash::myid() == 1) {// view to local element rangeauto local_elems = matrix.local;// use view as sequential range:for (auto elem : local_elems) { … }

}

| 12DASH


Multidimensional Iterator Ranges

– Global iterators provide access to theunderlying view of their index space

– DASH global iterators on multi-dimensional regions canstill be passed to standard library algorithms

auto r = mat.sub(0, { 4,7 }) // rows.sub(1, { 2,7 }); // cols

auto r_view = r.begin().view();r_view.extents() == { 3,5 }r_view.offsets() == { 4,2 }

// DASH algorithms use n-dim. view: dash::summa(r.begin(), r.end(), …);// multidimensional iterators still// are sequential:std::for_each(r.begin(), r.end(), …);

| 13DASH


Multidimensional Iterator Ranges

– Global iterators provide access to thedata distribution pattern of their iteration space

auto r = mat.sub(0, { 4,7 }) // rows.sub(1, { 2,7 }); // cols

auto r_pattern = r.begin().pattern();r_pattern.blocksize() == 4r_pattern.blocks() == 16r_pattern.blocks_at(dash::myid()) == 4

| 14DASH

DASH Algorithms

Growing number of DASH equivalents to STL algorithms:

Examples of STL algorithms ported to DASH, also workfor multidimensional ranges:

dash::GlobIter<T> dash::fill(GlobIter<T> begin, GlobIter<T> end,T val);

- dash::fill range[i] <- val- dash::generate range[i] <- func()- dash::for_each func(range[i])- dash::transform range[i] = func(range2[i])- dash::accumulate sum(range[i]) (0<=i<=n-1)- dash::min_element min(range[i]) (0<=i<=n-1)

- dash::copy range[i] <- range2[i]

| 15DASH

DASH Algorithms

Growing number of DASH equivalents to STL algorithms:

Examples of STL algorithms ported to DASH, also workfor multidimensional ranges:

dash::GlobIter<T> dash::fill(GlobIter<T> begin, GlobIter<T> end,T val);

- dash::fill range[i] <- val- dash::generate range[i] <- func()- dash::for_each func(range[i])- dash::transform range[i] = func(range2[i])- dash::accumulate sum(range[i]) (0<=i<=n-1)- dash::min_element min(range[i]) (0<=i<=n-1)

- dash::copy range[i] <- range2[i]

| 16DASH

Asynchronous Copying for Latency Hiding

Asynchronous Operations

– Async. algorithm interface:dash::copy_async()

– Launch policy:dash::launch::async (in upcoming DASH release 0.3.0)

std::vector<int> lcopy(block.size());// starts async. copy of global range to local memory// … via algorithm interface:auto fut = dash::copy_async(block.begin(), block.end(),

lcopy.begin());// … or via launch policy:auto fut = dash::copy(dash::launch::async,

block.begin(), block.end(),lcopy.begin());

overlapping computation();auto copy_end = fut.get(); // blocks until copy received

| 17DASH

Block matrix-matrix multiplication with prefetching

while(!done) {blk_a = matrixA.local.block(k); …blk_b = matrixB.local.block(k); …// prefetchauto get_a = dash::copy_async(blk_a.begin(), blk_a.end(), lblk_a_get);auto get_b = dash::copy_async(blk_b.begin(), blk_b.end(), lblk_b_get);// local DGEMMdash::multiply(lblk_a_comp, lblk_b_comp, lblk_c_comp); // wait for transfer to finishget_a.wait(); get_b.wait();// swap buffersswap(lblk_a_get, lblk_a_comp); swap(lblk_b_get, lblk_b_comp);

}

Case Study: S(R)UMMA Algorithm

| 18DASH



}


DISCLAIMER

This code is simplified for brevity.Get the real source code here:

https://github.com/dash-project/dash/blob/development/dash/include/dash/algorithm/SUMMA.h

https://github.com/dash-project/dash/blob/development/dash/include/dash/algorithm/SUMMA.h

| 19DASH



}


Schedules block transmissions tominimize network congestion

| 20DASH



}


Local submatrix multiplication using DGEMM from serial Intel MKL

Schedules block transmissions tominimize network congestion

| 21DASH

DASH vs. DGEMM: Intel MKL, PLASMA

| 22DASH

DASH vs. PDGEMM: ScaLAPACK

Good.

But acing singular benchmarksis not the actual point.

Most important: the NArray concept allows intuitive

design of efficient algorithms we achieved portable, robust

efficiency on different hardwareand system environments

| 23DASH

Summary

NArray Concept– Views simplify design of efficient algorithms

– First-class support for locality-based operations

– Complies to existing C++ standard library concepts

DASH algorithms on n-dim. ranges– SUMMA case study: straight-forward, compact

implementation

– Leveraged portable efficiency of Intel MKL

– Beats performance in (P)DGEMM compared toIntel MKL, PLASMA, ScaLAPACK

– Robust scalability in a variety of node-level and highly distributed benchmark scenarios

http://www.dash-project.org/http://github.com/dash-project/

http://www.dash-project.org/

http://github.com/dash-project/

| 24DASH

Acknowledgements

DASH on GitHub:https://github.com/dash-project/dash/

Funding

The DASH Team

T. Fuchs (LMU), R. Kowalewski (LMU), D. Hünich (TUD), A. Knüpfer (TUD), J. Gracia (HLRS), C. Glass (HLRS), H. Zhou (HLRS), K. Idrees (HLRS), J. Schuchart (HLRS), F. Mößbauer (LMU), K. Fürlinger (LMU)

https://github.com/dash-project/dash/

Software

A Multidimensional Distributed Array Abstraction for PGAS (HPCC'16)