A Multidimensional Distributed Array Abstraction for PGAS (HPCC'16)

Tobias Fuchs
Ludwig-Maximilians-Universität München, MNM-Team
www.dash-project.org

DASH - Overview

DASH is a C++ template library that offers
– distributed data structures and parallel algorithms
– a complete PGAS (partitioned global address space) programming system without a custom (pre-)compiler

PGAS Terminology – SHMEM Analogy

Unit: The individual participants in a DASH program, usually full OS processes.

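[Figure: units 0 to N-1. Each unit holds private data (int a; int b; int c; …). The shared space holds dash::Array a(1000), with elements 0..9 on unit 0, 10..19 on unit 1, and so on up to ..999, and dash::Shared s.]

Shared data: managed by DASH in a virtual global address space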

Private data: managed by regular C/C++ mechanisms
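The layout of dash::Array a(1000) shown above, with ten consecutive elements per unit, corresponds to a blocked distribution. As a minimal sketch in plain C++ (it does not use the DASH API), the mapping from a global index to its owning unit and local offset could look as follows. `GlobalPos` and `locate_blocked` are hypothetical names introduced for illustration, and the unit count of 100 in the usage note is an assumption derived from 1000 elements at 10 per unit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type, not part of DASH.
struct GlobalPos {
    std::size_t unit;   // unit that owns the element
    std::size_t local;  // offset in that unit's local memory
};

// Blocked distribution: every unit owns one contiguous block of
// ceil(n / nunits) elements (the last block may be shorter).
inline GlobalPos locate_blocked(std::size_t gidx, std::size_t n,
                                std::size_t nunits) {
    assert(nunits > 0 && gidx < n);
    const std::size_t blocksize = (n + nunits - 1) / nunits;
    return GlobalPos{gidx / blocksize, gidx % blocksize};
}
```

With 1000 elements on 100 units, `locate_blocked(19, 1000, 100)` yields unit 1 and local offset 9, which matches the range 10..19 drawn for unit 1.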

DASH Project Structure

Partner          Phase I (2013-2015)                  Phase II (2016-2018)
LMU Munich       Project lead, C++ template library   Project lead, C++ template library, data dock
TU Dresden       Libraries and interfaces, tools      Smart data structures, resilience
HLRS Stuttgart   DART runtime                         DART runtime
KIT Karlsruhe    Application studies                  -
IHR Stuttgart    -                                    Smart deployment, application studies

DASH - Partitioned Global Address Space

Data affinity
– data has well-defined owner but can be accessed by any unit
– data locality important for performance
– support for the owner computes execution model

DASH:
– unified access to local and remote data in global memory space

– and explicit views on local memory space

DASH Distributed Data Structures Overview

Container         Description                            Data distribution
Array<T>          1D Array                               static, configurable
NArray<T, N>      N-dim. Array                           static, configurable
Shared<T>         Shared scalar                          fixed (at 0)
Directory(*)<T>   Variable-size, locally indexed array   manual, load-balanced
List<T>           Variable-size linked list              dynamic, load-balanced
Map<T>            Variable-size associative map          dynamic, balanced by hash function

(*) Under construction

Multidimensional Data Distribution (1)

dash::Pattern<N> specifies N-dim data distribution
– Blocked, cyclic, and block-cyclic in multiple dimensions

Pattern<2>(20, 15)          Extent in first and second dimension

(BLOCKED, NONE)             Distribution in first and second dimension
(NONE, BLOCKCYCLIC(2))
(BLOCKED, BLOCKCYCLIC(3))
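To make the blocked, cyclic, and block-cyclic specifiers concrete, the following sketch computes which unit owns an index along a single dimension. It is plain C++ written for illustration and is not the DASH implementation: `DistKind`, `Dist`, and `owner_in_dim` are hypothetical names, and the unit count is a free parameter because the slide does not state one. How DASH arranges units across several distributed dimensions is not covered here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the distribution specifiers named above.
enum class DistKind { None, Blocked, Cyclic, BlockCyclic };

struct Dist {
    DistKind kind;
    std::size_t blocksize;  // only used for BlockCyclic
};

// Unit that owns index `i` along one dimension of size `extent`
// when that dimension is spread over `nunits` units.
inline std::size_t owner_in_dim(std::size_t i, std::size_t extent,
                                std::size_t nunits, Dist d) {
    assert(i < extent && nunits > 0);
    switch (d.kind) {
    case DistKind::None:
        return 0;  // dimension is not distributed
    case DistKind::Blocked: {
        // one contiguous block of ceil(extent / nunits) indices per unit
        const std::size_t bs = (extent + nunits - 1) / nunits;
        return i / bs;
    }
    case DistKind::Cyclic:
        return i % nunits;  // single indices dealt out round-robin
    case DistKind::BlockCyclic:
        assert(d.blocksize > 0);
        return (i / d.blocksize) % nunits;  // blocks dealt out round-robin
    }
    return 0;  // not reached
}
```

Assuming 4 units, index 7 of the first dimension (extent 20, blocked) falls into the second block and is owned by unit 1, while index 5 of the second dimension (extent 15, block-cyclic with block size 2) lies in block 2 and is owned by unit 2.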

Multidimensional Data Distribution (2)

Example: tiled and tile-shifted data distribution

TilePattern<2, COL_MAJOR>(20, 15)   (TILE(5), TILE(5))
ShiftTilePattern<2>(32, 24)         (TILE(4), TILE(3))

Multidimensional Views

Lightweight Multidimensional Views

// 8x8 2D array
dash::NArray<int, 2> mat(8,8);

// linear access using iterators
dash::distance(mat.begin(), mat.end()) == 64

// create 2x5 region view
auto reg = mat.cols(2,5).rows(3,2);

// region can be used just like 2D array
cout << reg[1][2] << endl; // '7'

dash::distance(reg.begin(), reg.end()) == 10
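The idea of a lightweight view is that it stores only offsets and extents and refers back to the original storage. The sketch below illustrates this for a row-major 2D array in a single address space; it is not the DASH view implementation. `RegionView` and `region_demo` are hypothetical names, and the fill values (element (i, j) holds i * 8 + j) are an assumption, so the value read here differs from the '7' shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view on a row-major 2D array.
class RegionView {
public:
    RegionView(std::vector<int>& data, std::size_t ncols_total,
               std::size_t row_off, std::size_t col_off,
               std::size_t nrows, std::size_t ncols)
        : data_(&data), stride_(ncols_total),
          row_off_(row_off), col_off_(col_off),
          nrows_(nrows), ncols_(ncols) {}

    // Element access with view-relative coordinates.
    int& at(std::size_t r, std::size_t c) {
        assert(r < nrows_ && c < ncols_);
        return (*data_)[(row_off_ + r) * stride_ + (col_off_ + c)];
    }

    // Number of elements covered by the view.
    std::size_t size() const { return nrows_ * ncols_; }

private:
    std::vector<int>* data_;
    std::size_t stride_, row_off_, col_off_, nrows_, ncols_;
};

// Builds an 8x8 matrix with element (i, j) = i * 8 + j and reads view
// position (1, 2) of a 2x5 region starting at row 3, column 2.
inline int region_demo() {
    std::vector<int> mat(64);
    for (std::size_t i = 0; i < mat.size(); ++i) mat[i] = static_cast<int>(i);
    RegionView reg(mat, 8, /*row_off=*/3, /*col_off=*/2, /*nrows=*/2, /*ncols=*/5);
    assert(reg.size() == 10);
    return reg.at(1, 2);  // global position (4, 4)
}
```

Creating the view copies no elements, which is why such views can be passed around cheaply and nested.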

Multidimensional Views

Lightweight Multidimensional Views
– Local and block views

dash::NArray<int, 2> mat(80000,17000);

// view to block element range
auto block = mat.block(3,2);
// use view as Cartesian space:
auto elem = block[30][20];

dash::NArray<int, 2> mat(80000,17000);

if (dash::myid() == 1) {
  // view to local element range
  auto local_elems = mat.local;
  // use view as sequential range:
  for (auto elem : local_elems) { … }
}

Multidimensional Views

Multidimensional Iterator Ranges

– Global iterators provide access to the underlying view of their index space
– DASH global iterators on multi-dimensional regions can still be passed to standard library algorithms

auto r = mat.sub(0, { 4,7 })   // rows
            .sub(1, { 2,7 });  // cols

auto r_view = r.begin().view();
r_view.extents() == { 3,5 }
r_view.offsets() == { 4,2 }

// DASH algorithms use n-dim. view:
dash::summa(r.begin(), r.end(), …);
// multidimensional iterators still
// are sequential:
std::for_each(r.begin(), r.end(), …);
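The extents { 3,5 } and offsets { 4,2 } above follow if each sub() call restricts one dimension to a half-open index range. The sketch below reproduces that arithmetic and the sequential visiting order in plain C++. `ViewSpec2D`, `sub`, `linearize`, and `slide_example` are hypothetical names, and the 8x8 size of the underlying matrix is an assumption, since the slide does not state it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of a 2D sub-range.
struct ViewSpec2D {
    std::array<std::size_t, 2> offsets;
    std::array<std::size_t, 2> extents;
};

// Restrict dimension `dim` of the view to the half-open range [first, last),
// given in coordinates relative to the current view.
inline ViewSpec2D sub(ViewSpec2D v, std::size_t dim,
                      std::size_t first, std::size_t last) {
    assert(dim < 2 && first <= last && last <= v.extents[dim]);
    v.offsets[dim] += first;
    v.extents[dim] = last - first;
    return v;
}

// Row-major global offsets of all elements in the view, in the order a
// sequential iterator over the region would visit them.
inline std::vector<std::size_t> linearize(const ViewSpec2D& v,
                                          std::size_t ncols_total) {
    std::vector<std::size_t> out;
    out.reserve(v.extents[0] * v.extents[1]);
    for (std::size_t r = 0; r < v.extents[0]; ++r)
        for (std::size_t c = 0; c < v.extents[1]; ++c)
            out.push_back((v.offsets[0] + r) * ncols_total + (v.offsets[1] + c));
    return out;
}

// The range from the text: rows [4,7) and columns [2,7) of an assumed 8x8 matrix.
inline ViewSpec2D slide_example() {
    ViewSpec2D full{{0, 0}, {8, 8}};
    return sub(sub(full, 0, 4, 7), 1, 2, 7);
}
```

Because the region linearizes to a plain sequence of 15 positions, it can be handed to an algorithm that only expects a begin/end pair, which is the property the second bullet describes.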

Multidimensional Views

Multidimensional Iterator Ranges

– Global iterators provide access to the data distribution pattern of their iteration space

auto r = mat.sub(0, { 4,7 })   // rows
            .sub(1, { 2,7 });  // cols

auto r_pattern = r.begin().pattern();
r_pattern.blocksize() == 4
r_pattern.blocks() == 16
r_pattern.blocks_at(dash::myid()) == 4

DASH Algorithms

Growing number of DASH equivalents to STL algorithms:

dash::GlobIter<T> dash::fill(GlobIter<T> begin, GlobIter<T> end, T val);

Examples of STL algorithms ported to DASH, which also work for multidimensional ranges:

- dash::fill          range[i] <- val
- dash::generate      range[i] <- func()
- dash::for_each      func(range[i])
- dash::transform     range[i] = func(range2[i])
- dash::accumulate    sum(range[i]) (0<=i<=n-1)
- dash::min_element   min(range[i]) (0<=i<=n-1)
- dash::copy          range[i] <- range2[i]
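For a reduction such as dash::min_element, the owner computes model suggests a two-phase scheme: every unit first searches its own local elements, then the per-unit candidates are combined. The slides do not spell out how DASH implements this, so the following is only a sketch of that general scheme, simulated sequentially in one process with a blocked distribution. `min_element_two_phase` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Global index of the first minimum of `data`, computed as if the array
// were split into `nunits` contiguous blocks:
//   phase 1: every unit finds the minimum of its own block (local work),
//   phase 2: the per-unit candidates are reduced to one result.
// Returns data.size() for an empty range.
inline std::size_t min_element_two_phase(const std::vector<int>& data,
                                         std::size_t nunits) {
    assert(nunits > 0);
    const std::size_t n = data.size();
    if (n == 0) return n;
    const std::size_t bs = (n + nunits - 1) / nunits;
    std::size_t best = n;  // n means "no candidate yet"
    for (std::size_t u = 0; u < nunits; ++u) {
        const std::size_t lo = std::min(u * bs, n);
        const std::size_t hi = std::min(lo + bs, n);
        if (lo == hi) continue;  // this unit owns no elements
        // Phase 1: local minimum on unit u.
        const auto first = data.begin() + static_cast<std::ptrdiff_t>(lo);
        const auto last = data.begin() + static_cast<std::ptrdiff_t>(hi);
        const auto it = std::min_element(first, last);
        const std::size_t cand = static_cast<std::size_t>(it - data.begin());
        // Phase 2: keep the smallest candidate; ties go to the lower index.
        if (best == n || data[cand] < data[best]) best = cand;
    }
    return best;
}
```

In a real distributed run, phase 1 would execute concurrently on all units and only the candidates would cross the network, which is where the locality benefit comes from.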

Asynchronous Copying for Latency Hiding

Asynchronous Operations

– Async. algorithm interface: dash::copy_async()
– Launch policy: dash::launch::async (in upcoming DASH release 0.3.0)

std::vector<int> lcopy(block.size());
// starts async. copy of global range to local memory
// … via algorithm interface:
auto fut = dash::copy_async(block.begin(), block.end(),
                            lcopy.begin());
// … or via launch policy:
auto fut = dash::copy(dash::launch::async,
                      block.begin(), block.end(),
                      lcopy.begin());

overlapping_computation();
auto copy_end = fut.get(); // blocks until copy received
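The same start, overlap, wait structure can be tried out in a single process with the C++ standard library. The sketch below uses std::async and std::future in place of a one-sided remote transfer, so it illustrates only the control flow and not DASH's communication. `copy_async_sketch` and `overlap_demo` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Starts copying `src` into `dst` on another thread and returns a future
// that yields the end iterator of the copied range in `dst`.
// `dst` must already hold at least src.size() elements, and neither
// vector may be touched by the caller until the future is ready.
inline std::future<std::vector<int>::iterator>
copy_async_sketch(const std::vector<int>& src, std::vector<int>& dst) {
    assert(dst.size() >= src.size());
    return std::async(std::launch::async, [&src, &dst] {
        return std::copy(src.begin(), src.end(), dst.begin());
    });
}

// Overlaps the copy with unrelated work and returns the sum of the copy.
inline long overlap_demo() {
    std::vector<int> block(1000);
    std::iota(block.begin(), block.end(), 0);  // 0, 1, ..., 999
    std::vector<int> lcopy(block.size());

    auto fut = copy_async_sketch(block, lcopy);

    long other = 0;  // stands in for the overlapping computation
    for (int i = 0; i < 100; ++i) other += i;
    (void)other;

    auto copy_end = fut.get();  // blocks until the copy has finished
    assert(copy_end == lcopy.end());
    return std::accumulate(lcopy.begin(), copy_end, 0L);
}
```

The latency is hidden only if the overlapping work touches neither the source nor the destination buffer before the future is waited on.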

Case Study: S(R)UMMA Algorithm

Block matrix-matrix multiplication with prefetching

while (!done) {
  blk_a = matrixA.local.block(k); …
  blk_b = matrixB.local.block(k); …
  // prefetch
  auto get_a = dash::copy_async(blk_a.begin(), blk_a.end(), lblk_a_get);
  auto get_b = dash::copy_async(blk_b.begin(), blk_b.end(), lblk_b_get);
  // local DGEMM
  dash::multiply(lblk_a_comp, lblk_b_comp, lblk_c_comp);
  // wait for transfer to finish
  get_a.wait(); get_b.wait();
  // swap buffers
  swap(lblk_a_get, lblk_a_comp); swap(lblk_b_get, lblk_b_comp);
}
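The double-buffering pattern in the loop above can be exercised without DASH. The following sketch keeps two buffers per operand, fetches the blocks for step k + 1 on another thread while step k is multiplied, and then swaps the buffers. It is a single-process illustration under simplifying assumptions: a naive triple loop replaces DGEMM, std::async replaces the remote transfer, and every block is assumed to hold exactly bs * bs values. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

using Block = std::vector<double>;  // dense bs x bs block, row-major

// c += a * b for bs x bs blocks (stands in for a DGEMM call).
inline void multiply_add(const Block& a, const Block& b, Block& c, std::size_t bs) {
    for (std::size_t i = 0; i < bs; ++i)
        for (std::size_t k = 0; k < bs; ++k)
            for (std::size_t j = 0; j < bs; ++j)
                c[i * bs + j] += a[i * bs + k] * b[k * bs + j];
}

// Copies one block on another thread (stands in for a remote get).
inline std::future<void> fetch_async(const Block& remote, Block& local) {
    return std::async(std::launch::async, [&remote, &local] { local = remote; });
}

// Computes the sum over k of a_blocks[k] * b_blocks[k], fetching the
// operands of step k + 1 while step k is being multiplied.
inline Block multiply_prefetched(const std::vector<Block>& a_blocks,
                                 const std::vector<Block>& b_blocks,
                                 std::size_t bs) {
    assert(a_blocks.size() == b_blocks.size());
    const std::size_t nsteps = a_blocks.size();
    Block c(bs * bs, 0.0);
    if (nsteps == 0) return c;

    Block a_comp, b_comp, a_get, b_get;
    // The first pair of blocks cannot be overlapped with any computation.
    fetch_async(a_blocks[0], a_comp).wait();
    fetch_async(b_blocks[0], b_comp).wait();

    for (std::size_t k = 0; k < nsteps; ++k) {
        std::future<void> get_a, get_b;
        const bool more = (k + 1 < nsteps);
        if (more) {  // prefetch operands of the next step
            get_a = fetch_async(a_blocks[k + 1], a_get);
            get_b = fetch_async(b_blocks[k + 1], b_get);
        }
        multiply_add(a_comp, b_comp, c, bs);  // local computation
        if (more) {
            get_a.wait();  // wait for transfer to finish
            get_b.wait();
            std::swap(a_get, a_comp);  // swap buffers
            std::swap(b_get, b_comp);
        }
    }
    return c;
}

// Two steps with 2x2 blocks: identity * [1 2; 3 4] + ones * (2 * identity).
inline Block prefetch_demo() {
    const std::vector<Block> a = {{1.0, 0.0, 0.0, 1.0}, {1.0, 1.0, 1.0, 1.0}};
    const std::vector<Block> b = {{1.0, 2.0, 3.0, 4.0}, {2.0, 0.0, 0.0, 2.0}};
    return multiply_prefetched(a, b, 2);
}
```

The compute buffers are never written while the multiplication runs, and the fetch buffers are never read until both futures have completed, which is what makes the swap safe.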

DISCLAIMER

This code is simplified for brevity. Get the real source code here:

https://github.com/dash-project/dash/blob/development/dash/include/dash/algorithm/SUMMA.h

Local submatrix multiplication using DGEMM from serial Intel MKL

Schedules block transmissions to minimize network congestion

DASH vs. DGEMM: Intel MKL, PLASMA

DASH vs. PDGEMM: ScaLAPACK

Good.

But acing individual benchmarks is not the actual point.

Most important:
– the NArray concept allows intuitive design of efficient algorithms
– we achieved portable, robust efficiency on different hardware and system environments

Summary

NArray Concept
– Views simplify design of efficient algorithms
– First-class support for locality-based operations
– Complies with existing C++ standard library concepts

DASH algorithms on n-dim. ranges
– SUMMA case study: straightforward, compact implementation
– Leveraged portable efficiency of Intel MKL
– Outperforms Intel MKL, PLASMA, and ScaLAPACK in (P)DGEMM
– Robust scalability in a variety of node-level and highly distributed benchmark scenarios

http://www.dash-project.org/
http://github.com/dash-project/

Acknowledgements

DASH on GitHub: https://github.com/dash-project/dash/

Funding

The DASH Team

T. Fuchs (LMU), R. Kowalewski (LMU), D. Hünich (TUD), A. Knüpfer (TUD), J. Gracia (HLRS), C. Glass (HLRS), H. Zhou (HLRS), K. Idrees (HLRS), J. Schuchart (HLRS), F. Mößbauer (LMU), K. Fürlinger (LMU)