53
COMPUTE | STORE | ANALYZE Chapel’s New Adventures in Data Locality Brad Chamberlain Chapel Team, Cray Inc. August 2, 2017

Chapel’s New Adventures in Data Locality | STORE | ANALYZE Flat vs. Hierarchical Locales 24 Traditionally, Chapel has supported a “flat” locale model intra-locale decisions are

  • Upload
    vannhi

  • View
    222

  • Download
    2

Embed Size (px)

Citation preview

C O M P U T E | S T O R E | A N A L Y Z E

Chapel’s New Adventures in Data Locality

Brad ChamberlainChapel Team, Cray Inc.

August 2, 2017

C O M P U T E | S T O R E | A N A L Y Z E

This presentation may contain forward-looking statements that arebased on our current expectations. Forward looking statementsmay include statements about our financial guidance and expectedoperating results, our opportunities and future potential, our productdevelopment and new product introduction plans, our ability toexpand and penetrate our addressable markets and otherstatements that are not historical facts. These statements are onlypredictions and actual results may materially vary from thoseprojected. Please refer to Cray's documents filed with the SEC fromtime to time concerning factors that could affect the Company andthese forward-looking statements.

Safe Harbor Statement

Copyright 2017 Cray Inc.2

C O M P U T E | S T O R E | A N A L Y Z E

What is Chapel?

3

Chapel: A productive parallel programming language● portable● open-source● a collaborative effort

Goals:● Support general parallel programming at scale● Make parallel programming far more productive

Copyright 2017 Cray Inc.

C O M P U T E | S T O R E | A N A L Y Z E

Chapel and Productivity

Copyright 2017 Cray Inc.4

● Chapel strives to be……as programmable as Python…as fast as Fortran…as scalable as MPI, SHMEM, or UPC…as portable as C…as flexible as C++…as fun as [your favorite programming language]

C O M P U T E | S T O R E | A N A L Y Z E

The Chapel Team at Cray (May 2017)

Copyright 2017 Cray Inc.5

14 full-time employees + 2 summer interns

C O M P U T E | S T O R E | A N A L Y Z E

The Broader Chapel Community (a subset)

Copyright 2017 Cray Inc.6

http://chapel.cray.com/collaborations.html

C O M P U T E | S T O R E | A N A L Y Z E

Scalable Parallel Programming Concerns

Copyright 2017 Cray Inc.

A

B

C

α

=

+

·

=

+

·

=

+

·

=

+

·

=

+

·

=

+

·

=

+

·

=

+

·

Typical Chapel programmers should focus on:● Parallelism: What should execute simultaneously?● Locality: Where should those tasks execute? their data reside?

7

C O M P U T E | S T O R E | A N A L Y Z E

Outline

Copyright 2017 Cray Inc.8

✓What’s Chapel?ØClassic Chapel Concepts for Locality (‘CCC’s)● Three Recent Locality Endeavors (“New Adventures”)● Wrap-up

C O M P U T E | S T O R E | A N A L Y Z E

CCC #1: Locales

Copyright 2017 Cray Inc.9

locale: Chapel type/values representing architectural locality● (think “compute node”)

locale

C O M P U T E | S T O R E | A N A L Y Z E

CCC #1: Locales

Copyright 2017 Cray Inc.10

locale: Chapel type/values representing architectural locality● (think “compute node”)● Chapel automatically provides a 1D array of locales:

const Locales: [0..#numLocales] locale;

locale #1 locale #2 locale #3locale #0

C O M P U T E | S T O R E | A N A L Y Z E

on-clause: Moves the current task to the specified locale

CCC #2: on-clauses

Copyright 2017 Cray Inc.11

C O M P U T E | S T O R E | A N A L Y Z E

on-clause: Moves the current task to the specified locale

// programs begin execution as a single task on locale #0config const n = computeLocalProblemSize(),

alpha = 0.5;

locale #1 locale #2 locale #3

CCC #2: on-clauses

Copyright 2017 Cray Inc.12

locale #0

n 𝛂

C O M P U T E | S T O R E | A N A L Y Z E

on-clause: Moves the current task to the specified locale

// programs begin execution as a single task on locale #0config const n = computeLocalProblemSize(),

alpha = 0.5;

coforall loc in Locales do // creates a task per localeon loc { // moves the task to its locale

}

locale #1 locale #2 locale #3

CCC #2: on-clauses

Copyright 2017 Cray Inc.13

locale #0

n 𝛂

C O M P U T E | S T O R E | A N A L Y Z E

on-clause: Moves the current task to the specified locale

// programs begin execution as a single task on locale #0config const n = computeLocalProblemSize(),

alpha = 0.5;

coforall loc in Locales do // creates a task per localeon loc { // moves the task to its locale

var A, B, C: [1..n] real;A = B + alpha * C;

}

locale #1

ABC

locale #2

ABC

locale #3

ABC

CCC #2: on-clauses

Copyright 2017 Cray Inc.14

locale #0

ABC

n 𝛂

(conceptual view)

C O M P U T E | S T O R E | A N A L Y Z E

locale #1

ABC

n 𝛂

locale #2

ABC

n 𝛂

locale #3

ABC

n 𝛂

CCC #2: on-clauses

Copyright 2017 Cray Inc.15

locale #0

ABC

n 𝛂

(optimized view)

n 𝛂 n 𝛂 n 𝛂

on-clause: Moves the current task to the specified locale

// programs begin execution as a single task on locale #0config const n = computeLocalProblemSize(),

alpha = 0.5;

coforall loc in Locales do // creates a task per localeon loc { // moves the task to its locale

var A, B, C: [1..n] real;A = B + alpha * C;

}

C O M P U T E | S T O R E | A N A L Y Z E

CCC #3: Distributions / Domain Maps

Copyright 2017 Cray Inc.16

distribution: Maps domains (“index sets”) to locales

C O M P U T E | S T O R E | A N A L Y Z E

CCC #3: Distributions / Domain Maps

Copyright 2017 Cray Inc.17

distribution: Maps domains (“index sets”) to locales

config const n = computeGlobalProblemSize(),

alpha = 0.5;

α

locale #1

locale #2

locale #3

locale #0

C O M P U T E | S T O R E | A N A L Y Z E

CCC #3: Distributions / Domain Maps

Copyright 2017 Cray Inc.18

distribution: Maps domains (“index sets”) to locales

config const n = computeGlobalProblemSize(),

alpha = 0.5;

use BlockDist;const ProblemSpace = {1..n} dmapped Block(…);

ProblemSpace

locale #1

locale #2

locale #3

locale #0

α

C O M P U T E | S T O R E | A N A L Y Z E

CCC #3: Distributions / Domain Maps

Copyright 2017 Cray Inc.19

distribution: Maps domains (“index sets”) to locales

config const n = computeGlobalProblemSize(),

alpha = 0.5;

use BlockDist;const ProblemSpace = {1..n} dmapped Block(…);var A, B, C: [ProblemSpace] real;

A

B

C

α

locale #1

locale #2

locale #3

locale #0

ProblemSpace

C O M P U T E | S T O R E | A N A L Y Z E

CCC #3: Distributions / Domain Maps

Copyright 2017 Cray Inc.20

distribution: Maps domains (“index sets”) to locales

config const n = computeGlobalProblemSize(),

alpha = 0.5;

use BlockDist;const ProblemSpace = {1..n} dmapped Block(…);var A, B, C: [ProblemSpace] real;A = B + alpha * C;

A

B

C

α

=

+

·

=

+

·

=

+

·

=

+

·

=

+

·

=

+

·

=

+

·

=

+

·

locale #1

locale #2

locale #3

locale #0

C O M P U T E | S T O R E | A N A L Y Z E

CCC #4: User Control over Locality Policies

Copyright 2017 Cray Inc.21

● In Chapel, users can……write their own distributions

“How should domains & arrays be mapped to locales and their memories?”

…write their own parallel iterators“How should forall-loops be implemented? How many tasks, running where?”

…write their own locale models“How should tasks, memory, communication be mapped to the system?”

● This gives users full control over key locality policies● Moreover, all “built-in” Chapel features are written in this framework

C O M P U T E | S T O R E | A N A L Y Z E

Locality Adventure #1: NUMA Locale Model

Copyright 2017 Cray Inc.22

C O M P U T E | S T O R E | A N A L Y Z E

The Perils of NUMA Obliviousness

Copyright 2017 Cray Inc.23

● Accessing non-NUMA-local memory ⇒ performance hit● e.g., Stream EP on Cray XC w/ 2 NUMA domains per node:

0

20

40

60

80

100

GB

/s

poorly-aligned memory

well-aligned memory

C O M P U T E | S T O R E | A N A L Y Z E

Flat vs. Hierarchical Locales

24

● Traditionally, Chapel has supported a “flat” locale model● intra-locale decisions are managed on the user’s behalf

● But, users can also write hierarchical locale models

locale #0 locale #1 locale #2 locale #3

locale #0 locale #1 locale #2 locale #3

sub-locale A

sub-locale B

sub-locale A

sub-locale B

sub-locale A

sub-locale B

sub-locale A

sub-locale BC C D E C C D E C C D E C C D E

C O M P U T E | S T O R E | A N A L Y Z E

Adventure #1: NUMA locale model

Copyright 2017 Cray Inc.25

● Created ‘numa’ locale model to describe NUMA nodes

● Also made the default domain map NUMA-aware● allocates local arrays using a chunk per sublocale

var A: [1..n] real;

NUMA compute nodeNUMA domain

memPU PU

PU PU

NUMA domain

memPU PU

PU PU

numa locale

NUMA 0 sub-localeNUMA 1

sub-locale⇒

numa locale

A0

A1

C O M P U T E | S T O R E | A N A L Y Z E

Adventure #1: Positive Impact

Copyright 2017 Cray Inc.26

0102030405060708090

100

Chapel 1.15

GB

/s

Stream EP

flat locale model*numa locale model

* = ostensibly… we’ll come back to this in a few slides

C O M P U T E | S T O R E | A N A L Y Z E

Adventure #1: Negative Impact

Copyright 2017 Cray Inc.27

● Array accesses like A[i] now require a dividenuma locale

A0

A1

C O M P U T E | S T O R E | A N A L Y Z E

Adventure #1: Summary

Copyright 2017 Cray Inc.28

● The increased array access cost is problematic● We’d like these idioms to all perform equivalently in Chapel:

● whole-array operations:A = B + alpha * C;

● zippered iteration:forall (a, b, c) in zip(A, B, C) do

a = b + alpha * c;

● random access:forall i in ProblemSpace do

A[i] = B[i] + alpha * C[i];

● While there are ways to mitigate the overheads, they aren’t ideal● still not overhead-free (in some approaches)● too expensive to implement (in others)

● So, let’s try something else…

C O M P U T E | S T O R E | A N A L Y Z E

Locality Adventure #2: PGAS, Networks, & Locality

Copyright 2017 Cray Inc.29

C O M P U T E | S T O R E | A N A L Y Z E

Flat Locale Model: Correcting a White Lie

Copyright 2017 Cray Inc.30

● I suggested that the flat locale model is NUMA-oblivious● It is, but the default domain map actually is not

● it distributes array indices using first-touch, heuristically

● Sometimes results in good performance, but not always:

0

20

40

60

80

100

Chapel 1.14

GB

/s comm = gasnet/mpi

comm = ugni

flat locale

A

C O M P U T E | S T O R E | A N A L Y Z E

● Chapel usually performs best when using ugni● leverages Cray network capabilities● matches Chapel’s PGAS features wellQ: why not here, when no communication is used?A: PGAS-based network registration of heap at program startup

● serves as first-touch, pinning all memory to NUMA domain 0● lack of communication magnifies memory-oriented bottlenecks

Flat Locale Model: Why does GASNet/MPI win?

Copyright 2017 Cray Inc.31

0

20

40

60

80

100

Chapel 1.14

GB

/s comm = gasnet/mpi

comm = ugni

C O M P U T E | S T O R E | A N A L Y Z E

● For ’numa’, each sublocale registers its own local heap● thus, this is one approach to addressing the problem

● but, it introduces the aforementioned overheads for random access

NUMA locale model and network registration

Copyright 2017 Cray Inc.32

80

100

GB

/s

comm = gasnet-mpi, locale = flatcomm = ugni, locale = numa

C O M P U T E | S T O R E | A N A L Y Z E

Adventure #2: Dynamic Memory Registration

Copyright 2017 Cray Inc.33

● Register array memory with network at allocation time● heuristically, divide array into approximately equal # of pages

● Impact: Restores performance for ugni:

flat locale

A

0

20

40

60

80

100

gasnet-mpi ugni, static registration

ugni, dynamic registration

GB

/s

C O M P U T E | S T O R E | A N A L Y Z E

Locality Adventure #3: Intel Xeon Phi (“KNL”) HBM

Copyright 2017 Cray Inc.34

C O M P U T E | S T O R E | A N A L Y Z E

Adventure #3: KNL Locale Model

Copyright 2017 Cray Inc.35

Image Source: https://newsroom.intel.com/press-kits/intel-xeon-phi-processor-family/

knl locale

NUMA 0 sub-localeNUMA 1

sub-locale

⇒HBM /

MCDRAMsub-locale

NUMA k sub-locale

C O M P U T E | S T O R E | A N A L Y Z E

KNL Locale Model: Usage and Status

Copyright 2017 Cray Inc.36

● Chapel can target KNL’s MCDRAM via normal on-clauses● accessor methods expose memory-based sub-locales● methods implemented across all standard locale models for portability

on here.highBandwidthMemory() {

x = new myClass(); // placed in MCDRAM...

on here.defaultMemory() {

y = new myClass(); // placed in DDR...

}

}

Status: Supported as of Chapel 1.15● no performance results to report at this time● next step: improve support for memory introspection (“if I have…”)

C O M P U T E | S T O R E | A N A L Y Z E

General Chapel Performance Snapshots

Copyright 2017 Cray Inc.37

C O M P U T E | S T O R E | A N A L Y Z E

00.5

11.5

22.5

33.5

PRESSURE_CALC ENERGY_CALC VOL3D_CALC DEL_DOT_VEC_2D COUPLE FIR INIT3 MULADDSUB IF_QUAD TRAP_INT PIC_2D

Nor

mal

ized

Tim

e

Parallel LCALS kernels: Chapel vs g++ w/ OMP

g++ OMP

Chapel parallel

LCALS Timings: Chapel 1.15 vs. C [+ OpenMP]

Copyright 2017 Cray Inc.38

Shared memory performance competitive with hand-coded

fast

er

0

1

2

3

Nor

mal

ized

Tim

e

Serial LCALS kernels: Chapel vs. g++

g++ serial

Chapel serial

C O M P U T E | S T O R E | A N A L Y Z E

���

���

���

�� �� �� ��� ���

�����

�������

����������� �� �� ���������

��� ��� ��������������� ��� ���������

���� ������� ��� ��������������

HPCC RA Performance: Chapel 1.15 vs. MPI

Copyright 2017 Cray Inc.

(x 36 cores per locale)

39

fast

er

C O M P U T E | S T O R E | A N A L Y Z E

ISx Timings: Chapel 1.15 vs. MPI, SHMEM

Copyright 2017 Cray Inc.40

0

2

4

6

8

10

12

14

1 2 4 8 16 32 64

Tim

e (s

econ

ds)

Cray XC nodes (x 36 cores per node)

ISx weakISO Total Time

SHMEM

MPI

Chapel 1.15

fast

er

C O M P U T E | S T O R E | A N A L Y Z E

The Computer Language Benchmarks Game (CLBG)

Copyright 2017 Cray Inc.41

Chapel entry acceptedFall 2016

C O M P U T E | S T O R E | A N A L Y Z E

CLBG Language Cross-Language Summary (May 2017 standings)

Copyright 2017 Cray Inc.42

Geometric mean code size (normalized to smallest entry)

Geo

met

ric m

ean

exec

utio

n tim

e (n

orm

aliz

ed to

fast

est e

ntry

)

smaller

fast

er

C O M P U T E | S T O R E | A N A L Y Z E

CLBG Language Cross-Language Summary (May 2017 standings, no Python)

Copyright 2017 Cray Inc.43

Geometric mean code size (normalized to smallest entry)

Geo

met

ric m

ean

exec

utio

n tim

e (n

orm

aliz

ed to

fast

est e

ntry

)

smaller

fast

er

C O M P U T E | S T O R E | A N A L Y Z E

A Closing Quote(source: Jonathan Dursi’s CHIUW 2017 keynote)

Copyright 2017 Cray Inc.44

C O M P U T E | S T O R E | A N A L Y Z E

CHIUW 2017 keynote (excerpt)

Copyright 2017 Cray Inc.45

“My opinion as an outsider…is that Chapel is important, Chapel is mature, and Chapel is just getting started.

“If the scientific community is going to have frameworks for solving scientific problems that are actually designed for our problems, they’re going to come from a project like

Chapel.“And the thing about Chapel is that the set of all things that

are ‘projects like Chapel’ is ‘Chapel.’”–Jonathan Dursi

Chapel’s Home in the New Landscape of Scientific Frameworks(and what it can learn from the neighbours)

CHIUW 2017 keynote

C O M P U T E | S T O R E | A N A L Y Z E

Questions?

Copyright 2017 Cray Inc.46

C O M P U T E | S T O R E | A N A L Y Z E

Chapel Resources

Copyright 2017 Cray Inc.47

C O M P U T E | S T O R E | A N A L Y Z E

Ways to Track Chapel Remotely

Copyright 2017 Cray Inc.

Facebook: http://facebook.com/ChapelLanguage

Twitter: http://twitter.com/ChapelLanguage

Youtube: https://www.youtube.com/channel/UCHmm27bYjhknK5mU7ZzPGsQ/

e-mail: [email protected]

48

C O M P U T E | S T O R E | A N A L Y Z E

Suggested Reading

49

Chapel chapter from Programming Models for Parallel Computing● a detailed overview of Chapel’s history, motivating themes, features● published by MIT Press, November 2015● edited by Pavan Balaji (Argonne)● chapter is now also available online

Other Chapel papers/publications available at http://chapel.cray.com/papers.html

Copyright 2017 Cray Inc.

C O M P U T E | S T O R E | A N A L Y Z E

Suggested Short Reads (Blog Articles)

50

CHIUW 2017: Surveying the Chapel Landscape, Cray Blog, July 2017.● a run-down of recent events

Chapel: Productive Parallel Programming, Cray Blog, May 2013.● a short-and-sweet introduction to Chapel

Six Ways to Say “Hello” in Chapel (parts 1, 2, 3), Cray Blog, Sep-Oct 2015.● a series of articles illustrating the basics of parallelism and locality in Chapel

Why Chapel? (parts 1, 2, 3), Cray Blog, Jun-Oct 2014.● a series of articles answering common questions about why we are pursuing

Chapel in spite of the inherent challenges

[Ten] Myths About Scalable Programming Languages, IEEE TCSC Blog(index available on chapel.cray.com “blog articles” page), Apr-Nov 2012.

● a series of technical opinion pieces designed to argue against standard reasons given for not developing high-level parallel languages

Copyright 2017 Cray Inc.

C O M P U T E | S T O R E | A N A L Y Z E

Where to..

Copyright 2017 Cray Inc.51

Submit bug reports:GitHub issues for chapel-lang/chapel: public bug [email protected]: for reporting non-public bugs

Ask User-Oriented Questions:StackOverflow: when appropriate / other users might care#chapel-users (irc.freenode.net): user-oriented IRC [email protected]: user discussions

Discuss Chapel [email protected]: developer discussions#chapel-developers (irc.freenode.net): developer-oriented IRC channel

Discuss Chapel’s use in [email protected]: educator discussions

Directly contact Chapel team at Cray: [email protected]

C O M P U T E | S T O R E | A N A L Y Z E

Legal Disclaimer

Copyright 2017 Cray Inc.

Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document.

Cray Inc. may make changes to specifications and product descriptions at any time, without notice.

All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Cray uses codenames internally to identify products that are in development and not yet publically announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user.

Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, and URIKA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners.

52