
Page 1

© 2008 IBM Corporation

Deep Computing Messaging Framework

Lightweight Communication for Petascale Supercomputing
Supercomputing 2008

Michael Blocksome, [email protected]

Page 2

DCMF Open Source Community

Open source community established January 2008

Wiki
– http://dcmf.anl-external.org/wiki

Mailing List
– [email protected]

Git Source Repository
– helpful git resources on wiki
– git clone http://dcmf.anl-external.org/dcmf.git/

Page 3

Design Goals

Scalable to millions of tasks

Efficient on low-frequency embedded cores
– Inlined system programmer interface (SPI)

Supports many programming paradigms
– Active Messages
– Support multiple contexts
– Multiple levels of application interfaces

Structured component design
– Extensible to new architectures
– Software architecture for multiple networks
– Open source runtime with external contributions

Separate library for optimized collectives
– Hardware acceleration
– Software collectives

Page 4

[Diagram: IBM® Blue Gene®/P Messaging Software Stack]
– Application Layer: Applications (QCD), Charm++, MPICH2 (via the dcmfd ADI), Berkeley UPC, GASNet, Global Arrays, ARMCI; applications may also call the DMA SPI or DCMF directly
– Library Portability Layer: DCMF Public API, DCMF (C++), CCMI
– Systems Programming Interface: DMA SPI
– BG/P Network Hardware
– Legend distinguishes IBM supported software from externally supported software

Page 5

Direct DCMF Application Programming

dcmf.h – core interface
– point-to-point and utilities
– all functions implemented

collectives interface(s)
– may or may not be implemented
– check the return value on register! (see the sketch at the end of this chart)

Collective Component Messaging Interface (CCMI)

– high level collectives library

– uses multisend interface

– extensible to new collectives

[Diagram: DCMF software organization — the Application (through an Adaptor) sits on four interfaces: dcmf.h (all point-to-point), dcmf_globalcollectives.h (global collectives), dcmf_multisend.h (multisend collectives), and dcmf_collectives.h (high-level collectives via CCMI); beneath them, the DCMF messager and sysdep layer drive protocols and devices on the BG/P hardware]
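A minimal sketch of the register-then-check pattern, under stated assumptions: DCMF_Broadcast_register() and DCMF_Broadcast_Configuration_t below are illustrative stand-ins rather than quotes from the headers, while DCMF_UNIMPL is the return code named on the porting chart later in this deck.

```cpp
// Sketch only: the registration call and configuration type are assumed,
// not copied from dcmf.h. The pattern -- register, then check the return
// code before relying on the protocol -- is what this chart recommends.
#include <dcmf.h>
#include <dcmf_collectives.h>
#include <cstdio>

static DCMF_Protocol_t bcast_registration;     // protocol handle (assumed type name)

bool try_register_broadcast()
{
  DCMF_Broadcast_Configuration_t config = {};  // hypothetical config struct
  DCMF_Result rc = DCMF_Broadcast_register(&bcast_registration, &config);

  if (rc == DCMF_UNIMPL) {
    // This messager does not implement the collective; fall back to a
    // point-to-point algorithm built on dcmf.h instead.
    std::fprintf(stderr, "broadcast unimplemented, using p2p fallback\n");
    return false;
  }
  return rc == DCMF_SUCCESS;
}
```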

Page 6

DCMF Blue Gene/P Performance

Point-to-Point

Protocol                   Latency (µs)
DCMF Eager One-way         1.6
MPI Eager One-way          2.4
MPI Rendezvous One-way     5.6
DCMF Put                   0.9
DCMF Get                   1.6
ARMCI blocking put         2.0
ARMCI blocking get         3.3

MPI achieves 4300 MB/sec (96% of peak) for torus near-neighbor communication on 6 links

Collectives on 512 nodes (SMP)

Operation                  Performance
MPI Barrier                1.3 µs
MPI Allreduce (int sum)    4.3 µs
MPI Broadcast              4.3 µs
MPI Allreduce throughput   817 MB/sec
MPI Bcast throughput       2.0 GB/sec

Barriers accelerated via the Global Interrupt network

Allreduce and broadcast operations accelerated via the collective network

Large broadcasts take advantage of the 6 edge-disjoint routes on a 3D torus

Page 7

Why use DCMF?

Scales on BG/P to millions of tasks
– high efficiency, low overhead

Open Source
– active community support

Easily port applications and libraries to the DCMF interface

Unique features of DCMF
– see the feature comparison on the next chart

Page 8

Feature Comparison (to the best of our knowledge)

Feature                                        MX   VERBS      LAPI  ELAN      DCMF
Multiple Contexts                              N    Y          Y     Y         Y
Active Messages                                N    N¹         Y     Y         Y
One-sided calls                                N    Y          Y     Y         Y
Strided or Vector calls                        N¹   N¹         Y     Y         N²
Multi-send calls                               N¹   N¹         N¹    N¹        Y
Message Ordering and Consistency               N    N          N     N         Y
Device interface for many different networks   N    Y (C-API)  N     N         Y³ (C++)
Topology Awareness                             N    N          N     N         Y
Architecture Neutral                           N    Y          Y     N         Y
Non-blocking optimized collectives             N¹   N¹         N¹    Blocking  Y

¹ This feature can be implemented in software on top of the provided set of features in this API, at possibly lower efficiency
² Non-contiguous transfer operation to be added
³ Device-level programming is available at the protocol level and not the API

Page 9

DCMF C API Features

Multiple Context Registration
– supports multiple, concurrent communication paradigms

Memory Consistency
– One-sided communication APIs like UPC and ARMCI need optimized support for memory consistency levels

Active Messaging
– Good match for Charm++ and other active message runtimes
– MPI can be easily supported

Multisend Protocols
– Amortize startup across many messages sent together

Topology Awareness

Optimized Protocols

See dcmf.h
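To make the active-message style concrete, here is a receive-side sketch. The callback parameters and configuration field names are assumptions patterned on typical active-message APIs rather than quotes from dcmf.h, and DCMF_DEFAULT_SEND_PROTOCOL is likewise assumed; only DCMF_Send_register's existence is implied by the deck.

```cpp
// Sketch, not the literal dcmf.h signature: parameter and field names are
// illustrative. The point is the shape -- the sender attaches small header
// metadata, and this callback dispatches on it with no pre-posted receive.
#include <dcmf.h>

static void on_short_message(void *clientdata,      // registration-time state
                             const DCQuad *msginfo, // sender-supplied header
                             unsigned count,        // number of header quads
                             size_t origin,         // sending rank
                             const char *src,       // payload
                             size_t bytes)          // payload size
{
  // Application-defined dispatch keyed off msginfo, e.g. a handler table,
  // which is why runtimes like Charm++ map onto this model so directly.
}

static DCMF_Protocol_t send_protocol;

void setup_active_messages()
{
  DCMF_Send_Configuration_t config = {};         // assumed field names below
  config.protocol = DCMF_DEFAULT_SEND_PROTOCOL;  // assumed enumerator
  config.cb_recv_short = on_short_message;
  config.cb_recv_short_clientdata = nullptr;
  DCMF_Send_register(&send_protocol, &config);
}
```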

Page 10

Extending DCMF to other Architectures

Copy the “Linux® sockets” messager and build options
– Contains the sockets device and a DCMF_Send() protocol
– Implements the core API; returns DCMF_UNIMPL for collectives

A new architecture only needs to implement DCMF_Send()
– Sockets device enables DCMF on Linux clusters
– Shmem device enables DCMF on multi-core systems

DCMF provides default point-to-point implementations layered over send
– DCMF_Put()
– DCMF_Get()
– DCMF_Control()

Selectively implement architecture devices and optimized protocols
– Assign to DCMF_USER0_SEND_PROTOCOL (for example) to test; see the sketch below
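One way to act on that last bullet, as a sketch: register the experimental path under the spare protocol slot and drive identical traffic through both registrations. Only DCMF_USER0_SEND_PROTOCOL comes from the chart above; DCMF_DEFAULT_SEND_PROTOCOL and the configuration fields are assumed, as in the earlier sketches.

```cpp
#include <dcmf.h>

static DCMF_Protocol_t default_send;       // existing, known-good send path
static DCMF_Protocol_t experimental_send;  // optimized protocol under test

void register_protocols()
{
  DCMF_Send_Configuration_t config = {};         // assumed field names
  config.protocol = DCMF_DEFAULT_SEND_PROTOCOL;  // assumed enumerator
  DCMF_Send_register(&default_send, &config);

  config.protocol = DCMF_USER0_SEND_PROTOCOL;    // spare slot from this chart
  DCMF_Send_register(&experimental_send, &config);

  // A test harness can now send the same messages through both handles and
  // compare correctness and latency before promoting the new protocol.
}
```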

Page 11

Upcoming Features * (nothing promised)

Common Device Interface (CDI)
– POSIX Shared Memory
– Sockets
– Infiniband

Multi-channel advance
– Thread may advance a “slice” of the messaging devices
– Dedicated threads result in uncontested locks for high-level communication libraries

Add a blocking advance API (see the sketch after this list)
– Eliminate explicit processor polls on supported hardware
– May degrade to a regular DCMF_Messager_advance() on unsupported hardware

Extend the API to access Blue Gene® features in a portable manner
– network and device structures
– replace the hardware struct with key-value pairs

Noncontiguous point-to-point one-sided
– iterator can be used to implement all other interfaces (strided, vector, etc.)

One-sided “on the fly” collectives (ad hoc)
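A sketch of what the blocking advance would buy. DCMF_Messager_advance() is named on this chart; DCMF_Messager_advance_blocking() and the done-flag plumbing are hypothetical illustrations of the proposal, not an existing API.

```cpp
#include <dcmf.h>

// Today: an explicit poll loop keeps a core busy until a completion
// callback flips the flag.
void wait_polling(volatile unsigned &done)
{
  while (!done)
    DCMF_Messager_advance();           // spin, advancing the devices
}

// Proposed: on supported hardware the call could sleep until network
// activity arrives; elsewhere it degrades to the poll loop above.
void wait_blocking(volatile unsigned &done)
{
  while (!done)
    DCMF_Messager_advance_blocking();  // hypothetical blocking variant
}
```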

Page 12

DCMF Device Abstraction

At the core of DCMF is a “Device”, with a packet API abstraction and a DMA API abstraction

In principle the functions are virtual; in practice the methods are inlined for performance
– Barton-Nackman C++ templates

Common Device Interface (CDI)
– If you implement this interface, you get all of DCMF “for free”
– Good for rapid prototypes
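For readers unfamiliar with the Barton-Nackman trick the slide mentions: the base class is parameterized on the concrete device type, so a “virtual” call resolves statically and can inline. The device and method names below are invented for illustration and are not DCMF's actual CDI.

```cpp
// Static polymorphism via the Barton-Nackman / CRTP idiom: the base class
// dispatches to the derived type at compile time, so no vtable lookup
// stands between the messager and the device on a slow embedded core.
template <class T_Device>
class PacketDevice
{
public:
  int advance()
  {
    // "Virtual in principle": resolved and inlined at compile time.
    return static_cast<T_Device *>(this)->advance_impl();
  }
};

class ExampleDevice : public PacketDevice<ExampleDevice>
{
public:
  int advance_impl()
  {
    // Poll the underlying transport and deliver packets (illustrative).
    return 0;
  }
};

int main()
{
  ExampleDevice device;
  return device.advance();  // direct call, no virtual dispatch overhead
}
```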

Page 13

Current DCMF Devices

Blue Gene/P
– DMA / 3-D Torus Network
– Collective Network
– Global Interrupt Network
– Lockbox / Memory Atomics

Generic
– Sockets (hybrid compatible)
– Shared Memory (hybrid compatible)
– Infiniband (hybrid compatible)

Page 14

Other DCMF Projects

IBM
– Roadrunner

Argonne National Laboratory
– MPICH2
– ZeptoOS

Pacific Northwest National Laboratory
– Global Arrays / ARMCI

Berkeley
– UPC / GASNet

University of Illinois at Urbana-Champaign
– Charm++

Page 15

Open Source Project Ideas, in no particular order

Store-and-Forward protocols
Stream API
Channel combining, message striping across devices
Extend to other process managers (OpenMPI, etc.)
Extend to other platforms (OS X, BSD, Windows, ?)
DCMF functional and performance test suite
Scalability improvements for sockets and IB
Combination shmem/sockets messager
GPU device? hybrid model?
Shared memory collectives

Page 16

How can we be a more effective open source project?

How to improve the open source experience?

Specific needs, directions?

Missing features?

Page 17

Additional Charts

DCMF on Linux Clusters
DCMF on Infiniband

Page 18

DCMF on Linux Clusters

Page 19

DCMF on Linux Clusters

Build Instructions on the Wiki
– http://dcmf.anl-external.org/wiki/index.php/Building_DCMF_for_Linux

Test environment for application developers
– Evaluate the DCMF API and runtime
– Port applications to DCMF before reserving time on Blue Gene/P

Uses MPICH2 PMI for job launch and management
– Needs a pluggable job launch and sysdep extension to remove the MPICH2 dependency

Implemented Devices
– sockets device
– shmem device

Page 20

DCMF Sockets Device

Standard sockets syscalls implemented on many architectures

Uses the “packet” CDI
– New “stream” CDI may provide better performance

Current design is not scalable
– primarily a development and porting platform

Can be used to initialize other devices that require synchronization

Page 21

DCMF Shmem Device

Uses the “packet” CDI

Only point-to-point send

Thread safe; allows multiple threads to post messages to the device

No collectives

Page 22

DCMF on Infiniband

Page 23

DCMF Infiniband Motivations

Optimize for low-power processors and large “fat” nodes

Infiniband project lead: Charles Archer
– communicate via the dcmf mailing list

Page 24

DCMF Infiniband Device

Implements the CDI “rdma” version
– direct RDMA
– memregions

Implements the CDI “packet” version
– “eager” style sends

rdma CDI design
– SRQ, scalable, but worst latency

packet CDI design
– per-destination rdma with send/recv
– per-destination rdma with direct DMA: best latency

Page 25

DCMF Infiniband – Future Work

Remove artificial limits on scalability
– currently 32 nodes

Implement memregion caching

Multiple adaptor support (?)

Switch management routines (?)

Multiple network implementations
– SRQ and “per destination”

Async progress through IB events