dan-holmes
Slides from a conference paper presentation introducing a new MPI library, written in pure C#.
Outline
• Yet another MPI library?
• Managed-code, C#, Windows
• McMPI, design and implementation details
• Object-orientation, design patterns,
communication performance results
• Threads and the MPI Standard
• Pre-“End Points proposal” ideas
Why Implement MPI Again?
• Parallel program, distributed memory => MPI library
• Most (all?) MPI libraries written in C
• MPI Standard provides C and FORTRAN bindings
• C++ can use the C functions
• Other languages can follow the C++ model
• Use the C functions
• Alternatively, MPI can be implemented in that language
• Removes inter-language function call overheads but …
• May not be possible to achieve comparable performance
Why Did I Choose C#?
• Experience and knowledge I gained from my career in
software development
• My impression of the popularity of C# in commercial
software development
• My desire to bridge the gap between high-performance
programming and high-productivity programming
• One of the UK research councils offered me funding for a
PhD that proposed to use C# to implement MPI
C# Myths
• C# only runs on Windows
• Not such a bad thing – 3 of the Top500 machines use Windows
• Not actually true – Mono works on multiple operating systems
• C# is a Microsoft language
• Not such a bad thing – resources, commitment, support, training
• Not actually true – C# follows ECMA and ISO standards
• C# is slow like Java
• Not such a bad thing – expressivity, readability, re-usability
• Not actually true – no easy way to prove this conclusively
• C# and its ilk are not things we need to care about
• Not such a bad thing – they will survive/thrive, or not, without us
• Not actually true – popularity trumps utility
McMPI Design & Implementation
• Desirable features of code
• Isolation of concerns -> easier to understand
• Human readability -> easier to maintain
• Compiler readability -> easier to get good performance
• Object-orientation can help with isolation of concerns
• So can modularisation and judiciously reducing LOC per code file
• Design patterns can help with human readability
• So can documentation and useful in-code comments
• Choice of language & compiler can help with performance
• So can coding style and detailed examination of compiler output
• What is the best compromise?
Communication Layer
• Abstract class factory design pattern
• Similar to plug-ins
• Enables addition of new functionality without re-compilation of the
rest of the library
• All communication modules:
• Implement the same Abstract Device Interface (ADI)
• Isolate the details of their implementation from other layers
• Provide the same semantics and capabilities
• Reliable delivery
• Ordering of delivery
• Preservation of message boundaries
• Message = fixed-size envelope information and variable-size user data
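The abstract class factory described above can be sketched as follows. This is an illustrative sketch, not actual McMPI source; the interface name `ICommunicationModule`, the factory registry, and the method signatures are all assumptions made for brevity.

```csharp
// Sketch of an abstract class factory for pluggable communication
// modules, each implementing the same Abstract Device Interface (ADI).
using System;
using System.Collections.Generic;

// Hypothetical ADI: every module must provide reliable, ordered,
// message-boundary-preserving delivery behind this interface.
public interface ICommunicationModule
{
    void Send(int destRank, byte[] envelope, byte[] userData);
    bool TryReceive(out byte[] envelope, out byte[] userData);
}

// Abstract factory: new modules register themselves by name, so new
// functionality plugs in without recompiling the rest of the library.
public abstract class CommModuleFactory
{
    private static readonly Dictionary<string, CommModuleFactory> registry
        = new Dictionary<string, CommModuleFactory>();

    public static void Register(string name, CommModuleFactory factory)
        => registry[name] = factory;

    public static ICommunicationModule Create(string name)
        => registry[name].CreateModule();

    protected abstract ICommunicationModule CreateModule();
}
```

A shared-memory or TCP module would subclass `CommModuleFactory` and register itself when loaded; the layers above only ever see `ICommunicationModule`.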
Communication Layer – UML
Protocol Layer
• Bridge design pattern
• Enables addition of new functionality without re-compilation of the
rest of the library
• All protocol messages:
• Inherit from the same base class
• Isolate the details of their implementation from other layers
• Modify state of internal shared data structures independently
• Shared data structures (message ‘queues’)
• Unexpected queue – message envelope at receiver before receive
• Request queue – receive called before message envelope arrival
• Matched queue – at receiver waiting for message data to arrive
• Pending queue – message data waiting at sender
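The matching rule implied by these queues can be sketched as below. This is an illustrative sketch, not actual McMPI source; matching on a `(source, tag)` pair and the type names are assumptions, and the pending queue (sender side) is omitted.

```csharp
// Sketch of receive-side envelope matching across the message 'queues':
// an envelope arriving before its receive goes to the unexpected queue;
// a receive posted before its envelope goes to the request queue; a
// successful match moves the envelope to the matched queue.
using System.Collections.Generic;

public sealed class MatchingEngine
{
    public record Envelope(int Source, int Tag);

    private readonly List<Envelope> unexpected = new();  // envelope arrived, no receive yet
    private readonly List<Envelope> requests = new();    // receive posted, no envelope yet
    public readonly List<Envelope> Matched = new();      // waiting for user data to arrive

    // Called when the communication layer delivers an envelope.
    public void OnEnvelope(Envelope e)
    {
        int i = requests.FindIndex(r => r.Source == e.Source && r.Tag == e.Tag);
        if (i >= 0) { requests.RemoveAt(i); Matched.Add(e); }
        else unexpected.Add(e);
    }

    // Called when the user posts a receive.
    public void OnReceive(int source, int tag)
    {
        int i = unexpected.FindIndex(e => e.Source == source && e.Tag == tag);
        if (i >= 0) { Matched.Add(unexpected[i]); unexpected.RemoveAt(i); }
        else requests.Add(new Envelope(source, tag));
    }
}
```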
Protocol Layer – UML
Interface Layer
• Simple façade design pattern
• Translates MPI Standard-like syntax into protocol layer syntax
• Will become adapter design pattern
• For example, when custom data-types are implemented
• Current version of McMPI covers parts of MPI 1 only
• Initialisation and finalisation
• Administration functions, e.g. to get rank and size of communicator
• Point-to-point communication functions
• ready, synchronous, standard (not buffered)
• blocking, non-blocking, persistent
• Previous version had collectives
• Implemented on top of point-to-point
• Using hypercube or binary tree algorithms
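The façade idea can be sketched as follows. This is an illustrative sketch, not McMPI's actual interface layer; `IProtocolLayer`, the `Mpi` class, and all member names are assumptions introduced for illustration.

```csharp
// Sketch of a simple façade: MPI-Standard-like calls are translated
// one-for-one into calls on a hypothetical protocol layer, which owns
// all protocol decisions (e.g. eager vs. rendezvous).
public interface IProtocolLayer
{
    void PostSend(int destRank, int tag, byte[] data, bool synchronous);
    void PostReceive(int sourceRank, int tag, byte[] buffer);
    int Rank { get; }
    int Size { get; }
}

public sealed class Mpi
{
    private readonly IProtocolLayer protocol;
    public Mpi(IProtocolLayer protocol) => this.protocol = protocol;

    // Administration functions: rank and size of the communicator.
    public int CommRank() => protocol.Rank;
    public int CommSize() => protocol.Size;

    // Standard-mode and synchronous-mode sends differ only in a flag
    // passed down; the façade itself adds no protocol logic.
    public void Send(byte[] buffer, int dest, int tag)
        => protocol.PostSend(dest, tag, buffer, synchronous: false);

    public void Ssend(byte[] buffer, int dest, int tag)
        => protocol.PostSend(dest, tag, buffer, synchronous: true);

    public void Recv(byte[] buffer, int source, int tag)
        => protocol.PostReceive(source, tag, buffer);
}
```

Because the façade only translates syntax, replacing it with an adapter (e.g. once custom data-types need marshalling) changes nothing below this layer.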
McMPI Implementation Overview
Performance Results – Introduction 1
• Shared-memory results – hardware details
• Number of Nodes: 1 (Armari Magnetar server)
• CPUs per Node: 2 (Intel Xeon E5420)
• Threads per CPU: 4 (quad-core, no hyper-threading)
• Core Clock Speed: 2.5 GHz (front-side bus 1333 MHz)
• Level 1 Cache: 4x2x32 KB (data & instruction, per core)
• Level 2 Cache: 2x6 MB (one per pair of cores)
• Memory per Node: 16 GB (DDR2, 667 MHz)
• Network Hardware: 2x NIC (Intel 82575EB Gigabit Ethernet)
• Operating System: WinXP Pro 64-bit with SP3 (version 5.2.3790)
Performance Results – Introduction 2
• Distributed-memory results – hardware details
• Number of Nodes: 18 (Dell PowerEdge 2900)
• CPUs per Node: 2 (Intel Xeon 5130, family 6, model 15, stepping 6)
• Threads per CPU: 2 (dual-core, no hyper-threading)
• Core Clock Speed: 2.0 GHz (front-side bus 1333 MHz)
• Level 1 Cache: 2x2x32 KB (data & instruction, per core)
• Level 2 Cache: 1x4 MB (one per CPU)
• Memory per Node: 4 GB (DDR2, 533 MHz)
• Network Hardware: 2x NIC (BCM5708C NetXtreme II GigE)
• Operating System: Win2008 Server x64 with SP2 (version 6.0.6002)
Shared-memory – Latency
[Line chart: latency (µs) vs. message size (1 B to 32,768 B); series: MPICH2 shared memory, MS-MPI shared memory, McMPI thread-to-thread]
Shared-memory – Bandwidth
[Line chart: bandwidth (Mbit/s) vs. message size (4,096 B to 1,048,576 B); series: McMPI thread-to-thread, MPICH2 shared-memory, MS-MPI shared-memory]
Distributed-memory – Latency
[Line chart: latency (µs) vs. message size (1 B to 32,768 B); series: McMPI Eager, MS-MPI]
Distributed-memory – Bandwidth
[Line chart: bandwidth (Mbit/s) vs. message size (4,096 B to 1,048,576 B); series: McMPI Rendezvous, McMPI Eager, MS-MPI]
Thread-as-rank – Threading Level
• McMPI allows MPI_THREAD_AS_RANK as input for the
MPI_INIT_THREAD function
• McMPI creates new threads during initialisation
• Not needed – MPI_INIT_THREAD must be called enough times
• McMPI uses thread-local storage to store ‘rank’
• Not needed – each communicator handle can encode ‘rank’
• Thread-to-thread message delivery is zero-copy
• Direct copy from user send buffer to user receive buffer
• Any thread can progress MPI messages
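The thread-local-storage approach mentioned above can be sketched as follows. This is an illustrative sketch of the idea only, not McMPI's API: the class name, the `InitThread` stand-in for MPI_INIT_THREAD with MPI_THREAD_AS_RANK, and the rank-assignment scheme are all assumptions.

```csharp
// Sketch of thread-as-rank: each participating thread calls the
// (hypothetical) initialisation function once and is assigned a rank,
// which is kept in thread-local storage so any later call on that
// thread can identify its own rank without an OS-process boundary.
using System.Threading;

public static class ThreadAsRank
{
    private static readonly ThreadLocal<int> rank = new ThreadLocal<int>();
    private static int nextRank = -1;

    // Stand-in for MPI_INIT_THREAD called "enough times", once per
    // thread; ranks are handed out atomically starting from 0.
    public static int InitThread()
    {
        rank.Value = Interlocked.Increment(ref nextRank);
        return rank.Value;
    }

    public static int Rank => rank.Value;
}
```

As the slide notes, thread-local storage is not the only option: each communicator handle could encode the rank instead, avoiding the thread-local lookup.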
Thread-as-rank – MPI Process
Diagram created
by Gaurav Saxena
MSc, 2013
Thread-as-rank – MPI Standard
• Is thread-as-rank compliant with the MPI Standard?
• Does the MPI Standard allow/support thread-as-rank?
• Ambiguous/debatable at best
• The MPI Standard assumes MPI process = OS process
• Call MPI_INIT or MPI_INIT_THREAD twice in one OS process
• Erroneous by definition or results in two MPI processes?
• MPI Standard “thread compliant” prohibits thread-as-rank
• To maintain a POSIX-process-like interface for MPI process
• End-points proposal violates this principle in exactly the same way
• Other possible interfaces exist
Thread-as-rank – End-points
• Similarities
• Multiple threads can communicate reliably without using tags
• Thread ‘rank’ can be stored in thread-local storage or handles
• Most common use-case likely requires MPI_THREAD_MULTIPLE
• Differences
• Thread-as-rank part of initialisation and active until finalisation
• End-points created after initialisation and can be destroyed
• Thread-as-rank has all possible ranks in MPI_COMM_WORLD
• End-points only has some ranks in MPI_COMM_WORLD
• Thread-as-rank cannot create ranks but may need to merge ranks
• End-points can create ranks and does not need to merge ranks
Thread-as-rank – MPI Forum Proposal?
• Short answer: no
• Long answer: not yet, it’s complicated
• More likely to be suggested amendments to end-points proposal
• Thread-as-rank is a special case of end-points
• Standard MPI_COMM_WORLD replaced with an end-points
communicator during MPI_INIT_THREAD
• Thread-safety implications are similar (possibly identical?)
• Advantages/opportunities similar
• Thread-to-thread delivery rather than process-to-process delivery
• Work-stealing MPI progress engine or per-thread message queues
Questions?