dan-holmes
Slides from a conference paper presentation introducing a new MPI library, written in pure C#.
Outline
• Yet another MPI library?
• Managed-code, C#, Windows
• McMPI, design and implementation details
• Object-orientation, design patterns,
communication performance results
• Threads and the MPI Standard
• Pre-“End Points proposal” ideas
Why Implement MPI Again?
• Parallel program, distributed memory => MPI library
• Most (all?) MPI libraries written in C
• MPI Standard provides C and FORTRAN bindings
• C++ can use the C functions
• Other languages can follow the C++ model
• Use the C functions
• Alternatively, MPI can be implemented in that language
• Removes inter-language function call overheads but …
• May not be possible to achieve comparable performance
Why Did I Choose C#?
• Experience and knowledge I gained from my career in
software development
• My impression of the popularity of C# in commercial
software development
• My desire to bridge the gap between high-performance
programming and high-productivity programming
• One of the UK research councils offered me funding for a
PhD that proposed to use C# to implement MPI
C# Myths
• C# only runs on Windows
• Not such a bad thing – 3 of the Top500 machines use Windows
• Not actually true – Mono works on multiple operating systems
• C# is a Microsoft language
• Not such a bad thing – resources, commitment, support, training
• Not actually true – C# follows ECMA and ISO standards
• C# is slow like Java
• Not such a bad thing – expressivity, readability, re-usability
• Not actually true – no easy way to prove this conclusively
• C# and its ilk are not things we need to care about
• Not such a bad thing – they will survive/thrive, or not, without us
• Not actually true – popularity trumps utility
McMPI Design & Implementation
• Desirable features of code
• Isolation of concerns -> easier to understand
• Human readability -> easier to maintain
• Compiler readability -> easier to get good performance
• Object-orientation can help with isolation of concerns
• So can modularisation and judiciously reducing LOC per code file
• Design patterns can help with human readability
• So can documentation and useful in-code comments
• Choice of language & compiler can help with performance
• So can coding style and detailed examination of compiler output
• What is the best compromise?
Communication Layer
• Abstract class factory design pattern
• Similar to plug-ins
• Enables addition of new functionality without re-compilation of the
rest of the library
• All communication modules:
• Implement the same Abstract Device Interface (ADI)
• Isolate the details of their implementation from other layers
• Provide the same semantics and capabilities
• Reliable delivery
• Ordering of delivery
• Preservation of message boundaries
• Message = fixed-size envelope information and variable-size user data
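The abstract class factory described above can be sketched as follows. This is an illustrative sketch, not actual McMPI source; the interface name `ICommunicationModule`, the factory registry, and the method signatures are all assumptions made for brevity.

```csharp
// Sketch of an abstract class factory for pluggable communication
// modules, each implementing the same Abstract Device Interface (ADI).
using System;
using System.Collections.Generic;

// Hypothetical ADI: every module must provide reliable, ordered,
// message-boundary-preserving delivery behind this interface.
public interface ICommunicationModule
{
    void Send(int destRank, byte[] envelope, byte[] userData);
    bool TryReceive(out byte[] envelope, out byte[] userData);
}

// Abstract factory: new modules register themselves by name, so new
// functionality plugs in without recompiling the rest of the library.
public abstract class CommModuleFactory
{
    private static readonly Dictionary<string, CommModuleFactory> registry
        = new Dictionary<string, CommModuleFactory>();

    public static void Register(string name, CommModuleFactory factory)
        => registry[name] = factory;

    public static ICommunicationModule Create(string name)
        => registry[name].CreateModule();

    protected abstract ICommunicationModule CreateModule();
}
```

A shared-memory or TCP module would subclass `CommModuleFactory` and register itself when loaded; the layers above only ever see `ICommunicationModule`.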
Communication Layer – UML
Protocol Layer
• Bridge design pattern
• Enables addition of new functionality without re-compilation of the
rest of the library
• All protocol messages:
• Inherit from the same base class
• Isolate the details of their implementation from other layers
• Modify state of internal shared data structures independently
• Shared data structures (message ‘queues’)
• Unexpected queue – message envelope at receiver before receive
• Request queue – receive called before message envelope arrival
• Matched queue – at receiver waiting for message data to arrive
• Pending queue – message data waiting at sender
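The matching rule implied by these queues can be sketched as below. This is an illustrative sketch, not actual McMPI source; matching on a `(source, tag)` pair and the type names are assumptions, and the pending queue (sender side) is omitted.

```csharp
// Sketch of receive-side envelope matching across the message 'queues':
// an envelope arriving before its receive goes to the unexpected queue;
// a receive posted before its envelope goes to the request queue; a
// successful match moves the envelope to the matched queue.
using System.Collections.Generic;

public sealed class MatchingEngine
{
    public record Envelope(int Source, int Tag);

    private readonly List<Envelope> unexpected = new();  // envelope arrived, no receive yet
    private readonly List<Envelope> requests = new();    // receive posted, no envelope yet
    public readonly List<Envelope> Matched = new();      // waiting for user data to arrive

    // Called when the communication layer delivers an envelope.
    public void OnEnvelope(Envelope e)
    {
        int i = requests.FindIndex(r => r.Source == e.Source && r.Tag == e.Tag);
        if (i >= 0) { requests.RemoveAt(i); Matched.Add(e); }
        else unexpected.Add(e);
    }

    // Called when the user posts a receive.
    public void OnReceive(int source, int tag)
    {
        int i = unexpected.FindIndex(e => e.Source == source && e.Tag == tag);
        if (i >= 0) { Matched.Add(unexpected[i]); unexpected.RemoveAt(i); }
        else requests.Add(new Envelope(source, tag));
    }
}
```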
Protocol Layer – UML
Interface Layer
• Simple façade design pattern
• Translates MPI Standard-like syntax into protocol layer syntax
• Will become adapter design pattern
• For example, when custom data-types are implemented
• Current version of McMPI covers parts of MPI 1 only
• Initialisation and finalisation
• Administration functions, e.g. to get rank and size of communicator
• Point-to-point communication functions
• ready, synchronous, standard (not buffered)
• blocking, non-blocking, persistent
• Previous version had collectives
• Implemented on top of point-to-point
• Using hypercube or binary tree algorithms
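The façade idea can be sketched as follows. This is an illustrative sketch, not McMPI's actual interface layer; `IProtocolLayer`, the `Mpi` class, and all member names are assumptions introduced for illustration.

```csharp
// Sketch of a simple façade: MPI-Standard-like calls are translated
// one-for-one into calls on a hypothetical protocol layer, which owns
// all protocol decisions (e.g. eager vs. rendezvous).
public interface IProtocolLayer
{
    void PostSend(int destRank, int tag, byte[] data, bool synchronous);
    void PostReceive(int sourceRank, int tag, byte[] buffer);
    int Rank { get; }
    int Size { get; }
}

public sealed class Mpi
{
    private readonly IProtocolLayer protocol;
    public Mpi(IProtocolLayer protocol) => this.protocol = protocol;

    // Administration functions: rank and size of the communicator.
    public int CommRank() => protocol.Rank;
    public int CommSize() => protocol.Size;

    // Standard-mode and synchronous-mode sends differ only in a flag
    // passed down; the façade itself adds no protocol logic.
    public void Send(byte[] buffer, int dest, int tag)
        => protocol.PostSend(dest, tag, buffer, synchronous: false);

    public void Ssend(byte[] buffer, int dest, int tag)
        => protocol.PostSend(dest, tag, buffer, synchronous: true);

    public void Recv(byte[] buffer, int source, int tag)
        => protocol.PostReceive(source, tag, buffer);
}
```

Because the façade only translates syntax, replacing it with an adapter (e.g. once custom data-types need marshalling) changes nothing below this layer.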
McMPI Implementation Overview
Performance Results – Introduction 1
• Shared-memory results – hardware details
• Number of Nodes: 1 (Armari Magnetar server)
• CPUs per Node: 2 (Intel Xeon E5420)
• Threads per CPU: 4 (quad-core, no hyper-threading)
• Core Clock Speed: 2.5 GHz (front-side bus 1333 MHz)
• Level 1 Cache: 4x2x32 KB (data & instruction, per core)
• Level 2 Cache: 2x6 MB (one per pair of cores)
• Memory per Node: 16 GB (DDR2, 667 MHz)
• Network Hardware: 2x NIC (Intel 82575EB Gigabit Ethernet)
• Operating System: WinXP Pro 64-bit with SP3 (version 5.2.3790)
Performance Results – Introduction 2
• Distributed-memory results – hardware details
• Number of Nodes: 18 (Dell PowerEdge 2900)
• CPUs per Node: 2 (Intel Xeon 5130, family 6, model 15, stepping 6)
• Threads per CPU: 2 (dual-core, no hyper-threading)
• Core Clock Speed: 2.0 GHz (front-side bus 1333 MHz)
• Level 1 Cache: 2x2x32 KB (data & instruction, per core)
• Level 2 Cache: 1x4 MB (one per CPU)
• Memory per Node: 4 GB (DDR2, 533 MHz)
• Network Hardware: 2x NIC (BCM5708C NetXtreme II GigE)
• Operating System: Win2008 Server x64 with SP2 (version 6.0.6002)
Shared-memory – Latency
[Line chart: latency (µs) vs. message size (1 B to 32,768 B); series: MPICH2 shared memory, MS-MPI shared memory, McMPI thread-to-thread]
Shared-memory – Bandwidth
[Line chart: bandwidth (Mbit/s) vs. message size (4,096 B to 1,048,576 B); series: McMPI thread-to-thread, MPICH2 shared-memory, MS-MPI shared-memory]
Distributed-memory – Latency
[Line chart: latency (µs) vs. message size (1 B to 32,768 B); series: McMPI Eager, MS-MPI]
Distributed-memory – Bandwidth
[Line chart: bandwidth (Mbit/s) vs. message size (4,096 B to 1,048,576 B); series: McMPI Rendezvous, McMPI Eager, MS-MPI]
Thread-as-rank – Threading Level
• McMPI allows MPI_THREAD_AS_RANK as input for the
MPI_INIT_THREAD function
• McMPI creates new threads during initialisation
• Not needed – MPI_INIT_THREAD must be called enough times
• McMPI uses thread-local storage to store ‘rank’
• Not needed – each communicator handle can encode ‘rank’
• Thread-to-thread message delivery is zero-copy
• Direct copy from user send buffer to user receive buffer
• Any thread can progress MPI messages
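The thread-local-storage approach mentioned above can be sketched as follows. This is an illustrative sketch of the idea only, not McMPI's API: the class name, the `InitThread` stand-in for MPI_INIT_THREAD with MPI_THREAD_AS_RANK, and the rank-assignment scheme are all assumptions.

```csharp
// Sketch of thread-as-rank: each participating thread calls the
// (hypothetical) initialisation function once and is assigned a rank,
// which is kept in thread-local storage so any later call on that
// thread can identify its own rank without an OS-process boundary.
using System.Threading;

public static class ThreadAsRank
{
    private static readonly ThreadLocal<int> rank = new ThreadLocal<int>();
    private static int nextRank = -1;

    // Stand-in for MPI_INIT_THREAD called "enough times", once per
    // thread; ranks are handed out atomically starting from 0.
    public static int InitThread()
    {
        rank.Value = Interlocked.Increment(ref nextRank);
        return rank.Value;
    }

    public static int Rank => rank.Value;
}
```

As the slide notes, thread-local storage is not the only option: each communicator handle could encode the rank instead, avoiding the thread-local lookup.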
Thread-as-rank – MPI Process
Diagram created
by Gaurav Saxena
MSc, 2013
Thread-as-rank – MPI Standard
• Is thread-as-rank compliant with the MPI Standard?
• Does the MPI Standard allow/support thread-as-rank?
• Ambiguous/debatable at best
• The MPI Standard assumes MPI process = OS process
• Call MPI_INIT or MPI_INIT_THREAD twice in one OS process
• Erroneous by definition or results in two MPI processes?
• MPI Standard “thread compliant” prohibits thread-as-rank
• To maintain a POSIX-process-like interface for MPI process
• End-points proposal violates this principle in exactly the same way
• Other possible interfaces exist
Thread-as-rank – End-points
• Similarities
• Multiple threads can communicate reliably without using tags
• Thread ‘rank’ can be stored in thread-local storage or handles
• Most common use-case likely requires MPI_THREAD_MULTIPLE
• Differences
• Thread-as-rank part of initialisation and active until finalisation
• End-points created after initialisation and can be destroyed
• Thread-as-rank has all possible ranks in MPI_COMM_WORLD
• End-points only has some ranks in MPI_COMM_WORLD
• Thread-as-rank cannot create ranks but may need to merge ranks
• End-points can create ranks and does not need to merge ranks
Thread-as-rank – MPI Forum Proposal?
• Short answer: no
• Long answer: not yet, it’s complicated
• More likely to be suggested amendments to end-points proposal
• Thread-as-rank is a special case of end-points
• Standard MPI_COMM_WORLD replaced with an end-points
communicator during MPI_INIT_THREAD
• Thread-safety implications are similar (possibly identical?)
• Advantages/opportunities similar
• Thread-to-thread delivery rather than process-to-process delivery
• Work-stealing MPI progress engine or per-thread message queues
Questions?