Toward Message Passing for a Million
Processes: Characterizing MPI on a
Massive Scale Blue Gene/P
P. Balaji, A. Chan, R. Thakur, W. Gropp and E. Lusk
Math. and Computer Sci., Argonne National Laboratory
Computer Science, University of Illinois at Urbana-Champaign
Massive Scale High End Computing
• We have passed the Petaflop barrier
– Two systems over the Petaflop mark in the Top500: LANL Roadrunner and ORNL Jaguar
– Argonne has a 163,840-core Blue Gene/P
– Lawrence Livermore has a 286,720-core Blue Gene/L
• Exaflop systems are expected by 2018-2020
– Expected to have more than a million processing elements
– Might be processors, cores, or SMTs
• Such large systems pose many challenges to the middleware that tries to take advantage of them
Hardware Sharing at Massive Scales
• At massive scales, number of hardware components
cannot increase exponentially with system size– Too expensive (cost plays a major factor!)
– E.g., Crossbar switches, Fat-tree networks
• At this scale, most systems do a lot of hardware sharing– Shared caches, shared communication engines, shared
networks
• More sharing means more contention– The challenge is how do we deal with this contention?
– More importantly: what’s the impact of such architectures?
Presentation Layout
• Introduction and Motivation
• Blue Gene/P Architectural Overview
• Performance Results and Analysis
• Conclusions and Future Work
Blue Gene/P Overview
• Second generation of the Blue Gene supercomputers
• Extremely energy-efficient design using low-power chips
– Four 850 MHz cores on each PPC450 processor
BG/P Network Stack
• Uses five specialized networks
– Two of them (10G and 1G Ethernet) are used for file I/O and system management
– The remaining three (3D torus, global collective network, global interrupt network) are used for MPI communication
• 3D torus: 6 bidirectional links for each node (a total of 5.1 GBps)
[Figure: 3D torus topology with X, Y, and Z axes]
BG/P Communication Middleware
• Three software stack layers:
– System Programming Interface (SPI)
• Directly above the hardware
• Most efficient, but very difficult to program and not portable!
– Deep Computing Messaging Framework (DCMF)
• Portability layer built on top of SPI
• Generalized message passing framework
• Allows different stacks to be built on top
– MPI
• Built on top of DCMF
• Most portable of the three layers (see the ping-pong sketch below)
• Based on MPICH2 (contributed back to Argonne as of 1.1a1)
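To make the MPI layer concrete, here is a minimal ping-pong sketch of the kind commonly used to measure the one-way latencies shown in the following slides. This is not the authors' benchmark code; the message size and iteration count are arbitrary illustrative choices.

/* Minimal MPI ping-pong latency sketch (illustrative, not the authors' code).
 * Rank 0 and rank 1 bounce a message back and forth; the one-way latency is
 * half of the average round-trip time. Build with mpicc, run with 2 ranks. */
#include <mpi.h>
#include <stdio.h>

#define ITERS    1000
#define MSG_SIZE 8      /* bytes; arbitrary small message */

int main(int argc, char **argv)
{
    int rank;
    char buf[MSG_SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}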
BG/P OS Stack
• Uses a lightweight kernel known as the Compute Node Kernel (CNK)
– Better integration between the hardware and software stacks
• No swap space
– Equal virtual and physical address space
– Static virtual-to-physical address translation
• Easier for devices to access a "virtual address region"
• (Mostly) symmetric address space
– Potential for direct memory access between processes
– Similar to SMARTMAP on Cray
Presentation Layout
• Introduction and Motivation
• Blue Gene/P Architectural Overview
• Performance Results and Analysis
• Conclusions and Future Work
Performance Results
• DMA Engine Behavior
• Impact of System Scale
• Network Congestion Characteristics
• Parallel Collective Communication
• Analysis of an Ocean Modeling Communication Kernel
Inter-node Performance
[Figure: One-way latency, in-cache vs. out-of-cache; latency (us) vs. message size (bytes)]
[Figure: Unidirectional bandwidth, in-cache vs. out-of-cache; bandwidth (Mbps) vs. message size (bytes)]
Intra-node Performance
[Figure: One-way latency, Core 1 and Core 2; latency (us) vs. message size (bytes)]
[Figure: Unidirectional bandwidth, Core 1 and Core 2; bandwidth (Mbps) vs. message size (bytes)]
Multi-Stream Communication
[Figure: Multi-stream bandwidth with 1, 2, 3, and 4 cores; bandwidth (Mbps) vs. message size (bytes)]
Fan Bandwidth Tests
[Figure: Fan-in bandwidth with 1, 2, and 3 peers; bandwidth (Mbps) vs. message size (bytes)]
[Figure: Fan-out bandwidth with 1, 2, 3, and 4 peers; bandwidth (Mbps) vs. message size (bytes)]
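A fan-out test of the kind plotted above is typically structured as one rank streaming nonblocking sends to several peers at once. The sketch below is illustrative only; the window and message sizes are arbitrary choices, not the authors' parameters.

/* Sketch of a fan-out bandwidth test: rank 0 streams messages to every other
 * rank concurrently using nonblocking sends. Illustrative parameters only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE (128 * 1024)   /* bytes per message */
#define WINDOW   16             /* messages in flight per peer */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *buf = malloc(MSG_SIZE);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    if (rank == 0) {
        int npeers = size - 1;
        MPI_Request *reqs = malloc((size_t)npeers * WINDOW * sizeof(MPI_Request));
        int r = 0;
        for (int w = 0; w < WINDOW; w++)
            for (int peer = 1; peer < size; peer++)
                MPI_Isend(buf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[r++]);
        MPI_Waitall(r, reqs, MPI_STATUSES_IGNORE);
        double secs = MPI_Wtime() - t0;
        printf("aggregate fan-out bandwidth: %.1f Mbps\n",
               (double)MSG_SIZE * WINDOW * npeers * 8.0 / secs / 1e6);
        free(reqs);
    } else {
        for (int w = 0; w < WINDOW; w++)
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Running it with one rank per node (for example, one sender node and three receiver nodes) exercises the node's DMA engine and network links rather than intra-node sharing.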
Performance Results
• DMA Engine Behavior
• Impact of System Scale
• Network Congestion Characteristics
• Parallel Collective Communication
• Analysis of an Ocean Modeling Communication Kernel
Impact of Number of Hops on Performance
[Figure: Degradation of one-way latency (%) vs. system size, from 4 to 122,880 processes, for 0-byte, 32-byte, 1K-byte, 32K-byte, and 1M-byte messages]
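As background for reading the figure (this model is not stated on the slide), small-message latency on a torus grows roughly linearly with the number of hops, and the worst-case hop count grows with the cube root of the node count, which is why larger partitions see more degradation:

$T(h) \approx T_0 + h \cdot t_{\text{hop}}, \qquad h_{\max} = \lfloor X/2 \rfloor + \lfloor Y/2 \rfloor + \lfloor Z/2 \rfloor \sim \tfrac{3}{2} N^{1/3}$

where $T_0$ is the software and injection overhead, $t_{\text{hop}}$ is the per-hop router delay, and $X \times Y \times Z = N$ are the torus dimensions (wraparound links assumed; small BG/P partitions are meshes, where $h_{\max}$ is correspondingly larger).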
Performance Results
• DMA Engine Behavior
• Impact of System Scale
• Network Congestion Characteristics
• Parallel Collective Communication
• Analysis of an Ocean Modeling Communication Kernel
Network Communication Behavior
• Network communication between pairs often uses overlapping links
– This can cause network congestion
– Communication throttling is a common approach to avoid such congestion
• On massive scale systems, getting network congestion feedback back to the source might not be very scalable
– Approach: if a link is busy, backpressure is applied to all of the remaining 5 inbound links
– Each DMA engine checks whether the link is busy before sending data
[Figure: Eight processes P0–P7 laid out along one dimension of the torus]
Network Congestion Behavior
[Figure: Congestion behavior with fully overlapped communication; bandwidth (Mbps) vs. message size (bytes) for the P2–P5 pair, the P3–P4 pair, and the no-overlap case]
Parallel Collective Performance
[Figure: MPI_Bcast time (us) for 16K-byte messages vs. system size]
[Figure: MPI_Allgather time (us) for 4-byte messages vs. system size (4 to 64K processes) with 1, 2, 3, and 4 communicators]
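Concurrent collectives over disjoint communicators, as in the MPI_Allgather figure, can be set up with MPI_Comm_split. The sketch below mirrors the figure's 4-byte element size and four communicators, but it is illustrative rather than the authors' benchmark.

/* Sketch: split MPI_COMM_WORLD into NCOMMS disjoint sub-communicators and run
 * a 4-byte MPI_Allgather concurrently on each, reporting the slowest one.
 * Illustrative only; NCOMMS and the iteration count are arbitrary choices. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NCOMMS 4
#define ITERS  100

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Interleave ranks across the sub-communicators. */
    MPI_Comm subcomm;
    MPI_Comm_split(MPI_COMM_WORLD, rank % NCOMMS, rank, &subcomm);

    int subsize, sendval = rank;
    MPI_Comm_size(subcomm, &subsize);
    int *recvbuf = malloc(subsize * sizeof(int));

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Allgather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT, subcomm);
    double local = (MPI_Wtime() - t0) * 1e6 / ITERS;   /* us per allgather */

    double slowest;
    MPI_Reduce(&local, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("slowest 4-byte allgather over %d communicators: %.1f us\n",
               NCOMMS, slowest);

    free(recvbuf);
    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}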
Performance Results
• DMA Engine Behavior
• Impact of System Scale
• Network Congestion Characteristics
• Parallel Collective Communication
• Analysis of an Ocean Modeling Communication Kernel
HALO: Modeling Ocean Modeling
• The NRL Layered Ocean Model (NLOM) simulates enclosed seas, major ocean basins, and the global ocean
• HALO was initially developed as the communication kernel for NLOM
– It gained popularity because of its similarity to other models as well (e.g., algebraic solvers)
– It gives a rough indication of the communication behavior of other models too, including CFD and nuclear physics
• It distributes data on a 2D logical process grid and performs a nearest-neighbor exchange along the logical grid (sketched below)
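The nearest-neighbor exchange that HALO performs can be illustrated with a 2D Cartesian communicator and MPI_Sendrecv. This is a generic halo-exchange pattern under assumed message sizes and a non-periodic grid; it is not the actual HALO source.

/* Sketch of a HALO-style nearest-neighbor exchange on a 2D logical process
 * grid, using a Cartesian communicator and MPI_Sendrecv in each direction.
 * Generic pattern only; NELEMS and the non-periodic grid are assumptions. */
#include <mpi.h>

#define NELEMS 256   /* doubles exchanged with each neighbor (assumed) */

int main(int argc, char **argv)
{
    int size, dims[2] = {0, 0}, periods[2] = {0, 0};
    double sendbuf[NELEMS], recvbuf[NELEMS];
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Let MPI factor the processes into a 2D logical grid. */
    MPI_Dims_create(size, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    /* Exchange with the four neighbors: +/- along each of the two dimensions.
     * At grid edges MPI_Cart_shift returns MPI_PROC_NULL, which MPI_Sendrecv
     * treats as a no-op. */
    for (int dim = 0; dim < 2; dim++) {
        for (int disp = -1; disp <= 1; disp += 2) {
            int src, dst;
            MPI_Cart_shift(cart, dim, disp, &src, &dst);
            MPI_Sendrecv(sendbuf, NELEMS, MPI_DOUBLE, dst, 0,
                         recvbuf, NELEMS, MPI_DOUBLE, src, 0,
                         cart, MPI_STATUS_IGNORE);
        }
    }

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}

How the MPI ranks are mapped onto the physical torus (the XYZ vs. YXZ mappings on the next slides) determines how many physical hops separate each pair of these logical neighbors.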
[Figure: Process Mapping (XYZ) on the 3D torus, with X, Y, and Z axes]
[Figure: Process Mapping (YXZ) on the 3D torus, with X, Y, and Z axes]
Nearest Neighbor Performance
[Figure: Execution time (us) vs. grid partition size (bytes) on 16K processors for the XYZT, TXYZ, and ZYXT mappings]
[Figure: Execution time (us) vs. grid partition size (bytes) on 128K processors for the XYZT, TXYZ, and ZYXT mappings]
Presentation Layout
• Introduction and Motivation
• Blue Gene/P Architectural Overview
• Performance Results and Analysis
• Conclusions and Future Work
Concluding Remarks and Future Work
• Increasing system scales are leading to large amounts of hardware sharing
– Shared caches, shared communication engines, shared networks
– More sharing means more contention
– What is the impact of such shared hardware on performance?
• We performed an analysis on Blue Gene/P
– Identified and characterized several performance issues
– Documented different areas where the behavior differs from cluster-like systems
• Future work: a description language for process mapping
Thank You!
Contacts:
{balaji, chan, thakur, lusk} @ mcs.anl.gov
wgropp @ illinois.edu
Web Links:
MPICH2: http://www.mcs.anl.gov/research/projects/mpich2