Toward Message Passing for a Million
Processes: Characterizing MPI on a
Massive Scale Blue Gene/P
P. Balaji, A. Chan, R. Thakur, W. Gropp and E. Lusk
Math. and Computer Sci., Argonne National Laboratory
Computer Science, University of Illinois at Urbana-Champaign
Massive Scale High End Computing
• We have passed the Petaflop barrier
– Two systems over the Petaflop mark in the Top500: LANL Roadrunner and ORNL Jaguar
– Argonne has a 163,840-core Blue Gene/P
– Lawrence Livermore has a 286,720-core Blue Gene/L
• Exaflop systems are expected by 2018-2020
– Expected to have more than a million processing elements
– Might be processors, cores, or SMTs
• Such large systems pose many challenges to the middleware that tries to take advantage of them
Hardware Sharing at Massive Scales
• At massive scales, number of hardware components
cannot increase exponentially with system size– Too expensive (cost plays a major factor!)
– E.g., Crossbar switches, Fat-tree networks
• At this scale, most systems do a lot of hardware sharing– Shared caches, shared communication engines, shared
networks
• More sharing means more contention– The challenge is how do we deal with this contention?
– More importantly: what’s the impact of such architectures?
Presentation Layout
• Introduction and Motivation
• Blue Gene/P Architectural Overview
• Performance Results and Analysis
• Conclusions and Future Work
Blue Gene/P Overview
• Second generation of the Blue Gene supercomputers
• Extremely energy-efficient design using low-power chips
– Four 850 MHz cores on each PPC450 processor
BG/P Network Stack
• Uses five specialized networks
– Two of them (10G and 1G Ethernet) are used for file I/O and system management
– The remaining three (3D torus, global collective network, global interrupt network) are used for MPI communication
• 3D torus: 6 bidirectional links for each node (a total of 5.1 GBps)
[Figure: 3D torus topology with X, Y, and Z axes]
BG/P Communication Middleware
• Three software stack layers:
– System Programming Interface (SPI)
• Directly above the hardware
• Most efficient, but very difficult to program and not portable!
– Deep Computing Messaging Framework (DCMF)
• Portability layer built on top of SPI
• Generalized message passing framework
• Allows different stacks to be built on top
– MPI
• Built on top of DCMF
• Most portable of the three layers (see the ping-pong sketch below)
• Based on MPICH2 (contributed back to Argonne as of 1.1a1)
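To make the MPI layer concrete, here is a minimal ping-pong sketch of the kind commonly used to measure the one-way latencies shown in the following slides. This is not the authors' benchmark code; the message size and iteration count are arbitrary illustrative choices.

/* Minimal MPI ping-pong latency sketch (illustrative, not the authors' code).
 * Rank 0 and rank 1 bounce a message back and forth; the one-way latency is
 * half of the average round-trip time. Build with mpicc, run with 2 ranks. */
#include <mpi.h>
#include <stdio.h>

#define ITERS    1000
#define MSG_SIZE 8      /* bytes; arbitrary small message */

int main(int argc, char **argv)
{
    int rank;
    char buf[MSG_SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}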
BG/P OS Stack
• Uses a lightweight kernel known as the Compute Node Kernel (CNK)
– Better integration between the hardware and software stacks
• No swap space
– Equal virtual and physical address space
– Static virtual-to-physical address translation
• Easier for devices to access a "virtual address region"
• (Mostly) symmetric address space
– Potential for direct memory access between processes
– Similar to SMARTMAP on Cray
Presentation Layout
• Introduction and Motivation
• Blue Gene/P Architectural Overview
• Performance Results and Analysis
• Conclusions and Future Work
Performance Results
• DMA Engine Behavior
• Impact of System Scale
• Network Congestion Characteristics
• Parallel Collective Communication
• Analysis of an Ocean Modeling Communication Kernel
Inter-node Performance
[Figure: One-way latency, in-cache vs. out-of-cache; latency (us) vs. message size (bytes)]
[Figure: Unidirectional bandwidth, in-cache vs. out-of-cache; bandwidth (Mbps) vs. message size (bytes)]
Intra-node Performance
[Figure: One-way latency, Core 1 and Core 2; latency (us) vs. message size (bytes)]
[Figure: Unidirectional bandwidth, Core 1 and Core 2; bandwidth (Mbps) vs. message size (bytes)]
Multi-Stream Communication
[Figure: Multi-stream bandwidth with 1, 2, 3, and 4 cores; bandwidth (Mbps) vs. message size (bytes)]
Fan Bandwidth Tests
[Figure: Fan-in bandwidth with 1, 2, and 3 peers; bandwidth (Mbps) vs. message size (bytes)]
[Figure: Fan-out bandwidth with 1, 2, 3, and 4 peers; bandwidth (Mbps) vs. message size (bytes)]
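A fan-out test of the kind plotted above is typically structured as one rank streaming nonblocking sends to several peers at once. The sketch below is illustrative only; the window and message sizes are arbitrary choices, not the authors' parameters.

/* Sketch of a fan-out bandwidth test: rank 0 streams messages to every other
 * rank concurrently using nonblocking sends. Illustrative parameters only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE (128 * 1024)   /* bytes per message */
#define WINDOW   16             /* messages in flight per peer */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *buf = malloc(MSG_SIZE);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    if (rank == 0) {
        int npeers = size - 1;
        MPI_Request *reqs = malloc((size_t)npeers * WINDOW * sizeof(MPI_Request));
        int r = 0;
        for (int w = 0; w < WINDOW; w++)
            for (int peer = 1; peer < size; peer++)
                MPI_Isend(buf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &reqs[r++]);
        MPI_Waitall(r, reqs, MPI_STATUSES_IGNORE);
        double secs = MPI_Wtime() - t0;
        printf("aggregate fan-out bandwidth: %.1f Mbps\n",
               (double)MSG_SIZE * WINDOW * npeers * 8.0 / secs / 1e6);
        free(reqs);
    } else {
        for (int w = 0; w < WINDOW; w++)
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Running it with one rank per node (for example, one sender node and three receiver nodes) exercises the node's DMA engine and network links rather than intra-node sharing.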
Performance Results
• DMA Engine Behavior
• Impact of System Scale
• Network Congestion Characteristics
• Parallel Collective Communication
• Analysis of an Ocean Modeling Communication Kernel
Impact of Number of Hops on Performance
[Figure: Degradation of one-way latency (%) vs. system size, from 4 to 122,880 processes, for 0-byte, 32-byte, 1K-byte, 32K-byte, and 1M-byte messages]
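As background for reading the figure (this model is not stated on the slide), small-message latency on a torus grows roughly linearly with the number of hops, and the worst-case hop count grows with the cube root of the node count, which is why larger partitions see more degradation:

$T(h) \approx T_0 + h \cdot t_{\text{hop}}, \qquad h_{\max} = \lfloor X/2 \rfloor + \lfloor Y/2 \rfloor + \lfloor Z/2 \rfloor \sim \tfrac{3}{2} N^{1/3}$

where $T_0$ is the software and injection overhead, $t_{\text{hop}}$ is the per-hop router delay, and $X \times Y \times Z = N$ are the torus dimensions (wraparound links assumed; small BG/P partitions are meshes, where $h_{\max}$ is correspondingly larger).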
Performance Results
• DMA Engine Behavior
• Impact of System Scale
• Network Congestion Characteristics
• Parallel Collective Communication
• Analysis of an Ocean Modeling Communication Kernel
Network Communication Behavior
• Network communication between pairs often uses overlapping links
– This can cause network congestion
– Communication throttling is a common approach to avoid such congestion
• On massive scale systems, getting network congestion feedback back to the source might not be very scalable
– Approach: if a link is busy, backpressure is applied to all of the remaining 5 inbound links
– Each DMA engine checks whether the link is busy before sending data
[Figure: Eight processes P0–P7 laid out along one dimension of the torus]
Network Congestion Behavior
[Figure: Congestion behavior with fully overlapped communication; bandwidth (Mbps) vs. message size (bytes) for the P2–P5 pair, the P3–P4 pair, and the no-overlap case]
Parallel Collective Performance
[Figure: MPI_Bcast time (us) for 16K-byte messages vs. system size]
[Figure: MPI_Allgather time (us) for 4-byte messages vs. system size (4 to 64K processes) with 1, 2, 3, and 4 communicators]
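Concurrent collectives over disjoint communicators, as in the MPI_Allgather figure, can be set up with MPI_Comm_split. The sketch below mirrors the figure's 4-byte element size and four communicators, but it is illustrative rather than the authors' benchmark.

/* Sketch: split MPI_COMM_WORLD into NCOMMS disjoint sub-communicators and run
 * a 4-byte MPI_Allgather concurrently on each, reporting the slowest one.
 * Illustrative only; NCOMMS and the iteration count are arbitrary choices. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NCOMMS 4
#define ITERS  100

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Interleave ranks across the sub-communicators. */
    MPI_Comm subcomm;
    MPI_Comm_split(MPI_COMM_WORLD, rank % NCOMMS, rank, &subcomm);

    int subsize, sendval = rank;
    MPI_Comm_size(subcomm, &subsize);
    int *recvbuf = malloc(subsize * sizeof(int));

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Allgather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT, subcomm);
    double local = (MPI_Wtime() - t0) * 1e6 / ITERS;   /* us per allgather */

    double slowest;
    MPI_Reduce(&local, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("slowest 4-byte allgather over %d communicators: %.1f us\n",
               NCOMMS, slowest);

    free(recvbuf);
    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}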
Performance Results
• DMA Engine Behavior
• Impact of System Scale
• Network Congestion Characteristics
• Parallel Collective Communication
• Analysis of an Ocean Modeling Communication Kernel
HALO: Modeling Ocean Modeling
• The NRL Layered Ocean Model (NLOM) simulates enclosed seas, major ocean basins, and the global ocean
• HALO was initially developed as the communication kernel for NLOM
– It gained popularity because of its similarity to other models as well (e.g., algebraic solvers)
– It gives a rough indication of the communication behavior of other models too, including CFD and nuclear physics
• It distributes data on a 2D logical process grid and performs a nearest-neighbor exchange along the logical grid (sketched below)
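The nearest-neighbor exchange that HALO performs can be illustrated with a 2D Cartesian communicator and MPI_Sendrecv. This is a generic halo-exchange pattern under assumed message sizes and a non-periodic grid; it is not the actual HALO source.

/* Sketch of a HALO-style nearest-neighbor exchange on a 2D logical process
 * grid, using a Cartesian communicator and MPI_Sendrecv in each direction.
 * Generic pattern only; NELEMS and the non-periodic grid are assumptions. */
#include <mpi.h>

#define NELEMS 256   /* doubles exchanged with each neighbor (assumed) */

int main(int argc, char **argv)
{
    int size, dims[2] = {0, 0}, periods[2] = {0, 0};
    double sendbuf[NELEMS], recvbuf[NELEMS];
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Let MPI factor the processes into a 2D logical grid. */
    MPI_Dims_create(size, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);

    /* Exchange with the four neighbors: +/- along each of the two dimensions.
     * At grid edges MPI_Cart_shift returns MPI_PROC_NULL, which MPI_Sendrecv
     * treats as a no-op. */
    for (int dim = 0; dim < 2; dim++) {
        for (int disp = -1; disp <= 1; disp += 2) {
            int src, dst;
            MPI_Cart_shift(cart, dim, disp, &src, &dst);
            MPI_Sendrecv(sendbuf, NELEMS, MPI_DOUBLE, dst, 0,
                         recvbuf, NELEMS, MPI_DOUBLE, src, 0,
                         cart, MPI_STATUS_IGNORE);
        }
    }

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}

How the MPI ranks are mapped onto the physical torus (the XYZ vs. YXZ mappings on the next slides) determines how many physical hops separate each pair of these logical neighbors.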
[Figure: Process Mapping (XYZ) on the 3D torus, with X, Y, and Z axes]
[Figure: Process Mapping (YXZ) on the 3D torus, with X, Y, and Z axes]
Nearest Neighbor Performance
[Figure: Execution time (us) vs. grid partition size (bytes) on 16K processors for the XYZT, TXYZ, and ZYXT mappings]
[Figure: Execution time (us) vs. grid partition size (bytes) on 128K processors for the XYZT, TXYZ, and ZYXT mappings]
Presentation Layout
• Introduction and Motivation
• Blue Gene/P Architectural Overview
• Performance Results and Analysis
• Conclusions and Future Work
Concluding Remarks and Future Work
• Increasing system scales are leading to large amounts of hardware sharing
– Shared caches, shared communication engines, shared networks
– More sharing means more contention
– What is the impact of such shared hardware on performance?
• We performed an analysis on Blue Gene/P
– Identified and characterized several performance issues
– Documented different areas where the behavior differs from cluster-like systems
• Future work: a description language for process mapping
Thank You!
Contacts:
{balaji, chan, thakur, lusk} @ mcs.anl.gov
wgropp @ illinois.edu
Web Links:
MPICH2: http://www.mcs.anl.gov/research/projects/mpich2