
Page 1: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Computer and Computational Sciences Division, Los Alamos National Laboratory

Ideas that change the world

Achieving Usability and Efficiency in

Large-Scale Parallel Computing Systems

Kei Davis and Fabrizio Petrini, {kei,fabrizio}@lanl.gov

Performance and Architectures Lab (PAL), CCS-3

Europar 2004, Pisa, Italy

Page 2: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Overview

In this part of the tutorial we discuss the characteristics of some of the most powerful supercomputers.

We classify these machines along three dimensions:
- Node Integration – how processors and the network interface are integrated in a computing node
- Network Integration – what primitive mechanisms the network provides to coordinate the processing nodes
- System Software Integration – how the operating system instances are globally coordinated

Page 3: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Overview

We argue that the level of integration in each of the three dimensions, more than other parameters (such as distributed vs. shared memory or vector vs. scalar processors), is the discriminating factor between large-scale supercomputers.

In this part of the tutorial we will briefly characterize some existing and upcoming parallel computers.

Page 4: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

ASCI Q: Los Alamos National Laboratory

Page 5: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

ASCI Q

Total — 20.48 TF/s, #3 in the top 500

Systems — 2048 AlphaServer ES45s

8,192 EV-68 1.25-GHz CPUs with 16-MB cache

Memory — 22 Terabytes

System Interconnect

Dual Rail Quadrics Interconnect

4096 QSW PCI adapters

Four 1024-way QSW federated switches

Operational in 2002

Page 6: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Node: HP (Compaq) AlphaServer ES45 (Alpha 21264) System Architecture

[Figure: ES45 node block diagram. Four EV68 1.25 GHz CPUs, each with 16 MB of cache, connected to quad C-chip controllers over 64-bit, 500 MHz links (4.0 GB/s each); up to 32 GB of memory on four memory motherboards (MMB 0-3) over two 256-bit, 125 MHz buses (4.0 GB/s each); PCI chip buses feeding ten 64-bit PCI slots (some hot-swap) at 33 MHz (266 MB/s) and 66 MHz (528 MB/s), plus PCI-USB and legacy I/O (serial, parallel, keyboard/mouse, floppy); 3.3 V and 5.0 V I/O.]

Page 7: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

QsNet: Quaternary Fat Tree

• Hardware support for collective communication
• MPI latency 4 µs, bandwidth 300 MB/s
• Barrier latency less than 10 µs

Page 8: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Interconnection Network

[Figure: ASCI Q interconnection network: a federation of sixteen 64U64D (64 up / 64 down) switches, the 1st serving nodes 0-63 and the 16th serving nodes 960-1023, joined through mid-level and super-top-level switch stages (switch levels 1-6); 1,024 nodes per rail (2x = 2,048 nodes).]

Page 9: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

System Software

- Operating system is Tru64
- Nodes are organized in clusters of 32 for resource allocation and administration purposes (TruCluster)
- Resource management executed through Ethernet (RMS)

Page 10: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

ASCI Q: Overview

- Node Integration: Low (multiple boards per node, network interface on I/O bus)
- Network Integration: High (HW support for atomic collective primitives)
- System Software Integration: Medium/Low (TruCluster)

Page 11: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

ASCI Thunder, 1,024 Nodes, 23 TF/s peak

Page 12: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

ASCI Thunder, Lawrence Livermore National Laboratory

• 1,024 nodes, 4,096 processors, 23 TF/s peak
• #2 in the top 500

Page 13: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

ASCI Thunder: Configuration

- 1,024 nodes, quad 1.4 GHz Itanium2, 8 GB DDR266 SDRAM per node (8 Terabytes total)
- 2.5 µs / 912 MB/s MPI latency and bandwidth over Quadrics Elan4
- Barrier synchronization 6 µs, allreduce 15 µs
- 75 TB of local disk (73 GB/node UltraSCSI320)
- Lustre file system with 6.4 GB/s delivered parallel I/O performance
- Linux RH 3.0, SLURM, CHAOS

Page 14: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems


CHAOS: Clustered High Availability Operating System
- Derived from Red Hat, but differs in the following areas:
  - Modified kernel (Lustre and hardware specific)
  - New packages for cluster monitoring, system installation, power/console management
  - SLURM, an open-source resource manager

Page 15: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

ASCI Thunder: Overview

- Node Integration: Medium/Low (network interface on I/O bus)
- Network Integration: Very High (HW support for atomic collective primitives)
- System Software Integration: Medium (CHAOS)

Page 16: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

System X: Virginia Tech

Page 17: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

System X, 10.28 TF/s

- 1,100 nodes, each with dual Apple G5 2 GHz CPUs
- 8 billion operations/second/processor (8 GFlops) peak double-precision floating-point performance
- Each node has 4 GB of main memory and 160 GB of Serial ATA storage; 176 TB total secondary storage
- InfiniBand, 8 µs and 870 MB/s latency and bandwidth, partial support for collective communication
- System-level fault tolerance (Déjà vu)

Page 18: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

System X: Overview

- Node Integration: Medium/Low (network interface on I/O bus)
- Network Integration: Medium (limited support for atomic collective primitives)
- System Software Integration: Medium (system-level fault tolerance)

Page 19: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

BlueGene/L System

Packaging hierarchy (peak performance, memory):
- Chip (2 processors): 2.8/5.6 GF/s, 4 MB
- Compute Card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
- Node Card (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR
- Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
- System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR

Page 20: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

BlueGene/L Compute ASIC

[Figure: BlueGene/L compute ASIC block diagram. Two PowerPC 440 cores (one acting as I/O processor), each with 32k/32k L1 caches and a "double FPU"; L2 caches and a multiported shared SRAM buffer; a shared L3 directory (with ECC) for 4 MB of embedded DRAM used as L3 cache or memory; a 144-bit-wide DDR controller with ECC for 256/512 MB of external memory; torus interface (6 out and 6 in links at 1.4 Gb/s each), tree interface (3 out and 3 in links at 2.8 Gb/s each), global interrupt (4 global barriers or interrupts), Gbit Ethernet, and JTAG access.]

• IBM CU-11, 0.13 µm
• 11 x 11 mm die size
• 25 x 32 mm CBGA
• 474 pins, 328 signal
• 1.5/2.5 Volt

Page 21: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems


Page 22: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

[Figure: BlueGene/L node card: 16 compute cards, 2 I/O cards, and DC-DC converters (40 V to 1.5 V and 2.5 V).]

Page 23: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems


Page 24: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

BlueGene/L Interconnection Networks

3-Dimensional Torus:
- Interconnects all compute nodes (65,536)
- Virtual cut-through hardware routing
- 1.4 Gb/s on all 12 node links (2.1 GBytes/s per node)
- 350/700 GBytes/s bisection bandwidth
- Communications backbone for computations

Global Tree:
- One-to-all broadcast functionality
- Reduction operations functionality
- 2.8 Gb/s of bandwidth per link
- Latency of tree traversal on the order of 5 µs
- Interconnects all compute and I/O nodes (1024)

Ethernet:
- Incorporated into every node ASIC
- Active in the I/O nodes (1:64)
- All external communication (file I/O, control, user interaction, etc.)

Low-Latency Global Barrier:
- 8 single wires crossing the whole system, touching all nodes

Control Network (JTAG):
- For booting, checkpointing, error logging

Page 25: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

BlueGene/L System Software Organization

Compute nodes dedicated to running user application, and almost nothing else - simple compute node kernel (CNK)

I/O nodes run Linux and provide O/S services

file access, process launch/termination, debugging

Service nodes perform system management services (e.g., system boot, heart beat, error monitoring) - largely transparent to application/system software

Page 26: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Operating Systems

Compute nodes: CNK
- Specialized, simple O/S: 5,000 lines of code, 40 KBytes in core
- No thread support, no virtual memory
- Protection: protects the kernel from the application; some network devices in userspace
- File I/O offloaded ("function shipped") to the I/O nodes through kernel system calls
- "Boot, start app and then stay out of the way"

I/O nodes: Linux
- 2.4.19 kernel (2.6 underway) with ramdisk
- NFS/GPFS client
- CIO daemon to start/stop jobs and execute file I/O

Global O/S (CMCS, service node)
- Invisible to user programs
- Global and collective decisions
- Interfaces with external policy modules (e.g., job scheduler)
- Commercial database technology (DB2) stores static and dynamic state
- Partition selection, partition boot, running of jobs, system error logs, checkpoint/restart mechanism
- Scalability, robustness, security
- Execution mechanisms in the core; policy decisions in the service node

Page 27: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

BlueGene/L: Overview

- Node Integration: High (the processing node integrates processors and network interfaces; the network interfaces are directly connected to the processors)
- Network Integration: High (separate tree network)
- System Software Integration: Medium/High (compute kernels are not globally coordinated)
- #2 and #4 in the top500

Page 28: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Cray XD1

Page 29: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Cray XD1 System Architecture

Compute:
- 12 AMD Opteron 32/64-bit x86 processors
- High Performance Linux

RapidArray Interconnect:
- 12 communications processors
- 1 Tb/s switch fabric

Active Management:
- Dedicated processor

Application Acceleration:
- 6 co-processors

Processors directly connected to the interconnect

Page 30: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Cray XD1 Processing Node

[Figure: Cray XD1 processing node chassis (front and rear views): six 2-way SMP blades, six SATA hard drives, four independent PCI-X slots, a 500 Gb/s crossbar switch with a 12-port inter-chassis connector, a connector to a second 500 Gb/s crossbar switch and 12-port inter-chassis connector, and 4 fans.]

Page 31: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Cray XD1 Compute Blade

[Figure: Cray XD1 compute blade: two AMD Opteron 2XX processors, each with 4 DIMM sockets for registered ECC DDR 400 memory, a RapidArray communications processor, and a connector to the main board.]

Page 32: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Fast Access to the Interconnect

[Figure: memory, I/O and interconnect bandwidth comparison.
- Cray XD1: 6.4 GB/s memory bandwidth (DDR 400), 8 GB/s to the interconnect
- Xeon server: 5.3 GB/s memory bandwidth (DDR 333), 1 GB/s I/O (PCI-X), 0.25 GB/s interconnect (GigE)]

Page 33: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Communications Optimizations

RapidArray Communications Processor:
- HT/RA tunnelling with bonding
- Routing with route redundancy
- Reliable transport
- Short message latency optimization
- DMA operations
- System-wide clock synchronization

[Figure: an AMD Opteron 2XX processor connected to the RapidArray communications processor over a 3.2 GB/s HyperTransport link, with 2 GB/s RapidArray (RA) links into the fabric.]

Page 34: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Active Management Software

- Usability: Single System Command and Control
- Resiliency: dedicated management processors, real-time OS and communications fabric; proactive background diagnostics with self-healing
- Synchronized Linux kernels
- Active Manager System

Page 35: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Cray XD1: Overview

- Node Integration: High (direct access from HyperTransport to RapidArray)
- Network Integration: Medium/High (HW support for collective communication)
- System Software Integration: High (compute kernels are globally coordinated)
- Early stage

Page 36: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

ASCI Red Storm

Page 37: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Red Storm Architecture

- Distributed memory MIMD parallel supercomputer
- Fully connected 3D mesh interconnect; each compute node processor has a bi-directional connection to the primary communication network
- 108 compute node cabinets and 10,368 compute node processors (AMD Sledgehammer @ 2.0 GHz)
- ~10 TB of DDR memory @ 333 MHz
- Red/Black switching: ~1/4, ~1/2, ~1/4
- 8 service and I/O cabinets on each end (256 processors for each color)
- 240 TB of disk storage (120 TB per color)

Page 38: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Red Storm Architecture

Functional hardware partitioning: service and I/O nodes, compute nodes, and RAS nodes

Partitioned Operating System (OS): LINUX on service and I/O nodes, LWK (Catamount) on compute nodes, stripped down LINUX on RAS nodes

Separate RAS and system management network (Ethernet)

Router table-based routing in the interconnect

Page 39: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Red Storm architecture

[Figure: functional partitioning of Red Storm: users connect through service nodes; separate network I/O and file I/O partitions (including /home) surround the compute partition.]

Page 40: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

System Layout (27 x 16 x 24 mesh)

[Figure: system layout: a normally unclassified section and a normally classified section at either end, switchable nodes in the middle, and disconnect cabinets between them.]

Page 41: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Red Storm System Software

Run-Time System:
- Logarithmic loader
- Fast, efficient node allocator
- Batch system – PBS
- Libraries – MPI, I/O, Math
- File systems being considered include PVFS (interim file system), Lustre (Pathforward support), Panasas…

Operating Systems:
- LINUX on service and I/O nodes
- Sandia's LWK (Catamount) on compute nodes
- LINUX on RAS nodes

Page 42: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

ASCI Red Storm: Overview

- Node Integration: High (direct access from HyperTransport to the network through a custom network interface chip)
- Network Integration: Medium (no support for collective communication)
- System Software Integration: Medium/High (scalable resource manager, no global coordination between nodes)
- Expected to become the most powerful machine in the world (competition permitting)

Page 43: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Overview

Summary of the per-machine assessments above:
- ASCI Q: Node Integration Low, Network Integration High, Software Integration Medium/Low
- ASCI Thunder: Node Integration Medium/Low, Network Integration Very High, Software Integration Medium
- System X: Node Integration Medium/Low, Network Integration Medium, Software Integration Medium
- BlueGene/L: Node Integration High, Network Integration High, Software Integration Medium/High
- Cray XD1: Node Integration High, Network Integration Medium/High, Software Integration High
- Red Storm: Node Integration High, Network Integration Medium, Software Integration Medium/High

Page 44: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

A Case Study: ASCI Q

We try to provide some insight into what we perceive to be the important problems in a large-scale supercomputer.

Our hands-on experience with ASCI Q shows that system software and global coordination are fundamental in a large-scale parallel machine.

Page 45: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

ASCI Q

- 2,048 ES45 AlphaServers, with 4 processors/node
- 16 GB of memory per node
- 8,192 processors in total
- 2 independent network rails, Quadrics Elan3
- > 8,192 cables
- 20 Tflops peak, #2 in the top 500 lists
- A complex human artifact

Page 46: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Dealing with the complexity of a real system

In this section of the tutorial we provide insight into the methodology that we used to substantially improve the performance of ASCI Q.

This methodology is based on an arsenal of:
- analytical models
- custom microbenchmarks
- full applications
- discrete event simulators

We deal with the complexity of the machine and the complexity of a real parallel application, SAGE, with > 150,000 lines of Fortran and MPI code.

Page 47: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Overview

- Our performance expectations for ASCI Q and the reality
- Identification of performance factors:
  - Application performance and breakdown into components
  - Detailed examination of system effects
- A methodology to identify operating system effects:
  - Effect of scaling – up to 2,000 nodes / 8,000 processors
  - Quantification of the impact
- Towards the elimination of overheads: demonstrated over 2x performance improvement
- Generalization of our results: application resonance
- Bottom line: the importance of the integration of the various system activities across nodes

Page 48: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

SAGE Performance (QA & QB)

[Figure: SAGE cycle time (s) vs. number of PEs (0-4096) on 1,024 nodes, comparing the performance model with the Sep-21-02 and Nov-25-02 measurements. Lower is better.]

- Performance consistent across QA and QB (the two segments of ASCI Q, with 1,024 nodes / 4,096 processors each)
- Measured time 2x greater than the model (4,096 PEs)
- There is a difference – why?

Page 49: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Using fewer PEs per Node

Test performance using 1, 2, 3 and 4 PEs per node

[Figure: SAGE on QB (timing.input), cycle time (s) vs. number of PEs for 1, 2, 3 and 4 PEs per node. Lower is better.]

Page 50: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Using fewer PEs per node (2)

Measurements match the model almost exactly for 1, 2 and 3 PEs per node!

[Figure: SAGE on QB (timing.input), error (s) = measured minus model vs. number of PEs for 1, 2, 3 and 4 PEs per node.]

Performance issue only occurs when using 4 PEs per node

Page 51: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Mystery #1

SAGE performs significantly worse on ASCI Q than was predicted by our model

Page 52: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

SAGE performance components

Look at SAGE in terms of its main components:
- Put/Get (point-to-point boundary exchange)
- Collectives (allreduce, broadcast, reduction)

[Figure: SAGE on QB, breakdown of time per cycle (timing.input) vs. number of PEs: token_allreduce, token_bcast, token_get, token_put, token_reduction, and total cycle time (cyc_time).]

Performance issue seems to occur only on collective operations

Page 53: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Performance of the collectives

Measure collective performance separately

[Figure: allreduce latency (ms) vs. number of nodes (up to 1,000) for 1, 2, 3 and 4 processes per node.]

Collectives (e.g., allreduce and barrier) mirror the performance of the application
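
The allreduce measurements above come from simple microbenchmarks. Below is a minimal sketch of that kind of benchmark, assuming C and MPI; the iteration count, message size (a single double), and reduction operation are illustrative choices, not the exact benchmark used on ASCI Q. Running it with 1, 2, 3 or 4 ranks per node reproduces the kind of per-node curves shown in the figure.

```c
/* allreduce_bench.c -- illustrative allreduce latency microbenchmark.
 * Compile: mpicc -O2 allreduce_bench.c -o allreduce_bench
 * Run:     mpirun -np <P> ./allreduce_bench
 */
#include <mpi.h>
#include <stdio.h>

#define ITERS 10000

int main(int argc, char **argv)
{
    int rank, nprocs, i;
    double in = 1.0, out, t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Warm up the collective path before timing. */
    for (i = 0; i < 100; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < ITERS; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    /* Mean latency seen by rank 0; the per-node curves come from running
     * this with 1, 2, 3 or 4 ranks per node. */
    if (rank == 0)
        printf("%d ranks: mean allreduce latency = %.2f us\n",
               nprocs, (t1 - t0) / ITERS * 1e6);

    MPI_Finalize();
    return 0;
}
```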

Page 54: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Identifying the problem within SAGE

[Diagram: simplify step by step from the full SAGE application down to its allreduce operations.]

Page 55: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Exposing the problems with simple benchmarks

[Diagram: starting from simple benchmarks, add complexity until the behavior of allreduce is reproduced.]

Challenge: identify the simplest benchmark that exposes the problem

Page 56: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Interconnection network and communication libraries

The initial (obvious) suspects were the interconnection network and the MPI implementation

We tested in depth the network, the low level transmission protocols and several allreduce algorithms

We also implemented allreduce in the Network Interface Card

By changing the synchronization mechanism we were able to reduce the latency of an allreduce benchmark by a factor of 7

But we only got small improvements in Sage (5%)

Page 57: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Mystery #2

Although SAGE spends half of its time in allreduce (at 4,096 processors), making allreduce 7 times faster leads to a small performance improvement

Page 58: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Computational noise

After having ruled out the network and MPI we focused our attention on the compute nodes

Our hypothesis is that the computational noise is generated inside the processing nodes

This noise “freezes” a running process for a certain amount of time and generates a “computational” hole

Page 59: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Computational noise: intuition

Running 4 processes on all 4 processors of an Alphaserver ES45


The computation of one process is interrupted by an external event (e.g., system daemon or kernel)

Page 60: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Computational noise: 3 processes on 3 processors

Running 3 processes on 3 processors of an AlphaServer ES45; the fourth processor is idle

The “noise” can run on the 4th processor without interrupting the other 3 processes

Page 61: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Coarse-grained measurement

We execute a computational loop for 1,000 seconds on all 4,096 processors of QB

[Diagram: all processes P1-P4 are timed from a common START to END.]
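
A sketch of how such a coarse-grained measurement can be structured, assuming C and MPI; the slide gives only the 1,000-second target, so the work loop, iteration count, and calibration constants below are illustrative placeholders.

```c
/* coarse_noise.c -- illustrative coarse-grained noise measurement:
 * each processor runs the same fixed amount of pure computation
 * (calibrated to ~1000 s on an idle node) and reports its wall-clock
 * time; anything above the calibrated time is attributed to noise.
 */
#include <mpi.h>
#include <stdio.h>

static double work(long n)          /* pure computation, no I/O, no messages */
{
    double x = 0.0;
    long i;
    for (i = 0; i < n; i++)
        x += (double)i * 1e-9;
    return x;
}

int main(int argc, char **argv)
{
    /* Illustrative values: choose TOTAL_ITERS so that work(TOTAL_ITERS)
     * takes about 1000 s on an unloaded processor, and set IDEAL_TIME
     * to the time measured in that calibration run. */
    const long   TOTAL_ITERS = 2000000000L;
    const double IDEAL_TIME  = 1000.0;

    int rank;
    double t0, elapsed, sink;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);    /* common starting point on all processors */
    t0 = MPI_Wtime();
    sink = work(TOTAL_ITERS);
    elapsed = MPI_Wtime() - t0;

    /* Per-process slowdown: on ASCI Q this came out between 1% and 2.5%. */
    printf("rank %d: %.1f s, slowdown %.2f%% (sink=%g)\n",
           rank, elapsed, 100.0 * (elapsed - IDEAL_TIME) / IDEAL_TIME, sink);

    MPI_Finalize();
    return 0;
}
```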

Page 62: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Coarse-grained computational overhead per process

The slowdown per process is small, between 1% and 2.5%

[Figure: coarse-grained slowdown per process; lower is better.]

Page 63: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Mystery #3

Although the "noise" hypothesis could explain SAGE's suboptimal performance, the microbenchmarks of per-processor noise indicate that at most 2.5% of performance is lost to noise

Page 64: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Fine-grained measurement

We run the same benchmark for 1000 seconds, but we measure the run time every millisecond

This fine granularity is representative of many ASCI codes
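
A sketch of the fine-grained version, again assuming C and MPI; the 1 ms chunk size is from the slides, while the calibration constant, chunk count and histogram layout are illustrative. The histogram of chunk durations is what the per-node distribution slides that follow plot.

```c
/* fine_noise.c -- illustrative fine-grained noise measurement:
 * time many nominally 1 ms chunks of computation and build a histogram
 * of how long each chunk actually took; on a noiseless machine every
 * chunk lands in the 1 ms bucket, and every outlier is an interruption.
 */
#include <mpi.h>
#include <stdio.h>

#define NCHUNKS  1000000           /* ~1000 s of 1 ms chunks */
#define NBUCKETS 1000              /* histogram buckets, 1 ms wide */

static double work_1ms(long iters) /* calibrated to take ~1 ms when undisturbed */
{
    double x = 0.0;
    long i;
    for (i = 0; i < iters; i++)
        x += (double)i * 1e-9;
    return x;
}

int main(int argc, char **argv)
{
    /* CALIB_ITERS is illustrative: it must be tuned per machine so that
     * work_1ms() takes about one millisecond on an idle processor. */
    const long CALIB_ITERS = 200000L;
    static long hist[NBUCKETS];
    double sink = 0.0;
    int rank;
    long n;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (n = 0; n < NCHUNKS; n++) {
        double t0 = MPI_Wtime();
        sink += work_1ms(CALIB_ITERS);
        double ms = (MPI_Wtime() - t0) * 1e3;
        int bucket = (int)ms;                 /* 1 ms wide buckets */
        if (bucket < 0) bucket = 0;
        if (bucket >= NBUCKETS) bucket = NBUCKETS - 1;
        hist[bucket]++;
    }

    /* Dump the non-empty buckets; plotting them per node exposes the
     * cluster-manager, quorum-node and RMS-node signatures. */
    for (n = 0; n < NBUCKETS; n++)
        if (hist[n] > 0)
            printf("rank %d: bucket %ld ms: %ld chunks\n", rank, n, hist[n]);

    printf("rank %d done (sink=%g)\n", rank, sink);
    MPI_Finalize();
    return 0;
}
```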

Page 65: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Fine-grained computational overhead per node

We now compute the slowdown per-node, rather than per-process

The noise has a clear per-cluster structure

[Figure: fine-grained slowdown per node across the machine; the optimum is 0 (lower is better).]

Page 66: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Finding #1

Analyzing noise on a per-node basis reveals a regular structure across nodes

Page 67: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Noise in a 32-Node Cluster

The Q machine is organized in 32-node clusters (TruCluster). In each cluster there is a cluster manager (node 0), a quorum node (node 1) and the RMS data collection (node 31).

Page 68: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Per-node noise distribution

- Plot the distribution of one million 1 ms computational chunks
- In an ideal, noiseless machine the distribution is a single bar at 1 ms with 1 million points per process (4 million per node)
- Every outlier identifies a computation that was delayed by external interference
- We show the distributions for the standard cluster node, and also for nodes 0, 1 and 31

Page 69: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Cluster Node (2-30)

10% of the time the execution of the 1 ms chunk of computation is delayed

Page 70: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Node 0, Cluster Manager

We can identify 4 main sources of noise

Page 71: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Node 1, Quorum Node

One source of heavyweight noise (335 ms!)

Page 72: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Node 31

Many fine grained interruptions, between 6 and 8 milliseconds

Page 73: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

The effect of the noise

An application is usually a sequence of a computation followed by a synchronization (collective):

[Diagram: per-node timelines alternating computation and synchronization.]

But if an event happens on a single node, then it can affect all the other nodes

Page 74: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Effect of System Size

The probability of a random event occurring increases with the node count.

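One way to make this concrete, as a hedged back-of-the-envelope sketch in C (the per-node interruption probability p below is purely illustrative): if each node is independently interrupted during a compute granule with probability p, the chance that at least one of N nodes is hit, and therefore that the whole synchronized step is delayed, is 1 - (1 - p)^N, which grows quickly with N.

```c
/* noise_scaling.c -- probability that at least one node is interrupted
 * during a single compute/synchronize step, as a function of node count.
 * Compile: cc noise_scaling.c -lm
 */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double p = 0.01;                     /* illustrative: 1% per node per granule */
    const int counts[] = { 1, 32, 256, 1024, 2048 };
    int i;

    for (i = 0; i < (int)(sizeof counts / sizeof counts[0]); i++) {
        int n = counts[i];
        double p_step = 1.0 - pow(1.0 - p, n); /* P(at least one node delayed) */
        printf("%5d nodes: P(step delayed) = %.3f\n", n, p_step);
    }
    return 0;
}
```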

Page 75: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Tolerating Noise: Buffered Coscheduling (BCS)

We can tolerate the noise by coscheduling the activities of the system software on each node

Page 76: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Discrete Event Simulator: used to model noise

- The DES is used to examine and identify the impact of noise: it takes as input the harmonics that characterize the noise
- The noise model closely approximates the experimental data
- The primary bottleneck is the fine-grained noise generated by the compute nodes (Tru64)

[Figure: simulated and measured performance; lower is better.]
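
The real tool was a discrete event simulator driven by the measured noise harmonics; the following is a much-simplified Monte Carlo stand-in, assuming C, with illustrative event rates and durations (loosely inspired by the 10% fine-grained delays on ordinary cluster nodes and the rare 335 ms events on a few special nodes). Each iteration, every node computes one granule plus whatever noise hits it, and the barrier cost is set by the slowest node; comparing the two scenarios gives the flavor of Finding #2 on the next slide.

```c
/* noise_sim.c -- simplified Monte Carlo stand-in for the noise model:
 * per-iteration time = max over nodes of (granule + noise), because the
 * barrier waits for the slowest node. Parameters are illustrative, not
 * the measured ASCI Q harmonics.
 */
#include <stdio.h>
#include <stdlib.h>

#define NODES   1024
#define ITERS   100000
#define GRANULE 1.0e-3                 /* 1 ms compute granule */

/* prob_hit: per-node probability of an interruption during one granule
 * noise_len: duration of one interruption (seconds)
 * noisy_nodes: how many nodes are subject to this noise source */
static double simulate(double prob_hit, double noise_len, int noisy_nodes)
{
    double total = 0.0;
    int it, n;

    for (it = 0; it < ITERS; it++) {
        double slowest = GRANULE;
        for (n = 0; n < noisy_nodes; n++) {
            double t = GRANULE;
            if (drand48() < prob_hit)
                t += noise_len;        /* this node was interrupted */
            if (t > slowest)
                slowest = t;           /* the barrier waits for the slowest node */
        }
        total += slowest;
    }
    return total / (ITERS * GRANULE);  /* slowdown relative to a noiseless run */
}

int main(void)
{
    srand48(42);

    /* Short, frequent noise on every node (like the ordinary cluster nodes)
     * vs. long, rare noise on a few special nodes (like the quorum nodes). */
    printf("short+frequent, all nodes: slowdown %.2fx\n",
           simulate(0.10, 1.0e-3, NODES));
    printf("long+rare, few nodes:      slowdown %.2fx\n",
           simulate(1.0e-6, 335.0e-3, 32));
    return 0;
}
```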

Page 77: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Finding #2

On fine-grained applications, more performance is lost to short but frequent noise on all nodes than to long but less frequent noise on just a few nodes

Page 78: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Incremental noise reduction

1. removed about 10 daemons from all nodes (including: envmod, insightd, snmpd, lpd, niff)

2. decreased RMS monitoring frequency by a factor of 2 on each node (from an interval of 30s to 60s)

3. moved several daemons from nodes 1 and 2 to node 0 on each cluster.

Page 79: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Improvements in the Barrier Synchronization Latency

Page 80: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Resulting SAGE Performance

Nodes 0 and 31 also configured out in the optimization

[Figure: SAGE cycle time (s) vs. number of PEs after noise removal. Left panel: up to 4,096 PEs, comparing the model with the Sep-21-02, Nov-25-02, Jan-27-03 and Jan-27-03 (min) measurements. Right panel: up to 8,192 PEs, comparing the model with the Sep-21-02, Nov-25-02, Jan-27-03, May-01-03 and May-01-03 (min) measurements.]

Page 81: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Finding #3

We were able to double SAGE's performance by selectively removing noise caused by several types of system activities

Page 82: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Generalizing our results: application resonance

The computational granularity of a balanced bulk-synchronous application correlates with the type of noise that affects it most.

Intuition: any noise source has a negative impact, but a few noise sources tend to have a major impact on a given application.

Rule of thumb: the computational granularity of the application "enters into resonance" with noise of the same order of magnitude.

The performance can be enhanced by selectively removing sources of noise.

We can provide a reasonable estimate of the performance improvement knowing the computational granularity of a given application.

Page 83: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Cumulative Noise Distribution, Sequence of Barriers with No Computation

Most of the latency is generated by the fine-grained, high-frequency noise of the cluster nodes

Page 84: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

Conclusions

- Combination of measurement, simulation and modeling to identify and resolve performance issues on Q
- Used modeling to determine that a problem exists
- Developed computation kernels to quantify O/S events:
  - The effect increases with the number of nodes
  - The impact is determined by the computational granularity of an application
- Application performance has significantly improved
- The method is also being applied to other large systems

Page 85: Achieving Usability and Efficiency in Large-Scale Parallel Computing Systems

About the authors

Kei Davis is a team leader and technical staff member at Los Alamos National Laboratory (LANL) where he is currently working on system software solutions for reliability and usability of large-scale parallel computers. Previous work at LANL includes computer system performance evaluation and modeling, large-scale computer system simulation, and parallel functional language implementation. His research interests are centered on parallel computing; more specifically, various aspects of operating systems, parallel programming, and programming language design and implementation. Kei received his PhD in Computing Science from Glasgow University and his MS in Computation from Oxford University. Before his appointment at LANL he was a research scientist at the Computing Research Laboratory at New Mexico State University.

Fabrizio Petrini is a member of the technical staff of the CCS-3 group at Los Alamos National Laboratory (LANL). He received his PhD in Computer Science from the University of Pisa in 1997. Before his appointment at LANL he was a research fellow at the Computing Laboratory of Oxford University (UK), a postdoctoral researcher at the University of California at Berkeley, and a member of the technical staff of Hewlett-Packard Laboratories. His research interests include various aspects of supercomputers, including high-performance interconnection networks and network interfaces, job scheduling algorithms, parallel architectures, operating systems, and parallel programming languages. He has received numerous awards from the NNSA for contributions to supercomputing projects, and from other organizations for scientific publications.