
Parallel Scalable Operating Systems


Page 1: Parallel Scalable Operating Systems

Parallel Scalable Operating Systems

Presented by Dr. Florin Isaila

Universidad Carlos III de Madrid Visiting Scholar Argonne National Lab

Page 2: Parallel Scalable Operating Systems

Contents

Preliminaries: Top500, scalability

Blue Gene: history; BG/L, BG/C, BG/P, BG/Q

Scalable OS for BG/L

Scalable file systems

BG/P at Argonne National Lab

Conclusions

Page 3: Parallel Scalable Operating Systems
Page 4: Parallel Scalable Operating Systems

Top500

Page 5: Parallel Scalable Operating Systems

Generalities

Published twice a year since 1993: June and November

Ranking of the most powerful computing systems in the world

Ranking criterion: performance on the LINPACK benchmark

Jack Dongarra is a founding author of the list

Web site: www.top500.org

Page 6: Parallel Scalable Operating Systems

HPL: High-Performance Linpack

Solves a dense system of linear equations, using a variant of LU factorization on matrices of size N

Measures a computer's floating-point rate of execution; computation is done in 64-bit floating-point arithmetic

Rpeak: theoretical peak system performance, an upper bound on the real performance (in MFLOPS). Example: an Intel Itanium 2 at 1.5 GHz executes 4 floating-point operations per cycle -> 6 GFLOPS

Nmax: problem size obtained by varying N and choosing the one with maximum performance

Rmax: maximum real performance, achieved for Nmax

N1/2: problem size needed to achieve half of Rmax
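To make the Rpeak and efficiency definitions concrete, here is a minimal sketch in C using the Itanium 2 numbers from this slide; the Rmax value is a hypothetical measured result, for illustration only:

```c
#include <stdio.h>

int main(void) {
    /* Itanium 2 example from the slide: 1.5 GHz, 4 floating-point ops per cycle */
    double clock_ghz = 1.5;
    double flops_per_cycle = 4.0;
    double rpeak_gflops = clock_ghz * flops_per_cycle;   /* 6 GFLOPS per processor */

    /* Hypothetical measured HPL result (Rmax), for illustration only */
    double rmax_gflops = 4.8;

    printf("Rpeak = %.1f GFLOPS\n", rpeak_gflops);
    printf("HPL efficiency = Rmax/Rpeak = %.0f%%\n", 100.0 * rmax_gflops / rpeak_gflops);
    return 0;
}
```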

Page 7: Parallel Scalable Operating Systems

Jack Dongarra's slide

Page 8: Parallel Scalable Operating Systems
Page 9: Parallel Scalable Operating Systems

Amdahl's law

Suppose a fraction f of your application is not parallelizable, and the remaining fraction 1-f is perfectly parallelizable over p processors. Then

Speedup(p) = T1 / Tp <= T1 / (f*T1 + (1-f)*T1/p) = 1 / (f + (1-f)/p) <= 1/f
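As a quick illustration of this bound, here is a minimal sketch in C; the serial fractions 0 to 0.04 and p = 1024 match the chart on the next slide:

```c
#include <stdio.h>

/* Amdahl's law: speedup with serial fraction f on p processors */
static double amdahl_speedup(double f, int p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    int p = 1024;
    /* Even a 1% serial fraction caps the speedup far below p */
    for (int i = 0; i <= 4; i++) {
        double f = 0.01 * i;
        printf("f = %.2f  ->  speedup on %d processors = %.1f\n", f, p, amdahl_speedup(f, p));
    }
    return 0;
}
```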

Page 10: Parallel Scalable Operating Systems

Amdahl's Law (for 1024 processors)

[Chart: speedup (0 to 1024) versus serial fraction s (0 to 0.04), showing how quickly the achievable speedup collapses as s grows]

Page 11: Parallel Scalable Operating Systems

Load Balance

Work includes data access and computation. It is not enough to give each processor equal work; processors must also be busy at the same time.

Speedup <= Sequential Work / Max Work on any Processor

Example (bar chart): sequential work of 1000 units, distributed as 200, 100, 400 and 300 units across processors 1-4, gives Speedup <= 1000/400 = 2.5.

Page 12: Parallel Scalable Operating Systems

Communication and synchronization

Communication is expensive! A useful measure is the communication-to-computation ratio.

Inherent communication is determined by the assignment of tasks to processes; the actual communication may be larger (artifactual communication).

One principle: assign tasks that access the same data to the same process.

Speedup <= Sequential Work / Max (Work + Synch Wait Time + Comm Cost)

[Timeline diagram: work, synchronization wait time and communication intervals for processes 1-3, separated by synchronization points]
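A minimal sketch in C of this bound, reusing the per-processor work distribution from the load-balance example (400, 300, 200, 100 units) together with hypothetical synchronization and communication costs:

```c
#include <stdio.h>

#define NPROCS 4

int main(void) {
    /* Per-process costs, in arbitrary time units */
    double work[NPROCS] = {400, 300, 200, 100};   /* load-balance example from the previous slide */
    double sync[NPROCS] = { 20,  40,  60,  80};   /* hypothetical synchronization wait times */
    double comm[NPROCS] = { 30,  30,  30,  30};   /* hypothetical communication costs */
    double sequential_work = 1000;

    double max_total = 0;
    for (int i = 0; i < NPROCS; i++) {
        double total = work[i] + sync[i] + comm[i];
        if (total > max_total)
            max_total = total;
    }

    /* Speedup <= Sequential Work / Max(Work + Synch Wait Time + Comm Cost) */
    printf("Speedup bound = %.2f\n", sequential_work / max_total);
    return 0;
}
```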

Page 13: Parallel Scalable Operating Systems

Blue Gene

Page 14: Parallel Scalable Operating Systems

Blue Gene partners

IBM ("Blue": the corporate color of IBM; "Gene": the intended use of the Blue Gene clusters, namely computational biology, specifically protein folding)

Lawrence Livermore National Lab

Department of Energy

Academia

Page 15: Parallel Scalable Operating Systems

BG History

December 1999: project created, a supercomputer for studying biomolecular phenomena

29 September 2004: the Blue Gene/L prototype overtook NEC's Earth Simulator (36.01 TFLOPS, 8 cabinets)

November 2004: Blue Gene/L reaches 70.82 TFLOPS (16 cabinets)

24 March 2005: Blue Gene/L broke its own record, reaching 135.5 TFLOPS (32 cabinets)

June 2005: Blue Gene systems world-wide took 5 of the top 10 positions on top500.org

27 October 2006: 280.6 TFLOPS (65,536 compute nodes and an additional 1,024 I/O nodes in 64 air-cooled cabinets)

November 2007: 478.2 TFLOPS

Page 16: Parallel Scalable Operating Systems

Family

BG/L BG/C BG/P BG/Q

Page 17: Parallel Scalable Operating Systems

Blue Gene/L

Packaging hierarchy (peak performance, memory):

Chip: 2 processors - 2.8/5.6 GF/s, 4 MB

Compute Card: 2 chips (1x2x1) - 5.6/11.2 GF/s, 1.0 GB

Node Card: 32 chips (4x4x2), 16 compute cards, 0-2 I/O cards - 90/180 GF/s, 16 GB

Rack: 32 node cards - 2.8/5.6 TF/s, 512 GB

System: 64 racks (64x32x32) - 180/360 TF/s, 32 TB

Page 18: Parallel Scalable Operating Systems

Technical specifications

64 cabinets containing 65,536 high-performance compute nodes (chips) and 1,024 I/O nodes

32-bit PowerPC processors

5 networks

The main memory has a total size of about 33 terabytes

Maximum performance of 183.5 TFLOPS when using one processor per node for computation and the other for communication, and 367 TFLOPS when using both for computation
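These peak figures follow from the per-core numbers given elsewhere in the deck (700 MHz PowerPC 440, 4 floating-point operations per cycle via the double FPU); a minimal sketch of the arithmetic:

```c
#include <stdio.h>

int main(void) {
    double clock_ghz = 0.7;          /* PowerPC 440 at 700 MHz */
    double flops_per_cycle = 4.0;    /* double FPU: two fused multiply-adds per cycle */
    double gflops_per_core = clock_ghz * flops_per_cycle;   /* 2.8 GF/s per core */
    int nodes = 65536;

    /* One core per node computing (coprocessor mode) vs. both cores computing */
    printf("Peak, 1 core/node:  %.1f TFLOPS\n", nodes * gflops_per_core / 1000.0);
    printf("Peak, 2 cores/node: %.1f TFLOPS\n", nodes * 2 * gflops_per_core / 1000.0);
    return 0;
}
```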

Page 19: Parallel Scalable Operating Systems

Blue Gene / L

Networks:

3D torus

Collective network

Global barrier/interrupt

Gigabit Ethernet (I/O and connectivity)

Control (system boot, debug, monitoring)

Page 20: Parallel Scalable Operating Systems

Networks

Three-dimensional torus - compute nodes

Global tree - collective communication and I/O

Ethernet

Control network

Page 21: Parallel Scalable Operating Systems

Three-dimensional (3D) torus network in which the nodes (red balls) are connected to their six nearest-neighbor nodes in a 3D mesh.

Page 22: Parallel Scalable Operating Systems

Blue Gene / L

Processor: PowerPC 440 at 700 MHz; low power allows dense packaging

External memory: 512 MB (or 1 GB) SDRAM per node

Slow embedded core at a clock speed of 700 MHz

32 KB L1 cache

L2 is a small prefetch buffer

4 MB embedded DRAM L3 cache

Page 23: Parallel Scalable Operating Systems

PowerPC 440 core

Page 24: Parallel Scalable Operating Systems

BG/L compute ASIC

Non-cache-coherent L1 caches

L2 prefetch buffer

Shared 4 MB embedded DRAM (L3)

Interface to external DRAM

5 network interfaces: torus, collective, global barrier, Ethernet, control

Page 25: Parallel Scalable Operating Systems

Block diagram

Page 26: Parallel Scalable Operating Systems

Blue Gene / L

Compute nodes: dual-processor, 1,024 per rack

I/O nodes: dual-processor, 16-128 per rack

Page 27: Parallel Scalable Operating Systems

Blue Gene / L

Compute nodes: proprietary kernel (tailored to the processor design)

I/O nodes: embedded Linux

Front-end and service nodes: SUSE SLES 9 Linux (familiar to users)

Page 28: Parallel Scalable Operating Systems

Blue Gene / L

Performance: peak performance per rack: 5.73 TFLOPS; Linpack performance per rack: 4.71 TFLOPS

Page 29: Parallel Scalable Operating Systems

Blue Gene / C

a.k.a. Cyclops64

Massively parallel (the first "supercomputer on a chip")

Processors connected by a 96-port, 7-stage, non-internally-blocking crossbar switch

Theoretical peak performance per chip: 80 GFLOPS

Page 30: Parallel Scalable Operating Systems

Blue Gene / C

Cellular architecture

64-bit Cyclops64 chip: 500 MHz, 80 processors (each with 2 thread units and a floating-point unit)

Software: Cyclops64 exposes much of the underlying hardware to the programmer, allowing the programmer to write very high-performance, finely tuned software.

Page 31: Parallel Scalable Operating Systems

Blue Gene / C

[Picture of BG/C]

Performance:

Board: 320 GFLOPS

Rack: 15.76 TFLOPS

System: 1.1 PFLOPS

Page 32: Parallel Scalable Operating Systems

Blue Gene / P

Similar architecture to BG/L, but:

Cache-coherent L1 cache

4 cores per node

10 Gbit Ethernet external I/O infrastructure

Scales up to 3 PFLOPS

More energy efficient

167 TF/s by 2007, 1 PF by 2008

Page 33: Parallel Scalable Operating Systems

Blue Gene / Q

Continuation of Blue Gene/L and /P

Targeting 10 PF/s by 2010/2011

Higher frequency at similar performance per watt

Similar number of nodes, many more cores

More generally useful

Aggressive compiler

New network: scalable and cheap

Page 34: Parallel Scalable Operating Systems

Motivation for a scalable OS

Blue Gene/L is currently the world's fastest and most scalable supercomputer.

Several system components contribute to that scalability; the operating systems for the different nodes of Blue Gene/L are among them.

The OS overhead on one node affects the scalability of the whole system.

Goal: design a scalable OS solution.

Page 35: Parallel Scalable Operating Systems

High-level view of BG/L

Principle: the structure of the software should reflect the structure of the hardware.

Page 36: Parallel Scalable Operating Systems

BG/L Partitioning

Space-sharing: the machine is divided along natural boundaries into partitions

Each partition can run only one job

Each node can run in one of these modes:

Coprocessor mode: one processor assists the other

Virtual node mode: two separate processors, each with its own memory space

Page 37: Parallel Scalable Operating Systems

OS

Compute nodes: dedicated OS

I/O nodes: dedicated OS

Service nodes: conventional off-the-shelf OS

Front-end nodes: program compilation, debugging, job submission

File servers: store data, not specific to BG/L

Page 38: Parallel Scalable Operating Systems

BG/L OS solution

Components: I/O nodes, service nodes, CNK

The compute and I/O nodes are organized into logical entities called processing sets, or psets: 1 I/O node plus a collection of 8, 16, 64, or 128 compute nodes. A pset is a logical concept, but it should reflect physical proximity so that communication is fast.

A job is a collection of N compute processes (on compute nodes), each with its own private address space, communicating by message passing (MPI ranks 0 to N-1), as in the sketch below.
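A minimal MPI program (standard MPI calls, not BG/L-specific code) illustrating the rank 0..N-1 view that each compute process in a job gets:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank: 0..N-1 */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* N, the number of compute processes in the job */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```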

Page 39: Parallel Scalable Operating Systems

High-level view of BG/L

Page 40: Parallel Scalable Operating Systems

BG/L OS solution: CNK

Compute nodes run only compute processes. All the compute nodes of a particular partition execute in one of two modes: coprocessor mode or virtual node mode.

The Compute Node Kernel (CNK) is a simple OS:

Creates the address space(s)

Loads code and initializes data

Transfers processor control to the loaded executable

Page 41: Parallel Scalable Operating Systems

CNK

Consumes 1 MB of memory

Creates either one address space of 511/1023 MB or two address spaces of 255/511 MB (see the sketch below)

No virtual memory, no paging: the entire mapping fits into the TLB of the PowerPC

Load in push mode: one compute node reads the executable from the file system and sends it to all the others

One image is loaded, and then the kernel stays out of the way!
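A minimal sketch of that memory split, assuming the 512 MB and 1 GB node configurations mentioned earlier and the fixed 1 MB kernel footprint:

```c
#include <stdio.h>

int main(void) {
    int node_mem_mb[2] = {512, 1024};   /* the two BG/L node memory configurations */
    int cnk_mb = 1;                     /* memory consumed by the CNK itself */

    for (int i = 0; i < 2; i++) {
        int usable = node_mem_mb[i] - cnk_mb;
        printf("%4d MB node: coprocessor mode -> one %d MB address space, "
               "virtual node mode -> two %d MB address spaces\n",
               node_mem_mb[i], usable, usable / 2);
    }
    return 0;
}
```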

Page 42: Parallel Scalable Operating Systems

CNK

No OS scheduling (a single thread)

No memory management (no TLB overhead)

No local file services

User-level execution until the process requests a system call or a hardware interrupt arrives: a timer (requested by the application) or an abnormal event

System calls (a dispatch sketch follows):

Simple: handled locally (getting the time, setting an alarm)

Complex: forwarded to the I/O nodes

Unsupported (fork/mmap): return an error
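A minimal sketch in C of how such a three-way dispatch might look; the syscall numbers and helper names here are illustrative, not the real CNK symbols:

```c
#include <errno.h>
#include <stdio.h>

/* Illustrative classification only: not the real CNK code or symbol names. */
enum syscall_class { SC_SIMPLE, SC_COMPLEX, SC_UNSUPPORTED };

/* Hypothetical syscall numbers, for the example */
enum { SYS_GETTIME = 1, SYS_ALARM = 2, SYS_READ = 3, SYS_WRITE = 4, SYS_FORK = 5, SYS_MMAP = 6 };

static enum syscall_class classify(int number) {
    switch (number) {
    case SYS_GETTIME: case SYS_ALARM: return SC_SIMPLE;       /* handled locally on the compute node */
    case SYS_READ:    case SYS_WRITE: return SC_COMPLEX;      /* forwarded to the I/O node (CIOD) */
    default:                          return SC_UNSUPPORTED;  /* fork, mmap, ... return an error */
    }
}

int main(void) {
    int calls[] = {SYS_GETTIME, SYS_READ, SYS_FORK};
    for (int i = 0; i < 3; i++) {
        switch (classify(calls[i])) {
        case SC_SIMPLE:  printf("syscall %d: handle locally\n", calls[i]); break;
        case SC_COMPLEX: printf("syscall %d: forward to I/O node\n", calls[i]); break;
        default:         printf("syscall %d: unsupported, return %d\n", calls[i], -ENOSYS); break;
        }
    }
    return 0;
}
```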

Page 43: Parallel Scalable Operating Systems

Benefits of the simple solution

Robustness: simple design, implementation, testing, debugging

Scalability: no interference among compute nodes

Low system noise

Performance measurements

Page 44: Parallel Scalable Operating Systems

I/O node

Two roles in Blue Gene/L:

Act as an effective master of its corresponding pset

Serve requests from the compute nodes in its pset, mainly I/O operations on locally mounted file systems

Only one processor is used, due to the lack of memory coherency

Executes an embedded version of the Linux operating system: it does not use any swap space, has an in-memory root file system, uses little memory, and lacks the majority of Linux daemons.

Page 45: Parallel Scalable Operating Systems

I/O node

Complete TCP/IP stack

Supported file systems: NFS, GPFS, Lustre, PVFS

Main process: the Control and I/O Daemon (CIOD)

Launching a job:

The job manager sends the request to the service node

The service node contacts the CIOD

The CIOD sends the executable to all processes in the pset

Page 46: Parallel Scalable Operating Systems

System calls

Page 47: Parallel Scalable Operating Systems

Service nodes

Run the Blue Gene/L control system

Tight integration with compute nodes (CNs) and I/O nodes (IONs)

CNs and IONs are stateless, with no persistent memory

Responsible for operating and monitoring the CNs and IONs

Creates system partitions and isolates them

Computes network routing for the torus, collective and global interrupt networks

Loads the OS code for the CNs and IONs

Page 48: Parallel Scalable Operating Systems

Problems

Not fully POSIX compliant

Many applications need:

Process/thread creation

Full server sockets

Shared memory segments

Memory-mapped files

Page 49: Parallel Scalable Operating Systems

File systems for BG systems

Need for scalable file systems: NFS is not a solution

Most supercomputers and clusters in the Top500 use one of these parallel file systems: GPFS, Lustre, PVFS2

Page 50: Parallel Scalable Operating Systems

GPFS/PVFS/Lustre mounted on the I/O nodes

File system servers

Page 51: Parallel Scalable Operating Systems

File systems for BG systems

All these parallel file systems stripe files round-robin over several file system servers and allow concurrent access to files.

File system calls are forwarded from the compute nodes to the I/O nodes; the I/O nodes execute the calls on behalf of the compute nodes and forward back the result.

Problem: data travels over different networks for each file system call (caching becomes critical). A sketch of the round-robin striping follows.
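A minimal sketch of round-robin file striping, with a hypothetical stripe size and server count (the actual parameters are configuration-dependent and differ between GPFS, Lustre and PVFS2):

```c
#include <stdio.h>

/* Map a byte offset to (stripe, server) under round-robin striping. */
#define STRIPE_SIZE (64 * 1024)   /* 64 KB stripes (example value only) */
#define NUM_SERVERS 4             /* number of file system servers (example value only) */

int main(void) {
    long offsets[] = {0, 100000, 300000, 1000000};
    for (int i = 0; i < 4; i++) {
        long stripe = offsets[i] / STRIPE_SIZE;
        int server = (int)(stripe % NUM_SERVERS);
        printf("offset %8ld -> stripe %4ld on server %d\n", offsets[i], stripe, server);
    }
    return 0;
}
```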

Page 52: Parallel Scalable Operating Systems

Summary

The OS solution for Blue Gene adopts a software architecture that reflects the hardware architecture of the system.

The result of this approach is a lightweight kernel for the compute nodes (CNK) and a port of Linux that implements the file system and TCP/IP functionality for the I/O nodes.

This separation of responsibilities leads to an OS solution that is simple, robust, high-performing, scalable and extensible.

Problem: limited applicability

Scalable parallel file systems

Page 53: Parallel Scalable Operating Systems

Further reading

Designing a Highly-Scalable Operating System: The Blue Gene/L Story. Proceedings of the 2006 ACM/IEEE Conference on Supercomputing.

www.research.ibm.com/bluegene/