"Towards Petascale Grids as a Foundation of E-Science"
Satoshi Matsuoka
Tokyo Institute of Technology / The NAREGI Project, National Institute of Informatics
Oct. 1, 2007 EGEE07 Presentation @ Budapest, Hungary
Vision of Grid Infrastructure in the past…
• A bunch of networked PCs virtualized to be a supercomputer, OR
• Very divergent & distributed supercomputers, storage, etc. tied together & "virtualized"
The "dream" was for the infrastructure to behave as a virtual supercomputing environment with an ideal programming model for many applications.
But this is not meant to be.
[Don Quixote or "barking up the wrong tree" picture here]
TSUBAME: the first 100 TeraFlops Supercomputer for Grids, 2006-2010
- Sun Galaxy 4 (Opteron dual-core, 8-socket): 10,480 cores / 655 nodes, 32-128 GB memory per node, 21.4 TeraBytes total memory, 50.4 TeraFlops
- ClearSpeed CSX600 SIMD accelerator: 360 → 648 boards, 35 → 52.2 TeraFlops
- Sun Blade integer-workload accelerator (90 nodes, 720 CPUs)
- NEC SX-8i (for porting)
- Storage: 1.0 Petabyte (Sun "Thumper", 48 × 500 GB disks per node) + 0.1 Petabyte (NEC iStore); Lustre FS, NFS, CIFS, WebDAV (over IP); 50 GB/s aggregate I/O bandwidth (now 1.5 PB, 60 GB/s)
- Unified InfiniBand network: Voltaire ISR9288, 10 Gbps ×2, ~1310+50 ports, ~13.5 Terabits/s (3 Tbit/s bisection), 10 Gbps+ external network
- OS: Linux (SuSE 9, 10), NAREGI Grid MW
- "Fastest Supercomputer in Asia" on the 29th Top500 list
- Now 103 TeraFlops peak as of Oct. 31st!
TSUBAME Job Statistics, Dec. 2006 - Aug. 2007 (#Jobs)
• 797,886 Jobs (~3270 daily)
• 597,438 serial jobs (74.8%)
• 121,108 <=8p jobs (15.2%)
• 129,398 ISV Application Jobs (16.2%)
• However, >32p jobs account for 2/3 of cumulative CPU usage
[Histogram: number of TSUBAME jobs by processors per job (=1p, <=8p, <=16p, <=32p, <=64p, <=128p, >128p); ~90% of jobs use 8 processors or fewer]
Coexistence and ease of use of both short-duration parameter surveys and large-scale MPI jobs fits the TSUBAME design well.
In the supercomputing landscape, Petaflops class is already here… in early 2008
- 2008Q1 TACC/Sun "Ranger": ~52,600 "Barcelona" Opteron CPU cores, ~500 TFlops, ~100 racks, ~300 m2 floorspace, 2.4 MW power, 1.4 km IB CX4 copper cabling, 2 Petabytes HDD
- 2008 LLNL/IBM "BlueGene/P": ~300,000 PPC cores, ~1 PFlops, ~72 racks, ~400 m2 floorspace, ~3 MW power, copper cabling
- > 10 Petaflops, > million cores, > 10s of Petabytes machines planned for 2011-2012 in the US, Japan, (EU), (other APAC)
- Other Petaflops machines 2008/2009: LANL/IBM "Roadrunner", JICS/Cray(?) (NSF Track 2), ORNL/Cray, ANL/IBM BG/P, EU machines (Jülich…), …
In fact we can build one now (!)
• @Tokyo --- one of the largest IDCs in the world (in Tokyo...)
• Can fit a 10 PF machine here easily (> 20 Rangers)
• On top of a 55 KV / 6 GW substation
• 150 m diameter (small baseball stadium)
• 140,000 m2 IDC floorspace
• 70+70 MW power
• Size of entire Google(?) (~million LP nodes)
• Source of "Cloud" infrastructure
Gilder’s Law – Will make thin-client accessibility to servers essentially “free”
Scientific American, January 2001
[Chart: performance per dollar spent vs. number of years (0-5) for Data Storage (bits per square inch, doubling time 12 months), Optical Fiber (bits per second, doubling time 9 months), and Silicon Computer Chips (number of transistors, doubling time 18 months)]
(Original slide courtesy Phil Papadopoulos @ SDSC)
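A quick back-of-the-envelope check of the slide's point (a minimal sketch, using only the doubling times quoted above): bandwidth improves far faster per dollar than chips do, which is what makes thin-client access to remote servers cheap.

```python
# Improvement factor over 5 years for each technology, from its doubling time
# (doubling times as quoted on the slide).
doubling_months = {
    "Optical Fiber (bits per second)": 9,
    "Data Storage (bits per square inch)": 12,
    "Silicon Computer Chips (transistors)": 18,
}

years = 5
for name, months in doubling_months.items():
    factor = 2 ** (years * 12 / months)
    print(f"{name}: ~{factor:.0f}x in {years} years")
# Fiber ~102x, storage ~32x, chips ~10x: the network outpaces the processor.
```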
DOE SC Applications Overview
(following slides courtesy John Shalf @ LBL NERSC)
NAME    | Discipline       | Problem/Method     | Structure
--------|------------------|--------------------|------------------
SuperLU | Multi-Discipline | LU Factorization   | Sparse Matrix
MADCAP  | Cosmology        | CMB Analysis       | Dense Matrix
PMEMD   | Life Sciences    | Molecular Dynamics | Particle
PARATEC | Material Science | DFT                | Fourier/Grid
GTC     | Magnetic Fusion  | Vlasov-Poisson     | Particle in Cell
LBMHD   | Plasma Physics   | MHD                | 2D/3D Lattice
CACTUS  | Astrophysics     | General Relativity | 3D Grid
FVCAM   | Climate Modeling | AGCM               | 3D Grid
Latency Bound vs. Bandwidth Bound?
• How large does a message have to be in order to saturate a dedicated circuit on the interconnect?
  – N_1/2 from the early days of vector computing
  – Bandwidth-Delay Product in TCP
System          | Technology      | MPI Latency | Peak Bandwidth | Bandwidth-Delay Product
----------------|-----------------|-------------|----------------|------------------------
Cray XD1        | RapidArray/IB4x | 1.7 us      | 2 GB/s         | 3.4 KB
Myrinet Cluster | Myrinet 2000    | 5.7 us      | 500 MB/s       | 2.8 KB
NEC ES          | NEC Custom      | 5.6 us      | 1.5 GB/s       | 8.4 KB
Cray X1         | Cray Custom     | 7.3 us      | 6.3 GB/s       | 46 KB
SGI Altix       | Numalink-4      | 1.1 us      | 1.9 GB/s       | 2 KB
• Bandwidth bound if msg size > Bandwidth × Delay
• Latency bound if msg size < Bandwidth × Delay
  – Except if pipelined (unlikely with MPI due to overhead)
  – Cannot pipeline MPI collectives (but can in Titanium)
(Original slide courtesy John Shalf @ LBL)
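A minimal sketch of the rule just stated, using the table's own latency and bandwidth numbers (the helper function is illustrative, not part of any of these systems' software): the bandwidth-delay product is simply latency times peak bandwidth, and a message is bandwidth bound only if it is larger than that.

```python
# Bandwidth-delay product (BDP): bytes that must be in flight to saturate the link.
# Message size > BDP -> bandwidth bound; message size < BDP -> latency bound.
systems = {
    # name: (MPI latency in microseconds, peak bandwidth in GB/s)
    "Cray XD1 (RapidArray/IB4x)": (1.7, 2.0),
    "Myrinet Cluster (Myrinet 2000)": (5.7, 0.5),
    "NEC ES (NEC Custom)": (5.6, 1.5),
    "Cray X1 (Cray Custom)": (7.3, 6.3),
    "SGI Altix (Numalink-4)": (1.1, 1.9),
}

def bdp_bytes(latency_us, bandwidth_gbs):
    return latency_us * 1e-6 * bandwidth_gbs * 1e9   # seconds * bytes/second

message = 1_000_000   # a 1 MB message, for example
for name, (lat, bw) in systems.items():
    bdp = bdp_bytes(lat, bw)
    bound = "bandwidth" if message > bdp else "latency"
    # BDP values agree with the table above up to rounding.
    print(f"{name}: BDP ~{bdp / 1e3:.1f} KB, a 1 MB message is {bound} bound")
```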
Diagram of Message Size Distribution Function (MADbench-P2P)
(Original slide courtesy John Shalf @ LBL)
60% of messages > 1 MB → bandwidth dominant; could be executed on a WAN
(Original slide courtesy John Shalf @ LBL)
Message Size Distribution (SuperLU-PTP)
> 95% of messages < 1 KByte → needs low latency, tightly coupled LAN
Collective Buffer Sizes – the demise of metacomputing
95% latency bound!!! => For metacomputing, desktop and small-cluster grids are pretty much hopeless, except for parameter-sweep apps
(Original slide courtesy John Shalf @ LBL)
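A hedged sketch of how such histograms lead to the verdict above: given an application's message sizes and a target interconnect's bandwidth-delay product, report how much of its traffic is latency bound (the two sample profiles and the 10 KB BDP are made up for illustration; they are not the measured MADbench/SuperLU data).

```python
def latency_bound_fraction(message_sizes, bdp_bytes):
    """Fraction of messages, and of total bytes, smaller than the BDP."""
    small = [s for s in message_sizes if s < bdp_bytes]
    return len(small) / len(message_sizes), sum(small) / sum(message_sizes)

# Illustrative profiles only: one dominated by tiny messages (SuperLU-like),
# one dominated by multi-megabyte messages (MADbench-like).
superlu_like = [512] * 950 + [4_000_000] * 50
madbench_like = [8_000_000] * 60 + [2_048] * 40

for name, sizes in [("SuperLU-like", superlu_like), ("MADbench-like", madbench_like)]:
    msgs, volume = latency_bound_fraction(sizes, bdp_bytes=10_000)
    print(f"{name}: {msgs:.0%} of messages, {volume:.0%} of bytes are latency bound")
```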
So what does this tell us?
• A "grid" programming model for parallelizing a single app is not worthwhile
  – Either simple parameter sweep / workflow, or it will not work
  – We will have enough problems programming a single system with millions of threads (e.g., Jack's keynote)
• Grid programming at the "diplomacy" level
  – Must look at multiple applications, and how they compete / coordinate
  – The apps' execution environment should be virtualized --- the grid being transparent to applications
• Zillions of apps in the overall infrastructure, competing for resources
• Hundreds to thousands of application components that coordinate (workflow, coupled multi-physics interactions, etc.)
  – NAREGI focuses on these scenarios
Use case in NAREGI: RISM-FMO Coupled Simulation
• RISM (solvent distribution, suitable for SMP) and FMO (electronic structure, suitable for a cluster) are coupled through Mediators over GridMPI
• The solvent charge distribution is transformed from regular to irregular meshes
• Mulliken charges are transferred as the partial charges of the solute molecules
• The electronic structure of nano-scale molecules in solvent is calculated self-consistently by exchanging the solvent charge distribution and the partial charges of the solute molecules
*The original RISM and FMO codes were developed by the Institute for Molecular Science and the National Institute of Advanced Industrial Science and Technology, respectively.
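The self-consistent exchange described above might be sketched roughly as follows; the three functions are placeholders, not the real RISM, FMO, or Mediator code, and only illustrate the loop structure of the coupling.

```python
# Illustrative RISM-FMO coupling loop (all functions are placeholders; in NAREGI the
# two codes run on different resources and exchange data through Mediators over GridMPI).

def rism_solvent_distribution(partial_charges):
    """Placeholder for RISM: solvent charge distribution from solute partial charges."""
    return [0.1 * q for q in partial_charges]

def mediator_regular_to_irregular(solvent_charges):
    """Placeholder for the Mediator's regular-to-irregular mesh transformation."""
    return solvent_charges   # identity here; a real remapping in practice

def fmo_electronic_structure(solvent_charges):
    """Placeholder for FMO: new Mulliken charges given the solvent charges."""
    return [1.0 - 0.5 * c for c in solvent_charges]

charges = [1.0, -1.0, 0.5]                      # initial solute partial charges
for iteration in range(100):
    solvent = rism_solvent_distribution(charges)
    new_charges = fmo_electronic_structure(mediator_regular_to_irregular(solvent))
    if max(abs(a - b) for a, b in zip(new_charges, charges)) < 1e-6:
        break                                   # self-consistency reached
    charges = new_charges
print(f"converged after {iteration + 1} iterations: {charges}")
```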
Registration & Deployment of Applications
Application sharing in research communities
[Diagram: Application Developer, PSE Server with ACS (Application Contents Service), Information Service, and candidate servers Server#1-#3]
① Register application (application summary, program source files, input files, resource requirements, etc.)
② Select compiling host (using resource info from the Information Service)
③ Compile (Compiling OK!)
④ Send back the compiled application environment
⑤ Select deployment hosts
⑥ Deploy and test-run on each host (Test Run OK! / NG!)
⑦ Register deployment info. in the Information Service
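Expressed as a sketch, the seven steps amount to something like the following; every name here (pse, info_service, the methods) is hypothetical and only mirrors the flow above, not an actual NAREGI API.

```python
# Hypothetical sketch of the PSE registration-and-deployment flow (not a real API).
def register_and_deploy(application, pse, info_service):
    pse.register(application)                                  # (1) summary, sources, inputs, requirements
    compile_host = pse.select_compiling_host(info_service)     # (2) pick a host from resource info
    built_env = compile_host.compile(application)              # (3)+(4) compile, send back the built environment
    for host in pse.select_deployment_hosts(info_service):     # (5) candidate hosts
        host.deploy(built_env)                                 # (6) deploy and test-run
        if host.test_run(built_env):                           #     some hosts pass (OK), some fail (NG)
            info_service.register_deployment(application, host)  # (7) record working deployments
```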
Description of Workflow and Job Submission Requirements
[Diagram: a workflow-editor applet (data icons, program icons, Appli-A, Appli-B, each with its JSDL) runs in the user's browser and talks over http(s) to a web server (apache) with a Workflow Servlet on tomcat; the workflow description in NAREGI-WFML is converted to BPEL plus JSDL (e.g., <invoke name=EPS-jobA> → JSDL-A, <invoke name=BES-jobA> → JSDL-A, …) and handed to the SuperScheduler through the NAREGI JM I/F module; the PSE, DataGrid, and Information Service supply application information and global file information (/gfarm/..); stdout/stderr are returned via GridFTP]
Reservation Based Co-Allocation
[Diagram: a client submits a workflow with abstract JSDL to the SuperScheduler; the SuperScheduler queries the Distributed Information Service (CIM resource info, DAI, UR/RUS accounting) and performs reservation-based co-allocation, sending concrete JSDL to the GridVM on each computing resource for reservation, submission, query, control, …]
• Co-allocation for heterogeneous architectures and applications
• Used for advanced science applications, huge MPI jobs, realtime visualization on grid, etc...
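A minimal sketch of what reservation-based co-allocation has to compute: the earliest start time at which every requested resource can be reserved for the job's duration (resource names, availability windows, and the algorithm are illustrative; the real SuperScheduler negotiates reservations with each site's GridVM).

```python
# Find the earliest common reservation window across all resources (illustrative only).
def co_allocate(availability, duration):
    """availability: {resource: [(start, end), ...]} -> earliest common start time, or None."""
    candidate_starts = sorted(s for windows in availability.values() for s, _ in windows)
    for start in candidate_starts:
        if all(any(s <= start and start + duration <= e for s, e in windows)
               for windows in availability.values()):
            return start
    return None

availability = {
    "clusterA": [(0, 50), (80, 200)],
    "clusterB": [(30, 120)],
    "vis-host": [(90, 300)],      # e.g., a realtime-visualization resource
}
print(co_allocate(availability, duration=20))   # -> 90: all three can be co-reserved at t=90
```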
Communication Libraries and Tools
1. Modules
   - GridMPI: MPI-1 and 2 compliant, grid-ready MPI library
   - GridRPC: OGF/GridRPC compliant GridRPC library
   - Mediator: communication tool for heterogeneous applications
   - SBC: storage-based communication tool
2. Features
   - GridMPI: MPI for a collection of geographically distributed resources; high performance, optimized for high-bandwidth networks
   - GridRPC: task-parallel, simple, seamless programming
   - Mediator: communication library for heterogeneous applications; data format conversion
   - SBC: storage-based communication for heterogeneous applications
3. Supporting Standards: MPI-1 and 2, OGF/GridRPC
Grid Ready Programming Libraries
• Standards-compliant GridMPI and GridRPC
  – GridMPI: data parallel, MPI compatibility
  – GridRPC (Ninf-G2): task parallel, simple seamless programming (RPC)
  [Diagram scale labels: 100,000 CPU; 100-500 CPU]
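The task-parallel style attributed to GridRPC above is, in essence, asynchronous remote procedure calls fanned out from a client and collected later; the sketch below imitates that pattern with a local thread pool purely to show the programming model, and is not the Ninf-G2 API.

```python
# Task-parallel, RPC-style pattern (imitated locally with a thread pool; with GridRPC
# each call would be shipped to a remote server instead of a local worker thread).
from concurrent.futures import ThreadPoolExecutor

def simulate(parameter):
    """Stand-in for a coarse-grained remote computation."""
    return parameter ** 2

with ThreadPoolExecutor(max_workers=8) as pool:
    handles = [pool.submit(simulate, p) for p in range(100)]   # asynchronous "calls"
    results = [h.result() for h in handles]                    # wait for and collect results

print(sum(results))
```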
Communication Tools for Co-Allocation Jobs
• Mediator: Application-1 ↔ Mediator ↔ Mediator ↔ Application-2 over GridMPI, with data format conversion on each side
• SBC (Storage Based Communication): Application-3 ↔ SBC library ↔ SBC library ↔ Application-2 over the SBC protocol
Compete Scenario: MPI / VM Migration on the Grid (our ABARIS FT-MPI)
[Diagram: Cluster A (fast CPU, slow networks) and Cluster B (high bandwidth, large memory); App A (high bandwidth) and App B (CPU-bound) run as MPI processes inside VMs across hosts; a Resource Manager, aware of individual application characteristics, redistributes the MPI communication log and migrates VM jobs between clusters, also for power optimization]
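A hedged sketch of the placement decision this scenario describes: the resource manager matches each application's dominant characteristic to the cluster that provides it and triggers a VM migration when the current placement does not (names, profiles, and policy are illustrative; this is not the ABARIS implementation).

```python
# Illustrative placement policy for the compete scenario (not the ABARIS code).
clusters = {
    "Cluster A": {"strength": "cpu"},          # fast CPU, slow networks
    "Cluster B": {"strength": "bandwidth"},    # high bandwidth, large memory
}

apps = {
    "App A": {"dominant": "bandwidth", "placed_on": "Cluster A"},
    "App B": {"dominant": "cpu",       "placed_on": "Cluster B"},
}

def preferred_cluster(app):
    for name, props in clusters.items():
        if props["strength"] == app["dominant"]:
            return name
    return app["placed_on"]                    # no better match: stay put

for name, app in apps.items():
    target = preferred_cluster(app)
    if target != app["placed_on"]:
        # In the real scenario this is a live migration of the VM-encapsulated MPI
        # processes, with the MPI communication log redistributed to the new hosts.
        print(f"migrate {name}: {app['placed_on']} -> {target}")
```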