"Towards Petascale Grids as a Foundation of E-Science"
Satoshi Matsuoka
Tokyo Institute of Technology / The NAREGI Project, National Institute of Informatics
Oct. 1, 2007 EGEE07 Presentation @ Budapest, Hungary
Vision of Grid Infrastructure in the past…
• A bunch of networked PCs virtualized to be a supercomputer, OR
• Very divergent & distributed supercomputers, storage, etc. tied together & "virtualized"
The "dream" was for the infrastructure to behave as a virtual supercomputing environment with an ideal programming model for many applications.
But this is not meant to be.
[Don Quixote or "barking up the wrong tree" picture here]
TSUBAME: the first 100 TeraFlops Supercomputer for Grids, 2006-2010
- Sun Galaxy 4 (Opteron dual-core, 8-socket): 10,480 cores / 655 nodes, 32-128 GB memory per node, 21.4 TeraBytes total memory, 50.4 TeraFlops
- ClearSpeed CSX600 SIMD accelerator: 360 → 648 boards, 35 → 52.2 TeraFlops
- Sun Blade integer-workload accelerator (90 nodes, 720 CPUs)
- NEC SX-8i (for porting)
- Storage: 1.0 Petabyte (Sun "Thumper", 48 × 500 GB disks per node) + 0.1 Petabyte (NEC iStore); Lustre FS, NFS, CIFS, WebDAV (over IP); 50 GB/s aggregate I/O bandwidth (now 1.5 PB, 60 GB/s)
- Unified InfiniBand network: Voltaire ISR9288, 10 Gbps ×2, ~1310+50 ports, ~13.5 Terabits/s (3 Tbit/s bisection), 10 Gbps+ external network
- OS: Linux (SuSE 9, 10), NAREGI Grid MW
- "Fastest Supercomputer in Asia" on the 29th Top500 list
- Now 103 TeraFlops peak as of Oct. 31st!
TSUBAME Job Statistics, Dec. 2006 - Aug. 2007 (#Jobs)
• 797,886 Jobs (~3270 daily)
• 597,438 serial jobs (74.8%)
• 121,108 <=8p jobs (15.2%)
• 129,398 ISV Application Jobs (16.2%)
• However, >32p jobs account for 2/3 of cumulative CPU usage
[Histogram: number of TSUBAME jobs by processors per job (=1p, <=8p, <=16p, <=32p, <=64p, <=128p, >128p); ~90% of jobs use 8 processors or fewer]
Coexistence and ease of use of both short-duration parameter surveys and large-scale MPI jobs fits the TSUBAME design well.
In the supercomputing landscape, Petaflops class is already here… in early 2008
- 2008Q1 TACC/Sun "Ranger": ~52,600 "Barcelona" Opteron CPU cores, ~500 TFlops, ~100 racks, ~300 m2 floorspace, 2.4 MW power, 1.4 km IB CX4 copper cabling, 2 Petabytes HDD
- 2008 LLNL/IBM "BlueGene/P": ~300,000 PPC cores, ~1 PFlops, ~72 racks, ~400 m2 floorspace, ~3 MW power, copper cabling
- > 10 Petaflops, > million cores, > 10s of Petabytes machines planned for 2011-2012 in the US, Japan, (EU), (other APAC)
- Other Petaflops machines 2008/2009: LANL/IBM "Roadrunner", JICS/Cray(?) (NSF Track 2), ORNL/Cray, ANL/IBM BG/P, EU machines (Jülich…), …
In fact we can build one now (!)
• @Tokyo --- one of the largest IDCs in the world (in Tokyo...)
• Can fit a 10 PF machine here easily (> 20 Rangers)
• On top of a 55 KV / 6 GW substation
• 150 m diameter (small baseball stadium)
• 140,000 m2 IDC floorspace
• 70+70 MW power
• Size of entire Google(?) (~million LP nodes)
• Source of "Cloud" infrastructure
Gilder’s Law – Will make thin-client accessibility to servers essentially “free”
Scientific American, January 2001
[Chart: performance per dollar spent vs. number of years (0-5) for Data Storage (bits per square inch, doubling time 12 months), Optical Fiber (bits per second, doubling time 9 months), and Silicon Computer Chips (number of transistors, doubling time 18 months)]
(Original slide courtesy Phil Papadopoulos @ SDSC)
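A quick back-of-the-envelope check of the slide's point (a minimal sketch, using only the doubling times quoted above): bandwidth improves far faster per dollar than chips do, which is what makes thin-client access to remote servers cheap.

```python
# Improvement factor over 5 years for each technology, from its doubling time
# (doubling times as quoted on the slide).
doubling_months = {
    "Optical Fiber (bits per second)": 9,
    "Data Storage (bits per square inch)": 12,
    "Silicon Computer Chips (transistors)": 18,
}

years = 5
for name, months in doubling_months.items():
    factor = 2 ** (years * 12 / months)
    print(f"{name}: ~{factor:.0f}x in {years} years")
# Fiber ~102x, storage ~32x, chips ~10x: the network outpaces the processor.
```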
DOE SC Applications Overview
(following slides courtesy John Shalf @ LBL NERSC)
NAME    | Discipline       | Problem/Method     | Structure
--------|------------------|--------------------|------------------
SuperLU | Multi-Discipline | LU Factorization   | Sparse Matrix
MADCAP  | Cosmology        | CMB Analysis       | Dense Matrix
PMEMD   | Life Sciences    | Molecular Dynamics | Particle
PARATEC | Material Science | DFT                | Fourier/Grid
GTC     | Magnetic Fusion  | Vlasov-Poisson     | Particle in Cell
LBMHD   | Plasma Physics   | MHD                | 2D/3D Lattice
CACTUS  | Astrophysics     | General Relativity | 3D Grid
FVCAM   | Climate Modeling | AGCM               | 3D Grid
Latency Bound vs. Bandwidth Bound?
• How large does a message have to be in order to saturate a dedicated circuit on the interconnect?
  – N_1/2 from the early days of vector computing
  – Bandwidth-Delay Product in TCP
System          | Technology      | MPI Latency | Peak Bandwidth | Bandwidth-Delay Product
----------------|-----------------|-------------|----------------|------------------------
Cray XD1        | RapidArray/IB4x | 1.7 us      | 2 GB/s         | 3.4 KB
Myrinet Cluster | Myrinet 2000    | 5.7 us      | 500 MB/s       | 2.8 KB
NEC ES          | NEC Custom      | 5.6 us      | 1.5 GB/s       | 8.4 KB
Cray X1         | Cray Custom     | 7.3 us      | 6.3 GB/s       | 46 KB
SGI Altix       | Numalink-4      | 1.1 us      | 1.9 GB/s       | 2 KB
• Bandwidth bound if msg size > Bandwidth × Delay
• Latency bound if msg size < Bandwidth × Delay
  – Except if pipelined (unlikely with MPI due to overhead)
  – Cannot pipeline MPI collectives (but can in Titanium)
(Original slide courtesy John Shalf @ LBL)
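A minimal sketch of the rule just stated, using the table's own latency and bandwidth numbers (the helper function is illustrative, not part of any of these systems' software): the bandwidth-delay product is simply latency times peak bandwidth, and a message is bandwidth bound only if it is larger than that.

```python
# Bandwidth-delay product (BDP): bytes that must be in flight to saturate the link.
# Message size > BDP -> bandwidth bound; message size < BDP -> latency bound.
systems = {
    # name: (MPI latency in microseconds, peak bandwidth in GB/s)
    "Cray XD1 (RapidArray/IB4x)": (1.7, 2.0),
    "Myrinet Cluster (Myrinet 2000)": (5.7, 0.5),
    "NEC ES (NEC Custom)": (5.6, 1.5),
    "Cray X1 (Cray Custom)": (7.3, 6.3),
    "SGI Altix (Numalink-4)": (1.1, 1.9),
}

def bdp_bytes(latency_us, bandwidth_gbs):
    return latency_us * 1e-6 * bandwidth_gbs * 1e9   # seconds * bytes/second

message = 1_000_000   # a 1 MB message, for example
for name, (lat, bw) in systems.items():
    bdp = bdp_bytes(lat, bw)
    bound = "bandwidth" if message > bdp else "latency"
    # BDP values agree with the table above up to rounding.
    print(f"{name}: BDP ~{bdp / 1e3:.1f} KB, a 1 MB message is {bound} bound")
```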
Diagram of Message Size Distribution Function (MADbench-P2P)
(Original slide courtesy John Shalf @ LBL)
60% of messages > 1 MB → bandwidth dominant; could be executed on a WAN
(Original slide courtesy John Shalf @ LBL)
Message Size Distribution (SuperLU-PTP)
> 95% of messages < 1 KByte → needs low latency, tightly coupled LAN
Collective Buffer Sizes – the demise of metacomputing
95% latency bound!!! => For metacomputing, desktop and small-cluster grids are pretty much hopeless, except for parameter-sweep apps
(Original slide courtesy John Shalf @ LBL)
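A hedged sketch of how such histograms lead to the verdict above: given an application's message sizes and a target interconnect's bandwidth-delay product, report how much of its traffic is latency bound (the two sample profiles and the 10 KB BDP are made up for illustration; they are not the measured MADbench/SuperLU data).

```python
def latency_bound_fraction(message_sizes, bdp_bytes):
    """Fraction of messages, and of total bytes, smaller than the BDP."""
    small = [s for s in message_sizes if s < bdp_bytes]
    return len(small) / len(message_sizes), sum(small) / sum(message_sizes)

# Illustrative profiles only: one dominated by tiny messages (SuperLU-like),
# one dominated by multi-megabyte messages (MADbench-like).
superlu_like = [512] * 950 + [4_000_000] * 50
madbench_like = [8_000_000] * 60 + [2_048] * 40

for name, sizes in [("SuperLU-like", superlu_like), ("MADbench-like", madbench_like)]:
    msgs, volume = latency_bound_fraction(sizes, bdp_bytes=10_000)
    print(f"{name}: {msgs:.0%} of messages, {volume:.0%} of bytes are latency bound")
```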
So what does this tell us?
• A "grid" programming model for parallelizing a single app is not worthwhile
  – Either simple parameter sweep / workflow, or it will not work
  – We will have enough problems programming a single system with millions of threads (e.g., Jack's keynote)
• Grid programming at the "diplomacy" level
  – Must look at multiple applications, and how they compete / coordinate
  – The apps' execution environment should be virtualized --- the grid being transparent to applications
• Zillions of apps in the overall infrastructure, competing for resources
• Hundreds to thousands of application components that coordinate (workflow, coupled multi-physics interactions, etc.)
  – NAREGI focuses on these scenarios
Use case in NAREGI: RISM-FMO Coupled Simulation
• RISM (solvent distribution, suitable for SMP) and FMO (electronic structure, suitable for a cluster) are coupled through Mediators over GridMPI
• The solvent charge distribution is transformed from regular to irregular meshes
• Mulliken charges are transferred as the partial charges of the solute molecules
• The electronic structure of nano-scale molecules in solvent is calculated self-consistently by exchanging the solvent charge distribution and the partial charges of the solute molecules
*The original RISM and FMO codes were developed by the Institute for Molecular Science and the National Institute of Advanced Industrial Science and Technology, respectively.
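The self-consistent exchange described above might be sketched roughly as follows; the three functions are placeholders, not the real RISM, FMO, or Mediator code, and only illustrate the loop structure of the coupling.

```python
# Illustrative RISM-FMO coupling loop (all functions are placeholders; in NAREGI the
# two codes run on different resources and exchange data through Mediators over GridMPI).

def rism_solvent_distribution(partial_charges):
    """Placeholder for RISM: solvent charge distribution from solute partial charges."""
    return [0.1 * q for q in partial_charges]

def mediator_regular_to_irregular(solvent_charges):
    """Placeholder for the Mediator's regular-to-irregular mesh transformation."""
    return solvent_charges   # identity here; a real remapping in practice

def fmo_electronic_structure(solvent_charges):
    """Placeholder for FMO: new Mulliken charges given the solvent charges."""
    return [1.0 - 0.5 * c for c in solvent_charges]

charges = [1.0, -1.0, 0.5]                      # initial solute partial charges
for iteration in range(100):
    solvent = rism_solvent_distribution(charges)
    new_charges = fmo_electronic_structure(mediator_regular_to_irregular(solvent))
    if max(abs(a - b) for a, b in zip(new_charges, charges)) < 1e-6:
        break                                   # self-consistency reached
    charges = new_charges
print(f"converged after {iteration + 1} iterations: {charges}")
```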
Registration & Deployment of Applications
Application sharing in research communities
[Diagram: Application Developer, PSE Server with ACS (Application Contents Service), Information Service, and candidate servers Server#1-#3]
① Register application (application summary, program source files, input files, resource requirements, etc.)
② Select compiling host (using resource info from the Information Service)
③ Compile (Compiling OK!)
④ Send back the compiled application environment
⑤ Select deployment hosts
⑥ Deploy and test-run on each host (Test Run OK! / NG!)
⑦ Register deployment info. in the Information Service
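Expressed as a sketch, the seven steps amount to something like the following; every name here (pse, info_service, the methods) is hypothetical and only mirrors the flow above, not an actual NAREGI API.

```python
# Hypothetical sketch of the PSE registration-and-deployment flow (not a real API).
def register_and_deploy(application, pse, info_service):
    pse.register(application)                                  # (1) summary, sources, inputs, requirements
    compile_host = pse.select_compiling_host(info_service)     # (2) pick a host from resource info
    built_env = compile_host.compile(application)              # (3)+(4) compile, send back the built environment
    for host in pse.select_deployment_hosts(info_service):     # (5) candidate hosts
        host.deploy(built_env)                                 # (6) deploy and test-run
        if host.test_run(built_env):                           #     some hosts pass (OK), some fail (NG)
            info_service.register_deployment(application, host)  # (7) record working deployments
```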
Description of Workflow and Job Submission Requirements
[Diagram: a workflow-editor applet (data icons, program icons, Appli-A, Appli-B, each with its JSDL) runs in the user's browser and talks over http(s) to a web server (apache) with a Workflow Servlet on tomcat; the workflow description in NAREGI-WFML is converted to BPEL plus JSDL (e.g., <invoke name=EPS-jobA> → JSDL-A, <invoke name=BES-jobA> → JSDL-A, …) and handed to the SuperScheduler through the NAREGI JM I/F module; the PSE, DataGrid, and Information Service supply application information and global file information (/gfarm/..); stdout/stderr are returned via GridFTP]
Reservation Based Co-Allocation
[Diagram: a client submits a workflow with abstract JSDL to the SuperScheduler; the SuperScheduler queries the Distributed Information Service (CIM resource info, DAI, UR/RUS accounting) and performs reservation-based co-allocation, sending concrete JSDL to the GridVM on each computing resource for reservation, submission, query, control, …]
• Co-allocation for heterogeneous architectures and applications
• Used for advanced science applications, huge MPI jobs, realtime visualization on grid, etc...
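A minimal sketch of what reservation-based co-allocation has to compute: the earliest start time at which every requested resource can be reserved for the job's duration (resource names, availability windows, and the algorithm are illustrative; the real SuperScheduler negotiates reservations with each site's GridVM).

```python
# Find the earliest common reservation window across all resources (illustrative only).
def co_allocate(availability, duration):
    """availability: {resource: [(start, end), ...]} -> earliest common start time, or None."""
    candidate_starts = sorted(s for windows in availability.values() for s, _ in windows)
    for start in candidate_starts:
        if all(any(s <= start and start + duration <= e for s, e in windows)
               for windows in availability.values()):
            return start
    return None

availability = {
    "clusterA": [(0, 50), (80, 200)],
    "clusterB": [(30, 120)],
    "vis-host": [(90, 300)],      # e.g., a realtime-visualization resource
}
print(co_allocate(availability, duration=20))   # -> 90: all three can be co-reserved at t=90
```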
Communication Libraries and Tools
1. Modules
   - GridMPI: MPI-1 and 2 compliant, grid-ready MPI library
   - GridRPC: OGF/GridRPC compliant GridRPC library
   - Mediator: communication tool for heterogeneous applications
   - SBC: storage-based communication tool
2. Features
   - GridMPI: MPI for a collection of geographically distributed resources; high performance, optimized for high-bandwidth networks
   - GridRPC: task-parallel, simple, seamless programming
   - Mediator: communication library for heterogeneous applications; data format conversion
   - SBC: storage-based communication for heterogeneous applications
3. Supporting Standards: MPI-1 and 2, OGF/GridRPC
Grid Ready Programming Libraries
• Standards-compliant GridMPI and GridRPC
  – GridMPI: data parallel, MPI compatibility
  – GridRPC (Ninf-G2): task parallel, simple seamless programming (RPC)
  [Diagram scale labels: 100,000 CPU; 100-500 CPU]
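The task-parallel style attributed to GridRPC above is, in essence, asynchronous remote procedure calls fanned out from a client and collected later; the sketch below imitates that pattern with a local thread pool purely to show the programming model, and is not the Ninf-G2 API.

```python
# Task-parallel, RPC-style pattern (imitated locally with a thread pool; with GridRPC
# each call would be shipped to a remote server instead of a local worker thread).
from concurrent.futures import ThreadPoolExecutor

def simulate(parameter):
    """Stand-in for a coarse-grained remote computation."""
    return parameter ** 2

with ThreadPoolExecutor(max_workers=8) as pool:
    handles = [pool.submit(simulate, p) for p in range(100)]   # asynchronous "calls"
    results = [h.result() for h in handles]                    # wait for and collect results

print(sum(results))
```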
Communication Tools for Co-Allocation Jobs
• Mediator: Application-1 ↔ Mediator ↔ Mediator ↔ Application-2 over GridMPI, with data format conversion on each side
• SBC (Storage Based Communication): Application-3 ↔ SBC library ↔ SBC library ↔ Application-2 over the SBC protocol
Compete Scenario: MPI / VM Migration on the Grid (our ABARIS FT-MPI)
[Diagram: Cluster A (fast CPU, slow networks) and Cluster B (high bandwidth, large memory); App A (high bandwidth) and App B (CPU-bound) run as MPI processes inside VMs across hosts; a Resource Manager, aware of individual application characteristics, redistributes the MPI communication log and migrates VM jobs between clusters, also for power optimization]
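A hedged sketch of the placement decision this scenario describes: the resource manager matches each application's dominant characteristic to the cluster that provides it and triggers a VM migration when the current placement does not (names, profiles, and policy are illustrative; this is not the ABARIS implementation).

```python
# Illustrative placement policy for the compete scenario (not the ABARIS code).
clusters = {
    "Cluster A": {"strength": "cpu"},          # fast CPU, slow networks
    "Cluster B": {"strength": "bandwidth"},    # high bandwidth, large memory
}

apps = {
    "App A": {"dominant": "bandwidth", "placed_on": "Cluster A"},
    "App B": {"dominant": "cpu",       "placed_on": "Cluster B"},
}

def preferred_cluster(app):
    for name, props in clusters.items():
        if props["strength"] == app["dominant"]:
            return name
    return app["placed_on"]                    # no better match: stay put

for name, app in apps.items():
    target = preferred_cluster(app)
    if target != app["placed_on"]:
        # In the real scenario this is a live migration of the VM-encapsulated MPI
        # processes, with the MPI communication log redistributed to the new hosts.
        print(f"migrate {name}: {app['placed_on']} -> {target}")
```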