
Blue Heron Project

IBM Rochester: Tom Budnik: [email protected], Amanda Peters: [email protected]

Condor: Greg Thain

With contributions from:
• IBM Rochester: Mark Megerian, Sam Miller, Brant Knudson and Mike Mundy
• Other IBMers: Patrick Carey, Abbas Farazdel, Maria Iordache and Alex Zekulin
• UW-Madison Condor: Dr. Miron Livny

April 30, 2008


Agenda

What is the Blue Heron Project?

Condor and IBM Blue Gene Collaboration

Introduction to Blue Gene/P

What applications fit the Blue Heron model?

How does Blue Heron work?

Information Sources

Condor on BG/P demo (Greg Thain)


What is the Blue Heron Project?

[Diagram: "Paths Toward a General Purpose Machine." In the Blue Gene environment, serial and pleasantly parallel apps take the HTC path (*** NEW ***, available 5/16/08), while highly scalable message-passing apps take the classic HPC (MPI) path.]

Blue Heron = Blue Gene/P HTC and Condor

Blue Heron provides a complete integrated solution that gives users a simple, flexible mechanism for submitting single-node jobs.

Blue Gene looks like a "cluster" from an app’s point of view

Blue Gene supports hybrid application environment

Classic HPC (MPI) apps and now HTC apps


Condor and IBM Blue Gene Collaboration

Both IBM and Condor teams engaged in adapting code to bring Condor and Blue Gene technologies together

Previous Activities (BG/L)
• Prototype/research Condor running HTC workloads

Current Activities (BG/P): Blue Heron Project
• Partner in design of HTC services
• Condor supports HTC workloads using static partitions

Future Collaboration (BG/P and BG/Q)
• Condor supports dynamic machine partitioning
• Condor supports HPC (MPI) jobs
• I/O Node exploitation with Condor
• Persistent memory support (data affinity scheduling)
• Petascale environment issues


Introduction to Blue Gene: Technology Roadmap

[Roadmap diagram]
• Blue Gene/L (2004): PPC 440 @ 700 MHz, scalable to 596+ TF
• Blue Gene/P (2007): PPC 450 @ 850 MHz, scalable to 3+ PF
• Blue Gene/Q (future)

BG/P is the 2nd generation of the Blue Gene family.


Introduction to Blue Gene/P

[Packaging hierarchy diagram]
• Chip: quad-core PowerPC system-on-chip, 4 processors, 13.6 GF/s, 8 MB EDRAM
• Compute Card: 1 chip, 20 DRAMs, 13.6 GF/s, 2 or 4 GB DDR2
• Node Card: 32 compute cards, up to 2 I/O cards, 435 GF/s, 64 or 128 GB
• Cabled Rack: 32 node cards, up to 64 10 GigE I/O links, 14 TF/s, 2 or 4 TB
• System: up to 256 racks, up to 3.56 PF/s, 512 or 1024 TB

Leadership performance in a space-saving, processor-dense, power-efficient package.

High reliability: designed for less than 1 failure per rack per year (7 days MTBF for 72 racks).

Easy administration using the powerful web-based Blue Gene Navigator.

Ultrascale capacity machine ("cluster buster"): run 4,096 HTC jobs on a single rack.

The system scales from 1 to 256 racks: 3.56 PF/s peak.


What applications fit the Blue Heron model? Master/Worker Paradigm:

Many "pleasantly parallel" apps on BG/P use a compute node as the "master node".

Advantage of the Blue Heron (HTC) solution: move the "master node" from a Blue Gene compute node to the Front-End Node (FEN). This is a better solution for the following reasons:
• Application resiliency: in the MPI model a single node failure kills the entire app for the partition. In HTC mode only the job running on the failed node is ended; other single-node jobs continue to run on the partition.
• The FEN has more memory, better performance, and more functionality than a single compute node.
• Code that runs on the compute nodes is much cleaner, since it contains only the work to be performed and leaves the coordination to a script or scheduler (NO MPI NEEDED).
• The coordinator functionality can be a Perl script, Python, a compiled program, or anything that runs on Linux.
• The coordinator can interact directly with DB2 or MySQL, either to get the inputs for the application or to store the results. This can eliminate the need to create a flat-file input for the app or to generate the results in an output file.

Example: American Monte Carlo (options pricing). Reference: en.wikipedia.org/wiki/Monte_Carlo_methods_in_finance. In the MPI master/worker version, the skeleton looks like this:

#include <mpi.h>
int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        // master: send work to the other nodes and collect results
    } else {
        // worker: do the real work
    }
    MPI_Finalize();
    return 0;
}
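
By contrast, a Blue Heron coordinator runs on the FEN and the compute-node code contains only the real work; no MPI is needed anywhere. Below is a minimal sketch of such a coordinator in Python (one of the coordinator languages named above). It is hypothetical: it assumes the HTC "submit" client described later is on the PATH, and the pool name, executable, and seed list are made up.

#!/usr/bin/env python
# Hypothetical FEN-side coordinator sketch: no MPI anywhere.
# Each task runs as a single-node HTC job via the submit client,
# which acts as a lightweight proxy for the job on the compute node.
import subprocess

seeds = [1, 2, 3, 4]            # in practice, inputs might come from DB2/MySQL
procs = []
for seed in seeds:
    p = subprocess.Popen(
        ["submit", "-pool", "BIOLOGY", "-exe", "./monte_carlo",
         "-args", str(seed)],
        stdout=subprocess.PIPE)
    procs.append((seed, p))
for seed, p in procs:
    out, _ = p.communicate()    # each task's stdout is its result
    print(seed, out.decode().strip())

Because each job is independent, a failed task affects only its own seed, which is exactly the resiliency advantage described above.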


How does Blue Heron work? “Software Architecture Viewpoint”

Design Goals:
• Lightweight
• Extreme scalability
• Flexible scalability
• High throughput (fast)


How does Blue Heron work? “End user perspective”

Submitting jobs (typically from the FEN):

"submit" client: acts as a shadow or proxy for the real job running on the compute node; very lightweight.

Submit jobs to a location or a pool:
• pool id concept: a scheduler alias for a collection of partitions available to run a job on
• location: the resource where the job will execute, in the form of a processor or wildcard location

Example #1 (submit to location): submit -location "R00-M0-N00-J05-C00" -exe hello_world
Example #2 (submit to pool): submit -pool BIOLOGY -exe hello_world

Job scheduler example: submit jobs using Condor ("condor_submit")
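
For reference, a plain Condor submit description for a batch of single-node tasks might look like the sketch below. This is a generic example: the file names and queue count are invented, and any Blue Gene/P-specific attributes the integration may require are omitted.

# hello_world.sub (hypothetical)
universe   = vanilla
executable = hello_world
arguments  = $(Process)
output     = hello_world.$(Process).out
error      = hello_world.$(Process).err
log        = hello_world.log
queue 64

Running condor_submit hello_world.sub would queue 64 independent single-node jobs, each receiving its own $(Process) index as an argument.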


Navigator: viewing active HTC jobs running on Blue Gene partitions (blocks) [screenshot]


Navigator: viewing HTC job history on Blue Gene [screenshot]


Information Sources

Official Website www.ibm.com/servers/deepcomputing/bluegene.html

Blue Gene Redbooks and Redpapers: for the latest list go to www.redbooks.ibm.com and search for "Blue Gene"

IBM Journal of Research and Development researchweb.watson.ibm.com/journal/rd/521/team.html

www.research.ibm.com/journal/rd49-23.html

Research Site www.research.ibm.com/bluegene/index.html

TOP500 List www.top500.org

Green500 List www.green500.org


Condor using HTC on BG/P Demo: Rosetta++ with MySQL

Rosetta++ is a protein structure prediction application

It is very well-suited to HTC, since it runs many simulations of the same protein using different random number seeds. The run that produces the lowest energy model among those attempted is the "solution".

Rosetta++ had already been shown to work on Blue Gene, by David Baker’s lab

Our goal was to show that it runs well in HTC mode

Very few actual code changes were required: compiled for Blue Gene, but using the single-node version (NO MPI)

Changed a few places that did file output to use stdout instead, since that made it easier for the submitting script to associate each task with its results

Created a simple database front-end using both DB2 and MySQL, to contain the proteins and the seeds

Perl script reads inputs from database, submits each task to Condor, and processes results back into the database

Demonstrates HTC mode using Condor, with perfect linear scaling and no MPI
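
A rough sketch of that coordinator flow is shown below (hypothetical: the database, table, column, and file names are invented, and the original demo used a Perl script rather than Python):

#!/usr/bin/env python
# Hypothetical sketch of the demo's coordinator loop.
import subprocess
import MySQLdb                  # a classic MySQL driver for Python

db = MySQLdb.connect(host="localhost", db="rosetta")
cur = db.cursor()
cur.execute("SELECT protein, seed FROM work_items")
for protein, seed in cur.fetchall():
    # write a per-task Condor submit description and queue it
    with open("task.sub", "w") as f:
        f.write("executable = rosetta\n")
        f.write("arguments  = %s %d\n" % (protein, seed))
        f.write("output     = %s.%d.out\n" % (protein, seed))
        f.write("queue\n")
    subprocess.call(["condor_submit", "task.sub"])
db.close()
# Each task's stdout file would later be parsed and the lowest-energy
# model written back into the database.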


Questions?


Backup Slides


What are the Blue Gene System Components?

• Blue Gene Rack(s): hardware and software
• Host System: Service Node and Front End (login) Nodes
• SuSE SLES/10, HPC software stack, file servers, storage subsystem, XLF/C compilers, DB2, 3rd-party Ethernet switch


Blue Gene Integrated Networks

Torus
• Compute nodes only
• Direct access by app
• DMA

Collective
• Compute and I/O node attached
• 16 routes allow multiple network configurations to be formed
• Contains an ALU for collective operation offload
• Direct access by app

Barrier
• Compute and I/O nodes
• Low latency barrier across system (< 1 µsec for 72 racks)
• Used to synchronize time bases
• Direct access by app

10Gb Functional Ethernet
• I/O nodes only

1Gb Private Control Ethernet
• Provides JTAG, I2C, etc., access to hardware
• Accessible only from the Service Node

Clock network
• Single clock source for all racks


Blue Gene is the most Power, Space, and Cooling Efficient Supercomputer (published specs per peak performance)

[Bar chart (0% to 400% scale): Racks/TF, kW/TF, Sq Ft/TF, and Tons/TF for Sun/Constellation, Cray/XT4, SGI/ICE, and IBM BG/P.]


Blue Gene is Orders of Magnitude more Reliable than other Platforms

Results of a survey conducted by Argonne National Lab on 10 clusters ranging from 1.2 to 365 TFlops (peak); excludes the storage subsystem, management nodes, SAN network equipment, and software outages.

[Bar chart: failures per month for a 100 TF/s system, for Itanium2, x86, Power5, BG/L, and BG/P (scale 0 to 800); the charted values are 800, 394, 127, 1, and <1*, with the Blue Gene systems orders of magnitude lower.]

* Estimated based on reliability improvements implemented in BG/P compared to BG/L


Blue Gene Software Hierarchical Organization

Compute nodes are dedicated to running the user application and almost nothing else: a simple compute node kernel (CNK)

I/O nodes run Linux and provide a more complete range of OS services: files, sockets, process launch, signaling, debugging, and termination

Service node performs system management services (e.g., heartbeating, monitoring errors), transparent to application software


BG/P Job Modes allow Flexible use of Compute Node Resources

Quad Mode (also called Virtual Node Mode)
• All 4 cores run 1 process each
• No threading
• Each process gets ¼ of the node memory (e.g., 512 MB on a 2 GB node)
• MPI/HTC programming model

Dual Mode
• 2 cores run 1 process each
• Each process may spawn 1 thread on a core not used by the other process
• Each process gets ½ of the node memory
• MPI/OpenMP/HTC programming model

SMP Mode
• 1 core runs 1 process
• The process may spawn threads on each of the other cores
• The process gets the full node memory
• MPI/OpenMP/HTC programming model

[Diagram: per-mode layout of processes (P), threads (T), and memory (M) across cores 0-3 and the memory address space.]


Why and for What is Blue Gene Used?
• Improve understanding: significantly larger scale, more complex and higher resolution models; new science applications
• Multiscale and multiphysics: from atoms to mega-structures; coupled applications
• Shorter time to solution: answers from months to minutes

Application areas: physics and materials science (molecular dynamics); environment and climate modeling; life sciences (sequencing, in-silico trials, drug discovery); biological modeling and brain science; computational fluid dynamics; financial modeling and streaming data analysis; geophysical data processing and upstream petroleum.


Many Computational Science Modeling and Simulation Algorithms and Numerical Methods are Massively Parallel

[Chart mapping application domains to the basic algorithms and numerical methods they use (Monte Carlo, discrete events, N-body, Fourier methods, graph theoretic, transport, partial and ordinary differential equations, fields), each rated Good/Better/Best for massive parallelism. The domains span science and engineering broadly, including molecular modeling, biomolecular dynamics / protein folding, rational drug design, nanotechnology, quantum chemistry, quantum chromodynamics, weather and climate, seismic processing, aerodynamics, cosmology and astrophysics, VLSI design, genome processing, large-scale data mining, cryptography, signal processing, MRI imaging, tomographic reconstruction, air traffic control, transportation systems, computer vision, and scientific visualization. Source: Rick Stevens, Argonne National Lab and The University of Chicago.]


What applications fit the Blue Heron model? Wide range of applications can run in HTC mode

Many applications that run on Blue Gene today are "embarrassingly (pleasantly) parallel" or "independently parallel": they don't exploit the torus for MPI communication and just want a large number of small tasks, with a coordinator of results.

HTC Application Identification

Solution statement: A high-throughput computing (HTC) application is one in which the same basic calculation must be performed over many independent input data elements and the results collected. Because each calculation is independent, it is extremely easy to spread calculations out over multiple cluster nodes. For this reason, high-throughput applications are sometimes called "embarrassingly parallel." HTC applications occur much more frequently than one might think, showing up in areas such as parameter studies, search applications, data analytics, and what-if calculations.

Identifying an HTC application: There are a number of identifiers you can use to determine whether your specific computing problem fits into the category of a high-throughput application:

• Do you need to run many instances of the same application with different arguments or parameters?
• Do you need to run the same application many times with different input files?
• Do you have an application that can select subsets of the input data and whose results can be combined by a simple merge process, such as concatenating them, placing them into a single database, or adding them together?

If the answer to any of these questions is "yes," then it is quite likely that you have an HTC application.

Source: Grid.org


How does Blue Heron work?

Key Features:

Provides a job submit command that is simple, lightweight, and extremely fast

Job state is integrated into Control System database, so administrators know which nodes have jobs, and which are idle

Provides stdin/stdout/stderr on a per-job basis

Enables individual jobs to be signaled or killed

Maintains a user ID on a per-job basis (allows multiple users per partition)

Blue Gene Navigator shows HTC jobs (active or in history) with job exit status & runtime stats

Designed for easy integration with job schedulers (e.g., Condor, LoadLeveler, SIMPLE)


submit command

./submit [options] or ./submit [options] binary [arg1 arg2 ... argn]

Job options:

[-]-exe <exe> executable to run

[-]-args "arg1 arg2 ... argn" arguments, must be enclosed in double quotes

[-]-env <env=value> add an environment variable for the job

[-]-exp_env <env> export an environment variable to the job's environment

[-]-env_all add all current environment variables to the job's environment

[-]-cwd <cwd> the job's current working directory

[-]-timeout <seconds> number of seconds before the job is killed

[-]-strace run job under system call tracing

Resource options:

[-]-mode <SMP|DUAL|VNM> the job mode

[-]-location <Rxx-Mx-Nxx-Jxx-Cxx> compute core location to run the job

[-]-pool <id> compute node pool ID to run the job

Options:

[-]-port <port> listen port of the submit mux to connect to (default 10246)

[-]-trace <0-7> tracing level, default(6)

[-]-enable_tty_reporting disable the default line buffering of stdin, stdout, and stderr when input (stdin) or output (stdout/stderr) is not a tty

[-]-raise if a job dies with a signal, submit will raise this signal
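
Putting the options together, a typical invocation might look like the following (illustrative only: the pool name, executable, and timeout value are made up):

./submit -mode SMP -pool BIOLOGY -exe ./hello_world -args "foo bar" -timeout 3600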