Progress Towards Petascale Virtual Machines. Al Geist, Oak Ridge National Laboratory, www.csm.ornl.gov/~geist. EuroPVM-MPI 2003, Venice, Italy, September 30, 2003.
Progress Towards Petascale Virtual Machines
Al Geist
Oak Ridge National Laboratory
www.csm.ornl.gov/~geist
EuroPVM-MPI 2003
Venice, Italy
September 30, 2003
Petascale Virtual Machine: Another kind of “PVM”
This talk will describe:
• DOE Genomes to Life Project: PVM use today in the Genomics Integrated Supercomputer Toolkit for fault tolerance and high availability in a dynamic environment
• Harness Project (next generation of PVM) and its features to help scale to petascale systems
– Distributed peer-to-peer control
– H2O: the self-adapting core of Harness
– FT-MPI: fault tolerant MPI
• Latest superscalable algorithms with natural fault tolerance for petascale environments
Understanding the Essential Processes of Living Systems
Follow-on to the Human Genome Program
– Determined the entire DNA sequence for humans
– 24 chromosomes in 6 ft of DNA
– 3 billion nucleotides code for 35,000 genes
– Only 0.0001% difference between people
The instructions to build a human fit on a DVD (3 GB)
The goal of the Genomes to Life Program is to read the instructions, starting with simple single-cell organisms (microbes)
– Molecular machines
– Regulatory pathways
– Multi-cell communities
Develop new computational methods to understand complex biological systems
DOE Genomes to Life Program
PVM
$100M effort
www.genomes-to-life.org
Many interlinked proteins form interacting machines
From The Machinery of Life, David S. Goodsell, Springer-Verlag, New York, 1993.
Molecular Machines Fill Cells
www.genomes-to-life.org
Gene regulation controls what genes are expressed
Proteome changes over time and due to environmental conditions
Regulatory Networks Control the Machines
- And -
GTL will Require Petascale Systems
[Figure: computing requirements from 1 TF to 1000 TF (TF = teraflops) plotted against biological complexity, with a marker for current U.S. computing. Applications shown: comparative genomics; constrained rigid docking; genome-scale protein threading; constraint-based flexible docking; protein machine interactions; molecular machine classical simulation; cell, pathway, and network simulation; community metabolic, regulatory, and signaling simulations; molecule-based cell simulation; cell-based community simulation.]
GTL is going to rely on high-performance computing and data analysis to process high-throughput experimental data
Biology for the 21st Century
The new computational biology environments will be conceptually integrated “knowledge enabling” environments that couple diverse sets of distributed data, advanced informatics methods, experiments, modeling, and simulation.
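[Figure: an integrated environment linking raw data, genomes, regulatory elements, protein structure, pathways, and models through a cycle of experiment, data analysis, modeling, and simulation. The example pathway diagram shows EGFR and erbB-2 signaling components (Shc, Cbl, eps8, PLC, Grb-2, Src, Eps15, AP-2, ERK, Annexin II, several marked with question marks) and numbered steps 1 to 6 through early endosomes, late endosomes, lysosomes, and the Golgi.]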
GIST is a framework for large-scale biological application deployment
– provides a transparent and high-performance interface to biological applications
– provides transparent access to distributed data sets
– utilizes PVM to launch and manage jobs across a wide diversity of supercomputers
– highly fault tolerant and adapts to dynamic changes in the environment using PVM
– next step: deploy across ORNL, ANL, PNNL, and SNL as a multi-site “Bio-Grid” with thousands of users for execution of genome analysis and simulation
Genome Integrated Supercomputer Toolkit
[Figure: PVM across heterogeneous supercomputers: IBM p690 (864 proc), Cray X1 (256 proc), P4 cluster (64 proc), SGI Altix (256 proc). A web portal and a protein analysis engine exchange XML and draw on raw data, genomes, and pathways.]
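The launch-and-manage behaviour described for GIST can be pictured with a small simulation. This is a sketch in plain Python, not PVM code and not the GIST implementation: `schedule_jobs`, the host labels, the job names, and the failure set are all hypothetical names chosen for illustration. A real deployment would rely on PVM's own task spawning and failure notification instead.

```python
from collections import deque


def schedule_jobs(jobs, hosts, failed_hosts):
    """Place each job on a host, requeueing work that lands on a failed host.

    Illustrative only: this simulates a virtual machine that drops dead
    hosts and resubmits their jobs; it does not call PVM.
    """
    alive = list(hosts)          # current membership of the virtual machine
    pending = deque(jobs)        # jobs still waiting for a host
    placement = {}               # job -> host that ran it
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("no hosts left in the virtual machine")
        job = pending.popleft()
        host = alive[turn % len(alive)]   # simple round-robin choice
        turn += 1
        if host in failed_hosts:
            alive.remove(host)   # adapt: shrink the machine
            pending.append(job)  # fault tolerance: resubmit the job
            continue
        placement[job] = host
    return placement


# Hypothetical labels loosely based on the machines named in the figure.
hosts = ["ibm-p690", "cray-x1", "p4-cluster", "sgi-altix"]
jobs = [f"job-{i}" for i in range(6)]
placement = schedule_jobs(jobs, hosts, failed_hosts={"cray-x1"})
print(placement)
```

Every job ends up on a surviving host, and the failed host is removed from the membership list the first time a job is routed to it.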
The GIST Developers really want Harness
They ask us regularly about the next generation of PVM called Harness because they want the increased adaptability and fault tolerance that Harness promises.
Harness is being developed by the same team that developed PVM:
Vaidy Sunderam – Emory University
Al Geist – Oak Ridge National Lab
Jack Dongarra – University of Tennessee and ORNL
Harness II Design Goals
Harness is a distributed virtual machine environment that goes beyond the features of PVM:
Allows users to dynamically customize, adapt, and extend a virtual machine's features
• to more closely match the needs of their application
• to optimize the virtual machine for the underlying computer resources
Is being designed to scale to petascale virtual machines
• distributed control
• minimized global state
• no single point of failure
Allows multiple virtual machines to join and split in temporary micro-grids
HARNESS II Architecture
[Figure: hosts A to D form a virtual machine; operation within the VM uses distributed control. Each host runs a component-based HARNESS daemon, built on top of the H2O kernel with the DVM pluglet loaded (other components shown: FT-MPI, process control). User features are added through customization and extension by dynamically adding pluglets. The VM can merge or split with other VMs.]
• No single point (or set of points) of failure for Harness. It survives as long as one member still lives.
• All members know the state of the virtual machine, and their knowledge is kept consistent w.r.t. the order of changes of state. (Important parallel programming requirement!)
• No member is more important than any other (at any instant), i.e. there isn’t a pass-around “control token”
• For Petascale Systems the control members can be a distributed subset of all the processors in the system
Symmetric Peer-to-Peer Distributed Control
Characteristics
Harness Distributed Control
Control is Asynchronous and Parallel
• Supports multiple simultaneous updates
• Supports fast host adding (addhost)
• Fast host delete or recovery from fault
• Parallel recovery from multiple host failures
HARNESS: Petascale Virtual Machine
Variable Distributed Control Loop Size
Size of the control loop: 1 <= S <= (size of VM)
• For S = 1, distributed control becomes a simple client/server model.
• For a small VM and ultimate fault tolerance, S = (size of VM).
• For a large VM, a random selection of a few hosts (e.g. S = 10) gives a balance of multi-point failure tolerance and performance.
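The trade-off in the control loop size S can be made concrete with a little arithmetic. The sketch below is not Harness code: `pick_control_members` and `prob_control_lost` are hypothetical helpers, and the model assumes that hosts fail independently with the same probability, which the slides do not state. Under that assumption control is lost only if all S members fail, so the probability is p raised to the power S.

```python
import random


def pick_control_members(hosts, s, seed=None):
    """Pick S distinct hosts at random to form the control loop."""
    if not 1 <= s <= len(hosts):
        raise ValueError("S must satisfy 1 <= S <= size of VM")
    return random.Random(seed).sample(hosts, s)


def prob_control_lost(p_fail, s):
    """Chance that every one of the S control members fails.

    Assumes independent failures with identical probability p_fail.
    """
    return p_fail ** s


hosts = [f"host{i}" for i in range(1000)]     # hypothetical large VM
members = pick_control_members(hosts, 10, seed=1)
print(members)
for s in (1, 3, 10):
    print(s, prob_control_lost(0.01, s))
```

With S = 1 the single control member is a single point of failure, as in a client/server model. Each extra member multiplies the chance of losing control by p, while adding one more participant to every control update.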
H2O kernel - Overview
H2O is a multithreaded, lightweight kernel that is dynamically configured by loading “pluglets”
Resources are provided as services through pluglets. Services may be deployed by any authorized party: provider, client, or third-party reseller
H2O is stateless and resource independent
In Harness the DVM service, which includes distributed control of services, must be installed on the host
Pluglets can provide multiple programming models
Java and C implementations being developed
[Figure: clients reach the kernel through functional interfaces provided by pluglets. Programming models shown: FT-MPI, PVM, Java RMI, active objects, OGSA, P2P. One interface is labeled “Suspendible”.]
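The idea of a kernel that is configured entirely by loading pluglets can be sketched in a few lines. This is a toy model, not the H2O API: `ToyKernel`, `deploy`, `invoke`, and the two example services are hypothetical names, and pluglets are reduced to plain Python callables. It only illustrates two points from the overview: the kernel offers nothing until a pluglet is loaded, and any authorized party may deploy one.

```python
class ToyKernel:
    """Minimal stand-in for a kernel configured by loading pluglets."""

    def __init__(self, authorized_parties):
        self._authorized = set(authorized_parties)
        self._pluglets = {}      # service name -> callable pluglet

    def deploy(self, party, name, pluglet):
        """Load a pluglet if the deploying party is authorized."""
        if party not in self._authorized:
            raise PermissionError(f"{party!r} may not deploy pluglets")
        self._pluglets[name] = pluglet

    def services(self):
        """Names of the services currently offered."""
        return sorted(self._pluglets)

    def invoke(self, name, *args):
        """Call a deployed service by name."""
        return self._pluglets[name](*args)


kernel = ToyKernel(authorized_parties={"provider", "client", "reseller"})
kernel.deploy("provider", "hostinfo", lambda: {"cpus": 4})
kernel.deploy("client", "scale", lambda xs, k: [k * x for x in xs])
print(kernel.services())
print(kernel.invoke("scale", [1, 2, 3], 10))
```

Keeping the kernel itself free of service logic is what lets different programming models sit side by side as separate pluglets.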
H2O is built on top of a flexible P2P communication layer called RMIX
– Provides interoperability between kernels and other web services
– Adopts common RMI semantics
– Designed for easy porting between protocols
– Dynamic protocol negotiation
– Scalable P2P design
H2O kernel - RMIX Communication
[Figure: a Java H2O kernel (pluglets A, B, C) and another H2O kernel (pluglets D, E, F) each sit on an RMIX layer over networking protocols such as RPC, IIOP, JRMP, and SOAP, and interoperate with RPC clients, SOAP clients, and web services.]
[Figure: pluglet deployment scenarios among providers, clients, developers, resellers, and repositories, including deployment of native code and legacy applications. Registration and discovery (publish and find) can use e-mail, phone, JNDI, UDDI, LDAP, DNS, GIS, and others.]
H2O can support a wide range of distributed computing models
Cluster computing, like
– PVM
– Harness
– LAM/MPI
Grid web portal, like
– Genome Channel
– Biology Workbench
– Web service
Internet computing, like
– SETI@home
– Entropia
– United Devices
Flexibility beyond the PVM/MPI model
Harness Fault Tolerant MPI Plug-in
FT-MPI is built in layers with tuned collectives, tuned derived data type handling, and good point-to-point bandwidth.
Works with MPE profiling and tools such as JUMPSHOT from ANL.
[Figure: two hosts, each running an MPI application linked with libftmpi and a startup plugin on top of H2O, together with a name service and Ftmpi_notifier.]
Application performance on par with MPICH-2.
FT-MPI available at SC2003
Harness Fault Tolerant MPI Plug-in
FT-MPI is a system-level, fault tolerant, full MPI 1.2 implementation.
Process failures are detected and passed back to the user's application using MPI objects. The user's application decides how best to reconfigure the system and continue.
Recovery options for affected communicators:
– ABORT: do as other implementations do, i.e. checkpoint/restart
– BLANK: leave a hole
– SHRINK: re-order processes to make a contiguous communicator
– REBUILD: re-spawn lost processes and add them to MPI_COMM_WORLD
[Figure: communicator options for nine processes (1 to 9) with three failures marked X. Rows shown: 1 2 3 4 5 6 7 8 9; 1 2 5 6 8 9; 1 2 3 4 5 6; 1 2 3 4 5 6 7 8 9.]
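The four recovery options can be compared side by side with a toy model of a communicator as a list of process labels. This is not the FT-MPI API: `recover` is a hypothetical helper, and the failed set {3, 4, 7} is an assumption read off the figure rows. In each returned list, the position is the slot in the recovered communicator and the value is the original process label.

```python
def recover(ranks, failed, mode):
    """Return the communicator layout after failures under one recovery mode.

    ranks  -- original process labels, in communicator order
    failed -- set of labels that have failed
    mode   -- one of "ABORT", "BLANK", "SHRINK", "REBUILD"
    """
    if mode == "ABORT":
        # Give up on the run; the job would restart from a checkpoint.
        raise RuntimeError("process failure: abort and restart from checkpoint")
    if mode == "BLANK":
        # Survivors keep their slots; failed slots become holes.
        return [None if r in failed else r for r in ranks]
    if mode == "SHRINK":
        # Survivors are packed into a contiguous, smaller communicator.
        return [r for r in ranks if r not in failed]
    if mode == "REBUILD":
        # Lost processes are re-spawned, so every slot is filled again.
        return list(ranks)
    raise ValueError(f"unknown recovery mode: {mode}")


ranks = list(range(1, 10))
failed = {3, 4, 7}
for mode in ("BLANK", "SHRINK", "REBUILD"):
    print(mode, recover(ranks, failed, mode))
```

The difference matters to application code: after BLANK, a process must be ready to find a hole where a peer used to be. After SHRINK, a process may find that its own position has changed.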
Large-scale Fault Tolerance
Developing fault tolerant algorithms is not trivial. Anything beyond simple checkpoint/restart is beyond most scientists. Many recovery issues must be addressed.
Restarting 90,000 tasks because 1 task failed may be a very inefficient use of resources.
When and what are the recovery options for large-scale simulations?
Taking fault tolerance beyond checkpoint/restart.
Fault Tolerance: a petascale perspective
Future systems are being designed with 100,000 processors.
The time before some failure will be measured in minutes.
Checkpointing and restarting this large a system could take longer than the time to the next failure!
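A rough calculation shows why the time to failure shrinks to minutes at this scale. The sketch assumes independent nodes with constant failure rates, so the system failure rate is the sum of the node rates. The 5-year node MTBF and the checkpoint and restart times are made-up illustrative numbers, not figures from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def system_mtbf_minutes(node_mtbf_years, n_nodes):
    """Mean time between failures of the whole system, in minutes.

    Assumes independent nodes with constant failure rates, so the
    system MTBF is the node MTBF divided by the number of nodes.
    """
    return node_mtbf_years * MINUTES_PER_YEAR / n_nodes


def checkpoint_outlasts_failures(checkpoint_min, restart_min, mtbf_min):
    """True when one checkpoint plus one restart takes at least one MTBF."""
    return checkpoint_min + restart_min >= mtbf_min


# Assumed inputs: 5-year node MTBF, 100,000 processors,
# 20 minutes to checkpoint and 10 minutes to restart.
mtbf = system_mtbf_minutes(node_mtbf_years=5, n_nodes=100_000)
print(f"system MTBF: {mtbf:.2f} minutes")
print(checkpoint_outlasts_failures(20, 10, mtbf))
```

Under these assumed numbers the system MTBF comes out at about 26 minutes, shorter than the combined 30 minutes of checkpointing and restarting, which is the situation the slide warns about.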
Develop algorithms that are naturally fault tolerant, i.e. failure anywhere can be ignored and the algorithm still gets the right answer.
– No monitoring
– No notification
– No recovery
Is this possible? YES!
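One way to picture an algorithm that needs no monitoring, notification, or recovery is a gossip-style global maximum. The sketch below is an illustration under stated assumptions, not the method from the talk: `gossip_max`, the fanout of 3, the 60 rounds, and the choice of fresh random peers in every round are all hypothetical. A failed node simply never answers, and the node asking just moves on.

```python
import random


def gossip_max(values, failed, rounds=60, fanout=3, seed=0):
    """Each live node repeatedly asks a few random peers and keeps the
    largest value it has seen. Failed nodes never reply and are ignored."""
    rng = random.Random(seed)
    n = len(values)
    known = list(values)                      # best value known to each node
    live = [i for i in range(n) if i not in failed]
    for _ in range(rounds):
        for i in live:
            for peer in rng.sample(range(n), fanout):
                if peer in failed:
                    continue                  # no reply: nothing to detect or repair
                known[i] = max(known[i], known[peer])
    return {i: known[i] for i in live}


# 200 nodes holding a permutation of 0..199; every tenth node has failed.
values = [(i * 37) % 200 for i in range(200)]
failed = set(range(0, 200, 10))
result = gossip_max(values, failed)
print(set(result.values()))
```

In this run the node holding the largest value is alive, so the survivors agree on the true maximum. If that node had failed before anyone read its value, the survivors would agree on the largest surviving value instead.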
What to do? Autonomic? Self-healing?
Progress on Super-scalable algorithms
Demonstrated that scale invariance and natural fault tolerance can exist for local and global algorithms
Local: Finite Difference (Christian Engelmann)
– Demonstrated natural fault tolerance with chaotic relaxation, meshless, finite difference solution of Laplace and Poisson problems
Global: Global information (Kasidit Chanchio)
– Demonstrated natural fault tolerance in the global max problem with random, directed graphs
Gridless Multigrid (Ryan Adams)
– Combines the fast convergence of multigrid with the natural fault tolerance property. Hierarchical implementation of the finite difference method above.
– Three different asynchronous updates explored
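A one-dimensional toy problem gives a feel for chaotic relaxation on a meshless point set. The sketch solves u'' = 0 on [0, 1] with u(0) = 0 and u(1) = 1, whose exact solution is u(x) = x. It is an assumption-laden analog, not the code behind the results above: `chaotic_relaxation`, the 21 points, and the failed indices are hypothetical. Points are updated in random order, and a surviving point skips failed neighbours and uses the nearest responding point on each side, weighted by distance.

```python
import random


def chaotic_relaxation(n=21, failed=(5, 6, 12), updates=100_000, seed=0):
    """Random-order relaxation of u'' = 0 on [0, 1], u(0) = 0, u(1) = 1.

    Points listed in `failed` never respond. Both boundary points
    (index 0 and n - 1) are assumed to stay alive.
    """
    rng = random.Random(seed)
    x = [i / (n - 1) for i in range(n)]       # point positions
    u = [0.0] * n
    u[-1] = 1.0                               # boundary condition u(1) = 1
    dead = set(failed)
    interior = [i for i in range(1, n - 1) if i not in dead]
    for _ in range(updates):
        i = rng.choice(interior)              # no fixed sweep order
        left = i - 1
        while left in dead:                   # skip failed neighbours
            left -= 1
        right = i + 1
        while right in dead:
            right += 1
        dl, dr = x[i] - x[left], x[right] - x[i]
        # Distance-weighted average: linear interpolation between neighbours.
        u[i] = (u[left] * dr + u[right] * dl) / (dl + dr)
    return {i: (x[i], u[i]) for i in [0] + interior + [n - 1]}


result = chaotic_relaxation()
worst = max(abs(ui - xi) for xi, ui in result.values())
print(f"largest error at a surviving point: {worst:.2e}")
```

The surviving points still converge to the exact linear solution because each update depends only on whichever neighbours respond, not on a fixed grid.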
Further Information
www.csm.ornl.gov/~geist
Genomes to Life
Harness
Naturally Fault Tolerant Algorithms
Questions?