VLab: A Cyberinfrastructure for Parameter Sampling Computations Suited for
Materials Science Calculations
Cesar R. S. da Silva¹
Pedro R. C. da Silveira¹
¹Minnesota Supercomputing Institute, University of Minnesota
Work Sponsored by NSF grant ITR-0426757 and MSI
The VLab
- "A cyberinfrastructure designed to enable the execution of extensive calculations that can be broken into many decoupled tasks."
Typically parameter-sweep applications, such as:
- Weather and climate
- Oil exploration
- Stress tests of investment strategies
- Seismology
- Geodynamics
- Everybody's favorite: calculation of thermal properties of materials at high pressures and temperatures
VLab has three main roles
1 - Science Enabler
• Empowering users to manage extensive workflows
- Automatic workflow management
- Ease of use
- Collaborative support
- Diversity of tools for data analysis, visualization, etc.
• Aggregating throughput of scattered resources to cope with huge workloads
- Distributed computations
- Fault tolerance
- Optimal scheduling
… three main roles
2 - Community facility
• Available to the entire planetary-materials community
• Provide a set of tools of common interest
3 - Virtual Organization
• Globally accessible through the WWW
• Strong collaborative support
- Shared access to projects
- Collaborative data analysis with synchronous view of data
- Works combined with teleconferencing software
Allows geographically distributed groups to work on the same project.
However, VLab is not:
1 - A program or Software Distribution
• You can download the sources and create your own VLab
• But you gain no advantage by doing so.
2 - A tool to calculate thermal properties of Materials
• This is just one VLab application
• New applications can be developed as users show interest and willingness to participate.
The VLab
- Composed of a set of tools, made available to each other as Web Services distributed across the Internet. Currently available tools include:
- Quantum ESPRESSO package tools
- Input preparation for pwscf, phonon, workflows, etc.
- Data analysis and visualization tools (VTK/OpenGL)
- Workflow management and monitoring tools
- and many more to come …
- Automatic generation of task input and recollection of output
- User interface consolidated through an easy-to-use portal
VLab Workflows
Typical VLab workflows, like the high-T Cij calculation, involve iterations through the following steps:
1) Prepare inputs for tasks, and generate execution packages containing required files.
2) Dispatch the execution packages to compute nodes for execution.
3) Gather results for analysis and, if needed, iterate steps 1-3.
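The three-step iteration above can be sketched in Python; names like `ExecutionPackage`, `prepare_packages`, and `dispatch` are illustrative placeholders, not the actual VLab API:

```python
# Sketch of the VLab workflow loop (steps 1-3); all names are
# illustrative stand-ins, not the real VLab interfaces.
from dataclasses import dataclass

@dataclass
class ExecutionPackage:
    task_id: int
    input_text: str

def prepare_packages(pressures):
    # Step 1: one execution package per sampled parameter value.
    return [ExecutionPackage(i, f"pressure = {p} GPa")
            for i, p in enumerate(pressures)]

def dispatch(pkg):
    # Step 2: stand-in for sending the package to a compute node
    # and receiving its output (a fake result here).
    return {"task_id": pkg.task_id, "energy": -10.0 - 0.1 * pkg.task_id}

def converged(results):
    # Step 3: decide whether another sweep (e.g. a finer grid) is needed.
    return len(results) >= 4

results = []
pressures = [0.0, 30.0, 60.0, 90.0]
while not converged(results):
    for pkg in prepare_packages(pressures):
        results.append(dispatch(pkg))

print(len(results))  # 4 gathered results
```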
Leverages computing capabilities of distributed resources (TeraGrid, OSG, scattered resources, other grids)
- Automatic Task Distribution and Data Recollection
Exploit workflow level parallelism to increase performance
Optimal scheduling is an open research field
VLab - A Distributed System Approach
- Distributed components are replicated for:
  - Redundancy
  - Performance
  - Flexibility
- No central component to fail and bring everything down!
- Flexible scheduling for:
  - Cost
  - Turnaround time
  - Job throughput
  - Workload balance
  - System throughput
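One simple way to trade off scheduling criteria like these is a weighted score over the competing objectives; the resource list, costs, turnaround estimates, and weights below are all hypothetical:

```python
# Hypothetical sketch: rank compute resources by a weighted score over
# cost and expected turnaround time (lower is better for both).
resources = [
    {"name": "TeraGrid", "cost": 1.0, "turnaround_h": 2.0},
    {"name": "local",    "cost": 0.0, "turnaround_h": 8.0},
    {"name": "OSG",      "cost": 0.2, "turnaround_h": 5.0},
]

def score(r, w_cost=0.5, w_time=0.5):
    # Weights encode the user's preference (e.g. cheap vs. fast).
    return w_cost * r["cost"] + w_time * r["turnaround_h"]

best = min(resources, key=score)
print(best["name"])  # TeraGrid (fastest wins with these weights)
```

Shifting `w_cost` toward 1.0 would instead favor the free local resource, which is the flexibility the slide refers to.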
VLab - What already works
-Automatic task distribution and data recollection
-Shared access to project monitoring tools and data
- Non-collaborative data analysis and 2D graphs
-High PT properties workflow and its sub-workflows
•High PT application completes successfully, generating a number of thermodynamic variables from a single input, with no user intervention during execution.
VLab - What has to be done
- Fault tolerance
  - Registry-based
  - Redundant registry and metadata DB for data persistence
  - Full journaling of critical transactions for data and metadata integrity
- Dynamic composition of Web Services
  - Will facilitate development of new applications
- Volumetric (3D) data visualization
  - Has to be rewritten from scratch
- Collaborative data analysis and visualization
  - Has an inconsistent UI
  - Erratic behavior with 2 or more simultaneous users
  - Support for synchronous view of data not yet implemented
… What has to be done - Methodological improvements
-Real space symmetry operations in ESPRESSO -> reciprocal space
-Numerical instability with Wentzcovitch VCS-MD -> (PR?)
-Constant g-space cut-off in VCS-MD in ESPRESSO -> (?)
- Fitting procedure in the high-PT data analysis tool: the tool currently in use has a serious flaw.
VLab in Action
Live demo at 2nd VLab Workshop 07: http://www.vlab.msi.umn.edu/events/videos/secondworkshop/08082007/Demo.mov
Calculation of High (P,T) Thermodynamic Properties
- Cubic MgO, 2-atom cell
- Static + lattice dynamics calculation
- {Pn}x{qi} sampling
Shows distributed computing capabilities and the ability to integrate visualization and data analysis tools.
Visit the VLab web site: http://www.vlab.msi.umn.edu/
VLab Service-Oriented Architecture
On the Web: http://dasilveira.msi.umn.edu:8080/vlab/
Usage-oriented view of the VLab SOA
=> Tree-like structure in 4 layers:
1) User interface (Portal)
2) Workflow control and monitoring (Project Executor / Interaction)
3) Task dispatching / interaction, task data retrieval, auxiliary services
4) Heavy computation and visualization resources
Scheduling => of fundamental importance for performance
The usual approach:
- Use agents that interact with the broker
- Problem: agents are not stateless!
  - More complicated to develop
  - Persistence must be guaranteed
The VLab approach:
- Use an independent Web Service to monitor workload
- Persistence of data is provided by a local DB
- Compute WS and Workload Monitor are stateless!
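The stateless-monitor idea can be sketched as follows: every query is answered from persisted data alone, so any replica of the monitor service gives the same answer. Here an in-memory SQLite table stands in for the local DB; the node names and schema are made up:

```python
# Sketch of a stateless workload monitor: the service keeps no session
# state; each query reads current load from a local database. An
# in-memory SQLite table stands in for the per-site DB here.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE load (node TEXT, running INTEGER)")
db.executemany("INSERT INTO load VALUES (?, ?)",
               [("node-a", 12), ("node-b", 3), ("node-c", 7)])
db.commit()

def least_loaded():
    # Stateless: the answer is derived entirely from persisted data,
    # so the service can be replicated or restarted freely.
    row = db.execute(
        "SELECT node FROM load ORDER BY running ASC LIMIT 1").fetchone()
    return row[0]

print(least_loaded())  # node-b
```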
VLab - Not Just a Client/Server
The Client/Server Approach:
-The portal and the supporting modules have access to a large central multi-processor system.
-Can work as a facilitator but lacks other important features found in VLab.
- No flexibility of scheduling
- No redundancy => poor availability
- No choice for cost (usually high)
Fault Tolerance
• Only Project Executor sessions and a few user- and project-interaction sessions are required to be persistent. Therefore, a simple approach to fault tolerance (FT) is possible:
- Reactive: we have not identified any need for proactive FT.
- Registry-based: persistent sessions are registered and must periodically inform the registry of their "alive" state.
- Redundant registry and metadata DB for data persistence.
- Full journaling of critical transactions for data and metadata integrity. This guarantees that the state of any persistent session can be restored in case of failure.
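The registry scheme can be sketched as follows: persistent sessions register and send heartbeats, and a session whose last heartbeat exceeds a timeout is flagged for recovery from the journal. Class and session names are illustrative, not VLab code:

```python
# Minimal sketch of registry-based fault tolerance: persistent sessions
# register and heartbeat; the registry flags sessions whose last
# heartbeat is older than a timeout. All names are illustrative.
TIMEOUT = 5.0  # seconds of silence before a session is considered failed

class Registry:
    def __init__(self):
        self.last_seen = {}

    def register(self, session_id, now):
        self.last_seen[session_id] = now

    def heartbeat(self, session_id, now):
        self.last_seen[session_id] = now

    def failed_sessions(self, now):
        # Sessions listed here would be restored from the journal
        # of critical transactions.
        return [s for s, t in self.last_seen.items() if now - t > TIMEOUT]

reg = Registry()
reg.register("executor-1", now=0.0)
reg.register("executor-2", now=0.0)
reg.heartbeat("executor-1", now=10.0)   # executor-2 went silent

print(reg.failed_sessions(now=12.0))  # ['executor-2']
```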
VLab Requirements
•Workflow management => facilitator/enabler
•Support for distributed computations
•Ease of use
•Support for collaboration
•Flexibility (update/add tools, new features)
•Fault tolerance
•Diversity of tools: analysis, visualization, data reduction, storage, etc.
Compute Performance vs. Throughput
Leveraging Concurrent Computing for features and performance
High Performance Parallel Computing
High Throughput Distributed Processing
The red line is the predicted optimal performance for up to 16 independent 4-way parallel tasks running concurrently (HTC job).
Basic Problem
Demand for Extensive Parameter Sampling
Typical high-(P,T) study (e.g. thermal properties): {Pn}x{qi} => ~10² jobs
Large high-(P,T) study (Cij(P,T)): {Pn}x{εi}x{qj} => ~10³-10⁴ jobs
Future studies: extension to alloys (sampling over configurations)
{{xm}l}x{Pn}x{εi}x{qj} => ~10⁵ jobs
• 10²-10⁵ jobs to prepare, submit, monitor, and analyze results
• Manual work is prone to human errors => unmanageable!
• First principles => sheer number (10¹⁵-10²⁰) of operations today => well over 10²² in 3-5 years
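The job counts above are simply products of the sampled set sizes; a toy tally with assumed grid sizes (the actual grids are not given in the slides) reproduces the quoted orders of magnitude:

```python
# Toy count of tasks in a parameter sweep: the total is the product of
# the sampled set sizes. All grid sizes are illustrative assumptions.
n_P = 10        # pressures {Pn}
n_strain = 7    # strains {eps_i} for Cij
n_q = 20        # phonon wave vectors {qj}
n_config = 50   # alloy configurations {{xm}l}

thermal = n_P * n_q                       # high-(P,T) thermal study, ~10^2
cij = n_P * n_strain * n_q                # Cij(P,T) study, ~10^3
alloy = n_config * n_P * n_strain * n_q   # alloy extension, ~10^5

print(thermal, cij, alloy)  # 200 1400 70000
```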
Basic Problem (cont. …)
Fundamental Requirements
• Enable users to manage these extensive workflows
  - Automatic workflow management
  - Ease of use, collaborative support, diversity of tools, flexibility
• Aggregate throughput to cope with huge workloads
  - Distributed computations, fault tolerance, optimal scheduling
The Big Challenge of Performance
MPP systems are not very cost-effective for this class of problems:
• FFT and matrix transposition: limited scalability, or
• Low performance per processor