DRM/Computational Grids
Bill DeSalvo
August 18, 2004
Computational Grids
© Platform Computing Inc. 2003
Definitions…
Cluster: An arbitrary collection of distributed IT resources organized as a management domain… a single system environment
Grid: Transparent, secure, coordinated resource sharing across one or more sites… a cluster of clusters
Grid Drivers
Virtual Organizations
- New infrastructure enables new org structures
- Collaborative computing
New Class of Capabilities
- Potential to solve very large problems
New Business Models
- Outsourcing of computing tasks
- Utility computing
- Peak load support
Source: IDC
Optimize Capabilities
Resource Optimization
- Maximize return on capital equipment
Resource Access
- Provide mechanisms to share resources across organizational boundaries
Cost Sharing
- Allow multiple groups to contribute resources to a project while maintaining control of those resources
Improved Management Model
- Incorporate multiple systems into an organization under a single unified systems model
Source: IDC
Optimize Infrastructure
Ian Foster’s Three-Point Grid Checklist
Coordinates resources
Not subject to centralized control
One or more (virtual) organizations
Geographic distribution of users/resources is common
Standard, open, general-purpose protocols and interfaces
Delivers nontrivial qualities of service
SLAs vs. policies vs. QoS
Translates business objectives into IT objectives
Enables effective utilization, resource aggregation, and remote access to specialized resources
Clusters are NOT grids!
A cluster is a local-area, logical arrangement of independent entities that collectively provide a service.
Virtual Organizations
Evolution of the Grid
Everyone’s Aware of “The Grid”
Platform Grid Competencies
Resource Leasing
Job Forwarding
Account Mapping
Grid Fairshare Scheduling
Advance Reservations
User Authentication
Reliable Data Transfer
Outgrowth of Platform’s experience in Grid and Distributed Computing
Platform MultiCluster
Three-Point Grid Checklist & Platform MultiCluster
Coordinates resources
Not subject to centralized control
‘Single’ organization (“Enterprise Grid”)
Geographic distribution of users/resources is common
Proprietary protocols and interfaces
Delivers nontrivial qualities of service
SLAs vs. policies
Common queues
Advance reservation
Resource leasing
Fairshare
SLAs
Translates business objectives into IT objectives
Enables effective utilization, resource aggregation, and remote access to specialized resources
Why MultiCluster
Global Sharing, Local Ownership (“politics of the grid”)
Providing … while maintaining …
Increased Capacity
Increased Capability
Increased Scalability
Growing Computational Needs
Local Autonomy
Dept A, Dept B, Dept C, Dept D
Job Forwarding Model
“HPC Center” Configuration
Enhanced transparency
FCFS guarantee, pending reason support, chunk jobs, host type/queue status aware scheduling, checkpoint/migration
Cluster A
HPC Center
Cluster B
Cluster C
Job Forwarding Model
Site A: Compute Servers
Site B: Compute Servers
Send queue
Receive queue
You submit, we do
- Job transfer
- Data staging
- Account mapping
- Accounting
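The forwarding steps listed on this slide can be sketched in a few lines of Python. This is an illustrative model only, not Platform LSF MultiCluster code: the `Job` class, the `forward_job` function, and the `ACCOUNT_MAP` table are hypothetical names, and data staging is reduced to a recorded step rather than a real file transfer.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: int
    user: str                                         # account name at the submission site
    input_files: list = field(default_factory=list)
    history: list = field(default_factory=list)       # simple accounting trail


# Hypothetical mapping from Site A account names to Site B account names.
ACCOUNT_MAP = {"alice": "alice_b", "bob": "hpc_bob"}


def forward_job(send_queue: deque, receive_queue: deque, account_map: dict) -> Job:
    """Move the oldest job from the send queue to the remote receive queue."""
    job = send_queue.popleft()                         # job transfer (first in, first out)
    job.history.append(f"staged {len(job.input_files)} file(s)")   # data staging stub
    local_user = job.user
    job.user = account_map[local_user]                 # account mapping
    job.history.append(f"mapped {local_user} -> {job.user}")       # accounting record
    receive_queue.append(job)
    return job


send_q = deque([Job(1, "alice", ["mesh.dat"]), Job(2, "bob")])
recv_q = deque()
forwarded = forward_job(send_q, recv_q, ACCOUNT_MAP)
print(forwarded.user, forwarded.history)
```

The user only touches the send queue at Site A; everything after `popleft()` stands in for the work the slide attributes to the system.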
Resource Leasing Model
Accelerating Enterprise Grid Adoption
Single system image, ease of administration, scalability
Enable fairshare, preemption, pending reason support, chunk jobs, advance reservation, interactive jobs, parallel jobs, … across clusters
Advance Reservation
Nodes dedicated to User A for time duration
Reserve nodes for exclusive access for user or user group
Ensures critical work is done without interference
Useful for benchmarking or system maintenance
One-time and recurring reservation
Administrator defines reservation for users
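A minimal Python sketch of the reservation idea on this slide follows: exclusive access to a set of nodes for a time window, with one-time and recurring variants. The `Reservation` class and the helper functions are hypothetical names chosen for illustration; they are not the Platform LSF reservation interface.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Reservation:
    owner: str            # user or user group holding exclusive access
    nodes: frozenset      # node names covered by the reservation
    start: datetime
    end: datetime


def conflicts(a: Reservation, b: Reservation) -> bool:
    """Two reservations conflict if they share a node and overlap in time."""
    return bool(a.nodes & b.nodes) and a.start < b.end and b.start < a.end


def recurring(owner, nodes, first_start, duration, every, count):
    """Expand a recurring reservation into `count` one-time reservations."""
    return [
        Reservation(owner, frozenset(nodes), first_start + i * every,
                    first_start + i * every + duration)
        for i in range(count)
    ]


def can_run(user, node, when, reservations):
    """A job may use a node unless someone else has it reserved at that time."""
    return all(
        r.owner == user or node not in r.nodes or not (r.start <= when < r.end)
        for r in reservations
    )


# Administrator reserves two nodes for user_a: two hours a day, three days running.
book = recurring("user_a", ["n1", "n2"], datetime(2004, 8, 18, 9),
                 timedelta(hours=2), timedelta(days=1), 3)
print(can_run("user_b", "n1", datetime(2004, 8, 19, 10), book))
```

Treating a recurring reservation as a list of one-time reservations keeps the conflict check to a single interval-overlap test.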
Use Cases
DoD HPCMP Grid
DoD HPCMP
Challenge
- Initiative to share HPCMP’s resources easily & transparently: SMDC, TACOM, NRL, NAVO and WSMR, …
- Build a meta-queuing system to integrate the centers
Primary Benefit
- The capability to submit a job to a single, common queue, from which it is sent to the best available computer in the Grid
DoD HPCMO
Solution
Platform LSF MultiCluster
- Resource reservation protocol
- Transparent job control
- Accounting
Client-server interactions Kerberized
- Ticket forwarding/renewal
- Multi-realm support
- Account mapping
Platform FTA
- Kerberized
- Fault tolerant
DoD HPCMP Grid
Requirement
- Fire and Forget
- Full Kerberos 5 Support
- Reliable, Secure File Transfer
Sites (linked by DREN)
- NAVO: Sun E10K, 64 PEs
- AEDC: Origin 2000, 64 PEs
- NRL: Origin 2000, 128 PEs
- TACOM/TARDEC: Onyx2, 32 PEs
- RTTC: Origin 2000, 32 PEs
- SMDC: Origin 2000, 64 PEs
- SSCSD: HP Superdome, 44 PEs
- AFFTC: Origin 3000, 64 PEs
- WSMR: Origin 2000, 64 PEs
GRID Challenges
Logistics / Coordination
- People
- User accounts
- Geographic locations
- Site configurations
- Time zones / schedules
Network Security / Firewalls
Intro of batch queuing systems to environments
Training & skills transfer
DoD HPCMP Grid
SHARCNET
External Grids/Portal
SHARCNET
The network is no longer ‘passive plumbing’
True resource that can be managed in real time – with guaranteed QoS
Potential projects
-based resource leasing, advance reservation
IP-based topology awareness
Enables new classes of Grid applications
Operational results
Real-time, remote visualization
Virtual storage
Persistent/pervasive
On demand
The Globus Toolkit V2
Sharing pains…physical login
Compute Servers (Site A), Compute Servers (Site B)
You have to
- Get and maintain multiple accounts
- Use different batch systems
- Do without consolidated accounting
- Move files manually
The Globus Toolkit™ Version 2 (GT2)
A software toolkit that addresses key technical problems in the development of Grid-enabled tools, services, and applications
Offers a modular “bag of technologies”
Enables incremental development of grid-enabled tools and applications
Implements standard Grid protocols and APIs
Made available under liberal Open Source license
Provided by The Globus Alliance
http://www.globus.org
Globus Toolkit: Evaluation (+)
Good technical solutions for key problems, e.g.
Authentication and authorization
Resource discovery and monitoring
Reliable remote service invocation
High-performance remote data access
This & good engineering is enabling progress
Good quality reference implementation, multi-language support, interfaces to many systems, large user base, industrial support
Growing community code base built on tools
Globus Toolkit: Evaluation (-)
Protocol deficiencies, e.g.
Heterogeneous basis: HTTP, LDAP, FTP
No standard means of invocation, notification, error propagation, authorization, termination, …
Significant missing functionality, e.g.
Databases, sensors, instruments, workflow, …
Virtualization of end systems (hosting envs.)
Little work on total system properties, e.g.
Dependability, end-to-end QoS, …
Reasoning about system properties
Scalability
LSF MC & Globus
MC: Transparent, dynamic, intelligent, scalable inter-cluster sharing
User does not need to know about clusters: total transparency
MC dynamically chooses the “best cluster” to run the job
Globus: Static, non-intelligent sharing
User chooses which cluster to submit the job to via the Globus interface
Lacks transparency
Cluster A Cluster B Cluster C
Globus
Inter-cluster protocols
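The contrast on this slide, a user-named cluster versus a scheduler-chosen “best cluster”, can be sketched as follows. The `Cluster` fields and the ranking rule (most free slots, then shortest pending queue) are assumptions made for illustration; the slides do not describe how MultiCluster actually ranks clusters.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    host_types: set      # e.g. {"linux", "irix"}
    free_slots: int
    pending_jobs: int


def choose_static(clusters, requested_name):
    """Static submission: the user names the target cluster."""
    return next(c for c in clusters if c.name == requested_name)


def choose_dynamic(clusters, host_type, slots):
    """Dynamic submission: pick a cluster from its current state.

    The ranking used here is an assumed policy for illustration only.
    """
    eligible = [c for c in clusters
                if host_type in c.host_types and c.free_slots >= slots]
    if not eligible:
        return None      # job stays pending until some cluster can take it
    return max(eligible, key=lambda c: (c.free_slots, -c.pending_jobs))


clusters = [
    Cluster("A", {"linux"}, free_slots=4, pending_jobs=10),
    Cluster("B", {"linux", "irix"}, free_slots=32, pending_jobs=2),
    Cluster("C", {"irix"}, free_slots=64, pending_jobs=0),
]
print(choose_static(clusters, "A").name)            # the user's fixed choice
print(choose_dynamic(clusters, "linux", 8).name)    # chosen from load data
```

In the static case the user must already know which clusters exist and how busy they are; in the dynamic case that knowledge lives in the scheduler.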
Globus Toolkit 3 (OGSA)
Open Grid Services Architecture (OGSA)
Next-generation architecture
Consequence of technology refresh (i.e., refactoring the Globus Toolkit) and research into Autonomic Computing
Convergence of Grid Computing and Web Services
Globus Toolkit
Access services – e.g., CLIs, GUIs, portals and CoGs
Resource and allocation management
Monitoring and discovery services – e.g., sensing and indexing
Data management services – e.g., file transfer, replica management, etc.
Security – e.g., the Grid Security Infrastructure
Initially SOAP, WSDL and WS-Inspection
The Global Grid Forum (GGF) serves as the standards authority
Two layers
Core Grid platform – OGSA platform interfaces and models
Core Grid infrastructure – Open Grid Services Infrastructure (OGSI)
http://www.gridforum.org http://www.globus.org/ogsa
Importance of OGSA to Customers
Grid-enabled Web Services transforming IT
Analyst feedback (e.g., Gartner)
Customer experience
Customers demand standards-compliant products, solutions and services – why?
Vendors guilty of over-promising and under-delivering
Avoid single-vendor lock-in
Proprietary implementations based on open standards
Seek multi-vendor deliverables
Framework for partner collaboration
Demanding professionalism in software engineering
Seek to be engaged in the process
Platform Embraces Open Standards
Platform developing software for over 11 years
Standards efforts are recent activities
Existing implementations are proprietary
Platform is an NPi founder
NPi merged with GGF (4/02)
NPi being leveraged in OGSA
Platform committed to open standards
Proprietary implementations based on open standards
Platform experienced in Open Source arena
Offering Linux solutions for over 6 years
Offering Globus Toolkit solutions for about 2 years
Source-code available for components of Platform LSF
Platform and Globus
Platform Globus Toolkit
CSF Plus: Advanced CSF-based metascheduler
Job persistence; enhanced scalability (6x GT 3); Cluster load balancing and host type matching (LSF only)
Globus Toolkit 3
Community Scheduler Framework (CSF)
Round robin job scheduling; Advance reservation booking, query, & control; Reservation based scheduling; Job throttling for increased reliability
Connectors for 3rd party workload management systems (e.g., SGE, PBS, etc.)
Native command line interface support
Platform Globus Toolkit
One step installation
Open Source
Platform Enhancements
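Two of the CSF features named on this slide, round robin job scheduling and job throttling, can be illustrated with a short Python sketch. The function name `dispatch_round_robin` and the `max_per_cycle` parameter are hypothetical; this is not the CSF implementation, only a model of the two ideas.

```python
from collections import deque
from itertools import cycle


def dispatch_round_robin(jobs, clusters, max_per_cycle):
    """Assign jobs to clusters in turn, at most `max_per_cycle` per call.

    Returns (assignments, remaining): jobs beyond the throttle limit stay
    queued for a later scheduling cycle.
    """
    queue = deque(jobs)
    targets = cycle(clusters)          # lsf, pbs, sge, lsf, ...
    assignments = []
    while queue and len(assignments) < max_per_cycle:
        assignments.append((queue.popleft(), next(targets)))
    return assignments, list(queue)


placed, waiting = dispatch_round_robin(
    ["j1", "j2", "j3", "j4", "j5"], ["lsf", "pbs", "sge"], max_per_cycle=4)
print(placed)
print(waiting)
```

Each call in this sketch starts again at the first cluster; a longer-lived scheduler would keep the rotation position between cycles.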
CSF
What is CSF?
CSF (Community Scheduler Framework)
- Not a Platform product
- Contributed as the industry's first open source meta-scheduler enhancement to Globus Toolkit V3.x
- Developed with the latest version of OGSI, a grid specification being developed with the Global Grid Forum
- Open source "meta-scheduler" framework
  - Provides basic protocols and interfaces to help resources work together in heterogeneous environments
  - Enables global access and maintains local control of resources
Key Benefits of OGSA Compliance
• Future-proof & protect grid investment using standards-based solutions
• Standardized approach to access Platform LSF
• Interoperate with 3rd party systems
Metaschedulers
Scheduler that co-ordinates communication between heterogeneous schedulers that operate at a local level
Enables global access and coordination while maintaining local control and ownership of resources
Future: possible to schedule not only workload execution but also storage, network bandwidth, etc.
CSF Grid Services
Job Service creates, monitors and controls compute jobs
Reservation Service guarantees resources are available for running a job
Queueing Service provides a service where administrators can customize and define scheduling policies at the VO level and/or at the level of the individual resource managers
RM Adaptor Service provides a Grid service interface that bridges the Grid service protocol and resource managers (LSF, PBS, SGE, Condor and other RMs)
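The bridging role of the RM Adaptor Service can be pictured as a classic adapter pattern: one common interface, one implementation per resource manager. The sketch below is an assumption-laden illustration, not the CSF API. `RMAdapter`, `LSFAdapter`, `PBSAdapter`, and `JobService` are hypothetical class names, and the adapters only build the local submit command (`bsub -q` for LSF, `qsub -q` for PBS) without executing it.

```python
from abc import ABC, abstractmethod


class RMAdapter(ABC):
    """Common interface the metascheduler uses for every resource manager."""

    @abstractmethod
    def submit_command(self, script: str, queue: str) -> list:
        """Return the local command line that would submit `script`."""


class LSFAdapter(RMAdapter):
    def submit_command(self, script, queue):
        return ["bsub", "-q", queue, script]


class PBSAdapter(RMAdapter):
    def submit_command(self, script, queue):
        return ["qsub", "-q", queue, script]


class JobService:
    """Creates jobs through whichever adapter is registered for a resource."""

    def __init__(self):
        self._adapters = {}

    def register(self, resource: str, adapter: RMAdapter):
        self._adapters[resource] = adapter

    def create_job(self, resource: str, script: str, queue: str) -> list:
        # The caller never needs to know which batch system sits behind `resource`.
        return self._adapters[resource].submit_command(script, queue)


service = JobService()
service.register("cluster-lsf", LSFAdapter())
service.register("cluster-pbs", PBSAdapter())
print(service.create_job("cluster-lsf", "run.sh", "normal"))
print(service.create_job("cluster-pbs", "run.sh", "batch"))
```

Supporting another resource manager means adding one adapter class; the job service and its callers stay unchanged.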
CSF Architecture
Diagram labels: Platform LSF User; Globus Toolkit User; LSF with Meta-scheduler Plugin (two instances); Grid Service Hosting Environment; Meta-Scheduler; Job Service; Reservation Service; Queuing Service; Global Information Service; RM Adapter; RIPS; GRAM SGE with RIPS; GRAM PBS with RIPS; Third Party Workload Management System (two instances); Platform LSF
RIPS = Resource Information Provider Services
GRAM = Grid Resource Allocation & Management
Chart labels: Profile (Low to High) against Awareness/Knowledge, Liking/Preference/Conviction, Commitment; entries: Grid Canada, OMII
What are the Multi-Domain Tools and What Do They Do?
Platform MultiCluster
Enables global access and coordination while maintaining local control and ownership of resources
Join geographically dispersed clusters
Production quality solution to build enterprise grids
Platform proprietary solution that is standards-based & OGSA compliant
Globus Toolkit
Tools to join geographically dispersed clusters
A bunch of “bricks” to build grids (that’s why it’s called a toolkit)
Users have to specify which cluster they would like their job to be sent to – not transparent
Open source solution
Platform adds commercial support: documentation, training, tech support, professional services
Summary
Summary
OGSA applies to e-Science and e-Business
Rich architectural framework
Existing, emerging and planned specifications
Ultimately resulting in Open Standards
Existing, emerging and planned implementations
The Community Scheduler Framework
Standards-based
Choice of implementations
Ushers existing grids towards OGSA compliance
Spectrum of potential use cases
Thank you.